nVIDIA Tesla P40

nVIDIA GPU ignored on Host (Dell R730xd, CentOS 9), PCI Passthrough to KVM VM, running CentOS 9 as a K8S node running nVIDIA Container Toolkit pods.

Host

Ignore PCI device

  1. Apend to GRUB_CMDLINE_LINUX in /etc/default/grub

intel_iommu=on pci-stub.ids=10de:1b38

  1. grub2-mkconfig -o /boot/grub2/grub.cfg

This step doesn't seem to be working. I have to manually add the pci-stub directive to the kernel cmdline when the server boots.

  1. reboot

PCI Passthrough to VM (via virt-manager)

  1. Add Hardware -> PCI Host Device

In the VM

Blacklist nouveau in VM

  1. echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf

  2. reboot

Install nVIDIA Driver

dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel9/$\(uname -i)/cuda-rhel9.repo

dnf module install nvidia-driver:550-dkms

Install nVIDIA Container Toolkit

curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

dnf install nvidia-container-toolkit

nvidia-ctk runtime configure --runtime=crio

Kubernetes

Install nVIDIA Device Plugin

Helm Chart

Configuration with NFD / LocalAI / Ollama / etc

LocalAI

  1. make sure runtime is set correctly
  2. confirm that localai is running on the nvidia-container-runtime

stable-diffusion

On node install TCMalloc

dnf install -y gperftools gperftools-deve

Other

nVidia HTOP

Improved nvidia-smi command.

nVidia HTOP

Fan Speed

IPMI Fan Script Discussion

Fan Control Script