nVIDIA Tesla P40
The nVIDIA GPU is ignored on the host (Dell R730xd, CentOS 9) and handed via PCI passthrough to a KVM VM. The VM runs CentOS 9 and acts as a K8S node for nVIDIA Container Toolkit pods.
Host
Ignore PCI device
- Append to GRUB_CMDLINE_LINUX in /etc/default/grub
intel_iommu=on pci-stub.ids=10de:1b38
- grub2-mkconfig -o /boot/grub2/grub.cfg
This step does not seem to work: the arguments are not picked up after regenerating the config, so I have to add the pci-stub directive to the kernel cmdline manually each time the server boots.
- reboot
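A possible explanation, stated as an assumption rather than something confirmed on this host: on CentOS 9 the kernel arguments normally live in BLS entries under /boot/loader/entries, so regenerating grub.cfg alone may leave them unchanged. The sketch below checks the running command line after a reboot. The function name cmdline_missing_args is a hypothetical helper, not part of any package.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every required kernel argument that is absent
# from the given command line. Returns 0 when nothing is missing.
cmdline_missing_args() {
    local cmdline="$1"
    shift
    local arg missing=0
    for arg in "$@"; do
        case " $cmdline " in
            *" $arg "*) ;;                         # present as a whole word
            *) printf '%s\n' "$arg"; missing=1 ;;  # absent
        esac
    done
    return "$missing"
}

# Usage after reboot:
#   cmdline_missing_args "$(cat /proc/cmdline)" intel_iommu=on pci-stub.ids=10de:1b38
#
# If arguments are reported missing, grubby can write them into every
# boot entry (run as root, then reboot):
#   grubby --update-kernel=ALL --args="intel_iommu=on pci-stub.ids=10de:1b38"
```

If the helper prints nothing and returns 0, both arguments reached the kernel and no manual edit at the boot menu should be needed.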
PCI Passthrough to VM (via virt-manager)
- Add Hardware -> PCI Host Device
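Before attaching the device it can help to confirm on the host which driver holds the card. A minimal sketch, assuming the 10de:1b38 ID used above; driver_in_use is a hypothetical helper that only parses lspci output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lspci -nnk` output on stdin and print the
# value of each "Kernel driver in use:" line.
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use:[[:space:]]*//p'
}

# Usage on the host:
#   lspci -nnk -d 10de:1b38 | driver_in_use
```

With the VM stopped this would be expected to print pci-stub. While the VM is running, libvirt usually rebinds the device to vfio-pci, so seeing that instead is not necessarily a problem. If it prints nouveau or nvidia, the host has claimed the card.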
In the VM
Blacklist nouveau in VM
- echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
- reboot
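After the reboot, a quick check that nouveau really stayed out of the kernel avoids confusing driver install failures later. A minimal sketch; module_loaded is a hypothetical helper, and the second argument exists only so the check can be pointed at a file other than /proc/modules.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the named module appears in a module
# list in /proc/modules format (module name in the first column).
module_loaded() {
    local name="$1" file="${2:-/proc/modules}"
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file"
}

# Usage in the VM:
#   if module_loaded nouveau; then echo "nouveau is still loaded"; fi
```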
Install nVIDIA Driver
dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -i)/cuda-rhel9.repo
dnf module install nvidia-driver:550-dkms
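Once the DKMS module has built, nvidia-smi should list the P40. The sketch below extracts the driver branch from its CSV output so it can be compared with the 550 stream installed above; driver_major is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "name, driver_version" CSV rows on stdin and
# print the major driver version of the first GPU listed.
driver_major() {
    awk -F', ' 'NR == 1 { split($2, v, "."); print v[1] }'
}

# Usage in the VM:
#   nvidia-smi --query-gpu=name,driver_version --format=csv,noheader | driver_major
```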
Install nVIDIA Container Toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
dnf install nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=crio
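CRI-O has to be restarted before it picks up the new runtime entry. As a rough check, assuming the crio binary is on the PATH and that `crio config` prints the effective configuration, the sketch below looks for the nvidia runtime table; has_nvidia_runtime is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read CRI-O configuration (TOML) on stdin and succeed
# if it defines a runtime handler named "nvidia".
has_nvidia_runtime() {
    grep -qF '[crio.runtime.runtimes.nvidia]'
}

# Usage in the VM (as root):
#   systemctl restart crio
#   crio config 2>/dev/null | has_nvidia_runtime && echo "nvidia runtime registered"
```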
Kubernetes
Install nVIDIA Device Plugin
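Once the device plugin pod is running, the node should advertise an nvidia.com/gpu resource. A minimal sketch for spotting which nodes do; gpu_nodes is a hypothetical helper that filters the kubectl table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a "NAME GPU" table (header row included) on
# stdin and print the names of nodes that advertise at least one GPU.
gpu_nodes() {
    awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Usage from a machine with cluster access:
#   kubectl get nodes \
#     -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | gpu_nodes
```

Nodes without the resource show `<none>` in the GPU column and are skipped. If the VM node is missing from the output, the plugin pod logs on that node are the first place to look.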
Configuration with NFD / LocalAI / Ollama / etc
LocalAI
- Make sure the runtime is set correctly
- Confirm that LocalAI is running on the nvidia-container-runtime
stable-diffusion
On node install TCMalloc
dnf install -y gperftools gperftools-devel
Other
nVidia HTOP
Improved nvidia-smi command.