nVIDIA Tesla P40
The nVIDIA GPU is ignored on the host (Dell R730xd, CentOS 9) and passed through over PCI to a KVM VM. The VM runs CentOS 9 as a K8S node hosting nVIDIA Container Toolkit pods.
Host
Ignore PCI device
- Append to GRUB_CMDLINE_LINUX in /etc/default/grub
intel_iommu=on pci-stub.ids=10de:1b38
- grub2-mkconfig -o /boot/grub2/grub.cfg
This step doesn't seem to be working. I have to manually add the pci-stub directive to the kernel cmdline when the server boots.
- reboot
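On CentOS 9 the boot entries are BLS snippets, and a plain grub2-mkconfig run may leave their kernel arguments untouched, which could explain why the arguments have to be typed in at boot. Below is a minimal sketch of a check, plus a possible fix in the comments, assuming grubby is available on the host. The helper names has_kernel_arg and check_passthrough_args are made up for illustration.

```shell
# has_kernel_arg CMDLINE ARG: succeed if ARG is one of the
# space-separated words in CMDLINE.
has_kernel_arg() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_passthrough_args [CMDLINE]: report which of the two expected
# arguments are present. Defaults to the running kernel's command line.
check_passthrough_args() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    missing=0
    for arg in intel_iommu=on pci-stub.ids=10de:1b38; do
        if has_kernel_arg "$cmdline" "$arg"; then
            echo "ok: $arg"
        else
            echo "missing: $arg"
            missing=1
        fi
    done
    return "$missing"
}

# Possible fix (assumption: run as root, then reboot); not executed here:
#   grubby --update-kernel=ALL --args="intel_iommu=on pci-stub.ids=10de:1b38"
```

After the reboot, calling check_passthrough_args with no argument inspects /proc/cmdline. If both arguments are present, lspci -nnk -d 10de:1b38 should list pci-stub as the kernel driver in use.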
PCI Passthrough to VM (via virt-manager)
- Add Hardware -> PCI Host Device
In the VM
Blacklist nouveau in VM
- echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
- Comment out the following block in /etc/X11/xorg.conf.d/10-nvidia.conf
#Section "OutputClass"
# Identifier "nvidia"
# MatchDriver "nvidia-drm"
# Driver "nvidia"
# Option "AllowEmptyInitialConfiguration"
# Option "PrimaryGPU" "no"
# Option "SLI" "Auto"
# Option "BaseMosaic" "on"
#EndSection
- reboot
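After the reboot it is worth confirming that nouveau really stayed out of the kernel. Below is a minimal sketch, assuming the usual /proc/modules layout where the module name is the first field. The helper names module_loaded and report_nouveau are made up for illustration.

```shell
# module_loaded NAME [FILE]: succeed if NAME is listed in FILE
# (default /proc/modules, where the module name is the first field).
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# report_nouveau [FILE]: print whether nouveau is still loaded.
report_nouveau() {
    if module_loaded nouveau "${1:-/proc/modules}"; then
        echo "nouveau is still loaded"
    else
        echo "nouveau is not loaded"
    fi
}
```

Run report_nouveau inside the VM. If nouveau still shows up, the initramfs may also need regenerating (for example with dracut --force). That is an assumption about this setup, not something these notes confirm.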
Install nVIDIA Driver
dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -i)/cuda-rhel9.repo
dnf module install nvidia-driver:565-dkms
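Once the driver is installed, a quick query confirms that the VM sees the card. Below is a minimal sketch that formats the CSV output of nvidia-smi. The helper name gpu_summary is made up for illustration, and the choice of query fields is an assumption.

```shell
# gpu_summary: read "name, driver_version, memory.total" CSV lines on stdin
# and print one summary line per GPU. Fails if no GPU line was seen.
gpu_summary() {
    awk -F', *' '
        NF >= 3 { printf "%s | driver %s | memory %s\n", $1, $2, $3; n++ }
        END { exit (n == 0) }
    '
}

# Intended use inside the VM (not executed here):
#   nvidia-smi --query-gpu=name,driver_version,memory.total \
#       --format=csv,noheader | gpu_summary
```

A non-zero exit status means nvidia-smi printed no GPU lines, which usually points back at the passthrough or driver steps.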
Install nVIDIA Container Toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
dnf install nvidia-container-toolkit
Configure the runtime
nvidia-ctk runtime configure --runtime=crio
Results
/etc/nvidia-container-runtime/config.toml
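The configure command should also add an nvidia runtime entry to the CRI-O configuration, and CRI-O has to be restarted to pick it up. Below is a minimal sketch of a check, assuming the entry is written either to /etc/crio/crio.conf or to a drop-in under /etc/crio/crio.conf.d/ (the exact file can differ between toolkit versions). The helper name crio_has_nvidia_runtime is made up for illustration.

```shell
# crio_has_nvidia_runtime FILE...: succeed if any of the given CRI-O config
# files defines the [crio.runtime.runtimes.nvidia] table.
crio_has_nvidia_runtime() {
    [ "$#" -gt 0 ] || return 2
    grep -qsF '[crio.runtime.runtimes.nvidia]' "$@"
}

# Intended use on the node (run as root; not executed here):
#   crio_has_nvidia_runtime /etc/crio/crio.conf /etc/crio/crio.conf.d/* \
#       && systemctl restart crio
```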
Kubernetes
Install nVIDIA Device Plugin
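With the device plugin running, the node should advertise the nvidia.com/gpu resource, and a throwaway pod can confirm that scheduling works. Below is a minimal sketch that prints such a pod manifest. The helper name gpu_test_pod and the default pod name gpu-test are made up for illustration, and the image is left as a placeholder argument.

```shell
# gpu_test_pod IMAGE [NAME]: print a Pod manifest that requests one GPU
# and runs nvidia-smi once.
gpu_test_pod() {
    image="$1"
    name="${2:-gpu-test}"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  containers:
    - name: ${name}
      image: ${image}
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Intended use (IMAGE is a placeholder for any CUDA-capable image):
#   gpu_test_pod IMAGE | kubectl apply -f -
#   kubectl logs gpu-test
```

If the nvidia runtime is not the CRI-O default, the pod would presumably also need a runtimeClassName pointing at a RuntimeClass for it. This sketch does not include that.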
Configuration with NFD / LocalAI / Ollama / etc
LocalAI
- make sure runtime is set correctly
- confirm that localai is running on the nvidia-container-runtime
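One way to check the runtime is to read the pod's runtimeClassName. Below is a minimal sketch, assuming the runtime is selected through a RuntimeClass named nvidia. The class name, the helper names, and the pod and namespace arguments are all hypothetical.

```shell
# pod_runtime_class POD [NAMESPACE]: print the pod's runtimeClassName
# (empty if the pod uses the default runtime).
pod_runtime_class() {
    kubectl get pod "$1" -n "${2:-default}" -o jsonpath='{.spec.runtimeClassName}'
}

# uses_nvidia_runtime POD [NAMESPACE]: succeed if the pod's runtime class
# is "nvidia" (assumed class name).
uses_nvidia_runtime() {
    [ "$(pod_runtime_class "$1" "${2:-default}")" = "nvidia" ]
}

# Intended use (POD and NAMESPACE are placeholders):
#   uses_nvidia_runtime POD NAMESPACE && echo "running on the nvidia runtime"
```

If the nvidia runtime was made the CRI-O default instead, runtimeClassName will be empty and this check does not apply.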
stable-diffusion
On node install TCMalloc
dnf install -y gperftools gperftools-devel
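A process only uses TCMalloc if the library is loaded into it, for instance through LD_PRELOAD. Below is a minimal sketch that locates the library from the linker cache. The helper name find_tcmalloc is made up for illustration, and whether a given stable-diffusion front end needs the preload set by hand is an assumption.

```shell
# find_tcmalloc: read `ldconfig -p` output on stdin and print the path of
# the first libtcmalloc library listed. Fails if there is none.
find_tcmalloc() {
    awk '
        $1 ~ /^libtcmalloc/ { print $NF; found = 1; exit }
        END { exit !found }
    '
}

# Intended use on the node (COMMAND is a placeholder; not executed here):
#   LD_PRELOAD="$(ldconfig -p | find_tcmalloc)" COMMAND
```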
Other
nVidia HTOP
Improved nvidia-smi command.