nVIDIA Tesla P40

The nVIDIA GPU is ignored on the host (Dell R730xd, CentOS 9) and passed through over PCI to a KVM VM. The VM runs CentOS 9 as a K8S node whose pods use the nVIDIA Container Toolkit.

Host

Ignore PCI device

Append to GRUB_CMDLINE_LINUX in /etc/default/grub

intel_iommu=on pci-stub.ids=10de:1b38
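One way to script the edit above, as a minimal sketch. `append_grub_args` is a hypothetical helper (not part of GRUB or grubby) that appends arguments inside the quoted GRUB_CMDLINE_LINUX value and skips the edit when they are already present. It assumes GNU sed and a double-quoted value on a single line.

```shell
# Sketch: append kernel arguments to GRUB_CMDLINE_LINUX in a grub defaults file.
# append_grub_args is a hypothetical helper name chosen for this example.
append_grub_args() {
    grub_file="$1"   # e.g. /etc/default/grub
    new_args="$2"    # e.g. "intel_iommu=on pci-stub.ids=10de:1b38"

    # Do nothing if the arguments are already on the line, so re-runs are harmless.
    if grep '^GRUB_CMDLINE_LINUX=' "$grub_file" | grep -qF -- "$new_args"; then
        return 0
    fi

    # Insert the arguments just before the closing double quote of the value.
    # Assumption: the arguments contain no '|' or '&' characters.
    sed -i "s|^\(GRUB_CMDLINE_LINUX=\".*\)\"\$|\1 ${new_args}\"|" "$grub_file"
}
```

Run as root, this could be called as `append_grub_args /etc/default/grub "intel_iommu=on pci-stub.ids=10de:1b38"`. Taking a copy of the file first is a sensible precaution.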

Update Grub

grubby --add-kernel $(grubby --default-kernel) --copy-default --args=vfio_pci.ids=10de:1b38 --title "Default kernel with vfio_pci" --make-default

IBM through-pci

reboot
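After the reboot it can help to check which driver claimed the card. The sketch below assumes `lspci` from pciutils is installed; `driver_in_use` is a hypothetical helper that only parses `lspci -nnk` output passed on stdin. If passthrough is set up as intended, the driver reported should be a stub such as vfio-pci or pci-stub rather than nouveau or nvidia.

```shell
# Sketch: print the kernel driver bound to a PCI device.
# Reads `lspci -nnk` output on stdin; driver_in_use is a hypothetical helper name.
driver_in_use() {
    # The relevant line looks like "<tab>Kernel driver in use: <driver>".
    awk -F': ' '/Kernel driver in use/ { print $2; exit }'
}

# Usage on the host (10de:1b38 is the vendor:device ID used above):
#   lspci -nnk -d 10de:1b38 | driver_in_use
# To see whether the IOMMU was enabled at boot:
#   dmesg | grep -i -e DMAR -e IOMMU
```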

PCI Passthrough to VM (via virt-manager)

Add Hardware -> PCI Host Device
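For a headless host, the same device can be attached with virsh instead of virt-manager. This is a sketch under assumptions: `pci_hostdev_xml` is a hypothetical helper, and the PCI address `0000:03:00.0` is a placeholder (the real address comes from `lspci` on the host). The function only prints a libvirt `<hostdev>` element.

```shell
# Sketch: build a libvirt <hostdev> element from a PCI address.
# pci_hostdev_xml is a hypothetical helper name chosen for this example.
pci_hostdev_xml() {
    pci_addr="$1"                  # domain:bus:slot.function, e.g. 0000:03:00.0
    pci_domain="${pci_addr%%:*}"   # text before the first ':'
    pci_rest="${pci_addr#*:}"      # bus:slot.function
    pci_bus="${pci_rest%%:*}"
    pci_slotfn="${pci_rest#*:}"    # slot.function
    pci_slot="${pci_slotfn%%.*}"
    pci_fn="${pci_slotfn#*.}"
    cat <<EOF
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x${pci_domain}' bus='0x${pci_bus}' slot='0x${pci_slot}' function='0x${pci_fn}'/>
  </source>
</hostdev>
EOF
}

# Usage (placeholder address and VM name):
#   pci_hostdev_xml 0000:03:00.0 > p40-hostdev.xml
#   virsh attach-device <vm-name> p40-hostdev.xml --config
```

With `--config` the device is added to the persistent definition, so it appears the next time the VM starts.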

In the VM

Blacklist nouveau in VM

echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf

Comment out the following block in /etc/X11/xorg.conf.d/10-nvidia.conf

#Section "OutputClass"
#    Identifier "nvidia"
#    MatchDriver "nvidia-drm"
#    Driver "nvidia"
#    Option "AllowEmptyInitialConfiguration"
#    Option "PrimaryGPU" "no"
#    Option "SLI" "Auto"
#    Option "BaseMosaic" "on"
#EndSection

reboot
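Once the VM is back up, a quick check that nouveau stayed unloaded can save debugging later. A minimal sketch: `is_module_loaded` is a hypothetical helper that reads `/proc/modules`-style input on stdin and succeeds only if the named module is listed.

```shell
# Sketch: succeed (exit 0) if a kernel module name appears in /proc/modules-style input.
# is_module_loaded is a hypothetical helper name chosen for this example.
is_module_loaded() {
    # The module name is the first whitespace-separated field on each line.
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Usage inside the VM:
#   if is_module_loaded nouveau < /proc/modules; then
#       echo "nouveau is still loaded"
#   fi
```

If nouveau is still loaded after the reboot, one possible cause is that the module is baked into the initramfs; rebuilding it with `dracut --force` and rebooting again is a common remedy on CentOS-like systems.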

Install nVIDIA Driver

dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -i)/cuda-rhel9.repo

dnf module install nvidia-driver:565-dkms
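To confirm that the driver sees the card, `nvidia-smi` can be queried in CSV form. The sketch below assumes the driver install succeeded and `nvidia-smi` is on the PATH; `check_gpu` is a hypothetical helper that scans `name, driver_version` lines on stdin for a name fragment.

```shell
# Sketch: look for a GPU by name in `nvidia-smi` CSV output read from stdin.
# check_gpu is a hypothetical helper name chosen for this example.
check_gpu() {
    wanted="$1"   # fragment of the GPU name to look for, e.g. "P40"
    while IFS=, read -r gpu_name gpu_driver; do
        gpu_driver="${gpu_driver# }"   # drop the space that follows the comma
        case "$gpu_name" in
            *"$wanted"*)
                echo "found ${gpu_name} (driver ${gpu_driver})"
                return 0
                ;;
        esac
    done
    echo "no GPU matching '${wanted}' reported" >&2
    return 1
}

# Usage inside the VM:
#   nvidia-smi --query-gpu=name,driver_version --format=csv,noheader | check_gpu "P40"
```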

Install nVIDIA Container Toolkit

curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

dnf install nvidia-container-toolkit

nvidia-ctk runtime configure --runtime=crio

Make sure the runtime is set correctly, then confirm that LocalAI is running on the nvidia-container-runtime.
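A rough way to check the runtime configuration is to inspect the CRI-O config that `nvidia-ctk` wrote. This is a sketch under assumptions: the file location varies between toolkit and CRI-O versions, so the path is passed in as a parameter, and both helper names are hypothetical. The helpers only read the file.

```shell
# Sketch: inspect a CRI-O TOML config for an nvidia runtime entry.
# has_nvidia_runtime and default_runtime are hypothetical helper names.

# Succeeds if the file defines a runtime table named "nvidia".
has_nvidia_runtime() {
    grep -qF '[crio.runtime.runtimes.nvidia]' "$1"
}

# Prints the value of default_runtime, or nothing if the key is not set.
default_runtime() {
    sed -n 's/^[[:space:]]*default_runtime[[:space:]]*=[[:space:]]*"\(.*\)".*/\1/p' "$1"
}

# Usage (the path is an assumption; check where your CRI-O keeps its drop-ins):
#   has_nvidia_runtime /etc/crio/crio.conf && echo "nvidia runtime defined"
#   default_runtime /etc/crio/crio.conf
```

CRI-O needs a restart (`systemctl restart crio`) before a changed config takes effect. Whether a given pod actually uses the nvidia runtime depends on how the cluster selects runtimes (default runtime versus a RuntimeClass), which is outside what this check covers.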

Fan Control Script 1

Fan Control Script 2
