Introduction

Warning

These docs contain information that relates to my setup. They may or may not work for you.



Lovenet Home Operations Repository

Managed by Flux, Renovate and GitHub Actions 🤖




Overview

This is the configuration for my GitOps homelab Kubernetes cluster, which runs the software services for my home. It is quite complex, with a lot of interdependencies, but the declarative nature of GitOps lets me manage this mesh of code. The software services fall into a few primary categories:

Core Components

Infrastructure

Networking

  • cilium: Kubernetes Container Network Interface (CNI).
  • cert-manager: Creates SSL certificates for services in my Kubernetes cluster.
  • external-dns: Automatically manages DNS records from my cluster in a cloud DNS provider.
  • ingress-nginx: Ingress controller to expose HTTP traffic to pods over DNS.
  • cloudflared: Cloudflare Tunnel client.

Storage

  • Rook-Ceph: Distributed block storage for persistent volumes.
  • MinIO: S3-compatible storage interface.
  • Longhorn: Cloud native distributed block storage for Kubernetes.
  • NFS: NFS storage.

GitOps


⚙️  Configuration


⚙️  Hardware

| Hostname | Device | CPU | RAM | OS | Role | Storage | IOT | Network |
|---|---|---|---|---|---|---|---|---|
| master1 | Intel NUC7PJYH | 4 | 8 GB | CentOS 9 | k8s Master | | | |
| master2 | VM on beast | 3 | 8 GB | CentOS 9 | k8s Master | | | |
| master3 | VM on beast | 3 | 8 GB | CentOS 9 | k8s Master | | | |
| worker1 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe | Z-Stick 7 | iot/sec-vlan |
| worker2 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe | | iot/sec-vlan |
| worker3 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | Sonoff | iot/sec-vlan |
| worker4 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe | Coral USB | iot/sec-vlan |
| worker5 | VM on beast | 10 | 24 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | | iot/sec-vlan |
| worker6 | VM on beast | 10 | 24 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | skyconnect | iot/sec-vlan |
| worker7 | VM on beast | 10 | 24 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | | iot/sec-vlan |
| worker8 | VM on beast | 10 | 48 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | nVIDIA P40 | iot/sec-vlan |

Network

[Figure: high-level physical network diagram]
| Name | CIDR | VLAN | Notes |
|---|---|---|---|
| Management VLAN | TBD | | |
| Default | 192.168.0.0/16 | 0 | |
| IOT VLAN | 10.10.20.1/24 | 20 | |
| Guest VLAN | 10.10.30.1/24 | 30 | |
| Security VLAN | 10.10.40.1/24 | 40 | |
| Kubernetes Pod Subnet (Cilium) | 10.42.0.0/16 | N/A | |
| Kubernetes Services Subnet (Cilium) | 10.43.0.0/16 | N/A | |
| Kubernetes LB Range (CiliumLoadBalancerIPPool) | 10.45.0.1/24 | N/A | |

☁️ Cloud Dependencies

| Service | Use | Cost |
|---|---|---|
| 1Password | Secrets with External Secrets | ~$65 (1 Year) |
| Cloudflare | Domain | Free |
| GitHub | Hosting this repository and continuous integration/deployments | Free |
| Mailgun | Email hosting | Free (Flex Plan) |
| Pushover | Kubernetes alerts and application notifications | $10 (One Time) |
| Frigate Plus | Model training services for Frigate NVR | $50 (1 Year) |

Total: ~$9.60/mo

Noteworthy Documentation

Initialization and Teardown   Github Webhook   Limits and Requests Philosophy   Debugging  

@whazor created this website as a creative way to search Helm Releases across GitHub. You may use it as a means to get ideas on how to configure an application's Helm values.

Coral Edge TPU

Coral USB

Info

I didn't seem to need any udev rules, or to build/load the apex driver, when passing this device through to Frigate. I had to do those things for the Mini PCIe Coral, but with the way I'm doing it now (see the mount sketch below), it doesn't seem necessary.
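For reference, a hedged sketch of what that mount point looks like — a hostPath mount of the USB bus into the Frigate container (the names and the privileged setting are illustrative, not copied from my manifests):

```yaml
spec:
  containers:
    - name: frigate
      securityContext:
        privileged: true          # illustrative; raw USB access typically needs elevated privileges
      volumeMounts:
        - name: coral-usb
          mountPath: /dev/bus/usb
  volumes:
    - name: coral-usb
      hostPath:
        path: /dev/bus/usb
```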

USB Resets

Whenever the Coral device was attached to the Frigate container, it would trigger the following entry in dmesg, and Node Feature Discovery could no longer identify that the node had the device. This left Frigate stuck in a 'Pending' state until I unplugged and re-plugged the Coral USB. It was very annoying that any time the Frigate container terminated, I had to unplug and re-plug the USB.

[ +12.269474] usb 2-5: reset SuperSpeed USB device number 22 using xhci_hcd
[ +0.012155] usb 2-5: LPM exit latency is zeroed, disabling LPM.

This hack resolved the USB reset issue: https://github.com/blakeblackshear/frigate/issues/2607#issuecomment-2092965042

Coral Mini PCIe

Info

Not currently working. For some reason it doesn't show up in 'lspci' in my Dell R730XD. I wonder if using a more powerful power supply would make a difference.

nVIDIA Tesla P40

The nVIDIA GPU is ignored on the host (Dell R730xd, CentOS 9) and passed through via PCI to a KVM VM running CentOS 9, which serves as a K8S node running nVIDIA Container Toolkit pods.

Host

Ignore PCI device

  1. Append the following to GRUB_CMDLINE_LINUX in /etc/default/grub (a consolidated sketch follows this list):

intel_iommu=on pci-stub.ids=10de:1b38

  2. grub2-mkconfig -o /boot/grub2/grub.cfg

This step doesn't seem to be working. I have to manually add the pci-stub directive to the kernel cmdline when the server boots.

  3. reboot
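Put together, a sketch of the host-side change and how I'd verify it took effect (10de:1b38 is the P40's vendor:device ID; confirm yours with lspci -nn):

```sh
# /etc/default/grub -- append to the existing GRUB_CMDLINE_LINUX line:
#   GRUB_CMDLINE_LINUX="... intel_iommu=on pci-stub.ids=10de:1b38"

grub2-mkconfig -o /boot/grub2/grub.cfg
reboot

# after boot: confirm the directives actually made it onto the kernel cmdline
cat /proc/cmdline
# and that pci-stub (not nouveau/nvidia) claimed the card
lspci -nnk -d 10de:1b38
```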

PCI Passthrough to VM (via virt-manager)

  1. Add Hardware -> PCI Host Device
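For the record, a hedged CLI equivalent via virsh (the VM name worker8 and PCI address 0000:82:00.0 are examples; substitute the real ones from lspci):

```sh
# hostdev definition for the GPU (address is an example)
cat > p40-hostdev.xml <<'EOF'
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x82' slot='0x00' function='0x0'/>
  </source>
</hostdev>
EOF

# attach persistently to the VM's definition
virsh attach-device worker8 p40-hostdev.xml --config
```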

In the VM

Blacklist nouveau in VM

  1. echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf

  2. reboot
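Depending on how the initramfs was built, nouveau can still be pulled in early during boot; a hedged sketch of the fuller sequence (assumes dracut, the CentOS 9 default — the dracut step may be unnecessary if the plain blacklist works for you):

```sh
echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
# assumption: also keep nouveau out of the initramfs, then rebuild it
echo 'omit_drivers+=" nouveau "' > /etc/dracut.conf.d/blacklist-nouveau.conf
dracut --force
reboot
```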

Install nVIDIA Driver

dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -i)/cuda-rhel9.repo

dnf module install nvidia-driver:550-dkms
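After the driver install (and a reboot), a quick sanity check that the P40 is visible:

```sh
lsmod | grep nvidia      # kernel modules loaded
nvidia-smi               # should list the Tesla P40 plus driver/CUDA versions
```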

Install nVIDIA Container Toolkit

curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

dnf install nvidia-container-toolkit

nvidia-ctk runtime configure --runtime=crio
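After configuring the runtime, CRI-O needs a restart to register it; I believe nvidia-ctk writes its CRI-O changes as a drop-in under /etc/crio/crio.conf.d/ (worth verifying on your host):

```sh
systemctl restart crio
# assumption: the toolkit's runtime entry shows up somewhere under /etc/crio/
grep -r nvidia /etc/crio/
```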

Kubernetes

Install nVIDIA Device Plugin

Helm Chart
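A hedged sketch of installing the chart by hand (repo URL and chart name as documented for NVIDIA's k8s-device-plugin; in this Flux-managed cluster it would normally be expressed as a HelmRelease instead):

```sh
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace

# the GPU node should now advertise the resource
kubectl describe node worker8 | grep nvidia.com/gpu
```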

Configuration with NFD / LocalAI / Ollama / etc

LocalAI

  1. Make sure the container runtime is set correctly.
  2. Confirm that LocalAI is running on the nvidia-container-runtime (see the check below).
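A hedged way to check both points (the localai namespace/deployment names and the nvidia RuntimeClass here are assumptions, not necessarily my exact manifests):

```sh
# assumption: the workload sets runtimeClassName: nvidia and requests nvidia.com/gpu
kubectl -n localai get pods -o yaml | grep -E 'runtimeClassName|nvidia.com/gpu'

# nvidia-smi from inside the container proves the GPU is actually wired in
kubectl -n localai exec deploy/localai -- nvidia-smi
```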

stable-diffusion

On the node, install TCMalloc:

dnf install -y gperftools gperftools-devel
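To confirm the allocator is actually findable on the node (the library path is an assumption for CentOS 9):

```sh
ldconfig -p | grep tcmalloc
# expect something like /usr/lib64/libtcmalloc.so.4 (assumed CentOS 9 path)
```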

Other

nVidia HTOP

Improved nvidia-smi command.

nVidia HTOP

Fan Speed

IPMI Fan Script Discussion

Fan Control Script

Initialization

./init/create-cluster.sh (on master)

./init/prepare-cluster.sh (on laptop)

./init/initialize-cluster.sh (on laptop)

ssh root@master1 rm /etc/kubernetes/manifests/kube-vip.yaml (on laptop)

Teardown

./init/destroy-cluster.sh (on laptop)

Github Webhook

kubectl -n flux-system get receivers.notification.toolkit.fluxcd.io shows the generated webhook path; the resulting URL goes into github.com -> Settings -> Webhooks -> Payload URL (see the sketch below).
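A hedged sketch of pulling both values out of the cluster (the Receiver name github-webhook is an assumption; the secret name is the one referenced in the list below):

```sh
# webhook path generated by the Flux Receiver
kubectl -n flux-system get receiver github-webhook \
  -o jsonpath='{.status.webhookPath}'
# Payload URL = https://<flux webhook ingress hostname><webhookPath above>

# token to paste into the GitHub webhook "Secret" field
kubectl -n flux-system get secret github-webhook-token \
  -o jsonpath='{.data.token}' | base64 -d
```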

  • Content Type: application/json
  • Secret: <token from the github-webhook-token secret (see the sketch above)>
  • SSL: Enable SSL verification
  • Which events would you like to trigger this webhook?: Just the push event.
  • Active: checked

Resources: Limits and Requests Philosophy

In short: do set CPU requests, don't set CPU limits, and set the memory limit equal to the memory request.
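As a sketch, on a container spec that philosophy looks like this (values are illustrative):

```yaml
resources:
  requests:
    cpu: 100m          # always set a CPU request
    memory: 512Mi
  limits:
    memory: 512Mi      # memory limit equals the memory request
    # intentionally no CPU limit
```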

Debugging

  • https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
  • https://dnschecker.org
  • https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
  • https://github.com/nicolaka/netshoot
  • https://www.redhat.com/sysadmin/using-nfsstat-nfsiostat