Introduction

Warning

These docs contain information that relates to my setup. They may or may not work for you.



Lovenet Home Operations Repository

Managed by Flux, Renovate and GitHub Actions πŸ€–




Overview

This is the configuration for my GitOps homelab Kubernetes cluster. The cluster runs home software services for my residence. It is quite complex, with a lot of interdependencies, but the declarative nature of GitOps lets me manage this mesh of code. The software services fall into a few primary categories:

Core Components

Infrastructure

Networking

  • cilium: Kubernetes Container Network Interface (CNI)
  • cert-manager: Creates SSL certificates for services in my Kubernetes cluster
  • external-dns: Automatically manages DNS records from my cluster in a cloud DNS provider
  • Cloudflared: Cloudflare tunnel client
  • Envoy Gateway: Networking gateways into cluster

Storage

  • Rook-Ceph: Distributed block storage for persistent volumes
  • MinIO: S3-compatible object storage
  • Longhorn: Cloud native distributed block storage for Kubernetes
  • NFS: NFS storage

GitOps


βš™οΈΒ  Hardware

| Hostname | Device | CPU | RAM | OS | Role | Storage | IOT | VLANs (multus) |
|----------|--------|-----|-----|-----|------|---------|-----|----------------|
| master1 | Intel NUC7PJYH | 4 | 8 GB | CentOS 9 | k8s Master | | | |
| master2 | VM on beast | 3 | 8 GB | CentOS 9 | k8s Master | | | |
| master3 | VM on beast | 3 | 8 GB | CentOS 9 | k8s Master | | | |
| worker1 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | ZWA-2 | iot, sec |
| worker2 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | | iot, sec |
| worker3 | ThinkCentre M910x | 8 | 64 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | Sonoff | iot, sec |
| worker4 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | Coral USB | iot, sec |
| worker5 | VM on beast | 10 | 24 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | | iot, sec |
| worker6 | VM on beast | 10 | 24 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | | iot, sec |
| worker7 | VM on beast | 10 | 24 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | | iot, sec |
| worker8 | VM on beast | 10 | 58 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | nVIDIA P40 | iot, sec |

Network

Click to see a high-level physical network diagram

| Name | CIDR | VLAN | Notes |
|------|------|------|-------|
| Management VLAN | TBD | | |
| Default | 192.168.0.0/16 | 0 | |
| IOT VLAN | 10.10.20.1/24 | 20 | |
| Guest VLAN | 10.10.30.1/24 | 30 | |
| Security VLAN | 10.10.40.1/24 | 40 | |
| Kubernetes Pod Subnet (Cilium) | 10.42.0.0/16 | N/A | |
| Kubernetes Services Subnet (Cilium) | 10.43.0.0/16 | N/A | |
| Kubernetes LB Range (CiliumLoadBalancerIPPool) | 10.45.0.1/24 | N/A | |

☁️ Cloud Dependencies

| Service | Use | Cost |
|---------|-----|------|
| 1Password | Secrets with External Secrets | ~$65 (1 Year) |
| Cloudflare | Domain | Free |
| GitHub | Hosting this repository and continuous integration/deployments | Free |
| Mailgun | Email hosting | Free (Flex Plan) |
| Pushover | Kubernetes alerts and application notifications | $10 (One Time) |
| Frigate Plus | Model training services for Frigate NVR | $50 (1 Year) |

Total: ~$9.60/mo

Noteworthy Documentation

Cluster Rebuild ActionsΒ Β  Initialization and TeardownΒ Β  Github WebhookΒ Β  Limits and Requests PhilosophyΒ Β  DebuggingΒ Β  Immich restore to new CNPG databaseΒ Β  nVIDIA P40 GPUΒ Β 

@whazor created this website as a creative way to search Helm Releases across GitHub. You may use it to get ideas on how to configure an application's Helm values.

After a whole-home power outage or all-nodes power cycle

The main problem is that the kube-vip pods are not running, so the VIP, typically 192.168.6.1, is unknown. It just needs to be set so that the kube control plane can come up and the kube-vip pods can be re-instantiated. To do this, simply log in to master1 and run the following command.

ip addr add 192.168.6.1 dev eno1
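
Once the address is up, a quick way to confirm recovery (a sketch; the kube-vip label selector is an assumption and may differ in your manifests):

curl -k https://192.168.6.1:6443/healthz

kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip   # label selector assumed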

Coral Edge TPU

Coral USB

Info

I didn't need to add udev rules or build/load the apex driver when passing this device through to Frigate. I had to do those things for the Mini PCIe Coral, but the way I'm doing it now (see the Frigate mount point), it doesn't seem necessary.

USB Resets

Whenever the Coral device was attached to the Frigate container, it would trigger the following entry in dmesg, and Node Feature Discovery could no longer identify that the node had the device. This left the Frigate pod in a 'Pending' state until I unplugged and re-plugged the Coral USB. It was very annoying that if the Frigate container ever terminated, I'd have to unplug and re-plug the USB.

[ +12.269474] usb 2-5: reset SuperSpeed USB device number 22 using xhci_hcd
[  +0.012155] usb 2-5: LPM exit latency is zeroed, disabling LPM.

This hack resolved the USB reset issue: https://github.com/blakeblackshear/frigate/issues/2607#issuecomment-2092965042

Coral Mini PCIe

Info

Not currently working. For some reason it doesn't show up in 'lspci' on my Dell R730XD. I wonder if a more powerful power supply would make a difference.

nVIDIA Tesla P40

Setup: the nVIDIA GPU is ignored on the host (Dell R730xd, CentOS 9) and passed through via PCI to a KVM VM running CentOS 9, which acts as a k8s node running nVIDIA Container Toolkit pods.

Host

Ignore PCI device

  1. Append to GRUB_CMDLINE_LINUX in /etc/default/grub

intel_iommu=on pci-stub.ids=10de:1b38

  2. Update Grub

grubby --add-kernel $(grubby --default-kernel) --copy-default --args=vfio_pci.ids=10de:1b38 --title "Default kernel with vfio_pci" --make-default

  3. Reboot
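
After the reboot, two standard checks confirm the kernel args took effect and the card is kept away from the host driver (device ID from the grub step above):

cat /proc/cmdline | tr ' ' '\n' | grep -E 'iommu|vfio|pci-stub'

lspci -nnk -d 10de:1b38

The second command should report a kernel driver in use of pci-stub or vfio-pci, not nouveau or nvidia.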

PCI Passthrough to VM (via virt-manager)

  1. Add Hardware -> PCI Host Device

In the VM

Blacklist nouveau in VM

  1. echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf

  2. Comment out the following block in /etc/X11/xorg.conf.d/10-nvidia.conf

#Section "OutputClass"
#    Identifier "nvidia"
#    MatchDriver "nvidia-drm"
#    Driver "nvidia"
#    Option "AllowEmptyInitialConfiguration"
#    Option "PrimaryGPU" "no"
#    Option "SLI" "Auto"
#    Option "BaseMosaic" "on"
#EndSection
  3. Reboot

Install nVIDIA Driver

dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -i)/cuda-rhel9.repo

dnf module install nvidia-driver:565-dkms

Install nVIDIA Container Toolkit

curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

dnf install nvidia-container-toolkit

Configure the runtime

nvidia-ctk runtime configure --runtime=crio

Results

/etc/crio/crio.conf

/etc/nvidia-container-runtime/config.toml

Kubernetes

Install nVIDIA Device Plugin

Helm Chart

Configuration with NFD / LocalAI / Ollama / etc

LocalAI

  1. Make sure the runtime is set correctly
  2. Confirm that LocalAI is running on the nvidia-container-runtime (see the check below)
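
A sketch for step 2; the namespace and label are assumptions, so adjust both to your deployment:

kubectl get pod -n ai -l app.kubernetes.io/name=local-ai -o jsonpath='{.items[0].spec.runtimeClassName}'   # namespace/label assumed

Expected output: nvidia.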

stable-diffusion

On the node, install TCMalloc:

dnf install -y gperftools gperftools-devel

Other

nVidia HTOP

Improved nvidia-smi command.

nVidia HTOP

Fan Control Methods

The R730XD doesn't officially support the P40, so it doesn't natively adjust the fans for it. Below are some workarounds that I have not implemented yet.

IPMI Fan Control

Fan Control Script 1

Fan Control Script 2

Cluster Rebuild Actions

Before Cluster Rebuild

  1. Restore CNPG from backup

Uncomment the restore section in cluster.yaml.

After Cluster Rebuild

  1. Update KUBECONFIG secret in github/home-ops

Settings -> (left side) Secrets and Variables -> (submenu) Actions

Edit the KUBECONFIG secret with the output of cat ~/.kube/config | base64.
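
The same update can be scripted with the GitHub CLI (the repo slug is a placeholder):

base64 -w0 ~/.kube/config | gh secret set KUBECONFIG --repo <user>/home-ops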

Cluster Upgrade Runbook

This runbook covers an in-place rolling upgrade of the cluster, planned in May 2026 to bring all nodes to a single k8s 1.34 patch + cri-o 1.34, then to k8s 1.35.

The cluster runs kubeadm with stacked etcd and kube-vip (DaemonSet) for the control-plane VIP at 192.168.6.1.

State at the start of the upgrade

| Aspect | Reality |
|--------|---------|
| Control plane | 3 nodes (master1/2/3), HA via kube-vip DaemonSet |
| Workers | 7 nodes (worker2-8), each runs a Ceph OSD |
| Special hardware | worker8 = NVIDIA GPU; worker4 = Frigate Coral USB + Intel GPU + vlan-security |
| k8s | All nodes on 1.34.x, drift across .2/.6/.7 |
| cri-o | master1 on 1.34.2 (modern); all others on 1.28.4 (legacy el8 build, outside skew) |
| OS | master1 on CentOS Stream 10; all others on CentOS Stream 9 |
| etcd | 3.6.5 across all masters, healthy |
| Storage | Rook/Ceph (ceph-block), Longhorn (per-app named SCs), Garage (S3) |
| GitOps | Flux pulls from home-ops-kubernetes GitRepository |

Phase 0 β€” Pre-flight (βœ… done)

  • βœ… Master1 stale kube-vip static pod removed (/root/kube-vip.yaml.removed-20260502 is the rollback breadcrumb)
  • βœ… etcd snapshot saved off-cluster (~/cluster-backups/etcd-20260502/snapshot-prephase0-20260502.db)
  • βœ… isv_cri-o_stable_v1.34.repo pre-staged on all nodes via dnf β€” master1 already had it from its earlier rebuild
  • βœ… descheduler HelmRelease suspended via the disable-descheduler commit pattern. Resume in Phase 5.
  • βœ… Cilium 1.19.3 confirmed compatible with k8s 1.35 (Rook 1.19.5, CNPG 1.29.0, Istio 1.29.2 also confirmed)
  • βœ… kube-vip DaemonSet already on v1.1.2; Renovate is tracking it
  • βœ… API deprecation grep clean β€” no core k8s alpha/beta apiVersions used outside of vendor CRDs

Per-node procedure (used in Phases 1–3)

The cri-o package swap, kubeadm upgrade, and kubelet bump all happen inside the same drain window per node, so we drain once per node.

Order

Standard "clean RPM" workers first, then the special cases, then masters.

  1. worker7 βœ… migrated 2026-05-02 β€” k8s 1.34.7 + cri-o 1.34.7
  2. worker3 β€” Intel GPU label but no pinned pods; mon-f is pinned here, drain only when worker3 is the active mon target (see "mon nodeSelector trap" below)
  3. worker2
  4. worker4 β€” Frigate node. Pre-suspend frigate, zigbee2mqtt, zwave-js-ui via the disable-<app> GitOps pattern; expect brief recording / automation gap.
  5. worker8 β€” NVIDIA. Pre-suspend ollama, comfyui. Carefully port the NVIDIA runtime stanza to a drop-in. Smoke-test with a runtimeClassName: nvidia pod before un-suspending.
  6. worker6 β€” manual-install kubelet/cri-o, requires the alternate procedure below (no rm crio.conf, use dnf install).
  7. worker5 β€” same alternate procedure as worker6.
  8. master3
  9. master2 (kubelet already 1.34.7; cri-o still 1.28.4)
  10. master1 (last β€” already on 1.34.2 + cri-o 1.34.2; just kubelet patch bump). The VIP risk is bounded by master2/master3 kube-vip DaemonSet pods.

Never drain two masters concurrently. After each master, verify etcd quorum:

kubectl exec -n kube-system etcd-master1.thesteamedcrab.com -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --cluster -w table

Pre-flight before every node (lessons from 2026-05-02)

These checks must happen before draining any node. Skipping any of them caused real damage on the first attempt at this phase.

  1. Drain Longhorn replicas off the target node FIRST, before kubectl drain. Otherwise, when the node's longhorn-manager blips during the cri-o restart, every replica still on that node becomes inaccessible, and pods on OTHER nodes whose replicas live here go into CrashLoopBackOff. That cascades into CNPG PDBs blocking subsequent drains. Two ways to do this:

    Via Longhorn UI (preferred β€” visual confirmation):

    • Open the Longhorn UI (port-forward longhorn-frontend in longhorn-system if you don't have ingress wired up).
    • Node tab β†’ click target node β†’ "Edit Node" β†’ set "Node Scheduling" to Disable AND "Eviction Requested" to True.
    • Watch the Volume tab β€” every volume with a replica on this node should show its replica count restoring on other nodes.
    • When the target node's "Replicas" count reaches 0, proceed.
    • Re-enable scheduling and clear eviction after the node is back and you've uncordoned it, otherwise replicas won't return.

    Via kubectl (equivalent):

    kubectl patch -n longhorn-system nodes.longhorn.io <node> \
      --type=merge -p '{"spec":{"allowScheduling":false,"evictionRequested":true}}'
    
    # Wait until no replicas remain on the node
    until [ "$(kubectl get replicas.longhorn.io -n longhorn-system -o json |
      jq -r --arg n "<node>" '.items[] | select(.spec.nodeID==$n) | .metadata.name' |
      wc -l)" = "0" ]; do sleep 10; done
    
    # … do the drain + upgrade + uncordon …
    
    # After uncordon and node Ready, restore Longhorn scheduling:
    kubectl patch -n longhorn-system nodes.longhorn.io <node> \
      --type=merge -p '{"spec":{"allowScheduling":true,"evictionRequested":false}}'
    
  2. Wait for ceph -s HEALTH_OK before draining the next node. Not just "the previous node's OSD pod is Ready" β€” Rook creates dynamic per-host OSD PDBs (rook-ceph-osd-host-<host>) when an OSD is unavailable, with MAX UNAVAILABLE: 1, ALLOWED DISRUPTIONS: 0. While ANY OSD-host PDB exists for ANY host, the next drain will hang indefinitely on its own host's PDB. This cost ~20 minutes of stuck drain on worker3 because worker6 was still degraded in the background. (A quick check for lingering OSD-host PDBs follows this list.)

  3. Check mon nodeSelectors and don't drain a mon's pinned host while another mon is also down. Rook pins each mon to a specific node and recreates them under new letters as nodes drop in/out:

    kubectl get pod -n rook-ceph -l app=rook-ceph-mon -o jsonpath='{range .items[*]}{.metadata.labels.mon}{"\t"}{.spec.nodeSelector.kubernetes\.io/hostname}{"\n"}{end}'
    

    Re-check before every node β€” the mon names shift (we saw c,e,f β†’ c,e,g β†’ c,e,h over a single afternoon as nodes drained). Draining a node hosting a pinned mon strands that mon β€” it cannot reschedule until the pin is satisfied again. If two mons are pinned to drained nodes, ceph quorum is lost. Drain pinned-mon nodes one at a time and let the mon come back before touching the next.

  4. Check the package install style on the target node. Some nodes have kubelet/kubeadm/kubectl/cri-o installed via dnf (RPM-tracked). Others have manually-installed binaries (no RPM entries):

    ssh root@<node> 'rpm -qa | grep -cE "^(kubeadm|kubelet|kubectl|cri-o)-"'
    

    Returns 4 β†’ standard procedure. Returns 0 β†’ alternate procedure (don't rm crio.conf; use dnf install not dnf upgrade). Nodes known to be manual-install: worker5, worker6.

  5. CNPG primary failover:

    for c in $(kubectl get pod -n databases -l 'cnpg.io/instanceRole=primary' \
        --field-selector spec.nodeName=<node>.thesteamedcrab.com \
        -o jsonpath='{range .items[*]}{.metadata.labels.cnpg\.io/cluster}{"\n"}{end}'); do
      replica=$(kubectl get pod -n databases -l "cnpg.io/cluster=$c,cnpg.io/instanceRole=replica" \
        -o jsonpath='{range .items[?(@.spec.nodeName!="<node>.thesteamedcrab.com")]}{.metadata.name}{"\n"}{end}' | head -1)
      kubectl cnpg promote -n databases "$c" "$replica"
    done
    

    Then poll kubectl get pod -n databases -l 'cnpg.io/instanceRole=primary' --field-selector spec.nodeName=<node>... until empty.

  6. Hardware-pinned pod suspension (worker4: frigate + zigbee2mqtt + zwave-js-ui; worker8: ollama + comfyui). Use the disable-<app> GitOps commit pattern, not kubectl scale β€” Flux will revert imperative scales. Wait for Flux reconciliation to actually take the pods down before draining.
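
Quick check for the per-host OSD PDB trap from item 2 (namespace and PDB name pattern per Rook in this cluster):

kubectl get pdb -n rook-ceph

Any rook-ceph-osd-host-<host> entry with ALLOWED DISRUPTIONS: 0 means the next drain will hang; wait for HEALTH_OK first.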

Standard per-worker procedure (RPM-tracked nodes)

For workers with RPM entries (rpm -qa returns 4 packages):

  1. Pre-flight checks above.
  2. Drain:
    kubectl drain <worker>.thesteamedcrab.com \
      --ignore-daemonsets --delete-emptydir-data
    
  3. Package upgrade and kubelet config refresh:
    ssh root@<worker>.thesteamedcrab.com '
      # Drop the legacy crio.conf / .rpmnew. Order matters: only do
      # this RIGHT BEFORE the upgrade β€” leaving cri-o
      # without a config will trigger the "unsafe procfs detected"
      # runc error and break ALL pods on the node.
      rm -f /etc/crio/crio.conf /etc/crio/crio.conf.rpmnew /etc/crio/crio.conf.working
    
      # Single transaction: cri-o upgrade picks up the new isv repo
      # automatically since 1.34.7 > 1.28.4.
      dnf upgrade -y cri-o kubelet-1.34.7 kubeadm-1.34.7 kubectl-1.34.7
    
      kubeadm upgrade node
      systemctl daemon-reload
      systemctl restart crio
      systemctl restart kubelet
    '
    
  4. Smoke-test before uncordon:
    ssh root@<worker>.thesteamedcrab.com '
      systemctl is-active crio kubelet
      crictl info | grep -E "CgroupManagerName|DefaultRuntime"
      ls /etc/crio/crio.conf.d/   # should exist; legacy crio.conf should be gone
    '
    kubectl get node <worker>.thesteamedcrab.com -o wide   # version + cri-o version match expected
    
  5. Uncordon:
    kubectl uncordon <worker>.thesteamedcrab.com
    
    Rook auto-clears any host noout flag on its own a few seconds after uncordon β€” don't manually unset it.
  6. Wait for ceph -s HEALTH_OK (no OSDs down, no degraded PGs) before the next node. Do not skip this. Typically ~1–3 minutes. A wait-loop sketch follows this list.
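
A minimal wait loop for step 6; the toolbox Deployment name is an assumption, so match it to your Rook install:

# toolbox deployment name assumed; adjust to your Rook install
until kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s | grep -q HEALTH_OK; do
  echo "ceph not healthy yet; sleeping"; sleep 15
done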

Alternate procedure for manual-install nodes (worker5, worker6)

These nodes have kubelet/cri-o binaries in /usr/bin/ not tracked by RPM. dnf upgrade cannot upgrade what it cannot see, and rm crio.conf will break the node since the dnf step provides no replacement.

  1. Pre-flight checks (same as above).
  2. Drain (same as above).
  3. Fresh-install via dnf (overwrites the un-tracked binaries):
    ssh root@<worker>.thesteamedcrab.com '
      # Stop services so we can replace running binaries cleanly
      systemctl stop kubelet
      systemctl stop crio
    
      # Move the manual binaries aside (rollback if dnf install fails)
      mv /usr/bin/kubelet  /usr/bin/kubelet.manual
      mv /usr/bin/kubeadm  /usr/bin/kubeadm.manual
      mv /usr/bin/kubectl  /usr/bin/kubectl.manual
      mv /usr/bin/crio     /usr/bin/crio.manual
    
      # Install the RPMs fresh (now they will be tracked)
      dnf install -y cri-o kubelet-1.34.7 kubeadm-1.34.7 kubectl-1.34.7
    
      # Now we can safely remove crio.conf β€” package provides drop-in
      rm -f /etc/crio/crio.conf /etc/crio/crio.conf.rpmnew /etc/crio/crio.conf.working
    
      kubeadm upgrade node
      systemctl daemon-reload
    
      # Multi-minor cri-o jumps (1.28 β†’ 1.34) leave stale container
      # refs that hang internal_wipe forever. Skip the regular start;
      # do the kill+wipe+start dance up front.
      systemctl kill --signal=SIGKILL crio || true
      crio wipe -f
      systemctl start crio
      systemctl start kubelet
    
      # Verify and clean up rollback files only after success
      rpm -q cri-o kubelet kubeadm kubectl
      systemctl is-active crio kubelet
      # rm /usr/bin/*.manual    # only after smoke-test confirms success
    '
    
  4. Smoke-test, uncordon, wait for HEALTH_OK β€” same as above.

Worker8 (NVIDIA) extra steps

After the standard procedure:

ssh root@worker8.thesteamedcrab.com 'cat > /etc/crio/crio.conf.d/20-nvidia.conf <<EOF
[crio.runtime.runtimes.nvidia]
runtime_path = "/usr/bin/nvidia-container-runtime"
runtime_root = "/run/nvidia"
runtime_type = "oci"
EOF
systemctl restart crio'

Smoke-test before resuming ollama / comfyui:

kubectl run nvidia-smoke --rm -i --restart=Never \
  --overrides='{"spec":{"runtimeClassName":"nvidia","nodeName":"worker8.thesteamedcrab.com"}}' \
  --image=nvidia/cuda:12.0-base-ubuntu22.04 -- nvidia-smi

Failure modes seen on 2026-05-02

| Symptom | Root cause | Fix |
|---------|------------|-----|
| Drain hangs ~indefinitely on rook-ceph-osd-host-* PDB | A previous node's OSD is still degraded; Rook's per-host PDB blocks all OSD evictions cluster-wide | Wait for ceph -s HEALTH_OK, then retry the drain |
| New pods on a node fail with runc create failed: unsafe procfs detected | /etc/crio/crio.conf was removed but the new cri-o package never got installed, so nothing replaced the config | Restore crio.conf from a peer node on the identical version (scp root@<peer>:/etc/crio/crio.conf root@<broken>:/etc/crio/), then systemctl restart crio |
| mon-X stays Pending after drain | Mon is pinned via nodeSelector to the cordoned node | Uncordon the pinned node and the mon comes back. Don't drain another mon's host until quorum is restored |
| dnf upgrade reports kubelet-1.34.7: No match for argument | Node has a manually-installed kubelet (no RPM entry) | Use the alternate procedure (dnf install after moving binaries aside) |
| longhorn-manager-X stuck CrashLoopBackOff with bind: address already in use on port 9502 | Old longhorn-manager process orphaned by a previous container; cri-o lost track of it but the binary is still bound | ssh root@<node> 'pgrep -af "longhorn-manager -d daemon"' β†’ kill -9 <pid>; then kubectl delete pod -n longhorn-system longhorn-manager-X |
| Multiple CNPG replica pods stuck Init:CrashLoopBackOff on a recently-broken node | Their Longhorn volumes failed to attach during the node's outage; pods are now in 5-minute kubelet backoff | Once Longhorn recovers, kubectl delete pod each one to force an immediate retry; the volume attaches succeed |
| Cascading CNPG PDB blocks drains | One unhealthy replica in cluster X means its PDB allows 0 disruptions; subsequent drains anywhere block on cluster X's PDB | Heal the unhealthy replica before draining its peer's host. The Longhorn pre-drain step (above) prevents this |
| cri-o stuck in activating (start) indefinitely after package upgrade; logs flood with Killing container <id> failed: container does not exist | cri-o's internal_wipe = true tries to kill phantom containers left in /var/lib/containers/storage/ by the previous version, but their runc state in /run/crun is gone. Worse across multi-minor jumps (1.28 β†’ 1.34) | systemctl kill --signal=SIGKILL crio β†’ crio wipe -f (clears container refs, keeps images) β†’ systemctl start crio β†’ systemctl start kubelet. Hit on worker6 during the manual-install upgrade; likely also needed on worker5 for the same reason |
| Suspended app stays at replicas: 0 after the suspend HR is reverted | Manually scaling a StatefulSet/Deployment to 0 before drain (the post-suspend step that actually stops pods) sticks; Helm/Flux applying the un-suspended HR doesn't reset the imperative scale | After reverting disable-<app>, kubectl scale -n <ns> statefulset <app> --replicas=1 (or the desired count). Bit us with frigate, ollama, comfyui |

### Master procedure

Same as worker, with two differences:

- **Master1 is special** (already on cri-o 1.34.x). Skip the cri-o swap;
  just do the k8s patch / minor bump. Master1 is also the last in the
  order so the kube-vip VIP can fail over to master2/master3 during
  master1's drain.
- **First control plane in Phase 3 minor bump** uses
  `kubeadm upgrade apply v1.35.x`; the others use
  `kubeadm upgrade node`. The first one *must* be done before any
  others.

## Phase 5 β€” verify and clean up

- All nodes show same kubelet + cri-o version: `kubectl get nodes -o wide` (a one-liner variant follows this list)
- Flux reconciled: `flux get all -A | grep -v True`
- `ceph -s` is `HEALTH_OK`
- Run `tools/etcd-defrag.sh` (etcd grew during the upgrade)
- **Resume descheduler**: `git revert <disable-descheduler sha>` and
  push. Same for any hardware-pinned `disable-<app>` commits made
  along the way.
- Update this runbook with anything new that bit you, before memory
  fades.
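
A compact variant of the first check, printing one line per distinct
version pair (a single line means the fleet is uniform):

    kubectl get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{" "}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}' | sort | uniq -c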

## Rollback

In-place RPM downgrade is messy, especially for kubelet across a minor
boundary. Realistic rollback paths:

- Per-node, *before* `kubeadm upgrade apply` on the first master:
  rollback is just `dnf downgrade kubeadm kubelet kubectl` to 1.34 + put
  the old `crio.conf` back (it's saved as `crio.conf.rpmsave` after the
  package swap).
- Per-node, *after* `kubeadm upgrade apply`: bring forward the rest of
  the cluster. Don't try to roll back the apiserver minor.
- Catastrophic (etcd corruption, control plane unrecoverable): restore
  from `~/cluster-backups/etcd-20260502/snapshot-prephase0-20260502.db`
  via [the kubeadm etcd recovery procedure][1]. **This is a last
  resort and has not been tested in this homelab.** It assumes you
  have at least one master with the original PKI intact and can
  `etcdctl snapshot restore` to a fresh data dir, then restart etcd
  static pods.

[1]: https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/#restoring-an-etcd-cluster

Per-Route Cert Migration Runbook

This runbook covers migrating from one shared wildcard ACME certificate (thesteamedcrab.com covering *.thesteamedcrab.com) to per-listener fine-grained short-term certificates, using cert-manager's Gateway API shim to auto-provision a Certificate per HTTPRoute hostname.

Why: smaller blast radius per renewal failure, per-host visibility, no reflector dependency for new namespaces, faster rotation.

Why this needs to be paced: Let's Encrypt limits issuance to 50 certificates per Registered Domain per rolling 7-day window. The cluster has ~70 HTTPRoutes under thesteamedcrab.com β€” too many to flag-day onto ACME without breaching the limit and locking out all issuance for ~7 days.

State at the start

| Aspect | Reality |
|--------|---------|
| Cert manifest | One Certificate (thesteamedcrab.com) producing thesteamedcrab-com-tls |
| Cert SANs | apex, *.thesteamedcrab.com, *.app.…, *.mcp.… |
| Reflection | Secret reflected to network, istio-system, mcp-system |
| Issuer | letsencrypt-production (DNS01 via Cloudflare) |
| Gateways | 4 β€” external (network), internal (network), istio (istio-system), mcp-gateway (mcp-system) |
| HTTPRoutes per Gateway | external: 11, internal: 50+, mcp-gateway: 9, istio: 2 |
| Existing TTL | LE default 90d, autorenew at ~30d remaining |

Target state β€” split by exposure

The 70+ routes split into two cohorts:

  • Public-facing (~22): external (11), mcp-gateway (9), istio (2). These need browser-trusted certs β†’ ACME via letsencrypt-production.
  • Internal-only (~50): everything on the internal Gateway. These never face the public internet β†’ private CA, issued by an in-cluster ClusterIssuer. Zero rate-limit concern, instant issuance, no public CT log entries leaking internal hostnames.

Why split

A "just ACME everything" plan would burn the entire 50/week budget twice over and leave no headroom for retries or the existing wildcard's renewal. Splitting halves the scope of the rate-limited work and removes the public CT log noise for internal services.

The cost of the split is one-time: distribute the private CA root to every device that visits internal hostnames (browsers, mobile devices, anything that calls internal APIs). That's a manual import on each device.

If that operational cost is unacceptable, see "Alternative: all-ACME" at the bottom of this runbook.

Rate-limit math (ACME side, ~22 certs)

  • Hard ceiling: 50 issuances / 7d / Registered Domain
  • Total ACME target certs: ~22
  • Renewal exemption: renewals (same FQDN set, same account) are exempt from the 50/week ceiling but still subject to a 5/week Duplicate Certificate cap per FQDN set. With duration: 168h / renewBefore: 48h, each cert renews ~1.4Γ—/week β€” well under 5/week.
  • Sustainable initial-issuance rate: 50 Γ· 7 β‰ˆ 7/day. The 7-day window is rolling, not calendar β€” at 10/day you hit the cap on day 5, not day 7.
  • Wave size: 5/day. Buffer of 2/day for retries and the existing wildcard's renewal pressure.
  • Total elapsed: ~5 days for 22 certs.

Critical gotchas (learned the hard way)

Three things bit us during the first execution attempt on 2026-05-02. Read these before writing any YAML.

1. The gateway-shim must be explicitly enabled

cert-manager v1.16+ does not enable the gateway-shim by default, contrary to what some release notes suggest. You need config.enableGatewayAPI: true in the Helm values. Without it, annotating a Gateway with cert-manager.io/cluster-issuer is silently inert β€” no Certificate is ever created. Verify before Phase 0:

kubectl get cm -n cert-manager cert-manager -o jsonpath='{.data.config\.yaml}'
# Expect to see: enableGatewayAPI: true

If absent, set it in the chart values and roll out cert-manager first.

2. The reflector pattern fights the gateway-shim across namespaces

This cluster mirrors network/thesteamedcrab-com-tls to istio-system and mcp-system via reflector. The gateway-shim is per-namespace: it creates a Certificate in the same namespace as the Gateway. So annotating the istio Gateway whose listener references the reflector-mirrored Secret causes:

  1. shim creates a Certificate in istio-system targeting the mirrored Secret's name
  2. cert-manager re-issues from ACME because "Secret was previously issued by a different issuer" (1 ACME order against the budget)
  3. ongoing fight: shim's Certificate vs reflector for ownership of the Secret

Don't annotate any Gateway whose listener still references a reflector-mirrored Secret. Migrate that Gateway's HTTPRoutes to per-app listeners with their own Secret names first, then remove the wildcard listener and the reflector dependency, then add the shim annotation.

For this cluster: the istio Gateway is excluded from Phase 0. It gets annotated only after its (small) HTTPRoute set is migrated and the wildcard listener is removed.

3. HTTPRoute sectionName has no fallback

A HTTPRoute attached to a listener that goes Programmed=False does not fall back to other listeners that match its hostname. The route is just down. The wildcard listener you keep around "as a backstop" only catches HTTPRoutes that explicitly point at it via sectionName.

Implication for the canary: don't pick an app whose downtime is unacceptable. The canary's listener is Programmed=False until ACME issues, which can be ~30s but can be much longer if anything's wrong.

Phase 0 β€” Annotation prep

Verified 2026-05-02 against cert-manager v1.17.1 with config.enableGatewayAPI: true: when the shim sees a listener whose referenced Secret doesn't exist (or isn't owned by an in-namespace Certificate), it creates one. When an in-namespace Certificate already owns the Secret, the shim is a no-op.

For this cluster, that means it's safe to annotate Gateways in the network namespace (which holds the manual thesteamedcrab.com Certificate that owns thesteamedcrab-com-tls), and unsafe to annotate Gateways in istio-system or mcp-system while they still reference the reflector-mirrored Secret.

0.1 Pre-flight test (one-time)

Confirm the shim creates Certificates for new Secret refs:

kubectl apply -f - <<'EOF'
---
apiVersion: v1
kind: Namespace
metadata: { name: shim-test }
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata: { name: ss, namespace: shim-test }
spec: { selfSigned: {} }
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: scratch
  namespace: shim-test
  annotations:
    cert-manager.io/issuer: ss
spec:
  gatewayClassName: envoy
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: bar.shim-test.thesteamedcrab.com
      allowedRoutes: { namespaces: { from: Same } }
      tls:
        certificateRefs: [{ kind: Secret, name: bar-shim-tls }]
EOF
sleep 20
kubectl get certificate -n shim-test
kubectl delete namespace shim-test

Expected: a bar-shim-tls Certificate appears, Ready: True. If nothing appears within 30s, the shim isn't running β€” fix that before proceeding.

0.2 Add the annotations

Annotate only the external and internal Gateways (network namespace). Skip istio and mcp-gateway for now β€” they're in namespaces that depend on the reflector. They'll be annotated in Phase 5 after their HTTPRoutes have moved off the shared Secret.

metadata:
  annotations:
    # ...existing annotations...
    cert-manager.io/cluster-issuer: letsencrypt-production
    cert-manager.io/duration: "168h"      # 7d
    cert-manager.io/renew-before: "48h"

What duration actually does. Let's Encrypt's default profile always issues 90-day certificates regardless of the duration requested by ACME. The cert-manager.io/duration annotation controls only cert-manager's renewal cadence β€” it tells cert-manager "treat this cert as expiring after 168h" so it renews early. You still get 90-day certs, just rotated every ~5 days.

For actually-short LE certs, opt into the tlsserver profile (~6-day validity) by adding acme.cert-manager.io/order-profile-name: tlsserver on the per-listener Certificate or via cert-manager's issuer-level configuration. That changes the LE order profile and the issued cert is genuinely short-lived. Verify with openssl s_client … | openssl x509 -noout -dates after issuance.

For the internal Gateway, swap the issuer to the private CA once Phase 1 is done:

    cert-manager.io/cluster-issuer: cluster-internal-ca

0.3 Verify

# Cert count must be unchanged after reconcile.
kubectl get certificate -A

# If a new Certificate appears in any namespace, the shim hit
# something it shouldn't have. Stop and investigate.

If a Certificate appeared, check whether it's in a namespace whose Gateway you annotated. If yes, you almost certainly hit case (2) above β€” the listener references a reflector-mirrored Secret. Revert the annotation and migrate that Gateway separately later.

Phase 1 β€” Internal Gateway β†’ private CA (parallel track)

This phase is independent of the ACME phases below β€” it has no rate limit concerns and can run in any order or in parallel.

1.1 Set up the private CA

(One-time. Skip if already present.)

# cert-manager: bootstrap a self-signed root + ClusterIssuer
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-bootstrap
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cluster-internal-ca
  namespace: cert-manager
spec:
  isCA: true
  commonName: thesteamedcrab.com Internal CA
  secretName: cluster-internal-ca-tls
  duration: 87600h  # 10y root
  renewBefore: 720h
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned-bootstrap
    kind: ClusterIssuer
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: cluster-internal-ca
spec:
  ca:
    secretName: cluster-internal-ca-tls

1.2 Distribute the root to client devices

This is the non-zero operational cost. Export the CA cert and import it into:

  • Each laptop / desktop browser (or system trust store)
  • Each phone / tablet
  • Any in-cluster client that calls internal hostnames over TLS
kubectl get secret -n cert-manager cluster-internal-ca-tls \
  -o jsonpath='{.data.tls\.crt}' | base64 -d > internal-ca.crt

Don't proceed past 1.2 until you've imported and verified the root on at least your primary device. (Open https://glance.thesteamedcrab.com post-migration and confirm no cert warning.)

1.3 Migrate the internal Gateway

Add the cert-manager.io/cluster-issuer: cluster-internal-ca annotation to kubernetes/apps/network/envoy-gateway/config/internal.yaml. Then, in batches of 10 for reviewability, add a per-app listener and flip the corresponding HTTPRoute's sectionName. Issuance is instant β€” no soak required between batches; just verify each Certificate goes Ready before moving on.

The wildcard listener stays alive until every internal HTTPRoute is migrated, so rollback is the same as the ACME track: revert sectionName and the route falls back to the wildcard.

Phase 2 β€” Public Gateways canary (1 app)

Pick a low-stakes route on the external Gateway (the highest public visibility). Avoid the highest-traffic apps (immich, jellyfin) until after canary; pick something like the github webhook or a seldom-used bookmark.

⚠ No fallback. Once you flip the HTTPRoute's sectionName to the new per-app listener, the route is bound to that listener only. If ACME issuance fails or is slow, the route returns 404 (or connection refused) until the cert lands β€” the wildcard listener does not serve as a backstop. Pick a canary you can afford to have down for ~30s in the happy case and several minutes in the failure case.

2.1 Add a per-app listener

spec:
  listeners:
    # existing http and wildcard https listeners untouched
    - name: https-<app>
      protocol: HTTPS
      port: 443
      hostname: "<app>.${SECRET_DOMAIN}"
      allowedRoutes:
        namespaces:
          from: All
      tls:
        certificateRefs:
          - kind: Secret
            name: <app>-tls

The Gateway already carries the shim annotations from Phase 0; this new listener picks them up automatically. cert-manager will create a <app>-tls Certificate within seconds of the listener appearing.

2.2 Flip the HTTPRoute's sectionName

   parentRefs:
     - name: external
       namespace: network
-      sectionName: https
+      sectionName: https-<app>

2.3 Verify and soak

# Cert issues in seconds for staging, ~30s for production
kubectl get certificate -n network -w

# Confirm the SAN is just the one hostname and duration β‰ˆ 7d
echo | openssl s_client -connect <app>.thesteamedcrab.com:443 \
  -servername <app>.thesteamedcrab.com 2>/dev/null \
  | openssl x509 -noout -subject -ext subjectAltName -dates

Soak 24h. If anything is wrong, revert the HTTPRoute sectionName. The wildcard listener still serves the old wildcard cert; no user impact.

Don't proceed past Phase 2 until this canary has cleanly soaked AND auto-renewed at least once (renewBefore: 48h means renewal kicks in on day 5 β€” wait for that).

Phase 3 β€” Wave migration on public Gateways

Goal: migrate the remaining ~19 public-facing HTTPRoutes on Gateways that are already annotated (external 10, mcp-gateway 9) at 5 per day with 12h soaks. The 2 istio routes are deferred to Phase 5.

Per-wave procedure

For each batch of 5:

  1. Add 5 listeners to the appropriate Gateway. Sort listeners alphabetically inside the listeners: block.
  2. Flip 5 HTTPRoute sectionNames.
  3. Commit + push. One PR per wave. Title: feat(network): per-app TLS migration wave N (X/Y).
  4. Watch issuance:
    kubectl get certificate -A -w
    
    All 5 should reach Ready within ~2 min. Anything stuck β†’ check kubectl describe certificate <name> -n network and the Order / Challenge resources.
  5. Soak ~12h between waves. Watch the issuer for rate-limit warnings:
    kubectl describe clusterissuer letsencrypt-production
    kubectl get challenge -A
    

Public Gateway with the manual Certificate first (lower risk β€” that's where Phase 0 was verified safe):

  1. Wave 1: external (5 of 10; canary already done)
  2. Wave 2: external (the other 5)
  3. Wave 3: mcp-gateway (5 of 9)
  4. Wave 4: mcp-gateway (the other 4)

Total ~4 calendar days at one wave per ~12h. The 2 istio routes are handled in Phase 5 once the reflector dependency is removed.

What to do if you hit a rate-limit error

cert-manager surfaces LE rate-limit responses in the Order resource:

kubectl get order -A | grep -i rate
kubectl describe order -n network <order>

If hit:

  • Stop the next wave immediately.
  • Don't delete failing Certificates β€” that doesn't reset the counter and can cause cert-manager to retry, eating more budget.
  • The window is rolling; just wait. The first issuance from N days ago drops off after 7d.
  • Resume waves once kubectl get certificate -A shows no Pending or Failing.

Phase 4 β€” Cleanup

Run only after every HTTPRoute on every Gateway has been migrated and soaked for 7+ days.

4.0 istio Gateway migration (prerequisite)

The istio Gateway was excluded from Phase 0 / Phase 3 because its listener references the reflector-mirrored Secret. Migrate its HTTPRoutes (~2: kiali and one redirect) before deleting the wildcard Certificate.

For each istio HTTPRoute (<app> = e.g. kiali):

  1. Create a manual Certificate in istio-system (no shim involvement β€” the istio Gateway is still un-annotated):

    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: <app>
      namespace: istio-system
    spec:
      secretName: <app>-tls
      issuerRef:
        name: letsencrypt-production
        kind: ClusterIssuer
      dnsNames:
        - <app>.${SECRET_DOMAIN}
      duration: 168h
      renewBefore: 48h
    
  2. Wait for it to be Ready (1 ACME issuance per cert).

  3. Add a per-app listener to the istio Gateway referencing <app>-tls, then flip the HTTPRoute's sectionName.

  4. Soak briefly. Confirm the HTTPRoute serves via the new listener.

After all istio HTTPRoutes are migrated:

  1. Remove the wildcard listener from the istio Gateway.

  2. Remove istio-system from the wildcard Certificate's reflector...reflection-allowed-namespaces annotation list (in kubernetes/apps/network/envoy-gateway/config/certificate.yaml). This stops new mirroring; existing Secret in istio-system can be deleted manually if desired (no consumers).

  3. Now the istio Gateway has no reflector-mirrored Secret. Add the cert-manager.io/cluster-issuer annotations from Phase 0. The shim is now safe β€” every listener's referenced Secret is owned by an in-namespace manual Certificate. (Optionally, delete the manual Certificates afterward and let the shim recreate them, at the cost of one ACME order per cert.)

4.1 Verify no consumers reference the wildcard Secret

grep -rn 'thesteamedcrab-com-tls\|${SECRET_DOMAIN/./-}-tls' kubernetes/

Anything still referencing the wildcard Secret needs to migrate first. Particular attention: Gateways in istio-system and mcp-system may still be using the reflector-mirrored copy.

4.2 Remove the wildcard listener from each Gateway

Delete the name: https block (with hostname: "*.${SECRET_DOMAIN}") from each of:

  • kubernetes/apps/network/envoy-gateway/config/external.yaml
  • kubernetes/apps/network/envoy-gateway/config/internal.yaml
  • kubernetes/apps/istio-system/gateway/gateway.yaml
  • kubernetes/apps/mcp-system/mcp-gateway/app/gateway.yaml

4.3 Delete the wildcard Certificate manifest

git rm kubernetes/apps/network/envoy-gateway/config/certificate.yaml

Also remove the entry from kubernetes/apps/network/envoy-gateway/config/kustomization.yaml.

The reflector annotations on the (now-deleted) Certificate's secretTemplate go away automatically; reflector stops mirroring; the reflected Secret copies in istio-system and mcp-system are garbage-collected.

4.4 Final verify

# No wildcard cert in any namespace
kubectl get certificate -A | grep -i wildcard

# Per-app certs all present and Ready
kubectl get certificate -A

# All public certs are short-duration; private CA certs may be longer
kubectl get certificate -A -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) duration=\(.spec.duration // "unset")"'

# Smoke test from outside (public) and from a CA-trusting device (internal)
for host in glance.thesteamedcrab.com photos.thesteamedcrab.com ... ; do
  echo "$host:"
  echo | openssl s_client -connect "$host:443" -servername "$host" 2>/dev/null \
    | openssl x509 -noout -subject -dates
done

Rollback

The wildcard cert + listeners stay alive through Phase 1, 2, and 3. Rollback during those phases is just reverting the HTTPRoute's sectionName. Even after deleting some per-app listeners, the wildcard listener catches the route.

Phase 4 is irreversible by git revert alone β€” once the wildcard Certificate is deleted, re-creating it kicks off a new ACME order (counts against rate limit). If you must roll back from Phase 4:

  1. Restore the Certificate manifest via git revert
  2. cert-manager re-issues β€” costs 1 against the 50/week budget
  3. Restore the wildcard listeners
  4. Flip HTTPRoutes back to sectionName: https

So don't enter Phase 4 unless you're confident in the new state.

What to watch

  • kubectl get certificate -A β€” column READY should always be True
  • kubectl get order -A β€” should be empty in steady state (orders exist transiently during issuance/renewal)
  • kubectl describe clusterissuer letsencrypt-production β€” surfaces ACME backoff messages if rate-limited
  • ACME audit log at https://crt.sh/?Identity=thesteamedcrab.com β€” external counter you don't control, useful sanity check

When to stop and ask

  • Any wave produces >0 failed Certificates after 5 min
  • Total Pending+Failing Certificates count >3 across the cluster
  • Any rate-limit error in Order events
  • More than one wave in the rolling 7d window has produced retries
  • Phase 0.1 test result was ambiguous

Alternative: all-ACME (no private CA)

If managing a private CA root on every client device is unacceptable:

  • Skip Phase 1 entirely
  • All ~70 routes go through ACME
  • Wave size still 5/day; total elapsed ~14 days
  • Internal hostnames will appear in public CT logs (https://crt.sh/?Identity=thesteamedcrab.com) β€” anyone can enumerate your service catalog from the cert transparency feed. Mitigate with hostnames that don't betray service identity if this matters.

Prerequisites

  • 1password-cli
  • minijinja
  • yq
  • go-task (alias to task)

Initialization

./init/create-cluster.sh (on master)

./init/initialize-cluster.sh (on laptop)

ssh root@master1 rm /etc/kubernetes/manifests/kube-vip.yaml (on laptop)

Teardown

./init/destroy-cluster.sh (on laptop)

Importing an Immich DB backup into a new CNPG database

  1. Create the new Immich database

kubectl cnpg -n databases psql <db name> -- -c 'CREATE DATABASE immich;'

  2. Ensure that you're working with the primary CNPG instance

kubectl cnpg -n databases status <db name> | grep "Primary instance:"

  3. Run the Immich restore command

gunzip < "immich-db-backup-1735455600013.sql.gz" | sed "s/SELECT pg_catalog.set_config('search_path', '', false);/SELECT pg_catalog.set_config('search_path', 'public, pg_catalog', true);/g" | kubectl cnpg -n databases psql <db name> --
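
An optional sanity check after the restore; the assets table name is an assumption about Immich's schema, so adjust if your version differs:

kubectl cnpg -n databases psql <db name> -- -d immich -c 'SELECT count(*) FROM assets;'   # table name assumed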

Github Webhook

kubectl -n flux-system get receivers.notification.toolkit.fluxcd.io shows the generated token URL, which goes into github.com -> Settings -> Webhooks -> Payload URL.

  • Content Type: application/json
  • Secret: <token from kubectl -n flux-system describe secrets github-webhook-token>
  • SSL: Enable SSL verification
  • Which events would you like to trigger this webhook?: Just the push event.
  • Active: checked
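
A sketch for pulling just the webhook path; the Receiver name github-webhook is an assumption, so match it to your manifest:

kubectl -n flux-system get receiver github-webhook -o jsonpath='{.status.webhookPath}'   # receiver name assumed

Append the returned path to your externally reachable webhook hostname to form the Payload URL.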

Resources: Limits and Requests Philosophy

In short: do set CPU requests, don't set CPU limits, and set the memory limit equal to the memory request.
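
An illustrative resources block following this philosophy (the values are placeholders, not tuned recommendations):

resources:
  requests:
    cpu: 100m        # placeholder value
    memory: 512Mi    # placeholder value
  limits:
    memory: 512Mi    # equal to the request; no CPU limit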

Debugging

  • https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
  • https://dnschecker.org
  • https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
  • https://github.com/nicolaka/netshoot
  • https://www.redhat.com/sysadmin/using-nfsstat-nfsiostat