Introduction
Lovenet Home Operations Repository
Managed by Flux, Renovate and GitHub Actions 🤖
Kubernetes Cluster Information
Infrastructure Information
Overview
This repository holds the configuration for my GitOps homelab Kubernetes cluster, which runs software services for my residence. The setup is complex, with many interdependencies, but the declarative nature of GitOps lets me manage this mesh of code. The services fall into a few primary categories:
- Home Automation (Home Assistant, ESPHome, Node-Red, EMQX, ZWave JS UI, Zigbee2MQTT)
- Home Metering and Monitoring (Weather Station, Power Monitoring, Sensors)
- Home Security (Frigate, Double Take)
- IOT Devices (WLED, Ratgdo)
Core Components
Infrastructure
- CentOS 9 Stream: Kubernetes Node Operating System.
- crun: Container Runtime implemented in C.
- nVIDIA Container Toolkit: Container Runtime for nVIDIA GPUs.
Networking
- cilium: Kubernetes Container Network Interface (CNI).
- cert-manager: Creates SSL certificates for services in my Kubernetes cluster.
- external-dns: Automatically manages DNS records from my cluster in a cloud DNS provider.
- ingress-nginx: Ingress controller to expose HTTP traffic to pods over DNS.
- Cloudflared: Cloudflare tunnel client.
Storage
- Rook-Ceph: Distributed block storage for persistent storage.
- Minio: S3 Compatible Storage Interface.
- Longhorn: Cloud native distributed block storage for Kubernetes.
- NFS: NFS storage.
GitOps
- Flux2: Declarative Cluster GitOps
- actions-runner-controller: Self-hosted GitHub runners.
- sops: Managed secrets for Kubernetes which are committed to Git.
- Renovate: Automated Cluster Management.
⚙️ Configuration
⚙️ Hardware
Hostname | Device | CPU | RAM | OS | Role | Storage | IOT | Network |
---|---|---|---|---|---|---|---|---|
master1 | Intel NUC7PJYH | 4 | 8 GB | CentOS 9 | k8s Master | |||
master2 | VM on beast | 3 | 8 GB | CentOS 9 | k8s Master | |||
master3 | VM on beast | 3 | 8 GB | CentOS 9 | k8s Master | |||
worker1 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe | Z-Stick 7 | iot/sec-vlan |
worker2 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe | | iot/sec-vlan |
worker3 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | Sonoff | iot/sec-vlan |
worker4 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe | Coral USB | iot/sec-vlan |
worker5 | VM on beast | 10 | 24 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | | iot/sec-vlan |
worker6 | VM on beast | 10 | 24 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | skyconnect | iot/sec-vlan |
worker7 | VM on beast | 10 | 24 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | | iot/sec-vlan |
worker8 | VM on beast | 10 | 58 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | nVIDIA P40 | iot/sec-vlan |
Network
Click to see a high level physical network diagram
Name | CIDR | VLAN | Notes |
---|---|---|---|
Management VLAN | TBD | ||
Default | 192.168.0.0/16 | 0 | |
IOT VLAN | 10.10.20.1/24 | 20 | |
Guest VLAN | 10.10.30.1/24 | 30 | |
Security VLAN | 10.10.40.1/24 | 40 | |
Kubernetes Pod Subnet (Cilium) | 10.42.0.0/16 | N/A | |
Kubernetes Services Subnet (Cilium) | 10.43.0.0/16 | N/A | |
Kubernetes LB Range (CiliumLoadBalancerIPPool) | 10.45.0.1/24 | N/A |
☁️ Cloud Dependencies
Service | Use | Cost |
---|---|---|
1Password | Secrets with External Secrets | ~$65 (1 Year) |
Cloudflare | Domain | Free |
GitHub | Hosting this repository and continuous integration/deployments | Free |
Mailgun | Email hosting | Free (Flex Plan) |
Pushover | Kubernetes Alerts and application notifications | $10 (One Time) |
Frigate Plus | Model training services for Frigate NVR | $50 (1 Year) |
Total: ~$9.60/mo |
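The monthly total can be reproduced from the yearly subscriptions in the table. This is a minimal sketch that assumes the one-time Pushover purchase is excluded from the monthly figure; `monthly_total` is a hypothetical helper name.

```shell
# Sum yearly costs (one per argument) and print the monthly equivalent.
monthly_total() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.2f\n", sum / 12 }'
}

# 1Password (~$65/year) and Frigate Plus ($50/year).
monthly_total 65 50   # prints 9.58
```

Rounded up, this matches the ~$9.60/mo shown in the table.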
Noteworthy Documentation
- Cluster Rebuild Actions
- Initialization and Teardown
- GitHub Webhook
- Limits and Requests Philosophy
- Debugging
- Immich restore to new CNPG database
- nVIDIA P40 GPU
Home-Ops Search
@whazor created this website as a creative way to search Helm Releases across GitHub. You may use it to get ideas on how to configure an application's Helm values.
Coral Edge TPU
Coral USB
Info
I didn't need to set up any udev rules or build/load the apex driver when passing this device through to Frigate. I had to do those things for the Mini PCIe Coral, but the way I'm doing it now (look at the Frigate mount point), it doesn't seem necessary.
USB Resets
Whenever the Coral device was attached to the Frigate container, it would trigger the following entry in dmesg, and Node Feature Discovery could no longer identify that the node had the device. This left the Frigate pod in a 'Pending' state until I unplugged the Coral USB and plugged it in again. It was very annoying: if the Frigate container ever terminated, I'd have to unplug and re-plug the USB.
[ +12.269474] usb 2-5: reset SuperSpeed USB device number 22 using xhci_hcd
[ +0.012155] usb 2-5: LPM exit latency is zeroed, disabling LPM.
This hack resolved the USB reset issue: https://github.com/blakeblackshear/frigate/issues/2607#issuecomment-2092965042
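To check whether the resets still occur after applying the fix, the kernel log can be searched for the reset message. A minimal sketch: `count_usb_resets` is a hypothetical helper name, and the sample input is the dmesg entry quoted above.

```shell
# Count USB reset events in kernel log text read from stdin.
# "|| true" keeps the exit status 0 when grep finds no matches.
count_usb_resets() {
  grep -c 'reset SuperSpeed USB device' || true
}

# Sample input: the two dmesg lines from the entry above.
printf '%s\n' \
  '[ +12.269474] usb 2-5: reset SuperSpeed USB device number 22 using xhci_hcd' \
  '[ +0.012155] usb 2-5: LPM exit latency is zeroed, disabling LPM.' \
  | count_usb_resets   # prints 1

# On the node itself: dmesg | count_usb_resets
```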
Coral Mini PCIe
Info
Not currently working. For some reason it doesn't show up in 'lspci' in my Dell R730XD. I wonder if using a more powerful power supply would make a difference.
nVIDIA Tesla P40
The nVIDIA GPU is ignored on the host (Dell R730xd, CentOS 9) and handed to a KVM VM via PCI passthrough. The VM runs CentOS 9 as a K8S node and runs pods through the nVIDIA Container Toolkit.
Host
Ignore PCI device
- Append to GRUB_CMDLINE_LINUX in /etc/default/grub
intel_iommu=on pci-stub.ids=10de:1b38
- grub2-mkconfig -o /boot/grub2/grub.cfg
This step doesn't seem to be working. I have to manually add the pci-stub directive to the kernel cmdline when the server boots.
- reboot
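A possible cause for the grub2-mkconfig step not taking effect, not confirmed on this host: CentOS Stream 9 keeps kernel arguments in BootLoaderSpec entries, which grub2-mkconfig does not rewrite by default. The sketch below shows the grubby alternative and a way to verify the running command line. `apply_passthrough_args` and `cmdline_has_args` are hypothetical helper names.

```shell
# Kernel arguments from the step above.
PASSTHROUGH_ARGS="intel_iommu=on pci-stub.ids=10de:1b38"

# Persist the arguments on every installed kernel (run as root on the host).
# Defined but not called here.
apply_passthrough_args() {
  grubby --update-kernel=ALL --args="$PASSTHROUGH_ARGS"
}

# Return 0 if every expected argument appears in the given command line.
# $1 = kernel command line, e.g. "$(cat /proc/cmdline)"
cmdline_has_args() {
  for arg in $PASSTHROUGH_ARGS; do
    case " $1 " in
      *" $arg "*) ;;
      *) return 1 ;;
    esac
  done
  return 0
}

# After rebooting:
#   cmdline_has_args "$(cat /proc/cmdline)" && echo "passthrough args active"
```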
PCI Passthrough to VM (via virt-manager)
- Add Hardware -> PCI Host Device
In the VM
Blacklist nouveau in VM
- echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
- Comment out the following block in /etc/X11/xorg.conf.d/10-nvidia.conf
#Section "OutputClass"
# Identifier "nvidia"
# MatchDriver "nvidia-drm"
# Driver "nvidia"
# Option "AllowEmptyInitialConfiguration"
# Option "PrimaryGPU" "no"
# Option "SLI" "Auto"
# Option "BaseMosaic" "on"
#EndSection
- reboot
Install nVIDIA Driver
dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -i)/cuda-rhel9.repo
dnf module install nvidia-driver:565-dkms
Install nVIDIA Container Toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
dnf install nvidia-container-toolkit
Configure the runtime
nvidia-ctk runtime configure --runtime=crio
Results
/etc/nvidia-container-runtime/config.toml
Kubernetes
Install nVIDIA Device Plugin
Configuration with NFD / LocalAI / Ollama / etc
LocalAI
- make sure runtime is set correctly
- confirm that localai is running on the nvidia-container-runtime
stable-diffusion
On node install TCMalloc
dnf install -y gperftools gperftools-devel
Other
nVidia HTOP
Improved nvidia-smi command.
Fan Speed
Cluster Rebuild Actions
Before Cluster Rebuild
- Restore CNPG from backup
uncomment section in cluster.yaml
After Cluster Rebuild
- Update KUBECONFIG secret in github/home-ops
Settings -> (left side) Secrets and Variables -> (submenu) Actions
Set the KUBECONFIG secret to the output of cat ~/.kube/config | base64
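GNU base64 wraps its output at 76 columns by default, so the pasted secret may contain line breaks. Whether the workflow tolerates that is not stated here. The sketch below produces a single-line encoding and checks the round trip. `encode_kubeconfig` is a hypothetical helper name, and the sample file is a stand-in, not a real kubeconfig.

```shell
# Encode a file as single-line base64 by stripping any line wrapping.
encode_kubeconfig() {
  base64 < "$1" | tr -d '\n'
}

# Round-trip check against a throwaway sample file.
sample="$(mktemp)"
printf 'apiVersion: v1\nkind: Config\n' > "$sample"
encoded="$(encode_kubeconfig "$sample")"
decoded="$(printf '%s' "$encoded" | base64 --decode)"
[ "$decoded" = "$(cat "$sample")" ] && echo "round trip ok"
rm -f "$sample"

# Real use: encode_kubeconfig ~/.kube/config
```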
Initialization
- On master: ./init/create-cluster.sh
- On laptop: ./init/prepare-cluster.sh
- On laptop: ./init/initialize-cluster.sh
- On laptop: ssh root@master1 rm /etc/kubernetes/manifests/kube-vip.yaml
Teardown
- On laptop: ./init/destroy-cluster.sh
Importing an Immich DB backup into a new CNPG database
- Create the new Immich database
kubectl cnpg -n databases psql postgres-16 -- -c 'CREATE DATABASE immich;'
- Ensure that you're working with the primary CNPG instance
kubectl cnpg -n databases status postgres-16 | grep "Primary instance:"
- Copy the database backup into the container's persistent storage (where there is likely enough room for it)
kubectl -n databases cp ./immich-db-backup.sql.gz postgres-16-2:/var/lib/postgresql/data/
- Exec into the postgres container
kubectl -n databases exec -it postgres-16-2 -- bash
- Run the Immich restore command
gunzip < "immich-db-backup-1735455600013.sql.gz" | sed "s/SELECT pg_catalog.set_config('search_path', '', false);/SELECT pg_catalog.set_config('search_path', 'public, pg_catalog', true);/g" | kubectl cnpg -n databases psql postgres-16 --
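The sed stage in the restore command rewrites the dump's search_path statement before it reaches psql. Its effect can be seen in isolation on a single sample line. `fix_search_path` is a hypothetical wrapper name around the same sed expression.

```shell
# Replace the empty search_path setting emitted by the dump with one
# that includes the public and pg_catalog schemas.
fix_search_path() {
  sed "s/SELECT pg_catalog.set_config('search_path', '', false);/SELECT pg_catalog.set_config('search_path', 'public, pg_catalog', true);/g"
}

printf '%s\n' "SELECT pg_catalog.set_config('search_path', '', false);" | fix_search_path
# prints: SELECT pg_catalog.set_config('search_path', 'public, pg_catalog', true);
```

Lines that do not match the pattern pass through unchanged.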
GitHub Webhook
kubectl -n flux-system get receivers.notification.toolkit.fluxcd.io
The output includes the generated token URL, which goes into:
github.com -> Settings -> Webhooks -> Payload URL
- Content Type: application/json
- Secret: <token from kubectl -n flux-system describe secrets github-webhook-token>
- SSL: Enable SSL verification
- Which events would you like to trigger this webhook?: Just the push event.
- Active:
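The Payload URL is the externally reachable hostname of the receiver followed by the token path. A minimal sketch, assuming the receiver exposes its path under `.status.webhookPath`; the hostname, the receiver name `github-receiver`, and the `payload_url` helper are hypothetical.

```shell
# Hypothetical hostname where the Flux webhook receiver is exposed.
WEBHOOK_HOST="flux-webhook.example.com"

# Join a hostname and a receiver path into a GitHub Payload URL.
# $1 = hostname, $2 = path beginning with "/"
payload_url() {
  printf 'https://%s%s\n' "$1" "$2"
}

# On the cluster, the path could be read with (receiver name is an assumption):
#   kubectl -n flux-system get receiver github-receiver -o jsonpath='{.status.webhookPath}'
payload_url "$WEBHOOK_HOST" "/hook/<token>"
```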
Resources: Limits and Requests Philosophy
In short: set CPU requests but not CPU limits, and set the memory limit equal to the memory request.
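As an illustration of that rule, the helper below prints a container resources stanza with a CPU request, no CPU limit, and identical memory request and limit. `resources_stanza` is a hypothetical helper, and the values are placeholders rather than settings taken from this cluster.

```shell
# Print a "resources" stanza following the rule above.
# $1 = CPU request (e.g. 100m), $2 = memory request and limit (e.g. 256Mi)
resources_stanza() {
  cat <<EOF
resources:
  requests:
    cpu: $1
    memory: $2
  limits:
    memory: $2
EOF
}

resources_stanza 100m 256Mi
```

The output can be pasted under a container spec or a Helm values block that accepts a standard resources section.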
Debugging
- https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
- https://dnschecker.org
- https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
- https://github.com/nicolaka/netshoot
- https://www.redhat.com/sysadmin/using-nfsstat-nfsiostat