Introduction

Warning

These docs contain information that relates to my setup. They may or may not work for you.



Lovenet Home Operations Repository

Managed by Flux, Renovate and GitHub Actions πŸ€–




Overview

This is the configuration for my GitOps homelab Kubernetes cluster. The cluster runs home software services for my residence. It is quite complex, with a lot of interdependencies, but the declarative nature of GitOps lets me manage this mesh of code. The software services fall into a few primary categories:

Core Components

Infrastructure

Networking

  • cilium: Kubernetes Container Network Interface (CNI)
  • cert-manager: Creates SSL certificates for services in my Kubernetes cluster
  • external-dns: Automatically manages DNS records from my cluster in a cloud DNS provider
  • Cloudflared: Cloudflare tunnel client
  • Envoy Gateway: Networking gateways into cluster

Storage

  • Rook-Ceph: Distributed block storage for persistent volumes
  • MinIO: S3-compatible object storage
  • Longhorn: Cloud native distributed block storage for Kubernetes
  • NFS: NFS storage

GitOps


βš™οΈΒ  Hardware

| Hostname | Device | CPU | RAM | OS | Role | Storage | IOT | VLANs (multus) |
|----------|--------|-----|-----|-----|------|---------|-----|----------------|
| master1 | Intel NUC7PJYH | 4 | 8 GB | CentOS 9 | k8s Master | | | |
| master2 | VM on beast | 3 | 8 GB | CentOS 9 | k8s Master | | | |
| master3 | VM on beast | 3 | 8 GB | CentOS 9 | k8s Master | | | |
| worker1 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | ZWA-2 | iot, sec |
| worker2 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | | iot, sec |
| worker3 | ThinkCentre M910x | 8 | 64 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | Sonoff | iot, sec |
| worker4 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | Coral USB | iot, sec |
| worker5 | VM on beast | 10 | 24 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | | iot, sec |
| worker6 | VM on beast | 10 | 24 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | | iot, sec |
| worker7 | VM on beast | 10 | 24 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | | iot, sec |
| worker8 | VM on beast | 10 | 58 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | nVIDIA P40 | iot, sec |

Network

Click to see a high-level physical network diagram

| Name | CIDR | VLAN | Notes |
|------|------|------|-------|
| Management VLAN | TBD | | |
| Default | 192.168.0.0/16 | 0 | |
| IOT VLAN | 10.10.20.1/24 | 20 | |
| Guest VLAN | 10.10.30.1/24 | 30 | |
| Security VLAN | 10.10.40.1/24 | 40 | |
| Kubernetes Pod Subnet (Cilium) | 10.42.0.0/16 | N/A | |
| Kubernetes Services Subnet (Cilium) | 10.43.0.0/16 | N/A | |
| Kubernetes LB Range (CiliumLoadBalancerIPPool) | 10.45.0.1/24 | N/A | |

☁️ Cloud Dependencies

| Service | Use | Cost |
|---------|-----|------|
| 1Password | Secrets with External Secrets | ~$65 (1 Year) |
| Cloudflare | Domain | Free |
| GitHub | Hosting this repository and continuous integration/deployments | Free |
| Mailgun | Email hosting | Free (Flex Plan) |
| Pushover | Kubernetes alerts and application notifications | $10 (One Time) |
| Frigate Plus | Model training services for Frigate NVR | $50 (1 Year) |

Total: ~$9.60/mo

Noteworthy Documentation

Cluster Rebuild ActionsΒ Β  Initialization and TeardownΒ Β  Github WebhookΒ Β  Limits and Requests PhilosophyΒ Β  DebuggingΒ Β  Immich restore to new CNPG databaseΒ Β  nVIDIA P40 GPUΒ Β 

@whazor created this website as a creative way to search Helm Releases across GitHub. You may use it to get ideas on how to configure an application's Helm values.

After a whole-home power outage or all-nodes power cycle

The main problem is that the kube-vip pods are not running, so the VIP, typically 192.168.6.1, is unknown. It just needs to be set so that the kube control plane can come up and the kube-vip pods can be re-instantiated. To do this, simply log in to master1 and run the following command.

ip addr add 192.168.6.1 dev eno1
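
Once the address is up, a quick way to confirm recovery (a sketch; the kube-vip label selector is an assumption and may differ in your manifests):

curl -k https://192.168.6.1:6443/healthz

kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-vip   # label selector assumed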

Coral Edge TPU

Coral USB

Info

I didn't need to add udev rules or build/load the apex driver when passing this device through to Frigate. I had to do those things for the Mini PCIe Coral, but the way I'm doing it now (see the Frigate mount point), it doesn't seem necessary.

USB Resets

Whenever the Coral device was attached to the Frigate container, it would trigger the following entry in dmesg, and Node Feature Discovery could no longer identify that the node had the device. This left the Frigate pod in a 'Pending' state until I unplugged and re-plugged the Coral USB. It was very annoying that if the Frigate container ever terminated, I'd have to unplug and re-plug the USB.

[ +12.269474] usb 2-5: reset SuperSpeed USB device number 22 using xhci_hcd
[  +0.012155] usb 2-5: LPM exit latency is zeroed, disabling LPM.

This hack resolved the USB reset issue: https://github.com/blakeblackshear/frigate/issues/2607#issuecomment-2092965042

Coral Mini PCIe

Info

Not currently working. For some reason it doesn't show up in 'lspci' on my Dell R730XD. I wonder if a more powerful power supply would make a difference.

nVIDIA Tesla P40

Setup: the nVIDIA GPU is ignored on the host (Dell R730xd, CentOS 9) and passed through via PCI to a KVM VM running CentOS 9, which acts as a k8s node running nVIDIA Container Toolkit pods.

Host

Ignore PCI device

  1. Append to GRUB_CMDLINE_LINUX in /etc/default/grub

intel_iommu=on pci-stub.ids=10de:1b38

  2. Update Grub

grubby --add-kernel $(grubby --default-kernel) --copy-default --args=vfio_pci.ids=10de:1b38 --title "Default kernel with vfio_pci" --make-default

  3. Reboot
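
After the reboot, two standard checks confirm the kernel args took effect and the card is kept away from the host driver (device ID from the grub step above):

cat /proc/cmdline | tr ' ' '\n' | grep -E 'iommu|vfio|pci-stub'

lspci -nnk -d 10de:1b38

The second command should report a kernel driver in use of pci-stub or vfio-pci, not nouveau or nvidia.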

PCI Passthrough to VM (via virt-manager)

  1. Add Hardware -> PCI Host Device

In the VM

Blacklist nouveau in VM

  1. echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf

  2. Comment out the following block in /etc/X11/xorg.conf.d/10-nvidia.conf

#Section "OutputClass"
#    Identifier "nvidia"
#    MatchDriver "nvidia-drm"
#    Driver "nvidia"
#    Option "AllowEmptyInitialConfiguration"
#    Option "PrimaryGPU" "no"
#    Option "SLI" "Auto"
#    Option "BaseMosaic" "on"
#EndSection
  3. Reboot

Install nVIDIA Driver

dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -i)/cuda-rhel9.repo

dnf module install nvidia-driver:565-dkms

Install nVIDIA Container Toolkit

curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

dnf install nvidia-container-toolkit

Configure the runtime

nvidia-ctk runtime configure --runtime=crio

Results

/etc/crio/crio.conf

/etc/nvidia-container-runtime/config.toml

Kubernetes

Install nVIDIA Device Plugin

Helm Chart

Configuration with NFD / LocalAI / Ollama / etc

LocalAI

  1. Make sure the runtime is set correctly
  2. Confirm that LocalAI is running on the nvidia-container-runtime (see the check below)
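
A sketch for step 2; the namespace and label are assumptions, so adjust both to your deployment:

kubectl get pod -n ai -l app.kubernetes.io/name=local-ai -o jsonpath='{.items[0].spec.runtimeClassName}'   # namespace/label assumed

Expected output: nvidia.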

stable-diffusion

On the node, install TCMalloc:

dnf install -y gperftools gperftools-devel

Other

nVidia HTOP

Improved nvidia-smi command.

nVidia HTOP

Fan Control Methods

The R730XD doesn't officially support the P40, so it doesn't natively adjust the fans for it. Below are some workarounds that I have not implemented yet.

IPMI Fan Control

Fan Control Script 1

Fan Control Script 2

Cluster Rebuild Actions

Before Cluster Rebuild

  1. Restore CNPG from backup

Uncomment the restore section in cluster.yaml.

After Cluster Rebuild

  1. Update KUBECONFIG secret in github/home-ops

Settings -> (left side) Secrets and Variables -> (submenu) Actions

Edit the KUBECONFIG secret with the output of cat ~/.kube/config | base64.
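
The same update can be scripted with the GitHub CLI (the repo slug is a placeholder):

base64 -w0 ~/.kube/config | gh secret set KUBECONFIG --repo <user>/home-ops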

Cluster Upgrade Runbook

This runbook covers an in-place rolling upgrade of the cluster, planned in May 2026 to bring all nodes to a single k8s 1.34 patch + cri-o 1.34, then to k8s 1.35.

The cluster runs kubeadm with stacked etcd and kube-vip (DaemonSet) for the control-plane VIP at 192.168.6.1.

State at the start of the upgrade

| Aspect | Reality |
|--------|---------|
| Control plane | 3 nodes (master1/2/3), HA via kube-vip DaemonSet |
| Workers | 7 nodes (worker2-8), each runs a Ceph OSD |
| Special hardware | worker8 = NVIDIA GPU; worker4 = Frigate Coral USB + Intel GPU + vlan-security |
| k8s | All nodes on 1.34.x, drift across .2/.6/.7 |
| cri-o | master1 on 1.34.2 (modern); all others on 1.28.4 (legacy el8 build, outside skew) |
| OS | master1 on CentOS Stream 10; all others on CentOS Stream 9 |
| etcd | 3.6.5 across all masters, healthy |
| Storage | Rook/Ceph (ceph-block), Longhorn (per-app named SCs), Garage (S3) |
| GitOps | Flux pulls from home-ops-kubernetes GitRepository |

Phase 0 β€” Pre-flight (βœ… done)

  • βœ… Master1 stale kube-vip static pod removed (/root/kube-vip.yaml.removed-20260502 is the rollback breadcrumb)
  • βœ… etcd snapshot saved off-cluster (~/cluster-backups/etcd-20260502/snapshot-prephase0-20260502.db)
  • βœ… isv_cri-o_stable_v1.34.repo pre-staged on all nodes via dnf β€” master1 already had it from its earlier rebuild
  • βœ… descheduler HelmRelease suspended via the disable-descheduler commit pattern. Resume in Phase 5.
  • βœ… Cilium 1.19.3 confirmed compatible with k8s 1.35 (Rook 1.19.5, CNPG 1.29.0, Istio 1.29.2 also confirmed)
  • βœ… kube-vip DaemonSet already on v1.1.2; Renovate is tracking it
  • βœ… API deprecation grep clean β€” no core k8s alpha/beta apiVersions used outside of vendor CRDs

Per-node procedure (used in Phases 1–3)

The cri-o package swap, kubeadm upgrade, and kubelet bump all happen inside the same drain window per node, so we drain once per node.

Order

Standard "clean RPM" workers first, then the special cases, then masters.

  1. worker7 βœ… migrated 2026-05-02 β€” k8s 1.34.7 + cri-o 1.34.7
  2. worker3 β€” Intel GPU label but no pinned pods; mon-f is pinned here, drain only when worker3 is the active mon target (see "mon nodeSelector trap" below)
  3. worker2
  4. worker4 β€” Frigate node. Pre-suspend frigate, zigbee2mqtt, zwave-js-ui via the disable-<app> GitOps pattern; expect brief recording / automation gap.
  5. worker8 β€” NVIDIA. Pre-suspend ollama, comfyui. Carefully port the NVIDIA runtime stanza to a drop-in. Smoke-test with a runtimeClassName: nvidia pod before un-suspending.
  6. worker6 β€” manual-install kubelet/cri-o, requires the alternate procedure below (no rm crio.conf, use dnf install).
  7. worker5 β€” same alternate procedure as worker6.
  8. master3
  9. master2 (kubelet already 1.34.7; cri-o still 1.28.4)
  10. master1 (last β€” already on 1.34.2 + cri-o 1.34.2; just kubelet patch bump). The VIP risk is bounded by master2/master3 kube-vip DaemonSet pods.

Never drain two masters concurrently. After each master, verify etcd quorum:

kubectl exec -n kube-system etcd-master1.thesteamedcrab.com -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --cluster -w table

Pre-flight before every node (lessons from 2026-05-02)

These checks must happen before draining any node. Skipping any of them caused real damage on the first attempt at this phase.

  1. Drain Longhorn replicas off the target node FIRST, before kubectl drain. Otherwise, when the node's longhorn-manager blips during the cri-o restart, every replica still on that node becomes inaccessible, and pods on OTHER nodes whose replicas live here go into CrashLoopBackOff. That cascades into CNPG PDBs blocking subsequent drains. Two ways to do this:

    Via Longhorn UI (preferred β€” visual confirmation):

    • Open the Longhorn UI (port-forward longhorn-frontend in longhorn-system if you don't have ingress wired up).
    • Node tab β†’ click target node β†’ "Edit Node" β†’ set "Node Scheduling" to Disable AND "Eviction Requested" to True.
    • Watch the Volume tab β€” every volume with a replica on this node should show its replica count restoring on other nodes.
    • When the target node's "Replicas" count reaches 0, proceed.
    • Re-enable scheduling and clear eviction after the node is back and you've uncordoned it, otherwise replicas won't return.

    Via kubectl (equivalent):

    kubectl patch -n longhorn-system nodes.longhorn.io <node> \
      --type=merge -p '{"spec":{"allowScheduling":false,"evictionRequested":true}}'
    
    # Wait until no replicas remain on the node
    until [ "$(kubectl get replicas.longhorn.io -n longhorn-system -o json |
      jq -r --arg n "<node>" '.items[] | select(.spec.nodeID==$n) | .metadata.name' |
      wc -l)" = "0" ]; do sleep 10; done
    
    # … do the drain + upgrade + uncordon …
    
    # After uncordon and node Ready, restore Longhorn scheduling:
    kubectl patch -n longhorn-system nodes.longhorn.io <node> \
      --type=merge -p '{"spec":{"allowScheduling":true,"evictionRequested":false}}'
    
  2. Wait for ceph -s HEALTH_OK before draining the next node. Not just "the previous node's OSD pod is Ready" β€” Rook creates dynamic per-host OSD PDBs (rook-ceph-osd-host-<host>) when an OSD is unavailable, with MAX UNAVAILABLE: 1, ALLOWED DISRUPTIONS: 0. While ANY OSD-host PDB exists for ANY host, the next drain will hang indefinitely on its own host's PDB. This cost ~20 minutes of stuck drain on worker3 because worker6 was still degraded in the background. (A quick check for lingering OSD-host PDBs follows this list.)

  3. Check mon nodeSelectors and don't drain a mon's pinned host while another mon is also down. Rook pins each mon to a specific node and recreates them under new letters as nodes drop in/out:

    kubectl get pod -n rook-ceph -l app=rook-ceph-mon -o jsonpath='{range .items[*]}{.metadata.labels.mon}{"\t"}{.spec.nodeSelector.kubernetes\.io/hostname}{"\n"}{end}'
    

    Re-check before every node β€” the mon names shift (we saw c,e,f β†’ c,e,g β†’ c,e,h over a single afternoon as nodes drained). Draining a node hosting a pinned mon strands that mon β€” it cannot reschedule until the pin is satisfied again. If two mons are pinned to drained nodes, ceph quorum is lost. Drain pinned-mon nodes one at a time and let the mon come back before touching the next.

  4. Check the package install style on the target node. Some nodes have kubelet/kubeadm/kubectl/cri-o installed via dnf (RPM-tracked). Others have manually-installed binaries (no RPM entries):

    ssh root@<node> 'rpm -qa | grep -cE "^(kubeadm|kubelet|kubectl|cri-o)-"'
    

    Returns 4 β†’ standard procedure. Returns 0 β†’ alternate procedure (don't rm crio.conf; use dnf install not dnf upgrade). Nodes known to be manual-install: worker5, worker6.

  5. CNPG primary failover:

    for c in $(kubectl get pod -n databases -l 'cnpg.io/instanceRole=primary' \
        --field-selector spec.nodeName=<node>.thesteamedcrab.com \
        -o jsonpath='{range .items[*]}{.metadata.labels.cnpg\.io/cluster}{"\n"}{end}'); do
      replica=$(kubectl get pod -n databases -l "cnpg.io/cluster=$c,cnpg.io/instanceRole=replica" \
        -o jsonpath='{range .items[?(@.spec.nodeName!="<node>.thesteamedcrab.com")]}{.metadata.name}{"\n"}{end}' | head -1)
      kubectl cnpg promote -n databases "$c" "$replica"
    done
    

    Then poll kubectl get pod -n databases -l 'cnpg.io/instanceRole=primary' --field-selector spec.nodeName=<node>... until empty.

  6. Hardware-pinned pod suspension (worker4: frigate + zigbee2mqtt + zwave-js-ui; worker8: ollama + comfyui). Use the disable-<app> GitOps commit pattern, not kubectl scale β€” Flux will revert imperative scales. Wait for Flux reconciliation to actually take the pods down before draining.
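
Quick check for the per-host OSD PDB trap from item 2 (namespace and PDB name pattern per Rook in this cluster):

kubectl get pdb -n rook-ceph

Any rook-ceph-osd-host-<host> entry with ALLOWED DISRUPTIONS: 0 means the next drain will hang; wait for HEALTH_OK first.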

Standard per-worker procedure (RPM-tracked nodes)

For workers with RPM entries (rpm -qa returns 4 packages):

  1. Pre-flight checks above.
  2. Drain:
    kubectl drain <worker>.thesteamedcrab.com \
      --ignore-daemonsets --delete-emptydir-data
    
  3. Package upgrade and kubelet config refresh:
    ssh root@<worker>.thesteamedcrab.com '
      # Drop the legacy crio.conf / .rpmnew. Order matters: only do
      # this RIGHT BEFORE the upgrade β€” leaving cri-o
      # without a config will trigger the "unsafe procfs detected"
      # runc error and break ALL pods on the node.
      rm -f /etc/crio/crio.conf /etc/crio/crio.conf.rpmnew /etc/crio/crio.conf.working
    
      # Single transaction: cri-o upgrade picks up the new isv repo
      # automatically since 1.34.7 > 1.28.4.
      dnf upgrade -y cri-o kubelet-1.34.7 kubeadm-1.34.7 kubectl-1.34.7
    
      kubeadm upgrade node
      systemctl daemon-reload
      systemctl restart crio
      systemctl restart kubelet
    '
    
  4. Smoke-test before uncordon:
    ssh root@<worker>.thesteamedcrab.com '
      systemctl is-active crio kubelet
      crictl info | grep -E "CgroupManagerName|DefaultRuntime"
      ls /etc/crio/crio.conf.d/   # should exist; legacy crio.conf should be gone
    '
    kubectl get node <worker>.thesteamedcrab.com -o wide   # version + cri-o version match expected
    
  5. Uncordon:
    kubectl uncordon <worker>.thesteamedcrab.com
    
    Rook auto-clears any host noout flag on its own a few seconds after uncordon β€” don't manually unset it.
  6. Wait for ceph -s HEALTH_OK (no OSDs down, no degraded PGs) before the next node. Do not skip this. Typically ~1–3 minutes. A wait-loop sketch follows this list.
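
A minimal wait loop for step 6; the toolbox Deployment name is an assumption, so match it to your Rook install:

# toolbox deployment name assumed; adjust to your Rook install
until kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s | grep -q HEALTH_OK; do
  echo "ceph not healthy yet; sleeping"; sleep 15
done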

Alternate procedure for manual-install nodes (worker5, worker6)

These nodes have kubelet/cri-o binaries in /usr/bin/ not tracked by RPM. dnf upgrade cannot upgrade what it cannot see, and rm crio.conf will break the node since the dnf step provides no replacement.

  1. Pre-flight checks (same as above).
  2. Drain (same as above).
  3. Fresh-install via dnf (overwrites the un-tracked binaries):
    ssh root@<worker>.thesteamedcrab.com '
      # Stop services so we can replace running binaries cleanly
      systemctl stop kubelet
      systemctl stop crio
    
      # Move the manual binaries aside (rollback if dnf install fails)
      mv /usr/bin/kubelet  /usr/bin/kubelet.manual
      mv /usr/bin/kubeadm  /usr/bin/kubeadm.manual
      mv /usr/bin/kubectl  /usr/bin/kubectl.manual
      mv /usr/bin/crio     /usr/bin/crio.manual
    
      # Install the RPMs fresh (now they will be tracked)
      dnf install -y cri-o kubelet-1.34.7 kubeadm-1.34.7 kubectl-1.34.7
    
      # Now we can safely remove crio.conf β€” package provides drop-in
      rm -f /etc/crio/crio.conf /etc/crio/crio.conf.rpmnew /etc/crio/crio.conf.working
    
      kubeadm upgrade node
      systemctl daemon-reload
    
      # Multi-minor cri-o jumps (1.28 β†’ 1.34) leave stale container
      # refs that hang internal_wipe forever. Skip the regular start;
      # do the kill+wipe+start dance up front.
      systemctl kill --signal=SIGKILL crio || true
      crio wipe -f
      systemctl start crio
      systemctl start kubelet
    
      # Verify and clean up rollback files only after success
      rpm -q cri-o kubelet kubeadm kubectl
      systemctl is-active crio kubelet
      # rm /usr/bin/*.manual    # only after smoke-test confirms success
    '
    
  4. Smoke-test, uncordon, wait for HEALTH_OK β€” same as above.

Worker8 (NVIDIA) extra steps

After the standard procedure:

ssh root@worker8.thesteamedcrab.com 'cat > /etc/crio/crio.conf.d/20-nvidia.conf <<EOF
[crio.runtime.runtimes.nvidia]
runtime_path = "/usr/bin/nvidia-container-runtime"
runtime_root = "/run/nvidia"
runtime_type = "oci"
EOF
systemctl restart crio'

Smoke-test before resuming ollama / comfyui:

kubectl run nvidia-smoke --rm -i --restart=Never \
  --overrides='{"spec":{"runtimeClassName":"nvidia","nodeName":"worker8.thesteamedcrab.com"}}' \
  --image=nvidia/cuda:12.0-base-ubuntu22.04 -- nvidia-smi

Failure modes seen on 2026-05-02

| Symptom | Root cause | Fix |
|---------|------------|-----|
| Drain hangs ~indefinitely on rook-ceph-osd-host-* PDB | A previous node's OSD is still degraded; Rook's per-host PDB blocks all OSD evictions cluster-wide | Wait for ceph -s HEALTH_OK, then retry the drain |
| New pods on a node fail with runc create failed: unsafe procfs detected | /etc/crio/crio.conf was removed but the new cri-o package never got installed, so nothing replaced the config | Restore crio.conf from a peer node on the identical version (scp root@<peer>:/etc/crio/crio.conf root@<broken>:/etc/crio/), then systemctl restart crio |
| mon-X stays Pending after drain | Mon is pinned via nodeSelector to the cordoned node | Uncordon the pinned node and the mon comes back. Don't drain another mon's host until quorum is restored |
| dnf upgrade reports kubelet-1.34.7: No match for argument | Node has a manually-installed kubelet (no RPM entry) | Use the alternate procedure (dnf install after moving binaries aside) |
| longhorn-manager-X stuck CrashLoopBackOff with bind: address already in use on port 9502 | Old longhorn-manager process orphaned by a previous container; cri-o lost track of it but the binary is still bound | ssh root@<node> 'pgrep -af "longhorn-manager -d daemon"' β†’ kill -9 <pid>; then kubectl delete pod -n longhorn-system longhorn-manager-X |
| Multiple CNPG replica pods stuck Init:CrashLoopBackOff on a recently-broken node | Their Longhorn volumes failed to attach during the node's outage; pods are now in 5-minute kubelet backoff | Once Longhorn recovers, kubectl delete pod each one to force an immediate retry; the volume attaches succeed |
| Cascading CNPG PDB blocks drains | One unhealthy replica in cluster X means its PDB allows 0 disruptions; subsequent drains anywhere block on cluster X's PDB | Heal the unhealthy replica before draining its peer's host. The Longhorn pre-drain step (above) prevents this |
| cri-o stuck in activating (start) indefinitely after package upgrade; logs flood with Killing container <id> failed: container does not exist | cri-o's internal_wipe = true tries to kill phantom containers left in /var/lib/containers/storage/ by the previous version, but their runc state in /run/crun is gone. Worse across multi-minor jumps (1.28 β†’ 1.34) | systemctl kill --signal=SIGKILL crio β†’ crio wipe -f (clears container refs, keeps images) β†’ systemctl start crio β†’ systemctl start kubelet. Hit on worker6 during the manual-install upgrade; likely also needed on worker5 for the same reason |
| Suspended app stays at replicas: 0 after the suspend HR is reverted | Manually scaling a StatefulSet/Deployment to 0 before drain (the post-suspend step that actually stops pods) sticks; Helm/Flux applying the un-suspended HR doesn't reset the imperative scale | After reverting disable-<app>, kubectl scale -n <ns> statefulset <app> --replicas=1 (or the desired count). Bit us with frigate, ollama, comfyui |

### Master procedure

Same as worker, with two differences:

- **Master1 is special** (already on cri-o 1.34.x). Skip the cri-o swap;
  just do the k8s patch / minor bump. Master1 is also the last in the
  order so the kube-vip VIP can fail over to master2/master3 during
  master1's drain.
- **First control plane in Phase 3 minor bump** uses
  `kubeadm upgrade apply v1.35.x`; the others use
  `kubeadm upgrade node`. The first one *must* be done before any
  others.

## Phase 5 β€” verify and clean up

- All nodes show same kubelet + cri-o version: `kubectl get nodes -o wide` (a one-liner variant follows this list)
- Flux reconciled: `flux get all -A | grep -v True`
- `ceph -s` is `HEALTH_OK`
- Run `tools/etcd-defrag.sh` (etcd grew during the upgrade)
- **Resume descheduler**: `git revert <disable-descheduler sha>` and
  push. Same for any hardware-pinned `disable-<app>` commits made
  along the way.
- Update this runbook with anything new that bit you, before memory
  fades.
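
A compact variant of the first check, printing one line per distinct
version pair (a single line means the fleet is uniform):

    kubectl get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{" "}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}' | sort | uniq -c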

## Rollback

In-place RPM downgrade is messy, especially for kubelet across a minor
boundary. Realistic rollback paths:

- Per-node, *before* `kubeadm upgrade apply` on the first master:
  rollback is just `dnf downgrade kubeadm kubelet kubectl` to 1.34 + put
  the old `crio.conf` back (it's saved as `crio.conf.rpmsave` after the
  package swap).
- Per-node, *after* `kubeadm upgrade apply`: bring forward the rest of
  the cluster. Don't try to roll back the apiserver minor.
- Catastrophic (etcd corruption, control plane unrecoverable): restore
  from `~/cluster-backups/etcd-20260502/snapshot-prephase0-20260502.db`
  via [the kubeadm etcd recovery procedure][1]. **This is a last
  resort and has not been tested in this homelab.** It assumes you
  have at least one master with the original PKI intact and can
  `etcdctl snapshot restore` to a fresh data dir, then restart etcd
  static pods.

[1]: https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/#restoring-an-etcd-cluster

Per-Route Cert Migration Runbook

This runbook covers migrating from one shared wildcard ACME certificate (thesteamedcrab.com covering *.thesteamedcrab.com) to per-listener fine-grained short-term certificates, using cert-manager's Gateway API shim to auto-provision a Certificate per HTTPRoute hostname.

Why: smaller blast radius per renewal failure, per-host visibility, no reflector dependency for new namespaces, faster rotation.

Why this needs to be paced: Let's Encrypt limits issuance to 50 certificates per Registered Domain per rolling 7-day window. The cluster has ~70 HTTPRoutes under thesteamedcrab.com β€” too many to flag-day onto ACME without breaching the limit and locking out all issuance for ~7 days.

State at the start

| Aspect | Reality |
|--------|---------|
| Cert manifest | One Certificate (thesteamedcrab.com) producing thesteamedcrab-com-tls |
| Cert SANs | apex, *.thesteamedcrab.com, *.app.…, *.mcp.… |
| Reflection | Secret reflected to network, istio-system, mcp-system |
| Issuer | letsencrypt-production (DNS01 via Cloudflare) |
| Gateways | 4 β€” external (network), internal (network), istio (istio-system), mcp-gateway (mcp-system) |
| HTTPRoutes per Gateway | external: 11, internal: 50+, mcp-gateway: 9, istio: 2 |
| Existing TTL | LE default 90d, autorenew at ~30d remaining |

Target state β€” split by exposure

The 70+ routes split into two cohorts:

  • Public-facing (~22): external (11), mcp-gateway (9), istio (2). These need browser-trusted certs β†’ ACME via letsencrypt-production.
  • Internal-only (~50): everything on the internal Gateway. These never face the public internet β†’ private CA, issued by an in-cluster ClusterIssuer. Zero rate-limit concern, instant issuance, no public CT log entries leaking internal hostnames.

Why split

A "just ACME everything" plan would burn the entire 50/week budget twice over and leave no headroom for retries or the existing wildcard's renewal. Splitting halves the scope of the rate-limited work and removes the public CT log noise for internal services.

The cost of the split is one-time: distribute the private CA root to every device that visits internal hostnames (browsers, mobile devices, anything that calls internal APIs). That's a manual import on each device.

If that operational cost is unacceptable, see "Alternative: all-ACME" at the bottom of this runbook.

Rate-limit math (ACME side, ~22 certs)

  • Hard ceiling: 50 issuances / 7d / Registered Domain
  • Total ACME target certs: ~22
  • Renewal exemption: renewals (same FQDN set, same account) are exempt from the 50/week ceiling but still subject to a 5/week Duplicate Certificate cap per FQDN set. With duration: 168h / renewBefore: 48h, each cert renews ~1.4Γ—/week β€” well under 5/week.
  • Sustainable initial-issuance rate: 50 Γ· 7 β‰ˆ 7/day. The 7-day window is rolling, not calendar β€” at 10/day you hit the cap on day 5, not day 7.
  • Wave size: 5/day. Buffer of 2/day for retries and the existing wildcard's renewal pressure.
  • Total elapsed: ~5 days for 22 certs.

Critical gotchas (learned the hard way)

Three things bit us during the first execution attempt on 2026-05-02. Read these before writing any YAML.

1. The gateway-shim must be explicitly enabled

cert-manager v1.16+ does not enable the gateway-shim by default, contrary to what some release notes suggest. You need config.enableGatewayAPI: true in the Helm values. Without it, annotating a Gateway with cert-manager.io/cluster-issuer is silently inert β€” no Certificate is ever created. Verify before Phase 0:

kubectl get cm -n cert-manager cert-manager -o jsonpath='{.data.config\.yaml}'
# Expect to see: enableGatewayAPI: true

If absent, set it in the chart values and roll out cert-manager first.

2. The reflector pattern fights the gateway-shim across namespaces

This cluster mirrors network/thesteamedcrab-com-tls to istio-system and mcp-system via reflector. The gateway-shim is per-namespace: it creates a Certificate in the same namespace as the Gateway. So annotating the istio Gateway whose listener references the reflector-mirrored Secret causes:

  1. shim creates a Certificate in istio-system targeting the mirrored Secret's name
  2. cert-manager re-issues from ACME because "Secret was previously issued by a different issuer" (1 ACME order against the budget)
  3. ongoing fight: shim's Certificate vs reflector for ownership of the Secret

Don't annotate any Gateway whose listener still references a reflector-mirrored Secret. Migrate that Gateway's HTTPRoutes to per-app listeners with their own Secret names first, then remove the wildcard listener and the reflector dependency, then add the shim annotation.

For this cluster: the istio Gateway is excluded from Phase 0. It gets annotated only after its (small) HTTPRoute set is migrated and the wildcard listener is removed.

3. HTTPRoute sectionName has no fallback

A HTTPRoute attached to a listener that goes Programmed=False does not fall back to other listeners that match its hostname. The route is just down. The wildcard listener you keep around "as a backstop" only catches HTTPRoutes that explicitly point at it via sectionName.

Implication for the canary: don't pick an app whose downtime is unacceptable. The canary's listener is Programmed=False until ACME issues, which can be ~30s but can be much longer if anything's wrong.

Phase 0 β€” Annotation prep

Verified 2026-05-02 against cert-manager v1.17.1 with config.enableGatewayAPI: true: when the shim sees a listener whose referenced Secret doesn't exist (or isn't owned by an in-namespace Certificate), it creates one. When an in-namespace Certificate already owns the Secret, the shim is a no-op.

For this cluster, that means it's safe to annotate Gateways in the network namespace (which holds the manual thesteamedcrab.com Certificate that owns thesteamedcrab-com-tls), and unsafe to annotate Gateways in istio-system or mcp-system while they still reference the reflector-mirrored Secret.

0.1 Pre-flight test (one-time)

Confirm the shim creates Certificates for new Secret refs:

kubectl apply -f - <<'EOF'
---
apiVersion: v1
kind: Namespace
metadata: { name: shim-test }
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata: { name: ss, namespace: shim-test }
spec: { selfSigned: {} }
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: scratch
  namespace: shim-test
  annotations:
    cert-manager.io/issuer: ss
spec:
  gatewayClassName: envoy
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: bar.shim-test.thesteamedcrab.com
      allowedRoutes: { namespaces: { from: Same } }
      tls:
        certificateRefs: [{ kind: Secret, name: bar-shim-tls }]
EOF
sleep 20
kubectl get certificate -n shim-test
kubectl delete namespace shim-test

Expected: a bar-shim-tls Certificate appears, Ready: True. If nothing appears within 30s, the shim isn't running β€” fix that before proceeding.

0.2 Add the annotations

Annotate only the external and internal Gateways (network namespace). Skip istio and mcp-gateway for now β€” they're in namespaces that depend on the reflector. They'll be annotated in Phase 5 after their HTTPRoutes have moved off the shared Secret.

metadata:
  annotations:
    # ...existing annotations...
    cert-manager.io/cluster-issuer: letsencrypt-production
    cert-manager.io/duration: "168h"      # 7d
    cert-manager.io/renew-before: "48h"

What duration actually does. Let's Encrypt's default profile always issues 90-day certificates regardless of the duration requested by ACME. The cert-manager.io/duration annotation controls only cert-manager's renewal cadence β€” it tells cert-manager "treat this cert as expiring after 168h" so it renews early. You still get 90-day certs, just rotated every ~5 days.

For actually-short LE certs, opt into the tlsserver profile (~6-day validity) by adding acme.cert-manager.io/order-profile-name: tlsserver on the per-listener Certificate or via cert-manager's issuer-level configuration. That changes the LE order profile and the issued cert is genuinely short-lived. Verify with openssl s_client … | openssl x509 -noout -dates after issuance.

For the internal Gateway, swap the issuer to the private CA once Phase 1 is done:

    cert-manager.io/cluster-issuer: cluster-internal-ca

0.3 Verify

# Cert count must be unchanged after reconcile.
kubectl get certificate -A

# If a new Certificate appears in any namespace, the shim hit
# something it shouldn't have. Stop and investigate.

If a Certificate appeared, check whether it's in a namespace whose Gateway you annotated. If yes, you almost certainly hit case (2) above β€” the listener references a reflector-mirrored Secret. Revert the annotation and migrate that Gateway separately later.

Phase 1 β€” Internal Gateway β†’ private CA (parallel track)

This phase is independent of the ACME phases below β€” it has no rate limit concerns and can run in any order or in parallel.

1.1 Set up the private CA

(One-time. Skip if already present.)

# cert-manager: bootstrap a self-signed root + ClusterIssuer
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-bootstrap
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cluster-internal-ca
  namespace: cert-manager
spec:
  isCA: true
  commonName: thesteamedcrab.com Internal CA
  secretName: cluster-internal-ca-tls
  duration: 87600h  # 10y root
  renewBefore: 720h
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned-bootstrap
    kind: ClusterIssuer
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: cluster-internal-ca
spec:
  ca:
    secretName: cluster-internal-ca-tls

1.2 Distribute the root to client devices

This is the non-zero operational cost. Export the CA cert and import it into:

  • Each laptop / desktop browser (or system trust store)
  • Each phone / tablet
  • Any in-cluster client that calls internal hostnames over TLS
kubectl get secret -n cert-manager cluster-internal-ca-tls \
  -o jsonpath='{.data.tls\.crt}' | base64 -d > internal-ca.crt

Don't proceed past 1.2 until you've imported and verified the root on at least your primary device. (Open https://glance.thesteamedcrab.com post-migration and confirm no cert warning.)

1.3 Migrate the internal Gateway

Add the cert-manager.io/cluster-issuer: cluster-internal-ca annotation to kubernetes/apps/network/envoy-gateway/config/internal.yaml. Then, in batches of 10 for reviewability, add a per-app listener and flip the corresponding HTTPRoute's sectionName. Issuance is instant β€” no soak required between batches; just verify each Certificate goes Ready before moving on.

The wildcard listener stays alive until every internal HTTPRoute is migrated, so rollback is the same as the ACME track: revert sectionName and the route falls back to the wildcard.

Phase 2 β€” Public Gateways canary (1 app)

Pick a low-stakes route on the external Gateway (the highest public visibility). Avoid the highest-traffic apps (immich, jellyfin) until after canary; pick something like the github webhook or a seldom-used bookmark.

⚠ No fallback. Once you flip the HTTPRoute's sectionName to the new per-app listener, the route is bound to that listener only. If ACME issuance fails or is slow, the route returns 404 (or connection refused) until the cert lands β€” the wildcard listener does not serve as a backstop. Pick a canary you can afford to have down for ~30s in the happy case and several minutes in the failure case.

2.1 Add a per-app listener

spec:
  listeners:
    # existing http and wildcard https listeners untouched
    - name: https-<app>
      protocol: HTTPS
      port: 443
      hostname: "<app>.${SECRET_DOMAIN}"
      allowedRoutes:
        namespaces:
          from: All
      tls:
        certificateRefs:
          - kind: Secret
            name: <app>-tls

The Gateway already carries the shim annotations from Phase 0; this new listener picks them up automatically. cert-manager will create a <app>-tls Certificate within seconds of the listener appearing.

2.2 Flip the HTTPRoute's sectionName

   parentRefs:
     - name: external
       namespace: network
-      sectionName: https
+      sectionName: https-<app>

2.3 Verify and soak

# Cert issues in seconds for staging, ~30s for production
kubectl get certificate -n network -w

# Confirm the SAN is just the one hostname and duration β‰ˆ 7d
echo | openssl s_client -connect <app>.thesteamedcrab.com:443 \
  -servername <app>.thesteamedcrab.com 2>/dev/null \
  | openssl x509 -noout -subject -ext subjectAltName -dates

Soak 24h. If anything is wrong, revert the HTTPRoute sectionName. The wildcard listener still serves the old wildcard cert; no user impact.

Don't proceed past Phase 2 until this canary has cleanly soaked AND auto-renewed at least once (renewBefore: 48h means renewal kicks in on day 5 β€” wait for that).

Phase 3 β€” Wave migration on public Gateways

Goal: migrate the remaining ~19 public-facing HTTPRoutes on Gateways that are already annotated (external 10, mcp-gateway 9) at 5 per day with 12h soaks. The 2 istio routes are deferred to Phase 5.

Per-wave procedure

For each batch of 5:

  1. Add 5 listeners to the appropriate Gateway. Sort listeners alphabetically inside the listeners: block.
  2. Flip 5 HTTPRoute sectionNames.
  3. Commit + push. One PR per wave. Title: feat(network): per-app TLS migration wave N (X/Y).
  4. Watch issuance:
    kubectl get certificate -A -w
    
    All 5 should reach Ready within ~2 min. Anything stuck β†’ check kubectl describe certificate <name> -n network and the Order / Challenge resources.
  5. Soak ~12h between waves. Watch the issuer for rate-limit warnings:
    kubectl describe clusterissuer letsencrypt-production
    kubectl get challenge -A
    

Public Gateway with the manual Certificate first (lower risk β€” that's where Phase 0 was verified safe):

  1. Wave 1: external (5 of 10; canary already done)
  2. Wave 2: external (the other 5)
  3. Wave 3: mcp-gateway (5 of 9)
  4. Wave 4: mcp-gateway (the other 4)

Total ~4 calendar days at one wave per ~12h. The 2 istio routes are handled in Phase 5 once the reflector dependency is removed.

What to do if you hit a rate-limit error

cert-manager surfaces LE rate-limit responses in the Order resource:

kubectl get order -A | grep -i rate
kubectl describe order -n network <order>

If hit:

  • Stop the next wave immediately.
  • Don't delete failing Certificates β€” that doesn't reset the counter and can cause cert-manager to retry, eating more budget.
  • The window is rolling; just wait. The first issuance from N days ago drops off after 7d.
  • Resume waves once kubectl get certificate -A shows no Pending or Failing.

Phase 4 β€” Cleanup

Run only after every HTTPRoute on every Gateway has been migrated and soaked for 7+ days.

4.0 istio Gateway migration (prerequisite)

The istio Gateway was excluded from Phase 0 / Phase 3 because its listener references the reflector-mirrored Secret. Migrate its HTTPRoutes (~2: kiali and one redirect) before deleting the wildcard Certificate.

For each istio HTTPRoute (<app> = e.g. kiali):

  1. Create a manual Certificate in istio-system (no shim involvement β€” the istio Gateway is still un-annotated):

    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: <app>
      namespace: istio-system
    spec:
      secretName: <app>-tls
      issuerRef:
        name: letsencrypt-production
        kind: ClusterIssuer
      dnsNames:
        - <app>.${SECRET_DOMAIN}
      duration: 168h
      renewBefore: 48h
    
  2. Wait for it to be Ready (1 ACME issuance per cert).

  3. Add a per-app listener to the istio Gateway referencing <app>-tls, then flip the HTTPRoute's sectionName.

  4. Soak briefly. Confirm the HTTPRoute serves via the new listener.

After all istio HTTPRoutes are migrated:

  1. Remove the wildcard listener from the istio Gateway.

  2. Remove istio-system from the wildcard Certificate's reflector...reflection-allowed-namespaces annotation list (in kubernetes/apps/network/envoy-gateway/config/certificate.yaml). This stops new mirroring; existing Secret in istio-system can be deleted manually if desired (no consumers).

  3. Now the istio Gateway has no reflector-mirrored Secret. Add the cert-manager.io/cluster-issuer annotations from Phase 0. The shim is now safe β€” every listener's referenced Secret is owned by an in-namespace manual Certificate. (Optionally, delete the manual Certificates afterward and let the shim recreate them, at the cost of one ACME order per cert.)

4.1 Verify no consumers reference the wildcard Secret

grep -rn 'thesteamedcrab-com-tls\|${SECRET_DOMAIN/./-}-tls' kubernetes/

Anything still referencing the wildcard Secret needs to migrate first. Particular attention: Gateways in istio-system and mcp-system may still be using the reflector-mirrored copy.

4.2 Remove the wildcard listener from each Gateway

Delete the name: https block (with hostname: "*.${SECRET_DOMAIN}") from each of:

  • kubernetes/apps/network/envoy-gateway/config/external.yaml
  • kubernetes/apps/network/envoy-gateway/config/internal.yaml
  • kubernetes/apps/istio-system/gateway/gateway.yaml
  • kubernetes/apps/mcp-system/mcp-gateway/app/gateway.yaml

4.3 Delete the wildcard Certificate manifest

git rm kubernetes/apps/network/envoy-gateway/config/certificate.yaml

Also remove the entry from kubernetes/apps/network/envoy-gateway/config/kustomization.yaml.

The reflector annotations on the (now-deleted) Certificate's secretTemplate go away automatically; reflector stops mirroring; the reflected Secret copies in istio-system and mcp-system are garbage-collected.

4.4 Final verify

# No wildcard cert in any namespace
kubectl get certificate -A | grep -i wildcard

# Per-app certs all present and Ready
kubectl get certificate -A

# All public certs are short-duration; private CA certs may be longer
kubectl get certificate -A -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) duration=\(.spec.duration // "unset")"'

# Smoke test from outside (public) and from a CA-trusting device (internal)
for host in glance.thesteamedcrab.com photos.thesteamedcrab.com ... ; do
  echo "$host:"
  echo | openssl s_client -connect "$host:443" -servername "$host" 2>/dev/null \
    | openssl x509 -noout -subject -dates
done

Rollback

The wildcard cert + listeners stay alive through Phase 1, 2, and 3. Rollback during those phases is just reverting the HTTPRoute's sectionName. Even after deleting some per-app listeners, the wildcard listener catches the route.

Phase 4 is irreversible by git revert alone β€” once the wildcard Certificate is deleted, re-creating it kicks off a new ACME order (counts against rate limit). If you must roll back from Phase 4:

  1. Restore the Certificate manifest via git revert
  2. cert-manager re-issues β€” costs 1 against the 50/week budget
  3. Restore the wildcard listeners
  4. Flip HTTPRoutes back to sectionName: https

So don't enter Phase 4 unless you're confident in the new state.

What to watch

  • kubectl get certificate -A β€” column READY should always be True
  • kubectl get order -A β€” should be empty in steady state (orders exist transiently during issuance/renewal)
  • kubectl describe clusterissuer letsencrypt-production β€” surfaces ACME backoff messages if rate-limited
  • ACME audit log at https://crt.sh/?Identity=thesteamedcrab.com β€” external counter you don't control, useful sanity check

When to stop and ask

  • Any wave produces >0 failed Certificates after 5 min
  • Total Pending+Failing Certificates count >3 across the cluster
  • Any rate-limit error in Order events
  • More than one wave in the rolling 7d window has produced retries
  • Phase 0.1 test result was ambiguous

Alternative: all-ACME (no private CA)

If managing a private CA root on every client device is unacceptable:

  • Skip Phase 1 entirely
  • All ~70 routes go through ACME
  • Wave size still 5/day; total elapsed ~14 days
  • Internal hostnames will appear in public CT logs (https://crt.sh/?Identity=thesteamedcrab.com) β€” anyone can enumerate your service catalog from the cert transparency feed. Mitigate with hostnames that don't betray service identity if this matters.

Prerequisites

  • 1password-cli
  • minijinja
  • yq
  • go-task (alias to task)

Initialization

./init/create-cluster.sh (on master)

./init/initialize-cluster.sh (on laptop)

ssh root@master1 rm /etc/kubernetes/manifests/kube-vip.yaml (on laptop)

Teardown

./init/destroy-cluster.sh (on laptop)

Importing an Immich DB backup into a new CNPG database

  1. Create the new Immich database

kubectl cnpg -n databases psql <db name> -- -c 'CREATE DATABASE immich;'

  2. Ensure that you're working with the primary CNPG instance

kubectl cnpg -n databases status <db name> | grep "Primary instance:"

  3. Run the Immich restore command

gunzip < "immich-db-backup-1735455600013.sql.gz" | sed "s/SELECT pg_catalog.set_config('search_path', '', false);/SELECT pg_catalog.set_config('search_path', 'public, pg_catalog', true);/g" | kubectl cnpg -n databases psql <db name> --
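
An optional sanity check after the restore; the assets table name is an assumption about Immich's schema, so adjust if your version differs:

kubectl cnpg -n databases psql <db name> -- -d immich -c 'SELECT count(*) FROM assets;'   # table name assumed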

Github Webhook

kubectl -n flux-system get receivers.notification.toolkit.fluxcd.io shows the generated token URL, which goes into github.com -> Settings -> Webhooks -> Payload URL.

  • Content Type: application/json
  • Secret: <token from kubectl -n flux-system describe secrets github-webhook-token>
  • SSL: Enable SSL verification
  • Which events would you like to trigger this webhook?: Just the push event.
  • Active: checked
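
A sketch for pulling just the webhook path; the Receiver name github-webhook is an assumption, so match it to your manifest:

kubectl -n flux-system get receiver github-webhook -o jsonpath='{.status.webhookPath}'   # receiver name assumed

Append the returned path to your externally reachable webhook hostname to form the Payload URL.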

Resources: Limits and Requests Philosophy

In short: do set CPU requests, don't set CPU limits, and set the memory limit equal to the memory request.
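
An illustrative resources block following this philosophy (the values are placeholders, not tuned recommendations):

resources:
  requests:
    cpu: 100m        # placeholder value
    memory: 512Mi    # placeholder value
  limits:
    memory: 512Mi    # equal to the request; no CPU limit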

Debugging

  • https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
  • https://dnschecker.org
  • https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
  • https://github.com/nicolaka/netshoot
  • https://www.redhat.com/sysadmin/using-nfsstat-nfsiostat