Introduction
Lovenet Home Operations Repository
Managed by Flux, Renovate, and GitHub Actions
Kubernetes Cluster Information
Infrastructure Information
Overview
This is the configuration for my GitOps homelab Kubernetes cluster. This cluster runs home software services for my residence. It is quite complex and there are a lot of interdependencies but the declarative nature of GitOps allows me to manage this mesh of code. The software services fall into a few primary categories:
- Home Automation (Home Assistant, ESPHome, Node-Red, EMQX, ZWave JS UI, Zigbee2MQTT)
- Home Metering and Monitoring (Weather Station, Droplet, Power Monitoring, Sensors)
- Home Security (Frigate)
- IoT Devices (WLED, Ratgdo)
Core Components
Infrastructure
- CentOS 9 Stream: Kubernetes Node Operating System
- crun: Container Runtime implemented in C
- nVIDIA Container Toolkit: GPU support for container runtimes on nVIDIA hardware
Networking
- cilium: Kubernetes Container Network Interface (CNI)
- cert-manager: Creates SSL certificates for services in my Kubernetes cluster
- external-dns: Automatically manages DNS records from my cluster in a cloud DNS provider
- Cloudflared: Cloudflare tunnel client
- Envoy Gateway: Networking gateways into cluster
Storage
- Rook-Ceph: Distributed block storage for persistent storage
- Minio: S3 Compatible Storage Interface
- Longhorn: Cloud native distributed block storage for Kubernetes
- NFS: NFS storage
GitOps
- Flux2: Declarative Cluster GitOps
- actions-runner-controller: Self-hosted Github runners
- Renovate: Automated Cluster Management
Hardware
| Hostname | Device | CPU | RAM | OS | Role | Storage | IoT | VLANs (multus) |
|---|---|---|---|---|---|---|---|---|
| master1 | Intel NUC7PJYH | 4 | 8 GB | CentOS 9 | k8s Master | |||
| master2 | VM on beast | 3 | 8 GB | CentOS 9 | k8s Master | |||
| master3 | VM on beast | 3 | 8 GB | CentOS 9 | k8s Master | |||
| worker1 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | ZWA-2 | iot, sec |
| worker2 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | iot, sec | |
| worker3 | ThinkCentre M910x | 8 | 64 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | Sonoff | iot, sec |
| worker4 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | Coral USB | iot, sec |
| worker5 | VM on beast | 10 | 24 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | iot, sec | |
| worker6 | VM on beast | 10 | 24 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | iot, sec | |
| worker7 | VM on beast | 10 | 24 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | iot, sec | |
| worker8 | VM on beast | 10 | 58 GB | CentOS 9 | k8s Worker | longhorn NVMe, ceph osd | nVIDIA P40 | iot, sec |
Network
| Name | CIDR | VLAN | Notes |
|---|---|---|---|
| Management VLAN | TBD | | |
| Default | 192.168.0.0/16 | 0 | |
| IoT VLAN | 10.10.20.0/24 | 20 | |
| Guest VLAN | 10.10.30.0/24 | 30 | |
| Security VLAN | 10.10.40.0/24 | 40 | |
| Kubernetes Pod Subnet (Cilium) | 10.42.0.0/16 | N/A | |
| Kubernetes Services Subnet (Cilium) | 10.43.0.0/16 | N/A | |
| Kubernetes LB Range (CiliumLoadBalancerIPPool) | 10.45.0.0/24 | N/A |
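A subnet plan like the one above is easy to sanity-check for accidental overlaps. A minimal sketch in pure shell/awk (IPv4 only; `cidr_overlap` is a hypothetical helper, and it assumes the CIDRs are network-aligned, i.e. host bits are already zero):

```shell
# cidr_overlap A B: exit 0 if the two IPv4 CIDRs overlap, 1 otherwise.
cidr_overlap() {
  awk -v a="$1" -v b="$2" '
    function ip2int(s,  p) { split(s, p, "."); return ((p[1] * 256 + p[2]) * 256 + p[3]) * 256 + p[4] }
    BEGIN {
      split(a, x, "/"); alo = ip2int(x[1]); ahi = alo + 2 ^ (32 - x[2]) - 1
      split(b, y, "/"); blo = ip2int(y[1]); bhi = blo + 2 ^ (32 - y[2]) - 1
      # Two ranges intersect iff each one starts before the other ends.
      exit !(alo <= bhi && blo <= ahi)
    }'
}

cidr_overlap 10.42.0.0/16 10.43.0.0/16 && echo overlap || echo ok   # pod vs service CIDR: ok
cidr_overlap 10.42.0.0/16 10.42.1.0/24 && echo overlap || echo ok   # contained range: overlap
```

Running it pairwise over the table is enough to catch a LB pool accidentally carved out of the pod CIDR.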
Cloud Dependencies
| Service | Use | Cost |
|---|---|---|
| 1Password | Secrets with External Secrets | ~$65 (1 Year) |
| Cloudflare | Domain | Free |
| GitHub | Hosting this repository and continuous integration/deployments | Free |
| Mailgun | Email hosting | Free (Flex Plan) |
| Pushover | Kubernetes Alerts and application notifications | $10 (One Time) |
| Frigate Plus | Model training services for Frigate NVR | $50 (1 Year) |
| Total | | ~$9.60/mo |
Noteworthy Documentation
- Cluster Rebuild Actions
- Initialization and Teardown
- Github Webhook
- Limits and Requests Philosophy
- Debugging
- Immich restore to new CNPG database
- nVIDIA P40 GPU
Home-Ops Search
@whazor created this website as a creative way to search Helm Releases across GitHub. You may use it as a means to get ideas on how to configure an application's Helm values.
After whole home power outage or all nodes power cycle
The main problem is that the kube-vip pods are not running, so the VIP (192.168.6.1) is unassigned. It just needs to be set so that the Kubernetes control plane can get up and running and the kube-vip pods can be re-instantiated. To do this, log in to master1 and run the following command.
ip addr add 192.168.6.1 dev eno1
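Re-running that command when the address is already assigned errors out, so it can be guarded. A minimal sketch (`has_vip` is an illustrative helper; eno1 and the VIP come from the text above):

```shell
# has_vip: succeeds if the `ip addr` output on stdin already lists the VIP.
has_vip() { grep -q 'inet 192\.168\.6\.1' ; }

# On master1 (requires root):
#   ip -4 addr show dev eno1 | has_vip || ip addr add 192.168.6.1 dev eno1
```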
Coral Edge TPU
Coral USB
Info
I didn't have to add any udev rules or build/load the apex driver when passing this device through to Frigate. I had to do those things for the Mini PCIe Coral, but with the way I'm doing it now (look at the Frigate mount point), it doesn't seem necessary.
USB Resets
Whenever the Coral device was attached to the Frigate container, it would trigger the following entry in dmesg, and Node Feature Discovery could no longer identify that the node had the device. This left the Frigate pod stuck in a 'Pending' state until I unplugged and re-plugged the Coral USB. Very annoying: any time the Frigate container terminated, the USB had to be physically re-seated.
[ +12.269474] usb 2-5: reset SuperSpeed USB device number 22 using xhci_hcd
[ +0.012155] usb 2-5: LPM exit latency is zeroed, disabling LPM.
This hack resolved the USB reset issue: https://github.com/blakeblackshear/frigate/issues/2607#issuecomment-2092965042
Coral Mini PCIe
Info
Not currently working. For some reason it doesn't show up in `lspci` on my Dell R730XD. I wonder if a more powerful power supply would make a difference.
nVIDIA Tesla P40
The nVIDIA GPU is ignored on the host (Dell R730xd, CentOS 9) and passed through via PCI to a KVM VM running CentOS 9 as a k8s node, which runs pods via the nVIDIA Container Toolkit.
Host
Ignore PCI device
- Append to GRUB_CMDLINE_LINUX in /etc/default/grub:
intel_iommu=on pci-stub.ids=10de:1b38
- Update Grub
grubby --add-kernel $(grubby --default-kernel) --copy-default --args=vfio_pci.ids=10de:1b38 --title "Default kernel with vfio_pci" --make-default
- reboot
PCI Passthrough to VM (via virt-manager)
- Add Hardware -> PCI Host Device
In the VM
Blacklist nouveau in VM
- Blacklist nouveau:

echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf

- Comment out the following block in /etc/X11/xorg.conf.d/10-nvidia.conf:
#Section "OutputClass"
# Identifier "nvidia"
# MatchDriver "nvidia-drm"
# Driver "nvidia"
# Option "AllowEmptyInitialConfiguration"
# Option "PrimaryGPU" "no"
# Option "SLI" "Auto"
# Option "BaseMosaic" "on"
#EndSection
- reboot
Install nVIDIA Driver
dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -i)/cuda-rhel9.repo
dnf module install nvidia-driver:565-dkms
Install nVIDIA Container Toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
dnf install nvidia-container-toolkit
Configure the runtime
nvidia-ctk runtime configure --runtime=crio
Results
/etc/nvidia-container-runtime/config.toml
Kubernetes
Install nVIDIA Device Plugin
Configuration with NFD / LocalAI / Ollama / etc
LocalAI
- make sure runtime is set correctly
- confirm that LocalAI is running on the nvidia-container-runtime
stable-diffusion
On the node, install TCMalloc:
dnf install -y gperftools gperftools-devel
Other
nVidia HTOP
Improved nvidia-smi command.
Fan Control Methods
The R730XD doesn't officially support the P40 so it doesn't natively adjust the fan. Below are some workarounds that I have not implemented yet.
Cluster Rebuild Actions
Before Cluster Rebuild
- Restore CNPG from backup
uncomment section in cluster.yaml
After Cluster Rebuild
- Update KUBECONFIG secret in github/home-ops
Settings -> (left side) Secrets and Variables -> (submenu) Actions
Edit the KUBECONFIG secret with the output of `cat ~/.kube/config | base64`
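One wrinkle worth knowing here: GNU base64 wraps output at 76 columns by default, and a multi-line value pasted into a GitHub secret can break consumers that expect a single line. A minimal round-trip sketch (the scratch path /tmp/demo-kubeconfig is illustrative; substitute ~/.kube/config for the real secret):

```shell
# Demo with a scratch file; substitute ~/.kube/config for the real secret.
printf 'apiVersion: v1\nkind: Config\n' > /tmp/demo-kubeconfig

# -w0 disables GNU base64's 76-column wrapping so the secret is one line
# (macOS base64 doesn't wrap and has no -w flag).
base64 -w0 < /tmp/demo-kubeconfig > /tmp/demo-kubeconfig.b64

# Verify the round-trip before pasting the value into the GitHub secret:
base64 -d < /tmp/demo-kubeconfig.b64 | diff - /tmp/demo-kubeconfig && echo round-trip-ok
```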
Cluster Upgrade Runbook
This runbook covers an in-place rolling upgrade of the cluster, planned in May 2026 to bring all nodes to a single k8s 1.34 patch + cri-o 1.34, then to k8s 1.35.
The cluster runs kubeadm with stacked etcd and kube-vip (DaemonSet) for
the control-plane VIP at 192.168.6.1.
State at the start of the upgrade
| Aspect | Reality |
|---|---|
| Control plane | 3 nodes (master1/2/3), HA via kube-vip DaemonSet |
| Workers | 7 nodes (worker2-8), each runs a Ceph OSD |
| Special hardware | worker8 = NVIDIA GPU; worker4 = Frigate Coral USB + Intel GPU + vlan-security |
| k8s | All nodes on 1.34.x, drift across .2/.6/.7 |
| cri-o | master1 on 1.34.2 (modern); all others on 1.28.4 (legacy el8 build, outside skew) |
| OS | master1 on CentOS Stream 10; all others on CentOS Stream 9 |
| etcd | 3.6.5 across all masters, healthy |
| Storage | Rook/Ceph (ceph-block), Longhorn (per-app named SCs), Garage (S3) |
| GitOps | Flux pulls from home-ops-kubernetes GitRepository |
Phase 0: Pre-flight (✅ done)
- ✅ Master1 stale kube-vip static pod removed (`/root/kube-vip.yaml.removed-20260502` is the rollback breadcrumb)
- ✅ etcd snapshot saved off-cluster (`~/cluster-backups/etcd-20260502/snapshot-prephase0-20260502.db`)
- ✅ `isv_cri-o_stable_v1.34.repo` pre-staged on all nodes via dnf; master1 already had it from its earlier rebuild
- ✅ `descheduler` HelmRelease suspended via the `disable-descheduler` commit pattern. Resume in Phase 5.
- ✅ Cilium 1.19.3 confirmed compatible with k8s 1.35 (Rook 1.19.5, CNPG 1.29.0, Istio 1.29.2 also confirmed)
- ✅ kube-vip DaemonSet already on v1.1.2; Renovate is tracking it
- ✅ API deprecation grep clean: no core k8s alpha/beta apiVersions used outside of vendor CRDs
Per-node procedure (used in Phases 1-3)
The cri-o package swap, kubeadm upgrade, and kubelet bump all happen inside the same drain window per node, so we drain once per node.
Order
Standard "clean RPM" workers first, then the special cases, then masters.
1. `worker7`: migrated 2026-05-02 to k8s 1.34.7 + cri-o 1.34.7
2. `worker3`: Intel GPU label but no pinned pods; mon-f is pinned here, drain only when worker3 is the active mon target (see "mon nodeSelector trap" below)
3. `worker2`
4. `worker4`: Frigate node. Pre-suspend frigate, zigbee2mqtt, zwave-js-ui via the `disable-<app>` GitOps pattern; expect a brief recording / automation gap.
5. `worker8`: NVIDIA. Pre-suspend ollama, comfyui. Carefully port the NVIDIA runtime stanza to a drop-in. Smoke-test with a `runtimeClassName: nvidia` pod before un-suspending.
6. `worker6`: manual-install kubelet/cri-o, requires the alternate procedure below (no `rm crio.conf`; use `dnf install`).
7. `worker5`: same alternate procedure as worker6.
8. `master3`
9. `master2` (kubelet already 1.34.7; cri-o still 1.28.4)
10. `master1` last: already on 1.34.2 + cri-o 1.34.2, so just a kubelet patch bump. The VIP risk is bounded by the master2/master3 kube-vip DaemonSet pods.
Never drain two masters concurrently. After each master, verify etcd quorum:
kubectl exec -n kube-system etcd-master1.thesteamedcrab.com -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status --cluster -w table
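The table above can be eyeballed, but a scriptable gate is handy mid-runbook. A sketch that parses the `-w table` output and asserts 3 members with exactly one leader (`quorum_ok` is a hypothetical helper, and the column position of IS LEADER is an assumption based on etcd 3.6's table writer; adjust if the layout differs):

```shell
# quorum_ok: reads `etcdctl endpoint status --cluster -w table` output on
# stdin; exits 0 only if exactly 3 members are listed and exactly 1 is leader.
quorum_ok() {
  awk -F'|' '
    NF > 6 {                         # member rows and the header; border rows have no pipes
      col = $6; gsub(/ /, "", col)   # IS LEADER column, whitespace stripped
      if (col == "true" || col == "false") members++
      if (col == "true") leaders++
    }
    END { exit !(members == 3 && leaders == 1) }
  '
}

# Gate a runbook step on quorum:
#   kubectl exec -n kube-system etcd-master1.thesteamedcrab.com -- etcdctl ... -w table \
#     | quorum_ok && echo quorum-ok || echo "STOP: quorum degraded"
```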
Pre-flight before every node (lessons from 2026-05-02)
These checks must happen before draining any node. Skipping any of them caused real damage on the first attempt at this phase.
- Drain Longhorn replicas off the target node FIRST, before `kubectl drain`. Otherwise: when the node's longhorn-manager blips during the cri-o restart, every replica still on that node becomes inaccessible, and pods on OTHER nodes whose replicas are here go into CrashLoopBackOff. That cascades into CNPG PDBs blocking subsequent drains. Two ways to do this:

  Via the Longhorn UI (preferred; visual confirmation):

  1. Open the Longhorn UI (port-forward `longhorn-frontend` in `longhorn-system` if you don't have ingress wired up).
  2. Node tab, click the target node, "Edit Node": set "Node Scheduling" to Disable AND "Eviction Requested" to True.
  3. Watch the Volume tab: every volume with a replica on this node should show its replica count restoring on other nodes.
  4. When the target node's "Replicas" count reaches 0, proceed.
  5. Re-enable scheduling and clear eviction after the node is back and you've uncordoned it, otherwise replicas won't return.

  Via kubectl (equivalent):

  ```shell
  kubectl patch -n longhorn-system nodes.longhorn.io <node> \
    --type=merge -p '{"spec":{"allowScheduling":false,"evictionRequested":true}}'
  # Wait until no replicas remain on the node
  until [ "$(kubectl get replicas.longhorn.io -n longhorn-system -o json \
      | jq -r --arg n "<node>" '.items[] | select(.spec.nodeID==$n) | .metadata.name' \
      | wc -l)" = "0" ]; do sleep 10; done
  # … do the drain + upgrade + uncordon …
  # After uncordon and node Ready, restore Longhorn scheduling:
  kubectl patch -n longhorn-system nodes.longhorn.io <node> \
    --type=merge -p '{"spec":{"allowScheduling":true,"evictionRequested":false}}'
  ```
- Wait for `ceph -s` HEALTH_OK before draining the next node. Not just "the previous node's OSD pod is Ready": Rook creates dynamic per-host OSD PDBs (`rook-ceph-osd-host-<host>`) when an OSD is unavailable, with `MAX UNAVAILABLE: 1, ALLOWED DISRUPTIONS: 0`. While ANY OSD-host PDB exists for ANY host, the next drain will hang indefinitely on its own host's PDB. This cost ~20 minutes of stuck drain on worker3 because worker6 was still degraded in the background.
Check mon nodeSelectors and don't drain a mon's pinned host while another mon is also down. Rook pins each mon to a specific node and recreates them under new letters as nodes drop in/out:
kubectl get pod -n rook-ceph -l app=rook-ceph-mon -o jsonpath='{range .items[*]}{.metadata.labels.mon}{"\t"}{.spec.nodeSelector.kubernetes\.io/hostname}{"\n"}{end}'Re-check before every node β the mon names shift (we saw c,e,f β c,e,g β c,e,h over a single afternoon as nodes drained). Draining a node hosting a pinned mon strands that mon β it cannot reschedule until the pin is satisfied again. If two mons are pinned to drained nodes, ceph quorum is lost. Drain pinned-mon nodes one at a time and let the mon come back before touching the next.
- Check the package install style on the target node. Some nodes have kubelet/kubeadm/kubectl/cri-o installed via dnf (RPM-tracked). Others have manually-installed binaries (no RPM entries):

  ```shell
  ssh root@<node> 'rpm -qa | grep -cE "^(kubeadm|kubelet|kubectl|cri-o)-"'
  ```

  Returns 4: standard procedure. Returns 0: alternate procedure (don't `rm crio.conf`; use `dnf install`, not `dnf upgrade`). Nodes known to be manual-install: worker5, worker6.
CNPG primary failover:
for c in $(kubectl get pod -n databases -l 'cnpg.io/instanceRole=primary' \ --field-selector spec.nodeName=<node>.thesteamedcrab.com \ -o jsonpath='{range .items[*]}{.metadata.labels.cnpg\.io/cluster}{"\n"}{end}'); do replica=$(kubectl get pod -n databases -l "cnpg.io/cluster=$c,cnpg.io/instanceRole=replica" \ -o jsonpath='{range .items[?(@.spec.nodeName!="<node>.thesteamedcrab.com")]}{.metadata.name}{"\n"}{end}' | head -1) kubectl cnpg promote -n databases "$c" "$replica" doneThen poll
kubectl get pod -n databases -l 'cnpg.io/instanceRole=primary' --field-selector spec.nodeName=<node>...until empty. -
Hardware-pinned pod suspension (worker4: frigate + zigbee2mqtt + zwave-js-ui; worker8: ollama + comfyui). Use the
disable-<app>GitOps commit pattern, notkubectl scaleβ Flux will revert imperative scales. Wait for Flux reconciliation to actually take the pods down before draining.
Standard per-worker procedure (RPM-tracked nodes)
For workers with RPM entries (`rpm -qa` returns 4 packages):
- Pre-flight checks above.
- Drain:

  ```shell
  kubectl drain <worker>.thesteamedcrab.com \
    --ignore-daemonsets --delete-emptydir-data
  ```

- Package upgrade and kubelet config refresh:

  ```shell
  ssh root@<worker>.thesteamedcrab.com '
    # Drop the legacy crio.conf / .rpmnew. Order matters: only do this
    # RIGHT BEFORE the upgrade; leaving cri-o without a config will
    # trigger the "unsafe procfs detected" runc error and break ALL
    # pods on the node.
    rm -f /etc/crio/crio.conf /etc/crio/crio.conf.rpmnew /etc/crio/crio.conf.working
    # Single transaction: the cri-o upgrade picks up the new isv repo
    # automatically since 1.34.7 > 1.28.4.
    dnf upgrade -y cri-o kubelet-1.34.7 kubeadm-1.34.7 kubectl-1.34.7
    kubeadm upgrade node
    systemctl daemon-reload
    systemctl restart crio
    systemctl restart kubelet
  '
  ```

- Smoke-test before uncordon:

  ```shell
  ssh root@<worker>.thesteamedcrab.com '
    systemctl is-active crio kubelet
    crictl info | grep -E "CgroupManagerName|DefaultRuntime"
    ls /etc/crio/crio.conf.d/   # should exist; legacy crio.conf should be gone
  '
  kubectl get node <worker>.thesteamedcrab.com -o wide   # version + cri-o version match expected
  ```

- Uncordon:

  ```shell
  kubectl uncordon <worker>.thesteamedcrab.com
  ```

  Rook auto-clears any host noout flag on its own a few seconds after uncordon; don't manually unset it.
- Wait for `ceph -s` HEALTH_OK (no OSDs down, no degraded PGs) before the next node. Do not skip this. Typically ~1-3 minutes.
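That last wait can be scripted as a blocking gate so the next drain can't start early. A sketch (`ceph_ok` is an illustrative helper written against piped `ceph -s` text so it can be tested offline; the `rook-ceph-tools` toolbox deployment name is an assumption about this cluster):

```shell
# ceph_ok: succeeds when `ceph -s` output on stdin reports HEALTH_OK.
ceph_ok() { grep -q 'health: HEALTH_OK' ; }

# Blocking gate between nodes (assumes the Rook toolbox deployment):
#   until kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s | ceph_ok; do
#     echo "waiting for HEALTH_OK..."; sleep 30
#   done
```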
Alternate procedure for manual-install nodes (worker5, worker6)
These nodes have kubelet/cri-o binaries in /usr/bin/ not tracked by
RPM. dnf upgrade cannot upgrade what it cannot see, and rm crio.conf will break the node since the dnf step provides no
replacement.
- Pre-flight checks (same as above).
- Drain (same as above).
- Fresh-install via dnf (overwrites the un-tracked binaries):

  ```shell
  ssh root@<worker>.thesteamedcrab.com '
    # Stop services so we can replace running binaries cleanly
    systemctl stop kubelet
    systemctl stop crio
    # Move the manual binaries aside (rollback if dnf install fails)
    mv /usr/bin/kubelet /usr/bin/kubelet.manual
    mv /usr/bin/kubeadm /usr/bin/kubeadm.manual
    mv /usr/bin/kubectl /usr/bin/kubectl.manual
    mv /usr/bin/crio /usr/bin/crio.manual
    # Install the RPMs fresh (now they will be tracked)
    dnf install -y cri-o kubelet-1.34.7 kubeadm-1.34.7 kubectl-1.34.7
    # Now we can safely remove crio.conf; the package provides drop-ins
    rm -f /etc/crio/crio.conf /etc/crio/crio.conf.rpmnew /etc/crio/crio.conf.working
    kubeadm upgrade node
    systemctl daemon-reload
    # Multi-minor cri-o jumps (1.28 -> 1.34) leave stale container
    # refs that hang internal_wipe forever. Skip the regular start;
    # do the kill+wipe+start dance up front.
    systemctl kill --signal=SIGKILL crio || true
    crio wipe -f
    systemctl start crio
    systemctl start kubelet
    # Verify and clean up rollback files only after success
    rpm -q cri-o kubelet kubeadm kubectl
    systemctl is-active crio kubelet
    # rm /usr/bin/*.manual   # only after smoke-test confirms success
  '
  ```

- Smoke-test, uncordon, wait for HEALTH_OK: same as above.
Worker8 (NVIDIA) extra steps
After the standard procedure:
ssh root@worker8.thesteamedcrab.com 'cat > /etc/crio/crio.conf.d/20-nvidia.conf <<EOF
[crio.runtime.runtimes.nvidia]
runtime_path = "/usr/bin/nvidia-container-runtime"
runtime_root = "/run/nvidia"
runtime_type = "oci"
EOF
systemctl restart crio'
Smoke-test before resuming ollama / comfyui:
kubectl run nvidia-smoke --rm -i --restart=Never \
--overrides='{"spec":{"runtimeClassName":"nvidia","nodeName":"worker8.thesteamedcrab.com"}}' \
--image=nvidia/cuda:12.0-base-ubuntu22.04 -- nvidia-smi
Failure modes seen on 2026-05-02
| Symptom | Root cause | Fix |
|---|---|---|
| Drain hangs ~indefinitely on `rook-ceph-osd-host-*` PDB | A previous node's OSD is still degraded; Rook's per-host PDB blocks all OSD evictions cluster-wide | Wait for `ceph -s` HEALTH_OK, then retry drain |
| New pods on a node fail with `runc create failed: unsafe procfs detected` | `/etc/crio/crio.conf` was removed but the new cri-o package never installed, so nothing replaced it | Restore crio.conf from a peer node's identical version (`scp root@<peer>:/etc/crio/crio.conf root@<broken>:/etc/crio/`), `systemctl restart crio` |
| mon-X stays Pending after drain | Mon is pinned via nodeSelector to the cordoned node | Uncordon the pinned node and the mon comes back. Don't drain another mon's host until quorum is restored |
| `dnf upgrade` reports `kubelet-1.34.7: No match for argument` | Node has a manually-installed kubelet (no RPM entry) | Use the alternate procedure (`dnf install` after moving binaries aside) |
| longhorn-manager-X stuck CrashLoopBackOff with `bind: address already in use` on port 9502 | Old longhorn-manager process orphaned by a previous container; cri-o lost track of it but the binary is still bound | `ssh root@<node> 'pgrep -af "longhorn-manager -d daemon"'`, `kill -9 <pid>`; then `kubectl delete pod -n longhorn-system longhorn-manager-X` |
| Multiple CNPG replica pods stuck Init:CrashLoopBackOff on a recently-broken node | Their Longhorn volumes failed to attach during the node's outage; pods are now in 5-minute kubelet backoff | Once Longhorn recovers, `kubectl delete pod` each one to force an immediate retry. Volume attaches succeed. |
| Replica failure cascades into a CNPG PDB that blocks drains | One unhealthy replica in cluster X means its PDB allows 0 disruptions; subsequent drains anywhere block on cluster X's PDB | Heal the unhealthy replica before draining its peer's host. The Longhorn pre-drain step (above) prevents this. |
| cri-o stuck in `activating (start)` indefinitely after package upgrade; logs flood with `Killing container <id> failed: container does not exist` | cri-o's `internal_wipe = true` tries to kill phantom containers left in `/var/lib/containers/storage/` by the previous version, but their runc state in `/run/crun` is gone. Worse across multi-minor jumps (1.28 to 1.34). | `systemctl kill --signal=SIGKILL crio`, then `crio wipe -f` (clears container refs, keeps images), then `systemctl start crio` and `systemctl start kubelet`. Hit on worker6 during the manual-install upgrade; likely also needed on worker5 for the same reason. |
| Suspended app stays at `replicas: 0` after the suspend HR is reverted | Manually scaling a StatefulSet/Deployment to 0 before drain (the post-suspend step that actually stops pods) sticks; Helm/Flux applying the un-suspended HR doesn't reset the imperative scale | After reverting `disable-<app>`, `kubectl scale -n <ns> statefulset <app> --replicas=1` (or whatever the desired count is). Bit us with frigate, ollama, comfyui. |
### Master procedure
Same as worker, with two differences:
- **Master1 is special** (already on cri-o 1.34.x). Skip the cri-o swap;
just do the k8s patch / minor bump. Master1 is also the last in the
order so the kube-vip VIP can fail over to master2/master3 during
master1's drain.
- **First control plane in Phase 3 minor bump** uses
`kubeadm upgrade apply v1.35.x`; the others use
`kubeadm upgrade node`. The first one *must* be done before any
others.
## Phase 5: verify and clean up
- All nodes show same kubelet + cri-o version: `kubectl get nodes -o wide`
- Flux reconciled: `flux get all -A | grep -v True`
- `ceph -s` is `HEALTH_OK`
- Run `tools/etcd-defrag.sh` (etcd grew during the upgrade)
- **Resume descheduler**: `git revert <disable-descheduler sha>` and
push. Same for any hardware-pinned `disable-<app>` commits made
along the way.
- Update this runbook with anything new that bit you, before memory
fades.
## Rollback
In-place RPM downgrade is messy, especially for kubelet across a minor
boundary. Realistic rollback paths:
- Per-node, *before* `kubeadm upgrade apply` on the first master: rolling
  back is just `dnf downgrade kubeadm kubelet kubectl` to 1.34 + put
the old `crio.conf` back (it's saved as `crio.conf.rpmsave` after the
package swap).
- Per-node, *after* `kubeadm upgrade apply`: bring forward the rest of
the cluster. Don't try to roll back the apiserver minor.
- Catastrophic (etcd corruption, control plane unrecoverable): restore
from `~/cluster-backups/etcd-20260502/snapshot-prephase0-20260502.db`
via [the kubeadm etcd recovery procedure][1]. **This is a last
resort and has not been tested in this homelab.** It assumes you
have at least one master with the original PKI intact and can
`etcdctl snapshot restore` to a fresh data dir, then restart etcd
static pods.
[1]: https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/#restoring-an-etcd-cluster
Per-Route Cert Migration Runbook
This runbook covers migrating from one shared wildcard ACME certificate
(thesteamedcrab.com covering *.thesteamedcrab.com) to per-listener
fine-grained short-term certificates, using cert-manager's Gateway API
shim to auto-provision a Certificate per HTTPRoute hostname.
Why: smaller blast radius per renewal failure, per-host visibility, no reflector dependency for new namespaces, faster rotation.
Why this needs to be paced: Let's Encrypt limits issuance to 50
certificates per Registered Domain per rolling 7-day window. The
cluster has ~70 HTTPRoutes under thesteamedcrab.com, too many to
flag-day onto ACME without breaching the limit and locking out all
issuance for ~7 days.
State at the start
| Aspect | Reality |
|---|---|
| Cert manifest | One Certificate (thesteamedcrab.com) producing thesteamedcrab-com-tls |
| Cert SANs | apex, *.thesteamedcrab.com, *.app.…, *.mcp.… |
| Reflection | Secret reflected to network, istio-system, mcp-system |
| Issuer | letsencrypt-production (DNS01 via Cloudflare) |
| Gateways | 4: external (network), internal (network), istio (istio-system), mcp-gateway (mcp-system) |
| HTTPRoutes per Gateway | external: 11, internal: 50+, mcp-gateway: 9, istio: 2 |
| Existing TTL | LE default 90d, autorenew at ~30d remaining |
Target state β split by exposure
The 70+ routes split into two cohorts:
- Public-facing (~22): `external` (11), `mcp-gateway` (9), `istio` (2). These need browser-trusted certs: ACME via `letsencrypt-production`.
- Internal-only (~50): everything on the `internal` Gateway. These never face the public internet: private CA, issued by an in-cluster ClusterIssuer. Zero rate-limit concern, instant issuance, no public CT log entries leaking internal hostnames.
Why split
A "just ACME everything" plan would burn the entire 50/week budget twice over and leave no headroom for retries or the existing wildcard's renewal. Splitting halves the scope of the rate-limited work and removes the public CT log noise for internal services.
The cost of the split is one-time: distribute the private CA root to every device that visits internal hostnames (browsers, mobile devices, anything that calls internal APIs). That's a manual import on each device.
If that operational cost is unacceptable, see "Alternative: all-ACME" at the bottom of this runbook.
Rate-limit math (ACME side, ~22 certs)
- Hard ceiling: 50 issuances / 7d / Registered Domain
- Total ACME target certs: ~22
- Renewal exemption: renewals (same FQDN set, same account) are exempt from the 50/week ceiling but still subject to a 5/week Duplicate Certificate cap per FQDN set. With `duration: 168h` / `renewBefore: 48h`, each cert renews ~1.4x/week, well under 5/week.
- Sustainable initial-issuance rate: 50 / 7 ≈ 7/day. The 7-day window is rolling, not calendar: at 10/day you hit the cap on day 5, not day 7.
- Wave size: 5/day. Buffer of 2/day for retries and the existing wildcard's renewal pressure.
- Total elapsed: ~5 days for 22 certs.
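The "rolling, not calendar" point is worth making concrete. A small sketch computing the first day on which a constant issuance rate fills the 50-per-rolling-7-days cap (`first_capped_day` is an illustrative helper; pure arithmetic, no ACME interaction):

```shell
# first_capped_day RATE: day on which RATE new certs/day first fills the
# 50/7d rolling window ("never" if the steady-state window stays under 50).
first_capped_day() {
  awk -v r="$1" 'BEGIN {
    for (d = 1; d <= 14; d++) {
      w = (d < 7 ? d : 7) * r          # issuances inside the trailing 7-day window
      if (w >= 50) { print d; exit }
    }
    print "never"
  }'
}

first_capped_day 10   # prints 5: the cap is hit on day 5, not day 7
first_capped_day 5    # prints "never": the window peaks at 35
```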
Critical gotchas (learned the hard way)
Three things bit us during the first execution attempt on 2026-05-02. Read these before writing any YAML.
1. The gateway-shim must be explicitly enabled
cert-manager v1.16+ does not enable the gateway-shim by default,
contrary to what some release notes suggest. You need
config.enableGatewayAPI: true in the Helm values. Without it,
annotating a Gateway with cert-manager.io/cluster-issuer is silently
inert; no Certificate is ever created. Verify before Phase 0:
kubectl get cm -n cert-manager cert-manager -o jsonpath='{.data.config\.yaml}'
# Expect to see: enableGatewayAPI: true
If absent, set it in the chart values and roll out cert-manager first.
2. The reflector pattern fights the gateway-shim across namespaces
This cluster mirrors network/thesteamedcrab-com-tls to istio-system
and mcp-system via reflector. The gateway-shim is per-namespace:
it creates a Certificate in the same namespace as the Gateway. So
annotating the istio Gateway whose listener references the
reflector-mirrored Secret causes:
- the shim creates a Certificate in istio-system targeting the mirrored Secret's name
- cert-manager re-issues from ACME because "Secret was previously issued by a different issuer" (1 ACME order against the budget)
- an ongoing fight: the shim's Certificate vs reflector for ownership of the Secret
Don't annotate any Gateway whose listener still references a reflector-mirrored Secret. Migrate that Gateway's HTTPRoutes to per-app listeners with their own Secret names first, then remove the wildcard listener and the reflector dependency, then add the shim annotation.
For this cluster: the istio Gateway is excluded from Phase 0. It gets annotated only after its (small) HTTPRoute set is migrated and the wildcard listener is removed.
3. HTTPRoute sectionName has no fallback
A HTTPRoute attached to a listener that goes Programmed=False
does not fall back to other listeners that match its hostname.
The route is just down. The wildcard listener you keep around
"as a backstop" only catches HTTPRoutes that explicitly point at it
via sectionName.
Implication for the canary: don't pick an app whose downtime is
unacceptable. The canary's listener is Programmed=False until ACME
issues, which can be ~30s but can be much longer if anything's wrong.
Phase 0 β Annotation prep
Verified 2026-05-02 against cert-manager v1.17.1 with
config.enableGatewayAPI: true: when the shim sees a listener whose
referenced Secret doesn't exist (or isn't owned by an in-namespace
Certificate), it creates one. When an in-namespace Certificate already
owns the Secret, the shim is a no-op.
For this cluster, that means it's safe to annotate Gateways in the
network namespace (which holds the manual thesteamedcrab.com
Certificate that owns thesteamedcrab-com-tls), and unsafe to
annotate Gateways in istio-system or mcp-system while they still
reference the reflector-mirrored Secret.
0.1 Pre-flight test (one-time)
Confirm the shim creates Certificates for new Secret refs:
kubectl apply -f - <<'EOF'
---
apiVersion: v1
kind: Namespace
metadata: { name: shim-test }
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata: { name: ss, namespace: shim-test }
spec: { selfSigned: {} }
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
name: scratch
namespace: shim-test
annotations:
cert-manager.io/issuer: ss
spec:
gatewayClassName: envoy
listeners:
- name: https
protocol: HTTPS
port: 443
hostname: bar.shim-test.thesteamedcrab.com
allowedRoutes: { namespaces: { from: Same } }
tls:
certificateRefs: [{ kind: Secret, name: bar-shim-tls }]
EOF
sleep 20
kubectl get certificate -n shim-test
kubectl delete namespace shim-test
Expected: a bar-shim-tls Certificate appears, Ready: True. If
nothing appears within 30s, the shim isn't running; fix that before
proceeding.
0.2 Add the annotations
Annotate only the external and internal Gateways
(network namespace). Skip istio and mcp-gateway for now;
they're in namespaces that depend on the reflector. They'll be
annotated in Phase 5 after their HTTPRoutes have moved off the
shared Secret.
metadata:
annotations:
# ...existing annotations...
cert-manager.io/cluster-issuer: letsencrypt-production
cert-manager.io/duration: "168h" # 7d
cert-manager.io/renew-before: "48h"
What `duration` actually does: Let's Encrypt's default profile always issues 90-day certificates regardless of the `duration` requested by ACME. The `cert-manager.io/duration` annotation controls only cert-manager's renewal cadence; it tells cert-manager "treat this cert as expiring after 168h" so it renews early. You still get 90-day certs, just rotated every ~5 days.

For actually-short LE certs, opt into the `tlsserver` profile (~6-day validity) by adding `acme.cert-manager.io/order-profile-name: tlsserver` on the per-listener Certificate or via cert-manager's issuer-level configuration. That changes the LE order profile and the issued cert is genuinely short-lived. Verify with `openssl s_client … | openssl x509 -noout -dates` after issuance.
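The ~1.4x/week figure from the rate-limit math falls straight out of these two values: under this model, cert-manager re-issues when remaining validity drops to renewBefore, i.e. every duration - renewBefore hours. A quick sketch (`renewals_per_week` is an illustrative helper):

```shell
# renewals_per_week DURATION_H RENEWBEFORE_H: renewal events per 168h week,
# given that a renewal fires every (duration - renewBefore) hours.
renewals_per_week() {
  awk -v d="$1" -v rb="$2" 'BEGIN { printf "%.1f\n", 168 / (d - rb) }'
}

renewals_per_week 168 48   # prints 1.4, safely under the 5/week duplicate cap
```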
For the internal Gateway, swap the issuer to the private CA once
Phase 1 is done:
```yaml
cert-manager.io/cluster-issuer: cluster-internal-ca
```
0.3 Verify
```shell
# Cert count must be unchanged after reconcile.
kubectl get certificate -A
# If a new Certificate appears in any namespace, the shim hit
# something it shouldn't have. Stop and investigate.
```
If a Certificate appeared, check whether it's in a namespace whose Gateway you annotated. If yes, you almost certainly hit case (2) above: the listener references a reflector-mirrored Secret. Revert the annotation and migrate that Gateway separately later.
Phase 1: Internal Gateway → private CA (parallel track)
This phase is independent of the ACME phases below: it has no rate-limit concerns and can run in any order or in parallel.
1.1 Set up the private CA
(One-time. Skip if already present.)
```yaml
# cert-manager: bootstrap a self-signed root + ClusterIssuer
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-bootstrap
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cluster-internal-ca
  namespace: cert-manager
spec:
  isCA: true
  commonName: thesteamedcrab.com Internal CA
  secretName: cluster-internal-ca-tls
  duration: 87600h # 10y root
  renewBefore: 720h
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned-bootstrap
    kind: ClusterIssuer
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: cluster-internal-ca
spec:
  ca:
    secretName: cluster-internal-ca-tls
```
1.2 Distribute the root to client devices
This is the non-zero operational cost. Export the CA cert and import it into:
- Each laptop / desktop browser (or system trust store)
- Each phone / tablet
- Any in-cluster client that calls internal hostnames over TLS
```shell
kubectl get secret -n cert-manager cluster-internal-ca-tls \
  -o jsonpath='{.data.tls\.crt}' | base64 -d > internal-ca.crt
```
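It's worth sanity-checking what you exported before distributing it. A sketch of the inspection, run here against a throwaway self-signed CA so it works anywhere; for the real check, point `-in` at the `internal-ca.crt` exported above:

```shell
# Generate a throwaway CA standing in for the cluster root (assumption:
# you'd normally skip this and inspect internal-ca.crt directly).
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:P-256 -nodes \
  -keyout /tmp/ca-key.pem -out /tmp/internal-ca.crt \
  -subj "/CN=thesteamedcrab.com Internal CA" -days 3650

# Confirm subject, validity window, and that it is actually a CA.
openssl x509 -in /tmp/internal-ca.crt -noout -subject -dates -ext basicConstraints
```

You want to see your expected CN, a sane validity window, and `CA:TRUE` in the basicConstraints output.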
Don't proceed past 1.2 until you've imported and verified the root on
at least your primary device. (Open https://glance.thesteamedcrab.com
post-migration and confirm no cert warning.)
1.3 Migrate the internal Gateway
Add the `cert-manager.io/cluster-issuer: cluster-internal-ca` annotation
to `kubernetes/apps/network/envoy-gateway/config/internal.yaml`. Then,
in batches of 10 for reviewability, add a per-app listener and flip
the corresponding HTTPRoute's `sectionName`. Issuance is instant, so
no soak is required between batches; just verify each Certificate goes
Ready before moving on.
The wildcard listener stays alive until every internal HTTPRoute is
migrated, so rollback is the same as the ACME track: revert
`sectionName` and the route falls back to the wildcard.
Phase 2: Public Gateways canary (1 app)
Pick a low-stakes route on the `external` Gateway (the highest public visibility). Avoid the highest-traffic apps (immich, jellyfin) until after the canary; pick something like the GitHub webhook or a seldom-used bookmark.
**No fallback.** Once you flip the HTTPRoute's
`sectionName` to the new per-app listener, the route is bound to that listener only. If ACME issuance fails or is slow, the route returns 404 (or connection refused) until the cert lands; the wildcard listener does not serve as a backstop. Pick a canary you can afford to have down for ~30s in the happy case and several minutes in the failure case.
2.1 Add a per-app listener
```yaml
spec:
  listeners:
    # existing http and wildcard https listeners untouched
    - name: https-<app>
      protocol: HTTPS
      port: 443
      hostname: "<app>.${SECRET_DOMAIN}"
      allowedRoutes:
        namespaces:
          from: All
      tls:
        certificateRefs:
          - kind: Secret
            name: <app>-tls
```
The Gateway already carries the shim annotations from Phase 0; this
new listener picks them up automatically. cert-manager will create an
`<app>-tls` Certificate within seconds of the listener appearing.
2.2 Flip the HTTPRoute's sectionName
```diff
  parentRefs:
    - name: external
      namespace: network
-     sectionName: https
+     sectionName: https-<app>
```
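For context, a minimal sketch of the HTTPRoute that diff applies to (names and backend illustrative, not a manifest from this repo):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: <app>
spec:
  parentRefs:
    - name: external
      namespace: network
      sectionName: https-<app> # binds the route to the per-app listener only
  hostnames:
    - <app>.${SECRET_DOMAIN}
  rules:
    - backendRefs:
        - name: <app>
          port: 80
```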
2.3 Verify and soak
```shell
# Cert issues in seconds for staging, ~30s for production
kubectl get certificate -n network -w
# Confirm the SAN is just the one hostname and duration ~7d
echo | openssl s_client -connect <app>.thesteamedcrab.com:443 \
  -servername <app>.thesteamedcrab.com 2>/dev/null \
  | openssl x509 -noout -subject -ext subjectAltName -dates
```
Soak 24h. If anything is wrong, revert the HTTPRoute's `sectionName`. The wildcard listener still serves the old wildcard cert; no user impact.
Don't proceed past Phase 2 until this canary has cleanly soaked AND
auto-renewed at least once (`renewBefore: 48h` means renewal kicks in
on day 5; wait for that).
Phase 3: Wave migration on public Gateways
Goal: migrate the remaining ~19 public-facing HTTPRoutes on Gateways
that are already annotated (`external` 10, `mcp-gateway` 9) at
5 per day with 12h soaks. The 2 istio routes are deferred to
Phase 4.0.
Per-wave procedure
For each batch of 5:
- Add 5 listeners to the appropriate Gateway. Sort listeners alphabetically inside the `listeners:` block.
- Flip 5 HTTPRoute `sectionName`s.
- Commit + push. One PR per wave. Title: `feat(network): per-app TLS migration wave N (X/Y)`.
- Watch issuance: all 5 should reach Ready within ~2 min. Anything stuck: check `kubectl get certificate -A -w`, `kubectl describe certificate <name> -n network`, and the Order / Challenge resources.
- Soak ~12h between waves. Watch the issuer for rate-limit warnings: `kubectl describe clusterissuer letsencrypt-production` and `kubectl get challenge -A`.
Recommended wave order
Public Gateway with the manual Certificate first (lower risk;
that's where Phase 0 was verified safe):

- Wave 1: `external` (5 of 10; canary already done)
- Wave 2: `external` (the other 5)
- Wave 3: `mcp-gateway` (5 of 9)
- Wave 4: `mcp-gateway` (the other 4)
Total ~4 calendar days at one wave per ~12h. The 2 istio routes are handled in Phase 4.0 once the reflector dependency is removed.
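The schedule arithmetic can be sanity-checked with a short sketch (numbers from this plan; the ~4 calendar days adds working hours between waves on top of the soak minimum):

```shell
# Wave count and minimum elapsed time from soaks alone.
routes=19
wave_size=5
soak_h=12
waves=$(( (routes + wave_size - 1) / wave_size ))   # ceiling division
echo "waves=${waves}, soak minimum=$(( waves * soak_h ))h"
```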
What to do if you hit a rate-limit error
cert-manager surfaces LE rate-limit responses in the Order resource:
```shell
kubectl get order -A | grep -i rate
kubectl describe order -n network <order>
```
If hit:
- Stop the next wave immediately.
- Don't delete failing Certificates; that doesn't reset the counter and can cause cert-manager to retry, eating more budget.
- The window is rolling; just wait. The first issuance from N days ago drops off after 7d.
- Resume waves once `kubectl get certificate -A` shows no Pending or Failing.
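A back-of-envelope check of why steady state stays under budget, assuming ~21 public per-app certs (19 wave routes + 2 istio) each renewing at duration minus renewBefore (168h - 48h = 120h):

```shell
# Steady-state weekly ACME order volume vs the 50/week per-domain limit.
awk 'BEGIN {
  certs = 21; week_h = 168; renew_every_h = 168 - 48
  orders = certs * week_h / renew_every_h
  printf "steady-state orders/week ~= %.1f (limit 50)\n", orders
}'
```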
Phase 4: Cleanup
Run only after every HTTPRoute on every Gateway has been migrated and soaked for 7+ days.
4.0 istio Gateway migration (prerequisite)
The istio Gateway was excluded from Phase 0 / Phase 3 because its listener references the reflector-mirrored Secret. Migrate its HTTPRoutes (~2: kiali and one redirect) before deleting the wildcard Certificate.
For each istio HTTPRoute (`<app>` = e.g. `kiali`):

- Create a manual Certificate in `istio-system` (no shim involvement; the istio Gateway is still un-annotated):

  ```yaml
  apiVersion: cert-manager.io/v1
  kind: Certificate
  metadata:
    name: <app>
    namespace: istio-system
  spec:
    secretName: <app>-tls
    issuerRef:
      name: letsencrypt-production
      kind: ClusterIssuer
    dnsNames:
      - <app>.${SECRET_DOMAIN}
    duration: 168h
    renewBefore: 48h
  ```

- Wait for it to be Ready (1 ACME issuance per cert).
- Add a per-app listener to the istio Gateway referencing `<app>-tls`, then flip the HTTPRoute's `sectionName`.
- Soak briefly. Confirm the HTTPRoute serves via the new listener.

After all istio HTTPRoutes are migrated:

- Remove the wildcard listener from the istio Gateway.
- Remove `istio-system` from the wildcard Certificate's `reflector...reflection-allowed-namespaces` annotation list (in `kubernetes/apps/network/envoy-gateway/config/certificate.yaml`). This stops new mirroring; the existing Secret in `istio-system` can be deleted manually if desired (no consumers).
- Now the istio Gateway has no reflector-mirrored Secret. Add the `cert-manager.io/cluster-issuer` annotations from Phase 0. The shim is now safe; every listener's referenced Secret is owned by an in-namespace manual Certificate. (Optionally, delete the manual Certificates afterward and let the shim recreate them, at the cost of one ACME order per cert.)
4.1 Verify no consumers reference the wildcard Secret
```shell
grep -rn 'thesteamedcrab-com-tls\|${SECRET_DOMAIN/./-}-tls' kubernetes/
```
Anything still referencing the wildcard Secret needs to migrate first.
Pay particular attention to Gateways in `istio-system` and `mcp-system`:
they may still be using the reflector-mirrored copy.
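If it's unclear why the grep searches for two spellings: Flux substitutes `${SECRET_DOMAIN}`, and the wildcard Secret name is the domain with its dot swapped for a dash. A bash sketch of that substitution (assumes the repo's `SECRET_DOMAIN` value):

```shell
# ${var/./-} is bash pattern substitution: replace the first "." with "-".
SECRET_DOMAIN=thesteamedcrab.com
echo "${SECRET_DOMAIN/./-}-tls"
```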
4.2 Remove the wildcard listener from each Gateway
Delete the `name: https` block (with `hostname: "*.${SECRET_DOMAIN}"`)
from each of:

- `kubernetes/apps/network/envoy-gateway/config/external.yaml`
- `kubernetes/apps/network/envoy-gateway/config/internal.yaml`
- `kubernetes/apps/istio-system/gateway/gateway.yaml`
- `kubernetes/apps/mcp-system/mcp-gateway/app/gateway.yaml`
4.3 Delete the wildcard Certificate manifest
```shell
git rm kubernetes/apps/network/envoy-gateway/config/certificate.yaml
```

Also remove the entry from
`kubernetes/apps/network/envoy-gateway/config/kustomization.yaml`.
The reflector annotations on the (now-deleted) Certificate's
`secretTemplate` go away automatically; reflector stops mirroring;
the reflected Secret copies in `istio-system` and `mcp-system` are
garbage-collected.
4.4 Final verify
```shell
# No wildcard cert in any namespace
kubectl get certificate -A | grep -i wildcard
# Per-app certs all present and Ready
kubectl get certificate -A
# All public certs are short-duration; private CA certs may be longer
kubectl get certificate -A -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) duration=\(.spec.duration // "unset")"'
# Smoke test from outside (public) and from a CA-trusting device (internal)
for host in glance.thesteamedcrab.com photos.thesteamedcrab.com ... ; do
  echo "$host:"
  echo | openssl s_client -connect "$host:443" -servername "$host" 2>/dev/null \
    | openssl x509 -noout -subject -dates
done
```
Rollback
The wildcard cert + listeners stay alive through Phases 1, 2, and 3.
Rollback during those phases is just reverting the HTTPRoute's
`sectionName`. Even after deleting some per-app listeners, the
wildcard listener catches the route.
Phase 4 is not cleanly reversible: a `git revert` restores the
manifests, but re-creating the wildcard Certificate kicks off a new
ACME order (which counts against the rate limit). If you must roll
back from Phase 4:

- Restore the Certificate manifest via `git revert`
- cert-manager re-issues; this costs 1 against the 50/week budget
- Restore the wildcard listeners
- Flip HTTPRoutes back to `sectionName: https`
So don't enter Phase 4 unless you're confident in the new state.
What to watch
- `kubectl get certificate -A`: the `READY` column should always be `True`
- `kubectl get order -A`: should be empty in steady state (orders exist transiently during issuance/renewal)
- `kubectl describe clusterissuer letsencrypt-production`: surfaces ACME backoff messages if rate-limited
- Certificate Transparency log at https://crt.sh/?Identity=thesteamedcrab.com: an external counter you don't control, useful as a sanity check
When to stop and ask
- Any wave produces >0 failed Certificates after 5 min
- Total Pending+Failing Certificates count >3 across the cluster
- Any rate-limit error in Order events
- More than one wave in the rolling 7d window has produced retries
- Phase 0.1 test result was ambiguous
Alternative: all-ACME (no private CA)
If managing a private CA root on every client device is unacceptable:
- Skip Phase 1 entirely
- All ~70 routes go through ACME
- Wave size still 5/day; total elapsed ~14 days
- Internal hostnames will appear in public CT logs (https://crt.sh/?Identity=thesteamedcrab.com); anyone can enumerate your service catalog from the certificate transparency feed. Mitigate with hostnames that don't betray service identity if this matters.
Prereqs
- 1password-cli
- minijinja
- yq
- go-task (alias to task)
Initialization
```shell
./init/create-cluster.sh       # on master
./init/initialize-cluster.sh   # on laptop
ssh root@master1 rm /etc/kubernetes/manifests/kube-vip.yaml   # on laptop
```
Teardown
```shell
./init/destroy-cluster.sh   # on laptop
```
Importing an Immich DB backup into a new CNPG database
- Create the new Immich database:

  ```shell
  kubectl cnpg -n databases psql <db name> -- -c 'CREATE DATABASE immich;'
  ```

- Ensure that you're working with the primary CNPG instance:

  ```shell
  kubectl cnpg -n databases status <db name> | grep "Primary instance:"
  ```

- Run the Immich restore command:

  ```shell
  gunzip < "immich-db-backup-1735455600013.sql.gz" \
    | sed "s/SELECT pg_catalog.set_config('search_path', '', false);/SELECT pg_catalog.set_config('search_path', 'public, pg_catalog', true);/g" \
    | kubectl cnpg -n databases psql <db name> --
  ```
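What the `sed` in that pipeline rewrites, demonstrated on the single line `pg_dump` emits (hermetic, no cluster needed): the dump pins an empty `search_path`, and the rewrite restores `public, pg_catalog` so unqualified table names resolve during the restore.

```shell
# Feed the line pg_dump emits through the same substitution.
echo "SELECT pg_catalog.set_config('search_path', '', false);" \
  | sed "s/set_config('search_path', '', false)/set_config('search_path', 'public, pg_catalog', true)/"
```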
Github Webhook
`kubectl -n flux-system get receivers.notification.toolkit.fluxcd.io` generates the token URL to be put into
github.com -> Settings -> Webhooks -> Payload URL:
- Content type: application/json
- Secret: the token from `kubectl -n flux-system describe secrets github-webhook-token`
- SSL: Enable SSL verification
- Which events would you like to trigger this webhook?: Just the push event.
- Active: checked
Resources: Limits and Requests Philosophy
In short: set CPU requests, don't set CPU limits, and set the memory limit equal to the memory request.
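As a concrete (illustrative) container spec following that policy, with hypothetical values:

```yaml
resources:
  requests:
    cpu: 100m      # set: informs the scheduler for bin-packing
    memory: 256Mi
  limits:
    memory: 256Mi  # equal to the request: Guaranteed-style memory, no overcommit surprises
    # no cpu limit: avoids CFS throttling; excess CPU is shared by weight
```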
Debugging
- https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
- https://dnschecker.org
- https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
- https://github.com/nicolaka/netshoot
- https://www.redhat.com/sysadmin/using-nfsstat-nfsiostat