Introduction

Warning

These docs contain information that relates to my setup. They may or may not work for you.




Lovenet Home Operations Repository

Production-grade Kubernetes for a household. GitOps with Flux · Automated dependency updates with Renovate · Self-hosted by design






📖 Overview

This is the live configuration for a multi-node Kubernetes cluster that runs a household: home automation, security cameras, media, document management, AI workloads, and the operational tooling required to keep it all up. Every change lands in Git first; Flux reconciles the cluster from there, and Renovate keeps dependencies current via PRs.

The repo is GitOps-strict: applications are declared as HelmRelease resources, secrets are pulled from 1Password through External Secrets Operator, and clusters are mostly identical except for app selection and sizing. Operational quirks, durability tiers, and security defaults live alongside the manifests in .agents/instructions/ so the conventions are enforceable, not folklore.


🗺️ Architecture

flowchart LR
    Dev[👤 Operator] -->|git push| Repo[(📦 GitHub<br/>home-ops)]
    Renovate[🤖 Renovate] -.->|automated PRs| Repo
    Repo -->|reconciles| Flux[⚙️ Flux]
    Flux -->|deploys| Cluster[☸️ Kubernetes<br/>10 nodes · 168 apps]

    Cluster --> Ceph[(🪨 Ceph<br/>block · default durable)]
    Cluster --> LH[(🐂 Longhorn<br/>+ recurring backups)]
    Cluster --> Garage[(🧺 Garage<br/>S3-compatible)]
    Cluster --> NFS[(🗄️ NFS<br/>beast / brain · bulk media)]

    LH -->|weekly + monthly| NFS
    Garage -->|rclone CronJobs| AWS[☁️ AWS S3<br/>Glacier Deep Archive<br/>offsite DR]

    classDef store fill:#1e293b,stroke:#475569,color:#e2e8f0
    class Ceph,LH,Garage,NFS,AWS store

Storage tiers are picked deliberately per workload; see storage-class.instructions.md for the decision tree.


🧰 Stack at a glance

| Layer | Tool | Role |
|---|---|---|
| OS | CentOS Stream 9 / 10 | Node operating system |
| Runtime | cri-o + crun | CRI runtime + OCI runtime (C implementation) |
| Kubernetes | v1.35.4 | Control-plane and node version |
| GPU | NVIDIA GPU Operator + Container Toolkit | P40 driver/runtime management on worker8 |
| GitOps | Flux2 | Declarative cluster reconciliation |
| Automation | Renovate + GitHub Actions | Dependency PRs, link checks, self-hosted runners |
| CNI | Cilium (eBPF) | Networking, BGP peering, LoadBalancer pool |
| Ingress | Envoy Gateway | L7 gateway / HTTPRoute |
| Service mesh | Istio | mTLS + traffic mgmt for mcp-system |
| DNS | external-dns | Cloudflare + bind9 split-horizon |
| TLS | cert-manager | Let's Encrypt + internal CA |
| Tunnel | cloudflared | Public ingress without exposing home WAN |
| AuthN/Z | Authelia + oauth2-proxy | OIDC SSO; 24 oauth2-proxy instances gate apps |
| Secrets | External Secrets Operator + 1Password | 109 ExternalSecrets, zero plain-text in Git |
| VPN | wg-easy | Operator OOB WireGuard access |
| Storage | Rook-Ceph, Longhorn, Garage, direct NFS | Tiered by durability requirement |
| Databases | CloudNative-PG, Dragonfly, Qdrant | 24 Postgres clusters, KV, vector |
| Observability | kube-prometheus-stack, Loki, Grafana, HolmesGPT | Metrics, logs, dashboards, AI alert triage |
| Images | ZOT | Pull-through registry / local cache |

🖥️ Hardware

| Role | Hostname | Device | CPU | RAM | OS | Storage / Accelerators | Notes |
|---|---|---|---|---|---|---|---|
| 🧠 | master1 | bare-metal | 4 | 32 GB | CentOS 10 | NVMe (Longhorn) | Intel iGPU · RTL-SDR · control plane |
| 🧠 | master2 | VM on beast | 3 | 12 GB | CentOS 9 | | virtualized control plane |
| 🧠 | master3 | VM on beast | 3 | 10 GB | CentOS 9 | | virtualized control plane |
| 💪 | worker2 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | NVMe (Longhorn + Ceph OSD) | ZWA-2 Z-Wave dongle |
| 💪 | worker3 | ThinkCentre M910x | 8 | 64 GB | CentOS 9 | NVMe (Longhorn + Ceph OSD) | Sonoff Zigbee dongle |
| 💪 | worker4 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | NVMe (Longhorn + Ceph OSD) | Coral USB TPU |
| 💪 | worker5 | VM on beast | 10 | 24 GB | CentOS 9 | NVMe (Longhorn + Ceph OSD) | |
| 💪 | worker6 | VM on beast | 10 | 30 GB | CentOS 9 | NVMe (Longhorn + Ceph OSD) | |
| 💪 | worker7 | VM on beast | 10 | 30 GB | CentOS 9 | NVMe (Longhorn + Ceph OSD) | |
| 🎮 | worker8 | VM on beast | 10 | 55 GB | CentOS 9 | NVMe (Longhorn + Ceph OSD) | NVIDIA P40 (24 GB VRAM) |

Off-cluster infrastructure

| Host | Role |
|---|---|
| beast | Dell R730xd · iDRAC 8 · RAID6 bulk storage · primary NFS · Longhorn backup target · Garage substrate · VM host |
| brain | Router/gateway · RAID6 mass_storage · NFS for downloads & TV · OOB SSH on :3231 |

🌐 Network

| Network | CIDR | VLAN |
|---|---|---|
| Default | 192.168.0.0/16 | 0 |
| IoT | 10.10.20.0/24 | 20 |
| Guest | 10.10.30.0/24 | 30 |
| Security (cameras) | 10.10.40.0/24 | 40 |
| Kubernetes pod subnet (Cilium) | 10.42.0.0/16 | n/a |
| Kubernetes services subnet (Cilium) | 10.43.0.0/16 | n/a |
| Kubernetes LB pool (CiliumLoadBalancerIPPool) | 10.45.0.0/24 | n/a |

Worker nodes attach to iot and sec VLANs via Multus for direct camera and IoT-device reachability. Cilium peers BGP with the upstream router to advertise the LB pool; external ingress flows through Envoy Gateway behind cloudflared.
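
As a quick sanity check on the address plan, the sketch below uses Python's standard `ipaddress` module to confirm that none of the listed ranges overlap. The CIDRs are copied from the network table; the short labels and the `find_overlaps` helper are illustrative names and not part of the repo.

```python
import ipaddress
from itertools import combinations

# CIDRs copied from the network table; the labels are shortened for readability.
NETWORKS = {
    "default": "192.168.0.0/16",
    "iot": "10.10.20.0/24",
    "guest": "10.10.30.0/24",
    "security": "10.10.40.0/24",
    "k8s-pods": "10.42.0.0/16",
    "k8s-services": "10.43.0.0/16",
    "k8s-lb-pool": "10.45.0.0/24",
}


def find_overlaps(networks):
    """Return every pair of network names whose address ranges overlap."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(parsed.items(), 2)
        if net_a.overlaps(net_b)
    ]


if __name__ == "__main__":
    overlaps = find_overlaps(NETWORKS)
    print("overlapping ranges:", overlaps or "none")
```

A check like this could run in CI whenever a VLAN or the Cilium LB pool is resized, though whether the repo does so is not stated here.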


📦 What's running

🏠 Home Automation: Home Assistant ecosystem, 400+ devices

| App | Purpose |
|---|---|
| Home Assistant | Primary orchestrator; 400+ Z-Wave / Zigbee / Matter / ESPHome devices |
| ESPHome | Build & deploy firmware for DIY sensors |
| EMQX | MQTT broker |
| Node-RED | Visual automation flows |
| Zigbee2MQTT | Zigbee bridge (Sonoff stick on worker3) |
| Z-Wave JS UI | Z-Wave bridge (ZWA-2 stick on worker2) |
| Matter Server | Matter protocol bridge |
| Frigate | NVR + ML camera analysis (7+ cameras, Frigate+ trained model) |
| n8n | Workflow automation (AlertManager → HolmesGPT, etc.) |
| NetBox | IPAM / DCIM |
| wyoming-services | Piper TTS + Whisper STT for voice |
| smtp-relay | Maddy → Mailgun outbound mail |

🎬 Media & Entertainment: Jellyfin, Immich, Music Assistant, RomM

| App | Purpose |
|---|---|
| Jellyfin | Primary media server (read-only metadata) |
| Immich + immich-pet-tagger + immichkiosk + immich-power-tools | Photo library with ML face/pet recognition, offsite-backed |
| Music Assistant + Gonic | Multi-room music control + Subsonic API |
| RomM | Retro game library (~10k ROMs) |
| Beets | Music library tagging |
| cutVideo / av1corrector / videodupfinder / medialyze | Custom video tooling |
| Theme Park | Consistent UI theming across apps |
| Batocera Webdashboard Pro | Retro-gaming console dashboard |
| kodi-playback-watcher | Bridge for Kodi playback state |

🤖 AI & ML: Local inference, agents, image generation

| App | Purpose |
|---|---|
| Ollama | Local LLM serving on the P40 (Qwen 2.5 7b/14b, DeepSeek-R1, etc.) |
| ComfyUI | Image generation workflows |
| Khoj | Personal AI assistant over notes + docs |
| LangGraph Agents | Custom multi-agent runtime (rwlove/langgraph-agents); Postgres-checkpointed; MCP-gateway client. See AI agent pipeline section below. |
| KubeClaw | Workflow agent platform w/ browser automation (upstream chart); being phased out in favor of LangGraph |
| MCP Inspector | Model Context Protocol debugger UI |
| Paperless-AI | Auto-tagging for paperless-ngx |
| sync-receiver | Cross-host AI state sync endpoint |

📊 Observability: Prom/Loki/Grafana with AI triage on top

| App | Purpose |
|---|---|
| kube-prometheus-stack | Prometheus + AlertManager + node-exporter |
| Loki | Log aggregation |
| Grafana | Dashboards + alerting UI |
| HolmesGPT | LLM-backed alert investigation |
| kube-state-metrics / kube-ops-view | Cluster state & visualization |
| Goldilocks | VPA-driven resource right-sizing recommendations |
| Kromgo | Prometheus → Glance dashboard bridge |
| Netdata | Per-node real-time metrics |
| network-ups-tools (NUT) | UPS monitoring & graceful shutdown |
| exporters | Custom Prometheus exporters |

🗄️ Data & Storage: Databases, object storage, vector search

| App | Purpose |
|---|---|
| CloudNative-PG | 24 Postgres clusters with WAL archiving to Garage |
| Dragonfly | Redis-compatible in-memory store |
| Qdrant | Vector DB for embeddings / RAG |
| pgAdmin | Postgres admin UI |
| Rook-Ceph | Distributed block storage (default durable tier) |
| Longhorn | Block storage with NFS-backed recurring backups |
| Garage | S3-compatible object storage (DB backups, app S3 workloads) |

🌐 Network, Auth & Platform: Ingress, SSO, GitOps machinery

| App | Purpose |
|---|---|
| Cilium | CNI, BGP, LoadBalancer pool |
| Envoy Gateway | Ingress / HTTPRoute (30 routes) |
| cert-manager | TLS certificate lifecycle |
| external-dns | Cloudflare + bind9 record sync |
| cloudflared | Public tunnel without exposed WAN |
| Authelia | OIDC identity provider |
| LLDAP | Lightweight LDAP directory backing Authelia |
| oauth2-proxy | 24 instances gating per-app SSO |
| wg-easy | Primary OOB WireGuard access |
| External Secrets Operator | 1Password-backed secret materialization |
| Flux2 | GitOps reconciler |
| Renovate | Image & Helm chart update PRs |
| Kuadrant | MCP server gateway (Authelia-gated JWT) |
| actions-runner-controller | Self-hosted GitHub Actions runners |
| ZOT | Pull-through registry cache |

🗂️ Documents & Collaboration: Personal knowledge stack + self-hosted tools

| App | Purpose |
|---|---|
| Paperless-ngx | Document scanning, OCR, tagging (CNPG-backed, offsite-backed) |
| Obsidian + obsidian-couchdb | Notes sync (CouchDB w/ Cloudflare rate-limiting) |
| Zulip | Self-hosted team chat (also wired into agent pipeline approvals) |
| Kitchenowl | Shopping lists + recipe / meal management |
| Open WebUI | Self-hosted LLM frontend (Ollama + MCP servers as tool servers) |
| SearXNG | Privacy-respecting metasearch engine |
| Glance | Personal dashboard / start page |
| Atuin | Encrypted shell-history sync across machines |
| IT-Tools | Self-hosted developer toolbox |
| MediKeep | Personal medical records |
| Nametag | Name tag / badge generator |
| Pump + Pump-cv | Custom personal apps (rwlove-built) |

🔌 MCP Servers: 14 Model Context Protocol servers behind an Authelia-gated gateway

| Server | Exposes |
|---|---|
| mcp-gateway | Aggregating gateway; Envoy SecurityPolicy validates Authelia-issued JWTs (daily-rotated key) |
| ha-mcp | Home Assistant entities + service calls |
| immich-mcp | Immich library search + asset metadata |
| kubectl-mcp | Cluster introspection + safe kubectl ops |
| grafana-mcp | Grafana dashboards + Loki/Prom queries |
| prometheus-mcp | Direct PromQL access |
| paperless-mcp | Paperless-ngx document search |
| netbox-mcp | NetBox IPAM / DCIM |
| github-mcp | GitHub repo + PR ops |
| n8n-mcp | n8n workflow control |
| omada-mcp | TP-Link Omada controller |
| searxng-mcp | Privacy search through SearXNG |
| arr-mcp | Library-search interface to *arr apps |
| time-mcp | Time / timezone utilities (rwlove/time-mcp native-SHTTP build) |

🧠 AI agent pipeline

How local AI agents run, get work, ask for human approval, and produce reports, all without putting data in someone else's cloud unless a task genuinely needs it.

flowchart TB
    subgraph Inputs[Inputs]
        AM[AlertManager alerts]
        Op[Operator chat / voice]
        Cron[n8n cron + webhooks]
    end

    subgraph Frontends[Frontends]
        OWUI[Open WebUI]
        HA[Home Assistant<br/>voice + conversation]
        N8N[n8n workflows]
    end

    subgraph Orchestration[Orchestration]
        Holmes[HolmesGPT<br/>alert RCA]
        LG[LangGraph Agents<br/>agent fleet]
        KC[KubeClaw<br/>retiring]
    end

    subgraph Inference[Inference]
        Ollama[Ollama on P40<br/>qwen2.5:7b/14b]
        Claude[Claude API<br/>escalation only]
    end

    subgraph Tools[Tools]
        Gw[MCP Gateway<br/>Authelia-gated JWT]
        Servers[14× MCP servers<br/>HA · Immich · k8s · Grafana · …]
    end

    subgraph Outputs[Outputs]
        Z[Zulip<br/>approvals + chat]
        P[Pushover<br/>high-priority alerts]
        V[(langgraph-vault<br/>drafts + reports)]
        DB[(Postgres CNPG<br/>checkpoints + memory)]
    end

    AM --> Holmes
    Op --> OWUI
    Op --> HA
    Cron --> N8N

    OWUI --> Ollama
    HA --> Ollama
    N8N --> LG

    Holmes --> Ollama
    LG --> Ollama
    LG -.-> Claude
    KC --> Ollama

    Holmes --> N8N
    LG --> Gw
    KC --> Gw
    OWUI --> Gw
    Gw --> Servers

    N8N --> Z
    N8N --> P
    LG --> V
    LG --> DB

Agent fleet (LangGraph)

A single rwlove/langgraph-agents FastAPI service runs the fleet. Each agent is a LangGraph graph with its own persona, tool set, and cost cap. Postgres-checkpointed state lets long-running plans survive restarts.

| Agent | Role |
|---|---|
| supervisor | Routes work to specialist agents; opens approvals |
| researcher | Web + repo + vault research |
| coder | Code reading, drafting, PR descriptions |
| reviewer | Reviews drafts before they reach the operator |
| triager | Classifies inbound items, assigns owner agent |
| reporter | Daily digests, summaries, status rollups |
| note-maker | Captures decisions + facts back into the vault |
| homelab-engineer | Cluster ops, HelmRelease drafting, PR-shaped output |
| smart-home-engineer | Home Assistant entities, automations, ESPHome configs |
| ml-tuner | Frigate, Immich CLIP, model tuning |
| errand-runner | One-shot real-world tasks (purchases, lookups, scheduling) |
| property-coordinator | 3532 Foxhall workstreams (contractors, deck, pool) |
| health-tracker | Local-only; never escalated to Claude API |
| doc-writer (Scribner) | Sweeps repos for stale docs; drafts README + docs/ patches as diffs when commits land |

Pipeline stages

  1. Inbox: langgraph-inbox.json workflow ingests requests from chat, AlertManager, or scheduled triggers.
  2. Triage: triager classifies and assigns to a specialist agent.
  3. Plan: agent drafts an action plan (goals, steps, tool calls, expected cost) into Postgres state.
  4. Approval (HITL): for anything non-trivial, langgraph-approval-post sends a signed Zulip message + Pushover ping with the plan summary; langgraph-approval-receive waits on the reply; langgraph-awaiting-user-sweep chases stuck tasks.
  5. Execute: agent runs tool calls through the MCP gateway. Cost caps enforced by langgraph-cost-cap-watcher ($5/task, $10/agent/day, $30/global/day).
  6. Report: output written to langgraph-vault (drafts / finals), summarized into the reporter agent's daily Zulip digest (langgraph-daily-digest).
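
The cost caps in stage 5 can be read as three independent budgets that must all have headroom before a tool call proceeds. The sketch below is one way to express that check. The dollar limits come from the stage list; `CostLedger`, `would_exceed`, `record`, and the in-memory bookkeeping are hypothetical and say nothing about how langgraph-cost-cap-watcher is actually implemented.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Limits in USD, taken from the pipeline description.
PER_TASK_CAP = 5.00
PER_AGENT_DAILY_CAP = 10.00
GLOBAL_DAILY_CAP = 30.00


@dataclass
class CostLedger:
    """Hypothetical in-memory spend tracker covering a single day."""

    by_task: dict = field(default_factory=lambda: defaultdict(float))
    by_agent: dict = field(default_factory=lambda: defaultdict(float))
    total: float = 0.0

    def would_exceed(self, task_id, agent, cost):
        """Name the first cap that spending `cost` would break, or None."""
        if self.by_task[task_id] + cost > PER_TASK_CAP:
            return "task"
        if self.by_agent[agent] + cost > PER_AGENT_DAILY_CAP:
            return "agent"
        if self.total + cost > GLOBAL_DAILY_CAP:
            return "global"
        return None

    def record(self, task_id, agent, cost):
        """Book the spend, refusing it if any cap would be exceeded."""
        breached = self.would_exceed(task_id, agent, cost)
        if breached is not None:
            raise RuntimeError(f"{breached} cost cap reached")
        self.by_task[task_id] += cost
        self.by_agent[agent] += cost
        self.total += cost
```

Checking the narrowest budget first means the error names the most specific cap that was hit, which would make an approval message easier to act on.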

Local-first by design

| Tier | Backend | When used |
|---|---|---|
| 1 | qwen2.5:7b on Ollama (P40) | Fast / simple agents (triager, note-maker drafts) |
| 2 | qwen2.5:14b on Ollama (P40) | Default for everything else |
| 3 | Claude API (escalation) | Only on explicit uncertainty markers, repeated local-retry failure, novel/long-context work, or requires_cloud tag |

health-tracker and errand-runner are pinned local-only: they never escalate, even if quality suffers, because the data class isn't suitable for off-site inference.
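
The tier table and the local-only rule can be summarized as a small routing function. This is a sketch under stated assumptions: the tier-1 agent set is taken from the two examples in the table, the escalation triggers are collapsed into a single boolean, and `choose_backend`, its parameters, and the backend labels are hypothetical names rather than the fleet's real API.

```python
# Agents the text pins to local inference regardless of output quality.
LOCAL_ONLY_AGENTS = {"health-tracker", "errand-runner"}
# Assumption: only the two agents named in the tier table use the small model.
FAST_AGENTS = {"triager", "note-maker"}


def choose_backend(agent, wants_escalation=False, claude_enabled=False):
    """Pick an inference backend for one agent call.

    `wants_escalation` stands in for any documented trigger: uncertainty
    markers, repeated local-retry failure, novel or long-context work,
    or a requires_cloud tag.
    """
    local = "ollama/qwen2.5:7b" if agent in FAST_AGENTS else "ollama/qwen2.5:14b"
    if agent in LOCAL_ONLY_AGENTS:
        return local  # pinned: never leaves the cluster
    if wants_escalation and claude_enabled:
        return "claude-api"
    return local
```

The `claude_enabled` flag defaults to off here so that escalation is opt-in, which keeps a misconfigured caller on local inference.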

Voice-to-action: power button → HA Assist → agents → Obsidian

The most common way work enters the fleet: hold the phone's power button, say "inbox <whatever I'm thinking>", and the cluster takes it from there.

flowchart LR
    Btn[📱 Hold power button<br/>Pixel: 'Hold for Assistant'] --> Assist[HA Companion app<br/>set as default assistant]
    Assist -->|audio stream| HA[Home Assistant<br/>Assist pipeline]
    HA --> Whisper[Whisper STT<br/>wyoming-services on P40]
    Whisper --> Sentence[Sentence trigger:<br/>'inbox &#123;content&#125;']
    Sentence --> Ollama[conversation.ollama_voice<br/>qwen3:8b]
    Ollama --> Rest[HA rest_command<br/>POST + Authelia JWT]
    Rest --> Hook[n8n: langgraph-inbox]
    Hook --> LG[langgraph-agents /inbox]
    LG --> Triage[triager classifies]
    Triage -->|capture only| Note[note-maker]
    Triage -->|plan + act| Spec[specialist agent<br/>drafts plan]
    Spec -->|needs input| Zulip[💬 Zulip approval<br/>+ Pushover ping]
    Zulip -->|reply| Receive[approval-receive]
    Receive --> Spec
    Spec --> Done[outcome to vault]
    Note --> Inbox[/vault/inbox/YYYY-MM-DD-…md/]
    Done --> Outputs[/vault/outputs/&#123;drafts,finals&#125;//]
    Inbox --> Couch[(obsidian-couchdb)]
    Outputs --> Couch
    Couch -->|LiveSync| Phone[📱 Obsidian on phone<br/>same vault]

The path

  1. Hold power button. Pixel's "Hold for Assistant" gesture is bound to the HA Companion app as the default digital assistant. The Assist UI opens with the mic hot.
  2. Speak. Audio streams to the cluster; there is no on-phone STT. The trigger phrase is inbox <body>; everything after inbox is the note.
  3. STT in cluster. The Assist pipeline routes the audio to Whisper (wyoming-services, GPU-accelerated on the P40).
  4. Intent + LLM. A sentence trigger matches inbox {content} and hands {content} to conversation.ollama_voice (qwen3:8b on Ollama, tool-calling enabled). The conversation agent's only job here is to confirm the intent and call the rest_command; it does not interpret the content.
  5. Auth'd POST. An HA rest_command POSTs to https://langgraph-inbox.${SECRET_DOMAIN}/webhook with { source:"voice", user:"rob", content:"<transcript>" }. The request carries an Authelia client_credentials JWT issued to a dedicated ha-voice-inbox OIDC client, using the same daily-rotated signing-key machinery the MCP gateway already uses. Envoy's SecurityPolicy validates the JWT against Authelia's JWKS at the gateway.
  6. n8n langgraph-inbox. Normalizes the payload and POSTs to /inbox on langgraph-agents.
  7. Triager classifies. Research question, household errand, homelab change, property task, or note-to-self; the triager then picks the specialist agent.
  8. Capture path → note-maker writes the file to /vault/inbox/YYYY-MM-DD-HHMM-<slug>.md on the langgraph-vault-rw PVC. Single writer, no race with the phone.
  9. Plan-and-act path → specialist drafts a plan into Postgres + a draft under /vault/outputs/drafts/. HITL approval via the existing Zulip + Pushover loop when needed (see triggers above).
  10. Round-trip to the phone. obsidian-couchdb watches the vault PVC and replicates new files through Self-hosted LiveSync, so the note from step 8, plus any drafts/finals from step 9, appears in the Obsidian app on the phone within a sync cycle. Same surface the dictation started on.
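
Steps 2, 5, and 8 are mostly string handling, so they can be illustrated without any of the real services. The sketch below parses the trigger phrase, shapes the JSON body from step 5, and derives a note path in the documented `YYYY-MM-DD-HHMM-<slug>.md` form. The function names, the regular expression, and the slug rules (lowercase, hyphen-separated, 40 characters at most) are assumptions for illustration; in the real pipeline the matching is done by the Home Assistant sentence trigger and the file is written by note-maker.

```python
import re
from datetime import datetime

# Assumed approximation of the 'inbox {content}' sentence trigger.
TRIGGER = re.compile(r"^\s*inbox\s+(?P<content>.+)$", re.IGNORECASE | re.DOTALL)


def parse_inbox(transcript):
    """Return the text after the leading 'inbox' keyword, or None if absent."""
    match = TRIGGER.match(transcript)
    return match.group("content").strip() if match else None


def build_payload(content, user="rob"):
    """Shape the JSON body described in step 5."""
    return {"source": "voice", "user": user, "content": content}


def note_path(content, now):
    """Build /vault/inbox/YYYY-MM-DD-HHMM-<slug>.md (slug rules are assumed)."""
    slug = re.sub(r"[^a-z0-9]+", "-", content.lower()).strip("-")[:40].strip("-")
    return f"/vault/inbox/{now:%Y-%m-%d-%H%M}-{slug or 'note'}.md"


if __name__ == "__main__":
    body = parse_inbox("Inbox buy deck screws")
    print(build_payload(body))
    print(note_path(body, datetime.now()))
```

Requiring whitespace after the keyword keeps a transcript such as "inboxes are full" from being captured as a note.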

The loop closes locally and on one surface: power-button → speak → outcome appears in the vault. Whisper, Ollama, n8n, and the agents all run in the cluster; the only off-site dependency is claude.com if the local fleet escalates a task.

Alert triage (production today)

HolmesGPT is the one agent already running in production:

  • AlertManager → HolmesGPT webhook (via alertmanager-holmesgpt-pushover.json) on every firing alert
  • HolmesGPT queries Prometheus, Loki, and the cluster directly to build a root-cause hypothesis
  • Result posted back as a Pushover message + Zulip thread; n8n sanitizes raw tool-call descriptors out of the agent text before delivery

Current state (2026-05-16)

  • HolmesGPT: live, handling cluster alerts daily.
  • LangGraph fleet: plumbed but cold (ENABLE_CLAUDE_API: false, no production triggers). Gated on NVIDIA Spark / Ascent GX10 arrival (~2026-05-20), which becomes the primary Ollama backend before the fleet goes hot.
  • KubeClaw: running in parallel during the LangGraph transition; scheduled for retirement once LangGraph is validated.

☁️ Cloud dependencies

| Service | Use | Cost |
|---|---|---|
| 1Password | Secret backend for External Secrets | ~$65 / yr |
| Cloudflare | Domain, DNS, tunnel, WAF rate-limiting | Free |
| GitHub | Repo hosting + CI | Free |
| Mailgun | Outbound mail relay (via Maddy) | Free (Flex) |
| Pushover | Push notifications for AlertManager + apps | $10 one-time |
| Frigate+ | Trained ML model for Frigate NVR | $50 / yr |
| AWS S3 Glacier Deep Archive | Offsite DR for Immich + Paperless (objects + DB backups) | ~$1–5 / mo (varies) |
| Total | | ~$10–15 / mo |
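
The monthly figure is consistent with the listed prices if the two annual subscriptions are amortized over twelve months and the one-time Pushover purchase is left out. That amortization is an assumption about how the total was derived, and the names in this sketch are illustrative.

```python
# Recurring costs from the cloud dependencies table, in USD.
ANNUAL = {"1Password": 65.0, "Frigate+": 50.0}
GLACIER_MONTHLY_RANGE = (1.0, 5.0)  # listed as "~$1-5 / mo (varies)"


def monthly_range(annual, variable_range):
    """Amortize annual fees over 12 months and add the variable monthly range."""
    fixed = sum(annual.values()) / 12
    low, high = variable_range
    return round(fixed + low, 2), round(fixed + high, 2)


if __name__ == "__main__":
    print(monthly_range(ANNUAL, GLACIER_MONTHLY_RANGE))
```

Under those assumptions the range works out to roughly $10.58 to $14.58 per month, in line with the stated ~$10–15.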

🛡️ Operational pillars

💾 Tiered storage durability

Four tiers, picked by what the data has to survive: node loss, Ceph loss, cluster loss, or full site loss. Databases get ceph-block + Barman→Garage; irreplaceable state goes to Longhorn with NFS-shipped weekly + monthly backups; S3-shaped workloads use Garage; bulk media rides direct NFS. Full decision tree: .agents/instructions/storage-class.instructions.md.
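
The four-way split can be pictured as a lookup from workload category to storage backend. The sketch below only restates the paragraph; the authoritative decision tree is the instructions file named there. The `Workload` enum, `pick_tier`, and the backend labels are hypothetical and are not necessarily the StorageClass names used in the cluster.

```python
from enum import Enum


class Workload(Enum):
    DATABASE = "database"
    IRREPLACEABLE_STATE = "irreplaceable-state"
    OBJECT_STORE = "s3-shaped"
    BULK_MEDIA = "bulk-media"


# (backend, backup path) pairs paraphrased from the text. None means the
# paragraph names no backup path for that tier, not that none exists.
TIERS = {
    Workload.DATABASE: ("ceph-block", "Barman to Garage"),
    Workload.IRREPLACEABLE_STATE: ("longhorn", "weekly + monthly backups to NFS"),
    Workload.OBJECT_STORE: ("garage", None),
    Workload.BULK_MEDIA: ("nfs", None),
}


def pick_tier(workload):
    """Return (storage backend, backup path or None) for a workload category."""
    return TIERS[workload]
```

For example, `pick_tier(Workload.DATABASE)` yields the ceph-block backend together with its Barman-to-Garage backup path.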

🔐 Secrets: zero plain-text in Git

All 109 ExternalSecrets resolve through External Secrets Operator from 1Password. Application credentials are templated into ExternalSecret resources and never live in YAML. Cross-namespace mirrors use the reflector pattern when consumer charts hard-code secret names.

🪪 Authentication: single sign-on everywhere

Authelia (with LLDAP) is the OIDC identity provider; per-app oauth2-proxy instances enforce auth at Envoy Gateway. 24 apps sit behind SSO today. The mcp-gateway validates Authelia-issued JWTs with a daily-rotated signing key for MCP tooling.

🔭 Observability: metrics, logs, AI triage

kube-prometheus-stack scrapes everything; Loki ingests pod logs; Grafana stitches the dashboards. AlertManager fans alerts to Pushover and to HolmesGPT, which runs LLM-driven root-cause investigation against the cluster and posts findings back via n8n.

🎮 GPU workloads

A single NVIDIA P40 (24 GB VRAM) on worker8 drives Ollama (local LLM), ComfyUI (image gen), Whisper STT, Immich's CLIP face/pet recognition, and the immich-pet-tagger fork pinned to a P40-compatible PyTorch build. Driver lifecycle is handled by the NVIDIA GPU Operator.

🛟 Disaster recovery

Per-app rclone CronJobs ship Immich originals and Paperless documents, plus their Garage-stored Postgres backups, to encrypted AWS S3 with a 1-day Glacier Deep Archive transition. Recovery procedure is documented at Offsite recovery and was last validated 2026-05-05.

🌪️ Strict GitOps

Every change reaches the cluster through Git. Flux suspends are a deliberate manual signal: paused Kustomizations are not "broken," they're intentional pauses for in-flight maintenance and are documented in conventions, not reverted on sight.


📚 Documentation

The full operator handbook lives in the mdBook site: https://rwlove.github.io/home-ops/.

Frequently referenced pages:

Repo-local conventions (auto-loaded by AI agents from .agents/instructions/):

  • Storage class selection · HelmRelease security defaults · ConfigMap layout · Sorting rules · Schema correction · Persona

🙏 Acknowledgements

Inspired by the k8s-at-home community. @whazor maintains the excellent k8s-at-home search, a great way to discover how others configure the same Helm releases.

This repo has been continuously reconciling itself since March 2021.