CNPG Cluster Restore Template¶

Per-cluster restore procedure for any of the ~25 CloudNativePG Cluster resources under kubernetes/apps/databases/cloudnative-pg/config/<app>/. Paper runbook — verify against the actual app's cluster.yaml + objectstore.yaml before executing.

When to use this¶

Either the PGData volume is gone (ceph-block PVC lost, Ceph cluster wipe, namespace deleted, etc.) while the Barman ObjectStore in Garage is intact, or the volume is fine but the data must be rolled back to an earlier point in time (Scenario A). Most "the database is broken" cases are NOT this; read the Gotchas section at the bottom first.

What survives in Garage¶

Every CNPG cluster in this repo writes to a paired Barman ObjectStore (barmancloud.cnpg.io/ObjectStore). The bucket layout in Garage:

<cluster-name>/base/<backup-id>/ — full base backups (daily cadence on most clusters, scheduled by ScheduledBackup)
<cluster-name>/wals/ — WAL archive (continuous)

Lose Garage and you lose the recovery target. See garage_restore.md for Garage substrate recovery.
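To make the layout concrete, here is a minimal sketch that prints the two prefixes for one cluster. It assumes the bucket is named cnpg-<app> (as in the cheatsheet below) and that the Barman serverName equals the cluster name; the helper name barman_prefixes is hypothetical. Check the app's objectstore.yaml before relying on either assumption.

```shell
# Print the S3 prefixes a cluster's Barman archive is expected to use.
# Assumption: bucket = cnpg-<app>, serverName = <app>.
barman_prefixes() {
  app="$1"
  printf 's3://cnpg-%s/%s/base/\n' "$app" "$app"
  printf 's3://cnpg-%s/%s/wals/\n' "$app" "$app"
}

barman_prefixes atuin
```

Each printed prefix can be passed to aws s3 ls (with AWS_ENDPOINT_URL_S3 pointing at Garage) to see what actually survives.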

Recovery scenarios¶

Scenario A — point-in-time recovery (PITR)¶

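The PGData is intact but the database state is wrong (bad migration, accidental DELETE, corrupted row). Restore to a specific timestamp by re-bootstrapping the cluster from the ObjectStore. The procedure is destructive: the existing cluster and its PVCs are deleted, and the restored database contains nothing written after the target time.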

Identify the recovery target:

kubectl -n databases get cluster <app> -o yaml | yq '.spec.plugins'
# confirm which ObjectStore (barmanObjectName) the cluster archives to
kubectl -n databases logs <app>-1 -c postgres | grep -i 'archived'
# find the latest WAL timestamp before the bad event
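Timestamps from application logs or chat usually carry a local UTC offset, while the targetTime used in this runbook is written in UTC. A small sketch for converting one to the other, assuming GNU date (the -d flag behaves differently on BSD/macOS); the helper name to_target_time is hypothetical:

```shell
# Convert a timestamp with an explicit UTC offset into the UTC form used for
# recoveryTarget.targetTime. Relies on GNU date's -d parsing.
to_target_time() {
  date -u -d "$1" '+%Y-%m-%d %H:%M:%S.00+00'
}

to_target_time '2026-05-20 16:30:00 +0200'
```

Pick a time safely before the bad event, not the event itself: recovery stops at the target, so anything committed earlier is kept.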

Edit kubernetes/apps/databases/cloudnative-pg/config/<app>/cluster.yaml to add spec.bootstrap.recovery:

spec:
  bootstrap:
    recovery:
      source: <app>
      recoveryTarget:
        targetTime: "2026-05-20 14:30:00.00+00"
  externalClusters:
    - name: <app>
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: <app>
          serverName: <app>

Delete the existing cluster (destructive):

kubectl -n databases delete cluster <app>
# Wait for full teardown; CNPG operator drops the PVCs too.
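"Wait for full teardown" can be scripted rather than eyeballed. The sketch below assumes the PVCs carry the cnpg.io/cluster=<app> label and uses an arbitrary 10-minute timeout; wait_for_teardown is a hypothetical helper name. Depending on the kubectl version, waiting on a resource that is already gone may report an error instead of succeeding.

```shell
# Block until the Cluster CR is gone, then confirm no PVCs remain.
# Assumption: CNPG-created PVCs are labeled cnpg.io/cluster=<app>.
wait_for_teardown() {
  app="$1"
  kubectl -n databases wait --for=delete "cluster/$app" --timeout=600s || return 1
  remaining="$(kubectl -n databases get pvc -l "cnpg.io/cluster=$app" -o name)"
  if [ -n "$remaining" ]; then
    echo "PVCs still present for $app:" >&2
    echo "$remaining" >&2
    return 1
  fi
  echo "teardown complete for $app"
}
```

Run wait_for_teardown <app> and only reconcile Flux once it prints the completion line; a leftover PVC means CNPG may reuse existing PGData and skip recovery.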

Reconcile Flux:

flux reconcile kustomization databases-cloudnative-pg-<app>

Watch the recovery:

kubectl -n databases logs -l cnpg.io/cluster=<app>,cnpg.io/instanceRole=primary -c postgres -f

Recovery applies the base backup + WALs up to targetTime.
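Tailing logs shows progress but not completion. As a rough sketch, the Cluster's Ready condition can be waited on instead; the one-hour timeout is an arbitrary assumption (large clusters may need more) and wait_for_recovery is a hypothetical helper name.

```shell
# Wait for the Cluster to report Ready, then print its status phase.
wait_for_recovery() {
  app="$1"
  kubectl -n databases wait --for=condition=Ready "cluster/$app" --timeout=3600s &&
    kubectl -n databases get "cluster/$app" -o jsonpath='{.status.phase}{"\n"}'
}
```

A non-zero exit means the timeout expired or the wait failed; go back to the pod logs in that case rather than retrying blindly.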

Verify the restore (see the verification checklist), then remove the bootstrap.recovery block in a follow-up PR, or leave it commented out for next time.

Scenario B — total cluster loss¶

PGData PVCs gone (ceph-block wipe, namespace nuked, etc.). Restore the cluster from scratch from the latest base backup + WALs to "now".

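Same as Scenario A, but omit recoveryTarget entirely so that recovery replays the WAL archive to its end. If the app's cluster.yaml already carries a bootstrap.recovery stanza pointing at the barmancloud.cnpg.io/ObjectStore (the convention in this repo), the normal spec can stay in place and CNPG bootstraps from the ObjectStore when the cluster is re-created. Confirm that the stanza exists first: without it CNPG falls back to initdb and creates an empty database.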

Concretely:

Verify the ObjectStore is intact:

kubectl -n databases get objectstore <app>
# Check Garage bucket directly:
AWS_ENDPOINT_URL_S3=https://s3.${SECRET_DOMAIN} \
  aws s3 ls s3://cnpg-<app>/<app>/base/

Reconcile Flux to re-create the Cluster CR. With bootstrap.recovery in the spec and no PGData PVC present, CNPG bootstraps from the ObjectStore (the expected behavior with our setup).
Watch logs as above. Recovery is complete when the cluster pod logs "database system is ready to accept connections".
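Before reconciling, it helps to know which base backup recovery will start from. A sketch that picks the newest backup ID out of aws s3 ls output, assuming Barman backup IDs are timestamp-shaped (for example 20260520T020000) so that a lexicographic sort is also chronological; latest_backup_id is a hypothetical helper name.

```shell
# Read `aws s3 ls <bucket>/<cluster>/base/` output on stdin and print the
# newest backup ID. Prefix lines look like: "   PRE 20260520T020000/".
latest_backup_id() {
  awk '$1 == "PRE" { sub(/\/$/, "", $2); print $2 }' | sort | tail -n 1
}

# Example with canned listing output:
printf '   PRE 20260519T020000/\n   PRE 20260520T020000/\n' | latest_backup_id
```

Empty output means no base backup was found under the prefix; stop there and treat it as Scenario C rather than reconciling.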

Scenario C — Garage substrate gone¶

Both the database AND the Barman ObjectStore in Garage are lost.

For most clusters: full data loss — these apps' state isn't shipped offsite. App reconstruction depends on the app:

atuin — terminal history; gone, restart fresh
home-assistant — see home-assistant-config/ for what's in-Git and what's runtime-only
media-pull-stack apps — full data loss is non-recoverable beyond re-scraping from external sources
medikeep — full loss; manual reentry
pump, sparkyfitness, videodupfinder — application-specific; mostly recoverable from source data
nametag, medialyze, windmill — workflow / metadata; full loss

For Immich and Paperless: see offsite_recovery.md — these two have a separate offsite-backup pipeline to AWS Glacier Deep Archive, restored independently.

Per-cluster cheatsheet¶

Cluster	Garage bucket	Special considerations
`atuin`	`cnpg-atuin`	None — terminal history is ephemeral
`home-assistant`	`cnpg-home-assistant`	Pair with HA backup (`.tar` in vault) for full restore
`immich`	`cnpg-immich`	Also covered by offsite Glacier — see `offsite_recovery.md`
`paperless`	`cnpg-paperless`	Also covered by offsite Glacier — see `offsite_recovery.md`
`langgraph-memory`	`cnpg-langgraph-memory`	memory-mcp's knowledge-graph backend; memory-mcp has been the sole consumer since langgraph-agents was decommissioned on 2026-07-06. 35+ entities; rebuildable but expensive
`windmill`	`cnpg-windmill`	Workflow definitions in Git; secrets in 1P — runtime state can be re-bootstrapped
Others (~25 clusters in total)	`cnpg-<app>`	Follow the template above

Gotchas¶

CNPG Cluster bootstrap.recovery is honored only on initial bootstrap. Once a cluster has PGData, editing the recovery block does nothing. You must delete the cluster first to force a re-bootstrap.
The CNPG operator must be running before you delete a Cluster CR, or the finalizer hangs. Check kubectl -n cnpg-system get pods first.
Barman version compatibility: the Barman Cloud plugin (which provides the ObjectStore CRD) must be a release compatible with the running CNPG operator. Bumping CNPG without bumping the plugin (or vice versa) can render backups unreadable.
The PostgreSQL major version of the backups in the ObjectStore must match the major version of the Cluster's imageName. A PG16 base backup cannot bootstrap a PG17 cluster.
Recovery is slow on large clusters. Immich and Paperless can take 30+ minutes on the base-backup restore alone, before WAL replay starts.
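The major-version gotcha is easy to check mechanically. The sketch below compares the major version in an image tag with a PostgreSQL server_version_num value (for example 160004). It assumes the image is referenced by a tag that starts with the major version (not by digest) and that you have obtained the backup's version number yourself, for instance from the backup's metadata; all three function names are hypothetical.

```shell
# Major version from an image reference such as
# ghcr.io/cloudnative-pg/postgresql:17.2 (tag must start with the major).
pg_major_from_image() {
  tag="${1##*:}"
  echo "${tag%%[!0-9]*}"
}

# Major version from a server_version_num value (PostgreSQL 10 and later).
pg_major_from_version_num() {
  echo $(( $1 / 10000 ))
}

# Compare the two and fail loudly on a mismatch.
check_major_match() {
  img_major="$(pg_major_from_image "$1")"
  bak_major="$(pg_major_from_version_num "$2")"
  if [ "$img_major" = "$bak_major" ]; then
    echo "ok: PG$img_major"
  else
    echo "mismatch: image PG$img_major, backup PG$bak_major"
    return 1
  fi
}
```

For example, check_major_match ghcr.io/cloudnative-pg/postgresql:17.2 160004 reports a mismatch and returns non-zero, which is the signal to pin imageName back to the backup's major version before restoring.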

Verification checklist¶

After any restore:

Cluster pod logs report "database system is ready to accept connections"
Application pod successfully connects (check app logs for ORM connection success)
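Smoke-test a representative query against the application database (without -d, psql connects to the postgres database instead): kubectl -n databases exec -it <app>-1 -c postgres -- psql -d <database> -c "SELECT count(*) FROM <known-table>"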
App-level health endpoint returns 200
Next ScheduledBackup runs successfully (catches the ObjectStore re-write path)
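Part of this checklist can be run as a script. The following is a sketch only: run_check and verify_restore are hypothetical helpers, the 60-second timeout is arbitrary, and the application-level checks (app logs, health endpoint, next ScheduledBackup) are left manual because they differ per app.

```shell
# Run a command quietly and report PASS/FAIL with a description.
run_check() {
  desc="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $desc"
  else
    echo "FAIL  $desc"
    return 1
  fi
}

# Cluster-level checks for one app: Ready condition plus a smoke query.
# Arguments: cluster name, database name, a table known to hold rows.
verify_restore() {
  app="$1"; db="$2"; table="$3"
  failed=0
  run_check "cluster $app is Ready" \
    kubectl -n databases wait --for=condition=Ready "cluster/$app" --timeout=60s || failed=1
  run_check "smoke query on $table" \
    kubectl -n databases exec "$app-1" -c postgres -- \
      psql -d "$db" -c "SELECT count(*) FROM $table" || failed=1
  return "$failed"
}
```

Invoke it as verify_restore <app> <database> <known-table>; both checks always run, so a single invocation shows every failure at once.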