Rook-Ceph DR Runbook
Procedural recovery for Rook-Ceph failures in this cluster. This is a
paper runbook: the commands are documented but not rehearsed. Verify
each one against kubectl rook-ceph -- ... before running any
destructive step.
Cluster baseline
- Deployment: kubernetes/apps/rook-ceph/rook-ceph/
- ~8 OSDs spread across worker nodes, 3-way replication on the ceph-blockpool (per storage-class.instructions.md)
- Mons live on dedicated control-plane / mon nodes per the CephCluster CR
- Toolbox pod: rook-ceph-tools-* in the rook-ceph namespace. Example: kubectl exec -it -n rook-ceph deploy/rook-ceph-tools -- ceph -s
Tier 1 — Single OSD loss
Symptom: ceph -s shows HEALTH_WARN, one OSD down/out,
recovery in progress.
Diagnose:
kubectl rook-ceph -n rook-ceph -- ceph osd tree
kubectl rook-ceph -n rook-ceph -- ceph -s
kubectl -n rook-ceph get pods -l app=rook-ceph-osd | grep -v Running
Recover:
- If the OSD pod is CrashLoopBackOff and the underlying disk is intact, delete the pod; it usually self-heals once Rook re-creates it. Wait 5 minutes before escalating.
- If the disk is dead, mark the OSD out. Wait for ceph -s to show recovery_io complete (can run hours at ~683 KB/s; see the docs/src/cluster_upgrade.md note about global recovery events). Then purge the OSD.
- Edit the CephCluster CR to remove the dead device, or replace the disk and let Rook re-provision via auto-discovery.
- Verify replication health: ceph -s reports HEALTH_OK and zero misplaced or degraded PGs.
With 3-way replication and 8 OSDs, losing one OSD is fully tolerated. The recovery window is the only at-risk period; don't lose a second OSD during it.
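The mark-out and purge steps in the recovery list can be sketched as below. This is a minimal, unrehearsed sketch: ceph osd out and ceph osd purge are standard Ceph commands, and the toolbox invocation is the one from the cluster baseline, but the wrapper names (ceph_tools, osd_mark_out, osd_purge), the APPLY switch, and the OSD id are hypothetical and not part of this repo. By default the functions only print the command they would run.

```shell
#!/bin/sh
# Sketch only. Helper names and the APPLY switch are illustrative assumptions.

# Print the ceph command by default; set APPLY=1 to run it in the toolbox pod.
ceph_tools() {
  if [ "${APPLY:-0}" = "1" ]; then
    kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph "$@"
  else
    echo "ceph $*"
  fi
}

# Step 1: mark the dead OSD out so its data is re-replicated elsewhere.
# Usage: osd_mark_out <numeric-osd-id>
osd_mark_out() {
  ceph_tools osd out "osd.$1"
}

# Step 2, only after recovery has finished: remove the OSD from the
# CRUSH map, delete its auth key, and free its id.
# Usage: osd_purge <numeric-osd-id>
osd_purge() {
  ceph_tools osd purge "$1" --yes-i-really-mean-it
}
```

For example, osd_mark_out 3 prints ceph osd out osd.3. Rerun with APPLY=1 only after confirming the id against ceph osd tree.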
Tier 2 — Mon quorum loss
Symptom: ceph -s hangs or reports mon election; pods using
ceph-block PVCs get IO errors.
Diagnose:
kubectl -n rook-ceph get pods -l app=rook-ceph-mon
kubectl -n rook-ceph logs deploy/rook-ceph-operator | tail -100
Recover:
This cluster runs an odd-numbered mon set (3 mons). Quorum requires 2 of 3 alive.
- If 2/3 mons alive: quorum holds. Identify the dead mon's node, fix the node (reboot, replace disk), and let Rook re-provision.
- If 1/3 mons alive: quorum lost. Rook's monMaxOSDChange safeguard kicks in. Recovery path:
  - Identify the surviving mon (ceph mon stat).
  - Edit the cephclusters.ceph.rook.io CR to reduce mon count to 1 temporarily.
  - Rook re-elects with the surviving mon as quorum-of-1.
  - Add mons back one at a time, verifying quorum at each step.
- If 0/3 mons alive: disaster; see Tier 3.
The Rook-Ceph upstream toolbox has explicit mon-recovery scripts
(rook-ceph-mon-quorum-recovery). Use those before manual quorum
edits. Reference:
https://rook.io/docs/rook/v1.18/Storage-Configuration/Advanced/ceph-mon-health/
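The quorum rule this tier relies on (a strict majority of the configured mons) can be written out as a small helper. This is an illustrative sketch; the function names are hypothetical and nothing here talks to the cluster.

```shell
#!/bin/sh
# Strict majority for N configured mons: floor(N / 2) + 1.
# Usage: mon_quorum_needed <total-mons>
mon_quorum_needed() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit 0) if <alive> mons out of <total> still form a majority.
# Usage: mon_has_quorum <total-mons> <alive-mons>
mon_has_quorum() {
  [ "$2" -ge "$(mon_quorum_needed "$1")" ]
}
```

With 3 mons, mon_quorum_needed 3 prints 2, so mon_has_quorum 3 2 succeeds and mon_has_quorum 3 1 fails. These match the 2/3 and 1/3 cases above.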
Tier 3 — Full Ceph cluster loss
Symptom: all mons dead, OSDs dead, or the rook-ceph namespace itself trashed.
Recover:
ceph-block is the in-cluster durable tier, not the
cluster-loss-survivable tier. Per storage-class.instructions.md,
data on ceph-block does NOT survive a Ceph cluster wipe unless it
was also being shipped offsite.
What survives:
- CNPG cluster data: yes, via Barman ObjectStore backups to Garage. Every CNPG Cluster in this repo has a paired barmancloud.cnpg.io/ObjectStore. Restore path: see cnpg_restore.md.
- Longhorn-backed apps: yes, via the backup target on beast NFS. Restore path: see longhorn_restore.md.
- Garage substrate: yes, via the NFS substrate (Garage's storage isn't ceph-block-backed). Garage stays intact.
- Anything else on ceph-block: no. Application configuration on ceph-block is regenerable (per the storage-class doc), so the app will reconstruct on next pod start.
Steps:
- Confirm no path to recover — Tier 1 / Tier 2 procedures exhausted.
- Suspend the rook-ceph-cluster Flux Kustomization to prevent automatic re-reconciliation while you assess.
- Drain workloads using ceph-block PVCs: list them with kubectl get pvc -A -o yaml | yq '.items[] | select( .spec.storageClassName == "ceph-block") | .metadata.namespace + "/" + .metadata.name', then scale the owning workloads to zero.
- Tear down the existing rook-ceph deployment per https://rook.io/docs/rook/latest-release/Getting-Started/ceph-teardown/ in this order: Cluster, OperatorConfig, then delete the namespace.
- Wipe the OSD disks on each worker (Rook-Ceph teardown does NOT wipe disks; you must run dd if=/dev/zero of=/dev/<osd-disk> bs=1M count=100 on each before re-deploying).
- Re-deploy rook-ceph: unsuspend the Kustomization, let Flux reconcile a fresh CephCluster.
- Wait for HEALTH_OK + zero PGs in unknown state.
- Restore data per the per-tier runbooks (CNPG → cnpg_restore.md; Longhorn → longhorn_restore.md).
- Scale the drained workloads back. Apps regenerate config on first start.
Expect Tier 3 recovery to take 4–8 hours total: ~30 min teardown, ~30 min fresh deploy, the rest is data restore + service smoke.
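The disk-wipe step can be sketched as a dry-run helper that prints the commands for one device instead of running them. The dd line is the one from the step list. The sgdisk --zap-all and partprobe lines are an assumption drawn from the upstream Rook teardown guide and should be checked against that page. The function name and the device path in the example are hypothetical.

```shell
#!/bin/sh
# Sketch only: prints wipe commands for one device and never runs them.
# Review the output, then run it on the node that owns the disk.
# Usage: print_wipe_cmds /dev/<osd-disk>
print_wipe_cmds() {
  disk=$1
  case "$disk" in
    /dev/*) ;;
    *)
      echo "refusing: '$disk' is not a /dev path" >&2
      return 1
      ;;
  esac
  # Assumption (not in this runbook): clear GPT/MBR partition tables first.
  echo "sgdisk --zap-all $disk"
  # From the step list: zero the first 100 MiB to destroy Ceph metadata.
  echo "dd if=/dev/zero of=$disk bs=1M count=100"
  # Assumption (not in this runbook): ask the kernel to re-read the table.
  echo "partprobe $disk"
}
```

Printing rather than executing keeps the destructive step reviewable. A mistyped device path shows up in the output before anything is overwritten.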
Common gotchas
- Drain hangs ~indefinitely on rook-ceph-osd-host-* PDB. A previous node's OSD is still degraded; Rook's per-host PDB blocks all OSD evictions cluster-wide. See docs/src/cluster_upgrade.md. Two options: wait for ceph -s HEALTH_OK + zero remapped/misplaced PGs, OR kubectl delete pod -n rook-ceph rook-ceph-osd-N-... directly (bypasses eviction API; safe with 8 OSDs + 3-way replication for the ~10-min recovery window).
- OSD pods stuck on Init:0/N. Usually an lvm tag mismatch on the disk after a hard power loss. Run ceph-volume lvm zap --osd-id <id> via rook-ceph-tools to clear it, then let Rook re-provision.
- HEALTH_WARN: mons are allowing insecure global_id reclaim. Cosmetic; clear with ceph config set mon auth_allow_insecure_global_id_reclaim false.
Verification checklist
After any Tier 1/2/3 recovery:
- ceph -s → HEALTH_OK
- ceph osd tree → all expected OSDs up/in
- ceph -s shows zero degraded, misplaced, inactive PGs
- A representative ceph-block PVC mounts cleanly (kubectl run dr-verify --image=alpine -- sh -c "sleep 60" with a PVC attached)
- No CephClusterErrorState PrometheusAlert firing
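The two ceph -s items in the checklist can be approximated with a text filter over the plain ceph -s output. This is a rough sketch, not a tested health gate: it assumes the human-readable output contains the literal HEALTH_OK and uses the words degraded, misplaced, or inactive when PGs are unhealthy. The function name is hypothetical.

```shell
#!/bin/sh
# Reads plain-text `ceph -s` output on stdin.
# Exits 0 only if HEALTH_OK is present and none of the words
# degraded / misplaced / inactive appear (case-insensitive).
ceph_status_clean() {
  status=$(cat)
  printf '%s\n' "$status" | grep -q 'HEALTH_OK' || return 1
  if printf '%s\n' "$status" | grep -Eqi 'degraded|misplaced|inactive'; then
    return 1
  fi
  return 0
}
```

A possible invocation, using the toolbox command from the cluster baseline: kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph -s | ceph_status_clean && echo clean.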
See also
- docs/src/cluster_upgrade.md: node-drain interactions with OSD PDB
- cnpg_restore.md: restore CNPG data after ceph-block loss
- longhorn_restore.md: restore Longhorn-backed apps
- garage_restore.md: restore Garage (NFS-substrate, doesn't depend on ceph)
- Memory: [[reference_predict_linear_sparse_data_false_positives]]: PR #11755 disk-fill alerting context