master1 etcd-Disk Swap Plan
Active plan to fix master1's degraded etcd performance by replacing
the NVMe under /. Captured 2026-05-05 from a diagnostic session.
Why
master1's etcd is the slow voter and the cause of intermittent apiserver timeouts on this cluster. Confirmed via Prometheus:
| Metric (10m p99 unless noted) | master1 (192.168.1.9) | master2 (192.168.4.9) | master3 (192.168.4.10) |
|---|---|---|---|
| etcd_disk_wal_fsync | 31 ms | 17 ms | 16 ms |
| etcd_disk_backend_commit | 103 ms | 29 ms | 27 ms |
| etcd_server_slow_apply_total (cumulative count) | 67,523 | 21,459 | 5,738 |
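The p99 figures above can be re-pulled with a query along these lines. This is a sketch, not the exact query used in the diagnostic session: it assumes the standard etcd histogram names (etcd_disk_wal_fsync_duration_seconds, etcd_disk_backend_commit_duration_seconds), and both the PROM_URL value and the function names are placeholders.

```shell
# Hypothetical Prometheus endpoint; replace with the real one for this cluster.
PROM_URL="${PROM_URL:-http://prometheus.example:9090}"

# Build a PromQL expression for the 10-minute p99 of an etcd latency
# histogram, grouped per instance.
# $1: histogram base name, without the _bucket suffix
p99_query() {
  printf 'histogram_quantile(0.99, sum by (instance, le) (rate(%s_bucket[10m])))' "$1"
}

# Run an instant query against the Prometheus HTTP API and print the raw JSON.
# $1: PromQL expression
prom_query() {
  curl -fsS --get "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}
```

Usage: prom_query "$(p99_query etcd_disk_wal_fsync_duration_seconds)", then the same with etcd_disk_backend_commit_duration_seconds.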
Read latency on the same NVMe is sub-millisecond. The slowness is on
the sync-write path: the page cache cannot absorb fdatasync, which must
wait for the device. The suspected drive is nvme1n1, a budget
HS1TBNVME M2 1TB at 41% wear,
hosting the LVM root that contains /var/lib/etcd.
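The sync-write path can be measured directly with fio, independent of etcd. The sketch below uses the commonly cited etcd-style job (small sequential writes, each followed by fdatasync). The function name, the RUN switch, and the scratch directory are illustrative assumptions. The directory must already exist, must sit on the same filesystem as /var/lib/etcd, and must not be etcd's live data directory.

```shell
# Print (or run) an fio job that mimics etcd's WAL pattern: small
# sequential writes, each followed by fdatasync.
# $1: existing scratch directory on the filesystem under test
# Set RUN=1 to execute fio; by default the command is only printed.
etcd_fsync_bench() {
  set -- fio --name=etcd-fsync --directory="$1" --rw=write \
    --ioengine=sync --fdatasync=1 --size=22m --bs=2300
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}
```

Usage: mkdir -p /var/tmp/fio-etcd && RUN=1 etcd_fsync_bench /var/tmp/fio-etcd, then read the fsync/fdatasync percentiles in the fio output. Running the same job on master2 or master3 gives a baseline to compare against.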
The cluster has a leader, zero leader changes per hour, and no failed proposals: etcd is stable but degraded. The apiserver on master1 takes the brunt because it always reads from its local etcd.
Unrelated note: /dev/sdb on master1, a secondary SATA disk that fell
off the bus, is in XfsShutdown state. This does not affect etcd
(verified: nothing on sdb is mounted; etcd lives on /, OSD-1 lives
on nvme0n1p1, Longhorn lives on nvme0n1p2). Investigate on its
own schedule.
Drive recommendation
Micron 7450 PRO 480 GB M.2 2280 (MTFDKBA480TFR-1BC1ZABYYR), ~$200 USD.
Why this one:
- M.2 2280 fits the most common slot (verify before ordering).
- 480 GB is plenty for / + etcd (etcd DB is ~340 MB).
- 1 DWPD, 800 TBW endurance: etcd will not wear it out.
- Hardware power-loss-protection capacitors, which is the whole point.
- Gen4×4, negotiates down to Gen3.
Alternatives if master1 has an M.2 22110 slot (longer 110 mm form factor):
- Samsung PM9A3 960 GB M.2 22110 (~$230–280)
- Micron 7450 PRO 960 GB M.2 22110 (~$270–370)
See conversation history for retailer links; pricing as of 2026-05-04.
Preflight (before ordering)
- Confirm M.2 slot length on master1. SSH in and run lspci -vv to check the PCIe link, but note that it reports link speed and width, not physical slot length; for the length, look at the board or its manual. 2280 vs 22110 changes the buy.
- Confirm where the cs_master1-root LVM actually sits:

      pvs && vgs && lvs && lsblk

  This is to verify the hypothesis that /var/lib/etcd is on nvme1n1. If it turns out to be on nvme0n1 (the WD Blue), the diagnosis changes: the issue would be NVMe contention with Longhorn/Ceph rather than a slow drive. The fix would then be different (move workloads, not the drive).
- Identify what was on /dev/sdb:

      grep sdb /etc/fstab
      blkid /dev/sdb    # may return nothing if device gone
      dmesg -T | grep -E 'sd[ab]|ata|sata'

  This is independent of the etcd fix but should be cleaned up eventually.
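The LVM check in the list above can be narrowed to a single answer. This is a minimal sketch: the helper name is made up, and the volume group name cs_master1 is inferred from the device-mapper name cs_master1-root, so confirm it with vgs first.

```shell
# Print the physical volumes that back a given volume group.
# stdin: output of `pvs --noheadings -o pv_name,vg_name`
# $1: volume group name
pvs_for_vg() {
  awk -v vg="$1" '$2 == vg { print $1 }'
}
```

Usage: pvs --noheadings -o pv_name,vg_name | pvs_for_vg cs_master1. If the result is a partition on nvme1n1, the hypothesis holds. findmnt -no SOURCE --target /var/lib/etcd shows which mapped device the etcd directory actually lives on.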
Replacement options
Two paths once the drive arrives:
Option A: Reinstall fresh (recommended)
Treat master1 as a control-plane replacement: drain, remove from
cluster, reinstall the OS on the new NVMe, rejoin via kubeadm join --control-plane.
This mirrors Promote Worker to Control Plane: the same Phases 2/3/4 apply, just with master1 instead of worker2.
Pros: clean state, drops accumulated cruft, validates the documented procedure.
Cons: more downtime, more steps.
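For orientation, the Option A sequence might look like the dry-run plan below. This is an assumption-laden sketch, not a substitute for the Promote Worker to Control Plane runbook: the function name is hypothetical, the ordering is one reasonable choice, and it only prints commands so they can be reviewed first.

```shell
# Print the removal and rejoin commands for a control-plane node
# without running any of them.
# $1: node name
# $2: etcd member ID, as shown by `etcdctl member list`
print_replace_plan() {
  cat <<EOF
kubectl drain $1 --ignore-daemonsets --delete-emptydir-data
etcdctl member remove $2
kubectl delete node $1
# reinstall the OS on the new NVMe, then on a healthy control plane:
kubeadm init phase upload-certs --upload-certs
kubeadm token create --print-join-command
# on $1: run the printed join command with --control-plane --certificate-key <key>
EOF
}
```

Usage: print_replace_plan master1 <member-id>. Removing the etcd member before the reinstall keeps the remaining two voters from waiting on a peer that will never return with the same identity.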
Option B: Clone LVM offline
Boot master1 into rescue/live media, dd or pvmove the existing
root LVM from nvme1n1 to the new drive, update bootloader, reboot.
Pros: faster, preserves config exactly.
Cons: carries forward whatever cruft is on /; bootloader fiddling;
no validation that the cluster's join procedure works.
Sequencing constraints
Before doing either replacement option, confirm:
- Cluster is otherwise healthy. Same gating conditions as the Promote Worker to Control Plane preconditions: etcd voter health, Ceph HEALTH_OK, no in-flight long storage operations.
- Etcd snapshot saved off-host. See the Etcd snapshot section of the same runbook.
- Offsite seeds are complete. The Immich offsite seed finished
2026-05-05 (5-file sample round-trip verified the same day), and the
Paperless seed also finished 2026-05-05 (778 MiB across docs + DB).
This gating constraint is resolved: the drive swap no longer needs to
coordinate around the initial seed. See offsite_recovery.md.
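The first two gates above can be scripted roughly as follows. This is a sketch under assumptions: the function names are illustrative, and the certificate paths are the kubeadm defaults, which may differ on this cluster. The Etcd snapshot section of the runbook remains the authority.

```shell
# Succeed only if a `ceph health` result reports HEALTH_OK.
# $1: the output of `ceph health`
ceph_is_ok() {
  case "$1" in
    HEALTH_OK*) return 0 ;;
    *) return 1 ;;
  esac
}

# Save an etcd snapshot from a control-plane node, using the default
# kubeadm certificate paths (assumed, verify before relying on them).
# $1: destination file; copy it off-host afterwards
etcd_snapshot() {
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    snapshot save "$1"
}
```

Usage: ceph_is_ok "$(ceph health)" && etcd_snapshot /root/etcd-pre-swap.db, where the destination path is only an example. Voter health can be checked with etcdctl endpoint health --cluster using the same certificate flags.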
After master1 is fixed
In order:
- Verify the fix landed. master1's etcd_server_slow_apply_total should grow at the same rate as the other voters over a 10-minute window. If yes, the disk swap solved the etcd problem.
- Investigate BLUESTORE_SLOW_OP_ALERT on OSD-1. Separate issue, on nvme0n1. Likely contention between Longhorn and Ceph on the same drive. Quantify with iostat and Ceph perf counters before deciding what to do.
- Then promote worker2 as a control plane per Promote Worker to Control Plane, retiring the VM master2.
- Then repeat for master3 with another bare-metal worker, so all three control planes live in independent failure domains.
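The first check in the list above (counter growth over a 10-minute window) can be done from two saved scrapes per voter. A minimal sketch follows; the helper names and the tolerated ratio are assumptions, since the plan only says "the same rate".

```shell
# Extract the etcd_server_slow_apply_total counter from Prometheus
# text-format metrics on stdin (e.g. a saved scrape of etcd's /metrics).
slow_apply_total() {
  awk '$1 == "etcd_server_slow_apply_total" { print $2 }'
}

# Compare counter growth between master1 and a reference voter.
# $1 $2: master1 counter at the start and end of the window
# $3 $4: reference voter counter at the start and end of the window
# $5: tolerated ratio (2 means master1 may grow up to 2x the reference)
# Exit status 0 if master1's growth is within the tolerated ratio.
growth_ok() {
  awk -v a0="$1" -v a1="$2" -v b0="$3" -v b1="$4" -v r="$5" \
    'BEGIN { exit !((a1 - a0) <= r * (b1 - b0)) }'
}
```

Usage: take a scrape on each voter, wait 10 minutes, take another, then run growth_ok with the four extracted values. In Prometheus, increase(etcd_server_slow_apply_total[10m]) per instance gives the same comparison without the manual scrapes.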