master1 etcd Disk Swap Plan

Active plan to fix master1's degraded etcd performance by replacing the NVMe under /. Captured 2026-05-05 from a diagnostic session.

Why

master1's etcd is the slow voter and the cause of intermittent apiserver timeouts on this cluster. Confirmed via Prometheus:

10m p99                                    master1 (192.168.1.9)  master2 (192.168.4.9)  master3 (192.168.4.10)
etcd_disk_wal_fsync                        31 ms                  17 ms                  16 ms
etcd_disk_backend_commit                   103 ms                 29 ms                  27 ms
etcd_server_slow_apply_total (cumulative)  67,523                 21,459                 5,738
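
The p99 rows above can be reproduced with a histogram_quantile query over etcd's duration histograms. A minimal sketch, assuming the table rows correspond to the standard etcd metrics etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds, and that Prometheus is reachable at a placeholder PROM_URL (not taken from this cluster):

```shell
#!/bin/sh
# Assumption: PROM_URL is a placeholder; point it at this cluster's Prometheus.
PROM_URL="${PROM_URL:-http://prometheus.example:9090}"

# Build a 10-minute p99 query for an etcd duration histogram.
# $1 is the metric base name without the _bucket suffix.
p99_query() {
    printf 'histogram_quantile(0.99, sum by (instance, le) (rate(%s_bucket[10m])))\n' "$1"
}

# Run an instant query against the Prometheus HTTP API and print the raw JSON.
prom_query() {
    curl -fsS --get "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage (not run automatically):
#   prom_query "$(p99_query etcd_disk_wal_fsync_duration_seconds)"
#   prom_query "$(p99_query etcd_disk_backend_commit_duration_seconds)"
```

Grouping by instance keeps one series per voter, which is what makes the master1 column comparable to the other two.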

Read latency on the same NVMe is sub-millisecond, so the slowness is on the sync-write path: the page cache cannot absorb fdatasync, which has to reach stable storage before it returns. The suspected drive is nvme1n1, a budget HS1TBNVME M2 1TB at 41% wear, hosting the LVM root that contains /var/lib/etcd.
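
The sync-write hypothesis can also be checked on the host itself with a small fdatasync-heavy write probe in fio, a pattern commonly used to approximate etcd's WAL writes. A sketch, assuming fio is installed and that the scratch directory (a hypothetical path) sits on the same filesystem as /var/lib/etcd; the block and file sizes are conventional choices, not values measured on this cluster:

```shell
#!/bin/sh
# Print (DRY_RUN=1, the default) or run (DRY_RUN=0) an fdatasync write probe.
# $1 is a scratch directory on the filesystem under test.
fsync_probe() {
    dir=$1
    set -- fio --name=etcd-fsync-probe --directory="$dir" \
        --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage (hypothetical scratch path; create it first, remove it afterwards):
#   mkdir -p /var/lib/etcd-fsync-probe
#   DRY_RUN=0 fsync_probe /var/lib/etcd-fsync-probe
```

The figure to read is the 99th percentile in fio's sync latency section. Running the same probe on master2 or master3 gives a baseline to compare against.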

Cluster has a leader, zero leader changes/hr, no failed proposals: etcd is stable but degraded. Apiserver-master1 takes the brunt because it always reads from local etcd.

Unrelated note: /dev/sdb on master1 is in XfsShutdown state. It is a secondary SATA disk that fell off the bus. Does not affect etcd (verified: nothing on sdb is mounted; etcd lives on /, OSD-1 lives on nvme0n1p1, Longhorn lives on nvme0n1p2). Investigate on its own schedule.

Drive recommendation

Micron 7450 PRO 480 GB M.2 2280 (MTFDKBA480TFR-1BC1ZABYYR), ~$200 USD.

Why this one:

  • M.2 2280 fits the most common slot (verify before ordering).
  • 480 GB is plenty for / + etcd (etcd DB is ~340 MB).
  • 1 DWPD, 800 TBW endurance, so etcd will not wear it out.
  • Hardware power-loss-protection capacitors, which is the whole point: they let the drive acknowledge sync writes quickly without risking data loss.
  • Gen4×4, negotiates down to Gen3.

Alternatives if master1 has an M.2 22110 slot (longer 110 mm form factor):

  • Samsung PM9A3 960 GB M.2 22110 (~$230–280)
  • Micron 7450 PRO 960 GB M.2 22110 (~$270–370)

See conversation history for retailer links; pricing as of 2026-05-04.

Preflight (before ordering)

  1. Confirm M.2 slot length on master1: check the board manual or look at the board. SSH and lspci -vv will show the PCIe link of the existing drive, but not the physical slot length. 2280 vs 22110 changes the buy.

  2. Confirm where cs_master1-root LVM actually sits:

    pvs && vgs && lvs && lsblk
    

    This verifies the hypothesis that /var/lib/etcd is on nvme1n1. If it turns out to be on nvme0n1 (the WD Blue), the diagnosis changes: the issue would be NVMe contention with Longhorn/Ceph rather than a slow drive, and the fix would be to move workloads rather than replace the drive.

  3. Identify what was on /dev/sdb:

    grep sdb /etc/fstab         # no match is inconclusive if fstab mounts by UUID or LABEL
    blkid /dev/sdb              # may return nothing if device gone
    dmesg -T | grep -E 'sd[ab]|ata|sata'
    

    This is independent of the etcd fix but should be cleaned up eventually.
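
For step 2, the chain from mount point to physical disk can also be resolved directly rather than read off the pvs/lsblk output by eye. A sketch, assuming util-linux findmnt and lsblk are available; the device names in the comments are illustrative, not captured from master1:

```shell
#!/bin/sh
# Read `lsblk -s -n -l -o NAME,TYPE <device>` output on stdin and print
# the whole disks at the bottom of the dependency chain.
backing_disks() {
    awk '$2 == "disk" { print $1 }'
}

# Resolve the block device that backs a path, then walk down to its disks.
disks_for_path() {
    src=$(findmnt -n -o SOURCE -T "$1") || return 1
    lsblk -s -n -l -o NAME,TYPE "$src" | backing_disks
}

# Usage (not run automatically):
#   disks_for_path /var/lib/etcd    # nvme1n1 would support the hypothesis
```

If the result is nvme0n1 instead, the contention diagnosis from step 2 applies and the drive order should wait.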

Replacement options

Two paths once the drive arrives:

Option A: Control-plane replacement

Treat master1 as a control-plane replacement: drain, remove from cluster, reinstall the OS on the new NVMe, rejoin via kubeadm join --control-plane. Symmetrical to Promote Worker to Control Plane: the same Phase 2/3/4 apply, just with master1 instead of worker2.

Pros: clean state, drops accumulated cruft, validates the documented procedure.

Cons: more downtime, more steps.
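
The etcd side of the removal needs master1's member ID, which etcdctl member list prints in its default comma-separated output. A sketch of the outline, assuming etcdctl is already configured with endpoints and certificates and that the etcd member is named master1. It does not replace the phases in Promote Worker to Control Plane.

```shell
#!/bin/sh
# Read `etcdctl member list` (default "simple" output) on stdin and print
# the hex member ID whose name matches $1.
# Line shape: ID, status, name, peer URLs, client URLs, is-learner
member_id_for() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Outline (not run automatically; run from a healthy control-plane node):
#   kubectl drain master1 --ignore-daemonsets --delete-emptydir-data
#   id=$(etcdctl member list | member_id_for master1)
#   etcdctl member remove "$id"
#   kubectl delete node master1
#   ... reinstall the OS on the new NVMe, then on master1:
#   kubeadm join --control-plane ...   # endpoint, token, and cert key per the runbook
```

Removing the etcd member before the reinstall keeps the remaining two voters from waiting on a peer that will come back with a new identity.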

Option B: Clone LVM offline

Boot master1 into rescue/live media, dd or pvmove the existing root LVM from nvme1n1 to the new drive, update bootloader, reboot.

Pros: faster, preserves config exactly.

Cons: carries forward whatever cruft is on /; bootloader fiddling; no validation that the cluster's join procedure works.
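
For the pvmove variant, the LVM side reduces to a short command sequence. A sketch that only prints the plan, assuming the volume group is named cs_master1 (inferred from the cs_master1-root mapping) and using hypothetical partition names. pvmove moves LVM extents only, so /boot and the EFI system partition still need to be copied separately before the bootloader update.

```shell
#!/bin/sh
# Print the LVM commands that migrate a volume group from one PV to another.
# $1 = volume group, $2 = old PV, $3 = new PV. Prints only; nothing is executed.
pvmove_plan() {
    vg=$1 old=$2 new=$3
    cat <<EOF
pvcreate $new
vgextend $vg $new
pvmove $old $new
vgreduce $vg $old
pvremove $old
EOF
}

# Usage (partition names are hypothetical; confirm them with pvs and lsblk):
#   pvmove_plan cs_master1 /dev/nvme1n1p3 /dev/nvme2n1p3
```

Printing the plan first makes it easy to review the device names against the preflight output before anything touches the disks.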

Sequencing constraints

Before doing either replacement option, confirm:

  1. Cluster is otherwise healthy. Same gating conditions as the Promote Worker to Control Plane preconditions: etcd voter health, Ceph HEALTH_OK, no in-flight long storage operations.
  2. Etcd snapshot saved off-host. See the Etcd snapshot section of the same runbook.
  3. Immich offsite seed finished 2026-05-05 (5-file sample round-trip verified the same day). Paperless seed also finished 2026-05-05 (778 MiB across docs + DB). This gating constraint is resolved: the drive swap no longer needs to coordinate around the initial seed. See offsite_recovery.md.
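
The first two gates can be scripted as a quick go/no-go check. A sketch, assuming etcdctl and ceph are configured on the node where it runs. The snapshot path is hypothetical, and the authoritative preconditions remain the ones in the runbook.

```shell
#!/bin/sh
# Read `etcdctl endpoint health --cluster` output on stdin and print the
# number of non-empty lines that do not report "is healthy".
unhealthy_count() {
    awk 'NF && !/is healthy/ { n++ } END { print n + 0 }'
}

# Succeed only when the `ceph health` output passed as $1 starts with HEALTH_OK.
ceph_ok() {
    case "$1" in
        HEALTH_OK*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (not run automatically; the snapshot path is hypothetical):
#   [ "$(etcdctl endpoint health --cluster 2>&1 | unhealthy_count)" -eq 0 ] &&
#       ceph_ok "$(ceph health)" &&
#       etcdctl snapshot save /root/etcd-pre-swap.db
#   # then copy the snapshot off-host
```

Chaining with && means the snapshot is only taken once both health checks pass, so a degraded cluster stops the procedure early.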

After master1 is fixed

In order:

  1. Verify the fix landed. master1's etcd_server_slow_apply_total should grow at the same rate as the other voters over a 10-minute window. If it does, the disk swap solved the etcd problem.
  2. Investigate BLUESTORE_SLOW_OP_ALERT on OSD-1. Separate issue, on nvme0n1. Likely Longhorn ↔ Ceph contention on the same drive. Quantify with iostat and Ceph perf counters before deciding what to do.
  3. Then promote worker2 as a control plane per Promote Worker to Control Plane, retiring the VM master2.
  4. Then repeat for master3 with another bare-metal worker, so all three control planes live in independent failure domains.
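
For step 1, the comparison is between per-voter increases over the window, not the cumulative totals (master1's total stays higher because of its history). In Prometheus this is increase(etcd_server_slow_apply_total[10m]) per instance. The same arithmetic from two manual readings can be sketched as below, with made-up numbers rather than measurements from this cluster:

```shell
#!/bin/sh
# Read lines of "voter before after" (two readings of
# etcd_server_slow_apply_total taken 10 minutes apart) and print each
# voter's increase over the window.
slow_apply_deltas() {
    awk '{ printf "%s %d\n", $1, $3 - $2 }'
}

# Usage with made-up readings (not measurements from this cluster):
#   printf '%s\n' 'master1 100 112' 'master2 50 61' 'master3 20 30' | slow_apply_deltas
```

Similar deltas across all three voters indicate the swap worked. A master1 delta that is still several times larger than the others means the sync-write path is still slow.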