Promote a Worker to a Schedulable Control Plane
End-to-end procedure for converting a bare-metal worker into a third (or replacement) control-plane node, while removing an existing control plane (typically a VM master that needs to retire).
The worked example below promotes worker2.${SECRET_DOMAIN} and
retires the VM master2.${SECRET_DOMAIN}. Substitute names freely;
nothing in the procedure is specific to those hosts.
Why this exists
The control plane in this cluster runs three etcd voters. When two of those live on the same physical host (e.g. master2 + master3 as VMs on one box), one hardware failure costs you etcd quorum. Promoting a bare-metal worker into the control-plane pool fixes that without changing the etcd voter count.
The promoted node stays schedulable: control-plane and worker roles are stacked on the same host, the same way master1 already runs in this cluster.
What you need to know first
This cluster is kubeadm + kube-vip + cri-o + Cilium on CentOS Stream. Relevant facts:
- Control-plane VIP is 192.168.6.1:6443 (kube-vip). Clients keep their kubeconfigs unchanged through this whole procedure.
- kube-vip is a static pod under /etc/kubernetes/manifests/kube-vip.yaml on every control-plane node. New control planes need this file themselves; see init/kube-vip.sh for the canonical generator.
- kubelet-csr-approver is running, so node CSRs auto-approve. Manual approval (init/approve-csrs.sh) is the fallback.
- The kubeadm-config ConfigMap in kube-system carries the list of control-plane endpoints. Old members must be removed from it.
Hostname strategy
"Convert worker2 to master2" can mean two things. Default to (a) unless you have a concrete reason to rename:
- (a) Promote in place. Keep the hostname (worker2.${SECRET_DOMAIN}). The master/worker prefix is purely cosmetic; the node-role.kubernetes.io/control-plane label is what matters.
- (b) Rename to masterN.${SECRET_DOMAIN}. Requires regenerating client certs, a new etcd member name, a new kube-vip pod identity, and DNS coordination. High risk, no functional gain.
The procedure below assumes (a).
Preconditions
Do not start until all of these are true:
- Surviving etcd voters are healthy as voters. This is the load-bearing check, because Phase 1 takes you to a 2/2 etcd quorum. What matters is write performance, not just "is the node Ready":
  - On each surviving control-plane node, the etcd container should show no recent "apply request took too long" or "waiting for ReadIndex response took too long" warnings:
        kubectl -n kube-system logs etcd-<node> --since=1h \
          | grep -E 'took too long|ReadIndex'
    A handful per hour is normal; sustained streams mean the voter is limping and you should not start the procedure.
  - On each surviving control-plane node, check disk latency on the device hosting /var/lib/etcd (often the OS LVM root, not necessarily the same disk as Longhorn or Ceph): run iostat -xz 5 and watch await and %util. Sustained await above a few ms on the etcd device is a problem; etcd wants single-digit-ms fsync.
  - A failed secondary disk on a control-plane node (e.g. a SATA drive that dropped off the bus while etcd lives on NVMe) is not an automatic blocker, but check whether the failure is causing bus resets or kernel I/O retries that bleed into the etcd device.
- Ceph is HEALTH_OK. Slow-OSD alerts on a control-plane node may or may not indicate disk failure (Ceph and Longhorn often share an NVMe, and contention can produce slow ops without a dying disk). Either way, do not stack: an unhealthy Ceph plus a transient 2-voter window plus an OSD coming out for the worker drain is too many concurrent degradations.
- No long-running storage operations are in flight that touch the worker being drained, e.g. an Immich offsite seed, a CNPG full backup, a Longhorn rebuild.
- Etcd snapshot saved off-host. See Etcd snapshot below.
- Decision made on the retiring control plane. Either remove it first (briefly run on 2 voters), or defer it until after the new control plane is healthy (briefly run on 4 voters). Removing first is preferred: a 4-member etcd raises the write quorum from 2 to 3 without improving fault tolerance.
- Kubelet/kubeadm version match. Confirm the worker's kubelet and kubeadm packages match what's currently on the control plane. Mismatched skew causes join to fail mid-handshake.
- Firewalld is disabled on the worker (per cluster policy: firewalld silently drops kubelet traffic post-reboot).
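Two of the checks above (the etcd slow-request warnings and the kubelet/kubeadm version match) are easy to script. The following is a minimal sketch, not part of the repository's init/ scripts; the function names count_slow_etcd_warnings and versions_match are hypothetical, and the threshold you act on is still a judgment call.

```shell
# Count etcd slow-request warnings in a log stream read from stdin.
# Matches the two messages named in the preconditions; prints one integer.
count_slow_etcd_warnings() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -cE 'apply request took too long|waiting for ReadIndex response took too long' || true
}

# Succeed only when two version strings are identical once a leading "v"
# has been stripped from each (e.g. "v1.30.2" vs "1.30.2").
versions_match() {
  [ "${1#v}" = "${2#v}" ]
}

# Usage on a live cluster (not executed here; <node> is a placeholder):
#   kubectl -n kube-system logs etcd-<node> --since=1h | count_slow_etcd_warnings
#   versions_match \
#     "$(kubectl get node <node> -o jsonpath='{.status.nodeInfo.kubeletVersion}')" \
#     "$(kubelet --version | awk '{print $2}')"   # second command runs on the worker
```

A count in the single digits per hour is consistent with "a handful"; a count in the hundreds is the sustained stream the precondition warns about.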
Etcd snapshot
Take from a healthy member (typically the leader; identify with etcdctl endpoint status --cluster):

    kubectl -n kube-system exec etcd-master1.${SECRET_DOMAIN} -- \
      etcdctl --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/peer.crt \
        --key=/etc/kubernetes/pki/etcd/peer.key \
        snapshot save /var/lib/etcd-backup/pre-cp-swap-$(date +%F).db

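Copy the snapshot off the host before continuing. A stock kubeadm etcd pod mounts only /var/lib/etcd and /etc/kubernetes/pki/etcd from the host, so /var/lib/etcd-backup must be an extra hostPath mount for the file to land on the host at all.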
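"Copied off the host" is worth verifying rather than assuming. A minimal sketch, assuming sha256sum is available on both ends; the helper names, the backup host, and the paths in the usage comment are hypothetical:

```shell
# Print only the SHA-256 digest of a file.
sha256_of() {
  sha256sum "$1" | awk '{print $1}'
}

# Succeed when the local copy ($2) is non-empty and its digest equals the
# digest ($1) computed on the host where the snapshot was taken.
snapshot_matches() {
  [ -s "$2" ] && [ "$(sha256_of "$2")" = "$1" ]
}

# Usage (hypothetical host and paths, not executed here):
#   expected=$(ssh master1 "sha256sum /var/lib/etcd-backup/pre-cp-swap.db" | awk '{print $1}')
#   scp master1:/var/lib/etcd-backup/pre-cp-swap.db /backups/
#   snapshot_matches "$expected" /backups/pre-cp-swap.db && echo "snapshot copy verified"
```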
Audit what the worker is carrying
Before you drain, inventory anything hardware-pinned or single-replica:
- USB devices (Z-Wave dongles, etc.) attached to the worker pin pods to it via NFD usb-* feature labels. The hardware stays with the box; if you reuse the box for control plane, those pods return after rejoin. If you physically move the box, plan to move the dongles too.
- GPU resources (gpu.intel.com/i915, NVIDIA, Coral): same story. They stay with the box and are re-published after rejoin.
- Custom node labels (node.network/vlan-*, app-specific pinning labels): capture them now, re-apply in Phase 4.
- CNPG primaries: failover-friendly, but you may want to manually promote a replica off the worker first to control timing.
- Single-replica controllers: istiod, MCP gateways, etc. These will reschedule but cause brief disruption to mesh-injection or tool calls.
- Ceph OSD on the worker: needs ceph osd out N and a rebalance before drain.
- Ceph mon on the worker: Rook will recreate it on a different node automatically, but the cluster runs at 2/3 mons in the gap.
Capture the worker's labels:

    kubectl get node worker2.${SECRET_DOMAIN} -o yaml > /tmp/worker2-pre-drain.yaml

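The full YAML dump mixes manual labels with ones that NFD and the kubelet regenerate on their own. A minimal sketch for isolating the manual ones, assuming that every auto-managed label lives under kubernetes.io/ or a *.kubernetes.io/ prefix; filter_manual_labels and the output path are hypothetical names:

```shell
# Read "key=value" label lines on stdin and drop every label whose key is in
# the kubernetes.io/ or *.kubernetes.io/ namespace (NFD features, node roles,
# kubelet-managed labels). What remains is assumed to be hand-applied.
filter_manual_labels() {
  grep -vE '^([a-z0-9.-]+\.)?kubernetes\.io/' || true
}

# Usage (not executed here):
#   kubectl get node worker2.${SECRET_DOMAIN} \
#     -o go-template='{{range $k, $v := .metadata.labels}}{{$k}}={{$v}}{{"\n"}}{{end}}' \
#     | filter_manual_labels > /tmp/worker2-manual-labels.txt
```

The resulting file is a ready-made argument list for kubectl label in Phase 4.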
Phase 1: Retire the outgoing control plane
Skip this phase if you're adding a 4th control plane and removing the old one later.
1. Drain the outgoing master:
       kubectl drain master2.${SECRET_DOMAIN} --ignore-daemonsets --delete-emptydir-data
2. Reset kubeadm state on the host:
       ssh master2 'kubeadm reset --force && systemctl stop kubelet'
   Do not wipe /var/lib/etcd by hand yet; it's your in-place rollback if the new control plane fails to join. Be aware that kubeadm reset on a stacked-etcd node normally tries to remove the local etcd member and clears the etcd data directory itself, so copy /var/lib/etcd aside before the reset if you are counting on it for rollback.
3. Remove the etcd member if it is still listed (run from a remaining control plane):
       etcdctl ... member list                  # find master2's member ID
       etcdctl ... member remove <member-id>
4. Update kubeadm bookkeeping:
       kubectl -n kube-system edit cm kubeadm-config    # drop master2 endpoint
       kubectl delete node master2.${SECRET_DOMAIN}
5. Power down the VM. Keep the VM image around for 24h as rollback.
You're now on 2 control planes. Etcd quorum is 2/2; any further loss fails writes. Don't dawdle.
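The 2/2 figure follows from etcd's majority rule: an n-member cluster needs floor(n/2) + 1 votes to commit a write. A small sketch of that arithmetic (the function names are illustrative only):

```shell
# Votes required to commit a write in an n-member etcd cluster.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster stays writable.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 2 3 4; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerated_failures=$(etcd_fault_tolerance "$n")"
done
```

With 2 members the cluster tolerates no failures, and 4 members tolerate the same single failure as 3 while needing one more vote per write, which is why the 2-voter window should be short and the 4-voter alternative buys nothing.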
Phase 2: Drain the worker cleanly
1. Optionally suspend Flux for noisy releases that you don't want re-reconciling mid-drain:
       flux suspend kustomization <name>
2. Cordon:
       kubectl cordon worker2.${SECRET_DOMAIN}
3. Out the OSD and wait for Ceph to rebalance (kubectl patch has no --replicas flag, so the replica count goes in the patch body only):
       kubectl -n rook-ceph patch deploy rook-ceph-osd-N \
         --type=merge -p '{"spec":{"replicas":0}}'
       # from a working ceph CLI (e.g. rook-ceph-tools pod):
       ceph osd out N
       ceph -w    # wait for HEALTH_OK and PGs active+clean
4. Move CNPG primaries off the worker:
       kubectl cnpg promote <cluster> <healthy-replica-instance>
   Or accept the brief failover window during drain.
5. Drain:
       kubectl drain worker2.${SECRET_DOMAIN} \
         --ignore-daemonsets --delete-emptydir-data --force
6. Remove from the cluster:
       kubectl delete node worker2.${SECRET_DOMAIN}
7. On the host, reset kubeadm state but preserve Longhorn data:
       kubeadm reset --force
       rm -rf /etc/kubernetes /var/lib/etcd /var/lib/kubelet
       # leave /var/lib/longhorn alone; replicas re-attach after rejoin
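Watching ceph -w by hand works, but the wait for HEALTH_OK can also be scripted so the drain does not start early. A minimal sketch, assuming it runs where the ceph CLI is available (e.g. inside the rook-ceph-tools pod); wait_for_ceph_ok is a hypothetical helper, and the default attempt count and interval are arbitrary:

```shell
# Poll `ceph health` until it reports HEALTH_OK.
# $1 = maximum attempts (default 60), $2 = seconds between attempts (default 30).
# Returns 0 once healthy, 1 if the attempts run out.
wait_for_ceph_ok() {
  ceph_attempts=${1:-60}
  ceph_interval=${2:-30}
  ceph_try=0
  while [ "$ceph_try" -lt "$ceph_attempts" ]; do
    case "$(ceph health 2>/dev/null)" in
      HEALTH_OK*) return 0 ;;
    esac
    ceph_try=$((ceph_try + 1))
    sleep "$ceph_interval"
  done
  return 1
}

# Usage (not executed here):
#   ceph osd out N
#   wait_for_ceph_ok 120 30 || echo "Ceph did not settle; do not drain yet" >&2
```

It is still worth confirming with ceph -s that all PGs are active+clean before moving on to the drain.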
Phase 3: Rejoin as a control plane
1. From an existing control plane, generate fresh join material:
       CERT_KEY=$(kubeadm init phase upload-certs --upload-certs \
         --config /path/to/init/clusterconfiguration.yaml | tail -1)
       JOIN=$(kubeadm token create --print-join-command)
       echo "$JOIN --control-plane --certificate-key $CERT_KEY"
2. On the worker (still its original hostname), preflight the kernel:
       modprobe br_netfilter
       echo 1 > /proc/sys/net/ipv4/ip_forward
3. Run the printed join command. After it completes:
       mkdir -p /etc/kubernetes/manifests
4. Drop a kube-vip static pod manifest into /etc/kubernetes/manifests/kube-vip.yaml. Mirror what init/kube-vip.sh produces, but verify the INTERFACE env var matches this host's NIC (master1 uses enp0s31f6; the new control plane may differ). Without kube-vip on the new master, the VIP just stays put on the survivors: non-fatal, but you lose the new node from leader election.
5. Approve any pending CSRs (kubelet-csr-approver should auto-approve; check anyway):
       kubectl get csr | grep Pending
       ./init/approve-csrs.sh    # if needed
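The join material in step 1 relies on tail -1 picking up the certificate key, which silently produces a broken command if kubeadm prints anything after it. A minimal sketch of a guard, assuming the key is the 64-character lowercase hex string that kubeadm currently emits; both function names are hypothetical:

```shell
# Succeed when $1 looks like a kubeadm certificate key
# (assumed here to be a hex-encoded 32-byte value, i.e. 64 hex characters).
is_cert_key() {
  printf '%s' "$1" | grep -Eq '^[0-9a-f]{64}$'
}

# Print the control-plane join command, or fail if the key is malformed.
# $1 = output of `kubeadm token create --print-join-command`
# $2 = certificate key from `kubeadm init phase upload-certs --upload-certs`
control_plane_join_cmd() {
  is_cert_key "$2" || { echo "unexpected certificate key: $2" >&2; return 1; }
  printf '%s --control-plane --certificate-key %s\n' "$1" "$2"
}

# Usage (not executed here):
#   control_plane_join_cmd "$JOIN" "$CERT_KEY"
```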
Phase 4: Make it schedulable + restore roles
1. Remove the control-plane NoSchedule taint:
       kubectl taint nodes worker2.${SECRET_DOMAIN} \
         node-role.kubernetes.io/control-plane:NoSchedule-
2. Re-apply node labels you captured in Audit. NFD will re-publish hardware feature labels on its own; you only need to re-apply the manual ones:
       kubectl label nodes worker2.${SECRET_DOMAIN} \
         node.longhorn.io/create-default-disk=true \
         node.network/vlan-iot=true \
         node.network/vlan-security=true
3. Recreate Ceph OSD. With useAllNodes/useAllDevices enabled, Rook picks the disk back up automatically. Otherwise add the host to the CephCluster nodes list.
4. Resume any Flux kustomizations you suspended in Phase 2.
Phase 5: Verify
- kubectl get nodes -o wide: the promoted node has the control-plane role and is Ready.
- flux get all -A | grep -v True is empty (or shows only known-suspended entries).
- Etcd shows three members:
      etcdctl ... member list
      etcdctl ... endpoint status --cluster -w table
- Ceph: ceph -s reports HEALTH_OK, all OSDs in/up, all PGs active+clean.
- VIP serving from the new node too:
      curl -k https://192.168.6.1:6443/healthz
- Hardware-pinned pods returned (e.g. zwave-js-ui-0, GPU transcoders).
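The "three members" check can be made mechanical by counting started, non-learner members. A minimal sketch, assuming the default (simple) output format of etcdctl member list, where fields are separated by ", " and the last field is the learner flag (etcd 3.4 and later); count_etcd_voters is a hypothetical helper:

```shell
# Count started, non-learner members in `etcdctl member list` output read
# from stdin. Assumed line format:
#   <id>, <status>, <name>, <peer URLs>, <client URLs>, <is learner>
count_etcd_voters() {
  awk -F', ' '$2 == "started" && $NF == "false" { n++ } END { print n + 0 }'
}

# Usage (not executed here; "..." stands for the same connection flags used
# elsewhere in this procedure):
#   [ "$(etcdctl ... member list | count_etcd_voters)" = "3" ] && echo "etcd has 3 voters"
```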
Rollback
The cheapest rollback is before Phase 3 step 3 (the new kubeadm join). Up to that point:
- The old VM master is powered off but its /var/lib/etcd is intact (assuming the data directory was preserved through the kubeadm reset in Phase 1). Power it back on, restart kubelet, and it will re-join the etcd cluster on its old member ID, provided you haven't yet removed it via etcdctl member remove. After member-remove, you have to re-bootstrap that member.
- The worker has been drained but not yet re-joined. kubectl uncordon is enough to put it back in service as a worker, as long as you have not yet deleted the node or run kubeadm reset on it (the last two steps of Phase 2); after those, it needs a fresh worker join. Either way, you'll need to re-add the labels you removed.
After the new control plane has been healthy for ~24h, the old VM's etcd data dir can be wiped and the VM image deleted.
What this does NOT solve
This procedure swaps one control-plane host. If two of your three control planes still share a single failure domain (e.g. master3 remains a VM on the same host as the retired master2 was), repeat the procedure for master3 against another bare-metal worker. Only then does the control plane have three independent failure domains.