Proxmox “Can’t Remove Node”: Safe Node Removal From a Cluster

You’re trying to decommission a Proxmox node. It’s dead, or it’s half-alive, or it’s “alive” in the way a laptop at 1% battery is alive.
You run pvecm delnode and Proxmox responds with the operational equivalent of a shrug. Now the cluster UI shows a ghost node,
backups complain, migrations get weird, and someone asks if you can “just remove it real quick.”

You can remove it. You can also remove your cluster’s ability to form quorum, if you do it wrong. This guide is how to do it the boring way:
safe, repeatable, and with an escape plan. Because nothing says “team sport” like a cluster membership change at 4:57 PM on a Friday.

The mental model: what “remove node” really means

In Proxmox VE, “removing a node” is not a single thing. It’s at least four things that happen to share a button label:
cluster membership, the cluster filesystem state, service-level expectations (HA manager, scheduling, storage plugins),
and whatever workloads were pinned to that node (VMs, containers, storage daemons).

The core is Corosync membership plus Proxmox’s cluster config database (pmxcfs) distributed under
/etc/pve. When you run pvecm delnode, you’re telling the remaining cluster:
“Stop expecting this node to participate in quorum decisions, and delete its config presence from the shared state.”
If you don’t have quorum, or the remaining nodes disagree on the shared state, you don’t get a clean deletion—because
clusters are allergic to unilateral edits. For good reasons.
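
A quick way to see that /etc/pve is not an ordinary directory but the pmxcfs FUSE mount. This is a read-only sanity check; the mount options shown are trimmed and will vary by version.

cr0x@server:~$ findmnt /etc/pve
TARGET   SOURCE    FSTYPE OPTIONS
/etc/pve /dev/fuse fuse   rw,nosuid,nodev,relatime,user_id=0,group_id=0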

So the rule is simple but annoying: membership changes require a healthy cluster.
When the cluster isn’t healthy, you stop trying to be clever and you start doing controlled surgery:
stabilize quorum, freeze changes, then remove the node. If you don’t stabilize first, you can manufacture a split brain
that looks fine in the UI until it doesn’t. And “until it doesn’t” is usually during a host reboot.

There’s also a difference between “node is offline” and “node is gone.” Offline means it might come back with old configs.
Gone means you should assume it never returns, and you must prevent it from returning unexpectedly. That’s why safe removal
includes: powering it off, wiping cluster configs on it, or at least keeping it off the network.

Fast diagnosis playbook

When “can’t remove node” happens, the time sink isn’t typing commands. It’s guessing what subsystem is blocking you.
This is the quick path to the bottleneck.

1) Quorum and membership first (always)

  • Check pvecm status. If you don’t have quorum, you’re not deleting anything cleanly.
  • Check journalctl -u corosync for membership churn, token timeouts, or “not in quorum.”
  • Check whether the node you’re removing is still counted in “Expected votes.”

2) Cluster filesystem (pmxcfs) second

  • Confirm pve-cluster is healthy on remaining nodes.
  • Check whether /etc/pve is responsive; hung FUSE mounts make every management command feel haunted.
  • Look for stale lock files (rare) or a node stuck with outdated config versions.

3) Workloads and HA third

  • If HA is enabled, confirm the node isn’t still referenced by HA groups or resources.
  • Confirm no VM/CT config still references that node’s local storage as the primary location.
  • If Ceph exists, make sure you’re not confusing “remove Proxmox node” with “remove Ceph host.” They’re related, not identical.

4) Then do the removal

  • Run pvecm delnode from a healthy remaining node.
  • Validate corosync.conf and node list updates replicated across /etc/pve.
  • Only after that, clean up the removed node so it can’t rejoin accidentally.
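
If you want a single read-only sweep before working through the tasks below, something like this covers most of the playbook; it changes nothing, so it’s safe to run mid-incident.

cr0x@server:~$ pvecm status | grep -E 'Quorate|Expected votes|Total votes'
cr0x@server:~$ systemctl is-active corosync pve-cluster
cr0x@server:~$ journalctl -u corosync --since "-15min" --no-pager | grep -iE 'token|quorum' | tail -n 20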

Interesting facts and context (why Proxmox behaves this way)

  • Corosync comes from the Linux HA world, designed to coordinate membership in clusters where correctness beats convenience.
  • Quorum is a safety feature, not a performance feature: it prevents “two clusters in one” (split brain) from both making changes.
  • Proxmox stores cluster configuration in a distributed filesystem (pmxcfs) mounted at /etc/pve, which is why edits there affect every node.
  • Historically, split brain incidents shaped cluster tooling: many strict behaviors are scars from earlier HA stacks where “best effort” corrupted state.
  • Two-node clusters are inherently awkward because majority is fragile; you usually need a tiebreaker (qdevice) or accept downtime risk.
  • Votes matter: Corosync quorum is based on votes; “expected votes” that include dead nodes can strand a cluster without quorum.
  • Proxmox’s HA manager is opinionated: it would rather fence or stop workloads than allow uncertain state, which is annoying until it saves you.
  • Ceph membership is separate from Proxmox membership: removing a Proxmox node doesn’t automatically remove Ceph OSDs/monitors, and mixing those steps blindly is a classic outage recipe.
  • Cluster changes are serialized through quorum: if you can’t get consensus, Proxmox prefers refusing the operation over guessing.

Before you touch anything: safety rails

Define the removal type: planned vs unplanned

Planned removal: node is reachable, you can migrate workloads, you can stop cluster services gracefully.
Unplanned removal: node is dead, disk is toast, or it’s in a reboot loop and you’re done negotiating.
The steps overlap, but the decisions differ: planned removal optimizes for clean migration; unplanned removal optimizes for restoring quorum
and preventing the corpse from rejoining the conversation.

Make sure you are not about to delete your only copy of anything

Local storage is the trap. If you used node-local ZFS, LVM-Thin, or directory storage and never replicated, then “remove node”
might be short for “delete the only place those disks existed.” Cluster membership is easy. Data is hard.

Operational rules (write them on a sticky note)

  • One person drives. Everyone else watches. Cluster membership changes are not a group typing exercise.
  • Freeze other changes: no upgrades, no network rework, no storage moves while you do this.
  • Keep the removed node powered off (or isolated) until you’ve cleaned it. “Oops it came back” is not a fun genre of incident.
  • Know your cluster size: 2-node and 3-node clusters have different quorum failure modes.
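
For default votequorum settings, the quorum threshold is floor(votes / 2) + 1. Worked out: three nodes need 2 votes and tolerate one failure; two nodes also need 2, which means both must stay up. That single line of arithmetic is the whole “different failure modes” story.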

One quote that operations people keep learning the hard way:
“Hope is not a strategy.” — Gen. Gordon R. Sullivan

Joke #1: A cluster is like a group chat—removing someone is easy until you realize they were the only one who knew the password.

Practical tasks (commands, outputs, decisions)

These are the checks and actions that actually move the ball. Each task includes a runnable command, what the output means,
and what decision you make based on it. Commands assume you’re root or using sudo; the prompt below is a placeholder.

Task 1 — Confirm cluster quorum and expected votes

cr0x@server:~$ pvecm status
Cluster information
-------------------
Name:             prod-pve
Config Version:   27
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Dec 26 13:10:43 2025
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.2f
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Meaning: “Quorate: Yes” means cluster changes are allowed. “Expected votes” should match the number of nodes you intend to have.
Decision: If not quorate, stop here and fix quorum (see Tasks 2–4). If quorate, proceed with planned removal steps.

Task 2 — List nodes known to the cluster

cr0x@server:~$ pvecm nodes
Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 pve01
0x00000002          1 pve02
0x00000003          1 pve03

Meaning: This is cluster membership as Corosync sees it.
Decision: If the dead node still appears and you want it gone, you will remove it with pvecm delnode <name> from a healthy node.

Task 3 — Check Corosync service health on remaining nodes

cr0x@server:~$ systemctl status corosync --no-pager
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: active (running) since Fri 2025-12-26 10:02:11 UTC; 3h 8min ago
       Docs: man:corosync
   Main PID: 1320 (corosync)
      Tasks: 11
     Memory: 38.4M
        CPU: 2.112s
     CGroup: /system.slice/corosync.service
             └─1320 /usr/sbin/corosync -f

Meaning: Corosync is running. If it’s flapping, you have a membership problem, not a “remove node” problem.
Decision: If not active, fix networking/timeouts/host firewall before attempting deletion.

Task 4 — Read Corosync logs for quorum loss or ring issues

cr0x@server:~$ journalctl -u corosync -n 50 --no-pager
Dec 26 13:02:18 pve01 corosync[1320]:   [KNET  ] link: host: 2 link: 0 is up
Dec 26 13:02:19 pve01 corosync[1320]:   [QUORUM] Members[2]: 1 2
Dec 26 13:02:19 pve01 corosync[1320]:   [QUORUM] This node is within the primary component and will provide service.
Dec 26 13:05:31 pve01 corosync[1320]:   [TOTEM ] Token has not been received in 3000 ms
Dec 26 13:05:31 pve01 corosync[1320]:   [TOTEM ] A processor failed, forming new configuration.
Dec 26 13:05:33 pve01 corosync[1320]:   [QUORUM] Members[3]: 1 2 3

Meaning: “Token has not been received” indicates network jitter, MTU mismatch, or overloaded host.
Decision: If you see frequent reconfigs, postpone removal and stabilize the cluster network first. Node removal during churn is how you earn weekend work.

Task 5 — Confirm pmxcfs (/etc/pve) is responsive

cr0x@server:~$ pvesh get /cluster/status
[
  {
    "id":"cluster",
    "name":"prod-pve",
    "quorate":1,
    "version":27
  },
  {
    "id":"node/pve01",
    "ip":"10.10.10.11",
    "local":1,
    "name":"pve01",
    "online":1,
    "type":"node"
  },
  {
    "id":"node/pve02",
    "ip":"10.10.10.12",
    "local":0,
    "name":"pve02",
    "online":1,
    "type":"node"
  }
]

Meaning: API reads cluster status quickly; pmxcfs is likely fine.
Decision: If this hangs or errors, you may have pmxcfs issues; address pve-cluster service and quorum before doing anything else.

Task 6 — Check for HA references to the node

cr0x@server:~$ ha-manager status
quorum OK
master pve01 (active, Fri Dec 26 13:12:19 2025)
lrm pve01 (active, Fri Dec 26 13:12:19 2025)
lrm pve02 (active, Fri Dec 26 13:12:17 2025)
service vm:101 (started)
service vm:203 (started)

Meaning: HA is enabled; each participating node runs an LRM (local resource manager), and the status shows which are active.
Decision: If the node you’re removing still shows an LRM entry and it’s dead, removal will eventually clear it. If HA services are pinned to that node, migrate them or take them out of HA first (see the sketch below).
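
If you need to detach HA from resources before removal, the commands below are the usual shape; vm:101 is just the example resource from the status output above, so substitute your own IDs.

cr0x@server:~$ ha-manager set vm:101 --state disabled   # stop HA from managing/restarting it
cr0x@server:~$ ha-manager remove vm:101                 # or drop it from HA entirely
cr0x@server:~$ ha-manager groupconfig                   # check HA groups for node references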

Task 7 — Find VMs/CTs still configured on the node you’re removing

cr0x@server:~$ pvesh get /nodes/pve03/qemu --output-format yaml
- vmid: 310
  name: build-runner-03
  status: stopped
- vmid: 311
  name: legacy-db-test
  status: running

Meaning: The target node still has VMs defined on it. Even if their disks sit on shared storage, the config association with that node matters. Note that this API call proxies to pve03, so it only answers while the node is reachable.
Decision: Migrate or move those guests. If the node is dead, verify where their disks live before you remove the node; don’t assume they’ll magically appear elsewhere. The sketch below shows how to inspect the configs from a surviving node.
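
If pve03 is unreachable, the guest configs are still visible from any surviving node in the shared filesystem. A read-only look, assuming default paths:

cr0x@server:~$ ls -1 /etc/pve/nodes/pve03/qemu-server /etc/pve/nodes/pve03/lxc
/etc/pve/nodes/pve03/lxc:

/etc/pve/nodes/pve03/qemu-server:
310.conf
311.conf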

Task 8 — Check storage dependencies (is anything node-local?)

cr0x@server:~$ pvesm status
Name             Type     Status           Total        Used       Avail
local            dir      active        19632040     3921040    14750212
local-zfs        zfspool  active        247463936   90218496   157245440
ceph-rbd         rbd      active       104857600    52428800    52428800
nfs-backup       nfs      active       209715200    73400320   136314880

Meaning: “local-zfs” is node-local unless it’s a shared ZFS-over-something (rare). “ceph-rbd” and NFS are shared.
Decision: If workloads depend on node-local storage, replicate or move disks before removal. If the node is already dead and the disks were local, your task is now “restore from backup.”
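
To see which storage backs a specific guest’s disks, read its config; 311 is the VM from Task 7, and the disk line here is illustrative.

cr0x@server:~$ qm config 311 | grep -E '^(scsi|virtio|ide|sata|efidisk)[0-9]'
scsi0: local-zfs:vm-311-disk-0,size=32G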

Task 9 — Verify replication jobs (if you use ZFS replication)

cr0x@server:~$ pvesh get /cluster/replication
[
  {
    "id":"101-0",
    "source":"pve03",
    "target":"pve01",
    "schedule":"*/15",
    "last_sync":1735218600,
    "duration":42,
    "comment":"critical VM replication"
  }
]

Meaning: Replication exists and has a last sync timestamp.
Decision: If the node is reachable, force a final sync before decommission. If it’s unreachable, confirm the target has a recent replica and you know the promotion steps.
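
If you prefer the CLI, pvesr shows the same jobs. Assumption worth stating: pvesr status reports jobs whose source is the local node, so run it on the source (pve03 here), and a forced sync only helps while that source is still alive.

cr0x@server:~$ pvesr status                 # on the source node: last sync, state, failures
cr0x@server:~$ pvesr schedule-now 101-0     # queue a final run of the job from Task 9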

Task 10 — Planned removal: migrate guests off the node

cr0x@server:~$ qm migrate 311 pve01 --online
migration started
migration status: active
migration status: completed

Meaning: VM moved successfully.
Decision: Repeat until the node hosts zero running guests. If migrations fail due to storage, pause and fix that—don’t improvise with half-migrated disks.
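
Containers follow the same pattern with pct. LXC can’t live-migrate, so restart mode is the usual move; 205 is a placeholder CT ID.

cr0x@server:~$ pct migrate 205 pve01 --restart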

Task 11 — Remove the node from the cluster (the actual deletion)

cr0x@server:~$ pvecm delnode pve03
Killing node 3

Meaning: The cluster drops pve03 from Corosync membership and propagates the updated corosync.conf through /etc/pve. Note that delnode does not contact the removed node; that host should already be powered off or isolated.
Decision: If it completes, validate Tasks 12–13. If it errors with quorum complaints, you have to restore quorum first—do not keep retrying like it’s a flaky download.

Task 12 — Confirm the node is gone from membership and expected votes updated

cr0x@server:~$ pvecm status
Cluster information
-------------------
Name:             prod-pve
Config Version:   28
Transport:        knet
Secure auth:      on

Quorum information
------------------
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Total votes:      2
Quorum:           2
Flags:            Quorate

Meaning: The cluster now expects 2 votes. That’s correct for a 2-node cluster, but it’s also a warning label.
Decision: If you’ve ended up with 2 nodes, decide immediately whether you’re adding a third node or a qdevice. Two nodes without a tiebreaker is a reliability bet.
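
If the answer is “qdevice,” the rough shape is: run corosync-qnetd on a small host outside the cluster, install corosync-qdevice on every Proxmox node, then register the device. The IP below is a placeholder for your qnetd host.

cr0x@server:~$ apt install corosync-qdevice      # on each cluster node
cr0x@server:~$ pvecm qdevice setup 10.10.10.50   # run once, from one node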

Task 13 — Confirm no stale node directory remains in /etc/pve/nodes

cr0x@server:~$ ls -1 /etc/pve/nodes
pve01
pve02

Meaning: The shared config no longer tracks the removed node.
Decision: If the removed node directory persists, pmxcfs may be stuck or quorum was never real. Recheck Tasks 1–5 and avoid manual deletion unless you’re doing a controlled recovery.

Task 14 — Confirm removed node is no longer in corosync.conf (from any remaining node)

cr0x@server:~$ grep -R "pve03" -n /etc/pve/corosync.conf || echo "no references"
no references

Meaning: Corosync config is clean.
Decision: If references remain, you didn’t actually remove it, or you have config propagation issues. Fix pmxcfs/quorum, then retry.

Task 15 — Cleanup on the removed node (if it’s still reachable)

cr0x@server:~$ systemctl stop pve-cluster corosync
cr0x@server:~$ pmxcfs -l
cr0x@server:~$ rm /etc/pve/corosync.conf
cr0x@server:~$ rm -r /etc/corosync/*
cr0x@server:~$ rm -f /var/lib/corosync/*
cr0x@server:~$ killall pmxcfs
cr0x@server:~$ systemctl disable corosync pve-cluster
Removed "/etc/systemd/system/multi-user.target.wants/corosync.service".
Removed "/etc/systemd/system/multi-user.target.wants/pve-cluster.service".

Meaning: Once pve-cluster stops, /etc/pve is unmounted, so pmxcfs -l remounts it in local mode (no quorum needed) long enough to delete the old cluster config. After the wipe, the node won’t casually rejoin the cluster with stale credentials.
Decision: Do this before you repurpose the host. If you skip it, the node may come back later and confuse the cluster like a former employee whose badge still works.

Task 16 — If removal fails due to “not quorate”: check the surviving node’s view of quorum

cr0x@server:~$ corosync-quorumtool -s
Quorum information
------------------
Date:             Fri Dec 26 13:18:22 2025
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.30
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Total votes:      1
Quorum:           2 Activity blocked

Meaning: Only one of the three expected votes is present, and the quorum threshold is 2, so the surviving node is not quorate and refuses cluster changes.
Decision: Your immediate fix is to restore quorum: bring at least one more node back online, or lean on a qdevice if that was already part of your design. Only in an emergency should you temporarily adjust expected votes to regain quorum for recovery steps; see the step-by-step section for guardrails.

Joke #2: The fastest way to learn quorum math is to break it once, preferably in a lab and not during payroll processing.

Checklists / step-by-step plan

Plan A: Planned decommission (node is reachable)

  1. Announce a change window. Keep it short, but don’t do this during a network maintenance on the same switches.
  2. Verify cluster health and quorum on a remaining node.
    Use Tasks 1–5. If anything flaps, stop and fix stability first.
  3. Drain the node.

    • Migrate or shut down VMs/CTs (Task 10; containers use pct migrate, shown in the sketch after that task).
    • Disable scheduling expectations: if you have HA, confirm resources are not pinned to the node (Task 6).
  4. Check storage dependencies.
    Use Task 8. If local storage is involved, replicate or move disks first (Task 9).
  5. Run the removal from a healthy remaining node: Task 11.
  6. Validate removal: Tasks 12–14.
  7. Clean the removed node: Task 15.
  8. Rebalance and harden.
    If you’re down to two nodes, decide: add a third node or add qdevice. Don’t “leave it for later” unless later is already scheduled.

Plan B: Unplanned loss (node is dead or unreachable)

  1. Physically or logically isolate the dead node.
    If it’s flapping on the network, it can disrupt Corosync. Pull its NIC, disable its switch port, or keep it powered off.
  2. Stabilize quorum on the surviving nodes.
    Use Task 1 and Task 4. If not quorate, your priority is quorum restoration, not cleanup aesthetics.
  3. Confirm the dead node’s workloads and data location.
    Use Task 7 and Task 8 from surviving nodes. Decide:

    • If disks are on shared storage (Ceph/NFS/iSCSI), you can restart guests elsewhere.
    • If disks were local-only, switch into restore mode: backups, replicas, or rebuild.
  4. Remove the node only after you’re quorate.
    Task 11. If the cluster refuses due to quorum, do not force random edits inside /etc/pve unless you’re deliberately entering recovery.
  5. After removal, confirm membership and config clean state.
    Tasks 12–14.

Plan C: Emergency recovery when quorum is impossible (use sparingly)

Sometimes you have a 3-node cluster, you lost 2 nodes, and leadership wants “the last node up” to serve workloads.
That’s not cluster mode. That’s survival mode. You can do it, but treat it like emergency power: loud, smelly, temporary.

The safe advice: restore enough nodes to regain quorum or use a proper quorum device if your environment supports it.
If you absolutely must proceed with a single remaining node, you are choosing to override the safety mechanism that prevents split brain.
Make that choice explicitly, write it down, and schedule a cleanup to return to a supported state.

If you do need to temporarily force expected votes for recovery operations (not for “normal operations”), do it with eyes open:

cr0x@server:~$ pvecm expected 1
Setting expected votes to 1
WARNING: forcing expected votes can lead to split-brain if other nodes are active

Meaning: You’re telling Corosync to accept a single node as quorate.
Decision: Only do this when you are certain other nodes are not simultaneously making changes (power them off / isolate them). Once you recover nodes, revert expected votes by letting cluster membership normalize (or rejoin nodes cleanly).

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “offline means safe to delete”

A mid-sized company ran a three-node Proxmox cluster for internal services. One node started throwing ECC errors.
It was still reachable, but it rebooted unpredictably. The on-call engineer made a reasonable assumption:
“If it’s offline in the UI, removing it is just cleanup.”

They ran pvecm delnode while Corosync was already experiencing intermittent token timeouts.
The cluster was technically quorate at the moment of the command. Ten minutes later, the flaky node came back,
rejoined briefly with old state, then dropped again. The remaining nodes disagreed on membership for long enough that pmxcfs
started lagging, and a routine VM start operation stalled.

The failure mode wasn’t dramatic. It was worse: a slow bleed. The UI timed out, then the API calls hung, and eventually
the team had to schedule an emergency maintenance to re-stabilize Corosync. The removed node’s “return from the dead” didn’t
resurrect workloads; it just created a disagreement about who was allowed to change the shared config.

The post-incident fix was boring: isolate the node first (switch port disabled), then repeat removal while the cluster was stable.
They also implemented a policy: “If a node is unstable, treat it as malicious until it’s fully quarantined.”

Optimization that backfired: “Let’s run a 2-node cluster for a while”

Another org had budget pressure and a shortage of rack space. They decided to remove one Proxmox node temporarily,
run the cluster as two nodes for a quarter, then add capacity later. On paper, everything looked fine:
most storage was on shared iSCSI, and the VMs were light.

They removed the node cleanly. Quorum still showed “Yes,” because expected votes matched the two nodes.
And for day-to-day operations it worked—until the first real network incident.
A top-of-rack switch rebooted and the two nodes lost Corosync connectivity for long enough that both nodes
decided they were in trouble. HA got conservative. Some services failed over, some stopped, and a few got stuck in “unknown.”

The issue wasn’t that two-node clusters never work. It’s that they work right up until they don’t, and then
you discover you built your reliability model on a coin toss. Without a third vote (qdevice or third node),
a partition can turn into a hard stop for management operations.

The “optimization” was reversed: they added a lightweight quorum device on a separate failure domain.
After that, routine node maintenance stopped being a drama project. The lesson wasn’t “never run two nodes.”
It was “don’t pretend two nodes behave like three.”

Boring but correct practice that saved the day: “decommission runbook + cleanup”

A finance-adjacent team (meaning: everything is audited and nobody is allowed to have fun) had a strict decommission checklist.
Every node removal required: workload drain, storage dependency check, explicit quarantine, and a cleanup step on the removed node.
It felt slow. People complained. Of course they did.

Then a retired node was repurposed by another group. They reinstalled the OS, reconnected it to the same management network,
and powered it on. That would have been a classic “zombie node reappears” incident—except the old cluster credentials and
Corosync state had been wiped during the original decommission.

The node came up as a plain Debian host with no cluster identity. Nothing tried to rejoin. No votes changed.
The cluster didn’t even notice. The loudest sound was someone realizing the checklist actually had a reason.

The saving grace was not brilliant engineering. It was the most unsexy operational habit: clean up after yourself so future-you
doesn’t have to reverse-engineer your intent from a half-configured machine.

Common mistakes: symptom → root cause → fix

1) Symptom: pvecm delnode fails with “cluster not ready” or “not quorate”

Root cause: No quorum (expected votes still include missing node, or Corosync ring instability).

Fix: Restore quorum (bring nodes back, stabilize network). If emergency-only, temporarily force expected votes (Plan C),
but isolate all other nodes first to avoid split brain.

2) Symptom: Node disappears from pvecm nodes but still shows in UI / /etc/pve/nodes

Root cause: pmxcfs replication lag, stuck pve-cluster, or a node that isn’t actually quorate.

Fix: Verify pve-cluster service, confirm API responsiveness (Task 5), confirm quorum (Task 1),
then re-run removal. Avoid manual deletion inside /etc/pve unless you’re in controlled recovery.

3) Symptom: After removal, cluster becomes fragile and management actions hang during minor network events

Root cause: You now have a 2-node cluster without a tie-breaker, or a network design that doesn’t tolerate partitions.

Fix: Add a third node or configure a quorum device. Don’t accept “works most of the time” for quorum.

4) Symptom: HA keeps trying to recover resources “on the removed node”

Root cause: HA config still references node/groups; stale resource placement constraints.

Fix: Check ha-manager status, update HA groups, disable/remove resources referencing the node, then re-verify.

5) Symptom: Guests won’t start elsewhere after node loss

Root cause: Disks were on node-local storage; the “cluster” didn’t imply shared storage.

Fix: Restore from backup, promote replicas, or rebuild storage. For future: replicate ZFS datasets or use shared storage (Ceph/NFS/iSCSI) for critical workloads.

6) Symptom: Removed node returns later and “rejoins” or causes membership confusion

Root cause: You removed it from the cluster but didn’t clean cluster credentials/state on the host; it retained /etc/corosync/authkey and old configs.

Fix: Perform cleanup (Task 15) or reinstall. Also isolate removed nodes until wiped.

7) Symptom: /etc/pve operations are slow/hanging during removal attempts

Root cause: pmxcfs is blocked due to quorum loss or corosync instability; FUSE mount depends on cluster health.

Fix: Stabilize Corosync first. Don’t treat this as a filesystem bug; it’s usually cluster state protection.

FAQ

1) Why does Proxmox refuse to remove a node when it’s offline?

Because “offline” doesn’t equal “consensus.” Removing a node updates shared cluster state. Without quorum, Proxmox can’t safely
prove the remaining nodes agree on the change, so it blocks you to prevent split brain.

2) Can I just delete the node directory under /etc/pve/nodes?

Don’t, unless you’re deliberately doing a recovery procedure and you understand the consequences.
/etc/pve is not a normal filesystem; it’s cluster-managed state. Manual edits can desync nodes or mask the real issue (quorum).

3) What’s the safest order: migrate workloads first or remove node first?

Migrate first, remove second, clean the host last. If the node is dead, you can’t migrate, so you verify storage location and restore/promote elsewhere before removal.

4) I removed a node and now I have a 2-node cluster. Is that “supported”?

It can run, but it’s operationally brittle. Add a third vote (third node or quorum device) if you care about predictable behavior during partitions and maintenance.

5) How do I know if my VM disks are on shared storage or local storage?

Check storage definitions (pvesm status) and the VM config (qm config <vmid>) to see disk backends.
If it says local-zfs or local, assume it’s node-local unless you built something exotic on purpose.

6) Does removing a Proxmox node also remove it from Ceph?

No. Ceph is its own cluster. You must separately remove OSDs, monitors, and CRUSH entries if that host participated in Ceph.
Coordinate the sequence so you don’t strand data placement or lose quorum in Ceph while you’re fixing quorum in Proxmox.
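
Before touching Proxmox membership on a Ceph-participating host, take a read-only look at what Ceph thinks that host holds. These are standard Ceph commands; the actual OSD/monitor removal sequence deserves its own runbook.

cr0x@server:~$ ceph -s                          # overall health and monitor quorum
cr0x@server:~$ ceph osd tree | grep -A 5 pve03  # which OSDs live on that host
cr0x@server:~$ ceph mon dump | grep pve03       # is it a monitor?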

7) What does “Expected votes” tell me?

It’s how many votes Corosync thinks should exist. If expected votes includes nodes that are gone, you may lose quorum even if enough nodes are online.
Fix expected votes by restoring membership properly (best) or temporarily forcing it for recovery (worst, but sometimes necessary).

8) What if the node name is reused later for a new host?

Don’t reuse the name until the old node is fully removed and cleaned. Cluster tooling often keys off node names in config paths.
If you must reuse, ensure the previous membership is gone and the new host joins cleanly with fresh keys.

9) Why does the UI sometimes still show the removed node for a while?

UI state can lag behind cluster state, especially if pmxcfs or the API had issues during the change.
Validate with pvecm nodes and /etc/pve/nodes; don’t trust the GUI as the source of truth in a cluster incident.

10) What’s the single most common reason node removal goes sideways?

Doing it while the cluster is unstable: flapping Corosync links, partial partitions, or degraded quorum.
People blame the command. The command is fine. The cluster is arguing.

Conclusion: next steps that won’t haunt you

Safe node removal in Proxmox isn’t about the deletion command. It’s about sequencing and certainty:
quorum, workload placement, storage reality, then membership change, then cleanup so the old node can’t wander back in.

If you do one thing after reading this: standardize a decommission runbook that includes (1) quorum verification,
(2) storage dependency checks, and (3) cleanup on the removed host. That’s the boring practice that keeps “can’t remove node”
from turning into “can’t manage cluster.”

Practical next steps:

  • Run Tasks 1–5 on your cluster now and record what “healthy” looks like for your environment.
  • If you ever operate with two nodes, decide on a third vote strategy and schedule it.
  • Audit which workloads still depend on node-local storage and either replicate them or accept restore-time pain explicitly.
  • Make “isolate removed nodes” non-negotiable: power off, port disabled, or wiped. Preferably all three.