Proxmox Clustering and HA: How It Works, What Breaks, and How to Design It Properly


The first time Proxmox HA “saves” you by rebooting a VM onto a node that can’t see its disk,
you learn a useful lesson: availability is a system property, not a checkbox.
Clustering is easy. Surviving failure is the hard part.

This is the practical, production-focused guide: how Proxmox clustering and HA actually work,
what breaks in real environments, and how to design a cluster that doesn’t turn minor outages into interpretive dance.

A mental model that matches reality

Proxmox clustering isn’t magic. It’s three separable problems that people love to blur together:

  • Control plane: nodes agree on membership and configuration (corosync + pmxcfs).
  • Data plane: VM disks are accessible where the VM runs (shared storage or replication).
  • Workload orchestration: when a node fails, something decides to restart VMs elsewhere (pve-ha-manager + pve-ha-lrm).

HA only works if all three are designed to fail gracefully. You can have a perfectly healthy cluster
with utterly non-HA storage. You can also have flawless shared storage with a control plane that panics at packet loss.
And you can have both, then sabotage yourself with “optimizations” like aggressive timeouts or a single switch.

The operational rule is simple: if you cannot explain where the VM’s disk lives during failover, you do not have HA.
You have “the ability to reboot VMs somewhere else.” Those are not the same.

What’s inside: corosync, pmxcfs, pve-ha-manager

Corosync: membership and messaging

Corosync is the cluster communication layer. It handles node membership and reliably distributes cluster state.
Proxmox uses it for the “who’s in the club” question. It’s sensitive to latency, jitter, and packet loss, and it does not care that your busiest switch uplink is “probably fine.”

pmxcfs: the config filesystem

Proxmox stores cluster configuration in /etc/pve, backed by pmxcfs, a distributed filesystem.
It’s not meant for your VM images. It’s for config: storage definitions, VM configs, ACLs, cluster settings.

When quorum is lost, /etc/pve flips into a protective mode. Writes are blocked because letting two partitions write config independently is how you get config split brain.
This is a feature. It’s also why people think “Proxmox is down” when only writes are blocked.
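
You can see this layer directly: /etc/pve is a FUSE mount provided by pmxcfs, and per-node VM configs live underneath it. A quick look, as a sketch only (node names, VM IDs, and exact mount options will differ):

cr0x@pve1:~$ findmnt /etc/pve
TARGET   SOURCE    FSTYPE OPTIONS
/etc/pve /dev/fuse fuse   rw,nosuid,nodev,relatime,...

cr0x@pve1:~$ ls /etc/pve/nodes/pve1/qemu-server/
101.conf

If writes under /etc/pve suddenly start failing, check quorum before you check disks. It is almost never a disk problem.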

pve-ha-manager and pve-ha-lrm: the HA brain and local hands

The HA stack has a cluster-wide manager and local resource managers. The manager (the CRM, running as pve-ha-crm on whichever node holds the master role) decides what should run where; the local agent (pve-ha-lrm) on each node executes the starts and stops.
It uses a mix of watchdog logic and cluster state to avoid running the same VM twice. Avoid. Not guarantee. The guarantee requires fencing.
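
Putting a VM under HA control is one command, not a project. A minimal sketch, assuming VM 101 already sits on shared storage (the limits are illustrative):

cr0x@pve1:~$ ha-manager add vm:101 --state started --max_restart 1 --max_relocate 1
cr0x@pve1:~$ ha-manager status | grep vm:101
service vm:101 (running, node=pve1)

The max_restart and max_relocate limits keep a persistently broken VM from bouncing around the cluster forever.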

One line to keep on your wall, in the spirit of operating reality: “Hope is not a strategy.” It gets attributed to engineers and operators everywhere, and it’s correct regardless of attribution.

Quorum: the single most important word in your cluster

Quorum answers: does this partition of nodes have the authority to make cluster decisions?
If you run a cluster without understanding quorum, you’re driving a forklift in a glass shop.

Why quorum exists

In a network partition, you can end up with two groups of nodes that can’t see each other.
Without quorum, both sides might “do the right thing” locally and start the same HA VM.
That’s how you corrupt filesystems, databases, and your relationships with the finance team.

Rule of thumb: odd numbers win

A 3-node cluster can lose 1 node and keep quorum: 2 of 3 votes is still a majority. A 2-node cluster can’t: 1 of 2 is not.
You can make 2-node work with a quorum device (QDevice), but if you’re trying to do HA on the cheap,
you’ll find out the expensive way why people recommend 3.

QDevice: the “third vote” without a third hypervisor

QDevice adds an external quorum vote so a 2-node cluster can keep operating when one node is down (or partitioned).
It must be placed carefully: if the QDevice sits on the same network path that failed, it’s just a decorative vote.
Place it where it breaks differently than your cluster interconnect.
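
The setup itself is short, assuming a small external host (a VM at another site, a tiny box on different power and switching) reachable from both nodes; the address is an example:

root@witness:~# apt install corosync-qnetd          # on the external witness host
cr0x@pve1:~$ apt install corosync-qdevice           # on every cluster node
cr0x@pve1:~$ pvecm qdevice setup 10.10.30.50        # run once, from one cluster node

Afterwards, pvecm status should show the extra vote. If it doesn’t, fix that now, not during the outage you bought it for.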

Joke #1 (short and relevant)

Quorum is like adulthood: if you have to ask whether you have it, you probably don’t.

Storage for HA: what “shared” really means

The storage decision is the HA decision. Everything else is choreography.

Option A: truly shared storage (SAN/NFS/iSCSI/FC)

Shared storage means the VM disk is accessible from multiple nodes at the same time, with correct locking.
In Proxmox terms, that’s storage types that support shared access and the right semantics (e.g., NFS with proper config, iSCSI with LVM, FC with multipath).

Shared storage makes failover fast: restart VM elsewhere, same disk. It also concentrates risk: the storage array becomes your “one thing” that can take everything out.
Some orgs accept that because the array is built like a tank. Others discover “the array” is a single controller and one admin password last changed in 2018.
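
In Proxmox terms, “shared” is a property of the storage definition in /etc/pve/storage.cfg, which every node reads through pmxcfs. A minimal NFS sketch (server, export, and mount options are placeholders, not a recommendation):

nfs: nfs-vmstore
        server 10.10.30.5
        export /export/vmstore
        path /mnt/pve/nfs-vmstore
        content images
        options vers=4.1

NFS-type storage is treated as shared automatically, but a definition being visible everywhere is not the same as the mount being healthy everywhere. That’s what pvesm status (Task 11 below) is for.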

Option B: hyperconverged shared storage (Ceph)

Ceph gives you distributed storage across your Proxmox nodes (or dedicated storage nodes). Great when designed correctly:
enough nodes, enough NICs, enough disks, enough CPU, and enough humility.

Ceph is not “free HA.” It’s a storage system with its own failure modes, recovery behavior, and performance profile.
During recovery, Ceph can eat your IO budget, which can look like “Proxmox HA is broken” when actually your cluster is just doing exactly what you asked: rebuilding data.
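
If recovery is eating the IO budget, the usual lever is throttling backfill and recovery rather than fighting HA. A sketch only: the exact knobs and their defaults depend on your Ceph release, and newer releases using the mClock scheduler may ignore these unless explicitly told otherwise.

cr0x@pve1:~$ ceph config set osd osd_max_backfills 1
cr0x@pve1:~$ ceph config set osd osd_recovery_max_active 1

Slower recovery means a longer window at reduced redundancy, so treat these as incident dials, not permanent defaults.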

Option C: replication (ZFS replication) + restart

ZFS replication can provide near-HA behavior: replicate VM disks from primary to secondary nodes on a schedule.
If a node dies, you start the VM on a node that has a recent replica. The trade-off is obvious: RPO is not zero; you lose whatever changed since the last successful sync. (Proxmox’s built-in replication is asynchronous; synchronous replication is its own beast.)

Replication is a strong choice for small clusters that can tolerate losing a few minutes of data, or for workloads where you already have application-level replication.
It’s also far simpler to operate than Ceph for many teams. Simple is good. Predictable is better.
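
Replication is configured as one job per VM and target node. A minimal sketch, assuming VM 101 should replicate to pve2 every 15 minutes with a 50 MB/s cap (all values illustrative):

cr0x@pve1:~$ pvesr create-local-job 101-0 pve2 --schedule "*/15" --rate 50
cr0x@pve1:~$ pvesr status

The schedule is your worst-case RPO. If the business signs off on “we can lose 15 minutes,” write that number down where the business can find it later.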

What breaks storage HA in practice

  • Locking mismatch: the storage supports shared reads but not safe shared writes.
  • Network coupling: storage traffic and corosync share the same congested link.
  • Assumed durability: “it’s RAID” becomes “it’s safe” becomes “where did the datastore go?”
  • Failover without fencing: two nodes think they own the same disk.

Fencing: the thing you skip until it hurts

Fencing (STONITH: Shoot The Other Node In The Head) ensures a failed or partitioned node is truly stopped before resources run elsewhere.
In VM HA, fencing is how you prevent the “two active writers” disaster.

Without fencing, you’re relying on timeouts and good luck. Timeouts are not authority. They are guesses.

What fencing looks like in Proxmox-land

Proxmox HA fences via watchdog-based self-fencing (a hardware watchdog if you have one, the softdog module otherwise); IPMI/iDRAC/iLO-style power control is your out-of-band backstop, driven by humans and runbooks rather than by the HA stack.
In practice, for serious HA:

  • Enable and validate watchdog on every node.
  • Use out-of-band management to hard power-cycle a node that’s “alive but wrong.”
  • Design power and networking so a node can be fenced even if its primary network is dead.

If you’re using shared block storage (iSCSI/FC) and running clustered filesystems or LVM, fencing matters even more.
Data corruption is rarely immediate. It’s usually delayed, subtle, and career-limiting.
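
Two concrete pieces, both examples rather than a prescription: point the HA stack at a real watchdog module (softdog is what you get if you configure nothing), and keep an out-of-band power command in the runbook. The module name, BMC address, and credentials below are placeholders.

cr0x@pve1:~$ cat /etc/default/pve-ha-manager
# select watchdog module (default is softdog)
WATCHDOG_MODULE=ipmi_watchdog

cr0x@pve1:~$ ipmitool -I lanplus -H 10.10.99.12 -U fenceuser -P '********' chassis power status
Chassis Power is on

Changing WATCHDOG_MODULE typically requires a node reboot to take effect, and the ipmitool path should be rehearsed from a machine that stays reachable when the node’s primary network is dead.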

Cluster networking: rings, latency, and packet loss

Corosync wants low latency and low packet loss. It will tolerate less weirdness than your storage stack.
The classic failure mode is “network is mostly fine,” which is what people say when packet loss is 0.5% and your cluster is intermittently losing quorum.

Separate networks by function

  • Corosync: dedicated VLAN or physical network if you can. Predictable latency matters.
  • Storage: dedicated too, especially for Ceph or iSCSI.
  • Client/VM traffic: keep it from stomping on your cluster control plane.

Dual rings are not decoration

Corosync can use multiple rings (separate networks) for redundancy. It’s worth doing.
If you can’t afford redundant switching, accept that your “HA” is “high anxiety.”
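
Here is what dual rings look like in /etc/pve/corosync.conf, trimmed to the relevant parts; the addresses match the two links shown in Task 3 later and are examples:

totem {
  cluster_name: prod-cluster
  config_version: 42
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.11
    ring1_addr: 10.10.20.11
  }
  # pve2 and pve3 repeat the pattern with their own addresses
}

Edit it the careful way: work on a copy, bump config_version, then move it into /etc/pve so pmxcfs distributes it. A malformed corosync.conf is a self-inflicted cluster outage.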

Joke #2 (short and relevant)

Packet loss is like termites: the house looks fine until it suddenly isn’t.

How it fails: real breakages and why

Failure mode: quorum loss stops config writes

Symptom: you can log in, VMs might keep running, but you can’t start/stop/migrate, and GUI changes fail.
Root cause: the node (or partition) doesn’t have quorum, so pmxcfs blocks writes.
Correct response: restore quorum, don’t “force it” unless you’re intentionally running a single-node island and understand the consequences.

Failure mode: HA restarts VMs but they won’t boot

Symptom: HA manager tries to move/restart, but VM fails with missing disks, storage offline, or IO errors.
Root cause: storage wasn’t actually shared/available on the target node; or the storage network failed alongside the node.
Correct response: treat storage as part of the failure domain; redesign storage path diversity; ensure storage definitions are consistent and available on all nodes.

Failure mode: split brain or “double start” risk

Symptom: the same service appears to run twice, or shared disk shows corruption, or HA flaps.
Root cause: network partition without effective fencing, mis-set expected votes, or QDevice placed poorly.
Correct response: fix quorum design and add fencing. Don’t tune timeouts until you’ve designed authority.

Failure mode: Ceph recovery makes everything look dead

Symptom: after a node failure, VMs stutter, IO latency spikes, cluster “feels down.”
Root cause: Ceph is backfilling/recovering, saturating disks and network.
Correct response: capacity planning, fast networks, and sane recovery settings. Also: accept that “self-healing storage” spends resources to heal.

Practical tasks: commands, outputs, and decisions

These are the checks I actually run. Each includes what the output means and what decision to make.
Run them on a node unless stated otherwise.

Task 1: Check cluster membership and quorum

cr0x@pve1:~$ pvecm status
Cluster information
-------------------
Name:             prod-cluster
Config Version:   42
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sun Dec 28 12:10:41 2025
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.2c
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Meaning: Quorate: Yes means this partition can safely change cluster state (start HA VMs, edit config).

Decision: If Quorate: No, stop making changes and fix connectivity/quorum device before touching HA.

Task 2: Show per-node view and votes

cr0x@pve1:~$ pvecm nodes
Membership information
----------------------
Nodeid      Votes Name
0x00000001      1 pve1 (local)
0x00000002      1 pve2
0x00000003      1 pve3

Meaning: Confirms who is in the membership list and how many votes each has.

Decision: If a node is missing unexpectedly, investigate corosync links and node health before blaming HA.

Task 3: Verify corosync ring health (knet links)

cr0x@pve1:~$ corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
        addr    = 10.10.10.11
        status  = ok
LINK ID 1 udp
        addr    = 10.10.20.11
        status  = ok

Meaning: Both rings are up. If a link is down, you’ve lost redundancy.

Decision: If ring redundancy is broken, treat it as a sev-2: the next switch hiccup becomes a cluster incident.

Task 4: Check corosync timing and membership from runtime stats

cr0x@pve1:~$ corosync-quorumtool -s
Quorum information
------------------
Date:             Sun Dec 28 12:11:22 2025
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          1
Ring ID:          1.2c
Quorate:          Yes

Meaning: Confirms votequorum’s perspective; useful when the UI lies or the logs are noisy.

Decision: If this says non-quorate, stop chasing storage/perf ghosts and fix cluster comms first.

Task 5: Confirm pmxcfs is healthy and not stuck in read-only mode

cr0x@pve1:~$ pvesh get /cluster/status
[
  {
    "type": "cluster",
    "name": "prod-cluster",
    "version": 42,
    "quorate": 1
  },
  {
    "type": "node",
    "name": "pve1",
    "online": 1,
    "ip": "10.10.10.11"
  },
  {
    "type": "node",
    "name": "pve2",
    "online": 1,
    "ip": "10.10.10.12"
  },
  {
    "type": "node",
    "name": "pve3",
    "online": 1,
    "ip": "10.10.10.13"
  }
]

Meaning: "quorate": 1 indicates the cluster filesystem can accept writes normally.

Decision: If quorate is 0, don’t attempt edits to /etc/pve; they won’t stick (or worse, you’ll create a mess during recovery).

Task 6: Inspect HA manager state

cr0x@pve1:~$ ha-manager status
quorum OK
master pve2 (active, Sat Dec 27 23:14:02 2025)

service vm:101 (running, node=pve1)
service vm:120 (running, node=pve3)

lrm pve1 (active, Sat Dec 27 23:14:11 2025)
lrm pve2 (active, Sat Dec 27 23:14:02 2025)
lrm pve3 (active, Sat Dec 27 23:14:07 2025)

Meaning: Shows which node is HA master and whether local resource managers are active.

Decision: If LRMs are inactive or master flaps, focus on quorum and corosync stability before touching individual VMs.

Task 7: Explain why a specific HA VM isn’t moving

cr0x@pve1:~$ ha-manager status --verbose
quorum OK
master pve2 (active, Sat Dec 27 23:14:02 2025)

service vm:101 (running, node=pve1)
  state: started
  request: none
  last_error: none

service vm:130 (error, node=pve2)
  state: stopped
  request: start
  last_error: unable to activate storage 'ceph-vm' on node 'pve2'

Meaning: HA isn’t “broken,” it’s blocked by storage activation on the target node.

Decision: Switch to storage debugging (Ceph/NFS/iSCSI), not HA tuning.

Task 8: Check cluster network loss/latency quickly

cr0x@pve1:~$ ping -c 20 -i 0.2 10.10.10.12
PING 10.10.10.12 (10.10.10.12) 56(84) bytes of data.
64 bytes from 10.10.10.12: icmp_seq=1 ttl=64 time=0.355 ms
64 bytes from 10.10.10.12: icmp_seq=2 ttl=64 time=0.420 ms
...
--- 10.10.10.12 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 3815ms
rtt min/avg/max/mdev = 0.312/0.401/0.612/0.081 ms

Meaning: Clean ping doesn’t prove corosync is happy, but loss/jitter here is a smoking crater.

Decision: If packet loss exists, stop. Fix network first. Corosync under loss causes phantom failures everywhere else.

Task 9: Review corosync logs for membership churn

cr0x@pve1:~$ journalctl -u corosync -S -2h --no-pager | tail -n 20
Dec 28 10:41:03 pve1 corosync[1203]:   [KNET  ] link: host: 2 link: 0 is down
Dec 28 10:41:06 pve1 corosync[1203]:   [QUORUM] This node is within the primary component and will provide service.
Dec 28 10:41:11 pve1 corosync[1203]:   [KNET  ] link: host: 2 link: 0 is up
Dec 28 10:41:12 pve1 corosync[1203]:   [TOTEM ] A new membership (1.2b) was formed. Members joined: 2

Meaning: Links flapping caused membership re-forms. That’s cluster turbulence.

Decision: Treat repeated membership changes as an outage precursor. Investigate NICs, switches, bonding, VLAN config, MTU, and congestion.

Task 10: Validate watchdog availability (self-fencing prerequisite)

cr0x@pve1:~$ dmesg | grep -i watchdog | tail -n 10
[    1.842113] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[    1.842205] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)

Meaning: A hardware watchdog driver is present. That’s the base for self-fencing.

Decision: If no watchdog is present, don’t pretend you have safe HA. Add hardware support or out-of-band fencing.

Task 11: Confirm storage visibility on a target node (NFS example)

cr0x@pve2:~$ pvesm status
Name         Type     Status           Total       Used        Avail     %
local        dir      active         19660800    2457600     17203200   12%
nfs-vmstore  nfs      active       1048576000  734003200    314572800   70%

Meaning: Storage is active on this node. If it’s inactive, HA can’t start disks there.

Decision: If inactive, fix mount/network/auth before you allow HA to place workloads here.

Task 12: Confirm a VM disk’s backend and whether it’s migratable

cr0x@pve1:~$ qm config 101
boot: order=scsi0;net0
cores: 4
memory: 8192
name: app-prod-01
net0: virtio=12:34:56:78:9a:bc,bridge=vmbr0
scsi0: ceph-vm:vm-101-disk-0,iothread=1,size=80G
scsihw: virtio-scsi-pci
vmgenid: 0b0b0b0b-1111-2222-3333-444444444444

Meaning: Disk is on ceph-vm. That’s shared (assuming Ceph is healthy).

Decision: If you see local-lvm or a node-local ZFS dataset with no replication, do not expect HA to restart elsewhere successfully.

Task 13: Check Ceph health (if you use it)

cr0x@pve1:~$ ceph -s
  cluster:
    id:     2f4a9d2e-aaaa-bbbb-cccc-111122223333
    health: HEALTH_WARN
            1 osds down
            Degraded data redundancy: 12/3456 objects degraded

  services:
    mon: 3 daemons, quorum pve1,pve2,pve3 (age 7m)
    mgr: pve1(active, since 2h)
    osd: 9 osds: 8 up (since 1m), 9 in (since 30d)

  data:
    pools:   3 pools, 128 pgs
    objects: 1152 objects, 4.5 GiB
    usage:   220 GiB used, 1.8 TiB / 2.0 TiB avail
    pgs:     12 active+undersized+degraded, 116 active+clean

Meaning: Ceph is up but degraded; performance may be bad and HA restarts may be slow.

Decision: During degraded states, avoid mass migrations and don’t trigger avoidable restarts. Stabilize Ceph first.

Task 14: Confirm replication status for ZFS replication setups

cr0x@pve1:~$ pvesr status
JobID  Type  State    Last Sync              Duration  Error
1000   local ok       2025-12-28 11:55:02    00:01:42  -
1001   local failed   2025-12-28 11:40:02    00:00:11  ssh connection failed

Meaning: Job 1001 failed; replicas may be stale on the target node.

Decision: Don’t claim HA coverage for VMs on a replication job that’s failing. Fix transport/auth, then validate RPO.

Task 15: Check if the cluster is resource-starved (CPU steal, IO wait symptoms)

cr0x@pve1:~$ pvesh get /nodes/pve1/status
{
  "cpu": 0.71,
  "loadavg": [2.15, 2.07, 1.94],
  "memory": {
    "total": 68719476736,
    "used": 51234567890,
    "free": 17484908846
  },
  "swap": {
    "total": 8589934592,
    "used": 2147483648
  },
  "uptime": 123456
}

Meaning: You’re swapping. In clusters, swapping turns “minor network blip” into “why did everything stall.”

Decision: If swap is used materially under normal conditions, fix memory pressure before diagnosing HA flaps.

Task 16: Verify time sync (yes, it matters)

cr0x@pve1:~$ timedatectl status
               Local time: Sun 2025-12-28 12:14:55 UTC
           Universal time: Sun 2025-12-28 12:14:55 UTC
                 RTC time: Sun 2025-12-28 12:14:55
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

Meaning: Clock is synchronized. Good. Time drift makes logs useless and can upset distributed systems in subtle ways.

Decision: If unsynchronized, fix NTP/chrony across all nodes before you try to interpret event order in an incident.

Fast diagnosis playbook

When a Proxmox cluster “acts weird,” you want speed and correct priorities. Here’s the order that finds the bottleneck fast.

First: is the control plane authoritative?

  • Run pvecm status: are you quorate?
  • Run corosync-cfgtool -s: are links up? is a ring down?
  • Scan journalctl -u corosync: are there repeated memberships?

If quorum is unstable, stop here. Fix the cluster network and membership before investigating HA, storage, or VM symptoms.

Second: can the target node see the VM’s storage?

  • pvesm status on the target node: storage active?
  • qm config <vmid>: where is the disk really?
  • If Ceph: ceph -s and watch for degraded/backfill.

If storage isn’t available everywhere it needs to be, HA will flap or restart into failure.

Third: is HA itself stuck, or is it refusing to do something unsafe?

  • ha-manager status --verbose: read the error, don’t guess.
  • Check watchdog/fencing readiness if you see double-start risk or repeated fencing-like behavior.

Fourth: is this actually performance starvation?

  • pvesh get /nodes/<node>/status: swap, load, CPU usage.
  • Check network loss and latency between nodes; then check storage network separately.

Common mistakes: symptoms → root cause → fix

1) “HA is broken, it won’t start VMs”

Symptoms: HA actions fail; GUI shows errors; some nodes show “unknown.”

Root cause: Loss of quorum, or corosync membership churn.

Fix: Restore stable corosync connectivity (dedicated network, redundant rings). Add QDevice for 2-node. Don’t change timeouts first.

2) “VM restarted on another node but disk missing”

Symptoms: VM starts then fails; storage errors; disk not found.

Root cause: Disk was on node-local storage (or shared storage not mounted/active on target).

Fix: Put HA VMs on truly shared storage (Ceph/NFS/iSCSI/FC) or implement replication with a defined RPO and test restores/failover.

3) “Cluster freezes during backups/migrations”

Symptoms: corosync timeouts, HA flapping, sluggish GUI during heavy IO.

Root cause: Corosync network shares a congested link with VM/storage traffic; or CPU starvation from encryption/compression/backups.

Fix: Separate networks; rate-limit heavy jobs; pin corosync to low-latency paths; capacity plan CPU for backup windows.

4) “Two-node cluster: sometimes it just stops managing stuff”

Symptoms: after one node outage, remaining node won’t start HA resources or edit config.

Root cause: No QDevice; expected votes misaligned; quorum not achievable alone.

Fix: Add QDevice placed on an independent failure domain; verify expected votes; test “one node down” scenarios.

5) “Ceph is healthy-ish but everything is slow”

Symptoms: high IO latency, VMs pause, failovers sluggish after a fault.

Root cause: Degraded/backfill/recovery saturating IO; undersized network (1GbE) or disks; too few OSDs.

Fix: Engineer Ceph properly: 10/25GbE, enough OSDs, separate public/cluster networks where applicable, and plan recovery impact.

6) “We tuned corosync timeouts and now HA is worse”

Symptoms: more frequent quorum loss, false node deaths, random failovers.

Root cause: Timeouts set below real jitter during load; cluster becomes trigger-happy.

Fix: Revert to sane defaults; fix the network; measure jitter under peak; only then consider careful tuning.

7) “After a reboot, a node can’t join cluster”

Symptoms: node appears offline; corosync won’t start; config mismatch errors.

Root cause: Wrong /etc/hosts mapping, stale corosync config, MTU mismatch, or firewall rules.

Fix: Validate name resolution, consistent MTU, open required ports on cluster networks, and confirm corosync.conf consistency via /etc/pve.
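
Two of those causes take seconds to check. A sketch, assuming a 9000-byte MTU on the cluster network and the node addresses used earlier (adjust sizes and targets to your environment):

cr0x@pve1:~$ getent hosts pve2                      # should resolve to the cluster-network address, identically on every node
cr0x@pve1:~$ ping -c 3 -M do -s 8972 10.10.10.12    # 8972 = 9000 minus 28 bytes of headers; failing here while small pings work means an MTU mismatch on the path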

Checklists / step-by-step plan

Design checklist: building a cluster you can sleep on

  1. Choose cluster size: Prefer 3+ nodes. If 2 nodes, plan QDevice and fencing from day one.
  2. Define failure domains: switches, power circuits, racks, ToR pairs, storage paths. Write them down.
  3. Separate networks: corosync on its own VLAN/physical fabric; storage on its own; client traffic separate.
  4. Redundancy: dual corosync rings; redundant switching; bonded NICs only if you understand your switch config.
  5. Storage strategy: shared (Ceph/SAN/NFS/iSCSI) for true HA; replication for “restart with RPO.” Don’t mix expectations.
  6. Fencing: watchdog + out-of-band power control tested. Document how to fence manually.
  7. Capacity: N+1 compute. HA without spare capacity is just automated disappointment.
  8. Operational tests: planned node reboot, power pull, switch port shutdown, storage path failure, QDevice loss (if used).

Implementation plan: from zero to stable

  1. Build nodes identically (kernel, NIC drivers, BIOS settings for virtualization, storage controllers).
  2. Set static addressing for corosync networks; ensure stable name resolution.
  3. Create the cluster; immediately configure dual-ring corosync if you have the networks.
  4. Validate quorum behavior by temporarily isolating one node (controlled test).
  5. Stand up storage and validate from every node (pvesm status must be boringly consistent).
  6. Enable watchdog and validate it exists in logs on every node.
  7. Create HA groups with sane constraints (don’t allow everything to pile onto the smallest node); see the group sketch after this list.
  8. Put one non-critical VM into HA, test node failure, confirm it comes up elsewhere with correct disk and networking.
  9. Roll HA out by service class; keep a “not HA” pool for pets and experiments.
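
A minimal sketch of step 7, assuming pve3 is the small node that should only take HA workloads as a last resort (group name and priorities are illustrative; a higher number means a more preferred node):

cr0x@pve1:~$ ha-manager groupadd big-nodes-first --nodes "pve1:2,pve2:2,pve3:1"
cr0x@pve1:~$ ha-manager set vm:101 --group big-nodes-first

Groups also have a nofailback option for when you’d rather move workloads back on your own schedule than the moment a node returns.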

Operations checklist: weekly “keep it healthy” routine

  1. Check pvecm status on a random node: confirm stable quorum.
  2. Scan corosync logs for link flaps and membership churn.
  3. If Ceph: check ceph -s and address warnings before they become incidents.
  4. Validate replication jobs (pvesr status) if you rely on them.
  5. Confirm time sync on all nodes.
  6. Run a controlled migration during business hours occasionally (yes, on purpose) to catch drift early; the one-liner below is all it takes.
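
A sketch of that controlled migration, assuming VM 101 and target pve2, with the disk on shared storage (otherwise budget for storage migration time):

cr0x@pve1:~$ qm migrate 101 pve2 --online

If this routinely stalls or fails, you’ve found drift in storage, network, or config on a quiet day instead of during a failover.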

Three corporate-world mini-stories

Mini-story 1: The incident caused by a wrong assumption

A mid-sized enterprise built a three-node Proxmox cluster for “HA.” The team did what many teams do: they clustered nodes,
enabled HA, and felt good. The storage was “shared,” they said, because every node could reach the NAS over NFS.
Nobody asked the unsexy question: is NFS mounted and healthy on every node during the exact failure we’re planning for?

The first real outage was a top-of-rack switch reboot. Nodes stayed up, but the NAS VLAN hiccuped.
Corosync was on the same switching path, so membership churn started. Quorum flickered.
HA saw a node as unhealthy and restarted two VMs onto a node that, at that moment, didn’t have the NFS share mounted.

The VMs failed to boot. Applications went down. The team, staring at the UI, interpreted it as “Proxmox HA is buggy.”
They rebooted nodes in the name of “stabilizing,” which made it worse: more churn, more failed starts.

After the dust settled, the postmortem revealed the wrong assumption: “reachable NFS” was treated as “HA storage.”
The fix wasn’t exotic. They separated corosync onto a dedicated network, made storage mounts resilient and monitored,
and added a hard rule: no HA tag on a VM unless its disk lives on storage proven accessible from any target node under failure tests.

The lesson is unromantic: HA is a contract. If storage can disappear during the same class of failure that triggers failover,
your HA system will eagerly restart workloads into a void and call it success.

Mini-story 2: The optimization that backfired

Another org ran Proxmox with Ceph and had intermittent “node lost” alerts. Someone decided to “tighten detection”
by reducing cluster timeouts. Faster detection equals faster failover, they reasoned. They wanted the system to be “snappier.”

It worked in the lab. In production, it met reality: backup windows, noisy neighbor IO, and a network that was fine
until Ceph recovery traffic decided to be ambitious. During peak load, corosync packets got delayed just enough.
With the tighter timeouts, those delays crossed the line from “tolerated” to “node declared dead.”

HA started moving VMs. Ceph was already busy. Now it had to serve VM IO plus handle recovery plus absorb migration/restart storms.
The cluster spiraled into self-inflicted pain: false failovers causing real load, causing more false failovers.

The fix was slightly embarrassing and completely effective: revert timeouts, then fix the underlying latency and congestion.
They separated traffic, rate-limited the noisiest jobs, and set expectations: failover speed is not worth cluster instability.

The deep lesson: timeouts are not performance knobs. They’re failure detectors. If you tune them below your worst-case jitter,
you aren’t improving availability. You’re generating chaos faster.

Mini-story 3: The boring but correct practice that saved the day

A company with strict compliance requirements ran a Proxmox cluster with shared iSCSI storage and a QDevice.
Their design wasn’t flashy. It was just relentlessly careful: dedicated corosync network, redundant switches,
documented fencing via out-of-band management, and quarterly “pull the plug” tests.

During a scheduled maintenance, a firmware update on one switch went sideways and the switch stopped forwarding traffic properly.
Corosync ring 0 went weird. Ring 1 stayed healthy. The cluster didn’t lose quorum.
VMs didn’t flap. The monitoring system lit up, but the business barely noticed.

The on-call followed the runbook: confirm quorum, confirm corosync ring status, confirm storage paths, then isolate the bad switch.
They moved some workloads deliberately instead of letting HA thrash. They did not panic-reboot nodes.

Post-incident, the team got zero hero points because nothing dramatic happened. That’s the dream.
The boring practice—redundant rings plus rehearsed diagnostics—turned a potentially messy outage into a controlled maintenance event.

If you want HA that looks like competence instead of adrenaline, you need these boring habits. Drama is not a KPI.

Interesting facts and historical context

  • Fact 1: Proxmox VE builds its cluster communications on Corosync, a long-running open-source project also used in classic Linux HA stacks.
  • Fact 2: The “quorum” concept predates modern virtualization; it’s rooted in distributed systems safety: only a majority can safely decide.
  • Fact 3: Split brain wasn’t invented by virtualization; storage clusters and database replication have been fighting it for decades.
  • Fact 4: pmxcfs is a configuration filesystem, not a general-purpose distributed filesystem; treating it as “shared storage” is a common misunderstanding.
  • Fact 5: Corosync’s move to knet transport improved link handling and redundancy options compared to older setups.
  • Fact 6: STONITH as a term is old enough to sound like a joke, but the idea is deadly serious: ensure only one writer exists.
  • Fact 7: “HA” originally meant service-level failover for processes; virtualization made the unit of failover a whole machine, but the safety rules stayed.
  • Fact 8: Ceph’s popularity in hyperconverged setups came from solving a real problem: scaling storage without a monolithic array, at the cost of operational complexity.
  • Fact 9: Two-node clusters are historically awkward in quorum-based systems; the third vote (human, witness, or qdevice) is a well-worn pattern.

FAQ

1) Do I need three nodes for Proxmox HA?

If you want sane quorum behavior without special components, yes. Two nodes can work with QDevice, but it’s less forgiving and easier to mis-design.

2) What happens when the cluster loses quorum?

Nodes without quorum will block configuration writes in /etc/pve. Running VMs may keep running, but cluster-managed operations (especially HA actions) can be restricted.

3) Can I do HA if my VM disks are on local storage?

Not real HA failover. Proxmox can restart a VM elsewhere only if the disk is accessible there (shared storage) or replicated and promotable with an understood RPO.

4) Is ZFS replication “HA”?

It’s “automated restart with replication.” Great for many workloads, but it’s not zero-RPO. If that’s acceptable and tested, it’s a valid design.

5) Do I really need fencing?

If you use shared storage and care about data integrity, yes. Without fencing, a network partition can lead to two active writers.
Sometimes you get lucky. Sometimes you get a corrupted database and a long weekend.

6) Should corosync run on the same network as Ceph or NFS?

Avoid it. Corosync wants predictable latency; storage traffic is bursty and will eventually bully it. Separate them unless your environment is tiny and you accept the risk.

7) How do I know if HA is failing because of storage or because of cluster comms?

Check quorum first (pvecm status). If quorate, check whether storage is active on the target (pvesm status) and read ha-manager status --verbose for explicit errors.

8) Why does Proxmox HA sometimes refuse to move a VM even when a node looks unhealthy?

Because it’s trying to avoid unsafe actions: starting without storage, starting twice, or acting without quorum. “Refusing” is often the correct behavior.

9) Can I stretch a Proxmox cluster across two sites?

You can, but you’re signing up for latency, partition risk, and complex storage replication. If you must, design quorum, fencing, and storage like a distributed system, not like a LAN.

10) What’s the simplest reliable HA design for small teams?

Three nodes, dedicated corosync network, shared storage that is truly shared (or replication with explicit RPO), watchdog enabled, and tested failure scenarios. Keep it boring.

Conclusion: practical next steps

Proxmox clustering and HA work well when you treat them like systems engineering: authority (quorum), safe execution (fencing),
and data accessibility (storage) designed together. If any one of those is hand-waved, the cluster will eventually collect payment.

Next steps you can do this week:

  1. Run the fast diagnosis checks now, while everything is “fine,” and record baseline outputs.
  2. Classify every VM: true HA (shared storage + fencing), restart-with-RPO (replication), or best-effort (local).
  3. Separate corosync traffic from storage and VM traffic, or at least prove the network can handle worst-case jitter under load.
  4. Test one real failure (power off a node, or disable a switch port) during a controlled window and verify: quorum behavior, HA decisions, storage availability.
  5. Document how to fence a node and practice it. You don’t want your first fencing attempt to be during corruption.