The first time Proxmox HA “saves” you by rebooting a VM onto a node that can’t see its disk,
you learn a useful lesson: availability is a system property, not a checkbox.
Clustering is easy. Surviving failure is the hard part.
This is the practical, production-focused guide: how Proxmox clustering and HA actually work,
what breaks in real environments, and how to design a cluster that doesn’t turn minor outages into interpretive dance.
A mental model that matches reality
Proxmox clustering isn’t magic. It’s three separable problems that people love to blur together:
- Control plane: nodes agree on membership and configuration (corosync + pmxcfs).
- Data plane: VM disks are accessible where the VM runs (shared storage or replication).
- Workload orchestration: when a node fails, something decides to restart VMs elsewhere (the pve-ha-manager stack: pve-ha-crm + pve-ha-lrm).
HA only works if all three are designed to fail gracefully. You can have a perfectly healthy cluster
with utterly non-HA storage. You can also have flawless shared storage with a control plane that panics at packet loss.
And you can have both, then sabotage yourself with “optimizations” like aggressive timeouts or a single switch.
The operational rule is simple: if you cannot explain where the VM’s disk lives during failover, you do not have HA.
You have “the ability to reboot VMs somewhere else.” Those are not the same.
What’s inside: corosync, pmxcfs, pve-ha-manager
Corosync: membership and messaging
Corosync is the cluster communication layer. It handles node membership and reliably distributes cluster state.
Proxmox uses it for the “who’s in the club” question. It’s sensitive to latency, jitter, and packet loss in exactly the way your busiest switch uplink is sensitive to “it’ll be fine.”
pmxcfs: the config filesystem
Proxmox stores cluster configuration in /etc/pve, backed by pmxcfs, a distributed filesystem.
It’s not meant for your VM images. It’s for config: storage definitions, VM configs, ACLs, cluster settings.
When quorum is lost, /etc/pve flips into a protective mode. Writes are blocked because letting two partitions write config independently is how you get config split brain.
This is a feature. It’s also why people think “Proxmox is down” when only writes are blocked.
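A minimal probe from a shell, if you want to see the difference yourself (the file name is arbitrary and the exact error text varies by version): if a trivial write to /etc/pve fails on an otherwise responsive node, check quorum before anything else.
cr0x@pve1:~$ touch /etc/pve/.writetest && rm /etc/pve/.writetest && echo "pmxcfs writable (quorate)" || echo "pmxcfs write blocked (check quorum)"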
pve-ha-manager: the HA brain and local hands
The HA stack (the pve-ha-manager package) has a cluster resource manager (pve-ha-crm) and a local resource manager (pve-ha-lrm) on every node. The manager decides what should run where; the local agent executes starts and stops.
It uses a mix of watchdog logic and cluster state to avoid running the same VM twice. Avoid. Not guarantee. The guarantee requires fencing.
One line to keep on your wall, in the spirit of operational reality: "Hope is not a strategy."
It has been attributed to engineers and operators everywhere, and it is correct regardless of attribution.
Quorum: the single most important word in your cluster
Quorum answers: does this partition of nodes have the authority to make cluster decisions?
If you run a cluster without understanding quorum, you’re driving a forklift in a glass shop.
Why quorum exists
In a network partition, you can end up with two groups of nodes that can’t see each other.
Without quorum, both sides might “do the right thing” locally and start the same HA VM.
That’s how you corrupt filesystems, databases, and your relationships with the finance team.
Rule of thumb: odd numbers win
A 3-node cluster can lose 1 node and keep quorum. A 2-node cluster can’t.
You can make 2-node work with a quorum device (QDevice), but if you’re trying to do HA on the cheap,
you’ll find out the expensive way why people recommend 3.
QDevice: the “third vote” without a third hypervisor
QDevice adds an external quorum vote so a 2-node cluster can keep operating when one node is down (or partitioned).
It must be placed carefully: if the QDevice sits on the same network path that failed, it’s just a decorative vote.
Place it where it breaks differently than your cluster interconnect.
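Setting one up is short. A hedged sketch, assuming the external vote runs on a small Debian box at 10.10.50.5 reachable over a path that fails independently of the cluster interconnect (address and hostname are placeholders):
cr0x@qdevice:~$ apt install corosync-qnetd        # on the external host
cr0x@pve1:~$ apt install corosync-qdevice         # on every cluster node
cr0x@pve1:~$ pvecm qdevice setup 10.10.50.5       # run once, from one node
cr0x@pve1:~$ pvecm status                         # expected votes should now include the external vote
The setup command wants root SSH access from the node to the external host; if that is not acceptable in your environment, corosync-qdevice can be configured by hand instead.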
Joke #1 (short and relevant)
Quorum is like adulthood: if you have to ask whether you have it, you probably don’t.
Storage for HA: what “shared” really means
The storage decision is the HA decision. Everything else is choreography.
Option A: truly shared storage (SAN/NFS/iSCSI/FC)
Shared storage means the VM disk is accessible from multiple nodes at the same time, with correct locking.
In Proxmox terms, that’s storage types that support shared access and the right semantics (e.g., NFS with proper config, iSCSI with LVM, FC with multipath).
Shared storage makes failover fast: restart VM elsewhere, same disk. It also concentrates risk: the storage array becomes your “one thing” that can take everything out.
Some orgs accept that because the array is built like a tank. Others discover “the array” is a single controller and one admin password last changed in 2018.
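For reference, this is roughly what a shared NFS definition looks like in /etc/pve/storage.cfg; the server, export, and mount options here are placeholders, and Proxmox treats NFS storage as shared automatically:
cr0x@pve1:~$ cat /etc/pve/storage.cfg
nfs: nfs-vmstore
        export /export/vmstore
        path /mnt/pve/nfs-vmstore
        server 10.10.30.5
        content images,rootdir
        options vers=4.2
The definition is cluster-wide; whether the export is actually mounted and healthy on each node is a separate question, which is exactly what pvesm status (Task 11 below) answers.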
Option B: hyperconverged shared storage (Ceph)
Ceph gives you distributed storage across your Proxmox nodes (or dedicated storage nodes). Great when designed correctly:
enough nodes, enough NICs, enough disks, enough CPU, and enough humility.
Ceph is not “free HA.” It’s a storage system with its own failure modes, recovery behavior, and performance profile.
During recovery, Ceph can eat your IO budget, which can look like “Proxmox HA is broken” when actually your cluster is just doing exactly what you asked: rebuilding data.
Option C: replication (ZFS replication) + restart
ZFS replication can provide near-HA behavior: replicate VM disks from primary to secondary nodes on a schedule.
If a node dies, you start the VM on a node that has a recent replica. The trade-off is obvious: RPO is not zero unless you do synchronous replication (which is its own beast).
Replication is a strong choice for small clusters that can tolerate losing a few minutes of data, or for workloads where you already have application-level replication.
It’s also far simpler to operate than Ceph for many teams. Simple is good. Predictable is better.
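A minimal sketch, assuming VM 101 lives on ZFS on pve1 and you want a copy on pve2 every 15 minutes with a bandwidth cap (job ID, target, schedule, and rate are all illustrative):
cr0x@pve1:~$ pvesr create-local-job 101-0 pve2 --schedule '*/15' --rate 50
cr0x@pve1:~$ pvesr status
The schedule is your RPO; write it down next to the VM, because nobody remembers it during an incident.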
What breaks storage HA in practice
- Locking mismatch: the storage supports shared reads but not safe shared writes.
- Network coupling: storage traffic and corosync share the same congested link.
- Assumed durability: “it’s RAID” becomes “it’s safe” becomes “where did the datastore go?”
- Failover without fencing: two nodes think they own the same disk.
Fencing: the thing you skip until it hurts
Fencing (STONITH: Shoot The Other Node In The Head) ensures a failed or partitioned node is truly stopped before resources run elsewhere.
In VM HA, fencing is how you prevent the “two active writers” disaster.
Without fencing, you’re relying on timeouts and good luck. Timeouts are not authority. They are guesses.
What fencing looks like in Proxmox-land
Proxmox HA relies on watchdog-based self-fencing; out-of-band management (IPMI/iDRAC/iLO-style) is the operator's tool for hard fencing a node from outside.
In practice, for serious HA:
- Enable and validate watchdog on every node.
- Use out-of-band management to hard power-cycle a node that’s “alive but wrong.”
- Design power and networking so a node can be fenced even if its primary network is dead.
If you’re using shared block storage (iSCSI/FC) and running clustered filesystems or LVM, fencing matters even more.
Data corruption is rarely immediate. It’s usually delayed, subtle, and career-limiting.
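Out-of-band power control is worth scripting and rehearsing before you need it. A hedged example using ipmitool against a BMC on a dedicated management network (address, user, and password are placeholders; your vendor tooling may differ):
cr0x@pve1:~$ ipmitool -I lanplus -H 10.10.99.13 -U fenceadmin -P '***' chassis power status
Chassis Power is on
cr0x@pve1:~$ ipmitool -I lanplus -H 10.10.99.13 -U fenceadmin -P '***' chassis power off
Chassis Power Control: Down/Off
Run it from somewhere that keeps network access when the node's primary network is dead; that is the whole point of the design rule above.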
Cluster networking: rings, latency, and packet loss
Corosync wants low latency and low packet loss. It will tolerate less weirdness than your storage stack.
The classic failure mode is “network is mostly fine,” which is what people say when packet loss is 0.5% and your cluster is intermittently losing quorum.
Separate networks by function
- Corosync: dedicated VLAN or physical network if you can. Predictable latency matters.
- Storage: dedicated too, especially for Ceph or iSCSI.
- Client/VM traffic: keep it from stomping on your cluster control plane.
Dual rings are not decoration
Corosync can use multiple rings (separate networks) for redundancy. It’s worth doing.
If you can’t afford redundant switching, accept that your “HA” is “high anxiety.”
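A dual-ring (dual-link) setup is just a second address per node in corosync.conf. A trimmed, illustrative excerpt, with addresses assumed to match two physically separate networks:
cr0x@pve1:~$ cat /etc/corosync/corosync.conf
totem {
  cluster_name: prod-cluster
  config_version: 42
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  transport: knet
  version: 2
}
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.11
    ring1_addr: 10.10.20.11
  }
  # pve2 and pve3 follow the same pattern with their own addresses
}
On Proxmox, make edits via /etc/pve/corosync.conf (and bump config_version) so pmxcfs distributes the change to every node; editing /etc/corosync/corosync.conf directly on one node is how configs drift.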
Joke #2 (short and relevant)
Packet loss is like termites: the house looks fine until it suddenly isn’t.
How it fails: real breakages and why
Failure mode: quorum loss stops config writes
Symptom: you can log in, VMs might keep running, but you can’t start/stop/migrate, and GUI changes fail.
Root cause: the node (or partition) doesn’t have quorum, so pmxcfs blocks writes.
Correct response: restore quorum, don’t “force it” unless you’re intentionally running a single-node island and understand the consequences.
Failure mode: HA restarts VMs but they won’t boot
Symptom: HA manager tries to move/restart, but VM fails with missing disks, storage offline, or IO errors.
Root cause: storage wasn’t actually shared/available on the target node; or the storage network failed alongside the node.
Correct response: treat storage as part of the failure domain; redesign storage path diversity; ensure storage definitions are consistent and available on all nodes.
Failure mode: split brain or “double start” risk
Symptom: the same service appears to run twice, or shared disk shows corruption, or HA flaps.
Root cause: network partition without effective fencing, mis-set expected votes, or QDevice placed poorly.
Correct response: fix quorum design and add fencing. Don’t tune timeouts until you’ve designed authority.
Failure mode: Ceph recovery makes everything look dead
Symptom: after a node failure, VMs stutter, IO latency spikes, cluster “feels down.”
Root cause: Ceph is backfilling/recovering, saturating disks and network.
Correct response: capacity planning, fast networks, and sane recovery settings. Also: accept that “self-healing storage” spends resources to heal.
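"Sane recovery settings" mostly means throttling backfill so client IO survives the rebuild. A hedged sketch of the usual knobs (values are illustrative, and defaults plus their interaction with the mClock scheduler differ across Ceph releases):
cr0x@pve1:~$ ceph config set osd osd_max_backfills 1
cr0x@pve1:~$ ceph config set osd osd_recovery_max_active 1
cr0x@pve1:~$ ceph config get osd osd_max_backfills
1
Lower values mean slower healing; that is a risk trade-off, not a free win, so decide it before the incident rather than during it.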
Practical tasks: commands, outputs, and decisions
These are the checks I actually run. Each includes what the output means and what decision to make.
Run them on a node unless stated otherwise.
Task 1: Check cluster membership and quorum
cr0x@pve1:~$ pvecm status
Cluster information
-------------------
Name: prod-cluster
Config Version: 42
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sun Dec 28 12:10:41 2025
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.2c
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Meaning: Quorate: Yes means this partition can safely change cluster state (start HA VMs, edit config).
Decision: If Quorate: No, stop making changes and fix connectivity/quorum device before touching HA.
Task 2: Show per-node view and votes
cr0x@pve1:~$ pvecm nodes
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 pve1 (local)
0x00000002 1 pve2
0x00000003 1 pve3
Meaning: Confirms who is in the membership list and how many votes each has.
Decision: If a node is missing unexpectedly, investigate corosync links and node health before blaming HA.
Task 3: Verify corosync ring health (knet links)
cr0x@pve1:~$ corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
addr = 10.10.10.11
status = ok
LINK ID 1 udp
addr = 10.10.20.11
status = ok
Meaning: Both rings are up. If a link is down, you’ve lost redundancy.
Decision: If ring redundancy is broken, treat it as a sev-2: the next switch hiccup becomes a cluster incident.
Task 4: Check corosync timing and membership from runtime stats
cr0x@pve1:~$ corosync-quorumtool -s
Quorum information
------------------
Date: Sun Dec 28 12:11:22 2025
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 1
Ring ID: 1.2c
Quorate: Yes
Meaning: Confirms votequorum’s perspective; useful when the UI lies or logs are noisy.
Decision: If this says non-quorate, stop chasing storage/perf ghosts and fix cluster comms first.
Task 5: Confirm pmxcfs is healthy and not stuck in read-only mode
cr0x@pve1:~$ pvesh get /cluster/status
[
{
"type": "cluster",
"name": "prod-cluster",
"version": 42,
"quorate": 1
},
{
"type": "node",
"name": "pve1",
"online": 1,
"ip": "10.10.10.11"
},
{
"type": "node",
"name": "pve2",
"online": 1,
"ip": "10.10.10.12"
},
{
"type": "node",
"name": "pve3",
"online": 1,
"ip": "10.10.10.13"
}
]
Meaning: "quorate": 1 indicates the cluster filesystem can accept writes normally.
Decision: If quorate is 0, don’t attempt edits to /etc/pve; they won’t stick (or worse, you’ll create a mess during recovery).
Task 6: Inspect HA manager state
cr0x@pve1:~$ ha-manager status
quorum OK
master pve2 (active, Sat Dec 27 23:14:02 2025)
service vm:101 (running, node=pve1)
service vm:120 (running, node=pve3)
lrm pve1 (active, Sat Dec 27 23:14:11 2025)
lrm pve2 (active, Sat Dec 27 23:14:02 2025)
lrm pve3 (active, Sat Dec 27 23:14:07 2025)
Meaning: Shows which node is HA master and whether local resource managers are active.
Decision: If LRMs are inactive or master flaps, focus on quorum and corosync stability before touching individual VMs.
Task 7: Explain why a specific HA VM isn’t moving
cr0x@pve1:~$ ha-manager status --verbose
quorum OK
master pve2 (active, Sat Dec 27 23:14:02 2025)
service vm:101 (running, node=pve1)
state: started
request: none
last_error: none
service vm:130 (error, node=pve2)
state: stopped
request: start
last_error: unable to activate storage 'ceph-vm' on node 'pve2'
Meaning: HA isn’t “broken,” it’s blocked by storage activation on the target node.
Decision: Switch to storage debugging (Ceph/NFS/iSCSI), not HA tuning.
Task 8: Check cluster network loss/latency quickly
cr0x@pve1:~$ ping -c 20 -i 0.2 10.10.10.12
PING 10.10.10.12 (10.10.10.12) 56(84) bytes of data.
64 bytes from 10.10.10.12: icmp_seq=1 ttl=64 time=0.355 ms
64 bytes from 10.10.10.12: icmp_seq=2 ttl=64 time=0.420 ms
...
--- 10.10.10.12 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 3815ms
rtt min/avg/max/mdev = 0.312/0.401/0.612/0.081 ms
Meaning: Clean ping doesn’t prove corosync is happy, but loss/jitter here is a smoking crater.
Decision: If packet loss exists, stop. Fix network first. Corosync under loss causes phantom failures everywhere else.
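If ping looks clean but you still suspect the interconnect, ask corosync itself how it sees each peer link (exact output format varies with the corosync version):
cr0x@pve1:~$ corosync-cfgtool -n
Local node ID 1, transport knet
nodeid: 2 reachable
   LINK: 0 udp (10.10.10.11->10.10.10.12) enabled connected mtu: 1397
   LINK: 1 udp (10.10.20.11->10.10.20.12) enabled connected mtu: 1397
nodeid: 3 reachable
   LINK: 0 udp (10.10.10.11->10.10.10.13) enabled connected mtu: 1397
   LINK: 1 udp (10.10.20.11->10.10.20.13) enabled connected mtu: 1397
A link that shows enabled but not connected is the redundancy you thought you had but don't.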
Task 9: Review corosync logs for membership churn
cr0x@pve1:~$ journalctl -u corosync -S -2h --no-pager | tail -n 20
Dec 28 10:41:03 pve1 corosync[1203]: [KNET ] link: host: 2 link: 0 is down
Dec 28 10:41:06 pve1 corosync[1203]: [QUORUM] This node is within the primary component and will provide service.
Dec 28 10:41:11 pve1 corosync[1203]: [KNET ] link: host: 2 link: 0 is up
Dec 28 10:41:12 pve1 corosync[1203]: [TOTEM ] A new membership (1.2b) was formed. Members joined: 2
Meaning: Links flapping caused membership re-forms. That’s cluster turbulence.
Decision: Treat repeated membership changes as an outage precursor. Investigate NICs, switches, bonding, VLAN config, MTU, and congestion.
Task 10: Validate watchdog availability (self-fencing prerequisite)
cr0x@pve1:~$ dmesg | grep -i watchdog | tail -n 10
[ 1.842113] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[ 1.842205] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
Meaning: A hardware watchdog driver is present. That’s the base for self-fencing.
Decision: If no watchdog is present, don’t pretend you have safe HA. Add hardware support or out-of-band fencing.
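Which watchdog module the HA stack loads is configured in /etc/default/pve-ha-manager (softdog is the fallback if nothing is set). A hedged check, with module names and file contents depending on your hardware and PVE version:
cr0x@pve1:~$ cat /etc/default/pve-ha-manager
# select watchdog module (default is softdog)
#WATCHDOG_MODULE=ipmi_watchdog
cr0x@pve1:~$ lsmod | grep -E 'softdog|iTCO_wdt|ipmi_watchdog'
iTCO_wdt               16384  1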
Task 11: Confirm storage visibility on a target node (NFS example)
cr0x@pve2:~$ pvesm status
Name Type Status Total Used Avail %
local dir active 19660800 2457600 17203200 12%
nfs-vmstore nfs active 1048576000 734003200 314572800 70%
Meaning: Storage is active on this node. If it’s inactive, HA can’t start disks there.
Decision: If inactive, fix mount/network/auth before you allow HA to place workloads here.
Task 12: Confirm a VM disk’s backend and whether it’s migratable
cr0x@pve1:~$ qm config 101
boot: order=scsi0;net0
cores: 4
memory: 8192
name: app-prod-01
net0: virtio=12:34:56:78:9a:bc,bridge=vmbr0
scsi0: ceph-vm:vm-101-disk-0,iothread=1,size=80G
scsihw: virtio-scsi-pci
vmgenid: 0b0b0b0b-1111-2222-3333-444444444444
Meaning: Disk is on ceph-vm. That’s shared (assuming Ceph is healthy).
Decision: If you see local-lvm or a node-local ZFS dataset with no replication, do not expect HA to restart elsewhere successfully.
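A small hedged sketch to catch this mistake cluster-wide: list HA-managed VMs and flag any whose config references node-local storage. The storage IDs in the grep pattern are assumptions; adjust them to your own local storage names.
cr0x@pve1:~$ for vmid in $(ha-manager config | awk -F: '/^vm:/ {print $2}'); do \
      grep -q -E 'local-lvm:|local-zfs:' /etc/pve/nodes/*/qemu-server/${vmid}.conf 2>/dev/null \
      && echo "WARNING: HA VM ${vmid} has a disk on node-local storage"; done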
Task 13: Check Ceph health (if you use it)
cr0x@pve1:~$ ceph -s
cluster:
id: 2f4a9d2e-aaaa-bbbb-cccc-111122223333
health: HEALTH_WARN
1 osds down
Degraded data redundancy: 12/3456 objects degraded
services:
mon: 3 daemons, quorum pve1,pve2,pve3 (age 7m)
mgr: pve1(active, since 2h)
osd: 9 osds: 8 up (since 1m), 9 in (since 30d)
data:
pools: 3 pools, 128 pgs
objects: 1152 objects, 4.5 GiB
usage: 220 GiB used, 1.8 TiB / 2.0 TiB avail
pgs: 12 active+undersized+degraded, 116 active+clean
Meaning: Ceph is up but degraded; performance may be bad and HA restarts may be slow.
Decision: During degraded states, avoid mass migrations and don’t trigger avoidable restarts. Stabilize Ceph first.
Task 14: Confirm replication status for ZFS replication setups
cr0x@pve1:~$ pvesr status
JobID Type State Last Sync Duration Error
1000 local ok 2025-12-28 11:55:02 00:01:42 -
1001 local failed 2025-12-28 11:40:02 00:00:11 ssh connection failed
Meaning: Job 1001 failed; replicas may be stale on the target node.
Decision: Don’t claim HA coverage for VMs on a replication job that’s failing. Fix transport/auth, then validate RPO.
Task 15: Check if a node is resource-starved (load, memory pressure, swap)
cr0x@pve1:~$ pvesh get /nodes/pve1/status
{
"cpu": 0.71,
"loadavg": [2.15, 2.07, 1.94],
"memory": {
"total": 68719476736,
"used": 51234567890,
"free": 17484908846
},
"swap": {
"total": 8589934592,
"used": 2147483648
},
"uptime": 123456
}
Meaning: You’re swapping. In clusters, swapping turns “minor network blip” into “why did everything stall.”
Decision: If swap is used materially under normal conditions, fix memory pressure before diagnosing HA flaps.
Task 16: Verify time sync (yes, it matters)
cr0x@pve1:~$ timedatectl status
Local time: Sun 2025-12-28 12:14:55 UTC
Universal time: Sun 2025-12-28 12:14:55 UTC
RTC time: Sun 2025-12-28 12:14:55
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
Meaning: Clock is synchronized. Good. Time drift makes logs useless and can upset distributed systems in subtle ways.
Decision: If unsynchronized, fix NTP/chrony across all nodes before you try to interpret event order in an incident.
Fast diagnosis playbook
When a Proxmox cluster “acts weird,” you want speed and correct priorities. Here’s the order that finds the bottleneck fast.
First: is the control plane authoritative?
- Run pvecm status: are you quorate?
- Run corosync-cfgtool -s: are the links up? Is a ring down?
- Scan journalctl -u corosync: are there repeated membership changes?
If quorum is unstable, stop here. Fix the cluster network and membership before investigating HA, storage, or VM symptoms.
Second: can the target node see the VM’s storage?
- pvesm status on the target node: is the storage active?
- qm config <vmid>: where is the disk really?
- If Ceph: ceph -s and watch for degraded/backfill.
If storage isn’t available everywhere it needs to be, HA will flap or restart into failure.
Third: is HA itself stuck, or is it refusing to do something unsafe?
- ha-manager status --verbose: read the error, don't guess.
- Check watchdog/fencing readiness if you see double-start risk or repeated fencing-like behavior.
Fourth: is this actually performance starvation?
- pvesh get /nodes/<node>/status: swap, load, CPU usage.
- Check network loss and latency between nodes; then check the storage network separately.
Common mistakes: symptoms → root cause → fix
1) “HA is broken, it won’t start VMs”
Symptoms: HA actions fail; GUI shows errors; some nodes show “unknown.”
Root cause: Loss of quorum, or corosync membership churn.
Fix: Restore stable corosync connectivity (dedicated network, redundant rings). Add QDevice for 2-node. Don’t change timeouts first.
2) “VM restarted on another node but disk missing”
Symptoms: VM starts then fails; storage errors; disk not found.
Root cause: Disk was on node-local storage (or shared storage not mounted/active on target).
Fix: Put HA VMs on truly shared storage (Ceph/NFS/iSCSI/FC) or implement replication with a defined RPO and test restores/failover.
3) “Cluster freezes during backups/migrations”
Symptoms: corosync timeouts, HA flapping, sluggish GUI during heavy IO.
Root cause: Corosync network shares a congested link with VM/storage traffic; or CPU starvation from encryption/compression/backups.
Fix: Separate networks; rate-limit heavy jobs; pin corosync to low-latency paths; capacity plan CPU for backup windows.
4) “Two-node cluster: sometimes it just stops managing stuff”
Symptoms: after one node outage, remaining node won’t start HA resources or edit config.
Root cause: No QDevice; expected votes misaligned; quorum not achievable alone.
Fix: Add QDevice placed on an independent failure domain; verify expected votes; test “one node down” scenarios.
5) “Ceph is healthy-ish but everything is slow”
Symptoms: high IO latency, VMs pause, failovers sluggish after a fault.
Root cause: Degraded/backfill/recovery saturating IO; undersized network (1GbE) or disks; too few OSDs.
Fix: Engineer Ceph properly: 10/25GbE, enough OSDs, separate public/cluster networks where applicable, and plan recovery impact.
6) “We tuned corosync timeouts and now HA is worse”
Symptoms: more frequent quorum loss, false node deaths, random failovers.
Root cause: Timeouts set below real jitter during load; cluster becomes trigger-happy.
Fix: Revert to sane defaults; fix the network; measure jitter under peak; only then consider careful tuning.
7) “After a reboot, a node can’t join cluster”
Symptoms: node appears offline; corosync won’t start; config mismatch errors.
Root cause: Wrong /etc/hosts mapping, stale corosync config, MTU mismatch, or firewall rules.
Fix: Validate name resolution, consistent MTU, open required ports on cluster networks, and confirm corosync.conf consistency via /etc/pve.
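Two quick hedged checks for this class of join failure: confirm the corosync network actually passes your assumed MTU end to end, and confirm corosync is listening on the address you expect (the payload size below assumes a 9000-byte MTU, minus 28 bytes of IP/ICMP headers; the ss output is illustrative):
cr0x@pve1:~$ ping -M do -s 8972 -c 3 10.10.10.12
cr0x@pve1:~$ ss -ulpn | grep corosync
UNCONN 0 0 10.10.10.11:5405 0.0.0.0:* users:(("corosync",pid=1203,fd=30))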
Checklists / step-by-step plan
Design checklist: building a cluster you can sleep on
- Choose cluster size: Prefer 3+ nodes. If 2 nodes, plan QDevice and fencing from day one.
- Define failure domains: switches, power circuits, racks, ToR pairs, storage paths. Write them down.
- Separate networks: corosync on its own VLAN/physical fabric; storage on its own; client traffic separate.
- Redundancy: dual corosync rings; redundant switching; bonded NICs only if you understand your switch config.
- Storage strategy: shared (Ceph/SAN/NFS/iSCSI) for true HA; replication for “restart with RPO.” Don’t mix expectations.
- Fencing: watchdog + out-of-band power control tested. Document how to fence manually.
- Capacity: N+1 compute. HA without spare capacity is just automated disappointment.
- Operational tests: planned node reboot, power pull, switch port shutdown, storage path failure, QDevice loss (if used).
Implementation plan: from zero to stable
- Build nodes identically (kernel, NIC drivers, BIOS settings for virtualization, storage controllers).
- Set static addressing for corosync networks; ensure stable name resolution.
- Create the cluster; immediately configure dual-ring corosync if you have the networks.
- Validate quorum behavior by temporarily isolating one node (controlled test).
- Stand up storage and validate from every node (pvesm status must be boringly consistent).
- Enable watchdog and validate it exists in logs on every node.
- Create HA groups with sane constraints (don’t allow everything to pile onto the smallest node).
- Put one non-critical VM into HA, test node failure, confirm it comes up elsewhere with correct disk and networking (a command sketch follows this list).
- Roll HA out by service class; keep a “not HA” pool for pets and experiments.
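A minimal sketch of the HA group and test-VM steps above; the group name, node priorities, and VMID 999 are placeholders to replace with your own:
cr0x@pve1:~$ ha-manager groupadd prod-ha --nodes pve1:2,pve2:2,pve3:1
cr0x@pve1:~$ ha-manager add vm:999 --group prod-ha --max_restart 1 --max_relocate 1
cr0x@pve1:~$ ha-manager set vm:999 --state started
cr0x@pve1:~$ ha-manager status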
Operations checklist: weekly “keep it healthy” routine
- Check pvecm status on a random node: confirm stable quorum.
- Scan corosync logs for link flaps and membership churn.
- If Ceph: check ceph -s and address warnings before they become incidents.
- Validate replication jobs (pvesr status) if you rely on them.
- Confirm time sync on all nodes.
- Run a controlled migration during business hours occasionally (yes, on purpose) to catch drift early.
Three corporate-world mini-stories
Mini-story 1: The incident caused by a wrong assumption
A mid-sized enterprise built a three-node Proxmox cluster for “HA.” The team did what many teams do: they clustered nodes,
enabled HA, and felt good. The storage was “shared,” they said, because every node could reach the NAS over NFS.
Nobody asked the unsexy question: is NFS mounted and healthy on every node during the exact failure we’re planning for?
The first real outage was a top-of-rack switch reboot. Nodes stayed up, but the NAS VLAN hiccuped.
Corosync was on the same switching path, so membership churn started. Quorum flickered.
HA saw a node as unhealthy and restarted two VMs onto a node that, at that moment, didn’t have the NFS share mounted.
The VMs failed to boot. Applications went down. The team, staring at the UI, interpreted it as “Proxmox HA is buggy.”
They rebooted nodes in the name of “stabilizing,” which only added more churn and more failed starts.
After the dust settled, the postmortem revealed the wrong assumption: “reachable NFS” was treated as “HA storage.”
The fix wasn’t exotic. They separated corosync onto a dedicated network, made storage mounts resilient and monitored,
and added a hard rule: no HA tag on a VM unless its disk lives on storage proven accessible from any target node under failure tests.
The lesson is unromantic: HA is a contract. If storage can disappear during the same class of failure that triggers failover,
your HA system will eagerly restart workloads into a void and call it success.
Mini-story 2: The optimization that backfired
Another org ran Proxmox with Ceph and had intermittent “node lost” alerts. Someone decided to “tighten detection”
by reducing cluster timeouts. Faster detection equals faster failover, they reasoned. They wanted the system to be “snappier.”
It worked in the lab. In production, it met reality: backup windows, noisy neighbor IO, and a network that was fine
until Ceph recovery traffic decided to be ambitious. During peak load, corosync packets got delayed just enough.
With the tighter timeouts, those delays crossed the line from “tolerated” to “node declared dead.”
HA started moving VMs. Ceph was already busy. Now it had to serve VM IO plus handle recovery plus absorb migration/restart storms.
The cluster spiraled into self-inflicted pain: false failovers causing real load, causing more false failovers.
The fix was slightly embarrassing and completely effective: revert timeouts, then fix the underlying latency and congestion.
They separated traffic, rate-limited the noisiest jobs, and set expectations: failover speed is not worth cluster instability.
The deep lesson: timeouts are not performance knobs. They’re failure detectors. If you tune them below your worst-case jitter,
you aren’t improving availability. You’re generating chaos faster.
Mini-story 3: The boring but correct practice that saved the day
A company with strict compliance requirements ran a Proxmox cluster with shared iSCSI storage and a QDevice.
Their design wasn’t flashy. It was just relentlessly careful: dedicated corosync network, redundant switches,
documented fencing via out-of-band management, and quarterly “pull the plug” tests.
During a scheduled maintenance, a firmware update on one switch went sideways and the switch stopped forwarding traffic properly.
Corosync ring 0 went weird. Ring 1 stayed healthy. The cluster didn’t lose quorum.
VMs didn’t flap. The monitoring system lit up, but the business barely noticed.
The on-call followed the runbook: confirm quorum, confirm corosync ring status, confirm storage paths, then isolate the bad switch.
They moved some workloads deliberately instead of letting HA thrash. They did not panic-reboot nodes.
Post-incident, the team got zero hero points because nothing dramatic happened. That’s the dream.
The boring practice—redundant rings plus rehearsed diagnostics—turned a potentially messy outage into a controlled maintenance event.
If you want HA that looks like competence instead of adrenaline, you need these boring habits. Drama is not a KPI.
Interesting facts and historical context
- Fact 1: Proxmox VE builds its cluster communications on Corosync, a long-running open-source project also used in classic Linux HA stacks.
- Fact 2: The “quorum” concept predates modern virtualization; it’s rooted in distributed systems safety: only a majority can safely decide.
- Fact 3: Split brain wasn’t invented by virtualization; storage clusters and database replication have been fighting it for decades.
- Fact 4: pmxcfs is a configuration filesystem, not a general-purpose distributed filesystem; treating it as “shared storage” is a common misunderstanding.
- Fact 5: Corosync’s move to knet transport improved link handling and redundancy options compared to older setups.
- Fact 6: STONITH as a term is old enough to sound like a joke, but the idea is deadly serious: ensure only one writer exists.
- Fact 7: “HA” originally meant service-level failover for processes; virtualization made the unit of failover a whole machine, but the safety rules stayed.
- Fact 8: Ceph’s popularity in hyperconverged setups came from solving a real problem: scaling storage without a monolithic array, at the cost of operational complexity.
- Fact 9: Two-node clusters are historically awkward in quorum-based systems; the third vote (human, witness, or qdevice) is a well-worn pattern.
FAQ
1) Do I need three nodes for Proxmox HA?
If you want sane quorum behavior without special components, yes. Two nodes can work with QDevice, but it’s less forgiving and easier to mis-design.
2) What happens when the cluster loses quorum?
Nodes without quorum will block configuration writes in /etc/pve. Running VMs may keep running, but cluster-managed operations (especially HA actions) can be restricted.
3) Can I do HA if my VM disks are on local storage?
Not real HA failover. Proxmox can restart a VM elsewhere only if the disk is accessible there (shared storage) or replicated and promotable with an understood RPO.
4) Is ZFS replication “HA”?
It’s “automated restart with replication.” Great for many workloads, but it’s not zero-RPO. If that’s acceptable and tested, it’s a valid design.
5) Do I really need fencing?
If you use shared storage and care about data integrity, yes. Without fencing, a network partition can lead to two active writers.
Sometimes you get lucky. Sometimes you get a corrupted database and a long weekend.
6) Should corosync run on the same network as Ceph or NFS?
Avoid it. Corosync wants predictable latency; storage traffic is bursty and will eventually bully it. Separate them unless your environment is tiny and you accept the risk.
7) How do I know if HA is failing because of storage or because of cluster comms?
Check quorum first (pvecm status). If quorate, check whether storage is active on the target (pvesm status) and read ha-manager status --verbose for explicit errors.
8) Why does Proxmox HA sometimes refuse to move a VM even when a node looks unhealthy?
Because it’s trying to avoid unsafe actions: starting without storage, starting twice, or acting without quorum. “Refusing” is often the correct behavior.
9) Can I stretch a Proxmox cluster across two sites?
You can, but you’re signing up for latency, partition risk, and complex storage replication. If you must, design quorum, fencing, and storage like a distributed system, not like a LAN.
10) What’s the simplest reliable HA design for small teams?
Three nodes, dedicated corosync network, shared storage that is truly shared (or replication with explicit RPO), watchdog enabled, and tested failure scenarios. Keep it boring.
Conclusion: practical next steps
Proxmox clustering and HA work well when you treat them like systems engineering: authority (quorum), safe execution (fencing),
and data accessibility (storage) designed together. If any one of those is hand-waved, the cluster will eventually collect payment.
Next steps you can do this week:
- Run the fast diagnosis checks now, while everything is “fine,” and record baseline outputs.
- Classify every VM: true HA (shared storage + fencing), restart-with-RPO (replication), or best-effort (local).
- Separate corosync traffic from storage and VM traffic, or at least prove the network can handle worst-case jitter under load.
- Test one real failure (power off a node, or disable a switch port) during a controlled window and verify: quorum behavior, HA decisions, storage availability.
- Document how to fence a node and practice it. You don’t want your first fencing attempt to be during corruption.