Proxmox “failed to start pve-ha-lrm”: why HA won’t start and what to check

That red status line—failed to start pve-ha-lrm—shows up right when you need HA to behave like HA. A node reboots, a storage path hiccups, or a switch gets “improved” by someone with a clipboard, and suddenly the Local Resource Manager (LRM) won’t come up. Your cluster is alive enough to mock you, but not healthy enough to fail over.

This isn’t a “restart the service” situation. pve-ha-lrm is the symptom. The cause is almost always one of: an unhappy cluster filesystem, corosync/quorum trouble, time or name-resolution weirdness, fencing/watchdog misconfiguration, or storage edge cases that HA refuses to guess about.

What pve-ha-lrm actually does (and why it refuses to start)

Proxmox HA is split into two main roles:

  • pve-ha-crm: the cluster brain (the “HA manager,” shipped in the pve-ha-manager package) that decides where resources (VMs/CTs) should run.
  • pve-ha-lrm: the node-local execution layer that starts/stops/migrates resources and reports status back.

The LRM is intentionally conservative. If it can’t trust the cluster state, can’t read the cluster filesystem, can’t talk to corosync reliably, or detects fencing/watchdog inconsistencies, it will bail. That’s not “picky.” It’s how you avoid split-brain double starts, which is the HA version of setting your data on fire with paperwork attached.

One operational truth worth tattooing onto a runbook: HA depends on the cluster being boring. Not “innovative.” Not “optimised.” Boring. The less surprising your time, networking, storage semantics, and node identity are, the more HA can safely do its job.

And yes, you can sometimes force things with manual starts. But if you don’t fix the underlying trust issues, HA will keep refusing to take responsibility—like a teenager asked to drive in a blizzard with bald tires.

Fast diagnosis playbook (first/second/third)

If you’re on-call and the pager is yelling, you need to locate the bottleneck quickly. Here’s the order that wastes the least time.

First: determine if this is a cluster/quorum problem

  1. Check if corosync is up and has quorum.
  2. Check if pmxcfs is mounted and writable.
  3. Check cluster membership and node name consistency.

If quorum is gone or pmxcfs is broken, HA has no safe world model. Fix that before touching HA services.

Second: confirm HA services and their immediate errors

  1. Read systemd status for pve-ha-lrm and pve-ha-crm.
  2. Read journal logs around the failure time.
  3. Check whether the node is stuck in a fenced/maintenance state.

Third: validate storage prerequisites for HA resources

  1. Shared storage present where expected (or you’re using replication properly).
  2. Storage config consistent across nodes.
  3. Stale locks or storage-level errors not blocking start actions.

Shortcut heuristic: If the UI shows weird cluster status, your problem is not HA. If the UI shows healthy cluster but HA can’t manage one VM, your problem is likely storage or resource config.
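
If you want a single triage pass before diving into the tasks below, here is a minimal sketch (plain commands, no output shown; adjust names to your environment):

cr0x@server:~$ pvecm status | grep -E 'Quorate|Expected votes|Total votes'   # quorum gate first
cr0x@server:~$ findmnt /etc/pve                                              # is pmxcfs mounted?
cr0x@server:~$ systemctl --failed --no-pager | grep -E 'corosync|pve'        # which unit actually failed
cr0x@server:~$ journalctl -u pve-ha-lrm -b -n 30 --no-pager                  # the last error LRM printed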

Hard requirements before HA even thinks about starting

1) Corosync membership and quorum must be sane

LRM depends on the cluster communications layer. If corosync can’t form a stable membership, HA decisions become unsafe: a VM may be started on two nodes, or stopped on the wrong one. Proxmox is designed to avoid that by being stubborn.

2) pmxcfs must be mounted and consistent

pmxcfs is the Proxmox cluster filesystem (a FUSE filesystem) that provides distributed configuration under /etc/pve. If it isn’t mounted, HA can’t read cluster state and config reliably.

3) Node identity must match across layers

Node name mismatches—between hostname, /etc/hosts, corosync nodelist, and what Proxmox thinks—cause subtle breakage. HA processes are not interested in your creativity with DNS aliases.

4) Time must not drift into the uncanny valley

Corosync and distributed coordination behave badly when time jumps around. NTP/chrony issues can create symptoms that look like “random” membership flaps. Spoiler: they aren’t random.
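
A quick way to check time health is sketched below; it assumes chrony, the default time daemon on current Proxmox releases (substitute your own NTP tooling otherwise):

cr0x@server:~$ timedatectl status | grep -E 'synchronized|NTP'   # is the clock considered synced?
cr0x@server:~$ chronyc tracking | grep -Ei 'offset|leap'         # offsets in the seconds range are a red flag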

5) Fencing/watchdog configuration must be coherent

HA without fencing is basically “best effort.” Proxmox can use watchdog-based fencing (and integrates with external fencing approaches depending on setup). If LRM sees watchdog requirements unmet, it may refuse to operate or behave in ways that look like deadlocks.

Joke #1: HA without quorum is like a committee without minutes: everyone remembers a different reality, and somehow finance still wins.

Practical tasks: commands, output meaning, and decisions

Below are 14 practical tasks that you can run on a node reporting failed to start pve-ha-lrm. Each includes what the output means and what decision to make next. Run them on the failing node first, then on at least one healthy node for comparison.

Task 1: Check systemd status for pve-ha-lrm

cr0x@server:~$ systemctl status pve-ha-lrm --no-pager
● pve-ha-lrm.service - PVE Local Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled)
     Active: failed (Result: exit-code) since Fri 2025-12-26 09:11:02 UTC; 2min 3s ago
    Process: 2149 ExecStart=/usr/sbin/pve-ha-lrm (code=exited, status=1/FAILURE)
   Main PID: 2149 (code=exited, status=1/FAILURE)

Dec 26 09:11:02 server pve-ha-lrm[2149]: unable to read cluster config: /etc/pve/corosync.conf: No such file or directory
Dec 26 09:11:02 server systemd[1]: pve-ha-lrm.service: Main process exited, code=exited, status=1/FAILURE
Dec 26 09:11:02 server systemd[1]: pve-ha-lrm.service: Failed with result 'exit-code'.

Meaning: The error line usually tells you which subsystem is broken. Here it’s complaining about /etc/pve, which screams pmxcfs not mounted or cluster FS unavailable.

Decision: If errors mention /etc/pve, jump straight to Tasks 4–6 (pmxcfs + corosync + quorum).

Task 2: Read recent journal logs for pve-ha-lrm

cr0x@server:~$ journalctl -u pve-ha-lrm -b --no-pager -n 120
Dec 26 09:10:59 server systemd[1]: Starting PVE Local Resource Manager Daemon...
Dec 26 09:11:02 server pve-ha-lrm[2149]: starting lrm service
Dec 26 09:11:02 server pve-ha-lrm[2149]: can't initialize HA stack - aborting
Dec 26 09:11:02 server pve-ha-lrm[2149]: cfs-lock 'file-ha_agent' error: no quorum
Dec 26 09:11:02 server systemd[1]: pve-ha-lrm.service: Main process exited, code=exited, status=1/FAILURE
Dec 26 09:11:02 server systemd[1]: pve-ha-lrm.service: Failed with result 'exit-code'.

Meaning: This is the classic: no quorum. HA is refusing to operate because it cannot safely coordinate.

Decision: Do not “force” HA services. Fix quorum: corosync membership, network, votes, or node count. Go to Task 7 and Task 8.

Task 3: Confirm pve-ha-crm status (the manager side matters)

cr0x@server:~$ systemctl status pve-ha-crm --no-pager
● pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled)
     Active: active (running) since Fri 2025-12-26 09:01:18 UTC; 12min ago
   Main PID: 1422 (pve-ha-crm)
      Tasks: 6 (limit: 154838)
     Memory: 33.4M
        CPU: 1.820s
     CGroup: /system.slice/pve-ha-crm.service
             └─1422 /usr/sbin/pve-ha-crm

Meaning: Manager running doesn’t guarantee LRM can run. But if both are dead, you’re likely looking at a deeper cluster/pmxcfs issue rather than a single node problem.

Decision: If manager is down across multiple nodes, treat it as cluster-level. If only LRM is failing on one node, treat it as node-local prerequisites or a stuck state.

Task 4: Verify pmxcfs is mounted

cr0x@server:~$ mount | grep -E 'pve|pmxcfs'
pve-cluster on /etc/pve type fuse.pve-cluster (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)

Meaning: If you don’t see pve-cluster on /etc/pve, HA will not work because config/state is inaccessible.

Decision: If missing, check pve-cluster and corosync (Task 5 and Task 7). Don’t start HA until /etc/pve is mounted.

Task 5: Check pve-cluster service

cr0x@server:~$ systemctl status pve-cluster --no-pager
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
     Active: active (running) since Fri 2025-12-26 09:00:32 UTC; 13min ago
   Main PID: 1120 (pmxcfs)
      Tasks: 12 (limit: 154838)
     Memory: 41.6M
        CPU: 4.122s
     CGroup: /system.slice/pve-cluster.service
             └─1120 /usr/bin/pmxcfs

Meaning: If pve-cluster is not running, /etc/pve won’t be available.

Decision: If it’s failing, the logs will often say “no quorum” or show corosync connection errors. Fix corosync/quorum first; restarting pmxcfs without quorum is whack-a-mole.

Task 6: Check whether /etc/pve is writable (quorum indicator)

cr0x@server:~$ pvecm status
Cluster information
-------------------
Name:             prod-cluster
Config Version:   26
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Dec 26 09:13:12 2025
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.4c
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Meaning: Quorate: Yes is the green light. If it’s No, HA is expected to stop.

Decision: If not quorate, do not try to “recover” HA by restarting services. You need quorum back, or you need a deliberate decision to run without it (rarely correct, usually desperate).

Task 7: Inspect corosync status

cr0x@server:~$ systemctl status corosync --no-pager
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
     Active: active (running) since Fri 2025-12-26 09:00:25 UTC; 13min ago
       Docs: man:corosync
   Main PID: 1011 (corosync)
      Tasks: 17 (limit: 154838)
     Memory: 30.1M
        CPU: 2.033s
     CGroup: /system.slice/corosync.service
             └─1011 /usr/sbin/corosync -f

Meaning: Running corosync doesn’t guarantee membership stability. You need to check the cluster view (Task 8) and logs (Task 9).

Decision: If corosync is inactive/failed, fix that before HA. If it’s active but quorum is lost, look at networking, node votes, or a missing node.

Task 8: See corosync membership and link state

cr0x@server:~$ corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
        addr    = 10.10.10.11
        status:
                nodeid:          1:  connected
                nodeid:          2:  connected
                nodeid:          3:  connected

Meaning: “connected” across peers is what you want. If you see “disconnected,” “faulty,” or a node missing, that’s your quorum problem.

Decision: If links are flapping, stop chasing HA and fix the network path (MTU mismatch, packet loss, firewall, switch features). Corosync hates surprises.

Task 9: Read corosync logs for the real reason

cr0x@server:~$ journalctl -u corosync -b --no-pager -n 160
Dec 26 09:07:44 server corosync[1011]: [KNET  ] link: host: 2 link: 0 is down
Dec 26 09:07:45 server corosync[1011]: [TOTEM ] A new membership (1.4b) was formed. Members left: 2
Dec 26 09:07:45 server corosync[1011]: [QUORUM] Members[1]: 1
Dec 26 09:07:45 server corosync[1011]: [QUORUM] This node is within the non-quorate partition
Dec 26 09:07:49 server corosync[1011]: [KNET  ] link: host: 2 link: 0 is up

Meaning: Link down/up events correlate with quorum loss and HA refusing to lock cluster state.

Decision: Treat as networking/transport. Look for MTU changes, VLAN issues, asymmetric routing, or NIC offload bugs. If this is happening under load, suspect microbursts or a switch policy that “protects” you from multicast/UDP.
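
If MTU is a suspect, a minimal path test with fragmentation disabled looks like this; 10.10.10.12 stands in for a peer node’s corosync address (following the example addressing used in this article), and 8972 assumes MTU 9000 minus 28 bytes of IP/ICMP headers (use 1472 for a standard 1500 MTU path):

cr0x@server:~$ ping -M do -s 8972 -c 5 10.10.10.12   # must pass end-to-end if jumbo frames are really configured
cr0x@server:~$ ping -M do -s 1472 -c 5 10.10.10.12   # baseline for a standard MTU path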

Task 10: Validate node name and resolution consistency

cr0x@server:~$ hostname -f
pve01.corp.local

cr0x@server:~$ grep -E '(^127\.0\.1\.1|10\.10\.10\.)' /etc/hosts
127.0.1.1 pve01.corp.local pve01
10.10.10.11 pve01.corp.local pve01
10.10.10.12 pve02.corp.local pve02
10.10.10.13 pve03.corp.local pve03

Meaning: Proxmox clusters are sensitive to identity consistency. The node’s hostname should resolve to its cluster address rather than only to 127.0.1.1, it should match what the cluster expects, and resolution for peer nodes should be stable. In the sample above, the FQDN also appears on the 127.0.1.1 line, which is exactly the kind of ambiguity that bites later.

Decision: If hostnames don’t match the cluster node names, or the FQDN resolves to a loopback address before the cluster address, fix identity carefully (and plan downtime). Don’t “just change DNS” and hope.
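
One way to line up the cluster’s view of names against local identity, assuming the default Proxmox corosync config with name and ring0_addr entries (read /etc/corosync/corosync.conf instead if /etc/pve isn’t mounted):

cr0x@server:~$ pvecm nodes                                         # node names as the cluster sees them
cr0x@server:~$ hostname; hostname -f                               # what this node calls itself
cr0x@server:~$ grep -E 'name:|ring0_addr:' /etc/pve/corosync.conf  # nodelist entries: names and cluster addresses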

Task 11: Inspect HA status and node states

cr0x@server:~$ ha-manager status
quorum OK
master pve01 (active, Fri Dec 26 09:13:55 2025)
lrm pve01 (active, Fri Dec 26 09:13:54 2025)
lrm pve02 (active, Fri Dec 26 09:13:51 2025)
lrm pve03 (idle, Fri Dec 26 09:12:10 2025)

service vm:101 (started)
service vm:102 (stopped)
service ct:201 (started)

Meaning: In ha-manager status, idle means the LRM is running but currently manages no HA resources on that node; stale timestamps, a wait_for_agent_lock state, or a missing entry mean the node isn’t really participating. If master is bouncing or absent, your HA manager is unstable.

Decision: If quorum is OK but the LRM entry is stale or missing, focus on node-local issues: systemd, watchdog, permissions, pmxcfs mount, or resource agent errors.

Task 12: Check watchdog device (fencing prerequisite in many setups)

cr0x@server:~$ ls -l /dev/watchdog*
crw------- 1 root root 10, 130 Dec 26 09:00 /dev/watchdog
crw------- 1 root root 10, 129 Dec 26 09:00 /dev/watchdog0

cr0x@server:~$ systemctl status watchdog-mux --no-pager
● watchdog-mux.service - Proxmox VE watchdog multiplexer
     Loaded: loaded (/lib/systemd/system/watchdog-mux.service; enabled)
     Active: active (running) since Fri 2025-12-26 09:00:28 UTC; 13min ago

Meaning: Missing watchdog devices or a dead watchdog multiplexer can block fencing assumptions and make HA conservative (or misbehave during failures).

Decision: If watchdog is missing and your HA policy expects fencing, fix hardware/driver/BIOS watchdog settings before trusting HA.
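
To see which watchdog the HA stack is actually using, a short sketch; by default Proxmox falls back to the software softdog module, and /etc/default/pve-ha-manager is the usual place for a hardware WATCHDOG_MODULE override if you configured one:

cr0x@server:~$ lsmod | grep -Ei 'softdog|wdt|ipmi_watchdog'     # which watchdog driver is loaded
cr0x@server:~$ cat /etc/default/pve-ha-manager                  # optional WATCHDOG_MODULE override
cr0x@server:~$ journalctl -u watchdog-mux -b -n 20 --no-pager   # errors from the multiplexer itself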

Task 13: Check for storage-induced start failures (shared storage visibility)

cr0x@server:~$ pvesm status
Name             Type     Status           Total        Used       Avail      %
local             dir     active        19688240     4680128    14073520  23.77%
shared-nfs         nfs     active      984320000   612480000   371840000  62.22%
ceph-vm            rbd     active      204800000    73216000   131584000  35.75%

Meaning: If the storage a VM depends on is missing/inactive on the node, HA may fail to start that VM and can leave resources in error states. Sometimes LRM startup itself is fine; the first action triggers failures.

Decision: If shared storage is inactive on the failing node, stop blaming HA. Fix the mount/auth/network path; then retry HA actions.
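
To trace a single HA resource to its storage dependencies, a sketch using VM 101 and the storage names from the examples above (substitute your own IDs):

cr0x@server:~$ qm config 101 | grep -E 'scsi|virtio|ide|sata|efidisk|tpmstate'   # which storages the disks live on
cr0x@server:~$ pvesm status --storage shared-nfs                                 # is that storage active on this node?
cr0x@server:~$ pvesm list shared-nfs | grep -w 101                               # can this node see the disk images?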

Task 14: Look for stuck locks or blocked cluster filesystem operations

cr0x@server:~$ pveperf /etc/pve
CPU BOGOMIPS:      55866.40
REGEX/SECOND:      3966050
HD SIZE:           29.34 GB (local-lvm)
BUFFERED READS:    201.15 MB/sec
FSYNCS/SECOND:     1248.61
DNS EXT:           63.63 ms
DNS INT:           0.09 ms (pve01.corp.local)

Meaning: pveperf against /etc/pve is a rough health check. If it stalls or produces strange errors, your cluster filesystem is sick (often because quorum is unstable or the node is overloaded).

Decision: If pveperf /etc/pve hangs, treat it as a cluster comms/quorum issue or severe node resource contention. Fix stability before HA.

Failure modes that map directly to “failed to start pve-ha-lrm”

1) No quorum (most common, most correct behavior)

HA needs consensus. Without quorum, it cannot safely acquire locks in the cluster filesystem. You’ll see errors like:

  • cfs-lock ... error: no quorum
  • unable to read cluster config (because pmxcfs is in read-only/no-quorum mode)

Fix direction: restore membership. Get the missing node back, fix the corosync network, or correct vote expectations. If you’re in a two-node cluster, read the “FAQ” section on why that’s a trap unless you use a qdevice.

2) pmxcfs not mounted or unhealthy

If /etc/pve isn’t mounted, you effectively don’t have a Proxmox cluster on that node. HA will fail. This can happen if:

  • pve-cluster failed to start
  • corosync is down or not authenticated
  • the node boots into a state where it can’t reach peers and therefore can’t get quorum

Fix direction: treat it as “cluster not formed.” Solve corosync and identity first, then restart pve-cluster, then HA.
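
Once corosync is healthy again, a conservative restart order looks like the sketch below: cluster filesystem first, HA daemons last, with a look at the journal in between rather than firing everything at once:

cr0x@server:~$ systemctl restart corosync
cr0x@server:~$ systemctl restart pve-cluster
cr0x@server:~$ journalctl -u pve-cluster -b -n 20 --no-pager   # confirm pmxcfs came up quorate before continuing
cr0x@server:~$ systemctl restart pve-ha-crm pve-ha-lrm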

3) Corosync link flapping (network and MTU sins)

When corosync is unstable, HA tends to look “randomly broken.” It isn’t random; it’s responding to membership changes. Causes:

  • MTU mismatch (especially with VLANs, bonds, jumbo frames)
  • Firewall rules added “temporarily”
  • Switch storm control / multicast filtering / policing UDP
  • NIC/driver offload problems under load

Fix direction: make the corosync network boring: consistent MTU end-to-end, no filtering, low loss, predictable latency.

4) Node name mismatch or stale cluster identity

If node names drift (hostname changed after cluster creation, DNS returns new names, or /etc/hosts differs), cluster membership can form but higher layers can’t reconcile identities. HA can then fail to map resources to nodes.

Fix direction: standardize names. Prefer stable hostnames and static mappings for cluster traffic. If you must change names, do it as a controlled migration with full awareness of corosync configuration.

5) Watchdog/fencing preconditions not met

In HA, fencing is how you ensure a node is really dead before starting its workloads elsewhere. If the watchdog device is missing or misconfigured, some HA setups will refuse to operate as designed or will behave conservatively. The failure might surface as LRM refusing to start or rapidly stopping.

Fix direction: verify watchdog device availability, BIOS settings, kernel modules, and service status. If your policy requires fencing, don’t waive it casually.

6) Resource agent failures that look like LRM failures

Sometimes pve-ha-lrm starts fine but immediately reports resource failures, and the operator interprets it as “LRM won’t start.” The logs tell the truth: a VM start/migrate action fails due to storage, locks, or config.

Fix direction: separate “LRM daemon failed” from “LRM couldn’t execute resource actions.” Use ha-manager status and per-VM logs.

Quote (paraphrased idea): the management line often attributed to Peter Drucker: “You can’t manage what you don’t measure.” In HA, membership and quorum are the measurements that matter first.

Common mistakes: symptom → root cause → fix

This section is intentionally blunt. These are the mistakes that keep recurring because they feel like reasonable shortcuts at 2 a.m.

Mistake 1: “Restart HA services” when quorum is lost

Symptom: pve-ha-lrm fails with no quorum and you keep restarting it.

Root cause: Cluster cannot safely lock state; restarting doesn’t restore consensus.

Fix: Restore corosync membership and quorum. Only then restart HA services (if they don’t auto-recover).

Mistake 2: Two-node cluster without a tie-breaker

Symptom: One node reboot or a minor network issue takes HA down; pve-ha-lrm fails; Quorate: No.

Root cause: With 2 nodes, losing either one means you lose majority unless you use a qdevice or a third vote.

Fix: Add a qdevice (or a third node). If you can’t, accept that “HA” is conditional and operationally expensive.
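
Adding a qdevice means running corosync-qnetd on a third machine outside the cluster and pointing both nodes at it. A minimal sketch, where 10.10.10.50 is a hypothetical address for that external host (it needs the corosync-qnetd package and root SSH reachable from the nodes):

cr0x@server:~$ apt install corosync-qdevice                    # on every cluster node
cr0x@server:~$ pvecm qdevice setup 10.10.10.50                 # registers the external vote
cr0x@server:~$ pvecm status | grep -E 'Qdevice|Total votes'    # expect one more vote in the totals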

Mistake 3: Changing hostname/DNS after cluster creation

Symptom: Corosync seems up, but nodes show odd identities; HA can’t coordinate; LRM errors reference missing nodes.

Root cause: Cluster node identity is not just “what DNS says today.”

Fix: Keep stable hostnames. If renaming is mandatory, plan a controlled reconfiguration and validate every node’s view of names and IPs.

Mistake 4: Corosync network shares a congested path with storage or VM traffic

Symptom: HA drops in/out during backup windows, migrations, or storage rebalancing; membership flaps.

Root cause: Corosync traffic is sensitive to loss/latency. Congestion makes it look like nodes are failing.

Fix: Put corosync on a dedicated, low-loss network (or at least a protected VLAN/QoS). Keep MTU consistent.

Mistake 5: “Optimizing” MTU or bonding without testing corosync

Symptom: Random quorum loss after network “improvements.”

Root cause: MTU mismatch or LACP hash changes causing packet loss/reordering for UDP.

Fix: Validate end-to-end MTU with real tests, and verify corosync stability under load. If you can’t prove it, don’t ship it.

Mistake 6: HA resources configured without shared storage guarantees

Symptom: HA refuses to start or immediately errors when trying to start a VM elsewhere.

Root cause: VM disks live on node-local storage; failover can’t access them.

Fix: Move disks to shared storage, use Ceph/RBD, or use replication where appropriate. “HA” does not teleport data.

Joke #2: The only thing scarier than split-brain is realizing both halves think they’re the “primary” because you named them that.

Checklists / step-by-step plan

Checklist A: Bring HA back safely on one node

  1. Confirm corosync is stable: no membership flaps in logs for at least several minutes during normal traffic.
  2. Confirm quorum: pvecm status shows Quorate: Yes.
  3. Confirm pmxcfs: mount | grep /etc/pve shows it mounted read-write.
  4. Confirm time sync: chrony/ntp stable; no big offsets.
  5. Confirm watchdog (if used): devices present and service active.
  6. Start HA stack (if not already): start manager then LRM, and watch logs.
  7. Verify HA status: ha-manager status shows LRM active.
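
The same checklist as a compact verification pass, assuming chrony and watchdog-mux as in the earlier examples:

cr0x@server:~$ pvecm status | grep Quorate                               # step 2
cr0x@server:~$ findmnt -o TARGET,FSTYPE,OPTIONS /etc/pve                 # step 3: fuse mount, rw
cr0x@server:~$ chronyc tracking | grep -i offset                         # step 4
cr0x@server:~$ ls -l /dev/watchdog* && systemctl is-active watchdog-mux  # step 5
cr0x@server:~$ systemctl start pve-ha-crm pve-ha-lrm                     # step 6
cr0x@server:~$ ha-manager status | head                                  # step 7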

Checklist B: When you should stop and declare an incident

  • Quorum flaps repeatedly and you can’t correlate it to a single node outage.
  • /etc/pve intermittently disappears or becomes read-only.
  • Corosync logs show frequent link up/down without a clear physical reason.
  • Nodes disagree on membership or expected votes.
  • You see signs of storage corruption or repeated I/O errors on the system disk (pmxcfs depends on the node being healthy too).

Checklist C: Clean restart sequence (only after quorum is solid)

cr0x@server:~$ systemctl restart pve-ha-crm
cr0x@server:~$ systemctl restart pve-ha-lrm
cr0x@server:~$ systemctl status pve-ha-lrm --no-pager
● pve-ha-lrm.service - PVE Local Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled)
     Active: active (running) since Fri 2025-12-26 09:20:05 UTC; 3s ago
   Main PID: 3011 (pve-ha-lrm)

Meaning: If it starts cleanly now, the earlier failure was almost certainly quorum/pmxcfs/identity, not “a broken HA package.”

Decision: If it still fails, go back to logs and correlate with the failure modes. Don’t loop restarts; you’ll just create more noise.

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “DNS is fine; it’s just names”

A mid-sized company ran a Proxmox cluster in a regional office. It wasn’t glamorous, but it hosted the usual: a monitoring stack, a few Windows VMs, internal apps, and a file service. HA was enabled because someone promised management “automatic failover.”

One Monday, they migrated internal DNS to a new pair of servers. The plan was clean: replicate zones, update DHCP options, then retire the old boxes. No changes to Proxmox were scheduled because, in the words of the change request, “DNS does not impact hypervisor clustering.”

After the cutover, HA started acting haunted. pve-ha-lrm failed on one node, then another. Corosync was “running,” but quorum wobbled during normal operations. The real culprit wasn’t DNS uptime—it was DNS behavior. The new resolver returned different answers for short names vs. FQDNs, and one node had a stale /etc/hosts mapping. Corosync membership formed, then higher-level identity comparisons didn’t line up consistently. The cluster was basically having an existential crisis.

The fix was boring: pin the cluster network resolution to stable entries, align hostnames with what Proxmox expects, and stop using ambiguous short-name lookups for cluster-critical paths. HA recovered immediately.

The wrong assumption was that naming is cosmetic. In clusters, names are identity. Identity is security. Security is coordination. Coordination is whether your VM starts once or twice.

Optimization that backfired: “Let’s jumbo-frame everything”

An enterprise team had a decent three-node Proxmox cluster with shared storage and a separate corosync VLAN. They wanted faster live migration and less CPU overhead, so they rolled out jumbo frames across the virtualization fabric. The change was approved quickly because they had done it in other environments.

They set MTU 9000 on host NICs and the top-of-rack switches. Storage performance improved. Migrations got faster. Everyone took a victory lap.

Then HA started failing to start on one node after reboots. The error was intermittent: sometimes pve-ha-lrm would start, sometimes it would throw no quorum briefly and stay dead. Corosync logs showed link flaps that lasted a few seconds, usually when backups kicked off.

The “optimization” was incomplete: one switch-to-switch trunk in the corosync path still had MTU 1500. Under light traffic, fragmentation and buffering masked the problem. Under load, packets dropped. Corosync interpreted the loss as node instability, quorum dipped, and HA processes refused to lock state.

The fix was not heroic: make MTU consistent end-to-end, then verify with real packet tests and sustained traffic. After that, HA behaved like it had never met these people before.

The lesson: performance improvements that touch network fundamentals need a validation plan that includes corosync stability. Your storage might tolerate a little loss. Your cluster coordinator will not.

Boring but correct practice that saved the day: “Dedicated corosync + predictable fencing”

A healthcare org (the kind that loves paperwork more than oxygen) ran Proxmox for non-clinical workloads. Their SRE lead insisted on two habits: (1) corosync on its own physically separate network and (2) watchdog fencing configured and tested quarterly. The team grumbled because it felt like process for process’s sake.

During a power event, one node returned in a bad state: the OS was up, but the NIC driver was wedged and dropping packets. From the outside, it looked alive enough to confuse humans. Corosync membership was unstable. Without fencing, this is where clusters go to die slowly—because everyone debates whether a node is “really” down.

Watchdog did its job. The unhealthy node was fenced cleanly, quorum stabilized, and HA started workloads on the remaining nodes. Users saw a short disruption, not a day of “intermittent slowness” that turns into a blame festival.

Afterward, the incident review was almost boring. The runbook matched reality, the failure domain was clear, and the fix was focused: replace the NIC, update firmware, re-test. No mystery. No folklore.

Boring practices aren’t glamorous. They’re also the reason you get to sleep.

Interesting facts and historical context

  • Fact 1: Proxmox’s cluster config lives in pmxcfs, a FUSE-based distributed filesystem mounted at /etc/pve, not in “normal” local files.
  • Fact 2: Corosync’s modern transport in Proxmox commonly uses knet, designed to handle multi-link redundancy and better network behavior than older approaches.
  • Fact 3: The concept of quorum is older than modern virtualization; it comes from distributed systems coordination problems where majority agreement prevents split-brain.
  • Fact 4: HA stacks often separate “decide” and “do” components (manager vs. local agent). Proxmox follows that pattern with manager and LRM for safety and clarity.
  • Fact 5: Fencing is not a Proxmox invention; it’s a long-standing cluster principle: if you can’t prove a node is dead, assume it’s alive and dangerous.
  • Fact 6: Two-node clusters are historically tricky across vendors because any single failure removes majority; tie-breakers (qdevice, witness) are the standard fix.
  • Fact 7: Cluster filesystems and consensus mechanisms often degrade “fail-safe” by going read-only or refusing locks when quorum is lost—this is a deliberate safety design.
  • Fact 8: Many “HA failures” are actually storage semantics issues—shared storage is not just “mounted,” it must be consistent, performant, and identically configured across nodes.

FAQ

1) Does “failed to start pve-ha-lrm” always mean corosync is broken?

No. Corosync/quorum is the most common cause, but LRM can also fail due to /etc/pve issues, identity mismatches, watchdog/fencing problems, or node-local corruption/overload. Start with quorum because it’s a hard gate.

2) If quorum is lost, can I force HA to start anyway?

You can try, but you’re asking HA to operate without a consistent reality. That’s how you get split-brain starts and storage corruption. The correct move is to restore quorum or deliberately disable HA and manage workloads manually until the cluster is safe.

3) Why does pmxcfs matter for pve-ha-lrm?

HA uses cluster-wide configuration and locks stored under /etc/pve. If pmxcfs is not mounted or is read-only due to no quorum, HA can’t coordinate state and will refuse to run.

4) What’s the difference between pve-ha-crm failing and pve-ha-lrm failing?

pve-ha-crm (the manager) is the coordinator; if it’s down cluster-wide, HA decisions won’t happen. pve-ha-lrm is per-node; if it’s down on one node, that node can’t execute HA actions even if the cluster is otherwise fine.

5) Can a single bad node prevent HA from working on other nodes?

Yes, if that bad node causes quorum loss or membership instability. If quorum remains intact, other nodes can usually continue. The key is whether the cluster can form a stable majority partition.

6) My cluster is two nodes. Is that “real HA”?

It can be, but only if you add a tie-breaker vote (qdevice/witness). Without it, the loss of either node (or a network partition) typically kills quorum, and HA will correctly refuse to act.

7) HA is up, but VMs won’t fail over. Is that still an LRM issue?

Sometimes. LRM may be running but failing resource actions due to storage not being available on the target node, migration constraints, or resource configuration. Check ha-manager status, storage status, and per-VM start logs.

8) What’s the single fastest indicator that I should stop touching HA and fix the cluster?

If you see cfs-lock ... no quorum or Quorate: No. HA is downstream. Fix quorum and corosync stability first.

9) Can time drift really cause HA startup problems?

Yes. Time jumps and unstable sync can correlate with corosync membership issues and odd lock behavior. It’s rarely the first suspect, but it’s a real contributor in messy environments.

10) If I fix quorum, do I need to reboot nodes to get HA back?

Usually no. Once quorum and pmxcfs are healthy, restarting the HA services is typically sufficient. Rebooting can help only if a node is wedged (driver issues, memory pressure, filesystem errors).

Conclusion: practical next steps

When Proxmox says failed to start pve-ha-lrm, it’s rarely “an HA bug.” It’s HA refusing to operate without a trustworthy cluster. Treat that refusal as a safety feature, not an obstacle.

Do this next, in order:

  1. Confirm quorum and membership stability with pvecm status, corosync-cfgtool -s, and corosync logs.
  2. Verify pmxcfs health: /etc/pve mounted, readable, writable (when quorate).
  3. Check identity and time: hostnames, /etc/hosts, and time sync sanity.
  4. Validate watchdog/fencing assumptions so HA can make safe decisions during partial failures.
  5. Only then restart HA services and verify with ha-manager status.

If you follow that flow, you’ll fix the root cause instead of arguing with a daemon that’s doing exactly what it was designed to do: refuse to guess.
