Proxmox HA “cannot start resource”: finding the real blocker (quorum, storage, network)

If you’ve ever watched Proxmox HA refuse to start a VM with the wonderfully vague “cannot start resource,” you know the feeling: the cluster is “up,” the VM is “fine,” and yet nothing moves.

That message isn’t a diagnosis. It’s a symptom. The real blocker is almost always one of four things: quorum, storage, network partitioning, or a stale HA state machine that’s trying to keep you safe and succeeding a little too hard.

What “cannot start resource” actually means in HA terms

In Proxmox HA, a “resource” is usually a VM or container that the HA stack (the cluster-wide CRM acting as manager, plus the per-node LRM) is responsible for placing, starting, stopping, and relocating. When you see “cannot start resource,” it typically means:

  • The HA manager attempted an action (start/migrate/recover) and the target node refused or couldn’t comply.
  • Or the HA manager refused to act because it believed doing so might create a split-brain, double-mount, or double-start situation.

That second case is why this error is so annoying and so correct. HA is conservative by design. It’d rather keep your service down than corrupt your data quietly. That’s not a bug; it’s the entire job description.

Key operational truth: The blocker is rarely “the VM.” It’s nearly always cluster health (quorum/membership), storage accessibility/locks, or networking between nodes (corosync ring, latency, MTU, drops).

One quote worth remembering, because it’s the whole posture of HA systems:

“Hope is not a strategy.” — General Gordon R. Sullivan

HA doesn’t run on hope. It runs on crisp state and verified reachability. If those aren’t crisp, it stops.

Fast diagnosis playbook (first/second/third)

This is what you do when an executive is hovering, the VM is down, and the HA UI is giving you a shrug.

First: confirm the cluster can make decisions (quorum + membership)

  • Check quorum (pvecm status). No quorum means no reliable “who owns what.”
  • Check corosync membership (corosync-cfgtool -s, corosync-quorumtool -s). Look for missing nodes, ring issues, or “partitioned.”
  • Check that pmxcfs is mounted and healthy (df -h /etc/pve).

Decision: If quorum is missing, stop chasing storage. Fix quorum/membership first. Every other “fix” risks making it worse.

Second: confirm HA manager state isn’t lying (CRM/LRM status)

  • ha-manager status for the resource and what node it thinks should run it.
  • On each node: systemctl status pve-ha-lrm pve-ha-crm.
  • Tail logs: journalctl -u pve-ha-crm -u pve-ha-lrm -n 200 --no-pager.

Decision: If HA services are unhealthy or stuck on a node, fix that before manual VM starts. HA fighting you is not a sport you win.

Third: verify the required storage is truly available on the target node

  • Check Proxmox storage status (pvesm status), and confirm the VM disks’ storage exists on the node.
  • For shared storage: check mount/connection (NFS/iSCSI/Ceph/ZFS replication targets).
  • Look for locks (qm config output includes lock:), and check tasks history.

Decision: If storage isn’t available everywhere HA expects it, the correct fix is to restore storage reachability or adjust the HA group constraints—not to “force start” and pray.

Everything else—CPU pressure, memory pressure, kernel logs, fencing expectations—comes after those three. You can be clever later. Be correct first.

The mental model: quorum, membership, manager, and storage reality

Quorum is not “cluster is up.” It’s “cluster can agree.”

Quorum is the mechanism that prevents split brain: two halves of the cluster both thinking they’re the real cluster. Without quorum, Proxmox deliberately restricts writes to the cluster filesystem and blocks certain actions. HA actions are among the first to be refused, because HA without consensus is how you get double-started VMs and disk corruption.

Corosync is the nervous system; latency is poison

Corosync is a membership and messaging layer. It doesn’t just want packets; it wants timely packets. You can “ping” between nodes and still have corosync flapping because of drops, reordering, or MTU mismatch. HA depends on corosync membership being stable. If membership flaps, HA keeps changing its mind—because it has to.
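
If you want evidence rather than vibes, corosync can report per-link statistics itself. A minimal sketch, assuming corosync 3.x with knet (the default on current Proxmox VE); the exact stats key names vary a little between corosync versions, so treat the grep pattern as a starting point:

cr0x@server:~$ corosync-cmapctl -m stats | grep -Ei 'latency|down_count'

Rising latency averages or non-zero down counters on a link are the kind of numbers that turn “the network is fine” into a useful conversation with the network team.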

pmxcfs is your shared brain

/etc/pve is backed by pmxcfs (Proxmox cluster filesystem). It’s not a generic shared filesystem; it’s a cluster database delivered as a filesystem. If it’s not mounted or not writable due to quorum loss, configs can appear “stale,” HA can’t reliably coordinate, and you’ll get errors that look like VM issues but are really “cluster metadata isn’t consistent.”
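
A minimal health probe for that shared brain, assuming the standard service names on a current Proxmox VE install:

cr0x@server:~$ systemctl is-active pve-cluster corosync
cr0x@server:~$ mount | grep /etc/pve

pve-cluster is the service that provides pmxcfs. If it isn’t active, or /etc/pve isn’t a fuse mount, stop and fix that before blaming the VM.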

Storage is the most common real blocker, because it’s where data can be damaged

HA can start a VM on a node only if the disks are accessible and safe. “Safe” means: the storage backend is present, the VM disks are not mounted elsewhere, and any locks (migration snapshot, backup) are resolved. A lot of backends can be “up” but still unsafe: a stale NFS mount, an iSCSI session that exists but is timing out, a Ceph cluster that’s in degraded mode with blocked I/O, or ZFS that’s alive but thrashing on latency.

Joke #1: High availability is like a fire drill: everyone loves it until you actually pull the alarm.

Interesting facts and context (why the sharp edges exist)

  • Fact 1: Quorum as a concept predates most virtualization stacks; it’s a classic distributed systems control to prevent “split brain” in clustered storage and databases.
  • Fact 2: Corosync evolved from early Linux cluster messaging efforts where stable membership was treated as the cornerstone of safe failover.
  • Fact 3: Proxmox’s pmxcfs deliberately behaves differently under no-quorum conditions: it prioritizes safety over convenience, which is why writes can be blocked even when nodes “seem fine.”
  • Fact 4: HA stacks typically separate “decision making” (cluster resource manager) from “execution” (local resource manager). Proxmox follows that pattern with CRM and LRM components.
  • Fact 5: Storage fencing (ensuring a failed node can’t still write to shared storage) is older than modern hypervisors; it came from clustered filesystems and SAN environments where a single rogue writer could corrupt everything.
  • Fact 6: “Heartbeat” style membership protocols historically suffered from false positives during congestion; modern stacks still face the same physics: drops and jitter look like node death.
  • Fact 7: Shared-nothing approaches (like local ZFS + replication) reduce shared storage failure modes but introduce “which copy is authoritative” problems—still a quorum problem, just wearing a different hat.
  • Fact 8: Many HA incidents are not caused by an actual failure, but by a partial failure: the node is up, the link is flaky, the storage is half-alive. HA hates partial failures because they’re ambiguous.
  • Fact 9: “Cannot start resource” is often an intentional safety message. It’s the HA stack telling you it cannot prove starting is safe, not that it tried and failed like a normal manual start.

Practical tasks: commands, expected output, and decisions (12+)

These are the checks I run when I’m on-call and want the truth quickly. Each task includes what the output means and what you do next.

Task 1: Check cluster quorum and expected votes

cr0x@server:~$ pvecm status
Cluster information
-------------------
Name:             pve-prod
Config Version:   27
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Dec 26 10:14:03 2025
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.22
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

What it means: Quorate: Yes means the cluster can make authoritative decisions and pmxcfs should be writable. If it says No, HA actions will often be blocked.

Decision: If not quorate, stop here and fix corosync connectivity or vote configuration. Don’t force-start HA resources.

Task 2: Check corosync ring status (are all links healthy?)

cr0x@server:~$ corosync-cfgtool -s
Printing link status.
Local node ID 1, transport knet
LINK ID 0 udp
        addr    = 192.168.10.11
        status:
                nodeid:          1:     localhost
                nodeid:          2:     connected
                nodeid:          3:     connected
LINK ID 1 udp
        addr    = 10.10.10.11
        status:
                nodeid:          1:     localhost
                nodeid:          2:     connected
                nodeid:          3:     connected

What it means: With the knet transport (corosync 3.x), you want every peer to show connected on every link; older corosync versions print “ring N active with no faults” instead. Disconnected or bouncing links correlate strongly with HA refusing actions or “flapping” resources.

Decision: If a link is down or flapping, fix network/MTU/VLAN/jumbo frames before touching HA. Dual-link misconfigurations are a classic “looks redundant, behaves brittle” trap.

Task 3: Check corosync quorum details (partition hints)

cr0x@server:~$ corosync-quorumtool -s
Quorum information
------------------
Date:             Fri Dec 26 10:15:31 2025
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          1
Ring ID:          1.22
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

What it means: Confirms the quorum subsystem sees the same reality as pvecm. If these disagree, you’re already in weird territory.

Decision: If not quorate: treat it as a cluster incident, not a VM incident.

Task 4: Verify pmxcfs is mounted (and not in a sad state)

cr0x@server:~$ df -h /etc/pve
Filesystem      Size  Used Avail Use% Mounted on
/dev/fuse       128M   32K  128M   1% /etc/pve

What it means: That /dev/fuse mount is pmxcfs, which is why the numbers look nothing like a normal filesystem. If /etc/pve is not a fuse mount at all, or the command hangs, cluster config distribution is broken.

Decision: If /etc/pve isn’t pmxcfs, fix cluster services. HA won’t behave predictably.

Task 5: Check HA manager view of the world

cr0x@server:~$ ha-manager status
quorum OK
master pve01 (active, Fri Dec 26 10:16:12 2025)

lrm pve01 (active, Fri Dec 26 10:16:10 2025)
lrm pve02 (active, Fri Dec 26 10:16:11 2025)
lrm pve03 (active, Fri Dec 26 10:16:09 2025)
service vm:101 (pve02, started)
service vm:105 (pve03, stopped)
service ct:210 (pve01, started)

What it means: You get the HA master, quorum state, per-node LRM status, and where HA thinks each resource is and in what state. If your broken VM is “stopped” but should be started, focus on why HA can’t place it.

Decision: If the master is flapping between nodes, you likely have corosync instability. Fix that before anything else.

Task 6: Zoom into a single resource’s requested vs. current state

cr0x@server:~$ ha-manager status | grep 'vm:105'
service vm:105 (pve03, stopped)
cr0x@server:~$ ha-manager config | grep -A 2 '^vm:105'
vm:105
        state started

What it means: The configured (requested) state is started, but the current state is stopped on pve03: HA wants the VM up and cannot, or will not, make that happen. The human-readable reason for that gap lives in the CRM/LRM journal on the target node and in the failed task log, which the next tasks cover.

Decision: If the logged error points at storage (for example “storage 'ceph-vm' is not available”), pivot to storage checks on that node now.

Task 7: Check HA services health (CRM/LRM)

cr0x@server:~$ systemctl status pve-ha-crm pve-ha-lrm --no-pager
● pve-ha-crm.service - PVE Cluster Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled)
     Active: active (running) since Fri 2025-12-26 09:01:12 UTC; 1h 15min ago

● pve-ha-lrm.service - PVE Local Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled)
     Active: active (running) since Fri 2025-12-26 09:01:14 UTC; 1h 15min ago

What it means: If either is not running, HA orchestration is compromised. LRM is the “hands” on each node; CRM is the “coordination brain.”

Decision: If one is failed, read logs and fix service stability before trying to start VMs manually.

Task 8: Read HA logs like you mean it

cr0x@server:~$ journalctl -u pve-ha-crm -u pve-ha-lrm -n 200 --no-pager
Dec 26 10:12:44 pve01 pve-ha-crm[2123]: status change: node pve03 online
Dec 26 10:13:02 pve01 pve-ha-crm[2123]: trying to start vm:105 on pve03
Dec 26 10:13:07 pve03 pve-ha-lrm[1988]: unable to start vm 105: storage 'ceph-vm' is not available
Dec 26 10:13:07 pve01 pve-ha-crm[2123]: service vm:105 start failed on node pve03 (exit code 255)

What it means: You get the timeline. If the log says “storage not available,” stop debating. It’s storage.

Decision: Follow the error at the node where the LRM failed. That’s where the missing dependency lives.

Task 9: Confirm storage status as Proxmox sees it

cr0x@server:~$ pvesm status
Name         Type     Status           Total        Used       Avail
local        dir      active        19528604     7824480    10716524
local-lvm    lvmthin  active        19000000     9200000     9800000
ceph-vm      rbd      inactive              0           0           0

What it means: inactive means Proxmox won’t use it. HA won’t start a VM whose disks live there. This is not negotiable.

Decision: Fix Ceph/RBD connectivity on that node, or move/restore disks elsewhere, or change HA placement constraints.
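
If the inactive backend is Ceph/RBD, ask Ceph directly from the affected node. A hedged sketch, assuming the ceph CLI and a usable keyring exist on that node (typical when the node participates in the Ceph cluster):

cr0x@server:~$ ceph -s
cr0x@server:~$ pvesm list ceph-vm

HEALTH_WARN/HEALTH_ERR with blocked requests, or a pvesm list that errors out or hangs, confirms the blocker lives below Proxmox, not in the HA stack.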

Task 10: Verify the VM’s disks and where they live

cr0x@server:~$ qm config 105
boot: order=scsi0;net0
cores: 4
memory: 8192
name: api-prod-01
net0: virtio=DE:AD:BE:EF:00:01,bridge=vmbr0
scsi0: ceph-vm:vm-105-disk-0,size=80G
scsihw: virtio-scsi-pci

What it means: Disk is on ceph-vm. If pvesm status says that storage is inactive on the target node, you’ve found the blocker.

Decision: Don’t try to “start anyway.” Make the storage active, or migrate disks to an available storage backend.

Task 11: Look for locks that block start

cr0x@server:~$ qm config 105 | grep -E '^lock:|^template:'
lock: backup

What it means: A lock can prevent start/migrate. A backup lock often remains after an interrupted backup job or storage hiccup.

Decision: Confirm whether a backup is still running; if it’s stale, remove lock cautiously (after verifying no job is active).
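
If you’ve confirmed the job is genuinely dead (no vzdump process, no active task for that VMID), the lock can be cleared with standard tooling. A sketch of the careful version, not a reflex:

cr0x@server:~$ ps -ef | grep -E '[v]zdump.*105'
cr0x@server:~$ qm unlock 105

qm unlock only removes the config lock; it does not make a still-running backup safe. Verify first, unlock second.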

Task 12: Inspect recent tasks for failures that left state behind

cr0x@server:~$ tail -n 30 /var/log/pve/tasks/index
UPID:pve01:00012A1B:0A2F3C4D:676D9B12:vzdump:105:root@pam:
UPID:pve01:00012A40:0A2F3C99:676D9B55:qmstart:105:root@pam:

What it means: You get a breadcrumb trail of what happened on this node. The task index is per node, so also check it on the node where the LRM attempted the start. Use the UPID to pull details.

Decision: If you see repeated failed starts or recoveries, treat it as systemic (storage/network), not as “try again.”

Task 13: Pull one UPID’s real log

cr0x@server:~$ cat "/var/log/pve/tasks/2/UPID:pve01:00012A1B:0A2F3C4D:676D9B12:vzdump:105:root@pam:"
INFO: starting new backup job: vzdump 105 --mode snapshot
ERROR: ceph-vm: rbd: error opening vm-105-disk-0: (2) No such file or directory
INFO: aborting backup

What it means: If backups fail due to storage, HA start may fail for the same reason. Also, missing RBD images points to a deeper storage inconsistency.

Decision: Stop and validate storage integrity and naming. Don’t “clear locks” until you know what disappeared.
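
To see what Ceph actually has, list the images in the backing pool. The pool name is defined in the storage configuration and may differ from the storage ID “ceph-vm”, so this is a hedged sketch with a placeholder:

cr0x@server:~$ grep -A 5 'rbd: ceph-vm' /etc/pve/storage.cfg
cr0x@server:~$ rbd -p <pool-from-storage-cfg> ls | grep vm-105

If the image genuinely isn’t there, the problem is storage integrity or naming, and no amount of HA retries will fix it.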

Task 14: Validate cluster network health quickly (drops, errors)

cr0x@server:~$ ip -s link show dev bond0
2: bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    RX:  bytes packets errors dropped  missed   mcast
    8942349234 9023434      0     124       0       0
    TX:  bytes packets errors dropped carrier collsns
    7342342342 8234234      0       0       0       0

What it means: Drops on the corosync network (or shared bond) are a smoking gun. Corosync tolerates some loss, but HA stability requires boring networking.

Decision: If drops climb during incidents, investigate switch buffers, MTU mismatch, LACP hashing, or a saturated link.
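
One cheap check before escalating: prove the MTU actually works end-to-end on the corosync path. The peer address below comes from this example cluster; substitute your own. With a 9000-byte MTU, the largest ICMP payload that fits unfragmented is 8972 bytes (9000 minus 20 bytes IP and 8 bytes ICMP header):

cr0x@server:~$ ping -M do -s 8972 -c 3 10.10.10.12

If that fails while a plain ping succeeds, you’ve reproduced the “ping works, corosync doesn’t” trap in one command.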

Task 15: Check time sync (yes, it matters more than you want)

cr0x@server:~$ timedatectl
               Local time: Fri 2025-12-26 10:18:41 UTC
           Universal time: Fri 2025-12-26 10:18:41 UTC
                 RTC time: Fri 2025-12-26 10:18:41
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

What it means: If clocks drift, logs lie, and debugging becomes interpretive dance. Some auth and cluster behaviors also get flaky with time skew.

Decision: If not synchronized, fix NTP/chrony on all nodes before you chase “random” HA behavior.
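
If the nodes run chrony (the default time daemon on recent Proxmox VE releases; adjust if yours uses systemd-timesyncd or ntpd), two commands show the current offset and which sources are trusted:

cr0x@server:~$ chronyc tracking
cr0x@server:~$ chronyc sources -v

An offset measured in milliseconds is fine. Seconds of skew means fix time sync before trusting any cross-node log timeline.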

Task 16: Confirm the VM is not actually running somewhere else

cr0x@server:~$ qm list | awk '$1==105 {print}'
       105 api-prod-01          stopped    8192              80.00 0

What it means: Proxmox says it’s stopped on this node. Repeat on other nodes if you suspect split-brain or stale state.

Decision: If you find it running on another node, don’t “start a second copy.” Investigate HA placement and locks immediately.
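
A quick cluster-wide sweep, assuming root SSH between nodes and the node names from this example (pve01 through pve03); adjust to your inventory:

cr0x@server:~$ for n in pve01 pve02 pve03; do echo "== $n =="; ssh root@$n "qm list | awk '\$1==105'"; done

The only acceptable number of nodes reporting VM 105 as running is zero or one.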

Finding the real blocker: quorum vs storage vs network vs state

1) Quorum loss: the silent “stop everything”

When quorum is lost, a lot of the system still appears operational. Nodes boot. You can SSH. You might even access local storage. The UI might render. And HA will still refuse to do the thing you want, because it cannot prove it’s safe.

Typical signs:

  • pvecm status shows Quorate: No.
  • /etc/pve is read-only or not updating across nodes.
  • HA master flips, or HA reports “quorum not OK.”

What to do: Restore connectivity between the majority of nodes. If you’re running a two-node cluster without a proper third vote, you’re living on the edge by design. Add a quorum device or third node if you want HA behavior that doesn’t resemble performance art.
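
For the two-node case, Proxmox supports an external QDevice as the tie-breaking vote. A hedged sketch: the witness host 192.168.10.50 is hypothetical, needs the corosync-qnetd package installed, and must be reachable over SSH from the cluster; the cluster nodes need corosync-qdevice:

cr0x@server:~$ apt install corosync-qdevice          # on every cluster node
cr0x@server:~$ pvecm qdevice setup 192.168.10.50

Afterwards, pvecm status should show the extra vote, which is what turns “two nodes arguing” into “a majority.”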

2) Network partition: the cluster can’t agree because it can’t talk

Partitions are worse than outages. An outage is honest; a partition is a liar. With a partition, both sides can be alive. Both sides can think the other side is dead. That’s how data corruption is born.

Typical signs:

  • Corosync ring faults on one or more nodes.
  • Intermittent membership changes in journalctl -u corosync.
  • Packet drops, MTU mismatch, or a “helpful” firewall rule added by someone who hates weekends.

What to do: Stabilize corosync networking: dedicated VLAN, consistent MTU end-to-end, no asymmetric routing, no filtering, and predictable latency. HA wants boring. Give it boring.
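
To see whether membership is actually flapping rather than guessing from symptoms, the corosync journal is blunt and honest. A minimal sketch:

cr0x@server:~$ journalctl -u corosync --since "1 hour ago" | grep -Ei 'link|token|membership'

Repeated link down/up messages or new-membership lines during your incident window are the partition telling on itself.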

3) Storage unavailable: the blocker that HA is right to enforce

Storage issues produce the most common “cannot start resource,” because they’re both common and dangerous. HA will refuse to start if it can’t access the VM disks safely. “Safely” includes “not already in use elsewhere.”

Typical signs:

  • pvesm status shows storage inactive on the target node.
  • Ceph/RBD errors in logs, or iSCSI session timeouts, or NFS “stale file handle.”
  • VM config points at storage not present on every node in the HA group.

What to do: Make storage uniformly available to all nodes that might run the resource, or restrict the resource to nodes that have the storage. HA placement without storage symmetry is just chaos with extra steps.
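
Constraints are expressed through HA groups. A hedged sketch with a hypothetical group name, assuming only pve01 and pve02 can actually reach the storage:

cr0x@server:~$ ha-manager groupadd ceph-capable --nodes pve01,pve02 --restricted 1
cr0x@server:~$ ha-manager set vm:105 --group ceph-capable

The restricted flag means the resource stays stopped rather than landing on a node outside the group, which is exactly what you want when “outside the group” means “cannot see the disks.”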

4) Locks and stale state: HA is waiting for a condition that never clears

Sometimes the cluster is healthy and the storage is fine, but the resource is “locked” or HA thinks it’s in transition. This can happen after interrupted migrations, backups, or a node crash mid-operation.

Typical signs:

  • qm config shows lock: values like backup, migrate, snapshot.
  • Tasks show operations that never completed.
  • HA status shows repeated retries with exit code errors but no progress.

What to do: Confirm the operation is not still running. Then clear stale locks carefully. If you clear locks while the underlying process is still active, you can create exactly the kind of dual-writer mess HA exists to prevent.

Joke #2: The only thing worse than a stuck HA manager is two “unstuck” ones, both convinced they’re the hero.

Three corporate mini-stories from real life

Mini-story 1: The incident caused by a wrong assumption (the “ping works” fallacy)

A mid-sized company had a three-node Proxmox cluster with dual corosync rings. They were proud of it. Redundancy, they said. A VM went down during a routine switch maintenance window, and HA refused to start it elsewhere: “cannot start resource.”

The on-call engineer did the classic check: ping between nodes. Clean. SSH worked. They assumed the cluster network was fine and pivoted hard into storage: remounted NFS, restarted iSCSI, even rebooted one node “to clear it.” The VM stayed down, and now they had two nodes disagreeing about cluster membership every few minutes.

The real issue was MTU mismatch introduced during a switch change. One corosync ring ran jumbo frames; one path silently dropped fragmented or oversized packets. ICMP pings were small and passed. Corosync traffic wasn’t. Membership flapped. Quorum wobbled. HA refused to start anything that could risk split brain.

Fixing MTU end-to-end stabilized corosync instantly. HA placed the VM. Storage was innocent. The postmortem’s most important lesson wasn’t “check MTU”—it was “don’t treat ping as proof of cluster health.” Corosync is a timing-sensitive protocol, not a vibes-based relationship.

Mini-story 2: The optimization that backfired (aggressive failover tuning)

Another organization wanted faster failover. They had customer-facing APIs on HA VMs and didn’t like the default detection and recovery timing. Someone tuned the corosync token timeouts and HA retry intervals to be “more responsive.” It looked good in a lab.

Then production happened. A brief microburst on the storage network caused transient latency and a handful of dropped packets on the corosync VLAN (shared physical ports, because “it was fine”). Corosync interpreted the jitter as node failure. HA reacted quickly—too quickly—and attempted recovery actions that collided with ongoing I/O stalls.

The net result: resources bounced. Not a full cluster meltdown, but a string of partial outages. The error in the UI was still “cannot start resource,” but the root cause was self-inflicted sensitivity: the system had been tuned to panic at normal network noise.

The rollback to conservative timeouts didn’t feel heroic, but stability returned. The hard lesson: in HA, “faster” often means “more wrong, more often.” If you want faster failover, invest in predictable networks and reliable fencing—not just smaller timeout values.

Mini-story 3: The boring but correct practice that saved the day (storage symmetry and placement rules)

A regulated enterprise ran Proxmox HA for internal services. Nothing flashy. Their cluster had a simple, strict rule: any HA resource must live on storage that is available on every node in its HA group, and every node in that group must have validated multipath or Ceph health checks as part of a weekly routine.

One afternoon, a node lost access to the shared storage due to a switch port issue. Proxmox marked the storage inactive on that node. HA saw it and refused to start certain resources there. The UI showed “cannot start resource” for a VM that the scheduler briefly considered for that node.

But because the placement rules limited that VM’s HA group to nodes with validated storage paths, HA immediately placed it on a different node with healthy access. The service stayed up. The incident was reduced to “replace a cable and fix a switch port configuration,” not “war room at 2 a.m.”

They didn’t have secret sauce. They had discipline: storage symmetry, clear constraints, and routine validation. Boring wins. It keeps your weekends intact.

Common mistakes: symptom → root cause → fix

1) Symptom: HA says “cannot start resource” right after a node reboot

Root cause: Quorum temporarily lost or corosync membership unstable during the reboot; HA refuses to act safely.

Fix: Verify pvecm status is quorate and corosync rings are healthy. Don’t chase VM logs until the cluster agrees on membership.

2) Symptom: VM starts manually with qm start, but HA won’t start it

Root cause: HA constraints or HA state machine thinks the VM belongs elsewhere, or a previous failure is recorded; HA is enforcing policy, not capability.

Fix: Check ha-manager status and the resource’s HA group. Align manual actions with HA: either tell HA to step back from that VM temporarily (see the sketch below) or fix the underlying placement issue.
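
If you decide to take manual control temporarily, do it explicitly rather than fighting the CRM. A minimal sketch; on recent Proxmox VE releases the ignored request state tells HA to leave the resource alone entirely (older setups can get a similar effect with ha-manager remove and a later re-add):

cr0x@server:~$ ha-manager set vm:105 --state ignored    # HA stops touching this resource
cr0x@server:~$ qm start 105                             # manual start, on your responsibility
cr0x@server:~$ ha-manager set vm:105 --state started    # hand control back once the blocker is fixed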

3) Symptom: Storage shows “active” on one node, “inactive” on another

Root cause: Backend connectivity is node-specific: missing routes, failed multipath, Ceph auth issues, stale NFS mount, or firewall rules.

Fix: Fix storage connectivity on the broken node, or remove that node from the HA group for resources on that storage. HA requires symmetry or explicit constraints.

4) Symptom: “cannot start resource” after a backup window

Root cause: Stale lock left by interrupted backup, or snapshot mode issues, or storage hiccup during backup.

Fix: Confirm no backup job is running, inspect task logs, then clear locks if stale. If backups are frequently interrupted, fix the storage/network that causes it.

5) Symptom: HA keeps moving a VM back and forth (“ping-pong”)

Root cause: Corosync membership flapping, node health checks failing intermittently, or resource start timeouts too aggressive.

Fix: Stabilize corosync network, undo overly aggressive timeout tuning, and verify node-level resource constraints (CPU, memory, storage latency).

6) Symptom: UI shows node online, but HA says node offline

Root cause: Management network is up, corosync network isn’t (or vice versa). The UI can mislead you because it’s not the membership oracle.

Fix: Trust corosync tools (corosync-cfgtool, corosync-quorumtool) and fix the corosync path.

7) Symptom: “storage is not available” but Ceph/NFS “looks fine” from one node

Root cause: Partial failure: the backend is reachable but too slow, blocked, or timing out; Proxmox marks it inactive due to failed checks.

Fix: Check backend health from the failing node specifically. For Ceph: verify client auth and monitor reachability. For NFS: check for stale mounts and kernel logs.
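
Node-specific spot checks, hedged because backends differ; run them on the failing node, not a healthy peer. The mount path follows Proxmox’s /mnt/pve/<storage-id> convention for NFS/CIFS storages, and “nfs-vm” is a hypothetical storage ID:

cr0x@server:~$ timeout 5 df -h /mnt/pve/nfs-vm
cr0x@server:~$ dmesg -T | grep -iE 'nfs|stale|iscsi' | tail -n 20
cr0x@server:~$ ceph -s

A df that hits the timeout is a hung mount, no matter how “mounted” it looks.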

8) Symptom: HA refuses to recover after a crash; resource stuck in “error”

Root cause: HA recorded a failure and is preventing loops; or LRM on the target node isn’t functioning; or the resource is still “owned” somewhere due to lock/state.

Fix: Read CRM/LRM logs, verify LRM services, check locks, and only after fixing the underlying cause reset the resource deliberately (for example, set its requested state to disabled and then back to started so HA retries cleanly).

Checklists / step-by-step plan

Emergency checklist: get one resource running safely

  1. Confirm quorum: pvecm status must be quorate. If not, restore majority connectivity first.
  2. Confirm membership stability: corosync-cfgtool -s should show every link connected (older corosync prints “active with no faults”).
  3. Confirm /etc/pve health: df -h /etc/pve shows pmxcfs.
  4. Read HA’s own view: ha-manager status (add --verbose for the full state) and the CRM/LRM journal on the target node for the actual error line.
  5. Check required storage on target node: pvesm status and validate the VM disk storage from qm config ID.
  6. Check locks: qm config ID | grep '^lock:'.
  7. Inspect task history: read task logs for the VM and last operations.
  8. Only then: attempt recovery via HA (preferred) or controlled manual start with HA disabled (only if you understand the consequences).

Stability checklist: prevent “cannot start resource” from returning

  1. Make corosync networking boring: dedicated VLAN, consistent MTU, no firewall meddling, monitor drops and latency.
  2. Stop treating two-node clusters as HA: use a proper quorum device or third node for real decision-making.
  3. Enforce storage symmetry: if a VM is HA-managed across nodes, its storage must be accessible from those nodes, or constraints must reflect reality.
  4. Standardize storage health checks per node: don’t accept “works on node A” as evidence.
  5. Keep time sync tight: consistent NTP/chrony across all nodes.
  6. Write down fencing expectations: even if Proxmox HA isn’t doing external power fencing for you, your operational runbook must specify how you prevent dual writers.
  7. Practice a failover drill quarterly: not to impress anyone—so you know what “normal weird” looks like in logs.

Decision guardrails (what to avoid under pressure)

  • Avoid: clearing locks blindly. Do: confirm the underlying job is truly dead.
  • Avoid: rebooting random nodes “to fix HA.” Do: identify whether the problem is quorum, storage, or network first.
  • Avoid: changing corosync timeouts mid-incident. Do: stabilize the network path; tune later with data.

FAQ

1) Does “cannot start resource” always mean quorum problems?

No. Quorum is common, but storage unavailability and locks are just as common. The fastest truth is ha-manager status plus the LRM logs on the target node.

2) The UI shows everything green. Why is HA still refusing?

The UI reflects management-plane reachability and some cluster status, but HA decisions depend on corosync membership and storage checks. Trust corosync tools and HA logs over the UI’s optimism.

3) Can I just run qm start and ignore HA?

You can, but you’re taking responsibility for safety. If the HA stack believes there’s a risk of split brain or double-mount, manual start can turn “downtime” into “data recovery.” If you must do it, disable HA for that resource first and verify storage exclusivity.

4) Why does HA care about storage “inactive” if the mount exists?

Because “mounted” doesn’t mean “working.” NFS can be mounted while hanging; iSCSI can be logged in while timing out; Ceph can be connected while blocked. Proxmox marks storage inactive when its checks fail or time out.

5) What’s the difference between CRM and LRM in Proxmox HA?

CRM coordinates cluster-wide decisions (where a resource should run). LRM executes actions on a node (start/stop). “Cannot start resource” often means CRM asked, LRM tried, and some dependency failed locally.

6) If corosync is unstable, why do my VMs still run?

VMs can keep running on their current node even when the cluster is confused. Starting and moving VMs safely is the hard part. HA will stop initiating actions if membership isn’t stable.

7) How do I distinguish storage failure from network partition quickly?

If quorum is lost or rings show faults, it’s network/membership first. If quorum is fine and HA says “storage not available,” run pvesm status on the target node and confirm the VM’s disk storage. Storage problems are often node-specific and won’t show on a healthy peer.

8) Why does a two-node cluster feel so fragile?

Because it is. With two nodes, any node loss (or link loss) creates an even split. Without a third vote (qdevice), you can’t reliably prove which side is authoritative. HA becomes conservative, as it should.

9) What if I suspect the HA manager state is stale?

First confirm quorum and corosync stability. Then check HA services on all nodes. If the manager is stuck, restarting HA services can help, but do it deliberately and only after you’ve confirmed the underlying dependency (storage/network) is actually fixed.

Conclusion: next steps that actually prevent repeats

“Cannot start resource” is Proxmox HA telling you it can’t prove the move is safe. Your job is to remove ambiguity. Do it in this order: quorum/membership, HA manager state, then storage availability and locks.

Practical next steps:

  1. Build a one-page runbook that starts with pvecm status, corosync-cfgtool -s, and ha-manager status. Make it boring and mandatory (a minimal sketch follows this list).
  2. Audit HA resources for storage symmetry. If a VM can only run on two nodes because of storage reality, encode that in HA groups and constraints.
  3. Instrument your corosync network: drops, errors, MTU consistency, and saturation. HA failures are often network failures that politely waited to become obvious.
  4. Practice one controlled failover when nobody is panicking. The best time to learn HA behavior is when it’s not actively humiliating you.
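
That runbook can literally be a dozen lines of shell. A minimal sketch; the script path and the hardcoded checks are suggestions, not an official tool:

cr0x@server:~$ cat /usr/local/sbin/ha-triage.sh
#!/bin/sh
# ha-triage.sh: first-five-minutes HA triage (sketch, adjust to your cluster)
VMID="${1:?usage: ha-triage.sh <vmid>}"
echo "== quorum =="     ; pvecm status | grep -E 'Quorate|Expected|Total'
echo "== corosync =="   ; corosync-cfgtool -s
echo "== pmxcfs =="     ; systemctl is-active pve-cluster
echo "== ha status =="  ; ha-manager status | grep -E "master|quorum|vm:${VMID}|ct:${VMID}"
echo "== storage =="    ; pvesm status
echo "== ha log =="     ; journalctl -u pve-ha-crm -u pve-ha-lrm -n 50 --no-pager | grep -i "${VMID}" | tail -n 20

Run it before touching anything, and paste its output into the incident channel so the next person doesn’t start from zero.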

Operational note: if your first instinct is to “force it,” pause. HA errors are often the system saving you from corruption. Fix the cause, then let HA do its job.
