Proxmox “pve-cluster.service failed”: How to Bring the Cluster Stack Back Up

When pve-cluster.service fails, Proxmox doesn’t just lose “cluster features.” It loses its nervous system. The GUI gets weird, VM operations start throwing errors, nodes drift, and you suddenly remember how much state Proxmox stores in that deceptively simple place: /etc/pve.

This isn’t a theory piece. It’s a production playbook for getting the cluster stack back on its feet—without turning a one-node problem into a multi-node incident. You’ll diagnose what’s actually failing (pmxcfs, Corosync, quorum, FUSE mount, disk, time, network), make the right trade-offs, and restore service with intent.

What “pve-cluster failed” really means

pve-cluster is the service that runs Proxmox Cluster File System (pmxcfs). pmxcfs is a FUSE-based filesystem mounted at /etc/pve. If it isn’t mounted and healthy, Proxmox loses access to cluster configuration in the way it expects. That includes:

  • Cluster-wide config, storage definitions, HA state
  • Firewall config (cluster scope)
  • Node membership and Corosync config distribution
  • VM and CT config files (in cluster mode)

Under the hood, pmxcfs depends heavily on Corosync (cluster membership, quorum, messaging). Corosync isn’t just “nice to have.” It’s the referee. If you don’t have quorum, pmxcfs becomes conservative to avoid split-brain writes.

That’s the part that bites: you can have “a perfectly fine node” with running VMs and healthy disks, but if Corosync can’t form membership, pmxcfs may refuse writes—or fail to start—and then every management action turns into a cryptic error.

Here’s the operational stance I recommend: treat pve-cluster.service failed as a symptom, not a diagnosis. The real diagnosis is usually one of these buckets:

  • Quorum/membership failure (network, node count, Corosync config mismatch)
  • Local system constraint (disk full, inode exhaustion, memory pressure, file descriptor limits)
  • Time problems (NTP drift makes Corosync unhappy; TLS breaks; logs lie)
  • Corrupted or inconsistent cluster config (bad corosync.conf, partial updates, stale node certificates)
  • Operator-induced foot-guns (force flags used in the wrong moment, “quick fixes” that quietly create split-brain)

One quote worth keeping in the back of your mind: Hope is not a strategy. —General Gordon R. Sullivan. It’s not a Proxmox quote, but it belongs on every on-call rotation.

Fast diagnosis playbook

If you’re on the clock, don’t bounce services randomly. Your goal is to answer one question fast: is this a local node problem or a cluster quorum problem? A read-only triage snippet that bundles the first two passes follows the three steps below.

First (60 seconds): determine whether /etc/pve is mounted and whether Corosync has quorum

  • Check if pmxcfs mounted: findmnt /etc/pve
  • Check Corosync/quorum: pvecm status and corosync-quorumtool -s
  • Check the exact failure reason: systemctl status pve-cluster + journalctl -u pve-cluster -b

Second (2–5 minutes): validate the “boring basics” that kill cluster services

  • Disk space/inodes: df -h, df -i
  • Memory pressure: free -h, dmesg -T | tail
  • Time sync: timedatectl, chronyc tracking (or systemctl status systemd-timesyncd)
  • Network reachability on Corosync links: ping/MTU checks between nodes

Third (5–15 minutes): decide your recovery mode

  • If quorum can be restored: fix network/config/time and bring Corosync up normally. Then restart pve-cluster.
  • If quorum cannot be restored quickly: choose between safe read-only operations vs. a controlled temporary single-node mode (only if you understand the consequences).
  • If this is a two-node cluster: decide whether to use a qdevice (best), or a short-term expected-votes hack (risky but sometimes necessary).
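
If you want the first two passes as one copy-paste, here is a read-only triage sketch. It changes nothing; it only collects evidence. The script name and output path are my own choices, not a Proxmox convention, and the last line assumes chrony (swap it for your time-sync daemon if you use systemd-timesyncd).

#!/bin/bash
# pve-triage.sh - read-only snapshot of the cluster stack (hypothetical name/path)
out=/root/pve-triage-$(date +%Y%m%d-%H%M%S).txt
{
  echo "== mount ==";    findmnt /etc/pve
  echo "== services =="; systemctl status pve-cluster corosync --no-pager
  echo "== quorum ==";   pvecm status; corosync-quorumtool -s
  echo "== logs ==";     journalctl -u pve-cluster -u corosync -b --no-pager -n 100
  echo "== disk ==";     df -h /; df -i /
  echo "== memory ==";   free -h
  echo "== time ==";     timedatectl; chronyc tracking
} > "$out" 2>&1
echo "Triage written to $out"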

Joke #1: If your first recovery step is “reboot everything,” you’re not troubleshooting—you’re performing interpretive dance for the outage gods.

Interesting facts and historical context (that actually help you debug)

  1. pmxcfs is a FUSE filesystem. The /etc/pve “directory” you see is a userspace mount; the backing data lives in a local database under /var/lib/pve-cluster, so when the mount is down the directory looks empty even though nothing was deleted.
  2. Proxmox moved toward cluster-centric config early. The decision makes multi-node management sane, but it also means cluster health affects basic operations on a single node.
  3. Corosync’s design is quorum-first. It prefers refusing writes over risking split-brain. That’s why failures often look “overly strict.”
  4. Two-node clusters are inherently awkward. Without a third vote (qdevice or witness), you’re always one failure away from a philosophical debate about who gets to be “the cluster.”
  5. Quorum isn’t about majority uptime—it’s about safe membership. You can have 99% of services running and still be “unsafe to write cluster state.”
  6. Clock drift can masquerade as network failure. Corosync and TLS behave badly when time is wrong; logs become misleading and retries spike.
  7. Disk-full issues hit clusters harder. A full root filesystem can prevent state writes, logging, or pmxcfs internals, causing cascading service failures that look like “Corosync died.”
  8. Proxmox UI errors often reflect pmxcfs state. The web UI may still load but operations fail because it can’t write configs into /etc/pve.

Practical recovery tasks (commands, outputs, decisions)

These are real tasks you run on a node. Each task includes: command, what the output means, and the decision you make next. Don’t run them all blindly; follow the fast diagnosis ordering.

Task 1: Confirm the service failure and capture the real reason

cr0x@server:~$ systemctl status pve-cluster --no-pager
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
     Active: failed (Result: exit-code) since Tue 2025-12-25 09:41:02 UTC; 2min 3s ago
    Process: 1832 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)
     Memory: 1.6M
     CPU: 43ms

Dec 25 09:41:02 server pmxcfs[1832]: [main] notice: starting pmxcfs
Dec 25 09:41:02 server pmxcfs[1832]: [main] crit: unable to initialize cluster communication
Dec 25 09:41:02 server systemd[1]: pve-cluster.service: Main process exited, code=exited, status=255/EXCEPTION
Dec 25 09:41:02 server systemd[1]: pve-cluster.service: Failed with result 'exit-code'.

Meaning: pmxcfs started but couldn’t initialize cluster communication—very often Corosync/quorum. Not a generic crash.

Decision: Immediately check Corosync status and quorum before touching pmxcfs flags or restarting everything.

Task 2: Read the pve-cluster journal for the precise failure chain

cr0x@server:~$ journalctl -u pve-cluster -b --no-pager -n 80
Dec 25 09:40:59 server systemd[1]: Starting The Proxmox VE cluster filesystem...
Dec 25 09:41:02 server pmxcfs[1832]: [main] notice: starting pmxcfs
Dec 25 09:41:02 server pmxcfs[1832]: [dcdb] notice: data verification successful
Dec 25 09:41:02 server pmxcfs[1832]: [main] crit: unable to initialize cluster communication
Dec 25 09:41:02 server pmxcfs[1832]: [main] notice: exit now
Dec 25 09:41:02 server systemd[1]: pve-cluster.service: Main process exited, code=exited, status=255/EXCEPTION

Meaning: Local database (dcdb) looks fine. The failure is networking/membership rather than local corruption.

Decision: Focus on Corosync and quorum, not on reinstalling packages or “repairing” files yet.

Task 3: Check whether /etc/pve is mounted (pmxcfs)

cr0x@server:~$ findmnt /etc/pve

Meaning: No output usually means it’s not mounted. If it is mounted, you’ll see a FUSE mount (pmxcfs).

Decision: If it’s not mounted and pve-cluster is failed, you’re in “cluster filesystem down” territory. Don’t edit /etc/pve expecting it to persist cluster-wide.
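
If you script anything that touches /etc/pve, a small guard prevents the classic “edited the underlying directory” mistake. A minimal sketch; the error wording is mine, and the fstype filter matches the fuse mount shown in Task 14.

# Refuse to proceed unless /etc/pve is actually a mounted FUSE filesystem.
if ! findmnt -n -t fuse /etc/pve >/dev/null; then
  echo "ERROR: /etc/pve is not mounted (pmxcfs down) - do not edit files here" >&2
  exit 1
fi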

Task 4: Verify Corosync and quorum status from Proxmox tools

cr0x@server:~$ pvecm status
Cluster information
-------------------
Name:             prod-clu
Config Version:   18
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Dec 25 09:43:18 2025
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.152
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:

Meaning: Not quorate. With only 1 of 3 votes visible, cluster services will behave defensively.

Decision: You need to restore connectivity to at least one more node (or qdevice) before expecting pmxcfs to behave normally.

Task 5: Cross-check Corosync’s own view

cr0x@server:~$ corosync-quorumtool -s
Quorum information
------------------
Date:             Tue Dec 25 09:43:46 2025
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          1
Ring ID:          1.152
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Total votes:      1
Quorum:           2 Activity blocked
Flags:

Meaning: Confirms it’s not just Proxmox tooling; Corosync truly lacks membership.

Decision: Stop thinking “pve-cluster is broken.” Start thinking “the cluster is broken.”

Task 6: Check Corosync service and recent logs

cr0x@server:~$ systemctl status corosync --no-pager
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
     Active: active (running) since Tue 2025-12-25 09:41:10 UTC; 2min 50s ago
       Docs: man:corosync
   Main PID: 1904 (corosync)
      Tasks: 9
     Memory: 29.8M
     CPU: 2.214s
cr0x@server:~$ journalctl -u corosync -b --no-pager -n 80
Dec 25 09:41:10 server corosync[1904]:   [KNET  ] link: host: 2 link: 0 is down
Dec 25 09:41:10 server corosync[1904]:   [KNET  ] link: host: 3 link: 0 is down
Dec 25 09:41:10 server corosync[1904]:   [QUORUM] This node is within the non-primary component and will NOT provide service.

Meaning: Corosync is running, but the knet link can’t reach peers. This is usually network, MTU, VLAN, firewall, or wrong IP in corosync.conf.

Decision: Verify the ring network path end-to-end before making any config edits.

Task 7: Validate corosync.conf is present and consistent on this node

cr0x@server:~$ sed -n '1,200p' /etc/pve/corosync.conf
sed: can't read /etc/pve/corosync.conf: No such file or directory

Meaning: If pmxcfs is down, /etc/pve may be unmounted or empty, so the read fails. That doesn’t mean the config is “missing”; it means the mount isn’t there.

Decision: Read the local fallback copy at /etc/corosync/corosync.conf (not perfect, but tells you what this node believes).

cr0x@server:~$ sed -n '1,220p' /etc/corosync/corosync.conf
totem {
  version: 2
  cluster_name: prod-clu
  config_version: 18
  transport: knet
  interface {
    linknumber: 0
  }
}
nodelist {
  node {
    name: pve1
    nodeid: 1
    ring0_addr: 10.10.50.11
  }
  node {
    name: pve2
    nodeid: 2
    ring0_addr: 10.10.50.12
  }
  node {
    name: pve3
    nodeid: 3
    ring0_addr: 10.10.50.13
  }
}
quorum {
  provider: corosync_votequorum
}

Meaning: This node expects its peers at 10.10.50.12 and 10.10.50.13 on ring 0. If your network team “optimized” VLANs, this is where you catch it.

Decision: Confirm that 10.10.50.x is reachable L2/L3 as designed and that the node’s interface is actually on that subnet.

Task 8: Confirm the node’s IP and routing match the Corosync ring network

cr0x@server:~$ ip -br addr
lo               UNKNOWN        127.0.0.1/8 ::1/128
eno1             UP             10.10.50.11/24
eno2             DOWN
vmbr0            UP             192.168.10.11/24
cr0x@server:~$ ip route
default via 192.168.10.1 dev vmbr0
10.10.50.0/24 dev eno1 proto kernel scope link src 10.10.50.11
192.168.10.0/24 dev vmbr0 proto kernel scope link src 192.168.10.11

Meaning: Corosync ring network is directly connected on eno1. That’s good. If it wasn’t, you’d see odd routing.

Decision: Test connectivity and MTU to peers on the ring network next.

Task 9: Test peer reachability and MTU (Corosync hates silent drops)

cr0x@server:~$ ping -c 2 10.10.50.12
PING 10.10.50.12 (10.10.50.12) 56(84) bytes of data.

--- 10.10.50.12 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1016ms
cr0x@server:~$ ping -M do -s 1472 -c 2 10.10.50.12
PING 10.10.50.12 (10.10.50.12) 1472(1500) bytes of data.

--- 10.10.50.12 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1013ms

Meaning: Total loss suggests link/VLAN/firewall/port issue, not just MTU. If small ping worked but MTU ping failed, suspect MTU mismatch or a path with fragmentation blocked.

Decision: If ping fails, stop. Fix the network path (switch port, VLAN tagging, firewall rules, routing). Restarting services won’t change physics.
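
If the small ping works but you suspect an MTU ceiling, a quick sweep narrows it down. A hedged sketch; the peer IP and candidate sizes are examples, and the +28 accounts for ICMP and IP headers (1472 bytes of payload corresponds to a 1500-byte frame, 8972 to a 9000-byte jumbo frame).

# Probe common payload sizes with "don't fragment" set (adjust peer to your ring network).
peer=10.10.50.12
for size in 1200 1472 8972; do
  if ping -M do -s "$size" -c 2 -W 1 "$peer" >/dev/null 2>&1; then
    echo "payload $size: OK"
  else
    echo "payload $size: FAILED (drop, or path MTU below $((size + 28)))"
  fi
done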

Task 10: Check if firewall is blocking Corosync traffic on the host

cr0x@server:~$ pve-firewall status
Status: enabled/running
cr0x@server:~$ iptables -S | head
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT

Meaning: Proxmox firewall can be enabled yet still not block Corosync depending on ruleset. The iptables snippet here shows default ACCEPT, which suggests host firewall isn’t the issue.

Decision: If you see default DROP or explicit drops on UDP 5405/5404 etc., fix rules. If not, focus on upstream network.
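
To confirm the host side, check what Corosync is actually listening on and whether any local rules mention its ports. knet defaults to UDP 5405 (plus nearby ports for additional links); this is a read-only check.

# Show corosync's UDP sockets and any local firewall rules touching its ports.
ss -ulpn | grep -i corosync
iptables-save 2>/dev/null | grep -E '540[45]' || echo "no iptables rules mention 5404/5405"
nft list ruleset 2>/dev/null | grep -E '540[45]' || echo "no nftables rules mention 5404/5405"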

Task 11: Check disk space and inode exhaustion (yes, really)

cr0x@server:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        48G   47G  180M 100% /
cr0x@server:~$ df -i /
Filesystem     Inodes  IUsed   IFree IUse% Mounted on
/dev/sda2     3145728 3120000   25728   100% /

Meaning: Full disk and full inodes are classic cluster-killers. Services can’t write state, logs, or temporary files. pmxcfs may fail in creative ways.

Decision: Free space safely (logs, old kernels, ISO clutter). Do not delete random cluster files because you’re angry.
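
Some boring, reversible ways to find and reclaim space on a Proxmox root filesystem. A sketch: review every list before deleting anything, and adjust the journal cap to your own retention policy.

# See what is eating space before touching anything.
du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -20
# Trim the systemd journal to a sane cap.
journalctl --vacuum-size=200M
# Clear the apt package cache.
apt-get clean
# Find large forgotten files (ISOs, old dumps). Review the list; don't pipe it to rm.
find / -xdev -type f -size +1G -printf '%s\t%p\n' 2>/dev/null | sort -rn | head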

Task 12: Time sanity check (because distributed systems are petty)

cr0x@server:~$ timedatectl
               Local time: Tue 2025-12-25 09:46:10 UTC
           Universal time: Tue 2025-12-25 09:46:10 UTC
                 RTC time: Tue 2025-12-25 09:46:09
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: no
              NTP service: active
          RTC in local TZ: no
cr0x@server:~$ chronyc tracking
Reference ID    : C000020A (192.0.2.10)
Stratum         : 3
Last offset     : +3.421 seconds
RMS offset      : 1.922 seconds
Frequency       : 34.112 ppm fast
Leap status     : Not synchronised

Meaning: You’re drifting seconds. That’s enough to make distributed components act haunted—especially during membership changes and TLS validation.

Decision: Fix time sync first (NTP reachability, chrony config). Then retry Corosync/pve-cluster.
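
Once NTP reachability is fixed, chrony normally slews back on its own. For a large offset on a node that is already out of the cluster conversation, a manual step is acceptable. A sketch assuming chrony; substitute your time-sync daemon if different.

# Check which NTP sources are actually reachable.
chronyc sources -v
# Step the clock immediately if the offset is large (avoid stepping time on a healthy, quorate node).
chronyc makestep
# Confirm it settles before restarting corosync.
chronyc tracking
timedatectl | grep -i synchronized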

Task 13: Bring Corosync up cleanly (after fixing the underlying issue)

cr0x@server:~$ systemctl restart corosync
cr0x@server:~$ pvecm status | sed -n '1,40p'
Cluster information
-------------------
Name:             prod-clu
Config Version:   18
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Dec 25 09:49:02 2025
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.154
Quorate:          Yes

Meaning: Quorum is back. Now pmxcfs has a fighting chance to start correctly.

Decision: Restart pve-cluster and verify the mount.

Task 14: Restart pve-cluster and validate pmxcfs mount + basic reads

cr0x@server:~$ systemctl restart pve-cluster
cr0x@server:~$ systemctl status pve-cluster --no-pager
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
     Active: active (running) since Tue 2025-12-25 09:49:20 UTC; 4s ago
   Main PID: 2488 (pmxcfs)
cr0x@server:~$ findmnt /etc/pve
TARGET   SOURCE  FSTYPE OPTIONS
/etc/pve pmxcfs  fuse   rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other
cr0x@server:~$ ls -la /etc/pve | head
total 0
drwxr-xr-x 2 root www-data 0 Dec 25 09:49 .
drwxr-xr-x 1 root root     0 Dec 25 09:49 ..
-rw-r----- 1 root www-data 0 Dec 25 09:49 .members
-rw-r----- 1 root www-data 0 Dec 25 09:49 corosync.conf
drwxr-xr-x 2 root www-data 0 Dec 25 09:49 nodes

Meaning: It’s mounted and cluster config is visible again. That’s the “stack is back” indicator.

Decision: Verify pvedaemon and pveproxy next if the UI still errors.

Task 15: Validate UI back-end services once cluster FS is healthy

cr0x@server:~$ systemctl status pvedaemon pveproxy --no-pager
● pvedaemon.service - PVE API Daemon
     Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled)
     Active: active (running) since Tue 2025-12-25 09:49:40 UTC; 9s ago
● pveproxy.service - PVE API Proxy Server
     Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
     Active: active (running) since Tue 2025-12-25 09:49:42 UTC; 7s ago

Meaning: API stack is running. If the GUI still fails, it’s likely an auth/cert issue, a stale browser session, or a specific node that still isn’t quorate.

Decision: Check cluster membership and node list across nodes.

Task 16: Confirm membership and look for “ghost nodes”

cr0x@server:~$ pvecm nodes
Membership information
----------------------
    Nodeid      Votes Name
         1          1 pve1
         2          1 pve2
         3          1 pve3

Meaning: Membership is clean. If you see nodes you decommissioned months ago, your cluster config is stale and may cause odd quorum math or link attempts.

Decision: If ghost nodes exist, plan a proper removal from the cluster, not an ad-hoc edit during an outage.
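
For reference, the proper cleanup later, once the cluster is healthy and quorate and the old node is permanently gone, is a single command. The node name below is hypothetical. Do not run this mid-outage to “fix” quorum math.

# Run on a healthy, quorate node. 'pve-old' is a hypothetical decommissioned node.
pvecm nodes            # confirm which entry is stale
pvecm delnode pve-old  # remove it from the cluster configuration
pvecm status           # expected votes should drop accordingly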

Task 17: If pmxcfs is mounted but writes fail, test a safe write operation

cr0x@server:~$ touch /etc/pve/.pmxcfs-write-test
touch: cannot touch '/etc/pve/.pmxcfs-write-test': Read-only file system

Meaning: pmxcfs mounted read-only. This often correlates with quorum loss or internal protection mode.

Decision: Re-check quorum (pvecm status). If quorate is “No”, stop trying to force writes. Restore quorum or deliberately choose single-node mode with eyes open (see checklists).

Task 18: Confirm local cluster config database health signals

cr0x@server:~$ systemctl status pve-cluster --no-pager -l
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
     Active: active (running) since Tue 2025-12-25 09:49:20 UTC; 1min 22s ago
   Main PID: 2488 (pmxcfs)
     CGroup: /system.slice/pve-cluster.service
             └─2488 /usr/bin/pmxcfs

Meaning: You’re mainly verifying it stays up and doesn’t crash-loop. If it flaps, it’s still fighting Corosync, disk, or time.

Decision: If it flaps, stop restarting it. Go back to Corosync logs and host resource checks.

Root causes by symptom (how to think, not just what to type)

Symptom: pve-cluster fails immediately with “unable to initialize cluster communication”

Likely causes: Corosync down, Corosync up but no quorum, wrong ring network, blocked UDP, MTU mismatch, or time drift causing membership instability.

Approach: Verify quorum and Corosync logs first. If Corosync can’t see peers, pmxcfs usually won’t form a coherent view.

Symptom: /etc/pve exists but is empty or missing expected files

Likely causes: pmxcfs not mounted; you’re looking at the underlying directory (or a broken mount).

Approach: Use findmnt. If it’s not mounted, do not “recreate” files in /etc/pve manually. That’s how you create a future outage with extra steps.

Symptom: pmxcfs mounts read-only

Likely causes: No quorum. Sometimes local “protection mode” after instability.

Approach: Restore quorum if at all possible. If not possible, decide whether you need writes badly enough to accept the split-brain risk (usually: no).

Symptom: Corosync is “active (running)” but cluster is not quorate

Likely causes: Network partition, peer nodes down, configuration mismatch across nodes, or “expected votes” math doesn’t match reality (common after node removal or a two-node design).

Approach: Read Corosync logs, check reachability, confirm consistent config version, and validate node list.

Symptom: Everything was fine until someone changed VLAN/MTU/teaming

Likely causes: Corosync traffic dropped, fragmented, or rerouted. knet is robust but not magical.

Approach: Validate MTU end-to-end, check switchport tags, and ensure Corosync’s ring network isn’t accidentally running over a congested or filtered segment.

Symptom: Cluster stack fails after power outage

Likely causes: Time drift (RTC off), inconsistent startup order, nodes booting with network not ready, or partial disk issues from abrupt shutdowns.

Approach: Confirm time sync, confirm disks are clean, then bring up Corosync before expecting pmxcfs to be happy.

Joke #2: Quorum is like a meeting that needs two people to approve a decision—until one person goes missing and suddenly nobody remembers how to do their job.

Three corporate mini-stories (because pain teaches)

Mini-story 1: The incident caused by a wrong assumption

They ran a tidy three-node Proxmox cluster in a mid-sized company: compute nodes in one rack, storage on a separate platform, networking “standardized.” One afternoon, a node’s management UI started throwing cluster errors, and pve-cluster.service went into a failed state. A junior engineer did what many of us have done under pressure: “It’s just the UI; the VMs are running, so the cluster must be fine.”

The wrong assumption was subtle: they assumed cluster state is only needed for cluster-wide actions. But their VM backups, firewall changes, and storage definitions all lived in /etc/pve via pmxcfs. When pmxcfs failed, backup jobs began failing silently at first (configs couldn’t be read consistently), then loudly (API operations errored). The incident report later used the phrase “secondary outage,” which is corporate for “we made it worse.”

The root cause was a network change on the Corosync VLAN. A switch port profile was updated for a “standard MTU,” which in practice meant the Corosync network lost jumbo frames while part of the path still sent them. Corosync didn’t die; it just couldn’t maintain stable membership. pmxcfs refused to participate, correctly preferring safety to nonsense.

What fixed it wasn’t restarts. It was verifying the ring network end-to-end with MTU pings, then aligning MTU consistently. After that, Corosync regained quorum, pmxcfs mounted read-write, and the UI went back to acting like a UI instead of a riddle.

Mini-story 2: The optimization that backfired

A different org decided their cluster network was “too chatty.” They had a separate storage network, a VM network, and a management network. So they “optimized” by consolidating Corosync onto the management bridge because “it already has connectivity everywhere.” The change was made during business hours because the engineer believed Corosync would reconnect quickly and pmxcfs would be tolerant. Optimism is a renewable resource.

The management network had firewall rules, rate-limits, and occasional bursts of traffic from monitoring, patching, and config management. Corosync membership began flapping under load, which meant quorum was intermittently lost. pmxcfs alternated between usable and defensive. Their Proxmox UI became a slot machine: click “start VM,” maybe it works, maybe it errors.

The real damage showed up later. HA state and config writes happened during brief “quorate windows,” but were blocked during “non-quorate windows,” creating inconsistent operator expectations. People began “fixing” symptoms: restarting daemons, disabling firewalls, adding hacks to expected votes. The environment became operationally noisy—the worst kind of unstable, because it looks alive.

The recovery was to undo the optimization. Corosync got a dedicated network with stable latency and consistent MTU, and they stopped treating cluster membership as “just another UDP service.” Afterwards, they added a qdevice because the cluster occasionally operated with only two nodes during maintenance. Boring design choices. Very effective.

Mini-story 3: The boring but correct practice that saved the day

A financial-services team had a Proxmox cluster that rarely failed—because they practiced unglamorous discipline. They documented the Corosync ring topology, kept a simple “cluster health” script run by monitoring, and enforced time sync like it was a security control (because it is).

One morning, pve-cluster failed on a node after a kernel update. The on-call engineer didn’t start by restarting services. They started by collecting evidence: pvecm status, journalctl, df, timedatectl. Within minutes, they saw the node’s clock was off and NTP wasn’t synchronizing. The update had reset a network policy and blocked NTP egress on that VLAN.

Because they had a known-good checklist, they fixed the firewall rule, confirmed chronyc tracking stabilized, then restarted Corosync and pmxcfs in the correct order. The cluster recovered without any config hacks. VMs kept running. The UI came back. Nobody had to “force” anything.

The postmortem was almost boring. That’s the point. They didn’t win because they were smarter; they won because they were consistent.

Checklists / step-by-step plan

Checklist A: Safe recovery when quorum can be restored

  1. Freeze risky operations. Don’t migrate VMs, don’t change storage definitions, don’t edit cluster config while the cluster is unstable.
  2. Capture state. Save outputs of systemctl status pve-cluster corosync, pvecm status, and the last ~100 lines of logs for both services.
  3. Fix basics first:
    • Disk space/inodes
    • Time sync
    • Network reachability on ring links (including MTU)
  4. Restart Corosync after the underlying fix. Don’t restart it as a superstition.
  5. Confirm quorate = Yes. If not, stop and keep working the network/membership issue.
  6. Restart pve-cluster. Validate findmnt /etc/pve shows pmxcfs.
  7. Validate read/write behavior. If read-only, you still don’t have quorum or stability.
  8. Then validate API services. pvedaemon, pveproxy.
  9. Only then resume normal operations.

Checklist B: Controlled single-node operation (last resort, temporary)

This is where people hurt themselves. If you run a single node as if it’s “the cluster” while other nodes may come back later, you risk split-brain cluster config. If you don’t understand that sentence, don’t do this.

  1. Confirm the other nodes are truly down or isolated. If they can come back unexpectedly, you’re playing config roulette.
  2. Communicate. Tell your team you’re entering a degraded mode and that config changes may need reconciliation later.
  3. Prefer read-only actions. Keep workloads running. Avoid edits to cluster-wide config.
  4. If you must regain write ability, use a deliberate quorum strategy (qdevice, or a temporary expected-votes adjustment; see the sketch after this checklist) and document what you changed.
  5. Plan the return to normal. The hard part is not entering degraded mode—it’s exiting it cleanly.
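
If you reach step 4 and genuinely need writes on an isolated node, the temporary lever is pvecm expected. Treat it as a last resort: you are telling this node that fewer votes count as quorum, which is exactly how split-brain config happens if the other nodes come back with changes of their own. A sketch:

# LAST RESORT on an isolated node, only after confirming the peers cannot return unexpectedly.
pvecm expected 1    # tell votequorum that one vote is enough (temporary)
pvecm status        # Quorate should now read Yes on this node
# Make only the minimal change you actually need, and write down exactly what you changed.
# Expected votes grow back as real nodes rejoin; plan to reconcile configs deliberately.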

Checklist C: Two-node cluster survival plan

  1. Best: add a qdevice/witness in a third failure domain (not on the same host, and not on the same switch if you can avoid it); a setup sketch follows this checklist.
  2. During incident: if one node is down, expect quorum loss unless you planned for it.
  3. Don’t permanently “hack” expected votes. Temporary changes become permanent in practice, and then you learn what split-brain tastes like.
  4. After recovery: invest in the third vote. It’s cheaper than the next outage.
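
Setting up that third vote is less work than people assume. A sketch of the usual flow, assuming a small independent Debian machine at 10.10.50.50 acts as the witness (the address is an example); run it while the cluster is healthy and quorate.

# On the witness host:
apt-get install corosync-qnetd
# On every Proxmox node:
apt-get install corosync-qdevice
# On one Proxmox node, with the witness reachable over SSH as root:
pvecm qdevice setup 10.10.50.50
# Verify the extra vote:
pvecm status    # look for the Qdevice information and the added vote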

Common mistakes: symptom → root cause → fix

1) “pve-cluster failed, so I edited /etc/pve manually”

Symptom: After recovery, config changes disappeared or only existed on one node.

Root cause: pmxcfs wasn’t mounted; you edited the underlying directory or a local stub view.

Fix: Verify findmnt /etc/pve before any edits. Make changes only when pmxcfs is mounted and quorate.

2) Corosync is running, but quorate is “No,” so I restarted pve-cluster 20 times

Symptom: pmxcfs flaps; UI errors persist.

Root cause: Membership issue (network partition, missing peers, expected votes mismatch).

Fix: Restore network connectivity and consistent Corosync config; restart Corosync once after fixing the underlying cause; then restart pve-cluster.

3) “We have three nodes, so we always have quorum”

Symptom: One node down and suddenly cluster is not quorate.

Root cause: Two nodes are not actually communicating (silent partition), leaving only one visible vote.

Fix: Validate connectivity between all nodes on the ring network. A cluster is a graph, not a headcount.

4) MTU mismatch causes intermittent quorum loss

Symptom: Corosync membership flaps, especially under load or after network changes.

Root cause: Mixed MTU across the path, fragmentation blocked, or switchport inconsistency.

Fix: Standardize MTU end-to-end. Use ping -M do -s tests between nodes on ring interfaces.

5) Disk full triggers “cluster communication” looking failures

Symptom: Services crash or refuse to start; logs may be truncated; weird behavior after updates.

Root cause: Root filesystem or inodes exhausted.

Fix: Free space/inodes safely, then restart services. Also fix log retention and housekeeping so it doesn’t recur.

6) Time drift makes everything look like a network problem

Symptom: TLS errors, odd cluster state, inconsistent logs, membership instability.

Root cause: NTP not synchronized; RTC wrong; NTP blocked.

Fix: Restore time sync, confirm stable tracking, then re-evaluate Corosync membership.

7) Removing a node “by deleting it”

Symptom: Ghost nodes appear; expected votes wrong; quorum math surprises you.

Root cause: Node was decommissioned without proper cluster removal steps.

Fix: Use correct node removal procedures when the cluster is healthy; don’t improvise during an outage.

FAQ

1) What exactly is pve-cluster.service?

It runs pmxcfs, the Proxmox cluster filesystem mounted at /etc/pve. If it’s down, cluster config access breaks in ways that make the UI and API look unreliable.

2) Are my running VMs affected when pve-cluster is down?

Usually the VMs keep running. The pain is management operations: starting/stopping from UI, migrations, backups, HA actions, and config writes may fail or become unsafe.

3) Why does /etc/pve look empty?

Because pmxcfs isn’t mounted. You’re seeing the underlying directory, not the FUSE mount. Confirm with findmnt /etc/pve.

4) Corosync is running. Why is the cluster not working?

Corosync can be “running” but not have quorum. Without quorum, pmxcfs may refuse writes or fail to start. Check pvecm status and Corosync logs.

5) Can I just restart pve-cluster and corosync until it works?

You can, but it’s a weak strategy. If the root cause is network reachability, MTU, disk full, or time drift, restarts only add noise and can worsen instability.

6) What’s the fastest way to tell if this is network vs. local node?

If df -h and timedatectl look sane, then pvecm status + journalctl -u corosync will usually point straight at network/membership.

7) I have a two-node cluster. Is it normal to lose quorum when one node is down?

Yes. Two-node quorum is awkward by design. If you want clean failure behavior, use a qdevice/witness so the cluster can still reach a majority decision.

8) pmxcfs is mounted read-only. How do I force it read-write?

Most of the time, read-only indicates quorum loss or instability. The “fix” is restoring quorum. Forcing writes without quorum risks split-brain cluster config.

9) The Proxmox GUI says “cluster not ready” but services look up. What now?

Verify /etc/pve mount, quorum, and whether API services are healthy. Many GUI errors are just pmxcfs/quorum problems reflected through the API.

10) After recovery, one node still shows old config. Is that possible?

Yes—especially after partitions or if someone made local edits while pmxcfs was down. Confirm membership, config version, and avoid manual reconciliation unless you’re certain which state is authoritative.

Next steps that prevent repeat incidents

Getting pve-cluster back up is half the job. Keeping it boring is the other half.

  • Give Corosync a stable network path. Dedicated VLAN if possible, consistent MTU end-to-end, no “helpful” firewall surprises.
  • Fix time like it’s production-critical (because it is). Monitor NTP sync; alert on drift.
  • Monitor quorum and pmxcfs mount. Alert when Quorate: No or when findmnt /etc/pve doesn’t show pmxcfs.
  • Stop building two-node clusters without a witness. If budget allows two nodes, it allows a tiny third vote.
  • Write down your recovery order. Corosync/quorum first, then pve-cluster, then API services. Don’t freestyle during an outage.

If you take only one operational habit from this: before touching config, confirm /etc/pve is mounted and the cluster is quorate. That single check prevents a remarkable amount of self-inflicted damage.
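
That single check is small enough to automate. A minimal monitoring sketch, with the exit codes and message wording being my own choices; wire it into whatever alerting you already run.

#!/bin/bash
# pve-health-check.sh - hypothetical name; alert when anything below fails.
fail=0
if ! findmnt -n -t fuse /etc/pve >/dev/null; then
  echo "CRITICAL: /etc/pve is not mounted (pmxcfs down)"; fail=1
fi
if ! pvecm status 2>/dev/null | grep -q '^Quorate:.*Yes'; then
  echo "CRITICAL: cluster is not quorate"; fail=1
fi
for svc in pve-cluster corosync pvedaemon pveproxy; do
  systemctl is-active --quiet "$svc" || { echo "WARNING: $svc is not active"; fail=1; }
done
[ "$fail" -eq 0 ] && echo "OK: pmxcfs mounted, quorate, services active"
exit "$fail"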
