You open the Proxmox GUI and it spins. Migrations stall. HA flaps. VMs are running—until they aren’t.
Then you check the obvious: Corosync quorum. It says everything’s fine.
That’s the trap. Corosync can be “fine” in the narrow sense—membership and quorum intact—while the
rest of the cluster is suffocating from latency, filesystem lock contention, storage stalls, or time drift.
Corosync is the pulse. Your cluster can still be bleeding out.
What Corosync is (and isn’t)
Corosync provides cluster membership and messaging. In Proxmox VE, it’s the component that decides
who’s in the club and whether the club has quorum. It does not guarantee that your management plane
is responsive, that your storage is healthy, or that your hypervisors can actually do work at the pace
you need.
Proxmox stacks a lot on top of “nodes can see each other”:
- pmxcfs (the Proxmox cluster filesystem that stores config state)
- pveproxy and pvedaemon (API/UI services)
- pve-ha-lrm/crm (HA logic)
- pvestatd (stats)
- and whatever your storage backend is doing today (ZFS, iSCSI, NFS, Ceph, local LVM… pick your favorite headache)
Corosync membership can remain stable even while:
- pmxcfs is stuck waiting on FUSE operations and you can’t commit config changes
- the network is dropping packets or spiking latency—just not enough to lose quorum
- time drift causes subtle authentication and fencing weirdness
- storage stalls freeze QEMU I/O threads and migrations time out
- the management daemons are blocked on DNS, PAM/LDAP, or filesystem calls
Two dry facts that save careers
- Quorum is binary; health is not. You can have quorum and still be unusable.
- Most Proxmox “cluster issues” are actually latency issues. Not always network—often storage or CPU contention that manifests as missed heartbeats elsewhere.
Interesting facts and historical context (because systems have baggage)
- Corosync evolved from the OpenAIS project, which aimed to implement “application interface specification” concepts for clustering in Linux.
- Totem is Corosync’s group communication layer; its token mechanism is why “token timeout” tuning can make you feel powerful—and then regret it.
- Quorum in Corosync is a voting problem (via votequorum) rather than a health scoring system; it doesn’t measure service-level responsiveness.
- pmxcfs is a FUSE-based distributed filesystem; it’s great for small config files and terrible for your patience when it blocks.
- Proxmox’s “cluster filesystem” is not a general filesystem; it’s a replicated config store. Treating it like shared storage is how you end up in therapy.
- Split-brain avoidance is a design bias in most cluster stacks; Proxmox tends to prefer “stop writes” over “maybe corrupt things quietly.”
- Ceph’s historical pain point was small-write amplification; modern versions improved a lot, but your network and disks still decide if it’s a Ferrari or a lawnmower.
- Linux kernel scheduling and I/O pressure can create “everything looks up but nothing moves” failure modes—especially on overloaded hypervisors.
One quote worth keeping taped to your monitor:
Hope is not a strategy.
— Gen. Gordon R. Sullivan
Joke #1: Corosync saying “quorum” while your GUI hangs is like your smoke detector saying “battery OK” during a kitchen fire.
Fast diagnosis playbook
When the cluster is dying, you don’t have time for interpretive log reading. You need a short path to the bottleneck.
Here’s the order that finds root causes fast in real environments.
First: is this a network membership problem or a management-plane stall?
- Check Corosync membership stability (pvecm status, corosync-cfgtool -s).
- Check whether pmxcfs is responsive (pvecm updatecerts will hang if the cluster filesystem is stuck; simple reads in /etc/pve can also block).
- Check whether the API/UI is blocked (systemd status and journal for pveproxy/pvedaemon).
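The pmxcfs responsiveness check is easy to script with a hard deadline, so a wedged FUSE mount can't hang your diagnosis too. A minimal sketch, assuming coreutils timeout; check_pmxcfs is my made-up helper name, not a Proxmox command:

```shell
#!/bin/sh
# Sketch: distinguish "pmxcfs wedged" from "listing failed outright".
# Deadline (seconds) and path are arguments; defaults are illustrative.
check_pmxcfs() {
    timeout "${1:-5}" ls "${2:-/etc/pve}" >/dev/null 2>&1
    case $? in
        0)   echo "OK" ;;      # listed within the deadline
        124) echo "STALL" ;;   # timeout fired: the FUSE call is blocking
        *)   echo "ERROR" ;;   # ls itself failed (not mounted, no access)
    esac
}

# On a cluster node: OK means move on, STALL means control-plane outage.
check_pmxcfs 5 /etc/pve
```

A STALL here on one node but not others points at local pressure; STALL everywhere points at corosync messaging or cluster-wide pmxcfs contention.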
Second: what is the dominant latency source right now?
- Network latency/packet loss (ping -f is not the answer; use mtr, ethtool -S, switch-side counters).
- Storage latency (ZFS zpool iostat -v, Ceph ceph -s and slow ops, NFS client stats).
- CPU steal / run queue / memory pressure (load average is not enough; check vmstat, top, and pressure stall information if available).
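Pressure stall information is worth scripting rather than eyeballing. A minimal sketch for reading the 10-second I/O pressure average, assuming a PSI-enabled kernel (Linux 4.20+); psi_io_avg10 is my name for it:

```shell
#!/bin/sh
# Sketch: read the avg10 I/O pressure from PSI, where the kernel exposes it.
# PSI line layout: "some avg10=X avg60=Y avg300=Z total=N".
psi_io_avg10() {
    awk -F'[= ]' '/^some/ { print $3; exit }' "$1"
}

# Prints nothing on kernels/containers without PSI.
if [ -r /proc/pressure/io ]; then
    echo "io pressure avg10: $(psi_io_avg10 /proc/pressure/io)%"
fi
```

A sustained non-trivial avg10 during your incident means tasks are stalling on I/O even if load average looks tame.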
Third: is something “helpfully” retrying forever?
- DNS and LDAP lookups (GUI logins hang, API calls stall).
- Multipath flapping (iSCSI paths dying and coming back like a soap opera).
- Ceph backfill/recovery saturating the cluster (it’s “healthy-ish” but slow enough to time out everything else).
Quick triage decisions
- If membership is stable but pmxcfs is blocked: treat it like a control-plane outage. Stop changing config and find the stall.
- If storage latency spikes: stop migrations, stop backups, stop anything that multiplies I/O. Restore baseline first.
- If network loss/latency spikes: prioritize stabilizing the ring network over “tuning token timeouts.” Tuning is a last resort, not a cure.
Failure modes where Corosync looks healthy
1) pmxcfs is stuck: Corosync is fine, but configuration writes block
pmxcfs is where Proxmox stores cluster-wide config: VM definitions, storage configs, firewall rules, user realms, and more.
It’s backed by Corosync’s messaging, and it’s mounted at /etc/pve using FUSE.
When pmxcfs is slow or wedged, you’ll see symptoms like:
- GUI actions hang (creating a VM, editing storage, changing HA)
- qm/pct commands freeze when they touch configs
- SSH is fine; VMs keep running; but management is “underwater”
Common causes: extreme CPU pressure, FUSE deadlocks, disk stalls affecting local journaling, or corosync message delays that don’t yet break quorum.
2) Token timeouts aren’t broken; your latency budget is
Corosync’s token mechanism expects timely message delivery. You can have stable quorum even with intermittent latency spikes that
don’t exceed your token timeout—but those spikes are still long enough to freeze migrations, backups, and HA decisions.
A classic pattern: you “fixed” corosync by increasing token timeout. Membership stops flapping.
Meanwhile, the cluster is now tolerant of latency so bad that everything else suffers. You didn’t fix the network.
You just taught Corosync to stop complaining.
3) Storage stalls freeze the hypervisor, not Corosync
The nastiest Proxmox incidents are storage-induced. A VM write blocks in the kernel or QEMU,
the host experiences I/O wait, and suddenly all your management daemons respond like they’re answering from a tunnel.
Corosync can still exchange heartbeats if the CPU gets scheduled occasionally. That’s enough to keep quorum.
But it’s not enough for a responsive system.
4) Time drift: the slow poison
NTP/chrony problems don’t always break quorum. But they can break everything that assumes time monotonicity:
TLS handshakes, authentication, logs correlation, fencing decisions, and “why did that node think it was 5 minutes in the future?”
You’ll also chase ghosts in logs because events appear out of order. That’s not “fun.” That’s how you lose hours.
5) HA isn’t “down,” it’s indecisive under partial failure
Proxmox HA depends on a coherent view of resources, node states, and storage availability.
With quorum intact but underlying latency, HA can get stuck: repeatedly trying to start resources, waiting for locks, or refusing actions
because it can’t safely verify state. From the outside it looks like “HA is broken.” From the inside it’s being cautious.
6) The GUI is slow because pveproxy is waiting on something dumb
Common culprits: reverse DNS lookups, LDAP/PAM timeouts, blocked reads in /etc/pve,
or a saturated single-threaded path somewhere in the request handling.
Practical tasks: commands, outputs, decisions
These are the checks I actually run when I’m on the clock. Each task includes what the output means and what decision you make from it.
Run them on at least two nodes: one “good” and one “bad.” Differences are your clue.
Task 1: Verify quorum and expected votes
cr0x@server:~$ pvecm status
Cluster information
-------------------
Name: prod-cluster
Config Version: 42
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Tue Feb 4 10:12:31 2026
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.2c
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Meaning: Corosync sees 3 nodes, quorum is achieved, votes match expectation.
Decision: If this is “Yes” but you still have pain, stop blaming quorum and start measuring latency, pmxcfs, and storage.
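That “Quorum: 2” line is plain majority math. A sketch, assuming default votequorum settings with one vote per node (two_node mode and qdevices change the math):

```shell
#!/bin/sh
# Sketch: votes needed for quorum = floor(n/2) + 1, i.e. a strict majority.
quorum_needed() { echo $(( $1 / 2 + 1 )); }

for n in 2 3 4 5; do
    echo "$n nodes -> quorum $(quorum_needed "$n")"
done
```

This is also why even-sized clusters are awkward: 4 nodes need 3 votes, so they tolerate exactly as many failures as 3 nodes do.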
Task 2: Check Corosync link status and MTU mismatches
cr0x@server:~$ corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id = 10.10.10.11
status = ring 0 active with no faults
RING ID 1
id = 10.10.20.11
status = ring 1 active with no faults
Meaning: Rings are up. “No faults” does not mean “good latency.”
Decision: If rings show faults intermittently, fix L2/L3 issues first (bonding, MTU, switch errors) before touching Corosync tuning.
Task 3: Read Corosync’s own complaints (they’re subtle)
cr0x@server:~$ journalctl -u corosync -S -2h --no-pager | tail -n 30
Feb 04 09:41:02 pve01 corosync[1267]: [KNET ] link: host: 2 link: 0 is down
Feb 04 09:41:03 pve01 corosync[1267]: [KNET ] host: 2 link: 0 recovered
Feb 04 09:58:19 pve01 corosync[1267]: [TOTEM ] Token has not been received in 1800 ms
Feb 04 09:58:19 pve01 corosync[1267]: [TOTEM ] A processor failed, forming new configuration.
Meaning: Short link drops and token delays. You can still remain quorate while reconfigurations happen.
Decision: If you see token warnings, treat it as a real incident: investigate network errors, CPU starvation, or IRQ storms.
Task 4: Confirm pmxcfs is mounted and responsive
cr0x@server:~$ mount | grep /etc/pve
pve on /etc/pve type fuse.pve (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
Meaning: The mount exists. It can still be slow.
Decision: Next, test read/write responsiveness.
Task 5: Test whether /etc/pve operations hang
cr0x@server:~$ time ls -l /etc/pve/nodes/pve01/qemu-server | head
total 8
-rw-r----- 1 root www-data 1324 Feb 4 09:55 101.conf
real 0m0.012s
user 0m0.002s
sys 0m0.004s
Meaning: Fast response is normal. If this takes seconds or hangs, pmxcfs is choking.
Decision: If slow/hanging on one node only, suspect local resource pressure. If slow on all nodes, suspect corosync latency or pmxcfs contention cluster-wide.
Task 6: Check pmxcfs and pve services health
cr0x@server:~$ systemctl status pve-cluster pvedaemon pveproxy --no-pager
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: active (running) since Tue 2026-02-04 08:01:12 UTC; 2h 11min ago
Main PID: 1123 (pmxcfs)
Tasks: 13 (limit: 154263)
Memory: 52.4M
CPU: 2min 1.911s
● pvedaemon.service - Proxmox VE API Daemon
Active: active (running)
● pveproxy.service - Proxmox VE API Proxy Server
Active: active (running)
Meaning: Services are “running.” That doesn’t mean responsive.
Decision: If “active” but UI hangs, inspect logs and blocking calls (next tasks).
Task 7: See if pveproxy is timing out on auth/DNS
cr0x@server:~$ journalctl -u pveproxy -S -2h --no-pager | tail -n 25
Feb 04 10:01:18 pve02 pveproxy[2044]: proxy detected vanished client connection
Feb 04 10:02:41 pve02 pveproxy[2044]: authentication failure; rhost=10.10.30.50 user=admin@pam msg=timeout
Feb 04 10:02:41 pve02 pveproxy[2044]: failed login attempt; user=admin@pam
Meaning: Auth timeouts can be LDAP/PAM/DNS slowness, not wrong passwords.
Decision: If you see timeouts, test name resolution and directory reachability; don’t “restart random services” yet.
Task 8: Validate time sync and drift across nodes
cr0x@server:~$ chronyc tracking
Reference ID : 192.0.2.10
Stratum : 3
Ref time (UTC) : Tue Feb 04 10:11:32 2026
System time : 0.000347812 seconds slow of NTP time
Last offset : -0.000112345 seconds
RMS offset : 0.000251901 seconds
Frequency : 12.345 ppm fast
Leap status : Normal
Meaning: Good sync shows tiny offsets and “Normal” leap status.
Decision: If offset is large or leap status is not normal, fix time now. Don’t troubleshoot cluster behavior until clocks agree.
Task 9: Detect CPU pressure and I/O wait that starves everything
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 812344 54212 9248120 0 0 4 21 920 1800 6 2 88 4 0
3 1 0 790112 54180 9249008 0 0 120 8020 1100 2100 9 3 44 44 0
4 2 0 780004 54140 9249912 0 0 200 9100 1200 2400 8 4 36 52 0
Meaning: High wa (I/O wait) indicates the system is blocked on storage. High b suggests blocked processes.
Decision: If wa is consistently high during your incident, stop chasing Corosync configs and go to storage diagnostics.
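If you want the wa check scriptable rather than eyeballed, a sketch that scans vmstat output for high-iowait samples; field 16 is the wa column in default vmstat layout, and the 30% threshold is illustrative:

```shell
#!/bin/sh
# Sketch: flag vmstat samples where I/O wait dominates.
flag_iowait() {
    awk -v limit="${1:-30}" '
        NR > 2 && $16 ~ /^[0-9]+$/ {   # skip the two header lines
            if ($16 >= limit) print "HIGH_WA sample " NR-2 ": wa=" $16 "%"
        }'
}

# Usage: vmstat 1 5 | flag_iowait 30
```

Silence means wa stayed under the limit for every sample; any output means stop chasing Corosync configs and go look at storage.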
Task 10: ZFS health and latency on a node using local ZFS
cr0x@server:~$ zpool status -x
all pools are healthy
Meaning: No known pool errors. Still doesn’t tell you latency.
Decision: If things are slow, check iostat and sync behavior next.
cr0x@server:~$ zpool iostat -v 1 3
capacity operations bandwidth
pool alloc free read write read write
rpool 320G 1.45T 80 1200 5.4M 98.2M
mirror 320G 1.45T 80 1200 5.4M 98.2M
nvme0n1 - - 40 610 2.7M 49.1M
nvme1n1 - - 40 590 2.7M 49.1M
Meaning: Heavy writes. If this correlates with management-plane hangs, you may be saturating storage.
Decision: Consider throttling backups/replication, and check for sync writes (databases, NFS sync, or mis-tuned ZFS).
Task 11: Ceph cluster state (if you run it)
cr0x@server:~$ ceph -s
cluster:
id: 1b2c3d4e-5555-6666-7777-88889999aaaa
health: HEALTH_WARN
12 slow ops, oldest one blocked for 38 sec
services:
mon: 3 daemons, quorum a,b,c (age 2h)
mgr: x(active, since 2h)
osd: 9 osds: 9 up (since 2h), 9 in (since 2h)
data:
pools: 6 pools, 512 pgs
usage: 12 TiB used, 18 TiB / 30 TiB avail
pgs: 512 active+clean
Meaning: “slow ops” is Ceph politely telling you your storage is hurting.
Decision: Treat slow ops as a production issue. Pause IO-heavy operations. Investigate OSD latency, network, and recovery/backfill settings.
Task 12: Check for network errors on Corosync interfaces
cr0x@server:~$ ip -s link show dev bond0
3: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 3c:ec:ef:aa:bb:cc brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
1234567890 987654 12 0 0 0
TX: bytes packets errors dropped carrier collsns
2233445566 876543 0 0 0 0
Meaning: RX errors non-zero is a clue. Twelve errors can be “nothing” or the top of an iceberg—correlate with time.
Decision: If errors increase during incidents, check cabling, optics, NIC firmware, switch ports, and MTU consistency end-to-end.
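Since absolute counts are ambiguous, measure the delta during the incident window. A sketch with made-up helper names (counter_delta, rx_errors); bond0 is this example's interface:

```shell
#!/bin/sh
# Sketch: absolute NIC error counts lie; deltas during the incident don't.
counter_delta() { echo $(( $2 - $1 )); }
rx_errors() { cat "/sys/class/net/$1/statistics/rx_errors"; }

# Usage, sampled across ten seconds of the incident:
#   a=$(rx_errors bond0); sleep 10; b=$(rx_errors bond0)
#   echo "rx errors in 10s: $(counter_delta "$a" "$b")"
```

A delta of zero moves your attention elsewhere; a growing delta is a hardware/path problem regardless of what quorum says.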
Task 13: Measure latency and loss between nodes (without fooling yourself)
cr0x@server:~$ mtr -r -c 50 -n 10.10.10.12
Start: 2026-02-04T10:12:01+0000
HOST: pve01 Loss% Snt Last Avg Best Wrst StDev
1.|-- 10.10.10.12 0.0% 50 0.4 0.6 0.3 2.1 0.3
Meaning: Good: low average, low worst-case, no loss.
Decision: If worst-case spikes into tens/hundreds of ms or loss appears, Corosync can still look “fine” while the rest times out. Fix network path quality.
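A sketch that grades an mtr report line against a latency budget, so "fine on average" stops hiding bad tails; the 30 ms worst-case budget is my illustrative number, not a Corosync constant:

```shell
#!/bin/sh
# Sketch: grade `mtr -r` hop lines on loss and worst-case latency.
grade_mtr() {
    awk -v wbudget="${1:-30}" '
        /\|--/ {
            loss = $3 + 0; wrst = $8 + 0   # Loss% and Wrst columns
            if (loss > 0 || wrst > wbudget)
                print "DEGRADED loss=" loss "% wrst=" wrst "ms"
            else
                print "OK wrst=" wrst "ms"
        }'
}

# Usage: mtr -r -c 50 -n 10.10.10.12 | grade_mtr 30
```

DEGRADED with stable quorum is exactly the "Corosync looks healthy" trap this article is about.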
Task 14: Check for stuck tasks and why migrations/backups don’t finish
cr0x@server:~$ pvesh get /cluster/tasks --limit 5
[
{
"endtime": 0,
"id": "UPID:pve02:0000A1B2:00C3D4E5:67A1B2C3:vzdump:105:root@pam:",
"node": "pve02",
"pid": 41394,
"starttime": 1707040801,
"status": "running",
"type": "vzdump",
"user": "root@pam"
}
]
Meaning: A backup running “forever” often correlates with storage stalls or snapshot commits that can’t flush.
Decision: Check the specific node logs and underlying storage latency. Don’t just kill the task unless you understand whether it’s holding locks or snapshots.
Task 15: Spot HA manager indecision
cr0x@server:~$ ha-manager status
quorum OK
master pve01 (active, Tue Feb 4 10:12:12 2026)
lrm pve01 (active, Tue Feb 4 10:12:11 2026)
lrm pve02 (active, Tue Feb 4 10:12:10 2026)
lrm pve03 (active, Tue Feb 4 10:12:09 2026)
service vm:101 (started)
service vm:102 (freeze) (request_stop)
service ct:203 (started)
Meaning: “freeze” indicates HA can’t make progress—often due to lock contention, storage unavailability, or stuck agent actions.
Decision: Investigate the affected resource’s storage and config locks. Do not “force” HA actions until you know what it’s waiting on.
Task 16: Find config lock contention (the quiet killer)
cr0x@server:~$ ls -l /var/lock/pve-manager
total 0
-rw-r----- 1 root www-data 0 Feb 4 10:08 vzdump.lock
-rw-r----- 1 root www-data 0 Feb 4 10:09 pve-storage-lock
Meaning: Locks exist during normal operations, but if they persist for a long time, something is stuck.
Decision: Correlate lock age with tasks list and storage performance. If a lock is stale due to a crashed process, resolve the underlying stuck task safely before removing locks.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran a three-node Proxmox cluster for internal services. Everything was “redundant”: dual NICs, two switches,
RAID on the hypervisors. They were proud of it. They’d earned that pride.
Then one Monday morning, the GUI froze intermittently. Migrations hung. Backups that usually took minutes took hours.
The on-call did the ritual: checked pvecm status. Quorate. No node left. Corosync looked clean enough.
So they assumed the cluster network was fine and went hunting in the Proxmox UI logs.
The wrong assumption: “If Corosync has quorum, the cluster network is healthy.”
Quorum only meant the nodes could still exchange enough messages to agree on membership. It said nothing about tail latency.
The actual cause was one switch port going bad in a way that didn’t fully drop link. It introduced intermittent microbursts and CRC errors.
Corosync’s knet links recovered quickly, so membership stayed stable. But pmxcfs writes were delayed, and the API was constantly waiting on cluster filesystem responses.
The fix was boring: replaced the suspect cable and SFP, moved the port, and verified error counters stayed flat.
The “mystery” disappeared instantly. The postmortem added one line that mattered: measure network error counters and latency, not just quorum.
Mini-story 2: The optimization that backfired
Another org had a Proxmox+Ceph deployment. They wanted fewer “Corosync token timeout” warnings during heavy load windows.
Someone suggested increasing token timeout and consensus timeouts so Corosync would ride out temporary slowness.
The change reduced log noise. Everyone celebrated. Briefly.
Weeks later, a storage maintenance event triggered Ceph recovery that saturated the backend network.
The cluster remained quorate. That was the problem. Nodes stayed members while becoming progressively non-responsive under I/O wait.
HA decisions were delayed. Migrations queued. The GUI half-worked—just enough to create false confidence.
The “optimization” made the failure mode worse by stretching the window where everything was technically connected but practically unusable.
Operators waited longer before declaring an incident because “Corosync is stable.” Meanwhile, the business impact grew.
The eventual fix wasn’t rolling back timeouts alone. They separated traffic: Corosync on a low-latency, non-congested network;
Ceph recovery tuned to avoid saturating; and they added alerting on tail latency and slow ops rather than membership flaps.
Token timeouts returned closer to defaults. Log noise went up; actual outages went down.
Mini-story 3: The boring but correct practice that saved the day
A regulated environment ran Proxmox for a set of line-of-business workloads. The team was conservative to the point of annoyance.
They maintained a strict rule: each node had out-of-band management configured, a documented “safe shutdown” procedure,
and a quarterly drill where they practiced recovering from partial failures without improvising.
During a power event, one node came back with a degraded storage pool and intermittent I/O errors. Corosync quorum held,
but management operations became unreliable: config changes sometimes hung, backups stalled, and HA was hesitant to relocate workloads.
Instead of thrashing, they followed the playbook: freeze changes, identify the bad node, evacuate VMs that could move safely,
and keep the rest stable. They used out-of-band access to confirm hardware errors, then removed the node from scheduling.
The boring practice—documented steps, a known-good order of operations, and refusing to “just click around”—kept a messy hardware
issue from turning into a cluster-wide incident. The business barely noticed. The team went back to being annoyed by their own process,
which is exactly the vibe you want from reliability work.
Joke #2: If your cluster runs on “tribal knowledge,” congratulations—you’ve invented a single point of failure with feelings.
Common mistakes: symptom → root cause → fix
1) Symptom: Quorum is “Yes,” but GUI actions hang
- Root cause: pmxcfs latency or lock contention; API calls blocked waiting on /etc/pve.
- Fix: Test ls /etc/pve latency on multiple nodes; check CPU/I/O wait; reduce load; resolve stuck tasks holding locks.
2) Symptom: HA shows “freeze” or repeated restart attempts
- Root cause: HA can’t confirm state due to storage timeouts, locks, or delayed cluster filesystem updates.
- Fix: Check ha-manager status, tasks, and storage health; stabilize storage first; avoid forcing starts until state is consistent.
3) Symptom: Migrations start and then stall at a fixed percentage
- Root cause: Storage backend can’t keep up (Ceph slow ops, NFS latency, ZFS sync pressure), or network throughput collapses under contention.
- Fix: Measure storage latency, check Ceph slow ops, check NIC errors; pause other I/O-heavy activities; ensure migration network isn’t shared with storage saturation.
4) Symptom: Corosync logs show token warnings but quorum stays
- Root cause: Tail latency spikes due to congestion, IRQ issues, or CPU starvation; reconfigurations occur without full membership loss.
- Fix: Treat as network/host performance incident; check ip -s link, ethtool -S, mtr, and CPU wait; fix the underlying path.
5) Symptom: Random “permission denied” or TLS/auth issues after “nothing changed”
- Root cause: Time drift between nodes; cert validation windows violated; Kerberos/LDAP time-sensitive auth fails.
- Fix: Fix chrony/NTP, validate drift on all nodes, then re-test auth flows. Don’t rotate certs as your first move.
6) Symptom: Only one node is “slow,” but it doesn’t leave the cluster
- Root cause: Local hardware or kernel issues: disk errors, ZFS degradation, NIC errors, memory pressure.
- Fix: Compare metrics and logs with a healthy node; evacuate workloads; investigate hardware; don’t let a sick node poison the control plane.
7) Symptom: Everything gets bad during backups
- Root cause: Backup I/O saturates storage or network; snapshot commits slow; locks held longer; management operations pile up.
- Fix: Stagger backups, throttle backup bandwidth, separate backup traffic, and ensure storage has headroom. Backups are supposed to be boring, not a load test.
Checklists / step-by-step plan
Checklist A: When the cluster “feels slow” but quorum is fine
- Freeze changes. No new storage configs, no firewall edits, no HA reshuffles until you understand the stall.
- Pick one “bad” node and one “good” node. Run the same checks; differences are gold.
- Confirm membership stability: pvecm status, corosync-cfgtool -s, the Corosync journal.
- Test pmxcfs responsiveness: a quick timed ls under /etc/pve.
- Check locks and stuck tasks: pvesh get /cluster/tasks; inspect lock files.
- Measure host pressure: vmstat, load, I/O wait, memory pressure.
- Measure network quality: error counters plus mtr between nodes on the Corosync ring.
- Measure storage health: ZFS iostat/status or Ceph slow ops.
- Only then consider tuning. Tuning without measurement is how you build a “stable” slow disaster.
Checklist B: Stabilize first, then recover functionality
- Stop the load multipliers: pause migrations, postpone backups, limit recovery/backfill if on Ceph (carefully).
- Isolate the sick node: if one node has errors/latency, migrate off what you can and remove it from HA decisions until fixed.
- Verify time sync: make sure clocks agree before you interpret logs and fencing events.
- Restore baseline network: eliminate packet loss, CRC errors, MTU mismatches, and congested links.
- Restore baseline storage: clear disk errors, repair degraded pools, address slow ops, ensure adequate free space.
- Re-enable operations gradually: migrations/backups one at a time, watch latency and logs.
Checklist C: Hardening so this doesn’t happen again
- Separate traffic classes: Corosync on low-latency links; storage on its own network; migrations separate if possible.
- Alert on tail latency, not just “up/down.” Quorum alarms are necessary and insufficient.
- Capacity plan for backups and recovery. If your cluster can’t handle a recovery event plus normal load, it’s not resilient.
- Test failure drills. Practice “one node slow,” “one link flapping,” “storage slow ops.” Real incidents shouldn’t be your first rehearsal.
FAQ
1) Why does pvecm status show “Quorate: Yes” when the GUI is unusable?
Because quorum is about membership and voting, not responsiveness. The GUI depends on pmxcfs and API daemons that can block on I/O, locks, DNS, or storage latency.
2) If Corosync shows no faults, can the network still be the problem?
Yes. Short spikes, microbursts, CRC errors, and jitter can ruin tail latency without dropping membership. Check counters and mtr, not just ring status.
3) Should I increase Corosync token timeout to stop flapping?
Only after you’ve proved the network and host scheduling are stable and you still need it. Increasing timeouts can hide real latency issues and delay failure detection.
4) What’s the quickest way to tell if pmxcfs is the bottleneck?
Time a simple ls in /etc/pve on multiple nodes. If it’s slow or hangs, pmxcfs is involved. Then check CPU and I/O wait.
5) Can storage problems really affect Corosync and cluster management?
Absolutely. Storage stalls drive I/O wait, which delays processes and scheduling. Corosync may continue to exchange enough messages, but pmxcfs and API calls will suffer.
6) How does time drift break a Proxmox cluster if quorum is fine?
Drift can break TLS/auth, confuse logs, and cause inconsistent decision-making in HA or fencing workflows. Fix time sync before deeper troubleshooting.
7) Why do migrations hang more often than “normal VM runtime” during incidents?
Migrations amplify bandwidth and storage requirements and are sensitive to latency. A VM can limp along with cache and retries; a migration is a tight loop that times out.
8) What should I do if one node is slow but still part of the cluster?
Treat it like a partial failure: evacuate workloads where safe, reduce what depends on that node, and investigate hardware/network/storage on that host specifically.
9) Is it safe to restart corosync or pmxcfs during an incident?
Sometimes, but it’s not a first-line move. Restarting can cause membership changes and lock churn. Stabilize network/storage first, then restart with a clear objective.
10) What’s the best “single metric” to alert on for these issues?
There isn’t one. Combine: pmxcfs responsiveness (synthetic checks), network loss/jitter, storage latency/slow ops, and host I/O wait. Quorum alone is a feel-good metric.
Next steps you can do this week
If you run Proxmox in production, here’s the practical path that actually changes outcomes:
- Add a synthetic pmxcfs check: measure and alert if ls /etc/pve exceeds a small latency threshold on any node.
- Alert on network errors on the Corosync interfaces: CRC errors, drops, link flaps. This catches “quorum is fine” degradations early.
- Alert on storage latency: ZFS pool iostat anomalies, Ceph slow ops, NFS client retransmits. Storage is the silent majority of these incidents.
- Keep Corosync timeouts sane: don’t use tuning as a bandage for a bad network. If you must tune, document why and what measurement justified it.
- Run a failure drill: simulate a congested storage network or a flapping link and practice “stabilize first.” Your future self will be grateful and slightly less tired.
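The synthetic pmxcfs check from the first bullet can be as small as this sketch; probe_ms and the 500 ms threshold are my choices, and you'd wire the exit codes into whatever monitoring you already run:

```shell
#!/bin/sh
# Sketch: time a directory listing in milliseconds; -1 on failure/timeout.
# Assumes GNU date (%N) and coreutils timeout, both present on Proxmox/Debian.
probe_ms() {
    start=$(date +%s%N)
    timeout 10 ls "$1" >/dev/null 2>&1 || { echo -1; return; }
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# On a cluster node, from cron or your monitoring agent:
#   ms=$(probe_ms /etc/pve)
#   alert if ms is -1 (stall/failure) or above ~500 ms
echo "probe /tmp: $(probe_ms /tmp)ms"
```

Run it per node and alert on the worst node, not the average; one sick node is exactly the case quorum alarms miss.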
Corosync is not lying to you. It’s just answering a smaller question than the one you’re asking.
If you want a cluster that survives, measure the whole organism—network quality, storage latency, control-plane responsiveness—and treat “quorum: yes” as the start of diagnosis, not the end.