The Proxmox UI is green. Then it’s red. Then your nodes decide they’re single again, like a distributed system going through a breakup phase.
Corosync logs say link down, quorum goes missing, HA gets nervous, and suddenly your “simple” maintenance window has a live audience.
“Link down” isn’t a mystical Corosync mood swing. It’s a symptom: packet loss, jitter, MTU mismatch, NIC offload weirdness, bad ring design,
time sync drift, overloaded hosts, or a two-node cluster pretending quorum rules don’t apply. Let’s make it boring again.
What “link down” actually means in Corosync
In Proxmox VE, Corosync is the cluster communication layer. It’s the part that decides whether nodes can reliably hear each other and whether the
cluster still forms a single coherent group. When Corosync says link down, it’s not talking about an Ethernet carrier drop
(though it can be). It’s talking about its own transport path—its ring—being unusable because messages aren’t arriving within the expected time
window, or because it’s detecting a partition.
Corosync is opinionated. It would rather declare a link down than silently accept unreliable communication. That’s a good instinct: unreliable
cluster comms cause the kind of “half the nodes think X, half think Y” chaos that turns storage and HA into liability machines.
Practical translation: Corosync link down means your cluster control plane is experiencing packet loss, excessive jitter, reordering,
or long scheduling delays—and it’s exceeding Corosync’s tolerance as set by token timeouts and retransmit logic. Fix the network and
host behavior; don’t just “increase timeouts until it stops complaining” unless you’ve measured the trade-offs.
One dry truth: if your Corosync ring is flaky, everything above it becomes a liar. The UI lies. HA lies. “It worked yesterday” lies.
Facts & context: why Corosync behaves like this
- Corosync evolved from the OpenAIS project (mid-2000s era) to provide reliable group messaging for clusters; it didn’t start life as a Proxmox feature.
- The Totem protocol (Corosync’s messaging core) is designed around membership and ordered messaging; it’s conservative about partitions.
- Token-based membership is a classic approach: if you can’t pass the “token” (conceptually) in time, you’re out. That’s why timeouts matter.
- Two-node quorum is historically awkward in most clustering stacks; external quorum devices exist because “2” is a political number, not a fault-tolerant one.
- Multicast used to be the default recommendation for some cluster stacks, but modern datacenters often disable it or treat it inconsistently, pushing people toward unicast.
- Ring redundancy is not new: HA clusters have long used dual networks (public + private heartbeat) because “one switch” is not a strategy.
- Linux bonding modes have a long history of surprises under packet loss or asymmetric routing; cluster traffic is where “mostly fine” becomes “not fine”.
- MTU mismatch bugs are ancient and still thriving: jumbo frames silently falling back or fragmenting can create exactly the intermittent loss that kills a token ring.
Why clusters flap: the failure modes that matter
1) Packet loss (microbursts count)
Corosync doesn’t need “big loss” to hurt you. A few dropped packets in a tight window can trigger retransmits and token delays, leading to a link-down
decision. Microbursts—short spikes of congestion—are famous for being invisible to simplistic monitoring. A switch can be “fine” on average while
dropping the one packet your cluster needed.
2) Jitter and scheduling latency (the host is the network too)
Even if the network path is perfect, a busy host can behave like a lossy network if Corosync threads can’t run. CPU overcommit, interrupt storms,
storage stalls, and kernel-level hiccups can delay processing. Corosync measures time in real time; your scheduler measures time in “when I get around to it.”
3) MTU mismatch and fragmentation weirdness
Clusters tend to live on “special” networks: VLANs, jumbo frames, bonds, bridges, NIC offloads, firewall rules, and sometimes overlay networks.
MTU mismatch is the classic: pings work (small), but larger packets fragment or drop, and Corosync gets intermittent timeouts.
4) Ring design errors: single points of failure disguised as “redundancy”
If ring0 and ring1 share the same switch, same NIC, same bond slave, or same upstream path, you don’t have redundancy. You have two names for one problem.
Corosync will happily flap both rings if your “two rings” are actually one failure domain.
5) Misconfigured unicast/multicast, or firewall “help”
Corosync needs consistent delivery. Half-configured multicast, IGMP snooping acting up, or a firewall dropping UDP fragments can make membership unstable.
Corosync uses UDP; UDP plus “enterprise firewall policy” is a relationship that requires counseling.
6) Time sync drift and clock steps
Token timeouts assume clocks are sane. NTP/chrony stepping time backward, or hosts with wildly different time sources, can amplify jitter and trigger false suspicions.
Corosync is not strictly dependent on synchronized clocks like some databases, but wild time behavior makes diagnostics and timeouts worse.
7) Quorum edge cases (especially two-node clusters)
The moment one node can’t see the other, you’ve got a decision: who is “the cluster”? In two nodes, there is no majority without help.
Proxmox provides qdevice/qnetd to avoid the classic split-brain dance. Ignore that, and link flaps turn into “cluster down” events.
Joke #1: A two-node cluster without a qdevice is like a conference call with only two people—when it gets quiet, both assume the other hung up.
Fast diagnosis playbook (check first/second/third)
The goal is to find the bottleneck quickly: is it the network path, host scheduling, or quorum design? Don’t start by editing timeouts.
Start by proving where the loss or delay is happening.
First: confirm it’s Corosync membership instability, not “UI weirdness”
- Check Corosync and cluster status: membership changes, ring status, expected votes.
- Check logs around the flap: which node declared link down first?
Second: validate the transport path (loss, MTU, firewall, routing symmetry)
- Run targeted ping tests with DF and larger payloads on the Corosync interfaces.
- Capture a short tcpdump during a flap window; look for gaps and ICMP frag-needed.
- Check switch counters (even if “network team says it’s fine”).
Third: check the host for stalls (CPU, interrupts, softnet drops, storage pauses)
- Look for ksoftirqd spikes, RX drops, ring buffer overruns.
- Check if ZFS scrubs, backups, or replication saturate IO/CPU during flaps.
- Verify time sync is stable (no big steps, no “NTP slew panic”).
Fourth: verify quorum strategy
- Two nodes? Get a qdevice. Three nodes? Ensure expected votes match reality.
- Confirm there isn’t a hidden third “dead” vote in config.
Hands-on tasks: commands, expected output, and decisions
These are the checks I actually run when a cluster starts flapping. Each includes a decision: what you do next based on what you see.
Task 1: Confirm cluster membership and quorum
cr0x@server:~$ pvecm status
Cluster information
-------------------
Name: prod-cluster
Config Version: 42
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Thu Dec 25 10:12:03 2025
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.3a
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
What it means: If Quorate: No or Total votes suddenly drops, you’re not looking at a “minor comms blip”—you’re looking at a partition event.
Decision: If quorum is lost intermittently, prioritize transport stability checks (Tasks 4–10) and quorum design (Task 12). Don’t tune HA first.
Task 2: Check Corosync ring status and per-link state
cr0x@server:~$ corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id = 192.168.50.11
status = ring 0 active with no faults
RING ID 1
id = 192.168.51.11
status = ring 1 active with no faults
What it means: If you see faulty or only one ring active, you’ve lost redundancy or the path is unstable.
Decision: If ring1 is faulty, check wiring/VLAN/MTU on ring1 specifically. Don’t assume “ring0 is fine so we’re safe”—failover flaps can still destabilize membership.
Task 3: Pull the relevant Corosync log slice around the flap
cr0x@server:~$ journalctl -u corosync --since "10 minutes ago" --no-pager
Dec 25 10:05:41 pve1 corosync[1642]: [KNET ] link: host: 2 link: 0 is down
Dec 25 10:05:41 pve1 corosync[1642]: [TOTEM ] Token has not been received in 1000 ms
Dec 25 10:05:41 pve1 corosync[1642]: [TOTEM ] A processor failed, forming new configuration.
Dec 25 10:05:42 pve1 corosync[1642]: [QUORUM] Members[2]: 1 3
Dec 25 10:05:47 pve1 corosync[1642]: [KNET ] link: host: 2 link: 0 is up
What it means: The “token not received” and “forming new configuration” lines are the membership churn. The KNET line identifies which peer/link dropped.
Decision: Identify which node reports the drop first and which peer is implicated. Then test that exact network path (Tasks 4–8).
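When several nodes flap at once, ordering matters. A small parser extracts the timestamp, reporting node, and implicated peer from the KNET link-down lines so you can sort events across nodes. The heredoc below reuses sample log lines (a sketch; in practice, pipe in the journalctl command from this task):

```shell
#!/bin/sh
# Pull "which node saw which peer's link drop, and when" out of KNET lines.
# The heredoc stands in for:
#   journalctl -u corosync --since "10 minutes ago" --no-pager
grep 'link:.*is down' <<'EOF' | awk '{print $1, $2, $3, "reporter=" $4, "peer_host=" $10, "link=" $12}'
Dec 25 10:05:41 pve1 corosync[1642]: [KNET  ] link: host: 2 link: 0 is down
Dec 25 10:05:43 pve3 corosync[2001]: [KNET  ] link: host: 2 link: 0 is up
Dec 25 10:05:44 pve3 corosync[2001]: [KNET  ] link: host: 2 link: 0 is down
EOF
```

Run it on every node's log slice, merge, and sort by timestamp: the first reporter and its implicated peer tell you which path to test.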
Task 4: Verify Corosync is using the interfaces you think it is
cr0x@server:~$ grep -E "ring0_addr|ring1_addr|transport|link_mode" /etc/pve/corosync.conf
transport: knet
link_mode: passive
ring0_addr: 192.168.50.11
ring1_addr: 192.168.51.11
What it means: This confirms addressing and transport. Wrong subnets or missing ring definitions are common after “quick changes.”
Decision: If addresses don’t match the intended dedicated network, fix that first. Corosync on a shared busy LAN is asking for jitter.
Task 5: Check MTU end-to-end with DF ping (small and large)
cr0x@server:~$ ping -c 3 -M do -s 8972 192.168.50.12
PING 192.168.50.12 (192.168.50.12) 8972(9000) bytes of data.
ping: local error: message too long, mtu=1500
ping: local error: message too long, mtu=1500
ping: local error: message too long, mtu=1500
--- 192.168.50.12 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2046ms
What it means: Your interface path MTU is 1500 (or something along the path is), so jumbo expectations are wrong.
Decision: Either configure jumbo consistently across NIC/switch/VLAN/bridge/bond, or standardize on 1500 everywhere for Corosync. Mixed MTU is flapping fuel.
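The 8972 in the command above is not arbitrary: the ICMP payload is the target MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header. A tiny helper (a sketch; assumes IPv4 without IP options) computes the right -s value for any MTU you intend to validate:

```shell
#!/bin/sh
# ping -s payload that exactly fills a target MTU over IPv4 (no IP options):
# payload = MTU - 20 (IPv4 header) - 8 (ICMP header)
df_payload() {
    echo $(( $1 - 20 - 8 ))
}
df_payload 9000   # -> 8972 (jumbo frame test)
df_payload 1500   # -> 1472 (standard Ethernet test)
```

Test both sizes on every ring between every node pair: the large one proves jumbo works end-to-end, the small one proves basic reachability isn't lying to you.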
Task 6: Confirm MTU on the actual Linux interfaces used (bridge/bond included)
cr0x@server:~$ ip -d link show vmbr1
4: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 3c:fd:fe:aa:bb:cc brd ff:ff:ff:ff:ff:ff
bridge forward_delay 1500 hello_time 200 max_age 2000 stp_state 0 priority 32768 vlan_filtering 1
What it means: If vmbr1 is 9000 but the bond or slave NIC is 1500, you still lose. MTU must match through the whole chain.
Decision: Check MTU on bondX and the underlying NICs. Fix inconsistencies before touching Corosync timers.
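To check the whole chain quickly, you can read MTUs straight from sysfs rather than running `ip -d link` per device. A read-only sketch, assuming a Linux host with /sys mounted (interface names are whatever your host actually has):

```shell
#!/bin/sh
# List the effective MTU of every interface from sysfs. Compare the
# bridge -> bond -> slave NIC chain by hand: everything on the Corosync
# path must agree, or large frames die somewhere in the middle.
for dev in /sys/class/net/*; do
    printf '%-14s mtu %s\n' "$(basename "$dev")" "$(cat "$dev/mtu")"
done
```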
Task 7: Measure packet loss and latency with fping (corosync network)
cr0x@server:~$ fping -c 200 -p 20 192.168.50.12 192.168.50.13
192.168.50.12 : xmt/rcv/%loss = 200/199/0%, min/avg/max = 0.20/0.35/4.10
192.168.50.13 : xmt/rcv/%loss = 200/192/4%, min/avg/max = 0.21/0.60/18.70
What it means: 4% loss with 18 ms spikes on a cluster network is not “fine.” It’s a link-down generator.
Decision: If you see any loss, stop and find the physical/logical cause (switch drops, bad cable, bad optics, duplex mismatch, congested uplink).
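"Any loss" is easy to enforce mechanically. This sketch parses fping's per-host summary lines and exits non-zero if any target shows loss; the heredoc reuses the sample output above (note fping -c writes its summaries to stderr, hence the 2>&1 when you pipe the real thing):

```shell
#!/bin/sh
# Turn fping -c summaries into pass/fail: any loss on the control plane
# is a defect. Heredoc = sample data; in practice:
#   fping -c 200 -p 20 <nodes> 2>&1 | this-script
awk '{
    split($5, a, "/")          # $5 looks like "200/199/0%,"
    sub(/%.*/, "", a[3])       # strip trailing "%," -> loss percentage
    printf "%s loss=%s%%\n", $1, a[3]
    if (a[3] + 0 > 0) bad = 1
} END { exit bad }' <<'EOF'
192.168.50.12 : xmt/rcv/%loss = 200/199/0%, min/avg/max = 0.20/0.35/4.10
192.168.50.13 : xmt/rcv/%loss = 200/192/4%, min/avg/max = 0.21/0.60/18.70
EOF
echo "any-loss exit code: $?"
```

A non-zero exit makes this trivially wired into a cron job or pre-change check.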
Task 8: Check Linux network error counters and drops
cr0x@server:~$ ip -s link show dev eno2
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
RX: bytes packets errors dropped missed mcast
987654321 1234567 0 4821 0 112233
TX: bytes packets errors dropped carrier collsns
876543210 1122334 0 0 0 0
What it means: RX dropped indicates the host is discarding packets (often ring buffer, driver, or CPU/interrupt pressure).
Decision: If drops increase during flaps, investigate interrupts/softnet (Task 9), NIC ring sizes, and host load (Task 10).
Task 9: Check softnet backlog drops (kernel can’t keep up)
cr0x@server:~$ awk '{print NR-1, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16}' /proc/net/softnet_stat | head
0 00000000 00000018 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
1 00000000 000001a2 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
What it means: The third column of this output (the second raw field of each CPU row, in hex) counts packets dropped because the softnet backlog was full. Non-zero values that grow under load are a smoking gun.
Decision: If these counters climb during flaps, reduce interrupt pressure (IRQ affinity), tune NIC queues, and stop saturating the host with noisy traffic on the Corosync NIC.
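Reading those hex fields by eye gets old. A helper (a sketch; the heredoc rows are hypothetical sample data standing in for the real file) decodes the per-CPU dropped counter to decimal:

```shell
#!/bin/sh
# Decode the "dropped" counter (2nd hex field per CPU row) of
# /proc/net/softnet_stat into decimal. Heredoc = hypothetical sample;
# replace it with:  < /proc/net/softnet_stat
cpu=0
while read -r _processed dropped _rest; do
    printf 'cpu%d dropped=%d\n' "$cpu" "0x$dropped"
    cpu=$((cpu + 1))
done <<'EOF'
00000000 00000018 00000000 00000000 00000000 00000000 00000000 00000000
00000000 000001a2 00000000 00000000 00000000 00000000 00000000 00000000
EOF
```

Run it twice a few minutes apart during a flap window; the delta matters far more than the absolute value.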
Task 10: Correlate flaps with CPU pressure and IO stalls
cr0x@server:~$ vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 102400 12345 678901 0 0 20 40 900 1500 12 4 83 1 0
8 2 0 51200 12000 650000 0 0 8000 12000 4000 9000 35 20 25 20 0
7 3 0 49000 11800 640000 0 0 9000 14000 4200 9200 33 22 20 25 0
What it means: High r (run queue), non-trivial wa (IO wait), and high context switches can align with Corosync delays.
Decision: If flaps correlate with backups, scrubs, or replication, schedule or throttle them, and isolate Corosync traffic (dedicated NIC/VLAN, QoS).
Task 11: Confirm time synchronization is stable (chrony example)
cr0x@server:~$ chronyc tracking
Reference ID : 0A0A0A01 (ntp1)
Stratum : 3
Ref time (UTC) : Thu Dec 25 10:10:58 2025
System time : 0.000231456 seconds slow of NTP time
Last offset : -0.000102334 seconds
RMS offset : 0.000356221 seconds
Frequency : 12.345 ppm fast
Residual freq : -0.012 ppm
Skew : 0.045 ppm
Root delay : 0.001234 seconds
Root dispersion : 0.002345 seconds
Update interval : 64.0 seconds
Leap status : Normal
What it means: Healthy: tiny offsets, stable stratum, no leap issues. If you see large offsets or frequent steps, time is a suspect.
Decision: Fix NTP/chrony stability before tuning Corosync. Time chaos makes diagnosis impossible and token timeouts unpredictable.
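You can turn the "tiny offsets" check into a one-liner. This sketch flags a System time offset above 10 ms; the threshold is an arbitrary illustration, and the heredoc stands in for real chronyc output:

```shell
#!/bin/sh
# Sanity-check the "System time" offset from chronyc tracking.
# 10 ms is an illustrative alert threshold - pick your own.
# Heredoc = sample line; in practice: chronyc tracking | this-script
awk -F':' '/^System time/ {
    split($2, f, " ")                 # f[1] = offset magnitude in seconds
    verdict = (f[1] + 0 > 0.010) ? "INVESTIGATE" : "ok"
    printf "offset=%ss -> %s\n", f[1], verdict
}' <<'EOF'
System time     : 0.000231456 seconds slow of NTP time
EOF
```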
Task 12: Validate expected votes and (if needed) qdevice presence
cr0x@server:~$ pvecm qdevice status
QDevice information
-------------------
Status: OK
QNetd host: 10.10.10.50
QDevice votes: 1
TLS: enabled
Algorithm: ffsplit
What it means: In two-node clusters, qdevice is the difference between “brief link blip” and “everyone panics.”
Decision: If you have two nodes and no qdevice, schedule adding one. If qdevice exists but status is not OK, fix it—don’t rely on hope.
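For orientation, a configured qdevice ends up as a device block inside the quorum section of corosync.conf, roughly like the sketch below. Host, votes, and algorithm here are illustrative; on Proxmox, `pvecm qdevice setup` generates the real thing for you:

```
quorum {
  provider: corosync_votequorum
  device {
    model: net
    votes: 1
    net {
      host: 10.10.10.50
      algorithm: ffsplit
      tls: on
    }
  }
}
```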
Task 13: Check Corosync runtime stats (knet) for link errors
cr0x@server:~$ corosync-cmapctl | grep -E "knet.*(rx|tx|errors|retries|latency)" | head -n 12
runtime.totem.knet.pmtud_interval = 60
runtime.totem.knet.link.0.packets_rx = 2849921
runtime.totem.knet.link.0.packets_tx = 2790012
runtime.totem.knet.link.0.errors = 12
runtime.totem.knet.link.0.retries = 834
runtime.totem.knet.link.0.latency = 420
What it means: Growing errors and retries often align with physical loss or MTU/PMTUD trouble.
Decision: If retries spike during flaps, treat it as a network defect until proven otherwise. “But vMotion works” is not exculpatory evidence.
Task 14: Capture Corosync traffic briefly to detect fragmentation/ICMP issues
cr0x@server:~$ tcpdump -i vmbr1 -nn -c 50 udp port 5405
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vmbr1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:12:11.120001 IP 192.168.50.11.5405 > 192.168.50.12.5405: UDP, length 1240
10:12:11.120220 IP 192.168.50.12.5405 > 192.168.50.11.5405: UDP, length 1240
10:12:11.121003 IP 192.168.50.11 > 192.168.50.12: ICMP 192.168.50.11 udp port 5405 unreachable, length 1276
What it means: ICMP unreachable is a giant sign: firewall, wrong bind address, or a node not actually listening on that interface.
Decision: If you see unreachable/frag-needed, fix routing/firewall/MTU. If you see silence during flaps, suspect drops earlier in the path.
Task 15: Verify firewall state for the Corosync interfaces
cr0x@server:~$ pve-firewall status
Status: enabled/running
What it means: Enabled firewall is fine, but it means you must confirm rules allow Corosync on the correct interface/VLAN.
Decision: If flaps began after firewall changes, audit datacenter and host rulesets and confirm Corosync's UDP ports (5405 and up with knet; 5405-5412 covers the defaults) are allowed between all ring addresses.
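If you manage rules through the Proxmox firewall, the cluster-wide ruleset needs an explicit allow between the ring subnets. A hedged sketch of what that might look like in /etc/pve/firewall/cluster.fw; the subnets and the 5405:5412 knet port range are assumptions, so match them to your actual config:

```
[RULES]

# Corosync/knet between cluster nodes on both ring subnets (illustrative)
IN ACCEPT -source 192.168.50.0/24 -p udp -dport 5405:5412
IN ACCEPT -source 192.168.51.0/24 -p udp -dport 5405:5412
```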
Three corporate mini-stories from real life (anonymized)
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company migrated from a “single beefy host” mindset to a three-node Proxmox cluster. They did the sensible thing—dedicated a VLAN for Corosync—
and the less sensible thing—put that VLAN on the same top-of-rack switches as a busy storage network without thinking about congestion patterns.
The wrong assumption was simple: “Cluster traffic is tiny, so it can share anything.” Most of the time it did. The graphs showed low average bandwidth.
Everyone nodded, and the change request closed itself emotionally.
Then came weekly backups and an internal data export job that blasted the storage VLAN. Switch buffers filled, microbursts formed, and the Corosync VLAN
took collateral damage. Nodes dropped membership for a few seconds at a time—just long enough to trigger HA fencing logic and upset a few VMs.
The fix wasn’t a Corosync timeout tweak. They moved Corosync to a physically separate switch pair and NICs, and they stopped running backup traffic
through the same uplinks as cluster comms. The “tiny traffic can share anything” theory died quietly, as it should.
Mini-story 2: The optimization that backfired
Another org decided to “optimize latency” by enabling jumbo frames everywhere. The network team set MTU 9000 on the switches. The virtualization team set MTU 9000
on Linux bridges. They celebrated early, which is a reliable way to summon reality.
The backfire: one intermediate device—an old firewall doing VLAN routing for a management subnet—silently enforced 1500 MTU. PMTUD was inconsistent
due to filtering rules, so large packets were dropped without helpful ICMP. Pings worked. SSH worked. Everything worked except the thing that needed
consistent UDP delivery under load: Corosync.
The cluster started “randomly” flapping a few times a day. The team tried increasing token timeouts. Flaps became less frequent, but recovery got slower,
and failover decisions became sluggish. They traded correctness for numbness.
The fix was boring and surgical: enforce consistent MTU end-to-end (or standardize on 1500 for Corosync), and stop blocking PMTUD-related ICMP on internal links.
The cluster went quiet. The “optimization” had been a distributed MTU mismatch generator.
Mini-story 3: The boring but correct practice that saved the day
A financial services team ran Proxmox with Ceph and treated cluster comms like production control traffic: dedicated NICs, dedicated switches,
and a written checklist for any network change. They also had a habit of doing one unfashionable thing: capturing a short tcpdump before and after
every change that touched the cluster network.
One afternoon, link flaps started immediately after a routine switch firmware update. The network team insisted no config changed. That’s usually true—
right up until it isn’t. The virtualization team pulled their “boring practice” evidence: before/after captures showed periodic bursts of UDP loss every 30 seconds.
They correlated those bursts with an IGMP snooping behavior change on the switch (the update tweaked defaults). Corosync was running in a mode that relied on
assumptions about multicast handling. With the evidence in hand, the network team adjusted the snooping settings for that VLAN and validated with counters.
The cluster stabilized within minutes. The tcpdump habit didn’t make them cool in meetings, but it made them right when it mattered.
Stabilize by design: network, rings, quorum, and time
Build a cluster network like you mean it
Corosync is latency-sensitive, loss-sensitive, and jitter-sensitive. Treat the Corosync network as control-plane traffic. That usually means:
dedicated VLAN at minimum, dedicated physical NICs ideally, and a dedicated switch pair if you can afford it.
If you must share, enforce QoS and avoid known noisy neighbors (backups, replication, storage rebuild traffic). Don’t bet your cluster membership on the idea that
your top-of-rack switch will buffer everything forever.
Use two rings, but make them different failure domains
Two rings are good. Two rings that share a single switch are not two rings. If ring0 and ring1 terminate on the same switch stack, the same bonded pair,
or the same upstream router, you’ve built “redundancy theater.”
Ideal: ring0 on one NIC and switch pair, ring1 on another NIC and another switch pair. Acceptable: ring1 shares switch pair but uses different NICs and different
cabling paths, if you’ve validated that the switch pair itself is highly available and not saturated.
Be conservative about MTU
Jumbo frames are not evil. Mixed jumbo frames are. If you can’t prove MTU consistency across NICs, bonds, bridges, switch ports, VLAN trunks, and any intermediate
L3 devices, stick to 1500 for Corosync. You’ll sleep more.
Don’t mask problems with token timeout tuning
Corosync has knobs: token timeouts, retransmits, and consensus behavior. They exist because different environments exist. But tuning is not a substitute for fixing loss.
Longer timeouts can reduce sensitivity to transient jitter, but they also increase failure detection time. That means slower fencing, slower failover, and longer periods
where the cluster disagrees.
Here’s your trade-off: short timeouts mean fast failure detection but more sensitivity; long timeouts mean fewer false positives but slower reaction.
Tune only after you measure baseline latency and loss and after you’ve fixed obvious transport defects.
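If measurement genuinely justifies tuning, the knobs live in the totem section of corosync.conf. A sketch with illustrative values only; Proxmox ships sane defaults, so don't copy these numbers blindly:

```
totem {
  version: 2
  # Time (ms) to wait for the token before suspecting a failure.
  # Larger = fewer false positives, slower failure detection.
  token: 3000
  # Retransmit attempts before a processor is declared lost.
  token_retransmits_before_loss_const: 10
}
```

Document the before/after values and the measured jitter that justified the change, so the next person doesn't mistake your workaround for a default.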
Quorum: stop pretending two nodes are three
If you run two nodes, add a qdevice (qnetd) on a third independent host. It can be small. It can be a VM in a different failure domain. It should not sit on the same
physical host you’re trying to arbitrate. Yes, people do that. No, it doesn’t count.
If you run three or more nodes, verify expected votes match reality and remove stale nodes from config. Zombie votes are how you end up non-quorate when everything
“looks up.”
Time sync: boring, foundational, and non-negotiable
Chrony with stable upstreams is the usual answer. Avoid large time steps on running cluster nodes unless you understand the side effects. If your logs show time jumps,
your diagnostics become fiction. Fix time first, then everything else gets easier.
One reliability idea worth keeping on a sticky note, paraphrased from John Allspaw: reliability comes from enabling safe, fast learning, not from blaming people when systems surprise us.
Joke #2: Increasing Corosync timeouts to stop flapping is like turning up the radio to fix the check-engine light—it changes the vibe, not the problem.
Common mistakes: symptom → root cause → fix
1) Symptom: random link down during backups
Root cause: backup traffic congests shared switch/uplink; microbursts drop Corosync UDP.
Fix: isolate Corosync to dedicated NIC/VLAN/switch; apply QoS; throttle backup concurrency; confirm with fping loss tests.
2) Symptom: flaps only on one ring (ring1 keeps going faulty)
Root cause: ring1 VLAN mis-tagged, bad cable/optic, MTU mismatch, or wrong IP/subnet.
Fix: validate /etc/pve/corosync.conf, run DF pings on ring1, check ip -s link counters, replace suspect physical components.
3) Symptom: after enabling jumbo frames, cluster becomes unstable
Root cause: partial MTU rollout; PMTUD blocked; fragmentation/drop affecting UDP.
Fix: enforce MTU consistency end-to-end or revert to 1500 for Corosync; allow essential ICMP types internally; validate with DF pings.
4) Symptom: “Quorum lost” in a two-node cluster during brief link blips
Root cause: no external tie-breaker; classic two-node split-brain prevention kicks in.
Fix: deploy qdevice/qnetd; ensure it’s on an independent failure domain; verify with pvecm qdevice status.
5) Symptom: Corosync seems fine, but membership churns under CPU load
Root cause: scheduling latency; softirq backlog; packet processing delayed.
Fix: reduce contention (pin interrupts, adjust RPS/XPS, reduce noisy traffic), avoid saturating host CPU during critical operations.
6) Symptom: one node frequently “starts the flap”
Root cause: that host has NIC driver issues, RX drops, bad firmware, or a noisy neighbor VM saturating the bridge.
Fix: compare ip -s link, softnet_stat, and logs across nodes; update NIC firmware/driver; isolate cluster NIC from guest bridges if possible.
7) Symptom: flaps after firewall changes
Root cause: UDP ports blocked, fragments dropped, or asymmetric allow rules between nodes.
Fix: explicitly allow Corosync UDP between ring addresses; confirm delivery with tcpdump and verify the nodes are actually listening with ss -ulpn.
8) Symptom: flaps coincide with switch maintenance or STP events
Root cause: spanning tree reconvergence, port flaps, LACP renegotiation causing burst loss.
Fix: use portfast/edge where appropriate, validate LACP hashing, avoid L2 reconvergence on the Corosync VLAN, and consider dual-ring over independent switch pairs.
Checklists / step-by-step plan
Step-by-step stabilization plan (do this in order)
- Confirm the symptom is real: capture pvecm status, corosync-cfgtool -s, and the last 10 minutes of Corosync logs on all nodes. You need timestamps and which node declared first.
- Freeze changes: stop “helpful” tuning while you measure. Disable non-essential migrations and heavy maintenance tasks until comms are stable.
- Prove MTU end-to-end: DF ping at the maximum intended payload on each ring between every node pair. Fix mismatches immediately.
- Measure loss/jitter: run fping for a few minutes. Any loss is unacceptable for the control plane.
- Inspect host drops: check ip -s link and /proc/net/softnet_stat. If the host drops, the switch is innocent (for once).
- Validate routing symmetry: ensure Corosync traffic stays on the intended interface and doesn’t hairpin through routers or firewalls.
- Check time stability: confirm chrony/NTP health; ensure no step changes are happening during incidents.
- Fix quorum design: two nodes → add qdevice; three+ → ensure expected votes, remove stale nodes, verify qdevice if used.
- Only then consider tuning: if your environment has unavoidable jitter (long-distance links, encrypted overlays), adjust timeouts carefully and document the trade-offs.
Operational checklist for any network change touching Corosync
- Record current ring status and membership.
- Run DF ping tests for both rings to every node.
- Run a 2–5 minute fping burst and archive results.
- Check ip -s link counters before/after.
- Confirm firewall rules unchanged for ring VLANs.
- After change: verify corosync-cfgtool -s shows no faults and membership is stable for at least 30 minutes.
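The before/after habit is easy to script. A minimal snapshot helper, using the commands from earlier in this article; the interface name eno2 is an assumption from those examples, so substitute your actual ring NICs:

```shell
#!/bin/sh
# Hypothetical snapshot helper: capture ring status, membership, and NIC
# counters into a timestamped file so before/after comparison is a diff,
# not a memory. eno2 is an assumed interface name - adjust for your hosts.
snap="/tmp/corosync-snapshot-$(date +%Y%m%d-%H%M%S).txt"
{
    echo "== snapshot at $(date -u) =="
    for c in "pvecm status" "corosync-cfgtool -s" "ip -s link show dev eno2"; do
        echo "--- $c ---"
        $c 2>&1 || echo "(command unavailable on this host)"
    done
} > "$snap"
echo "snapshot written to $snap"
```

Run it before the change, after the change, and during the 30-minute stability window, then diff the files.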
FAQ
1) Does “link down” mean the NIC link actually went down?
Not necessarily. It means Corosync considers its transport link unhealthy. That can be physical link loss, but more often it’s packet loss/jitter beyond tolerance.
Always verify with ip -s link, switch counters, and logs.
2) Should I just increase the token timeout?
Not first. If you increase timeouts to hide loss, you slow down failure detection and can make HA behavior worse. Fix transport stability first, then tune cautiously
if you have a measured need.
3) Is it okay to run Corosync on the management network?
“Okay” is a business decision. If the management network is quiet, dedicated, and stable, fine. If it’s shared with user traffic, backups, or unpredictable routing,
you’re building a flap generator. Dedicated VLAN at minimum is the sane baseline.
4) Do I need two rings?
If you care about uptime, yes. But only if they’re separate failure domains. Two rings on the same switch stack is only slightly better than one ring, and sometimes worse
if it adds complexity without real isolation.
5) Why does my three-node cluster still lose quorum sometimes?
Usually because one node is intermittently isolated or overloaded, or because expected votes are misconfigured (stale node entry). Check pvecm status and logs.
If one node “goes missing” repeatedly, diagnose that node’s network path and host load.
6) Can Ceph traffic cause Corosync flaps?
Absolutely, if they share NICs/switches/uplinks and Ceph recovery or backfill saturates the path. Ceph can create sustained high bandwidth and bursty patterns.
Corosync doesn’t need much bandwidth, but it needs consistent delivery.
7) Is multicast required for Corosync?
Modern Proxmox clusters commonly use knet/unicast-style behavior depending on configuration. Multicast can work, but it’s frequently mishandled in enterprise networks.
If you rely on multicast, validate IGMP snooping behavior and confirm switches treat the VLAN correctly.
8) What’s the quickest proof that MTU mismatch is the culprit?
A DF ping with a large payload that fails on one path and succeeds on another. If you expect jumbo, test near-9000. If it fails with “message too long” or silently drops,
you’ve found a real issue, not a theory.
9) Can a busy CPU really make Corosync think the network is down?
Yes. If the host can’t process network interrupts or schedule Corosync in time, packets get dropped or delayed. Check softnet drops and correlate flaps with CPU/IO pressure.
10) What if I only have two nodes and can’t add a third for qdevice?
Then accept that you don’t have a resilient cluster quorum model. You can run, but you’ll pick between availability and safety during partitions.
If the business requires HA, find a third failure domain for qnetd—even a small external host—before blaming Corosync for doing its job.
Conclusion: next steps that reduce pager noise
“Corosync link down” is not a configuration puzzle. It’s your cluster telling you the control plane can’t trust the network or the host under real conditions.
Treat it like a production reliability defect.
Next steps that actually move the needle:
- Prove MTU consistency on every ring path with DF pings and fix mismatches.
- Measure loss with fping and eliminate it—especially microbursts from shared uplinks.
- Check host-level drops (softnet, RX drops) and reduce interrupt/CPU contention.
- Make rings real: separate failure domains, not just separate IPs.
- Fix quorum design: two nodes get a qdevice, three nodes verify votes and remove zombies.
- Only then tune Corosync timeouts, and document the trade-off in failure detection time.
The stable end-state is boring: no membership churn, no surprise reconfigurations, and no late-night debates about whether “it’s probably the network.”
When Corosync goes quiet, the rest of your Proxmox stack starts telling the truth again.