Proxmox “cannot initialize CMAP service”: Corosync/pmxcfs troubleshooting checklist

The error shows up at the worst time: you rebooted a node (or it rebooted itself), and now the cluster feels “half alive.”
pve-cluster won’t behave, /etc/pve is empty or read-only, and corosync logs keep chanting:
cannot initialize CMAP service.

This is one of those Proxmox failures where storage, networking, and distributed systems all get a vote. If you only listen to one of them,
you’ll fix nothing and learn new swear words. Below is a production-grade checklist: fast triage first, then systematic isolation, then safe recovery.

What “cannot initialize CMAP service” actually means

CMAP is corosync’s internal configuration and runtime key-value database. Think of it as “the cluster’s shared memory interface” for settings and state:
node IDs, ring membership, quorum bits, totem timing, logging, and more. When a process says it cannot initialize CMAP service, it’s failing to
connect to the corosync CMAP API—usually because corosync isn’t running, is wedged, is restarting, or its IPC sockets aren’t accessible.

In Proxmox clusters, this shows up most often in these patterns:

  • Corosync is down → pve-cluster can’t read cluster membership → /etc/pve goes weird.
  • Corosync can’t form a ring (network/MTU/firewall) → no quorum → cluster filesystem may be read-only or not mount.
  • Corosync config mismatch or wrong node name/IP → corosync starts but can’t join peers → CMAP is not usable from dependent daemons.
  • Time drift or extreme scheduling stalls → token timeouts → membership flaps → CMAP consumers fail intermittently.

The practical translation: treat CMAP errors as “corosync is not healthy enough to provide cluster runtime state.” Fix corosync health first, then pmxcfs,
and only then mess with higher-level Proxmox services.
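
If you prefer to run that decision as one paste-able check, here is a minimal read-only sketch (plain bash, using only commands that appear throughout this article) that reports each layer in the order you should fix it. It reports; it does not remediate.

# read-only triage, lowest layer first (sketch)
echo "== corosync service =="
systemctl is-active corosync

echo "== quorum =="
pvecm status 2>/dev/null | grep -E 'Quorate|Nodes:' || echo "pvecm status failed (corosync not reachable?)"

echo "== pmxcfs mount =="
mount | grep -q 'on /etc/pve type fuse' && echo "mounted" || echo "NOT mounted"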

Fast diagnosis playbook (first/second/third)

First: decide if you have a cluster problem or a single-node problem

The fastest win is knowing whether you’re debugging a local daemon crash or a split brain / quorum situation.
Run these three commands on the broken node and one healthy node (if you have one).

cr0x@server:~$ systemctl is-active corosync pve-cluster
inactive
active

Meaning: If corosync is inactive/failed, CMAP errors are a symptom, not the disease.
If corosync is active but consumers still complain, you’re looking for ring/quorum/config problems.
Decision: Fix corosync first. Don’t restart pvedaemon, pveproxy, or random services “to see what happens.”

cr0x@server:~$ pvecm status
Cluster information
-------------------
Name:             prod-cluster
Config Version:   18
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Dec 25 12:07:21 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.52
Quorate:          No

Meaning: “Quorate: No” is your headline. Without quorum, pmxcfs may refuse writes and some cluster operations will fail.
Decision: Determine if the missing node(s) are actually down, isolated by network, or misconfigured. Don’t force quorum casually.
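
One quick way to separate “down” from “isolated”: walk the nodelist addresses from the corosync config and test basic reachability from this node. A minimal sketch, assuming the local config copy is readable at its usual path; remember that ICMP reachability is necessary but not sufficient (the next section explains why).

# ping every ring0_addr listed in the corosync config (reachability only; sketch)
CONF=/etc/corosync/corosync.conf
grep 'ring0_addr:' "$CONF" | awk '{print $2}' | while read -r addr; do
  if ping -c 1 -W 1 "$addr" >/dev/null 2>&1; then
    echo "$addr reachable"
  else
    echo "$addr NOT reachable"
  fi
done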

cr0x@server:~$ journalctl -u corosync -b --no-pager | tail -n 40
Dec 25 12:05:12 pve1 corosync[1789]:   [TOTEM ] A processor failed, forming new configuration.
Dec 25 12:05:14 pve1 corosync[1789]:   [KNET  ] link: host: 2 link: 0 is down
Dec 25 12:05:18 pve1 corosync[1789]:   [QUORUM] Members[1]: 1
Dec 25 12:05:18 pve1 corosync[1789]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.

Meaning: This is classic network partition or dead peer. KNET says the link is down.
Decision: Stop thinking about CMAP and start thinking about the corosync network: routing, VLANs, MTU, firewall, link state.

Second: prove the cluster network path is clean (not “works for ping”)

Corosync uses multicast in older setups and unicast (knet) in modern Proxmox, typically over UDP, with strict timing.
A single firewall rule, asymmetric routing, or MTU mismatch can keep ICMP happy while corosync quietly burns.

cr0x@server:~$ ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
eno1             UP             10.10.10.11/24
eno2             UP             172.16.20.11/24

Meaning: Identify the actual interface(s) corosync should use. Many clusters dedicate a NIC/VLAN for corosync—until someone “simplifies” it.
Decision: Compare with /etc/pve/corosync.conf (or local copy if pmxcfs is unhealthy) and verify the ring address matches reality.
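
To catch the “ring address doesn’t exist on this host” mismatch mechanically, compare the addresses the kernel actually has with the ring0_addr entries. A small sketch, assuming the local config copy is readable:

# check that a ring0_addr from the config exists on a local interface (sketch)
CONF=/etc/corosync/corosync.conf
LOCAL_IPS=$(ip -o -4 addr show | awk '{print $4}' | cut -d/ -f1)
grep 'ring0_addr:' "$CONF" | awk '{print $2}' | while read -r addr; do
  if echo "$LOCAL_IPS" | grep -qx "$addr"; then
    echo "$addr is configured on this host (this node's ring address)"
  fi
done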

Third: confirm pmxcfs state and whether you’re stuck read-only

cr0x@server:~$ mount | grep -E '/etc/pve|pmxcfs'
pve-cluster on /etc/pve type fuse.pve-cluster (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)

Meaning: If this mount is missing, stale, or read-only, Proxmox will behave like it has amnesia.
Decision: If corosync is broken, pmxcfs may not mount or may mount but refuse writes. Don’t edit /etc/pve directly if it’s not mounted properly.

Interesting facts and historical context (why this is the way it is)

  • Corosync grew out of the OpenAIS project and its group-communication and cluster-membership lineage; there are decades of distributed-systems scar tissue baked into it.
  • CMAP replaced earlier config approaches so daemons could read and react to cluster state through a consistent API instead of parsing files repeatedly.
  • Totem protocol (used by corosync) was designed to maintain ordered message delivery and membership despite failures—great when healthy, merciless when timing goes sideways.
  • Proxmox chose a FUSE-based cluster filesystem (pmxcfs) to distribute small configuration state quickly, not to be a general-purpose data store.
  • pmxcfs stores “authoritative” cluster config in SQLite/DB files under /var/lib/pve-cluster/; /etc/pve is the mounted view, not the original source directory.
  • Quorum rules come from classic consensus safety: it’s better to refuse writes than to accept conflicting updates you can’t reconcile later.
  • knet transport became the standard path to improve link handling and multi-link capabilities compared to older multicast-focused patterns.
  • Many “corosync issues” are actually clock issues: token-based membership is extremely sensitive to jitter, CPU starvation, and time jumps.

Corosync + CMAP + pmxcfs: the moving parts you’re debugging

Corosync: membership, messaging, quorum

Corosync is the cluster membership engine. It decides who is in, who is out, and how confident it is about that statement.
Its configuration lives in corosync.conf. Its runtime state lives behind CMAP. If corosync can’t run or can’t stabilize membership, everything above it
becomes decorative.

CMAP: runtime state interface

CMAP is where corosync exposes configuration and state to clients. Proxmox components, and corosync’s own tools, consult it.
When you see “cannot initialize CMAP service,” the client failed to connect to corosync’s runtime interface—often via IPC sockets.
That’s why the error can appear even on a single-node “cluster”: after an upgrade, after a permission/SELinux/AppArmor oddity (rare on Proxmox), or after a crash that leaves stale sockets behind.

pmxcfs: the Proxmox cluster filesystem

pmxcfs is a FUSE filesystem mounted at /etc/pve. It’s where Proxmox expects cluster-wide config to live:
storage definitions, firewall config, VM configs, cluster config, user realms. It depends on corosync for membership and quorum decisions.

If corosync is down or you’re not quorate, pmxcfs may:

  • not mount at all, leaving /etc/pve as an empty directory or a stale mount,
  • mount read-only, making writes fail in confusing ways,
  • mount but show only local node state if you’re isolated, depending on how things broke.

Paraphrased idea from Werner Vogels (Amazon CTO): Everything fails, all the time; build systems that expect it and recover quickly.

Joke #1: Corosync is like a meeting invite. If half the attendees can’t connect, nobody gets to edit the agenda.

Practical tasks (commands, expected output, decisions)

These are the tasks I actually run when someone slacks “CMAP service” at 02:00. Each includes what the output means and what decision you make next.
Run them on the broken node first, then compare with a healthy node if available.

Task 1: Check service states and recent failures

cr0x@server:~$ systemctl status corosync --no-pager
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
     Active: failed (Result: exit-code) since Thu 2025-12-25 12:01:03 UTC; 4min ago
       Docs: man:corosync
    Process: 1732 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=1/FAILURE)

Meaning: Corosync is not running; CMAP errors are expected.
Decision: Go straight to logs and config validation. Don’t restart everything; fix the first domino.

Task 2: Read corosync logs with context, not just the last line

cr0x@server:~$ journalctl -u corosync -b --no-pager -n 200
Dec 25 12:00:58 pve1 corosync[1732]:   [CFG   ] Parsing config file '/etc/corosync/corosync.conf'
Dec 25 12:00:58 pve1 corosync[1732]:   [CFG   ] No valid name found for nodeid 1
Dec 25 12:00:58 pve1 corosync[1732]:   [MAIN  ] Corosync Cluster Engine exiting with status 8 at main.c:1835.

Meaning: Config mismatch: nodeid mapping or nodelist wrong.
Decision: Validate nodelist, node names, and ring addresses. Don’t touch quorum settings yet; you have a deterministic config error.

Task 3: Confirm which corosync.conf you’re actually using

In a healthy Proxmox cluster, the canonical config lives at /etc/pve/corosync.conf (the cluster filesystem view), and pmxcfs keeps a local copy in sync at
/etc/corosync/corosync.conf, which is what corosync actually reads; that way corosync can start before pmxcfs is up. When pmxcfs is broken, the two copies can drift apart, or the /etc/pve view can be missing entirely.

cr0x@server:~$ ls -l /etc/corosync/corosync.conf /etc/pve/corosync.conf
-rw-r----- 1 root root     512 Dec 25 11:58 /etc/corosync/corosync.conf
-rw-r----- 1 root www-data 512 Dec 25 11:57 /etc/pve/corosync.conf

Meaning: Both copies exist and the cluster view in /etc/pve is readable. If the sizes or contents differ, the local copy may be stale or a recent edit didn’t propagate.
Decision: If /etc/pve/corosync.conf is missing or unreadable, you need to restore it from the local cluster DB files or backups.
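
If /etc/pve is gone but the node’s disk is healthy, the config content usually still lives in the local pmxcfs database. The sketch below is read-only and assumes the default pmxcfs layout (an SQLite file at /var/lib/pve-cluster/config.db with a tree table holding name/data pairs) and that the sqlite3 CLI is installed; verify both before relying on it.

# read corosync.conf back out of the local pmxcfs database (read-only sketch)
cp -a /var/lib/pve-cluster/config.db /root/config.db.backup    # always work from a copy
sqlite3 /var/lib/pve-cluster/config.db \
  "SELECT data FROM tree WHERE name = 'corosync.conf';" > /root/corosync.conf.recovered
less /root/corosync.conf.recovered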

Task 4: Validate corosync configuration syntax

cr0x@server:~$ corosync -t
Parsing of config file successful

Meaning: Syntax is fine. This does not prove the config is correct; it only proves it parses.
Decision: If parsing fails, fix syntax before anything else. If parsing succeeds, focus on semantics: IPs, node IDs, links, crypto, MTU.

Task 5: Check membership and quorum status from Proxmox tooling

cr0x@server:~$ pvecm status
Cluster information
-------------------
Name:             prod-cluster
Config Version:   18
Transport:        knet

Quorum information
------------------
Nodes:            3
Node ID:          0x00000001
Quorate:          Yes

Meaning: If this returns cleanly and says quorate, CMAP is likely functional, and your error might be from a dependent service starting too early.
Decision: If corosync is healthy, pivot to pmxcfs and service ordering; check pve-cluster logs.
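
“Started too early” is easy to confirm from systemd itself: compare when each unit last entered the active state and read the interleaved startup log. A small sketch using standard systemctl/journalctl options:

# which unit came up first this boot, and when? (sketch)
for unit in corosync pve-cluster; do
  echo -n "$unit: "
  systemctl show -p ActiveEnterTimestamp --value "$unit"
done
# interleaved startup messages for both units
journalctl -b -u corosync -u pve-cluster --no-pager | head -n 40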

Task 6: Query corosync runtime from CMAP tools directly

cr0x@server:~$ corosync-cmapctl | head
runtime.config.totem.token (u32) = 3000
runtime.config.totem.token_retransmits_before_loss_const (u32) = 10
runtime.config.totem.consensus (u32) = 3600
runtime.totem.pg.mrp.srp.members (str) = 1 2 3

Meaning: If this works, CMAP is accessible. If it fails with a CMAP initialization error, corosync isn’t reachable via IPC.
Decision: If CMAP is inaccessible but corosync shows “active,” suspect permission/socket issues, stale runtime directory, or rapid restart loops.
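
A crash loop can look “active” every time you glance at it. Two quick ways to spot flapping across the whole boot, using only systemd tooling (a sketch; the grep counts both start and exit lines, so anything beyond a couple of hits means cycling):

# has systemd been auto-restarting corosync this boot? (sketch)
systemctl show -p NRestarts corosync
# count engine start/exit lines in this boot's journal
journalctl -u corosync -b --no-pager | grep -c 'Corosync Cluster Engine'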

Task 7: Inspect corosync runtime sockets and permissions

cr0x@server:~$ ls -ld /run/corosync /run/corosync/* 2>/dev/null | head
drwxr-xr-x  2 root root  80 Dec 25 12:06 /run/corosync
srwxrwx---  1 root root   0 Dec 25 12:06 /run/corosync/corosync.sock
srwxrwx---  1 root root   0 Dec 25 12:06 /run/corosync/cmap.sock

Meaning: The sockets exist. If sockets are missing, CMAP clients can’t connect. If permissions are too strict, non-root clients may fail.
Decision: Missing sockets → corosync not actually running or stuck early. Weird permissions → check package integrity and local hardening changes.

Task 8: Confirm pmxcfs mount health and error messages

cr0x@server:~$ systemctl status pve-cluster --no-pager
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
     Active: active (running) since Thu 2025-12-25 12:03:11 UTC; 3min ago
   Main PID: 2011 (pmxcfs)

Meaning: pmxcfs is running, but it could still be read-only due to no quorum.
Decision: Check quorum and test write capability safely (create a temp file in a safe location like /etc/pve/nodes/<node>/ if appropriate).

Task 9: Detect read-only cluster filesystem (without guessing)

cr0x@server:~$ touch /etc/pve/.ro-test
touch: cannot touch '/etc/pve/.ro-test': Read-only file system

Meaning: pmxcfs is mounted but refusing writes. That usually maps to no quorum or local protection mode.
Decision: Restore quorum (preferred) or use a controlled temporary quorum override only if you truly have a single surviving node and accept the risk.
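
If you want the write test without leaving stray files, scope it to this node’s own directory (as suggested in the previous task) and clean up immediately. A minimal sketch:

# non-destructive write probe scoped to this node's own directory (sketch)
NODE=$(hostname -s)
PROBE="/etc/pve/nodes/${NODE}/.rw-probe"
if touch "$PROBE" 2>/dev/null; then
  echo "pmxcfs is writable"
  rm -f "$PROBE"
else
  echo "pmxcfs is read-only or not mounted"
fi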

Task 10: Check for time drift and NTP health

cr0x@server:~$ timedatectl status
Local time: Thu 2025-12-25 12:08:30 UTC
Universal time: Thu 2025-12-25 12:08:30 UTC
RTC time: Thu 2025-12-25 12:08:29
System clock synchronized: yes
NTP service: active

Meaning: Good. If clock isn’t synchronized or jumps around, corosync token timing gets spicy fast.
Decision: If unsynchronized, fix NTP/chrony/systemd-timesyncd. Then restart corosync to stabilize membership.

Task 11: Verify network path and MTU (because “ping works” lies)

cr0x@server:~$ ping -M do -s 1472 -c 3 172.16.20.12
PING 172.16.20.12 (172.16.20.12) 1472(1500) bytes of data.
1480 bytes from 172.16.20.12: icmp_seq=1 ttl=64 time=0.352 ms
1480 bytes from 172.16.20.12: icmp_seq=2 ttl=64 time=0.339 ms
1480 bytes from 172.16.20.12: icmp_seq=3 ttl=64 time=0.347 ms

Meaning: PMTU for 1500-byte frames is okay. If this fails but a smaller ping works, you have MTU/fragmentation problems.
Decision: Fix MTU consistency on switches, bonds, VLANs, and NICs. Corosync/knet doesn’t enjoy silent fragmentation black holes.
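
If the full-size probe fails, find out where the path stops passing unfragmented packets. A small sweep sketch (ICMP payload sizes; add 28 bytes of IP+ICMP headers to get the frame size), aimed at the peer corosync address from the earlier examples:

# find the largest unfragmented ICMP payload toward a peer (sketch)
PEER=172.16.20.12    # peer's corosync address from the example above
for size in 1472 1464 1400 1300 1200 1000; do
  if ping -M do -s "$size" -c 1 -W 1 "$PEER" >/dev/null 2>&1; then
    echo "payload $size OK (frame size $((size + 28)))"
    break
  fi
  echo "payload $size blocked or fragmented"
done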

Task 12: Check firewall rules that often break corosync silently

cr0x@server:~$ pve-firewall status
Status: enabled/running
cr0x@server:~$ iptables -S | grep -E 'corosync|5405|5404|knet' | head

Meaning: Proxmox firewall is on. Corosync traffic is UDP; port 5405 is the long-standing default (with 5404 also used in older multicast setups), and knet links can be configured to use other ports.
The empty grep doesn’t prove the traffic is allowed; it only proves no rule mentions those ports, which could mean “allowed by default policy” or “blocked somewhere you’re not looking.”
Decision: If you recently enabled firewalling, explicitly allow cluster traffic on the dedicated corosync network. Also check host firewalls outside Proxmox tooling.
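
For illustration only, an explicit allow for the corosync subnet on the dedicated NIC might look like the sketch below. The interface and subnet come from the earlier examples; the UDP port range is an assumption you must match against your own corosync.conf, and if you manage rules through pve-firewall, add the equivalent rule there so it survives reloads.

# example only: allow cluster UDP traffic from the corosync subnet on the dedicated NIC
# (port range is an assumption; check your corosync.conf)
iptables -A INPUT -i eno2 -p udp -s 172.16.20.0/24 --dport 5404:5412 -j ACCEPT
# confirm the rule is present
iptables -S INPUT | grep '172.16.20.0'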

Task 13: Inspect corosync link status and peer reachability

cr0x@server:~$ corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 172.16.20.11
        status  = ring 0 active with no faults

Meaning: Local ring is active. If it shows faults or down, you have a link-level issue or wrong bindnetaddr/ring addresses.
Decision: If ring is faulty, verify the correct network is used, check VLAN tagging, and confirm the peer IPs match the physical topology.

Task 14: Look for split brain / multiple partitions

cr0x@server:~$ pvecm nodes
Membership information
----------------------
    Nodeid      Votes Name
         1          1 pve1 (local)
         2          1 pve2

Meaning: Only two nodes visible. If you expected three, you might have a node down or a partition.
Decision: If a node is permanently gone, remove it cleanly (after you’re stable). Don’t keep a dead node in config “just in case.”
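
To tell a partition from a dead node, compare what each reachable node believes. A sketch assuming root SSH works between cluster nodes (it normally does in a Proxmox cluster) and using the node names from this example:

# compare membership as seen from each node (sketch; assumes root SSH between nodes)
for host in pve1 pve2 pve3; do
  echo "== membership according to $host =="
  ssh -o ConnectTimeout=3 "root@$host" pvecm nodes || echo "$host unreachable over SSH"
done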

Task 15: Corosync restart (only after you have a hypothesis)

cr0x@server:~$ systemctl restart corosync
cr0x@server:~$ systemctl is-active corosync
active

Meaning: Restart succeeded. Now immediately re-check quorum and CMAP access.
Decision: If it flips between active/failed, you’re in a crash loop—go back to logs and config validation. Don’t keep restarting; you’ll just thrash.

Task 16: Check pve-cluster logs for pmxcfs-specific clues

cr0x@server:~$ journalctl -u pve-cluster -b --no-pager | tail -n 60
Dec 25 12:03:12 pve1 pmxcfs[2011]: [status] notice: cpg_initialize failed: 2
Dec 25 12:03:12 pve1 pmxcfs[2011]: [status] notice: retrying in 1 seconds
Dec 25 12:03:18 pve1 pmxcfs[2011]: [status] notice: resolved node name 'pve1' to '172.16.20.11'

Meaning: pmxcfs can’t initialize its corosync group communication (CPG), typically because corosync isn’t ready or membership isn’t stable.
Decision: Treat this as another “corosync health first” signal. Fix ring/quorum, then pmxcfs stops complaining.

Task 17: Verify node name resolution is stable and matches corosync config

cr0x@server:~$ hostname -f
pve1.prod.example
cr0x@server:~$ getent hosts pve1
172.16.20.11     pve1.prod.example pve1

Meaning: Hostname resolves to the expected cluster IP. If it resolves to the management network one day and the storage network the next,
corosync will have a bad time.
Decision: Pin correct entries in DNS or /etc/hosts for cluster names; avoid “smart” DNS that returns different addresses based on source.
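
What “pinning” looks like in practice, using the addresses from this example (pve3’s address is assumed for illustration); every node should carry the same entries so cluster names always resolve to the ring network:

# /etc/hosts fragment pinning cluster names to the corosync network (example values)
172.16.20.11    pve1.prod.example pve1
172.16.20.12    pve2.prod.example pve2
172.16.20.13    pve3.prod.example pve3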

Task 18: If you must operate without quorum (controlled, temporary)

This is the part where you can make things worse with confidence. Only do this if you truly have a single surviving node and you need access to configs
to recover workloads. As soon as the cluster is healthy again, undo it.

cr0x@server:~$ pvecm expected 1
expected votes set to 1

Meaning: You’ve told the cluster to expect 1 vote, which can restore quorate state on the lone node.
Decision: Use it to get unstuck, not to run indefinitely. If the “missing” node comes back unexpectedly, you’re now in the land of conflicting truths.

Joke #2: Forcing quorum is like filing your own taxes with a flamethrower. It can work, but you’ll smell it later.

Common mistakes: symptoms → root cause → fix

1) Symptom: “cannot initialize CMAP service” plus corosync is “active”

  • Root cause: corosync is flapping (restarting), or IPC sockets under /run/corosync are missing/permissioned wrong.
  • Fix: Check journalctl -u corosync -b for crash loops; verify sockets exist; reinstall/repair corosync packages if needed; fix config semantics.

2) Symptom: /etc/pve is empty, and Proxmox UI shows “node has no valid subscription” style weirdness

  • Root cause: pmxcfs not mounted (FUSE mount missing), often because pve-cluster failed or corosync isn’t available.
  • Fix: Start with corosync health, then restart pve-cluster. Verify the mount with mount | grep /etc/pve.

3) Symptom: pmxcfs mounted but read-only; touching files fails

  • Root cause: no quorum or isolated node in a non-primary partition.
  • Fix: Restore cluster connectivity/quorum. If node loss is permanent, remove the dead node properly and reestablish quorum. Temporary workaround: pvecm expected 1 only when truly isolated and planned.

4) Symptom: Corosync fails with nodeid/name complaints after IP change

  • Root cause: Corosync config still references old ring addresses; hostname resolves to a different IP than the one in nodelist.
  • Fix: Update the nodelist ring0_addr entries (in the authoritative cluster config), align DNS/hosts, and restart corosync on all nodes in a controlled order.

5) Symptom: Everything works until backups or migrations, then corosync flaps

  • Root cause: CPU starvation, IO wait, or network saturation causing token timeouts. Corosync is timing-sensitive; your cluster network can be “fine” until it isn’t.
  • Fix: Move corosync traffic to a quieter network/VLAN; fix MTU; reserve CPU (avoid oversubscription on host); investigate NIC bonding/driver issues.

6) Symptom: After reboot, cluster doesn’t form; logs mention “wrong authkey”

  • Root cause: mismatched authkey between nodes, often from partial restore or incorrect manual copy.
  • Fix: Re-sync the corosync authentication key across nodes from a known-good source, then restart corosync everywhere.

7) Symptom: Single node cluster still shows CMAP failures

  • Root cause: local corosync service failure; not actually a “cluster” problem. Common causes: broken config file, corrupted package state, failed disk, or permissions.
  • Fix: Validate config, ensure runtime directories exist, check disk health and logs, reinstall corosync if necessary.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company had a clean three-node Proxmox cluster. “Clean” here means it worked often enough that nobody felt nervous.
During a routine change window, an engineer updated DNS records to standardize naming: pve1, pve2, pve3 now resolved to the management network,
not the corosync network. The assumption was harmless: “names should point to the primary interface.”

Corosync didn’t agree. After the first node reboot, corosync started, tried to bind and announce on one network, and other nodes expected it on another.
Token timeouts showed up, quorum bounced, and pmxcfs went read-only at the exact moment a VM migration was in flight.
The migration failed in a way that looked like storage corruption—because the UI couldn’t commit config updates reliably.

The fix wasn’t magic. They pinned cluster-specific resolution in /etc/hosts for node names used by corosync, so pve1 always mapped to the ring address.
They also stopped using “generic” hostnames for multi-network clusters; they introduced explicit names for cluster traffic and management traffic.
It was boring. It worked. The biggest learning was psychological: DNS changes are infrastructure changes, not “just cleanup.”

Mini-story 2: The optimization that backfired

Another team had a dedicated corosync VLAN and decided to “optimize” by combining it with the storage VLAN to reduce switch complexity.
Their storage network was 10GbE and generally stable. The assumption: faster network equals happier corosync.
They also enabled jumbo frames on the storage network—because storage likes it.

A month later, they started seeing intermittent “cannot initialize CMAP service” reports during heavy backup nights.
Not continuously; just enough to create confusion. Corosync wasn’t truly down, but it was flapping membership for a few seconds at a time,
which is all it takes for pmxcfs clients to throw errors and for humans to panic.

The culprit was MTU inconsistency across a couple of trunk ports and a bond configuration on one node. ICMP pings worked because they were small.
Storage traffic mostly worked because TCP retransmitted and shrugged. Corosync, using UDP and timing assumptions, suffered.
The “optimization” created a failure mode that didn’t exist when corosync had a small, quiet, well-controlled network.

The remediation was to split corosync traffic back onto a dedicated VLAN with strict MTU=1500 end-to-end, and to rate-limit noisy broadcast domains.
They didn’t need a faster network. They needed a predictable one.

Mini-story 3: The boring but correct practice that saved the day

A finance org ran a four-node Proxmox cluster. No heroics. But they had one practice that looked old-fashioned: after every cluster change
(adding/removing nodes, IP changes, corosync tweaks), they archived /etc/pve and /var/lib/pve-cluster snapshots on each node,
and they kept a short change log with timestamps and “why.”

One morning, a node’s root filesystem developed errors. The node rebooted, corosync failed, then pmxcfs failed, and /etc/pve looked wrong.
The on-call didn’t start guessing. They compared the local archived corosync config with the current one, saw that the current file was truncated,
and immediately suspected local disk corruption rather than “cluster weirdness.”

They rebuilt the node cleanly, rejoined it with the known-good cluster config, and brought services back without making quorum worse.
The correct practice wasn’t glamorous. It reduced decision time when everything looked suspicious.

Checklists / step-by-step plan

Checklist A: Stabilize corosync first (do this before touching pmxcfs)

  1. Confirm service health.

    cr0x@server:~$ systemctl is-active corosync
    active
    

    Decision: If not active, read logs and fix config/network before anything else.

  2. Validate config parses.

    cr0x@server:~$ corosync -t
    Parsing of config file successful
    

    Decision: Parsing failure means you stop and fix syntax, not “try a restart again.”

  3. Verify ring addresses and name resolution.

    cr0x@server:~$ grep -E 'ring0_addr|name|nodeid' -n /etc/pve/corosync.conf | head -n 40
    12:        name: pve1
    13:        nodeid: 1
    14:        ring0_addr: 172.16.20.11
    

    Decision: If ring IP doesn’t exist on the host, fix that mismatch before proceeding.

  4. Check ring status and faults.

    cr0x@server:~$ corosync-cfgtool -s
    Printing ring status.
    Local node ID 1
    RING ID 0
            id      = 172.16.20.11
            status  = ring 0 active with no faults
    

    Decision: Faults point to MTU/firewall/routing/VLAN. Go there, not to Proxmox UI.

  5. Confirm CMAP is queryable.

    cr0x@server:~$ corosync-cmapctl totem.cluster_name
    totem.cluster_name (str) = prod-cluster
    

    Decision: If CMAP query fails, corosync is not healthy. Solve that before pmxcfs.

Checklist B: Restore quorum safely

  1. Check quorum state.

    cr0x@server:~$ pvecm status | grep -E 'Quorate|Nodes|Node ID'
    Nodes:            2
    Node ID:          0x00000001
    Quorate:          No
    

    Decision: If “No,” identify missing voters and whether they’re actually reachable.

  2. Validate peer reachability on the corosync network (not management).

    cr0x@server:~$ ping -c 3 172.16.20.12
    PING 172.16.20.12 (172.16.20.12) 56(84) bytes of data.
    64 bytes from 172.16.20.12: icmp_seq=1 ttl=64 time=0.301 ms
    64 bytes from 172.16.20.12: icmp_seq=2 ttl=64 time=0.287 ms
    64 bytes from 172.16.20.12: icmp_seq=3 ttl=64 time=0.294 ms
    

    Decision: If ping fails, fix network. If ping succeeds, still check MTU and firewall.

  3. Only if a node is permanently dead: plan removal, don’t panic-remove.

    cr0x@server:~$ pvecm nodes
    Membership information
    ----------------------
        Nodeid      Votes Name
             1          1 pve1 (local)
             2          1 pve2
             3          1 pve3
    

    Decision: Confirm which node is dead and why. If it might come back, fix it instead of removing it.

  4. Temporary survival mode (single node only): set expected votes.

    cr0x@server:~$ pvecm expected 1
    expected votes set to 1
    

    Decision: Use this to access configs and recover. Undo it when normal quorum is restored.

Checklist C: Bring pmxcfs back to a sane mounted state

  1. Verify the mount.

    cr0x@server:~$ mount | grep /etc/pve
    pve-cluster on /etc/pve type fuse.pve-cluster (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other)
    

    Decision: No mount means pve-cluster isn’t working; check its logs and corosync health.

  2. Restart pmxcfs only after corosync is stable.

    cr0x@server:~$ systemctl restart pve-cluster
    

    Decision: If it immediately errors with CPG/CMAP issues, you’re still not done with corosync.

  3. Sanity check: list node directories.

    cr0x@server:~$ ls -1 /etc/pve/nodes
    pve1
    pve2
    pve3
    

    Decision: Missing nodes can mean partition, non-quorum view, or config inconsistency. Cross-check with pvecm nodes.

FAQ

1) Is “cannot initialize CMAP service” a Proxmox bug?

Usually no. It’s a corosync health signal. Proxmox surfaces it because Proxmox components rely on corosync runtime state. Fix the underlying corosync
service/ring/quorum problem and the CMAP errors typically disappear.

2) Can I just restart everything: corosync, pve-cluster, pvedaemon, pveproxy?

You can, but it’s like rebooting the smoke detector to fix a kitchen fire. Restarting corosync without understanding why it’s unhappy can make membership
flap more, which makes pmxcfs more stubborn, which makes your UI and API more confusing.

3) Why does /etc/pve act like it’s empty or read-only?

Because it’s a FUSE mount provided by pmxcfs. If pmxcfs isn’t mounted, you’re looking at a plain directory. If you don’t have quorum, pmxcfs may mount but refuse writes.
Always verify the mount and quorum state before editing anything.

4) What’s the safest order to restart services?

Corosync first (but only after config/network sanity). Then pve-cluster. Then the UI/API daemons (pveproxy, pvedaemon) if they’re misbehaving.
Starting at the top wastes time.

5) Do I need multicast for corosync?

Most modern Proxmox clusters use knet transport (unicast). Multicast was common historically but is often blocked or poorly supported in enterprise networks.
Don’t chase multicast unless your config explicitly uses it.

6) Is forcing quorum with pvecm expected 1 safe?

It’s a tool, not a lifestyle. It can be safe in a real single-node survival situation, when you understand that you are overriding safety rules
to regain manageability. It’s unsafe if other nodes might still be alive and writing conflicting config elsewhere.

7) Can storage issues cause CMAP errors?

Indirectly, yes. If the host is in severe IO wait (dying boot disk, overloaded ZFS pool, saturated CEPH OSD on the same node),
corosync can miss timing windows, membership flaps, and CMAP clients fail. Corosync is not impressed by your latency graphs.

8) What if I changed the cluster network IPs?

Expect corosync pain until every node’s corosync.conf nodelist and name resolution agree. Partial changes cause asymmetric membership.
Plan IP changes like a migration: update config consistently, validate MTU, validate firewall, restart in a controlled order.

9) Why does it work for a while after a restart, then fail again?

That’s classic for timing or network jitter: token timeouts, MTU black holes under load, or CPU starvation during backups. It can also indicate a flapping link
(bonding, LACP, switch ports). Look for patterns: “fails under load” is not a coincidence.

10) How do I know if it’s a partition vs a dead node?

Compare pvecm nodes output on multiple nodes. If different nodes see different membership, you have a partition.
If all healthy nodes agree a node is missing and it’s unreachable on the corosync network, it’s likely dead or isolated.

Conclusion: next steps that don’t create a second incident

When Proxmox throws “cannot initialize CMAP service,” don’t treat it as a mysterious Proxmox mood swing. It’s corosync telling you it can’t provide
stable runtime state. Your job is to stabilize the ring, restore quorum, and only then expect pmxcfs and the UI to behave.

Practical next steps:

  • Run the fast diagnosis playbook: service health → quorum → ring network/MTU → pmxcfs mount state.
  • Collect evidence before you restart things: corosync logs, ring status, CMAP query, and time sync status.
  • Fix the boring causes first: IP/name mismatches, MTU inconsistencies, firewall rules, and clock drift.
  • If you must force quorum, do it deliberately, document it, and undo it as soon as you can.
  • After recovery, schedule a hardening pass: dedicated corosync network, consistent name resolution, and config backups you can trust.