Proxmox “pmxcfs is not mounted”: why /etc/pve is empty and how to recover

You log into a Proxmox node and the web UI looks like it forgot everything you ever built. No VMs. No storage definitions. No cluster settings. You open a shell and find the true horror: /etc/pve is empty. Then you run a random command (because that’s what we do under stress) and it says: pmxcfs is not mounted.

This failure feels like “my configs are gone.” Usually they aren’t. Most of the time, you’re staring at a dead or unmounted cluster filesystem (pmxcfs), not a wiped disk. The job is to figure out which of the three usual suspects is holding the knife: pve-cluster, corosync, or quorum. Then recover in a way that doesn’t silently fork your cluster state into a future disaster.

What pmxcfs actually is (and why /etc/pve is special)

/etc/pve on Proxmox is not a normal directory in the way your muscle memory expects. It’s a FUSE-mounted filesystem called pmxcfs (Proxmox cluster filesystem). The process that mounts and serves it is part of pve-cluster. It stores cluster-wide configuration in a small database and exposes it as files.
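
A quick way to see this on a healthy node (paths here are the standard ones on current Proxmox VE, but verify against your version): findmnt reports fuse.pmxcfs as the filesystem type for /etc/pve, and the small database backing it normally sits under /var/lib/pve-cluster. Treat that database as evidence to inspect, never a file to edit by hand.

cr0x@server:~$ findmnt /etc/pve
cr0x@server:~$ ls -l /var/lib/pve-cluster/config.db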

So when you see “pmxcfs is not mounted” and /etc/pve looks empty, it typically means:

  • pmxcfs didn’t mount (service down, crash, permissions, FUSE issue).
  • pmxcfs mounted but is unhealthy (database lock, corruption, or can’t achieve cluster quorum).
  • you’re looking at the mountpoint without the mount (so you see an empty directory on the underlying root filesystem).

And yes: you can make it worse. The wrong “fix” can create split-brain, overwrite good config with stale config, or permanently diverge cluster states. Avoid improvisation.

Dry truth: Proxmox configuration is “just files” and “not just files.” pmxcfs makes it look simple while quietly depending on corosync membership and quorum rules.

Interesting facts and historical context

  • pmxcfs is FUSE-based: it’s not a kernel filesystem; it’s user-space, which means crashes and mount failures look like “my directory disappeared.”
  • /etc/pve is cluster-scoped: even on a single node, Proxmox uses the cluster filesystem abstraction for uniform tooling and UI behavior.
  • Quorum gates writes: in multi-node clusters, pmxcfs can refuse writes without quorum to avoid split-brain config.
  • Corosync came from the HA world: it’s a group communication system historically used in high-availability Linux clusters; Proxmox leverages it for membership and messaging.
  • Proxmox inherited a “files as API” mindset: editing files under /etc/pve is basically calling the configuration API; the UI and CLI both converge there.
  • pmxcfs keeps a local database copy: nodes carry state locally, which is why a node can often recover configs even when the network is on fire.
  • Two-node clusters are historically awkward: quorum math punishes even numbers. Proxmox supports a two-node setup, but you must understand the failure modes.
  • QDevice exists because physics exists: a third quorum participant can be virtual (qnetd/qdevice) so you don’t have to buy a third full server.

One quote still applies in operations, especially when the UI is blank and your pulse is not: “Hope is not a strategy,” a line usually attributed to Gordon R. Sullivan.

Symptoms that look like data loss (but usually aren’t)

When pmxcfs isn’t mounted, you don’t just lose a directory listing. Proxmox components that expect cluster config files start failing in a cascade:

  • Web UI shows no VMs/CTs, or errors out loading configuration.
  • pvesm status shows storages missing because /etc/pve/storage.cfg is “gone.”
  • VM config files under /etc/pve/qemu-server/*.conf appear missing.
  • Cluster commands like pvecm status fail or show “no quorum.”
  • Backup jobs, replication, and HA management complain because they read cluster state.

Here’s the key diagnostic question: Did we lose the VMs, or did we lose the config view? In most cases, disks/LVM/ZFS volumes are still present; only the configuration layer is offline.

Joke #1: If /etc/pve is empty, it’s not “minimalism,” it’s your cluster filesystem silently taking a sick day.

Fast diagnosis playbook

When production is down, you don’t need a lecture. You need a tight loop: identify whether this is a mount problem, a service problem, or a quorum problem. Do it in this order.

First: is /etc/pve actually mounted?

  • If it’s not mounted: focus on pve-cluster and FUSE/mount issues.
  • If it is mounted but empty/unreadable: focus on pmxcfs health and logs.

Second: is pve-cluster running and healthy?

  • If pve-cluster is down or restarting: check logs for db lock/corruption, permission issues, or FUSE failures.

Third: is corosync running, and do you have quorum?

  • If corosync is down: pmxcfs may not form cluster membership properly.
  • If corosync is up but no quorum: decide whether you’re in a multi-node outage (need to restore quorum) or a single-node environment where you can temporarily force expected votes (with caution).

Fourth: are the VM disks still there?

  • Confirm ZFS datasets, LVM volumes, or directory storage contents exist. If they do, you’re likely dealing with config availability, not data loss.

Fifth: decide the recovery track

  • Cluster track: restore corosync connectivity/quorum; avoid any “rebuild” steps until quorum is back.
  • Standalone track: fix local services; consider restoring /etc/pve from backups if pmxcfs db is damaged.

Practical tasks: commands, expected outputs, and decisions

Below are real tasks with commands, what the output means, and the decision you make next. Run them as root or via sudo. Keep notes. Under stress, you will forget what you already checked.

Task 1: Confirm whether pmxcfs is mounted

cr0x@server:~$ mount | grep -E '/etc/pve|pmxcfs' || true
pmxcfs on /etc/pve type fuse.pmxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)

Meaning: If you see pmxcfs on /etc/pve, the mount exists. If you see nothing, it’s not mounted.

Decision: If not mounted, jump to service-level checks (Task 3). If mounted but empty, check quorum and pmxcfs logs (Tasks 4–7).

Task 2: Check if /etc/pve is empty because it’s not mounted

cr0x@server:~$ ls -la /etc/pve
total 0
drwxr-xr-x  2 root root  40 Dec 25 11:03 .
drwxr-xr-x 98 root root  80 Dec 25 10:58 ..

Meaning: An “empty” /etc/pve with only . and .. is classic for an unmounted mountpoint directory on the root filesystem.

Decision: Treat this as a mount/service issue, not missing configs. Do not start recreating VM configs from memory yet.

Task 3: Check pve-cluster service status

cr0x@server:~$ systemctl status pve-cluster --no-pager
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
     Active: failed (Result: exit-code) since Thu 2025-12-25 10:59:33 UTC; 2min ago
    Process: 2210 ExecStart=/usr/bin/pmxcfs (code=exited, status=1/FAILURE)
     Main PID: 2210 (code=exited, status=1/FAILURE)

Meaning: If it’s failed, pmxcfs isn’t serving /etc/pve. If it’s active, move on to corosync/quorum.

Decision: Failed service means you read logs and fix the root cause before rebooting blindly.

Task 4: Check corosync service status

cr0x@server:~$ systemctl status corosync --no-pager
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
     Active: active (running) since Thu 2025-12-25 10:55:02 UTC; 6min ago
       Docs: man:corosync
   Main PID: 1020 (corosync)
      Tasks: 9 (limit: 15400)
     Memory: 23.4M

Meaning: corosync is your cluster membership transport. If it’s down, quorum and pmxcfs behavior will be off.

Decision: If corosync is inactive/failed, fix that first; don’t go hunting through UI daemons, certificates, or pmxcfs while the membership layer itself is down.

Task 5: Check quorum quickly with pvecm

cr0x@server:~$ pvecm status
Cluster information
-------------------
Name:             prod-cluster
Config Version:   42
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Dec 25 11:01:12 2025
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.23
Quorate:          No

Meaning: “Quorate: No” is often the reason pmxcfs refuses to behave in a cluster, especially for writes. Reads may also degrade depending on circumstances.

Decision: If you’re in a multi-node cluster and it’s not quorate, focus on restoring node connectivity or qdevice. Avoid “rebuild pmxcfs” steps until quorum is back.

Task 6: Check corosync membership from this node

cr0x@server:~$ corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
	addr	= 10.10.10.11
	status:
		nodeid:          1:	connected
		nodeid:          2:	disconnected
		nodeid:          3:	disconnected

Meaning: This node can’t see peers. This is commonly a network or firewall issue, or peers are down.

Decision: If peers should be up, check network paths and MTU. If peers are down, decide whether to bring them back or (carefully) use a quorum workaround.

Task 7: Read pve-cluster logs for the real reason

cr0x@server:~$ journalctl -u pve-cluster -b --no-pager -n 80
Dec 25 10:59:33 server pmxcfs[2210]: [main] notice: starting pmxcfs
Dec 25 10:59:33 server pmxcfs[2210]: [main] crit: Unable to acquire pmxcfs lock: Resource temporarily unavailable
Dec 25 10:59:33 server systemd[1]: pve-cluster.service: Main process exited, code=exited, status=1/FAILURE
Dec 25 10:59:33 server systemd[1]: pve-cluster.service: Failed with result 'exit-code'.

Meaning: Lock contention can happen after a crash or if a stale process is hanging around. This is actionable.

Decision: Look for stale pmxcfs processes or leftover lock state; don’t rm -rf random directories as therapy.

Task 8: Check for stale pmxcfs processes

cr0x@server:~$ ps aux | grep -E 'pmxcfs|pve-cluster' | grep -v grep
root       1987  0.0  0.1  45620  9400 ?        Ss   10:58   0:00 /usr/bin/pmxcfs

Meaning: If pmxcfs is already running, starting it again can fail with a lock error. If multiple pmxcfs exist, you’ve got a mess.

Decision: If there’s an orphan pmxcfs, stop the service cleanly and ensure it exits before restarting.

Task 9: Restart services in a controlled order

cr0x@server:~$ systemctl restart corosync
cr0x@server:~$ systemctl restart pve-cluster
cr0x@server:~$ systemctl restart pvedaemon pveproxy

Meaning: corosync first (membership), then pmxcfs, then the daemons that serve UI/API.

Decision: If pve-cluster still fails, go back to logs. If it starts but quorum is still absent, don’t start editing configs.

Task 10: Confirm /etc/pve mount and contents after restart

cr0x@server:~$ mount | grep /etc/pve
pmxcfs on /etc/pve type fuse.pmxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
cr0x@server:~$ find /etc/pve -maxdepth 2 -type f | head
/etc/pve/corosync.conf
/etc/pve/storage.cfg
/etc/pve/user.cfg
/etc/pve/domains.cfg
/etc/pve/qemu-server/101.conf

Meaning: If you see these files, pmxcfs is mounted and serving data again.

Decision: Now you can move on to fixing the cluster state or VM runtime issues. If still empty, stop and reassess.

Task 11: Validate cluster filesystem health and file operations

cr0x@server:~$ pvecm status | grep -E 'Quorate|Nodes|Name'
Name:             prod-cluster
Nodes:            3
Quorate:          Yes
cr0x@server:~$ test -r /etc/pve/storage.cfg && echo "read ok"
read ok

Meaning: Quorum “Yes” plus readable config means the cluster filesystem is in a stable state.

Decision: If quorate is “No”, avoid config changes. If “Yes”, proceed with normal operations and cleanup.

Task 12: Confirm VM disks still exist (ZFS example)

cr0x@server:~$ zfs list -o name,used,avail,mountpoint | grep -E 'rpool/data|vm-'
rpool/data                 128G  1.72T  /rpool/data
rpool/data/vm-101-disk-0   64G   1.72T  -
rpool/data/vm-102-disk-0   64G   1.72T  -

Meaning: Disks exist. You likely “lost” only the config layer, not the VM data.

Decision: If disks are present, focus on recovering configs; do not reinitialize storage.

Task 13: Confirm VM disks still exist (LVM-thin example)

cr0x@server:~$ lvs -a -o lv_name,vg_name,lv_attr,lv_size,data_percent,metadata_percent
LV              VG   Attr       LSize   Data%  Meta%
data            pve  twi-aotz--  <1.80t  42.11  3.02
vm-101-disk-0    pve  Vwi-a-tz--  64.00g
vm-102-disk-0    pve  Vwi-a-tz--  64.00g

Meaning: Your VM volumes are intact.

Decision: Don’t touch the thin pool. Your urgency is config + cluster health.

Task 14: Look for backup copies of /etc/pve

cr0x@server:~$ ls -la /var/lib/pve-cluster/
total 20
drwxr-xr-x  5 root root 4096 Dec 25 10:59 .
drwxr-xr-x 50 root root 4096 Dec 25 10:58 ..
drwxr-xr-x  2 root root 4096 Dec 25 10:58 backup
drwxr-xr-x  2 root root 4096 Dec 25 10:59 config
drwx------  2 root root 4096 Dec 25 10:59 db
cr0x@server:~$ ls -la /var/lib/pve-cluster/backup | head
total 64
-rw------- 1 root root 65536 Dec 24 02:05 config-2025-12-24_02_05_01.tar

Meaning: Proxmox often keeps cluster config backups locally. That’s your “boring but effective” parachute.

Decision: If pmxcfs db is corrupted, restoring from these backups is typically safer than hand-rebuilding configs.

Task 15: Confirm basic network path for corosync (knet)

cr0x@server:~$ ip -br addr show
lo               UNKNOWN        127.0.0.1/8 ::1/128
eno1             UP             10.10.10.11/24
eno2             DOWN
cr0x@server:~$ ping -c 2 10.10.10.12
PING 10.10.10.12 (10.10.10.12) 56(84) bytes of data.
64 bytes from 10.10.10.12: icmp_seq=1 ttl=64 time=0.412 ms
64 bytes from 10.10.10.12: icmp_seq=2 ttl=64 time=0.398 ms

--- 10.10.10.12 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms

Meaning: If basic connectivity to peers is broken, corosync won’t form membership. If ping works but corosync doesn’t, think firewall, MTU, or port filtering.

Decision: Fix networking before touching cluster config. Corosync problems are rarely solved by editing random files during an outage.
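
If plain ping succeeds but corosync still flaps, test with full-size non-fragmenting packets and confirm corosync is actually listening. A minimal sketch, assuming a 1500-byte MTU on the ring (1472 bytes of ICMP payload plus 28 bytes of headers; adjust if you run jumbo frames):

cr0x@server:~$ ping -M do -s 1472 -c 2 10.10.10.12
cr0x@server:~$ ss -ulpn | grep corosync

If the large ping fails or complains the message is too long while small pings succeed, suspect an MTU mismatch on the path. With knet, corosync listens on UDP (port 5405 by default); if ss shows nothing, the daemon isn’t bound where you think it is.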

Task 16: Check for filesystem or disk issues that can crash pmxcfs

cr0x@server:~$ dmesg -T | tail -n 20
[Thu Dec 25 10:57:10 2025] EXT4-fs error (device sda2): ext4_journal_check_start:83: Detected aborted journal
[Thu Dec 25 10:57:10 2025] EXT4-fs (sda2): Remounting filesystem read-only

Meaning: If the root filesystem remounted read-only, pmxcfs may fail to update its db/lock state and services will fall apart.

Decision: Stop treating this as “Proxmox is weird.” This is a storage problem. Fix disk, remount, and reboot if needed.

Recovery paths (single node vs cluster)

Path A: Single node (or “I don’t care about cluster membership, I need my configs back”)

On a true single-node Proxmox system (or a node that you have intentionally de-clustered), you mainly care about getting pve-cluster to mount pmxcfs and present /etc/pve. Quorum is usually not the gatekeeper in the same way it is for multi-node clusters.

Typical causes on a single node:

  • pve-cluster failed to start due to lock contention or stale processes.
  • root filesystem is read-only after a disk hiccup.
  • pmxcfs database corruption after a crash or power loss.

What I do first: fix the root FS and services. Only if logs point to a broken pmxcfs DB do I restore from a local backup tar.

When pmxcfs won’t mount due to lock/stale process

If journalctl mentions an inability to acquire the pmxcfs lock, identify the stale process, stop services, and restart cleanly. Don’t “kill -9” unless you’ve tried a normal stop and it’s stuck.

cr0x@server:~$ systemctl stop pve-cluster
cr0x@server:~$ pkill -TERM pmxcfs || true
cr0x@server:~$ sleep 2
cr0x@server:~$ ps aux | grep pmxcfs | grep -v grep || echo "no pmxcfs running"
no pmxcfs running
cr0x@server:~$ systemctl start pve-cluster
cr0x@server:~$ mount | grep /etc/pve
pmxcfs on /etc/pve type fuse.pmxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)

Decision point: If that fixes it, you’re done. If it fails again, you’re hunting a deeper issue: disk read-only, missing FUSE support, or db corruption.
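
For the “missing FUSE support” branch, a minimal read-only sanity check is to confirm the fuse device node exists and the module is loaded (or built into the kernel):

cr0x@server:~$ test -e /dev/fuse && echo "fuse device present" || echo "no /dev/fuse"
cr0x@server:~$ lsmod | grep -w fuse || echo "fuse not listed (possibly built-in)"

If /dev/fuse is genuinely missing on a Proxmox node, the problem is at the kernel or package level; investigate that before blaming pmxcfs.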

When the root filesystem went read-only

This is a classic “everything is failing, but the root cause is one line in dmesg.” If ext4 remounted read-only, you cannot trust pmxcfs and corosync to behave.

Fix the underlying disk problem. That may mean checking SMART, scheduling fsck, and rebooting. If this is a virtualized Proxmox node (yes, people do that), fix the underlying hypervisor storage too.
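
A minimal evidence-gathering pass, assuming the root device is /dev/sda (adjust for your hardware) and that smartmontools is installed:

cr0x@server:~$ findmnt -no OPTIONS /
cr0x@server:~$ smartctl -H /dev/sda
cr0x@server:~$ journalctl -k -b --no-pager | grep -iE 'ext4|i/o error|remount' | tail -n 20

If the mount options include ro, plan an fsck from a rescue environment or at the next boot instead of trying to coax pmxcfs back to life on a read-only root.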

When pmxcfs DB is damaged and you need config restore

If you have local backups under /var/lib/pve-cluster/backup, restoring is usually safer than rebuilding storage.cfg, users, permissions, and VM configs by hand.

Be strict: do this only when you are confident the local node’s config is the source of truth, not a stale copy compared to other nodes.

One practical method is to extract the backup tar somewhere safe and compare what you’re about to restore. Then proceed with the least invasive restore method available for your situation.

cr0x@server:~$ mkdir -p /root/pve-restore
cr0x@server:~$ tar -tf /var/lib/pve-cluster/backup/config-2025-12-24_02_05_01.tar | head
etc/pve/corosync.conf
etc/pve/storage.cfg
etc/pve/user.cfg
etc/pve/qemu-server/101.conf
etc/pve/lxc/103.conf

Decision point: If the tar contains the config you expect, you can restore selected files once pmxcfs is mounted. If pmxcfs won’t mount at all, you must repair pmxcfs first; dumping files into an unmounted /etc/pve just writes to the empty directory on root FS and solves nothing.
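
A minimal comparison sketch, assuming the backup tar from Task 14 and a mounted, healthy /etc/pve (if pmxcfs is not mounted, repair that first; diffing against the live tree is meaningless otherwise):

cr0x@server:~$ tar -xf /var/lib/pve-cluster/backup/config-2025-12-24_02_05_01.tar -C /root/pve-restore
cr0x@server:~$ diff -u /root/pve-restore/etc/pve/storage.cfg /etc/pve/storage.cfg
cr0x@server:~$ diff -ru /root/pve-restore/etc/pve/qemu-server /etc/pve/qemu-server

No diff output means the backup matches what is live; any differences are exactly the lines you need to reason about before restoring anything.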

Path B: Clustered node (this is where people lose weekends)

In a cluster, pmxcfs is intertwined with corosync membership and quorum rules. The right recovery depends on whether:

  • This node is isolated but others are fine.
  • The whole cluster is partially down (lost nodes, lost network, lost qdevice).
  • You actually have split-brain risk (two halves think they’re primary).

Rule I enforce: If you can restore quorum by bringing nodes/network back, do that. Don’t “force” quorum as your first move. Forced quorum is like bypassing a fuse with a nail. It works. It also teaches you new kinds of failure.

Restore corosync connectivity before touching /etc/pve

Corosync is sensitive to:

  • Network partitions and firewall changes
  • MTU mismatches (especially if you moved to jumbo frames “for performance”)
  • Wrong interface binding after NIC renaming or hardware swap
  • Broken hostnames or IP changes (especially if you rebuilt a node from a template)

Get the nodes talking. Then check quorum. Then confirm pmxcfs mount health.
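
Once the nodes can talk again, a short verification pass in that exact order (all read-only, repeating commands from the tasks above) looks like this:

cr0x@server:~$ corosync-cfgtool -s
cr0x@server:~$ pvecm status | grep -E 'Quorate|Nodes'
cr0x@server:~$ mount | grep /etc/pve
cr0x@server:~$ journalctl -u corosync -b --no-pager -n 30

Give membership a few minutes before declaring victory; a flapping link shows up in the corosync journal as repeated link up/down messages.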

Two-node clusters and qdevice: the math doesn’t care about your optimism

Two-node clusters are doable, but they’re booby-trapped if you treat them like three-node clusters. If one node drops, you lose quorum unless you have a qdevice or you adjust expected votes temporarily.

Temporary expected-votes changes can get you out of a hole, but they are emergency maneuvers. They must be reversed when the cluster returns to normal. Otherwise you’ve “fixed” quorum by redefining reality, which is a bold strategy in systems engineering.

Joke #2: Quorum is like a meeting: nothing gets decided until enough people show up, and the one person who did show up is always mad.

When it’s safe(ish) to use an emergency quorum workaround

Use a workaround only if:

  • You are certain the missing nodes are down, not alive and partitioned.
  • You accept the risk that config changes made now might conflict when nodes return.
  • You are trying to restore service to VMs that live on this node and you need pmxcfs writable.

If you’re unsure, don’t. Fix the network. Bring nodes up. Restore normal quorum.
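
If you do take the emergency route on a node you are certain is the sole survivor, the classic maneuver is lowering expected votes so that node becomes quorate again. Treat this as a sketch of the emergency posture, not a recommendation, and write down exactly what you changed:

cr0x@server:~$ pvecm expected 1
cr0x@server:~$ pvecm status | grep -E 'Quorate|Expected'

Revert the change (or restart corosync so it re-reads its normal configuration) once the missing nodes are genuinely back; leaving it in place is how split-brain gets invited to the next outage.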

Three corporate-world mini-stories

Mini-story 1: The incident caused by a wrong assumption

They had a three-node Proxmox cluster and a tidy change window. One node needed a firmware update. The plan: take it out, patch it, bring it back. Standard.

Someone noticed the cluster still looked “healthy” in the UI after the node rebooted, so they proceeded to reboot a second node to update it too. The assumption was simple: “If the UI is up, the cluster is fine.”

The second reboot dropped quorum. pmxcfs on the remaining node didn’t mount properly, and /etc/pve looked empty. The UI went from “healthy-ish” to “looks like a fresh install” in under a minute. Panic followed, naturally.

The wrong move came next: an engineer started recreating storage definitions in the UI. With no quorum, some writes were blocked, some were inconsistent, and a few were made on a node that later turned out not to be the best source of truth.

When the second node came back and quorum returned, the cluster state converged — but not the way anyone wanted. They spent the rest of the night untangling duplicate storages, stale VM configs, and conflicting definitions. Nothing was “lost,” but it sure wasn’t “correct.”

What fixed it: they rolled back the improvised config edits, restored the original config from a known-good backup tar, then reintroduced nodes one at a time while verifying quorum and pmxcfs mount health. The lesson wasn’t “never reboot nodes.” It was “never treat the UI as a quorum oracle.”

Mini-story 2: The optimization that backfired

A team wanted lower latency between nodes for corosync, Ceph traffic, and live migration. They consolidated cluster traffic onto a “faster” VLAN and enabled jumbo frames. It worked in the lab. Of course it did.

In production, one switch path silently didn’t support the MTU end-to-end. Ping worked because small packets don’t care. Corosync sometimes worked, then flapped, then declared peers dead. Nodes would appear and disappear like unreliable coworkers.

On one of the nodes, pmxcfs would intermittently unmount or refuse operations due to unstable membership/quorum. Operations saw the symptom: /etc/pve empty, VMs missing in UI, “pmxcfs is not mounted.” Storage engineers saw the deeper symptom: a cluster membership layer oscillating under load.

The “optimization” created an outage pattern that was worse than a clean failure. It was a slow-motion failure: just stable enough to make people change things, just unstable enough to corrupt their mental model.

The fix was boring: revert MTU to 1500 on the corosync ring, separate corosync traffic from bulk storage traffic, and validate with real packet sizes. The team later reintroduced jumbo frames only where end-to-end support was proven, with monitoring for corosync link stats. Performance returned later. Stability returned immediately.

Mini-story 3: The boring but correct practice that saved the day

A different org had a habit: every node had a nightly job that copied /var/lib/pve-cluster/backup off-node to a storage target that wasn’t part of the cluster. Not fancy. Just consistent.

They also kept a small “disaster notebook”: node inventory, cluster name, corosync ring addresses, and a one-page “how to confirm quorum” checklist. It was written by someone who clearly didn’t trust memory, including their own.

During a power event, one node came up with root filesystem errors and pmxcfs refused to mount. The UI showed nothing. But they didn’t rush to rebuild. They checked mounts, checked corosync, and confirmed that the other nodes still held quorum.

They rebuilt the broken node as a new OS install, then rejoined it to the cluster using the stored cluster information. After that, they restored needed host-specific bits and validated that pmxcfs showed the correct cluster config. The VMs were unaffected because their storage was on shared backends.

What “saved the day” wasn’t a clever trick. It was having a consistent config backup plus a boring runbook. They didn’t “hero” the outage. They followed the process and went home at a reasonable hour, which is the real luxury.

Common mistakes: symptom → root cause → fix

This section is opinionated because the same mistakes repeat across teams. Some are understandable. Most are avoidable.

Mistake 1: “/etc/pve is empty, so the configs were deleted”

Symptom: ls /etc/pve shows nothing; UI shows no VMs.

Root cause: pmxcfs isn’t mounted. You’re viewing an empty mountpoint directory on the underlying filesystem.

Fix: Check mount | grep /etc/pve, then fix pve-cluster and corosync/quorum.

Mistake 2: Recreating configs while quorum is lost

Symptom: Things “come back” but now storages/VMs are duplicated, or permissions are weird.

Root cause: You changed cluster config in a degraded/quorumless state. Later, the cluster reconciled with other nodes and you got a Franken-config.

Fix: Restore quorum first. If you must change config in emergency mode, document every change and plan to reconcile after quorum returns.

Mistake 3: Restarting random services and hoping

Symptom: Sometimes it works, sometimes it doesn’t; pveproxy restarts don’t help.

Root cause: pveproxy/pvedaemon depend on pmxcfs for config. Restarting the UI doesn’t mount the filesystem.

Fix: Start with corosync and pve-cluster. Then restart UI daemons.

Mistake 4: Editing files under /etc/pve when it’s not mounted

Symptom: You “fixed” /etc/pve/storage.cfg but nothing changes in the UI.

Root cause: You wrote to the plain directory on root FS, not to pmxcfs.

Fix: Verify mount first. If pmxcfs isn’t mounted, your edits went to the wrong place. Remove the bogus files after recovery to avoid confusion.
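
One way to check for bogus files hiding under the mountpoint without stopping pmxcfs is a bind mount of the root filesystem, which exposes the underlying directory (the /mnt/rootfs path is just an example):

cr0x@server:~$ mkdir -p /mnt/rootfs
cr0x@server:~$ mount --bind / /mnt/rootfs
cr0x@server:~$ ls -la /mnt/rootfs/etc/pve
cr0x@server:~$ umount /mnt/rootfs

Anything listed there was written while pmxcfs was unmounted; archive or delete it after recovery so it doesn’t mislead the next responder.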

Mistake 5: Treating a two-node cluster like it has quorum redundancy

Symptom: Lose one node, the remaining node becomes non-quorate, pmxcfs behavior degrades.

Root cause: Two-node clusters need qdevice or careful vote management.

Fix: Add qdevice, or accept that losing one node is a quorum loss event and plan operations accordingly.
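
Adding a QDevice is a documented Proxmox procedure; the sketch below assumes a third host at 10.10.10.50 running corosync-qnetd and the corosync-qdevice package installed on both cluster nodes (check the Proxmox documentation for the exact prerequisites on your version):

cr0x@server:~$ pvecm qdevice setup 10.10.10.50

Afterwards, pvecm status should show the qdevice contributing a vote, which is exactly what keeps a two-node cluster quorate when one node drops.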

Mistake 6: “Fixing” by deleting pmxcfs DB without a plan

Symptom: pmxcfs starts, but config is reset or inconsistent; cluster identity changes; nodes don’t agree.

Root cause: You wiped local cluster state without understanding whether that node was authoritative or how it will resync.

Fix: Only do destructive DB actions when you’ve identified the authoritative node(s) and have backups. Prefer restoring from /var/lib/pve-cluster/backup when appropriate.

Mistake 7: Ignoring underlying disk errors

Symptom: pmxcfs/corosync flap, root filesystem read-only, services fail in odd ways.

Root cause: Real storage or filesystem corruption.

Fix: Stop and fix the disk. No amount of service restarts will heal a remounted read-only filesystem.

Checklists / step-by-step plan

Checklist 1: Immediate triage (10 minutes, no heroics)

  1. Check mount: mount | grep /etc/pve.
  2. Check pve-cluster and corosync status.
  3. Check quorum: pvecm status.
  4. Check logs: journalctl -u pve-cluster -b -n 80 and journalctl -u corosync -b -n 80.
  5. Confirm disks exist (ZFS/LVM). This prevents destructive panic.

Checklist 2: If pmxcfs is not mounted

  1. Verify root FS is writable: check dmesg -T | tail for read-only remount events.
  2. Stop pve-cluster cleanly; look for stale pmxcfs processes.
  3. Start corosync (if cluster) then start pve-cluster.
  4. Confirm /etc/pve is mounted and non-empty.
  5. Only after pmxcfs is mounted: restart pveproxy/pvedaemon.

Checklist 3: If corosync/quorum is the problem

  1. Confirm network reachability and correct interface IPs on the corosync ring.
  2. Check membership: corosync-cfgtool -s.
  3. Bring back missing nodes if possible; restore normal membership.
  4. If two-node cluster: ensure qdevice is reachable, or acknowledge you need an emergency expected-votes change (and write down what you change).
  5. After quorum is restored: validate that /etc/pve shows expected files and that nodes agree on cluster state.

Checklist 4: If pmxcfs DB seems corrupted

  1. Stop and capture evidence: logs, service states, and copies of backup tars.
  2. Identify the authoritative node (in a cluster: the one with healthy quorum and expected config).
  3. Use backups from /var/lib/pve-cluster/backup where appropriate; avoid manual rebuild unless you love subtle mistakes.
  4. Validate critical files: corosync.conf, storage.cfg, qemu-server/*.conf, ACL/user configs.
  5. Bring services up, validate UI, then validate actual VM disk mappings before starting workloads.

FAQ

1) Why does /etc/pve become empty instead of showing old files?

Because /etc/pve is a mountpoint. When pmxcfs isn’t mounted, you’re looking at the underlying directory on the root filesystem, which is typically empty.

2) Are my VM disks deleted when /etc/pve is empty?

Usually no. VM disks live on ZFS datasets, LVM volumes, Ceph RBD, or directories elsewhere. Validate with zfs list, lvs, or your storage backend tools before assuming data loss.

3) Can I just reboot to fix “pmxcfs is not mounted”?

Sometimes. But if the cause is disk errors, network partition, or broken corosync config, rebooting is just a time-consuming way to get the same failure with extra downtime.

4) If corosync is down, can pmxcfs still mount?

On a cluster node, pmxcfs behavior is tightly coupled to cluster state. It may mount but be degraded, refuse writes, or behave inconsistently. Treat corosync failures as primary until proven otherwise.

5) What’s the safest first command when I see this error?

mount | grep /etc/pve and systemctl status pve-cluster corosync. You want to know if it’s a mount issue, a service issue, or quorum/membership.

6) Can I edit /etc/pve files directly to recover?

Yes—when pmxcfs is mounted and the cluster is healthy/quorate. No—when it’s not mounted, because you’ll edit the wrong place and create confusion.

7) What if only one node shows /etc/pve empty but others are fine?

That node is likely isolated, misconfigured network-wise, or has a local service/disk issue. Compare corosync membership and logs. Don’t “repair the cluster” from the broken node.

8) How do I prevent this from becoming a bigger incident next time?

Keep config backups off-node, monitor quorum and corosync link state, avoid risky network “optimizations” without end-to-end validation, and rehearse node-loss procedures.
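
One low-effort way to handle the off-node part is a nightly rsync of the local backup directory to a host that is not a cluster member; the destination host and path below are placeholders:

cr0x@server:~$ rsync -a /var/lib/pve-cluster/backup/ backup-host:/srv/pve-config-backups/$(hostname)/

Wire that into cron or a systemd timer, and occasionally verify a copy is actually restorable.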

9) Does “no quorum” always mean /etc/pve will be empty?

No. “No quorum” usually affects the ability to commit changes safely, and can trigger degraded behavior, but an empty /etc/pve strongly suggests pmxcfs isn’t mounted or isn’t running.

10) Should I remove and re-add the node to the cluster?

Only after you confirm whether the node’s local state is corrupt and whether the cluster is stable. Rejoining is sometimes the right answer, but it’s a last step, not a first guess.

Next steps you should actually do

If you got pmxcfs mounted again and /etc/pve is populated, don’t stop there. The “everything looks fine” phase is where latent issues hide.

  1. Confirm quorum and membership stability over a few minutes. If corosync is flapping, you’re not done.
  2. Validate config consistency across nodes: storage definitions, VM configs, and permissions.
  3. Confirm VM disks map correctly before starting critical workloads. A config mismatch can point a VM at the wrong disk.
  4. Export off-node backups of /var/lib/pve-cluster/backup if you don’t already.
  5. Write down the root cause and the exact fix you applied. Future you is a stranger with less sleep.

The goal isn’t to “get the UI back.” The goal is to restore a consistent, quorate, and durable control plane so your compute and storage don’t drift into folklore.
