The symptom: you click “Apply,” you run qm set, you try to create a VM, and Proxmox snaps back with unable to write /etc/pve/.... Nothing sticks. The GUI lies to your face. The CLI glares at you. Your change window is melting.
The reality: /etc/pve is not a normal directory. If you treat it like one, you’ll chase the wrong problem for hours. The fix might be a disk issue, a cluster/quorum issue, a pmxcfs/FUSE issue, or plain old permissions. The trick is telling which in minutes, not afternoons.
The mental model: what /etc/pve really is
In Proxmox VE, /etc/pve isn’t “just configuration on disk.” It’s a virtual filesystem provided by pmxcfs (Proxmox Cluster File System), mounted via FUSE. It behaves like a shared config database with filesystem semantics: files and directories, atomic updates, permissions, and inotify events. But the truth underneath is closer to “cluster state machine plus replication” than “a folder.”
This one fact explains most of the weirdness:
- If pmxcfs is unhealthy or read-only, writes to /etc/pve fail even if your root filesystem has plenty of free space.
- If the cluster loses quorum, pmxcfs often goes read-only to protect you from split brain. The error looks like a filesystem issue because that’s how you interact with it.
- If permissions are wrong, you can be root on the node and still face a denial depending on how you’re writing and what context you’re in (API token, user, file ownership, or a stale lock pattern).
So when you see “unable to write /etc/pve/*”, don’t immediately do the classic Linux dance of chown -R and chmod -R. That’s how people take a manageable problem and turn it into a recovery project.
Here’s the operational translation:
- Disk problem means a backing store that pmxcfs depends on is out of space or unhealthy (typically local root fs for logs/state, or worse: corruption, read-only remount, I/O errors).
- pmxcfs/quorum problem means the cluster membership/state is not writable; your node may have lost quorum, corosync is unhappy, or pve-cluster is stuck.
- Permissions problem means the actor writing (CLI user, API user/token, service process) can’t write, or a file is locked/stale in a way that surfaces as a write failure.
One operational idea worth keeping taped to your monitor, paraphrased from Werner Vogels (Amazon CTO): you build it, you run it; operations feedback is part of engineering.
The point: don’t just “fix the write,” fix the reason the system decided writing was unsafe.
Fast diagnosis playbook (first/second/third)
First: confirm what /etc/pve is doing right now (seconds)
- Is pmxcfs mounted and writable? Check mount type and read-only status.
- Is the cluster quorate? If not, expect read-only behavior.
- Are services healthy? Specifically pve-cluster and corosync. A quick combined check is sketched below.
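If you want those three answers in one paste, a minimal sketch (commands only, no output shown) looks like this:

findmnt -no TARGET,FSTYPE,OPTIONS /etc/pve   # expect a FUSE mount with rw in the options
pvecm status | grep -i quorate               # expect "Quorate: Yes"
systemctl is-active pve-cluster corosync     # expect "active" twice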
Second: rule out the boring killers (minutes)
- Disk full on / or /var? Disk pressure breaks everything indirectly.
- Filesystem remounted read-only? One I/O hiccup and your writes are doomed.
- Memory pressure? pmxcfs runs in RAM for metadata; extreme pressure can make symptoms weird. A quick sweep for all three is sketched below.
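A quick, non-destructive sweep for those boring killers, assuming a default single-root layout (adjust paths if /var is a separate filesystem):

df -h / /var                                        # block capacity
df -i / /var                                        # inode capacity
dmesg -T | grep -iE 'read-only|i/o error' | tail    # recent remounts or I/O errors
free -m                                             # memory pressure at a glance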
Third: verify the actor and the specific write path (minutes)
- Are you writing via GUI/API token? Permission model differs from “root on SSH.”
- Is there a lock file or stale config? VM config locks and replication tasks can hold things up (see the lock-check sketch after this list).
- Is this node “special”? Time skew, network loss, or single-node cluster mode can flip behavior.
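To check the actor and the most common lock case without touching anything, something like this works (VM ID 101 is a placeholder):

id                              # who am I really?
sudo -l                         # what can I escalate to?
qm config 101 | grep '^lock'    # a "lock:" line means a task currently owns this VM's config
pvesh get /cluster/tasks --output-format json-pretty | head -n 40   # recent and active tasks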
If you’re in a real outage: don’t restart everything blindly. Restarting corosync on the wrong side of a partition is how you discover what split brain tastes like.
Joke #1: pmxcfs is like a group chat: if half the team can’t see messages, nobody gets to edit the pinned post.
Interesting facts & historical context (why it behaves this way)
- Fact 1: /etc/pve is a FUSE mount provided by pmxcfs, not a normal on-disk directory. That’s why “filesystem” errors are really “cluster state” errors in disguise.
- Fact 2: Proxmox uses Corosync for cluster membership and quorum decisions; pmxcfs uses that to decide when it’s safe to accept writes.
- Fact 3: The “read-only on no quorum” behavior is a safety feature against split brain. It’s annoying until it saves you from two nodes making conflicting VM configs.
- Fact 4: pmxcfs keeps much of the cluster config in memory and synchronizes it; losing pmxcfs is not the same as losing your entire OS filesystem.
- Fact 5: Historically, cluster filesystems in virtualization stacks have leaned toward conservative write gating. VMware’s vCenter model and many HA stacks also bias toward “stop writes” when membership is uncertain.
- Fact 6: Proxmox’s design choice—making cluster config look like files—makes automation simple (edit a file, run a command), but it also makes failure modes look deceptively “Unix-y.”
- Fact 7: Many Proxmox write failures are secondary effects: disk full stops log writes, services crash-loop, quorum flaps, and suddenly /etc/pve looks broken even though the real issue is capacity.
- Fact 8: Time drift can destabilize corosync membership on marginal networks. In clusters, time is a dependency like power and cooling—just quieter.
Triage tasks: commands, outputs, and decisions (12+)
These are the commands I actually run when production is on fire. Each task includes what you’re looking for, what the output means, and what decision you make next.
Task 1: confirm /etc/pve is a pmxcfs FUSE mount
cr0x@server:~$ mount | grep "on /etc/pve"
pve on /etc/pve type fuse.pve (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
Meaning: You expect type fuse.pve. If it’s missing, pmxcfs isn’t mounted (pve-cluster down or stuck).
Decision: If missing, jump to service checks (pmxcfs/quorum) and logs. If present but ro, treat as quorum or pmxcfs state issue.
Task 2: test write path without collateral damage
cr0x@server:~$ touch /etc/pve/.diag-write-test
touch: cannot touch '/etc/pve/.diag-write-test': Read-only file system
Meaning: “Read-only file system” is usually pmxcfs gating writes (often quorum). “Permission denied” is actor/ACL. “No such file” hints mount is gone.
Decision: Read-only → check quorum and pve-cluster; permission denied → check user/API context; no such file → pmxcfs not mounted.
Task 3: check quorum state fast
cr0x@server:~$ pvecm status
Cluster information
-------------------
Name: prod-cluster
Config Version: 42
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Thu Dec 25 12:10:03 2025
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1.2
Quorate: No
Meaning: Quorate: No is your smoking gun. pmxcfs may flip to read-only to prevent divergence.
Decision: Fix corosync connectivity/quorum before trying to “fix permissions” or “restart random services.”
Task 4: see who corosync thinks is present
cr0x@server:~$ corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id = 192.0.2.11
status = ring 0 active with no faults
RING ID 1
id = 198.51.100.11
status = ring 1 active with no faults
Meaning: Rings are up locally, but that doesn’t guarantee other nodes are reachable.
Decision: If local rings show faults, fix NIC/VLAN/MTU first. If local looks fine but quorum is lost, check peer reachability and corosync logs.
Task 5: check cluster membership view
cr0x@server:~$ corosync-quorumtool -s
Quorum information
------------------
Date: Thu Dec 25 12:10:08 2025
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 1
Ring ID: 1.2
Quorate: No
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 1
Quorum: 2
Flags: 0
Meaning: Total votes 1 implies you’re isolated. This is not a “disk full” problem. This is a “network/cluster membership” problem.
Decision: Stop trying to make writes happen. Restore connectivity or make a deliberate decision about temporary single-node operation.
Task 6: check pve-cluster service state
cr0x@server:~$ systemctl status pve-cluster --no-pager
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: active (running) since Thu 2025-12-25 12:01:22 UTC; 8min ago
Main PID: 1543 (pmxcfs)
Tasks: 7
Memory: 52.3M
CPU: 9.122s
Meaning: If inactive/failed, /etc/pve either won’t be mounted or will be stale.
Decision: If failed, read logs before restarting. If running but /etc/pve is RO, focus on quorum/state rather than service restarts.
Task 7: read the relevant logs, not all the logs
cr0x@server:~$ journalctl -u pve-cluster -u corosync --since "30 min ago" --no-pager
Dec 25 11:55:02 pve1 pmxcfs[1543]: [status] notice: received write while not quorate - rejecting
Dec 25 11:55:04 pve1 corosync[1321]: [QUORUM] This node is within the non-primary component
Dec 25 11:55:06 pve1 pmxcfs[1543]: [dcdb] notice: leaving CFS service because of lost quorum
Meaning: pmxcfs is explicitly rejecting writes due to quorum loss. That’s the whole story.
Decision: Go fix corosync membership. If this is a planned maintenance partition, use a planned procedure (see checklists).
Task 8: rule out simple disk exhaustion
cr0x@server:~$ df -hT /
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda2 ext4 110G 109G 0 100% /
Meaning: 100% root filesystem is an outage generator. Services fail in creative ways. pmxcfs may still mount, but other components can’t write state/logs.
Decision: Free space immediately (logs, caches), then re-check services and /etc/pve behavior. Don’t start with “cluster is broken” if the node can’t write to /.
Task 9: check inode exhaustion (yes, it still happens)
cr0x@server:~$ df -i /
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda2 7208960 7208801 159 100% /
Meaning: You can have free gigabytes and still be unable to create new files. Some write failures bubble up as “unable to write” and look like permissions.
Decision: Find the inode hog (often small files under /var/lib or /var/log) and clean safely. Then retry writes to /etc/pve.
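Finding the inode hog is usually a counting exercise. A minimal sketch, staying on the root filesystem with -xdev:

# count files per second-level directory; the biggest counts are your suspects
find / -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n | tail -n 15
# then drill into the winner, e.g. /var/log
find /var/log -xdev -type f | wc -l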
Task 10: detect a read-only remount from I/O errors
cr0x@server:~$ dmesg -T | tail -n 20
[Thu Dec 25 12:06:14 2025] EXT4-fs error (device sda2): ext4_journal_check_start:83: Detected aborted journal
[Thu Dec 25 12:06:14 2025] EXT4-fs (sda2): Remounting filesystem read-only
Meaning: The OS filesystem is read-only. You can’t trust any write behavior, including services that manage /etc/pve.
Decision: Stop. Fix the underlying storage/hardware or recover filesystem. A cluster config write error is not the real problem here.
Task 11: confirm /etc/pve mount options (rw vs ro)
cr0x@server:~$ findmnt -no TARGET,FSTYPE,OPTIONS /etc/pve
/etc/pve fuse.pve rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
Meaning: If it shows ro, it’s a gating decision (quorum or pmxcfs state) or a mount failure mode.
Decision: ro + quorate=no → fix quorum. ro + quorate=yes → dig into pmxcfs health and logs.
Task 12: validate node sees peers (network sanity check)
cr0x@server:~$ ping -c 2 192.0.2.12
PING 192.0.2.12 (192.0.2.12) 56(84) bytes of data.
64 bytes from 192.0.2.12: icmp_seq=1 ttl=64 time=0.332 ms
64 bytes from 192.0.2.12: icmp_seq=2 ttl=64 time=0.301 ms
--- 192.0.2.12 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
Meaning: Ping working doesn’t mean corosync works, but ping failing usually means you’ve found the reason for lost votes.
Decision: If ping fails, fix L2/L3 first. If ping works, investigate MTU, multicast/UDPU, firewall rules, or knet transport specifics.
Task 13: check for lock files impacting config writes
cr0x@server:~$ ls -la /etc/pve/nodes/pve1/qemu-server/ | head
total 8
drwxr-xr-x 2 root www-data 4096 Dec 25 11:50 .
drwxr-xr-x 4 root www-data 4096 Dec 25 11:40 ..
-rw-r----- 1 root www-data 612 Dec 25 11:50 101.conf
Meaning: You’re checking ownership and perms on a per-node config. Normal is often root:www-data with restrictive mode.
Decision: If you see weird ownership (user:user) or world-writable bits, don’t “fix” it blindly—figure out what wrote it that way. If files are missing, it can indicate a mount or sync issue.
Task 14: reproduce the failure with a controlled write via the API layer
cr0x@server:~$ pvesh set /nodes/pve1/config -description "diag change"
unable to write file '/etc/pve/nodes/pve1/config': Permission denied
Meaning: If the error is “Permission denied” via pvesh, you may be running under a non-root shell or using an API token with insufficient privileges.
Decision: Confirm your identity and role permissions. If you’re root and still get denial, look for ACL/ownership damage or a pmxcfs permission state issue.
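If the denial follows an API token or a non-root user, inspect the Proxmox permission model rather than the filesystem. A minimal sketch (the user automation@pve is a placeholder):

pveum user list                          # which users exist and are enabled
pveum acl list                           # which roles are granted on which paths
pveum user token list automation@pve     # tokens owned by the automation user (placeholder name)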
Task 15: confirm who you are (it matters more than you think)
cr0x@server:~$ id
uid=1001(ops) gid=1001(ops) groups=1001(ops),27(sudo)
Meaning: “sudo group” is not “root.” If your tool didn’t use sudo, you’re not root. Proxmox often expects root-level writes to cluster config.
Decision: Use sudo -i for administrative actions or fix your automation to run with the correct privileges.
Task 16: check whether your root filesystem is healthy enough to trust
cr0x@server:~$ smartctl -H /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.0] (local build)
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Meaning: If the disk is failing, all bets are off. You might be chasing “permission errors” caused by I/O failure.
Decision: Prioritize data safety: evacuate workloads if possible, replace disk, then fix software. Reliability beats cleverness.
Joke #2: Permissions bugs are the workplace equivalent of a keycard that stops working right when you’re carrying coffee.
Three big buckets: disk, pmxcfs/quorum, permissions
Bucket A: disk and OS filesystem problems
Disk issues show up in two ways: capacity (full blocks/inodes) and integrity (I/O errors causing read-only remounts). Both can produce “unable to write /etc/pve/*” because Proxmox components need local disk to function even if /etc/pve itself is virtual.
Common disk-driven chains:
- / full → services misbehave → cluster services flap → pmxcfs becomes unstable. This happens more than people admit.
- Filesystem remounts read-only → pve-cluster may keep running but can’t persist state/logs. You see write failures, but the fix is storage, not Proxmox.
- Inodes exhausted → tiny-file storms (container logs, backups, monitoring spools) → config writes fail indirectly.
What not to do: don’t “solve” full disk by deleting random files under /var/lib/pve-cluster or anything that smells like cluster state. Clean logs, rotate, prune backups, remove abandoned ISO images, then reassess.
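When the root filesystem is simply full, the safe levers are logs, package caches, and orphaned images, not cluster state. A minimal cleanup sketch; verify before deleting anything:

du -xh --max-depth=2 / 2>/dev/null | sort -h | tail -n 20   # where the space actually went
journalctl --vacuum-size=200M                               # shrink the systemd journal
apt-get clean                                               # drop cached .deb files
ls -lh /var/lib/vz/template/iso/                            # forgotten ISOs are a classic offender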
Bucket B: pmxcfs and quorum (the classic)
If your node isn’t quorate, Proxmox is often doing you a favor by refusing writes. The cluster config must stay consistent. If multiple nodes could write divergent configs during a partition, you would eventually run two incompatible realities. That’s not “high availability.” That’s “high drama.”
Key indicators this is your bucket:
- pvecm status shows Quorate: No
- journalctl -u pve-cluster mentions lost quorum, rejecting writes
- /etc/pve is mounted but read-only
- GUI changes fail across the board: VM create, storage edit, network config edits
What causes quorum loss in real life?
- Network partitions, especially “it works for TCP but not for corosync” (MTU mismatches, firewall rules, asymmetric routing)
- Node down (planned or not) and cluster sized such that you lose majority
- Time drift causing membership instability
- Corosync config changes applied partially or incorrectly
Operational guidance: if you’re in a two-node cluster without a tie-breaker, you’re living dangerously. It can work, but the failure mode is exactly this: one node goes away and the other refuses to accept config writes. That’s not a bug; it’s math.
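Two checks pay for themselves here. First, prove the corosync path end-to-end at your intended MTU (the 8972-byte payload below assumes a 9000-byte MTU; use 1472 for a standard 1500 MTU). Second, if you are stuck with two nodes, give corosync a third vote with a QDevice. The peer and witness addresses below are placeholders:

ping -M do -s 8972 -c 3 192.0.2.12     # do-not-fragment ping at jumbo size; drops reveal MTU mismatches
corosync-cfgtool -n                    # this node's view of its peers and link status (corosync 3 / knet)
pvecm qdevice setup 192.0.2.50         # adds a witness vote; 192.0.2.50 is a placeholder qdevice host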
Bucket C: permissions and access path problems
Permissions are the least exciting bucket, but they’re common in shops with automation, API tokens, and “temporary” changes that become permanent.
Typical cases:
- You’re not root (or your tool isn’t using sudo) and you try to write cluster config.
- An API token lacks the required role/privileges to modify specific objects.
- Someone changed ownership/mode under /etc/pve manually (yes, it’s possible; no, it’s not a hobby).
- Stale lock semantics: not classic UNIX file locks, but Proxmox “lock” states in config (backup, snapshot, migration) that prevent writes to a VM’s conf.
The key diagnostic distinction is the error text:
- Permission denied tends to be actor/ACL/ownership.
- Read-only file system tends to be quorum/pmxcfs gating or OS remount RO.
- No space left on device is either OS disk or sometimes a manifestation of an underlying resource exhaustion.
Three corporate mini-stories from the trenches
Mini-story 1: the incident caused by a wrong assumption
They had a three-node Proxmox cluster in a remote site. A switch firmware upgrade was scheduled at 2 a.m., because of course it was. The plan: upgrade one switch at a time, rely on redundant uplinks, no downtime. Classic.
The wrong assumption was subtle: “If ping works, corosync will work.” After the first switch reboot, the management network stayed reachable. SSH worked. Monitoring showed the nodes “up.” But corosync traffic ran over a different VLAN with a different MTU, and the new firmware defaulted to a slightly different jumbo-frame behavior. Not broken enough to drop all traffic—just broken enough to cause intermittent packet loss on larger frames.
Half an hour later, the on-call tried to create a VM for an urgent workload. The GUI failed: unable to write /etc/pve/nodes/.... They did what many smart people do under stress: they assumed it was local. They cleaned disk, restarted pve-cluster, even rebooted a node. It didn’t help. In fact, it made the membership flapping worse because the cluster kept re-electing membership while the network was unstable.
The fix was not Proxmox. The fix was making corosync reliable again: aligning MTU end-to-end, confirming knet link stability, and only then letting quorum settle. Once quorum was stable, /etc/pve became writable instantly—no permission tweaks required, no config voodoo.
The lesson: don’t diagnose a distributed system like a single box. “I can SSH” is a low bar. Quorum traffic has stricter needs than your terminal session.
Mini-story 2: the optimization that backfired
A different org wanted faster failovers. They tuned aggressively: faster corosync timeouts, more sensitive failure detection, and a “lean” network path. They also consolidated cluster traffic onto the same interfaces used for storage replication because “it’s all 10GbE, what could go wrong?”
It ran beautifully in calm weather. Then monthly backups kicked off and replication traffic spiked. The interfaces were saturated enough to introduce jitter. Corosync doesn’t need much bandwidth, but it hates unpredictable latency. Membership started flapping. Quorum dropped briefly, returned, dropped again. Operators didn’t see “network down”; they saw “Proxmox can’t write config.”
The backfire was operational, not theoretical. Every time quorum dropped, pmxcfs stopped accepting writes. Changes via GUI partially applied, then rolled back. Automation that assumed idempotency started thrashing: “set storage,” “fail,” “retry,” creating a storm of tasks and logs. That log storm, in turn, pushed the node closer to disk pressure. A tidy optimization turned into a multi-layer failure.
The eventual solution was boring: separate corosync traffic from heavy flows, revert timeout tuning to sane defaults, and treat quorum stability as a first-class SLO. The cluster became slightly “slower” to declare failure, and massively faster to recover safely.
Lesson: in HA, speed is a feature only when you can afford it. If you buy speed by reducing safety margins, you’ll pay later—with interest.
Mini-story 3: the boring but correct practice that saved the day
A healthcare-ish environment (regulated enough to be cautious) ran Proxmox in three nodes, with a fourth small “witness” node for quorum voting. Nothing fancy. They also had a habit that engineers love to mock: a written runbook with “pre-checks” and “stop conditions.”
One afternoon, a storage controller started throwing intermittent errors. The OS remounted a filesystem read-only on one node. That node stayed reachable, but services degraded. The on-call saw unable to write /etc/pve/... while trying to disable a scheduled job. Instead of force-restarting everything, they followed the runbook: check dmesg, check mount status, check quorum, check disk health. The runbook forced them to prove or disprove each bucket.
Within minutes, they realized it wasn’t quorum at all. The cluster was quorate; pmxcfs was fine. The local filesystem was read-only due to kernel-detected errors. That changed the response: evacuate VMs from that node, cordon it, and engage hardware replacement. They didn’t “fix permissions,” and they didn’t risk corrupting cluster state.
The next day, the incident report was dull. That’s the compliment. It was dull because they treated “unable to write /etc/pve” as a symptom, not a diagnosis, and they had a disciplined path to isolate the cause.
Lesson: the boring checklist isn’t bureaucracy. It’s a reliability feature.
Common mistakes: symptoms → root cause → fix
1) Symptom: “Read-only file system” when touching /etc/pve
Root cause: Cluster not quorate, or pmxcfs intentionally refusing writes.
Fix: Restore quorum (network, corosync membership). Verify pvecm status shows Quorate: Yes. Do not chmod your way out of it.
2) Symptom: GUI saves fail, CLI as root also fails
Root cause: pmxcfs gating writes (quorum) or pmxcfs unhealthy; less commonly OS filesystem read-only.
Fix: Check findmnt /etc/pve for ro, check journalctl -u pve-cluster, check dmesg for read-only remounts.
3) Symptom: “Permission denied” only in GUI or automation, but root on SSH can edit
Root cause: API token / user role missing permissions; or automation running as non-root without proper sudo.
Fix: Audit roles/ACLs and confirm identity. Re-run the same action using pvesh as root to isolate access path differences.
4) Symptom: only one VM config won’t update; others are fine
Root cause: VM is locked (backup, snapshot, migration), or a stale lock state exists.
Fix: Check VM lock status in config and task list; resolve the underlying task. Avoid manual lock removal unless you’re sure the operation is dead and won’t resume.
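A minimal way to see and, only if the owning task is truly dead, clear a VM lock (VM ID 101 is a placeholder; check the task log first):

qm config 101 | grep '^lock'    # shows e.g. "lock: backup" while a backup owns the config
qm unlock 101                   # last resort: removes the lock; never do this while the task may still resume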
5) Symptom: errors started after node reboot / maintenance
Root cause: Node came up isolated (VLAN trunk missing, firewall rule, wrong interface), so it lost quorum and pmxcfs went read-only.
Fix: Validate corosync interfaces, VLAN tagging, MTU, and peer reachability before touching Proxmox services.
6) Symptom: “No space left on device” while writing /etc/pve
Root cause: Root filesystem or inode exhaustion; sometimes a log storm or runaway backup spool.
Fix: Free space/inodes safely, then re-check. Don’t delete cluster state files as your first move.
7) Symptom: /etc/pve is missing or empty
Root cause: pve-cluster/pmxcfs not mounted; or mount failed during boot.
Fix: Start/repair pve-cluster; check why it failed in logs. Confirm FUSE is working and service dependencies are healthy.
8) Symptom: intermittent ability to write; works for a minute, then fails
Root cause: quorum flapping due to network jitter, MTU mismatch, overloaded corosync links, or aggressive timeouts.
Fix: Stabilize corosync transport. Reduce contention. Don’t tune timeouts for “speed” without measuring jitter under load.
Checklists / step-by-step plan (safe sequence)
Checklist A: you just saw “unable to write /etc/pve/*” and you want the truth fast
- Confirm mount and mode: findmnt /etc/pve. If missing → service/mount problem; if ro → quorum/pmxcfs gating likely.
- Confirm quorum: pvecm status. If not quorate, don’t waste time on permissions.
- Confirm services: systemctl status pve-cluster corosync.
- Check logs: journalctl -u pve-cluster -u corosync --since "30 min ago".
- Rule out disk pressure: df -h, df -i, and dmesg for read-only remounts.
- Reproduce with a harmless write: touch /etc/pve/.diag-write-test and interpret the exact error message.
- Make a call: if quorum is lost, fix networking/quorum. If disk is dead, evacuate the node. If it’s permissions, fix ACLs/identity. A combined script version of this checklist is sketched below.
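If you want Checklist A as one paste, a rough sketch (read-only except for the harmless touch test) could look like this:

#!/bin/bash
# pve-write-triage.sh - classify "unable to write /etc/pve" in one pass (sketch; adjust to taste)
echo "== mount ==";    findmnt -no TARGET,FSTYPE,OPTIONS /etc/pve || echo "NOT MOUNTED"
echo "== quorum ==";   pvecm status | grep -iE 'quorate|total votes'
echo "== services =="; systemctl is-active pve-cluster corosync
echo "== disk ==";     df -h / /var; df -i /
echo "== kernel ==";   dmesg -T | grep -iE 'read-only|i/o error' | tail -n 5
echo "== write test =="
if touch /etc/pve/.diag-write-test; then
    echo "write OK"
    rm -f /etc/pve/.diag-write-test
fi  # on failure, touch itself prints the exact error you need to interpret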
Checklist B: if quorum is lost, decide your operational stance (don’t improvise)
- Is this a real network partition or just a node down? Determine whether peers are reachable and whether the down node is expected.
- Do you have majority somewhere? In a 3-node cluster, one node isolated has no quorum; the other two likely do.
- Prefer fixing connectivity over forcing writes. “Force” is a last resort because it can create divergent state.
- Stabilize the network path: verify VLANs, MTU, firewall rules; check for saturation on corosync links.
- After quorum returns: retry writes. Don’t restart services unless you have a clear reason.
Checklist C: if it’s disk/full/RO, treat it like a storage incident
- Confirm disk condition: dmesg and SMART/RAID status.
- If filesystem is RO due to errors: stop trying to “fix Proxmox.” Evacuate workloads if possible. Plan repair.
- If it’s just full: clean space safely (rotate logs, prune backups, remove unused ISOs). Avoid deleting Proxmox state.
- Re-check services and /etc/pve behavior after space returns. Capacity incidents often cascade.
Checklist D: if it’s permissions, fix the smallest thing that works
- Confirm identity and execution context (id, sudo -l, API token roles).
- Don’t recursively chmod/chown /etc/pve. That’s how you create mystery meat permissions.
- Fix role/ACL issues at the Proxmox level (users, groups, roles) rather than file-level hacking (see the sketch after this list).
- Validate by re-running the failing action and confirming audit logs/tasks succeed.
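Fixing it at the Proxmox layer usually means granting a role on a path rather than touching files. A minimal sketch; the user, path, and role here are placeholders, so pick the narrowest scope that works:

pveum acl modify /vms/101 --users automation@pve --roles PVEVMAdmin   # grant VM admin on one VM
pveum acl list | grep automation                                      # confirm the grant landed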
FAQ
1) Why does Proxmox put cluster config in /etc/pve instead of a database?
Because files are a universal interface. Tools, scripts, and admins can read them easily, diff them, and back them up. pmxcfs gives “files” semantics while handling cluster synchronization and access control.
2) If /etc/pve is virtual, why does my local disk matter?
Because the OS, services, logs, and runtime state still need local disk. Full disks or read-only remounts destabilize the services that keep pmxcfs running and healthy.
3) What’s the fastest way to tell quorum vs permissions?
Run pvecm status and do a harmless touch under /etc/pve. “Quorate: No” or “Read-only file system” strongly points to quorum/pmxcfs gating. “Permission denied” points to permissions/identity.
4) Can I just restart pve-cluster to fix it?
Sometimes, but it’s not a first move. If quorum is lost, restarting pve-cluster won’t create votes out of thin air. If the OS filesystem is read-only, restarts are theater. Read logs first; decide based on evidence.
5) Why do GUI changes fail but editing files over SSH seems to work (or vice versa)?
Because the GUI uses the API with its own permission model and may fail on ACLs even if your SSH session as root can write. Or the reverse: root can’t write because pmxcfs is read-only, and the GUI shows a generic write error.
6) Is a two-node cluster a bad idea?
It’s a fragile idea unless you add a third vote (qdevice/witness) or accept that losing one node can remove quorum and block writes. Two nodes can run, but the failure mode is exactly the one you’re debugging.
7) What does “pmxcfs rejecting writes while not quorate” actually protect me from?
Split brain config divergence: two nodes both “successfully” updating VM configs, storage definitions, or firewall rules in isolation. When the partition heals, reconciling that mess is painful and sometimes destructive.
8) Can I force Proxmox to be writable without quorum?
You can operate in degraded modes, but it’s a deliberate emergency procedure, not a casual toggle. The risk is diverging cluster state. If you must, document it, minimize changes, and plan a clean rejoin.
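For completeness, the usual emergency lever is lowering the expected vote count on the surviving node so it regains quorum by itself. Treat this as a break-glass step, not a fix, and expect to undo it when the cluster heals:

pvecm expected 1   # tell votequorum one vote is enough; /etc/pve becomes writable on this node
# make only the changes you must, document them, and restore normal membership afterwards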
9) What if only writes to a specific path fail, like one VM config?
Look for locks and task state (backup/migration/snapshot) on that VM. Cluster-wide inability to write usually points to quorum/pmxcfs; single-object failures often point to lock/task issues or a corrupted config file.
10) How do I prevent this from recurring?
Design for quorum stability (odd number of votes, reliable corosync network), enforce disk capacity hygiene (alerts on space and inodes), and standardize privileged access (API tokens with explicit roles, automation that uses proper privileges).
Conclusion: next steps that actually prevent repeats
“Unable to write /etc/pve/*” is Proxmox telling you it won’t accept config changes because something foundational is wrong or unsafe. Your job is to classify the failure quickly:
- If quorum is lost: fix corosync connectivity and membership. Don’t fight pmxcfs; it’s protecting you.
- If disk is full or read-only: treat it as a storage incident. Free space safely or evacuate and repair hardware/filesystems.
- If it’s permissions: fix identity, roles, and ACLs at the Proxmox layer. Don’t recursively chmod the cluster’s brain.
Practical next steps for Monday morning (when you’re not mid-outage):
- Add monitoring on pvecm status (quorum), pve-cluster health, and corosync link stability. A minimal cron-able check is sketched after this list.
- Alert on df -h and df -i for root and /var. Inode exhaustion is a stealth outage.
- Run an odd number of votes. If you must run two nodes, add a witness vote and treat the network as critical infrastructure.
- Write a short internal runbook with the “Fast diagnosis playbook” above. You want future-you to be lazy and correct.
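A minimal cron-able sketch of that monitoring idea, assuming a clustered node; hook the echo lines into whatever alerting you already have, and treat the 90% thresholds as arbitrary examples:

#!/bin/bash
# pve-health-probe.sh - run from cron every few minutes; exits non-zero on any red flag
fail=0
pvecm status | grep -q 'Quorate:.*Yes'        || { echo "ALERT: cluster not quorate"; fail=1; }
systemctl is-active --quiet pve-cluster       || { echo "ALERT: pve-cluster not active"; fail=1; }
systemctl is-active --quiet corosync          || { echo "ALERT: corosync not active"; fail=1; }
[ "$(df --output=pcent / | tail -1 | tr -dc 0-9)" -lt 90 ]  || { echo "ALERT: / over 90% full"; fail=1; }
[ "$(df --output=ipcent / | tail -1 | tr -dc 0-9)" -lt 90 ] || { echo "ALERT: / over 90% inodes used"; fail=1; }
exit $fail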