It’s 02:13. A VM won’t start, stop, migrate, or snapshot. Proxmox shrugs and says: “cannot get exclusive lock”. The business hears “hypervisor is down.” You hear “somebody left a sticky note on the steering wheel.”
This error is rarely the real problem. It’s a symptom: a running task, a crashed daemon, a storage backend stalling, or an HA manager making a call you didn’t know it was allowed to make. The goal isn’t to “remove the lock.” The goal is to identify who is holding it, why, and whether releasing it will corrupt data or just unblock a queue.
What the lock actually is (and what it isn’t)
Proxmox VE (PVE) is a multi-process system doing orchestration in the least glamorous way possible: lock files and serialized tasks. That’s good. It keeps two operations from stomping the same VM configuration or disk at the same time.
When you see “cannot get exclusive lock,” PVE is saying: “I tried to take a lock for this VM, but something else already has it.” That “something else” can be:
- a legitimate long-running task (backup, snapshot, replication, migration)
- a dead task that didn’t release the lock (daemon restart, node crash, storage hang)
- an HA manager or cluster component coordinating access
- a storage backend lock (Ceph RBD lock, LVM activation lock, NFS file lock)
- a running QEMU process that still owns the VM runtime state even if the UI looks confused
Here’s the part that trips people: the lock might be on the VM config, not the VM process. “Unlocking” can let you edit config while a disk operation is still running. That is how you get to explain to your future self why the VM’s disks look like modern art.
One dry truth: a lock is a warning label. Peeling it off doesn’t make the contents safe.
Two classes of locks matter in practice:
- PVE-level locks (typically in VM config state). These block actions in qm, the UI, and the API.
- Storage-level locks (Ceph RBD, ZFS “dataset busy”, LVM activation, NFS stale locks). These can block QEMU, snapshots, or backups even if PVE looks “unlocked.”
Fast diagnosis playbook
If you’re on-call and you want the shortest path to “this is safe to fix,” do this in order. The idea is to locate the bottleneck: is it a normal task, a stuck task, or a storage hang?
First: is there an active PVE task for the VM?
- Check task history for the VM ID and look for “running” tasks.
- Confirm whether a backup, snapshot, migrate, or replication job is active.
- If yes: don’t unlock yet; decide whether to wait, cancel, or kill the worker.
Second: is QEMU still running (or half-running)?
- Check for a kvm/qemu-system-x86_64 process for that VM ID.
- Check if QMP responds, or if the process is stuck in uninterruptible I/O wait (D state).
- If QEMU is alive: the “lock” is often correct. If QEMU is stuck: storage is usually the real villain.
Third: is storage blocked?
- ZFS: check zpool status, I/O latency, and stuck zfs destroy/snapshot operations.
- Ceph: check cluster health and RBD locks/watchers.
- NFS: look for stale file handles and blocked mount points.
- LVM: check LV activation and any stuck lvchange/lvs commands.
Fourth: only then consider releasing locks
If no task is running, QEMU is not running (or is gone), and storage isn’t performing an operation, then releasing a stale lock is usually safe.
Rule of thumb: if there’s any chance a disk operation is mid-flight, treat “unlock” as an incident, not a fix.
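If you want that order as a copy-paste triage block, here is a minimal sketch (node name server and VMID 101 are placeholders carried through this article; the pvesh task filter can vary slightly between PVE releases). Each check is expanded in the Tasks section below.
cr0x@server:~$ grep '^lock:' /etc/pve/qemu-server/101.conf      # which operation set the lock?
cr0x@server:~$ pvesh get /nodes/server/tasks --source active    # is any task still actually running?
cr0x@server:~$ pgrep -a -f "kvm -id 101"                        # is QEMU alive for this VM?
cr0x@server:~$ pvesm status                                     # is every storage backend active?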
Tasks & commands: find the holder, prove the cause, choose the fix
Below are practical tasks you can run on a PVE node. Each one includes: a command, what typical output means, and what decision you should make.
Task 1: Identify the exact lock message and operation
cr0x@server:~$ qm start 101
cannot get exclusive lock on VM 101 (snapshot-delete)
What it means: the lock is not generic; it tells you which operation category holds it (here: snapshot-delete).
Decision: go look for snapshot deletion tasks, not just “some random lock.” This also hints storage involvement.
Task 2: Check VM config for a lock field
cr0x@server:~$ grep -E '^(lock|parent|snapname):' /etc/pve/qemu-server/101.conf
lock: snapshot-delete
What it means: PVE recorded a lock in the VM config itself. That is often a leftover from an interrupted task.
Decision: do not delete this line blindly yet. First confirm whether any task is still running or storage is still doing work.
Task 3: Find recent tasks mentioning the VM ID in the node logs
cr0x@server:~$ journalctl -u pvedaemon -u pveproxy -u pvestatd --since "2 hours ago" | grep -E 'vmid=101|VM 101|101.conf' | tail -n 30
Dec 26 01:41:10 server pvedaemon[2143]: starting task UPID:server:00002A1F:00014B77:676D1A46:vzdump:101:root@pam:
Dec 26 01:41:11 server pvedaemon[2143]: INFO: backup job started
Dec 26 01:55:44 server pvedaemon[2143]: ERROR: backup failed: storage 'nfs-backup' not online
Dec 26 01:55:44 server pvedaemon[2143]: TASK ERROR: job failed
What it means: you have an operation that failed. Failures can leave locks depending on where the failure happened.
Decision: pivot to the failing subsystem (here: backup storage). If storage is still not online, unlocking won’t help—you’ll just fail in a new, more creative way.
Task 4: List running tasks and filter for the VM
cr0x@server:~$ pvesh get /nodes/server/tasks --source active | grep -E 'vzdump|:101:' || true
UPID:server:00002A1F:00014B77:676D1A46:vzdump:101:root@pam:
What it means: a task is still marked as running. It might be actually running, or stuck, or just not cleaned up.
Decision: inspect that UPID and decide whether to wait, cancel, or kill the worker process.
Task 5: Inspect a task’s status (and see if it’s truly alive)
cr0x@server:~$ pvesh get /nodes/server/tasks/UPID:server:00002A1F:00014B77:676D1A46:vzdump:101:root@pam:/status --output-format json
{"exitstatus":null,"id":"UPID:server:00002A1F:00014B77:676D1A46:vzdump:101:root@pam:","node":"server","pid":10783,"starttime":1735177270,"status":"running","type":"vzdump","upid":"UPID:server:00002A1F:00014B77:676D1A46:vzdump:101:root@pam:","user":"root@pam"}
What it means: PVE thinks PID 10783 is still running and therefore keeps the lock valid.
Decision: verify that PID exists and isn’t stuck in I/O wait forever.
Task 6: Verify the task PID exists and see what it’s doing
cr0x@server:~$ ps -o pid,ppid,stat,etime,cmd -p 10783
PID PPID STAT ELAPSED CMD
10783 2143 D+ 00:14:28 /usr/bin/vzdump 101 --storage nfs-backup --mode snapshot
What it means: state D means uninterruptible sleep, usually blocked on I/O. Killing it may not work until the I/O returns.
Decision: stop treating this as “a Proxmox lock problem.” It’s a storage or network hang problem. Go to the storage checks section before you kill anything.
Task 7: Check if QEMU is still running for that VM
cr0x@server:~$ pgrep -a -f "kvm -id 101"
24591 /usr/bin/kvm -id 101 -name vm101,debug-threads=on -m 8192 ...
What it means: VM runtime is alive. Even if config lock looks stale, the VM process might be mid-snapshot or has attached devices.
Decision: don’t “qm unlock” as your first move. Determine whether the VM is healthy, and what action is blocked.
Task 8: Ask Proxmox what it thinks the VM status is
cr0x@server:~$ qm status 101
status: running
What it means: PVE still considers it running. That usually blocks certain exclusive operations (like snapshot delete) by design.
Decision: if your goal is “start VM,” you’re chasing the wrong issue. If your goal is “delete snapshot,” you need to resolve the snapshot task, not the VM runtime.
Task 9: Review snapshot tree and identify the stuck snapshot name
cr0x@server:~$ qm listsnapshot 101
`-> pre-upgrade-2024-12-01
  `-> vzdump
    `-> current
What it means: snapshot names matter. “vzdump” snapshots are often left behind by backup failures.
Decision: if a “vzdump” snapshot is present and backups are failing, fix backup storage first, then delete snapshot cleanly.
Task 10: Try a graceful task cancellation (when appropriate)
cr0x@server:~$ pvesh delete /nodes/server/tasks/UPID:server:00002A1F:00014B77:676D1A46:vzdump:101:root@pam:
What it means: you requested stop. It may still hang if the process is in D state.
Decision: if it stops quickly, great. If not, storage debugging is mandatory. Avoid escalating to SIGKILL without understanding what it’s stuck on.
Task 11: When there’s no running task, safely unlock at the PVE layer
cr0x@server:~$ qm unlock 101
What it means: it removed the lock: entry from the VM config. This does not fix storage locks or stuck QEMU processes.
Decision: only do this after you confirmed there’s no active operation. After unlocking, retry the blocked action and watch logs.
Task 12: Check storage availability in PVE (the boring “is it even mounted?” check)
cr0x@server:~$ pvesm status
Name Type Status Total Used Available
local dir active 98.00GB 12.40GB 85.60GB
nfs-backup nfs inactive 0B 0B 0B
rbd-ceph rbd active 3.00TB 1.90TB 1.10TB
What it means: backup storage is inactive. Tasks that depend on it may hang or fail and leave locks.
Decision: fix storage reachability first. If you unlock while storage is still down, the next backup/snapshot attempt will re-lock and re-break.
Task 13: NFS-specific: check mount health and “stale file handle” risk
cr0x@server:~$ mount | grep nfs-backup
10.0.20.15:/export/pve-backup on /mnt/pve/nfs-backup type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2)
cr0x@server:~$ timeout 3 ls -la /mnt/pve/nfs-backup | head
ls: cannot access '/mnt/pve/nfs-backup': Stale file handle
What it means: a “hard” NFS mount with a stale handle can hang processes in D state. That includes vzdump and snapshot cleanup.
Decision: remediate the NFS condition (server/export restart, remount). Don’t just kill the task and hope. It will wedge again.
Task 14: Ceph-specific: check cluster health fast
cr0x@server:~$ ceph -s
  cluster:
    id:     0f3c6f1a-9f72-4a1f-9e35-7b4c9b2a1c0a
    health: HEALTH_WARN
            1 slow ops, oldest one blocked for 34 sec
  services:
    mon: 3 daemons, quorum mon1,mon2,mon3
    mgr: mgr1(active)
    osd: 6 osds: 6 up, 6 in
  data:
    pools:   4 pools, 256 pgs
    usage:   1.9 TiB used, 1.1 TiB / 3.0 TiB avail
    pgs:     256 active+clean
What it means: “slow ops” correlates with hung snapshots/migrations and lock timeouts because I/O completion is delayed.
Decision: solve Ceph slowness (network, OSD, backfill, recovery) before you manipulate locks. Otherwise you’re just speeding up your failure loop.
Task 15: Ceph RBD lock/watchers: see who’s attached
cr0x@server:~$ rbd status vm-101-disk-0 --pool rbd
Watchers:
watcher=10.0.10.21:0/1689456502 client.48231 cookie=18446744073709551615
What it means: some client (likely a PVE node) still has the disk open. Snapshots or exclusive operations may block.
Decision: identify that client and confirm QEMU really stopped. If it’s a dead node, you may need to clear stale attachments at the cluster level—carefully.
Task 16: ZFS-specific: confirm pool health and look for long-running operations
cr0x@server:~$ zpool status -x
all pools are healthy
cr0x@server:~$ zfs list -t snapshot | grep -E '^rpool/data/vm-101-disk-0@' | tail -n 5
rpool/data/vm-101-disk-0@vzdump 2.10G - 58.3G -
What it means: pool is healthy; snapshot exists. If deletion is stuck, it’s often a hold, clone dependency, or an I/O stall elsewhere.
Decision: check for clones/holds before forcing anything.
Task 17: ZFS holds: find the reason a snapshot won’t delete
cr0x@server:~$ zfs holds rpool/data/vm-101-disk-0@vzdump
NAME TAG TIMESTAMP
rpool/data/vm-101-disk-0@vzdump pve-vzdump Thu Dec 26 01:41 2024
What it means: there’s an explicit hold. Proxmox (or a script) placed it to prevent accidental deletion during backup.
Decision: remove the hold only if the backup task is confirmed dead and the snapshot is safe to remove.
Task 18: Remove a ZFS hold (surgical, not casual)
cr0x@server:~$ zfs release pve-vzdump rpool/data/vm-101-disk-0@vzdump
What it means: you removed the blocker that prevented deletion. Now snapshot deletion can proceed.
Decision: immediately run the delete via Proxmox again (preferred) so PVE’s state remains consistent.
Task 19: Check who is locking /var/lock at OS level (rare but real)
cr0x@server:~$ ls -la /var/lock/qemu-server/
total 0
-rw-r----- 1 root root 0 Dec 26 01:41 lock-101.conf
cr0x@server:~$ lsof /var/lock/qemu-server/lock-101.conf || true
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
pvedaemon 2143 root 8u REG 0,53 0 512 /var/lock/qemu-server/lock-101.conf
What it means: the daemon itself holds the file descriptor. That can happen during legitimate operations—or when the daemon is stuck.
Decision: if pvedaemon is wedged (and tasks are stuck), a controlled restart may be appropriate, but only after you understand what it’s waiting on.
Task 20: Restart the management daemons (last resort, but sometimes correct)
cr0x@server:~$ systemctl restart pvedaemon pveproxy pvestatd
What it means: you restarted the control plane. This can clear stale locks and task bookkeeping. It will not unstick blocked kernel I/O.
Decision: do this when: (1) you confirmed QEMU and storage are fine, (2) tasks are ghosted, and (3) HA/cluster state is stable. Otherwise you’re rebooting the dashboard while the engine is on fire.
Joke #1: Unlocking a VM without checking storage is like firing the smoke alarm because it’s loud. The noise stops; the problem doesn’t.
Lock types in Proxmox: what each one usually means
The lock reason string (the part in parentheses) is your clue. Treat it as a starting hypothesis, not a verdict.
backup / vzdump
Usually means: a backup job is running or crashed mid-flight, often leaving a snapshot behind (especially on ZFS/Ceph).
Common underlying causes: backup storage offline, NFS stale handle, PBS datastore unavailable, slow storage.
What to do: check tasks, then backup target health, then snapshots.
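A quick way to test that hypothesis, assuming the VMID (101) and backup storage name (nfs-backup) used in the earlier examples:
cr0x@server:~$ pgrep -a -f "vzdump.*101"        # is a backup worker still alive for this VM?
cr0x@server:~$ pvesm status | grep nfs-backup   # is the backup target active at all?
cr0x@server:~$ qm listsnapshot 101              # did a failed run leave a vzdump snapshot behind?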
snapshot / snapshot-delete
Usually means: snapshot creation/deletion is in progress, or a previous attempt died after setting state.
Common underlying causes: ZFS holds, Ceph slow ops, RBD watchers still attached, pending IO.
What to do: inspect snapshot list, storage back-end state, and any holds/watchers.
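A minimal check of both layers, assuming the placeholder disk names from this article (ZFS dataset rpool/data/vm-101-disk-0, Ceph image rbd/vm-101-disk-0):
cr0x@server:~$ qm listsnapshot 101                          # what PVE believes exists
cr0x@server:~$ zfs holds rpool/data/vm-101-disk-0@vzdump    # ZFS backend: anything pinning the snapshot?
cr0x@server:~$ rbd snap ls rbd/vm-101-disk-0                # Ceph backend: does the snapshot still exist on the image?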
clone
Usually means: a template clone is running or partially completed.
Common underlying causes: storage copy/clone stuck, networked storage latency, insufficient space.
What to do: confirm the clone task, verify storage space and backend health, avoid manual edits to disk config while clone is active.
migrate
Usually means: a migration is running or failed and didn’t clean up.
Common underlying causes: target node unreachable, SSH disruptions, shared storage hiccups, HA interventions.
What to do: find the migration task UPID, verify both nodes’ view of the VM, and confirm disks aren’t active on both sides.
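One sketch of “where is this VM really?” using the pmxcfs layout plus a process check on both sides (server2 is a placeholder node name):
cr0x@server:~$ ls /etc/pve/nodes/*/qemu-server/101.conf      # which node currently owns the config
cr0x@server:~$ pgrep -a -f "kvm -id 101"                     # is QEMU running locally?
cr0x@server:~$ ssh server2 'pgrep -a -f "kvm -id 101"'       # is QEMU running on the other node?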
rollback
Usually means: snapshot rollback is in progress, which is inherently exclusive.
Common underlying causes: rollback started and the VM was forced off, storage struggled, or rollback failed and left state.
What to do: treat as high-risk. Confirm disk consistency and do not unlock casually.
Storage-specific failure modes (ZFS, Ceph, LVM, NFS, iSCSI)
Most “exclusive lock” tickets are storage tickets wearing a virtualization costume.
ZFS: “dataset busy”, holds, and silent dependencies
ZFS is great at snapshots. It’s also great at not letting you shoot yourself in the foot. When snapshot delete is blocked, it’s often because of:
- holds placed by backup tooling (so you can’t delete mid-backup)
- clones depending on the snapshot
- busy datasets due to mounts or processes holding references
- pool distress (latency, failing device) causing operations to take forever
What you should avoid: manually destroying snapshots behind Proxmox’s back while it still thinks an operation is ongoing. Prefer using qm delsnapshot after you’ve cleared the underlying blocker; it keeps Proxmox metadata consistent.
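Before escalating, a short sketch of the usual ZFS blockers (dataset and snapshot names are the placeholders used earlier):
cr0x@server:~$ zfs holds rpool/data/vm-101-disk-0@vzdump                    # explicit holds on the snapshot
cr0x@server:~$ zfs get -H -o value clones rpool/data/vm-101-disk-0@vzdump   # clones that depend on it
cr0x@server:~$ zpool status -x                                              # is the pool itself in distress?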
Ceph RBD: watchers, exclusive locks, and “slow ops”
Ceph has its own concept of who is attached to an RBD image. Even if PVE drops a config lock, the image might still be opened by a QEMU instance somewhere.
If a node crashed, you can get “phantom attachments”: the cluster still believes a client holds the image. You’ll see watchers, and your snapshot/delete actions will stall or fail.
In Ceph incidents, the lock error is often the first visible symptom of a deeper condition: recovery/backfill saturating disks, network packet loss, or OSDs flapping. Fixing Ceph health is frequently the fastest “unlock.”
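A hedged three-step look at the Ceph side (pool and image names are the placeholders from the RBD example above):
cr0x@server:~$ ceph health detail | head        # what exactly is degraded or slow
cr0x@server:~$ rbd status rbd/vm-101-disk-0     # watchers: who still has the image open
cr0x@server:~$ rbd lock ls rbd/vm-101-disk-0    # explicit locks held on the image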
LVM-thin: activation locks and stuck lvchange
LVM-thin is dependable until you run out of metadata or the VG is busy across nodes in a way you didn’t intend. If lvs hangs, you’re in the same category as NFS “hard” mount problems: processes blocked in kernel I/O, not polite userland failures.
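Two quick probes, wrapped in timeout so a wedged backend cannot trap your shell (pool and VG names are whatever your setup uses):
cr0x@server:~$ timeout 10 lvs -a -o lv_name,vg_name,lv_attr,data_percent,metadata_percent   # does lvs return, and is thin metadata filling up?
cr0x@server:~$ timeout 10 dmsetup info -c | head                                            # device-mapper view, bounded the same way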
NFS: hard mounts, stale handles, and why your kill -9 didn’t work
NFS is a fine choice for backups. It’s also a masterclass in distributed systems trade-offs. With a hard mount, processes will wait forever for the server to return. That’s “reliability” in one sense, and “my cluster is glued to a dead export” in another.
When NFS returns Stale file handle, you’re usually looking at an export or filesystem event on the server side (reboot, failover, remount, export change). Client processes can wedge. Locks remain “held” because tasks never exit.
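Two bounded checks that distinguish “slow” from “gone” (the mount path is the placeholder from earlier):
cr0x@server:~$ timeout 3 stat -t /mnt/pve/nfs-backup || echo "mount not responding"   # bounded probe instead of an ls that hangs forever
cr0x@server:~$ ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'                         # everything currently stuck in uninterruptible sleep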
iSCSI / multipath: path loss and D-state QEMU
Block storage path issues love to show up as lock problems because QEMU can block on I/O. The management plane then blocks waiting for QEMU or storage completion. If you see QEMU in D state, think multipath or SAN before you think “Proxmox bug.”
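Two sanity checks on the block path, assuming open-iscsi and dm-multipath are in use:
cr0x@server:~$ multipath -ll         # per-path states: look for failed/faulty paths
cr0x@server:~$ iscsiadm -m session   # are the iSCSI sessions still logged in?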
Joke #2: Storage outages are generous—they share their pain with every layer above them, free of charge.
Three corporate-world mini-stories (how this goes wrong in real life)
Mini-story 1: The incident caused by a wrong assumption
The team had a small cluster with shared Ceph RBD. A node rebooted during a maintenance window, and afterward a VM wouldn’t migrate. The UI showed “cannot get exclusive lock,” so the on-call assumed it was just Proxmox being conservative and ran qm unlock.
Migration still failed. Now the config was editable, so they “fixed” it by detaching and reattaching the disk in the VM config. That created a clean-looking config state that was now out of sync with reality: the old RBD image still had a watcher, and the new attachment attempts piled on errors.
They spent two hours “retrying” and restarting daemons, because everyone loves a ritual when the root cause is unclear. The real cause was visible in minutes: Ceph had slow ops and the RBD image was still watched by a client ID from the rebooted node.
What changed the outcome was not a clever command. It was the decision to stop treating the lock as the problem and start treating it as evidence. Once Ceph recovered and the stale attachment was resolved, the lock error disappeared without heroics.
Mini-story 2: The optimization that backfired
A different company wanted faster backups. Someone tuned NFS mounts for throughput and set large read/write sizes and aggressive settings. It was fast. On good days.
Then the NFS server had a brief failover event. The mount was hard, which is defensible, but they didn’t plan for the operational reality: backup processes went into uninterruptible sleep while waiting for I/O. Those tasks held VM locks, snapshot cleanup stalled, and the next day’s backups collided with yesterday’s leftovers.
By mid-morning, several VMs were “locked,” and the storage team was being asked why virtualization couldn’t “just unlock them.” They could unlock. They could also make snapshot sprawl worse by letting new snapshots stack on top of an already-failed chain.
The eventual fix wasn’t rolling back the tuning entirely. It was adding guardrails: monitoring for hung mounts, isolating backup storage traffic, and ensuring backup tooling failed fast enough to release locks when the backend was sick.
Mini-story 3: The boring but correct practice that saved the day
A finance org had strict change control and, more importantly, strict habits. They scheduled snapshot deletions and backups with clear windows, and they had a rule: no manual unlocks until three checks were done—task list, QEMU process state, and storage health.
One night a snapshot delete got stuck after a node lost connectivity to the backup target. The on-call saw the lock error and followed the checklist. Tasks showed a running vzdump with a PID in D state. Storage status showed the NFS mount inactive.
They did the least exciting thing possible: restored the NFS export, remounted cleanly, waited for the stuck task to exit, then re-ran snapshot deletion through Proxmox. No unlock command was needed. The VM was never put at additional risk.
It was so boring it almost didn’t feel like success. That’s the point. Their practice prevented a cascade: no forced unlock, no config edits, no partial snapshots left behind, no “why is the backup chain broken” follow-up ticket.
Common mistakes: symptom → root cause → fix
1) “Cannot get exclusive lock (backup)” and you immediately run qm unlock
Symptom: backups failing, VM locked, and snapshots named “vzdump” accumulating.
Root cause: backup target offline or hung (NFS/PBS), leaving vzdump tasks stuck or failing mid-snapshot.
Fix: restore backup storage connectivity; confirm no running tasks; clean up snapshots via Proxmox; only then unlock if stale.
2) Lock persists after node reboot
Symptom: VM shows locked even though “nothing is running.”
Root cause: stale lock: field in /etc/pve/qemu-server/*.conf from an interrupted operation.
Fix: confirm no tasks are running and no QEMU process exists; then qm unlock <vmid>.
3) Killing the task PID doesn’t clear the lock
Symptom: you send SIGTERM/SIGKILL; process stays; lock stays.
Root cause: process is in uninterruptible I/O (D) due to NFS/SAN/Ceph stall; the kernel won’t kill it until I/O returns.
Fix: fix storage/network path; remount NFS or restore SAN paths; then the task can exit and release locks.
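To prove the process really is stuck in the kernel, a small sketch using the example PID from Task 6 (10783); /proc/<pid>/stack needs root and is not exposed on every kernel:
cr0x@server:~$ ps -o pid,stat,wchan:30,cmd -p 10783   # STAT "D" plus a wchan entry points at the blocked subsystem
cr0x@server:~$ cat /proc/10783/stack                  # kernel stack of the blocked task, if the kernel exposes it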
4) Migration lock that “won’t go away”
Symptom: VM locked with (migrate), but migration failed hours ago.
Root cause: partially completed migration left state on source/target; possibly VM registered on both nodes or disk active on target.
Fix: verify VM process location; ensure disks are not active on both nodes; clean up migration artifacts; then unlock stale state.
5) Snapshot delete fails repeatedly on ZFS
Symptom: lock says snapshot-delete; delete fails; snapshot remains.
Root cause: ZFS hold or clone dependency prevents deletion.
Fix: zfs holds and zfs get origin checks; release holds only when safe; delete via Proxmox afterward.
6) Ceph RBD lock errors during stop/start
Symptom: stop hangs; then “exclusive lock” errors appear on operations.
Root cause: Ceph slow ops or lingering watchers; QEMU still attached or Ceph cluster degraded.
Fix: stabilize Ceph (ceph -s); confirm watchers; ensure QEMU is stopped cleanly; then proceed.
7) Restarting pvedaemon “fixes it” once, then it returns
Symptom: locks clear briefly after daemon restart, but new tasks re-lock and fail.
Root cause: underlying storage issue remains; control-plane reset masks symptoms.
Fix: stop using restarts as a treatment; instrument storage health; fix mount/cluster issues first.
Checklists / step-by-step plan
Checklist A: “VM is locked and won’t start”
- Capture the exact error string (operation type).
- Check config lock: grep '^lock:' /etc/pve/qemu-server/<vmid>.conf.
- Check running tasks for the VM: pvesh get /nodes/<node>/tasks --source active.
- If a task is running: inspect its PID and state (ps); if it is in D state, go storage-first.
- Check if QEMU is already running: pgrep -a -f "kvm -id <vmid>".
- Check storage status in PVE: pvesm status.
- Check backend health (ZFS/Ceph/NFS) depending on where the disks/backup live.
- Only if there are no tasks, no QEMU process, and storage is stable: qm unlock <vmid>, then retry the start.
Checklist B: “Snapshot delete lock”
- List snapshots: qm listsnapshot <vmid> and identify the target.
- Check for ongoing backup tasks: running vzdump commonly leaves snapshot-delete locks.
- ZFS: check holds: zfs holds <dataset>@<snap>.
- Ceph: check watchers: rbd status <image> --pool <pool>.
- Fix the backend blocker (holds/watchers/slow ops).
- Delete the snapshot through Proxmox again (preferred): qm delsnapshot <vmid> <snapname>.
- If Proxmox still shows a stale lock and nothing is active: qm unlock <vmid>.
Checklist C: “Backup lock”
- Confirm backup target is reachable and active: pvesm status.
- NFS: validate mount is responsive: timeout 3 ls /mnt/pve/<storage>.
- PBS: check datastore connectivity (from PVE perspective) and authentication.
- Inspect vzdump task status and PID state.
- Resolve storage issue, let the task fail/exit cleanly or stop it if it can stop.
- Clean up leftover “vzdump” snapshots if present.
- Re-run one manual backup for validation before re-enabling schedules.
Checklist D: “When it’s actually safe to use qm unlock”
- No running task references the VM ID.
- No QEMU process exists for that VM ID.
- Storage backend is healthy and responsive.
- No snapshot operation is currently in progress at backend level.
- You understand what operation set the lock, and you accept the consequences of unblocking.
Interesting facts & historical context
- Proxmox VE grew out of pragmatic Linux ops culture: file-based configs and simple locking are intentional because they’re inspectable under pressure.
- /etc/pve is not a normal filesystem: it’s a cluster filesystem (pmxcfs). “A lock in the config” is shared cluster state, not just local text.
- Locking became a first-class concern as clusters matured: early single-node virtualization could “wing it”; clustered orchestration cannot.
- Snapshot semantics depend on backend: ZFS snapshots behave differently from Ceph RBD snapshots, and Proxmox has to map both into one UI story.
- NFS hard mounts are a deliberate trade-off: they prevent silent data corruption at the cost of “everything hangs when the server disappears.”
- Ceph’s “watchers” are a visibility gift: they let you see which clients have an image open, which often explains exclusive-lock failures without guesswork.
- Uninterruptible sleep (D-state) is a Linux feature, not a bug: it’s the kernel refusing to kill processes waiting on critical I/O paths.
- HA adds another locking layer: in HA setups, a VM may be “held” by the manager’s decision-making, even if humans want to intervene.
One operations quote, paraphrased and repeated endlessly in SRE circles: “Hope is not a strategy.” Treat lock errors the same way: verify, then act.
FAQ
1) What does “cannot get exclusive lock” mean in Proxmox?
It means Proxmox attempted an operation requiring exclusive access to the VM’s config/state (or a related operation) but detected an existing lock. The lock is usually recorded in VM config (lock:) and/or held by a running task.
2) Is it safe to run qm unlock <vmid>?
Safe when the lock is stale: no running tasks, no QEMU process, and backend storage is healthy. Unsafe when a backup/snapshot/migration is actually in progress or stuck in I/O—unlocking can let you perform conflicting operations.
3) Why does the lock mention “snapshot-delete” when I’m just trying to start the VM?
Because the VM is locked for a snapshot deletion operation that’s pending or stuck, and Proxmox blocks other actions to prevent inconsistent state. Your “start” request is collateral damage.
4) I killed the backup process but the lock is still there. Why?
Either the process didn’t actually die (common in D state), or Proxmox recorded the lock in the VM config and it wasn’t cleaned up. Confirm with task lists and process state; then unlock only after verifying storage isn’t still working.
5) What’s the fastest way to find who holds the lock?
Check (a) running tasks via pvesh, (b) journalctl for the last task touching the VM, and (c) whether QEMU is running. Then check backend indicators: Ceph watchers, ZFS holds, NFS mount responsiveness.
6) Can HA cause an exclusive lock issue?
Yes. HA-managed resources can be started/stopped/migrated by the HA manager, and failures can leave “in-progress” operations. Always confirm HA state and where the VM is supposed to be running before manual intervention.
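A hedged way to check before touching anything:
cr0x@server:~$ ha-manager status    # quorum, current master, and the state of each HA resource
cr0x@server:~$ ha-manager config    # is this VM listed as an HA resource, and with what requested state?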
7) Why do NFS problems show up as VM locks?
Because backup/snapshot tasks depend on writing to or reading from NFS. With hard mounts, those tasks can block indefinitely in the kernel. Proxmox keeps the lock because the task never exits.
8) The UI shows no running tasks, but the lock persists. What now?
Look directly at VM config for lock:, and check logs for prior failures. Then check OS-level lock files and daemon health. If everything is truly idle, unlock is reasonable.
9) If Ceph is HEALTH_WARN, should I unlock anyway to get things moving?
Usually no. HEALTH_WARN with slow ops often means I/O is delayed; unlocking won’t make the underlying operations complete. Stabilize Ceph first, or you’ll trade one blocked task for a pile of half-finished ones.
10) Do I ever delete the lock: line manually from 101.conf?
Avoid it unless you’re in recovery mode and you understand pmxcfs implications. Prefer qm unlock, which is the supported path and keeps tooling expectations aligned.
Conclusion: practical next steps
When Proxmox says it “cannot get exclusive lock,” don’t argue with it. Interrogate it. The lock reason tells you which subsystem to investigate, and the fastest wins come from confirming whether a task is still alive and whether storage is actually responsive.
Next steps you can apply today:
- Adopt the fast diagnosis order: tasks → QEMU → storage → unlock.
- Add one operational rule: no qm unlock until you’ve checked task state and storage health.
- For backup-related locks, treat the backup target as a production dependency. Monitor it like one.
- For Ceph/ZFS, learn the native indicators (watchers, holds). They’re often the real explanation.
- Write your own internal runbook with your storage names, pools, and failure patterns—because the next lock will happen when you’re tired.