You clicked a button in Proxmox. Start VM. Stop container. Snapshot. Backup. Migration. And then: TASK ERROR: timeout waiting for …. The UI gives you a vague noun phrase and a clock that just ran out. Production gives you angry humans and a clock that never stops.
The key move is to stop treating “timeout” as the problem. A Proxmox timeout is a symptom: some process was waiting on a specific lock, kernel state, storage operation, or cluster filesystem step, and it didn’t finish before Proxmox gave up. Your job is to identify what it was actually waiting on, then fix the subsystem that stalled.
What Proxmox timeouts really are (and why the UI hides the truth)
Proxmox is a manager. It orchestrates lots of moving parts: systemd services, QEMU, LXC, ZFS, Ceph, kernel block devices, network storage, and a cluster configuration filesystem. When you click “Stop,” Proxmox doesn’t stop a VM by magic. It asks QEMU nicely, waits for the guest to cooperate, maybe escalates to a SIGKILL, then waits for block devices to detach, volumes to unmap, and config locks to release.
The “timeout waiting for …” message is usually emitted by a Proxmox Perl layer (pvedaemon/pveproxy helpers) waiting for a state transition. The timeouts vary by operation, and they often guard against the worst day in ops: a permanently stuck task holding a lock that blocks everything else.
That’s the philosophy. The reality is messy: if the kernel is stuck in I/O wait, Proxmox will “timeout waiting for…” something perfectly reasonable, like a device unmap, while the real problem is a dead HBA, a wedged multipath path, or a Ceph OSD flapping. You need to map the phrase in the task log to the concrete dependency chain.
Here’s the mental model that actually works:
- Proxmox task (UPID) runs as a process.
- It waits on a condition: lock, PID exit, volume operation, cluster file update.
- The condition depends on a subsystem: kernel, storage, network, cluster, guest OS.
- The timeout tells you which condition wasn’t met. Logs tell you why.
One operational truth: “timeout waiting for…” is often Proxmox politely saying “something is stuck in D-state and I can’t kill it.” If you haven’t looked for D-state yet, you’re still guessing.
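Here’s the fastest way to answer that, before you read another log line (standard ps/awk, nothing exotic):

cr0x@server:~$ ps -eo state,pid,wchan:24,cmd | awk '$1=="D"'    # list D-state processes and what they wait on
cr0x@server:~$ ps -eo state | grep -c '^D'                      # count them; a rising count is a storage incident forming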
Joke #1 (short, and earned): When Proxmox says “timeout waiting for lock,” it’s basically your datacenter doing the “after you” door dance… forever.
Fast diagnosis playbook: first/second/third checks that find the bottleneck
If you do only one thing differently after reading this: stop chasing the UI string. Start from the UPID, identify the blocking process, and then identify the subsystem by what that process is blocked on.
First: Is the node itself sick?
- Check load, D-state, memory pressure, and kernel I/O stalls. If the node is unhealthy, all tasks are suspect.
- If you see many processes in D-state, treat it as storage until proven otherwise.
Second: Is it storage I/O, storage control-plane, or cluster lock?
- Storage data-plane: slow reads/writes, hung unmap, stalled flushes, SCSI resets.
- Storage control-plane: Ceph monitors slow, ZFS transaction groups stuck, iSCSI sessions flapping.
- Cluster lock: pmxcfs lock contention or a dead task holding a lock file.
Third: Is it a guest that refuses to cooperate?
- VM shutdown timeouts are often just a guest ignoring ACPI.
- But if shutdown blocks on device detach, it’s still storage/host.
Fast triage checklist (what to run immediately)
- Find the UPID in the task view; match it to log lines in /var/log/pve/tasks/.
- Identify the PID and the exact wait point (lock vs I/O vs process exit).
- Check dmesg for storage resets/timeouts in the same time window.
- Check Ceph health or ZFS health (depending on backend).
- Check cluster quorum and pmxcfs responsiveness if config/locks are involved.
Map “waiting for …” to the actual subsystem
Proxmox’s “waiting for …” strings are not random. They usually correspond to one of these categories:
1) Waiting for a lock (config and lifecycle serialization)
Common phrases: “timeout waiting for lock,” “can’t lock file,” “lock for VM …,” “lock for storage …”.
What actually timed out: a lock acquisition in Proxmox (often under /var/lock/pve-manager/ or a lock inside pmxcfs), or a lock held by another task that never finished.
Likely causes: a previous task hung (often due to storage), or pmxcfs is slow/unresponsive due to cluster issues.
2) Waiting for process exit (QEMU/LXC lifecycle)
Common phrases: “timeout waiting for VM to shutdown,” “timeout waiting for ‘qemu’ to stop,” “timeout waiting for CT to stop”.
What actually timed out: Proxmox sent a shutdown/stop request and waited for QEMU/LXC to exit; it didn’t. QEMU might be stuck in kernel I/O, or the guest ignored shutdown, or QEMU is paused waiting on storage.
3) Waiting for storage operations (create/destroy/snapshot/clone/unmap)
Common phrases: “timeout waiting for RBD unmap,” “timeout waiting for zfs destroy,” “timeout waiting for lvremove,” “timeout waiting for qm” (during disk operations), “timeout waiting for vzdump”.
What actually timed out: a storage command (rbd, zfs, lvm, iscsiadm, mount) didn’t finish. This is where the kernel and the storage backend meet—and argue.
4) Waiting for cluster filesystem updates (pmxcfs)
Common phrases: “timeout waiting for pmxcfs,” “unable to read/write /etc/pve/…,” “cfs-lock ‘…’ timed out”.
What actually timed out: pmxcfs couldn’t complete a lock or update quickly enough. Root causes include corosync/quorum issues, saturated CPU, or a disk-full condition preventing writes (yes, really).
5) Waiting for networking (migration and remote storage)
Common phrases: “migration aborted: timeout,” “waited too long for …,” “ssh connection timeout” (sometimes hidden behind generic task timeouts).
What actually timed out: TCP stalled, MTU mismatch, packet loss, or the remote side blocked on storage.
Dissect the Proxmox task: from UPID to the blocking syscall
Proxmox tasks are logged with a UPID (Unique Process ID) that encodes node, PID, start time, and task type. Your workflow should treat UPID as the primary key.
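If you want to pull a UPID apart at the shell, here is a minimal bash sketch (the UPID is the fictional one reused in the examples below; the split assumes the usual colon-separated layout: node, worker PID in hex, process start ticks, start time as hex epoch, task type, VMID, user):

cr0x@server:~$ upid='UPID:pve01:00009C88:019F39F0:676D39A9:qmshutdown:104:root@pam:'
cr0x@server:~$ IFS=: read -r _ node pid pstart start type vmid user _ <<< "$upid"
cr0x@server:~$ printf 'node=%s pid=%d type=%s vmid=%s started=%s\n' "$node" "$((16#$pid))" "$type" "$vmid" "$(date -d @"$((16#$start))" '+%F %T')"

That decoded PID is exactly what you feed into ps, /proc, and fuser in the tasks below.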
Here’s a practical way to cut through the noise:
- Find the UPID in the GUI task log.
- Open the task file on the node to see the exact internal steps and timestamps.
- Find the PID of the worker and inspect its state (running, D-state, blocked).
- Use kernel-level tools (stack traces, I/O stats) to see what it’s waiting on.
If you only read Proxmox’s high-level message, you’ll miss the part where the kernel has been shouting “SCSI timeout” for ten minutes in dmesg.
One quote, because it’s still the north star for operations: “Hope is not a strategy.” — General Gordon R. Sullivan
Practical tasks: commands, what the output means, and the decision you make
These are not “nice to know” commands. These are the ones you run while someone is waiting on a restore, your HA is flapping, and you need to decide whether to reboot a node or chase a cable.
Task 1: Find the task’s UPID log file and read it like a timeline
cr0x@server:~$ ls -1t /var/log/pve/tasks/ | head
active
index
UPID:pve01:0000A1B2:019F3A2C:676D3B1C:vzdump:101:root@pam:
UPID:pve01:00009C88:019F39F0:676D39A9:qmshutdown:104:root@pam:
cr0x@server:~$ sed -n '1,120p' /var/log/pve/tasks/UPID:pve01:00009C88:019F39F0:676D39A9:qmshutdown:104:root@pam:
starttime 1735163817
status: started
command: /usr/sbin/qm shutdown 104 --timeout 60
shutdown VM 104: initiated
shutdown VM 104: waiting for guest to shutdown...
shutdown VM 104: timeout waiting for shutdown
TASK ERROR: timeout waiting for shutdown
Meaning: You now know the exact command and timeout parameter involved (qm shutdown --timeout 60).
Decision: If this is a guest not responding to ACPI, you can escalate to qm stop. If this is repeated and coincides with storage errors, you diagnose host I/O first.
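A minimal escalation sketch, assuming the standard Proxmox pidfile location for QEMU VMs (/var/run/qemu-server/<vmid>.pid); check the process state before you reach for bigger hammers:

cr0x@server:~$ qm status 104
cr0x@server:~$ ps -o pid,stat,wchan:24 -p "$(cat /var/run/qemu-server/104.pid)"   # STAT 'D' means kernel I/O wait, not a stubborn guest
cr0x@server:~$ qm stop 104   # only if the process is NOT in D-state; otherwise fix storage first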
Task 2: Correlate Proxmox task timestamps with journal logs
cr0x@server:~$ journalctl -u pvedaemon -S "2025-12-26 09:30:00" -U "2025-12-26 09:40:00" --no-pager | tail -n 40
Dec 26 09:33:22 pve01 pvedaemon[1412]: starting task UPID:pve01:00009C88:019F39F0:676D39A9:qmshutdown:104:root@pam:
Dec 26 09:34:22 pve01 pvedaemon[1412]: task UPID:pve01:00009C88:019F39F0:676D39A9:qmshutdown:104:root@pam: finished with exit code 255
Meaning: Confirms which daemon executed it and when it failed.
Decision: If pvedaemon is repeatedly timing out across multiple tasks, look at node health and storage rather than individual VMs.
Task 3: Identify whether the worker process is stuck (and in what state)
cr0x@server:~$ ps -eo pid,ppid,stat,wchan:24,cmd | egrep 'pvedaemon|pveworker|qmshutdown|vzdump' | head -n 20
1412 1 Ss ep_poll /usr/bin/pvedaemon
9856 1412 S do_wait pveworker UPID:pve01:00009C88:019F39F0:676D39A9:qmshutdown:104:root@pam:
9874 9856 D io_schedule /usr/bin/qemu-system-x86_64 -id 104 -name vm104
Meaning: QEMU is in D-state (D in STAT) and waiting in io_schedule. That’s not “guest won’t shut down.” That’s “host kernel is waiting on I/O.”
Decision: Stop trying to kill QEMU. Investigate storage and kernel logs. A reboot might be the only way out if the kernel can’t complete the I/O.
Task 4: Check dmesg for storage timeouts/reset storms around the failure
cr0x@server:~$ dmesg -T | egrep -i 'timed out|reset|abort|blk_update_request|I/O error|nvme|scsi' | tail -n 30
[Thu Dec 26 09:33:51 2025] sd 2:0:0:1: rejecting I/O to offline device
[Thu Dec 26 09:33:51 2025] blk_update_request: I/O error, dev sdb, sector 9412896 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Thu Dec 26 09:33:52 2025] scsi host2: sas: task abort: SUCCESS scmd(0x00000000a1b2c3d4)
[Thu Dec 26 09:33:58 2025] sd 2:0:0:1: timing out command, waited 180s
Meaning: This is the smoking gun for “what actually timed out”: the kernel’s block layer couldn’t complete I/O. Proxmox just happened to be waiting.
Decision: Treat as storage incident: check cabling/HBA/paths, multipath, array health, and consider evacuating VMs.
Task 5: Determine if pmxcfs (/etc/pve) is slow or blocked
cr0x@server:~$ timeout 3 ls -la /etc/pve/nodes/pve01/qemu-server/ | head
total 0
drwxr-xr-x 2 root www-data 0 Dec 26 09:35 .
drwxr-xr-x 4 root www-data 0 Dec 26 09:20 ..
-rw-r----- 1 root www-data 0 Dec 26 09:35 104.conf
Meaning: If this hangs or times out, your cluster filesystem is sick. When ls on /etc/pve hangs, lots of “timeout waiting for lock” errors are downstream noise.
Decision: Check corosync/quorum, pmxcfs process, and node resource saturation before touching VM operations.
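Since pmxcfs runs as the pve-cluster service, a quick follow-up sketch is to ask systemd and the journal directly:

cr0x@server:~$ systemctl status pve-cluster --no-pager | head -n 5
cr0x@server:~$ journalctl -u pve-cluster -u corosync --since "15 minutes ago" --no-pager | tail -n 20   # look for retransmits, quorum changes, database errors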
Task 6: Check quorum and corosync status (cluster-level timeouts)
cr0x@server:~$ pvecm status
Cluster information
-------------------
Name: prod-cluster
Config Version: 42
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Thu Dec 26 09:36:01 2025
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.98
Quorate: Yes
Meaning: If Quorate: No or nodes flap, pmxcfs locks can stall or fail.
Decision: Stabilize the cluster network before you trust any “timeout waiting for lock” diagnostics.
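Corosync’s own tooling is more direct than the GUI when links flap; a minimal sketch:

cr0x@server:~$ corosync-cfgtool -s   # per-link status for knet; a down or disconnected link explains flapping quorum
cr0x@server:~$ journalctl -u corosync --since "1 hour ago" --no-pager | egrep -i 'retransmit|token|link' | tail -n 20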
Task 7: Identify which lock is held and by what task
cr0x@server:~$ ls -la /var/lock/pve-manager/ | head
total 0
drwxr-xr-x 2 root root 120 Dec 26 09:36 .
drwxr-xr-x 9 root root 180 Dec 26 08:10 ..
-rw-r--r-- 1 root root 0 Dec 26 09:31 lock-104.conf
-rw-r--r-- 1 root root 0 Dec 26 09:33 vzdump-101.lock
cr0x@server:~$ fuser -v /var/lock/pve-manager/lock-104.conf
USER PID ACCESS COMMAND
/var/lock/pve-manager/lock-104.conf:
root 9856 F.... pveworker
Meaning: The lock isn’t mystical. It’s a file, and a PID holds it open.
Decision: Investigate why that PID is stuck. Killing the worker may release the lock, but if it’s stuck in D-state, you won’t kill it. Fix the underlying I/O or plan a node reboot.
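A small helper sketch that chains the two checks together (it assumes a single PID holds the lock; adjust if fuser reports several):

cr0x@server:~$ pid=$(fuser /var/lock/pve-manager/lock-104.conf 2>/dev/null | tr -d ' ')
cr0x@server:~$ ps -o pid,stat,wchan:24,cmd -p "$pid"
cr0x@server:~$ cat /proc/"$pid"/stack   # io_schedule or fsync frames here mean storage, not Proxmox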
Task 8: Inspect a stuck PID with kernel stack traces (advanced, decisive)
cr0x@server:~$ cat /proc/9874/stack
[<0>] io_schedule+0x12/0x40
[<0>] bit_wait_io+0x11/0x60
[<0>] __wait_on_bit+0x6a/0x90
[<0>] out_of_line_wait_on_bit+0x8a/0xb0
[<0>] __filemap_fdatawait_range+0xd3/0x120
[<0>] filemap_write_and_wait_range+0x4a/0x90
[<0>] blkdev_fsync+0x2e/0x50
[<0>] vfs_fsync_range+0x4a/0xb0
Meaning: QEMU is blocked waiting for flush/fsync to complete. That’s storage. Possibly the backing device is hung, saturated, or fencing is in progress.
Decision: Stop tuning guest shutdown timeouts. Go fix the storage path or backend.
Task 9: If using ZFS, check pool health and spot latency sources
cr0x@server:~$ zpool status -v
pool: rpool
state: ONLINE
scan: scrub repaired 0B in 00:10:12 with 0 errors on Thu Dec 26 03:10:21 2025
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda2 ONLINE 0 0 0
sdb2 ONLINE 0 0 0
errors: No known data errors
cr0x@server:~$ zpool iostat -v rpool 1 5
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
rpool 92.3G 380G 12 420 1.2M 48.9M
mirror 92.3G 380G 12 420 1.2M 48.9M
sda2 - - 6 210 600K 24.4M
sdb2 - - 6 210 600K 24.5M
Meaning: zpool status tells you if you’re degraded/faulted; zpool iostat shows if you’re hammering the pool. Not all hangs show up as DEGRADED—latency can still be awful on “ONLINE” pools.
Decision: If iostat shows high writes and your tasks involve snapshots/destroys, consider contention (backup window, replication, scrub) and throttle or reschedule.
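On recent OpenZFS, zpool iostat can print latency columns directly, which is usually more telling than raw throughput (pool name assumed to be rpool as above):

cr0x@server:~$ zpool iostat -l rpool 1 5   # total_wait/disk_wait/syncq_wait columns show where the milliseconds go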
Task 10: If using Ceph, check cluster health and slow ops
cr0x@server:~$ ceph -s
cluster:
id: 3c9f3a2c-1111-2222-3333-019f39f0abcd
health: HEALTH_WARN
12 slow ops, oldest one blocked for 87 sec, mon.pveceph01 has slow ops
services:
mon: 3 daemons, quorum pveceph01,pveceph02,pveceph03 (age 6h)
mgr: pveceph01(active, since 6h)
osd: 9 osds: 9 up (since 6h), 9 in (since 6h)
data:
pools: 2 pools, 256 pgs
objects: 1.2M objects, 4.6 TiB
usage: 14 TiB used, 18 TiB / 32 TiB avail
pgs: 255 active+clean, 1 active+undersized+degraded
cr0x@server:~$ ceph health detail | head -n 40
HEALTH_WARN 12 slow ops, oldest one blocked for 87 sec, mon.pveceph01 has slow ops
SLOW_OPS 12 slow ops, oldest one blocked for 87 sec, osd.4 has slow ops
slow ops on osd.4, oldest one blocked for 87 sec, committed, flags = none
Meaning: Slow ops are your “what actually timed out.” Proxmox waits for an RBD operation; Ceph is stuck finishing it.
Decision: Stop retrying Proxmox tasks. Stabilize Ceph: check OSD latency, backfill/recovery pressure, network, and failing disks.
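Two follow-ups that narrow “Ceph is slow” down to a culprit; the admin-socket query has to run on the node hosting that OSD (osd.4 is the one flagged above):

cr0x@server:~$ ceph osd perf | sort -k3 -n | tail -n 5    # per-OSD commit/apply latency; one outlier usually means one sick disk
cr0x@server:~$ ceph daemon osd.4 dump_ops_in_flight | head -n 20   # what that OSD is chewing on right now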
Task 11: If using iSCSI/multipath, verify sessions and path health
cr0x@server:~$ iscsiadm -m session
tcp: [1] 10.10.20.50:3260,1 iqn.2001-05.com.example:storage.lun1 (non-flash)
tcp: [2] 10.10.20.51:3260,1 iqn.2001-05.com.example:storage.lun1 (non-flash)
cr0x@server:~$ multipath -ll | head -n 60
mpatha (3600a098038314b6d4a2b4b6f6f4b6d31) dm-3 EXAMPLE,Array
size=2.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 2:0:0:1 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
`- 3:0:0:1 sdc 8:32 active ready running
Meaning: If sessions are missing or paths show failed/faulty, your timeouts are the host waiting on a dead path. If you see queue_if_no_path, I/O can hang “politely” rather than error fast.
Decision: Fix pathing before you touch Proxmox. Consider whether your multipath features are appropriate for virtualization; hanging forever is rarely what you want.
Task 12: For NFS/CIFS storage, find out if mounts are stuck (classic “everything hangs”)
cr0x@server:~$ mount | egrep ' type nfs| type cifs'
10.10.30.10:/exports/vmstore on /mnt/pve/nfs-vmstore type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,sec=sys,clientaddr=10.10.30.21)
cr0x@server:~$ timeout 3 stat /mnt/pve/nfs-vmstore
File: /mnt/pve/nfs-vmstore
Size: 4096 Blocks: 8 IO Block: 4096 directory
Device: 0,53 Inode: 2 Links: 15
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2025-12-26 09:36:10.000000000 +0000
Modify: 2025-12-26 09:33:01.000000000 +0000
Change: 2025-12-26 09:33:01.000000000 +0000
Birth: -
Meaning: If stat hangs, NFS is wedged and your “timeout waiting for …” is a downstream effect. With hard mounts, processes can sit in D-state forever waiting for the server.
Decision: Fix the NAS/network first. Consider mount options and monitoring so a flaky filer doesn’t freeze your whole node.
Task 13: Confirm whether vzdump/backups are the hidden lock-holder
cr0x@server:~$ pvesh get /nodes/pve01/tasks --limit 5
┌──────────────────────────────────────────────────────────────────────────────┬───────────┬─────────┬────────────┬──────────┬────────────┐
│ upid │ user │ status │ starttime │ type │ id │
╞══════════════════════════════════════════════════════════════════════════════╪═══════════╪═════════╪════════════╪══════════╪════════════╡
│ UPID:pve01:0000A1B2:019F3A2C:676D3B1C:vzdump:101:root@pam: │ root@pam │ running │ 1735163950 │ vzdump │ 101 │
│ UPID:pve01:00009C88:019F39F0:676D39A9:qmshutdown:104:root@pam: │ root@pam │ error │ 1735163817 │ qmshutdown│ 104 │
└──────────────────────────────────────────────────────────────────────────────┴───────────┴─────────┴────────────┴──────────┴────────────┘
Meaning: A running backup may hold snapshot or config locks, or it may simply saturate storage and make other tasks time out.
Decision: If backups overlap with operational tasks, you either schedule better or accept risk. Pick one deliberately.
Task 14: Check node-level I/O pressure quickly
cr0x@server:~$ iostat -x 1 3
Linux 6.2.16-20-pve (pve01) 12/26/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
11.20 0.00 6.10 42.30 0.00 40.40
Device r/s w/s rkB/s wkB/s await svctm %util
dm-3 1.2 420.0 9.6 48960.0 85.3 1.9 98.7
Meaning: %iowait is huge; await and %util are high. Your node is waiting on storage. Proxmox timeouts are just the messenger.
Decision: Throttle workload, pause heavy jobs, investigate backend saturation or failure. If it’s a shared array, coordinate with storage team (or become the storage team).
Storage-specific timeout failure modes (ZFS, Ceph, iSCSI/NFS)
Most “timeout waiting for …” incidents in Proxmox are storage incidents wearing a management UI’s clothes. The UI times out because storage is slow. Or gone. Or “sort of there but not answering,” which is worse.
ZFS: when “ONLINE” still hurts
ZFS is great at data integrity and reasonably honest about failures. It is not obligated to be fast when you ask it to do expensive metadata operations while you’re also slamming the pool with VM writes.
Timeout patterns you’ll see:
- Snapshot/destroy/clone operations stalling due to high TXG pressure or slow device flush.
- VM shutdown/stop stalling because QEMU is waiting on fsync/flush to the ZVOL backing device.
- Replication jobs timing out because sending snapshots can’t keep up or the network is congested.
What to look for:
- Pool-level latency: iostat await and zpool iostat throughput.
- ARC pressure and memory: ZFS is a memory consumer with opinions.
- Transaction group sync times: long syncs mean everything that needs a sync waits longer.
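OpenZFS on Linux exposes per-TXG timing as a kstat; a quick, hedged way to eyeball sync times (path and pool name assumed, adjust for your pool):

cr0x@server:~$ tail -n 5 /proc/spl/kstat/zfs/rpool/txgs   # times are in nanoseconds; a growing sync-time column means everything queues behind it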
Ceph RBD: control-plane health becomes data-plane latency
Ceph’s trick is distributed reliability. Its tax is that when the cluster is unhealthy, even “simple” operations can slow down: mapping/unmapping RBD, snapshots, image resize, flatten, etc.
Timeout patterns:
- “timeout waiting for RBD unmap” during VM stop, migration cleanup, or delete disk.
- Backup tasks time out because reads are slow during recovery/backfill.
- HA failover looks broken because storage operations are stuck, not because HA is dumb.
Ceph-specific reality checks:
- Slow ops older than a minute are not “normal background noise” in a VM cluster. They’re a symptom of resource starvation, network issues, or failing disks.
- If Ceph is recovering/backfilling hard, Proxmox tasks that expect timely storage responses will time out. You either tune recovery or schedule disruptive operations away from peak.
iSCSI and multipath: the joy of “queue_if_no_path”
The most painful iSCSI failures are not the loud ones. They’re the silent hangs: a path dies, multipath queues I/O, and every process touching that LUN enters D-state. Proxmox doesn’t know. It waits for a command to finish. It never does.
If you’re running virtualization on iSCSI, you need to decide—explicitly—whether you prefer:
- Fail fast: tasks error out, VMs crash or pause, but the node is recoverable without reboot.
- Hang forever: data-plane waits for storage to return, which might preserve I/O ordering, but it can freeze your host.
Don’t let default multipath policy decide this for you. Defaults were made by committees, and committees do not get paged.
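For reference, this is the no_path_retry knob in multipath.conf. A minimal, hedged fragment; the numbers are placeholders, not a recommendation, and your array vendor’s guidance wins:

# /etc/multipath.conf (fragment)
defaults {
    polling_interval 5
    no_path_retry 12      # retry ~12 checker intervals, then fail I/O instead of queueing forever
}

After changing it, reload multipathd and confirm with multipath -ll that queue_if_no_path is no longer in the features line.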
NFS: “hard” mounts can freeze management too
With NFS, a common surprise is that storage isn’t just where disks live; it’s also where backups, templates, and sometimes ISO storage live. A wedged NFS mount can cause backup tasks to hang, but it can also block operations that try to list or update content on that storage.
When NFS flakes out, Proxmox can time out waiting for a storage scan, a backup write, or a mount operation. Meanwhile, your shell commands hang, and your “simple” reboot plan becomes a fencing plan.
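Before the outage, it’s worth seeing what you actually mounted, and therefore how it will fail, instead of trusting memory or the storage GUI (mountpoint taken from the earlier example):

cr0x@server:~$ findmnt -t nfs,nfs4 -o TARGET,SOURCE,OPTIONS   # hard vs soft, timeo, retrans: these decide hang-forever vs fail-fast
cr0x@server:~$ timeout 3 stat -t /mnt/pve/nfs-vmstore >/dev/null && echo responsive || echo wedged-or-slow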
Cluster filesystem and locks: pmxcfs, corosync, and stale lock folklore
Proxmox’s cluster config lives in /etc/pve, backed by pmxcfs. It’s convenient and also the reason a “storage issue” can manifest as “can’t write config” if the node is overloaded or the cluster is unstable.
pmxcfs symptoms that masquerade as everything else
- CLI commands hang when they access /etc/pve (e.g., qm config).
- “timeout waiting for cfs-lock” when starting/stopping VMs, especially around HA or migrations.
- GUI becomes slow because pveproxy calls touch cluster config repeatedly.
Locks: the good, the bad, and the hung
Locks exist to prevent two operations from stomping on the same VM config or disk state. They’re correct. They’re also a pain when the holder is dead.
Rules that save you:
- Never delete lock files blindly as a first move. If the lock is guarding an in-flight storage operation, you can create split-brain at the VM level (two conflicting actions) even if the cluster itself is fine.
- Identify the lock owner PID. If it’s a normal stuck process (not D-state), you might terminate it cleanly. If it’s D-state, stop pretending signals work.
- Lock timeouts are often secondary. A hung storage unmap causes a stuck task; the stuck task holds the lock; the next task times out waiting for the lock.
Joke #2 (short, and too true): You can’t kill -9 a process in D-state. That’s Linux’s way of saying “please hold.”
Three mini-stories from corporate reality
Mini-story 1: The incident caused by a wrong assumption
They ran a small Proxmox cluster for internal services: ticketing, CI runners, a couple of databases that “weren’t critical” (until they were). One morning, VM stops started failing with “TASK ERROR: timeout waiting for shutdown”. The on-call assumed it was a guest OS problem: “Windows updates again,” someone muttered, confidently wrong.
They increased the shutdown timeout. Then they tried qm stop. It hung. They tried killing QEMU. It didn’t die. The assumption shifted to “Proxmox bug.” They restarted pvedaemon. Same. Meanwhile, more tasks queued up, all timing out waiting for locks.
The turning point was a simple ps showing multiple QEMU processes in D-state, all waiting in io_schedule. A quick look at dmesg showed a storm of SCSI timeouts on one path. Multipath was configured to queue I/O when paths disappeared. Great for some workloads. Catastrophic for a hypervisor when the array controller rebooted.
The fix was not “more timeouts.” The fix was restoring a healthy path (and later changing multipath behavior so the host fails fast enough to recover). The postmortem takeaway was blunt: a VM shutdown timeout is rarely about shutdown if the hypervisor is stuck on I/O. They started treating “D-state + timeouts” as a storage incident by default, which made future incidents shorter and less theatrical.
Mini-story 2: The optimization that backfired
A different company wanted faster backups. They had a nightly vzdump window, and it was creeping into business hours. Someone proposed an “easy win”: enable more parallelism and push snapshot-based backups harder. Storage was “plenty fast,” based on peak throughput numbers from a vendor slide.
The first few nights looked great. Backups completed earlier. Then the cluster hit a day with heavier daytime write load. During the backup window overlap, VM snapshots and cleanup triggered a wave of timeout waiting for … errors: snapshot commits stalling, VM stops timing out, replication lagging. The operations team chased locks, killed tasks, and generally fought symptoms.
The real cause was transactional latency: ZFS sync times spiked under concurrent snapshot churn plus VM write I/O. The pool stayed “ONLINE,” but latency went ugly. Proxmox timeouts weren’t lying; they were just impatient. The optimization—more concurrency—turned out to be a latency amplifier.
The fix wasn’t dramatic: they reduced backup concurrency, moved the heaviest jobs to a separate window, and added simple I/O latency alerting. Throughput stayed decent, and the system stopped timing out. The lesson was classic SRE: optimizing for the metric you can brag about (backup duration) can quietly destroy the metric users feel (tail latency).
Mini-story 3: The boring but correct practice that saved the day
A regulated environment ran Proxmox with Ceph. Nothing exciting, by design. Their “boring practice” was strict correlation: every task failure had to be tied to a timestamped backend health signal (Ceph health detail, kernel logs, network errors). No exceptions. It felt bureaucratic until it wasn’t.
One afternoon, migrations started failing with timeouts, and a few VM stops hung. The UI messages varied—some said timeouts waiting for storage, others looked like lock timeouts. Engineers were tempted to blame Proxmox upgrades from the previous week. The boring practice forced them to pull Ceph slow ops and kernel logs for the same time slice.
They found a pattern: slow ops spiked whenever a specific top-of-rack switch port flapped. Corosync remained quorate, so the cluster looked “fine,” but Ceph traffic saw intermittent loss and retransmits. That increased latency, which increased timeouts, which increased lock contention. Cascading failure with a polite face.
Because they had the correlation habit, the incident didn’t become a week-long witch hunt. They moved Ceph traffic off the flapping port, replaced a transceiver, and the timeouts stopped. The postmortem was dull. That’s the compliment.
Common mistakes: symptom → root cause → fix
1) Symptom: “timeout waiting for lock …” across many VMs
Root cause: One stuck task holds a lock; often the stuck task is blocked on storage I/O or pmxcfs.
Fix: Identify the lock owner with fuser, inspect its state (D-state?), then fix storage/pmxcfs. Don’t delete locks as your first move.
2) Symptom: VM shutdown timeouts, then “qm stop” also hangs
Root cause: QEMU is blocked in the kernel (D-state), typically on disk flush/unmap or dead storage path.
Fix: Check ps for D-state and dmesg for block errors. Restore storage health; reboot node if kernel I/O is irrecoverably hung.
3) Symptom: vzdump jobs time out and leave snapshots/locks behind
Root cause: Backup target storage is slow/unreachable (NFS issues are frequent), or snapshot commit is slow (Ceph slow ops / ZFS latency).
Fix: Verify mount responsiveness with stat and check backend health. Reduce concurrency, reschedule, and ensure backup storage is monitored like production storage (because it is).
4) Symptom: “timeout waiting for RBD unmap” during stop/delete
Root cause: Ceph slow ops, hung client I/O, or network issues between node and Ceph.
Fix: Check ceph -s and ceph health detail. If slow ops exist, treat Ceph first. Consider temporarily pausing disruptive operations.
5) Symptom: operations touching /etc/pve hang or error
Root cause: pmxcfs is blocked due to corosync instability, CPU starvation, or a wedged node.
Fix: Check quorum (pvecm status), look at corosync logs, check CPU/memory pressure, and stabilize the cluster network.
6) Symptom: migrations time out but storage seems OK
Root cause: migration network congestion/MTU mismatch or remote node storage latency.
Fix: Validate network health; run migrations during lower load; check both source and destination node I/O wait. Migration is two nodes and a network, not a checkbox.
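MTU mismatches on the migration network are common enough to deserve a standing check; a minimal sketch (the destination IP is a placeholder for your migration target):

cr0x@server:~$ ping -M do -s 8972 -c 3 10.10.40.22   # jumbo path: 8972 payload + 28 bytes of headers = 9000
cr0x@server:~$ ping -M do -s 1472 -c 3 10.10.40.22   # standard 1500 MTU path

If the large ping fails with “message too long” while the small one passes, the path does not carry jumbo frames end to end.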
Checklists / step-by-step plan
Step-by-step: pinpoint what actually timed out
- Get the UPID from the Proxmox task view.
- Read the task file in /var/log/pve/tasks/ to identify the exact underlying command and the phase that stalled.
- Identify the worker PID holding the lock (if applicable) using fuser on the lock file, or by matching process command lines.
- Check process state with ps:
  - If D: stop wasting time on signals; it’s I/O or kernel blockage.
  - If S/R: it may be waiting on another process, network, or a lock.
- Check kernel logs (dmesg -T) for storage timeouts/resets in the same time window.
- Check backend health:
  - ZFS: zpool status, zpool iostat
  - Ceph: ceph -s, ceph health detail
  - iSCSI: iscsiadm -m session, multipath -ll
  - NFS: stat on mountpoints
- Decide recovery action:
  - If storage is failing: evacuate/migrate where possible; stop churn; fix backend; consider reboot if kernel is wedged.
  - If pmxcfs/quorum: stabilize corosync and cluster communications first.
  - If the guest won’t shut down but the host is healthy: use stop/kill escalation with awareness of data risk.
Checklist: “Should I reboot this node?”
- Do you have processes in D-state tied to critical tasks?
- Do kernel logs show repeated I/O timeouts or device offline events?
- Is the storage backend already recovered but the node is still stuck?
- Can you migrate/evacuate VMs first (shared storage, HA, or planned downtime)?
If you answered yes to the first two and you can’t clear the condition, a controlled reboot is often the least bad option. The worst option is letting the node rot in a half-dead state while tasks pile up and locks accumulate.
Facts and context: why these timeouts exist
- Fact 1: Proxmox task tracking uses UPIDs so asynchronous operations can be logged and audited independent of the GUI session.
- Fact 2: pmxcfs is a FUSE-based cluster filesystem; that convenience means normal file operations can block on cluster health.
- Fact 3: Linux D-state (uninterruptible sleep) exists to protect kernel and I/O integrity; it’s why some “kills” don’t kill.
- Fact 4: QEMU shutdown paths commonly involve storage flush semantics; a “shutdown timeout” can be a disk flush timeout in disguise.
- Fact 5: Ceph “slow ops” are not just performance metrics; they can directly block client operations like RBD unmap/snapshot/remove.
- Fact 6: Multipath’s “queue if no path” behavior was designed to preserve I/O ordering through path outages, but it can freeze a hypervisor when outages are prolonged.
- Fact 7: NFS “hard” mounts retry forever by default; that’s great for data correctness and terrible for interactive management when the server is gone.
- Fact 8: Snapshot-heavy workflows shift load from pure throughput to metadata and latency; timeouts often track tail latency, not average bandwidth.
- Fact 9: Cluster lock timeouts often appear after the real incident has started; they’re frequently secondary failures from earlier stalls.
FAQ
1) What does “timeout waiting for lock” usually mean in Proxmox?
It means a Proxmox task couldn’t acquire a lock (VM config, storage, or cluster config) within the allowed time. The real cause is typically a prior task holding that lock—often because it’s stuck on storage I/O or pmxcfs.
2) If a VM shutdown times out, should I just increase the timeout?
Only if the host is healthy and the guest is simply slow to shut down. If QEMU is in D-state or the node is in heavy I/O wait, increasing the timeout just makes you wait longer for a thing that can’t complete.
3) How do I tell if it’s a guest issue or a host/storage issue?
Check the QEMU process state. If it’s in D-state with io_schedule or blocked in fsync, it’s host/storage. If it’s running normally and only ACPI shutdown is ignored, it’s likely guest behavior.
4) Why do locks pile up after one bad task?
Because Proxmox serializes many lifecycle operations to avoid corruption. One stuck operation holds a lock; subsequent operations time out waiting for it. The lock pileup is the smoke, not the fire.
5) What’s the quickest indicator that Ceph is the problem?
ceph -s showing slow ops, degraded PGs, or recovering/backfilling under pressure. If slow ops exist, assume Proxmox storage operations will time out until Ceph stabilizes.
6) Can I safely delete lock files under /var/lock/pve-manager?
Sometimes, but it’s a last resort. First identify the owner PID. If the owner is a legitimately running task, deleting the lock can create concurrent conflicting operations. If the owner is gone and you’ve confirmed no underlying storage op is active, removing stale locks may be acceptable—document what you did.
7) Why does /etc/pve access hang when the cluster has issues?
Because /etc/pve is pmxcfs (FUSE). File operations can depend on cluster messaging and lock coordination. If corosync is unstable or the node is overloaded, normal reads/writes to that tree can block.
8) What’s the right order to troubleshoot: Proxmox logs or kernel logs?
Start with the task log to identify what Proxmox was doing, then immediately check kernel logs for I/O errors in the same time window. Kernel logs usually tell you why the operation stalled.
9) Why do timeouts happen during backups more than anything else?
Backups are heavy on I/O, snapshots, and metadata. They stress both storage throughput and tail latency. If your storage is “fine” during normal operation, backups can still push it into the slow zone where Proxmox’s waiting loops hit their timeout thresholds.
Next steps you should do this week
If you want fewer surprise timeouts and shorter incidents, do these practical next steps:
- Train your team to start from UPID and read /var/log/pve/tasks/. Make it muscle memory.
- Add D-state and I/O wait monitoring on every node. If your monitoring can’t alert on D-state spikes, it’s not monitoring; it’s scrapbooking.
- Baseline your storage backend during backups and migrations. Capture iostat latency, Ceph slow ops, ZFS iostat—whatever you use.
- Decide your failure policy for multipath/NFS (fail-fast vs hang). Defaults are not a strategy.
- Run one game day: intentionally saturate backup storage or simulate a path flap in a controlled window, and practice identifying “what actually timed out” in under ten minutes.
Proxmox timeouts feel vague because they’re a management layer waiting on lower layers. Once you consistently trace the wait to a process, then to a syscall, then to the backend, the error message stops being mysterious. It becomes a breadcrumb.