Proxmox Stuck Tasks: How to Clean Up Hung Jobs and Processes Safely

Some days, Proxmox is a calm, sensible hypervisor. Other days it’s a haunted house: backups that never finish, “TASK ERROR: interrupted by signal,” an LXC that refuses to stop, and a GUI “running…” spinner that outlives your coffee.

The hard part isn’t “how do I kill it.” The hard part is knowing what you are actually killing, what the system is waiting on (storage? cluster quorum? a kernel D-state?), and how to unwind it without turning a temporary hang into a corruption event.

What “stuck” really means in Proxmox (and why the GUI lies)

In Proxmox, most operations you click in the UI become a background task handled by a small cast of daemons: pvedaemon, pveproxy, pvescheduler, and friends. Those tasks may spawn other processes: vzdump for backups, qm for QEMU VM actions, pct for LXC actions, zfs utilities, rbd commands, or just plain rsync.

“Stuck” usually means one of these conditions:

  • A userspace process is waiting on a lock, a network socket, or a child process that’s not returning.
  • A process is blocked in kernel space (D-state), typically waiting for disk or network I/O. This is the nasty one: you can send signals all day and it won’t die until the kernel call returns.
  • A Proxmox “lock” is held (like VM config lock or vzdump lock). Sometimes it’s valid. Sometimes it’s stale, meaning the job died but the lock didn’t get cleaned up.
  • The cluster filesystem (pmxcfs) is unhappy. If the node can’t reliably commit config changes, tasks that touch VM config or storage definitions can stall in weird ways.
  • Storage is the bottleneck, and the task is the messenger. ZFS scrub, a saturated pool, a wedged NFS mount, a degraded Ceph cluster—those can freeze the “simple” actions like stop/start, snapshot, or backup.

Proxmox UI status isn’t authoritative. It’s a view into task logs and the cluster state, which can lag or fail to update when the underlying plumbing is congested. Don’t “click harder.” Observe like an adult.

Fast diagnosis playbook (first/second/third checks)

If you’re in an outage, you don’t have time to admire the subtlety of Linux process states. You need a fast funnel to find the bottleneck.

First: confirm whether this is storage, cluster, or a single process

  • Is the host I/O-bound? High iowait, blocked tasks in dmesg, ZFS pool latency, Ceph slow ops, or a wedged mount screams “storage.”
  • Is the node in quorum and pmxcfs writable? If cluster comms are broken, config-changing tasks can “hang” while retrying or waiting.
  • Is it only one VM/container action? Then you likely have a per-guest issue: QEMU stuck, QMP not responding, a device passthrough teardown, a backup snapshot hold, or a lock file.
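
A minimal first-pass triage covering these three checks, run as root on the affected node (the exact grep patterns and the 5-second timeout are illustrative, not canonical):

cr0x@server:~$ ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'          # any uninterruptible (D-state) processes, and what they wait on
cr0x@server:~$ dmesg -T | grep -i 'blocked for more than' | tail -n 5  # kernel hung-task warnings
cr0x@server:~$ pvecm status | grep -E 'Quorate|Nodes'                  # quorum at a glance
cr0x@server:~$ timeout 5 ls /etc/pve >/dev/null && echo "pmxcfs reads OK" || echo "pmxcfs read hung or failed"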

Second: locate the exact process and its wait reason

  • Get the task UPID from the UI/task log or pvesh.
  • Find the PID and inspect state: ps, top, pstree, cat /proc/PID/stack, wchan.
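
The PID is encoded in the UPID itself, so you can decode it instead of grepping for it; a small sketch with a hypothetical UPID value (the fields are node:pid:pstart:starttime:type:id:user, and pid/starttime are hexadecimal):

cr0x@server:~$ UPID='UPID:pve1:00001E22:03F4C2B5:676D2C1A:vzdump:101:root@pam:'  # hypothetical; paste your own
cr0x@server:~$ echo $((16#$(cut -d: -f3 <<<"$UPID")))                            # third field = worker PID, in hex
7714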

Third: choose the least risky intervention

  • If it’s a stale Proxmox lock: remove the lock after verifying no active process owns it.
  • If it’s a userspace hang: stop/kill child processes in order, then the parent task.
  • If it’s D-state I/O wait: don’t play whack-a-mole with kill -9. Fix the I/O path first, then clean up.
  • If it’s cluster/pmxcfs: restore quorum or temporarily operate locally with discipline (and awareness of risk).

Interesting facts and historical context (why this keeps happening)

  1. Linux “D-state” has been a classic ops trap for decades: a process stuck in uninterruptible sleep ignores signals until the kernel operation completes. This is why “kill -9” sometimes looks like it’s broken.
  2. Proxmox tasks use UPIDs (Unique Process IDs) that encode node, PID, start time, and type. It’s not just an ID; it’s a breadcrumb trail.
  3. pmxcfs is a FUSE-based cluster filesystem that lives in RAM and syncs config via corosync. If it’s stuck, config reads/writes can get weird despite the underlying disk being fine.
  4. Snapshot-based backups changed the failure modes: modern Proxmox backup flows rely on QEMU guest agent hooks, snapshot creation, and storage flushes. Failures are often “coordination bugs,” not just slow disks.
  5. ZFS intentionally prioritizes consistency over responsiveness. Under certain error paths, ZFS will wait rather than guess. That’s a feature until you’re staring at a frozen backup task at 02:00.
  6. NFS “hard” mounts can hang indefinitely by design (they keep retrying). That’s great for correctness and terrible for task completion when the NAS is having a moment.
  7. Ceph’s “slow ops” are often networking problems in disguise. The storage symptom shows up first; the root cause can be a flapping switch port.
  8. QEMU stop/reboot hangs are frequently about device teardown: a stuck virtio-blk flush, a dead iSCSI session, or a PCI passthrough device that won’t reset cleanly.

Safety rules: what to do before you touch anything

When tasks hang, people get aggressive. That’s how you turn “a stuck backup” into “a corrupt VM disk and an awkward meeting.” Here’s the baseline discipline.

Rule 1: Don’t delete locks until you prove they’re stale

A Proxmox lock is often there because something is actively mutating VM state. If you remove it while the job is still running, you can overlap operations that were meant to be serialized.

Rule 2: Prefer stopping the child process, not the parent daemon

Killing pvedaemon because one backup is stuck is like rebooting a router because a single browser tab won’t load. Technically it does something; it’s rarely the best first move.

Rule 3: If the process is in D-state, stop trying to “kill” it

Fix what it’s waiting on: storage, network path, or kernel driver issue. Then the process will unwind. Or it won’t, and you’re looking at a node reboot window.

Rule 4: Capture evidence before intervention

At minimum: task log, journalctl around the time of failure, and process state. If you later need to explain what happened, “I killed some stuff and it went away” is not an incident report.

Joke #1: “Kill -9” is not a troubleshooting strategy; it’s an admission of defeat with punctuation.

Practical tasks (commands + what the output means + the decision you make)

These are the moves I actually use in production. Each task includes a realistic command, an example of the output you’ll see, what it means, and the decision it drives.

Task 1: List running Proxmox tasks and identify the UPID

cr0x@server:~$ pvesh get /nodes/pve1/tasks --source active
┌──────────────────────────────────────────────────────────────────────────────────────┬───────────────┬────────┬─────────────┬──────────┬────────┐
│ upid                                                                                 │ type          │ status │ starttime   │ user     │ id     │
╞══════════════════════════════════════════════════════════════════════════════════════╪═══════════════╪════════╪═════════════╪══════════╪════════╡
│ UPID:pve1:00012A3B:03F4C2B5:676D2C1A:vzdump:101:root@pam:                            │ vzdump        │ running│ 1735019546  │ root@pam │ 101    │
│ UPID:pve1:00013210:03F4C2C8:676D2C4B:qmstop:205:root@pam:                            │ qmstop        │ running│ 1735019595  │ root@pam │ 205    │
└──────────────────────────────────────────────────────────────────────────────────────┴───────────────┴────────┴─────────────┴──────────┴────────┘

What it means: You have two running tasks. One is a backup for VM 101. One is a stop action for VM 205.

Decision: Pick one UPID to investigate. Don’t shotgun-kill services until you know which job is stuck and why.

Task 2: Pull the task log for the stuck job

cr0x@server:~$ pvesh get /nodes/pve1/tasks/UPID:pve1:00012A3B:03F4C2B5:676D2C1A:vzdump:101:root@pam:/log
INFO: starting new backup job: vzdump 101 --storage backup-nfs --mode snapshot --compress zstd
INFO: VM Name: app01
INFO: include disk 'scsi0' 'local-zfs:vm-101-disk-0' 200G
INFO: creating vzdump archive '/mnt/pve/backup-nfs/dump/vzdump-qemu-101-2025_12_26-01_12_26.vma.zst'
INFO: started backup task 'UPID:pve1:00012A3B:03F4C2B5:676D2C1A:vzdump:101:root@pam:'
INFO: status: 0% (loading)
INFO: backup is sparse: 58% (data 42%)

What it means: The job started fine and is now stuck early (“loading”), often a hint it’s waiting on storage I/O or a snapshot/flush.

Decision: Next check system and storage health before killing anything.

Task 3: Identify the PID behind a UPID

cr0x@server:~$ ps -eo pid,etimes,stat,cmd | egrep 'vzdump|pvedaemon|UPID' | head
  7714   1289 Ssl  pvedaemon worker UPID:pve1:00012A3B:03F4C2B5:676D2C1A:vzdump:101:root@pam:
  7760   1281 S    /usr/bin/perl /usr/bin/vzdump 101 --mode snapshot --storage backup-nfs --compress zstd

What it means: The worker is PID 7714; it spawned the real work process (vzdump) PID 7760.

Decision: Investigate PID 7760. Killing the worker without cleaning children can leave half-finished artifacts.

Task 4: Check whether the process is in D-state (uninterruptible sleep)

cr0x@server:~$ ps -o pid,stat,wchan,cmd -p 7760
  PID STAT WCHAN  CMD
 7760 D    nfs_wait_on_request /usr/bin/perl /usr/bin/vzdump 101 --mode snapshot --storage backup-nfs --compress zstd

What it means: The process is in D state and waiting in nfs_wait_on_request. Signals won’t help until the NFS call returns.

Decision: Stop treating this as a “Proxmox task problem.” It’s an NFS problem. Go check the mount, network path, and the NAS.

Task 5: Inspect stuck mounts (the silent task killer)

cr0x@server:~$ mount | grep '/mnt/pve/backup-nfs'
10.10.20.50:/export/pve-backups on /mnt/pve/backup-nfs type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,sec=sys,clientaddr=10.10.20.11,local_lock=none)

What it means: It’s a hard NFS mount (normal for correctness). If the server is slow or unreachable, client processes can hang indefinitely.

Decision: If the NFS server is down, you either restore it or plan a controlled unmount/reboot. Don’t expect Proxmox to “unstick” itself.
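
Before you plan that unmount or reboot, prove the server side is at fault; a hedged probe, assuming the NAS address from the mount output above and that the nfs-common/rpcbind client tools are installed:

cr0x@server:~$ timeout 5 stat -f /mnt/pve/backup-nfs >/dev/null || echo "statfs on the mount hangs or fails"
cr0x@server:~$ ping -c 2 -W 1 10.10.20.50
cr0x@server:~$ rpcinfo -p 10.10.20.50 | grep -w nfs          # is the NFS service even registered?
cr0x@server:~$ nfsstat -m | grep -A 3 backup-nfs             # client-side view of the mount and its RPC options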

Task 6: Confirm system-wide I/O pain (iowait and blocked tasks)

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  8      0 512344  89312 845992    0    0   112   984  921 1650  6  4 30 60  0
 1  7      0 508120  89316 846200    0    0    96   812  880 1543  5  3 27 65  0
 0  9      0 509004  89320 846380    0    0    72   900  915 1602  4  3 25 68  0

What it means: High wa (iowait) and many blocked processes (the b column) indicate an I/O stall. This aligns with a D-state process waiting on NFS.

Decision: Shift troubleshooting to storage/network. Killing tasks won’t fix the wait.

Task 7: Identify what is holding a Proxmox lock on a VM

cr0x@server:~$ qm config 205 | grep -E '^lock:'
lock: stop

What it means: VM 205 is locked for a stop operation. It may be legitimate (a stop is in progress) or stale (a stop failed and left a lock).

Decision: Verify whether a stop task is actively running for this VM before clearing the lock.

Task 8: Check for active qmstop task and QEMU process state

cr0x@server:~$ ps -eo pid,stat,cmd | grep -E 'kvm -id 205|qm stop 205' | grep -v grep
 8122 Sl   /usr/bin/kvm -id 205 -name vm205 -m 16384 -smp 8 -machine q35 ...

What it means: QEMU is running. The stop task may be stuck in QMP communication, or QEMU is stuck on I/O teardown.

Decision: Query QEMU monitor status and check whether storage is wedged.

Task 9: Query QEMU status via qm monitor (fast sanity check)

cr0x@server:~$ echo 'info status' | qm monitor 205
VM status: running

What it means: QEMU is responsive enough to answer monitor commands. That’s good: it’s probably not in a kernel-level I/O coma.

Decision: Try a graceful shutdown first (qm shutdown) and inspect guest agent if configured. Avoid qm stop unless necessary.
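
If the guest agent is configured, a quick ping tells you whether the guest itself is still responsive; a sketch for VM 205 from the running example:

cr0x@server:~$ qm guest cmd 205 ping && echo "guest agent responds"
cr0x@server:~$ qm status 205 --verbose | grep -E '^(status|qmpstatus)'   # QMP view of the VM state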

Task 10: Check task-related logs from systemd journal

cr0x@server:~$ journalctl -u pvedaemon -u pvescheduler -u pveproxy --since "30 min ago" | tail -n 40
Dec 26 01:12:26 pve1 pvedaemon[1650]: starting task UPID:pve1:00012A3B:03F4C2B5:676D2C1A:vzdump:101:root@pam:
Dec 26 01:12:31 pve1 pvedaemon[7714]: vzdump: backup job failed to make progress (waiting for storage)
Dec 26 01:13:15 pve1 pvedaemon[1650]: starting task UPID:pve1:00013210:03F4C2C8:676D2C4B:qmstop:205:root@pam:
Dec 26 01:14:16 pve1 pvedaemon[8120]: qmstop: VM 205 qmp command failed - timeout

What it means: The daemons are telling you exactly what hurts: backup waiting for storage; QMP timeout for stop.

Decision: For the stop timeout, check if QEMU is blocked on I/O. For the backup, address NFS/server first.

Task 11: Look for kernel “hung task” warnings

cr0x@server:~$ dmesg -T | tail -n 30
[Thu Dec 26 01:15:02 2025] INFO: task vzdump:7760 blocked for more than 120 seconds.
[Thu Dec 26 01:15:02 2025]       Tainted: P           O      6.5.11-8-pve #1
[Thu Dec 26 01:15:02 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Dec 26 01:15:02 2025] task:vzdump          state:D stack:0     pid:7760  ppid:7714 flags:0x00004006
[Thu Dec 26 01:15:02 2025] Call Trace:
[Thu Dec 26 01:15:02 2025]  nfs_wait_on_request+0x2f/0x90 [nfs]

What it means: The kernel confirms it: the task is blocked in NFS. This is not a Proxmox bug; it’s the storage path.

Decision: Fix NFS reachability/performance; if you must recover immediately, plan a controlled node reboot after ensuring guest safety.

Task 12: Confirm ZFS pool health and spot obvious stalls

cr0x@server:~$ zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
  scan: scrub repaired 0B in 00:21:11 with 0 errors on Thu Dec 26 00:40:15 2025
config:

        NAME                         STATE     READ WRITE CKSUM
        rpool                        ONLINE       0     0     0
          mirror-0                   ONLINE       0     0     0
            ata-SAMSUNG_SSD_AAA      ONLINE       0     0     1
            ata-SAMSUNG_SSD_BBB      ONLINE       0     0     0

errors: No known data errors

What it means: Pool is online, but you have a checksum error on one device. Not necessarily causing the current hang, but it’s a flashing yellow light.

Decision: If your tasks are hung on local ZFS I/O, investigate latency (zpool iostat) and hardware. Even “minor” errors can correlate with stalls under load.

Task 13: Check ZFS latency and throughput (find the choke point)

cr0x@server:~$ zpool iostat -v 1 5
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
rpool                        620G  1.19T    210    980   12.5M  88.3M
  mirror-0                   620G  1.19T    210    980   12.5M  88.3M
    ata-SAMSUNG_SSD_AAA         -      -    105    495   6.25M  44.1M
    ata-SAMSUNG_SSD_BBB         -      -    105    485   6.30M  44.2M
--------------------------  -----  -----  -----  -----  -----  -----

What it means: Throughput is decent; if tasks still hang, the issue may be elsewhere (NFS, Ceph, or a specific device path) rather than local ZFS saturation.

Decision: Don’t blame ZFS by reflex. Use evidence: if ZFS iostat is calm but NFS wchan shows waits, follow the NFS thread.

Task 14: Inspect Ceph health if your storage is RBD/CephFS

cr0x@server:~$ ceph -s
  cluster:
    id:     8b7a3c8d-4f20-4f0a-8ce8-8c1b0f20d1b2
    health: HEALTH_WARN
            12 slow ops, oldest one blocked for 76 sec, osd.3 has slow requests

  services:
    mon: 3 daemons, quorum mon1,mon2,mon3
    mgr: mgr1(active), standbys: mgr2
    osd: 6 osds: 6 up (6 total), 6 in (6 total)

  data:
    pools:   3 pools, 256 pgs
    objects: 18.2k objects, 71 GiB
    usage:   224 GiB used, 2.1 TiB / 2.3 TiB avail
    pgs:     254 active+clean, 2 active+degraded

What it means: Slow ops can absolutely make Proxmox tasks hang (snapshot, backup, stop). Even if “it’s only WARN,” your latency is already lying to you.

Decision: Treat slow ops as a performance incident. Investigate OSD latency, network, and backfill/recovery. Don’t just kill tasks; they’ll hang again.

Task 15: Check pmxcfs and quorum state (cluster-induced “local” hangs)

cr0x@server:~$ pvecm status
Cluster information
-------------------
Name:             prod-cluster
Config Version:   41
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Dec 26 01:18:07 2025
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.342
Quorate:          Yes

What it means: Quorum is fine. That removes a whole class of weirdness.

Decision: If quorum were No, avoid config writes and be cautious with VM actions that require cluster state updates.

Task 16: Find and clear a stale lock safely (only after verification)

cr0x@server:~$ qm unlock 205
unlocking VM 205

What it means: The lock is removed from the VM config. This does not kill any running QEMU process; it only removes Proxmox’s serialization guard.

Decision: Only do this after confirming no active operation is still legitimately running for that VM. If QEMU is still mid-stop and you unlock, you can start overlapping operations and create follow-on failures.

Task 17: Cancel a running task (best-effort, depends on wait state)

cr0x@server:~$ pvesh delete /nodes/pve1/tasks/UPID:pve1:00012A3B:03F4C2B5:676D2C1A:vzdump:101:root@pam:

What it means: Proxmox asked the task worker to stop. If the process is in userspace, it will usually exit. If it’s blocked in D-state, it won’t.

Decision: Send the stop request first for politeness; then deal with the root cause. Don’t assume a clean return means “it’s gone.”

Task 18: Kill a userspace-hung helper process cleanly (and only the helper)

cr0x@server:~$ pstree -ap 7714 | head -n 20
pvedaemon,7714 worker UPID:pve1:00012A3B:03F4C2B5:676D2C1A:vzdump:101:root@pam:
  └─perl,7760 /usr/bin/vzdump 101 --mode snapshot --storage backup-nfs --compress zstd
     └─zstd,7788 --rsyncable --threads=1

What it means: The vzdump worker pipes its output through a helper (here a zstd compressor; for containers it may be tar). Sometimes the helper hangs while the parent is fine.

Decision: If the helper is stuck in userspace (not D-state), you can try terminating it first to let the parent fail cleanly and release locks.

cr0x@server:~$ kill -TERM 7788
cr0x@server:~$ sleep 3
cr0x@server:~$ ps -p 7788
  PID TTY          TIME CMD

What it means: No output after the header implies it exited.

Decision: Re-check the parent task log. If it’s unwinding and cleaning up, you’re done. If it’s still blocked (especially D-state), stop and focus on storage.

Storage angle: ZFS, Ceph, NFS, iSCSI, and the art of waiting

Most “hung Proxmox tasks” are storage incidents wearing a UI costume. The UI is the messenger. Don’t shoot it unless you’re sure it deserves it.

ZFS-backed VMs: snapshot and send/receive stalls

With ZFS, common hang points are:

  • Snapshot creation when the pool is under stress (rarely truly hangs, but can get slow).
  • ZFS send (replication) waiting on disk reads, or throttled by destination writes.
  • Pool contention: backups, scrubs, resilvers, and heavy random writes on the same vdevs.

The trick is to differentiate “slow but progressing” from “stalled.” ZFS will often keep moving, just painfully. If your iostat shows throughput and ops ticking, it’s not hung; it’s underprovisioned.
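
One way to tell “slow but progressing” from “stalled” is to watch the send process’s own I/O counters rather than the pool totals; a minimal sketch, assuming a single zfs send is running (if the counters advance between the two samples, it’s slow, not stuck):

cr0x@server:~$ pid=$(pgrep -f 'zfs send' | head -n 1)
cr0x@server:~$ grep -E 'rchar|wchar|read_bytes' /proc/$pid/io; sleep 10; grep -E 'rchar|wchar|read_bytes' /proc/$pid/io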

NFS-backed backups: “hard” mount semantics cause D-state hangs

If your backup target is NFS, you are borrowing someone else’s storage stack and hoping it doesn’t sneeze. When it does, the Linux client can block in kernel NFS calls. That blocks vzdump, which blocks the Proxmox worker, which blocks locks and follow-on jobs. This is why you’ll see a “backup stuck” that also prevents VM operations.

If your NFS mount is hard (typical), you can’t “kill the process” out of it. You must restore the NFS server path or schedule a reboot of the node to clear stuck kernel threads. Rebooting is not shameful; it’s sometimes the only clean cut.
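
Before any unmount attempt, see exactly what is sitting on the mount; a sketch using fuser (the timeout matters because these tools can themselves block on a dead hard mount; lsof can do the same job):

cr0x@server:~$ timeout 10 fuser -vm /mnt/pve/backup-nfs     # users, PIDs, and access mode per process on the mount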

Ceph-backed storage: slow ops turn into stuck lifecycle actions

Ceph is usually honest. If it’s unhealthy, it tells you. But Proxmox actions can still look “hung” because they wait on RBD operations that are slowed by recovery, backfill, or network latency.

Common pattern: a VM stop hangs because the guest disk flush never completes in time. Or a snapshot task waits on RBD metadata operations. If you see slow ops, stop asking “why is Proxmox stuck” and ask “why is Ceph slow.”
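
To move from “Ceph is slow” to “which OSD or network leg is slow,” a couple of read-only queries help; a sketch:

cr0x@server:~$ ceph health detail | head -n 10               # names the OSDs behind the slow ops
cr0x@server:~$ ceph osd perf | sort -k2 -n | tail -n 5       # worst commit/apply latencies last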

iSCSI: sessions that half-die and take your processes hostage

iSCSI is capable of being perfectly boring in production, which is the nicest thing you can say about storage. When it fails, it can fail slowly: a path flaps, multipath tries, timeouts pile up, processes block in uninterruptible sleep. Tasks that touch those LUNs will hang.

If you see D-state waits in blk_mq_get_tag or similar, you’re in block layer territory. Fix the path (network, target, multipath) before trying to kill Proxmox jobs.
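
To confirm or rule out a path problem, check session and multipath state; a sketch, assuming open-iscsi and multipath-tools are in use:

cr0x@server:~$ iscsiadm -m session -P 1 | grep -E 'Target:|Session State|Connection State'
cr0x@server:~$ multipath -ll | grep -E 'status=|failed|faulty'
cr0x@server:~$ dmesg -T | grep -iE 'iscsi|multipath|blk_mq' | tail -n 10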

One quote you should actually internalize

“Hope is not a strategy.” — General Gordon R. Sullivan

In ops terms: if your “fix” depends on the storage coming back on its own, it’s not a fix. It’s a prayer with a ticket number.

Cluster angle: pmxcfs, corosync, quorum, and why “local” tasks hang

Proxmox’s cluster stack is a productivity multiplier. It’s also a source of surprising failure modes when the cluster is degraded.

pmxcfs: the config filesystem that can make you doubt reality

/etc/pve is not a normal directory. It’s a cluster filesystem (FUSE) backed by RAM and synchronized. That means:

  • If pmxcfs can’t commit changes, tasks that write VM configs can block or fail in odd ways.
  • If the node is isolated without quorum, Proxmox may intentionally prevent certain writes to avoid split-brain configuration.
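
Two quick checks that separate “pmxcfs is down” from “pmxcfs is up but unhappy”; a sketch (pmxcfs runs as the pve-cluster service):

cr0x@server:~$ systemctl is-active pve-cluster corosync
active
active
cr0x@server:~$ timeout 5 cat /etc/pve/.version >/dev/null && echo "pmxcfs reads OK" || echo "pmxcfs read hung or failed"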

Quorum loss: the “everything is running but nothing can change” mode

Losing quorum doesn’t necessarily stop your running VMs. It stops coordinated change. People interpret that as “Proxmox is stuck.” It’s not stuck; it’s refusing to make unsafe changes.

If you’re on a single isolated node and you force it to behave like it’s alone, you might get work done. You might also end up with conflicting VM configs when the cluster heals. Choose that trade intentionally, not emotionally.

Joke #2: Corosync without quorum is like a meeting without minutes—everyone remembers a different truth, and all of them are wrong.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

The team had a Proxmox cluster with NFS for backups. One night, the backup jobs started stacking up. The UI showed “running” for hours. A senior engineer assumed, reasonably, that the backup process was just slow because the weekly job hits more data.

So they waited. Then they waited longer. VM stop operations started hanging too. The assumption became “Proxmox is overloaded.” They restarted pvedaemon and pveproxy to “clear tasks.” The UI refreshed, and it looked better for about ten minutes—until it didn’t.

The actual problem: the NFS server had a network path issue. The mount was hard, so processes were blocked in D-state. Restarting daemons did nothing except remove the easy breadcrumb trail in logs and confuse the on-call rotation.

The recovery was blunt: restore the NFS path, wait for the kernel to unwind blocked calls, then cancel the jobs and clean stale partial archives. They also learned a rule: when you see D-state with an NFS wchan, your “backup hang” is an NFS incident. Escalate accordingly.

Mini-story 2: The optimization that backfired

A company wanted faster backups. They changed vzdump compression from something moderate to a heavier setting and increased concurrency so more VMs could be backed up in parallel. CPU graphs looked “fine” in normal business hours, so the change sailed through. It was an optimization, after all.

Backup night arrived. The ZFS pool hit high write amplification under concurrent snapshotting and compression, the NFS backup target became the next bottleneck, and latency climbed. Some guests became sluggish. A few “qm shutdown” operations started timing out because QMP commands were waiting on I/O flushes.

The team responded by killing hung tasks, which freed locks but didn’t reduce the underlying pressure. Now they had partial backup artifacts and retries piling on. It turned into a self-inflicted thundering herd.

The fix was almost boring: reduce concurrency, choose compression that matches the CPU/storage balance, stagger heavy jobs, and watch iowait and storage latency, not just CPU utilization. The lesson: “faster backups” is a systems problem, not a single knob. If you optimize one component, another will happily catch fire.

Mini-story 3: The boring but correct practice that saved the day

A different organization had a rule: every node has a small “break-glass” checklist in a runbook—how to identify tasks by UPID, how to find PIDs, and which logs to capture before intervention. Nobody loved it. It felt like paperwork for machines.

During a maintenance window, an iSCSI path flapped. A handful of VMs were fine. Two VMs had stop operations that hung, and a replication job froze. The on-call followed the checklist: capture task logs, confirm process states, check kernel messages, verify multipath, and only then decide on a controlled restart of affected services.

They noticed the critical detail early: the hung processes were in D-state in the block layer, so killing them would not work. Instead, they restored the iSCSI path, confirmed I/O resumed, then retried stop actions cleanly. No corruption, no panic reboots, no mystery.

The “boring practice” was the difference. Under stress, you don’t rise to the occasion—you fall to the level of your operational habits. Their habit was evidence first, action second.

Common mistakes: symptom → root cause → fix

This section is deliberately specific. Generic advice is cheap; downtime isn’t.

1) Symptom: “TASK OK” never appears; UI shows “running” forever

  • Root cause: The worker process is blocked in kernel I/O (D-state), or the UI cannot update due to pmxcfs/cluster issues.
  • Fix: Find PID, check ps STAT and wchan, consult dmesg. If D-state, fix storage/network. If not, use task cancel and terminate child processes cleanly.

2) Symptom: “can't lock file '/var/lock/qemu-server/lock-205.conf' - got timeout”

  • Root cause: A valid lock held by an active job, or a stale lock after a crash.
  • Fix: Confirm no running task for that VM; inspect qm config for lock; then qm unlock 205. If the VM action is still running, resolve that first.

3) Symptom: vzdump stuck at “0% (loading)”

  • Root cause: Waiting on storage flush/snapshot or a dead backup target mount (NFS/SMB).
  • Fix: Check process state; check mount health; check network path; for NFS hard mount issues, restore server or plan controlled reboot.

4) Symptom: qm stop hangs; QMP timeouts

  • Root cause: QEMU blocked on disk flush or device teardown; guest agent absent; storage latency high.
  • Fix: Try qm shutdown first; check qm monitor; inspect storage health and kernel messages. If QEMU is stuck in D-state, fix I/O path. Only force stop when you accept possible guest filesystem impact.

5) Symptom: LXC stop hangs

  • Root cause: Container processes stuck on NFS inside the container, or kernel cgroup freezer waiting.
  • Fix: Identify container init PID, inspect process states, check mounts used by the container. Fix storage path; then retry stop. Avoid force-killing unless you understand the risk to container workloads.

6) Symptom: tasks fail randomly after cluster hiccup

  • Root cause: Quorum loss or pmxcfs not healthy; config writes blocked.
  • Fix: Check pvecm status, restore quorum, confirm /etc/pve is responsive. Avoid making config changes while non-quorate unless you have a deliberate split-brain plan (usually you don’t).

7) Symptom: replication jobs stuck; zfs send not progressing

  • Root cause: Destination slow or unreachable, network bottleneck, snapshot holds, or source pool contention.
  • Fix: Check ZFS send process state; check network; validate destination dataset health; reduce competing load (scrubs/resilver) during replication windows.
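
For the replication case in item 7, check the job status and whether old replication snapshots are pinned by holds; a sketch with a hypothetical dataset and snapshot name (adjust to your pool layout):

cr0x@server:~$ pvesr status                                                        # per-job replication state and last error
cr0x@server:~$ zfs list -t snapshot -o name,used,creation rpool/data/vm-101-disk-0 | tail -n 5
cr0x@server:~$ zfs holds rpool/data/vm-101-disk-0@__replicate_101-0_1735019546__   # hypothetical snapshot name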

Checklists / step-by-step plans

Checklist A: Safely unwind a stuck vzdump backup

  1. Get the UPID from running tasks (pvesh get /nodes/NODE/tasks --source active).
  2. Pull the task log and note the last progress line.
  3. Find the PIDs (worker + vzdump + helpers) with ps and pstree.
  4. Check state with ps -o stat,wchan.
  5. If D-state in NFS/block layer: fix the storage path first. Don’t waste time with signals.
  6. If userspace hang: ask Proxmox to stop the task (pvesh delete on its UPID, or the GUI Stop button), then kill -TERM helper processes, then the parent if needed.
  7. Verify lock cleanup on the VM and that no new tasks are blocked behind it.
  8. Check for partial archive files on the backup store and remove only after you confirm they are incomplete and not referenced by a still-running process.
  9. Run a single backup job with reduced concurrency to validate the path before resuming the schedule.
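
For step 8, confirm nothing still has the partial archive open before deleting it; a sketch using the backup path from the earlier examples (run it only once the mount is responsive again):

cr0x@server:~$ ls -lh /mnt/pve/backup-nfs/dump/ | tail -n 3
cr0x@server:~$ timeout 10 lsof /mnt/pve/backup-nfs/dump/vzdump-qemu-101-2025_12_26-01_12_26.vma.zst || echo "no process holds it open"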

Checklist B: Safely deal with a hung “stop” task for a VM

  1. Confirm the VM is actually still running: ps for QEMU PID and qm status VMID.
  2. Try qm shutdown (graceful). If guest agent is installed and working, even better.
  3. Check monitor responsiveness: echo 'info status' | qm monitor VMID (and the guest agent, if installed).
  4. If QEMU responds but stop hangs: investigate storage latency and device teardown logs; consider a longer timeout.
  5. If QEMU is D-state: fix storage path. For hard failures, schedule node reboot; don’t trust forced kills.
  6. Only as a last resort: qm stop VMID (force). Document the risk and follow up with guest filesystem checks as appropriate.
  7. After recovery: clear stale Proxmox lock (qm unlock VMID) if needed, but only after verifying no active stop task remains.
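
The same escalation as a compact command sketch (the timeouts are illustrative):

cr0x@server:~$ qm status 205
status: running
cr0x@server:~$ qm shutdown 205 --timeout 180     # graceful; errors out if the guest ignores it
cr0x@server:~$ qm stop 205 --timeout 60          # force stop: last resort, accept possible guest-side damage
cr0x@server:~$ qm unlock 205                     # only if a stale lock is left behind afterwards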

Checklist C: When you suspect cluster/pmxcfs issues

  1. Check quorum: pvecm status.
  2. Check corosync health in logs: journalctl -u corosync --since "30 min ago".
  3. Test responsiveness of /etc/pve (simple reads like ls should not hang).
  4. Don’t perform config-changing operations if non-quorate unless you’ve explicitly accepted the split-brain risk.
  5. Restore networking between nodes, fix time sync issues, then verify quorum returns.
  6. Once cluster is stable, re-check running tasks; many “hung” tasks suddenly become merely “failed” and can be retried cleanly.
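
A compact, read-only version of these checks, safe to run mid-incident:

cr0x@server:~$ corosync-quorumtool -s | tail -n 10                      # votes, quorum flag, membership
cr0x@server:~$ timeout 5 ls /etc/pve >/dev/null && echo "/etc/pve responds" || echo "/etc/pve read hung"
cr0x@server:~$ journalctl -u corosync -u pve-cluster --since "30 min ago" | grep -iE 'quorum|link|retransmit' | tail -n 10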

FAQ

1) Can I just restart pvedaemon to clear stuck tasks?

You can, but it’s rarely the right first move. If a worker is blocked in D-state due to storage, restarting daemons won’t unblock it. You’ll just lose continuity in logs and confuse task tracking.

2) What’s the safest way to stop a hung backup?

Ask Proxmox to stop the task (pvesh delete on its UPID, or the GUI Stop button), then terminate the child processes gently (TERM) if they’re in userspace. If they’re in D-state, fix storage first; otherwise you’ll create a mess without achieving “stopped.”

3) How do I know if a lock is stale?

Correlate it with running tasks and actual processes. If the VM has lock: set but there’s no matching running task and no relevant process tree, it’s likely stale. Then qm unlock is appropriate.
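
A sketch of that correlation for the VM from the earlier examples:

cr0x@server:~$ qm config 205 | grep '^lock:'                                  # is a lock even set?
cr0x@server:~$ pvesh get /nodes/pve1/tasks --source active | grep ':205:'     # any active task touching VMID 205?
cr0x@server:~$ ps -ef | grep -E '[q]m (stop|shutdown|migrate) 205'            # any CLI-invoked qm process still alive?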

4) Why does “kill -9” not work sometimes?

Because the process is stuck in uninterruptible sleep (D-state) inside the kernel, usually waiting for I/O. The signal is queued but not acted on until the kernel call returns.

5) Is it safe to remove partial backup files from the backup store?

Usually, yes—after you confirm no process is still writing to them and you’ve decided the archive is incomplete. Deleting a file still open by a process can make troubleshooting worse and doesn’t necessarily free you from the underlying hang.

6) My VM won’t stop and QMP times out. Is the VM corrupted?

Not necessarily. It often means QEMU can’t complete an I/O flush or device operation, so it can’t respond. Corruption risk increases when you forcefully terminate without resolving I/O.

7) Does Ceph HEALTH_WARN mean I can ignore it for now?

No. “WARN” often means “latency is already bad enough to stall higher-level operations.” Slow ops are the canary; Proxmox stuck tasks are the mine collapse.

8) When is a node reboot the correct answer?

When critical processes are stuck in D-state due to an I/O path that can’t be restored quickly, or when kernel threads are wedged and preventing clean recovery. Reboot is a controlled reset, not a moral failing.

9) Why do “local” tasks fail when the cluster has quorum issues?

Because “local” often still touches cluster-synchronized config in /etc/pve. If pmxcfs is unhappy or writes are blocked due to quorum policy, tasks can stall or refuse to proceed.

10) How do I prevent stuck tasks from recurring?

Engineer for predictable I/O latency: size storage, limit backup concurrency, separate backup traffic, monitor kernel hung tasks, and treat storage warnings as incidents, not trivia.

Conclusion: next steps that prevent repeats

Stuck Proxmox tasks are rarely mysterious. They’re usually one of three things: storage stalls, stale locks, or cluster coordination problems. The difference between a clean recovery and a messy one is whether you identify which category you’re in before you start killing processes.

Do this next:

  • Adopt the fast diagnosis funnel: storage vs cluster vs per-guest process.
  • Teach your team to recognize D-state and stop wasting time on signals when the kernel is waiting on I/O.
  • Put sane limits on backup concurrency and schedule heavy operations (scrub, replication, bulk backups) like you actually care about latency.
  • When you clear locks, do it like a surgeon: verify, then cut. Don’t freestyle.