“TASK ERROR” is Proxmox’s way of telling you something failed, while politely refusing to tell you what failed in the UI. In production, that’s not a message. It’s an invitation to lose an afternoon.
The good news: Proxmox is predictable. The bad news: you have to know which log belongs to which daemon, and how to follow a task’s breadcrumbs into the right subsystem—storage, networking, cluster, or guest config—without guessing.
Fast diagnosis playbook (first/second/third checks)
When a Proxmox task fails, you don’t “start reading logs.” You run a short, brutal sequence that narrows the blast radius in minutes. The trick is to stop treating “TASK ERROR” as a single problem. It’s a wrapper around a subprocess, and that subprocess has a home.
First check: grab the task’s real command and exit context
- Open the task in the GUI and copy the UPID (Unique Process ID). It looks like a long colon-separated string.
- On the node where it ran, read the task log from disk (/var/log/pve/tasks/) and look for the actual failing command and the first concrete error line.
- If the log is truncated or vague, jump to the systemd journal for the daemon that executed it: usually pvedaemon, sometimes pveproxy, sometimes pvescheduler.
Second check: decide which subsystem owns the failure
Most task failures fall into one of five buckets:
- Storage: ZFS, LVM, Ceph, NFS, iSCSI, permission issues, “no space,” dataset mount weirdness.
- Cluster: corosync link flaps, no quorum, stale pmxcfs state.
- Guest: QEMU command line errors, missing disks, invalid config, LXC apparmor/cgroup problems.
- Network: broken bridges, MTU mismatch, firewall, migration sockets blocked.
- Host: kernel, hardware, time drift, OOM, stuck I/O, filesystem corruption.
If you can’t identify the bucket in 3 minutes, you’re reading the wrong log.
Third check: confirm with one “truth command” per bucket
Run one command that cannot fake success:
- Storage truth: pvesm status, zpool status, ceph -s, df -h
- Cluster truth: pvecm status, journalctl -u corosync
- Guest truth: qm start --debug or pct start --debug
- Network truth: ss -tulpn, plus migration port checks and ip -d link
- Host truth: dmesg -T, journalctl -p err..alert, smartctl
Once you have “truth output,” you stop debating and start fixing.
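If you want that first pass as a paste-able sequence, here is a minimal truth-sweep sketch for a ZFS-backed node; swap in ceph -s or your NFS checks to match your storage, and treat the 30-minute window as an arbitrary starting point:
cr0x@server:~$ pvesm status && zpool status -x && df -hT
cr0x@server:~$ pvecm status | grep -E 'Quorate|Expected votes|Total votes'
cr0x@server:~$ journalctl -p err..alert --since "30 min ago" --no-pager | tail -n 25
cr0x@server:~$ dmesg -T | tail -n 25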
One dry reality: most Proxmox “TASK ERROR” incidents are not “Proxmox bugs.” They’re storage or cluster hygiene surfacing at the worst possible time.
How Proxmox tasks are wired (and why the UI lies by omission)
Proxmox VE is a collection of daemons coordinating work:
- pvedaemon: executes most node-level tasks requested via API/UI (start VM, create container, migrations, storage actions).
- pveproxy: the API/UI frontend; it handles auth and dispatch, but it’s rarely the source of the actual failure.
- pvescheduler: kicks off scheduled jobs like backups.
- pvestatd: collects stats; it’s useful when the UI graph looks wrong, not when a task fails.
- pmxcfs: the cluster filesystem mounted at /etc/pve. If it’s unhappy, everything becomes “weird” in a uniquely Proxmox way.
A task in Proxmox is basically: “run a command, capture stdout/stderr, store it, show a filtered view in the UI.” The UI often highlights the last line and hides the earlier line that contains the real error. That’s why you go to the raw task log and the daemon journal.
Also: the error you see might be the wrapper failing, not the underlying operation. Example: a migration fails with “connection timed out,” but the root cause is a storage lock on the destination node because the target filesystem is read-only. The migration wrapper tells the truth, but not the useful truth.
Paraphrased idea from Werner Vogels: “You build it, you run it.” In Proxmox land that translates to: you attach storage, you troubleshoot storage.
Joke #1: The Proxmox UI error message is like a weather forecast that only says “some weather occurred.” Technically correct, operationally useless.
Log locations that matter (and the ones that don’t)
The task log archive (start here)
Proxmox stores task logs on the node under:
- /var/log/pve/tasks/ — the canonical task logs, organized by node and date
- /var/log/pve/tasks/index — an index of tasks; handy for grepping
The fastest workflow: grab the UPID from the UI, then grep for it in /var/log/pve/tasks/ or use the task viewer file directly. You’re looking for:
- the first concrete error line (not the last)
- the exact command invocation (qemu, vzdump, zfs, rbd, ssh, rsync)
- any mention of “permission denied,” “no space,” “I/O error,” “quorum,” “timeout,” “dataset not found”
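A minimal grep sketch for surfacing that first concrete error; the file path is the one used in Task 2 below, and the pattern list is a starting point, not a complete catalogue:
cr0x@server:~$ grep -nEi 'error|denied|no space|timeout|quorum|not found' /var/log/pve/tasks/9/98122 | head -n 5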
Systemd journal (the adult logs)
Task logs are good, but daemons often log more context to the system journal. These units matter:
- pvedaemon — most task execution details
- pveproxy — API/auth issues, ticket problems, websocket disconnects
- pvescheduler — scheduled vzdump and replication triggers
- corosync — cluster membership, link status, quorum changes
- pve-cluster (pmxcfs) — cluster filesystem and config sync
- ceph-* — if you run Ceph, you already know where this is going
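journalctl accepts multiple units in one call, which helps when you don’t yet know which daemon owns the failure. A minimal sketch using the time window from Task 3 below; adjust units and timestamps to your incident:
cr0x@server:~$ journalctl -u pvedaemon -u pvescheduler -u pve-cluster -u corosync --since "2025-12-26 01:55" --until "2025-12-26 02:05" --no-pager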
Kernel and storage logs (where the bodies are buried)
When storage is involved, the kernel usually tattles:
- dmesg / journalctl -k — I/O errors, resets, hung tasks, ZFS messages
- /var/log/syslog (or the journal on newer installs) — service-level context
- ZFS: zpool status, zpool events (events are gold during flaps)
- Ceph: ceph -s, OSD logs via systemd
Logs you should stop over-trusting
- The UI by itself. It’s a summary, not a root-cause tool.
- /var/log/pveproxy/access.log as a first stop for task errors. If auth works and tasks launch, pveproxy is rarely the culprit.
- Random greps across /var/log without an identifier. You’ll find scary words and learn nothing.
Practical tasks: commands, output meaning, and what decision to make
The goal here is speed. Each task below includes: command(s), what the output means, and the decision you make from it. Run them on the node that executed the failing task unless noted otherwise.
Task 1: Identify the UPID and pull the raw task log
cr0x@server:~$ grep -R "UPID:" -n /var/log/pve/tasks/index | tail -n 3
/var/log/pve/tasks/index:98122:UPID:pve1:0000A1B2:01F4C3D2:676D2C8B:vzdump:101:root@pam:
/var/log/pve/tasks/index:98123:UPID:pve1:0000A1B7:01F4C3D8:676D2C91:qmstart:104:root@pam:
/var/log/pve/tasks/index:98124:UPID:pve1:0000A1C0:01F4C3E0:676D2C9A:vmmigrate:104:root@pam:
Meaning: You’ve got recent task UPIDs. Ideally, copy the relevant one from the UI; the index grep is a fallback.
Decision: Use the UPID to locate the specific task log file and read it top-to-bottom, not bottom-to-top.
Task 2: Read the task log file for that UPID
cr0x@server:~$ UPID="UPID:pve1:0000A1B2:01F4C3D2:676D2C8B:vzdump:101:root@pam:"; grep -R "$UPID" -n /var/log/pve/tasks/ | head
/var/log/pve/tasks/9/98122:UPID:pve1:0000A1B2:01F4C3D2:676D2C8B:vzdump:101:root@pam:
cr0x@server:~$ sed -n '1,160p' /var/log/pve/tasks/9/98122
UPID:pve1:0000A1B2:01F4C3D2:676D2C8B:vzdump:101:root@pam:
INFO: starting new backup job: vzdump 101 --storage backup-nfs --mode snapshot --compress zstd
INFO: Running hook script 'backup-hook.pl'.
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/backup-nfs/dump/vzdump-qemu-101-2025_12_26-02_00_01.vma.zst'
ERROR: qemu-img: Could not open '/dev/zvol/rpool/data/vm-101-disk-0': Permission denied
INFO: aborting backup job
TASK ERROR: command 'vzdump 101 --storage backup-nfs --mode snapshot --compress zstd' failed: exit code 255
Meaning: The useful line is Permission denied opening a zvol. The last line is just a wrapper.
Decision: Pivot to ZFS permissions / device access / AppArmor / storage config, not “backup settings.”
Task 3: Pull daemon-side context for the same time window
cr0x@server:~$ journalctl -u pvedaemon --since "2025-12-26 01:55" --until "2025-12-26 02:05" | tail -n 30
Dec 26 02:00:01 pve1 pvedaemon[1123]: starting task UPID:pve1:0000A1B2:01F4C3D2:676D2C8B:vzdump:101:root@pam:
Dec 26 02:00:02 pve1 pvedaemon[1123]: starting task vzdump 101
Dec 26 02:00:05 pve1 pvedaemon[1123]: command 'qemu-img info /dev/zvol/rpool/data/vm-101-disk-0' failed: Permission denied
Dec 26 02:00:05 pve1 pvedaemon[1123]: end task UPID:pve1:0000A1B2:01F4C3D2:676D2C8B:vzdump:101:root@pam: TASK ERROR: exit code 255
Meaning: Confirms the failing subcommand and that it’s happening under pvedaemon, not some mystery UI layer.
Decision: Investigate why qemu-img can’t open the zvol device path.
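One quick hedge before blaming backup settings: rerun the failing subcommand interactively. A minimal sketch using the zvol path from this example; substitute your own device:
cr0x@server:~$ qemu-img info /dev/zvol/rpool/data/vm-101-disk-0
If it fails the same way outside the task wrapper, the problem is device access or policy, not vzdump configuration.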
Task 4: Quick system-wide error sweep (don’t overthink it)
cr0x@server:~$ journalctl -p err..alert --since "1 hour ago" | tail -n 25
Dec 26 02:00:05 pve1 kernel: audit: type=1400 apparmor="DENIED" operation="open" profile="lxc-container-default-cgns" name="/dev/zvol/rpool/data/vm-101-disk-0" pid=22871 comm="qemu-img" requested_mask="r" denied_mask="r"
Dec 26 02:00:05 pve1 pvedaemon[1123]: command 'qemu-img info /dev/zvol/rpool/data/vm-101-disk-0' failed: Permission denied
Meaning: AppArmor denial. That’s not “random”; it’s a direct reason for the permission error.
Decision: Fix the AppArmor profile / container confinement / Proxmox update mismatch causing the denial.
Task 5: Confirm storage state from Proxmox’s perspective
cr0x@server:~$ pvesm status
Name Type Status Total Used Available %
local dir active 98317312 21938176 71340544 22.31%
local-zfs zfspool active 402653184 312475648 90177536 77.60%
backup-nfs nfs active 1048576000 612368384 436207616 58.40%
Meaning: Storage is “active” per PVE. That doesn’t guarantee underlying ZFS devices are accessible under confinement, but it clears “storage offline” as a primary cause.
Decision: Stay focused on the device access path and security layer, not NFS mount failures.
Task 6: Validate ZFS health and look for I/O errors
cr0x@server:~$ zpool status -x
all pools are healthy
Meaning: No obvious pool-level faults. This is not a failing disk story yet.
Decision: Don’t start replacing hardware because a backup failed. Keep chasing the actual denial.
Task 7: Look for ZFS events around failure time (useful during flaps)
cr0x@server:~$ zpool events -v | tail -n 12
Dec 26 01:58:44.120782 sysevent.fs.zfs.config_sync
Dec 26 01:58:44.120790 sysevent.fs.zfs.history_event
history_event: zfs create rpool/data/vm-104-disk-0
Dec 26 02:00:03.908111 sysevent.fs.zfs.vdev_check
pool: rpool
Meaning: Nothing screaming. Events are quiet, which supports “policy/permission” over “storage dying.”
Decision: Stop digging in ZFS until you have a reason.
Task 8: Re-run the failing guest operation in debug mode
cr0x@server:~$ qm start 101 --debug
start VM 101: UPID:pve1:0000A1D0:01F4C410:676D2D12:qmstart:101:root@pam:
QEMU command line: /usr/bin/kvm -id 101 -name vm101 -no-shutdown ...
kvm: -drive file=/dev/zvol/rpool/data/vm-101-disk-0,if=none,id=drive-scsi0,format=raw: Could not open '/dev/zvol/rpool/data/vm-101-disk-0': Permission denied
TASK ERROR: start failed: QEMU exited with code 1
Meaning: Not a backup-only issue. VM start fails for the same reason. That’s strong evidence of a host policy/access regression.
Decision: Treat as “VM cannot access its block device.” Escalate priority.
Task 9: Confirm device node ownership and permissions (yes, it’s basic—do it)
cr0x@server:~$ ls -l /dev/zvol/rpool/data/vm-101-disk-0
lrwxrwxrwx 1 root root 13 Dec 26 01:12 /dev/zvol/rpool/data/vm-101-disk-0 -> ../../zd0
cr0x@server:~$ ls -l /dev/zd0
brw-rw---- 1 root disk 230, 0 Dec 26 01:12 /dev/zd0
Meaning: The block device is group disk, mode 660. QEMU typically runs as root on Proxmox, so plain UNIX perms usually aren’t the blocker.
Decision: If perms look sane but access fails, suspect confinement (AppArmor), a mount namespace oddity, or a kernel-level block rule.
Task 10: Check AppArmor status and recent denials
cr0x@server:~$ aa-status | sed -n '1,60p'
apparmor module is loaded.
60 profiles are loaded.
58 profiles are in enforce mode.
/usr/bin/lxc-start
/usr/bin/pve-container
lxc-container-default-cgns
2 profiles are in complain mode.
0 processes are unconfined but have a profile defined.
Meaning: AppArmor is active. If denials mention your device path or qemu-img, you have your smoking gun.
Decision: Fix the profile or the path expectations. In Proxmox, this often shows up after partial upgrades or custom hardening.
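A minimal sketch for pulling the relevant denials and, if needed, confirming the profile is the blocker. aa-complain ships in apparmor-utils; switching a profile to complain mode is a diagnostic step, not a fix, and the profile path here is a placeholder:
cr0x@server:~$ journalctl -k --since "1 hour ago" | grep -i 'apparmor="DENIED"'
cr0x@server:~$ aa-complain /etc/apparmor.d/your-profile-file   # placeholder path; revert with aa-enforce when done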
Task 11: Migration failures—confirm connectivity and listening services
cr0x@server:~$ ss -tulpn | egrep '(:8006|:3128|:5900|:60000|:22)\b'
tcp LISTEN 0 4096 0.0.0.0:8006 0.0.0.0:* users:(("pveproxy",pid=1044,fd=6))
tcp LISTEN 0 4096 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=881,fd=3))
Meaning: This confirms basics: API is up, SSH is up. Migration uses SSH for orchestration and a separate channel for memory transfer depending on settings.
Decision: If migration fails with timeout, you now check firewall rules, routing, MTU, and any dedicated migration network. If services aren’t listening, fix that before touching Proxmox configs.
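A minimal connectivity sketch for a migration path. The peer 10.0.0.2, the bridge vmbr0, and the 8972-byte payload are assumptions: 8972 exercises a 9000-byte MTU once ICMP/IP headers are added; use 1472 for a standard 1500 MTU:
cr0x@server:~$ ping -M do -s 8972 -c 3 10.0.0.2
cr0x@server:~$ ip -d link show vmbr0 | grep -o 'mtu [0-9]*'
cr0x@server:~$ ssh -o BatchMode=yes root@10.0.0.2 true && echo "ssh trust OK"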
Task 12: Cluster task errors—check quorum and corosync health
cr0x@server:~$ pvecm status
Cluster information
-------------------
Name: prodcluster
Config Version: 19
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Fri Dec 26 02:03:11 2025
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.2a
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Meaning: Quorum is OK. If you see Quorate: No, many tasks will fail, hang, or refuse to run because /etc/pve becomes read-only.
Decision: If not quorate: stop doing “storage troubleshooting” and fix the cluster link/quorum first. Proxmox depends on a healthy pmxcfs state for sane operations.
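If quorum looks shaky, corosync-cfgtool (shipped with corosync) shows per-link status without digging through logs. A minimal sketch:
cr0x@server:~$ corosync-cfgtool -s
cr0x@server:~$ journalctl -u corosync --since "1 hour ago" | grep -iE 'link|token|quorum' | tail -n 20
Links reported down or flapping point at the network, not at Proxmox.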
Task 13 (bonus, because you will need it): Verify pmxcfs mount and status
cr0x@server:~$ mount | grep /etc/pve
pve on /etc/pve type fuse.pve (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
Meaning: rw is what you want. If it flips to ro, you’ll see bizarre failures: VM config edits fail, storage edits fail, tasks fail with misleading errors.
Decision: If it’s ro, treat it as a cluster/quorum issue until proven otherwise.
Task 14 (bonus): Catch “no space” the way the kernel sees it
cr0x@server:~$ df -hT | egrep '(/$|/var|/mnt/pve|/rpool|/dev/sd)'
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/pve-root ext4 96G 89G 2.1G 98% /
backup-nfs:/export/backup nfs4 1000G 584G 417G 59% /mnt/pve/backup-nfs
Meaning: Root filesystem at 98% is not a “future problem.” It’s a current source of random task failures (logs can’t write, temp files fail, dpkg breaks).
Decision: If root is full: clean it now, then re-run the task. Do not “keep trying” and hope it resolves itself.
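A minimal cleanup sketch for a full root filesystem. These are generic Debian steps, not Proxmox-specific; read what du reports before deleting anything:
cr0x@server:~$ du -xh -d1 / 2>/dev/null | sort -h | tail -n 10
cr0x@server:~$ journalctl --vacuum-time=7d
cr0x@server:~$ apt-get clean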
That’s already more than 12 tasks. Good. Your job is not to memorize them; it’s to internalize the mapping: task → daemon → subsystem → truth command.
Common failure modes by category
Backups (vzdump) that fail “randomly”
Backups touch almost everything: guest snapshot logic, storage, compression, network (if remote), and retention cleanup. When a backup fails, the cause is usually not “vzdump is broken.” It’s:
- snapshot not possible (guest agent missing or filesystem freeze fails)
- storage can’t create temp files (permissions, no space, stale mount)
- timeouts from slow disks or overloaded NFS/Ceph
- lock contention: previous backup still running, or stale lock file
Where to look first: the task log for the failing subcommand (qemu-img, vma, zstd, tar) and then the kernel logs for I/O stalls.
Migrations that fail with timeouts
Live migration is a three-act play:
- Orchestration (SSH, pvedaemon on both ends)
- State transfer (RAM pages, device state)
- Storage synchronization (depends on shared storage vs local + replication)
Failures often correlate with MTU mismatch on a dedicated migration network, firewall rules blocking ephemeral ports, or storage on the destination not being “actually ready” even if PVE says it’s active.
VM start errors that look like QEMU being dramatic
QEMU errors are verbose but honest. When you see:
- Could not open ... Permission denied → policy/permissions or a block path mismatch
- Device or resource busy → a stale process holding the device, a leftover qemu, or a ZFS volume still in use
- No such file or directory → wrong storage path, missing dataset, or storage not mounted
- Invalid argument → often a kernel/driver mismatch or a broken option after an upgrade
Cluster tasks failing because the cluster is “kind of up”
Proxmox clusters can be deceptively functional while broken: nodes ping, UI loads, but config writes fail or tasks hang. If quorum is lost, /etc/pve goes read-only to avoid split-brain writes. That’s good engineering, but it looks like sabotage at 2 a.m.
Storage tasks failing because storage is “active” but unusable
“Active” in pvesm status means Proxmox can see the storage definition and it passed its activation check. That does not guarantee:
- NFS is stable (it may be mounted but stale/hung)
- Ceph is healthy (it may respond but be degraded or blocked)
- ZFS has enough space (it may be online but effectively full due to snapshots)
Common mistakes: symptom → root cause → fix
These are the failures I see repeatedly because they’re sneaky, not because they’re hard.
1) Symptom: “TASK ERROR: command ‘vzdump …’ failed” with exit code 255
Root cause: Exit code 255 is often the wrapper reporting a failure from a subcommand. The real error is earlier: permission denied, snapshot failure, or remote storage trouble.
Fix: Read the task log from the top and find the first ERROR: line. Then check journalctl -u pvedaemon for the same time window.
2) Symptom: Migration fails with “connection timed out” or “ssh exited with status 255”
Root cause: Not one thing. Usually firewall/MTU/routing on the migration path, or SSH trust broken between nodes.
Fix: Confirm sshd is reachable node-to-node, verify ip link MTU consistency, and check firewall rules on both ends. If you use a dedicated migration network, test it directly.
3) Symptom: VM start fails after an upgrade, but worked yesterday
Root cause: Partial upgrades or mismatched kernel/modules/userspace. Sometimes AppArmor profiles change and start denying paths that used to be allowed.
Fix: Ensure the node is fully upgraded and rebooted into the expected kernel. Check journalctl -p err..alert for AppArmor denials and kernel errors.
4) Symptom: Tasks fail to modify VM configs; UI says permission denied or “file is read-only”
Root cause: Cluster lost quorum; pmxcfs is read-only.
Fix: Check pvecm status. Fix corosync connectivity. Don’t “work around it” by editing files elsewhere; that’s how you create split-brain config drift.
5) Symptom: Storage is “active,” but backups/migrations hang for minutes then fail
Root cause: Stale NFS mount or blocked I/O. The mount exists, but I/O is wedged.
Fix: Check kernel logs for NFS timeouts, verify NFS server health, and consider hard vs soft mount options. If the host is stuck in D state tasks, you may need to fix the storage path before Proxmox can recover.
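A minimal stale-mount probe, using the backup-nfs path from the earlier examples. The timeout keeps your own shell from wedging if the mount is hung; drop oflag=direct if your mount refuses direct I/O:
cr0x@server:~$ timeout 10 ls /mnt/pve/backup-nfs
cr0x@server:~$ timeout 10 dd if=/dev/zero of=/mnt/pve/backup-nfs/.iotest bs=1M count=8 oflag=direct
cr0x@server:~$ rm -f /mnt/pve/backup-nfs/.iotest
If either command hangs until the timeout fires, treat the mount as wedged and fix the NFS path before touching Proxmox.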
6) Symptom: “no space left on device” even though df shows free space
Root cause: Inodes exhausted, ZFS dataset quota, ZFS pool nearly full with copy-on-write penalties, or Ceph full ratio reached.
Fix: Check inode usage (df -i), ZFS quotas and snapshot bloat (zfs list -o space), and Ceph fullness thresholds if applicable.
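A minimal space-accounting sketch, assuming a ZFS root pool named rpool; adjust names for your layout:
cr0x@server:~$ df -i /
cr0x@server:~$ zfs list -o space -r rpool | head -n 15
In the zfs output, a large USEDSNAP relative to USEDDS means snapshots, not live data, are eating the pool.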
7) Symptom: “unable to activate storage” after reboot
Root cause: Network storage comes up before networking is ready, or name resolution isn’t available yet, or systemd ordering is wrong.
Fix: Validate DNS, routes, and consider systemd mount dependencies. On NFS, ensure the mount unit waits for network-online. Then re-activate storage and re-run tasks.
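If you manage the export yourself in /etc/fstab rather than defining it as PVE storage (which Proxmox mounts on its own), a hedged example of an entry that waits for the network; server and paths are placeholders:
192.168.1.50:/export/backup  /mnt/backup  nfs4  defaults,_netdev  0  0
systemd already treats nfs mounts as network filesystems and orders them after network-online, so _netdev is mostly making the intent explicit.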
8) Symptom: LXC start fails with cgroup or mount errors
Root cause: Kernel/cgroup v2 expectations vs container config, AppArmor confinement, or nesting settings.
Fix: Run pct start <id> --debug, then check journalctl -u pvedaemon and kernel logs for cgroup messages. Fix container feature flags deliberately, not by trial-and-error toggling.
Joke #2: “It worked yesterday” is not a diagnosis; it’s a bedtime story the system tells you before it ruins your morning.
Three corporate mini-stories from the trenches
Incident 1: The outage caused by a wrong assumption (shared storage ≠ consistent storage)
A mid-sized company ran a two-node Proxmox cluster with “shared storage” on NFS. The UI showed the NFS datastore as active on both nodes. Migrations were routine. Backups were green. Everyone slept.
Then a maintenance window happened: a network engineer replaced a switch, and one node came back with a different VLAN tagging configuration. The node could still reach the NFS server intermittently—enough for the mount to exist—but large I/O would stall. In Proxmox terms, the storage was “active.” In reality, it was a trap.
Live migration began during business hours. The task failed with a timeout. The on-call assumed it was “Proxmox migration flakiness,” retried twice, then tried a different VM. Now multiple VMs were in limbo with locks held. The UI filled with TASK ERROR entries that all looked basically identical.
The fix was boring: verify NFS I/O health with a direct read/write test on the mount, check kernel logs for NFS timeouts, and correct the VLAN tagging. The lesson stuck: don’t treat “mounted” as “healthy,” and don’t retry migrations blindly. Retrying is how you turn one failure into a queue of failures.
Incident 2: The optimization that backfired (faster backups, slower cluster)
A different team wanted faster nightly backups. They moved from gzip to zstd, enabled multiple backup jobs in parallel, and felt clever. CPU had headroom. Storage was “fast.” What could go wrong?
What went wrong was I/O scheduling and latency. Parallel vzdump jobs created bursts of reads from ZFS zvols while simultaneously writing large compressed streams to an NFS target. ZFS did what ZFS does: it tried to be helpful with caching and copy-on-write. NFS did what NFS does: it added variability. The end result was occasional I/O stalls.
Proxmox tasks started failing with vague messages: timeouts, snapshot commit delays, “unable to freeze filesystem,” occasional VM monitor command timeouts. The team spent a week staring at Proxmox logs, convinced the hypervisor was unstable. The kernel logs told the real story: blocked tasks waiting on I/O and NFS latency spikes.
The rollback wasn’t dramatic. They reduced concurrency, pinned backup windows, and—most importantly—stopped writing backups to the same network path that handled other latency-sensitive traffic. Backups became slightly slower. Everything else became reliable again. Optimizing a non-bottleneck is how you create a bottleneck.
Incident 3: The boring but correct practice that saved the day (time sync + log discipline)
A finance-adjacent company ran a Proxmox cluster with Ceph. They had a strict rule: every node must have time sync working, and logs were kept with enough retention to cover weekends. Nobody celebrated this. It was just policy, enforced by automation.
One Friday night, a node started throwing TASK ERROR during replication and Ceph operations. The UI errors were generic. But the team immediately correlated events because timestamps lined up cleanly across nodes. They saw that the failures began right after a specific network change and that corosync link flaps preceded Ceph warnings.
Because logs were retained and time was sane, they traced the root cause quickly: a misconfigured jumbo frame setting caused packet loss on one interface, which destabilized the cluster network and triggered cascading timeouts. They fixed MTU consistency, watched corosync stabilize, and tasks stopped failing.
Nothing heroic. Just boring fundamentals: synchronized clocks, consistent retention, and a habit of checking corosync before blaming Ceph or Proxmox. This is what “operational maturity” looks like when it’s not trying to sell you something.
Checklists / step-by-step plan
Checklist A: When you see “TASK ERROR” and you have 10 minutes
- Copy the UPID from the task details.
- Open the raw task log under /var/log/pve/tasks/ and find the first real ERROR: line.
- Pull daemon journal context around the same minute: journalctl -u pvedaemon (or pvescheduler for scheduled jobs).
- Bucketize the problem: storage, cluster, guest, network, host.
- Run one truth command for that bucket (pvesm status / zpool status / pvecm status / qm start --debug / dmesg -T).
- Decide: retry is allowed only after you remove the cause (space, permission, quorum, connectivity). Never retry as a diagnostic method unless the operation is idempotent and you know it is.
Checklist B: Storage-centered failures (backups, replication, disk attach)
- pvesm status — is the storage active?
- df -hT — is root or the target mount full?
- If ZFS: zpool status -x and zfs list -o space — pool health and snapshot bloat.
- If Ceph: ceph -s — is the cluster HEALTH_OK? If not, expect weirdness.
- journalctl -k — any I/O errors, resets, hung tasks?
- Fix the bottleneck, then rerun exactly one task to validate.
Checklist C: Cluster-centered failures (config edits, migrations, HA actions)
- pvecm status — quorate or not?
- journalctl -u corosync — link flaps, token timeouts?
- mount | grep /etc/pve — is pmxcfs rw?
- Fix cluster network/links first. Don’t “force” operations on a non-quorate cluster unless you fully accept split-brain risk.
Checklist D: Guest start failures
- Run the guest start with debug (qm start --debug or pct start --debug).
- Look for file paths and device names in the error. Those are your anchors.
- Verify the underlying storage object exists (zvol, qcow2, rbd image, logical volume).
- Check security layer denials (AppArmor) and kernel logs.
- Only after that: consider config regression (machine type, CPU flags, device options).
Interesting facts and historical context (so the logs make more sense)
- UPID is Proxmox’s task identity system: it encodes node, PID, start time, and task type. It’s designed for distributed tracking in clusters.
- Proxmox leans on Debian’s service model: many “Proxmox issues” are really systemd/journal/kernel issues wearing a Proxmox hat.
- /etc/pve is not a normal filesystem: it’s pmxcfs, a FUSE-based cluster filesystem. When quorum is lost, it can go read-only to prevent split-brain writes.
- Corosync quorum behavior is intentionally strict: refusing writes without quorum is a design choice that favors correctness over convenience.
- ZFS copy-on-write changes failure symptoms: pools can be “healthy” but painfully slow or near-unusable when near full, because every write becomes expensive.
- Ceph health is more than “up”: a Ceph cluster can answer commands while still being degraded enough to time out client I/O during recovery.
- vzdump is a wrapper: the backup tool orchestrates snapshotting and packaging, but the real work happens in qemu, storage backends, and compression tools.
- Task logs are stored locally: in a multi-node cluster, you must read the logs on the node that executed the task, not the node where you clicked the UI.
- Many failures are time-correlation problems: without decent time sync, you can’t correlate corosync events, kernel errors, and task failures. “Logs” become fiction.
FAQ
1) Where exactly do I find the log for a task that shows “TASK ERROR” in the UI?
On the node that executed it: /var/log/pve/tasks/. Use the UPID to locate the exact file (or grep the index file at /var/log/pve/tasks/index).
2) Why does the Proxmox UI show so little detail for TASK ERROR?
Because it’s rendering a summary. The real error is often earlier in the task output or logged by the executing daemon in the journal.
3) Which daemon should I check in journalctl for most task failures?
pvedaemon first. For scheduled tasks like backups, also check pvescheduler. For cluster membership and quorum issues, check corosync and pve-cluster.
4) My task fails on node A, but I’m logged into node B in the UI. Does it matter?
Yes. Task logs are local to the node that ran the task. If you read logs on the wrong node, you’ll conclude “there are no logs,” and then you’ll start guessing.
5) I see “exit code 255” a lot. What does it mean?
Usually “the command failed and the wrapper is reporting it.” It’s rarely the root cause by itself. Find the earlier line: permission, path, space, timeout, snapshot failure.
6) How do I tell if a failure is cluster/quorum-related?
pvecm status. If Quorate: No, expect task weirdness and config write failures. Also check whether /etc/pve is mounted read-only.
7) Backups fail but storage is “active.” What now?
“Active” is not “healthy.” Check kernel logs for storage timeouts, confirm free space and inode availability, and validate the storage backend’s own health (ZFS/Ceph/NFS).
8) What’s the fastest way to debug a VM that won’t start?
Run qm start <vmid> --debug on the node and read the exact QEMU error line. Then validate the referenced disk path/device and check kernel/AppArmor logs.
9) When should I reboot a Proxmox node during a TASK ERROR incident?
When you’ve confirmed the failure is caused by a host-level state that won’t clear safely: stuck I/O, kernel driver issues, or post-upgrade kernel/userspace mismatch. Rebooting as a first response is how you lose forensic data.
10) Do I need centralized logging for Proxmox troubleshooting?
It helps, but it’s not required to find root cause. What you need is: correct node selection, UPID-based task log reading, and journal access with time correlation.
Next steps you can actually do today
- Practice the UPID workflow: pick any completed task, find its UPID, and trace it from UI → /var/log/pve/tasks/ → journalctl. Make it muscle memory.
- Decide your “truth commands” per subsystem and write them down for your environment (ZFS vs Ceph vs NFS changes the first questions).
- Stop retrying blind: institute a rule—one retry max, only after a hypothesis and a check. Retries create locks, load, and worse logs.
- Baseline cluster health: check quorum status, corosync stability, and pmxcfs mount mode during calm hours so you recognize abnormal output instantly.
- Watch your root filesystem: keep free space on / and /var. Task failures caused by a full root disk are embarrassing because they’re preventable. A minimal check sketch follows this list.
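A minimal check sketch you could run from cron so a filling root disk warns you before tasks start failing; the 90% threshold is an arbitrary assumption:
cr0x@server:~$ df -P / /var | awk 'NR>1 && $5+0 >= 90 {print "WARNING: "$6" at "$5}'
Wire the output into whatever already alerts you; a new channel nobody watches is just another log.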
If you do those five things, “TASK ERROR” stops being a vague insult and becomes what it always was: a pointer to a specific subsystem that’s misbehaving. You’ll still have incidents. You’ll just finish them faster and with fewer guesses.