Proxmox pvedaemon.service Failed: Why Tasks Don’t Run and How to Fix It


You click Start on a VM. The web UI pretends to submit the job. Then nothing happens—no task log, no error you can use, just a long silence and a growing sense that the hypervisor is judging you.
If pvedaemon is down, Proxmox can look “mostly fine” while the one thing you actually need—tasks—quietly stops existing.

This is one of those failures that wastes hours because it doesn’t break loudly. It breaks politely. Here’s how to diagnose it fast, fix it correctly, and keep it from coming back like a bad recurring meeting.

What pvedaemon does (and why the UI lies when it’s dead)

Proxmox isn’t a monolith. It’s a handful of cooperating daemons and a web UI that mostly brokers requests to them. If you remember one thing:
pvedaemon is the worker that runs most node-local tasks—starting/stopping guests, snapshot operations, backups, storage actions, and plenty of “do a thing” jobs you trigger from the UI or API.

When pvedaemon is down, the web interface (pveproxy) may still serve pages just fine. Authentication works. Status dashboards render. You can even click buttons.
But when the system tries to spawn a task, it can’t hand it to the worker. That’s why you get the classic behavior: UI alive, tasks dead.

In reliability terms: your control plane is answering HTTP, but your executor is missing. It’s like a restaurant taking reservations while the kitchen is on fire.

What to expect when it fails

  • Task log entries don’t appear, or appear and then instantly fail with “connection refused”, “timeout”, or “no such file”.
  • Backups don’t run on schedule, or start and hang on locks.
  • Starting a VM from the CLI (qm start) may still work, because CLI tools run the management code directly rather than through the pvedaemon worker path, but many operations still depend on the wider daemon ecosystem.
  • Cluster environments add an extra layer of fun: tasks may be submitted to the wrong node or fail due to cluster state, even if the UI looks healthy.

One quote worth remembering in operations—attributed to Werner Vogels (Amazon CTO): You build it, you run it. If you’re running Proxmox in production, you own this failure mode too.

Fast diagnosis playbook

Don’t “poke around.” Run a tight sequence that narrows the failure domain in minutes. The goal is to answer three questions:
Is pvedaemon actually down? If yes, why? And if it’s restarting, what’s killing it?

First: confirm the daemon status and restart loop

  1. Check systemctl status pvedaemon and journalctl -u pvedaemon.
  2. If you see “start request repeated too quickly” or exit codes, you’re in systemd-land. Fix the underlying error first; don’t spam restarts (a reset sketch follows this list).
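
If the unit has been rate-limited, fix the error first, then clear systemd’s failed state; a minimal sketch:

cr0x@server:~$ systemctl reset-failed pvedaemon   # clears the "start request repeated too quickly" counter
cr0x@server:~$ systemctl start pvedaemon
cr0x@server:~$ systemctl is-active pvedaemon      # expect "active"; anything else means back to the journal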

Second: check for “it’s not pvedaemon, it’s everything underneath”

  1. Disk full on root filesystem or /var/log.
  2. Broken Perl module / failed package upgrade.
  3. Hostname / DNS / cluster state issues causing API calls to fail.
  4. Storage timeouts (NFS/iSCSI/Ceph) causing tasks to block and pvedaemon to look “hung.”
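
A quick triage pass over those four layers, assuming a default install (adjust paths and storage names to your environment):

cr0x@server:~$ df -h / /var/log                        # full root or log filesystem?
cr0x@server:~$ dpkg --audit                            # half-configured packages after an upgrade?
cr0x@server:~$ hostname -f; getent hosts "$(hostname -f)"   # does the hostname resolve consistently?
cr0x@server:~$ pvesm status                            # any storage inactive, or does this command itself hang?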

Third: decide on the recovery posture

  • If it’s a clean crash with a clear log error: fix and restart.
  • If it’s a hang due to storage: stop digging with a shovel; stabilize storage first.
  • If it’s a cluster split: avoid “random restarts.” Reconcile quorum and corosync health.

Joke #1 (short and relevant): If pvedaemon is down, your Proxmox UI becomes a very expensive wallpaper generator.

Interesting facts and context (why this fails in surprising ways)

A little context makes troubleshooting faster because you stop expecting Proxmox to behave like a single service.
Here are concrete facts that matter when pvedaemon.service fails:

  1. Proxmox tasks are orchestrated across multiple daemons: pveproxy for the UI/API, pvedaemon for workers, and often pvestatd for metrics refresh.
  2. Most Proxmox management logic is implemented in Perl. A missing Perl module after an upgrade can crash daemons at startup like a trapdoor.
  3. systemd restart rate limiting will mark services as failed even if they could recover—because systemd is protecting the node from a fork-bomb loop.
  4. Proxmox VE evolved from the “PVE” stack around 2008 with a strong focus on Debian packaging. That means package state matters: half-installed packages can cripple core services.
  5. Task logs are written under /var/log/pve/, and a full root filesystem can break task creation in ways that look like “random daemon failure.”
  6. Cluster membership affects task routing. In a cluster, some tasks consult cluster config under /etc/pve, which is a distributed filesystem backed by pmxcfs.
  7. /etc/pve is not “just a directory”. It’s a special FUSE filesystem. If pmxcfs has issues, configs can look missing or stale, and daemons can crash on reads.
  8. Storage plugins can block management tasks. If an NFS server hangs, a “simple” backup listing can block until timeouts cascade.
  9. Proxmox uses task IDs (UPIDs) to track jobs. If you can’t create a UPID, you usually have a daemon/log/permission problem—not a “VM problem.”

How failures present: symptoms that point to pvedaemon

You don’t troubleshoot by guessing. You match symptoms to likely failure domains.

Classic symptom clusters

  • Web UI loads, but every action fails: start VM, stop VM, snapshot, backup. That’s pvedaemon or API worker failure, not QEMU itself.
  • Tasks appear as “running” forever: often storage calls stuck, lock contention, or a worker blocked in I/O.
  • “Connection refused” or 5xx-style API errors: can be pvedaemon down, pveproxy issues, or local socket permission problems.
  • After an upgrade, tasks stop: packaging mismatch, Perl module dependency breakage, or stale services not restarted cleanly.
  • On a cluster, only one node is weird: local service failure. If all nodes are weird: cluster/quorum or shared storage trouble.

Practical tasks: commands, expected output, and the decision you make

Below are hands-on checks I actually run. Each includes: the command, what the output means, and what decision to make next.
Use them like a playbook, not like a buffet.

Task 1: Check whether pvedaemon is failed, dead, or in a restart loop

cr0x@server:~$ systemctl status pvedaemon --no-pager
● pvedaemon.service - PVE API Daemon
     Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2025-12-22 09:14:02 UTC; 2min 11s ago
    Process: 2190 ExecStart=/usr/bin/pvedaemon start (code=exited, status=255/EXCEPTION)
   Main PID: 2190 (code=exited, status=255/EXCEPTION)

Dec 22 09:14:02 server systemd[1]: pvedaemon.service: Main process exited, code=exited, status=255/EXCEPTION
Dec 22 09:14:02 server systemd[1]: pvedaemon.service: Failed with result 'exit-code'.

Meaning: It’s not running, and it exited immediately. That’s usually a startup error (config, missing dependency, permissions), not “high load.”

Decision: Go straight to logs (journalctl) to find the first real error. Don’t restart blindly.

Task 2: Read the pvedaemon journal, but do it like you mean it

cr0x@server:~$ journalctl -u pvedaemon -b --no-pager -n 200
Dec 22 09:14:02 server pvedaemon[2190]: Starting pvedaemon
Dec 22 09:14:02 server pvedaemon[2190]: Can't locate PVE/API2/Tasks.pm in @INC (you may need to install the PVE::API2::Tasks module) (@INC contains: /usr/share/perl5 ...)
Dec 22 09:14:02 server pvedaemon[2190]: BEGIN failed--compilation aborted at /usr/bin/pvedaemon line 7.
Dec 22 09:14:02 server systemd[1]: pvedaemon.service: Main process exited, code=exited, status=255/EXCEPTION

Meaning: Packaging/dependency problem. Perl can’t find a Proxmox module. This often happens after an interrupted upgrade or partial install.

Decision: Stop troubleshooting “services” and fix packages. Jump to Task 8 (package sanity).

Task 3: Check whether other PVE daemons are also unhappy

cr0x@server:~$ systemctl --no-pager --failed
  UNIT                 LOAD   ACTIVE SUB    DESCRIPTION
● pvedaemon.service     loaded failed failed PVE API Daemon
● pvestatd.service      loaded failed failed PVE Status Daemon

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

Meaning: Not just one daemon. That points to shared dependencies: pmxcfs, Perl libs, disk full, broken config, or failed upgrade.

Decision: Expand scope: check pve-cluster/pmxcfs, disk, and packages before chasing one service.

Task 4: Confirm the web UI daemon isn’t the one lying to you

cr0x@server:~$ systemctl status pveproxy --no-pager
● pveproxy.service - PVE API Proxy Server
     Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2025-12-22 08:51:10 UTC; 25min ago
   Main PID: 1422 (pveproxy)
      Tasks: 4 (limit: 154606)
     Memory: 78.3M
        CPU: 6.321s
     CGroup: /system.slice/pveproxy.service
             ├─1422 pveproxy
             └─1427 "pveproxy worker"

Meaning: UI/API proxy is fine. Users can log in. They’ll insist “Proxmox is up.” They’re technically correct, which is the worst kind of correct.

Decision: Focus on pvedaemon and the back-end chain (pmxcfs, storage, packages).

Task 5: Check pmxcfs and cluster filesystem health (/etc/pve)

cr0x@server:~$ systemctl status pve-cluster --no-pager
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2025-12-22 08:51:07 UTC; 26min ago
   Main PID: 1201 (pmxcfs)
      Tasks: 7 (limit: 154606)
     Memory: 26.9M
        CPU: 2.113s
     CGroup: /system.slice/pve-cluster.service
             └─1201 /usr/bin/pmxcfs -l
cr0x@server:~$ mount | grep /etc/pve
pmxcfs on /etc/pve type fuse.pmxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)

Meaning: If pmxcfs isn’t mounted or pve-cluster is down, many configs “disappear” and daemons may crash reading them.

Decision: If pve-cluster is failed, fix cluster filesystem first; restarting pvedaemon will be pointless.

Task 6: Check quorum and corosync (clustered nodes)

cr0x@server:~$ pvecm status
Cluster information
-------------------
Name:             prod-pve
Config Version:   18
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Dec 22 09:17:12 2025
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.2
Quorate:          Yes

Meaning: If quorum is No, cluster FS may go read-only or stale and management operations can fail or behave strangely.

Decision: If non-quorate, fix networking/corosync first. Do not “force it” unless you accept split-brain risks and you know exactly what you’re doing.
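
To see why a node lost quorum, look at corosync link state and its journal before restarting anything; a minimal sketch:

cr0x@server:~$ corosync-cfgtool -s                     # per-link status: connected links vs. down links
cr0x@server:~$ journalctl -u corosync -b -n 50 --no-pager   # token timeouts, retransmits, membership changes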

Task 7: Check disk space and inode exhaustion (yes, really)

cr0x@server:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        94G   94G     0 100% /
cr0x@server:~$ df -ih /var/log
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sda2        6.0M  6.0M     0  100% /

Meaning: Full disks or no inodes can prevent log writes, task log creation, temp file creation, and even socket operations. Daemons may crash or refuse to start.

Decision: Free space immediately (rotate logs, delete old backups on local storage, clear package cache). Do it before restarting anything.
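
A hedged example of reclaiming space fast; find the big consumers before deleting anything:

cr0x@server:~$ du -xh --max-depth=2 / 2>/dev/null | sort -h | tail -n 15   # largest directories on the root filesystem
cr0x@server:~$ journalctl --vacuum-size=200M           # shrink the systemd journal to roughly 200 MB
cr0x@server:~$ apt-get clean                           # drop cached .deb files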

Task 8: Validate dpkg/apt health (the “missing module” class of failure)

cr0x@server:~$ dpkg --audit
The following packages are only half configured, probably due to problems
configuring them the first time. The configuration should be retried using
dpkg --configure <package> or the configure menu option in dselect:
 pve-manager               Proxmox Virtual Environment management tools
cr0x@server:~$ apt-get -f install
Reading package lists... Done
Building dependency tree... Done
Correcting dependencies... Done
The following additional packages will be installed:
  pve-cluster pve-container pve-ha-manager
0 upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 0 B/5,842 kB of archives.
After this operation, 24.8 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
Setting up pve-cluster (8.0.7) ...
Setting up pve-manager (8.2.2) ...

Meaning: Half-configured packages are a smoking gun. If Perl modules are missing, tasks will fail spectacularly and repeatedly.

Decision: Get the package state clean (dpkg --configure -a, apt-get -f install) before blaming systemd.

Task 9: Try starting pvedaemon manually and watch for immediate errors

cr0x@server:~$ systemctl restart pvedaemon
cr0x@server:~$ systemctl status pvedaemon --no-pager
● pvedaemon.service - PVE API Daemon
     Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2025-12-22 09:20:41 UTC; 3s ago
   Main PID: 3122 (pvedaemon)
      Tasks: 4 (limit: 154606)
     Memory: 64.2M
        CPU: 424ms
     CGroup: /system.slice/pvedaemon.service
             ├─3122 pvedaemon
             └─3123 "pvedaemon worker"

Meaning: Service started and stayed up. That’s the easy case.

Decision: Immediately test a simple task (Task 12) and then hunt the original trigger (full disk, broken package, etc.) so it doesn’t recur.

Task 10: If it keeps crashing, capture the exact exit reason

cr0x@server:~$ systemctl show -p ExecMainStatus -p ExecMainCode -p Result pvedaemon
ExecMainCode=1
ExecMainStatus=255
Result=exit-code

Meaning: Exit code 255 is common for “exception at startup” in higher-level runtimes. It’s not telling you enough.

Decision: You need the first exception in journalctl (Task 2) or a package verification (Task 8).

Task 11: Check for runaway memory pressure / OOM kills

cr0x@server:~$ journalctl -k -b --no-pager | tail -n 30
Dec 22 09:12:44 server kernel: Out of memory: Killed process 2877 (pvedaemon) total-vm:512004kB, anon-rss:221344kB, file-rss:1232kB, shmem-rss:0kB, UID:0 pgtables:624kB oom_score_adj:0
Dec 22 09:12:44 server kernel: oom_reaper: reaped process 2877 (pvedaemon), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Meaning: The kernel killed it. Not systemd. Not Proxmox. The kernel. Usually due to memory overcommit, a ballooning guest gone wild, or host swap exhaustion.

Decision: Fix memory pressure: reduce host overcommit, add swap (carefully), stop the offending guests, and consider reserving memory for the host. Restarting pvedaemon without fixing memory is just a loop.
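
Before touching guests, measure how tight memory actually is; a minimal sketch:

cr0x@server:~$ free -h                                 # current headroom and swap usage
cr0x@server:~$ swapon --show                           # is swap configured at all?
cr0x@server:~$ journalctl -k -b --no-pager | grep -ci 'out of memory'   # OOM kill count this boot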

Task 12: Validate task execution end-to-end

cr0x@server:~$ pvesh get /nodes/$(hostname)/tasks --limit 3
[
  {
    "endtime": 1734868815,
    "id": "UPID:server:00001243:0002A3D1:6767F480:vzdump:101:root@pam:",
    "node": "server",
    "pid": 4675,
    "starttime": 1734868780,
    "status": "OK",
    "type": "vzdump",
    "upid": "UPID:server:00001243:0002A3D1:6767F480:vzdump:101:root@pam:",
    "user": "root@pam"
  }
]

Meaning: Tasks are being created, tracked, and completed. If your UI still misbehaves, you likely have a proxy/UI cache issue—not a worker problem.

Decision: If this works, move to storage-specific validation and whatever originally failed (backups, migrations, snapshots).

Task 13: Find a stuck task and identify which worker holds it

cr0x@server:~$ ls -1 /var/log/pve/tasks | tail -n 5
UPID:server:00000A1C:00006D32:6767F1C0:qmstart:103:root@pam:
UPID:server:00000A21:00006DA0:6767F1E2:vzsnapshot:101:root@pam:
UPID:server:00000B10:00008211:6767F23A:vzdump:101:root@pam:
UPID:server:00000B15:00008260:6767F248:qmstop:102:root@pam:
UPID:server:00000B19:00008288:6767F251:qmstart:104:root@pam:
cr0x@server:~$ tail -n 50 "/var/log/pve/tasks/UPID:server:00000B10:00008211:6767F23A:vzdump:101:root@pam:"
INFO: starting new backup job: vzdump 101 --storage backup-nfs --mode snapshot --compress zstd
ERROR: storage 'backup-nfs' is not online
INFO: backup job failed

Meaning: This is not “pvedaemon down.” This is “task ran and failed.” The log tells you what subsystem to fix.

Decision: Stop debugging daemons. Fix storage connectivity/state.

Task 14: Check for locks that stall tasks (common with storage hiccups)

cr0x@server:~$ pvesh get /cluster/locks
[
  {
    "lock": "backup",
    "node": "server",
    "type": "storage",
    "id": "backup-nfs",
    "time": 1734868001
  }
]

Meaning: A lock exists. Locks aren’t bad; they’re serialized safety. But stale locks can happen after crashes or timeouts.

Decision: If the lock is old and the task is dead, confirm no active process is running, then clear it safely (see Task 15). Don’t delete lock files blindly.

Task 15: Verify whether a “stuck” operation is still running before you clear anything

cr0x@server:~$ ps aux | egrep 'vzdump|qemu-img|pvesm|pvedaemon' | grep -v egrep
root      3122  0.0  0.2  52440 18944 ?        Ss   09:20   0:00 pvedaemon
root      4675  0.1  0.3 120332 26420 ?        Ss   09:24   0:02 vzdump 101 --storage backup-nfs --mode snapshot --compress zstd

Meaning: The job is still running. Clearing locks now would be a great way to create overlapping backups and inconsistent snapshots.

Decision: If it’s hung, debug why (usually storage I/O). Only kill the process if you understand consequences (snapshot cleanup, partial archives).

Task 16: Confirm storage status as Proxmox sees it (not as you wish it to be)

cr0x@server:~$ pvesm status
Name         Type     Status     Total     Used     Available  %
local        dir      active     93.00G   12.40G     75.85G   13%
backup-nfs   nfs      inactive       0B       0B         0B    0%

Meaning: Storage is inactive. If tasks depend on it, they’ll fail or hang.

Decision: Troubleshoot storage at OS level: mounts, network, credentials. Do not blame pvedaemon for refusing to use an offline storage.

Task 17: Verify NFS mounts and responsiveness (example)

cr0x@server:~$ grep backup-nfs /etc/pve/storage.cfg
nfs: backup-nfs
        server 10.20.30.40
        export /srv/nfs/pve-backup
        path /mnt/pve/backup-nfs
        content backup
        options vers=4.1,proto=tcp
cr0x@server:~$ mount | grep /mnt/pve/backup-nfs || echo "not mounted"
not mounted
cr0x@server:~$ timeout 5 bash -lc 'ls -la /mnt/pve/backup-nfs'
ls: cannot access '/mnt/pve/backup-nfs': No such file or directory

Meaning: Storage path doesn’t exist or isn’t mounted. Proxmox marks it offline, tasks fail. If the path exists but ls hangs, you’ve got a different problem: a stuck NFS mount.

Decision: Fix the mount definition/path, and if mounts hang, use umount -f or lazy unmount cautiously. Don’t reboot a node to fix a mount unless you enjoy downtime.
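
If the mount is hung rather than missing, a cautious unmount escalation looks like this (stop at the first step that works, and only after confirming no backup is writing to it):

cr0x@server:~$ umount /mnt/pve/backup-nfs              # clean unmount first
cr0x@server:~$ umount -f /mnt/pve/backup-nfs           # force, for an unreachable NFS server
cr0x@server:~$ umount -l /mnt/pve/backup-nfs           # lazy detach, last resort; the hung I/O doesn't disappear
cr0x@server:~$ pvesm status                            # let Proxmox re-evaluate the storage afterwards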

Task 18: Check TLS/hostname mismatch issues (a subtle task killer)

cr0x@server:~$ hostname -f
server.example.internal
cr0x@server:~$ grep -E 'server|127.0.1.1' /etc/hosts
127.0.0.1 localhost
127.0.1.1 server.example.internal server

Meaning: Name resolution is coherent. If it isn’t—if hostname -f returns something not matching certs or cluster config—you can see weird API errors and daemon confusion.

Decision: Fix hostname/DNS/hosts coherently across the cluster. Don’t “just add another hosts entry” until it works. That’s how you create haunted infrastructure.
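
A quick coherence check; on a clustered node, compare this on every node and against the node names in the cluster config (the grep pattern below is an assumption about your corosync.conf layout):

cr0x@server:~$ hostname -f
cr0x@server:~$ getent hosts "$(hostname -f)"           # the address other nodes and certificates will see
cr0x@server:~$ grep -A3 "name: $(hostname -s)" /etc/pve/corosync.conf   # does the cluster config agree?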

Root causes that actually happen in production

1) Broken package state after upgrade (most common, least glamorous)

Proxmox upgrades are usually smooth—until they’re not. The failure mode isn’t always “upgrade failed.” Sometimes the upgrade “mostly worked” and left you with missing Perl modules, half-configured packages, or daemons running old code against new libraries.

The tell: Can't locate ... in @INC, BEGIN failed, or dpkg audit complaints. Fix it by completing package configuration and ensuring repositories are correct for your Proxmox/Debian version.

2) Disk full or inode exhaustion (quietly lethal)

Proxmox writes task logs, state, and temp data. If root is full, the symptom is often “tasks don’t start” or “service won’t stay up.”
People love to treat disk-full alerts as optional until they become mandatory.

Common triggers: local backups left on local storage, runaway journal logs, crash dumps, or ISO hoarding.

3) pmxcfs or cluster state problems

In a cluster, /etc/pve is the shared brain. If it’s unavailable, daemons that depend on reading config can crash or misbehave.
If quorum is lost, you might see reads become stale or writes blocked. That’s not Proxmox being fragile. That’s Proxmox refusing to lie about distributed state.

4) Storage timeouts and hung I/O (pvedaemon “up” but useless)

You can have a running pvedaemon that is effectively dead because its workers are blocked on I/O. This is especially common with:

  • NFS mounts that hang rather than fail fast
  • iSCSI multipath misconfiguration
  • Ceph health issues (slow ops, blocked requests)
  • Fibre Channel path flaps that cause long SCSI timeouts

The trick is to distinguish: daemon crashed vs daemon blocked. Logs and process state tell you which.
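
A minimal sketch for telling the two apart: a crashed daemon shows up as failed or inactive in systemd, while a blocked one is active but its workers sit in uninterruptible sleep waiting on I/O.

cr0x@server:~$ systemctl is-active pvedaemon           # "failed"/"inactive" = crashed; "active" = keep looking
cr0x@server:~$ ps axo pid,stat,wchan:20,cmd | grep '[p]vedaemon'   # STAT "D" = blocked in kernel I/O, usually storage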

5) Memory pressure and OOM kills

When the kernel starts killing processes, it doesn’t pick the one you like. If pvedaemon gets shot, tasks stop. But the true problem is host memory governance: ballooning, swap policy, and the fact that “it booted” isn’t the same as “it’s stable.”

6) Permissions and filesystem quirks (containers make this spicy)

Permission issues show up when task logs can’t be written, when a directory under /var/lib has wrong ownership, or when someone “hardened” the node and broke assumptions.
Proxmox is not allergic to hardening, but it expects the basics: correct owners, writable paths, and sane mount options.

Three corporate mini-stories (because this always happens to “someone else”)

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran a two-node Proxmox cluster for “high availability.” Not HA-manager, just a cluster. Their assumption: “Two nodes means redundancy.”
The networking team made a routine change on a core switch. A short broadcast storm later, corosync started dropping packets.

Node A still served the web UI. People logged in and tried to restart a few “stuck” VMs. Tasks didn’t run. Naturally, someone concluded: “pvedaemon is broken.”
They restarted daemons. They restarted the whole node. Nothing improved, because the node was non-quorate and pmxcfs refused to behave like a single-node filesystem.

The wrong assumption wasn’t about pvedaemon. It was about quorum. Two-node clusters are operational debt unless you deliberately design quorum behavior (witness, qdevice, or a third node).

The fix wasn’t heroic: restore corosync connectivity, re-establish quorum, confirm /etc/pve health, then restart services cleanly.
Afterward, they added a quorum device. Boring. Effective. The only kind of “redundancy” that matters at 3 a.m.

Mini-story 2: The optimization that backfired

Another shop wanted faster backups. They switched vzdump to use an NFS target with aggressive mount options and tuned timeouts, because “defaults are slow.”
It worked. For a while. Backups were faster when the NAS was healthy.

Then the NAS had a controller failover during the backup window. The NFS export didn’t go down cleanly; it went “half alive.” Reads sometimes worked. Metadata calls hung.
From Proxmox’s point of view, tasks started and then froze. Workers piled up. pvedaemon stayed running, but every meaningful task queue became a parking lot.

The “optimization” created a system that failed by hanging, not failing fast. That’s the worst failure mode for a task runner because it eats worker capacity while providing no useful error.

The long-term fix was to stop treating storage as a magical black box: they added monitoring for NFS responsiveness (not just ping), changed mount behavior to favor faster failure, and staggered backups to reduce blast radius.
Backups got slightly slower. Incidents got dramatically rarer. That’s a trade most adults accept.
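
For context, the “favor faster failure” knob for NFS-backed storage usually lives in the mount options in /etc/pve/storage.cfg. A hedged example only, reusing the storage from Task 17; soft mounts convert hangs into I/O errors, which can abort a backup instead of parking it forever, so pick that trade consciously:

nfs: backup-nfs
        server 10.20.30.40
        export /srv/nfs/pve-backup
        path /mnt/pve/backup-nfs
        content backup
        options vers=4.1,proto=tcp,soft,timeo=150,retrans=3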

Mini-story 3: The boring but correct practice that saved the day

A regulated environment ran Proxmox clusters with a strict maintenance routine: before upgrades, they checked free space on root, confirmed dpkg health, and took a configuration backup of /etc/pve (plus a note of current versions).
Nobody loved this checklist. It was paperwork with shell commands.

During one upgrade window, a mirror issue caused partial downloads and left pve-manager half configured. pvedaemon failed to start, and tasks went dark.
The on-call engineer didn’t improvise. They executed the boring plan: confirm package state, fix dependencies, re-run configuration, restart daemons in a clean order.

Because they had a known-good baseline (versions, space, and a copy of configs), they could make confident decisions fast. No guessing. No “maybe it’s DNS” spirals.
Systems recovered quickly, and the postmortem was short, which is a quiet sign of professional happiness.

Joke #2 (short and relevant): The only thing more persistent than a stale lock is an executive asking if you “tried restarting it.”

Common mistakes: symptom → root cause → fix

This section is blunt on purpose. These are the mistakes that turn a 10-minute repair into a half-day incident.

1) “Tasks don’t run, so I restart the whole node”

  • Symptom: UI works; tasks never start or immediately fail.
  • Root cause: pvedaemon crashed due to missing packages or full disk; reboot doesn’t fix either.
  • Fix: Check systemctl status pvedaemon, then journalctl -u pvedaemon, then fix package/disk issues before restart.

2) “I cleared locks because things looked stuck”

  • Symptom: Backups/migrations hang; locks show in cluster locks.
  • Root cause: Job is still running but blocked on storage; clearing lock causes overlapping operations.
  • Fix: Confirm processes with ps and storage responsiveness. Only clear locks when you’ve proven the worker is dead.

3) “Storage is fine because ping works”

  • Symptom: pvedaemon running but tasks hang, especially backups/snapshots.
  • Root cause: NFS/iSCSI/Ceph stalls at the I/O layer; ping is irrelevant.
  • Fix: Test actual operations: ls on mount, read/write small files, check pvesm status, examine kernel logs for I/O timeouts.

4) “I upgraded, it completed, so packages must be fine”

  • Symptom: pvedaemon exits with Perl module errors.
  • Root cause: Half-configured packages or repo mismatch.
  • Fix: dpkg --audit, dpkg --configure -a, apt-get -f install, then restart services.

5) “It’s a single node, cluster services don’t matter”

  • Symptom: Config reads fail, /etc/pve looks wrong, tasks error on config access.
  • Root cause: pmxcfs/pve-cluster down even on a “single node” install, because Proxmox still uses it for config management.
  • Fix: Restore pve-cluster health and confirm /etc/pve is mounted as fuse.pmxcfs.

6) “I fixed the daemon, but tasks still don’t run”

  • Symptom: pvedaemon active; UI actions still error or hang.
  • Root cause: Underlying storage offline, stale lock, or API routing/hostname mismatch.
  • Fix: Check task logs in /var/log/pve/tasks/, validate pvesm status, and confirm hostname/DNS consistency.

Checklists / step-by-step plan

Checklist A: Rapid restore of task execution (single node)

  1. Check disk space and inodes: df -h /, df -ih /. Free space if needed.
  2. Check service status: systemctl status pvedaemon.
  3. Check logs: journalctl -u pvedaemon -b -n 200. Find the first real error.
  4. If packaging: run dpkg --audit, then dpkg --configure -a, then apt-get -f install.
  5. Confirm pmxcfs: systemctl status pve-cluster and mount | grep /etc/pve.
  6. Restart in sane order:
    • systemctl restart pve-cluster (if needed)
    • systemctl restart pvedaemon
    • systemctl restart pveproxy (optional, if UI weirdness persists)
  7. Validate with pvesh get /nodes/$(hostname)/tasks --limit 3 and one real operation (start/stop a test VM).

Checklist B: Cluster-safe restore (avoid making it worse)

  1. Check quorum first: pvecm status.
  2. If not quorate: fix corosync network first. Avoid random restarts.
  3. Confirm /etc/pve is mounted and coherent on the node you’re using.
  4. Check pvedaemon logs for config read errors or permission problems.
  5. Check storage status from the node where tasks are failing: pvesm status.
  6. Only after the above: restart pvedaemon on the affected node and re-test tasks.

Checklist C: Prevent recurrence (the part most teams skip)

  1. Put root filesystem free-space alerts where humans see them, and act on them.
  2. After upgrades, verify daemon health: systemctl is-active pvedaemon and run one test task.
  3. Make storage “fail fast” where possible, and monitor storage responsiveness (not just reachability).
  4. Document how your cluster handles quorum (especially if it’s two nodes).
  5. Keep a known-good rollback plan for package issues: package caches, local mirrors, or at least a tested recovery procedure.
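
A post-upgrade health check small enough that people actually run it; a sketch assuming the standard service names:

cr0x@server:~$ for s in pve-cluster pvedaemon pveproxy pvestatd; do printf '%-12s %s\n' "$s" "$(systemctl is-active "$s")"; done
cr0x@server:~$ dpkg --audit                            # no output means package state is clean
cr0x@server:~$ df -h / | awk 'NR==2 {print "root usage:", $5}'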

FAQ

1) What exactly breaks when pvedaemon is down?

Most node-local management tasks: VM/LXC lifecycle actions, backups, snapshots, many storage operations, and anything that needs a task worker and UPID tracking.

2) Why does the web UI still load if pvedaemon is failed?

Because pveproxy serves the UI and API front-end. It can respond even if the back-end worker is missing. The UI isn’t malicious; it’s just optimistic.

3) How do I tell the difference between “pvedaemon crashed” and “pvedaemon hung”?

If systemd shows failed or inactive, it crashed or didn’t start. If it’s active (running) but tasks hang, check storage I/O, locks, and worker processes.

4) I see “Can’t locate … in @INC” in the journal. What now?

Fix package state. Run dpkg --audit, then dpkg --configure -a and apt-get -f install. This is almost never solved by editing random Perl paths.

5) Can I just reinstall pve-manager?

Sometimes, yes, but treat it as a package-health problem, not a single-package problem. A forced reinstall without resolving repo mismatch or partial upgrades can make it worse.

6) Tasks are stuck and I see locks. Is it safe to remove them?

Only after you confirm the underlying process is not running and the storage subsystem is stable. Locks are there to prevent corruption; clearing them blindly is gambling with your data.

7) Could DNS/hostname issues really cause task failures?

Yes, especially in clusters where nodes reference each other by name and validate certs. Inconsistent hostname resolution can cause API calls, migrations, and storage actions to fail in ways that look unrelated.

8) Does restarting pveproxy help when tasks don’t run?

Rarely. If pvedaemon is dead, restarting the UI proxy is theater. Restart pvedaemon after fixing the cause, and only restart pveproxy if the UI is caching errors or acting strange.

9) Is this related to vzdump backups specifically?

Backups are a common trigger because they touch storage heavily and create locks. But pvedaemon failing is broader: any task runner failure will affect many operations.

10) What’s the single fastest “is it fixed?” test?

Start a harmless, quick task and confirm it completes. For example, query recent tasks with pvesh and run a stop/start on a non-critical VM.
If tasks generate UPIDs and complete, the worker path is back.
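
For example, a minimal end-to-end test might look like this (VMID 105 stands in for any non-critical guest you can safely bounce; CLI commands exercise the task/UPID path, while a start from the web UI also exercises the pveproxy-to-pvedaemon chain):

cr0x@server:~$ qm stop 105 && qm start 105
cr0x@server:~$ pvesh get /nodes/$(hostname)/tasks --limit 2   # both actions should appear with status "OK"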

Conclusion: next steps that stick

When pvedaemon.service fails, Proxmox doesn’t so much “go down” as it stops doing work. That’s why it burns time: dashboards still render, users still click, and nothing actually happens.

Do this next, in this order:

  1. Confirm the failure mode: crashed vs hung (systemctl status, journalctl).
  2. Fix the usual suspects: disk/inodes, package state, pmxcfs/quorum, storage responsiveness.
  3. Restart deliberately: restart the right daemons in the right order, then run an end-to-end task test.
  4. Prevent recurrence: alert on root usage, validate after upgrades, and monitor storage like it’s part of the compute stack—because it is.