You reboot a Proxmox node and it comes back with that smug gray line:
Dependency failed for …. The web UI is dead, VMs are down, and you’re staring at a console that’s basically saying,
“Something else didn’t happen, so I didn’t happen either.”
The trap is that the unit shown on screen is often the victim, not the culprit. systemd reports what couldn’t start,
not what started the chain reaction. This guide is how you find the first domino—quickly, repeatably, and without guessing.
The mental model: what “Dependency failed” really means
systemd is a dependency engine with a logging habit. It starts units (services, mounts, devices, targets) based on declared relationships
(Requires=, Wants=, After=, Before=, plus implicit links from mount units, device units,
and generators).
A “Dependency failed” message for unit X means one of X’s required dependencies entered a failed state
(or never appeared within a timeout). That can be:
- A service that failed (non-zero exit, timeout, watchdog, missing config).
- A mount that didn’t mount (bad fs, missing device, wrong UUID, ZFS/LVM not imported).
- A device unit that never showed up (disk missing, HBA firmware doing interpretive dance).
- A target that can’t be reached because something required to reach it failed.
The line you see on console is almost always downstream. Your job is to walk upstream until you find the first unit that failed, and then
ask: is this a real failure (broken), or an ordering problem (wrong dependencies)?
One quote worth keeping in your head, attributed to John Gall (paraphrased idea): Complex systems that work usually evolved from simpler systems that worked.
If your boot is complex, debug it like an evolution: find the earliest failure, not the last symptom.
Fast diagnosis playbook (first/second/third checks)
If this is production and you’re bleeding minutes, do this in order. The goal is to identify the first failing unit and
whether it’s storage, network, cluster, or a plain old config error.
First: list failures and inspect the earliest one
cr0x@server:~$ systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● pve-cluster.service loaded failed failed The Proxmox VE cluster filesystem
● zfs-import-cache.service loaded failed failed Import ZFS pools by cache file
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
Decision: if there are only one or two failed units, start with the one that looks most “foundational” (storage import, mount,
network-online). If there are many, the “foundational” one is almost always the first failure.
Second: read this boot's journal, not your console scrollback
cr0x@server:~$ journalctl -b -0 -p err..alert --no-pager | head -n 50
Dec 26 09:14:03 server systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
Dec 26 09:14:03 server systemd[1]: Failed to start Import ZFS pools by cache file.
Dec 26 09:14:03 server systemd[1]: Dependency failed for ZFS Mount.
Dec 26 09:14:03 server systemd[1]: Dependency failed for Proxmox VE firewall.
Decision: prioritize the first “Failed to start …” line. The later “Dependency failed for …” entries are downstream.
Third: draw the dependency chain for the victim unit
cr0x@server:~$ systemctl list-dependencies --reverse pve-firewall.service
pve-firewall.service
● pve-container.service
● pve-guests.service
● multi-user.target
Decision: if the victim is a “leaf” service (firewall, UI, guest management), stop blaming it. Walk up to the dependency that
actually failed (storage import, mount, network-online).
Time-saving heuristic: storage before cluster before network before “random service”
On Proxmox nodes, boot blockers cluster into four buckets:
storage imports (ZFS/LVM), mounts (including /var/lib/vz),
cluster filesystem (pve-cluster, corosync), and network readiness.
App-level services tend to fail because one of those didn’t.
Joke #1: systemd is like a very strict librarian—if you whisper “After=network-online.target” without arranging it properly, it will shush your boot.
Interesting facts and historical context (for better instincts)
- systemd replaced SysV init on most Linux distros in the early 2010s, bringing parallel startup and explicit dependency graphs instead of shell script ordering.
- “After=” is not a requirement. It only orders start time. People still confuse it with Requires=, and boot failures are born from that confusion.
- Mounts are first-class units. systemd generates .mount units from /etc/fstab, and those become dependencies for services that need them.
- Device units appear dynamically. A missing disk doesn’t “fail”; it just never shows up. Downstream units then time out, which looks like a service failure.
- ZFS on Linux uses import services (zfs-import-cache/zfs-import-scan) to bring pools online during boot; missing cache files or renamed pools can stall everything depending on mounts.
- Proxmox leans on /etc/pve, a cluster filesystem backed by pmxcfs. If pmxcfs doesn’t mount, half the management stack can’t read config.
- Network “online” is contentious. Different distros ship different wait-online services. Enabling the wrong one can add boot delays or create deadlocks when bridges/VLANs are involved.
- Initramfs changes are subtle. A module missing from initramfs can turn a perfectly fine root disk into “not found,” and the boot failure message will look unrelated.
- systemd has been able to show critical path timing for years. It’s one of the fastest ways to spot a boot bottleneck that doesn’t strictly “fail.”
Practical tasks: commands, expected output, decisions
These are field-tested checks. Each includes a command, what the output means, and what you do next.
Don’t run everything blindly; pick the ones that match the failure domain you see.
Task 1: List failed units (the obvious starting point)
cr0x@server:~$ systemctl --failed --no-pager
UNIT LOAD ACTIVE SUB DESCRIPTION
● zfs-mount.service loaded failed failed Mount ZFS filesystems
● pveproxy.service loaded failed failed PVE API Proxy Server
2 loaded units listed.
Meaning: pveproxy is probably collateral damage. zfs-mount is foundational.
Decision: inspect zfs-mount first, not the proxy.
Task 2: Inspect the unit that failed (status includes last logs)
cr0x@server:~$ systemctl status zfs-mount.service --no-pager -l
× zfs-mount.service - Mount ZFS filesystems
Loaded: loaded (/lib/systemd/system/zfs-mount.service; enabled)
Active: failed (Result: exit-code) since Thu 2025-12-26 09:14:03 UTC; 1min 2s ago
Process: 812 ExecStart=/sbin/zfs mount -a (code=exited, status=1/FAILURE)
Main PID: 812 (code=exited, status=1/FAILURE)
Dec 26 09:14:03 server zfs[812]: cannot mount 'rpool/data': failed to create mountpoint: /rpool/data
Dec 26 09:14:03 server systemd[1]: zfs-mount.service: Failed with result 'exit-code'.
Meaning: the failing action is explicit: ZFS can’t create a mountpoint.
Decision: check whether the mountpoint is read-only, missing root mount, or blocked by a directory/file conflict.
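Two quick checks usually settle which of those it is. Dataset names here follow the example above; substitute your own:
cr0x@server:~$ zfs get -o property,value mountpoint,canmount rpool/data
PROPERTY    VALUE
mountpoint  /rpool/data
canmount    on
cr0x@server:~$ findmnt -rno TARGET,OPTIONS /
/ rw,relatime,xattr,noacl
If / shows ro, nothing can create /rpool/data and the error above is explained. If a stale directory or file already occupies the mountpoint, ls -la of the parent directory will show it.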
Task 3: Identify the first failure in the boot timeline
cr0x@server:~$ journalctl -b -0 --no-pager | grep -E "Failed to start|Dependency failed" | head -n 30
Dec 26 09:14:02 server systemd[1]: Failed to start Import ZFS pools by cache file.
Dec 26 09:14:03 server systemd[1]: Dependency failed for ZFS Mount.
Dec 26 09:14:03 server systemd[1]: Failed to start Mount ZFS filesystems.
Dec 26 09:14:03 server systemd[1]: Dependency failed for Proxmox VE cluster filesystem.
Meaning: zfs-import-cache failed first. Everything else is cascading.
Decision: inspect zfs-import-cache.service logs next.
Task 4: Deep dive a unit’s logs with journalctl
cr0x@server:~$ journalctl -u zfs-import-cache.service -b -0 --no-pager -l
Dec 26 09:14:02 server systemd[1]: Starting Import ZFS pools by cache file...
Dec 26 09:14:02 server zpool[771]: cannot open '/etc/zfs/zpool.cache': No such file or directory
Dec 26 09:14:02 server systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Dec 26 09:14:02 server systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
Meaning: no cache file. Not always fatal if zfs-import-scan exists, but your dependency graph may force cache import.
Decision: confirm whether scan import is enabled, or regenerate cache from a working import.
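Confirming which import path is active is a one-liner. The unit names are the stock OpenZFS ones; the output below is illustrative:
cr0x@server:~$ systemctl is-enabled zfs-import-scan.service zfs-import-cache.service
disabled
enabled
If only the cache unit is enabled and the cache file is gone, boot-time import has no fallback. Regenerating the cache (Task 6) is the clean fix.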
Task 5: Check whether ZFS pools are importable at all
cr0x@server:~$ zpool import
pool: rpool
id: 1234567890123456789
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:
rpool ONLINE
mirror-0 ONLINE
sda3 ONLINE
sdb3 ONLINE
Meaning: the pool exists and is healthy, but not imported.
Decision: attempt a manual import (read-only first if you suspect damage), then fix boot-time import configuration.
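A sketch of the cautious variant first: -o readonly=on imports without risking writes, and -N skips mounting datasets until you have looked around.
cr0x@server:~$ sudo zpool import -o readonly=on -N rpool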
Task 6: Manual import (careful, but effective for diagnosis)
cr0x@server:~$ sudo zpool import -N rpool
cr0x@server:~$ zpool status rpool
pool: rpool
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda3 ONLINE 0 0 0
sdb3 ONLINE 0 0 0
errors: No known data errors
Meaning: import works. The boot failure is likely configuration/ordering (cache file, service enablement), not dead disks.
Decision: rebuild /etc/zfs/zpool.cache and ensure the correct import unit is enabled.
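A minimal sketch of that fix, assuming rpool is your only pool and the manual import above succeeded (setting cachefile writes the file immediately):
cr0x@server:~$ sudo zpool set cachefile=/etc/zfs/zpool.cache rpool
cr0x@server:~$ sudo systemctl enable zfs-import-cache.service
cr0x@server:~$ sudo zfs mount -a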
Task 7: Verify wait-online/network blockers
cr0x@server:~$ systemd-analyze blame | head -n 15
1min 32.104s systemd-networkd-wait-online.service
8.771s pve-cluster.service
4.210s zfs-mount.service
2.018s networking.service
Meaning: boot isn’t “failing” so much as hanging on “network online.”
Decision: confirm if your node actually needs “online” (often it doesn’t). If Proxmox services incorrectly require it, fix dependencies or configuration.
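If you genuinely need wait-online but only for one interface, scope it instead of disabling it. A drop-in sketch, assuming systemd-networkd manages the bridge and vmbr0 is the interface that matters (on default Proxmox with ifupdown2, this service shouldn't be enabled at all):
cr0x@server:~$ sudo systemctl edit systemd-networkd-wait-online.service
[Service]
# Clear the vendor command, then wait only for vmbr0.
ExecStart=
ExecStart=/lib/systemd/systemd-networkd-wait-online --interface=vmbr0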
Task 8: Show boot critical chain for a specific target
cr0x@server:~$ systemd-analyze critical-chain multi-user.target
multi-user.target @1min 42.909s
└─pve-guests.service @1min 42.870s +37ms
└─pve-cluster.service @1min 33.982s +8.884s
└─network-online.target @1min 32.111s
└─systemd-networkd-wait-online.service @0min 0.008s +1min 32.104s
Meaning: the “critical chain” shows what actually delayed reaching the target.
Decision: tune the wait-online service or remove the requirement if it’s unnecessary for your boot success.
Task 9: Inspect dependencies of the unit shown as “Dependency failed”
cr0x@server:~$ systemctl show -p Requires -p Wants -p After -p Before pve-cluster.service
Requires=system.slice basic.target
Wants=corosync.service
After=network.target syslog.target corosync.service
Before=
Meaning: pve-cluster wants corosync and starts after it, but it does not explicitly require network-online here.
Your delay may be elsewhere, or another unit drags in network-online.
Decision: follow the actual chain with systemd-analyze critical-chain and check drop-ins.
Task 10: Find drop-in overrides (where “helpful” changes hide)
cr0x@server:~$ systemctl cat pve-cluster.service
# /lib/systemd/system/pve-cluster.service
[Unit]
Description=The Proxmox VE cluster filesystem
After=network.target corosync.service
Wants=corosync.service
[Service]
Type=forking
ExecStart=/usr/bin/pmxcfs -l
ExecStop=/usr/bin/killall -TERM pmxcfs
Restart=on-failure
# /etc/systemd/system/pve-cluster.service.d/override.conf
[Unit]
After=network-online.target
Wants=network-online.target
Meaning: someone added a drop-in making pve-cluster wait for “online.”
Decision: decide if that’s warranted. In most Proxmox setups it’s not; prefer network.target unless you truly need routable connectivity during boot.
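Removing a drop-in cleanly is one command. systemctl revert deletes override files and falls back to the vendor unit:
cr0x@server:~$ sudo systemctl revert pve-cluster.service
Removed "/etc/systemd/system/pve-cluster.service.d/override.conf".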
Task 11: Confirm what “network-online” actually is on your box
cr0x@server:~$ systemctl status network-online.target --no-pager -l
● network-online.target - Network is Online
Loaded: loaded (/lib/systemd/system/network-online.target; static)
Active: active since Thu 2025-12-26 09:14:00 UTC; 4min ago
Docs: man:systemd.special(7)
Dec 26 09:14:00 server systemd[1]: Reached target Network is Online.
Meaning: the target is satisfied when its “wait” service is satisfied. Which wait service depends on your network stack.
Decision: check whether you’re using systemd-networkd, NetworkManager, or ifupdown, and disable the wrong wait-online unit.
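A quick way to see what's actually enabled; the second result here assumes NetworkManager isn't installed, which is typical for a Proxmox node:
cr0x@server:~$ systemctl is-enabled systemd-networkd-wait-online.service
disabled
cr0x@server:~$ systemctl is-enabled NetworkManager-wait-online.service
Failed to get unit file state for NetworkManager-wait-online.service: No such file or directory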
Task 12: Validate storage mounts tied to Proxmox (fstab and mount units)
cr0x@server:~$ findmnt -rno TARGET,SOURCE,FSTYPE,OPTIONS /var/lib/vz
/var/lib/vz rpool/var-lib-vz zfs rw,xattr,noacl
Meaning: Proxmox storage paths exist and are backed by the expected filesystem.
Decision: if this is missing or wrong, fix ZFS/LVM import/mount before touching Proxmox services.
Task 13: Check fstab for boot blockers (bad UUIDs, stale mounts)
cr0x@server:~$ grep -vE '^\s*#|^\s*$' /etc/fstab
UUID=2f1f9e2d-aaaa-bbbb-cccc-5f9b9d1d2e3f / ext4 defaults 0 1
UUID=deadbeef-1111-2222-3333-444444444444 /mnt/backup ext4 defaults 0 2
Meaning: a mount like /mnt/backup can block boot if the disk is gone.
Decision: if it’s non-critical, add nofail and sensible x-systemd.device-timeout=. Or remove it entirely.
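The fixed line for the example above would look like this (the 10-second timeout is a judgment call, not a magic number):
UUID=deadbeef-1111-2222-3333-444444444444 /mnt/backup ext4 defaults,nofail,x-systemd.device-timeout=10s 0 2
After editing fstab, run systemctl daemon-reload so the generator rebuilds the mount unit.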
Task 14: Confirm whether the missing device exists
cr0x@server:~$ lsblk -o NAME,SIZE,FSTYPE,UUID,MOUNTPOINT
NAME SIZE FSTYPE UUID MOUNTPOINT
sda 447.1G
├─sda1 512M vfat 8A1B-2C3D /boot/efi
├─sda2 1G ext4 0b7c9c2f-1234-5678-90ab-7e1f2c3d4e5f /boot
└─sda3 445.6G zfs_member
sdb 447.1G
└─sdb3 445.6G zfs_member
Meaning: if the fstab UUID doesn’t appear here, the mount will fail.
Decision: correct UUID, fix cabling/HBA, or mark the mount nofail if optional.
Task 15: Check pmxcfs and /etc/pve health (Proxmox-specific)
cr0x@server:~$ mount | grep -E "/etc/pve|pmxcfs"
pmxcfs on /etc/pve type fuse.pmxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
Meaning: if /etc/pve isn’t mounted, Proxmox config isn’t available; many services fail “mysteriously.”
Decision: fix pve-cluster.service before chasing pveproxy/pvedaemon errors.
Task 16: Confirm corosync status without guessing
cr0x@server:~$ systemctl status corosync.service --no-pager -l
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: active (running) since Thu 2025-12-26 09:14:01 UTC; 6min ago
Meaning: if corosync is down, pve-cluster may hang or fail depending on config.
Decision: inspect corosync logs and network (multicast/unicast, ring interfaces), but don’t “restart everything” in random order.
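To check link health without guessing, corosync-cfgtool shows ring/link state directly. Output format varies by corosync version; this sketch is the corosync 3/knet shape, and the question that matters is whether peers show connected:
cr0x@server:~$ corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
        addr    = 10.10.10.11
        status:
                nodeid:   1:    localhost
                nodeid:   2:    connected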
Task 17: Find which unit pulled in a target (who “wanted” it)
cr0x@server:~$ systemctl list-dependencies --reverse network-online.target
network-online.target
● pve-cluster.service
● pvescheduler.service
● zfs-import.target
Meaning: these units are forcing “online” semantics.
Decision: review whether each really needs it; remove drop-ins that add unnecessary waits.
Task 18: Verify generator output (fstab-to-unit translation)
cr0x@server:~$ systemctl status mnt-backup.mount --no-pager -l
× mnt-backup.mount - /mnt/backup
Loaded: loaded (/run/systemd/generator/mnt-backup.mount; generated)
Active: failed (Result: exit-code) since Thu 2025-12-26 09:14:05 UTC; 2min ago
Where: /mnt/backup
What: /dev/disk/by-uuid/deadbeef-1111-2222-3333-444444444444
Dec 26 09:14:05 server mount[901]: mount: /mnt/backup: special device /dev/disk/by-uuid/deadbeef-1111-2222-3333-444444444444 does not exist.
Meaning: the mount unit is generated from fstab and is failing because the device is missing.
Decision: either restore the device path/UUID, or make the mount optional (nofail plus timeout).
Task 19: Check initramfs and root device recognition (when boot is really broken)
cr0x@server:~$ lsinitramfs /boot/initrd.img-$(uname -r) | grep -E "zfs|dm-mod|nvme" | head
usr/lib/modules/6.8.12-4-pve/kernel/drivers/md/dm-mod.ko
usr/lib/modules/6.8.12-4-pve/kernel/drivers/nvme/host/nvme.ko
usr/lib/modules/6.8.12-4-pve/updates/dkms/zfs.ko
Meaning: critical modules appear in initramfs. If they don’t, the kernel can’t find disks early.
Decision: rebuild initramfs and verify DKMS modules, especially after kernel updates.
Task 20: Rebuild initramfs (only after you know what’s missing)
cr0x@server:~$ sudo update-initramfs -u -k all
update-initramfs: Generating /boot/initrd.img-6.8.12-4-pve
W: Possible missing firmware /lib/firmware/i915/tgl_dmc_ver2_12.bin for module i915
Meaning: warnings about GPU firmware are usually irrelevant for servers; missing storage/HBA firmware is not.
Decision: if you see missing firmware for storage/network, fix that before the next reboot. Otherwise proceed.
Where Proxmox boot failures usually come from
When Proxmox “doesn’t boot,” the kernel probably booted fine. systemd is up. The failure is that the system didn’t reach the target that
includes your management services. Proxmox adds a few special pressure points:
1) Storage import and mounts
Proxmox nodes almost always depend on local storage being up early: ZFS pools, LVM volume groups, or both. If storage isn’t imported,
mount units fail, and then services that read config or store state fail. The UI going down is a symptom, not a cause.
Watch for:
- ZFS import failures due to missing /etc/zfs/zpool.cache, renamed pools, or changed device paths.
- LVM failures due to missing PVs or VG activation issues.
- fstab entries for external backup disks that are “optional” until they aren’t.
2) /etc/pve and pmxcfs
/etc/pve is not a normal directory. It’s a fuse filesystem provided by pmxcfs, controlled by pve-cluster.service.
If pmxcfs isn’t mounted, lots of Proxmox tooling behaves like config vanished.
The console will happily tell you “Dependency failed for Proxmox VE API Proxy Server.”
That’s true, but unhelpful. The proxy depends on config and certs. No /etc/pve, no party.
3) Corosync assumptions
In a cluster, corosync is how nodes agree on membership. A node can run as a single node too, but the presence of cluster config changes boot behavior.
Misconfigured ring interfaces, a new NIC name after hardware change, or an overly strict “network-online” dependency can stall cluster-related units.
4) network-online target and “helpful” boot ordering
The network stack is layered: the link comes up, addresses apply, routes appear, DNS maybe works, and only then do you have “online.”
systemd’s network.target usually means “basic networking is configured,” not “the internet is reachable.”
If someone forces Proxmox services to require network-online.target, you can create a boot deadlock where the network wait service
is waiting for a bridge/VLAN that is created by a service that is waiting for network-online. Congratulations, you invented a circular dependency.
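systemd does log real cycles explicitly; it just doesn't put them on the console. If you suspect a loop, grep for it. The lines below show the shape to look for (unit names illustrative). Note that systemd "fixes" the cycle by deleting a job, which means something you wanted never starts:
cr0x@server:~$ journalctl -b -0 | grep -i "ordering cycle"
Dec 26 09:13:58 server systemd[1]: multi-user.target: Found ordering cycle on pve-cluster.service/start
Dec 26 09:13:58 server systemd[1]: multi-user.target: Found dependency on network-online.target/start
Dec 26 09:13:58 server systemd[1]: multi-user.target: Job pve-cluster.service/start deleted to break ordering cycle starting with multi-user.target/start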
Joke #2: A dependency loop is the only time your server successfully models corporate org charts.
Hunting the root unit: chains, ordering, and why your eyes lie
The most common operator error in these incidents is treating the screen message as the root cause. systemd prints “Dependency failed for X”
because X was queued and then canceled. It’s like a project manager telling you “the demo is canceled” without mentioning the power outage.
Start from “failed units,” but don’t stop there
systemctl --failed is necessary but not sufficient. It shows what ended in a failed state. But the unit that matters might be:
- A mount unit generated from fstab (something.mount).
- A device unit that never appeared (no explicit “failed” state; instead something times out).
- A one-shot unit that failed early and got buried in logs.
Prefer “critical-chain” to “blame” when you suspect ordering
systemd-analyze blame lists duration spent in units, which is useful for slow boots. But it can mislead you if a unit spent time waiting
on another unit. The critical chain shows you the real dependency path to a target.
If you’re stuck at boot and something is waiting forever, ask:
- What target am I trying to reach? Usually multi-user.target on servers.
- Which unit is the last one in the critical chain?
- What is that unit waiting for? (status + journal)
Understand the three dependency axes: requirement, ordering, and relationship
systemd has relationships that sound similar but behave differently:
- Requires=: if the required unit fails, this unit fails.
- Wants=: weak dependency; pulls it in but doesn’t fail the dependent if it fails.
- After=/Before=: ordering only. No requirement.
Many “dependency failed” incidents come from admins adding After= without Wants= or Requires=, expecting
systemd to start something automatically. It won’t.
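A minimal sketch of the difference, using a hypothetical myapp.service that needs a mount:
[Unit]
# Ordering only: start myapp after the mount, *if* both happen to be queued.
After=mnt-data.mount
# Requirement: pull the mount into the transaction and fail myapp if it fails.
Requires=mnt-data.mount
Swap Requires= for Wants= and the mount still gets queued, but myapp starts even if it fails. In practice, Requires= plus After= is the pairing people usually mean when they say “depends on.”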
When the real culprit is a mount or device
If a service has RequiresMountsFor=/var/lib/vz (or an implicit dependency via path use), and that mount is missing, systemd will
refuse to start it. The service isn’t “broken.” It’s being correctly cautious.
The best trick: inspect the .mount unit status and logs. It’s often more explicit than a service log.
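If you're writing such a unit yourself, RequiresMountsFor= is the concise form; it implies both the requirement and the ordering for every mount on the path. A hypothetical drop-in:
[Unit]
RequiresMountsFor=/var/lib/vz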
When the real culprit is a timeout
Some “Dependency failed” messages appear because a dependency didn’t fail explicitly—it just never became active.
For example: a device unit never appears because a disk didn’t enumerate. Then a mount waits, times out, fails, and that fails a service.
In these cases, the important log lines are earlier: kernel messages, udev, storage driver errors, or firmware delays. Don’t ignore dmesg.
cr0x@server:~$ dmesg --level=err,warn | tail -n 30
[ 7.112345] sd 2:0:0:0: [sdc] tag#23 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 7.112400] blk_update_request: I/O error, dev sdc, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
Decision: if you see transport errors, stop editing systemd units and start checking cables/HBA/drive health.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-size company ran a two-node Proxmox cluster at remote sites. The nodes were stable, boring, and mostly untouched—until a network refresh.
The team replaced a switch and adjusted VLANs. The nodes came back up, but one stayed “half-booted.” Console showed “Dependency failed for Proxmox VE API Proxy Server.”
Naturally, the first response was to restart pveproxy. Then pvedaemon. Then corosync. Then the node got rebooted again, because rebooting is a lifestyle.
The wrong assumption: “pveproxy is the issue because it’s what users notice.” It wasn’t. The real culprit was a drop-in override added months earlier
to pve-cluster.service forcing After=network-online.target and Wants=network-online.target.
The switch change delayed DHCP on a management VLAN that should never have been DHCP in the first place.
systemd-networkd-wait-online waited for an address that never arrived. That prevented network-online.target,
which prevented pve-cluster, which prevented pmxcfs mounting /etc/pve, which prevented pveproxy reading configuration.
The message “Dependency failed” was accurate, but it accused the wrong bystander.
The fix was unglamorous: remove the override, revert the management interface to static addressing, and accept that “network.target” is good enough
for most local services. They also added a check during network changes: verify that “online” means what you think it means. It rarely does.
Mini-story 2: The optimization that backfired
A different org had a habit: make boot faster by “simplifying dependencies.” Someone noticed the node took 90 seconds to boot because of a missing
backup disk in fstab. So they added nofail. Good start. Then they went further and added aggressive timeouts everywhere,
including ZFS import, because “timeouts are bad.”
They also disabled zfs-import-cache and relied on scan import, reasoning it was “more flexible.” On paper, sure.
In practice, scan import caused occasional delays when multipath timing changed after firmware updates. Sometimes the pool imported a few seconds later
than before, which was fine—except they had shortened service timeouts for mounts and for the services that required them.
The result: intermittent boot failures. Not every time, which is the worst kind. Mount units timed out, ZFS eventually imported, but the system had already
failed critical units. A reboot sometimes fixed it, which encouraged the team to keep rebooting until it worked. That’s not reliability engineering; that’s wishcasting.
The recovery path was to undo the “optimization” and treat boot dependencies as a contract. If storage is required, let it wait long enough to be true.
If a disk is optional, mark it optional, but don’t randomly shorten timeouts for required resources. Boots should be deterministic, not a race.
Mini-story 3: The boring but correct practice that saved the day
One enterprise team ran Proxmox nodes with ZFS mirrors and a small local LVM volume group for scratch. They had a policy:
every node change that touched boot dependencies required a console validation, a snapshot of service overrides, and a recorded “golden boot” log sample.
People complained it was bureaucracy.
Then an on-call night happened. A kernel update pulled in a new initramfs, and a DKMS build for ZFS failed quietly due to a missing header package
after a repository change. The node rebooted and dropped into a state where ZFS modules weren’t available early enough. Storage import failed,
mount units failed, Proxmox services cascaded.
The boring practice paid off. They compared the “golden boot” log with the current boot via journalctl -b -1 and saw the DKMS failure line
in the previous run. They also had a saved list of enabled units and overrides, so they ruled out “someone changed systemd dependencies.”
Resolution was fast: install the missing headers, rebuild DKMS, regenerate initramfs, reboot. No drama, no random restarts of unrelated services.
The postmortem was equally boring: keep the policy. Boring is good when your job is to keep other people’s exciting ideas from breaking production.
Common mistakes: symptom → root cause → fix
Here’s the pattern library. If you recognize the symptom, skip the guesswork.
1) “Dependency failed for Proxmox VE API Proxy Server”
- Symptom: pveproxy/pvedaemon won’t start; UI down.
- Root cause: /etc/pve not mounted (pmxcfs down), or storage/mounts missing.
- Fix: verify pve-cluster.service and the pmxcfs mount; fix corosync/network ordering only if logs show it.
2) “Dependency failed for ZFS Mount” or “Failed to import ZFS pools”
- Symptom: ZFS services fail; mounts missing; guests won’t start.
- Root cause: missing cache file, renamed pool, device path changes, missing ZFS module in initramfs, or actual disk errors.
- Fix: confirm pools via zpool import, import manually for diagnosis, rebuild the cache, ensure import services and initramfs are correct.
3) Boot hangs for ~90 seconds on “A start job is running …”
- Symptom: long boot delay; sometimes ends in dependency failures.
- Root cause: wait-online service waiting for an interface, or fstab mount waiting for a missing disk.
- Fix: identify the waiting unit with systemd-analyze critical-chain; disable or configure wait-online properly; mark optional mounts with nofail.
4) Cluster-related failures after NIC rename or hardware change
- Symptom: corosync.service fails; pve-cluster fails; node can’t join the cluster.
- Root cause: corosync configured for an interface name that no longer exists.
- Fix: correct corosync ring interface config and restart corosync/pve-cluster in the right order; avoid binding to unstable interface naming.
5) “Dependency failed for Local File Systems”
- Symptom: system drops to emergency shell; multiple services canceled.
- Root cause: a required fstab mount failed (wrong UUID, fs errors, missing device).
- Fix: correct fstab, add nofail for non-essential mounts, run fsck where appropriate, or fix the underlying disk issue.
6) After a kernel update, storage import fails intermittently
- Symptom: ZFS/LVM services fail only after certain updates; reboot roulette.
- Root cause: DKMS module build failure or initramfs missing module/firmware.
- Fix: verify DKMS status, rebuild initramfs, ensure headers match the kernel, confirm module presence via lsinitramfs.
Checklists / step-by-step plan (safe, boring, effective)
Checklist A: Find the real failing unit
- Run systemctl --failed. Write down the failed units.
- Run journalctl -b -0 -p err..alert. Find the earliest “Failed to start …” line.
- Inspect that unit: systemctl status UNIT -l.
- If the “failed unit” is a leaf service (UI, firewall), map dependencies back: systemctl list-dependencies --reverse.
- Use systemd-analyze critical-chain multi-user.target to confirm what actually blocked reaching the target.
Checklist B: If it smells like storage
- Check whether pools/VGs exist: zpool import, vgs, pvs (if applicable).
- Look for mount unit failures: systemctl --failed | grep mount, then systemctl status *.mount as needed.
- Validate the Proxmox-critical mounts: findmnt /var/lib/vz and whatever holds VM disks.
- Scan kernel logs for transport errors: dmesg --level=err,warn.
- Only then consider manual import/activation. If manual import works, fix boot ordering/config; don’t “repair” healthy pools out of boredom.
Checklist C: If it smells like network-online or cluster
- Check boot delays: systemd-analyze blame and systemd-analyze critical-chain.
- See who pulls in network-online: systemctl list-dependencies --reverse network-online.target.
- Inspect unit overrides: systemctl cat pve-cluster.service (and any suspect units).
- Confirm corosync status: systemctl status corosync, then logs if needed.
- Remove unnecessary “online” dependencies. Prefer network.target.
Step-by-step plan: Fixing without making it worse
When you’re done diagnosing, fix like an adult:
- Make one change at a time. If you change three things and it boots, you learned nothing.
- Prefer drop-ins over editing vendor unit files, but don’t keep ancient drop-ins that nobody remembers.
- After changes, reload systemd and restart only the relevant units.
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart zfs-import-cache.service zfs-mount.service
cr0x@server:~$ sudo systemctl restart pve-cluster.service pveproxy.service
What output means: if restart now works, your issue was boot ordering/config, not fundamental breakage.
If restart still fails, read the unit logs again—your fix didn’t address the actual failure condition.
FAQ
1) Why does systemd show “Dependency failed” for the wrong thing?
Because it prints what it canceled. The root cause is usually the first unit that failed earlier in the chain. Find that unit in the journal.
2) What’s the fastest way to find the first failing unit?
journalctl -b -0 -p err..alert and look for the earliest “Failed to start …” entry. Then open that unit’s logs with journalctl -u.
3) Should I just restart all Proxmox services?
No. Restarting the stack can mask ordering problems and make the next reboot fail again. Fix the dependency (storage/mount/network) first.
4) Is After=network-online.target a good idea for Proxmox?
Usually not. It adds failure modes and delays. Use it only if the service truly requires routable networking during start, and you’ve configured wait-online correctly.
5) How do I tell if it’s storage vs. systemd ordering?
If manual import/activation works (ZFS pool imports, VG activates) and there are no kernel I/O errors, it’s likely ordering/config. If devices are missing or dmesg shows errors, it’s storage/hardware.
6) What if a missing backup disk in fstab breaks boot?
Make it optional: add nofail and a short x-systemd.device-timeout=. Or remove the mount and mount it on demand. Don’t let optional storage become a boot gatekeeper.
7) Why does systemd-analyze blame sometimes point at the wrong bottleneck?
Because it measures time spent in units, including waiting time. Use systemd-analyze critical-chain to see the true dependency path.
8) If /etc/pve isn’t mounted, can I still recover the node?
Yes, but you need pve-cluster/pmxcfs healthy. Start by fixing corosync (if clustered) and ensuring required storage/mounts are present. Then restart pve-cluster.
9) Can “Dependency failed” be caused by a circular dependency?
Yes. Especially with custom overrides that force network-online or mounts in the wrong direction. Look for ordering loops in logs and use systemctl cat to find the override that created it.
10) What’s the safest way to change systemd dependencies?
Use a drop-in under /etc/systemd/system/UNIT.d/override.conf, document why, and test a reboot. If you can’t justify it, don’t ship it.
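For reference, systemctl edit creates exactly that file for you:
cr0x@server:~$ sudo systemctl edit pve-cluster.service
That opens an editor on /etc/systemd/system/pve-cluster.service.d/override.conf and reloads systemd when you close it. Verify the result with systemctl cat, and use systemctl revert when the justification expires.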
Conclusion: next steps you can do today
The cure for “Dependency failed” is not superstition. It’s graph tracing.
Find the first failing unit in the journal, confirm the dependency chain, and fix the upstream blocker—usually storage import/mounts, pmxcfs, or network-online.
Practical next steps:
- On a healthy node, capture a baseline: systemctl --failed (should be empty), systemd-analyze critical-chain, and enabled overrides (systemctl cat for Proxmox-critical units).
- Audit /etc/fstab for non-essential mounts and mark them optional or on-demand.
- Hunt and delete “mystery” drop-ins that force network-online.target without a strong reason.
- After kernel updates, verify storage modules are present in initramfs before scheduling reboots.