“vzdump backup failed” is one of those messages that lands at 02:13, right after you finally trusted the backups enough to stop worrying. It’s vague on purpose: vzdump is the messenger, and it’s reporting that something somewhere in storage, snapshots, networking, permissions, or time itself went sideways.
If you’re running Proxmox in production, you don’t need vibes. You need a reproducible checklist: commands, expected output, and a decision tree that finds the bottleneck quickly. That’s what this is.
Fast diagnosis playbook (check 1/2/3)
When a backup fails, there are two goals: (1) recover a successful backup today, and (2) prevent recurrence without turning your backup window into a weekend-long art project. The fastest path is usually: logs → storage health/capacity → snapshot capability → transport (NFS/CIFS/PBS) → permissions/locks.
Check 1: find the exact failure point in the log
Don’t start “fixing storage” blindly. Get the first hard error message and the context around it.
cr0x@server:~$ ls -1t /var/log/vzdump/*.log | head
/var/log/vzdump/qemu-101.log
/var/log/vzdump/lxc-203.log
cr0x@server:~$ tail -n 80 /var/log/vzdump/qemu-101.log
INFO: starting new backup job: vzdump 101 --storage backup --mode snapshot --compress zstd
INFO: filesystem type on dumpdir is 'nfs'
INFO: creating vzdump archive '/mnt/pve/backup/dump/vzdump-qemu-101-2025_12_26-02_00_01.vma.zst'
ERROR: vzdump archive creation failed: write error: No space left on device
INFO: Backup job finished with errors
TASK ERROR: job errors
Decision: If you see a concrete error like "No space left on device", "Permission denied", "Input/output error", "stale file handle", or "snapshot failed", stop and follow that branch. If you only see something like ERROR: Backup of VM 101 failed - vma: ... with no stated cause, jump to the "Tasks" section below and run the deeper checks.
Check 2: confirm the target storage is writable, mounted, and not full
cr0x@server:~$ pvesm status
Name Type Status Total Used Available %
local dir active 102400000 12233456 90016644 11.94%
backup nfs active 2048000000 2039000000 9000000 99.56%
local-lvm lvmthin active 500000000 310000000 190000000 62.00%
Decision: If backup is at 99–100% or shows weirdly small available space, treat it as full. Clear space or fix quotas/reservations before you do anything else.
Check 3: is the failure snapshot-related or transport-related?
Snapshot-mode failures are loud and immediate. Transport failures (NFS, PBS, slow disks) tend to look like timeouts, “broken pipe”, or “stalled” jobs.
cr0x@server:~$ grep -E "snapshot|freeze|thaw|qmp|lvm|zfs|ceph|rbd|stale|timeout|broken pipe|i/o error" -i /var/log/vzdump/qemu-101.log | tail -n 30
ERROR: vzdump archive creation failed: write error: No space left on device
Decision: Snapshot keywords push you toward local storage driver checks (LVM-thin/ZFS/Ceph) and guest agent. Transport keywords push you toward NFS/PBS networking and server-side capacity/health.
Interesting facts and context (why this fails the way it does)
- vzdump predates modern Proxmox “PBS-first” setups. It started life as a simple dump tool and grew into a workflow engine that coordinates snapshots, freezing, and archiving.
- Snapshot backups are a storage feature first, a Proxmox feature second. If the backing store can’t snapshot reliably (or runs out of metadata space), vzdump can’t “make it work.”
- QEMU guest agent integration is optional, but the consequences aren’t. Without it, you still get a crash-consistent backup, but applications (databases) may be unhappy.
- VMA is Proxmox’s VM archive format. It’s simple and fast, but it’s not magic: it depends on stable reads from VM volumes and stable writes to target storage.
- NFS is popular for backup storage because it’s “easy,” which is also why it’s a frequent failure point: stale handles, soft mounts, retransmits, and “it worked yesterday” syndrome.
- ZFS snapshots are cheap; ZFS sends are not free. Snapshot explosion and dataset fragmentation can quietly turn a nightly backup into a performance incident.
- LVM-thin’s metadata pool is a separate thing you can fill. When thin metadata hits 100%, you don’t “degrade gracefully.” You stop.
- Backup compression changes I/O shape. zstd is usually a win, but if your node is CPU-bound or your target storage is slow, compression can make timeouts worse.
- Proxmox uses lock files and task logs to keep you from double-backup chaos. When a node reboots mid-task, those locks can outlive the task and create a second-order failure.
10 real causes of “vzdump backup failed” (in the order I check)
1) Target storage is full (or effectively full due to quotas/reservations)
This is the boring top offender. It’s not always “100% full” either. NFS exports can have per-directory quotas. ZFS datasets can have reservations. PBS datastores can hit chunk GC limits. The symptom from vzdump is usually “write error” or “No space left on device.”
Check: use pvesm status first, then verify on the mounted path with df -h. If it’s NFS, also check the server side. If it’s PBS, check datastore usage and prune settings.
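If you'd rather not eyeball every storage one by one, a quick filter over pvesm status flags anything above a usage threshold (a minimal sketch; the 85% cutoff is an example, adjust to taste):
cr0x@server:~$ pvesm status | awk 'NR>1 && $7+0 > 85 {print $1, "is at", $7}'
backup is at 99.56%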
2) Backup target not mounted or mounted wrong (NFS/CIFS path changed, automount race)
If /mnt/pve/backup isn’t actually your NFS share at the moment the job runs, vzdump happily writes to the local directory behind it (or fails if it’s missing). That’s how you “mysteriously” fill up local storage while backups “fail.”
Check: mount, findmnt, and confirm the mount options are sane (hard mount for backups, not soft).
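If the target is a plain dir-type storage sitting on top of an externally managed mount, you can also tell Proxmox to refuse it when nothing is mounted there. A minimal sketch, assuming a hypothetical dir storage named backup-dir; check your version's storage options before relying on it:
cr0x@server:~$ pvesm set backup-dir --is_mountpoint yes
With that set, the storage should show as inactive instead of silently accepting writes into the empty local directory.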
3) Snapshot creation fails (wrong mode for the storage type, or storage can’t snapshot right now)
Snapshot backups require cooperation, and the rules differ by guest type. For containers, snapshot mode leans on the storage driver: LVM-thin, ZFS, or Ceph RBD snapshots. On plain directory storage without snapshot support, vzdump falls back to suspend mode or fails outright. For QEMU VMs, "snapshot" mode uses QEMU's live-backup machinery rather than a storage snapshot, but the storage layer still has to behave: if you're on LVM-thin and thin metadata is near full, snapshots and new allocations fail exactly when you need them most.
Check: log for “snapshot” errors, verify storage type with pvesm status/pvesm config, and inspect LVM-thin/ZFS/Ceph health.
4) Stuck lock or leftover task state blocks the job
Power loss, node reboot, storage hiccup—anything that interrupts a backup can leave behind locks. Then the next run fails immediately with “locked” errors. Proxmox tries to protect your VM disks from concurrent backups; it’s doing you a favor, just noisily.
Check: task list and lock files; also verify the VM isn’t actually running a backup already.
5) NFS/CIFS transport flakiness (stale file handles, permission mapping, timeouts)
NFS failures are rarely polite. You get “stale file handle”, “permission denied”, “input/output error”, or “server not responding”. CIFS/SMB gives you its own flavor of misery: credential rotation, dialect quirks, and file locking weirdness.
Check: kernel logs (dmesg), mount options, retransmits, and server-side logs. If your network drops packets, your backup system becomes a distributed denial-of-service against your patience.
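To quantify "flaky", look at client-side RPC retransmissions, not just dmesg. A small sketch, assuming nfsstat from nfs-common is installed; a retrans count that climbs during the backup window points at the network or a struggling server:
cr0x@server:~$ nfsstat -rc
Client rpc stats:
calls      retrans    authrefrsh
1846213    1042       1846300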
6) PBS datastore issues (permissions, datastore full, prune/GC not keeping up, verification failures)
Backing up to Proxmox Backup Server is generally more robust than raw file shares, but failures still happen: datastore full, namespace permission mismatch, token expired, or verification/GC contention during backup windows.
Check: Proxmox node logs will show authentication/HTTP errors; PBS will show datastore and chunk store errors. Don’t guess—confirm on the PBS side.
7) QEMU Guest Agent / freeze-thaw problems cause snapshot hang or timeout
When you ask for a consistent snapshot, Proxmox uses QEMU guest agent to freeze filesystems. If the agent is absent, old, or wedged, you may see freeze timeouts. If you force it, you might get backups that “work” and restores that hurt.
Check: QMP/agent status, look for “fsfreeze” messages, and confirm the agent service inside the guest.
8) Storage read errors on the VM disks (ZFS checksum errors, Ceph degraded, underlying SSD dying)
A backup reads every allocated block. That’s why backups often detect bad storage earlier than your users do. If reads fail, vzdump aborts. The real culprit is below Proxmox: ZFS errors, mdadm issues, SMART pending sectors, or Ceph PG problems.
Check: ZFS zpool status, SMART data, Ceph health, and kernel I/O errors.
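SMART is the one layer the tasks below don't demonstrate, so here is the short version (assumes smartmontools; device names are examples for your hardware):
cr0x@server:~$ smartctl -H /dev/sda
SMART overall-health self-assessment test result: PASSED
cr0x@server:~$ smartctl -a /dev/nvme0n1 | egrep -i "media|error|spare|percentage"
Available Spare: 100%
Percentage Used: 3%
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
A "PASSED" with pending or reallocated sectors climbing is still a failing disk on a payment plan; read the attributes, not just the verdict.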
9) Performance bottlenecks (CPU, IO wait, tiny backup window) cause timeouts or job overlap
Backups failing due to “timeout” is usually a scheduling problem disguised as a technical problem. Your backup throughput is lower than your change rate, and you’re trying to squeeze it into a window that doesn’t fit.
Check: I/O latency, node load, backup duration trend, and whether jobs overlap. Then fix the schedule, concurrency, or storage—not the log message.
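If you want one number for "is the disk the problem", watch per-device latency while a backup runs. A sketch, assuming the sysstat package provides iostat:
cr0x@server:~$ iostat -x 5 3
Look at r_await/w_await (milliseconds per request) and %util. Double-digit awaits on SSDs, or %util pinned near 100% for the whole window, mean the schedule is writing checks the storage can't cash.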
10) VMA/compression pipeline failures (broken pipe, tmp space, zstd thread pressure)
vzdump pipes data through tools (vma, compression, possibly encryption depending on target). If the pipeline breaks—target write stalls, process killed by OOM, tmp fills—you’ll see “broken pipe”, “killed”, or generic exit code errors.
Check: systemd journal for OOM kills, free RAM, swap activity, and the backup log around compression start.
Joke #1: Backups are like parachutes—if you need one and it doesn’t open, you’ll have a very short postmortem.
Hands-on tasks: commands, output meaning, and decisions
These are deliberately practical. Run them in order until you find the “first real error.” The output snippets are representative; your exact strings will vary. The point is what you decide next.
Task 1: Identify the failing job and VM/CT quickly
cr0x@server:~$ grep -R "TASK ERROR" -n /var/log/vzdump/ | tail -n 5
/var/log/vzdump/qemu-101.log:68:TASK ERROR: job errors
/var/log/vzdump/lxc-203.log:55:TASK ERROR: job errors
What it means: You have per-guest logs. Don’t stare at the GUI summary; open the specific log for the guest that failed.
Decision: Pick the newest failing log and work that one to ground truth before chasing secondary failures.
Task 2: Pull the first ERROR line and the 30 lines before it
cr0x@server:~$ awk '/ERROR:/ && !e {e=NR} {a[NR]=$0} END{if(e) for(i=e-30;i<=e+5;i++) if(i>0 && i<=NR) print a[i]}' /var/log/vzdump/qemu-101.log
INFO: starting new backup job: vzdump 101 --storage backup --mode snapshot --compress zstd
INFO: filesystem type on dumpdir is 'nfs'
INFO: creating vzdump archive '/mnt/pve/backup/dump/vzdump-qemu-101-2025_12_26-02_00_01.vma.zst'
ERROR: vzdump archive creation failed: write error: No space left on device
INFO: Backup job finished with errors
What it means: The first meaningful error is often the only one that matters.
Decision: If it’s target-write related, stop and fix storage capacity/mount/permissions before touching snapshot settings.
Task 3: Verify Proxmox thinks the backup storage is active
cr0x@server:~$ pvesm status --storage backup
Name Type Status Total Used Available %
backup nfs active 2048000000 2039000000 9000000 99.56%
What it means: “active” only means “mounted” and “responding now.” It doesn’t mean “has space” or “is fast enough.”
Decision: If available space is tiny, prune old backups or expand. If totals look wrong (like 0), suspect mount trouble.
Task 4: Confirm the mount is real (and not a local directory)
cr0x@server:~$ findmnt -T /mnt/pve/backup
TARGET SOURCE FSTYPE OPTIONS
/mnt/pve/backup 10.10.0.20:/exports/pve-backup nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2
What it means: If SOURCE is blank or looks like a local device, you’re not mounted where you think you are.
Decision: If it’s not mounted, don’t run more backups. Fix mounting and clean up any “accidental local backups” that filled your node.
Task 5: Measure real free space where vzdump writes
cr0x@server:~$ df -h /mnt/pve/backup
Filesystem Size Used Avail Use% Mounted on
10.10.0.20:/exports/pve-backup 2.0T 2.0T 9.0G 100% /mnt/pve/backup
What it means: You’re done diagnosing. It’s full.
Decision: Prune backups, move some off, or expand. Then rerun one backup manually to prove it.
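If you prune by hand, list before you delete. A minimal sketch; the 30-day cutoff and path are examples, and if the storage has retention configured in Proxmox, prefer that over manual deletion:
cr0x@server:~$ find /mnt/pve/backup/dump -name 'vzdump-*' -mtime +30 -printf '%TY-%Tm-%Td %p\n' | sort | head
2025-10-02 /mnt/pve/backup/dump/vzdump-qemu-101-2025_10_02-02_00_01.vma.zst
Once you've reviewed the list, re-run with -delete, and remember each backup usually has matching .log and .notes files.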
Task 6: Check for quota/reservation surprises (common on ZFS-backed NFS servers)
cr0x@server:~$ zfs get -o name,property,value,source quota,reservation,refquota,refreservation tank/pve-backup
NAME PROPERTY VALUE SOURCE
tank/pve-backup quota 2T local
tank/pve-backup reservation none default
tank/pve-backup refquota none default
tank/pve-backup refreservation none default
What it means: A dataset quota can make a filesystem “full” while the pool has plenty of space.
Decision: If the dataset quota is tight, raise it or implement pruning that respects retention.
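Raising a ZFS quota is a one-liner on the NFS server (dataset name and new size are examples; make sure the pool actually has the space before promising it):
cr0x@server:~$ zfs set quota=3T tank/pve-backup
cr0x@server:~$ zfs get quota tank/pve-backup
NAME             PROPERTY  VALUE  SOURCE
tank/pve-backup  quota     3T     local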
Task 7: Detect NFS transport errors in kernel logs
cr0x@server:~$ dmesg -T | egrep -i "nfs|stale|server not responding|timed out|rpc" | tail -n 20
[Thu Dec 26 02:05:11 2025] NFS: server 10.10.0.20 not responding, still trying
[Thu Dec 26 02:05:44 2025] NFS: server 10.10.0.20 OK
What it means: Your “storage problem” may actually be packet loss, congestion, or a busy NAS.
Decision: If you see these during backup windows, treat network and NFS server performance as first-class backup dependencies.
Task 8: Spot stuck or overlapping vzdump jobs
cr0x@server:~$ pgrep -a vzdump
21984 /usr/bin/perl /usr/bin/vzdump 101 --storage backup --mode snapshot --compress zstd
cr0x@server:~$ pvesh get /nodes/$(hostname)/tasks --limit 5
┌──────────────┬───────────────────────────────┬───────────┬──────────┬─────────┬──────────────┐
│ upid │ starttime │ type │ status │ user │ id │
╞══════════════╪═══════════════════════════════╪═══════════╪══════════╪═════════╪══════════════╡
│ UPID:node... │ 2025-12-26T02:00:01Z │ vzdump │ running │ root@pam│ 101 │
└──────────────┴───────────────────────────────┴───────────┴──────────┴─────────┴──────────────┘
What it means: The job is still running; a “failed” email might have been from another guest or earlier attempt.
Decision: If it’s truly stuck (no progress, storage frozen), you may need to stop the task and clean locks carefully.
Task 9: Check for backup locks on the guest
cr0x@server:~$ qm config 101 | egrep -i "lock|backup"
lock: backup
What it means: Proxmox believes a backup is in progress (or was interrupted).
Decision: Confirm no vzdump process is actually working. If not, remove the lock.
cr0x@server:~$ qm unlock 101
OK
Task 10: Verify snapshot capability and storage type for the VM disks
cr0x@server:~$ qm config 101 | egrep -i "scsi|virtio|ide|sata"
scsi0: local-lvm:vm-101-disk-0,size=80G
cr0x@server:~$ pvesm status --storage local-lvm
Name Type Status Total Used Available %
local-lvm lvmthin active 500000000 310000000 190000000 62.00%
What it means: LVM-thin supports snapshots, but only while thin metadata is healthy.
Decision: If a container's volumes live on dir storage (not LVM-thin/ZFS/Ceph) and you demand snapshot mode, expect disappointment. VMs get a pass because QEMU handles the live backup itself, but an exhausted thin pool or degraded pool still ruins the night.
Task 11: Check LVM-thin metadata usage (the silent snapshot killer)
cr0x@server:~$ lvs -a -o+seg_monitor,metadata_percent,lv_size,data_percent vg0
LV VG Attr LSize Pool Data% Meta% Monitor
data vg0 twi-aotz-- 465.76g 68.12 98.77 monitored
What it means: Meta% at ~99% is a red alert. Snapshots and allocations can fail abruptly.
Decision: Extend the thin metadata LV (if designed for it) or reduce churn. Don’t “wait for it to clear”; it won’t.
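Extending thin pool metadata is usually a one-liner, provided the volume group still has free extents (VG and pool names follow the example above; check vgs first):
cr0x@server:~$ vgs vg0
  VG  #PV #LV #SN Attr   VSize    VFree
  vg0   1  12   0 wz--n- <476.00g  10.00g
cr0x@server:~$ lvextend --poolmetadatasize +1G vg0/data
  Size of logical volume vg0/data_tmeta changed from 120.00 MiB to 1.12 GiB.
  Logical volume vg0/data_tmeta successfully resized.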
Task 12: Check ZFS pool health if VM disks live on ZFS
cr0x@server:~$ zpool status -x
all pools are healthy
On a healthy node, that one-liner is all you need. On a node with problems, the full output tells the story:
cr0x@server:~$ zpool status rpool
pool: rpool
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
scan: scrub repaired 0B in 00:12:44 with 2 errors on Thu Dec 26 01:10:02 2025
config:
NAME STATE READ WRITE CKSUM
rpool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 2
What it means: Checksum errors during a scrub are not "just a warning." Backups that read everything are likely to trip over the same corruption.
Decision: Replace/fix the failing device, restore affected data from known-good copies, and only then trust backups again.
Task 13: Check Ceph health if you use RBD
cr0x@server:~$ ceph -s
cluster:
id: 2c1b2d24-aaaa-bbbb-cccc-6f4f3b2d1b2a
health: HEALTH_WARN
Degraded data redundancy: 12/3456 objects degraded
services:
mon: 3 daemons, quorum mon1,mon2,mon3
osd: 6 osds: 6 up, 6 in
data:
pools: 3 pools, 256 pgs
objects: 1.15M objects, 4.2 TiB
usage: 12 TiB used, 18 TiB / 30 TiB avail
pgs: 10 degraded, 5 undersized
What it means: Degraded/undersized PGs can make snapshots slow or fail, and reads can stall.
Decision: Fix Ceph health first. “Back up harder” is not a storage repair strategy.
Task 14: Confirm guest agent status (for freeze/thaw and better consistency)
cr0x@server:~$ qm agent 101 ping
cr0x@server:~$ qm agent 101 fsfreeze-status
thawed
What it means: qm agent ping returns silently when the agent answers; an error here means the agent is missing, stopped, or wedged. fsfreeze-status confirms the guest filesystems are not stuck frozen from an earlier failed backup.
Decision: If agent commands fail, either fix the agent inside the VM or adjust expectations (crash-consistent backup). Don’t pretend you’re getting application-consistent snapshots.
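Getting the agent working is two small steps: enable it in the VM config, then install and start it inside the guest. A sketch for a Debian-family guest; other distros use their own package name, and the VM needs a full stop/start (not just an in-guest reboot) to pick up the new agent device:
cr0x@server:~$ qm set 101 --agent enabled=1
update VM 101: -agent enabled=1
user@guest:~$ sudo apt install qemu-guest-agent
user@guest:~$ sudo systemctl enable --now qemu-guest-agent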
Task 15: Detect OOM kills or killed compression processes
cr0x@server:~$ journalctl -k --since "2025-12-26 01:50" | egrep -i "oom|killed process" | tail -n 20
Dec 26 02:03:22 node kernel: Out of memory: Killed process 22311 (zstd) total-vm:8123456kB, anon-rss:2147488kB, file-rss:0kB, shmem-rss:0kB
What it means: The backup pipeline died because the kernel needed memory more than your backup did.
Decision: Reduce compression level/threads, stagger jobs, add RAM, or move compression off-node (PBS can help). Otherwise you’ll keep “successfully failing” every night.
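Node-wide vzdump defaults live in /etc/vzdump.conf. A conservative sketch; the numbers are examples and can be overridden per job:
# /etc/vzdump.conf
# zstd compression threads (0 means half the cores; 1 is the gentle option)
zstd: 1
# cap backup I/O in KiB/s so production keeps breathing
bwlimit: 102400
# lowest I/O priority for the backup process
ionice: 7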
Task 16: Test write throughput to target storage (don’t benchmark forever)
cr0x@server:~$ dd if=/dev/zero of=/mnt/pve/backup/.pve-write-test bs=16M count=128 oflag=direct status=progress
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 18 s, 119 MB/s
128+0 records in
128+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 18.0832 s, 119 MB/s
cr0x@server:~$ rm -f /mnt/pve/backup/.pve-write-test
What it means: If you’re seeing single-digit MB/s on a system that used to do 200 MB/s, your “backup failed” is a performance regression.
Decision: Fix the network, the NAS, or contention. Or adjust backup concurrency so your window is realistic.
Task 17: Run a manual backup of one guest with maximum verbosity and minimal concurrency
cr0x@server:~$ vzdump 101 --storage backup --mode snapshot --compress zstd --notes-template '{{guestname}} {{vmid}}' --stdout 0
INFO: starting new backup job: vzdump 101 --storage backup --mode snapshot --compress zstd --notes-template {{guestname}} {{vmid}} --stdout 0
INFO: creating vzdump archive '/mnt/pve/backup/dump/vzdump-qemu-101-2025_12_26-03_10_01.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: archive file size: 14.32GB
INFO: Finished Backup of VM 101 (00:05:54)
INFO: Backup job finished successfully
What it means: A controlled one-off run tells you if the issue is systemic or due to concurrency/schedule.
Decision: If manual works but the nightly job fails, fix scheduling, storage contention, or job parameters—not the VM.
Joke #2: The only thing more reliable than a failing backup job is the email subject line that says it failed “successfully.”
Three corporate-world mini-stories (anonymized, plausible, and painful)
Mini-story 1: The wrong assumption (NFS is “just a disk”)
They had a neat little Proxmox cluster supporting internal apps: CI runners, a few databases, and a bunch of “temporary” VMs that had been temporary for three years. Backups went to an NFS share on a NAS appliance. Nothing exotic. It worked fine until it didn’t.
The assumption was simple: “If the storage shows active, it’s mounted, so the backups are landing on the NAS.” That’s almost true. The miss was that the NFS share was occasionally failing to mount at boot due to a dependency ordering issue, leaving /mnt/pve/backup as an empty local directory. The backup job ran, wrote locally, then failed when local filled up. Some nights it “worked” (wrote locally and didn’t fill). Some nights it failed. The NAS looked innocent the entire time.
The on-call engineer chased the NAS for a week: firmware, drives, load, network. Meanwhile, local storage on one node was quietly filling with partial .vma.zst files. The real hint was in the logs: “filesystem type on dumpdir is ‘ext4’” on the bad nights and “nfs” on the good nights. It was saying the truth; nobody was listening.
The fix was not clever: ensure mount ordering (systemd dependencies), add monitoring that alerts if backup storage is not the expected filesystem type, and fail the backup job early when the mount isn’t present. They also cleaned up local “phantom backups” and added retention that kept the NAS from hitting 99% ever again.
The lesson: treat network storage as a network dependency, not a directory that happens to work today.
Mini-story 2: The optimization that backfired (compression turned into a denial-of-service)
A different org wanted to shrink backup storage costs. Someone toggled zstd compression with a higher level and enabled multiple concurrent vzdump jobs. It looked great in a test: a single VM backup got smaller and didn’t take much longer. Everyone went home pleased with themselves.
In production, the node had a mix of workloads. During the backup window, CPU load spiked, IO wait spiked, and latency-sensitive services started timing out. Backups began failing with “broken pipe” and occasional “killed process” messages. On the worst nights, the kernel OOM killer took out the compression process. On the second worst nights, it took out something more exciting.
The optimization was based on a wrong mental model: “compression saves I/O, so it’s always easier on the system.” Compression saves bytes on the wire and disk, but it costs CPU and memory bandwidth. If your backup target is slow, you might actually be okay. If your target is fast and your node is already busy, compression can just add heat.
The fix was to reduce concurrency, dial compression back to a sane level, and schedule heavy guests separately. They also moved more backups to PBS where chunking and dedup shifted the economics, and the node stopped doing so much expensive work during peak periods.
The lesson: the “optimal” backup setting is the one that finishes reliably without harming production. Smaller archives are cute; stable restores are employment.
Mini-story 3: The boring practice that saved the day (scrubs, SMART checks, and test restores)
This team ran Proxmox with ZFS. Nothing flashy: mirrored SSDs for the OS, a bigger pool for VM storage. They were boring in the best way: weekly scrubs, SMART health checks, and a monthly restore drill to a fenced-off network. The restore drill was disliked by everyone. It took time. It caused tickets. It required attention. It was also the reason they didn’t have an incident report with the word “regret” in it.
One Friday, backups started failing for a specific VM with a generic read error. The app was still running. Users were happy. The on-call could have shrugged and silenced alerts until Monday. Instead, they checked zpool status and saw checksum errors. A scrub confirmed a couple of bad blocks on one SSD. No red lights, no dramatic failure—just early signs.
They replaced the device during a planned window. Then they ran the restore drill earlier than scheduled: restore last night’s backup to the sandbox, boot it, and validate basic application health. It worked. That confirmed the backup chain and the recovery procedure, not just the existence of backup files.
The boring practice paid off twice: first by catching degradation before it became data loss, and second by proving that backups were actually restorable. Not theoretically restorable. Not “it should be fine.” Restorable.
The lesson: reliability is mostly repetitive evidence-gathering. Drama is what happens when you skip that.
One quote (paraphrased idea): Werner Vogels (Amazon CTO) has emphasized that you should "design for failure" rather than assume components won't break.
Common mistakes: symptoms → root cause → fix
1) “Backup storage active” but backups fill local disk
Symptoms: local disk usage climbs; backup storage appears fine; vzdump errors mention ext4 instead of nfs; archives appear under /mnt/pve/backup but that path is local.
Root cause: backup target was not mounted, or mount failed transiently; the directory existed locally.
Fix: enforce mount dependencies, alert on filesystem type mismatch using findmnt, and make backups fail early if mount absent. Clean up local accidental files.
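The "fail early" part can be a tiny vzdump hook script. A minimal sketch, assuming the mount point from the examples above; wire it in via the script option in /etc/vzdump.conf or on the backup job, and test it by unmounting on purpose:
#!/bin/sh
# hypothetical /usr/local/bin/check-backup-mount.sh
# vzdump passes the phase as $1; abort the whole job if the target isn't NFS
if [ "$1" = "job-start" ]; then
    fstype=$(findmnt -n -o FSTYPE -T /mnt/pve/backup)
    case "$fstype" in
        nfs|nfs4) ;;  # looks like the real share
        *) echo "backup target is '$fstype', expected nfs" >&2; exit 1 ;;
    esac
fi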
2) Snapshot mode fails only for certain guests
Symptoms: “snapshot failed” or “unable to create snapshot” for one VM; others succeed.
Root cause: that guest has a disk on a storage that doesn’t snapshot (e.g., directory), or has an unsupported config (e.g., raw device mapping quirks), or thin metadata is exhausted at the moment of snapshot creation.
Fix: move disks to snapshot-capable storage; fix LVM-thin metadata pressure; verify config via qm config and storage type.
3) “stale file handle” on NFS during backup
Symptoms: vzdump fails mid-write; kernel logs show NFS stale handle; sometimes clears on remount.
Root cause: export changed, server rebooted, underlying filesystem re-exported, or aggressive NAS-side maintenance. Sometimes triggered by snapshotting the export filesystem on the NAS.
Fix: stabilize exports; avoid changing export roots; use hard mounts; coordinate NAS snapshot/maintenance windows outside backup window; remount and retry.
4) “Permission denied” even though the share is writable from shell
Symptoms: you can touch a file manually, but vzdump fails creating its archive.
Root cause: root-squash on NFS, wrong ownership under the target directory, or backup job writing into a subdir with different permissions.
Fix: verify server-side export options; set correct ownership/permissions on the dump directory; use consistent UID/GID mapping.
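On a Linux NFS server, the export line is where root-squash lives. A sketch of /etc/exports (path and network are examples; weigh the security trade-off of no_root_squash before copying it, and run exportfs -ra after editing):
/exports/pve-backup 10.10.0.0/24(rw,sync,no_subtree_check,no_root_squash)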
5) Exit code 1 with “broken pipe”
Symptoms: the log mentions “broken pipe” during compression or archive creation.
Root cause: downstream writer died: target storage stalled/disconnected, out of space, or process killed (OOM).
Fix: check for OOM kills; check target storage logs and free space; reduce compression or concurrency; fix flaky transport.
6) Backups succeed but restores are slow or fail verification
Symptoms: backups complete, but later restores take forever or fail checks/verification (common with PBS if datastore is unhealthy).
Root cause: underlying storage degradation, chunk store issues, or silent corruption detected later.
Fix: verify datastore health regularly; run scrubs/SMART checks; perform periodic test restores.
7) Backups time out only during business hours
Symptoms: jobs hang/stall; NFS “server not responding”; Ceph latency spikes; IO wait spikes.
Root cause: contention. Backups compete with production load on the same storage/network.
Fix: move backups to off-peak, throttle concurrency, isolate backup traffic, or upgrade the bottleneck (often network or NAS CPU).
Checklists / step-by-step plan
Step-by-step: get tonight’s backup to succeed
- Open the specific log: /var/log/vzdump/qemu-<vmid>.log or lxc-<ctid>.log. Find the first meaningful ERROR:.
- Confirm target storage: pvesm status and df -h /mnt/pve/<storage>.
- Validate mount reality: findmnt -T /mnt/pve/<storage>. If it’s not the expected fstype/source, stop and fix that first.
- Free space fast if needed: delete or move old backups per your retention policy. Don’t “rm -rf everything” unless you enjoy career changes.
- Check locks: qm config <vmid> for lock: backup; remove with qm unlock only after verifying no job is running.
- If snapshot errors: check LVM-thin metadata (lvs ... metadata_percent) or ZFS/Ceph health.
- Run one manual backup: vzdump <vmid> --storage ... --mode snapshot to prove the fix. Don’t wait for the scheduled job to “maybe” work.
- Document the cause: one line: “Backup failed because X; verified by Y; fixed by Z; validated by manual backup.” Your future self will buy you coffee.
Step-by-step: prevent repeats (reduce failure probability)
- Capacity headroom: keep backup targets under ~80–85% used. Above that, everything gets fragile: fragmentation, GC pressure, quota surprise.
- Make mount failures loud: monitor findmnt output and alert if the backup storage source/fstype changes.
- Separate heavy guests: schedule large DB or file servers in their own window. They dominate I/O and make everyone else look guilty.
- Storage health routines: ZFS scrub cadence; SMART monitoring; Ceph health gating. Backups shouldn’t be your first signal of disk failure.
- Periodic restore tests: prove that you can restore and boot, not just that you can write archives.
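A restore drill can be as small as this (new VMID, target storage, and archive name are examples; keep the restored guest on an isolated bridge so it doesn't fight the original for IPs):
cr0x@server:~$ qmrestore /mnt/pve/backup/dump/vzdump-qemu-101-2025_12_26-03_10_01.vma.zst 9101 --storage local-lvm --unique
cr0x@server:~$ qm start 9101
Boot it, check the application, then delete it. That ten-minute loop is the difference between "we have backups" and "we can restore".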
Order-of-operations cheat sheet (print-worthy)
- Log shows space/permission/IO? Fix that exact thing first.
- Verify mount is real and stable (NFS/CIFS/PBS).
- Check snapshot layer (LVM-thin meta%, ZFS pool, Ceph health).
- Check locks and stuck jobs.
- Check performance regression (dd test + dmesg + iostat if you use it).
- Rerun one backup manually to confirm.
FAQ
1) Where are the real vzdump logs on Proxmox?
Look in /var/log/vzdump/. There’s typically one log per guest per run (e.g., qemu-101.log, lxc-203.log). The GUI task view is a summary, not the full story.
2) Why does vzdump say “backup failed” when the archive file exists?
Because creating a file isn’t the definition of success. The job can fail after writing (verification step, finalizing, flushing, permissions on temp files, or post-backup hooks). Check the end of the log for the final error and whether the archive size looks sane.
3) Should I use snapshot mode or stop mode?
Snapshot mode is the default for a reason: it avoids downtime. Use stop mode for guests that cannot snapshot safely (rare) or when storage can’t snapshot reliably and you need a clean backup more than uptime.
4) What does “lock: backup” mean and is it safe to unlock?
It means Proxmox thinks a backup is in progress or was interrupted. Unlocking is safe only after you confirm there is no running vzdump task for that guest and no storage snapshot operation still active. Otherwise, you risk inconsistent state.
5) How do I know if NFS is the problem versus local disk?
Two fast tells: findmnt -T /mnt/pve/backup (is it mounted from the correct server?), and dmesg -T for NFS errors during the backup window. If the filesystem type is wrong in the vzdump log, it’s not NFS—it’s you writing locally.
6) Why do backups reveal storage corruption before users notice?
Backups read broadly and sequentially across VM disks. Regular workloads might only touch hot regions and never hit a bad block until months later. Backup is an accidental integrity test—don’t waste that signal.
7) Is higher compression always better for backups?
No. Higher compression can reduce storage and network use, but can increase CPU load and memory pressure, and can make long-running jobs overlap. Pick a setting that finishes reliably within your window.
8) What if PBS backups fail but NFS backups work (or vice versa)?
That usually points to the target side: authentication/token issues, datastore fullness, verification/GC contention on PBS, or NFS server export/permission problems. Split the system: confirm node-to-target network and then check the target’s own health and logs.
9) How do I prove a fix without waiting until tomorrow?
Run a manual backup for one representative guest immediately after the change. If the nightly job was failing due to concurrency, run two in parallel and see if the failure returns. Evidence beats hope.
10) What’s the single most effective reliability habit for backups?
Test restores on a schedule. Not once. Not after an incident. On purpose, routinely. Backups are only as real as your last successful restore.
Conclusion: next steps that actually reduce pager noise
When Proxmox says “vzdump backup failed,” it’s not being coy; it’s telling you the failure happened in a chain of dependencies. The fastest fix is to identify the first concrete error in the per-guest log, validate the target storage is real and writable, then check snapshot capability and storage health before you touch tuning knobs.
Do these next, in this order:
- Add a mount sanity check (filesystem type and source) before backups run, and alert if it changes.
- Enforce headroom on backup storage (retention/pruning, quotas you understand, and a hard “stop at 85%” policy).
- Audit snapshot readiness (LVM-thin metadata percent, ZFS pool health, Ceph health) and gate backups if the platform is degraded.
- Right-size concurrency and compression so jobs finish reliably. Your backup window is a budget; don’t overspend it.
- Schedule restore drills and treat them like fire alarms: annoying, necessary, and the difference between a controlled event and an obituary.