Proxmox backups to PBS fail: common chunk/space errors and what to do

You scheduled backups, watched the first few succeed, and mentally filed the whole thing under “handled.”
Then a restore test (or worse, a real incident) reveals the backups have been failing for days with messages about
chunks, space, locks, or “unexpected EOF.” This is the backup equivalent of finding out your smoke detector has been chirping
in an empty building.

Proxmox Backup Server (PBS) is solid engineering, but it’s a storage system with opinions: deduped chunks, indexes, snapshots,
garbage collection, and a datastore that will absolutely enforce physics. When backups fail, you’re usually dealing with one of
four realities: no space, no inodes, bad storage behavior, or a metadata/index inconsistency. The trick is identifying which one
quickly, then acting without making it worse.

How PBS failures actually look (and why they’re confusing)

PBS error messages often mention “chunks,” “indexes,” “catalog,” “snapshot,” “datastore,” or “verification.”
If you’re coming from file-copy style backups, this feels unnecessarily abstract. In PBS, the backup stream is cut into
chunks (variable sized), hashed, and deduplicated. Metadata (indexes) maps your VM/CT image to those chunks. Data and metadata
are both required for a restore.
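
If you want to see what that means on disk, a quick look at the datastore layout makes the model concrete. A minimal sketch, assuming your datastore lives at /mnt/datastore and VM 101 has at least one backup there (adjust both to your setup):

cr0x@pbs:~$ ls /mnt/datastore/.chunks | head -n 3
cr0x@pbs:~$ ls /mnt/datastore/vm/101/

The .chunks directory holds the deduplicated chunk files, fanned out into subdirectories by digest prefix; the per-snapshot directories under vm/101/ hold the index files (.fidx for disk images, .didx for archive streams) and small metadata blobs that map the guest back onto those chunks. Lose either side and the restore is gone.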

So you can get failures in at least five layers:

  • Transport/auth: PVE can’t authenticate to PBS, or TLS breaks.
  • Repository/datastore: path not writable, datastore locked, permissions wrong.
  • Capacity: bytes, inodes, or “reserved space” constraints stop writes mid-stream.
  • Storage integrity: underlying disk returns I/O errors, or filesystem starts lying.
  • Index/chunk mismatch: missing chunk, corrupted index, verification failures.

Your job as the on-call adult is to identify the tightest constraint first. Don’t chase the message;
chase the bottleneck that makes the message inevitable.

Fast diagnosis playbook (check 1/2/3)

If you have ten minutes before the next angry meeting, do this in order. It’s optimized for “what breaks most often” and
“what you can confirm fastest,” not for elegance.

1) Capacity reality check: bytes, inodes, and pool health

  • On PBS: check free space (df -h) and inodes (df -i) on the datastore mount.
  • If ZFS: check zpool list, zpool status, and dataset usage (zfs list).
  • Look for “80–90% full” conditions. On copy-on-write systems, that’s where performance and fragmentation get mean.

2) Identify the failing job and the error class quickly

  • On PVE: read the job log in the UI, but confirm with journalctl for context and timestamps.
  • On PBS: read journalctl -u proxmox-backup. The same failure may show clearer causes server-side.
  • Decide: is it space, permission/lock, I/O, or chunk/index integrity?

3) Stop making it worse: pause jobs, protect what’s good

  • If the datastore is nearly full: stop backup jobs, run prune/GC deliberately, and don’t start “verification of everything” right now.
  • If the pool is degraded or throwing checksum errors: stop writes, fix hardware, then verify.
  • If it’s a lock/permission mess: clear stale locks carefully (not by deleting random files), then re-run a single job.

One operational truth: when you’re out of space, everything looks like corruption. When you’re corrupt, everything looks like space.
Your job is to separate the two.

Interesting facts and context (the stuff that explains the weirdness)

  • Dedup backups aren’t “smaller files.” PBS stores data as chunks referenced by indexes. Space usage depends on chunk reuse, not file size.
  • Chunking is content-defined. Like classic dedup systems, PBS splits data based on rolling hashes, which helps dedup even when blocks shift.
  • “Verification” is a first-class feature. Many backup stacks treat integrity checks as optional and rarely run them. PBS expects you to verify.
  • Garbage collection is separate from pruning. Prune removes snapshot references; GC reclaims unreferenced chunks. People forget the second step.
  • CoW filesystems hate being 95% full. ZFS and btrfs need breathing room for metadata and allocations; “df says 5% free” is not reassuring.
  • Inode exhaustion still exists. You can have terabytes free and still fail writes if the filesystem runs out of inodes (especially with small chunk files and some FS layouts).
  • Backups stress different I/O paths than production. Sequential reads from VM disks plus random writes to the datastore can surface controller/firmware bugs that normal workloads don’t.
  • Silent corruption is a design target. Modern backup systems assume bit rot happens; checksums and verification exist because disks occasionally lie.
  • Most “PBS is slow” issues are storage layout issues. Wrong ashift, small recordsize, or a single HDD trying to behave like an array will make PBS look guilty.

Practical tasks: commands, outputs, decisions

These are real tasks you can run today. Each one includes: a command, what typical output means, and the decision you make.
Run commands on the right host (PVE vs PBS). Don’t freestyle destructive options until you’ve identified the failure class.

Task 1: Confirm datastore mount path and free space (PBS)

cr0x@pbs:~$ df -hT
Filesystem     Type   Size  Used Avail Use% Mounted on
/dev/sda2      ext4   917G  812G   59G  94% /
tmpfs          tmpfs   32G     0   32G   0% /dev/shm

Meaning: 94% used is danger territory for backup writes and filesystem housekeeping.
Decision: stop or stagger jobs, run prune/GC, and free space before re-running large backups.

Task 2: Check inode usage (PBS)

cr0x@pbs:~$ df -i
Filesystem      Inodes   IUsed   IFree IUse% Mounted on
/dev/sda2     61054976 60980012  74964  100% /

Meaning: You’re out of inodes. Writes fail even if bytes are available.
Decision: prune old snapshots, run GC, and consider migrating datastore to a filesystem/dataset sized for many files. Also review chunk/index layout and retention.
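
If that datastore sits on ext4, it’s worth checking how many inodes the filesystem was created with before you plan the migration. A minimal sketch, assuming the datastore device is /dev/sda2 (substitute yours):

cr0x@pbs:~$ tune2fs -l /dev/sda2 | grep -Ei 'inode count|free inodes|inode size'

If the inode count is small relative to the chunk count you expect, no amount of pruning will make the layout comfortable long-term; plan the replacement filesystem with a lower bytes-per-inode ratio, or use ZFS, which doesn’t have a fixed inode table.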

Task 3: Identify which datastore is configured (PBS)

cr0x@pbs:~$ proxmox-backup-manager datastore list
┌───────────┬───────────────┬────────────┬───────────┐
│ Name      │ Path          │ PruneOpts  │ Comment   │
╞═══════════╪═══════════════╪════════════╪═══════════╡
│ mainstore │ /mnt/datastore│ keep-daily │ primary   │
└───────────┴───────────────┴────────────┴───────────┘

Meaning: You now know the path to check with df, ls, and filesystem tooling.
Decision: focus diagnostics on the correct mount, not “/ looks full” guessing.

Task 4: Check ZFS pool health (PBS, if applicable)

cr0x@pbs:~$ zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
  scan: scrub repaired 0B in 02:11:45 with 3 errors on Fri Dec 19 03:12:10 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            sdb     ONLINE       0     0     2
            sdc     ONLINE       0     0     1
            sdd     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
        tank/pbs@somefile

Meaning: checksum errors exist. This is not a “PBS bug”; it’s a storage integrity event.
Decision: pause heavy writes, fix hardware, scrub again, then run PBS verification and plan restores for affected snapshots.
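
Once the hardware side is addressed, a scrub plus a verbose status check tells you whether errors are still accumulating. A sketch, assuming the pool is named tank:

cr0x@pbs:~$ zpool scrub tank
cr0x@pbs:~$ zpool status -v tank
cr0x@pbs:~$ zpool clear tank

zpool status -v lists affected files explicitly, which helps you map ZFS-level damage to PBS snapshots. Only run zpool clear after the underlying cause is fixed; otherwise you’re erasing evidence, not the problem.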

Task 5: Locate backup failure in PBS logs (PBS)

cr0x@pbs:~$ journalctl -u proxmox-backup --since "today" | tail -n 30
Dec 26 01:12:01 pbs proxmox-backup[2211]: upload failed: No space left on device (os error 28)
Dec 26 01:12:01 pbs proxmox-backup[2211]: failed to finalize backup: unable to write index
Dec 26 01:12:01 pbs proxmox-backup[2211]: job failed: datastore "mainstore" is full

Meaning: clear “ENOSPC” and index write failure. Chunk writes may have succeeded but metadata couldn’t finalize.
Decision: free space first, then re-run. Also schedule prune/GC so you don’t repeat this next week.

Task 6: Check PVE-side job error context (PVE)

cr0x@pve1:~$ journalctl --since "today" | grep -E "vzdump|proxmox-backup-client" | tail -n 20
Dec 26 01:11:58 pve1 vzdump[18277]: INFO: starting new backup job: vzdump 101 --mode snapshot --storage pbs-main --notes-template '{{guestname}}'
Dec 26 01:12:01 pve1 vzdump[18277]: ERROR: Backup of VM 101 failed - upload failed: No space left on device (os error 28)
Dec 26 01:12:01 pve1 vzdump[18277]: INFO: Failed at 2025-12-26 01:12:01

Meaning: The same ENOSPC bubbles up. No need to chase network ghosts.
Decision: stop new backups until datastore space is reclaimed.

Task 7: Verify the mount isn’t lying (PBS, common with NFS/iSCSI misadventures)

cr0x@pbs:~$ mount | grep /mnt/datastore
192.0.2.50:/export/pbs on /mnt/datastore type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2)

Meaning: Datastore is on NFS. This can work, but it’s where “chunks missing” stories are born.
Decision: if you see chunk corruption or intermittent I/O, strongly consider moving PBS datastores to local disks or a storage stack with end-to-end checksums.
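
If you’re stuck on NFS for now, at least check whether the client is quietly retransmitting. A sketch (nfsstat comes from the standard NFS client utilities):

cr0x@pbs:~$ nfsstat -c | head -n 6
cr0x@pbs:~$ grep /mnt/datastore /proc/mounts

A non-trivial retrans count, or hard-mount stalls in the journal around backup windows, is where “the network ate my chunk” stories come from.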

Task 8: Check for filesystem I/O errors (PBS)

cr0x@pbs:~$ dmesg -T | tail -n 20
[Fri Dec 26 01:10:44 2025] blk_update_request: I/O error, dev sda, sector 194512345 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Fri Dec 26 01:10:44 2025] EXT4-fs error (device sda2): ext4_journal_check_start:83: Detected aborted journal
[Fri Dec 26 01:10:44 2025] EXT4-fs (sda2): Remounting filesystem read-only

Meaning: Disk or controller issue caused ext4 to remount read-only. PBS will fail in weird ways after this point.
Decision: stop backup jobs, fix hardware, fsck as required, and only then attempt new backups. Assume recent backups are incomplete until verified.
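
Before trusting the disk again, ask for its own opinion. A sketch, assuming a SATA disk at /dev/sda and smartmontools installed (NVMe devices report different fields):

cr0x@pbs:~$ smartctl -H /dev/sda
cr0x@pbs:~$ smartctl -A /dev/sda | grep -Ei 'reallocated|pending|uncorrect'

A “PASSED” overall health with climbing reallocated or pending sector counts is still a failing disk; plan the replacement instead of arguing with it.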

Task 9: Check PBS datastore status and usage (PBS)

cr0x@pbs:~$ proxmox-backup-manager datastore status mainstore
┌──────────────┬──────────────────────────────┐
│ Key          │ Value                        │
╞══════════════╪══════════════════════════════╡
│ name         │ mainstore                    │
│ path         │ /mnt/datastore               │
│ total        │ 8.00 TiB                      │
│ used         │ 7.62 TiB                      │
│ avail        │ 0.38 TiB                      │
│ read-only    │ false                        │
└──────────────┴──────────────────────────────┘

Meaning: Only 0.38 TiB left. That can vanish fast with large fulls, or even with metadata/index growth.
Decision: tighten retention, add capacity, or split workloads across datastores. Don’t rely on “dedup will save us.”

Task 10: Run a targeted verify (PBS)

cr0x@pbs:~$ proxmox-backup-manager verify start mainstore --ns vm --group vm/101
starting verification of datastore 'mainstore'
verifying group 'vm/101'
OK: verified snapshot 'vm/101/2025-12-25T01:10:02Z'
FAILED: snapshot 'vm/101/2025-12-26T01:10:02Z' - missing chunk "3f2c...d91a"
TASK OK: verify datastore mainstore

Meaning: One snapshot is missing a chunk. This is either underlying storage loss, interrupted writes, or index/chunk inconsistency.
Decision: preserve the datastore state (don’t “clean up” randomly), inspect storage health, and plan to re-run that backup after space/stability is fixed.

Task 11: Check prune simulation before deleting (PBS)

cr0x@pbs:~$ proxmox-backup-manager prune list mainstore --ns vm --group vm/101 --dry-run
found 14 snapshots
would keep: 7
would remove: 7
  remove vm/101/2025-12-01T01:10:02Z
  remove vm/101/2025-12-02T01:10:02Z
  remove vm/101/2025-12-03T01:10:02Z

Meaning: You can see what retention will delete without committing.
Decision: if space is tight, proceed with prune, then run GC. If legal/compliance needs exist, stop and adjust policy with stakeholders.

Task 12: Run garbage collection (PBS)

cr0x@pbs:~$ proxmox-backup-manager garbage-collection start mainstore
starting garbage collection on datastore 'mainstore'
found 18234 unreferenced chunks
removed 18190 chunks, reclaimed 412.7 GiB
TASK OK: garbage collection

Meaning: Space reclaimed. This is the step people forget after pruning.
Decision: re-check free space; only then restart backup schedules.

Task 13: Inspect datastore file count pressure (PBS)

cr0x@pbs:~$ find /mnt/datastore/.chunks -type f | wc -l
58492312

Meaning: Tens of millions of files can stress inode tables, directory lookups, and backups of the backup server itself.
Decision: if you’re on ext4 with inode exhaustion or performance pain, plan a datastore migration (bigger inode ratio, or ZFS dataset design) and reconsider retention.
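
Walking tens of millions of files with find can take ages on a struggling datastore. A cheaper estimate samples one chunk subdirectory and extrapolates; a rough sketch, assuming the usual 65536-way .chunks fan-out under /mnt/datastore:

cr0x@pbs:~$ ls /mnt/datastore/.chunks | wc -l
cr0x@pbs:~$ find /mnt/datastore/.chunks/0000 -type f | wc -l

Multiply the second number by the first for a ballpark chunk count. It’s not exact (digest prefixes aren’t perfectly uniform), but it’s close enough to decide whether inode pressure is your real problem.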

Task 14: Confirm repository connectivity and auth (PVE)

cr0x@pve1:~$ proxmox-backup-client login --repository backup@pbs@pbs01:mainstore
Password for "backup@pbs@pbs01":
Login succeeded.

Meaning: Auth and network basics work. If jobs still fail, it’s likely capacity or integrity, not credentials.
Decision: stop rotating passwords in a panic and go back to datastore diagnostics.

Task 15: Check for stale lock symptoms (PBS)

cr0x@pbs:~$ journalctl -u proxmox-backup --since "today" | grep -i lock | tail -n 10
Dec 26 00:59:03 pbs proxmox-backup[2144]: unable to acquire lock on snapshot: resource busy
Dec 26 00:59:03 pbs proxmox-backup[2144]: backup failed: datastore is locked by another task

Meaning: Another task (backup, verify, GC, prune) holds the lock, or a previous task crashed and left state behind.
Decision: check running tasks in the UI and on the host; don’t delete lock files blindly. If a task is stuck, fix the underlying I/O issue first.

Task 16: Confirm time sync (PVE and PBS)

cr0x@pbs:~$ timedatectl
               Local time: Fri 2025-12-26 01:20:11 UTC
           Universal time: Fri 2025-12-26 01:20:11 UTC
                 RTC time: Fri 2025-12-26 01:20:11
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

Meaning: Time is synced. This matters more than people admit: expired certs, “future snapshots,” and weird scheduling can look like storage issues.
Decision: if clocks are wrong, fix NTP first, then re-run failing operations.

Chunk errors: what they mean and what to do

Chunk errors show up as “missing chunk,” “checksum mismatch,” “unable to decode chunk,” or index finalize failures.
Treat these as a spectrum: from harmlessly incomplete snapshots (job died before commit) to genuine storage corruption.

Missing chunk

Typical symptom: verify reports missing chunks; restore fails for a specific snapshot.

Common root causes:

  • Interrupted backup write (ENOSPC mid-flight, server crash, filesystem remounted RO).
  • Underlying storage inconsistency (NFS hiccups, RAID controller write cache lies, flaky disk).
  • Human cleanup (“I deleted some old chunk files to make space.” Please don’t.)

What to do (a minimal command pass follows the list):

  • Confirm the datastore filesystem is healthy: check dmesg, SMART, RAID, ZFS scrubs.
  • Verify whether only the newest snapshot is affected. If yes and it coincides with ENOSPC or a crash, re-run the backup after fixing capacity.
  • If multiple snapshots across time have missing chunks, assume storage is dropping writes or corrupting data. Stop and fix that first.
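
A minimal command pass over the first two bullets, assuming a ZFS-backed datastore on a pool named tank and an affected VM 101 (adjust names and time window):

cr0x@pbs:~$ dmesg -T | grep -iE 'i/o error|read-only|remount' | tail -n 20
cr0x@pbs:~$ zpool status -x
cr0x@pbs:~$ journalctl -u proxmox-backup --since "7 days ago" | grep -E 'vm/101|No space' | tail -n 20

If the journal shows ENOSPC or a read-only remount right around the snapshot timestamp, the missing chunk is collateral damage, not creeping corruption; fix capacity, re-run the job, and verify.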

Checksum mismatch / decode errors

This is where you stop blaming PBS and start blaming entropy. Checksums exist because drives return wrong bits occasionally.

Root causes: bad RAM (yes, really), a disk returning stale data, or a controller/firmware issue. In virtualized PBS, the hypervisor storage stack can also be involved.

What to do: scrub and test storage; run memory tests if corruption repeats; verify all recent snapshots; and for affected workloads, ensure you have a second copy (another PBS, tape, object store, whatever your governance allows).
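
Concretely, the first round of blame assignment might look like this. A sketch, assuming a ZFS pool named tank and the memtester package installed (a full memtest86+ boot is more thorough on bare metal):

cr0x@pbs:~$ zpool scrub tank
cr0x@pbs:~$ zpool status -v tank
cr0x@pbs:~$ memtester 2048M 1

If scrubs keep finding new checksum errors after a clean pass, or memtester reports failures, stop treating this as a PBS problem and start treating it as a hardware replacement ticket.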

Index write/finalize failures

Often appears as “unable to write index,” “failed to finalize backup,” “unexpected EOF.” The backup stream can upload a lot of chunks,
then fail at the end when the index/catalog must be committed. If space is tight, that final commit is where you lose.

What to do: treat index errors as “backup not valid until verified.” Fix space or filesystem RO state, re-run the backup, and then verify.

Joke #1: A missing chunk is like a missing sock—annoying, mysterious, and it always disappears when you’re already late.

Space errors: not just “disk full”

“No space left on device” is blunt. The problem is that Linux can throw it for at least four different kinds of “space,” and PBS
will surface it when it tries to commit data or metadata. Here are the space traps that hit PBS operators in production.

1) Byte exhaustion (classic disk full)

Straightforward: df -h says 100%. But the more interesting problem is “not 100% yet, but effectively full.”
ZFS, btrfs, and even ext4 under heavy fragmentation can behave badly well before 100%.

Rule: don’t run a dedup backup datastore above ~80–85% long-term if you care about performance and successful GC.
You can spike higher briefly, but if you live there, you’ll pay in timeouts and weirdness.
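
On ZFS, “where did the space actually go” has a precise answer. A sketch, assuming the datastore dataset is tank/pbs:

cr0x@pbs:~$ zfs list -o space tank/pbs

The USEDSNAP column shows how much is pinned by ZFS snapshots of the dataset itself (separate from PBS snapshots), and USEDREFRESERV shows reservations; both are classic “the pool is full but nothing looks big” culprits.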

2) Inode exhaustion

PBS chunk stores can create enormous file counts. Some filesystems (and some inode sizing choices) will punish you.
If df -i is high, you can get ENOSPC while df -h looks fine.

Fix: prune and GC to reduce chunk count, then plan a datastore migration sized for many files. If you’re building from scratch, design for inode pressure.
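
If you rebuild on ext4, decide the inode ratio at mkfs time, because you can’t change it later. A hypothetical example, assuming the new datastore disk is /dev/sdb1; the number is illustrative, not a recommendation:

cr0x@pbs:~$ mkfs.ext4 -i 8192 /dev/sdb1

Lower bytes-per-inode means more inodes (and a bit more metadata overhead). ZFS sidesteps the fixed-inode question entirely, which is one reason it’s a popular datastore backend.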

3) Metadata/transaction space constraints (CoW reality)

On ZFS, you can have “free space” but not enough contiguous space (or not enough metadata headroom) to complete allocations efficiently.
A full pool can cause GC to run painfully slow, which means you can’t reclaim space fast enough to unblock backups. This becomes a loop.

Fix: keep pools roomy; don’t treat “it fits” as success. Add vdevs/capacity before you hit the cliff. Consider special vdevs only if you understand failure domains.
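
Two numbers tell you how close to the cliff you are. A sketch, assuming the pool is named tank:

cr0x@pbs:~$ zpool list -o name,size,allocated,free,fragmentation,capacity,health tank

Capacity creeping past ~80% with fragmentation climbing is your early warning: add capacity or cut retention before GC starts timing out, not after.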

4) “Reserved blocks” and root-only space (ext4 nuance)

ext4 typically reserves a percentage of blocks for root. PBS’s datastore writes normally run under its service user, so they can hit ENOSPC at the reserve boundary while root-owned processes still write happily.
Depending on your service user and mount setup, this creates confusing partial failures.

Fix: understand your reserve settings and service permissions. But don’t “tune away” safety reserves just to squeeze a few more backups in. That’s how you get a dead system at 3 a.m.
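
You can at least see what the reserve is before deciding whether it matters. A sketch, again assuming the datastore device is /dev/sda2:

cr0x@pbs:~$ tune2fs -l /dev/sda2 | grep -iE 'block count|reserved'

If you ever adjust it, tune2fs -m takes a percentage and the change is reversible, but remember why the reserve exists before spending it on one more week of backups.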

Locks, permissions, and “it worked yesterday”

Backup systems are state machines with sharp edges. Locks exist to prevent concurrent modifications of the same snapshot group.
When locks become stale, or permissions change underfoot, failures look random.

Permission errors (datastore path, ownership, mount options)

Symptoms: “permission denied,” “read-only filesystem,” “failed to create directory,” or jobs failing immediately.

Reality check: “I can write there as root” is not the same as “the PBS service can write there under its runtime context.”
Also, an NFS export can silently change behavior after a reboot or network event.

Fix: confirm mount options, confirm actual path permissions, and confirm the filesystem is RW and stable.
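
A quick way to test the actual runtime context instead of your root shell: PBS datastore I/O normally runs as the backup service user, so try a write as that user. A sketch, assuming the datastore path /mnt/datastore and a root shell on the PBS host:

cr0x@pbs:~$ ls -ld /mnt/datastore /mnt/datastore/.chunks
cr0x@pbs:~$ runuser -u backup -- touch /mnt/datastore/.perm-test && rm /mnt/datastore/.perm-test

If the touch fails while root writes fine, you’ve found your “permission denied”: check ownership (backup:backup is the usual expectation for a local datastore) and mount options before touching PBS config.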

Lock errors

Locks are often a symptom, not the disease. A long-running verify, stuck GC, or an I/O-hung filesystem can hold locks and cause backups to fail.
The correct move is not “remove locks,” it’s “why is the task stuck.”
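
Before thinking about lock files at all, look at what’s actually running or stuck. A sketch using standard PBS tooling (substitute a real UPID from the list):

cr0x@pbs:~$ proxmox-backup-manager task list
cr0x@pbs:~$ proxmox-backup-manager task log <UPID-from-the-list>

A verify or GC that has been “running” for twelve hours on an I/O-starved pool explains the lock far better than any stale-file theory. Fix or stop that task (proxmox-backup-manager task stop exists for a reason), and the lock problem usually evaporates with it.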

A paraphrased idea from Werner Vogels (Amazon CTO): “Everything fails, all the time; design systems so failure is routine, not catastrophic.”

Prune and garbage collection: the slow-moving train

PBS retention is a two-step dance:

  1. Prune removes snapshot references according to policy.
  2. Garbage collection reclaims chunks no longer referenced by any snapshot.

If you only prune, your datastore may still look full. If you only GC, nothing happens because snapshots still reference chunks.
If you do both while the pool is 95% full and unhealthy, both operations become slow and fragile.

Operational guidance (opinionated, because you need it)

  • Schedule prune and GC during low I/O windows, but not so rarely that you hit cliffs.
  • Don’t overlap heavy verify with GC on small systems. Your disks will spend the night arguing with themselves.
  • Keep retention realistic. “Keep everything forever” is a storage purchase order, not a policy.
  • Verify restores, not just backups. Verification catches missing chunks; restore tests catch operational gaps (keys, permissions, network).

Joke #2: Garbage collection is like corporate budgeting—nothing gets reclaimed until someone proves it’s truly unused.

Three corporate mini-stories from the backups trench

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran PBS on a new storage shelf. The team assumed “RAID controller with battery-backed cache” was enough to make writes safe.
They also assumed that because production VMs ran fine, backup writes were “less important” and could tolerate occasional hiccups.

The first sign of trouble was verification failures: missing chunks scattered across different VM groups. Not just the latest snapshot.
Restores worked sometimes, then failed on others. People blamed PBS updates, then blamed the network, then blamed “dedup complexity.”

The turning point was correlating PBS verify timestamps with controller logs showing cache destaging warnings and occasional resets.
Backup writes are high-throughput and bursty; they hit the controller in a way the VM workload didn’t. The array occasionally acknowledged writes,
then lost them during a reset window. PBS did its job: it complained when the chunks weren’t there later.

The fix was boring: firmware update, controller replacement, and moving the PBS datastore to ZFS with end-to-end checksums on a host with ECC RAM.
Verification stopped failing. They also added a second PBS replica target for critical workloads, because “one copy” is not a strategy.

Mini-story 2: The optimization that backfired

Another shop wanted faster backups. They put the PBS datastore on an NFS share backed by a fast NAS. On paper, it was great: lots of capacity,
easy expansion, centralized storage team approval. They tuned NFS rsize/wsize, felt clever, and declared victory.

For a month, everything looked fine. Then they saw periodic “unable to decode chunk” errors and some snapshots that couldn’t be verified.
The failures were sporadic—classic “it’s always the network” vibes. The storage team insisted the NAS was healthy. The virtualization team insisted PBS was picky.

The real issue was subtle: intermittent NFS stalls during metadata-heavy operations and a failover event that changed export behavior briefly.
Chunk files and indexes are sensitive to partial writes and timing. The NAS was optimized for large sequential file workloads, not millions of small-ish chunk objects
with constant metadata churn.

They rolled back the “optimization” and placed the datastore on local ZFS mirrors with a sane free-space target. Backups slowed slightly on paper,
but became consistent and verifiable. The best performance metric is “restores work,” not “the graph is pretty.”

Mini-story 3: The boring but correct practice that saved the day

A regulated company had a dull habit: weekly restore tests of one VM per cluster, rotated through a list.
Not a fire drill—just a scheduled task with a ticket, a checkbox, and a human who had to attach a console screenshot and a checksum file.

One week, the restore test failed with a missing chunk on the newest snapshot, but older snapshots restored fine.
The operator escalated immediately, and the team paused new backups to avoid overwriting the situation with noise.

They found the datastore was hovering around 92% full and GC hadn’t reclaimed space because prune ran, but GC was scheduled for a different window
and had been failing silently due to timeouts. The system wasn’t corrupt; it was just suffocating.

They freed space, ran GC, re-ran the failed backup, and verified it. No drama. The important part is that the restore test caught it
before a real incident forced a restore under pressure.

That’s the lesson: the “boring” practice isn’t backups. It’s proving backups are usable.

Common mistakes: symptoms → root cause → fix

1) “No space left on device” during finalize

Symptoms: backup uploads most data then fails at the end; logs mention index finalize.

Root cause: datastore too full; metadata/index needs space at commit time.

Fix: prune snapshots, run GC, keep datastore under a sane utilization target; re-run and verify.

2) Backups fail while df -h shows plenty of space

Symptoms: ENOSPC but “60% free.”

Root cause: inode exhaustion or quota/refreservation constraints.

Fix: check df -i; on ZFS check dataset quotas/reservations; reduce snapshot count via prune; plan migration if inode layout is wrong.

3) “Missing chunk” on verification for newest snapshot only

Symptoms: only latest snapshot fails verify; older ones OK.

Root cause: interrupted backup (space ran out, crash, filesystem RO).

Fix: fix capacity/health; re-run the backup; verify again; consider alerting on ENOSPC and RO remounts.

4) “Missing chunk” scattered across many snapshots

Symptoms: verify failures across time and groups.

Root cause: underlying storage corruption or unreliable remote filesystem behavior.

Fix: stop writes, run storage diagnostics (scrub, SMART, controller logs), remediate hardware, then re-verify and re-seed backups.

5) Backups hang or crawl, then fail with timeouts

Symptoms: jobs run for ages; GC never finishes; occasional lock errors.

Root cause: datastore too full (CoW pain), or disks saturated by concurrent verify/GC/backup.

Fix: reduce concurrency; separate verify windows; keep free space; move datastore to faster disks or add vdevs.

6) “Datastore is locked”

Symptoms: new jobs fail immediately; logs mention locks.

Root cause: another task running, or a stuck task due to I/O stalls.

Fix: identify running tasks; resolve I/O issues; avoid manual deletion of state; re-run one job to validate.

7) “Read-only filesystem” during backup

Symptoms: sudden widespread failures; dmesg shows ext4 remount RO or ZFS fault.

Root cause: hardware I/O errors or filesystem corruption response.

Fix: stop, repair hardware/filesystem, then verify datastore consistency and restore capability.

8) “Authentication failed” after “no changes”

Symptoms: jobs fail at start; login fails; TLS errors.

Root cause: expired certs, time drift, password rotation mismatch, or repository string typo.

Fix: check timedatectl; verify repository; re-auth with proxmox-backup-client login; update credentials in PVE storage config.

Checklists / step-by-step plan

Checklist A: When backups start failing tonight

  1. Stop the bleeding: pause backup schedules if failures are due to space or I/O errors.
  2. Classify: space (bytes/inodes), integrity (I/O, checksum), locks/perms, or auth/network.
  3. Confirm datastore mount health: df -hT, df -i, mount, dmesg.
  4. Read server-side logs: journalctl -u proxmox-backup near the failure timestamp.
  5. If ENOSPC: prune (dry-run first), then GC. Re-check free space.
  6. If I/O errors: stop writes, fix storage, then verify snapshots.
  7. Re-run one backup for a representative VM/CT and immediately verify it.

Checklist B: When you need to reclaim space safely

  1. List retention and run prune dry-run for a group you can delete safely.
  2. Prune snapshots intentionally (don’t guess-delete files).
  3. Run garbage collection to reclaim chunks.
  4. Re-check free bytes and inodes.
  5. Adjust retention policies so you don’t repeat the same emergency weekly.

Checklist C: When verification reports chunk errors

  1. Determine if it’s only the newest snapshot or scattered history.
  2. Check dmesg and storage health (ZFS scrub results / SMART / RAID logs).
  3. If newest-only and correlates with ENOSPC/crash: fix cause, re-run backup, verify again.
  4. If scattered: treat as storage integrity incident. Stop writes. Remediate hardware/storage stack.
  5. After remediation: run verification broadly and perform restore tests for critical workloads.

FAQ

1) Why does PBS talk about “chunks” instead of files?

PBS deduplicates and verifies at the chunk level. Your backup is a set of indexes referencing hashed chunks, not a single monolithic file.
That’s why missing chunks break restores even if “most data uploaded.”

2) If I prune snapshots, why don’t I immediately get space back?

Because prune removes references; it doesn’t delete shared chunks that might still be used by other snapshots. Garbage collection is what reclaims
unreferenced chunks.

3) Can I store PBS datastores on NFS?

You can, but it’s a risk trade. If your NFS server is rock-solid and tuned for metadata-heavy workloads, it may work.
In practice, many “missing chunk” issues trace back to remote filesystem behavior under stress or failover.

4) What free space target should I aim for?

For healthy operations and predictable GC, plan to keep meaningful headroom. If you’re regularly above ~85% usage, expect slowdowns and failures,
especially on CoW filesystems. Size capacity and retention accordingly.

5) I have ENOSPC but still see free space. How?

Inodes, quotas, reservations, or filesystem-specific constraints can cause ENOSPC. Always check df -i, and if you use ZFS, review dataset quotas/reservations.
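
On ZFS, the quick way to check for hidden ceilings is to ask for the relevant properties directly. A sketch, assuming the datastore dataset is tank/pbs:

cr0x@pbs:~$ zfs get quota,refquota,reservation,refreservation tank/pbs

Any of these set to an actual value (rather than none) can produce ENOSPC long before the pool itself is full.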

6) Should I run verification nightly?

Verification is great, but it consumes I/O. On small systems, schedule it so it doesn’t overlap with heavy backups or GC.
For critical data, frequent verification plus periodic restore tests beats blind faith.

7) Is “datastore locked” safe to fix by deleting lock files?

Usually no. Locks are often held because a task is running or stuck due to I/O issues. Removing locks can create inconsistent state.
Find the task, find why it’s stuck, then resolve that.

8) Do chunk errors always mean my storage is corrupt?

Not always. If the newest snapshot failed during ENOSPC or a crash, you can get missing chunks for that snapshot only.
If chunk errors appear across many snapshots over time, assume storage integrity problems until proven otherwise.

9) What’s the single best habit to avoid surprise backup failures?

Alert on datastore utilization (bytes and inodes), and run scheduled restore tests. Backup success logs are comforting; restores are evidence.

Next steps (do this, not that)

If you’re dealing with PBS backup failures today, do these next steps in order:

  1. Get hard evidence: pull the PBS-side error from journalctl -u proxmox-backup and classify it (space vs integrity vs lock vs auth).
  2. Fix capacity first: bytes and inodes. Prune with a dry-run. Then run garbage collection. Re-check utilization.
  3. Verify after recovery: run targeted verify on the failing groups; don’t assume “next backup succeeded” means everything’s fine.
  4. Stop trusting unstable storage: if you see real corruption patterns, pause writes and remediate the storage stack before creating more questionable backups.
  5. Make the failure boring next time: add alerts for utilization and filesystem RO remounts, schedule prune/GC sensibly, and run restore tests.

The goal isn’t heroic recovery. The goal is a backup system so predictable that the only surprise is how rarely you have to think about it.
And when it does fail, you want it to fail loudly, quickly, and with enough detail to fix it before you need a restore.
