Proxmox Backup “No space left on device”: why it fails even when space seems free

You’ve got plenty of free space. df -h says so. The storage graph in the GUI looks fine. Then a backup job detonates with
“No space left on device” and you start bargaining with the universe.

The punchline: “no space” rarely means “no bytes left.” It means some limit was hit—bytes, inodes, metadata, ZFS headroom, quotas,
tmp space, snapshot growth, or PBS chunk-store rules. This is the map of that territory, written by someone who’s watched it burn in production.

What “No space left on device” actually means

On Linux, ENOSPC is the error code behind the message. It does not mean “storage is physically full.”
It means: “the kernel refused a write because the filesystem, the block device, or a quota/limit said no.”
The refusal can happen for a few different reasons:

  • Bytes are exhausted on the target filesystem or pool.
  • Inodes are exhausted (classically ext4 and its fixed inode table; PBS datastores create huge numbers of chunk files).
  • Metadata space is exhausted (ZFS special devices, btree nodes, journal limits, or just fragmentation and slop).
  • Reserved blocks (ext filesystems keep a reserve; ZFS needs headroom).
  • Quotas / project quotas / dataset quotas hit before the “whole disk” is full.
  • Copy-on-write amplification (snapshots turn deletes into “still used,” and overwrites into new allocations).
  • Different filesystem than you think (tmp dir on a small partition; container rootfs vs mounted storage).

Proxmox adds its own flavors. Backups involve temporary files, snapshots, and (if you use Proxmox Backup Server) a chunk store that behaves
more like a content-addressed archive than a folder of tar files. So you can have 2 TB “free” and still be unable to allocate the next 128 KB.
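
Want proof that the message is just the kernel talking? Linux ships /dev/full, a pseudo-device that fails every write with ENOSPC. Writing anything to it reproduces the exact error, no full disk required:

cr0x@server:~$ echo test > /dev/full
-bash: echo: write error: No space left on device

Same string, same errno (28), zero bytes of storage involved. Keep that in mind the next time the GUI swears there is space.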

First joke (short, because your backup window isn’t getting any longer): Free space is like meeting room availability—always “available” until you try to book it.

Fast diagnosis playbook

When backups fail with “no space,” you want a short, ruthless sequence. Don’t browse dashboards. Don’t guess. Check the three constraints that
commonly lie: bytes, inodes, and headroom/quotas. A consolidated triage sketch follows the steps.

1) Identify where the write failed (PBS datastore? NFS? local dir?)

  • Look at the Proxmox job log: does it mention a path under /var/lib/vz, a mounted share, or a PBS datastore?
  • On PBS, errors often appear in the task viewer and journalctl with “chunk” or “datastore” context.

2) Check bytes and inodes on the target filesystem

  • df -h (bytes)
  • df -i (inodes)

3) If ZFS is involved, check pool health and capacity like you mean it

  • zpool status -x (health) and zpool list (capacity, fragmentation)
  • zfs list -o space for the dataset that actually hosts the datastore/backups
  • If the pool is >80–85% full, treat it as “effectively full” for write-heavy workloads.

4) If PBS: run prune and garbage collection, then re-check

  • Prune removes snapshot references; GC frees unreferenced chunks. You usually need both.
  • If prune works but GC cannot allocate metadata, you’re still “full” in the way that matters.

5) Verify temp directories and mount points

  • Backups can write temp files to /var/tmp, /tmp, or a configured tmpdir.
  • If /tmp is a small tmpfs, large vzdump jobs can fail even while the backup target has terabytes free.

6) If it’s a remote target (NFS/SMB): check server-side quotas and exports

  • “Client shows free space” is not a legally binding contract.
  • Quotas, snapshots, or reserve policies on the NAS can throw ENOSPC while the share looks roomy.
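
If you want the first few steps in one pass, here’s a minimal triage sketch. The script name, the ZFS branch, and the layout are my own assumptions, not Proxmox tooling; feed it the path named in the failing job log.

#!/bin/bash
# triage.sh - given the path a backup job failed on, show which constraint is biting.
set -euo pipefail
P="${1:?usage: triage.sh /path/from/the/job/log}"

echo "== filesystem backing the path =="
findmnt -T "$P" -o TARGET,SOURCE,FSTYPE,SIZE,USED,AVAIL,USE%

echo "== bytes and inodes =="
df -hT "$P"
df -i  "$P"

# If the path lives on ZFS, the dataset and pool view matter more than df.
if [ "$(findmnt -n -T "$P" -o FSTYPE)" = "zfs" ]; then
  ds="$(findmnt -n -T "$P" -o SOURCE)"
  echo "== zfs view =="
  zfs list -o space "$ds"
  zfs get -H -o name,property,value quota,refquota,reservation,refreservation "$ds"
  zpool list
fi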

The real failure modes (and why your free space lies)

Failure mode A: inode exhaustion (yes, in 2025)

Inodes are the filesystem’s “file slots.” Run out of them and you can’t create new files—even with gigabytes free.
PBS datastores can create lots of files (chunks, indexes, metadata). If you put a datastore on ext4 with a poor inode ratio,
or you’ve been running it for years with lots of churn, inode pressure becomes real.

The tell: df -h looks fine, df -i is at 100%. Backups fail on “create” or “write” calls.

Opinionated advice: for PBS, prefer ZFS (with sane headroom) or XFS with planning. ext4 is fine until it isn’t—and the day it isn’t will be
a Friday night.
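
If you’re stuck with ext4, at least know your inode budget. A minimal check, with the device name as an example (substitute whatever backs your datastore or root):

cr0x@server:~$ sudo tune2fs -l /dev/mapper/pve-root | grep -Ei 'inode count|free inodes'

Compare those two numbers against your chunk growth. And if you’re building a new ext4 filesystem for millions of small files, the inode ratio is set at creation time only, e.g. mkfs.ext4 -i 8192 for one inode per 8 KiB of space; you cannot add inodes later.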

Failure mode B: ZFS pool “full” behavior (it gets ugly before 100%)

ZFS is copy-on-write. That’s great for integrity and snapshots, and occasionally terrible for “I need to append a lot of data now.”
As pool utilization climbs, every allocation needs more metadata updates and more searching for contiguous free space, and fragmentation grows. Performance falls off a cliff.
Eventually, you hit a point where the pool cannot allocate blocks of the needed size reliably. You may see ENOSPC or transaction group stalls.

The specific trap: people treat ZFS like ext4—run it to 95% and expect it to behave. It won’t.
If you care about reliability and predictable backups, keep ZFS pools comfortably below the danger zone. My default: start sweating at 80%, start acting at 85%.
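
That threshold is trivial to automate. A minimal sketch; the 80% value and the plain echo are placeholders for whatever already pages you:

#!/bin/bash
# zpool-headroom.sh - warn when any pool crosses the "start sweating" threshold.
THRESHOLD=80
zpool list -H -o name,capacity | while read -r pool cap; do
  pct="${cap%\%}"                      # "93%" -> "93"
  if [ "$pct" -ge "$THRESHOLD" ]; then
    echo "WARNING: pool ${pool} is ${pct}% full; act before 85%" >&2
  fi
done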

Failure mode C: dataset quotas/reservations (the pool is fine; your dataset isn’t)

You can have a ZFS pool with 10 TB free and a dataset with a quota=1T that’s full. Same story on XFS project quotas.
Proxmox and PBS are often installed by humans who were “being tidy” and set quotas “just to keep backups from eating everything.”
Then they forget. The quota remembers.

Also, reservations work the other way: a dataset can reserve space and starve other datasets. If your backup target suddenly can’t write, check both quota and reservation.
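
A quick way to find forgotten limits across an entire pool (the pool name is an example):

cr0x@server:~$ zfs get -r -H -o name,property,value quota,refquota,reservation,refreservation rpool | awk '$3 != "none" && $3 != "-"'

Anything that prints is a wall somebody built on purpose. Decide whether it still makes sense before your backup job rediscovers it.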

Failure mode D: snapshots holding space hostage

Snapshots are not a second copy. They are a promise: “the old blocks stay available.” When you delete or overwrite data after taking a snapshot,
the old blocks remain referenced and your “free space” doesn’t come back. In ZFS, zfs list -o space shows this as USEDSNAP (not USEDDS).

On PBS, pruning removes snapshot references from the backup catalog; garbage collection removes chunks no longer referenced.
If you only prune but never GC, the disk usage won’t fall much. If GC can’t run due to space pressure, you’re in a deadlock: need space to free space.
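
To see which snapshots are pinning space on the dataset behind your datastore (the name matches the examples later in this article):

cr0x@server:~$ zfs list -t snapshot -o name,used,creation -s used -r rpool/store1 | tail -n 15

One caveat: a snapshot’s USED column only counts blocks unique to that snapshot, so a long snapshot chain can collectively hold far more than the column suggests.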

Failure mode E: temp space, tmpfs, and “not writing where you think”

Proxmox VE backups (vzdump) often write to a target storage, but they can also stage data, create logs, and use temp space.
If your /tmp is a small tmpfs (common), or /var is on a small root disk, your backup can fail long before the actual backup disk fills.

This is one of those problems that feels insulting. The backup target has room. Your root disk doesn’t. The error message doesn’t tell you which one is the problem.
Welcome to operations.
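
Two quick checks, assuming the default config path shown later in this article:

cr0x@server:~$ grep -E '^(tmpdir|dumpdir)' /etc/vzdump.conf
cr0x@server:~$ findmnt -T /var/tmp

If the tmpdir (or the default staging location) resolves to the root filesystem, either point it somewhere roomier with a tmpdir: line, or accept that root now needs real headroom and monitoring.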

Failure mode F: thin provisioning and overcommit (LVM-thin, ZVOLs, SANs)

Thin provisioning makes everything look spacious until it isn’t. LVM-thin pools can hit 100% data or metadata usage.
SAN/NAS backends can overcommit, and the first write that needs real blocks fails. Some environments return ENOSPC; others return I/O errors.

The dangerous bit: you can have “free space” inside a VM disk while the thin pool hosting it is out of space. Backups that snapshot and read
a disk won’t necessarily fail; backups that write to the same stressed pool will.
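
A minimal watchdog sketch; the VG/LV name and the threshold are placeholders, and Task 15 below shows the raw command and output:

#!/bin/bash
# thin-watch.sh - warn before an LVM-thin pool runs out of data or metadata space.
VG_LV="vg0/thinpool"   # placeholder: your volume group / thin pool
LIMIT=80
read -r data meta < <(lvs --noheadings -o data_percent,metadata_percent "$VG_LV" 2>/dev/null) || true
[ -n "${meta:-}" ] || { echo "no thin pool found at $VG_LV" >&2; exit 1; }
for v in "$data" "$meta"; do
  if [ "${v%.*}" -ge "$LIMIT" ]; then
    echo "WARNING: $VG_LV data=${data}% metadata=${meta}%" >&2
    break
  fi
done
# Growing metadata later needs free extents in the VG, e.g.:
#   lvextend --poolmetadatasize +1G vg0/thinpool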

Failure mode G: remote storage lies (NFS/SMB quirks, server-side snapshots, quotas)

NFS clients cache attributes. SMB sometimes lies by omission. And the server can enforce quotas or reserve policies invisible to the client.
If your Proxmox node says “No space left on device” while the NAS UI says “2 TB free,” believe the error and investigate server-side constraints.

Failure mode H: PBS chunk store overhead and “space you can’t use”

PBS stores data as chunks and indexes with verification. That brings overhead:
metadata, checksums, indexes, logs, plus additional “working” space for GC.
If you size PBS “to the raw backup size,” you will eventually discover the concept of “operational slack space,” typically during a failed backup.

Second joke, because we only get two: A backup system without slack space is like a parachute packed to exactly the volume of the bag—technically impressive, operationally fatal.

Failure mode I: the root filesystem is full (and it’s taking your backup job down with it)

Even if backups write to a separate mount, the system still needs space for logs, journal, temporary files, lock files, and sometimes
snapshot metadata. If /var/log or the system journal fills the root filesystem, you’ll see a cascade of unrelated failures.
Backups are just the first thing you notice.
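
The usual suspects on a full root are the journal and /var. A minimal cleanup pass; the 500M cap is an example, not a recommendation:

cr0x@server:~$ sudo du -xh --max-depth=2 /var 2>/dev/null | sort -h | tail -n 15
cr0x@server:~$ journalctl --disk-usage
cr0x@server:~$ sudo journalctl --vacuum-size=500M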

Failure mode J: “deleted but still open” files

Classic Unix trap: a process has a file open; you delete it; the space is not freed until the process closes it.
So you clean up, see “free space unchanged,” then the next backup still fails.
PBS and Proxmox components are usually well-behaved, but loggers, debuggers, or third-party agents can hold large files open.

One operational quote to keep in your head, paraphrased idea from James Hamilton (AWS): “Measure everything you can, and automate the rest.”
If you don’t have alerting for inode usage, ZFS capacity, and thin-pool metadata, you’re choosing surprise outages.

Interesting facts and short history (the stuff that explains today’s weirdness)

  1. ENOSPC is older than most of your servers. The error code dates back to early Unix; “device” meant the filesystem’s backing storage, not just disks.
  2. Inodes were designed when disks were small. Early Unix filesystems pre-allocated inode tables; running out of inodes before bytes is a legacy feature that refuses to die.
  3. ext filesystems reserve blocks by default. Traditionally ~5% reserved for root to keep the system alive and reduce fragmentation; it surprises people on data volumes.
  4. ZFS needs headroom for copy-on-write. High utilization increases fragmentation and metadata pressure; “100% full” is not a practical target.
  5. Thin provisioning popularized “it looked free.” VM platforms made overcommit normal; storage admins learned the hard way that “allocated” and “used” are not the same.
  6. PBS is dedup-first, not file-first. Its chunk store trades human-readable backup files for integrity and efficiency; cleanup requires prune + GC, not just deleting files.
  7. Garbage collection often needs temporary breathing room. Many systems require some free space to rearrange metadata and safely free blocks; running “to zero” can deadlock cleanup.
  8. NFS attribute caching can mislead monitoring. The client’s view of free space and the server’s enforcement can diverge, especially with quotas and snapshots.
  9. Filesystem “free” is not all usable. Between reserved blocks, allocation unit sizes, and metadata, you can be “free” but unable to allocate the needed extents.

Practical tasks: commands, outputs, what they mean, and the decision you make

These are the checks I run when someone pings “backups failing, but there’s space.” Each task includes: a command, a realistic snippet of output,
what it means, and the action you take.

Task 1: Find the failing path in Proxmox task logs

cr0x@server:~$ journalctl -u pvedaemon -u pvescheduler --since "2 hours ago" | tail -n 30
Dec 26 01:12:19 pve01 pvescheduler[1883]: starting new backup job: vzdump 101 102 --storage pbs01 --mode snapshot
Dec 26 01:13:44 pve01 vzdump[22109]: ERROR: Backup of VM 101 failed - write error: No space left on device
Dec 26 01:13:44 pve01 vzdump[22109]: INFO: Failed at: /var/tmp/vzdumptmp12345
Dec 26 01:13:44 pve01 pvedaemon[1765]: VM 101 qmp command failed - backup aborted

Meaning: The failure happened writing to /var/tmp, not the backup target. That’s your first fork in the road.

Decision: Check root filesystem and temp space, not PBS capacity.

Task 2: Check byte usage on all mounted filesystems

cr0x@server:~$ df -hT
Filesystem                        Type   Size  Used Avail Use% Mounted on
/dev/mapper/pve-root              ext4    96G   95G     0 100% /
tmpfs                             tmpfs   32G  1.2G   31G   4% /run
rpool/data                        zfs    1.8T  1.1T  700G  62% /rpool/data
192.0.2.10:/exports/pbsstore      nfs4    20T   15T  5.0T  75% /mnt/pbsstore

Meaning: Root is full while the data and NFS mounts have plenty of room, which is exactly how people get misled. Focus on the mount relevant to the failing path.

Decision: Free space on / (logs, caches) or move temp dir to a larger filesystem.

Task 3: Check inode exhaustion (the silent killer)

cr0x@server:~$ df -i
Filesystem                        Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/pve-root             6553600 6553201     399  100% /
192.0.2.10:/exports/pbsstore    134217728 210345 134007383    1% /mnt/pbsstore

Meaning: Root filesystem has basically no inodes left. That can break package installs, log rotation, and backup temp file creation.

Decision: Identify inode-heavy directories (often /var/log, /var/lib, or container storage) and clean or expand.

Task 4: Find inode-heavy paths quickly

cr0x@server:~$ sudo du --inodes -x -d 2 /var | sort -n | tail -n 10
1200    /var/cache
8421    /var/tmp
15300   /var/log
1204500 /var/lib
1231900 /var

Meaning: /var/lib is consuming inodes like it’s a hobby. On Proxmox that may be container images, caches, or stray directories.

Decision: Drill down in /var/lib to find offenders; consider relocating heavy-write paths to a bigger filesystem.

Task 5: Confirm where vzdump is staging temporary data

cr0x@server:~$ grep -R "tmpdir" -n /etc/vzdump.conf /etc/pve/* 2>/dev/null
/etc/vzdump.conf:2:tmpdir: /var/tmp

Meaning: Your backup temp directory is on root. That’s fine until root is small or busy.

Decision: Move tmpdir to a dedicated filesystem with space (and monitor it).

Task 6: Check whether /tmp is tmpfs and its size

cr0x@server:~$ findmnt -no SOURCE,FSTYPE,SIZE,USED,AVAIL /tmp
tmpfs tmpfs 2.0G 1.9G 100M

Meaning: /tmp is a 2 GB tmpfs and basically full. Any tool using /tmp for staging will fail fast.

Decision: Clear tmpfs usage, increase tmpfs size, or configure Proxmox to use a different temp location.

Task 7: If PBS is involved, check datastore usage from PBS

cr0x@server:~$ proxmox-backup-manager datastore list
┌───────────┬───────────────┬───────────────┬─────────────┐
│ Name      │ Path          │ Comment       │ Maintenance │
╞═══════════╪═══════════════╪═══════════════╪═════════════╡
│ store1    │ /mnt/store1   │ main datastore│ false       │
└───────────┴───────────────┴───────────────┴─────────────┘

Meaning: Datastore is at /mnt/store1. Now you know which filesystem’s constraints matter.

Decision: Check df and inodes on that mount; then PBS prune/GC status.

Task 8: Check datastore filesystem bytes and inodes on PBS host

cr0x@server:~$ df -hT /mnt/store1
Filesystem     Type  Size  Used Avail Use% Mounted on
rpool/store1   zfs   8.0T  7.4T  600G  93% /mnt/store1

Meaning: 93% full on ZFS is “you like living dangerously.” GC and new writes may fail.

Decision: Free space immediately (prune+GC, delete old groups, add capacity). Then enforce headroom policy.

Task 9: ZFS pool and dataset space breakdown (what’s really consuming space)

cr0x@server:~$ zfs list -o name,used,avail,refer,usedsnap,usedds,usedrefreserv,logicalused -r rpool/store1
NAME           USED  AVAIL  REFER  USEDSNAP  USEDDS  USEDREFRESERV  LOGICALUSED
rpool/store1  7.40T   600G  7.10T     900G   6.50T            0B       12.8T

Meaning: Snapshots are holding 900G. That may be normal, or it may be runaway retention. Logical used shows dedup/compression effects.

Decision: Review retention and whether snapshots are being cleared (PBS prune) and chunks freed (GC). Consider adding capacity before tuning.

Task 10: Check ZFS pool capacity and fragmentation signals

cr0x@server:~$ zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH
rpool  8.00T  7.45T   550G         -    61%    93%  1.00x  ONLINE

Meaning: 93% capacity and 61% fragmentation. Writes will be painful; metadata allocation can fail even before 100%.

Decision: Reduce utilization. Don’t “optimize” your way out of this; make space or add vdevs.

Task 11: Check dataset quotas and reservations (the invisible wall)

cr0x@server:~$ zfs get -o name,property,value -H quota,refquota,reservation,refreservation rpool/store1
rpool/store1	quota	7.5T
rpool/store1	refquota	none
rpool/store1	reservation	none
rpool/store1	refreservation	none

Meaning: The dataset quota is 7.5T. You’re at 7.4T used, so you’re basically out of space even if the pool has more elsewhere.

Decision: Raise quota (if intentional), or free space within it. If the quota was a forgotten “safety measure,” remove it and replace with monitoring.

Task 12: PBS prune simulation (see what would be deleted)

cr0x@server:~$ proxmox-backup-manager prune-job run nightly --dry-run
Starting prune on datastore 'store1'
Group 'vm/101' - would keep 14 backups, remove 6 backups
Group 'vm/102' - would keep 14 backups, remove 4 backups
Prune dry-run completed successfully

Meaning: Prune policy would remove some snapshots. Good. But prune alone doesn’t necessarily free space until GC runs.

Decision: Run prune for real during a safe window, then run GC and confirm space drops.

Task 13: Run PBS garbage collection and read the result

cr0x@server:~$ proxmox-backup-manager garbage-collection start store1
Starting garbage collection on datastore 'store1'
Removed 18432 unreferenced chunks (68.4 GiB)
GC finished in 00:12:09

Meaning: You reclaimed ~68 GiB. If you reclaim almost nothing, either retention is too aggressive, or most chunks are still referenced.

Decision: If space is still tight, reduce retention, move some groups to another datastore, or add capacity. Don’t expect miracles from GC.

Task 14: Detect deleted-but-open files eating space

cr0x@server:~$ sudo lsof +L1 | head -n 10
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NLINK    NODE NAME
rsyslogd  912 syslog  7w   REG  252,0 2147483648     0  53123 /var/log/syslog.1 (deleted)

Meaning: A 2 GB log file was deleted but still held open by rsyslogd. Space won’t return until the process closes it.

Decision: Restart the holding process (carefully), or rotate logs properly. Then re-check df.

Task 15: For LVM-thin setups, check thin pool data and metadata

cr0x@server:~$ sudo lvs -a -o +seg_monitor,metadata_percent,data_percent vg0
  LV              VG  Attr       LSize   Pool       Origin Data%  Meta%  Monitor
  thinpool        vg0 twi-aotz--  3.00t                      92.31  99.87 monitored
  thinpool_tmeta  vg0 ewi-aotz--  8.00g
  thinpool_tdata  vg0 lwi-aotz--  3.00t

Meaning: Thin pool metadata is at 99.87%. That’s an “about to explode” condition. Writes can fail even if Data% isn’t 100%.

Decision: Extend the thin pool metadata immediately and stop overcommitting without monitoring.

Task 16: Validate that your backup target mount is actually mounted

cr0x@server:~$ findmnt /mnt/pbsstore
TARGET        SOURCE                      FSTYPE OPTIONS
/mnt/pbsstore 192.0.2.10:/exports/pbsstore nfs4   rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2

Meaning: It’s mounted. If it weren’t, Proxmox might have been writing to the empty directory on root—filling it silently.

Decision: If not mounted, fix automount/systemd dependencies and clean up the accidental local writes.

Three corporate-world mini-stories (anonymized, plausible, and painfully familiar)

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran Proxmox VE with backups to an NFS share on a respectable NAS. The ops team watched the NAS dashboard:
terabytes free, green lights, everyone relaxed. Then Monday morning: multiple VM backups failed with “No space left on device.”
People immediately blamed Proxmox, because that’s what you do when you’re tired.

The wrong assumption was subtle: they assumed the client’s view of free space matched the server’s enforcement.
The NAS had per-share quotas enabled—set months earlier during a storage reorganization. The share had plenty of physical capacity,
but the quota ceiling had been reached. NFS correctly returned ENOSPC.

Worse, the team tried to “fix” it by deleting old backup files from the share. The quota accounting didn’t move much because the NAS also had snapshots,
and the snapshot retention policy held the deleted blocks for weeks. The share looked emptier by file listing, but the quota stayed pinned.

The final fix was boring: increase the share quota, reduce snapshot retention, and align the Proxmox retention policy with what the NAS snapshots were doing.
The postmortem included a new rule: any storage target must have quotas and snapshots documented in the same place as the backup job definition.

Operational takeaway: when the backend says no, believe it. Then verify quotas and snapshots on the server side, not just df on the client.

Mini-story 2: The optimization that backfired

Another org decided to “optimize” PBS space by running the datastore right up to the edge. They had deduplication savings, and it looked great on paper.
They targeted 95% utilization because “we’re paying for every terabyte.” Finance loved it. The storage engineer did not, but was outvoted.

The failure wasn’t a clean “disk full” event. It was a slow degradation: garbage collection ran longer, then started failing intermittently.
Backup windows expanded until they overlapped. Nightly jobs started stepping on each other. Eventually a set of backups failed during a critical patch window,
which is a great time to discover you don’t have a recent restore point.

The root cause wasn’t just “full disk.” It was ZFS pool behavior under high fragmentation and low headroom.
GC needed to create and update metadata, and those allocations became unreliable. The system was technically online, functionally fragile.

The fix was to add capacity and enforce a hard internal threshold: alerts at 75%, action at 80–85%, and a ban on “we’ll just push it to 95% for a week.”
The “week” always becomes a quarter.

Operational takeaway: utilization targets are reliability targets. If you budget for 95% full, you’re budgeting for failure modes you can’t test safely.

Mini-story 3: The boring but correct practice that saved the day

A third team ran Proxmox with PBS on ZFS. Their practice was offensively dull: weekly capacity review, alerting on ZFS pool capacity and inode usage,
and a standing runbook: prune, GC, confirm space, then only expand retention if headroom remains. They also kept a small “emergency” dataset quota buffer
that could be temporarily increased during incidents.

One night, a developer accidentally wrote logs into a mounted backup directory from a container with a misconfigured bind mount.
PBS started accumulating junk files next to its datastore path (not inside it). The backup job began failing, and the first page woke the on-call.

The on-call didn’t guess. They followed the runbook: check mount points, check df -i, check ZFS dataset usage. They found the offending directory,
stopped the container, cleaned up, and ran PBS GC. Backups resumed before the morning shift.

No heroics, no “war room,” no existential dread. Just tooling, thresholds, and a runbook that assumed humans will forget where files actually go.

Operational takeaway: “boring” is a feature. If your backup system needs creativity at 2 a.m., it’s not engineered; it’s improvised.

Common mistakes: symptoms → root cause → fix

1) Symptom: df shows 30% free; backups fail immediately

Root cause: inode exhaustion on the target filesystem (or on root/temp filesystem).

Fix: Run df -i. If inodes are full, delete inode-heavy files, rotate logs, or move PBS datastore to a filesystem designed for many files.

2) Symptom: PBS prune ran, but space didn’t come back

Root cause: prune removed snapshot references, but chunks remain until garbage collection runs; or most chunks are still referenced by other backups.

Fix: Run PBS GC. If GC fails due to low headroom, free space by deleting whole backup groups or adding capacity, then rerun GC.

3) Symptom: ZFS pool at 92% and everything becomes slow, then ENOSPC

Root cause: ZFS allocation and metadata pressure at high utilization; fragmentation and CoW overhead.

Fix: Reduce pool utilization (delete data, prune+GC, add vdevs). Set alert thresholds well below “full.”

4) Symptom: Proxmox job log mentions /var/tmp or /tmp

Root cause: temp dir on small filesystem or tmpfs filled.

Fix: Move tmpdir in /etc/vzdump.conf to a larger mount; clean tmp; increase tmpfs if appropriate.

5) Symptom: Pool has space, dataset says no

Root cause: ZFS dataset quota/refquota hit; or XFS project quota.

Fix: Check quotas and reservations. Raise/remove quota or allocate a dedicated dataset sized for retention goals.

6) Symptom: “Deleted old backups” but disk usage unchanged

Root cause: snapshots (ZFS or NAS) hold blocks; or deleted-but-open files; or PBS chunks still referenced.

Fix: Check snapshot usage, run prune+GC, and check lsof +L1 for open deleted files.

7) Symptom: Backups to NFS fail, but NAS UI shows free space

Root cause: server-side share quota, reserve policy, or snapshot retention; sometimes export misconfiguration.

Fix: Check quotas/snapshots on the NAS, not just client df. Confirm mount is correct with findmnt.

8) Symptom: The LVM-thin pool backing the VM disks shows plenty of “free” space, but backups fail writing to local storage

Root cause: thin pool metadata full (often before data is full).

Fix: Monitor data_percent and metadata_percent; extend metadata; reduce snapshots; stop overcommitting without alerts.

9) Symptom: Root filesystem fills repeatedly after you “cleaned it”

Root cause: backups writing to an unmounted directory (mount failed), or logs growing, or runaway temp files.

Fix: Ensure mounts are active at boot; add systemd mount dependencies; put a guardrail like a mount-check in backup scripts.

Checklists / step-by-step plan

Step-by-step: recover from “No space left on device” today

  1. Stop the bleeding. Pause backup schedules so you don’t turn a space issue into a space-and-load issue.
  2. Locate the failing path. From the job log, identify whether it failed on root/temp, the backup target, or the datastore mount.
  3. Check bytes and inodes. Run df -hT and df -i on the failing mount and on /.
  4. If root is full: clear large logs, old kernels, caches; fix log rotation; consider moving temp directories.
  5. If inodes are full: delete inode-heavy junk; find offenders with du --inodes; consider filesystem redesign for PBS.
  6. If ZFS pool is >85%: free space immediately. Treat it as urgent. Run prune+GC (PBS) and/or delete old datasets.
  7. If PBS: run prune, then garbage collection, then re-check dataset/pool usage.
  8. If remote storage: check server-side quotas and snapshots; confirm the export isn’t capped.
  9. Re-run one backup manually. Validate the fix with a single VM before re-enabling schedules.

Prevent it: operational guardrails that work

  • Capacity SLOs: Define target max utilization (ZFS: 80–85% for write-heavy PBS). Make it policy, not advice.
  • Alert on inodes: inode usage is not optional monitoring for backup stores with lots of files.
  • Alert on thin pool metadata: monitor metadata_percent. Treat 80% as a page, not an email.
  • Make prune+GC scheduled and visible: if GC hasn’t run successfully in days, you’re accumulating risk.
  • Mount verification: ensure backup targets are mounted before jobs run; fail fast if not (a minimal check script follows this list).
  • Separate system and backup write paths: don’t let backups depend on a tiny root filesystem for temp storage.
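
Several of these guardrails fit in one boring script. A minimal sketch; the paths and thresholds are examples from this article, and hooking it in front of your backup jobs (for example as a vzdump hook script) is left to your environment:

#!/bin/bash
# pre-backup-check.sh - refuse to start backups when the ground is already shaky.
set -euo pipefail

MOUNTS=(/mnt/pbsstore)             # must be real mounts, not empty directories on /
PATHS=(/mnt/pbsstore /var/tmp /)   # byte and inode pressure is checked on all of these
MAX_USE=85                         # percent of bytes
MAX_IUSE=90                        # percent of inodes

for m in "${MOUNTS[@]}"; do
  findmnt -n "$m" >/dev/null || { echo "ABORT: $m is not mounted" >&2; exit 1; }
done

for p in "${PATHS[@]}"; do
  use=$(df --output=pcent "$p" | tail -n1 | tr -dc '0-9')
  iuse=$(df --output=ipcent "$p" | tail -n1 | tr -dc '0-9')
  if [ "${use:-0}" -ge "$MAX_USE" ] || [ "${iuse:-0}" -ge "$MAX_IUSE" ]; then
    echo "ABORT: $p is at ${use:-?}% bytes / ${iuse:-?}% inodes" >&2
    exit 1
  fi
done
echo "pre-backup checks passed"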

What to avoid (because it feels clever until it isn’t)

  • Running ZFS pools near full because “dedup will save us.” It might, until it doesn’t, and then it’s not a linear failure.
  • Setting quotas as a substitute for monitoring. Quotas are fine, but they’re not observability.
  • Deleting random files in PBS datastore paths. Use PBS tools; the datastore is not a junk drawer.
  • Assuming “client df” reflects NFS server truth.

FAQ

1) Why does Proxmox say “No space left on device” when df shows free space?

Because the failing write may be hitting a different filesystem (like /var/tmp), or you ran out of inodes, or a quota was hit,
or ZFS cannot allocate reliably due to high utilization.

2) Is this a Proxmox bug?

Usually no. Proxmox surfaces the kernel error. The confusion comes from humans (and GUIs) assuming “free space” is a single number,
when it’s actually several constraints layered together.

3) For PBS, why doesn’t deleting old backups free space immediately?

PBS is chunk-based. Removing backups (prune) removes references; space is freed when garbage collection removes unreferenced chunks.
If most chunks are still referenced by other backups, space won’t drop much.

4) How much free space should I keep on a ZFS pool used for PBS?

Keep meaningful headroom. For reliability: try to stay below ~80–85% used. If you run higher, expect GC and writes to become unreliable under pressure.
The exact number depends on workload and vdev layout, but “95% full is fine” is fantasy.

5) Can inode exhaustion happen on ZFS?

ZFS doesn’t have a fixed inode table like ext4; it can still hit metadata limits in other ways, but “df -i at 100%” is mainly an ext-family issue.
On ZFS, focus more on pool/dataset capacity and metadata behavior.

6) What’s the difference between PBS prune and garbage collection?

Prune removes backups from the catalog according to retention. Garbage collection removes chunks on disk that are no longer referenced by any remaining backups.
You often need both to reclaim space.

7) Why do backups fail when the NAS says there’s space?

Quotas and snapshots. The NAS may enforce a share quota or reserve capacity. Snapshots may keep deleted data “in use.”
NFS/SMB clients can also cache attributes; the server is the source of truth.

8) How do I know if Proxmox is writing to an unmounted directory?

Use findmnt to confirm the mount is active. If it isn’t, the path is just a directory on root, and backups can fill root silently.
Also check timestamps and unexpected files under the mount point.

9) Can thin provisioning cause ENOSPC during backups even if the backup target is separate?

Yes, if the write is to local storage (temp files, logs) on the thin-provisioned pool, or if the thin pool is also backing the datastore.
Also, thin pool metadata exhaustion can break writes in surprising places.

10) Should I just add a bigger disk and move on?

Often yes—capacity is cheap compared to outage time. But don’t stop there: add monitoring for the specific constraint that bit you (inodes, ZFS headroom,
quotas, thin metadata) or you’ll be back here with a bigger disk and the same surprise.

Conclusion: next steps you should actually do

“No space left on device” is a diagnosis category, not a measurement. The fix is almost never “delete a random file and hope.”
Find where the write failed, then check the constraint that applies there: bytes, inodes, quotas, ZFS headroom, thin metadata, or server-side policy.

Do these next, in order:

  1. From logs, identify the exact failing path and host (PVE node vs PBS server).
  2. Run df -hT and df -i on that filesystem and on /.
  3. If ZFS is involved, treat >85% used as an incident and reduce utilization.
  4. If PBS is involved, run prune then GC; confirm reclaimed space with zfs list -o space or df.
  5. Add monitoring for the constraint that failed, with thresholds that force action before the cliff.

The best backup system is the one that fails loudly in staging and quietly succeeds in production. Engineer it so “space” is a controlled variable, not a surprise.
