It always happens at the worst time: deploy window, on-call phone buzzing, dashboards screaming, and someone says, “Disk is at 100%.” That statement is simultaneously useful and wildly incomplete. Which disk? Which filesystem? Real space, reserved space, inodes, thin-provisioned pools, or a liar-mount that isn’t even where your app writes?
Getting this wrong doesn’t just waste time. It creates new incidents: deleting the wrong files, corrupting databases, or “fixing” it long enough to get paged again in an hour. Let’s do the grown-up version: identify what’s actually full, why it became full, and how to fix it without gambling with your data.
When “100% disk” doesn’t mean what you think
“Disk usage 100%” can mean at least eight different things. Some of them are straightforward (“the filesystem has no free blocks”). Some are mean (“free blocks exist but you can’t use them”). Others are sneaky (“the space is free, but the kernel still holds it because a process has the file open”). If you don’t distinguish these, you’ll do the classic move: delete a file, see no space returned, panic, and delete more. That’s not engineering. That’s flailing.
Start with the unit of truth: filesystem vs block device vs pool
At minimum, you need to know what layer is full:
- Filesystem (ext4/xfs/btrfs/zfs dataset): files and directories, inodes, reserved blocks, snapshots, quotas.
- Block device (LVM LV, NVMe, EBS volume): the underlying capacity; can be fine while the filesystem is not (or vice versa).
- Storage pool (ZFS pool, LVM thin pool, Ceph): thin provisioning, metadata, snapshots, and “logical vs physical” accounting.
- Container/overlay layer: Docker overlay2, containerd snapshotters, Kubernetes ephemeral storage; writes aren’t where you think.
The three most common misreads
- Space vs inodes: “df says full” is different from “df -i says full.” The fixes are different.
- Deleted-but-open files: you deleted the file, but the space won’t return until the holding process closes it.
- Snapshots and copy-on-write: you “deleted” data but snapshots still reference the blocks, so the pool stays full.
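The deleted-but-open misread is worth seeing once instead of rediscovering it during an incident. A minimal sketch (everything happens in a throwaway temp directory, so it is safe to run anywhere):

```shell
# Demo of Unix unlink semantics: removing a filename does not free the data
# while any process still holds the file open.
set -euo pipefail
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/big.log" bs=1M count=10 status=none

exec 3<"$tmp/big.log"            # hold the file open on fd 3
rm "$tmp/big.log"                # the name is gone...

entries=$(ls -A "$tmp" | wc -l)  # 0: the directory is empty now
bytes=$(wc -c </proc/$$/fd/3)    # ...but all 10 MiB are still live via the fd

exec 3<&-                        # only after the last close can blocks be freed
rm -rf "$tmp"
echo "entries=$entries bytes=$bytes"
```

This is exactly why `df` stays pinned after a big `rm`: the directory entry is gone, the allocation is not.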
Joke #1: Disk is like a hotel. “We have empty rooms” doesn’t help if every room key is still checked out.
Fast diagnosis playbook (first/second/third)
This is the on-call version. It prioritizes: stop the bleeding, identify the real constraint, avoid making it worse.
First: confirm what’s full, at which layer
- Check filesystem usage on the suspected mount(s): df -hT.
- Check inode usage: df -i.
- Check underlying block device: lsblk -f, and pvs/vgs/lvs if LVM is involved.
- If ZFS: check dataset and pool: zfs list, zpool list.
- If containers: check overlay/container storage: docker system df or crictl df.
Second: find the consumer fast (the “what got big?” step)
- Top-level directory sizes: du -xhd1 /mount (stays on one filesystem).
- Recent growth: sort by modification time with find ... -printf '%T@ %s %p\n'.
- Logs: journalctl --disk-usage, check /var/log and application log directories.
- Deleted-but-open space leak: lsof +L1.
Third: choose a safe fix based on the failure mode
- Files actually taking space: delete/rotate/move data safely, then add capacity or reduce retention.
- Inodes exhausted: remove millions of tiny files (cache/temp), redesign, or rebuild filesystem with more inodes (rarely pleasant).
- Snapshots: prune snapshots or change snapshot policy. Do not just delete “random files” and expect magic.
- Thin pool full: extend the pool or delete snapshots; treat it as a high-risk event.
- Deleted-but-open: restart/rotate offending service cleanly (or kill it, if you enjoy postmortems).
Practical tasks: commands, outputs, decisions
These are not “try this” vibes. Each task includes: a runnable command, what the output tells you, and what decision you make next. Assume Linux unless stated otherwise. Run as root when you must, but don’t turn every disk-full problem into a privilege-escalation hobby.
Task 1: Identify the full filesystem (and its type)
cr0x@server:~$ df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/vg0-root ext4 80G 79G 614M 99% /
/dev/nvme0n1p1 vfat 511M 7.2M 504M 2% /boot/efi
tmpfs tmpfs 32G 96M 32G 1% /run
/dev/mapper/vg0-var xfs 200G 200G 20M 100% /var
What it means: /var is full, and it’s XFS. Your database logs, container layers, and journals often live here.
Decision: Focus on /var. Don’t delete from / hoping it helps. Also note filesystem type: XFS behaves differently from ext4 (no shrinking, different reserved space behavior).
Task 2: Check inode exhaustion (the silent killer)
cr0x@server:~$ df -i /var
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg0-var 13107200 13107190 10 100% /var
What it means: You have basically no inodes left. You can still have free bytes and be dead. Lots of tiny files did this.
Decision: Stop hunting for “big files.” Find directories with millions of entries (caches, temp, mail spools, container layers) and delete those safely.
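When the cleanup target is millions of tiny files, how you delete matters: a shell glob expands every name into one argument list and can die with "Argument list too long", while find streams the work. A minimal sketch with throwaway stand-in names:

```shell
# Inode-friendly purge: find -delete instead of rm with a huge glob.
set -euo pipefail
tmp=$(mktemp -d)
for i in $(seq 1 2000); do : > "$tmp/cache-entry-$i"; done  # simulate a tiny-file storm

before=$(find "$tmp" -maxdepth 1 -type f | wc -l)
find "$tmp" -maxdepth 1 -type f -name 'cache-entry-*' -delete  # no argv limit involved
after=$(find "$tmp" -maxdepth 1 -type f | wc -l)

echo "before=$before after=$after"
rmdir "$tmp"
```

On a real cache directory, add -mmin/-mtime filters so you only purge entries the application can regenerate.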
Task 3: Find the biggest directories without crossing mounts
cr0x@server:~$ sudo du -xhd1 /var | sort -h
12M /var/tmp
1.6G /var/cache
48G /var/log
151G /var/lib
200G /var
cr0x@server:~$ sudo du -xhd1 /var/lib | sort -h | tail -n 2
150G /var/lib/docker
151G /var/lib
What it means: /var/lib dominates, and one level down it’s /var/lib/docker that’s enormous. That’s not rare. It’s also often underestimated.
Decision: Investigate container storage before deleting logs. Logs might be the symptom, Docker layers the cause, or vice versa.
Task 4: Get Docker’s version of the truth
cr0x@server:~$ sudo docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 47 12 62.3GB 44.8GB (71%)
Containers 36 8 9.1GB 7.9GB (86%)
Local Volumes 18 11 81.4GB 12.6GB (15%)
Build Cache 119 0 24.7GB 24.7GB (100%)
What it means: Build cache is fully reclaimable. Images and stopped containers are also reclaimable. Volumes are where “data you care about” hides.
Decision: Clear build cache first (safe), then unused images/containers, and only touch volumes if you’re certain they’re disposable.
Task 5: Safely reclaim Docker cache and unused artifacts
cr0x@server:~$ sudo docker builder prune -af
Deleted build cache objects:
m1r4nd4d1g3st:sha256:0c5b... 215.3MB
m1r4nd4d1g3st:sha256:8e21... 112.9MB
Total reclaimed space: 24.7GB
What it means: You got real space back without touching persistent volumes.
Decision: Re-check df. If still tight, proceed to docker image prune and docker container prune. Schedule an image retention policy later so you don’t repeat this weekly.
Task 6: Find deleted-but-open files (space won’t return)
cr0x@server:~$ sudo lsof +L1 | head
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
java 2214 app 12w REG 253,1 4294967296 0 90122 /var/log/app/server.log (deleted)
rsyslogd 912 syslog 7w REG 253,1 2147483648 0 72311 /var/log/syslog (deleted)
What it means: Processes are holding deleted log files. The filesystem can’t free blocks until those FDs close.
Decision: Restart the offending service(s) or trigger log rotation correctly. Killing processes is the blunt instrument; use it only when you’re boxed in.
Task 7: Measure systemd journal usage and vacuum it
cr0x@server:~$ sudo journalctl --disk-usage
Archived and active journals take up 18.0G in the file system.
cr0x@server:~$ sudo journalctl --vacuum-size=2G
Deleted archived journal /var/log/journal/4c.../system@0005b1d2....journal (8.0G).
Deleted archived journal /var/log/journal/4c.../system@0005b1d9....journal (6.0G).
Vacuuming done, freed 16.0G of archived journals.
What it means: Journald was hoarding history. It happens on busy nodes, and it’s not a moral failing.
Decision: Set a persistent policy (SystemMaxUse= in journald config) after the fire is out. Don’t keep vacuuming manually like it’s a lifestyle.
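A persistent cap is a two-line drop-in. Sketch below: $confdir is a stand-in so this runs anywhere; on a real host the file belongs under /etc/systemd/journald.conf.d/ followed by a systemd-journald restart.

```shell
# Persist the journal cap instead of vacuuming by hand.
# $confdir stands in for /etc/systemd/journald.conf.d on a real host.
set -euo pipefail
confdir=$(mktemp -d)
cat > "$confdir/size.conf" <<'EOF'
[Journal]
SystemMaxUse=2G
SystemKeepFree=1G
EOF
settings=$(grep -c '=' "$confdir/size.conf")
echo "wrote $settings settings"
```

SystemMaxUse caps total journal size; SystemKeepFree tells journald to leave headroom for everything else on the filesystem.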
Task 8: Find top files quickly (size-based)
cr0x@server:~$ sudo find /var -xdev -type f -size +1G -printf '%s %p\n' | sort -nr | head
8589934592 /var/lib/docker/containers/1b.../1b...-json.log
4294967296 /var/log/app/server.log
1073741824 /var/lib/postgresql/15/main/pg_wal/000000010000000A000000FE
What it means: Container JSON logs are huge, and WAL segments under PostgreSQL’s pg_wal are piling up. Multiple consumers.
Decision: Add log limits for Docker logging driver; for PostgreSQL, check replication/archiving and retention. Don’t just delete WAL files unless you enjoy rebuilding from backups.
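For the Docker side, the json-file driver accepts per-container size caps in daemon.json. Sketch below: $stage is a stand-in directory so it runs anywhere; the real file is /etc/docker/daemon.json, applied on daemon restart (already-created containers keep their old log settings).

```shell
# Cap json-file container logs at 3 rotated files of 100 MB each.
# $stage stands in for /etc on a real Docker host.
set -euo pipefail
stage=$(mktemp -d)
cat > "$stage/daemon.json" <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "100m", "max-file": "3" }
}
EOF
grep -q '"max-size"' "$stage/daemon.json" && echo "log caps staged"
```

Note that log-opts values must be JSON strings ("3", not 3), a classic silent-failure trap.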
Task 9: Check PostgreSQL WAL and archiving pressure
cr0x@server:~$ sudo -u postgres psql -c "select now(), pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),'0/0')) as total_wal_written;"
now | total_wal_written
------------------------------+-------------------
2026-02-05 12:44:19.121+00 | 87 GB
(1 row)
What it means: 87 GB of WAL has been written over the cluster’s lifetime (the diff against LSN 0/0 is cumulative, not what’s on disk now). That’s not automatically bad, but it’s a clue; pair it with the size of pg_wal itself.
Decision: Check if archiving/replication is stuck and causing WAL retention. If you don’t know, stop and get the DBA. Disk-full plus “winging it in WAL” is how careers get exciting.
Task 10: Verify LVM thin pool health (thin provisioning can bite)
cr0x@server:~$ sudo lvs -a -o +devices,seg_monitor,metadata_percent,data_percent
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
thinpool vg0 twi-aotz-- 500.00g 99.80 92.10 /dev/sdb(0)
vm-01 vg0 Vwi-aotz-- 80.00g thinpool 74.20 thinpool(0)
vm-02 vg0 Vwi-aotz-- 120.00g thinpool 68.15 thinpool(0)
What it means: Thin pool is basically full. When a thin pool hits 100%, writes fail in ugly, surprising ways.
Decision: Immediately extend the thin pool or free space within it (delete unused thin snapshots/volumes). Treat this as urgent; it’s not “just housekeeping.”
Task 11: Confirm ZFS pool/dataset constraints (quotas, snapshots, reservation)
cr0x@server:~$ sudo zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 3.62T 3.55T 72.0G - - 39% 98% 1.00x ONLINE -
cr0x@server:~$ sudo zfs list -o name,used,avail,refer,mountpoint
NAME USED AVAIL REFER MOUNTPOINT
tank 3.55T 72.0G 128K /tank
tank/app 2.40T 20.0G 2.10T /tank/app
tank/app@daily-1 300G - 2.10T -
What it means: The pool is 98% full. Dataset tank/app has only 20G available. A snapshot exists that may be pinning space.
Decision: Evaluate snapshot retention and delete snapshots safely if they’re the reason space won’t return. Also: stop running ZFS pools at 98% unless you like performance cliffs.
Task 12: Check ext4 reserved blocks (root has space, users don’t)
cr0x@server:~$ sudo tune2fs -l /dev/mapper/vg0-root | egrep 'Block count|Reserved block count|Reserved block percentage'
Block count: 20971520
Reserved block count: 1048576
Reserved block percentage: 5%
What it means: 5% of the filesystem is reserved (historically to protect root and reduce fragmentation). On large volumes, that can be many GB.
Decision: If this is a data volume (not a system root), consider reducing reserved blocks, but do it deliberately. Don’t “free space” by removing safety rails on critical volumes without a plan.
Task 13: Identify runaway file creation (inode burn) by directory
cr0x@server:~$ sudo find /var -xdev -type d -printf '%p\n' | while read -r d; do echo "$(find "$d" -maxdepth 1 -type f 2>/dev/null | wc -l) $d"; done | sort -nr | head
1200345 /var/cache/myapp
410221 /var/lib/docker/overlay2
99521 /var/spool/postfix/maildrop
What it means: A cache directory is generating an absurd number of files. That’s your inode culprit.
Decision: Purge cache and fix the application’s cache policy (TTL, max entries, directory hashing). Also consider moving caches to tmpfs if safe, or to a dedicated filesystem with inode planning.
Task 14: Verify mountpoints (avoid cleaning the wrong place)
cr0x@server:~$ mount | egrep ' /var | /var/lib/docker | / '
/dev/mapper/vg0-root on / type ext4 (rw,relatime)
/dev/mapper/vg0-var on /var type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
What it means: Docker is under /var on this host. If you assumed it lived on a different LV, you’d clean the wrong filesystem and learn nothing.
Decision: Confirm mounts before acting. This is how you avoid “I deleted 20G and it didn’t help” because you deleted from the wrong mount.
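This check is cheap enough to script into service startup. A minimal guard sketch, using /proc only because it is mounted on every Linux host; substitute your actual data mount:

```shell
# Startup guard: refuse to run if a required filesystem is not mounted,
# so the service can't silently write to the wrong disk.
required=/proc   # stand-in; on a real host this is your data mount, e.g. /var
if mountpoint -q "$required"; then
  status="ok: $required is a mountpoint"
else
  status="refusing to start: $required is not mounted"
fi
echo "$status"
```

In a real unit file, the same idea is expressible declaratively with RequiresMountsFor= so systemd enforces the ordering for you.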
Task 15: Track what’s growing right now (when disk fills live)
cr0x@server:~$ sudo bash -lc 'while true; do date; df -h /var; du -xhd1 /var/log | sort -h | tail -n 5; sleep 10; done'
Thu Feb 5 12:46:00 UTC 2026
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg0-var 200G 199G 380M 100% /var
1.3G /var/log/apt
4.8G /var/log/nginx
16G /var/log/journal
48G /var/log
Thu Feb 5 12:46:10 UTC 2026
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg0-var 200G 200G 120M 100% /var
What it means: Usage is still increasing. This isn’t “cleanup and done.” Something is actively writing.
Decision: Identify the writer (logs, DB, uploads). Consider temporarily stopping the service or rate-limiting ingestion. Cleaning without stopping the source is mop-vs-flood territory.
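Two samples are enough for a back-of-envelope time-to-full estimate, which tells you whether you have minutes or seconds. Using the numbers from the transcript above (avail fell from 380M to 120M in 10 seconds):

```shell
# Crude time-to-full from two df samples; integer arithmetic is plenty here.
avail_then=380   # MiB at the first sample
avail_now=120    # MiB ten seconds later
interval=10      # seconds between samples

rate=$(( (avail_then - avail_now) / interval ))  # MiB/s of net growth
secs_left=$(( avail_now / rate ))                # seconds until hard full
echo "growing at ${rate}MiB/s, roughly ${secs_left}s until the write that fails"
```

Four seconds of runway means you stop the writer first and clean up second; four hours means you can diagnose properly.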
The real causes (with fixes that stick)
Cause 1: Logs grew because your system is healthy enough to keep running
Logs are the classic disk-eater because they’re designed to never block the main workload. That’s good… until it isn’t. Common triggers:
- Debug logging left enabled after an incident.
- Access logs on a public endpoint during a scan storm.
- Applications logging full request bodies (yes, people do this in production).
- Systemd journal unbounded on busy nodes.
Real fixes:
- Set retention and size caps (logrotate, journald’s SystemMaxUse).
- Ship logs off-host, but don’t use “ship it” as an excuse to keep infinite on-disk logs.
- Make log levels runtime-configurable and default to sane verbosity.
Cause 2: Deleted files didn’t free space (open file descriptors)
Unix semantics are elegant: removing a filename doesn’t remove the underlying file until nobody is using it. On busy hosts, “nobody” can take a long time.
Real fixes:
- Use lsof +L1 to find deleted-but-open files.
- Restart the process cleanly (or force it if required).
- Fix log rotation: ensure the service reopens logs (e.g., via SIGHUP) after rotation.
Cause 3: Inodes ran out (space exists, but you can’t allocate files)
Inodes are the filesystem’s “file slots.” If you create millions of tiny files—cache shards, temp artifacts, image layers, mail queues—you can hit 100% inode usage while df -h still looks fine. The symptom is usually “No space left on device” at a laughably low byte usage.
Real fixes:
- Delete the high-count directories (cache, temp) and fix the producer.
- Use fewer files: pack objects, use SQLite/LMDB, or redesign cache layouts.
- For extreme cases: rebuild the filesystem with more inodes (plan the count with mke2fs -N) or switch to a filesystem better suited for the workload.
Cause 4: Snapshots pinned blocks (especially CoW filesystems)
On ZFS and btrfs, snapshots are not “free.” They are cheap, but not free. Deleting a file doesn’t necessarily free its blocks if a snapshot references them. This is correct behavior. It’s also confusing at 3 a.m.
Real fixes:
- Review snapshot policies: frequency, retention, and which datasets are included.
- Prune snapshots to release space.
- Keep free space headroom; CoW filesystems get cranky when nearly full (metadata pressure, fragmentation).
Cause 5: Thin provisioning ran out of real space
LVM thin pools, VM datastores, and many SANs let you allocate more “virtual” space than you physically have. That’s fine until it’s not. When the pool hits the wall, writes fail in ways applications interpret as corruption, timeouts, or “disk full” in the wrong place.
Real fixes:
- Monitor thin pool data and metadata usage and alert early.
- Extend the pool before it hits 100%. At 99% you’re already late.
- Limit snapshot sprawl. Snapshots are a debt instrument with compounding interest.
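Monitoring a thin pool comes down to parsing two percentages. A sketch of the check, where $sample stands in for real output of `lvs --noheadings -o lv_name,data_percent,metadata_percent vg0/thinpool` (the vg0/thinpool name is from the Task 10 example):

```shell
# Early-warning check for thin pools: page before the pool hits the wall.
set -euo pipefail
sample='thinpool 99.80 92.10'   # stand-in for real lvs output
read -r name data meta <<<"$sample"

alert=no
# Data over 90% or metadata over 80% is already urgent, not "watch it".
if awk -v d="$data" -v m="$meta" 'BEGIN { exit !(d > 90 || m > 80) }'; then
  alert=yes
fi
echo "pool=$name data=${data}% meta=${meta}% alert=$alert"
```

Metadata deserves its own threshold: a thin pool can fail on metadata exhaustion while data usage still looks comfortable.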
Cause 6: Overlay/container storage grew quietly
Containers make disk usage feel abstract. Images, layers, build caches, and JSON logs pile up. Kubernetes makes it worse by spreading responsibility across nodes, namespaces, and “not my problem.”
Real fixes:
- Set log size limits for container logging drivers.
- Garbage collect images and build caches on a schedule.
- Use dedicated partitions for container storage so it can’t suffocate the OS.
- Set ephemeral storage requests/limits in Kubernetes and enforce them.
Cause 7: Reserved space and quotas (it’s full for you, not for root)
On ext4, reserved blocks exist for good reasons. On ZFS, quotas and reservations can make a dataset appear full while the pool has space, or vice versa. On XFS, project quotas can silently cap directories. This isn’t a bug. It’s policy.
Real fixes:
- Check quota settings and reserved space before you delete anything.
- Adjust policy with intention: who should be able to fill the disk, and who should be protected from that?
Cause 8: “df says full” but “du can’t find it” (the accounting mismatch)
This is the headache case: df reports used blocks, but du doesn’t show corresponding files. Common reasons:
- Deleted-but-open files (again).
- Filesystem metadata and journal growth.
- Snapshots (ZFS/btrfs).
- Block allocation in sparse files or preallocated DB files.
Real fix: Use the right tool for the right layer. If you treat all disk usage as “files in directories,” you’re blind to half the system.
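The sparse-file case is the easiest to demonstrate: a file can have a huge apparent size while occupying almost no blocks, and databases preallocate the same way. A throwaway demo:

```shell
# Apparent size (what ls -l shows) vs allocated blocks (what df accounts for).
set -euo pipefail
tmp=$(mktemp -d)
truncate -s 1G "$tmp/sparse.img"   # 1 GiB apparent, ~0 blocks allocated

apparent=$(du --apparent-size -m "$tmp/sparse.img" | cut -f1)
allocated=$(du -m "$tmp/sparse.img" | cut -f1)

echo "apparent=${apparent}M allocated=${allocated}M"
rm -rf "$tmp"
```

The same mechanism runs in reverse, too: a preallocated file that slowly fills with real data grows in df without any directory listing changing.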
One quote, because it belongs here: “Everything fails, all the time.” — Werner Vogels
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
They had a fleet of application servers with a “big data disk” attached. Everyone knew the rule: logs go on the big disk. The runbook even said so. The team did the sensible thing during an incident: they tailed logs, saw a storm of errors, and rotated a massive log file. Disk stayed at 100%.
Someone escalated to storage. Storage took one look at df -hT and sighed. The big disk was mounted at /data, but the application’s service file had been updated months earlier. The log path quietly changed to /var/log/app—on the root filesystem. The “big disk” was idle and innocent.
The wrong assumption was social, not technical: “the config can’t drift because the runbook exists.” It drifted anyway. The fix wasn’t a heroic cleanup script. It was: correct the service config, restart cleanly, and add a startup check that refuses to run if the expected mount isn’t present.
They also changed the runbook: the first step became “verify mounts and paths,” not “rotate logs.” The next time the symptom appeared, the diagnosis took minutes, not an afternoon of confident guessing.
Mini-story 2: The optimization that backfired
A performance-minded engineer decided to speed up builds by caching everything on the CI runners. Docker build cache, language package caches, test artifacts—if it could be cached, it was cached. Builds got faster. Everyone cheered. Then the runners started failing in waves.
It wasn’t CPU. It wasn’t memory. It was disk. But not “big files.” Inodes. Millions of tiny cache objects spread across directories designed by people who never had to run a shared runner farm.
The first response was predictable: “Add bigger disks.” That helped for a week. Then the inodes hit 100% again, because inode count doesn’t scale automatically with “just add a bigger volume” in the way people assume. They were upgrading capacity and still starving for file slots.
The backfiring optimization taught a boring lesson: caching needs budgets. The eventual fix was dull and effective—bounded cache sizes, periodic pruning, and moving certain caches into tarballs instead of file storms. Builds stayed fast enough, and the runners stopped dying from a thousand tiny cuts.
Mini-story 3: The boring but correct practice that saved the day
A different org ran a database cluster where disks were sized conservatively and alerts were set at 70%, 80%, and 85%. People complained the alerts were noisy. Leadership wanted them “less sensitive.” The SRE team didn’t budge.
One weekend, a downstream consumer stalled and replication lag grew. WAL started accumulating. The 70% alert fired. The on-call looked, saw the trend, and opened an incident before customers noticed anything.
The team had time to do the safe things: diagnose why the consumer was stuck, temporarily increase capacity, and adjust the retention knobs without improvising. No one deleted random database files. No one “freed space” by deleting the only copy of something important.
By the time disk would have hit 100%, the system was already stable. The alert policy looked boring on a dashboard. In practice, it was the difference between a controlled response and a late-night data-recovery seminar.
Common mistakes: symptom → root cause → fix
1) “I deleted 20 GB, but df didn’t change”
Symptom: du shows less, but df still reports full.
Root cause: Deleted-but-open files (or snapshots on CoW systems).
Fix: lsof +L1, restart the holding process; for snapshots, prune snapshots and validate space release at the pool/dataset level.
2) “No space left on device” but df shows plenty of free space
Symptom: Writes fail; df -h shows free GBs.
Root cause: Inodes exhausted, or quota hit.
Fix: df -i; identify high-file-count directories; clean caches; check and adjust quotas.
3) “Only / is full, but the data disk is empty”
Symptom: Root filesystem hits 100%; attached storage has space.
Root cause: Wrong path, missing mount, or application writing to default location after mount failure.
Fix: Verify mounts (mount, findmnt), enforce mount requirements at service start, use absolute paths, and test reboot behavior.
4) “Disk usage jumps during backups/snapshots and never comes down”
Symptom: Usage increases after snapshot-based backups.
Root cause: Snapshot retention too long, or backups pin snapshots; CoW block retention.
Fix: Audit snapshot schedule/retention, ensure backups expire snapshots, keep headroom, and monitor referenced vs used (ZFS properties help).
5) “After we enabled debug, disks started filling every day”
Symptom: Steady log growth; application is otherwise stable.
Root cause: Unbounded logging volume; log rotation missing or misconfigured.
Fix: Cap logs, implement rotation, and ship logs centrally with sane local retention.
6) “We extended the volume but it’s still 100%”
Symptom: Cloud volume resized; df unchanged.
Root cause: Block device grew but filesystem not expanded, or wrong layer resized (e.g., PV but not LV, LV but not FS).
Fix: Confirm with lsblk; expand PV/LV/FS appropriately (e.g., pvresize, lvextend, xfs_growfs / resize2fs).
7) “A thin-provisioned pool hit 100% and everything got weird”
Symptom: Random write failures, filesystem errors, VM pauses.
Root cause: Thin pool out of data or metadata.
Fix: Extend the thin pool immediately; reduce snapshots; implement alerting well before 90%.
Joke #2: If you run a filesystem at 100% and it’s “fine,” it’s only because it hasn’t checked its email yet.
Checklists / step-by-step plan
Checklist A: Live incident (disk is at 100% right now)
- Stop the bleeding: If a single service is writing uncontrollably, rate-limit or stop it. Keep the host alive.
- Confirm the full target: df -hT and df -i on the mount.
- Identify the layer: filesystem vs thin pool vs ZFS pool vs container overlay.
- Find top consumers: du -xhd1 and targeted find for large files.
- Check deleted-but-open: lsof +L1.
- Apply the safest reclaim first:
  - Vacuum journals
  - Prune build cache
  - Rotate/compress logs properly
  - Remove known-safe caches
- Re-check: df and service health after each change.
- Only then consider expanding capacity or moving data.
Checklist B: After the incident (prevent the sequel)
- Write down what filled the disk in one sentence (example: “Docker build cache and JSON logs, no limits”).
- Add monitoring and alerts for:
- Filesystem space and inodes
- Thin pool data and metadata
- ZFS pool capacity and fragmentation
- Container storage usage
- Set retention budgets: logs, artifacts, backups, snapshots, caches.
- Make mounts enforceable: service refuses to start if the required filesystem isn’t mounted.
- Run a game day: simulate disk pressure and validate the runbook.
Checklist C: Storage design choices that reduce disk-full incidents
- Put /var, container storage, and databases on separate filesystems where possible. Blast radius matters.
- Prefer predictable growth paths: capacity planning beats emergency cleanup.
- Keep headroom: aim for 20–30% free on busy filesystems; more for CoW and metadata-heavy workloads.
- Assume snapshots are production data. Treat them with the same seriousness as backups.
- Don’t thin-provision without first-class monitoring and well-tested expansion procedures.
Interesting facts and historical context
- Reserved blocks on ext filesystems were introduced to keep the system usable for root and to reduce fragmentation when disks get full.
- Inodes are pre-allocated on many traditional filesystems (like ext4). You can run out of file entries independently from bytes.
- The “deleted but open” behavior is a core Unix design choice: filenames are directory entries; the file lives until the last reference is gone.
- XFS was built for big iron (high throughput, large filesystems). It’s fast, but shrinking it is not a thing you do on a Tuesday.
- Copy-on-write snapshots (ZFS/btrfs) trade easy snapshotting for more complicated space accounting under pressure.
- Log rotation exists because disks are finite, a fact that was as true on 40 MB drives as it is on multi-terabyte volumes.
- Thin provisioning became mainstream because it improved utilization, but it also turned “capacity” into a monitoring and governance problem.
- Filesystem-full behavior is nonlinear: performance and failure rates can degrade sharply near full, especially with fragmented free space and metadata churn.
FAQ
Why does my system slow down when the disk is nearly full?
Free space becomes fragmented, metadata operations get more expensive, and CoW filesystems may amplify writes. Near-full is where “average case” assumptions go to die.
Why does du not add up to what df shows?
du walks directory trees and sums file sizes it can see. df reports allocated blocks. Deleted-but-open files, snapshots, and filesystem metadata can make them disagree.
Is it safe to delete files in /var/log?
Sometimes. Prefer rotating and compressing via the system’s logging tools. If you delete a log that a running process has open, you may not reclaim space until the process reopens it.
Can I just delete PostgreSQL WAL files to free space?
No, not as a normal operational tactic. WAL is part of durability and replication. The safe approach is to fix archiving/replication, adjust retention, or add capacity. If you’re already in a disaster state, get database expertise involved.
What’s the fastest safe way to reclaim space on a Docker host?
Start with build cache: docker builder prune -af. Then prune unused images and stopped containers. Be cautious with volumes; they often contain real data.
How do I tell if it’s an inode problem?
Run df -i on the affected mount. If IUse% is near 100%, you need to delete lots of small files or redesign the workload, not hunt for large files.
Why does ZFS show space used even after deleting large directories?
Snapshots may still reference those blocks, so the pool can’t free them. Check snapshot lists and retention. Also keep headroom; ZFS performs poorly when very full.
What alert thresholds should we use for disk usage?
Depends on growth rate and recovery time. A practical baseline: warn at 70–80%, page at 85–90% for critical systems. CoW and thin-provisioned pools deserve earlier alerts.
Should we separate /var from /?
Yes for most production hosts. /var is where growth happens (logs, caches, container data). Separating it reduces the chance that a noisy workload bricks the OS.
What’s the “safest” emergency deletion target?
Reclaimable caches you can regenerate: build cache, package caches, temp files, rotated logs, old journals. Avoid touching live database files or unknown directories under pressure.
Next steps that prevent repeat incidents
Fixing 100% disk usage isn’t about heroics. It’s about identifying the layer that’s full, using the right tools to find the real consumer, and applying changes that won’t boomerang.
- Adopt the fast diagnosis playbook and make it muscle memory: df -hT, df -i, du -x, lsof +L1, then layer-specific checks (Docker/ZFS/LVM thin).
- Put budgets everywhere: log retention, journal caps, snapshot retention, cache sizes, image lifecycle policies.
- Design for blast radius: separate filesystems for OS, logs, containers, and databases; enforce mounts at startup.
- Monitor what matters early: space, inodes, thin pool data/metadata, snapshot growth, and top directories by usage.
- Keep headroom on purpose: “We run at 95% all the time” is not thrift; it’s deferred incident response.
Disk-full incidents are rarely mysterious. They’re usually just unpaid operational debt, showing up with interest and insisting on immediate payment.