The alert says “filesystem full,” the app starts throwing 500s, and your database—previously smug—now behaves like it forgot how to write.
You freed “some space” and restarted everything, and now the DB is either refusing to start, replaying WAL/redo forever, or worse: it starts and lies.
This is a recovery guide for Debian 13 when ENOSPC (no space left on device) hit your database in production. Not theory. Not “just delete logs.”
A real runbook: how to find what’s actually full, how to free space without making consistency worse, and how to get to a safe, verified state.
What “filesystem full” really does to databases
“Disk full” is not a single failure mode. It’s a family of failures that all look the same on a dashboard and behave wildly differently at 03:00.
On Linux you can be “full” because blocks are exhausted, because inodes are exhausted, because your process can’t allocate due to quotas,
or because your filesystem has space but your database needs contiguous-ish room for its own safety mechanisms (think WAL, redo logs, temp files).
The most dangerous moment is the first few minutes after ENOSPC. Databases respond by:
- Failing writes mid-transaction, leaving partial state that must be rolled back or replayed.
- Failing fsync. That’s when you get into “I thought it was durable” territory.
- Rotating logs badly. Some engines keep writing to the same file descriptor and don’t “see” you freed space until restart.
- Stalling on recovery because the recovery itself needs temporary disk.
- Breaking replication because WAL/binlogs can’t be archived or streamed.
Your job is to restore the invariant the DB expects: enough free space to complete crash recovery, apply/rollback, checkpoint, and resume normal write patterns.
“Enough” is not “a few hundred MB.” In production, I aim for at least 10–20% free on the DB volume as a minimum operating margin.
If you can’t get there, you treat this as capacity emergency, not housekeeping.
One quote worth remembering here is a paraphrased idea from Werner Vogels (Amazon CTO): everything fails, so design to detect failure quickly and recover automatically.
Disk-full events are the most boring form of failure—and the most humiliating, because they’re also the most predictable.
Fast diagnosis playbook (first/second/third)
You’re under pressure. You need a short sequence that identifies the bottleneck quickly and avoids “random cleanup” that deletes evidence or makes recovery harder.
Do this in order. Don’t get creative until you’ve done the basics.
First: confirm what kind of “full” you have
- Blocks full? Check df -h on the DB mount.
- Inodes full? Check df -i. Inode exhaustion feels like “full” even when df -h looks fine.
- Deleted-but-open files? Check lsof +L1. You can “delete” logs and reclaim nothing. A combined pass over all three checks is sketched below.
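If you want all three checks in one pass, a minimal sketch, assuming the root filesystem plus a separate DB mount at /var/lib/postgresql as in the later examples (adjust the mount list to your layout):
cr0x@server:~$ df -h / /var/lib/postgresql
cr0x@server:~$ df -i / /var/lib/postgresql
cr0x@server:~$ sudo lsof +L1 | sort -k7 -n | tail -n 5
The sort key assumes lsof’s default column order, where SIZE/OFF is the seventh field, so the last command surfaces the largest deleted-but-open files.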
Second: identify the biggest writers, not the biggest files
- Check recent growth in logs/journals: journalctl --disk-usage, du on /var/log.
- Check temp dirs and DB temp usage: /tmp, /var/tmp, Postgres base/pgsql_tmp equivalents.
- Check container layers and images if applicable: docker system df (or your runtime).
Third: decide your recovery strategy
- If DB won’t start: free space first, then start DB, then verify consistency.
- If DB starts but errors on writes: keep it up only long enough to capture state and drain traffic; then fix capacity.
- If replication exists: consider failing over to a healthy replica instead of “hero repairs” on the primary.
Joke #1: Disk-full incidents are like gravity—everyone “doesn’t believe in them” until they fall off the roof.
Interesting facts & historical context (quick hits)
- Reserved blocks on ext filesystems: ext2/3/4 traditionally reserve ~5% blocks for root to prevent total brickage; great for servers, confusing for humans.
- ENOSPC isn’t only blocks: the same error is commonly returned for inode exhaustion and quota exhaustion, which is why “df says 40% free” can still be a crisis.
- Journaling is not magic: ext4 journaling protects metadata consistency, not your database’s logical correctness. Your DB has its own journal for a reason.
- Old Unix lesson: deleting a file does not free space until no process holds it open—this has been true for decades and still causes modern outages.
- Write amplification is real: databases can turn one logical write into multiple physical writes (WAL/redo + data + index + checkpoint). “We only insert 1 GB/day” is how you lose weekends.
- Inodes were sized for 1980s workloads: defaults can still bite you with millions of small files (think caches), even on multi-terabyte disks.
- Crash recovery needs space: Postgres may need room for WAL replay and temp files; InnoDB may need to expand logs; you can’t recover on fumes.
- Filesystem semantics differ: XFS behaves differently from ext4 under pressure (though both keep space allocated for deleted-but-open files); ZFS has its own “don’t fill the pool beyond ~80%” culture for performance reasons.
Stabilize first: stop the bleeding safely
When a filesystem is full, the wrong move is frantic restarts. Restarts can turn “temporarily wedged but consistent” into “recovery loop that needs more disk.”
Stabilization means: reduce writes, keep evidence, and avoid making the DB do extra work until you have headroom.
Task 1: confirm what’s failing at the service layer
cr0x@server:~$ systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● postgresql@16-main.service loaded failed failed PostgreSQL Cluster 16-main
What it means: systemd agrees your DB is failing, not just your app.
Decision: don’t spam restarts. Move to logs and disk state.
Task 2: capture the last relevant errors before they rotate away
cr0x@server:~$ journalctl -u postgresql@16-main.service -n 200 --no-pager
Dec 30 02:11:44 db1 postgres[9123]: FATAL: could not write to log file: No space left on device
Dec 30 02:11:44 db1 postgres[9123]: PANIC: could not write to file "pg_wal/00000001000000A9000000FF": No space left on device
Dec 30 02:11:44 db1 systemd[1]: postgresql@16-main.service: Main process exited, code=exited, status=1/FAILURE
What it means: WAL writes failed. That’s not “nice to have”; it’s core durability.
Decision: your first objective is to restore enough space for WAL and crash recovery.
Task 3: stop traffic or put the DB behind a maintenance gate
cr0x@server:~$ systemctl stop myapp.service
cr0x@server:~$ systemctl stop nginx.service
What it means: you’re reducing write pressure while you recover.
Decision: if you have read-only fallback, use it; otherwise accept downtime over corruption.
Task 4: freeze the DB process state (if it’s looping) rather than hard-killing immediately
cr0x@server:~$ systemctl kill -s SIGSTOP postgresql@16-main.service
cr0x@server:~$ systemctl status postgresql@16-main.service | sed -n '1,12p'
● postgresql@16-main.service - PostgreSQL Cluster 16-main
Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled)
Active: activating (start) since Tue 2025-12-30 02:12:01 UTC; 3min ago
Process: 10455 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 16-main start (code=exited, status=1/FAILURE)
Main PID: 10501 (code=killed, signal=STOP)
What it means: the process is paused, not thrashing the disk.
Decision: do this if it’s repeatedly attempting recovery and chewing whatever space you free; resume with SIGCONT once you have headroom.
A practical rule: if you’re not sure what to delete, stop the writers first. You can always restart services; you can’t un-delete the wrong file in the middle of a consistency incident.
Find space like you mean it (commands + decisions)
You need to answer four questions quickly:
- Which filesystem is full?
- Is it blocks, inodes, quotas, or “open deleted”?
- Who is writing?
- Can I create stable free space that stays free?
Task 5: identify the full filesystem(s)
cr0x@server:~$ df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/nvme0n1p2 ext4 80G 79G 0 100% /
/dev/nvme1n1p1 ext4 1.8T 1.2T 520G 70% /var/lib/postgresql
What it means: root filesystem is hard full; DB mount is fine. That still breaks the DB if it logs to /var/log or uses /tmp on /.
Decision: focus on /, not the DB volume. “DB disk is 70%” does not save you.
Task 6: check inode exhaustion
cr0x@server:~$ df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/nvme0n1p2 5242880 5242880 0 100% /
/dev/nvme1n1p1 61054976 712345 60342631 2% /var/lib/postgresql
What it means: you’re out of inodes on /. This often comes from caches, runaway small files, or mail spools.
Decision: deleting one giant file won’t help. You need to delete lots of small files or move the workload.
Task 7: detect deleted-but-open files (the “I deleted it, why is it still full?” classic)
cr0x@server:~$ sudo lsof +L1 | head -n 15
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
rsyslogd 721 syslog 6w REG 259,2 2147483648 0 1049883 /var/log/syslog.1 (deleted)
java 2041 app 12w REG 259,2 1073741824 0 1051122 /var/log/myapp/app.log (deleted)
What it means: space is still held by running processes. The filename is gone, but the inode is still allocated.
Decision: restart or signal those processes to close/reopen logs (e.g., systemctl restart rsyslog), or truncate via /proc/<pid>/fd/<fd> if you must.
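If you can’t restart the owning process right now, truncating through the still-open descriptor reclaims the space immediately. A minimal sketch using the java process from the lsof output above (PID 2041, FD 12; substitute the PID/FD from your own output):
cr0x@server:~$ sudo truncate -s 0 /proc/2041/fd/12
cr0x@server:~$ df -h /
This discards the log contents, so only do it to data you can afford to lose; the process keeps writing to the now-empty file.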
Task 8: find which directory on the full filesystem is the culprit
cr0x@server:~$ sudo du -xhd1 / | sort -h
0 /boot
1.2G /etc
2.8G /home
4.5G /opt
9.9G /usr
12G /var
49G /
What it means: /var is large on root; that’s typically logs, caches, spool, or container junk.
Decision: drill into /var next, still staying on the same filesystem (-x matters).
Task 9: identify the biggest consumers inside /var
cr0x@server:~$ sudo du -xhd1 /var | sort -h
120M /var/cache
260M /var/tmp
1.1G /var/lib
1.7G /var/spool
8.4G /var/log
12G /var
What it means: logs are massive. That’s common, but don’t assume it’s safe to delete everything.
Decision: inspect /var/log, especially journals and application logs.
Task 10: check systemd-journald usage and limits
cr0x@server:~$ journalctl --disk-usage
Archived and active journals take up 6.8G in the file system.
What it means: journald is a significant chunk. On small root partitions, it can quietly win the “who ate my disk” contest.
Decision: vacuum journals to a safe size, then set persistent limits so this doesn’t recur.
Task 11: find large files quickly (size-based triage)
cr0x@server:~$ sudo find /var/log -xdev -type f -size +200M -printf '%s %p\n' | sort -n | tail -n 10
268435456 /var/log/journal/3b2b1a.../system@0000000000000000-0000000000000000.journal
536870912 /var/log/myapp/app.log
2147483648 /var/log/syslog.1
What it means: you’ve got a few very large files. That’s the easiest kind of cleanup—if they’re not held open.
Decision: if logs are held open, restart the logger/app after truncating/rotating properly.
Task 12: check for package caches (safe-ish to remove)
cr0x@server:~$ sudo du -sh /var/cache/apt/archives
1.1G /var/cache/apt/archives
What it means: apt cache is non-trivial.
Decision: clearing it is generally safe and fast; it doesn’t change DB state.
Task 13: check if the DB volume itself has hidden bloat (WAL, temp, backups)
cr0x@server:~$ sudo du -sh /var/lib/postgresql /var/lib/postgresql/* 2>/dev/null | sort -h | tail -n 8
48G /var/lib/postgresql/16
48G /var/lib/postgresql/16/main
7.9G /var/lib/postgresql/16/main/pg_wal
What it means: WAL is large but not necessarily wrong—could be a replica down, archiving failing, or long transactions.
Decision: don’t delete WAL files manually. Fix the upstream cause (replication/archiving) and let Postgres manage retention.
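To see whether a slot or a broken archiver is pinning WAL, two read-only checks that are safe to run during the incident:
cr0x@server:~$ sudo -u postgres psql -d postgres -c "select slot_name, slot_type, active, restart_lsn from pg_replication_slots;"
cr0x@server:~$ sudo -u postgres psql -d postgres -c "select last_archived_wal, last_failed_wal, last_failed_time from pg_stat_archiver;"
An inactive slot with an old restart_lsn, or a recent last_failed_wal, points at the real reason pg_wal keeps growing.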
Task 14: check memory + swap pressure (because recovery needs RAM and tmp)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 32Gi 26Gi 1.1Gi 214Mi 4.9Gi 3.8Gi
Swap: 0B 0B 0B
What it means: you’re tight on available RAM, no swap. Some DB recovery tasks spill to temp files; low RAM can amplify disk usage and time.
Decision: avoid running heavy maintenance (VACUUM FULL, OPTIMIZE) during recovery; stabilize first.
Free space safely (what to delete, what not to)
The priority is to create stable free space—space that stays free after services resume.
Deleting random files can make the system boot slower, lose audit trails, or break package state. Deleting the wrong DB files can end your week.
Low-risk space wins (do these first)
Task 15: vacuum systemd journal to a bounded size
cr0x@server:~$ sudo journalctl --vacuum-size=500M
Vacuuming done, freed 6.3G of archived journals from /var/log/journal.
What it means: you reclaimed real space. This is typically safe during incidents.
Decision: if this returns little, journald isn’t your main culprit; move on.
Task 16: set journald limits so you don’t repeat this tomorrow
cr0x@server:~$ sudo sed -i 's/^#SystemMaxUse=.*/SystemMaxUse=500M/' /etc/systemd/journald.conf
cr0x@server:~$ sudo systemctl restart systemd-journald
What it means: you’ve constrained future log growth.
Decision: pick a value that fits your disk and retention requirements; the right number depends on your incident response needs.
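Editing journald.conf in place works, but a drop-in file is easier to audit and survives package upgrades; a minimal sketch (the drop-in file name is arbitrary):
cr0x@server:~$ sudo mkdir -p /etc/systemd/journald.conf.d
cr0x@server:~$ printf '[Journal]\nSystemMaxUse=500M\n' | sudo tee /etc/systemd/journald.conf.d/90-size.conf
cr0x@server:~$ sudo systemctl restart systemd-journald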
Task 17: clear apt cache
cr0x@server:~$ sudo apt-get clean
cr0x@server:~$ sudo du -sh /var/cache/apt/archives
4.0K /var/cache/apt/archives
What it means: you got back a chunk of space without touching application state.
Decision: do it when you need quick relief; it’s not a root-cause fix.
Task 18: rotate/truncate runaway application logs properly
cr0x@server:~$ sudo truncate -s 0 /var/log/myapp/app.log
cr0x@server:~$ sudo systemctl restart myapp.service
What it means: truncation frees the space immediately, even while the process keeps the file open; that is why it works where deleting the file does not.
Decision: only truncate logs you can afford to lose. Prefer logrotate fixes after the incident.
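The durable fix is a logrotate policy for that log. A minimal sketch for the hypothetical /var/log/myapp/app.log used above; size and retention are assumptions to tune against your disk:
cr0x@server:~$ sudo tee /etc/logrotate.d/myapp >/dev/null <<'EOF'
/var/log/myapp/*.log {
    daily
    maxsize 500M
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
EOF
copytruncate avoids having to teach the app to reopen its log file, at the cost of possibly losing a few lines during the copy; if the app supports a reload signal, prefer create plus a postrotate hook.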
Task 19: resolve deleted-but-open logs by restarting the right daemon
cr0x@server:~$ sudo systemctl restart rsyslog.service
cr0x@server:~$ sudo lsof +L1 | head
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
What it means: no more deleted-but-open files listed (or fewer).
Decision: if space usage doesn’t drop, your problem isn’t held-open files; re-check df and inodes.
Medium-risk moves (do with intent)
These can be safe, but they’re not free of consequences.
Task 20: clear large caches with a known owner (example: application cache directory)
cr0x@server:~$ sudo du -sh /var/cache/myapp
3.4G /var/cache/myapp
cr0x@server:~$ sudo rm -rf /var/cache/myapp/*
cr0x@server:~$ sudo du -sh /var/cache/myapp
12K /var/cache/myapp
What it means: you reclaimed space, but you may increase load when the cache warms up again.
Decision: acceptable during incident, but coordinate with app owners; watch CPU and latency post-restart.
Task 21: clean up core dumps (often huge, often forgotten)
cr0x@server:~$ sudo coredumpctl list | head
TIME PID UID GID SIG COREFILE EXE
Tue 2025-12-30 00:13:19 UTC 8123 1001 1001 11 present /usr/bin/myapp
cr0x@server:~$ sudo du -sh /var/lib/systemd/coredump
5.2G /var/lib/systemd/coredump
cr0x@server:~$ sudo rm -f /var/lib/systemd/coredump/*
cr0x@server:~$ sudo du -sh /var/lib/systemd/coredump
0 /var/lib/systemd/coredump
What it means: you removed diagnostics artifacts.
Decision: do this only if you’ve already captured what you need for debugging, or if uptime outranks post-mortem detail.
High-risk moves (usually avoid during recovery)
- Deleting database files under /var/lib/postgresql or /var/lib/mysql because they “look big.” That’s how you create a second incident.
- Running aggressive DB maintenance (VACUUM FULL, REINDEX DATABASE, OPTIMIZE TABLE) while you’re low on disk. These operations often need more disk to finish.
- Moving the database directory on the fly without a tested plan. If you must move storage, use a controlled stop, copy/rsync, verify, update service config, then start.
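For reference, the shape of a controlled move, not something to improvise mid-incident; the target path /mnt/bigdisk is an assumption, and the service name follows the Postgres examples below:
cr0x@server:~$ sudo systemctl stop postgresql@16-main.service
cr0x@server:~$ sudo rsync -aHAX /var/lib/postgresql/ /mnt/bigdisk/postgresql/
cr0x@server:~$ sudo rsync -aHAXc --dry-run --itemize-changes /var/lib/postgresql/ /mnt/bigdisk/postgresql/
cr0x@server:~$ sudo mount --bind /mnt/bigdisk/postgresql /var/lib/postgresql
cr0x@server:~$ sudo systemctl start postgresql@16-main.service
The second rsync should report nothing to transfer; that’s your verification step. Make the bind mount (or the updated data_directory) permanent in fstab or the service config, and only then reclaim the old copy.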
Joke #2: The fastest way to free disk is to delete /var/lib; the fastest way to update your résumé is to do it in production.
Database recovery steps that actually work
“Actually work” here means: you end up with a database that starts, accepts writes, and is consistent enough to trust—backed by verification, not vibes.
The exact steps depend on your engine, but the shape is consistent:
make space → start safely → confirm integrity → fix the growth vector → restore margins.
Step zero: get real headroom
Before you start the DB again, aim for at least:
- PostgreSQL: free space ≥ WAL needed to replay + some temp; I aim for 10–20% of the cluster volume, or at least several GB for small systems.
- MySQL/InnoDB: enough for redo/undo activity and temp tables; again, 10–20% on the filesystem hosting datadir and tmpdir.
Task 22: verify free space after cleanup (don’t assume)
cr0x@server:~$ df -hT /
Filesystem Type Size Used Avail Use% Mounted on
/dev/nvme0n1p2 ext4 80G 62G 14G 82% /
What it means: you now have 14G free on root. That’s a usable margin for logs, temp, and recovery.
Decision: proceed with DB restart. If you’re still >95%, keep freeing space or expand the filesystem first.
PostgreSQL on Debian 13 (clustered service)
Debian’s PostgreSQL packaging uses clusters (e.g., postgresql@16-main), and it’s usually well-behaved during recovery—if you give it space.
The two biggest Postgres disk-full footguns are: deleting WAL manually, and repeated restarts that never let recovery finish.
Task 23: resume the paused DB (if you used SIGSTOP), then start cleanly
cr0x@server:~$ sudo systemctl kill -s SIGCONT postgresql@16-main.service
cr0x@server:~$ sudo systemctl start postgresql@16-main.service
cr0x@server:~$ sudo systemctl status postgresql@16-main.service | sed -n '1,12p'
● postgresql@16-main.service - PostgreSQL Cluster 16-main
Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled)
Active: active (running) since Tue 2025-12-30 02:24:09 UTC; 3s ago
What it means: the service is up. Not proven healthy yet, but it’s breathing.
Decision: check logs for recovery completion and ensure it’s accepting connections.
Task 24: confirm crash recovery completion in logs
cr0x@server:~$ sudo journalctl -u postgresql@16-main.service -n 80 --no-pager
Dec 30 02:24:08 db1 postgres[11001]: LOG: database system was interrupted; last known up at 2025-12-30 02:10:21 UTC
Dec 30 02:24:08 db1 postgres[11001]: LOG: redo starts at A9/FF000028
Dec 30 02:24:09 db1 postgres[11001]: LOG: redo done at A9/FF9A2B30
Dec 30 02:24:09 db1 postgres[11001]: LOG: database system is ready to accept connections
What it means: WAL replay completed and Postgres declared readiness.
Decision: proceed to integrity checks and workload reintroduction.
Task 25: check DB connectivity and basic read/write
cr0x@server:~$ sudo -u postgres psql -d postgres -c "select now();"
now
-------------------------------
2025-12-30 02:24:31.12345+00
(1 row)
cr0x@server:~$ sudo -u postgres psql -d postgres -c "create table if not exists diskfull_probe(x int); insert into diskfull_probe values (1);"
CREATE TABLE
INSERT 0 1
What it means: you can connect and commit.
Decision: keep this probe table or drop it later; the point is verifying writes are possible.
Task 26: check for lingering “out of space” errors at the SQL layer
cr0x@server:~$ sudo -u postgres psql -d postgres -c "select datname, temp_bytes, deadlocks from pg_stat_database;"
datname | temp_bytes | deadlocks
-----------+------------+-----------
postgres | 0 | 0
template1 | 0 | 0
(2 rows)
What it means: temp usage is currently minimal; no obvious contention artifacts.
Decision: if temp_bytes is exploding, your workload is spilling to disk; ensure temp_tablespaces and filesystem margin are adequate.
Task 27: run a targeted integrity check (not a full-table panic)
cr0x@server:~$ sudo -u postgres psql -d postgres -c "select * from pg_stat_wal;"
wal_records | wal_fpi | wal_bytes | wal_buffers_full | wal_write | wal_sync
-------------+---------+-----------+------------------+----------+----------
18234 | 112 | 12345678 | 0 | 214 | 93
(1 row)
What it means: WAL subsystem is operating. This is not a corruption guarantee, but it’s a “not currently on fire” signal.
Decision: if you suspect corruption, schedule amcheck or restore from backup; don’t improvise in the incident window.
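When you do schedule that deeper check, amcheck is the standard extension for validating B-tree indexes without taking the database down. A minimal sketch against one suspect index; the database and index names are placeholders:
cr0x@server:~$ sudo -u postgres psql -d mydb -c "create extension if not exists amcheck;"
cr0x@server:~$ sudo -u postgres psql -d mydb -c "select bt_index_check('my_index_name'::regclass);"
If the check returns without an error, the index passed; an error means you plan a rebuild or restore, still outside the incident window.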
MySQL / MariaDB (InnoDB) on Debian 13
InnoDB handles crash recovery by replaying redo logs. Disk-full can interrupt that and leave you with a service that won’t start or starts read-only.
The worst move is deleting ib_logfile or ibdata files to “force” a start. That’s not repair; that’s data loss with extra steps.
Task 28: read the last InnoDB recovery messages
cr0x@server:~$ journalctl -u mariadb.service -n 120 --no-pager
Dec 30 02:13:02 db1 mariadbd[9322]: InnoDB: Error: Write to file ./ib_logfile0 failed at offset 1048576.
Dec 30 02:13:02 db1 mariadbd[9322]: InnoDB: Error: 28 No space left on device
Dec 30 02:13:02 db1 mariadbd[9322]: InnoDB: Plugin initialization aborted with error Generic error
Dec 30 02:13:02 db1 systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
What it means: redo log write failed. Recovery can’t proceed without space.
Decision: free space on the filesystem containing datadir (often /var/lib/mysql) and any configured tmpdir.
Task 29: confirm datadir and tmpdir locations before you chase the wrong mount
cr0x@server:~$ sudo my_print_defaults mysqld | grep -Ei 'datadir|tmpdir'
--datadir=/var/lib/mysql
--tmpdir=/tmp
What it means: tmpdir is on /. If root is full, queries creating temp tables will fail even if /var/lib/mysql is fine.
Decision: consider moving tmpdir to a larger filesystem after recovery (and test it).
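A minimal sketch of that change for after recovery, assuming a dedicated temp directory on a filesystem with real headroom (the path and drop-in file name are assumptions; do this in a maintenance window):
cr0x@server:~$ sudo install -d -o mysql -g mysql -m 750 /var/lib/mysql-tmp
cr0x@server:~$ printf '[mysqld]\ntmpdir = /var/lib/mysql-tmp\n' | sudo tee /etc/mysql/mariadb.conf.d/60-tmpdir.cnf
cr0x@server:~$ sudo systemctl restart mariadb.service
cr0x@server:~$ sudo mariadb -e "select @@tmpdir;"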
Task 30: start MariaDB/MySQL and watch recovery progress
cr0x@server:~$ sudo systemctl start mariadb.service
cr0x@server:~$ sudo systemctl status mariadb.service | sed -n '1,14p'
● mariadb.service - MariaDB 10.11.6 database server
Loaded: loaded (/lib/systemd/system/mariadb.service; enabled)
Active: active (running) since Tue 2025-12-30 02:27:40 UTC; 2s ago
What it means: it started. Now confirm it’s not silently in trouble.
Decision: check error log and run basic queries.
Task 31: validate basic read/write and InnoDB health
cr0x@server:~$ sudo mariadb -e "select @@version, now();"
@@version now()
10.11.6-MariaDB-0+deb13u1 2025-12-30 02:28:03
cr0x@server:~$ sudo mariadb -e "create database if not exists diskfull_probe; use diskfull_probe; create table if not exists t(x int); insert into t values (1);"
cr0x@server:~$ sudo mariadb -e "show engine innodb status\G" | sed -n '1,35p'
*************************** 1. row ***************************
Type: InnoDB
Name:
Status:
=====================================
2025-12-30 02:28:11 0x7f9c2c1fe6c0 INNODB MONITOR OUTPUT
...
Log sequence number 123456789
Log flushed up to 123456789
What it means: redo log flush caught up; basic commits work.
Decision: bring traffic back gradually, monitor error rates and disk usage closely.
If the DB still won’t start: what to do (and what not to)
If you’ve created headroom and the DB still fails, don’t immediately reach for “forced recovery” modes unless you’re prepared to restore from backup.
Forced recovery can be a data extraction tool, not a return-to-service strategy.
Task 32: verify permissions and filesystem health signals (quick sanity)
cr0x@server:~$ sudo dmesg -T | tail -n 20
[Thu Dec 30 02:22:11 2025] EXT4-fs warning (device nvme0n1p2): ext4_dx_add_entry: Directory index full!
[Thu Dec 30 02:22:12 2025] EXT4-fs (nvme0n1p2): Delayed block allocation failed for inode 3932211 at logical offset 0 with max blocks 2 with error 28
What it means: kernel saw allocation failures consistent with ENOSPC. No immediate sign of media errors here, but you’re reading the ground truth.
Decision: if you see I/O errors, remount read-only events, or corruption warnings, stop and plan a restore/migration.
Task 33: if you’re on LVM, extend the filesystem instead of playing cleanup whack-a-mole
cr0x@server:~$ sudo vgs
VG #PV #LV #SN Attr VSize VFree
vg0 1 2 0 wz--n- 200.00g 40.00g
cr0x@server:~$ sudo lvextend -L +20G /dev/vg0/root
Size of logical volume vg0/root changed from 80.00 GiB (20480 extents) to 100.00 GiB (25600 extents).
Logical volume vg0/root successfully resized.
cr0x@server:~$ sudo resize2fs /dev/vg0/root
resize2fs 1.47.0 (5-Feb-2023)
Filesystem at /dev/vg0/root is mounted on /; on-line resizing required
old_desc_blocks = 10, new_desc_blocks = 13
The filesystem on /dev/vg0/root is now 26214400 (4k) blocks long.
cr0x@server:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg0-root 99G 62G 33G 66% /
What it means: you converted a recurring incident into capacity.
Decision: if you can extend, extend. Cleanup is a stopgap; capacity is a fix.
Checklists / step-by-step plan (printable mindset)
Checklist A: incident response for “filesystem full broke DB”
- Stop write-heavy services (app workers, ingestion, cron jobs). Keep the DB off until you have space.
- Capture logs from systemd and DB logs before you rotate/truncate.
- Confirm the failure mode: blocks (df -h), inodes (df -i), deleted-open (lsof +L1), quotas.
- Free low-risk space: journald vacuum, apt cache, known-safe caches, rotate logs.
- Re-check free space and ensure margin (don’t settle for 200MB).
- Start DB once, and let it recover. Don’t thrash restarts.
- Verify readiness: DB logs “ready,” connectivity, basic write test.
- Bring app traffic back gradually: canary workers first, then scale.
- Fix the cause of growth: log rotation, stuck archiving, runaway query temp, containers, backups.
- Set guardrails: monitoring, quotas, journald limits, alerting thresholds, capacity plan.
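For the monitoring item, whatever you alert with should watch blocks and inodes together; as a stopgap, even a cron job wrapped around something like this beats nothing (the mount list and the 85% threshold are assumptions):
cr0x@server:~$ df --output=target,pcent,ipcent / /var/lib/postgresql | awk 'NR>1 && (int($2)>=85 || int($3)>=85) {print "WARNING:", $0}'
Wire the output to mail or your alerting pipeline; the point is that both numbers get checked on the mounts your DB actually depends on.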
Checklist B: post-incident hardening (what prevents case #60)
- Alert on both blocks and inodes. Many shops only alert on blocks and then act surprised.
- Put DB temp paths on a volume with margin. For Postgres, consider a dedicated tablespace for temp if appropriate.
- Enforce log limits (journald + app logs) and test logrotate under load.
- Keep at least one tested restore path (backup restore or replica promotion) that doesn’t require heroics.
- Run periodic “disk full game day” in staging: simulate ENOSPC and validate recovery time and steps.
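For the game day itself you don’t need to damage anything: filling a staging volume with a throwaway file reproduces ENOSPC faithfully and rolls back instantly. A minimal sketch; the path follows the Postgres layout used above, and the size is an assumption (pick one that leaves only a sliver free):
cr0x@server:~$ sudo fallocate -l 50G /var/lib/postgresql/ballast.bin
cr0x@server:~$ df -h /var/lib/postgresql
cr0x@server:~$ sudo rm /var/lib/postgresql/ballast.bin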
Three corporate mini-stories from the trenches
1) The incident caused by a wrong assumption: “The DB has its own disk, so root can fill”
A mid-sized SaaS company split storage “properly”: database data on a big separate mount, root on a small NVMe partition.
The team felt virtuous. The DB volume had plenty of space; the graphs looked green. The on-call rotation got comfortable.
Then an innocuous change went live: a verbose debug flag accidentally enabled for an authentication service.
Logs exploded into /var/log on root. Within hours, root hit 100%. The database volume still had hundreds of gigabytes free,
so the incident commander initially dismissed storage as a cause and chased network and CPU red herrings.
PostgreSQL started failing writes—not because /var/lib/postgresql was full, but because it couldn’t write to its logs and couldn’t create temp files.
Recovery loops began: systemd attempted restarts; each restart tried to log; each log write failed; the service flapped.
Meanwhile the app, seeing connection failures, retried aggressively, amplifying the problem.
The fix was painfully simple: vacuum journald, rotate the runaway app log, stop the retry storm, and only then restart the DB once.
The postmortem action item that mattered wasn’t “add more disk.” It was “treat root as a dependency of the DB,” and alert on it accordingly.
2) The optimization that backfired: “Turn off logrotate compression, it’s wasting CPU”
A large internal platform team wanted to reduce CPU spikes on a fleet of Debian database proxies.
Someone noticed logrotate compression chewing CPU during peak hours. A reasonable thought followed: disable compression and keep rotation.
CPU flattened. Everyone congratulated themselves and moved on.
Two weeks later, multiple nodes started hitting disk-full at roughly the same time. The proxied databases were fine; the proxies were not.
Uncompressed rotated logs were now huge, and retention was tuned for “compressed size,” not raw. Root partitions were small because “they don’t do much.”
The incident manifested as connection churn and cascading retries—classic distributed systems behavior: one small fault becomes everybody’s problem.
The recovery effort got messy because engineers kept deleting old logs while rsyslog still held file descriptors open.
Space didn’t return, which led to more deleting, which led to fewer forensic logs. They eventually fixed it by restarting rsyslog, then setting sane logrotate policies,
and moving high-volume logs to a dedicated filesystem.
Lesson: optimizations that remove guardrails are rarely free. CPU is usually easier to buy than clean recovery time.
If you disable compression, you must re-tune retention and alerting, or the disk becomes your new scheduler.
3) The boring but correct practice that saved the day: “Always keep a replica with real promotion runbooks”
A financial services team ran PostgreSQL with one streaming replica in a different rack. Nothing fancy.
The boring part: once per quarter they practiced promoting the replica, updating app configs, and demoting/rebuilding the old primary.
They treated it like fire drills—annoying, scheduled, and non-optional.
During an end-of-month batch run, the primary filled its filesystem due to an archiving misconfiguration that caused WAL retention to balloon.
Writes stopped. Recovery attempts fought for disk. The team didn’t attempt complicated on-host surgery while business stakeholders hovered.
They confirmed the replica was caught up enough, promoted it, and restored service with minimal drama.
After traffic moved, they took their time: they expanded storage, corrected archiving, validated backups, and rebuilt replication cleanly.
The postmortem read like a grocery list, not a thriller novel. That’s the goal. Boring is a feature in operations.
Common mistakes: symptom → root cause → fix
1) “df shows 0 bytes freed after deleting logs”
- Symptom: you delete large files, but df -h stays at 100%.
- Root cause: files were deleted but still held open by processes.
- Fix: lsof +L1; restart the owning service or truncate the open FD via /proc/<pid>/fd/<fd>. Then re-check df.
2) “df -h looks fine, but everything errors ‘No space left on device’”
- Symptom: plenty of GB available, yet creates/writes fail.
- Root cause: inode exhaustion (df -i) or quota/project limits.
- Fix: delete large counts of small files in the offending directory; for quotas, inspect and raise limits; for inodes, stop the file-creating workload and redesign storage layout.
3) “DB starts, but application gets intermittent write failures”
- Symptom: service is up, but some writes fail, temp-table operations fail, or sorts crash.
- Root cause: tmpdir or log directory is on a still-full filesystem (often root), while datadir is fine.
- Fix: confirm tmpdir paths (Postgres temp files location varies; MySQL tmpdir configurable). Free/expand the correct filesystem; move tmpdir to a larger mount with tested config changes.
4) “Postgres WAL directory is huge; let’s delete old WAL files”
- Symptom: pg_wal consumes tens of GB.
- Root cause: replication slot retaining WAL, replica down, archiving failing, or long-running transactions preventing cleanup.
- Fix: identify replication slots and lag; fix archiving; remove unused slots; resolve long transactions. Do not delete WAL files by hand.
5) “MySQL won’t start; someone suggests deleting ib_logfile0”
- Symptom: InnoDB initialization errors after ENOSPC.
- Root cause: incomplete redo log writes due to disk-full and insufficient space for recovery.
- Fix: restore disk space, start cleanly, verify InnoDB status. If corruption persists, use backups/replicas; forced recovery is for extraction, not normal service restoration.
6) “We freed 2GB; why does recovery still fail?”
- Symptom: recovery starts then fails again with ENOSPC.
- Root cause: recovery itself generates writes (WAL replay, temp, checkpoints). 2GB is not a strategy.
- Fix: free/extend until you have meaningful margin (10–20% or several GB depending on DB size and workload), then retry once.
7) “After cleanup, disk refills instantly”
- Symptom: you free space, restart services, and it’s back to 100% within minutes.
- Root cause: runaway log spam, retry storm, stuck queue, or a batch job that resumes.
- Fix: keep writers stopped; identify the top writer; add rate limits; fix log levels; drain queues carefully; only then reintroduce traffic.
FAQ
1) Is “filesystem full” the same as “disk full”?
Not always. Filesystems can be “full” because blocks are exhausted, inodes are exhausted, or space is reserved/limited by quotas.
Always check df -h and df -i, and scan for deleted-but-open files with lsof +L1.
2) Can I just delete old logs to fix it?
Sometimes, but do it intentionally. If logs are held open, deleting won’t reclaim space until the process closes them.
Also: deleting logs can remove the only evidence of what happened, so capture what you need first.
3) Why did my DB break if the database partition wasn’t full?
Because databases depend on other paths: log directories, tmp dirs, sockets, PID files, and sometimes crash recovery artifacts.
Root being full can break a DB whose data directory is on a separate mount.
4) Should I restart the DB repeatedly until it comes back?
No. Repeated restarts can thrash recovery and generate more writes (and more logs) while you’re low on space.
Make space first, then start once and let recovery finish.
5) How much free space is “enough” before I restart?
Enough to finish recovery and survive a burst of normal writes. In production I aim for 10–20% free on the relevant filesystem(s).
If you can’t get there, extend storage or fail over to a replica instead of gambling.
6) What’s the safest way to reclaim space immediately?
Vacuum systemd journals, clear package caches, remove known-safe caches, and rotate/truncate runaway app logs.
Avoid touching database files unless you are following a tested, engine-specific procedure.
7) What if I’m out of inodes?
Deleting a few large files won’t help. You must delete many small files (often caches) or move that workload off the filesystem.
Long-term, redesign: separate caches from root, and pick filesystem parameters that match file-count patterns.
8) Does ext4’s “reserved blocks” mean I can reclaim emergency space?
Reserved blocks exist so root can still function when users fill the disk. You can adjust them with tune2fs,
but changing reserved blocks during an incident is rarely your best first move. Free real space or extend the filesystem.
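For completeness, the knob looks like this; treat it as a post-incident decision, not an emergency lever (the device name follows the earlier df output):
cr0x@server:~$ sudo tune2fs -l /dev/nvme0n1p2 | grep -i 'reserved block count'
cr0x@server:~$ sudo tune2fs -m 1 /dev/nvme0n1p2
The second command lowers the reserve from the default ~5% to 1%; on a root filesystem that reserve is exactly what keeps root-owned daemons limping along when users fill the disk, so think twice.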
9) Can disk-full cause silent data corruption?
Disk-full typically causes loud failures (ENOSPC), but it can still put your database into a state where recovery is required and consistency must be verified.
If you suspect corruption—especially with I/O errors in dmesg—prefer restore/replica promotion over “keep restarting and hope.”
10) How do I prevent this from happening again?
Alert on blocks and inodes, enforce log limits (journald and app), put temp paths on storage with margin, and practice failover/restore.
Prevention is mostly boring configuration—and boring is cheaper than downtime.
Conclusion: next steps you can do today
Disk-full incidents aren’t glamorous, but they’re honest: they reveal whether your system is operated with margins, guardrails, and rehearsed recovery.
If this just happened to you on Debian 13, your priority is to end in a verified-good state, not merely “service running.”
Do these next, in this order:
- Set hard limits for logs (journald + logrotate) and confirm they work under load.
- Add alerting for inodes and deleted-open files (or at least make lsof +L1 part of your on-call muscle memory).
- Move DB dependencies off root where sensible: tmp dirs, high-volume logs, container layers.
- Decide your “minimum free space policy” for DB volumes (10–20% is a sane start), and enforce it with alerts and capacity planning.
- Practice the recovery path (restore or replica promotion). If you only do it during an outage, it’s not a plan—it’s a dare.
The best outcome of a disk-full incident is not that you recovered. It’s that you now have enough discipline in the system that the next one never becomes an incident.