Disk full is the kind of outage that makes everyone suddenly remember how many logs exist, how many “temporary” exports are permanent, and how quickly a database will punish optimism. The app looks “slow.” Then it looks “down.” Then it looks like your pager is trying to achieve escape velocity.
This is a field comparison: when the filesystem hits 100%, which engine gets you back to green faster—and which one recovers cleaner (meaning fewer weird aftershocks, fewer silent data risks, fewer “we fixed it but it’s haunted” follow-ups).
The blunt thesis
If you measure “who recovers faster” by “who comes back online with minimal human babysitting,” PostgreSQL often wins on clarity and predictable failure modes, especially around WAL and crash recovery. If you measure “who recovers cleaner” by “who is less likely to quietly keep limping with hidden corruption risk,” PostgreSQL again tends to feel safer—because it is louder, more transactional about its metadata, and it refuses to pretend things are fine.
MySQL (InnoDB) can also recover cleanly, and it’s very good at crash recovery—but disk-full incidents have a talent for turning into messy partial failures: temp tables, redo logs, binary logs, and the OS all fighting over the last few megabytes like it’s the last seat on a commuter train.
My opinionated operational guidance:
- PostgreSQL: prioritize making WAL and pg_wal resilient and bounded (archiving, replication slot discipline, a separate volume). Disk-full usually looks like “can’t write WAL / can’t checkpoint,” which is scary but legible.
- MySQL: prioritize bounding binary logs, tmpdir, and undo/redo growth, and avoid filesystem-level surprises (thin provisioning, snapshots). Disk-full often looks like “everything is sort of broken at once,” which is operationally expensive. (Both retention levers are sketched right after this list.)
- For both: keep free space like it’s a feature, not a suggestion. “We run at 92% full” is how you end up doing incident response while negotiating with a storage array.
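A minimal sketch of what “bounded” can look like in practice. The retention values are assumptions to adapt, not recommendations; SET PERSIST and binlog_expire_logs_seconds need MySQL 8.0+, and max_slot_wal_keep_size needs PostgreSQL 13+.
# MySQL: cap binary log retention to 3 days (259200 s); the number is an assumed budget
mysql -e "SET PERSIST binlog_expire_logs_seconds = 259200;"
# PostgreSQL: cap how much WAL a replication slot may pin; 50GB is an assumed budget
sudo -u postgres psql -c "ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';"
sudo -u postgres psql -c "SELECT pg_reload_conf();"
The exact numbers matter less than the principle: both engines let you put a hard ceiling on their most dangerous growth path, and a slot that hits its ceiling breaks one consumer instead of the whole database.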
One quote to keep taped to your monitor, popularly attributed to Gene Kranz: “Failure is not an option.” When you run production storage, treat that as a paraphrased idea, not a promise.
Interesting facts and historical context
- PostgreSQL’s WAL lineage: PostgreSQL’s write-ahead logging approach matured from Postgres research roots into a robust, conservative durability model that strongly shapes disk-full behavior (WAL first, everything else later).
- InnoDB wasn’t always “the default”: InnoDB became the practical default engine for MySQL because it brought crash recovery and transactional semantics that older MyISAM setups lacked—changing how disk-full incidents present (redo/undo vs table-level chaos).
- MySQL’s historical ibdata1 pain: Early InnoDB setups often used a shared system tablespace (ibdata1) that could bloat and not shrink easily, a long-running operational scar for “we deleted data, why is disk still full?” incidents.
- PostgreSQL MVCC costs space by design: PostgreSQL’s MVCC creates dead tuples that must be vacuumed, so disk pressure is not a surprise; it’s a bill you pay routinely or with interest later.
- Replication slots changed failure modes: PostgreSQL replication slots are powerful, but they can pin WAL indefinitely; modern “disk full” incidents often trace back to a forgotten slot holding WAL hostage.
- MySQL binlog retention is an outage lever: In MySQL, binary logs are both a recovery asset and a disk bomb. Retention defaults and operational habits can decide whether disk-full is “minor” or “multi-hour.”
- Filesystems matter more than you want: XFS, ext4, and ZFS behave differently under ENOSPC. The database doesn’t get to opt out of the kernel’s personality.
- Checkpoint behavior is a major differentiator: PostgreSQL checkpoints and MySQL flush behavior create different “write burst” patterns; under high fill, these bursts are where you discover you were skating on thin ice.
What “disk full” really means in production
“Disk full” is rarely one thing. It’s an argument between layers:
- The filesystem returns ENOSPC. Or it returns EDQUOT because you hit a quota you forgot existed.
- The storage backend lies politely (thin provisioning) until it stops being polite, and then it’s everybody’s problem.
- The kernel may keep some processes alive while others fail on fsync, rename, or allocate.
- The database has multiple write paths: WAL/redo, data files, temp files, sort spill, autovacuum/vacuum, binary logs, replication metadata, and background workers.
The practical definition for incident handling is: can the database still guarantee durability and consistency? When storage is 100% full, the answer becomes “not reliably” long before the process actually dies.
Also, “disk full” is not just capacity. It’s free blocks, inodes, IOPS headroom, write amplification, and the amount of contiguous space your filesystem can allocate under fragmentation.
Short joke #1: Disk-full incidents are like toddlers—silent is fine, screaming is bad, but the worst is when they get quiet again and you realize they found the markers.
How PostgreSQL and MySQL behave when storage runs out
PostgreSQL: WAL is king, and it will tell you when the kingdom is broke
PostgreSQL’s durability revolves around WAL. If it can’t write WAL, it can’t safely commit. That’s not negotiable. When disk fills on pg_wal (or the filesystem hosting it), you’ll often see:
- Transactions failing with “could not write to file … No space left on device”.
- Checkpoint warnings escalating into “PANIC” in worst cases (depending on what exactly failed).
- Replication lag becoming irrelevant because primaries can’t generate WAL reliably.
This is brutal but honest: Postgres tends to fail in ways that make you fix the underlying storage constraint, not in ways that invite you to keep writing “just a little more” until you’ve created a second incident.
Cleaner recovery patterns in PostgreSQL disk-full events:
- Clear error messages pointing to WAL segments, temp files, base directory.
- Crash recovery is typically deterministic once you restore write ability.
- Bounded set of usual suspects: pg_wal, temp spill, autovacuum, replication slots.
Messy patterns you still see in Postgres:
- Replication slots pin WAL until the disk is gone.
- Long-running transactions prevent vacuum, inflate tables and indexes, and then the disk fills.
- Temp file explosions from bad queries: sorts, hashes, large CTEs, or missing indexes (the triage queries sketched just below help tell these patterns apart).
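A quick triage sketch for those patterns, assuming you can still open a psql session; long transactions and temp spill are covered here, and slots get their own treatment in Task 8 below. Both queries use standard catalog views.
# Oldest open transactions (vacuum blockers, likely temp spillers)
sudo -u postgres psql -c "SELECT pid, usename, state, now() - xact_start AS xact_age, left(query, 80) AS q FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_start LIMIT 5;"
# Cumulative temp spill per database since the last stats reset
sudo -u postgres psql -c "SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_spill FROM pg_stat_database ORDER BY temp_bytes DESC LIMIT 5;"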
MySQL (InnoDB): multiple write paths, multiple ways to suffer
InnoDB has redo logs, undo logs, doublewrite buffer, data files, temporary tablespaces, binary logs (server-level), relay logs (replication), and then your filesystem. When disk gets tight, you can hit failure in one area while another area still writes—creating partial functionality and confusing symptoms.
Common MySQL disk-full patterns:
- Binary logs fill the partition, especially with row-based logging and busy write workloads.
- tmpdir fills from large sorts or temp tables; queries start failing oddly while the server “looks up.”
- InnoDB can’t extend a tablespace (file-per-table or shared tablespace), producing errors on insert/update.
- Replication breaks asymmetrically: source keeps running but replica stops on relay log write, or vice versa.
Cleaner recovery patterns in MySQL:
- Once you free space, InnoDB crash recovery is generally solid.
- Binary log purge is a fast lever if you’re disciplined and understand replication requirements.
Messier patterns in MySQL:
- ibdata1 and undo tablespaces can remain large even after deletes; “freeing space” isn’t always immediate without rebuilds.
- Table corruption suspicion rises when files were partially written and the filesystem was under pressure—rare, but the fear is expensive.
- Background threads may keep hammering IO trying to flush/merge while you’re trying to stabilize the system.
So who recovers faster?
If your on-call needs a single sentence: PostgreSQL usually gives you a more straightforward path from “disk full” to “safe again,” provided you understand WAL, checkpoints, and slots. MySQL gives you more “quick levers” (purging binlogs, moving tmpdir), but also more ways to accidentally cut the branch you’re sitting on.
And who recovers cleaner?
Cleaner recovery is about confidence: after you free space, do you trust the system, or do you schedule a weekend to “verify”? PostgreSQL’s posture—stop the world when WAL can’t be guaranteed—tends to produce fewer “it’s running but…” situations. MySQL can be perfectly clean too, but disk-full incidents more often leave you with a checklist of “did we lose replication position, did we truncate something, did tmpdir move, did binlog purge break a replica?”
Fast diagnosis playbook
This is the order that finds the bottleneck fast. Not “theoretically correct,” but “ends the outage.”
1) Confirm what is actually full (blocks vs inodes vs quota)
- Check filesystem blocks used.
- Check inode exhaustion.
- Check quotas (user/project).
- Check thin provisioning / LVM / array headroom.
2) Identify the dominant space consumer (and whether it’s still growing)
- Find which directory is huge (/var/lib/postgresql, /var/lib/mysql, /var/log).
- Check for open-but-deleted files that still occupy space.
- Check if growth is “steady logs” or “sudden temp spill.”
3) Determine if the database can still guarantee durability
- Postgres: can it write WAL? are checkpoints failing?
- MySQL: are redo/binlog/tmp writes failing? is replication compromised?
4) Apply a reversible, low-risk space relief first
- Delete/purge rotated logs, old dumps, old packages.
- Purge MySQL binlogs only if replicas are safe.
- Resolve Postgres replication slots pinning WAL.
- Move temp directories to another volume as a stopgap.
5) After stability, do the correctness work
- Run consistency checks appropriate to engine.
- Fix retention policies and capacity alerts.
- Schedule vacuum/reindex or table rebuilds if bloat caused it.
Practical tasks: commands, outputs, and decisions (18)
These are the commands you run at 3 a.m. Each includes what the output means and the decision you make.
Task 1: Check filesystem capacity (blocks)
cr0x@server:~$ df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/nvme0n1p2 ext4 80G 79G 120M 100% /
/dev/nvme1n1p1 xfs 500G 410G 90G 83% /var/lib/postgresql
tmpfs tmpfs 16G 1.2G 15G 8% /run
Meaning: Root filesystem is at 100% with only 120M available; Postgres data volume is fine. Many services break when / is full (journald, package updates, temp files).
Decision: Free space on / immediately (logs, caches). Don’t touch database files yet if they’re not the culprit.
Task 2: Check inode exhaustion
cr0x@server:~$ df -ih
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/nvme0n1p2 5.0M 5.0M 0 100% /
/dev/nvme1n1p1 20M 1.2M 18.8M 6% /var/lib/postgresql
Meaning: Inodes on / are exhausted. This looks like “disk full” but deleting one huge file won’t help.
Decision: Find directories with millions of tiny files (often logs, temp, or poorly managed app caches). Clean those first.
Task 3: Find top space consumers safely
cr0x@server:~$ sudo du -xhd1 /var | sort -h
120M /var/cache
2.4G /var/log
8.1G /var/tmp
55G /var/lib
Meaning: /var/lib dominates. That’s where databases live. But /var/log and /var/tmp are non-trivial and often easiest to cut down.
Decision: If the database is down, first recover headroom by trimming logs and temp. Then investigate database directories with more precision.
Task 4: Detect open-but-deleted files (classic “df says full, du says not”)
cr0x@server:~$ sudo lsof +L1 | head
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
rsyslogd 812 syslog 7w REG 259,2 2147483648 0 12345 /var/log/syslog.1 (deleted)
java 1552 app 12w REG 259,2 1073741824 0 12346 /var/log/app.log (deleted)
Meaning: Processes still hold file descriptors to deleted files, so space won’t be reclaimed until those processes restart or close FDs.
Decision: Restart the specific service(s) after confirming impact, or rotate logs correctly. Don’t reboot blindly unless you enjoy extended downtime.
Task 5: Check journald disk usage
cr0x@server:~$ sudo journalctl --disk-usage
Archived and active journals take up 1.8G in the file system.
Meaning: Journals are eating real space. On small roots, this matters.
Decision: Vacuum old logs if needed; then fix retention so this isn’t your recurring hobby.
Task 6: Vacuum journald quickly (space relief)
cr0x@server:~$ sudo journalctl --vacuum-time=7d
Vacuuming done, freed 1.2G of archived journals from /var/log/journal.
Meaning: You got 1.2G back. That’s often enough to let databases checkpoint, rotate logs, or restart cleanly.
Decision: Use regained space to stabilize the database (or create a temporary safety buffer), then fix root cause.
Task 7: PostgreSQL—check if WAL directory is the culprit
cr0x@server:~$ sudo -u postgres du -sh /var/lib/postgresql/16/main/pg_wal
86G /var/lib/postgresql/16/main/pg_wal
Meaning: WAL is enormous. That usually means (a) replication slot pinned WAL, (b) archiving is broken, (c) replica is far behind, or (d) checkpointing is impaired.
Decision: Investigate replication slots and archiving status before deleting anything. Deleting WAL files manually is how you turn “incident” into “career change.”
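Archiving status isn’t shown above; a minimal health check, assuming archive_mode is on, uses the standard pg_stat_archiver view. A growing failed_count with a recent last_failed_wal usually explains a swollen pg_wal on its own.
sudo -u postgres psql -x -c "SELECT archived_count, last_archived_wal, last_archived_time, failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;"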
Task 8: PostgreSQL—list replication slots and spot the WAL pin
cr0x@server:~$ sudo -u postgres psql -x -c "SELECT slot_name, slot_type, active, restart_lsn, wal_status FROM pg_replication_slots;"
-[ RECORD 1 ]--+------------------------------
slot_name | analytics_slot
slot_type | logical
active | f
restart_lsn | 2A/9F000000
wal_status | reserved
-[ RECORD 2 ]--+------------------------------
slot_name | standby_1
slot_type | physical
active | t
restart_lsn | 2F/12000000
wal_status | extended
Meaning: analytics_slot is inactive but still reserving WAL via restart_lsn. That’s a common cause of WAL growth until disk death.
Decision: If the consumer is gone or can be reset, drop the slot to free WAL retention. If it’s needed, fix the consumer and let it catch up, or move WAL to a bigger volume.
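To quantify how much WAL each slot is actually holding back before deciding, a sketch using pg_wal_lsn_diff against the current write position (function names as of PostgreSQL 10+):
sudo -u postgres psql -c "SELECT slot_name, active, wal_status, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;"
The slot with the largest retained_wal and active = f is usually your culprit.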
Task 9: PostgreSQL—drop an unused logical slot (only if you are sure)
cr0x@server:~$ sudo -u postgres psql -c "SELECT pg_drop_replication_slot('analytics_slot');"
pg_drop_replication_slot
--------------------------
(1 row)
Meaning: Slot is removed; PostgreSQL can now recycle WAL once no other retention constraints exist.
Decision: Monitor pg_wal size and disk use; coordinate with the team that owned the slot because their pipeline will break (better broken than full).
Task 10: PostgreSQL—check for runaway temp files (query spill)
cr0x@server:~$ sudo -u postgres find /var/lib/postgresql/16/main/base -maxdepth 2 -type f -name "pgsql_tmp*" -printf "%s %p\n" | head
2147483648 /var/lib/postgresql/16/main/base/pgsql_tmp/pgsql_tmp4412.0
1073741824 /var/lib/postgresql/16/main/base/pgsql_tmp/pgsql_tmp4412.1
Meaning: Temp files exist and are large, suggesting sorts/hashes spilling to disk; the number in each file name is the PID of the backend that owns it. These usually disappear when sessions end, but during an incident they can be the incident.
Decision: Identify offending sessions (next task), cancel them if needed, and adjust query/indexing. For immediate relief, killing a few sessions can free gigabytes quickly.
Task 11: PostgreSQL—find heavy sessions and cancel the worst offender
cr0x@server:~$ sudo -u postgres psql -c "SELECT pid, usename, state, now()-query_start AS age, left(query,120) AS q FROM pg_stat_activity WHERE state <> 'idle' ORDER BY query_start ASC LIMIT 5;"
pid | usename | state | age | q
------+--------+--------+---------+------------------------------------------------------------
4412 | app | active | 00:34:12 | SELECT ... ORDER BY ...
4520 | app | active | 00:21:03 | WITH ... JOIN ...
cr0x@server:~$ sudo -u postgres psql -c "SELECT pg_cancel_backend(4412);"
pg_cancel_backend
-------------------
t
(1 row)
Meaning: You found long-running queries; cancel succeeded for PID 4412. If it was spilling to disk, temp files should stop growing and may be removed.
Decision: If disk pressure is acute, cancel/terminate the sessions causing spill. Then fix the query plan later with calm eyes.
Task 12: MySQL—check where space is going (datadir and logs)
cr0x@server:~$ sudo mysql -e "SHOW VARIABLES WHERE Variable_name IN ('datadir','tmpdir','log_bin','general_log_file','slow_query_log_file');"
+--------------------+---------------------------+
| Variable_name | Value |
+--------------------+---------------------------+
| datadir | /var/lib/mysql/ |
| tmpdir | /tmp |
| log_bin | ON |
| general_log_file | /var/lib/mysql/general.log|
| slow_query_log_file| /var/lib/mysql/slow.log |
+--------------------+---------------------------+
Meaning: tmpdir is on /tmp (often root-backed). If / fills, MySQL temp operations fail in surprising ways.
Decision: If / is constrained, move tmpdir to a larger volume and restart (or configure ahead of time). Also check if general log accidentally got enabled.
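A sketch of both follow-ups. The drop-in path and target directory are assumptions for a Debian-style layout, and tmpdir is not a dynamic variable, so the change only takes effect after a restart.
# Confirm whether the general log is quietly eating the datadir
sudo mysql -e "SELECT @@general_log, @@general_log_file;"
# Point tmpdir at a roomier volume (assumed path /data/mysql-tmp, owned by mysql)
sudo install -d -o mysql -g mysql -m 750 /data/mysql-tmp
printf '[mysqld]\ntmpdir = /data/mysql-tmp\n' | sudo tee /etc/mysql/conf.d/tmpdir.cnf
sudo systemctl restart mysql   # service name may be mysqld on RHEL-family systems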
Task 13: MySQL—see binary log inventory and size
cr0x@server:~$ sudo mysql -e "SHOW BINARY LOGS;"
+------------------+-----------+
| Log_name | File_size |
+------------------+-----------+
| binlog.000231 | 104857600 |
| binlog.000232 | 104857600 |
| binlog.000233 | 104857600 |
| binlog.000234 | 104857600 |
+------------------+-----------+
Meaning: Binlogs exist and can pile up. The sizes here are consistent, but the count may be huge.
Decision: Before purging, verify replication state. Purging binlogs that replicas still need is self-inflicted downtime.
Task 14: MySQL—confirm replication position before purging binlogs
cr0x@server:~$ sudo mysql -e "SHOW MASTER STATUS\G"
*************************** 1. row ***************************
File: binlog.000234
Position: 89234122
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set: 3f1c2c3a-aaaa-bbbb-cccc-111111111111:1-982341
cr0x@server:~$ sudo mysql -e "SHOW SLAVE HOSTS;"
+-----------+-----------+------+-------------------+-----------+
| Server_id | Host | Port | Rpl_recovery_rank | Master_id |
+-----------+-----------+------+-------------------+-----------+
| 12 | replica01 | 3306 | 0 | 1 |
+-----------+-----------+------+-------------------+-----------+
Meaning: You have at least one replica. You need to ensure it’s caught up enough (GTID or file/pos) before removing logs.
Decision: Check replica status on each replica (or via monitoring). Only purge logs older than what all replicas have consumed.
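A sketch of the per-replica check, run against each replica. SHOW REPLICA STATUS is the 8.0.22+ spelling (older servers use SHOW SLAVE STATUS with the corresponding older field names), and replica01 is the host from the output above.
mysql -h replica01 -e "SHOW REPLICA STATUS\G" | grep -E 'Replica_IO_Running|Replica_SQL_Running|Seconds_Behind_Source|Retrieved_Gtid_Set|Executed_Gtid_Set'
Only purge up to the oldest log still needed by the slowest replica; with GTIDs, compare Executed_Gtid_Set across the fleet rather than eyeballing file names.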
Task 15: MySQL—purge binary logs conservatively
cr0x@server:~$ sudo mysql -e "PURGE BINARY LOGS TO 'binlog.000233';"
Meaning: Binlogs older than binlog.000233 are removed. If a replica still needed them, it will stop and require re-seeding or other repair.
Decision: Purge only after verifying replication. If uncertain, free space elsewhere first. Binlogs are a chainsaw, not a scalpel.
Task 16: MySQL—check for temp table blowups and tmpdir pressure
cr0x@server:~$ sudo mysql -e "SHOW GLOBAL STATUS LIKE 'Created_tmp%';"
+-------------------------+----------+
| Variable_name | Value |
+-------------------------+----------+
| Created_tmp_disk_tables | 184203 |
| Created_tmp_files | 91203 |
| Created_tmp_tables | 3312849 |
+-------------------------+----------+
Meaning: A high Created_tmp_disk_tables count suggests query patterns that spill to disk. On a full filesystem, those queries fail and sometimes block others.
Decision: Mitigate immediate disk pressure (move tmpdir, add space, kill worst queries), then tune queries/indexes and temp table settings.
Task 17: Confirm whether thin provisioning is lying to you (LVM example)
cr0x@server:~$ sudo lvs -a -o +data_percent,metadata_percent
LV VG Attr LSize Pool Data% Meta%
mysql vg0 Vwi-aotz-- 300.00g thinpool 98.44 92.10
thinpool vg0 twi-aotz-- 500.00g 98.44 92.10
Meaning: The thin pool is nearly full. Even if the filesystem looks “fine,” allocations may fail soon. Databases love to discover this at peak write time.
Decision: Expand the thin pool or free extents immediately. Treat this as “disk full pending.”
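A sketch of the expansion path, assuming the volume group still has free extents; vg0/thinpool matches the example above and +100G is an arbitrary increment.
# How much room does the VG have left?
sudo vgs vg0
# Grow the thin pool, and keep an eye on its metadata, which can fill separately
sudo lvextend -L +100G vg0/thinpool
sudo lvextend --poolmetadatasize +1G vg0/thinpool
If the VG itself is out of extents, you are adding a physical volume or freeing one, not running one command.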
Task 18: Verify you didn’t hit a quota (XFS project quota example)
cr0x@server:~$ sudo xfs_quota -x -c "report -p" /var/lib/mysql
Project quota on /var/lib/mysql (/dev/nvme2n1p1)
Project ID: 10 (mysql)
Used: 498.0G Soft: 0 Hard: 500.0G Warn/Grace: [--------]
Meaning: You are effectively at the 500G hard quota (about 2G of headroom left). The filesystem may have free space overall, but your MySQL directory cannot grow.
Decision: Increase quota or migrate data to a bigger project. Stop blaming the database for a policy decision.
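Raising the project hard limit is one command in xfs_quota expert mode; 600g is an assumed new ceiling and mysql is the project name from the report above.
sudo xfs_quota -x -c 'limit -p bhard=600g mysql' /var/lib/mysql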
Short joke #2: The only thing that grows faster than your WAL is the confidence of the person who says “we don’t need disk alerts.”
Three corporate mini-stories (anonymized, painfully plausible)
1) The incident caused by a wrong assumption: “We have 20% free, we’re fine”
They ran PostgreSQL on a VM with a separate data disk and a small root disk. The dashboard showed the data disk at 78% used. Everyone felt responsible. Nobody felt worried.
Then a deployment flipped on verbose application logging during a debugging sprint. Logs went to /var/log on the root disk. By midnight, the root filesystem hit 100% and journald started dropping messages. By 00:30, PostgreSQL began failing temp writes and then struggled with checkpoint-related writes because even “not the data disk” still matters for the OS and system services.
On-call did the usual: checked the database volume, saw headroom, and assumed the problem was elsewhere. Meanwhile the “elsewhere” was the root partition, which was also the place for /tmp and a handful of admin scripts that wrote their own temp files. The failure looked like a database problem, but the root cause was infrastructure layout and an assumption that “database disk” equals “all disks that matter.”
Recovery was fast once someone ran df -h instead of staring at the database dashboard. They vacuumed journald, truncated the runaway log, restarted the chatty service, and Postgres recovered without drama. The postmortem action item was boring: partition sizing and log retention. It worked.
2) The optimization that backfired: “Let’s keep binlogs longer, just in case”
A MySQL team wanted better point-in-time recovery and smoother replica rebuilds. The easiest knob was binlog retention. They extended it. Nobody wrote down the new maximum footprint, because capacity planning is what you do when you have time, and nobody ever has time.
The storage was thin-provisioned. The filesystem still reported free space, so the alerts stayed quiet. Writes continued. Binlogs accumulated. The thin pool crept toward full. One day, allocations started failing in bursts—first during high write periods, then more frequently. MySQL threw errors about writing to the binary log, and suddenly commits began to fail intermittently.
The worst part was the pattern: the server didn’t immediately die, it just became unreliable. Some transactions committed, others didn’t. The application started retrying. Retrying increased write load. Write load generated more binlog pressure. That’s a feedback loop you don’t want.
They recovered by expanding the thin pool and purging binlogs conservatively after confirming replicas were healthy. The backfired optimization wasn’t “keeping binlogs longer.” It was doing it without a bounded budget, without storage-level alerts, and without a practiced purge procedure.
3) The boring but correct practice that saved the day: separate WAL/redo, enforce budgets, rehearse cleanup
A different org ran PostgreSQL with WAL on a dedicated volume and strict monitoring: volume fill alerts at 70/80/85%, and a runbook that included “check replication slots, check archiving, check long transactions.” It wasn’t glamorous. It was effective.
One afternoon, a downstream logical replication consumer stalled after a network change. The replication slot went inactive. WAL began to accumulate. The 70% alert fired. On-call didn’t panic; they executed the runbook. They confirmed the slot was the cause, validated that the consumer was truly down, then dropped the slot and notified the owning team.
The database never went read-only, never crashed, never took an outage. The “incident” was a Slack thread and a ticket. The practice that saved them was not brilliance. It was the quiet discipline of putting the most dangerous write-path (WAL) on a volume that had a budget, plus alerts that gave humans time to behave like humans.
If you want faster disk-full recovery, you don’t start during the outage. You start when you decide where WAL/binlogs live and how much slack you keep.
Common mistakes: symptoms → root cause → fix
1) “df says 100%, but I deleted files and nothing changed”
Symptom: Disk usage stays high after deleting large logs or dumps.
Root cause: Open-but-deleted files (process still holds the FD).
Fix: Use lsof +L1, restart the specific service, then confirm df -h drops. Don’t shotgun reboot unless you have to.
2) PostgreSQL: “pg_wal is huge and won’t shrink”
Symptom: WAL directory grows until disk is nearly full; archiving might be “working” sometimes.
Root cause: Inactive replication slot, stuck archiver, or replica that can’t consume WAL.
Fix: Inspect pg_replication_slots, fix consumer or drop slot; verify archive_command health; ensure WAL volume has headroom.
3) PostgreSQL: “No space left” but data disk has space
Symptom: Errors writing temp files, failure during sorts/joins; the main data mount looks OK.
Root cause: Temp files going to a different filesystem (often /tmp or root), or root filesystem full affecting system operations.
Fix: Check temp_tablespaces and OS temp usage; free root space; optionally move temp tablespaces to a larger mount.
4) MySQL: “Server is up but writes fail randomly”
Symptom: Some transactions fail, others succeed; errors mention binlog or tmp files.
Root cause: Binary logs or tmpdir on a full filesystem; thin provisioning near-full causing intermittent allocation failures.
Fix: Free space where binlogs live; validate thin pool; move tmpdir; bound binlog retention (binlog_expire_logs_seconds) and monitor.
5) “We freed 5G, but the database immediately fills it again”
Symptom: You delete stuff, free space briefly, then it’s gone within minutes.
Root cause: A write-amplifying background process: checkpoints under pressure, autovacuum catching up, replication backlog producing logs, or a runaway query producing temp spill.
Fix: Identify the growth vector (WAL/binlog/temp). Stop the bleeding: cancel the query, fix slot/replica, pause a job, or temporarily throttle ingestion.
6) “We purged MySQL binlogs and now a replica is dead”
Symptom: Replica errors about missing binlog files.
Root cause: Purged binlogs still required by replica; poor visibility into replica lag/GTID state.
Fix: Re-seed or use GTID-based recovery if available. Prevent recurrence: automate binlog purge based on safe retention and monitor replica lag.
7) PostgreSQL: “Vacuum didn’t save us; disk still full”
Symptom: Deletes happened, vacuum ran, but disk didn’t shrink.
Root cause: MVCC frees space inside relation files for reuse; it doesn’t necessarily return it to the OS. Also, indexes bloat.
Fix: For real shrink: VACUUM FULL (blocking) or table rewrite strategies, plus REINDEX where needed. Plan it; don’t improvise it mid-incident.
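Before scheduling VACUUM FULL or rebuilds, it helps to know where the bytes actually are. A sketch against a single database (appdb is an assumed name), listing the largest tables, indexes, and materialized views:
sudo -u postgres psql -d appdb -c "SELECT c.relname, c.relkind, pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE c.relkind IN ('r', 'i', 'm') AND n.nspname NOT IN ('pg_catalog', 'information_schema') ORDER BY pg_total_relation_size(c.oid) DESC LIMIT 10;"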
Checklists / step-by-step plan
During the incident: stabilize first, then repair
- Stop the growth: pause batch jobs, disable noisy logs, throttle ingestion, or temporarily block the worst offender.
- Confirm what is full:
- Run df -hT and df -ih.
- Check quotas if your org likes invisible walls.
- Recover headroom fast (aim for 5–10% free, not “a few MB”):
- Vacuum journald and prune old logs.
- Remove old dumps/artifacts from known locations.
- Address open-but-deleted files and restart the specific service.
- Database-specific triage:
- Postgres: inspect pg_wal, slots, archiving, long transactions, temp spill.
- MySQL: inspect binlogs, tmpdir placement, relay logs on replicas, and thin pool status.
- Bring the database back safely:
- Prefer a controlled restart over repeated crash loops.
- Verify durability paths are working (WAL/binlog writable); a simple canary write, sketched after this checklist, is one way to check.
- Validate application health: error rates drop, latency normalizes, replication catches up.
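One simple way to check the durability path end to end is a committed canary write; ops_scratch is an assumed scratch database, not something either engine provides.
# PostgreSQL: a committed INSERT exercises WAL write and flush
sudo -u postgres psql -d ops_scratch -c "CREATE TABLE IF NOT EXISTS disk_full_canary (ts timestamptz); INSERT INTO disk_full_canary VALUES (now());"
# MySQL: a committed INSERT exercises redo and, if enabled, the binary log
sudo mysql -e "CREATE DATABASE IF NOT EXISTS ops_scratch; CREATE TABLE IF NOT EXISTS ops_scratch.disk_full_canary (ts datetime); INSERT INTO ops_scratch.disk_full_canary VALUES (NOW());"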
After the incident: make it harder to repeat
- Separate critical write paths: WAL or redo/binlog on volumes with alerts and predictable growth budgets.
- Set retention explicitly:
- Postgres: manage slots; ensure archiving is monitored if used.
- MySQL: set binlog_expire_logs_seconds; verify log rotation.
- Put temp where it can breathe: dedicated temp volume or sane limits; avoid root for heavy temp workloads.
- Alert on rate of change, not just percent full: “+20G/hour” beats “92% used” every time (a rough time-to-full sketch follows this checklist).
- Rehearse a disk-full runbook quarterly. The goal is muscle memory, not heroism.
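A rough sketch of the “time to full” idea in plain shell. The default mount point and sampling interval are assumptions, and real alerting should do this from your metrics system rather than a one-off script.
#!/usr/bin/env bash
# Estimate time until a mount point fills, assuming the current growth rate holds.
set -euo pipefail

MOUNT="${1:-/var/lib/postgresql}"   # assumed default; pass your own mount point
INTERVAL="${2:-60}"                 # seconds between the two samples

used_kb()  { df -Pk "$MOUNT" | awk 'NR==2 {print $3}'; }
avail_kb() { df -Pk "$MOUNT" | awk 'NR==2 {print $4}'; }

u1=$(used_kb)
sleep "$INTERVAL"
u2=$(used_kb)

growth_kb_per_s=$(( (u2 - u1) / INTERVAL ))
if [ "$growth_kb_per_s" -le 0 ]; then
  echo "$MOUNT: not growing over the last ${INTERVAL}s"
else
  seconds_left=$(( $(avail_kb) / growth_kb_per_s ))
  echo "$MOUNT: roughly $((seconds_left / 3600))h $(( (seconds_left % 3600) / 60 ))m until full at the current rate"
fi
Rate-based alerting catches the 2 a.m. runaway job that percent-full thresholds sleep through.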
FAQ
1) If disk is full, should I restart the database immediately?
No. First free enough space to let the database start and complete crash recovery/checkpoints. Restarting into ENOSPC loops can worsen corruption risk and extend downtime.
2) Is it safe to delete PostgreSQL WAL files to free space?
Almost never. Manual deletion can make recovery impossible. The correct approach is to remove the reason WAL is retained (slots, archiving backlog, replica lag) or add space.
3) Is it safe to purge MySQL binary logs during an incident?
Yes, but only with replication awareness. Confirm all replicas have consumed the logs (GTID or file/pos). Purge conservatively, then watch replicas closely.
4) Why does PostgreSQL consume so much disk even after deleting rows?
MVCC keeps old row versions until vacuum can reclaim space for reuse. That space is usually reused inside the same files, not returned to the filesystem. Shrinking requires rewrites (e.g., VACUUM FULL) or rebuild strategies.
5) Why does MySQL “look up” but queries fail under disk pressure?
Because different subsystems fail independently: tmpdir may be full, binlog writes may fail, or tablespaces can’t extend. The TCP port being open isn’t the same as “database is healthy.”
6) Which database is more tolerant of running near full?
Neither. PostgreSQL is more likely to refuse unsafe commits when WAL can’t be written. MySQL may limp longer but can be operationally messier. Your best tolerance feature is free space.
7) What’s the single best preventive control?
Bound growth with explicit budgets: WAL/binlog retention, log rotation, and quotas where appropriate—plus alerts on growth rate. “We’ll notice” is not a control.
8) How much free space should I keep?
Keep enough to survive your worst-case growth until humans can react: typically 10–20% on database volumes, more on volumes that host WAL/binlogs/temp, and extra if you have write bursts.
9) Does filesystem choice change recovery outcomes?
It changes failure behavior and tooling. ext4 reserved blocks can mask “full” for root; XFS quotas are common in multi-tenant setups; ZFS has its own copy-on-write amplification and “don’t fill the pool” rules. Pick intentionally and monitor accordingly.
10) What’s the fastest “space back” move that is usually safe?
Cleaning non-database artifacts: rotated logs, old dumps, package caches, journald vacuuming, and open-but-deleted file cleanup. Touch database internals only once you understand why they grew.
Conclusion: what to do next week
Disk-full incidents aren’t about databases being fragile. They’re about us being optimistic. PostgreSQL tends to recover cleaner because it is strict about WAL and consistency. MySQL can recover quickly too, but it gives you more operational levers—and more opportunities to cut yourself.
Next steps that pay off immediately:
- Separate and budget your critical write paths: Postgres WAL, MySQL binlogs/redo, and temp spill locations.
- Instrument growth rate alerts and add “time to full” to your dashboards.
- Write and rehearse a disk-full runbook that starts with df, df -i, and lsof +L1 before it touches any database file.
- Normalize boring hygiene: log retention, dump cleanup, slot/binlog discipline, and periodic bloat management.
If you do those four things, “disk full” stops being a thriller and becomes a maintenance ticket. That’s the kind of downgrade you should actively pursue.