At 92% full, ZFS still smiles politely. At 98%, it starts answering emails with “per my last allocation request.” At 100%, it doesn’t negotiate: your “routine” rename becomes a production incident, and suddenly everyone learns what ENOSPC means.
This is a field guide for the ugly moment when a ZFS pool runs out of space (or the space ZFS can actually use), plus the disciplined path back to stable headroom. It’s written for people who don’t have time to admire the elegance of copy-on-write while the pager is screaming.
What breaks first: the failure chain when ZFS gets full
“Pool full” is not a single state. It’s a cascading set of constraints that hit different workloads differently. ZFS doesn’t just need bytes for your data; it needs working space for metadata, allocation bookkeeping, copy-on-write rewrites, and transaction group (TXG) commits. When the pool approaches full, the allocator’s choices get worse, fragmentation rises, and latency starts behaving like it’s trying to send you a message.
1) The first thing you’ll feel: latency, not necessarily errors
When free space is scarce, ZFS has fewer contiguous regions to allocate. For writes, that means more metaslab lookups, more fragmentation, and more scattered I/O. Even if the pool isn’t “100%,” the performance cliff is real. For many pools, the cliff starts around 80–90% depending on recordsize, workload, and vdev geometry. It’s not superstition; it’s allocator physics and seek penalties dressed up as a storage stack.
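If you want to watch for that cliff rather than discover it, capacity and fragmentation are ordinary pool properties, so a trend check is one line. This is a sketch against the example pool tank; adjust the pool name:
cr0x@server:~$ zpool list -o name,size,allocated,free,fragmentation,capacity tank
Rising FRAG alongside rising CAP is the early warning. By the time writes fail, both numbers have usually been telling the story for weeks.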
2) Then the application starts “mysteriously” failing
Applications often fail indirectly:
- Databases can stall on fsync or fail when they can’t create temporary files, WAL segments, or new tablespace extents.
- Containers fail to start because overlay layers can’t write, logs can’t append, or the runtime can’t create state under /var/lib.
- System services fail because they can’t write pidfiles, journald can’t persist, or package managers can’t unpack.
3) Deleting files doesn’t always free space (and that’s the cruel part)
ZFS snapshots preserve referenced blocks. If you delete a 200 GB directory but snapshots still reference those blocks, the pool usage doesn’t budge. In a “pool full” incident, this is the moment teams start blaming ZFS. ZFS is innocent; it’s just consistent.
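The fastest way to see where the “missing” space actually lives is the space breakdown view. This is standard zfs list output, shown here against the example pool tank:
cr0x@server:~$ zfs list -r -o space tank
The USEDSNAP column is space held only by snapshots, USEDDS is the live dataset, and USEDREFRESERV is reservation padding. If USEDSNAP dwarfs USEDDS on the dataset you just cleaned, your deletes are pinned and snapshot pruning is the only path to relief.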
4) The really sharp edges: metadata, special vdevs, and reservations
Some “pool full” incidents are actually “the pool has space, but you can’t use it” incidents:
- Reservations (reservation, refreservation) can fence off space so your deletes don’t help the dataset you care about.
- Quotas (quota, refquota) can cap a dataset so it hits ENOSPC even when the pool has plenty.
- Special vdev full (if you use one) can cause metadata allocation failures even while the main data vdevs have headroom.
- Slop space (pool-wide reserved space) can block allocations when you’re near full, which is a safety feature you will curse until it saves you.
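Slop isn’t a property you can list per dataset; on Linux OpenZFS it’s derived from pool size via the spa_slop_shift module parameter (roughly pool size divided by 2^spa_slop_shift, clamped between a floor and a ceiling that vary by release). Confirming the setting is a read-only check; this sketch assumes the usual Linux module parameter path:
cr0x@server:~$ cat /sys/module/zfs/parameters/spa_slop_shift
5
5 is the long-standing default. Raising the shift to shrink the reserve mid-incident is technically possible and almost always the wrong move; the reserve is what keeps you able to delete things at all.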
5) Why “just add a disk” isn’t always immediate relief
Expanding capacity can help, but it doesn’t rewind fragmentation, fix a full special vdev, or remove snapshot references. Also, resilvering or expansion activity adds load during the worst possible moment. You add capacity to survive; you still need cleanup and policy to recover.
One paraphrased idea from Werner Vogels (reliability/ops): “Everything fails, all the time; design and operate assuming it will.” That applies to storage fullness more than anyone wants to admit.
Interesting facts and history (the kind that changes decisions)
- ZFS was born at Sun in the mid-2000s with copy-on-write and end-to-end checksums as core design points, not bolt-ons.
- “Snapshots are cheap” is true—until the pool is full and “cheap” becomes “why didn’t anyone prune these.”
- ZFS uses a transactional model (TXGs). Your writes aren’t “done” the way many people assume until the TXG commits; pressure there shows up as sync latency.
- Space accounting is subtle: USED, REFER, USEDDS, snapshot usage, and reservations all tell different truths.
- The “80% rule” wasn’t invented by ZFS, but ZFS makes the consequences of high fill levels more visible because allocation behavior degrades sharply.
- Special vdevs (popularized in OpenZFS) can accelerate metadata and small blocks, but they introduce a new “full” failure mode if sized wrong.
- Recordsize and volblocksize influence fragmentation and free-space usability. Big blocks can be efficient—until they can’t find landing spots.
- Compression can delay the crisis, but it can also hide growth until you cross a threshold and suddenly every write is a fight for space.
- Copy-on-write means “overwrite” needs free space. Running out of free space can break workloads that think they’re “just updating in place.”
Joke #1: A ZFS pool at 99% is like a meeting room booked “back-to-back” all day: technically available, practically unusable.
Fast diagnosis playbook (first / second / third)
If you only have five minutes before a VP appears in your Slack, do this. The goal is to identify which of these is your bottleneck: pool capacity, dataset quota/reservation, snapshots, or a special vdev/metadata constraint.
First: confirm the real constraint
- Is the pool actually full? Check zpool list capacity and health.
- Is the dataset capped? Check quotas/refquotas for the affected dataset.
- Is space “stuck” in snapshots? Look at snapshot usage and recent snapshot churn.
- Are reservations hoarding space? Look for refreservation and reservation outliers. (A combined one-paste version of these checks follows below.)
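If you’d rather run the whole first pass as one paste, here is a rough triage sequence; tank and tank/data are stand-ins for your pool and whichever dataset is failing:
cr0x@server:~$ zpool list tank
cr0x@server:~$ zfs get -o name,property,value quota,refquota,reservation,refreservation tank/data
cr0x@server:~$ zfs list -r -t snapshot -o name,used -S used tank | head
Three commands, one answer each: is the pool full, is the dataset fenced, and is the space sitting in snapshots.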
Second: locate the biggest consumer in the shortest time
- Dataset breakdown: find which dataset is consuming the pool.
- Snapshot breakdown: find which snapshot set is pinning the most space.
- Process perspective: find which workload is currently writing and failing.
Third: pick the safest immediate action
- Immediate safety valve: stop the writer (pause ingestion, disable log spam, stop runaway jobs).
- Low-risk reclaim: delete known disposable data with no snapshot retention requirements; prune old snapshots if policy allows.
- Structural fix: add capacity, adjust quotas/reservations, or fix special vdev sizing—then normalize snapshot policy.
Do not start by “rm -rf” on random paths. You’ll just create a second incident: “we deleted the wrong thing and also didn’t free space.”
Hands-on recovery tasks (commands, outputs, decisions)
These are field-tested tasks you can run during an incident. Each task includes a command, realistic output, what it means, and the decision to make. Run them in order if you’re unsure. Skip around if you already know the shape of the problem.
Task 1: Confirm pool capacity and health
cr0x@server:~$ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 27.2T 26.6T 640G - - 72% 97% 1.00x ONLINE -
Meaning: Pool tank is 97% full and heavily fragmented. You are in the danger zone.
Decision: Treat this like an outage-in-progress. Stop nonessential writers. Plan to reclaim space and/or add capacity.
Task 2: Look for pool-wide errors and slow devices
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error.
action: Determine if the device needs to be replaced, and clear the errors
scan: scrub repaired 0B in 12:31:44 with 0 errors on Sun Dec 22 03:11:08 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 2 0 0
sdd ONLINE 0 1 0
errors: Permanent errors have been detected in the following files:
tank/data/postgres/base/16384/2609
Meaning: You have device errors and a referenced file with corruption. This is separate from “pool full” but can become visible under stress.
Decision: Do not begin aggressive cleanup until you snapshot critical datasets (if possible) and plan remediation for the failing drive(s). If the pool is too full to snapshot, prioritize freeing a little space safely first, then snapshot.
Task 3: Identify which datasets consume the pool
cr0x@server:~$ zfs list -r -o name,used,avail,refer,mountpoint -S used tank
NAME USED AVAIL REFER MOUNTPOINT
tank 26.6T 512G 192K /tank
tank/data 19.8T 512G 19.8T /tank/data
tank/backups 4.9T 512G 3.1T /tank/backups
tank/vm 1.6T 512G 1.6T /tank/vm
tank/home 310G 512G 310G /tank/home
Meaning: The biggest consumer is tank/data, then tank/backups.
Decision: Focus on the largest datasets first. Freeing 50 GB in a 27 TB pool might buy minutes, not stability.
Task 4: Check whether snapshots are pinning space
cr0x@server:~$ zfs list -d 1 -t snapshot -o name,used,refer,creation -S used tank/data | head
NAME USED REFER CREATION
tank/data@autosnap_2025-12-26_0000 620G 19.8T Fri Dec 26 00:00 2025
tank/data@autosnap_2025-12-25_0000 410G 19.6T Thu Dec 25 00:00 2025
tank/data@autosnap_2025-12-24_0000 395G 19.5T Wed Dec 24 00:00 2025
tank/data@autosnap_2025-12-23_0000 380G 19.2T Tue Dec 23 00:00 2025
Meaning: Snapshots are consuming hundreds of GB each in “unique blocks” (the USED column). Deletes in tank/data won’t free those blocks until snapshots are destroyed.
Decision: If the snapshots aren’t required for compliance or restore points, prune. If they are required, add capacity first or move data off-pool.
Task 5: Find the biggest snapshot consumers across the pool quickly
cr0x@server:~$ zfs list -r -t snapshot -o name,used -S used tank | head -n 10
NAME USED
tank/backups@weekly_2025-12-22 1.2T
tank/data@autosnap_2025-12-26_0000 620G
tank/data@autosnap_2025-12-25_0000 410G
tank/vm@hourly_2025-12-26_0900 210G
tank/vm@hourly_2025-12-26_0800 205G
tank/home@daily_2025-12-26 18.4G
tank/home@daily_2025-12-25 17.9G
Meaning: One backup snapshot is eating 1.2 TB; it’s probably retention gone wild or a backup dataset that changes too much.
Decision: Target the high-impact snapshot(s) first, but verify they’re safe to remove (restore policy, legal hold, etc.).
Task 6: Verify quotas and refquotas on the failing dataset
cr0x@server:~$ zfs get -o name,property,value,source quota,refquota tank/data
NAME PROPERTY VALUE SOURCE
tank/data quota none default
tank/data refquota 20T local
Meaning: tank/data is capped at 20 TB referenced usage. It can hit ENOSPC even if the pool has free space.
Decision: If the dataset is the one failing, consider raising refquota temporarily—after you confirm it won’t starve other critical datasets.
Task 7: Check reservations and refreservations (space hoarding)
cr0x@server:~$ zfs get -r -o name,property,value,source reservation,refreservation tank | egrep -v 'none|0B|default'
tank/vm refreservation 2T local
tank/db reservation 1T local
Meaning: Two datasets reserve 3 TB. That space is effectively untouchable by other datasets.
Decision: In an emergency, reduce or remove reservations if you can tolerate the risk (reservations exist for a reason). Document and restore them later.
Task 8: Confirm whether deletions are stuck because a filesystem is “busy”
cr0x@server:~$ lsof +L1 | head
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
nginx 1421 root 9w REG 0,113 1048576 0 5329156 /tank/data/logs/access.log
postgres 2198 postgres 11w REG 0,113 1073741824 0 5329901 /tank/db/pg_wal/000000010000002A000000B3
Meaning: Files have been deleted (NLINK=0) but are still held open by processes. The space won’t be freed until those file descriptors close.
Decision: Restart or signal the offending services to reopen logs (or rotate properly). This can reclaim space without touching snapshots.
Task 9: Identify what’s writing right now (and stop it with intent)
cr0x@server:~$ zpool iostat -v tank 2 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 26.6T 640G 420 3800 82M 610M
raidz2-0 26.6T 640G 420 3800 82M 610M
sda - - 70 640 14M 98M
sdb - - 66 640 13M 99M
sdc - - 73 940 14M 155M
sdd - - 71 580 14M 93M
---------- ----- ----- ----- ----- ----- -----
Meaning: Writes are heavy and uneven on sdc. Someone is still pouring data into a nearly full pool.
Decision: Pause ingestion, disable debug logging, stop batch jobs, or put the system in read-only mode for the impacted service until headroom is restored.
Task 10: Check if a special vdev is the real “full”
cr0x@server:~$ zpool list -v tank
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 27.2T 26.6T 640G - - 72% 97% 1.00x ONLINE -
raidz2-0 27.0T 26.1T 900G - - 70% 97% - ONLINE
special 200G 198G 2.0G - - 15% 99% - ONLINE
Meaning: The special vdev is 99% full. If metadata and small blocks land there, you can get allocation failures even with some free space elsewhere.
Decision: Treat this as urgent. You likely need to add capacity to the special vdev (mirror another device and attach, or rebuild a larger special vdev) and/or adjust special_small_blocks strategy. Quick fixes are limited.
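If the plan is to grow the special class, the usual path is adding another special mirror. Treat the following as a sketch with placeholder device names, and note that on a raidz pool a special vdev generally can’t be removed later, so get the sizing right before committing:
cr0x@server:~$ sudo zpool add -n tank special mirror /dev/nvme2n1 /dev/nvme3n1
cr0x@server:~$ sudo zpool add tank special mirror /dev/nvme2n1 /dev/nvme3n1
The -n form prints the layout that would result without changing anything; review it first. zpool may also warn about a replication-level mismatch against the raidz vdevs and ask for -f; read that warning before overriding it.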
Task 11: Free space by destroying snapshots (safely and in the right order)
cr0x@server:~$ zfs destroy tank/data@autosnap_2025-12-23_0000
cr0x@server:~$ zfs destroy tank/data@autosnap_2025-12-24_0000
cr0x@server:~$ zfs destroy tank/data@autosnap_2025-12-25_0000
Meaning: Snapshot destroy is asynchronous in effect; you may not see immediate free-space relief if the system is busy, but it should trend in the right direction.
Decision: Delete oldest first when retention permits. Stop when you’ve regained safe headroom (more on “how much” below), then fix snapshot policy so you don’t do this again at 3 a.m.
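For the next round of pruning, a dry run first is cheap insurance: the % syntax addresses a contiguous range of snapshots on one dataset, and -nv reports what would be destroyed and how much space would come back without deleting anything.
cr0x@server:~$ zfs destroy -nv tank/data@autosnap_2025-12-23_0000%autosnap_2025-12-25_0000
If the “would reclaim” total is far smaller than the USED figures suggested, those blocks are shared with snapshots outside the range, and you’ll need to widen it or accept smaller gains.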
Task 12: Confirm space reclaimed at the pool level
cr0x@server:~$ zpool list tank
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 27.2T 25.9T 1.30T - - 69% 95% 1.00x ONLINE -
Meaning: You gained roughly 0.7 TB of free space. Still tight, but you’ve stepped back from the ledge.
Decision: Keep going until you have operational headroom (often 15–20% on busy pools). If you can’t reach that through deletion, you need capacity expansion or data migration.
Task 13: Check if a dataset is still blocked by quota even after pool headroom improves
cr0x@server:~$ zfs list tank/data
NAME USED AVAIL REFER MOUNTPOINT
tank/data 19.9T 100G 19.9T /tank/data
Meaning: The pool has more free space now, but the dataset has only 100 GB available—consistent with a refquota limit.
Decision: Raise refquota if appropriate, or redistribute data to another dataset with more headroom.
Task 14: Temporarily raise a refquota (controlled, logged, reversible)
cr0x@server:~$ sudo zfs set refquota=22T tank/data
cr0x@server:~$ zfs get -o name,property,value,source refquota tank/data
NAME PROPERTY VALUE SOURCE
tank/data refquota 22T local
Meaning: The dataset can now reference up to 22 TB, assuming the pool has the space.
Decision: Use this as a stopgap. Then implement capacity planning and guardrails so this doesn’t become “just bump it again” culture.
Task 15: Find top space consumers inside a dataset (when you need to delete something real)
cr0x@server:~$ sudo du -xhd1 /tank/backups | sort -h | tail -n 10
120G /tank/backups/tmp
540G /tank/backups/staging
1.8T /tank/backups/daily
2.4T /tank/backups/weekly
Meaning: /tank/backups/weekly dominates. This is where pruning will matter.
Decision: Delete from the least business-critical retention tier first. Then adjust the backup job so it doesn’t create unbounded growth (often it’s “kept forever” by accident).
Task 16: Verify that snapshot retention tooling isn’t immediately recreating the problem
cr0x@server:~$ systemctl status zfs-auto-snapshot.timer
● zfs-auto-snapshot.timer - ZFS auto-snapshot timer
Loaded: loaded (/lib/systemd/system/zfs-auto-snapshot.timer; enabled; preset: enabled)
Active: active (waiting) since Thu 2025-12-26 09:05:12 UTC; 4h 12min ago
Trigger: Thu 2025-12-26 14:00:00 UTC; 42min left
Meaning: Automated snapshots are scheduled and enabled.
Decision: Don’t disable this blindly. Tune retention, exclude datasets that don’t need frequent snaps, and ensure pruning works. If you must pause it during recovery, set a reminder to re-enable.
Task 17: Confirm pool feature flags and version context (helps when you call for backup)
cr0x@server:~$ zpool get all tank | egrep 'feature@|ashift|autoreplace|autoexpand'
tank ashift 12 local
tank autoexpand off default
tank autoreplace off default
tank feature@spacemap_histogram enabled local
tank feature@allocation_classes active local
Meaning: You have allocation classes active (common with special vdevs), ashift=12, and no autoexpand.
Decision: When planning capacity changes, know whether autoexpand is disabled and whether allocation classes might be involved in “space available but not usable” behavior.
Three corporate mini-stories from the trenches
Mini-story #1 (wrong assumption): “Deleting the data will free the space.”
A mid-sized SaaS company ran a ZFS-backed analytics cluster. The on-call got paged for an API outage. Symptom: writes failing, database complaining about “no space left on device.” The pool showed 99% used. The on-call did what most of us would do under stress: deleted a directory full of old exports.
The pool usage didn’t change. Not even a percent. The on-call deleted more. Still nothing. Now they had a broken service and missing data that a customer might ask for later. The incident channel filled with confident guesses, which is always a bad sign.
The wrong assumption was simple: that filesystem deletions immediately free space. On ZFS with frequent snapshots, those blocks were still referenced. They had a snapshot policy that took hourly snapshots and retained them for longer than the data lifecycle. Great for restore points. Terrible for emergency cleanup.
The recovery was boring but strict: identify the largest snapshot consumers, confirm retention requirements with the data owner, destroy the oldest snapshots first, and only then delete more filesystem data. They regained headroom and restored service. The postmortem created a rule: “No deleting until you account for snapshots.” They added a dashboard showing snapshot used space by dataset, so the next on-call wouldn’t need a theology degree in ZFS to do triage.
Mini-story #2 (optimization that backfired): the undersized special vdev
A large internal platform team added a special vdev to speed up metadata-heavy workloads: millions of small files, lots of directory operations, and container layers. It worked beautifully. Latency dropped. Everyone celebrated and moved on.
Six months later, they hit a “pool full” incident where the pool still had measurable free space—hundreds of gigabytes. Yet new file creates failed sporadically. Renames stalled. Some datasets behaved normally; others melted down. Classic partial outage: the worst kind.
The special vdev was the culprit. It was sized optimistically, based on metadata estimates and a “small blocks” threshold that turned out to capture more than intended. It crept to 99% allocated, and then allocation for metadata became a choke point. The main data vdevs were fine, which made the symptoms feel paranormal.
The fix wasn’t a clever CLI trick. It was capacity: rebuild the special vdev with larger devices and correct the policy so “special_small_blocks” matched reality. They also learned an operational truth: special vdevs are powerful, but they are not optional infrastructure. You monitor them like you monitor the pool itself, and you plan their growth like you plan your database growth—because it is, effectively, database growth.
Mini-story #3 (boring but correct practice): quotas + retention + headroom saved the day
A finance-adjacent enterprise (heavy compliance, heavy paperwork, heavy everything) ran ZFS for file services and backups. They were not the “move fast” type. They had quotas on user datasets, refquotas on noisy apps, reservations for the database, and a snapshot policy with explicit retention tiers. Most engineers privately rolled their eyes.
Then a vendor tool went rogue and started writing debug logs at an impressive rate. The pool began climbing. But the blast radius was contained: the tool’s dataset hit its refquota and started failing its own writes without consuming the whole pool. The database reservation held. User home directories stayed writable. The incident was limited to “that one tool is broken,” not “everything is broken.”
They cleaned it up in daylight. They pruned the offending dataset, fixed log rotation, and adjusted alerting on dataset-level available space, not just pool capacity. Nobody had to negotiate deletions of critical data at 2 a.m.
This is the part where the boring team wins. They didn’t avoid incidents; they made incidents local. Quotas and retention policies aren’t exciting, but they are operational armor.
Joke #2: The best time to tune snapshot retention was six months ago. The second-best time is before your deletion “does nothing” and you start bargaining with the filesystem.
Common mistakes: symptom → root cause → fix
1) “We deleted data but pool usage didn’t drop”
Symptom: rm completes; df and zpool list barely change.
Root cause: Snapshots still reference the blocks, or deleted files are held open by processes.
Fix: Check snapshot usage (zfs list -t snapshot) and open deleted files (lsof +L1). Destroy unneeded snapshots; restart processes holding deleted files; then verify pool free space changes.
2) “The pool has free space but the app says ENOSPC”
Symptom: Pool shows hundreds of GB free; a dataset or application still fails writes.
Root cause: Dataset refquota/quota limit, or a reservation elsewhere is fencing off space, or slop space prevents allocation.
Fix: Inspect quotas/reservations (zfs get quota,refquota,reservation,refreservation). Adjust the constraint deliberately, and keep enough pool headroom to avoid slop-space collisions.
3) “Small file operations fail first”
Symptom: Creating small files fails; large reads still work; some datasets behave worse.
Root cause: Special vdev full (metadata allocation pressure) or severe fragmentation.
Fix: Check special vdev allocation (zpool list -v). Add capacity or rebuild the special vdev larger; revisit special_small_blocks. Reduce fill level and fragmentation by restoring headroom.
4) “Everything is slow, even reads”
Symptom: High latency across the board near 95–100% full; TXG-related stalls; user complaints about “the system is frozen.”
Root cause: Allocation contention and fragmentation; heavy sync writes; device-level bottlenecks amplified under pressure.
Fix: Stop heavy writers, reclaim space, and consider temporarily reducing write rate (application throttles). After recovery, enforce headroom policy and monitor fragmentation trends.
5) “We destroyed snapshots but free space still doesn’t recover”
Symptom: Snapshot list shrinks, but pool free space doesn’t increase as expected.
Root cause: Space is held by other snapshots/clones; or heavy ongoing writes are consuming reclaimed space immediately; or reservations/quotas confuse expectations.
Fix: Check for clones (zfs get origin), verify the top writers (zpool iostat), and re-check reservations. Pause writers during cleanup so you can actually regain headroom.
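Clones are the usual hidden anchor here. A quick sweep lists every dataset whose origin property is set; the awk filter just drops entries with no origin (a sketch, assuming dataset names without spaces):
cr0x@server:~$ zfs get -r -H -o name,value origin tank | awk '$2 != "-"'
Anything that shows up keeps its origin snapshot’s blocks alive until the clone is destroyed or promoted.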
6) “We added capacity but performance didn’t improve”
Symptom: Pool is no longer full, but latency remains ugly.
Root cause: Fragmentation remains; special vdev still constrained; workload is sync-heavy; or device imbalance.
Fix: Restore meaningful headroom (not just 1–2%). Check fragmentation and special vdev usage. Consider workload-level changes (log batching, recordsize alignment) rather than expecting capacity alone to be a performance reset button.
Checklists / step-by-step plan
Phase 0: Stabilize (stop the bleeding)
- Freeze the biggest writer. Pause ingestion pipelines, disable runaway debug logs, stop batch jobs, or temporarily scale down writers.
- Confirm you’re solving the right constraint. Pool full vs dataset quota vs special vdev full vs open-deleted files.
- Communicate a simple status. “We are space constrained; we are reclaiming X; ETA for writes returning is Y.” Keep it factual.
Phase 1: Get immediate headroom (fast wins)
- Close deleted-but-open files. Use lsof +L1, restart services cleanly.
- Prune the worst snapshots first. Delete oldest, highest-USED snapshots that are not required.
- Delete disposable non-snapshotted data. Temp directories, staging data, cache, old build artifacts—only after verifying snapshots aren’t pinning them.
- Target a real headroom threshold. On busy pools, 10% is often still uncomfortable. Aim for 15–20% where feasible.
Phase 2: Restore normal operations (make writes safe again)
- Re-enable writers gradually. Watch zpool iostat and application error rates; don’t go from zero to full throttle instantly.
- Re-check quotas and reservations. Undo emergency changes carefully; confirm datasets have appropriate limits.
- Run a scrub if hardware was suspect. If you saw device errors, schedule a scrub and plan replacement if needed.
Phase 3: Fix the system (so this doesn’t recur)
- Implement snapshot retention you can explain. Hourly for a day, daily for a week, weekly for a month—whatever fits your business. But make it explicit, automated, and audited.
- Alert on dataset and pool headroom. Pool at 80/85/90% with escalating urgency; dataset avail below workload-specific thresholds.
- Capacity plan with growth rate, not vibes. Track weekly change in zfs list used space, snapshot deltas, and backup churn.
- If you use a special vdev, size it like production depends on it. Because it does.
Prevention: make “pool full” boring again
Headroom policy: pick a number and enforce it
If your pool hosts anything latency-sensitive or write-heavy, treat 80–85% as “yellow,” 90% as “red,” and 95% as “drop everything”. These thresholds aren’t moral judgments; they’re operational guardrails. Fragmentation and allocator pressure don’t care about your quarterly roadmap.
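A threshold is only a policy if something checks it. Here is a minimal sketch you could cron; the pool name, thresholds, and the logger tag are assumptions, and logger stands in for whatever your real paging path is:
#!/bin/sh
# Sketch: warn when the example pool crosses the agreed headroom thresholds.
POOL=tank
CAP=$(zpool list -H -o capacity "$POOL" | tr -d '%')
if [ "$CAP" -ge 90 ]; then
    echo "CRITICAL: $POOL at ${CAP}% capacity" | logger -t zfs-headroom
elif [ "$CAP" -ge 85 ]; then
    echo "WARNING: $POOL at ${CAP}% capacity" | logger -t zfs-headroom
fi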
Dataset design: localize failure
Put noisy workloads in their own datasets with refquota. Put critical systems (databases, VM images) in datasets with deliberate reservations only if you truly need guaranteed space. This is how you prevent a logging storm from taking out your database.
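In practice that means one dataset per noisy workload with its own cap; the names and sizes below are illustrative:
cr0x@server:~$ sudo zfs create -o refquota=500G tank/applogs      # new noisy workload gets its own fence
cr0x@server:~$ sudo zfs set refquota=200G tank/vendor-tool        # retrofit an existing offender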
Snapshot discipline: retention is a product feature
Snapshots are not “free backups.” They’re a retention system with a cost profile that spikes during high churn. Define:
- What you snapshot (not everything deserves the same policy)
- How often (hourly/daily/weekly tiers)
- How long you keep them (and who approves increases)
- How you prune (automated, verified, monitored)
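Pruning is the part that silently breaks, so audit it. A rough sketch that only prints snapshots older than a cutoff (the 30-day cutoff is an assumption; -p makes creation an epoch timestamp so the comparison stays trivial):
cr0x@server:~$ CUTOFF=$(( $(date +%s) - 30*24*3600 ))
cr0x@server:~$ zfs list -r -Hp -t snapshot -o name,creation tank | awk -v c="$CUTOFF" '($2+0) < (c+0) {print $1}'
If that list is long and your retention policy says it shouldn’t be, the pruning job is broken, not the policy.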
Special vdevs: monitor them like a separate pool
If you’ve adopted special vdevs, add alerts on their capacity and watch the effect of special_small_blocks. Many teams treat special vdevs as “a cache-ish thing.” It’s not. It’s an allocation class that can become your first hard limit.
Operational hygiene that pays off
- Log rotation that is tested (not just configured). Include the “process holds old fd” case.
- Backups that don’t explode snapshots. Some backup workflows cause massive churn; tune them.
- Dashboards that show: pool capacity, fragmentation, snapshot used by dataset, and dataset avail.
- Runbooks that include your organization’s retention/compliance rules so on-call doesn’t have to negotiate policy mid-incident.
FAQ
1) What actually happens when ZFS is completely full?
Writes start failing with ENOSPC, but not all writes fail equally. Metadata allocations, copy-on-write updates, and sync writes can fail or stall earlier than you expect. Performance usually degrades before hard failure.
2) Why does ZFS get slow near full even before errors?
Free space becomes fragmented. The allocator works harder to find suitable blocks, and I/O becomes more scattered. On spinning disks, this is especially painful; on SSDs, it still increases write amplification and latency.
3) Why didn’t deleting files free space?
Most commonly: snapshots still reference those blocks, or the files are deleted but still open by a process. Use zfs list -t snapshot and lsof +L1 to confirm which.
4) Should I destroy snapshots during an outage?
If the snapshots are the reason you can’t free space and retention policy allows it, yes—destroying snapshots is often the cleanest reclaim. But do it intentionally: identify highest-USED snapshots, confirm business requirements, and delete oldest first.
5) Is it safe to set quota/refquota higher to fix ENOSPC?
It’s safe mechanically, but it’s risky operationally. You’re shifting who gets to consume shared pool space. Use it as a temporary measure and pair it with a capacity/retention fix.
6) How much free space should I keep on a ZFS pool?
For general mixed workloads: aim for 15–20% headroom. For mostly-append workloads with low churn you can sometimes run tighter, but you’re buying risk and latency. If you can’t keep headroom, you need more disks or less data—pick one.
7) Can I “defragment” a ZFS pool after it gets too full?
Not directly like legacy filesystems. The practical method is to restore headroom and let new writes land better, or migrate data (send/receive) to a fresh pool for a real reset. Plan capacity so you don’t need heroics.
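The “fresh pool” reset is, in practice, replicated send/receive. A minimal sketch, assuming a destination pool named newtank already exists and a short write freeze at cutover is acceptable:
cr0x@server:~$ sudo zfs snapshot -r tank/data@migrate
cr0x@server:~$ sudo zfs send -R tank/data@migrate | sudo zfs receive -u newtank/data
Follow up with an incremental send of a final snapshot taken during the cutover window, then swap mountpoints. The data lands sequentially on the new pool, which is the closest thing ZFS has to defragmentation.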
8) What if the special vdev is full but the pool isn’t?
Then you have a metadata allocation choke point. You’ll need to address special vdev capacity (often by rebuilding/expanding it) and revisit which blocks you’re sending there. This is not a “delete a few files” situation.
9) Do reservations help or hurt in “pool full” situations?
Both. Reservations protect critical datasets from noisy neighbors, which can save the day. But they also reduce flexibility during emergencies. Use them sparingly and only with explicit ownership and monitoring.
10) Should I add capacity first or delete first?
If you can safely delete and reclaim meaningfully fast, delete first. If snapshots/compliance prevent deletion, or reclamation is slow compared to outage cost, add capacity first. Often you do both: add capacity to survive, then delete to restore healthy margins.
Next steps
A ZFS pool filling up is rarely a surprise; it’s usually a surprise only to the on-call who didn’t get the right signal early enough. When it happens, the winning move is to stop guessing. Confirm the constraint (pool vs dataset vs snapshots vs special vdev), reclaim space with the least irreversible actions first, and restore headroom to a level where ZFS can allocate sanely.
Practical next steps you can do this week:
- Add alerts on pool capacity, dataset avail, and snapshot used space by dataset.
- Write and enforce a retention policy that matches business reality (not optimism).
- Put noisy workloads behind refquotas so they can fail alone.
- If you run special vdevs, monitor and capacity-plan them like first-class storage, not an accessory.
- Run a game day: simulate a pool at 95%, walk the runbook, and see where your process falls apart before production does.
Because the only thing worse than a full pool is a full pool during a migration window you promised would be “low risk.”