ZFS 80% Rule: Myth, Truth, and the Real ‘Danger Zone’


ZFS people love rules of thumb. They’re comforting, like labeling a breaker panel or pretending you’ll “tidy up that dataset layout later.” The most famous is the “ZFS 80% rule”: don’t let a pool go past 80% full, or performance falls off a cliff and misery follows. In production, you’ll hear it repeated like it’s physics.

The truth is better—and more useful. The 80% line is neither superstition nor universal law. It’s a crude boundary for a handful of real mechanisms: metaslab allocation, fragmentation, copy-on-write behavior, metadata growth, and the brutal math of “I need contiguous-ish free space to write efficiently.” The real danger zone is not a single number; it’s when your remaining free space stops being usable free space for your workload.

What the “80% rule” actually means

The “80% rule” is shorthand for: ZFS allocation gets harder as the pool fills, and once it gets hard enough, everything you care about (latency, write throughput, resilver times, snapshot deletes) can degrade nonlinearly.

That’s the key: nonlinear. ZFS isn’t gradually and politely slowing down like an elevator with too many people. It’s more like trying to park in a city when only the weird, tiny, half-blocked spots are left. You can still park, but your average “find a spot” time goes way up.

Two jokes, as promised, because storage people cope with humor and caffeine:

Joke #1: The 80% rule is like a diet: everyone agrees with it, and then Friday happens.

Joke #2: If you want to learn what “copy-on-write” really means, fill a pool to 95% and watch your application rewrite its résumé.

So is 80% right?

Sometimes. For many mixed workloads, “keep it under ~80%” is conservative and keeps you out of trouble. But it’s not a magical threshold. Some pools run happily at 85–90% because the workload is mostly sequential reads and large sequential writes, fragmentation is controlled, and there’s headroom elsewhere (fast special vdev for metadata, adequate RAM, sane recordsize). Other pools become miserable at 70–75% because they’re running random writes on small blocks with a snapshot-heavy retention policy and a bunch of small metaslabs due to narrow vdevs.

The goal isn’t to worship 80%. The goal is to understand what makes your pool “effectively full” before the “used” number hits 100%.

What actually changes as free space shrinks

At a high level, ZFS needs to find free blocks to satisfy allocations. As the pool fills, the remaining free blocks are more fragmented and unevenly distributed across metaslabs. Allocation decisions get more expensive and less optimal, which can translate into:

  • Higher write amplification (more I/O per unit of logical write).
  • More scattered writes (worse for HDDs, still not great for SSDs).
  • Longer transaction group (TXG) sync times (visible as latency spikes).
  • Slower frees (deleting snapshots can feel like pushing a piano uphill).
  • Longer resilvers and scrubs, especially when the pool is hot and fragmented.
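
If you want a low-effort way to watch for several of these symptoms at once, OpenZFS can report per-vdev latency directly. This is only a sketch: the pool name matches the example pool used later in this article, the interval is arbitrary, and exact column names vary a little between releases.

cr0x@server:~$ zpool iostat -l -v tank 5

The -l flag adds latency columns (total and disk waits plus the sync/async queue waits) to the familiar per-vdev view. If total wait climbs while bandwidth stays flat as the pool fills, you're watching the allocator work harder, not the disks getting slower.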

Quick history & facts that explain the folklore

Storage rules of thumb don’t appear by magic; they appear after enough people get burned in the same place. Here are concrete facts and context points that make the 80% rule make sense—without treating it like scripture:

  1. ZFS was designed for copy-on-write from day one. That means overwrites become “allocate new, then update pointers,” which needs free space even to modify existing data.
  2. The metaslab allocator favors low-fragmentation regions. As the pool fills, the “good” metaslabs get used up and the allocator increasingly deals with scraps.
  3. Early ZFS deployments were HDD-heavy. HDDs punish random writes and seeking; the “cliff” was obvious. SSDs soften the pain but don’t remove allocation overhead, nor do they remove metadata churn.
  4. RAIDZ changes the math. RAIDZ has parity and variable-stripe behavior. Small-block random writes can become read-modify-write cycles, and fragmented free space makes it worse.
  5. Snapshots are cheap until they aren’t. Taking snapshots is fast; retaining and deleting them at scale can turn “free space” into a complicated liability.
  6. 128K became a cultural default recordsize. That default is great for many sequential workloads, but it interacts badly with small random write workloads and snapshot churn if you don’t tune per dataset.
  7. Ashift mistakes are forever (for that vdev). Misaligned sector assumptions can waste space and I/O headroom, shrinking your practical free-space margin.
  8. Special vdevs changed metadata economics. Putting metadata (and optionally small blocks) on fast media can keep pools usable deeper into high utilization—if sized correctly. If not, it can become the new bottleneck.
  9. “df” and “zfs list” tell different truths. Traditional filesystems let you pretend; ZFS exposes more reality: snapshots, reservations, and referenced vs used matter.

The real danger zone: when free space stops being usable

If you want one operational sentence to replace “never exceed 80%,” use this:

The danger zone begins when the pool’s remaining free space can’t satisfy your allocation pattern efficiently, especially under peak write load and snapshot retention.

“Used” is not the same as “stress”

Two pools can both be 85% used and behave completely differently. Stress depends on:

  • Free space fragmentation: Do you have free blocks in large extents, or a confetti pile?
  • Write size and locality: Are you writing large sequential blocks or 4K random updates?
  • Vdev geometry: Mirrors behave differently than RAIDZ; more vdevs means more allocation “lanes.”
  • Snapshot churn: Frequent snapshots + overwrites means your frees are delayed and your live set grows “shadow copies.”
  • Metadata load: Millions of small files, extended attributes, ACLs, and dedup tables create metadata I/O that doesn’t show up as “big writes.”

A practical definition of “effectively full”

In production, I call a pool “effectively full” when any of these are true:

  • Application write latency becomes unpredictable (p99 and p999 spike), even though disks aren’t saturated on raw throughput.
  • TXG sync time grows and stays high under steady write load.
  • Snapshot deletions stall or take hours longer than usual.
  • Resilver time estimates become comical, especially during business hours.
  • You start “fixing” performance by rebooting, which is how you know you’re out of ideas and into denial.

Why the cliff feels sudden

The cliff is a feedback loop:

  1. The pool fills and fragments.
  2. Allocations get more scattered; writes take longer.
  3. Longer writes keep TXGs open longer; more dirty data accumulates.
  4. More dirty data means heavier sync work; sync latency rises.
  5. Applications see latency and retry or queue; load increases.

It’s not that ZFS “panics at 81%.” It’s that your workload crosses the point where the allocator’s choices stop being cheap.
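
One way to see this loop forming is to watch TXG history, which OpenZFS exposes as a kstat on Linux. A minimal sketch, assuming the zfs_txg_history module parameter is enabled on your build (recent releases keep a short history by default, but check your version):

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_txg_history
cr0x@server:~$ tail -n 20 /proc/spl/kstat/zfs/tank/txgs

Each row is one transaction group; the columns include how much dirty data it carried and how long it spent open, quiescing, and syncing (names vary slightly by version). Sync times that keep growing under a steady write load are the cliff announcing itself.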

Workload patterns that hit the wall first

VM and database random writes on RAIDZ

This is the classic. Lots of 8K–16K updates, sync writes, and overwrite churn. If it’s RAIDZ, small writes can trigger read-modify-write. Add snapshots and you’ve built a machine that manufactures fragmentation.

Snapshot-heavy file shares with frequent renames and small files

Metadata churn and delayed frees. Users don’t notice the first 50,000 snapshots because reads still feel fine. Then deletes, renames, and directory operations start to lag, and everyone blames “the network.”

Backups and object-ish storage with large sequential streams

These can tolerate higher utilization if the workload is append-mostly and you don’t rewrite in place. But beware: retention expiration (mass deletes) can be its own storm.

Containers and CI pipelines

Lots of small files, short lifetimes, overlayfs layers, build caches, and “delete everything” events. ZFS can do great here, but near-full pools turn the constant create/delete churn into allocator pain.

Space accounting: why “df” lies and ZFS doesn’t care

ZFS makes people angry because it refuses to keep the illusion simple. You can have 2 TB “free” in one view and be out of space in another. Usually, nobody is lying; you’re asking different questions.

Key terms that actually matter

  • USED: Space consumed by a dataset and its descendants, including snapshot space, which rolls up into the parent’s USED.
  • REFER: The amount of data currently accessible through the dataset; it can be shared with snapshots and clones, so it is not “unique” space.
  • AVAIL: Space the dataset can still use once quotas, reservations, and pool free space are accounted for.
  • Logical vs physical (logicalused vs used): compression and the copies property make the “logical” data size differ from the physical space it occupies.
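
A convenient way to see these numbers broken down per dataset is the built-in space view. Nothing exotic here; the columns are the standard ones and the pool name is the same example pool used throughout:

cr0x@server:~$ zfs list -o space -r tank

This prints AVAIL, USED, and then USED split into snapshot space (USEDSNAP), the dataset itself (USEDDS), refreservation (USEDREFRESERV), and children (USEDCHILD). When USEDSNAP dominates, deleting live files won’t give you much back; snapshot retention is where the space went.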

Why snapshots make free space feel haunted

With snapshots, deletes don’t necessarily free blocks. Overwrites allocate new blocks; old blocks remain referenced by snapshots. So your “delete the big file” reflex doesn’t buy you as much as it did on ext4. The pool can look like it has free space, but the allocator is still forced into awkward places because the free space it has is fragmented or unevenly distributed.

Three corporate-world mini-stories (from the trenches)

1) Incident caused by a wrong assumption: “80% is safe, so 79% is safe”

A mid-size enterprise ran a ZFS-backed NFS cluster serving home directories and a few build farms. The storage team had a dashboard with a nice green/yellow/red band: green below 80%, yellow above 80%, red above 90%. It looked professional, and everyone loved it because it reduced arguments to a color.

Then the pool hit 79% used. Still green. That week, the build farm’s artifact churn spiked due to a product launch, and the home directory share had a quarterly compliance snapshot policy. Nothing exotic: just a lot of small files being created and overwritten, plus snapshots taken hourly.

On Tuesday, the helpdesk started seeing “random” build failures. On Wednesday, the NFS clients began to stall on metadata operations: stat() calls, directory listings, file renames. The network team got paged because “NFS is slow,” and they did what network teams do: proved the network was innocent.

By Thursday, the storage team found the truth: the pool wasn’t “safe” at 79%. It was effectively full for that workload. Metaslabs had become fragmented enough that allocator searches and scattered writes were hammering latency. The 80% line wasn’t the cliff; it was a folk warning. Their dashboards were green while users were on fire.

The fix wasn’t magic. They freed space by expiring older snapshots, moved build artifacts to a dataset with tuned recordsize and compression, and—most importantly—changed alerting to track not just percentage used but also write latency, TXG sync behavior, and snapshot space growth. The new “green” band was based on behavior, not just a number.

2) Optimization that backfired: “We’ll crank recordsize and compress everything”

A different company ran ZFS for virtualization. They had SSDs, lots of RAM, and confidence—always a dangerous ingredient. Someone noticed the default recordsize was 128K and decided “bigger is better,” bumping it to 1M on the VM dataset. They also enabled aggressive compression everywhere because it looked great in the quarterly storage report.

At first, it was fine. Backups got smaller. The pool’s “used” number looked healthier. Then, as utilization climbed, random write latency started spiking. VM workloads that did small updates were now touching huge records, amplifying writes and metadata work. Compression added CPU overhead right when the system needed to make fast allocation decisions and keep up with sync.

The worst part: their monitoring mostly tracked throughput and percent used. Throughput looked okay. Percent used was below 80%. Meanwhile, the hypervisors were timing out on storage operations during peak hours. The business experience was “the platform is flaky,” which is the most expensive kind of performance bug.

They rolled back: VM disks moved to datasets with saner recordsize (and, for zvols, proper volblocksize), compression was kept but matched to CPU headroom, and they separated workloads by dataset instead of one-size-fits-all tuning. The lesson wasn’t “compression is bad” or “large recordsize is bad.” The lesson was: optimizations that help capacity reports can hurt allocation behavior near full pools.

3) Boring but correct practice that saved the day: reservations, headroom, and rehearsed cleanup

A financial services shop ran ZFS for a log analytics pipeline and internal file services. Nothing glamorous: lots of append writes, periodic compactions, and snapshots for fast rollback. The storage lead was allergic to heroic recoveries, so they built a dull plan: keep 20–25% headroom, enforce quotas, and create a “break glass” dataset with reserved space for emergencies.

It was unpopular. Teams hate quotas the way cats hate baths. But they agreed after a few meetings and some calm explanations about copy-on-write and snapshot retention. They also created a cleanup runbook: which snapshots to expire first, what datasets could be purged, and what had to be preserved. They tested it quarterly like a fire drill.

One year, a runaway logging loop started generating massive data. In a less disciplined environment, the pool would have hit 95% and then turned into a latency carnival. Here, alerts fired early (space trend + latency), quotas contained the blast radius, and the reserved “break glass” space ensured that critical services could still write while the team cleaned up. They executed the runbook without improvisation: expire non-critical snapshots, pause the offending pipeline, scrub for sanity, and then resume.

Nothing heroic happened. That’s the point. The boring practice didn’t make a great story at the company all-hands, but it prevented a great story on social media.

Practical tasks: commands, outputs, and how to read them

Below are hands-on tasks I actually use when a pool is “mysteriously slow” or approaching the danger zone. Commands are shown as if run on a typical Linux host with OpenZFS. Interpretations are included—because raw output without meaning is just decorative.

Task 1: Check pool capacity and health (the baseline)

cr0x@server:~$ zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank   54.5T  44.1T  10.4T        -         -    38%    80%  1.00x  ONLINE  -

Interpretation: CAP at 80% is not automatically bad, but it’s the moment to start paying attention. FRAG is an indicator, not a verdict. A FRAG of 38% can be fine or painful depending on workload.

Task 2: Get detailed vdev layout and spot an obvious bottleneck

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 07:12:33 with 0 errors on Sun Dec 21 02:10:14 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0

errors: No known data errors

Interpretation: Health is good. This doesn’t mean performance is good. If a single disk had high READ/WRITE errors or was intermittently dropping, you’d see it here—near-full pools amplify the pain of marginal hardware.

Task 3: Identify what’s consuming space (top datasets)

cr0x@server:~$ zfs list -o name,used,refer,avail,compressratio,mountpoint -S used | head -n 12
NAME                 USED  REFER  AVAIL  COMPRESSRATIO  MOUNTPOINT
tank                 44.1T   128K  10.4T  1.35x         /tank
tank/vm              18.7T  18.2T  10.4T  1.12x         /tank/vm
tank/home            12.4T   9.1T  10.4T  1.61x         /tank/home
tank/backup           9.8T   9.7T  10.4T  1.05x         /tank/backup
tank/containers       2.1T   1.9T  10.4T  1.48x         /tank/containers

Interpretation: USED rolls up descendant datasets and snapshot space; REFER is only the live data this dataset currently references. A big gap between USED and REFER often means snapshots are holding a lot of old blocks.

Task 4: Quantify snapshot space impact

cr0x@server:~$ zfs list -t snapshot -o name,used,refer,creation -S used | head -n 10
NAME                               USED  REFER  CREATION
tank/home@hourly-2025-12-24-2300   420G   9.0T  Wed Dec 24 23:00 2025
tank/home@hourly-2025-12-24-2200   390G   8.9T  Wed Dec 24 22:00 2025
tank/vm@daily-2025-12-24           210G  18.1T  Wed Dec 24 01:00 2025
tank/home@hourly-2025-12-24-2100   180G   8.8T  Wed Dec 24 21:00 2025

Interpretation: Snapshot USED is the exclusive space held by that snapshot. If a few snapshots are huge, you likely have heavy overwrite churn (VMs, databases, CI caches) in that dataset.

Task 5: Spot reservations and quotas that make “AVAIL” weird

cr0x@server:~$ zfs get -o name,property,value,source quota,refquota,reservation,refreservation tank/home tank/vm
NAME      PROPERTY        VALUE  SOURCE
tank/home quota           none   default
tank/home refquota        none   default
tank/home reservation     none   default
tank/home refreservation  none   default
tank/vm   quota           20T    local
tank/vm   refquota        none   default
tank/vm   reservation     2T     local
tank/vm   refreservation  none   default

Interpretation: Reservations carve out space even when the pool is tight. Great for protecting critical workloads, confusing if you forgot you set them.

Task 6: Check pool fragmentation and allocation classes

cr0x@server:~$ zdb -L -bbbs tank | head -n 30
Traversing all blocks to verify metadata...
Metaslabs:
    tank: 256 metaslabs, 38% fragmented, 80% capacity
    ...
Summary:
    blocks = 123456789
    alloc = 44.1T
    free  = 10.4T

Interpretation: Fragmentation here is pool-level and coarse, but it correlates with allocator pain. Rising FRAG plus rising write latency is your “danger zone” siren.
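
If zdb feels heavyweight for routine checks (it can generate a lot of read I/O on a large pool), a lighter option is to ask zpool itself for per-vdev numbers; same example pool as above:

cr0x@server:~$ zpool list -v tank

The -v output repeats SIZE/ALLOC/FREE/FRAG/CAP per vdev, which is enough to spot one vdev that is much fuller or more fragmented than its siblings—common after a new vdev was added to an already-full pool.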

Task 7: Inspect dataset properties that influence near-full behavior

cr0x@server:~$ zfs get -o name,property,value,source recordsize,compression,atime,sync,logbias,primarycache,secondarycache tank/vm
NAME     PROPERTY        VALUE     SOURCE
tank/vm  recordsize      128K      default
tank/vm  compression     lz4       local
tank/vm  atime           off       local
tank/vm  sync            standard  default
tank/vm  logbias         latency   local
tank/vm  primarycache    all       default
tank/vm  secondarycache  all       default

Interpretation: For VM datasets, recordsize and sync behavior matter. Near-full pools punish sync-heavy random writes; logbias and a correctly deployed SLOG can help, but they don’t fix fragmentation.

Task 8: Check zvol block size (common VM pitfall)

cr0x@server:~$ zfs get -o name,property,value,source volblocksize tank/vm/zvol0
NAME           PROPERTY      VALUE  SOURCE
tank/vm/zvol0  volblocksize  8K     local

Interpretation: A reasonable volblocksize can reduce write amplification. A mismatch (like 128K for 8K workload, or 4K for large sequential) can make “near-full” feel worse earlier.

Task 9: Watch real-time latency and IOPS per vdev

cr0x@server:~$ zpool iostat -v tank 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        44.1T  10.4T    820   1600   110M   95.2M
  raidz2-0  44.1T  10.4T    820   1600   110M   95.2M
    sda         -      -    120    250  17.0M  15.8M
    sdb         -      -    118    260  16.5M  16.2M
    sdc         -      -    130    245  17.2M  15.4M
    sdd         -      -    112    270  16.1M  16.8M
    sde         -      -    160    300  22.0M  18.5M
    sdf         -      -    180    275  21.2M  17.0M

Interpretation: Look for an outlier disk doing less work (or showing errors elsewhere). Also notice: bandwidth might look fine while latency is not. Use iostat and application metrics, not just throughput.
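
To back up the “bandwidth looks fine, latency is not” point, OpenZFS can also print request latency histograms. A hedged example—the interval is arbitrary and histogram buckets differ between releases:

cr0x@server:~$ zpool iostat -w tank 5

Watch whether the write histogram grows a long tail into tens or hundreds of milliseconds as the pool fills; averages hide exactly the spikes your applications complain about.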

Task 10: Check ARC health and whether you’re caching what matters

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:00:01   890   120     13    10    1    40    4    70    8   96G   112G
12:00:02   910   140     15    15    2    50    5    75    8   96G   112G
12:00:03   870   110     12     8    1    38    4    64    7   96G   112G

Interpretation: ARC misses aren’t automatically bad, but if metadata misses climb during directory-heavy workloads, near-full pools suffer more because every metadata operation triggers more scattered I/O.

Task 11: See if sync writes are dominating (and whether a SLOG is helping)

cr0x@server:~$ cat /proc/spl/kstat/zfs/zil
8 1 0x01 107 8880 1234567890 987654321
name                            type data
zil_commit_count                4    152340
zil_commit_writer_count         4    152338
zil_itx_count                   4    9832451
zil_itx_indirect_count          4    0

Interpretation: High commit activity suggests sync-heavy workloads. If you’re near full and sync-heavy, you’ll feel it. A SLOG can reduce latency for sync writes, but it won’t cure allocator fragmentation or snapshot-induced write amplification.

Task 12: Confirm ashift (capacity and IO alignment implications)

cr0x@server:~$ zdb -C tank | grep -E "ashift|vdev_tree" -n | head
109:        vdev_tree:
174:            ashift: 12
231:            ashift: 12

Interpretation: ashift=12 (4K) is common. If you accidentally built with too small an ashift for 4K-native drives, you can get performance and space inefficiencies that reduce your real headroom.

Task 13: Measure delete pain (snapshot destruction) safely

cr0x@server:~$ zfs destroy -nv tank/home@hourly-2025-12-24-2100
would destroy tank/home@hourly-2025-12-24-2100
would reclaim 180G

Interpretation: Dry-run shows reclaimable space. If destroys take forever in practice, that’s often a sign you’re deep into fragmentation and/or heavy metadata churn. Plan deletions during off-peak and avoid mass destroys during peak write periods.
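
When you need to estimate a whole retention window rather than a single snapshot, the percent-range syntax accepts a dry run too. The older endpoint below is hypothetical—pick real endpoints from your own zfs list -t snapshot output:

cr0x@server:~$ zfs destroy -nv tank/home@hourly-2025-12-22-0000%hourly-2025-12-24-2100

This reports the combined space the range would reclaim, which is usually more than the sum of the individual snapshots’ USED values, because blocks shared only among snapshots inside the range get freed as well.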

Task 14: Spot small-block workload candidates for special vdev (metadata/small blocks)

cr0x@server:~$ zfs get -o name,property,value,source special_small_blocks tank/home tank/containers
NAME            PROPERTY              VALUE  SOURCE
tank/home       special_small_blocks  0      default
tank/containers special_small_blocks  0      default

Interpretation: If you have a special vdev and enable small blocks to land there, you can reduce random I/O on HDD vdevs. But this is a design choice, not a band-aid: undersize the special vdev and you create a new “full pool” failure mode when it fills.

Fast diagnosis playbook (what to check first, second, third)

When someone says “ZFS is slow” and the pool is getting full, you need a repeatable sequence that finds the bottleneck before the meeting invites multiply.

First: confirm whether this is capacity pressure or something else

  1. Pool CAP and FRAG: Is CAP > ~80%? Is FRAG rising over time?
  2. Write latency at the app: Do you see p99 spikes that correlate with write load or snapshot operations?
  3. Errors/degraded devices: Any disk errors, slow devices, or a resilver in progress?
cr0x@server:~$ zpool list -o name,size,alloc,free,frag,cap,health
NAME  SIZE  ALLOC  FREE  FRAG  CAP  HEALTH
tank  54.5T 44.1T 10.4T  38%   80%  ONLINE

Second: determine whether you’re constrained by IOPS, bandwidth, sync, or CPU

  1. IOPS and bandwidth per vdev: zpool iostat -v 1 during the complaint window.
  2. Disk latency: Use OS tools to see await/service times; a near-full pool can look like “random I/O everywhere.”
  3. Sync pressure: NFS, databases, and hypervisors can force sync writes. Confirm with ZIL activity and dataset sync policies.
  4. CPU: Compression and checksum are usually worth it, until the box is CPU-starved during TXG sync.
cr0x@server:~$ iostat -x 1 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.30    0.00    6.10   22.40    0.00   59.20

Device            r/s     w/s   rkB/s   wkB/s  await  svctm  %util
sda              21.0   240.0   1800   16400   28.1   2.8    73.4
sdb              18.0   255.0   1600   17100   31.5   2.9    79.2

Interpretation: High await and %util imply the disks are the limiter. Near-full fragmentation tends to push more random I/O, which raises await.

Third: identify the dataset or behavior causing pressure

  1. Which datasets are growing fastest? zfs list -o used,refer and snapshot lists.
  2. Which snapshots are huge? Sort snapshots by USED.
  3. Is a single workload doing overwrite churn? VMs, CI caches, databases.
cr0x@server:~$ zfs list -t snapshot -o name,used -S used | head
NAME                               USED
tank/home@hourly-2025-12-24-2300   420G
tank/home@hourly-2025-12-24-2200   390G
tank/vm@daily-2025-12-24           210G
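
To answer “which datasets are growing fastest” without waiting for trend graphs, the written property (space written since the most recent snapshot) is a quick proxy. A sketch against the same example pool:

cr0x@server:~$ zfs list -r -o name,used,written -S written tank | head

Datasets at the top of this list are doing the most overwrite and append churn right now, which is usually where the allocator pressure is coming from.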

Common mistakes, specific symptoms, and fixes

Mistake 1: Treating 80% as a hard line instead of watching behavior

Symptom: Pool is “only” 75–80% used, but write latency spikes and deletes stall.

Fix: Add alerting on write latency, TXG sync time (via system metrics), and snapshot growth rate. Keep headroom based on observed workload. If your pool becomes unhappy at 75%, your rule is 75%.
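
A minimal sketch of trend capture, assuming a cron-friendly host and an arbitrary log path—this is not a monitoring system, just enough history to see the knee forming:

cr0x@server:~$ ( date '+%F %T'; zpool list -Hp -o name,capacity,fragmentation tank ) >> /var/log/zfs-capacity-trend.log

Run it every few minutes from cron, pair it with latency samples from zpool iostat -l and your application metrics, and graph the result. The point is to correlate capacity, fragmentation, and latency over weeks—a single snapshot of any one number won’t show you the trend.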

Mistake 2: Keeping a snapshot policy that grows without a deletion plan

Symptom: “We deleted 2 TB but got only 50 GB back.” Snapshot USED is large; deletes don’t free space quickly.

Fix: Audit snapshot retention by dataset, cap retention, and delete oldest first during off-peak. Consider separating churny workloads into their own dataset with a tailored snapshot cadence.

Mistake 3: Mixing incompatible workloads in one dataset

Symptom: Tuning one workload breaks another; recordsize/compression choices feel like compromise.

Fix: Split datasets by workload: VMs, backups, home dirs, containers. Tune per dataset (recordsize, atime, logbias, compression).
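
As a sketch of what “split and tune per dataset” looks like in practice—the dataset names here are hypothetical, and zstd requires OpenZFS 2.0 or newer:

cr0x@server:~$ zfs create -o recordsize=16K -o compression=lz4 -o atime=off tank/db
cr0x@server:~$ zfs create -o recordsize=1M -o compression=zstd -o atime=off tank/archive

Each dataset gets properties matched to its I/O pattern, and later tuning changes stay scoped to the workload that actually needs them.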

Mistake 4: Over-optimizing for capacity reports

Symptom: Great compression ratios, terrible latency under write load.

Fix: Keep compression sensible (lz4 is the usual “free win”), but measure CPU headroom. Don’t crank recordsize up for random-write workloads.

Mistake 5: Assuming a SLOG fixes “near full” performance

Symptom: Added a fast SLOG; random write latency still awful; snapshot deletes still slow.

Fix: SLOG helps sync write latency; it does not fix fragmentation, metadata pressure, or allocation search costs. Focus on headroom and dataset layout.

Mistake 6: Running RAIDZ for IOPS-heavy VM workloads without enough vdev width/count

Symptom: Acceptable performance at 50% used; painful at 75%; scrubs/resilvers devastate latency.

Fix: Mirrors (more vdevs) often provide better small-random IOPS. If staying RAIDZ, plan for more vdevs, consider special vdev for metadata, and keep more free space.

Mistake 7: Ignoring rebuild/resilver dynamics

Symptom: A disk replacement takes forever; performance during resilver is catastrophic.

Fix: Keep headroom, scrub regularly, and schedule heavy jobs around maintenance. Near-full pools make resilvers slower because allocation and free-space patterns are worse.

Checklists / step-by-step plan

Step-by-step: decide your real headroom target

  1. Measure your “normal”: Record p95/p99 write latency, TXG sync behavior (indirectly via latency spikes), and baseline scrub time at 50–60% utilization.
  2. Measure at higher utilization: As you pass 70%, 75%, 80%, compare those metrics weekly.
  3. Find the knee: The point where latency variance and operational tasks (snapshot deletes, scrub) degrade faster than capacity grows.
  4. Set policy: Your “rule” is where your workload stays boring. For some pools it’s 80%. For others it’s 70%. For some append-only backup pools, it can be higher with careful management.

Step-by-step: emergency free-space recovery without chaos

  1. Stop the bleeding: Identify the biggest writer and pause it if possible (CI cache, log loop, runaway backup job).
  2. Delete with intent: Prefer deleting snapshots that reclaim the most space first (but don’t mass-delete blindly at peak time).
  3. Protect critical datasets: Ensure they have reservations or quotas set appropriately so one team can’t starve everyone.
  4. Verify after reclaim: Confirm pool free space and that latency stabilizes; schedule a scrub if you had hardware issues.
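
For step 3, the mechanics are one-liners; the “break glass” dataset name below is hypothetical and mirrors the reserved-space idea from the third story above:

cr0x@server:~$ zfs set quota=3T tank/containers
cr0x@server:~$ zfs create -o reservation=500G tank/breakglass

The quota caps how much one churny tenant can take; the reservation guarantees that critical writes (or your cleanup tooling) still have room to work even when everything else reports “no space left.”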

Step-by-step: make “near full” less scary long-term

  1. Separate workloads into datasets and tune them (recordsize/volblocksize, compression, atime, logbias).
  2. Revisit vdev design (mirrors vs RAIDZ, number of vdevs, special vdev sizing if used).
  3. Implement trend-based alerting (space growth rate + latency), not just thresholds.
  4. Practice cleanup (snapshot retention enforcement, emergency reclaim drill).

FAQ

1) Is the ZFS 80% rule real?

It’s real as a warning, not as a universal threshold. Many pools become noticeably harder to allocate efficiently as they pass ~80%, but the actual “knee” depends on workload, vdev layout, and fragmentation.

2) Why does performance degrade so much near full?

Because ZFS must allocate new blocks for writes (copy-on-write), and as free space shrinks it becomes more fragmented and unevenly distributed. The allocator works harder and makes less optimal placements, increasing I/O and latency.

3) Do SSD pools ignore the 80% rule?

SSDs mask some symptoms (seek penalty) but don’t remove allocation overhead, metadata churn, or the impact of snapshot-heavy overwrites. Also, SSDs have their own performance cliffs when the device’s internal free space (overprovisioning) gets tight.

4) Is ZFS FRAG a reliable indicator?

It’s useful but not sufficient. A moderate FRAG value can be fine for sequential workloads and painful for random-write workloads. Use FRAG alongside latency metrics and workload understanding.

5) Will adding more RAM fix near-full pool performance?

More ARC helps reads and metadata caching, and can reduce some I/O. But it won’t make fragmented free space contiguous, and it won’t prevent allocator pain under heavy writes. Consider RAM a multiplier, not a cure.

6) Does a SLOG let me run the pool fuller?

A SLOG can improve sync write latency for workloads like NFS or databases with fsync. It does not fix general allocation inefficiency or snapshot-induced write amplification. It can help you survive, not break the laws of geometry.

7) What’s the safest way to reclaim space fast?

Stop or throttle the biggest writer, then reclaim space where it’s truly reclaimable—often by deleting high-USED snapshots. Use dry-run destroy to estimate reclaimed space and avoid doing massive destructive operations in the middle of peak load.

8) Can I defragment a ZFS pool?

Not in the traditional “run defrag” sense. You can reduce fragmentation over time by rewriting data (send/receive to a new pool, or replication to fresh vdevs) and by keeping headroom so allocator choices remain good.
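
A minimal sketch of the rewrite-elsewhere approach, assuming a second pool (here called newpool, hypothetical) with enough free space; the snapshot name is a placeholder:

cr0x@server:~$ zfs snapshot -r tank/vm@migrate
cr0x@server:~$ zfs send -R tank/vm@migrate | zfs receive -u newpool/vm

The received copy is written fresh, so it lands in large contiguous allocations. Follow up with an incremental send at cutover time, then point the workload at the new location—this is data rewriting, not in-place defragmentation.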

9) Why did deleting files not free space?

Because snapshots may still reference the old blocks. Deleting the live file removes one reference, but snapshots preserve another. Space is reclaimed when the last reference is gone—often meaning you must expire snapshots.

10) What utilization target should I choose?

Choose the highest utilization that keeps your system boring under peak load. For many general-purpose pools that’s 70–80%. For churny VM/DB workloads, lower is often safer. For append-only backups, you may push higher with careful monitoring and retention control.

Conclusion

The ZFS 80% rule is neither myth nor gospel. It’s an old scar turned into a slogan. The real engineering truth is that ZFS needs usable free space to keep copy-on-write allocations efficient, and “usable” depends on fragmentation, metaslab availability, workload write patterns, and snapshot churn.

If you run production storage, the winning move is not arguing about 80%. It’s defining your pool’s danger zone using behavior: latency variance, snapshot reclaim dynamics, scrub/resilver impact, and growth trends. Keep enough headroom to make allocation easy, keep policies boring (quotas, reservations, retention), and you’ll spend less time negotiating with a filesystem that has run out of good choices.
