ZFS Slop Space: Why Pools Feel Full Before 100%

ZFS has an uncanny habit: a pool hits “90% used” and suddenly behaves like it’s out of runway. Writes slow, allocations fail, and your monitoring lights up like a pinball machine. But zpool list still claims there’s space available. If you’ve ever muttered “ZFS is lying to me,” you’re not alone—and you’re not entirely wrong. ZFS is doing math that your mental model probably isn’t.

This is the story of slop space, plus the other invisible hands that make pools feel full before 100%: copy-on-write (CoW), metadata, block alignment, snapshots, reservations, special vdevs, and fragmentation. We’ll treat it like an operator’s guide, not a brochure: what’s happening, how to measure it, how to fix it, and how not to repeat the same outage with a different ticket number.

What slop space is (and what it isn’t)

Slop space is ZFS’s way of keeping you from driving a CoW filesystem into a wall. When a pool gets too full, ZFS becomes increasingly likely to hit allocation failures at the worst possible moment: while rewriting metadata, while committing transaction groups, or while trying to allocate a new block to replace an old one. CoW means “don’t overwrite in place”; it means “allocate new, then update pointers.” That update itself needs space.

Slop space is a reserved cushion that ZFS tries to hold back from general use so the pool can still do the “internal work” required to remain consistent and make forward progress. Think of it as the emergency lane on a highway: you can drive in it, but if everyone does, ambulances stop getting to crashes.

Operationally, slop shows up as: “Available” in zfs list being smaller than you expect, and sometimes a dataset refusing writes even though zpool list says there’s space. The details vary by platform (OpenZFS version, OS integration), but the operator lesson is consistent: don’t plan to use the last few percent of a pool, and don’t treat “100%” as a usable design target.

What slop space is not

  • Not a quota: quotas are per-dataset and enforced at the dataset boundary. Slop is about pool survival.
  • Not “lost space”: it’s still part of the pool. It can become available under certain conditions and it’s used for internal allocations.
  • Not snapshot space: snapshots are ordinary blocks held alive by references. Slop is a reservation-like behavior applied by the allocator.

Joke #1: If you think you can run a ZFS pool at 99.9% forever, you’re not an optimist—you’re an unpaid chaos engineer.

Why pools feel full before 100%

“Slop space” is the headline, but in real incidents it’s rarely the only actor. Pools feel full early because the remaining space isn’t equally usable for all future writes. ZFS needs space of the right shape: free extents that match block sizes, space on the right vdevs, and enough headroom to do CoW and metadata updates. Here are the practical causes, in the order they most often bite production systems.

1) CoW needs working room

On a CoW filesystem, a small user write can trigger a chain of allocations:

  • New data blocks
  • New indirect blocks (if the tree grows or must be rewritten)
  • New metadata blocks (block pointers, space maps, allocation classes)
  • Deferred frees (space reclaimed later, not immediately)

That’s why “I’m only writing 1 GB” can fail when the pool shows “2 GB free.” In the tail end of a full pool, the allocator may need to try many candidates, fragmenting further and consuming metadata. ZFS is not being dramatic; it’s trying not to corrupt itself.
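
A quick way to make CoW visible (a minimal sketch; tank/demo is a hypothetical dataset used only for illustration): write a file, snapshot the dataset, then overwrite the file in place and watch USED grow even though the file’s logical size never changes.

cr0x@server:~$ zfs create tank/demo
cr0x@server:~$ dd if=/dev/urandom of=/tank/demo/file.bin bs=1M count=1024
cr0x@server:~$ zfs snapshot tank/demo@before
cr0x@server:~$ dd if=/dev/urandom of=/tank/demo/file.bin bs=1M count=1024 conv=notrunc
cr0x@server:~$ zfs list -o name,used,usedbysnapshots,available tank/demo

The rewrite allocates roughly a gigabyte of new blocks while the snapshot keeps the old gigabyte alive, so the dataset now accounts for about twice the file’s size.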

2) Slop space: the pool-level safety margin

Slop space is typically calculated as a fraction of pool size with a cap, and the exact mechanics have changed across implementations and versions. The effect: some portion of free space is treated as effectively unavailable to ordinary datasets to preserve pool operability. When you see a dataset claim “0 avail” while the pool still has “some” free, you’re often looking at slop in action.

It’s not a fixed “3%.” On large pools it may be capped; on smaller pools it can be proportionally significant. In the real world, on small lab boxes it feels like ZFS is stealing your lunch money; on multi-hundred-terabyte pools it feels like a sensible tax for stability.
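
If you want to see the knob behind this (a sketch assuming OpenZFS on Linux; the path and the default can differ by version), the slop fraction is governed by the spa_slop_shift module parameter: roughly pool size divided by 2^spa_slop_shift is held back, subject to internal floors and caps.

cr0x@server:~$ cat /sys/module/zfs/parameters/spa_slop_shift
5

A value of 5 means about 1/32 of the pool (~3%). Raising the shift to “reclaim” that space is possible, but you’re trading a predictable cushion for unpredictable allocation failures under pressure.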

3) 128 KiB records, 4K sectors, and the tyranny of alignment

ZFS datasets default to a recordsize of 128 KiB. Most disks present 4K physical sectors; some present 512e; some flash devices have larger internal page sizes. ZFS aligns allocations to ashift, the vdev’s minimum allocation size expressed as a power-of-two exponent. A vdev with ashift=12 allocates in 4K units; ashift=13 means 8K, and so on.

If your workload writes lots of small blocks, the pool can “use” space faster than your file sizes suggest due to padding, parity overhead, and metadata. And if you chose an ashift that’s too large “for performance,” you might have also chosen to burn capacity permanently. (We’ll get to the story where that went sideways.)
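
Two quick checks before the decision becomes permanent (a minimal sketch; the device names and the pool name newpool are illustrative): ask the devices what they report, then set ashift explicitly at creation rather than trusting firmware that may advertise 512e.

cr0x@server:~$ lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sda /dev/nvme0n1
NAME     PHY-SEC LOG-SEC
sda         4096     512
nvme0n1      512     512
cr0x@server:~$ zpool create -o ashift=12 newpool mirror \
    /dev/disk/by-id/ata-DISK0 /dev/disk/by-id/ata-DISK1

ashift=12 (4 KiB) is the common safe choice for drives that under-report their physical sector size; there is no changing it later without rebuilding the vdev.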

4) RAIDZ parity and the “last lap” problem

RAIDZ allocation is clever, but it’s not magic. As free space becomes fragmented, ZFS has fewer choices to assemble full stripes. That can increase write amplification and reduce effective free space. The final 10–20% of a RAIDZ pool can feel like molasses: not because ZFS is lazy, but because “good” allocations are harder to find, and the allocator has to work harder to avoid pathological layouts.

Mirrors generally degrade more gracefully under fragmentation than RAIDZ, but they don’t escape CoW headroom requirements.
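
The overhead is easiest to see with a small worked example (illustrative arithmetic; assumes a 6-wide RAIDZ2 vdev with ashift=12, i.e. 4 KiB allocation units):

  • A single 4 KiB logical block needs 1 data sector plus 2 parity sectors.
  • RAIDZ pads each allocation to a multiple of nparity+1 sectors; 3 already is one, so no extra padding here.
  • Net result: 12 KiB on disk for 4 KiB of data, roughly 200% overhead. A full 128 KiB record on the same vdev spreads across complete stripes and pays close to the nominal parity cost (2 parity sectors per 4 data sectors).
  • The same 4 KiB block on a 2-way mirror costs 8 KiB: more than you’d like, but predictable and independent of stripe geometry.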

5) Metadata is real space

ZFS stores a lot of metadata: block pointers, dnodes, indirect blocks, spacemaps, checksums, and (depending on features) additional structures. The more files, snapshots, clones, and small blocks you have, the more metadata you carry.

If you add a special vdev for metadata and small blocks, the accounting changes: metadata may no longer consume “normal” space, but now you can hit a different wall—special vdev capacity. When the special vdev fills, the pool can be effectively dead in the water even if the main vdevs have room.

6) Snapshots: the space you can’t see until it’s too late

Snapshots don’t copy data at creation time; they pin blocks. You can delete a 10 TB directory and see almost no space return, because the snapshots still reference those blocks. This is where “pool feels full early” becomes “pool is full and we don’t know why.”

Snapshots also increase metadata and can increase fragmentation because freed blocks aren’t actually freed until all references go away.
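
The written property makes that churn visible over time (a sketch; it assumes the tank/vm dataset used in the examples later in this article): for a snapshot, written reports how much data changed between the previous snapshot and this one, which is exactly the kind of data that ends up pinned.

cr0x@server:~$ zfs list -r -t snapshot -o name,creation,used,written -s creation tank/vm

USED is what deleting that one snapshot would free right now; WRITTEN is how much changed during its interval. A snapshot with small USED but large WRITTEN usually shares its blocks with neighboring snapshots, so reclaiming the space means removing a range, not a single snapshot.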

7) Reservations, refreservations, and volblocksize gotchas

Reservations (for filesystems) and refreservations (often used for zvols) carve out space so a dataset can continue writing even under pressure. That’s good—but it can also make other datasets look full early because the remaining space is promised elsewhere.

Zvols add another layer: the volume’s block size (volblocksize) interacts with workload and ashift. A misfit can waste space and raise write amplification, which makes the “tail end” of the pool uglier.
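
Worth checking when zvols feel hungry (a minimal sketch; tank/vm/disk0 is a hypothetical zvol): volblocksize is fixed at creation, so a mismatch with the consumer’s block size means migrating the volume, not flipping a property.

cr0x@server:~$ zfs get -H -o name,property,value,source volblocksize,refreservation,compressratio tank/vm/disk0

Compare volblocksize against what the guest filesystem or database actually writes (4K, 8K, 16K), and remember that refreservation on a thick-provisioned zvol is space promised away from everyone else.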

8) Deferred frees and transaction group timing

ZFS batches work into transaction groups (TXGs). Frees can be deferred; space can be “logically freed” but not immediately reusable until the relevant TXG commits and syncing completes. In a pool under heavy write load, you can see a gap between what applications think they freed and what the allocator can actually reuse right now.
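
You can watch deferred frees drain via the pool’s freeing property (a sketch; the values shown are illustrative): right after a large destroy it is nonzero, then it shrinks toward zero over subsequent TXGs as the space becomes reusable.

cr0x@server:~$ zpool get -H -o name,property,value freeing tank
tank  freeing  1.21T
cr0x@server:~$ sleep 60; zpool get -H -o name,property,value freeing tank
tank  freeing  812G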

Joke #2: “It says 5% free” is the storage equivalent of “I’ll be there in five minutes”—it might be true, but you shouldn’t bet your production deploy on it.

Interesting facts and history

Some context points that explain why ZFS behaves the way it does—and why the “full pool” cliff is a known, engineered tradeoff rather than a bug.

  1. ZFS was born in the mid-2000s with end-to-end checksumming and CoW as core principles, trading away in-place overwrite simplicity for integrity and snapshot semantics.
  2. The “80% rule” predates ZFS: classic Unix filesystems (like UFS/FFS) reserved space (often ~10%) for root and to reduce fragmentation. Different mechanism, same goal: avoid death-by-full-disk.
  3. RAIDZ is not RAID5: ZFS’s RAIDZ integrates allocation and parity into the filesystem, avoiding the write hole through CoW and transactional semantics.
  4. ashift is forever: once a pool is created, the sector size choice is effectively permanent for that vdev. People learn this right after they learn what ashift means.
  5. Special vdevs changed the game: metadata and small blocks can be redirected to faster devices, but it also introduced a new “pool feels full” mode: special vdev exhaustion.
  6. Snapshot sprawl is a modern capacity killer: ZFS made snapshots cheap to create, which led to “snapshot everything” habits—and to the realization that deletion does not mean reclamation.
  7. Compression changed capacity planning: with lz4 and friends, “logical used” and “physical used” diverge, which complicates human intuition and makes “free space” feel inconsistent.
  8. Copy-on-write implies write amplification: especially under fragmentation and RAIDZ, small random writes can cost more physical IO and metadata updates than their size suggests.
  9. Space accounting evolved across OpenZFS versions: fields like usedby* and improvements in reporting made it easier to attribute space, but they also exposed just how many buckets space can fall into.

Three corporate-world mini-stories

Mini-story #1: The incident caused by a wrong assumption

They had a neat spreadsheet: pool size, expected growth, and a clean “alert at 90%” line. The pool was RAIDZ2, mostly sequential backups during the night, and a busy virtualization cluster during the day. Everyone agreed: 90% is aggressive, but manageable. The system had run that way for months, which is how bad assumptions earn tenure.

Then came the quarter-end. A batch job produced more data than usual, a few VMs were live-migrated, and a backup window overlapped with a snapshot retention job. Free space dipped into the uncomfortable zone, but the dashboard still showed “a few terabytes free.” The on-call engineer tried to be calm, because calm is cheaper than panic.

At 02:13, write latency climbed. At 02:17, a couple of VMs reported disk errors. At 02:19, the pool started throwing ENOSPC for a dataset that “should have had space.” It wasn’t lying: the dataset had reached the point where ZFS would no longer hand out the last chunk of pool headroom for ordinary allocations. Slop space plus fragmentation meant “free” wasn’t “allocatable.”

The wrong assumption wasn’t “90% is fine.” The wrong assumption was treating pool free space as a single number that predicts allocatability. They recovered by deleting short-lived snapshots, pausing backups, and moving a few VM disks off the pool. The postmortem changed two things: alerts triggered earlier, and every capacity report started including fragmentation and snapshot attribution—not just percent used.

Mini-story #2: The optimization that backfired

A storage team wanted to “get ahead” of performance complaints. They had NVMe devices sitting idle, so they added a special vdev to accelerate metadata and small IO. On paper, it was perfect: faster directory operations, lower latency for small reads, and a happier virtualization platform.

It worked—until it didn’t. Over months, the special vdev quietly filled. Not with user data, not with “big files,” but with metadata, small blocks, and the accumulation of “helpful” filesystem features. Nobody watched the special vdev’s allocation separately; the pool still had plenty of room on the main RAIDZ. Monitoring saw 60% pool usage and went back to sleep.

Then one day a routine operation—creating many small files during a CI build—started failing. ZFS needed to allocate metadata, and the class that stored that metadata was effectively out of space. The main pool had free space, but the allocator couldn’t place the required blocks where policy demanded. The team had optimized the hot path and accidentally created a single point of capacity failure.

The fix wasn’t glamorous: add more special vdev capacity (and mirror it properly), rebalance by moving some datasets’ small blocks back to normal allocation where possible, and set alerts specifically on special vdev usage. The lesson: every optimization creates a new resource to exhaust. If you don’t monitor it, you’re just inventing a new outage mode.

Mini-story #3: The boring but correct practice that saved the day

A different shop ran ZFS for databases and logs. Nothing exotic. Their storage lead had a rule: no pool should spend meaningful time above ~75% unless there’s a written plan for expansion. They also had another rule: snapshots must have explicit retention policies, and every dataset must declare whether it’s “snapshot-heavy” or “snapshot-light.”

One afternoon, a developer accidentally deployed a build that doubled log volume. The log dataset ballooned. The monitoring system didn’t just page on “pool at 85%.” It paged on “pool projected to hit 80% within 6 hours” and “snapshot space trend accelerating.” That’s not magic; it’s boring math and consistent labeling.

The on-call engineer throttled log ingestion, cut snapshot frequency for non-critical datasets, and increased retention for the database dataset only. They also had a pre-approved runbook step: temporarily raise a dataset quota for a critical service while lowering a less critical dataset’s quota. No emergency procurement. No midnight blame karaoke.

The pool never crossed the cliff where fragmentation and slop space become existential. The incident report was one page long, which is the best kind of incident report. The lesson: headroom is a feature, and boring rules are often the only ones that work under stress.

Practical tasks: commands + interpretation

The goal here isn’t to throw commands at the wall. It’s to build a mental model that matches ZFS’s accounting: pool-level reality, dataset-level promises, and “why can’t I allocate even though it looks like I should.” The examples assume OpenZFS on a typical Linux system, but the concepts translate.

Task 1: Check pool capacity and basic health

cr0x@server:~$ zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  87.2T  71.4T  15.8T        -         -    41%    81%  1.00x  ONLINE  -

Interpretation: CAP is the blunt instrument. FRAG is a warning light, not a verdict, but once fragmentation climbs, the “last 10–15%” becomes less usable. A pool at 81% with 41% frag can already feel tight for random-write workloads.

Task 2: See detailed vdev allocation and ashift

cr0x@server:~$ zdb -C tank | sed -n '1,120p'
MOS Configuration:
        version: 5000
        name: 'tank'
        vdev_tree:
            type: 'root'
            id: 0
            guid: 1234567890123456789
            children[0]:
                type: 'raidz'
                id: 0
                ashift: 12
                nparity: 2
                children[0]:
                    type: 'disk'
                    path: '/dev/disk/by-id/ata-DISK0'
                children[1]:
                    type: 'disk'
                    path: '/dev/disk/by-id/ata-DISK1'
                children[2]:
                    type: 'disk'
                    path: '/dev/disk/by-id/ata-DISK2'

Interpretation: ashift tells you the minimum allocation unit. If it’s larger than your actual physical sector needs, you’ve permanently reduced usable capacity. If it’s smaller than reality (rare with modern defaults), performance and write amplification can suffer.

Task 3: Compare pool vs dataset view of available space

cr0x@server:~$ zfs list -r -o name,used,avail,refer,mountpoint -S used tank
NAME         USED  AVAIL  REFER  MOUNTPOINT
tank        71.4T  9.82T    96K  /tank
tank/vm     52.1T  2.40T  42.0T  /tank/vm
tank/backup 18.9T  7.40T  18.9T  /tank/backup

Interpretation: Pool FREE was 15.8T, but the top-level dataset shows 9.82T avail. That gap is where slop space, reservations, and accounting realities live. The number that applications care about is often the dataset’s AVAIL, not the pool’s FREE.

Task 4: Inspect reservations and refreservations

cr0x@server:~$ zfs get -H -o name,property,value,source reservation,refreservation tank tank/vm tank/backup
tank        reservation     none    default
tank        refreservation  none    default
tank/vm     reservation     2T      local
tank/vm     refreservation  none    default
tank/backup reservation     none    default
tank/backup refreservation  none    default

Interpretation: A reservation reduces what everyone else can use. If a dataset is reserved and mostly empty, the rest of the pool may feel “full early.” This is often intentional, but it needs to be visible in capacity reports.
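
To see the promise in action (a minimal sketch; tank/scratch is a hypothetical dataset): setting a reservation immediately lowers AVAIL for every sibling, even though nothing has been written yet.

cr0x@server:~$ zfs list -o name,used,available tank/backup
cr0x@server:~$ zfs set reservation=2T tank/scratch
cr0x@server:~$ zfs list -o name,used,available tank/backup
cr0x@server:~$ zfs set reservation=none tank/scratch

The second listing shows tank/backup’s AVAIL roughly 2T lower; the last command undoes the change.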

Task 5: Attribute space to snapshots, children, and refreservation

cr0x@server:~$ zfs list -o name,used,usedbysnapshots,usedbydataset,usedbychildren,usedbyrefreservation tank/vm
NAME     USED  USEDBYSNAPSHOTS  USEDBYDATASET  USEDBYCHILDREN  USEDBYREFRESERVATION
tank/vm  52.1T            8.4T          43.7T             0B                   0B

Interpretation: 8.4T pinned by snapshots is not “freeable” until snapshots go. If you deleted VM images and space didn’t come back, this is usually why.

Task 6: List snapshots and find the heavy hitters

cr0x@server:~$ zfs list -t snapshot -o name,used,refer,creation -s used | tail -n 8
tank/vm@auto-2025-12-20  210G  39.8T  Sat Dec 20 02:00 2025
tank/vm@auto-2025-12-21  240G  40.1T  Sun Dec 21 02:00 2025
tank/vm@auto-2025-12-22  310G  40.5T  Mon Dec 22 02:00 2025
tank/vm@auto-2025-12-23  480G  41.0T  Tue Dec 23 02:00 2025
tank/vm@auto-2025-12-24  1.2T  42.0T  Wed Dec 24 02:00 2025
tank/vm@pre-migration     3.6T  38.2T  Wed Dec 24 18:11 2025
tank/vm@quarter-end       6.1T  37.9T  Thu Dec 25 01:05 2025
tank/vm@baseline          7.8T  36.4T  Fri Dec 12 03:00 2025

Interpretation: Snapshot USED shows unique blocks held by that snapshot. If one snapshot dwarfs the rest, it’s a prime candidate for review (not automatic deletion—review).

Task 7: Check compression ratio and logical vs physical usage

cr0x@server:~$ zfs get -H -o name,property,value,source compressratio,compression tank/vm
tank/vm  compressratio  1.45x  -
tank/vm  compression    lz4    local

Interpretation: Compression can mask growth until it doesn’t. When incoming data becomes less compressible (encrypted backups, already-compressed media, random blocks), physical usage can rise faster than your historical trend.

Task 8: Check free space fragmentation at pool level

cr0x@server:~$ zpool get -H -o name,property,value fragmentation,capacity,free tank
tank  fragmentation  41%
tank  capacity       81%
tank  free           15.8T

Interpretation: FRAG rising with CAP is normal; FRAG rising fast at moderate CAP often indicates a workload mismatch (VM random writes on RAIDZ, heavy snapshots, or constant churn).

Task 9: Inspect special vdev allocation (if present)

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
config:

        NAME                                STATE     READ WRITE CKSUM
        tank                                ONLINE       0     0     0
          raidz2-0                          ONLINE       0     0     0
            ata-DISK0                       ONLINE       0     0     0
            ata-DISK1                       ONLINE       0     0     0
            ata-DISK2                       ONLINE       0     0     0
        special
          mirror-1                          ONLINE       0     0     0
            nvme-NVME0                      ONLINE       0     0     0
            nvme-NVME1                      ONLINE       0     0     0

errors: No known data errors
cr0x@server:~$ zpool list -v tank
NAME          SIZE  ALLOC   FREE
tank         87.2T  71.4T  15.8T
  raidz2-0   84.7T  68.9T  15.8T
special          -      -      -
  mirror-1    2.5T   2.5T     0B

Interpretation: If special shows “0B free,” you have a problem even if the main vdev has space. The pool may fail allocations that require special-class blocks (metadata/small blocks depending on properties). This is the “optimization that backfired” pattern.

Task 10: Identify datasets using special allocation classes

cr0x@server:~$ zfs get -H -o name,property,value special_small_blocks tank tank/vm tank/backup
tank        special_small_blocks  0     default
tank/vm     special_small_blocks  16K   local
tank/backup special_small_blocks  0     default

Interpretation: If special_small_blocks is set, small blocks go to special. That’s great until special fills. Know which datasets depend on it.
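
If the special class is the one running out, one lever is to stop new small blocks from landing there (a sketch; this affects only blocks written after the change, existing data stays where it is):

cr0x@server:~$ zfs set special_small_blocks=0 tank/vm
cr0x@server:~$ zfs get -H -o name,property,value,source special_small_blocks tank/vm
tank/vm  special_small_blocks  0  local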

Task 11: Check quotas and refquotas that can create “feels full” at dataset level

cr0x@server:~$ zfs get -H -o name,property,value quota,refquota tank/vm
tank/vm  quota     55T   local
tank/vm  refquota  none  default

Interpretation: Quota limits total used (including snapshots depending on type); refquota limits referenced space. If a dataset hits quota, it looks full regardless of pool state.

Task 12: Confirm whether frees are being deferred (space not immediately reclaiming)

cr0x@server:~$ zpool iostat -v tank 2 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        71.4T  15.8T    210    980  82.1M  311M
  raidz2-0  68.9T  15.8T    200    940  80.4M  305M
  special    2.5T     0B     10     40  1.7M   6.1M
----------  -----  -----  -----  -----  -----  -----
tank        71.4T  15.8T    190   1020  75.2M  340M
  raidz2-0  68.9T  15.8T    180    980  73.6M  333M
  special    2.5T     0B     10     40  1.6M   6.6M
----------  -----  -----  -----  -----  -----  -----

Interpretation: Heavy sustained writes, especially alongside snapshot churn, can keep the pool in a state where space is “in motion.” If you delete data and immediately attempt a large write, you can still hit ENOSPC because the allocator can’t yet reuse freed space. This is where patience (or a controlled pause in workload) can matter.

Task 13: Reduce snapshot pressure safely (example workflow)

cr0x@server:~$ zfs destroy -nv tank/vm@baseline
would destroy tank/vm@baseline
would reclaim 7.8T
cr0x@server:~$ zfs destroy -v tank/vm@baseline
will destroy tank/vm@baseline
will reclaim 7.8T

Interpretation: Use -nv first: dry run plus verbose, so you see exactly what would be destroyed and how much space would come back before you commit. If you’re trying to escape a full-pool event, this is one of the cleanest levers, provided the snapshot is actually safe to remove.

Task 14: Spot “space promised elsewhere” via logicalused/available

cr0x@server:~$ zfs get -H -o name,property,value logicalused,used,available tank
tank  logicalused  102T
tank  used         71.4T
tank  available    9.82T

Interpretation: With compression, logical and physical diverge: roughly 102T has been written logically but occupies 71.4T physically. The available figure is the dataset-level answer to “can I write here,” and it already reflects slop space, reservations, and allocation-class constraints; if it looks tight while zpool FREE looks comfortable, that gap is where the promises live.

Fast diagnosis playbook

This is the “pager goes off” sequence. The goal is to identify which kind of “full” you have: pool full, dataset limited, snapshot pinned, reservation promised, special vdev exhausted, or fragmentation/alloc failure.

First: establish what kind of failure you’re seeing

  1. Application error: ENOSPC? EIO? timeouts? Don’t assume ENOSPC is “no bytes free”; it can mean “no allocatable space under policy.”
  2. Scope: one dataset or everything? One zvol or all filesystems?
  3. Timing: did it start after snapshotting, replication, a large delete, or adding a special vdev?

Second: check pool reality

cr0x@server:~$ zpool list tank
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  87.2T  71.4T  15.8T        -         -    41%    81%  1.00x  ONLINE  -
cr0x@server:~$ zpool status -x tank
pool 'tank' is healthy

Decision: If HEALTH is not ONLINE/healthy, stop and fix that first. Capacity games don’t matter if you’re degraded and resilvering.

Third: compare dataset avail with pool free

cr0x@server:~$ zfs list -o name,used,avail,mounted -r tank | head
NAME          USED  AVAIL  MOUNTED
tank         71.4T  9.82T  yes
tank/vm      52.1T  2.40T  yes
tank/backup  18.9T  7.40T  yes

Decision: If pool free is “reasonable” but the dataset avail is tiny, you’re likely dealing with slop, reservations, quotas, or special vdev capacity constraints.

Fourth: attribute space quickly

cr0x@server:~$ zfs list -o name,used,usedbysnapshots,usedbydataset,usedbychildren,usedbyrefreservation -r tank | head -n 15
NAME          USED  USEDBYSNAPSHOTS  USEDBYDATASET  USEDBYCHILDREN  USEDBYREFRESERVATION
tank         71.4T               0B            96K          71.4T                    0B
tank/vm      52.1T             8.4T          43.7T             0B                    0B
tank/backup  18.9T             1.1T          17.8T             0B                    0B

Decision: If snapshots dominate, reclaiming space means snapshot policy changes, not file deletes.

Fifth: check special vdev and allocation class constraints

cr0x@server:~$ zpool list -v tank
NAME          SIZE  ALLOC   FREE
tank         87.2T  71.4T  15.8T
  raidz2-0   84.7T  68.9T  15.8T
special          -      -      -
  mirror-1    2.5T   2.5T     0B

Decision: If special is full, treat it as a high-severity capacity incident. Freeing “normal” space won’t necessarily help if the allocator needs special-class blocks.

Sixth: check fragmentation and workload shape

cr0x@server:~$ zpool get fragmentation tank
NAME  PROPERTY       VALUE  SOURCE
tank  fragmentation  41%    -

Decision: High frag plus high CAP suggests you need headroom immediately. The fastest fix is usually to delete snapshot(s), move datasets, or expand the pool. Defragmentation is not a button you press; it’s a capacity and lifecycle discipline.
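
If you need to see where the fragmentation actually lives (a sketch; zdb output is verbose and its format varies by OpenZFS version), zdb can dump per-metaslab space map statistics, which show whether free space is a few large regions or thousands of slivers:

cr0x@server:~$ zdb -mm tank | head -n 40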

Common mistakes (symptoms and fixes)

Mistake 1: Treating “pool FREE” as “dataset can write”

Symptom: zpool list shows terabytes free, but writes fail on a dataset or zvol.

Likely causes: slop space threshold reached; dataset quota; reservation elsewhere; special vdev full.

Fix: Compare zfs list AVAIL for the target dataset; check quotas/reservations; inspect special vdev usage; free space in the right place.

Mistake 2: Snapshot-heavy datasets without retention discipline

Symptom: Deleting files doesn’t free space; usedbysnapshots grows; pool usage climbs “mysteriously.”

Fix: Use zfs list -t snapshot -o used to identify heavy snapshots; prune safely; redesign retention (fewer snapshots, shorter retention, or replication strategy that doesn’t pin forever).

Mistake 3: Adding special vdevs without sizing and alerting

Symptom: Pool seems fine, but metadata/small-file operations fail; special vdev shows near-full or full.

Fix: Monitor special vdev allocation; expand special vdev (mirrored); consider adjusting special_small_blocks for datasets; avoid making special the capacity choke point.

Mistake 4: Over-optimizing ashift or recordsize for “performance”

Symptom: Capacity is lower than expected; small-write workloads consume space quickly; performance degrades near full.

Fix: Choose ashift based on physical reality, not folklore. Tune recordsize/volblocksize per workload. If you already chose poorly, plan a migration; ashift isn’t a toggle.

Mistake 5: Running RAIDZ pools too hot with random-write workloads

Symptom: Latency spikes as the pool fills; fragmentation rises; allocations slow; “space left” becomes unusable.

Fix: Keep more headroom; consider mirrors for random-write heavy workloads (VMs, databases); use special vdev carefully; reduce snapshot churn.

Mistake 6: Assuming frees are immediate during heavy churn

Symptom: You delete a lot, but AVAIL doesn’t rise quickly; immediate writes still fail.

Fix: Allow TXGs to sync; reduce write pressure temporarily; verify that snapshots aren’t pinning blocks; avoid “delete then immediately rewrite the world” during an incident.

Checklists / step-by-step plan

Checklist A: Capacity report that matches reality

  1. Report zpool CAP, zpool FRAG, and zpool FREE.
  2. Report top datasets with USED, AVAIL, and usedbysnapshots.
  3. Report special vdev allocation separately (if present).
  4. Include quotas/reservations summary for “critical” datasets.
  5. Trend growth using physical used (not just logical) if compression is enabled. A minimal collection sketch follows this list.
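
A minimal collection sketch for Checklist A (the script name capacity-report.sh and its output layout are assumptions; adjust the columns to taste):

cr0x@server:~$ cat capacity-report.sh
#!/bin/sh
# Capacity report sketch: pool reality, per-vdev view (including the special class),
# and top datasets with snapshot/reservation attribution.
POOL="${1:-tank}"
echo "== pool =="
zpool list -o name,size,allocated,free,fragmentation,capacity "$POOL"
echo "== per-vdev =="
zpool list -v "$POOL"
echo "== top datasets =="
zfs list -r -S used \
  -o name,used,available,usedbysnapshots,reservation,refreservation,quota "$POOL" | head -n 15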

Checklist B: When a dataset says “no space left”

  1. Check the dataset’s AVAIL: zfs list dataset.
  2. Check quota/refquota: zfs get quota,refquota.
  3. Check reservation/refreservation: zfs get reservation,refreservation.
  4. Check snapshot pinning: zfs list -o usedbysnapshots and list snapshots.
  5. Check pool health and special vdev free space.
  6. If you need emergency space: delete snapshots with dry-run first; pause churny workloads; move data off-pool if possible.

Checklist C: Staying out of the “tail risk” zone

  1. Set alert thresholds on pool CAP (early) and on rate of change (earlier).
  2. Set alert thresholds on FRAG and on special vdev usage.
  3. Define a target maximum utilization per pool based on workload (VMs need more headroom than archives).
  4. Keep snapshot policies explicit, reviewed, and enforced.
  5. Test “delete data” scenarios: confirm space returns when expected (and learn when it doesn’t).
  6. Plan expansions before you need them; emergency expansions are how you end up with weird vdev geometry forever.

FAQ

1) Is slop space configurable?

Sometimes, depending on platform and OpenZFS version, there are tunables that affect the behavior. Operationally, treat slop space as a safety feature, not a budget line item. If you “tune it away,” you’re trading a predictable cushion for unpredictable failure modes under pressure.

2) Why does zpool list FREE not match zfs list AVAIL?

Because pool free space is not the same as allocatable space for a dataset. The difference can include slop space, reservations held for other datasets, and sometimes constraints like special vdev allocation classes. Always trust dataset AVAIL for “can I write here?” questions.

3) I deleted a huge directory. Why didn’t space come back?

Most commonly: snapshots. The blocks are still referenced by snapshots, so they can’t be freed. Check usedbysnapshots and list snapshots sorted by USED.

4) Does fragmentation actually matter in ZFS?

Yes, especially near high utilization and on RAIDZ. Fragmentation reduces the allocator’s options and increases work per allocation. It can convert “space free” into “space that exists but isn’t helpful,” particularly for large or stripe-aligned writes.

5) Is the “keep pools under 80%” rule real?

It’s a rule of thumb, not physics. Some workloads can run happily above 80% for long periods (cold archives with few rewrites). Others become miserable at 70% (VM random writes with heavy snapshots). The correct number depends on workload, vdev type, and operational tolerance for risk.

6) Can a special vdev filling up break the whole pool?

It can. If metadata or small blocks are directed to special and special is full, operations that require those allocations can fail even if main vdevs have free space. This is why special vdevs must be sized conservatively and monitored aggressively.

7) Do reservations help with slop space problems?

Reservations help guarantee a dataset can write when the pool is tight, but they don’t create space. They shift who gets to fail first. In multi-tenant pools, that’s often the correct move—just understand the trade.

8) Why do writes slow down dramatically near full?

Because allocation becomes harder, metadata updates increase, and the allocator must search for usable extents. On RAIDZ, assembling good stripes becomes more constrained. CoW means each write is “allocate new then commit,” which is more expensive when the pool is crowded.

9) If I add more disks, will the “slop” issue go away?

Adding capacity helps because it restores headroom and gives the allocator better options. But if the root cause is snapshot sprawl, quota misconfiguration, or special vdev exhaustion, adding disks may only delay the next incident.

10) What’s the single best habit to avoid “feels full” surprises?

Track snapshots and headroom as first-class metrics. “Used” is a lagging indicator; “what’s pinning space” and “how fast we’re approaching the cliff” are leading indicators.

Conclusion

ZFS pools feel full before 100% because ZFS is not a dumb block bucket. It’s a transactional CoW system that needs room to maneuver: room for new blocks, room for metadata, room to avoid fragmentation traps, and room to keep its promises (reservations) without collapsing under load. Slop space is the explicit version of that reality; snapshots, special vdevs, and RAIDZ behavior are the implicit versions.

If you take one operational lesson from this: capacity is not a single number. Runbooks that only check “percent used” are how you end up debugging a storage incident with the emotional toolkit of a weather forecast. Check allocatable space at the dataset level, attribute what’s pinning blocks, watch fragmentation and special vdevs, and keep enough headroom that ZFS can do the thing you hired it for: keeping your data intact while everything else is on fire.
