ZFS: Why Your Pool Is “Full” at 70% (And How to Fix Space Planning)

You bought “20 TB raw,” built a RAIDZ pool, and felt smug. Then at ~70% used, things started to wobble:
writes slowed, scrub times grew legs, and your monitoring screamed “pool is full” while df claimed
you still had terabytes left. If this is your life, you’re not unlucky—you’re running ZFS as designed.

ZFS doesn’t treat “free space” like a dumb bucket. It treats it like a warehouse with aisles, forklifts, and
pallets that can’t be rearranged without work. When the aisles get tight, your forklifts stop being cute.

The 70% problem: what “full” really means in ZFS

ZFS pools often feel full well before they hit 100% used. The “magic number” people quote—70%, 80%,
sometimes 85%—isn’t superstition. It’s a rule of thumb for when allocation behavior turns from “mostly contiguous,
fast, and predictable” into “fragmented, metadata-heavy, and occasionally dramatic.”

Two things are simultaneously true:

  • ZFS can keep accepting writes above 70% of pool usage, and many pools run fine at 90%—until they
    don’t.
  • ZFS performance and reliability are strongly correlated with free space, especially for random
    writes, snapshots, RAIDZ, and metadata-heavy workloads.

“Full” in production is less about raw capacity and more about whether the allocator can find large enough
contiguous regions (and do so without burning CPU) to satisfy incoming writes while preserving redundancy and
performance.

Here’s the mental model that will keep you employed: free space is not fungible.
Ten terabytes of free space scattered in 128 KB crumbs is not the same as ten terabytes in big, clean chunks.
ZFS is copy-on-write. It doesn’t overwrite blocks in place; it writes new blocks and updates pointers.
That means it constantly needs new space for changes—even tiny ones.

One short joke, because the universe demands balance: ZFS doesn’t run out of space. It runs out of good
space—like a fridge full of condiments when you’re hungry.

So why 70%?

70% is not a hard limit; it’s a warning light. Above that, especially on pools with RAIDZ vdevs, the allocator has
fewer large extents to work with. Fragmentation rises, the amount of metadata churn increases, and every write
becomes a small negotiation with physics.

The “correct” threshold depends on:

  • RAID layout (mirror vs RAIDZ1/2/3)
  • vdev width
  • ashift choice
  • recordsize/volblocksize versus workload IO size
  • snapshot frequency and retention
  • special vdev usage (metadata/small blocks)
  • how often data is rewritten (databases vs archives)

If you want a policy that won’t embarrass you in front of a change review board: plan for
~70–80% maximum on RAIDZ-heavy pools that handle random writes or many snapshots, and allow
~80–90% only when you understand your workload and have a tested growth path.
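If you want that policy as something a script can enforce rather than a paragraph in a wiki, here is a minimal sketch. The layout names and thresholds are mine, lifted from the ranges above; they are not anything ZFS defines:

```shell
# Headroom policy sketch. Thresholds are this article's suggested
# defaults, not ZFS-mandated limits; layout labels are hypothetical.
headroom_target() {  # arg: layout -> max %CAP before expansion work starts
    case "$1" in
        raidz-churny)   echo 75 ;;  # RAIDZ + snapshots + random writes
        mirror-quiet)   echo 88 ;;  # mirrors, sequential, low churn (monitored)
        *)              echo 75 ;;  # unknown workload: be conservative
    esac
}

needs_expansion() {  # args: layout cap_percent -> yes/no
    [ "$2" -ge "$(headroom_target "$1")" ] && echo yes || echo no
}
```

So `needs_expansion raidz-churny 76` says yes, and the change review board gets a number instead of a debate.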

Facts and history that explain the behavior

Storage engineers love folklore. Here are concrete facts—useful ones—that explain why ZFS behaves differently than
traditional filesystems and RAID stacks.

  1. ZFS was born at Sun Microsystems in the early 2000s as a combined volume manager and filesystem,
    designed to avoid the “RAID controller + filesystem” blame game.
  2. Copy-on-write wasn’t invented by ZFS, but ZFS mainstreamed it in enterprise storage: it enabled
    cheap snapshots and strong on-disk consistency without fsck rituals.
  3. ZFS uses a pooled storage model: datasets are carved logically, not with static partitions.
    That flexibility also means quotas, reservations, and snapshots can create surprising “where did my space go”
    scenarios.
  4. RAIDZ is not “RAID5 in software”. It has variable-stripe writes and different allocation
    behavior. It can be efficient, but it’s also more sensitive to fragmentation than mirrors.
  5. The allocator works in metaslabs (chunks of space per vdev), and it has multiple strategies
    depending on fragmentation and available extents. As metaslabs fill, allocation becomes more expensive.
  6. “Slop space” exists on purpose: ZFS reserves a chunk of pool space to keep the system operating
    even when users attempt to fill it. That’s not wasted space; it’s an anti-panic mechanism.
  7. The 128 KB default recordsize has a backstory: it’s a throughput-friendly compromise for
    sequential workloads, not a universal truth for databases or VM disks.
  8. 4K sector drives changed the world: choosing ashift=9 vs ashift=12
    affects space efficiency and performance. Get it wrong and you pay forever.
  9. Special vdevs are newer in OpenZFS evolution: they’re powerful for metadata/small blocks, but
    they also create a new “you can fill the wrong thing first” failure mode.

None of this is trivia. It’s the reason your pool can be technically “not full” and practically unusable.
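A quick sketch of fact 6, the slop reserve. Assuming the usual OpenZFS formula (pool size divided by 2^spa_slop_shift, default shift 5, so 1/32 of the pool, clamped to roughly a 128 MiB floor and, on newer releases, about a 128 GiB ceiling), you can estimate the space the pool will never hand you:

```shell
# Back-of-envelope slop estimate, assuming the common OpenZFS formula:
# slop = pool_size / 2^spa_slop_shift (default shift 5 => 1/32 of pool),
# clamped to a minimum of ~128 MiB and, on newer OpenZFS, ~128 GiB max.
slop_estimate_gib() {  # arg: pool size in GiB -> reserve in GiB
    slop=$(( $1 / 32 ))
    [ "$slop" -lt 1 ] && slop=1       # stand-in for the 128 MiB floor
    [ "$slop" -gt 128 ] && slop=128   # cap on newer OpenZFS releases
    echo "$slop"
}
```

On the 27 TiB-class pools in this article, that is on the order of 100+ GiB that df will never give your application.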

Where the space goes: the mechanics behind the cliff

1) Copy-on-write turns updates into allocations

On a traditional overwrite-in-place filesystem, updating a block can reuse the same spot. On ZFS, updating a block
usually means writing the new block somewhere else, then atomically updating the pointer chain up the tree.
That’s fantastic for consistency. It’s also why ZFS needs breathing room.

When the pool is spacious, ZFS can choose nice, contiguous allocations. When the pool is crowded, the allocator
starts taking what it can get. Your “small update” becomes several fragmented writes plus metadata updates.
Multiply by a database doing 8K writes and you get the vibe.
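The vibe is quantifiable. A hedged sketch, assuming an uncached record and ignoring compression: COW rewrites the whole record, so the data-path amplification for a small update is roughly recordsize over IO size:

```shell
# Rough write-amplification for a small logical update under COW:
# the whole record is rewritten, so amplification ~= recordsize / io_size.
# Assumes the record is not cached and ignores compression and metadata.
amplification() {  # args: recordsize_kib io_kib -> multiplier
    echo $(( $1 / $2 ))
}
```

An 8 KiB database write on the default 128 KiB recordsize is a 16x data-path rewrite; on a 16 KiB recordsize it drops to 2x. That gap is why recordsize tuning matters later in this article.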

2) RAIDZ amplification: parity isn’t free, and neither are small writes

Mirrors are simple: write two copies, done. RAIDZ computes parity per block using variable-width stripes, so
unlike classic RAID5 it avoids read-modify-write on parity. The trade: each small block becomes its own narrow
stripe carrying full parity, and the allocation is rounded up to a multiple of (parity + 1) sectors so the
leftover free space stays reusable.

As free space decreases and fragmentation increases, ZFS is forced into smaller, more scattered allocations.
The result: more narrow stripes, proportionally more parity and padding per logical write, more IO operations,
more latency.
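A concrete way to see the small-write tax: in OpenZFS, each RAIDZ block carries its own parity, and the allocation is rounded up to a multiple of (parity + 1) sectors. A sketch for RAIDZ2 at ashift=12 (4 KiB sectors); the function name is mine:

```shell
# Illustrative on-disk cost of one block on RAIDZ2 (p=2 parity sectors),
# ashift=12 so sectors are 4 KiB. OpenZFS rounds RAIDZ allocations up
# to a multiple of (parity + 1) sectors to keep free space reusable.
raidz2_sectors() {  # arg: data_sectors -> total sectors allocated
    p=2
    total=$(( $1 + p ))
    rem=$(( total % (p + 1) ))
    [ "$rem" -ne 0 ] && total=$(( total + (p + 1) - rem ))
    echo "$total"
}
```

A 4 KiB block (1 data sector) allocates 3 sectors, so 12 KiB on disk; a 128 KiB block (32 sectors) allocates 36, about 89% efficient. Small-block workloads fill RAIDZ pools faster than the raw capacity math suggests.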

3) Fragmentation is not one number; it’s the allocator’s pain level

ZFS reports fragmentation at the pool level, but the allocator experiences fragmentation per metaslab.
A pool can show “only” 30% fragmentation and still have several metaslabs that are essentially allocation-hostile.

The key point: fragmentation rises naturally in COW filesystems when data is rewritten and snapshots exist.
It’s manageable with free space. It’s ugly without it.

4) Metadata and indirect blocks grow with snapshots and churn

Snapshots don’t copy data immediately; they preserve block versions. That means deleting files doesn’t necessarily
free space if a snapshot still references the old blocks.

Heavy churn + long snapshot retention is a fragmentation factory. Every rewrite leaves behind historical blocks
pinned by snapshots. Your “free space” becomes a museum of yesterday’s data.
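To find out who is running the museum, parse the parsable form of the snapshot listing. This sketch assumes `zfs list -t snapshot -Hp -o name,used` style input (-H drops the header, -p prints raw bytes); reading stdin keeps it usable against live output or a captured sample:

```shell
# Rank and sum snapshot-pinned space from "name used_bytes" lines
# on stdin, as produced by `zfs list -t snapshot -Hp -o name,used`.
top_pinned() {  # arg: N -> the N snapshots pinning the most space
    sort -k2,2n | tail -n "$1"
}
total_pinned() {  # -> total bytes pinned by all listed snapshots
    awk '{ sum += $2 } END { print sum + 0 }'
}
```

Pipe the live listing through `top_pinned 5` and you know exactly which snapshots to negotiate about first.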

5) The “slop space” reserve: your pool is not lying, you’re just not invited

ZFS keeps some space in reserve so the system can continue operating, delete data, and complete transactions.
This prevents the classic “disk full → system can’t write logs → system gets worse” cascade.

Practically, it means that when you’re very full, you may see confusing differences between:
zpool list, zfs list, and df. They’re not all reporting the same thing,
and some of that “free” space is not meant for your application.

6) Reservations, refreservations, and quotas: the silent space blockers

ZFS lets you reserve space for a dataset and its descendants (reservation), or for the dataset
alone, excluding snapshots and children (refreservation); non-sparse zvols get a refreservation by default.
This is a good tool, right up until someone sets a reservation “just to be safe” and forgets about it.

Quotas can also make “pool free” irrelevant for a dataset. The pool can have space; your dataset can still be
hard-stopped.
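The effective limit an application sees is the tighter of pool free space and its quota headroom. A toy model (names and units are mine, not ZFS properties):

```shell
# Effective writable space for a dataset: the tighter of pool free
# space and remaining quota headroom wins. Units are arbitrary;
# quota of 0 means "no quota set".
dataset_avail() {  # args: pool_free quota used -> writable space
    if [ "$2" -gt 0 ]; then
        q=$(( $2 - $3 ))
        [ "$q" -lt 0 ] && q=0
        [ "$q" -lt "$1" ] && echo "$q" || echo "$1"
    else
        echo "$1"
    fi
}
```

A pool with 100 units free does nothing for a dataset sitting 10 units under its quota: the dataset stops at 10.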

7) Special vdevs and small blocks: you can fill metadata before data

Special vdevs (often SSD) store metadata and optionally small blocks. If you undersize them, your pool can go into
a bad place: the special vdev fills up, and allocation becomes constrained even if the main data vdevs still have
room.

This is one of those features that is powerful in trained hands and a career-development opportunity in untrained
ones.

8) ashift and padding overhead: death by alignment

If you pick ashift too small relative to your drive’s physical sector size, you can get write
amplification and performance issues. If you pick it too large, you waste space on small blocks due to padding.
With lots of small files or metadata, that waste becomes visible early.
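The padding math is simple enough to sanity-check by hand: the minimum allocation is 2^ashift bytes, so every block rounds up to a whole sector. A sketch:

```shell
# Bytes actually allocated for a logical size at a given ashift:
# everything rounds up to a multiple of the 2^ashift sector size.
alloc_bytes() {  # args: ashift logical_bytes -> allocated bytes
    sector=$(( 1 << $1 ))
    echo $(( ( ($2 + sector - 1) / sector ) * sector ))
}
```

A 1 KiB file that occupied 1 KiB at ashift=9 occupies 4 KiB at ashift=12; multiply by millions of small files and “that waste becomes visible early” stops being rhetorical.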

Fast diagnosis playbook

When someone pings “ZFS is full” or “ZFS is slow” and you need to triage fast, don’t wander. Do this in order.
The goal is to identify whether you’re dealing with (a) real capacity exhaustion, (b) snapshot pinning, (c)
quota/reservation constraints, (d) special vdev saturation, or (e) allocator fragmentation.

First: confirm what “full” means

  1. Check pool usage and health: zpool list, zpool status
  2. Check dataset usage vs available: zfs list
  3. Check whether a quota/reservation is blocking writes: zfs get quota,reservation,refquota,refreservation

Second: find what’s pinning space

  1. Check snapshots and their “used” space: zfs list -t snapshot
  2. Check whether deletes aren’t freeing space because of snapshots: compare dataset used vs refer

Third: diagnose allocator pain

  1. Check fragmentation: zpool list -o fragmentation
  2. Check metaslab/special vdev saturation: zpool list -v, zpool status -v
  3. Check IO latency and queueing: zpool iostat -v 1

Fourth: decide whether you need immediate relief or structural changes

  • Immediate relief: delete snapshots, reduce retention, remove reservations, add space, move hot datasets.
  • Structural: redesign vdev width, move DB/VMs to mirrors, add special vdev correctly, tune recordsize/volblocksize, fix snapshot policy.
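The playbook above collapses into a decision function. The thresholds here are this article's rules of thumb, not ZFS constants, and the inputs are what you read off the zpool/zfs output in the steps above:

```shell
# Triage sketch mapping the playbook's checks to its five buckets.
# Thresholds are rules of thumb; flags are 1 if the check tripped.
triage() {  # args: cap_pct frag_pct quota_blocked snap_pinned special_cap_pct
    if [ "$3" -eq 1 ]; then echo "quota/reservation constraint"
    elif [ "$5" -ge 85 ]; then echo "special vdev saturation"
    elif [ "$4" -eq 1 ]; then echo "snapshot pinning"
    elif [ "$1" -ge 90 ]; then echo "real capacity exhaustion"
    elif [ "$2" -ge 50 ]; then echo "allocator fragmentation"
    else echo "no obvious storage cause; look elsewhere"
    fi
}
```

Encode your own thresholds once, and the 2 a.m. version of you doesn't have to re-derive them.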

Practical tasks: commands, outputs, and decisions (12+)

These are the real ops moves. Each task includes: command, example output, what it means, and what decision you make.
Commands assume OpenZFS on Linux, but the concepts apply broadly.

Task 1: Check pool capacity and basic health

cr0x@server:~$ zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  27.2T  20.1T  7.08T        -         -    41%    74%  1.00x  ONLINE  -

Meaning: The pool is 74% used and 41% fragmented. “ONLINE” is good; the allocator pain is not.

Decision: At ~74% with 41% frag, treat this as “approaching the cliff” for RAIDZ + churny workloads.
Start planning relief (space, snapshot cleanup, workload moves).

Task 2: Check per-vdev allocation balance (and catch a stealth bottleneck)

cr0x@server:~$ zpool list -v tank
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank        27.2T  20.1T  7.08T        -         -    41%    74%  1.00x  ONLINE  -
  raidz2-0  18.1T  14.9T  3.22T        -         -    48%    82%      -  ONLINE
  raidz2-1  9.06T  5.23T  3.83T        -         -    19%    57%      -  ONLINE

Meaning: One vdev is at 82% and highly fragmented while the other is at 57%. Allocation imbalance happens;
it hurts because the fullest vdev becomes the limiter.

Decision: Treat capacity planning at the vdev level, not pool level. Consider adding vdevs of equal size,
or migrating data to rebalance (send/receive), or rethinking the layout.

Task 3: Confirm dataset-level view (what apps experience)

cr0x@server:~$ zfs list
NAME              USED  AVAIL  REFER  MOUNTPOINT
tank              20.1T  5.92T   160K  /tank
tank/prod         18.4T  5.92T  2.50T  /tank/prod
tank/prod/vm      9.80T  5.92T  9.80T  /tank/prod/vm
tank/prod/db      6.10T  5.92T  6.10T  /tank/prod/db
tank/backups      1.62T  5.92T  1.62T  /tank/backups

Meaning: Datasets all show the same AVAIL because they share pool free space unless constrained by quotas.

Decision: If the app says “no space left” but AVAIL looks fine, suspect quotas/reservations or slop space,
not raw capacity.

Task 4: Catch quotas and reservations that are quietly choking you

cr0x@server:~$ zfs get -r quota,refquota,reservation,refreservation tank/prod
NAME          PROPERTY        VALUE     SOURCE
tank/prod     quota           none      default
tank/prod     refquota        none      default
tank/prod     reservation     none      default
tank/prod     refreservation  none      default
tank/prod/vm  quota           10T       local
tank/prod/vm  refquota        none      default
tank/prod/vm  reservation     none      default
tank/prod/vm  refreservation  9T        local

Meaning: tank/prod/vm is guaranteed 9T regardless of descendants, and capped at 10T.
This can starve other datasets or make “free space” look misleading.

Decision: If the pool is tight, remove or right-size refreservations unless they’re protecting something
mission-critical. Most environments overuse them.

Task 5: Identify snapshot space pinning (the classic “I deleted files and nothing happened”)

cr0x@server:~$ zfs list -t snapshot -o name,used,refer,creation -s creation | tail -n 6
tank/prod/db@auto-2026-02-03_0100    0B  6.10T  Mon Feb  3 01:00 2026
tank/prod/db@auto-2026-02-03_0200  18G  6.10T  Mon Feb  3 02:00 2026
tank/prod/db@auto-2026-02-03_0300  22G  6.10T  Mon Feb  3 03:00 2026
tank/prod/db@auto-2026-02-03_0400  25G  6.10T  Mon Feb  3 04:00 2026
tank/prod/db@auto-2026-02-03_0500  27G  6.10T  Mon Feb  3 05:00 2026
tank/prod/db@auto-2026-02-03_0600  31G  6.10T  Mon Feb  3 06:00 2026

Meaning: Snapshots are accumulating “USED” (changed blocks). This is space that cannot be freed until snapshots expire.

Decision: If you need emergency space, delete the biggest/oldest snapshots first (with care), then fix retention.
If “USED” is tiny, snapshots are not your main villain today.

Task 6: Find datasets with suspicious divergence between USED and REFER

cr0x@server:~$ zfs list -o name,used,refer,compressratio -r tank/prod | head
NAME           USED  REFER  COMPRESSRATIO
tank/prod      18.4T  2.50T  1.12x
tank/prod/vm   9.80T  9.80T  1.01x
tank/prod/db   6.10T  6.10T  1.37x

Meaning: tank/prod USED is much larger than REFER because snapshots/children are pinning space.

Decision: Investigate snapshot policy and churny datasets. Move VM and DB datasets into separate pools or vdev types
if they’re dominating rewrite traffic.

Task 7: Check pool fragmentation and capacity in one glance

cr0x@server:~$ zpool list -o name,size,alloc,free,cap,frag
NAME   SIZE  ALLOC   FREE  CAP  FRAG
tank  27.2T  20.1T  7.08T  74%  41%

Meaning: FRAG is a coarse signal. Past ~50% fragmentation on a busy pool is where “free space exists” but “fast writes do not.”

Decision: If FRAG rises along with CAP, prioritize adding space or reducing churn. Fragmentation rarely “heals” on its own.

Task 8: Observe real-time IO pressure and which vdev is hurting

cr0x@server:~$ zpool iostat -v tank 1 3
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        20.1T  7.08T    220    980  18.4M  96.7M
  raidz2-0                  14.9T  3.22T    170    770  14.1M  76.2M
  raidz2-1                  5.23T  3.83T     50    210   4.3M  20.5M
--------------------------  -----  -----  -----  -----  -----  -----

Meaning: raidz2-0 is doing most writes and is also the fuller, more fragmented vdev. That’s a performance trap.

Decision: If one vdev is hot and full, add capacity in a way that doesn’t worsen imbalance (add another similar vdev,
or migrate data off the hot vdev via replication to a new pool).

Task 9: Confirm ashift (space/perf overhead you can’t “tune later”)

cr0x@server:~$ zdb -C tank | grep -E "ashift|vdev_tree" -n | head -n 8
38:    vdev_tree:
49:            ashift: 12
73:            ashift: 12

Meaning: ashift: 12 means 4K sectors. Good for modern drives; it avoids misalignment pain.

Decision: If you discover ashift=9 on 4K drives, don’t “hope it’s fine.” Plan a migration rebuild.
It’s one of the few ZFS mistakes that sticks like gum.

Task 10: Check special vdev presence and whether it’s near full

cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
config:

        NAME                         STATE     READ WRITE CKSUM
        tank                         ONLINE       0     0     0
          raidz2-0                   ONLINE       0     0     0
            sda                      ONLINE       0     0     0
            sdb                      ONLINE       0     0     0
            sdc                      ONLINE       0     0     0
            sdd                      ONLINE       0     0     0
            sde                      ONLINE       0     0     0
            sdf                      ONLINE       0     0     0
          special
            mirror-1                 ONLINE       0     0     0
              nvme0n1                ONLINE       0     0     0
              nvme1n1                ONLINE       0     0     0

Meaning: A special vdev exists (mirrored NVMes). Now you must treat it as critical metadata storage.

Decision: Immediately check special vdev allocation and ensure monitoring alerts on it separately. If it fills,
the pool can become write-constrained in ugly ways.

Task 11: See special vdev allocation at the top level

cr0x@server:~$ zpool list -v tank | sed -n '1,8p'
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank        27.2T  20.1T  7.08T        -         -    41%    74%  1.00x  ONLINE  -
  raidz2-0  18.1T  14.9T  3.22T        -         -    48%    82%      -  ONLINE
  special    1.81T  1.55T   260G       -         -    12%    85%      -  ONLINE

Meaning: Special vdev is 85% full. That’s not “fine.” That’s a fire drill in slow motion.

Decision: Stop allocating small blocks/metadata there if it’s undersized (special_small_blocks),
add special capacity if your platform supports it safely, or migrate to a corrected pool design.

Task 12: Check recordsize and volblocksize (misalignment with workload causes amplification)

cr0x@server:~$ zfs get recordsize,compression tank/prod/db
NAME          PROPERTY     VALUE     SOURCE
tank/prod/db  recordsize   128K      default
tank/prod/db  compression  zstd      local

Meaning: DB dataset is using 128K recordsize. For many databases doing 8K/16K IO, this can increase read/modify/write and
fragmentation under churn.

Decision: Consider setting recordsize=16K (or aligned to your DB page size) for new data,
and validate with a benchmark. Don’t cargo-cult; test.

Task 13: Show compression ratio and decide whether compression is a capacity lever

cr0x@server:~$ zfs list -o name,used,refer,compressratio -r tank/prod | grep -E 'db|vm'
tank/prod/vm  9.80T  9.80T  1.01x
tank/prod/db  6.10T  6.10T  1.37x

Meaning: VM data is barely compressible; DB data is moderately compressible.

Decision: Compression won’t save VM pools. For DBs, it might buy time, but don’t treat it as a capacity plan.

Task 14: Identify large snapshots quickly (who’s hoarding yesterday)

cr0x@server:~$ zfs list -t snapshot -o name,used -s used | tail -n 5
tank/prod/vm@weekly-2026-01-05  412G
tank/prod/vm@weekly-2026-01-12  438G
tank/prod/vm@weekly-2026-01-19  451G
tank/prod/vm@weekly-2026-01-26  466G
tank/prod/vm@weekly-2026-02-02  489G

Meaning: Weekly VM snapshots are each pinning hundreds of GB. That’s expected with churn, but it’s also why your pool “fills early.”

Decision: Tighten snapshot retention for VM datasets or move VM snapshots to a dedicated backup target via replication.

Task 15: Validate that deletes can actually reclaim space (not blocked by holds)

cr0x@server:~$ zfs holds tank/prod/vm@weekly-2026-02-02
NAME                            TAG  TIMESTAMP
tank/prod/vm@weekly-2026-02-02  keep  Mon Feb  3 09:12 2026

Meaning: A snapshot hold exists. Someone “protected” it. Space won’t be freed until the hold is released.

Decision: Audit holds. If the hold is unjustified, release it and delete the snapshot. If it’s required for compliance, stop arguing with physics and add storage.

Task 16: Estimate logical vs physical usage (and catch hidden overhead)

cr0x@server:~$ zfs get -o name,property,value logicalused,logicalreferenced,used,referenced tank/prod
NAME       PROPERTY           VALUE
tank/prod  logicalused        19.9T
tank/prod  logicalreferenced  2.80T
tank/prod  used               18.4T
tank/prod  referenced         2.50T

Meaning: Logical used is higher than physical used due to compression, but the gap between USED and REFER indicates snapshot/child effects.

Decision: Use logical metrics for “how much the application thinks it has,” and physical used for “how full the pool really is.”

Three corporate mini-stories from the storage trenches

1) The incident caused by a wrong assumption: “Free space is free space”

A mid-sized company ran an internal analytics platform on a ZFS RAIDZ2 pool. It wasn’t fancy: a handful of big
datasets, nightly snapshotting, and a steady stream of ETL jobs. Someone noticed the pool hovered around 78% for
months and decided it was “safe” because it never hit 90%.

Then they added a new workload: a write-heavy job that rewrote a few terabytes daily. On paper, it fit. In
practice, the pool started timing out. Application logs showed sporadic “no space left” messages; the pool still
had multiple terabytes free. People blamed the database. People blame the database the way sailors blame the sea.

The real issue was allocator behavior under churn with snapshots. The ETL job wasn’t just writing new data; it was
rewriting existing blocks. Copy-on-write plus snapshots meant old versions stayed pinned. Free space became
fragmented, metaslabs became harder to allocate from, and RAIDZ had trouble assembling efficient stripes.
Latency climbed until the job’s retry logic did the rest.

The fix was unglamorous. They paused new snapshots for the churny dataset, trimmed retention, and moved the ETL
scratch space to a mirrored vdev pool designed for random writes. The lesson that stuck: “pool free” is not the
same as “pool healthy,” and %CAP isn’t a performance guarantee.

2) The optimization that backfired: “Let’s dedup; storage is expensive”

Another organization wanted to reduce storage spend for VM images and backups. They turned on deduplication on a
dataset with lots of similar-looking data. The first week looked great—until memory pressure started showing up,
and the storage boxes began swapping under load.

Dedup in ZFS is not a toggle you flip because it sounds efficient. It’s a design commitment: it creates a
deduplication table that must be consulted for writes, and it thrives on RAM and low-latency metadata access.
They didn’t have enough RAM, and their metadata lived on the same spinning disks as everything else.

As the pool filled, the dedup tables and metadata activity got worse. Latency rose, performance became spiky, and
snapshot deletion—normally a housekeeping task—turned into an overnight event. They hadn’t just made the pool
smaller; they made the pool harder to manage when space was tight.

They eventually disabled dedup (after migrating data, because you don’t just “turn it off” and reclaim everything),
leaned on compression instead, and redesigned backups to avoid storing multiple full copies. The optimization that
backfired wasn’t dedup itself—it was the assumption that capacity efficiency is always the right goal.

3) The boring but correct practice that saved the day: “Always keep headroom, always have a growth path”

A finance-adjacent shop ran customer-facing APIs backed by a ZFS pool, with strict latency SLOs. Their storage
engineer was not popular at budgeting time because they insisted on 25–30% free space targets and a written growth
plan. It sounded conservative. It was.

One quarter, a product change increased write amplification: more indexes, more small updates, more snapshot churn.
They saw the early signals—fragmentation creeping up, special vdev trending fuller, scrub times stretching—and
treated it as a planned event rather than an emergency.

They executed a pre-approved playbook: add a new vdev set, rebalance by migrating the noisiest dataset using
replication to a fresh pool, then adjust snapshot retention on the churn-heavy datasets. Users noticed nothing.
Leadership didn’t notice either, which is the highest compliment operations can get.

The boring practice wasn’t just “keep headroom.” It was instrumentation plus a decision rule: if CAP > 75% and
frag > 35% on a hot pool, start the expansion ticket. No heroics. No midnight math.

Common mistakes: symptom → root cause → fix

1) “I deleted a ton of data but space didn’t come back”

Symptom: Dataset shows lower live data, but pool ALLOC barely changes.

Root cause: Snapshots still reference the blocks; deletes only remove current references.

Fix: Identify high-USED snapshots (zfs list -t snapshot -o name,used -s used), delete or reduce retention, and remove holds if appropriate.

2) “Pool says 10% free, but writes fail with ENOSPC”

Symptom: Applications error; zpool list still shows some FREE.

Root cause: Slop space reserve, quota/refquota, or metaslab exhaustion on a vdev (pool-level free doesn’t help if one vdev is the limiter).

Fix: Check quotas/reservations; check vdev CAP individually; add capacity or migrate data. Don’t try to “ride it out.”

3) “Performance collapsed after we crossed ~80%”

Symptom: Latency spikes, throughput drops, CPU rises in kernel IO paths.

Root cause: Fragmentation + RAIDZ partial-stripe writes + metadata churn.

Fix: Create headroom (delete snapshots, add vdevs, move churny workloads). For the long term, use mirrors for random-write heavy datasets.

4) “One vdev is near full, but the pool isn’t”

Symptom: Pool CAP looks OK, but one top-level vdev shows very high CAP/FRAG.

Root cause: Allocation imbalance from adding mismatched vdev sizes or historical allocation decisions.

Fix: Add vdevs that match existing vdev sizes; consider data migration to a new pool for re-layout.

5) “Special vdev filled and now everything’s weird”

Symptom: Writes slow or fail; metadata ops stall; pool still has room on HDD vdevs.

Root cause: Special vdev undersized for metadata/small blocks; it’s a hard dependency.

Fix: Stop aggressive small-block offload, resize by adding special vdev capacity if safe, or rebuild the pool. Monitor special vdev CAP like it’s oxygen.

6) “We set reservations to be safe and now we’re out of space”

Symptom: Some datasets have space, others can’t grow; pool free seems wrong.

Root cause: Reservations/refreservations hoard space regardless of actual usage.

Fix: Remove or right-size reservations; keep them only for truly critical datasets with hard guarantees.

7) “We tuned recordsize and it got worse”

Symptom: More IO, more fragmentation, worse latency after a “performance change.”

Root cause: Recordsize/volblocksize mismatch with workload, plus existing data not rewritten to new sizes.

Fix: Benchmark with representative IO patterns; apply per-dataset; rewrite data if you need the new block geometry to matter.

Checklists / step-by-step plan

Space planning policy (what to do, not what to debate)

  1. Set a target cap per pool type:
    • RAIDZ + snapshots + random writes: target ≤ 75–80% used.
    • Mirrors + mostly sequential + low churn: target ≤ 85–90% used (with monitoring).
  2. Plan at the vdev level: a single “hot” vdev can define your failure threshold.
  3. Assume snapshots will grow: retention is a capacity consumer, not a checkbox.
  4. Document growth actions: what you add, how long it takes, and who approves it.
  5. Monitor special vdev separately: treat it like a pool inside your pool.

Emergency headroom plan (when you need space today)

  1. Confirm you’re not blocked by quotas/reservations; remove the accidental ones first.
  2. Delete the highest-USED snapshots you can safely remove; beware holds.
  3. Stop creating new snapshots temporarily for the churny datasets.
  4. Move temporary/scratch workloads off the constrained pool.
  5. Add capacity (new vdevs) if you can do it safely and quickly; otherwise replicate to a fresh pool.

Structural fix plan (when you don’t want this to happen again)

  1. Classify workloads: VM disks, databases, logs, backups, media archives.
  2. Match layout to workload:
    • Random write heavy: mirrors (and enough vdevs for IOPS).
    • Mostly sequential, capacity-heavy: RAIDZ2/3 with planned headroom.
  3. Set recordsize/volblocksize per dataset based on IO profile, not on vibes.
  4. Snapshot with intent: shorter retention for churny datasets, longer for mostly-append datasets.
  5. Automate “CAP + FRAG” alerts with a hard operational policy: when thresholds are crossed, expansion starts.
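Step 5, sketched as a filter. It assumes `zpool list -Hp -o name,capacity,fragmentation` style input (-H drops the header, -p drops the % signs), with CAP > 75 and FRAG > 35 as the trigger:

```shell
# Alert filter for "name cap frag" lines, one pool per line, as
# produced by `zpool list -Hp -o name,capacity,fragmentation`.
# Prints pools that crossed the expansion thresholds (75/35 here).
cap_frag_alert() {
    awk '$2 > 75 && $3 > 35 { print $1 " needs an expansion ticket (cap=" $2 "% frag=" $3 "%)" }'
}
```

Run it from cron, pipe the output to your pager of choice, and the expansion ticket opens before the users do.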

One reliability idea worth stealing

Paraphrased idea, attributed: Richard Cook’s “How Complex Systems Fail” argues that safety is an emergent property of the whole system, not of its individual components.

Space planning is that in miniature: you don’t get reliability by hoping pools stay neat; you get it by controlling headroom and change rate.

FAQ

1) Is the “keep ZFS under 80%” rule always correct?

No. It’s a policy knob. For RAIDZ pools with churn and snapshots, 70–80% is a solid default.
For mirror pools with mostly sequential writes and low churn, you can run higher—if you monitor fragmentation and latency and have a tested expansion path.

2) Why does ZFS slow down more than ext4 when space gets tight?

Copy-on-write means updates require new allocations, and RAIDZ wants large, well-aligned writes to be efficient.
When free space is fragmented, ZFS spends more work finding space, writes become less contiguous, and parity work increases.

3) Why do df and zfs list disagree about free space?

They report different layers. df reports what the mounted filesystem advertises, while zfs list reports dataset accounting inside the pool,
influenced by quotas, reservations, and slop space behavior.

4) Do snapshots consume space even if nothing changes?

A snapshot itself is cheap, but changes after the snapshot are what cost space, because old block versions must be preserved.
If a dataset is mostly append-only, snapshots grow slowly. If it rewrites data, snapshot “USED” grows fast.

5) Can I “defragment” a ZFS pool?

Not in-place like old-school defrag tools. The practical defrag is migration: zfs send | zfs receive to a fresh pool or dataset,
or rewriting data to new blocks. Headroom is the real anti-fragmentation tool.

6) Does adding a vdev fix fragmentation?

It often improves allocation behavior because new writes go to the emptier vdevs, but it doesn’t magically rewrite old fragmented block layouts.
It’s relief, not time travel.

7) What’s the biggest space planning mistake with special vdevs?

Undersizing them and then forgetting they’re now critical. If the special vdev fills, you can get severe performance issues or allocation failures.
Size them with margin, mirror them, and monitor them explicitly.

8) Should I use compression as a capacity strategy?

Use compression because it’s usually a net win (especially zstd) and can reduce IO.
But don’t build a plan that requires a specific compression ratio to stay solvent; data changes and “compressible” is not a contractual property.
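If you must lean on compression anyway, at least run the solvency check: how much physical space do you need if the ratio degrades? An integer-only sketch (ratio passed in hundredths; function name is mine):

```shell
# Physical TiB needed to hold a logical amount at a compression ratio,
# rounded up. Ratio is in hundredths (140 = 1.40x) to stay within
# POSIX integer arithmetic.
physical_tib_needed() {  # args: logical_tib ratio_x100 -> physical TiB
    echo $(( ($1 * 100 + $2 - 1) / $2 ))
}
```

14 TiB of logical data at 1.4x needs 10 TiB physical; let the ratio sag to 1.1x and the same data needs 13 TiB. That 3 TiB is the gap your capacity plan has to survive.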

9) Why do my weekly VM snapshots get huge?

VM disks churn: OS updates, log files, swap, databases inside VMs, and random writes. Snapshots preserve old blocks, so churn equals snapshot growth.
The fix is retention policy, replication strategy, and sometimes moving VM storage to a layout that tolerates churn better.

10) What’s the safest immediate action when the pool is nearing full?

Create headroom: reduce snapshot retention and delete high-USED snapshots you can safely remove, then add capacity.
Don’t wait for “100% used” as a trigger; ZFS can get operationally stuck before that.

Next steps you can do this week

  1. Set an explicit cap policy per pool (by workload and vdev type). If you can’t explain why it’s 85% instead of 75%, it’s 75%.
  2. Implement a “CAP + FRAG” alert using zpool list, and page humans before users page you.
  3. Audit snapshot retention on churny datasets (VMs, DBs). If snapshots are your backup strategy, fine—make the backup target pay for it, not production.
  4. Audit quotas/reservations. Remove anything that exists “just in case.” “Just in case” is how storage dies quietly.
  5. Check special vdev utilization. If it’s above 70–80%, treat it as urgent engineering work.
  6. Write a growth playbook: who approves expansion, how you add vdevs, and when you choose migration over expansion.

Second short joke, because you’ll need it during the next capacity review: The only thing more optimistic than sales forecasts is a storage pool planned to 99%.

The core discipline is simple: ZFS wants headroom the way databases want indexes—because the alternative is chaos,
and chaos always finds a maintenance window you didn’t schedule.
