The storage outage you remember wasn’t the one with the scary log messages. It was the quiet one: the pool hit 95%,
latency spiked, applications timed out, and everyone learned that “free space” is not a number—it’s a range with sharp edges.
Capacity planning on ZFS is less about arithmetic and more about avoiding irreversible design corners. You can fix many
sins later. You cannot un-bake a vdev layout without rebuilding the pool. So plan for growth like you actually intend to grow.
What “capacity planning” really means in ZFS
In a traditional RAID controller world, capacity planning is mostly “how many disks do we need” and “how long until we buy
more.” In ZFS, capacity planning is also:
- Layout irreversibility: vdev width and redundancy are the skeleton. Changing them later is usually a rebuild.
- Performance coupling: space, IOPS, and latency are linked through free space, fragmentation, and metaslabs.
- Operational safety margins: resilver time, URE risk, scrub windows, and snapshot retention are capacity topics.
- Failure domains: “more space” can quietly mean “bigger blast radius.”
Capacity planning is therefore a system design problem. It’s choosing constraints you can live with for years, not quarters.
Your future self will judge you by two things: whether expansion was boring, and whether outages were rare.
Facts and context that change decisions
A few concrete points—some historical, some practical—that tend to correct people’s mental models:
- ZFS was built at Sun to avoid “silent data corruption”, not just to pool disks. End-to-end checksums and self-healing are foundational.
- Copy-on-write (CoW) is why “full pools” get ugly: you need free space to write new blocks before freeing old ones.
- Early ZFS deployments were allergic to hardware RAID because hiding disk errors broke the checksum+repair model. That’s still true: ZFS wants direct disk visibility.
- RAIDZ was designed to avoid the write hole that can plague RAID5/6; ZFS maintains consistency through transactions and CoW.
- “Ashift” became a war story because 4K sectors arrived and people stuck with 512-byte alignment. You can’t change ashift later without rebuilding.
- Drive capacity grew faster than rebuild speed. Resilver time is now a planning input, not a post-mortem detail.
- ZFS has “slop space” (a reserved margin) specifically because administrators are optimistic and applications are ruthless.
- Snapshots are cheap until they aren’t: they cost space only when blocks diverge, but retention policies can turn “oops” into “out of space.”
- Special vdevs (the special allocation class) changed performance economics by letting you buy fast space for metadata instead of for everything.
One paraphrased idea worth keeping on the wall, attributed to John Allspaw: Reliability is a feature you build into systems, not a state you declare after launch.
Capacity planning is reliability engineering with a calculator and the humility to leave slack.
Model growth first: workloads, write amplification, and time
Start with the “what”, not the disks
Before you pick RAIDZ widths or mirror counts, decide what you’re storing and how it behaves:
- Block size profile: databases and VM images produce small random writes; media archives do large sequential writes.
- Churn rate: how often data is rewritten matters more than how big it is. CoW means churn creates fragmentation.
- Retention and lifecycle: snapshots, backups, and legal holds can double “effective” used space.
- Performance SLOs: p95 latency targets and rebuild time budgets should drive redundancy and vdev count.
Think in “usable now” vs “usable later”
Procurement loves raw TB. Operations lives in usable TB at acceptable latency. You want two numbers in your plan:
- Day-0 usable: what you can safely allocate on day one while staying within headroom targets.
- Day-N usable: what you can reach by adding vdevs, expanding disks, or both, without rebuilding the pool’s layout (a rough arithmetic sketch follows).
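A rough arithmetic sketch of the two numbers, assuming a hypothetical design of two 6-wide RAIDZ2 vdevs on 14 TB drives and an 80% headroom target (swap in your own numbers; real usable space is lower after slop space, metadata, and TB-to-TiB conversion):
cr0x@server:~$ vdevs=2; width=6; parity=2; disk_tb=14
cr0x@server:~$ echo "raw usable: $(( vdevs * (width - parity) * disk_tb )) TB"
raw usable: 112 TB
cr0x@server:~$ echo "day-0 budget at 80%: $(( vdevs * (width - parity) * disk_tb * 80 / 100 )) TB"
day-0 budget at 80%: 89 TB
Day-N is the same arithmetic run against layouts you can actually reach: a third matching vdev adds another (width - parity) × disk size, and nothing else in the plan should quietly assume more than that.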
Write amplification isn’t just for SSDs
ZFS amplifies writes in ways that matter for capacity:
- Parity overhead (RAIDZ): small random writes can become read-modify-write patterns if recordsize and workload misalign.
- Metadata growth: many small files, ACLs, xattrs, and snapshots inflate metadata.
- Copies and replication: copies=2, send/receive targets, and backup staging double-count quickly.
Plan with a safety factor. If you don’t know churn, assume it’s high. If you don’t know retention, assume someone will
ask for “just keep it longer” two weeks after you hit 85% utilization.
Joke #1: The only thing that grows faster than data is the number of teams claiming they “barely write anything.” It’s adorable.
Vdev choices that determine your future
Mirrors: the boring choice that wins more often than it loses
Mirrors are capacity-expensive and operationally forgiving. They:
- Scale IOPS by adding more mirror vdevs (each mirror vdev adds more independent heads).
- Resilver faster, because only allocated blocks are copied and there’s less parity math and less multi-disk coordination.
- Handle “one drive is weird” situations better. Latency variance is lower because reads can come from either side.
If your workload is random I/O heavy (VMs, databases, mixed containers), mirrors are usually the right answer unless
capacity cost is the only thing your organization can hear.
RAIDZ: capacity-efficient, but width is a long-term commitment
RAIDZ (single, double, triple parity) trades IOPS and resilver characteristics for usable space. It can be fantastic for
throughput-heavy or mostly-sequential workloads. But the key planning gotcha is vdev width.
Historically, you couldn’t expand a RAIDZ vdev by adding disks; you expanded by adding an entire new vdev. Modern OpenZFS
has RAIDZ expansion support, but availability and maturity depend on platform and version, and you still need to think about
operational complexity and performance during reshape. Treat it as “possible” not “inevitable.” If your platform doesn’t
support it, plan as if it doesn’t exist.
RAIDZ level: don’t be cheap with parity at large drive sizes
Large disks changed the math. Rebuild windows are longer, and the chance of a second failure during resilver is not theoretical.
RAIDZ1 is still used, but it should be reserved for cases where:
- You can tolerate higher risk,
- Rebuilds are fast (small disks, low utilization),
- And you have excellent backups and tested recovery.
For “business storage,” RAIDZ2 is the default recommendation. RAIDZ3 is defensible for very wide vdevs, very large disks,
or environments where drive failure correlation is a thing you’ve already seen.
Vdev count: capacity is pooled, performance isn’t
ZFS stripes across vdevs, not across individual disks in the way a RAID controller might. A pool with one wide RAIDZ vdev
is still one vdev. It has one vdev’s worth of concurrency for many operations. Add vdevs to add parallelism.
This is why “one 12-disk RAIDZ2” and “two 6-disk RAIDZ2” can feel wildly different under load. The second option has two
vdevs, more concurrency, and often better tail latency. The first option has a single failure domain and a single performance
envelope. Your monitoring will tell you which one you built.
Ashift and sector alignment: the tattoo you get once
Pick ashift=12 (4K sectors) as a baseline unless you have a very specific reason not to. Many “512e” drives
are lying politely. An incorrect ashift can cost performance and space forever.
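Setting it is a creation-time option, which is exactly why it belongs in the plan rather than in a post-incident review. A minimal sketch with hypothetical device names; confirm the drives’ physical sector size before trusting any default:
cr0x@server:~$ sudo zpool create -o ashift=12 tank raidz2 \
    /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 /dev/disk/by-id/ata-DISK3 \
    /dev/disk/by-id/ata-DISK4 /dev/disk/by-id/ata-DISK5 /dev/disk/by-id/ata-DISK6
cr0x@server:~$ zdb -C tank | grep ashift
            ashift: 12
The same -o ashift=12 applies when you later zpool add a vdev; a pool with mixed ashift values is a subtler version of the same tattoo.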
Headroom rules: why 80% is not a superstition
ZFS needs free space to stay fast and safe
The old guidance of “don’t exceed 80%” survives because it works. As pools fill:
- Allocation becomes harder; ZFS has fewer large free extents to choose from.
- Fragmentation increases; sequential writes become less sequential.
- Copy-on-write requires more temporary space, so metadata and block rewrites get more expensive.
- Resilver and scrub become slower because there’s more data and less I/O slack.
Treat “80%” as a default trigger for planning, not panic. “90%” is where you start canceling meetings.
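Turning those thresholds into something a machine watches is a one-liner. A minimal sketch for a cron or monitoring check; the 80 is an assumption, use your own trigger:
cr0x@server:~$ zpool list -H -o name,capacity | tr -d '%' | awk '$2 >= 80 {print $1 " at " $2 "% - start expansion work"}'
No output means nothing has crossed the line yet; wire the same check into alerting at your planning and execution thresholds so the trend becomes a ticket before it becomes an incident.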
Slop space: the margin you forget until it saves you
ZFS reserves some space (the “slop”) to prevent catastrophic full-pool behavior. It’s not there for your convenience; it’s there
to keep the pool functioning when humans do human things. Capacity plans should assume slop space is untouchable.
Quota and reservation strategy is part of capacity planning
Quotas and reservations are not “filesystem policy.” They are guardrails that prevent one dataset from eating the pool
and turning every other workload into a victim.
- Use quotas to cap blast radius.
- Use reservations (and refreservation) only when you must guarantee space for critical datasets.
- Prefer project quotas in environments with many subtrees and frequent team churn (a minimal guardrail sketch follows).
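A minimal guardrail sketch; dataset names and sizes are hypothetical, and the values should come from your plan, not from this page:
cr0x@server:~$ sudo zfs set quota=10T tank/projects        # caps the dataset plus children and snapshots
cr0x@server:~$ sudo zfs set refquota=8T tank/projects      # caps live (referenced) data; snapshot space doesn't count against it
cr0x@server:~$ sudo zfs set refreservation=2T tank/db      # guarantees space for a dataset that must keep writing
cr0x@server:~$ zfs get -o name,property,value quota,refquota,refreservation tank/projects tank/db
Quotas are cheap to adjust later; reservations quietly subtract from everyone else’s free space, which is why the default answer to “can we have a reservation” should be “prove it.”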
Snapshots, clones, and retention: the stealth capacity tax
Snapshots aren’t free; they’re deferred billing
Snapshots preserve old blocks. The more your data changes, the more snapshots pin space. For VM images and databases with
constant churn, snapshots can become a second copy over time. The trap is that the pool looks fine… until you delete data and
nothing frees because snapshots still reference it.
Retention must be explicit and enforced
“Keep daily snapshots for 30 days” sounds harmless until you have 50 datasets and 10 teams and one of them decides to
snapshot every 5 minutes. You need:
- a naming scheme,
- a retention policy per dataset class,
- and automated pruning that is treated as production-critical (a minimal pruning sketch follows this list).
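A minimal pruning sketch, assuming snapshots carry an auto- prefix and a 30-day policy (both are assumptions, not recommendations); it only prints the destroy commands so a human reviews the list before anything is deleted:
cr0x@server:~$ cutoff=$(date -d '30 days ago' +%s)
cr0x@server:~$ zfs list -H -p -r -t snapshot -o name,creation tank/vm | \
    awk -v c="$cutoff" '$1 ~ /@auto-/ && $2 < c {print "zfs destroy " $1}'
zfs destroy tank/vm@auto-2025-11-20-1200
zfs destroy tank/vm@auto-2025-11-21-1200
In production this logic belongs in your snapshot tooling with its own alerting, not in shell history; the point is that retention is an executable policy, not a wiki page.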
Clones: convenient, but they tangle accounting
Clones share blocks. That’s the point. It also means “used” becomes a question with multiple answers (logical used, referenced,
written, snapshot usage). If you don’t train people on this, someone will delete the “original” and be confused that space
didn’t return, or delete the “clone” and be surprised by what disappears.
Special vdevs, SLOG, L2ARC: capacity and failure domains
Special vdevs: the performance lever that can also brick your pool
Special allocation classes can store metadata (and optionally small blocks) on fast devices. Done right, they make HDD pools
feel less like it’s 2009. Done wrong, they create a new failure domain that can take the whole pool down.
Rules that keep you employed:
- Mirror special vdevs. Treat them like critical metadata (because they are).
- Size them with growth in mind. If they fill, performance drops and allocation behavior changes.
- Track special usage separately. It’s easy to miss until it’s too late (a sketch of adding and watching one follows).
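A minimal sketch of adding and then watching one, with hypothetical NVMe device names. In a pool whose data vdevs are RAIDZ, you generally cannot remove a special vdev later, so treat this as a design decision you rehearse, not a Friday tweak:
cr0x@server:~$ sudo zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
cr0x@server:~$ zpool list -v tank | grep -A3 special    # track allocation on the special class separately
If you also plan to redirect small blocks via special_small_blocks, size the devices for that growth as well; metadata alone is the conservative assumption.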
SLOG: not a write cache, and not for most people
A separate log device (SLOG) only helps synchronous writes. If your workload is mostly asynchronous, a SLOG is a fancy
placebo. If your workload is synchronous and latency-sensitive (NFS for VMs, databases with fsync), a good SLOG can
stabilize tail latency.
Capacity planning tie-in: SLOG size is usually small, but device endurance, power-loss protection, and mirroring matter.
A dead SLOG device in the wrong setup is a fast way to learn what “hanging sync writes” looks like.
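If your workload does qualify, the add itself is small. A minimal sketch assuming mirrored, power-loss-protected NVMe devices (names hypothetical):
cr0x@server:~$ sudo zpool add tank log mirror /dev/nvme2n1 /dev/nvme3n1
cr0x@server:~$ zpool iostat -v tank 5 1 | grep -A3 logs    # confirm the log devices actually take sync traffic
Mirroring matters because an unmirrored log device that dies at the wrong moment, together with a crash, is how recent sync writes quietly disappear.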
L2ARC: read cache that can steal memory and disappoint you
L2ARC is not “add SSD, get fast.” It’s “add SSD, then pay metadata costs in RAM.” Plan RAM before you plan L2ARC. If your ARC
is already under pressure, L2ARC can make things worse by increasing eviction churn.
Joke #2: L2ARC is like a gym membership—buying it feels productive, but the results depend on what you actually do.
Expansion paths without rebuilding (and what still forces rebuilds)
Expansion method #1: add vdevs (the classic)
Adding a new vdev is the most established growth path. It preserves existing vdev geometry and increases performance by
adding more parallelism. It also increases failure domain count: more disks means more failures over time, so redundancy
policy and monitoring matter.
Planning implication: design initial vdevs so that future vdevs can match them. Mixing wildly different vdev sizes and
performance profiles can cause uneven allocation and unpredictable behavior.
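A minimal sketch, using hypothetical disk names and matching the existing 6-wide RAIDZ2 geometry; the -n flag is a dry run that prints the resulting layout without changing the pool, which is exactly the review step you want before committing:
cr0x@server:~$ sudo zpool add -n tank raidz2 \
    /dev/disk/by-id/ata-NEW1 /dev/disk/by-id/ata-NEW2 /dev/disk/by-id/ata-NEW3 \
    /dev/disk/by-id/ata-NEW4 /dev/disk/by-id/ata-NEW5 /dev/disk/by-id/ata-NEW6
cr0x@server:~$ sudo zpool add tank raidz2 \
    /dev/disk/by-id/ata-NEW1 /dev/disk/by-id/ata-NEW2 /dev/disk/by-id/ata-NEW3 \
    /dev/disk/by-id/ata-NEW4 /dev/disk/by-id/ata-NEW5 /dev/disk/by-id/ata-NEW6
Expect ZFS to favor the new, emptier vdev for a while; that’s normal, and it’s also why adding a vdev at 95% full is far less pleasant than adding it at 75%.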
Expansion method #2: replace disks with larger disks (grow-in-place)
Replacing each disk in a vdev with a larger one and letting ZFS expand is common for mirrors and RAIDZ. It works, but:
- It’s slow: you resilver each disk, one at a time.
- It’s risky if your vdev is already stressed or highly utilized.
- You only get the new space after the last disk in the vdev is replaced (true for mirrors and RAIDZ alike), and only once autoexpand is on or the devices are expanded manually.
Planning implication: if you’re going to grow-in-place, ensure your resilver windows and spares strategy are realistic. “We’ll
just replace disks over the weekend” is how you end up rebuilding on a Tuesday.
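A minimal sketch of one replacement cycle (hypothetical device names; repeat per disk, and let each resilver finish before touching the next one):
cr0x@server:~$ sudo zpool set autoexpand=on tank
cr0x@server:~$ sudo zpool replace tank ata-WDC_WD140EDGZ-OLD /dev/disk/by-id/ata-BIGGER-NEW
cr0x@server:~$ zpool status tank | grep -E 'replacing|resilver'    # do not start the next disk until this is clean
With autoexpand=on, the extra capacity appears once the last member is replaced; if it doesn’t, zpool online -e against the vdev’s devices nudges the expansion.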
Expansion method #3: RAIDZ expansion (where supported)
If your platform supports RAIDZ vdev expansion, treat it as a tool, not a strategy. It can be useful for incremental growth
when adding whole vdevs isn’t feasible (a hedged command sketch follows the questions below). But you still need to ask:
- What’s the performance impact during reshape?
- How does it interact with your scrub schedule?
- What’s the rollback story if something goes sideways?
- Does your monitoring and operational maturity match the complexity?
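Where the feature exists, the command itself is deceptively small, which is part of the risk. A hedged sketch, assuming an OpenZFS release that ships RAIDZ expansion (the attach-to-a-raidz-vdev form) and a hypothetical new disk; verify the exact syntax and feature status on your platform before you plan around it:
cr0x@server:~$ sudo zpool attach tank raidz2-0 /dev/disk/by-id/ata-EXTRA-DISK
cr0x@server:~$ zpool status tank    # expansion progress shows here; don't trust the new capacity until it completes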
What usually forces a rebuild anyway
Some decisions are hard to undo:
- Wrong ashift (sector size alignment).
- Bad vdev geometry for workload (e.g., one huge RAIDZ vdev for random I/O heavy workloads).
- Changing redundancy policy (e.g., moving from RAIDZ1 to RAIDZ2 without supported conversion paths).
- Fundamentally mismatched special vdev strategy (metadata devices undersized or unmirrored).
- Dedup mistakes that require redesign to recover performance and capacity sanity.
Practical tasks: commands, outputs, and decisions (12+)
This is the part you can copy into an ops runbook. Each task includes: command, what the output means, and the decision it drives.
Commands assume OpenZFS on Linux; adapt device naming for your platform.
Task 1: Get the real pool capacity and health
cr0x@server:~$ zpool list -o name,size,alloc,free,capacity,health
NAME   SIZE  ALLOC   FREE  CAP  HEALTH
tank   109T  71.2T  37.8T  65%  ONLINE
Meaning: “capacity” is pool-level utilization. It does not include future snapshot growth or dataset reservations.
Decision: If capacity is trending toward 80%, start expansion planning; if past 85%, start executing it.
Task 2: See vdev layout (the “future regrets” view)
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 12:41:03 with 0 errors on Sun Dec 22 03:10:32 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-WDC_WD140EDGZ-... ONLINE 0 0 0
ata-WDC_WD140EDGZ-... ONLINE 0 0 0
ata-WDC_WD140EDGZ-... ONLINE 0 0 0
ata-WDC_WD140EDGZ-... ONLINE 0 0 0
ata-WDC_WD140EDGZ-... ONLINE 0 0 0
ata-WDC_WD140EDGZ-... ONLINE 0 0 0
errors: No known data errors
Meaning: You have one RAIDZ2 vdev. One vdev means limited concurrency compared to multiple vdevs.
Decision: If you need more IOPS, plan to add another vdev rather than making this one wider.
Task 3: Confirm ashift (alignment) before you buy more disks
cr0x@server:~$ zdb -C tank | grep -E 'ashift|vdev_tree' -n | head
37: vdev_tree:
58: ashift: 12
59: asize: 14000519643136
Meaning: ashift: 12 indicates 4K sector alignment. Good default for modern drives.
Decision: If ashift is 9 on 4K drives, consider planning a rebuild or migration sooner rather than compounding the mistake.
Task 4: Identify top datasets by space (including snapshots)
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint -S used | head -n 12
NAME USED AVAIL REFER MOUNTPOINT
tank/vm 28.1T 38.0T 6.2T /tank/vm
tank/backups 17.4T 38.0T 1.1T /tank/backups
tank/home 8.6T 38.0T 8.2T /tank/home
tank 71.2T 38.0T 216K /tank
Meaning: USED includes snapshot usage. REFER is live referenced data.
Decision: If USED is much larger than REFER, snapshots are pinning space; address retention before buying disks.
Task 5: Quantify snapshot space burn
cr0x@server:~$ zfs list -t snapshot -o name,used,refer,creation -S used | head -n 8
NAME USED REFER CREATION
tank/vm@auto-2025-12-25-1200 1.2T 6.1T Thu Dec 25 12:00 2025
tank/vm@auto-2025-12-24-1200 1.1T 5.9T Wed Dec 24 12:00 2025
tank/backups@daily-2025-12-25 640G 1.1T Thu Dec 25 01:00 2025
tank/vm@auto-2025-12-23-1200 980G 5.7T Tue Dec 23 12:00 2025
Meaning: Snapshot USED is the space held uniquely by that snapshot (as estimated by ZFS).
Decision: If a few snapshots dominate, prune or adjust schedule; if all are large, churn is high—plan more headroom.
Task 6: Check for reservations and refreservations (hidden “missing space”)
cr0x@server:~$ zfs get -r -H -o name,property,value reservation,refreservation tank | grep -v 'none$' | head
tank/db reservation 2T
tank/vm refreservation 5T
Meaning: Reservations consume space even if the dataset is empty; refreservation ties to referenced space, often used with volumes.
Decision: If the pool is tight, challenge reservations: keep them only for workloads that truly need guaranteed space.
Task 7: Inspect compression effectiveness (capacity multiplier or disappointment)
cr0x@server:~$ zfs get -r -o name,property,value compression,compressratio tank | head -n 12
NAME PROPERTY VALUE
tank/home compression zstd
tank/home compressratio 1.62x
tank/vm compression zstd
tank/vm compressratio 1.08x
tank/db compression zstd
tank/db compressratio 1.01x
Meaning: compressratio shows actual savings. VM and DB data often compress poorly.
Decision: If your plan assumes compression savings, verify reality. Don’t budget imaginary terabytes.
Task 8: Check recordsize/volblocksize alignment (performance and space behavior)
cr0x@server:~$ zfs get -o name,property,value recordsize tank/home tank/backups
NAME PROPERTY VALUE
tank/home recordsize 128K
tank/backups recordsize 1M
Meaning: Large recordsize improves throughput for large sequential workloads; it can hurt small random write workloads.
Decision: Set recordsize per dataset class. Don’t run VM images with 1M recordsize unless you enjoy latency graphs.
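A minimal sketch (hypothetical datasets; the small value should match the application’s page or block size, and a changed recordsize only applies to blocks written afterwards):
cr0x@server:~$ sudo zfs set recordsize=16K tank/db        # small, update-heavy blocks
cr0x@server:~$ sudo zfs set recordsize=1M tank/backups    # large sequential streams
cr0x@server:~$ zfs get -o name,property,value,source recordsize tank/db tank/backups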
Task 9: Check sync settings (and whether you’re masking a problem)
cr0x@server:~$ zfs get -o name,property,value sync tank
NAME PROPERTY VALUE
tank sync standard
Meaning: standard honors application sync semantics. disabled is a data-loss trade.
Decision: If someone set sync=disabled to “fix performance,” reverse it and design proper SLOG or workload isolation.
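The reversal is one property, which is exactly why it got flipped in the first place. A minimal sketch (hypothetical dataset; do it with the application owners watching latency, because the sync cost they were dodging comes back):
cr0x@server:~$ sudo zfs set sync=standard tank/db
cr0x@server:~$ zfs get -o name,property,value,source sync tank/db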
Task 10: Measure fragmentation and allocation pressure
cr0x@server:~$ zpool list -o name,capacity,fragmentation,free,allocated
NAME   CAP  FRAG   FREE  ALLOC
tank   65%   29%  37.8T  71.2T
Meaning: Fragmentation isn’t always evil, but high frag + high capacity often correlates with latency spikes.
Decision: If FRAG is climbing and you’re above ~80% capacity, prioritize adding space (new vdev) over micro-optimizations.
Task 11: Watch I/O latency by vdev (find the actual limiter)
cr0x@server:~$ zpool iostat -v tank 5 3
capacity operations bandwidth
pool alloc free read write read write
------------------------------------------ ----- ----- ----- ----- ----- -----
tank 71.2T 37.8T 820 1.40K 92.1M 141M
raidz2-0 71.2T 37.8T 820 1.40K 92.1M 141M
ata-WDC_WD140EDGZ-... - - 140 240 15.3M 25.0M
ata-WDC_WD140EDGZ-... - - 135 230 15.0M 24.6M
ata-WDC_WD140EDGZ-... - - 138 235 15.2M 24.8M
------------------------------------------ ----- ----- ----- ----- ----- -----
Meaning: One vdev carries all load. If bandwidth is fine but latency is bad, you’re likely IOPS-limited.
Decision: Add vdevs for concurrency; don’t expect tuning to invent new spindles.
Task 12: Confirm scrub behavior and error trend
cr0x@server:~$ zpool status tank | sed -n '1,12p'
pool: tank
state: ONLINE
scan: scrub repaired 0B in 12:41:03 with 0 errors on Sun Dec 22 03:10:32 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
Meaning: Scrubs are your early warning system. Long scrub times can indicate size growth, contention, or weak disks.
Decision: If scrub time is creeping up, plan bigger maintenance windows or more vdevs; also review drive health.
Task 13: Check per-disk SMART indicators (predict the next failure, not the last one)
cr0x@server:~$ sudo smartctl -a /dev/sdb | egrep 'SMART overall-health|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
SMART overall-health self-assessment test result: PASSED
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 1
Meaning: Pending and uncorrectable sectors are a “this drive is negotiating with reality” signal.
Decision: Proactively replace drives showing growing pending/uncorrectable counts, especially before a planned expansion/resilver cycle.
Task 14: Validate special vdev usage (if present)
cr0x@server:~$ zpool list -v tank
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH
tank 109T 71.2T 37.8T - - 29% 65% 1.00x ONLINE
raidz2-0 109T 71.2T 37.8T - - 29% 65%
special          -      -      -       -         -     -     -
  mirror-1    1.75T   412G  1.34T      -         -    8%   23%
    nvme0n1       -      -      -      -         -     -     -
    nvme1n1       -      -      -      -         -     -     -
Meaning: If you have a special class, you must monitor it like a pool within the pool (allocation, errors, wear).
Decision: If special devices approach high utilization, expand them (carefully) before metadata allocation pressure hits.
Task 15: See what’s actually writing now (tie capacity to offenders)
cr0x@server:~$ sudo arcstat 5 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:00:05 2.1K 320 13 120 5 40 2 160 7 98.4G 110G
12:00:10 2.3K 410 15 180 6 50 2 180 7 98.1G 110G
12:00:15 2.2K 360 14 140 6 45 2 175 7 98.2G 110G
Meaning: High miss rates imply more disk reads; if coupled with a near-full pool, latency balloons.
Decision: If ARC is undersized for working set, plan RAM upgrades before blaming disks. Capacity plans should include RAM for caching.
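A minimal check, assuming OpenZFS on Linux and the standard module parameter paths; a zfs_arc_max of 0 means the built-in default cap, typically around half of RAM on Linux builds:
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max
0
cr0x@server:~$ awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
c_max 118111600640
size 105689374720
If c_max already covers most of system RAM and miss rates stay high, the working set simply doesn’t fit; that’s a RAM or workload-placement conversation, not a tuning one.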
Fast diagnosis playbook
When someone says “storage is slow” you don’t have time for philosophical debates about CoW. You need a triage sequence that
finds the bottleneck quickly and points to the next action.
First: are you out of space (or effectively out of space)?
- Run zpool list and look at CAPACITY and FREE.
- Run zfs list and look for datasets whose USED includes massive snapshot usage.
- Check reservations/refreservations that “hide” free space.
If pool utilization is high (especially >85%), treat that as the likely root cause of latency spikes until proven otherwise.
The fix is usually “add space” or “delete snapshots,” not kernel tuning.
Second: is it IOPS/latency, bandwidth, or CPU?
- zpool iostat -v 5 to see whether load is concentrated on one vdev and whether disks are saturated.
- System-level: iostat -x 5 and vmstat 5 to see queue depths, await times, and CPU steal/saturation.
- If RAIDZ and heavy small writes: check CPU usage and interrupt load; parity work isn’t free.
Third: are you in a maintenance or failure mode?
- zpool status for resilver/scrub in progress, errors, or slow devices.
- SMART checks for disks with pending sectors.
- Check recent config changes: compression, sync, recordsize, special vdev properties.
Fourth: is ARC under pressure or is the working set mismatched?
- arcstat for miss rate trends and ARC size.
- Look for sudden changes in workload pattern (e.g., a new analytics job reading everything once).
This sequence avoids the classic failure: spending hours tuning when the pool is simply too full, or chasing “disk issues”
when it’s an ARC/RAM bottleneck.
Three corporate mini-stories from the trenches
Mini-story 1: the incident caused by a wrong assumption
A mid-size SaaS company had a ZFS-backed NFS cluster for VM images. The storage team sized capacity based on “current used
plus 30%.” They were proud of the spreadsheet. It had conditional formatting and everything.
The wrong assumption: snapshot retention was “stable.” In reality, the virtualization team had increased snapshot frequency
for faster rollback during a migration. Snapshots went from hourly to every 10 minutes for a subset of datasets. Nobody told
storage; they “didn’t change the data size.” Technically true. Operationally irrelevant.
Weeks later, pool utilization crossed into the danger zone. Writes became increasingly fragmented. p95 latency climbed, then
p99 became a cliff. The NFS clients started timing out under what looked like “normal” load. Applications blamed the network,
the network blamed the hypervisors, and storage got paged last, as tradition demands.
The post-incident fix wasn’t heroic. They audited snapshot schedules, enforced retention by dataset class, and implemented
quotas so “one experiment” couldn’t eat the pool. Capacity planning was updated to include snapshot growth as a first-class
term, not a footnote.
The lesson: in ZFS, “data size” is a misleading concept. You must model churn and retention, or you’re planning for a world
you don’t live in.
Mini-story 2: the optimization that backfired
A finance org ran a mix of databases and file shares. They were IO-latency sensitive at end-of-month. Someone noticed that
synchronous write latency was the pain point and decided to “fix it” quickly: sync=disabled on the hot datasets.
It worked—spectacularly. Latency graphs flattened. The team declared victory and moved on. A few weeks later, a power event
took down a rack long enough to hard-reset a storage node. The hardware came back. The pool imported. The application came up.
Then the integrity bugs started: database inconsistencies, missing recent transactions, and the worst kind of incident—data
loss that doesn’t announce itself immediately.
The ensuing recovery was painful and political. They restored from backups, replayed what they could, and spent days proving
what was lost. The original “optimization” wasn’t malicious; it was the result of treating storage semantics as a performance
toggle.
The boring fix was also the correct one: they reverted to sync=standard, added mirrored power-loss-protected devices
as SLOG for the datasets that truly needed it, and separated workloads so databases weren’t fighting file shares during peaks.
Capacity planning changed too—SLOG endurance and failure domains became part of the design, not an emergency add-on.
The lesson: some optimizations are just deferred incidents with better graphs.
Mini-story 3: the boring practice that saved the day
A media company had petabytes of nearline archive and a smaller hot tier for active projects. Their storage engineer had a
habit that looked dull in meetings: quarterly “capacity fire drills.” No panic, no drama. Just testing expansion procedures,
validating backups, and reviewing trend lines.
They maintained a simple rule: expansion starts at 75% projected utilization (based on a rolling growth rate), not when the
pool hits 80% in reality. They also had a strict snapshot policy: aggressive on active datasets, conservative on archives, and
always with automated pruning. Nobody got to keep infinite snapshots because “it’s safer.”
During a major project, an internal team suddenly began generating high-churn intermediate files on the hot tier. Growth rate
doubled. Monitoring flagged the trend within days. Because the organization had already tested the procedure, adding a new
vdev was operationally uneventful. They expanded early, rebalanced expectations with stakeholders, and avoided the familiar
spiral of “we’ll fix it after the deadline.”
The incident that didn’t happen is hard to celebrate. But boring practice—trend-based triggers, rehearsed expansion, and
policy enforcement—kept the system stable when human behavior changed.
Common mistakes: symptom → root cause → fix
1) Symptom: latency spikes as pool hits ~85–95%
Root cause: allocation pressure + fragmentation + CoW overhead in a near-full pool.
Fix: add capacity (preferably new vdevs), prune snapshots, and stop writes that aren’t critical. Don’t “tune” your way out of physics.
2) Symptom: “deleted data but space didn’t return”
Root cause: snapshots (or clones) still reference blocks.
Fix: identify snapshot usage (zfs list -t snapshot), prune retention, avoid long-lived clones for high-churn datasets.
3) Symptom: pool shows free space, but datasets report low available
Root cause: quotas/reservations, refreservations, or slop space effects.
Fix: audit zfs get reservation,refreservation,quota; remove or right-size. Educate teams that pool free ≠ dataset free.
4) Symptom: “we added faster disks, still slow”
Root cause: single-vdev design limits concurrency; or workload is IOPS-bound but you added sequential throughput.
Fix: add vdevs (more concurrency), consider mirrors for random I/O, split workloads by dataset and vdev class.
5) Symptom: scrubs/resilvers take forever now
Root cause: more allocated data, slower disks, contention, or near-full pool causing inefficient allocation patterns.
Fix: expand capacity earlier; schedule scrubs with lower load; investigate weak disks; avoid pushing pools into high-utilization zones.
6) Symptom: special vdev fills up, performance falls off a cliff
Root cause: special devices undersized for metadata/small-block growth; small blocks redirected unexpectedly.
Fix: size special vdevs conservatively; mirror them; monitor allocation; adjust special_small_blocks with caution.
7) Symptom: random write performance is awful on RAIDZ
Root cause: parity overhead + read-modify-write patterns + insufficient vdev parallelism.
Fix: use mirrors for random-write-heavy datasets, add more RAIDZ vdevs rather than widening one, tune recordsize per dataset.
8) Symptom: you can’t expand the pool “the way you planned”
Root cause: design assumed a feature or procedure not supported on your platform/version; or disk bays/backplane constraints.
Fix: validate expansion methods on the exact platform early; document supported growth paths; keep a migration plan.
Checklists / step-by-step plan
Capacity planning checklist (new pool)
- Classify datasets: VM, DB, home, backups, archive—each gets its own expectations.
- Define SLOs: latency targets, acceptable rebuild windows, and downtime tolerance.
- Pick redundancy: mirrors for mixed/random I/O; RAIDZ2/3 for capacity + throughput where appropriate.
- Choose vdev width: avoid “one giant vdev” unless the workload is truly sequential and tolerant.
- Set ashift correctly from day zero.
- Plan headroom: treat 80% as a planning boundary; keep emergency space beyond that.
- Design snapshot policy: naming, frequency, retention, and automated pruning.
- Decide on special vdevs only if you will mirror and monitor them; size for growth.
- Plan expansion path: add vdevs, replace disks, or both—validate against platform constraints.
- Operationalize: monitoring, scrub schedule, SMART checks, and rehearsed expansion procedures.
Growth execution plan (existing pool approaching limits)
- Measure reality: pool utilization, fragmentation, top datasets, snapshot usage, reservations.
- Stop the bleeding: prune snapshots, cap runaway datasets with quotas, move temporary workloads off the pool.
- Choose expansion method:
- Add vdevs when you need performance and a clean growth step.
- Replace disks with larger ones when the chassis has no free bays and you can afford long resilver cycles.
- Use RAIDZ expansion only if your platform supports it and you’ve tested it.
- Rehearse: perform a dry run on a lab or staging system; confirm device naming and failure handling.
- Execute with observability: watch zpool status, zpool iostat, system I/O, and application latency.
- Post-expand validation: verify new space, confirm scrub schedule, confirm alerts, and re-check headroom triggers.
Policy checklist that prevents surprise rebuilds
- Standardize vdev geometry for each storage tier.
- Enforce snapshot retention and prune automatically.
- Require capacity impact review for new projects that generate churn (CI artifacts, analytics scratch, VM template sprawl).
- Keep documented expansion procedures with version-specific notes.
- Run periodic “capacity fire drills”: simulate expansion, verify backups, and test restore.
FAQ
1) What utilization target should I plan for on ZFS?
Plan to operate below ~80% for general-purpose pools. For high-churn workloads (VMs/DBs), aim lower if you care about tail latency.
Treat 85% as “execute expansion,” not “start thinking.”
2) Are mirrors always better than RAIDZ?
No. Mirrors are usually better for random I/O and predictable latency. RAIDZ is often better for capacity efficiency and sequential throughput.
The mistake is picking RAIDZ to save space and then expecting mirror-like IOPS under VM workloads.
3) Can I change RAIDZ level later (RAIDZ1 to RAIDZ2)?
In many environments, not without rebuilding or complex migration. Treat redundancy level as a day-0 decision. If you’re unsure,
pick more parity, not less.
4) Why does deleting files not free space?
Snapshots (or clones) keep old blocks referenced. Deleting the live file only removes one reference. Space returns when all
references are gone—often meaning snapshot pruning.
5) How do I plan snapshot capacity?
Estimate based on churn: daily changed data × retention window. For high-churn datasets, snapshots can approach another full
copy over time. Measure with zfs list -t snapshot and trend it.
6) Should I enable dedup to save space?
Usually no, unless you have a proven dedup-friendly workload and enough RAM (or specialized design) to support it safely.
Dedup mistakes can turn capacity plans into performance incidents.
7) When do special vdevs make sense?
When you have HDD pools suffering from metadata-heavy workloads (many small files, directories, snapshots) and you can mirror
fast devices and monitor them properly. They’re powerful—and unforgiving.
8) Is adding a SLOG the same as adding a cache?
No. SLOG improves synchronous write latency. If your workload isn’t doing sync writes, you won’t see much benefit. Don’t buy
a SLOG to fix general slowness.
9) What’s the safest way to expand capacity without downtime?
Adding a new vdev is commonly the least disruptive path if you have free bays and a consistent design. Replacing disks in-place
can also be done online, but it stretches risk over a longer time window.
10) How many disks per RAIDZ vdev should I use?
It depends on workload and rebuild tolerance. Very wide vdevs increase rebuild complexity and can hurt random I/O behavior.
Many production teams prefer multiple moderately sized vdevs over one very wide vdev for better concurrency and operational flexibility.
Conclusion: next steps you can execute this week
ZFS capacity planning is the art of choosing the future you want: boring expansions, predictable latency, and recoverable failures.
The wrong design doesn’t usually explode immediately. It just accumulates interest until the pool is full and the graph turns vertical.
- Run the audit commands above and write down: pool utilization, fragmentation, top datasets, snapshot usage, reservations, and scrub times.
- Set a headroom trigger tied to trend (not feelings): start expansion work at 75% projected, execute before 85% actual.
- Standardize dataset policies: recordsize, compression, quotas, and snapshot retention per class.
- Decide your growth path and rehearse it: adding vdevs, disk replacement, or (if supported) RAIDZ expansion—with a rollback story.
- Make one structural improvement: more vdev concurrency for IOPS workloads, or mirrored special vdevs for metadata-heavy pools.
Do those, and you’ll spend less time negotiating with storage emergencies and more time running systems that behave like they were designed on purpose.