ZFS RAIDZ expansion: What’s Possible Today and Best Workarounds

You bought “just enough” disks for a RAIDZ. Six months later the pool is at 83%, snapshots are multiplying like rabbits, and the business wants another quarter’s worth of data yesterday.

This is where people discover that ZFS isn’t magic—it’s engineering. RAIDZ gives you great density and nice failure tolerance. It also makes capacity expansion feel like arguing with physics. The good news: modern OpenZFS has finally grown a new limb here. The bad news: you still need to understand what it does, what it doesn’t do, and which workarounds are safe in production.

What’s possible today (and what isn’t)

Let’s separate three concepts that get mashed together in Slack threads and incident calls:

  • Expanding a pool: adding more top-level vdevs (e.g., add another RAIDZ group). Easy. Has consequences.
  • Growing a vdev: making an existing vdev wider (e.g., RAIDZ1 from 5 disks to 6 disks). Historically “no,” now “sometimes yes” depending on feature support.
  • Growing each disk: replacing disks with larger ones and letting the vdev grow after the last replacement. Classic, safe, slow.

1) Adding a disk to an existing RAIDZ vdev

Today: This is possible on OpenZFS 2.3 and newer, where the raidz_expansion feature flag is supported and enabled. You attach an additional drive to a RAIDZ vdev (with zpool attach) and the vdev undergoes an expansion process.

The catch: It’s not instant, and it’s not free. Existing data must be rewritten to take advantage of the new geometry (more columns). That means a long-running background process that looks a lot like a full-pool rewrite. Expect heavy I/O, lots of time, and performance variability.
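
When the feature is present, the operation itself is a single attach against the RAIDZ vdev. A minimal sketch, assuming OpenZFS 2.3+, the pool and vdev names used in the examples below, and the spare disk from Task 12 (substitute your own by-id names):

# Check whether this build and pool know about the feature at all
zpool get feature@raidz_expansion tank

# Attach the new disk to the existing RAIDZ vdev; this kicks off the long reflow
zpool attach tank raidz2-0 /dev/disk/by-id/ata-ST4000NM0035_ZC1A1ABJ

# zpool status reports progress while the expansion runs
zpool status tank

The attach command returns quickly; the background rewrite does not. Plan for hours or days on large vdevs.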

2) Replacing all drives with bigger ones

Today: Still the most boring-and-correct way to grow RAIDZ capacity. You replace drives one at a time, resilver after each, and the vdev grows once the last (smallest) member has been swapped for a larger one and autoexpand is honored.

The catch: You need patience and good drives. Resilvering on big disks is a lifestyle choice.
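
The mechanics of that route, as a minimal sketch (placeholder device names; repeat the replace step for every member, letting each resilver finish before touching the next disk):

# Replace one member with a larger disk; -w waits for the resilver where supported,
# otherwise watch zpool status until it completes
zpool replace -w tank /dev/disk/by-id/ata-OLD_SMALL_DISK /dev/disk/by-id/ata-NEW_LARGER_DISK

# After the last replacement, let the vdev grow into the new space
zpool set autoexpand=on tank
zpool online -e tank /dev/disk/by-id/ata-NEW_LARGER_DISK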

3) Adding a new top-level vdev (another RAIDZ group, or mirrors)

Today: Always works. It’s how ZFS was designed to scale pools: striped across vdevs.

The catch: You are permanently changing your redundancy profile and failure domain math. ZFS stripes across top-level vdevs; losing any one top-level vdev loses the pool. So if you “just add a single disk” (a one-disk vdev), you’ve created a pool that can die from one disk failure. That’s not an expansion, that’s a loaded foot-gun.
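
If you do add a vdev, the dry-run flag earns its keep. A minimal sketch with placeholder device names:

# -n prints the layout zpool would create without changing anything
zpool add -n tank raidz2 \
    /dev/disk/by-id/ata-NEW1 /dev/disk/by-id/ata-NEW2 /dev/disk/by-id/ata-NEW3 \
    /dev/disk/by-id/ata-NEW4 /dev/disk/by-id/ata-NEW5 /dev/disk/by-id/ata-NEW6

# If the printed config shows a proper raidz2 vdev (not a stray single disk), rerun without -n

Note that zpool add refuses mismatched redundancy unless you force it with -f; a needed -f is a design smell, not an inconvenience.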

Opinionated guidance: If you can use RAIDZ expansion safely (feature supported, maintenance window tolerance, I/O headroom), it’s a legitimate tool now. If you can’t, don’t get cute: either replace disks with larger ones or migrate to a new pool/vdev layout. “Temporary” single-disk vdevs have a way of becoming permanent right up until they become catastrophic.

Joke #1: RAIDZ expansion is like a gym membership—technically you can make progress, but it’s going to demand time and sustained discomfort.

How RAIDZ expansion works under the hood (enough to make decisions)

RAIDZ writes data in “stripes” across disks, with parity. The number of disks participating in a stripe is the vdev’s “width.” When you add a disk to a RAIDZ vdev (with the expansion feature), you’re changing that width.

Here’s why it’s hard: old blocks are laid out according to the old width. New blocks could be written with the new width, but then you’d end up with a vdev that contains a mix of layouts. That is possible, but it means:

  • Capacity accounting gets complicated—space isn’t magically freed until enough data is rewritten.
  • Performance characteristics vary depending on how much of the dataset is “old layout” vs “new layout.”
  • Parity math, allocation classes, and metaslab selection all need to cooperate without breaking on-disk compatibility.

So expansion involves a controlled rewrite of blocks so they can be redistributed across the new geometry. Think of it as: “the pool learns a new gait, then slowly teaches its existing data to walk that way.”

What you should expect operationally

  • A long-running process that competes with normal workloads (read, write, metadata).
  • Higher write amplification because blocks are being rewritten, plus parity overhead.
  • Snapshot interaction: blocks referenced by snapshots don’t disappear; rewriting may be constrained by retention policies. You don’t “rebalance” around a mountain of immutable snapshot history without paying for it.
  • Thermal and SMART stress on the entire vdev. If your drives are already marginal, you’re about to find out in the least pleasant way.

What expansion does not fix

  • Bad ashift choices (e.g., 4K disks with ashift=9). That’s forever.
  • Fundamentally wrong topology for your workload (e.g., RAIDZ for heavy sync random writes in a latency-sensitive DB). Wider RAIDZ may improve throughput, but it won’t turn it into a mirror.
  • High fragmentation and snapshot bloat caused by workload patterns. Expansion may reduce pressure but not the underlying behavior.

Facts and history that explain the weirdness

Some context points that explain why “just add a disk” took so long to become real in ZFS-land:

  1. ZFS was born in an era of big, expensive disks where planning the vdev layout upfront was assumed. The culture came with it: “Choose wisely; changing later is hard.”
  2. Top-level vdev striping is foundational: ZFS pools are a stripe across vdevs. This makes scaling easy but makes mixed-redundancy pools risky when people improvise.
  3. RAIDZ parity isn’t RAID5/6 glued on the side; it’s integrated into allocation and block pointers. That deep integration is great for integrity and awful for retrofitting layout changes.
  4. The “replace-with-bigger-drives” method predates most modern ZFS forks and became the canonical expansion story because it didn’t require rewriting everything at once.
  5. Resilvering is block-based, not full-disk (in many cases). That’s a ZFS advantage—but RAIDZ resilvers can still be long because parity reconstruction touches a lot of data.
  6. OpenZFS consolidation mattered: ZFS development diverged across Solaris descendants, then re-converged. Big on-disk feature work tends to move at “careful filesystem speed.”
  7. Snapshots are a first-class citizen, which changes everything: “delete and rewrite” isn’t simple if older blocks are still referenced.
  8. End-to-end checksums changed failure expectations: you don’t just want capacity; you want correct data. Any expansion feature must preserve integrity across failure cases, power loss, and partial progress.
  9. Drive capacities exploded faster than rebuild windows shrank: this widened the pain gap—people wanted expansion because buying larger replacements meant multi-day resilvers.

Practical tasks: commands, outputs, and decisions

Below are real tasks I run before touching a production pool. Each includes a command, sample output, what the output means, and the decision you make.

Task 1: Identify your ZFS and kernel/userspace versions

cr0x@server:~$ zfs version
zfs-2.2.4-1
zfs-kmod-2.2.4-1

What it means: You’re on OpenZFS 2.2.x userspace and kernel module. RAIDZ expansion shipped with OpenZFS 2.3.0, so a 2.2.x build does not have it; exact feature availability also varies with distribution packaging.

Decision: If your release predates the feature, stop assuming RAIDZ expansion exists. Plan an upgrade window first, or use the older workarounds.

Task 2: Check pool health and any latent damage

cr0x@server:~$ zpool status -xv
all pools are healthy

What it means: No known errors at the moment.

Decision: If this shows errors, do not expand. Fix underlying device issues first. Expansion multiplies I/O and will surface weak drives.

Task 3: Get the exact vdev topology (this dictates your options)

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0

errors: No known data errors

What it means: Single RAIDZ2 vdev, 6-wide. Losing more than 2 drives in that vdev kills the pool.

Decision: You can (a) expand this vdev width (if supported), (b) replace drives with bigger ones, or (c) add another RAIDZ2 vdev. Do not add single disks.

Task 4: Confirm ashift and sector alignment (you can’t “fix” it later)

cr0x@server:~$ zdb -C tank | grep ashift -n
56:                ashift: 12
57:                asize:  599998545920

What it means: ashift=12 (4K sectors) is generally sane for modern disks.

Decision: If ashift is wrong (commonly 9 on 4K drives), don’t sink more time into expansion. Migration is usually the only correct fix.

Task 5: Check feature flags and whether the pool can support newer features

cr0x@server:~$ zpool get -H all tank | egrep 'feature@|compatibility'
tank	compatibility	off	local

What it means: Compatibility mode is off; feature flags may be enabled individually.

Decision: In environments that boot older recovery media or replicate to older hosts, enabling new features can be a compatibility break. Confirm your fleet supports it before you turn knobs.
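
For the expansion question specifically, ask the pool directly. A minimal sketch, using the feature name current OpenZFS releases expose:

# disabled/enabled/active means this build knows the feature; an error means it predates it
zpool get feature@raidz_expansion tank

# Individual features can be enabled (zpool set feature@...=enabled), but that is a one-way
# door for older importers, so clear it with the rest of the fleet first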

Task 6: Check pool capacity and fragmentation (expansion doesn’t like full pools)

cr0x@server:~$ zpool list -o name,size,alloc,free,cap,frag,health tank
NAME  SIZE  ALLOC   FREE  CAP  FRAG  HEALTH
tank  21.8T  19.1T  2.7T  87%   61%  ONLINE

What it means: 87% full and 61% fragmented. You’re in the “everything is harder now” zone.

Decision: If CAP > 80%, prioritize freeing space before expansion or replacements. ZFS performance and allocation behavior degrade badly when full.

Task 7: Identify the biggest space consumers, including snapshots

cr0x@server:~$ zfs list -t all -o name,used,avail,refer,mountpoint -S used | head
NAME                 USED   AVAIL  REFER  MOUNTPOINT
tank                 19.1T  2.7T   256K   /tank
tank/backups         11.2T  2.7T   4.1T   /tank/backups
tank/vm              5.6T   2.7T   5.6T   /tank/vm
tank/backups@snap-1  2.8T      -   4.0T   -
tank/home            1.9T   2.7T   1.9T   /tank/home

What it means: Snapshots can account for a shocking amount of “USED,” and they pin old blocks.

Decision: If expansion depends on rewriting blocks but snapshots keep them alive, trim snapshot retention first (with stakeholder buy-in).
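
To see which snapshots are doing the pinning, and what deleting a range would actually return, a minimal sketch (the end of the snapshot range is a placeholder):

# Per-snapshot space, largest first, for the heavy dataset
zfs list -t snapshot -o name,used,refer,creation -S used tank/backups | head

# Dry run: show what destroying a range of snapshots would reclaim, without destroying anything
zfs destroy -nv tank/backups@snap-1%snap-30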

Task 8: Check sync and write-path dataset properties (SLOG misconceptions live here)

cr0x@server:~$ zfs get -o name,property,value -s local,default sync,logbias,recordsize tank/vm
NAME     PROPERTY    VALUE
tank/vm  sync        standard
tank/vm  logbias     latency
tank/vm  recordsize  128K

What it means: VM dataset is default sync behavior, latency logbias, 128K recordsize.

Decision: If you’re seeing latency pain, don’t assume “expansion fixes it.” You may have a sync write bottleneck that needs SLOG tuning or workload changes.

Task 9: Verify autotrim, compression, and atime—quiet killers and quiet savers

cr0x@server:~$ zfs get -o name,property,value compression,atime,relatime tank
NAME  PROPERTY     VALUE
tank  compression  lz4
tank  atime        off
tank  relatime     off

cr0x@server:~$ zpool get -o name,property,value autotrim tank
NAME  PROPERTY  VALUE
tank  autotrim  off

What it means: Compression is on (good). atime off (often good). autotrim off (might be fine for HDD; on SSD it’s a discussion).

Decision: Expansion increases writes; compression can reduce them. If your pool is SSD-based, consider autotrim behavior carefully—but test first.

Task 10: Baseline I/O and latency before any surgery

cr0x@server:~$ zpool iostat -v tank 5 3
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        19.1T  2.7T      85    210  42.5M   168M
  raidz2-0                  19.1T  2.7T      85    210  42.5M   168M
    sda                         -      -      12     35  7.10M  28.0M
    sdb                         -      -      14     36  7.40M  27.5M
    sdc                         -      -      13     34  7.00M  28.5M
    sdd                         -      -      15     35  7.20M  28.0M
    sde                         -      -      16     35  6.90M  28.5M
    sdf                         -      -      15     35  6.90M  27.5M
--------------------------  -----  -----  -----  -----  -----  -----

What it means: This shows per-vdev and per-disk bandwidth/ops. You’re looking for outliers and headroom.

Decision: If disks are already near saturation, expansion will hurt. Schedule a quiet window or throttle the expansion process (where supported) by workload management.

Task 11: Check recent error history and resilver/scrub cadence

cr0x@server:~$ zpool status tank | sed -n '1,120p'
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 12:44:02 with 0 errors on Mon Dec  2 03:21:08 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0

errors: No known data errors

What it means: Scrub completed cleanly recently. Good baseline.

Decision: If you haven’t scrubbed in months, scrub before expansion/replacements. Find latent errors when you still have slack.
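
Kicking one off is cheap; a minimal sketch:

# Start a scrub and check how it is going
zpool scrub tank
zpool status tank | grep -A1 scan: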

Task 12: Validate that the “new” disk is actually the disk you think it is

cr0x@server:~$ lsblk -o NAME,SIZE,MODEL,SERIAL,TYPE
NAME   SIZE MODEL            SERIAL          TYPE
sda   3.64T ST4000NM0035     ZC1A1ABC        disk
sdb   3.64T ST4000NM0035     ZC1A1ABD        disk
sdc   3.64T ST4000NM0035     ZC1A1ABE        disk
sdd   3.64T ST4000NM0035     ZC1A1ABF        disk
sde   3.64T ST4000NM0035     ZC1A1ABG        disk
sdf   3.64T ST4000NM0035     ZC1A1ABH        disk
sdg   3.64T ST4000NM0035     ZC1A1ABJ        disk

What it means: You have a candidate disk (sdg) present. You can cross-check serials with your DC hands ticket.

Decision: Never operate on /dev/sdX blindly. Prefer stable identifiers like /dev/disk/by-id in actual pool configs.

Task 13: Check SMART health before involving a disk in your redundancy math

cr0x@server:~$ sudo smartctl -a /dev/sdg | egrep 'SMART overall-health|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

What it means: No reallocations/pending sectors. Not a guarantee, but it’s not screaming either.

Decision: If pending sectors or growing reallocations exist, don’t add it. Drives fail during stress events; don’t invite them.

Task 14: Simulate the topology change you’re considering (in your head, using math)

cr0x@server:~$ zpool list -o name,ashift,autoexpand,size,alloc,free,cap tank
NAME  ASHIFT  AUTOEXPAND  SIZE   ALLOC   FREE  CAP
tank     12   off         21.8T  19.1T   2.7T  87%

What it means: autoexpand off. For “replace disks with larger ones,” you typically want autoexpand on eventually.

Decision: Decide your path:

  • RAIDZ expansion path: ensure feature support, ensure spare bays, accept rewrite time.
  • Drive replacement path: turn on autoexpand at the right time; expect serial resilvers.
  • New vdev path: ensure matching redundancy and similar performance characteristics.

Task 15: Prepare for the boring part—verify you can restore

cr0x@server:~$ zfs get -o name,property,value -s local,default copies,encryption,keylocation tank/backups
NAME         PROPERTY     VALUE
tank/backups copies       1
tank/backups encryption   off
tank/backups keylocation  none

What it means: This dataset is not encrypted and has one copy. That’s fine if your backup system is elsewhere; terrible if you’re pretending this is the backup.

Decision: Before big layout changes, validate backups and restore drills. Expansion is not supposed to be destructive, but production systems enjoy irony.

Task 16: If you’re going the “replace drives” route, confirm device-by-id naming

cr0x@server:~$ ls -l /dev/disk/by-id/ | egrep 'sd[abc]$' | head -n 3
lrwxrwxrwx 1 root root  9 Dec 25 02:10 ata-ST4000NM0035_ZC1A1ABC -> ../../sda
lrwxrwxrwx 1 root root  9 Dec 25 02:10 ata-ST4000NM0035_ZC1A1ABD -> ../../sdb
lrwxrwxrwx 1 root root  9 Dec 25 02:10 ata-ST4000NM0035_ZC1A1ABE -> ../../sdc

What it means: Stable identifiers exist. Good.

Decision: Use these paths in zpool replace operations to avoid “wrong disk” incidents.

Fast diagnosis playbook: find the bottleneck fast

When someone says “we need RAIDZ expansion because storage is slow/full,” you need to diagnose the actual bottleneck before you start moving disks around. Here’s the triage order I use.

First: capacity pressure and fragmentation

  • Check zpool list CAP and FRAG.
  • If CAP > 80% and FRAG > ~50%, expect allocator pain and write amplification.

Action: Delete or move data first; tighten snapshot retention; add space using the least risky method available. Expansion on a nearly-full pool is like repairing an engine while the car is still in a race.

Second: latency vs throughput (they are not the same problem)

  • Use zpool iostat -v to spot overloaded disks or a single slow device.
  • Correlate with application symptoms: are they complaining about IOPS latency or bulk transfer time?

Action: For latency-sensitive random writes, RAIDZ width changes may not help much. Mirrors or special devices might.

Third: sync writes and intent log behavior

  • Check dataset properties: sync, logbias.
  • Look for workload types: NFS with sync semantics, databases, VM hypervisors.

Action: If sync latency is the bottleneck, a proper SLOG on power-loss-protected media might help more than any expansion. Or tune the app semantics if you’re allowed.
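
If you go that route, mirror the log devices and insist on power-loss protection. A minimal sketch with placeholder NVMe names:

# Mirrored SLOG; an unmirrored log device that dies during a crash can cost recent sync writes
zpool add tank log mirror /dev/disk/by-id/nvme-PLP_SSD_A /dev/disk/by-id/nvme-PLP_SSD_B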

Fourth: error rates and drive health

  • Check zpool status for read/write/cksum errors.
  • Check SMART stats on outliers.

Action: Replace failing drives before attempting expansion/rebuild. A marginal disk under heavy rewrite load becomes an outage generator.

Fifth: recordsize, volblocksize, and workload mismatch

  • For databases: recordsize too large increases read amplification.
  • For VM zvols: check volblocksize choices; changing later can require migration.

Action: Fix the workload/dataset mismatch if possible; expansion doesn’t correct mis-tuned datasets.
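
Keep in mind that these knobs only shape new writes, which is why late fixes often end in migration. A minimal sketch with hypothetical dataset names:

# recordsize applies to blocks written from now on, not to existing data
zfs set recordsize=16K tank/db

# volblocksize is fixed at zvol creation; changing it means creating a new zvol and migrating
zfs create -V 200G -o volblocksize=16K tank/vm/newdisk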

Best workarounds when you can’t (or shouldn’t) expand RAIDZ

Even with RAIDZ expansion available, there are cases where you should choose a different approach: the pool is too full, the workload can’t tolerate sustained background rewrite, your platform can’t enable new feature flags, or you simply don’t trust the maintenance window.

Workaround 1: Replace drives with larger ones (the “slow but sane” method)

This is the classic: replace one disk at a time, resilver, repeat. Once the last device is larger and the pool recognizes the new size, you get more capacity without changing vdev width.

What it’s good for: predictable risk profile, no new topology, compatibility-friendly.

What it costs: time. Also, repeated resilvers stress the vdev multiple times, which is not nothing on older fleets.

Operational advice: keep at least one cold spare on-site. Treat drive replacement as a campaign, not a series of one-off heroics.

Workaround 2: Add a new top-level vdev (but do it like an adult)

Add another RAIDZ vdev of the same parity level and similar performance. This increases capacity immediately. It also increases aggregate performance in many cases because you have more spindles and metaslabs.

Rules:

  • Match redundancy. RAIDZ2 + RAIDZ2 is reasonable; RAIDZ2 + single disk is negligence.
  • Try to match width and drive class. Mixing a wide HDD RAIDZ with a narrow SSD RAIDZ creates fun performance cliff edges.
  • Plan for failure domains: more vdevs means more “any vdev failure kills the pool” surfaces. Don’t confuse “more redundancy per vdev” with “invincible pool.”

Workaround 3: Migrate to mirrors for future growth

If you’re in a world where capacity needs to grow in small increments and performance wants IOPS, mirrors are your friend. Mirrors let you add capacity in pairs and get predictable latency.

The trade: you pay in raw capacity. Sometimes that’s fine. Sometimes finance will argue. Finance always argues; it’s their RAIDZ.

Workaround 4: Build a new pool and replicate (the “clean cut”)

If your layout is wrong (ashift, wrong parity level, wrong width, wrong vdev type), stop patching and migrate. Build a new pool with the right geometry and replicate datasets with zfs send | zfs receive.

Why this is often best: it lets you fix multiple structural mistakes at once, and it gives you a rollback plan. The migration itself is work, but it’s the kind of work that ends in a system you actually trust.
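
The mechanics are the easy part; the planning is the work. A minimal sketch, assuming a new pool named newtank already built with the geometry you actually want (pool and snapshot names are placeholders):

# Recursive snapshot of everything you intend to move
zfs snapshot -r tank@migrate-1

# Full replication stream: datasets, snapshots, and (with -R) properties
zfs send -R tank@migrate-1 | zfs receive -F newtank

# Later, during a short write freeze: final incremental catch-up, then cut over mounts
zfs snapshot -r tank@migrate-2
zfs send -R -i tank@migrate-1 tank@migrate-2 | zfs receive -F newtank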

Workaround 5: Add special vdevs carefully (metadata/small blocks acceleration)

Special vdevs can move metadata (and optionally small blocks) to faster media. This can make a RAIDZ pool feel dramatically snappier, especially for metadata-heavy workloads.

Warning: special vdevs are not a cache. If you lose a special vdev and it contains metadata, you can lose the pool. Mirror special vdevs. Treat them as first-class storage, not garnish.
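
A minimal sketch of doing it the careful way, with placeholder SSD names:

# Mirrored special vdev for metadata (and optionally small blocks)
zpool add tank special mirror /dev/disk/by-id/nvme-SSD_A /dev/disk/by-id/nvme-SSD_B

# Opt a dataset into keeping small blocks on the special vdev; affects new writes only
zfs set special_small_blocks=32K tank/vm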

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

They had a single RAIDZ2 vdev and were running out of space. A well-meaning engineer said, “ZFS can stripe across vdevs, so we can just add one disk temporarily and move data later.” It sounded plausible in the way that many dangerous ideas do.

The team ran zpool add with a single disk. Capacity pressure eased instantly. The ticket was closed. Everyone went home and enjoyed the rare feeling of “we fixed storage quickly.”

Two months later, that “temporary” disk started throwing errors. Not a total failure—worse. Intermittent timeouts and occasional write errors. ZFS began marking the device as degraded, and then faulted. The pool died because a top-level vdev died. No parity, no mirror. One disk was now a single point of failure for the entire pool.

The postmortem was painful because the system behaved exactly as designed. The wrong assumption wasn’t about ZFS being unreliable; it was about misunderstanding what “striped across vdevs” really implies. They didn’t add capacity—they added a new, fragile pillar under the whole building.

The fix was a recovery from backups and a rebuild of the pool topology. They also added a guardrail: a wrapper that refused zpool add unless the vdev being added met a minimum redundancy rule.

Mini-story 2: The optimization that backfired

A different company had a RAIDZ1 pool for a VM farm. It was “fine” until it wasn’t: latency spikes, guests stuttering, angry application teams. The storage team decided to “optimize” by making the RAIDZ wider during an expansion to increase throughput, assuming more disks equals faster.

They expanded capacity by adding another wide RAIDZ vdev, and they also adjusted dataset settings: larger recordsize, more aggressive caching assumptions, and a heavier snapshot schedule “for safety.” The system benchmarked well on sequential tests. The production workload was neither sequential nor polite.

Once back under real load, sync writes and metadata churn dominated. Wider RAIDZ made small random writes more expensive. The snapshot schedule pinned blocks and amplified fragmentation. Scrubs took longer, resilvers took longer, and the latency tail got uglier.

They didn’t cause data loss, but they created a system that met capacity targets while missing performance SLOs. The rollback was a migration to mirrored vdevs for the VM datasets, leaving RAIDZ for backups and bulk storage.

The lesson wasn’t “RAIDZ is bad.” It was: optimizing for the wrong metric is indistinguishable from sabotage, except the tickets are nicer at first.

Mini-story 3: The boring but correct practice that saved the day

A media company ran an archive pool that was always close to full because someone considered “free space” an optional luxury. The storage engineer—quiet, unglamorous, consistently right—kept insisting on three practices: monthly scrubs, stable device naming, and staged drive replacement with a tested runbook.

One week, a drive started showing pending sectors. The pool was still online. No application alarms yet. But the scrub report and SMART trend were clear: the drive was turning into a future incident.

They swapped the drive during business hours with a controlled resilver. Because they had stable by-id mapping, the risk of replacing the wrong disk was low. Because they scrubbed regularly, there were no nasty latent errors waiting to be discovered during resilver. Because they kept 20% free space as policy, allocations stayed sane and the resilver completed without becoming a performance disaster.

Two days later, another drive in the same vdev failed outright. They took a breath, watched ZFS do its parity job, and moved on with their week. The second failure could have been a pool loss if the first drive had been left to rot.

The “boring” practice didn’t get applause. It did something better: it prevented the 3 a.m. call.

Common mistakes: symptoms → root cause → fix

Mistake 1: “We added a disk and now performance is worse”

Symptoms: Higher latency, scrubs slower, unpredictable throughput after expansion activity.

Root cause: Expansion/rewrite competes with production I/O; also the pool is fragmented and near full.

Fix: Create headroom (delete/migrate data), reduce snapshot retention, schedule heavy rewrite in low-traffic windows, and baseline with zpool iostat -v to confirm contention.

Mistake 2: “We can just add one disk temporarily”

Symptoms: Pool becomes dependent on a non-redundant top-level vdev; later a single disk failure drops the whole pool.

Root cause: Misunderstanding top-level vdev striping and failure domains.

Fix: Never add a single-disk vdev to an important pool. If you already did, migrate off immediately or replace it by attaching redundancy (mirror) if feasible.
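
If you are stuck with the single-disk vdev for a while, at least give it a partner before the real fix. A minimal sketch (placeholder device names):

# Attaching to a lone disk turns that vdev into a two-way mirror
zpool attach tank /dev/disk/by-id/ata-LONELY_DISK /dev/disk/by-id/ata-NEW_MIRROR_DISK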

Mistake 3: “We replaced all disks but didn’t get more space”

Symptoms: After the final replacement, zpool list still shows old size.

Root cause: autoexpand disabled, or partitions not grown, or underlying device size not exposed.

Fix: Enable autoexpand, confirm partition sizing, export/import if needed, and verify with zpool get autoexpand and lsblk.
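
The usual sequence to make the space appear, as a minimal sketch (placeholder device name):

# Allow the pool to grow into larger members
zpool set autoexpand=on tank

# Tell ZFS to use the full size of an already-replaced device
zpool online -e tank /dev/disk/by-id/ata-NEW_LARGER_DISK

# Confirm
zpool list tank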

Mistake 4: “Resilver took forever and then another disk failed”

Symptoms: Multi-day resilver, increasing errors, second drive failure during rebuild.

Root cause: No proactive scrubs; latent errors discovered during rebuild; old drives under stress; pool too full.

Fix: Scrub regularly, keep free space, replace drives on SMART trends, and consider higher parity (RAIDZ2/3) for large HDD pools.

Mistake 5: “We enabled a new feature flag and now replication/boot media broke”

Symptoms: Another host can’t import the pool; recovery image can’t mount; replication target refuses streams.

Root cause: Feature flags/compatibility mismatch across systems.

Fix: Standardize OpenZFS versions across fleet before enabling features; use compatibility properties where appropriate; keep recovery tooling updated.

Mistake 6: “We expanded but capacity didn’t appear immediately”

Symptoms: Added disk shows in config, but usable space grows slowly or not at all.

Root cause: Existing blocks still laid out in old geometry; snapshots pin blocks; rewrite/expansion process takes time.

Fix: Manage expectations, monitor progress, reduce snapshot retention, and avoid expanding on a nearly-full pool.

Checklists / step-by-step plan

Plan A: Use RAIDZ expansion (when supported and you can tolerate the rewrite)

  1. Confirm platform support: OpenZFS version across all importers (prod nodes, DR nodes, rescue media).
  2. Confirm pool health: clean zpool status, recent scrub, no SMART red flags.
  3. Create headroom: get CAP under ~80% if possible.
  4. Freeze risky changes: no simultaneous kernel updates, no firmware roulette, no “while we’re here” experiments.
  5. Validate backups: test a restore path for at least one representative dataset.
  6. Identify the target disk by-id: confirm serial, slot, and WWN.
  7. Schedule the window: expansion is disruptive; plan for degraded performance and longer batch jobs.
  8. Monitor continuously: watch zpool status, zpool iostat, SMART, and application latency.
  9. Post-change scrub: once the system settles, run a scrub to verify integrity under the new configuration.

Plan B: Replace disks with larger ones (production-friendly, time-expensive)

  1. Run a scrub first; fix any errors before starting.
  2. Replace one disk at a time. Let resilver complete fully.
  3. Do not replace multiple disks “to save time” unless you like gambling with parity.
  4. After the final disk replacement, ensure the vdev expands (autoexpand, partition sizing).
  5. Validate performance and capacity; then adjust snapshot retention if the real goal was “free space.”

Plan C: Add a new vdev (fast capacity, permanent topology change)

  1. Choose redundancy to match or exceed existing vdevs.
  2. Match drive class and approximate width for predictable performance.
  3. Document the new failure domain math for on-call.
  4. After adding, watch distribution: new writes land on the new vdev; old data stays where it is unless rewritten.
  5. If you need rebalancing, plan it explicitly (send/receive or controlled rewrite), not as wishful thinking.

FAQ

1) Can I add one disk to my RAIDZ vdev and get more space instantly?

Not instantly in the way people hope. With RAIDZ expansion support, you can add a disk and start an expansion process, but usable space may materialize gradually as data is rewritten.

2) Is it safer to add a new RAIDZ vdev instead of expanding an existing one?

“Safer” depends on what you mean. Adding a new redundant vdev is a well-understood operation with predictable failure domains. Expansion rewrites a lot of data and stresses drives. If your drives are old or your pool is very full, adding a new vdev (properly redundant) may be the lower-risk choice.

3) Why can’t ZFS just rebalance data automatically after I add capacity?

ZFS doesn’t shuffle old blocks just because you added space; it allocates new blocks to new metaslabs. Automatic rebalancing would mean huge background I/O and complex policy decisions (which data, when, how aggressively, with what priority).

4) If I add a new vdev, will reads get faster?

Often yes for parallel workloads, because you have more vdevs to service reads. But hot data already written to the old vdev stays there; “more vdevs” doesn’t teleport your working set.

5) Should I switch from RAIDZ to mirrors for future growth?

If you need incremental expansion and predictable IOPS latency, mirrors are usually the right answer. If you need maximum usable capacity per disk and the workload is more sequential/bulk, RAIDZ is still great. Don’t pick based on religion; pick based on failure domain and latency requirements.

6) Does adding a SLOG help with expansion or capacity?

No for capacity. For performance: it helps only for synchronous writes, and only if the SLOG device is fast and power-loss safe. SLOG myths are eternal; so are write caches that lie.

7) Will RAIDZ expansion fix my “pool is 90% full” problem?

It can provide more space, but expanding a nearly-full pool is risky and can be painfully slow. First priority: reduce usage, prune snapshots, or add capacity in a way that doesn’t require a massive rewrite under pressure.

8) How do I know if snapshots are blocking my space recovery?

Look at zfs list output including snapshots, and compare dataset REFER vs USED. Large snapshot USED values mean old blocks are pinned. If you delete data and space doesn’t return, snapshots are a prime suspect.

9) Can I “undo” a bad expansion choice?

You can’t remove a top-level vdev from a pool in the general case. You can sometimes evacuate and destroy a pool, or migrate datasets to a new pool with a better layout. That’s why topology decisions deserve paranoia.

10) What’s the single best habit to prevent expansion drama?

Keep headroom. Under ~70–80% usage, ZFS behaves like a competent filesystem. Above that, it becomes a performance and operational comedy show with a very expensive cast.

Practical next steps

Here’s what I’d do Monday morning if I inherited your “we need more RAIDZ capacity” situation:

  1. Measure reality: run zpool list, zpool iostat -v, and zfs list -t snapshot. Decide whether the problem is capacity, latency, or both.
  2. Buy time safely: prune snapshots and cold data first. If you need immediate capacity, add a new redundant vdev—never a single disk.
  3. Pick an expansion strategy:
    • If you can tolerate a long background rewrite and your platform supports it, RAIDZ expansion is now a legitimate option.
    • If you need minimal feature risk and predictable ops, replace drives with larger ones.
    • If your topology is wrong (ashift, wrong class of drives, wrong redundancy), migrate to a new pool.
  4. Run the boring guardrails: scrub first, check SMART, validate backups, and document the rollback.

One paraphrased idea from Richard Cook (safety and operations researcher): Success in complex systems often comes from teams constantly adapting, not from the absence of problems.

Joke #2: The quickest way to learn your pool topology is wrong is to attempt to “temporarily” fix it during a holiday change freeze.
