You bought “just enough” disks for a RAIDZ. Six months later the pool is at 83%, snapshots are multiplying like rabbits, and the business wants another quarter’s worth of data yesterday.
This is where people discover that ZFS isn’t magic—it’s engineering. RAIDZ gives you great density and nice failure tolerance. It also makes capacity expansion feel like arguing with physics. The good news: modern OpenZFS has finally grown a new limb here. The bad news: you still need to understand what it does, what it doesn’t do, and which workarounds are safe in production.
What’s possible today (and what isn’t)
Let’s separate three concepts that get mashed together in Slack threads and incident calls:
- Expanding a pool: adding more top-level vdevs (e.g., add another RAIDZ group). Easy. Has consequences.
- Growing a vdev: making an existing vdev wider (e.g., RAIDZ1 from 5 disks to 6 disks). Historically “no,” now “sometimes yes” depending on feature support.
- Growing each disk: replacing disks with larger ones and letting the vdev grow after the last replacement. Classic, safe, slow.
1) Adding a disk to an existing RAIDZ vdev
Today: This is possible on OpenZFS 2.3 and newer, where the RAIDZ expansion feature exists and can be enabled on the pool. You can attach an additional drive to a RAIDZ vdev and the vdev will undergo an expansion process.
The catch: It’s not instant, and it’s not free. Existing data must be rewritten to take advantage of the new geometry (more columns). That means a long-running background process that looks a lot like a full-pool rewrite. Expect heavy I/O, lots of time, and performance variability.
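On a platform that supports it, the operation itself is a single attach against the RAIDZ vdev. A minimal sketch, assuming OpenZFS 2.3 or newer with the expansion feature enabled, and a hypothetical by-id path for the new disk:
cr0x@server:~$ sudo zpool attach tank raidz2-0 /dev/disk/by-id/ata-EXAMPLE_MODEL_NEWSERIAL
cr0x@server:~$ zpool status tank
While it runs, zpool status reports the expansion progress. Budget hours to days on a large, busy vdev, not minutes.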
2) Replacing all drives with bigger ones
Today: Still the most boring-and-correct way to grow RAIDZ capacity. You replace drives one at a time, resilver after each, then the vdev expands once the last device is larger and autoexpand is honored.
The catch: You need patience and good drives. Resilvering on big disks is a lifestyle choice.
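The shape of one replacement cycle, sketched with hypothetical by-id paths; autoexpand needs to be on (or each device expanded explicitly) before the new capacity appears:
cr0x@server:~$ sudo zpool set autoexpand=on tank
cr0x@server:~$ sudo zpool replace tank ata-OLD_MODEL_OLDSERIAL /dev/disk/by-id/ata-NEW_MODEL_NEWSERIAL
cr0x@server:~$ zpool status tank
Repeat once per disk, letting each resilver finish before you touch the next bay; the vdev only grows after the last, smallest device has been replaced.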
3) Adding a new top-level vdev (another RAIDZ group, or mirrors)
Today: Always works. It’s how ZFS was designed to scale pools: striped across vdevs.
The catch: You are permanently changing your redundancy profile and failure domain math. ZFS stripes across top-level vdevs; losing any one top-level vdev loses the pool. So if you “just add a single disk” (a one-disk vdev), you’ve created a pool that can die from one disk failure. That’s not an expansion, that’s a loaded foot-gun.
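If you do add a vdev, add a whole redundant one and dry-run it first. A sketch with hypothetical device paths; -n prints the resulting layout without changing anything:
cr0x@server:~$ sudo zpool add -n tank raidz2 /dev/disk/by-id/ata-D1 /dev/disk/by-id/ata-D2 /dev/disk/by-id/ata-D3 /dev/disk/by-id/ata-D4 /dev/disk/by-id/ata-D5 /dev/disk/by-id/ata-D6
Review the printed config, then rerun without -n. zpool add also complains when the new vdev's redundancy doesn't match the existing ones; don't silence that warning with -f unless you enjoy postmortems.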
Opinionated guidance: If you can use RAIDZ expansion safely (feature supported, maintenance window tolerance, I/O headroom), it’s a legitimate tool now. If you can’t, don’t get cute: either replace disks with larger ones or migrate to a new pool/vdev layout. “Temporary” single-disk vdevs have a way of becoming permanent right up until they become catastrophic.
Joke #1: RAIDZ expansion is like a gym membership—technically you can make progress, but it’s going to demand time and sustained discomfort.
How RAIDZ expansion works under the hood (enough to make decisions)
RAIDZ writes data in “stripes” across disks, with parity. The number of disks participating in a stripe is the vdev’s “width.” When you add a disk to a RAIDZ vdev (with the expansion feature), you’re changing that width.
Here’s why it’s hard: old blocks are laid out according to the old width. New blocks could be written with the new width, but then you’d end up with a vdev that contains a mix of layouts. That is possible, but it means:
- Capacity accounting gets complicated—space isn’t magically freed until enough data is rewritten.
- Performance characteristics vary depending on how much of the dataset is “old layout” vs “new layout.”
- Parity math, allocation classes, and metaslab selection all need to cooperate without breaking on-disk compatibility.
So expansion involves a controlled rewrite of blocks so they can be redistributed across the new geometry. Think of it as: “the pool learns a new gait, then slowly teaches its existing data to walk that way.”
What you should expect operationally
- A long-running process that competes with normal workloads (read, write, metadata).
- Higher write amplification because blocks are being rewritten, plus parity overhead.
- Snapshot interaction: blocks referenced by snapshots don’t disappear; rewriting may be constrained by retention policies. You don’t “rebalance” around a mountain of immutable snapshot history without paying for it.
- Thermal and SMART stress on the entire vdev. If your drives are already marginal, you’re about to find out in the least pleasant way.
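A minimal watch list while the rewrite runs, assuming the pool is named tank:
cr0x@server:~$ zpool status tank
cr0x@server:~$ zpool iostat -v tank 5
cr0x@server:~$ for d in /dev/sd?; do sudo smartctl -H $d; done
Status shows progress, iostat shows whether any single disk is the outlier, and the SMART health loop catches a drive going marginal before ZFS faults it.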
What expansion does not fix
- Bad ashift choices (e.g., 4K disks with ashift=9). That’s forever.
- Fundamentally wrong topology for your workload (e.g., RAIDZ for heavy sync random writes in a latency-sensitive DB). Wider RAIDZ may improve throughput, but it won’t turn it into a mirror.
- High fragmentation and snapshot bloat caused by workload patterns. Expansion may reduce pressure but not the underlying behavior.
Facts and history that explain the weirdness
Some context points that explain why “just add a disk” took so long to become real in ZFS-land:
- ZFS was born in an era of big, expensive disks where planning the vdev layout upfront was assumed. The culture came with it: “Choose wisely; changing later is hard.”
- Top-level vdev striping is foundational: ZFS pools are a stripe across vdevs. This makes scaling easy but makes mixed-redundancy pools risky when people improvise.
- RAIDZ parity isn’t RAID5/6 glued on the side; it’s integrated into allocation and block pointers. That deep integration is great for integrity and awful for retrofitting layout changes.
- The “replace-with-bigger-drives” method predates most modern ZFS forks and became the canonical expansion story because it didn’t require rewriting everything at once.
- Resilvering is block-based, not full-disk (in many cases). That’s a ZFS advantage—but RAIDZ resilvers can still be long because parity reconstruction touches a lot of data.
- OpenZFS consolidation mattered: ZFS development diverged across Solaris descendants, then re-converged. Big on-disk feature work tends to move at “careful filesystem speed.”
- Snapshots are a first-class citizen, which changes everything: “delete and rewrite” isn’t simple if older blocks are still referenced.
- End-to-end checksums changed failure expectations: you don’t just want capacity; you want correct data. Any expansion feature must preserve integrity across failure cases, power loss, and partial progress.
- Drive capacities exploded faster than rebuild windows shrank: this widened the pain gap—people wanted expansion because buying larger replacements meant multi-day resilvers.
Practical tasks: commands, outputs, and decisions
Below are real tasks I run before touching a production pool. Each includes a command, sample output, what the output means, and the decision you make.
Task 1: Identify your ZFS and kernel/userspace versions
cr0x@server:~$ zfs version
zfs-2.2.4-1
zfs-kmod-2.2.4-1
What it means: You’re on OpenZFS 2.2.x userspace and kernel module. RAIDZ expansion landed in OpenZFS 2.3.0, so a 2.2.x host doesn’t have it; exact feature availability also depends on distribution packaging.
Decision: If you’re below the release that carries the feature, stop assuming RAIDZ expansion exists. Plan an upgrade window first, or use the older workarounds.
Task 2: Check pool health and any latent damage
cr0x@server:~$ zpool status -xv
all pools are healthy
What it means: No known errors at the moment.
Decision: If this shows errors, do not expand. Fix underlying device issues first. Expansion multiplies I/O and will surface weak drives.
Task 3: Get the exact vdev topology (this dictates your options)
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
errors: No known data errors
What it means: Single RAIDZ2 vdev, 6-wide. Losing more than 2 drives in that vdev kills the pool.
Decision: You can (a) expand this vdev width (if supported), (b) replace drives with bigger ones, or (c) add another RAIDZ2 vdev. Do not add single disks.
Task 4: Confirm ashift and sector alignment (you can’t “fix” it later)
cr0x@server:~$ zdb -C tank | grep ashift -n
56: ashift: 12
57: asize: 599998545920
What it means: ashift=12 (4K sectors) is generally sane for modern disks.
Decision: If ashift is wrong (commonly 9 on 4K drives), don’t sink more time into expansion. Migration is usually the only correct fix.
Task 5: Check feature flags and whether the pool can support newer features
cr0x@server:~$ zpool get -H all tank | egrep 'feature@|compatibility'
tank compatibility off local
What it means: Compatibility mode is off; feature flags may be enabled individually.
Decision: In environments that boot older recovery media or replicate to older hosts, enabling new features can be a compatibility break. Confirm your fleet supports it before you turn knobs.
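For the expansion question specifically, you can ask about the one flag you care about. A minimal check, assuming the feature name used by current OpenZFS releases (raidz_expansion):
cr0x@server:~$ zpool get feature@raidz_expansion tank
A value of disabled means the pool could enable it, enabled or active means it is already on, and an error about an invalid property means your userspace is too old to know the feature exists.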
Task 6: Check pool capacity and fragmentation (expansion doesn’t like full pools)
cr0x@server:~$ zpool list -o name,size,alloc,free,cap,frag,health tank
NAME SIZE ALLOC FREE CAP FRAG HEALTH
tank 21.8T 19.1T 2.7T 87% 61% ONLINE
What it means: 87% full and 61% fragmented. You’re in the “everything is harder now” zone.
Decision: If CAP > 80%, prioritize freeing space before expansion or replacements. ZFS performance and allocation behavior degrade badly when full.
Task 7: Identify the biggest space consumers, including snapshots
cr0x@server:~$ zfs list -t all -o name,used,avail,refer,mountpoint -S used | head
NAME USED AVAIL REFER MOUNTPOINT
tank 19.1T 2.7T 256K /tank
tank/backups 11.2T 2.7T 4.1T /tank/backups
tank/vm 5.6T 2.7T 5.6T /tank/vm
tank/backups@snap-1 2.8T - 4.0T -
tank/home 1.9T 2.7T 1.9T /tank/home
What it means: Snapshots can account for a shocking amount of “USED,” and they pin old blocks.
Decision: If expansion depends on rewriting blocks but snapshots keep them alive, trim snapshot retention first (with stakeholder buy-in).
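Before you negotiate retention, measure what a deletion would actually return. zfs destroy with -nv is a dry run that prints the reclaimable space; the snapshot range below is hypothetical:
cr0x@server:~$ zfs destroy -nv tank/backups@snap-1%snap-9
The "would reclaim" figure is the number to bring to stakeholders, not the raw USED column.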
Task 8: Check sync and write-path properties (SLOG misconceptions live here)
cr0x@server:~$ zfs get -o name,property,value -s local,default sync,logbias,recordsize tank/vm
NAME PROPERTY VALUE
tank/vm sync standard
tank/vm logbias latency
tank/vm recordsize 128K
What it means: VM dataset is default sync behavior, latency logbias, 128K recordsize.
Decision: If you’re seeing latency pain, don’t assume “expansion fixes it.” You may have a sync write bottleneck that needs SLOG tuning or workload changes.
Task 9: Verify autotrim, compression, and atime—quiet killers and quiet savers
cr0x@server:~$ zfs get -o name,property,value compression,atime,relatime tank
NAME PROPERTY VALUE
tank compression lz4
tank atime off
tank relatime off
cr0x@server:~$ zpool get -o name,property,value autotrim tank
NAME PROPERTY VALUE
tank autotrim off
What it means: Compression is on (good). atime off (often good). autotrim off (might be fine for HDD; on SSD it’s a discussion).
Decision: Expansion increases writes; compression can reduce them. If your pool is SSD-based, consider autotrim behavior carefully—but test first.
Task 10: Baseline I/O and latency before any surgery
cr0x@server:~$ zpool iostat -v tank 5 3
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 19.1T 2.7T 85 210 1.20G 2.45G
raidz2-0 19.1T 2.7T 85 210 1.20G 2.45G
sda - - 12 35 180M 420M
sdb - - 14 36 190M 410M
sdc - - 13 34 200M 430M
sdd - - 15 35 210M 400M
sde - - 16 35 210M 410M
sdf - - 15 35 210M 380M
-------------------------- ----- ----- ----- ----- ----- -----
What it means: This shows per-vdev and per-disk bandwidth/ops. You’re looking for outliers and headroom.
Decision: If disks are already near saturation, expansion will hurt. Schedule a quiet window or throttle the expansion process (where supported) by workload management.
Task 11: Check recent error history and resilver/scrub cadence
cr0x@server:~$ zpool status tank | sed -n '1,120p'
pool: tank
state: ONLINE
scan: scrub repaired 0B in 12:44:02 with 0 errors on Mon Dec 2 03:21:08 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
errors: No known data errors
What it means: Scrub completed cleanly recently. Good baseline.
Decision: If you haven’t scrubbed in months, scrub before expansion/replacements. Find latent errors when you still have slack.
Task 12: Validate that the “new” disk is actually the disk you think it is
cr0x@server:~$ lsblk -o NAME,SIZE,MODEL,SERIAL,TYPE
NAME SIZE MODEL SERIAL TYPE
sda 3.64T ST4000NM0035 ZC1A1ABC disk
sdb 3.64T ST4000NM0035 ZC1A1ABD disk
sdc 3.64T ST4000NM0035 ZC1A1ABE disk
sdd 3.64T ST4000NM0035 ZC1A1ABF disk
sde 3.64T ST4000NM0035 ZC1A1ABG disk
sdf 3.64T ST4000NM0035 ZC1A1ABH disk
sdg 3.64T ST4000NM0035 ZC1A1ABJ disk
What it means: You have a candidate disk (sdg) present. You can cross-check serials with your DC hands ticket.
Decision: Never operate on /dev/sdX blindly. Prefer stable identifiers like /dev/disk/by-id in actual pool configs.
Task 13: Check SMART health before involving a disk in your redundancy math
cr0x@server:~$ sudo smartctl -a /dev/sdg | egrep 'SMART overall-health|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
SMART overall-health self-assessment test result: PASSED
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
What it means: No reallocations/pending sectors. Not a guarantee, but it’s not screaming either.
Decision: If pending sectors or growing reallocations exist, don’t add it. Drives fail during stress events; don’t invite them.
Task 14: Simulate the topology change you’re considering (in your head, using math)
cr0x@server:~$ zpool list -o name,ashift,autoexpand,size,alloc,free,cap tank
NAME ASHIFT AUTOEXPAND SIZE ALLOC FREE CAP
tank 12 off 21.8T 19.1T 2.7T 87%
What it means: autoexpand off. For “replace disks with larger ones,” you typically want autoexpand on eventually.
Decision: Decide your path:
- RAIDZ expansion path: ensure feature support, ensure spare bays, accept rewrite time.
- Drive replacement path: turn on autoexpand at the right time; expect serial resilvers.
- New vdev path: ensure matching redundancy and similar performance characteristics.
Task 15: Prepare for the boring part—verify you can restore
cr0x@server:~$ zfs get -o name,property,value -s local,default copies,encryption,keylocation tank/backups
NAME PROPERTY VALUE
tank/backups copies 1
tank/backups encryption off
tank/backups keylocation none
What it means: This dataset is not encrypted and has one copy. That’s fine if your backup system is elsewhere; terrible if you’re pretending this is the backup.
Decision: Before big layout changes, validate backups and restore drills. Expansion is not supposed to be destructive, but production systems enjoy irony.
Task 16: If you’re going the “replace drives” route, confirm device-by-id naming
cr0x@server:~$ ls -l /dev/disk/by-id/ | egrep 'sd[abc]$' | head -n 3
lrwxrwxrwx 1 root root 9 Dec 25 02:10 ata-ST4000NM0035_ZC1A1ABC -> ../../sda
lrwxrwxrwx 1 root root 9 Dec 25 02:10 ata-ST4000NM0035_ZC1A1ABD -> ../../sdb
lrwxrwxrwx 1 root root 9 Dec 25 02:10 ata-ST4000NM0035_ZC1A1ABE -> ../../sdc
What it means: Stable identifiers exist. Good.
Decision: Use these paths in zpool replace operations to avoid “wrong disk” incidents.
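The replacement itself then references those stable names. A sketch, where the new drive's by-id path is a placeholder:
cr0x@server:~$ sudo zpool replace tank ata-ST4000NM0035_ZC1A1ABC /dev/disk/by-id/ata-ST4000NM0035_NEWSERIAL
cr0x@server:~$ zpool status tank
The old device can be named by its existing label or guid; the new one should always be the full by-id path.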
Fast diagnosis playbook: find the bottleneck fast
When someone says “we need RAIDZ expansion because storage is slow/full,” you need to diagnose the actual bottleneck before you start moving disks around. Here’s the triage order I use.
First: capacity pressure and fragmentation
- Check zpool list CAP and FRAG.
- If CAP > 80% and FRAG > ~50%, expect allocator pain and write amplification.
Action: Delete or move data first; tighten snapshot retention; add space using the least risky method available. Expansion on a nearly-full pool is like repairing an engine while the car is still in a race.
Second: latency vs throughput (they are not the same problem)
- Use zpool iostat -v to spot overloaded disks or a single slow device.
- Correlate with application symptoms: are they complaining about IOPS latency or bulk transfer time?
Action: For latency-sensitive random writes, RAIDZ width changes may not help much. Mirrors or special devices might.
Third: sync writes and intent log behavior
- Check dataset properties: sync, logbias.
- Look for workload types: NFS with sync semantics, databases, VM hypervisors.
Action: If sync latency is the bottleneck, a proper SLOG on power-loss-protected media might help more than any expansion. Or tune the app semantics if you’re allowed.
Fourth: error rates and drive health
- Check zpool status for read/write/cksum errors.
- Check SMART stats on outliers.
Action: Replace failing drives before attempting expansion/rebuild. A marginal disk under heavy rewrite load becomes an outage generator.
Fifth: recordsize, volblocksize, and workload mismatch
- For databases: recordsize too large increases read amplification.
- For VM zvols: check volblocksize choices; changing later can require migration.
Action: Fix the workload/dataset mismatch if possible; expansion doesn’t correct mis-tuned datasets.
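A quick way to see whether datasets and zvols were tuned at all, reusing the tank/vm path from earlier:
cr0x@server:~$ zfs get -r -t filesystem,volume -o name,property,value recordsize,volblocksize tank/vm
recordsize can be changed later (it only affects newly written blocks); volblocksize is fixed at creation, so fixing a bad choice means a new zvol and a migration.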
Best workarounds when you can’t (or shouldn’t) expand RAIDZ
Even with RAIDZ expansion available, there are cases where you should choose a different approach: the pool is too full, the workload can’t tolerate sustained background rewrite, your platform can’t enable new feature flags, or you simply don’t trust the maintenance window.
Workaround 1: Replace drives with larger ones (the “slow but sane” method)
This is the classic: replace one disk at a time, resilver, repeat. Once the last device is larger and the pool recognizes the new size, you get more capacity without changing vdev width.
What it’s good for: predictable risk profile, no new topology, compatibility-friendly.
What it costs: time. Also, repeated resilvers stress the vdev multiple times, which is not nothing on older fleets.
Operational advice: keep at least one cold spare on-site. Treat drive replacement as a campaign, not a series of one-off heroics.
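If you script the campaign, don't guess when a resilver is done. zpool wait blocks until the named activity completes; a minimal sketch:
cr0x@server:~$ zpool wait -t resilver tank
cr0x@server:~$ zpool status -x tank
Put that between replacement steps and the script cannot outrun the pool.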
Workaround 2: Add a new top-level vdev (but do it like an adult)
Add another RAIDZ vdev of the same parity level and similar performance. This increases capacity immediately. It also increases aggregate performance in many cases because you have more spindles and metaslabs.
Rules:
- Match redundancy. RAIDZ2 + RAIDZ2 is reasonable; RAIDZ2 + single disk is negligence.
- Try to match width and drive class. Mixing a wide HDD RAIDZ with a narrow SSD RAIDZ creates fun performance cliff edges.
- Plan for failure domains: more vdevs means more “any vdev failure kills the pool” surfaces. Don’t confuse “more redundancy per vdev” with “invincible pool.”
Workaround 3: Migrate to mirrors for future growth
If you’re in a world where capacity needs to grow in small increments and performance wants IOPS, mirrors are your friend. Mirrors let you add capacity in pairs and get predictable latency.
The trade: you pay in raw capacity. Sometimes that’s fine. Sometimes finance will argue. Finance always argues; it’s their RAIDZ.
Workaround 4: Build a new pool and replicate (the “clean cut”)
If your layout is wrong (ashift, wrong parity level, wrong width, wrong vdev type), stop patching and migrate. Build a new pool with the right geometry and replicate datasets with zfs send | zfs receive.
Why this is often best: it lets you fix multiple structural mistakes at once, and it gives you a rollback plan. The migration itself is work, but it’s the kind of work that ends in a system you actually trust.
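The mechanics are a recursive snapshot plus a replication stream; the pool and snapshot names below are placeholders for your new layout:
cr0x@server:~$ sudo zfs snapshot -r tank@migrate-1
cr0x@server:~$ sudo zfs send -R tank@migrate-1 | sudo zfs receive -u newtank/tank
Follow up with an incremental send (-i or -I) during the cutover window so the final sync is short.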
Workaround 5: Add special vdevs carefully (metadata/small blocks acceleration)
Special vdevs can move metadata (and optionally small blocks) to faster media. This can make a RAIDZ pool feel dramatically snappier, especially for metadata-heavy workloads.
Warning: special vdevs are not a cache. If you lose a special vdev and it contains metadata, you can lose the pool. Mirror special vdevs. Treat them as first-class storage, not garnish.
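A sketch of adding a mirrored special vdev and opting one dataset's small blocks into it; the device paths and the 32K threshold are illustrative, not a recommendation:
cr0x@server:~$ sudo zpool add tank special mirror /dev/disk/by-id/nvme-SSD_A /dev/disk/by-id/nvme-SSD_B
cr0x@server:~$ sudo zfs set special_small_blocks=32K tank/home
Without the mirror keyword you would be adding a single-device special vdev, which is the same loaded foot-gun described earlier.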
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
They had a single RAIDZ2 vdev and were running out of space. A well-meaning engineer said, “ZFS can stripe across vdevs, so we can just add one disk temporarily and move data later.” It sounded plausible in the way that many dangerous ideas do.
The team ran zpool add with a single disk. Capacity pressure eased instantly. The ticket was closed. Everyone went home and enjoyed the rare feeling of “we fixed storage quickly.”
Two months later, that “temporary” disk started throwing errors. Not a total failure—worse. Intermittent timeouts and occasional write errors. ZFS began marking the device as degraded, and then faulted. The pool died because a top-level vdev died. No parity, no mirror. One disk was now a single point of failure for the entire pool.
The postmortem was painful because the system behaved exactly as designed. The wrong assumption wasn’t about ZFS being unreliable; it was about misunderstanding what “striped across vdevs” really implies. They didn’t add capacity—they added a new, fragile pillar under the whole building.
The fix was a recovery from backups and a rebuild of the pool topology. They also added a guardrail: a wrapper that refused zpool add unless the vdev being added met a minimum redundancy rule.
Mini-story 2: The optimization that backfired
A different company had a RAIDZ1 pool for a VM farm. It was “fine” until it wasn’t: latency spikes, guests stuttering, angry application teams. The storage team decided to “optimize” by making the RAIDZ wider during an expansion to increase throughput, assuming more disks equals faster.
They expanded capacity by adding another wide RAIDZ vdev, and they also adjusted dataset settings: larger recordsize, more aggressive caching assumptions, and a heavier snapshot schedule “for safety.” The system benchmarked well on sequential tests. The production workload was neither sequential nor polite.
Once back under real load, sync writes and metadata churn dominated. Wider RAIDZ made small random writes more expensive. The snapshot schedule pinned blocks and amplified fragmentation. Scrubs took longer, resilvers took longer, and the latency tail got uglier.
They didn’t cause data loss, but they created a system that met capacity targets while missing performance SLOs. The rollback was a migration to mirrored vdevs for the VM datasets, leaving RAIDZ for backups and bulk storage.
The lesson wasn’t “RAIDZ is bad.” It was: optimizing for the wrong metric is indistinguishable from sabotage, except the tickets are nicer at first.
Mini-story 3: The boring but correct practice that saved the day
A media company ran an archive pool that was always close to full because someone considered “free space” an optional luxury. The storage engineer—quiet, unglamorous, consistently right—kept insisting on three practices: monthly scrubs, stable device naming, and staged drive replacement with a tested runbook.
One week, a drive started showing pending sectors. The pool was still online. No application alarms yet. But the scrub report and SMART trend were clear: the drive was turning into a future incident.
They swapped the drive during business hours with a controlled resilver. Because they had stable by-id mapping, the risk of replacing the wrong disk was low. Because they scrubbed regularly, there were no nasty latent errors waiting to be discovered during resilver. Because they kept 20% free space as policy, allocations stayed sane and the resilver completed without becoming a performance disaster.
Two days later, another drive in the same vdev failed outright. They took a breath, watched ZFS do its parity job, and moved on with their week. The second failure could have been a pool loss if the first drive had been left to rot.
The “boring” practice didn’t get applause. It did something better: it prevented the 3 a.m. call.
Common mistakes: symptoms → root cause → fix
Mistake 1: “We added a disk and now performance is worse”
Symptoms: Higher latency, scrubs slower, unpredictable throughput after expansion activity.
Root cause: Expansion/rewrite competes with production I/O; also the pool is fragmented and near full.
Fix: Create headroom (delete/migrate data), reduce snapshot retention, schedule heavy rewrite in low-traffic windows, and baseline with zpool iostat -v to confirm contention.
Mistake 2: “We can just add one disk temporarily”
Symptoms: Pool becomes dependent on a non-redundant top-level vdev; later a single disk failure drops the whole pool.
Root cause: Misunderstanding top-level vdev striping and failure domains.
Fix: Never add a single-disk vdev to an important pool. If you already did, migrate off immediately or replace it by attaching redundancy (mirror) if feasible.
Mistake 3: “We replaced all disks but didn’t get more space”
Symptoms: After the final replacement, zpool list still shows old size.
Root cause: autoexpand disabled, or partitions not grown, or underlying device size not exposed.
Fix: Enable autoexpand, confirm partition sizing, export/import if needed, and verify with zpool get autoexpand and lsblk.
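In practice, the checks and nudges look like this, assuming whole-disk vdev members:
cr0x@server:~$ zpool get autoexpand,expandsize tank
cr0x@server:~$ sudo zpool set autoexpand=on tank
cr0x@server:~$ sudo zpool online -e tank ata-ST4000NM0035_ZC1A1ABC
A non-zero EXPANDSZ means the pool can see unclaimed space; online -e tells a specific device to grow into it.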
Mistake 4: “Resilver took forever and then another disk failed”
Symptoms: Multi-day resilver, increasing errors, second drive failure during rebuild.
Root cause: No proactive scrubs; latent errors discovered during rebuild; old drives under stress; pool too full.
Fix: Scrub regularly, keep free space, replace drives on SMART trends, and consider higher parity (RAIDZ2/3) for large HDD pools.
Mistake 5: “We enabled a new feature flag and now replication/boot media broke”
Symptoms: Another host can’t import the pool; recovery image can’t mount; replication target refuses streams.
Root cause: Feature flags/compatibility mismatch across systems.
Fix: Standardize OpenZFS versions across fleet before enabling features; use compatibility properties where appropriate; keep recovery tooling updated.
Mistake 6: “We expanded but capacity didn’t appear immediately”
Symptoms: Added disk shows in config, but usable space grows slowly or not at all.
Root cause: Existing blocks still laid out in old geometry; snapshots pin blocks; rewrite/expansion process takes time.
Fix: Manage expectations, monitor progress, reduce snapshot retention, and avoid expanding on a nearly-full pool.
Checklists / step-by-step plan
Plan A: Use RAIDZ expansion (when supported and you can tolerate the rewrite)
- Confirm platform support: OpenZFS version across all importers (prod nodes, DR nodes, rescue media).
- Confirm pool health: clean zpool status, recent scrub, no SMART red flags.
- Create headroom: get CAP under ~80% if possible.
- Freeze risky changes: no simultaneous kernel updates, no firmware roulette, no “while we’re here” experiments.
- Validate backups: test a restore path for at least one representative dataset.
- Identify the target disk by-id: confirm serial, slot, and WWN.
- Schedule the window: expansion is disruptive; plan for degraded performance and longer batch jobs.
- Monitor continuously: watch zpool status, zpool iostat, SMART, and application latency.
- Post-change scrub: once the system settles, run a scrub to verify integrity under the new configuration.
Plan B: Replace disks with larger ones (production-friendly, time-expensive)
- Run a scrub first; fix any errors before starting.
- Replace one disk at a time. Let resilver complete fully.
- Do not replace multiple disks “to save time” unless you like gambling with parity.
- After the final disk replacement, ensure the vdev expands (autoexpand, partition sizing).
- Validate performance and capacity; then adjust snapshot retention if the real goal was “free space.”
Plan C: Add a new vdev (fast capacity, permanent topology change)
- Choose redundancy to match or exceed existing vdevs.
- Match drive class and approximate width for predictable performance.
- Document the new failure domain math for on-call.
- After adding, watch distribution: new writes land on the new vdev; old data stays where it is unless rewritten.
- If you need rebalancing, plan it explicitly (send/receive or controlled rewrite), not as wishful thinking.
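To see how allocation is skewed after an add, and whether an explicit rebalance is worth the effort, per-vdev numbers tell the story:
cr0x@server:~$ zpool list -v tank
If the old vdev sits near full while the new one absorbs all writes, reads of existing data still hit the old spindles; that's expected behavior, not a bug.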
FAQ
1) Can I add one disk to my RAIDZ vdev and get more space instantly?
Not instantly in the way people hope. With RAIDZ expansion support, you can add a disk and start an expansion process, but usable space may materialize gradually as data is rewritten.
2) Is it safer to add a new RAIDZ vdev instead of expanding an existing one?
“Safer” depends on what you mean. Adding a new redundant vdev is a well-understood operation with predictable failure domains. Expansion rewrites a lot of data and stresses drives. If your drives are old or your pool is very full, adding a new vdev (properly redundant) may be the lower-risk choice.
3) Why can’t ZFS just rebalance data automatically after I add capacity?
ZFS doesn’t shuffle old blocks just because you added space; it allocates new blocks to new metaslabs. Automatic rebalancing would mean huge background I/O and complex policy decisions (which data, when, how aggressively, with what priority).
4) If I add a new vdev, will reads get faster?
Often yes for parallel workloads, because you have more vdevs to service reads. But hot data already written to the old vdev stays there; “more vdevs” doesn’t teleport your working set.
5) Should I switch from RAIDZ to mirrors for future growth?
If you need incremental expansion and predictable IOPS latency, mirrors are usually the right answer. If you need maximum usable capacity per disk and the workload is more sequential/bulk, RAIDZ is still great. Don’t pick based on religion; pick based on failure domain and latency requirements.
6) Does adding a SLOG help with expansion or capacity?
No for capacity. For performance: it helps only for synchronous writes, and only if the SLOG device is fast and power-loss safe. SLOG myths are eternal; so are write caches that lie.
7) Will RAIDZ expansion fix my “pool is 90% full” problem?
It can provide more space, but expanding a nearly-full pool is risky and can be painfully slow. First priority: reduce usage, prune snapshots, or add capacity in a way that doesn’t require a massive rewrite under pressure.
8) How do I know if snapshots are blocking my space recovery?
Look at zfs list output including snapshots, and compare dataset REFER vs USED. Large snapshot USED values mean old blocks are pinned. If you delete data and space doesn’t return, snapshots are a prime suspect.
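The per-dataset breakdown makes that conversation concrete; a minimal check against the backups dataset from earlier:
cr0x@server:~$ zfs get -o name,property,value usedbydataset,usedbysnapshots tank/backups
If usedbysnapshots dominates, retention policy is the lever, not more disks.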
9) Can I “undo” a bad expansion choice?
You can’t remove a top-level vdev from a pool in the general case. You can sometimes evacuate and destroy a pool, or migrate datasets to a new pool with a better layout. That’s why topology decisions deserve paranoia.
10) What’s the single best habit to prevent expansion drama?
Keep headroom. Under ~70–80% usage, ZFS behaves like a competent filesystem. Above that, it becomes a performance and operational comedy show with a very expensive cast.
Practical next steps
Here’s what I’d do Monday morning if I inherited your “we need more RAIDZ capacity” situation:
- Measure reality: run zpool list, zpool iostat -v, and zfs list -t snapshot. Decide whether the problem is capacity, latency, or both.
- Buy time safely: prune snapshots and cold data first. If you need immediate capacity, add a new redundant vdev, never a single disk.
- Pick an expansion strategy:
- If you can tolerate a long background rewrite and your platform supports it, RAIDZ expansion is now a legitimate option.
- If you need minimal feature risk and predictable ops, replace drives with larger ones.
- If your topology is wrong (ashift, wrong class of drives, wrong redundancy), migrate to a new pool.
- Run the boring guardrails: scrub first, check SMART, validate backups, and document the rollback.
One paraphrased idea from Richard Cook (safety and operations researcher): Success in complex systems often comes from teams constantly adapting, not from the absence of problems.
Joke #2: The quickest way to learn your pool topology is wrong is to attempt to “temporarily” fix it during a holiday change freeze.