The pager goes off because “storage is slow.” The graph says latency is up, but only sometimes. The app team swears nothing changed.
You log in, run zpool status, and there it is: a shiny new vdev added last week, sitting next to older, fuller vdevs like a new hire
who got all the easy tickets and still somehow broke prod.
ZFS makes it easy to add capacity. It also makes it easy to create a pool layout that is permanently imbalanced in performance and risk.
The trap is simple: zpool add adds a vdev to a pool, but ZFS does not automatically rebalance existing data across vdevs.
That one design choice is why your “quick expansion” becomes next quarter’s storage incident.
The “Add VDEV” trap: what really happens
When you run zpool add, you are not “adding disks” to an existing redundancy group. You are adding a brand-new top-level vdev
to the pool. ZFS then stripes new writes across top-level vdevs based on space and a few heuristics, but it does not move the old blocks.
The old vdevs remain full of old data; the new vdev starts empty and absorbs a lot of new allocations.
That seems fine until you remember what a top-level vdev means: the pool’s IOPS, throughput, and fault tolerance are all the sum (or the minimum)
of its top-level vdevs. Add a vdev with a different width, different disk class, different ashift, or different health profile, and you’ve changed the
pool’s behavior for the rest of its life.
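To see the difference in command form, here is a minimal sketch; the pool name and device names are hypothetical, and recent OpenZFS versions will warn (and demand -f) if the new vdev's redundancy doesn't match the existing vdevs:
cr0x@server:~$ sudo zpool attach tank sdx sdy      # extends redundancy INSIDE an existing single-disk or mirror vdev
cr0x@server:~$ sudo zpool add tank mirror sdu sdv  # creates a brand-new top-level vdev; existing data stays where it is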
One pool, many personalities
ZFS pools are not “one RAID.” A pool is a collection of top-level vdevs. Each top-level vdev has its own redundancy (mirror, raidz, dRAID),
and the pool stripes across them. That’s the model. It’s powerful. It’s also how people accidentally build a storage chimera:
three wide RAIDZ2 vdevs, then a single mirror “just to add capacity quickly,” then an SSD special vdev “because metadata,” and now the pool
behaves like a committee where everyone votes and the slowest person still blocks the meeting.
Why ZFS does not rebalance by default
The lack of auto-rebalance isn’t laziness; it’s conservatism. Moving blocks around a live pool is expensive, wears drives, risks power-loss corner cases,
and complicates guarantees. ZFS will happily keep pointers to where data already lives. It optimizes for correctness and survivability, not for “make it pretty.”
The operational consequence is blunt: if you add a vdev and then wonder why one vdev is doing all the work, the answer is “because it’s empty and you’re writing
new data.” If you wonder why reads still hammer old vdevs, it’s because the old data is still there. ZFS is not being mysterious.
It’s being literal.
Facts and context that explain the behavior
- ZFS started at Sun in the early 2000s, designed for end-to-end data integrity with checksums on every block, not just “fast RAID.”
- Pools were a radical shift: instead of carving LUNs out of fixed RAID groups, you aggregate vdevs and allocate dynamically.
- Copy-on-write is the core mechanic: ZFS never overwrites live blocks in place; it writes new blocks and flips pointers. Great for snapshots, tricky for “rebalance.”
- Top-level vdevs are the unit of failure: lose a top-level vdev, lose the pool. That’s why a single-disk vdev is a ticking outage.
- RAIDZ is not RAID5/6 in implementation details; it uses variable-width stripes with parity, interacting with recordsize and allocation patterns in ways that surprise people migrating from hardware RAID.
- “Ashift” is forever per vdev: choose the wrong sector size alignment and you can lock in write amplification for the lifetime of that vdev.
- Special vdevs (metadata/small blocks) made hybrid pools more practical, but they also introduced a new “if this dies, your pool is toast” component unless mirrored.
- L2ARC was never a write cache: it’s a read cache and it resets on reboot unless persistent L2ARC is enabled and supported. People still treat it like magic RAM.
- Resilver behavior differs by vdev type: a mirror resilver copies used blocks from the surviving side; a RAIDZ resilver has to read from every remaining disk in the vdev to reconstruct, so it’s heavier and slower.
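For the L2ARC point above: on Linux OpenZFS builds that support persistent L2ARC, whether the cache survives a reboot is controlled by a module parameter. The path and name below are the usual ones on current releases; treat this as a hedged check, not gospel:
cr0x@server:~$ cat /sys/module/zfs/parameters/l2arc_rebuild_enabled   # 1 means L2ARC contents are rebuilt on pool import instead of starting cold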
Why imbalance hurts: performance, resilience, and cost
Performance: the pool is only as smooth as its worst vdev
In a balanced pool, your IOPS and throughput scale by adding more top-level vdevs of similar capability. In an imbalanced pool, you get weirdness:
bursts of good performance, followed by latency spikes when one vdev is saturated while others are underused.
Here’s the classic pattern: you add a new vdev, new writes mostly land there, and your monitoring looks great. Then the workload shifts to reads of older data
(backups restoring, analytics jobs hitting historical partitions, or a VM fleet booting from older images). Suddenly the old vdevs become read hotspots,
and the new vdev sits bored.
Resilience: mixing redundancy levels is how you buy risk with CAPEX
A pool’s fault tolerance is not the “average redundancy.” It’s the redundancy of each top-level vdev, and losing any one top-level vdev kills the pool.
Add a single mirror vdev next to several RAIDZ2 vdevs, and you just introduced a weaker link: a two-disk mirror can survive one disk failure,
while your RAIDZ2 vdevs survive two. You now have uneven failure domains.
Add a single-disk vdev “temporarily,” and congratulations: you just created a pool that is one disk failure away from complete loss.
The word “temporary” has a long half-life in infrastructure.
Cost: you pay twice—once for the disks, once for the aftermath
Imbalanced pools are expensive in boring ways: more on-call time, more escalations, more “why is this one host slower” debugging, more premature upgrades.
And if you have to “fix” the layout, the fix is often disruptive: migrate data, rebuild pools, or do controlled block rewrites that take weeks.
Joke #1: Adding a mismatched vdev to a ZFS pool is like putting a spare tire on a race car—yes, it rolls, and yes, everyone can hear it.
Fast diagnosis playbook (first/second/third checks)
When someone says “ZFS got slow after we added disks,” do not start by tuning recordsize. Start by proving whether the pool is imbalanced and where the time is going.
First: Is one top-level vdev doing all the work?
- Check per-vdev I/O and latency with zpool iostat -v.
- Look for one vdev with consistently higher busy/await than others.
- If you see that, stop. You’re not debugging “ZFS performance.” You’re debugging “layout and allocation.”
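If your OpenZFS version has it, the -l flag adds per-vdev latency columns to zpool iostat, which answers the “who is slow” question directly; the pool name matches the examples later in this article:
cr0x@server:~$ sudo zpool iostat -vl tank 1
# compare total_wait and disk_wait per vdev: one vdev consistently higher than its peers is your hot spot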
Second: Is this read-bound, write-bound, or sync-bound?
- Read vs write mix: zpool iostat -v 1 and arcstat (if available) to see ARC hit rate and read pressure.
- Sync writes: check zfs get sync and whether a SLOG exists and is healthy.
- If latency spikes correlate with sync write bursts, you’re looking at ZIL/SLOG or underlying write latency.
Third: Are you actually blocked on the device layer?
- Check kernel device stats: iostat -x and smartctl error counters.
- Confirm ashift and sector sizes. A new vdev with a different physical sector size can behave differently under load.
- Look for silent mischief: a controller link downshifted, a drive in SATA 1.5G mode, or NCQ disabled.
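Two quick, hedged checks for the “silent mischief” class; /dev/sdX is a placeholder for each suspect disk:
cr0x@server:~$ sudo smartctl -i /dev/sdX | grep -i 'sata version'    # a 6 Gb/s drive negotiated down to 1.5 Gb/s is a smoking gun
cr0x@server:~$ lsblk -o NAME,MODEL,TRAN,ROTA,PHY-SEC,LOG-SEC          # transport, rotational flag, and sector sizes for every disk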
Practical tasks (commands, outputs, decisions)
Below are real tasks you can run on a typical Linux OpenZFS host. Each one includes what the output means and what decision you make from it.
Use them like a checklist when you’re sleep-deprived and your change window is closing.
Task 1: Show pool topology and spot the “new vdev” immediately
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 03:12:44 with 0 errors on Tue Dec 10 02:40:11 2025
config:
NAME             STATE     READ WRITE CKSUM
tank             ONLINE       0     0     0
  raidz2-0       ONLINE       0     0     0
    sda          ONLINE       0     0     0
    sdb          ONLINE       0     0     0
    sdc          ONLINE       0     0     0
    sdd          ONLINE       0     0     0
    sde          ONLINE       0     0     0
    sdf          ONLINE       0     0     0
  raidz2-1       ONLINE       0     0     0
    sdg          ONLINE       0     0     0
    sdh          ONLINE       0     0     0
    sdi          ONLINE       0     0     0
    sdj          ONLINE       0     0     0
    sdk          ONLINE       0     0     0
    sdl          ONLINE       0     0     0
  mirror-2       ONLINE       0     0     0
    nvme0n1p2    ONLINE       0     0     0
    nvme1n1p2    ONLINE       0     0     0
errors: No known data errors
Meaning: This pool has two RAIDZ2 vdevs and then a mirror. That mirror is a different class (NVMe) and a different redundancy profile.
Decision: Treat this as a heterogeneous pool. Expect allocation skew and performance “modes.” If this mirror was added for capacity, plan a migration
or a deliberate rebalance approach rather than more patchwork.
Task 2: Identify which vdev is hot (IOPS and bandwidth)
cr0x@server:~$ sudo zpool iostat -v tank 1 5
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 82.1T 21.9T 3.21K 1.05K 410M 198M
raidz2-0 41.0T 3.7T 2.80K 190 360M 31M
sda - - 470 32 61M 5.2M
sdb - - 458 30 59M 5.1M
sdc - - 472 31 61M 5.1M
sdd - - 469 33 60M 5.3M
sde - - 467 31 60M 5.2M
sdf - - 464 33 59M 5.2M
raidz2-1 40.9T 3.8T 380 175 49M 29M
sdg - - 64 29 8.1M 4.8M
sdh - - 63 30 8.0M 4.9M
sdi - - 62 28 8.0M 4.7M
sdj - - 64 29 8.2M 4.8M
sdk - - 63 29 8.0M 4.8M
sdl - - 64 30 8.1M 4.9M
mirror-2 250G 850G 30 690 1.2M 138M
nvme0n1p2 - - 15 345 0.6M 69M
nvme1n1p2 - - 15 345 0.6M 69M
-------------------------- ----- ----- ----- ----- ----- -----
Meaning: Reads are dominated by raidz2-0. Writes are dominated by the NVMe mirror. That’s classic “new vdev absorbs writes”
and “old vdev serves reads” behavior.
Decision: If your latency complaints are read-latency related, adding more write-capable vdevs won’t fix it. You need to move hot read data,
add more similar RAIDZ vdevs, or reconsider pool design.
Task 3: Confirm allocation skew and how bad it is
cr0x@server:~$ sudo zpool list -v tank
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 104T 82.1T 21.9T - - 28% 78% 1.00x ONLINE -
raidz2-0 45.5T 41.0T 3.7T - - 34% 91% - ONLINE
raidz2-1 45.5T 40.9T 3.8T - - 31% 90% - ONLINE
mirror-2 1.09T 250G 850G - - 2% 22% - ONLINE
Meaning: Two big vdevs are ~90% full, the new mirror is ~22% used. That is not balanced capacity, so it won’t be a balanced workload either,
unless your workload is magically only new writes.
Decision: Do not keep adding tiny vdevs expecting ZFS to “spread it out.” Decide whether you’re going to add more vdevs of the same class/width
or migrate to a new pool with a clean layout.
Task 4: Check ashift and catch the “new vdev is different” problem
cr0x@server:~$ sudo zdb -C tank | sed -n '1,120p'
MOS Configuration:
version: 5000
name: 'tank'
state: 0
txg: 1293387
pool_guid: 14772854026870511222
vdev_children: 3
vdev_tree:
type: 'root'
id: 0
guid: 14772854026870511222
children[0]:
type: 'raidz'
ashift: 12
nparity: 2
children[0]: type: 'disk' path: '/dev/sda'
...
children[2]:
type: 'mirror'
ashift: 13
children[0]: type: 'disk' path: '/dev/nvme0n1p2'
children[1]: type: 'disk' path: '/dev/nvme1n1p2'
Meaning: RAIDZ vdevs are ashift=12 (4K). The mirror is ashift=13 (8K). That’s not automatically wrong, but it’s a permanent behavioral difference.
Decision: If you see accidental ashift mismatches due to mis-detected sectors or mixed device types, stop doing incremental “fixes.”
Plan a rebuild with correct ashift rather than stacking more inconsistency.
Task 5: Verify dataset properties that change latency (sync, recordsize, compression)
cr0x@server:~$ sudo zfs get -r compression,recordsize,sync tank/vmstore
NAME PROPERTY VALUE SOURCE
tank/vmstore compression lz4 local
tank/vmstore recordsize 16K local
tank/vmstore sync standard default
Meaning: Reasonable VM defaults: lz4 compression and 16K recordsize. Sync is standard, meaning synchronous writes are honored.
Decision: If you’re seeing sync write latency and you don’t have a SLOG, consider adding a mirrored SLOG on power-loss-safe devices.
Do not “fix” sync latency by setting sync=disabled unless you’re comfortable explaining data loss in postmortem.
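If you decide a SLOG is warranted, the add itself is one command; the device names below are illustrative, and the devices must have power-loss protection for this to be worth anything:
cr0x@server:~$ sudo zpool add tank log mirror nvme2n1p1 nvme3n1p1
cr0x@server:~$ sudo zpool iostat -v tank 1    # the log vdev should start absorbing the sync write bursts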
Task 6: Check if a SLOG exists and whether it’s a single point of pain
cr0x@server:~$ sudo zpool status tank | sed -n '1,80p'
pool: tank
state: ONLINE
config:
NAME           STATE     READ WRITE CKSUM
tank           ONLINE       0     0     0
  raidz2-0     ONLINE       0     0     0
  raidz2-1     ONLINE       0     0     0
  mirror-2     ONLINE       0     0     0
logs
  mirror-3     ONLINE       0     0     0
    nvme2n1p1  ONLINE       0     0     0
    nvme3n1p1  ONLINE       0     0     0
Meaning: There is a mirrored log device. Good: no single-device SLOG. Also, it’s separate from the top-level vdev mirror.
Decision: If logs show as a single disk, fix that before you do anything else. A single SLOG is an outage waiting to happen,
and a slow SLOG is a latency factory for sync-heavy workloads.
Task 7: Measure real latency and queueing at the OS layer
cr0x@server:~$ iostat -x 1 3
Linux 6.8.0 (server) 12/25/2025 _x86_64_ (32 CPU)
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 78.0 5.0 62.1 4.1 173.2 9.80 110.2 120.3 18.1 2.8 23.1
sdg 12.0 4.0 9.5 3.2 194.0 0.90 22.4 24.1 17.3 2.4 3.8
nvme0n1 1.0 420.0 0.2 132.0 64.0 1.10 2.6 1.9 2.6 0.1 4.2
Meaning: sda has high await and queue depth compared to other disks. That usually means the vdev it belongs to is saturated.
Decision: Confirm this aligns with zpool iostat -v. If yes, you have a hot vdev.
If no, you might have controller or path issues specific to that disk.
Task 8: Check ARC pressure and whether reads are missing cache
cr0x@server:~$ sudo arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% size c
12:01:11 3200 1200 37 800 25 350 11 50 2 118G 128G
12:01:12 3400 1500 44 980 29 420 12 60 2 118G 128G
12:01:13 3350 1480 44 950 28 430 13 60 2 118G 128G
Meaning: Cache miss rate is ~40%+ under load. That can turn your “fast new vdev” story into “old vdev reads are killing us,” because the ARC
isn’t saving you.
Decision: If read misses correlate with latency, consider adding RAM (often the best IOPS upgrade per dollar),
or separate hot datasets onto faster, correctly redundant vdevs instead of mixing.
Task 9: Inspect fragmentation and recognize the “nearly full RAIDZ vdev” cliff
cr0x@server:~$ sudo zpool get -H -o property,value fragmentation tank
fragmentation 28%
Meaning: Fragmentation is non-trivial. In RAIDZ, high fullness + fragmentation can amplify write cost and latency variability.
Decision: If your RAIDZ vdevs are 85–90% full and frag climbs, stop trying to “tune” it. Add equivalent vdevs or migrate.
The physics of allocation will win.
Task 10: See if one dataset is the bully (and move it if needed)
cr0x@server:~$ sudo zfs list -o name,used,available,refer,compressratio -r tank | head -n 12
NAME USED AVAIL REFER RATIO
tank 82.1T 21.9T 128K 1.22x
tank/vmstore 42.8T 21.9T 42.8T 1.05x
tank/backups 25.6T 21.9T 24.1T 1.48x
tank/analytics 11.9T 21.9T 10.7T 1.31x
tank/home 1.8T 21.9T 1.8T 1.63x
Meaning: VMstore dominates space. If the hot workload is here, it dominates I/O too.
Decision: Consider moving tank/vmstore to a new, properly designed pool and leaving colder datasets behind.
A surgical split beats a heroic “rebalance everything.”
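A hedged sketch of the surgical split, assuming a freshly built pool named newpool already exists; this is the initial copy, taken while the source stays live:
cr0x@server:~$ sudo zfs snapshot -r tank/vmstore@migrate-base
cr0x@server:~$ sudo zfs send -R tank/vmstore@migrate-base | sudo zfs receive -u newpool/vmstore
Follow it later with an incremental send and a short cutover window; the rebalance section below shows the same rewrite pattern applied in place.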
Task 11: Check for silent device errors before blaming allocation
cr0x@server:~$ sudo smartctl -a /dev/sda | egrep "Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count"
Reallocated_Sector_Ct 0
Current_Pending_Sector 0
Offline_Uncorrectable 0
UDMA_CRC_Error_Count 27
Meaning: CRC errors suggest cabling/backplane/controller issues, not dying media. That can present as “one vdev is slow.”
Decision: Fix the physical path and clear the error trend (replace cable, reseat, swap bay) before you redesign the pool.
Task 12: Confirm that the “new vdev” is actually being used for new allocations
cr0x@server:~$ sudo zpool iostat -v tank 5
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 82.1T 21.9T 980 2.20K 210M 420M
raidz2-0 41.0T 3.7T 820 180 180M 40M
raidz2-1 40.9T 3.8T 150 160 28M 36M
mirror-2 250G 850G 10 1.86K 2.0M 344M
-------------------------- ----- ----- ----- ----- ----- -----
Meaning: Writes are flowing primarily to the mirror vdev. That’s expected when it’s emptier and faster.
Decision: If this mirror is not meant to be the primary write sink (for cost, endurance, or policy reasons),
you need to redesign, not “hope it evens out.”
Task 13: Prove you can’t remove the top-level vdev later (the irreversible part)
cr0x@server:~$ sudo zpool remove tank mirror-2
cannot remove mirror-2: operation not supported on this pool
Meaning: Top-level vdev removal depends on your OpenZFS version, and it is not supported at all on pools that contain RAIDZ top-level vdevs (like this one).
Even where it is supported (pools built from mirrors and single-disk vdevs with matching ashift), it’s constrained and can take a long time.
Decision: Treat zpool add as permanent unless you have validated vdev removal support in your environment
and you can tolerate the time and risk.
Task 14: Validate special vdev configuration (because losing it can lose the pool)
cr0x@server:~$ sudo zpool status tank | sed -n '1,120p'
pool: tank
state: ONLINE
config:
NAME           STATE     READ WRITE CKSUM
tank           ONLINE       0     0     0
  raidz2-0     ONLINE       0     0     0
  raidz2-1     ONLINE       0     0     0
special
  mirror-4     ONLINE       0     0     0
    nvme4n1p1  ONLINE       0     0     0
    nvme5n1p1  ONLINE       0     0     0
Meaning: Special vdev is mirrored. Good. If it were a single device and it died, you could lose metadata/small blocks and effectively lose the pool.
Decision: Never run special vdev as a single device in production. Mirror it, monitor it, and size it with headroom.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company had a ZFS pool backing a virtualization cluster. They ran two RAIDZ2 vdevs of identical HDDs. Growth was steady.
Then they got a new customer and storage jumped. Someone asked the obvious question: “Can we just add disks?”
The engineer on duty did what many of us would do at 2 a.m.: they added a new mirror vdev using two spare drives, because it was quick and capacity was urgent.
They assumed the pool would “rebalance,” or at least “spread I/O” evenly across vdevs over time. The pool stayed online, graphs improved, everyone went back to sleep.
Two weeks later, the platform team rolled out a new VM image and the fleet rebooted across the cluster. Boot storms are a read-heavy festival of small random I/O.
Reads hit the older data on the nearly-full RAIDZ vdevs. Latency spiked. The new mirror vdev had plenty of free space and speed, but it didn’t help because it held mostly new blocks.
The outage wasn’t a mystery; it was a topology debt being collected with interest. Their fix wasn’t clever tuning. They migrated the busiest VM datasets to a new pool
built from mirrors (matching the workload), then repurposed the old pool for colder data. The painful lesson: ZFS will not protect you from your own assumptions.
Mini-story 2: The optimization that backfired
An enterprise analytics team wanted faster query performance. They had a ZFS pool of RAIDZ2 HDD vdevs. A well-meaning optimization proposal arrived:
“Add a couple of NVMe drives as a mirror vdev, and ZFS will stripe across it. Boom: faster.”
They added the NVMe mirror as a top-level vdev. New writes (including temporary query spill data) landed disproportionately on NVMe. At first, it looked fantastic.
But the NVMe drives were consumer-grade and not power-loss safe. Worse: they were now in the hot write path for a workload that did lots of synchronous writes
due to application behavior they didn’t fully control.
A few months in, one NVMe started throwing media errors. The mirror protected them from immediate data loss, but performance degraded sharply during error handling
and resilver activity. The business impact was “queries sometimes take 10x longer,” which is how analytics outages present: not down, just unusable.
The backfire was subtle: they hadn’t just “added performance.” They changed allocation, failure characteristics, and the operational profile of the entire pool.
Their eventual solution was boring: move analytics scratch to a separate NVMe-only pool (mirrors), keep the HDD pool for durable data, and make the sync semantics explicit.
Mini-story 3: The boring but correct practice that saved the day
A finance company ran ZFS for a document archive and a VM farm. Their storage lead was allergic to “quick fixes” and insisted on a policy:
top-level vdevs must be identical in width, type, and disk class; no exceptions without a written risk sign-off.
The policy annoyed people because it made expansions slower. When capacity got tight, they didn’t “add whatever disks exist.”
They bought enough drives to add a full new RAIDZ2 vdev matching the existing ones, and they staged it as a planned change with burn-in tests.
Months later, they had a controller firmware issue that intermittently increased latency on one SAS path. Because their vdevs were consistent,
their metrics told a clear story: one path was wrong; the pool layout wasn’t muddying the signal. They isolated the faulty path, failed it over cleanly,
and scheduled a firmware fix.
The saving wasn’t heroics; it was clarity. Homogeneous vdevs and disciplined changes meant the system behaved predictably when hardware got weird,
which is the only time you really find out what your architecture is made of.
How to expand ZFS safely (what to do instead)
Rule 1: Add top-level vdevs that match, or accept permanent weirdness
If you’re going to grow a pool by adding vdevs, add vdevs of the same type, same width, and similar performance class. Mirrors with mirrors.
RAIDZ2 with RAIDZ2 of the same disk count. Similar ashift. Similar disk models if you can.
This isn’t aesthetic. It’s operational math: the pool stripes across vdevs, and any vdev can become the limiting factor depending on access patterns.
Homogeneity makes performance and capacity planning tractable.
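A matching expansion, sketched with hypothetical device names; -n is a dry run that prints the resulting layout so you can eyeball it before committing:
cr0x@server:~$ sudo zpool add -n tank raidz2 sdm sdn sdo sdp sdq sdr
cr0x@server:~$ sudo zpool add tank raidz2 sdm sdn sdo sdp sdq sdr    # same width and parity as raidz2-0 and raidz2-1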
Rule 2: If you need different media types, use separate pools or special vdevs deliberately
Want NVMe performance and HDD capacity? You have three sane patterns:
- Separate pools: one NVMe pool for latency-sensitive datasets, one HDD pool for bulk. Simple, predictable, easy to reason about.
- Special vdev (mirrored): for metadata and small blocks, to accelerate directory operations and small random I/O while keeping data on HDD.
- SLOG (mirrored, PLP devices): to accelerate synchronous write latency without changing where data ultimately lands.
What you should avoid is “just add a fast vdev” and hope ZFS turns it into a tiering system. It won’t. It will turn it into an allocation sink.
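The “separate pools” pattern in one hedged sketch; pool, dataset, and device names are made up:
cr0x@server:~$ sudo zpool create fastpool mirror nvme6n1 nvme7n1
cr0x@server:~$ sudo zfs create -o recordsize=16K -o compression=lz4 fastpool/vmstore    # latency-sensitive data lives here, bulk stays on the HDD pool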
Rule 3: Consider rebuilding instead of patching when the layout is already compromised
Sometimes the correct answer is: build a new pool with the layout you wanted, then migrate datasets. Yes, it’s work. It’s also finite work.
Living with a compromised pool is infinite work.
What about “rebalance”?
ZFS does not have a one-command online rebalance that redistributes existing blocks across vdevs the way some distributed systems do.
If you want data to move, you usually have to rewrite it.
Practical approaches include:
- Dataset send/receive to a new pool (best when you can provision a new pool).
- Dataset-level rewrite in place via replication to a temporary dataset and back (works, but heavy and operationally risky).
- Selective migration of hot datasets to a new pool, leaving cold data where it is.
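The “rewrite in place” option from the list above, sketched and hedged: it needs enough free space for a second copy, a pause in writers around the rename, and it will hammer the pool while it runs. Dataset names are illustrative:
cr0x@server:~$ sudo zfs snapshot tank/analytics@rebal
cr0x@server:~$ sudo zfs send tank/analytics@rebal | sudo zfs receive tank/analytics-rebal
cr0x@server:~$ sudo zfs rename tank/analytics tank/analytics-old      # after stopping writers and doing a final incremental send
cr0x@server:~$ sudo zfs rename tank/analytics-rebal tank/analytics
cr0x@server:~$ sudo zfs destroy -r tank/analytics-old                 # only after verification
The rewritten blocks allocate according to the pool’s current free-space weighting, which is the whole point of the exercise.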
Joke #2: ZFS doesn’t “rebalance” your pool; it “preserves history.” Like your audit department, it remembers everything and moves nothing without paperwork.
Checklists / step-by-step plan
Step-by-step: Before you run zpool add in production
- Write down the goal: capacity, IOPS, throughput, or latency? “More space” is not a performance plan.
- Confirm current vdev geometry: mirror vs raidz, disk counts, ashift, device classes.
- Decide whether you can keep vdevs homogeneous: if not, decide whether you’re okay with permanent heterogeneity.
- Check free space and fragmentation: if you’re already high-cap and fragmented, expect worse behavior under writes.
- Validate controller and path health: don’t add complexity on top of flaky hardware.
- Plan the rollback: assume you can’t remove the vdev later; plan how you’d migrate away if it goes wrong.
- Stage and burn-in disks: run SMART long tests, check firmware, confirm sector sizes.
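A pre-flight sketch that strings the checks above together; the pool name matches the examples in this article and /dev/sdX stands in for each new disk:
cr0x@server:~$ sudo zpool list -v tank                        # vdev widths, fullness, fragmentation per vdev
cr0x@server:~$ sudo zpool status -v tank                      # health, plus any scrub or resilver still running
cr0x@server:~$ sudo zdb -C tank | grep -E "type:|ashift"      # per-vdev ashift, so the new vdev can match
cr0x@server:~$ sudo zpool get fragmentation,capacity tank     # overall headroom before you add anything
cr0x@server:~$ lsblk -o NAME,MODEL,TRAN,PHY-SEC,LOG-SEC       # sector sizes and transport of the candidate disks
cr0x@server:~$ sudo smartctl -t long /dev/sdX                 # burn-in: long self-test on each new disk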
Step-by-step: Safe expansion patterns
- Need more capacity on an existing RAIDZ pool: add a new RAIDZ vdev matching the existing width and parity.
- Need more IOPS for VM-like random workloads: add mirror vdevs (multiple mirrors scale IOPS well).
- Need faster sync write latency: add a mirrored SLOG on power-loss-protected devices; don’t add a random fast top-level vdev.
- Need metadata/small block acceleration: add a mirrored special vdev; set special_small_blocks only with a sizing model and monitoring.
- Need both fast and slow tiers: build two pools and place datasets explicitly.
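A hedged sketch of the special vdev pattern from the list above; device and dataset names are illustrative, and the special_small_blocks threshold must stay below the dataset’s recordsize or every data block will land on the special vdev:
cr0x@server:~$ sudo zpool add tank special mirror nvme4n1p1 nvme5n1p1
cr0x@server:~$ sudo zfs set special_small_blocks=16K tank/home    # example threshold; size the special vdev with headroom first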
Step-by-step: If you already fell into the trap
- Quantify imbalance: per-vdev alloc%, per-vdev iostat under real workload.
- Identify hot datasets: which datasets dominate I/O and latency complaints.
- Choose a target architecture: homogeneous vdevs, or separate pools by workload.
- Plan migration: send/receive with incremental snapshots, or move hot datasets first.
- Schedule scrubs and resilvers: expansions and migrations are when weak drives reveal themselves.
Common mistakes: symptom → root cause → fix
1) Symptom: “After adding disks, reads are still slow”
Root cause: You added a new vdev, but old data stayed on old vdevs; read workload is still hitting the old, full vdevs.
Fix: Move hot datasets to a new pool or rewrite them so blocks reallocate. If you must expand in-place, add matching vdevs to increase read parallelism where the data lives.
2) Symptom: “Writes got fast, then we started seeing random latency spikes”
Root cause: New vdev is taking most writes; old vdevs are near full and fragmented; mixed media types cause uneven service times.
Fix: Stop mixing top-level vdev classes for general allocation. Separate pools or use SLOG/special vdev for targeted acceleration.
3) Symptom: “Pool performance is inconsistent across hosts / times of day”
Root cause: Workload phases (read old data vs write new data) interact with allocation skew. The pool has “modes.”
Fix: Measure per-vdev I/O during each workload phase. Place datasets according to access pattern, not according to where there happened to be free space.
4) Symptom: “Scrub or resilver time exploded after expansion”
Root cause: High utilization vdevs + RAIDZ geometry + slow disks = long maintenance operations. Expansion didn’t reduce the old vdev’s fullness.
Fix: Add matching vdevs before you hit high CAP%. Keep headroom. Replace aging disks proactively. Consider mirrors for faster resilver when the workload demands it.
5) Symptom: “We added a single disk temporarily and now we’re terrified to touch the pool”
Root cause: A single-disk top-level vdev makes the pool one disk away from total loss.
Fix: Immediately convert that single disk into redundancy by attaching a second disk (mirror), or migrate data off that pool.
Do not schedule this for “later.”
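The attach itself is quick to type and slow to finish; device names are hypothetical, where sdq is the lone disk already in the pool and sdr is the new one:
cr0x@server:~$ sudo zpool attach tank sdq sdr
cr0x@server:~$ sudo zpool status tank    # wait for the resilver to complete before you call it fixed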
6) Symptom: “New vdev is faster but the pool got slower overall”
Root cause: Mixed ashift, sector sizes, or write amplification on one vdev; plus metadata/small-block behavior can concentrate on certain devices.
Fix: Validate ashift and device properties before adding. If wrong, rebuild with correct ashift; don’t keep adding.
7) Symptom: “Synchronous writes are terrible, so someone proposes sync=disabled”
Root cause: No SLOG or a slow/unsafe SLOG. The pool is honoring sync semantics and paying the price in latency.
Fix: Add a mirrored, power-loss-protected SLOG; confirm application needs; keep sync=standard for durability unless you explicitly accept data loss.
FAQ (the questions people ask after the incident)
1) Is “adding disks” the same as “adding a vdev” in ZFS?
Usually, in the way people mean it, yes. There are only two ways to grow: replace disks within an existing vdev (refreshing or growing it in place), or add a new top-level vdev to the pool.
When someone says “just add disks,” they almost always mean zpool add, which adds a new top-level vdev, and that changes pool behavior permanently.
2) Will ZFS automatically rebalance data after I add a vdev?
No. Existing blocks stay where they are. New allocations tend to prefer vdevs with more free space, so new vdevs often get most new writes.
3) If I add a faster vdev, will reads get faster?
Only for data that is allocated on that vdev (or cached in ARC/L2ARC). If your hot data lives on older vdevs, reads will still hit those vdevs.
4) Can I “fix imbalance” without migrating to a new pool?
Sometimes partially, by rewriting data (send/receive to a temporary dataset and back, or moving datasets around).
But there’s no free lunch: to move blocks, you must rewrite blocks, and that is I/O-heavy and time-consuming.
5) Is it safe to mix mirrors and RAIDZ vdevs in one pool?
It can be safe in the sense that ZFS will function, but it’s rarely wise for predictable performance or consistent failure characteristics.
If you do it, do it deliberately and monitor per-vdev behavior. Otherwise, split pools.
6) What’s the most common “we accidentally broke it” ZFS expansion move?
Adding a small mirror vdev to a big RAIDZ pool because “we just need a little more space.” You end up with a pool that allocates disproportionately to the mirror,
changes wear patterns, and makes future capacity planning painful.
7) Does adding more vdevs always increase performance?
Adding more similar top-level vdevs usually increases aggregate throughput and IOPS. Adding dissimilar vdevs increases unpredictability.
Also, if your workload is limited by CPU, ARC, sync behavior, or network, more vdevs won’t help.
8) How do I choose between mirrors and RAIDZ for growth?
Mirrors scale random read/write IOPS better and resilver faster; RAIDZ is capacity-efficient but more sensitive to fullness and write patterns.
If you run VM workloads or databases with lots of random I/O, mirrors usually behave better operationally.
9) What quote should I remember when someone wants a “quick storage fix”?
Werner Vogels (paraphrased idea): “Everything fails, all the time.” Build layouts and procedures assuming the next failure is already scheduled.
Conclusion: next steps you can do this week
If you remember one thing, make it this: zpool add doesn’t “extend RAID.” It adds a new top-level vdev and it does not rebalance old data.
That’s not a bug. It’s the model.
Practical next steps:
- Inventory your pools: run zpool status and write down vdev types, widths, and device classes. If it’s a mess, admit it now.
- Measure per-vdev load: capture zpool iostat -v 1 during real workload peaks. Find the hot vdev.
- Decide on architecture: homogeneous vdevs in one pool, or multiple pools by workload. Don’t keep improvising.
- Plan the exit: if you already added an ill-fitting vdev, schedule a migration path while the system is still healthy enough to move data.
Storage reliability is mostly avoiding cleverness. ZFS gives you sharp tools. Use them like you plan to be the one on call when they slip.