ZFS Adding Disks: The ‘Add VDEV’ Trap That Creates Imbalance

The pager goes off because “storage is slow.” The graph says latency is up, but only sometimes. The app team swears nothing changed.
You log in, run zpool status, and there it is: a shiny new vdev added last week, sitting next to older, fuller vdevs like a new hire
who got all the easy tickets and still somehow broke prod.

ZFS makes it easy to add capacity. It also makes it easy to create a pool layout that is permanently imbalanced in performance and risk.
The trap is simple: zpool add adds a vdev to a pool, but ZFS does not automatically rebalance existing data across vdevs.
That one design choice is why your “quick expansion” becomes next quarter’s storage incident.

The “Add VDEV” trap: what really happens

When you run zpool add, you are not “adding disks” to an existing redundancy group. You are adding a brand-new top-level vdev
to the pool. ZFS then stripes new writes across top-level vdevs based on space and a few heuristics, but it does not move the old blocks.
The old vdevs remain full of old data; the new vdev starts empty and absorbs a lot of new allocations.

That seems fine until you remember what a top-level vdev means: the pool’s IOPS, throughput, and fault tolerance are all the sum (or the minimum)
of its top-level vdevs. Add a vdev with a different width, different disk class, different ashift, or different health profile, and you’ve changed the
pool’s behavior for the rest of its life.
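
The difference is easiest to see side by side. A minimal sketch with hypothetical device names (sdx, sdy, sdw, sdz are placeholders, not the devices from the examples later in this article):

# ATTACH: grow redundancy inside an existing vdev. Data on sdx is resilvered
# onto sdy; you get a (wider) mirror, not more capacity.
sudo zpool attach tank sdx sdy

# ADD: bolt a brand-new top-level vdev onto the pool. Existing blocks stay
# where they are; new allocations start favoring the emptier vdev.
sudo zpool add tank mirror sdw sdz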

One pool, many personalities

ZFS pools are not “one RAID.” A pool is a collection of top-level vdevs. Each top-level vdev has its own redundancy (mirror, raidz, dRAID),
and the pool stripes across them. That’s the model. It’s powerful. It’s also how people accidentally build storage chimeras:
three wide RAIDZ2 vdevs, then a single mirror “just to add capacity quickly,” then an SSD special vdev “because metadata,” and now the pool
behaves like a committee where everyone votes and the slowest person still blocks the meeting.

Why ZFS does not rebalance by default

The lack of auto-rebalance isn’t laziness; it’s conservatism. Moving blocks around a live pool is expensive, wears drives, risks power-loss corner cases,
and complicates guarantees. ZFS will happily keep pointers to where data already lives. It optimizes for correctness and survivability, not for “make it pretty.”

The operational consequence is blunt: if you add a vdev and then wonder why one vdev is doing all the work, the answer is “because it’s empty and you’re writing
new data.” If you wonder why reads still hammer old vdevs, it’s because the old data is still there. ZFS is not being mysterious.
It’s being literal.

Facts and context that explain the behavior

  • ZFS started at Sun in the early 2000s, designed for end-to-end data integrity with checksums on every block, not just “fast RAID.”
  • Pools were a radical shift: instead of carving LUNs out of fixed RAID groups, you aggregate vdevs and allocate dynamically.
  • Copy-on-write is the core mechanic: ZFS never overwrites live blocks in place; it writes new blocks and flips pointers. Great for snapshots, tricky for “rebalance.”
  • Top-level vdevs are the unit of failure: lose a top-level vdev, lose the pool. That’s why a single-disk vdev is a ticking outage.
  • RAIDZ is not RAID5/6 in implementation details; it’s variable-stripe with parity, interacting with recordsize and allocation patterns in ways that surprise people migrating from hardware RAID.
  • “Ashift” is forever per vdev: choose the wrong sector size alignment and you can lock in write amplification for the lifetime of that vdev.
  • Special vdevs (metadata/small blocks) made hybrid pools more practical, but they also introduced a new “if this dies, your pool is toast” component unless mirrored.
  • L2ARC was never a write cache: it’s a read cache and it resets on reboot unless persistent L2ARC is enabled and supported. People still treat it like magic RAM.
  • Resilver behavior differs by vdev type: mirrors resilver used space; RAIDZ resilver is heavier and touches more of the vdev geometry.

Why imbalance hurts: performance, resilience, and cost

Performance: the pool is only as smooth as its worst vdev

In a balanced pool, your IOPS and throughput scale by adding more top-level vdevs of similar capability. In an imbalanced pool, you get weirdness:
bursts of good performance, followed by latency spikes when one vdev is saturated while others are underused.

Here’s the classic pattern: you add a new vdev, new writes mostly land there, and your monitoring looks great. Then the workload shifts to reads of older data
(backups restoring, analytics jobs hitting historical partitions, or a VM fleet booting from older images). Suddenly the old vdevs become read hotspots,
and the new vdev sits bored.

Resilience: mixing redundancy levels is how you buy risk with CAPEX

A pool’s fault tolerance is not the “average redundancy.” It’s the redundancy of each top-level vdev, and losing any one top-level vdev kills the pool.
Add a single mirror vdev next to several RAIDZ2 vdevs, and you just introduced a weaker link: a two-disk mirror can survive one disk failure,
while your RAIDZ2 vdevs survive two. You now have uneven failure domains.

Add a single-disk vdev “temporarily,” and congratulations: you just created a pool that is one disk failure away from complete loss.
The word “temporary” has a long half-life in infrastructure.

Cost: you pay twice—once for the disks, once for the aftermath

Imbalanced pools are expensive in boring ways: more on-call time, more escalations, more “why is this one host slower” debugging, more premature upgrades.
And if you have to “fix” the layout, the fix is often disruptive: migrate data, rebuild pools, or do controlled block rewrites that take weeks.

Joke #1: Adding a mismatched vdev to a ZFS pool is like putting a spare tire on a race car—yes, it rolls, and yes, everyone can hear it.

Fast diagnosis playbook (first/second/third checks)

When someone says “ZFS got slow after we added disks,” do not start by tuning recordsize. Start by proving whether the pool is imbalanced and where the time is going.

First: Is one top-level vdev doing all the work?

  • Check per-vdev I/O and latency with zpool iostat -v.
  • Look for one vdev with consistently higher busy/await than others.
  • If you see that, stop. You’re not debugging “ZFS performance.” You’re debugging “layout and allocation.”

Second: Is this read-bound, write-bound, or sync-bound?

  • Read vs write mix: zpool iostat -v 1 and arcstat (if available) to see ARC hit rate and read pressure.
  • Sync writes: check zfs get sync and whether a SLOG exists and is healthy.
  • If latency spikes correlate with sync write bursts, you’re looking at ZIL/SLOG or underlying write latency.

Third: Are you actually blocked on the device layer?

  • Check kernel device stats: iostat -x and smartctl error counters.
  • Confirm ashift and sector sizes. A new vdev with a different physical sector size can behave differently under load.
  • Look for silent mischief: a controller link downshifted, a drive in SATA 1.5G mode, or NCQ disabled.

Practical tasks (commands, outputs, decisions)

Below are real tasks you can run on a typical Linux OpenZFS host. Each one includes what the output means and what decision you make from it.
Use them like a checklist when you’re sleep-deprived and your change window is closing.

Task 1: Show pool topology and spot the “new vdev” immediately

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 03:12:44 with 0 errors on Tue Dec 10 02:40:11 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0
          raidz2-1                  ONLINE       0     0     0
            sdg                     ONLINE       0     0     0
            sdh                     ONLINE       0     0     0
            sdi                     ONLINE       0     0     0
            sdj                     ONLINE       0     0     0
            sdk                     ONLINE       0     0     0
            sdl                     ONLINE       0     0     0
          mirror-2                  ONLINE       0     0     0
            nvme0n1p2               ONLINE       0     0     0
            nvme1n1p2               ONLINE       0     0     0
errors: No known data errors

Meaning: This pool has two RAIDZ2 vdevs and then a mirror. That mirror is a different class (NVMe) and a different redundancy profile.

Decision: Treat this as a heterogeneous pool. Expect allocation skew and performance “modes.” If this mirror was added for capacity, plan a migration
or a deliberate rebalance approach rather than more patchwork.

Task 2: Identify which vdev is hot (IOPS and bandwidth)

cr0x@server:~$ sudo zpool iostat -v tank 1 5
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        82.1T  21.9T  3.21K  1.05K   410M   198M
  raidz2-0                  41.0T  3.7T   2.80K   190    360M    31M
    sda                         -      -    470    32     61M   5.2M
    sdb                         -      -    458    30     59M   5.1M
    sdc                         -      -    472    31     61M   5.1M
    sdd                         -      -    469    33     60M   5.3M
    sde                         -      -    467    31     60M   5.2M
    sdf                         -      -    464    33     59M   5.2M
  raidz2-1                  40.9T  3.8T     380   175     49M    29M
    sdg                         -      -     64    29    8.1M   4.8M
    sdh                         -      -     63    30    8.0M   4.9M
    sdi                         -      -     62    28    8.0M   4.7M
    sdj                         -      -     64    29    8.2M   4.8M
    sdk                         -      -     63    29    8.0M   4.8M
    sdl                         -      -     64    30    8.1M   4.9M
  mirror-2                    250G   850G    30    690    1.2M   138M
    nvme0n1p2                   -      -     15   345   0.6M    69M
    nvme1n1p2                   -      -     15   345   0.6M    69M
--------------------------  -----  -----  -----  -----  -----  -----

Meaning: Reads are dominated by raidz2-0. Writes are dominated by the NVMe mirror. That’s classic “new vdev absorbs writes”
and “old vdev serves reads” behavior.

Decision: If your latency complaints are read-latency related, adding more write-capable vdevs won’t fix it. You need to move hot read data,
add more similar RAIDZ vdevs, or reconsider pool design.

Task 3: Confirm allocation skew and how bad it is

cr0x@server:~$ sudo zpool list -v tank
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank        104T  82.1T  21.9T        -         -    28%    78%  1.00x  ONLINE  -
  raidz2-0  45.5T  41.0T  3.7T        -         -    34%    91%      -  ONLINE
  raidz2-1  45.5T  40.9T  3.8T        -         -    31%    90%      -  ONLINE
  mirror-2  1.09T   250G   850G       -         -    2%     22%      -  ONLINE

Meaning: Two big vdevs are ~90% full; the new mirror is ~22% used. That is not “balanced capacity,” so it won’t be a balanced workload
unless your workload is magically all new writes.

Decision: Do not keep adding tiny vdevs expecting ZFS to “spread it out.” Decide whether you’re going to add more vdevs of the same class/width
or migrate to a new pool with a clean layout.

Task 4: Check ashift and catch the “new vdev is different” problem

cr0x@server:~$ sudo zdb -C tank | sed -n '1,120p'
MOS Configuration:
        version: 5000
        name: 'tank'
        state: 0
        txg: 1293387
        pool_guid: 14772854026870511222
        vdev_children: 3
        vdev_tree:
            type: 'root'
            id: 0
            guid: 14772854026870511222
            children[0]:
                type: 'raidz'
                ashift: 12
                nparity: 2
                children[0]: type: 'disk' path: '/dev/sda'
                ...
            children[2]:
                type: 'mirror'
                ashift: 13
                children[0]: type: 'disk' path: '/dev/nvme0n1p2'
                children[1]: type: 'disk' path: '/dev/nvme1n1p2'

Meaning: RAIDZ vdevs are ashift=12 (4K). The mirror is ashift=13 (8K). That’s not automatically wrong, but it’s a permanent behavioral difference.

Decision: If you see accidental ashift mismatches due to mis-detected sectors or mixed device types, stop doing incremental “fixes.”
Plan a rebuild with correct ashift rather than stacking more inconsistency.
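
Before adding anything, it’s worth checking what the drives report and what the pool will default to. A small sketch using standard Linux and OpenZFS tooling; recent OpenZFS exposes ashift as a pool property, and the device names are whatever candidates you have on hand:

# What sector size do the candidate drives really report?
# PHY-SEC is what ashift auto-detection keys off; drives that lie are common.
lsblk -o NAME,MODEL,PHY-SEC,LOG-SEC

# What the pool will use by default for new vdevs (0 means auto-detect).
sudo zpool get ashift tank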

Task 5: Verify dataset properties that change latency (sync, recordsize, compression)

cr0x@server:~$ sudo zfs get -r compression,recordsize,sync tank/vmstore
NAME         PROPERTY     VALUE     SOURCE
tank/vmstore compression  lz4       local
tank/vmstore recordsize   16K       local
tank/vmstore sync         standard  default

Meaning: Reasonable VM defaults: lz4 compression and 16K recordsize. Sync is standard, meaning synchronous writes are honored.

Decision: If you’re seeing sync write latency and you don’t have a SLOG, consider adding a mirrored SLOG on power-loss-safe devices.
Do not “fix” sync latency by setting sync=disabled unless you’re comfortable explaining data loss in postmortem.

Task 6: Check if a SLOG exists and whether it’s a single point of pain

cr0x@server:~$ sudo zpool status tank | sed -n '1,80p'
  pool: tank
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
          raidz2-1                  ONLINE       0     0     0
          mirror-2                  ONLINE       0     0     0
        logs
          mirror-3                  ONLINE       0     0     0
            nvme2n1p1               ONLINE       0     0     0
            nvme3n1p1               ONLINE       0     0     0

Meaning: There is a mirrored log device. Good: no single-device SLOG. Also, it’s separate from the top-level vdev mirror.

Decision: If logs show as a single disk, fix that before you do anything else. A single SLOG is an outage waiting to happen,
and a slow SLOG is a latency factory for sync-heavy workloads.
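
If you do find a single log device, the fix is usually quick and online: attach a partner so the log becomes a mirror. A sketch assuming a hypothetical spare, power-loss-protected NVMe partition (nvme8n1p1):

# Attach a second device to the existing log device; ZFS resilvers it and
# the topology changes from a lone log disk to "logs -> mirror".
sudo zpool attach tank nvme2n1p1 nvme8n1p1
sudo zpool status tank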

Task 7: Measure real latency and queueing at the OS layer

cr0x@server:~$ iostat -x 1 3
Linux 6.8.0 (server)  12/25/2025  _x86_64_  (32 CPU)

Device            r/s     w/s   rMB/s   wMB/s  avgrq-sz avgqu-sz   await  r_await  w_await  svctm  %util
sda             78.0    5.0    62.1     4.1     173.2     9.80   110.2    120.3     18.1   2.8   23.1
sdg             12.0    4.0     9.5     3.2     194.0     0.90    22.4     24.1     17.3   2.4    3.8
nvme0n1          1.0  420.0     0.2   132.0      64.0     1.10     2.6      1.9      2.6   0.1    4.2

Meaning: sda has high await and queue depth compared to other disks. That usually means the vdev it belongs to is saturated.

Decision: Confirm this aligns with zpool iostat -v. If yes, you have a hot vdev.
If no, you might have controller or path issues specific to that disk.

Task 8: Check ARC pressure and whether reads are missing cache

cr0x@server:~$ sudo arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c
12:01:11  3200  1200     37   800   25   350   11    50    2  118G  128G
12:01:12  3400  1500     44   980   29   420   12    60    2  118G  128G
12:01:13  3350  1480     44   950   28   430   13    60    2  118G  128G

Meaning: Cache miss rate is ~40%+ under load. That can turn your “fast new vdev” story into “old vdev reads are killing us,” because the ARC
isn’t saving you.

Decision: If read misses correlate with latency, consider adding RAM (often the best IOPS upgrade per dollar),
or separate hot datasets onto faster, correctly redundant vdevs instead of mixing.
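
On Linux, the ARC ceiling is just a module parameter, so confirm whether the cache is artificially capped before you blame the disks. A sketch; the example value is illustrative, not a recommendation:

# Current ARC ceiling in bytes (0 means the platform's built-in default).
cat /sys/module/zfs/parameters/zfs_arc_max

# Raise it at runtime (example: 192 GiB; size it for your host, deliberately).
echo 206158430208 | sudo tee /sys/module/zfs/parameters/zfs_arc_max

# Persist it across reboots (merge carefully if the file already exists).
echo "options zfs zfs_arc_max=206158430208" | sudo tee -a /etc/modprobe.d/zfs.conf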

Task 9: Inspect fragmentation and recognize the “nearly full RAIDZ vdev” cliff

cr0x@server:~$ sudo zpool get -H -o property,value fragmentation tank
fragmentation	28%

Meaning: Fragmentation is non-trivial. In RAIDZ, high fullness + fragmentation can amplify write cost and latency variability.

Decision: If your RAIDZ vdevs are 85–90% full and frag climbs, stop trying to “tune” it. Add equivalent vdevs or migrate.
The physics of allocation will win.

Task 10: See if one dataset is the bully (and move it if needed)

cr0x@server:~$ sudo zfs list -o name,used,available,refer,compressratio -r tank | head -n 12
NAME                 USED  AVAIL  REFER  RATIO
tank                82.1T  21.9T   128K  1.22x
tank/vmstore        42.8T  21.9T  42.8T  1.05x
tank/backups        25.6T  21.9T  24.1T  1.48x
tank/analytics      11.9T  21.9T  10.7T  1.31x
tank/home           1.8T   21.9T   1.8T  1.63x

Meaning: VMstore dominates space. If the hot workload is here, it dominates I/O too.

Decision: Consider moving tank/vmstore to a new, properly designed pool and leaving colder datasets behind.
A surgical split beats a heroic “rebalance everything.”

Task 11: Check for silent device errors before blaming allocation

cr0x@server:~$ sudo smartctl -a /dev/sda | egrep "Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count"
Reallocated_Sector_Ct     0
Current_Pending_Sector    0
Offline_Uncorrectable     0
UDMA_CRC_Error_Count      27

Meaning: CRC errors suggest cabling/backplane/controller issues, not dying media. That can present as “one vdev is slow.”

Decision: Fix the physical path and clear the error trend (replace cable, reseat, swap bay) before you redesign the pool.

Task 12: Confirm that the “new vdev” is actually being used for new allocations

cr0x@server:~$ sudo zpool iostat -v tank 5
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        82.1T  21.9T   980   2.20K   210M   420M
  raidz2-0                  41.0T  3.7T    820    180    180M    40M
  raidz2-1                  40.9T  3.8T    150    160     28M    36M
  mirror-2                    250G   850G    10   1.86K   2.0M   344M
--------------------------  -----  -----  -----  -----  -----  -----

Meaning: Writes are flowing primarily to the mirror vdev. That’s expected when it’s emptier and faster.

Decision: If this mirror is not meant to be the primary write sink (for cost, endurance, or policy reasons),
you need to redesign, not “hope it evens out.”

Task 13: Prove you can’t remove the top-level vdev later (the irreversible part)

cr0x@server:~$ sudo zpool remove tank mirror-2
cannot remove mirror-2: operation not supported on this pool

Meaning: Top-level vdev removal may not be supported depending on your OpenZFS version, pool feature flags, and topology: device removal
does not work on pools that contain raidz top-level vdevs and requires matching ashift across vdevs. Even when it is supported,
it’s constrained, leaves an indirect mapping behind for remapped blocks, and can take a long time.

Decision: Treat zpool add as permanent unless you have validated vdev removal support in your environment
and you can tolerate the time and risk.
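
You can at least check what you’re running and whether the removal feature is active before you plan around it. A sketch:

# What OpenZFS release is this host actually running?
zpool version

# Is the device_removal feature enabled or active on this pool?
sudo zpool get feature@device_removal tank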

Task 14: Validate special vdev configuration (because losing it can lose the pool)

cr0x@server:~$ sudo zpool status tank | sed -n '1,120p'
  pool: tank
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
          raidz2-1                  ONLINE       0     0     0
        special
          mirror-4                  ONLINE       0     0     0
            nvme4n1p1               ONLINE       0     0     0
            nvme5n1p1               ONLINE       0     0     0

Meaning: Special vdev is mirrored. Good. If it were a single device and it died, you could lose metadata/small blocks and effectively lose the pool.

Decision: Never run special vdev as a single device in production. Mirror it, monitor it, and size it with headroom.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company had a ZFS pool backing a virtualization cluster. They ran two RAIDZ2 vdevs of identical HDDs. Growth was steady.
Then they got a new customer and storage jumped. Someone asked the obvious question: “Can we just add disks?”

The engineer on duty did what many of us would do at 2 a.m.: they added a new mirror vdev using two spare drives, because it was quick and capacity was urgent.
They assumed the pool would “rebalance,” or at least “spread I/O” evenly across vdevs over time. The pool stayed online, graphs improved, everyone went back to sleep.

Two weeks later, the platform team rolled out a new VM image and the fleet rebooted across the cluster. Boot storms are a read-heavy festival of small random I/O.
Reads hit the older data on the nearly-full RAIDZ vdevs. Latency spiked. The new mirror vdev had plenty of free space and speed, but it didn’t help because it held mostly new blocks.

The outage wasn’t a mystery; it was a topology debt being collected with interest. Their fix wasn’t clever tuning. They migrated the busiest VM datasets to a new pool
built from mirrors (matching the workload), then repurposed the old pool for colder data. The painful lesson: ZFS will not protect you from your own assumptions.

Mini-story 2: The optimization that backfired

An enterprise analytics team wanted faster query performance. They had a ZFS pool of RAIDZ2 HDD vdevs. A well-meaning optimization proposal arrived:
“Add a couple of NVMe drives as a mirror vdev, and ZFS will stripe across it. Boom: faster.”

They added the NVMe mirror as a top-level vdev. New writes (including temporary query spill data) landed disproportionately on NVMe. At first, it looked fantastic.
But the NVMe drives were consumer-grade and not power-loss safe. Worse: they were now in the hot write path for a workload that did lots of synchronous writes
due to application behavior they didn’t fully control.

A few months in, one NVMe started throwing media errors. The mirror protected them from immediate data loss, but performance degraded sharply during error handling
and resilver activity. The business impact was “queries sometimes take 10x longer,” which is how analytics outages present: not down, just unusable.

The backfire was subtle: they hadn’t just “added performance.” They changed allocation, failure characteristics, and the operational profile of the entire pool.
Their eventual solution was boring: move analytics scratch to a separate NVMe-only pool (mirrors), keep the HDD pool for durable data, and make the sync semantics explicit.

Mini-story 3: The boring but correct practice that saved the day

A finance company ran ZFS for a document archive and a VM farm. Their storage lead was allergic to “quick fixes” and insisted on a policy:
top-level vdevs must be identical in width, type, and disk class; no exceptions without a written risk sign-off.

The policy annoyed people because it made expansions slower. When capacity got tight, they didn’t “add whatever disks exist.”
They bought enough drives to add a full new RAIDZ2 vdev matching the existing ones, and they staged it as a planned change with burn-in tests.

Months later, they had a controller firmware issue that intermittently increased latency on one SAS path. Because their vdevs were consistent,
their metrics told a clear story: one path was wrong; the pool layout wasn’t muddying the signal. They isolated the faulty path, failed it over cleanly,
and scheduled a firmware fix.

The saving wasn’t heroics; it was clarity. Homogeneous vdevs and disciplined changes meant the system behaved predictably when hardware got weird,
which is the only time you really find out what your architecture is made of.

How to expand ZFS safely (what to do instead)

Rule 1: Add top-level vdevs that match, or accept permanent weirdness

If you’re going to grow a pool by adding vdevs, add vdevs of the same type, same width, and similar performance class. Mirrors with mirrors.
RAIDZ2 with RAIDZ2 of the same disk count. Similar ashift. Similar disk models if you can.

This isn’t aesthetic. It’s operational math: the pool stripes across vdevs, and any vdev can become the limiting factor depending on access patterns.
Homogeneity makes performance and capacity planning tractable.
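
A sketch of what a matching expansion looks like, assuming hypothetical drives sdm through sdr of the same class and count as the existing vdevs:

# Burn the drives in first (SMART long tests, firmware check), then add a
# vdev with the same width and parity, pinning ashift to match the old ones.
sudo zpool add -o ashift=12 tank raidz2 sdm sdn sdo sdp sdq sdr

# Verify: same geometry, same ashift, no surprises.
sudo zpool status tank
sudo zpool list -v tank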

Rule 2: If you need different media types, use separate pools or special vdevs deliberately

Want NVMe performance and HDD capacity? You have three sane patterns:

  • Separate pools: one NVMe pool for latency-sensitive datasets, one HDD pool for bulk. Simple, predictable, easy to reason about.
  • Special vdev (mirrored): for metadata and small blocks, to accelerate directory operations and small random I/O while keeping data on HDD.
  • SLOG (mirrored, PLP devices): to accelerate synchronous write latency without changing where data ultimately lands.

What you should avoid is “just add a fast vdev” and hope ZFS turns it into a tiering system. It won’t. It will turn it into an allocation sink.
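
For the deliberate patterns, the commands name the role explicitly, which is the point. A sketch with hypothetical NVMe partitions; treat it as a template, not something to paste in:

# Mirrored SLOG: accelerates synchronous write latency only; data still
# lands on the main vdevs.
sudo zpool add tank log mirror nvme6n1p1 nvme7n1p1

# Mirrored special vdev: metadata (and optionally small blocks) allocate here.
sudo zpool add tank special mirror nvme8n1p1 nvme9n1p1

# Opt a dataset into small-block placement only with a sizing model:
# blocks at or below this size go to the special vdev.
sudo zfs set special_small_blocks=16K tank/vmstore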

Rule 3: Consider rebuilding instead of patching when the layout is already compromised

Sometimes the correct answer is: build a new pool with the layout you wanted, then migrate datasets. Yes, it’s work. It’s also finite work.
Living with a compromised pool is infinite work.

What about “rebalance”?

ZFS does not have a one-command online rebalance that redistributes existing blocks across vdevs the way some distributed systems do.
If you want data to move, you usually have to rewrite it.

Practical approaches include:

  • Dataset send/receive to a new pool (best when you can provision a new pool; see the sketch after this list).
  • Dataset-level rewrite in place via replication to a temporary dataset and back (works, but heavy and operationally risky).
  • Selective migration of hot datasets to a new pool, leaving cold data where it is.
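
A minimal send/receive sketch for the first approach, assuming a hypothetical destination pool named fastpool and a short cutover window for the final increment:

# 1. Snapshot and seed the full copy while the source stays online.
sudo zfs snapshot -r tank/vmstore@migrate-base
sudo zfs send -R tank/vmstore@migrate-base | sudo zfs receive -u fastpool/vmstore

# 2. Quiesce writers, take a final snapshot, send only the delta.
sudo zfs snapshot -r tank/vmstore@migrate-final
sudo zfs send -R -I @migrate-base tank/vmstore@migrate-final | \
    sudo zfs receive -u -F fastpool/vmstore

# 3. Repoint consumers at fastpool/vmstore, verify, and keep the snapshots
#    until you trust the new copy.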

Joke #2: ZFS doesn’t “rebalance” your pool; it “preserves history.” Like your audit department, it remembers everything and moves nothing without paperwork.

Checklists / step-by-step plan

Step-by-step: Before you run zpool add in production

  1. Write down the goal: capacity, IOPS, throughput, or latency? “More space” is not a performance plan.
  2. Confirm current vdev geometry: mirror vs raidz, disk counts, ashift, device classes.
  3. Decide whether you can keep vdevs homogeneous: if not, decide whether you’re okay with permanent heterogeneity.
  4. Check free space and fragmentation: if you’re already high-cap and fragmented, expect worse behavior under writes.
  5. Validate controller and path health: don’t add complexity on top of flaky hardware.
  6. Plan the rollback: assume you can’t remove the vdev later; plan how you’d migrate away if it goes wrong.
  7. Stage and burn in disks: run SMART long tests, check firmware, confirm sector sizes.

Step-by-step: Safe expansion patterns

  1. Need more capacity on an existing RAIDZ pool: add a new RAIDZ vdev matching the existing width and parity.
  2. Need more IOPS for VM-like random workloads: add mirror vdevs (multiple mirrors scale IOPS well).
  3. Need faster sync write latency: add a mirrored SLOG on power-loss-protected devices; don’t add a random fast top-level vdev.
  4. Need metadata/small block acceleration: add a mirrored special vdev; set special_small_blocks only with a sizing model and monitoring.
  5. Need both fast and slow tiers: build two pools and place datasets explicitly.

Step-by-step: If you already fell into the trap

  1. Quantify imbalance: per-vdev alloc%, per-vdev iostat under real workload.
  2. Identify hot datasets: which datasets dominate I/O and latency complaints.
  3. Choose a target architecture: homogeneous vdevs, or separate pools by workload.
  4. Plan migration: send/receive with incremental snapshots, or move hot datasets first.
  5. Schedule scrubs and resilvers: expansions and migrations are when weak drives reveal themselves.

Common mistakes: symptom → root cause → fix

1) Symptom: “After adding disks, reads are still slow”

Root cause: You added a new vdev, but old data stayed on old vdevs; read workload is still hitting the old, full vdevs.

Fix: Move hot datasets to a new pool or rewrite them so their blocks reallocate. If you must expand in place, add matching vdevs and then rewrite or migrate the hot data so reads actually spread across the new layout.
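
When a second pool isn’t an option, the in-place variant is to rewrite the dataset so its blocks reallocate under the current free-space ratios. It is crude, I/O-heavy, and needs room for a full extra copy; a hedged sketch with hypothetical dataset names:

# Copy the dataset within the same pool; the receive rewrites every block,
# so allocations spread according to current per-vdev free space.
sudo zfs snapshot tank/analytics@rewrite
sudo zfs send tank/analytics@rewrite | sudo zfs receive tank/analytics_new

# After a final incremental with writers stopped, swap the names.
sudo zfs rename tank/analytics tank/analytics_old
sudo zfs rename tank/analytics_new tank/analytics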

2) Symptom: “Writes got fast, then we started seeing random latency spikes”

Root cause: New vdev is taking most writes; old vdevs are near full and fragmented; mixed media types cause uneven service times.

Fix: Stop mixing top-level vdev classes for general allocation. Separate pools or use SLOG/special vdev for targeted acceleration.

3) Symptom: “Pool performance is inconsistent across hosts / times of day”

Root cause: Workload phases (read old data vs write new data) interact with allocation skew. The pool has “modes.”

Fix: Measure per-vdev I/O during each workload phase. Place datasets according to access pattern, not according to where there happened to be free space.

4) Symptom: “Scrub or resilver time exploded after expansion”

Root cause: High utilization vdevs + RAIDZ geometry + slow disks = long maintenance operations. Expansion didn’t reduce the old vdev’s fullness.

Fix: Add matching vdevs before you hit high CAP%. Keep headroom. Replace aging disks proactively. Consider mirrors for faster resilver when the workload demands it.

5) Symptom: “We added a single disk temporarily and now we’re terrified to touch the pool”

Root cause: A single-disk top-level vdev makes the pool one disk away from total loss.

Fix: Immediately convert that single disk into redundancy by attaching a second disk (mirror), or migrate data off that pool.
Do not schedule this for “later.”
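
The mechanical fix is a single command once a matching disk is in the bay; a sketch with hypothetical device names:

# sdq is the lonely single-disk vdev; sdr is the new disk.
# attach (not add) turns the single disk into a two-way mirror and resilvers.
sudo zpool attach tank sdq sdr

# Watch the resilver finish before you relax.
sudo zpool status tank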

6) Symptom: “New vdev is faster but the pool got slower overall”

Root cause: Mixed ashift, sector sizes, or write amplification on one vdev; plus metadata/small-block behavior can concentrate on certain devices.

Fix: Validate ashift and device properties before adding. If wrong, rebuild with correct ashift; don’t keep adding.

7) Symptom: “Synchronous writes are terrible, so someone proposes sync=disabled”

Root cause: No SLOG or a slow/unsafe SLOG. The pool is honoring sync semantics and paying the price in latency.

Fix: Add a mirrored, power-loss-protected SLOG; confirm application needs; keep sync=standard for durability unless you explicitly accept data loss.

FAQ (the questions people ask after the incident)

1) Is “adding disks” the same as “adding a vdev” in ZFS?

Practically, yes. You either replace disks within an existing vdev (growing it or refreshing it), or you add a new top-level vdev to the pool.
zpool add adds a new top-level vdev. That changes pool behavior permanently.

2) Will ZFS automatically rebalance data after I add a vdev?

No. Existing blocks stay where they are. New allocations tend to prefer vdevs with more free space, so new vdevs often get most new writes.

3) If I add a faster vdev, will reads get faster?

Only for data that is allocated on that vdev (or cached in ARC/L2ARC). If your hot data lives on older vdevs, reads will still hit those vdevs.

4) Can I “fix imbalance” without migrating to a new pool?

Sometimes partially, by rewriting data (send/receive to a temporary dataset and back, or moving datasets around).
But there’s no free lunch: to move blocks, you must rewrite blocks, and that is I/O-heavy and time-consuming.

5) Is it safe to mix mirrors and RAIDZ vdevs in one pool?

It can be safe in the sense that ZFS will function, but it’s rarely wise for predictable performance or consistent failure characteristics.
If you do it, do it deliberately and monitor per-vdev behavior. Otherwise, split pools.

6) What’s the most common “we accidentally broke it” ZFS expansion move?

Adding a small mirror vdev to a big RAIDZ pool because “we just need a little more space.” You end up with a pool that allocates disproportionately to the mirror,
changes wear patterns, and makes future capacity planning painful.

7) Does adding more vdevs always increase performance?

Adding more similar top-level vdevs usually increases aggregate throughput and IOPS. Adding dissimilar vdevs increases unpredictability.
Also, if your workload is limited by CPU, ARC, sync behavior, or network, more vdevs won’t help.

8) How do I choose between mirrors and RAIDZ for growth?

Mirrors scale random read/write IOPS better and resilver faster; RAIDZ is capacity-efficient but more sensitive to fullness and write patterns.
If you run VM workloads or databases with lots of random I/O, mirrors usually behave better operationally.

9) What quote should I remember when someone wants a “quick storage fix”?

Werner Vogels (paraphrased idea): “Everything fails, all the time.” Build layouts and procedures assuming the next failure is already scheduled.

Conclusion: next steps you can do this week

If you remember one thing, make it this: zpool add doesn’t “extend RAID.” It adds a new top-level vdev and it does not rebalance old data.
That’s not a bug. It’s the model.

Practical next steps:

  1. Inventory your pools: run zpool status and write down vdev types, widths, and device classes. If it’s a mess, admit it now.
  2. Measure per-vdev load: capture zpool iostat -v 1 during real workload peaks. Find the hot vdev.
  3. Decide on architecture: homogeneous vdevs in one pool, or multiple pools by workload. Don’t keep improvising.
  4. Plan the exit: if you already added an ill-fitting vdev, schedule a migration path while the system is still healthy enough to move data.

Storage reliability is mostly avoiding cleverness. ZFS gives you sharp tools. Use them like you plan to be the one on call when they slip.
