ZFS vdev imbalance: why one slow vdev drags the whole pool

You bought “fast storage.” The pool benchmarks nicely on day one. Then a month later, your databases start
doing that thing where every query feels like it’s walking through wet cement. Grafana says latency. The
app team says network. The network team says “not us.” You run zpool iostat and there it is:
one vdev sitting in the corner, red-faced, sweating, dragging the entire pool like it’s hauling a sofa up
three flights of stairs.

ZFS is remarkably fair and remarkably unforgiving. It will use all vdevs… until one vdev can’t keep up.
Then the pool is effectively paced by the slowest participant, especially for writes and latency-sensitive
workloads. This isn’t a “ZFS is bad” story. It’s a “physics is rude” story—plus a handful of design choices
you can use, abuse, or fix.

What “vdev imbalance” really means (and what it doesn’t)

In ZFS, a pool is a collection of vdevs. A vdev is the atomic unit of storage allocation: ZFS stripes
data across vdevs, not across individual disks (with the important caveat that a RAIDZ or mirror vdev
internally uses multiple disks).

“Vdev imbalance” is when one vdev does a disproportionate amount of work, or completes its work much more
slowly, such that pool-wide latency or throughput is constrained. It can be:

  • Performance imbalance: vdev A has higher latency / lower IOPS than vdev B under similar load.
  • Allocation imbalance: vdev A is much fuller / more fragmented, so new allocations are harder and slower.
  • Workload imbalance: “special” roles (SLOG, special vdev, L2ARC metadata patterns) funnel I/O to a subset.
  • Failure-mode imbalance: one device is retrying, remapping, or quietly dying, and you pay the penalty in tail latency.

What it usually isn’t: ZFS randomly “choosing favorites.” ZFS has rules. If one vdev is slower, more full,
more fragmented, or intermittently erroring, ZFS will continue to submit I/O, and the pool’s user-visible
latency becomes the sum of a thousand tiny “wait for the slow one” moments.

Dry-funny joke #1: A slow vdev is like a status meeting that “will just take five minutes”—you will still
be there at 11:30.

How ZFS spreads I/O across vdevs

The pool is a stripe across vdevs, not a magical performance soup

The basic mental model that keeps you sane: ZFS allocates blocks to vdevs, and it generally tries to keep
free space balanced (with weighting) across vdevs. When you have multiple top-level vdevs, the pool behaves
like a stripe at the vdev layer. For many workloads, that means throughput scales with the number of vdevs.
But it also means the pool inherits the worst latency behavior of any vdev that participates in the workload.

Allocation decisions: metaslabs, space maps, and a preference for “less awful”

ZFS divides vdevs into metaslabs. Metaslabs have space maps, and ZFS uses heuristics to pick where to place
new blocks. It prefers metaslabs with more free space and better allocation characteristics. Over time,
fragmentation and fullness change the cost of allocation.

Two important consequences:

  • As vdevs fill, allocation gets more expensive and more fragmented. A vdev at 80–90% used
    can become meaningfully slower than a vdev at 40–60% used, even if the disks are identical.
  • “Balanced free space” is not equal performance. ZFS may keep free space roughly even,
    but it can’t equalize physical device behavior, firmware quirks, or hidden media errors.
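
If you want to see that allocation machinery with your own eyes, zdb can dump per-metaslab statistics. A minimal, read-only sketch, with the caveat that zdb output formats shift between OpenZFS versions and walking metaslabs on a busy pool adds its own I/O:

# One summary line per metaslab: offset, free space, fragmentation.
zdb -m tank | less

# Add per-metaslab histograms for more detail (heavier to run).
zdb -mm tank | less

# Per-vdev fragmentation at a glance, no zdb required.
zpool list -v -o name,size,alloc,free,frag,cap tank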

Reads vs writes: why writes are where you feel pain first

Reads often have more “escape hatches”: ARC cache, prefetch, compression effects, and the possibility that
your hot data clusters on the faster vdev (by accident or by policy). Writes have fewer outs. When the app
is waiting for a sync write, the slowest link becomes the whole chain.

Special vdevs, SLOGs, and the “performance funnel” problem

Some ZFS features intentionally route specific I/O to specific devices:

  • SLOG (separate log) for synchronous writes (ZIL). Great if it’s fast and power-loss safe; catastrophic if it’s slow or overcommitted.
  • Special vdev for metadata and (optionally) small blocks. Great if properly provisioned; a bottleneck if undersized or slower than the main vdevs.
  • L2ARC for read caching. Doesn’t directly slow writes, but can change read patterns and hide problems until cache misses spike.

If you design a funnel, you own the funnel. Put a slow device at the bottom and you’ve built a latency machine.
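
Before you blame the funnel, audit it. A quick sketch for checking which datasets route I/O toward the log and special devices; the properties are standard OpenZFS, and the pool name matches the examples later in this article:

# Which datasets force sync semantics, bias the ZIL, or send small blocks to special.
zfs get -r -t filesystem,volume -o name,property,value sync,logbias,special_small_blocks tank

# What actually sits in the log and special classes, and how full they are.
zpool status -v tank
zpool list -v tank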

Why one slow vdev hurts the whole pool

Tail latency is the real pool limiter

Storage performance discussions love averages because averages are polite. Production systems live in
percentiles because percentiles are honest. If one vdev has periodic 200–800ms spikes due to firmware
retries, SMR rewrite amplification, or a saturated controller queue, those spikes propagate.

ZFS is not doing synchronous cross-vdev “wait for everyone” for every block in the pool. But your
application’s view of latency is still shaped by the slowest I/O it depends on: transaction group
commits, sync write log flush, metadata updates, indirect blocks, and small random I/O where parallelism
is limited.

RAIDZ and mirrors: internal geometry matters

A top-level vdev might be a mirror, a RAIDZ1/2/3, or a single disk (don’t). If one disk inside a RAIDZ vdev
slows down—because it’s failing, because it’s SMR, because it’s behind a weird expander—then the entire RAIDZ
vdev’s latency can inflate. You don’t get to “skip” the slow disk; parity math and reconstruction reads are
group activities.

Mirrors behave differently: reads can be serviced by either side (ZFS picks based on heuristics), but writes
must go to both. A single laggy mirror member raises write latency for the mirror vdev.

The pool scheduler can’t fix a bad vdev

ZFS can queue I/O intelligently. It can issue more to fast vdevs. It can distribute allocations. It cannot
turn a slow device into a fast one, and it cannot remove physics from the equation. When the workload has
sync writes, metadata dependencies, or limited outstanding I/O, one slow vdev becomes a metronome for the pool.

A reliability maxim, paraphrased

Hope is not a strategy. (Attributed to Gene Kranz, flight director; widely repeated in ops culture, and the phrasing varies.)

In storage, “hope the slow vdev will stop being slow” is not a remediation plan. It’s a calendar event waiting to happen.

Facts & historical context you can use in arguments

  1. ZFS originated at Sun Microsystems in the mid-2000s with a core goal: end-to-end data integrity via checksums and copy-on-write.
  2. The “vdev as allocation unit” design is deliberate: it simplifies failure domains and performance modeling compared to per-disk striping managed in multiple places.
  3. Early ZFS deployments leaned heavily on RAIDZ to reduce the RAID5 write hole and improve integrity; performance trade-offs were always part of the deal.
  4. Modern OpenZFS added “special vdev” to accelerate metadata/small blocks on SSDs; it can be transformative—or become a single choke point.
  5. ZFS’s ARC cache pre-dates the current SSD-everywhere era; many “ZFS is slow” complaints are actually “my workload isn’t in ARC anymore.”
  6. SMR drives changed the failure landscape: they can look normal until random writes or sustained updates force read-modify-write behavior and latency spikes.
  7. 4K sectors and ashift became a long-term performance tax: a wrong ashift is permanent for that vdev and can silently cut write performance.
  8. ZFS scrubs are a design choice, not an optional vanity task: regular scrubs surface latent errors before a resilver depends on reading those sectors cleanly.
  9. Device timeouts and retries are the hidden enemy: even a small amount of retry behavior can dominate tail latency and user experience.

Fast diagnosis playbook (check first/second/third)

First: confirm it’s a vdev bottleneck, not the app lying to you

  • Run zpool iostat -v with latency columns if available on your platform, and watch which vdev has the worst await or longest service time.
  • Check if the pool is doing a scrub/resilver. If yes, stop guessing: your performance baseline is already invalid.
  • Verify the workload type: sync writes, small random reads, sequential streaming, or metadata-heavy. The “slow vdev” manifestation differs.

Second: isolate whether the vdev is slow because it’s busy, failing, or structurally disadvantaged

  • Busy: high utilization, high queue depth, steady high I/O rate.
  • Failing: medium I/O rate but huge latency spikes, errors incrementing, link resets, timeouts.
  • Structurally disadvantaged: much fuller than others, special vdev undersized, wrong ashift, SMR mixed with CMR, tiny recordsize causing IOPS starvation.

Third: decide whether you can fix it live or need a rebuild/migration

  • Live fix: replace a device, remove a log device, add vdevs, tune dataset properties, rebalance by rewriting data.
  • Rebuild/migrate: wrong vdev type, wrong ashift, mixed media that will always behave badly, special vdev too small to grow safely without redesign.

Dry-funny joke #2: RAIDZ doesn’t care about your feelings, only your IOPS budget.

Practical tasks: commands, outputs, decisions (14)

Task 1: Identify the slow vdev under real load

cr0x@server:~$ zpool iostat -v 2 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        18.2T  7.45T   8200  4100   780M  420M
  raidz2-0  9.10T  3.72T   4100  2000   390M  205M
    sda         -      -   1100   520   105M  54.0M
    sdb         -      -   1000   510   95.0M 53.5M
    sdc         -      -    950   500   92.0M 52.0M
    sdd         -      -   1050   520   98.0M 54.0M
    sde         -      -   1000   510   95.0M 52.5M
    sdf         -      -   1000   510   95.0M 53.0M
  raidz2-1  9.10T  3.73T   4100  2100   390M  215M
    sdg         -      -   1050   540   98.0M 55.0M
    sdh         -      -   1000   530   95.0M 54.0M
    sdi         -      -    200   520   18.0M 53.0M
    sdj         -      -   1050   540   98.0M 55.0M
    sdk         -      -   1000   530   95.0M 54.0M
    sdl         -      -   1000   530   95.0M 54.0M

What it means: Disk sdi is contributing far fewer reads than its siblings. That can mean it’s slow, erroring,
or getting hammered by retries so the scheduler avoids it for reads.

Decision: Move immediately to SMART + kernel logs for sdi, and check cabling/controller path. If it’s a mirror member,
consider replacing; if in RAIDZ, expect the whole vdev to suffer during reconstruction reads.

Task 2: Check pool health and whether errors are accumulating

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.
action: Replace the device or clear the errors.
  scan: scrub repaired 0B in 12:41:20 with 0 errors on Tue Dec 24 03:12:51 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
          raidz2-1  ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
            sdi     ONLINE       3     0     0
            sdj     ONLINE       0     0     0
            sdk     ONLINE       0     0     0
            sdl     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /tank/vmstore/vm-104-disk-0

What it means: sdi has read errors. Also, there are permanent errors in a file: the checksum/repair pipeline couldn’t fully correct.

Decision: Treat sdi as suspect. Replace it, then restore affected data from backups or replicas. Don’t “clear and hope” until you understand impact.
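
The replacement itself is mechanically simple; getting the device path right is the part people fumble. A sketch, assuming the new disk shows up as the hypothetical /dev/disk/by-id/ata-NEWDISK (prefer stable by-id paths over sdX names):

# Replace the suspect disk in place; ZFS resilvers onto the new device.
zpool replace tank sdi /dev/disk/by-id/ata-NEWDISK

# Watch resilver progress and confirm error counters stop moving.
zpool status -v tank

# Only after the impact is understood and affected files are restored:
zpool clear tank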

Task 3: Check for scrub/resilver activity that’s masking normal behavior

cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
  scan: resilver in progress since Wed Dec 25 01:10:02 2025
        3.22T scanned at 1.14G/s, 920G issued at 322M/s, 18.0T total
        920G resilvered, 5.00% done, 15:28:06 to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
          raidz2-1  ONLINE       0     0     0

What it means: You’re in resilver land. Performance complaints during resilver are expected; what matters is whether one vdev is disproportionately slow.

Decision: If your SLA is bleeding, consider scheduling resilvers/scrubs during low-traffic windows and ensure ashift, device class, and controller paths aren’t creating a single-lane highway.
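
If the scan is a scrub rather than a resilver, OpenZFS lets you pause it and pick it up later; resilvers generally run until they finish. A sketch:

# Pause a running scrub during the busy window...
zpool scrub -p tank

# ...then resume it later (issuing scrub again resumes a paused one).
zpool scrub tank

# Confirm which state you are actually in.
zpool status tank | grep -A 2 "scan:"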

Task 4: Confirm if one vdev is far fuller than others

cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint tank
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank  12.1T  4.92T   192K  /tank
cr0x@server:~$ zpool list -v tank
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH
tank       25.6T  18.2T  7.45T        -         -    38%    70%  1.00x  ONLINE
  raidz2-0  12.8T  9.10T  3.72T        -         -    31%    70%      -  ONLINE
  raidz2-1  12.8T  9.10T  3.73T        -         -    46%    70%      -  ONLINE

What it means: Both vdevs have similar allocation, but fragmentation differs: raidz2-1 is more fragmented.

Decision: Expect higher allocation overhead and worse tail lat on the more fragmented vdev. If fragmentation is high and stable, plan a rewrite/rebalance strategy (send/recv or dataset migration).

Task 5: Catch the “one device is retrying” pattern in kernel logs

cr0x@server:~$ dmesg -T | egrep -i "ata|scsi|nvme|reset|timeout|error" | tail -n 20
[Wed Dec 25 02:13:21 2025] sd 6:0:12:0: [sdi] tag#1034 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Wed Dec 25 02:13:21 2025] sd 6:0:12:0: [sdi] Sense Key : Medium Error [current]
[Wed Dec 25 02:13:21 2025] sd 6:0:12:0: [sdi] Add. Sense: Unrecovered read error
[Wed Dec 25 02:13:22 2025] blk_update_request: I/O error, dev sdi, sector 7814037096 op 0x0:(READ)
[Wed Dec 25 02:13:25 2025] ata12.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Wed Dec 25 02:13:26 2025] ata12.00: failed command: READ FPDMA QUEUED
[Wed Dec 25 02:13:27 2025] ata12: hard resetting link

What it means: Hard resets and unrecovered read errors create long stalls. Even if ZFS “heals” reads, you pay in latency and queue lockups.

Decision: Replace the device and inspect the whole path: HBA firmware, expander, backplane, cables. If multiple drives show resets, suspect the path first.

Task 6: Pull SMART/NVMe health and look for the “quietly dying” counters

cr0x@server:~$ smartctl -a /dev/sdi | egrep -i "realloc|pending|uncorrect|offline|error|timeout|crc"
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       24
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       6
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       34

What it means: Pending/uncorrectable sectors point to media trouble. CRC errors usually point to cabling/backplane issues.

Decision: If CRC is climbing, reseat/replace cable/backplane lane and watch. If pending/uncorrectable exist, schedule replacement. Don’t negotiate with entropy.
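
One disk’s counters rarely tell the whole story; pull the same attributes for every member and compare neighbors. A minimal sketch assuming the sda–sdl layout from this pool (adjust the device list, and expect different attribute names on SAS or NVMe):

#!/bin/bash
# Quick SMART sweep: the same counters for every pool member, side by side.
for d in /dev/sd{a..l}; do
  echo "== $d =="
  smartctl -a "$d" | egrep -i "realloc|pending|uncorrect|crc"
done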

Task 7: Check whether sync writes are gated by a slow or missing SLOG

cr0x@server:~$ zpool status tank | sed -n '1,80p'
  pool: tank
 state: ONLINE
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
          raidz2-1    ONLINE       0     0     0
        logs
          nvme0n1p2   ONLINE       0     0     0

What it means: There is a dedicated log device. If sync-write latency is terrible, this device is suspect—or simply saturated.

Decision: Measure latency on the log vdev specifically (iostat, smartctl, nvme-cli). If it’s consumer NVMe without power-loss protection, reconsider: for a SLOG, power-loss protection and predictable latency are not optional features.
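
To look at the log vdev specifically rather than the pool average, something like this sketch works; nvme-cli is a separate package, and the exact iostat columns vary by OpenZFS version:

# Per-vdev latency, including the logs section.
zpool iostat -v -l tank 2

# Per-vdev queue depths: a saturated SLOG shows up here.
zpool iostat -v -q tank 2

# NVMe health and thermals for the log device (requires nvme-cli).
nvme smart-log /dev/nvme0n1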

Task 8: Identify special vdev and whether it’s becoming the choke point

cr0x@server:~$ zpool status tank | sed -n '1,120p'
  pool: tank
 state: ONLINE
config:

        NAME            STATE     READ WRITE CKSUM
        tank            ONLINE       0     0     0
          raidz2-0      ONLINE       0     0     0
          raidz2-1      ONLINE       0     0     0
        special
          mirror-2      ONLINE       0     0     0
            nvme1n1p1   ONLINE       0     0     0
            nvme2n1p1   ONLINE       0     0     0

What it means: Metadata (and maybe small blocks) live on the special vdev. If it’s slow, the whole pool feels slow—especially metadata-heavy workloads.

Decision: Verify dataset properties like special_small_blocks. Ensure special vdev has enough capacity headroom and comparable endurance/latency to your workload.

Task 9: See dataset settings that can amplify vdev imbalance

cr0x@server:~$ zfs get -o name,property,value -s local recordsize,compression,atime,sync,logbias,special_small_blocks tank/vmstore
NAME          PROPERTY              VALUE
tank/vmstore  recordsize            16K
tank/vmstore  compression           lz4
tank/vmstore  atime                 off
tank/vmstore  sync                  standard
tank/vmstore  logbias               latency
tank/vmstore  special_small_blocks  16K

What it means: Small recordsize plus special_small_blocks=16K means a lot of blocks go to the special vdev. That’s fine if the special vdev is fast and roomy.

Decision: If special vdev is near full or high-latency, reduce special_small_blocks (or disable), or redesign: more special capacity, faster devices, or different recordsize based on workload.
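
Keep in mind that property changes only apply to newly written blocks; existing data stays where it was allocated. A hedged sketch of dialing the funnel back on this example dataset:

# Stop sending small data blocks to the special vdev; metadata still lands there.
zfs set special_small_blocks=0 tank/vmstore

# Verify the change and check how much headroom the special vdev has left.
zfs get special_small_blocks tank/vmstore
zpool list -v tank

# Existing blocks only move when rewritten (send/receive, migration, etc.).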

Task 10: Quantify vdev latency directly with per-device stats

cr0x@server:~$ iostat -x 2 3
Linux 6.8.0 (server) 	12/25/2025 	_x86_64_	(32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.21    0.00    3.14    7.88    0.00   79.77

Device            r/s     w/s   rkB/s   wkB/s  aqu-sz  await  r_await  w_await  svctm  %util
sda             95.0    52.0  97280   53248     1.2   10.5     9.8     11.7    2.1   31.0
sdi             18.0    50.0  18432   51200    14.8  214.2   390.1     62.3    6.8   97.0
nvme0n1        620.0   410.0  640000  420000    0.9    1.4     1.2      1.7    0.2   18.0

What it means: sdi has huge await and is near 100% utilized with deep queue. That’s the slow vdev symptom in neon.

Decision: If this is persistent, replace/evacuate the disk or fix the path. If it’s bursty, look for background jobs (scrub, backup, snapshots) or SMR behavior.

Task 11: Check ashift to catch permanent misalignment mistakes

cr0x@server:~$ zdb -C tank | egrep -i "vdev_tree|ashift" -n | head -n 30
54:        vdev_tree:
78:            ashift: 12
141:            ashift: 9

What it means: Mixed ashift values in one pool are a smell. An ashift of 9 (512B) on 4K-native drives can cause read-modify-write penalties.

Decision: If you find an incorrect ashift, you can’t “tune it.” Plan a vdev replacement/migration. Do not build new pools without explicitly setting ashift if you care about predictable performance.
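
Ashift is fixed when a vdev is created or a device is added, never afterward, so be explicit every time. A sketch with placeholder pool and device names:

# New pool: force 4K sectors regardless of what the drives advertise.
zpool create -o ashift=12 newpool raidz2 /dev/disk/by-id/ata-DISK{1..6}

# Adding a vdev later: say it again; ashift is per top-level vdev.
zpool add -o ashift=12 newpool raidz2 /dev/disk/by-id/ata-DISK{7..12}

# Confirm what you actually got.
zdb -C newpool | grep ashift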

Task 12: Detect severe fragmentation and allocation pathology

cr0x@server:~$ zpool list -o name,size,alloc,free,frag,cap,health tank
NAME  SIZE   ALLOC  FREE  FRAG  CAP  HEALTH
tank  25.6T  18.2T  7.45T  38%  70%  ONLINE

What it means: Fragmentation isn’t automatically fatal, but when it climbs with pool fullness, random-write workloads get punished.

Decision: If FRAG is high and performance is suffering, plan a rewrite cycle (send/recv to a fresh pool, or dataset migration) rather than endless tuning.

Task 13: See if one vdev is doing disproportionate I/O over time

cr0x@server:~$ zpool iostat -v -l 5 3
                              operations         bandwidth          total_wait
pool                       read  write       read  write            read  write
-------------------------  ----  -----      -----  -----           -----  -----
tank                       8200  4100       780M   420M            12ms   28ms
  raidz2-0                 4100  2000       390M   205M             8ms   16ms
  raidz2-1                 4100  2100       390M   215M            45ms   82ms
-------------------------  ----  -----      -----  -----           -----  -----

What it means: raidz2-1 has much higher wait time than raidz2-0 at similar throughput. That’s vdev-level imbalance, not a fluke.

Decision: Drill into the devices inside raidz2-1. If they’re healthy, suspect fullness/fragmentation, controller path, or mixed drive models/firmware.

Task 14: Identify who is generating the I/O (because “storage is slow” is not a process)

cr0x@server:~$ zpool iostat -v 1
...output...
cr0x@server:~$ iotop -oPa
Total DISK READ:   145.20 M/s | Total DISK WRITE:  81.33 M/s
  PID USER      DISK READ  DISK WRITE  SWAPIN      IO>    COMMAND
23144 root      90.12 M/s   1.23 M/s   0.00 %  22.14 %   /usr/sbin/zfs receive -u tank/vmstore
18201 postgres   8.10 M/s  45.77 M/s   0.00 %  10.31 %   postgres: checkpointer

What it means: A zfs receive is writing hard, and Postgres is checkpointing. That can create sync write bursts and metadata churn.

Decision: If this is planned replication, rate-limit or schedule it. If it’s unplanned, find the job owner. Then tune datasets (recordsize, sync behavior) appropriately, not emotionally.
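
If replication has to coexist with peak traffic, throttle the stream instead of the pool. A sketch using pv as a rate limiter; pv is a third-party tool, and the snapshot names, remote host, and the 80 MB/s cap are placeholders:

# Sender side: cap the incremental replication stream at roughly 80 MB/s.
zfs send -i tank/vmstore@prev tank/vmstore@now \
  | pv -q -L 80m \
  | ssh backup-host zfs receive -u backup/vmstore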

Common failure modes that create vdev imbalance (the stuff that actually happens)

1) Mixed media or mixed “behavior class” drives

Mixing HDD models that look similar on paper can still create vdev imbalance because firmware and cache behavior differ.
Mixing CMR and SMR drives is worse: SMR can exhibit massive latency spikes under sustained random writes or housekeeping.
Mixing SATA and SAS behind different expanders can also create “one vdev is fine, the other is haunted.”

If your pool has two RAIDZ vdevs and one is built from “whatever we had,” you don’t have a pool; you have an experiment.
Production doesn’t need experiments.

2) A single marginal disk inside RAIDZ or mirror

ZFS will keep the vdev online while a disk limps along with retries. That’s good for availability. It’s bad for latency.
A disk can be “healthy enough” to not fail, while being slow enough to destroy your tail latency. This is common with:

  • Growing reallocated/pending sectors
  • Interface CRC errors from a shaky cable/backplane lane
  • Thermal throttling in dense chassis
  • Firmware bugs that trigger periodic resets

3) Special vdev undersized or too slow

The special vdev is a performance multiplier when done right and a pool-wide bottleneck when done wrong.
Metadata is small but constant. If you place small blocks on special and then run VM images with 8K–16K blocks,
your special vdev is now the write path for a lot of your workload.

If the special vdev fills up, new allocations spill over to the normal vdevs and you quietly lose the benefit you paid for. More subtly: even before it fills,
if it’s slower than your main vdevs (or simply saturated), you’ll see the pool “feel” like the special vdev’s latency.

4) Wrong ashift: permanent performance debt

Misaligned writes create read-modify-write cycles on the drive. That’s not a tunable annoyance; it’s a design flaw.
If one vdev has a wrong ashift, it can behave like a slow vdev forever, even when the devices are identical.

5) Fragmentation + high pool fullness

ZFS fragmentation isn’t the same as traditional filesystem fragmentation, but the effect is similar: more seeks, more metadata,
more small I/O. RAIDZ is especially sensitive to small random writes because parity math and read-modify-write amplify the work.
At high utilization, metaslab selection becomes constrained, and the allocator’s “best option” is often “least terrible.”

6) Background work you forgot about

Scrubs, resilvers, snapshot deletion, replication receives, and heavy compression can dominate I/O. A common pattern:
one vdev is slightly weaker, so background work queues up more there, which makes it weaker, which makes it queue more.
Congratulations, you built a positive feedback loop.

Three corporate-world mini-stories (anonymized, plausible, technically accurate)

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran a virtualization cluster backed by a ZFS pool: two top-level RAIDZ2 vdevs, each six disks.
The team assumed “same capacity” meant “same performance.” One vdev was built from a slightly newer batch of disks
with a different firmware line. Procurement said “equivalent.” It was not equivalent.

Everything looked fine in synthetic tests: sequential throughput was great. Then Monday happened—VM boot storms,
logins, and a database that never met a small random read it didn’t like. Latency spiked, then stayed spiky.
The application team saw timeouts. The storage team saw nothing “down.” The pool was ONLINE, scrubs passed, life was green.

The breakthrough was admitting that health and performance are separate states. zpool iostat -v showed one vdev
with consistently higher wait times during random read peaks. Per-disk iostat -x and kernel logs showed intermittent
link resets on just two disks—enough to stall that RAIDZ group.

The fix wasn’t a clever ZFS parameter. It was boring: replace the suspect disks and, more importantly, fix the SAS path
that was causing resets. They also changed their build standard: every vdev must be homogeneous in model/firmware, and
every HBA/backplane path must be validated under load before production cutover.

The wrong assumption was subtle: “ZFS will balance across vdevs, so one slightly odd vdev won’t matter.” ZFS balanced
allocations just fine. The workload didn’t care about fairness. It cared about tail latency.

Mini-story 2: The optimization that backfired

Another shop had a pool serving home directories and CI artifacts. Metadata lookups were heavy, so they added a special vdev
on a pair of consumer NVMe drives. It was fast at first. So fast that people started calling it “the SSD turbo mode,” which is
a phrase that should trigger a risk review.

The team enabled special_small_blocks broadly because it improved CI job times. Then the dataset mix shifted:
more small files, more container layers, more tiny random writes. The special vdev became both the metadata store and the small-block store.
Wear increased. Latency started to wobble. Nobody noticed—ARC and L2ARC masked reads, and writes “usually” completed quickly.

Months later, a burst of sync writes during peak hours exposed the weakness. The consumer NVMes began throttling under heat
and sustained writes. The pool didn’t go down. It just felt like an outage: shell prompts hung, builds timed out, and everyone
blamed the CI system.

The postmortem was blunt: they optimized the wrong thing. They treated special vdev like a cache (optional, best-effort).
It’s not a cache. It’s primary storage for what you put on it. The fix involved moving small-block allocation back to the main
vdevs for most datasets, improving cooling, and rebuilding the special vdev on enterprise-grade devices with power-loss protection.

The backfire wasn’t because special vdev is bad. It backfired because they made it the critical path and then built it like a toy.

Mini-story 3: The boring but correct practice that saved the day

A financial services team ran ZFS-backed NFS for a mix of analytics and user shares. They had a ritual: weekly scrub, monthly
review of SMART trends, and a standing rule that any disk with growing CRC or pending sector counts gets replaced during business hours.
Not because it’s failing now, but because they don’t like surprises.

One Thursday, latency started creeping up. Not a spike—just a slow climb. The on-call checked zpool iostat and saw one vdev’s
wait time rising. zpool status was clean. No errors. Everything ONLINE. This is where many teams stop.

Their “boring” practice kicked in: they pulled SMART for the whole chassis and found one disk with a small but steady increase in
UDMA_CRC_Error_Count. That doesn’t scream “replace me.” It whispers “your cable is flaky.” They reseated the cable, errors stopped
climbing, and latency normalized without drama.

Two days later, during a scheduled maintenance window, they replaced the cable harness and audited the backplane lane mapping.
No outage, no data loss, no midnight pager. The pool never went DEGRADED. That’s the point. The best incident is the one that never earns a ticket.

Common mistakes: symptom → root cause → fix

1) “Pool throughput is fine, but latency is awful”

Symptom: Big sequential copies look fast; databases and VM I/O stall; p99 latency spikes.

Root cause: One vdev/device has high tail latency (retries, throttling, SMR behavior), or special vdev/SLOG is saturated.

Fix: Identify the slow vdev with zpool iostat -v + iostat -x. Replace/repair the slow component. Verify special vdev sizing and device class.

2) “One vdev shows higher wait time even with identical disks”

Symptom: Two RAIDZ vdevs, same model disks, but one is consistently slower.

Root cause: Different controller path, expander, queue settings, or a single marginal disk inside the vdev.

Fix: Check kernel logs for resets, verify cabling/backplane, compare per-disk await/util. Swap disks between bays if you need proof.

3) “Scrub/resilver takes forever and performance tanks”

Symptom: Resilver ETA swings wildly; user I/O stalls during scan.

Root cause: A weak disk or saturated vdev; high fragmentation; concurrent heavy workload; SMR drives.

Fix: Reduce competing workload, prioritize repair window, and replace slow disks. If SMR is present, plan migration; don’t debate it.

4) “After adding a new vdev, nothing got faster”

Symptom: You added capacity/IOPS but performance remains constrained.

Root cause: Existing data remains on old vdevs; allocation bias doesn’t retroactively rebalance; workload hot set didn’t move.

Fix: Rebalance by rewriting data (send/recv, rsync to new dataset, or migrate VMs). Don’t expect magic from “add vdev.”

5) “Metadata-heavy workload is slow, but disks are idle”

Symptom: Directory listings, small-file ops, container unpacking is slow; disk %util not crazy.

Root cause: Special vdev overloaded or nearly full; ARC pressure causing constant metadata misses; recordsize mismatch.

Fix: Inspect special vdev usage, ARC stats, and dataset properties. Expand or redesign special vdev; adjust special_small_blocks and recordsize.
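
For the ARC side of that inspection, OpenZFS ships two helpers; names vary slightly by distribution (sometimes arcstat.py / arc_summary.py), so treat this as a sketch:

# ARC size, hit rate, and miss rate sampled every 5 seconds.
arcstat 5

# One-shot summary, including metadata vs data hits and ARC tuning values.
arc_summary | less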

6) “Random write performance collapsed after pool hit ~80%”

Symptom: Same workload, worse performance over time; no hardware changes.

Root cause: Fragmentation + allocator constraints + RAIDZ small-write amplification.

Fix: Keep pools comfortably below high utilization for random-write workloads, or plan periodic rewrite/migration. Tuning won’t unfragment a vdev.

Checklists / step-by-step plan

When latency alarms fire: 15-minute triage

  1. Run zpool status. If scrub/resilver is active, annotate the incident and adjust expectations.
  2. Run zpool iostat -v 2 5. Identify which vdev is slow (wait/throughput imbalance) and which disk is odd.
  3. Run iostat -x 2 3. Confirm high await/%util on the suspected devices.
  4. Check dmesg for resets/timeouts. If present, stop debating; fix the path or replace disk.
  5. Pull SMART/NVMe health. Look for pending/uncorrectable and CRC trends.
  6. If sync writes are involved, inspect SLOG health and load. If metadata-heavy, inspect special vdev.
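
If you would rather paste one thing during those first five minutes, here is a minimal sketch bundling steps 1–4; the pool name is a placeholder and everything it runs is read-only:

#!/bin/bash
# 15-minute triage, steps 1-4 in one pass.
POOL=tank

zpool status "$POOL"
zpool iostat -v "$POOL" 2 5
iostat -x 2 3
dmesg -T | egrep -i "ata|scsi|nvme|reset|timeout|error" | tail -n 40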

Stabilize first, optimize later

  1. Remove the failing component (replace disk, fix cable/HBA path).
  2. Throttle or reschedule background jobs (replication receives, scrubs, snapshot deletions) during peak.
  3. Verify pool utilization and fragmentation; plan capacity additions before you hit allocator pain.
  4. Only after stability: tune dataset properties for workload (recordsize, compression, atime, sync/logbias where appropriate).

Rebalancing plan (because adding vdevs doesn’t move old blocks)

  1. Create a new dataset (or new pool) with correct properties.
  2. Use zfs send/zfs receive to rewrite data onto the new allocation layout (see the sketch after this list).
  3. Cut over consumers (mountpoints, shares, VM storage config).
  4. Destroy old datasets to free space and reduce fragmentation.
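
A hedged sketch of steps 1–2, using a recursive snapshot and send/receive on the same pool; dataset and snapshot names are placeholders, and a dry run of the receive (zfs receive -n) is cheap insurance:

# Snapshot the source, then rewrite it onto the new allocation layout.
zfs snapshot -r tank/vmstore@rebalance
zfs send -R tank/vmstore@rebalance | zfs receive -u tank/vmstore-new

# Sanity-check that locally set properties carried over before cutover.
zfs get -r -s local all tank/vmstore-new

# Only after cutover and verification:
# zfs destroy -r tank/vmstore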

Design checklist to prevent imbalance (what I’d enforce in a build standard)

  • Homogeneous vdevs: same drive model, firmware family, and approximate wear level.
  • Consistent ashift (set it deliberately).
  • Separate performance roles only with appropriate devices (PLP for SLOG, enterprise SSDs for special vdev).
  • Capacity headroom: don’t run hot pools at 85–95% if you care about random I/O.
  • Controller path sanity: validate expanders/backplanes; keep queueing consistent across vdevs.
  • Operational hygiene: regular scrubs, SMART trend monitoring, and clear ownership for I/O-heavy jobs.
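
What that standard looks like as an actual command, sketched with placeholder device IDs; mirrors are shown, but the same discipline applies to RAIDZ layouts:

# Explicit ashift, stable by-id names, homogeneous mirror vdevs,
# and purpose-built devices for the log and special roles.
zpool create -o ashift=12 \
  -O compression=lz4 -O atime=off \
  newpool \
  mirror /dev/disk/by-id/ata-MODELA-1 /dev/disk/by-id/ata-MODELA-2 \
  mirror /dev/disk/by-id/ata-MODELA-3 /dev/disk/by-id/ata-MODELA-4 \
  log mirror /dev/disk/by-id/nvme-PLP-1 /dev/disk/by-id/nvme-PLP-2 \
  special mirror /dev/disk/by-id/nvme-ENT-1 /dev/disk/by-id/nvme-ENT-2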

FAQ

1) Does one slow disk always slow the entire ZFS pool?

Not always, but often enough to plan for it. If the slow disk is inside a RAIDZ vdev, it can raise that vdev’s latency.
If your workload depends on that vdev’s I/O (it will), p99 latency can jump. ARC may hide read pain until it can’t.

2) Why does ZFS “use” the slow vdev at all?

Because the pool is built from vdevs, and ZFS allocates across them. It can bias allocations based on free space and heuristics,
but it can’t permanently avoid a vdev without removing it. If you want ZFS to not use it, you must redesign: replace or remove.

3) If I add a new fast vdev, will ZFS rebalance existing data?

No. Existing blocks stay where they are. New allocations will tend to go to vdevs with more free space, so performance may improve
slowly as the working set changes. If you need immediate rebalance, you rewrite data (send/recv or migration).

4) Is RAIDZ more sensitive to vdev imbalance than mirrors?

Generally yes for small random writes and rebuild behavior. RAIDZ has parity overhead and can do read-modify-write for partial stripes.
Mirrors can serve reads from either side, which sometimes masks a slow member on reads, but writes still pay for the slowest member.

5) Can a special vdev become the bottleneck even if it’s SSD?

Absolutely. SSD isn’t a guarantee; it’s a device class with wide variance. A small special vdev can fill up, and a consumer SSD can throttle,
suffer latency spikes, or wear out faster under metadata/small-block churn. Treat special vdev as tier-0 storage, not a cache.

6) Should I disable sync writes to “fix” performance?

Only if you can tolerate losing acknowledged writes on power loss or crash, and you understand what that means for your application.
For databases and VM storage, casually setting sync=disabled is a reliability regression disguised as a benchmark win.

7) How full is too full for a pool?

Depends on workload, but if you care about random write latency, start getting nervous above ~70–80% and plan capacity before 85%.
The allocator has fewer good choices as free space shrinks; fragmentation and metadata overhead compound.

8) What’s the fastest way to prove the problem is hardware, not “ZFS tuning”?

Combine: zpool iostat -v showing one vdev with higher wait, iostat -x showing high await/%util on specific devices,
and dmesg/smartctl showing resets or error counters. That triangle is hard to argue with in a meeting.

9) Can incorrect ashift cause one vdev to be slower than others?

Yes, and it’s permanent for that vdev. Misalignment can increase write amplification and latency. If you find one vdev with the wrong ashift,
plan a replacement/migration. Don’t waste weeks tuning around a geometry mistake.

10) Why does the pool feel slow when only one dataset is busy?

Because vdevs are shared resources and ZFS has global work (txg commits, metadata updates, space map updates). A noisy neighbor dataset can
saturate a vdev, increase wait times, and spill into other workloads. Use workload separation (different pools) when stakes are high.

Conclusion: next steps that actually move the needle

ZFS doesn’t “mysteriously slow down.” It does exactly what it’s designed to do: preserve integrity, allocate across vdevs, and keep going
even when parts of the system are limping. The price is that you feel every slow component in the places that matter: sync write latency,
metadata churn, and tail behavior.

Do this next:

  1. Find the slow vdev with zpool iostat -v and confirm with iostat -x.
  2. Check the path: kernel logs + SMART counters. Replace bad cables and suspicious disks early.
  3. Audit special vdev and SLOG design. If they’re in the critical path, they must be built like production, not like a lab.
  4. Plan for rewrite-based rebalancing after expansions. Adding vdevs adds potential; rewriting data realizes it.
  5. Keep headroom. If you want predictable random I/O, stop treating 90% full pools as normal.

If you remember one thing: the pool is only as fast as the slowest vdev at the moment your workload needs it. Don’t argue with that.
Design around it, monitor for it, and replace your weakest links before they introduce themselves to your customers.
