ZFS vdev width planning: why more disks per vdev has a cost

You bought a stack of disks. The spreadsheet says “one huge RAIDZ2 vdev” is the most efficient. Then production says “why did latency jump to 80ms during a scrub?” and your on-call phone says “congrats, it’s your problem now.”

Vdev width planning is where storage stops being a shopping trip and becomes risk management. Wider vdevs can look like free capacity and free throughput. They are not free. They change failure domains, rebuild timelines, small-block I/O behavior, and the way ZFS distributes work.

The mental model: vdevs are the performance units

ZFS pools are built from vdevs (virtual devices). The pool stripes data across vdevs, not across individual disks. Inside each vdev, ZFS relies on the vdev’s redundancy layout (mirror, RAIDZ1/2/3, dRAID, etc.) to turn multiple physical disks into a single logical device.

That last sentence is the trap. People see 24 disks and think “24 spindles worth of IOPS.” With ZFS, you get roughly “number of vdevs worth of IOPS,” adjusted by layout. A pool with one RAIDZ2 vdev has a single top-level I/O queue for random reads and writes. A pool with six mirror vdevs has six. Same raw disks. Different behavior under load.

Vdev width planning is picking a failure domain and a performance shape. Wider vdevs consolidate capacity and can improve large sequential throughput. They also concentrate risk and can make small random I/O feel like wading through syrup.
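
A minimal sketch of the difference, using hypothetical disk names (real builds would use /dev/disk/by-id paths). Both pools get the same twelve disks; the second one just gives ZFS six top-level vdevs to stripe across instead of one:

# One wide RAIDZ2 vdev: one top-level vdev, effectively one queue for random I/O.
zpool create tank raidz2 disk1 disk2 disk3 disk4 disk5 disk6 \
                         disk7 disk8 disk9 disk10 disk11 disk12

# Six mirror vdevs from the same twelve disks: six top-level vdevs to stripe across.
zpool create tank mirror disk1 disk2 mirror disk3 disk4 mirror disk5 disk6 \
                  mirror disk7 disk8 mirror disk9 disk10 mirror disk11 disk12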

Two rules that stay true when everything else changes

  • Rule 1: If your workload is random I/O heavy (VMs, databases), you generally want more vdevs, not wider ones.
  • Rule 2: If your workload is mostly large sequential (backups, media, archives), wider RAIDZ can be fine—until rebuild windows become your new hobby.

One line worth keeping on the wall, because it's the whole game: "Hope is not a strategy," a maxim long associated with operations and reliability culture.

Joke #1: A “one giant vdev” is like a one-lane highway—amazing until the first stalled car, and then everyone learns new words.

Quick facts and historical context

Storage engineers love arguing about vdev width because the “right” answer depends on workload, risk tolerance, and how much you hate weekends. Still, some history helps explain why the tradeoffs look the way they do.

  1. ZFS started at Sun Microsystems in the mid-2000s with an “integrated volume manager + filesystem” model. That’s why vdevs exist at all: the filesystem owns the RAID decisions.
  2. RAIDZ is not classic hardware RAID5/6. ZFS uses variable stripe width for RAIDZ, which reduces the classic “RAID5 write hole” problems when combined with copy-on-write and checksums.
  3. “IOPS per vdev” became a rule of thumb because ZFS dispatches I/O at the vdev level. It’s not the whole story, but it predicts pain surprisingly well.
  4. Scrubs were designed as integrity operations, not performance features. They read everything to verify checksums. On wide vdevs with big disks, “everything” is… a lot.
  5. Disk capacity growth outpaced IOPS for decades. A modern HDD is huge but not fast. Wider vdevs amplify the “big but slow” mismatch during resilvering.
  6. 4K sector reality (ashift) changed planning. Misaligned sectors (wrong ashift) can quietly multiply write amplification and latency, especially in RAIDZ.
  7. SSD changed the game but didn’t delete physics. You can make vdevs wide on SSD and it’ll “work,” but rebuild and parity math still costs CPU and latency.
  8. dRAID exists largely because resilver time hurt. It spreads spare capacity and rebuild work, reducing the “one disk at a time for days” failure window typical of wide RAIDZ on large HDDs.

What “more disks per vdev” really costs

1) You’re buying capacity with a larger failure domain

A top-level vdev is the atomic redundancy unit. If a RAIDZ2 vdev can lose two disks, then any third disk failure in that same vdev is catastrophic for the pool. When you make the vdev wider, you put more disks into the same “two failures allowed” bucket.

With multiple smaller vdevs, failures are more likely to be spread across vdevs. With one wide vdev, the pool’s fate rides on that one group’s parity budget. That’s the “blast radius” cost.

2) Rebuilds take longer, and long rebuilds are a reliability tax

ZFS resilvers allocated data rather than every raw sector, which is great. But the pool still has to walk metadata to find that data, and you still must read from surviving disks and write reconstructed blocks. As disks get larger, “only allocated” can still be many terabytes.

Wide vdevs increase the number of disks participating in reconstruction I/O for each block. That means more total reads during resilver, more chance of encountering latent sector errors, more contention with production workloads, and more time spent in the danger window where a second/third failure ends the story.

3) Small random I/O: parity is not free

Mirrors are friendly to random I/O: reads can be balanced across the mirror sides, and small writes are relatively cheap. RAIDZ avoids the classic RAID5 read-modify-write by writing variable-width stripes, but every block still carries its own parity sectors and allocation padding, and reading that block means touching every disk that holds a piece of it. ZFS is smart, but it's not magic.

As vdev width increases, the full-stripe size increases. For small-block workloads (common with VMs and databases), most writes fall far short of a full stripe, so parity and padding consume a growing share of each allocation and space efficiency quietly erodes. Wider vdevs can mean more overhead per write, more disks touched per read, and more latency variance.
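
A back-of-envelope sketch of that padding cost, assuming ashift=12 (4K sectors) and RAIDZ2, where allocations are rounded up to a multiple of parity + 1 sectors:

# An 8K volblocksize block on RAIDZ2 with 4K sectors:
#   2 data sectors + 2 parity sectors = 4 sectors,
#   rounded up to a multiple of 3 (parity + 1) = 6 sectors allocated.
echo "$(( 6 * 4 ))K on disk for 8K of logical data"
# -> 24K on disk for 8K of logical data, no matter how wide the vdev is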

4) Latency becomes spikier under background work

Scrubs, resilvers, and heavy sequential readers will touch every disk. In a wide vdev, those operations light up more disks at once, and they do it for longer. The pool can still serve normal I/O, but queues deepen. Latency percentiles climb. And it’s the percentiles that wake you up at night, not the average.

5) You don’t actually get “more performance” the way you think

For sequential throughput, wider RAIDZ can help because more disks contribute to each stripe. For random IOPS, especially writes, one RAIDZ vdev behaves like one device with parity overhead. If you want more random IOPS, you want more top-level vdevs.

So the “more disks per vdev” pitch is usually: better capacity efficiency, sometimes better sequential throughput, fewer vdevs to manage. The bill arrives as: worse random performance per TB, bigger failure domain, longer rebuilds, harsher background-work impact.

6) Expansion planning becomes awkward

ZFS historically expanded pools by adding vdevs, not by “adding disks to a RAIDZ vdev.” RAIDZ expansion has become possible in modern OpenZFS, but it’s not a get-out-of-jail-free card: it’s a heavy operation, takes time, and you still end up with a wider vdev and all the costs above.

Planning vdev width is also planning how you’ll grow. If your growth model is “buy the same shelf again,” smaller vdevs that match shelf geometry tend to age better.
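
A sketch of the two growth paths, with hypothetical device names. The second path requires a release that ships the raidz_expansion feature (OpenZFS 2.3 and later), so treat it as conditional on your version:

# Preferred growth path: add another top-level vdev of the same shape.
zpool add tank raidz2 disk13 disk14 disk15 disk16 disk17 disk18

# RAIDZ expansion: attach one more disk to an existing RAIDZ vdev.
# It reflows existing data, runs for a long time on a full pool, and
# leaves you with a wider vdev and all the costs described above.
zpool attach tank raidz2-0 disk19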

Match vdev width to workload: sequential, random, mixed

Large sequential (backup repositories, media, archive)

Sequential writes and reads are where RAIDZ can look great. You stream big blocks, ZFS can assemble large I/O, and disks stay busy in a predictable way. Wider vdevs can increase aggregate throughput because more disks participate in each stripe.

But: backup repositories also do scrubs, pruning, and sometimes many parallel restores. If you build one monster RAIDZ2 vdev, background work and restore storms will contend in the same single-vdev queue. If you can tolerate that (and your RTO can), fine. If your restore is the moment your CEO discovers storage, don’t gamble.

Random I/O heavy (VM datastores, databases)

Here, wide RAIDZ is usually the wrong tool. Not because it’s “slow,” but because it’s unpredictable under pressure. Random write latency is the killer. Mirrors (or more vdevs of narrower RAIDZ) spread I/O across more independent queues and reduce parity overhead.

If you must use RAIDZ for VM storage on HDDs, keep vdevs narrower, use enough vdevs to get parallelism, and be honest about performance expectations. If it’s SSD/NVMe, RAIDZ can be acceptable, but mirror layouts still win in tail latency in many real workloads.

Mixed workloads (fileshares + VMs + backups on the same pool)

This is where “one pool to rule them all” gets spicy. Mixed workloads create mixed I/O sizes. RAIDZ hates mixed small random writes while someone else does sequential reads. Your monitoring will show “utilization” and “throughput” and you’ll still get complaints because latency is what humans feel.

In corporate life, mixed workloads happen because budgets happen. If you can’t split pools, at least split vdevs: more vdevs gives ZFS options. Also consider special vdevs for metadata/small blocks if you know what you’re doing—done right, it can transform directory-heavy and small-file workloads.

RAIDZ width, parity, and the physics you can’t negotiate

Parity levels: RAIDZ1 vs RAIDZ2 vs RAIDZ3

Parity is your insurance policy. RAIDZ1 (single parity) is increasingly a bad idea with large HDDs in serious environments because rebuild windows are long and latent read errors aren’t a fairy tale. RAIDZ2 is the practical baseline for HDD pools that matter. RAIDZ3 exists for when you want to sleep during a long resilver on huge drives, but it costs more parity overhead and performance.

Parity doesn’t scale with width. A 6-disk RAIDZ2 and a 16-disk RAIDZ2 both tolerate two failures. Wider vdev means “same number of allowed failures across more disks.” That’s the core reliability tradeoff.

Recordsize, volblocksize, and why “small writes” become everyone’s problem

ZFS writes in records (default recordsize often 128K for filesystems), but zvols for block devices have volblocksize (often 8K/16K). Databases and VM images tend to generate 4K–16K writes. If your RAIDZ full-stripe width is large and your actual writes are small, you’re in partial-stripe territory. Partial-stripe writes are where parity overhead and read-modify-write show up as latency.

The fix is not “turn knobs until it looks better.” The fix is: choose a vdev layout that matches the workload’s I/O shape, and set recordsize/volblocksize intentionally.
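
A sketch of setting those properties deliberately (dataset and zvol names are illustrative). Note that volblocksize is fixed at creation time, so it has to be chosen before the first VM image lands on the zvol:

# Large records for sequential backup data.
zfs set recordsize=1M tank/backups

# Smaller records for a dataset that serves ~16K database I/O.
zfs set recordsize=16K tank/db

# volblocksize can only be set when the zvol is created.
zfs create -V 200G -o volblocksize=16K tank/vmstore/zvol2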

Compression changes width math (in a good way, mostly)

Compression (like lz4) reduces bytes written. That can reduce disk time and sometimes reduce parity work. It can also make writes smaller and more scattered, which can increase metadata churn. Usually, compression is a win, but measure it—especially on already CPU-bound systems.
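
Checking whether compression is earning its keep is cheap; a minimal sketch:

# Enable lz4 at the pool root (children inherit it) and see what it achieves.
zfs set compression=lz4 tank
zfs get -r compressratio tank
# Note: compressratio only reflects data written after compression was enabled.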

Special vdevs and SLOG: powerful, sharp tools

Special vdevs (metadata and small blocks) can rescue performance for small-file workloads on HDD RAIDZ. But they become critical devices: lose the special vdev and you can lose the pool. That means mirrored special vdevs with good devices and monitoring.

SLOG (separate log) helps only for synchronous writes. It won’t fix general latency. Put it in because you understand your sync workload, not because you saw it in a forum thread.
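
If you do add either one, the shape matters more than the brand; a sketch with hypothetical NVMe device names:

# Special vdev for metadata (and optionally small blocks): mirror it,
# because losing the special vdev can lose the pool.
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
zfs set special_small_blocks=32K tank/shares   # only after measuring block sizes

# SLOG: helps synchronous writes only; mirror it and use
# power-loss-protected devices.
zpool add tank log mirror /dev/nvme2n1 /dev/nvme3n1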

Joke #2: Buying a SLOG to fix random write latency is like buying a nicer doormat to stop a leaky roof.

Resilver/rebuild math: time, risk window, and blast radius

Why wide vdevs make resilver risk feel personal

During a resilver, surviving disks are read heavily and continuously. That’s stress. It’s also when you discover which drives were quietly accumulating errors. Wider vdev means more drives are in the same redundancy group, and more drives must be read to reconstruct data. The resilver touches more hardware and runs longer, which raises the chance that something else fails before you’re done.

Allocated data vs full disk: ZFS helps, but don’t bet your job on it

ZFS resilvering is often faster than classic RAID rebuild because it can skip unallocated blocks. Great. But real pools aren’t empty, and metadata still needs scanning. Also, “fast” is relative when your disks are tens of terabytes and your pool is busy.
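
A back-of-envelope sketch for an 8-wide RAIDZ2 with roughly 22T allocated, so the replacement disk must receive about 2.75T; the throughput figures are assumptions, not measurements:

# Optimistic: the resilver sustains 120 MB/s.
echo "$(( 2750000 / 120 / 3600 )) hours"    # -> 6 hours
# Realistic on a busy pool: production I/O drags it down to 40 MB/s.
echo "$(( 2750000 / 40 / 3600 )) hours"     # -> 19 hours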

Wide vdevs increase collateral damage during rebuild

Even if a resilver completes successfully, a wide vdev tends to degrade performance more during the process because more disks are busy doing rebuild work. If the pool serves latency-sensitive workloads, the rebuild window becomes a performance incident.

Operational implication: your rebuild plan is part of your architecture

Don’t plan vdev width in a vacuum. Plan it with the points below (a short command sketch for the rebuild window itself follows the list):

  • How quickly you can replace a failed disk (humans, spares, vendor SLAs).
  • How much performance degradation you can tolerate during resilver.
  • How long your pool can run in “reduced redundancy” safely.
  • What happens if a second disk fails mid-resilver.
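
A minimal sketch for the rebuild window itself (pool and device names are illustrative):

# Start the replacement and watch progress.
zpool replace tank old-disk new-disk
zpool status -v tank

# Block until the resilver finishes (handy in runbooks and scripts).
zpool wait -t resilver tank

# If a scrub is competing with the resilver for disk time, pause it.
zpool scrub -p tank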

Three corporate mini-stories (all anonymized, all real enough)

1) Incident caused by a wrong assumption: “24 disks = lots of IOPS”

The company was mid-growth, migrating from a legacy SAN to a ZFS-based appliance. They had one requirement on paper: “support the existing VM farm without performance regression.” A team member did the familiar math: 24 HDDs, 7200 RPM, therefore “plenty” of IOPS. They built a single RAIDZ2 vdev because it maximized usable space and looked tidy.

It passed initial tests. The synthetic benchmark used large sequential reads and writes. The graphs looked heroic. The migration proceeded, and for a week it was quiet.

Then the Monday morning login storm hit. VMs booted, antivirus updated, Windows did Windows things, and latency climbed. Not gradually—spikily. The helpdesk tickets said “everything is slow.” The hypervisor logs said “storage latency 100–200ms.” The storage box was not “down,” just drowning.

They discovered the mistake: the pool had one top-level vdev. One queue. Random I/O from dozens of VMs piled into that queue, and RAIDZ parity overhead made writes worse. The fix wasn’t a tuning flag. The fix was architectural: rebuild the pool as multiple vdevs (mirrors in their case), accept less usable capacity, and get predictable latency. The hard lesson: IOPS come from vdev count and media type, not from raw disk count alone.

2) Optimization that backfired: “Let’s go wider to reduce parity overhead”

Another org ran a large backup repository on HDD RAIDZ2. It worked. Not fast, but acceptable. They decided to “optimize” by increasing vdev width during an expansion: fewer vdevs, wider each, simpler layout, slightly better capacity efficiency. Also, fewer vdevs meant fewer things to monitor, which always sounds good in a meeting.

For large sequential ingest, it improved. Ingest jobs finished earlier, and everyone congratulated themselves. Then the quarterly compliance scrub hit. Scrubs ran longer than expected, but that was tolerated.

The real failure showed up weeks later during a restore-heavy day. They had multiple parallel restores while a resilver was already running (because of course a drive failed during the busy period). The wider vdev meant the resilver involved more disks, which meant more contention. Restore performance collapsed, and the RTO they’d promised to internal customers became a performance incident.

The optimization “worked” for the best-case workload and failed for the worst-case one. That’s a classic corporate trap: you benchmark the sunny day, then you live through the storm.

3) Boring but correct practice that saved the day: narrow-ish vdevs, hot spares, and scrub discipline

A third team ran a mixed environment: home directories, build artifacts, and some VM storage. They were conservative. They built multiple RAIDZ2 vdevs of moderate width, kept at least one tested spare on-site, and scheduled scrubs regularly with monitoring on scrub duration and error counts.

It was not glamorous. The capacity efficiency wasn’t “maxed.” They also had a rule: no pool above a certain utilization threshold (they aimed to keep real free space, not fantasy free space). Finance complained; engineers nodded and moved on.

One Friday afternoon, a disk failed. The spare was inserted immediately and resilver started. Overnight, a second disk in the same chassis threw errors. Because vdevs were narrower and scrub history had already identified a couple of marginal drives earlier in the year, they had replacements staged. The second failure hit a different vdev, so redundancy held. The pool stayed online, performance dipped but remained usable, and Monday morning wasn’t an incident review.

The boring practice didn’t “prevent” failure. It reduced the blast radius and shortened the time spent in danger. That’s what good storage architecture does.

Fast diagnosis playbook: find the bottleneck before you “tune”

When performance is bad, teams often start turning ZFS knobs like it’s a radio. Don’t. First, identify the limiting resource and whether the limitation is structural (vdev width/layout) or situational (scrub, resilver, a single failing disk).

First: confirm the pool’s topology and current health

  • Is it one wide RAIDZ vdev or many vdevs?
  • Is a scrub/resilver running?
  • Are there read/write/checksum errors?

Second: determine whether the pain is latency, throughput, or CPU

  • High latency with low throughput often means queueing, small random I/O, or a sick disk.
  • High throughput with acceptable latency means you’re probably fine.
  • High CPU during writes can mean compression, checksum, RAIDZ parity work, or recordsize mismatch.

Third: isolate whether one vdev/disk is the bully

  • If one disk shows high latency or errors, it can drag a whole RAIDZ vdev down.
  • If one vdev is saturated and others are idle, you’re bottlenecked on vdev count, not disk count.

Fourth: check pool free space and fragmentation indicators

  • Very full pools behave badly, especially for RAIDZ with small blocks.
  • Free space is performance budget, not just capacity.

Fifth: decide if the fix is operational or architectural

  • Operational fixes: replace failing disk, reschedule scrub, throttle resilver, add SLOG only if sync workload demands it.
  • Architectural fixes: add vdevs, rebuild as mirrors, split workloads across pools, consider special vdevs or dRAID where appropriate.

Practical tasks: 12+ commands, outputs, and decisions

These are the commands you run when someone says “the storage is slow” and you need to answer with evidence. Each task includes what the output means and the decision you make.

Task 1: Identify vdev layout and current state

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 07:12:44 with 0 errors on Wed Dec 24 03:12:01 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            /dev/disk/by-id/ata-d1  ONLINE       0     0     0
            /dev/disk/by-id/ata-d2  ONLINE       0     0     0
            /dev/disk/by-id/ata-d3  ONLINE       0     0     0
            /dev/disk/by-id/ata-d4  ONLINE       0     0     0
            /dev/disk/by-id/ata-d5  ONLINE       0     0     0
            /dev/disk/by-id/ata-d6  ONLINE       0     0     0
            /dev/disk/by-id/ata-d7  ONLINE       0     0     0
            /dev/disk/by-id/ata-d8  ONLINE       0     0     0

errors: No known data errors

Meaning: One RAIDZ2 vdev with 8 disks. Scrub completed, no errors.

Decision: If this pool hosts random I/O workloads and latency is the complaint, you’re likely vdev-count limited. Consider adding more top-level vdevs (or rebuilding layout) rather than chasing tuning.

Task 2: Show vdev-level utilization and latency

cr0x@server:~$ zpool iostat -v tank 1 5
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        22.1T  12.7T    310    420  42.1M  55.8M
  raidz2-0                  22.1T  12.7T    310    420  42.1M  55.8M
    ata-d1                      -      -     38     52  5.3M   7.1M
    ata-d2                      -      -     40     50  5.4M   6.9M
    ata-d3                      -      -     39     52  5.4M   7.1M
    ata-d4                      -      -     38     53  5.2M   7.2M
    ata-d5                      -      -     39     53  5.4M   7.2M
    ata-d6                      -      -     38     53  5.2M   7.2M
    ata-d7                      -      -     39     53  5.4M   7.2M
    ata-d8                      -      -     39     54  5.4M   7.3M
--------------------------  -----  -----  -----  -----  -----  -----

Meaning: All I/O flows through one vdev. Disk distribution looks even (good), but this can still be a single-queue bottleneck for random I/O.

Decision: If you need more random IOPS, add another vdev (same redundancy) to increase top-level parallelism. Wider doesn’t help here; more vdevs does.

Task 3: Catch a failing disk by latency, not just errors

cr0x@server:~$ iostat -x 1 3
Linux 6.8.0 (server)  12/26/2025  _x86_64_  (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.12    0.00    3.44    9.33    0.00   81.11

Device            r/s     w/s   rkB/s   wkB/s  avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda              6.0    12.0    512    2048      256.0     0.7   38.2   12.4   51.1   2.4   4.3
sdb             45.0    60.0   5200    7600      245.3     9.9   96.7   90.2  101.5   3.1  32.5
sdc             46.0    59.0   5150    7550      244.9     2.1   19.4   17.2   21.2   2.9  30.8

Meaning: sdb has much higher await and queue depth than peers. It might be retrying or internally remapping.

Decision: Pull SMART stats and consider proactive replacement. A single slow disk can throttle RAIDZ because every stripe waits for the laggard.

Task 4: Check SMART error counters and self-test history

cr0x@server:~$ sudo smartctl -a /dev/sdb | egrep -i 'Reallocated|Pending|Uncorrect|Power_On_Hours|SMART overall|Self-test'
SMART overall-health self-assessment test result: PASSED
Power_On_Hours          37124
Reallocated_Sector_Ct   12
Current_Pending_Sector  3
Offline_Uncorrectable   1
SMART Self-test log structure revision number 1
# 1  Extended offline    Completed: read failure       90%     37110         12345678

Meaning: “PASSED” is a weak signal. Pending/uncorrectable sectors plus failed extended test is a loud signal.

Decision: Replace the disk before it becomes a second failure during resilver. Also consider running a scrub after replacement to validate.

Task 5: Verify ashift (sector alignment) for the pool

cr0x@server:~$ zdb -C tank | grep -m1 ashift
        ashift: 12

Meaning: ashift=12 means 4K sectors assumed. Good for modern HDD/SSD.

Decision: If you see ashift=9 on 4K drives, expect write amplification and poor performance. Fix requires rebuilding the pool correctly; you can’t “tune” ashift after the fact.
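
When you build the replacement pool, pin the sector size explicitly rather than trusting what the drives report; a one-line sketch (disk names illustrative):

# Force 4K alignment at creation time; ashift cannot be changed afterwards.
zpool create -o ashift=12 tank raidz2 disk1 disk2 disk3 disk4 disk5 disk6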

Task 6: Identify datasets and whether you’re using zvols (VMs) vs filesystems

cr0x@server:~$ zfs list -o name,type,used,available,recordsize,volblocksize -r tank
NAME               TYPE   USED  AVAIL  RECSIZE  VOLBLOCK
tank               filesystem  22.1T  12.7T   128K     -
tank/vmstore       filesystem   8.2T  12.7T   128K     -
tank/vmstore/zvol1 volume      600G  12.7T      -     8K
tank/backups       filesystem  10.3T  12.7T   1M       -

Meaning: VMs are using a zvol with 8K volblocksize; backups dataset uses 1M recordsize (good for sequential).

Decision: If VM latency is bad on wide RAIDZ, consider moving VM storage to mirrors or SSD, or increasing vdev count. For backups, wider RAIDZ may be fine.

Task 7: Check sync settings and whether a SLOG would matter

cr0x@server:~$ zfs get -o name,property,value,source sync tank/vmstore/zvol1
NAME                PROPERTY  VALUE  SOURCE
tank/vmstore/zvol1   sync      standard  local

Meaning: Sync writes are honored. If your workload is sync-heavy (databases, NFS with sync), latency may be dominated by ZIL behavior.

Decision: Only consider SLOG if you confirm significant sync write load and you can provide proper power-loss-protected devices. Otherwise, don’t.

Task 8: Measure sync write pressure via ZIL stats (Linux OpenZFS)

cr0x@server:~$ cat /proc/spl/kstat/zfs/zil | egrep 'zil_commit_count|zil_itx_count|zil_itx_indirect_count'
zil_commit_count                    18442
zil_itx_count                        9921
zil_itx_indirect_count                 144

Meaning: Non-trivial ZIL commits happening. If this climbs rapidly during the complaint window, sync writes are involved.

Decision: If sync write latency is the bottleneck, evaluate SLOG (mirrored) or app-side batching. If not, don’t blame the ZIL.

Task 9: Confirm if a scrub/resilver is running and how fast it’s progressing

cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Fri Dec 26 08:11:02 2025
        9.21T scanned at 1.12G/s, 2.03T issued at 252M/s, 22.1T total
        0B repaired, 9.19% done, 1 day 01:33:40 to go
config:
...

Meaning: Scrub is running and will continue for over a day. That’s a long time to share disks with production.

Decision: If this is impacting latency-sensitive workloads, schedule scrubs in quieter windows, and consider narrower vdevs/more vdevs for future builds to reduce scrub impact per vdev.

Task 10: Check pool free space and special allocation flags

cr0x@server:~$ zpool list -o name,size,alloc,free,cap,health
NAME  SIZE   ALLOC   FREE  CAP  HEALTH
tank  34.8T  22.1T  12.7T  63%  ONLINE

Meaning: 63% used is comfortable. If you’re regularly above ~80–85% on RAIDZ with mixed workloads, expect worse fragmentation and latency.

Decision: If cap is high, plan expansion before you hit the cliff. Expansion is easier than recovery, and it’s cheaper than explaining to the business why writes became slow “for no reason.”

Task 11: Identify block size distribution with a quick fio sample

cr0x@server:~$ fio --name=randwrite --filename=/tank/vmstore/testfile --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --size=8G --runtime=30 --time_based --direct=1
randwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
  write: IOPS=820, BW=3.2MiB/s (3.4MB/s)(96MiB/30001msec)
    lat (usec): min=450, max=280000, avg=38500.12, stdev=42000.55

Meaning: 4K random writes are low IOPS with very high max/avg latency. On HDD RAIDZ, this is expected pain.

Decision: If this resembles production workload, stop trying to tune your way out. Use mirrors/SSD, increase vdev count, or separate VM storage from bulk storage.

Task 12: Check ARC pressure (memory) and whether reads are cache hits or disk hits

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  arcsz     c
08:32:11  12K   4K     33%   2K   16%   2K   16%   96G   110G
08:32:12  11K   5K     45%   3K   27%   2K   18%   96G   110G
08:32:13  13K   4K     30%   2K   15%   2K   15%   96G   110G

Meaning: ARC hit rate is okay-ish. If miss% is high and you’re read-latency bound, disks are doing real work.

Decision: If reads are missing ARC and your disks are already busy, don’t widen vdevs expecting miracles. Consider adding RAM (if reasonable), splitting workloads, or using special vdev/L2ARC only after measuring.

Task 13: Check for checksum errors indicating silent corruption or cabling trouble

cr0x@server:~$ zpool status -v tank | egrep -A2 'READ|WRITE|CKSUM'
        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0

Meaning: No errors. If you see CKSUM errors, don’t assume “bad disk”—it’s often cabling, HBA, expander, or power.

Decision: If CKSUM grows, investigate hardware path (swap cable, port, HBA). Don’t rebuild layout to fix a loose cable.

Task 14: Inspect dataset properties that influence I/O shape

cr0x@server:~$ zfs get -o name,property,value,source compression,atime,recordsize,logbias,primarycache tank/vmstore tank/backups
NAME          PROPERTY      VALUE     SOURCE
tank/vmstore  compression   lz4       local
tank/vmstore  atime         off       local
tank/vmstore  recordsize    128K      default
tank/vmstore  logbias       latency   default
tank/vmstore  primarycache  all       default
tank/backups  compression   lz4       local
tank/backups  atime         off       local
tank/backups  recordsize    1M        local
tank/backups  logbias       throughput local
tank/backups  primarycache  all       default

Meaning: Backups tuned for throughput; VM store defaults may be okay but zvol choices matter more than recordsize.

Decision: If a dataset is tuned opposite its workload (e.g., recordsize too small for backups, sync forced for no reason), fix properties. But don’t expect properties to overcome a bad vdev layout.

Common mistakes: symptoms → root cause → fix

Mistake 1: “Latency spikes during scrubs; users complain”

Symptoms: Predictable latency increases when scrub starts; interactive workloads become sluggish; zpool status shows scrub in progress.

Root cause: Wide vdev + HDD means scrub is long and competes for the same disk bandwidth and queues as production I/O.

Fix: Schedule scrubs off-hours; consider multiple vdevs (more top-level parallelism) for future builds; consider separating latency-sensitive workloads from bulk pools.
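
Scheduling is the cheap half of that fix; a cron sketch (the timing is an example, the zpool path varies by distribution, and many distributions already ship a periodic scrub job, so check before adding a second one):

# /etc/cron.d/zfs-scrub: scrub tank every Sunday at 02:00.
0 2 * * 0 root /usr/sbin/zpool scrub tank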

Mistake 2: “We have 20+ disks so VM performance should be great”

Symptoms: Low IOPS, high tail latency, throughput looks okay but apps time out; especially bad during boot storms.

Root cause: One (or too few) RAIDZ vdevs. Random I/O bottlenecked by vdev count and parity overhead.

Fix: Use mirrors for VM/databases on HDD, or increase the number of top-level vdevs. If you must use RAIDZ, keep vdevs narrower and add more of them.

Mistake 3: “Resilver takes forever; we’re nervous for days”

Symptoms: Resilver estimated time is multiple days; performance degraded; second failure anxiety becomes a lifestyle.

Root cause: Wide vdev + large disks + high utilization + ongoing workload. Rebuild window is long and stressful.

Fix: Keep utilization lower; choose narrower vdevs or dRAID where appropriate; maintain on-site spares; plan rebuild throttling and maintenance windows.

Mistake 4: “We added a SLOG and nothing changed”

Symptoms: Same latency; no noticeable improvement; sometimes worse stability due to cheap SSD.

Root cause: Workload not sync-write dominated, or SLOG device is not power-loss safe, or the real issue is random write amplification in wide RAIDZ.

Fix: Measure sync write rate first; if needed, use proper enterprise devices and mirror them. Otherwise remove the complexity.

Mistake 5: “One disk is ‘kinda slow’ but we’ll wait”

Symptoms: No ZFS errors yet, but iostat shows one disk with higher latency; occasional timeouts in logs.

Root cause: Drive is degrading, cabling path is flaky, or expander port is unhappy. RAIDZ stripes wait for slowest member.

Fix: Proactively replace or move the suspect component. Validate with SMART and error logs. Don’t wait for it to become “failed” during resilver.

Mistake 6: “Pool is 90% full and writes got slow; must be a bug”

Symptoms: Increasing latency over time; free space low; fragmentation and metadata overhead rise.

Root cause: Copy-on-write needs free space to allocate efficiently. RAIDZ suffers more as free segments shrink and writes become scattered.

Fix: Add capacity before the pool is full; enforce quotas/reservations; archive or delete; consider splitting hot and cold data across pools.

Checklists / step-by-step plan

Step-by-step vdev width planning (what I’d do before buying disks)

  1. Classify the workload: mostly sequential, mostly random, or mixed. If you can’t classify it, it’s mixed.
  2. Decide your failure tolerance: RAIDZ2 baseline for HDD; RAIDZ3 for very large disks or long replacement times; mirrors for low-latency tiers.
  3. Pick a target performance shape: how many random read/write IOPS do you actually need, and what latency percentile is acceptable?
  4. Choose vdev count first: number of top-level vdevs is your parallelism budget. Then pick width per vdev.
  5. Choose vdev width conservatively:
    • For HDD VM/databases: mirrors, multiple vdevs.
    • For HDD backups/archive: RAIDZ2 with moderate width; avoid “one giant vdev” unless you truly accept the rebuild risk.
  6. Plan growth increments: add vdevs over time; keep layouts consistent. Avoid having one odd vdev that behaves differently.
  7. Keep utilization headroom: set an internal cap (often ~80% for mixed workloads) and treat it as policy, not a suggestion.
  8. Plan for rebuilds: on-site spare(s), documented replacement procedure, and monitoring for early failure signals.
  9. Separate tiers if needed: put VM latency-sensitive workloads on mirrors/SSD, bulk data on RAIDZ.
  10. Test with a workload-like benchmark: random 4K/8K, mixed read/write, plus a scrub running, because that's reality (a sketch follows this list).
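
A sketch of that worst-case test, assuming a scratch file on the pool under evaluation (paths, sizes, and mix are illustrative; don't point it at a dataset you care about):

# Start background integrity work, then benchmark mixed random I/O on top of it.
zpool scrub tank
fio --name=mixed --filename=/tank/vmstore/fiotest --ioengine=libaio --direct=1 \
    --rw=randrw --rwmixread=70 --bs=8k --iodepth=16 --numjobs=4 \
    --size=16G --runtime=300 --time_based --group_reporting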

Operational checklist (what to verify monthly)

  • Scrub schedule and last scrub duration; investigate if duration is trending upward.
  • SMART error counters and failed self-tests; replace early.
  • Pool capacity trend and projected “80% date.”
  • zpool status clean, no CKSUM errors.
  • Latency percentiles (not just averages) during peak and during background ops; the sketch below shows the latency views built into zpool iostat.
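
OpenZFS can show latency breakdowns itself, which is often enough to spot queueing before you reach for external tooling; a minimal sketch:

# Per-vdev average latencies (total, disk, and queue wait), sampled every 5 seconds.
zpool iostat -l -v tank 5

# Full latency histograms, the closest thing to percentiles zpool gives you directly.
zpool iostat -w tank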

FAQ

Q1: Is “wider RAIDZ vdev = faster” ever true?

For large sequential throughput, yes: more disks per stripe can raise read/write bandwidth. For random I/O, especially small writes, wider is often worse or at least not better.

Q2: Why do people say “IOPS come from vdevs”?

Because ZFS stripes across top-level vdevs. Each vdev is a performance unit with its own queueing. More vdevs means more parallelism for random I/O.

Q3: What’s a “safe” RAIDZ2 width?

“Safe” depends on disk size, replacement time, and workload. Practically, moderate widths are easier to live with than very wide ones because rebuild windows and blast radius grow with width.

Q4: If I have 24 disks, should I do one 24-wide RAIDZ2?

Almost never for general-purpose or VM-heavy pools. You’ll get a huge failure domain and one-vdev random I/O behavior. Split into multiple vdevs.

Q5: Mirrors waste 50% capacity. Why would I choose them?

Because they buy low latency, high random IOPS, simpler failure behavior, and often faster resilvers. Capacity is cheaper than outages.

Q6: Does adding L2ARC fix wide vdev random I/O problems?

Sometimes it helps read-heavy workloads with repeated access patterns. It won’t fix random write latency or parity overhead. Also, L2ARC has its own memory and warmup considerations.

Q7: Should I use RAIDZ1 if I have smaller drives?

For anything important on HDD, RAIDZ1 is a gamble you don’t need. RAIDZ2 is the sane default. RAIDZ1 can be acceptable for non-critical, easily reproducible data with short rebuild windows.

Q8: Can I expand a RAIDZ vdev by adding disks later?

Modern OpenZFS supports RAIDZ expansion, but it’s a heavy operation and you still end up with a wider vdev and its costs. Many shops still prefer adding a new vdev for growth.

Q9: Does dRAID solve the “wide vdev rebuild” problem?

dRAID reduces resilver time by distributing rebuild work and spare capacity. It can be a good choice for large HDD pools, but it’s not a universal replacement for mirrors in low-latency workloads.

Q10: What’s the single biggest planning mistake?

Designing for capacity efficiency first and treating performance/rebuild behavior as “tuning problems.” Layout is architecture. Tuning is seasoning.

Next steps you can do this week

  • Inventory your pools: list vdev layouts, workloads served, scrub duration, and typical latency.
  • Run the fast diagnosis playbook during a complaint window and during a scrub window; compare.
  • Decide what you optimize for: low latency, high throughput, or maximum usable TB. Pick one primary goal per pool.
  • If you’re vdev-count limited: plan expansion by adding vdevs (or migrating to mirrors for the hot tier), not by widening the existing vdev.
  • If resilver windows scare you: reduce vdev width on next build, keep more free space, and keep tested spares on hand.
  • Write down a rebuild runbook: disk replacement steps, expected resilver behavior, and what metrics trigger escalation.

If you take one opinionated guideline from all of this, make it this: don’t build your first ZFS pool as one wide RAIDZ vdev unless the workload is truly sequential and the risk is truly acceptable. When it goes wrong, it doesn’t go wrong politely.
