There’s a reason RAIDZ2 keeps showing up in “boring infrastructure that survives” conversations. It’s not trendy, it’s not flashy, and it won’t win benchmark drag races against mirrors for small random I/O. What it does—when you size it sanely and operate it like you mean it—is keep your data standing when a disk dies on a Tuesday and another one looks sick on Wednesday.
This is a field guide from the perspective of someone who has watched pools degrade in real time, argued with procurement about “just buy bigger drives,” and learned that the difference between a calm on-call night and a career-defining incident is often a single assumption made months earlier.
What RAIDZ2 actually is (and what it isn’t)
RAIDZ2 is ZFS’s double-parity RAID. In plain terms: within a single RAIDZ vdev, you can lose any two drives and the vdev can still reconstruct data. ZFS stripes data across all drives in that vdev, with parity distributed. When you build a pool from one or more RAIDZ2 vdevs, your pool’s availability is the availability of its vdevs—lose one vdev, lose the pool.
RAIDZ2 is not “two drives worth of safety no matter what.” It’s protection against device loss and unreadable sectors during reconstruction—up to a point. It doesn’t protect you from:
- accidental deletion (use snapshots, replication, backups)
- application-level corruption (use checksums, also snapshots/replication)
- controller lies or cabling chaos (monitor, test, and keep firmware sane)
- human error (sadly, still a top-tier failure mode)
In operations, people confuse “RAID level” with “data safety.” RAIDZ2 is one layer. It’s a strong layer, but it’s not the whole story. If you only remember one thing: RAIDZ2 keeps you online during some hardware failures; it does not keep you whole during all failures.
Joke #1 (short, relevant): RAIDZ2 is like bringing a spare tire and a can of sealant—useful. It still doesn’t help if you drive into a lake.
Why RAIDZ2 is the sweet spot (usually)
Most storage design is a negotiation between four parties who don’t like each other: capacity, performance, resiliency, and budget. Mirrors win on IOPS and rebuild behavior, but cost you 50% raw capacity. RAIDZ1 is tempting until you operate it at modern drive sizes and real-world URE rates; you don’t want to learn that lesson during a resilver window. RAIDZ3 is robust but starts to feel like you’re paying a tax that your risk profile may not justify (unless you’re running huge vdevs or very large disks in high-stress workloads).
RAIDZ2 often lands in the middle: you “spend” two drives per vdev on parity, which is a relatively fixed overhead as you scale vdev width. In an 8-disk RAIDZ2, you’re at 75% usable (before ZFS overhead). In a 10-disk RAIDZ2, 80%. That’s the capacity story.
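If you want to sanity-check that arithmetic for other widths, a throwaway one-liner does it; the (N-2)/N figure deliberately ignores ZFS metadata, padding, and reserved space.
cr0x@server:~$ for n in 6 8 10 12; do awk -v n=$n 'BEGIN { printf "%2d-wide RAIDZ2: %.0f%% usable before ZFS overhead\n", n, (n-2)/n*100 }'; done
 6-wide RAIDZ2: 67% usable before ZFS overhead
 8-wide RAIDZ2: 75% usable before ZFS overhead
10-wide RAIDZ2: 80% usable before ZFS overhead
12-wide RAIDZ2: 83% usable before ZFS overhead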
The survival story is more subtle. Modern disks are large, and rebuild/resilver is a long walk, not a sprint. During that time, you’re exposed: every additional read is a chance to discover a latent error; every vibration or thermal event is a chance another disk goes weird. RAIDZ2 is often the point where you can tolerate a second surprise without immediately converting an inconvenience into a catastrophe.
Performance is where people get surprised. RAIDZ2 can deliver great sequential throughput—streaming, backups, large-object stores, media workflows, analytics scans—because you’re using many spindles. But small random writes are more complex: parity means read-modify-write behavior at the vdev layer. ZFS has mitigations (variable stripe width, transaction groups, caching), but physics still shows up to the meeting.
So “sweet spot” usually means: decent capacity efficiency, survivable under realistic failure patterns, acceptable performance for mixed workloads when configured correctly, and not so expensive that procurement starts inventing new adjectives.
Facts and historical context that matter
Storage engineering has a long memory, mostly because we keep repeating the same mistakes with larger numbers. A few facts and context points that shape RAIDZ2 decisions:
- ZFS was built to assume disks lie. End-to-end checksums were a response to silent corruption being common enough to matter in enterprise arrays and commodity hardware.
- RAIDZ exists partly to avoid the “RAID-5 write hole” problem. ZFS’s copy-on-write and transaction semantics change the failure modes compared to classic hardware RAID implementations.
- Drive capacities grew faster than rebuild speed. You can buy a lot of terabytes cheaply; you cannot buy back time during a rebuild when you’re reading nearly the whole vdev.
- URE (unrecoverable read error) math moved from theory to incident reports. What used to be “unlikely” becomes “inevitable” when you read tens of terabytes under stress.
- Ashift became a career-defining footgun. Advanced Format (4K sector) drives made 512e/4Kn alignment a real-world performance cliff. ZFS records it at vdev creation; you don’t casually change it later.
- LZ4 compression became the default for a reason. On many workloads, it improves effective throughput because the CPU is faster than your disks, and you’re moving fewer bytes.
- Scrubs used to be “nice to have.” With large pools, scrubs are proactive failure discovery: you want to find bad sectors on your schedule, not during a resilver.
- ECC RAM debates never die. ZFS doesn’t “require” ECC to function, but production operators have seen memory corruption do spectacular things. The risk is workload- and platform-dependent, but dismissing it as mythology is how postmortems are born.
- NVMe changed the architecture. Special vdevs, SLOG, and L2ARC can make RAIDZ2 behave dramatically differently, for better or worse, depending on how you deploy them.
Design decisions that decide your fate
1) Vdev width: how many disks per RAIDZ2?
RAIDZ2 overhead is “two disks,” so wider vdevs improve capacity efficiency. That’s the temptation. The counterweight is failure domain and rebuild time: a wider vdev means more total data to read and more disks participating, which increases both performance potential and the surface area for errors during resilver and scrub.
In practice, many production deployments land around 6–10 disks per RAIDZ2 vdev for general-purpose storage. Narrower vdevs (6–8) are often friendlier to resilver time and IOPS predictability; wider (10–12+) is common in throughput-heavy systems where you can tolerate longer maintenance windows and your monitoring and spares posture are mature.
A rule that’s less wrong than it sounds: choose a width you can rebuild within your operational patience, not your vendor’s marketing slides. If your on-call rotation panics at a three-day resilver, build for a one-day resilver. If you can’t rebuild quickly, you need more redundancy, better spares, or smaller vdevs—or all three.
2) How many vdevs per pool?
ZFS performance scales by vdev count more than by disk count inside a single vdev for many random I/O patterns. Two 8-disk RAIDZ2 vdevs generally behave better for concurrent workloads than one 16-disk RAIDZ2 vdev, because you have two independent allocation and I/O queues at the vdev layer.
Operationally, more vdevs also mean more components and more opportunities for individual failures—but that’s not automatically worse. If each vdev is smaller, resilvers can be shorter and failure domains can be easier to reason about. The key is to keep your design consistent: identical vdevs, same ashift, same drive class, same firmware, same topology.
3) Recordsize, volblocksize, and workload shape
ZFS is not block storage in the same way as a classic RAID controller; it’s a filesystem and volume manager with a lot of knobs. The critical knob for file datasets is recordsize. Larger recordsize can improve streaming throughput and reduce metadata overhead; smaller recordsize can reduce read amplification for small random reads.
For zvols (iSCSI, VM disks), you care about volblocksize. Get this wrong and you can create write amplification that makes RAIDZ2 feel like it’s working through molasses. On RAIDZ2 specifically, a very small volblocksize also burns extra capacity on parity and padding, so usable space can land well below the naive (N-2)/N figure. For VM storage, 8K–16K volblocksize is common; for databases, you tune to the database page size and access patterns. The right answer depends on the workload; the wrong answer depends on assumptions.
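As a sketch (the dataset name is hypothetical), note that volblocksize is fixed at creation time; you can’t change it on an existing zvol, only recreate and migrate.
cr0x@server:~$ sudo zfs create -V 200G -o volblocksize=16K tank/vm/guest01-disk0
cr0x@server:~$ sudo zfs get -o name,property,value volblocksize tank/vm/guest01-disk0
NAME                   PROPERTY      VALUE
tank/vm/guest01-disk0  volblocksize  16K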
4) Compression and checksums: not optional, just managed
Compression (usually LZ4) is often a net win even when you don’t “need” the space. It can reduce I/O load, which reduces latency and speeds resilver/scrub. Checksums are ZFS’s reason for existing; don’t turn them off. If you’re worried about CPU, profile it—don’t guess.
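Measuring beats guessing here too: the compressratio property shows what compression is actually buying you on your data (values below are illustrative).
cr0x@server:~$ sudo zfs get -o name,property,value compressratio,compression tank tank/vm tank/backups
NAME          PROPERTY       VALUE
tank          compressratio  1.31x
tank          compression    lz4
tank/vm       compressratio  1.52x
tank/vm       compression    lz4
tank/backups  compressratio  1.08x
tank/backups  compression    lz4
A ratio near 1.0x on already-compressed data is normal and not a reason to panic; it just means there was nothing left to squeeze.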
5) Special vdevs, SLOG, and L2ARC: power tools, not decorations
Adding a special vdev (metadata and small blocks on SSD) can transform RAIDZ2 pools serving lots of small files or metadata-heavy workloads. It can also become a single point of pool failure if you build it without redundancy. In other words: it can be a performance savior or a very expensive outage generator.
SLOG (separate log device) helps synchronous writes only if you actually have synchronous writes. If your workload is async (common for many file shares), a SLOG does nothing. An L2ARC can help read-heavy workloads with working sets larger than RAM, but it’s not magic; it’s a cache with its own overhead.
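For reference, this is roughly what deploying those power tools looks like (device names are hypothetical): special and log vdevs mirrored because their loss matters, cache unmirrored because L2ARC contents are expendable.
cr0x@server:~$ sudo zpool add tank special mirror /dev/disk/by-id/nvme-SSD_A /dev/disk/by-id/nvme-SSD_B
cr0x@server:~$ sudo zpool add tank log mirror /dev/disk/by-id/nvme-SSD_C /dev/disk/by-id/nvme-SSD_D
cr0x@server:~$ sudo zpool add tank cache /dev/disk/by-id/nvme-SSD_E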
Joke #2 (short, relevant): A SLOG for an async workload is like installing a turbo on a shopping cart—it’s impressive, but you’re still pushing it by hand.
Performance reality: throughput, IOPS, latency
RAIDZ2 is often unfairly judged by a benchmark that doesn’t match production. It’s also sometimes deployed into workloads where mirrors would have been the right call. Let’s separate the modes.
Sequential reads and writes
If you’re pushing large blocks—backups, media, archives, replication streams—RAIDZ2 can be excellent. You get aggregate throughput from multiple disks, and parity overhead becomes a smaller fraction of the work because you’re writing full stripes more often.
Where it stumbles is when your “sequential” workload isn’t really sequential: small synchronous writes, fragmented datasets, or lots of concurrent writers can turn the pattern into something closer to random I/O. ZFS’s transaction group behavior can mask some of that, until it can’t.
Random reads
Random reads can be fine if they’re cacheable. ARC (in RAM) is your first line of defense. If your working set fits, RAIDZ2 performance is basically “how fast can the CPU and RAM serve it,” and the disks mostly idle.
If it doesn’t fit, you’re at the mercy of spindle latency. More vdevs help because you have more independent I/O paths. Wider RAIDZ vdevs don’t give you more IOPS the way adding vdevs does; they mostly give you more throughput per vdev and more capacity efficiency.
Random writes and the parity tax
Random writes on RAIDZ2 can be expensive because parity needs to be updated. ZFS does variable-width stripes and can avoid classic RAID’s worst behavior, but there’s still work to do. Small writes may require reading old data/parity and then writing new data/parity. That’s extra I/O and extra latency.
In practice, random-write-heavy workloads (VM farms with busy guests, database logs, small synchronous writes) often prefer mirrors, or RAIDZ2 plus a special vdev and careful tuning, or a hybrid approach: mirrors for hot data, RAIDZ2 for capacity tiers.
Latency: the metric that hurts careers
Throughput is easy to sell; latency is what users notice. RAIDZ2 pools under pressure can exhibit “latency cliffs” during scrubs, resilvers, or when the pool is too full. ZFS tries hard, but if your disks are saturated and your queue depths rise, latency follows.
The operator’s job is to keep enough headroom that maintenance events don’t turn into customer-facing incidents. That means: don’t run pools at 95% full and then act surprised when performance falls off a cliff. ZFS is not unique here; it’s just honest about it.
Operations that keep RAIDZ2 healthy
RAIDZ2 design gets you to “survivable.” Operations gets you to “boringly reliable.” The difference is routine.
Scrubs: controlled pain beats surprise pain
Scrubs read all data and verify checksums. They surface latent sector errors while you still have redundancy. If you skip scrubs, you’re gambling that your first discovery of bad sectors happens during a resilver—when you’re already stressed and already missing a disk.
Schedule scrubs based on pool size and workload. Monthly is common; larger, colder pools sometimes scrub less often; mission-critical pools scrub more often. The right schedule is the one that finishes reliably and doesn’t destroy your latency budget.
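One minimal way to schedule this on Linux is a root cron entry (pool name and timing assumed; many distributions ship their own scrub timers, so check before you double-schedule).
cr0x@server:~$ sudo crontab -l | grep scrub
0 2 1 * * /usr/sbin/zpool scrub tank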
Resilver behavior: sequential rebuilds are a myth now
Classic RAID rebuilds were essentially “read the whole disk.” ZFS resilver is more surgical: it rebuilds only allocated blocks. That’s great for pools that aren’t full. But in many corporate realities, pools are… full. And “allocated blocks” becomes “most of the disk.”
Plan as if resilvers will be long. Make sure you can detect and replace failing drives quickly. Keep spares or at least maintain fast procurement paths. In operations, time-to-replace is a reliability metric.
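If you keep hot spares in the chassis, attaching one is a one-liner (device name hypothetical). On Linux, the ZFS Event Daemon (zed) is what activates a spare automatically when a device faults, so make sure it’s actually running.
cr0x@server:~$ sudo zpool add tank spare /dev/disk/by-id/ata-SPAREDISK1
cr0x@server:~$ systemctl is-active zfs-zed
active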
Don’t treat SMART as a fortune teller
SMART data is useful, but it’s not a guarantee. Some disks fail loudly; others just start returning slow reads, timeouts, and weirdness that shows up as kernel messages and ZFS checksum errors. Watch error counters in zpool status and system logs. Those are the symptoms ZFS actually cares about.
Keep pools below the danger zone
As pools fill, allocation becomes harder, fragmentation increases, and write amplification rises. The result is slower performance and longer resilvers/scrubs. A common operational target is to keep pools below ~80% for mixed workloads, sometimes lower if you care about consistent latency.
This is not superstition. It’s geometry and scheduling. When free space becomes fragmented, ZFS has fewer good choices, so it makes less-good choices more often.
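One blunt but effective guardrail is a quota on the top-level dataset, sized to keep the pool out of the danger zone (the number below is illustrative; pick one that matches your pool and headroom target).
cr0x@server:~$ sudo zfs set quota=70T tank
cr0x@server:~$ sudo zfs get -o name,property,value quota tank
NAME  PROPERTY  VALUE
tank  quota     70T
Because the quota sits on the root dataset, it caps everything beneath it, which turns “the pool silently filled up” into “an application got a disk-quota error and someone filed a ticket.”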
Three corporate-world mini-stories
Mini-story 1: An incident caused by a wrong assumption
The assumption: “RAIDZ2 means we can lose two disks, so we’re safe during maintenance.” The setup was a mid-sized virtualization cluster storing VM disks on a RAIDZ2 pool. The team planned a rolling firmware upgrade for the HBA and drive backplane. They’d done it before on smaller systems and nothing exciting happened.
They upgraded one host, rebooted, and the pool came back degraded: one drive didn’t enumerate. No panic. “We can lose two drives,” someone said, and the maintenance window continued. Then the second host rebooted, and a different drive path flapped. The pool went from degraded to heavily unhappy—timeouts, then a second device fault.
At this point the math changed. RAIDZ2 can lose two devices, but it can’t lose two devices while also surviving a third device that’s “not failed, just slow and timing out,” plus a controller that’s renegotiating links. The pool didn’t instantly die, but the VM workloads did what VM workloads do under storage latency: they dogpiled the I/O queue, then the cluster started fencing nodes for “unresponsive storage.”
The postmortem wasn’t about ZFS being bad. It was about the wrong assumption: “disk loss is binary.” In real systems, failures are messy: partial connectivity, timeout storms, and “works on reboot” behavior. The operational fix wasn’t to abandon RAIDZ2; it was to stop doing multi-host storage maintenance without a staged validation plan, and to treat “degraded” as “we are one bad day away from a very long week.”
Mini-story 2: An optimization that backfired
The goal: make a RAIDZ2 pool “faster” for small files on a build farm. The team had a pool of HDDs and decided to add a special vdev on a single, very fast NVMe. The benchmarks looked great. Metadata operations flew. Developers cheered. Someone wrote “problem solved” in the ticket and closed it with confidence.
Then came the quiet part: a few months later, the NVMe started throwing media errors. Not catastrophic at first—just a few. But the special vdev was unmirrored. In ZFS, if a special vdev holds metadata and small blocks, it is not optional: lose it and you can lose the pool, because you can’t find your data without the metadata. The pool went from “fast” to “existential crisis” in a single on-call shift.
The recovery was painful: they had replication, but not recent enough to make everyone happy. Some artifacts were rebuilt; some were re-downloaded; some were re-generated. The root cause wasn’t “NVMe is unreliable.” The root cause was treating a special vdev like a cache. It isn’t a cache; it’s a tier.
The fix was straightforward and boring: mirror the special vdev, monitor it like it’s a first-class citizen, and size it properly. Performance came back, and so did sleep. The lesson: when you add a power tool to RAIDZ2, read the safety label.
Mini-story 3: A boring but correct practice that saved the day
The practice: monthly scrubs and alerting on checksum errors, plus a standing rule that “any checksum error is a ticket.” It wasn’t glamorous. It didn’t get applause in quarterly planning. It was the kind of discipline people call “paranoid” until they call it “leadership.”
One week, an alert fired: a handful of checksum errors on one disk during a scrub. The pool was otherwise healthy. No one was screaming. The storage graphs didn’t look dramatic. But the ticket got attention because it always did. The on-call replaced the disk the next day during business hours. The resilver completed cleanly.
Two weeks later, another disk in the same vdev failed hard. In a different timeline—one where scrubs were “we’ll do it later”—that first disk would still have been in service with latent errors. The second failure would have turned the resilver into a roulette wheel: read errors during reconstruction, data loss, and an executive meeting where everyone pretends they always wanted better backups.
Instead, RAIDZ2 did what it’s supposed to do: absorb failure without drama. The day was saved not by heroics, but by a calendar reminder and a cultural habit of treating “small” signals as real.
Hands-on tasks (commands + interpretation)
These are practical tasks you can run on a typical ZFS-on-Linux or illumos/FreeBSD system. Adjust device names and pool names. Output examples are illustrative; your system will differ.
Task 1: Inspect pool health and error counters
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 05:12:44 with 0 errors on Sun Dec 15 03:12:44 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-WDC_WD120...-part1 ONLINE 0 0 0
ata-WDC_WD120...-part1 ONLINE 0 0 0
ata-WDC_WD120...-part1 ONLINE 0 0 0
ata-WDC_WD120...-part1 ONLINE 0 0 0
ata-WDC_WD120...-part1 ONLINE 0 0 0
ata-WDC_WD120...-part1 ONLINE 0 0 0
ata-WDC_WD120...-part1 ONLINE 0 0 0
ata-WDC_WD120...-part1 ONLINE 0 0 0
errors: No known data errors
Interpretation: You care about READ/WRITE/CKSUM counters and the scan line. Any non-zero CKSUM errors deserve investigation; they often indicate cabling/HBA issues as much as disks.
Task 2: See vdev layout and ashift
cr0x@server:~$ sudo zdb -C tank | sed -n '1,120p'
MOS Configuration:
vdev_children: 1
version: 5000
pool_guid: 1234567890123456789
vdev_tree:
type: 'root'
id: 0
guid: 1111111111111111111
children[0]:
type: 'raidz'
id: 0
guid: 2222222222222222222
nparity: 2
ashift: 12
Interpretation: nparity: 2 confirms RAIDZ2. ashift: 12 means 4K sectors. If you accidentally created ashift: 9 on modern drives, you’re carrying a performance and wear penalty you can’t “tune away.”
Task 3: Check dataset properties that affect RAIDZ2 behavior
cr0x@server:~$ sudo zfs get -o name,property,value,source -s local,default recordsize,compression,atime,sync,primarycache tank
NAME PROPERTY VALUE SOURCE
tank recordsize 128K default
tank compression lz4 local
tank atime off local
tank sync standard default
tank primarycache all default
Interpretation: compression=lz4 is usually good. atime=off avoids extra writes for read-heavy datasets. Don’t set sync=disabled casually; that’s not tuning, it’s gambling with power failures.
Task 4: Measure real-time pool I/O and spot a scrub/resilver impact
cr0x@server:~$ sudo zpool iostat -v tank 1 5
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 72.1T 14.2T 210 480 48.2M 96.1M
raidz2-0 72.1T 14.2T 210 480 48.2M 96.1M
ata-WDC_WD120... - - 28 60 6.3M 12.1M
ata-WDC_WD120... - - 25 58 6.0M 11.5M
ata-WDC_WD120... - - 31 61 6.8M 12.4M
-------------------------- ----- ----- ----- ----- ----- -----
Interpretation: Look for one disk doing significantly more or less work than its peers, or showing high latency (zpool iostat -l, iostat -x, and smartctl help here). During scrub, reads rise; if writes spike too, you may be thrashing metadata or dealing with a busy workload.
Task 5: Start a scrub and verify it’s progressing
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank | sed -n '1,12p'
pool: tank
state: ONLINE
scan: scrub in progress since Tue Dec 23 01:05:11 2025
9.34T scanned at 1.12G/s, 3.02T issued at 371M/s, 72.1T total
0B repaired, 4.19% done, 2 days 03:14:00 to go
Interpretation: “Scanned” is how fast ZFS is walking metadata and sorting block pointers; “issued” is the verification I/O actually sent to disks. A large gap between the two early in a scrub is normal. If issued bandwidth stays low while the disks are busy, the disks are the bottleneck. If ETA is absurdly long, investigate disk health and pool contention.
Task 6: Throttle resilver/scrub impact (Linux/OpenZFS tunables)
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_resilver_min_time_ms
3000
cr0x@server:~$ echo 1500 | sudo tee /sys/module/zfs/parameters/zfs_resilver_min_time_ms
1500
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
2
Interpretation: Tunable names vary by OpenZFS version: older releases exposed zfs_resilver_delay and zfs_scrub_delay; current ones use zfs_resilver_min_time_ms, zfs_scrub_min_time_ms, and the zfs_vdev_scrub_* queue limits. Lowering the min-time values or the scrub queue depth reduces impact on production latency at the cost of longer resilver/scrub times. Do this deliberately: if you extend your exposure window too much, you may trade performance for risk.
Task 7: Find datasets consuming space and detect “pool is full” risk
cr0x@server:~$ sudo zfs list -o name,used,avail,refer,mountpoint -S used | head
NAME USED AVAIL REFER MOUNTPOINT
tank 65.5T 14.2T 112K /tank
tank/vm 38.2T 14.2T 38.2T /tank/vm
tank/backups 19.4T 14.2T 19.4T /tank/backups
tank/home 7.9T 14.2T 7.9T /tank/home
Interpretation: When AVAIL gets tight, performance and maintenance suffer. Consider quotas, reservations, or capacity expansion before you hit the “everything is slow” phase.
Task 8: Check fragmentation and why “it’s slower than last year” might be true
cr0x@server:~$ sudo zpool list -o name,size,alloc,free,cap,frag,health tank
NAME SIZE ALLOC FREE CAP FRAG HEALTH
tank 86.3T 72.1T 14.2T 83% 52% ONLINE
Interpretation: High fragmentation plus high capacity is a bad combination for latency. This is where “keep it under 80%” stops sounding like folklore.
Task 9: Replace a failing disk the safe way (offline/replace)
cr0x@server:~$ sudo zpool offline tank ata-WDC_WD120...-part1
cr0x@server:~$ sudo zpool replace tank ata-WDC_WD120...-part1 /dev/disk/by-id/ata-WDC_WD120_NEWDRIVE-part1
cr0x@server:~$ sudo zpool status tank | sed -n '1,20p'
pool: tank
state: DEGRADED
scan: resilver in progress since Tue Dec 23 02:21:02 2025
1.18T scanned at 622M/s, 140G issued at 74.1M/s, 72.1T total
0B resilvered, 0.19% done, 5 days 06:10:00 to go
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
replacing-0 DEGRADED 0 0 0
ata-WDC_WD120...-part1 OFFLINE 0 0 0
ata-WDC_WD120_NEWDRIVE-part1 ONLINE 0 0 0
Interpretation: Watch resilver progress and error counters. If resilver ETA is huge, confirm the pool isn’t saturated by workload and the new disk is healthy and performing.
Task 10: Identify slow disks and transport issues with SMART
cr0x@server:~$ sudo smartctl -a /dev/sdc | sed -n '1,60p'
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0] (local build)
=== START OF INFORMATION SECTION ===
Device Model: WDC WD120...
Serial Number: XXXXX
User Capacity: 12,000,000,000,000 bytes
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 2
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 18
Interpretation: Pending sectors and CRC errors matter. CRC errors often indicate cabling/backplane issues, not “bad disk.” In RAIDZ2 incidents, half the battle is distinguishing media failure from transport failure.
Task 11: Check ARC efficiency and whether “add RAM” is the right answer
cr0x@server:~$ sudo arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
02:10:01 812 122 15 92 75 30 25 0 0 124G 128G
02:10:02 790 118 14 88 75 30 25 0 0 124G 128G
02:10:03 835 140 17 101 72 39 28 0 0 124G 128G
Interpretation: A low miss% suggests caching is working. A high miss% during a read-heavy workload suggests disks are being hit. If ARC is capped (arcsz near c) and misses are high, more RAM might help—if the working set is cacheable.
Task 12: Confirm whether you really have synchronous writes
cr0x@server:~$ sudo zfs get -o name,property,value sync tank/vm
NAME PROPERTY VALUE
tank/vm sync standard
Interpretation: standard means “honor application requests.” If your workload is NFS with sync semantics or databases with fsync, a SLOG might matter. If your workload rarely issues sync writes, a SLOG is mostly decoration.
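If you want more than the property, one rough way to see whether sync writes are actually happening on Linux is to watch the ZIL kstat counters move under load (kstat names vary somewhat across OpenZFS versions; values below are illustrative).
cr0x@server:~$ grep -E 'zil_commit_count|zil_itx_metaslab' /proc/spl/kstat/zfs/zil
zil_commit_count                4    129348
zil_itx_metaslab_normal_count   4    88231
zil_itx_metaslab_slog_count     4    0
If zil_commit_count barely moves while the application is busy, you don’t have a meaningful sync write load and a SLOG won’t help; the slog counter only starts moving once a log device exists and is being used.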
Task 13: Use fio to sanity-check behavior (carefully, off-peak)
cr0x@server:~$ sudo fio --name=randwrite --directory=/tank/test --size=8G --bs=4k --rw=randwrite --iodepth=16 --numjobs=4 --direct=1 --runtime=60 --time_based
...
write: IOPS=820, BW=3.2MiB/s (3.4MB/s)(192MiB/60001msec)
lat (usec): min=420, max=120000, avg=18500.12, stdev=9200.55
Interpretation: Random 4K writes on HDD RAIDZ2 can look ugly; that’s not necessarily misconfiguration. Use this to validate expectations, not to win internet arguments.
Task 14: Verify autoexpand status and device expansion posture
cr0x@server:~$ sudo zpool get autoexpand tank
NAME PROPERTY VALUE SOURCE
tank autoexpand off default
Interpretation: autoexpand affects whether a pool grows automatically when underlying devices grow. It won’t save you from poor planning, but it prevents “we replaced all drives and the pool didn’t get bigger” surprises.
Fast diagnosis playbook
When RAIDZ2 performance drops or errors appear, you don’t have time for a philosophical debate. You need a tight loop: confirm symptoms, isolate the layer, and decide whether you’re in “fix now” or “monitor” mode.
First: establish whether this is a health event or a performance event
- Check pool health: zpool status -v. Any DEGRADED/FAULTED state, resilver in progress, or checksum errors changes priorities immediately.
- Check for active scrub/resilver: the scan line in zpool status. If yes, expect reduced performance; decide whether to throttle or reschedule.
- Check space and fragmentation: zpool list -o cap,frag and zfs list. A full pool can masquerade as “hardware got slower.” A compact snapshot combining these checks follows this list.
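Here is one way to capture those three checks in a single pass before you start changing anything (pool name assumed; output illustrative):
cr0x@server:~$ sudo zpool status -x && sudo zpool list -o name,cap,frag,health tank
all pools are healthy
NAME  CAP  FRAG  HEALTH
tank  83%   52%  ONLINE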
Second: identify the bottleneck layer (CPU, RAM, disks, or network)
- Disk/vdev pressure: zpool iostat -v 1. Look for uneven per-disk bandwidth or a vdev pinned near its limit.
- System I/O wait and queueing: iostat -x 1 and vmstat 1. High await and utilization near 100% suggest the disks are saturated or a disk is slow.
- ARC/cache behavior: arcstat (or equivalent). High misses during read-heavy load mean disks are doing real work.
- Network (if applicable): nfsstat, ss -s, NIC counters. Storage “slowness” is sometimes retransmits and bufferbloat wearing a disk costume.
Third: determine whether you have a single bad actor
- Look for checksum/read/write errors per device: zpool status.
- Check SMART for pending sectors or CRC errors: smartctl -a. CRC errors scream “cable/backplane”; pending sectors scream “media.”
- Check kernel logs: dmesg -T or journalctl -k for link resets, timeouts, and SCSI errors.
Decision point: what to do right now
- If pool health is compromised: stop risky changes, reduce load if possible, and plan a disk replacement or transport fix. Performance tuning is not the priority.
- If pool is full/fragmented: free space, add capacity, or migrate data. You can’t tune your way out of geometry.
- If one disk is slow: treat it like a failure in progress. RAIDZ2 handles dead disks better than half-dead ones.
- If it’s workload mismatch: consider mirrors for hot tiers or add vdevs to increase parallelism.
Common mistakes: symptoms and fixes
Mistake 1: Oversized RAIDZ2 vdevs because “capacity efficiency”
Symptoms: resilvers take forever; scrubs regularly run into business hours; performance collapses during maintenance; more frequent multi-disk stress events.
Fix: prefer more vdevs of moderate width rather than one very wide vdev; keep spare capacity; plan expansion by adding another RAIDZ2 vdev rather than widening existing ones (unless you have a verified expansion feature and a tested process).
Mistake 2: Treating special vdev as a cache and leaving it unmirrored
Symptoms: everything is fast until it’s suddenly not; a special device error threatens the whole pool; terrifying pool import behavior after SSD issues.
Fix: mirror special vdev devices; monitor them; size them for metadata growth; treat them as critical storage, not optional acceleration.
Mistake 3: Running too close to 100% and calling it “efficient”
Symptoms: rising latency, sluggish deletions, long transaction group sync times, scrubs/resilvers slow down dramatically.
Fix: enforce quotas, archive old snapshots, add capacity earlier, and keep operational headroom. If you need 90%+ utilization, design a tier for it and accept degraded performance.
Mistake 4: Wrong ashift, discovered after go-live
Symptoms: inexplicably low write performance; higher disk utilization; SSD wear concerns; “it was faster in staging” confusion.
Fix: ashift can’t be changed in place for existing vdevs. The real fix is rebuild/migrate: replicate to a correctly created pool and cut over.
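A minimal sketch of that migration, assuming a correctly created destination pool named newtank and a window where you can freeze writers for the final pass:
cr0x@server:~$ sudo zfs snapshot -r tank@migrate1
cr0x@server:~$ sudo zfs send -R tank@migrate1 | sudo zfs receive -Fu newtank
When you’re ready to cut over, stop writers, take a final recursive snapshot, and send only the increment:
cr0x@server:~$ sudo zfs snapshot -r tank@migrate2
cr0x@server:~$ sudo zfs send -R -i tank@migrate1 tank@migrate2 | sudo zfs receive -Fu newtank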
Mistake 5: Disabling sync to “fix” latency
Symptoms: write latency improves; later, after a crash or power event, applications report corruption, missing transactions, or inconsistent VM disks.
Fix: leave sync=standard unless you fully understand and accept the durability trade. If sync writes are the bottleneck, consider a proper SLOG on power-loss-protected SSDs, or move sync-heavy workloads to mirrors.
Mistake 6: Ignoring checksum errors because “it’s still ONLINE”
Symptoms: occasional CKSUM increments; intermittent client errors; later, a cascade of device faults during a scrub/resilver.
Fix: investigate early. Check cabling, HBA firmware, backplane, and SMART. Replace suspect components before the pool is under reconstruction stress.
Checklists / step-by-step plan
Planning checklist: choosing RAIDZ2 geometry
- Define the workload: mostly sequential, mostly random, sync-heavy, metadata-heavy, mixed?
- Pick a target utilization ceiling (e.g., 75–80% for mixed latency-sensitive workloads).
- Choose vdev width based on rebuild tolerance and risk posture (often 6–10 disks per RAIDZ2 vdev).
- Choose number of vdevs to meet IOPS/parallelism needs.
- Decide on special vdev (mirrored) if metadata/small-file heavy; decide on SLOG only if sync writes are real.
- Set ashift correctly at creation for the actual media (assume 4K at minimum; consider 8K/16K for some SSDs where appropriate).
- Set sane defaults: compression=lz4, atime=off where appropriate, tuned recordsize/volblocksize.
- Plan scrubs, alerting, and a tested drive replacement runbook.
Build plan: create a RAIDZ2 pool (example)
Example only—verify device IDs and partitioning. Use stable by-id paths.
cr0x@server:~$ ls -l /dev/disk/by-id/ | egrep 'ata-|nvme-' | head
cr0x@server:~$ sudo zpool create -o ashift=12 tank raidz2 \
/dev/disk/by-id/ata-DISK1-part1 \
/dev/disk/by-id/ata-DISK2-part1 \
/dev/disk/by-id/ata-DISK3-part1 \
/dev/disk/by-id/ata-DISK4-part1 \
/dev/disk/by-id/ata-DISK5-part1 \
/dev/disk/by-id/ata-DISK6-part1 \
/dev/disk/by-id/ata-DISK7-part1 \
/dev/disk/by-id/ata-DISK8-part1
Interpretation: This creates one 8-wide RAIDZ2 vdev. For more performance, you typically add another similar vdev later, not widen the existing one.
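Right after creation is the cheapest moment to apply the defaults you decided on during planning (a sketch, not a mandate; adjust per dataset later as needed):
cr0x@server:~$ sudo zfs set compression=lz4 tank
cr0x@server:~$ sudo zfs set atime=off tank
cr0x@server:~$ sudo zpool set autoexpand=on tank
The autoexpand setting is optional; turn it on only if you want the pool to grow automatically when you later swap in larger drives.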
Operational plan: monthly scrub and alerting workflow
- Schedule: run zpool scrub off-peak.
- Monitor: alert on any non-zero READ/WRITE/CKSUM changes and on scrub completion with errors.
- On error: open a ticket, capture zpool status -v, kernel logs, and SMART data.
- Remediate: fix transport first (CRC/link resets), then replace disks with pending/reallocated sectors or repeated timeouts.
- Verify: re-scrub if needed and confirm error counters stabilize.
Capacity expansion plan: add a new RAIDZ2 vdev
The normal ZFS growth pattern is adding vdevs. You preserve your existing vdev geometry and add parallelism.
cr0x@server:~$ sudo zpool add tank raidz2 \
/dev/disk/by-id/ata-NEWDISK1-part1 \
/dev/disk/by-id/ata-NEWDISK2-part1 \
/dev/disk/by-id/ata-NEWDISK3-part1 \
/dev/disk/by-id/ata-NEWDISK4-part1 \
/dev/disk/by-id/ata-NEWDISK5-part1 \
/dev/disk/by-id/ata-NEWDISK6-part1 \
/dev/disk/by-id/ata-NEWDISK7-part1 \
/dev/disk/by-id/ata-NEWDISK8-part1
cr0x@server:~$ sudo zpool status tank
Interpretation: Now you have two RAIDZ2 vdevs. Random I/O generally improves because allocations can spread across vdevs.
FAQ
1) Is RAIDZ2 “enough” for large drives (16–24TB)?
Often, yes—if you keep vdev widths reasonable, scrub regularly, and replace suspect disks quickly. With very large drives and very wide vdevs, RAIDZ3 becomes more attractive. The real question is operational: how long is your exposure window during resilver, and how confident are you in your ability to avoid a third failure mode (timeouts, UREs, transport resets) during that window?
2) RAIDZ2 vs mirrors for VM storage: which should I choose?
If the VM workload is random-write heavy and latency-sensitive, mirrors usually win. RAIDZ2 can work for VMs, but it’s less forgiving and may need more vdevs, SSD special vdevs, careful volblocksize tuning, and enough RAM. If you want the simple, predictable answer: mirrors for hot VM tiers; RAIDZ2 for capacity or colder tiers.
3) What’s a good RAIDZ2 width?
For mixed workloads on HDDs, 6–10 disks per RAIDZ2 vdev is a common practical range. Narrower tends to rebuild faster and behave better under random I/O; wider improves capacity efficiency and sequential throughput. If you don’t know your workload well, don’t build a 14-wide RAIDZ2 and hope monitoring will save you.
4) Do I need a SLOG with RAIDZ2?
Only if you have significant synchronous writes. Many file-sharing workloads are mostly async; a SLOG won’t help. For NFS with sync semantics, databases with fsync, or VM storage with sync writes, a power-loss-protected SSD SLOG can reduce latency. But it must be the right device class; consumer SSDs without PLP can make things worse, not better.
5) Does LZ4 compression slow things down?
Usually no, and often it speeds things up. LZ4 is fast, and by writing fewer bytes you reduce disk work—especially helpful on HDD RAIDZ2. There are corner cases (already-compressed media, some CPU-constrained systems), but the default posture in production is “enable LZ4 unless you have a measured reason not to.”
6) Why do I see checksum errors but SMART says PASSED?
Because SMART “PASSED” is a coarse self-assessment, not a guarantee. Checksum errors often point to transport issues (bad cables, flaky backplane, HBA problems) as well as media. Treat repeated checksum errors as real: capture zpool status, check UDMA_CRC_Error_Count, and read kernel logs for link resets.
7) Can I expand an existing RAIDZ2 vdev by adding a disk?
Historically, RAIDZ expansion wasn’t available in many deployments; newer OpenZFS has introduced RAIDZ expansion features, but operational maturity varies by platform and version. In many production environments, the conservative and widely used method remains: add a new vdev (another RAIDZ2 group) to the pool. If you plan to use expansion, test it in a staging environment that matches production, and plan a rollback story.
8) How full is “too full” for a RAIDZ2 pool?
It depends, but for mixed workloads where latency matters, treat ~80% as a yellow line, not a target. If you run higher, expect more fragmentation and worse tail latency, plus slower scrubs/resilvers. For cold archival pools with mostly sequential access, you can push higher, but you’re trading operational comfort for capacity utilization.
9) What’s the biggest operational risk with RAIDZ2?
Not the second disk failure—that’s the one you planned for. The biggest risks are extended exposure windows during resilver, “half-failures” (timeouts and transport resets), and running pools too full while assuming parity will save you from everything. RAIDZ2 buys time; it doesn’t buy immunity.
Conclusion
RAIDZ2 is popular in production for the same reason a good pager rotation is popular: it acknowledges reality. Disks fail. They also misbehave, slow down, and occasionally gaslight your monitoring until the worst possible moment. RAIDZ2 doesn’t prevent drama, but it gives you room to respond without turning one bad disk into a data-loss headline.
The “sweet spot” comes with conditions. Keep vdev widths sane. Add vdevs for performance scaling. Scrub on schedule. Treat checksum errors as signals, not trivia. Don’t run the pool to the brim and then blame ZFS for physics. Do that, and RAIDZ2 becomes what it’s supposed to be: unexciting storage that keeps showing up for work.