ZFS Design: Mirrors vs Parity — The Real Cost of “More Capacity”

The request usually shows up in a ticket with a cheery subject line: “Need more space.”
The subtext is always the same: someone noticed a pool at 78% and decided the storage team is an obstacle to progress.

If you run ZFS in production, “more capacity” is not a neutral change. It’s a decision that rewires failure domains,
changes rebuild behavior, shifts latency tails, and alters how your worst day will feel. Mirrors and parity both work.
But they fail differently. And the bill for “more usable TB” is often paid in time, not money.

The actual tradeoff: latency, rebuilds, and human time

Mirrors and RAIDZ parity are often pitched as a simple slider: “mirrors are faster, RAIDZ is more space-efficient.”
That’s true in the way “winter is cold” is true. It’s not wrong; it’s just missing everything that hurts.

The real tradeoff is how much risk and operational complexity you’re buying with every extra terabyte of usable space.
Mirrors buy you predictable performance and simpler failure handling. RAIDZ buys you capacity efficiency but charges interest
in resilver behavior, parity write amplification, and time-to-sleep-again when a disk drops during peak hours.

When someone says “we’re leaving capacity on the table with mirrors,” translate that to:
“We want to reduce upfront spend by increasing the amount of work we might have to do later,
under stress, while the business is watching.”

Dry rule of thumb: if the workload is random I/O heavy (VMs, databases, mixed containers, CI runners), mirrors are the default.
If the workload is large sequential I/O (backups, media, archive, log retention), RAIDZ2 is often the default.
RAIDZ1 is for people who enjoy writing postmortems about “unexpected second failure.”
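The rule of thumb translates directly into pool layouts. A minimal sketch; the pool names and wwn-… device paths are placeholders, not real devices:

```shell
# Sketch only: pool names and wwn-... device paths are placeholders.
# Mirror pool for random-I/O tiers (VMs, databases, CI runners):
zpool create -o ashift=12 fastpool \
  mirror /dev/disk/by-id/wwn-0xAAAA0001 /dev/disk/by-id/wwn-0xAAAA0002 \
  mirror /dev/disk/by-id/wwn-0xAAAA0003 /dev/disk/by-id/wwn-0xAAAA0004

# RAIDZ2 pool for large sequential tiers (backups, media, archive):
zpool create -o ashift=12 coldpool raidz2 \
  /dev/disk/by-id/wwn-0xBBBB0001 /dev/disk/by-id/wwn-0xBBBB0002 \
  /dev/disk/by-id/wwn-0xBBBB0003 /dev/disk/by-id/wwn-0xBBBB0004 \
  /dev/disk/by-id/wwn-0xBBBB0005 /dev/disk/by-id/wwn-0xBBBB0006
```

Note that growth paths differ: the mirror pool grows two disks at a time, the RAIDZ2 pool grows a whole vdev at a time.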

Facts & history that explain today’s pain

  • ZFS was born at Sun (mid-2000s) with end-to-end checksumming as a response to silent data corruption, not as a quest for maximum usable capacity.
  • RAID5’s reputation suffered as drive sizes grew faster than rebuild speeds; long rebuild windows made a second failure statistically more likely at the worst time.
  • “Unrecoverable read error” rates on commodity disks were a big deal in the early SATA era; parity rebuilds read a lot of data, which is exactly when UREs show up.
  • ZFS “scrub” became mainstream because it catches latent sector errors before a resilver forces you to read everything under duress.
  • 4K sector drives (and the move away from 512-byte sectors) made ashift a permanent design choice: pick wrong and you pay forever in write amplification.
  • Copy-on-write (CoW) means ZFS never overwrites live blocks in-place; great for integrity, but it changes the fragmentation and small-write story.
  • RAIDZ was designed for ZFS (not bolted on) to avoid the classic write-hole problem found in traditional parity RAID without a reliable journal.
  • OpenZFS features diverged across platforms for years; what your colleague swears worked on one distro might behave differently on another release.
  • SSD adoption shifted the pain from “throughput” to “tail latency,” and mirrors tend to behave better when you care about p99.9 more than benchmarks.

How ZFS thinks: vdevs, pool behavior, and why layouts aren’t “just RAID”

ZFS doesn’t do “RAID across disks” the way a hardware controller does. ZFS does redundancy at the vdev layer.
Your pool is a stripe across vdevs. A vdev is the unit of failure.

That one sentence explains most production surprises:

  • If you lose a vdev, you lose the pool. A pool is only as reliable as its least reliable vdev.
  • Adding a vdev adds capacity and performance, but it also adds another failure domain.
  • You can’t “convert” a RAIDZ vdev to a mirror vdev later without migration. (Yes, there are expansion features evolving, but plan like you still have to migrate.)

Mirrors and RAIDZ are different kinds of vdevs. Mirrors replicate blocks. RAIDZ encodes parity across a stripe.
ZFS spreads allocations across top-level vdevs, so if you mix vdev types, you mix behavior. That’s not “flexibility,”
it’s “your performance graph will look like modern art.”

Mirrors in production: the good, the bad, and the predictable

What mirrors buy you

Mirrors are boring. This is praise.

For random reads, mirrors are usually excellent because ZFS can service reads from either side, and it can be smart about queue depth.
For random writes, you write twice—but the I/O pattern stays simple, and failure behavior stays straightforward.

Resilvering a mirror tends to be gentler, especially when you replace with a larger disk or when the pool isn’t full.
ZFS can resilver only allocated blocks (sequential-ish) instead of a full disk rebuild, depending on platform and feature flags.
Even when it has to work hard, mirror resilvers are typically simpler to reason about.

Where mirrors bite

The cost is capacity efficiency. A two-way mirror is 50% usable (before slop, metadata, and padding).
A three-way mirror is 33%. You pay that every day, whether you need performance or not.
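The math is worth writing down once, because it anchors the whole argument. A back-of-envelope sketch for a hypothetical 12 x 18 TB shelf; it ignores metadata, slop space, and padding, so real usable is lower:

```shell
# Back-of-envelope usable capacity for a hypothetical 12 x 18 TB shelf.
# Ignores metadata, slop space, and padding -- real usable is lower.
disks=12; size_tb=18
mirror2_usable=$(( disks / 2 * size_tb ))       # 6 x 2-way mirrors: 50% efficiency
mirror3_usable=$(( disks / 3 * size_tb ))       # 4 x 3-way mirrors: 33% efficiency
raidz2_usable=$(( (disks - 2 * 2) * size_tb ))  # 2 x 6-wide RAIDZ2: 2 parity disks per vdev
echo "2-way mirrors: ${mirror2_usable} TB"
echo "3-way mirrors: ${mirror3_usable} TB"
echo "2x6 RAIDZ2:    ${raidz2_usable} TB"
```

That 108 TB vs 144 TB gap is the spreadsheet argument for RAIDZ2. The rest of this article is about what the spreadsheet leaves out.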

Mirrors also encourage a specific kind of overconfidence: “It’s mirrored, so it’s safe.”
It’s safer than a single disk. It’s not a backup. It doesn’t protect against “rm -rf”, ransomware, or an application
happily writing garbage with valid checksums.

RAIDZ parity in production: capacity wins, operational debt, and when it’s still right

What RAIDZ buys you

RAIDZ1/2/3 buys you better usable capacity per disk. That’s the headline, and it matters.
For large sequential workloads, RAIDZ can be perfectly fine, and often cost-effective.

RAIDZ2 is the practical baseline for large HDD pools. The cost difference between RAIDZ1 and RAIDZ2 is one disk of parity per vdev.
The difference in “how likely you are to lose the vdev during rebuild stress” is larger than that.

Where parity bites

RAIDZ has a structural problem with small random writes: parity math plus read-modify-write behavior can turn one small logical write
into multiple physical operations. ZFS can mitigate some of this with recordsize choices, workload alignment, and enough CPU,
but you can’t negotiate with physics.

Rebuild behavior is where parity layouts cash their checks. A RAIDZ resilver needs to reconstruct data from surviving disks,
which means reading a lot. That can drag out for days on big HDDs, and during that window your pool is stressed and your margin is thin.

Joke #1: RAIDZ1 on large spinning disks is like skydiving with a spare parachute you left in the car. Technically you own it.

Capacity math that doesn’t lie (and the parts it hides)

Capacity math is the easiest part, which is why it gets abused. People do quick arithmetic and declare victory.
The trap is that ZFS has overheads and behaviors that make “usable” different from “safe to run at.”

Raw vs usable vs sane

  • Raw capacity: sum of disk sizes.
  • Usable capacity: raw minus parity/mirror redundancy, minus formatting/ashift realities, minus ZFS metadata.
  • Sane operating capacity: usable minus the space you refuse to fill because performance and fragmentation get nasty when you do.

If you run pools at 90–95% full and then act surprised by latency spikes, you’re not unlucky. You’re just doing the thing that causes it.
ZFS needs space to find good allocation targets; it’s not alone—most CoW systems behave similarly.
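One way to make “sane operating capacity” real is to alert before you reach it. A minimal sketch; the pool name, the threshold, and how you deliver the warning are assumptions to wire into your own monitoring:

```shell
# Fullness alert sketch. Pool name, threshold, and delivery mechanism are
# assumptions -- wire this into your monitoring, not just cron + echo.
check_pool_fullness() {
  local pool=$1 threshold=$2 cap
  # zpool list -H: scripted mode (no header); capacity prints like "78%"
  cap=$(zpool list -H -o capacity "$pool" | tr -d '%')
  if [ "$cap" -ge "$threshold" ]; then
    echo "WARN: $pool at ${cap}% (threshold ${threshold}%)"
  fi
}
```

Calling `check_pool_fullness tank 80` from a timer gives you the “start planning” nudge before the allocator starts hurting.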

Why “one big RAIDZ vdev” is seductive

A wide RAIDZ vdev looks great on a spreadsheet: more data disks per parity disk.
Then you realize a wide vdev resilver reads from more disks, takes longer, and each disk is a point of failure during that long window.
The ops cost isn’t linear; it’s more like compound interest.

Performance reality: IOPS, throughput, and tail latency

Storage decisions fail when you optimize the wrong metric. People buy “more TB” when they needed “less latency.”
Or they buy “more IOPS” when they needed “more throughput.” Mirrors vs parity is mostly a story about latency and IOPS.

IOPS: mirrors usually win where it hurts

A mirror vdev can service reads from either disk, so read IOPS scale better than parity for random workloads.
Write IOPS on a mirror are roughly like a single disk (you write to both), but the behavior stays clean.

RAIDZ random write IOPS are constrained by parity operations and stripe geometry. If you have a lot of small, sync writes
(databases, NFS for VMs, iSCSI), parity can turn into a latency factory.

Throughput: parity can be great (when the workload matches)

For large sequential reads/writes, RAIDZ can deliver excellent throughput, especially if your recordsize matches your workload
and you aren’t forcing sync semantics unnecessarily.
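Matching recordsize to a sequential workload is a one-line change. It affects newly written blocks only; `tank/backups` here is the example dataset used elsewhere in this article:

```shell
# recordsize affects newly written blocks only; existing data keeps its size.
# tank/backups is the example dataset used elsewhere in this article.
zfs set recordsize=1M tank/backups
```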

Tail latency: the graph that matters when users complain

Average latency makes dashboards look calm. Tail latency makes incident channels loud.
Mirrors tend to have fewer nasty surprises under mixed workloads because there’s less parity reconstruction and fewer “must-touch-many-disks”
operations. RAIDZ tends to show heavier tails when the pool is busy, fragmented, or resilvering.

Paraphrased quote, commonly cited in ops contexts and often attributed to General James Mattis: “Hope is not a strategy.”

Resilvering and scrubs: where theory meets the pager

Scrubs and resilvers are not background chores. They are the moments your redundancy design is tested under load.
If your design assumes scrubs will always complete quickly, or that resilvers won’t collide with peak traffic, you’re designing a lab, not a service.

Scrub: proactive pain to avoid reactive disaster

A scrub reads all data and verifies checksums, repairing from redundancy when possible. It finds latent errors before a disk failure forces
a rebuild that reads everything anyway, but under worse conditions.

Scrubs cost I/O. The right response is to schedule them and monitor them, not to disable them because they “impact performance.”
If your system can’t tolerate a scrub, it can’t tolerate a resilver either; it just hasn’t admitted it yet.
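Scheduling is the whole trick. A cron sketch; the path, pool name, and schedule are assumptions, and some distros instead ship systemd timers (e.g. `zfs-scrub-monthly@.timer`) that you should prefer if present:

```shell
# /etc/cron.d/zfs-scrub -- schedule, path, and pool name are assumptions.
# Prefer your distro's systemd timers (e.g. zfs-scrub-monthly@.timer) if shipped.
0 2 1 * * root /usr/sbin/zpool scrub tank
```

Whatever mechanism you pick, alert on scrub duration and errors, not just on “did it run.”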

Resilver: where redundancy gets expensive

Mirror resilvering is often a mostly-sequential copy of allocated blocks to the replacement disk. RAIDZ resilvering reconstructs stripes.
In both cases, the pool is doing extra work while degraded. This is exactly when you don’t want surprises.

The biggest operational risk is not “a disk failed.” Disks fail. The risk is “a disk failed and the rebuild window is long enough
that something else fails too.” Mirrors and RAIDZ2 reduce that risk in different ways; RAIDZ1 increases it.

Joke #2: The fastest way to learn your resilver strategy is inadequate is to name it “we’ll deal with it later.” Later is very punctual.

Practical tasks: commands, outputs, and decisions

These are tasks you can actually run on a ZFS host. Each one includes: a command, what typical output means, and what decision you make.
Hostname and pool names are examples; adapt carefully.

Task 1: Identify pool layout and vdev types

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 03:12:44 with 0 errors on Sun Feb  4 02:10:31 2026
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000c500a1b2c3d4  ONLINE       0     0     0
            wwn-0x5000c500a1b2c3d5  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            wwn-0x5000c500a1b2c3d6  ONLINE       0     0     0
            wwn-0x5000c500a1b2c3d7  ONLINE       0     0     0

errors: No known data errors

What it means: Top-level vdevs are mirrors; pool will survive one disk failure per mirror vdev.
Read/write/checksum counters are clean.

Decision: If this pool backs VMs or databases, you’re probably already on the right track. If it’s an archive pool,
you may be paying too much in capacity. Decide based on workload, not ideology.

Task 2: Confirm ashift (sector alignment) before you expand or replace disks

cr0x@server:~$ zdb -C tank | grep -i ashift -n
34:            ashift: 12

What it means: ashift=12 means 4K sectors. Good for modern drives.

Decision: If you see ashift=9 on 4K drives, plan a migration. Don’t “fix it later”; you can’t fix ashift in place.

Task 3: Check pool free space and fragmentation indicators

cr0x@server:~$ zpool list -o name,size,alloc,free,frag,health
NAME   SIZE  ALLOC   FREE  FRAG  HEALTH
tank  21.8T  16.9T  4.90T   28%  ONLINE

What it means: 28% fragmentation is not catastrophic, but it’s trending toward “random write latency will get spicy.”

Decision: If you’re planning to push past ~80% allocation on a busy pool, budget time for performance testing or expansion.
“It still has free TB” is not a performance plan.

Task 4: Find the datasets that are actually consuming space

cr0x@server:~$ zfs list -o name,used,avail,refer,compressratio,mountpoint -S used | head
NAME                    USED  AVAIL  REFER  RATIO  MOUNTPOINT
tank/vmstore           8.42T  4.10T  8.42T  1.28x  /tank/vmstore
tank/backups           5.10T  4.10T  5.10T  1.01x  /tank/backups
tank/home              1.80T  4.10T  1.70T  1.70x  /tank/home
tank                   512K   4.10T   192K  1.00x  /tank

What it means: vmstore dominates usage; compression helps a bit.

Decision: If the largest consumer is VM storage, mirrors are usually the right default. If it’s backups, RAIDZ2 might be fine.

Task 5: Check dataset recordsize and sync settings (common parity pain points)

cr0x@server:~$ zfs get -o name,property,value -r recordsize,sync tank/vmstore | head -n 12
NAME          PROPERTY    VALUE
tank/vmstore  recordsize  128K
tank/vmstore  sync        standard

What it means: Recordsize is 128K (default-ish). Sync is standard (depends on client/app).

Decision: For databases, consider smaller recordsize (e.g., 16K) only when you understand the I/O pattern.
For VM zvols, you typically tune volblocksize at creation. Don’t random-walk these settings on a production pool.
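For reference, volblocksize is fixed at zvol creation and cannot be retrofitted. A hypothetical example; the dataset names and sizes are illustrative, and 16K assumes it matches the database or guest block size:

```shell
# volblocksize is fixed at zvol creation; plan it, don't retrofit it.
# Names and sizes are illustrative; 16K assumes a matching DB/guest block size.
zfs create -V 200G -o volblocksize=16K tank/vmstore/vm01

# recordsize (filesystems, not zvols) can change later, but only for new writes:
zfs set recordsize=16K tank/dbstore
```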

Task 6: Identify whether compression is on (and whether it’s helping)

cr0x@server:~$ zfs get -o name,property,value,source -r compression tank | head -n 12
NAME           PROPERTY     VALUE     SOURCE
tank           compression  lz4       local
tank/vmstore   compression  lz4       inherited from tank
tank/backups   compression  off       local

What it means: VM store compresses; backups don’t.

Decision: If backups are already compressed (e.g., encrypted/compressed archives), compression won’t help.
For general-purpose datasets, lz4 is usually a free win. If CPU is pinned, measure before changing.

Task 7: Look at real-time I/O to see whether you’re IOPS-bound or throughput-bound

cr0x@server:~$ zpool iostat -v tank 2 3
                                            capacity     operations     bandwidth
pool                                      alloc   free   read  write   read  write
----------------------------------------  -----  -----  -----  -----  -----  -----
tank                                      16.9T  4.90T    820   1450   112M   210M
  mirror-0                                8.45T  2.45T    410    720  56.0M   105M
    wwn-0x5000c500a1b2c3d4                    -      -    208    718  28.1M   105M
    wwn-0x5000c500a1b2c3d5                    -      -    202    721  27.9M   105M
  mirror-1                                8.45T  2.45T    410    730  56.0M   105M
    wwn-0x5000c500a1b2c3d6                    -      -    205    728  28.0M   105M
    wwn-0x5000c500a1b2c3d7                    -      -    205    731  28.0M   105M
----------------------------------------  -----  -----  -----  -----  -----  -----

What it means: Writes are heavier than reads. Bandwidth is moderate. If latency complaints exist despite these numbers,
it may be sync writes, queueing, or a slow device (or you’re hitting p99 tails).

Decision: If you’re IOPS-bound on parity, mirrors are a structural fix. If you’re throughput-bound, add vdevs or faster media.

Task 8: Check ARC health and memory pressure (ZFS performance is often “RAM vs disk”)

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c
12:40:01   942   121     12     0    0   121  100     0    0  64.0G  64.0G
12:40:02   910   115     12     0    0   115  100     0    0  64.0G  64.0G
12:40:03   955   118     12     0    0   118  100     0    0  64.0G  64.0G

What it means: ~12% ARC miss rate. Not awful, not amazing. If your working set doesn’t fit in RAM, your vdev design matters more.

Decision: If ARC misses are high and disks are busy, you may need more RAM or faster storage—not a different RAID level.

Task 9: Identify whether you’re blocked on sync writes and SLOG behavior

cr0x@server:~$ zpool status tank | sed -n '1,120p'
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 03:12:44 with 0 errors on Sun Feb  4 02:10:31 2026
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000c500a1b2c3d4  ONLINE       0     0     0
            wwn-0x5000c500a1b2c3d5  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            wwn-0x5000c500a1b2c3d6  ONLINE       0     0     0
            wwn-0x5000c500a1b2c3d7  ONLINE       0     0     0
        logs
          mirror-2                  ONLINE       0     0     0
            nvme0n1p2               ONLINE       0     0     0
            nvme1n1p2               ONLINE       0     0     0

errors: No known data errors

What it means: There is a mirrored log device (SLOG). That can dramatically help sync write latency for NFS/iSCSI, when configured correctly.

Decision: If you have heavy sync workloads and no SLOG, mirrors vs RAIDZ won’t save you by itself. Add a proper mirrored SLOG or change sync semantics knowingly.
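Adding a mirrored SLOG after the fact is one command. The device paths below are placeholders; use power-loss-protected SSD/NVMe, and mirror the log so a SLOG death during a crash window can’t cost you sync writes:

```shell
# Device paths are placeholders; use power-loss-protected SSD/NVMe.
zpool add tank log mirror \
  /dev/disk/by-id/nvme-EXAMPLE-A-part2 \
  /dev/disk/by-id/nvme-EXAMPLE-B-part2
```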

Task 10: Verify ongoing resilver/scrub status and estimate operational risk

cr0x@server:~$ zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.
  scan: resilver in progress since Tue Feb  4 11:10:51 2026
        3.12T scanned at 1.45G/s, 1.84T issued at 860M/s, 16.9T total
        1.84T resilvered, 10.88% done, 05:12:33 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            wwn-0x5000c500a1b2c3d4  ONLINE       0     0     0
            replacing-1             DEGRADED     0     0     0
              wwn-0x5000c500a1b2c3d5  ONLINE     0     0     0
              wwn-0x5000c500ffff0001  ONLINE     0     0     0  (resilvering)

errors: No known data errors

What it means: Pool is degraded during resilver. Time remaining is non-trivial.

Decision: Freeze risky changes. If this is a parity vdev, treat the window as higher risk; consider throttling workload or scheduling maintenance.

Task 11: Catch failing disks before they become “sudden”

cr0x@server:~$ smartctl -a /dev/sda | egrep -i 'reallocated|pending|uncorrect|crc|overall|temperature'
SMART overall-health self-assessment test result: PASSED
Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
Temperature_Celsius     0x0022   034   040   000    Old_age   Always       -       34

What it means: Pending and uncorrectable sectors exist. “PASSED” does not mean “healthy.” It means “not dead enough yet.”

Decision: Replace the drive. On RAIDZ, do it proactively to avoid discovering weak drives during a resilver.

Task 12: Check whether you are suffering from a single slow disk (the silent pool killer)

cr0x@server:~$ iostat -x 1 3
Linux 6.5.0 (server)  02/04/2026  _x86_64_  (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.1    0.0     6.4    18.7     0.0    62.8

Device            r/s     w/s   rkB/s   wkB/s  await  svctm  %util
sda              62.0   115.0  8120.0  20120.0   9.2   1.8   31.0
sdb              64.0   117.0  8200.0  19990.0   8.9   1.9   32.0
sdc              12.0    95.0  1100.0  19850.0  44.5   2.1   88.0
sdd              60.0   112.0  8050.0  20010.0   9.1   1.8   30.5

What it means: sdc has much higher await and high utilization. One slow disk can drag a mirror vdev or parity vdev into bad latency.

Decision: Correlate sdc to a ZFS device (by-id/wwn). Check SMART, cabling, HBA errors. Replace or fix the path.
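Correlating a kernel name like sdc to the stable ID ZFS uses is a quick lookup. A sketch; the wwn below reuses this article’s example device IDs:

```shell
# Map the kernel name to the stable ID ZFS uses:
ls -l /dev/disk/by-id/ | grep -w sdc

# Then inspect the suspect path (wwn reuses this article's example IDs):
smartctl -a /dev/disk/by-id/wwn-0x5000c500a1b2c3d6 | egrep -i 'pending|uncorrect|crc'
dmesg | egrep -i 'sdc|I/O error' | tail -n 20
```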

Task 13: See if your pool is throttling itself due to errors

cr0x@server:~$ zpool events -v | tail -n 12
Feb 04 2026 12:11:02.123456789 ereport.fs.zfs.checksum
        pool = tank
        vdev = /dev/disk/by-id/wwn-0x5000c500a1b2c3d6
        errno = 0x0
        ...
Feb 04 2026 12:11:05.987654321 ereport.fs.zfs.io
        pool = tank
        vdev = /dev/disk/by-id/wwn-0x5000c500a1b2c3d6
        errno = 0x5
        ...

What it means: Checksum and I/O errors are being reported. ZFS may be retrying, reconstructing, and stalling.

Decision: Treat as a hardware-path issue until proven otherwise: drive, cable, backplane, HBA, firmware. Fix that before debating mirrors vs parity.

Task 14: Verify that scrubs are running and finishing in a reasonable window

cr0x@server:~$ zpool status -x
all pools are healthy
cr0x@server:~$ zpool status tank | grep -A2 -i scan
  scan: scrub repaired 0B in 03:12:44 with 0 errors on Sun Feb  4 02:10:31 2026

What it means: Scrub completed in a few hours and repaired nothing. That’s what you want.

Decision: If scrubs take days and impact production, your pool is too busy, too full, too wide, or too slow. Mirrors can reduce pain, but so can “more vdevs” and “more I/O headroom.”

Fast diagnosis playbook: find the bottleneck in minutes

When users say “storage is slow,” do not start by redesigning the pool. Start by proving what’s slow and why.
Here’s the order that saves time and avoids wrong fixes.

First: confirm it’s actually storage

  • Check iowait and CPU steal: high iowait suggests storage; high steal suggests noisy neighbor or hypervisor pressure.
  • Check application symptoms: are they latency-bound (timeouts) or throughput-bound (slow batch jobs)?
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 6  2      0  92100  52000 812000    0    0  1200  4800 8200 9000 14  6 60 20  0

Decision: If wa is high and b (blocked processes) is non-zero, proceed to storage checks.

Second: identify whether it’s one device, one vdev, or the whole pool

  • Per-disk: iostat -x for outliers in await/%util.
  • Per-vdev: zpool iostat -v for imbalance and overload.

Third: check ZFS health signals that change everything

  • Degraded or resilvering: performance will be worse. Decide if you can shed load.
  • Errors/events: checksum/I/O errors suggest hardware problems that no RAID level fixes.
  • Free space/fragmentation: high fullness + fragmentation = tail latency party.

Fourth: validate sync write path and intent log behavior

  • NFS/iSCSI/DB sync writes: verify if they’re forcing sync and whether SLOG exists and is healthy.

Fifth: only then talk layout changes

If you can’t name the bottleneck, don’t propose mirrors vs RAIDZ as the fix. That’s cargo culting.
Layout changes are expensive, disruptive, and hard to roll back. Earn them with evidence.

Three corporate mini-stories from the storage trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company ran a ZFS pool backing a virtualization cluster. The storage was built as a single wide RAIDZ1 vdev.
The reasoning was familiar: “We’re losing too much capacity with mirrors, and ZFS is smarter than RAID anyway.”

The first disk failed on a Tuesday morning. No big deal. They had hot spares on the shelf and an on-call who’d done this before.
The resilver started and performance dipped—noticeably, but tolerably. The team assumed it would be done overnight.

Overnight became “tomorrow afternoon.” The pool was busy, scrub history was irregular, and the pool was north of 85% full.
Random writes were already a bit ugly; now parity reconstruction amplified the ugliness. VM latency climbed. Application timeouts followed.

Late on day two, a second disk threw read errors during the resilver. The vdev couldn’t reconstruct a few blocks.
ZFS did what it could, but critical VM images were affected. The incident wasn’t caused by ZFS being unreliable.
It was caused by the assumption that RAIDZ1 behaves like a casual insurance policy, even during long rebuild windows.

The corrective action wasn’t “use mirrors everywhere.” It was more adult:
RAIDZ2 for HDD parity vdevs, regular scrubs with alerting on duration, and a pool fullness policy that treats 80% as “start planning,” not “fine.”

Mini-story 2: The optimization that backfired

A financial services shop had a ZFS mirror pool supporting a latency-sensitive database.
Everything was stable, if a little pricey in raw capacity. During a budget cycle, someone proposed switching to RAIDZ2 to “save disks.”

The migration was planned carefully and executed well. The new pool had enough spindles, RAIDZ2 redundancy, and looked great in a capacity chart.
Early tests showed acceptable throughput. The team declared the optimization a success and moved on.

Then the workload changed: more microservices, more small transactions, more sync writes, and a new batch job that did lots of small updates.
Tail latency crept up. Not consistently—just enough to trigger periodic slow queries and user-facing hiccups.

The team chased ghosts: network, DB parameters, application code, kernel updates. Eventually they did what they should have done at the start:
measure I/O patterns. The workload was IOPS-heavy with small synchronous writes. RAIDZ2 wasn’t “bad,” it was mismatched.

They ended up adding a mirrored SLOG and tuning some dataset settings, which helped.
But the big fix was reintroducing mirrors for the database tier and keeping RAIDZ2 for backups and less spiky storage.
The optimization did save disks. It also bought an extra on-call rotation’s worth of sleep deprivation.

Mini-story 3: The boring but correct practice that saved the day

A media company stored production assets and long-term archives on ZFS RAIDZ2 pools. Not glamorous.
The team had a habit that looked overly cautious: monthly scrubs, alerts on scrub duration and error counts,
and a rule that any disk showing pending sectors gets replaced within a week.

One quarter, a batch of drives began showing small but rising read error counts. Nothing failed outright.
The pools stayed ONLINE. Most teams would have shrugged—after all, the dashboards were green.

Because the scrubs were consistent and tracked, they noticed a pattern: scrub times were increasing and a few disks were consistently slower.
They replaced those drives during a planned maintenance window, one at a time, keeping the resilver load manageable.

Two weeks later, another company in the same building (same power conditions, similar hardware generation) had a multi-disk failure event
during an unplanned rebuild and lost a vdev. The media company didn’t “get lucky.” They had operational hygiene.

The practice wasn’t sexy. It didn’t show up in a quarterly slide deck. It did keep the archive intact and the executives unbothered,
which is the highest compliment production can offer.

Common mistakes: symptoms → root cause → fix

Mistake 1: “We need more space, switch mirrors to RAIDZ”

Symptoms: Pool is 75–85% full; performance already occasionally spiky; capacity pressure from growth.

Root cause: Treating layout as a capacity tool instead of a workload tool; ignoring that redesign is migration and risk.

Fix: Add vdevs (same type) or add another pool for cold data. If you must migrate, do it with workload classification:
mirrors for random I/O tiers, RAIDZ2 for sequential tiers.

Mistake 2: RAIDZ1 on large HDDs for primary storage

Symptoms: Rebuild windows measured in days; anxiety spikes during resilver; occasional unrecoverable errors during rebuild.

Root cause: Single-parity vdev can’t tolerate a second failure/URE during a long rebuild; wide vdev magnifies rebuild stress.

Fix: Use RAIDZ2 (or mirrors) for large HDD vdevs. Keep vdev widths reasonable. Scrub regularly to reduce latent error surprise.

Mistake 3: Running pools too full because “ZFS can handle it”

Symptoms: Latency spikes, slow metadata operations, unpredictable write performance, scrub times increase.

Root cause: CoW allocator has fewer good options; fragmentation increases; metaslab contention rises.

Fix: Keep headroom (often 20% for busy pools). Add capacity before panic. Consider special vdevs for metadata only when you can mirror and monitor them.
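One way to make headroom non-negotiable is a reservation held by an empty, unmountable dataset. A sketch; `tank/headroom` and the 2T figure are assumptions to size against your own pool:

```shell
# Enforced headroom: an empty, unmountable dataset holding a reservation.
# Name and size are assumptions -- size them to your pool and workload.
zfs create -o reservation=2T -o canmount=off tank/headroom
```

When the pool genuinely runs out, shrinking the reservation buys you controlled breathing room instead of a 100%-full emergency.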

Mistake 4: Misunderstanding sync writes and blaming RAID level

Symptoms: NFS datastore “randomly” stalls; database fsync latency high; throughput seems fine but users complain.

Root cause: Sync writes forcing ZIL behavior; no SLOG, or bad/unsafe SLOG device; parity amplifies small sync writes.

Fix: Add a proper mirrored SLOG (power-loss protected SSD/NVMe), verify sync semantics, and measure again. Don’t set sync=disabled unless you accept data loss on crash.

Mistake 5: Mixing vdev types in one pool without understanding allocation

Symptoms: Some workloads are fast, some are inexplicably slow; performance changes over time as the pool fills.

Root cause: ZFS allocates across top-level vdevs; slower vdevs become tail-latency anchors; imbalance appears.

Fix: Keep pools homogeneous per tier. If you need different behavior, use separate pools or carefully designed special vdevs with clear intent.

Mistake 6: Treating SMART “PASSED” as a clean bill of health

Symptoms: Intermittent checksum errors, slow disk behavior, weird timeouts; “but SMART says PASSED.”

Root cause: SMART overall status is coarse; pending/uncorrectable sectors are early warnings; cabling/HBA issues don’t always flip SMART.

Fix: Alert on meaningful attributes (pending, uncorrectable, reallocated trends, CRC errors). Use zpool events and OS logs to catch path problems.

Checklists / step-by-step plans

Plan A: Choosing mirrors vs RAIDZ (production decision checklist)

  1. Classify the workload: random IOPS heavy (VMs/DB) vs sequential (backup/archive) vs mixed.
  2. Define the failure budget: acceptable degraded performance, acceptable rebuild duration, and acceptable risk window.
  3. Set an operating fullness target: pick a max % full where performance remains predictable.
  4. Decide redundancy level: mirrors (2-way or 3-way) vs RAIDZ2/3; avoid RAIDZ1 for large HDD primary pools.
  5. Pick vdev width: keep it reasonable; don’t build the widest vdev your chassis allows just because you can.
  6. Plan for growth: add vdevs of the same type; avoid mixing types; ensure bays and HBAs support future expansion.
  7. Plan for rebuild operations: hot spares, maintenance windows, monitoring, and documented replacement procedure.
  8. Validate with measurements: benchmark representative I/O (including sync writes) before committing.
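Step 8 can start with a single fio run shaped like your workload. A sketch, not a benchmark methodology; fio is installed separately, `/tank/bench` is a scratch dataset, and every parameter here is an assumption to replace with measured I/O patterns:

```shell
# fio (installed separately) shaped as small sync writes; every parameter
# is an assumption -- model your real workload, then compare layouts.
fio --name=syncwrite --directory=/tank/bench --rw=randwrite --bs=16k \
    --size=4G --numjobs=4 --iodepth=16 --fsync=1 \
    --runtime=120 --time_based --group_reporting
```

Run the same job against candidate layouts and compare p99 latency, not just throughput.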

Plan B: Disk replacement procedure (safe, repeatable)

  1. Confirm the failing device by stable ID (WWN/by-id), not /dev/sdX.
  2. Check zpool status and ensure you understand which vdev is affected.
  3. If possible, offline the disk cleanly before removal.
  4. Replace the disk physically.
  5. Use zpool replace with the by-id path.
  6. Monitor resilver progress and system latency.
  7. After completion, run a scrub at the next safe window if your policy allows.
cr0x@server:~$ zpool offline tank /dev/disk/by-id/wwn-0x5000c500a1b2c3d5
cr0x@server:~$ zpool replace tank /dev/disk/by-id/wwn-0x5000c500a1b2c3d5 /dev/disk/by-id/wwn-0x5000c500ffff0001
cr0x@server:~$ zpool status tank | grep -A4 -i resilver
  scan: resilver in progress since Tue Feb  4 11:10:51 2026
        1.84T resilvered, 10.88% done, 05:12:33 to go

Decision: If the pool is parity and business-critical, consider temporarily reducing workload or rescheduling batch jobs during resilver.
For mirrors, keep an eye on the sibling disk: it’s now doing all the reads.
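When deciding whether to reschedule batch jobs, a floor estimate of resilver duration helps. A minimal sketch, with hypothetical numbers: real resilvers are not linear (they slow down under competing workload and metadata-heavy pools), so treat this as optimistic.

```shell
# Back-of-envelope resilver duration: data to rewrite divided by sustained rate.
# Both inputs are hypothetical; read the real rate off `zpool status` once
# the resilver has been running for a while.
data_tb=16; mb_per_s=120
hours=$(awk -v d="$data_tb" -v r="$mb_per_s" 'BEGIN { printf "%.1f", d * 1024 * 1024 / r / 3600 }')
echo "~${hours}h to resilver ${data_tb}T at ${mb_per_s}MB/s"
```

If that floor already lands in the middle of your busiest window, that is an argument for narrower vdevs or mirrors, made before the disk fails rather than after.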

Plan C: If you must chase capacity, do it without wrecking reliability

  1. Move cold datasets (backups, archives) to a separate RAIDZ2 pool designed for that duty.
  2. Keep hot datasets (VMs/DB) on mirrors.
  3. Use quotas/reservations to prevent backups from consuming the VM pool.
  4. Enable compression where it’s likely to help (often lz4), but don’t expect miracles on already-compressed data.
  5. Set alerting at 70/80/85% pool fullness with explicit actions at each stage.
cr0x@server:~$ zfs set quota=6T tank/backups
cr0x@server:~$ zfs get -o name,property,value quota tank/backups
NAME          PROPERTY  VALUE
tank/backups  quota     6T

Decision: If backups are eating the pool, constrain them. If product teams want more space, make them pick: more disks or less retention.
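The 70/80/85% staging from step 5 can be sketched as a simple threshold map. The capacity value here is hardcoded so the logic is visible; on a real host it would come from something like `zpool list -H -o capacity tank | tr -d %`.

```shell
# Map pool fullness to the 70/80/85% alert stages, each with an explicit action.
cap=82   # hypothetical current fullness, percent
if   [ "$cap" -ge 85 ]; then stage="page: stop growth, free space now"
elif [ "$cap" -ge 80 ]; then stage="ticket: plan expansion or cleanup"
elif [ "$cap" -ge 70 ]; then stage="notice: review retention and quotas"
else                         stage="ok"
fi
echo "pool at ${cap}%: ${stage}"
```

The important part is not the percentages; it’s that each threshold has a named action and an owner, so “the pool is filling up” never arrives as a surprise.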

FAQ

1) Are mirrors always faster than RAIDZ?

For random read-heavy workloads, usually yes. For large sequential throughput, RAIDZ can be competitive or better per dollar.
“Always” is the word that ruins storage design.

2) Is RAIDZ2 “safe enough” for large HDD pools?

RAIDZ2 is the common baseline because it tolerates two failures per vdev. “Safe enough” depends on rebuild time, disk quality,
scrub hygiene, and how close to full you run. But RAIDZ2 is dramatically less stressful than RAIDZ1 during rebuild windows.

3) Why is RAIDZ1 discouraged on big disks?

Because rebuild windows are long and parity rebuilds read a lot of data. That’s when latent errors or a second failure show up.
RAIDZ1 gives you no margin for that.

4) Can I add a single disk later to a RAIDZ vdev?

Historically, no—you add whole vdevs, not disks. Newer OpenZFS has RAIDZ expansion capabilities in some environments,
but you should plan as if you’ll still expand by adding vdevs or migrating, because feature availability and operational maturity vary.

5) Should I use a single wide RAIDZ2 vdev or multiple narrower ones?

Multiple vdevs usually give better parallelism and more predictable performance. A single very wide vdev can look efficient,
but it increases rebuild stress and can create long tail latency under mixed I/O.

6) Do mirrors waste too much capacity?

They waste capacity only if you don’t need what they buy: predictable latency and simpler degraded behavior.
If your workload is random I/O heavy, mirrors are often cheaper than the engineering time you’ll spend trying to tame parity.

7) Is a SLOG required?

No. It’s required only if you have significant sync writes and you care about latency.
If you add one, mirror it and use a device with power-loss protection, or you’re building a performance feature out of a reliability hazard.

8) How full is too full for a ZFS pool?

It depends on workload, but busy pools often get unpleasant above ~80–85%. If you’re running latency-sensitive services,
treat 80% as a planning threshold, not a finish line.

9) Are 3-way mirrors worth it?

Sometimes: very high availability requirements, very slow replacement logistics, or when resilver risk must be minimized.
But they’re expensive in capacity. RAIDZ3 can also be an option for some large archival tiers.

10) If I’m performance-bound, should I switch RAID level or add vdevs?

Add vdevs when you need more parallelism and throughput/IOPS. Switch layout when the current vdev type is structurally mismatched
(e.g., parity for small sync random writes). If you can solve it by adding vdevs of the same type, that’s usually less disruptive.

Conclusion: practical next steps

Mirrors vs parity is not a moral debate. It’s a production bet: what kind of bad day you’re willing to have,
and how long you’re willing to have it.

Next steps that actually move you forward:

  1. Measure your workload: use zpool iostat -v, latency metrics, and application behavior. Decide based on I/O pattern.
  2. Set policies: scrub schedule, drive replacement thresholds, and pool fullness limits with alerts and owners.
  3. Pick the boring default: mirrors for VM/DB tiers; RAIDZ2 for backup/archive tiers. Deviate only with evidence.
  4. Design for rebuilds: assume a disk will fail during a busy week. If that thought makes you sweat, your layout is wrong.
  5. Keep pools homogeneous per tier: avoid mixing vdev types unless you can explain exactly how it will behave at 85% full during a resilver.