ZFS RAIDZ Rebuild Math: Why One More Failure Can Kill You

Your pager goes off at 02:13. One disk is dead. ZFS is resilvering. Everyone wants one answer: “How long until we’re safe?”
The bad news is that “safe” is not a timestamp. It’s a probability curve that gets ugly precisely when you need it to be calm.

RAIDZ rebuilds aren’t just about throughput. They’re about exposure time, error rates, and how many separate ways storage can betray you
while you’re busy fixing it. The math isn’t hard. The consequences are.

Terms that matter (and the ones people misuse)

RAIDZ is not “RAID-5 but in ZFS”

RAIDZ is ZFS’s parity RAID layout. RAIDZ1 tolerates one disk failure in a vdev. RAIDZ2 tolerates two. RAIDZ3 tolerates three.
A pool’s fault tolerance is the fault tolerance of its least tolerant vdev, and losing any whole vdev loses the pool.
Mix a RAIDZ2 vdev with a two-way mirror and the mirror becomes the weak link: it tolerates only one failure. A RAIDZ1 vdev in the mix is just as dangerous.
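
A tiny sketch of that rule, with made-up vdev labels and tolerances, just to make the “minimum over vdevs” explicit (a toy model, not anything ZFS exposes):

# Toy model: a pool only tolerates as many failures per vdev as its least
# tolerant vdev allows, because losing any whole vdev loses the pool.
VDEV_TOLERANCE = {"raidz1": 1, "raidz2": 2, "raidz3": 3,
                  "mirror-2way": 1, "mirror-3way": 2}

def pool_fault_tolerance(vdev_types):
    return min(VDEV_TOLERANCE[v] for v in vdev_types)

print(pool_fault_tolerance(["raidz2", "raidz2"]))       # 2
print(pool_fault_tolerance(["raidz2", "mirror-2way"]))  # 1 -- the mirror is the weak link
print(pool_fault_tolerance(["raidz2", "raidz1"]))       # 1 -- so is RAIDZ1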

Resilver vs rebuild

“Rebuild” is the generic RAID term. ZFS performs a resilver: it reconstructs missing data onto a replacement device.
The key detail: ZFS can often copy only allocated blocks (not the entire disk), so resilvers can be much faster than old-school RAID rebuilds.
But the vdev still has to read lots of data to recompute parity and verify checksums. If your pool is near-full, “only allocated blocks”
becomes “most blocks,” and the difference gets theoretical.

Scrub is not a backup and not a resilver

A scrub reads data and verifies checksums, repairing corruption using redundancy. It doesn’t make you resilient to “oops, we deleted prod.”
But scrubs reduce the chance that the first time you discover latent sector errors is during a resilver—when you’re already down one disk.

URE, latent errors, and why vendor numbers lie by omission

URE means “unrecoverable read error.” Vendors publish an unrecoverable bit error rate (UBER) such as one error per 10^14 or 10^15 bits read.
People interpret that as “one error per X TB read,” and then stop thinking. The actual risk depends on:
how many bits you read during the resilver, how many drives you read from, what the firmware does under stress,
and whether you’ve been scrubbing. The published number is a starting point, not a promise.
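
If you still want the “one error per X TB” figure, here is the arithmetic behind it as a minimal Python sketch. The UBER values are typical published figures; everything else is just unit conversion:

# Convert a published per-bit UBER into "terabytes read per expected error".
# This is the optimistic spec-sheet view, before scrub history, firmware
# behavior under stress, or batch effects enter the picture.
def tb_read_per_expected_error(uber_per_bit: float) -> float:
    bits_per_error = 1.0 / uber_per_bit      # mean bits read between errors
    return bits_per_error / 8 / 1e12         # bits -> bytes -> decimal TB

for uber in (1e-14, 1e-15):
    print(f"UBER {uber:g}/bit -> ~{tb_read_per_expected_error(uber):.1f} TB read per expected error")
# UBER 1e-14/bit -> ~12.5 TB read per expected error
# UBER 1e-15/bit -> ~125.0 TB read per expected error

Twelve-ish terabytes is roughly one large HDD, which is why the spec-sheet number alone settles nothing.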

One quote worth keeping on your desk. Richard Cook’s well-known idea about complex systems is often summarized as:
“Success in complex operations is built on adaptation; failure happens when defenses are exhausted.”
(paraphrased idea, Richard Cook)

Joke #1: A resilver is like moving apartments while the building is on fire—technically possible, but you’ll discover which boxes are fragile.

Interesting facts and short history

  • ZFS was designed in the mid-2000s to treat storage as a system, not a pile of devices, baking in end-to-end checksums and self-healing.
  • RAIDZ exists to avoid the “RAID-5 write hole” by integrating parity updates with copy-on-write transaction groups.
  • Early ZFS deployments were memory-rich by necessity; the “ZFS needs tons of RAM” meme came from real ARC hunger on old systems, not superstition.
  • 4K sector reality changed everything: ashift mismatches (e.g., 512e drives given ashift=9, 512-byte alignment) can permanently tax performance and rebuild times.
  • Drive capacities grew faster than drive throughput; rebuild exposure windows expanded as disks went from hundreds of GB to tens of TB.
  • “Resilver only used space” was a game-changer compared to classic RAID, but it’s only as good as your fragmentation and fullness.
  • Checksums made silent corruption visible; the unpleasant side effect is you discover corruption during scrubs/resilvers, when the pool is stressed.
  • RAIDZ expansion arrived late compared to people’s wishes; operationally, that meant lots of “just add bigger disks and replace one by one” plans.
  • Enterprise HDD UBER numbers improved, but modern firmware behavior under heavy queueing can still produce timeouts that look like “random failures.”

Why one more failure can kill you

In RAIDZ1, one disk dying puts you in a state where every block that needs the missing disk must be reconstructed from the remaining disks.
During the resilver, you are running degraded: you have no redundancy. One additional failure—another disk drop, a cable issue, a controller reset,
a rash of UREs on the wrong disk—can turn “resilver in progress” into “restore from backups,” assuming you have backups.

RAIDZ2 buys you time and options. Not infinite safety, but breathing room. You can survive a second device failure during a resilver.
And more importantly, you can survive a handful of read errors without the vdev panicking itself into a corner. The practical difference is psychological:
engineers stop making rushed, risky moves.

The failure modes are not independent

The cute math assumes independent failures: one disk fails, other disks have random UREs at vendor rates, etc.
Reality is messier:

  • Disks from the same batch fail close together.
  • Resilver workloads are punishing: sustained reads, high queue depth, and lots of random IO if the pool is fragmented.
  • Controllers and expanders get stressed; marginal cables stop being “fine.”
  • Thermals shift: a failed fan or a chassis with poor airflow turns rebuilds into a heat-soak test.
  • Human factors: someone sees “DEGRADED” and starts swapping hardware like they’re playing whack-a-mole.

The resilience you thought you bought with parity is often spent on “non-disk” problems during the resilver window.
That’s why the phrase “one more failure can kill you” is operational, not poetic.

Rebuild math you can actually use

1) Exposure time: how long you’re living dangerously

The simplest and most useful model is: risk increases with time spent degraded. If your resilver takes 2 hours, you’re exposed for 2 hours.
If it takes 2 days, you’re exposed for 2 days. The details matter, but this alone explains why oversized RAIDZ1 vdevs are a trap.

A practical estimate for resilver time:

  • Data to copy ≈ allocated bytes on the vdev (not the pool), adjusted upward for fragmentation and metadata overhead.
  • Effective throughput ≈ minimum of (disk read bandwidth, disk write bandwidth to the new disk, vdev parity compute, controller/expander limits, and pool workload interference).
  • Resilver time ≈ data-to-copy / effective-throughput.

If you’re at 80–90% pool fullness, expect “data-to-copy” to be uncomfortably close to “whole disk,” and expect random IO to dominate.
That’s why seasoned operators keep RAIDZ pools under ~70–80% unless there’s a strong reason not to.
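
A minimal sketch of that estimate. Every input is an assumption you supply for your own pool (allocated data, a fragmentation/metadata fudge factor, and whatever effective throughput survives contention); none of it comes from ZFS itself:

# Back-of-envelope resilver time: data to copy divided by effective throughput.
def estimate_resilver_hours(allocated_tb: float,
                            overhead: float,         # e.g. 1.3 = +30% for fragmentation/metadata
                            effective_mb_s: float    # min of disk, parity, controller, workload limits
                            ) -> float:
    data_mb = allocated_tb * 1e6 * overhead          # decimal TB -> MB
    return data_mb / effective_mb_s / 3600

# Example: 12 TB to reconstruct, fragmented pool, 150 MB/s effective because
# production traffic is still running.
print(f"~{estimate_resilver_hours(12, 1.3, 150):.0f} hours exposed")   # ~29 hours

Run it with the contended throughput you actually get during an incident, not the benchmark number, and oversized RAIDZ1 vdevs stop looking clever.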

2) URE math: the classic “why RAID5 died at scale” argument (and what ZFS changes)

The canonical back-of-envelope calculation is:
the probability of at least one URE while reading B bits is approximately
P ≈ 1 − (1 − p)^B,
where p is the per-bit URE probability (e.g., 10^-14) and B is the number of bits read.

If you read lots of bits, P rises. Fast. The mistake is treating this as gospel in either direction:
“URE will definitely happen” or “URE never happens in practice.” Both are lazy.
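
The same formula in Python, so you can argue with numbers instead of adjectives. The per-bit probability and the read sizes below are illustrative assumptions:

import math

# P(at least one URE) = 1 - (1 - p)^B, computed via logs so tiny p doesn't underflow.
def p_at_least_one_ure(tb_read: float, p_per_bit: float = 1e-14) -> float:
    bits = tb_read * 1e12 * 8
    return 1.0 - math.exp(bits * math.log1p(-p_per_bit))

for tb in (4, 12, 40):
    print(f"{tb:>3} TB read at 1e-14/bit: P(>=1 URE) ~ {p_at_least_one_ure(tb):.0%}")
#   4 TB read at 1e-14/bit: P(>=1 URE) ~ 27%
#  12 TB read at 1e-14/bit: P(>=1 URE) ~ 62%
#  40 TB read at 1e-14/bit: P(>=1 URE) ~ 96%

Swap in 10^-15 for an enterprise-class drive and the same reads look far less scary, which is exactly why quoting a UBER without the read volume is meaningless.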

What ZFS changes:

  • ZFS verifies checksums, so it can detect bad data instead of serving it silently.
  • With redundancy (RAIDZ2/3), ZFS can often repair around a bad sector using parity and other copies.
  • During a degraded RAIDZ1 resilver, a read error on the wrong block can become unrecoverable because the missing disk removes your last margin.

3) The “how many disks do I have to read?” reality

During a RAIDZ resilver, the system reads from all remaining disks in the vdev to reconstruct missing data. That means:
the more disks in the vdev, the more aggregate reads, the more opportunities for:
timeouts, UREs, and “this drive was okay until you asked it to read everything.”

Wider vdevs can be great for sequential throughput. They’re also great at concentrating risk during resilver.
RAIDZ2 with a sensible width tends to be the adult choice for large HDD pools.
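
A rough way to see the fan-out, assuming each surviving disk carries the same allocated data and every failure mode is independent; this is the pessimistic end of the argument, not a forecast:

import math

# During a RAIDZ1 resilver every surviving disk is read, so the bits you must
# read cleanly (and the URE opportunities) scale with vdev width.
def p_ure_during_resilver(width: int, allocated_tb_per_disk: float,
                          p_per_bit: float = 1e-14) -> float:
    bits_read = (width - 1) * allocated_tb_per_disk * 1e12 * 8
    return 1.0 - math.exp(bits_read * math.log1p(-p_per_bit))

for width in (4, 8, 12):
    print(f"{width:>2}-wide RAIDZ1, 10 TB/disk allocated: "
          f"P(>=1 URE during resilver) ~ {p_ure_during_resilver(width, 10):.0%}")
# roughly 91%, 100%, 100% for widths 4, 8, 12 at a 1e-14 UBER

Scrubs, better-UBER drives, and above all RAIDZ2’s second parity are how real pools avoid living at those odds; the sketch only shows that width multiplies the amount of data you must read flawlessly.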

4) Performance math that bites: small blocks and fragmentation

Resilver speed isn’t just “disk MB/s.” If your dataset is full of small blocks (databases, VMs, mail spools),
ZFS does more metadata traversal, seeks more, and your effective throughput collapses.
If you have compression, you might read less from disk than the logical data size. If you have dedup, you might make your life harder in other ways.

The real rebuild math is a blend:
exposure window × rate of additional failures per unit time × probability of an unrecoverable read across the data you must read.
You don’t need a PhD to use it. You just need to stop pretending rebuild time is constant.
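
Here is one way to wire those factors together for a degraded RAIDZ1 vdev. It is a crude sketch: it assumes independence (which this article just told you not to trust) and invented inputs for AFR, UBER, and sizes, so use it to compare designs, not to make promises:

import math

def p_second_disk_failure(surviving: int, afr: float, window_h: float) -> float:
    # Exponential model from an annualized failure rate, e.g. afr=0.02 for 2%/year.
    return 1.0 - math.exp(-surviving * afr / (365 * 24) * window_h)

def p_ure(tb_read: float, p_per_bit: float) -> float:
    return 1.0 - math.exp(tb_read * 1e12 * 8 * math.log1p(-p_per_bit))

def raidz1_loss_estimate(width: int, tb_per_disk: float, afr: float,
                         window_h: float, p_per_bit: float) -> tuple:
    surviving = width - 1
    p_disk = p_second_disk_failure(surviving, afr, window_h)
    p_read = p_ure(surviving * tb_per_disk, p_per_bit)
    return p_disk, p_read, 1.0 - (1.0 - p_disk) * (1.0 - p_read)

# 8-wide RAIDZ1, 10 TB allocated per disk, 2% AFR, enterprise UBER of 1e-15:
for window_h in (12, 72):
    d, u, total = raidz1_loss_estimate(8, 10, 0.02, window_h, 1e-15)
    print(f"{window_h:>2} h degraded: disk {d:.2%}  URE {u:.0%}  combined {total:.0%}")
# 12 h degraded: disk 0.02%  URE 43%  combined 43%
# 72 h degraded: disk 0.12%  URE 43%  combined 43%

For disks this size the read-error term dominates no matter how fast you resilver, which is precisely the case for RAIDZ2: with a second parity, a URE mid-resilver is a repair event instead of a data-loss event.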

How ZFS resilvering really works (and why it sometimes wins)

Copy-on-write and transaction groups

ZFS is copy-on-write. It writes new blocks, then updates pointers. This reduces “write hole” exposure because ZFS can keep consistent on-disk state
through transaction groups. RAIDZ parity is integrated into those writes.

Why “only allocated blocks” is both true and misleading

ZFS resilvers based on what it believes is in use, not “every LBA on the disk.” That’s good.
But there are three footguns:

  • Pool fullness: if you’re 85% full, allocated blocks are most of the disk.
  • Fragmentation: allocated blocks are scattered, so the resilver becomes random IO.
  • Metadata overhead: indirect blocks, spacemaps, and gang blocks add reads and CPU work.

Sequential resilver vs healing resilver

ZFS can perform different styles of resilvering depending on implementation and situation (for example, sequential scanning behaviors),
but operationally you should assume:
resilvering causes sustained reads across the vdev, consumes IO budget, and drags performance down.
If you run a heavy production workload at the same time, you are choosing a longer exposure window.

Joke #2: The fastest resilver is the one you schedule before the disk fails—sadly, disks refuse calendar invites.

What really slows a resilver

1) The pool is busy, and ZFS is being polite

ZFS intentionally avoids taking the whole system hostage. Resilver IO competes with normal IO.
That’s great for users. It’s terrible for “get back to redundancy ASAP.”
You can tune it, but doing so blindly can make production latency worse and cause timeouts that look like hardware faults.

2) A single slow disk drags the whole vdev

RAIDZ vdevs behave like a team of rowers: if one rower is hungover, the whole boat turns in circles.
A marginal disk that hasn’t “failed” can bottleneck reads, triggering long resilvers and more stress.

3) Controllers, expanders, and cabling

Rebuilds are great at exposing the stuff you never tested: a flaky SAS expander port, a cable that’s “fine” at idle,
a controller firmware bug, or a backplane that heats up and starts throwing errors.
If your logs show link resets during resilver, treat it as a reliability incident, not a performance mystery.

4) ashift mistakes are forever

If a pool was created with the wrong ashift, every write is misaligned. That inflates IO and hurts resilver time.
You can’t fix ashift in place. You fix it by rebuilding the pool correctly. This is why “just make a pool quickly” is a career-limiting move.

5) CPU is not usually the bottleneck—until it is

Modern CPUs usually handle RAIDZ parity and checksums fine. But heavy compression, encryption, and high IOPS workloads can change that.
Watch CPU during resilver. If you’re pegged, your disks are waiting on math.

Fast diagnosis playbook

When a resilver is slow, you do not have time for artisanal debugging. Triage it like an outage.
The goal is to find the one limiting factor you can change quickly.

First: confirm the failure domain and the real blast radius

  • Is this one vdev degraded, or multiple?
  • Is the pool actually serving errors, or just rebuilding?
  • Are there checksum errors (bad data) or just device errors (bad path)?

Second: identify the bottleneck class

  • Single-disk slow: one drive showing high latency / low throughput.
  • Bus/path problem: link resets, SAS errors, timeouts, multipath flapping.
  • Workload contention: production IO is starving resilver.
  • CPU/memory pressure: compression/encryption or ARC thrash slowing everything.

Third: take the least risky corrective action

  • If a disk is clearly misbehaving, replace it now rather than “waiting to see.”
  • If the path is unstable, fix cabling/firmware before you burn more disks.
  • If production load is the problem, rate-limit clients, move workloads, or schedule a maintenance window to prioritize resilver.

The right answer is often boring: reduce load, let resilver finish, then scrub.
The wrong answer is hero-mode tuning changes made on a pool that is already degraded.

Practical tasks: commands, outputs, decisions

Below are field tasks you can run right now. Each one includes: command, what output means, and the decision you make.
Hostnames, pools, and devices are examples; adjust them to your environment.

Task 1: Verify pool health and resilver status

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Dec 26 02:20:17 2025
        3.12T scanned at 410M/s, 1.84T issued at 242M/s, 6.20T total
        1.84T resilvered, 29.68% done, 6:05:11 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     2
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0
            sdg                     ONLINE       0     0     0
            sdh                     ONLINE       0     0     0
            replacing-8             DEGRADED     0     0     0
              sdi                   FAULTED      0     0     0  too many errors
              sdz                   ONLINE       0     0     0  (resilvering)

errors: No known data errors

Meaning: “issued” is actual work done; “scanned” can be higher. CKSUM errors on a surviving disk (sdb) are a red flag.

Decision: If CKSUM errors rise during resilver, treat it as potential media or path instability; plan to replace or re-cable that disk.

Task 2: See per-vdev I/O to spot a slow device

cr0x@server:~$ zpool iostat -v tank 5 3
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        92.1T  28.7T  1.02K   210   310M  52.1M
  raidz2-0                  92.1T  28.7T  1.02K   210   310M  52.1M
    sda                         -      -    120    25  38.5M  6.3M
    sdb                         -      -     11    24   1.2M  6.2M
    sdc                         -      -    121    25  38.7M  6.2M
    sdd                         -      -    122    24  38.8M  6.1M
    sde                         -      -    121    26  38.6M  6.4M
    sdf                         -      -    120    25  38.5M  6.3M
    sdg                         -      -    121    25  38.6M  6.2M
    sdh                         -      -    120    25  38.4M  6.1M
    sdz                         -      -    287    22  66.1M  5.9M
--------------------------  -----  -----  -----  -----  -----  -----

Meaning: sdb is contributing almost no read bandwidth compared to peers.

Decision: Investigate sdb immediately (SMART, cabling, controller path). A “slow-but-online” disk can kill your resilver window.

Task 3: Check recent ZFS events for device flaps

cr0x@server:~$ zpool events -v | tail -n 20
TIME                           CLASS
Dec 26 2025 02:14:03.129812000  sysevent.fs.zfs.dev_remove
    pool = tank
    vdev_path = /dev/sdi
    vdev_guid = 1234567890123456789
Dec 26 2025 02:16:44.410223000  sysevent.fs.zfs.config_sync
    pool = tank
Dec 26 2025 02:20:17.004901000  sysevent.fs.zfs.resilver_start
    pool = tank
    vdev_path = /dev/sdz

Meaning: Confirms remove/replace timeline; useful for correlating with logs and human actions.

Decision: If you see repeated remove/add events, treat it as path instability (expander/cable/HBA) more than “bad disk.”

Task 4: Confirm what ZFS thinks about errors

cr0x@server:~$ zpool status -x
pool 'tank' is not healthy

Meaning: Still degraded; no nuance, but it’s a fast sanity check for dashboards and runbooks.

Decision: Use it for alert clear conditions only after you’ve also reviewed detailed status.

Task 5: Pull SMART health and error counters for the suspicious disk

cr0x@server:~$ sudo smartctl -a /dev/sdb | egrep -i "Model|Serial|Reallocated|Pending|Offline_Uncorrectable|UDMA_CRC|Error|SMART overall"
Device Model:     WDC WD140EDGZ-11B2DA2
Serial Number:    9JH3K1AB
SMART overall-health self-assessment test result: PASSED
Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       3

Meaning: Pending/offline uncorrectable sectors are bad; CRC errors suggest cable/backplane/HBA issues.

Decision: Pending sectors during resilver? Replace disk. CRC errors rising? Reseat/replace cable, check expander/HBA logs.

Task 6: Run a short SMART test (if you can afford it)

cr0x@server:~$ sudo smartctl -t short /dev/sdb
Please wait 2 minutes for test to complete.
Test will complete after Fri Dec 26 03:01:42 2025

Meaning: A short test can catch obvious failures quickly.

Decision: If the disk can’t complete a short test or reports read failures, stop debating and replace it.

Task 7: Watch kernel logs for link resets and timeouts

cr0x@server:~$ sudo dmesg -T | egrep -i "sas|scsi|reset|timeout|link|I/O error" | tail -n 12
[Fri Dec 26 02:44:18 2025] sd 5:0:7:0: [sdb] tag#913 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[Fri Dec 26 02:44:18 2025] sd 5:0:7:0: [sdb] CDB: Read(16) 88 00 00 00 00 10 2a 3f 00 00 00 00 80 00 00 00
[Fri Dec 26 02:44:19 2025] sas: phy-7: reset complete
[Fri Dec 26 02:44:22 2025] sd 5:0:7:0: [sdb] Sense Key : Medium Error [current]
[Fri Dec 26 02:44:22 2025] sd 5:0:7:0: [sdb] Add. Sense: Unrecovered read error

Meaning: Timeouts + medium errors during heavy reads are exactly what resilver induces.

Decision: Medium errors: replace disk. Repeated PHY resets across multiple disks: suspect expander/HBA/cabling/firmware.

Task 8: Identify if the pool is too full (resilver will drag)

cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint -S used tank | head
NAME            USED  AVAIL  REFER  MOUNTPOINT
tank            92.1T  28.7T   128K  /tank
tank/vm         41.8T  28.7T  41.8T  /tank/vm
tank/backups    28.2T  28.7T  28.2T  /tank/backups
tank/home       13.4T  28.7T  13.4T  /tank/home

Meaning: ~76% allocated. Manageable, but if this were 90%+, resilver pain increases sharply.

Decision: If >80% used, plan to free space or add capacity after the incident; do not normalize “nearly full” RAIDZ pools.

Task 9: Check fragmentation (a quiet resilver killer)

cr0x@server:~$ zpool list -o name,alloc,free,frag,cap,health
NAME   ALLOC   FREE  FRAG  CAP  HEALTH
tank   92.1T  28.7T   41%  76%  DEGRADED

Meaning: 41% fragmentation will turn “copy allocated blocks” into lots of seeks.

Decision: High frag + slow resilver: reduce workload, expect longer window; consider future redesign (more vdevs, different recordsize, keep pools emptier).

Task 10: Check ashift (alignment) of the vdev

cr0x@server:~$ zdb -C tank | egrep "vdev|ashift" | head -n 20
        vdev_tree:
            type: 'raidz'
            id: 0
            guid: 9876543210123456789
            ashift: 12
            nparity: 2

Meaning: ashift 12 means 4K alignment. Good for modern HDD/SSD. ashift 9 on 4K-native/512e is a classic own-goal.

Decision: If ashift is wrong, don’t “tune” your way out. Schedule a proper migration to a correctly created pool.

Task 11: Confirm recordsize / volblocksize for key datasets (impacts layout and resilver behavior)

cr0x@server:~$ zfs get -o name,property,value -s local recordsize,volblocksize tank/vm tank/home
NAME      PROPERTY     VALUE
tank/vm   volblocksize 16K
tank/home recordsize   128K

Meaning: VM zvols at 16K can be seek-heavy; home dataset at 128K is more sequential-friendly.

Decision: If resilvers are chronically slow on VM-heavy pools, consider special vdevs (metadata), more vdevs, or mirrors for IOPS-heavy tiers.

Task 12: Pause competing work: locate top I/O consumers

cr0x@server:~$ sudo iostat -x 5 2
Linux 6.5.0 (server)  12/26/2025  _x86_64_  (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.41    0.00    6.32   18.77    0.00   62.50

Device            r/s     w/s   rkB/s   wkB/s  await  svctm  %util
sda             98.1    22.4  39512    6821   28.4   3.1   95.2
sdb             11.3    21.9   1290    6720  412.7  10.2   99.8
sdz            210.4    18.2  67711    5812   16.9   2.7   89.4

Meaning: sdb has 400ms await and 99.8% util while doing tiny reads. That’s the hungover rower.

Decision: If a disk’s latency is extreme, do not wait for it to “recover.” Replace it or fix its path; otherwise, your resilver is a slow-motion disaster.

Task 13: Find checksum errors at the dataset level (silent corruption surfacing)

cr0x@server:~$ zpool status -v tank | sed -n '/errors:/,$p'
errors: Permanent errors have been detected in the following files:

        tank/home@daily-2025-12-25:/documents/finance.xlsx

Meaning: You have at least one permanent error: ZFS couldn’t repair that block from redundancy.

Decision: Trigger incident response: identify affected data owners, restore from backup/snapshot replication, and reassess redundancy level and scrub cadence.

Task 14: Start a scrub after the resilver (validation pass)

cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ zpool status tank | egrep -A3 "scan:"
  scan: scrub in progress since Fri Dec 26 09:02:11 2025
        1.34T scanned at 620M/s, 410G issued at 190M/s, 120T total
        0B repaired, 0.34% done, 178:22:10 to go

Meaning: Scrub is slower on large pools; “repaired” staying at 0 is good, but “to go” matters for planning.

Decision: If scrub reveals new errors, treat it as ongoing media/path instability—don’t declare victory because the resilver finished.

Three corporate mini-stories (anonymized, plausible, painful)

Mini-story 1: The incident caused by a wrong assumption

A mid-size SaaS company ran a single big ZFS pool for “everything not in the database.” It was RAIDZ1 because someone remembered
that RAIDZ2 “wastes too much space,” and spreadsheets love to win arguments.

A disk failed on a Friday night. On-call replaced it and watched the resilver progress bar. It was moving, slowly.
The team assumed the only risk was “another disk has to fail,” which felt unlikely over a weekend.
They left the pool serving full production traffic because “customers can’t take downtime.”

Twelve hours in, ZFS started logging checksum errors on a different disk. SMART looked “PASSED,” so the team dismissed it as noise.
The resilver continued. Then the second disk threw a run of unreadable sectors on exactly the blocks needed to reconstruct data from the missing disk.
RAIDZ1 had no extra parity to work around it. The vdev faulted hard enough that the pool stopped being trustworthy.

The wrong assumption wasn’t “a second disk won’t fail.” It was subtler: “If the pool is still online, we’re okay.”
ZFS will keep trying until it can’t. Online is not the same thing as safe.

Their recovery was a messy blend: restore from object-store backups for most data, accept some customer data loss for a handful of tenants,
and spend the next month rebuilding storage with RAIDZ2 and a stricter scrub policy. The postmortem takeaway was blunt:
“We treated parity as insurance, then canceled the policy during the only week we needed it.”

Mini-story 2: The optimization that backfired

An internal analytics platform had a ZFS pool backing a farm of VMs. Performance complaints came in: write latency spikes,
occasional stalls during snapshots. A well-meaning engineer tuned for throughput: wider RAIDZ vdevs (because “more disks = faster”),
aggressive compression on everything, and a larger recordsize on datasets because “big IO is efficient.”

It looked good in benchmarks. Sequential reads flew. The complaints slowed down. Everyone moved on.
Six months later, a disk failed. The resilver was glacial. Not just “hours,” but “days,” even though the pool wasn’t near full.

The catch: the workload was mostly small random reads/writes from VMs, with metadata churn and snapshot-heavy behavior.
The wide vdev meant more disks participated in each reconstruction operation. Compression increased CPU work during resilver
at the exact moment the system was already doing parity math and checksum verification. And the large recordsize on the wrong datasets
increased write amplification and fragmentation over time.

During the long resilver, a second disk started timing out—not dead, just slow and hot. The controller logged resets.
The team had to throttle the entire VM cluster to let the resilver proceed, which created a customer-visible incident anyway.
They had optimized the happy path and made the failure path unbearable.

Afterward, they redesigned: smaller, more numerous vdevs; RAIDZ2; careful dataset tuning (VMs treated differently from bulk storage);
and a rule that any performance change must include “what happens during a resilver” as a first-class acceptance test.
The optimization didn’t fail because it was dumb. It failed because it ignored recovery math.

Mini-story 3: The boring but correct practice that saved the day

A conservative enterprise team ran storage for a file-heavy internal platform. Nothing glamorous: lots of documents, build artifacts,
a few fat datasets that grew steadily. They had a reputation for saying “no” to exciting shortcuts.

Their design was unremarkable on purpose: RAIDZ2 vdevs with moderate width, hot spares available, monthly SMART long tests,
and scrubs on a schedule that matched their failure domain. They also kept a strict utilization target: when pools approached it,
capacity expansion work was ticketed and funded like any other production requirement.

When a disk failed, the on-call followed the runbook: confirm no other disks had rising CRC counts, reduce nonessential batch jobs,
replace the disk, watch the resilver, then scrub. The resilver finished in hours, not days, because the pool had headroom and low fragmentation.

A week later, another disk showed increasing pending sectors. It hadn’t failed yet. They replaced it proactively in a maintenance window.
No outage. No drama. The “cost” was a few more drives and some calendar time. That’s cheap.

What saved them wasn’t a magic ZFS flag. It was the discipline to treat storage like a system that must remain recoverable,
not just fast. Their boring practice prevented the exciting incident.

Common mistakes: symptom → root cause → fix

1) Resilver is “stuck” at a low percentage

Symptom: “resilver in progress… 2% done… time remaining: days” barely changes.

Root cause: Severe fragmentation, heavy random IO workload, or a single slow disk dragging read rates.

Fix: Identify slow disk via zpool iostat -v and iostat -x; reduce workload; replace suspicious disk; keep pools less full long-term.

2) Rising CKSUM errors on an “ONLINE” disk during resilver

Symptom: zpool status shows CKSUM increments on a surviving device.

Root cause: Actual media errors or (very often) unstable path: SAS link resets, bad cable, expander issues.

Fix: Check dmesg for resets/timeouts; check SMART CRC counters; reseat/replace cables; update HBA firmware; replace disk if medium errors/pending sectors appear.

3) Pool goes FAULTED during RAIDZ1 rebuild

Symptom: After a second “minor” issue, pool becomes unavailable or reports permanent errors.

Root cause: RAIDZ1 had no margin; a URE or second device/path failure made some blocks unreconstructable.

Fix: Restore from backups/snapshots; rebuild with RAIDZ2 or mirrors; increase scrub cadence; stop using RAIDZ1 for large HDD vdevs in production.

4) Rebuild is slow only on one chassis / shelf

Symptom: Similar pools elsewhere rebuild faster; this one always drags.

Root cause: Backplane/expander bottleneck, thermal throttling, or bad cabling localized to that enclosure.

Fix: Compare error logs across hosts; swap cables; move a known-good disk to reproduce; check temps; validate expander firmware compatibility.

5) Scrubs always find “a few errors,” but nobody cares

Symptom: Regular checksum repairs or occasional permanent errors are normalized.

Root cause: Underlying media decay or unstable IO path; scrub is acting as your canary.

Fix: Treat scrub repairs as incidents with thresholds; replace suspect disks; validate cabling; ensure redundancy level can tolerate reality.

6) Adding more tuning made the resilver worse

Symptom: After “performance tuning,” resilver takes longer and latency spikes.

Root cause: Tuning that increases IO amplification or CPU overhead; wider vdevs; wrong recordsize; too much compression/encryption without headroom.

Fix: Back out changes; re-evaluate with failure-path benchmarks; design for rebuild, not just for dashboards.

Checklists / step-by-step plan

When a disk fails (RAIDZ): the incident runbook

  1. Freeze the scene: capture zpool status -v, zpool events -v, and dmesg -T output for the timeline.
  2. Confirm redundancy remaining: RAIDZ1 degraded = zero margin. RAIDZ2 degraded = one margin. Adjust urgency accordingly.
  3. Check for a second weak disk: run SMART on all disks in the vdev, at least the usual suspects (same model batch, same bay, same expander chain).
  4. Stabilize the path: CRC errors or link resets mean you fix cabling/HBA before you burn more drives.
  5. Replace the failed device correctly: use zpool replace with persistent device IDs; avoid device-name roulette.
  6. Reduce competing load: pause heavy batch, slow VM migration churn, defer backups that hammer the pool.
  7. Monitor resilver rate and errors: watch for CKSUM growth and slowing throughput that signals a second failure forming.
  8. After resilver, scrub: validate the pool and flush out latent corruption while you still have redundancy.
  9. Do a post-incident capacity and design review: if resilver took “too long,” it’s a design problem, not bad luck.

Design checklist: building RAIDZ to survive rebuild math

  1. Prefer RAIDZ2 for large HDD vdevs: treat RAIDZ1 as archival or small-disk territory where exposure windows are short.
  2. Keep vdev width reasonable: wider vdevs increase rebuild read fan-out and risk concentration.
  3. Keep pools under a utilization target: plan for 70–80% as “normal,” not “wasted.”
  4. Scrub on a schedule you can explain: frequent enough to catch latent errors before a failure; slow enough not to drown production.
  5. Standardize on persistent device naming: by-id paths, not /dev/sdX.
  6. Test resilver behavior: in staging, simulate a disk replacement and measure rebuild time with typical workload patterns (see the drill sketch after this checklist).
  7. Keep spares and human spares: having a hot spare is good; having someone who knows the procedure is better.
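
For checklist item 6, a hedged sketch of a replacement drill on a throwaway, file-backed pool. The pool name, paths, and sizes are invented, it needs root and a reasonably recent OpenZFS (zpool wait), and it must never point at a production pool; it will not predict your HDD resilver time, but it validates the procedure and gives you a repeatable baseline:

#!/usr/bin/env python3
# Replacement drill on a disposable file-backed pool: create, fill, "fail" a
# disk, replace it, and time the resilver. Run on a test box as root only.
import pathlib, subprocess, time

POOL = "resilver_drill"                          # invented name
DIR = pathlib.Path("/var/tmp/resilver_drill")    # invented path
DISKS = [DIR / f"disk{i}.img" for i in range(6)]
SPARE = DIR / "replacement.img"

def run(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)

DIR.mkdir(exist_ok=True)
for f in (*DISKS, SPARE):
    run("truncate", "-s", "2G", str(f))          # sparse backing files

run("zpool", "create", "-f", POOL, "raidz2", *map(str, DISKS))
run("dd", "if=/dev/urandom", f"of=/{POOL}/fill", "bs=1M", "count=2048")  # ~2 GiB of data

run("zpool", "offline", POOL, str(DISKS[0]))     # simulate a dead disk
start = time.monotonic()
run("zpool", "replace", POOL, str(DISKS[0]), str(SPARE))
run("zpool", "wait", "-t", "resilver", POOL)     # blocks until the resilver finishes
print(f"toy resilver took {time.monotonic() - start:.1f}s")

run("zpool", "destroy", POOL)                    # clean up the drill pool

On real hardware, repeat the drill with representative data and the usual workload running, and record the number where your capacity planning can see it.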

FAQ

1) Is RAIDZ resilvering safer than classic RAID rebuilds?

Often safer because ZFS has end-to-end checksums and can resilver allocated blocks, not every LBA.
But degraded RAIDZ1 is still a razor’s edge: one more failure mode (disk, path, or unreadable sector) can end the vdev.

2) Why does “scanned” exceed “issued” in zpool status?

“Scanned” is the amount ZFS examined; “issued” is IO actually sent for resilvering. Scanning can include metadata traversal and work that doesn’t translate 1:1 into writes.
If issued rate is low, you’re often bottlenecked by random IO, contention, or a slow disk.

3) Should I crank resilver priority to finish faster?

Sometimes, but only with intent. If you starve production IO too hard, you can trigger timeouts and secondary failures.
The safer move is to reduce workload first (pause batch, drain VMs), then tune if needed.

4) Does RAIDZ2 guarantee I won’t lose data during rebuild?

No. It tolerates two disk failures in a vdev, but it doesn’t protect you from: firmware bugs, operator mistakes, mis-cabling, controller resets affecting multiple disks,
or three concurrent failures. It just makes the common bad day survivable.

5) How wide should a RAIDZ2 vdev be?

There isn’t one magic number. Operationally, “reasonable width” means: rebuild time is acceptable, and single-disk latency doesn’t dominate.
For HDD pools, many teams land on moderate-width RAIDZ2 vdevs rather than ultra-wide monsters. Measure your own workload.

6) Why did my resilver get slower at 60% than at 10%?

Early phases may hit more sequential regions or cached metadata. Later phases can encounter more fragmented allocations, colder data, and smaller blocks.
Also, other workloads may ramp back up as people assume “it’s almost done.”

7) If SMART says PASSED, is the disk fine?

No. SMART overall status is a crude summary. Look at pending sectors, offline uncorrectables, and CRC errors.
A disk can “PASS” while actively sabotaging your resilver.

8) Should I scrub more often or less often?

More often than “never,” less often than “constantly.” The goal is to surface latent errors while you still have redundancy and before a failure forces a resilver.
If scrubs impact production, schedule them and control workload; don’t abandon them.

9) Mirrors vs RAIDZ for rebuild risk—what’s the trade?

Mirrors resilver by copying from one surviving disk to one replacement disk, typically faster and with less read fan-out.
RAIDZ has better capacity efficiency but larger rebuild blast radius. If your workload is IOPS-heavy or your recovery time objective is strict, mirrors often win.

10) Can I just “add one more disk” to a RAIDZ vdev to reduce risk?

Risk is mostly about parity level, rebuild exposure time, and error behavior. Adding disks can increase throughput, but it can also widen the failure domain.
If you need more safety, increase parity (RAIDZ2/3) or change vdev topology—not just width.

Next steps you should take this week

If you’re running RAIDZ in production, treat resilver math like a budget. You’re spending safety every hour you’re degraded.
The job is to shorten the window and increase the margin.

  1. Audit parity level: find any RAIDZ1 vdevs backing important data. If the disks are large HDDs, plan migration to RAIDZ2 or mirrors.
  2. Measure real resilver time: not vendor MB/s. Use a controlled replacement drill in staging or a noncritical pool and record timings.
  3. Set scrub cadence and thresholds: decide what number of repaired/checksum errors triggers disk replacement or cabling inspection.
  4. Check pool fullness and fragmentation: if you live above 80%, you’re choosing longer, riskier rebuilds.
  5. Harden the boring stuff: persistent device naming, spare drives, known-good firmware, and runbooks that don’t assume heroics.

The day you lose data is rarely the day the first disk fails. It’s the day you discover your rebuild assumptions were fantasy.
Make the math real now, while you can still choose the shape of your failure.
