ZFS Sequential Resilver: Why Rebuild Speed Isn’t Just “Disk MB/s”

You replace a failed disk, kick off a resilver, and expect to watch a nice steady stream of hundreds of MB/s. Instead you get a rebuild that starts fast, then crawls, then oscillates like it’s being throttled by an unseen committee. Meanwhile your application latency goes sideways and management wants to know why “the new drive is slow.”

Here’s the uncomfortable truth: resilver speed is a systems problem. ZFS resilvers don’t happen in a vacuum. They compete with real workloads, traverse real metadata, and are constrained by vdev geometry, allocation history, and the physics of small random reads—even when the write side looks “sequential.”

What “sequential resilver” really means (and what it doesn’t)

When people say “sequential resilver,” they usually mean: “ZFS will rebuild by reading big contiguous chunks from healthy disks and writing big contiguous chunks to the replacement disk.” That’s the dream. And sometimes it happens.

But sequential resilvering is not a promise that the underlying I/O pattern is purely sequential end-to-end. It’s a strategy: preferentially scanning space maps and allocation metadata to rebuild only what’s allocated, and doing it in a way that tends to group work to reduce seeks.

What it is:

  • A method to rebuild allocated data, not the entire address space of the device (like old-school RAID rebuilds).
  • A best-effort attempt to minimize random I/O by walking metaslabs and allocation trees in a locality-friendly order.
  • A workload that is often read-limited by the surviving devices and metadata traversal, even if the writes to the new disk look sequential.

What it is not:

  • Not “copy the whole disk from LBA 0 upward.” ZFS doesn’t rebuild free space, and it doesn’t have to.
  • Not “one thread, one stream.” Resilver is pipelined and concurrent: reads, checksum verification, decompression, parity reconstruction (RAIDZ), and writes.
  • Not “immune to fragmentation.” ZFS can be sequential on paper and still be forced into scattered reads because the data is scattered.

Resilver time is the time it takes to find the blocks that matter, read them reliably, reconstruct them correctly, and write them in a way that keeps the pool consistent. “Disk MB/s” is just one variable, and it’s rarely the limiting one when things get ugly.

Why “disk MB/s” is the wrong unit for resilver planning

Sequential throughput benchmarks assume you’re doing something like dd across a clean device. A resilver is doing something closer to:

  • metadata-driven discovery (space maps, block pointers, indirect blocks)
  • mixed-size reads (records, partial records, small metadata blocks)
  • checksum verification for every block read
  • optional decompression (read path) and compression (write path)
  • RAIDZ parity math (CPU + memory bandwidth)
  • queueing and scheduling across multiple vdev members
  • write allocation and metaslab updates on the target

This is why two pools built from identical disks can resilver at radically different speeds:

  • Pool A: 40% full, low fragmentation, recordsize matched to workload, wide mirrors, enough IOPS headroom. Resilver looks “sequential.”
  • Pool B: 85% full, years of churn, lots of small blocks, RAIDZ with pathological read amplification, and a heavy foreground workload. Resilver looks like a crime scene.

Even on the same pool, resilver speed can change hour-to-hour because the bottleneck moves: first you’re reading big data extents quickly, then you hit metadata-heavy regions and suddenly you’re IOPS-limited. The progress bar keeps moving, but your expectations shouldn’t.

Joke #1: If you think resilver speed is “just disk MB/s,” I have a storage array that benchmarks at 10 GB/s—right up until you use it.

So what should you plan for instead of MB/s? Think in constraints:

  • Read IOPS ceiling of the surviving vdev members (often the limiter).
  • Small-block behavior (metadata, dedup tables, special vdev hits/misses).
  • Allocation history (fragmentation, free space distribution, metaslab sizes).
  • Concurrency (resilver threads, ZIO pipeline, ARC behavior).
  • Operational budget: how much latency you can tolerate for production while it runs.
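
If you still need a planning number, derive it from the observed issued rate rather than a datasheet figure. A minimal back-of-envelope sketch, using hypothetical values you would pull from your own zpool status output:

cr0x@server:~$ # hypothetical numbers: take "total" and the issued rate from zpool status, not the drive spec sheet
cr0x@server:~$ total_gib=3174          # data the resilver must issue (~3.1T)
cr0x@server:~$ issued_mib_s=204        # observed issued rate
cr0x@server:~$ echo "rough finish: $(( total_gib * 1024 / issued_mib_s / 3600 )) hours (if the rate holds, which it won't)"
rough finish: 4 hours (if the rate holds, which it won't)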

Interesting facts and small historical context

  1. ZFS popularized “rebuild allocated data only” as a first-class design goal, avoiding traditional “rebuild the whole disk” behavior common in classic RAID controllers.
  2. Scrubs and resilvers share machinery, but resilver has to write and update state, which changes contention patterns in the I/O pipeline.
  3. RAIDZ resilvering is inherently read-amplified because reconstructing a block often requires reading multiple columns; mirrors can frequently read from a single side.
  4. Block pointer checksums are central: ZFS reads are verified end-to-end, so “fast but silent corruption” is not on the menu.
  5. Ashift choices became more painful over time as disks moved to 4K physical sectors; misalignment can turn “sequential” work into extra read-modify-write cycles.
  6. Space maps evolved to better track free/allocated ranges, and that bookkeeping directly affects how efficiently ZFS can walk allocated space during rebuild.
  7. Device removal and sequential resilver improvements in OpenZFS ecosystems made rebuilds less punishing than early-era “scan everything” approaches—when the pool layout cooperates.
  8. Special vdevs changed the game for metadata-heavy pools: resilver speed can improve dramatically if metadata reads are served quickly, but only if the special vdev is healthy and sized correctly.

The real bottlenecks: where resilvers actually spend time

1) The surviving disks are doing the hard part

The replacement disk is usually not the limiting factor. During resilver, the pool has to read all the blocks that belonged to the failed device from the surviving members. Those reads are frequently:

  • spread across the vdev’s address space (fragmentation)
  • mixed-size, including lots of tiny metadata reads
  • competing with your production workload

So you can put a brand new, fast drive in, and it won’t matter because the old drives are busy seeking all over their platters like it’s 2007.

2) RAIDZ parity reconstruction amplifies reads

With mirrors, ZFS often needs one good copy to rebuild a block. With RAIDZ, reconstructing a block generally requires reading the other columns in the stripe. This is why “same raw TB, same disks” does not mean “same resilver time.” RAIDZ trades capacity efficiency for more complex rebuild behavior, and resilver is where you pay that bill.
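
A rough way to see the difference, assuming a single failed member and a hypothetical 10-wide RAIDZ2 (reconstructing one missing column needs roughly a data-width's worth of surviving columns):

cr0x@server:~$ # hedged arithmetic sketch, not a measurement: width and parity here are assumptions
cr0x@server:~$ W=10; P=2; D=$((W - P))
cr0x@server:~$ echo "raidz$P, $W-wide: ~$D column reads per reconstructed block; mirror: ~1 read per block"
raidz2, 10-wide: ~8 column reads per reconstructed block; mirror: ~1 read per block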

3) Fragmentation turns sequential intent into random reality

Sequential resilver tries to walk allocations in order, but if your data was written in a churn-heavy pattern (VM images, databases with frequent rewrites, object stores with deletes), the “next block” might be nowhere near the previous one on disk.

Two practical consequences:

  • Your reads become IOPS-bound, not throughput-bound.
  • Your pool can show “low MB/s” while still being saturated (high utilization, high latency).

4) Metadata can dominate the tail

The first half of a resilver often looks faster because you’re copying larger, more contiguous extents. The end gets weird: tiny blocks, indirect blocks, dnodes, spill blocks, spacemap updates. That’s when the progress percentage moves slowly and everyone starts accusing the disks.

5) Compression, checksums, and CPU aren’t free

On modern CPUs, checksum and compression overhead is usually manageable, until it isn’t. If you’re doing RAIDZ parity math plus high compression ratios plus a busy box, CPU can become the bottleneck. This shows up as:

  • disks not fully utilized (low %util) while resilver is slow
  • high system CPU, kernel time, or softirq time depending on platform
  • ARC pressure causing extra disk reads

6) The pool is a shared resource; resilver competes

If you allow a resilver to run “full send” during peak hours, the pool can become an I/O hostage situation. ZFS tries to be fair, but fairness is not the same as “your database stays happy.” You need to decide what you’re optimizing for: fastest rebuild time, or acceptable production latency. Trying to get both without measurement is how you get neither.

Joke #2: A resilver is like roadwork: it’s always scheduled for when you’re already late.

One reliability idea worth keeping on your wall

Hope is not a strategy.

It's a well-worn operations maxim rather than anything ZFS-specific, but it's painfully relevant here: don't hope your resilver will be fast. Measure, design, and practice.

Fast diagnosis playbook

This is the triage path when someone pings you: “resilver is slow.” Don’t start tuning knobs blindly. Find the limiter.

First: confirm what kind of slowness you have

  • Low MB/s but high disk utilization? You’re likely IOPS/seek limited (fragmentation, small blocks, RAIDZ reads).
  • Low MB/s and low disk utilization? You’re likely CPU-limited, throttled, or blocked on something else (ARC misses, special vdev, scheduler, tunables).
  • High latency spikes on clients? Resilver is competing with production workload; you need to budget I/O.

Second: isolate whether reads or writes are limiting

  • If the healthy disks show high read ops and high latency: read side is limiting.
  • If the new disk shows high write latency or errors: target write path is limiting (bad disk, cabling, HBA, SMR behavior).

Third: check for pathology

  • Any checksum errors during resilver? Stop assuming “performance problem.” You may have a reliability problem.
  • Any SMR drives, odd firmware, or power management quirks? Rebuild behavior can collapse under sustained writes.
  • Pool nearly full or heavily fragmented? Expect tail-latency and slow finish.

Fourth: decide on the operational goal

  • Goal A: finish resilver ASAP (accept user impact).
  • Goal B: keep production stable (accept longer resilver).

Make that decision explicitly, then tune. Otherwise you’ll oscillate between “too slow” and “users are screaming” and pretend it’s a mystery.

Hands-on tasks: commands, outputs, and decisions (12+)

Below are practical tasks I use in production. Each includes (1) command, (2) what the output means, (3) the decision you make.

Task 1: Confirm resilver state and scan rate

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.
scan: resilver in progress since Tue Dec 24 11:02:12 2025
        1.23T scanned at 410M/s, 612G issued at 204M/s, 3.10T total
        148G resilvered, 19.11% done, 0 days 03:41:22 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            ...
            sdj                     ONLINE       0     0     0
            replacing-1             DEGRADED     0     0     0
              sdx                   UNAVAIL      0     0     0  cannot open
              sdnew                 ONLINE       0     0     0  (resilvering)

errors: No known data errors

Meaning: “scanned” vs “issued” tells you how much metadata traversal is happening versus actual data reconstruction. If scanned is high but issued is low, you’re doing a lot of bookkeeping and seeking.

Decision: If “issued” is far below what the disks can do, go hunt for IOPS/latency limiters (fragmentation, small blocks, RAIDZ). If errors appear, switch from performance mode to data-integrity mode immediately.
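
One practical habit: the rates zpool status prints are averaged over the scan so far, so they can lag what the pool is doing right now. Sample the scan line twice and diff the issued figures yourself:

cr0x@server:~$ # two samples 60s apart; subtract the issued values to get the current real rate
cr0x@server:~$ zpool status tank | grep issued; sleep 60; zpool status tank | grep issued
        1.23T scanned at 410M/s, 612G issued at 204M/s, 3.10T total
        1.25T scanned at 409M/s, 623G issued at 203M/s, 3.10T total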

Task 2: Watch per-vdev I/O during resilver

cr0x@server:~$ zpool iostat -v tank 1
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        51.2T  13.8T  8.12K  1.90K   812M   162M
  raidz2-0                  51.2T  13.8T  8.12K  1.90K   812M   162M
    sda                         -      -    310     75  31.2M  6.1M
    sdb                         -      -    298     72  30.8M  6.0M
    sdc                         -      -    315     76  31.4M  6.2M
    ...
    sdnew                       -      -     12    980   1.1M  98.4M
--------------------------  -----  -----  -----  -----  -----  -----

Meaning: Healthy disks are doing lots of reads; the new disk is mostly writes. If read ops are high but read MB/s per disk is low, you’re IOPS-bound and seeking.

Decision: If per-disk read MB/s is anemic but ops are high, stop expecting line-rate. Consider throttling resilver to protect latency, or scheduling off-hours; tuning for “speed” won’t beat physics.

Task 3: Check disk-level latency and saturation

cr0x@server:~$ iostat -x 1 3
Device            r/s     w/s   rkB/s   wkB/s  await  aqu-sz  %util
sda             310.0    70.0  32000    6200   28.4    8.2   99.0
sdb             295.0    68.0  30800    6000   30.1    8.5   98.7
sdnew            10.0   980.0   1100  102000   12.7    2.1   72.4

Meaning: The old disks are pegged at ~99% util with ~30ms await. That’s classic seek/IOPS limitation. New disk isn’t saturated.

Decision: Your bottleneck is surviving-disk read behavior. If production latency matters, cap resilver aggressiveness rather than chasing new-disk throughput.

Task 4: Inspect pool health and error counters

cr0x@server:~$ zpool status tank
  pool: tank
 state: DEGRADED
scan: resilver in progress since Tue Dec 24 11:02:12 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     2
            sdnew   ONLINE       0     0     0

errors: No known data errors

Meaning: CKSUM errors on a surviving disk during a resilver are a red flag. Even if ZFS can correct them, they cost time and risk.

Decision: Pull SMART, cabling, and HBA logs. If errors climb, stop “optimizing speed” and start preparing for another failure and possible data loss scenario.

Task 5: Check SMART for the replacement and the stressed survivors

cr0x@server:~$ sudo smartctl -a /dev/sdnew | sed -n '1,80p'
=== START OF INFORMATION SECTION ===
Device Model:     ST18000NM000J
Firmware Version: SN05
User Capacity:    18,000,207,937,536 bytes
Sector Sizes:     512 bytes logical, 4096 bytes physical
SMART overall-health self-assessment test result: PASSED
...
ID# ATTRIBUTE_NAME          VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    200   200   000    Old_age   Always       -       12

Meaning: CRC errors often implicate cabling/backplane/HBA path issues. During resilver, link flakiness can look like “random slowness.”

Decision: If CRC counts increase during load, reseat/replace cables or move bays. Don’t keep resilvering over a shaky link.
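
To catch a climbing counter early, trend the raw CRC value across the stressed survivors as well as the replacement. A small sketch; the device list is an example, and the attribute applies to SATA drives:

cr0x@server:~$ for d in sda sdb sdc; do printf '%s: ' "$d"; sudo smartctl -A /dev/$d | awk '/UDMA_CRC_Error_Count/ {print $NF}'; done
sda: 0
sdb: 12
sdc: 0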

Task 6: Identify ashift and top-level vdev layout

cr0x@server:~$ zdb -C tank | egrep -A3 'ashift|vdev_tree|type'
        vdev_tree:
            type: 'root'
            id: 0
            ashift: 12

Meaning: ashift=12 implies 4K sectors. If this is wrong (e.g., 9 on 4K drives), resilver can suffer from alignment penalties and extra I/O.

Decision: If ashift is wrong, the “fix” is generally a rebuild/migration, not a tuning knob. Plan it; don’t argue with geometry.
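
To check what the geometry actually is, compare the drives' logical and physical sector sizes with what ashift implies (ashift=12 expects 4096-byte allocation units). The device names here are examples:

cr0x@server:~$ lsblk -d -o NAME,LOG-SEC,PHY-SEC /dev/sda /dev/sdnew
NAME  LOG-SEC PHY-SEC
sda       512    4096
sdnew     512    4096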

Task 7: Check pool fragmentation and fullness

cr0x@server:~$ zpool list -o name,size,alloc,free,frag,health tank
NAME  SIZE  ALLOC  FREE  FRAG  HEALTH
tank  65.0T 51.2T  13.8T   42%  DEGRADED

Meaning: High frag increases seeky reads. High allocation percentage reduces available contiguous space and can worsen allocation behavior during writes.

Decision: If frag is high and alloc is high, expect slow resilver tails. Consider adding capacity or rebalancing long before the next failure.

Task 8: Inspect dataset properties that change block shape

cr0x@server:~$ zfs get -o name,property,value -s local recordsize,compression,checksum tank/vmstore
NAME          PROPERTY     VALUE
tank/vmstore  recordsize   128K
tank/vmstore  compression  lz4
tank/vmstore  checksum     on

Meaning: recordsize affects how many blocks exist and how large they are. Small blocks increase metadata and IOPS needs; large blocks can improve sequentiality for streaming workloads.

Decision: If this dataset hosts small random I/O workloads (VMs, databases), the recordsize may not match the workload, and changing it mid-life won't rewrite existing blocks. Plan a migration or rewrite window if you need a different block profile.
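
If you do decide to change it, remember the property only applies to blocks written after the change; existing data keeps its old block size until rewritten. A minimal sketch (the 16K value is just an example for a small-random-I/O dataset):

cr0x@server:~$ sudo zfs set recordsize=16K tank/vmstore
cr0x@server:~$ zfs get -o name,property,value recordsize tank/vmstore
NAME          PROPERTY    VALUE
tank/vmstore  recordsize  16K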

Task 9: Check special vdev and its health (metadata acceleration)

cr0x@server:~$ zpool status tank | sed -n '1,80p'
  pool: tank
 state: DEGRADED
config:

        NAME            STATE     READ WRITE CKSUM
        tank            DEGRADED     0     0     0
          special       ONLINE       0     0     0
            nvme0n1p1    ONLINE       0     0     0
            nvme1n1p1    ONLINE       0     0     0
          raidz2-0      DEGRADED     0     0     0
            ...

Meaning: A special vdev can dramatically affect resilver performance by serving metadata quickly. If it’s missing or unhealthy, metadata reads fall back to HDDs, and the tail gets brutal.

Decision: If your workload is metadata-heavy and you don’t have a special vdev, consider designing one for the next hardware refresh. If you have one, protect it like it’s production secrets.

Task 10: Confirm ARC pressure and memory headroom

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
11:10:01  6120  1480     24   820  55    60   4   600  41   112G   128G
11:10:02  5988  1622     27   910  56    72   4   640  39   112G   128G
11:10:03  6210  1588     25   880  55    55   3   653  41   112G   128G

Meaning: High miss rates during resilver can force extra disk reads. That slows resilver and increases latency for everything else.

Decision: If ARC is too small for the working set, you can’t “tune” your way out permanently. Add memory, reduce workload churn, or move metadata to faster devices.
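
Before concluding the cache is simply too small, check whether it has been capped below what the box could use. A value of 0 here means the built-in default cap applies; the exact default depends on the OpenZFS version:

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max
0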

Task 11: Watch CPU saturation and kernel time

cr0x@server:~$ mpstat -P ALL 1 2
Linux 6.8.0 (server)  12/24/2025  _x86_64_  (32 CPU)

Average:     CPU   %usr  %sys  %iowait  %idle
Average:     all   18.2  42.7     6.1    33.0
Average:      0    21.0  55.0     3.0    21.0
Average:      1    16.0  51.0     4.0    29.0

Meaning: High %sys during RAIDZ resilver can indicate parity/checksum overhead, plus I/O stack overhead. If disks are not saturated but %sys is high, CPU is a suspect.

Decision: Consider reducing resilver concurrency to relieve CPU if production is suffering, or schedule resilver-heavy periods when CPU headroom is available.

Task 12: Check zed events and kernel logs for link resets

cr0x@server:~$ sudo dmesg -T | tail -n 20
[Tue Dec 24 11:22:14 2025] ata12: hard resetting link
[Tue Dec 24 11:22:15 2025] ata12: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Tue Dec 24 11:22:15 2025] sd 11:0:0:0: [sdb] tag#18 timing out command, waited 180s
[Tue Dec 24 11:22:15 2025] blk_update_request: I/O error, dev sdb, sector 1987654321 op 0x0:(READ)

Meaning: Link resets and command timeouts are not “performance.” They are reliability events that manifest as performance collapse.

Decision: Replace cables/backplane/HBA port, possibly the disk. Don’t continue a resilver with a flapping path unless you enjoy living dangerously.

Task 13: Confirm scrub/resilver tunables (OpenZFS on Linux)

cr0x@server:~$ ls /sys/module/zfs/parameters | egrep 'resilver|scan|scrub' | head
zfs_no_scrub_io
zfs_no_scrub_prefetch
zfs_resilver_disable_defer
zfs_resilver_min_time_ms
zfs_scan_ignore_errors
zfs_scan_legacy
zfs_scan_suspend_progress
zfs_scan_vdev_limit
zfs_scrub_min_time_ms
zfs_vdev_scrub_max_active

Meaning: On Linux, OpenZFS exposes its tunables as module parameters rather than sysctls (FreeBSD surfaces the equivalents in its sysctl tree), and the exact names vary by release. Scan state itself is easiest to read from zpool status; these parameters control how aggressively the scan is issued relative to foreground I/O.

Decision: Don’t change tunables because a blog told you to. Only change them when you know whether you’re latency-budgeting or speed-maximizing, and you can roll back.
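
Reading them is always safe, though. A minimal read-only sketch for inspecting two of the commonly discussed knobs; treat the names and the values shown as assumptions, since they differ across OpenZFS releases and defaults have changed over time:

cr0x@server:~$ # read-only check; do not write new values until you know your goal and your rollback
cr0x@server:~$ grep -H . /sys/module/zfs/parameters/zfs_resilver_min_time_ms /sys/module/zfs/parameters/zfs_scan_vdev_limit
/sys/module/zfs/parameters/zfs_resilver_min_time_ms:3000
/sys/module/zfs/parameters/zfs_scan_vdev_limit:4194304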

Task 14: Identify SMR drives (a silent rebuild killer)

cr0x@server:~$ lsblk -d -o NAME,MODEL,ROTA
NAME  MODEL              ROTA
sda   ST18000NM000J         1
sdb   ST18000NM000J         1
sdnew ST18000NM000J         1

Meaning: This doesn’t directly tell you SMR vs CMR, but it gives you the model string to cross-check against vendor documentation and your own approved-drive list (see the zoned-device check below). If you unknowingly mixed in SMR, sustained writes during resilver can crater.

Decision: Maintain an approved-drive list for pools. If you suspect SMR, stop and verify model behavior; “it works most of the time” is not a storage strategy.
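
On Linux, one extra check can catch host-aware or host-managed SMR outright. Drive-managed SMR will still report "none", so treat a clean result as necessary but not sufficient:

cr0x@server:~$ # device names are examples; "none" rules out host-managed/host-aware zones only
cr0x@server:~$ for d in sda sdb sdnew; do printf '%s: %s\n' "$d" "$(cat /sys/block/$d/queue/zoned)"; done
sda: none
sdb: none
sdnew: none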

Three corporate-world mini-stories

Mini-story 1: The incident caused by a wrong assumption

The company ran a large analytics cluster on ZFS over RAIDZ2. When a disk failed, the on-call engineer swapped it and told everyone the pool would be healthy “by lunch” because each drive could do 250 MB/s sequential writes. That number came from a vendor datasheet and a quick benchmark on an empty disk.

The resilver started at a few hundred MB/s, which felt validating. Then it slowed to a jittery 30–60 MB/s “issued,” while scanned stayed high. Production queries began timing out, not because the network was slow, but because the pool’s read latency spiked. Users filed tickets. Management filed calendar invites.

The wrong assumption wasn’t “ZFS is slow.” The wrong assumption was that the replacement disk’s sequential write speed determines rebuild time. The pool was 80% full with years of churn. Reconstructing the missing column forced scattered reads from all surviving disks. Those disks were now doing high-IOPS random reads while also serving the live workload.

The fix was boring: throttle the resilver during business hours, resume aggressively overnight, and add a policy that pools should not live above a defined fullness threshold. After that, the lunch promises stopped. The pools did not become magically faster, but the business stopped being surprised.

Mini-story 2: The optimization that backfired

A different team wanted faster resilvers, so they went hunting for tunables. They increased scan concurrency and made resilver “more parallel,” expecting the pool to chew through work faster. On paper, it did: more I/O operations per second.

In reality, application latency doubled. The database wasn’t bandwidth-bound; it was tail-latency-bound. The increased resilver concurrency caused deeper queues on the HDDs. “Average throughput” looked better in a dashboard while real user queries got slower. The team then reacted by adding more application retries, which increased read load, which made queues deeper. Classic.

They rolled back the tuning and took a step that sounds almost insulting: they let resilver run slower. But they pinned a maximum acceptable latency for the database and tuned resilver aggressiveness to stay under it. The rebuild took longer, but the business stopped noticing.

The backfire lesson: resilver tuning is not a free lunch. If your pool is seek-limited, more concurrency can just mean more outstanding seeks. You don’t get “faster.” You get “louder.”

Mini-story 3: The boring but correct practice that saved the day

A financial-services platform ran ZFS mirrors for transactional data and RAIDZ for cold storage. Nothing exotic. Their secret weapon was a dull set of practices: scrubs on schedule, SMART trending, and a policy that any disk with growing UDMA_CRC errors gets moved to a different bay before it’s allowed back into a pool.

One week, a disk failed in a mirror vdev during peak processing. Resilver started and looked normal. An hour in, another disk in the same chassis started throwing CRC errors, and the kernel logged link resets. This is where a lot of teams “wait and see” and pray the resilver finishes first.

They didn’t wait. They paused the chaos, moved the suspect disk to a known-good slot, replaced the cable, and resumed. Resilver finished without a second failure. The transaction system stayed within latency SLOs because the team also had a standard procedure: during peak, limit resilver impact; at night, let it run.

The practice that saved them wasn’t a miracle tunable. It was disciplined hygiene and a willingness to treat flaky links as production incidents, not “weird performance.”

Common mistakes: symptoms → root cause → fix

Mistake 1: “Resilver starts fast then slows to a crawl”

Symptom: High initial MB/s, then the tail takes forever; disks show high util but low throughput.

Root cause: Fragmentation + metadata-heavy tail; remaining work is many small scattered blocks, which is IOPS-bound.

Fix: Accept the physics; don’t panic-tune. Keep pools below high-fullness thresholds, prefer mirrors for churny workloads, and consider special vdev for metadata-heavy cases.

Mistake 2: “New disk is fast, but resilver is slow”

Symptom: Replacement disk write is moderate; surviving disks are saturated on reads.

Root cause: Read side is the limiter (RAIDZ read amplification, seeky reads, workload contention).

Fix: Diagnose per-disk latency; throttle resilver to protect production; design future vdevs with rebuild behavior in mind (mirrors vs RAIDZ width).

Mistake 3: “Resilver speed collapses unpredictably”

Symptom: Periodic stalls; huge latency spikes; kernel logs show resets/timeouts.

Root cause: Link instability (cables, backplane, expander, HBA), or a marginal disk under sustained load.

Fix: Treat as hardware incident. Check dmesg, SMART CRC counts, HBA logs; move the drive bay/port; replace suspect components.

Mistake 4: “Tuned resilver for speed; users complain”

Symptom: Resilver finishes sooner but app latency and timeouts increase.

Root cause: Queue depth and fairness: more resilver concurrency increases tail latency and steals IOPS from foreground.

Fix: Tune to an SLO, not a stopwatch. Cap resilver aggressiveness during peak, schedule aggressive windows off-hours.

Mistake 5: “Checksum errors appear during resilver; performance drops”

Symptom: CKSUM counts increment; resilver slows; possible read retries.

Root cause: Media issues, bad cabling, or a second failing disk; ZFS is correcting (if it can), which costs time.

Fix: Stop focusing on speed. Stabilize hardware, identify the fault domain, and consider preemptive replacement of the suspect member.

Mistake 6: “Pool is 90% full and resilver is terrible”

Symptom: Resilver takes dramatically longer than past events; allocation looks chaotic.

Root cause: Free space is fragmented; metaslabs have fewer contiguous extents; both reads and writes become less efficient.

Fix: Add capacity earlier. If you can’t, migrate datasets off, reduce churn, and stop treating 90% full as normal operations.

Checklists / step-by-step plan

Step-by-step: handling a disk failure with minimal drama

  1. Freeze the story: capture zpool status -v, zpool iostat -v 1 (30 seconds), and iostat -x 1 (30 seconds); a capture sketch follows this list. This is your before/after truth.
  2. Confirm the fault domain: is it a disk, a cable, a bay, an expander lane? Check SMART CRC and kernel logs.
  3. Replace the device correctly: use zpool replace and confirm the correct GUID/device mapping. Human errors love drive bays.
  4. Decide the operational mode: “finish fast” vs “protect latency.” Write it down in the ticket. Be accountable.
  5. Monitor health first: any READ/WRITE/CKSUM increases? If yes, pause performance tuning and stabilize hardware.
  6. Monitor bottlenecks: per-disk util and await; if old disks are pegged, stop pretending the new disk matters.
  7. Protect users: if latency-sensitive, reduce resilver aggressiveness during peak windows rather than forcing throughput.
  8. After resilver: run a scrub (or ensure next scheduled scrub happens soon) to confirm no silent issues remain.
  9. Postmortem the root cause: not “a disk died,” but “why did it die and what else shares that failure domain?”
  10. Capacity/fragmentation follow-up: if you were >80% full, treat that as an incident contributor and plan remediation.
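
For step 1, a minimal capture sketch; the output path and the 30-second windows are arbitrary examples:

cr0x@server:~$ ts=$(date +%Y%m%d-%H%M%S)
cr0x@server:~$ { zpool status -v tank; zpool iostat -v tank 1 30; iostat -x 1 30; } > /var/tmp/resilver-$ts.log 2>&1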

Design checklist: build pools that don’t make resilver a career

  • Choose vdev types based on churn: mirrors for high-churn random I/O; RAIDZ for colder, more sequential, less rewrite-heavy data.
  • Keep pools with headroom: don’t live in the 85–95% zone and act surprised when rebuilds are awful.
  • Standardize drive models: avoid mixing unknown SMR behavior into write-heavy pools.
  • Plan for metadata: special vdev can help, but only if mirrored and properly sized and monitored.
  • Practice failures: rehearse replace/resilver procedures; document device naming and bay mapping; confirm alerting works.

FAQ

1) Is a resilver always faster than a traditional RAID rebuild?

Often, because ZFS rebuilds allocated blocks rather than the whole device. But if your pool is very full and fragmented, the advantage shrinks and the tail can still be brutal.

2) Why does “scanned” differ from “issued” in zpool status?

“Scanned” is how much address space/metadata traversal has been walked; “issued” is actual I/O issued for reconstruction. Big gaps usually mean heavy metadata work or inefficient access patterns.

3) Mirrors or RAIDZ for faster resilvers?

Mirrors generally resilver faster and with less read amplification. RAIDZ resilvers can be significantly slower because reconstruction often requires reading multiple columns for each block.

4) Can I just throttle the resilver to keep production stable?

Yes, and you often should. The risk tradeoff is longer time in a degraded state. For latency-sensitive services, a controlled slow resilver is usually safer than a fast one that causes timeouts and cascading failures.

5) Does adding a faster replacement disk speed up the resilver?

Sometimes, but commonly no. The limiting factor is often the surviving disks’ read IOPS and latency, not the target’s write throughput.

6) Why does the resilver slow down near the end?

The tail contains more small blocks and metadata-heavy work. Small random reads dominate, which collapses throughput even though the system is “busy.”

7) Is fragmentation a dataset problem or a pool problem?

Both. Fragmentation is driven by allocation patterns and free space distribution at the pool level, but churny datasets amplify it. The pool pays the price during resilver.

8) Are checksum errors during resilver normal?

No. ZFS can correct them if redundancy allows, but they indicate a real fault: disk media, cable, HBA path, or another device in trouble. Treat it as a reliability incident.

9) Does compression help or hurt resilver speed?

It can help if it reduces physical bytes read/written, but it can hurt if CPU becomes the bottleneck or if the workload is metadata/IOPS-bound. Measure CPU and disk utilization before blaming compression.

10) Is “sequential resilver” something I can force?

Not in the sense people mean. ZFS uses strategies to reduce random I/O, but the on-disk allocation reality and vdev layout dictate how sequential it can be.

Practical next steps

If you’re dealing with a slow resilver today:

  1. Run zpool status -v, zpool iostat -v 1, and iostat -x 1. Decide whether you’re IOPS-limited, CPU-limited, or dealing with hardware pathology.
  2. Check dmesg and SMART CRC counts. If links are flapping, fix hardware before you touch tunables.
  3. Pick a goal: fastest finish or stable production. Throttle/accelerate accordingly and document the decision.

If you’re designing for the next failure (the correct time to care):

  1. Stop overfilling pools. Headroom is performance and reliability insurance.
  2. Choose vdevs based on churn. Mirrors cost capacity; they buy you rebuild sanity.
  3. Invest in observability: per-disk latency, ZFS error counters, and a clear mapping of bays to device IDs.

The best resilver is the one you finish before anyone notices. That’s not magic. It’s design, measurement, and refusing to treat “disk MB/s” as a plan.
