You replace a failed disk, kick off a resilver, and expect to watch a nice steady stream of hundreds of MB/s. Instead you get a rebuild that starts fast, then crawls, then oscillates like it’s being throttled by an unseen committee. Meanwhile your application latency goes sideways and management wants to know why “the new drive is slow.”
Here’s the uncomfortable truth: resilver speed is a systems problem. ZFS resilvers don’t happen in a vacuum. They compete with real workloads, traverse real metadata, and are constrained by vdev geometry, allocation history, and the physics of small random reads—even when the write side looks “sequential.”
What “sequential resilver” really means (and what it doesn’t)
When people say “sequential resilver,” they usually mean: “ZFS will rebuild by reading big contiguous chunks from healthy disks and writing big contiguous chunks to the replacement disk.” That’s the dream. And sometimes it happens.
But sequential resilvering is not a promise that the underlying I/O pattern is purely sequential end-to-end. It’s a strategy: preferentially scanning space maps and allocation metadata to rebuild only what’s allocated, and doing it in a way that tends to group work to reduce seeks.
What it is:
- A method to rebuild allocated data, not the entire address space of the device (like old-school RAID rebuilds).
- A best-effort attempt to minimize random I/O by walking metaslabs and allocation trees in a locality-friendly order.
- A workload that is often read-limited by the surviving devices and metadata traversal, even if the writes to the new disk look sequential.
What it is not:
- Not “copy the whole disk from LBA 0 upward.” ZFS doesn’t rebuild free space, and it doesn’t have to.
- Not “one thread, one stream.” Resilver is pipelined and concurrent: reads, checksum verification, decompression, parity reconstruction (RAIDZ), and writes.
- Not “immune to fragmentation.” ZFS can be sequential on paper and still be forced into scattered reads because the data is scattered.
Resilver time is the time it takes to find the blocks that matter, read them reliably, reconstruct them correctly, and write them in a way that keeps the pool consistent. “Disk MB/s” is just one variable, and it’s rarely the limiting one when things get ugly.
Why “disk MB/s” is the wrong unit for resilver planning
Sequential throughput benchmarks assume you’re doing something like dd across a clean device. A resilver is doing something closer to:
- metadata-driven discovery (space maps, block pointers, indirect blocks)
- mixed-size reads (records, partial records, small metadata blocks)
- checksum verification for every block read
- optional decompression (read path) and compression (write path)
- RAIDZ parity math (CPU + memory bandwidth)
- queueing and scheduling across multiple vdev members
- write allocation and metaslab updates on the target
This is why two pools built from identical disks can resilver at radically different speeds:
- Pool A: 40% full, low fragmentation, recordsize matched to workload, wide mirrors, enough IOPS headroom. Resilver looks “sequential.”
- Pool B: 85% full, years of churn, lots of small blocks, RAIDZ with pathological read amplification, and a heavy foreground workload. Resilver looks like a crime scene.
Even on the same pool, resilver speed can change hour-to-hour because the bottleneck moves: first you’re reading big data extents quickly, then you hit metadata-heavy regions and suddenly you’re IOPS-limited. The progress bar keeps moving, but your expectations shouldn’t.
Joke #1: If you think resilver speed is “just disk MB/s,” I have a storage array that benchmarks at 10 GB/s—right up until you use it.
So what should you plan for instead of MB/s? Think in constraints:
- Read IOPS ceiling of the surviving vdev members (often the limiter).
- Small-block behavior (metadata, dedup tables, special vdev hits/misses).
- Allocation history (fragmentation, free space distribution, metaslab sizes).
- Concurrency (resilver threads, ZIO pipeline, ARC behavior).
- Operational budget: how much latency you can tolerate for production while it runs.
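The first constraint on that list usually dominates, so sanity-check expectations against IOPS rather than MB/s. Here is a minimal back-of-envelope sketch; every input (4 TiB to rebuild, 64 KiB average block, 10 surviving readers, 200 random-read IOPS each) is an assumption you should replace with your own measurements:

# Worst-case, IOPS-bound estimate: time ~= blocks_to_rebuild / aggregate_read_iops
# All inputs are illustrative assumptions, not measurements from a real pool.
awk 'BEGIN {
  rebuild_bytes = 4 * 1024^4;   # ~4 TiB of allocated data to reconstruct
  avg_block     = 64 * 1024;    # average rebuilt block size, in bytes
  disks         = 10;           # surviving disks serving reconstruction reads
  iops_per_disk = 200;          # sustainable random-read IOPS per HDD
  blocks  = rebuild_bytes / avg_block;
  seconds = blocks / (disks * iops_per_disk);
  printf "~%.1f hours if reads stay IOPS-bound\n", seconds / 3600;
}'

If that number comes out far longer than the MB/s math suggested, believe the IOPS math.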
Interesting facts and small historical context
- ZFS popularized “rebuild allocated data only” as a first-class design goal, avoiding traditional “rebuild the whole disk” behavior common in classic RAID controllers.
- Scrubs and resilvers share machinery, but resilver has to write and update state, which changes contention patterns in the I/O pipeline.
- RAIDZ resilvering is inherently read-amplified because reconstructing a block often requires reading multiple columns; mirrors can frequently read from a single side.
- Block pointer checksums are central: ZFS reads are verified end-to-end, so “fast but silent corruption” is not on the menu.
- Ashift choices became more painful over time as disks moved to 4K physical sectors; misalignment can turn “sequential” work into extra read-modify-write cycles.
- Space maps evolved to better track free/allocated ranges, and that bookkeeping directly affects how efficiently ZFS can walk allocated space during rebuild.
- Device removal and sequential resilver improvements in OpenZFS ecosystems made rebuilds less punishing than early-era “scan everything” approaches—when the pool layout cooperates.
- Special vdevs changed the game for metadata-heavy pools: resilver speed can improve dramatically if metadata reads are served quickly, but only if the special vdev is healthy and sized correctly.
The real bottlenecks: where resilvers actually spend time
1) The surviving disks are doing the hard part
The replacement disk is usually not the limiting factor. During resilver, the pool has to read all the blocks that belonged to the failed device from the surviving members. Those reads are frequently:
- spread across the vdev’s address space (fragmentation)
- mixed-size, including lots of tiny metadata reads
- competing with your production workload
So you can put a brand new, fast drive in, and it won’t matter because the old drives are busy seeking all over their platters like it’s 2007.
2) RAIDZ parity reconstruction amplifies reads
With mirrors, ZFS often needs one good copy to rebuild a block. With RAIDZ, reconstructing a block generally requires reading the other columns in the stripe. This is why “same raw TB, same disks” does not mean “same resilver time.” RAIDZ trades capacity efficiency for more complex rebuild behavior, and resilver is where you pay that bill.
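A rough way to put numbers on that difference is reads per reconstructed block. The 10-wide RAIDZ2 geometry below is an assumption, and real blocks do not always span the full stripe width, so treat this as a sketch rather than a model:

# Illustrative read amplification per rebuilt block (full-width stripes assumed).
# A mirror needs one surviving copy; RAIDZ needs enough surviving columns to solve
# for the missing one, which is roughly the data width of the stripe.
awk 'BEGIN {
  width = 10; parity = 2;       # assumed 10-wide RAIDZ2
  printf "mirror: ~1 read per rebuilt block\n";
  printf "raidz%d, %d-wide: ~%d reads per rebuilt block\n", parity, width, width - parity;
}'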
3) Fragmentation turns sequential intent into random reality
Sequential resilver tries to walk allocations in order, but if your data was written in a churn-heavy pattern (VM images, databases with frequent rewrites, object stores with deletes), the “next block” might be nowhere near the previous one on disk.
Two practical consequences:
- Your reads become IOPS-bound, not throughput-bound.
- Your pool can show “low MB/s” while still being saturated (high utilization, high latency).
4) Metadata can dominate the tail
The first half of a resilver often looks faster because you’re copying larger, more contiguous extents. The end gets weird: tiny blocks, indirect blocks, dnodes, spill blocks, spacemap updates. That’s when the progress percentage moves slowly and everyone starts accusing the disks.
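If you want an idea of how bad that tail will be before the next failure, zdb can print block statistics, including how much of the pool lives in small blocks. It walks the pool, so it is slow and best run outside an incident; flags and output format vary by OpenZFS version, so treat this invocation as an example:

# Block statistics with size breakdowns; -L skips leak detection, extra b's add detail,
# -s prints progress while it runs. Expect this to take a long time on large pools.
sudo zdb -Lbbbs tank | less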
5) Compression, checksums, and CPU aren’t free
On modern CPUs, checksum and compression overhead is usually manageable, until it isn’t. If you’re doing RAIDZ parity math plus high compression ratios plus a busy box, CPU can become the bottleneck. This shows up as:
- disks not fully utilized (low %util) while resilver is slow
- high system CPU, kernel time, or softirq time depending on platform
- ARC pressure causing extra disk reads
6) The pool is a shared resource; resilver competes
If you allow a resilver to run “full send” during peak hours, the pool can become an I/O hostage situation. ZFS tries to be fair, but fairness is not the same as “your database stays happy.” You need to decide what you’re optimizing for: fastest rebuild time, or acceptable production latency. Trying to get both without measurement is how you get neither.
Joke #2: A resilver is like roadwork: it’s always scheduled for when you’re already late.
One reliability idea worth keeping on your wall
Hope is not a strategy.
— James Cameron
It’s not ZFS-specific, but it’s painfully relevant: don’t hope your resilver will be fast. Measure, design, and practice.
Fast diagnosis playbook
This is the triage path when someone pings you: “resilver is slow.” Don’t start tuning knobs blindly. Find the limiter.
First: confirm what kind of slowness you have
- Low MB/s but high disk utilization? You’re likely IOPS/seek limited (fragmentation, small blocks, RAIDZ reads).
- Low MB/s and low disk utilization? You’re likely CPU-limited, throttled, or blocked on something else (ARC misses, special vdev, scheduler, tunables).
- High latency spikes on clients? Resilver is competing with production workload; you need to budget I/O.
Second: isolate whether reads or writes are limiting
- If the healthy disks show high read ops and high latency: read side is limiting.
- If the new disk shows high write latency or errors: target write path is limiting (bad disk, cabling, HBA, SMR behavior).
Third: check for pathology
- Any checksum errors during resilver? Stop assuming “performance problem.” You may have a reliability problem.
- Any SMR drives, odd firmware, or power management quirks? Rebuild behavior can collapse under sustained writes.
- Pool nearly full or heavily fragmented? Expect tail-latency and slow finish.
Fourth: decide on the operational goal
- Goal A: finish resilver ASAP (accept user impact).
- Goal B: keep production stable (accept longer resilver).
Make that decision explicitly, then tune. Otherwise you’ll oscillate between “too slow” and “users are screaming” and pretend it’s a mystery.
Hands-on tasks: commands, outputs, and decisions (12+)
Below are practical tasks I use in production. Each includes (1) command, (2) what the output means, (3) the decision you make.
Task 1: Confirm resilver state and scan rate
cr0x@server:~$ zpool status -v tank
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered.
scan: resilver in progress since Tue Dec 24 11:02:12 2025
1.23T scanned at 410M/s, 612G issued at 204M/s, 3.10T total
148G resilvered, 19.11% done, 0 days 03:41:22 to go
config:
        NAME             STATE     READ WRITE CKSUM
        tank             DEGRADED     0     0     0
          raidz2-0       DEGRADED     0     0     0
            sda          ONLINE       0     0     0
            sdb          ONLINE       0     0     0
            sdc          ONLINE       0     0     0
            sdd          ONLINE       0     0     0
            sde          ONLINE       0     0     0
            sdf          ONLINE       0     0     0
            sdg          ONLINE       0     0     0
            sdh          ONLINE       0     0     0
            ...
            replacing-1  DEGRADED     0     0     0
              sdxX       UNAVAIL      0     0     0
              sdnew      ONLINE       0     0     0  (resilvering)
errors: No known data errors
Meaning: “scanned” vs “issued” tells you how much metadata traversal is happening versus actual data reconstruction. If scanned is high but issued is low, you’re doing a lot of bookkeeping and seeking.
Decision: If “issued” is far below what the disks can do, go hunt for IOPS/latency limiters (fragmentation, small blocks, RAIDZ). If errors appear, switch from performance mode to data-integrity mode immediately.
Task 2: Watch per-vdev I/O during resilver
cr0x@server:~$ zpool iostat -v tank 1
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 51.2T 13.8T 8.12K 1.90K 812M 162M
raidz2-0 51.2T 13.8T 8.12K 1.90K 812M 162M
sda - - 310 75 31.2M 6.1M
sdb - - 298 72 30.8M 6.0M
sdc - - 315 76 31.4M 6.2M
...
sdnew - - 12 980 1.1M 98.4M
-------------------------- ----- ----- ----- ----- ----- -----
Meaning: Healthy disks are doing lots of reads; the new disk is mostly writes. If read ops are high but read MB/s per disk is low, you’re IOPS-bound and seeking.
Decision: If per-disk read MB/s is anemic but ops are high, stop expecting line-rate. Consider throttling resilver to protect latency, or scheduling off-hours; tuning for “speed” won’t beat physics.
Task 3: Check disk-level latency and saturation
cr0x@server:~$ iostat -x 1 3
Device r/s w/s rkB/s wkB/s await aqu-sz %util
sda 310.0 70.0 32000 6200 28.4 8.2 99.0
sdb 295.0 68.0 30800 6000 30.1 8.5 98.7
sdnew 10.0 980.0 1100 102000 12.7 2.1 72.4
Meaning: The old disks are pegged at ~99% util with ~30ms await. That’s classic seek/IOPS limitation. New disk isn’t saturated.
Decision: Your bottleneck is surviving-disk read behavior. If production latency matters, cap resilver aggressiveness rather than chasing new-disk throughput.
Task 4: Inspect pool health and error counters
cr0x@server:~$ zpool status tank
pool: tank
state: DEGRADED
scan: resilver in progress since Tue Dec 24 11:02:12 2025
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 2
sdnew ONLINE 0 0 0
errors: No known data errors
Meaning: CKSUM errors on a surviving disk during a resilver are a red flag. Even if ZFS can correct them, they cost time and risk.
Decision: Pull SMART, cabling, and HBA logs. If errors climb, stop “optimizing speed” and start preparing for another failure and possible data loss scenario.
Task 5: Check SMART for the replacement and the stressed survivors
cr0x@server:~$ sudo smartctl -a /dev/sdnew | sed -n '1,80p'
=== START OF INFORMATION SECTION ===
Device Model: ST18000NM000J
Firmware Version: SN05
User Capacity: 18,000,207,937,536 bytes
Sector Sizes: 512 bytes logical, 4096 bytes physical
SMART overall-health self-assessment test result: PASSED
...
ID# ATTRIBUTE_NAME VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 100 100 010 Pre-fail Always - 0
197 Current_Pending_Sector 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 200 200 000 Old_age Always - 12
Meaning: CRC errors often implicate cabling/backplane/HBA path issues. During resilver, link flakiness can look like “random slowness.”
Decision: If CRC counts increase during load, reseat/replace cables or move bays. Don’t keep resilvering over a shaky link.
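A quick way to confirm whether CRC errors are still accumulating under resilver load (device names are placeholders; attribute 199 is the ATA convention, SAS and NVMe report link errors differently):

# Snapshot the raw UDMA CRC counters now and again in ten minutes.
# A rising value under load points at the cable/backplane/HBA path, not the platters.
for d in /dev/sda /dev/sdb /dev/sdnew; do
  printf "%s " "$d"
  sudo smartctl -A "$d" | awk '/UDMA_CRC_Error_Count/ {print $NF}'
done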
Task 6: Identify ashift and top-level vdev layout
cr0x@server:~$ zdb -C tank | egrep -A3 'ashift|vdev_tree|type'
vdev_tree:
type: 'root'
id: 0
ashift: 12
Meaning: ashift=12 implies 4K sectors. If this is wrong (e.g., 9 on 4K drives), resilver can suffer from alignment penalties and extra I/O.
Decision: If ashift is wrong, the “fix” is generally a rebuild/migration, not a tuning knob. Plan it; don’t argue with geometry.
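To cross-check geometry against the drives themselves, compare what the kernel reports for sector sizes with the pool's ashift (column names assume util-linux lsblk):

# Physical vs logical sector size as the kernel sees it; 4096/512 is a 512e drive,
# which wants ashift=12 even though it accepts 512-byte I/O.
lsblk -d -o NAME,MODEL,PHY-SEC,LOG-SEC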
Task 7: Check pool fragmentation and fullness
cr0x@server:~$ zpool list -o name,size,alloc,free,frag,health tank
NAME SIZE ALLOC FREE FRAG HEALTH
tank 65.0T 51.2T 13.8T 42% DEGRADED
Meaning: High frag increases seeky reads. High allocation percentage reduces available contiguous space and can worsen allocation behavior during writes.
Decision: If frag is high and alloc is high, expect slow resilver tails. Consider adding capacity or rebalancing long before the next failure.
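Fragmentation is a per-vdev and per-metaslab story, not just one pool-wide number. Two hedged ways to look deeper (the second is slow on large pools and better suited to offline analysis):

# Per-vdev breakdown, including FRAG for each top-level vdev
zpool list -v tank

# Per-metaslab detail; verbose and time-consuming
sudo zdb -mm tank | less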
Task 8: Inspect dataset properties that change block shape
cr0x@server:~$ zfs get -o name,property,value -s local recordsize,compression,checksum tank/vmstore
NAME PROPERTY VALUE
tank/vmstore recordsize 128K
tank/vmstore compression lz4
tank/vmstore checksum on
Meaning: recordsize affects how many blocks exist and how large they are. Small blocks increase metadata and IOPS needs; large blocks can improve sequentiality for streaming workloads.
Decision: If this dataset hosts small random I/O workloads (VMs, DB), recordsize may not match; but changing it mid-life won’t rewrite existing blocks. Plan a migration or rewrite window if you need a different block profile.
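Note that a new recordsize only applies to blocks written after the change, and zfs send/receive generally preserves the source's block layout. A minimal migration sketch, with dataset names and the 16K value as placeholders, is to rewrite the data through the filesystem:

# Create a destination with the block profile you actually want (values are examples),
# then rewrite the data at the file level; send/receive would keep the old block sizes.
zfs create -o recordsize=16K -o compression=lz4 tank/vmstore16k
rsync -aHAX /tank/vmstore/ /tank/vmstore16k/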
Task 9: Check special vdev and its health (metadata acceleration)
cr0x@server:~$ zpool status tank | sed -n '1,80p'
pool: tank
state: DEGRADED
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
special ONLINE 0 0 0
nvme0n1p1 ONLINE 0 0 0
nvme1n1p1 ONLINE 0 0 0
raidz2-0 DEGRADED 0 0 0
...
Meaning: A special vdev can dramatically affect resilver performance by serving metadata quickly. If it’s missing or unhealthy, metadata reads fall back to HDDs, and the tail gets brutal.
Decision: If your workload is metadata-heavy and you don’t have a special vdev, consider designing one for the next hardware refresh. If you have one, protect it like it’s production secrets.
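Two quick, hedged checks if you already have a special vdev: confirm what it is allowed to absorb, and how full it is, because once it fills new metadata spills back to the slow disks:

# 0 means metadata only; a nonzero value also redirects data blocks up to that size
zfs get -r special_small_blocks tank | head

# Capacity of the special class; once alloc approaches size, overflow goes back to the HDDs
zpool list -v tank | grep -A3 special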
Task 10: Confirm ARC pressure and memory headroom
cr0x@server:~$ arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
11:10:01 6120 1480 24 820 55 60 4 600 41 112G 128G
11:10:02 5988 1622 27 910 56 72 4 640 39 112G 128G
11:10:03 6210 1588 25 880 55 55 3 653 41 112G 128G
Meaning: High miss rates during resilver can force extra disk reads. That slows resilver and increases latency for everything else.
Decision: If ARC is too small for the working set, you can’t “tune” your way out permanently. Add memory, reduce workload churn, or move metadata to faster devices.
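On Linux, the raw counters behind arcstat live in /proc/spl/kstat/zfs/arcstats. A minimal check of current size versus target and hard limit (field names can differ slightly between OpenZFS versions):

# ARC current size (size), target (c), and hard limit (c_max), in bytes
awk '$1 == "size" || $1 == "c" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats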
Task 11: Watch CPU saturation and kernel time
cr0x@server:~$ mpstat -P ALL 1 2
Linux 6.8.0 (server) 12/24/2025 _x86_64_ (32 CPU)
Average: CPU %usr %sys %iowait %idle
Average: all 18.2 42.7 6.1 33.0
Average: 0 21.0 55.0 3.0 21.0
Average: 1 16.0 51.0 4.0 29.0
Meaning: High %sys during RAIDZ resilver can indicate parity/checksum overhead, plus I/O stack overhead. If disks are not saturated but %sys is high, CPU is a suspect.
Decision: Consider reducing resilver concurrency to relieve CPU if production is suffering, or schedule resilver-heavy periods when CPU headroom is available.
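OpenZFS on Linux benchmarks its checksum and RAIDZ parity implementations at module load and records the results in kstats. If CPU is a suspect, it is worth confirming the fast SIMD paths were actually selected; the paths below assume the Linux SPL/kstat layout:

# Which fletcher4 and RAIDZ implementations were benchmarked and which one is active
cat /proc/spl/kstat/zfs/fletcher_4_bench
cat /proc/spl/kstat/zfs/vdev_raidz_bench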
Task 12: Check zed events and kernel logs for link resets
cr0x@server:~$ sudo dmesg -T | tail -n 20
[Tue Dec 24 11:22:14 2025] ata12: hard resetting link
[Tue Dec 24 11:22:15 2025] ata12: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Tue Dec 24 11:22:15 2025] sd 11:0:0:0: [sdb] tag#18 timing out command, waited 180s
[Tue Dec 24 11:22:15 2025] blk_update_request: I/O error, dev sdb, sector 1987654321 op 0x0:(READ)
Meaning: Link resets and command timeouts are not “performance.” They are reliability events that manifest as performance collapse.
Decision: Replace cables/backplane/HBA port, possibly the disk. Don’t continue a resilver with a flapping path unless you enjoy living dangerously.
Task 13: Confirm scrub/resilver tunables and scan state
cr0x@server:~$ sudo sysctl -a | egrep 'zfs_scan|resilver|scrub' | head
kstat.zfs.misc.scan.state = 2
kstat.zfs.misc.scan.pass = 1
kstat.zfs.misc.scan.start_time = 1735038132
Meaning: The kstat-style names shown here come from a FreeBSD-flavored sysctl tree; not every platform exposes the same thing. On Linux, OpenZFS publishes kstats under /proc/spl/kstat/zfs and tunables as module parameters, so verify scan state and knobs wherever your platform puts them.
Decision: Don’t change tunables because a blog told you to. Only change them when you know whether you’re latency-budgeting or speed-maximizing, and you can roll back.
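On Linux, the knobs live under /sys/module/zfs/parameters rather than sysctl. A minimal, read-only check; parameter names and defaults vary between OpenZFS releases, so treat these as examples rather than a definitive list:

# Print current values of a few scan/resilver-related module parameters (if present)
grep -H . /sys/module/zfs/parameters/zfs_resilver_min_time_ms \
          /sys/module/zfs/parameters/zfs_scan_vdev_limit \
          /sys/module/zfs/parameters/zfs_scan_legacy 2>/dev/null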
Task 14: Identify SMR drives (a silent rebuild killer)
cr0x@server:~$ lsblk -d -o NAME,MODEL,ROTA
NAME MODEL ROTA
sda ST18000NM000J 1
sdb ST18000NM000J 1
sdnew ST18000NM000J 1
Meaning: This doesn’t directly tell you SMR vs CMR, but it gives you the model for cross-checking internally. If you unknowingly mixed in SMR, sustained writes during resilver can crater.
Decision: Maintain an approved-drive list for pools. If you suspect SMR, stop and verify model behavior; “it works most of the time” is not a storage strategy.
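On Linux you can also ask the kernel whether a drive presents itself as zoned. This catches host-aware and host-managed SMR, but drive-managed SMR reports "none", so absence of a flag is not proof of CMR:

# "host-aware" or "host-managed" means a zoned (SMR) device; "none" means it is not
# advertising zones, which still does not rule out drive-managed SMR.
for d in /sys/block/sd*; do
  printf "%s %s\n" "$(basename "$d")" "$(cat "$d/queue/zoned")"
done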
Three corporate-world mini-stories
Mini-story 1: The incident caused by a wrong assumption
The company ran a large analytics cluster on ZFS over RAIDZ2. When a disk failed, the on-call engineer swapped it and told everyone the pool would be healthy “by lunch” because each drive could do 250 MB/s sequential writes. That number came from a vendor datasheet and a quick benchmark on an empty disk.
The resilver started at a few hundred MB/s, which felt validating. Then it slowed to a jittery 30–60 MB/s “issued,” while scanned stayed high. Production queries began timing out, not because the network was slow, but because the pool’s read latency spiked. Users filed tickets. Management filed calendar invites.
The wrong assumption wasn’t “ZFS is slow.” The wrong assumption was that the replacement disk’s sequential write speed determines rebuild time. The pool was 80% full with years of churn. Reconstructing the missing column forced scattered reads from all surviving disks. Those disks were now doing high-IOPS random reads while also serving the live workload.
The fix was boring: throttle the resilver during business hours, resume aggressively overnight, and add a policy that pools should not live above a defined fullness threshold. After that, the lunch promises stopped. The pools did not become magically faster, but the business stopped being surprised.
Mini-story 2: The optimization that backfired
A different team wanted faster resilvers, so they went hunting for tunables. They increased scan concurrency and made resilver “more parallel,” expecting the pool to chew through work faster. On paper, it did: more I/O operations per second.
In reality, application latency doubled. The database wasn’t bandwidth-bound; it was tail-latency-bound. The increased resilver concurrency caused deeper queues on the HDDs. “Average throughput” looked better in a dashboard while real user queries got slower. The team then reacted by adding more application retries, which increased read load, which made queues deeper. Classic.
They rolled back the tuning and took a step that sounds almost insulting: they let resilver run slower. But they pinned a maximum acceptable latency for the database and tuned resilver aggressiveness to stay under it. The rebuild took longer, but the business stopped noticing.
The backfire lesson: resilver tuning is not a free lunch. If your pool is seek-limited, more concurrency can just mean more outstanding seeks. You don’t get “faster.” You get “louder.”
Mini-story 3: The boring but correct practice that saved the day
A financial-services platform ran ZFS mirrors for transactional data and RAIDZ for cold storage. Nothing exotic. Their secret weapon was a dull set of practices: scrubs on schedule, SMART trending, and a policy that any disk with growing UDMA_CRC errors gets moved to a different bay before it’s allowed back into a pool.
One week, a disk failed in a mirror vdev during peak processing. Resilver started and looked normal. An hour in, another disk in the same chassis started throwing CRC errors, and the kernel logged link resets. This is where a lot of teams “wait and see” and pray the resilver finishes first.
They didn’t wait. They paused the chaos, moved the suspect disk to a known-good slot, replaced the cable, and resumed. Resilver finished without a second failure. The transaction system stayed within latency SLOs because the team also had a standard procedure: during peak, limit resilver impact; at night, let it run.
The practice that saved them wasn’t a miracle tunable. It was disciplined hygiene and a willingness to treat flaky links as production incidents, not “weird performance.”
Common mistakes: symptoms → root cause → fix
Mistake 1: “Resilver starts fast then slows to a crawl”
Symptom: High initial MB/s, then the tail takes forever; disks show high util but low throughput.
Root cause: Fragmentation + metadata-heavy tail; remaining work is many small scattered blocks, which is IOPS-bound.
Fix: Accept the physics; don’t panic-tune. Keep pools below high-fullness thresholds, prefer mirrors for churny workloads, and consider special vdev for metadata-heavy cases.
Mistake 2: “New disk is fast, but resilver is slow”
Symptom: Replacement disk write is moderate; surviving disks are saturated on reads.
Root cause: Read side is the limiter (RAIDZ read amplification, seeky reads, workload contention).
Fix: Diagnose per-disk latency; throttle resilver to protect production; design future vdevs with rebuild behavior in mind (mirrors vs RAIDZ width).
Mistake 3: “Resilver speed collapses unpredictably”
Symptom: Periodic stalls; huge latency spikes; kernel logs show resets/timeouts.
Root cause: Link instability (cables, backplane, expander, HBA), or a marginal disk under sustained load.
Fix: Treat as hardware incident. Check dmesg, SMART CRC counts, HBA logs; move the drive bay/port; replace suspect components.
Mistake 4: “Tuned resilver for speed; users complain”
Symptom: Resilver finishes sooner but app latency and timeouts increase.
Root cause: Queue depth and fairness: more resilver concurrency increases tail latency and steals IOPS from foreground.
Fix: Tune to an SLO, not a stopwatch. Cap resilver aggressiveness during peak, schedule aggressive windows off-hours.
Mistake 5: “Checksum errors appear during resilver; performance drops”
Symptom: CKSUM counts increment; resilver slows; possible read retries.
Root cause: Media issues, bad cabling, or a second failing disk; ZFS is correcting (if it can), which costs time.
Fix: Stop focusing on speed. Stabilize hardware, identify the fault domain, and consider preemptive replacement of the suspect member.
Mistake 6: “Pool is 90% full and resilver is terrible”
Symptom: Resilver takes dramatically longer than past events; allocation looks chaotic.
Root cause: Free space is fragmented; metaslabs have fewer contiguous extents; both reads and writes become less efficient.
Fix: Add capacity earlier. If you can’t, migrate datasets off, reduce churn, and stop treating 90% full as normal operations.
Checklists / step-by-step plan
Step-by-step: handling a disk failure with minimal drama
- Freeze the story: capture zpool status -v, zpool iostat -v 1 (30 seconds), and iostat -x 1 (30 seconds). This is your before/after truth.
- Confirm the fault domain: is it a disk, a cable, a bay, an expander lane? Check SMART CRC and kernel logs.
- Replace the device correctly: use zpool replace and confirm the correct GUID/device mapping. Human errors love drive bays.
- Decide the operational mode: “finish fast” vs “protect latency.” Write it down in the ticket. Be accountable.
- Monitor health first: any READ/WRITE/CKSUM increases? If yes, pause performance tuning and stabilize hardware.
- Monitor bottlenecks: per-disk util and await; if old disks are pegged, stop pretending the new disk matters.
- Protect users: if latency-sensitive, reduce resilver aggressiveness during peak windows rather than forcing throughput.
- After resilver: run a scrub (or ensure next scheduled scrub happens soon) to confirm no silent issues remain.
- Postmortem the root cause: not “a disk died,” but “why did it die and what else shares that failure domain?”
- Capacity/fragmentation follow-up: if you were >80% full, treat that as an incident contributor and plan remediation.
Design checklist: build pools that don’t make resilver a career
- Choose vdev types based on churn: mirrors for high-churn random I/O; RAIDZ for colder, more sequential, less rewrite-heavy data.
- Keep pools with headroom: don’t live in the 85–95% zone and act surprised when rebuilds are awful.
- Standardize drive models: avoid mixing unknown SMR behavior into write-heavy pools.
- Plan for metadata: special vdev can help, but only if mirrored and properly sized and monitored.
- Practice failures: rehearse replace/resilver procedures; document device naming and bay mapping; confirm alerting works.
FAQ
1) Is a resilver always faster than a traditional RAID rebuild?
Often, because ZFS rebuilds allocated blocks rather than the whole device. But if your pool is very full and fragmented, the advantage shrinks and the tail can still be brutal.
2) Why does “scanned” differ from “issued” in zpool status?
“Scanned” is how much address space/metadata traversal has been walked; “issued” is actual I/O issued for reconstruction. Big gaps usually mean heavy metadata work or inefficient access patterns.
3) Mirrors or RAIDZ for faster resilvers?
Mirrors generally resilver faster and with less read amplification. RAIDZ resilvers can be significantly slower because reconstruction often requires reading multiple columns for each block.
4) Can I just throttle the resilver to keep production stable?
Yes, and you often should. The risk tradeoff is longer time in a degraded state. For latency-sensitive services, a controlled slow resilver is usually safer than a fast one that causes timeouts and cascading failures.
5) Does adding a faster replacement disk speed up the resilver?
Sometimes, but commonly no. The limiting factor is often the surviving disks’ read IOPS and latency, not the target’s write throughput.
6) Why does the resilver slow down near the end?
The tail contains more small blocks and metadata-heavy work. Small random reads dominate, which collapses throughput even though the system is “busy.”
7) Is fragmentation a dataset problem or a pool problem?
Both. Fragmentation is driven by allocation patterns and free space distribution at the pool level, but churny datasets amplify it. The pool pays the price during resilver.
8) Are checksum errors during resilver normal?
No. ZFS can correct them if redundancy allows, but they indicate a real fault: disk media, cable, HBA path, or another device in trouble. Treat it as a reliability incident.
9) Does compression help or hurt resilver speed?
It can help if it reduces physical bytes read/written, but it can hurt if CPU becomes the bottleneck or if the workload is metadata/IOPS-bound. Measure CPU and disk utilization before blaming compression.
10) Is “sequential resilver” something I can force?
Not in the sense people mean. ZFS uses strategies to reduce random I/O, but the on-disk allocation reality and vdev layout dictate how sequential it can be.
Practical next steps
If you’re dealing with a slow resilver today:
- Run zpool status -v, zpool iostat -v 1, and iostat -x 1. Decide whether you’re IOPS-limited, CPU-limited, or dealing with hardware pathology.
- Check dmesg and SMART CRC counts. If links are flapping, fix hardware before you touch tunables.
- Pick a goal: fastest finish or stable production. Throttle/accelerate accordingly and document the decision.
If you’re designing for the next failure (the correct time to care):
- Stop overfilling pools. Headroom is performance and reliability insurance.
- Choose vdevs based on churn. Mirrors cost capacity; they buy you rebuild sanity.
- Invest in observability: per-disk latency, ZFS error counters, and a clear mapping of bays to device IDs.
The best resilver is the one you finish before anyone notices. That’s not magic. It’s design, measurement, and refusing to treat “disk MB/s” as a plan.