The alert arrives at 09:12: “DEGRADED pool.” You swap the disk, run zpool replace, and expect a couple hours of churn.
Then zpool status hits you with “resilvering… 3%” and an ETA that looks like a long weekend.
Resilver time isn’t a moral failing. It’s physics, queue depth, vdev geometry, and the awkward reality that production workloads don’t pause just because you’d prefer they did.
The trick is knowing which levers are safe, which are cargo cult, and which will trade speed today for data loss tomorrow.
What resilver actually does (and why it feels slower than it “should”)
In ZFS, a “resilver” is reconstruction after a device is replaced or comes back online. ZFS walks the pool’s metadata to discover which blocks are actually in use,
then regenerates the missing copies/parity and writes them to the new (or returned) device.
That “walk the metadata” part is why resilver is often not a simple linear copy of “used bytes.” It’s a dependency chain:
ZFS must read metadata to learn where data blocks live, then read those blocks, then write reconstructed blocks, while also staying consistent with ongoing writes.
If your pool is fragmented, metadata-heavy, or under load, resilver becomes a seek-and-queue festival.
Also, resilver isn’t just a big streaming read and write. It’s “find all referenced blocks and fix up the missing side,” which in RAIDZ means reading enough
columns to reconstruct parity, and in mirrors means copying the other side’s blocks. Mirrors can be fast if they can read sequentially. RAIDZ often can’t.
One more operational reality: ZFS tries to be a good citizen. By default it won’t take your serving workload behind the barn and do the merciful thing.
Resilver competes for I/O with everything else, and ZFS intentionally leaves headroom—unless you tell it otherwise.
Why a resilver takes days: the real bottlenecks
1) Random I/O and fragmentation: your “used bytes” aren’t contiguous
If your pool has been running for years with mixed workloads—VM images, databases, small files, deletions, snapshots—blocks get scattered.
ZFS must chase metadata pointers, which turns into lots of small reads. HDDs hate that. Even SSDs can struggle if you saturate them with queue depth mismatches
or hit write amplification.
The lie we tell ourselves is: “There’s only 12 TB used; it should resilver in 12 TB / disk throughput.” That assumes sequential reads and writes, low metadata overhead,
and no contention. In reality, resilver’s effective throughput is often gated by IOPS, not MB/s.
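A back-of-napkin comparison shows how far apart those two worlds are. The numbers below are illustrative assumptions (12 TiB allocated, 200 MB/s sequential, ~150 IOPS at an average reconstruction I/O of 128 KiB), not measurements from any pool in this article:
cr0x@server:~$ echo $(( 12 * 1024 * 1024 / 200 / 3600 ))                  # hours if purely sequential at 200 MB/s
17
cr0x@server:~$ echo $(( 12 * 1024 * 1024 / (150 * 128 / 1024) / 86400 ))  # days if IOPS-bound (~150 IOPS x 128 KiB)
8
Same "used bytes," but the difference between an evening and most of a week. The divisor that matters is IOPS times average I/O size, not the drive's spec-sheet bandwidth.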
2) vdev geometry: RAIDZ rebuild reads more than you think
In a mirror, to rebuild a missing side you can usually read the good disk and write the new disk. In RAIDZ, to reconstruct one missing disk,
ZFS reads the remaining columns of each stripe. That’s more I/O per reconstructed byte, and it’s scattered across more spindles.
RAIDZ resilver can be especially punishing on wide vdevs with large disks. The pool is degraded, so redundancy is reduced, and performance drops exactly when you need it.
If you’re unlucky, you’ll also be serving production reads with fewer columns available. It’s like rebuilding a bridge while rush hour is still on it.
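To make the read amplification concrete, assume (purely for illustration) a 10-wide RAIDZ2—8 data plus 2 parity columns—and roughly 10 TB of column data to reconstruct onto the new disk:
cr0x@server:~$ echo $(( 10 * 8 ))   # ~TB of aggregate reads across surviving disks to write ~10 TB of reconstructed columns
80
Each surviving disk only reads its own columns, but the pool as a whole moves closer to 80 TB than 10 TB—while still serving production.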
3) “Allocating while resilvering”: blocks move under your feet
ZFS is copy-on-write. New writes go to new locations, old blocks remain referenced until freed. During resilver, active writes can change what needs to be copied:
metadata updates, indirect blocks, new block pointers. ZFS handles this, but it means the operation is less “single pass” than people assume.
4) Pool fullness: above ~80% gets ugly fast
Full pools fragment more, allocate in smaller chunks, and force ZFS to work harder to find space. Resilver becomes more random, and the overhead climbs.
If you’re also snapshot-heavy, freed space isn’t truly free until snapshots expire, so “df says 20% free” might be fiction.
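zpool list shows both fullness and ZFS's own fragmentation metric on one line; the output here is illustrative, not from the pool used elsewhere in this article:
cr0x@server:~$ zpool list tank
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank  21.8T  19.3T  2.54T        -         -    41%    88%  1.00x  DEGRADED  -
FRAG measures free-space fragmentation, not file fragmentation, but a high FRAG on a nearly full pool is exactly the profile that turns resilver into random I/O.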
5) Recordsize, volblocksize, and small-block workloads
Resilver has to deal with your block sizes as they exist on disk. A VM zvol with 8K volblocksize or a database dataset with 16K recordsize
results in many more blocks to traverse than a dataset full of 1M records.
More blocks means more metadata, more checksums, more I/O operations, and less chance of nice sequential patterns. You don’t notice this day-to-day
until you need to rebuild.
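The block-count math is what bites you. As a rough illustration, 10 TiB of allocated data at an 8K volblocksize versus a 1M recordsize (ignoring compression and metadata):
cr0x@server:~$ echo $(( 10 * 1024 * 1024 * 1024 / 8 ))   # 8K blocks in 10 TiB
1342177280
cr0x@server:~$ echo $(( 10 * 1024 * 1024 ))              # 1M records in 10 TiB
10485760
That's roughly 1.3 billion block pointers to traverse versus about 10 million—a 128x difference in metadata and I/O operations for the same logical bytes.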
6) Compression and dedup: great until you rebuild
Compression usually helps resilver because fewer bytes need to be read and written—if CPU isn’t the bottleneck.
Dedup is the opposite: it adds metadata lookups and often makes everything more random.
If you enabled dedup because you once saw a slide deck about “storage efficiency,” you’ve built yourself a resilver tax. It compounds under pressure.
7) Checksumming, crypto, and CPU bottlenecks
ZFS verifies checksums as it reads. If you’re using native encryption, it also decrypts. On older CPUs or busy boxes, resilver can become CPU-bound,
especially when the I/O pattern is lots of small blocks (more checksum operations per byte).
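On Linux, a quick way to see whether ZFS kernel threads are the ones burning CPU is to sort threads by CPU usage. Exact thread names vary by OpenZFS version, but I/O pipeline and checksum threads (names like z_rd_int and z_wr_iss) or txg_sync sitting at the top is the tell:
cr0x@server:~$ ps -eLo comm,pcpu --sort=-pcpu | head -n 10
If those dominate while the disks are nowhere near saturated, you're CPU-bound, and no I/O tunable will rescue you.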
8) “Resilver priority” is a trade, not a free lunch
You can often make resilver faster by letting it consume more I/O. That speeds recovery but can crush latency for your applications.
The safe speedup is the one that keeps your SLOs intact.
9) Slow or mismatched replacement disks
If the new disk is SMR, has aggressive internal garbage collection, is connected through a sad HBA, or is simply slower than the old one,
resilver time can explode. “Same capacity” is not “same behavior.”
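Most drives won't announce "I'm SMR" in standard tooling, so verify what you actually installed: smartctl shows the model, rotation rate, and negotiated link, and you check the model number against the vendor's datasheet:
cr0x@server:~$ sudo smartctl -i /dev/sdx | egrep -i 'model|rotation|sata version'
A downshifted link speed in that output is also a clue that the transport, not the drive, is the real problem.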
Joke #1: Resilver is the storage equivalent of repainting your house while you’re still living in it—everything is technically possible, just not pleasant.
Interesting facts & history (because the past explains the pain)
- ZFS started at Sun Microsystems in the mid-2000s as a response to filesystems that treated “volume manager” and “filesystem” as separate problems.
- Copy-on-write was a deliberate bet: it made snapshots cheap and consistency strong, but it also made allocation patterns more complex over time.
- Resilver isn’t scrub: scrub validates the whole pool; resilver reconstructs redundancy after device loss. They share codepaths but have different intent.
- “Slop space” exists for a reason: ZFS keeps some space unallocatable to avoid catastrophic fragmentation and allocation failures on near-full pools.
- RAIDZ expansion (growing an existing RAIDZ vdev by adding disks) wasn’t available for most of ZFS’s history, which pushed many shops toward wide vdevs up front—great on day one, tense on day 900.
- SMR drives changed the game: they can look fine in benchmarks and then crater under sustained random writes like resilver traffic.
- OpenZFS became the center of gravity after Oracle absorbed Sun, with multiple platforms (illumos, FreeBSD, Linux) carrying the torch and diverging in tunables.
- Sequential resilver improvements landed over time to make some patterns faster, but they can’t undo fragmentation or fix “pool is 92% full” as a life choice.
Fast diagnosis playbook: find the bottleneck in 10 minutes
When resilver is slow, don’t guess. Take three measurements: what ZFS thinks it’s doing, what the disks are doing, and what the CPU and memory are doing.
Then decide whether to speed up resilver or reduce production load—or both.
First: confirm the rebuild is real and see the shape of it
- Check zpool status for scan rate, errors, and whether it’s a resilver or scrub.
- Confirm which vdev is affected and whether you’re RAIDZ or mirror.
- Look for “resilvered X in Y” style progress; if it’s barely moving, you’re likely IOPS-bound or blocked by errors/retries.
Second: identify the limiting resource (IOPS, bandwidth, CPU, or contention)
- Disk busy but low throughput: random I/O / queueing / SMR / retries.
- High CPU in kernel/ZFS threads: checksum/encryption/metadata heavy workload.
- Latency spikes in apps: resilver competing with production I/O; tune priorities or schedule load shedding.
Third: decide on the safe intervention
- If production is calm, increase resilver aggressiveness slightly and watch latency.
- If production is hurting, lower resilver impact and accept longer rebuild—unless risk dictates otherwise.
- If a device is erroring, stop “tuning” and fix hardware/cabling first.
Practical tasks: commands, outputs, and decisions (12+)
These are the checks I actually run. Each includes what the output means and the decision you make from it.
Commands are shown as if you’re on a Linux box with OpenZFS; adapt paths if you’re on illumos/FreeBSD.
Task 1: Confirm scan state, speed, and whether you’re resilvering or scrubbing
cr0x@server:~$ zpool status -v tank
pool: tank
state: DEGRADED
status: One or more devices is being resilvered.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Dec 23 09:12:11 2025
1.87T scanned at 58.3M/s, 612G issued at 19.1M/s, 22.4T total
102G resilvered, 2.91% done, 5 days 03:18:22 to go
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
...
errors: No known data errors
What it means: “scanned” vs “issued” tells you metadata traversal versus actual reconstruction I/O.
If “issued” is far lower than “scanned,” you’re spending time walking metadata and/or being throttled by IOPS.
Decision: If ETA is days and your pool is big, don’t panic yet. Move to the bottleneck checks below before touching tunables.
Task 2: Check pool health and error trends (don’t tune around dying hardware)
cr0x@server:~$ zpool status -x
pool 'tank' is degraded
What it means: The pool is not healthy; resilver is expected. If you see additional errors (READ/WRITE/CKSUM), that’s more urgent.
Decision: If errors climb during resilver, stop “performance work” and start “hardware triage.”
Task 3: Confirm which disk is new and whether it negotiated correctly (link speed, size)
cr0x@server:~$ lsblk -o NAME,SIZE,MODEL,SERIAL,ROTA,TYPE /dev/sdx
NAME SIZE MODEL SERIAL ROTA TYPE
sdx 14.6T ST16000NM000J ZR12ABCDEF 1 disk
What it means: You want the replacement to match expected capacity and be a CMR enterprise model, not a surprise SMR desktop drive.
Decision: If model/serial looks wrong, stop and validate procurement. The cheapest “fix” is returning the wrong disk before it wastes your week.
Task 4: Spot SMR behavior or deep write stalls using iostat
cr0x@server:~$ iostat -x /dev/sdx 2 5
Linux 6.6.0 (server) 12/25/2025 _x86_64_ (32 CPU)
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdx 12.0 180.0 1.1 9.4 116.0 27.8 145.2 8.1 154.8 2.9 56.8
sdx 11.5 190.5 1.0 2.2 36.2 64.1 336.7 9.2 356.8 2.7 52.4
What it means: Rising await with collapsing wMB/s is classic “drive is stalling” behavior.
Not always SMR, but often “drive firmware is busy reorganizing writes” or you have a transport/HBA issue.
Decision: If the replacement device has pathological await, move it to a different bay/cable/HBA port or swap the drive model.
Task 5: See if resilver is IOPS-bound across the vdev
cr0x@server:~$ iostat -x 2 3
Device r/s w/s rMB/s wMB/s avgqu-sz await %util
sda 85.0 22.0 5.1 1.2 9.2 86.4 92.0
sdb 82.0 25.0 4.9 1.4 8.7 84.9 90.1
sdc 83.0 23.0 5.0 1.3 9.1 85.7 91.5
What it means: High %util with low MB/s means you’re not streaming; you’re seeking. This is why “14TB disk at 250MB/s” math fails.
Decision: Don’t crank “resilver speed” knobs and expect miracles. You need to reduce random I/O pressure (pause heavy workloads, reduce snapshot churn),
or accept the timeline.
Task 6: Check ARC pressure and whether the box is thrashing
cr0x@server:~$ arcstat 2 5
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
09:23:01 914 202 22 46 23 156 77 0 0 96G 112G
09:23:03 901 229 25 51 22 178 78 0 0 96G 112G
09:23:05 938 301 32 90 29 211 70 0 0 96G 112G
What it means: Rising misses during resilver can mean metadata isn’t fitting well, or production + resilver exceeds cache usefulness.
Decision: If ARC is constrained and you’re swapping, stop: memory pressure will destroy resilver and everything else. Add RAM or reduce workload.
Task 7: Confirm you are not swapping (swapping turns rebuild into a slow-motion disaster)
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 0 82432 12644 9812448 0 0 4920 1280 4200 6100 18 12 58 12 0
2 0 0 80120 12644 9812016 0 0 5100 1320 4302 6230 17 11 57 15 0
What it means: si/so should be zero. If you’re swapping, ZFS metadata walks and checksum work will crawl.
Decision: If swapping, reduce ARC cap, stop memory hogs, or move workloads off. Do not “just let it finish.”
Task 8: Check whether a scrub is running concurrently (and stop it if it’s not policy-critical)
cr0x@server:~$ zpool status tank | sed -n '1,20p'
pool: tank
state: DEGRADED
scan: scrub in progress since Mon Dec 23 08:55:02 2025
3.11T scanned at 72.5M/s, 901G issued at 21.0M/s, 22.4T total
0B repaired, 4.03% done, 2 days 22:10:05 to go
What it means: A scrub competing with a resilver is usually self-sabotage unless you have a specific reason.
Decision: If the pool is already degraded and you’re trying to restore redundancy, prioritize resilver and pause scrub.
cr0x@server:~$ sudo zpool scrub -s tank
scrub stopped
Task 9: Verify autotrim and ashift assumptions (performance cliffs hide here)
cr0x@server:~$ zdb -C tank | grep -i ashift
            ashift: 12
cr0x@server:~$ zpool get autotrim tank
NAME  PROPERTY  VALUE  SOURCE
tank  autotrim  off    default
What it means: ashift defines the sector alignment. Wrong ashift can permanently kneecap write performance.
autotrim matters mostly for SSD pools.
Decision: You can’t change ashift in place. If it’s wrong, plan a migration. Don’t pretend a tunable will fix geometry.
Task 10: Check dataset-level properties that amplify resilver work
cr0x@server:~$ zfs get -o name,property,value recordsize,compression,dedup,atime tank/vmstore
NAME PROPERTY VALUE
tank/vmstore recordsize 128K
tank/vmstore compression lz4
tank/vmstore dedup off
tank/vmstore atime off
What it means: Small recordsize, dedup=on, and atime=on (for busy datasets) can all increase metadata churn and rebuild work.
Decision: Don’t flip these mid-resilver as a “speed hack.” Use them as input for future design, and for narrowing which workloads to throttle.
Task 11: Identify whether special vdev or metadata devices are the bottleneck
cr0x@server:~$ zpool status tank | sed -n '1,120p'
pool: tank
state: DEGRADED
scan: resilver in progress since Mon Dec 23 09:12:11 2025
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
sda ONLINE 0 0 0
...
special
mirror-1 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
What it means: If you have a special vdev (metadata/small blocks), its performance can dominate resilver speed because resilver is metadata-heavy.
Decision: Watch NVMe latency and health; a “fine” data vdev can still resilver slowly if metadata devices are saturated or degraded.
Task 12: Check for I/O errors and retries in kernel logs (silent killer)
cr0x@server:~$ sudo dmesg -T | egrep -i 'ata[0-9]|scsi|reset|I/O error|blk_update_request' | tail -n 12
[Tue Dec 23 10:02:14 2025] sd 3:0:8:0: [sdx] tag#83 I/O error, dev sdx, sector 1883742336 op 0x1:(WRITE) flags 0x0 phys_seg 16 prio class 0
[Tue Dec 23 10:02:15 2025] ata9: hard resetting link
[Tue Dec 23 10:02:20 2025] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
What it means: Link resets and degraded link speed (1.5 Gbps) will stretch resilver into geological time.
Decision: Fix cabling/backplane/HBA. Don’t tune ZFS to compensate for a flaky transport.
Task 13: See if ZFS is throttling resilver due to tunables (and adjust carefully)
cr0x@server:~$ sudo sysctl -a 2>/dev/null | egrep 'zfs_vdev_resilver|zfs_resilver|scan_idle'
debug.zfs_scan_idle=50
debug.zfs_vdev_resilver_max_active=2
debug.zfs_vdev_resilver_min_active=1
What it means: These knobs influence how aggressively ZFS issues I/O for resilver and how much it idles to favor production.
Names vary by platform/distribution; don’t copy-paste blog values blindly.
Decision: If you have I/O headroom and acceptable latency, increase max_active modestly. If latency is already bad, don’t.
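On Linux with OpenZFS, these tunables live as module parameters rather than sysctls. The quickest way to see what your specific version actually exposes (names differ across releases, so treat any list from a blog post—including this one—as a starting point):
cr0x@server:~$ ls /sys/module/zfs/parameters/ | egrep -i 'resilver|scan|scrub'
Read the current value with cat, change it by writing to the file (or via module options for persistence), and record the original value before you touch anything.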
Task 14: Verify the pool isn’t dangerously full (full pools rebuild slowly and fail creatively)
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint -p tank | head
NAME USED AVAIL REFER MOUNTPOINT
tank 19854735163392 2533274798080 1048576 /tank
What it means: Roughly 19.8 TB used, 2.5 TB available. On a ~22 TB pool, that’s flirting with the danger zone.
Decision: If you’re above ~80–85% and resilver is slow, prioritize freeing space (delete old snapshots, move cold data) before you tune for speed.
Safe ways to speed up resilver (what works, what doesn’t)
The goal isn’t “make resilver fast.” The goal is “restore redundancy quickly without blowing up production or corrupting data.”
Those are not the same. You can always make things fast by doing them wrong.
1) Reduce competing I/O (the unsexy, highest-leverage move)
If resilver is IOPS-bound, the winning move is to stop generating random I/O. That usually means:
- Pause batch jobs: backups, log reindexing, analytics, large rsyncs.
- Throttle or migrate noisy tenants (VM clusters are famous for this).
- Delay snapshot pruning that triggers lots of frees/rewrites (depends on implementation and workload).
This is often politically hard. It shouldn’t be. A degraded pool is a risk event. Treat it like one.
2) Increase resilver aggressiveness—carefully, and with a rollback plan
ZFS tunables that control scan/resilver concurrency can increase throughput. They can also increase tail latency and trigger timeouts in sensitive apps.
Adjust in small steps, measure, and revert if pain outweighs gain.
cr0x@server:~$ sudo sysctl -w debug.zfs_scan_idle=0
debug.zfs_scan_idle = 0
What it means: Lower idle time means scan work yields less to normal I/O. Resilver gets more turns.
Decision: Use this only when you can tolerate higher latency, and monitor application SLOs immediately. If latency spikes, put it back.
cr0x@server:~$ sudo sysctl -w debug.zfs_vdev_resilver_max_active=4
debug.zfs_vdev_resilver_max_active = 4
What it means: More concurrent I/O operations per vdev. Good for underutilized systems, bad for already-saturated spindles.
Decision: If disks show low %util and low queue depth, this can help. If disks are already pegged, it will mostly increase latency.
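Whatever your platform calls these knobs, capture the baseline before you change anything so rollback is a copy-paste rather than a memory test. A minimal sketch using the tunable names from this example—adjust to whatever your system actually exposes; the output path is arbitrary:
cr0x@server:~$ sysctl debug.zfs_scan_idle debug.zfs_vdev_resilver_max_active | tee /var/tmp/zfs-scan-baseline-$(date +%F).txt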
3) Put the replacement on the best path (HBA, firmware, cabling)
The boring truth: resilver speed is often limited by one misbehaving link. A single disk at 1.5 Gbps SATA, or an HBA port flapping,
can drag a RAIDZ resilver down because parity reconstruction waits on stragglers.
Fix the physical layer. Then tune.
4) Prefer mirrors when rebuild time matters more than capacity
If you’re designing systems where rebuild time under failure is a core risk, mirrors are your friend. They resilver by copying allocated blocks from a healthy side.
In many real deployments, mirrors also deliver more predictable performance under partial failure.
RAIDZ is fine—sometimes great—but don’t pretend it’s “the same but cheaper.” During resilver, it’s a different beast.
5) Keep pools less full (your future self will thank you)
The easiest way to speed up resilver is to avoid pathological fragmentation. The most reliable predictor of fragmentation in ZFS land is:
how close to full you run the pool.
Set quotas. Enforce them. Have a capacity plan. “We’ll clean it up later” is how you get 5-day resilvers and 2 a.m. meetings.
6) Use sane block sizes for the workload (before the incident, not during)
For VM stores, choose volblocksize with intention. For datasets, pick recordsize aligned with workload. This isn’t about micro-optimizing performance benchmarks;
it’s about reducing metadata and block count so rebuild work scales sanely.
7) Don’t “optimize” by disabling checksums or trusting magic
Checksums are not optional safety belts. Resilver is exactly when you want end-to-end integrity.
ZFS doesn’t give you a supported, sensible path to “skip verification for speed,” and that’s a feature, not a limitation.
Joke #2: Turning knobs during a resilver without measuring is like adding more coffee to fix a broken printer—emotionally satisfying, technically unrelated.
One quote worth keeping on your incident bridge
“Hope is not a strategy.” — paraphrased idea commonly cited in engineering and operations
The operational version: measure first, change one thing, measure again. Anything else is performance cosplay.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company ran a ZFS-backed VM cluster on a wide RAIDZ2 vdev. It had worked “fine” for years. One disk failed on a Tuesday.
The on-call swapped it quickly and kicked off the replace. Everyone relaxed.
The assumption: “Resilver only copies used data, so it’ll be faster than a full rebuild.” The pool had about 60% used.
They did the classic back-of-napkin math using sequential throughput and decided resilver would finish overnight.
Overnight came and went. Progress stalled at single-digit percent. Latency for customer VMs spiked intermittently, and the hypervisor fleet started logging guest I/O timeouts.
The team reacted by adding more load—specifically, migrating VMs around to “balance.” That migration workload was random reads and writes. It poured gasoline on the fire.
The real problem: the pool was old, snapshot-heavy, and badly fragmented. Resilver was IOPS-bound, not bandwidth-bound. Every “move fast” mitigation made the I/O pattern worse.
After 36 hours, a second disk threw errors. Now it wasn’t a slow rebuild; it was a data risk incident.
They recovered, but the lesson stuck: resilver time is not a function of “used TB” alone. It’s a function of allocation history, workload shape, and contention.
Their postmortem action items were simple and uncomfortable: enforce capacity headroom, cap snapshot counts, and stop building wide RAIDZ vdevs for latency-sensitive VM clusters.
Mini-story 2: The optimization that backfired
Another shop decided resilvers were taking too long. Someone found tunables online and set scan/resilver concurrency aggressively across the fleet.
It looked great in a quiet staging environment: rebuilds were faster. Everyone high-fived. The change rolled out.
Then production had a real failure. A disk dropped during peak business hours. Resilver ramped up like a jet engine: lots of concurrent I/O, minimal idling.
Rebuild speed improved, sure. Meanwhile, database latency went from “fine” to “why is everything timing out.”
The worst part wasn’t the latency. It was the retries. Apps started retrying failed requests, which increased load, which increased I/O, which increased latency.
The system entered a familiar spiral: the more it struggled, the harder it tried.
The team rolled back tunables mid-incident. Rebuild slowed down, but the platform stabilized.
Postmortem conclusion: “faster resilver” is not a global default. It’s an incident-mode switch, tied to business hours and SLOs, with explicit monitoring and rollback.
Mini-story 3: The boring but correct practice that saved the day
A financial services company (yes, the kind that loves change control) ran mirrored vdevs for critical datasets and RAIDZ for colder tiers.
They also enforced a simple policy: pools stay under a defined fullness threshold, and snapshot retention is capped with regular pruning.
A disk failed during quarter-end. Of course it did. The on-call replaced it, and resilver began. They didn’t touch tunables at first.
Instead, they executed the runbook: pause non-essential batch jobs, verify no concurrent scrub, check link speed, check dmesg for resets, and watch latency dashboards.
Resilver finished in a predictable window. No drama. No second failure. No heroic tuning.
The team’s favorite part was how little they had to explain to management—because nothing customer-visible happened.
The “boring” work was done months earlier: headroom, sane vdev design, and operational discipline.
That’s the kind of reliability story nobody tells at conferences because it doesn’t fit on a t-shirt. It’s still the one you want.
Common mistakes: symptom → root cause → fix
1) Symptom: Resilver rate starts decent, then collapses
Root cause: Replacement disk internal write stall (often SMR or firmware GC), or transport link renegotiated down, or the pool hit more fragmented regions.
Fix: Check iostat -x for rising await and collapsing MB/s; check dmesg for resets/link speed. Swap port/cable/HBA or replace the drive model.
2) Symptom: “Scanned” grows fast, “issued” is tiny
Root cause: Metadata-heavy traversal with low actual reconstruction, often due to fragmentation and many snapshots; sometimes due to throttling settings.
Fix: Reduce competing metadata churn (pause snapshotting, heavy filesystem activity). Consider temporarily lowering scan idle if latency budget allows.
3) Symptom: Apps time out during resilver even though throughput isn’t high
Root cause: Tail latency spike from random I/O contention; a few queues saturate while average throughput looks modest.
Fix: Watch await, queue depth, and app-level p99 latency. Reduce load, or increase resilver idle to give production priority.
4) Symptom: Resilver never finishes; progress inches and then “restarts”
Root cause: Device flapping, transient disconnects, or repeated errors forcing retries; sometimes a marginal backplane.
Fix: Check zpool status error counters; inspect dmesg. Fix hardware. No tunable compensates for a cable that hates you.
5) Symptom: CPU pegged during resilver on an “I/O system”
Root cause: Checksum/encryption work on many small blocks, plus metadata overhead. Can be amplified by dedup.
Fix: Confirm with top/vmstat and ARC stats; reduce small-block churn (pause VM migrations), and plan CPU upgrades for encrypted pools.
6) Symptom: Resilver is slow only on one pool, not others on the same hardware
Root cause: Pool fullness, fragmentation, dataset block size choices, snapshot count, or vdev width differences.
Fix: Compare zfs list usage, snapshot counts, and dataset properties. The hardware isn’t “slow”; your allocation history is.
7) Symptom: Rebuild speed improved after “tuning,” then pool gets weird later
Root cause: Persistent sysctl/tunable changes applied globally without guardrails; increased I/O pressure causes timeouts and secondary failures.
Fix: Make tunables incident-scoped with explicit rollback. Capture baseline values and revert after the pool is healthy.
Checklists / step-by-step plan
Step-by-step: when a disk fails and resilver begins
- Confirm state: zpool status. Ensure it’s a resilver, not a scrub, and identify the affected vdev.
- Stop competing maintenance: If a scrub is running, stop it unless policy requires it right now.
- Hardware sanity: Confirm replacement disk model (CMR vs SMR), link speed, and no resets in dmesg.
- Measure contention: iostat -x and app latency dashboards. Decide if you have headroom to push resilver harder.
- Check memory pressure: vmstat and ARC stats. Ensure no swapping.
- Decide priority: If risk is high (second disk shaky, critical data), prioritize resilver. If business hours are critical, bias toward SLOs.
- Apply tunables carefully (optional): Increase resilver aggressiveness in small increments, monitoring p95/p99 latency.
- Communicate: Set expectations. “Degraded until X” is a business risk statement, not a storage trivia fact.
- After completion: Verify pool is healthy, then revert temporary tunables. Schedule a scrub after redundancy is restored.
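If you want those measurements collected the same way every time (and pasted into the incident channel instead of reconstructed from memory), here is a minimal sketch; the script path and default pool name are assumptions, and it expects to run as root:
cr0x@server:~$ cat /usr/local/sbin/resilver-snapshot.sh
#!/usr/bin/env bash
# Capture a one-shot picture of a resilver: ZFS view, disk view, memory view, kernel errors.
# POOL default and output directory are assumptions; adjust for your environment. Run as root.
set -euo pipefail
POOL="${1:-tank}"
OUT="/var/tmp/resilver-${POOL}-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "== zpool status =="
  zpool status -v "$POOL"
  echo "== iostat (two 5s samples) =="
  iostat -x 5 2
  echo "== memory / swap =="
  vmstat 1 3
  echo "== recent kernel I/O messages =="
  dmesg -T | egrep -i 'reset|i/o error|link up|link down' | tail -n 20 || true
} > "$OUT"
echo "wrote $OUT"
cr0x@server:~$ sudo /usr/local/sbin/resilver-snapshot.sh tank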
Checklist: safe speedups you can justify in a postmortem
- Pause non-essential batch I/O and snapshot-heavy jobs.
- Stop concurrent scrubs while resilver is running (unless compliance requires otherwise).
- Fix transport issues (link resets, downshifted SATA speeds) before tuning.
- Increase resilver concurrency modestly only when disks have headroom and app latency is stable.
- Reduce scan idling temporarily only during a controlled incident window.
- Prefer mirrors for tiers where rebuild risk dominates capacity efficiency.
- Maintain capacity headroom as a policy, not a suggestion.
Checklist: things you should not do during a resilver
- Don’t enable dedup “to save space” mid-incident.
- Don’t start big migrations, rebalancing, or bulk rewrites unless you’re intentionally trading resilver time for a bigger risk.
- Don’t keep cranking tunables upward when latency is already bad; you’re just making failure louder.
- Don’t ignore kernel logs. If you see resets or I/O errors, you’re in hardware land now.
FAQ
1) Is resilver supposed to be faster than scrub?
Often, yes—because resilver only touches allocated blocks that need reconstruction. But fragmentation and metadata traversal can erase that advantage.
If the pool is old and random-I/O heavy, resilver can feel like a scrub with extra steps.
2) Why does “scanned” not match “resilvered”?
“Scanned” reflects how much the scan process has walked through the pool’s block pointers and metadata.
“Resilvered” is the actual reconstructed data written to the replacement. Lots of scanning with little resilvered typically means metadata-heavy work or throttling/IOPS limits.
3) Does ZFS resilver copy only used space?
ZFS aims to resilver only allocated (referenced) blocks, not the whole raw device. That’s why free space doesn’t always cost time.
But “allocated blocks” can still be scattered into millions of small extents, which makes the operation slow.
4) Can I pause and resume a resilver?
Depending on platform and version, you may be able to stop a scan and later resume, but behavior varies and may restart portions of work.
Operationally: treat “pause” as “delay with risk,” not a clean checkpoint.
5) Should I run a scrub immediately after replacing a disk?
Usually: resilver first, scrub after. While degraded, you want redundancy restored as quickly as possible.
After resilver completes and the pool is healthy, a scrub is a good follow-up to validate integrity—schedule it during low load.
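Once zpool status reports the pool back to ONLINE, the follow-up is a single command (pool name assumed), ideally kicked off in a quiet window:
cr0x@server:~$ sudo zpool scrub tank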
6) What’s the single safest way to shorten resilver time?
Reduce competing I/O and keep pools less full. Tunables help at the margins; workload and fragmentation determine the baseline.
The “safest” speedup is taking pressure off the pool so resilver can use IOPS without harming production.
7) Are mirrors always better than RAIDZ for resilver?
Not always, but mirrors typically resilver more predictably and with less parity-read amplification.
RAIDZ can be efficient and reliable, but rebuild behavior under failure is more complex, especially on wide vdevs and busy pools.
8) Why did replacing a disk with a “same size” model make resilver slower?
Same size isn’t same performance. You may have introduced SMR behavior, lower sustained write rates, worse firmware under random writes, or a link negotiated at a lower speed.
Verify model and check transport errors.
9) Does compression make resilver faster or slower?
Usually faster for I/O-bound systems because fewer bytes move. It can be slower if CPU becomes the bottleneck, especially with encryption and small blocks.
Measure CPU during resilver; don’t assume.
10) If resilver is slow, is my data at risk?
A degraded pool has reduced redundancy, so risk is higher until resilver finishes. Slow resilver extends the exposure window.
That’s why the right reaction isn’t just “wait”; it’s “reduce load, fix hardware issues, and restore redundancy quickly.”
Next steps you can do today
If you’re in the middle of a painfully slow resilver, do this in order:
- Run zpool status and confirm you’re not accidentally scrubbing while degraded.
- Check dmesg for link resets and I/O errors; fix physical issues before touching ZFS knobs.
- Use iostat -x to decide whether you’re IOPS-bound or bandwidth-bound.
- Reduce competing I/O: pause backups, migrations, batch jobs, and any heavy snapshot churn.
- If latency budget allows, adjust resilver aggressiveness modestly and monitor p95/p99 latency; revert after the pool is healthy.
If you’re not currently degraded, even better. Use that calm to buy future speed: keep headroom, avoid surprise SMR, choose vdev geometry intentionally,
and treat resilver time as a first-class design constraint—not an afterthought you discover when the disk dies.