The page is red. Latency is up. Someone says “just run a scrub” like it’s a magical disinfectant wipe.
Another person replies “it’s already resilvering.” Now you’ve got two adults arguing about verbs while your
storage is trying not to eat itself.
Scrub and resilver are not synonyms. They don’t have the same goal, they don’t have the same triggers,
and—critically—they don’t have the same risk profile when you’re already limping on a degraded vdev.
Mixing them up is how you turn a recoverable disk failure into a “we’ll restore from backups” meeting.
The mental model: verification vs reconstruction
Think of a ZFS pool as two jobs happening over time:
proving data is correct and making redundancy whole.
Scrub is the first. Resilver is the second.
A scrub is a full integrity audit. ZFS walks allocated blocks, reads them, verifies checksums,
and if redundancy exists (mirror, RAIDZ), it repairs bad copies by rewriting good data over the bad.
Scrub is about correctness across everything you’ve stored.
A resilver is reconstruction targeted at a device that needs to become a correct member of a vdev again:
after a disk replacement, after an offline/online event, after a transient disconnect, after a device was removed
and reintroduced. Resilver is about restoring redundancy for that one device (or portion of it), not auditing the universe.
Here’s the sentence that prevents incidents: Scrub proves your data; resilver rebuilds your redundancy.
They can both discover errors. They can both repair. But they start from different intents and cover different scopes.
Precise definitions (and what they are not)
Scrub: a checksum-driven audit of allocated data
Scrub reads allocated blocks in the pool. Not free space. Not “the raw disk.” Actual live data and metadata.
Each block’s checksum is verified. If a block is corrupt, ZFS tries to fetch a correct copy from redundancy and repair it.
If ZFS can’t find a correct copy, you get a permanent error: the checksum didn’t match and there was no good source.
What scrub is not:
- Not a disk surface scan (that’s closer to vendor diagnostics or SMART long tests).
- Not a “performance cleanup.” If scrub made you faster, you were broken.
- Not a replacement for backups. It can detect corruption; it can’t conjure missing data.
Resilver: bringing a device back into correct membership
Resilver copies (or reconstructs) the data needed for a device to rejoin a vdev with the correct contents.
In a mirror, that means copying blocks from the healthy side to the new/reintroduced disk.
In RAIDZ, it means reconstructing from parity to populate the new/reintroduced disk.
Modern ZFS implementations track dirty time logs (DTLs) to know which ranges actually need rebuilding, and many support sequential resilvering,
so a resilver tends to focus on actually-allocated data rather than "the entire disk." That's good.
It’s also why people sometimes underestimate resilver load: it’s less than a full-disk rebuild, but it’s still a heavy,
latency-sensitive read+write workload.
What resilver is not:
- Not a proactive integrity check. It might discover errors, but that’s collateral, not the objective.
- Not optional when redundancy is compromised. If you’re degraded, you’re gambling with every read.
- Not “just copying files.” It’s copying blocks, including metadata, often with random-ish access patterns.
One quote that operations people learn the hard way:
Hope is not a strategy.
— a maxim popular in SRE circles, usually attributed to Gordon R. Sullivan
What triggers a scrub, what triggers a resilver
If you can’t answer “why is it scanning right now?” you don’t control your system. ZFS will do what you ask,
and it will also do what it must.
Scrub triggers
- You start it: zpool scrub pool.
- Your scheduler starts it: cron, a systemd timer, or a NAS UI (a minimal cron sketch follows below).
- Some appliances start it automatically after upgrades or certain maintenance events.
Scrub is a policy decision. It’s scheduled hygiene. You pick frequency based on risk, workload, and how quickly you want to detect latent errors.
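If you want that policy to be explicit rather than buried in an appliance default, a plain cron entry is enough. A minimal sketch, assuming a pool named tank and zpool living in /usr/sbin (check whether your distribution already ships a periodic scrub job before adding another):
# /etc/cron.d/zfs-scrub-tank  (example file; adjust the schedule to your quiet window)
0 2 1 * * root /usr/sbin/zpool scrub tank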
Resilver triggers
- Disk replacement: zpool replace.
- Device returns: a transient cable or HBA glitch and the disk comes back.
- Online after offline: zpool offline then zpool online, or a reboot.
- Attach in mirrors: zpool attach to turn a single disk into a mirror (see the sketch after this list).
Resilver is an availability event. Something changed in the set of devices that form redundancy, and ZFS is repairing the redundancy contract.
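The attach case is worth seeing once, because the argument order trips people up: you name the existing device first, then the new one. A sketch with placeholder by-id paths:
cr0x@server:~$ sudo zpool attach tank /dev/disk/by-id/ata-EXISTING_DISK /dev/disk/by-id/ata-NEW_MIRROR_DISK
cr0x@server:~$ sudo zpool status tank    # the scan line should now read "resilver in progress"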
Joke #1: A scrub is like taking inventory; a resilver is like restocking after someone stole a pallet. Mixing them up is how you reorder paperclips during a fire.
What gets read, what gets written, and why your IOPS vanish
Scrub I/O profile
Scrub reads every allocated block and verifies it. That means:
- Reads dominate, often streaming but not perfectly sequential (metadata walks jump around).
- Writes happen only when repair is needed (or when rewriting metadata due to detection/repair mechanics).
- Worst-case random reads show up on fragmented pools or busy datasets with lots of small blocks.
On healthy pools, scrub is “read mostly.” On unhealthy pools, scrub can turn into “read everything, then write repairs,” which is when latency complaints get loud.
Resilver I/O profile
Resilver is read + write by definition: it must populate one device with correct data. Depending on topology:
- Mirror resilver: read from healthy disk(s), write to the new/returned disk.
- RAIDZ resilver: read from all remaining disks (to reconstruct), write to the target disk.
- Special vdevs (metadata/small blocks) can make resilver surprisingly “spiky.”
Resilver competes with your production workload for IOPS and bandwidth. If the pool is degraded, resilver also competes with “every normal read now has less redundancy,” which increases the cost of handling errors.
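To see where that scan I/O actually lands, per-vdev statistics beat pool-level totals. A quick look (output shape varies a bit by OpenZFS version; the interval is arbitrary):
cr0x@server:~$ sudo zpool iostat -v tank 5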
The operational implication
Scrub is planned pain; resilver is unplanned pain. Plan the first so the second doesn’t become catastrophe.
If you are resilvering, treat performance tuning as secondary to data safety. Your job is to finish the resilver without losing another disk.
How to read zpool status like you mean it
zpool status is where reality lives. It tells you whether you’re scrubbing or resilvering, how far along,
and whether errors are being found or repaired.
Key fields that matter in incidents
- state: ONLINE / DEGRADED / FAULTED / UNAVAIL. This drives your urgency.
- scan: scrub in progress, resilver in progress, scrub repaired X, resilvered X, and a rate/ETA.
- errors: “No known data errors” is not a victory lap, but it is a good sign.
- READ/WRITE/CKSUM columns per device: tells you if the device is lying, failing, or being kicked.
If you take nothing else: the “scan” line says which process is running. Stop guessing.
Practical tasks: commands, outputs, decisions (12+)
These are the things you actually do at 02:13 when someone asks “is it safe?” Each task includes:
the command, what the output means, and what decision it should drive.
Task 1: Confirm whether it’s a scrub or resilver
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered.
action: Wait for the resilver to complete.
scan: resilver in progress since Thu Dec 26 00:41:11 2025
1.23T scanned at 612M/s, 742G issued at 369M/s, 5.41T total
742G resilvered, 13.40% done, 03:46:12 to go
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
sdg ONLINE 0 0 0
sdh DEGRADED 0 0 0 (resilvering)
errors: No known data errors
Meaning: It’s a resilver, not a scrub (“scan: resilver in progress”). Pool is DEGRADED but not failing.
Decision: Don’t start a scrub “to help.” Let resilver finish unless you have evidence of silent corruption you must audit immediately (rare during active resilver).
Task 2: Start a scrub on purpose (and only on purpose)
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Thu Dec 26 01:12:03 2025
388G scanned at 1.05G/s, 122G issued at 331M/s, 5.41T total
0B repaired, 2.20% done, 04:41:09 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
errors: No known data errors
Meaning: Scrub is running; “0B repaired” is good so far. The “issued” rate shows actual I/O being sent, sometimes lower than “scanned.”
Decision: Leave it running if the workload can tolerate it. If production is suffering, throttle via tunables (platform-dependent) or run off-hours next time.
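On Linux/OpenZFS, the throttle knobs are module parameters; names and defaults vary by version, so read the module-parameters man page and the current value before you write anything. A hedged sketch (zfs_scan_vdev_limit caps in-flight scan I/O per leaf vdev; the value below is an example, not a recommendation, and should be reverted after the window):
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_scan_vdev_limit
cr0x@server:~$ echo 2097152 | sudo tee /sys/module/zfs/parameters/zfs_scan_vdev_limit   # example value; revert afterward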
Task 3: Stop a scrub (because you like your latency)
cr0x@server:~$ sudo zpool scrub -s tank
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub canceled on Thu Dec 26 01:24:55 2025
512G scanned at 1.02G/s, 0B repaired, 9.24% done
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
errors: No known data errors
Meaning: Scrub stopped. You did not “break” the pool; you just canceled the audit.
Decision: Reschedule scrub for a window where it won’t fight peak I/O. Don’t go six months without one because canceling felt good.
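On OpenZFS 0.8 and newer, pausing is usually better than canceling: progress is kept, and running scrub again resumes from the pause point (check the zpool-scrub man page if you're unsure your version supports it):
cr0x@server:~$ sudo zpool scrub -p tank    # pause; zpool status will report the scrub as paused
cr0x@server:~$ sudo zpool scrub tank       # later: resumes from where it paused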
Task 4: Identify which device is being resilvered (and why)
cr0x@server:~$ sudo zpool status -P tank
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered.
scan: resilver in progress since Thu Dec 26 00:41:11 2025
742G resilvered, 13.40% done, 03:46:12 to go
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
/dev/disk/by-id/ata-WDC_WD80EFAX_A1A... ONLINE 0 0 0
/dev/disk/by-id/ata-WDC_WD80EFAX_B2B... ONLINE 0 0 0 (resilvering)
Meaning: The “(resilvering)” tag tells you the target. -P shows persistent paths, not fragile /dev/sdX names.
Decision: Verify the physical disk matches the by-id path before you pull anything from a chassis. This prevents “wrong disk replaced” disasters.
Task 5: Replace a failed disk correctly
cr0x@server:~$ sudo zpool replace tank /dev/disk/by-id/ata-WDC_WD80EFAX_B2B... /dev/disk/by-id/ata-WDC_WD80EFAX_NEW...
cr0x@server:~$ sudo zpool status tank
pool: tank
state: DEGRADED
scan: resilver in progress since Thu Dec 26 02:03:22 2025
96.4G scanned at 485M/s, 62.1G issued at 313M/s, 5.41T total
62.1G resilvered, 1.15% done, 05:02:10 to go
Meaning: Replacement initiated; resilver starts. “Issued” shows actual reconstruction work.
Decision: If resilver rate is extremely low, move immediately to bottleneck diagnosis (see playbook). Slow resilvers extend the window of risk.
Task 6: Spot silent corruption versus device errors
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 06:11:44 with 0 errors on Thu Dec 26 08:02:13 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda ONLINE 0 0 2
sdb ONLINE 0 0 0
errors: No known data errors
Meaning: Device sda has accumulated checksum errors. The scrub itself repaired 0B, so the bad reads were most likely detected and healed earlier during normal reads; the per-device counters persist until you clear them. Either way, that device returned data ZFS had to correct from redundancy.
Decision: Investigate sda immediately: cabling/HBA, firmware, SMART, and consider preemptive replacement. Checksum errors are ZFS politely telling you a disk lied.
Task 7: Correlate ZFS errors with kernel logs
cr0x@server:~$ sudo dmesg -T | tail -n 20
[Thu Dec 26 03:11:19 2025] sd 2:0:8:0: [sda] tag#231 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Dec 26 03:11:19 2025] sd 2:0:8:0: [sda] Sense Key : Medium Error [current]
[Thu Dec 26 03:11:19 2025] sd 2:0:8:0: [sda] Add. Sense: Unrecovered read error
[Thu Dec 26 03:11:19 2025] blk_update_request: I/O error, dev sda, sector 2384812048
[Thu Dec 26 03:11:21 2025] ata3: hard resetting link
Meaning: This is not “ZFS being fussy.” The kernel is reporting real read errors and link resets.
Decision: Treat it as a failing disk or path. If you see resets, also suspect SATA/SAS cabling, expander, or HBA issues—not just the drive.
Task 8: Check SMART quickly (triage)
cr0x@server:~$ sudo smartctl -a /dev/sda | egrep -i 'Reallocated|Pending|Offline_Uncorrectable|UDMA_CRC|SMART overall'
SMART overall-health self-assessment test result: PASSED
5 Reallocated_Sector_Ct 0x0033 098 098 010 Pre-fail Always - 12
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 4
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 4
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 87
Meaning: “PASSED” is not exoneration. Pending/uncorrectable sectors are bad; high CRC errors scream cabling/path issues.
Decision: If CRC is climbing, fix cabling/backplane/HBA first. If pending/uncorrectable is present, plan a replacement even if the drive is still “working.”
Task 9: Find the pool’s ashift and topology (sets expectations)
cr0x@server:~$ sudo zdb -C tank | egrep 'ashift|vdev_tree|type'
ashift: 12
type: 'root'
type: 'mirror'
type: 'disk'
type: 'disk'
Meaning: ashift: 12 implies 4K sectors. Bad ashift choices can punish random I/O and make scrubs/resilvers slower than they should be.
Decision: If ashift is wrong (too small), you can’t fix it in-place. You plan migration, not heroics during an incident.
Task 10: Check whether autotrim is affecting scan performance
cr0x@server:~$ sudo zpool get autotrim tank
NAME PROPERTY VALUE SOURCE
tank autotrim on local
Meaning: Autotrim can add background work on SSD pools. Usually fine, sometimes a contributor to “why is everything slow right now?”
Decision: If you’re resilvering on SSDs under heavy load, consider temporarily disabling nonessential background churn—but only if you understand your SSD behavior and endurance.
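If you decide the trim churn is in the way during a resilver, the property can be flipped temporarily; just put re-enabling it in the same ticket:
cr0x@server:~$ sudo zpool set autotrim=off tank
cr0x@server:~$ sudo zpool get autotrim tank    # confirm, and turn it back on after the resilver completes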
Task 11: Observe real-time I/O while scrub/resilver runs
cr0x@server:~$ sudo iostat -x 2 5
Linux 6.8.0 (server) 12/26/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
3.1 0.0 2.8 18.6 0.0 75.5
Device r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 415.2 18.1 54032 6120 252.1 9.84 23.5 22.9 37.2 1.9 82.4
sdb 402.7 17.5 52810 5904 251.6 9.21 22.1 21.6 35.8 1.8 79.3
Meaning: High %util and queue (avgqu-sz) with rising await indicates disks saturated. That will stretch your scan and hammer app latency.
Decision: If production SLOs matter, throttle scan or move workload. If resilver is the priority, you accept the pain but watch for a second disk showing errors.
Task 12: Identify whether ARC pressure is making everything worse
cr0x@server:~$ sudo arcstat 2 3
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
01:33:20 328K 112K 34 28K 25 72K 64 12K 11 92G 104G 18G
01:33:22 341K 126K 36 31K 25 83K 66 12K 10 92G 104G 18G
01:33:24 355K 149K 42 45K 30 96K 64 8K 5 92G 104G 18G
Meaning: Miss rate climbing during scan suggests scan is evicting useful cache, especially on busy systems.
Decision: If scans routinely trash cache and hurt apps, schedule them better, or tune scan behavior (platform-dependent), or add RAM if the workload justifies it.
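To separate "ARC is too small" from "the scan is evicting hot data," compare the current ARC size against its target and ceiling. On Linux these live in arcstats (the path is Linux-specific; field names are stable):
cr0x@server:~$ awk '$1 == "size" || $1 == "c" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats   # bytes: size=current, c=target, c_max=ceiling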
Task 13: Verify last scrub time and whether you’re overdue
cr0x@server:~$ sudo zpool status tank | grep -E 'scan:|scrub'
scan: scrub repaired 0B in 06:11:44 with 0 errors on Thu Dec 26 08:02:13 2025
Meaning: You have a baseline: duration and result.
Decision: If you can’t find a recent “scrub repaired … with … errors” line in your history, you’re flying blind. Implement a schedule and alerting.
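A minimal alerting sketch, assuming nothing fancier than root's cron and local mail is available (the script name and recipient are placeholders; swap the mail command for your real notification path):
cr0x@server:~$ cat /usr/local/bin/zfs-health-check.sh
#!/bin/sh
# Hypothetical helper: zpool status -x prints a single "all pools are healthy" line when nothing is wrong.
zpool status -x | grep -q 'all pools are healthy' || zpool status -x | mail -s "ZFS needs attention on $(hostname)" ops@example.com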
Task 14: Check if a resilver is restarting or not making progress
cr0x@server:~$ sudo zpool status tank | sed -n '1,25p'
pool: tank
state: DEGRADED
scan: resilver in progress since Thu Dec 26 00:41:11 2025
1.23T scanned at 612M/s, 742G issued at 369M/s, 5.41T total
742G resilvered, 13.40% done, 03:46:12 to go
Meaning: “Scanned” increasing while “issued” stalls can indicate the scan is walking metadata but not issuing useful I/O—sometimes due to contention, sometimes due to errors.
Decision: If the percentage doesn’t move across multiple checks, look for device errors and bottlenecks. Don’t just wait and hope.
Task 15: Confirm pool health after completion
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
scan: resilvered 5.41T in 06:02:18 with 0 errors on Thu Dec 26 06:43:29 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
errors: No known data errors
Meaning: Resilver finished with 0 errors and pool is ONLINE.
Decision: Now you scrub (soon) if you haven’t recently, because resilver completion doesn’t guarantee you don’t have latent corruption elsewhere.
Fast diagnosis playbook (find the bottleneck fast)
When a scrub/resilver is slow, teams lose hours arguing about “ZFS overhead.” Don’t.
Find the bottleneck. It’s almost always one of: device, controller/path, workload contention,
recordsize/small block patterns, or system memory pressure.
First: establish what’s running and what “slow” means
- Check scan type and progress: run zpool status. Record the scanned/issued rates, the ETA, and whether errors appear. If the ETA keeps growing, it's not just slow; it's unstable.
- Check pool state: ONLINE vs DEGRADED changes everything. If DEGRADED, finishing the resilver is priority one.
Second: find the limiting layer (disk vs path vs CPU vs cache)
- Disk saturation: iostat -x. Look for high %util, high await, and large queues. If a single disk has much worse latency than its peers, it's your villain.
- Kernel errors: dmesg -T. Link resets, timeouts, "Medium Error," and "I/O error" are hard evidence. Fix path issues before you tune ZFS.
- ARC pressure: arcstat or the platform equivalent. If the scan evicts hot data, applications will suffer and may amplify I/O.
- CPU bottlenecks (less common, but real on checksum-heavy loads): check system CPU and whether compression/encryption is in play. Scrub verifies checksums; that's compute.
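The same triage, compressed into something you can paste during an incident (the pool name is a placeholder; read the three outputs in that order):
cr0x@server:~$ sudo zpool status -v tank | sed -n '1,15p'
cr0x@server:~$ sudo dmesg -T | grep -iE 'medium error|i/o error|reset|timeout' | tail -n 10
cr0x@server:~$ iostat -x 2 3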
Third: decide the operating mode
- Safety-first mode (degraded, recent disk errors, uncertain hardware): minimize extra churn, don't start new scrubs, reduce application write bursts, and avoid reboots.
- Performance-first mode (healthy pool, planned scrub): schedule, throttle, and keep it predictable. The goal is "finish without anyone noticing," not "win benchmarks."
Common mistakes: symptoms → root cause → fix
Mistake 1: “Scrub will rebuild the missing disk”
Symptoms: Pool is DEGRADED. Someone runs scrub. Nothing gets “fixed.” Degraded state persists.
Root cause: Scrub is not a redundancy rebuild operation. It audits and repairs corrupted blocks; it does not repopulate an absent/failed device.
Fix: Replace/online/attach the device to trigger resilver. Use zpool replace or zpool online as appropriate. Then monitor resilver.
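The commands that actually change a DEGRADED state, with placeholder device paths (online if the original device is back and trustworthy, replace if it is gone or dying):
cr0x@server:~$ sudo zpool online tank /dev/disk/by-id/ata-DISK_THAT_RETURNED
cr0x@server:~$ sudo zpool replace tank /dev/disk/by-id/ata-FAILED_DISK /dev/disk/by-id/ata-NEW_DISK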
Mistake 2: “Resilver finished, so data is verified”
Symptoms: After disk replacement, team relaxes. Weeks later, a file read throws checksum errors or an app reports corruption.
Root cause: Resilver rebuilds redundancy to one device; it does not necessarily walk and verify every block across the whole pool.
Fix: Schedule regular scrubs. After major events (disk failures, controller resets), run a scrub in the next safe window.
Mistake 3: Canceling scrubs forever because “they hurt performance”
Symptoms: Scrubs always get canceled. Eventually a second disk fails during a rebuild and latent errors surface at the worst time.
Root cause: Scrub is the mechanism that finds latent sector errors while you still have redundancy to repair them.
Fix: Run scrubs on a predictable cadence. If impact is unacceptable, tune schedule, add I/O headroom, or adjust pool design.
Mistake 4: Treating checksum errors as “just ZFS being dramatic”
Symptoms: CKSUM counts rise on one device. Pool stays ONLINE. People ignore it.
Root cause: Checksum errors mean data read from that device did not match what ZFS expected. ZFS corrected it using redundancy—this time.
Fix: Check cabling/HBA, then SMART, then replace the device if errors persist. Also scrub to force verification and healing.
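After the physical fix, zero the counters so any new errors stand out, then verify. Clearing repairs nothing; it only resets the tally:
cr0x@server:~$ sudo zpool clear tank sda
cr0x@server:~$ sudo zpool scrub tank    # in a safe window, to confirm the fix actually held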
Mistake 5: Optimizing scan speed by starving applications (or the reverse)
Symptoms: You throttle scrubs so much they run for days, or you unthrottle so much production falls over.
Root cause: Scans are background I/O that must be balanced against foreground latency demands.
Fix: Pick an explicit policy: time windows, throttles, and alerting if scans exceed a duration threshold (a sign of underlying hardware trouble).
Mistake 6: Replacing the wrong disk because /dev/sdX changed
Symptoms: After reboot, the “failed” disk name changes. Technician pulls the wrong sled. Now you’re degraded twice.
Root cause: Using ephemeral device nodes instead of stable identifiers.
Fix: Use /dev/disk/by-id paths in ZFS and operational docs. Verify with zpool status -P before any physical action.
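Before anyone pulls a sled, map the stable identifier to whatever the kernel calls it right now; the symlink target shows the current sdX name. Pair this with your enclosure's locate-LED tooling if you have it:
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep -i 'WD80EFAX'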
Three corporate mini-stories (how this goes wrong)
Mini-story #1: The incident caused by a wrong assumption
A mid-size SaaS shop ran ZFS on a pair of mirrored HDDs per node. Nothing fancy. A disk dropped out on one node during a quiet weekend.
The on-call engineer saw “DEGRADED” and did what they thought was the safe, conservative action: started a scrub.
The assumption was simple: scrub equals “repair.” And ZFS does repair during scrub—when it has redundancy. But the missing disk was still missing.
Scrub dutifully walked the remaining disk, verified what it could, and forced a ton of reads. Meanwhile, the pool had no redundancy.
The remaining disk wasn’t happy about being the sole source of truth under sustained load. Latent sector errors showed up.
A few blocks could not be read cleanly. With no mirror partner, ZFS couldn’t heal them. Now the pool had actual data errors, not just “degraded redundancy.”
They replaced the disk Monday morning and resilvered. Redundancy came back, but the damaged blocks stayed damaged—because the only good copy had never existed.
The fallout wasn’t total data loss, but it was worse: a handful of corrupted objects in the app’s blob store, discovered by customers in production.
The postmortem fix was boring: treat “degraded” as “minimize additional stress,” prioritize restoring redundancy (resilver), and run scrub after resilver in a window.
Also: put “scrub is not a rebuild” into the runbook in large letters.
Mini-story #2: The optimization that backfired
A finance company had RAIDZ2 pools backing a data warehouse. Scrubs were painful, so an engineer decided to “optimize” by running scrubs constantly at low intensity
and canceling them during business hours. The idea was to always be making progress while staying invisible.
In practice, the scrubs never finished. They’d run a few hours, get canceled, restart later, get canceled again. Weeks passed without a completed scrub.
Everyone felt good because the command history showed “scrub started” frequently. Management loves activity.
Then a disk failed. Resilver started, and it was slower than expected because one of the remaining disks had a growing pile of unreadable sectors.
The unreadable sectors had been present for some time, but the never-ending scrub pattern never forced a complete pass that would have surfaced the problem early.
Resilver encountered those bad sectors at the worst moment: while redundancy was reduced and load was high.
The resilver dragged on, which kept the pool in a risk window longer, and the warehouse performance cratered during peak reporting.
The fix wasn’t “scrub less.” It was “scrub correctly”: schedule a window where it completes, alert on completion and errors, and track baseline durations.
Their “optimization” was activity without outcomes, which is the corporate version of jogging in place.
Mini-story #3: The boring but correct practice that saved the day
A media company ran large ZFS pools with mirrors of SSDs for hot content and RAIDZ for warm archives.
Nothing heroic, but they did three disciplined things: monthly scrubs with completion alerts, stable by-id device naming everywhere,
and a hard rule that any checksum errors trigger investigation within a day.
One morning, alerts showed a scrub completed but recorded a small number of checksum errors on a single SSD.
The pool stayed ONLINE. No customer impact. This is the exact moment teams are tempted to shrug.
They didn’t shrug. They pulled SMART, saw CRC errors rising, and found a marginal cable in the backplane.
They fixed the path, cleared the error counters by replacing the cable and reseating, then ran a follow-up scrub.
No further errors. Problem ended quietly.
A month later, a different node lost a drive for real. Resilver completed quickly because the remaining devices were healthy
and the team already had confidence in their scan baselines. No drama, no “maybe it’s fine,” no surprise corrupted files.
Boring practices are underrated because they don’t produce adrenaline. They do produce uptime.
Interesting facts and historical context
- ZFS was built around end-to-end checksumming, meaning the filesystem, not the disk, decides whether data is correct.
- Scrub exists to catch latent errors—the “bit rot” class of problems that traditional RAID may not detect until it’s too late.
- Early ZFS systems popularized routine scrubbing as an operational habit, similar to how databases popularized routine backups and consistency checks.
- Resilver behavior evolved over time; older approaches could resemble “rebuild the whole disk,” while newer approaches focus on allocated space and are more sequential when possible.
- RAIDZ resilver inherently reads many disks to reconstruct the missing one. Mirrors can be gentler: read one side, write the other.
- Device naming got people hurt (operationally); the industry shift toward stable identifiers (by-id, WWN) was driven by real incidents of replacing the wrong disk.
- Checksums don't prevent corruption; they detect it. The prevention comes from redundancy plus repair actions (scrub or normal reads that trigger healing).
- Scrub is not free space verification. If you want to stress-test a new disk, you do burn-in and SMART testing, not just ZFS scrub.
- Latency-sensitive workloads notice scans first; the pain isn't "bandwidth," it's queueing delay. That's why iostat's await matters more than raw MB/s.
Joke #2: Scrubs are the dentist visit of storage—skip them long enough and you’ll eventually pay in root canals.
Checklists / step-by-step plans
Plan A: You replaced a disk and resilver started
- Confirm the target device: zpool status -P. Make sure you're resilvering what you think you're resilvering.
- Check pool state: if DEGRADED, treat this as a priority incident until the resilver completes.
- Watch for new errors: check the READ/WRITE/CKSUM counts every 10–30 minutes during the early hours (a simple watch loop is sketched after this plan).
- Check kernel logs: if you see resets/timeouts, fix the path now. A flapping link can stretch the resilver and trigger more failures.
- Decide contention policy: if this is a critical workload, you might temporarily reduce application load to finish resilver faster.
- After completion: verify ONLINE and 0 errors; then schedule a scrub soon (not necessarily immediately) to validate the broader pool.
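A low-effort way to keep eyes on progress and error counters without babysitting a terminal (the interval is arbitrary):
cr0x@server:~$ sudo watch -n 60 'zpool status tank | grep -A 3 "scan:"'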
Plan B: A scrub is slow and the business is yelling
- Measure impact: check app latency and disk await. If you can't quantify it, you're negotiating with vibes.
- Check whether the scrub is finding repairs: zpool status. If it's repairing, stopping may postpone healing. Balance the risk.
- Decide: cancel or continue. If production is at risk and the pool is healthy, cancel the scrub and reschedule. If the pool shows errors, lean toward continuing in a controlled window.
- Fix the root cause: slow scrubs often point to failing disks, cabling issues, or a pool design with insufficient IOPS headroom.
- Baseline durations: record how long scrubs take on healthy hardware. When it doubles, something changed.
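One way to keep that baseline without new tooling: append each completed scan line to a log and compare durations over time (path and pool name are placeholders; run it from root's cron after each scrub window):
cr0x@server:~$ sudo sh -c '(date; zpool status tank | grep -A 1 "scan:") >> /var/log/zfs-scan-baseline.log'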
Plan C: You see checksum errors but everything “seems fine”
- Identify the device: which disk is accumulating CKSUM errors?
- Check dmesg: look for I/O errors and resets.
- Check SMART: pending sectors and CRC errors guide whether this is media or path.
- Run a scrub (in a safe window): force verification and healing.
- Replace or fix path: if errors persist, do not negotiate with a lying disk. Replace it.
FAQ
1) Does a scrub read free space?
No. Scrub walks allocated blocks (live data and metadata). It’s integrity verification, not a raw-disk surface scan.
2) Does resilver verify checksums like scrub?
Resilver operations involve reading blocks and writing reconstructed/correct copies, and checksum verification occurs as part of normal reads.
But resilver is not a comprehensive audit of the whole pool; it focuses on what the target device needs to rejoin correctly.
3) Should I run a scrub immediately after resilver?
Usually: soon, yes. Immediately: depends. If the system is fragile or heavily loaded, schedule it for the next safe window.
The point is to validate the broader pool after a disruptive event.
4) Why is “scanned” higher than “issued” in zpool status?
“Scanned” reflects how far ZFS has walked through the space it needs to consider. “Issued” reflects actual I/O requests sent.
The gap can widen due to metadata traversal, skipping unneeded regions, contention, or implementation details.
5) Is it safe to cancel a resilver?
“Safe” is the wrong question. Canceling a resilver keeps you degraded longer. That increases risk.
You cancel only if you must, and you restart it as soon as possible after addressing the reason (like a path problem).
6) Scrub found errors but “errors: No known data errors” still appears. What gives?
ZFS can correct some errors using redundancy. If it repaired everything successfully, there may be no known data errors left.
The presence of device CKSUM errors still matters: something returned bad data.
7) Does running scrubs reduce SSD lifespan?
Scrub is mostly reads; writes occur when repairing. Reads still consume some SSD resources indirectly (controller work), but the write amplification concern is usually modest.
The bigger issue is performance impact and ensuring your SSDs are healthy and properly cooled.
8) How often should I scrub?
Common practice is monthly on large pools, more often for critical systems, less often for lightly used archives—if you have good monitoring and know your risk.
The right answer is: frequently enough to detect latent errors before a second failure makes them unrecoverable.
9) Why does resilver take so long on RAIDZ compared to mirrors?
RAIDZ resilver requires reading data/parity from multiple disks to reconstruct what belongs on the missing disk. Mirrors can copy from a single healthy member.
The topology dictates the I/O fan-out and the penalty under contention.
10) If my pool is healthy, can I ignore occasional checksum errors?
No. Occasional might mean “rare and transient,” but it can also mean “early warning.” Investigate first.
The cost of checking logs and SMART is cheap compared to discovering it’s real during a degraded event.
Next steps you can do this week
- Put scrub on a schedule that actually completes (and alert on completion and errors). “Started” is not a useful status.
- Baseline scrub duration on healthy hardware. When it changes materially, investigate before you have a failure.
- Standardize on stable disk identifiers (/dev/disk/by-id) in pool configs and runbooks (an import sketch follows this list).
- Write a one-page degraded-pool procedure: prioritize resilver, minimize churn, watch for second-device errors, avoid risky reboots.
- Train the team on zpool status: the scan line, the error counters, and what "No known data errors" does and does not promise.
- Decide your policy for scan impact: when to cancel a scrub, when to let it run, and how to handle business-hour pressure without abandoning integrity checks.
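If an existing pool was built with sdX names, you can usually fix the labels without rebuilding by re-importing against the by-id directory. Do this in a maintenance window, after confirming nothing critical has the pool mounted:
cr0x@server:~$ sudo zpool export tank
cr0x@server:~$ sudo zpool import -d /dev/disk/by-id tank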
Scrub and resilver are both “scan” operations, which is why people blend them together in conversation.
But in operations, verbs matter. One proves. One rebuilds. If you treat them interchangeably, ZFS won’t argue with you—it will just faithfully execute your mistake.