Your ZFS pool looks fine. Apps are happy. SMART is “PASSED.” Then a disk dies, a resilver starts, and suddenly you discover you’ve been living with silent corruption for months.
That moment is when people learn what a scrub is for.
A scrub is not a “performance task.” It’s not a ritual. It’s a controlled, repeatable way to prove that your pool can still read what it thinks it wrote—before you need that proof in the middle of an outage.
What a scrub actually does (and what it doesn’t)
A ZFS scrub is a full integrity walk of the pool. ZFS reads all allocated blocks, verifies checksums, and—if redundancy exists—repairs bad data by rewriting good copies.
That’s the key: scrub is a read verification pass with optional healing.
What scrub does
- Reads allocated data across datasets/zvols/snapshots, not free space.
- Verifies checksums end-to-end (the checksum is stored separately from the data it protects, so “the disk returned something” isn’t good enough).
- Repairs when possible by using parity/mirror copies and rewriting corrected blocks (a “self-healing” read).
- Surfaces latent sector errors and flaky paths that normal workloads might not touch.
- Accumulates evidence: error counters, affected vdevs, and whether corruption is permanent (no good copy) or correctable.
What scrub does not do
- It does not validate your application’s semantics. If your app wrote the wrong bytes consistently, ZFS will proudly preserve them.
- It does not prove backups are restorable. Scrub is not a restore test.
- It does not fix “bad hardware design.” If your HBA, expander, or backplane is lying, scrub may reveal symptoms but not cure the cause.
- It does not necessarily read free space. Free-space corruption is mostly irrelevant until allocation, and ZFS doesn’t waste time validating blocks that don’t exist.
Scrub is also not a resilver. A resilver reconstructs missing data for a replaced disk or reattached device. A scrub validates existing data and heals it if redundancy permits.
Confusing the two leads to some creative outages.
Joke #1: A scrub is like flossing—nobody enjoys it, and everyone swears they’ll do it more often after something starts bleeding.
What scrub proves (and what it can only suggest)
Operations is the art of knowing what your tools can prove. Scrub gives you strong evidence, but not omniscience.
Treat scrub results as a reliability signal with boundaries.
Scrub proves these things (with high confidence)
- ZFS can read allocated blocks and validate them against checksums at the time of the scrub. If the scrub completes cleanly, the pool’s data as read matched its recorded checksums.
- Redundancy paths can deliver correct data for blocks that were wrong on one device. “Repaired” counts mean ZFS had at least one good copy and wrote it back.
- Your system can sustain a full-pool sequential-ish read workload without collapsing. This matters because resilvers and restores look similar from a storage perspective.
Scrub suggests these things (useful, but not guaranteed)
- Disk media health. A clean scrub suggests the media is okay now, but it doesn’t forecast future failures. Disks fail on Tuesday because it’s Tuesday.
- Controller/cable integrity. If errors appear during scrub, the path is suspicious. If errors never appear, it’s still not a certificate of good cabling—just a lack of observed failure.
- Workload safety margins. Scrub speed and latency impact reveal contention, but a “fast scrub” doesn’t mean your random I/O workload is safe under pressure.
What it cannot prove
- That the data is “correct” to humans. ZFS validates integrity, not meaning. If you stored the wrong spreadsheet, it will faithfully protect it.
- That a future resilver will succeed. Scrub reduces the chance of discovering latent errors during resilver, but it cannot eliminate multi-disk failure, firmware bugs, or sudden URE storms.
A useful mental model: scrub answers “Can I read everything I care about right now, and if not, can I fix it with redundancy?”
That’s a very good question. It’s also not the only question you have.
One quote worth keeping in your runbook, because it’s painfully true:
“Hope is not a strategy.”
— General Gordon R. Sullivan
How often to run a scrub: rules that hold up in production
The “right” scrub frequency is an engineering compromise between detection latency and operational cost.
Detection latency is how long you’re willing to carry silent corruption before you find it. Operational cost is performance impact, wear, and human attention.
My default recommendation (and why)
- Home / small NAS: monthly scrub.
- Business file + VM pools: every 2–4 weeks, leaning toward 2 weeks for larger pools.
- Critical data + long retention snapshots: every 1–2 weeks.
- Archive / cold storage: monthly to quarterly, but only if you also do periodic restore drills and keep spare drives compatible.
If that sounds aggressive, remember what you’re buying: earlier discovery of corruption, and earlier discovery of marginal disks.
The most expensive scrub is the one you didn’t run before a disk failure forced your hand.
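If you want something concrete to start from, a plain cron.d entry gets you an every-two-weeks-ish cadence. Treat this as a minimal sketch: the pool name tank, the 02:00 window, and the /sbin/zpool path are assumptions for your environment, and a more auditable systemd-timer version appears later in this article.
cr0x@server:~$ cat /etc/cron.d/zfs-scrub-tank
# Scrub 'tank' on the 1st and 15th at 02:00 -- roughly every two weeks.
# Adjust the pool name, the schedule, and the zpool path to match your distro.
0 2 1,15 * * root /sbin/zpool scrub tank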
The two variables that should actually change your schedule
1) Pool size and rebuild math
Bigger pools take longer to scrub. Longer scrubs mean a larger window where another failure can coincide with “we’re already stressed.”
Also, rebuild times scale with capacity and device behavior under load. If resilver takes days, you should scrub more often, not less, because you want latent errors found while you still have redundancy.
2) Your data churn and snapshot retention
Scrub reads allocated blocks, including blocks referenced only by snapshots. Long snapshot retention increases “allocated data” even if live datasets are small.
If you keep months of snapshots, you’re carrying more historical blocks that need periodic verification. Otherwise you learn about old corruption right when you need to restore old data. That’s not a fun genre of surprise.
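To put a number on that, here’s a minimal sketch (assuming OpenZFS on Linux and the pool name tank) that sums usedbysnapshots across every dataset in the pool:
cr0x@server:~$ zfs list -Hp -o usedbysnapshots -r tank | awk '{sum+=$1} END {printf "%.2f TiB held only by snapshots\n", sum/2^40}'
If that figure rivals your live data, your scrub duration is mostly history, and retention policy is the tuning knob.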
What about SSDs? Do I scrub less?
No. If anything, scrub at least as often, because SSDs fail in ways that look clean until they don’t.
SSDs also have internal error correction that can mask a degrading NAND situation until the drive hits a cliff.
Scrub is your way of asking the full stack, end-to-end, to prove it can still read and checksum-validate.
Performance impact: don’t fear it, manage it
Scrub competes for I/O. On HDD pools, it often becomes a big sequential read that still disrupts random I/O latency.
On mixed workloads (VMs plus scrub), users don’t complain about throughput; they complain about latency spikes.
The answer is rarely “never scrub.” The answer is “scrub with guardrails”: scheduling, throttling, and monitoring. You can make scrubs boring. That’s the goal.
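On Linux with OpenZFS, the main guardrails besides scheduling are the scrub queue-depth module parameters. Treat the following as a sketch: the zfs_vdev_scrub_min_active / zfs_vdev_scrub_max_active parameters exist in current OpenZFS releases, but defaults and available knobs vary by version, so read what your module actually exposes before you change anything.
cr0x@server:~$ grep . /sys/module/zfs/parameters/zfs_vdev_scrub_*_active
cr0x@server:~$ echo 1 | sudo tee /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
Changes made this way do not survive a reboot; persist them via /etc/modprobe.d only after you’ve confirmed they actually help.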
Trigger-based scrubs (in addition to a schedule)
Scheduled scrubs are baseline hygiene. Trigger scrubs are for specific risk events:
- After replacing a disk and finishing resilver, run a scrub to validate the new steady state.
- After moving hardware (new HBA/backplane/cables), run a scrub to shake out path issues.
- After power events or kernel/storage driver updates, run a scrub during a low-traffic window.
- After any non-zero checksum errors appear, scrub sooner, not later, after you’ve stabilized the system.
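For the first trigger in the list above (post-resilver validation), recent OpenZFS ships zpool wait, which blocks until a named activity finishes. A minimal sketch, assuming a pool called tank and OpenZFS 2.0 or newer:
cr0x@server:~$ sudo sh -c 'zpool wait -t resilver tank && zpool scrub tank'
On older releases without zpool wait, poll zpool status in a loop instead.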
Interesting facts and history: why scrubbing exists
Scrubbing didn’t appear because engineers were bored. It exists because storage lies—quietly, and sometimes convincingly.
A few context points that matter when you’re making policy:
- ZFS was designed with end-to-end checksums from the start (Sun era), because controllers and disks can return “successful” reads with wrong data.
- The term “scrub” predates ZFS in storage: RAID arrays and enterprise SANs used patrol reads to discover latent sector errors before rebuilds.
- Latent sector errors became a mainstream concern as drive capacities outpaced rebuild times; the longer the rebuild, the more likely you hit a bad block during it.
- ZFS stores checksums separately from the block they protect (in metadata), avoiding the “corruption corrupts the checksum too” failure mode common in weaker designs.
- ZFS can repair only when redundancy exists. A single-disk pool can detect checksum mismatches, but it cannot conjure correct data.
- Scrub reads snapshots too because snapshots reference old blocks; this is both a benefit (verification) and a cost (more to read).
- Some errors are “correctable” and some are permanent. This distinction is operationally huge: correctable errors often point to a failing device or path; permanent errors mean data loss unless you restore from backup.
- Scrubs are intentionally interruptible (pause/stop and resume depending on implementation), acknowledging real-world production windows.
- Modern ZFS implementations expose tunables to throttle scrub impact, reflecting the shift from “big iron maintenance windows” to “always-on multi-tenant systems.”
Practical tasks: commands, outputs, and decisions
This section is deliberately hands-on: commands you can run, what the output implies, and what decision you make next.
If you only skim one part, skim this.
Task 1: Confirm pool health before you scrub
cr0x@server:~$ zpool status
pool: tank
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD80EAZZ-00BKLB0 ONLINE 0 0 0
ata-WDC_WD80EAZZ-00BKLB0 ONLINE 0 0 0
errors: No known data errors
Meaning: Pool is online; no scan in progress; error counters are zero.
Decision: Safe to start a scrub. If the pool were DEGRADED, you’d decide whether scrub is urgent (to find more errors) or risky (to stress remaining devices).
Task 2: Start a scrub (and know what it touches)
cr0x@server:~$ sudo zpool scrub tank
Meaning: ZFS begins reading allocated blocks in tank.
Decision: Do it in a window where added I/O is acceptable. If this is a VM host at 10:00 Monday, you just volunteered for user feedback.
Task 3: Watch scrub progress and estimate time-to-done
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
scan: scrub in progress since Mon Dec 23 01:12:43 2025
3.21T scanned at 612M/s, 1.04T issued at 198M/s, 18.7T total
0B repaired, 5.57% done, 1 days 02:11:09 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD80EAZZ-00BKLB0 ONLINE 0 0 0
ata-WDC_WD80EAZZ-00BKLB0 ONLINE 0 0 0
errors: No known data errors
Meaning: “Scanned” can outrun “issued” because the scrub discovers blocks (the metadata scan) ahead of actually issuing the reads; focus on the issued rate and the ETA trend. Repaired is 0, which is good.
Decision: If ETA balloons during business hours, you consider throttling via system tunables, moving the scrub window, or pausing (see later tasks).
Task 4: Detect whether scrub is fighting your workload
cr0x@server:~$ iostat -x 5 3
Linux 6.8.0 (server) 12/23/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
5.12 0.00 2.41 18.77 0.00 73.70
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await aqu-sz %util
sda 92.0 184320.0 0.0 0.00 28.44 2003.5 6.2 512.0 12.21 3.42 98.7
sdb 89.5 179200.0 0.0 0.00 31.02 2002.8 6.0 480.0 14.10 3.64 99.1
Meaning: Disks are pegged (~99% util), read await is high (28–31ms). That’s normal for HDD scrub, but it may be deadly for latency-sensitive workloads.
Decision: If users complain or VM latency spikes, throttle scrub impact or schedule it at night. If await is high even off-hours, investigate disk health/cabling.
Task 5: Find “real size” of what scrub must read (including snapshots)
cr0x@server:~$ zfs list -o name,used,usedbysnapshots,refer,avail -r tank
NAME USED USEDBYSNAPSHOTS REFER AVAIL
tank 18.7T 6.2T 128K 12.1T
tank/vm 11.4T 4.8T 6.6T 12.1T
tank/home 2.1T 1.2T 0.9T 12.1T
tank/backups 5.2T 0.2T 5.0T 12.1T
Meaning: Scrub reads allocated blocks; 6.2T is only snapshots. That’s extra scrub time you “forgot” you were paying for.
Decision: If scrubs are taking too long, adjust snapshot retention or split datasets/pools by criticality, not by vibes.
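If retention is the suspect, a quick way to see which snapshots actually hold the space (a sketch reusing the tank pool from above):
cr0x@server:~$ zfs list -t snapshot -o name,used,creation -s used -r tank | tail -n 10
Keep in mind a snapshot’s used column only counts blocks unique to that snapshot; space shared across a chain of snapshots won’t show up until the whole chain is destroyed, so treat this as a first pass, not exact accounting.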
Task 6: See scrub history and whether you’re actually doing it
cr0x@server:~$ zpool history -il tank | tail -n 12
2025-12-23.01:12:43 zpool scrub tank
2025-11-23.01:09:51 zpool scrub tank
2025-10-23.01:11:02 zpool scrub tank
2025-09-22.01:08:37 zpool scrub tank
Meaning: Someone (or a timer) has run monthly scrubs. If there’s a gap, you’ve been running on hope.
Decision: Put scrubs under a scheduler you can audit (systemd timers, cron) and alert on “scrub overdue.”
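“Alert on scrub overdue” can be as small as a script your monitoring runs. This is a sketch: the script name/path, the 30-day threshold, and the exit codes are placeholders for your alerting convention, and it leans on the zpool status wording plus GNU date.
cr0x@server:~$ cat /usr/local/bin/check-scrub-age.sh
#!/bin/sh
# Exit non-zero if the last completed scrub on a pool is older than N days.
# Sketch: relies on the "scrub repaired ... on <date>" wording (weekday stripped)
# and GNU date -d; an in-progress or never-run scrub counts as "no completed scrub".
POOL="${1:-tank}"
MAX_DAYS="${2:-30}"
LAST=$(zpool status "$POOL" | sed -n 's/.*scrub repaired.* on [A-Z][a-z]* \(.*\)$/\1/p')
[ -z "$LAST" ] && { echo "CRIT: no completed scrub recorded for $POOL"; exit 2; }
AGE_DAYS=$(( ( $(date +%s) - $(date -d "$LAST" +%s) ) / 86400 ))
if [ "$AGE_DAYS" -gt "$MAX_DAYS" ]; then
    echo "WARN: last scrub on $POOL finished $AGE_DAYS days ago"
    exit 1
fi
echo "OK: last scrub on $POOL finished $AGE_DAYS days ago"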
Task 7: Catch checksum errors early and identify the guilty vdev
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
scan: scrub repaired 0B in 05:41:10 with 2 errors on Mon Dec 23 06:53:53 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD80EAZZ-00BKLB0 ONLINE 0 0 2
ata-WDC_WD80EAZZ-00BKLB0 ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
tank/vm@auto-2025-12-01:vm-103-disk-0
Meaning: Two checksum errors and permanent errors: ZFS could not repair because all copies were bad or redundancy insufficient for those blocks.
Decision: Stop pretending this is “fine.” Restore affected data from backup/snapshots (if snapshots are clean), and start hardware/path investigation immediately.
Task 8: Clear error counters only after you’ve acted
cr0x@server:~$ sudo zpool clear tank
Meaning: Resets error counters and clears “errors” state. It does not magically fix data.
Decision: Clear only after: (1) you captured evidence, (2) replaced/confirmed hardware, and (3) ran a follow-up scrub to validate stability. Otherwise you’re deleting the crime scene.
Task 9: Pause or stop a scrub when it hurts (and resume later)
cr0x@server:~$ sudo zpool scrub -p tank
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
scan: scrub paused since Mon Dec 23 10:04:11 2025
7.12T scanned at 590M/s, 2.98T issued at 248M/s, 18.7T total
0B repaired, 15.9% done, scrub paused
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
...
Meaning: Scrub is paused; progress is preserved depending on implementation and version.
Decision: If latency is killing production, pause and resume in a quiet window rather than aborting the whole thing.
cr0x@server:~$ sudo zpool scrub -s tank
Meaning: Stops the scrub entirely. Unlike pause, there is nothing to resume; the next scrub starts from the beginning.
Decision: Use stop when you need to change hardware or when the scrub itself is triggering faults you must address first (timeouts, resets).
Task 10: Differentiate scrub from resilver in status output
cr0x@server:~$ zpool status tank
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Dec 24 02:01:17 2025
1.44T scanned at 312M/s, 620G issued at 134M/s, 6.60T total
620G resilvered, 9.16% done, 0 days 11:08:33 to go
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ata-WDC_WD80EAZZ-00BKLB0 ONLINE 0 0 0
replacing-1 DEGRADED 0 0 0
ata-WDC_WD80EAZZ-00BKLB0 FAULTED 0 0 0 too many errors
ata-WDC_WD80EAZZ-00BKLB0 ONLINE 0 0 0 (resilvering)
errors: No known data errors
Meaning: This is a resilver, not a scrub. Different risk profile: you’re reconstructing redundancy.
Decision: Avoid starting a scrub during resilver unless you’re intentionally validating a shaky pool and can afford the I/O. Usually: let resilver finish, then scrub.
Task 11: Check device error logs (Linux) when scrubs trigger resets
cr0x@server:~$ sudo dmesg -T | tail -n 20
[Mon Dec 23 03:18:11 2025] ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Mon Dec 23 03:18:11 2025] ata9.00: failed command: READ FPDMA QUEUED
[Mon Dec 23 03:18:11 2025] ata9: hard resetting link
[Mon Dec 23 03:18:12 2025] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Mon Dec 23 03:18:14 2025] ata9.00: configured for UDMA/133
Meaning: Link resets during heavy reads. That screams cabling/backplane/HBA issues, not “ZFS being picky.”
Decision: Fix the physical path. Scrub is just the messenger that forced sustained reads.
Task 12: Pull SMART data and decide whether to replace the disk
cr0x@server:~$ sudo smartctl -a /dev/sda | egrep -i "Reallocated|Pending|Offline_Uncorrectable|UDMA_CRC|Power_On_Hours"
9 Power_On_Hours 0x0032 086 086 000 Old_age Always - 31254
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 8
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 8
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 14
Meaning: Pending/uncorrectable sectors suggest real media trouble. UDMA CRC errors often point to cabling/backplane.
Decision: If pending/uncorrectable > 0 and scrub surfaced checksum errors: replace disk. If only CRC climbs: reseat/replace cable/backplane lane and keep watching.
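If the verdict is “watch it,” make the drive prove itself: a SMART extended self-test forces a full media read inside the drive, independent of ZFS. A sketch, assuming the /dev/sda from above:
cr0x@server:~$ sudo smartctl -t long /dev/sda
cr0x@server:~$ sudo smartctl -l selftest /dev/sda
The test runs in the background inside the drive; check the self-test log after the estimated duration, and treat any logged read failure as a strong vote for replacement.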
Task 13: Confirm ashift and why it matters for scrub time
cr0x@server:~$ zdb -C tank | egrep "ashift|vdev_tree" -n | head
42: vdev_tree:
57: ashift: 12
Meaning: ashift=12 implies 4K sectors. Wrong ashift (too small) can cause read-modify-write amplification and generally miserable behavior.
Decision: If ashift is wrong, you don’t “tune it.” You plan a migration to a correctly created pool. Scrub won’t save you from a foundational mistake.
Task 14: Identify whether a dataset is forcing pathological I/O during scrub
cr0x@server:~$ zfs get -o name,property,value -s local recordsize,compression,atime,dedup tank/vm
NAME PROPERTY VALUE
tank/vm recordsize 128K
tank/vm compression lz4
tank/vm atime off
tank/vm dedup off
Meaning: Reasonable defaults for many VM workloads. Dedup off is a sanity-preserving choice for most shops.
Decision: If you see dedup on and scrubs/resilvers are slow with memory pressure, investigate ARC/memory and DDT pressure. Don’t guess; measure.
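A quick way to measure, as a sketch assuming OpenZFS on Linux (zdb usually needs root): check whether a dedup table exists at all, and how big the ARC currently is relative to its ceiling.
cr0x@server:~$ sudo zdb -D tank
cr0x@server:~$ awk '/^size |^c_max / {printf "%s %.1f GiB\n", $1, $3/2^30}' /proc/spl/kstat/zfs/arcstats
A pool with dedup off reports empty DDTs; if the DDTs are large and the ARC is pinned at c_max, you’ve found your memory pressure.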
Task 15: Verify scrub isn’t “fast” because it isn’t reading much
cr0x@server:~$ zpool list -o name,size,alloc,free,fragmentation,health
NAME SIZE ALLOC FREE FRAG HEALTH
tank 29.1T 18.7T 10.4T 42% ONLINE
Meaning: Allocated is 18.7T. A scrub that “finishes” in 15 minutes probably did not read 18.7T; you’re misreading what happened (or looking at the wrong pool).
Decision: Sanity-check scrub duration against alloc and realistic device throughput. When numbers don’t match physics, assume you’re missing information.
Task 16: Use event logs to correlate scrub errors with timestamps
cr0x@server:~$ sudo zpool events -v | tail -n 20
TIME CLASS
Dec 23 2025 03:18:12.123456789 ereport.fs.zfs.checksum
zevent.fsname = tank
zevent.vdev_path = /dev/disk/by-id/ata-WDC_WD80EAZZ-00BKLB0_XXXX
zevent.vdev_guid = 1234567890123456789
Dec 23 2025 03:18:12.223456789 ereport.fs.zfs.io
zevent.fsname = tank
zevent.vdev_path = /dev/disk/by-id/ata-WDC_WD80EAZZ-00BKLB0_XXXX
Meaning: ZFS is telling you when and where it saw checksum/I/O faults.
Decision: Correlate with dmesg and SMART. If events line up with link resets, the “bad disk” might be a bad path.
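To correlate, pull the kernel log for the same window as the event timestamps; a sketch assuming systemd-journald with persistent logging (adjust the times to match your events):
cr0x@server:~$ sudo journalctl -k --since "2025-12-23 03:15" --until "2025-12-23 03:25" --no-pager
If the checksum events and the link resets share a timeline, suspect the path before the platters.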
Fast diagnosis playbook: find the bottleneck fast
When a scrub is slow or throwing errors, you can spend hours philosophizing—or you can do a quick triage that narrows it down to: disk, path, CPU/memory, or competing workload.
Here’s the practical order that saves time.
First: what does ZFS say is happening?
- Check: zpool status -v tank
- Look for: “scrub in progress” vs “resilver”; repaired bytes; “errors: Permanent errors”; which vdev shows READ/WRITE/CKSUM increments.
- Decision: If permanent errors appear, switch from “performance” to “data recovery” mode. Capture evidence, identify affected datasets/objects, plan restore.
Second: is it hardware path instability?
- Check: dmesg -T | tail -n 200 for resets/timeouts; smartctl for CRC errors; HBA logs if available.
- Look for: SATA link resets, SCSI aborts, NVMe resets, “I/O error” spam that appears only under scrub load.
- Decision: If resets occur: treat as a path issue first (cables/backplane/HBA/expander/firmware). Replace the cheapest suspect component before blaming ZFS.
Third: is it simple I/O saturation or contention?
- Check: iostat -x 5, plus whatever telemetry you use for latency.
- Look for: 100% util, high await, queue depth climbing, and whether writes also suffer.
- Decision: If this is “normal saturation,” manage it: schedule, throttle, or isolate workloads. If await is extreme even with low util, suspect firmware/drive issues.
Fourth: is ZFS or the OS starved (CPU, RAM, ARC pressure)?
- Check: vmstat 5, memory pressure, ARC sizing, and whether the system is swapping.
- Look for: swap activity; kswapd storms; CPU pegged in system time.
- Decision: If you’re swapping during scrub, fix memory first. Scrub is a read workload; it should not turn your server into a pager demo.
Fifth: is it “scrub is slow because it has a lot to read”?
- Check: zfs list (usedbysnapshots) and zpool list alloc.
- Look for: snapshot bloat, unexpected allocation, or fragmentation that implies a lot of seeks.
- Decision: If the pool is simply large and busy, accept longer scrubs but tighten scheduling and alerting. Reliability isn’t free; it’s billed in IOPS.
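Whatever the triage points to, capture the evidence before anyone clears counters or reboots. A minimal sketch: the script name/path, the output directory, and the /dev/sd? device glob are assumptions; widen the glob for NVMe or multipath setups.
cr0x@server:~$ cat /usr/local/bin/zfs-evidence.sh
#!/bin/sh
# Dump the current diagnostic state of a pool into a timestamped directory.
POOL="${1:-tank}"
OUT="/root/zfs-evidence-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"
zpool status -v "$POOL" > "$OUT/zpool-status.txt"
zpool events -v         > "$OUT/zpool-events.txt"
dmesg -T                > "$OUT/dmesg.txt"
for dev in /dev/sd?; do
    smartctl -a "$dev" > "$OUT/smart-$(basename "$dev").txt" 2>&1
done
echo "Evidence written to $OUT"
cr0x@server:~$ sudo /usr/local/bin/zfs-evidence.sh tank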
Three corporate mini-stories from the scrub trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran ZFS on a pair of storage servers backing virtualization. Mirrors everywhere. They felt safe, which is a common precondition for learning.
Scrubs were “optional,” because the storage was “RAID1 and enterprise disks.”
A disk failed during a busy week. Resilver began. Halfway through, the pool started logging checksum errors on the surviving side of one mirror.
The team assumed: “ZFS is being dramatic; it’ll heal.” That assumption is the incident.
ZFS couldn’t heal those blocks because the only remaining copy was wrong. The corrupt blocks were old—weeks old—sitting quietly in a rarely read VM disk area.
Nobody noticed until resilver forced a read of everything and demanded correctness.
The operational failure wasn’t “a disk died.” Disks die. The failure was accepting unknown integrity for weeks because scrubs weren’t scheduled and monitored.
They had backups, but restores were slow and politically painful. The outage became an executive-level event.
Afterward, they didn’t just add a monthly scrub. They added a policy: any non-zero checksum error creates a ticket, requires hardware triage, and ends with a clean scrub before closure.
The incident stopped being “mysterious ZFS corruption” and became “we found a bad path before it ate our lunch.”
Mini-story 2: The optimization that backfired
Another org had a large HDD pool that scrubbed slowly. Users complained about Monday morning slowness.
Someone proposed an optimization: run scrub continuously but “nice” it by limiting CPU and letting it trickle. They also moved the scrub window to business hours because “it’ll be gentle.”
On paper, gentle. In reality, it created permanent low-grade contention. Scrub never got enough contiguous time to finish quickly, so it overlapped with every peak.
Latency spikes weren’t dramatic; they were constant. That’s worse. People tolerate storms; they quit over drizzle.
Meanwhile, scrubs that take forever increase exposure. The longer the pool is under sustained scanning, the more likely you collide with unrelated failures:
controller hiccups, firmware quirks, or just the unlucky drive that chooses that week to start timing out.
They eventually reverted to a strict off-hours scrub with a hard stop at the end of the maintenance window, then resume next night.
Same total work, less user pain, faster completion, and fewer “scrub coincided with something weird” incidents.
The lesson: “always on but slow” isn’t automatically safer. Finishing a scrub is a reliability milestone. Drip-feeding a scrub across business hours is how you invent a new baseline of mediocrity.
Mini-story 3: The boring but correct practice that saved the day
A financial services team ran ZFS on mirrored vdevs for VM storage. Nothing fancy. What they did have was a dull routine:
biweekly scrubs, alerts on completion, and a ticket if any CKSUM count moved.
They also logged scrub duration and compared it to historical baselines.
One cycle, the scrub finished but took noticeably longer. No errors, just slower. The alert didn’t page anyone, but it created an investigation ticket because the duration breached a threshold.
An engineer looked at iostat during the next scrub and saw one disk with higher await than its mirror partner.
SMART didn’t show reallocated sectors, but it did show rising CRC errors. They swapped a cable and re-seated the drive in the backplane.
Next scrub returned to normal speed. Still no errors.
Two weeks later, that same backplane slot started throwing link resets during a high-IOPS trading window—except now the team had already identified the slot as suspicious.
They failed traffic over, replaced the backplane, and avoided data corruption and a wider incident.
Boring practice didn’t just “find corruption.” It found a degrading path before it crossed the line into data loss.
That’s what operational maturity looks like: fewer heroic restores, more quiet fixes.
Common mistakes: symptoms → root cause → fix
Most scrub problems are not exotic. They’re the same few failure modes wearing different hats.
Here’s the shortlist I wish more runbooks had.
1) Scrub shows checksum errors, SMART looks “fine”
- Symptom: zpool status shows CKSUM > 0, sometimes “Permanent errors”. SMART overall-health says PASSED.
- Root cause: SMART “PASSED” is not a health guarantee. Also, corruption can come from the path (cable/HBA) or RAM, not just media.
- Fix: Correlate ZFS events with dmesg. Check CRC errors. Run a follow-up scrub after reseating/replacing cables or swapping the drive to a different port. If errors persist on the same drive across ports, replace the drive.
2) Scrubs are “taking forever” after snapshots increased
- Symptom: Scrub time doubled over months; no hardware changes; pool feels fine otherwise.
- Root cause: Snapshot retention increased allocated blocks. Scrub reads blocks referenced by snapshots, even if the live dataset is small.
- Fix: Inspect usedbysnapshots. Adjust retention, or move long-retention snapshots to a different pool/media tier. Re-baseline scrub duration once snapshot policy is sane.
3) Scrub causes VM latency spikes and “storage is slow” tickets
- Symptom: During scrub, latency goes up; users complain; database stalls; VMs “freeze” briefly.
- Root cause: HDDs saturated by scrub reads; queue depth climbs; synchronous writes wait behind scrub I/O.
- Fix: Schedule scrubs off-hours. Consider throttling via OS/ZFS tunables available on your platform. If this is a chronic issue, the real fix is more spindles, SSDs, or splitting pools by workload.
4) Scrub keeps “repairing” a small amount every run
- Symptom: Each scrub reports some repaired bytes, but nothing ever escalates to a clear fault.
- Root cause: Marginal hardware or path causing intermittent bad reads; ZFS heals using redundancy, masking the underlying rot until it gets worse.
- Fix: Treat recurring repairs as a hardware incident. Identify the vdev with rising errors. Swap cables/ports, run SMART long tests, replace the device if behavior follows it.
5) Scrub “finishes instantly” on a huge pool
- Symptom: Scrub done in minutes; pool has tens of TB allocated; nobody believes the numbers.
- Root cause: You scrubbed the wrong pool, you’re reading status from a different host, or the pool has very little allocated data (thin allocation, mostly free).
- Fix: Confirm alloc with zpool list, confirm the pool name, confirm you’re on the correct machine, check zpool history, and validate status timestamps.
6) Permanent errors appear, and someone clears them immediately
- Symptom: “errors: Permanent errors” appears, then disappears after zpool clear, but no restore occurred.
- Root cause: Treating error state as cosmetic; pressure to make dashboards green.
- Fix: Policy: never clear before capturing zpool status -v, identifying impacted objects, and attempting restore. Require a clean scrub after remediation before clearing tickets.
7) Scrub speed collapses after adding a “faster” cache or special device
- Symptom: After changes (L2ARC, special vdev), scrub time and latency get worse.
- Root cause: Metadata placement and device asymmetry can change I/O patterns. A special vdev that’s too small or stressed can become the choke point.
- Fix: Measure device-level utilization during scrub. If a special vdev is saturated, expand it appropriately or redesign. Don’t bolt “acceleration” onto a pool without planning for scrub/resilver behavior.
8) Scrub triggers device timeouts only under load
- Symptom: Normal workloads fine. Scrub causes I/O errors, resets, offlining.
- Root cause: Marginal firmware, overheating, power delivery issues, or a path that fails under sustained throughput.
- Fix: Check temps, power, cabling, HBA firmware. Run long SMART tests. Consider lowering concurrent load (schedule scrubs) while you fix the underlying hardware issue.
Joke #2: Scrub doesn’t “cause” your disks to fail; it just asks them to do their job for more than five minutes.
Checklists / step-by-step plan
Checklist A: Set a scrub policy you can defend
- Pick a frequency based on risk: start with every 2–4 weeks for business pools.
- Define an “overdue” threshold (e.g., 2× your interval) that triggers an alert/ticket.
- Define acceptable scrub impact: max window, off-hours schedule, pause rules.
- Write down what “non-zero errors” means operationally: who gets paged, who investigates, how closure is proven.
- Baseline scrub duration and throughput; alert on regressions, not just failures.
Checklist B: Operational runbook for each scrub cycle
- Before starting: zpool status (confirm ONLINE and note existing error counters).
- Start scrub: zpool scrub tank.
- During scrub: monitor zpool status -v and iostat -x for impact.
- If performance pain: pause (zpool scrub -p) and resume during quiet hours.
- After completion: capture zpool status -v output for records.
- If repaired > 0 or CKSUM > 0: open an incident ticket; gather SMART + dmesg + zpool events; plan remediation.
- After remediation: run a follow-up scrub and require it to complete cleanly.
Checklist C: When scrub reports permanent errors
- Stop making changes. Capture: zpool status -v, zpool events -v, dmesg -T, and SMART data for all members.
- Identify impacted objects/files listed in status; map them to datasets/services.
- Decide recovery path: restore from backup, restore from an older snapshot, or rebuild affected VM/dataset.
- Address the likely cause (replace disk, fix path, check memory if corruption pattern suggests it).
- Run a scrub after repairs and only then clear errors if you must for hygiene.
Example: scheduling with systemd timers (boring, effective)
cr0x@server:~$ cat /etc/systemd/system/zfs-scrub@.service
[Unit]
Description=ZFS scrub on %i
[Service]
Type=oneshot
ExecStart=/sbin/zpool scrub %i
cr0x@server:~$ cat /etc/systemd/system/zfs-scrub@tank.timer
[Unit]
Description=Run ZFS scrub on tank monthly
[Timer]
OnCalendar=monthly
Persistent=true
[Install]
WantedBy=timers.target
cr0x@server:~$ sudo systemctl enable --now zfs-scrub@tank.timer
Created symlink /etc/systemd/system/timers.target.wants/zfs-scrub@tank.timer → /etc/systemd/system/zfs-scrub@tank.timer.
Meaning: You now have an auditable schedule. “Persistent=true” runs missed timers after downtime.
Decision: Add monitoring that checks last scrub completion time and last scrub result. Scheduling without visibility is just optimism with timestamps.
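To verify the schedule is really armed (and see the last and next run), ask systemd directly; the pattern below matches the template units defined above:
cr0x@server:~$ systemctl list-timers 'zfs-scrub@*'
If you prefer a roughly two-week cadence over monthly, an OnCalendar expression like *-*-01,15 02:00 is a reasonable substitute in the timer unit.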
FAQ
1) Should I run a scrub on a degraded pool?
Sometimes. If you suspect silent corruption, a scrub can reveal the scope. But it also stresses remaining devices.
If the pool is degraded because a disk is missing and you have a replacement ready, prioritize replacement/resilver first, then scrub.
2) Is scrub the same as SMART long tests?
No. SMART long tests are device-level diagnostics. Scrub is end-to-end integrity validation through filesystem, controller, and disk.
Run both. They catch different classes of failure.
3) Does scrub validate free space?
Generally no; it reads allocated blocks (including snapshots). Free space isn’t data. The risk is in blocks you might need to read later, which are the allocated ones.
4) Why does scrub slow down near the end?
Early scrub tends to hit large contiguous regions and caches; later phases can involve more fragmented metadata and scattered blocks, which increases seeks and reduces throughput—especially on HDDs.
5) If scrub repaired data, am I safe now?
Safer, not safe. Repairs mean redundancy worked, but something caused bad reads. Treat repairs as a warning that your pool is consuming redundancy to hide a problem.
Investigate hardware and run a follow-up scrub.
6) Can I scrub too often?
You can scrub so often that you’re always contending with your own maintenance workload. That’s not “safer,” it’s just self-inflicted latency.
If your scrubs overlap continuously, either the pool is too busy, too slow, or your snapshot retention is out of control.
7) What’s the difference between READ/WRITE/CKSUM errors in zpool status?
Roughly: READ/WRITE suggest I/O failures (timeouts, device errors). CKSUM suggests data was read but didn’t match the expected checksum (could be media, path, controller, RAM).
CKSUM errors are the ones that should make you particularly skeptical of “but the disk says it’s fine.”
8) Should I scrub after replacing a disk?
Yes, after resilver completes. Resilver reconstructs redundancy; scrub validates end-to-end correctness across everything allocated.
It’s the difference between “we rebuilt it” and “we verified it.”
9) Do I need to scrub if I have backups?
Yes. Backups are for recovery, scrubs are for early detection. Without scrubs, you may only discover corruption when you try to restore—often long after the last known-good copy.
10) How do I know whether to replace a disk after scrub errors?
If errors follow the disk across ports/cables, replace the disk. If errors follow the port/cable/backplane slot, fix the path.
Repeated correctable errors are not “fine.” They’re a countdown timer with an unknown duration.
Conclusion: next steps you can do today
A ZFS scrub is a reality check. It proves that what you’ve stored can still be read and validated—end to end—right now.
That proof matters most right before something breaks, which is why you schedule it before you feel like you need it.
Do these next
- Pick a frequency: default to every 2–4 weeks for production pools; tighten if pools are large or resilvers are slow.
- Schedule scrubs with an auditable tool (systemd timer/cron) and alert on “overdue” and “errors.”
- Baseline scrub duration and throughput, then alert on regressions—slow is often the first symptom.
- Write an escalation policy for any non-zero checksum errors or repaired bytes.
- After any hardware change (drive, cable, HBA), run a scrub in a controlled window and confirm it ends cleanly.
Your future outage postmortem will either say “we found it during a routine scrub” or “we discovered it during a rebuild.”
One of those reads better to management. More importantly, one is usually survivable.