You wake up to a pager: “checksum errors.” Nobody touched the storage (allegedly). Applications are slow, backups are late, and someone is already asking if the problem is “ZFS being weird.” You run a scrub, it “repairs” something, and the incident closes. Two weeks later, a different disk drops dead during a resilver and you get to enjoy the sequel.
Scrubs and SMART long tests are both “health checks,” but they see different layers of reality. Treating them as interchangeable is how you get surprised at 2 a.m. and then blamed at 9 a.m.
The mental model: who checks what
Start with the stack, because the confusion is always about layers:
- SMART long test is the drive checking itself: media surface, internal error correction, head positioning, read stability, and whatever else the vendor firmware thinks matters. It’s device-centric.
- ZFS scrub is ZFS reading your data and verifying it against end-to-end checksums stored in metadata, then repairing from redundancy if possible. It’s data-centric.
Those viewpoints overlap, but not enough to rely on one. SMART can tell you a disk is degrading even when ZFS hasn’t touched the weak areas yet. ZFS can tell you the data you care about is corrupt even if SMART insists the disk is “PASSED.”
Operationally: SMART is a predictive signal. Scrub is a correctness guarantee. You want both because production systems fail in multiple dimensions, often at the same time.
One quote that should be tattooed on every storage runbook, courtesy of James Cameron: "Hope is not a strategy." Storage is where hope goes to die.
What a ZFS scrub actually finds (and fixes)
What a scrub does, precisely
A ZFS scrub is a pool-wide, intentional read of allocated blocks, verifying every block’s checksum. If ZFS finds a mismatch, it attempts repair by fetching a good copy from redundancy (mirror/RAIDZ) and rewriting the bad one. That repair is not magical. It depends on:
- Redundancy being present and healthy.
- The badness being limited to one side of redundancy (or at least not exceeding parity).
- The corruption being detectable (checksum mismatch) rather than invisible (you wrote the wrong bytes and checksummed them).
Failures scrubs are good at finding
Scrubs excel at catching silent data corruption within the ZFS-protected path. That includes:
- Bit rot / latent sector errors: sectors that have become unreadable or return incorrect data later.
- Bad reads that “look successful” to the drive: the drive returns data, but it’s wrong. ZFS catches it because the checksum doesn’t match.
- Cabling/controller issues that result in corrupted transfers (sometimes shows up as checksum errors rather than read errors).
- Uncorrectable read errors during the scrub: ZFS logs them and may repair if redundancy exists.
Failures scrubs are not good at finding
A scrub is not a “drive diagnostic,” and it has blind spots:
- Unallocated space isn’t read. A scrub won’t test sectors ZFS hasn’t allocated.
- Write-path failures can be missed if the wrong data is written and checksummed as correct (this is rarer than people fear, but it exists in the universe of bugs, misdirected writes, and memory corruption without ECC).
- Drive mechanics like slow seeks, marginal heads, or failing cache can be developing while your current working set still reads fine.
- Transient timeouts might not manifest during scrub timing; or they might only appear under a different IO pattern.
Dry-funny reality: a scrub is like auditing your company’s expenses. It finds fraud in the submitted receipts, not the stuff nobody filed.
What “scrub repaired X” really means
If zpool status says a scrub “repaired” bytes, you have learned exactly one thing: the pool served at least one bad copy of a block, and ZFS found another good copy. That’s a correctness win, but also a red flag. It’s evidence of a failing component or a path issue. Your job is to figure out which, before it becomes multi-disk failure during a resilver.
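A quick way to narrow it down is to pull the interesting SMART counters for every member disk right after the scrub. A minimal sketch, assuming SATA disks exposed as ata-* links under /dev/disk/by-id (adjust the glob for SAS or NVMe):
cr0x@server:~$ for dev in /dev/disk/by-id/ata-*; do
>   case "$dev" in *-part*) continue ;; esac   # skip partition symlinks, whole disks only
>   echo "== $dev"
>   sudo smartctl -A "$dev" | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count'
> done
Any disk whose counters moved since the last scrub is your suspect; if none moved, start looking at cables, backplane, and HBA.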
Scrub vs resilver, because people mix them up
A scrub reads and validates allocated data in-place. A resilver is reconstruction after a device replacement or reattach, copying the necessary data to restore redundancy. Both are heavy reads. Both can trigger latent sector errors. But the intent differs:
- Scrub: “prove the pool’s existing data is correct.”
- Resilver: “rebuild redundancy and re-populate a device.”
If you only learn about latent read errors during a resilver, you learned too late.
What a SMART long test actually finds (and predicts)
What SMART long tests do
SMART self-tests are firmware-directed diagnostics. A long (extended) test typically reads most or all of the device’s surface (implementation varies wildly), exercising the drive’s ability to read sectors and correct errors internally. Unlike ZFS scrub, it’s not verifying your data’s checksum. It’s verifying the drive’s own ability to retrieve sectors reliably, with its internal ECC and remapping mechanisms.
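Two smartctl calls are worth knowing before you schedule anything; a small sketch (the device name is an example). The -c output includes the drive's own estimate of how long an extended test takes, and -l selftest prints the history of previous self-tests:
cr0x@server:~$ sudo smartctl -c /dev/sda | grep -i 'self-test routine recommended polling time'
cr0x@server:~$ sudo smartctl -l selftest /dev/sda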
What SMART is good at detecting
- Media degradation (pending sectors, reallocated sectors, uncorrectable errors).
- Read instability that drives hide until it gets worse (increasing error correction, retries, slow reads).
- Thermal stress patterns correlated with failures (temperature history, over-temp events).
- Interface-level errors (CRC errors often point to cabling/backplane issues, not the media).
What SMART is not good at (especially in 2025)
SMART is not a truth oracle. It’s vendor firmware telling you how it feels today. Problems:
- Attribute semantics vary between vendors. A raw value can be “real” on one drive and “vendor math” on another.
- SMART “PASSED” is meaningless comfort. Many drives fail without ever tripping the overall health flag.
- Long tests may not cover all LBAs on some devices, or may be interrupted by power management quirks.
- SSD behavior differs: “bad sectors” become a translation-layer issue, and some failures present as sudden death, not graceful degradation.
SMART long tests are still worth running. Just don’t use them as a get-out-of-jail-free card when ZFS starts shouting.
Joke #1: SMART “PASSED” is like a manager saying “everything looks fine” five minutes before the org chart changes.
How SMART failures map to operational decisions
SMART gives you early warning, but the decision isn’t always “replace immediately.” Typical triggers (a trend-logging sketch follows the list):
- Any non-zero Offline_Uncorrectable or increasing trend: schedule replacement. That drive is producing hard failures.
- Reallocated_Sector_Ct increasing: replace soon, especially if paired with pending sectors.
- Current_Pending_Sector non-zero: treat as urgent; a scrub/resilver will likely hit them and fail reads.
- UDMA_CRC_Error_Count increasing: often replace cable/backplane path first, not the drive.
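These counters only mean something as trends, so record them somewhere cheap and diff them week over week. A minimal sketch, assuming SATA disks under /dev/disk/by-id and a root-writable log directory; in real life smartd or your metrics stack should own this:
#!/bin/bash
# smart-trends.sh (sketch): append dated raw values of the attributes that predict pain
outdir=/var/log/smart-trends
mkdir -p "$outdir"
for dev in /dev/disk/by-id/ata-*; do
  case "$dev" in *-part*) continue ;; esac   # whole disks only
  name=$(basename "$dev")
  smartctl -A "$dev" | awk -v d="$(date -Is)" \
    '/Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count/ {print d, $2, $NF}' \
    >> "$outdir/$name.log"
done
Run it from root's cron (or prefix smartctl with sudo when testing by hand); a few weeks of these files plus diff is a surprisingly effective early-warning system.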
Overlap, gaps, and why both are necessary
The overlap: when both tools agree
Sometimes life is simple. A drive is failing, it’s throwing read errors, SMART shows pending sectors, and a scrub reports read errors or checksum errors. Great. You replace the disk and move on.
The gap #1: SMART looks clean, ZFS reports checksum errors
This is common and it spooks people. It shouldn’t. Checksums are end-to-end: they can detect corruption from RAM, controllers, HBAs, cabling, expanders, backplanes, firmware, and cosmic mischief. SMART only knows about what the drive sees internally.
If ZFS says “checksum error,” do not assume the drive is lying. Assume the path is guilty until proven innocent. Cable issues often manifest as CRC errors in SMART (SATA especially), but not always in time.
The gap #2: SMART looks bad, ZFS is quiet
This is the “latent failure” scenario. The drive is degrading in areas ZFS hasn’t read recently (cold data, sparse datasets, unallocated blocks). ZFS is quiet because nothing has asked it to read those sectors yet. A scrub will likely find it—if the scrub runs before a resilver forces full-stripe reads under pressure.
The gap #3: scrubs pass, but you still lose data
Scrubs validate what exists and what is readable. They don’t guarantee:
- That your backups are valid.
- That your hardware won’t fail catastrophically tomorrow.
- That you have redundancy appropriate for your failure domain (same batch of disks, same backplane, same firmware bug).
Scrubs reduce risk. They don’t repeal physics.
My opinionated rule
If you run ZFS in production and you’re not scheduling scrubs and SMART long tests, you’re not “saving wear.” You’re saving ignorance for later, with interest.
Interesting facts and historical context
- ZFS shipped with end-to-end checksumming from day one (Solaris era), explicitly to address silent corruption that traditional filesystems couldn’t detect.
- The term “scrub” was popularized in storage by RAID scrubbing: periodic reads to find latent sector errors before rebuild time, when you can least afford surprises.
- SMART dates back to the 1990s, with early vendor-specific implementations before ATA standardized a baseline set of behaviors.
- The SMART overall-health “PASSED” flag is notoriously conservative; many failures occur without flipping it, which is why SREs watch attributes and trends instead.
- ATA and SCSI/SAS health reporting diverged: SAS uses log pages and “self-test” mechanisms that feel familiar but aren’t identical to ATA SMART attributes.
- Modern drives can retry reads invisibly, turning “bad sector” into “slow sector,” which shows up first as latency spikes, not immediate errors.
- ZFS scrubs only read allocated blocks, so a mostly-empty pool can hide bad areas until data grows or relocates.
- SSD controllers remap constantly; “sectors” are logical, and failures can be sudden (firmware, FTL metadata corruption) rather than gradual sector decay.
- RAIDZ resilver behavior historically required more reading than mirrors; newer ZFS features like sequential resilvering help, but rebuilds still stress the fleet.
Fast diagnosis playbook
This is the “stop arguing in chat and find the bottleneck” sequence. Use it when you see checksum errors, read errors, slow IO, or scrub taking forever.
First: Is the pool currently safe?
- Check zpool status -v: are any vdevs degraded, any devices faulted, any unrecoverable errors?
- If redundancy is compromised (degraded mirror, RAIDZ with a dead disk), prioritize restoration of redundancy over performance tuning.
Second: Is it a drive, a path, or the host?
- Drive: SMART shows pending/uncorrectable/reallocated growth; ZFS shows read errors on that device.
- Path: SMART shows CRC errors; ZFS shows checksum errors across multiple disks on the same HBA/backplane; dmesg shows link resets/timeouts.
- Host: memory errors, ECC events, kernel warnings, or a recent driver/firmware update.
Third: Are you IO-bound or latency-bound right now?
- Use zpool iostat -v 1 to see which vdev is saturated.
- Use iostat -x to catch a single device with high await/util.
- During scrub: expect heavy reads; if the system is collapsing, throttle or schedule better.
Fourth: Decide—repair, replace, or rewire
- If it’s media: replace the drive, then scrub after resilver.
- If it’s path: fix cabling/backplane/HBA first, then scrub to confirm error counters stop rising.
- If it’s host: validate RAM (ECC logs), review recent changes, and consider isolating the node.
Joke #2: The fastest way to “fix” checksum errors is to stop checking—this is also how you become a cautionary tale.
Practical tasks with commands, output meaning, and the decision you make
All examples assume a Linux host with ZFS and smartmontools installed. Replace device names and pool names with your reality. The point is the workflow: observe → interpret → decide.
Task 1: Check scrub status and whether errors are still accumulating
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error.
action: Determine if the device needs to be replaced, and clear the errors
see: zpool(8)
scan: scrub repaired 128K in 02:13:44 with 0 errors on Tue Dec 24 03:11:09 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 2 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
errors: No known data errors
What it means: scrub repaired data, and one disk has read errors (READ=2). ZFS fixed it using parity, so no known data errors. This is a warning shot.
Decision: pull SMART for the offending disk, check cabling/HBA logs, and plan replacement if SMART shows pending/uncorrectable or if read errors recur.
Task 2: See whether checksum errors are localized or spread across devices
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Wed Dec 25 01:00:03 2025
2.11T scanned at 1.02G/s, 1.44T issued at 711M/s, 18.2T total
0B repaired, 0.00% done, 06:59:12 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-Samsung_SSD_... ONLINE 0 0 0
ata-Samsung_SSD_... ONLINE 0 0 17
What it means: CKSUM errors are on one side of the mirror. That points to that device or its path (cable/backplane/HBA) more than pool-wide corruption.
Decision: check SMART and kernel logs for that device; if CRC errors rise, suspect cabling; if media errors rise, replace.
Task 3: Clear ZFS errors only after you’ve fixed the cause
cr0x@server:~$ zpool clear tank
What it means: counters reset. This is not a fix; it’s a way to confirm whether the issue returns.
Decision: clear only after you replace hardware or fix the path; then monitor for reappearance during subsequent scrub.
Task 4: Start a scrub intentionally (and don’t do this at noon on payroll day)
cr0x@server:~$ sudo zpool scrub tank
What it means: scrub queued/started. On busy pools, it will contend with workloads.
Decision: if the scrub causes latency pain, fix the schedule first; treat dataset tuning (recordsize and similar zfs set knobs) as a separate exercise. Don’t disable scrubs.
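Scheduling can be as boring as a cron entry per pool, staggered so they never overlap. A sketch, assuming pools named tank and backup and a Debian-style /usr/sbin path (several distros already ship scrub timers or cron jobs, so check before adding your own):
# /etc/cron.d/zfs-scrub (sketch)
# m h dom mon dow user command
0 1 1  * * root /usr/sbin/zpool scrub tank
0 1 15 * * root /usr/sbin/zpool scrub backup
Recent OpenZFS also lets you pause a running scrub with zpool scrub -p tank and resume it later by issuing zpool scrub tank again, which beats cancelling it outright when the latency complaints arrive.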
Task 5: Watch scrub progress and identify the slow vdev
cr0x@server:~$ zpool iostat -v tank 1
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 9.12T 8.97T 512 44 820M 23.1M
raidz2-0 9.12T 8.97T 512 44 820M 23.1M
ata-WDC_WD140EDFZ-1 - - 82 7 131M 5.8M
ata-WDC_WD140EDFZ-2 - - 85 6 138M 5.2M
ata-WDC_WD140EDFZ-3 - - 19 6 12M 5.4M
ata-WDC_WD140EDFZ-4 - - 86 6 139M 5.3M
-------------------------- ----- ----- ----- ----- ----- -----
What it means: one disk is delivering far less read bandwidth than its peers (WD140EDFZ-3). That’s your likely bottleneck or the one doing heavy retries.
Decision: inspect SMART and kernel logs for that disk; plan replacement if it’s slow due to media retries.
Task 6: Run a SMART long test on a SATA disk
cr0x@server:~$ sudo smartctl -t long /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.0] (local build)
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 255 minutes for test to complete.
Test will complete after Wed Dec 25 06:14:02 2025
What it means: the drive accepted the test; it runs in firmware while IO continues (but may impact performance).
Decision: schedule long tests off-peak; on busy arrays, stagger them to avoid synchronized latency spikes.
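smartd can own the self-test schedule for you. A minimal /etc/smartd.conf sketch (device names, mail target, and times are examples; the -s schedule syntax is documented in smartd.conf(5)): daily short tests, weekly long tests on a different day per disk so they never overlap.
# /etc/smartd.conf (sketch)
# -a monitor everything, -m mail on trouble, -s test schedule (T/MM/DD/d/HH)
# sda: short test daily at 02:00, long test Saturdays at 03:00
/dev/sda -a -m root -s (S/../.././02|L/../../6/03)
# sdb: same, but the long test runs Sundays so the two never collide
/dev/sdb -a -m root -s (S/../.././02|L/../../7/03)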
Task 7: Read SMART self-test results and decide if it’s a replace-now situation
cr0x@server:~$ sudo smartctl -a /dev/sda | sed -n '/Self-test execution status/,$p' | head -n 25
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever been run.
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 00% 18432 723451233
# 2 Extended offline Completed without error 00% 18390 -
What it means: a read failure occurred at an LBA. That’s not “monitor it,” that’s “this drive can’t read part of its surface reliably.”
Decision: replace the drive. Before replacement, consider forcing reads in ZFS (scrub) to see if redundancy can rewrite bad blocks, but don’t bet production on it.
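The replacement itself is a short sequence; a sketch with placeholder by-id names (use the exact names from your own zpool status and /dev/disk/by-id):
cr0x@server:~$ sudo zpool replace tank ata-WDC_WD140EDFZ-OLD_SERIAL ata-WDC_WD140EDFZ-NEW_SERIAL
cr0x@server:~$ zpool status tank      # watch the resilver until it completes
cr0x@server:~$ sudo zpool scrub tank  # then scrub to confirm the pool is clean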
Task 8: Check key SMART attributes (HDD) for latent failure signals
cr0x@server:~$ sudo smartctl -A /dev/sda | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count'
5 Reallocated_Sector_Ct 0x0033 188 188 140 Pre-fail Always - 24
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 3
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 3
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
What it means: pending and uncorrectable sectors exist. Reallocated sectors already happened. CRC is clean, so it’s likely media, not cabling.
Decision: replace the drive and run a scrub after resilver to verify no lingering errors.
Task 9: Detect a cabling/backplane problem via CRC errors
cr0x@server:~$ sudo smartctl -A /dev/sdb | egrep 'UDMA_CRC_Error_Count|CRC'
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 312
What it means: CRC errors are transport-level. Data got corrupted or had retries between disk and controller. Drives often stay “healthy” while the link is flaky.
Decision: reseat/replace cable, inspect backplane connectors, swap ports on HBA. Do not replace the drive first unless you enjoy wasting time.
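To confirm the path fix actually worked, snapshot the CRC counter before the cable swap and again a day or two later; a sketch (SATA attribute name assumed):
cr0x@server:~$ echo "$(date -Is) CRC=$(sudo smartctl -A /dev/sdb | awk '/UDMA_CRC_Error_Count/{print $NF}')"
If the number is frozen after the swap, close the ticket; if it keeps climbing, keep walking the path (port, HBA, backplane).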
Task 10: Correlate ZFS device errors to physical devices
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep WDC_WD140EDFZ | head
lrwxrwxrwx 1 root root 9 Dec 26 00:41 ata-WDC_WD140EDFZ-11A0VA0_V9J0ABCD -> ../../sdc
lrwxrwxrwx 1 root root 9 Dec 26 00:41 ata-WDC_WD140EDFZ-11A0VA0_V9J0EFGH -> ../../sdd
What it means: you map ZFS-reported by-id names to kernel devices. This matters when swapping disks; /dev/sdX is not stable.
Decision: always use /dev/disk/by-id in ZFS vdevs and in your replacement procedure.
Task 11: Check kernel logs for link resets and timeouts (path issues)
cr0x@server:~$ sudo dmesg -T | egrep -i 'ata[0-9]+|link is slow|hard resetting link|I/O error|blk_update_request' | tail -n 12
[Wed Dec 25 02:14:11 2025] ata7: hard resetting link
[Wed Dec 25 02:14:12 2025] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Wed Dec 25 02:14:14 2025] blk_update_request: I/O error, dev sdc, sector 723451233 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[Wed Dec 25 02:14:14 2025] ata7.00: failed command: READ FPDMA QUEUED
What it means: link reset + read errors. Could be a dying drive, could be a flaky connection. Combine with SMART CRC and pending sectors to decide.
Decision: if CRC grows, treat as path; if pending/uncorrectable grows, treat as media. If both grow, replace the drive and the cable/backplane path, because you’re in the fun zone.
Task 12: Identify whether ZFS has permanent data errors (the nightmare scenario)
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error.
action: Restore the file in question if possible.
scan: scrub repaired 0B in 07:44:02 with 1 errors on Thu Dec 26 04:12:09 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-ST8000DM-1 ONLINE 0 0 0
ata-ST8000DM-2 ONLINE 0 0 0
ata-ST8000DM-3 ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
tank/data/db/pg_wal/000000010000000A000000FE
What it means: redundancy could not repair a corrupted block. ZFS can tell you which file was affected. This is exactly why you run ZFS.
Decision: restore from backup or application-level replication; investigate why redundancy wasn’t sufficient (RAIDZ1, multiple read errors, prior silent issues). Consider increasing redundancy for this workload.
Task 13: See which datasets are consuming most scrub time (indirect method)
cr0x@server:~$ sudo zfs list -o name,used,logicalused,recordsize,compression -S used | head -n 10
NAME USED LUSED RECSIZE COMPRESS
tank/data 6.12T 6.70T 128K lz4
tank/backups 2.44T 2.80T 1M lz4
tank/vm 1.91T 2.10T 16K lz4
tank/home 312G 330G 128K lz4
What it means: heavily used datasets dominate scrub IO. Small recordsize (like 16K for VMs) can increase metadata and IO overhead.
Decision: don’t “optimize” recordsize during an incident. But do note which datasets might justify separate pools or different layouts if scrubs are constantly painful.
Task 14: Run SMART long test on an NVMe device (self-test/log differs)
cr0x@server:~$ sudo smartctl -a /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.0] (local build)
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 990 PRO 2TB
Serial Number: S7HBNX0T123456A
Firmware Version: 5B2QJXD7
SMART overall-health self-assessment test result: PASSED
Critical Warning: 0x00
Temperature: 41 Celsius
Available Spare: 100%
Percentage Used: 2%
Data Units Read: 12,384,921
Data Units Written: 9,221,114
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
What it means: NVMe health is different: percentage used, media errors, critical warnings. There isn’t a “pending sector” story; failures often present via media errors or sudden controller problems.
Decision: track NVMe media/data integrity errors and error log entries over time; replace on any upward trend, not just when “critical warning” flips.
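If nvme-cli is installed, the drive's native logs give the same counters in a script-friendly form; a sketch (the controller name is an example):
cr0x@server:~$ sudo nvme smart-log /dev/nvme0
cr0x@server:~$ sudo nvme error-log /dev/nvme0 | head -n 20
Graph media_errors and num_err_log_entries over time; the trend matters more than any single reading.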
Three corporate-world mini-stories
Mini-story 1: The incident caused by a wrong assumption
They had a tidy rule: “If SMART says PASSED, the disk is fine.” It was written into the on-call wiki, repeated in handoffs, and used as a fast close for tickets. The storage pool was ZFS mirrors on SATA SSDs in a 2U box, backing a set of internal services that were never “critical” until they were.
One week, ZFS started reporting checksum errors on one mirror leg. Not read errors. Not write errors. Just checksum increments during normal operation. SMART looked clean. The team cleared errors, ran a scrub, saw “repaired 0B,” and moved on. The checksum counter crept up again. Same playbook: clear, shrug, defer.
Then a firmware update happened on the HBA. Under load, the box began logging link resets. Suddenly, the mirror leg with checksum errors went unavailable during peak traffic. The remaining side carried the workload, but the pool was now one bad day away from a really bad week.
The postmortem was blunt: the wrong assumption wasn’t “SMART is useful.” The wrong assumption was that SMART is the right tool to validate data correctness. The checksum errors were a path issue—likely cabling/backplane marginality that only showed up under certain IO patterns. SMART didn’t see it because the drive itself wasn’t failing. ZFS saw it because the data arriving didn’t match the checksum. The fix was boring: replace cables, reseat the backplane connectors, and stop clearing counters until the cause was understood. After that, checksum errors stopped entirely.
Lesson: SMART PASSED does not overrule ZFS checksums. When they disagree, investigate the bus.
Mini-story 2: The optimization that backfired
A different company ran huge ZFS RAIDZ2 pools for backups and object storage. Scrubs were set monthly because “these are cold data pools,” and SMART long tests were disabled because “they slow down IO.” Someone decided to be clever: scrubs should run quarterly, because the pools were large and scrubs annoyed stakeholders.
For a while, nothing happened. That’s the trap. Latent sector errors don’t send a calendar invite. Then a disk failed and they started a resilver. During the resilver, another disk began throwing read errors in a region that hadn’t been touched in months. The pool limped along, but the resilver slowed to a crawl because the drive was retrying reads. Eventually, ZFS declared unrecoverable errors in a handful of backup objects.
Those were “just backups,” until they weren’t. When a production restore was needed later, a subset of the oldest recovery points was corrupt. The app team had to choose between restoring a newer snapshot (more data loss) or rebuilding state manually (more pain). They did both, poorly.
The optimization was based on a misunderstanding: scrubs are not optional “maintenance.” They are how ZFS converts unknown latent failures into known repaired blocks while redundancy still exists and the pool is stable. Resilver time is the worst possible time to discover unreadable sectors.
They reverted to monthly scrubs and staggered SMART long tests weekly across devices. Scrubs still annoyed stakeholders, but less than data loss annoyed executives. That’s how priorities get clarified.
Mini-story 3: The boring but correct practice that saved the day
A team running a multi-tenant virtualization cluster had a simple, unsexy routine: weekly SMART long tests staggered per host, monthly ZFS scrubs staggered per pool, and a dashboard that tracked deltas for three SMART attributes: pending sectors, offline uncorrectable, and CRC errors. No fancy machine learning. Just trends.
One Monday, the dashboard showed a small but real increase in UDMA_CRC_Error_Count on two disks behind the same backplane. The drives had zero pending sectors and zero reallocated. ZFS was clean. Performance was fine. Nobody was paging. This is the moment where teams either ignore it or look like paranoids.
They scheduled a short maintenance window, reseated the backplane connector, and replaced two SATA cables. CRC errors stopped incrementing. No incidents. No downtime beyond the window. Everyone forgot about it, which is the correct outcome for preventive maintenance.
Three weeks later, another host in the same rack had a similar CRC pattern. They swapped cables again and found a batch issue with a specific cable model. They proactively replaced them across the fleet during normal maintenance windows. Still no incidents.
This is what “reliability work” looks like when it’s done well: quiet, repetitive, and slightly boring. That’s the goal.
Common mistakes (symptom → root cause → fix)
1) Symptom: ZFS checksum errors increase, but SMART looks fine
Root cause: transport corruption or retries (cabling/backplane/HBA), occasionally memory/driver issues.
Fix: check SMART CRC counts, dmesg for link resets, reseat/replace cables, move the drive to another port, consider HBA firmware/driver rollback.
2) Symptom: Scrub “repaired” data and you treat it as resolved
Root cause: latent media errors or intermittent read instability; you got lucky because redundancy still worked.
Fix: identify which device logged read/checksum errors; run SMART long test; replace the suspect drive if attributes are non-zero or errors recur.
3) Symptom: Scrubs take forever and murder latency
Root cause: scrubs running during peak IO; one slow device throttling a RAIDZ vdev; SMR HDDs in the mix; or drives doing heavy internal retries.
Fix: schedule scrubs off-peak; stagger across pools; identify slow disks with zpool iostat -v; replace the laggard; avoid SMR where you expect consistent reads under load.
4) Symptom: SMART shows pending sectors, but ZFS scrub is clean
Root cause: problematic sectors are in areas ZFS hasn’t read recently (or at all).
Fix: treat pending sectors as urgent; run a scrub (or targeted reads) to force detection/repair; plan drive replacement.
5) Symptom: “Permanent errors” in zpool status
Root cause: corruption exceeded redundancy, or corruption was written and checksummed as “valid,” or multiple devices returned bad data.
Fix: restore affected files from backup/replication; increase redundancy design (mirrors/RAIDZ2/3), and audit memory integrity (ECC) and controller firmware.
6) Symptom: SMART long test fails or aborts repeatedly
Root cause: media problems; device resets; power management quirks; or the device is behind a controller that blocks SMART passthrough.
Fix: run SMART through the correct device type (-d sat, -d megaraid,N, etc.); check dmesg; replace disk on genuine read failures.
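Concretely, the passthrough flags look like this; a sketch (the megaraid index and device nodes are examples, and the right -d type depends on your controller):
cr0x@server:~$ sudo smartctl --scan                     # let smartctl guess device types for you
cr0x@server:~$ sudo smartctl -a -d sat /dev/sdc         # SATA disk behind a SAS HBA or USB bridge
cr0x@server:~$ sudo smartctl -a -d megaraid,4 /dev/sda  # physical disk 4 behind a MegaRAID controller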
7) Symptom: You replace a “bad” drive but errors continue on the new one
Root cause: the path was the issue (cable/backplane/HBA), not the drive.
Fix: investigate CRC errors and link resets; swap port/HBA; validate power delivery; don’t keep sacrificing drives to the cable gods.
8) Symptom: Resilver fails with read errors from surviving disks
Root cause: latent sector errors discovered under rebuild stress because scrubs weren’t frequent enough.
Fix: increase scrub frequency; replace marginal drives earlier based on SMART trends; consider higher redundancy and smaller vdev widths for rebuild risk management.
Checklists / step-by-step plan
Baseline policy (production ZFS)
- Enable scheduled ZFS scrubs per pool (monthly is a good default; more frequent for hot data or large HDD pools).
- Schedule SMART long tests staggered across drives (weekly or biweekly; daily short tests if you want quick signal).
- Alert on ZFS errors: any non-zero READ/WRITE/CKSUM per device, and any “repaired” bytes (a minimal ZED sketch follows this list).
- Alert on SMART trends: pending sectors, offline uncorrectable, reallocated growth, CRC growth, NVMe media/data integrity errors.
- Document replacement procedure using /dev/disk/by-id and include a post-replace scrub/resilver verification step.
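For the ZFS-error alerting bullet, OpenZFS ships ZED (the ZFS Event Daemon), and a minimal email setup is a few lines in its config. A sketch of /etc/zfs/zed.d/zed.rc; the variable names come from the stock file, the address is an example:
# /etc/zfs/zed.d/zed.rc (sketch)
ZED_EMAIL_ADDR="storage-oncall@example.com"
ZED_NOTIFY_INTERVAL_SECS=3600
ZED_NOTIFY_VERBOSE=1   # also notify on healthy-pool events such as scrub completion
Restart the daemon afterwards (systemctl restart zfs-zed on most systemd distros) and test it by scrubbing a lab pool before trusting it in production.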
When ZFS reports checksum errors
- Run zpool status -v, identify the device(s), and check whether redundancy is degraded.
- Check dmesg for link resets/timeouts around the timestamps.
- Check SMART attributes, especially CRC, pending, uncorrectable, and reallocated.
- If CRC is rising: fix cabling/backplane/HBA first.
- If pending/uncorrectable is rising: replace the drive.
- After fix: zpool clear, then scrub, then verify counters stay at zero.
When SMART long test fails but ZFS is quiet
- Confirm the failure in SMART self-test logs and attributes.
- Run a ZFS scrub to force reads of allocated data; watch for read/checksum errors.
- Replace the drive even if ZFS doesn’t complain yet, if the long test shows read failures or uncorrectables.
- After replacement/resilver: scrub again to confirm pool health.
Scheduling advice that won’t get you hated
- Stagger everything: don’t scrub all pools at 01:00 on Sunday and run SMART long tests at 01:05. You’ll create your own incident.
- Prefer “predictable slow” over “random slow”: a scheduled scrub is annoying; an unscheduled resilver is career-defining.
- Track duration trends: if scrub time doubles month over month, something changed (capacity, workload, a slow disk, or a firmware regression).
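Tracking scrub duration can be as dumb as appending the scan line to a log once each scrub finishes; a sketch (pool name and log path are examples):
cr0x@server:~$ echo "$(date -Is) $(zpool status tank | grep 'scan:')" >> ~/scrub-history.log
Wire it into whatever starts the scrub, offset by a comfortable margin, and review the log monthly; a scrub that suddenly takes twice as long is a slow disk or a workload change announcing itself.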
FAQ
1) If I run ZFS scrubs regularly, do I still need SMART long tests?
Yes. Scrubs validate allocated data; SMART can warn about degradation in areas ZFS hasn’t touched and can reveal path issues via CRC counters.
2) If SMART long tests are clean, can I ignore ZFS checksum errors?
No. Checksums are end-to-end. SMART can be clean while the controller, cable, backplane, or driver corrupts or retries IO.
3) Do scrubs “wear out” SSDs?
Scrubs are mostly reads; reads have minimal wear impact compared to writes. The bigger risk is performance impact, not endurance. Schedule properly.
4) Why did a scrub repair data but SMART shows no reallocated sectors?
Because the corruption may not have been a media defect that triggers remapping. It could be a bad read that still passed the drive’s internal checks, or a path issue.
5) How often should I scrub?
Monthly is a solid default for HDD pools. For very large pools or high-criticality data, consider every 2–4 weeks. If scrubs are painful, fix the architecture, not the calendar.
6) How often should I run SMART long tests?
Weekly or biweekly is reasonable if you stagger disks. For laptops or lightly used systems, monthly is fine. For arrays, keep it regular and automated.
7) What’s the difference between READ errors and CKSUM errors in zpool status?
READ errors mean the device failed to deliver data at all. CKSUM errors mean data was delivered but didn’t match the checksum—often a path or corruption issue.
8) Should I replace a drive with a few reallocated sectors?
One-time reallocations that stabilize can be survivable, but in production I replace on trend: if the count increases, or if pending/uncorrectable appear.
9) Can scrubs find corruption in RAM or CPU?
Indirectly. If bad memory corrupts data in-flight or on write, ZFS may detect mismatches later. ECC memory reduces this risk dramatically; without ECC, you’re gambling.
10) What if the pool is RAIDZ1 and I see repaired bytes?
Take it seriously. RAIDZ1 has less margin for multi-sector or multi-disk issues during rebuilds. Consider upgrading redundancy if the data matters.
Conclusion: next steps you can do this week
Stop treating ZFS scrub and SMART long test as competing tools. They’re complementary sensors pointed at different parts of the system: one validates your data, the other interrogates the device.
- Schedule monthly scrubs per pool and stagger them across your fleet.
- Schedule SMART long tests weekly/biweekly, staggered per disk, and alert on the attributes that actually predict pain (pending, uncorrectable, reallocations, CRC).
- When ZFS reports checksum errors, investigate the path first—cables, backplane, HBA, firmware—then the drive.
- When SMART reports pending/uncorrectable, don’t wait for a resilver to discover it’s real. Replace the drive on your terms.
- After any fix, clear counters, scrub, and confirm the errors stop. Evidence beats optimism.
Your future self will still get paged. But it’ll be for something new, not the same avoidable storage tragedy with a different timestamp.