You get a page at 02:13: SMART says “prefail” on /dev/sdX. ZFS says the pool is ONLINE.
Your manager says, “If ZFS is fine, why are you bothering me?”
The answer is: because ZFS is honest, not psychic. It reports what it can prove. SMART is messy, vendor-specific,
and prone to crying wolf. Your job is to correlate the two so you can make a decision that’s boring, defensible,
and doesn’t end with a resilver during a quarterly close.
A mental model: what ZFS knows vs what SMART knows
ZFS and SMART are not rivals. They’re two witnesses to the same crime, standing at different street corners.
One saw the suspect’s face. The other heard the tires screech. Neither story is complete alone.
ZFS: evidence-based, end-to-end, and ruthlessly literal
ZFS sees the world through I/O operations and checksums. When ZFS reads a block, it validates the checksum.
If the checksum doesn’t match, ZFS calls it a checksum error. If the device refuses to read
or write, ZFS calls it an I/O error. If ZFS can correct bad data from redundancy (mirror/RAIDZ),
it increments counters and carries on, often while you sleep.
ZFS doesn’t inherently know why a read failed. It could be a failing platter surface. It could be a bad SAS expander port.
It could be a controller firmware bug. It could be a cosmic ray. ZFS just knows it asked for data and got something wrong.
SMART: predictive, vendor-flavored, and occasionally dramatic
SMART is a set of self-reported counters and test results from the drive firmware. It knows about reallocations,
pending sectors, read disturb, endurance, CRC errors on the SATA link, and more. It can run self-tests.
It can also hide problems behind “attribute normalization” where the raw value matters but the “VALUE/WORST/THRESH”
tries to keep you calm until it’s far too late.
SMART also has a big blind spot: if the problem is upstream (cable, HBA, backplane, power), the drive can look healthy
while ZFS is drowning in I/O errors. Or SMART can show UDMA CRC errors that scream “cable” while ZFS counters climb,
and you’ll be tempted to RMA the drive anyway because that’s emotionally satisfying.
Correlation is the discipline of asking: are these signals describing the same failure mode, and do they point to a specific action?
This is less “monitoring” and more “courtroom cross-examination.”
One paraphrased idea from Richard Cook (resilience engineering): Success in operations comes from people continually adapting to complexity under pressure.
Interesting facts and historical context
- ZFS was designed with silent data corruption in mind. End-to-end checksums were a reaction to storage stacks that assumed disks “mostly tell the truth.”
- SMART predates modern filesystems’ integrity features. It emerged when the OS had very little visibility into internal drive behavior.
- SMART attributes are not standardized in practice. The names look consistent; the meaning and scaling often aren’t, especially across vendors and SSD generations.
- “UDMA CRC error count” is one of the most actionable attributes. It often indicates cabling/backplane issues rather than media failure.
- ZFS scrubs were controversial early on. People worried scrubs would “wear out” disks; reality is scrubs are how you discover latent errors before a rebuild forces the issue.
- Early SSD SMART was wildly optimistic. Some devices reported near-perfect health right up to failure; modern NVMe telemetry is better but still not gospel.
- RAID rebuilds used to be “the scary part.” With multi-TB drives, unrecoverable read errors during rebuild became a practical risk; ZFS’s scrubs and checksums help expose risks earlier.
- ZFS error counters can increase with no user-visible outage. Self-healing reads repair data silently in redundant vdevs; “no outage” does not mean “no problem.”
The signal map: SMART warnings ↔ ZFS errors
Start with ZFS’s error types (because they’re tied to data correctness)
- Checksum errors: data returned by the device didn’t match the expected checksum. Often media issues, sometimes controller/cable corruption, occasionally RAM (if non-ECC and you’re unlucky).
- Read errors / write errors: the device failed the I/O. Often link resets, timeouts, power/cable, or the device giving up.
- Device removal / faulted: the OS lost the device. Think: flaky SATA power, expander reset, HBA trouble, or the drive itself hard-resetting.
Now the SMART attributes that actually earn your attention
The “best” SMART attributes are the ones with a clear physical meaning and decent correlation with failure:
pending sectors, reallocated sectors, uncorrectable errors, and interface CRC errors.
Temperature matters too, but mostly as a risk factor; a quick way to pull just these attributes across disks is sketched after the list.
- Reallocated_Sector_Ct: sectors the drive has remapped. Non-zero is not instant death, but increasing is a trajectory.
- Current_Pending_Sector: sectors that couldn’t be read and are waiting for a rewrite to remap. This is often the “data loss soon” attribute.
- Offline_Uncorrectable / UNC: uncorrectable read errors found during offline scanning or normal ops.
- UDMA_CRC_Error_Count: data corruption on the link (SATA). Almost always cabling/backplane/connector trouble, not disk media.
- SMART overall-health: useful when it fails; not trustworthy when it passes.
- NVMe Media and Data Integrity Errors: a strong signal on NVMe devices when non-zero and increasing.
- Power_On_Hours / Power_Cycle_Count: context. Frequent power cycles correlate with weird failures and connector issues.
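If you want to eyeball exactly these attributes across every disk without wading through full smartctl dumps, a loop like the one below works on a typical Linux host. This is a minimal sketch: it assumes smartctl is installed, that your data disks appear as /dev/sd?, and that your vendor uses these attribute names (labels vary by vendor and generation).
for dev in /dev/sd?; do
  echo "== $dev =="
  # -A prints the vendor attribute table; filter to the handful worth alerting on
  sudo smartctl -A "$dev" | egrep -i 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count|Temperature_Celsius'
done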
Correlation patterns you can bet your weekend on
Pattern A: ZFS checksum errors + pending sectors (or offline uncorrectables)
This is the cleanest case. The device is returning bad data or can’t read it reliably. ZFS can often heal it from redundancy,
but your job is to stop gambling. Scrub, capture evidence, replace the disk if the counters move.
Pattern B: ZFS I/O errors + UDMA CRC errors (SATA)
This is usually not the disk. It’s the path: cable, backplane, connector oxidation, a cheap splitter,
or a “hot-swap bay” that’s more “warm-swap if you promise not to sneeze.”
Reseat/replace cables, move ports, check power. If CRC increments stop, the drive is fine.
Pattern C: ZFS device faults/removals + clean SMART
Think power and transport. Drives can disappear under marginal power, expander resets, or HBA firmware issues.
SMART won’t always record it because the drive didn’t fail internally; it just got yanked from reality.
Pattern D: SMART “PASSED” + ZFS checksum errors
SMART overall-health is a blunt instrument. Some drives don’t declare failure until the situation is already on fire.
Treat ZFS checksums as higher priority for data correctness, then dig into SMART raw attributes and transport logs.
Pattern E: No ZFS errors, SMART warnings rising
This is where people get lazy. ZFS hasn’t observed incorrect data yet. SMART is telling you the drive is
doing extra work to keep up appearances. If the SMART attributes are the “real” ones (pending/uncorrectable),
plan a controlled replacement. If it’s just temperature spikes or a single reallocation years ago, monitor closely.
Joke #1: SMART is like a check-engine light: sometimes it’s catastrophic, sometimes your gas cap is loose, and either way you’re late for work.
Fast diagnosis playbook (first/second/third)
First: find out what ZFS is complaining about
- Run zpool status -xv and read it like a contract. Look for READ, WRITE, CKSUM counters and any “too many errors” messages.
- Identify the exact device and vdev (by persistent ID, not /dev/sdX roulette).
- Check whether errors are still increasing (run status again after some I/O, or after a minute; a simple before/after capture is sketched below). Static historical errors are different from ongoing ones.
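To separate static history from an active problem, capture the counters twice and compare. A minimal sketch, assuming the pool is named backup and that a minute of normal workload is enough to move live counters:
zpool status backup > /tmp/zpool.before
sleep 60
zpool status backup > /tmp/zpool.after
# Differences in the READ/WRITE/CKSUM columns mean the problem is ongoing, not historical.
# (The scan progress line will also differ during a scrub; ignore it and read the per-device rows.)
diff /tmp/zpool.before /tmp/zpool.after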
Second: decide if it smells like media vs transport vs host
- Media suspicion: checksum errors on one disk, SMART pending/uncorrectable/reallocated increasing, self-test fails.
- Transport suspicion: I/O errors, device resets, CRC errors, multiple disks on same backplane port, kernel logs show link resets/timeouts.
- Host suspicion: errors across many vdevs at once, HBA resets, PCIe AER messages, memory instability, firmware bugs.
Third: pick a safe action that reduces risk quickly
- If redundancy is intact: scrub to force reads; gather SMART; schedule replacement or cabling work during business hours.
- If redundancy is compromised (degraded RAIDZ/mirror already missing a member): stop experiments. Minimize load, take a backup/snapshot strategy seriously, and replace the most likely culprit first.
- If device is flapping: stabilize transport (reseat/replace cable, move port) before resilvering. Resilvering through a flaky link is just performance art.
Practical tasks: commands, outputs, decisions
Below are hands-on tasks you can run on a typical Linux ZFS host (OpenZFS). Each task includes: a command, a realistic snippet of output,
what it means, and the decision you make. Do them in order when you’re on call. Do them in slow motion when you’re not.
Task 1: Confirm whether any pool is unhappy
cr0x@server:~$ zpool status -x
pool 'tank' is healthy
pool 'backup' has experienced errors
Meaning: -x prints only problems. “Healthy” doesn’t mean “perfect,” it means no known faults requiring action.
Decision: If any pool is not healthy, move to a full verbose status immediately and start collecting evidence before rebooting anything.
Task 2: Get the detailed ZFS view (your primary source)
cr0x@server:~$ zpool status -xv backup
pool: backup
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
scan: scrub repaired 0B in 02:11:30 with 2 errors on Sun Dec 22 03:10:12 2025
config:
NAME STATE READ WRITE CKSUM
backup ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ7... ONLINE 0 0 2
ata-WDC_WD80EFAX-68LHPN0_VKJ8... ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ9... ONLINE 0 0 0
errors: No known data errors
Meaning: Two checksum errors occurred and were corrected. “No known data errors” means ZFS believes it repaired or isolated the damage.
Decision: Treat corrected checksum errors as a warning shot. Don’t clear counters yet; correlate with SMART and transport logs first.
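Capturing that evidence takes a minute and pays for itself later. A minimal sketch; the output path is a placeholder and the by-id name is the affected disk from the status above:
ts=$(date +%Y%m%dT%H%M%S)
{
  zpool status -v backup
  # smartctl -x dumps everything the drive will report: attributes, error log, self-test log
  sudo smartctl -x /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ7XXXX
} > "/var/tmp/backup-cksum-evidence-$ts.txt"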
Task 3: Map the ZFS device ID to the OS block device
cr0x@server:~$ ls -l /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ7*
lrwxrwxrwx 1 root root 9 Dec 26 01:12 /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ7XXXX -> ../../sdc
Meaning: The persistent ID points to /dev/sdc right now. Tomorrow it might not.
Decision: Use /dev/disk/by-id paths for all ZFS replace operations. Write down the slot/serial mapping while you’re calm.
Task 4: Check SMART overall-health (quick but not decisive)
cr0x@server:~$ sudo smartctl -H /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.0] (local build)
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Meaning: The drive isn’t declaring failure. That’s not the same as “the drive is fine.”
Decision: Continue to attributes and error logs. Don’t let “PASSED” overrule ZFS checksums.
Task 5: Pull SMART attributes that correlate with media failure
cr0x@server:~$ sudo smartctl -A /dev/sdc
=== START OF READ SMART DATA SECTION ===
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 2
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
Meaning: Pending and uncorrectable sectors exist. This is a drive that couldn’t reliably read two sectors.
Decision: If this disk is in a redundant vdev: plan replacement. If it’s a single-disk pool: back up now, then replace. Also run a long self-test for confirmation, but don’t wait for it to “get worse.”
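Whether these counters are moving matters more than their absolute values. A minimal sketch that snapshots the raw values now so you can diff after a scrub or a day of workload; attribute IDs 5, 197 and 198 are the usual SATA numbering but are not guaranteed across vendors:
sudo smartctl -A /dev/sdc | awk '$1==5 || $1==197 || $1==198 {print $1, $2, $NF}' \
  | tee "/var/tmp/sdc-smart-$(date +%F).txt"
# Re-run later and diff the two files; any growth in the raw value column is a trend, not noise.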
Task 6: Read the SMART error log (often shows the pattern)
cr0x@server:~$ sudo smartctl -l error /dev/sdc
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 3
CR = Command Register [HEX]
FR = Features Register [HEX]
...
Error 3 occurred at disk power-on lifetime: 42110 hours
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 ...
40 -- 51 0008 00000000f3a1c2 ... UNC
Meaning: UNC indicates an uncorrectable read error at a specific LBA. That aligns with pending/uncorrectable sectors and ZFS checksum issues.
Decision: Replace the drive on your schedule, not the drive’s. Before replacement, scrub to force ZFS to touch as much surface area as possible.
Task 7: Run a SMART long test (evidence, not therapy)
cr0x@server:~$ sudo smartctl -t long /dev/sdc
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 780 minutes for test to complete.
Meaning: The drive will scan surface media. This can surface more UNC errors.
Decision: If the pool is degraded or the workload is latency sensitive, schedule the test for a quiet window. If you already have pending sectors, you don’t need this test’s permission to replace the drive.
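If you start the test anyway, you don’t have to babysit it: the drive reports progress in the capabilities section, and the verdict lands in the self-test log (Task 8). A minimal sketch, assuming smartctl’s usual output layout:
# Shows "Self-test routine in progress" and the percentage remaining while the test runs
sudo smartctl -c /dev/sdc | grep -A 2 'Self-test execution status'
# Once it completes, read the result
sudo smartctl -l selftest /dev/sdc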
Task 8: Check SMART self-test results
cr0x@server:~$ sudo smartctl -l selftest /dev/sdc
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 10% 42112 16062914
# 2 Short offline Completed without error 00% 42001 -
Meaning: Long test hit a read failure. This is as close as SMART gets to a sworn statement.
Decision: Replace the disk. If you have spares, do it now. If procurement is slow, at least move critical datasets off this vdev or adjust redundancy plans.
Task 9: Look for transport/link issues in kernel logs
cr0x@server:~$ sudo dmesg -T | egrep -i 'ata[0-9]|sas|reset|link|I/O error|blk_update_request' | tail -n 20
[Tue Dec 24 02:12:11 2025] ata7: hard resetting link
[Tue Dec 24 02:12:12 2025] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Tue Dec 24 02:12:12 2025] ata7.00: configured for UDMA/133
[Tue Dec 24 02:12:12 2025] blk_update_request: I/O error, dev sdc, sector 16062914 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Tue Dec 24 02:12:13 2025] ata7.00: failed command: READ FPDMA QUEUED
Meaning: Link resets + read failures. Could be drive or cable; combine with SMART to decide. With UNC/pending sectors present, the drive is guilty enough.
Decision: If CRC errors are zero but UNC exists, replace drive. If CRC errors are climbing, fix cabling too; you can have both.
Task 10: Check UDMA CRC specifically (classic “don’t RMA the cable”)
cr0x@server:~$ sudo smartctl -A /dev/sdd | egrep 'UDMA_CRC_Error_Count|CRC'
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 41
Meaning: CRC errors occurred on the SATA link. If this number increases over time, your transport path is corrupting or dropping frames.
Decision: Reseat/replace the SATA cable, check backplane connectors, inspect power splitters, and consider moving the disk to another port. Do not replace the drive solely for CRC errors unless other evidence points to media failure.
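The useful question is not “are there CRC errors” but “are they still increasing.” A minimal sketch that records the raw count, waits through some I/O, and compares; the device and the interval are placeholders:
before=$(sudo smartctl -A /dev/sdd | awk '$1==199 {print $NF}')
sleep 600   # or run your normal workload / part of a scrub in the meantime
after=$(sudo smartctl -A /dev/sdd | awk '$1==199 {print $NF}')
echo "UDMA CRC errors: $before -> $after"
# If the count grows, keep working on the cable/backplane/port; if it stays flat, the path is probably fine now.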
Task 11: Identify which physical slot a disk is in (stop guessing)
cr0x@server:~$ sudo udevadm info --query=all --name=/dev/sdc | egrep 'ID_SERIAL=|ID_SERIAL_SHORT=|ID_PATH=|DEVLINKS='
E: ID_SERIAL=WDC_WD80EFAX-68LHPN0_VKJ7XXXX
E: ID_SERIAL_SHORT=VKJ7XXXX
E: ID_PATH=pci-0000:3b:00.0-ata-7
E: DEVLINKS=/dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ7XXXX /dev/disk/by-path/pci-0000:3b:00.0-ata-7
Meaning: You can map the drive to a controller port via by-path. In a chassis with labeled bays, this saves you from pulling the wrong disk.
Decision: Document the mapping (serial → bay). If you can’t map reliably, you’re not operating storage; you’re gambling with hardware.
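The serial-to-slot map can be generated rather than hand-maintained. A minimal sketch that dumps the by-id link, the current kernel name, and the controller path for every whole disk; bay labels still have to be added by hand once per chassis:
for link in /dev/disk/by-id/ata-* /dev/disk/by-id/nvme-*; do
  [ -e "$link" ] || continue
  # Skip partition links; we only want whole disks
  case "$link" in *-part*) continue ;; esac
  dev=$(readlink -f "$link")
  path=$(udevadm info --query=property --name="$dev" | awk -F= '/^ID_PATH=/{print $2}')
  printf '%s,%s,%s\n' "$link" "$dev" "$path"
done > /var/tmp/disk-map.csv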
Task 12: Run a scrub (controlled truth-finding)
cr0x@server:~$ sudo zpool scrub backup
cr0x@server:~$ zpool status backup
pool: backup
state: ONLINE
scan: scrub in progress since Thu Dec 26 01:20:05 2025
1.12T scanned at 1.05G/s, 410G issued at 384M/s, 16.2T total
0B repaired, 2.50% done, 11:42:16 to go
Meaning: Scrub forces reads and checksum verification. If a disk is marginal, scrubs often provoke the errors you need to see.
Decision: If errors climb during scrub on one device, prepare replacement. If errors appear on multiple disks in the same enclosure, suspect transport/power/HBA.
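While the scrub runs, keep an eye on which devices’ counters move. A minimal sketch that logs the progress and per-device counters every two minutes (the interval and log path are arbitrary); stop it with Ctrl-C:
while true; do
  date
  # Print everything from the scan line onward: progress, per-device counters, errors
  zpool status backup | awk '/scan:/{show=1} show'
  sleep 120
done | tee /var/tmp/backup-scrub-watch.log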
Task 13: Inspect ZFS error details for affected files (when ZFS knows)
cr0x@server:~$ zpool status -v backup
pool: backup
state: ONLINE
status: One or more devices has experienced an unrecoverable error.
...
errors: Permanent errors have been detected in the following files:
backup/media@autosnap_2025-12-25:some/path/video.mp4
Meaning: ZFS could not repair data for that file/block. This is not a “monitoring alert.” This is data loss.
Decision: Restore from backup/snapshot/replica. Then investigate hardware. Also: stop clearing errors; you need the audit trail until you fix the root cause.
Task 14: Replace a failed disk correctly (use by-id)
cr0x@server:~$ sudo zpool replace backup ata-WDC_WD80EFAX-68LHPN0_VKJ7XXXX ata-WDC_WD80EFAX-68LHPN0_VNEW1234
cr0x@server:~$ zpool status backup
pool: backup
state: ONLINE
scan: resilver in progress since Thu Dec 26 02:01:44 2025
1.84T scanned at 512M/s, 620G issued at 172M/s, 16.2T total
610G resilvered, 3.74% done, 07:55:10 to go
config:
NAME STATE READ WRITE CKSUM
backup ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
replacing-0 ONLINE 0 0 0
ata-WDC_WD80EFAX-...VKJ7XXXX ONLINE 0 0 6
ata-WDC_WD80EFAX-...VNEW1234 ONLINE 0 0 0 (resilvering)
ata-WDC_WD80EFAX-...VKJ8YYYY ONLINE 0 0 0
ata-WDC_WD80EFAX-...VKJ9ZZZZ ONLINE 0 0 0
Meaning: Resilver is underway. The old and new devices sit under a temporary replacing-0 vdev until the resilver completes; checksum errors on the old device can keep climbing because it is still read during the resilver.
Decision: Watch for rising errors on other disks. If resilver slows dramatically or devices reset, pause and fix transport before forcing through.
Task 15: Clear errors only after you’ve fixed the cause
cr0x@server:~$ sudo zpool clear backup
cr0x@server:~$ zpool status -xv backup
pool 'backup' is healthy
Meaning: Counters reset. This is not “healed”; it’s “clean slate.”
Decision: Only clear after replacement/repair and after a clean scrub. Otherwise you delete your own forensic evidence and invite repeat incidents.
Task 16: NVMe-specific SMART/health (different vocabulary, same job)
cr0x@server:~$ sudo smartctl -a /dev/nvme0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Critical Warning: 0x00
Temperature: 47 Celsius
Available Spare: 100%
Percentage Used: 6%
Media and Data Integrity Errors: 12
Error Information Log Entries: 44
Meaning: Non-zero and increasing media/data integrity errors are a strong predictor of trouble for NVMe.
Decision: Correlate with ZFS checksum/I/O errors and kernel NVMe resets. If errors are rising, plan replacement; NVMe tends to fail “fast” once it starts.
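If nvme-cli is installed, the same counters are available straight from the controller’s SMART/health log page. A minimal sketch; the device name is a placeholder:
sudo nvme smart-log /dev/nvme0
# Track media_errors and num_err_log_entries over time, the same way you track pending
# sectors on SATA: a rising trend matters more than the absolute number.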
Joke #2: Resilvering through a flaky cable is like trying to mop up a leak while someone keeps turning the faucet on—technically possible, spiritually unwise.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-size SaaS company ran a ZFS-backed object store on a pair of storage servers. Monitoring flagged SMART “PASSED” across the board.
ZFS reported a handful of checksum errors during scrubs, always corrected, always on the same vdev. The on-call rotation treated it like harmless background noise.
The wrong assumption was simple: “If the pool is ONLINE and SMART says PASSED, the disk is healthy.” That’s a comforting story because it lets you do nothing.
And doing nothing is the most popular operational strategy on Earth.
Weeks later, one disk in that vdev started accumulating Current_Pending_Sector and Offline_Uncorrectable counts. Nobody looked because nobody had dashboards for raw SMART attributes,
only the overall-health bit. Then a second disk in the same RAIDZ group developed read timeouts during a busy ingest window.
ZFS did what it could, but RAIDZ isn’t magic when multiple members are sick. The pool degraded, performance collapsed, and scrub/resilver activity fought production I/O.
The incident wasn’t instant data loss—it was worse in a corporate way: a slow, noisy outage with partial errors and a lot of “retry later” behavior.
Post-incident, the fix was boring: alert on the raw attributes that matter, alert on ZFS corrected errors, and require a human decision when either moves.
The team didn’t eliminate disk failures. They eliminated the surprise.
Mini-story 2: The optimization that backfired
A financial services shop wanted faster scrubs. They’d read that increasing scrub throughput helps catch latent errors sooner.
True. They also wanted to reduce “maintenance windows,” so they cranked parallelism, scheduled scrubs during business hours, and tuned the system to prioritize scrub I/O.
The first few scrubs were lightning-fast. Everyone congratulated themselves. Then they hit a vdev with a marginal disk and a borderline SAS backplane connection.
The aggressive scrub produced a storm: link resets, command timeouts, retries. ZFS counters grew. Latency graphs became abstract art.
The punchline was that the optimization changed the failure mode. Instead of discovering a few bad sectors gently over time, they forced the system into a high-stress workload
that magnified transport instability and created customer-visible latency spikes. Worse, their monitoring treated “scrub in progress” as “maintenance noise” and suppressed alerts.
The remediation was not “never scrub.” It was: scrub regularly, but tune scrub impact, don’t suppress the wrong alerts, and treat transport errors during scrub as a top-tier signal.
They also learned to stagger scrubs across hosts and avoid synchronized I/O storms.
They ended up with slower scrubs and fewer incidents. That trade was worth it. Performance is a feature; predictability is a product.
Mini-story 3: The boring but correct practice that saved the day
A company ran a large ZFS pool for VM storage. Nothing glamorous. They did three things consistently:
monthly scrubs, weekly review of SMART raw attributes, and strict device naming via /dev/disk/by-id with a living map of serial-to-slot.
One Friday afternoon, ZFS reported a few checksum errors on a single disk during a scrub. SMART showed UDMA_CRC_Error_Count rising, but no pending sectors.
The engineer on rotation didn’t panic. They didn’t order an emergency drive shipment. They followed the checklist.
They reseated the drive, replaced the SATA cable, and moved the port to a different controller lane. CRC errors stopped increasing immediately.
Next scrub completed cleanly. ZFS counters stayed flat.
The “save” was not heroic debugging. It was having enough baseline and mapping that the engineer could confidently touch one cable and know it was the right one.
The incident write-up was short. That’s how you know it was a good day.
Common mistakes: symptom → root cause → fix
1) “ZFS checksum errors on one disk, SMART says PASSED”
Symptom: ZFS CKSUM increments; SMART overall-health passes.
Root cause: SMART overall-health is not sensitive; raw attributes may still show pending/uncorrectables. Or corruption is in transport (cable/controller) not internal media.
Fix: Check SMART raw attributes and error log; check kernel logs for resets; if pending/UNC exists, replace disk. If CRC/reset pattern dominates, fix cabling/HBA first.
2) “Lots of I/O errors across multiple disks in the same enclosure”
Symptom: Several vdev members show read/write errors simultaneously.
Root cause: Shared component: backplane, expander, power rail, HBA, firmware, or a loose mini-SAS cable.
Fix: Inspect transport path; swap expander port/cable; check power distribution. Do not shotgun-replace multiple drives unless SMART media errors corroborate.
3) “CRC errors climbing but ZFS is fine”
Symptom: UDMA_CRC_Error_Count increases; ZFS shows no errors yet.
Root cause: Link integrity is degrading; retries are masking it.
Fix: Replace/secure cables, clean connectors, verify proper seating. If CRC stops incrementing, you likely prevented a future ZFS incident.
4) “Scrub repaired 0B but reported errors”
Symptom: Scrub output shows errors but no bytes repaired.
Root cause: Errors may be I/O timeouts or transient transport failures rather than checksum-correctable corruption.
Fix: Examine zpool status -v and kernel logs; check SMART CRC errors and controller resets. Repair transport; re-run scrub.
5) “Resilver is slow and keeps restarting”
Symptom: Resilver progress stalls; devices reset; pool flaps between DEGRADED/ONLINE.
Root cause: Transport instability or power issues. Under rebuild load, marginal links fail more often.
Fix: Stabilize cabling/backplane/power first. Consider lowering workload, pausing disruptive jobs, and avoiding repeated device resets.
6) “Cleared ZFS errors and now we can’t tell what happened”
Symptom: Historical counters gone; intermittent errors return; no timeline.
Root cause: Premature zpool clear erased evidence before root cause was fixed.
Fix: Keep counters until after remediation and a clean scrub. Use ticket notes to record before/after counters and SMART raw values.
7) “Replaced a disk and the errors followed the slot”
Symptom: New disk shows errors immediately, same bay/port.
Root cause: Backplane slot, cable, or HBA port is the problem.
Fix: Move the disk to a different bay/port; replace the suspect cable/backplane component. Don’t keep sacrificing drives to a bad connector.
8) “ZFS reports permanent errors, but SMART looks clean”
Symptom: zpool status -v lists files with permanent errors.
Root cause: Data was written corrupted (e.g., transient path corruption) and then became the “truth” in parity, or redundancy couldn’t reconstruct due to multiple faults.
Fix: Restore affected data from backup/replica; then troubleshoot transport and memory stability. Treat as a data integrity incident, not just a disk replacement.
Checklists / step-by-step plan
Checklist A: When you get a SMART alert
- Run zpool status -xv. If ZFS is already complaining, prioritize ZFS signals.
- Identify the device via its /dev/disk/by-id symlink. Record serial and by-path.
- Collect SMART: smartctl -H, -A, -l error, -l selftest.
- Classify the SMART alert:
- Pending/uncorrectable/reallocated increasing: plan disk replacement.
- CRC errors increasing: fix cabling/backplane/port.
- Temperature high: fix cooling/airflow and re-check; heat accelerates failure.
- Run a scrub if redundancy exists and operational impact is acceptable.
- Decide: replace disk now, schedule replacement, or remediate transport and monitor.
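Much of this checklist can be automated with smartd from smartmontools, so the alert arrives before the pager does. A minimal /etc/smartd.conf sketch; the mail address and self-test schedule are placeholders:
# Monitor all attributes (-a), run a short self-test daily at 02:00 and a long test
# on Saturdays at 03:00 (-s), and mail warnings to the storage team (-m).
DEVICESCAN -a -s (S/../.././02|L/../../6/03) -m storage-oncall@example.com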
Checklist B: When ZFS reports checksum or I/O errors
- Do not clear errors. Capture zpool status -xv output into a ticket.
- Check if errors are increasing over time. If yes, treat as an active incident.
- Map the device: ls -l /dev/disk/by-id/ and udevadm info.
- Collect SMART raw attributes and error logs.
- Check kernel logs for resets/timeouts and for shared-path patterns (multiple disks).
- If media indicators exist: replace the drive using zpool replace with by-id.
- If transport indicators dominate: fix cabling/HBA/backplane, then scrub again.
- After remediation: scrub clean, then zpool clear to reset counters.
Checklist C: Before you declare victory
- Verify the pool is stable: zpool status -x should be quiet.
- Verify no counters increment during normal workload.
- Verify SMART CRC stops increasing (if it was a transport issue).
- Schedule a follow-up scrub (or confirm a scrub completed cleanly after the fix).
- Update your serial-to-slot documentation so next time is faster and less creative.
FAQ
1) Which should I trust more: ZFS or SMART?
For data correctness, trust ZFS more. For predicting a disk that’s about to go sour, SMART raw attributes help.
Overall SMART “PASSED” is not a veto over ZFS checksum errors.
2) Do I replace a disk after a single checksum error?
Not always. A single corrected checksum error could be transient. The move is: run a scrub, inspect SMART raw attributes,
check transport logs. If errors repeat or SMART shows pending/uncorrectables, replace. If CRC errors rise, fix cabling.
3) What’s the difference between ZFS scrub and resilver for diagnostics?
Scrub reads and verifies existing data across the pool; it’s a periodic integrity audit. Resilver reconstructs data onto a replaced device.
Both stress disks, but resilver is more urgent and riskier if transport is flaky because it’s reconstructing state under load.
4) Why do I see ZFS errors but “errors: No known data errors”?
Because ZFS can correct certain failures using redundancy and still records that it had to. “No known data errors” means it believes user data is consistent now.
It doesn’t mean the hardware is healthy.
5) Are UDMA CRC errors a reason to RMA a drive?
Usually no. CRC errors implicate the SATA link: cable, connector, backplane, EMI. Replace the cable, reseat, move ports.
If CRC stops increasing, you fixed it. If the drive also has pending/UNC, then yes, the drive may be failing too.
6) How do I avoid replacing the wrong disk?
Use persistent naming and mapping. Operate with /dev/disk/by-id. Confirm serial via smartctl -i and map to bay via by-path or enclosure tools.
Never trust /dev/sdX names for planned maintenance.
7) Should I run SMART long tests on production drives?
Yes, but with intent. Long tests can add load and latency. Schedule them, stagger them, and don’t run them when the pool is degraded unless you’re collecting urgent evidence.
If you already have strong failure indicators, replace instead of “testing until it fails harder.”
8) What if SMART shows pending sectors but ZFS has no errors?
Pending sectors mean the drive couldn’t read some data reliably. ZFS may not have touched those blocks recently.
Run a scrub to force reads. If pending/uncorrectable persists or increases, replace the disk while redundancy is intact.
9) Can bad RAM cause ZFS checksum errors that look like disk problems?
Yes, though it’s less common than people claim when they want to avoid replacing a disk. If checksum errors appear across multiple devices or move around,
consider host memory/CPU/PCIe stability. ECC helps; it doesn’t make you immortal.
10) Why do errors spike during scrubs?
Scrubs read everything. Latent media errors that never got touched under normal workload suddenly get exercised.
Transport issues also show up because sustained throughput increases the chance of resets/timeouts.
A scrub that reveals errors is doing its job; the mistake is ignoring the results.
Conclusion: next steps you can do today
Correlating SMART warnings with ZFS errors is not about collecting more graphs. It’s about turning two imperfect truth sources into a decision.
Your best outcomes look dull: a planned disk swap, a replaced cable, a scrub that finishes clean, and a ticket that closes without drama.
- Standardize identification: operate ZFS using /dev/disk/by-id and keep a serial-to-slot map.
- Alert on meaningful SMART raw attributes: pending, uncorrectable, reallocated trends; CRC errors for transport.
- Treat corrected ZFS errors as actionable signals: not emergencies, but not background noise.
- Scrub regularly and intentionally: stagger scrubs, watch for errors during scrub, and don’t suppress the alerts you actually need.
- Only clear counters after the fix: evidence first, cosmetics later.
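Regular, staggered scrubs are easy to schedule; check first whether your distribution already ships a scrub timer or cron job. A minimal /etc/cron.d sketch, with pool names, dates, times, and the zpool path as placeholders:
# Scrub tank on the 1st and backup on the 15th of each month, at 02:30
30 2 1 * * root /usr/sbin/zpool scrub tank
30 2 15 * * root /usr/sbin/zpool scrub backup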
The goal isn’t zero alerts. The goal is fewer surprises, and fewer weekends spent learning the exact acoustic signature of a dying disk.