You log in to a box that has been “fine for years” and ZFS is suddenly screaming. READ errors.
A scrub that won’t finish. Latency spikes that make the app team say “the database is slow” like that’s a diagnosis.
You run zpool status and there it is: a vdev with a handful of read errors, maybe some checksum errors,
and everyone wants to know the one thing you can’t responsibly answer in five seconds: is it the disk, the cable, or the controller?
The trick is that ZFS is honest, but it’s not psychic. It can tell you what it saw—bad reads, bad checksums, I/O timeouts—
not why the universe decided to misbehave. Your job is to turn symptoms into a decision: replace a disk, reseat a cable,
update HBA firmware, stop trusting a backplane, or all of the above.
A mental model: what ZFS can prove vs. what it can only suggest
ZFS is a storage system with receipts. Every block has a checksum; metadata is protected; redundancy (mirror/RAIDZ)
lets ZFS self-heal by reading a good copy and repairing the bad one. That is the part people remember.
The part people forget: ZFS is sitting on top of an I/O stack that includes drivers, firmware, a controller (HBA/RAID card in IT mode),
a backplane, cables, power, and finally the drive’s own controller and media. If any layer lies, retries, times out, or returns corrupted data
while still claiming success, ZFS will catch it (checksum mismatch) but won’t necessarily know where it was corrupted.
So treat ZFS errors like evidence at a crime scene:
- ZFS can prove data didn’t match the checksum. It cannot prove where the corruption happened.
- ZFS can count I/O errors per device. It cannot guarantee that the device label maps to a physical drive if your cabling is chaotic.
- ZFS can repair if redundancy exists and the failure is transient. It cannot repair if you’ve lost redundancy and the only copy is bad.
Operationally, you want to answer three questions quickly:
- Is data integrity at risk right now (more than your redundancy can tolerate)?
- Is the fault localized (one drive) or systemic (path/controller/backplane/power)?
- What’s the safest action that improves odds immediately?
A useful rule: replace parts only after you’ve reduced uncertainty. Swapping the wrong thing is how you create
“new” problems—like bumping a marginal cable and turning a flaky link into a dead one.
READ vs CHECKSUM vs WRITE: what each error points at
READ errors: the device didn’t deliver the requested data
A READ error in zpool status is ZFS saying “I asked the block device for bytes and got an error back.”
That error could come from the drive (media error, internal reset), from the link (CRC errors leading to aborted commands),
from the controller/driver (timeouts, resets), or from power events. It’s not as specific as people hope.
If READ errors are isolated to one device and the OS logs show that device reporting medium errors, the drive is a prime suspect.
If READ errors jump around across drives behind the same HBA port, the link or controller becomes more interesting.
CHECKSUM errors: data arrived, but it was wrong
CHECKSUM errors are ZFS’s calling card. The device returned data successfully, but the checksum didn’t match.
If redundancy exists, ZFS will try other replicas/parity and correct the bad copy. If not, your “successful read” just handed you corruption.
Checksum errors often implicate the path: dodgy SAS/SATA cables, a flaky backplane, or a controller doing something “creative.”
They can also be caused by RAM issues, but if you’re running ZFS without ECC and you’re seeing checksum errors, that’s not “bad luck”—
that’s a design review.
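If you want to rule memory in or out quickly, most server platforms log corrected and uncorrected memory events through the kernel's EDAC subsystem. A minimal sketch, assuming the EDAC driver for your platform is loaded and exposes the usual sysfs counters (paths and grep patterns here are illustrative, not gospel):
# Look for memory error reports from EDAC or machine-check handling.
sudo journalctl -k | grep -i -E 'edac|mce|machine check'
# Per-memory-controller corrected/uncorrected counters, if your platform exposes them.
grep -H . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count 2>/dev/null
Weeks of uptime with zero events doesn't prove the RAM is perfect, but it does move memory well down the suspect list.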
WRITE errors: the device refused to accept data
Write errors are more straightforward: ZFS tried to write, and the device path returned failure.
Persistent write errors usually point to a device or path that is dying or intermittently disconnecting.
The pattern that should change your posture immediately:
checksum errors across multiple devices at once. That’s not “two disks failing together.”
That’s “something shared is misbehaving.”
Joke #1: RAID is not a backup, and neither is “we’ve never had a problem.” That’s just a wish wearing a spreadsheet.
Interesting facts and historical context (the stuff that explains today’s weirdness)
- ZFS was built at Sun to end silent data corruption. The checksum-on-everything design was a direct response to storage stacks that “trusted the disk.”
- Early SATA got a reputation for “just working” until it didn’t. Consumer-grade SATA cabling and connectors were never meant for dense, vibration-heavy chassis.
- SAS was designed with expander/backplane topologies in mind. That’s why SAS error counters and link-level reporting can be richer—if your tooling exposes it.
- “RAID controller” caching once hid disk problems. Battery-backed cache could mask latency and even reorder failure symptoms, making postmortems fun in the worst way.
- IT mode HBAs became the ZFS norm because ZFS wants the truth. ZFS expects direct device semantics; RAID firmware that remaps errors can sabotage diagnosis.
- ATA error reporting is inconsistent across vendors. SMART is useful, but vendors implement attributes differently and sometimes optimistically.
- UDMA CRC errors are usually “path” errors, not media errors. If that counter is climbing, suspect cables, backplanes, or electrical noise first.
- Drive internal retries can look like “random latency” before they look like failure. By the time you see hard errors, the drive may have been struggling for weeks.
- ZFS scrubs were made for catching latent sector errors. The point is to find bad sectors while you still have redundancy, not during a rebuild when you don’t.
One paraphrased idea from Werner Vogels: “Everything fails, all the time.” Treat it as a design constraint, not motivational wall art.
Fast diagnosis playbook (first/second/third checks)
First: establish blast radius and current risk
- Run zpool status -xv. You're looking for which vdevs are degraded, whether errors are increasing, and whether ZFS repaired anything.
- Check if a scrub/resilver is running and its ETA. If you're already degraded, your goal is to avoid adding stress or triggering more failures.
- Identify redundancy margin. Mirror: you can lose one side. RAIDZ1: you can lose one drive total. RAIDZ2: two. Be brutally honest.
Second: correlate with OS-level evidence
- Look at dmesg / journald for resets, timeouts, link errors. If you see "hard resetting link," "device offline," or "COMRESET failed," you're in cable/controller territory.
- Pull SMART, especially UDMA CRC and reallocated/pending sectors. CRC climbing points to the path; reallocated/pending points to the disk.
- Check if multiple drives on the same HBA port or expander show symptoms. Shared fate suggests shared hardware.
Third: act conservatively and verify after each change
- If the evidence points to one disk: replace it. Don’t “watch it” if you’re already degraded and the drive is throwing media errors.
- If the evidence points to the path: reseat/replace the simplest component first. Start with the cable, then the backplane slot, then the HBA—unless logs scream otherwise.
- After any hardware change: clear errors, scrub, and watch counters. “Fixed” means counters stop increasing under load.
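A low-tech way to do the "watch counters" step without staring at a terminal: snapshot the counters, let real load (or a scrub) run, and diff. The pool name and the wait below are assumptions; use your own.
# Snapshot per-device error counters, wait under load, then compare.
# Note: the scan progress line will change too; focus on the READ/WRITE/CKSUM columns.
sudo zpool status tank > /tmp/zpool-before.txt
sleep 600
sudo zpool status tank > /tmp/zpool-after.txt
diff /tmp/zpool-before.txt /tmp/zpool-after.txt || echo "output changed -- check whether error counters moved"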
Practical tasks: commands, expected output, and the decision you make
Task 1: Get the authoritative ZFS symptom list
cr0x@server:~$ sudo zpool status -xv
pool: tank
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
scan: scrub repaired 0B in 02:11:09 with 0 errors on Tue Dec 24 03:12:21 2025
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 DEGRADED 3 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
errors: No known data errors
What it means: One disk has accumulated READ errors, but ZFS says applications are unaffected and no known data errors exist.
That suggests ZFS had redundancy to recover.
Decision: You are not done. You need to identify whether those reads are a drive problem (media) or a path problem (CRC/timeouts).
Do not clear errors yet; they’re breadcrumbs.
Task 2: Watch whether the error counters are still increasing
cr0x@server:~$ sudo zpool status tank
pool: tank
state: DEGRADED
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 DEGRADED 7 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
What it means: Errors grew from 3 to 7. Something is still wrong.
Decision: Escalate urgency. Start OS log correlation immediately; plan a maintenance action (replace disk or fix path) today, not “next window.”
Task 3: Map the ZFS device to a physical disk you can touch
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep WDC_WD80 | head
lrwxrwxrwx 1 root root 9 Dec 25 02:10 ata-WDC_WD80...-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 9 Dec 25 02:10 ata-WDC_WD80...-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 9 Dec 25 02:10 ata-WDC_WD80...-part1 -> ../../sdd1
What it means: You can translate the by-id name into /dev/sdX.
Don’t skip this: people replace the wrong drive more often than they admit.
Decision: Identify the suspect device node (say /dev/sdc) and use that consistently for SMART and logs.
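Before anything gets pulled, match the suspect's serial number against the label on the drive or your bay map. A quick sketch (the device node is the example's /dev/sdc; column availability depends on your lsblk version):
# Model and serial for every disk, so by-id names can be matched to physical labels.
lsblk -d -o NAME,MODEL,SERIAL,SIZE
# Cross-check the suspect straight from its SMART identity block.
sudo smartctl -i /dev/sdc | grep -i -E 'model|serial'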
Task 4: Inspect kernel logs for link resets/timeouts (path/controller smell)
cr0x@server:~$ sudo dmesg -T | egrep -i 'sdc|ata[0-9]|reset|timeout|failed command|I/O error' | tail -n 20
[Wed Dec 25 01:44:02 2025] ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0xe frozen
[Wed Dec 25 01:44:02 2025] ata6.00: irq_stat 0x08000000, interface fatal error
[Wed Dec 25 01:44:02 2025] ata6: SError: { CommWake DevExch }
[Wed Dec 25 01:44:03 2025] ata6: hard resetting link
[Wed Dec 25 01:44:08 2025] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Wed Dec 25 01:44:08 2025] sd 5:0:0:0: [sdc] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Wed Dec 25 01:44:08 2025] sd 5:0:0:0: [sdc] Sense Key : Medium Error [current]
[Wed Dec 25 01:44:08 2025] sd 5:0:0:0: [sdc] Add. Sense: Unrecovered read error
What it means: You have both link-level drama (hard resetting link) and a medium error (unrecovered read error).
This is where diagnosis gets spicy: a bad cable can trigger resets; a bad disk can also provoke resets.
Decision: Pull SMART next. If SMART shows pending/reallocated sectors, the disk is guilty enough to replace.
If SMART is clean but CRC errors climb, suspect cable/backplane/controller.
Task 5: Check SMART health summary and the attributes that matter
cr0x@server:~$ sudo smartctl -a /dev/sdc | egrep -i 'SMART overall|Reallocated|Pending|Offline_Uncorrectable|UDMA_CRC|Power_On_Hours|Temperature'
SMART overall-health self-assessment test result: PASSED
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 12
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 7
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 074 074 000 Old_age Always - 22816
194 Temperature_Celsius 0x0022 110 099 000 Old_age Always - 40
What it means: SMART “PASSED” is not exoneration. You have 12 pending sectors and 7 offline uncorrectables.
That’s not a cable; that’s the drive failing to read its own media.
Decision: Replace the disk. Don’t waste time swapping cables first when the media is already admitting defeat.
Task 6: Look for UDMA CRC errors (classic cable/backplane evidence)
cr0x@server:~$ sudo smartctl -a /dev/sdb | egrep -i 'UDMA_CRC_Error_Count|Interface_CRC|CRC'
199 UDMA_CRC_Error_Count 0x003e 200 199 000 Old_age Always - 43
What it means: CRC errors increment when the link corrupts frames. Drives can’t fix that with reallocation.
A handful over years can happen, but a climbing count in a short period is a hardware path problem.
Decision: Reseat/replace the SATA/SAS cable for that bay, check backplane connectors, and ensure the drive is firmly seated.
After changes, verify the CRC counter stops increasing under load.
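To make "verify the CRC counter stops increasing" concrete, record the raw value before and after a stretch of load. A minimal sketch, assuming the drive reports attribute 199 the way most SATA drives do; the device node is carried over from the example:
# Capture the raw UDMA CRC count, run load (or a scrub) for a while, then compare.
before=$(sudo smartctl -A /dev/sdb | awk '/UDMA_CRC_Error_Count/ {print $NF}')
sleep 3600
after=$(sudo smartctl -A /dev/sdb | awk '/UDMA_CRC_Error_Count/ {print $NF}')
echo "UDMA CRC raw count: before=$before after=$after"
If the number still moves after you've swapped the cable, the cable wasn't the whole story.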
Task 7: Run a targeted SMART self-test to validate suspicion
cr0x@server:~$ sudo smartctl -t long /dev/sdc
Please wait 780 minutes for test to complete.
Test will complete after Wed Dec 25 15:10:22 2025
What it means: A long test forces the drive to walk the media. If it fails, you get concrete evidence.
Decision: If you can’t afford to wait, replace now. If you can, let it run and check results—especially useful when arguing with procurement.
Task 8: Read SMART self-test results (the “show your work” step)
cr0x@server:~$ sudo smartctl -a /dev/sdc | sed -n '/SMART Self-test log/,$p' | head -n 20
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 10% 22820 3912456672
# 2 Short offline Completed without error 00% 22816 -
What it means: “Completed: read failure” with an LBA is a smoking gun for media trouble.
Decision: Replace the disk, then scrub after resilver. If you’re in RAIDZ1 and already degraded, move fast and avoid heavy nonessential load.
Task 9: Identify whether errors cluster by controller / expander
cr0x@server:~$ lsscsi -t
[0:0:0:0] disk ata:WDC_WD80... /dev/sdb - sata:ahci
[0:0:1:0] disk ata:WDC_WD80... /dev/sdc - sata:ahci
[2:0:0:0] disk sas:SEAGATE... /dev/sdd - sas:phy0
[2:0:1:0] disk sas:SEAGATE... /dev/sde - sas:phy0
What it means: You have devices behind different hosts/phys. If all the “sas:phy0” disks show checksum errors together,
you stop blaming “random drives” and start blaming shared infrastructure.
Decision: Group errors by host/port. If the pattern matches a single HBA/expander, schedule maintenance on that component.
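Another way to see shared fate is to walk sysfs: each disk's device symlink resolves through the PCI and SCSI topology it is attached to, so disks behind the same HBA or expander share most of the path. A sketch; the exact path format varies by kernel and is an assumption here:
# Print the SCSI/PCI path each disk hangs off of.
for d in /sys/block/sd*; do
  printf '%s -> %s\n' "$(basename "$d")" "$(readlink -f "$d/device")"
done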
Task 10: Inspect HBA/driver events (resets are not a vibe)
cr0x@server:~$ sudo dmesg -T | egrep -i 'mpt2sas|mpt3sas|sas|reset|ioc|task abort' | tail -n 30
[Wed Dec 25 01:20:11 2025] mpt3sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[Wed Dec 25 01:20:12 2025] mpt3sas_cm0: sending diag reset !!
[Wed Dec 25 01:20:20 2025] mpt3sas_cm0: diag reset: SUCCESS
[Wed Dec 25 01:20:23 2025] sd 2:0:1:0: [sde] tag#91 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
What it means: The controller reset, and disks saw DID_RESET. ZFS may log read/checksum issues because the I/O stream got interrupted.
Decision: Treat this as controller/firmware/PCIe/power territory. Check firmware levels, PCIe errors, and thermal conditions.
If resets repeat, plan an HBA swap rather than “we’ll monitor.”
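Checking firmware and driver levels doesn't require vendor tools to get started: the mpt3sas driver logs its firmware and BIOS versions when it attaches, and some builds expose them in sysfs. A sketch, with the sysfs attribute name treated as an assumption since it depends on your driver version:
# Driver version as loaded.
modinfo mpt3sas | grep -i '^version'
# Firmware/BIOS versions reported at attach time (search the boot log).
sudo journalctl -k -b | grep -i -E 'mpt3sas.*(fwversion|biosversion|chiprevision)'
# Some builds also expose firmware info in sysfs.
grep -H . /sys/class/scsi_host/host*/version_fw 2>/dev/null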
Task 11: Confirm PCIe layer health (because HBAs sit on PCIe, not vibes)
cr0x@server:~$ sudo journalctl -k | egrep -i 'pcie|aer|corrected|uncorrected|fatal' | tail -n 20
Dec 25 01:20:10 server kernel: pcieport 0000:3b:00.0: AER: Corrected error received: 0000:3b:00.0
Dec 25 01:20:10 server kernel: pcieport 0000:3b:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Dec 25 01:20:10 server kernel: pcieport 0000:3b:00.0: AER: device [8086:2030] error status/mask=00000040/00002000
Dec 25 01:20:10 server kernel: pcieport 0000:3b:00.0: AER: [ 6] Bad TLP
What it means: PCIe AER “Bad TLP” is the platform admitting data link trouble. That can manifest as HBA resets and I/O errors.
Decision: Reseat the HBA, verify PCIe slot health, check BIOS settings, and consider moving the HBA to another slot.
If errors persist, suspect motherboard/risers.
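It's also worth looking at the link status of the slot the HBA sits in; a link that has downgraded its speed or width, or a pile of corrected errors, points at the slot/riser rather than the drives. A sketch using the PCI address from the log above, which is an assumption; substitute your HBA's address from lspci:
# Find the HBA's PCI address.
lspci | grep -i -E 'sas|lsi|broadcom|hba'
# Compare link capability vs. negotiated status, and look at AER status bits.
sudo lspci -vvv -s 3b:00.0 | grep -i -E 'lnkcap|lnksta|aer|uesta|cesta'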
Task 12: Clear errors only after the underlying issue is addressed
cr0x@server:~$ sudo zpool clear tank
cr0x@server:~$ sudo zpool status -xv
all pools are healthy
What it means: Counters are reset. Great for visibility, terrible if you did it before fixing anything.
Decision: Now apply load (normal workload or a scrub) and confirm counters stay at zero.
If they return, your “fix” was theater.
Task 13: Start a scrub and interpret what it finds
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Wed Dec 25 02:05:18 2025
1.20T scanned at 1.45G/s, 220G issued at 270M/s, 8.00T total
0B repaired, 2.75% done, 08:07:12 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
What it means: Scrub is clean so far and counters remain zero. This is what “fixed” looks like.
Decision: Let the scrub finish. If it completes with 0 errors and counters remain stable during peak workload,
you can close the incident with decent confidence.
Task 14: Replace a failed disk properly (and avoid the “wrong disk” incident)
cr0x@server:~$ sudo zpool offline tank ata-WDC_WD80...-part1
cr0x@server:~$ sudo zpool replace tank ata-WDC_WD80...-part1 /dev/disk/by-id/ata-WDC_WD80_NEW_DISK-part1
cr0x@server:~$ sudo zpool status tank
pool: tank
state: DEGRADED
scan: resilver in progress since Wed Dec 25 02:12:44 2025
620G scanned at 1.10G/s, 85.2G issued at 155M/s, 8.00T total
85.2G resilvered, 1.04% done, 09:11:33 to go
What it means: Resilver is underway. Degraded is expected until it finishes.
Decision: Monitor for new errors during resilver. If another disk starts erroring during resilver, stop and reassess risk:
you may be one bad day away from data loss (especially RAIDZ1).
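Before pulling anything, make the chassis agree with you about which bay holds the suspect. If your enclosure supports SES LED control and the ledmon package is installed (both assumptions), you can blink the locate LED:
# Blink the locate LED for the suspect, verify at the rack, then turn it off.
sudo ledctl locate=/dev/sdc
# ...confirm the blinking bay and the serial number on the drive label...
sudo ledctl locate_off=/dev/sdc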
Task 15: Verify ZFS event history (useful when the counter already got cleared)
cr0x@server:~$ sudo zpool events -v | tail -n 25
Dec 25 2025 01:44:08.123456789 ereport.fs.zfs.io
class = "ereport.fs.zfs.io"
ena = 0x7f0b2c00000001
detector = (embedded nvlist)
(snip)
vdev_path = "/dev/disk/by-id/ata-WDC_WD80...-part1"
vdev_guid = 1234567890123456789
io_error = 5
What it means: ZFS recorded I/O ereports, including which vdev path was implicated. Good for timeline reconstruction.
Decision: Use events to correlate with OS logs and hardware changes. If you see bursts of events aligned with controller resets, the controller is the suspect.
Joke #2: If your “quick fix” is rebooting the storage server, that’s not troubleshooting—it’s the IT version of turning the radio up.
Disk vs cable vs controller: pattern-matching that actually holds up
When it’s the disk
Drives fail in ways that are both predictable and maddening. The predictable part: media errors show up as pending sectors,
uncorrectables, and SMART self-tests that can’t complete. The maddening part: many drives “pass” SMART right up until they don’t,
because SMART is a vendor-defined set of thresholds, not a guarantee.
Disk-likely indicators:
- Current_Pending_Sector or Offline_Uncorrectable is non-zero and growing.
- Kernel logs show Medium Error, “Unrecovered read error,” or similar sense data for the same disk.
- SMART long test fails with a read failure and an LBA.
- ZFS errors stay localized to a single vdev leaf across time and across reboots.
Disk replacement is not a moral failing. It’s the job. The only real mistake is hesitating until a second component fails
while you’re still debating the first.
When it’s the cable or backplane
Cables and backplanes don’t get enough blame because they’re boring. They also don’t come with warranty dashboards.
But signal integrity is physics, and physics doesn’t care that your quarterly review is tomorrow.
Cable/backplane-likely indicators:
- UDMA_CRC_Error_Count increases (especially rapidly) while media attributes look clean.
- Kernel logs show link resets, COMRESET failed, “SATA link down,” or “device offlined” without clear medium errors.
- Errors move when you move the drive to another bay, or when you swap the cable.
- Multiple drives in the same backplane row/expander exhibit errors during vibration/temperature changes.
Practical advice: if you’re using SATA in a dense chassis, use high-quality latching cables, minimize bends, and treat backplanes as components
with a lifecycle. If you can’t get latching cables, you’re one accidental tug away from an incident.
When it’s the controller (HBA, expander, firmware, or PCIe)
Controllers fail in a unique way: they fail “systemically,” and they fail loudly in logs. They also fail in ways that look like drive failures,
because the drive is the messenger getting shot.
Controller-likely indicators:
- Kernel logs show mpt2sas/mpt3sas resets, task aborts, IOC resets, or repeated bus resets.
- Errors appear across multiple drives behind the same HBA/expander, especially within the same time window.
- PCIe AER errors correlate with I/O disruptions.
- Changing firmware/driver versions changes the symptom pattern (sometimes for better, sometimes for “new and exciting”).
Opinionated guidance: if you are running ZFS, your HBA should be in IT mode, on a stable firmware version you can reproduce,
and you should be able to identify exactly which drives hang off which ports. If you can’t, you’re operating blind.
The messy middle: mixed signals and secondary damage
Real incidents often involve more than one contributing factor:
- A marginal cable causes intermittent link resets.
- Those resets force drives into error recovery more often.
- A drive with weak sectors now has to retry reads under stress, and pending sectors start appearing.
That’s why you don’t stop at “it’s the disk” if CRC errors are skyrocketing, and you don’t stop at “it’s the cable” if SMART is reporting uncorrectables.
Sometimes the correct answer is: replace the disk and the cable, then watch for controller resets that started the whole chain.
Three corporate-world mini-stories (anonymized, painfully plausible)
1) The incident caused by a wrong assumption
A mid-sized company ran a ZFS-backed NFS farm for build artifacts. Nothing exotic: a couple of RAIDZ2 pools, decent drives,
and a habit of ignoring warnings until they became pages. One morning, developers complained that builds were failing to fetch artifacts.
The storage host showed read errors on one disk in a vdev.
The on-call engineer made a classic assumption: “READ errors mean the drive is bad.” They offlined the disk, walked to the rack,
and pulled the drive from the bay they believed matched /dev/sde. They replaced it, ran zpool replace,
and watched the resilver begin. It looked competent. It was not.
The problem was that the chassis had been re-cabled months earlier, and the physical bay mapping was never updated.
The drive they pulled wasn’t sde. It was a healthy sibling drive in the same vdev. The actual failing drive stayed online,
still flaking out, while the pool ran degraded during resilver. Then the failing drive threw more errors under resilver load and got kicked.
Now the vdev was in a worse state than when the day started.
They recovered because it was RAIDZ2, not RAIDZ1, and because ZFS kept enough consistent data to finish the resilver after a messy pause.
The postmortem was short and sharp: the failure wasn’t the drive; it was the assumption that device names correspond to bays without verification.
The fix wasn’t heroic. They added a procedure: always map by-id to the physical bay using enclosure tools or serial numbers,
and label the chassis. Boring. Correct. The kind of boring that keeps your weekends intact.
2) The optimization that backfired
Another organization had a storage-heavy analytics platform. They were proud of their throughput and always wanted more.
Someone noticed that scrubs were “wasting I/O” during business hours, so they reduced scrub frequency and tuned ZFS to be
less aggressive about background work. The dashboards looked nicer. For a while.
Months later, a drive failed during a routine workload spike. Replacement was straightforward, but the resilver was slow.
During resilver, a second drive hit an unrecoverable read error on a sector that hadn’t been touched in ages.
ZFS couldn’t repair the block because the remaining parity couldn’t cover the combination of failures.
The unpleasant truth: the second drive’s sector had been bad for a long time. A regular scrub would have found it while redundancy was intact,
and ZFS would have rewritten the bad sector from a good copy. By optimizing away scrubs, they optimized away early warning and self-healing.
They ended up restoring a slice of data from backups and reprocessing some analytics jobs. The business impact wasn’t existential,
but it was expensive and embarrassing. The engineer who proposed the change wasn’t careless; they were optimizing one metric
without understanding what it bought them in integrity and risk reduction.
The revised policy was blunt: scrub regularly, throttle scrubs if needed, but don’t stop doing them. If you must choose,
sacrifice a little performance to avoid discovering latent errors at the worst possible time.
3) The boring but correct practice that saved the day
A SaaS company ran ZFS on a fleet of storage nodes. The setup was intentionally plain: mirrored vdevs for the hot tier,
RAIDZ2 for the cold tier, ECC RAM everywhere, and a hard rule that HBAs and firmware were standardized.
The rule annoyed people because it slowed down “quick upgrades.”
One node started showing intermittent checksum errors across multiple drives in the same shelf.
The on-call followed the runbook: capture zpool status, capture SMART, capture kernel logs,
and check whether the errors correlated with a particular HBA port. They did not clear counters. They didn’t reboot.
They treated the system like evidence.
The logs showed repeated controller resets at the same timestamps as the checksum bursts.
SMART on the drives was clean—no pending sectors, no uncorrectables, CRC stable. They swapped a SAS cable and moved the shelf connection
to a spare HBA port. Errors stopped immediately, and the scrub completed cleanly.
The real “save” happened a week later. Another node exhibited similar symptoms, but this time the CRC counters climbed fast.
Because they had baseline metrics for CRC and resets, the deviation was obvious within minutes.
They swapped the cable before any pool went degraded.
Nothing about this is glamorous. The practice that saved them was consistency: standard firmware, ECC, routine scrubs,
and a habit of collecting the same evidence every time. Their incident report was short. Their sleep was long.
Common mistakes: symptom → root cause → fix
1) “ZFS shows errors; I cleared them and now it’s fine.”
Symptom: Errors reappear days later, sometimes worse.
Root cause: Clearing counters removed evidence without fixing the underlying hardware/path issue.
Fix: Capture zpool status -xv, logs, and SMART first. Clear only after remediation, then validate with scrub/load.
2) CHECKSUM errors blamed on the disk without checking CRC
Symptom: Multiple disks show checksum errors, often in bursts.
Root cause: Data corruption in transit (cable/backplane/HBA) rather than media defects.
Fix: Check UDMA_CRC_Error_Count, dmesg for link resets, and whether the errors cluster by HBA/expander. Replace/repair the shared component.
3) Treating SMART “PASSED” as “healthy”
Symptom: SMART says PASSED, but read errors keep happening and pending sectors exist.
Root cause: SMART overall health is threshold-based and often too forgiving.
Fix: Look at specific attributes (pending, uncorrectable, reallocated, CRC). Run long self-tests when possible.
4) Replacing the wrong drive
Symptom: After replacement, errors persist; sometimes the pool becomes more degraded.
Root cause: Device-to-bay mapping was guessed, not verified. Cabling changed; enumeration changed.
Fix: Use /dev/disk/by-id, serial numbers, and enclosure mapping tools. Label bays. Require a second-person check for production.
5) Running RAIDZ1 in places where rebuild risk is not acceptable
Symptom: During resilver, a second disk error causes data loss or unrecoverable blocks.
Root cause: Single-parity vdevs are fragile during rebuilds, especially with large drives and latent sector errors.
Fix: Prefer mirrors (fast resilver, simple failure domain) or RAIDZ2/3 for large capacity vdevs. Scrub regularly.
6) Ignoring controller resets as “noise”
Symptom: Intermittent I/O stalls, multiple disks briefly fault, errors come in waves.
Root cause: HBA firmware/driver instability, PCIe issues, overheating, or power events.
Fix: Check dmesg for resets, check PCIe AER, verify cooling, standardize firmware, and replace/relocate the HBA if needed.
7) Overloading the pool during a resilver
Symptom: Resilver takes forever; more errors appear; a second disk drops.
Root cause: Heavy random I/O + rebuild stress triggers marginal components.
Fix: Reduce workload if possible, schedule rebuilds, and avoid pushing latency-sensitive jobs during degraded states. Consider mirrors for hot tiers.
Checklists / step-by-step plan
Immediate containment checklist (first 30 minutes)
- Run zpool status -xv. Save output to the ticket.
- Confirm redundancy margin (mirror/RAIDZ level) and whether any vdev is already degraded.
- Check whether error counters are increasing (run zpool status twice, minutes apart).
- Capture OS logs around the incident window: dmesg -T and journalctl -k.
- Pull SMART from suspect disks (at least the ones with ZFS errors; ideally all in the vdev).
- If degraded: reduce nonessential load and avoid maintenance actions that add stress (like firmware updates) unless necessary.
Decision checklist: disk vs path
- Replace the disk now if pending/offline uncorrectable sectors are non-zero, or SMART long test fails.
- Fix the path first if CRC errors are increasing, link resets dominate logs, and SMART media indicators are clean.
- Suspect the controller/platform if multiple drives across one HBA show issues and you see controller resets or PCIe AER events.
- Do both if the disk shows media problems and the path shows CRC/resets. Mixed failures happen.
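To make that call with the same evidence every time, pull the handful of SMART attributes that matter for every member of the affected vdev in one pass. A minimal sketch; the device list is an assumption, so substitute your vdev's members:
# Media indicators (pending/uncorrectable/reallocated) vs. path indicator (CRC), side by side.
for d in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
  echo "== $d =="
  sudo smartctl -A "$d" | grep -i -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable|UDMA_CRC'
done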
Remediation plan (change one thing at a time)
- Identify the exact physical component (drive serial, bay, cable, HBA port).
- Replace/repair the most likely component with the smallest blast radius:
- Cable reseat/replace
- Move drive to another bay (if your enclosure allows and you can keep mapping correct)
- Replace drive
- Swap HBA or move to another PCIe slot
- Clear ZFS errors only after remediation.
- Run a scrub. Monitor errors during the scrub.
- Document the mapping and what was replaced (future you will be tired and grateful).
Validation checklist (the “prove it” phase)
- Scrub completes with 0 errors.
- zpool status shows no increasing READ/WRITE/CKSUM counters during peak load.
- SMART CRC counters stop increasing after cable/backplane remediation.
- No controller resets or PCIe AER errors in logs for at least one full business cycle.
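A small script that captures the same evidence after every remediation makes the "prove it" phase repeatable. A sketch under the assumptions that the pool is named tank and you know which disks to sample; adjust both:
#!/usr/bin/env bash
# Collect post-remediation evidence into one timestamped file for the ticket.
# Run as root (or via sudo).
set -eu
out="/var/tmp/zfs-validation-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "### zpool status"
  zpool status -v tank
  echo "### kernel messages (resets, AER, I/O errors)"
  dmesg -T | grep -i -E 'reset|timeout|aer|i/o error' | tail -n 50
  echo "### SMART key attributes"
  for d in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    echo "== $d =="
    smartctl -A "$d" | grep -i -E 'Reallocated|Pending|Uncorrectable|UDMA_CRC'
  done
} > "$out"
echo "evidence saved to $out"
Run it right after remediation and again after the scrub finishes; the diff between the two files is your closing argument.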
FAQ
1) Are ZFS read errors always a failing disk?
No. READ errors mean the block layer returned an error. That can be disk media, but also link resets, controller timeouts, or power.
Use SMART media attributes and OS logs to separate disk from path.
2) What’s the difference between READ errors and CHECKSUM errors in zpool status?
READ errors mean the device didn’t successfully return data. CHECKSUM errors mean the device returned data, but it didn’t match ZFS’s checksum,
pointing toward corruption in transit, firmware, or memory (less commonly).
3) If ZFS says “errors: No known data errors,” can I ignore it?
You can’t ignore it. That message means ZFS believes it repaired or avoided user-visible corruption. It does not mean the hardware is healthy.
Treat it as an early warning—your cheapest kind.
4) Should I run zpool clear right away?
Not right away. Capture evidence first. Clear after you’ve replaced/fixed something so you can see whether the problem returns.
5) How many UDMA CRC errors are “too many”?
A few over a long lifetime can happen due to transient events. A CRC counter that increases during normal operation—especially rapidly—means the path is unhealthy.
The trend matters more than the absolute value.
6) Can ECC RAM issues show up as checksum errors?
Yes, but it’s less common than cable/backplane/HBA issues in well-built systems. ECC usually corrects single-bit errors and logs them.
If you suspect memory, check kernel logs for ECC events and consider a memory test during a maintenance window.
7) Is RAIDZ1 safe with large disks?
It depends on your risk tolerance, workload, and operational discipline. In practice, large disks plus long rebuild times make RAIDZ1 brittle.
Mirrors or RAIDZ2 are safer choices for production where downtime or data loss is expensive.
8) Why do errors often show up during a scrub or resilver?
Because scrubs and resilvers read every allocated block. They touch cold data that hasn’t been read in months.
That’s exactly when latent sector errors, weak cables, and marginal controllers get exposed.
9) If I swap a cable, how do I confirm it fixed the issue?
Clear ZFS errors after the swap, then run a scrub and observe. Also watch SMART CRC counters: they should stop increasing.
If link resets continue in logs, the problem isn’t solved.
10) Can a backplane cause checksum errors without dropping drives?
Yes. A marginal backplane can corrupt traffic or trigger intermittent retries without fully disconnecting a disk.
That often shows up as checksum errors, occasional link resets, and CRC counters that creep upward.
Next steps (what to do after you stop the bleeding)
Once the pool is stable and the scrub is clean, don’t waste the pain. Convert it into posture:
- Standardize your drive mapping. Labels, by-id usage, and a documented bay map reduce human error more than any tool.
- Baseline SMART and log signals. CRC counters, pending sectors, controller resets, and PCIe AER should be graphed or at least periodically reviewed.
- Scrub on a schedule. Tune for impact, but don’t skip. Scrubs are your integrity exercise, not an optional hobby (a minimal cron sketch follows this list).
- Keep firmware boring. Stable HBA firmware/driver combos beat “latest” in production unless you have a specific fix you need.
- Design for rebuilds. Mirrors or RAIDZ2/3, plus realistic rebuild windows, turn disk failures into maintenance instead of incidents.
- Practice replacements. Run through a disk replacement in a lab. The worst time to learn your chassis quirks is during a degraded vdev.
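For the scrub schedule, a plain cron entry is enough if your distro doesn't already ship a scrub timer or cron job (several do, so check first). The pool name, cadence, and zpool path below are assumptions; pick what your workload tolerates:
# /etc/cron.d/zfs-scrub -- monthly scrub, 1st of the month at 02:00.
0 2 1 * * root /usr/sbin/zpool scrub tank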
The most reliable storage systems aren’t the ones that never fail. They’re the ones where failure looks routine: clear evidence, quick isolation,
one component replaced, scrub, done. That’s the bar. Set it, and your future read errors become an errand instead of a crisis.