It’s 02:13. Your alert says “ZFS: checksum errors increasing.” Nothing is down—yet. The dashboard looks “fine”—which is exactly how storage failures like to introduce themselves. You’re staring at zpool status wondering if you should replace a disk, reseat a cable, or stop touching things before you make it worse.
ZFS checksum errors are not a vague omen. They’re a precise statement: “Data was read, but what came back doesn’t match what ZFS knows it wrote.” That can be a dying drive. It can also be a flaky SAS expander, a firmware bug, a bad HBA, or RAM doing interpretive dance. The trick is to decide quickly which one—without accidentally destroying your best evidence.
What checksum errors actually mean (and what they don’t)
When ZFS writes a block, it computes a checksum and stores it in metadata (not next to the data block in the same place where a single failure could corrupt both). Later, when ZFS reads that block, it recomputes the checksum and compares it to what it expects.
A checksum error means: the bytes returned by the storage path are different from what ZFS previously wrote. ZFS successfully read something—just not the right thing.
A checksum error does not automatically mean: “the disk is bad.” It means “the data is bad somewhere between platters/NAND and ZFS.” That “somewhere” includes:
- Drive media or controller (classic failure)
- SATA/SAS cable, connector, backplane, expander (depressingly common)
- HBA firmware/driver issues (especially around resets and timeouts)
- Power problems (brownouts, marginal PSUs, loose power connector)
- RAM corruption (ECC helps; non-ECC makes the story… spicier)
- CPU/IMC instability from overclocking or undervolting (yes, servers too)
If redundancy exists (mirror/RAIDZ) and ZFS can fetch a good copy from another device, it will repair the bad data during a scrub or read. That’s the part people love about ZFS. The part they ignore is the implication: if you keep seeing checksum errors, the system is repeatedly receiving corrupt data. The repair is a bandage, not a cure.
Also, don’t confuse these three counters:
- READ errors: the device couldn’t read data (I/O error). That’s usually a disk, cable, or HBA problem.
- WRITE errors: ZFS couldn’t write (I/O error) or the device rejected a write. Also often hardware.
- CKSUM errors: data came back, but it was wrong. This is where “bad disk” competes with “bad path.”
Here’s the opinionated rule: treat checksum errors as a production incident until you prove they’re benign. They often start as “a few.” They rarely stay that way.
Joke #1 (short, relevant): ZFS checksum errors are like a smoke alarm that can also tell you which room is burning. You still shouldn’t ignore it because the kitchen “seems fine.”
Why ZFS catches what other filesystems miss
ZFS is end-to-end checksummed. That phrase gets thrown around, but the operational consequence is simple: ZFS distrusts your hardware by default. Most stacks historically trusted disks to either return correct data or return an error. In reality, devices sometimes return wrong data without raising an I/O error. That’s “silent corruption,” and it’s why checksum errors exist as a class.
ZFS also knows how to fix what it detects—as long as you gave it redundancy. A mirror can read from the other side. RAIDZ can reconstruct using parity. Without redundancy, ZFS can only say “this block is wrong” and then you get the least fun kind of error: the honest one.
Operationally, ZFS’s pickiness is a gift. It turns “mysterious app bug” into “storage returned corrupted bytes.” But it also makes you do the work: you must decide whether the source is the drive, the bus, memory, or software.
One more practical note: checksum errors can show up long after the original corruption occurred. The bad data might sit unread for weeks. Then a scrub (or a backup job) touches it and ZFS finally complains. The error is real; the timing is just rude.
Fast diagnosis playbook (first/second/third)
First: determine scope and whether ZFS can self-heal
- Is the pool DEGRADED, or is it just reporting errors?
- Are errors increasing right now, or are they historical?
- Are they confined to one device, one vdev, or spread?
- Is there redundancy (mirror/RAIDZ) that can repair?
Second: classify the failure mode by pattern
- Single disk, rising CKSUM with clean cabling/backplane history: suspect disk or its slot.
- Multiple disks on same HBA/backplane show CKSUM: suspect cable/backplane/expander/HBA.
- CKSUM with no READ/WRITE and weird system instability: suspect RAM/CPU/firmware.
- Errors only during scrub/resilver: suspect marginal hardware under sustained load (disk, cable, expander, HBA, power).
Third: gather evidence before you “fix” anything
- Capture zpool status -v, zpool events -v, and system logs around the timestamps.
- Record device identifiers: the /dev/disk/by-id mapping matters more than sdX.
- Pull SMART / NVMe health data and error logs.
- If multiple devices show errors, map them to a physical chain (HBA port → expander → backplane slot).
Then—and only then—start replacing parts or re-seating cables. Otherwise you erase the trail and keep the mystery alive for the next on-call.
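If you want that evidence capture to be one muscle-memory step, here is a minimal sketch. It assumes the pool is named tank and that /var/tmp is an acceptable scratch location; adjust both for your environment, and run it with root privileges so the journal and event log are complete.
#!/usr/bin/env bash
# Minimal evidence-capture sketch: snapshot ZFS and kernel state before changing anything.
# Assumptions: pool name "tank", writable /var/tmp, run as root.
set -euo pipefail
POOL="tank"
OUT="/var/tmp/zfs-evidence-$(date +%Y%m%dT%H%M%S)"
mkdir -p "$OUT"
zpool status -v "$POOL"              > "$OUT/zpool-status.txt"
zpool events -v                      > "$OUT/zpool-events.txt"
zpool list -v "$POOL"                > "$OUT/zpool-list.txt"
ls -l /dev/disk/by-id/               > "$OUT/disk-by-id.txt"
journalctl -k --since "2 hours ago"  > "$OUT/kernel-log.txt"
echo "Evidence saved to $OUT"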
Hands-on tasks: commands, outputs, and decisions
The goal here isn’t to run commands for comfort. It’s to extract a decision. Each task includes: the command, what the output means, and what you do next.
Task 1: Get the current truth from ZFS
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
scan: scrub repaired 128K in 0 days 00:12:31 with 0 errors on Tue Dec 24 01:10:18 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ9KABC ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ9KXYZ ONLINE 0 0 14
errors: No known data errors
Meaning: One device shows 14 checksum errors, but ZFS repaired data during scrub and there are “No known data errors.” That implies redundancy worked.
Decision: Treat the device path as suspect. Investigate hardware, but you’re not in immediate “restore from backup” territory.
Task 2: Confirm whether errors are still increasing
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ9KABC ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ9KXYZ ONLINE 0 0 14
Meaning: Same number as before. That’s good: it suggests a historical event rather than an ongoing corruption stream.
Decision: Continue evidence gathering. If the count climbs during load or scrub, escalate to physical troubleshooting and likely replacement.
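If "check again later" keeps slipping, a small sketch like this logs the per-device counters every minute so the trend lives on disk rather than in someone's memory. It assumes the pool is named tank, that your device names start with ata-, nvme-, scsi-, or wwn- as in the by-id listing, and that the default zpool status column layout applies (the awk field positions rely on it).
# Sketch: append timestamped per-device READ/WRITE/CKSUM counters to a log file.
# Stop with Ctrl-C once you have enough data points to see a trend.
while true; do
  date -Is
  zpool status -p tank | awk '$1 ~ /^(ata-|nvme-|scsi-|wwn-)/ {print $1, "READ="$3, "WRITE="$4, "CKSUM="$5}'
  sleep 60
done | tee -a /var/tmp/tank-cksum-trend.log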
Task 3: Pull device identity and avoid the sdX trap
cr0x@server:~$ ls -l /dev/disk/by-id/ | egrep 'WDC_WD80EFAX|ata-WDC'
lrwxrwxrwx 1 root root 9 Dec 25 01:22 ata-WDC_WD80EFAX-68LHPN0_VKJ9KABC -> ../../sdb
lrwxrwxrwx 1 root root 9 Dec 25 01:22 ata-WDC_WD80EFAX-68LHPN0_VKJ9KXYZ -> ../../sdc
Meaning: You have stable identifiers mapping to kernel device names sdb/sdc. The kernel names can change across reboots; the by-id names generally don’t.
Decision: Use by-id paths in replacements and documentation. If you pull the wrong drive, ZFS will remember.
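To make that mapping explicit (and pasteable into a ticket), a short sketch: it resolves each by-id symlink to its kernel name, SCSI address (host:channel:target:lun), and serial, which you will want later when correlating errors to an HBA port. The glob matches the drive model from this example; adjust it for your hardware.
# Sketch: by-id name -> kernel name, SCSI address, serial. Skips partition symlinks.
for dev in /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_*; do
  case "$dev" in *-part*) continue ;; esac
  printf '%s -> ' "${dev##*/}"
  lsblk -ndo NAME,HCTL,SERIAL "$(readlink -f "$dev")"
done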
Task 4: Check ZFS events for the timeline
cr0x@server:~$ zpool events -v | tail -n 20
Dec 24 01:06:12.900760000 sysevent.fs.zfs.checksum
pool = tank
vdev_path = /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ9KXYZ
vdev_guid = 1234567890123456789
vdev_state = 0
vdev_read_errors = 0
vdev_write_errors = 0
vdev_checksum_errors = 14
Dec 24 01:10:18.112233000 sysevent.fs.zfs.scrub_finish
pool = tank
scrub_repaired = 131072
scrub_errors = 0
Meaning: The checksum event occurred during scrub. It repaired 128K and finished with zero uncorrected errors.
Decision: Investigate what changed around that time: cabling work, firmware updates, power events, heavy I/O, or a controller reset.
Task 5: Look for controller resets and link errors in system logs
cr0x@server:~$ sudo journalctl -k --since "2025-12-24 00:45" --until "2025-12-24 01:30" | egrep -i "ata|sas|scsi|reset|link|error" | tail -n 20
Dec 24 01:05:44 server kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
Dec 24 01:05:44 server kernel: ata7.00: irq_stat 0x08000000, interface fatal error
Dec 24 01:05:44 server kernel: ata7: SError: { UnrecovData Handshk }
Dec 24 01:05:45 server kernel: ata7: hard resetting link
Dec 24 01:05:50 server kernel: ata7: link is slow to respond, please be patient (ready=0)
Dec 24 01:05:51 server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 24 01:05:51 server kernel: sd 6:0:0:0: [sdc] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=7s
Meaning: The kernel saw a SATA link error and reset. That’s a red flag for cabling/backplane/drive interface—especially if it repeats.
Decision: If you see link resets tied to the same bay, start with physical layer: reseat/replace cable, move the drive to another slot, or swap the breakout cable.
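A quick way to see whether one port is the repeat offender is to count reset events per ATA link over a recent window. A sketch, assuming journald holds enough history on this host:
# Sketch: count link resets per ATA port in the last 24 hours. One hot port points at a
# single bay/cable; resets spread across many ports point at the HBA or power.
sudo journalctl -k --since "1 day ago" \
  | egrep -io 'ata[0-9]+(\.[0-9]+)?: (hard|soft) resetting link' \
  | sort | uniq -c | sort -rn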
Task 6: Check SMART health (SATA/SAS via smartctl)
cr0x@server:~$ sudo smartctl -a /dev/sdc | egrep -i "Reallocated_Sector_Ct|Reported_Uncorrect|UDMA_CRC_Error_Count|Current_Pending_Sector|Offline_Uncorrectable|SMART overall"
SMART overall-health self-assessment test result: PASSED
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 23
Meaning: CRC errors (UDMA_CRC_Error_Count) point to data corruption in transit—cable/backplane—not media defects. Reallocated/pending sectors are zero, which makes a “dying platter” diagnosis weaker.
Decision: Fix the transport first: cable/backplane/slot. If CRC continues to climb after physical remediation, then you consider drive or controller.
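When more than one bay is in play, checking drives one at a time gets tedious. Here is a sketch that pulls just the disk-vs-cable attributes from every SATA member; the by-id glob is an example, so adjust it for your drives.
# Sketch: the SMART attributes that separate "dying media" from "bad transport",
# for every SATA disk under /dev/disk/by-id. Requires smartmontools.
for dev in /dev/disk/by-id/ata-*; do
  case "$dev" in *-part*) continue ;; esac
  echo "== ${dev##*/}"
  sudo smartctl -A "$dev" | egrep -i 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count'
done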
Task 7: Run a short SMART self-test and read results
cr0x@server:~$ sudo smartctl -t short /dev/sdc
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
cr0x@server:~$ sudo smartctl -a /dev/sdc | egrep -i "Self-test execution status|SMART Self-test log" -A5
Self-test execution status: ( 0) The previous self-test routine completed without error.
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 9123 -
Meaning: The drive passed its own test. That does not exonerate the drive, but it shifts suspicion toward cabling/backplane/controller if checksum errors persist.
Decision: If ZFS errors keep accumulating and SMART is clean except CRCs, stop blaming the disk and start swapping cables/slots.
Task 8: For NVMe, check SMART and error logs (nvme-cli)
cr0x@server:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0x00
temperature : 41 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 3%
media_errors : 0
num_err_log_entries : 2
Meaning: NVMe reports media_errors and error log entries. Media errors are serious; a small number of error log entries might be benign depending on type, but it’s evidence.
Decision: If media_errors > 0 or the error log shows repeated data integrity issues, plan replacement. NVMe tends to fail fast and without nostalgia.
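Since num_err_log_entries is non-zero here, it is worth actually reading those entries rather than guessing at them. A sketch with nvme-cli; the device name is the one from this example.
# Sketch: dump the controller's error log entries so you can see whether they are
# integrity-related or benign (for example, aborted commands during shutdown).
sudo nvme error-log /dev/nvme0 | head -n 40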
Task 9: Identify which vdev type you’re dealing with (affects urgency)
cr0x@server:~$ zpool list -v tank
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 14.5T 9.12T 5.38T - - 21% 62% 1.00x ONLINE -
mirror-0 7.25T 4.56T 2.69T - - 19% 62% - ONLINE
mirror-1 7.25T 4.56T 2.69T - - 23% 62% - ONLINE
Meaning: Mirrors give you fast resilvers and simple self-healing (read the other side). RAIDZ reconstructs from parity, so rebuilds take longer and stress every remaining member, which changes the risk profile.
Decision: If you’re on RAIDZ with large drives, be more aggressive about early remediation because the resilver/scrub window is larger and exposes marginal components longer.
Task 10: Scrub intentionally, not as a reflex
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Wed Dec 25 01:40:02 2025
1.27T scanned at 3.10G/s, 912G issued at 2.23G/s, 9.12T total
0B repaired, 9.76% done, 0:59:31 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD80EFAX-... ONLINE 0 0 0
ata-WDC_WD80EFAX-... ONLINE 0 0 14
errors: No known data errors
Meaning: Scrub stresses the entire path. If errors increase during scrub, you’re reproducing the issue (good for diagnosis, bad for fragile hardware).
Decision: Run scrubs when you can observe and capture logs. Don’t start a scrub and go to bed unless you enjoy surprise pages.
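One way to scrub with eyes open is to record the scan progress and the kernel log for the same window, so a transport error and a CKSUM bump land in the same evidence set. A sketch, assuming pool tank and /var/tmp as the drop zone; run it with root privileges.
# Sketch: start a scrub, poll progress once a minute, and capture the kernel log
# for exactly the scrub window afterwards.
START="$(date '+%Y-%m-%d %H:%M:%S')"
zpool scrub tank
while zpool status tank | grep -q 'scrub in progress'; do
  zpool status tank | grep -E 'scan:|scanned|repaired'
  sleep 60
done
zpool status -v tank           > /var/tmp/scrub-after-status.txt
journalctl -k --since "$START" > /var/tmp/scrub-kernel-window.txt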
Task 11: Clear errors only after remediation
cr0x@server:~$ sudo zpool clear tank
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD80EFAX-... ONLINE 0 0 0
ata-WDC_WD80EFAX-... ONLINE 0 0 0
errors: No known data errors
Meaning: Counters reset. This does not fix anything. It just makes the dashboard quiet.
Decision: Clear after you’ve reseated cables, swapped slots, replaced a disk, updated firmware, or otherwise changed the world. Then watch if errors recur.
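A small habit that preserves the trend data: save the "before" picture, then clear, then re-check. A sketch; the output path is just an example.
# Sketch: archive counters before resetting them, so any recurrence is measurable.
zpool status -v tank > "/var/tmp/tank-pre-clear-$(date +%Y%m%dT%H%M%S).txt"
sudo zpool clear tank
zpool status tank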
Task 12: Replace a disk the ZFS way (mirror example)
cr0x@server:~$ sudo zpool replace tank ata-WDC_WD80EFAX-68LHPN0_VKJ9KXYZ /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_NEWDRIVE
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
scan: resilver in progress since Wed Dec 25 02:05:11 2025
612G scanned at 1.45G/s, 210G issued at 512M/s, 9.12T total
210G resilvered, 2.25% done, 4:58:40 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ9KABC ONLINE 0 0 0
replacing-1 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ9KXYZ ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_NEWDRIVE ONLINE 0 0 0
errors: No known data errors
Meaning: ZFS is resilvering. If checksum errors occur during resilver, that’s diagnostic: your path is unstable under sustained I/O.
Decision: Monitor journalctl -k alongside. If link resets or transport errors appear, pause and fix cabling/HBA before finishing replacement.
Task 13: Use zdb only when you’re ready to be precise
cr0x@server:~$ sudo zdb -bbbbbb tank 2>/dev/null | head -n 20
Dataset tank [ZPL], ID 50, cr_txg 4, 1.62G, 1021 objects, rootbp DVA[0]=<0:1c3f4a8000:200> ...
Meaning: zdb is for deep forensics, not casual poking. It can help identify which blocks are affected when you have “known data errors.”
Decision: Use it if you’re facing uncorrectable errors and need to decide what data to restore, or to validate assumptions about metadata vs file data corruption.
Task 14: Check memory health signals (ECC and kernel reports)
cr0x@server:~$ sudo journalctl -k | egrep -i "mce|edac|ecc|machine check" | tail -n 20
Dec 25 00:12:08 server kernel: EDAC MC0: 1 CE memory error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x12345 offset:0x0 grain:32 syndrome:0x0)
Meaning: Corrected ECC errors (CE) mean RAM is misbehaving but being corrected. That can still correlate with checksum errors and instability.
Decision: Treat repeated ECC events as a hardware incident. Schedule DIMM replacement and stop blaming disks for what RAM is doing upstream.
Task 15: Inspect ZFS properties that affect corruption visibility
cr0x@server:~$ zfs get checksum,compression,recordsize,atime tank
NAME PROPERTY VALUE SOURCE
tank checksum on default
tank compression lz4 local
tank recordsize 128K local
tank atime off local
Meaning: Checksumming is on (good). Compression doesn’t cause checksum errors, but it changes I/O patterns and can expose weak hardware during heavy reads/writes.
Decision: Don’t “fix” checksum errors by toggling dataset properties. Fix the hardware path. ZFS is the messenger, not the villain.
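If you inherit pools from people with creative pasts, it is also worth confirming nobody turned checksums off on a child dataset. A sketch; it prints offenders only, so no output is the good outcome.
# Sketch: list any dataset in the pool with checksumming disabled.
zfs get -r -o name,value checksum tank | awk '$2 == "off"'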
Task 16: Map physical topology (SAS example) to find shared components
cr0x@server:~$ sudo sas2ircu 0 DISPLAY | egrep -i "Enclosure|Slot|Device is a Hard disk|SAS Address|State" -A2
Enclosure# : 1
Slot# : 4
Device is a Hard disk
State : Optimal
Slot# : 5
Device is a Hard disk
State : Optimal
Meaning: When multiple affected disks share an enclosure/backplane/expander, topology mapping turns “random errors” into “single point of failure.”
Decision: If errors correlate by enclosure/port, prioritize the shared component (cable, expander, backplane, HBA) over drive swaps.
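If you do not have a vendor tool like sas2ircu handy, sysfs alone can show which physical chain each disk hangs off. A minimal sketch:
# Sketch: print the PCI/HBA/expander path behind each kernel disk; disks that share a
# failing component will share most of this path.
for d in /sys/block/sd*; do
  printf '%s -> %s\n' "${d##*/}" "$(readlink -f "$d/device")"
done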
Interpreting patterns: disk vs cable vs RAM vs firmware
Pattern A: One device, checksum errors climb slowly, SMART shows reallocations/pending sectors
Likely culprit: the drive. If reallocated sectors increase or pending sectors appear, the media is failing or the drive can’t reliably read old data.
What to do: replace the drive. Don’t negotiate. In mirrored vdevs you have breathing room; in RAIDZ you have math, not mercy.
Pattern B: One device, checksum errors + SMART UDMA CRC errors
Likely culprit: transport layer. CRC errors are often cable/backplane/connector issues. With SAS expanders, it can be a marginal lane.
What to do: reseat and replace the cable; move the drive to a different bay; if using SATA in a server backplane, check fit and strain. Then clear errors and observe.
Pattern C: Multiple devices across one HBA/backplane show checksum errors
Likely culprit: a shared component, such as the HBA, expander, backplane, power, or a firmware bug. ZFS is telling you corruption is systemic, not a single-disk tragedy.
What to do: correlate affected drives to a specific port chain. Swap the SAS cable. Update HBA firmware to a known-stable version. Check power rails and backplane seating.
Pattern D: Checksum errors with no disk-level errors, plus kernel MCE/EDAC reports
Likely culprit: memory/CPU platform instability. ZFS checksums protect on-disk integrity, but bad RAM can corrupt data before it is ever written (so ZFS faithfully checksums garbage) or after it is read back, outside the on-disk checksum's protection.
What to do: treat as a hardware platform issue. Verify ECC is enabled. Run a memory test in maintenance. Stop overclocking. Replace DIMMs with recurring corrected errors.
Pattern E: Errors appear only after firmware updates or kernel upgrades
Likely culprit: driver/firmware interactions, queue depth behavior, power management, or a regression. It’s not glamorous, but it happens.
What to do: roll back if possible, or move forward to a fixed release. Keep the evidence (events, logs). Don’t “fix” it by suppressing errors.
One quote, because operations is mostly about humility: “Hope is not a strategy.”
— General Gordon R. Sullivan
Joke #2 (short, relevant): Clearing ZFS errors without fixing the cause is like turning off the check-engine light by removing the bulb. It does reduce dashboard noise.
Three corporate mini-stories (painfully familiar)
Mini-story 1: The incident caused by a wrong assumption
The team inherited a storage node that “had a few checksum errors for months.” The previous admin’s note said it was a “known bad disk” and “not urgent because RAIDZ.” That sounded plausible, so it lived at the bottom of the backlog with all the other plausible things.
Then a routine scrub started on Sunday morning. Halfway through, checksum errors began climbing on two different drives in the same vdev. The on-call replaced the “known bad disk” first, expecting the errors to stop. They didn’t. In fact, during resilver, a third drive started logging transport resets. Now the pool was degraded and furious.
The wrong assumption was subtle: they assumed checksum errors behave like SMART reallocated sectors—drive-specific and local. But ZFS checksum errors are often path-specific. After a frantic hour of swapping disks, someone finally checked kernel logs and noticed repeated link resets on the same SAS port.
The root cause ended up being a marginal SAS breakout cable that had been bent into a shape that looked great in photos and terrible in physics. The cable worked under light load and failed under sustained scrub/resilver patterns. Once replaced, the “bad disk” stopped being bad. The team learned a lesson: a checksum error is evidence of corruption, not a verdict on a specific drive.
Mini-story 2: The optimization that backfired
A different company had a performance initiative: reduce scrub time and backup windows. They increased parallelism in their backup jobs and tuned the storage stack for throughput. Scrubs finished faster, dashboards looked greener, and everyone congratulated the spreadsheet.
A month later, checksum errors began appearing during scrubs across multiple pools—mostly on systems with older expanders. Reads were now more aggressive and more concurrent. The expanders started to show their age: occasional link flaps, retries, and subtle corruption events that ZFS caught as checksum mismatches.
The optimization didn’t “cause” corruption out of nothing. It turned a marginal system into a reproducible failure. Previously, the environment never sustained the I/O pattern long enough to expose the weakness. Now it did, on a schedule, helpfully at peak business hours.
The fix wasn’t to dial performance back forever. They moved scrubs to a quieter window, reduced concurrency during scrubs, updated expander firmware where safe, and replaced the worst hardware. But the bigger fix was cultural: performance tuning now required a reliability test plan, not just a throughput chart.
Mini-story 3: The boring but correct practice that saved the day
In one enterprise, the storage team had a habit that seemed bureaucratic: every quarter, they ran a scrub, exported the last scrub report, and archived zpool status -v outputs alongside kernel logs from the window. They also labeled drive bays with the by-id serials. It was dull. It was also exactly what you want during a real incident.
One night, checksum errors spiked on a mirror member. Because they had baseline artifacts, they knew two things immediately: the drive had never shown CRC errors before, and the bay had a history of link resets two years ago with a different disk. Same bay. Different drive. Same symptom family.
They swapped the bay’s cable during a maintenance window, cleared errors, and ran a controlled scrub while watching logs. No new errors. They left the drive in place and kept it under observation. If they had followed the default instinct—replace the disk—they would have wasted time, money, and probably triggered more disruption.
The executive summary for management was boring too: “We fixed a faulty interconnect and validated integrity.” No drama, no heroics, just a system that kept its promises. That’s the job.
Common mistakes: symptom → root cause → fix
1) “Checksum errors are rising, so replace the disk immediately”
Symptom: CKSUM increments, but SMART shows rising UDMA CRC errors and no reallocations.
Root cause: cable/backplane/connector corruption, not media failure.
Fix: replace/reseat cables, move the disk to another bay, check backplane connectors; then scrub and observe.
2) “We cleared errors and they’re gone, so it’s fixed”
Symptom: Someone ran zpool clear; dashboard is quiet for a week.
Root cause: counters reset; underlying cause remains. You just lost trend data.
Fix: treat zpool clear as a post-remediation step with follow-up scrubs and monitoring.
3) “Errors appeared after a scrub, so the scrub caused corruption”
Symptom: First time anyone noticed errors was during scrub.
Root cause: scrub read the data and discovered existing corruption or stressed marginal hardware into showing its flaws.
Fix: keep scrubs; change when/how you scrub; fix weak hardware and verify with controlled scrubs.
4) “The pool is ONLINE, so data is fine”
Symptom: Pool shows ONLINE but device CKSUM counts are non-zero.
Root cause: redundancy masked corruption, but corruption still occurred.
Fix: investigate, remediate, and confirm no new errors under load. ONLINE is not a clean bill of health.
5) “Multiple disks have checksum errors, so we had multiple disk failures”
Symptom: Several drives across the same chassis show CKSUM increments within the same hour.
Root cause: shared component (HBA, expander, backplane, PSU) misbehaving.
Fix: map topology; target the common point; check logs for resets; swap cables and test.
6) “ZFS is too sensitive; turn off checksums”
Symptom: Someone suggests disabling checksumming or ignoring events.
Root cause: misunderstanding: ZFS is detecting real corruption your stack previously missed.
Fix: keep checksumming; fix the hardware path; improve monitoring and maintenance windows.
7) “ECC corrected errors are harmless”
Symptom: EDAC logs show corrected errors; storage sees occasional checksum issues.
Root cause: corrected errors can be an early warning. Uncorrected errors may follow, and stability may already be compromised.
Fix: schedule DIMM replacement, check seating, update BIOS/microcode where appropriate, and validate with stress tests.
8) “We can keep a degraded pool running while we troubleshoot for days”
Symptom: A RAIDZ vdev is degraded and someone wants to wait for procurement.
Root cause: extended exposure: one more failure and you’re restoring from backup.
Fix: prioritize restoration of redundancy immediately—spares, temporary replacements, or moving workloads.
Checklists / step-by-step plan
Step-by-step triage (do this in order)
- Capture state: save zpool status -v output and timestamp it. Don't clear anything yet.
- Check if errors are correctable: look for "errors: No known data errors" vs listed file paths/objects.
- Check trend: run zpool status again after load or after 10–30 minutes. Is CKSUM increasing?
- Check events: use zpool events -v to tie errors to scrub/resilver/normal reads.
- Check kernel logs: look for resets, link errors, timeouts during the same window.
- Check SMART/NVMe health: specifically CRC, reallocated/pending, media errors, error logs.
- Map topology: are affected drives sharing a port/backplane/expander? If yes, suspect the shared component.
- Remediate physically: reseat/replace cable, move disk bays, swap HBA port, check power connections.
- Clear errors: only after a change, then run a controlled scrub or read workload while observing.
- Replace hardware if needed: drive, cable, expander, HBA, DIMM—based on the evidence.
- Verify: complete scrub with zero new errors; verify logs are quiet; monitor counters for a week.
- Document: what changed, what evidence supported it, and what “normal” now looks like.
When to replace the drive (my blunt criteria)
- SMART shows increasing reallocated sectors, pending sectors, or offline uncorrectables.
- NVMe reports non-zero media_errors or repeated integrity-related error log entries.
- Checksum errors keep increasing on the same drive after you’ve swapped cables/ports and verified logs are clean.
- The drive is the outlier in a mirror (other side clean) and the error pattern follows the drive when moved to a new bay.
When to suspect cabling/backplane/expander first
- SMART UDMA CRC errors rise.
- Kernel logs show link resets, PHY resets, or transport errors.
- Multiple drives in the same enclosure/port chain show issues.
- Errors show up during sustained sequential reads (scrub/resilver) more than random workload.
When to suspect RAM/CPU/platform stability
- EDAC/MCE logs show corrected or uncorrected memory errors.
- Checksum errors appear across devices with no consistent physical topology.
- You see inexplicable crashes, panics, or data issues outside ZFS as well.
- The system runs non-ECC RAM for workloads that actually matter.
Interesting facts and historical context (the kind you only learn after being burned)
- Fact 1: ZFS originated at Sun Microsystems in the early 2000s with the explicit goal of end-to-end data integrity, not just “fast filesystem.”
- Fact 2: Traditional storage stacks often relied on the disk to detect bad reads via sector-level ECC; silent corruption can bypass that and still return “success.”
- Fact 3: ZFS stores checksums in metadata rather than alongside the data block, reducing the chance that one bad write corrupts both data and its checksum.
- Fact 4: Scrubs aren’t “maintenance theater.” They’re a systematic read of the pool to surface latent errors before your restore job discovers them.
- Fact 5: The rise of very large disks made long rebuild/resilver windows normal, which increased the importance of detecting and repairing silent corruption early.
- Fact 6: SATA CRC errors have been a classic “it’s not the disk” clue for decades; they implicate signal integrity and connectors more than media.
- Fact 7: Some of the nastiest checksum-error incidents involve expanders/backplanes because they create correlated failures across many drives at once.
- Fact 8: ZFS can repair corrupted data transparently on redundant vdevs, but it will still increment error counters—because you deserve to know it happened.
- Fact 9: “ONLINE” pool state can coexist with serious underlying issues; ZFS prioritizes availability while reporting integrity problems so you can act.
FAQ
1) Are checksum errors always a sign of data loss?
No. If you have redundancy and ZFS reports “No known data errors,” ZFS likely repaired the bad blocks. It’s still a sign of corruption somewhere, but not necessarily lost application data.
2) What’s the difference between READ errors and CKSUM errors in zpool status?
READ errors mean the device couldn’t deliver data (I/O failure). CKSUM errors mean it delivered data that didn’t match the expected checksum. READ screams “can’t read.” CKSUM whispers “I read something, but it’s wrong.”
3) Should I run a scrub immediately when I see checksum errors?
Usually yes—if you can watch it. A scrub can both repair and reproduce the issue under load, which is great for diagnosis. Don’t start it blind in the middle of a risky window.
4) Is it safe to run zpool clear?
It’s safe in the sense that it doesn’t delete data. It is unsafe operationally if you use it to hide evidence. Clear after you remediate something, so you can detect recurrence.
5) I have checksum errors on a single mirror disk. Can I ignore them?
You can postpone replacement if the count is stable and evidence points to a one-time event (like a transient link reset). You should not ignore recurring increases. Mirrors are forgiving; they’re not magical.
6) Can bad RAM cause ZFS checksum errors?
Yes. Bad RAM can corrupt data in memory before it’s written (so ZFS faithfully checksums corrupted data), or corrupt data after it’s read. ECC reduces risk and gives you evidence via EDAC/MCE logs.
7) Why do checksum errors often show up during scrub/resilver?
Because those operations read a lot of data and stress the entire I/O path continuously. Marginal cables, flaky expanders, and borderline drives hate sustained, predictable work.
8) If SMART says “PASSED,” why does ZFS report errors?
SMART “PASSED” is not a clean bill of health; it’s a minimal threshold check. ZFS is measuring actual integrity outcomes. Believe the system that validates data end-to-end.
9) What if zpool status -v lists files with errors?
That means ZFS has known data errors it could not repair. Your decision tree changes: identify impacted datasets, restore from backup/snapshot where possible, and address hardware immediately.
10) Do checksum errors mean my HBA is bad?
Sometimes. If errors correlate across multiple drives attached to the same HBA port or appear with controller resets/timeouts in logs, the HBA (or its firmware/driver) moves up the suspect list fast.
Practical next steps
Checksum errors are ZFS doing its job: catching corruption that the rest of the stack would happily serve to your applications with a smile. Your job is to turn “CKSUM=14” into a concrete fix.
- Collect evidence first: zpool status -v, zpool events -v, and kernel logs around the time of the errors.
- Classify the pattern: single device vs multiple devices, scrub-only vs anytime, CRC vs reallocations, topology correlation.
- Remediate the most likely root cause (often cabling/backplane) before swapping drives blindly.
- Clear errors only after a change, then run a controlled scrub while watching logs.
- If errors recur, replace the component that the evidence points to—drive, cable, expander, HBA, or DIMM—and verify with a clean scrub.
If you take one operational lesson from all this: don’t confuse ZFS reporting an integrity problem with ZFS being the problem. It’s the witness. Treat it like one.