You run a Proxmox box. ZFS has been boring (the best kind of storage). Then one morning you see it: CKSUM incrementing in zpool status. The pool is still online, the VMs are still booting, and your brain is already bargaining: “It’s probably just a scrub artifact.”
No. ZFS checksum errors are ZFS doing its job: telling you it read bytes that didn’t match what it previously wrote. Your job is to prove where the corruption occurred: on the platter/NAND, in the drive electronics, in the cable, in the backplane, or in the HBA. Guessing is how you buy the wrong part twice.
What ZFS checksum errors actually mean (and what they don’t)
ZFS stores checksums for blocks and verifies them when reading. If the checksum doesn’t match, ZFS calls it a checksum error. That is a statement about the data when it was read, not a confession from the disk. The disk can return “success” while returning wrong bytes; ZFS notices anyway.
Here’s the important nuance:
- Checksum errors are end-to-end data-integrity failures. The corruption could be on-disk media, drive firmware, controller, cable, backplane, expander, RAM, DMA, or kernel/driver issues.
- ZFS can often repair them automatically if redundancy exists (mirror/RAIDZ) and the bad copy can be replaced by a good copy. That’s why you may see checksum errors without application-visible failures—until you don’t.
- They are not the same as read errors. A read error is “I couldn’t read the sector.” A checksum error is “I read something, but it’s not what you previously wrote.”
If you run mirrors or RAIDZ, ZFS can tell you which vdev member delivered the bad data. That’s your starting point. But “which device delivered bad bytes” isn’t always “which device is defective.” Cables and HBAs lie in between and they love intermittent failure.
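If you want more than counters, OpenZFS also keeps an event log you can query. A minimal sketch, assuming OpenZFS 2.x as shipped with Proxmox (the exact fields in the verbose payload vary a little between versions):

# Checksum problems are recorded as ereport.fs.zfs.checksum events
sudo zpool events | grep -i checksum

# Verbose mode dumps each event's payload, including the vdev path/guid
# that delivered the bad bytes -- useful when counters alone feel ambiguous
sudo zpool events -v | less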
One quote to keep in your pocket: “The only real reliability is the reliability you measure.” — John Allspaw (paraphrased idea)
Fast diagnosis playbook (first/second/third)
If you’re in production, you don’t have time to become a philosopher about entropy. You need a fast path to likely causes, then you gather evidence until you can justify a part swap.
First: confirm the scope and whether ZFS repaired anything
- Is the pool degraded, or just showing historical errors?
- Are errors increasing right now, or stuck at a single number?
- Does zpool status -v list file paths (metadata) or just counts?
Second: correlate ZFS errors with kernel transport errors
- Look for ata link resets, I/O error, frozen commands, SAS phy reset, or SCSI sense data around the same times as scrubs/resilvers.
- If you see transport instability, your “bad disk” might actually be a cable, backplane slot, or HBA port.
Third: use SMART/NVMe logs to separate media failure from transport failure
- Media failure shows up as reallocated/pending/uncorrectable sectors (HDD) or media/data integrity errors (NVMe).
- Transport failure shows up as CRC errors (SATA), link resets/timeouts (SAS/SATA), but can leave SMART media counters clean.
Then you decide: swap cable/slot first (cheap, fast) or swap disk first (also fast, but riskier if it’s actually the slot and you just “infect” the replacement).
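If you want the first two steps in a single pass, a small wrapper saves typing at 2 a.m. This is a sketch, not gospel: the pool name tank and device sdb come from the examples below, and you should adjust both.

#!/usr/bin/env bash
# zfs-triage.sh -- pool state, recent transport noise, and SMART highlights in one shot.
# Usage: sudo ./zfs-triage.sh tank sdb
set -euo pipefail

POOL="${1:-tank}"   # pool to inspect
DEV="${2:-sdb}"     # kernel name of the suspect member

echo "== zpool status =="
zpool status -v "$POOL"

echo "== kernel transport errors, last 24h =="
journalctl -k --since "24 hours ago" \
  | grep -Ei "$DEV|ata[0-9]+|link reset|frozen|I/O error" || true

echo "== SMART highlights for /dev/$DEV =="
smartctl -a "/dev/$DEV" \
  | grep -Ei "Serial Number|overall|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count" || true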
Interesting facts & context (because history repeats)
- ZFS’s end-to-end checksumming was a reaction to “silent data corruption” in traditional stacks where the filesystem trusts the disk blindly.
- ATA UDMA CRC error counters exist because even in the 1990s, engineers accepted that cables and connectors are failure devices, not accessories.
- Enterprise SAS backplanes introduced expanders and more connectors—more flexibility, more places for marginal signal integrity to hide.
- Consumer SATA cables vary wildly in quality; some are essentially decorative, especially at 6 Gb/s with tight tolerances.
- ZFS scrubs are intentionally boring: they read everything and verify checksums, which is why they tend to “discover” weak links first.
- Write caching lies are older than your career: a disk/controller can acknowledge a write and later lose it. ZFS checksums catch the aftermath on read.
- CMR vs SMR HDD behavior can create long I/O latency spikes that look like timeouts; timeouts can trigger resets that look like “bad cable,” even when the disk is just overloaded.
- ECC RAM matters because ZFS uses RAM aggressively; corruption in RAM can become “checksum errors” later. It’s rarer than cables, but not mythical.
An evidence framework: where corruption can happen
When ZFS shows checksum errors for /dev/sdX, you need to decide whether to accuse:
- The drive media (surface/NAND cells)
- The drive electronics/firmware
- The transport (SATA cable, SAS cable, connectors)
- The backplane/expander (slot, traces, expander chip)
- The HBA/controller (port, firmware, PCIe errors)
- The host (RAM, PCIe issues, power, kernel driver)
The trick is to avoid the classic failure of logic: “ZFS blamed the disk, so the disk is bad.” ZFS doesn’t have a Ouija board. It has a device node and whatever that path delivered at read time.
What “disk is bad” usually looks like: SMART shows reallocated/pending/uncorrectable; errors persist even when moved to a new port/cable; ZFS errors continue on the same physical drive across moves.
What “cable/slot is bad” usually looks like: SMART media counters clean, but UDMA CRC errors climb (SATA) or kernel logs show link resets; replacing/moving the cable/slot makes the errors stop, even with the same drive.
Joke #1 (short, relevant): A flaky SATA cable is the kind of “high availability” you didn’t ask for—it fails only when you’re watching.
Practical tasks (commands, outputs, decisions)
These are the tasks I actually run on Proxmox/Debian hosts. Each one includes what you’re looking for and what decision it drives. Do them in order if you can; skip around if production is on fire.
Task 1 — Capture a clean baseline: pool status with details
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
scan: scrub repaired 384K in 02:11:43 with 0 errors on Sun Dec 22 03:12:11 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD80EFZX-68... ONLINE 0 0 3
ata-WDC_WD80EFZX-68... ONLINE 0 0 0
errors: No known data errors
What it means: ZFS saw checksum mismatches on one device and corrected them using redundancy (mirror). The pool is still online.
Decision: Treat this as a real integrity signal. Start correlation work before you clear counters. Don’t replace anything yet unless errors are climbing fast or the pool is degraded.
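Before you touch anything else, save what you're looking at. A minimal sketch; the /root/zfs-evidence path and the 48-hour window are arbitrary choices, not requirements:

# Capture the current state somewhere the next shift won't overwrite it.
TS=$(date +%Y%m%d-%H%M%S)
DIR="/root/zfs-evidence/$TS"
mkdir -p "$DIR"
zpool status -v tank                  > "$DIR/zpool-status.txt"
zpool events -v                       > "$DIR/zpool-events.txt"
journalctl -k --since "48 hours ago"  > "$DIR/kernel.log"
echo "Evidence saved to $DIR"

Run it as root (or prefix the commands with sudo); it pairs naturally with Task 8, where you clear counters only after this exists.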
Task 2 — Identify the exact device path ZFS is using
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep WD80EFZX
lrwxrwxrwx 1 root root 9 Dec 26 09:20 ata-WDC_WD80EFZX-68... -> ../../sdb
lrwxrwxrwx 1 root root 9 Dec 26 09:20 ata-WDC_WD80EFZX-68... -> ../../sdc
What it means: You can map ZFS member names to /dev/sdX for SMART and kernel log correlation.
Decision: Work with /dev/disk/by-id names in ZFS commands (stable) and map to /dev/sdX only for diagnostics.
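If you'd rather not eyeball symlinks one grep at a time, a short loop prints the whole mapping. A sketch; it simply walks /dev/disk/by-id and skips partition links:

# Print "by-id name -> kernel name" for whole ATA/NVMe disks.
for link in /dev/disk/by-id/ata-* /dev/disk/by-id/nvme-*; do
  [ -e "$link" ] || continue
  case "$link" in *-part*) continue ;; esac   # skip partition symlinks
  printf '%s -> %s\n' "$(basename "$link")" "$(basename "$(readlink -f "$link")")"
done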
Task 3 — Pull recent kernel errors for that disk (SATA/SAS transport tells on itself)
cr0x@server:~$ sudo journalctl -k --since "24 hours ago" | egrep -i "sdb|ata|scsi|sas|link reset|I/O error" | tail -n 30
Dec 26 02:41:12 server kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0x6 frozen
Dec 26 02:41:12 server kernel: ata6: SError: { CommWake DevExch PhyRdyChg }
Dec 26 02:41:12 server kernel: ata6.00: failed command: READ FPDMA QUEUED
Dec 26 02:41:12 server kernel: ata6: hard resetting link
Dec 26 02:41:13 server kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 26 02:41:15 server kernel: sd 5:0:0:0: [sdb] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Dec 26 02:41:15 server kernel: sd 5:0:0:0: [sdb] Sense Key : Medium Error [current]
Dec 26 02:41:15 server kernel: sd 5:0:0:0: [sdb] Add. Sense: Unrecovered read error
What it means: You’re seeing link resets and a read error. That can be a drive, but transport instability is strongly suggested by repeated resets/frozen commands.
Decision: If you see lots of link resets and SError noise, plan a cable/slot move test. But also check SMART for real media errors, because both can be true.
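The kernel talks about ata6 while ZFS talks about sdb, and guessing which port is which is how greps go wrong. Sysfs can map one to the other, assuming the disk sits behind libata (SATA); SAS drives behind an HBA won't show an ataN component:

# Which libata port does /dev/sdb hang off? The sysfs device path contains "ataN".
readlink -f /sys/block/sdb/device | grep -Eo 'ata[0-9]+' | head -n1

Use the result (for example ata6) as an extra pattern in the journalctl grep so you catch port-level resets even when the sdX name isn't in the message.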
Task 4 — Check SMART health and the counters that separate “media” from “transport” (HDD/SATA)
cr0x@server:~$ sudo smartctl -a /dev/sdb | egrep -i "Serial Number|SMART overall|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count"
Serial Number: WD-ABC123XYZ
SMART overall-health self-assessment test result: PASSED
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 199 000 Old_age Always - 12
What it means: Media counters are clean (0), but CRC errors exist. CRC errors are classic cable/connector/backplane signal issues for SATA.
Decision: Replace the SATA cable or move the drive to another backplane slot/port first. Don’t RMA a perfectly healthy drive because a $3 cable is having a personality.
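Marginal cabling rarely limits itself to one bay, so once you're checking one SATA drive it's cheap to survey them all. A sketch; /dev/sd? only covers sda through sdz, and vendors format SMART attributes differently, so treat this as a quick survey rather than a parser:

# One summary block per SATA disk: media counters plus the CRC counter.
for d in /dev/sd?; do
  echo "== $d =="
  sudo smartctl -A "$d" \
    | grep -Ei "Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count"
done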
Task 5 — For NVMe, check error logs and media errors (different vocabulary)
cr0x@server:~$ sudo nvme smart-log /dev/nvme0 | egrep -i "critical_warning|media_errors|num_err_log_entries|temperature|percentage_used"
critical_warning : 0x00
temperature : 42 C
percentage_used : 3%
media_errors : 0
num_err_log_entries : 0
What it means: NVMe isn’t reporting media errors. That doesn’t fully exonerate it, but it reduces the chance of actual NAND decay.
Decision: If ZFS checksum errors show up on NVMe with clean media stats, you look harder at PCIe errors, power, firmware, and the motherboard slot.
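It's also worth pulling the controller's own error log, not just the smart-log summary. A sketch, assuming nvme-cli is installed (it usually is on Proxmox hosts):

# Persistent error log entries; an empty or all-zero list is good news.
sudo nvme error-log /dev/nvme0

# JSON output is easier to archive and diff alongside your other evidence.
sudo nvme smart-log /dev/nvme0 -o json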
Task 6 — Check PCIe/AER errors (especially for HBAs and NVMe)
cr0x@server:~$ sudo journalctl -k --since "7 days ago" | egrep -i "AER|pcie|Corrected error|Uncorrected" | tail -n 40
Dec 25 11:03:18 server kernel: pcieport 0000:00:1c.0: AER: Corrected error received: 0000:03:00.0
Dec 25 11:03:18 server kernel: pcieport 0000:00:1c.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer
Dec 25 11:03:18 server kernel: pcieport 0000:00:1c.0: AER: [ 0] RxErr
What it means: Physical layer errors can manifest as storage weirdness. “Corrected” doesn’t mean “harmless”; it means “we noticed and fixed it, until we don’t.”
Decision: If these correlate with ZFS errors, consider reseating the HBA/NVMe, trying another slot, updating firmware, and checking power/thermals.
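On reasonably recent kernels, per-device AER counters are also exposed in sysfs, which is easier to trend than grepping the journal. A sketch, with the caveat that these files only exist where the kernel and platform surface AER (absence proves nothing):

# Per-device corrected-error counters, where the platform exposes them.
for f in /sys/bus/pci/devices/*/aer_dev_correctable; do
  [ -r "$f" ] || continue
  echo "== $(basename "$(dirname "$f")") =="
  cat "$f"
done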
Task 7 — Run a ZFS scrub and watch whether checksum errors climb during sustained reads
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ watch -n 30 "zpool status tank"
Every 30.0s: zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Thu Dec 26 09:31:12 2025
1.32T scanned at 682M/s, 920G issued at 475M/s, 6.12T total
0B repaired, 0.00% done, 03:45:12 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD80EFZX-68... ONLINE 0 0 5
ata-WDC_WD80EFZX-68... ONLINE 0 0 0
What it means: The CKSUM count is increasing during scrub. That’s active failure, not historical residue.
Decision: Stop treating this like housekeeping. Plan immediate mitigation: reduce load, prepare a replacement path, and start isolating the transport (swap cable/slot) or the disk (replace) based on evidence.
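watch is fine for staring; a timestamped log is better for correlating against the journal afterwards. A minimal sketch; the five-minute interval and the log path are arbitrary:

# Append a timestamped snapshot of per-device error counters during the scrub.
# Stop with Ctrl-C once the scrub finishes.
while true; do
  {
    date -Is
    zpool status -v tank | grep -E "ONLINE|DEGRADED|FAULTED"
  } >> /root/scrub-cksum-trend.log
  sleep 300
done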
Task 8 — Clear counters only after you’ve captured evidence
cr0x@server:~$ sudo zpool clear tank
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 02:11:43 with 0 errors on Thu Dec 26 12:05:01 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD80EFZX-68... ONLINE 0 0 0
ata-WDC_WD80EFZX-68... ONLINE 0 0 0
What it means: Counters reset. Great for tracking recurrence, terrible if you did it before collecting logs.
Decision: Only clear after saving zpool status -v and relevant journalctl slices. If errors return quickly, you’ve got proof of an ongoing issue.
Task 9 — Check SMART self-tests (they’re not perfect, but they’re admissible evidence)
cr0x@server:~$ sudo smartctl -t long /dev/sdb
Please wait 728 minutes for test to complete.
cr0x@server:~$ sudo smartctl -a /dev/sdb | egrep -i "Self-test execution|# 1|error"
Self-test execution status: ( 121) The previous self-test completed having the read element of the test failed
# 1 Extended offline Completed: read failure 90% 54123 12345678
What it means: A long test hit a read failure. That’s a strong indicator of disk/media trouble, not just a cable. (Cables can still cause read failures, but SMART recording read failure tends to implicate the drive.)
Decision: If long tests fail, prioritize replacing the disk. Still check cabling if CRC errors are also climbing, because a bad cable can make a good disk look guilty.
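If long self-tests are going to be part of your evidence, schedule them so the first one you ever run isn't during an incident. smartd can do this; a sketch of an /etc/smartd.conf line, assuming Debian's smartmontools packaging and that local mail to root actually goes somewhere:

# /etc/smartd.conf (excerpt)
# Monitor /dev/sdb, mail root on failures, run a short self-test daily at 02:00
# and a long self-test every Sunday at 03:00.
/dev/sdb -a -m root -s (S/../.././02|L/../../7/03)

Restart the smartd service after editing, and stagger the schedules across drives so long tests don't collide with your scrub window.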
Task 10 — For SATA, watch UDMA_CRC_Error_Count over time (it should not grow)
cr0x@server:~$ sudo smartctl -A /dev/sdb | awk '$1==199 || /UDMA_CRC_Error_Count/ {print}'
199 UDMA_CRC_Error_Count 0x003e 200 199 000 Old_age Always - 12
cr0x@server:~$ sleep 60
cr0x@server:~$ sudo smartctl -A /dev/sdb | awk '$1==199 || /UDMA_CRC_Error_Count/ {print}'
199 UDMA_CRC_Error_Count 0x003e 200 199 000 Old_age Always - 14
What it means: CRC errors increased while the system was running. That’s almost never “disk media.” That’s signal integrity: cable, connector, or backplane.
Decision: Replace cable, reseat connectors, change port/slot. After the change, the CRC count should stop increasing. It won’t reset, but it should go flat.
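To prove the "goes flat" part, log the counter with timestamps across the swap. A minimal sketch; the one-minute interval and log path are arbitrary, and smartctl needs root:

# Log the raw value of UDMA_CRC_Error_Count (attribute 199) for /dev/sdb once a minute.
# A flat line after the cable swap is your before/after proof.
while true; do
  printf '%s ' "$(date -Is)"
  smartctl -A /dev/sdb | awk '$1 == 199 {print $NF}'
  sleep 60
done >> /root/sdb-crc-trend.log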
Task 11 — Map the disk to a physical slot (so you don’t swap the wrong thing at 2 a.m.)
cr0x@server:~$ sudo lsblk -o NAME,SERIAL,MODEL,SIZE,WWN,HCTL
NAME SERIAL MODEL SIZE WWN HCTL
sdb WD-ABC123XYZ WDC WD80EFZX 7.3T 0x50014ee2b1c2d3e4 5:0:0:0
sdc WD-DEF456UVW WDC WD80EFZX 7.3T 0x50014ee2b1c2d3e5 6:0:0:0
What it means: You have serials and HCTL. With HBAs/backplanes, HCTL helps map to a port/slot. With SAS, you can go deeper using SAS addresses.
Decision: Put labels on trays/cables. If you can’t map physical to logical reliably, you’ll eventually replace the healthy disk and keep the broken one.
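If your backplane wires up locate LEDs, blink the bay before you pull anything. One way is ledctl from the ledmon package; this assumes the package is installed and the enclosure actually supports SGPIO/SES LED control:

# Blink the locate LED on the suspect drive's bay...
sudo ledctl locate=/dev/sdb
# ...and turn it off once the tray is labeled.
sudo ledctl locate_off=/dev/sdb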
Task 12 — Check SAS topology and link events (for SAS HBA/backplanes)
cr0x@server:~$ sudo sas2ircu 0 DISPLAY | egrep -i "Enclosure|Slot|Serial|PHY|Link Rate|Error"
Enclosure# : 1
Slot # : 4
Serial No : WD-ABC123XYZ
PHY Error Count : 19
Negotiated Link Rate : 6.0 Gbps
What it means: PHY errors and link negotiation issues implicate the transport or expander/backplane path. A single slot showing errors while others are clean is a bright arrow pointing at the slot/cable.
Decision: Move the drive to a different slot on the same enclosure/backplane. If errors follow the drive, it’s the drive. If errors stay with the slot, it’s the slot/backplane/cabling/HBA lane.
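If your HBA isn't one that sas2ircu speaks to, lsscsi gives a vendor-neutral view of the transport addresses behind each sdX. A sketch, assuming the lsscsi package is installed:

# Show each SCSI device with its transport address and generic (sg) node.
lsscsi -t -g

Matching the SAS address printed here against your enclosure/slot map is a decent fallback when the controller utility is missing.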
Task 13 — Check for SATA error statistics in sysfs (useful when SMART is noisy or incomplete)
cr0x@server:~$ sudo grep -R . /sys/class/ata_link/link*/sata_spd 2>/dev/null | head
/sys/class/ata_link/link5/sata_spd:6.0 Gbps
/sys/class/ata_link/link6/sata_spd:6.0 Gbps
What it means: You can confirm link speed and sometimes spot downshifts after resets. A link repeatedly dropping from 6.0 to 3.0 Gbps under load is suspicious.
Decision: If downshifts correlate with errors, treat it like cabling/backplane integrity. Replace or reseat, then verify stability.
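The sata_spd files tell you speed per link but not which disk sits on which link. You can join the two through sysfs; a sketch that assumes plain libata/SATA (no port multipliers; SAS, USB, and virtio devices are skipped):

# For every SATA disk, print its ata port and the negotiated link speed.
for blk in /sys/block/sd*; do
  dev=$(basename "$blk")
  port=$(readlink -f "$blk/device" | grep -Eo 'ata[0-9]+' | head -n1)
  [ -n "$port" ] || continue                      # not a libata device
  spd="/sys/class/ata_link/link${port#ata}/sata_spd"
  [ -r "$spd" ] && echo "$dev on $port: $(cat "$spd")"
done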
Task 14 — Stress-read the suspect disk (carefully) and watch logs
cr0x@server:~$ sudo dd if=/dev/sdb of=/dev/null bs=8M status=progress
4831838208 bytes (4.8 GB, 4.5 GiB) copied, 25 s, 193 MB/s
10603200512 bytes (11 GB, 9.9 GiB) copied, 55 s, 193 MB/s
^C
cr0x@server:~$ sudo journalctl -k --since "5 minutes ago" | egrep -i "sdb|ata|reset|I/O error" | tail -n 20
Dec 26 09:52:41 server kernel: ata6: hard resetting link
Dec 26 09:52:45 server kernel: ata6.00: configured for UDMA/133
Dec 26 09:52:46 server kernel: blk_update_request: I/O error, dev sdb, sector 123456789 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
What it means: Under sustained reads, the link reset and I/O errors recur. That’s not “ZFS being dramatic.” It’s transport or disk stability breaking under load.
Decision: If dd reliably triggers link resets, you can reproduce the fault. Reproduce once before and once after swapping the cable/slot. That gives you “proof,” not vibes.
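To keep the before/after comparison honest, run the same bounded read and capture the same log window each time. A sketch; the 64 GiB read size, file paths, and labels are arbitrary choices:

#!/usr/bin/env bash
# repro-read.sh -- bounded sequential read plus a log capture, for before/after
# comparisons around a cable or slot swap. Usage: sudo ./repro-read.sh sdb before-swap
set -euo pipefail
DEV="${1:?device, e.g. sdb}"
TAG="${2:?label, e.g. before-swap}"

START=$(date '+%Y-%m-%d %H:%M:%S')
# Read only the first 64 GiB: enough to stress the link, bounded in time.
dd if="/dev/$DEV" of=/dev/null bs=8M count=8192 status=progress

journalctl -k --since "$START" \
  | grep -Ei "$DEV|ata[0-9]+|reset|I/O error" > "/root/repro-$DEV-$TAG.log" || true
echo "Log saved to /root/repro-$DEV-$TAG.log"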
Task 15 — Verify memory integrity (rare, but catastrophic when it’s the cause)
cr0x@server:~$ sudo dmesg -T | egrep -i "EDAC|MCE|Machine check|memory error" | tail -n 20
[Thu Dec 26 07:11:32 2025] EDAC MC0: 1 CE memory scrubbing error on CPU#0Channel#1_DIMM#0
What it means: Correctable ECC errors are being logged. Not an immediate indictment, but it means the platform is not pristine. Non-ECC systems won’t log this; they’ll just corrupt and smile.
Decision: If you see MCE/EDAC activity and ZFS checksum errors across multiple vdevs/devices, expand scope to RAM/CPU/motherboard before replacing a pile of “bad disks.”
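If rasdaemon is installed, it keeps a queryable history of memory errors, which beats scraping dmesg after a reboot has eaten the buffer. A sketch, assuming the rasdaemon package (edac-utils is an older alternative):

# Per-DIMM corrected/uncorrected error counts recorded by rasdaemon.
sudo ras-mc-ctl --error-count

# Full records with timestamps and DIMM labels where the platform maps them.
sudo ras-mc-ctl --errors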
Task 16 — When you decide to replace, do it the ZFS way (and keep evidence)
cr0x@server:~$ sudo zpool replace tank ata-WDC_WD80EFZX-68... /dev/disk/by-id/ata-WDC_WD80EFZX-NEW123
cr0x@server:~$ watch -n 30 "zpool status tank"
Every 30.0s: zpool status tank
pool: tank
state: ONLINE
scan: resilver in progress since Thu Dec 26 10:12:40 2025
1.88T scanned at 512M/s, 1.02T issued at 278M/s, 6.12T total
1.02T resilvered, 16.66% done, 05:12:11 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
replacing-0 ONLINE 0 0 0
ata-WDC_WD80EFZX-68... ONLINE 0 0 0
ata-WDC_WD80EFZX-NEW123 ONLINE 0 0 0
ata-WDC_WD80EFZX-68... ONLINE 0 0 0
What it means: Resilver is rebuilding redundancy. Watch for new checksum errors during resilver; resilver is a stress test that often flushes out marginal cables.
Decision: If checksum errors appear on the new disk immediately, stop blaming disks. Start blaming the port, cable, expander lane, or HBA.
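When the resilver finishes, close the loop before you retire the old disk: scrub, wait, and confirm the counters stayed at zero. A minimal sketch, assuming OpenZFS 2.x as shipped with current Proxmox (older versions lack zpool wait, so poll zpool status instead):

# Post-fix verification: scrub, block until it completes, then inspect counters.
sudo zpool scrub tank
sudo zpool wait -t scrub tank
sudo zpool status -v tank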
Three corporate-world mini-stories (how this goes sideways)
Mini-story #1: The incident caused by a wrong assumption
A mid-size company ran Proxmox on a handful of commodity servers. Mirrored ZFS boot pools, RAIDZ for VM storage. One host started showing checksum errors on a single disk. The on-call engineer did what everyone does under stress: they replaced the disk. Errors stopped. Everyone went back to sleep.
Two weeks later, the checksum errors returned—this time on the replacement disk. That triggered the “bad batch of drives” hypothesis. Another replacement followed. Same story: quiet for a bit, then errors again.
The wrong assumption was subtle: “If the error shows on the disk, the disk is the cause.” They never checked the SATA CRC counters. They never looked at dmesg for link resets. The physical slot was shared with a particular cable that had been kinked hard during a rushed maintenance window.
When they finally swapped the cable and moved the drive to a different backplane position, the CRC error count stopped climbing immediately. The original “bad” drives, tested elsewhere, were fine. The company had paid for three disks and lost a night of sleep because they treated a transport issue like a media issue.
The lesson that stuck: ZFS told the truth (“bad bytes arrived”), but it did not tell the whole truth (“where they went bad”). You have to interrogate the path.
Mini-story #2: The optimization that backfired
A large internal platform team wanted faster rebuilds and better VM performance. They added an HBA with more ports and moved from direct SATA to a SAS expander backplane to simplify cabling. The rack looked cleaner. The inventory spreadsheet looked cleaner. Everyone loves clean.
Then scrub times became unpredictable. Some scrubs ran at full speed; others crawled. A few hosts showed intermittent checksum errors, always “corrected,” never “fatal.” That’s the most dangerous kind of storage problem because it trains people to ignore it.
The “optimization” was that they used mixed-quality mini-SAS cables and routed them tightly around power distribution. The expander plus longer cable runs made signal margins smaller. Under peak I/O (scrubs plus VM snapshots), the path occasionally glitched. ZFS did what it does: it detected the mismatch, fetched good data from parity/mirror, and healed. The platform looked fine—until one day a second disk in the same vdev went offline during a resilver and the window for recovery got narrow and sweaty.
They fixed it by boring means: higher-quality cables, less aggressive routing, and moving the HBA to a slot with fewer PCIe corrected errors. Performance stayed good. More importantly, scrubs became consistent and checksum errors stopped appearing like seasonal allergies.
Mini-story #3: The boring but correct practice that saved the day
A finance-adjacent team ran Proxmox with ZFS mirrors for critical databases. Their environment wasn’t fancy, but they had a habit: monthly scheduled scrubs, and an internal runbook requiring “scrub results + SMART deltas” to be reviewed.
One month, the scrub report showed two checksum errors on one mirror member. Not hundreds. Just two. SMART looked clean except for a CRC counter that had increased by one. Nothing else. No tickets from users. The temptation was to shrug.
The runbook forced the next step: clear counters, swap the cable, rerun a scrub within 48 hours, and compare. After the cable swap, CRC stopped increasing. Scrub completed clean. They kept the disk in place.
Three months later, the same host had a power event (UPS maintenance gone wrong). Disks handled it, but that old cable would likely have made recovery uglier. Instead, the pool came up clean. The team didn’t “save the day” with heroics. They saved it by being dull on a schedule.
Common mistakes: symptom → root cause → fix
This is where most ZFS checksum threads go to die: vague symptoms and cargo-cult part swaps. Here are specific patterns that show up on Proxmox hosts.
1) CKSUM increases during scrub, SMART media looks perfect
- Symptom: zpool status shows CKSUM climbing; Reallocated/Pending/Uncorrectable are zero.
- Likely root cause: SATA cable or backplane connector causing CRC errors; or SAS PHY errors.
- Fix: Replace cable, reseat both ends, move to a different port/slot, then rerun scrub and confirm CRC/PHY errors stop increasing.
2) One disk shows CKSUM, then replacement disk shows CKSUM in the same bay
- Symptom: You replaced the disk; errors return on the new one.
- Likely root cause: Bad bay/backplane lane, expander port, or HBA port.
- Fix: Keep the new disk, move it to a different bay; move another known-good disk into the suspect bay as a test if you can. If errors stay with the bay, repair/replace the backplane or change HBA port.
3) CKSUM errors across multiple disks at once
- Symptom: Several devices in different vdevs show CKSUM increases around the same time.
- Likely root cause: Systemic: HBA issue, PCIe errors, power instability, RAM/MCE, or a backplane/expander shared path.
- Fix: Look at AER/MCE logs, HBA firmware, PSU/cabling, and shared expander paths. Don’t shotgun-replace multiple disks without narrowing the shared component.
4) CKSUM count is non-zero but never increases
- Symptom: Historical CKSUM count remains constant across weeks.
- Likely root cause: Past transient event: power loss, a one-off cable wiggle, a one-time controller hiccup.
- Fix: Clear counters after capturing evidence; monitor. If it returns, treat as active. If it doesn’t, log it and move on.
5) “No known data errors” but everyone panics anyway
- Symptom: ZFS says it repaired errors; no files listed.
- Likely root cause: Redundancy worked. But it still indicates underlying reliability problems.
- Fix: Investigate transport/media. Don’t ignore it. This is your early warning system doing you a favor.
6) Errors show up after you changed recordsize, compression, or enabled some “performance” knob
- Symptom: Timing suggests config caused corruption.
- Likely root cause: Not the ZFS knob. More often it increased I/O pressure and exposed a marginal cable/drive that only fails under sustained throughput.
- Fix: Reproduce under load; check transport errors; fix the marginal component. Then decide whether the knob is still a good idea for your workload.
Joke #2 (short, relevant): Storage failures are like meetings—if you don’t take notes, you’ll repeat them.
Checklists / step-by-step plan
Checklist A — Prove “disk” vs “cable/slot” with minimal downtime
- Snapshot evidence: save zpool status -v output and a kernel log slice around the last scrub/resilver window.
- Check SMART media counters: reallocated, pending, offline uncorrectable (HDD) or media errors (NVMe).
- Check transport counters: SATA UDMA_CRC_Error_Count; SAS PHY errors; kernel link resets/timeouts.
- Clear ZFS errors: only after evidence is captured.
- Swap the cheapest suspect first: replace SATA/SAS cable or move the drive to another bay/port.
- Rerun scrub (or a controlled read test): confirm whether errors recur and whether CRC/PHY counters increase.
- If errors follow the drive across ports: replace the drive.
- If errors stay with the port/bay: fix backplane/HBA lane/cable routing; consider HBA firmware updates and PCIe slot changes.
Checklist B — If the pool is degraded (stop diagnosing, start stabilizing)
- Reduce workload: pause heavy backups, replication, and aggressive scrubs.
- Confirm redundancy state: mirror vs RAIDZ, number of failures tolerated.
- Export evidence (pool status, logs) somewhere safe.
- Replace the most likely failing component based on evidence, but keep the old disk until resilver completes and scrub validates.
- After resilver, scrub again. No scrub, no confidence.
Checklist C — “We need proof for procurement”
- Show zpool status -v with the exact member and rising CKSUM counts.
- Show kernel logs with link resets/timeouts or medium errors tied to that device.
- Show SMART deltas: CRC errors increasing (cable) or reallocated/pending increasing (disk).
- Show the A/B test: after swapping cable/port, the issue either stopped (cable) or followed the drive (disk).
FAQ
1) Are ZFS checksum errors always a failing disk?
No. They’re a failing integrity check. Disks are common culprits, but cables, backplanes, HBAs, PCIe issues, and RAM can all produce the same symptom.
2) If SMART says “PASSED,” can the disk still be bad?
Yes. SMART overall status is a low bar. Look at specific attributes (reallocated/pending/uncorrectable) and self-test results. Also correlate with kernel logs.
3) What’s the single best sign of a bad SATA cable?
UDMA_CRC_Error_Count increasing. One or two historical CRC errors can happen; a counter that keeps climbing under load is a smoking connector.
4) ZFS repaired the errors. Can I ignore it?
You can, but you’re spending redundancy like it’s free. Today it’s a correctable mismatch; tomorrow it’s a second failure during resilver. Investigate and fix the underlying cause.
5) Should I run zpool clear immediately?
No. Capture evidence first. Clearing early destroys the timeline and makes it harder to prove whether the fix worked.
6) Why do errors show up during scrub but not during normal VM activity?
Scrub forces full-surface reads and checksum verification. It’s sustained I/O and broad coverage, which exposes marginal transport and weak media that normal hot data patterns might never touch.
7) If I swap the disk, what should I watch during resilver?
Watch for new read/write/checksum errors on any member, and watch kernel logs for resets. Resilver is a stress test for the entire chain: disk, cable, HBA, expander, power.
8) Can RAM issues look like disk checksum errors?
Yes. Bad RAM can corrupt data in memory before it is written or after it is read. ECC reduces the chance and increases observability via EDAC/MCE logs.
9) Does compression or recordsize cause checksum errors?
Not directly. Those settings change I/O patterns and CPU load, which can trigger marginal hardware paths. Fix the marginality; don’t blame compression because it was the last thing you touched.
10) What if the pool is RAIDZ and ZFS points to one disk with CKSUM?
Treat it similarly: check transport/media for that disk first. But also remember RAIDZ resilver/scrub behavior can stress the entire vdev; watch other members closely during diagnostics.
Conclusion: next steps you can do today
Checksum errors are not a “ZFS problem.” They’re ZFS catching someone else’s problem before your applications do. Your mission is to turn that warning into a defensible diagnosis.
Do this, in this order:
- Capture zpool status -v and the relevant kernel log window.
- Check SMART/NVMe media indicators and transport indicators (CRC/PHY/link resets).
- Clear counters, then reproduce under a scrub or controlled read.
- Swap cable/slot/port first when evidence points to transport (CRC/PHY errors, resets with clean media stats).
- Replace the disk when media evidence points to the drive (failed SMART long test, growing reallocated/pending/uncorrectable, errors follow the drive).
- After any fix, scrub again. If you didn’t scrub, you didn’t verify.
If you follow that discipline, you’ll stop arguing with guesswork and start walking into the parts room with proof. Storage engineers love proof. It’s the only thing that survives the next incident review.