At 02:17, somebody pings: “Storage shows errors. Can I just run zpool clear?” That’s the UNIX equivalent of asking if you can stop the smoke alarm by removing the battery. Sometimes it’s fine. Sometimes it’s how you turn a recoverable fault into a career-limiting event.
zpool clear is not magic, and it’s not maintenance. It’s a broom. Use it to sweep up known, already-fixed messes. Don’t use it to hide broken plumbing.
What zpool clear actually does (and what it doesn’t)
zpool clear clears error counts and certain fault states recorded by ZFS for a pool or a specific vdev. That’s it. It does not “repair data.” It does not “fix the disk.” It does not “remove corruption.” It doesn’t even guarantee the pool will stay healthy for the next five minutes.
Think of ZFS as having two layers of truth:
- Physical reality: what the disks, HBAs, expanders, cables, backplanes, and controllers can actually deliver.
- Recorded observation: what ZFS has seen (errors, retries, checksum mismatches) and tallied.
zpool clear resets the recorded observation. It does nothing to physical reality. If physical reality is still broken, ZFS will just start counting again—often immediately, sometimes seconds after you clear. If you clear and the counters stay at zero under load, you probably fixed something and are now making the pool status readable again. If you clear and the counters race back up, you just confirmed a live problem.
Also: clearing can transition a pool from DEGRADED/FAULTED to ONLINE if the underlying device is now accessible and the fault was transient. That’s convenient. It’s also how you accidentally “prove” a flaky path is good because it worked for one minute.
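A minimal sanity check after a clear, assuming a pool named tank and a placeholder device name (substitute your own by-id path): clear the one suspect vdev, then watch whether its counters come back under real load.
cr0x@server:~$ sudo zpool clear tank ata-SUSPECT_DISK_SERIAL
cr0x@server:~$ watch -n 60 sudo zpool status tank
If the counters stay at zero through a scrub or a busy period, the fix probably held; if they climb again, stop clearing and start isolating.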
The job of error counters in ZFS operations
ZFS uses per-vdev counters (READ, WRITE, CKSUM) and state machines (ONLINE, DEGRADED, UNAVAIL, FAULTED) to help you answer two questions:
- Is data at risk right now?
- If yes, is it a device problem, a path problem, or actual on-disk corruption?
Clearing counters is a hygiene step after you’ve answered both questions and mitigated the cause. It’s not step one.
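A quick way to answer "is anything wrong right now" across all pools is zpool status -x, which prints only pools that are unhealthy or have known errors; a healthy fleet answers with a single reassuring line.
cr0x@server:~$ sudo zpool status -x
all pools are healthy
Anything other than that one-liner means you owe the pool a real diagnosis before anyone reaches for zpool clear.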
Paraphrased idea, often attributed to practitioners in reliability and operations culture: “Hope is not a strategy.”
Facts and context you can use in a war room
- ZFS was built to detect silent corruption—the class of bugs and bit flips your RAID controller cheerfully ignores while returning “success.” That’s why checksum errors matter more than a scary-looking “degraded” banner.
- End-to-end checksumming wasn’t a mainstream filesystem feature when ZFS was designed in the early 2000s; ZFS pushed the idea that storage should verify what it serves, not just what it stores.
- The “scrub” concept is intentionally proactive: it’s a scheduled “read everything and verify” job, not a reactive fsck after a crash. It’s how ZFS finds bad sectors before you need them.
- READ/WRITE/CKSUM counters are per-vdev and persistent across reboots until cleared. That persistence is useful for trend spotting and terrible for people who panic at old numbers.
- A single checksum mismatch doesn’t automatically mean data loss in redundant vdevs (mirror/raidz). It often means ZFS detected a bad copy and repaired it from a good one—quietly saving you.
- ZFS “fault management” was influenced by serviceability thinking: it tries to tell you what component to replace, not merely that “something is wrong.” It’s good, but it’s not omniscient about cables and backplanes.
- Many “disk errors” are actually transport issues: SATA/SAS link resets, marginal expanders, mis-seated drives, power issues. ZFS can’t tell if your cable is sulking.
- On Solaris-derived stacks, FMA (Fault Management Architecture) integrated with ZFS. On Linux, you get a different ecosystem (ZED, udev, kernel logs), so diagnosis often requires correlating sources.
- The rise of cheap large disks made latent sector errors a real operational problem. Scrubs and redundancy planning became non-optional as rebuild times stretched and URE math got ugly.
Interpreting READ/WRITE/CKSUM like an adult
zpool status gives you three counters per device:
- READ: device returned an I/O error on read, timed out, or otherwise failed to deliver blocks.
- WRITE: device failed to write blocks (or acknowledge writes).
- CKSUM: the device returned data, but ZFS computed a checksum mismatch—data was not what was originally written.
In practice:
- READ/WRITE errors often point to a device that can’t do I/O reliably, or to a path issue (HBA, expander, cable, power). They usually show up alongside kernel log errors and can cause vdevs to drop offline.
- CKSUM errors are the “quietly terrifying” category. They can be caused by a dying disk, yes. They can also be caused by flaky transport, bad RAM (less common with ECC but not impossible), firmware weirdness, or controllers doing “helpful” things. ZFS is telling you: I got bytes, but I don’t trust them.
What you should mentally map from counters to actions
Here’s a field guide that matches what seasoned on-call engineers do, not what they say they do:
- CKSUM on one disk, small count, never increases: likely a transient or a repaired event. Scrub, correlate logs, then clear if you’ve remediated (reseat, replace cable, firmware update, etc.).
- CKSUM climbing during scrub or under load: active integrity issue. Do not clear. Find the component. Replace things until the counter stops rising.
- READ/WRITE errors with resets/timeouts: treat as path/device instability. ZFS might keep serving, but your resilience budget is being spent.
- Errors on multiple disks behind one HBA/expander: don’t start a disk replacement spree. That’s how you end up with “new disks, same errors.” Suspect shared infrastructure.
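A crude way to separate “small count, never increases” from “climbing under load” is to sample the suspect device on an interval and watch the deltas. A sketch, assuming a pool named tank and a placeholder device name:
# Print the counter header plus the suspect device every 5 minutes, with a timestamp.
# "tank" and SUSPECT_DISK are placeholders; use your pool name and by-id device.
while true; do
  date
  sudo zpool status tank | egrep 'CKSUM|SUSPECT_DISK'
  sleep 300
done
If the CKSUM column for that device moves between samples, you have an active problem, not history.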
When clearing errors is correct
zpool clear is correct when the error state is stale, the underlying cause has been addressed, and you need clean counters to verify stability going forward.
Use it after you fix something measurable
Good triggers:
- You replaced a failed disk, resilver completed, and the pool is healthy but still shows old counts. Clear to reset the scoreboard.
- You reseated a drive or replaced a SATA/SAS cable/backplane slot, and you want to confirm errors do not recur under a scrub.
- You had a one-time power event (PDU hiccup, chassis power loss) that caused transient I/O errors; after confirming hardware is stable, clear to avoid living with “historic shame.”
- You corrected a misconfiguration (wrong firmware, controller mode, multipath config) and want to prove the fix by watching counters under load.
Use it to make monitoring sane again
Many monitoring setups fire alerts on non-zero error counters. If you’ve done the work—scrubbed, remediated, validated—then leaving ancient counters in place trains your org to ignore alerts. Clear becomes an operational reset: “from this moment, new errors mean new incidents.”
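One hedged way to implement “new errors mean new incidents” is to alert on counter deltas rather than absolute values. A sketch, assuming a pool named tank and a writable /var/tmp (both placeholders):
#!/bin/sh
# Sketch: complain only when counters change since the last run.
POOL=tank                                   # placeholder pool name
BASELINE=/var/tmp/zpool-counters.$POOL      # placeholder state file
CURRENT=$(mktemp)
zpool status "$POOL" \
  | awk 'NF>=5 && $2 ~ /ONLINE|DEGRADED|FAULTED|UNAVAIL/ {print $1, $3, $4, $5}' > "$CURRENT"
if [ -f "$BASELINE" ] && ! diff -q "$BASELINE" "$CURRENT" >/dev/null; then
  echo "zpool counters changed on $POOL:"   # wire this into your alerting
  diff "$BASELINE" "$CURRENT"
fi
mv "$CURRENT" "$BASELINE"
Run it from cron or a timer; the first run only seeds the baseline.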
Use it on a specific device when you’re isolating
Clearing the entire pool can remove useful forensic context. Prefer:
- zpool clear poolname when you’re fully done and want everything reset.
- zpool clear poolname /dev/disk/by-id/... when you’re tracking a suspected bad actor and want a fresh count for that one path.
When clearing errors is stupid
Clearing errors is stupid when you haven’t proven the pool is stable, or when you’re using it to silence symptoms instead of diagnosing the cause.
Do not clear while errors are still accruing
If a scrub is running and CKSUM is climbing, clearing is just changing the odometer while the engine is on fire. You’ll lose the ability to quantify “how bad” and “how fast.” Keep the numbers. They’re evidence.
Do not clear to “fix” data corruption
Sometimes zpool status says errors: Permanent errors have been detected. That’s not a counter problem. That’s ZFS telling you it found blocks it could not repair from redundancy. Clearing doesn’t resurrect data. It hides the warning until the next read triggers the same pain in a more expensive place: your application.
Do not clear to make a degraded pool look green
A pool in DEGRADED is already spending redundancy. If you clear and declare victory without replacing the failing component, you’re betting the rest of your vdev will behave. That bet is statistically popular and professionally regrettable.
Do not clear before you’ve captured context
Clearing wipes a breadcrumb trail. Before clearing, capture:
- zpool status -v
- zpool events -v (if available)
- Kernel logs around the time errors occurred
- smartctl for affected disks
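A hedged capture helper, assuming a pool named tank, a suspect device /dev/sdX, and a writable /var/tmp (all placeholders); run it as root before any clearing:
#!/bin/sh
# Sketch: bundle the forensic context into one timestamped directory.
POOL=tank                # placeholder pool name
DISK=/dev/sdX            # placeholder suspect device
OUT=/var/tmp/zfs-incident-$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUT"
zpool status -v "$POOL"  > "$OUT/zpool-status.txt"
zpool events -v          > "$OUT/zpool-events.txt" 2>&1   # may not be available on every platform
dmesg -T | tail -n 500   > "$OUT/dmesg-tail.txt"
smartctl -a "$DISK"      > "$OUT/smartctl.txt" 2>&1
echo "Evidence captured in $OUT"
Attach the directory (or its contents) to the ticket, and only then think about zpool clear.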
First joke (short, relevant): Clearing ZFS errors without diagnosis is like rebooting the smoke detector. You didn’t fix the fire; you just made it quieter.
Fast diagnosis playbook
When you’re on-call, you don’t have the luxury of a philosophical relationship with your storage. Here’s how to get to “what’s broken” quickly.
First: establish whether this is active or historical
- Check zpool status and note whether counters are increasing over time.
- If a scrub/resilver is in progress, watch whether errors grow during that activity.
Second: classify the failure mode (device vs path vs data)
- If it’s mostly READ/WRITE with timeouts in logs → suspect device or transport.
- If it’s CKSUM without I/O errors → suspect corruption in flight (cable/HBA/expander/RAM) or a disk returning bad data.
- If Permanent errors show up → treat as a data loss event until proven otherwise.
Third: look for shared blast radius
- Are errors on multiple drives on the same HBA/port/expander? That’s a shared component.
- Are errors on a single disk only? That’s usually the disk, its slot, or its cable.
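Two quick, low-risk ways to see which drives share a controller path (output will look different on every box):
cr0x@server:~$ lsblk -o NAME,HCTL,MODEL,SERIAL,SIZE
cr0x@server:~$ ls -l /dev/disk/by-path/ | grep -v part
If the erroring disks cluster on the same host number in HCTL or the same PCI/SAS path in by-path, suspect the shared component before you suspect the disks.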
Fourth: decide whether you can keep serving
- Redundant vdev, single device errors, scrub repairs cleanly: you can often keep serving while you schedule a replacement.
- Non-redundant vdev or multiple devices in the same RAIDZ group acting up: stop gambling. Stabilize first.
Practical tasks: commands, outputs, decisions
Below are real operational tasks you can run. Each one includes: command, what the output means, and what decision you make.
Task 1: Get the headline status (don’t squint)
cr0x@server:~$ sudo zpool status
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
scan: scrub repaired 0B in 02:41:13 with 0 errors on Tue Dec 24 03:12:01 2025
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-ABC123 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-DEF456 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-GHI789 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-JKL012 UNAVAIL 8 2 0 cannot open
errors: Permanent errors have been detected in the following files:
tank/data/finance.db
What it means: This is not “just counters.” You have an unavailable disk and ZFS reports permanent errors in a specific file.
Decision: Do not run zpool clear as your first move. Stabilize hardware (bring disk back / replace), then address the corrupted file (restore from replica/backup or application-level rebuild).
Task 2: Get verbose detail on errors and files
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: DEGRADED
scan: scrub repaired 0B in 02:41:13 with 0 errors on Tue Dec 24 03:12:01 2025
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-ABC123 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-DEF456 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-GHI789 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-JKL012 UNAVAIL 8 2 0 cannot open
errors: Permanent errors have been detected in the following files:
tank/data/finance.db
What it means: ZFS is pointing at a specific dataset path. This is actionable.
Decision: Engage app owner. Determine whether file can be restored or rebuilt. Hardware work and data work must proceed in parallel.
Task 3: Confirm the disk is actually missing vs renamed
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep JKL012
What it means: No output suggests the OS doesn’t see the device at all (unplugged, dead, link down, controller issue).
Decision: Check cabling/backplane/HBA. If this is a hot-swap bay, reseat the drive. If it’s a VM, check virtual disk attachment.
Task 4: Correlate with kernel logs (transport vs media)
cr0x@server:~$ sudo dmesg -T | egrep -i "ata|sas|scsi|reset|timeout|I/O error" | tail -n 20
[Wed Dec 24 02:01:44 2025] ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x4050000 action 0x6 frozen
[Wed Dec 24 02:01:44 2025] ata9: SError: { CommWake DevExch }
[Wed Dec 24 02:01:45 2025] ata9.00: failed command: READ FPDMA QUEUED
[Wed Dec 24 02:01:45 2025] blk_update_request: I/O error, dev sdi, sector 918273645 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Wed Dec 24 02:01:47 2025] ata9: hard resetting link
[Wed Dec 24 02:01:52 2025] ata9: link is slow to respond, please be patient (ready=0)
[Wed Dec 24 02:01:57 2025] ata9: SATA link down (SStatus 0 SControl 300)
What it means: Link resets and link down scream “transport/path” more than “bad sector.” Disk may still be fine; the path isn’t.
Decision: Inspect/replace cable, backplane port, or HBA lane. Don’t replace the disk first unless you have other evidence.
Task 5: Pull SMART to see if the disk is sick
cr0x@server:~$ sudo smartctl -a /dev/sdi | egrep -i "Reallocated|Pending|Uncorrect|CRC|SMART overall|Power_On_Hours" -n
5:SMART overall-health self-assessment test result: PASSED
83: 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
88:197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
89:198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
102:199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 47
109: 9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 46712
What it means: High CRC errors with no reallocations/pending sectors points to cable/backplane/controller noise. Disk media doesn’t look obviously bad.
Decision: Fix the path. Then scrub. Then consider clearing counters.
Task 6: Identify which physical bay a by-id maps to
cr0x@server:~$ sudo lsblk -o NAME,SERIAL,MODEL,HCTL,SIZE,STATE | grep -E "sdi|JKL012"
sdi WD-JKL012 WDC WD80EFAX-68LHPN0 3:0:9:0 7.3T running
What it means: HCTL gives you host:channel:target:lun, which is useful with SAS HBAs and chassis mapping.
Decision: Use HCTL and your chassis map to touch the correct drive. Avoid the classic “pull the wrong disk” moment.
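If the chassis has enclosure management and the ledmon package is installed (an assumption, not a given on every box), blinking the locate LED beats trusting a paper chassis map:
cr0x@server:~$ sudo ledctl locate=/dev/sdi       # blink the bay holding sdi
cr0x@server:~$ sudo ledctl locate_off=/dev/sdi   # turn it off when you're done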
Task 7: Confirm pool topology and ashift (future pain predictor)
cr0x@server:~$ sudo zpool get ashift,autoreplace,autotrim tank
NAME PROPERTY VALUE SOURCE
tank ashift 12 local
tank autoreplace off default
tank autotrim off default
What it means: ashift=12 is 4K sectors; fine. autoreplace=off means replacing a disk may require explicit steps. autotrim off is typical for HDD pools.
Decision: If you expect hot-swap behavior, decide whether to enable autoreplace after you understand your environment. Don’t touch it mid-incident unless you enjoy surprise.
Task 8: Run a scrub and watch whether counters increase
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ watch -n 10 sudo zpool status tank
Every 10.0s: sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Wed Dec 24 04:01:10 2025
1.23T scanned at 1.4G/s, 212G issued at 245M/s, 7.98T total
0B repaired, 2.59% done, 08:31:12 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-ABC123 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-DEF456 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-GHI789 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-JKL012 ONLINE 0 0 0
errors: No known data errors
What it means: Scrub is progressing, pool is stable, no errors increasing. That’s a good sign.
Decision: If this follows remediation (cable reseat/replacement), you’re approaching “safe to clear” territory after scrub completes.
Task 9: Check scrub result and repaired bytes
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub repaired 12M in 02:39:44 with 0 errors on Wed Dec 24 06:41:02 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 3
raidz1-0 ONLINE 0 0 3
ata-WDC_WD80EFAX-68LHPN0_WD-ABC123 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-DEF456 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-GHI789 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-JKL012 ONLINE 0 0 3
errors: No known data errors
What it means: Scrub repaired 12M (self-healed), but the disk still shows checksum errors. If these errors are historical and not increasing, you can clear after you’re satisfied the underlying cause is fixed.
Decision: Correlate with logs during scrub. If no new resets/timeouts and SMART CRC stops increasing, clear and monitor.
Task 10: Clear a single device (surgical reset)
cr0x@server:~$ sudo zpool clear tank ata-WDC_WD80EFAX-68LHPN0_WD-JKL012
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub repaired 12M in 02:39:44 with 0 errors on Wed Dec 24 06:41:02 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-ABC123 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-DEF456 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-GHI789 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-JKL012 ONLINE 0 0 0
errors: No known data errors
What it means: Counters reset. Now you have a clean baseline.
Decision: Put the pool under normal load and re-check in hours/days. If counters reappear, the fix didn’t hold.
Task 11: Use zpool events to see recent ZFS fault activity
cr0x@server:~$ sudo zpool events -v | tail -n 25
TIME CLASS
Dec 24 2025 02:01:58.219388000 ereport.fs.zfs.io
vdev_path: /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_WD-JKL012
vdev_guid: 1234567890123456789
pool: tank
pool_guid: 9876543210987654321
ereport_payload:
zio_err: 5
zio_offset: 469124961280
zio_size: 131072
zio_objset: 54
zio_object: 1
zio_level: 0
What it means: ZFS logged an I/O ereport for that vdev. The timestamp helps correlate to kernel logs and physical events.
Decision: If events stop after remediation, you’re winning. If they continue, escalate hardware isolation.
Task 12: Confirm the pool isn’t silently re-silvering or stuck
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-WDC...ABC123 ONLINE 0 0 0
ata-WDC...DEF456 ONLINE 0 0 0
ata-WDC...GHI789 ONLINE 0 0 0
ata-WDC...JKL012 ONLINE 0 0 0
errors: No known data errors
What it means: No ongoing scan, nothing deferred.
Decision: If you expected a resilver and don’t see it, you may have replaced the wrong thing or ZFS didn’t attach the new disk as you assumed.
Task 13: Replace a disk correctly (and avoid clearing too early)
cr0x@server:~$ sudo zpool replace tank ata-WDC_WD80EFAX-68LHPN0_WD-JKL012 /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_WD-MNO345
cr0x@server:~$ sudo zpool status tank
pool: tank
state: DEGRADED
scan: resilver in progress since Wed Dec 24 07:10:04 2025
612G scanned at 1.1G/s, 118G issued at 211M/s, 7.98T total
118G resilvered, 1.44% done, 06:10:22 to go
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-WDC...ABC123 ONLINE 0 0 0
ata-WDC...DEF456 ONLINE 0 0 0
ata-WDC...GHI789 ONLINE 0 0 0
replacing-3 DEGRADED 0 0 0
ata-WDC...JKL012 UNAVAIL 0 0 0 cannot open
ata-WDC...MNO345 ONLINE 0 0 0 (resilvering)
What it means: Resilver is active; ZFS is reconstructing redundancy onto the new disk.
Decision: Do not clear during resilver unless you’re clearing stale errors after a resolved transient issue and you’ve captured evidence. Let resilver finish; then validate with scrub.
Task 14: Clear pool-wide after a completed and validated remediation
cr0x@server:~$ sudo zpool clear tank
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: resilvered 7.98T in 06:44:09 with 0 errors on Wed Dec 24 13:54:13 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-WDC...ABC123 ONLINE 0 0 0
ata-WDC...DEF456 ONLINE 0 0 0
ata-WDC...GHI789 ONLINE 0 0 0
ata-WDC...MNO345 ONLINE 0 0 0
errors: No known data errors
What it means: Clean baseline after a successful replacement/resilver.
Decision: Set a reminder to re-check SMART and zpool status after one business day and after the next scheduled scrub.
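A throwaway cron entry can do the reminding for you (hypothetical schedule; assumes cron output is delivered to you, and uses the pool and disk names from this example):
# m h dom mon dow  command
30 8 * * *  zpool status tank; smartctl -H /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_WD-MNO345
Remove it once the next scheduled scrub has come back clean.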
Three corporate mini-stories from the real world
Mini-story 1: The incident caused by a wrong assumption
The storage team at a mid-size company ran a mixed workload on a RAIDZ2 pool: analytics reads, nightly compactions, and a few “temporary” datasets that somehow became permanent. One morning, monitoring started paging on checksum errors on a single disk. The on-call engineer saw the pool still ONLINE and decided it was old noise. They ran zpool clear to stop the page spam.
Within hours, the checksum count returned—fast. But the page had been “handled,” so it didn’t get much attention. The next night’s heavy batch job pushed the system harder than daytime traffic, and a second disk in the same vdev started throwing read timeouts. Now the pool went DEGRADED. The team assumed it was “two failing disks” and started a standard disk replacement process.
What they missed: both disks were connected through the same SAS expander path, and the expander firmware had a known issue with link power management. The initial checksum errors were early warning. Clearing erased the ability to show that the errors were increasing, and it also reset the timeline evidence in the on-call notes because nobody captured zpool status -v and logs first.
They replaced one disk. Resilver crawled with intermittent resets. They replaced the second disk too. Still flaky. Then a third disk began erroring, and now everyone had the horrible realization: this wasn’t a set of bad disks. It was shared infrastructure.
The fix was boring: disable the problematic link power setting, update expander firmware during a maintenance window, and reseat a marginal mini-SAS cable. After that, the errors stopped. They cleared counters again—this time correctly—and watched them stay at zero. The cost wasn’t just hardware; it was trust. Storage pages got ignored for weeks afterward, which is how organizations accumulate future outages like credit card debt.
Mini-story 2: The optimization that backfired
A different org wanted faster scrubs. They had a big pool and a tight maintenance window, so someone tuned scrub behavior at the system level (the exact knobs varied by platform) and scheduled scrubs during business hours “because the array is fast.” It worked in test.
In production, the faster scrub increased read pressure on a set of aging drives. That pressure exposed a marginal HBA and a questionable backplane trace. Nothing catastrophic—just enough retries and link resets that ZFS started logging checksum errors on multiple disks. The pool stayed online, so the tickets were downgraded to “watch.”
Then the team did the worst possible thing for their own observability: they cleared the errors weekly after each scrub, because leadership didn’t like dashboards with red numbers. With the counters constantly reset, it became impossible to show the trend that the environment was degrading. The early warning system got turned into a cosmetic filter.
Eventually, a heavy write workload coincided with a scrub and a firmware hiccup, causing a vdev to drop temporarily. The pool recovered, but an application saw transient errors and started rebuilding indexes from scratch, hammering storage even harder. The incident wasn’t one big explosion; it was a chain reaction powered by “optimization.”
The postmortem had a blunt lesson: performance tuning that hides reliability signals is not optimization. It’s debt. They rolled back the aggressive scrub behavior, moved scrubs to controlled windows, fixed the shared hardware, and only then used zpool clear to reset baselines. Performance came back. So did their credibility.
Mini-story 3: The boring but correct practice that saved the day
A financial services team ran ZFS for internal services that were not allowed to be exciting. They had a rule: before anyone runs zpool clear, they must attach three artifacts to the ticket—zpool status -v, a relevant excerpt of kernel logs, and smartctl for any involved disks. No exceptions, no “quick fix.”
One weekend, a pool started showing a small number of checksum errors on a single disk. The system otherwise behaved. The on-call engineer followed the ritual, captured the evidence, and noticed something subtle: the SMART report showed CRC errors increasing, but reallocated/pending sectors were zero. Kernel logs showed link resets on a specific port.
Instead of replacing the disk, they moved the drive to another bay and swapped the cable to the backplane. After that, they ran a scrub; no new errors appeared. Only then did they clear the device counters. Monitoring stayed quiet for months.
What saved them wasn’t genius. It was the refusal to let “clear the errors” be a substitute for diagnosis. The ritual produced enough data to pick the right component the first time, during a calm maintenance window, without a resilver marathon and without risking a second failure during rebuild.
Common mistakes: symptoms → root cause → fix
Mistake 1: “Pool is ONLINE, so checksum errors don’t matter.”
Symptom: Pool shows ONLINE, but one disk has growing CKSUM errors.
Root cause: Redundancy is masking corruption; ZFS is repairing reads, but something is returning wrong data (disk or path).
Fix: Correlate with dmesg and SMART; check for CRC errors and link resets. Swap cable/port, or replace the disk if SMART indicates media issues. Scrub to validate. Clear only after the count stops increasing.
Mistake 2: Clearing errors to silence monitoring
Symptom: Alerts stop after zpool clear, then return later, often worse.
Root cause: Monitoring was right; you deleted the evidence and delayed diagnosis.
Fix: Keep counters until root cause is addressed. If alert fatigue is real, tune alerting to “rate of increase” and “active degraded/faulted state,” not merely non-zero historic counters.
Mistake 3: Replacing disks when the real issue is the HBA/expander
Symptom: Multiple drives show errors, often across different vdevs, and replacements don’t help.
Root cause: Shared component failure: HBA overheating, expander firmware bugs, bad backplane, power brownouts.
Fix: Map devices to HBAs/ports (HCTL), check logs for common reset patterns, and isolate by moving one affected drive to a different controller path if possible. Replace/repair the shared component.
Mistake 4: Treating “Permanent errors” as a counter problem
Symptom: zpool status lists specific files with permanent errors.
Root cause: ZFS could not self-heal those blocks (insufficient redundancy, multiple faults, or corruption persisted through all replicas).
Fix: Restore affected files from backup/replica or rebuild at application level. Identify and fix the hardware/transport issue that caused corruption. Clearing counters does not fix the file.
Mistake 5: Clearing before capturing evidence
Symptom: “It happened last night but we can’t reproduce.”
Root cause: Counters/events/logs rotated or were cleared, destroying correlation.
Fix: Capture zpool status -v, zpool events -v, and a time-window of logs before any destructive hygiene. Standardize this in incident runbooks.
Mistake 6: Confusing scrub with resilver and making the wrong call
Symptom: Team runs scrubs expecting redundancy to rebuild after a replacement, or they replace disks while scrub is the real need.
Root cause: Scrub verifies and repairs; resilver reconstructs onto a replacement/newly attached device. Different goals, different signals.
Fix: If a device was replaced/reattached, confirm resilver is happening. After resilver, run scrub to validate. Clear errors after validation.
Checklists / step-by-step plan
Checklist A: “Can I run zpool clear now?”
- Have you captured zpool status -v output for the ticket?
- Do kernel logs show the last relevant I/O errors/resets, and are they explained?
- Is the pool ONLINE with no missing devices?
- Has a scrub completed since the fix, with 0 errors?
- Are SMART indicators stable (CRC not increasing, no pending/reallocated growth)?
- Are error counters not increasing under load?
If you can’t say yes to most of these, don’t clear. If you can, clear surgically (device-level) when possible, then monitor.
Checklist B: Step-by-step incident response when zpool status shows errors
- Stabilize: confirm pool state, redundancy, and whether any vdev is offline/unavail.
- Preserve evidence: capture zpool status -v, zpool events -v, logs, SMART.
- Classify: READ/WRITE vs CKSUM vs permanent errors.
- Scope: single disk vs multiple disks vs shared component pattern.
- Mitigate: reseat/swap cable/port; replace disk only when indicated.
- Validate: resilver (if applicable), then scrub.
- Reset baseline: clear errors only after validation.
- Monitor: check status after 1h, 24h, and next scrub; keep an eye on SMART deltas.
Checklist C: Post-fix verification (the part people skip)
- Run a scrub (or schedule immediately if the pool is large and you need a window).
- Confirm no new errors during scrub.
- Confirm counters remain stable for at least one normal workload cycle.
- Only then clear error counters and mark the incident resolved.
Second joke (short, relevant): The quickest way to make zpool status look healthy is zpool clear. The second quickest is fixing the problem.
FAQ
1) Does zpool clear fix corruption?
No. It clears recorded error counts and can clear certain fault states. Corruption is fixed by redundancy repair during reads/scrub, resilver, or by restoring data.
2) If I clear errors, can I lose data?
Clearing itself doesn’t delete data, but it can make you miss a developing failure by removing evidence. The data loss comes later, when you fail to replace the bad component in time.
3) When should I clear: before or after a scrub?
After. A scrub is your proof that the pool can read and verify data end-to-end. Clearing before scrub removes the baseline you want to compare against.
4) My pool shows checksum errors but “No known data errors.” Is that okay?
It can be okay if the errors are historical and not increasing—ZFS may have repaired them. It is not okay if the checksum count is rising, especially during scrub or heavy reads.
5) Why do checksum errors often implicate cables and HBAs?
Because the disk can return bytes that are corrupted in transit or by a flaky controller path. SMART CRC counts and kernel logs about link resets are common telltales.
6) Can I clear only one disk’s errors?
Yes, and you often should. Use zpool clear poolname vdev to reset counters on the suspected device while preserving pool-wide context.
7) What if errors return immediately after clearing?
That’s a gift: it means you have an active problem. Stop clearing. Correlate with logs and SMART, and isolate whether it’s the disk or the path/shared hardware.
8) Is it safe to clear errors on a degraded pool?
Usually not as a “fix.” If the pool is degraded because a device is missing, clearing won’t replace redundancy. The only safe reason is after you’ve restored device availability and validated stability.
9) How do I decide between replacing the disk and swapping the cable?
If SMART shows reallocations/pending/uncorrectables climbing, replace the disk. If SMART shows CRC errors and logs show link resets, prioritize the cable/port/HBA path. If unsure, swap the path first; it’s low-risk and fast.
10) Does clearing errors affect future diagnostics?
Yes. You lose historical counters that help show trend. That’s why you capture evidence first and clear only after you’ve fixed and validated.
Next steps you can do today
If you operate ZFS in production, stop treating zpool clear as a ritual and start treating it as a controlled reset of evidence.
- Add a rule to your runbooks: no clearing until you’ve attached zpool status -v, relevant logs, and SMART output to the ticket.
- Teach your monitoring to care about change, not shame: alert on increasing counters, degraded/faulted states, scrub errors, and repeated device dropouts.
- Schedule scrubs like grown-ups: regular enough to catch latent issues, controlled enough not to become your peak load.
- Practice surgical clears: clear a single vdev when you’re isolating. Clear the whole pool when the incident is truly over and validated.
- Do one tabletop exercise: simulate checksum errors on one drive and make the team walk the playbook without improvising with zpool clear.
When you do clear, do it with a reason, a baseline, and a plan to watch what happens next. That’s not superstition. That’s operations.