ZFS Write Errors: The Failure Pattern That Predicts a Dropout

You get the page at 02:13: “pool degraded.” You log in, run zpool status, and there it is: one device has
a few write errors. Not checksum errors. Not read errors. Writes.

Maybe the app is still up. Maybe latency is a little weird. Someone suggests a scrub and going back to bed.
And that’s how you end up with a device dropout at 09:42, right when the business starts doing business.

The pattern: why write errors predict dropouts

In ZFS land, checksum errors are the headline grabbers. They sound scary and “data-corruption-y.”
But when you’re trying to predict the next failure event—the one that knocks a device out of the pool—
write errors are the better canary.

Here’s the failure pattern that shows up again and again in production:

  1. Small, intermittent write errors accumulate on one device. The pool may still be ONLINE.
    Often it’s a single-digit number that doesn’t change for days, then jumps.
  2. Latency spikes correlate with link resets (SATA/SAS), HBA queue stalls, or firmware “helpfulness.”
    Your applications feel it before you do—unless you graph the right things.
  3. The kernel logs show transport drama: timeouts, resets, “device offline,” “rejecting I/O.”
    ZFS logs the consequence: write I/O that didn’t complete successfully.
  4. A scrub doesn’t “fix” write errors. Scrubs validate and repair data using redundancy.
    They don’t heal a transport path or a dying write channel.
  5. The device drops out during pressure: a resilver, a snapshot send, a backup window, or a busy Monday.
    The pool degrades, and the incident starts charging you rent.

The key insight: write errors are often not “bad blocks,” not initially. They’re frequently
path-level failures—controller, cabling, expander, backplane, power, or firmware.
ZFS is reporting “I tried to write, and the stack below me didn’t deliver.”

If you treat that as a cosmetic counter and move on, you’re betting your uptime on the same flaky path
behaving better tomorrow under higher load. That’s not bravery; that’s gambling.

One quote that still holds up in operations: Hope is not a strategy — often attributed to Vince Lombardi.

Joke #1: Drives don’t “partially fail” to be polite. They do it to make sure your outage fits neatly into a meeting invite.

Facts and historical context worth knowing

  • ZFS was born at Sun in the mid-2000s with end-to-end checksumming as a foundational idea, not a bolt-on.
  • ZFS error counters are per-device runtime stats, not durable records: they reset on reboot, pool export/import, and zpool clear, so capture them in your notes before maintenance.
  • Early SATA was rough in enterprise enclosures: timeouts, poor TLER behavior, and flimsy link handling taught a generation to respect transport errors.
  • SMART and ZFS see different worlds: SMART often reports “disk health” while ZFS reports “I/O reality,” and those can disagree for weeks.
  • Write failures are more “visible” than reads because successful reads can be served from ARC/L2ARC, hiding problems until the cache misses.
  • Expander and backplane firmware bugs have caused real outages by issuing resets under load, leading to bursty write errors and device offlines.
  • ZFS scrubs are not fsck: they verify blocks and repair from redundancy, but they don’t fix unstable devices or flaky HBAs.
  • Resilver behavior has evolved: “sequential resilver” and improvements in OpenZFS reduced pain, but resilvers still amplify weak links.
  • Ashift mistakes are forever: misaligned sector sizing doesn’t directly create write errors, but it increases write amplification, pushing marginal hardware over the edge.

What ZFS “write errors” actually mean

The three counters: READ, WRITE, CKSUM

zpool status typically shows three numbers per device: READ, WRITE, and CKSUM.
They are not interchangeable, and they do not implicate the same layer.

  • READ errors: the device/path failed to return data when asked.
    This could be media failure, transport timeout, or “device not ready.”
  • WRITE errors: the device/path failed to persist data when asked.
    This is commonly a transport or device firmware problem, sometimes power-related.
  • CKSUM errors: data came back, but it didn’t match its checksum.
    That often indicates corruption in-flight (cable, controller, RAM), or bad media returning wrong data.

Why write errors are a dropout predictor

A device that can’t reliably complete writes tends to get kicked out by the OS stack first.
Most HBAs and drivers have a patience budget. Under load, the budget gets spent faster.
When timeouts accumulate, the controller resets the link. ZFS sees a write I/O fail.
Enough of those, and the device is effectively unreliable even if it “comes back.”
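
If you want to see that patience budget in numbers, the sketch below reads two of the relevant knobs. It assumes a Linux host and a suspect disk at /dev/sdg, matching the examples later in this article; values and support vary by drive and distro.

# Kernel-side patience: per-command SCSI/ATA timeout in seconds (default is usually 30)
cat /sys/block/sdg/device/timeout

# Drive-side patience: SCT Error Recovery Control (TLER), if the drive supports it;
# a drive that retries internally for minutes will blow straight through the kernel timeout
sudo smartctl -l scterc /dev/sdg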

What write errors are not

They are not automatically “data lost.” If the pool has redundancy (mirror/RAIDZ) and the write failed on one side,
ZFS can still complete the transaction group safely, depending on what failed and when.

They are not automatically “the disk is dying.” A surprising percentage of write errors in the field are
cabling, expanders, HBAs, backplanes, or power distribution. The disk is just the messenger—and we all know
what organizations do to messengers.

The dropout choreography (what you’ll see)

The classic sequence for a SATA/SAS path issue is:
timeouts → link reset on the affected port → queued I/O aborted → ZFS logs write errors → device marked FAULTED or REMOVED → resilver begins.

The more your system “helpfully” tries to recover links (reset storms), the more chaotic the latency becomes.
Applications don’t care that the device came back. They care that 99th percentile latency just moved into a new country.
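
If you want to watch that choreography live instead of reconstructing it afterwards, OpenZFS exposes an event stream. A minimal sketch; the exact event class names (I/O and delay ereports, statechange) vary a little by version.

# Follow ZFS events as they happen; transport failures typically show up as I/O and delay
# ereports, followed by a statechange event when the device finally drops
sudo zpool events -f

# Or review recent events with full detail after the fact
sudo zpool events -v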

Fast diagnosis playbook (first/second/third)

The goal in the first 15 minutes is not to conduct a philosophical seminar on storage.
It’s to answer three questions:
Is data safe right now? What’s the failing layer? What’s the next action that reduces risk fastest?

First: confirm pool state and blast radius

  • Run zpool status -v. Identify the exact vdev and device path reporting write errors.
  • Check whether redundancy is intact (mirror has another side ONLINE, RAIDZ has enough parity).
  • Look for ongoing resilver/scrub. Those amplify weak hardware and change the “next 30 minutes” risk profile.

Second: correlate with kernel transport logs

  • Search dmesg / journalctl -k for resets, timeouts, “rejecting I/O,” “device offline,” SAS phy resets.
  • If logs scream “link reset,” stop debating. Treat it as a path issue until proven otherwise.

Third: check device identity and SMART, then decide replace vs path-fix

  • Map ZFS device to /dev node and physical slot (use zpool status, ls -l /dev/disk/by-id, enclosure tools if you have them).
  • Run smartctl and look for reallocated/pending sectors, UDMA CRC errors, command timeouts, and SMART error log entries.
  • If SMART is clean but dmesg shows resets: suspect cable/HBA/backplane/power before the disk.

Decision rule that keeps you employed

If write errors are increasing and you see transport resets in the same window, prioritize stabilizing the path.
If you can’t stabilize the path quickly, replace the component most likely to be intermittent (cable/backplane port),
and if the platform is “mystery meat,” replace the disk too. Labor is expensive; downtime is worse.
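
Here is that first/second/third pass compressed into one script. It is a sketch, not a product: the pool name, device path, and time window are assumptions to adjust, and it expects to run as root.

#!/usr/bin/env bash
# triage.sh: the 15-minute first pass, condensed
set -u
POOL=tank
DISK=/dev/disk/by-id/ata-WDC_WD80EFAX_B2   # the device zpool status blames

# 1) Pool state, per-device counters, and any files with permanent errors
zpool status -v "$POOL"

# 2) Transport drama in the last hour: resets, timeouts, offlined devices
journalctl -k --since "1 hour ago" | egrep -i "ata|sas|scsi|reset|timeout|offline|aborted" | tail -n 40

# 3) Path vs media hints from SMART: CRC errors point at cabling, reallocations at media
smartctl -A "$DISK" | egrep -i "Reallocated|Pending|Offline_Uncorrectable|UDMA_CRC"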

Practical tasks: commands, outputs, decisions (12+)

These are the tasks I actually run. Each one includes what the output means and what decision it drives.
Commands are shown for a typical Linux OpenZFS host; adjust device names to your environment.

Task 1: Get the truth from ZFS

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Restore the file in question if possible.  Otherwise restore the entire pool from backup.
  scan: scrub repaired 0B in 02:41:10 with 0 errors on Tue Dec 24 03:10:11 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          raidz1-0                  DEGRADED     0     0     0
            ata-WDC_WD80EFAX_A1       ONLINE      0     0     0
            ata-WDC_WD80EFAX_B2       ONLINE      0     7     0
            ata-WDC_WD80EFAX_C3       ONLINE      0     0     0
            ata-WDC_WD80EFAX_D4       ONLINE      0     0     0

errors: Permanent errors have been detected in the following files:
        tank/data/vmstore/vm-104.img

Meaning: One disk shows WRITE=7. The pool is DEGRADED and ZFS reports a file with permanent errors.
This is not “maybe.” Something already failed to write correctly or couldn’t be committed.

Decision: Treat as an active incident. Identify whether the error is isolated and recoverable from redundancy/backup.
Also start transport investigation immediately; the write counter is a leading indicator for a dropout.

Task 2: Verify the exact device path(s) ZFS is using

cr0x@server:~$ sudo zpool status -P tank
  pool: tank
 state: DEGRADED
config:

        NAME                                      STATE  READ WRITE CKSUM
        tank                                      DEGRADED 0    0     0
          raidz1-0                                DEGRADED 0    0     0
            /dev/disk/by-id/ata-WDC_WD80EFAX_A1   ONLINE  0    0     0
            /dev/disk/by-id/ata-WDC_WD80EFAX_B2   ONLINE  0    7     0
            /dev/disk/by-id/ata-WDC_WD80EFAX_C3   ONLINE  0    0     0
            /dev/disk/by-id/ata-WDC_WD80EFAX_D4   ONLINE  0    0     0

Meaning: You now have stable identifiers. If your pool uses /dev/sdX names, fix that later.

Decision: Use the by-id path for SMART and for locating the physical disk. Do not trust /dev/sdX ordering after resets.

Task 3: Check whether errors are still increasing

cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: DEGRADED
config:

        NAME                        STATE  READ WRITE CKSUM
        tank                        DEGRADED 0    0     0
          raidz1-0                  DEGRADED 0    0     0
            ata-WDC_WD80EFAX_A1     ONLINE  0    0     0
            ata-WDC_WD80EFAX_B2     ONLINE  0    9     0
            ata-WDC_WD80EFAX_C3     ONLINE  0    0     0
            ata-WDC_WD80EFAX_D4     ONLINE  0    0     0

Meaning: WRITE went from 7 to 9. The problem is ongoing.

Decision: Stop “wait and see.” Either remove load, migrate workloads, or move toward replacement/path repair now.
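
A low-tech way to prove the counter is still moving: sample it on a schedule and compare. A sketch to run as root in tmux (or convert to a systemd timer); the pool name and log path are assumptions.

# Append a timestamped snapshot of exact error counters every 5 minutes;
# rising WRITE values between snapshots mean the problem is live, not historical
while true; do
  date -Is >> /var/log/zpool-counters.log
  zpool status -p tank >> /var/log/zpool-counters.log
  sleep 300
done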

Task 4: Pull kernel logs for transport resets/timeouts

cr0x@server:~$ sudo journalctl -k --since "30 min ago" | egrep -i "ata|sas|scsi|reset|timeout|offline|aborted" | tail -n 20
Dec 25 01:43:11 server kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Dec 25 01:43:11 server kernel: ata8.00: cmd 61/00:08:10:3c:2a/04:00:00:00:00/40 tag 1 ncq dma 524288 out
Dec 25 01:43:11 server kernel: ata8.00: status: { DRDY ERR }
Dec 25 01:43:11 server kernel: ata8.00: error: { ICRC ABRT }
Dec 25 01:43:12 server kernel: ata8: hard resetting link
Dec 25 01:43:17 server kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 25 01:43:17 server kernel: ata8.00: configured for UDMA/133

Meaning: ICRC errors and link resets are classic cable/backplane/connector issues (or occasionally controller).
This is not ZFS “being picky.” The transport is flaking.

Decision: Treat as a path problem first. Reseat/replace cable, check backplane port, inspect power, consider HBA firmware/driver issues.
If this is a hot-swap bay, move the disk to another slot if possible.

Task 5: Map the device to a kernel node and verify it’s stable

cr0x@server:~$ ls -l /dev/disk/by-id/ata-WDC_WD80EFAX_B2
lrwxrwxrwx 1 root root 9 Dec 25 01:10 /dev/disk/by-id/ata-WDC_WD80EFAX_B2 -> ../../sdg

Meaning: The disk is currently /dev/sdg. After resets, it might become sdf or sdh.

Decision: Use /dev/disk/by-id/... for all actions, not /dev/sdg.
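
To tie that kernel node to a physical slot without guessing, the sketch below helps; ledctl comes from the ledmon package and only works if your enclosure supports SES LED control.

# Serial numbers straight from the kernel; match them against the sled labels
lsblk -d -o NAME,MODEL,SERIAL,SIZE

# by-path names encode the controller/port the disk hangs off, which helps locate the slot
ls -l /dev/disk/by-path/ | grep sdg

# Optional: blink the locate LED on the suspect bay
sudo ledctl locate=/dev/sdg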

Task 6: Pull SMART health and error log

cr0x@server:~$ sudo smartctl -a /dev/sdg | egrep -i "Model|Serial|Power_On_Hours|Reallocated|Pending|Offline_Uncorrectable|UDMA_CRC|Error|SMART overall"
Model Family:     Western Digital Red
Device Model:     WDC WD80EFAX-68KNBN0
Serial Number:    XXXXXXXXXXXX
SMART overall-health self-assessment test result: PASSED
Power_On_Hours:   23144
Reallocated_Sector_Ct: 0
Current_Pending_Sector: 0
Offline_Uncorrectable: 0
UDMA_CRC_Error_Count: 27
SMART Error Log Version: 1
ATA Error Count: 14

Meaning: Media looks clean (no reallocations/pending), but CRC errors exist. CRC errors correlate strongly with cabling/backplane issues.

Decision: Reseat/replace the SATA/SAS path components. If CRC errors continue increasing after that, replace the disk or HBA.

Task 7: Check whether ZFS thinks this is a transient I/O issue or permanent corruption

cr0x@server:~$ sudo zpool status -v tank | sed -n '1,120p'
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
config:
...
errors: Permanent errors have been detected in the following files:
        tank/data/vmstore/vm-104.img

Meaning: “Permanent errors” means ZFS could not self-heal that block from redundancy at the time it was detected.
It might have happened during a write where parity couldn’t be updated safely, or the redundancy was insufficient/compromised.

Decision: Restore that file/object from a known-good source. For VM images, that usually means restoring from backup or replica.
Do not pretend a scrub will resurrect missing good data.

Task 8: Identify if a scrub or resilver is in progress (and how painful it is)

cr0x@server:~$ sudo zpool status tank | egrep -i "scan|resilver|scrub|repaired|errors"
  scan: resilver in progress since Wed Dec 25 01:21:34 2025
        1.18T scanned at 802M/s, 312G issued at 210M/s, 8.34T total
        78.1G resilvered, 3.65% done, 10:44:12 to go
errors: No known data errors

Meaning: Resilver is running. That’s high sustained I/O and a great way to trigger marginal links.

Decision: Consider temporarily reducing load (throttle workloads, or adjust resilver tunables where your OpenZFS version exposes them),
or schedule the heavy work during the lowest-risk window. If the path is unstable, resilver may never finish cleanly.
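
If you go the tunable route, first check what your platform actually exposes. On Linux OpenZFS the resilver and scrub knobs live under /sys/module/zfs/parameters; names and defaults differ between versions, so treat the ones below as examples rather than a prescription.

# Inspect current values; a missing file just means your version names the knob differently
grep -H . /sys/module/zfs/parameters/zfs_resilver_min_time_ms \
          /sys/module/zfs/parameters/zfs_vdev_scrub_max_active 2>/dev/null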

Task 9: Confirm ashift and vdev layout (to understand stress and rebuild behavior)

cr0x@server:~$ sudo zdb -C tank | egrep -i "type|ashift|nparity|children" | head -n 30
            type: 'raidz'
            nparity: 1
            ashift: 12
            children[0]:
            children[1]:
            children[2]:
            children[3]:

Meaning: ashift 12 is 4K sectors. Good. If you see ashift 9 on modern disks, you’re amplifying writes.

Decision: If ashift is wrong, plan a rebuild/migration. Don’t “tune around” a structural mistake while chasing write errors.

Task 10: Check ARC pressure and memory errors (because bad RAM can mimic disk problems)

cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep "size|c_max|memory_throttle_count|arc_no_grow" | head
size                            4    8589934592
c_max                           4    17179869184
memory_throttle_count           4    0
arc_no_grow                     4    0

Meaning: ARC isn’t throttling due to memory pressure. This doesn’t exonerate the RAM itself, but it rules out a common “system is dying” scenario.

Decision: If you suspect memory, check ECC logs (mcelog/rasdaemon) and run a maintenance test.
Don’t replace disks to fix RAM corruption.
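
Two hedged starting points for that memory check on a Linux host (rasdaemon may or may not be installed in your environment):

# If rasdaemon is running, it keeps a queryable log of corrected/uncorrected memory errors
sudo ras-mc-ctl --summary
sudo ras-mc-ctl --error-count

# Otherwise, look for machine-check and EDAC noise in the kernel log
sudo journalctl -k | egrep -i "mce|machine check|edac"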

Task 11: Inspect I/O error stats at the block layer

cr0x@server:~$ iostat -x /dev/sdg 1 5
Linux 6.5.0 (server)  12/25/2025  _x86_64_ (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.10    0.00    1.20   12.40    0.00   83.30

Device            r/s     w/s   rkB/s   wkB/s  await  svctm  %util
sdg              0.10  220.00     1.2  28160.0  85.40   4.10  90.20

Meaning: High await with high utilization could be “normal under resilver,” or it could be repeated retries due to transport issues.
You need to correlate with kernel logs.

Decision: If await spikes align with resets/timeouts in logs, escalate path remediation. If not, this might be expected during rebuild,
and you can manage by throttling rather than replacing hardware immediately.

Task 12: Confirm HBA driver/firmware class and queue behavior

cr0x@server:~$ lspci -nn | egrep -i "SAS|RAID|HBA"
03:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0097]
cr0x@server:~$ modinfo mpt3sas | egrep -i "version|firmware|srcversion" | head -n 10
version:        47.00.00.00
srcversion:     9F6C3D0F5E4D2C3A2E3A1B2

Meaning: You’re on an LSI/Broadcom SAS3 HBA stack. Great—usually solid, sometimes picky about firmware.

Decision: If many disks show intermittent write errors, suspect HBA/expander/firmware. One disk only? Suspect that lane/cable/slot/disk.

Task 13: Check for wide impact—errors across multiple devices

cr0x@server:~$ sudo zpool status tank | awk 'BEGIN{p=0} /config:/{p=1} p && $1 ~ /(ata-|scsi-|nvme|\/dev\/disk)/ {print}'
ata-WDC_WD80EFAX_A1         ONLINE  0 0 0
ata-WDC_WD80EFAX_B2         ONLINE  0 9 0
ata-WDC_WD80EFAX_C3         ONLINE  0 0 0
ata-WDC_WD80EFAX_D4         ONLINE  0 0 0

Meaning: Only one device is accumulating write errors. That reduces suspicion of global components—but doesn’t eliminate it.
A single bad port on a backplane is still a “global component” in disguise.

Decision: Focus physical inspection on that slot/cable/path first. If multiple devices show WRITE/READ errors, widen to HBA/expander/power.

Task 14: Force a controlled SMART self-test during maintenance (not during peak I/O)

cr0x@server:~$ sudo smartctl -t short /dev/sdg
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
cr0x@server:~$ sudo smartctl -a /dev/sdg | egrep -i "Self-test execution status|# 1|Completed"
Self-test execution status:      (  0) The previous self-test routine completed without error.
# 1  Short offline       Completed without error       00%     23145         -

Meaning: Short test passes; still doesn’t clear the drive for sustained writes under load, but it’s a data point.

Decision: If transport errors persist, a passing SMART test doesn’t exonerate the path. Keep chasing resets and CRC counts.

Task 15: Clear counters the right way (and only after fixing the cause)

cr0x@server:~$ sudo zpool clear tank
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: DEGRADED
config:

        NAME                        STATE  READ WRITE CKSUM
        tank                        DEGRADED 0 0 0
          raidz1-0                  DEGRADED 0 0 0
            ata-WDC_WD80EFAX_A1     ONLINE  0 0 0
            ata-WDC_WD80EFAX_B2     ONLINE  0 0 0
            ata-WDC_WD80EFAX_C3     ONLINE  0 0 0
            ata-WDC_WD80EFAX_D4     ONLINE  0 0 0

Meaning: Counters cleared. This does not fix anything; it just gives you a clean slate to see whether the issue recurs.

Decision: Clear only after remediation (cable reseat, slot move, firmware update, disk replacement), then monitor for recurrence.
If counters climb again, you have proof—not vibes.

Task 16: Controlled write test on a non-production spare (never the live vdev)

cr0x@server:~$ sudo fio --name=writecheck --filename=/mnt/testdisk/fio.dat --rw=write --bs=1M --ioengine=libaio --iodepth=16 --numjobs=1 --size=8G --direct=1 --runtime=120 --time_based=1
writecheck: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=16
fio-3.33
writecheck: (groupid=0, jobs=1): err= 0: pid=25191: Wed Dec 25 01:58:23 2025
  write: IOPS=210, BW=210MiB/s (220MB/s)(25.2GiB/123000msec)
    clat (usec): min=950, max=420000, avg=7700.12, stdev=9500.10

Meaning: This is a basic sustained write test. It can expose transport resets when paired with log watching.

Decision: If kernel logs show resets during a simple fio run, you’ve reproduced the problem under controlled conditions.
Fix the path. Don’t blame ZFS for reporting it.
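
The “paired with log watching” part is literal. In a second terminal while fio runs, something like:

# Any reset/timeout lines that appear only while the write load is running are your reproduction
sudo journalctl -k -f | egrep -i "reset|timeout|offline|aborted"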

Three corporate mini-stories from the real world

1) Incident caused by a wrong assumption: “SMART passed, so the disk is fine”

A mid-sized SaaS shop ran a ZFS-backed VM cluster. One host started reporting a handful of write errors on one disk.
The on-call engineer did the classic move: smartctl showed “PASSED,” no reallocations, no pending sectors.
They cleared the pool errors and moved on.

Two days later, during the backup window, the same disk dropped out. The pool degraded, resilver started, and
the host’s latency went from “fine” to “helpdesk apocalypse.” The on-call saw SATA link resets in the logs
but assumed it was a side effect of the resilver rather than the cause.

A senior SRE finally looked at the SMART attributes again—specifically the interface error counters.
UDMA CRC errors had climbed steadily. The disk wasn’t failing to store data; it was failing to communicate reliably.
The “passed” health summary was technically correct and operationally useless.

The fix was embarrassing in the way good fixes often are: replace a single SATA cable and move the drive to a different backplane port.
Errors stopped. The resilver finished. No more dropouts.

The wrong assumption wasn’t that SMART lies. The assumption was that SMART is the primary truth.
In ZFS incidents, the primary truth is “what I/O did the system fail to complete,” and ZFS is closer to that truth than the disk’s marketing summary.

2) Optimization that backfired: “More throughput” via aggressive tuning

Another company had a storage node that looked underpowered on paper: lots of disks, a busy database workload, and a team allergic to buying hardware.
They did what engineers do when procurement says no: tuning.

They cranked recordsize without understanding the I/O profile, tweaked sync settings in the name of throughput,
and raised queue depths in the controller stack to “keep the disks busy.” Benchmarks improved. Slides were made.
Production, predictably, did something else.

Under peak load, the system began accumulating write errors on two disks behind the same expander.
Kernel logs showed timeouts and resets. The tuning had increased the duration and burstiness of writes, stretching
the expander’s tolerance and amplifying any tiny signal integrity issues. The system didn’t fail immediately; it failed intermittently,
which is the worst kind because it wastes everyone’s time.

They rolled back most of the tuning and replaced a marginal expander/backplane module.
Throughput decreased a bit, tail latency improved a lot, and the write errors vanished.
The team learned the lesson nobody wants: a faster path to failure is still a path to failure.

Joke #2: “We tuned it for performance” is sometimes just a fancy way of saying “we made the future incident more efficient.”

3) Boring but correct practice that saved the day: device naming discipline and staged replacements

A finance org ran OpenZFS on Linux for internal file services. Not glamorous, but they treated it like a real system:
stable by-id device names in pools, labeled drive sleds, documented slot maps, and a strict policy that no one hot-swaps anything
without a second person verifying the serial number.

One afternoon, write errors appeared on a single vdev member. The on-call pulled zpool status -P,
confirmed the by-id path, mapped it to a physical slot, and checked logs: intermittent resets on one phy.
They opened a ticket to facilities for a scheduled maintenance window and pre-staged a replacement cable and a spare disk.

During the window, they moved the disk to a new slot and replaced the cable. They cleared errors and ran a scrub.
No recurrence. They didn’t even need the spare disk, but having it on hand prevented the “we’ll order it and wait” trap.

Nobody got an award for this. That’s the point. The boring practices—stable naming, physical labeling, change control—turned a likely outage
into a routine maintenance task.

Common mistakes: symptom → root cause → fix

1) “WRITE errors on one disk, SMART says PASSED”

Symptom: ZFS shows write errors; SMART overall-health is “PASSED,” media attributes look clean.

Root cause: Transport instability (CRC errors, link resets), backplane port, cable, expander lane, or HBA hiccups.

Fix: Check kernel logs for resets; inspect/replace cable; move the disk to another slot; update HBA/expander firmware; only then consider disk replacement.

2) “Scrub ran clean, but write errors keep rising”

Symptom: Scrub reports 0 errors, yet WRITE counter increments.

Root cause: Scrub reads; your issue is on writes (or link resets during writes). Scrub doesn’t validate the write path under your workload.

Fix: Correlate with write-heavy periods; reproduce with controlled writes off-peak; stabilize the path.

3) “Multiple disks suddenly show WRITE errors”

Symptom: Several vdev members get write errors within minutes/hours.

Root cause: Shared component failure (HBA, expander, backplane power, PCIe issues, firmware bug).

Fix: Stop replacing disks like it’s a ritual. Inspect shared hardware, check PCIe AER logs, review recent firmware/kernel changes, and consider rolling back.
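
For the PCIe AER part, two hedged starting points; 03:00.0 is the HBA address from Task 12 and will differ on your host.

# Corrected/uncorrected PCIe errors logged by the kernel's AER driver
sudo journalctl -k | grep -i " AER"

# Error-status registers on the HBA's slot (root needed for the full capability dump)
sudo lspci -vvv -s 03:00.0 | egrep -i "AER|UESta|CESta"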

4) “Device keeps dropping and returning; pool flaps”

Symptom: Disk goes FAULTED/REMOVED then comes back ONLINE after resets.

Root cause: Link instability or power intermittency, sometimes a marginal PSU rail to the backplane.

Fix: Treat flapping as urgent. Replace cable/slot/backplane; check power connectors; confirm PSU health and backplane power delivery.

5) “WRITE errors after changing sync settings”

Symptom: After performance tuning, write errors appear under peak load.

Root cause: Tuning increased write burstiness/queue depth, exposing controller/expander weaknesses.

Fix: Roll back tuning; re-test; add SLOG only if you understand your sync workload and have enterprise-grade devices.

6) “Permanent errors in files even with redundancy”

Symptom: ZFS reports permanent errors for specific files.

Root cause: At the time of the event, redundancy couldn’t provide a good copy (multiple errors, partial writes, or prior silent damage).

Fix: Restore from backup/replica; then investigate why redundancy didn’t save you (second failing device, unstable path, misconfiguration).

7) “Resilver never finishes; errors keep happening”

Symptom: Resilver slows or restarts; write/read errors continue.

Root cause: The rebuild load is triggering the failure repeatedly. This is common with marginal cables, expanders, or overheated HBAs.

Fix: Reduce load, improve cooling, replace the suspect path component, and only then attempt resilver again.

Checklists / step-by-step plan

Immediate containment (0–30 minutes)

  1. Run zpool status -v and capture output in the incident notes.
  2. Identify whether redundancy is intact. If not, move to data protection mode: stop writes, snapshot what you can, prepare restore.
  3. Check if a scrub/resilver is active. If resilver is running, expect the system to be fragile.
  4. Pull kernel logs for resets/timeouts. Transport errors shift your priority from “disk health” to “path stability.”
  5. Map the device to by-id and physical location. Don’t touch hardware until you know exactly what you’re pulling.

Root-cause isolation (same day)

  1. Compare ZFS counters across all devices. One device: likely local. Many devices: likely shared component.
  2. Check SMART attributes that indicate path issues (CRC, command timeouts), not just reallocated sectors.
  3. Inspect/reseat/replace the cable; move disk to another slot if the chassis allows it.
  4. Review HBA/expander firmware and recent kernel updates; regression is a real thing.
  5. Clear ZFS errors only after remediation to confirm whether the problem recurs.

Recovery and validation (after hardware work)

  1. If a disk was replaced, let resilver complete with monitoring. Don’t declare victory at 3%.
  2. Run a scrub after resilver to validate redundancy and discover latent read issues.
  3. Verify application data integrity for any files ZFS flagged as permanent errors.
  4. Set or adjust alerting: write errors > 0 should page during business hours, not wait for the pool to degrade.

Hardening (the part that prevents repeats)

  1. Use by-id device naming in all pools. If you inherited a pool using /dev/sdX, plan a controlled migration.
  2. Standardize HBAs in IT mode and keep firmware consistent across the fleet.
  3. Label sleds with serial numbers and maintain a slot map. This is how you avoid “replaced the wrong disk” incidents.
  4. Monitor kernel reset messages and SMART CRC counters alongside ZFS error counters (a minimal counter-watch sketch follows this list).
  5. Do periodic, scheduled scrubs and track trends, not just pass/fail.
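
That monitoring item doesn’t need a heavyweight stack. A minimal cron-able sketch, assuming the pool is named tank and that the last line gets wired into whatever actually pages you; the script name is illustrative, and you can bolt on the smartctl and journalctl greps from the tasks above.

#!/usr/bin/env bash
# zfs-error-watch.sh: flag any device with non-zero READ/WRITE/CKSUM counters (run from root's cron)
set -u
POOL=tank

ERRS=$(zpool status -p "$POOL" | awk '
  /NAME.*READ.*WRITE.*CKSUM/ { in_cfg = 1; next }
  in_cfg && NF >= 5 && ($3 + $4 + $5) > 0 { print $1, "READ=" $3, "WRITE=" $4, "CKSUM=" $5 }')

if [ -n "$ERRS" ]; then
  logger -t zfs-error-watch "non-zero error counters on $POOL: $ERRS"
  echo "$ERRS"   # replace or extend with your paging/ticketing hook
fi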

FAQ

1) Are ZFS write errors always a failing disk?

No. They often indicate the write I/O didn’t complete successfully, which can be disk firmware, but very commonly is
cabling, backplane, expander, HBA, or power. Check transport logs before you order a pallet of drives.

2) What’s the difference between write errors and checksum errors in ZFS?

Write errors mean ZFS couldn’t successfully commit data to the device/path. Checksum errors mean data was read but didn’t match its checksum—
corruption or mis-delivery. Both are bad; they point to different layers.

3) If I run a scrub and it reports 0 errors, am I safe?

Safer, not safe. Scrub is read-heavy and validates stored data. It doesn’t prove your write path is stable under your real workload.
If write errors are incrementing, you still have a write-path reliability issue.

4) Should I clear ZFS errors with zpool clear?

Yes, but only after you’ve fixed something and want to validate that the fix worked. Clearing as a “make the alert go away” move is how incidents mature.

5) Why do write errors often show up during resilver?

Resilver is sustained, heavy write activity plus reads across the vdev. It turns marginal transport issues into obvious failures.
Think of resilver as a stress test you didn’t schedule.

6) My pool is mirrored. If one side has write errors, can ZFS still succeed?

Often, yes. But the mirror only protects you if the other side is healthy and the system can complete transactions safely.
A flapping device can still cause latency spikes and operational risk.

7) Do I need ECC RAM to avoid write errors?

ECC is strongly recommended for ZFS, but write errors specifically are usually I/O completion failures, not memory corruption.
ECC helps prevent checksum errors and silent corruption. It doesn’t fix bad cables.

8) Can an HBA cause write errors without any disk SMART problems?

Absolutely. If the HBA or expander resets links, mishandles queueing, overheats, or has firmware bugs, ZFS will see write I/O failures.
SMART can look pristine because the disk never got a clean chance to do its job.

9) What alert threshold should I use for write errors?

For most production systems: any non-zero write errors on an otherwise healthy device should trigger investigation.
If they increase over time, escalate. Trends matter more than a single number, but the first write error is the earliest warning.

10) If ZFS reports “permanent errors,” does that mean total data loss?

Not total, but it means ZFS couldn’t repair certain blocks from redundancy. The affected files/objects need restoration from backup or replicas.
Then you need to find out why redundancy wasn’t enough at that moment.

Conclusion: practical next steps

ZFS write errors are not a trivia stat. They’re a failure pattern. They predict dropouts because they’re often the first visible sign
that the write path—disk, link, controller, enclosure—can’t reliably complete I/O under pressure.

Do three things the next time you see WRITE > 0:

  1. Correlate immediately: ZFS counters + kernel logs + SMART interface errors. Find the failing layer.
  2. Stabilize the path: reseat/replace cables, move slots, check HBAs and firmware, and don’t let a resilver grind on unstable hardware.
  3. Prove the fix: clear counters after remediation and watch for recurrence. If the count climbs again, stop negotiating and replace components.

Storage reliability is mostly about refusing to be surprised by the same failure twice.
ZFS is giving you a forecast. Use it.
