Silent corruption is the worst kind of storage failure because it’s not dramatic. Nothing beeps. Nothing crashes.
Your monitoring stays green. And then, weeks later, a finance report has a “weird” value, a VM won’t boot, or a backup
can’t restore because the data was already wrong when you backed it up.
RAID can keep disks online. That’s not the same as keeping data correct. ZFS was designed to notice when bytes
change behind your back, and—when redundancy exists—repair them automatically. This is where ZFS earns its reputation
and where many RAID stacks quietly tap out.
Silent corruption: what it is and why RAID can miss it
Silent corruption (a.k.a. “bit rot,” “latent sector errors,” “misdirected writes,” or “oops the controller lied”)
is any case where the storage stack returns data that is not what was written, without raising an I/O error.
The key word is silent: the OS gets a successful read, your app gets bytes, and nobody throws an exception.
RAID—especially hardware RAID or “mdadm-only” RAID—typically focuses on device availability. A disk fails, parity or
a mirror keeps the volume online. Great. But RAID often has limited visibility into whether the block content is
correct as originally written by the application. It can detect some errors (like an unreadable sector),
but it can also happily deliver corrupted data because it has no end-to-end verification from app → filesystem → disk.
Here’s the uncomfortable truth: many RAID designs assume the rest of the stack is honest. In production, that’s a
risky assumption. Controllers have firmware bugs. HBAs have quirks. Cables and backplanes do weird things at scale.
Disks remap sectors. DMA can misbehave. Memory flips bits. And yes, sometimes humans “fix” a storage problem by
swapping slots, re-cabling, or updating firmware—and accidentally create a new one.
ZFS takes a different approach: it assumes the world is hostile. It checks everything it can, and it treats “success”
from a lower layer as a suggestion, not a guarantee.
What ZFS “sees” that classic RAID often doesn’t
End-to-end checksums: the big idea
ZFS stores a checksum for every block it writes—data and metadata—and stores that checksum in the block’s parent
metadata (not alongside the block itself). When ZFS reads a block, it recomputes the checksum and compares it to the
expected value. If the checksums mismatch, ZFS knows the data is wrong even if the disk said “read OK.”
That checksum is end-to-end: it covers the path from the filesystem’s perspective down to the device. ZFS doesn’t
rely on the drive’s internal ECC or the controller’s parity logic as the only integrity mechanism. It uses
them as defense-in-depth.
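The checksum algorithm is a per-dataset property you can inspect and, if you want a stronger hash, change. A minimal sketch (the dataset name tank/vmstore is illustrative; the default value "on" maps to fletcher4 on current OpenZFS, and a change only affects blocks written afterward):
cr0x@server:~$ zfs get checksum tank/vmstore
cr0x@server:~$ sudo zfs set checksum=sha256 tank/vmstore
Leave the default unless you have a specific reason to change it; never set checksum=off.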
Self-healing reads (when redundancy exists)
If the pool has redundancy (mirror, RAIDZ, dRAID), ZFS can try alternate copies when it detects corruption. It reads
from another replica, validates the checksum, and then repairs the bad copy automatically. That’s not “rebuild when a
disk dies.” That’s “repair a single bad block while the disk is still alive.”
RAID parity can reconstruct data when a disk is missing. But if a disk returns a bad block without an error, RAID
often doesn’t know which disk is wrong, and it may not have any higher-level checksum to prove it. You end up with a
Schrödinger’s block: it’s both correct and incorrect until you try to use it.
Scrubs: proactive detection instead of surprise failure
ZFS scrubs walk the allocated space in the pool, read blocks, verify checksums, and repair when possible. This is how
you catch latent sector errors before the only good copy gets overwritten or before you need that data during
an incident.
Copy-on-write: fewer “torn writes,” different failure surface
ZFS is copy-on-write (CoW). It writes new blocks elsewhere, then updates metadata pointers atomically. CoW doesn’t
magically prevent all corruption, but it reduces a classic RAID problem: partial writes leaving a filesystem in an
inconsistent state (especially after power loss). Traditional filesystems rely on journaling and correct ordering.
ZFS relies on transactional semantics at the filesystem layer.
Checksumming metadata too, because losing the map is worse than losing the territory
Plenty of systems treat metadata as “small and reliable.” In practice, metadata corruption is often catastrophic:
directory structures, allocation tables, indirect blocks. ZFS checksums metadata and can self-heal it with redundancy.
That’s a huge difference in failure behavior: instead of “filesystem needs fsck and prayers,” you often get “ZFS healed
it during the read.”
Better error accounting: you get receipts
ZFS error counters distinguish between read errors, write errors, and checksum errors. That last category is the
interesting one: the drive didn’t error, but the data didn’t match. Many RAID setups never expose this nuance. ZFS
will tell you when the storage stack is lying.
An idea often attributed to W. Edwards Deming fits storage operations too: without measurement, you're managing by guesswork.
ZFS measures integrity continuously.
RAID’s strengths, and the gaps people confuse for safety
RAID is about availability, not correctness
RAID’s core job is to keep a block device online when a disk fails. It’s good at that. It’s also good at hiding disk
failures from the OS, which can be convenient and terrible at the same time: convenient because servers keep running,
terrible because the first time you notice a problem may be when the RAID card is already juggling multiple
degraded conditions.
“But the RAID controller has a battery and patrol read”
Battery-backed (or flash-backed) write cache is great for performance and can reduce some power-loss hazards. Patrol
read can detect unreadable sectors. Neither guarantees end-to-end correctness.
- Patrol read is a disk-surface health tool. It doesn't validate that the bytes read match what the application wrote months ago.
- Some controller stacks use checksums internally, but they're usually not exposed to the filesystem, and they don't protect you from misdirected writes or host-side corruption.
The write hole and parity nightmares
RAID5/6 parity writes can suffer from the “write hole” if power is lost mid-stripe update. Modern controllers mitigate
this with cache + journaling, but implementations vary. Software RAID has its own mitigation strategies, also variable.
ZFS avoids the classic write hole by design because it never overwrites in-place; it commits new consistent trees.
When RAID rebuilds, it stress-tests your weakest link
Rebuild/resilver reads huge amounts of data. This is when latent read errors show up. A RAID6 rebuild can still fail
if there are too many unreadable sectors at the wrong time. ZFS resilver can be faster in some scenarios because it
only reconstructs allocated blocks, not every sector of a disk.
Joke #1: RAID is like a spare tire—you’re happy it’s there, right up until you realize it’s been flat for months.
Facts and historical context worth knowing
- ZFS originated at Sun Microsystems in the early-to-mid 2000s, designed to combine volume management and filesystem into one coherent stack.
- End-to-end checksumming was a core design point from the beginning: ZFS treats silent corruption as a first-class problem, not an edge case.
- Copy-on-write snapshots weren’t “nice to have”—they’re structurally tied to transactional consistency and safe on-disk updates.
- RAID predates modern multi-terabyte disks by decades; many RAID assumptions were shaped when rebuild windows were short and URE risk was lower.
- Hardware RAID controllers became popular partly to offload CPU and standardize disk management, back when parity math and caching mattered more.
- ZFS scrubbing formalized what many admins did manually: periodic full reads to surface latent errors before they become data loss.
- Enterprise disks advertise URE rates (unrecoverable read errors), reminding you that “a disk that hasn’t failed yet” can still fail during rebuild.
- OpenZFS became the multi-platform lineage, with active development across illumos, FreeBSD, Linux, and others—important because storage bugs don’t respect OS boundaries.
- “Bit rot” isn’t a single phenomenon; it’s a bucket name for media defects, firmware bugs, misdirected writes, and RAM/cable/controller faults.
Failure modes: how corruption happens in real fleets
Misdirected writes: the scariest corruption you don’t see coming
A misdirected write is when the system writes the right data to the wrong place. Your application writes block A, but
the disk/controller writes it to block B. Both writes “succeed.” RAID happily mirrors or parity-protects the wrong
location. Later, reads return data that passes drive ECC, and RAID thinks everything is fine.
ZFS catches this because the checksum stored in metadata for block A won’t match what’s now at block A. With
redundancy, ZFS can read another copy and repair.
DMA / memory corruption: your filesystem can’t fix what it can’t detect
If bad RAM flips bits before data hits disk, the disk stores corrupted content. RAID will faithfully protect the wrong
bytes. ZFS will compute a checksum on write—so if the corruption happens before checksumming, the checksum
matches the corrupted data and ZFS cannot know it was wrong.
That’s why ECC memory matters in ZFS systems. ZFS is an integrity system; it still needs trustworthy memory.
Partial writes and ordering: power loss isn’t polite
Sudden power loss can create torn writes: half old, half new. Some drives lie about flush completion. Some
controllers reorder. Filesystems try to cope with journaling, barriers, and careful flushes.
ZFS’s transactional CoW model reduces exposure, but you still need sane write ordering and honest flush semantics.
If your drive acknowledges writes it didn’t persist, you can still lose the last few seconds of data. The difference:
you’re less likely to get a corrupt filesystem structure; more likely to lose recent transactions.
Bad sectors: the classic, boring villain
Latent sector errors show up at the worst time: during rebuild, scrub, or restore. RAID can reconstruct if it can
identify the bad read and still has enough redundancy. ZFS can do the same, but it also validates correctness with
checksums and can repair while the disk remains online.
Joke #2: Storage vendors promise “five nines,” but they never say whether those nines refer to uptime or your sleep.
Fast diagnosis playbook
When you suspect corruption or you’re seeing ZFS errors, speed matters. Not because ZFS is fragile—because the longer
you run degraded, the more likely you’ll stack failures. Here’s a fast, practical order of operations that works in
real on-call life.
1) Confirm whether it’s integrity (checksum) or I/O (read/write) failure
- Check zpool status -v first. If you see CKSUM increments, that's "device returned data but it was wrong."
- If you see READ or WRITE errors, that's more classic device or transport failure.
2) Determine if redundancy can heal it
- Mirrors and RAIDZ can correct many checksum errors automatically.
- Single-disk pools cannot. They can detect corruption, but not repair it.
3) Localize the problem: one disk, one path, or systemic
- If errors cluster on one vdev member, suspect that disk, its cable, that bay, or that HBA port.
- If multiple disks on the same HBA show issues, suspect the controller, backplane, expander, or firmware.
- If checksum errors appear across many devices with no pattern, suspect memory, CPU instability, or a driver bug.
4) Decide: scrub, offline/replace, or stop writes
- Run a scrub if you have redundancy and the pool is stable; it will surface and fix more issues.
- Offline/replace a device that keeps accumulating errors.
- Stop writes (or freeze workloads) if you’re seeing active corruption without a clear containment strategy.
5) Validate backups by restoring, not by believing
If corruption is suspected, do an actual restore test of representative data. ZFS protects what it stores, not what
your backup system mishandles.
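A concrete way to restore-and-verify instead of believing: restore a sample into a scratch location and hash it against a snapshot-consistent source copy. A sketch with standard tools (the snapshot name and the /mnt/restore-test path are hypothetical; don't hash a live, changing VM image and expect a match, and prefer application-level consistency checks for databases):
cr0x@server:~$ sha256sum /tank/vmstore/.zfs/snapshot/auto-2025-12-25-0100/vm/guest1/disk0.img
cr0x@server:~$ sha256sum /mnt/restore-test/guest1/disk0.img
If the hashes differ, you've learned something important before you needed the restore to work.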
Hands-on tasks: commands, outputs, and what to do next
Below are practical tasks I expect an operator to run during commissioning, weekly hygiene, and incident response.
Each task includes a realistic command, typical output, what it means, and the decision you make.
Task 1: Identify corruption vs I/O failure
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data corruption.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
scan: scrub repaired 0B in 00:12:41 with 3 errors on Thu Dec 25 03:10:12 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
sda ONLINE 0 0 3
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
tank/vmstore@auto-2025-12-24:/vm/guest1/disk0.img
Meaning: CKSUM errors on sda indicate bad data returned without an I/O error. ZFS found at least one permanent error in a specific file.
Decision: If redundancy exists, replace sda and restore or recreate the affected file. If the file is a VM disk, treat it as critical: restore from a known-good snapshot/backup.
Task 2: Start a scrub (and know what it will do)
cr0x@server:~$ sudo zpool scrub tank
Meaning: A scrub will read allocated blocks, validate checksums, and repair using redundancy.
Decision: Schedule scrubs regularly; run one after replacing hardware or after a suspicious event (power loss, cable reseat, controller reset).
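If your distribution doesn't already ship a scrub timer, a plain cron entry is enough. A minimal sketch, assuming zpool lives at /usr/sbin/zpool and a monthly scrub at 03:00 on the 1st suits your load profile (several distros package their own scrub cron jobs or systemd timers, so check before adding a duplicate):
cr0x@server:~$ sudo crontab -e
0 3 1 * * /usr/sbin/zpool scrub tank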
Task 3: Check scrub progress and whether it’s healing or just reading
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Fri Dec 26 01:20:41 2025
412G scanned at 1.31G/s, 88G issued at 288M/s, 6.02T total
0B repaired, 1.43% done, 06:12:11 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
Meaning: “0B repaired” is good; if “repaired” grows, ZFS is actively healing.
Decision: If repair bytes rise repeatedly across scrubs, treat it as a hardware path problem (disk, cable, HBA) not a one-off cosmic event.
Task 4: Inspect error counters over time
cr0x@server:~$ zpool status -p tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda ONLINE 0 0 24
sdb ONLINE 0 0 0
Meaning: CKSUM counters accumulate until you clear them or the pool is re-imported (for example, after a reboot); they won't reset on their own while the pool stays imported.
Decision: Clear counters only after remediation and documentation. If CKSUM continues increasing, replace/relocate that device and check the path.
Task 5: Clear errors after you’ve fixed something (not before)
cr0x@server:~$ sudo zpool clear tank sda
Meaning: Clears error counts for sda so new errors are visible.
Decision: Only do this after you’ve reseated/replaced hardware; otherwise you’re deleting evidence mid-investigation.
Task 6: List datasets and mountpoints (to find what data might be affected)
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint -r tank
NAME USED AVAIL REFER MOUNTPOINT
tank 4.12T 2.55T 128K /tank
tank/vmstore 3.40T 2.55T 3.40T /tank/vmstore
tank/backups 612G 2.55T 612G /tank/backups
Meaning: You can map pool health issues to business-impacting datasets quickly.
Decision: If permanent errors cite a path, identify the dataset and decide whether to restore, rehydrate from replication, or rebuild generated data.
Task 7: Find snapshots (useful for rolling back damaged data)
cr0x@server:~$ zfs list -t snapshot -o name,creation -r tank/vmstore | tail
tank/vmstore@auto-2025-12-24-0100 Wed Dec 24 01:00 2025
tank/vmstore@auto-2025-12-25-0100 Thu Dec 25 01:00 2025
tank/vmstore@auto-2025-12-26-0100 Fri Dec 26 01:00 2025
Meaning: If corruption is recent, a snapshot from before the event may be clean.
Decision: Consider cloning and validating before rollback. Don’t blindly rollback a busy dataset without coordinating with app owners.
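"Clone and validate" in practice is three commands. A sketch (snapshot and clone names are illustrative); the clone is writable, nearly free, and mounts under the pool by default, so you can inspect the data without touching the live dataset:
cr0x@server:~$ sudo zfs clone tank/vmstore@auto-2025-12-25-0100 tank/vmstore-verify
cr0x@server:~$ ls /tank/vmstore-verify
cr0x@server:~$ sudo zfs destroy tank/vmstore-verify
Only after the clone checks out do you schedule the actual zfs rollback with the application owners, because rollback discards everything written after that snapshot.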
Task 8: Validate pool properties that influence integrity behavior
cr0x@server:~$ zfs get -o name,property,value -s local,default checksum,compression,recordsize tank/vmstore
NAME PROPERTY VALUE
tank/vmstore checksum on
tank/vmstore compression lz4
tank/vmstore recordsize 128K
Meaning: Checksumming is on by default; turning it off is self-sabotage.
Decision: Leave checksum=on. If you inherit odd properties from templates, fix them before production.
Task 9: Look for devices by persistent IDs (avoid “sda roulette”)
cr0x@server:~$ ls -l /dev/disk/by-id/ | egrep 'ata-|scsi-' | head
lrwxrwxrwx 1 root root 9 Dec 26 00:10 ata-ST8000NM000A_ZR123ABC -> ../../sda
lrwxrwxrwx 1 root root 9 Dec 26 00:10 ata-ST8000NM000A_ZR123DEF -> ../../sdb
lrwxrwxrwx 1 root root 9 Dec 26 00:10 ata-ST8000NM000A_ZR123GHI -> ../../sdc
Meaning: Device names like /dev/sda can change across boots; by-id is stable.
Decision: Use by-id paths when creating pools and when replacing devices to avoid replacing the wrong disk (a real and popular hobby).
Task 10: Check SMART health for early warning signs
cr0x@server:~$ sudo smartctl -a /dev/sda | egrep 'SMART overall-health|Reallocated_Sector_Ct|Current_Pending_Sector|UDMA_CRC_Error_Count'
SMART overall-health self-assessment test result: PASSED
5 Reallocated_Sector_Ct 0x0033 098 098 010 Pre-fail Always 12
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always 1
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always 37
Meaning: Pending sectors and CRC errors point to media problems and/or cabling issues. “PASSED” doesn’t mean “healthy,” it means “not dead yet.”
Decision: Replace disk if pending sectors persist or reallocated sectors trend upward. If CRC errors climb, replace cable/backplane slot and watch if errors stop.
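After swapping a cable or bay, you can ask the drive to re-read its own surface instead of waiting for the next scrub. A sketch using smartctl's built-in self-tests (device path illustrative; a long test on an 8TB disk can take many hours and runs in the background):
cr0x@server:~$ sudo smartctl -t long /dev/sda
cr0x@server:~$ sudo smartctl -l selftest /dev/sda
The second command shows the self-test log once the test has finished.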
Task 11: Confirm the pool isn’t accidentally built on a RAID controller volume
cr0x@server:~$ lsblk -o NAME,SIZE,TYPE,MODEL
NAME SIZE TYPE MODEL
sda 7.3T disk ST8000NM000A
sdb 7.3T disk ST8000NM000A
sdc 7.3T disk ST8000NM000A
nvme0n1 1.8T disk Samsung SSD 990
Meaning: If you see a single huge “disk” that’s actually a RAID LUN from a controller, ZFS cannot see individual disks and can’t do proper self-healing per device.
Decision: Prefer HBA/JBOD mode for ZFS. If you’re already deployed on a RAID LUN, plan a migration; don’t wait for the controller to become your single point of regret.
Task 12: Replace a failing disk correctly (mirror example)
cr0x@server:~$ sudo zpool replace tank /dev/disk/by-id/ata-ST8000NM000A_ZR123ABC /dev/disk/by-id/ata-ST8000NM000A_ZR999XYZ
Meaning: ZFS starts resilvering to the new disk.
Decision: Monitor resilver. Don’t pull the old disk until you see the new one online and resilver progressing.
Task 13: Monitor resilver and interpret “resilvered X” properly
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
scan: resilver in progress since Fri Dec 26 02:11:05 2025
842G scanned at 780M/s, 216G issued at 200M/s, 2.90T total
216G resilvered, 7.43% done, 05:41:32 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
replacing-0 ONLINE 0 0 0
/dev/disk/by-id/ata-ST8000NM000A_ZR123ABC ONLINE 0 0 24
/dev/disk/by-id/ata-ST8000NM000A_ZR999XYZ ONLINE 0 0 0
sdb ONLINE 0 0 0
Meaning: Resilver “issued” shows what’s being written; “scanned” includes what’s being checked. The old disk still has checksum history; that’s fine.
Decision: If resilver slows to a crawl, check for other I/O load, failing disks, or an HBA bottleneck. If checksum errors occur during resilver, pause and investigate the path.
Task 14: After resilver, verify no new errors and run a scrub
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
scan: resilvered 2.90T in 05:55:12 with 0 errors on Fri Dec 26 08:06:17 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-ST8000NM000A_ZR999XYZ ONLINE 0 0 0
sdb ONLINE 0 0 0
Meaning: Resilver completed cleanly.
Decision: Kick off a scrub soon after, especially if the incident involved cabling/controller instability. You want confidence, not vibes.
Task 15: Check ZFS event history for the timeline
cr0x@server:~$ sudo zpool events -v | tail -n 20
TIME CLASS
Dec 26 2025 01:18:43.221001000 ereport.fs.zfs.checksum
Dec 26 2025 02:11:05.102334000 sysevent.fs.zfs.resilver_start
Dec 26 2025 08:06:17.552190000 sysevent.fs.zfs.resilver_finish
Meaning: Events tell you when checksum issues began and what remediation occurred.
Decision: Use this to correlate with maintenance windows, firmware updates, power events, or workload spikes.
Task 16: Confirm ashift and avoid accidental 512e performance/latency traps
cr0x@server:~$ zdb -C tank | egrep 'ashift|path' | head -n 10
path: /dev/disk/by-id/ata-ST8000NM000A_ZR123DEF
ashift: 12
path: /dev/disk/by-id/ata-ST8000NM000A_ZR123GHI
ashift: 12
Meaning: ashift=12 indicates 4K sectors. Wrong ashift can cause write amplification and timeouts that look like flaky disks.
Decision: If ashift is wrong on a pool, plan migration. You can’t change ashift in-place safely in the general case.
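Because ashift is fixed per vdev at creation, set it explicitly when building pools rather than trusting what the drive reports. A minimal sketch (pool layout and device IDs reuse this chapter's placeholder names):
cr0x@server:~$ sudo zpool create -o ashift=12 tank mirror \
    /dev/disk/by-id/ata-ST8000NM000A_ZR123ABC \
    /dev/disk/by-id/ata-ST8000NM000A_ZR123DEF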
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-size SaaS company ran a database cluster on “enterprise” storage: dual-controller SAN, RAID6, the usual reassuring
acronyms. The team’s assumption was simple: RAID6 means data corruption is basically handled. They focused on disk
failures and spare drives, not data integrity.
Over a few months, they saw intermittent query weirdness: a small subset of rows would fail application-level checks.
Nothing consistent. No obvious disk failures. Storage support asked for logs, the team collected them, and the problem
politely refused to reproduce during business hours.
The turning point was a restore test (not during an outage—just a scheduled fire drill). The restored database passed
checksums at the database layer for most tables, but a couple of indexes were subtly corrupted. The backups had
captured the corruption because the corruption had already been present and silent.
They moved the workload to ZFS mirrors on commodity servers with ECC memory, scrubbed weekly, and enabled frequent
snapshots. The scary part wasn’t that RAID failed—it did what it promised. The wrong assumption was believing RAID
promised correctness. RAID promised “your volume stays online when a disk dies.” Different contract.
Mini-story 2: The optimization that backfired
Another org had a virtualization cluster with ZFS on Linux. Performance was fine, but someone wanted “better” numbers:
lower latency, higher IOPS. They added an L2ARC device and tuned recordsize and caching aggressively. Then they got
bold: they disabled sync writes at the dataset level for VM storage because “the UPS is solid.”
Months later a controller firmware update triggered an unexpected reboot loop on one host. The UPS was indeed solid.
The host was not. After the dust settled, several VMs wouldn’t boot. ZFS itself was consistent—CoW did its job—but the
guest filesystems had lost transactions they believed were durable.
The optimization wasn’t the L2ARC. It was the decision to lie about durability. Disabling sync is like removing the
seatbelt because it wrinkles your shirt. It looks great until the first sharp stop.
They rolled back the setting, added a proper SLOG device for workloads that needed sync latency improvements, and
implemented a change review checklist specifically for storage durability settings. Performance recovered to “good
enough,” and correctness returned to “not negotiable.”
Mini-story 3: The boring but correct practice that saved the day
A financial services shop ran OpenZFS for document storage. Nothing glamorous. The team had three boring habits:
monthly scrubs, quarterly restore tests, and a rule that any checksum error triggered a physical inspection of the
entire path: drive, bay, backplane, HBA port, cable.
One month, a scrub reported a handful of checksum repairs on one mirror member. No outages, no user complaints. The
on-call engineer filed a ticket and followed the rule: pull SMART, check CRC errors, inspect the SAS cable. SMART
showed rising CRC counts, but the disk looked fine otherwise.
They swapped the cable and moved the disk to a different bay. Errors stopped. The next scrub repaired nothing. Later,
during a scheduled maintenance, they found the original backplane slot had intermittent contact under vibration.
Nobody wrote a postmortem novel because nothing went down. That’s the point. The boring practice turned a “silent
corruption someday” story into a non-event.
Common mistakes (symptom → root cause → fix)
1) Symptom: CKSUM errors increase, but SMART looks “PASSED”
Root cause: SMART overall status is too coarse; checksum errors often indicate transport, firmware, or misdirected writes, not just dying media.
Fix: Check SMART attributes (CRC/pending/realloc), swap cables, move bays, update HBA firmware, and replace the disk if errors persist after path remediation.
2) Symptom: Permanent errors in files after scrub
Root cause: ZFS detected corruption but couldn’t repair because redundancy was insufficient, or all replicas were bad, or the corrupted data was overwritten.
Fix: Restore affected files from backup or a clean snapshot/replica. Then investigate why redundancy didn’t save you (single-disk vdev, degraded pool during event, multiple faults).
3) Symptom: Scrubs take forever and impact performance badly
Root cause: Underprovisioned I/O, failing disk retry storms, or a pool design that forces wide reads. Sometimes it’s just “you built a busy pool and scheduled scrub at peak.”
Fix: Run scrubs off-peak, investigate slow devices, cap scrub/resilver priority where available, and consider more vdevs/mirrors for parallelism.
4) Symptom: RAID controller shows healthy, but ZFS inside a VM reports corruption
Root cause: ZFS can’t see physical disks; it sees a virtual disk or RAID LUN. Corruption may be introduced below the guest.
Fix: Prefer passing through HBAs/disks to ZFS (JBOD/HBA mode) or run ZFS on the host. If you must virtualize, ensure the lower layer provides integrity guarantees and stable flush semantics.
5) Symptom: After power events, apps complain about lost writes, but ZFS pool is “ONLINE”
Root cause: Sync writes were disabled, or drives/controllers lied about cache flush, or the workload expected durability that wasn’t provided.
Fix: Re-enable sync semantics, use a proper SLOG if needed, and verify write cache policies. Treat “acknowledged but not durable” as data loss by design.
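Checking and restoring sync semantics is one command each way. A sketch (dataset name illustrative): sync=standard honors application flush requests; sync=disabled acknowledges them without persisting, which is exactly the failure described above.
cr0x@server:~$ zfs get sync tank/vmstore
cr0x@server:~$ sudo zfs set sync=standard tank/vmstore
If sync latency is the real complaint, add a dedicated log device instead of lying about durability, e.g. sudo zpool add tank log /dev/disk/by-id/nvme-EXAMPLE-SLOG (device name hypothetical).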
6) Symptom: Repeated errors across multiple disks on the same host
Root cause: Common-path failure: HBA, expander, backplane, firmware, power, or memory instability.
Fix: Move one disk to a different controller path and observe. Update firmware. Check dmesg for link resets. Validate ECC health and run memory diagnostics.
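For the "check dmesg for link resets" step, a filtered view usually separates transport trouble from media trouble. A sketch (exact message text varies by kernel and driver, so treat the pattern as a starting point, not a parser):
cr0x@server:~$ sudo dmesg -T | egrep -i 'link reset|hard resetting link|i/o error' | tail -n 20
Repeated resets on one port point at cable, bay, or expander; I/O errors with no resets point back at the disk.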
7) Symptom: Replacing a disk doesn’t stop checksum errors
Root cause: You replaced the victim, not the culprit. Often it’s cabling, bay, or controller, or the disk model/firmware has a known issue.
Fix: Swap cable, change bay, update firmware, and if needed replace the HBA. Track errors by physical slot and by-id, not by sda names.
Checklists / step-by-step plan
Commissioning checklist (before production traffic)
- Use ECC memory. If you can’t, at least acknowledge the risk explicitly in the design review.
- Use HBAs in IT/JBOD mode, not hardware RAID volumes, so ZFS can see and manage disks directly.
- Build pools with persistent device paths (/dev/disk/by-id).
- Pick redundancy appropriate to the blast radius: mirrors for IOPS and fast resilver; RAIDZ for capacity with careful rebuild expectations.
- Set and document scrub schedule (common: monthly for large pools, more often for critical data).
- Enable snapshots and replication; decide retention based on recovery objectives, not wishful thinking (a minimal send/receive sketch follows this checklist).
- Test restore of representative datasets before go-live.
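The snapshot-and-replication item above can start as a plain send/receive over SSH. A minimal sketch (host backup01 and the target pool backup are placeholders; production setups add incremental sends with zfs send -i plus retention automation):
cr0x@server:~$ sudo zfs snapshot tank/vmstore@baseline
cr0x@server:~$ sudo zfs send tank/vmstore@baseline | ssh root@backup01 zfs receive backup/vmstore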
Operational hygiene checklist (weekly/monthly)
- Review zpool status across fleets; alert on non-zero READ/WRITE/CKSUM deltas (see the sketch after this checklist).
- Check SMART attributes for trends, not just failure flags.
- Run scheduled scrubs; verify they complete and whether they repaired anything.
- Track resilver times; increasing resilver duration is an early warning of aging disks or overloaded design.
- Perform a restore test quarterly. Pick random samples and actually verify content integrity at the application layer.
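For the fleet review item, zpool status -x prints a single line when everything is healthy, which makes a cron-driven check trivial:
cr0x@server:~$ zpool status -x
all pools are healthy
A minimal wrapper (illustrative; the mail command and address are placeholders for your real alerting path, and the string check assumes at least one pool is imported):
#!/bin/sh
# Complain only when zpool status -x has something to say.
STATUS=$(zpool status -x)
if [ "$STATUS" != "all pools are healthy" ]; then
    echo "$STATUS" | mail -s "zpool alert on $(hostname)" ops@example.com
fi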
Incident response plan (checksum errors detected)
- Freeze risky changes. Stop “helpful” reboots and firmware updates during the incident.
- Run zpool status -v and capture output for the ticket.
- If redundancy exists, start a scrub and monitor for repairs and new errors.
- Inspect SMART and dmesg for link resets and CRC errors; swap cables/bays if indicated.
- Replace devices that continue to accrue errors. Use zpool replace with by-id paths.
- After remediation, clear counters, scrub again, and confirm stability over at least one full scrub cycle.
- Restore affected files/VMs from snapshots or backups; validate at the app layer.
- Write down what was replaced (disk serial, bay, cable) so the next engineer doesn’t repeat the same dance.
FAQ
Does ZFS prevent all corruption?
No. ZFS detects corruption that happens on disk or in the I/O path after the data has been checksummed. Corruption introduced before checksumming (bad RAM flipping bits before the write, or an application writing wrong bytes) gets a valid checksum and passes silently. Detection plus redundancy gives repair, not immunity.
If I already have RAID, do I still need ZFS?
If your priority is correctness, you need end-to-end integrity somewhere. ZFS provides it at the filesystem+volume layer. RAID provides device redundancy. Combining them is usually redundant in a bad way (complexity) unless you’re using RAID as simple JBOD abstraction—which is better replaced by HBA mode.
Why does ZFS report checksum errors when the disk says it’s fine?
Because the disk can return the wrong data without reporting an error (firmware issues, misdirected reads, transport corruption). ZFS validates the content, not the device’s confidence.
What’s the difference between READ errors and CKSUM errors in zpool status?
READ errors mean the device couldn’t return data (I/O failure). CKSUM errors mean it returned data, but the data didn’t match the expected checksum. CKSUM is often more alarming because it implies silent corruption.
How often should I scrub?
Monthly is a common baseline for multi-terabyte pools. More often for critical data or less reliable hardware. The right answer is: often enough that latent errors are found while redundancy still exists and before backups age out.
Will a scrub fix corruption?
If you have redundancy and at least one good copy exists, yes—ZFS can self-heal during scrub. If you have a single-disk vdev, scrub detects corruption but cannot repair it. If all copies are bad, you’re restoring from backup either way.
Are mirrors safer than RAIDZ?
Mirrors tend to resilver faster and are operationally simpler; they also offer better random read IOPS. RAIDZ is capacity-efficient but can have longer rebuild windows and wider fault domains. “Safer” depends on workload, disk size, and operational discipline. If you’re not scrubbing regularly, neither is safe.
Can I run ZFS on top of a hardware RAID volume?
You can, but you’re blinding ZFS to physical disk behavior and pushing trust down into the RAID controller. You lose some of ZFS’s ability to localize faults and self-heal at the correct layer. If you care about integrity, give ZFS direct disk access.
What should I do when ZFS reports “Permanent errors”?
Treat it as data loss for the referenced file(s). Restore from backup or a clean snapshot/replica, then fix the underlying hardware/path issue and scrub again. Don’t assume “it healed itself” if it says permanent.
Does compression or recordsize affect corruption detection?
Checksumming still works. Compression changes the physical blocks written, but they’re still checksummed. Recordsize affects I/O patterns and rebuild behavior, which can influence how quickly you notice issues and how painful scrubs/resilvers feel.
Practical next steps
If you run RAID and think you’re “covered,” decide what you actually mean: availability or correctness. If you mean correctness, add an end-to-end integrity layer. ZFS is one of the few that does this comprehensively and operationally.
- Enable (and actually monitor) ZFS scrubs. Put them on a calendar and alert on repairs and errors.
- Design for redundancy that can heal: mirrors or RAIDZ/dRAID with realistic rebuild windows.
- Use ECC memory and HBAs in JBOD/IT mode so ZFS can see real disks.
- When checksum errors appear, don’t argue with them. Localize the path, replace what’s guilty, scrub again, and restore affected data.
- Prove backups by restoring. Corruption that’s already in the source will happily propagate into “perfectly successful” backups.
The operational win is simple: ZFS turns silent corruption into loud corruption. Loud problems get fixed. Silent ones get promoted into outages.