Storage failures rarely show up as cinematic smoke. In production, the scariest failures are polite: a controller that says “all good,” a filesystem that returns the wrong bytes with the confidence of a fortune teller, a rebuild that “completed successfully” while quietly baking your remaining disks to death. If you’ve operated fleets long enough, you learn the real enemy isn’t the dead drive you can see—it’s the component that lies while continuing to serve I/O.
This is the heart of the ZFS vs hardware RAID debate. Not “which is faster in a benchmark,” but: when reality diverges from what your storage stack reports, who notices first, who can prove it, and who can fix it without guessing? We’re going to treat this like production engineering, not religion: what each stack is good at, where it fails, and what you should actually do on a Tuesday at 2 a.m. when the dashboard is green and the database is screaming.
The lie: how storage misleads you
“The controller lies” doesn’t mean it’s malicious. It means the software stack you trust is receiving a comforting story that doesn’t match the physical reality of the bytes on platters or flash cells. This can happen in several ways:
Silent corruption: bad data, good attitude
Silent corruption is when a read returns a block that is not what you wrote, and nothing complains. It can come from flaky cables, marginal SAS expanders, firmware bugs, DRAM errors in a controller, misdirected writes, or media returning stale data under rare conditions. Your application gets bytes. They’re just the wrong bytes.
Here’s why this is so operationally nasty: application-level checksums are rare, and even where they exist (think: some databases), they may only catch the corruption when that page is read again—weeks later. Meanwhile, backups happily preserve the corrupted version. That’s the nightmare scenario: the last good copy quietly ages out of retention.
The write cache paradox
Write caching can make a storage array look heroic right up until it loses power and becomes a historian rewriting events. If a controller acknowledges a write before it’s safely on non-volatile media, you’ve traded latency for truth. Battery-backed write cache (BBWC) and flash-backed write cache (FBWC) reduce that risk, but the devil lives in the details: battery health, capacitor aging, firmware policies, and “temporary” settings that become permanent because nobody wants to reboot to change them back.
Joke #1: A RAID controller with write-back cache and no power protection is like a motivational speaker—lots of confidence, not much accountability.
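Joking aside, one cheap reality check from the host is whether a SATA drive’s own volatile write cache is enabled. The device name below is an example, and SAS or NVMe devices need different tooling (sdparm, nvme-cli):
cr0x@server:~$ sudo hdparm -W /dev/sda
/dev/sda:
 write-caching =  1 (on)
A drive cache that is on is only safe if every layer above it issues and honors cache flushes; it’s one more spot where “acknowledged” and “durable” can quietly diverge.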
The rebuild spiral
When a disk fails in RAID5/6, rebuilds stress the surviving drives with sustained reads. That’s when latent sector errors show up. The array can go from “degraded but fine” to “degraded and now unreadable” at the worst possible time. ZFS has its own version (resilver), but it behaves differently when it knows which blocks are actually in use.
False positives and false negatives in health reporting
Drive health, temperature, error counters, and predictive failure—these are not objective truths so much as vendor-specific interpretations of S.M.A.R.T. attributes and link error logs. Hardware RAID adds another layer: sometimes you can’t see the drive’s real S.M.A.R.T. data without jumping through controller-specific hoops. That can make your monitoring feel like you’re checking a patient’s pulse through a winter coat.
Facts and history: why this keeps happening
Storage is old enough to have traditions, and some of them are… character-building. A few context points that matter in real designs:
- RAID was born in an era of expensive disks and cheap CPU cycles. The original RAID ideas (late 1980s) assumed you’d trade extra disks for reliability while keeping performance acceptable.
- The “write hole” has been a known RAID problem for decades. Parity RAID can acknowledge writes that update data and parity out of sync after power loss, unless the stack has protections.
- ZFS (mid-2000s) was designed around end-to-end checksums. The premise: storage should detect corruption at the filesystem layer, not trust lower layers blindly.
- Unrecoverable Read Error (URE) rates became relevant as drive sizes exploded. Bigger disks mean rebuilds read more bits, increasing the odds you hit a URE during rebuild—especially with consumer-grade drives.
- Write caches moved from “optional speed boost” to “mandatory for decent latency.” As workloads became random I/O heavy (VMs, databases), caching became a core behavior, not a nice-to-have.
- Controllers have firmware, DRAM, and sometimes buggy logic. Hardware RAID isn’t “pure hardware”; it’s an embedded system that can have its own failure modes.
- Filesystems historically trusted the block layer. Traditional stacks assumed a read returns what was written. That assumption is now routinely false in large fleets.
- Commodity servers + JBOD changed the economics. It became practical to run software-defined storage where the host CPU does parity/checksums and disks are dumb.
- NVMe reduced some problems and introduced new ones. Lower latency and fewer intermediaries help, but firmware issues, power-loss protection, and namespace quirks still exist.
ZFS’s model: end-to-end truth
ZFS is often described with a kind of reverence, like it’s a magical filesystem that defeats entropy. It’s not magic. It’s a set of design decisions that move the trust boundary upward and make integrity verifiable.
End-to-end checksums: the core mechanic
When ZFS writes a block, it computes a checksum and stores it in the parent block pointer, i.e., in metadata kept separately from the data it describes. When it reads, it verifies. If the checksum doesn’t match, ZFS knows the data is wrong. That’s already a huge shift: it’s no longer guessing whether the disk returned the right block.
If redundancy exists (mirror, RAIDZ), ZFS can read an alternate copy and repair the bad one (“self-healing”). This is the part people mean when they say ZFS “protects against bit rot.” The more precise claim: ZFS detects corruption and, with redundancy, can correct it.
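To make the mechanics concrete: the checksum algorithm is an ordinary per-dataset property you can inspect and, for newly written data, change. Pool and dataset names here are illustrative; the default value (“on”) maps to fletcher4, and sha256 trades CPU for a stronger guarantee:
cr0x@server:~$ sudo zfs get checksum tank
NAME  PROPERTY  VALUE  SOURCE
tank  checksum  on     default
cr0x@server:~$ sudo zfs set checksum=sha256 tank/critical
Changing the property only affects blocks written afterward; existing blocks keep the checksum they were born with.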
Copy-on-write and transactional semantics
ZFS doesn’t overwrite live blocks in place. It writes new blocks, then updates pointers. That means the on-disk state is always consistent from ZFS’s perspective: either the transaction committed or it didn’t. This design significantly reduces classic filesystem corruption after crashes and interacts favorably with integrity checking.
It doesn’t eliminate every failure mode. If the data is corrupted before ZFS ever computes the checksum (memory corruption is the classic example), the checksum will happily validate the wrong bytes and ZFS will store and serve them faithfully. That’s why ECC RAM and stable hardware matter in serious ZFS deployments.
Scrubs: scheduled truth audits
A scrub is not a “performance optimization.” It’s an integrity audit: ZFS reads data and verifies checksums, repairing from redundancy when possible. If you want to find silent corruption before it becomes your only copy, scrubs are how you do it.
Operationally, the important nuance: scrubs create load. You schedule them, monitor them, and you don’t run them at the same time as your monthly “reindex everything” job unless you enjoy chaos.
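A minimal scheduling sketch, assuming a pool named tank and a quiet Sunday window; many distributions already ship a scrub cron job or systemd timer, so check before adding a duplicate:
cr0x@server:~$ sudo crontab -l | grep scrub
0 2 * * 0 /usr/sbin/zpool scrub tank
Scheduling is the easy half. The part that saves you is alerting on the scrub result, not assuming it ran.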
Resilver vs rebuild: not just semantics
When replacing a drive in a ZFS mirror or RAIDZ, ZFS resilvers. In mirrors, it can copy only allocated blocks rather than the entire raw device, reducing time and wear. With RAIDZ, resilver behavior depends on metadata and can still be heavy, but the “only what’s in use” property often helps, especially with sparse usage.
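The replacement flow itself is short; the device names below are placeholders, and the scan line in zpool status switches from “scrub” to “resilver in progress” while it runs:
cr0x@server:~$ sudo zpool replace tank ata-SAMSUNG_SSD_2 ata-SAMSUNG_SSD_3
cr0x@server:~$ sudo zpool status tank
Watch it to completion. A resilver that finishes with fresh checksum errors is telling you the problem wasn’t only the disk you just replaced.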
ZFS is not a free lunch
ZFS costs CPU and memory. It wants RAM for ARC (cache) and metadata. It’s happier with ECC memory. It demands that you understand recordsize, volblocksize (for zvols), and the interaction between sync writes and SLOG devices. You can absolutely make ZFS slow by misconfiguring it; ZFS will comply with your bad ideas with admirable professionalism.
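A small example of the kind of deliberate configuration this implies: creating a dataset for a database with small pages. The dataset name and the 16K value are illustrative starting points, not universal truths; measure with your own workload:
cr0x@server:~$ sudo zfs create -o recordsize=16K -o compression=lz4 tank/db
cr0x@server:~$ sudo zfs get recordsize,compression tank/db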
Hardware RAID’s model: abstraction and acceleration
Hardware RAID is a different deal: put a controller in front of disks, let it manage redundancy, and present the OS with one or more logical volumes that look like normal disks. It’s seductive because it’s simple at the OS layer: the server sees “/dev/sda,” and life continues.
What hardware RAID does well
Good controllers with proper cache protection can deliver strong performance, especially for small random writes. They offload parity math (less important now with modern CPUs) and hide complexity from the OS. In some environments—especially those standardized around vendor support contracts—hardware RAID is a familiar, supportable shape.
What hardware RAID hides from you
That abstraction cuts both ways. If the controller maps logical blocks to physical blocks incorrectly, or if its cache acknowledges writes prematurely, the OS and filesystem may have no way to detect it. Traditional filesystems typically don’t checksum data blocks end-to-end; they rely on the block device being honest.
And even when the controller is behaving, it can obscure the health of underlying disks. You might see “logical drive optimal” while one disk is throwing media errors that the controller is remapping quietly. That sounds helpful until you realize you’ve lost the ability to reason about risk. You’re flying IFR with a single altimeter and no external view.
Patrol read, consistency checks, and their limits
Enterprise RAID controllers often have “patrol read” or background consistency checks. These can detect and sometimes correct parity inconsistencies. That’s real value. But it’s not the same as filesystem-level end-to-end integrity: the controller can validate parity relationships without knowing whether the data itself is the right data for that logical block in the first place.
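If you rely on these checks, verify they’re actually enabled and scheduled instead of assuming a vendor default did it for you. The storcli syntax below is a sketch and varies by controller generation and CLI version, so treat it as an assumption to confirm against your vendor docs:
cr0x@server:~$ sudo storcli /c0 show patrolread
cr0x@server:~$ sudo storcli /c0/v0 start cc
The output should tell you whether patrol read is enabled and when it last ran; the consistency check is the on-demand version of the same idea.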
The controller is a computer you forgot to patch
Hardware RAID failures are not always disk failures. Controllers have firmware bugs, memory errors, overheating issues, and sometimes weird interactions with specific drive firmware revisions. If you’ve never seen a controller behave differently after a firmware update, congratulations on your peaceful life—but also, you may not be patching aggressively enough.
Joke #2: The only thing more optimistic than a RAID controller is a project plan that says “no downtime required.”
Who protects you when the controller lies?
This is the decision-making frame I use in production: assume you will eventually have a component that tells you “success” when reality is “corruption.” Then design so that some layer can prove it.
Case 1: The controller returns wrong data on read
- Hardware RAID + traditional filesystem: likely undetected until application-level checks fail (if they exist). The filesystem can’t validate data blocks. Your backups may happily preserve corruption.
- ZFS on JBOD/HBA: checksum mismatch is detected. With redundancy, ZFS can correct and log it. Without redundancy, at least you know the block is bad rather than serving incorrect bytes silently.
- ZFS on hardware RAID (single logical volume): ZFS can detect mismatches, but correction is limited if the underlying RAID volume always returns the same wrong data (no alternate copies visible to ZFS). You get detection without self-healing, and your troubleshooting is now layered.
Case 2: The controller lies about write durability
This is the write cache problem. If the controller acks writes early and then loses them, you can get filesystem corruption or application-level inconsistency.
- Hardware RAID: depends entirely on cache policy and power-loss protection. If BBWC/FBWC is healthy and configured correctly, you’re usually fine. If not, you can lose acknowledged writes.
- ZFS: ZFS cares deeply about sync writes. With proper configuration, ZFS will only acknowledge sync writes when they’re safely on stable storage (or on a SLOG device that is power-loss safe). But ZFS cannot overcome a disk/controller that lies about flushes. If your devices ignore cache flush commands, ZFS’s guarantees are weakened.
Case 3: The controller hides disk errors until it’s too late
Hardware RAID can keep an array “optimal” while remapping sectors and dealing with media errors. That can be good—until you realize multiple drives are degrading. ZFS tends to surface errors loudly (checksum errors, read errors, degraded vdev), which is annoying but actionable. In operations, I prefer noisy truth over quiet fiction.
The pragmatic answer
If your question is strictly “who protects you when the controller lies,” ZFS (with direct disk access through a proper HBA in IT mode) is the stronger story because it validates data at the filesystem layer and can self-heal with redundancy.
That doesn’t automatically mean “never use hardware RAID.” It means: if you choose hardware RAID, you must accept that you’re outsourcing correctness to the controller and you need operational controls (battery monitoring, patrol reads, firmware lifecycle, periodic verification) that match that trust.
Three corporate-world mini-stories
Mini-story #1: The incident caused by a wrong assumption
A mid-size company ran a busy VM cluster on hardware RAID10. It was boring—in the good way—for years. Then they virtualized a few more database-heavy workloads and started seeing rare, random failures: a service would crash, a database would report a corrupt page, then everything would be fine after a restart. Classic “cosmic rays” territory, so the first response was to blame the applications.
They replaced memory. They tuned the hypervisor. They even added more monitoring. The storage dashboard remained green: logical drive optimal, no failed disks, cache healthy. The wrong assumption was simple: “If the RAID controller says it’s healthy, the bytes are healthy.”
The breakthrough came from a boring test: they took a known data set, wrote it, computed checksums at the application layer, and then re-read it repeatedly under load. Once every few days, a block came back wrong. Not consistently wrong—just wrong enough to ruin your week. The controller logs showed nothing useful. The disks’ S.M.A.R.T. data wasn’t visible through the controller in their current tooling, and the OS had no visibility into individual drives.
They eventually reproduced the problem more frequently by swapping a SAS cable and moving the array behind an expander with a marginal port. The controller was retrying reads until it got something and returning it. From its perspective, it had satisfied the request. From the application’s perspective, reality had drifted.
The fix was not “switch to ZFS” overnight. The fix was: expose real disk health, replace the marginal path, enable patrol reads with alerting, and implement end-to-end checksumming where possible. Later, on the next refresh, they moved to ZFS mirrors on HBAs for the most integrity-sensitive workloads. The lesson wasn’t that RAID is evil. It was that health LEDs are not data integrity guarantees.
Mini-story #2: The optimization that backfired
A different org had a habit: every storage problem was a performance problem until proven otherwise. They were running ZFS on a solid JBOD shelf, but one workload had latency spikes during peak hours. Someone proposed a “simple” fix: add a fast SLOG device for sync writes and set a few datasets to sync=always “to make it safe,” and a few others to sync=disabled “to make it fast.” Yes, both at once. In the same pool. On production.
They did add an NVMe device as SLOG. It was fast. It was also a consumer NVMe without power-loss protection. Under normal conditions it looked fantastic. The latency graphs flattened out. The optimization was declared a win and promptly forgotten, which is how most disasters are funded.
Months later, there was a power event that didn’t trip the UPS properly—just long enough to brown out the server and reboot it. ZFS did what ZFS does: it came back consistent. But a chunk of the “fast” datasets that had sync disabled had acknowledged writes that never landed. A VM’s filesystem survived, but the database inside it had missing transactions and a broken journal. The root cause wasn’t “ZFS is unsafe.” The root cause was using ZFS controls to override durability semantics without understanding the workload, plus putting an unsafe device in the one role where you really want power-loss safety.
What they changed afterward was refreshingly unsexy: they stopped using sync=disabled as a performance knob; they used a proper enterprise SSD with power-loss protection for SLOG; and they created a policy that any dataset change to sync semantics required a ticket with a rollback plan.
Mini-story #3: The boring but correct practice that saved the day
A financial services team had a storage philosophy that annoyed people: “Trust nothing. Verify routinely.” They ran ZFS mirrors for critical systems and RAIDZ2 for large repositories. Every pool had a scrub schedule and alerting on checksum errors, not just device failures. They also had a quarterly drill: simulate a failed disk, replace it, and confirm resilver behavior and alerts. It felt like busywork until it wasn’t.
One quarter, the scrub reported a small but non-zero number of checksum errors on a mirror vdev. ZFS repaired them automatically, but the presence of errors was the signal. The team treated checksum errors like smoke: you don’t ignore them because the fire alarm “handled it.” They pulled SMART data, saw growing UDMA CRC errors on one drive path, and replaced a SAS cable during a scheduled window.
A few weeks later, that same rack experienced a vibration event (construction nearby, the kind of thing facilities swears “couldn’t possibly matter”). One of the drives in that mirror dropped briefly. Because the cable had been replaced earlier, the mirror stayed stable. The event became a single alert and a brief resilver rather than a multi-hour incident with a degraded pool and panicked leadership.
Nothing about this story is glamorous. That’s the point. The practice that saved them was a scrub schedule, alerting that treats checksum errors as actionable, and routine drills that kept the replacement process crisp. In production, boring is a feature.
Practical tasks (commands + interpretation)
The commands below assume Linux with ZFS installed (OpenZFS), plus common tools. Adjust device names and pool names. Every task includes what “good” and “bad” tend to look like.
Task 1: Get a quick truth snapshot of a ZFS pool
cr0x@server:~$ sudo zpool status -v
pool: tank
state: ONLINE
scan: scrub repaired 0B in 02:14:33 with 0 errors on Sun Dec 22 03:10:21 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-SAMSUNG_SSD_1 ONLINE 0 0 0
ata-SAMSUNG_SSD_2 ONLINE 0 0 0
errors: No known data errors
Interpretation: Look at READ, WRITE, and especially CKSUM. Non-zero checksum errors mean ZFS detected corruption on reads. If redundancy exists, ZFS may have repaired it—still investigate the path (drive, cable, HBA, backplane).
Task 2: Force a scrub and watch it like it matters
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Mon Dec 23 01:00:02 2025
312G scanned at 1.20G/s, 22.4G issued at 88.0M/s, 1.80T total
0B repaired, 1.21% done, 05:40:11 to go
Interpretation: “Issued” vs “scanned” helps you understand actual disk work. If scrub speed collapses, you may have a sick disk or a bottlenecked controller. If a scrub finds errors, don’t just celebrate that it repaired them—treat it as a leading indicator.
Task 3: Find ZFS events and fault history
cr0x@server:~$ sudo zpool events -v | tail -n 30
Dec 23 01:10:44.123456 ereport.fs.zfs.checksum
class = "ereport.fs.zfs.checksum"
ena = 0x2c4b3c1f2b000001
detector = (embedded nvlist)
...
Interpretation: ZFS events provide breadcrumb trails. Repeated checksum events tied to a specific vdev or device often indicate a path problem (cable/backplane) as much as a failing disk.
Task 4: Confirm ZFS ashift (sector alignment) before you regret it
cr0x@server:~$ sudo zdb -C tank | grep -E "ashift|vdev_tree" -n | head
45: vdev_tree:
78: ashift: 12
Interpretation: ashift=12 means 4K sectors. Misaligned ashift can hurt performance permanently; you can’t change it without recreating vdevs. If you’re on modern drives/SSDs, ashift 12 (or 13 for 8K) is typical.
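Because ashift is baked in at vdev creation, the safe habit is to set it explicitly when you build the pool rather than trusting autodetection on drives that report misleading sector sizes. The device paths below are placeholders:
cr0x@server:~$ sudo zpool create -o ashift=12 tank mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B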
Task 5: Check dataset properties that often cause “mystery latency”
cr0x@server:~$ sudo zfs get -o name,property,value,source -s local,default recordsize,compression,atime,sync,logbias tank
NAME PROPERTY VALUE SOURCE
tank recordsize 128K default
tank compression lz4 local
tank atime off local
tank sync standard default
tank logbias latency default
Interpretation: For VM images (zvols), you care more about volblocksize than recordsize. For databases, recordsize can matter. sync is not a casual toggle; changing it changes durability semantics.
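For zvols specifically, volblocksize is fixed at creation, so it deserves a hedged example; the size, name, and 16K value are illustrative and should match what the guest or application actually does:
cr0x@server:~$ sudo zfs create -V 100G -o volblocksize=16K tank/vm-disk0
cr0x@server:~$ sudo zfs get volblocksize tank/vm-disk0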
Task 6: Inspect ARC stats to see if you’re memory-bound
cr0x@server:~$ sudo arcstat 1 5
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
01:20:01 842 58 6 12 1 44 5 2 0 24G 32G
01:20:02 901 61 6 10 1 49 5 2 0 24G 32G
Interpretation: High miss rates under steady read workloads suggest ARC is too small or working set is too large. That doesn’t prove “buy more RAM,” but it’s a strong clue. Don’t confuse cache misses with disk failure; they just mean the system is doing real I/O.
Task 7: For sync write pain, verify whether you’re actually sync-bound
cr0x@server:~$ sudo zpool iostat -v tank 1 5
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 1.20T 600G 210 1800 45.2M 92.1M
mirror-0 1.20T 600G 210 1800 45.2M 92.1M
sda - - 105 900 22.6M 46.0M
sdb - - 105 900 22.6M 46.0M
Interpretation: Watch write IOPS and latency (use iostat -x too). If latency spikes align with sync-heavy workloads, a safe SLOG device can help—but only if it has power-loss protection and you size expectations correctly.
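If the evidence does point at sync writes, a SLOG sketch looks like this; device names are placeholders, and the only SLOG worth having is one with genuine power-loss protection, preferably mirrored:
cr0x@server:~$ sudo zpool add tank log mirror /dev/disk/by-id/nvme-PLP_SSD_A /dev/disk/by-id/nvme-PLP_SSD_B
cr0x@server:~$ sudo zpool status tank | grep -A 3 logs
Remember that a SLOG only accelerates sync writes; it does nothing for async-heavy workloads except look expensive.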
Task 8: Check individual disk health (JBOD/HBA case)
cr0x@server:~$ sudo smartctl -a /dev/sda | egrep -i "model|serial|reallocated|pending|crc|power_on|temperature"
Device Model: ST12000NM0008
Serial Number: ZJV123AB
Reallocated_Sector_Ct 0
Current_Pending_Sector 0
UDMA_CRC_Error_Count 12
Power_On_Hours 39120
Temperature_Celsius 41
Interpretation: Reallocated/pending sectors are classic “media is deteriorating.” UDMA CRC errors often implicate the cable/backplane path, not the drive media. If CRC errors climb, replace the path before you replace the disk.
Task 9: Check for kernel-level storage errors that ZFS might be reacting to
cr0x@server:~$ sudo dmesg -T | egrep -i "ata|sas|scsi|reset|timeout|I/O error" | tail -n 30
[Mon Dec 23 01:12:10 2025] sd 4:0:0:0: [sda] tag#32 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[Mon Dec 23 01:12:10 2025] blk_update_request: I/O error, dev sda, sector 123456789
Interpretation: Timeouts and resets correlate strongly with checksum errors and degraded vdevs. If you see frequent link resets, start with physical layer: cables, backplane, expander, HBA firmware.
Task 10: If you’re on hardware RAID, verify cache/battery status (example with storcli)
cr0x@server:~$ sudo storcli /c0 show
Controller = 0
Status = Success
Description = None
Product Name = RAID Controller
FW Package Build = 24.21.0-0123
BBU = Present
cr0x@server:~$ sudo storcli /c0/bbu show
BBU_Info:
Battery State = Optimal
Next Learn time = 2025-12-30 02:00:00
Interpretation: If BBU is missing, failed, or in learn cycle, many controllers disable write-back cache or behave differently. The performance symptom often shows up before anyone reads the BBU status alert.
Task 11: On hardware RAID, check physical drive error counters
cr0x@server:~$ sudo storcli /c0/eall/sall show all | egrep -i "Drive|Media Error|Other Error|Predictive"
Drive /c0/e32/s0
Media Error Count = 0
Other Error Count = 14
Predictive Failure Count = 0
Interpretation: “Other Error” often means link issues, timeouts, or command aborts. A rising count is a canary. Don’t wait for “Predictive Failure.” That flag is conservative and sometimes arrives after your patience is already gone.
Task 12: Verify write cache policy on hardware RAID
cr0x@server:~$ sudo storcli /c0/vall show | egrep -i "Cache|Write"
Cache Policy = WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Interpretation: “WriteBack” is good only if you truly have power protection and the policy is safe when BBU is bad. If you see “WriteThrough” unexpectedly, expect latency pain. If you see “WriteBack” with “Always Write Cache,” be very sure you like living dangerously.
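Changing the policy is a one-liner, but the exact storcli syntax varies by version, so treat this as a sketch to verify against your controller’s documentation:
cr0x@server:~$ sudo storcli /c0/v0 set wrcache=wt
Write-through (wt) is the conservative choice while cache protection is suspect; “awb” (always write back) keeps caching even with a bad BBU, which is exactly the setting the interpretation above warns about.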
Task 13: Measure real latency at the block layer
cr0x@server:~$ sudo iostat -x 1 5
Device r/s w/s r_await w_await aqu-sz %util
sda 95.0 820.0 2.10 18.40 4.20 98.0
Interpretation: High w_await combined with high %util suggests the device is saturated or stalling. If this is a RAID virtual disk, remember you’re seeing the controller’s personality as much as the disks.
Task 14: Spot a write-hole risk in parity RAID (conceptual check)
cr0x@server:~$ sudo storcli /c0/v0 show | egrep -i "RAID Level|Strip Size|Write Cache"
RAID Level = Primary-5, Secondary-0, RAID Level Qualifier-3
Strip Size = 256 KB
Write Cache = WriteBack
Interpretation: RAID5 + write-back cache can be fine with proper power protection. Without it, a power loss can leave parity inconsistent. The symptom later is “filesystem corruption” that looks like software’s fault.
Fast diagnosis playbook
This is the “don’t panic, don’t theorize” sequence I use to find the bottleneck or integrity issue quickly. The goal is to separate latency, throughput, and correctness problems, because they look similar when users are yelling.
First: establish whether you have an integrity emergency
- ZFS: zpool status -v. If CKSUM is non-zero or the pool is degraded, treat it as priority one. Check zpool events -v.
- Hardware RAID: controller status + physical drive error counters. Don’t stop at “virtual drive optimal.” Look at media/other errors and cache/battery health.
- OS logs: dmesg -T for timeouts, resets, I/O errors.
Second: identify the pain type (IOPS vs bandwidth vs latency spikes)
- Block device utilization: iostat -x 1 to see await times and %util.
- ZFS device view: zpool iostat -v 1 to see which vdev/disk is hot.
- Application symptoms: are you seeing slow commits (sync write pain), slow reads (cache misses / disk), or periodic stalls (timeouts/resets)?
Third: validate caching and durability assumptions
- ZFS: check dataset sync, presence of SLOG, and whether the SLOG device is safe (power-loss protection). Confirm you didn’t “fix” latency by disabling durability.
- Hardware RAID: confirm write-back cache behavior and what happens when BBU is bad or learning. Check if cache policy changed after an event.
Fourth: localize the fault domain
- Single disk with rising errors? Replace disk or path.
- Multiple disks with CRC/link errors? Suspect HBA/backplane/expander/cabling.
- Only one workload slow? Suspect dataset properties, recordsize/volblocksize, sync patterns, fragmentation, or application-level I/O patterns.
Common mistakes, symptoms, fixes
Mistake 1: Running ZFS on top of hardware RAID and expecting self-healing
Symptom: ZFS reports checksum errors but can’t repair them; errors recur on the same logical blocks.
Why: ZFS sees one logical device; it can detect corruption but has no alternate copy to fetch if the RAID volume returns the same bad data.
Fix: Prefer HBAs/JBOD so ZFS can manage redundancy. If you must use hardware RAID, use a mirror of logical volumes at minimum and still treat it as layered risk.
Mistake 2: Treating sync=disabled as a performance feature
Symptom: After a crash/power event, databases show missing transactions or corrupted journals despite “clean” pool import.
Why: ZFS acknowledged writes that weren’t durable.
Fix: Keep sync=standard unless you fully understand the workload and accept the risk. Use a proper SLOG device for sync-heavy workloads.
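A minimal corrective sketch, with an illustrative dataset name:
cr0x@server:~$ sudo zfs set sync=standard tank/db
cr0x@server:~$ sudo zfs get sync tank/db
NAME     PROPERTY  VALUE     SOURCE
tank/db  sync      standard  local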
Mistake 3: Ignoring checksum errors because “ZFS repaired them”
Symptom: Occasional checksum errors during scrubs; later you see a disk drop or a degraded vdev.
Why: The underlying issue (cable, HBA, disk) is worsening.
Fix: Treat checksum errors as hardware investigation triggers. Check SMART, replace suspect cables, update HBA firmware.
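Once you’ve investigated and fixed the suspected path, reset the counters so the next error stands out instead of blending into history; the pool name is illustrative:
cr0x@server:~$ sudo zpool clear tank
cr0x@server:~$ sudo zpool status tank | egrep "READ|WRITE|CKSUM|errors"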
Mistake 4: Not monitoring RAID cache/battery health
Symptom: Sudden latency regression after maintenance; controller silently switches to write-through.
Why: BBU in learn cycle or failed, policy changes.
Fix: Alert on BBU state and cache policy. Schedule learn cycles. Confirm cache settings after firmware updates.
Mistake 5: RAID5 for large disks without a rebuild risk plan
Symptom: Second disk error during rebuild; array becomes unrecoverable or goes read-only.
Why: Rebuild reads enormous amounts of data; latent errors appear.
Fix: Use RAID6/RAIDZ2 or mirrors for large disks and critical data. Keep hot spares, monitor error rates, and test rebuild times.
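Back-of-the-envelope math for why single parity gets scary, assuming the commonly quoted 1-per-10^14-bit URE spec for consumer drives (enterprise drives are usually rated an order of magnitude better):
cr0x@server:~$ awk 'BEGIN { bits = 12e12 * 8; printf "expected UREs reading one 12TB drive: %.2f\n", bits / 1e14 }'
expected UREs reading one 12TB drive: 0.96
Spec-sheet numbers tend to be pessimistic in practice, but the trend is the point: a rebuild that must read every surviving bit has a non-trivial chance of tripping over a latent error, which is why the fix above favors double parity or mirrors.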
Mistake 6: Believing “drive failure” is the only disk problem
Symptom: Random I/O timeouts, link resets, CRC errors, but drives test “fine.”
Why: Path issues: cabling, backplane, expander, power, firmware mismatches.
Fix: Replace the path components methodically and watch error counters. Don’t shotgun disks first.
Checklists / step-by-step plan
Step-by-step: choosing between ZFS and hardware RAID (production version)
- Define the failure you cannot tolerate. Is it downtime, data corruption, or performance collapse during rebuild?
- Decide where you want integrity verification. If you want end-to-end checksums and self-healing at the filesystem layer, plan for ZFS with direct disk access.
- Decide who owns caching semantics. Hardware RAID: controller owns it. ZFS: the host owns it, but devices can still lie about flushes.
- Pick redundancy based on rebuild math, not superstition. Mirrors for performance and fast recovery; RAIDZ2/RAID6 for capacity efficiency with better fault tolerance than single parity.
- Plan monitoring before deployment. ZFS: alert on checksum errors, degraded pools, slow scrubs. RAID: alert on BBU state, cache policy, media/other errors, patrol read outcomes.
- Plan recovery drills. Practice replacing a disk, importing pools, verifying data, and timing resilvers/rebuilds.
Step-by-step: baseline a new ZFS pool safely
- Use an HBA in IT mode (pass-through). Avoid RAID volumes.
- Create pool with correct ashift for your media.
- Enable
compression=lz4for most general workloads. - Set
atime=offunless you truly need it. - Define datasets for different workloads; don’t dump everything in one dataset.
- Schedule scrubs and alert on checksum errors.
- Validate backups by restore testing, not by optimism.
Step-by-step: operating hardware RAID without fooling yourself
- Verify write cache is protected (BBWC/FBWC) and policy disables write-back when protection is bad.
- Enable patrol read/background checks and alert on their findings.
- Expose physical drive stats to monitoring (media errors, other errors, temperature).
- Keep firmware lifecycle disciplined (controller, backplane/expander if applicable, drives).
- Test rebuild times and performance impact in a maintenance window before you need it during an incident.
FAQ
1) Does ZFS eliminate the need for hardware RAID?
For most ZFS deployments, yes: you generally want direct disk access via an HBA so ZFS can manage redundancy and integrity. Hardware RAID can still be used in environments with strict vendor support requirements, but it adds a layer that can obscure disk truth.
2) Is ZFS safe without ECC RAM?
ZFS will run without ECC, and many people do. But if you’re designing for “who protects you when something lies,” ECC reduces the chance that memory corruption turns into valid-but-wrong checksums or metadata issues. For critical systems, ECC is a practical risk reducer.
3) Can hardware RAID detect silent corruption?
It can detect some inconsistencies (parity mismatches, bad sectors) via patrol reads and consistency checks, but it typically does not provide filesystem-level end-to-end validation of user data. Without end-to-end checksums above the controller, you may not detect “wrong but readable” data.
4) Is ZFS on top of a single hardware RAID volume a good idea?
It’s usually a compromise: ZFS can detect corruption but may not be able to correct it, and you complicate troubleshooting. If you must do it, prefer mirrored logical volumes and be serious about controller health monitoring.
5) What redundancy should I use: mirrors, RAIDZ1/2/3, RAID5/6?
For critical systems with demanding latency, mirrors are often the most predictable. For capacity-heavy datasets, RAIDZ2 (or RAID6) is a common balance. Single-parity (RAIDZ1/RAID5) becomes increasingly risky with large disks and long rebuilds.
6) Are ZFS scrubs the same as RAID patrol reads?
They rhyme but they’re not identical. A ZFS scrub verifies data against checksums at the filesystem layer and can repair from redundancy. Patrol read focuses on drive/array consistency at the controller layer and may not validate that the bytes are the “correct” bytes for your application.
7) Why do I see checksum errors but SMART looks fine?
Because checksum errors can come from the path (cables, HBA, expander, backplane) and transient read issues, not just failing media. Also, SMART is not a perfect predictor; some drives fail “suddenly” with minimal SMART warning.
8) What’s the most common way storage teams accidentally lose durability?
Turning off sync semantics (ZFS sync=disabled) or relying on write-back cache without reliable power-loss protection. Both make benchmarks look great and post-mortems look expensive.
9) If ZFS detects corruption, does that mean my data is already lost?
Not necessarily. With redundancy, ZFS can often correct the bad copy automatically. But detection means something in the chain is untrustworthy, so you should investigate before the next error lands on the last good copy.
Conclusion
When the controller lies, the winner is the stack that can prove the lie and recover without guessing. ZFS’s end-to-end checksums and self-healing (with proper redundancy and direct disk access) make it a strong answer to silent corruption and misbehavior in the lower layers. Hardware RAID can be fast and operationally simple, but its abstraction can also become a blindfold—especially if your monitoring stops at “logical drive optimal.”
The practical takeaway isn’t “ZFS good, RAID bad.” It’s: decide where truth lives, monitor that layer aggressively, and avoid performance “optimizations” that trade away durability unless you are consciously buying that risk. Storage doesn’t need your faith. It needs your verification.