ZFS vs Hardware RAID: Who Protects You When the Controller Lies?

Storage failures rarely show up as cinematic smoke. In production, the scariest failures are polite: a controller that says “all good,” a filesystem that returns the wrong bytes with the confidence of a fortune teller, a rebuild that “completed successfully” while quietly baking your remaining disks to death. If you’ve operated fleets long enough, you learn the real enemy isn’t the dead drive you can see—it’s the component that lies while continuing to serve I/O.

This is the heart of the ZFS vs hardware RAID debate. Not “which is faster in a benchmark,” but: when reality diverges from what your storage stack reports, who notices first, who can prove it, and who can fix it without guessing? We’re going to treat this like production engineering, not religion: what each stack is good at, where it fails, and what you should actually do on a Tuesday at 2 a.m. when the dashboard is green and the database is screaming.

The lie: how storage misleads you

“The controller lies” doesn’t mean it’s malicious. It means the software stack you trust is receiving a comforting story that doesn’t match the physical reality of the bytes on platters or flash cells. This can happen in several ways:

Silent corruption: bad data, good attitude

Silent corruption is when a read returns a block that is not what you wrote, and nothing complains. It can come from flaky cables, marginal SAS expanders, firmware bugs, DRAM errors in a controller, misdirected writes, or media returning stale data under rare conditions. Your application gets bytes. They’re just the wrong bytes.

Here’s why this is so operationally nasty: application-level checksums are rare, and even where they exist (think: some databases), the corruption may only be detected when that page is next touched, possibly weeks later. Meanwhile, backups happily preserve the corrupted version. That’s the nightmare scenario: the last good copy quietly ages out of retention.

The write cache paradox

Write caching can make a storage array look heroic right up until it loses power and becomes a historian rewriting events. If a controller acknowledges a write before it’s safely on non-volatile media, you’ve traded latency for truth. Battery-backed write cache (BBWC) and flash-backed write cache (FBWC) reduce that risk, but the devil lives in the details: battery health, capacitor aging, firmware policies, and “temporary” settings that become permanent because nobody wants to reboot to change them back.

Joke #1: A RAID controller with write-back cache and no power protection is like a motivational speaker—lots of confidence, not much accountability.

The rebuild spiral

When a disk fails in RAID5/6, rebuilds stress the surviving drives with sustained reads. That’s when latent sector errors show up. The array can go from “degraded but fine” to “degraded and now unreadable” at the worst possible time. ZFS has its own version (resilver), but it behaves differently when it knows which blocks are actually in use.

False positives and false negatives in health reporting

Drive health, temperature, error counters, and predictive failure—these are not objective truths so much as vendor-specific interpretations of S.M.A.R.T. attributes and link error logs. Hardware RAID adds another layer: sometimes you can’t see the drive’s real S.M.A.R.T. data without jumping through controller-specific hoops. That can make your monitoring feel like you’re checking a patient’s pulse through a winter coat.

Facts and history: why this keeps happening

Storage is old enough to have traditions, and some of them are… character-building. A few context points that matter in real designs:

  1. RAID was born in an era of expensive disks and cheap CPU cycles. The original RAID ideas (late 1980s) assumed you’d trade extra disks for reliability while keeping performance acceptable.
  2. The “write hole” has been a known RAID problem for decades. Parity RAID can acknowledge writes that update data and parity out of sync after power loss, unless the stack has protections.
  3. ZFS (mid-2000s) was designed around end-to-end checksums. The premise: storage should detect corruption at the filesystem layer, not trust lower layers blindly.
  4. Unrecoverable Read Error (URE) rates became relevant as drive sizes exploded. Bigger disks mean rebuilds read more bits, increasing the odds you hit a URE during rebuild—especially with consumer-grade drives.
  5. Write caches moved from “optional speed boost” to “mandatory for decent latency.” As workloads became random I/O heavy (VMs, databases), caching became a core behavior, not a nice-to-have.
  6. Controllers have firmware, DRAM, and sometimes buggy logic. Hardware RAID isn’t “pure hardware”; it’s an embedded system that can have its own failure modes.
  7. Filesystems historically trusted the block layer. Traditional stacks assumed a read returns what was written. That assumption is now routinely false in large fleets.
  8. Commodity servers + JBOD changed the economics. It became practical to run software-defined storage where the host CPU does parity/checksums and disks are dumb.
  9. NVMe reduced some problems and introduced new ones. Lower latency and fewer intermediaries help, but firmware issues, power-loss protection, and namespace quirks still exist.

ZFS’s model: end-to-end truth

ZFS is often described with a kind of reverence, like it’s a magical filesystem that defeats entropy. It’s not magic. It’s a set of design decisions that move the trust boundary upward and make integrity verifiable.

End-to-end checksums: the core mechanic

When ZFS writes a block, it computes a checksum and stores it in the parent block pointer, separate from the data block itself. When it reads, it verifies. If the checksum doesn’t match, ZFS knows the data is wrong. That’s already a huge shift: it’s no longer guessing whether the disk returned the right block.

If redundancy exists (mirror, RAIDZ), ZFS can read an alternate copy and repair the bad one (“self-healing”). This is the part people mean when they say ZFS “protects against bit rot.” The more precise claim: ZFS detects corruption and, with redundancy, can correct it.
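
A minimal sketch of the layout that makes self-healing possible, assuming two disks addressed by stable /dev/disk/by-id paths (pool and device names here are illustrative):

cr0x@server:~$ sudo zpool create -o ashift=12 tank \
    mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B

With a mirror (or RAIDZ) vdev, a checksum failure on one side can be repaired from the other copy. On a single-disk or stripe-only pool, ZFS can still detect the bad block, but it has nothing to repair it from.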

Copy-on-write and transactional semantics

ZFS doesn’t overwrite live blocks in place. It writes new blocks, then updates pointers. That means the on-disk state is always consistent from ZFS’s perspective: either the transaction committed or it didn’t. This design significantly reduces classic filesystem corruption after crashes and interacts favorably with integrity checking.

It doesn’t eliminate every failure mode. If data is corrupted before ZFS computes the checksum (for example, in faulty RAM), ZFS will checksum the wrong bytes and store them as valid. That’s why ECC RAM and stable hardware matter in serious ZFS deployments.

Scrubs: scheduled truth audits

A scrub is not a “performance optimization.” It’s an integrity audit: ZFS reads data and verifies checksums, repairing from redundancy when possible. If you want to find silent corruption before it becomes your only copy, scrubs are how you do it.

Operationally, the important nuance: scrubs create load. You schedule them, monitor them, and you don’t run them at the same time as your monthly “reindex everything” job unless you enjoy chaos.
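
One way to put scrubs on a schedule, sketched for a Linux host using cron (pool name, binary path, and timing are illustrative; many distros ship their own scrub timer or cron script, so check before adding a second one):

cr0x@server:~$ sudo tee /etc/cron.d/zfs-scrub <<'EOF'
# Weekly integrity audit: Sunday 02:00, outside backup and batch windows
0 2 * * 0 root /usr/sbin/zpool scrub tank
EOF

Pair the schedule with alerting on scrub results, so a scrub that repairs something pages a human instead of quietly moving on.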

Resilver vs rebuild: not just semantics

When replacing a drive in a ZFS mirror or RAIDZ, ZFS resilvers. In mirrors, it can copy only allocated blocks rather than the entire raw device, reducing time and wear. With RAIDZ, resilver behavior depends on metadata and can still be heavy, but the “only what’s in use” property often helps, especially with sparse usage.
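
Replacing a suspect disk is one command, and the resilver starts automatically; the names below match the illustrative pool used in the practical tasks later in this article:

cr0x@server:~$ sudo zpool replace tank ata-SAMSUNG_SSD_2 /dev/disk/by-id/ata-SAMSUNG_SSD_3
cr0x@server:~$ sudo zpool status tank

zpool status will then show a “resilver in progress” scan line with scanned/issued counters, in the same format as the scrub output shown in the tasks below.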

ZFS is not a free lunch

ZFS costs CPU and memory. It wants RAM for ARC (cache) and metadata. It’s happier with ECC memory. It demands that you understand recordsize, volblocksize (for zvols), and the interaction between sync writes and SLOG devices. You can absolutely make ZFS slow by misconfiguring it; ZFS will comply with your bad ideas with admirable professionalism.
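
Two of those knobs, sketched with illustrative dataset names and sizes: recordsize for file-based workloads and volblocksize for zvols backing VM disks.

cr0x@server:~$ sudo zfs create -o recordsize=16K -o compression=lz4 tank/postgres
cr0x@server:~$ sudo zfs create -V 200G -o volblocksize=16K tank/vm-disk01

recordsize can be changed later (it only affects newly written blocks), but volblocksize is fixed at zvol creation, so decide before the data moves in.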

Hardware RAID’s model: abstraction and acceleration

Hardware RAID is a different deal: put a controller in front of disks, let it manage redundancy, and present the OS with one or more logical volumes that look like normal disks. It’s seductive because it’s simple at the OS layer: the server sees “/dev/sda,” and life continues.

What hardware RAID does well

Good controllers with proper cache protection can deliver strong performance, especially for small random writes. They offload parity math (less important now with modern CPUs) and hide complexity from the OS. In some environments—especially those standardized around vendor support contracts—hardware RAID is a familiar, supportable shape.

What hardware RAID hides from you

That abstraction cuts both ways. If the controller maps logical blocks to physical blocks incorrectly, or if its cache acknowledges writes prematurely, the OS and filesystem may have no way to detect it. Traditional filesystems typically don’t checksum data blocks end-to-end; they rely on the block device being honest.

And even when the controller is behaving, it can obscure the health of underlying disks. You might see “logical drive optimal” while one disk is throwing media errors that the controller is remapping quietly. That sounds helpful until you realize you’ve lost the ability to reason about risk. You’re flying IFR with a single altimeter and no external view.

Patrol read, consistency checks, and their limits

Enterprise RAID controllers often have “patrol read” or background consistency checks. These can detect and sometimes correct parity inconsistencies. That’s real value. But it’s not the same as filesystem-level end-to-end integrity: the controller can validate parity relationships without knowing whether the data itself is the right data for that logical block in the first place.

The controller is a computer you forgot to patch

Hardware RAID failures are not always disk failures. Controllers have firmware bugs, memory errors, overheating issues, and sometimes weird interactions with specific drive firmware revisions. If you’ve never seen a controller behave differently after a firmware update, congratulations on your peaceful life—but also, you may not be patching aggressively enough.

Joke #2: The only thing more optimistic than a RAID controller is a project plan that says “no downtime required.”

Who protects you when the controller lies?

This is the decision-making frame I use in production: assume you will eventually have a component that tells you “success” when reality is “corruption.” Then design so that some layer can prove it.

Case 1: The controller returns wrong data on read

  • Hardware RAID + traditional filesystem: likely undetected until application-level checks fail (if they exist). The filesystem can’t validate data blocks. Your backups may happily preserve corruption.
  • ZFS on JBOD/HBA: checksum mismatch is detected. With redundancy, ZFS can correct and log it. Without redundancy, at least you know the block is bad rather than serving incorrect bytes silently.
  • ZFS on hardware RAID (single logical volume): ZFS can detect mismatches, but correction is limited if the underlying RAID volume always returns the same wrong data (no alternate copies visible to ZFS). You get detection without self-healing, and your troubleshooting is now layered.

Case 2: The controller lies about write durability

This is the write cache problem. If the controller acks writes early and then loses them, you can get filesystem corruption or application-level inconsistency.

  • Hardware RAID: depends entirely on cache policy and power-loss protection. If BBWC/FBWC is healthy and configured correctly, you’re usually fine. If not, you can lose acknowledged writes.
  • ZFS: ZFS cares deeply about sync writes. With proper configuration, ZFS will only acknowledge sync writes when they’re safely on stable storage (or on a SLOG device that is power-loss safe). But ZFS cannot overcome a disk/controller that lies about flushes. If your devices ignore cache flush commands, ZFS’s guarantees are weakened.
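
A quick audit of those assumptions on the ZFS side, assuming a pool named tank: check whether a separate log device exists and how sync is currently set.

cr0x@server:~$ sudo zpool status tank | grep -A 3 logs
cr0x@server:~$ sudo zfs get sync tank

No “logs” section means sync writes land in the pool’s main ZIL blocks, which is safe but slower; a dedicated log device only helps if it honors flushes and survives power loss.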

Case 3: The controller hides disk errors until it’s too late

Hardware RAID can keep an array “optimal” while remapping sectors and dealing with media errors. That can be good—until you realize multiple drives are degrading. ZFS tends to surface errors loudly (checksum errors, read errors, degraded vdev), which is annoying but actionable. In operations, I prefer noisy truth over quiet fiction.

The pragmatic answer

If your question is strictly “who protects you when the controller lies,” ZFS (with direct disk access through a proper HBA in IT mode) is the stronger story because it validates data at the filesystem layer and can self-heal with redundancy.

That doesn’t automatically mean “never use hardware RAID.” It means: if you choose hardware RAID, you must accept that you’re outsourcing correctness to the controller and you need operational controls (battery monitoring, patrol reads, firmware lifecycle, periodic verification) that match that trust.

Three corporate-world mini-stories

Mini-story #1: The incident caused by a wrong assumption

A mid-size company ran a busy VM cluster on hardware RAID10. It was boring—in the good way—for years. Then they virtualized a few more database-heavy workloads and started seeing rare, random failures: a service would crash, a database would report a corrupt page, then everything would be fine after a restart. Classic “cosmic rays” territory, so the first response was to blame the applications.

They replaced memory. They tuned the hypervisor. They even added more monitoring. The storage dashboard remained green: logical drive optimal, no failed disks, cache healthy. The wrong assumption was simple: “If the RAID controller says it’s healthy, the bytes are healthy.”

The breakthrough came from a boring test: they took a known data set, wrote it, computed checksums at the application layer, and then re-read it repeatedly under load. Once every few days, a block came back wrong. Not consistently wrong—just wrong enough to ruin your week. The controller logs showed nothing useful. The disks’ S.M.A.R.T. data wasn’t visible through the controller in their current tooling, and the OS had no visibility into individual drives.

They eventually reproduced the problem more frequently by swapping a SAS cable and moving the array behind an expander with a marginal port. The controller was retrying reads until it got something and returning it. From its perspective, it had satisfied the request. From the application’s perspective, reality had drifted.

The fix was not “switch to ZFS” overnight. The fix was: expose real disk health, replace the marginal path, enable patrol reads with alerting, and implement end-to-end checksumming where possible. Later, on the next refresh, they moved to ZFS mirrors on HBAs for the most integrity-sensitive workloads. The lesson wasn’t that RAID is evil. It was that health LEDs are not data integrity guarantees.

Mini-story #2: The optimization that backfired

A different org had a habit: every storage problem was a performance problem until proven otherwise. They were running ZFS on a solid JBOD shelf, but one workload had latency spikes during peak hours. Someone proposed a “simple” fix: add a fast SLOG device for sync writes and set a few datasets to sync=always “to make it safe,” and a few others to sync=disabled “to make it fast.” Yes, both at once. In the same pool. On production.

They did add an NVMe device as SLOG. It was fast. It was also a consumer NVMe without power-loss protection. Under normal conditions it looked fantastic. The latency graphs flattened out. The optimization was declared a win and promptly forgotten, which is how most disasters are funded.

Months later, there was a power event that didn’t trip the UPS properly—just long enough to brown out the server and reboot it. ZFS did what ZFS does: it came back consistent. But a chunk of the “fast” datasets that had sync disabled had acknowledged writes that never landed. A VM’s filesystem survived, but the database inside it had missing transactions and a broken journal. The root cause wasn’t “ZFS is unsafe.” The root cause was using ZFS controls to override durability semantics without understanding the workload, plus putting an unsafe device in the one role where you really want power-loss safety.

What they changed afterward was refreshingly unsexy: they stopped using sync=disabled as a performance knob; they used a proper enterprise SSD with power-loss protection for SLOG; and they created a policy that any dataset change to sync semantics required a ticket with a rollback plan.

Mini-story #3: The boring but correct practice that saved the day

A financial services team had a storage philosophy that annoyed people: “Trust nothing. Verify routinely.” They ran ZFS mirrors for critical systems and RAIDZ2 for large repositories. Every pool had a scrub schedule and alerting on checksum errors, not just device failures. They also had a quarterly drill: simulate a failed disk, replace it, and confirm resilver behavior and alerts. It felt like busywork until it wasn’t.

One quarter, the scrub reported a small but non-zero number of checksum errors on a mirror vdev. ZFS repaired them automatically, but the presence of errors was the signal. The team treated checksum errors like smoke: you don’t ignore them because the fire alarm “handled it.” They pulled SMART data, saw growing UDMA CRC errors on one drive path, and replaced a SAS cable during a scheduled window.

A few weeks later, that same rack experienced a vibration event (construction nearby, the kind of thing facilities swears “couldn’t possibly matter”). One of the drives in that mirror dropped briefly. Because the cable had been replaced earlier, the mirror stayed stable. The event became a single alert and a brief resilver rather than a multi-hour incident with a degraded pool and panicked leadership.

Nothing about this story is glamorous. That’s the point. The practice that saved them was a scrub schedule, alerting that treats checksum errors as actionable, and routine drills that kept the replacement process crisp. In production, boring is a feature.

Practical tasks (commands + interpretation)

The commands below assume Linux with ZFS installed (OpenZFS), plus common tools. Adjust device names and pool names. Every task includes what “good” and “bad” tend to look like.

Task 1: Get a quick truth snapshot of a ZFS pool

cr0x@server:~$ sudo zpool status -v
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 02:14:33 with 0 errors on Sun Dec 22 03:10:21 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            ata-SAMSUNG_SSD_1       ONLINE       0     0     0
            ata-SAMSUNG_SSD_2       ONLINE       0     0     0

errors: No known data errors

Interpretation: Look at READ, WRITE, and especially CKSUM. Non-zero checksum errors mean ZFS detected corruption on reads. If redundancy exists, ZFS may have repaired it—still investigate the path (drive, cable, HBA, backplane).

Task 2: Force a scrub and watch it like it matters

cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Mon Dec 23 01:00:02 2025
        312G scanned at 1.20G/s, 22.4G issued at 88.0M/s, 1.80T total
        0B repaired, 1.21% done, 05:40:11 to go

Interpretation: “Issued” vs “scanned” helps you understand actual disk work. If scrub speed collapses, you may have a sick disk or a bottlenecked controller. If a scrub finds errors, don’t just celebrate that it repaired them—treat it as a leading indicator.

Task 3: Find ZFS events and fault history

cr0x@server:~$ sudo zpool events -v | tail -n 30
Dec 23 01:10:44.123456  ereport.fs.zfs.checksum
        class = "ereport.fs.zfs.checksum"
        ena = 0x2c4b3c1f2b000001
        detector = (embedded nvlist)
        ...

Interpretation: ZFS events provide breadcrumb trails. Repeated checksum events tied to a specific vdev or device often indicate a path problem (cable/backplane) as much as a failing disk.

Task 4: Confirm ZFS ashift (sector alignment) before you regret it

cr0x@server:~$ sudo zdb -C tank | grep -E "ashift|vdev_tree" -n | head
45:        vdev_tree:
78:            ashift: 12

Interpretation: ashift=12 means 4K sectors. Misaligned ashift can hurt performance permanently; you can’t change it without recreating vdevs. If you’re on modern drives/SSDs, ashift 12 (or 13 for 8K) is typical.

Task 5: Check dataset properties that often cause “mystery latency”

cr0x@server:~$ sudo zfs get -o name,property,value -s local,default recordsize,compression,atime,sync,logbias tank
NAME  PROPERTY     VALUE     SOURCE
tank  recordsize   128K      default
tank  compression  lz4       local
tank  atime        off       local
tank  sync         standard  default
tank  logbias      latency   default

Interpretation: For VM images (zvols), you care more about volblocksize than recordsize. For databases, recordsize can matter. sync is not a casual toggle; changing it changes durability semantics.

Task 6: Inspect ARC stats to see if you’re memory-bound

cr0x@server:~$ sudo arcstat 1 5
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
01:20:01   842    58      6    12    1    44    5     2    0   24G   32G
01:20:02   901    61      6    10    1    49    5     2    0   24G   32G

Interpretation: High miss rates under steady read workloads suggest ARC is too small or working set is too large. That doesn’t prove “buy more RAM,” but it’s a strong clue. Don’t confuse cache misses with disk failure; they just mean the system is doing real I/O.
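
If you do conclude the ARC is the constraint, the cap is a module parameter on Linux OpenZFS. The 16 GiB value below is purely illustrative, and writing it at runtime is an assumption about your platform; persist it via module options if you keep it.

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max
0
cr0x@server:~$ echo 17179869184 | sudo tee /sys/module/zfs/parameters/zfs_arc_max

A value of 0 means “use the built-in default,” which on most builds is roughly half of system RAM.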

Task 7: For sync write pain, verify whether you’re actually sync-bound

cr0x@server:~$ sudo zpool iostat -v tank 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.20T  600G     210   1800  45.2M  92.1M
  mirror-0  1.20T  600G     210   1800  45.2M  92.1M
    sda         -      -     105    900  22.6M  46.0M
    sdb         -      -     105    900  22.6M  46.0M

Interpretation: Watch write IOPS and latency (use iostat -x too). If latency spikes align with sync-heavy workloads, a safe SLOG device can help—but only if it has power-loss protection and you size expectations correctly.
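
Adding a log vdev is one command; mirroring it is cheap insurance against losing in-flight sync data if the SLOG dies. The device ids below are illustrative and should be power-loss-protected SSDs.

cr0x@server:~$ sudo zpool add tank log mirror /dev/disk/by-id/nvme-PLP_SSD_A /dev/disk/by-id/nvme-PLP_SSD_B

Then re-run the same zpool iostat / iostat -x comparison under the real workload; if latency doesn’t move, the bottleneck wasn’t sync writes in the first place.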

Task 8: Check individual disk health (JBOD/HBA case)

cr0x@server:~$ sudo smartctl -a /dev/sda | egrep -i "model|serial|reallocated|pending|crc|power_on|temperature"
Device Model:     ST12000NM0008
Serial Number:    ZJV123AB
Reallocated_Sector_Ct     0
Current_Pending_Sector    0
UDMA_CRC_Error_Count      12
Power_On_Hours            39120
Temperature_Celsius       41

Interpretation: Reallocated/pending sectors are classic “media is deteriorating.” UDMA CRC errors often implicate the cable/backplane path, not the drive media. If CRC errors climb, replace the path before you replace the disk.
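
When the counters are ambiguous, a long self-test puts the drive itself on record (it runs in the background; duration varies by drive):

cr0x@server:~$ sudo smartctl -t long /dev/sda
cr0x@server:~$ sudo smartctl -l selftest /dev/sda

A completed test reporting a read failure plus pending sectors points at the media; clean self-tests plus climbing CRC counts point back at the cable or backplane.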

Task 9: Check for kernel-level storage errors that ZFS might be reacting to

cr0x@server:~$ sudo dmesg -T | egrep -i "ata|sas|scsi|reset|timeout|I/O error" | tail -n 30
[Mon Dec 23 01:12:10 2025] sd 4:0:0:0: [sda] tag#32 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[Mon Dec 23 01:12:10 2025] blk_update_request: I/O error, dev sda, sector 123456789

Interpretation: Timeouts and resets correlate strongly with checksum errors and degraded vdevs. If you see frequent link resets, start with physical layer: cables, backplane, expander, HBA firmware.

Task 10: If you’re on hardware RAID, verify cache/battery status (example with storcli)

cr0x@server:~$ sudo storcli /c0 show
Controller = 0
Status = Success
Description = None

Product Name = RAID Controller
FW Package Build = 24.21.0-0123
BBU = Present

cr0x@server:~$ sudo storcli /c0/bbu show
BBU_Info:
  Battery State = Optimal
  Next Learn time = 2025-12-30 02:00:00

Interpretation: If BBU is missing, failed, or in learn cycle, many controllers disable write-back cache or behave differently. The performance symptom often shows up before anyone reads the BBU status alert.

Task 11: On hardware RAID, check physical drive error counters

cr0x@server:~$ sudo storcli /c0/eall/sall show all | egrep -i "Drive|Media Error|Other Error|Predictive"
Drive /c0/e32/s0
Media Error Count = 0
Other Error Count = 14
Predictive Failure Count = 0

Interpretation: “Other Error” often means link issues, timeouts, or command aborts. A rising count is a canary. Don’t wait for “Predictive Failure.” That flag is conservative and sometimes arrives after your patience is already gone.

Task 12: Verify write cache policy on hardware RAID

cr0x@server:~$ sudo storcli /c0/vall show | egrep -i "Cache|Write"
Cache Policy = WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

Interpretation: “WriteBack” is good only if you truly have power protection and the policy is safe when BBU is bad. If you see “WriteThrough” unexpectedly, expect latency pain. If you see “WriteBack” with “Always Write Cache,” be very sure you like living dangerously.

Task 13: Measure real latency at the block layer

cr0x@server:~$ sudo iostat -x 1 5
Device            r/s   w/s  r_await  w_await  aqu-sz  %util
sda              95.0  820.0    2.10   18.40    4.20   98.0

Interpretation: High w_await and high %util suggest the device is saturated or stalling. If this is a RAID virtual disk, remember you’re seeing the controller’s personality as much as the disks.

Task 14: Spot a write-hole risk in parity RAID (conceptual check)

cr0x@server:~$ sudo storcli /c0/v0 show | egrep -i "RAID Level|Strip Size|Write Cache"
RAID Level = Primary-5, Secondary-0, RAID Level Qualifier-3
Strip Size = 256 KB
Write Cache = WriteBack

Interpretation: RAID5 + write-back cache can be fine with proper power protection. Without it, a power loss can leave parity inconsistent. The symptom later is “filesystem corruption” that looks like software’s fault.

Fast diagnosis playbook

This is the “don’t panic, don’t theorize” sequence I use to find the bottleneck or integrity issue quickly. The goal is to separate latency, throughput, and correctness problems, because they look similar when users are yelling.

First: establish whether you have an integrity emergency

  1. ZFS: zpool status -v. If CKSUM is non-zero or the pool is degraded, treat it as priority one. Check zpool events -v.
  2. Hardware RAID: controller status + physical drive error counters. Don’t stop at “virtual drive optimal.” Look at media/other errors and cache/battery health.
  3. OS logs: dmesg -T for timeouts, resets, I/O errors.

Second: identify the pain type (IOPS vs bandwidth vs latency spikes)

  1. Block device utilization: iostat -x 1 to see await times and %util.
  2. ZFS device view: zpool iostat -v 1 to see which vdev/disk is hot.
  3. Application symptoms: are you seeing slow commits (sync write pain), slow reads (cache misses / disk), or periodic stalls (timeouts/resets)?

Third: validate caching and durability assumptions

  1. ZFS: check dataset sync, presence of SLOG, and whether the SLOG device is safe (power-loss protection). Confirm you didn’t “fix” latency by disabling durability.
  2. Hardware RAID: confirm write-back cache behavior and what happens when BBU is bad or learning. Check if cache policy changed after an event.

Fourth: localize the fault domain

  1. Single disk with rising errors? Replace disk or path.
  2. Multiple disks with CRC/link errors? Suspect HBA/backplane/expander/cabling.
  3. Only one workload slow? Suspect dataset properties, recordsize/volblocksize, sync patterns, fragmentation, or application-level I/O patterns.

Common mistakes, symptoms, fixes

Mistake 1: Running ZFS on top of hardware RAID and expecting self-healing

Symptom: ZFS reports checksum errors but can’t repair them; errors recur on the same logical blocks.

Why: ZFS sees one logical device; it can detect corruption but has no alternate copy to fetch if the RAID volume returns the same bad data.

Fix: Prefer HBAs/JBOD so ZFS can manage redundancy. If you must use hardware RAID, use a mirror of logical volumes at minimum and still treat it as layered risk.

Mistake 2: Treating sync=disabled as a performance feature

Symptom: After a crash/power event, databases show missing transactions or corrupted journals despite “clean” pool import.

Why: ZFS acknowledged writes that weren’t durable.

Fix: Keep sync=standard unless you fully understand the workload and accept the risk. Use a proper SLOG device for sync-heavy workloads.
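
A quick audit for this mistake, assuming a pool named tank (the dataset name in the second command is illustrative): list datasets where sync has been explicitly overridden, then put the risky ones back.

cr0x@server:~$ sudo zfs get -r -s local sync tank
cr0x@server:~$ sudo zfs set sync=standard tank/fast-dataset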

Mistake 3: Ignoring checksum errors because “ZFS repaired them”

Symptom: Occasional checksum errors during scrubs; later you see a disk drop or a degraded vdev.

Why: The underlying issue (cable, HBA, disk) is worsening.

Fix: Treat checksum errors as hardware investigation triggers. Check SMART, replace suspect cables, update HBA firmware.

Mistake 4: Not monitoring RAID cache/battery health

Symptom: Sudden latency regression after maintenance; controller silently switches to write-through.

Why: BBU in learn cycle or failed, policy changes.

Fix: Alert on BBU state and cache policy. Schedule learn cycles. Confirm cache settings after firmware updates.

Mistake 5: RAID5 for large disks without a rebuild risk plan

Symptom: Second disk error during rebuild; array becomes unrecoverable or goes read-only.

Why: Rebuild reads enormous amounts of data; latent errors appear.

Fix: Use RAID6/RAIDZ2 or mirrors for large disks and critical data. Keep hot spares, monitor error rates, and test rebuild times.
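
A RAIDZ2 layout for large disks, sketched with six illustrative device ids; any two drives can fail without data loss, which is the property you want when rebuild windows stretch into days.

cr0x@server:~$ sudo zpool create -o ashift=12 bulk raidz2 \
    /dev/disk/by-id/ata-HDD_1 /dev/disk/by-id/ata-HDD_2 /dev/disk/by-id/ata-HDD_3 \
    /dev/disk/by-id/ata-HDD_4 /dev/disk/by-id/ata-HDD_5 /dev/disk/by-id/ata-HDD_6

Appending spare /dev/disk/by-id/ata-HDD_7 to the same command would add a hot spare, if you keep one in the chassis anyway.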

Mistake 6: Believing “drive failure” is the only disk problem

Symptom: Random I/O timeouts, link resets, CRC errors, but drives test “fine.”

Why: Path issues: cabling, backplane, expander, power, firmware mismatches.

Fix: Replace the path components methodically and watch error counters. Don’t shotgun disks first.

Checklists / step-by-step plan

Step-by-step: choosing between ZFS and hardware RAID (production version)

  1. Define the failure you cannot tolerate. Is it downtime, data corruption, or performance collapse during rebuild?
  2. Decide where you want integrity verification. If you want end-to-end checksums and self-healing at the filesystem layer, plan for ZFS with direct disk access.
  3. Decide who owns caching semantics. Hardware RAID: controller owns it. ZFS: the host owns it, but devices can still lie about flushes.
  4. Pick redundancy based on rebuild math, not superstition. Mirrors for performance and fast recovery; RAIDZ2/RAID6 for capacity efficiency with better fault tolerance than single parity.
  5. Plan monitoring before deployment. ZFS: alert on checksum errors, degraded pools, slow scrubs. RAID: alert on BBU state, cache policy, media/other errors, patrol read outcomes.
  6. Plan recovery drills. Practice replacing a disk, importing pools, verifying data, and timing resilvers/rebuilds.

Step-by-step: baseline a new ZFS pool safely

  1. Use an HBA in IT mode (pass-through). Avoid RAID volumes.
  2. Create pool with correct ashift for your media.
  3. Enable compression=lz4 for most general workloads.
  4. Set atime=off unless you truly need it.
  5. Define datasets for different workloads; don’t dump everything in one dataset.
  6. Schedule scrubs and alert on checksum errors.
  7. Validate backups by restore testing, not by optimism.
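
A compact sketch of steps 3 through 5 on a freshly created pool, with illustrative dataset names; scrub scheduling can reuse the cron example from earlier in the article.

cr0x@server:~$ sudo zfs set compression=lz4 tank
cr0x@server:~$ sudo zfs set atime=off tank
cr0x@server:~$ sudo zfs create tank/vm
cr0x@server:~$ sudo zfs create tank/db

Per-dataset properties (recordsize, sync, quotas) then get tuned per workload instead of pool-wide.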

Step-by-step: operating hardware RAID without fooling yourself

  1. Verify write cache is protected (BBWC/FBWC) and policy disables write-back when protection is bad.
  2. Enable patrol read/background checks and alert on their findings.
  3. Expose physical drive stats to monitoring (media errors, other errors, temperature).
  4. Keep firmware lifecycle disciplined (controller, backplane/expander if applicable, drives).
  5. Test rebuild times and performance impact in a maintenance window before you need it during an incident.

FAQ

1) Does ZFS eliminate the need for hardware RAID?

For most ZFS deployments, yes: you generally want direct disk access via an HBA so ZFS can manage redundancy and integrity. Hardware RAID can still be used in environments with strict vendor support requirements, but it adds a layer that can obscure disk truth.

2) Is ZFS safe without ECC RAM?

ZFS will run without ECC, and many people do. But if you’re designing for “who protects you when something lies,” ECC reduces the chance that memory corruption turns into valid-but-wrong checksums or metadata issues. For critical systems, ECC is a practical risk reducer.

3) Can hardware RAID detect silent corruption?

It can detect some inconsistencies (parity mismatches, bad sectors) via patrol reads and consistency checks, but it typically does not provide filesystem-level end-to-end validation of user data. Without end-to-end checksums above the controller, you may not detect “wrong but readable” data.

4) Is ZFS on top of a single hardware RAID volume a good idea?

It’s usually a compromise: ZFS can detect corruption but may not be able to correct it, and you complicate troubleshooting. If you must do it, prefer mirrored logical volumes and be serious about controller health monitoring.

5) What redundancy should I use: mirrors, RAIDZ1/2/3, RAID5/6?

For critical systems with demanding latency, mirrors are often the most predictable. For capacity-heavy datasets, RAIDZ2 (or RAID6) is a common balance. Single-parity (RAIDZ1/RAID5) becomes increasingly risky with large disks and long rebuilds.

6) Are ZFS scrubs the same as RAID patrol reads?

They rhyme but they’re not identical. A ZFS scrub verifies data against checksums at the filesystem layer and can repair from redundancy. Patrol read focuses on drive/array consistency at the controller layer and may not validate that the bytes are the “correct” bytes for your application.

7) Why do I see checksum errors but SMART looks fine?

Because checksum errors can come from the path (cables, HBA, expander, backplane) and transient read issues, not just failing media. Also, SMART is not a perfect predictor; some drives fail “suddenly” with minimal SMART warning.

8) What’s the most common way storage teams accidentally lose durability?

Turning off sync semantics (ZFS sync=disabled) or relying on write-back cache without reliable power-loss protection. Both make benchmarks look great and post-mortems look expensive.

9) If ZFS detects corruption, does that mean my data is already lost?

Not necessarily. With redundancy, ZFS can often correct the bad copy automatically. But detection means something in the chain is untrustworthy, so you should investigate before the next error lands on the last good copy.

Conclusion

When the controller lies, the winner is the stack that can prove the lie and recover without guessing. ZFS’s end-to-end checksums and self-healing (with proper redundancy and direct disk access) make it a strong answer to silent corruption and misbehavior in the lower layers. Hardware RAID can be fast and operationally simple, but its abstraction can also become a blindfold—especially if your monitoring stops at “logical drive optimal.”

The practical takeaway isn’t “ZFS good, RAID bad.” It’s: decide where truth lives, monitor that layer aggressively, and avoid performance “optimizations” that trade away durability unless you are consciously buying that risk. Storage doesn’t need your faith. It needs your verification.
