ZFS is picky in the way a good pager is picky: it only screams when you’ve earned it. Give it clean disks, honest error reporting, and predictable latency, and it will pay you back with boring uptime. Put a RAID controller between ZFS and the platters and you can turn “boring” into “mysterious” fast.
The marketing words don’t help. “Passthrough.” “HBA mode.” “IT mode.” “JBOD.” “Non-RAID.” Some of these mean what you think. Some mean “we’re still doing RAID things, just quietly.” This piece is about telling the difference, choosing the right gear, and diagnosing the failure modes that show up in production at 2 a.m.
The principle: ZFS wants disks, not opinions
ZFS is a filesystem and a volume manager. That’s not a slogan; it’s a design contract. ZFS expects to see the actual disks (or at least something that behaves like disks) because it wants to:
- control redundancy (mirrors, RAIDZ) and understand it end-to-end,
- track checksums and repair using redundant copies,
- manage write ordering and flush semantics,
- observe I/O errors and act on them (fault a device, resilver, scrub),
- see identity and topology (serials, paths) so it can survive cable/controller weirdness.
Hardware RAID controllers were built for a different contract: abstract disks into “virtual drives,” hide bad blocks, retry internally, reorder writes, and generally provide the illusion that the world is fine. ZFS’s whole point is that the world is not fine, and pretending otherwise leads to the kind of data loss that looks like “random corruption” until you realize it was deterministic all along.
The goal, if you’re running ZFS, is simple: present each physical disk to the OS with minimal translation. That’s where IT mode and proper HBAs shine. And that’s where “passthrough” RAID controllers sometimes lie to you.
Terms that vendors abuse: IT mode, HBA, JBOD, passthrough
IT mode (Initiator-Target mode)
In LSI/Broadcom land (and most of the rebranded ecosystem: Dell PERC, IBM ServeRAID, Fujitsu, etc.), “IT mode” is firmware that turns a RAID-capable controller into a simpler SAS initiator. It stops doing RAID virtualization and exposes each drive as its own target.
“IT mode” is not a universal term across all vendors, but in practice it’s shorthand for “HBA firmware on an LSI SAS controller.”
HBA (Host Bus Adapter)
A true HBA is a controller designed to attach disks and expose them as-is. No RAID stack, no virtual drive layer, no write-back caching pretending to be your filesystem. It may support SAS expanders, multipath, and higher queue depths. It’s the storage equivalent of a good wrench: not exciting, always there.
JBOD mode
JBOD is where things get slippery. Some controllers have a “JBOD” option that creates one “virtual disk” per physical disk. Sometimes that’s close enough. Sometimes it’s a RAID0 wrapper around each disk with extra state, caching, remapping, and error handling. ZFS can run on top of that, but now you’ve inserted a translation layer that can mask errors and break assumptions.
Passthrough
“Passthrough” might mean:
- the OS gets SCSI/SAS devices directly, including serials and SMART,
- or the OS gets a virtual device that forwards most commands but not all,
- or the OS gets a “disk-like object” that shares only enough of reality to boot.
When someone says “passthrough,” your next question is: passthrough of what, exactly? Errors? Flushes? SMART? Queue depth? TRIM? Power loss behavior? Because that’s where the bodies are buried.
IT mode: what it is, what it buys you, what can still go wrong
IT mode is the sweet spot for many ZFS builds because it’s widely available, cheap on the secondary market, and stable in mainstream OS drivers (Linux mpt2sas/mpt3sas, FreeBSD mps/mpr). You get physical disks showing up as physical disks. ZFS gets to do its job.
What you gain with IT mode
- Direct disk visibility: serial numbers, WWNs, and predictable device IDs.
- Honest error reporting: media errors show up to the OS; ZFS can fault drives.
- Predictable flush semantics: drive cache flushes are not “helpfully” reinterpreted by RAID firmware.
- Less write-path nonsense: no write-back cache trying to outrun power loss.
- Better compatibility: tools like smartctl and sg_ses usually behave.
What can still go wrong
IT mode doesn’t magically fix the physics. Your bottlenecks just become more honest.
- Bad cables/backplanes: SAS is robust until it isn’t. One marginal mini-SAS cable can turn into link resets and timeouts that look like “random disk errors.”
- Expanders under load: cheap expanders can introduce head-of-line blocking, especially with many HDDs scrubbing simultaneously.
- Queue depth mismatch: too shallow and you waste throughput; too deep and you inflate latency during contention.
- Firmware mismatch: mixing firmware generations can be fine, until a certain drive model hits a corner case. Storage is a museum of corner cases.
- Thermals: HBAs run hot. “Works in a lab” becomes “link flaps in production” when the chassis airflow isn’t real.
Joke #1: An HBA without airflow is like a database without backups—technically functional right up until it becomes your personality.
True HBAs: boring, correct, fast enough
If you have a choice, a true HBA is the least surprising option for ZFS. That’s the compliment. Production loves “least surprising.”
Modern “true” HBAs are usually SAS3 (12Gbps) or SAS4 (24Gbps) capable, support expanders properly, and have drivers that have been beaten up by a decade of real-world abuse. With HDD pools, you’ll hit disk limitations long before you hit an HBA limitation. With SSD pools, HBA selection matters more, but the rule stays: keep the path simple.
The practical differences you notice with a true HBA:
- Disk identity stays stable across reboots and rescans.
- SMART and error counters are readable without weird controller-specific incantations.
- ZFS fault management behaves predictably when a disk degrades.
- Latency is “just disks” plus bus overhead, not “disks plus firmware committee.”
“Fake HBA”: RAID controllers with passthrough makeup
Let’s define “fake HBA” plainly: a RAID controller that claims to expose single disks but still keeps a RAID-era abstraction layer in front of them. Sometimes this is called JBOD, sometimes “HBA mode,” sometimes “passthrough.” It’s not always evil. It’s just not always honest.
Why it exists
Vendors built RAID controllers for Windows/VMware-era datacenters where the controller is the storage brain. Then ZFS (and software RAID generally) became mainstream, and suddenly people wanted “just a disk.” Rather than redesign silicon, vendors often shipped a firmware feature that approximates disk exposure.
What “fake HBA” breaks
- SMART transparency: you might see partial SMART, or none, or it might be mapped through controller-specific pages.
- Error semantics: the controller might retry and hide marginal media, turning a failing disk into long latency spikes instead of clean errors.
- Write barriers/flush: the controller might acknowledge flushes early due to its cache policy.
- Device identity: drives show up as “Virtual Disk 0” rather than stable serials/WWNs.
- Recovery behavior: on reset or power events, the controller can reorder what the OS sees and when.
When it’s “good enough”
Sometimes you inherit hardware and you can’t change it this quarter. If the controller provides true JBOD that exposes each disk with stable identifiers and full SMART, and you can disable caching and read-ahead policies cleanly, you can make it work. The question is: can you verify all of that, repeatedly, and can you detect when firmware updates change behavior?
When it’s a hard no
If you can’t get reliable SMART, if disk IDs are unstable, if ZFS sees weird timeouts during scrubs, or if the controller requires virtual-drive wrappers per disk, stop negotiating. Swap it for an HBA. The money you “save” will be spent on incident response. And you’ll pay in weekends.
Joke #2: A RAID controller in “passthrough mode” is like a meeting that “could have been an email”—still somehow takes two hours and breaks your afternoon.
Interesting facts and historical context (the short, useful kind)
- ZFS was born in an era of lying disks. Early commodity SATA drives and controllers would reorder writes aggressively; ZFS’s emphasis on end-to-end checksums was partly a response.
- “IT mode” came from the SAS world, not the ZFS world. Initiator-Target firmware was meant for hosts that just needed to talk SAS without RAID logic in the middle.
- Many famous “RAID controller” cards are just rebrands of LSI silicon. Dell PERC and IBM ServeRAID lines often map to the same chip families, with different firmware constraints.
- Write-back cache was historically a performance band-aid for slow disks. It improved benchmarks dramatically, but it also made power-loss correctness someone else’s problem—often your problem.
- Battery-backed cache evolved into flash-backed cache. Batteries age and swell; flash modules with supercaps became common to preserve cache content without constant battery babysitting.
- SMART wasn’t designed for RAID controllers. Controllers that sit between OS and disk often need vendor-specific passthrough mechanisms; not all implement them fully.
- SAS expanders are basically Ethernet switches for disks. They multiplex lanes; good ones behave, bad ones introduce subtle congestion patterns during scrub/resilver storms.
- Queue depth became the stealth performance knob. As SSDs rose, the ability to sustain many outstanding commands mattered more than raw link speed.
- Some controllers present RAID0-per-disk as “JBOD.” It looks like a disk, smells like a disk, and still has metadata and policy state that can bite you later.
What to buy and what to avoid (opinionated)
Buy this
- A true HBA (SAS3/SAS4) supported well by your OS, especially if you run SSDs or large drive counts.
- An LSI/Broadcom SAS controller flashed to IT mode if you’re comfortable validating firmware and keeping spares.
- Good cables and a sane backplane. Most “controller problems” are copper problems wearing a different hat.
Avoid this
- Hardware RAID for ZFS (real arrays). ZFS on top of a RAID5/6 virtual disk is how you get checksum errors you can’t repair.
- “JBOD” that is really RAID0 wrappers per disk, especially if you can’t read SMART cleanly.
- Controllers that can’t pass through drive serials/WWNs. If you can’t identify a failing disk confidently, you will eventually pull the wrong one.
- Any setup where you can’t disable write-back caching or at least verify flush behavior. ZFS needs truthful barriers, not vibes.
The one quote
“Hope is not a strategy.” — General Gordon R. Sullivan
Practical tasks: commands, outputs, and decisions (12+)
These are not “toy” commands. They’re the ones you run when you’re deciding whether you have a real HBA, whether passthrough is lying, and where your performance went. Examples assume Linux, but the logic carries elsewhere.
Task 1: Identify the controller and its driver
cr0x@server:~$ lspci -nn | egrep -i 'sas|scsi|raid'
03:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0097] (rev 02)
04:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS-3 3108 [1000:005d] (rev 02)
What it means: The first line is a Fusion-MPT SAS controller (often used as an HBA/IT mode device). The second is MegaRAID (RAID personality).
Decision: Prefer the Fusion-MPT device for ZFS. If you must use the MegaRAID, plan to validate “JBOD” behavior aggressively or replace it.
Task 2: Confirm which kernel modules are in use
cr0x@server:~$ lsmod | egrep 'mpt3sas|megaraid|aacraid|hpsa'
mpt3sas 307200 2
megaraid_sas 180224 0
What it means: mpt3sas is typical for LSI SAS3 HBAs/IT mode; megaraid_sas indicates RAID stack.
Decision: If your disks sit behind megaraid_sas, assume you might have a “fake HBA” until proven otherwise.
Task 3: Map disks to controller paths and check if they look “virtual”
cr0x@server:~$ lsblk -o NAME,MODEL,SERIAL,HCTL,TYPE,SIZE
NAME MODEL SERIAL HCTL TYPE SIZE
sda ST12000NM0008 ZHZ123AB 3:0:0:0 disk 10.9T
sdb ST12000NM0008 ZHZ124CD 3:0:1:0 disk 10.9T
sdc MR9361-8i 00c0ffee 4:2:0:0 disk 7.3T
What it means: sda/sdb show real drive model/serial. sdc shows the controller model as the “disk” model—classic virtual drive presentation.
Decision: Do not build a ZFS pool on sdc unless you understand exactly what’s being virtualized and why.
Task 4: Verify ZFS sees stable identifiers (WWN) and not just /dev/sdX
cr0x@server:~$ ls -l /dev/disk/by-id/ | egrep 'wwn|scsi'
lrwxrwxrwx 1 root root 9 Dec 26 09:11 wwn-0x5000c500a1b2c3d4 -> ../../sda
lrwxrwxrwx 1 root root 9 Dec 26 09:11 wwn-0x5000c500a1b2c3d5 -> ../../sdb
lrwxrwxrwx 1 root root 9 Dec 26 09:11 scsi-35000c500a1b2c3d4 -> ../../sda
What it means: WWN-based IDs exist and are stable across reboots (assuming the controller doesn’t rewrite identity).
Decision: Use /dev/disk/by-id/wwn-* paths when creating pools. If you don’t have WWNs, treat that as a red flag.
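For reference, a minimal creation sketch using those stable paths (the pool name and layout mirror the examples later in this piece; the WWNs are placeholders for your own):
cr0x@server:~$ sudo zpool create -o ashift=12 tank raidz2 \
    /dev/disk/by-id/wwn-0x5000c500a1b2c3d4 \
    /dev/disk/by-id/wwn-0x5000c500a1b2c3d5 \
    /dev/disk/by-id/wwn-0x5000c500a1b2c3d6 \
    /dev/disk/by-id/wwn-0x5000c500a1b2c3d7
Setting ashift=12 explicitly (right for 4K-sector drives) avoids trusting whatever sector size the controller decides to report.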
Task 5: Check if SMART works directly (best sign of real passthrough)
cr0x@server:~$ sudo smartctl -a /dev/sda | egrep 'Model Family|Device Model|Serial Number|SMART overall-health'
Model Family: Seagate IronWolf Pro
Device Model: ST12000NM0008
Serial Number: ZHZ123AB
SMART overall-health self-assessment test result: PASSED
What it means: Direct SMART access on /dev/sda works normally.
Decision: This disk path is “real enough” for ZFS. Repeat for all disks; one misbehaving bay/backplane can be the outlier.
Task 6: SMART behind MegaRAID (if you must), and what failure looks like
cr0x@server:~$ sudo smartctl -a /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.0] (local build)
Smartctl open device: /dev/sdc failed: DELL or MegaRaid controller, please try adding '-d megaraid,N'
What it means: The OS sees a virtual disk. SMART isn’t directly visible.
Decision: If you continue, you now depend on controller-specific tooling (-d megaraid,N) and its correctness. For ZFS, that’s a step backward.
Task 7: Confirm physical drive visibility behind MegaRAID (controller-specific)
cr0x@server:~$ sudo smartctl -a -d megaraid,0 /dev/sdc | egrep 'Device Model|Serial Number|SMART overall-health'
Device Model: ST12000NM0008
Serial Number: ZHZ999XY
SMART overall-health self-assessment test result: PASSED
What it means: You can access physical drive SMART via a RAID layer index.
Decision: This is better than nothing, but you must ensure ZFS receives honest I/O errors and flush behavior. SMART access alone does not certify passthrough quality.
Task 8: Check ZFS pool topology and confirm it uses by-id paths
cr0x@server:~$ sudo zpool status -P
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
/dev/disk/by-id/wwn-0x5000c500a1b2c3d4 ONLINE 0 0 0
/dev/disk/by-id/wwn-0x5000c500a1b2c3d5 ONLINE 0 0 0
/dev/disk/by-id/wwn-0x5000c500a1b2c3d6 ONLINE 0 0 0
/dev/disk/by-id/wwn-0x5000c500a1b2c3d7 ONLINE 0 0 0
What it means: ZFS is anchored to stable device identifiers, not /dev/sdX.
Decision: If you see /dev/sdX in production, schedule a maintenance window to fix it. It’s a delayed outage.
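A common, low-drama way to re-anchor an existing pool onto by-id paths, sketched for the example pool above (quiesce I/O and do it in the window you scheduled):
cr0x@server:~$ sudo zpool export tank
cr0x@server:~$ sudo zpool import -d /dev/disk/by-id tank
cr0x@server:~$ sudo zpool status -P
No data moves; ZFS simply records the new device paths at import time, and zpool status should now show wwn-* links.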
Task 9: Spot controller-induced latency spikes during scrub
cr0x@server:~$ sudo zpool iostat -v tank 5 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 12.3T 21.4T 220 15 420M 2.1M
raidz2 12.3T 21.4T 220 15 420M 2.1M
wwn-... - - 55 4 105M 0.6M
wwn-... - - 54 4 104M 0.5M
wwn-... - - 56 3 107M 0.6M
wwn-... - - 55 4 104M 0.4M
What it means: Reads are balanced. No obvious outlier disk.
Decision: If one disk shows drastically lower bandwidth or near-zero ops during scrub, suspect a bad drive, bad lane, expander trouble, or a controller retry storm hiding errors.
Task 10: Check for link resets and command timeouts (classic “it’s the cable” proof)
cr0x@server:~$ sudo dmesg -T | egrep -i 'mpt3sas|sas|reset|timeout|link'
[Thu Dec 26 09:22:11 2025] mpt3sas_cm0: SAS host is non-operational !!!!
[Thu Dec 26 09:22:12 2025] mpt3sas_cm0: sending diag reset !!
[Thu Dec 26 09:22:20 2025] sd 3:0:1:0: [sdb] tag#129 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
What it means: The HBA reset or link went sideways. Drives start timing out.
Decision: Before blaming ZFS, replace/seat cables and check backplane power/thermals. If this repeats under load, you’re flirting with pool instability.
Task 11: Validate write cache policy at the drive level
cr0x@server:~$ sudo hdparm -W /dev/sda
/dev/sda:
write-caching = 1 (on)
What it means: Drive write cache is enabled. That can be fine if flushes are honored and power is stable; it can be spicy otherwise.
Decision: If you can’t guarantee power-loss protection (UPS, proper shutdown, or SSD PLP), consider disabling write cache on HDDs for safer semantics—accepting a performance hit.
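A minimal sketch for the “disable it” path on a SATA drive; note the setting often does not survive a power cycle, so reapply it at boot (udev rule or a oneshot unit) if you depend on it:
cr0x@server:~$ sudo hdparm -W 0 /dev/sda
cr0x@server:~$ sudo hdparm -W /dev/sda
The second command should now report write-caching = 0 (off). SAS drives expose the same knob through their caching mode page rather than hdparm.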
Task 12: Check whether barriers/flush are being issued (symptom-driven)
cr0x@server:~$ sudo cat /sys/block/sda/queue/write_cache
write back
What it means: The kernel believes the device uses write-back caching. That increases the importance of correct flush behavior.
Decision: On “fake HBA” RAID controllers, this is where you get nervous: does the controller actually flush the disk, or just its own cache?
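The kernel’s belief is overridable, which matters exactly in this situation. Writing “write back” here tells the block layer to keep issuing flushes even if a layer downstream claims otherwise; the reverse (“write through”) suppresses flushes and should only be set when you are certain no volatile cache exists. A sketch against the suspect virtual disk from earlier:
cr0x@server:~$ echo 'write back' | sudo tee /sys/block/sdc/queue/write_cache
This changes what the kernel assumes, not what the controller actually does; the verification work is still yours.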
Task 13: Confirm TRIM/discard support for SSD pools (and detect virtualization)
cr0x@server:~$ lsblk -D -o NAME,DISC-GRAN,DISC-MAX,DISC-ZERO
NAME DISC-GRAN DISC-MAX DISC-ZERO
sda 4K 2G 0
sdb 4K 2G 0
sdc 0B 0B 0
What it means: sda/sdb support discard. sdc does not, which is common for virtual disks behind RAID controllers.
Decision: For SSD-heavy ZFS pools, lack of discard is a performance and endurance concern. Prefer true HBAs or controllers with verified full pass-through.
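Where discard does work end-to-end, OpenZFS (0.8 and later) can use it; “tank” is the example pool from earlier:
cr0x@server:~$ sudo zpool set autotrim=on tank
cr0x@server:~$ sudo zpool trim tank
cr0x@server:~$ zpool status -t tank
autotrim handles freed space continuously, zpool trim runs an explicit pass, and the -t flag shows per-device trim state. If the device reports no discard support, there is nothing for any of this to do.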
Task 14: Measure queue depth and sanity-check saturation
cr0x@server:~$ cat /sys/block/sda/device/queue_depth
32
What it means: The device queue depth is 32. This may be fine for HDDs; SSDs and heavy concurrency may want more, but it depends on controller and workload.
Decision: If latency spikes under parallel load, don’t reflexively crank queue depth. First confirm you’re not hiding errors or hitting expander bottlenecks.
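If you do experiment, the per-device knob lives in sysfs and resets at reboot, so treat changes as measurements rather than configuration until they have earned their keep:
cr0x@server:~$ echo 64 | sudo tee /sys/block/sda/device/queue_depth
cr0x@server:~$ cat /sys/block/sda/device/queue_depth
Compare latency under the real workload before and after; a number that only improves a synthetic benchmark is not a win.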
Task 15: Check controller firmware/BIOS version (change control fuel)
cr0x@server:~$ sudo lshw -class storage -short
H/W path Device Class Description
/0/100/3 storage Serial Attached SCSI controller
/0/100/4 storage RAID bus controller
What it means: You’ve identified storage devices, but not firmware. Next step is vendor tooling or storcli/sas3flash depending on controller family.
Decision: Track firmware versions in your CMDB or at least your build notes. “It changed” is half of every postmortem.
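As a sketch of that next step, assuming Broadcom’s own utilities are installed (package names and supported controllers vary by generation):
cr0x@server:~$ sudo sas3flash -list
cr0x@server:~$ sudo storcli /c0 show
The first reports firmware/BIOS on Fusion-MPT HBAs, the second summarizes a MegaRAID controller. Record the result next to the driver version from modinfo mpt3sas; that pair is what the postmortem will ask for.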
Fast diagnosis playbook (find the bottleneck quickly)
When ZFS performance tanks or a pool starts throwing errors, you don’t have time for philosophy. You need a tight loop that tells you: is this a disk, a controller, a path, or ZFS itself?
First: is ZFS unhappy, or is the hardware lying?
- Run zpool status -P. Look for READ/WRITE/CKSUM increments and degraded/faulted devices.
- Check whether device paths are stable (/dev/disk/by-id/wwn-*).
- If errors appear only as timeouts/resets in dmesg with no clear disk faulting, suspect cabling/HBA/expander.
Second: find the slow component under load
- Run zpool iostat -v pool 5 during the problem. Identify outlier disks with low bandwidth or stalled ops.
- Correlate with iostat -x 5 to see device utilization and await times.
- If one disk shows huge await and low throughput, it’s probably failing or retrying. If all disks show elevated await simultaneously, suspect a controller/expander bottleneck or a workload-induced sync storm.
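For the iostat half (it ships in the sysstat package), the columns that matter most are r_await/w_await and %util; remember the first sample is an average since boot and should be ignored:
cr0x@server:~$ iostat -x 5 3
One device pinned at high await while its neighbors idle points at that device or its lane; everything elevated at once points upstream at the controller, expander, or workload.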
Third: verify the controller mode and transparency
- Check lspci + lsmod to confirm the driver stack (mpt3sas good, megaraid_sas needs scrutiny).
- Confirm SMART passthrough. If you need controller indexes to read SMART, you may have a “fake HBA.”
- Check discard support for SSDs (lsblk -D).
Fourth: decide if this is “replace disk” or “replace controller/cable”
- Replace disk if SMART shows reallocated/pending sectors, media errors, or if ZFS consistently faults that one disk.
- Replace/seat cable/backplane lane if errors move when you move the disk, or if logs show link resets affecting multiple disks.
- Replace controller if you see recurring HBA resets, command timeouts across many disks, or virtualization that blocks reliable monitoring.
Common mistakes: symptom → root cause → fix
1) Scrubs are painfully slow and “randomly” stall
Symptom: Scrub starts fast, then throughput collapses; sometimes a single disk shows 0 ops for long intervals.
Root cause: Controller retry storms hiding marginal media, or a SAS link problem causing resets; expanders amplify it.
Fix: Check dmesg for resets/timeouts, swap cables, move the suspect disk to another bay, and run SMART long tests. If behind a RAID passthrough layer, strongly consider moving to IT mode/true HBA.
2) ZFS checksum errors you can’t repair reliably
Symptom: CKSUM increases; scrubs find errors; repairs don’t stick or errors recur.
Root cause: ZFS on top of a RAID virtual disk, or a controller doing silent remapping/caching that breaks end-to-end assumptions.
Fix: Stop using hardware RAID arrays under ZFS. Rebuild with direct disks (HBA/IT mode). If you can’t, at least ensure one disk equals one device with full error reporting—but treat that as temporary.
3) After reboot, disks “changed names” and the pool won’t import cleanly
Symptom: Pool import complains about missing devices; /dev/sdX ordering changed.
Root cause: Pool created using /dev/sdX paths; unstable enumeration; RAID controller virtualization can worsen this.
Fix: Use by-id paths and export/import carefully. Replace controller mode if it rewrites identity.
4) SMART data looks empty or generic (“Unknown model”)
Symptom: SMART queries fail unless you pass special flags; model shows as controller.
Root cause: Virtual drives presented by RAID firmware, not physical disks.
Fix: Use an HBA/IT mode if possible. If stuck, configure monitoring using controller-specific methods and verify it covers all drives and attributes you care about.
5) Performance looks great until a power event, then you get pool drama
Symptom: After a hard power loss, pool imports but has errors, or you see suspiciously recent data missing.
Root cause: Write-back caching at the controller layer acknowledging writes early; flush semantics not honored end-to-end.
Fix: Disable write-back cache (controller and drives) unless you have verified power-loss protection and correct flush behavior. Prefer simpler HBAs for ZFS.
6) Random multi-disk “failure” events during heavy I/O
Symptom: Several disks log errors at once; ZFS reports timeouts; then everything “recovers.”
Root cause: HBA overheating, expander saturation, or PSU/backplane instability causing momentary link drops.
Fix: Fix airflow over the HBA heatsink, validate power and backplane connectors, and avoid bargain expanders for high-drive-count boxes.
Three corporate mini-stories from the storage trenches
Incident caused by a wrong assumption: “Passthrough means passthrough”
A mid-size SaaS company inherited a pair of storage servers during a rapid team reshuffle. The previous owner left a note: “Controller in passthrough, ZFS handles redundancy.” Everyone nodded. It booted, the pool imported, dashboards were green. The team had bigger fires.
Months later, they saw a pattern: periodic latency spikes on the primary database replica, always during scrub windows. No obvious disk was failing. ZFS didn’t fault anything, but application latency went from “fine” to “angry customers” in minutes. The on-call muted alerts, disabled scrubs, and promised to revisit.
The revisit happened after an unclean shutdown during a building maintenance event. The pool came back, but ZFS reported checksum errors that didn’t correlate with any single drive. Worse, the error pattern moved around between scrubs. The team suspected RAM, then kernel bugs, then cosmic rays. The postmortem draft was already blaming “ZFS complexity.”
The actual issue: the RAID controller’s “passthrough” was a virtual-drive wrapper with write-back caching still enabled at the controller level. Flushes were not being honored the way ZFS expected. Under normal operation it was “fine”; under scrub + workload it produced latency spikes due to internal retries and cache behavior; under power loss it was correctness roulette.
Fix was blunt: replace the controller with a proper HBA and rebuild the pool from backups. They also wrote a runbook rule: “Passthrough isn’t a feature; it’s a claim. Verify with SMART, device identity, and cache policy.”
Optimization that backfired: “Let’s use write-back cache for more IOPS”
A different org ran a ZFS-backed VM store. Someone noticed that enabling write-back caching on the controller made benchmarks look heroic. The change request read well: “Improves latency and throughput; risk mitigated by UPS.” It was approved because the graphs were pretty and nobody wanted to be the person who blocked performance.
For a while it worked. Latency dropped. The team celebrated by increasing consolidation ratios. That’s how you know an optimization has been culturally adopted: people start relying on it.
Then a routine firmware update changed the controller’s behavior around cache flushes. Not dramatically. Just enough. Under heavy sync write load (VMs love sync when you least expect it), the controller occasionally acknowledged writes early and reordered some operations in ways ZFS didn’t anticipate. ZFS didn’t immediately scream. It quietly did what it could with the information it was given.
Weeks later, a storage node crashed and imported with a handful of corrupted VM disks. Not the whole pool, not a clean failure—just the kind that ruins your day and your credibility. The UPS didn’t help because it was never about “power loss” alone; it was about correctness guarantees at the boundary.
They rolled back caching and ate the performance hit. The lesson wasn’t “never optimize.” It was “never optimize by violating the filesystem’s contract.” ZFS already has a cache (ARC), and it already knows how to order writes. Your controller doesn’t understand your intent, only your packets.
Boring but correct practice that saved the day: stable IDs, tested spares, and controller transparency
A financial services shop (the kind that likes change windows more than humans) standardized on a short list of HBAs. Every server build used by-id device paths, and every disk was labeled physically with its WWN suffix. The policy sounded pedantic until you watched it in action.
One afternoon, a pool went DEGRADED. ZFS faulted a disk cleanly. There was no ambiguity: the zpool status -P path matched a WWN, and the chassis label matched the bay. The tech pulled the right disk on the first try. That alone is rarer than it should be.
Here’s the part that saved them: they also validated SMART passthrough and error reporting during commissioning. When the disk started throwing read errors, those errors surfaced fast and consistently. ZFS didn’t have to guess. It did its job: repair from redundancy, mark the drive bad, and keep serving.
The replacement resilvered without drama because the HBA wasn’t injecting its own recovery policies. No hidden retries, no opaque “degraded virtual disk,” no controller alarms that only one person knew how to interpret. The incident ticket was closed with the driest comment in the world: “Replaced failed disk; resilver complete.”
That’s the dream: a failure that behaves like the design docs. The trick is that it only happens when you build for it.
Checklists / step-by-step plan
Step-by-step: validate a controller for ZFS (new build or inherited)
- Identify controller type and driver.
  - Run lspci -nn and lsmod.
  - Goal: HBA/IT mode driver stack (e.g., mpt3sas), not a RAID virtualization stack unless you’ve verified transparency.
- Confirm disk identity is real.
  - Run lsblk -o NAME,MODEL,SERIAL.
  - Goal: disk model/serial matches the drive label, not the controller model.
- Confirm stable by-id paths exist.
  - Check /dev/disk/by-id/ for WWNs.
  - Goal: create pools using WWN-based paths, never /dev/sdX.
- Validate SMART works normally for every disk (a quick loop sketch follows this checklist).
  - smartctl -a /dev/sdX should succeed.
  - If it requires -d megaraid,N, write monitoring that covers it and treat the controller as a risk.
- Validate error behavior under load.
  - Run a scrub and watch zpool iostat -v.
  - Check dmesg for resets/timeouts.
  - Goal: no link flaps; any bad disk should show cleanly as disk errors, not controller-wide chaos.
- Decide caching policy deliberately.
  - If your controller has write-back cache, understand and document when it’s enabled and why.
  - Prefer correctness over benchmarks unless you have verified power-loss protection and flush semantics end-to-end.
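A quick loop for the SMART validation step, assuming whole-disk nodes named /dev/sd[a-z] (adjust the glob for larger enclosures or NVMe):
cr0x@server:~$ for d in /dev/sd?; do echo "== $d"; sudo smartctl -H -i "$d" | egrep 'Device Model|Serial Number|overall-health'; done
Any disk that needs extra flags to answer, or that answers with the controller’s identity instead of its own, is the one to chase before the pool goes live.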
Operational checklist: every quarter (yes, really)
- Run a scrub and confirm it completes in a predictable time window.
- Review SMART stats for slow-burn failures (pending sectors, CRC errors).
- Spot-check that disk IDs in ZFS still match physical labeling.
- Verify firmware versions haven’t drifted unpredictably across fleet.
- Test a disk replacement procedure on a non-critical node (muscle memory matters).
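The drill itself is short; the value is having typed it before it matters. A sketch with placeholder WWNs standing in for the failed and replacement disks:
cr0x@server:~$ sudo zpool offline tank wwn-0x5000c500a1b2c3d7
cr0x@server:~$ sudo zpool replace tank wwn-0x5000c500a1b2c3d7 /dev/disk/by-id/wwn-0x5000c500deadbeef
cr0x@server:~$ zpool status tank
Watch the resilver to completion, and label the new disk physically with its WWN suffix before it goes into the bay, not after.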
Migration checklist: moving from “fake HBA” to true HBA/IT mode
- Assume you will need a maintenance window and backups that you trust.
- Inventory current pool layout with zpool status -P and save it.
- Record drive WWNs and bay mapping.
- Plan controller swap and verify boot/root disk strategy (don’t strand the OS).
- After swap: validate SMART, discard (for SSD), and stable by-id paths before importing/creating pools.
- Run a scrub after the first heavy workload week to catch path issues early.
FAQ
1) Do I have to use IT mode for ZFS?
No. You have to give ZFS direct, truthful disks. IT mode is a common way to achieve that on LSI-based cards. True HBAs also work. RAID arrays under ZFS are the thing to avoid.
2) What’s the simplest sign I’m dealing with a “fake HBA”?
If lsblk shows the controller model as the disk model, or smartctl -a /dev/sdX fails without controller-specific flags, you’re probably not seeing raw disks.
3) Is “one RAID0 per disk” acceptable for ZFS?
It can function, but it’s not equivalent to a real HBA. You inherit metadata/state per disk, controller error handling, and sometimes caching/flush quirks. If it’s production and you have a choice, don’t.
4) Can I run ZFS on top of a RAID5/6 virtual disk?
You can, and people do, and some even survive. But when you get corruption or rebuild edge cases, ZFS can’t heal correctly because it can’t see which physical device lied. For systems you care about, don’t stack redundancy layers that hide failure detail.
5) Does a battery/flash-backed cache make RAID controllers safe for ZFS?
It reduces one risk (power-loss during write-back caching). It does not solve the bigger issue: the controller can still mask errors, rewrite identity, or change command semantics. ZFS wants visibility, not just “durability most days.”
6) How do I know if flush/barriers are honored?
It’s hard to prove perfectly without targeted testing and vendor specifics. Practically: avoid layers that might reinterpret flushes, disable controller write-back caching where possible, and prefer IT mode/true HBAs where the drive semantics are straightforward.
7) Do SAS expanders work with ZFS?
Yes, commonly. The risk is quality and oversubscription: under scrubs/resilvers with many HDDs, expanders can become contention points. If you see controller resets or consistent slowdowns, test with a direct attach configuration.
8) My pool is slow. Is it the HBA?
Sometimes, but usually it’s disks, cabling, or workload sync patterns. Use zpool iostat -v to find per-disk outliers, and dmesg to catch link resets. HBAs are often innocent until proven otherwise.
9) What about virtualization: can I pass through an HBA to a VM and run ZFS inside?
Yes, with IOMMU/PCI passthrough, and it can be solid. The key is that the guest must see real disks, not virtual disks from the hypervisor or RAID layer. If you can’t do full passthrough, you’re back in “someone is lying” territory.
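A quick sanity check before committing to that design, assuming a Linux hypervisor: confirm the IOMMU is actually on and that the HBA sits in its own group, because a shared group drags unrelated devices into the passthrough:
cr0x@server:~$ sudo dmesg | egrep -i 'iommu|dmar|amd-vi'
cr0x@server:~$ find /sys/kernel/iommu_groups/ -type l | sort
Then rerun the lspci/lsmod/smartctl checks from the tasks above inside the guest; if they pass there, the passthrough is earning its name.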
10) Are SATA drives behind SAS HBAs okay?
Yes. SAS HBAs commonly attach SATA drives via the SATA Tunneling Protocol (STP). Just be mindful of power management quirks, and that SATA error handling is historically less deterministic than SAS under heavy fault conditions.
Conclusion: practical next steps
If you remember one thing: ZFS doesn’t want a storage middle manager. It wants direct disks, honest errors, and stable identity. IT mode and true HBAs deliver that. RAID controllers wearing a “passthrough” badge might, but you have to verify—and verification is work you’ll repeat forever.
Next steps you can do this week:
- Inventory your controllers (lspci) and drivers (lsmod) across the fleet.
- Pick one ZFS host and verify SMART, discard (if SSD), and by-id stability end-to-end.
- During your next scrub, watch per-disk behavior with zpool iostat -v and check logs for resets/timeouts.
- If you find a “fake HBA,” decide whether you’re okay owning its monitoring and semantics—or schedule the controller swap and be done.
Storage reliability is mostly about removing surprises. The fastest path to fewer surprises is a plain HBA, good cables, and ZFS seeing what it expects: real disks doing real disk things.