Your ZFS pool is fine. Until it isn’t. You kick off a scrub or a resilver, latency goes from “boring” to “spicy,”
and suddenly dmesg starts narrating a disaster: command timeouts, link resets, and disks dropping just long enough
to ruin your day and your confidence.
A lot of people walk away from those incidents with a simple conclusion: “SAS is stable; SATA is flaky.”
That’s not wrong, but it’s not the whole story either. The real difference lives in timeout semantics, error
recovery behavior, transport layers, and how HBAs react under stress. ZFS is the messenger. Don’t shoot it.
What “stable” actually means in ZFS land
When operators say SAS “feels more stable,” they usually mean this:
- Drives don’t disappear from the OS during scrubs/resilvers.
- Error handling is bounded: failures surface as errors, not minutes-long stalls.
- Controllers recover cleanly without resetting entire links or whole buses.
- Latency stays predictable enough that ZFS can keep the pool healthy without panicking.
ZFS is allergic to silence. If a vdev stops responding long enough, ZFS assumes it’s gone and will fault it.
Sometimes the drive is “fine” in the consumer sense—no reallocated sectors, SMART says OK—but it spent 45 seconds
doing deep error recovery. ZFS doesn’t care about your drive’s feelings; it cares about timely I/O completion.
“Stable” therefore often means: the storage stack fails fast and fails narrow.
SAS tends to be built for that. SATA tends to be built for lowest-cost capacity and a desktop-style expectation:
a long pause is acceptable if it might salvage a sector.
Timeouts 101: the failure you see vs the failure you have
A timeout in logs is rarely the root cause. It’s the point where the OS gives up waiting. The real root cause is
one of these families:
- Media trouble: weak/unstable sectors that trigger long internal retries.
- Transport trouble: link errors, marginal cables, backplane issues, expanders being dramatic.
- Controller trouble: HBA firmware bugs, queue starvation, resets under error storms.
- Workload pressure: scrubs/resilvers + random reads + small sync writes collide and the queue
turns into a parking lot.
In practice, SAS “wins” under stress because its ecosystem assumes you are running a shared storage backplane,
with multipath, with expanders, with dozens of disks, and with operational people who will replace a drive instead
of begging it to retry for two minutes.
SATA can be reliable, but you need to be stricter: better power, better cables/backplanes, better firmware, and
drives configured for bounded error recovery (if supported). Without that, you get the classic ZFS failure mode:
one drive pauses, the HBA resets, a couple other drives get collateral damage, and ZFS thinks the world is ending.
SAS vs SATA behaviors that matter under ZFS stress
Error recovery philosophy: bounded vs “let me cook”
Enterprise SAS drives are typically tuned for RAID-style environments: if a sector read is failing, don’t spend
forever. Return an error promptly so the redundancy layer (RAID, ZFS mirror/RAIDZ) can reconstruct and rewrite.
That’s not altruism; it’s survival. In a shared array, one drive stalling can stall many.
Many SATA drives—especially desktop/NAS-light models—will try hard to recover a sector internally. That can mean
tens of seconds of retry loops. During that time, the drive may not respond to other commands. From the OS
perspective, the device “hung.”
The enterprise answer here is TLER/ERC/CCTL (different vendors, same concept): limit how long the
drive spends on recovery before returning failure. SAS drives often behave as if this is always enabled.
SATA drives may not support it, may default it off, or may behave inconsistently across firmware revisions.
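If a SATA drive supports SCT Error Recovery Control, you can check and bound its recovery time from the host. A minimal sketch, assuming /dev/sdb is the SATA drive you care about (values are tenths of a second, so 70 means 7.0 seconds):
cr0x@server:~$ sudo smartctl -l scterc /dev/sdb
cr0x@server:~$ sudo smartctl -l scterc,70,70 /dev/sdb
The first command reports the current read/write recovery limits, or tells you the drive doesn't support the command at all. The second requests a 7-second cap. Many drives forget this setting after a power cycle, so operators typically reapply it at boot; a drive that doesn't support SCT ERC cannot be bounded this way and deserves extra suspicion in a redundant array.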
Transport and command handling: SAS is SCSI-native; SATA is a translation party
SAS is a SCSI transport. SATA is ATA. When you attach SATA drives behind a SAS HBA/backplane, you usually do it via
SATA Tunneling Protocol (STP). That works, but it’s not the same as SAS end-to-end:
some error reporting is less expressive, some recovery paths are uglier, and under heavy error conditions you can
see behavior like link resets that feel disproportionate to a single bad sector.
SAS also gives you niceties such as dual-porting (on many drives) and better integration with multipath.
That matters when the goal is “keep serving” rather than “eventually finish the read.”
Queuing and fairness: who gets to talk, and how long
Under load, the storage stack is a queueing system. SAS infrastructures (HBAs, expanders, firmware) are usually
designed for deeper queues and better fairness across many devices. SATA can do NCQ, but the end-to-end ecosystem
is more variable—especially once you introduce port multipliers, cheap backplanes, or consumer controllers.
Your scrub/resilver is a perfect queue-depth torture test: sustained sequential-ish reads plus metadata lookups
plus occasional writes. If one device starts delaying completions, queues fill, timeouts appear, and resets happen.
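You can at least see how much queueing the host is allowed to do per device. A quick look, assuming sdb is one of the disks in question (standard Linux sysfs paths):
cr0x@server:~$ cat /sys/block/sdb/device/queue_depth
cr0x@server:~$ cat /sys/block/sdb/queue/nr_requests
The first is the per-device queue depth the SCSI layer will keep outstanding; the second is the block-layer request queue size. Remember that SATA NCQ tops out at 32 tags per device, while SAS drives routinely advertise much deeper queues, which is part of why SAS behaves better when the pool is saturated.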
Reset blast radius: SAS tends to be surgical; SATA can be “reset the universe”
When a SAS device misbehaves, the error handling often stays scoped to that device or that phy. With SATA behind
STP, recovery may require resetting a link that also hosts other drives in the same backplane path, or at least
momentarily interrupting traffic in a way that causes collateral timeouts.
That’s why the same workload can produce “one bad disk” behavior on SAS and “three disks vanished for 10 seconds”
behavior on SATA in the same chassis. It’s not magic. It’s reset domain size.
SMART and telemetry: SAS tends to tell you more, sooner
SAS devices expose richer SCSI log pages and often more consistent reporting of grown defects and error counters.
SATA SMART is famous for vendor-specific attributes and “everything is OK until it isn’t.” You can still run good
operations on SATA, but you must compensate with better trending and tighter replacement criteria.
First joke (and only because you deserve a break): SATA error recovery is like a microwave that won’t stop because it’s “almost done.” You still end up eating cold pizza at 2 a.m.
How ZFS scrubs/resilvers amplify weak links
ZFS is a reliability system that does a lot of I/O on purpose. Scrubs read everything to validate checksums.
Resilvers read a lot to reconstruct redundancy. Both are background tasks that look suspiciously like a benchmark
when your pool is large enough.
That creates two effects:
- You finally touch the cold data. The sector that has been silently decaying for a year gets read. If the drive decides to do extended recovery, you get long command stalls right when the pool is busy.
- You concentrate stress. A resilver can hammer a subset of disks (the ones in the affected vdev) and can turn minor link/cable marginality into repeated errors.
ZFS also has opinions about slow devices. If a device is missing or non-responsive long enough, ZFS will fault it to
keep the pool coherent. That’s the correct behavior. The storage stack should not block indefinitely.
The operator trap is to interpret “ZFS faulted the disk” as “the disk is dead.” Sometimes it is dead. Sometimes the
disk is fine but your transport is bad. Sometimes the disk is “fine” in isolation but wrong for a redundant array
because it stalls too long.
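On recent OpenZFS releases you can see this pressure before a fault happens. A hedged sketch, assuming a Linux host and a pool named tank:
cr0x@server:~$ sudo zpool status -s tank
cr0x@server:~$ cat /sys/module/zfs/parameters/zio_slow_io_ms
The -s flag adds a per-leaf-vdev column counting slow I/Os; zio_slow_io_ms is the latency threshold (in milliseconds) above which an I/O is counted as slow. A vdev whose slow-I/O counter keeps climbing during scrubs is telling you something long before ZFS faults it.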
Facts and historical context (the stuff we keep relearning)
- Fact 1: SAS was designed as a serial successor to parallel SCSI with multi-device backplanes and expanders in mind; “many disks, one HBA” was always the plan.
- Fact 2: SATA’s early design center was desktop/commodity storage; long internal retries were acceptable if they recovered a file without user intervention.
- Fact 3: The TLER/ERC/CCTL concept emerged because hardware RAID arrays needed drives to fail fast so the controller could rebuild from parity/mirrors.
- Fact 4: SATA behind SAS backplanes commonly uses STP, a protocol bridge that can reduce the fidelity of error handling compared with native SAS.
- Fact 5: SAS supports dual-porting on many enterprise drives, enabling multipath redundancy; SATA generally does not (with a few niche exceptions).
- Fact 6: ZFS’s scrub was built to continuously detect and correct silent corruption; it intentionally creates a read workload that other filesystems never generate by default.
- Fact 7: “Drive disappeared during load” incidents often trace back to reset domains: one controller reset can flap multiple SATA devices on a shared path.
- Fact 8: The industry spent years learning that “SMART says OK” is not a warranty; timeouts and command aborts can precede any SMART threshold changes.
- Fact 9: Enterprise SAS ecosystems typically validate firmware combinations (HBA + expander + disk) more rigorously because the market demands predictable failure behavior.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company moved a build artifact store onto a new ZFS box. The design looked reasonable: RAIDZ2, lots of
cheap high-capacity SATA drives, a popular SAS HBA in IT mode, and a tidy 24-bay chassis. Someone asked, “Should we
buy SAS drives?” Someone else answered, “Nah, ZFS has checksums. It’ll be fine.”
The first month was boring. Then a scrub hit during a busy CI day. Latency spiked. One disk started timing out.
The HBA tried to recover with link resets. Two other SATA disks on adjacent bays glitched long enough to throw
timeouts as well. ZFS faulted one drive, then temporarily marked another as unavailable, and the pool went degraded
with a side of panic.
The wrong assumption wasn’t “SATA is unreliable.” The wrong assumption was “data integrity features eliminate
transport and timeout problems.” Checksums detect corruption. They do not force a drive to respond promptly.
The fix wasn’t exotic. They swapped the worst offenders with enterprise-class drives that had bounded error
recovery behavior, replaced a suspicious backplane cable, and tuned scrub scheduling away from peak hours. Most
importantly, they wrote an operational rule: any disk that times out under scrub is treated as suspect until proven otherwise.
Mini-story 2: The optimization that backfired
Another team wanted faster resilvers. Their idea: raise queue depths, increase parallelism, and let the HBA
“stretch its legs.” They bumped tunables, enabled aggressive scrub behavior, and declared victory after a quick
test on an empty pool.
Six weeks later a drive failed for real. The resilver kicked off on a pool that was also serving database backups.
The new settings drove the disks into long outstanding queues. Latency hit the ceiling. A marginal SATA disk that
had been “fine” started accumulating command timeouts. The HBA responded with resets, and the resilver slowed down
anyway—except now the pool was degraded and unstable.
The postmortem was blunt: optimizing for peak rebuild speed without protecting latency headroom is gambling with
your redundancy window. Rebuilds are not benchmarks; they are emergency procedures.
They rolled back the tunables, capped scrub/resilver impact, and measured success by “does the pool stay online and
does latency stay bounded,” not by raw MB/s. The resilver took longer. The business took fewer outages. That trade
penciled out nicely.
Mini-story 3: The boring but correct practice that saved the day
A financial services shop ran ZFS on SAS drives with a strict maintenance culture. Nothing heroic: known-good HBA
firmware, consistent drive models, scheduled scrubs, and a simple habit of reviewing error counters weekly. They
also did something unfashionable: they labeled every cable and recorded which bay mapped to which /dev entry.
One quarter, a scrub reported a handful of checksum errors and a single device with rising transport errors.
No outage. No tickets. Just quiet evidence.
The on-call replaced one cable during a maintenance window and proactively swapped the drive that showed elevated
grown defect counts. The next scrub was clean.
Two months later, a nearby rack had a power event that caused a brief storm of link resets. Their system stayed up.
Why? Because the “boring” hygiene meant there were no marginal links waiting to turn a transient into a cascade.
Operational stability is mostly the absence of surprises, and surprises love neglected cabling.
Fast diagnosis playbook
When ZFS is throwing timeouts, you do not have time for interpretive dance. You need to find the bottleneck and the
reset domain quickly.
First: Confirm what ZFS thinks is happening
- Is the pool degraded? Which vdev?
- Are errors checksum errors (data corruption) or I/O errors/timeouts (device/transport)?
- Is the problem localized to one disk or multiple disks behind the same HBA/backplane path?
Second: Correlate with kernel logs and identify the transport layer
- Look for SCSI timeouts, aborts, task management resets, link resets.
- Identify the driver: mpt2sas/mpt3sas, megaraid_sas, ahci, ata_piix, etc.
- Check whether the affected devices are SAS or SATA behind STP.
Third: Measure latency and queue pressure during the event
- Use zpool iostat -v to find the slow vdev.
- Use per-disk latency metrics (iostat -x) to find which disk is stalling.
- Check if the HBA is resetting repeatedly (that's a sign of transport/firmware issues or a disk that's poisoning the bus).
Fourth: Decide whether to quarantine the disk or fix the path
- If one disk shows repeated timeouts and high media errors: replace it.
- If multiple disks on the same backplane path flap together: inspect/replace cables, expander/backplane, HBA firmware.
- If problems appear only under scrub/resilver: cap rebuild aggressiveness and schedule those jobs off-peak; still investigate the underlying weak component.
Second joke (last one): A disk that “only fails during scrubs” is like a parachute that “only fails during jumps.” The brochure is not the problem.
Practical tasks (commands, outputs, and decisions)
The commands below assume a Linux host. If you run illumos or FreeBSD, the concepts translate; the filenames and
tooling change. The outputs shown are representative—your box will have its own drama.
Task 1: Check pool health and error type
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
scan: scrub in progress since Thu Dec 26 01:11:19 2025
3.21T scanned at 812M/s, 1.94T issued at 492M/s, 12.8T total
0B repaired, 15.15% done, 6:25:45 to go
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 3
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
tank/backups/db/full-2025-12-25.sql.zst
What it means: Checksum errors mean ZFS detected data that didn’t match its checksum. That can be
a bad sector, bad cable, bad RAM, or a controller issue—but it’s not merely a timeout symptom.
Decision: If checksum errors are present, treat it as a data-integrity incident. Identify the
affected files, validate backups, and then hunt the hardware cause.
Task 2: Watch per-vdev latency during scrub/resilver
cr0x@server:~$ sudo zpool iostat -v tank 2
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 8.44T 4.18T 2.40K 112 520M 5.12M
raidz2-0 8.44T 4.18T 2.40K 112 520M 5.12M
ata-WDC_WD140EDFZ-... - - 410 19 86.9M 862K
ata-WDC_WD140EDFZ-... - - 12 18 512K 840K
ata-WDC_WD140EDFZ-... - - 398 18 84.2M 835K
ata-WDC_WD140EDFZ-... - - 401 19 85.1M 860K
ata-WDC_WD140EDFZ-... - - 395 19 83.7M 858K
ata-WDC_WD140EDFZ-... - - 404 19 85.4M 865K
-------------------------- ----- ----- ----- ----- ----- -----
What it means: One disk is doing almost no reads while others are working. That’s a classic sign of
a stalled device or a path issue. Decision: Identify that disk’s OS device name and inspect logs
and SMART. If it repeats, plan replacement or path remediation.
Task 3: Confirm which /dev node maps to the ZFS disk ID
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep WD140EDFZ | head
lrwxrwxrwx 1 root root 9 Dec 26 01:12 ata-WDC_WD140EDFZ-11A0VA0_VGH1ABC1 -> ../../sdb
lrwxrwxrwx 1 root root 9 Dec 26 01:12 ata-WDC_WD140EDFZ-11A0VA0_VGH1ABC2 -> ../../sdc
lrwxrwxrwx 1 root root 9 Dec 26 01:12 ata-WDC_WD140EDFZ-11A0VA0_VGH1ABC3 -> ../../sdd
What it means: ZFS shows by-id names; you need the kernel device for SMART, error logs, and
controller mapping. Decision: Use by-id consistently in ZFS, but map to /dev/sdX for triage.
Task 4: Pull kernel errors around the timeouts
cr0x@server:~$ sudo journalctl -k -S "2025-12-26 01:00" -g "timeout|reset|I/O error|mpt3sas|ata"
Dec 26 01:13:44 server kernel: sd 0:0:8:0: [sdb] tag#91 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
Dec 26 01:13:44 server kernel: sd 0:0:8:0: [sdb] tag#91 CDB: Read(16) 88 00 00 00 00 1b 2f 9e 80 00 00 00 01 00 00 00
Dec 26 01:13:44 server kernel: ata9: hard resetting link
Dec 26 01:13:45 server kernel: ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 26 01:13:46 server kernel: sd 0:0:8:0: [sdb] tag#93 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Dec 26 01:13:49 server kernel: mpt3sas_cm0: log_info(0x31120600): originator(PL), code(0x12), sub_code(0x0600)
What it means: You have SCSI command timeouts and a SATA link reset on the same path. That suggests
either a misbehaving SATA drive under STP or a transport problem (cable/backplane) causing link instability.
Decision: If resets cluster around one drive, replace that drive. If resets affect multiple drives
on the same “ataX” or HBA port, inspect cabling/backplane and HBA firmware.
Task 5: Identify whether a disk is SAS or SATA, and through which host
cr0x@server:~$ sudo lsblk -d -o NAME,MODEL,TRAN,SERIAL,HCTL /dev/sdb
NAME MODEL TRAN SERIAL HCTL
sdb WDC WD140EDFZ sata VGH1ABC1 0:0:8:0
What it means: TRAN=sata tells you this is a SATA drive even if it’s sitting in a SAS chassis.
HCTL shows the SCSI address. Decision: Expect more STP-related quirks; be more conservative about
timeout sensitivity and cabling quality.
Task 6: Inspect SMART for evidence of slow recovery or link trouble (SATA)
cr0x@server:~$ sudo smartctl -a /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.6.0] (local build)
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD140EDFZ-11A0VA0
Serial Number: VGH1ABC1
SATA Version is: SATA 3.3, 6.0 Gb/s
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 200 200 0 Old_age Always - 3
198 Offline_Uncorrectable 0x0010 200 200 0 Old_age Offline - 3
199 UDMA_CRC_Error_Count 0x003e 199 199 0 Old_age Always - 24
What it means: Pending/offline uncorrectable sectors suggest media problems; UDMA CRC errors suggest
link/cable/backplane issues. Decision: Pending sectors during scrubs are a red flag—plan
replacement. CRC errors: reseat/replace cable/backplane path and watch if the counter keeps climbing.
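The host-side CRC counter has a drive-side counterpart worth checking too. A quick look, assuming the same SATA drive:
cr0x@server:~$ sudo smartctl -l sataphy /dev/sdb
This prints the drive's SATA Phy event counters (CRC errors, R_ERR responses, link resets) as the device saw them. Comparing drive-side and host-side counters helps you decide whether the problem is the drive end, the host end, or the cable and backplane in between.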
Task 7: Inspect SMART for SAS with richer error logs
cr0x@server:~$ sudo smartctl -a /dev/sde
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.6.0] (local build)
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST12000NM000J
Revision: SN03
Logical Unit id: 0x5000c500d3a1b2c3
Transport protocol: SAS (SPL-3)
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 34 C
Elements in grown defect list: 8
Error counter log:
         Errors corrected  Rereads/   Total errors  Correction algorithm  Gigabytes   Total uncorrected
         by ECC            rewrites   corrected     invocations           processed   errors
read:    1200              2          1202          1200                  98123.456   0
write:   0                 0          0             0                     77234.112   0
What it means: Grown defects exist but reads are still corrected with no uncorrected errors.
SAS gives you a clearer picture of correction activity. Decision: Trend “grown defect list” and
corrected error rates. If they accelerate or you see uncorrected errors/timeouts, replace proactively.
Task 8: Check HBA driver/firmware and spot known reset storms
cr0x@server:~$ sudo lspci -nn | grep -i -E "sas|lsi|broadcom"
03:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0087] (rev 02)
cr0x@server:~$ sudo modinfo mpt3sas | grep "^version:"
version:        43.100.00.00
cr0x@server:~$ cat /sys/class/scsi_host/host0/version_fw
16.00.01.00
What it means: You need to know exactly which driver and controller firmware you're running. An HBA
flashed with IR (RAID) firmware, for example, may not be what you intended for ZFS passthrough.
Decision: Ensure the HBA runs IT-mode firmware for ZFS, and keep firmware/driver combinations consistent
across the fleet. If you see resets, verify you're not on a known-problematic combo.
Task 9: Check link error counters (SAS phy stats)
cr0x@server:~$ for phy in /sys/class/sas_phy/phy-*; do echo "== $phy =="; sudo cat $phy/invalid_dword_count $phy/running_disparity_error_count $phy/loss_of_dword_sync_count $phy/phy_reset_problem_count 2>/dev/null; done | head -n 20
== /sys/class/sas_phy/phy-0:0 ==
0
0
0
0
== /sys/class/sas_phy/phy-0:1 ==
12
5
3
1
What it means: Non-zero disparity/sync/reset counts are physical-layer trouble: cable, backplane,
expander port, or a marginal connector. Decision: If counters increase over time, replace the path
components before you blame ZFS or the disk.
Task 10: Identify which disks share an expander/backplane path
cr0x@server:~$ sudo lsscsi -t
[0:0:0:0] disk sas:0x5000c500d3a1b2c3 /dev/sde
[0:0:1:0] disk sas:0x5000c500d3a1b2c4 /dev/sdf
[0:0:8:0] disk sata:0x50014ee2b7c81234 /dev/sdb
[0:0:9:0] disk sata:0x50014ee2b7c85678 /dev/sdc
What it means: You can see SAS vs SATA devices, and often infer which are behind an expander.
Decision: If multiple devices on the same host/channel show errors together, suspect shared path.
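When lsscsi isn't explicit enough, the sysfs path spells out the physical topology. An illustrative check (the exact path will differ on your system):
cr0x@server:~$ readlink -f /sys/block/sdb/device
/sys/devices/pci0000:00/0000:03:00.0/host0/port-0:0/expander-0:0/port-0:0:8/end_device-0:0:8/target0:0:8/0:0:8:0
The expander-* and port-* components show which HBA port and expander the disk hangs off. Disks whose paths share the same expander segment share a reset domain; if they all flap together, that segment is your suspect.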
Task 11: Observe per-disk latency and queue depth during pain
cr0x@server:~$ iostat -x -d 2 3 /dev/sdb /dev/sdc
Linux 6.6.0 (server) 12/26/2025 _x86_64_ (32 CPU)
Device r/s w/s rkB/s wkB/s await aqu-sz %util
sdb 8.0 1.0 512 64 980.0 31.5 99.9
sdc 210.0 12.0 86000 1200 12.4 2.1 88.0
What it means: sdb has ~1 second average wait and a huge queue. It's not keeping up; it's
stalling. Decision: If this correlates with kernel timeouts, you have a device/path issue on
sdb. Replace disk or fix transport; don’t just “tune ZFS” and hope.
Task 12: Check for ZFS slow I/O messages
cr0x@server:~$ sudo dmesg -T | grep -iE "zfs.*(slow|failure)" | tail
[Thu Dec 26 01:13:55 2025] ZFS: vdev IO failure, zio 0x0000000123456789, type 1, offset 231145472, size 131072
[Thu Dec 26 01:13:55 2025] ZFS: vdev slow I/O, zio 0x0000000123456790, 60 seconds
What it means: ZFS is telling you I/O is taking tens of seconds. That usually aligns with drive
recovery or transport resets. Decision: Treat repeated “slow I/O” as a pre-failure indicator; it’s
not normal background noise.
Task 13: Throttle scrub impact (Linux OpenZFS)
cr0x@server:~$ sudo cat /sys/module/zfs/parameters/zfs_vdev_scrub_min_active
1
cr0x@server:~$ sudo cat /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
10
cr0x@server:~$ echo 4 | sudo tee /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
4
What it means: You reduced the number of concurrent scrub I/Os per vdev, lowering pressure.
Decision: Use this as a temporary stability lever during incident response. If throttling makes
timeouts disappear, you still have a weak component—just one that only fails under stress.
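Anything you echo into /sys is gone after a reboot. If you decide the lower setting should be permanent, persist it as a module option; a minimal sketch (the filename is just an example):
cr0x@server:~$ echo "options zfs zfs_vdev_scrub_max_active=4" | sudo tee /etc/modprobe.d/zfs-scrub-tuning.conf
If the zfs module is loaded from an initramfs (root on ZFS), regenerate the initramfs so the option is picked up at boot. Treat persistent throttling as a deliberate policy decision, not a way to avoid replacing a weak disk.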
Task 14: Temporarily pause and resume a scrub to stop the bleeding
cr0x@server:~$ sudo zpool scrub -p tank
cr0x@server:~$ sudo zpool status tank | sed -n '1,12p'
pool: tank
state: DEGRADED
scan: scrub paused since Thu Dec 26 01:20:11 2025
3.80T scanned at 0B/s, 2.41T issued at 0B/s, 12.8T total
What it means: Pausing reduces load so you can stabilize and investigate without continuing to
poke the failing component. Decision: Pause during active timeout storms; resume after you fix the
suspected disk/path or after applying throttles.
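Resuming is the same command without the pause flag; on OpenZFS, issuing a scrub against a pool whose scrub is paused picks up where it left off rather than starting over:
cr0x@server:~$ sudo zpool scrub tank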
Task 15: Offline a flapping disk to protect the pool (carefully)
cr0x@server:~$ sudo zpool offline tank ata-WDC_WD140EDFZ-11A0VA0_VGH1ABC1
cr0x@server:~$ sudo zpool status -v tank | sed -n '1,25p'
pool: tank
state: DEGRADED
status: One or more devices is being resilvered or is offline.
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
ata-WDC_WD140EDFZ-... OFFLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 3
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
ata-WDC_WD140EDFZ-... ONLINE 0 0 0
What it means: You’ve removed a flapping device from the I/O path. This can stop bus resets and
protect remaining disks. Decision: Only do this if redundancy allows it (RAIDZ2/mirror safety).
If you are one disk away from data loss, you fix the path first and avoid offlining unless the disk is poisoning
the whole vdev.
Task 16: Replace the disk and monitor resilver progress
cr0x@server:~$ sudo zpool replace tank ata-WDC_WD140EDFZ-11A0VA0_VGH1ABC1 /dev/disk/by-id/ata-WDC_WD140EDFZ-11A0VA0_VGH9NEW1
cr0x@server:~$ sudo zpool status tank | sed -n '1,22p'
pool: tank
state: DEGRADED
scan: resilver in progress since Thu Dec 26 02:01:09 2025
1.22T scanned at 421M/s, 512G issued at 176M/s, 12.8T total
0B repaired, 3.90% done, 20:21:21 to go
What it means: Resilver is underway. Watch for new timeouts during resilver; that’s when marginal
components confess. Decision: If resilver triggers new timeouts on other disks, you likely have a
shared-path issue or another weak disk.
Task 17: Check SATA error handling parameters (Linux SCSI layer)
cr0x@server:~$ cat /sys/block/sdb/device/timeout
30
cr0x@server:~$ echo 60 | sudo tee /sys/block/sdb/device/timeout
60
What it means: The OS will wait longer before declaring a command timed out. This can reduce false
positives with slow disks, but it can also hide real hangs and extend stalls.
Decision: Use timeout increases only as a stopgap and only if you understand the tradeoff. If a
disk needs 90 seconds to read a sector, you don’t have a timeout problem—you have a drive problem.
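If you do decide a longer timeout is a deliberate policy, the echo above won't survive reboot or hotplug. A udev rule is the usual way to persist it; a hedged sketch (filename and match are examples; scope the match to the devices you actually mean):
cr0x@server:~$ cat /etc/udev/rules.d/60-scsi-timeout.rules
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="60"
cr0x@server:~$ sudo udevadm control --reload && sudo udevadm trigger --subsystem-match=block
Keep the rule as narrow as possible; blanket timeout increases across every disk are a good way to hide the next failing drive.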
Task 18: Verify ZFS isn’t being sabotaged by write cache surprises
cr0x@server:~$ sudo hdparm -W /dev/sdb | head -n 2
/dev/sdb:
write-caching = 1 (on)
What it means: Write cache is enabled. That’s normal, but on some consumer drives and some
enclosures, cache + power-loss behavior can be risky. Decision: If you don’t have power-loss
protection and you care about sync semantics, consider SLOG for sync-heavy workloads and enterprise drives for
predictable behavior rather than toggling cache blindly.
Common mistakes: symptoms → root cause → fix
1) Symptom: “ZFS keeps faulting different disks during scrub”
- Root cause: Shared-path resets (backplane, expander, cable, or HBA) causing multiple devices to time out, not multiple bad disks.
- Fix: Correlate which disks flap together using kernel logs and HCTL. Replace cable/backplane/expander path; update HBA firmware; reduce reset domain complexity.
2) Symptom: “SMART looks fine but I see command timeouts”
- Root cause: Drives can stall on error recovery without tripping SMART thresholds; transport errors also don’t always show as classic SMART failures.
- Fix: Treat timeouts as a first-class failure signal. Trend CRC/link counters and kernel resets; replace the disk or path before it becomes a data-loss incident.
3) Symptom: “Resilver is painfully slow and the pool feels unstable”
- Root cause: Aggressive rebuild parallelism saturating queues; a weak disk is stalling under load; or a SATA drive is doing long internal retries.
- Fix: Throttle scrub/resilver parameters temporarily, protect latency for production, and replace any disk showing long awaits/timeouts.
4) Symptom: “Checksum errors appear, then disappear after a reboot”
- Root cause: Intermittent link or memory corruption; reboot masks it temporarily by resetting the path.
- Fix: Don’t celebrate. Run memory tests, check ECC logs, and inspect SAS phy or SATA CRC counters. Replace marginal components.
5) Symptom: “Only SATA drives drop; SAS drives don’t”
- Root cause: STP translation plus consumer drive error recovery behavior plus larger reset blast radius.
- Fix: Use enterprise/NAS drives with bounded recovery if available; prefer SAS for dense multi-bay chassis; ensure backplane and HBA are validated for SATA loads.
6) Symptom: “After replacing one disk, another starts timing out during resilver”
- Root cause: The resilver created enough stress to reveal the next weakest disk or a marginal transport path.
- Fix: Run a full scrub after resilver, watch iostat/latency, and proactively replace any disk with rising pending sectors or repeated slow I/O.
Checklists / step-by-step plan
Step-by-step: when you see timeouts during scrub/resilver
- Freeze the blast radius: pause scrub if the system is unstable and production is impacted.
- Capture evidence: save zpool status -v, zpool iostat -v, and kernel logs covering the event window (see the capture sketch after this list).
- Identify the suspect device(s): map by-id → /dev/sdX and note HCTL addresses.
- Differentiate media vs transport: SMART pending/offline uncorrectable suggests media; CRC/link counters suggest transport.
- Check shared-path correlation: do multiple disks share the same host/port/backplane segment?
- Apply temporary throttles: reduce scrub/resilver concurrency to stabilize while you work.
- Fix the likeliest cause: replace cable/backplane first if link counters rise; replace disk if it stalls alone or shows media errors.
- Resume scrub/resilver and watch: if errors recur, you missed the real weak link.
- Post-incident hygiene: schedule a full scrub after resilver and review error counter trends weekly for a month.
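To make the "capture evidence" step a reflex rather than an aspiration, keep a tiny script on the box. A hypothetical sketch (paths and filename are examples):
cr0x@server:~$ cat /usr/local/sbin/zfs-incident-capture.sh
#!/bin/bash
# Hypothetical evidence-capture sketch: snapshot pool state, per-vdev I/O stats,
# the last hour of kernel messages, and the device mapping before you change anything.
set -euo pipefail
out="/var/tmp/zfs-incident-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$out"
zpool status -v                    > "$out/zpool-status.txt"
zpool iostat -v 2 5                > "$out/zpool-iostat.txt"
journalctl -k --since "1 hour ago" > "$out/kernel.log"
ls -l /dev/disk/by-id/             > "$out/disk-by-id.txt"
echo "Evidence saved to $out"
Run it as root at the start of the incident; the five-sample zpool iostat gives you a latency picture from during the event, not a reconstruction from memory afterwards.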
Decision checklist: when to choose SAS over SATA for ZFS
- Choose SAS if you have a dense chassis, expanders, many drives per HBA, or strict uptime needs.
- Choose SAS if you expect frequent resilvers (large pools, high churn) and you want bounded error recovery by default.
- Choose SATA if cost/TB dominates, your chassis is simple, and you’re willing to be ruthless about drive models, cabling, and proactive replacement.
- Avoid mixing “whatever is cheap this quarter” SATA SKUs in the same pool. Consistency beats surprise.
Configuration checklist: make SATA behave less like a hobby
- Use a real HBA in IT mode; avoid RAID personalities.
- Prefer high-quality backplanes and short, labeled cables; replace anything that increments CRC counters.
- Schedule scrubs off-peak and throttle them if latency matters.
- Trend timeouts and resets, not just SMART “PASSED.”
- Keep spare drives on hand; replace on first signs of slow I/O under scrub, not after the third incident.
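"Trend, don't glance" is easier with a dumb append-only log. A hypothetical sketch that records the counters this article keeps referring to (device globs and paths are examples; adjust for your fleet):
cr0x@server:~$ cat /usr/local/sbin/disk-counter-trend.sh
#!/bin/bash
# Hypothetical trending sketch: append key per-disk SMART fields and per-phy link
# counters to a log so week-over-week increases show up as simple diffs.
set -u
log="/var/log/disk-counters.log"
{
  echo "=== $(date -Is) ==="
  for d in /dev/disk/by-id/ata-* /dev/disk/by-id/scsi-*; do
    [ -e "$d" ] || continue
    case "$d" in *-part*) continue ;; esac   # skip partition symlinks
    echo "-- $d"
    smartctl -H -A "$d" 2>/dev/null | grep -Ei "overall-health|Health Status|Reallocated|Pending|Offline_Unc|CRC|grown defect"
  done
  for phy in /sys/class/sas_phy/phy-*; do
    [ -d "$phy" ] || continue
    echo "-- $phy invalid_dword=$(cat "$phy/invalid_dword_count") reset_problem=$(cat "$phy/phy_reset_problem_count")"
  done
} >> "$log"
Run it weekly from cron or a systemd timer, and read the diffs, not the absolute numbers: a CRC counter that stays at 24 is history; one that goes from 24 to 60 is a cable asking to be replaced.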
FAQ
1) Is SAS inherently more reliable than SATA?
Not inherently. SAS ecosystems are typically engineered for predictable failure behavior under multi-disk stress.
That predictability is what operators interpret as reliability.
2) Why do SATA drives “drop out” during ZFS scrub?
Often because the drive stalls on internal error recovery, the OS times out the command, and the controller resets
the link. With STP behind a SAS HBA, that reset can be noisy.
3) If ZFS has checksums, why do timeouts matter?
Checksums detect corruption. They don’t make I/O complete on time. ZFS still needs the device to respond within the
OS timeout window to keep the pool online and coherent.
4) Should I increase Linux disk timeouts to stop flapping?
Sometimes as a temporary mitigation. But longer timeouts increase stall duration and can hide failing media.
If a disk needs dramatically longer timeouts, the safer answer is replacement or moving to drives with bounded
recovery behavior.
5) Do SAS expanders cause timeouts?
They can if firmware is buggy, the topology is overloaded, or cabling is marginal. But a good expander in a sane
topology is not a problem. A bad cable pretending to be an expander problem is extremely common.
6) Are checksum errors always a disk problem?
No. They can come from disk, cable, HBA, backplane, or memory corruption. That’s why correlating kernel logs,
link counters, and SMART is mandatory.
7) Why does SAS “fail fast” feel better for ZFS?
Because ZFS redundancy works best when a device reports an error quickly so ZFS can reconstruct and heal. Long
silent stalls cause timeouts, resets, and multi-device collateral damage.
8) Is mixing SAS and SATA in the same pool a bad idea?
It’s not forbidden, but it’s operationally awkward. Different timeout behavior and telemetry make incidents harder
to interpret. If you must mix, isolate by vdev class or pool purpose and document expectations.
9) What’s the single biggest indicator that I have a cabling/backplane issue?
Errors affecting multiple disks that share a physical path, plus rising CRC/link error counters. Media failures are
usually localized; transport failures love company.
10) What should I optimize for: scrub speed or stability?
Stability. A scrub is a safety procedure. If you need it to finish faster, scale out capacity or schedule better.
Don’t turn your redundancy check into a denial-of-service event.
Conclusion: next steps that prevent the 3 a.m. page
SAS feels more stable under ZFS stress because the whole stack—drives, protocol, and controllers—tends to return
errors promptly, isolate resets, and provide clearer telemetry. SATA can work, sometimes brilliantly, but it’s far
less forgiving when anything is marginal and far more likely to turn one weak disk into a bus-wide incident.
Practical next steps:
- Classify your errors (checksum vs timeout vs transport) and stop treating them as interchangeable.
- Correlate failures to shared paths using HCTL and link counters; replace cables/backplanes aggressively.
- Replace disks that stall under scrub even if SMART says “PASSED.” Your pool needs timely I/O, not optimism.
- Throttle scrubs/resilvers to protect production latency, then fix the underlying weakness instead of living on throttles.
- If uptime matters, buy SAS for dense multi-bay systems. You’re paying for bounded failure behavior and smaller reset blast radius.
One paraphrased idea worth keeping on a sticky note, attributed carefully to John Allspaw: reliability comes from building systems that expect failure and recover predictably.