ZFS SAS Expander Tuning: Avoiding Saturation and Link Bottlenecks

You built a perfectly reasonable ZFS pool. Then you put it behind a SAS expander because “it’s just wiring, right?” Now scrubs take forever, resilvers crawl, and your latency graph looks like a seismograph during a minor apocalypse.

This is the part where people blame ZFS. Don’t. Most of the time, it’s topology and link math: oversubscription, narrow ports, queueing, expander arbitration, or an HBA that’s quietly pinned on PCIe. The good news: you can diagnose this quickly and fix it without turning your storage shelf into modern art.

A mental model that won’t betray you

A SAS expander is a packet switch for SAS frames. That’s it. It’s not a magic bandwidth multiplier. It gives you fan-out (many drives) behind fewer upstream links (to your HBA), and it schedules access on those links. If your downstream aggregate demand exceeds upstream capacity, you don’t get more throughput—you get contention, queueing, and latency.

When ZFS does big sequential reads, you might get away with an oversubscribed expander because drives are slow and predictable. When ZFS does scrubs, resilvers, metadata walks, and lots of concurrent small I/O, expanders get stressed in a way that looks like “random latency spikes” at the application layer.

Three layers that can bottleneck (and often do)

  • Media: HDD/SSD behavior, SATA vs SAS, NCQ/TCQ, firmware quirks.
  • Transport: SAS links, wide ports, expander arbitration, SATA tunneling (STP), zoning.
  • Host: HBA queue depth, driver settings, PCIe lanes, interrupt handling, CPU overhead, ZFS scheduling.

The job is to identify which layer is limiting you right now. Not in theory. Not on the spec sheet. In your rack at 3 a.m. while the resilver timer laughs at you.

Joke #1: A SAS expander is like an open-plan office—everyone can collaborate, but somehow nobody gets work done at peak hours.

Facts and history: why expanders behave the way they do

Some context helps because SAS has a long tail of design decisions. Here are concrete points that show up in real systems:

  1. SAS expanders are descendants of Fibre Channel switching ideas, but with simpler addressing and a different arbitration model. The “it’s a switch” intuition is right, but the implementation details differ.
  2. SAS-1 (3 Gb/s), SAS-2 (6 Gb/s), SAS-3 (12 Gb/s) are per-lane rates; wide ports bundle lanes (x2, x4, etc.). Your “12G shelf” can still be effectively “6G-ish” if it negotiates down.
  3. SATA drives behind SAS expanders use STP (SATA Tunneling Protocol), which can behave very differently under load than native SAS. Some expanders handle STP contention poorly.
  4. Early SAS-2 expanders had notorious firmware oddities: link resets under error, poor fairness, and strange interactions with specific HBA firmwares. It got better, but “update firmware” is still not superstition.
  5. Wide porting exists because individual lanes are not enough. A single 12G lane is great until you put 24 drives behind it and start a scrub.
  6. Zoning on SAS expanders is real, and misconfigured zoning can force traffic through a narrow path or prevent multipath from working even when physically cabled.
  7. Queue depth tuning has been a sport since the SCSI days. Too low wastes hardware; too high causes latency collapse. Modern Linux makes it easy to set—also easy to set wrong.
  8. PCIe became the quiet limiter as SAS got faster. A “12G HBA” on insufficient PCIe lanes can bottleneck long before the SAS fabric does.

One engineering quote to keep in your pocket when you’re tempted to optimize without measuring: “If you can’t measure it, you can’t improve it.” — Peter Drucker. (Commonly attributed; treat it as a paraphrased idea if you’re picky.)

Topology and oversubscription math (where the bodies are buried)

Let’s talk about the most common failure: you have lots of disks and not enough upstream bandwidth.

Know your lanes, and stop guessing

SAS bandwidth is per lane, per direction, roughly:

  • 6G SAS: ~600 MB/s per lane after encoding/overhead (ballpark).
  • 12G SAS: ~1200 MB/s per lane after overhead (ballpark).

Wide ports combine lanes. A x4 12G wide port is roughly “up to ~4.8 GB/s each direction” in the ideal case. Ideal cases are rare in production, but the math still tells you if you’re dreaming.
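
If you want that math on hand instead of in your head, a quick shell sketch works. The per-lane figure and per-drive estimate below are the same ballparks as above, and the drive count is a placeholder; swap in your own shelf's numbers:

cr0x@server:~$ lanes=4; mb_per_lane=1100; drives=24; mb_per_drive=250; echo "uplink ~$((lanes*mb_per_lane)) MB/s vs. worst-case demand ~$((drives*mb_per_drive)) MB/s"
uplink ~4400 MB/s vs. worst-case demand ~6000 MB/s

When worst-case demand comfortably exceeds uplink capacity, you are oversubscribed; whether that hurts depends on how synchronized the demand is, which is exactly what the next section is about.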

Oversubscription is not automatically evil

Oversubscription is normal because disks don’t all run at line rate. With HDDs, each drive might do ~200–280 MB/s sequential, and much less with random I/O. You can oversubscribe and still be fine if:

  • workloads are bursty and not synchronized across many spindles,
  • there’s enough cache (ARC/L2ARC) to absorb reads,
  • you don’t run scrubs/resilvers during peak or you cap them sensibly.

But ZFS maintenance operations are the synchronized workload. A scrub touches everything. A resilver touches a lot and does it while the pool is already degraded. If your expander uplink is narrow, these operations turn into a slow-motion traffic jam.

Common topologies and their traps

  • One HBA port → one expander uplink → many drives: simplest, and the easiest to saturate.
  • Two HBA ports → dual uplinks to the same expander: can help, but only if wide porting or multipath is actually negotiated and the expander is configured to use it.
  • Dual expanders (redundant paths) in a shelf: good for availability; performance depends on how traffic is balanced and whether the OS sees distinct paths.
  • Daisy-chained expanders: works, but it’s easy to create a “funnel” where everything transits one link. Latency spikes become your personality.

ZFS I/O patterns that stress expanders

ZFS is not “a RAID card.” It’s a storage system that schedules I/O based on transaction groups, vdev layout, and queueing policies. This matters because expanders are sensitive to concurrency and fairness.

Scrub/resilver: high fan-out, sustained, fairness-sensitive

During a scrub, ZFS reads the entire pool to verify checksums. That is: many drives, lots of concurrent reads, and steady pressure for hours or days. An oversubscribed uplink becomes a shared choke point, and expanders can introduce additional latency when arbitration cycles get busy.

Small-block random I/O: metadata and synchronous workloads

Even if your application does “big streaming writes,” ZFS metadata, indirect blocks, and allocation behavior introduce smaller I/O. Expanders don’t hate small I/O; they hate lots of outstanding commands competing for a constrained uplink, especially with SATA behind STP.

Special vdevs and SLOG can help, but they can also mask transport pain

A special vdev can reduce metadata I/O on HDD vdevs. A SLOG can transform sync write latency. Neither one increases expander uplink bandwidth. They can reduce demand—and that’s great—but don’t mistake “symptoms improved” for “fabric fixed.”

Joke #2: Resilver time estimates are like weather forecasts—technically derived, emotionally inaccurate.

Fast diagnosis playbook

You’re on-call. Latency is high. Scrub is running. Someone says “the shelf is slow.” Here’s a fast sequence that finds the bottleneck more often than not.

First: prove whether the bottleneck is in the SAS fabric or the disks

  1. Check ZFS vdev latency (are all vdevs slow equally, or one side?): use zpool iostat -v.
  2. Check per-disk wait times and queueing: use iostat -x and look at per-request waits (r_await/w_await), average queue size (aqu-sz), and %util.
  3. Check link negotiation and topology: use systool/lsscsi/sas2ircu/storcli (depending on your HBA) to confirm 12G/6G and number of lanes.

Second: check if you’re saturating an uplink

  1. Measure aggregate throughput during scrub/resilver and compare it to the uplink's theoretical capacity: does it plateau suspiciously?
  2. Look for fairness issues: some disks show huge queue depths while others are idle; that’s a classic expander arbitration or pathing problem.
  3. Look for resets/retries: link errors cause retransmits and stalls that mimic “slow storage.”

Third: eliminate host-side bottlenecks

  1. Check PCIe link width/speed for the HBA.
  2. Check CPU softirq/interrupt pressure if you’re doing very high IOPS (SSDs behind expanders can do that).
  3. Check queue depth settings for SCSI devices and the HBA driver/module.

If you do only three things: zpool iostat -v, iostat -x, and “what speed/width did the HBA negotiate,” you’ll catch most of the real-world failures.
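
If you want those three answers captured in one shot (so the evidence outlives the incident), here is a minimal sketch; it assumes the pool is called tank and the HBA sits at PCIe address 3b:00.0, matching the examples in the tasks below, so adjust both for your system:

cr0x@server:~$ out=/tmp/storage-evidence-$(date +%F-%H%M); mkdir -p "$out"; zpool iostat -v -l tank 5 3 > "$out/zpool_iostat.txt"; iostat -x 5 3 > "$out/iostat.txt"; sudo lspci -s 3b:00.0 -vv | egrep 'LnkCap|LnkSta' > "$out/hba_pcie.txt"

Thirty seconds of samples taken during the pain is worth more than an hour of benchmarks taken after it stops.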

Practical tasks: commands, outputs, and what to decide

These are not benchmark beauty shots. They are the kinds of commands you run when someone is waiting on Slack for an answer. Each task includes what the output means and what decision to make next.

Task 1: Identify pool-wide latency and whether it’s localized

cr0x@server:~$ zpool iostat -v tank 5 3
                               capacity     operations     bandwidth
pool                         alloc   free   read  write   read  write
---------------------------  -----  -----  -----  -----  -----  -----
tank                         120T   80.0T  1.20K   220   1.10G  180M
  raidz2-0                    60T   40.0T    620   120   560M   95M
    sda                           -      -    78    15   70M   12M
    sdb                           -      -    77    14   70M   12M
    sdc                           -      -    76    15   69M   12M
  raidz2-1                    60T   40.0T    580   100   540M   85M
    sdd                           -      -    74    13   68M   11M
    sde                           -      -    12     2   10M  1.5M
    sdf                           -      -    73    13   67M   11M
---------------------------  -----  -----  -----  -----  -----  -----

What it means: One disk (sde) is dramatically underperforming while others are steady. That’s not a simple uplink saturation story; it smells like a bad disk, a bad path, or a link negotiating down.

Decision: Drill into sde: check negotiated speed, error counters, cabling, and SMART. Don’t tune ZFS until you know whether one drive/path is failing.

Task 2: Check scrub status and whether it’s being throttled

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Thu Dec 26 01:12:03 2025
        14.2T scanned at 1.05G/s, 9.8T issued at 720M/s, 200K repaired
        18.4% done, 0 days 10:22:11 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
          raidz2-1                  ONLINE       0     0     0

errors: No known data errors

What it means: The difference between “scanned at” and “issued at” suggests the scrub is limited by actual device I/O issuance (or some throttling), not just walking metadata.

Decision: If latency impact is unacceptable, consider temporary throttling via ZFS module tunables (platform-dependent) or scheduling scrubs off-peak. If the “issued at” rate is far below what the disks can do, suspect expander uplink or queueing constraints.

Task 3: See per-disk utilization and wait times

cr0x@server:~$ iostat -x -d 5 2
Linux 6.6.0 (server)   12/26/2025  _x86_64_  (32 CPU)

Device            r/s     w/s   rMB/s   wMB/s  r_await  w_await  aqu-sz  %util
sda              78.2    14.9    70.5    12.1    9.2     12.3     1.2   88.0
sdb              77.9    14.7    70.3    12.0    9.4     12.1     1.2   87.5
sdc              76.8    15.1    69.5    12.2    9.6     12.0     1.3   89.1
sdd              74.0    13.4    68.1    11.0    9.1     11.8     1.1   86.2
sde              11.7     2.0    10.2     1.6   55.0     61.2     3.8   42.0
sdf              73.2    13.1    67.4    10.8    9.0     11.6     1.1   85.9

What it means: sde has high await and higher average queue size but low throughput and lower utilization. That’s classic “waiting on something not-the-platter”: retries, link resets, path issues, or expander oddities.

Decision: Check kernel logs for link errors and query the SAS link rate for that target. Don’t increase queue depths to “fix” this; it will make the tail latency uglier.

Task 4: Find SAS hosts and the expander(s) visible to Linux

cr0x@server:~$ lsscsi -g
[0:0:0:0]    disk    ATA      ST12000NM000J  SN02  /dev/sda  /dev/sg0
[0:0:1:0]    disk    ATA      ST12000NM000J  SN02  /dev/sdb  /dev/sg1
[0:0:2:0]    disk    ATA      ST12000NM000J  SN02  /dev/sdc  /dev/sg2
[0:0:3:0]    disk    ATA      ST12000NM000J  SN02  /dev/sdd  /dev/sg3
[0:0:4:0]    disk    ATA      ST12000NM000J  SN02  /dev/sde  /dev/sg4
[0:0:5:0]    disk    ATA      ST12000NM000J  SN02  /dev/sdf  /dev/sg5
[1:0:0:0]    enclosu HGST     H4060-J        4R06  -         /dev/sg10

What it means: You have at least two SCSI hosts ([0:...] and [1:...]), and an enclosure device. That enclosure entry often indicates SES (enclosure services), which is common in shelves with expanders.

Decision: Map disks to expander/phy topology using HBA tooling (LSI/Broadcom utilities) or sysfs attributes. You want to know: are “slow disks” clustered behind one expander port?
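
A hedged starting point for that mapping, without vendor tools: mpt2sas/mpt3sas-style HBAs usually expose a per-device SAS address in sysfs, so you can group drives by address and spot which ones hang off the same expander. The attribute path is an assumption about your driver; the fallback handles devices that don't expose it:

cr0x@server:~$ for d in /sys/block/sd*; do echo "$(basename $d) $(cat $d/device/sas_address 2>/dev/null || echo n/a)"; done

lsscsi -t prints similar transport addresses next to each device if you prefer one command, and the vendor utilities (sas3ircu, storcli) give you the authoritative phy/port view.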

Task 5: Check kernel logs for SAS link resets, timeouts, and transport errors

cr0x@server:~$ dmesg -T | egrep -i 'sas|mpt3sas|scsi|reset|timeout|link'
[Thu Dec 26 02:01:14 2025] mpt3sas_cm0: log_info(0x31120400): originator(PL), code(0x12), sub_code(0x0400)
[Thu Dec 26 02:01:15 2025] sd 0:0:4:0: [sde] tag#1826 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Thu Dec 26 02:01:15 2025] sd 0:0:4:0: [sde] tag#1826 Sense Key : Aborted Command [current]
[Thu Dec 26 02:01:15 2025] mpt3sas_cm0: sas_device(0x00000008), phy(6) link reset
[Thu Dec 26 02:01:17 2025] sd 0:0:4:0: [sde] Synchronizing SCSI cache

What it means: Soft errors and link resets on a particular phy strongly indicate a physical-layer issue (cable, connector, backplane slot, expander phy), or marginal negotiation at high speed.

Decision: Before touching ZFS tunables, fix transport reliability: reseat/replace cable, move the drive to another bay, check expander/HBA firmware compatibility. Performance tuning on a flaky link is just polishing a cracked lens.
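
While you're at it, the SAS transport layer keeps per-phy error counters in sysfs; rising invalid-dword or dword-sync-loss counts point at a specific phy, cable, or bay. A minimal sketch (the sas_phy class is standard, but whether expander phys report counters depends on the driver):

cr0x@server:~$ grep -H . /sys/class/sas_phy/phy-*/invalid_dword_count /sys/class/sas_phy/phy-*/loss_of_dword_sync_count 2>/dev/null

Record these before and after a scrub, along with running_disparity_error_count and phy_reset_problem_count: counters that climb under load are the transport telling you where it hurts.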

Task 6: Verify PCIe speed/width for the HBA

cr0x@server:~$ lspci -nn | egrep -i 'sas|scsi'
3b:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0097] (rev 02)

cr0x@server:~$ sudo lspci -s 3b:00.0 -vv | egrep -i 'LnkCap|LnkSta'
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
LnkSta: Speed 8GT/s, Width x4

What it means: The card can do PCIe Gen3 x8, but it’s currently running at x4. That’s a sneaky throughput cap and can produce “mysterious” saturation even with a properly wide SAS uplink.

Decision: Move the HBA to a slot that provides full lanes, adjust BIOS bifurcation settings, or remove the conflicting device stealing lanes. Don’t argue with physics.
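
For context on what that downshift costs: PCIe Gen3 carries roughly 985 MB/s of usable payload per lane after 128b/130b encoding (before protocol overhead), so the arithmetic is quick:

cr0x@server:~$ echo "Gen3 x8 ~$((8*985)) MB/s, Gen3 x4 ~$((4*985)) MB/s"
Gen3 x8 ~7880 MB/s, Gen3 x4 ~3940 MB/s

A x4 Gen3 link sits below what a single x4 12G SAS wide port can carry, which is exactly how an HBA becomes the quiet ceiling.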

Task 7: Inspect SCSI queue depth per device

cr0x@server:~$ for d in sda sdb sdc sdd sde sdf; do echo -n "$d "; cat /sys/block/$d/device/queue_depth; done
sda 32
sdb 32
sdc 32
sdd 32
sde 32
sdf 32

What it means: Queue depth 32 is common for SATA drives behind SAS. Not inherently bad.

Decision: If you’re saturating an uplink, raising queue depths can worsen latency because it increases outstanding work competing for the same narrow link. If you’re under-driving fast SSDs and latency is stable, increasing may help. Decide based on observed saturation/latency, not vibes.
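
If measurement does justify a change, the knob is the same sysfs file you just read, writable at runtime. The value 16 here is purely illustrative; pick numbers from your own latency data, and note they do not persist across reboots (use a udev rule if you decide to keep them):

cr0x@server:~$ echo 16 | sudo tee /sys/block/sde/device/queue_depth
16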

Task 8: Check block layer scheduler (helps diagnose latency amplification)

cr0x@server:~$ cat /sys/block/sda/queue/scheduler
[mq-deadline] none kyber bfq

What it means: mq-deadline is active. That’s often sane for HDDs in servers.

Decision: If you see pathological tail latency under mixed workloads behind an expander, mq-deadline is usually a better starting point than none for HDD. Don’t cargo-cult “none everywhere.”

Task 9: Check for device-level negotiated link speed (SAS/SATA)

cr0x@server:~$ sudo smartctl -a /dev/sde | egrep -i 'SATA Version|SAS Version|Negotiated|Transport protocol'
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)

What it means: The drive supports 6.0 Gb/s but is currently at 3.0 Gb/s. That’s a smoking gun for a marginal link.

Decision: Treat “negotiated down” as a hardware/cabling/backplane problem first. Fix that, then retest. If it keeps happening, that bay or expander phy is suspect.
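
To sweep the whole shelf for the same symptom instead of interrogating one suspect, a small loop over smartctl works for SATA drives; SAS drives report link rates differently, so the fallback message just flags them for manual checking. The /dev/sd[a-f] glob matches this article's example devices, so widen it for your system:

cr0x@server:~$ for d in /dev/sd[a-f]; do echo -n "$d: "; sudo smartctl -i $d | grep -i 'SATA Version' || echo "no SATA link info (check manually)"; done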

Task 10: Map disks to enclosure slots (so you can move the right thing)

cr0x@server:~$ sudo sg_map -x
/dev/sg0  0 0 0 0  0  /dev/sda
/dev/sg1  0 0 1 0  0  /dev/sdb
/dev/sg2  0 0 2 0  0  /dev/sdc
/dev/sg3  0 0 3 0  0  /dev/sdd
/dev/sg4  0 0 4 0  0  /dev/sde
/dev/sg5  0 0 5 0  0  /dev/sdf
/dev/sg10  1 0 0 0  13

What it means: You can now correlate SCSI addresses with disks. If you have SES tools, you can often light LEDs or query slot mapping.

Decision: If only one phy/slot is problematic, move the drive to another bay to see if the problem follows the drive (drive issue) or stays with the slot (backplane/expander path).

Task 11: Observe ZFS latency under load with per-vdev visibility

cr0x@server:~$ zpool iostat -v -l tank 5 2
                              capacity     operations     bandwidth    total_wait     disk_wait
pool                        alloc   free   read  write   read  write   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----  ----- -----  -----  -----
tank                        120T   80.0T  1.15K   240   1.05G  190M   12ms   8ms    9ms   6ms
  raidz2-0                   60T   40.0T    600   120   530M   95M    11ms   7ms    8ms   5ms
  raidz2-1                   60T   40.0T    550   120   520M   95M    13ms   9ms   10ms   7ms
--------------------------  -----  -----  -----  -----  -----  -----  ----- -----  -----  -----

What it means: total_wait includes time in ZFS queues; disk_wait is closer to device service time. When total_wait balloons while disk_wait stays modest, you’re bottlenecked above the disks (queuing in ZFS, HBA, or fabric).

Decision: If both waits are high, the disks or the transport to them are slow. If only total is high, look at queue depth, scrub/resilver throttles, and host-side contention.

Task 12: Check current ZFS tunables that affect scrub/resilver behavior (Linux OpenZFS)

cr0x@server:~$ grep . /sys/module/zfs/parameters/zfs_vdev_{max,scrub_max,async_read_max,async_write_max,sync_read_max,sync_write_max}_active /sys/module/zfs/parameters/zfs_scan_issue_strategy
/sys/module/zfs/parameters/zfs_vdev_max_active:1000
/sys/module/zfs/parameters/zfs_vdev_scrub_max_active:64
/sys/module/zfs/parameters/zfs_vdev_async_read_max_active:64
/sys/module/zfs/parameters/zfs_vdev_async_write_max_active:64
/sys/module/zfs/parameters/zfs_vdev_sync_read_max_active:10
/sys/module/zfs/parameters/zfs_vdev_sync_write_max_active:10
/sys/module/zfs/parameters/zfs_scan_issue_strategy:0

What it means: These control how many concurrent I/Os ZFS will throw at vdevs for different classes of work. High concurrency can saturate an expander uplink and inflate latency for “real work.” Low concurrency can make maintenance take forever.

Decision: If you’re seeing expander saturation during scrub/resilver, reduce scrub/resilver concurrency first (and observe impact). If you’re underutilizing the fabric and disks, increase cautiously. Do it with measurement, and do not change ten knobs at once.
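
On Linux these are runtime-writable module parameters, which makes controlled experiments cheap. A minimal sketch that drops scrub concurrency mid-scrub (16 is an illustrative value, not a recommendation for your hardware), to be paired with before/after zpool iostat -l samples:

cr0x@server:~$ echo 16 | sudo tee /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
16

The change takes effect for subsequent I/O issuance and is lost when the module reloads, which is exactly what you want while you are still measuring.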

Task 13: Confirm multipath is actually in effect (if you expect it)

cr0x@server:~$ multipath -ll
mpatha (3600508b1001c3d6d5f3c1b2a00000001) dm-2 LIO-ORG ,disk
size=10T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 3:0:0:0 sda 8:0  active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  `- 4:0:0:0 sdg 8:96 active ready running

What it means: You have multiple paths. Great—assuming they are independent uplinks and not two paths that converge into the same narrow expander uplink.

Decision: Validate physical cabling and expander routing. Multipath is not a performance guarantee; it’s a correctness/availability tool that can also help throughput when topology supports it.

Task 14: Measure aggregate throughput during scrub/resilver to spot uplink plateaus

cr0x@server:~$ zpool iostat tank 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        120T   80.0T  1.10K   210   1.15G  160M
tank        120T   80.0T  1.12K   215   1.16G  162M
tank        120T   80.0T  1.11K   212   1.15G  161M
tank        120T   80.0T  1.11K   211   1.15G  160M
tank        120T   80.0T  1.10K   209   1.14G  159M

What it means: A stable plateau (here ~1.15 GB/s reads) is not automatically bad. It becomes suspicious when your theoretical uplink is near that value and you have many spindles that could collectively exceed it during scrub.

Decision: Compare plateau to uplink capacity (lanes × negotiated rate). If you’re close, you’re uplink-bound. Remedy: add lanes (wider ports), add another uplink, or split shelves across HBAs.

Task 15: Check HBA firmware/driver version (because compatibility is a performance feature)

cr0x@server:~$ modinfo mpt3sas | grep -i '^version'
version:        46.100.00.00
cr0x@server:~$ cat /sys/class/scsi_host/host0/version_fw
16.00.12.00

What it means: You can at least identify what you’re running. Expander and HBA firmware mismatches can manifest as resets, reduced link rate, or poor fairness under load.

Decision: If you’re chasing intermittent resets or negotiation downshifts, align HBA firmware with a known-good set for your shelf/expander generation. Make changes in maintenance windows and validate with stress + scrub.

Tuning levers that actually matter

1) Fix the physical and link layer first

If you have link resets, negotiated-down speeds, or CRC errors, stop. Replace cables, reseat connectors, swap bays, and update firmware. Tuning above that layer is like tuning a race car with three lug nuts missing.

2) Prefer wide ports and real uplink bandwidth

If your expander has multiple external ports, you want either:

  • Wide porting: multiple lanes aggregated between HBA and expander, or
  • Multiple independent uplinks split across shelves/vdevs so one uplink doesn’t become a funnel.

Practically: cable the shelf the way the vendor expects for “high bandwidth,” not the way that uses the fewest cables. Cables are cheap. On-call time is not.

3) Scrub/resilver concurrency: cap it to your fabric, not your ego

On OpenZFS (Linux), the zfs_vdev_scrub_max_active module parameter is the first knob to reach for when maintenance operations murder latency; it governs both scrub and resilver I/O, since they share a vdev queue class (sequential rebuilds have their own zfs_vdev_rebuild_max_active).

Guideline that works surprisingly well behind oversubscribed expanders:

  • Start with 16–32 outstanding scrub/resilver I/Os per vdev (zfs_vdev_scrub_max_active) if you have HDDs behind an expander.
  • If latency is stable and you’re not saturating uplinks, move up gradually.
  • If you see tail latency spikes, move down and re-measure.

These values aren’t sacred. What’s sacred is changing one thing at a time and collecting before/after.
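
When you have settled on values that behave, persist them as a module option so a reboot does not silently undo your work. A minimal sketch for a Linux system loading ZFS via modprobe (appending avoids clobbering options already in the file; the value remains a placeholder):

cr0x@server:~$ echo "options zfs zfs_vdev_scrub_max_active=16" | sudo tee -a /etc/modprobe.d/zfs.conf
options zfs zfs_vdev_scrub_max_active=16

Remember the file only takes effect at module load; set the matching /sys/module/zfs/parameters value too if you want it live now.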

4) Queue depth: don’t “fix” bandwidth caps with more queueing

Queue depth is a multiplier for contention. If your uplink is already saturated, increasing queue depth increases the amount of work waiting its turn, which increases latency, which makes applications sad, which makes you sad.

When would you increase it?

  • SSDs behind a properly wide SAS fabric where the HBA isn’t the limiter.
  • Workloads that are throughput-oriented and tolerate higher latency (backups, bulk replication).

5) Split pools/vdevs across fabrics when you can

ZFS performance lives and dies by vdev parallelism. If you have two HBAs or two shelves, don’t put all vdevs behind one expander uplink because it’s “tidy.” Spread vdevs so that no single link becomes the shared choke point for the entire pool. That’s design, not tuning.

6) Validate PCIe and NUMA placement

If your HBA is in a slot negotiating at x4 instead of x8, or it’s attached to the “other socket” with a chatty NUMA penalty, you can waste a lot of time blaming the expander. Confirm PCIe link width/speed and keep interrupts/CPU locality sane for high-IOPS systems.
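
Both checks are one-liners; the PCI address is the one from the earlier lspci task, so substitute your own. A numa_node of -1 means the platform reports no locality information, and the interrupt count tells you how many MSI-X vectors the driver registered (worth knowing before you start pinning anything):

cr0x@server:~$ cat /sys/bus/pci/devices/0000:3b:00.0/numa_node; grep -c mpt3sas /proc/interrupts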

7) Be careful with “optimize for rebuild speed” policies

Fast resilvers are great until they starve production I/O. On oversubscribed shelves, aggressive resilver concurrency can saturate uplinks and degrade every dataset, even those living on “healthy” vdevs. Your goal is predictable service, not winning a benchmark screenshot.

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “12G shelf means 12G everywhere”

The setup looked modern: SAS3 HBA, SAS3 shelf, “12G” printed on the bezel. The team migrated a ZFS backup target into it, then repurposed it for something more interactive because “it has plenty of disks.” First scrub after go-live, latency blew up. Databases complained. People stared at ZFS graphs like they were going to confess.

The assumption was subtle: that a 12G shelf guarantees 12G per drive and enough uplink bandwidth to match the spindle count. In reality, the shelf had a 12G expander, but the upstream cabling was a single x1 lane equivalent because of a bad cable choice and a port that negotiated down. The expander happily served 24 drives through a straw.

The telling sign was a suspicious throughput plateau: scrub reads pinned at about what a single down-negotiated lane could do. Per-disk iostat showed lots of waiting and low utilization. Nothing was “wrong” with the disks. The fabric was simply saturated.

The fix was boring: recable to a proper x4 wide link, verify negotiated rates, and run a scrub again. Throughput doubled, latency dropped, and the team stopped arguing about whether ZFS “needs a RAID controller.” They also wrote down the port mapping in a runbook, which should not be a heroic act, but here we are.

Optimization that backfired: turning every knob to 11

A different company had a big object store on ZFS with HDD vdevs behind expanders. Scrubs were slow, and someone decided it was unacceptable. They increased ZFS vdev concurrency values aggressively and bumped device queue depths because “more parallelism equals more speed.” The next scrub was indeed faster—for about fifteen minutes.

Then came the weirdness: API latency climbed, timeouts appeared, and the busiest nodes showed oscillating I/O graphs. It wasn’t a simple saturation line; it looked like bursts of congestion and recovery, like traffic waves on a highway. The on-call team disabled the scrub to stabilize production, which is exactly what you don’t want to do for data integrity.

Post-incident analysis showed classic queueing collapse. The expander uplinks were oversubscribed. By increasing outstanding commands, they increased the queue inside the fabric and at the disks. The system spent more time juggling and timing out commands and less time doing useful I/O. Tail latency ballooned; the app felt it immediately.

The fix was to lower scrub/resilver concurrency to match the uplink bandwidth, keep queue depths moderate, and schedule scrubs with an explicit impact budget. The result was a scrub that took longer than the “fast” attempt, but it ran without disrupting production. That’s the version you can live with.

Boring but correct practice that saved the day: topology documentation and canary scrubs

One enterprise team ran several ZFS pools across multiple shelves. They were not the “move fast and break storage” type. Every shelf had a topology diagram: which HBA port, which expander port, which cable, and which bay ranges belonged to which vdevs. It was a spreadsheet. It was not exciting. It was also accurate.

During a maintenance window, they upgraded expander firmware. After the change, a canary scrub on a non-critical pool showed a small but consistent drop in throughput and a rise in disk_wait. Not catastrophic, but measurable. Because they had “known good” baselines from previous canary scrubs, they didn’t have to argue whether this was normal variance.

They rolled back firmware on one shelf and the canary performance returned. That narrowed the problem to a specific expander firmware interaction with their HBA model and SATA drive mix. They staged a different firmware revision and retested until it matched baseline.

Nothing dramatic happened in production. That’s the point. Boring practices—topology documentation, baselines, and canary scrubs—turn “mysterious performance regressions” into controlled changes with evidence. It’s not glamorous, but neither is restoring from backups.

Common mistakes: symptoms → root cause → fix

1) Scrubs/resilvers plateau at a suspiciously round number

Symptoms: Throughput stalls around ~550 MB/s, ~1.1 GB/s, ~2.2 GB/s, etc., regardless of how many spindles you have.

Root cause: Uplink is a single lane (or negotiated down), or wide porting isn’t actually in effect. Sometimes the HBA is PCIe-limited.

Fix: Verify negotiated SAS rate and lane count; recable for x4 wide port; ensure HBA PCIe x8 is negotiated; split vdevs across uplinks/HBAs.

2) One or a few disks show massive await while others look fine

Symptoms: iostat -x shows one disk with high await, low throughput, and low-ish %util; ZFS shows that leaf device lagging.

Root cause: Link resets, negotiated-down speed for that drive, marginal bay/backplane, or expander phy problems.

Fix: Check dmesg; confirm negotiated speed via SMART; move the disk to a different bay; replace cable/backplane path; update expander firmware if it’s a known issue.

3) “Multipath enabled” but performance is unchanged

Symptoms: Two paths exist in OS, but throughput looks like a single link; failover works but no scaling.

Root cause: Both paths converge to the same expander uplink, or zoning/expander routing forces a single active path.

Fix: Validate physical topology; ensure independent uplinks; check expander zoning and HBA port wiring; test by pulling one cable and observing path changes.

4) Latency spikes during scrub that disappear when scrub stops

Symptoms: Applications see periodic timeouts; storage metrics show spikes correlated with scrub windows.

Root cause: Scrub concurrency too high for fabric; expander arbitration/oversubscription causes queueing; sometimes sync workloads compete badly.

Fix: Reduce scrub/resilver concurrency; schedule scrubs off-peak; consider special vdev for metadata; ensure uplinks are wide enough.

5) “Upgraded to faster HBA” but nothing improved

Symptoms: New SAS3 HBA installed; same plateau; same latency patterns.

Root cause: HBA is PCIe lane-limited, or shelf uplink is still narrow, or drives are SATA behind STP and dominate behavior.

Fix: Confirm PCIe x8 at expected generation; confirm expander uplink lanes; consider SAS drives for heavy concurrency use; don’t forget cabling.

6) Random command timeouts under heavy load

Symptoms: Kernel logs show aborted commands/timeouts; ZFS marks devices slow; resilvers restart.

Root cause: Excessive queue depth + oversubscription + marginal links; expander firmware issues; sometimes power/thermal causing PHY instability.

Fix: Fix link errors, lower concurrency/queue depth, update firmware, check shelf power/thermals, and revalidate with stress tests.

Checklists / step-by-step plan

Step-by-step: from “storage is slow” to a stable fix

  1. Capture the moment. Save zpool status, zpool iostat -v -l, and iostat -x during the incident window.
  2. Check for obvious single-device weirdness. If one disk is lagging, treat it as a link/device issue first.
  3. Check logs for transport errors. Any link reset/retry pattern moves you into hardware remediation mode.
  4. Verify negotiated speeds. Confirm drives aren’t stuck at 3G and uplinks aren’t narrower than you think.
  5. Verify PCIe negotiation. Confirm HBA is at expected width/speed.
  6. Compute rough uplink capacity. Lanes × rate; compare to observed plateau under scrub.
  7. Decide: add bandwidth or reduce demand. Bandwidth: recable wide ports, add uplinks, split shelves. Demand: tune scrub/resilver concurrency, schedule maintenance.
  8. Change one thing. Apply a single adjustment (recable, firmware, concurrency) and re-measure.
  9. Run a canary scrub. Not full production first. Validate stability and latency impact.
  10. Document topology and baselines. If you skip this, you’ll pay later—with interest.

Checklist: “Is my expander uplink actually wide?”

  • HBA port(s) used are capable of wide porting and are configured accordingly.
  • Cables support the lane count you think they do (not all external cables are equivalent).
  • Link negotiated at expected rate (6G/12G) end-to-end.
  • OS/driver/HBA utility reports multiple phys in the wide port.
  • No zoning/routing rule forces all traffic through one narrow path.

Checklist: safe scrub/resilver tuning on oversubscribed shelves

  • Start conservative on zfs_vdev_scrub_max_active (it governs both scrub and resilver I/O).
  • Measure latency impact on production workloads during a controlled window.
  • Increase in small steps; stop when tail latency starts to climb.
  • Keep a rollback plan (and document the previous values).

FAQ

1) Do SAS expanders reduce performance by default?

No. They reduce performance when you oversubscribe uplinks, have poor cabling/negotiation, or hit fairness/queueing limits. With proper wide ports and sane concurrency, expanders can perform very well.

2) Why do scrubs hurt more than normal reads?

Scrubs are synchronized, sustained, and wide-fan-out. They keep many disks busy at once and expose shared bottlenecks (uplinks, PCIe, arbitration) that normal application I/O might not hit continuously.

3) Is SATA behind a SAS expander a bad idea?

It’s common and can be fine for capacity tiers. But STP behavior plus high concurrency can produce uglier tail latency than native SAS, especially during scrub/resilver. If you need predictable latency under heavy concurrency, SAS drives are easier to reason about.

4) Should I increase queue depth to speed up resilvers?

Only if you’ve proven you are not uplink-bound and latency is stable. If you’re already saturating an expander uplink, higher queue depth often increases timeouts and tail latency, making resilvers less stable.

5) What’s the single best indicator of an expander uplink bottleneck?

A stable throughput plateau during scrub/resilver that matches a narrow link’s capacity, plus elevated queueing/wait times across many disks without any one disk being the clear villain.

6) How do I know if wide porting is actually working?

Use HBA tooling to inspect phys/ports and negotiated link rates. OS-level indicators alone can be misleading. Also compare throughput under load: if adding a second cable changes nothing, your “wide port” might not be wide.

7) Can I “tune ZFS” to overcome a saturated SAS uplink?

You can tune ZFS to be less disruptive (lower concurrency, better scheduling), but you cannot tune your way around missing bandwidth. You’re choosing between “slow maintenance” and “slow everything.” Fix the topology if you need both fast and smooth.

8) Does adding an L2ARC or more RAM help with expander bottlenecks?

It can reduce read I/O to disks, which reduces demand on the fabric. It won’t help writes that must hit disks, and it won’t fix scrub/resilver bandwidth caps. Treat it as workload mitigation, not transport repair.

9) What about splitting a pool across multiple expanders?

Splitting vdevs across multiple independent uplinks can be a big win. The trick is “independent”: if both expanders ultimately funnel through one HBA port, you’ve just added complexity, not bandwidth.

Practical next steps

If your ZFS pool sits behind a SAS expander and you’re fighting slow scrubs, slow resilvers, or latency spikes, do these next:

  1. Prove negotiated speeds and PCIe width. Fix any downshift or lane limitation before tuning anything else.
  2. Measure plateau vs uplink math. If your throughput tops out near a single-lane or narrow-wide-port number, you’ve found the choke point.
  3. Right-size scrub/resilver concurrency. Make maintenance predictable and non-destructive, even if it’s slower than you wish.
  4. Re-cable for bandwidth. Wide ports and multiple uplinks beat clever sysctl settings every day of the week.
  5. Write down the topology. Future-you will need it, and future-you is already tired.

Once the fabric is reliable and the link math matches your expectations, ZFS tends to behave like a grown-up system: boring, measurable, and fast enough to keep everyone out of trouble. That’s the dream. Aim for boring.
