ZFS 10GbE Performance: Proving Whether Network Is the Bottleneck

Your ZFS box benchmarks like a hero locally, then crawls when a client hits it over 10GbE. Everyone has a theory. Storage blames the switch. Network blames ZFS. The app team blames “latency.” Meanwhile you’re staring at a copy dialog that feels like it’s rendering each packet by hand.

This is how you stop guessing. You’ll build a baseline, isolate layers, and walk away with proof: either the network is your ceiling, or it’s just the messenger delivering bad news from somewhere else.

A mental model that survives meetings

To prove a bottleneck, you need to separate throughput ceilings from latency amplification and serialization.

1) 10GbE has a hard ceiling, but your workload hits it differently

On paper, 10GbE is 10 gigabits per second. In practice, best-case TCP payload throughput is usually around 9.4–9.8 Gbit/s depending on framing, offloads, and CPU. That’s about 1.1–1.2 GB/s in file-copy terms if everything is aligned and streaming.

But ZFS workloads aren’t always streaming. Small I/O, metadata chatter, synchronous writes, or a chatty protocol can “spend” throughput on overhead and “spend” time on round trips. 10GbE can look slow when it’s actually fast but forced into stop-and-wait behavior by something upstream.

2) ZFS can be faster than your network, and that’s a trap

A modern ZFS pool with enough spindles or NVMe can outrun 10GbE on sequential reads. Great. That means the network becomes the limiter on big transfers, and you should see a clean plateau: disk isn’t busy, CPU isn’t pegged, and the NIC is near line rate.

If you don’t see a clean plateau, you’re not “network limited.” You’re “something else limited” and the network just happens to be involved.

3) Proving a network bottleneck requires two independent measurements

You’re trying to answer a specific question: Is the network the narrowest point in this end-to-end path?

To prove it, you need:

  • A network-only test (no disks, no filesystems) that hits near line rate.
  • A storage-only test (local to the server) that exceeds what you see over the network.
  • A combined test over the real protocol (NFS/SMB/iSCSI) that matches the network-only ceiling.

If any of those doesn’t line up, you don’t have a neat network bottleneck. You have a layered problem, which is more annoying but also more fixable.

One operational rule worth taping to your forehead: Never tune ZFS for a network problem, and never tune the network for a ZFS problem, until you have isolation tests.

Interesting facts and quick history (so you stop repeating myths)

  • 10GbE wasn’t built for cheap copper first. Early 10GbE deployments were mostly fiber; 10GBASE-T arrived later and was power-hungry compared to SFP+ optics/DAC.
  • Jumbo frames predate 10GbE popularity. They were used to reduce CPU overhead on busy links long before every switch defaulted to “maybe support it.”
  • ZFS was designed with end-to-end data integrity as a first-class feature. Checksums and copy-on-write aren’t “extras”; they’re the point, and they have performance implications.
  • NFS has lived multiple lives. v3 is stateless and common in appliances; v4 adds state, locking, and different failure modes that show up as “performance issues.”
  • SMB performance changed dramatically once multi-channel and modern Linux/Windows stacks matured. Old SMB tuning folklore still gets copy-pasted into environments that don’t match.
  • TCP window scaling made high-throughput, high-latency links feasible. Without it, your “10GbE” would behave like a polite suggestion.
  • Interrupt moderation and offloads are a trade. They can reduce CPU cost per packet, but they can also increase latency or hide driver bugs.
  • ZFS recordsize is not the same as “block size on disk,” but it strongly influences I/O pattern and thus how efficiently network protocols can stream data.
  • Switch buffers are not infinite, and microbursts can drop packets even when average utilization looks fine.

Fast diagnosis playbook (first/second/third)

First: prove the network can do 10GbE between the exact two hosts

  • Run iperf3 both directions with multiple streams.
  • Check NIC negotiated speed, PCIe width, driver, and errors.
  • Look for drops/retransmits. If you see them, stop and fix that before touching ZFS.

Second: prove the ZFS server can read/write faster locally than the client sees

  • Use fio locally against a dataset (and optionally a zvol) with direct I/O and a working set larger than RAM to avoid “it’s just ARC” illusions; older OpenZFS releases quietly accept O_DIRECT without bypassing ARC, so the large working set is what keeps the test honest.
  • Use zpool iostat -v to confirm disks are doing what you think they’re doing.
  • Watch CPU usage and per-device latency; if the pool can’t feed 1+ GB/s, the network is not the primary bottleneck.

Third: test the real protocol (NFS/SMB/iSCSI) and correlate with both sides

  • Run a controlled file transfer or fio over NFS/SMB.
  • Simultaneously capture: NIC stats, ZFS stats, and client-side retransmits.
  • If protocol throughput matches iperf3 ceiling and server disk is underutilized, you’re network-limited (by design, not by accident).

If you do only one thing: run iperf3 and capture retransmits. If the network is sick, it will confess quickly.

Baseline targets for 10GbE + what “good” looks like

Targets vary by OS, NIC, and CPU. But for a single 10GbE link on modern Linux:

  • iperf3 single stream: often 6–9 Gbit/s depending on TCP tuning and latency.
  • iperf3 4–8 parallel streams: should approach 9.4–9.8 Gbit/s on a clean LAN.
  • TCP retransmits: should be near zero on a stable LAN. A few during warm-up can happen; persistent retransmits mean drops, bad optics/cables, duplex/speed negotiation issues, or buffer pressure.
  • NIC errors/drops: should be zero or extremely low. “A little” is not a plan.
  • Server local sequential read (fio): ideally > 1.2 GB/s if you want to saturate 10GbE with headroom; otherwise your pool is the limiter.

Also: don’t ignore the client. Plenty of “NAS is slow” incidents are actually “client has a laptop NIC driver from the Pleistocene.”

Joke #1: A 10GbE link that tops out at 2 Gbit/s is still “10GbE” in the same way a sports car on a flat tire is still “a sports car.”

Practical tasks: commands, outputs, decisions

These are real tasks you can run today. Each includes a command, a sample output, and what decision you make from it. Run them on both the ZFS server and the client when applicable.

Task 1: Verify link speed, duplex, and negotiated state

cr0x@server:~$ sudo ethtool enp65s0
Settings for enp65s0:
        Supported ports: [ FIBRE ]
        Supported link modes:   10000baseT/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: No
        Advertised link modes:  10000baseT/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: No
        Speed: 10000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        Link detected: yes

What it means: If you don’t see 10000Mb/s Full with link detected, stop. You can’t out-tune a bad link.

Decision: If speed isn’t 10Gb, fix cabling/optics/switch config first. If the link has negotiated down to a lower speed (common with marginal 10GBASE-T cabling), fix that before touching anything else.

Task 2: Check PCIe link width/speed (yes, it matters)

cr0x@server:~$ sudo lspci -s 65:00.0 -vv | egrep -i "LnkCap|LnkSta"
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <512ns, L1 <64us
LnkSta: Speed 8GT/s, Width x8

What it means: A 10GbE NIC in a PCIe slot running x1 or at low speed can cap throughput or spike latency under load.

Decision: If width/speed is below expected, move the NIC, change BIOS settings, or stop sharing lanes with a storage HBA if the motherboard is stingy.

Task 3: Check interface error counters and drops

cr0x@server:~$ ip -s link show dev enp65s0
2: enp65s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:aa:bb:cc brd ff:ff:ff:ff:ff:ff
    RX:  bytes  packets  errors  dropped  missed  mcast
    9876543210  1234567  0       0        0       1234
    TX:  bytes  packets  errors  dropped  carrier collsns
    8765432109  2345678  0       0        0       0

What it means: Any non-zero errors/dropped counter during testing is suspicious. “Dropped” can also mean host-side queue overflow, not just loss on the wire.

Decision: If drops climb under load, investigate ring sizes, driver, IRQ affinity, and switch congestion. Don’t touch ZFS yet.
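
If the drops look like host-side queue overflow, the ring buffers are the first thing worth inspecting. This is a minimal sketch using ethtool's -g/-G pair; the 4096 value is an example, not a recommendation, and your hardware maximum may differ.

cr0x@server:~$ sudo ethtool -g enp65s0             # compare "Pre-set maximums" with "Current hardware settings"
cr0x@server:~$ sudo ethtool -G enp65s0 rx 4096     # example only: raise the RX ring toward the hardware maximum

Larger rings trade a little latency for fewer drops under bursts. Change one value, rerun the same load, and keep the before/after counters.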

Task 4: Look for TCP retransmits on Linux (client and server)

cr0x@server:~$ sudo nstat -az | egrep "TcpRetransSegs|TcpOutSegs|TcpInErrs"
TcpInErrs                    0                  0.0
TcpOutSegs                   4863921            0.0
TcpRetransSegs               12                 0.0

What it means: Retransmits should be extremely low on a LAN. Persistent retransmits indicate packet loss or severe reordering.

Decision: If retransmits increase meaningfully during a throughput test, fix network path (cables/optics, switch buffers, LACP hashing, MTU mismatch) before blaming ZFS.
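
Averages hide bursts, so watch retransmits while the load test is actually running. A minimal sketch, assuming the client address used in the iperf3 examples (10.20.30.41); nstat prints deltas between calls, and ss exposes per-connection retransmit counters.

cr0x@server:~$ for i in $(seq 60); do nstat TcpRetransSegs; sleep 1; done   # one-second deltas during the test
cr0x@server:~$ ss -ti dst 10.20.30.41 | grep -o 'retrans:[^ ]*'             # retransmits on live connections to that client

If the counter climbs in lockstep with throughput dips, you have your correlation; go look at the path, not the pool.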

Task 5: Establish raw network throughput with iperf3 (server mode)

cr0x@server:~$ iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 10.20.30.41, port 49822
[  5] local 10.20.30.10 port 5201 connected to 10.20.30.41 port 49824
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  10.8 GBytes  9.27 Gbits/sec

What it means: 9+ Gbit/s is a good sign. If you’re stuck at 3–5 Gbit/s, the network or host stack is limiting.

Decision: If iperf3 can’t go fast, do not waste time tuning NFS/SMB/ZFS. Fix the basic pipe first.

Task 6: Use multiple streams to reach line rate

cr0x@client:~$ iperf3 -c 10.20.30.10 -P 8 -t 15
[SUM]   0.00-15.00  sec  17.1 GBytes  9.80 Gbits/sec                  receiver

What it means: Multi-stream often hides single-flow limitations (window sizing, CPU scheduling, offload quirks).

Decision: If multi-stream hits ~9.5–9.8 Gbit/s but single-stream is low, the network is fine; focus on TCP tuning, CPU, and the application/protocol behavior (single-threaded copy, sync writes, small I/O).
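
The playbook also calls for testing both directions. iperf3's -R flag reverses the flow so the server transmits, which is the path your clients exercise when they read files. A minimal sketch against the same hosts:

cr0x@client:~$ iperf3 -c 10.20.30.10 -P 8 -R -t 15   # server-to-client (the read direction)
cr0x@client:~$ iperf3 -c 10.20.30.10 -t 60           # one long single stream, to expose sawtooth collapses

If the two directions differ sharply, suspect asymmetric problems: offload settings on one side, flow control, or a congested switch port in one direction only.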

Task 7: Validate MTU end-to-end (no fragmentation, no wishful thinking)

cr0x@client:~$ ping -M do -s 8972 10.20.30.10 -c 3
PING 10.20.30.10 (10.20.30.10) 8972(9000) bytes of data.
8972 bytes from 10.20.30.10: icmp_seq=1 ttl=64 time=0.215 ms
8972 bytes from 10.20.30.10: icmp_seq=2 ttl=64 time=0.207 ms
8972 bytes from 10.20.30.10: icmp_seq=3 ttl=64 time=0.209 ms

--- 10.20.30.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2035ms

What it means: If this fails, you don’t have working jumbo frames across the path. Partial jumbo is worse than none.

Decision: Either make MTU consistent on NICs/VLAN/switch ports, or stay at 1500 everywhere and stop arguing about it.
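
It also helps to confirm what MTU each interface is actually configured with before arguing about the path. A quick sketch; the server interface matches the earlier examples, and the client interface name here is a placeholder for whatever your client uses.

cr0x@server:~$ ip -o link show enp65s0 | awk '{print $2, $4, $5}'   # prints the interface name and its MTU
cr0x@client:~$ ip -o link show enp3s0 | awk '{print $2, $4, $5}'

Switch ports and VLAN interfaces in between count too; the ping -M do test above remains the end-to-end proof.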

Task 8: Check NIC offloads and sanity (TSO/GSO/GRO/RX/TX)

cr0x@server:~$ sudo ethtool -k enp65s0 | egrep "tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload|rx-checksumming|tx-checksumming"
rx-checksumming: on
tx-checksumming: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on

What it means: Offloads can improve throughput and reduce CPU, but buggy drivers/firmware can cause drops or weird latency.

Decision: If you see retransmits/drops and suspect offload issues, toggle features one at a time during a controlled test. Don’t shotgun-disable everything “because Reddit.”
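
“Toggle one at a time” in practice looks like the sketch below, with GRO as the example feature. Restore it if the numbers don't move; leaving offloads off without evidence just burns CPU.

cr0x@server:~$ sudo ethtool -K enp65s0 gro off       # disable exactly one feature
cr0x@client:~$ iperf3 -c 10.20.30.10 -P 8 -t 15      # rerun the identical baseline from Task 6
cr0x@server:~$ sudo ethtool -K enp65s0 gro on        # put it back if nothing changed

One change, one retest, one note in the ticket. That's the entire method.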

Task 9: Observe interrupt distribution (a single hot CPU can cap you)

cr0x@server:~$ grep -i enp65s0 /proc/interrupts | head
  72:  1203987   1123340   1189021   1098877   PCI-MSI 524288-edge      enp65s0-TxRx-0
  73:    23411     21987     24010     22022   PCI-MSI 524289-edge      enp65s0-TxRx-1
  74:    11234     10988     12001     11102   PCI-MSI 524290-edge      enp65s0-TxRx-2
  75:     9876      9455     10012      9666   PCI-MSI 524291-edge      enp65s0-TxRx-3

What it means: If one queue/CPU takes almost all interrupts, you’ll hit a CPU ceiling before line rate.

Decision: If distribution is skewed, fix RSS/RPS/XPS, check irqbalance, and ensure multiple TX/RX queues are active.
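
To see how many queues the NIC exposes and is actually using, ethtool's channel commands are the usual tool. A sketch; the count of 8 is an example, bounded by whatever "Pre-set maximums" reports for your hardware.

cr0x@server:~$ sudo ethtool -l enp65s0               # compare "Pre-set maximums" with "Current hardware settings"
cr0x@server:~$ sudo ethtool -L enp65s0 combined 8    # example: enable 8 combined TX/RX queues

After changing queue counts, re-check /proc/interrupts under load. The goal is spread, not a bigger number for its own sake.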

Task 10: Confirm ZFS pool health and basic layout (no, resilvering doesn’t “not matter”)

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 02:31:14 with 0 errors on Sun Dec 22 03:10:18 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0
            sdg                     ONLINE       0     0     0
            sdh                     ONLINE       0     0     0

errors: No known data errors

What it means: If you’re resilvering, scrubbing, or degraded, performance conclusions are contaminated.

Decision: Only do performance proof on a stable pool, or you’ll optimize for an emergency state.

Task 11: Measure server-side disk throughput and latency while clients run

cr0x@server:~$ sudo zpool iostat -v tank 1 5
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        3.12T  21.8T    210    180   1.15G   320M
  raidz2-0                  3.12T  21.8T    210    180   1.15G   320M
    sda                         -      -     26     22    142M    40M
    sdb                         -      -     26     22    144M    39M
    sdc                         -      -     26     22    143M    40M
    sdd                         -      -     26     22    142M    40M
    sde                         -      -     26     22    143M    40M
    sdf                         -      -     26     22    143M    40M
    sdg                         -      -     26     22    144M    41M
    sdh                         -      -     26     22    144M    41M
--------------------------  -----  -----  -----  -----  -----  -----

What it means: If your pool is reading at ~1.15 GB/s and your NIC is doing ~9.5 Gbit/s, the network could be the limit. If pool bandwidth is far below that while the devices show high operation counts, you’re IOPS/latency-limited.

Decision: If the pool can’t locally push beyond what you observe over the network, stop calling it a network issue.
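
The per-device latency mentioned earlier is easiest to see with zpool iostat's latency columns, available in any reasonably current OpenZFS release. A minimal sketch:

cr0x@server:~$ sudo zpool iostat -vl tank 1 5        # adds total_wait and disk_wait columns per vdev and device

One device with wait times far above its siblings usually means a dying or overloaded disk dragging the whole vdev, which is not a network problem no matter how the ticket is worded.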

Task 12: Measure local storage performance with fio (bypass cache illusions)

cr0x@server:~$ sudo fio --name=seqread --directory=/tank/test --rw=read --bs=1M --ioengine=libaio --direct=1 --iodepth=32 --numjobs=1 --size=20G --runtime=30 --time_based
seqread: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
fio-3.33
seqread: (groupid=0, jobs=1): err= 0: pid=28144: Fri Dec 25 10:12:01 2025
  read: IOPS=1342, BW=1342MiB/s (1407MB/s)(40.0GiB/30538msec)
    slat (usec): min=6, max=214, avg=17.32, stdev=7.11
    clat (msec): min=1, max=41, avg=23.76, stdev=4.12

What it means: Local read BW (~1.4 GB/s) exceeds what 10GbE can carry. Great: storage can feed the link.

Decision: If local BW is well above 1.2 GB/s but network file reads are far lower, the bottleneck is network/protocol/CPU, not disks.
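
To close the three-measurement loop, run the same fio job from the client against the NFS mount and compare all three numbers. A sketch using the client-side mount point shown in Task 16 (/tank/data); adjust the path to wherever your share is actually mounted.

cr0x@client:~$ fio --name=protoread --directory=/tank/data --rw=read --bs=1M --ioengine=libaio --direct=1 --iodepth=32 --numjobs=1 --size=20G --runtime=30 --time_based

If this lands near the iperf3 ceiling while local fio was higher, that's your network-limited proof. If it lands well below both, the gap lives in the protocol, the CPU, or the client's behavior.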

Task 13: Measure local synchronous write sensitivity (SLOG truth serum)

cr0x@server:~$ sudo fio --name=syncwrite --directory=/tank/test --rw=write --bs=16K --ioengine=libaio --direct=1 --iodepth=1 --numjobs=1 --size=5G --runtime=20 --time_based --fsync=1
syncwrite: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=1
fio-3.33
syncwrite: (groupid=0, jobs=1): err= 0: pid=28410: Fri Dec 25 10:14:44 2025
  write: IOPS=820, BW=12.8MiB/s (13.4MB/s)(256MiB/20008msec)
    clat (usec): min=340, max=8241, avg=1169.55, stdev=422.12

What it means: Sync writes can be brutally limited by latency. Throughput looks “bad” because each write waits for durability.

Decision: If your app/protocol is sync-heavy (databases, VM images, NFS with sync), you won’t saturate 10GbE without addressing sync latency (SLOG, settings, workload design).
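
Before concluding you need an SLOG, check whether one already exists and what the dataset's sync policy is. A quick sketch:

cr0x@server:~$ sudo zpool status tank | grep -A 3 -i logs    # a "logs" section appears only if a log vdev (SLOG) is attached
cr0x@server:~$ sudo zfs get sync,logbias tank/data           # sync=standard honors client requests; always/disabled change semantics

Do not set sync=disabled to win a benchmark. You'd be trading durability for a number, and the number will not testify on your behalf after a power loss.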

Task 14: Check ZFS dataset settings that influence streaming vs chatter

cr0x@server:~$ sudo zfs get -o name,property,value -s local,default recordsize,atime,compression,sync tank/data
NAME       PROPERTY     VALUE
tank/data  recordsize   1M
tank/data  atime        off
tank/data  compression  lz4
tank/data  sync         standard

What it means: recordsize influences sequential efficiency; atime can add extra writes; compression can increase apparent throughput (if compressible) or cost CPU.

Decision: Don’t touch these until the network baseline is proven. Then tune to the workload: large files want a larger recordsize; small random I/O wants a smaller one, and possibly a special vdev for metadata and small blocks.

Task 15: Observe CPU saturation and softirq pressure during transfer

cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.1.0 (server)  12/25/2025  _x86_64_  (32 CPU)

12:16:11 AM  CPU   %usr %nice  %sys %iowait  %irq %soft  %steal  %idle
12:16:12 AM  all   12.4  0.0   18.1   0.7    0.0  22.8    0.0   46.0
12:16:12 AM   7   10.0  0.0   21.0   0.0    0.0  65.0    0.0    4.0
12:16:12 AM  15   11.0  0.0   20.0   0.0    0.0  62.0    0.0    7.0

What it means: If one or two CPUs are nearly pegged in %soft, you’re network-stack limited, not link-speed limited. This looks like “10GbE is slow” but is actually “host can’t push packets fast enough.”

Decision: Fix IRQ/RSS, consider faster CPU, better NIC/driver, or adjust protocol behavior (fewer small ops, larger I/O).

Task 16: Validate NFS mount and server export behavior (client-side)

cr0x@client:~$ nfsstat -m
/tank/data from 10.20.30.10:/tank/data
 Flags: rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.20.30.41,local_lock=none

What it means: Small rsize/wsize or wrong version can kneecap throughput. v4.1 with 1M rsize/wsize is generally sane for big streaming.

Decision: If rsize/wsize are tiny, fix mount options or server config. If using UDP (rare now), stop.
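
The server side of the same conversation is worth one command too: exportfs shows the options actually in effect, which sometimes differ from what someone believes is in /etc/exports. A minimal sketch:

cr0x@server:~$ sudo exportfs -v      # each export with its effective options (sync/async, wdelay, sec, squash settings)

Pay attention to sync vs async on the export, and to the nfsd thread count if many clients hammer the server at once.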

Task 17: Observe NFS client retrans and RPC behavior

cr0x@client:~$ nfsstat -rc
Client rpc stats:
calls      retrans    authrefrsh
987654     12         0

Client nfs v4:
null         read         write        commit       open
0            120034       60012        0            320

What it means: Rising retrans correlates with network loss or server overload. It kills throughput because RPC waits.

Decision: If retrans grows during load, investigate network drops or server CPU contention; don’t “increase timeo” and call it solved.
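
For a per-mount view on the client, nfsiostat (from nfs-utils) splits reads and writes and reports RPC timing per interval. A sketch using the mount shown in Task 16:

cr0x@client:~$ nfsiostat 1 5 /tank/data      # per-interval kB/s, ops/s, avg RTT and avg exe per operation type

Rising avg RTT points at the network or the server's response time; avg exe much larger than RTT usually means requests are queuing on the client itself.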

Task 18: SMB server/client: verify multi-channel and throughput behavior

cr0x@server:~$ sudo smbstatus -b
Samba version 4.18.6
PID     Username     Group        Machine                                   Protocol Version  Encryption           Signing
-----------------------------------------------------------------------------------------------------------------------------
12011   alice        staff        10.20.30.41 (ipv4:10.20.30.41:52144)      SMB3_11           -                    partial(AES-128-GMAC)

Service      pid     Machine       Connected at                     Encryption   Signing
------------------------------------------------------------------------------------------
data         12011   10.20.30.41   Fri Dec 25 00:18:12 2025        -            partial(AES-128-GMAC)

What it means: SMB signing/encryption can cost CPU; multi-channel may or may not be in use depending on client and server config.

Decision: If CPU is the limiter and SMB signing/encryption is enabled unnecessarily for a trusted LAN, decide with security whether to adjust. If you need it, plan for CPU headroom.
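
Whether multi-channel is even enabled on the Samba side is a one-line check; the parameter name below is Samba's, and its default depends on your version, so treat this as a sketch rather than a verdict.

cr0x@server:~$ testparm -s 2>/dev/null | grep -i "server multi channel support"

If nothing prints, your version's default applies; check the smb.conf man page for that version instead of assuming. On a Windows client, Get-SmbMultichannelConnection shows whether multiple channels are actually in use.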

Task 19: Track per-protocol throughput vs NIC throughput (server-side)

cr0x@server:~$ sar -n DEV 1 3
Linux 6.1.0 (server)  12/25/2025  _x86_64_  (32 CPU)

00:20:01     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
00:20:02   enp65s0   82000     79000     1150000   180000      0         0       120
00:20:03   enp65s0   83000     80000     1165000   175000      0         0       130

What it means: ~1,150,000 kB/s is ~1.15 GB/s receive. That’s close to the practical ceiling of 10GbE for payload.

Decision: If NIC is near ceiling but users still complain, you’re likely saturating the link or hitting unfair sharing among clients. Consider LACP (carefully), faster links, or QoS—not ZFS recordsize rituals.

Task 20: Check ZFS ARC and memory pressure (because “cache” hides problems)

cr0x@server:~$ sudo arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
00:21:10   951    21      2     8   38    13   62     0    0   96G   112G
00:21:11  1002    24      2     9   37    15   63     0    0   96G   112G

What it means: A low miss% during reads suggests ARC is serving a lot. Your “network throughput test” might be a RAM test.

Decision: For proof, use direct I/O tests or large working sets. If ARC is doing the work, that’s fine operationally, but don’t misattribute it to disk performance.
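
To size an honest working set, it helps to know how big ARC is allowed to grow. On Linux the limit and the current size are both readable directly; a sketch:

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max                    # 0 means the built-in default (a large fraction of RAM)
cr0x@server:~$ awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats    # current ARC size in bytes

Pick fio file sizes several times larger than this, or accept that you're benchmarking RAM with extra steps.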

That’s more than a dozen tasks; use the ones that match your stack. The point is not to collect trivia. The point is to make a decision after each command.

Interpreting results: proving the bottleneck

What “network bottleneck” looks like when it’s real

You can call the network the bottleneck when these are simultaneously true:

  • iperf3 between the same hosts reaches ~9.4–9.8 Gbit/s with low retransmits.
  • Local server storage can exceed 1.2 GB/s for your access pattern (or at least exceed what you’re getting over NFS/SMB).
  • During real file access, the NIC is near ceiling while disks are not saturated and CPU isn’t choking in softirq.
  • Protocol-specific stats (NFS retrans, SMB credit stalls, iSCSI retrans) do not show pathologies.

In that case, congratulations: your system is performing normally. You’ve hit physics and standards. The fix is capacity: more links, faster links, or better distribution across multiple NICs/clients.

What “not network bottleneck” looks like

Most of the time, “10GbE is slow” is one of these:

Case A: Packet loss and retransmits under load

iperf3 may look okay for short bursts, but sustained transfers show drops, retransmits, and throughput collapse. You’ll see increasing TcpRetransSegs, NIC drops, or switch port discards. This is a network problem—even if the link is negotiated at 10Gb.

Common culprits: bad optics/DAC, marginal 10GBASE-T cable, MTU mismatch causing fragmentation/blackholes, switch buffer exhaustion, and oversubscribed uplinks.

Case B: CPU/interrupt bottleneck on the server or client

NIC is at 10Gb, but your host isn’t. High %soft on one CPU, uneven interrupts, or a single-threaded copy loop can cap throughput at 3–6 Gbit/s. This is painfully common on older CPUs or with unfortunate driver defaults.

Case C: Protocol/workload mismatch (small I/O over a chatty protocol)

10GbE shines on large sequential I/O. If your workload is lots of small random reads, metadata ops, or synchronous writes, your throughput will be set by IOPS and latency, not bandwidth. That’s not a network bottleneck. That’s a workload reality.

Case D: ZFS sync write latency (and the “SLOG will save us” mythology)

If you’re doing sync writes, throughput can be tiny while the network sits idle. You’ll see low MB/s, but high operation counts and elevated latency. Without a proper SLOG device (and the right expectation), you can’t buy your way out with MTU changes.

A quote you should keep in your incident channel

“Hope is not a strategy.” — a maxim of uncertain origin that gets repeated in engineering and operations circles

Whether or not you’ve heard it in a postmortem, the point stands: measure, isolate, decide.

Joke #2: If you ‘fix’ 10GbE performance by disabling checksums, you didn’t tune a system—you committed a misdemeanor against reality.

Three corporate mini-stories (painful, real, useful)

Mini-story #1: The outage caused by a wrong assumption

The team had a new ZFS-based NAS serving VM images over NFS to a small cluster. The storage tests looked strong: local reads were well over 1 GB/s, latency seemed fine, everyone high-fived. Then Monday morning hit, and VM boots staggered like a zombie movie.

The immediate assumption was classic: “NFS is slow” and “ZFS needs tuning.” Someone proposed changing recordsize, disabling atime everywhere (fine, but irrelevant), and playing with sync. The network team said the 10GbE links were “up,” so it couldn’t be them.

One SRE ran iperf3 and got 9.7 Gbit/s for 10 seconds. Case closed, right? Except the pain happened during sustained load. They re-ran iperf3 for 10 minutes with multiple streams and watched throughput sawtooth. Retransmits climbed. Switch port counters showed intermittent discards.

Root cause: the NAS and the hypervisors were connected through a leaf switch pair with an uplink that was quietly oversubscribed during backups. The link was technically “10GbE” and “healthy,” but not available when needed. The wrong assumption was equating negotiated speed with delivered capacity.

Fix: move backup traffic, add capacity where contention existed, and add basic alerting on switch discards and host retransmits. ZFS settings were untouched. Performance issues vanished like they were never real.

Mini-story #2: The optimization that backfired

A different company had a ZFS server backing a creative team’s shared media storage over SMB. They wanted faster transfers of large video files. Someone suggested jumbo frames and pushed MTU 9000 on the server and a few desktops. The switch “supported jumbo,” so it must be fine.

For a day, it looked better. Then the tickets started: random stalls, transfers that froze at 95%, sometimes a file copy would restart from scratch. Wireshark traces showed TCP retransmissions and occasional ICMP “fragmentation needed” messages that weren’t consistently delivered.

The problem wasn’t jumbo frames as a concept. It was jumbo frames as a partially deployed religion. One switch segment had MTU 9000, another was stuck at 1500 due to an older interconnect, and a firewall in the path treated large frames like suspicious luggage.

The “optimization” increased the probability of hard-to-diagnose blackholing and made the system less reliable. Throughput averages improved in a narrow case, but tail latency and transfer failure rate got worse—which is how real users experience performance.

Fix: either deploy jumbo frames end-to-end with verification (including every hop and VLAN), or revert to 1500 and focus on protocol and disk layout. They reverted, then improved SMB settings and client concurrency. Net result: slightly lower peak throughput, dramatically fewer stalls. Users stopped complaining, which is the only benchmark that matters.

Mini-story #3: The boring but correct practice that saved the day

A financial-services shop ran a ZFS appliance for nightly data loads. Nothing fancy: NFS, a few 10GbE links, and a strict change process. Their secret weapon wasn’t a magical sysctl. It was dull discipline.

They kept a quarterly “performance baseline” runbook: iperf3 between known hosts, a local fio sequential read/write, and a protocol-level test that mimicked production (same mounts, same credentials, same directory depth). They archived the outputs in a ticketing system. It was as exciting as watching paint dry. On purpose.

One quarter, the iperf3 baseline dropped from near line rate to about 6 Gbit/s. No one had complained yet. They treated it like a fire anyway, because baselines don’t lie without help.

The investigation found a BIOS update had reset PCIe power management defaults and nudged the NIC into a less optimal state under load. The change didn’t break connectivity; it just quietly reduced throughput. Because they had baseline outputs, they could prove the regression, isolate it, and roll forward with correct settings.

That’s the lesson: boring tests, run regularly, detect problems before your users do. “But it worked last month” becomes “we have a diff.” That’s operations adulthood.

Common mistakes: symptom → root cause → fix

1) Symptom: “We only get 3–5 Gbit/s on 10GbE”

Root cause: Single TCP flow limited (window sizing, CPU, interrupt affinity), or client can’t keep up.

Fix: Validate with iperf3 -P 8. If multi-stream is fast, adjust workload to use parallelism (multiple copy threads), fix IRQ/RSS, or improve CPU/NIC.

2) Symptom: Throughput spikes then collapses every few seconds

Root cause: Packet loss due to microbursts, switch buffer exhaustion, or oversubscribed uplinks; TCP backs off hard.

Fix: Check switch discards and host retransmits. Reduce contention, add bandwidth, or apply QoS where appropriate. Don’t touch ZFS first.

3) Symptom: Large sequential reads are fast, writes are painfully slow

Root cause: Sync writes (NFS/VM/databases) limited by ZIL latency; no suitable SLOG; or SLOG exists but is not power-loss safe.

Fix: Confirm with fio sync tests and workload characteristics. Add proper SLOG only if your workload benefits, and choose a device designed for low latency and power-loss protection.

4) Symptom: “Jumbo frames improved one client but broke others”

Root cause: MTU mismatch along the path, or mixed MTU domains without proper routing/PMTUD support.

Fix: Either deploy MTU end-to-end with validation using ping -M do at the target size, or standardize on 1500.

5) Symptom: SMB copy speed is low and CPU is high

Root cause: SMB signing/encryption overhead, single-threaded client behavior, or server CPU saturated in kernel/network stack.

Fix: Measure CPU and SMB settings. Decide with security whether to relax signing/encryption on trusted networks; otherwise provision CPU and consider protocol changes.

6) Symptom: NFS stalls and “server not responding” messages appear

Root cause: Network drops/retrans, server overloaded, or storage latency spikes causing RPC timeouts.

Fix: Check NFS retrans, TCP retrans, and server latency. Increase timeo only after you’ve fixed the cause, not as a sedative.

7) Symptom: Great performance in synthetic tests, bad in production

Root cause: Benchmarks hit ARC, bypass metadata, or use sequential I/O unlike real workload.

Fix: Test with working set larger than RAM, use --direct=1 for fio, and mimic I/O size and concurrency.

8) Symptom: LACP added, but a single client still can’t exceed ~10 Gbit/s

Root cause: LACP doesn’t increase throughput for a single flow; hashing pins flows to one member link.

Fix: Use multiple sessions/clients, SMB multichannel, NFS sessions, or move to faster single links (25/40/100GbE) if single-host throughput is required.

Checklists / step-by-step plan

Step-by-step plan to prove (not guess) the bottleneck

  1. Freeze the battlefield: ensure no resilver/scrub, no major background jobs, and test during a controlled window.
  2. Record topology: exact client ↔ switch ports ↔ VLAN ↔ server port path. If you can’t draw it, you can’t debug it.
  3. Validate link and PCIe: ethtool, lspci, error counters, drops.
  4. Run iperf3 baseline: single stream, then multi-stream, then reverse direction. Capture retransmits before and after.
  5. Validate MTU policy: either 1500 everywhere or jumbo everywhere. Prove with ping -M do.
  6. Measure local storage: fio sequential read/write with --direct=1; optionally sync write test if relevant.
  7. Measure real protocol: NFS/SMB/iSCSI with a representative test. Keep it repeatable.
  8. Correlate during the test: server zpool iostat, NIC throughput (sar), CPU softirq (mpstat), client retrans (nstat); a capture sketch follows this list.
  9. Make the call: choose the narrowest point supported by data. Write it down with the outputs attached.
  10. Change one thing: rerun the same test. If you changed three things, you learned nothing.
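
Step 8 is much easier when the collectors start and stop together, so the timestamps line up. This is a minimal server-side sketch, not a polished tool; the pool name and duration are examples, and it assumes root plus the sysstat and ZFS utilities used throughout this article.

#!/bin/bash
# capture.sh: run correlated collectors for one test window, then exit.
# Start it on the ZFS server, then immediately kick off the client-side test.
DUR=120                                      # seconds; match your test length
OUT=/tmp/baseline-$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUT"

sar -n DEV 1 "$DUR"           > "$OUT/sar-dev.log" &    # NIC throughput per second
mpstat -P ALL 1 "$DUR"        > "$OUT/mpstat.log"  &    # CPU, including %soft
zpool iostat -v tank 1 "$DUR" > "$OUT/zpool.log"   &    # pool and per-device bandwidth/ops
( for i in $(seq "$DUR"); do nstat TcpRetransSegs; sleep 1; done ) > "$OUT/retrans.log" &    # per-second retransmit deltas

wait
echo "Collector outputs are in $OUT; attach them to the ticket."

A similar wrapper on the client (iperf3 or fio plus nstat) gives you both ends of the same window, which is exactly what step 9 needs.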

Checklist: signs you are truly network-limited (the “stop tuning ZFS” checklist)

  • iperf3 stable near line rate with low retransmits
  • NIC throughput near 1.1–1.2 GB/s during file reads/writes
  • ZFS pool not saturated (disk bandwidth and ops not maxed)
  • CPU not pegged in softirq or a single core
  • Protocol stats clean (no NFS retrans spike, no SMB stalls)

Checklist: signs your bottleneck is not the link, but the host/protocol

  • iperf3 fast with multiple streams but slow single stream
  • high %soft or skewed interrupts during transfers
  • low NIC utilization while users wait
  • sync-heavy workload with low MB/s but high op latency
  • high packet rate (pps) with small I/O and metadata ops

FAQ

1) What throughput should I expect from 10GbE for a big file copy?

On a clean LAN, expect roughly 1.0–1.2 GB/s payload in the best case. If you’re seeing ~800–1100 MB/s on large sequential transfers, you’re probably fine.

2) Why does iperf3 show 9.8 Gbit/s but SMB copy is only 400 MB/s?

Because the network pipe is healthy, and something above it isn’t: SMB signing/encryption CPU cost, small I/O pattern, client single-threading, server CPU softirq, or ZFS sync behavior. Measure CPU and protocol stats during the copy.

3) Do jumbo frames always help ZFS NAS performance?

No. They can reduce CPU overhead for high packet rates, but they also create failure modes if the path isn’t consistently configured. If you can’t prove MTU end-to-end, use 1500 and move on.

4) Should I disable NIC offloads to “fix performance”?

Only if you have evidence they’re causing drops/retransmits or latency spikes with your specific driver/firmware. Toggle one feature at a time and retest. Blanket disabling often increases CPU use and reduces throughput.

5) Is LACP the answer to saturating 10GbE?

LACP increases aggregate bandwidth across multiple flows, not usually a single flow. If one client needs >10Gb/s, you want a faster single link (25/40/100GbE) or a protocol/client that uses multiple channels.

6) Can ZFS compression improve network throughput?

Sometimes, yes. If your data compresses well, the server sends fewer bytes over the wire, so effective throughput increases. But you pay CPU. Measure both throughput and CPU before declaring victory.

7) How do I know if I’m limited by sync writes?

If write throughput is low, latency is high, and tests with fsync or database workloads behave similarly, you’re likely gated by durability latency. Adding a proper SLOG can help, but only for sync writes and only if the device is appropriate.

8) Why does performance vary by time of day?

Usually contention: oversubscribed switch uplinks, backup windows, replication, or noisy neighbors on shared infrastructure. Prove it by correlating throughput drops with retransmits and switch port discards during those windows.

9) How do I keep tests honest when ARC makes everything look amazing?

Use working sets larger than RAM, use fio --direct=1, and repeat runs after cache warm-up with controlled file sizes. Don’t benchmark memory and call it storage.

10) If I’m truly network-limited, what are my best upgrade paths?

Add bandwidth (25GbE is a common step), add parallelism (multiple NICs/clients), or split workloads across interfaces. The right choice depends on whether you need single-client speed or aggregate multi-client capacity.

Next steps you can do this week

  1. Run and save three baselines: iperf3 (single + multi-stream), local fio sequential read, and one protocol-level test (NFS/SMB) that matches production.
  2. Add two alerts: host TCP retransmits (client/server) and switch port discards/drops on the NAS-facing ports.
  3. Decide your MTU policy: 1500 everywhere or jumbo everywhere. Document it and enforce it.
  4. Pick one bottleneck to fix: packet loss, CPU softirq, sync write latency, or real link saturation. Do not try to fix all of them with one magical sysctl.
  5. Write the proof down: paste command outputs into your ticket. Future-you will thank present-you when someone says “it’s always been like that.”

If you do the isolation tests and the numbers still don’t make sense, that’s not failure. That’s data telling you the real system is more interesting than the diagram. Welcome to production.
