ZFS iSCSI: ZVOL Setup That Doesn’t Stutter Under Load


The worst kind of storage problem is the one that “usually works.” The VM boots. The database passes its smoke tests.
Then someone runs a real workload at 10:14 on a Tuesday and your iSCSI latency graph turns into modern art.
The symptom is a stutter: periodic stalls, long tail latencies, and users describing it as “slowness” like that’s a real metric.

If you’re exporting ZFS ZVOLs over iSCSI, you’re holding a chain of sharp knives: ZFS transaction groups, sync semantics,
iSCSI queueing, multipath, and guest OS caching. Get the geometry wrong and you’ll spend your week blaming “the network”
while ZFS politely waits for your tiny SLOG to stop melting.

A mental model: where the stutter comes from

“Stutter” under load is rarely raw throughput. It’s variance. Your median I/O looks fine, but your 99.9th percentile gets ugly:
the guest pauses, the database checkpoint slows, the hypervisor panics and retries, and the app team files a ticket titled
“Storage randomly freezes.”

With ZFS ZVOLs exported via iSCSI, the stutter usually comes from one of these choke points:

  • Synchronous write handling: the client issues a write with FUA/flush semantics; ZFS must commit safely. Without a proper SLOG, latency spikes.
  • Transaction group (TXG) behavior: ZFS batches writes and periodically syncs them. Mis-sized dirty data limits or a slow pool can make TXG sync time balloon.
  • Block size mismatch: the client does 4K random writes but your ZVOL volblocksize is 128K, forcing read-modify-write amplification.
  • Queueing and backpressure: iSCSI initiator queue depth, target queueing, and ZFS vdev queues can create “sawtooth” latency.
  • Latency outliers from the physical layer: one dying SSD, one misbehaving HBA, one path flapping in multipath. ZFS will tell you, if you ask the right question.

The practical goal: predictable latency under sustained load, even if peak throughput drops a little.
In production, boring and consistent wins.

Interesting facts and context (so you stop repeating history)

  1. ZFS was born at Sun in the mid-2000s as an end-to-end storage stack: checksums, pooled storage, snapshots, and copy-on-write. It’s not “a filesystem on top of RAID” so much as “storage that refuses to lie.”
  2. Copy-on-write makes snapshots cheap, but it also means overwrites become new writes, and fragmentation is a thing. Block workloads that do lots of random overwrites can “age” a pool faster than you expect.
  3. ZVOLs are not files. A ZVOL is a block device backed by ZFS. That distinction matters: ZVOLs don’t have recordsize; they have volblocksize, and it’s fixed once the volume is created.
  4. iSCSI dates back to the late 1990s as “SCSI over TCP.” It won because Ethernet won, not because it’s charming. It’s also why “storage performance” can be impacted by tiny TCP behaviors.
  5. The ZFS Intent Log (ZIL) exists to make synchronous writes durable. The separate log device (SLOG) is not a write cache; it’s a low-latency landing pad for sync write intent.
  6. Early ZFS guidance was shaped by spinning disks and big sequential writes. Modern SSD/NVMe pools shift the bottleneck from seek time to latency amplification and queueing behavior.
  7. 4K sectors changed everything (Advanced Format). If you misalign ashift, you can accidentally turn a 4K write into a read-modify-write at the drive level. That’s a tax you pay forever.
  8. Compression used to be feared on block workloads. Modern LZ4 is fast enough that enabling compression often reduces I/O and improves latency, especially for VM images and databases with repetitive patterns.
  9. Multipath was originally about redundancy but is now also about performance. Get it wrong and you can load-balance yourself into packet loss and timeouts.

Design choices that matter (and the ones that don’t)

Choose ZVOL vs dataset like you mean it

If the client expects a block device (hypervisor datastore, Windows volume, clustered database that wants raw disks), use a ZVOL.
If you can present files (NFS, SMB) and you control the application’s I/O patterns, a dataset is often easier to tune and observe.
But this piece is about ZVOLs, so we’ll stay on that battlefield.

Volblocksize: the “format” decision you don’t get to redo lightly

volblocksize is the internal block size ZFS uses for the ZVOL. It strongly influences write amplification and metadata churn.
Smaller blocks fit random I/O better. Larger blocks reduce metadata overhead for sequential workloads but punish small random writes.

  • VM boot/system disks: 8K or 16K is a safe start. 8K often wins for mixed random reads/writes. 16K can reduce overhead if the workload isn’t too chatty.
  • Databases (OLTP-ish): 8K if the DB page size is 8K (common). Match reality. Don’t fight physics.
  • Big sequential (backup targets, media): 64K–128K can be appropriate, but don’t export that as a general-purpose VM datastore unless you enjoy mystery stalls.

You can change volblocksize only by recreating the ZVOL and migrating data. That is not a weekend project if you’re already in production.

Ashift: decide once, suffer forever

ashift is the sector size exponent ZFS uses for a vdev, fixed when the vdev is created (e.g., 12 means 2^12 = 4096 bytes). If your devices use 4K sectors (they almost certainly do), set ashift=12.
That covers most SSDs; for devices that prefer 8K writes, ashift=13 may be worth it.
The key: don’t let “auto-detect” guess wrong and then discover it after your pool has data.
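
A minimal creation sketch, assuming a fresh pool named tank and four placeholder /dev/disk/by-id names (substitute your real device IDs):

cr0x@server:~$ sudo zpool create -o ashift=12 tank \
    mirror /dev/disk/by-id/nvme-EXAMPLE0 /dev/disk/by-id/nvme-EXAMPLE1 \
    mirror /dev/disk/by-id/nvme-EXAMPLE2 /dev/disk/by-id/nvme-EXAMPLE3

Setting ashift explicitly at creation removes the guesswork; it is fixed per vdev and cannot be corrected later without rebuilding.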

Sync writes: understand what the client is asking you to guarantee

Synchronous writes are the part where the storage system promises the data won’t vanish if power dies.
On iSCSI, guests and filesystems issue flushes and FUA writes more than people think—especially virtualized stacks.

ZFS can handle sync writes without a SLOG by writing to the main pool, but latency will track your slowest vdev’s stable write latency.
If you care about tail latency, you probably want a dedicated SLOG on low-latency power-loss-protected (PLP) SSD/NVMe.

SLOG: not “more cache,” but “less waiting”

A SLOG accelerates acknowledgement of sync writes by quickly persisting intent. Later, ZFS flushes the transaction group to the main pool.
If your sync workload is heavy, the SLOG needs:

  • Low latency under sustained writes (not just “fast sequential”).
  • Power loss protection (PLP); without it, acknowledged sync writes can vanish when power does.
  • Enough endurance for the write rate (ZIL writes can be brutal).

Joke #1: Buying a “gaming NVMe” for SLOG is like hiring a sprinter as a night watchman—fast, yes, but asleep when the power goes out.
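
If the pool already exists, a log vdev can be added later without downtime. A minimal sketch, assuming two PLP NVMe devices with placeholder names:

cr0x@server:~$ sudo zpool add tank log mirror /dev/disk/by-id/nvme-PLP0 /dev/disk/by-id/nvme-PLP1

Verify with zpool status afterwards; the devices should appear under a separate logs section, as in Task 7 below.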

Compression: turn it on unless you have a strong reason not to

Use compression=lz4 for most ZVOL workloads. Even if the data doesn’t compress much, the overhead is small on modern CPUs.
The payoff is often fewer bytes hitting disk and fewer I/Os, which can lower latency.
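
A quick sketch of enabling and sanity-checking it on an existing ZVOL (property names are standard; the ratio shown is illustrative):

cr0x@server:~$ sudo zfs set compression=lz4 tank/zvol/vmstore
cr0x@server:~$ zfs get -o name,property,value compression,compressratio tank/zvol/vmstore
NAME               PROPERTY       VALUE
tank/zvol/vmstore  compression    lz4
tank/zvol/vmstore  compressratio  1.54x

Note that compression only applies to blocks written after you enable it; existing data stays as-is until rewritten.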

Dedup: don’t

Dedup on ZVOLs is a classic “it looked good on a slide” feature. It increases memory pressure and can amplify latency.
Unless you’ve modeled it, benchmarked it, and can afford the RAM and the operational complexity, don’t.

Special vdevs and metadata: useful, but easy to misuse

Special allocation class devices can accelerate metadata and small blocks.
On ZVOL-heavy systems, this can help certain patterns, but it’s also a new reliability domain.
If you lose a special vdev that contains metadata, you can lose the pool unless it’s properly mirrored.
Treat it like core storage, not a “bonus SSD.”

Network: 10/25/40/100GbE doesn’t fix latency

iSCSI is sensitive to loss, buffer pressure, and path instability. A fast link with microbursts can still stutter.
Your goal is consistent latency, not just big numbers in iperf.
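
One cheap sanity check before any deeper tuning: confirm that the MTU you think you configured actually survives end-to-end. A sketch assuming jumbo frames (MTU 9000) and the portal address used later in this article:

cr0x@server:~$ ping -c 3 -M do -s 8972 10.0.10.10

The payload is 8972 bytes because 9000 minus 20 bytes of IP header and 8 bytes of ICMP header must fit in a single frame; -M do forbids fragmentation. If this fails while normal pings succeed, something in the path is dropping or fragmenting jumbo frames, and iSCSI will stutter in ways that look like a storage problem.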

Build a ZVOL iSCSI target that stays smooth

Baseline assumptions

The examples assume a Linux-based ZFS host running OpenZFS, exporting block LUNs over iSCSI using LIO (targetcli).
The client is a Linux initiator, but we’ll call out Windows/ESXi gotchas.
Adjust to your platform, but keep the principles: align geometry, respect sync, and measure latency where it’s created.

Pool layout: mirrors beat RAIDZ for random-write latency

For VM/iSCSI block workloads, mirrors usually beat RAIDZ on latency and IOPS consistency. RAIDZ can be fine for mostly sequential
or read-heavy workloads, but a RAIDZ vdev delivers roughly the random IOPS of a single disk, small writes carry parity and padding overhead, and latency gets worse during rebuilds.

If you must use RAIDZ for capacity reasons, budget for slower sync writes and more pronounced tail latency during scrubs/resilvers.
You can still run production on it; you just don’t get to act surprised.

Create the ZVOL with sane properties

A good starting set for a general VM datastore-style ZVOL:

  • volblocksize=8K or 16K depending on workload
  • compression=lz4
  • sync=standard (do not set disabled to “fix” latency unless you like data loss)
  • logbias=latency for sync-heavy workloads (common with VM storage)

Also decide whether you want thin provisioning. ZVOLs can be sparse (thin) or thick. Thin is convenient. Thin also lets people
oversubscribe until the pool hits 100% and everything becomes a slow-motion incident.
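
A sketch of both options, using a hypothetical ZVOL name; -s creates a sparse (thin) volume, and refreservation=auto reserves the full logical size to make it effectively thick:

cr0x@server:~$ sudo zfs create -s -V 4T -b 8K -o compression=lz4 tank/zvol/vmstore-thin
cr0x@server:~$ sudo zfs set refreservation=auto tank/zvol/vmstore-thin

If you stay thin, pair it with pool-capacity alerting; the reservation is what turns “the pool filled up” from an outage into a rejected provisioning request.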

Export it via iSCSI with predictable queueing

With LIO, you’ll typically:

  1. Create a backstore pointing at the ZVOL
  2. Create an iSCSI target IQN and portal
  3. Create a LUN mapping and ACLs
  4. Optionally tune sessions, error recovery, and timeouts
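
A minimal targetcli sketch of those four steps, assuming the ZVOL, IQNs, and portal address used in the tasks below; treat it as a skeleton, not a hardened configuration:

cr0x@server:~$ sudo targetcli /backstores/block create name=vmstore2 dev=/dev/zvol/tank/zvol/vmstore2
cr0x@server:~$ sudo targetcli /iscsi create iqn.2025-12.lab:storage.tank
cr0x@server:~$ sudo targetcli /iscsi/iqn.2025-12.lab:storage.tank/tpg1/portals create 10.0.10.10 3260
cr0x@server:~$ sudo targetcli /iscsi/iqn.2025-12.lab:storage.tank/tpg1/luns create /backstores/block/vmstore2
cr0x@server:~$ sudo targetcli /iscsi/iqn.2025-12.lab:storage.tank/tpg1/acls create iqn.2025-12.lab:client01
cr0x@server:~$ sudo targetcli saveconfig

Depending on your targetcli defaults, creating the target may also auto-add a 0.0.0.0:3260 portal; remove it if you want traffic pinned to specific interfaces. And don’t skip saveconfig, or the whole target evaporates on reboot.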

Where the stutter hides: excessive outstanding I/O can make latency look like “periodic freezing” when the queue drains.
Under-tuning can cap throughput, but over-tuning causes tail latency spikes and timeouts. You want enough queue depth to keep disks busy,
not enough to build an I/O traffic jam.

Client-side multipath: boring, specific, and mandatory for production

Multipath solves two problems: failover and load distribution. It also creates a whole new failure mode: path flapping, where the client
thrashes between paths and your storage looks like it’s “dropping packets” when it’s actually being yo-yo’d by policy.

Use two physically separate paths if you claim redundancy. Separate NICs, separate switches if possible, separate everything. Otherwise, it’s theater.
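
A minimal /etc/multipath.conf sketch for a LIO-backed LUN; the vendor/product match and the policy values are assumptions to check against your distro’s built-in defaults (multipath -t prints them):

defaults {
    user_friendly_names yes
    find_multipaths     yes
}
devices {
    device {
        vendor                "LIO-ORG"
        product               ".*"
        path_grouping_policy  failover
        path_selector         "service-time 0"
        path_checker          tur
        failback              immediate
        no_path_retry         12
    }
}

failover keeps one active path and holds the other in reserve, which matches the two-group layout shown in Task 12; switch to multibus only if you’ve verified the target and network handle spreading I/O across paths cleanly. no_path_retry 12 queues I/O for roughly a minute (12 retries at the default 5-second polling interval) before failing it upward instead of hanging forever.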

One quote that actually belongs in ops

“Hope is not a strategy.” — a traditional SRE saying, used as the epigraph to the first chapter of Google’s Site Reliability Engineering book

Practical tasks: commands, outputs, and decisions

You can’t tune what you can’t observe. Below are hands-on tasks you can run on the ZFS/iSCSI target and initiator.
Each task includes: the command, what typical output means, and what decision you make from it.

Task 1: Verify pool health and obvious hardware errors

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 0 days 02:11:43 with 0 errors on Sun Dec 21 03:10:16 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nvme0n1 ONLINE       0     0     0
            nvme1n1 ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            nvme2n1 ONLINE       0     0     0
            nvme3n1 ONLINE       0     0     0

errors: No known data errors

Meaning: any non-zero READ/WRITE/CKSUM errors are a clue, not a suggestion. Even a single flaky device can create “stutter”
due to retries and slow I/O completion.

Decision: if you see errors or a degraded vdev, stop tuning. Fix hardware first. Performance work on broken storage is just cosplay.

Task 2: Confirm ashift and vdev topology (latency expectations start here)

cr0x@server:~$ sudo zdb -C tank | egrep -i 'ashift|vdev_tree|type'
        type: 'root'
        type: 'mirror'
            ashift: 12
        type: 'mirror'
            ashift: 12

Meaning: ashift=12 indicates 4K sector alignment. If you discover ashift=9/10/11 on 4K devices, you’ve found a permanent tax.

Decision: wrong ashift is usually “rebuild the pool correctly.” If that’s not possible immediately, document the risk and stop promising low latency.

Task 3: Check how full the pool is (ZFS gets cranky when it’s crowded)

cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint tank
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank   18.2T 2.11T  256K   /tank

Meaning: with only ~10% free, fragmentation and allocator behavior can increase latency variance, especially on block workloads.

Decision: aim to keep meaningful free space (often 20%+ for busy pools). If you’re tight, plan capacity before you tune queue depths.

Task 4: Inspect the ZVOL properties (block size, compression, sync)

cr0x@server:~$ zfs get -o name,property,value -s local,default volblocksize,compression,sync,logbias,refreservation tank/zvol/vmstore
NAME               PROPERTY        VALUE
tank/zvol/vmstore  volblocksize    8K
tank/zvol/vmstore  compression     lz4
tank/zvol/vmstore  sync            standard
tank/zvol/vmstore  logbias         latency
tank/zvol/vmstore  refreservation  none

Meaning: this is your basic “is it sane” checklist. sync=disabled on a VM datastore is a red flag unless you’ve made an explicit risk decision.

Decision: if volblocksize is obviously wrong for the workload (e.g., 128K for OLTP), plan a migration to a new ZVOL. Don’t pretend it’s a sysctl away.

Task 5: Create a ZVOL correctly (example) and explain why

cr0x@server:~$ sudo zfs create -V 4T -b 8K -o compression=lz4 -o logbias=latency -o sync=standard tank/zvol/vmstore2

Meaning: -V creates a block volume, -b sets volblocksize at creation time, and we’re explicitly choosing latency-friendly behavior for sync writes.

Decision: choose 8K when you don’t know better and you expect random writes. Choose 16K if your workload is mixed but not tiny-I/O heavy and you want slightly less metadata overhead.

Task 6: Confirm the block device exists and see its sector sizes

cr0x@server:~$ ls -l /dev/zvol/tank/zvol/vmstore2
lrwxrwxrwx 1 root root 13 Dec 25 10:12 /dev/zvol/tank/zvol/vmstore2 -> ../../../zd16
cr0x@server:~$ sudo blockdev --getss /dev/zd16
4096
cr0x@server:~$ sudo blockdev --getpbsz /dev/zd16
4096

Meaning: the ZVOL presents 4K logical sectors here. That’s typically good for modern guests and aligns with ashift=12.

Decision: if your initiator/OS stack expects 512 and you present 4K, some older guests behave poorly. For modern systems, 4K is usually fine and often better.

Task 7: Verify whether you have a dedicated SLOG and whether it’s mirrored

cr0x@server:~$ sudo zpool status tank | sed -n '1,120p'
  pool: tank
 state: ONLINE
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme0n1    ONLINE       0     0     0
            nvme1n1    ONLINE       0     0     0
          mirror-1     ONLINE       0     0     0
            nvme2n1    ONLINE       0     0     0
            nvme3n1    ONLINE       0     0     0
        logs
          mirror-2     ONLINE       0     0     0
            nvme4n1    ONLINE       0     0     0
            nvme5n1    ONLINE       0     0     0

Meaning: a mirrored SLOG reduces the risk that one log device failure causes loss of recent sync writes (depending on failure mode).

Decision: if you run sync-heavy iSCSI and you care about correctness, use a mirrored SLOG with PLP. If you can’t, accept higher sync latency and tune expectations, not physics.

Task 8: Watch ZFS latency and queueing in real time

cr0x@server:~$ sudo zpool iostat -v tank 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        18.2T  2.11T    820   2500  91.3M  88.2M
  mirror-0    -      -     410    950  45.6M  38.1M
    nvme0n1    -      -     205    470  22.8M  19.3M
    nvme1n1    -      -     205    480  22.8M  18.8M
  mirror-1    -      -     410    950  45.7M  38.1M
    nvme2n1    -      -     205    475  22.9M  19.1M
    nvme3n1    -      -     205    475  22.8M  19.0M
logs          -      -       0    600      0  12.0M
  mirror-2    -      -       0    600      0  12.0M
    nvme4n1    -      -       0    300      0   6.0M
    nvme5n1    -      -       0    300      0   6.0M

Meaning: you can see whether writes are hitting the log device (sync-heavy workload) and whether one vdev is lagging.

Decision: if one device consistently shows lower ops or bandwidth under mirror, suspect device latency, firmware, thermal throttling, or PCIe topology problems.
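
If your OpenZFS version supports it, the -l flag adds per-vdev average latency columns (total wait, disk wait, sync/async queue wait), which is often more telling than ops and bandwidth when you’re hunting stutter:

cr0x@server:~$ sudo zpool iostat -vl tank 1

Watch for one device whose disk-wait column sits well above its mirror partner, or for sync queue wait climbing while the log vdev looks idle; both point you at a specific layer instead of a vague “the pool is slow.”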

Task 9: Check TXG behavior and “stutter cadence” clues

cr0x@server:~$ sudo arcstat 1 5
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
10:20:01   920    80      8     0   0%    40  50%    40  50%  128G   128G
10:20:02   980    70      7     0   0%    35  50%    35  50%  128G   128G
10:20:03   910    95     10     0   0%    47  49%    48  51%  128G   128G
10:20:04   940    82      9     0   0%    41  50%    41  50%  128G   128G
10:20:05   960    78      8     0   0%    39  50%    39  50%  128G   128G

Meaning: ARC pressure and misses often correlate with read latency, not write stutter. But if your “stutter” is actually read cache miss storms,
this will show it: high miss rates while the app pauses.

Decision: if ARC is pinned and miss% is high, consider memory, working set, or shifting read-heavy workloads to file protocols where caching works better.
Don’t jump to L2ARC until you’ve measured it.
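
arcstat covers the read side. For write-side cadence, the TXG-related module parameters are worth a look; a sketch assuming OpenZFS on Linux with stock defaults (the dirty-data value is illustrative and scales with RAM):

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_txg_timeout
5
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_dirty_data_max
13743895347

If the stutter repeats on roughly the zfs_txg_timeout cadence (default 5 seconds) under sustained writes, you are probably watching TXG syncs land on slow stable storage. Fix the SLOG and pool latency first; raising dirty-data limits just makes the eventual flush bigger.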

Task 10: Confirm iSCSI target configuration and mapped LUNs

cr0x@server:~$ sudo targetcli ls
o- / ............................................................................. [...]
  o- backstores ................................................................. [...]
  | o- block ............................................................. [Storage Objects: 1]
  | | o- vmstore2 ........................................ [/dev/zvol/tank/zvol/vmstore2 (4.0TiB)]
  o- iscsi ............................................................... [Targets: 1]
    o- iqn.2025-12.lab:storage.tank ............................................. [TPGs: 1]
      o- tpg1 .......................................................... [no-gen-acls, no-auth]
        o- acls ............................................................... [ACLs: 1]
        | o- iqn.2025-12.lab:client01 .............................................. [Mapped LUNs: 1]
        o- luns ............................................................... [LUNs: 1]
        | o- lun0 ............................... [block/vmstore2 (/dev/zvol/tank/zvol/vmstore2)]
        o- portals ......................................................... [Portals: 1]
          o- 10.0.10.10:3260 .............................................................. [OK]

Meaning: verify the backstore points at the intended ZVOL and that ACLs map the correct initiator IQN.

Decision: if you see fileio backstores for high-performance VM storage, reconsider: block backstores to ZVOLs are typically cleaner and faster.

Task 11: On the initiator, confirm sessions and negotiated parameters

cr0x@server:~$ sudo iscsiadm -m session -P 3 | sed -n '1,120p'
iSCSI Transport Class version 2.0-870
iscsiadm version 2.1.9
Target: iqn.2025-12.lab:storage.tank
    Current Portal: 10.0.10.10:3260,1
    Persistent Portal: 10.0.10.10:3260,1
    DataDigest: None
    HeaderDigest: None
    MaxRecvDataSegmentLength: 262144
    MaxXmitDataSegmentLength: 262144
    FirstBurstLength: 262144
    MaxBurstLength: 1048576
    InitialR2T: No
    ImmediateData: Yes

Meaning: these parameters influence performance and CPU cost. Digests add CPU overhead but help detect corruption on bad networks.
Most clean DC networks run with digest off.

Decision: if CPU is the bottleneck or you see retransmits, revisit network and MTU first. Don’t “tune” iSCSI bursts as a first move; it’s rarely the main issue.

Task 12: Verify multipath status (path flaps create “stutter”)

cr0x@server:~$ sudo multipath -ll
mpatha (36001405f3f2b7f8d9d5b3d2a8c1e0001) dm-3 LIO-ORG,vmstore2
size=4.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 4:0:0:0 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=50 status=enabled
  `- 5:0:0:0 sdc 8:32 active ready running

Meaning: you want stable “active ready running” paths. If you see paths switching between failed/active, that’s your latency spike generator.

Decision: fix cabling/switching/NIC/MTU issues before touching ZFS tunables. Storage tuning can’t outvote physics and packet loss.

Task 13: Check for TCP retransmits and NIC drops on the target

cr0x@server:~$ sudo nstat -az | egrep 'TcpRetransSegs|TcpExtTCPRenoReorder|IpInDiscards'
TcpRetransSegs                    12                 0.0
IpInDiscards                      0                  0.0

Meaning: retransmits add latency variance. A few over long uptime is fine; rising quickly under load is not.

Decision: if retransmits spike during your “stutter,” investigate switch buffer/ECN, MTU mismatch, bad optics/cables, or IRQ/CPU saturation on the NIC.

Task 14: Observe disk latency directly (don’t guess)

cr0x@server:~$ sudo iostat -x 1 3
Linux 6.8.0 (server)  12/25/2025  _x86_64_  (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.10    0.00    5.20    2.80    0.00   79.90

Device            r/s     w/s   rkB/s   wkB/s  rrqm/s  wrqm/s  %util  await  r_await  w_await
nvme0n1         210.0   480.0  24000   20000     0.0     0.0   71.0   1.80     1.20     2.10
nvme1n1         215.0   470.0  24500   19800     0.0     0.0   69.0   1.90     1.30     2.20
nvme2n1         205.0   475.0  23500   19900     0.0     0.0   73.0   1.70     1.10     2.00
nvme3n1         205.0   475.0  23600   20100     0.0     0.0   72.0   1.75     1.15     2.05

Meaning: if await jumps into tens/hundreds of milliseconds during stutter, your pool devices are the bottleneck (or being forced into sync waits).

Decision: if device latency is fine but client sees stutter, suspect iSCSI queueing, multipath flaps, or sync write behavior (SLOG, flush storms).

Task 15: Identify synchronous write pressure (ZIL/SLOG usage clues)

cr0x@server:~$ sudo zfs get -o name,property,value sync,logbias tank/zvol/vmstore2
NAME               PROPERTY  VALUE
tank/zvol/vmstore2  sync      standard
tank/zvol/vmstore2  logbias   latency
cr0x@server:~$ sudo zpool iostat -v tank 1 | sed -n '1,20p'
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        18.2T  2.11T    200   2400  22.0M  98.0M
logs          -      -       0   1200      0  24.0M

Meaning: significant write ops to logs indicates sync activity (ZIL). If you have no SLOG, those sync writes hit main vdevs and can spike latency.

Decision: if your workload is sync-heavy and latency-sensitive, implement a proper SLOG. If you already have one and stutter persists, validate it’s not saturated or thermally throttling.

Task 16: Run a controlled fio test from the initiator (measure the right thing)

cr0x@server:~$ sudo fio --name=iscsi-4k-randwrite --filename=/dev/mapper/mpatha --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --time_based=1 --runtime=60 --group_reporting
iscsi-4k-randwrite: (groupid=0, jobs=4): err= 0: pid=4121: Thu Dec 25 10:30:12 2025
  write: IOPS=18.2k, BW=71.2MiB/s (74.7MB/s)(4272MiB/60001msec)
    slat (usec): min=4, max=1120, avg=15.2, stdev=9.8
    clat (usec): min=90, max=84210, avg=650.1, stdev=2100.4
     lat (usec): min=110, max=84230, avg=666.2, stdev=2101.0
    clat percentiles (usec):
     |  1.00th=[  140], 10.00th=[  180], 50.00th=[  310], 90.00th=[  980], 99.00th=[ 5800]
     | 99.90th=[38000], 99.99th=[79000]

Meaning: the average latency looks fine, but 99.90th at 38ms and 99.99th at 79ms is your “stutter.” That’s what users feel.
(A destructive note: this writes to the raw LUN, so run it only against a scratch device, never a datastore holding live VMs.)

Decision: optimize for tail latency, not average. If tails spike periodically, correlate with TXG sync times, SLOG behavior, and network retransmits.

Fast diagnosis playbook

When the stutter happens, you need a short list that narrows the blast radius quickly. This is the order I use because it finds
the “obviously broken” class of issues before you waste hours on tunables.

First: rule out failures and flaps

  • zpool status: any device errors, degraded vdevs, or resilvers? If yes, that’s the story.
  • multipath -ll on initiators: any paths failing/flapping? If yes, fix networking/paths.
  • nstat / NIC counters: retransmits, drops, ring overruns. If yes, you have a network/host CPU interrupt problem.

Second: decide if it’s disk latency or sync semantics

  • iostat -x on target: await spikes? If yes, disks or SLOG are slow or saturated.
  • zpool iostat -v 1: do log devices show heavy write ops? If yes, you’re sync-heavy; SLOG quality matters.
  • fio percentiles: high tail lat with modest average often points to periodic flush/sync behavior or queue drain events.

Third: hunt configuration mismatches and amplification

  • volblocksize: mismatched to workload? 128K ZVOL for 4K random writes will stutter like clockwork under pressure.
  • pool fullness: near-full pools amplify allocator and fragmentation pain.
  • RAIDZ + small random sync writes: expect higher tails, especially during scrubs/resilvers.

Fourth: tune queue depth carefully

Queue depth tuning is the last step because it can mask problems and create new ones. But once you’ve verified the system is healthy,
adjusting initiator depth and multipath policies can smooth latency without sacrificing correctness.
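
On a Linux open-iscsi initiator, the relevant knobs live in /etc/iscsi/iscsid.conf for new sessions, or can be updated on an existing node record; the values below are the shipped defaults, shown as a starting point rather than a recommendation:

cr0x@server:~$ grep -E 'cmds_max|queue_depth' /etc/iscsi/iscsid.conf
node.session.cmds_max = 128
node.session.queue_depth = 32
cr0x@server:~$ sudo iscsiadm -m node -T iqn.2025-12.lab:storage.tank -p 10.0.10.10 -o update -n node.session.queue_depth -v 16
cr0x@server:~$ sudo iscsiadm -m node -T iqn.2025-12.lab:storage.tank -p 10.0.10.10 --logout
cr0x@server:~$ sudo iscsiadm -m node -T iqn.2025-12.lab:storage.tank -p 10.0.10.10 --login

The change only applies after a re-login, which is disruptive, so do it in a window. Change one value at a time and re-run the fio percentiles from Task 16; lower depth usually trims the tail at some cost to peak IOPS.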

Common mistakes: symptom → root cause → fix

  • Symptom: periodic latency spikes every few seconds under write load
    Root cause: TXG sync pressure + slow stable storage, often no SLOG or bad SLOG
    Fix: add a PLP mirrored SLOG; verify log writes hit it; ensure pool isn’t near-full; keep scrubs/resilvers scheduled off-peak.
  • Symptom: good throughput but “freezes” in VMs, especially on Windows guests
    Root cause: flush storms (guest filesystem, hypervisor, or app), sync write latency exposed over iSCSI
    Fix: correct SLOG, keep sync=standard; consider workload-specific tuning on the guest (write cache policy) only with a risk review.
  • Symptom: terrible 4K random write IOPS, high disk utilization, surprisingly low bandwidth
    Root cause: volblocksize too large causing read-modify-write amplification
    Fix: create a new ZVOL with 8K/16K volblocksize and migrate; don’t try to “tune it away.”
  • Symptom: stutter appears only with multipath enabled
    Root cause: unstable secondary path, path checker too aggressive, switch misconfig, asymmetric routing
    Fix: stabilize L2/L3 first; set sane multipath policies; avoid “round-robin everything” if your target or network can’t handle it cleanly.
  • Symptom: latency worsens after enabling L2ARC
    Root cause: L2ARC stealing RAM/CPU, cache thrash, or SSD contention with pool workload
    Fix: remove L2ARC; increase RAM; only add L2ARC when you’ve measured read working set and can dedicate a fast device.
  • Symptom: performance collapses during scrub/resilver
    Root cause: pool layout (RAIDZ), limited IOPS margin, scrub competing with production, vdev imbalance
    Fix: schedule scrubs; tune scrub behavior if needed; design pools with headroom; mirrors for latency-critical block workloads.
  • Symptom: “random” pauses; logs show iSCSI timeouts/reconnects
    Root cause: TCP retransmits, MTU mismatch, NIC offload quirks, CPU IRQ saturation
    Fix: validate MTU end-to-end; check drops; pin IRQs; consider disabling problematic offloads; keep it simple and measurable.
  • Symptom: space usage looks fine, then suddenly everything slows near capacity
    Root cause: thin-provisioning oversubscription + pool hitting high utilization
    Fix: enforce quotas/refreservations; alert earlier; keep free space; treat capacity as an SLO, not a spreadsheet.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption (volblocksize “doesn’t matter”)

A mid-sized company migrated from an aging Fibre Channel SAN to a ZFS-based iSCSI target. The test plan was “copy a few VMs,
run login tests, call it good.” Everything looked fine in the lab. In production, the ERP system started “hanging” for a few seconds
at a time during peak entry hours.

The infrastructure team assumed the network was congested because iSCSI is “over TCP,” and TCP is “fragile.” They added bandwidth:
upgraded uplinks, rearranged VLANs, even replaced a switch. The stutter remained, politely unchanged. Users still described it as
“it pauses and then catches up.”

The clue was in a focused fio run that reproduced the problem: 4K random writes showed big tail latencies, even though average latency
looked acceptable. On the ZFS side, the ZVOL had been created with a 128K volblocksize—because someone read a blog post about “bigger blocks are faster.”
For the ERP workload, the IO was small, random, and sync-heavy.

Under load, ZFS was doing read-modify-write cycles on large blocks for tiny updates, churning metadata and forcing extra IOs.
The system wasn’t “slow” as much as “bursty with periodic misery.” The fix was unglamorous: create a new ZVOL with 8K volblocksize,
migrate the LUN at the hypervisor layer, and keep compression on.

The network upgrade wasn’t wasted—it improved headroom—but it didn’t touch the root cause. The postmortem changed the build process:
block size became a formal design input, not a default.

Mini-story 2: The optimization that backfired (sync=disabled “for performance”)

Another shop ran a virtualization cluster on ZFS iSCSI. They had a good pool, decent network, and still got complaints about VM pauses
during patch windows. Someone discovered that setting sync=disabled made benchmarks scream. They applied it to the main VM ZVOL
during a maintenance window and declared victory.

For a while, it looked great. Latency graphs smoothed out. The helpdesk got quiet. Then a power event hit one rack—not the whole datacenter,
just a single row where a PDU failed. The storage server rebooted. A handful of VMs came up with corrupted filesystems.
Not all of them. Just enough to make the incident both real and maddening.

The team’s “optimization” wasn’t a tuning change. It was an integrity change. With sync disabled, the system acknowledged writes that
were still in volatile cache. For some VMs it didn’t matter; for others it absolutely did. The recovery was a mix of restoring from backups,
filesystem repairs, and one long weekend of explaining to management why the “fast” fix made data go missing.

The long-term fix wasn’t a lecture, it was architecture: mirrored PLP SLOG for the sync workloads, and a refusal to change data safety semantics
to chase prettier graphs. They also added tests that simulate power-loss-like behavior by forcing hard reboots during synthetic sync write loads
in a pre-production environment.

Joke #2: sync=disabled is the storage equivalent of removing smoke detectors because the beeping is annoying.

Mini-story 3: The boring but correct practice that saved the day (capacity headroom + scrub discipline)

A financial services firm ran ZFS iSCSI for dev/test and a subset of production analytics workloads. Their storage lead had one stubborn policy:
keep pools below a utilization threshold, and run scrubs on schedule with clear alerts for any checksum error.

It was boring. It also meant they routinely argued with project managers about capacity requests. But they kept headroom, and they treated
scrubs as “non-optional hygiene” rather than “something we do when bored.”

One quarter, a batch job started hammering the storage with random writes and unexpected sync flushes. Latency increased, but it didn’t become
catastrophic. The pool still had free space, the allocator wasn’t cornered, and the mirrors had IOPS headroom. The team could slow the batch job,
adjust schedules, and tune the initiator queue depth without being under existential pressure.

During the same period, a scrub detected early checksum errors on one device. They replaced it preemptively. No outage, no drama, no “we lost half a day.”
The policy didn’t make them heroes. It made the system predictable, which is better.

Checklists / step-by-step plan

Step-by-step: building a new ZFS iSCSI ZVOL that won’t stutter

  1. Pick the pool layout for latency: mirrors for VM/iSCSI unless you have a strong reason otherwise.
  2. Confirm ashift before creating data: verify with zdb -C. If wrong, fix now, not later.
  3. Decide volblocksize based on workload: 8K/16K for general VM store; match DB page size when known.
  4. Create the ZVOL with explicit properties: lz4 compression, sync standard, logbias latency if sync-heavy.
  5. Plan free space: operational target (e.g., 20% free). Set alerts before you hit pain.
  6. Add a proper SLOG if sync-heavy: mirrored, PLP, low latency, endurance appropriate.
  7. Export via iSCSI with stable configuration: block backstore to ZVOL; explicit ACLs; consistent portal config.
  8. Set up multipath on initiators: validate both paths are truly independent; confirm stable “active ready running.”
  9. Run fio percentiles from initiators: measure tails, not just average; test both sync-ish (fsync) and async patterns (see the sketch after this list).
  10. Record a baseline: zpool iostat, iostat -x, retransmits, latency percentiles. This becomes your “known good.”
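
The sync-ish variant from step 9, as a sketch; it writes to the raw multipath device, so only run it against a scratch LUN that holds nothing you care about:

cr0x@server:~$ sudo fio --name=iscsi-4k-syncwrite --filename=/dev/mapper/mpatha --direct=1 --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --fsync=1 --time_based=1 --runtime=60 --group_reporting

With --fsync=1, every write is followed by a flush, so the completion percentiles approximate what a sync-heavy guest actually feels; compare them against the async run in Task 16 to see how much of your tail is flush latency.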

Operational checklist: when adding new LUNs or new tenants

  • Confirm pool free space threshold and projected growth.
  • Validate volblocksize matches the tenant’s expected I/O profile.
  • Decide whether to use refreservation to prevent thin-provisioning surprises.
  • Check SLOG health and wear indicators (outside ZFS itself, but mandatory; see the nvme smart-log sketch after this list).
  • Schedule scrubs and monitor checksum errors as a first-class signal.
  • Run a short fio smoke test after changes to networking or multipath policies.
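
A sketch of the wear check from the SLOG item above, assuming nvme-cli and the log devices named earlier; smartctl works too if you prefer it:

cr0x@server:~$ sudo nvme smart-log /dev/nvme4 | egrep -i 'percentage_used|media_errors|temperature'

percentage_used creeping toward 100 means the device is nearing its rated endurance, media_errors should stay at zero, and a temperature that spikes under sustained sync load is a hint that your “fast” SLOG is thermally throttling exactly when you need it.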

FAQ

1) Should I use a dataset (file) over NFS instead of ZVOL iSCSI?

If your hypervisor and workload are happy with NFS and you want simpler tuning and observability, NFS datasets are often easier.
Use ZVOL iSCSI when you need block semantics or OS expectations demand it.

2) What volblocksize should I pick for VM storage?

Start with 8K for general VM workloads when you care about random write latency. Use 16K if you’ve measured that your workloads skew larger
and you want slightly lower metadata overhead. Avoid 64K/128K for mixed VM stores unless you truly have sequential patterns.

3) Can I change volblocksize later?

Not in-place. You typically create a new ZVOL with the desired volblocksize and migrate data at the client/hypervisor layer.
Plan for that reality up front.

4) Do I need a SLOG for iSCSI?

If your workload issues sync writes (many do) and you care about latency consistency, yes—a proper mirrored PLP SLOG is often the difference
between “smooth” and “mystery pauses.” If your workload is mostly async and you can tolerate sync latency, you may skip it.

5) Why not just set sync=disabled and enjoy the speed?

Because it changes correctness. You can lose acknowledged writes on power loss or crash, and the failure can be partial and nasty.
If you’re okay with that risk for a scratch environment, document it. For production, don’t.

6) Does L2ARC help iSCSI ZVOL performance?

Sometimes, for read-heavy workloads with a working set larger than RAM. But L2ARC consumes RAM and CPU and can contend with devices.
Measure ARC hit rates and actual read latency before adding it. It’s not a magic “more cache” lever.

7) Mirrors vs RAIDZ for iSCSI: what’s the practical difference?

Mirrors generally deliver better random write latency and more predictable IOPS under rebuild/scrub. RAIDZ trades that for capacity efficiency.
For VM/databases over iSCSI, mirrors are the safer default.

8) My network is 25/100GbE—why do I still stutter?

Because bandwidth doesn’t fix tail latency. Microbursts, retransmits, path flaps, and CPU interrupt saturation can cause pauses even on fast links.
Check drops/retransmits and multipath stability first.

9) Should I enable compression on ZVOLs?

Yes, usually lz4. It often reduces physical I/O and improves latency. Exceptions exist (already compressed data streams),
but “off by default” is old superstition.

10) Why do benchmarks look fine but apps still pause?

Many benchmarks report averages and hide tail latency. Users feel the 99.9th percentile. Use fio percentiles and correlate with ZFS TXG,
SLOG activity, and network retransmits.

Conclusion: next steps you can do this week

If your ZFS iSCSI ZVOL stutters under load, it’s usually not one magic knob. It’s a chain: block size geometry, sync semantics, log device quality,
queueing, and network stability. Your job is to find where the waits are created, then remove the worst ones without cheating on durability.

Practical next steps:

  1. Run the fast diagnosis playbook during a stutter event and capture evidence: zpool status, zpool iostat -v 1, iostat -x 1, retransmit counters, and fio percentiles.
  2. Audit every ZVOL’s volblocksize and identify the outliers that don’t match workload reality.
  3. Decide explicitly whether you need a SLOG. If you do, buy the right class of device (PLP, low latency, mirrored), not “fast consumer NVMe.”
  4. Confirm multipath stability end-to-end. Eliminate path flapping before you touch storage tunables.
  5. Set a capacity policy and alerts that keep the pool out of the “nearly full” zone where stutter becomes a lifestyle.