If you moved ZFS onto “fast NVMe” and expected latency to evaporate, you’ve probably met the two villains of modern storage: tail latency and invisible contention. Your benchmarks look heroic, but production still pauses for 200–800 ms at the worst possible time. Users don’t care that your median is 80 µs when their checkout page hits the 99.9th percentile.
NVMe-only pools remove one bottleneck and expose the rest. Now the limits are CPU scheduling, interrupt distribution, queueing, ZFS transaction group behavior, and the difference between “fast device” and “fast system.” Let’s tune for reality: stable latency under mixed load, not just pretty fio graphs.
The real limits: NVMe makes the rest obvious
On rust or even SATA SSDs, storage latency dominates and hides a lot of sins. On NVMe, storage is no longer “the slow part” most of the time. That’s good news and bad news.
Good news: you can get insane IOPS and great throughput with very little tuning, especially for reads and async writes. Bad news: the bottleneck moves. If you see latency spikes on NVMe-only ZFS pools, your first suspects are:
- CPU scheduling and interrupt handling: one core doing all the work while others idle.
- Queueing in the block layer or NVMe driver: deep queues hide congestion until they don’t.
- ZFS transaction group (TXG) cadence: bursts of writeback every few seconds, plus sync semantics.
- Memory pressure and ARC behavior: reclaim storms, kswapd, and stalled threads.
- “Fast path” assumptions that don’t hold: metadata-heavy workloads, small random writes, and sync-heavy apps rarely behave the way your benchmark did.
You’re tuning a distributed system on one machine: application threads, kernel threads, interrupts, ZFS pipeline stages, and a controller with its own firmware. NVMe doesn’t eliminate complexity; it just stops masking it.
Interesting facts & history (that actually matters)
- ZFS wasn’t born in the NVMe era. The original design assumed spinning disks and write coalescing over seconds; TXG behavior still reflects that.
- NVMe was designed for parallelism. Multiple submission/completion queues exist specifically to reduce lock contention and interrupt storms compared to AHCI/SATA.
- Interrupt coalescing is older than NVMe. NICs have long traded latency for throughput by batching interrupts; NVMe devices and drivers do similar things.
- 4K sectors weren’t always normal. Advanced Format and then SSD erase block realities forced alignment conversations; ZFS’s ashift is basically “I don’t trust the device to tell the truth.”
- TRIM/Discard used to be scary. Early SSDs could stall hard on discard; “don’t enable TRIM” was once reasonable advice. Modern NVMe generally handles it better, but not always gracefully under load.
- Checksumming is not a tax you can ignore. On fast media, CRC and compression can become first-order CPU costs, especially with small IO sizes and high concurrency.
- ZIL is not a write cache. The intent log exists to satisfy POSIX sync semantics; misunderstanding it leads to weird tuning and disappointed finance teams.
- IOPS numbers got weaponized. “Millions of IOPS” marketing pushed people to forget tail latency and mixed workloads; operations people had to clean that up.
Mental model: where latency comes from in ZFS on NVMe
When latency goes bad, you need a map. Here’s the map.
ZFS write pipeline, simplified
- Application write() enters the ZFS DMU and ARC (ZFS largely bypasses the regular page cache).
- If sync is required, ZFS must commit intent to the ZIL (in-pool or SLOG) before acknowledging.
- TXG open phase: writes accumulate in memory.
- TXG sync phase: data is written to the main pool, metaslabs allocate space, checksums computed, compression applied, etc.
- Flush/FUA/barriers: underlying NVMe must make durability promises for sync paths.
Read pipeline, simplified
- ARC hit: cheap, mostly CPU/cache behavior.
- ARC miss: ZFS issues IO; device returns data; checksums verified; optional decompression.
NVMe path, simplified
- Block IO queued in kernel.
- NVMe driver submits to device queue(s).
- Device completes; interrupt (or polling) notifies CPU; completion processed.
- Wakeups and context switches deliver data to waiting threads.
Latency spikes happen when any stage becomes serialized. On NVMe, serialization is usually CPU-side: one queue, one core, one lock, one misguided setting. Or it’s ZFS’s own cadence: TXG sync bursts, metaslab contention, or sync-write flushes.
One quote to keep you honest: “Hope is not a strategy.”
— General Gordon R. Sullivan
Fast diagnosis playbook (first/second/third)
This is the “open laptop at 2 a.m.” playbook. You are trying to answer one question: where is the time going?
First: confirm the symptom is storage latency (not app or network)
- Check whether the application threads are blocked on IO (D state) or on CPU; a quick way to do that is sketched below.
- Check end-to-end latency around the time of incident (p99/p999) and correlate with IO wait and interrupt load.
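A minimal first pass, assuming a Linux host with sysstat installed; the point is only to see whether threads are stuck in uninterruptible sleep and what they are waiting on:
cr0x@server:~$ ps -eo state,pid,wchan:32,comm | awk '$1 == "D"'
cr0x@server:~$ pidstat -d 1 5
If the D-state list stays empty during a spike and pidstat shows negligible IO, the time is going somewhere else (CPU, locks, network) and pool tuning won’t fix it.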
Second: decide whether it’s read-path, async write-path, or sync write-path
- Sync-heavy workloads (databases, NFS with sync, VM images) behave differently than log-heavy async writers.
- Look for ZIL activity and flush behavior if latency spikes are periodic or aligned with fsync (see the queue view sketched below).
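One quick, read-only way to see where write traffic queues up is OpenZFS’s per-queue view; -q adds the sync/async queue columns and -y skips the since-boot summary:
cr0x@server:~$ sudo zpool iostat -q -y tank 1 5
If the sync queues carry most of the activity during the spikes, you’re in ZIL/flush territory; if the async queues dominate, look at TXG cadence and writeback instead.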
Third: identify whether the bottleneck is CPU/IRQs, queueing, or TXG
- If one CPU is pegged with interrupts/softirqs: fix IRQ affinity and queue mapping.
- If IO queues are deep and await times climb: you’re queueing; reduce concurrency, fix scheduler settings, or address firmware/controller limits.
- If stalls align with TXG sync: you’re not “out of NVMe,” you’re out of ZFS pipeline capacity (often metaslab/alloc or memory pressure).
Joke #1: Tail latency is like a house cat—quiet all day, then at 3 a.m. it sprints through your dashboard for no reason.
Practical tasks: commands, output meaning, and decisions
These are not “run commands for fun” tasks. Each one ends with a decision. Run them during normal load and again during the incident window if possible.
Task 1: Confirm pool topology and ashift (alignment)
cr0x@server:~$ sudo zpool status -v
pool: tank
state: ONLINE
scan: scrub repaired 0B in 00:18:21 with 0 errors on Mon Dec 23 02:10:01 2025
config:
        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme0n1p2  ONLINE       0     0     0
            nvme1n1p2  ONLINE       0     0     0
errors: No known data errors
cr0x@server:~$ sudo zdb -C tank | egrep -i 'ashift|vdev_tree' -n | head
55: ashift: 12
What it means: ashift: 12 implies 4K sectors. For many NVMe drives, 4K is correct; some prefer 8K/16K alignment, but ZFS won’t change ashift after creation.
Decision: If ashift is 9 (512B) on SSD/NVMe, plan a rebuild. No clever sysctl will save you from read-modify-write penalties.
Task 2: Check dataset properties that commonly drive latency
cr0x@server:~$ sudo zfs get -o name,property,value -s local,default recordsize,compression,atime,sync,logbias,primarycache,secondarycache,xattr,dnodesize tank
NAME PROPERTY VALUE
tank recordsize 128K
tank compression lz4
tank atime off
tank sync standard
tank logbias latency
tank primarycache all
tank secondarycache all
tank xattr sa
tank dnodesize legacy
What it means: This is the ground truth for behavior. sync=standard is usually correct; logbias=latency says “optimize sync.”
Decision: Only change properties per workload, not per superstition. If you’re hosting databases, consider smaller recordsize (16K/32K) and leave compression=lz4 on unless CPU is demonstrably tight.
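For example, if a hypothetical dataset tank/pgdata holds PostgreSQL, a per-dataset change might look like the following; note that a new recordsize only applies to blocks written after the change, so existing data keeps its old layout until rewritten:
cr0x@server:~$ sudo zfs set recordsize=16K tank/pgdata
cr0x@server:~$ sudo zfs get recordsize,compression,sync tank/pgdata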
Task 3: Measure pool latency and queueing from ZFS’s perspective
cr0x@server:~$ sudo zpool iostat -v tank 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.20T  2.30T   8.1K  12.4K   410M   980M
  mirror    1.20T  2.30T   8.1K  12.4K   410M   980M
    nvme0n1     -      -   4.0K   6.2K   205M   490M
    nvme1n1     -      -   4.1K   6.2K   205M   490M
What it means: This shows load distribution and whether one device is lagging. It doesn’t show per-IO latency, but it shows imbalance.
Decision: If one NVMe is consistently doing less work in a mirror, suspect firmware throttling, PCIe link issues, or thermal throttling.
Task 4: Check NVMe health and throttling indicators
cr0x@server:~$ sudo nvme smart-log /dev/nvme0n1 | egrep -i 'temperature|warning|critical|media|percentage|thm'
temperature : 71 C
critical_warning : 0x00
percentage_used : 4%
media_errors : 0
warning_temp_time : 12
critical_comp_time : 0
What it means: Temperature and accumulated time above warning thresholds matter. A drive can be “healthy” and still throttle under sustained writes.
Decision: If warning temp time climbs during incidents, fix airflow/heatsinks, reduce sustained write amplification (recordsize, sync patterns), or move the device to a cooler slot.
Task 5: Verify PCIe link speed and width (the silent throughput cap)
cr0x@server:~$ sudo lspci -s 0000:5e:00.0 -vv | egrep -i 'LnkCap|LnkSta'
LnkCap: Port #0, Speed 16GT/s, Width x4
LnkSta: Speed 16GT/s, Width x4
What it means: A Gen4 x4 link is expected for many NVMe devices. If you see x2 or Gen3 unexpectedly, you’ve found your “NVMe isn’t fast” moment.
Decision: Fix BIOS lane bifurcation, slot choice, risers, or M.2 backplane wiring. This is hardware, not tuning.
Task 6: Inspect block layer latency (await) and queue depth behavior
cr0x@server:~$ iostat -x -d nvme0n1 nvme1n1 1 5
Device r/s w/s rKB/s wKB/s rrqm/s wrqm/s r_await w_await aqu-sz %util
nvme0n1 4100 6200 210000 505000 0.0 0.0 0.35 1.90 6.20 78.0
nvme1n1 4200 6200 210000 505000 0.0 0.0 0.36 1.85 6.05 77.5
What it means: await is average request latency. aqu-sz shows average queue size. High aqu-sz with rising await means queueing (congestion) rather than raw device latency.
Decision: If aqu-sz is high and %util is near 100% with await climbing, you’re saturating a device or a queue. Reduce concurrency, spread load, or add vdevs.
Task 7: Check NVMe driver IRQ distribution (one core doing all the work)
cr0x@server:~$ grep -i nvme /proc/interrupts | head -n 12
98: 10293812 0 0 0 PCI-MSI 524288-edge nvme0q0
99: 1938120 1980221 2018830 1999012 PCI-MSI 524289-edge nvme0q1
100: 1912202 1923301 1899987 1901120 PCI-MSI 524290-edge nvme0q2
101: 1908821 1910092 1903310 1899922 PCI-MSI 524291-edge nvme0q3
What it means: If one IRQ line is skyrocketing on CPU0 while others are flat, you’re bottlenecking on interrupt handling. Queue 0 (nvme0q0) is the admin queue and is expected to look different; don’t panic about it alone, but do watch imbalances across the IO queues (nvme0q1 and up).
Decision: If interrupts are not spread, enable irqbalance (if appropriate), or pin queues deliberately to CPUs aligned with NUMA locality.
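A sketch of deliberate pinning, assuming the IO-queue IRQs from the listing above (99-101) and that CPUs 12-23 are local to the device; substitute your own IRQ numbers and CPU list, and stop irqbalance first so it doesn’t undo the change:
cr0x@server:~$ sudo systemctl stop irqbalance
cr0x@server:~$ for irq in 99 100 101; do echo 12-23 | sudo tee /proc/irq/$irq/smp_affinity_list; done
On kernels that use managed IRQ affinity for NVMe, the write may be rejected; that’s not a failure, it means the kernel is already distributing the queues and you should focus on NUMA placement of the workload instead.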
Task 8: Confirm NUMA locality and whether NVMe is “far away” from the CPU doing the work
cr0x@server:~$ sudo cat /sys/class/nvme/nvme0/device/numa_node
1
cr0x@server:~$ numactl -H | sed -n '1,25p'
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23
node distances:
node 0 1
0: 10 21
1: 21 10
What it means: If NVMe sits on NUMA node 1 but most interrupts and ZFS threads run on node 0, you pay cross-node latency and memory bandwidth penalties.
Decision: Align IRQ affinity and, for extreme cases, process placement (database threads) with the NVMe’s NUMA node.
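If you go as far as process placement, a sketch with numactl that binds a latency-sensitive process to the NVMe’s node; the binary path is a placeholder, and treat this as an experiment with a rollback rather than a default:
cr0x@server:~$ numactl --cpunodebind=1 --membind=1 -- /usr/local/bin/latency-sensitive-app
For services managed by systemd, the equivalent is a CPUAffinity/NUMA policy drop-in rather than a wrapper command.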
Task 9: Check ZFS thread states and look for TXG sync pressure
cr0x@server:~$ ps -eLo pid,tid,cls,rtprio,pri,psr,stat,wchan:20,comm | egrep 'txg|z_wr_iss|z_wr_int|arc_reclaim|dbu|zio' | head
1423 1450 TS - 19 12 S txg_sync_thread txg_sync
1423 1451 TS - 19 13 S txg_quiesce txg_quiesce
1423 1462 TS - 19 14 S zio_wait z_wr_int
1423 1463 TS - 19 15 S zio_wait z_wr_int
1423 1488 TS - 19 16 S arc_reclaim_thr arc_reclaim
What it means: Seeing threads is normal. Seeing many blocked on zio_wait during spikes suggests backend pressure; arc_reclaim being busy suggests memory pressure.
Decision: If reclaim correlates with latency, stop “tuning ZFS” and start fixing memory: ARC limits, application RSS, and kernel reclaim behavior.
Task 10: Watch memory pressure and reclaim storms
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 52132 91244 8032100 0 0 2 180 8200 9900 22 9 66 3 0
7 1 0 11200 88010 7941200 0 0 0 2500 14000 22000 28 14 50 8 0
What it means: Rising b (blocked), shrinking free, and increasing context switches during latency incidents are a classic “the system is thrashing” smell. On NVMe, the device is fast enough to make reclaim loops “interesting.”
Decision: If reclaim correlates, cap ARC (or increase RAM), stop overcommitting, and investigate which process is ballooning.
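On kernels with pressure stall information (PSI) enabled, there’s an even more direct confirmation of whether memory or IO is stalling tasks:
cr0x@server:~$ cat /proc/pressure/memory
cr0x@server:~$ cat /proc/pressure/io
If the memory “some”/“full” averages climb in step with the latency spikes while IO pressure stays flat, fix memory first and leave the pool alone.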
Task 11: Confirm sync workload and whether flushes dominate
cr0x@server:~$ sudo zpool iostat -l tank 1 3
              capacity     operations     bandwidth    total_wait     disk_wait
pool        alloc   free   read  write   read  write   read  write   read  write
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
tank        1.20T  2.30T   2.2K  18.1K   120M  1.10G  420us  3.8ms  160us  2.9ms
What it means: total_wait is latency as ZFS sees it end to end; disk_wait is the time spent at the device (columns trimmed here for readability). If total_wait is much higher than disk_wait, you’re paying CPU/locking/queueing inside the stack rather than waiting on media. The full output also breaks out syncq_wait and asyncq_wait, which is how you see whether sync writes dominate.
Decision: If disk latency is low but ZFS latency is high, look at CPU, IRQs, and ZFS concurrency; don’t buy more NVMe.
Task 12: Measure per-CPU softirq time (storage completions show up here)
cr0x@server:~$ mpstat -P 0,12 1 3
CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
0 3.12 0.00 6.01 0.21 0.90 28.44 0.00 0.00 0.00 61.31
12 15.90 0.00 10.82 0.12 0.10 3.20 0.00 0.00 0.00 69.86
What it means: CPU0 spending 28% in softirq is often “too much IO completion work on one core.” This is a frequent culprit when NVMe “should be fast” but isn’t.
Decision: Fix IRQ affinity, enable multiple queues, and align to NUMA. If needed, consider NVMe polling (carefully) for latency-sensitive workloads.
Task 13: Inspect queue configuration and scheduler choices
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
cr0x@server:~$ cat /sys/block/nvme0n1/queue/nr_requests
1023
What it means: For NVMe, none is common and often correct. Schedulers can help fairness; they can also add overhead. nr_requests affects queue depth at the block layer, which can affect tail latency.
Decision: If tail latency is your enemy, avoid enormous queues that hide congestion. Consider mq-deadline for more predictable latency if none produces ugly p99s under mixed IO.
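If you decide to test a scheduler change, a sketch for one device plus a udev rule so it survives reboot; the rule file name is arbitrary, and the change is easy to revert by writing none back:
cr0x@server:~$ echo mq-deadline | sudo tee /sys/block/nvme0n1/queue/scheduler
cr0x@server:~$ echo 'ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="mq-deadline"' | sudo tee /etc/udev/rules.d/60-nvme-scheduler.rules
Measure p99 under a realistic mix before and after; keep whichever gives the flatter tail, not the bigger throughput number.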
Task 14: Check ZFS tunables you should treat as “last-mile,” not magic
cr0x@server:~$ grep -E '^(size|c|mfu_size|mru_size) ' /proc/spl/kstat/zfs/arcstats
c                               4    73014444032
size                            4    68412985344
mru_size                        4    15600930816
mfu_size                        4    51200901120
What it means: ARC size and target matter. If ARC fights the application for memory, you get reclaim churn. If ARC is too small, you thrash reads and metadata.
Decision: Set ARC max explicitly on shared boxes. On dedicated storage appliances, let ARC breathe unless you have a concrete reason not to.
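A minimal sketch for capping ARC, assuming a 64 GiB cap is what your capacity planning says (the number is an example, not a recommendation). The runtime write adjusts the target right away (actual shrinking can take a little while); the modprobe file (name arbitrary) makes it persistent, and some distros also want the initramfs regenerated afterwards:
cr0x@server:~$ echo 68719476736 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
cr0x@server:~$ echo 'options zfs zfs_arc_max=68719476736' | sudo tee /etc/modprobe.d/zfs-arc.conf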
Task 15: Confirm TRIM/autotrim behavior and whether it lines up with spikes
cr0x@server:~$ sudo zpool get autotrim tank
NAME PROPERTY VALUE SOURCE
tank autotrim on local
cr0x@server:~$ sudo zpool trim tank
cr0x@server:~$ sudo zpool status -t tank | grep nvme
            nvme0n1p2  ONLINE       0     0     0  (100% trimmed, completed at ...)
            nvme1n1p2  ONLINE       0     0     0  (100% trimmed, completed at ...)
What it means: Autotrim is generally good on SSD/NVMe, but some devices handle background deallocation poorly, especially under heavy write load.
Decision: If you can correlate spikes with trimming, set autotrim=off and schedule trims in a quiet window. If performance degrades over weeks without trim, the device’s GC needs the help—so don’t just turn it off forever.
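If you do turn autotrim off, pair it with a scheduled trim so the device’s garbage collection still gets help; a sketch using a cron.d entry (file name and schedule are arbitrary, and the zpool path assumes /usr/sbin):
cr0x@server:~$ sudo zpool set autotrim=off tank
cr0x@server:~$ echo '0 3 * * 0 root /usr/sbin/zpool trim tank' | sudo tee /etc/cron.d/zfs-weekly-trim
zpool trim returns immediately and runs in the background; check progress later with zpool status -t.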
Dataset & pool tuning that’s worth doing (and what’s cargo cult)
Compression: leave lz4 on, then measure CPU
On NVMe-only pools, compression often helps more than it hurts. It reduces bytes written and read, which reduces backend pressure and can improve latency. The tax is CPU. On modern CPUs, lz4 is usually a bargain.
When does it hurt? High-entropy data (already compressed), small-block sync-heavy writes, or when you’re CPU-bound due to interrupts and checksumming already.
recordsize: stop using 128K as a moral stance
recordsize is about IO shape. Databases, VM images, and log-structured apps often prefer 8K–32K. Streaming workloads (backups, media, analytics scans) like 128K or larger.
On NVMe, large recordsize can still be fine, but small random updates to big records cause write amplification. The device is fast enough that you won’t notice until your p99 grows teeth.
atime and xattr: boring toggles, real wins
Turning off atime reduces metadata writes. Storing xattrs as SA (xattr=sa) reduces IO for ACL-heavy and metadata-heavy workloads. Neither is glamorous; both are frequently correct.
dnodesize: metadata-heavy workloads deserve better defaults
Large dnodes can pack more metadata into fewer reads, which matters for directory traversals, container image layers, and small-file storms. The downside is slightly more space usage.
Do it per dataset when you have a metadata-heavy workload and you’re tired of pretending NVMe solves metadata fan-out by itself.
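A sketch for a metadata-heavy dataset, with a hypothetical name; dnodesize and xattr layout apply to newly created files, so these settings pay off most on datasets populated after the change:
cr0x@server:~$ sudo zfs set atime=off tank/container-layers
cr0x@server:~$ sudo zfs set xattr=sa tank/container-layers
cr0x@server:~$ sudo zfs set dnodesize=auto tank/container-layers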
sync=disabled: don’t do it unless you can lose data and keep your job
Disabling sync turns “durable” into “optimistic.” It makes benchmarks look like you upgraded the laws of physics. It also changes the meaning of fsync. If you’re serving NFS, running databases, or hosting VMs, this is not a tuning knob; it’s a contractual decision.
Joke #2: sync=disabled is the storage equivalent of removing the smoke alarm because it’s loud.
IRQs, queues, and CPU: the unglamorous bottleneck
NVMe is built for parallelism, but the system has to accept that invitation. If your interrupts land on one CPU and your completion processing burns in softirq, your “fast storage” becomes a single-core latency machine.
What “good” looks like
- Multiple NVMe IO queues are active (nvme0q1..qN), not just queue 0.
- Interrupts are distributed across CPUs on the correct NUMA node.
- Softirq time is non-trivial but not concentrated on one core.
- Application threads and ZFS threads aren’t fighting over the same few CPUs.
What you can actually change safely
IRQ affinity: pin NVMe IRQs to a CPU set local to the device’s NUMA node. Don’t pin everything to CPU0 because a blog post from 2017 said “CPU0 handles interrupts.” That blog post was written when people wore different shoes.
irqbalance: can help, can hurt. It’s fine for general-purpose servers. For latency-critical storage, you often want deliberate pinning.
Scheduler: none is fine for pure throughput. If you have mixed IO and care about p99, test mq-deadline. Don’t assume; measure.
Sync writes, SLOG myths, and what NVMe changes
NVMe-only pools create a common temptation: “If the pool is already NVMe, do I need a SLOG?” The honest answer: usually no. The more useful answer: it depends on your sync write pattern and what “durable” means in your environment.
Clarify what ZIL/SLOG does
The ZIL (ZFS Intent Log) exists to record enough intent so that synchronous writes can be acknowledged safely before the full TXG commit. On crash, replay makes the last committed sync operations appear durable.
A SLOG is a separate device used to store ZIL entries. It helps when the main pool is slow at sync writes or flushes. On an NVMe pool, the main pool may already be good at this. Adding a SLOG can still help if:
- your main pool NVMe has poor flush latency under load,
- you have heavy sync workloads and want isolation,
- you want power-loss-protected log device behavior.
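If you do add one, mirror it and insist on power-loss protection; a sketch, assuming nvme2n1 and nvme3n1 are spare PLP devices (hypothetical names):
cr0x@server:~$ sudo zpool add tank log mirror nvme2n1 nvme3n1
cr0x@server:~$ sudo zpool iostat -v tank 1 3
Then measure your actual sync workload before and after; if fsync latency doesn’t move, the SLOG was a solution looking for a problem.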
The real limiter: flush and durability behavior
Sync performance is frequently about cache flush semantics, not raw media speed. A drive can do 700k IOPS and still have ugly fsync latency if its firmware has conservative flush behavior under sustained writes.
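A crude but honest way to measure that from the filesystem’s point of view is to time small synchronous writes directly; the path is a placeholder on the pool, and delete the file afterwards:
cr0x@server:~$ dd if=/dev/zero of=/tank/synctest bs=8k count=1000 oflag=dsync
dd prints the elapsed time; dividing it by 1000 gives a rough per-sync cost. Compare that figure at idle and under sustained write load to see whether flush latency degrades.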
TXG behavior, stalls, and why “fast media” doesn’t save you
ZFS writes are organized into transaction groups that sync periodically. Even on NVMe, this “batching” behavior can create periodic latency spikes, especially when:
- you have bursts of random writes causing allocator pressure,
- metadata updates pile up (snapshots, small files, atime),
- the system is memory constrained and can’t buffer smoothly.
Recognize TXG-related symptoms
- Latency spikes appear on a cadence (often seconds).
- CPU usage spikes in kernel threads, not user space.
- Device latency remains modest, but ZFS-level latency increases.
TXG tuning exists, but it’s not where you start. You start by ensuring the system isn’t choking on interrupts, memory reclaim, or pathological IO shapes.
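If you do get that far, read before you write; on Linux the relevant OpenZFS module parameters are visible under /sys/module/zfs/parameters, for example:
cr0x@server:~$ grep -H . /sys/module/zfs/parameters/zfs_txg_timeout /sys/module/zfs/parameters/zfs_dirty_data_max
Record the defaults, change one thing at a time, and keep a rollback note; these knobs interact with memory sizing and workload shape in ways that are easy to misjudge.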
Special vdevs and metadata on NVMe-only pools
Special vdevs are often pitched as “put metadata on SSD.” On an NVMe-only pool, you already did that. So why talk about it?
Because “NVMe-only” doesn’t mean “all NVMe behaves the same,” and because metadata and small blocks have different performance profiles than big sequential writes. A special vdev can still help if:
- you want to isolate metadata and small blocks onto a lower-latency, power-loss-protected device,
- you want to reduce fragmentation pressure on the main vdevs for mixed workloads,
- you need consistent small-IO latency even under large streaming writes.
But special vdevs are a commitment: if you lose it without redundancy, you lose the pool. Mirror it or don’t do it.
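For completeness, a sketch of what that commitment looks like, assuming two spare power-loss-protected devices (hypothetical names); special_small_blocks additionally routes small data blocks, not just metadata, to the special vdev:
cr0x@server:~$ sudo zpool add tank special mirror nvme4n1 nvme5n1
cr0x@server:~$ sudo zfs set special_small_blocks=32K tank
Once data lands on a special vdev it stays there, so size it generously and mirror it with the same care as the main vdevs.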
Three corporate mini-stories from the latency trenches
Incident caused by a wrong assumption: “NVMe mirror means no more latency”
A mid-size SaaS company moved their primary Postgres cluster to an NVMe-mirrored ZFS pool. The migration plan was careful: scrub before cutover, snapshots, rollback. The storage graph looked gorgeous in staging. Everyone slept well.
In production, every few minutes the database would freeze just long enough to trip application timeouts. Not a total outage—worse. A slow bleed of errors and retries. The on-call kept staring at disk throughput, which was nowhere near saturation, and at NVMe SMART, which looked clean.
The wrong assumption was that “fast disk means IO can’t be the problem.” The real problem was a single CPU core pinned at high softirq during peak traffic. NVMe completions were landing on a narrow set of CPUs due to a combination of default IRQ affinity and NUMA placement. Under normal load, it was fine. Under peak load, that core became a completion bottleneck, queueing grew, and database threads piled up behind it.
They fixed it by aligning NVMe IRQs to CPUs local to the NVMe NUMA node, and by moving the Postgres process CPU set away from the hottest interrupt CPUs. The pool didn’t change. The drives didn’t change. Latency did.
Optimization that backfired: “Let’s crank queue depth and disable sync”
A different company was running a fleet of virtualization hosts on NVMe ZFS. They had a quarterly “performance push,” which is corporate for “someone saw a benchmark slide.” A well-meaning engineer increased block layer queue sizes and allowed deeper outstanding IO, chasing higher throughput numbers on a synthetic test.
The numbers improved. Throughput went up. Everyone clapped. Then Monday happened. Real workloads arrived: mixed reads, writes, metadata, and bursts of sync activity from guest filesystems.
Tail latency went off a cliff. Deep queues hid congestion until it was too late, and when the system fell behind, it fell behind in big, ugly chunks. They also tested sync=disabled on VM datasets “just for performance.” That change did improve guest fsync times—right up until a host crash turned “just for performance” into a multi-VM filesystem repair party.
They rolled back the queue depth changes, moved to more modest concurrency, and restored correct sync semantics. The performance didn’t look as exciting on slides, but the pager stopped screaming.
Boring but correct practice that saved the day: pinning, observability, and a runbook
An enterprise team running an internal object store had a habit of doing three boring things: set ARC max explicitly, keep IRQ affinity documented, and run a weekly “latency drill” where they captured 10 minutes of iostat/mpstat/zpool iostat during peak business hours.
One week, latency started climbing subtly. Not enough to cause outages, but enough that SLO burn rate alarms woke someone up. Their drill data showed a new pattern: one NVMe device’s w_await was doubling under sustained writeback, while its mirror partner stayed stable. The pool stayed online; nothing “broke.”
Because they already had baselines, they didn’t debate whether it was “normal.” They swapped the device at the next maintenance window. Postmortem notes suggested thermal throttling due to a partially obstructed airflow path after a routine cabling change.
No heroics. No magical tunables. Just boring discipline: baselines, pinning, and replacing hardware before it becomes an incident.
Common mistakes: symptom → root cause → fix
- Symptom: Great average latency, terrible p99 under load.
Root cause: Queueing plus interrupt concentration; deep queues hide congestion.
Fix: Check iostat -x for rising aqu-sz; redistribute NVMe IRQs; consider mq-deadline; reduce concurrency where possible.
- Symptom: Periodic “stutters” every few seconds during heavy writes.
Root cause: TXG sync bursts plus allocator/metaslab pressure, sometimes worsened by metadata churn.
Fix: Reduce metadata writes (atime=off, xattr=sa), tune recordsize per workload, ensure ample RAM; verify ZFS vs disk latency via zpool iostat -l.
- Symptom: NVMe shows low disk latency, but the app sees high IO latency.
Root cause: CPU bottleneck in checksum/compression/interrupt handling; lock contention.
Fix: Use mpstat to find softirq hot spots; distribute IRQs; confirm CPU headroom; avoid heavy compression levels; don’t over-parallelize small IO.
- Symptom: Sync-heavy workload (fsync) is slow even on NVMe.
Root cause: Flush/FUA latency behavior under load; firmware; sometimes virtualization layers.
Fix: Measure sync IO separately; consider a mirrored PLP log device; avoid sync=disabled unless policy allows data loss.
- Symptom: One device in a mirror underperforms or falls behind.
Root cause: Thermal throttling, PCIe link downgrade, firmware quirks.
Fix: Check SMART warning-temp time and lspci link status, and compare iostat -x per device; fix cooling or slot/BIOS config.
- Symptom: Latency spikes during trims or after large deletes.
Root cause: TRIM/GC interaction; device firmware stalls during deallocation under load.
Fix: Disable autotrim temporarily and schedule trims; validate that long-term write performance doesn’t degrade.
Checklists / step-by-step plan
Step-by-step: stabilize latency on an NVMe-only ZFS pool
- Baseline first: capture zpool iostat, iostat -x, mpstat, and /proc/interrupts during normal peak load.
- Confirm hardware reality: PCIe link width/speed; NVMe temps; firmware versions if your org tracks them.
- Fix obvious CPU bottlenecks: IRQ distribution; NUMA alignment; avoid starving ZFS threads.
- Validate dataset semantics: ensure sync matches durability requirements; set atime=off where appropriate; keep compression=lz4 unless CPU proves otherwise.
- Match recordsize to workload: databases/VMs generally smaller; streaming bigger. Measure with production-like IO sizes.
- Watch memory pressure: set ARC max if the box runs apps; prevent reclaim storms.
- Re-test p99: don’t stop at average throughput. Run a mixed workload test and look at the tails (see the fio sketch after this list).
- Only then consider exotic knobs: schedulers, queue depths, polling. Each change gets a rollback plan.
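A hedged fio sketch for looking at tails rather than averages; the path, size, and read/write mix are placeholders to adapt to your workload, and it should point at a scratch file on the pool, never at a raw device that holds data:
cr0x@server:~$ fio --name=mixed-p99 --filename=/tank/fio/testfile --size=20G \
    --rw=randrw --rwmixread=70 --bs=8k --iodepth=16 --numjobs=4 \
    --time_based --runtime=300 --group_reporting
Read the clat percentile lines (99.00th, 99.90th) in the output, not the bandwidth summary; that’s where the pager-worthy behavior lives.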
Checklist: before you blame ZFS
- PCIe link is at expected speed/width for each NVMe.
- No NVMe is in persistent thermal warning territory.
- IRQs are distributed and local to the correct NUMA node.
- CPU softirq isn’t concentrated on one core.
- There’s enough RAM and no swap/reclaim thrash.
- You can explain whether the pain is sync, async, read, or metadata.
FAQ
1) Do I need a SLOG on an NVMe-only pool?
Usually no. Add a SLOG only if you have measurable sync-write latency issues and a suitable low-latency, power-loss-protected device. Otherwise you’re adding complexity for vibes.
2) Should I set sync=disabled for performance?
Only if the business accepts data loss on crash and you can prove the application is safe with that risk. For databases, NFS, and VM storage, it’s typically a bad trade.
3) Is compression=lz4 still helpful on NVMe?
Yes, often. It reduces bytes moved and written, which can improve tail latency. Turn it off only after you confirm CPU is the limiting factor.
4) Why is my NVMe doing “only” a fraction of its advertised IOPS?
Because your system is not a vendor benchmark rig. IRQ handling, NUMA, queueing, ZFS checksums, and workload mix all reduce headline numbers. Also, mirror vdev behavior and sync semantics matter.
5) Should I use mq-deadline or none for NVMe?
Start with none. If you have ugly p99 latency under mixed IO, test mq-deadline. Use measured tail latency, not throughput, to decide.
6) How do I know if I’m CPU-bound rather than storage-bound?
If disk latency is low but ZFS/app latency is high, and you see high %soft or one core saturated, you’re CPU-bound. Confirm with mpstat and interrupt distribution.
7) Does more queue depth always improve performance?
No. It can improve throughput and worsen latency. Deep queues trade predictability for bulk movement. If your workload is user-facing, you often want controlled queues and stable tails.
8) What’s the single most common cause of NVMe “mystery latency” on ZFS?
Interrupt and completion processing concentrated on too few CPUs, often worsened by NUMA mismatch. It’s embarrassingly common, and it’s fixable.
9) Should I tune TXG parameters to fix stutters?
Only after you’ve ruled out IRQ/CPU issues, memory pressure, and pathological IO shapes. TXG knobs can help in edge cases, but they’re easy to misapply and hard to reason about.
Conclusion: next steps you can do this week
If you want a calmer pager and better p99 latency on NVMe-only ZFS pools, do the unsexy work first. Verify PCIe links, stop thermal throttling, distribute IRQs, and align NUMA. Then match dataset properties to the workload: recordsize where it matters, atime off where it’s noise, and keep lz4 unless you’ve proven CPU is the choke point.
Next steps:
- Capture a baseline during peak: iostat -x, zpool iostat -l, mpstat, and /proc/interrupts.
- Fix any IRQ/NUMA obviousness you find. Re-measure p99.
- Audit datasets: identify which ones are database/VM/log/streaming and set recordsize accordingly.
- Decide explicitly whether you need strict sync semantics; don’t “accidentally” change durability.
- Write a tiny runbook: what graphs/commands you check first, and what “good” looks like for your fleet.