Latency spikes are the kind of storage problem that make smart people say weird things in incident channels. Everything is fine… until it isn’t. Your API p99 goes vertical for 800ms, your database starts “waiting for IO,” and your ZFS box looks bored.
ZFS isn’t “randomly slow.” It’s usually doing something very specific at a very specific layer: sync writes, TXG commit, device queueing, memory reclaim, scrub/resilver, or an unlucky workload mismatch. The trick is to stop guessing and run a checklist that narrows the bottleneck in minutes, not hours.
Fast diagnosis playbook (first/second/third)
This is the “I have five minutes before my boss joins the call” sequence. Don’t optimize anything yet. Don’t toggle properties like a DJ. Just identify the layer where the time is going.
First: confirm it’s storage latency, not CPU scheduling or network
- Check system load and IO wait. If CPU is saturated or you’re thrashing memory, storage will look guilty even when it isn’t.
cr0x@server:~$ uptime
14:22:18 up 18 days, 3:11, 2 users, load average: 7.81, 7.34, 6.92
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 1 0 82120 15640 912340 0 0 120 980 540 890 12 5 73 10 0
1 2 0 81288 15640 905112 0 0 240 2030 620 1010 10 6 62 22 0
What it means: High wa (IO wait) suggests the CPU is often stalled waiting on IO. High b means processes blocked, often on disk.
Decision: If wa spikes coincide with application latency spikes, proceed to ZFS-specific checks. If us/sy is pegged, look at CPU bottlenecks first (compression, checksums, encryption).
- Check network if it’s NFS/SMB/iSCSI.
cr0x@server:~$ ss -ti | head -n 15
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 10.0.0.12:2049 10.0.2.45:51712 timer:(keepalive,38min,0) ino:0 sk:3b2
cubic wscale:7,7 rto:204 retrans:0/0 rtt:0.337/0.012 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:818244 bytes_acked:818244 bytes_received:94412 segs_out:571 segs_in:558 send 343Mb/s lastsnd:12 lastrcv:12 lastack:12 pacing_rate 686Mb/s
What it means: Retransmits and huge RTT swings can mimic “storage spikes.”
Decision: If network is clean (stable RTT, no retrans), stay focused on disk/ZFS.
Second: identify whether the spike is sync-write/TXG related
- Watch ZFS latency at the pool layer.
cr0x@server:~$ zpool iostat -v -l 1 10
capacity operations bandwidth total_wait disk_wait
pool alloc free read write read write read write read write
tank 3.12T 1.45T 210 9800 12.3M 402M 2.1ms 85ms 1.9ms 82ms
raidz2-0 3.12T 1.45T 210 9800 12.3M 402M 2.1ms 85ms 1.9ms 82ms
sda - - 20 1250 1.3M 50.1M 1.8ms 88ms 1.7ms 84ms
sdb - - 22 1210 1.4M 49.6M 2.0ms 86ms 1.8ms 83ms
sdc - - 19 1220 1.2M 50.0M 2.1ms 84ms 1.9ms 81ms
What it means: total_wait is what callers feel; disk_wait isolates device service time. If total_wait is high but disk_wait is low, you’re queueing above the disks (TXG, throttling, contention).
Decision: If write wait jumps into tens/hundreds of ms during spikes, suspect sync writes, SLOG, TXG commit pressure, or pool saturation.
Third: check for “maintenance IO” and obvious contention
- Is a scrub/resilver running?
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices is currently being scrubbed.
scan: scrub in progress since Mon Dec 23 02:01:13 2025
1.22T scanned at 1.02G/s, 438G issued at 365M/s, 3.12T total
0B repaired, 14.04% done, 0:02:45 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
What it means: Scrubs and resilvers are legitimate IO storms. They also change IO patterns (more reads, more metadata).
Decision: If spikes line up with scrub/resilver windows, either reschedule, tune scrub behavior, or provision enough headroom to survive maintenance.
A workable mental model of ZFS latency
ZFS latency spikes rarely come from one magical setting. They come from a pipeline where any stage can stall:
- Application semantics: sync vs async writes, fsync storms, database checkpoints.
- Filesystem layer: dataset properties (recordsize, compression, atime), metadata and small-block behavior, special vdevs.
- ARC and memory: cache hits are fast; cache misses and eviction churn aren’t. Memory pressure makes everything angry.
- TXG (transaction groups): ZFS batches changes and commits them. When commit work piles up, you can see periodic pauses or write latency waves.
- ZIL/SLOG (sync writes): sync writes are acknowledged after they are safely logged. If the log device is slow, your application learns new words.
- vdev topology and disks: RAIDZ math, queue depth, SMR weirdness, firmware hiccups, write cache policies.
- Block device layer: scheduler choice, multipath, HBAs, drive timeouts.
When someone says “ZFS spikes,” ask: spikes where? In read latency? write latency? only sync writes? only metadata? only when the pool is 80% full? only during backups? Your job is to turn “spiky” into a plot with a culprit.
One opinion that will save you time: treat ZFS like a database. It has its own batching (TXGs), logging (ZIL), cache (ARC), and background work (scrub/resilver). If you wouldn’t tune a database by randomly changing knobs, don’t do it to ZFS either.
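If you want to see the batching cadence for yourself, Linux OpenZFS exposes the TXG commit interval as a module parameter. A quick, read-only look, assuming a reasonably recent OpenZFS build (the default is typically 5 seconds):
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_txg_timeout
5
That interval is one reason write-latency “waves” often repeat every few seconds under sustained load; Task 15 below inspects the same machinery in more detail.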
Paraphrased idea from Werner Vogels (Amazon CTO): “Everything fails, all the time; design systems that expect it.” That includes your storage latency budget.
Joke #1: ZFS doesn’t have “mood swings.” It has “I/O accounting.” It’s less fun, but more actionable.
Interesting facts and short history (why ZFS behaves like this)
- ZFS debuted in the mid‑2000s with end-to-end checksumming as a first-class feature, not an add-on. That choice influences latency because every block read can involve checksum verification.
- The ZIL exists even without a dedicated SLOG. If you don’t add a log device, ZFS uses space on the pool devices. Your “no SLOG” setup still has sync logging; it’s just slower and competes with normal writes.
- TXG commit is periodic. ZFS batches dirty data and metadata, then flushes. That batching increases throughput but can create rhythmic latency pulses under sustained dirtying.
- Copy-on-write is both a feature and a tax. It prevents in-place overwrites (good for integrity and snapshots) but can increase fragmentation and metadata work over time, affecting tail latency.
- RAIDZ is not “free RAID.” It saves disks but makes small random writes expensive (read-modify-write patterns). Tail latency suffers first.
- 4K sector reality arrived after many arrays were designed. Misaligned ashift choices (or old 512e assumptions) can turn “normal writes” into amplified IO.
- ARC was built to be adaptive, not polite. It will happily eat memory for cache; if your workload needs RAM elsewhere (VMs, page cache, databases), the resulting pressure can manifest as storage jitter.
- Special vdevs are modern ZFS’s “metadata SSD” story. They can dramatically reduce metadata latency—unless they saturate or fail, in which case the entire pool experience changes.
- Compression became mainstream in ZFS before it was cool. It trades CPU cycles for fewer IOs; on fast NVMe, CPU becomes the bottleneck more often than people expect, causing “storage latency” symptoms.
Checklists / step-by-step plan (with commands)
This is the main event: practical tasks you can run during an incident and again during a calm postmortem. Each task includes (1) command, (2) what the output means, (3) the decision you make.
Task 1: Capture pool-level latency and separate queueing from disk service time
cr0x@server:~$ zpool iostat -l 1 30
capacity operations bandwidth total_wait disk_wait
pool alloc free read write read write read write read write
tank 3.12T 1.45T 320 9200 18.2M 380M 3.2ms 96ms 2.8ms 91ms
tank 3.12T 1.45T 310 9300 17.8M 387M 2.9ms 140ms 2.6ms 132ms
tank 3.12T 1.45T 290 9100 16.4M 376M 3.0ms 35ms 2.7ms 31ms
What it means: When total_wait balloons, that’s user-visible. If disk_wait tracks it, the disks (or HBAs) are slow. If disk_wait stays modest while total_wait spikes, you’re bottlenecked above disk: throttling, TXG congestion, ZIL contention, lock contention.
Decision: Disk_wait-driven spikes send you to device/hardware/queue checks. Total_wait-only spikes send you to sync/TXG/ARC checks.
Task 2: Find the guilty vdev or single slow disk
cr0x@server:~$ zpool iostat -v -l 2 10 tank
capacity operations bandwidth total_wait disk_wait
pool alloc free read write read write read write read write
tank 3.12T 1.45T 300 9100 16.8M 375M 3.0ms 72ms 2.7ms 68ms
raidz2-0 3.12T 1.45T 300 9100 16.8M 375M 3.0ms 72ms 2.7ms 68ms
sda - - 30 1200 1.8M 48.9M 2.6ms 70ms 2.4ms 66ms
sdb - - 29 1180 1.7M 48.2M 2.7ms 72ms 2.5ms 69ms
sdc - - 28 1190 1.6M 48.6M 35ms 250ms 33ms 240ms
What it means: One device with 10× latency drags a RAIDZ vdev down because parity math forces coordination. Mirrors also suffer: reads can avoid a slow side, writes can’t.
Decision: If one disk is the outlier, check SMART, cabling, HBA pathing. Replace the disk if it keeps spiking. Don’t “tune ZFS” around a dying drive.
Task 3: Confirm whether writes are synchronous (and whether your workload forces it)
cr0x@server:~$ zfs get -r sync tank
NAME PROPERTY VALUE SOURCE
tank sync standard default
tank/db sync standard inherited from tank
tank/vm sync always local
What it means: sync=always forces every write to behave like a sync write. Some workloads need it (databases), many don’t (bulk logs). standard obeys the application (fsync/O_DSYNC).
Decision: If a dataset is sync=always without a clear reason, set it back to standard after confirming application durability needs. If it’s truly required, you must provide a good SLOG or accept the latency.
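If the investigation says a dataset does not need forced sync, the change is a reversible one-liner. A minimal sketch using the tank/vm dataset from the output above; confirm the application’s durability requirements before flipping it:
cr0x@server:~$ zfs set sync=standard tank/vm
cr0x@server:~$ zfs get sync tank/vm
NAME PROPERTY VALUE SOURCE
tank/vm sync standard local
The property takes effect immediately and can be set back to always just as quickly if durability questions come up.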
Task 4: Check if you actually have a SLOG, and whether it’s healthy
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
logs
nvme0n1p2 ONLINE 0 0 0
What it means: If the pool has a logs section, you have a separate log device. If not, sync writes land on the main vdevs. Also: an ONLINE SLOG can still be slow.
Decision: If sync latency is killing you and there is no SLOG, add one (properly mirrored if durability matters). If there is a SLOG, benchmark/validate it; cheap consumer NVMe without power-loss protection can show catastrophic tail latency once every sync write must reach stable media.
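Adding a mirrored log vdev is mechanically simple; the hard part is picking devices with power-loss protection and low sync-write latency. A sketch with hypothetical device paths (substitute your own by-id names):
cr0x@server:~$ zpool add tank log mirror /dev/disk/by-id/nvme-EXAMPLE-A-part2 /dev/disk/by-id/nvme-EXAMPLE-B-part2
cr0x@server:~$ zpool status tank | grep -A 3 logs
logs
mirror-2 ONLINE 0 0 0
nvme-EXAMPLE-A-part2 ONLINE 0 0 0
nvme-EXAMPLE-B-part2 ONLINE 0 0 0
A log vdev can also be removed later with zpool remove, which makes this a comparatively low-risk change next to topology decisions.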
Task 5: Validate dataset recordsize and volblocksize against the workload
cr0x@server:~$ zfs get recordsize,volblocksize,primarycache,logbias tank/vm tank/db
NAME PROPERTY VALUE SOURCE
tank/vm recordsize 128K inherited from tank
tank/vm volblocksize - -
tank/vm primarycache all default
tank/vm logbias latency default
tank/db recordsize 16K local
tank/db volblocksize - -
tank/db primarycache all default
tank/db logbias latency local
What it means: Databases often like 8K–16K recordsize for files; VM images may prefer volblocksize tuned at zvol creation time (common values 8K–64K depending on hypervisor). Wrong sizes increase read-modify-write and metadata pressure.
Decision: If you see a mismatch, plan a migration (you can’t change volblocksize after creation). Don’t do this mid-incident unless you like overtime.
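Unlike volblocksize, recordsize can be changed at any time, but it only applies to blocks written after the change; existing data keeps its old block size until it is rewritten (for example via send/recv into a fresh dataset). A minimal sketch, assuming a hypothetical staging dataset tank/db2 for a database doing 16K IO:
cr0x@server:~$ zfs create -o recordsize=16K -o compression=lz4 tank/db2
cr0x@server:~$ zfs get recordsize tank/db2
NAME PROPERTY VALUE SOURCE
tank/db2 recordsize 16K local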
Task 6: Check pool fullness and fragmentation pressure
cr0x@server:~$ zpool list -o name,size,alloc,free,cap,frag,health
NAME SIZE ALLOC FREE CAP FRAG HEALTH
tank 4.50T 3.12T 1.38T 69% 41% ONLINE
What it means: High cap (especially above ~80–85%) and high fragmentation often correlate with worse tail latency, particularly on HDD RAIDZ. ZFS has fewer contiguous free segments; allocation gets more expensive.
Decision: If cap is high, free space (delete, move, expand). If frag is high and performance is collapsing, consider rewriting data (send/recv to a fresh pool) or adding vdevs to increase allocation choices.
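Fullness and fragmentation are also per-vdev problems, so look one level down before deciding. A sketch with per-disk rows omitted for brevity (same pool as above):
cr0x@server:~$ zpool list -v -o name,size,alloc,free,cap,frag tank
NAME SIZE ALLOC FREE CAP FRAG
tank 4.50T 3.12T 1.38T 69% 41%
raidz2-0 4.50T 3.12T 1.38T 69% 41%
In a multi-vdev pool, one vdev that is far fuller or far more fragmented than the others pushes new allocations onto the healthier vdevs, and latency becomes lopsided.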
Task 7: Check ARC size, hit ratio, and memory pressure signals
cr0x@server:~$ arcstat 1 5
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
14:25:01 820 44 5 10 1 34 4 0 0 62G 64G
14:25:02 910 210 23 80 9 130 14 0 0 62G 64G
14:25:03 840 190 22 70 8 120 14 0 0 58G 64G
14:25:04 870 55 6 12 1 43 5 0 0 58G 64G
What it means: Miss% spikes and ARC shrinking can indicate memory pressure or a workload shift. When ARC can’t hold hot data, reads turn into real disk IO, and latency gets spiky.
Decision: If ARC is volatile and miss% jumps during incidents, check overall memory, reclaim behavior, and whether something else (VMs, containers) is eating RAM. Consider reserving memory, tuning ARC max, or moving the noisy tenant.
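On Linux OpenZFS the ARC ceiling is the zfs_arc_max module parameter (bytes; 0 means auto-sized). It can be changed at runtime and persisted via modprobe options; the 48 GiB figure below is purely illustrative:
cr0x@server:~$ echo 51539607552 > /sys/module/zfs/parameters/zfs_arc_max
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max
51539607552
To make it survive a reboot, set options zfs zfs_arc_max=51539607552 in /etc/modprobe.d/zfs.conf. Lowering the cap shrinks ARC gradually, so don’t expect the graph to drop instantly.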
Task 8: Identify sync-write pressure via ZIL statistics
cr0x@server:~$ kstat -p | egrep 'zfs:0:zil:|zfs:0:vdev_sync' | head
zfs:0:zil:zil_commit_count 184220
zfs:0:zil:zil_commit_writer_count 12011
zfs:0:zil:zil_commit_waiter_count 31255
zfs:0:vdev_sync:vdev_sync_write_bytes 98342199296
What it means: Rising commit and waiter counts point to heavy sync activity. If waiters pile up, applications are blocking on log commits.
Decision: If sync pressure is high, validate SLOG latency, logbias settings, and whether the app is fsyncing too often (or running in a “safe but slow” mode).
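The kstat tool above is the illumos-style view. On Linux OpenZFS the same class of ZIL counters lives under /proc/spl/kstat/zfs; a sketch, assuming a reasonably recent build (exact counter names shift a little between versions):
cr0x@server:~$ grep -E 'zil_commit_count|zil_commit_writer_count' /proc/spl/kstat/zfs/zil
zil_commit_count 4 184220
zil_commit_writer_count 4 12011
Sample it twice a few seconds apart; the rate of change during a spike is more useful than the absolute numbers.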
Task 9: Confirm whether a scrub, resilver, or device error recovery is consuming IO
cr0x@server:~$ zpool status tank | sed -n '1,25p'
pool: tank
state: ONLINE
scan: scrub repaired 0B in 0 days 03:22:11 with 0 errors on Sun Dec 22 03:22:11 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
What it means: A clean pool status doesn’t mean “no background IO,” but it rules out the obvious maintenance tasks.
Decision: If scan is active during spikes, decide whether to pause (in some environments), reschedule, or accept the hit as the cost of integrity.
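If the scrub has to yield right now, recent OpenZFS can pause and later resume it instead of cancelling outright. A sketch; verify your platform supports pause before putting this in a runbook:
cr0x@server:~$ zpool scrub -p tank
cr0x@server:~$ zpool scrub tank
The first command pauses the in-progress scrub; running zpool scrub again later resumes it from where it stopped rather than starting over.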
Task 10: Check per-disk IO queueing and latency at the Linux block layer
cr0x@server:~$ iostat -x 1 5
Linux 6.6.0 (server) 12/25/2025 _x86_64_ (32 CPU)
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 25.0 1200.0 1.6 49.0 86.0 9.2 62.0 18.0 63.0 0.8 98.0
sdb 24.0 1180.0 1.5 48.3 86.0 8.9 60.0 17.5 61.0 0.8 97.0
sdc 23.0 1190.0 1.4 48.6 86.0 28.1 220.0 21.0 225.0 0.8 99.0
What it means: avgqu-sz and await show queue buildup. %util near 100% indicates saturation. A single disk with huge queue and await is your latency spike generator.
Decision: If the block layer shows the same outlier disk as ZFS, this is not a ZFS tuning problem. Pull SMART, check controller logs, and prepare replacement.
Task 11: Check drive health and “this disk is lying to you” counters
cr0x@server:~$ smartctl -a /dev/sdc | egrep -i 'Reallocated|Pending|Uncorrect|CRC|Power_On_Hours|Temperature|Error'
Power_On_Hours 38122
Temperature_Celsius 44
Reallocated_Sector_Ct 0
Current_Pending_Sector 8
Offline_Uncorrectable 2
UDMA_CRC_Error_Count 19
What it means: Pending/uncorrectable sectors can cause long internal retries. CRC errors often mean cabling/backplane issues, which produce intermittent latency spikes rather than clean failures.
Decision: Pending sectors and uncorrectables are a “replace soon” sign; CRC errors are a “fix the path” sign. Either can explain periodic 200ms–2s stalls.
Task 12: Check dataset properties that quietly cause extra IO
cr0x@server:~$ zfs get -r atime,compression,xattr,acltype,logbias tank | head -n 20
NAME PROPERTY VALUE SOURCE
tank atime off local
tank compression lz4 local
tank xattr sa local
tank acltype posixacl local
tank logbias latency default
tank/db atime off inherited from tank
tank/db compression lz4 inherited from tank
tank/db xattr sa inherited from tank
tank/db acltype posixacl inherited from tank
tank/db logbias latency local
What it means: atime=on on read-heavy workloads generates writes on reads, which is a special kind of self-own. Compression can reduce IO but increase CPU latency. xattr=sa often helps metadata-heavy workloads.
Decision: If you see atime=on on hot datasets and you don’t need it, turn it off. If compression is heavy and CPU spikes correlate with latency spikes, consider adjusting compression or adding CPU.
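Disabling atime is one of the few safe mid-incident changes because it is instant and reversible. A sketch with a hypothetical read-heavy dataset; relatime=on is a middle ground if something actually consumes access times:
cr0x@server:~$ zfs set atime=off tank/share
cr0x@server:~$ zfs get atime tank/share
NAME PROPERTY VALUE SOURCE
tank/share atime off local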
Task 13: Check if snapshots and deletes are creating pathological metadata work
cr0x@server:~$ zfs list -t snapshot -o name,used,creation -s creation | tail -n 5
tank/db@auto-2025-12-25-0100 2.1G Mon Dec 25 01:00 2025
tank/db@auto-2025-12-25-0200 2.2G Mon Dec 25 02:00 2025
tank/db@auto-2025-12-25-0300 2.2G Mon Dec 25 03:00 2025
tank/db@auto-2025-12-25-0400 2.3G Mon Dec 25 04:00 2025
tank/db@auto-2025-12-25-0500 2.4G Mon Dec 25 05:00 2025
What it means: Snapshots are cheap to take, not always cheap to keep. Large snapshot trees + constant overwrites can increase fragmentation and metadata churn, which shows up as jitter.
Decision: If you have aggressive snapshot schedules on high-churn datasets, tighten retention or move that workload to mirrors/NVMe where metadata churn is less punishing.
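To see which datasets carry the heaviest snapshot load, count snapshots per dataset instead of eyeballing the tail of the list. A plain shell pipeline over standard zfs list output (the counts shown are illustrative):
cr0x@server:~$ zfs list -H -t snapshot -o name | awk -F'@' '{print $1}' | sort | uniq -c | sort -rn | head -n 5
912 tank/db
310 tank/vm
48 tank/home
12 tank/share
4 tank
A dataset with hundreds of snapshots and constant overwrites is exactly where retention tightening pays off first.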
Task 14: Validate ashift and sector alignment (the “forever” setting)
cr0x@server:~$ zdb -C tank | egrep 'ashift|path' | head -n 20
path: '/dev/disk/by-id/ata-ST12000NM0008-2H2161_ZHZ12345'
ashift: 12
path: '/dev/disk/by-id/ata-ST12000NM0008-2H2161_ZHZ23456'
ashift: 12
What it means: ashift=12 means 4K sectors; ashift=9 means 512B. Setting ashift too low on 4K drives can create write amplification and ugly tail latency.
Decision: If ashift is wrong, you can’t “fix it” in place. Plan a rebuild/migration. Put it in the postmortem as “we will not do this again.”
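The only real fix happens at creation time, when ashift is specified for the new pool (or any vdev added later). A sketch for the eventual rebuild, with hypothetical device paths; modern OpenZFS usually detects 4K drives correctly, but stating it removes the guesswork:
cr0x@server:~$ zpool create -o ashift=12 tank2 raidz2 /dev/disk/by-id/ata-EXAMPLE-1 /dev/disk/by-id/ata-EXAMPLE-2 /dev/disk/by-id/ata-EXAMPLE-3 /dev/disk/by-id/ata-EXAMPLE-4
cr0x@server:~$ zdb -C tank2 | grep ashift
ashift: 12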
Task 15: Find throttling and TXG pressure signals (Linux OpenZFS)
cr0x@server:~$ cat /proc/spl/kstat/zfs/tank/txgs
txg birth state ndirty
1064217 1703510152 open 2147483648
1064216 1703510150 quiescing 1987654321
1064215 1703510148 syncing 1876543210
What it means: Multiple TXGs in quiescing/syncing with very high dirty bytes suggests the system is struggling to flush. That can manifest as write latency spikes and occasional stalls when ZFS applies backpressure.
Decision: If TXGs are stuck syncing, reduce dirtying rate (application throttling, tune write bursts), increase vdev performance, or reduce competing background tasks. Don’t just raise dirty data limits and hope.
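Before changing anything, at least record what the write throttle is currently allowed to do. On Linux OpenZFS the relevant knobs are module parameters; a read-only look, with values that are typical defaults (zfs_dirty_data_max scales with RAM):
cr0x@server:~$ grep . /sys/module/zfs/parameters/zfs_dirty_data_max /sys/module/zfs/parameters/zfs_dirty_data_max_percent /sys/module/zfs/parameters/zfs_delay_min_dirty_percent
/sys/module/zfs/parameters/zfs_dirty_data_max:4294967296
/sys/module/zfs/parameters/zfs_dirty_data_max_percent:10
/sys/module/zfs/parameters/zfs_delay_min_dirty_percent:60
If dirty bytes in the txgs output routinely sit near zfs_dirty_data_max, the pool is being throttled; the durable fix is more flush capacity, not a bigger dirty buffer.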
Task 16: Check special vdev health and utilization (metadata on SSD)
cr0x@server:~$ zpool status -v tank | sed -n '1,80p'
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
nvme2n1 ONLINE 0 0 0
What it means: A special vdev holds metadata (and optionally small blocks). If it’s unhealthy or undersized, metadata operations can spike even if bulk data is fine. Also: if special vdev fails and you lose redundancy, pool risk skyrockets.
Decision: If special vdev is present, treat it as tier-0 infrastructure. Monitor it like you monitor your database WAL device. If it’s small or slow, fix that before chasing ghosts.
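Two quick checks keep a special vdev honest: whether small data blocks are being routed to it deliberately, and how full it is. A sketch (special_small_blocks=0 means metadata only; watch the special class rows in zpool list -v and zpool iostat -v for fill and latency):
cr0x@server:~$ zfs get -r special_small_blocks tank | head -n 4
NAME PROPERTY VALUE SOURCE
tank special_small_blocks 0 default
tank/db special_small_blocks 16K local
tank/vm special_small_blocks 0 default
A special vdev that fills up silently starts spilling new metadata to the main vdevs, which looks exactly like “metadata got slow for no reason.”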
Task 17: Correlate application sync behavior with storage spikes
cr0x@server:~$ pidstat -d 1 5 | egrep 'postgres|mysqld|qemu|java' || true
14:28:11 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
14:28:12 113 2218 0.00 402112.00 1024.00 312 postgres
14:28:13 113 2218 0.00 398740.00 980.00 411 postgres
What it means: iodelay growing indicates the process is waiting on IO. If your DB process is the one blocked during spikes, stop blaming the load balancer.
Decision: If one process dominates IO wait, analyze its write pattern (checkpoints, fsync storms, backup jobs). The fix may be in the application schedule, not ZFS.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
They moved a transactional service to a new ZFS-backed storage cluster. The architecture review went smoothly. “We’re on SSDs now,” someone said, which was treated like a universal solvent for performance risk.
The first week was quiet. Then month-end hit. The service started showing 1–3 second write stalls. Not constant—just enough to trigger retries, amplify load, and make the rest of the system look flaky. The incident commander did the usual tour: CPU fine, network fine, “ZFS is spiking.”
The wrong assumption was subtle: they assumed the presence of an NVMe device meant “fast sync writes.” The pool had no dedicated SLOG, and the main vdev layout was RAIDZ. The workload did frequent fsyncs. During bursts, the pool had to do sync logging on the same devices handling regular writes and parity work.
Once they looked at pool total_wait versus disk_wait, it was obvious: disk_wait soared during sync bursts. They added a mirrored pair of power-loss-protected SLOG devices, so the fsync-heavy dataset’s log records landed on the new mirror instead of competing with regular writes and parity work (a SLOG serves the whole pool; the other datasets simply issued few sync writes and stayed at sync=standard). Latency spikes didn’t vanish; they shrank into the noise floor where monitoring graphs go to die.
The lesson that stuck: “SSD” is not a performance guarantee. It’s a medium. The architecture is still your problem.
Mini-story 2: The optimization that backfired
A different team was battling random read latency spikes on a busy file server. Someone suggested turning off compression “to reduce CPU overhead.” It sounded reasonable: less CPU, more speed. They flipped compression=off on the hottest dataset.
Within hours, p99 got worse. Not slightly worse—meaningfully worse. CPU went down a bit, yes, but disk read bandwidth went up, and so did queue depth. The pool was now servicing more physical IO for the same logical workload, and those extra IOs were the ones that show up in tail latency.
The real issue turned out to be ARC misses triggered by a new workload mix plus a memory limit introduced by container settings. Compression had been hiding some of the pressure by making blocks smaller, effectively increasing the “useful capacity” of ARC and reducing disk traffic.
They reverted compression to lz4, fixed the container memory policy, and set expectations: “We optimize for p99, not for CPU vanity metrics.” Compression wasn’t the villain; unaccounted memory pressure was.
Joke #2: Turning off compression to reduce latency is like removing your car’s seatbelt to improve acceleration—technically fewer constraints, practically a bad day.
Mini-story 3: The boring but correct practice that saved the day
A large enterprise ran mixed workloads: VMs, file shares, and a few databases that everyone pretended weren’t “production-critical” until they were. They had a policy: weekly scrub windows, monthly SMART checks, and immediate investigation of any rising CRC errors. It was not glamorous work. It was also the reason their worst incident never happened.
One Thursday, latency started spiking in short bursts: 200ms, then normal, then 500ms, then normal. The graphs were maddening. ZFS stats didn’t scream “pool dying.” No disk was officially failed. Yet users were noticing.
The on-call ran zpool iostat -v -l and saw one disk occasionally jumping to huge disk_wait. SMART showed a rising UDMA_CRC_Error_Count. That’s not “replace disk,” that’s “fix the path.” The team reseated the drive, swapped the SAS cable, and moved on.
A week later, a similar spike appeared on another chassis. Same playbook, same fix, no drama. The boring practice—treating CRC errors as first-class alerts—prevented a slow-motion resilver incident that would have crushed performance for days.
The lesson: the best latency spike is the one you never have because you believed your telemetry.
Common mistakes: symptom → root cause → fix
1) “Every 5–10 seconds, writes stall”
Symptom: rhythmic write latency waves; applications time out on commit; graphs look like a heartbeat.
Root cause: TXG commit pressure. Dirty data accumulates faster than the pool can flush, so ZFS applies backpressure and you feel it as periodic stalls.
Fix: Reduce bursty write sources (batch jobs, checkpoints), add vdev performance, avoid RAIDZ for heavy random write workloads, and keep scrub/resilver out of peak hours.
2) “Only sync writes are slow; async is fine”
Symptom: reads okay; buffered writes okay; fsync-heavy apps suffer; NFS with sync exports feels terrible.
Root cause: slow ZIL path—either no SLOG, or a SLOG with terrible tail latency, or a forced sync setting (sync=always).
Fix: Add a proper mirrored SLOG (power-loss-protected), set sync=standard unless you truly need always, and validate the application isn’t fsyncing excessively.
3) “Random read p99 gets worse after we added more tenants”
Symptom: average is fine; p99 is not; spikes correlate with other jobs.
Root cause: ARC eviction churn and cache misses due to memory contention (containers/VMs), plus IO competition on the same vdevs.
Fix: Reserve RAM, set sane ARC limits, isolate noisy workloads onto separate pools/vdevs, or add faster media for metadata/small IO.
4) “Latency spikes appeared after enabling encryption/compression”
Symptom: CPU jumps during spikes; disks not fully utilized.
Root cause: CPU-bound pipeline: compression, checksums, or encryption pushes work onto CPU. If CPU scheduling slips, IO completion appears “slow.”
Fix: Profile CPU, pin workloads appropriately, upgrade CPU, or adjust compression level. Don’t disable integrity features blindly; prove CPU is the bottleneck first.
5) “Everything is fine until scrub/resilver starts, then users scream”
Symptom: predictable performance cliffs during maintenance.
Root cause: insufficient headroom. Maintenance IO competes with production IO; on HDD RAIDZ, the penalty is especially sharp.
Fix: Schedule scrubs, tune priorities if your platform supports it, and—here’s the unpopular one—buy enough disks to have headroom.
6) “One VM causes everyone’s storage to spike”
Symptom: noisy neighbor behavior; bursts of sync writes or small random IO dominate.
Root cause: forced sync workload (journaling, databases) sharing a pool with latency-sensitive reads; or a zvol volblocksize mismatch causing write amplification.
Fix: Isolate that VM’s storage, tune zvols properly, provide SLOG if sync is required, or set per-workload QoS at a higher layer if you have it.
7) “Spikes disappear after reboot… then return”
Symptom: reboot cures symptoms temporarily.
Root cause: ARC warm cache hides disk issues until cache misses return; or long-term fragmentation/snapshot churn returns; or hardware retries build up again.
Fix: Use the warm period to collect baseline metrics, then chase the real root cause: disk health, workload changes, cache sizing, fragmentation management.
A stricter checklist: isolate the bottleneck layer
Step 1: Classify the spike by IO type
- Read spikes: look for ARC misses, metadata latency, one slow disk, or special vdev saturation.
- Write spikes: determine sync vs async. For async: TXG, vdev saturation, RAIDZ math. For sync: ZIL/SLOG path.
- Metadata spikes: lots of file creates/deletes, snapshots, small files, directory operations. Special vdevs help; HDD RAIDZ hates it.
Step 2: Decide if you’re saturated or jittery
- Saturated: %util near 100%, long queues, disk_wait high. Fix is capacity/performance: more vdevs, faster media, fewer competing jobs.
- Jittery: average utilization moderate but p99 awful. Often firmware retries, CRC/cabling, SMR behavior, GC on consumer SSD/NVMe, or sync log tail latency.
Step 3: Verify the topology matches the workload
- Heavy random write and sync workloads prefer mirrors (or striped mirrors) for latency.
- RAIDZ can be great for sequential throughput and capacity efficiency, but it is not a low-latency random-write specialist.
- Special vdevs can transform metadata-heavy performance, but they are now part of the pool’s survival story. Mirror them.
Step 4: Make changes that are reversible first
- Reschedule scrubs and backups.
- Adjust dataset properties like atime, logbias, and primarycache when justified.
- Only after evidence: add SLOG, add vdevs, migrate workload, rebuild pool for correct ashift.
Latency spike patterns and what they usually mean
Pattern: “Short spikes, one disk is always the worst”
This is the easiest one and the most commonly ignored. If one disk consistently shows 10× latency, the pool is doing group work with a coworker who takes smoke breaks during fire drills. Replace it or fix the path.
Pattern: “Spikes only during fsync-heavy events”
Classic ZIL/SLOG story. Also shows up during NFS with synchronous semantics, database commit storms, or VM guest filesystems with aggressive barriers. If you need durability, you need low-latency durable logging hardware.
Pattern: “Spikes when memory is tight”
ARC shrink + misses increase physical IO. If you’re running ZFS on a box that also runs a zoo of containers, be explicit about memory budgets. “It’ll be fine” is not a memory strategy.
Pattern: “Spikes after snapshot retention grew”
Long snapshot chains aren’t evil, but they can amplify the cost of deletes/overwrites and increase fragmentation. Latency suffers first; throughput looks okay until it doesn’t.
Pattern: “Spikes after adding a special vdev”
Special vdev helps when it’s fast and not saturated. If it’s undersized, it becomes the hot spot. If it’s unmirrored (don’t do this), it becomes the single point of pool death.
FAQ
1) Are ZFS latency spikes “normal” because of TXG commits?
Some periodicity can be normal under sustained write load, but big p99 spikes aren’t a feature. If TXGs are causing user-visible stalls, you’re over-driving the pool or fighting a sync/log bottleneck.
2) Should I set sync=disabled to fix latency?
Only if you’re comfortable acknowledging writes that may vanish on power loss or crash. It can “fix” latency by removing durability. In regulated or transactional systems, that’s not a fix; it’s a career choice.
3) Do I always need a SLOG?
No. If your workload rarely issues sync writes (or you can tolerate the latency), you can live without it. If you run databases, VM storage, or synchronous NFS and you care about p99, a good SLOG is often the difference between calm and chaos.
4) Why does one slow disk hurt the whole pool so much?
Because vdevs do coordinated IO. In RAIDZ, parity coordination makes the slowest member set the pace. In mirrors, writes still go to both sides. Tail latency is dominated by the worst participant.
5) How full is “too full” for a ZFS pool?
It depends on workload and vdev type, but above ~80–85% you should expect allocation and fragmentation effects to show up—especially on HDD RAIDZ. If latency is a goal, leave headroom like you mean it.
6) Is lz4 compression good or bad for latency?
Often good. It reduces physical IO and can improve p99 if you were IO-bound. It can be bad if you become CPU-bound (busy boxes, encryption, weak cores). Measure CPU and IO before deciding.
7) Can snapshots cause latency spikes?
Indirectly, yes. Frequent snapshots plus high churn can increase fragmentation and metadata work. Deleting large snapshot sets can also create heavy background activity that users feel.
8) Do special vdevs always improve latency?
They improve metadata and small-block performance when sized and mirrored correctly. But they add a new bottleneck class: special vdev saturation. Treat them as tier-0 and monitor accordingly.
9) Is RAIDZ always worse than mirrors for latency?
For random-write-heavy, low-latency workloads, mirrors usually win. RAIDZ can be excellent for capacity-efficient sequential workloads. Pick based on IO pattern, not ideology.
10) Why do latency spikes sometimes look like network issues?
Because the application experiences “time waiting,” and it can’t tell whether it’s disk, CPU scheduling, lock contention, or packet loss. That’s why you start with layer separation: vmstat/iostat/zpool iostat, then network.
Conclusion: next steps that actually reduce spikes
If you want fewer ZFS latency spikes, don’t start with tuning. Start with classification and evidence.
- Instrument and capture: keep a short rolling capture of zpool iostat -l, iostat -x, and memory stats during peak hours. Spikes that aren’t captured become folklore (a minimal capture sketch follows this list).
- Fix the obvious hardware jitter: outlier disks, CRC errors, and flaky paths. Replace or repair first; tuning later.
- Respect sync semantics: if the workload needs sync, provide a real SLOG and validate its tail latency. If it doesn’t, don’t force sync “just to be safe.”
- Keep headroom: capacity headroom and performance headroom. Scrubs, resilvers, and backups are not optional; plan for them.
- Align topology to workload: mirrors for latency-sensitive random writes, RAIDZ for capacity/throughput where it fits, special vdevs for metadata if you can operate them responsibly.
- Make one change at a time: and measure. If you can’t tell whether it helped, it didn’t—at least not reliably.
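A small capture script is enough to turn “it was spiky around 14:20” into evidence. A minimal sketch, assuming the tank pool, an arcstat binary on the PATH, and a /var/log/latency directory you are free to write to (all hypothetical placeholders):
cr0x@server:~$ cat /usr/local/bin/latency-capture.sh
#!/bin/sh
# Rolling 60-second capture of pool, block-device, memory, and ARC stats.
# Assumes pool "tank" and output directory /var/log/latency (adjust to taste).
OUT=/var/log/latency/$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUT"
zpool iostat -l -v tank 1 60 > "$OUT/zpool-iostat.txt" &
iostat -x 1 60 > "$OUT/iostat.txt" &
vmstat 1 60 > "$OUT/vmstat.txt" &
arcstat 1 60 > "$OUT/arcstat.txt" 2>/dev/null &
wait
Run it from cron during peak hours or trigger it from your alerting hook when p99 crosses a threshold; sixty seconds of correlated data beats an hour of retrospective guessing.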
The point of the checklist isn’t to make you “good at ZFS.” It’s to make you fast at finding the one thing that’s actually causing the spike, before the incident channel develops its own weather system.