Latency spikes are the kind of storage problem that make smart people say weird things in incident channels. Everything is fine… until it isn’t. Your API p99 goes vertical for 800ms, your database starts “waiting for IO,” and your ZFS box looks bored.
ZFS isn’t “randomly slow.” It’s usually doing something very specific at a very specific layer: sync writes, TXG commit, device queueing, memory reclaim, scrub/resilver, or an unlucky workload mismatch. The trick is to stop guessing and run a checklist that narrows the bottleneck in minutes, not hours.
Fast diagnosis playbook (first/second/third)
This is the “I have five minutes before my boss joins the call” sequence. Don’t optimize anything yet. Don’t toggle properties like a DJ. Just identify the layer where the time is going.
First: confirm it’s storage latency, not CPU scheduling or network
- Check system load and IO wait. If CPU is saturated or you’re thrashing memory, storage will look guilty even when it isn’t.
cr0x@server:~$ uptime
14:22:18 up 18 days, 3:11, 2 users, load average: 7.81, 7.34, 6.92
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 1 0 82120 15640 912340 0 0 120 980 540 890 12 5 73 10 0
1 2 0 81288 15640 905112 0 0 240 2030 620 1010 10 6 62 22 0
What it means: High wa (IO wait) suggests the CPU is often stalled waiting on IO. High b means processes blocked, often on disk.
Decision: If wa spikes coincide with application latency spikes, proceed to ZFS-specific checks. If us/sy is pegged, look at CPU bottlenecks first (compression, checksums, encryption).
- Check network if it’s NFS/SMB/iSCSI.
cr0x@server:~$ ss -ti | head -n 15
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 10.0.0.12:2049 10.0.2.45:51712 timer:(keepalive,38min,0) ino:0 sk:3b2
cubic wscale:7,7 rto:204 retrans:0/0 rtt:0.337/0.012 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:818244 bytes_acked:818244 bytes_received:94412 segs_out:571 segs_in:558 send 343Mb/s lastsnd:12 lastrcv:12 lastack:12 pacing_rate 686Mb/s
What it means: Retransmits and huge RTT swings can mimic “storage spikes.”
Decision: If network is clean (stable RTT, no retrans), stay focused on disk/ZFS.
Second: identify whether the spike is sync-write/TXG related
- Watch ZFS latency at the pool layer.
cr0x@server:~$ zpool iostat -v -l 1 10
capacity operations bandwidth total_wait disk_wait
pool alloc free read write read write read write read write
tank 3.12T 1.45T 210 9800 12.3M 402M 2.1ms 85ms 1.9ms 82ms
raidz2-0 3.12T 1.45T 210 9800 12.3M 402M 2.1ms 85ms 1.9ms 82ms
sda - - 20 1250 1.3M 50.1M 1.8ms 88ms 1.7ms 84ms
sdb - - 22 1210 1.4M 49.6M 2.0ms 86ms 1.8ms 83ms
sdc - - 19 1220 1.2M 50.0M 2.1ms 84ms 1.9ms 81ms
What it means: total_wait is what callers feel; disk_wait isolates device service time. If total_wait is high but disk_wait is low, you’re queueing above the disks (TXG, throttling, contention).
Decision: If write wait jumps into tens/hundreds of ms during spikes, suspect sync writes, SLOG, TXG commit pressure, or pool saturation.
Third: check for “maintenance IO” and obvious contention
- Is a scrub/resilver running?
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices is currently being scrubbed.
scan: scrub in progress since Mon Dec 23 02:01:13 2025
1.22T scanned at 1.02G/s, 438G issued at 365M/s, 3.12T total
0B repaired, 14.04% done, 0:02:45 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
What it means: Scrubs and resilvers are legitimate IO storms. They also change IO patterns (more reads, more metadata).
Decision: If spikes line up with scrub/resilver windows, either reschedule, tune scrub behavior, or provision enough headroom to survive maintenance.
A workable mental model of ZFS latency
ZFS latency spikes rarely come from one magical setting. They come from a pipeline where any stage can stall:
- Application semantics: sync vs async writes, fsync storms, database checkpoints.
- Filesystem layer: dataset properties (recordsize, compression, atime), metadata and small-block behavior, special vdevs.
- ARC and memory: cache hits are fast; cache misses and eviction churn aren’t. Memory pressure makes everything angry.
- TXG (transaction groups): ZFS batches changes and commits them. When commit work piles up, you can see periodic pauses or write latency waves.
- ZIL/SLOG (sync writes): sync writes are acknowledged after they are safely logged. If the log device is slow, your application learns new words.
- vdev topology and disks: RAIDZ math, queue depth, SMR weirdness, firmware hiccups, write cache policies.
- Block device layer: scheduler choice, multipath, HBAs, drive timeouts.
When someone says “ZFS spikes,” ask: spikes where? In read latency? write latency? only sync writes? only metadata? only when the pool is 80% full? only during backups? Your job is to turn “spiky” into a plot with a culprit.
One opinion that will save you time: treat ZFS like a database. It has its own batching (TXGs), logging (ZIL), cache (ARC), and background work (scrub/resilver). If you wouldn’t tune a database by randomly changing knobs, don’t do it to ZFS either.
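If you want to see the batching cadence for yourself, Linux OpenZFS exposes the TXG commit interval as a module parameter. A quick, read-only look, assuming a reasonably recent OpenZFS build (the default is typically 5 seconds):
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_txg_timeout
5
That interval is one reason write-latency “waves” often repeat every few seconds under sustained load; Task 15 below inspects the same machinery in more detail.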
Paraphrased idea from Werner Vogels (Amazon CTO): “Everything fails, all the time; design systems that expect it.” That includes your storage latency budget.
Joke #1: ZFS doesn’t have “mood swings.” It has “I/O accounting.” It’s less fun, but more actionable.
Interesting facts and short history (why ZFS behaves like this)
- ZFS debuted in the mid‑2000s with end-to-end checksumming as a first-class feature, not an add-on. That choice influences latency because every block read can involve checksum verification.
- The ZIL exists even without a dedicated SLOG. If you don’t add a log device, ZFS uses space on the pool devices. Your “no SLOG” setup still has sync logging; it’s just slower and competes with normal writes.
- TXG commit is periodic. ZFS batches dirty data and metadata, then flushes. That batching increases throughput but can create rhythmic latency pulses under sustained dirtying.
- Copy-on-write is both a feature and a tax. It prevents in-place overwrites (good for integrity and snapshots) but can increase fragmentation and metadata work over time, affecting tail latency.
- RAIDZ is not “free RAID.” It saves disks but makes small random writes expensive (read-modify-write patterns). Tail latency suffers first.
- 4K sector reality arrived after many arrays were designed. Misaligned ashift choices (or old 512e assumptions) can turn “normal writes” into amplified IO.
- ARC was built to be adaptive, not polite. It will happily eat memory for cache; if your workload needs RAM elsewhere (VMs, page cache, databases), the resulting pressure can manifest as storage jitter.
- Special vdevs are modern ZFS’s “metadata SSD” story. They can dramatically reduce metadata latency—unless they saturate or fail, in which case the entire pool experience changes.
- Compression became mainstream in ZFS before it was cool. It trades CPU cycles for fewer IOs; on fast NVMe, CPU becomes the bottleneck more often than people expect, causing “storage latency” symptoms.
Checklists / step-by-step plan (with commands)
This is the main event: practical tasks you can run during an incident and again during a calm postmortem. Each task includes (1) command, (2) what the output means, (3) the decision you make.
Task 1: Capture pool-level latency and separate queueing from disk service time
cr0x@server:~$ zpool iostat -l 1 30
capacity operations bandwidth total_wait disk_wait
pool alloc free read write read write read write read write
tank 3.12T 1.45T 320 9200 18.2M 380M 3.2ms 96ms 2.8ms 91ms
tank 3.12T 1.45T 310 9300 17.8M 387M 2.9ms 140ms 2.6ms 132ms
tank 3.12T 1.45T 290 9100 16.4M 376M 3.0ms 35ms 2.7ms 31ms
What it means: When total_wait balloons, that’s user-visible. If disk_wait tracks it, the disks (or HBAs) are slow. If disk_wait stays modest while total_wait spikes, you’re bottlenecked above disk: throttling, TXG congestion, ZIL contention, lock contention.
Decision: Disk_wait-driven spikes send you to device/hardware/queue checks. Total_wait-only spikes send you to sync/TXG/ARC checks.
Task 2: Find the guilty vdev or single slow disk
cr0x@server:~$ zpool iostat -v -l 2 10 tank
capacity operations bandwidth total_wait disk_wait
pool alloc free read write read write read write read write
tank 3.12T 1.45T 300 9100 16.8M 375M 3.0ms 72ms 2.7ms 68ms
raidz2-0 3.12T 1.45T 300 9100 16.8M 375M 3.0ms 72ms 2.7ms 68ms
sda - - 30 1200 1.8M 48.9M 2.6ms 70ms 2.4ms 66ms
sdb - - 29 1180 1.7M 48.2M 2.7ms 72ms 2.5ms 69ms
sdc - - 28 1190 1.6M 48.6M 35ms 250ms 33ms 240ms
What it means: One device with 10× latency drags a RAIDZ vdev down because parity math forces coordination. Mirrors also suffer: reads can avoid a slow side, writes can’t.
Decision: If one disk is the outlier, check SMART, cabling, HBA pathing. Replace the disk if it keeps spiking. Don’t “tune ZFS” around a dying drive.
Task 3: Confirm whether writes are synchronous (and whether your workload forces it)
cr0x@server:~$ zfs get -r sync tank
NAME PROPERTY VALUE SOURCE
tank sync standard default
tank/db sync standard inherited from tank
tank/vm sync always local
What it means: sync=always forces every write to behave like a sync write. Some workloads need it (databases), many don’t (bulk logs). standard obeys the application (fsync/O_DSYNC).
Decision: If a dataset is sync=always without a clear reason, set it back to standard after confirming application durability needs. If it’s truly required, you must provide a good SLOG or accept the latency.
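If the investigation says a dataset does not need forced sync, the change is a reversible one-liner. A minimal sketch using the tank/vm dataset from the output above; confirm the application’s durability requirements before flipping it:
cr0x@server:~$ zfs set sync=standard tank/vm
cr0x@server:~$ zfs get sync tank/vm
NAME PROPERTY VALUE SOURCE
tank/vm sync standard local
The property takes effect immediately and can be set back to always just as quickly if durability questions come up.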
Task 4: Check if you actually have a SLOG, and whether it’s healthy
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
logs
nvme0n1p2 ONLINE 0 0 0
What it means: If the pool has a logs section, you have a separate log device. If not, sync writes land on the main vdevs. Also: an ONLINE SLOG can still be slow.
Decision: If sync latency is killing you and there is no SLOG, add one (properly mirrored if durability matters). If there is a SLOG, benchmark/validate it; cheap consumer NVMe without power-loss protection can show catastrophic tail latency once every sync write must reach stable media.
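Adding a mirrored log vdev is mechanically simple; the hard part is picking devices with power-loss protection and low sync-write latency. A sketch with hypothetical device paths (substitute your own by-id names):
cr0x@server:~$ zpool add tank log mirror /dev/disk/by-id/nvme-EXAMPLE-A-part2 /dev/disk/by-id/nvme-EXAMPLE-B-part2
cr0x@server:~$ zpool status tank | grep -A 3 logs
logs
mirror-2 ONLINE 0 0 0
nvme-EXAMPLE-A-part2 ONLINE 0 0 0
nvme-EXAMPLE-B-part2 ONLINE 0 0 0
A log vdev can also be removed later with zpool remove, which makes this a comparatively low-risk change next to topology decisions.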
Task 5: Validate dataset recordsize and volblocksize against the workload
cr0x@server:~$ zfs get recordsize,volblocksize,primarycache,logbias tank/vm tank/db
NAME PROPERTY VALUE SOURCE
tank/vm recordsize 128K inherited from tank
tank/vm volblocksize - -
tank/vm primarycache all default
tank/vm logbias latency default
tank/db recordsize 16K local
tank/db volblocksize - -
tank/db primarycache all default
tank/db logbias latency local
What it means: Databases often like 8K–16K recordsize for files; VM images may prefer volblocksize tuned at zvol creation time (common values 8K–64K depending on hypervisor). Wrong sizes increase read-modify-write and metadata pressure.
Decision: If you see a mismatch, plan a migration (you can’t change volblocksize after creation). Don’t do this mid-incident unless you like overtime.
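Unlike volblocksize, recordsize can be changed at any time, but it only applies to blocks written after the change; existing data keeps its old block size until it is rewritten (for example via send/recv into a fresh dataset). A minimal sketch, assuming a hypothetical staging dataset tank/db2 for a database doing 16K IO:
cr0x@server:~$ zfs create -o recordsize=16K -o compression=lz4 tank/db2
cr0x@server:~$ zfs get recordsize tank/db2
NAME PROPERTY VALUE SOURCE
tank/db2 recordsize 16K local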
Task 6: Check pool fullness and fragmentation pressure
cr0x@server:~$ zpool list -o name,size,alloc,free,cap,frag,health
NAME SIZE ALLOC FREE CAP FRAG HEALTH
tank 4.50T 3.12T 1.38T 69% 41% ONLINE
What it means: High cap (especially above ~80–85%) and high fragmentation often correlate with worse tail latency, particularly on HDD RAIDZ. ZFS has fewer contiguous free segments; allocation gets more expensive.
Decision: If cap is high, free space (delete, move, expand). If frag is high and performance is collapsing, consider rewriting data (send/recv to a fresh pool) or adding vdevs to increase allocation choices.
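Fullness and fragmentation are also per-vdev problems, so look one level down before deciding. A sketch with per-disk rows omitted for brevity (same pool as above):
cr0x@server:~$ zpool list -v -o name,size,alloc,free,cap,frag tank
NAME SIZE ALLOC FREE CAP FRAG
tank 4.50T 3.12T 1.38T 69% 41%
raidz2-0 4.50T 3.12T 1.38T 69% 41%
In a multi-vdev pool, one vdev that is far fuller or far more fragmented than the others pushes new allocations onto the healthier vdevs, and latency becomes lopsided.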
Task 7: Check ARC size, hit ratio, and memory pressure signals
cr0x@server:~$ arcstat 1 5
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
14:25:01 820 44 5 10 1 34 4 0 0 62G 64G
14:25:02 910 210 23 80 9 130 14 0 0 62G 64G
14:25:03 840 190 22 70 8 120 14 0 0 58G 64G
14:25:04 870 55 6 12 1 43 5 0 0 58G 64G
What it means: Miss% spikes and ARC shrinking can indicate memory pressure or a workload shift. When ARC can’t hold hot data, reads turn into real disk IO, and latency gets spiky.
Decision: If ARC is volatile and miss% jumps during incidents, check overall memory, reclaim behavior, and whether something else (VMs, containers) is eating RAM. Consider reserving memory, tuning ARC max, or moving the noisy tenant.
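On Linux OpenZFS the ARC ceiling is the zfs_arc_max module parameter (bytes; 0 means auto-sized). It can be changed at runtime and persisted via modprobe options; the 48 GiB figure below is purely illustrative:
cr0x@server:~$ echo 51539607552 > /sys/module/zfs/parameters/zfs_arc_max
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max
51539607552
To make it survive a reboot, set options zfs zfs_arc_max=51539607552 in /etc/modprobe.d/zfs.conf. Lowering the cap shrinks ARC gradually, so don’t expect the graph to drop instantly.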
Task 8: Identify sync-write pressure via ZIL statistics
cr0x@server:~$ kstat -p | egrep 'zfs:0:zil:|zfs:0:vdev_sync' | head
zfs:0:zil:zil_commit_count 184220
zfs:0:zil:zil_commit_writer_count 12011
zfs:0:zil:zil_commit_waiter_count 31255
zfs:0:vdev_sync:vdev_sync_write_bytes 98342199296
What it means: Rising commit and waiter counts point to heavy sync activity. If waiters pile up, applications are blocking on log commits.
Decision: If sync pressure is high, validate SLOG latency, logbias settings, and whether the app is fsyncing too often (or running in a “safe but slow” mode).
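The kstat tool above is the illumos-style view. On Linux OpenZFS the same class of ZIL counters lives under /proc/spl/kstat/zfs; a sketch, assuming a reasonably recent build (exact counter names shift a little between versions):
cr0x@server:~$ grep -E 'zil_commit_count|zil_commit_writer_count' /proc/spl/kstat/zfs/zil
zil_commit_count 4 184220
zil_commit_writer_count 4 12011
Sample it twice a few seconds apart; the rate of change during a spike is more useful than the absolute numbers.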
Task 9: Confirm whether a scrub, resilver, or device error recovery is consuming IO
cr0x@server:~$ zpool status tank | sed -n '1,25p'
pool: tank
state: ONLINE
scan: scrub repaired 0B in 0 days 03:22:11 with 0 errors on Sun Dec 22 03:22:11 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
What it means: A clean pool status doesn’t mean “no background IO,” but it rules out the obvious maintenance tasks.
Decision: If scan is active during spikes, decide whether to pause (in some environments), reschedule, or accept the hit as the cost of integrity.
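If the scrub has to yield right now, recent OpenZFS can pause and later resume it instead of cancelling outright. A sketch; verify your platform supports pause before putting this in a runbook:
cr0x@server:~$ zpool scrub -p tank
cr0x@server:~$ zpool scrub tank
The first command pauses the in-progress scrub; running zpool scrub again later resumes it from where it stopped rather than starting over.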
Task 10: Check per-disk IO queueing and latency at the Linux block layer
cr0x@server:~$ iostat -x 1 5
Linux 6.6.0 (server) 12/25/2025 _x86_64_ (32 CPU)
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 25.0 1200.0 1.6 49.0 86.0 9.2 62.0 18.0 63.0 0.8 98.0
sdb 24.0 1180.0 1.5 48.3 86.0 8.9 60.0 17.5 61.0 0.8 97.0
sdc 23.0 1190.0 1.4 48.6 86.0 28.1 220.0 21.0 225.0 0.8 99.0
What it means: avgqu-sz and await show queue buildup. %util near 100% indicates saturation. A single disk with huge queue and await is your latency spike generator.
Decision: If the block layer shows the same outlier disk as ZFS, this is not a ZFS tuning problem. Pull SMART, check controller logs, and prepare replacement.
Task 11: Check drive health and “this disk is lying to you” counters
cr0x@server:~$ smartctl -a /dev/sdc | egrep -i 'Reallocated|Pending|Uncorrect|CRC|Power_On_Hours|Temperature|Error'
Power_On_Hours 38122
Temperature_Celsius 44
Reallocated_Sector_Ct 0
Current_Pending_Sector 8
Offline_Uncorrectable 2
UDMA_CRC_Error_Count 19
What it means: Pending/uncorrectable sectors can cause long internal retries. CRC errors often mean cabling/backplane issues, which produce intermittent latency spikes rather than clean failures.
Decision: Pending sectors and uncorrectables are a “replace soon” sign; CRC errors are a “fix the path” sign. Either can explain periodic 200ms–2s stalls.
Task 12: Check dataset properties that quietly cause extra IO
cr0x@server:~$ zfs get -r atime,compression,xattr,acltype,logbias tank | head -n 20
NAME PROPERTY VALUE SOURCE
tank atime off local
tank compression lz4 local
tank xattr sa local
tank acltype posixacl local
tank logbias latency default
tank/db atime off inherited from tank
tank/db compression lz4 inherited from tank
tank/db xattr sa inherited from tank
tank/db acltype posixacl inherited from tank
tank/db logbias latency local
What it means: atime=on on read-heavy workloads generates writes on reads, which is a special kind of self-own. Compression can reduce IO but increase CPU latency. xattr=sa often helps metadata-heavy workloads.
Decision: If you see atime=on on hot datasets and you don’t need it, turn it off. If compression is heavy and CPU spikes correlate with latency spikes, consider adjusting compression or adding CPU.
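Disabling atime is one of the few safe mid-incident changes because it is instant and reversible. A sketch with a hypothetical read-heavy dataset; relatime=on is a middle ground if something actually consumes access times:
cr0x@server:~$ zfs set atime=off tank/share
cr0x@server:~$ zfs get atime tank/share
NAME PROPERTY VALUE SOURCE
tank/share atime off local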
Task 13: Check if snapshots and deletes are creating pathological metadata work
cr0x@server:~$ zfs list -t snapshot -o name,used,creation -s creation | tail -n 5
tank/db@auto-2025-12-25-0100 2.1G Mon Dec 25 01:00 2025
tank/db@auto-2025-12-25-0200 2.2G Mon Dec 25 02:00 2025
tank/db@auto-2025-12-25-0300 2.2G Mon Dec 25 03:00 2025
tank/db@auto-2025-12-25-0400 2.3G Mon Dec 25 04:00 2025
tank/db@auto-2025-12-25-0500 2.4G Mon Dec 25 05:00 2025
What it means: Snapshots are cheap to take, not always cheap to keep. Large snapshot trees + constant overwrites can increase fragmentation and metadata churn, which shows up as jitter.
Decision: If you have aggressive snapshot schedules on high-churn datasets, tighten retention or move that workload to mirrors/NVMe where metadata churn is less punishing.
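To see which datasets carry the heaviest snapshot load, count snapshots per dataset instead of eyeballing the tail of the list. A plain shell pipeline over standard zfs list output (the counts shown are illustrative):
cr0x@server:~$ zfs list -H -t snapshot -o name | awk -F'@' '{print $1}' | sort | uniq -c | sort -rn | head -n 5
912 tank/db
310 tank/vm
48 tank/home
12 tank/share
4 tank
A dataset with hundreds of snapshots and constant overwrites is exactly where retention tightening pays off first.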
Task 14: Validate ashift and sector alignment (the “forever” setting)
cr0x@server:~$ zdb -C tank | egrep 'ashift|path' | head -n 20
path: '/dev/disk/by-id/ata-ST12000NM0008-2H2161_ZHZ12345'
ashift: 12
path: '/dev/disk/by-id/ata-ST12000NM0008-2H2161_ZHZ23456'
ashift: 12
What it means: ashift=12 means 4K sectors; ashift=9 means 512B. Setting ashift too low on 4K drives can create write amplification and ugly tail latency.
Decision: If ashift is wrong, you can’t “fix it” in place. Plan a rebuild/migration. Put it in the postmortem as “we will not do this again.”
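The only real fix happens at creation time, when ashift is specified for the new pool (or any vdev added later). A sketch for the eventual rebuild, with hypothetical device paths; modern OpenZFS usually detects 4K drives correctly, but stating it removes the guesswork:
cr0x@server:~$ zpool create -o ashift=12 tank2 raidz2 /dev/disk/by-id/ata-EXAMPLE-1 /dev/disk/by-id/ata-EXAMPLE-2 /dev/disk/by-id/ata-EXAMPLE-3 /dev/disk/by-id/ata-EXAMPLE-4
cr0x@server:~$ zdb -C tank2 | grep ashift
ashift: 12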
Task 15: Find throttling and TXG pressure signals (Linux OpenZFS)
cr0x@server:~$ cat /proc/spl/kstat/zfs/tank/txgs
txg birth state ndirty
1064217 1703510152 open 2147483648
1064216 1703510150 quiescing 1987654321
1064215 1703510148 syncing 1876543210
What it means: Multiple TXGs in quiescing/syncing with very high dirty bytes suggests the system is struggling to flush. That can manifest as write latency spikes and occasional stalls when ZFS applies backpressure.
Decision: If TXGs are stuck syncing, reduce dirtying rate (application throttling, tune write bursts), increase vdev performance, or reduce competing background tasks. Don’t just raise dirty data limits and hope.
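Before changing anything, at least record what the write throttle is currently allowed to do. On Linux OpenZFS the relevant knobs are module parameters; a read-only look, with values that are typical defaults (zfs_dirty_data_max scales with RAM):
cr0x@server:~$ grep . /sys/module/zfs/parameters/zfs_dirty_data_max /sys/module/zfs/parameters/zfs_dirty_data_max_percent /sys/module/zfs/parameters/zfs_delay_min_dirty_percent
/sys/module/zfs/parameters/zfs_dirty_data_max:4294967296
/sys/module/zfs/parameters/zfs_dirty_data_max_percent:10
/sys/module/zfs/parameters/zfs_delay_min_dirty_percent:60
If dirty bytes in the txgs output routinely sit near zfs_dirty_data_max, the pool is being throttled; the durable fix is more flush capacity, not a bigger dirty buffer.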
Task 16: Check special vdev health and utilization (metadata on SSD)
cr0x@server:~$ zpool status -v tank | sed -n '1,80p'
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
nvme2n1 ONLINE 0 0 0
What it means: A special vdev holds metadata (and optionally small blocks). If it’s unhealthy or undersized, metadata operations can spike even if bulk data is fine. Also: if special vdev fails and you lose redundancy, pool risk skyrockets.
Decision: If special vdev is present, treat it as tier-0 infrastructure. Monitor it like you monitor your database WAL device. If it’s small or slow, fix that before chasing ghosts.
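Two quick checks keep a special vdev honest: whether small data blocks are being routed to it deliberately, and how full it is. A sketch (special_small_blocks=0 means metadata only; watch the special class rows in zpool list -v and zpool iostat -v for fill and latency):
cr0x@server:~$ zfs get -r special_small_blocks tank | head -n 4
NAME PROPERTY VALUE SOURCE
tank special_small_blocks 0 default
tank/db special_small_blocks 16K local
tank/vm special_small_blocks 0 default
A special vdev that fills up silently starts spilling new metadata to the main vdevs, which looks exactly like “metadata got slow for no reason.”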
Task 17: Correlate application sync behavior with storage spikes
cr0x@server:~$ pidstat -d 1 5 | egrep 'postgres|mysqld|qemu|java' || true
14:28:11 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
14:28:12 113 2218 0.00 402112.00 1024.00 312 postgres
14:28:13 113 2218 0.00 398740.00 980.00 411 postgres
What it means: iodelay growing indicates the process is waiting on IO. If your DB process is the one blocked during spikes, stop blaming the load balancer.
Decision: If one process dominates IO wait, analyze its write pattern (checkpoints, fsync storms, backup jobs). The fix may be in the application schedule, not ZFS.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
They moved a transactional service to a new ZFS-backed storage cluster. The architecture review went smoothly. “We’re on SSDs now,” someone said, which was treated like a universal solvent for performance risk.
The first week was quiet. Then month-end hit. The service started showing 1–3 second write stalls. Not constant—just enough to trigger retries, amplify load, and make the rest of the system look flaky. The incident commander did the usual tour: CPU fine, network fine, “ZFS is spiking.”
The wrong assumption was subtle: they assumed the presence of an NVMe device meant “fast sync writes.” The pool had no dedicated SLOG, and the main vdev layout was RAIDZ. The workload did frequent fsyncs. During bursts, the pool had to do sync logging on the same devices handling regular writes and parity work.
Once they looked at pool total_wait versus disk_wait, it was obvious: disk_wait soared during sync bursts. They added a mirrored pair of power-loss-protected SLOG devices, so the fsync-heavy dataset’s log records landed on the new mirror instead of competing with regular writes and parity work (a SLOG serves the whole pool; the other datasets simply issued few sync writes and stayed at sync=standard). Latency spikes didn’t vanish; they shrank into the noise floor where monitoring graphs go to die.
The lesson that stuck: “SSD” is not a performance guarantee. It’s a medium. The architecture is still your problem.
Mini-story 2: The optimization that backfired
A different team was battling random read latency spikes on a busy file server. Someone suggested turning off compression “to reduce CPU overhead.” It sounded reasonable: less CPU, more speed. They flipped compression=off on the hottest dataset.
Within hours, p99 got worse. Not slightly worse—meaningfully worse. CPU went down a bit, yes, but disk read bandwidth went up, and so did queue depth. The pool was now servicing more physical IO for the same logical workload, and those extra IOs were the ones that show up in tail latency.
The real issue turned out to be ARC misses triggered by a new workload mix plus a memory limit introduced by container settings. Compression had been hiding some of the pressure by making blocks smaller, effectively increasing the “useful capacity” of ARC and reducing disk traffic.
They reverted compression to lz4, fixed the container memory policy, and set expectations: “We optimize for p99, not for CPU vanity metrics.” Compression wasn’t the villain; unaccounted memory pressure was.
Joke #2: Turning off compression to reduce latency is like removing your car’s seatbelt to improve acceleration—technically fewer constraints, practically a bad day.
Mini-story 3: The boring but correct practice that saved the day
A large enterprise ran mixed workloads: VMs, file shares, and a few databases that everyone pretended weren’t “production-critical” until they were. They had a policy: weekly scrub windows, monthly SMART checks, and immediate investigation of any rising CRC errors. It was not glamorous work. It was also the reason their worst incident never happened.
One Thursday, latency started spiking in short bursts: 200ms, then normal, then 500ms, then normal. The graphs were maddening. ZFS stats didn’t scream “pool dying.” No disk was officially failed. Yet users were noticing.
The on-call ran zpool iostat -v -l and saw one disk occasionally jumping to huge disk_wait. SMART showed a rising UDMA_CRC_Error_Count. That’s not “replace disk,” that’s “fix the path.” The team reseated the drive, swapped the SAS cable, and moved on.
A week later, a similar spike appeared on another chassis. Same playbook, same fix, no drama. The boring practice—treating CRC errors as first-class alerts—prevented a slow-motion resilver incident that would have crushed performance for days.
The lesson: the best latency spike is the one you never have because you believed your telemetry.
Common mistakes: symptom → root cause → fix
1) “Every 5–10 seconds, writes stall”
Symptom: rhythmic write latency waves; applications time out on commit; graphs look like a heartbeat.
Root cause: TXG commit pressure. Dirty data accumulates faster than the pool can flush, so ZFS applies backpressure and you feel it as periodic stalls.
Fix: Reduce bursty write sources (batch jobs, checkpoints), add vdev performance, avoid RAIDZ for heavy random write workloads, and keep scrub/resilver out of peak hours.
2) “Only sync writes are slow; async is fine”
Symptom: reads okay; buffered writes okay; fsync-heavy apps suffer; NFS with sync exports feels terrible.
Root cause: slow ZIL path—either no SLOG, or a SLOG with terrible tail latency, or a forced sync setting (sync=always).
Fix: Add a proper mirrored SLOG (power-loss-protected), set sync=standard unless you truly need always, and validate the application isn’t fsyncing excessively.
3) “Random read p99 gets worse after we added more tenants”
Symptom: average is fine; p99 is not; spikes correlate with other jobs.
Root cause: ARC eviction churn and cache misses due to memory contention (containers/VMs), plus IO competition on the same vdevs.
Fix: Reserve RAM, set sane ARC limits, isolate noisy workloads onto separate pools/vdevs, or add faster media for metadata/small IO.
4) “Latency spikes appeared after enabling encryption/compression”
Symptom: CPU jumps during spikes; disks not fully utilized.
Root cause: CPU-bound pipeline: compression, checksums, or encryption pushes work onto CPU. If CPU scheduling slips, IO completion appears “slow.”
Fix: Profile CPU, pin workloads appropriately, upgrade CPU, or adjust compression level. Don’t disable integrity features blindly; prove CPU is the bottleneck first.
5) “Everything is fine until scrub/resilver starts, then users scream”
Symptom: predictable performance cliffs during maintenance.
Root cause: insufficient headroom. Maintenance IO competes with production IO; on HDD RAIDZ, the penalty is especially sharp.
Fix: Schedule scrubs, tune priorities if your platform supports it, and—here’s the unpopular one—buy enough disks to have headroom.
6) “One VM causes everyone’s storage to spike”
Symptom: noisy neighbor behavior; bursts of sync writes or small random IO dominate.
Root cause: forced sync workload (journaling, databases) sharing a pool with latency-sensitive reads; or a zvol volblocksize mismatch causing write amplification.
Fix: Isolate that VM’s storage, tune zvols properly, provide SLOG if sync is required, or set per-workload QoS at a higher layer if you have it.
7) “Spikes disappear after reboot… then return”
Symptom: reboot cures symptoms temporarily.
Root cause: ARC warm cache hides disk issues until cache misses return; or long-term fragmentation/snapshot churn returns; or hardware retries build up again.
Fix: Use the warm period to collect baseline metrics, then chase the real root cause: disk health, workload changes, cache sizing, fragmentation management.
A stricter checklist: isolate the bottleneck layer
Step 1: Classify the spike by IO type
- Read spikes: look for ARC misses, metadata latency, one slow disk, or special vdev saturation.
- Write spikes: determine sync vs async. For async: TXG, vdev saturation, RAIDZ math. For sync: ZIL/SLOG path.
- Metadata spikes: lots of file creates/deletes, snapshots, small files, directory operations. Special vdevs help; HDD RAIDZ hates it.
Step 2: Decide if you’re saturated or jittery
- Saturated: %util near 100%, long queues, disk_wait high. Fix is capacity/performance: more vdevs, faster media, fewer competing jobs.
- Jittery: average utilization moderate but p99 awful. Often firmware retries, CRC/cabling, SMR behavior, GC on consumer SSD/NVMe, or sync log tail latency.
Step 3: Verify the topology matches the workload
- Heavy random write and sync workloads prefer mirrors (or striped mirrors) for latency.
- RAIDZ can be great for sequential throughput and capacity efficiency, but it is not a low-latency random-write specialist.
- Special vdevs can transform metadata-heavy performance, but they are now part of the pool’s survival story. Mirror them.
Step 4: Make changes that are reversible first
- Reschedule scrubs and backups.
- Adjust dataset properties like atime, logbias, and primarycache when justified.
- Only after evidence: add SLOG, add vdevs, migrate workload, rebuild pool for correct ashift.
Latency spike patterns and what they usually mean
Pattern: “Short spikes, one disk is always the worst”
This is the easiest one and the most commonly ignored. If one disk consistently shows 10× latency, the pool is doing group work with a coworker who takes smoke breaks during fire drills. Replace it or fix the path.
Pattern: “Spikes only during fsync-heavy events”
Classic ZIL/SLOG story. Also shows up during NFS with synchronous semantics, database commit storms, or VM guest filesystems with aggressive barriers. If you need durability, you need low-latency durable logging hardware.
Pattern: “Spikes when memory is tight”
ARC shrink + misses increase physical IO. If you’re running ZFS on a box that also runs a zoo of containers, be explicit about memory budgets. “It’ll be fine” is not a memory strategy.
Pattern: “Spikes after snapshot retention grew”
Long snapshot chains aren’t evil, but they can amplify the cost of deletes/overwrites and increase fragmentation. Latency suffers first; throughput looks okay until it doesn’t.
Pattern: “Spikes after adding a special vdev”
Special vdev helps when it’s fast and not saturated. If it’s undersized, it becomes the hot spot. If it’s unmirrored (don’t do this), it becomes the single point of pool death.
FAQ
1) Are ZFS latency spikes “normal” because of TXG commits?
Some periodicity can be normal under sustained write load, but big p99 spikes aren’t a feature. If TXGs are causing user-visible stalls, you’re over-driving the pool or fighting a sync/log bottleneck.
2) Should I set sync=disabled to fix latency?
Only if you’re comfortable acknowledging writes that may vanish on power loss or crash. It can “fix” latency by removing durability. In regulated or transactional systems, that’s not a fix; it’s a career choice.
3) Do I always need a SLOG?
No. If your workload rarely issues sync writes (or you can tolerate the latency), you can live without it. If you run databases, VM storage, or synchronous NFS and you care about p99, a good SLOG is often the difference between calm and chaos.
4) Why does one slow disk hurt the whole pool so much?
Because vdevs do coordinated IO. In RAIDZ, parity coordination makes the slowest member set the pace. In mirrors, writes still go to both sides. Tail latency is dominated by the worst participant.
5) How full is “too full” for a ZFS pool?
It depends on workload and vdev type, but above ~80–85% you should expect allocation and fragmentation effects to show up—especially on HDD RAIDZ. If latency is a goal, leave headroom like you mean it.
6) Is lz4 compression good or bad for latency?
Often good. It reduces physical IO and can improve p99 if you were IO-bound. It can be bad if you become CPU-bound (busy boxes, encryption, weak cores). Measure CPU and IO before deciding.
7) Can snapshots cause latency spikes?
Indirectly, yes. Frequent snapshots plus high churn can increase fragmentation and metadata work. Deleting large snapshot sets can also create heavy background activity that users feel.
8) Do special vdevs always improve latency?
They improve metadata and small-block performance when sized and mirrored correctly. But they add a new bottleneck class: special vdev saturation. Treat them as tier-0 and monitor accordingly.
9) Is RAIDZ always worse than mirrors for latency?
For random-write-heavy, low-latency workloads, mirrors usually win. RAIDZ can be excellent for capacity-efficient sequential workloads. Pick based on IO pattern, not ideology.
10) Why do latency spikes sometimes look like network issues?
Because the application experiences “time waiting,” and it can’t tell whether it’s disk, CPU scheduling, lock contention, or packet loss. That’s why you start with layer separation: vmstat/iostat/zpool iostat, then network.
Conclusion: next steps that actually reduce spikes
If you want fewer ZFS latency spikes, don’t start with tuning. Start with classification and evidence.
- Instrument and capture: keep a short rolling capture of zpool iostat -l, iostat -x, and memory stats during peak hours. Spikes that aren’t captured become folklore (a minimal capture sketch follows this list).
- Fix the obvious hardware jitter: outlier disks, CRC errors, and flaky paths. Replace or repair first; tuning later.
- Respect sync semantics: if the workload needs sync, provide a real SLOG and validate its tail latency. If it doesn’t, don’t force sync “just to be safe.”
- Keep headroom: capacity headroom and performance headroom. Scrubs, resilvers, and backups are not optional; plan for them.
- Align topology to workload: mirrors for latency-sensitive random writes, RAIDZ for capacity/throughput where it fits, special vdevs for metadata if you can operate them responsibly.
- Make one change at a time: and measure. If you can’t tell whether it helped, it didn’t—at least not reliably.
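A small capture script is enough to turn “it was spiky around 14:20” into evidence. A minimal sketch, assuming the tank pool, an arcstat binary on the PATH, and a /var/log/latency directory you are free to write to (all hypothetical placeholders):
cr0x@server:~$ cat /usr/local/bin/latency-capture.sh
#!/bin/sh
# Rolling 60-second capture of pool, block-device, memory, and ARC stats.
# Assumes pool "tank" and output directory /var/log/latency (adjust to taste).
OUT=/var/log/latency/$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUT"
zpool iostat -l -v tank 1 60 > "$OUT/zpool-iostat.txt" &
iostat -x 1 60 > "$OUT/iostat.txt" &
vmstat 1 60 > "$OUT/vmstat.txt" &
arcstat 1 60 > "$OUT/arcstat.txt" 2>/dev/null &
wait
Run it from cron during peak hours or trigger it from your alerting hook when p99 crosses a threshold; sixty seconds of correlated data beats an hour of retrospective guessing.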
The point of the checklist isn’t to make you “good at ZFS.” It’s to make you fast at finding the one thing that’s actually causing the spike, before the incident channel develops its own weather system.