Storage outages rarely start as “everything is down.” They start as someone in Slack saying, “why is the app… sticky?” Then your database checkpoints take longer, your queues lengthen, and suddenly your SLOs look like a ski slope.
ZFS is excellent at keeping data correct. It’s also excellent at hiding the early warning signs inside a few counters you’re not graphing. This is about those counters: the graphs that tell you “something is wrong” while you still have time to fix it without a war room.
What “latency” means in ZFS (and why your app cares)
Most teams graph throughput because it’s gratifying: big numbers, colorful dashboards, leadership-friendly. Latency is the number that actually hurts users. One slow I/O can stall a transaction, block a worker thread, and amplify into a queue that takes minutes to unwind.
ZFS adds its own layers. That’s not a criticism; it’s a reality of copy-on-write, checksums, compression, and transaction groups. Your I/O doesn’t just “hit the disk.” It moves through:
- Application and filesystem path (syscalls, VFS)
- ZFS intent/logging for synchronous semantics (ZIL/SLOG)
- ARC (RAM cache) and optional L2ARC (flash cache)
- DMU / metaslab allocation (where blocks go)
- vdev queues (the place where “this looks fine” becomes “why is everything waiting”)
- physical devices (including their firmware moods)
When someone says “ZFS is slow,” translate it. Ask: slow reads or slow writes? synchronous or asynchronous? small random IO or large sequential? hot set fits in ARC or not? That’s how you pick the right graphs and the right fix.
One more practical point: average latency is a liar. You want percentiles (p95, p99) and you want queue depth context. A system can have “low average latency” while p99 is burning your app alive.
Joke #1: Average latency is like average temperature—great until you realize your head is in the oven and your feet are in the freezer.
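If you run OpenZFS 0.7 or newer, you don’t have to infer latency from ops and bandwidth: zpool iostat can emit latency histograms, which are the raw material for percentiles. A minimal sketch, using the tank pool from the tasks later in this article:
cr0x@server:~$ sudo zpool iostat -w tank 5
# -w prints per-bucket counts (total wait, disk wait, sync/async queue wait).
# Ship the buckets to your metrics pipeline and derive p95/p99 from them
# instead of graphing a single average.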
Interesting facts and history (the useful kind)
- ZFS was born at Sun Microsystems with a design goal that data integrity beats everything else, which is why checksums and copy-on-write are non-negotiable features.
- The “ZIL” exists even without a SLOG: ZIL is the in-pool log, and SLOG is a separate device that can hold ZIL records to accelerate sync writes.
- TXGs (transaction groups) are the heartbeat: ZFS batches changes and periodically flushes them. When flushes get slow, latency gets weird—and then catastrophic.
- ARC is not “just a cache”: ARC interacts with memory pressure, metadata, and prefetch behavior; it can shift workload from disks to RAM in surprising ways.
- L2ARC used to be riskier: older implementations could consume a lot of RAM for metadata and warmup behavior could disappoint. Modern systems improved, but it’s still not free.
- Compression can reduce latency when it turns random reads into fewer device IOs—until CPU becomes the bottleneck or recordsize misaligns with access patterns.
- RAIDZ changes the write story: small blocks on RAIDZ pay parity and padding overhead, and partial-record updates incur record-level read-modify-write, both of which show up as higher latency and device queueing.
- Scrubs are operationally mandatory, but they compete for IO. If you don’t shape scrub impact, your users will shape it for you via angry tickets.
- Special vdevs changed metadata economics: putting metadata (and optionally small blocks) on fast SSD can dramatically improve latency for metadata-heavy workloads.
One quote, because it survives every storage incident: “Hope is not a strategy.”
— James Cameron
The graphs that catch disaster early
Latency monitoring isn’t one graph. It’s a small set of graphs that tell a coherent story. If your monitoring tool only supports a few panels, pick the ones below. If you can do more, do more. You’re buying time.
1) Pool and vdev latency (read/write) with percentiles
You want read and write latency per vdev, not just the pool average. Pools hide murder. One degraded device, one slow SSD, one HBA hiccup, and the pool graph looks “a bit worse,” while one vdev is on fire.
What it catches: a single device dying, firmware GC stalls, queue starvation, a controller issue affecting only one path.
How it fails: if you only graph pool averages, you’ll miss the “one bad apple” stage and discover the problem at the “everything timed out” stage.
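If your collector can’t do per-vdev percentiles yet, per-vdev average latency is still a big step up from pool totals; a sketch, again assuming OpenZFS 0.7+:
cr0x@server:~$ sudo zpool iostat -v -l tank 5
# -l adds latency columns (total wait, disk wait, sync/async queue wait) for the
# pool, each raidz/mirror vdev, and each leaf device. One leaf with 10x the
# total wait of its siblings is the "one bad apple" stage you want to catch.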
2) Queue depth / wait time signals
Latency without queue depth is like smoke without a fire alarm. ZFS has internal queues; disks have queues; the OS has queues. When queue depth rises, latency rises, and throughput often plateaus. This is your early disaster signature.
What it catches: saturation, throttling, sudden workload shifts, scrub/resilver interference, a dataset property change that increases IO size or sync behavior.
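ZFS will show you its internal per-vdev queues directly, which makes this signature easy to graph; a sketch using standard zpool iostat flags:
cr0x@server:~$ sudo zpool iostat -q -v tank 5
# -q prints pending/active counts for the sync read/write, async read/write,
# scrub, and trim queues per vdev. Pending counts climbing while active counts
# stay flat means work is arriving faster than the devices can drain it.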
3) Sync write latency (ZIL/SLOG health)
Databases and NFS often force sync semantics. If your SLOG is slow or misconfigured, users learn new words. Graph:
- sync write latency (p95/p99)
- sync write ops per second
- SLOG device latency and utilization
What it catches: a SLOG wearing out, device write cache disabled, power-loss-protection requirements not met, accidental removal of the SLOG, or a “fast” consumer SSD stalling under sustained sync load.
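On Linux, OpenZFS also exposes ZIL counters you can scrape to see how much log traffic exists and where it lands; a rough sketch (kstat paths and field names can shift between OpenZFS releases):
cr0x@server:~$ cat /proc/spl/kstat/zfs/zil
# zil_commit_count           : how often something forced a log commit (fsync & friends)
# zil_itx_metaslab_slog_*    : log records written to the SLOG
# zil_itx_metaslab_normal_*  : log records written to the main pool; if this climbs
#                              while you own a SLOG, the log device is missing, full,
#                              or being bypassed (e.g. logbias=throughput, very large writes)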
4) TXG commit time and “dirty data” pressure
When ZFS can’t flush dirty data fast enough, everything backs up. Monitor:
- TXG sync time (or proxies like “time blocked in txg”)
- dirty data size
- write throttle events
What it catches: slow disks, overcommitted pools, recordsize mismatches, SMR drive weirdness, RAIDZ math under random write load.
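On Linux you can watch TXG timing and the dirty-data ceiling directly; a sketch, assuming OpenZFS’s usual kstat and module-parameter paths and the tank pool:
cr0x@server:~$ tail -n 5 /proc/spl/kstat/zfs/tank/txgs
# One row per recent transaction group: dirty bytes carried (ndirty) and time spent
# open, quiescing, waiting, and syncing (otime/qtime/wtime/stime, in nanoseconds).
# Sync times drifting from milliseconds toward seconds are the early warning.
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_dirty_data_max
# The dirty-data ceiling. Living near it means the write throttle is already
# shaping your application's latency.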
5) ARC hit ratio with context (and the misses that matter)
ARC hit ratio is a classic vanity metric. A high hit ratio can still hide latency if the misses are on the critical path (metadata misses, or random reads that hammer a slow vdev). Graph:
- ARC size and target size
- ARC hit ratio
- metadata vs data hits (if you can)
- eviction rate
What it catches: memory pressure, container density changes, a kernel update changing ARC behavior, or a new workload that blows out the cache.
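Most of these live in a single kstat file on Linux; a sketch of pulling the latency-relevant pieces (exact field names vary a little across OpenZFS versions):
cr0x@server:~$ grep -wE 'size|c|c_max|demand_data_misses|demand_metadata_misses|evict_skip' /proc/spl/kstat/zfs/arcstats
# size vs c (the target) shows whether ARC is being squeezed; demand_* misses are
# the ones an application actually waits on, unlike prefetch misses.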
6) Fragmentation and metaslab allocation behavior (slow-motion disasters)
Fragmentation isn’t instant pain; it’s future pain. When free space gets low and fragmentation rises, allocation becomes slower and IO becomes more random. Graph:
- pool free space percentage
- fragmentation percentage
- average block size written (if available)
What it catches: “we’re fine at 20% free” myths, snapshot hoarding, and workloads that churn small blocks.
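This is the easiest panel to back with a plain CLI check; a rough sketch, with thresholds that are illustrative rather than gospel:
cr0x@server:~$ sudo zpool list -H -p -o name,capacity,fragmentation | \
  awk '$2 > 80 || $3 > 50 {print "WARN: pool " $1 " capacity " $2 "% frag " $3 "%"}'
# -H drops headers, -p prints raw numbers; wire any WARN output into cron or your
# monitoring agent so the slow-motion disaster pages someone while it is still slow.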
7) Scrub/resilver impact overlay
Scrubs and resilvers are necessary. They are also heavy IO. Your monitoring should annotate when they run. Otherwise you’ll treat “expected IO pressure” as “mysterious performance regression,” or worse, you’ll disable scrubs and congratulate yourself on the improvement.
8) Errors that precede latency spikes
Graph counts of:
- checksum errors
- read/write errors
- timeouts and link resets (from OS logs)
Latency incidents often start with “harmless” retries. Retries are not harmless; they are IO multiplication with a bad attitude.
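A crude scrape of those counters is better than none; a sketch, assuming the standard zpool status column layout (NAME STATE READ WRITE CKSUM):
cr0x@server:~$ sudo zpool status | awk '$2 ~ /ONLINE|DEGRADED|FAULTED/ && ($3+$4+$5) > 0 {print "ERRS:", $1, $3, $4, $5}'
# Prints any pool/vdev/device line whose READ/WRITE/CKSUM counters are non-zero.
# Alert on any output at all; these counters are supposed to be boring.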
Fast diagnosis playbook (first / second / third)
This is the “my pager is yelling” sequence. It’s optimized for speed and correctness, not elegance.
First: decide if this is device latency, ZFS queuing, or workload change
- Look at vdev read/write latency. Is one vdev off the charts?
- Look at IOPS vs throughput. Did the workload shift from sequential to random?
- Check sync write rate. Did something flip to sync semantics?
Second: confirm whether the pool is blocked by TXG flushes or by sync writes
- If sync latency is high and SLOG is saturated: you’re in ZIL/SLOG land.
- If async writes are slow and dirty data is high: you’re in TXG/flush land.
- If reads are slow with low ARC hits: you’re in cache working-set land.
Third: look for the “boring” root causes that always win
- A device is failing (SMART, error counters, timeouts).
- A scrub/resilver is running and contending.
- A pool is too full or too fragmented.
- A recent change: recordsize, compression, sync, logbias, special vdev, volblocksize, database config.
If you can’t answer “which vdev is slow” within five minutes, your monitoring isn’t finished. Fix that before the next incident fixes it for you.
Practical tasks: commands, outputs, and decisions (12+)
These are runnable tasks you can do on a typical Linux system with ZFS. The point is not to memorize commands; it’s to connect output to a decision. Copy/paste with intent.
Task 1: Identify the pool and its topology
cr0x@server:~$ sudo zpool status -v
pool: tank
state: ONLINE
scan: scrub repaired 0B in 02:11:34 with 0 errors on Sun Dec 22 03:10:12 2025
config:
NAME          STATE     READ WRITE CKSUM
tank          ONLINE       0     0     0
  raidz2-0    ONLINE       0     0     0
    sda       ONLINE       0     0     0
    sdb       ONLINE       0     0     0
    sdc       ONLINE       0     0     0
    sdd       ONLINE       0     0     0
    sde       ONLINE       0     0     0
    sdf       ONLINE       0     0     0
  mirror-1    ONLINE       0     0     0
    nvme0n1   ONLINE       0     0     0
    nvme1n1   ONLINE       0     0     0
logs
  nvme2n1     ONLINE       0     0     0
errors: No known data errors
What it means: You have RAIDZ2 for capacity, a fast NVMe mirror as a second data vdev (mirror-1; a dedicated special vdev would appear under its own “special” heading), and a separate log device.
Decision: When latency spikes, you’ll check each of those components separately. If sync latency spikes, the logs device becomes suspect immediately.
Task 2: Watch per-vdev latency and queue behavior in real time
cr0x@server:~$ sudo zpool iostat -v tank 1
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 5.21T 9.33T 812 604 98.6M 41.2M
raidz2-0 5.01T 9.33T 740 520 92.1M 38.0M
sda - - 98 70 11.8M 5.2M
sdb - - 110 74 12.0M 5.4M
sdc - - 96 68 11.5M 5.1M
sdd - - 99 69 11.6M 5.1M
sde - - 112 75 12.1M 5.6M
sdf - - 225 164 33.1M 11.6M
mirror-1 210G 1.59T 72 84 6.5M 3.2M
nvme0n1 - - 36 42 3.2M 1.6M
nvme1n1 - - 36 42 3.3M 1.6M
logs - - - - - -
nvme2n1 - - 0 180 0 2.1M
-------------------------- ----- ----- ----- ----- ----- -----
What it means: You’re seeing distribution. One disk (sdf) is doing far more work, which may be normal (layout effects) or suspicious (hot spot, errors, or a slow sibling causing imbalance).
Decision: If a single device shows disproportionate work or performance shifts under load, correlate with kernel logs and SMART for that device. If sync writes spike, check the log device.
Task 3: Get latency at the block layer (are devices stalling?)
cr0x@server:~$ iostat -x 1 3
Linux 6.6.44 (server) 12/26/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
4.12 0.00 1.88 6.05 0.00 87.95
Device r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 8.2 6.0 1024.0 512.0 185.1 0.9 22.4 18.1 28.3 0.9 12.3
sdf 32.1 25.8 8192.0 2048.0 356.7 7.8 118.6 101.2 140.8 1.1 98.7
nvme2n1 0.0 220.0 0.0 2304.0 20.9 0.5 2.3 0.0 2.3 0.1 2.6
What it means: sdf is saturated (%util near 100) with high await and a large queue (avgqu-sz). That’s your latency factory.
Decision: Treat it as a likely failing device, path issue, or workload hotspot. Move from ZFS to OS and hardware evidence.
Task 4: Check for ZFS health events and recent changes
cr0x@server:~$ sudo zpool events -v | tail -n 20
TIME CLASS
Dec 26 2025 10:42:18.911482 ereport.fs.zfs.io
pool = tank
vdev = /dev/sdf
errno = 5
error = Input/output error
Dec 26 2025 10:42:19.002114 ereport.fs.zfs.checksum
pool = tank
vdev = /dev/sdf
errors = 3
What it means: ZFS is recording IO and checksum problems for /dev/sdf. Latency spikes and errors are friends.
Decision: Start replacement planning. If you see timeouts/resets, consider cabling/HBA. Don’t “tune” your way out of a dying device.
Task 5: Pull SMART/NVMe health (confirm the hardware story)
cr0x@server:~$ sudo smartctl -a /dev/sdf | egrep -i 'Reallocated|Pending|Offline|CRC|Power_On|Temperature|error'
Power_On_Hours 0x0032 092 092 000 Old_age Always - 35621
Reallocated_Sector_Ct 0x0033 089 089 010 Pre-fail Always - 24
Current_Pending_Sector 0x0012 098 098 000 Old_age Always - 3
UDMA_CRC_Error_Count 0x003e 199 199 000 Old_age Always - 12
SMART Error Log Version: 1
ATA Error Count: 7
What it means: Reallocated and pending sectors plus CRC errors. The drive and/or link is misbehaving. This is not a “maybe.”
Decision: Replace the disk and check cable/backplane/HBA for CRC issues. If CRCs climb on multiple drives, suspect the path, not just the disk.
Task 6: Confirm if you’re bottlenecked by sync writes (ZIL/SLOG)
cr0x@server:~$ sudo zfs get -o name,property,value -s local,default sync,logbias tank
NAME PROPERTY VALUE
tank logbias latency
tank sync standard
What it means: You’re not forcing sync=always, but applications calling fsync() or using O_DSYNC still generate sync writes. logbias=latency (the default) is the right setting for databases and NFS-style sync workloads.
Decision: If sync latency is high and you have a SLOG, validate its health and suitability. If you don’t have a SLOG and your workload is sync-heavy, add one (correctly).
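For reference, adding a log vdev is a single command; a sketch with placeholder device names (mirror the SLOG if losing in-flight sync writes to a device failure would hurt), with a dry run first:
cr0x@server:~$ sudo zpool add -n tank log mirror nvme3n1 nvme4n1   # dry run: show the resulting layout
cr0x@server:~$ sudo zpool add tank log mirror nvme3n1 nvme4n1
# nvme3n1/nvme4n1 are placeholders. Pick devices with power-loss protection and
# consistent low write latency; "fast on a spec sheet" is not the same thing.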
Task 7: Verify that a SLOG exists and is actually in use
cr0x@server:~$ sudo zpool status tank | sed -n '/logs/,$p'
logs
nvme2n1 ONLINE 0 0 0
What it means: A log vdev is present. That doesn’t prove it’s good; it proves you intended it.
Decision: If sync latency is bad, benchmark or observe nvme2n1 latency under pressure and confirm it has power-loss protection characteristics appropriate for your risk profile.
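To observe what the sync path actually delivers, a small fio probe against a scratch dataset is enough; a sketch with illustrative paths and sizes (it writes real data, so point it somewhere disposable):
cr0x@server:~$ fio --name=syncwrite --directory=/tank/app/test --rw=write --bs=8k \
    --iodepth=1 --numjobs=1 --size=1g --fsync=1 --time_based --runtime=30 \
    --group_reporting
# --fsync=1 forces a flush after every write, so the reported latency reflects the
# ZIL/SLOG path rather than the async buffer. Watch p99/p99.9, not the average.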
Task 8: Look at ARC behavior (is the cache helping or thrashing?)
cr0x@server:~$ arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
10:55:01 812 96 11 38 4 58 7 8 33 128G 144G
10:55:02 790 210 27 120 15 82 10 8 10 128G 144G
10:55:03 805 240 30 150 19 82 10 8 9 128G 144G
What it means: Miss rate is increasing fast. If this aligns with latency spikes, your working set may have outgrown ARC, or a new workload is evicting useful data.
Decision: Investigate what changed (deploy, query pattern, new tenants). Consider adding RAM, optimizing access patterns, or using a special vdev for metadata if metadata misses dominate.
Task 9: Check pool fullness and fragmentation (slow-motion latency bombs)
cr0x@server:~$ sudo zpool list -o name,size,alloc,free,capacity,fragmentation,health
NAME SIZE ALLOC FREE CAPACITY FRAG HEALTH
tank 14.5T 13.2T 1.3T 91% 62% ONLINE
What it means: 91% full and 62% fragmented. ZFS can run like this, but it won’t run happily. Allocation gets harder, IO gets more random, latency increases.
Decision: Free up space. Add vdevs (the right way), delete old snapshots/data, or move data. Don’t tune around a nearly-full pool; that’s denial with graphs.
Task 10: Identify datasets with risky properties for latency
cr0x@server:~$ sudo zfs get -r -o name,property,value recordsize,compression,atime,sync,primarycache,secondarycache tank/app
NAME PROPERTY VALUE
tank/app recordsize 128K
tank/app compression lz4
tank/app atime off
tank/app sync standard
tank/app primarycache all
tank/app secondarycache all
tank/app/db recordsize 128K
tank/app/db compression lz4
tank/app/db atime off
tank/app/db sync standard
tank/app/db primarycache all
tank/app/db secondarycache all
What it means: Sensible defaults, but recordsize=128K may be wrong for some DB patterns (especially lots of 8–16K random IO). Wrong recordsize can inflate read-modify-write and amplify latency.
Decision: If your workload is random small-block, consider a dataset tuned to that workload (often 16K) and validate with real IO tests. Don’t change recordsize mid-flight without a plan: it affects newly written blocks.
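A common, low-drama pattern is to create a new dataset tuned for the database rather than flipping properties on a live one; a sketch with illustrative names and values:
cr0x@server:~$ sudo zfs create -o recordsize=16K -o compression=lz4 tank/app/db-16k
cr0x@server:~$ sudo zfs get recordsize,compression tank/app/db-16k
# recordsize only applies to blocks written after the change; existing files keep
# their old block size until rewritten, which is why "change it and hope" fails.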
Task 11: Find whether scrub/resilver is stealing your lunch
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Fri Dec 26 09:58:41 2025
3.12T scanned at 1.25G/s, 1.88T issued at 780M/s, 5.21T total
0B repaired, 36.15% done, 01:29:13 to go
config:
NAME          STATE     READ WRITE CKSUM
tank          ONLINE       0     0     0
  raidz2-0    ONLINE       0     0     0
    sda       ONLINE       0     0     0
    sdb       ONLINE       0     0     0
    sdc       ONLINE       0     0     0
    sdd       ONLINE       0     0     0
    sde       ONLINE       0     0     0
    sdf       ONLINE       0     0     0
errors: No known data errors
What it means: Scrub is in progress. Even when healthy, scrubs raise latency by consuming IO and queue slots.
Decision: If this is a production latency incident, consider pausing scrub during peak and rescheduling. But don’t permanently “solve latency” by never scrubbing. That’s how you get silent corruption and a bad week.
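If you do shed scrub load during an incident, OpenZFS 0.7+ lets you pause and resume instead of cancelling; a sketch:
cr0x@server:~$ sudo zpool scrub -p tank    # pause the running scrub
cr0x@server:~$ sudo zpool scrub tank       # later: resumes from where it paused
# Put the resume in the same ticket as the pause, or it will quietly never happen.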
Task 12: Check snapshot pressure (space and performance side-effects)
cr0x@server:~$ sudo zfs list -t snapshot -o name,used,refer,creation -S used | head
NAME USED REFER CREATION
tank/app/db@hourly-2025-12-26-10 78G 1.2T Fri Dec 26 10:00 2025
tank/app/db@hourly-2025-12-26-09 74G 1.2T Fri Dec 26 09:00 2025
tank/app/db@hourly-2025-12-26-08 71G 1.2T Fri Dec 26 08:00 2025
tank/app/db@hourly-2025-12-26-07 69G 1.2T Fri Dec 26 07:00 2025
tank/app/db@hourly-2025-12-26-06 66G 1.2T Fri Dec 26 06:00 2025
tank/app/db@hourly-2025-12-26-05 64G 1.2T Fri Dec 26 05:00 2025
tank/app/db@hourly-2025-12-26-04 62G 1.2T Fri Dec 26 04:00 2025
tank/app/db@hourly-2025-12-26-03 61G 1.2T Fri Dec 26 03:00 2025
tank/app/db@hourly-2025-12-26-02 59G 1.2T Fri Dec 26 02:00 2025
What it means: Snapshots are consuming serious space. Heavy snapshotting with churn increases fragmentation and reduces free space, both of which degrade latency.
Decision: Apply retention that matches business needs, not anxiety. If you need frequent snapshots, plan capacity and vdev layout for it.
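To see which datasets are hoarding snapshots, zfs list plus a little awk is enough; a rough sketch:
cr0x@server:~$ sudo zfs list -H -t snapshot -o name | \
  awk -F@ '{count[$1]++} END {for (d in count) print count[d], d}' | sort -rn | head
# Counts snapshots per dataset, largest first. Pair the counts with the "used"
# column above to decide where retention policy needs teeth.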
Task 13: Validate whether a “tuning” change actually helped (fio)
cr0x@server:~$ fio --name=randread --directory=/tank/app/test --rw=randread --bs=16k --iodepth=32 --numjobs=4 --size=4g --time_based --runtime=30 --group_reporting
randread: (groupid=0, jobs=4): err= 0: pid=21144: Fri Dec 26 10:58:22 2025
read: IOPS=42.1k, BW=658MiB/s (690MB/s)(19.3GiB/30001msec)
slat (nsec): min=500, max=220000, avg=1120.4, stdev=820.1
clat (usec): min=85, max=14200, avg=302.5, stdev=190.2
lat (usec): min=87, max=14203, avg=304.0, stdev=190.4
clat percentiles (usec):
| 1.00th=[ 120], 5.00th=[ 150], 10.00th=[ 170], 50.00th=[ 280],
| 90.00th=[ 470], 95.00th=[ 560], 99.00th=[ 900], 99.90th=[ 3100]
What it means: You have percentile latency data. p99 at ~900us might be fine; p99.9 at 3.1ms might be a concern for some OLTP. The tail matters.
Decision: Compare before/after tuning changes. If p99 got worse while average improved, you didn’t “optimize,” you just moved pain into the tail where incidents live.
Task 14: Check for kernel-level IO errors and resets
cr0x@server:~$ sudo dmesg -T | egrep -i 'reset|timeout|blk_update_request|I/O error|nvme|ata' | tail -n 20
[Fri Dec 26 10:42:16 2025] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Fri Dec 26 10:42:16 2025] ata7.00: failed command: READ DMA EXT
[Fri Dec 26 10:42:16 2025] blk_update_request: I/O error, dev sdf, sector 983742112 op 0x0:(READ) flags 0x0 phys_seg 8 prio class 0
[Fri Dec 26 10:42:17 2025] ata7: hard resetting link
[Fri Dec 26 10:42:18 2025] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
What it means: The OS is seeing link resets and IO errors. This aligns with the earlier ZFS events and the iostat stall.
Decision: Replace suspect hardware, and inspect the physical path. Also expect latency spikes to correlate with these resets; annotate them on graphs if you can.
Joke #2: When an SSD’s firmware starts garbage-collecting mid-incident, it’s like your janitor vacuuming during a fire drill—technically working, emotionally unhelpful.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company ran PostgreSQL on ZFS-backed volumes. They had good hardware, decent monitoring, and a confident belief: “NVMe is fast, so sync writes can’t be the problem.” They graphed throughput and CPU and called it observability.
Then a new feature launched: more small transactions, more commits, more fsyncs. Within hours, p99 API latency doubled. The database didn’t crash. It just started moving like a person walking through wet cement. Engineers chased query plans, connection pools, GC pauses—anything but storage, because the storage was “NVMe.”
The missing graph was sync write latency. Once they added it, the shape was obvious: a clean sawtooth of normal behavior punctuated by ugly spikes. Those spikes lined up with ZIL activity and device stalls. Their “fast” SLOG device was a consumer NVMe without power-loss protection, and under sustained sync load it periodically stalled hard.
They replaced the SLOG with an enterprise device designed for consistent latency, and they moved a few critical datasets to sync=standard with explicit application-level durability settings, rather than blanket forcing sync behavior everywhere. The incident ended quickly—after a long detour through every system except the one that was guilty.
The wrong assumption wasn’t “NVMe is fast.” It was “fast means consistent.” Latency monitoring exists to punish that belief before customers do.
Mini-story 2: The optimization that backfired
An enterprise internal platform team wanted to reduce disk IO. They noticed their pool was doing a lot of metadata lookups and decided to “speed it up” by adding an L2ARC device. They provisioned a large SSD, enabled L2ARC, and watched cache size grow. Victory, right?
Within a week, they saw periodic latency spikes on read-heavy services. The graphs were confusing: throughput was stable, CPU was modest, and ARC hit ratio looked okay. But the tail latency was worse. Users complained about intermittent slowness that never showed up in averages.
The backfire was subtle: the L2ARC was huge, and its metadata overhead in RAM wasn’t trivial. Under memory pressure from other services, the kernel reclaimed memory; ARC shrank; eviction increased; and the system oscillated between “cache warm” and “cache cold.” Each oscillation created a burst of disk reads, queueing, and tail latency spikes.
They fixed it by right-sizing the L2ARC, protecting system memory headroom, and shifting focus to a special vdev for metadata (which reduced critical-path misses) instead of trying to cache their way out of a working-set problem. The optimization wasn’t stupid—it was incomplete. It treated “more cache” as a universal good rather than a trade-off.
Mini-story 3: The boring practice that saved the day
A financial services team ran ZFS for a set of customer-facing APIs. Their culture was aggressively unromantic: every change had a rollback plan, every pool had weekly scrub windows, and every dashboard had a few panels nobody cared about until they mattered.
One of those panels was per-vdev latency with annotations for scrubs and deploys. Another was a simple counter: checksum errors per device. It was boring. It stayed flat. People forgot it existed.
On a Tuesday afternoon, the checksum error counter started ticking upward on a single disk. Latency wasn’t bad yet. Throughput looked fine. No customer complaints. The on-call engineer replaced the disk that evening because the runbook said, effectively, “don’t argue with checksum errors.”
Two days later, the vendor confirmed a batch issue with that drive model. Other teams discovered the problem during resilvers and brownouts. This team discovered it in a flat graph turning slightly non-flat. The savings weren’t heroic; they were procedural. Their “boring” practice converted a future incident into a maintenance ticket.
Common mistakes: symptom → root cause → fix
1) Symptom: p99 write latency spikes, but throughput looks normal
Root cause: a single device stalling (firmware GC, failing drive, link resets) while the pool average hides it.
Fix: graph per-vdev latency; confirm with iostat -x and kernel logs; replace the device or fix the path. Don’t “tune ZFS” first.
2) Symptom: database commits become slow after a “storage upgrade”
Root cause: SLOG is missing, slow, or not appropriate for sync durability needs; or dataset switched to sync=always accidentally.
Fix: verify pool has a log vdev; validate sync write latency; ensure the SLOG device is designed for consistent low-latency writes and power-loss behavior; revert accidental property changes.
3) Symptom: latency worsens during backups/snapshots
Root cause: snapshot churn increases fragmentation and reduces free space; backup reads collide with production IO.
Fix: schedule backups and scrubs; use bandwidth limits where available; improve retention; add capacity to keep pools comfortably below “panic fullness.”
4) Symptom: reads are slow only after a deploy or tenant onboard
Root cause: working set no longer fits ARC, or metadata misses increased; ARC is thrashing due to memory pressure.
Fix: add memory headroom; identify noisy neighbor workloads; consider special vdev for metadata; don’t blindly add huge L2ARC without RAM planning.
5) Symptom: “everything gets slower” when scrub runs
Root cause: scrub IO contends with production; pool near saturation; vdev queues fill.
Fix: run scrubs off-peak; if your platform supports it, limit scrub rate; ensure enough spindles/IOPS for both. If your pool can’t scrub without hurting users, it’s undersized.
6) Symptom: high latency with low device utilization
Root cause: queueing above the block layer (ZFS internal locks, TXG sync blocking, CPU saturation, memory reclaim), or IO is synchronous and serialized.
Fix: check CPU and memory pressure; look for TXG sync time growth; verify sync workload and SLOG; confirm your workload isn’t accidentally single-threaded at the storage layer.
7) Symptom: performance degrades over months with no obvious incident
Root cause: pool filling up, fragmentation climbing, snapshots accumulating; allocation becomes expensive and IO patterns worsen.
Fix: capacity planning with enforced thresholds (alerts at 70/80/85%); snapshot retention; add vdevs before the pool becomes a complaint generator.
Checklists / step-by-step plan
Checklist A: Build a latency dashboard that’s actually operational
- Per-vdev read/write latency (p50/p95/p99 if possible).
- Per-vdev utilization and queue depth (or closest available proxy).
- Sync write latency and sync IOPS.
- TXG flush pressure: dirty data, txg sync time, write throttling indicators.
- ARC size, target, miss rate, eviction rate (not just hit ratio).
- Pool capacity and fragmentation.
- Scrub/resilver state with annotations.
- Error counters: checksum, read/write errors, IO timeouts, link resets.
Rule: Every graph must answer a question you will ask at 3 a.m. If it doesn’t, delete it.
Checklist B: Incident response steps (15 minutes to clarity)
- Run zpool status -v and confirm health, errors, scrub/resilver activity.
- Run zpool iostat -v 1 and spot the worst vdev/device.
- Run iostat -x 1 to confirm device-level saturation or await spikes.
- Check zpool events -v and dmesg for resets/timeouts/errors.
- Check dataset properties that influence latency (sync, logbias, recordsize, compression).
- Decide: hardware fault, workload shift, or capacity/fragmentation pressure.
- Mitigate: move load, pause scrub, reduce sync pressure, or replace device.
Checklist C: Preventive hygiene that pays rent
- Alert on pool capacity thresholds and fragmentation trend.
- Scrub regularly, but schedule it and observe its impact.
- Track per-device error counters and replace early.
- Keep enough free space for ZFS to allocate efficiently.
- Validate changes with a small benchmark and tail-latency comparison, not just averages.
FAQ
1) What’s the single most important ZFS latency graph?
Per-vdev latency (reads and writes). Pool averages hide the early stage where one device is ruining everyone’s day.
2) Why do I need percentiles? Isn’t average latency enough?
Averages hide tail behavior. Users feel p99. Databases feel p99. Your incident channel is basically a p99 reporting system.
3) Does adding a SLOG always reduce latency?
No. It reduces latency for synchronous writes if the SLOG device has low and consistent write latency. A bad SLOG can make things worse.
4) How do I know if my workload is sync-heavy?
Measure. Look for high sync write ops and high sync latency; correlate with database commit latency, NFS sync behavior, or applications calling fsync().
5) Can a pool be “healthy” and still have terrible latency?
Absolutely. Health is correctness and availability. Latency is performance. A pool can be ONLINE while one drive is intermittently stalling.
6) Is a nearly-full pool really that bad? We still have free space.
Yes, it’s that bad. Above ~80–85% capacity (workload dependent), allocation gets harder, fragmentation rises, and tail latency grows teeth.
7) Should I disable scrubs to keep latency low?
No. Scrubs are how you find silent corruption and latent disk issues. Schedule them and manage their impact; don’t pretend entropy doesn’t exist.
8) Does compression help or hurt latency?
Both, depending on workload. Compression can reduce IO and improve latency, but it adds CPU cost and can change IO size patterns. Measure with your data.
9) Why do I see high latency but low %util on disks?
Because the bottleneck might be above the disk: TXG sync blocking, memory reclaim shrinking ARC, CPU saturation, or a serialized sync workload path.
10) What’s the fastest way to catch a dying disk before it fails?
Graph per-device latency and error counters. A slow disk often becomes inconsistent before it becomes dead.
Next steps you can do this week
- Add per-vdev latency panels (read/write, percentiles if possible). If you can’t, at least separate vdev/device metrics from pool metrics.
- Annotate scrubs, resilvers, and deploys on the same timeline as latency. Guessing is expensive.
- Write a 1-page runbook using the “Fast diagnosis playbook” above and include the exact commands your team will run.
- Set capacity alerts that trigger before panic: start warning at 70–75%, page at a threshold your workload can tolerate.
- Run one fio test (carefully, on a safe dataset) and record p95/p99. That becomes your baseline for “normal.”
- Practice a replacement drill: identify a drive, simulate failure in process (not physically), confirm you can interpret zpool status, and proceed without improvisation.
The goal isn’t to build a perfect observability cathedral. The goal is to see the next disaster while it’s still small enough to step on.