Storage outages rarely start as “everything is down.” They start as someone in Slack saying, “why is the app… sticky?” Then your database checkpoints take longer, your queues lengthen, and suddenly your SLOs look like a ski slope.
ZFS is excellent at keeping data correct. It’s also excellent at hiding the early warning signs inside a few counters you’re not graphing. This is about those counters: the graphs that tell you “something is wrong” while you still have time to fix it without a war room.
What “latency” means in ZFS (and why your app cares)
Most teams graph throughput because it’s gratifying: big numbers, colorful dashboards, leadership-friendly. Latency is the number that actually hurts users. One slow I/O can stall a transaction, block a worker thread, and amplify into a queue that takes minutes to unwind.
ZFS adds its own layers. That’s not a criticism; it’s a reality of copy-on-write, checksums, compression, and transaction groups. Your I/O doesn’t just “hit the disk.” It moves through:
- Application and filesystem path (syscalls, VFS)
- ZFS intent/logging for synchronous semantics (ZIL/SLOG)
- ARC (RAM cache) and optional L2ARC (flash cache)
- DMU / metaslab allocation (where blocks go)
- vdev queues (the place where “this looks fine” becomes “why is everything waiting”)
- physical devices (including their firmware moods)
When someone says “ZFS is slow,” translate it. Ask: slow reads or slow writes? synchronous or asynchronous? small random IO or large sequential? hot set fits in ARC or not? That’s how you pick the right graphs and the right fix.
One more practical point: average latency is a liar. You want percentiles (p95, p99) and you want queue depth context. A system can have “low average latency” while p99 is burning your app alive.
Joke #1: Average latency is like average temperature—great until you realize your head is in the oven and your feet are in the freezer.
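If you run OpenZFS 0.7 or newer, you don’t have to infer latency from ops and bandwidth: zpool iostat can emit latency histograms, which are the raw material for percentiles. A minimal sketch, using the tank pool from the tasks later in this article:
cr0x@server:~$ sudo zpool iostat -w tank 5
# -w prints per-bucket counts (total wait, disk wait, sync/async queue wait).
# Ship the buckets to your metrics pipeline and derive p95/p99 from them
# instead of graphing a single average.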
Interesting facts and history (the useful kind)
- ZFS was born at Sun Microsystems with a design goal that data integrity beats everything else, which is why checksums and copy-on-write are non-negotiable features.
- The “ZIL” exists even without a SLOG: ZIL is the in-pool log, and SLOG is a separate device that can hold ZIL records to accelerate sync writes.
- TXGs (transaction groups) are the heartbeat: ZFS batches changes and periodically flushes them. When flushes get slow, latency gets weird—and then catastrophic.
- ARC is not “just a cache”: ARC interacts with memory pressure, metadata, and prefetch behavior; it can shift workload from disks to RAM in surprising ways.
- L2ARC used to be riskier: older implementations could consume a lot of RAM for metadata and warmup behavior could disappoint. Modern systems improved, but it’s still not free.
- Compression can reduce latency when it turns random reads into fewer device IOs—until CPU becomes the bottleneck or recordsize misaligns with access patterns.
- RAIDZ changes the write story: small blocks on RAIDZ pay parity and padding overhead, and partial-record updates incur record-level read-modify-write, both of which show up as higher latency and device queueing.
- Scrubs are operationally mandatory, but they compete for IO. If you don’t shape scrub impact, your users will shape it for you via angry tickets.
- Special vdevs changed metadata economics: putting metadata (and optionally small blocks) on fast SSD can dramatically improve latency for metadata-heavy workloads.
One quote, because it survives every storage incident: “Hope is not a strategy.”
— James Cameron
The graphs that catch disaster early
Latency monitoring isn’t one graph. It’s a small set of graphs that tell a coherent story. If your monitoring tool only supports a few panels, pick the ones below. If you can do more, do more. You’re buying time.
1) Pool and vdev latency (read/write) with percentiles
You want read and write latency per vdev, not just the pool average. Pools hide murder. One degraded device, one slow SSD, one HBA hiccup, and the pool graph looks “a bit worse,” while one vdev is on fire.
What it catches: a single device dying, firmware GC stalls, queue starvation, a controller issue affecting only one path.
How it fails: if you only graph pool averages, you’ll miss the “one bad apple” stage and discover the problem at the “everything timed out” stage.
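If your collector can’t do per-vdev percentiles yet, per-vdev average latency is still a big step up from pool totals; a sketch, again assuming OpenZFS 0.7+:
cr0x@server:~$ sudo zpool iostat -v -l tank 5
# -l adds latency columns (total wait, disk wait, sync/async queue wait) for the
# pool, each raidz/mirror vdev, and each leaf device. One leaf with 10x the
# total wait of its siblings is the "one bad apple" stage you want to catch.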
2) Queue depth / wait time signals
Latency without queue depth is like smoke without a fire alarm. ZFS has internal queues; disks have queues; the OS has queues. When queue depth rises, latency rises, and throughput often plateaus. This is your early disaster signature.
What it catches: saturation, throttling, sudden workload shifts, scrub/resilver interference, a dataset property change that increases IO size or sync behavior.
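ZFS will show you its internal per-vdev queues directly, which makes this signature easy to graph; a sketch using standard zpool iostat flags:
cr0x@server:~$ sudo zpool iostat -q -v tank 5
# -q prints pending/active counts for the sync read/write, async read/write,
# scrub, and trim queues per vdev. Pending counts climbing while active counts
# stay flat means work is arriving faster than the devices can drain it.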
3) Sync write latency (ZIL/SLOG health)
Databases and NFS often force sync semantics. If your SLOG is slow or misconfigured, users learn new words. Graph:
- sync write latency (p95/p99)
- sync write ops per second
- SLOG device latency and utilization
What it catches: a SLOG wearing out, device write cache disabled, power-loss-protection requirements not met, accidental removal of the SLOG, or a “fast” consumer SSD stalling under sustained sync load.
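On Linux, OpenZFS also exposes ZIL counters you can scrape to see how much log traffic exists and where it lands; a rough sketch (kstat paths and field names can shift between OpenZFS releases):
cr0x@server:~$ cat /proc/spl/kstat/zfs/zil
# zil_commit_count           : how often something forced a log commit (fsync & friends)
# zil_itx_metaslab_slog_*    : log records written to the SLOG
# zil_itx_metaslab_normal_*  : log records written to the main pool; if this climbs
#                              while you own a SLOG, the log device is missing, full,
#                              or being bypassed (e.g. logbias=throughput, very large writes)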
4) TXG commit time and “dirty data” pressure
When ZFS can’t flush dirty data fast enough, everything backs up. Monitor:
- TXG sync time (or proxies like “time blocked in txg”)
- dirty data size
- write throttle events
What it catches: slow disks, overcommitted pools, recordsize mismatches, SMR drive weirdness, RAIDZ math under random write load.
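On Linux you can watch TXG timing and the dirty-data ceiling directly; a sketch, assuming OpenZFS’s usual kstat and module-parameter paths and the tank pool:
cr0x@server:~$ tail -n 5 /proc/spl/kstat/zfs/tank/txgs
# One row per recent transaction group: dirty bytes carried (ndirty) and time spent
# open, quiescing, waiting, and syncing (otime/qtime/wtime/stime, in nanoseconds).
# Sync times drifting from milliseconds toward seconds are the early warning.
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_dirty_data_max
# The dirty-data ceiling. Living near it means the write throttle is already
# shaping your application's latency.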
5) ARC hit ratio with context (and the misses that matter)
ARC hit ratio is a classic vanity metric. A high hit ratio can still hide latency if the misses are on the critical path (metadata misses, or random reads that hammer a slow vdev). Graph:
- ARC size and target size
- ARC hit ratio
- metadata vs data hits (if you can)
- eviction rate
What it catches: memory pressure, container density changes, a kernel update changing ARC behavior, or a new workload that blows out the cache.
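Most of these live in a single kstat file on Linux; a sketch of pulling the latency-relevant pieces (exact field names vary a little across OpenZFS versions):
cr0x@server:~$ grep -wE 'size|c|c_max|demand_data_misses|demand_metadata_misses|evict_skip' /proc/spl/kstat/zfs/arcstats
# size vs c (the target) shows whether ARC is being squeezed; demand_* misses are
# the ones an application actually waits on, unlike prefetch misses.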
6) Fragmentation and metaslab allocation behavior (slow-motion disasters)
Fragmentation isn’t instant pain; it’s future pain. When free space gets low and fragmentation rises, allocation becomes slower and IO becomes more random. Graph:
- pool free space percentage
- fragmentation percentage
- average block size written (if available)
What it catches: “we’re fine at 20% free” myths, snapshot hoarding, and workloads that churn small blocks.
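This is the easiest panel to back with a plain CLI check; a rough sketch, with thresholds that are illustrative rather than gospel:
cr0x@server:~$ sudo zpool list -H -p -o name,capacity,fragmentation | \
  awk '$2 > 80 || $3 > 50 {print "WARN: pool " $1 " capacity " $2 "% frag " $3 "%"}'
# -H drops headers, -p prints raw numbers; wire any WARN output into cron or your
# monitoring agent so the slow-motion disaster pages someone while it is still slow.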
7) Scrub/resilver impact overlay
Scrubs and resilvers are necessary. They are also heavy IO. Your monitoring should annotate when they run. Otherwise you’ll treat “expected IO pressure” as “mysterious performance regression,” or worse, you’ll disable scrubs and congratulate yourself on the improvement.
8) Errors that precede latency spikes
Graph counts of:
- checksum errors
- read/write errors
- timeouts and link resets (from OS logs)
Latency incidents often start with “harmless” retries. Retries are not harmless; they are IO multiplication with a bad attitude.
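A crude scrape of those counters is better than none; a sketch, assuming the standard zpool status column layout (NAME STATE READ WRITE CKSUM):
cr0x@server:~$ sudo zpool status | awk '$2 ~ /ONLINE|DEGRADED|FAULTED/ && ($3+$4+$5) > 0 {print "ERRS:", $1, $3, $4, $5}'
# Prints any pool/vdev/device line whose READ/WRITE/CKSUM counters are non-zero.
# Alert on any output at all; these counters are supposed to be boring.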
Fast diagnosis playbook (first / second / third)
This is the “my pager is yelling” sequence. It’s optimized for speed and correctness, not elegance.
First: decide if this is device latency, ZFS queuing, or workload change
- Look at vdev read/write latency. Is one vdev off the charts?
- Look at IOPS vs throughput. Did the workload shift from sequential to random?
- Check sync write rate. Did something flip to sync semantics?
Second: confirm whether the pool is blocked by TXG flushes or by sync writes
- If sync latency is high and SLOG is saturated: you’re in ZIL/SLOG land.
- If async writes are slow and dirty data is high: you’re in TXG/flush land.
- If reads are slow with low ARC hits: you’re in cache working-set land.
Third: look for the “boring” root causes that always win
- A device is failing (SMART, error counters, timeouts).
- A scrub/resilver is running and contending.
- A pool is too full or too fragmented.
- A recent change: recordsize, compression, sync, logbias, special vdev, volblocksize, database config.
If you can’t answer “which vdev is slow” within five minutes, your monitoring isn’t finished. Fix that before the next incident fixes it for you.
Practical tasks: commands, outputs, and decisions (12+)
These are runnable tasks you can do on a typical Linux system with ZFS. The point is not to memorize commands; it’s to connect output to a decision. Copy/paste with intent.
Task 1: Identify the pool and its topology
cr0x@server:~$ sudo zpool status -v
pool: tank
state: ONLINE
scan: scrub repaired 0B in 02:11:34 with 0 errors on Sun Dec 22 03:10:12 2025
config:
NAME          STATE     READ WRITE CKSUM
tank          ONLINE       0     0     0
  raidz2-0    ONLINE       0     0     0
    sda       ONLINE       0     0     0
    sdb       ONLINE       0     0     0
    sdc       ONLINE       0     0     0
    sdd       ONLINE       0     0     0
    sde       ONLINE       0     0     0
    sdf       ONLINE       0     0     0
  mirror-1    ONLINE       0     0     0
    nvme0n1   ONLINE       0     0     0
    nvme1n1   ONLINE       0     0     0
logs
  nvme2n1     ONLINE       0     0     0
errors: No known data errors
What it means: You have RAIDZ2 for capacity, a fast NVMe mirror as a second data vdev (mirror-1; a dedicated special vdev would appear under its own “special” heading), and a separate log device.
Decision: When latency spikes, you’ll check each of those components separately. If sync latency spikes, the logs device becomes suspect immediately.
Task 2: Watch per-vdev latency and queue behavior in real time
cr0x@server:~$ sudo zpool iostat -v tank 1
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 5.21T 9.33T 812 604 98.6M 41.2M
raidz2-0 5.01T 9.33T 740 520 92.1M 38.0M
sda - - 98 70 11.8M 5.2M
sdb - - 110 74 12.0M 5.4M
sdc - - 96 68 11.5M 5.1M
sdd - - 99 69 11.6M 5.1M
sde - - 112 75 12.1M 5.6M
sdf - - 225 164 33.1M 11.6M
mirror-1 210G 1.59T 72 84 6.5M 3.2M
nvme0n1 - - 36 42 3.2M 1.6M
nvme1n1 - - 36 42 3.3M 1.6M
logs - - - - - -
nvme2n1 - - 0 180 0 2.1M
-------------------------- ----- ----- ----- ----- ----- -----
What it means: You’re seeing distribution. One disk (sdf) is doing far more work, which may be normal (layout effects) or suspicious (hot spot, errors, or a slow sibling causing imbalance).
Decision: If a single device shows disproportionate work or performance shifts under load, correlate with kernel logs and SMART for that device. If sync writes spike, check the log device.
Task 3: Get latency at the block layer (are devices stalling?)
cr0x@server:~$ iostat -x 1 3
Linux 6.6.44 (server) 12/26/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
4.12 0.00 1.88 6.05 0.00 87.95
Device r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 8.2 6.0 1024.0 512.0 185.1 0.9 22.4 18.1 28.3 0.9 12.3
sdf 32.1 25.8 8192.0 2048.0 356.7 7.8 118.6 101.2 140.8 1.1 98.7
nvme2n1 0.0 220.0 0.0 2304.0 20.9 0.5 2.3 0.0 2.3 0.1 2.6
What it means: sdf is saturated (%util near 100) with high await and a large queue (avgqu-sz). That’s your latency factory.
Decision: Treat it as a likely failing device, path issue, or workload hotspot. Move from ZFS to OS and hardware evidence.
Task 4: Check for ZFS health events and recent changes
cr0x@server:~$ sudo zpool events -v | tail -n 20
TIME CLASS
Dec 26 2025 10:42:18.911482 ereport.fs.zfs.io
pool = tank
vdev = /dev/sdf
errno = 5
error = Input/output error
Dec 26 2025 10:42:19.002114 ereport.fs.zfs.checksum
pool = tank
vdev = /dev/sdf
errors = 3
What it means: ZFS is recording IO and checksum problems for /dev/sdf. Latency spikes and errors are friends.
Decision: Start replacement planning. If you see timeouts/resets, consider cabling/HBA. Don’t “tune” your way out of a dying device.
Task 5: Pull SMART/NVMe health (confirm the hardware story)
cr0x@server:~$ sudo smartctl -a /dev/sdf | egrep -i 'Reallocated|Pending|Offline|CRC|Power_On|Temperature|error'
Power_On_Hours 0x0032 092 092 000 Old_age Always - 35621
Reallocated_Sector_Ct 0x0033 089 089 010 Pre-fail Always - 24
Current_Pending_Sector 0x0012 098 098 000 Old_age Always - 3
UDMA_CRC_Error_Count 0x003e 199 199 000 Old_age Always - 12
SMART Error Log Version: 1
ATA Error Count: 7
What it means: Reallocated and pending sectors plus CRC errors. The drive and/or link is misbehaving. This is not a “maybe.”
Decision: Replace the disk and check cable/backplane/HBA for CRC issues. If CRCs climb on multiple drives, suspect the path, not just the disk.
Task 6: Confirm if you’re bottlenecked by sync writes (ZIL/SLOG)
cr0x@server:~$ sudo zfs get -o name,property,value -s local,default sync,logbias tank
NAME PROPERTY VALUE
tank logbias latency
tank sync standard
What it means: You’re not forcing sync=always, but applications calling fsync() or using O_DSYNC still generate sync writes. logbias=latency (the default) is the right setting for databases and NFS-style sync workloads.
Decision: If sync latency is high and you have a SLOG, validate its health and suitability. If you don’t have a SLOG and your workload is sync-heavy, add one (correctly).
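For reference, adding a log vdev is a single command; a sketch with placeholder device names (mirror the SLOG if losing in-flight sync writes to a device failure would hurt), with a dry run first:
cr0x@server:~$ sudo zpool add -n tank log mirror nvme3n1 nvme4n1   # dry run: show the resulting layout
cr0x@server:~$ sudo zpool add tank log mirror nvme3n1 nvme4n1
# nvme3n1/nvme4n1 are placeholders. Pick devices with power-loss protection and
# consistent low write latency; "fast on a spec sheet" is not the same thing.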
Task 7: Verify that a SLOG exists and is actually in use
cr0x@server:~$ sudo zpool status tank | sed -n '/logs/,$p'
logs
nvme2n1 ONLINE 0 0 0
What it means: A log vdev is present. That doesn’t prove it’s good; it proves you intended it.
Decision: If sync latency is bad, benchmark or observe nvme2n1 latency under pressure and confirm it has power-loss protection characteristics appropriate for your risk profile.
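To observe what the sync path actually delivers, a small fio probe against a scratch dataset is enough; a sketch with illustrative paths and sizes (it writes real data, so point it somewhere disposable):
cr0x@server:~$ fio --name=syncwrite --directory=/tank/app/test --rw=write --bs=8k \
    --iodepth=1 --numjobs=1 --size=1g --fsync=1 --time_based --runtime=30 \
    --group_reporting
# --fsync=1 forces a flush after every write, so the reported latency reflects the
# ZIL/SLOG path rather than the async buffer. Watch p99/p99.9, not the average.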
Task 8: Look at ARC behavior (is the cache helping or thrashing?)
cr0x@server:~$ arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
10:55:01 812 96 11 38 4 58 7 8 33 128G 144G
10:55:02 790 210 27 120 15 82 10 8 10 128G 144G
10:55:03 805 240 30 150 19 82 10 8 9 128G 144G
What it means: Miss rate is increasing fast. If this aligns with latency spikes, your working set may have outgrown ARC, or a new workload is evicting useful data.
Decision: Investigate what changed (deploy, query pattern, new tenants). Consider adding RAM, optimizing access patterns, or using a special vdev for metadata if metadata misses dominate.
Task 9: Check pool fullness and fragmentation (slow-motion latency bombs)
cr0x@server:~$ sudo zpool list -o name,size,alloc,free,capacity,fragmentation,health
NAME SIZE ALLOC FREE CAPACITY FRAG HEALTH
tank 14.5T 13.2T 1.3T 91% 62% ONLINE
What it means: 91% full and 62% fragmented. ZFS can run like this, but it won’t run happily. Allocation gets harder, IO gets more random, latency increases.
Decision: Free up space. Add vdevs (the right way), delete old snapshots/data, or move data. Don’t tune around a nearly-full pool; that’s denial with graphs.
Task 10: Identify datasets with risky properties for latency
cr0x@server:~$ sudo zfs get -r -o name,property,value recordsize,compression,atime,sync,primarycache,secondarycache tank/app
NAME PROPERTY VALUE
tank/app recordsize 128K
tank/app compression lz4
tank/app atime off
tank/app sync standard
tank/app primarycache all
tank/app secondarycache all
tank/app/db recordsize 128K
tank/app/db compression lz4
tank/app/db atime off
tank/app/db sync standard
tank/app/db primarycache all
tank/app/db secondarycache all
What it means: Sensible defaults, but recordsize=128K may be wrong for some DB patterns (especially lots of 8–16K random IO). Wrong recordsize can inflate read-modify-write and amplify latency.
Decision: If your workload is random small-block, consider a dataset tuned to that workload (often 16K) and validate with real IO tests. Don’t change recordsize mid-flight without a plan: it affects newly written blocks.
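A common, low-drama pattern is to create a new dataset tuned for the database rather than flipping properties on a live one; a sketch with illustrative names and values:
cr0x@server:~$ sudo zfs create -o recordsize=16K -o compression=lz4 tank/app/db-16k
cr0x@server:~$ sudo zfs get recordsize,compression tank/app/db-16k
# recordsize only applies to blocks written after the change; existing files keep
# their old block size until rewritten, which is why "change it and hope" fails.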
Task 11: Find whether scrub/resilver is stealing your lunch
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Fri Dec 26 09:58:41 2025
3.12T scanned at 1.25G/s, 1.88T issued at 780M/s, 5.21T total
0B repaired, 36.15% done, 01:29:13 to go
config:
NAME          STATE     READ WRITE CKSUM
tank          ONLINE       0     0     0
  raidz2-0    ONLINE       0     0     0
    sda       ONLINE       0     0     0
    sdb       ONLINE       0     0     0
    sdc       ONLINE       0     0     0
    sdd       ONLINE       0     0     0
    sde       ONLINE       0     0     0
    sdf       ONLINE       0     0     0
errors: No known data errors
What it means: Scrub is in progress. Even when healthy, scrubs raise latency by consuming IO and queue slots.
Decision: If this is a production latency incident, consider pausing scrub during peak and rescheduling. But don’t permanently “solve latency” by never scrubbing. That’s how you get silent corruption and a bad week.
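If you do shed scrub load during an incident, OpenZFS 0.7+ lets you pause and resume instead of cancelling; a sketch:
cr0x@server:~$ sudo zpool scrub -p tank    # pause the running scrub
cr0x@server:~$ sudo zpool scrub tank       # later: resumes from where it paused
# Put the resume in the same ticket as the pause, or it will quietly never happen.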
Task 12: Check snapshot pressure (space and performance side-effects)
cr0x@server:~$ sudo zfs list -t snapshot -o name,used,refer,creation -S used | head
NAME USED REFER CREATION
tank/app/db@hourly-2025-12-26-10 78G 1.2T Fri Dec 26 10:00 2025
tank/app/db@hourly-2025-12-26-09 74G 1.2T Fri Dec 26 09:00 2025
tank/app/db@hourly-2025-12-26-08 71G 1.2T Fri Dec 26 08:00 2025
tank/app/db@hourly-2025-12-26-07 69G 1.2T Fri Dec 26 07:00 2025
tank/app/db@hourly-2025-12-26-06 66G 1.2T Fri Dec 26 06:00 2025
tank/app/db@hourly-2025-12-26-05 64G 1.2T Fri Dec 26 05:00 2025
tank/app/db@hourly-2025-12-26-04 62G 1.2T Fri Dec 26 04:00 2025
tank/app/db@hourly-2025-12-26-03 61G 1.2T Fri Dec 26 03:00 2025
tank/app/db@hourly-2025-12-26-02 59G 1.2T Fri Dec 26 02:00 2025
What it means: Snapshots are consuming serious space. Heavy snapshotting with churn increases fragmentation and reduces free space, both of which degrade latency.
Decision: Apply retention that matches business needs, not anxiety. If you need frequent snapshots, plan capacity and vdev layout for it.
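To see which datasets are hoarding snapshots, zfs list plus a little awk is enough; a rough sketch:
cr0x@server:~$ sudo zfs list -H -t snapshot -o name | \
  awk -F@ '{count[$1]++} END {for (d in count) print count[d], d}' | sort -rn | head
# Counts snapshots per dataset, largest first. Pair the counts with the "used"
# column above to decide where retention policy needs teeth.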
Task 13: Validate whether a “tuning” change actually helped (fio)
cr0x@server:~$ fio --name=randread --directory=/tank/app/test --rw=randread --bs=16k --iodepth=32 --numjobs=4 --size=4g --time_based --runtime=30 --group_reporting
randread: (groupid=0, jobs=4): err= 0: pid=21144: Fri Dec 26 10:58:22 2025
read: IOPS=42.1k, BW=658MiB/s (690MB/s)(19.3GiB/30001msec)
slat (nsec): min=500, max=220000, avg=1120.4, stdev=820.1
clat (usec): min=85, max=14200, avg=302.5, stdev=190.2
lat (usec): min=87, max=14203, avg=304.0, stdev=190.4
clat percentiles (usec):
| 1.00th=[ 120], 5.00th=[ 150], 10.00th=[ 170], 50.00th=[ 280],
| 90.00th=[ 470], 95.00th=[ 560], 99.00th=[ 900], 99.90th=[ 3100]
What it means: You have percentile latency data. p99 at ~900us might be fine; p99.9 at 3.1ms might be a concern for some OLTP. The tail matters.
Decision: Compare before/after tuning changes. If p99 got worse while average improved, you didn’t “optimize,” you just moved pain into the tail where incidents live.
Task 14: Check for kernel-level IO errors and resets
cr0x@server:~$ sudo dmesg -T | egrep -i 'reset|timeout|blk_update_request|I/O error|nvme|ata' | tail -n 20
[Fri Dec 26 10:42:16 2025] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Fri Dec 26 10:42:16 2025] ata7.00: failed command: READ DMA EXT
[Fri Dec 26 10:42:16 2025] blk_update_request: I/O error, dev sdf, sector 983742112 op 0x0:(READ) flags 0x0 phys_seg 8 prio class 0
[Fri Dec 26 10:42:17 2025] ata7: hard resetting link
[Fri Dec 26 10:42:18 2025] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
What it means: The OS is seeing link resets and IO errors. This aligns with the earlier ZFS events and the iostat stall.
Decision: Replace suspect hardware, and inspect the physical path. Also expect latency spikes to correlate with these resets; annotate them on graphs if you can.
Joke #2: When an SSD’s firmware starts garbage-collecting mid-incident, it’s like your janitor vacuuming during a fire drill—technically working, emotionally unhelpful.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company ran PostgreSQL on ZFS-backed volumes. They had good hardware, decent monitoring, and a confident belief: “NVMe is fast, so sync writes can’t be the problem.” They graphed throughput and CPU and called it observability.
Then a new feature launched: more small transactions, more commits, more fsyncs. Within hours, p99 API latency doubled. The database didn’t crash. It just started moving like a person walking through wet cement. Engineers chased query plans, connection pools, GC pauses—anything but storage, because the storage was “NVMe.”
The missing graph was sync write latency. Once they added it, the shape was obvious: a clean sawtooth of normal behavior punctuated by ugly spikes. Those spikes lined up with ZIL activity and device stalls. Their “fast” SLOG device was a consumer NVMe without power-loss protection, and under sustained sync load it periodically stalled hard.
They replaced the SLOG with an enterprise device designed for consistent latency, and they moved a few critical datasets to sync=standard with explicit application-level durability settings, rather than blanket forcing sync behavior everywhere. The incident ended quickly—after a long detour through every system except the one that was guilty.
The wrong assumption wasn’t “NVMe is fast.” It was “fast means consistent.” Latency monitoring exists to punish that belief before customers do.
Mini-story 2: The optimization that backfired
An enterprise internal platform team wanted to reduce disk IO. They noticed their pool was doing a lot of metadata lookups and decided to “speed it up” by adding an L2ARC device. They provisioned a large SSD, enabled L2ARC, and watched cache size grow. Victory, right?
Within a week, they saw periodic latency spikes on read-heavy services. The graphs were confusing: throughput was stable, CPU was modest, and ARC hit ratio looked okay. But the tail latency was worse. Users complained about intermittent slowness that never showed up in averages.
The backfire was subtle: the L2ARC was huge, and its metadata overhead in RAM wasn’t trivial. Under memory pressure from other services, the kernel reclaimed memory; ARC shrank; eviction increased; and the system oscillated between “cache warm” and “cache cold.” Each oscillation created a burst of disk reads, queueing, and tail latency spikes.
They fixed it by right-sizing the L2ARC, protecting system memory headroom, and shifting focus to a special vdev for metadata (which reduced critical-path misses) instead of trying to cache their way out of a working-set problem. The optimization wasn’t stupid—it was incomplete. It treated “more cache” as a universal good rather than a trade-off.
Mini-story 3: The boring practice that saved the day
A financial services team ran ZFS for a set of customer-facing APIs. Their culture was aggressively unromantic: every change had a rollback plan, every pool had weekly scrub windows, and every dashboard had a few panels nobody cared about until they mattered.
One of those panels was per-vdev latency with annotations for scrubs and deploys. Another was a simple counter: checksum errors per device. It was boring. It stayed flat. People forgot it existed.
On a Tuesday afternoon, the checksum error counter started ticking upward on a single disk. Latency wasn’t bad yet. Throughput looked fine. No customer complaints. The on-call engineer replaced the disk that evening because the runbook said, effectively, “don’t argue with checksum errors.”
Two days later, the vendor confirmed a batch issue with that drive model. Other teams discovered the problem during resilvers and brownouts. This team discovered it in a flat graph turning slightly non-flat. The savings weren’t heroic; they were procedural. Their “boring” practice converted a future incident into a maintenance ticket.
Common mistakes: symptom → root cause → fix
1) Symptom: p99 write latency spikes, but throughput looks normal
Root cause: a single device stalling (firmware GC, failing drive, link resets) while the pool average hides it.
Fix: graph per-vdev latency; confirm with iostat -x and kernel logs; replace the device or fix the path. Don’t “tune ZFS” first.
2) Symptom: database commits become slow after a “storage upgrade”
Root cause: SLOG is missing, slow, or not appropriate for sync durability needs; or dataset switched to sync=always accidentally.
Fix: verify pool has a log vdev; validate sync write latency; ensure the SLOG device is designed for consistent low-latency writes and power-loss behavior; revert accidental property changes.
3) Symptom: latency worsens during backups/snapshots
Root cause: snapshot churn increases fragmentation and reduces free space; backup reads collide with production IO.
Fix: schedule backups and scrubs; use bandwidth limits where available; improve retention; add capacity to keep pools comfortably below “panic fullness.”
4) Symptom: reads are slow only after a deploy or tenant onboard
Root cause: working set no longer fits ARC, or metadata misses increased; ARC is thrashing due to memory pressure.
Fix: add memory headroom; identify noisy neighbor workloads; consider special vdev for metadata; don’t blindly add huge L2ARC without RAM planning.
5) Symptom: “everything gets slower” when scrub runs
Root cause: scrub IO contends with production; pool near saturation; vdev queues fill.
Fix: run scrubs off-peak; if your platform supports it, limit scrub rate; ensure enough spindles/IOPS for both. If your pool can’t scrub without hurting users, it’s undersized.
6) Symptom: high latency with low device utilization
Root cause: queueing above the block layer (ZFS internal locks, TXG sync blocking, CPU saturation, memory reclaim), or IO is synchronous and serialized.
Fix: check CPU and memory pressure; look for TXG sync time growth; verify sync workload and SLOG; confirm your workload isn’t accidentally single-threaded at the storage layer.
7) Symptom: performance degrades over months with no obvious incident
Root cause: pool filling up, fragmentation climbing, snapshots accumulating; allocation becomes expensive and IO patterns worsen.
Fix: capacity planning with enforced thresholds (alerts at 70/80/85%); snapshot retention; add vdevs before the pool becomes a complaint generator.
Checklists / step-by-step plan
Checklist A: Build a latency dashboard that’s actually operational
- Per-vdev read/write latency (p50/p95/p99 if possible).
- Per-vdev utilization and queue depth (or closest available proxy).
- Sync write latency and sync IOPS.
- TXG flush pressure: dirty data, txg sync time, write throttling indicators.
- ARC size, target, miss rate, eviction rate (not just hit ratio).
- Pool capacity and fragmentation.
- Scrub/resilver state with annotations.
- Error counters: checksum, read/write errors, IO timeouts, link resets.
Rule: Every graph must answer a question you will ask at 3 a.m. If it doesn’t, delete it.
Checklist B: Incident response steps (15 minutes to clarity)
- Run zpool status -v and confirm health, errors, scrub/resilver activity.
- Run zpool iostat -v 1 and spot the worst vdev/device.
- Run iostat -x 1 to confirm device-level saturation or await spikes.
- Check zpool events -v and dmesg for resets/timeouts/errors.
- Check dataset properties that influence latency (sync, logbias, recordsize, compression).
- Decide: hardware fault, workload shift, or capacity/fragmentation pressure.
- Mitigate: move load, pause scrub, reduce sync pressure, or replace device.
Checklist C: Preventive hygiene that pays rent
- Alert on pool capacity thresholds and fragmentation trend.
- Scrub regularly, but schedule it and observe its impact.
- Track per-device error counters and replace early.
- Keep enough free space for ZFS to allocate efficiently.
- Validate changes with a small benchmark and tail-latency comparison, not just averages.
FAQ
1) What’s the single most important ZFS latency graph?
Per-vdev latency (reads and writes). Pool averages hide the early stage where one device is ruining everyone’s day.
2) Why do I need percentiles? Isn’t average latency enough?
Averages hide tail behavior. Users feel p99. Databases feel p99. Your incident channel is basically a p99 reporting system.
3) Does adding a SLOG always reduce latency?
No. It reduces latency for synchronous writes if the SLOG device has low and consistent write latency. A bad SLOG can make things worse.
4) How do I know if my workload is sync-heavy?
Measure. Look for high sync write ops and high sync latency; correlate with database commit latency, NFS sync behavior, or applications calling fsync().
5) Can a pool be “healthy” and still have terrible latency?
Absolutely. Health is correctness and availability. Latency is performance. A pool can be ONLINE while one drive is intermittently stalling.
6) Is a nearly-full pool really that bad? We still have free space.
Yes, it’s that bad. Above ~80–85% capacity (workload dependent), allocation gets harder, fragmentation rises, and tail latency grows teeth.
7) Should I disable scrubs to keep latency low?
No. Scrubs are how you find silent corruption and latent disk issues. Schedule them and manage their impact; don’t pretend entropy doesn’t exist.
8) Does compression help or hurt latency?
Both, depending on workload. Compression can reduce IO and improve latency, but it adds CPU cost and can change IO size patterns. Measure with your data.
9) Why do I see high latency but low %util on disks?
Because the bottleneck might be above the disk: TXG sync blocking, memory reclaim shrinking ARC, CPU saturation, or a serialized sync workload path.
10) What’s the fastest way to catch a dying disk before it fails?
Graph per-device latency and error counters. A slow disk often becomes inconsistent before it becomes dead.
Next steps you can do this week
- Add per-vdev latency panels (read/write, percentiles if possible). If you can’t, at least separate vdev/device metrics from pool metrics.
- Annotate scrubs, resilvers, and deploys on the same timeline as latency. Guessing is expensive.
- Write a 1-page runbook using the “Fast diagnosis playbook” above and include the exact commands your team will run.
- Set capacity alerts that trigger before panic: start warning at 70–75%, page at a threshold your workload can tolerate.
- Run one fio test (carefully, on a safe dataset) and record p95/p99. That becomes your baseline for “normal.”
- Practice a replacement drill: identify a drive, simulate failure in process (not physically), confirm you can interpret zpool status, and proceed without improvisation.
The goal isn’t to build a perfect observability cathedral. The goal is to see the next disaster while it’s still small enough to step on.