Some outages arrive with a bang. ZFS read-latency problems usually don’t. They show up as “the app feels weird,” dashboards that look fine until you zoom in, and customers discovering patience they didn’t know they had.
If you run production on ZFS and you’ve ever stared at zpool iostat wondering why throughput looks healthy while latency is on fire, this is your field guide. We’re going to use zpool iostat -r like adults: measure the right thing, isolate the bottleneck quickly, and make changes that don’t create a second incident.
What -r actually tells you (and what it doesn’t)
zpool iostat is the quickest “is my pool sick?” test we have that doesn’t require a kernel trace. Add -r and you get per-interval latency for read operations (and, depending on platform/version, a structured view of latency broken down into time spent in different stages). This is the tool you reach for when the question is:
- Are read requests queuing?
- Is the pool slow because the disks are slow, or because we’re asking ZFS to do expensive work?
- Is one vdev dragging the pool down?
Here’s the uncomfortable truth: -r does not magically reveal the entire path a read takes through ARC, compression, checksums, RAID-Z reconstruction, device firmware, controller, HBA, and back. It gives you strong signals. You still have to think.
What “read latency” means in ZFS terms
Read latency is the time from “a read was issued” to “data was returned.” But the pool sees it in a very specific way: ZFS measures and reports timings at the pool/vdev layer. That includes queueing and service time in the storage stack. It may not include the time your application spent waiting on CPU, locks, or scheduling delays before issuing the I/O.
Why -r is different from “disk latency” tools
Tools like iostat -x and sar -d focus on block devices. They’re great, but they don’t understand vdev topology. ZFS does. A mirror behaves differently than RAID-Z, and a special vdev changes the read path. zpool iostat -r is the closest thing to a “ZFS-native” read-latency view you can get without going full DTrace/eBPF.
A practical latency model for ZFS pools
If you want to read zpool iostat -r like a pro, you need a mental model that’s simple enough to use under stress, but accurate enough to prevent bad decisions.
Latency is usually one of four things
- Cache miss penalty (ARC/L2ARC doesn’t have it, you hit disks).
- Queueing (the pool/vdev is saturated; requests wait their turn).
- Service time (the device is slow per I/O: media, firmware, controller, SMR pain, rebuild/scrub contention).
- Amplification work (RAID-Z parity, decompression, checksum, gang blocks, small random reads across wide vdevs).
Your job is to figure out which one dominates right now. Not which one is philosophically important. Not which one makes a nice slide deck.
Read latency is workload-shaped
ZFS read behavior differs wildly depending on access pattern:
- Random 4–16K reads: IOPS bound. Mirrors shine. Wide RAID-Z vdevs can struggle.
- Large sequential reads: throughput bound. RAID-Z can look great until it doesn’t (fragmentation and recordsize matter).
- Metadata-heavy workloads: lots of small reads; special vdevs and recordsize choices matter a lot.
- Mixed read/write: reads can be dragged down by writes via contention, especially during sync-heavy periods and background work.
Know your “good” latency
There is no universal “good read latency.” But there are healthy ranges:
- NVMe mirrors: single-digit ms under load is often fine; sub-ms is common under light load.
- SAS/SATA SSD: a few ms is normal; tens of ms suggests queueing or garbage collection.
- 7200 RPM HDD: 8–15 ms service time is normal; 30–100+ ms likely means queueing or rebuild/scrub contention.
The key is correlation: latency with IOPS and bandwidth. High latency at low IOPS is “device or path is slow.” High latency only at high IOPS is “you’re saturating something.”
One quote that should be taped to your monitor:
“Hope is not a strategy.” — Gene Kranz
Interesting facts & historical context (short and useful)
- ZFS was designed with end-to-end data integrity baked in (checksums everywhere). That integrity work can show up as CPU cost under extreme read rates.
- ARC (Adaptive Replacement Cache) replaced simplistic LRU caching with something more resilient to scans; great for real workloads, sometimes surprising during big sequential reads.
- RAID-Z exists largely because “hardware RAID lied” about write ordering and error handling; ZFS wanted predictable semantics and scrubbable correctness.
- Scrubs are not optional “maintenance”; they’re how ZFS validates redundancy. But scrubs absolutely compete for read I/O and can inflate read latency.
- ashift is forever for a vdev: choose poorly (e.g., 512-byte sectors on 4K disks) and you can create amplification that haunts latency.
- L2ARC historically had a reputation for being “cache that eats RAM” because its metadata can pressure ARC; newer implementations improved behavior, but the tradeoff is still real.
- Special vdevs (metadata/small blocks) are a relatively modern addition in OpenZFS and can dramatically change read latency for metadata-heavy workloads—and also a new place to shoot yourself in the foot.
- RAID-Z expansion is a newer capability; before that, changing vdev width often meant rebuilding pools, which influenced how people designed for latency (mirrors became a default).
- ZFS prefetch has evolved; it can help sequential reads but can hurt when it mispredicts, increasing pointless I/O and latency for the real reads.
Walking the output: columns, units, and traps
zpool iostat has multiple output modes and the exact columns vary by OpenZFS version and OS packaging. The principle stays consistent: you’re looking at per-pool and per-vdev stats, per interval, and with -r you get read latency data.
Two traps show up constantly:
- Looking at pool-level latency only. A pool is the sum of its vdevs, and one bad vdev can poison the experience.
- Confusing “average latency” with “tail latency”.
zpool iostat is interval-based and generally average-ish. Your customers live in the tail.
Also: latency reported by ZFS is not always the same as what the app experiences. If the app is blocked on CPU or lock contention, ZFS can look innocent while users suffer.
Joke #1: Latency is like garbage—you can ignore it for a while, until it starts making decisions about your life.
Fast diagnosis playbook (first/second/third)
First: confirm it’s real and find the scope
- Run zpool iostat -r with a short interval to catch spikes and identify which pool/vdev is involved.
- Check if background work is running (scrub, resilver). If yes, decide whether to pause/slow it.
- Check ARC behavior: if you’re missing cache, disks will be busy and latency will rise. Confirm with ARC stats (a quick check is sketched right after this list).
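A quick way to confirm cache behavior without extra tooling, assuming a Linux host where the ARC counters are exposed under /proc/spl/kstat/zfs/arcstats (the path differs on FreeBSD/illumos):
cr0x@server:~$ # Sample hit/miss counters twice and compare the deltas by hand.
cr0x@server:~$ grep -E '^(hits|misses|size|c_max) ' /proc/spl/kstat/zfs/arcstats
cr0x@server:~$ sleep 10
cr0x@server:~$ grep -E '^(hits|misses) ' /proc/spl/kstat/zfs/arcstats
If misses grow much faster than hits while latency spikes, the disks are being asked to serve a working set the ARC doesn’t have.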
Second: decide if it’s saturation, a single bad actor, or a misconfiguration
- Saturation: latency rises with IOPS/bandwidth and drops when load drops. Likely queueing or device limits.
- Single bad actor: one vdev shows much higher read latency than peers. Suspect disk, path, cabling, firmware, or a degraded mirror.
- Misconfiguration: high read amplification (small recordsize on HDD RAID-Z, wrong ashift, special vdev near full, weird volblocksize) creates persistent latency under normal load.
Third: validate with one cross-check tool, not five
Pick the minimum extra tool needed:
- iostat -x to validate device-level queueing/service time.
- zpool status to confirm errors, degraded vdevs, ongoing operations.
- arcstat or equivalent to confirm cache hit rate trends.
If you need to jump to tracing, do it deliberately. Latency incidents love people who thrash around collecting “all the data.”
Practical tasks: commands, what the output means, and the decision you make
These are production-grade tasks. Each includes: a runnable command, a representative snippet of output, what it means, and what you do next. Adjust pool names and device paths to match your system.
Task 1: Get the baseline read latency per vdev
cr0x@server:~$ zpool iostat -v -r 1 10
capacity operations bandwidth read latency
pool alloc free read write read write read
-------------------------- ----- ----- ----- ----- ----- ----- -----
tank 2.15T 5.12T 3200 410 210M 18.2M 7ms
mirror-0 1.07T 2.56T 1580 210 104M 9.1M 6ms
sda - - 790 105 52.1M 4.6M 6ms
sdb - - 790 105 52.0M 4.6M 6ms
mirror-1 1.08T 2.56T 1620 200 106M 9.1M 7ms
sdc - - 810 100 53.0M 4.5M 7ms
sdd - - 810 100 53.0M 4.5M 7ms
-------------------------- ----- ----- ----- ----- ----- ----- -----
Meaning: Pool reads are split fairly evenly across mirrors. Latency is consistent across vdevs and disks. This looks like normal load, not a single failing device.
Decision: If latency is “too high,” it’s likely saturation or workload mismatch—not a single bad disk. Continue with workload and cache checks.
Task 2: Catch bursty latency by extending the sample window
cr0x@server:~$ zpool iostat -v -r 5
capacity operations bandwidth read latency
pool alloc free read write read write read
-------------------------- ----- ----- ----- ----- ----- ----- -----
tank 2.15T 5.12T 900 60 58.0M 2.1M 3ms
tank 2.15T 5.12T 4100 120 260M 6.0M 22ms
tank 2.15T 5.12T 800 55 52.0M 1.8M 4ms
^C
Meaning: The second interval shows a spike: IOPS and bandwidth jump and latency rises. That’s classic queueing under burst load.
Decision: Identify the burst source (backup job, analytics query, scrub/resilver). Consider throttling the workload or increasing vdev parallelism (more mirrors) if sustained.
Task 3: Confirm pool health and background operations
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices is currently being resilvered.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Dec 24 10:12:11 2025
1.24T scanned at 820M/s, 410G issued at 270M/s, 1.24T total
0B resilvered, 32.9% done, 1:05:12 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
Meaning: Resilvering is a read-heavy operation and can raise read latency for clients, especially on HDD.
Decision: If this is a latency incident, either schedule resilvers/scrubs for low-traffic windows or tune scan speed (see Task 10).
Task 4: Cross-check device-level latency and queueing
cr0x@server:~$ iostat -x 1 3
Linux 6.6.0 (server) 12/25/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.10 0.00 4.60 1.80 0.00 81.50
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await aqu-sz %util
sda 380.0 52000.0 0.0 0.0 6.10 136.8 55.0 4600.0 3.20 2.10 78.0
sdb 379.0 51900.0 0.0 0.0 6.00 136.9 55.0 4600.0 3.10 2.05 77.5
sdc 402.0 53000.0 0.0 0.0 7.20 131.8 50.0 4500.0 3.00 2.40 82.0
sdd 401.0 52900.0 0.0 0.0 7.10 131.9 50.0 4500.0 3.10 2.35 81.5
Meaning: Device-level read await roughly matches ZFS read latency in the earlier sample. Utilization is high-ish, queue size is non-trivial: you are working the disks.
Decision: If you need lower latency, reduce load, add vdevs, or move hot data to faster media. Don’t “tune” your way out of physics.
Task 5: Identify whether ARC is saving you or betraying you
cr0x@server:~$ arcstat 1 5
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
10:44:01 8200 980 12 40 0 920 11 20 0 96.2G 128G
10:44:02 7900 1100 13 38 0 1040 13 22 0 96.2G 128G
10:44:03 8100 4200 51 45 0 4130 51 25 0 96.2G 128G
10:44:04 8300 3900 46 42 0 3830 46 28 0 96.2G 128G
10:44:05 8000 1000 12 40 0 940 12 20 0 96.2G 128G
Meaning: Two seconds show a sudden miss spike. That’s often a scan, a cache-thrashing query, or a new working set that doesn’t fit. Miss spikes correlate strongly with read latency spikes.
Decision: If it’s a scheduled scan (backup, AV, analytics), isolate it or rate-limit it. If it’s normal traffic, you may need more RAM, better dataset layout, or faster disks.
Task 6: Check dataset properties that commonly shape read latency
cr0x@server:~$ zfs get -o name,property,value,source recordsize,compression,atime,primarycache,secondarycache tank/app
NAME PROPERTY VALUE SOURCE
tank/app recordsize 128K local
tank/app compression lz4 inherited from tank
tank/app atime off local
tank/app primarycache all default
tank/app secondarycache all default
Meaning: recordsize affects how much data ZFS reads per logical read. For database-like random reads, 128K can create unnecessary read amplification. Compression can help by reducing physical I/O, but costs CPU.
Decision: For OLTP-ish random reads, consider smaller recordsize (e.g., 16K) on a per-dataset basis. Don’t carpet-bomb the whole pool.
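A minimal sketch of the per-dataset approach, using a hypothetical canary dataset (tank/app-canary); remember that recordsize only affects blocks written after the change:
cr0x@server:~$ # Create a canary with the smaller recordsize instead of flipping tank/app in place.
cr0x@server:~$ sudo zfs create -o recordsize=16K tank/app-canary
cr0x@server:~$ zfs get recordsize,compression tank/app-canary
cr0x@server:~$ # Load a representative slice of data, replay the read workload, and compare zpool iostat -r before/after.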
Task 7: Validate pool topology and vdev imbalance
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
Meaning: A single wide RAID-Z2 vdev concentrates IOPS limits. Random read latency will climb sooner compared to multiple mirrors or multiple narrower vdevs.
Decision: If your workload is random-read heavy and latency-sensitive, prefer mirrors or multiple vdevs. RAID-Z is fine for capacity and streaming, not magic for IOPS.
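For reference, the mirror-based layout looks like this; a sketch assuming four spare disks with placeholder by-id paths and a hypothetical pool name (fastpool):
cr0x@server:~$ # Two mirror vdevs: random reads spread across vdevs, and each read can be served by either side of a mirror.
cr0x@server:~$ sudo zpool create -o ashift=12 fastpool \
    mirror /dev/disk/by-id/DISK-A /dev/disk/by-id/DISK-B \
    mirror /dev/disk/by-id/DISK-C /dev/disk/by-id/DISK-D
Adding more mirror pairs later with zpool add scales random-read IOPS roughly linearly, which is the property a single wide RAID-Z vdev can’t give you.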
Task 8: Detect a single slow disk inside a mirror
cr0x@server:~$ zpool iostat -v -r 1 5
capacity operations bandwidth read latency
pool alloc free read write read write read
-------------------------- ----- ----- ----- ----- ----- ----- -----
tank 2.15T 5.12T 2400 200 160M 8.0M 18ms
mirror-0 1.07T 2.56T 1200 100 80.0M 4.0M 35ms
sda - - 600 50 40.0M 2.0M 34ms
sdb - - 600 50 40.0M 2.0M 36ms
mirror-1 1.08T 2.56T 1200 100 80.0M 4.0M 2ms
sdc - - 600 50 40.0M 2.0M 2ms
sdd - - 600 50 40.0M 2.0M 2ms
-------------------------- ----- ----- ----- ----- ----- ----- -----
Meaning: Mirror-0 is dramatically worse than mirror-1. Even if both disks in mirror-0 show similar latency, it can be a shared path (SAS expander lane, HBA port) or media issue.
Decision: Investigate hardware path: swap cables/ports, check SMART, check HBA logs. If you confirm a disk is slow, replace it before it becomes “degraded + slow,” the worst combo.
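A quick hardware triage pass, assuming smartmontools is installed and the suspect pair is the sda/sdb mirror from the sample above:
cr0x@server:~$ # Reallocated/pending sectors point at media; CRC errors point at cables or backplane.
cr0x@server:~$ sudo smartctl -a /dev/sda | grep -iE 'reallocat|pending|crc|error'
cr0x@server:~$ sudo smartctl -a /dev/sdb | grep -iE 'reallocat|pending|crc|error'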
Task 9: Check for checksum errors that force expensive reads
cr0x@server:~$ zpool status -v
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error.
action: Replace the device or restore the pool from backup.
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda ONLINE 0 0 12
sdb ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
tank/app@autosnap:2025-12-25:00:00:00:/var/lib/app/index.dat
Meaning: Checksum errors can trigger retries, reconstruction, and slow reads. Even if the pool stays ONLINE, latency can spike when ZFS has to work harder to return correct data.
Decision: Treat checksum errors as urgent. Replace questionable media/path, scrub after replacement, and validate application-level corruption risk.
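The replacement flow, sketched against the sda device from the sample output; confirm you still have redundancy first and use stable by-id paths (the new-disk path here is a placeholder):
cr0x@server:~$ sudo zpool replace tank sda /dev/disk/by-id/NEW-DISK
cr0x@server:~$ zpool status tank                # watch the resilver
cr0x@server:~$ sudo zpool scrub tank            # after the resilver completes, validate the whole pool
cr0x@server:~$ sudo zpool clear tank            # reset error counters only once you trust the hardware again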
Task 10: Control scrub/resilver impact on read latency
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_vdev_scrub_min_active
1
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
2
Meaning: These parameters influence how aggressively ZFS issues scrub I/O. Higher values can finish scrubs faster but can absolutely inflate read latency during business hours.
Decision: On latency-sensitive systems, keep scrub aggressiveness conservative during peak and schedule scrubs off-hours. If your platform supports zpool scrub -p (pause) or zpool scrub -s (stop), pausing scrubs during incidents is a pragmatic move.
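During an incident, pausing is usually the smallest safe move; a sketch assuming an OpenZFS version with scrub pause support:
cr0x@server:~$ sudo zpool scrub -p tank           # pause; progress is kept
cr0x@server:~$ zpool status tank | grep scan      # confirm it shows as paused
cr0x@server:~$ sudo zpool scrub tank              # resume later, off-hours
Lowering zfs_vdev_scrub_max_active at runtime is also possible on Linux (echo the new value into the file under /sys/module/zfs/parameters), but treat module tunables as version-specific and verify against your platform’s documentation.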
Task 11: Validate special vdev health and space (metadata hot path)
cr0x@server:~$ zpool list -v tank
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 7.27T 2.15T 5.12T - - 18% 29% 1.00x ONLINE -
special 894G 770G 124G - - 35% 86% - ONLINE -
mirror-2 894G 770G 124G - - 35% 86% - ONLINE -
nvme0n1 - - - - - - - - ONLINE -
nvme1n1 - - - - - - - - ONLINE -
Meaning: Special vdev is at 86% capacity. That’s where latency incidents go to breed. When special fills, ZFS behavior changes and metadata placement can get ugly.
Decision: Keep special vdev well below “scary” utilization. If it’s filling, either add special capacity or adjust which blocks are eligible (carefully). Treat special vdev as a tier-0 dependency.
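Two concrete levers, sketched; special_small_blocks only affects newly written blocks, and tank/bulk is a hypothetical dataset name:
cr0x@server:~$ zpool list -v tank | grep -A2 special          # track special ALLOC/CAP separately from pool CAP
cr0x@server:~$ zfs get special_small_blocks tank/app
cr0x@server:~$ sudo zfs set special_small_blocks=0 tank/bulk  # metadata only: stop small data blocks from this dataset landing on special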
Task 12: Spot pathological small-block reads using zpool iostat block-size view
cr0x@server:~$ zpool iostat -v -r -w 1 3
capacity operations bandwidth read latency
pool alloc free read write read write read
-------------------------- ----- ----- ----- ----- ----- ----- -----
tank 2.15T 5.12T 9000 200 72.0M 6.0M 14ms
mirror-0 1.07T 2.56T 4600 100 36.0M 3.0M 15ms
mirror-1 1.08T 2.56T 4400 100 36.0M 3.0M 13ms
-------------------------- ----- ----- ----- ----- ----- ----- -----
Meaning: High read IOPS with modest bandwidth implies small I/O sizes. That’s an IOPS game, not a throughput game.
Decision: If this is HDD-backed RAID-Z, expect pain. For SSD/NVMe mirrors, check CPU and metadata path. Consider special vdevs for metadata/small blocks if appropriate.
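If the pool doesn’t have a special vdev yet, adding one is a one-way decision (it becomes a required top-level vdev, so mirror it); a sketch with placeholder NVMe by-id paths:
cr0x@server:~$ # New metadata (and small blocks, if configured) will land here; existing data is not migrated automatically.
cr0x@server:~$ sudo zpool add tank special mirror /dev/disk/by-id/nvme-NEW-0 /dev/disk/by-id/nvme-NEW-1
cr0x@server:~$ zpool status tank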
Task 13: Verify ashift to avoid read-modify amplification
cr0x@server:~$ zdb -C tank | grep -nE 'ashift|path'
45: ashift: 12
62: path: '/dev/disk/by-id/ata-SAMSUNG_SSD_870_EVO_2TB_S5...'
88: ashift: 9
105: path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68...'
Meaning: Mixed ashift values can happen across vdevs, especially if the pool grew over time. ashift: 9 (512B) on 4K-native or 4K-emulated disks can create extra work and latency.
Decision: If you find wrong ashift on a vdev, plan remediation: replace vdev by migration, rebuild pool, or accept the cost. There’s no safe in-place “flip ashift” button.
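When you do rebuild or migrate, pin ashift explicitly rather than trusting what the drives report; a sketch with a hypothetical pool name and placeholder devices (the ashift pool property is exposed on recent OpenZFS releases):
cr0x@server:~$ zpool get ashift tank              # pool-level default for newly added vdevs (0 = auto-detect)
cr0x@server:~$ sudo zpool create -o ashift=12 newtank \
    mirror /dev/disk/by-id/DISK-X /dev/disk/by-id/DISK-Y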
Task 14: Check whether your “optimization” is actually prefetch thrash
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_prefetch_disable
0
Meaning: Prefetch is enabled. For sequential workloads it can help. For some mixed/random patterns, it can increase useless reads and worsen latency.
Decision: Don’t blindly disable prefetch. Test it on the specific workload and observe zpool iostat -r plus ARC miss patterns. If disabling it reduces read latency without hurting throughput, keep it off for that host.
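A reversible test, assuming Linux where the tunable is writable at runtime; capture the same interval before and after so the comparison means something:
cr0x@server:~$ zpool iostat -v -r 1 30 > /tmp/latency-prefetch-on.txt
cr0x@server:~$ echo 1 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable
cr0x@server:~$ zpool iostat -v -r 1 30 > /tmp/latency-prefetch-off.txt
cr0x@server:~$ echo 0 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable   # revert if the numbers don't improve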
Three corporate mini-stories from the latency trenches
1) The incident caused by a wrong assumption
The team had a ZFS-backed VM platform. Mirrors on SSDs, lots of RAM, a decent HBA. One Monday, users complained that VM consoles were laggy and deployments were timing out. The graphs showed CPU fine, network fine, and “disk throughput” not outrageous. The on-call stared at zpool iostat and saw read bandwidth sitting comfortably under what the SSDs could handle. They assumed storage wasn’t the issue.
It was. zpool iostat -v -r showed read latency spiking to tens of milliseconds for short bursts. Nothing sustained, just enough to make synchronous operations (like metadata reads during boot storms) fall over. The pool had a special vdev for metadata on a pair of small NVMe devices. One of them was quietly throwing media errors that didn’t fully degrade the mirror yet, but did trigger retries and slow paths.
The wrong assumption was “if throughput is fine, storage is fine.” In reality, throughput is a bulk metric; the platform was dying by a thousand tiny reads. Every retry extended the critical path for metadata-heavy operations.
They replaced the NVMe, scrubbed, and the issue vanished. The postmortem included a simple rule: when users report “laggy,” look at latency first. Throughput is what you brag about after the incident.
2) The optimization that backfired
A database team wanted faster reads. They enabled L2ARC on a big SSD and celebrated when ARC hit rate improved during a synthetic benchmark. Then production got… slower. Not catastrophically, but enough that tail latencies moved from “annoying” to “ticket-generating.”
zpool iostat -r showed read latency spikes coinciding with CPU softirqs and memory pressure. ARC size looked stable, but the system was spending a lot more time managing cache metadata. The L2ARC device was fast; the host wasn’t free. In their case, the workload had a large sequential component plus a hot random set. L2ARC helped sometimes, but it also encouraged more churn and overhead.
The backfire wasn’t that L2ARC is bad. It was that they treated it like a free speed upgrade. L2ARC costs RAM and CPU, and under some access patterns it can increase work per read enough to offset reduced disk I/O.
They rolled it back, then reintroduced L2ARC with a smaller size and more RAM in the host. The real win came later: moving a few metadata-heavy datasets onto a mirrored NVMe special vdev, which reduced small-read latency directly without cache-churn drama.
3) The boring but correct practice that saved the day
A storage cluster ran nightly scrubs and had an unpopular policy: scrub windows were fixed, and any team wanting to run heavy batch reads had to coordinate. People complained. It felt bureaucratic.
One week, a firmware regression hit a batch of disks. Nothing failed outright. Instead, one drive in a mirror started taking longer to respond under sustained reads. Latency rose slowly. The interesting part: the system’s monitoring had a weekly “scrub latency profile” because scrubs were consistent. When the drive began to misbehave, the scrub window showed a clear deviation: read latency on one vdev climbed while others stayed flat.
They replaced the disk proactively. No outage, no degraded vdev during peak, no “mystery performance incident.” The boring practice—regular scrubs on a known schedule and comparing like-for-like—turned into an early warning system.
When people asked why they didn’t just “scrub whenever,” the answer was simple: consistency creates baselines, and baselines catch slow failures. Randomness creates folklore.
Joke #2: The only thing more predictable than disk failures is the meeting where someone says, “But it worked in staging.”
Common mistakes: symptom → root cause → fix
1) Symptom: pool read latency spikes during business hours, especially in short bursts
Root cause: background scans (scrub/resilver) or bursty batch jobs creating queueing.
Fix: schedule heavy scans off-hours, reduce scrub aggressiveness, and rate-limit batch readers. Confirm with zpool status and compare zpool iostat -r during/without the job.
2) Symptom: one vdev shows much higher read latency than peers
Root cause: single slow disk, bad SAS lane, HBA port issue, or intermittent media errors causing retries.
Fix: cross-check with iostat -x, SMART, and HBA logs. Swap ports/cables if possible; replace suspect disk early.
3) Symptom: high read latency but low IOPS and low bandwidth
Root cause: device is slow per I/O (firmware GC, SMR behavior, error recovery) or kernel-level congestion.
Fix: validate with iostat -x (r_await high, %util low-to-moderate), check drive logs, consider replacing the device class. For HDD pools, ensure no hidden SMR drives snuck in.
4) Symptom: read latency rises with small I/O workloads on RAID-Z
Root cause: IOPS limitation and read amplification on wide RAID-Z vdevs; metadata seeks dominate.
Fix: redesign for mirrors or add vdevs to increase parallelism. For existing pools, consider moving latency-sensitive datasets to a mirrored SSD/NVMe pool.
5) Symptom: after enabling a special vdev, latency improves then degrades months later
Root cause: special vdev filling up or fragmenting; metadata/small blocks forced back to main vdevs or placed inefficiently.
Fix: keep special vdev well under high utilization, add capacity early, and monitor special allocation separately. Treat it as production-critical.
6) Symptom: latency spikes after changing recordsize or volblocksize
Root cause: mismatch between I/O pattern and block size causing amplification, plus old data not rewritten (properties apply mostly to new writes).
Fix: test changes on a canary dataset, rewrite data if needed (migration/replication), and validate with zpool iostat -r under representative load.
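A sketch of the rewrite path. zfs send/receive generally preserves the source block layout, so a file-level copy into a dataset created with the new recordsize is the reliable way to re-block existing data (tank/app-new is a hypothetical destination; mountpoints assumed to be the defaults):
cr0x@server:~$ sudo zfs create -o recordsize=16K tank/app-new
cr0x@server:~$ sudo rsync -aHAX /tank/app/ /tank/app-new/     # file-level copy rewrites every block at the new recordsize
cr0x@server:~$ # Cut the application over in a maintenance window, then retire the old dataset.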
7) Symptom: latency spikes look like storage, but ZFS latency is normal
Root cause: application or VM host CPU contention, lock contention, or scheduler delays. Storage is the scapegoat because it’s measurable.
Fix: correlate with CPU run queue, vmstat, and app profiling. Don’t “tune ZFS” when ZFS isn’t the problem.
Checklists / step-by-step plan
Step-by-step: when you get paged for “slow reads”
- Confirm the symptom: is it user-facing latency, app timeouts, or batch slowdown? Get one concrete example and timestamp.
- Run zpool iostat -v -r 1 10 on the affected host. Identify the worst pool/vdev.
- Check zpool status for scrub/resilver, errors, degraded vdevs.
- Cross-check with iostat -x 1 3 to see if devices are saturated or slow per I/O.
- Check ARC hit rate (Task 5). If misses spike, identify the workload causing it.
- Make the smallest safe intervention:
- pause or reschedule scrub/resilver if possible and appropriate,
- throttle the batch reader,
- fail out a clearly sick disk if you have strong evidence and redundancy,
- move the hot dataset to a faster pool if you already have one.
- Validate improvement with the same commands and interval. Don’t declare victory based on vibes.
- Write down what you saw: “mirror-0 read latency 35ms while mirror-1 2ms” is gold for follow-up.
Checklist: design choices that prevent read-latency incidents
- Choose vdev topology based on IOPS needs (mirrors for random reads; RAID-Z for capacity/streaming).
- Set ashift correctly at creation time.
- Use dataset-level tuning (recordsize, atime, compression) rather than global heroics.
- Keep special vdevs healthy, mirrored, and not near full.
- Schedule scrubs and keep them consistent to build baselines.
- Monitor latency, not just throughput and space.
Checklist: what not to do during a latency incident
- Don’t change five tunables at once. You will not know what helped.
- Don’t “fix” latency by disabling checksums or integrity features. That’s not engineering; that’s denial.
- Don’t assume a fast SSD pool can’t be the problem. NVMe can queue too.
- Don’t ignore a single vdev outlier. Pools are only as fast as their slowest critical path.
FAQ
1) What does zpool iostat -r measure exactly?
It reports read latency observed at the ZFS pool/vdev layer over each sampling interval. Think “time to satisfy reads seen by ZFS,” including queueing and device service time.
2) Why is my app slow when zpool iostat -r looks fine?
Because storage isn’t always the bottleneck. CPU scheduling delays, lock contention, network, or application serialization can create latency while ZFS stays healthy. Correlate with system and app metrics.
3) Why does latency spike during scrubs if scrubs are “just reads”?
Scrubs compete for the same read bandwidth and IOPS as clients, and they can increase head movement on HDD. Even on SSD, they can add queueing and reduce cache effectiveness.
4) Can L2ARC reduce read latency?
Yes—when the workload has a stable working set that fits and the host has RAM/CPU to manage it. It can also backfire by increasing overhead and cache churn. Measure before and after.
5) Is RAID-Z always worse for read latency than mirrors?
For small random reads, usually yes: mirrors scale IOPS with vdev count and have simpler reconstruction. For large sequential reads, RAID-Z can be excellent. Match topology to workload.
6) My pool read latency is high but disks show low %util. How?
Possible causes: the bottleneck is above the disks (CPU, checksum/decompression, contention), or the measurement window hides bursts, or the device is slow per I/O without being “busy” in the classic sense. Use shorter intervals and cross-check.
7) What’s the fastest way to identify “one bad disk” behavior?
Compare per-vdev latency in zpool iostat -v -r and then check per-device await in iostat -x. If one path is consistently worse, dig into that path first.
8) Does changing recordsize fix existing data read latency?
Not directly. recordsize affects how new data is laid out. Existing blocks keep their old structure until rewritten (by rewrite/migration/replication).
9) Should I disable prefetch to reduce read latency?
Sometimes. Prefetch helps sequential workloads and can hurt mispredicted patterns. Don’t cargo-cult it; test on the actual workload and watch latency plus ARC misses.
10) What latency number should trigger paging someone?
Pick thresholds relative to your baseline and workload. A better policy: page on sustained deviation (e.g., 3–5× baseline for several minutes) plus user impact, not a single spike.
Conclusion: next steps you can actually do
If you want to get good at ZFS latency, stop treating it like a mystical property of disks. It’s a measurable output of topology, workload, caching behavior, and background work. zpool iostat -r gives you the closest thing to a truthful story, as long as you ask it the right questions.
Do this next:
- Run zpool iostat -v -r 1 10 during normal load and save the output as your baseline (a minimal capture sketch follows this list).
- Schedule scrubs consistently and record latency profiles so you can spot slow failures early.
- Identify your top 2–3 latency-sensitive datasets and audit their properties (recordsize, compression, special vdev eligibility).
- Decide whether your pool topology matches your workload. If it doesn’t, accept that the real fix is architectural, not a tunable.
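A minimal baseline-capture sketch, assuming cron and a log location of your choosing (script name and paths are placeholders); the value is in consistency, not sophistication:
cr0x@server:~$ cat /usr/local/bin/zfs-latency-baseline.sh
#!/bin/sh
# Run hourly from cron: append a 10-sample per-vdev latency snapshot to a dated log.
mkdir -p /var/log/zpool-latency
zpool iostat -v -r 1 10 >> "/var/log/zpool-latency/$(date +%F).log"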
Latency doesn’t care about your intentions. Measure it, respect it, and it usually behaves. Usually.