You don’t need a dashboard to know your storage is unhappy. You need one angry API timeout, one “why is the deploy stuck?” Slack thread,
and one exec who says “but it worked yesterday.” ZFS gives you a truth serum: zpool iostat -w.
Used well, it tells you whether you’re CPU-bound, IOPS-bound, latency-bound, sync-write bound, or simply “bound by optimism.”
Used badly, it convinces you to buy hardware you don’t need, tune knobs you don’t understand, and blame “the network” out of habit.
What -w is really showing (and why you should care)
zpool iostat is the heartbeat monitor. The -w flag adds the part you usually wish you had during an incident: latency.
Not theoretical latency. Not vendor datasheet latency. Observed latency as ZFS sees it for pool and vdev operations.
If you run production systems, you should treat zpool iostat -w as the “is storage the bottleneck right now?” tool.
It answers:
- Are we queuing? Latency grows before throughput drops.
- Where is the pain? Pool-level vs one slow vdev vs a single mirror leg.
- What kind of pain? Reads, writes, sync writes, metadata-heavy work, resilver, trim, or “small random everything.”
Latency is the currency apps spend. Throughput is what storage teams brag about. When the bill comes due, apps pay in latency.
What -w adds and what it does not
The exact columns vary by ZFS implementation and version (OpenZFS on Linux vs FreeBSD vs illumos). But broadly:
- It adds latency columns (often separate for read and write).
- It may add queue or “wait” time depending on platform.
- It does not magically separate ZFS intent latency from device latency unless you ask for it (more on that later).
- It does not tell you why latency is high; it tells you where to dig next.
One dry truth: zpool iostat can make your system look healthy even while your application is on fire, because the ARC is quietly saving you.
Then the cache misses spike and the same pool collapses under real disk reads. You need to watch it continuously, not just when you’re already doomed.
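If you want to see whether the ARC is actually doing that saving, here is a minimal check on Linux OpenZFS (assuming /proc/spl/kstat/zfs/arcstats is present; the bundled arcstat utility reports the same counters more readably):
cr0x@server:~$ # cumulative ARC hits vs misses since boot; a fast-rising miss count during an incident means reads are hitting disk
cr0x@server:~$ awk '/^(hits|misses) / {print $1, $3}' /proc/spl/kstat/zfs/arcstats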
Quick history and facts that make the output click
A few context points turn “columns of numbers” into a story you can act on. Here are nine short facts that actually matter.
- ZFS was born at Sun Microsystems (mid-2000s) as an end-to-end storage stack: filesystem + volume manager, with checksums everywhere.
- The “pool” idea is the core innovation: filesystems live on top of a pool, and the pool decides where blocks go across vdevs.
- Copy-on-write (CoW) is why fragmentation feels different: ZFS doesn’t overwrite blocks in place; it writes new blocks and updates pointers.
- The ZIL is not a write cache: it’s a log for synchronous writes. It exists even without a dedicated SLOG device.
- SLOG is a device, ZIL is a concept: adding a SLOG moves the ZIL to faster media, but only for sync writes.
- OpenZFS unified multiple forks so features like device removal, special vdevs, and persistent L2ARC became more common across platforms.
- Ashift is forever (mostly): set wrong at pool creation and you carry that performance penalty for the life of the vdev.
- “IOPS” is a half-truth without latency: you can push high IOPS with terrible tail latency; your database will still hate you.
- Resilver and scrub are intentional pain: they are background reads/writes that can dominate zpool iostat if you let them.
Exactly one quote, because engineers deserve better than motivational posters:
Hope is not a strategy.
— paraphrased idea often attributed to operations leadership circles.
A production mental model: from app request to vdev
When an application does I/O on ZFS, you’re watching multiple layers negotiate reality:
- The app issues reads/writes (often small, often random, and occasionally insulting).
- The OS and ZFS aggregate, cache (ARC), and sometimes reorder.
- ZFS translates logical blocks into physical writes across vdevs, obeying redundancy and allocation rules.
- Your vdevs translate that to actual device commands. The slowest relevant component sets the pace.
Pool vs vdev: why your “fast disks” can still be slow
ZFS performance is vdev-centric. A pool is a set of vdevs, and pool performance scales with the number of vdevs, not with the raw disk count the way people naively assume.
Mirrors give you more IOPS per vdev than RAIDZ, and multiple mirrors scale. RAIDZ vdevs are great at capacity efficiency and large sequential reads,
but they don’t magically turn into IOPS monsters. If you built one giant RAIDZ2 vdev with a pile of disks and expected database-grade random I/O,
you bought a minivan and entered it in a drag race.
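To make the layout difference concrete, here is roughly what the two shapes look like at creation time. Pool and device names are hypothetical, and this is a sketch of two alternatives for the same six disks, not a recommendation:
cr0x@server:~$ # three mirror vdevs: random IOPS scale with the vdev count
cr0x@server:~$ sudo zpool create fastpool mirror sda sdb mirror sdc sdd mirror sde sdf
cr0x@server:~$ # one wide RAIDZ2 vdev: better capacity efficiency, roughly one vdev's worth of random IOPS
cr0x@server:~$ sudo zpool create bigpool raidz2 sda sdb sdc sdd sde sdf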
Latency is a stack: service time + waiting time
The most useful way to interpret -w is: “is the device slow” vs “is the device busy.”
High latency with moderate utilization often means the device itself is slow (or failing, or doing internal garbage collection).
High latency with high IOPS/throughput usually means queueing: the workload is beyond what the vdev can serve at acceptable latency.
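On OpenZFS you can look at that split directly: the -l flag breaks latency into total, disk, and queue wait, and -q shows queue depths per I/O class. Flag availability varies by version, so treat this as a sketch:
cr0x@server:~$ # total_wait vs disk_wait: if total is much larger than disk, time is spent queuing, not on the device
cr0x@server:~$ sudo zpool iostat -l -v tank 1
cr0x@server:~$ # queue depths per class (sync/async read/write, scrub, trim): sustained deep queues mean the vdev can't keep up
cr0x@server:~$ sudo zpool iostat -q -v tank 1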
How to read the columns without lying to yourself
You’ll see variants like:
- capacity: used, free, fragmentation, and sometimes allocated space by pool/vdev.
- operations: read/write operations per second (IOPS).
- bandwidth: read/write bytes per second (throughput).
- latency: read/write latency, sometimes broken into “wait” and “service.”
The rule: you diagnose with a combination of IOPS, bandwidth, and latency. Any one alone is a liar.
Pool-level lines are averages; vdev lines are truth
Pool-level stats can hide a single sick disk in a mirror or a single slow RAIDZ vdev dragging everything.
Always use -v to see vdev breakdown when diagnosing.
Watch the change, not the number
zpool iostat is best used as a time series. Run it with a 1-second or 2-second interval and watch trends.
The “now” matters more than the lifetime averages.
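A minimal way to keep that time series around for later comparison; the -T d timestamp flag exists on OpenZFS, and the log path is just an example:
cr0x@server:~$ sudo zpool iostat -w -v -T d tank 2 | tee -a /var/tmp/zpool-iostat-$(date +%F).log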
Joke #1: Storage graphs are like horoscopes—vague until the pager goes off, and then suddenly they’re “obviously predictive.”
Fast diagnosis playbook
This is the sequence I use when an app team says “storage is slow” and you have 90 seconds to decide whether that’s true.
First: determine if you’re looking at disk or cache
- Run zpool iostat -w at 1s intervals for 10–20 seconds.
- If read IOPS and bandwidth are low but the app is slow, suspect CPU, locking, network, or cache misses not reaching disk yet.
- If read IOPS spike and latency spikes with them, you’re on the disks. Now it’s real.
Second: isolate the bottleneck layer
- Use zpool iostat -w -v. Find the vdev with the worst latency or the highest utilization symptoms.
- If it's one disk in a mirror: likely failing, firmware weirdness, or pathing issue.
- If it’s a whole RAIDZ vdev: you’re saturating that vdev’s IOPS. Fix is architectural (more vdevs, mirrors, or accept latency).
Third: decide whether it’s sync write pain
- Look for write latency spikes that correlate with small write IOPS and low bandwidth.
- Check whether the workload is forcing sync writes (databases, NFS, hypervisors).
- If yes: examine SLOG health and device latency; consider whether your sync settings are correct for risk tolerance.
Fourth: check for “background jobs pretending to be traffic”
- Scrubs, resilvers, trims, and heavy snapshots/clones can dominate I/O.
- If zpool status shows active work, decide whether to throttle, schedule, or let it finish.
Fifth: confirm with a second signal
- Correlate with CPU (mpstat), memory pressure, or per-process I/O (pidstat).
- Use device-level tools (iostat -x) to see if one NVMe is melting down while ZFS averages everything.
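If you want those five steps as one paste-able block during an incident, here is a rough sketch; the pool name defaults to tank, the sample counts are arbitrary, and mpstat/pidstat come from the sysstat package:
#!/bin/bash
# quick-triage.sh -- a 90-second storage triage sketch; run with sudo
set -u
POOL=${1:-tank}

date
zpool status "$POOL" | head -n 20      # scrub/resilver running? device errors?
zpool iostat -w -v "$POOL" 1 10        # per-vdev latency right now
iostat -x 1 5                          # device-level queues and await
mpstat 1 5                             # are we CPU-bound instead of disk-bound?
pidstat -d 1 5                         # who is actually generating the I/O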
Practical tasks: commands, meaning, decisions
You asked for real tasks, not vibes. Here are fourteen. Each includes a command, an example output snippet, what it means, and the decision it drives.
Outputs are representative; your columns may differ by platform/version.
Task 1: Baseline pool latency in real time
cr0x@server:~$ sudo zpool iostat -w tank 1
capacity operations bandwidth latency
pool alloc free read write read write read write
---------- ----- ----- ----- ----- ----- ----- ----- -----
tank 3.21T 7.58T 210 180 18.2M 22.4M 2ms 4ms
tank 3.21T 7.58T 195 260 16.9M 31.1M 3ms 18ms
tank 3.21T 7.58T 220 240 19.1M 29.7M 2ms 20ms
Meaning: Writes jumped in latency from 4ms to ~20ms while bandwidth increased. That’s a classic “we’re pushing the write path.”
Decision: If this aligns with user-facing latency, treat storage as suspect. Next: add -v to find which vdev is causing it.
Task 2: Find the vdev that’s hurting you
cr0x@server:~$ sudo zpool iostat -w -v tank 1
operations bandwidth latency
pool vdev read write read write read write
---------- -------------------------------- ---- ----- ----- ----- ----- -----
tank - 220 240 19.1M 29.7M 2ms 20ms
tank mirror-0 110 120 9.5M 14.6M 2ms 10ms
tank nvme0n1 108 118 9.4M 14.4M 2ms 12ms
tank nvme1n1 112 119 9.6M 14.7M 2ms 65ms
tank mirror-1 110 120 9.6M 15.1M 2ms 10ms
tank nvme2n1 110 118 9.5M 14.8M 2ms 11ms
tank nvme3n1 110 121 9.7M 15.3M 2ms 10ms
Meaning: One mirror leg (nvme1n1) has 65ms write latency while its partner is fine. ZFS can serve reads from either leg, but writes must wait for both.
Decision: This is a device/path issue. Check SMART, firmware, PCIe errors, multipath, and link speed. Replace or fix the path before tuning ZFS.
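A hedged starting point for that device-level investigation; the device name comes from the example above, smartctl is from smartmontools, and nvme is from nvme-cli:
cr0x@server:~$ # SMART / NVMe health: media errors, temperature, percentage used, thermal throttle events
cr0x@server:~$ sudo smartctl -a /dev/nvme1n1
cr0x@server:~$ sudo nvme smart-log /dev/nvme1n1
cr0x@server:~$ # kernel-side resets, timeouts, and PCIe complaints
cr0x@server:~$ sudo dmesg | grep -iE 'nvme|pcie' | tail -n 50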
Task 3: Separate “busy” from “broken” at the device layer
cr0x@server:~$ iostat -x 1 3
Linux 6.5.0 (server) 12/25/2025 _x86_64_ (32 CPU)
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 110.0 120.0 9.5 14.4 197.0 0.4 3.1 2.0 4.2 0.4 18.0
nvme1n1 112.0 119.0 9.6 14.7 198.2 9.8 66.7 2.2 65.8 0.5 95.0
Meaning: nvme1n1 is at high utilization with a deep queue and huge write await. That’s not ZFS being “chatty”; that’s a sick or throttled device.
Decision: Stop blaming recordsize. Investigate device health and throttling (thermal, firmware GC). If it’s a shared PCIe lane, fix topology.
Task 4: Detect a sync-write workload quickly
cr0x@server:~$ sudo zpool iostat -w tank 1
operations bandwidth latency
pool read write read write read write
---------- ---- ----- ----- ----- ----- -----
tank 80 3200 6.1M 11.8M 1ms 35ms
tank 75 3500 5.9M 12.4M 1ms 42ms
Meaning: Very high write IOPS but low write bandwidth means small writes. Latency is high. If the app is a database, VM host, or NFS server, assume sync pressure.
Decision: Check SLOG and sync settings. If you don’t have a SLOG and you need safe sync, accept that spinning disks will cry.
Task 5: Verify whether datasets are forcing sync behavior
cr0x@server:~$ sudo zfs get -o name,property,value sync tank
NAME PROPERTY VALUE
tank sync standard
Meaning: Pool/datasets are using default semantics: honor app sync requests.
Decision: If latency is killing you and you can accept risk for a specific dataset (not the whole pool), consider sync=disabled only for that dataset.
If you can’t explain the risk to an auditor without sweating, don’t do it.
Task 6: Check if you even have a SLOG and whether it’s healthy
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
nvme2n1 ONLINE 0 0 0
nvme3n1 ONLINE 0 0 0
logs
mirror-2 ONLINE 0 0 0
nvme4n1 ONLINE 0 0 0
nvme5n1 ONLINE 0 0 0
Meaning: A mirrored SLOG exists. Good: single-device SLOG is a foot-gun if you care about durability.
Decision: If sync writes are slow, measure SLOG device latency separately and consider replacing with power-loss-protected NVMe.
Task 7: Measure per-vdev behavior during a scrub or resilver
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Thu Dec 25 10:11:03 2025
1.23T scanned at 1.1G/s, 512G issued at 450M/s, 3.21T total
0B repaired, 15.9% done, 01:42:18 to go
cr0x@server:~$ sudo zpool iostat -w -v tank 1
operations bandwidth latency
pool vdev read write read write read write
---------- -------------------------------- ---- ----- ----- ----- ----- -----
tank - 6200 180 1.05G 22.4M 12ms 4ms
tank mirror-0 3100 90 525M 11.2M 13ms 4ms
tank mirror-1 3100 90 525M 11.2M 12ms 4ms
Meaning: Reads are dominated by scrub. Latency is elevated but explainable. Writes look normal.
Decision: If this is a busy production window, throttle scrub (platform-dependent) or reschedule. If your read latency is wrecking SLOs, scrub timing is now an SRE problem, not a storage problem.
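On OpenZFS you can pause an in-progress scrub and resume it later; whether pausing is acceptable is a business call, but the mechanics are simple:
cr0x@server:~$ sudo zpool scrub -p tank      # pause the running scrub
cr0x@server:~$ sudo zpool status tank        # the scan line should now report a paused scrub
cr0x@server:~$ sudo zpool scrub tank         # resume later; it continues where it left off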
Task 8: Identify a single slow disk in a mirror without guessing
cr0x@server:~$ sudo zpool iostat -w -v tank 1
operations bandwidth latency
pool vdev read write read write read write
---------- -------------------------------- ---- ----- ----- ----- ----- -----
tank mirror-0 800 900 65M 72M 3ms 14ms
tank sda 400 450 32M 36M 2ms 6ms
tank sdb 400 450 33M 36M 3ms 80ms
Meaning: sdb is the anchor dragging the boat. Mirrors write to both sides; one slow leg poisons write latency.
Decision: Pull SMART data, check cabling/HBA, and plan replacement. Do not “tune” your way around a dying disk.
Task 9: Confirm ashift and why it matters
cr0x@server:~$ sudo zdb -C tank | grep -E "ashift|vdev_tree" -n
45: ashift: 12
Meaning: ashift=12 (4K sectors). This is generally sane for modern disks and SSDs. If you see ashift=9 on 4K-native media, you’re paying a write amplification tax forever.
Decision: If ashift is wrong, you don’t “fix it with a sysctl.” You migrate to a new vdev/pool with correct ashift.
Task 10: Detect metadata-heavy pain and consider a special vdev
cr0x@server:~$ sudo zpool iostat -w -v tank 1
operations bandwidth latency
pool vdev read write read write read write
---------- -------------------------------- ---- ----- ----- ----- ----- -----
tank - 5200 2100 48M 39M 18ms 22ms
tank raidz2-0 5200 2100 48M 39M 18ms 22ms
Meaning: Huge IOPS with tiny bandwidth is the signature of metadata or small-block random I/O (directory walks, small files, maildirs, container layers).
RAIDZ is not thrilled.
Decision: Consider adding mirrors, adding more vdevs, or using a special vdev for metadata/small blocks if your platform supports it and you can operationalize it safely.
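If you go the special vdev route, the change looks roughly like this; device and dataset names are hypothetical, the special vdev must be redundant (losing it loses the pool), and small-block routing only applies to newly written data:
cr0x@server:~$ # add a mirrored special vdev for metadata (and optionally small blocks)
cr0x@server:~$ sudo zpool add tank special mirror nvme6n1 nvme7n1
cr0x@server:~$ # route blocks of 32K or smaller on this dataset to the special vdev (new writes only)
cr0x@server:~$ sudo zfs set special_small_blocks=32K tank/containers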
Task 11: Verify dataset recordsize vs workload (and avoid self-inflicted pain)
cr0x@server:~$ sudo zfs get -o name,property,value recordsize tank/db
NAME PROPERTY VALUE
tank/db recordsize 128K
Meaning: 128K recordsize is fine for large sequential I/O, but many databases prefer smaller (like 16K) depending on engine and page size.
Decision: If your workload is random reads/writes of small pages and you see read amplification, test a smaller recordsize for that dataset.
Do not change recordsize blindly for an existing dataset and expect instant miracles; existing blocks stay as they were.
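Because recordsize only affects newly written blocks, the honest way to test it is a new dataset and a copy; dataset names and paths here are examples:
cr0x@server:~$ sudo zfs create -o recordsize=16K tank/db16k
cr0x@server:~$ # rewrite the data so it actually lands in 16K records, then benchmark against the original
cr0x@server:~$ sudo rsync -a /tank/db/ /tank/db16k/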
Task 12: Check compression and learn whether you’re CPU-bound
cr0x@server:~$ sudo zfs get -o name,property,value compression,compressratio tank/vm
NAME PROPERTY VALUE
tank/vm compression zstd
tank/vm compressratio 1.62x
cr0x@server:~$ mpstat 1 3
Linux 6.5.0 (server) 12/25/2025 _x86_64_ (32 CPU)
12:11:01 AM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
12:11:02 AM all 72.0 0.0 11.0 2.0 0.0 0.0 0.0 15.0
Meaning: Compression is active and effective. CPU is pretty busy. If zpool iostat shows low disk activity but latency is high at the application, CPU could be the limiter.
Decision: If CPU saturation correlates with I/O latency, consider switching compression level (still keep compression), adding CPU, or isolating noisy neighbors.
Don’t turn compression off as your first reflex; it often reduces disk I/O.
Task 13: Spot TRIM/autotrim impact and decide when to run it
cr0x@server:~$ sudo zpool get autotrim tank
NAME PROPERTY VALUE SOURCE
tank autotrim on local
cr0x@server:~$ sudo zpool iostat -w tank 1
operations bandwidth latency
pool read write read write read write
---------- ---- ----- ----- ----- ----- -----
tank 180 220 12.4M 19.1M 2ms 6ms
tank 175 240 12.2M 19.8M 3ms 18ms
Meaning: If you see periodic write latency spikes without matching workload changes, background maintenance (including TRIM on some devices) can be a culprit.
Decision: If autotrim causes visible jitter on latency-sensitive systems, test disabling autotrim and scheduling manual trims during low-traffic windows. Measure, don’t guess.
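The manual variant looks roughly like this; trim support depends on the devices, and the -t status view is an OpenZFS feature:
cr0x@server:~$ sudo zpool set autotrim=off tank
cr0x@server:~$ sudo zpool trim tank          # kick off a manual trim during a quiet window
cr0x@server:~$ sudo zpool status -t tank     # per-device trim progress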
Task 14: Correlate “who is doing the I/O” with pool symptoms
cr0x@server:~$ pidstat -d 1 5
Linux 6.5.0 (server) 12/25/2025 _x86_64_ (32 CPU)
12:12:01 AM UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
12:12:02 AM 0 2211 120.0 82000.0 0.0 1 postgres
12:12:02 AM 0 3440 200.0 14000.0 0.0 0 qemu-system-x86
Meaning: Postgres is hammering writes. If zpool iostat -w shows high write latency, you can now have a productive conversation with the database owner.
Decision: Decide whether you’re dealing with a legitimate workload increase (scale storage), a misconfigured app (fsync loop), or an operational job (vacuum, reindex) that needs scheduling.
Recognizing workload patterns in the wild
The whole point of zpool iostat -w is to recognize patterns fast enough to act. Here are the common ones that show up in real systems.
Pattern: small random writes, high IOPS, low bandwidth, high latency
Looks like: thousands of write ops, a few MB/s, write latency tens of milliseconds or worse.
Usually is: database WAL/fsync pressure, VM journaling, NFS sync writes, log-heavy workloads.
What to do:
- Confirm whether writes are synchronous (dataset sync and application behavior).
- Ensure SLOG is present, fast, and power-loss protected if you require durability.
- Don’t put a cheap consumer SSD as SLOG and call it a day. That’s how you buy data loss with a receipt.
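For completeness, adding a mirrored SLOG is a one-liner in shape; the device names are placeholders, and they should be low-latency parts with power-loss protection:
cr0x@server:~$ sudo zpool add tank log mirror nvme4n1 nvme5n1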
Pattern: high read bandwidth, moderate IOPS, rising read latency
Looks like: hundreds of MB/s to GB/s reads, latency climbing from a few ms to tens of ms.
Usually is: streaming reads during backup, analytics scans, scrubs, or a cache-miss storm.
What to do:
- Check scrub/resilver status.
- Check ARC hit ratio indirectly by seeing whether reads are reaching disk at all (pair with ARC stats if available).
- If it’s real user traffic, you may need more vdevs or faster media. Latency doesn’t negotiate.
Pattern: one vdev shows high latency; pool average looks “okay”
Looks like: zpool iostat -w pool line seems fine; -v shows one mirror or disk with 10–100x latency.
Usually is: failing drive, bad cable, HBA reset storms, thermal throttling, or a firmware bug.
What to do: treat it as hardware until proven otherwise. Replace. Don’t spend a week writing a tuning proposal around a disk that’s quietly dying.
Pattern: latency spikes at steady throughput
Looks like: same MB/s and IOPS, but latency periodically shoots up.
Usually is: device GC, TRIM, write cache flushes, or contention on shared resources (PCIe lanes, HBA, virtualization host).
What to do: correlate with device-level metrics and system logs. If it’s NVMe thermal throttling, your “fix” might be airflow, not software.
Joke #2: If your “latency spikes are random,” congratulations—you’ve built a probability distribution in production.
Three corporate mini-stories (because reality is mean)
Mini-story #1: The incident caused by a wrong assumption
A mid-size SaaS company migrated a busy Postgres cluster from an aging SAN to local NVMe with ZFS mirrors. The team celebrated:
benchmarks looked great, average latency was low, and the storage graphs finally stopped looking like a crime scene.
Two weeks later, an incident: periodic 2–5 second stalls on write-heavy endpoints. Not constant. Not predictable.
The app team blamed locks. The DBAs blamed autovacuum. The SREs blamed “maybe the kernel.”
Everyone had a favorite villain and none of them were the disks.
Someone finally ran zpool iostat -w -v 1 during a stall. Pool averages were fine, but one NVMe showed write latency in the hundreds of milliseconds.
It wasn’t failing outright. It was intermittently throttling.
The wrong assumption: “NVMe is always fast, and if it’s slow it must be ZFS.” The reality: consumer NVMe devices can hit thermal limits and drop performance dramatically
under sustained sync-ish write patterns. The box had great CPU and terrible airflow.
The fix was gloriously boring: improve cooling, update firmware, and swap that one model for enterprise parts on the next maintenance cycle.
Tuning ZFS wouldn’t have helped. Observing the per-device latency with -w -v did.
Mini-story #2: The optimization that backfired
A corporate platform team ran a multi-tenant virtualization cluster backed by a large RAIDZ2 pool. They were under pressure: developers wanted faster CI,
and storage was “the thing everyone complains about.”
Someone proposed a quick win: set sync=disabled on the VM dataset. The argument was seductive: “We have UPS. The hypervisor can recover.
And it’s only dev workloads.” They changed it late one afternoon and watched zpool iostat write latency drop. High-fives all around.
Then came the backfire. A host crashed in a way the UPS did not politely prevent (it was a motherboard issue, not a power outage).
A handful of VMs had filesystem corruption. Not all of them. Just enough to ruin a weekend and make the postmortem spicy.
The operational mistake wasn’t “sync=disabled is always wrong.” The mistake was treating durability semantics as a performance knob with no blast-radius modeling.
They optimized for the median case and paid in tail risk.
The long-term fix: re-enable sync=standard, add a mirrored power-loss-protected SLOG, and segment “real dev” from “pretend prod”
so the durability decision matched the business reality. The lesson: zpool iostat -w can show you that sync writes hurt,
but it can’t grant you permission to disable them.
Mini-story #3: The boring but correct practice that saved the day
A finance-adjacent company ran ZFS for file services and build artifacts. Nothing fancy. The kind of storage that never gets attention until it breaks.
The storage engineer had one habit: a weekly “five minute drill” during business hours.
Run zpool iostat -w -v 1 for a minute, look at latencies, check zpool status, and move on.
One Tuesday, the drill showed a mirror leg with steadily climbing write latency. No errors yet. No alerts. The system was “fine.”
But the latency trend was wrong, the way a gearbox sounds wrong before it explodes.
They pulled SMART data and found increasing media errors. The disk wasn’t dead; it was just starting to lie.
They scheduled a replacement for the next maintenance window, resilvered, and never had an outage.
Weeks later, a similar model disk in another team’s fleet failed hard and caused a visible incident. Same vendor, same batch, same failure mode.
Their team escaped purely because someone watched -w and trusted the slow drift.
The boring practice wasn’t heroism. It was acknowledging that disks rarely go from “perfect” to “dead” without a phase of “weird.”
zpool iostat -w is excellent at catching weird.
Common mistakes: symptom → root cause → fix
These are the failure modes I keep seeing in real organizations. The trick is to map the symptom in zpool iostat -w to the likely cause,
then make a specific change that can be validated.
1) Pool write latency is high, but only one mirror leg is slow
- Symptom: Pool write latency spikes; -v shows one device with much higher write latency.
- Root cause: Device throttling, firmware GC, thermal issue, bad link, or a failing drive.
- Fix: Verify with iostat -x and logs; replace the device or fix the path. Don't waste time tuning recordsize or ARC.
2) High write IOPS, low write bandwidth, ugly write latency
- Symptom: Thousands of writes/s, only a few MB/s, write latency tens of ms to seconds.
- Root cause: Sync writes without a fast SLOG, or a slow SLOG.
- Fix: Add/replace mirrored PLP SLOG; confirm app sync behavior; isolate datasets and set sync semantics intentionally.
3) Reads look fine until ARC misses spike, then everything melts
- Symptom: Normally low disk reads; during incidents, read IOPS/bandwidth jump and read latency climbs.
- Root cause: Working set outgrows ARC, or an access pattern change (scan, report job, backup, new feature).
- Fix: Add RAM (often the best ROI), reconsider caching strategy, or isolate scan-heavy workloads. Validate by observing disk read changes in zpool iostat -w.
4) RAIDZ vdev saturates on metadata/small random I/O
- Symptom: High IOPS, low MB/s, high latency; vdev is RAIDZ.
- Root cause: RAIDZ parity overhead and limited IOPS per vdev for small random writes.
- Fix: Add more vdevs (not more disks to the same vdev), or shift to mirrors for latency-sensitive random I/O workloads. Consider special vdev for metadata if appropriate.
5) “We upgraded disks but it’s still slow”
- Symptom: New SSDs, similar latency as before under load.
- Root cause: You are CPU-bound (compression/checksumming), or limited by a single vdev layout, or constrained by PCIe/HBA.
- Fix: Confirm with CPU metrics and per-vdev stats; scale vdev count, fix topology, or move hot datasets to a separate pool.
6) Latency spikes during scrubs/resilver and users complain
- Symptom: Read latency increases sharply when maintenance runs.
- Root cause: Scrub/resilver competing with production workload.
- Fix: Schedule maintenance in off-hours, throttle if available, or provision enough performance headroom so integrity checks aren’t an outage generator.
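Scheduling is unglamorous but effective. A cron sketch for an off-hours scrub; the binary path and timing are assumptions, and some distributions already ship their own scrub timers:
cr0x@server:~$ # scrub every Sunday at 03:00
cr0x@server:~$ echo '0 3 * * 0 root /usr/sbin/zpool scrub tank' | sudo tee /etc/cron.d/zfs-scrub-tank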
Checklists / step-by-step plan
Checklist: responding to “storage is slow” in under five minutes
- Run sudo zpool iostat -w -v tank 1 and watch 10–20 lines.
- Identify whether reads or writes dominate, and whether latency is rising with load.
- If one device stands out, pivot to iostat -x and system logs for that device.
- Check sudo zpool status tank for scrub/resilver.
- Check dataset sync settings for the workload in question.
- Correlate with process I/O (pidstat -d) so you're not debugging ghosts.
- Make one change at a time; confirm impact with the same zpool iostat -w view.
Checklist: building a baseline before you touch anything
- Capture zpool iostat -w -v 2 30 during a known-good period.
- Capture the same during peak traffic.
- Save outputs with timestamps in your incident notebook or ticket.
- Record pool layout: vdev types, disk models, ashift.
- Record dataset properties: recordsize, compression, sync, atime.
- Decide what "bad" looks like (latency thresholds aligned to your app SLO).
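A small sketch that captures the items above into dated files; the paths and default pool name are assumptions:
#!/bin/bash
# baseline.sh -- capture a ZFS performance baseline for later comparison; run with sudo
set -u
POOL=${1:-tank}
OUT=/var/tmp/zfs-baseline-$(date +%F-%H%M)
mkdir -p "$OUT"

zpool iostat -w -v "$POOL" 2 30 > "$OUT/iostat-w.txt"
zpool status -v "$POOL" > "$OUT/status.txt"
zpool get all "$POOL" > "$OUT/pool-props.txt"
zfs get -r recordsize,compression,sync,atime "$POOL" > "$OUT/dataset-props.txt"
echo "baseline written to $OUT"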
Step-by-step plan: turning observations into an improvement project
- Classify the workload: random vs sequential, read vs write, sync vs async, metadata-heavy vs large blocks.
- Map to ZFS layout: determine if your vdev design matches the workload.
- Fix correctness first: replace bad devices, correct cabling/HBA issues, ensure redundancy for SLOG/special vdevs.
- Reduce avoidable I/O: enable sensible compression, tune recordsize per dataset, consider atime=off where appropriate.
- Scale properly: add vdevs to scale IOPS; don't keep inflating a single RAIDZ vdev and expect miracles.
- Validate with -w: you want lower latency at the same workload, not just prettier throughput numbers.
- Operationalize: add routine checks and alerting on per-vdev latency anomalies, not only capacity.
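One way to operationalize that last step: OpenZFS's scripted (-H) and parsable (-p) modes make per-vdev latency easy to grep. A rough sketch that flags anything whose total write wait exceeds a threshold; field positions vary by OpenZFS version, so check them against a normal -l run before trusting the awk:
#!/bin/bash
# vdev-latency-check.sh -- flag vdevs with high total write wait (sketch; adjust the field index for your version)
set -u
POOL=${1:-tank}
LIMIT_NS=${2:-20000000}   # 20 ms, in nanoseconds (-p prints exact values)

# -y skips the since-boot sample, -H drops headers (tab-separated), -p prints raw numbers
zpool iostat -y -H -p -l -v "$POOL" 5 1 \
  | awk -F'\t' -v limit="$LIMIT_NS" \
      '$9 != "-" && $9+0 > limit {printf "high write wait: %s (%.1f ms)\n", $1, $9/1e6}'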
FAQ
1) What does zpool iostat -w actually measure for latency?
It reports observed latency for ZFS I/O at the pool/vdev level, as ZFS accounts it. It’s not a perfect substitute for device firmware metrics,
but it’s extremely good at showing where time is being spent and whether queueing is happening.
2) Why does pool latency look fine, but my database is still slow?
Because the disks might not be the bottleneck. You could be CPU-bound (compression, checksumming), lock-bound, or suffering from application-level fsync behavior.
Also, ARC can mask disk reads until cache misses spike. Correlate with CPU and per-process I/O.
3) Should I always run with -v?
For diagnosis, yes. Pool averages hide slow devices and uneven vdev load. For quick sampling on a busy system, start without -v and then pivot.
4) Does adding more disks to a RAIDZ vdev increase IOPS?
Not in the way most people want. It can increase sequential throughput, but small random I/O is limited by parity overhead and vdev behavior.
If you need more IOPS, add more vdevs or use mirrors for the hot data.
5) When is a SLOG worth it?
When you have significant synchronous write load and you care about durability semantics (sync=standard).
Without sync pressure, a SLOG often does nothing measurable. With sync pressure, it can be the difference between “fine” and “why is everything timing out?”
6) Can I use a consumer SSD as a SLOG?
You can, but you probably shouldn’t if you care about correctness. A good SLOG needs low latency under sustained sync writes and power-loss protection.
Cheap SSDs can lie about flushes and fall off a performance cliff under the exact workload you bought them for.
7) Why do I see high latency during scrub, even when user traffic is low?
Scrubs read a lot of data and can push vdev queues. Even with low app traffic, the scrub itself is real I/O.
The fix is scheduling, throttling (where supported), or provisioning enough headroom.
8) Is sync=disabled ever acceptable?
Only if you have a clear risk decision and you can tolerate losing the last few seconds of writes on crash or power loss, potentially with application-level corruption.
If the dataset contains anything you’d call “important” in a postmortem, don’t do it. Use a proper SLOG instead.
9) Why does write latency increase even when write bandwidth is constant?
Because queueing and device internal behavior matter. Constant throughput can still accumulate a backlog if the device’s service time increases due to garbage collection,
thermal throttling, write cache flushes, or contention.
10) How long should I sample with zpool iostat -w?
For incident triage, 10–30 seconds at 1-second intervals is usually enough to spot the bottleneck. For capacity planning or performance work, capture multiple windows:
idle, peak, and “bad day.”
Conclusion: next steps that actually move the needle
zpool iostat -w is not a reporting tool. It’s a decision engine. It tells you whether you have a device problem, a layout problem, a sync-write problem,
or a “background maintenance is eating my lunch” problem—while the system is live and misbehaving.
Practical next steps:
- During a calm period, capture a baseline: sudo zpool iostat -w -v tank 2 30.
- Write down what "normal" latency looks like per vdev and per workload window.
- When the next complaint hits, run the fast diagnosis playbook and resist the urge to tune first.
- If you discover a repeat pattern (sync write pressure, one slow device, RAIDZ metadata pain), turn it into an engineering project with measurable outcomes.
Your future self doesn’t need more graphs. Your future self needs fewer surprises. -w is how you start charging interest on chaos.