ZFS zpool iostat -w: Understanding Workload Patterns in Real Time

You don’t need a dashboard to know your storage is unhappy. You need one angry API timeout, one “why is the deploy stuck?” Slack thread,
and one exec who says “but it worked yesterday.” ZFS gives you a truth serum: zpool iostat -w.

Used well, it tells you whether you’re CPU-bound, IOPS-bound, latency-bound, sync-write bound, or simply “bound by optimism.”
Used badly, it convinces you to buy hardware you don’t need, tune knobs you don’t understand, and blame “the network” out of habit.

What -w is really showing (and why you should care)

zpool iostat is the heartbeat monitor. The -w flag adds the part you usually wish you had during an incident: latency.
Not theoretical latency. Not vendor datasheet latency. Observed latency as ZFS sees it for pool and vdev operations.

If you run production systems, you should treat zpool iostat -w as the “is storage the bottleneck right now?” tool.
It answers:

  • Are we queuing? Latency grows before throughput drops.
  • Where is the pain? Pool-level vs one slow vdev vs a single mirror leg.
  • What kind of pain? Reads, writes, sync writes, metadata-heavy work, resilver, trim, or “small random everything.”

Latency is the currency apps spend. Throughput is what storage teams brag about. When the bill comes due, apps pay in latency.

What -w adds and what it does not

The exact columns vary by ZFS implementation and version (OpenZFS on Linux vs FreeBSD vs illumos). But broadly:

  • It adds latency visibility. On OpenZFS, -w prints wait-time histograms (total wait, disk wait, sync/async queue wait), while -l prints averaged latency columns; other platforms may show a simpler read/write latency pair.
  • It does not magically separate ZFS queueing delay from device service time unless you ask for that breakdown (more on that later).
  • It does not tell you why latency is high; it tells you where to dig next.

One dry truth: zpool iostat can make your system look healthy even while your application is on fire, because ARC cache is quietly saving you.
Then the cache misses spike and the same pool collapses under real disk reads. You need to watch it continuously, not just when you’re already doomed.
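
If you want to check whether ARC is doing the saving, the cache counters are enough. A minimal sketch, assuming OpenZFS on Linux, where the kstats live under /proc/spl (counter names can vary slightly by version), and that the arcstat helper is packaged on your platform:

cr0x@server:~$ awk '/^(hits|misses) / {print $1, $3}' /proc/spl/kstat/zfs/arcstats   # raw hit/miss counters; sample twice and diff
cr0x@server:~$ arcstat 1 10                                                          # per-second hit%/miss view

If misses barely move while the app suffers, the disks are probably not your problem yet.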

Quick history and facts that make the output click

A few context points turn “columns of numbers” into a story you can act on. Here are nine short facts that actually matter.

  1. ZFS was born at Sun Microsystems (mid-2000s) as an end-to-end storage stack: filesystem + volume manager, with checksums everywhere.
  2. The “pool” idea is the core innovation: filesystems live on top of a pool, and the pool decides where blocks go across vdevs.
  3. Copy-on-write (CoW) is why fragmentation feels different: ZFS doesn’t overwrite blocks in place; it writes new blocks and updates pointers.
  4. The ZIL is not a write cache: it’s a log for synchronous writes. It exists even without a dedicated SLOG device.
  5. SLOG is a device, ZIL is a concept: adding a SLOG moves the ZIL to faster media, but only for sync writes.
  6. OpenZFS unified multiple forks so features like device removal, special vdevs, and persistent L2ARC became more common across platforms.
  7. Ashift is forever (mostly): set wrong at pool creation and you carry that performance penalty for the life of the vdev.
  8. “IOPS” is a half-truth without latency: you can push high IOPS with terrible tail latency; your database will still hate you.
  9. Resilver and scrub are intentional pain: they are background reads/writes that can dominate zpool iostat if you let them.

Exactly one quote, because engineers deserve better than motivational posters:
Hope is not a strategy. — a line repeated so often in SRE and operations circles that nobody agrees on who said it first.

A production mental model: from app request to vdev

When an application does I/O on ZFS, you’re watching multiple layers negotiate reality:

  • The app issues reads/writes (often small, often random, and occasionally insulting).
  • The OS and ZFS aggregate, cache (ARC), and sometimes reorder.
  • ZFS translates logical blocks into physical writes across vdevs, obeying redundancy and allocation rules.
  • Your vdevs translate that to actual device commands. The slowest relevant component sets the pace.

Pool vs vdev: why your “fast disks” can still be slow

ZFS performance is vdev-centric. A pool is a set of vdevs, and random IOPS scale with the number of vdevs, not with the raw number of disks the way people naively assume.

Mirrors give you more IOPS per vdev than RAIDZ, and multiple mirrors scale. RAIDZ vdevs are great at capacity efficiency and large sequential reads,
but they don’t magically turn into IOPS monsters. If you built one giant RAIDZ2 vdev with a pile of disks and expected database-grade random I/O,
you bought a minivan and entered it in a drag race.
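
Before arguing about performance, confirm what you actually built. A quick sketch, assuming a pool named tank:

cr0x@server:~$ sudo zpool list -v tank                               # per-vdev capacity, fragmentation, layout
cr0x@server:~$ sudo zpool status tank | grep -cE 'mirror-|raidz'     # count vdevs, not disks

The vdev count from that second line is your honest IOPS budget, whatever the disk count says.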

Latency is a stack: service time + waiting time

The most useful way to interpret -w is: “is the device slow” vs “is the device busy.”
High latency with moderate utilization often means the device itself is slow (or failing, or doing internal garbage collection).
High latency with high IOPS/throughput usually means queueing: the workload is beyond what the vdev can serve at acceptable latency.
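
On OpenZFS you can ask for that split directly instead of guessing. A sketch, assuming a pool named tank (exact column names vary by version):

cr0x@server:~$ sudo zpool iostat -l -v tank 1    # average latency split: total wait vs disk wait vs sync/async queue wait
cr0x@server:~$ sudo zpool iostat -q -v tank 1    # queue stats: pending vs active I/Os per class

If disk wait dominates, the device is slow; if queue wait dominates, you're asking for more than the vdev can serve.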

How to read the columns without lying to yourself

You’ll see variants like:

  • capacity: allocated and free space per pool/vdev.
  • operations: read/write operations per second (IOPS).
  • bandwidth: read/write bytes per second (throughput).
  • latency: read/write latency, sometimes broken into “wait” and “service.”

The rule: you diagnose with a combination of IOPS, bandwidth, and latency. Any one alone is a liar.

Pool-level lines are averages; vdev lines are truth

Pool-level stats can hide a single sick disk in a mirror or a single slow RAIDZ vdev dragging everything.
Always use -v to see vdev breakdown when diagnosing.

Watch the change, not the number

zpool iostat is best used as a time series. Run it with a 1-second or 2-second interval and watch trends.
The “now” matters more than the lifetime averages.
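
A timestamped capture is worth far more than a memory of "it looked bad." A minimal sketch, assuming OpenZFS's -T flag and a hypothetical log path:

cr0x@server:~$ sudo zpool iostat -w -v -T d tank 2 30 | tee /var/tmp/zpool-iostat-$(date +%F-%H%M).log

Thirty samples at two seconds costs you a minute and gives the postmortem something better than adjectives.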

Joke #1: Storage graphs are like horoscopes—vague until the pager goes off, and then suddenly they’re “obviously predictive.”

Fast diagnosis playbook

This is the sequence I use when an app team says “storage is slow” and you have 90 seconds to decide whether that’s true.

First: determine if you’re looking at disk or cache

  1. Run zpool iostat -w at 1s intervals for 10–20 seconds.
  2. If read IOPS and bandwidth are low but the app is slow, suspect CPU, locking, network, or cache misses not reaching disk yet.
  3. If read IOPS spike and latency spikes with them, you’re on the disks. Now it’s real.

Second: isolate the bottleneck layer

  1. Use zpool iostat -w -v. Find the vdev with the worst latency or the clearest signs of saturation.
  2. If it’s one disk in a mirror: likely failing, firmware weirdness, or pathing issue.
  3. If it’s a whole RAIDZ vdev: you’re saturating that vdev’s IOPS. Fix is architectural (more vdevs, mirrors, or accept latency).

Third: decide whether it’s sync write pain

  1. Look for write latency spikes that correlate with small write IOPS and low bandwidth.
  2. Check whether the workload is forcing sync writes (databases, NFS, hypervisors).
  3. If yes: examine SLOG health and device latency; consider whether your sync settings are correct for risk tolerance.

Fourth: check for “background jobs pretending to be traffic”

  1. Scrubs, resilvers, trims, and heavy snapshots/clones can dominate I/O.
  2. If zpool status shows active work, decide whether to throttle, schedule, or let it finish.

Fifth: confirm with a second signal

  1. Correlate with CPU (mpstat), memory pressure, or per-process I/O (pidstat).
  2. Use device-level tools (iostat -x) to see if one NVMe is melting down while ZFS averages everything.

Practical tasks: commands, meaning, decisions

You asked for real tasks, not vibes. Here are fourteen. Each includes a command, an example output snippet, what it means, and the decision it drives.
Outputs are representative; your columns may differ by platform/version.

Task 1: Baseline pool latency in real time

cr0x@server:~$ sudo zpool iostat -w tank 1
              capacity     operations     bandwidth    latency
pool        alloc   free   read  write   read  write   read  write
----------  -----  -----  -----  -----  -----  -----  -----  -----
tank        3.21T  7.58T    210    180  18.2M  22.4M   2ms   4ms
tank        3.21T  7.58T    195    260  16.9M  31.1M   3ms  18ms
tank        3.21T  7.58T    220    240  19.1M  29.7M   2ms  20ms

Meaning: Writes jumped in latency from 4ms to ~20ms while bandwidth increased. That’s a classic “we’re pushing the write path.”

Decision: If this aligns with user-facing latency, treat storage as suspect. Next: add -v to find which vdev is causing it.

Task 2: Find the vdev that’s hurting you

cr0x@server:~$ sudo zpool iostat -w -v tank 1
                                            operations     bandwidth    latency
pool        vdev                             read  write   read  write   read  write
----------  -------------------------------- ----  -----  -----  -----  -----  -----
tank        -                                  220    240  19.1M  29.7M   2ms  20ms
tank        mirror-0                           110    120   9.5M  14.6M   2ms  10ms
tank          nvme0n1                          108    118   9.4M  14.4M   2ms  12ms
tank          nvme1n1                          112    119   9.6M  14.7M   2ms  65ms
tank        mirror-1                           110    120   9.6M  15.1M   2ms  10ms
tank          nvme2n1                          110    118   9.5M  14.8M   2ms  11ms
tank          nvme3n1                          110    121   9.7M  15.3M   2ms  10ms

Meaning: One mirror leg (nvme1n1) has 65ms write latency while its partner is fine. ZFS can spread reads across mirror legs, but a write isn't complete until every leg has it.

Decision: This is a device/path issue. Check SMART, firmware, PCIe errors, multipath, and link speed. Replace or fix the path before tuning ZFS.
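
A follow-up sketch for that specific leg, assuming smartmontools and nvme-cli are installed and the slow device is /dev/nvme1n1:

cr0x@server:~$ sudo smartctl -a /dev/nvme1n1                          # media errors, temperature, percentage used
cr0x@server:~$ sudo nvme smart-log /dev/nvme1n1                       # critical warnings, thermal throttle events
cr0x@server:~$ sudo dmesg -T | grep -iE 'nvme|pcie|aer' | tail -50    # resets, link errors, AER complaints

If the drive reports thermal throttling or the log shows controller resets, no ZFS property will save you.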

Task 3: Separate “busy” from “broken” at the device layer

cr0x@server:~$ iostat -x 1 3
Linux 6.5.0 (server)  12/25/2025  _x86_64_  (32 CPU)

Device            r/s     w/s    rkB/s    wkB/s  r_await  w_await  aqu-sz  %util
nvme0n1         110.0   120.0   9500.0  14400.0      2.0      4.2     0.4   18.0
nvme1n1         112.0   119.0   9600.0  14700.0      2.2     65.8     9.8   95.0

Meaning: nvme1n1 is at high utilization with a deep queue and huge write await. That’s not ZFS being “chatty”; that’s a sick or throttled device.

Decision: Stop blaming recordsize. Investigate device health and throttling (thermal, firmware GC). If it’s a shared PCIe lane, fix topology.

Task 4: Detect a sync-write workload quickly

cr0x@server:~$ sudo zpool iostat -w tank 1
              operations     bandwidth    latency
pool        read  write   read  write   read  write
----------  ----  -----  -----  -----  -----  -----
tank          80   3200   6.1M  11.8M   1ms  35ms
tank          75   3500   5.9M  12.4M   1ms  42ms

Meaning: Very high write IOPS but low write bandwidth means small writes. Latency is high. If the app is a database, VM host, or NFS server, assume sync pressure.

Decision: Check SLOG and sync settings. If you don’t have a SLOG and you need safe sync, accept that spinning disks will cry.

Task 5: Verify whether datasets are forcing sync behavior

cr0x@server:~$ sudo zfs get -o name,property,value sync tank
NAME   PROPERTY  VALUE
tank   sync      standard

Meaning: The pool honors application sync requests (the default), and child datasets inherit that unless explicitly overridden.

Decision: If latency is killing you and you can accept risk for a specific dataset (not the whole pool), consider sync=disabled only for that dataset.
If you can’t explain the risk to an auditor without sweating, don’t do it.

Task 6: Check if you even have a SLOG and whether it’s healthy

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            nvme0n1   ONLINE       0     0     0
            nvme1n1   ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            nvme2n1   ONLINE       0     0     0
            nvme3n1   ONLINE       0     0     0
        logs
          mirror-2    ONLINE       0     0     0
            nvme4n1   ONLINE       0     0     0
            nvme5n1   ONLINE       0     0     0

Meaning: A mirrored SLOG exists. Good: a single-device SLOG is a foot-gun if you care about durability.

Decision: If sync writes are slow, measure SLOG device latency separately and consider replacing with power-loss-protected NVMe.
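
To check whether the log vdev itself is the slow part, a sketch assuming the layout above:

cr0x@server:~$ sudo zpool iostat -l -v tank 1 | grep -A 3 logs        # latency on the log mirror and its legs
cr0x@server:~$ iostat -x nvme4n1 nvme5n1 1                            # device-level await for the SLOG devices

If the log devices show low single-digit write latency while sync writes are still slow, look upstream of the SLOG.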

Task 7: Measure per-vdev behavior during a scrub or resilver

cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Thu Dec 25 10:11:03 2025
        1.23T scanned at 1.1G/s, 512G issued at 450M/s, 3.21T total
        0B repaired, 15.9% done, 01:42:18 to go
cr0x@server:~$ sudo zpool iostat -w -v tank 1
                                            operations     bandwidth    latency
pool        vdev                             read  write   read  write   read  write
----------  -------------------------------- ----  -----  -----  -----  -----  -----
tank        -                                 6200    180  1.05G  22.4M  12ms   4ms
tank        mirror-0                          3100     90  525M  11.2M  13ms   4ms
tank        mirror-1                          3100     90  525M  11.2M  12ms   4ms

Meaning: Reads are dominated by scrub. Latency is elevated but explainable. Writes look normal.

Decision: If this is a busy production window, throttle scrub (platform-dependent) or reschedule. If your read latency is wrecking SLOs, scrub timing is now an SRE problem, not a storage problem.
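
If you need relief right now, OpenZFS lets you pause a scrub and pick it up later; a sketch:

cr0x@server:~$ sudo zpool scrub -p tank     # pause; progress is kept
cr0x@server:~$ sudo zpool scrub tank        # resume later, ideally from a timer in a quiet window

Pausing is reversible and boring, which is exactly what you want during a latency incident.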

Task 8: Identify a single slow disk in a mirror without guessing

cr0x@server:~$ sudo zpool iostat -w -v tank 1
                                            operations     bandwidth    latency
pool        vdev                             read  write   read  write   read  write
----------  -------------------------------- ----  -----  -----  -----  -----  -----
tank        mirror-0                           800    900  65M   72M     3ms  14ms
tank          sda                              400    450  32M   36M     2ms   6ms
tank          sdb                              400    450  33M   36M     3ms  80ms

Meaning: sdb is the anchor dragging the boat. Mirrors write to both sides; one slow leg poisons write latency.

Decision: Pull SMART data, check cabling/HBA, and plan replacement. Do not “tune” your way around a dying disk.

Task 9: Confirm ashift and why it matters

cr0x@server:~$ sudo zdb -C tank | grep -E "ashift|vdev_tree" -n
45:        ashift: 12

Meaning: ashift=12 (4K sectors). This is generally sane for modern disks and SSDs. If you see ashift=9 on 4K-native media, you’re paying a write amplification tax forever.

Decision: If ashift is wrong, you don’t “fix it with a sysctl.” You migrate to a new vdev/pool with correct ashift.
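
If you do end up rebuilding, set it explicitly at creation time instead of trusting autodetection; a sketch with hypothetical pool and device names:

cr0x@server:~$ sudo zpool create -o ashift=12 tank2 mirror nvme6n1 nvme7n1   # hypothetical replacement pool
cr0x@server:~$ sudo zpool get ashift tank2                                   # verify before you start migrating data

The same -o ashift=12 applies when you add or attach devices to an existing pool.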

Task 10: Detect metadata-heavy pain and consider a special vdev

cr0x@server:~$ sudo zpool iostat -w -v tank 1
                                            operations     bandwidth    latency
pool        vdev                             read  write   read  write   read  write
----------  -------------------------------- ----  -----  -----  -----  -----  -----
tank        -                                 5200   2100  48M   39M    18ms  22ms
tank        raidz2-0                           5200   2100  48M   39M    18ms  22ms

Meaning: Huge IOPS with tiny bandwidth is the signature of metadata or small-block random I/O (directory walks, small files, maildirs, container layers).
RAIDZ is not thrilled.

Decision: Consider adding mirrors, adding more vdevs, or using a special vdev for metadata/small blocks if your platform supports it and you can operationalize it safely.
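
If you go the special vdev route, the shape is roughly this; a sketch with hypothetical devices and dataset, remembering that the special vdev must survive or the pool doesn't:

cr0x@server:~$ sudo zpool add tank special mirror nvme8n1 nvme9n1        # hypothetical mirrored special vdev for metadata
cr0x@server:~$ sudo zfs set special_small_blocks=16K tank/containers     # optionally steer small blocks there too

Size it carefully: once the special vdev fills up, new allocations spill back to the ordinary vdevs and the win evaporates.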

Task 11: Verify dataset recordsize vs workload (and avoid self-inflicted pain)

cr0x@server:~$ sudo zfs get -o name,property,value recordsize tank/db
NAME     PROPERTY    VALUE
tank/db  recordsize  128K

Meaning: 128K recordsize is fine for large sequential I/O, but many databases prefer smaller (like 16K) depending on engine and page size.

Decision: If your workload is random reads/writes of small pages and you see read amplification, test a smaller recordsize for that dataset.
Do not change recordsize blindly for an existing dataset and expect instant miracles; existing blocks stay as they were.
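
If you decide to test it, a minimal sketch for that one dataset:

cr0x@server:~$ sudo zfs set recordsize=16K tank/db
cr0x@server:~$ sudo zfs get -o name,property,value,source recordsize tank/db   # source should now read "local"

To see the benefit on existing data you have to rewrite it (restore, copy, or let the database churn through it naturally).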

Task 12: Check compression and learn whether you’re CPU-bound

cr0x@server:~$ sudo zfs get -o name,property,value compression,compressratio tank/vm
NAME      PROPERTY       VALUE
tank/vm   compression    zstd
tank/vm   compressratio  1.62x
cr0x@server:~$ mpstat 1 3
Linux 6.5.0 (server)  12/25/2025  _x86_64_  (32 CPU)

12:11:01 AM  CPU   %usr  %nice   %sys %iowait  %irq  %soft  %steal  %idle
12:11:02 AM  all   72.0    0.0   11.0     2.0   0.0    0.0     0.0   15.0

Meaning: Compression is active and effective. CPU is pretty busy. If zpool iostat shows low disk activity but latency is high at the application, CPU could be the limiter.

Decision: If CPU saturation correlates with I/O latency, consider a lighter compression level (keep compression on), adding CPU, or isolating noisy neighbors.
Don't turn compression off as your first reflex; it often reduces disk I/O.
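
Changing the level is also per dataset and only affects new writes; a sketch, assuming OpenZFS 2.x where zstd levels are exposed:

cr0x@server:~$ sudo zfs set compression=zstd-1 tank/vm       # lighter CPU cost than the default zstd level
cr0x@server:~$ sudo zfs get -o name,property,value compression,compressratio tank/vm

Watch both CPU and compressratio afterwards; a cheaper level that barely compresses is just wasted effort.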

Task 13: Spot TRIM/autotrim impact and decide when to run it

cr0x@server:~$ sudo zpool get autotrim tank
NAME  PROPERTY  VALUE     SOURCE
tank  autotrim  on        local
cr0x@server:~$ sudo zpool iostat -w tank 1
              operations     bandwidth    latency
pool        read  write   read  write   read  write
----------  ----  -----  -----  -----  -----  -----
tank         180    220  12.4M  19.1M   2ms   6ms
tank         175    240  12.2M  19.8M   3ms  18ms

Meaning: If you see periodic write latency spikes without matching workload changes, background maintenance (including TRIM on some devices) can be a culprit.

Decision: If autotrim causes visible jitter on latency-sensitive systems, test disabling autotrim and scheduling manual trims during low-traffic windows. Measure, don’t guess.
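
A sketch of the manual approach, assuming you've confirmed autotrim is actually the jitter source:

cr0x@server:~$ sudo zpool set autotrim=off tank
cr0x@server:~$ sudo zpool trim tank           # run a full trim in a quiet window instead
cr0x@server:~$ sudo zpool status -t tank      # per-vdev trim progress

Re-measure with zpool iostat -w afterwards; if the jitter is still there, it was never the trim.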

Task 14: Correlate “who is doing the I/O” with pool symptoms

cr0x@server:~$ pidstat -d 1 5
Linux 6.5.0 (server)  12/25/2025  _x86_64_  (32 CPU)

12:12:01 AM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
12:12:02 AM     0      2211      120.0   82000.0      0.0       1  postgres
12:12:02 AM     0      3440      200.0   14000.0      0.0       0  qemu-system-x86

Meaning: Postgres is hammering writes. If zpool iostat -w shows high write latency, you can now have a productive conversation with the database owner.

Decision: Decide whether you’re dealing with a legitimate workload increase (scale storage), a misconfigured app (fsync loop), or an operational job (vacuum, reindex) that needs scheduling.

Recognizing workload patterns in the wild

The whole point of zpool iostat -w is to recognize patterns fast enough to act. Here are the common ones that show up in real systems.

Pattern: small random writes, high IOPS, low bandwidth, high latency

Looks like: thousands of write ops, a few MB/s, write latency tens of milliseconds or worse.

Usually is: database WAL/fsync pressure, VM journaling, NFS sync writes, log-heavy workloads.

What to do:

  • Confirm whether writes are synchronous (dataset sync and application behavior).
  • Ensure SLOG is present, fast, and power-loss protected if you require durability.
  • Don’t put a cheap consumer SSD as SLOG and call it a day. That’s how you buy data loss with a receipt.

Pattern: high read bandwidth, moderate IOPS, rising read latency

Looks like: hundreds of MB/s to GB/s reads, latency climbing from a few ms to tens of ms.

Usually is: streaming reads during backup, analytics scans, scrubs, or a cache-miss storm.

What to do:

  • Check scrub/resilver status.
  • Check ARC hit ratio indirectly by seeing whether reads are reaching disk at all (pair with ARC stats if available).
  • If it’s real user traffic, you may need more vdevs or faster media. Latency doesn’t negotiate.

Pattern: one vdev shows high latency; pool average looks “okay”

Looks like: zpool iostat -w pool line seems fine; -v shows one mirror or disk with 10–100x latency.

Usually is: failing drive, bad cable, HBA reset storms, thermal throttling, or a firmware bug.

What to do: treat it as hardware until proven otherwise. Replace. Don’t spend a week writing a tuning proposal around a disk that’s quietly dying.

Pattern: latency spikes at steady throughput

Looks like: same MB/s and IOPS, but latency periodically shoots up.

Usually is: device GC, TRIM, write cache flushes, or contention on shared resources (PCIe lanes, HBA, virtualization host).

What to do: correlate with device-level metrics and system logs. If it’s NVMe thermal throttling, your “fix” might be airflow, not software.

Joke #2: If your “latency spikes are random,” congratulations—you’ve built a probability distribution in production.

Three corporate mini-stories (because reality is mean)

Mini-story #1: The incident caused by a wrong assumption

A mid-size SaaS company migrated a busy Postgres cluster from an aging SAN to local NVMe with ZFS mirrors. The team celebrated:
benchmarks looked great, average latency was low, and the storage graphs finally stopped looking like a crime scene.

Two weeks later, an incident: periodic 2–5 second stalls on write-heavy endpoints. Not constant. Not predictable.
The app team blamed locks. The DBAs blamed autovacuum. The SREs blamed “maybe the kernel.”
Everyone had a favorite villain and none of them were the disks.

Someone finally ran zpool iostat -w -v 1 during a stall. Pool averages were fine, but one NVMe showed write latency in the hundreds of milliseconds.
It wasn’t failing outright. It was intermittently throttling.

The wrong assumption: “NVMe is always fast, and if it’s slow it must be ZFS.” The reality: consumer NVMe devices can hit thermal limits and drop performance dramatically
under sustained sync-ish write patterns. The box had great CPU and terrible airflow.

The fix was gloriously boring: improve cooling, update firmware, and swap that one model for enterprise parts on the next maintenance cycle.
Tuning ZFS wouldn’t have helped. Observing the per-device latency with -w -v did.

Mini-story #2: The optimization that backfired

A corporate platform team ran a multi-tenant virtualization cluster backed by a large RAIDZ2 pool. They were under pressure: developers wanted faster CI,
and storage was “the thing everyone complains about.”

Someone proposed a quick win: set sync=disabled on the VM dataset. The argument was seductive: “We have UPS. The hypervisor can recover.
And it’s only dev workloads.” They changed it late one afternoon and watched zpool iostat write latency drop. High-fives all around.

Then came the backfire. A host crashed in a way the UPS did not politely prevent (it was a motherboard issue, not a power outage).
A handful of VMs had filesystem corruption. Not all of them. Just enough to ruin a weekend and make the postmortem spicy.

The operational mistake wasn’t “sync=disabled is always wrong.” The mistake was treating durability semantics as a performance knob with no blast-radius modeling.
They optimized for the median case and paid in tail risk.

The long-term fix: re-enable sync=standard, add a mirrored power-loss-protected SLOG, and segment “real dev” from “pretend prod”
so the durability decision matched the business reality. The lesson: zpool iostat -w can show you that sync writes hurt,
but it can’t grant you permission to disable them.

Mini-story #3: The boring but correct practice that saved the day

A finance-adjacent company ran ZFS for file services and build artifacts. Nothing fancy. The kind of storage that never gets attention until it breaks.
The storage engineer had one habit: a weekly “five minute drill” during business hours.
Run zpool iostat -w -v 1 for a minute, look at latencies, check zpool status, and move on.

One Tuesday, the drill showed a mirror leg with steadily climbing write latency. No errors yet. No alerts. The system was “fine.”
But the latency trend was wrong, the way a gearbox sounds wrong before it explodes.

They pulled SMART data and found increasing media errors. The disk wasn’t dead; it was just starting to lie.
They scheduled a replacement for the next maintenance window, resilvered, and never had an outage.

Weeks later, a similar model disk in another team’s fleet failed hard and caused a visible incident. Same vendor, same batch, same failure mode.
Their team escaped purely because someone watched -w and trusted the slow drift.

The boring practice wasn’t heroism. It was acknowledging that disks rarely go from “perfect” to “dead” without a phase of “weird.”
zpool iostat -w is excellent at catching weird.

Common mistakes: symptom → root cause → fix

These are the failure modes I keep seeing in real organizations. The trick is to map the symptom in zpool iostat -w to the likely cause,
then make a specific change that can be validated.

1) Pool write latency is high, but only one mirror leg is slow

  • Symptom: Pool write latency spikes; -v shows one device with much higher write latency.
  • Root cause: Device throttling, firmware GC, thermal issue, bad link, or a failing drive.
  • Fix: Verify with iostat -x and logs; replace the device or fix the path. Don’t waste time tuning recordsize or ARC.

2) High write IOPS, low write bandwidth, ugly write latency

  • Symptom: Thousands of writes/s, only a few MB/s, write latency tens of ms to seconds.
  • Root cause: Sync writes without a fast SLOG, or a slow SLOG.
  • Fix: Add or replace with a mirrored, power-loss-protected SLOG; confirm app sync behavior; isolate datasets and set sync semantics intentionally.

3) Reads look fine until ARC misses spike, then everything melts

  • Symptom: Normally low disk reads; during incidents, read IOPS/bandwidth jump and read latency climbs.
  • Root cause: Working set outgrows ARC, or an access pattern change (scan, report job, backup, new feature).
  • Fix: Add RAM (often the best ROI), reconsider caching strategy, or isolate scan-heavy workloads. Validate by observing disk read changes in zpool iostat -w.

4) RAIDZ vdev saturates on metadata/small random I/O

  • Symptom: High IOPS, low MB/s, high latency; vdev is RAIDZ.
  • Root cause: RAIDZ parity overhead and limited IOPS per vdev for small random writes.
  • Fix: Add more vdevs (not more disks to the same vdev), or shift to mirrors for latency-sensitive random I/O workloads. Consider special vdev for metadata if appropriate.

5) “We upgraded disks but it’s still slow”

  • Symptom: New SSDs, similar latency as before under load.
  • Root cause: You are CPU-bound (compression/checksumming), or limited by a single vdev layout, or constrained by PCIe/HBA.
  • Fix: Confirm with CPU metrics and per-vdev stats; scale vdev count, fix topology, or move hot datasets to a separate pool.

6) Latency spikes during scrubs/resilver and users complain

  • Symptom: Read latency increases sharply when maintenance runs.
  • Root cause: Scrub/resilver competing with production workload.
  • Fix: Schedule maintenance in off-hours, throttle if available, or provision enough performance headroom so integrity checks aren’t an outage generator.

Checklists / step-by-step plan

Checklist: responding to “storage is slow” in under five minutes

  1. Run sudo zpool iostat -w -v tank 1 and watch 10–20 lines.
  2. Identify whether reads or writes dominate, and whether latency is rising with load.
  3. If one device stands out, pivot to iostat -x and system logs for that device.
  4. Check sudo zpool status tank for scrub/resilver.
  5. Check dataset sync settings for the workload in question.
  6. Correlate with process I/O (pidstat -d) so you’re not debugging ghosts.
  7. Make one change at a time; confirm impact with the same zpool iostat -w view.

Checklist: building a baseline before you touch anything

  1. Capture zpool iostat -w -v 2 30 during a known-good period.
  2. Capture the same during peak traffic.
  3. Save outputs with timestamps in your incident notebook or ticket.
  4. Record pool layout: vdev types, disk models, ashift.
  5. Record dataset properties: recordsize, compression, sync, atime.
  6. Decide what “bad” looks like (latency thresholds aligned to your app SLO).
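
To make steps 1–3 less of a chore, a minimal capture sketch, assuming a pool named tank and a hypothetical notes directory:

cr0x@server:~$ mkdir -p ~/storage-baselines
cr0x@server:~$ out=~/storage-baselines/tank-$(date +%F-%H%M).log
cr0x@server:~$ sudo zpool status tank | tee "$out"
cr0x@server:~$ sudo zpool iostat -w -v -T d tank 2 30 | tee -a "$out"

The point is not the tooling; it's having a dated "normal" to compare against when someone swears the pool got slower.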

Step-by-step plan: turning observations into an improvement project

  1. Classify the workload: random vs sequential, read vs write, sync vs async, metadata-heavy vs large blocks.
  2. Map to ZFS layout: determine if your vdev design matches the workload.
  3. Fix correctness first: replace bad devices, correct cabling/HBA issues, ensure redundancy for SLOG/special vdevs.
  4. Reduce avoidable I/O: enable sensible compression, tune recordsize per dataset, consider atime=off where appropriate.
  5. Scale properly: add vdevs to scale IOPS; don’t keep inflating a single RAIDZ vdev and expect miracles.
  6. Validate with -w: you want lower latency at the same workload, not just prettier throughput numbers.
  7. Operationalize: add routine checks and alerting on per-vdev latency anomalies, not only capacity.

FAQ

1) What does zpool iostat -w actually measure for latency?

It reports latency as ZFS accounts it for pool and vdev I/O: on OpenZFS as wait-time histograms, with -l giving the averaged view. It's not a perfect substitute for device firmware metrics,
but it's extremely good at showing where time is being spent and whether queueing is happening.

2) Why does pool latency look fine, but my database is still slow?

Because the disks might not be the bottleneck. You could be CPU-bound (compression, checksumming), lock-bound, or suffering from application-level fsync behavior.
Also, ARC can mask disk reads until cache misses spike. Correlate with CPU and per-process I/O.

3) Should I always run with -v?

For diagnosis, yes. Pool averages hide slow devices and uneven vdev load. For quick sampling on a busy system, start without -v and then pivot.

4) Does adding more disks to a RAIDZ vdev increase IOPS?

Not in the way most people want. It can increase sequential throughput, but small random I/O is limited by parity overhead and vdev behavior.
If you need more IOPS, add more vdevs or use mirrors for the hot data.

5) When is a SLOG worth it?

When you have significant synchronous write load and you care about durability semantics (sync=standard).
Without sync pressure, a SLOG often does nothing measurable. With sync pressure, it can be the difference between “fine” and “why is everything timing out?”

6) Can I use a consumer SSD as a SLOG?

You can, but you probably shouldn’t if you care about correctness. A good SLOG needs low latency under sustained sync writes and power-loss protection.
Cheap SSDs can lie about flushes and fall off a performance cliff under the exact workload you bought them for.

7) Why do I see high latency during scrub, even when user traffic is low?

Scrubs read a lot of data and can push vdev queues. Even with low app traffic, the scrub itself is real I/O.
The fix is scheduling, throttling (where supported), or provisioning enough headroom.

8) Is sync=disabled ever acceptable?

Only if you have a clear risk decision and you can tolerate losing the last few seconds of writes on crash or power loss, potentially with application-level corruption.
If the dataset contains anything you’d call “important” in a postmortem, don’t do it. Use a proper SLOG instead.

9) Why does write latency increase even when write bandwidth is constant?

Because queueing and device internal behavior matter. Constant throughput can still accumulate a backlog if the device’s service time increases due to garbage collection,
thermal throttling, write cache flushes, or contention.

10) How long should I sample with zpool iostat -w?

For incident triage, 10–30 seconds at 1-second intervals is usually enough to spot the bottleneck. For capacity planning or performance work, capture multiple windows:
idle, peak, and “bad day.”

Conclusion: next steps that actually move the needle

zpool iostat -w is not a reporting tool. It’s a decision engine. It tells you whether you have a device problem, a layout problem, a sync-write problem,
or a “background maintenance is eating my lunch” problem—while the system is live and misbehaving.

Practical next steps:

  1. During a calm period, capture a baseline: sudo zpool iostat -w -v tank 2 30.
  2. Write down what “normal” latency looks like per vdev and per workload window.
  3. When the next complaint hits, run the fast diagnosis playbook and resist the urge to tune first.
  4. If you discover a repeat pattern (sync write pressure, one slow device, RAIDZ metadata pain), turn it into an engineering project with measurable outcomes.

Your future self doesn’t need more graphs. Your future self needs fewer surprises. -w is how you start charging interest on chaos.
