ZFS iostat Basics: Turning Numbers Into a Bottleneck

Your app is “slow.” The database team swears nothing changed. CPU is fine. Network graphs are boring.
And then someone points at storage: “ZFS looks busy.” Congratulations, you’ve entered the part of the incident
where people start making confident statements based on vibes.

zpool iostat is where vibes go to die. It’s also where perfectly good engineers get trapped,
because the output is simple-looking and full of numbers that seem to agree with whatever theory you already had.
This is how you turn those numbers into a bottleneck you can name, measure, and fix.

What ZFS iostat is (and what it isn’t)

ZFS gives you multiple “iostat” lenses. The one most people mean is zpool iostat, which reports
per-pool and per-vdev throughput and I/O rates. Depending on flags, it can also show latency, queue size, and
breakdowns by device. This is ZFS’s view of work, not the block device’s view, and not necessarily what
your application thinks it asked for.

What it isn’t: it’s not a full performance profiler, it won’t tell you “the database is missing an index,” and it
won’t magically separate “bad workload” from “bad storage design.” But it will tell you where the pool is paying
time: which vdev, which device, reads vs writes, bandwidth vs IOPS vs latency, and whether you’re bottlenecked on
the slowest disk or on your own configuration choices.

A useful rule: if you can’t explain an incident in terms of a few stable metrics (IOPS, bandwidth, latency),
you haven’t finished debugging; you’ve just started narrating.

Interesting facts and historical context

  • ZFS started at Sun in the early 2000s to fix filesystem and volume manager complexity by making them one system, with end-to-end checksums.
  • The “pool of storage” idea was radical in a world of partitions and LUN spreadsheets; it changed how ops teams thought about capacity management.
  • vdevs are the performance unit: ZFS stripes across vdevs, not across individual disks the way many people casually assume.
  • RAIDZ parity math is why small random writes hurt more on RAIDZ than mirrors; “write amplification” isn’t marketing, it’s arithmetic.
  • OpenZFS diverged and converged: multiple platforms carried ZFS forward after Oracle, and features and tooling evolved unevenly across OSes for years.
  • SLOG became a thing because synchronous writes are a contract; ZIL exists to satisfy it, and iostat is one of the fastest ways to see if that contract is expensive.
  • Special vdevs were introduced to handle metadata/small blocks on faster media; they’re performance gold and operational responsibility debt.
  • ashift battles happened because drives lied (or were misunderstood) about physical sector sizes; the wrong ashift can permanently tax performance.
  • “L2ARC fixes everything” has been a recurring myth since the first admins discovered caching; it doesn’t fix write latency, and it can steal memory you needed elsewhere.

Field guide: reading zpool iostat like you mean it

Start with the default output, but don’t stop there

The default zpool iostat output is a gentle introduction: read/write ops and bandwidth per pool and
optionally per vdev. It’s good for “is it busy?” and bad for “why is it slow?” because bottlenecks are usually
about latency distribution and one slow path, not aggregate throughput.
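
One wrinkle before you trust any numbers: with an interval, the first report zpool iostat prints is the average since the pool was imported, not a live sample. On reasonably recent OpenZFS you can drop it; a minimal sketch:

cr0x@server:~$ zpool iostat -v 1 5      # first report = since-import average, then live samples
cr0x@server:~$ zpool iostat -v -y 1 5   # -y omits that since-import summary entirely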

What the columns actually mean

Column names vary slightly by platform and OpenZFS version, but the concepts are stable:

  • operations (read / write): read and write operations per second as ZFS sees them at that layer (pool, vdev, or device).
  • bandwidth (read / write): read and write bytes per second at the same layer.
  • total_wait (with -l): average time an I/O spent end to end, queueing plus device service, split into read and write columns. This is where sync writes usually bite first.
  • disk_wait, syncq_wait, asyncq_wait (with -l on recent OpenZFS): the same wait broken down by where it was spent: at the device, in the sync queue, or in the async queue.
  • queue depths (with -q): pending and active I/Os per queue class. A deep queue plus rising wait is the classic “device saturated” signal.
  • There is no %util column; don’t mentally import Linux iostat semantics. ZFS uses its own instrumentation, and the device-level view lives in OS iostat (a flag sketch follows this list).
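
The flag sketch referenced above, assuming a reasonably recent OpenZFS where -l and -q are supported:

cr0x@server:~$ zpool iostat -v -l 1 5   # adds wait columns: total_wait, disk_wait, syncq_wait, asyncq_wait
cr0x@server:~$ zpool iostat -v -q 1 5   # adds pending/active queue depth per I/O class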

The one sentence that saves you hours

ZFS performance is dominated by the slowest vdev involved in the operation, and zpool iostat is the fastest way to identify which one is acting like the slowest kid in a group project.

A practical mental model: where the time goes

When an application says “write this,” ZFS turns it into a dance: transaction groups (TXGs), copy-on-write (COW),
checksumming, optional compression, metaslab allocation, and then actual device I/O. Reads are their own story:
ARC hits, L2ARC hits, prefetch, RAIDZ reconstruction, and finally disks.

zpool iostat won’t show you TXG commit time directly, but it will show you the symptoms:
bursts of writes, latency spikes, a single vdev doing more work than others, or a SLOG device taking all the
synchronous write heat.

If you need a quick mapping from symptom to “what layer is hurting,” here’s the blunt version:

  • High w_await, low bandwidth: small writes, sync writes, RAIDZ parity overhead, or device latency.
  • High rkB/s, moderate r/s, rising r_await: large reads saturating bandwidth.
  • High r/s, low rkB/s, high r_await: random reads, poor caching, or metadata contention.
  • One top-level vdev pegged: uneven allocation, one vdev smaller/fuller, or an actual failing device.

One quote worth keeping on the wall:
Hope is not a strategy. — traditional SRE saying

Practical tasks: commands, outputs, decisions (12+)

These are the things you actually do at 02:17 when someone says “storage is slow.” Every task has: a command,
what you’re looking at, and the decision it supports. Run them in order if you’re new; jump around if you already
have a hunch and a pager burn scar.

Task 1: Get a moving picture, not a snapshot

cr0x@server:~$ zpool iostat -v 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        3.12T  2.88T    420    310   42.1M  18.7M
  mirror    1.56T  1.44T    210    155   21.0M   9.4M
    sda         -      -    110     78   10.4M   4.9M
    sdb         -      -    100     77   10.6M   4.5M
  mirror    1.56T  1.44T    210    155   21.1M   9.3M
    sdc         -      -    105     77   10.5M   4.6M
    sdd         -      -    105     78   10.6M   4.7M

Meaning: You’re getting five samples at a one-second interval with vdev and disk detail; the first report is the average since import, so weight the later lines. Look for imbalance (one disk or mirror doing way more), and look for workload shape: big bandwidth vs lots of ops.

Decision: If one device is consistently behind its sibling in a mirror, you suspect device health,
cabling, controller issues, or a path problem. If both sides match but the pool latency is bad, you’re hunting
workload or configuration.

Task 2: Turn on latency view (if supported)

cr0x@server:~$ zpool iostat -v -l 1 5
                              capacity     operations     bandwidth    total_wait
pool                        alloc   free   read  write   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----  -----  -----
tank                        3.12T  2.88T    430    320   43.2M  19.1M    4ms   28ms
  mirror                    1.56T  1.44T    215    160   21.5M   9.6M    3ms   26ms
    sda                         -      -    112     81   10.7M   5.0M    3ms   25ms
    sdb                         -      -    103     79   10.8M   4.6M    3ms   28ms
  mirror                    1.56T  1.44T    215    160   21.7M   9.5M    4ms   31ms
    sdc                         -      -    109     80   10.8M   4.8M    4ms   32ms
    sdd                         -      -    106     80   10.9M   4.7M    4ms   30ms

Meaning: “total_wait” (or similar) is your quick-and-dirty latency signal.
Here writes are far slower than reads, and they’re slow everywhere, not just one disk.

Decision: If write wait is high across all vdevs, you ask: sync writes? RAIDZ parity?
a slow SLOG? a TXG stall? If it’s only one vdev/disk, you treat it like a hardware/path issue until proven otherwise.
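
Averages hide tails. If your OpenZFS build supports the histogram view, it is a quick cross-check before you trust a single “28ms”; a sketch:

cr0x@server:~$ zpool iostat -v -w 1 3   # per-vdev latency histograms: slow average, or a nasty tail?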

Task 3: Narrow to the guilty pool when you have multiple pools

cr0x@server:~$ zpool iostat -v tank 1 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        3.12T  2.88T    440    330   44.1M  19.8M
  mirror    1.56T  1.44T    220    165   22.0M   9.9M
    sda         -      -    112     83   10.8M   5.1M
    sdb         -      -    108     82   11.2M   4.8M
  mirror    1.56T  1.44T    220    165   22.1M   9.9M
    sdc         -      -    110     82   11.0M   4.9M
    sdd         -      -    110     83   11.1M   5.0M

Meaning: Avoid cross-pool noise. Especially on multi-tenant boxes, your “slow app”
might not be the only one doing I/O.

Decision: If the busy pool is not the pool backing the complaint, you don’t tune storage;
you tune expectations and investigate resource contention elsewhere (or billing).

Task 4: Confirm pool health before you chase performance ghosts

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 06:41:12 with 0 errors on Mon Dec 23 03:12:01 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors

Meaning: If you see increasing READ/WRITE/CKSUM errors, performance “problems” may be the system
retrying, remapping, and suffering. ZFS will keep your data correct and your latency spicy.

Decision: Any nonzero, trending error counters: stop optimizing. Start replacing hardware,
checking cables/HBAs, and reviewing SMART data.
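
A quick SMART spot-check pairs well with the error counters. A sketch, assuming smartmontools is installed and sda is the suspect device:

cr0x@server:~$ smartctl -H /dev/sda   # overall health verdict
cr0x@server:~$ smartctl -A /dev/sda | grep -i -E 'realloc|pending|uncorrect|crc'   # the attributes that usually predict pain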

Task 5: Check if a scrub or resilver is stealing your lunch

cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Tue Dec 24 11:02:09 2025
        1.43T scanned at 1.21G/s, 820G issued at 694M/s, 3.12T total
        0B repaired, 26.18% done, 00:52:14 to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors

Meaning: Scrubs are legitimate I/O consumers. They’re also good hygiene.
But during peak hours, they can push already-tight latency over the edge.

Decision: If the system is latency-sensitive, schedule scrubs off-peak and/or tune
scrub priority. If you can’t tolerate scrubs, you can’t tolerate operating storage.
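
If a scrub really is pushing a latency-sensitive window over the edge, OpenZFS can pause it and resume later without losing progress; a sketch:

cr0x@server:~$ zpool scrub -p tank   # pause the running scrub; progress is kept
cr0x@server:~$ zpool scrub tank      # run again later to resume from the pause point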

Task 6: See whether you’re bound by IOPS or bandwidth

cr0x@server:~$ zpool iostat -v 2 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        3.12T  2.88T   4200    180   32.1M   6.2M
  mirror    1.56T  1.44T   2100     90   16.0M   3.1M
    sda         -      -   1050     45    8.0M   1.6M
    sdb         -      -   1050     45    8.0M   1.5M
  mirror    1.56T  1.44T   2100     90   16.1M   3.1M
    sdc         -      -   1050     45    8.1M   1.6M
    sdd         -      -   1050     45    8.0M   1.5M

Meaning: 4200 reads/sec but only 32 MB/s: that’s small-block random I/O. If latency is high,
adding more spindles (more vdevs) helps more than “faster disks” in isolation.

Decision: IOPS-bound workloads want mirrors, more vdevs, and sane record sizes; bandwidth-bound
workloads want sequential layout, fewer parity penalties, and often larger blocks.
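
If you would rather see the I/O size distribution than infer it from ops vs MB/s, recent OpenZFS can print request-size histograms; a sketch:

cr0x@server:~$ zpool iostat -v -r 1 3   # request size histograms per vdev: buckets clustered at 4K-16K = small random I/O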

Task 7: Validate whether sync writes are the villain

cr0x@server:~$ zfs get -o name,property,value,source sync tank/db
NAME     PROPERTY  VALUE  SOURCE
tank/db  sync      standard  local

Meaning: sync=standard means sync writes are honored when applications request them.
Databases often do.

Decision: If write latency is terrible and you see many sync writes, your SLOG (if present) must
be fast and power-safe. If you “fix” this by setting sync=disabled, at least be honest and call it
“choosing data loss as a feature.”
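
Two follow-up checks worth a few seconds: which datasets override sync at all, and what logbias is set to, since it changes how the ZIL is used. A sketch using the tank/db example from above:

cr0x@server:~$ zfs get -r -t filesystem,volume sync tank | grep -vw default   # datasets that explicitly override sync
cr0x@server:~$ zfs get logbias tank/db                                        # latency (default) vs throughput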

Task 8: Check for SLOG and whether it’s actually used

cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
        logs
          nvme0n1   ONLINE       0     0     0

errors: No known data errors

Meaning: There is a separate log device. That’s the SLOG (separate intent log).
It only accelerates synchronous writes, not regular writes.

Decision: If you don’t have sync-heavy workloads, a SLOG won’t help and can complicate failure
domains. If you do, the SLOG must be low latency and have power-loss protection.

Task 9: Watch the log device under load (does it spike?)

cr0x@server:~$ zpool iostat -v 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        3.12T  2.88T    120   1800    3.1M  22.4M
  mirror    1.56T  1.44T     60    900    1.5M  11.1M
    sda         -      -     30    450  760K     5.5M
    sdb         -      -     30    450  740K     5.6M
  mirror    1.56T  1.44T     60    900    1.6M  11.3M
    sdc         -      -     30    450  780K     5.7M
    sdd         -      -     30    450  820K     5.6M
logs
  nvme0n1       -      -      0   5200      0  41.6M

Meaning: Big write ops on nvme0n1 under “logs” strongly suggests sync write traffic
is being committed to the SLOG. If the pool has high write wait and the log device shows high ops, you have a
sync-write dominated workload.

Decision: If log device ops are high and latency is high, upgrade/replace SLOG, verify PLP, and
confirm you’re not accidentally using a cheap consumer NVMe that turns into a pumpkin under sustained sync writes.

Task 10: Check dataset recordsize and workload mismatch

cr0x@server:~$ zfs get -o name,property,value,source recordsize tank/db
NAME     PROPERTY    VALUE  SOURCE
tank/db  recordsize  128K   default

Meaning: Many databases prefer smaller blocks (e.g., 8K–16K) to avoid reading/writing extra data.
A 128K recordsize can still work, but it can amplify I/O on random workloads.

Decision: If your iostat shows high IOPS with low bandwidth (small random), consider setting
recordsize closer to the DB page size (after confirming the specific DB guidance).
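
If you do change it, remember recordsize only applies to blocks written after the change; existing data keeps its old block size until it is rewritten. A sketch, assuming a database with 16K pages:

cr0x@server:~$ zfs set recordsize=16K tank/db   # affects new writes only; reload or rewrite data to convert old blocks
cr0x@server:~$ zfs get recordsize tank/db       # confirm the value and source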

Task 11: Check compression (it can reduce I/O, or cost CPU)

cr0x@server:~$ zfs get -o name,property,value,source compression tank/db
NAME     PROPERTY     VALUE   SOURCE
tank/db  compression  lz4     local

Meaning: lz4 is usually a win: less data written, less read from disk, often faster.
But on CPU-starved systems or already-compressed data, it can be neutral or slightly worse.

Decision: If iostat shows disk saturation but CPU headroom exists, compression is your friend.
If CPU is pegged and disk is idle, don’t “optimize storage” with more compression.
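
Before arguing about compression in the abstract, check what it is actually buying on this dataset; compressratio reflects data already written:

cr0x@server:~$ zfs get compression,compressratio tank/db   # a ratio near 1.00x means little to gain (or already-compressed data)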

Task 12: Check ashift (alignment) because it’s forever

cr0x@server:~$ zdb -C tank | grep -E 'ashift|vdev_tree' -n | head
118:        vdev_tree:
134:            ashift: 12

Meaning: ashift=12 implies 4K sectors. If you built the pool with ashift=9
on 4K disks, small writes can become read-modify-write nightmares.

Decision: Wrong ashift is a rebuild-grade mistake. If it’s wrong and performance matters, plan a
migration. There’s no magical “fix” knob that rewrites your pool geometry safely in place.
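
It is worth cross-checking the drives’ reported sector sizes against ashift. A Linux-side sketch, assuming a util-linux lsblk that knows the PHY-SEC/LOG-SEC columns:

cr0x@server:~$ lsblk -d -o NAME,PHY-SEC,LOG-SEC /dev/sd[a-d]   # 4096-byte physical sectors under an ashift=9 pool is the classic trap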

Task 13: Identify uneven vdev fill (allocation bias)

cr0x@server:~$ zpool list -v tank
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH
tank        6.00T  3.12T  2.88T        -         -    19%    52%  1.00x  ONLINE
  mirror    3.00T  2.40T   600G        -         -    28%    80%      -  ONLINE
  mirror    3.00T   720G  2.30T        -         -     8%    24%      -  ONLINE

Meaning: One mirror is 80% full and more fragmented; the other is 24% full. ZFS tries to balance,
but pool history, expansion, and metaslab behavior can lead to skew. Skew can mean one vdev becomes the hot spot.

Decision: If iostat shows that fuller vdev doing more work or showing more latency, rebalance
strategy matters: consider adding vdevs earlier, keep pools under aggressive capacity thresholds, and avoid
“expand later” planning as a performance strategy.

Task 14: Correlate ZFS iostat with OS-level device stats

cr0x@server:~$ iostat -x 1 3
Linux 6.6.0 (server) 	12/25/2025 	_x86_64_	(32 CPU)

Device            r/s     w/s   rkB/s   wkB/s  await  aqu-sz  %util
sda             110.2    81.4  10960.0  5092.0   4.2    0.8   42.1
sdb             105.9    79.6  11010.0  4780.0   4.0    0.7   41.7
sdc             109.8    80.2  11030.0  4960.0   4.6    1.0   44.9
sdd             108.8    80.7  11080.0  5030.0   4.4    0.9   44.1
nvme0n1           0.0  5200.3      0.0 41600.0   0.3    0.6   62.0

Meaning: ZFS sees logical work; the OS sees actual device queues and utilization.
If ZFS says a disk is busy but OS says it’s idle, you might be bottlenecked above the device (locks, CPU, TXG).
If OS shows high await and high %util, the device really is the limiter.

Decision: Use OS iostat to confirm whether the problem is “ZFS scheduling” vs “device can’t keep up.”
This prevents you from replacing disks when you needed to tune sync writes, or vice versa.

Task 15: Check ARC pressure and memory headroom (because reads lie)

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
12:10:01  8420  1120     13   620   55   410   36    90    8   58G    64G
12:10:02  9010  1460     16   780   53   590   40    90    6   58G    64G
12:10:03  8770  1310     14   700   53   520   40    90    7   58G    64G

Meaning: A 13–16% miss rate may be fine or awful depending on workload.
If misses spike and disks light up in zpool iostat, your “storage latency” might be “cache working as designed.”

Decision: If ARC is capped and miss rate is high, add RAM before buying more disks. If ARC is huge
but misses still high, your working set is bigger than memory or you’re doing streaming reads that don’t cache well.
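
On Linux, the raw ARC counters live in /proc/spl/kstat/zfs/arcstats; a sketch to compare current ARC size against its target and cap:

cr0x@server:~$ awk '/^(size|c|c_min|c_max) / {printf "%-6s %7.1f GiB\n", $1, $3/2^30}' /proc/spl/kstat/zfs/arcstats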

Joke #1: If you stare at iostat long enough, the numbers don’t blink first. You do.

Fast diagnosis playbook

When you need a bottleneck fast, do this in order. The goal is not perfect understanding; it’s a correct first fix
and a crisp incident timeline.

First: is this a health problem pretending to be a performance problem?

  • Run zpool status -v. Look for errors, degraded vdevs, slow resilvering, or a device flapping.
  • Run OS-level device stats (iostat -x) to see if one device is suffering.

If you see errors or resets: stop. Replace/fix the path. Performance tuning on sick hardware is like tuning a race
car while the tire is falling off.

Second: identify the scope and shape of I/O

  • zpool iostat -v 1 5: which pool, which vdev, which device is hottest?
  • Look at IOPS vs bandwidth: lots of ops with small bandwidth screams random/small I/O.
  • If available, add -l for wait/latency hints: are reads fine but writes awful?

Third: decide whether the bottleneck is sync writes, RAIDZ parity, or capacity/fragmentation

  • Check zfs get sync on the affected dataset(s).
  • Check for a SLOG in zpool status, then watch log vdev stats in zpool iostat -v.
  • Check zpool list -v for vdev fill imbalance and fragmentation clues.

Fourth: validate with one cross-check metric

  • Use iostat -x to confirm actual device queue/await behavior.
  • Use arcstat (or platform equivalent) to rule in/out caching as the reason disks are busy.

At the end of this playbook you should be able to say one of these sentences with a straight face:
“We’re saturated on random read IOPS,” “Sync writes are bottlenecked by the log device,” “One vdev is overloaded,”
“We’re paying fragmentation/capacity tax,” or “A device/path is failing and retrying.”

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “It’s mirrored, so it’s fast”

A mid-sized SaaS company migrated a customer-facing PostgreSQL cluster onto a new ZFS-backed storage server.
The pool was built as a couple of mirrors. Clean, simple, reassuring. Everyone slept better.

Weeks later, latency started spiking during peak traffic. The graphs were maddening: CPU wasn’t pegged, network was fine.
On-call pulled zpool iostat -v and saw one mirror showing noticeably higher write wait than the other.
The assumption was immediate: “ZFS should balance across mirrors; a mirror can’t be the bottleneck, it’s redundant.”
They began looking elsewhere.

The real issue was uneven vdev fill caused by an earlier expansion workflow. One mirror was significantly more full and more
fragmented. That mirror became the allocation hot spot, so it did more work, so it got slower, so the pool looked slower.
The zpool list -v output had been trying to confess the whole time.

Fix was not exotic: they re-evaluated capacity thresholds, changed the expansion plan to add whole vdevs earlier, and migrated
data to a freshly built pool designed with consistent vdev sizes. The biggest change was cultural: “mirrors” are not “one unit.”
In ZFS, vdevs are the units that matter.

An optimization that backfired: “Disable sync, ship it”

A data platform team had an ingestion pipeline that wrote lots of small transactions. During a load test, write latency looked bad.
Someone discovered zfs set sync=disabled and ran a quick benchmark. Latency dropped. Throughput went up.
The change request wrote itself.

In production, the first few weeks were great. Then a power event hit a rack. The servers rebooted. The pipeline restarted cleanly.
And a silent data inconsistency appeared hours later: downstream jobs saw duplicates and missing records. The team spent days
“debugging Kafka,” then “debugging the database,” and finally traced it back to acknowledged writes that never made it safely to disk.

The post-incident review was awkward because nobody had “done the wrong thing” in their local frame.
The benchmark had been real. The improvement had been real. The workload was real.
The mistake was treating a durability contract like a tunable preference.

The fix wasn’t “never touch sync.” It was “build the right write path.” They added a proper SLOG device with power-loss protection,
validated it under load via zpool iostat log stats, and left sync=standard in place.
Performance improved, and so did everyone’s sleep.

The boring but correct practice that saved the day: baseline sampling

An internal platform team ran ZFS on a fleet of virtualization hosts. They had a policy nobody got excited about:
capture a 60-second zpool iostat -v -l 1 sample during “normal” business hours each week and store it with host metadata.
No dashboards, no hype—just text files with timestamps. The kind of thing you delete by accident during spring cleaning.

One afternoon, VM latency complaints came in. The hosts looked “fine” at first glance. The pool wasn’t full.
There were no errors. Yet the hypervisor UI showed periodic stalls.

On-call compared the current iostat sample to the baseline and saw the difference immediately: write wait had doubled
at the same IOPS level. It wasn’t “more load.” It was “same load, slower service time.”
That narrowed the search to “something in the I/O path changed,” not “someone added tenants.”

The culprit was a firmware update on an HBA that altered queue behavior under mixed read/write workloads.
They rolled back firmware on affected hosts, confirmed latency returned to baseline, and then worked with the vendor
on a fixed build. Boring baselines saved them from a week of ghost-chasing and blame pinball.

Joke #2: Storage latency is like corporate approvals—nobody notices until it’s suddenly everyone’s problem.

Common mistakes: symptom → root cause → fix

1) “Pool IOPS look high, therefore disks are slow”

  • Symptom: High r/s or w/s in zpool iostat; app latency increases.
  • Root cause: Workload changed to smaller I/O or more random I/O; aggregate bandwidth may be modest.
  • Fix: Verify I/O size (IOPS vs MB/s), tune dataset recordsize, add vdevs (more spindles/mirrors), or move hot data to SSD/special vdev.

2) “One disk in a mirror is doing less, so it’s fine”

  • Symptom: In a mirror, one disk shows fewer ops and/or higher wait.
  • Root cause: Device is slow, error-retrying, has a path issue, or is being throttled by the controller.
  • Fix: Check zpool status counters, check OS logs, check SMART, reseat/replace cables, move the disk to a different port/HBA, replace the device if needed.

3) “We added a SLOG; writes are still slow; SLOG is useless”

  • Symptom: SLOG present; write latency remains high.
  • Root cause: The workload isn’t sync-heavy, the SLOG is slow or consumer-grade, or the dataset’s sync setting isn’t what you assumed (disabled, or forced to always).
  • Fix: Confirm sync behavior (zfs get sync), observe “logs” vdev activity in zpool iostat -v, use a low-latency PLP device, and don’t expect SLOG to accelerate async writes.

4) “RAIDZ should be fine; it’s just like hardware RAID”

  • Symptom: Small random writes have high wait; bandwidth looks low; CPU is fine.
  • Root cause: RAIDZ parity overhead and read-modify-write behavior on small updates.
  • Fix: Use mirrors for IOPS-sensitive workloads, or change workload/block sizes, or add more vdevs, or isolate random-write workloads to mirror/SSD pools.

5) “Latency is high, so the pool is overloaded”

  • Symptom: High total_wait; app stalls; but device %util isn’t high.
  • Root cause: The bottleneck may be above the disks: TXG sync, CPU contention, memory pressure, or lock contention.
  • Fix: Cross-check with OS iostat -x, observe ARC behavior, check CPU steal/interrupts, and look for TXG-related stalls via system logs and ZFS tunables (platform-specific).

6) “Performance got worse after we enabled special vdev; ZFS is broken”

  • Symptom: Reads become bursty; metadata-heavy operations stall.
  • Root cause: Special vdev is undersized, saturated, or on a device with poor latency consistency; also, metadata concentration increases dependency on that device.
  • Fix: Ensure special vdev has ample capacity and redundancy, monitor it as a critical tier, and don’t use a single cheap device as your metadata lifeline.

7) “We’re only 80% full; capacity can’t affect performance”

  • Symptom: Latency rises gradually as pool fills; allocation becomes uneven; fragmentation increases.
  • Root cause: Free space becomes harder to allocate efficiently; metaslab fragmentation grows; some vdevs become hotter.
  • Fix: Keep pools comfortably below high-water marks for your workload, add vdevs earlier, and avoid last-minute expansions during peak usage.

Checklists / step-by-step plan

Step-by-step: from “slow” to “bottleneck” in 15 minutes

  1. Confirm the pool and dataset involved.
    Check mountpoints/volumes and map them to pools. Don’t diagnose the wrong box.
  2. Check health first.
    Run zpool status -v. Any errors, degraded devices, resilvers, or scrubs? Treat that as primary.
  3. Capture a short iostat run.
    Run zpool iostat -v 1 10 (and -l if available). Save it; incidents without artifacts become arguments (see the capture sketch after this list).
  4. Classify workload shape.
    IOPS-heavy (high ops, low MB/s) vs bandwidth-heavy (moderate ops, high MB/s). Decide what kind of hardware helps.
  5. Identify the hot vdev.
    If one vdev is hotter, ask “why”: fill imbalance, different disk type, different path, errors, or special/log device saturation.
  6. Check sync behavior.
    Run zfs get sync on the dataset(s). If sync-heavy, confirm SLOG and its behavior in iostat.
  7. Cross-check at OS layer.
    Run iostat -x 1 5. Confirm device-level await and queue. If devices are idle, stop blaming disks.
  8. Decide the first safe mitigation.
    Examples: pause/retime scrub, move workload, rate-limit a batch job, add log device, add vdevs, or migrate dataset.
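
A minimal capture sketch for step 3, assuming Linux, sysstat installed, and a writable /var/tmp; the filenames and the 60-second window are arbitrary:

cr0x@server:~$ TS=$(date +%Y%m%d-%H%M%S)
cr0x@server:~$ zpool status -v tank > /var/tmp/zstatus-$TS.txt
cr0x@server:~$ zpool list -v tank  >> /var/tmp/zstatus-$TS.txt
cr0x@server:~$ zpool iostat -v -l 1 60 > /var/tmp/zpool-iostat-$TS.txt &
cr0x@server:~$ iostat -x 1 60 > /var/tmp/os-iostat-$TS.txt &
cr0x@server:~$ wait   # about a minute later you have matching ZFS-level and device-level artifacts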

Checklist: what to capture for a postmortem

  • At least 60 seconds of zpool iostat -v samples, plus latency view if available.
  • zpool status -v output at incident time.
  • OS device stats (iostat -x) for correlation.
  • Dataset properties: sync, recordsize, compression, atime, and any special vdev configuration.
  • Whether a scrub/resilver was active and its rate.
  • A clear statement: IOPS-bound, bandwidth-bound, sync-bound, or single-vdev-bound.

FAQ

1) What’s the difference between zpool iostat and Linux iostat?

zpool iostat reports ZFS’s logical view per pool/vdev, including virtual devices like logs and special vdevs.
Linux iostat reports block device behavior (queue, await, utilization). Use both: ZFS to find the hot vdev,
OS iostat to confirm device saturation or path trouble.

2) Why does one vdev become the bottleneck if ZFS “stripes” data?

ZFS stripes across top-level vdevs. If one vdev is fuller, slower, misbehaving, or simply built from fewer or slower disks, it can become
the limiting factor. The pool’s aggregate speed is dragged toward the slowest vdev involved in the workload.

3) Does a SLOG speed up all writes?

No. A SLOG speeds up synchronous writes (writes that must be committed safely before returning).
Asynchronous writes are buffered and committed with TXGs; SLOG doesn’t change that path much.

4) Why are my writes slower than reads in iostat?

Common causes: sync writes without a good SLOG, RAIDZ parity overhead for small writes, a nearly-full/fragmented pool,
or a device with inconsistent write latency. If reads are fine and writes are awful, check sync and log vdev behavior.

5) Is high IOPS always bad?

High IOPS is just a workload characteristic. It becomes bad when latency climbs or the system can’t meet SLOs.
The right question is: are we IOPS-bound (small random) or bandwidth-bound (large sequential)?

6) How many samples should I take with zpool iostat?

For triage, 5–10 seconds at 1-second intervals is fine. For diagnosing jitter, take at least 60 seconds and capture
the same during “normal” hours for comparison. Storage problems love to hide in the variance.

7) Can fragmentation explain latency spikes?

It can contribute, especially as pools fill. Fragmentation tends to raise average I/O cost and increase variance.
If you see rising wait times with similar IOPS, and the pool is getting full, fragmentation/capacity tax is a real suspect.

8) What’s the safest “quick fix” during an incident?

The safest is reducing competing I/O: pause/reschedule a scrub, rate-limit batch jobs, or move a noisy tenant.
Changing sync settings or rebuilding pool geometry is not an incident-time fix unless you’re prepared to own the risk.

9) Why does zpool iostat show low bandwidth but the app says it’s writing a lot?

Applications “write” at a logical level; ZFS may compress, coalesce, defer, or satisfy writes via caching and TXG batching.
The device-level bandwidth is what actually hits disks. If your app writes small random blocks, bandwidth can look small even while the system struggles.

10) Should I trust average latency numbers?

Trust them as a clue, not as a verdict. Averages hide tail latency, and tail latency is where users live.
Use samples over time, correlate with workload phases, and confirm with OS-level queue metrics.

Next steps you can do today

If you operate ZFS in production, do these three things before the next incident does them to you:

  1. Create a baseline. Capture weekly zpool iostat -v (and latency view if available) for each important host.
    Keep it somewhere boring and durable (a cron sketch follows this list).
  2. Document your intended write contract. For each critical dataset: is sync required? If yes, what’s the SLOG and how do you validate it under load?
    If no, why not—and who signed for that risk?
  3. Stop treating “pool full” as a capacity-only event. Decide your performance waterline (often well below 90%),
    and plan vdev additions while the pool is still comfortable.
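
A minimal baseline sketch for step 1: a tiny script plus a weekly cron entry. The paths, schedule, and 60-second window are placeholders; drop -l if your OpenZFS build doesn’t support it:

cr0x@server:~$ cat /usr/local/sbin/zfs-baseline.sh
#!/bin/sh
# Weekly ZFS baseline: 60 seconds of pool/vdev stats with latency, stored as plain text.
TS=$(date +%Y%m%d-%H%M%S)
OUT=/var/lib/zfs-baselines/$(hostname)-$TS.txt
mkdir -p /var/lib/zfs-baselines
zpool list -v            > "$OUT"
zpool iostat -v -l 1 60 >> "$OUT"

cr0x@server:~$ crontab -l | grep baseline
17 14 * * 3 /usr/local/sbin/zfs-baseline.sh   # mid-week, mid-afternoon: boring, representative load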

Once you can name the bottleneck with evidence—IOPS-bound, bandwidth-bound, sync-bound, or a single vdev dragging the pool—
the fixes become straightforward. Not always cheap. But straightforward. That’s the deal.
