ZFS zpool iostat -v: Finding the One Disk That’s Ruining Latency

Your users aren’t complaining that throughput is low. They’re complaining that sometimes everything stalls:
logins hang, queries time out, backups “freeze,” and the dashboard lights up with p99 latency that looks like a heart monitor.
In ZFS land, that’s often not “the pool is slow.” It’s “one disk is having a bad day and everyone is invited.”

zpool iostat -v is how you catch that disk in the act. Not after the fact. Not “maybe SMART said something last week.”
Right now—while latency is ugly—you can isolate the offender, prove it with numbers, and decide whether to offline it, replace it,
or stop blaming ZFS for what is really a hardware problem wearing a software costume.

The mental model: ZFS performance is per-vdev, not per-pool

Start here, because it changes how you read every chart and every incident update.
A ZFS pool is a collection of vdevs. Data is striped across vdevs, but each block lives on a specific vdev.
If one vdev gets slow, any workload that hits blocks on that vdev suffers.
If the vdev is a mirror and one side is slow, ZFS can sometimes route reads around it—until it can’t (or until it’s doing writes,
or resilvering, or your workload is sync-heavy).

There’s also a more annoying truth: ZFS is very good at turning small, intermittent device misery into whole-system tail latency.
It does this because ZFS is consistent: it waits for the storage it asked for. Your app doesn’t care that 99.9% of I/Os are fine;
it remembers the 0.1% that took 2 seconds because one disk decided to do internal housekeeping at the worst possible time.

So the goal of zpool iostat -v isn’t “measure the pool.” It’s “find the vdev or disk with a different story than the rest.”
You’re not hunting for low averages. You’re hunting for outliers, spikes, queue growth, and asymmetry.

Quick facts and history that actually help you debug

  • ZFS was built around end-to-end checksums. That’s great for integrity, but it also means ZFS will loudly surface bad devices by retrying, healing, and logging errors instead of silently returning junk.
  • RAIDZ is not “hardware RAID but in software.” RAIDZ parity math and allocation behavior make small random writes more complex than mirrors, which matters when a single disk slows down.
  • “Slower disk ruins the vdev” is older than ZFS. Classic RAID arrays have always been gated by their slowest member; ZFS just gives you better instrumentation to prove it.
  • 4K sector reality changed everything. ZFS ashift exists because disks lied (or half-lied) about sector sizes for years; misalignment can amplify I/O and latency.
  • Scrub is a feature, not a punishment. ZFS scrubs read everything intentionally; the point is to find latent errors before resilver forces you to learn the hard way.
  • ZFS can choose different mirror sides for reads. That can hide a slow disk on reads while writes still suffer, leading to confusing “reads are fine, writes are awful” incidents.
  • SLOG is not a write cache. A separate log device accelerates synchronous writes only; it won’t fix async write latency or slow pool members.
  • OpenZFS iostat grew useful over time. Older implementations were thinner; modern OpenZFS exposes more per-vdev behavior, and on some platforms you can get latency histograms via other tooling.
  • SSDs can be “healthy” and still stall. Firmware GC and thermal throttling can create latency spikes without obvious SMART failures—until you correlate iostat + temperatures.

What zpool iostat -v really shows (and what it hides)

The command you’ll actually use

The workhorse is:
zpool iostat -v with an interval and optionally a count.
Without an interval, you get lifetime averages since boot/import—useful for capacity planning, terrible for incidents.
With an interval, you get per-interval deltas. That’s where the truth lives.

Common variants:

  • zpool iostat -v 1 for real-time watching
  • zpool iostat -v 5 12 for a 1-minute snapshot
  • zpool iostat -v -y 1 to suppress the first “since boot” line, which otherwise distracts people in war rooms
  • zpool iostat -v -p for exact bytes (no humanization), which matters when you’re eyeballing small deltas

What the columns mean (and what you should infer)

Depending on your platform/OpenZFS version, you’ll see columns like capacity, operations, bandwidth.
Typical output shows read/write ops and read/write bandwidth at pool, vdev, and leaf-device levels.
That’s enough to catch many latency killers because latency usually manifests as reduced ops plus uneven distribution plus one device doing less work (or weirdly more).

What you often won’t see directly is latency in milliseconds. Some platforms expose it via extended iostat modes or other tools,
but even without explicit latency columns you can still diagnose: when the workload is steady, a slow device shows up as a drop in its ops/bandwidth compared to its peers,
plus increased load elsewhere, plus user-visible stalls.
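
If your OpenZFS build is recent enough, you can ask for latency directly. A hedged sketch (these flags exist in modern OpenZFS, but availability and exact columns vary by version and platform, so check your man page):

cr0x@server:~$ zpool iostat -v -l 1
cr0x@server:~$ zpool iostat -w 1

The first adds average wait columns (total, disk, and queue waits) per vdev and leaf; the second prints per-vdev latency histograms where supported.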

The discipline: don’t stare at pool totals. Pool totals can look “fine” while one disk quietly causes tail latency by intermittently stalling.
Always expand to vdev and disk.

Joke #1: If storage were a team sport, the slow disk is the one who insists on “just one more quick thing” before every pass.

Fast diagnosis playbook (first/second/third)

First: confirm it’s storage latency, not CPU, network, or memory pressure

If the system is swapping, or your NIC is dropping packets, or the CPU is pegged by compression, storage will get blamed anyway.
Do a 60-second sanity pass (a quick sketch follows the list):

  • Load average vs CPU usage
  • Swap activity
  • Network errors
  • ZFS arc size and evictions
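
A minimal sanity pass, assuming a Linux box with iproute2 and the OpenZFS userland tools installed (the interface name is an assumption, and older packages ship arc_summary as arc_summary.py):

cr0x@server:~$ uptime
cr0x@server:~$ vmstat 1 5
cr0x@server:~$ ip -s link show eth0
cr0x@server:~$ arc_summary | head -n 20

Load vs cores, swap-in/swap-out (si/so), interface errors/drops, and ARC size/hit rate cover most of the “it’s not actually storage” cases in under a minute.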

But don’t overthink it: if your app latency correlates with a pool spike, keep going.

Second: run zpool iostat -v with an interval and watch for asymmetry

Run zpool iostat -v -y 1 during the incident. You’re looking for:

  • One leaf device with far fewer ops than its siblings (or periodic zeroing)
  • One device with weird bandwidth compared to the rest (read amplification, retries, rebuild traffic)
  • A single vdev dragging down pool ops (common with RAIDZ under random I/O)

Third: corroborate with health and error telemetry

Once you have a suspect disk, validate it:

  • zpool status -v for errors, resilver/scrub activity
  • smartctl for media errors, CRC errors, temperature, and timeout patterns
  • iostat / nvme tooling for device-level utilization and latency (platform dependent)

The decision point is usually one of these:

  • Offline/replace a failing disk
  • Fix a path/cable/controller issue (CRC errors, link resets)
  • Stop an “optimization” that’s generating destructive I/O (bad recordsize, misused sync, pathological scrub timing)
  • Rebalance or redesign vdev layout if you’ve outgrown it (RAIDZ width, mirrors, special vdev)

Practical tasks: commands, output meaning, decisions

These are the moves you can do on a live system. Each includes what you’re looking at and what decision it drives.
Use an interval. Use -y. And stop copy/pasting lifetime averages into incident channels like they mean anything.

Task 1: Get a clean, real-time per-disk view

cr0x@server:~$ zpool iostat -v -y 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
tank        7.12T  3.80T    980    420   120M  38.0M
  raidz2-0  7.12T  3.80T    980    420   120M  38.0M
    sda         -      -    170     70  21.0M  6.2M
    sdb         -      -    165     71  20.8M  6.4M
    sdc         -      -    168     69  20.9M  6.1M
    sdd         -      -     20    180   2.0M  19.3M
    sde         -      -    170     71  21.1M  6.3M
    sdf         -      -    167     70  20.7M  6.2M

Meaning: One disk (sdd) is doing far fewer reads and far more writes than peers; the pattern is asymmetric.
That could be real workload skew, but in RAIDZ it’s often a hint of retries, reconstruction reads, or a device behaving oddly.

Decision: Mark sdd as a suspect and corroborate with zpool status and SMART. Don’t replace anything yet, but stop arguing about “pool totals.”

Task 2: Narrow to a single pool and reduce noise

cr0x@server:~$ zpool iostat -v -y tank 2 10
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
tank        7.12T  3.80T    950    410   118M  37.6M
  raidz2-0  7.12T  3.80T    950    410   118M  37.6M
    sda         -      -    165     68  20.6M  6.0M
    sdb         -      -    163     67  20.7M  6.1M
    sdc         -      -    164     66  20.5M  5.9M
    sdd         -      -     10    175   1.1M  18.7M
    sde         -      -    166     67  20.8M  6.0M
    sdf         -      -    162     69  20.5M  6.1M

Meaning: The imbalance persists across intervals. It’s not a one-off burst.

Decision: Escalate from “hunch” to “active investigation.” Someone should start a SMART dump and error check now, not after the incident ends.

Task 3: Check pool health and see if ZFS is already telling you the answer

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced.
  scan: scrub repaired 0B in 02:14:33 with 0 errors on Wed Dec 25 03:12:41 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       3     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0

errors: No known data errors

Meaning: sdd has read errors. ZFS corrected them (so you’re not corrupt), but it paid latency to do it.
“Applications are unaffected” means data integrity, not user experience.

Decision: Treat sdd as failing until proven otherwise. Prepare replacement, and check cabling/controller too.

Task 4: Pull SMART health and look for the usual liars

cr0x@server:~$ sudo smartctl -a /dev/sdd
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME          VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   100   100   010    Pre-fail  Always       -       12
187 Reported_Uncorrect      100   100   000    Old_age   Always       -       3
197 Current_Pending_Sector  100   100   000    Old_age   Always       -       1
199 UDMA_CRC_Error_Count    200   199   000    Old_age   Always       -       48
194 Temperature_Celsius     031   048   000    Old_age   Always       -       57

Meaning: “PASSED” is not a performance guarantee. Reallocated sectors and pending sectors hint at media degradation.
CRC errors point to cabling/backplane/controller issues. 57°C is “I’m not dead, I’m just slow and angry.”

Decision: If CRC is climbing, reseat/replace cable/backplane lane. If realloc/pending exists, plan disk replacement.
If temperature is high, fix airflow; heat causes latency spikes before it causes outright failure.

Task 5: Confirm the suspect disk mapping (avoid replacing the wrong drive)

cr0x@server:~$ ls -l /dev/disk/by-id/ | grep sdd | head
lrwxrwxrwx 1 root root  9 Dec 25 09:10 ata-ST12000NM0007_ZL0ABC12 -> ../../sdd
lrwxrwxrwx 1 root root 10 Dec 25 09:10 wwn-0x5000c500a1b2c3d4 -> ../../sdd

Meaning: You have stable identifiers (WWN/by-id) that survive reboots and device renumbering.

Decision: Use by-id/WWN in replacement procedures and in zpool replace where possible.
“We pulled the wrong disk” is an outage genre.

Task 6: Check whether a scrub/resilver is competing for I/O

cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Wed Dec 25 09:02:11 2025
        3.11T scanned at 1.25G/s, 1.44T issued at 595M/s, 7.12T total
        0B repaired, 20.18% done, 00:18:43 to go

Meaning: A scrub is actively reading the pool. On busy pools, scrub can raise latency by saturating disk queues.
If one disk is weak, scrub makes it obvious by dragging its feet.

Decision: During a customer-facing incident, consider pausing scrub (zpool scrub -p) if policy allows.
But don’t “fix” the incident by permanently never scrubbing; that’s how you convert latent errors into data loss later.
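
A minimal sketch of the mitigation and its undo, assuming your OpenZFS release supports scrub pause (most modern ones do; older ones only support cancel via zpool scrub -s):

cr0x@server:~$ sudo zpool scrub -p tank
cr0x@server:~$ sudo zpool scrub tank

The first pauses the running scrub; the second, issued later, resumes it from where it left off.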

Task 7: Identify sync-write pressure (and stop blaming SLOG incorrectly)

cr0x@server:~$ zfs get -o name,property,value -H sync,logbias tank
tank	sync	standard
tank	logbias	latency

Meaning: Sync writes are honored normally; logbias favors the log device if present.
If your app latency is from fsync-heavy writes, SLOG quality matters. If it’s not sync-heavy, SLOG won’t save you.

Decision: If the incident is write-latency and you have no SLOG (or a slow one), consider adding a proper power-loss-safe SLOG.
If you already have one, don’t assume it’s working—measure.
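
One way to measure instead of assuming, as a sketch (the logs section only appears in zpool iostat -v output when a log device is attached, and the -l latency columns depend on your OpenZFS version):

cr0x@server:~$ zpool iostat -v -y tank 1
cr0x@server:~$ zpool iostat -v -l -y tank 1

Watch whether the logs vdev takes write ops at all during the incident, and if latency columns are available, compare its write waits to the data vdevs.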

Task 8: Observe per-vdev behavior in a mirror (catch the “one side is sick” case)

cr0x@server:~$ zpool iostat -v -y ssdpool 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
ssdpool      812G   989G   4200   1800   520M   210M
  mirror-0   812G   989G   4200   1800   520M   210M
    nvme0n1      -      -   4100    900   505M   110M
    nvme1n1      -      -    100    900    15M   110M

Meaning: Reads are being served mostly from nvme0n1; writes are mirrored so both take them.
This pattern can happen because ZFS prefers the faster side for reads. If nvme1n1 is stalling, you may not notice until write latency or resilver.

Decision: Investigate the “quiet” side anyway. Run SMART/NVMe logs and check for thermal throttling or media errors.

Task 9: Check NVMe health for throttling and resets

cr0x@server:~$ sudo nvme smart-log /dev/nvme1
temperature                         : 79 C
available_spare                     : 100%
percentage_used                     : 12%
media_errors                        : 0
num_err_log_entries                 : 27
warning_temp_time                   : 148
critical_comp_time                  : 0

Meaning: 79°C plus significant warning-temp time: the device is likely throttling.
Error log entries suggest resets/timeouts even if media_errors is zero.

Decision: Fix cooling and check firmware/controller behavior. Throttling is a latency killer that looks like “random ZFS pauses.”

Task 10: Look for controller or link issues in kernel logs

cr0x@server:~$ sudo dmesg | egrep -i "sdd|ata|reset|timeout|I/O error" | tail -n 8
[ 8123.441122] ata7: hard resetting link
[ 8123.771090] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 8124.112904] blk_update_request: I/O error, dev sdd, sector 223456789 op 0x0:(READ)
[ 8124.112970] sd 7:0:0:0: [sdd] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8124.113001] sd 7:0:0:0: [sdd] Sense Key : Medium Error [current]

Meaning: Link resets and I/O errors. That’s not “ZFS is slow.” That’s the device or the path.

Decision: Replace/repair the path (cable/backplane/HBA port) and likely the disk. If errors persist after path fix, the disk is guilty.

Task 11: Confirm ashift and stop accidental write amplification

cr0x@server:~$ zdb -C tank | egrep -n "ashift|vdev_tree" | head -n 12
56:        vdev_tree:
78:            ashift: 12

Meaning: ashift: 12 means 4K sectors. Good for modern disks.
If you see 9 (512B) on 4K media, you can create brutal read-modify-write behavior and latency under random writes.

Decision: If ashift is wrong, the real fix is rebuilding the pool correctly. Don’t pretend a tunable will save you.

Task 12: Determine if special vdev is the hidden bottleneck

cr0x@server:~$ zpool status tank | sed -n '1,120p'
  pool: tank
 state: ONLINE
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            sda        ONLINE       0     0     0
            sdb        ONLINE       0     0     0
            sdc        ONLINE       0     0     0
            sdd        ONLINE       3     0     0
            sde        ONLINE       0     0     0
            sdf        ONLINE       0     0     0
        special
          mirror-1     ONLINE       0     0     0
            nvme2n1    ONLINE       0     0     0
            nvme3n1    ONLINE       0     0     0

Meaning: There’s a special vdev mirror. If it’s undersized or throttling, metadata-heavy workloads stall even if the main RAIDZ looks fine.
If special vdev dies, the pool can be toast depending on what’s allocated there.

Decision: Monitor special vdev like it’s production-critical (because it is). If it’s hot or erroring, fix it first.

Task 13: Correlate ZFS-level behavior with block-device utilization

cr0x@server:~$ iostat -x 1 5
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.20    0.00    6.10    9.40    0.00   73.30

Device            r/s     w/s   rkB/s   wkB/s  avgrq-sz avgqu-sz   await  svctm  %util
sda             168.0    70.0 21000.0  6200.0    224.2     1.10    5.20   1.40  33.2
sdb             166.0    69.0 20800.0  6100.0    223.9     1.08    5.10   1.35  32.9
sdc             167.0    68.0 20900.0  6000.0    224.1     1.12    5.30   1.38  33.5
sdd              10.0   175.0  1100.0 18700.0    225.0    18.40  115.00   5.90  99.8
sde             167.0    69.0 21100.0  6100.0    224.0     1.09    5.20   1.37  33.1
sdf             165.0    70.0 20700.0  6200.0    224.3     1.11    5.10   1.36  33.0

Meaning: sdd has massive queue (avgqu-sz), high await, and 99.8% util.
Peers are cruising at ~33% util with ~5ms awaits. That’s your latency villain.

Decision: You have enough evidence to take action. If redundancy permits, offline and replace sdd, or fix its path.
Don’t wait for it to “fail harder.”

Task 14: Offlining a disk safely (when you know what you’re doing)

cr0x@server:~$ sudo zpool offline tank sdd

Meaning: ZFS stops using that device. A RAIDZ2 vdev can survive up to two missing devices; a two-way mirror can survive one (an N-way mirror survives N-1).

Decision: Do this only if redundancy allows it and you’re confident in device identity. Offlining can immediately improve latency by removing the stalling device from the I/O path.

Task 15: Replace by-id and watch resilver I/O distribution

cr0x@server:~$ sudo zpool replace tank /dev/disk/by-id/wwn-0x5000c500a1b2c3d4 /dev/disk/by-id/wwn-0x5000c500d4c3b2a1

Meaning: You are replacing the exact WWN device with a new one. This avoids device-name roulette.

Decision: After replacement, use zpool iostat -v during resilver to ensure the new disk behaves like its peers and the pool remains responsive.

How to interpret the patterns: latency signatures of common failures

Signature 1: One disk shows lower ops and occasional “flatline” intervals

You’ll see a disk with read/write ops dropping to near-zero for an interval, then “catching up.”
This is classic for firmware stalls, internal GC (SSD), or SATA link resets.
The pool total might not crash because other devices still do work, but your p99 will look terrible.

What to do: Check dmesg for resets/timeouts and SMART/NVMe error logs. Verify temperatures. If it’s recurring, replace.
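
Because intermittent stalls are easy to miss and hard to prove afterwards, capture with timestamps while it is happening. A sketch (the -T option is standard in modern OpenZFS; the output path is an assumption):

cr0x@server:~$ zpool iostat -v -y -T d 1 | tee /var/tmp/zpool-iostat-incident.log

Timestamped intervals let you line up the flatline with dmesg resets and application-side p99 spikes later.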

Signature 2: One disk is at 100% util with huge queues; peers are idle-ish

That’s a smoking gun. ZFS is issuing what it can, but the device can’t keep up.
In RAIDZ, a single device at 100% can throttle reconstruction reads and parity operations. In mirrors, the slow side can still hurt writes.

What to do: Confirm with iostat -x. If it’s a path issue (CRC errors), fix cabling. If it’s media, replace disk.

Signature 3: Writes are slow everywhere, but reads are fine

Often sync writes. Or a slog that isn’t actually fast. Or a dataset setting that forces sync (or your app calls fsync constantly).
Mirrors hide read problems better than write problems; RAIDZ tends to punish random writes.

What to do: Confirm dataset settings (sync, logbias), check for a SLOG device, and validate it’s low-latency and power-loss-safe.
Also check if you’re hitting fragmentation + small records on RAIDZ.
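
Confirming the write-path settings is one command, sketched here (the grep just hides properties still at their defaults so overrides stand out):

cr0x@server:~$ zfs get -r sync,logbias,recordsize tank | grep -v 'default$' | head -n 30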

Signature 4: During scrub/resilver, everything becomes a potato

Scrub/resilver is a full-contact sport. ZFS will compete with your workload for disk time.
If you have one marginal disk, scrub makes it the center of attention.

What to do: Schedule scrub. Consider throttling via system-level I/O scheduling tools (platform-specific) rather than disabling scrubs.
If scrub reveals errors on a specific disk, don’t argue; replace it.
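
Scheduling is the boring fix. A minimal cron sketch, assuming a Linux host where zpool lives in /usr/sbin and no distro-provided scrub timer or cron job is already active (check /etc/cron.d and systemd timers first so you don’t double-schedule):

cr0x@server:~$ echo '0 2 * * 0 root /usr/sbin/zpool scrub tank' | sudo tee /etc/cron.d/zfs-scrub-tank

That runs a weekly scrub at 02:00 on Sunday; adjust the window to your lowest-traffic hours.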

Signature 5: Special vdev or metadata device stalls cause bizarre “everything is slow but data disks are chill”

Metadata is the choke point for many workloads. If special vdev is overloaded or throttled, file operations and small IO can crawl.
You’ll see the special vdev devices hotter/busier than the main vdev.

What to do: Monitor it like a tier-0 component. Use zpool iostat -v and device telemetry to confirm it’s not thermal throttling.

One idea worth keeping on a sticky note (paraphrasing John Allspaw’s reliability message): you don’t “prevent” failure; you build systems that detect and recover from it quickly.

Joke #2: SMART “PASSED” is like a corporate status report—technically true, emotionally useless.

Three corporate mini-stories from the latency trenches

Mini-story #1: The incident caused by a wrong assumption

A mid-size SaaS shop ran a ZFS-backed NFS tier for CI artifacts and container images. For months it was fine.
Then “random” slowness hit: builds stalled, pulls timed out, and the on-call was told, repeatedly, that “storage is slow again.”
The team assumed it was network saturation because the pool throughput graphs didn’t look terrible.

They spent a morning tuning NFS threads and arguing about MTU. They rebooted a switch.
Nothing changed. The latency spikes were still there, mostly during peak commit hours.
Someone finally ran zpool iostat -v -y 1 during an incident window and noticed one disk in a RAIDZ2 vdev
showing drastically fewer reads than its siblings, with periodic near-zero intervals.

The “wrong assumption” was subtle: they believed “if throughput is okay, storage isn’t the problem.”
But their workload was full of small random reads (metadata-heavy directory traversals and lots of tiny files), where tail latency matters more than aggregate MB/s.
One disk was intermittently resetting the SATA link. ZFS kept the pool online and healed, but the retries translated into user-visible stalls.

They replaced the disk, and the graphs didn’t change much. That’s the point: throughput was never the real symptom.
The customer tickets stopped within the hour because p99 latency stopped taking scenic routes through error recovery.
The lesson they wrote down: you don’t troubleshoot ZFS incidents with pool totals; you troubleshoot them with per-device deltas and evidence.

Mini-story #2: The optimization that backfired

A finance-adjacent company had a ZFS pool serving VMs over iSCSI. Latency was decent but not stellar.
An engineer proposed an “easy win”: change dataset properties to chase performance—smaller recordsize “for databases,” more aggressive compression,
and a blanket change to treat sync writes differently because “the UPS will cover us.”

The immediate benchmark looked better. Then production happened.
The databases started doing more I/O operations per transaction due to recordsize mismatch with workload patterns,
compression increased CPU during spikes, and the sync changes created pathological behavior with the application’s fsync patterns.
To make it worse, the team added a consumer NVMe as SLOG because it was “fast,” ignoring power-loss characteristics and sustained latency.

zpool iostat -v told a very specific story: the SLOG device was pegged with high write ops during peak hours,
and one member of a mirror started to show uneven distribution as it overheated.
Latency got worse exactly when the team expected it to get better.

The rollback fixed the incident. The postmortem wasn’t about shame; it was about discipline.
Optimizations that change write semantics aren’t “tweaks.” They’re architecture changes.
If you can’t explain the new failure modes—and measure them live—you’re just moving risk around until it lands on a customer.

Mini-story #3: The boring but correct practice that saved the day

A healthcare vendor ran ZFS for a document store. Nobody loved that system; it was “legacy,” which meant it was critical and underfunded.
But the storage engineer had one habit: weekly scrub windows, and a simple runbook that included capturing zpool iostat -v during scrub.
Boring. Repetitive. Effective.

One week, scrub time increased noticeably. Not enough to page anyone, but enough to show up in the notes.
During the scrub, per-disk stats showed one drive doing fewer reads and occasionally stalling.
SMART still said “PASSED,” because of course it did, but there were a few growing CRC errors and an elevated temperature.

The engineer filed a ticket to replace the cable and, if needed, the disk at the next maintenance window.
Cable replacement reduced CRC errors but didn’t eliminate the stalls. They replaced the disk proactively, resilvered cleanly, and moved on.

Two months later, a different team had a near-identical failure on a similar chassis and suffered a messy incident because they had no trend history and no habit of looking at per-disk behavior.
The boring practice didn’t just prevent downtime; it prevented ambiguity.
When you can say “this disk is degrading over weeks,” you get to do maintenance instead of heroics.

Common mistakes: symptom → root cause → fix

1) Symptom: Pool throughput looks fine, but p99 latency is terrible

Root cause: One disk stalls intermittently; ZFS retries/heals; averages hide tail latency.
Or special vdev stalls metadata operations.

Fix: Use zpool iostat -v -y 1 during the event; find asymmetry.
Correlate with iostat -x, dmesg, and SMART/NVMe logs. Replace failing disk/path; address overheating.

2) Symptom: Writes are slow on a mirror even though reads are fast

Root cause: ZFS can choose the faster side for reads, hiding a sick disk; writes must hit both.
Or sync writes saturate a weak SLOG device.

Fix: Look for per-leaf imbalance in mirror with zpool iostat -v.
Validate SLOG behavior. Fix/replace the slow mirror member or SLOG device.

3) Symptom: Scrubs make the pool unusable

Root cause: Scrub competes for I/O; pool is near capacity or fragmented; one disk is marginal and becomes the bottleneck.

Fix: Schedule scrubs in low-traffic windows; consider pausing during incidents.
Investigate the slow disk with zpool iostat -v and SMART; replace it if it’s dragging. Keep free space healthy.

4) Symptom: RAIDZ pool has poor random write latency compared to expectations

Root cause: RAIDZ parity overhead plus small blocks; wrong recordsize; misaligned ashift; heavy sync workload without appropriate design.

Fix: Confirm ashift. Align recordsize to workload. For heavy random I/O, prefer mirrors or adjust vdev width and workload patterns.
Don’t “tune” parity away.
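
If you do adjust recordsize, remember it only affects newly written blocks. A hedged example (tank/db and the 16K value are assumptions, matching a common database page size rather than a universal recommendation):

cr0x@server:~$ sudo zfs set recordsize=16K tank/db

Existing data keeps its old block size until it is rewritten, so don’t expect instant latency changes from this alone.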

5) Symptom: Disk shows many CRC errors but no reallocations

Root cause: Cabling/backplane/HBA issues causing link-level corruption and retries.

Fix: Replace/reseat cables, swap ports, check backplane. Watch if CRC continues to increment after fix.

6) Symptom: Replacing a disk didn’t help; latency still spikes

Root cause: The path/controller is the real issue; or the workload is pounding sync writes; or special vdev is throttling.

Fix: Use dmesg and controller telemetry. Validate sync/logbias and SLOG.
Inspect special vdev iostat and NVMe thermals.

7) Symptom: “zpool iostat shows nothing wrong” but the app is still slow

Root cause: You’re looking at lifetime averages (no interval), or the incident is intermittent and you missed it.
Or the bottleneck is above ZFS (CPU, memory pressure, network) or below (multipath, SAN, hypervisor).

Fix: Always use interval mode. Capture during the event. Add lightweight continuous sampling in your monitoring so you can replay the moment.
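
A crude but effective continuous sampler, as a sketch (a metrics agent or systemd timer is the cleaner long-term answer; the log path and its rotation are left as assumptions):

cr0x@server:~$ nohup sh -c 'while true; do zpool iostat -v -y -T u 5 1; done' >> /var/log/zpool-iostat-sample.log 2>&1 &

Each iteration waits five seconds, prints one timestamped per-device sample, and appends it, so you can replay the exact minute the incident happened.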

Checklists / step-by-step plan

Step-by-step: catch the slow disk in under 10 minutes

  1. Run per-interval stats.

    cr0x@server:~$ zpool iostat -v -y 1
    

    Decision: Identify any leaf device with ops/bandwidth that doesn’t match its siblings.

  2. Confirm pool health and background activity.

    cr0x@server:~$ zpool status -v
    

    Decision: If scrub/resilver is active, decide whether to pause for incident mitigation.

  3. Correlate with device-level queue and await.

    cr0x@server:~$ iostat -x 1 3
    

    Decision: High await and queue on one disk = action item. Don’t wait.

  4. Check logs for resets/timeouts.

    cr0x@server:~$ sudo dmesg | egrep -i "reset|timeout|I/O error|blk_update_request" | tail -n 30
    

    Decision: If link resets exist, suspect cabling/HBA/backplane even if SMART is quiet.

  5. Pull SMART/NVMe logs for the suspect device.

    cr0x@server:~$ sudo smartctl -a /dev/sdX
    

    Decision: Pending/reallocated/CRC/temp issues drive replace vs path fix.

  6. If redundancy allows, offline the offender to restore latency.

    cr0x@server:~$ sudo zpool offline tank sdX
    

    Decision: Use as an emergency mitigation, not a permanent lifestyle.

  7. Replace with stable identifiers and monitor resilver.

    cr0x@server:~$ sudo zpool replace tank /dev/disk/by-id/OLD /dev/disk/by-id/NEW
    

    Decision: If the new disk’s iostat looks different from peers, stop and validate hardware and firmware.

Checklist: what to capture for a useful postmortem

  • zpool iostat -v -y 1 samples during the incident window (even 60 seconds helps; one capture sketch follows this list)
  • zpool status -v including error counts and scan state
  • SMART/NVMe logs for any suspect disk (including temperature and CRC)
  • iostat -x for queue/await/util confirmation
  • Kernel logs around the time of the spike (resets, timeouts)
  • What changed recently (firmware, cables moved, workload changes, dataset settings)
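
One way to grab most of the above in a single pass, sketched here (adjust the pool name, suspect device, and output path to your environment):

cr0x@server:~$ { date; zpool status -v tank; zpool iostat -v -y -T d 1 60; iostat -x 1 10; sudo smartctl -a /dev/sdd; sudo dmesg | tail -n 200; } > /var/tmp/zfs-incident-$(date +%Y%m%d-%H%M).txt 2>&1

Sixty seconds of per-device iostat plus status, SMART, and kernel logs is usually enough to settle the “was it the disk?” argument in the postmortem.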

Checklist: decisions you should make explicitly (not by vibes)

  • Is the problem isolated to one disk, one vdev, or systemic?
  • Do we have redundancy to offline now?
  • Is this a path problem (CRC/resets) or media problem (pending/realloc/uncorrectables)?
  • Is a scrub/resilver amplifying the issue, and can it be paused safely?
  • Do we need to change architecture (mirrors vs RAIDZ, special vdev sizing) rather than tuning?

FAQ

1) Why does one slow disk hurt the whole pool?

Because ZFS issues I/O to specific vdevs. If a block lives on the slow vdev (or needs reconstruction involving it), the request waits.
Tail latency is dominated by the worst participant, not the average participant.

2) Is zpool iostat showing latency?

Usually it shows ops and bandwidth, not milliseconds. But latency shows up indirectly as reduced ops, uneven distribution, and queue growth
confirmed by iostat -x or platform-specific latency tools.

3) Why should I always use an interval (like zpool iostat -v 1)?

Without an interval, you’re looking at averages since import/boot. Incidents live in spikes and deltas.
Interval mode gives you the per-second (or per-N-seconds) reality.

4) I see one mirror side doing almost all reads. Is that bad?

Not automatically. ZFS can prefer the faster side. It’s bad when the “ignored” side is ignored because it’s unhealthy
(thermal throttling, timeouts, errors). Verify with device telemetry.

5) SMART says PASSED. Can the disk still be the problem?

Yes. SMART’s overall health is coarse. Look at specific attributes (pending sectors, reallocations, CRC errors, temperature)
and error logs. Also check kernel logs for resets/timeouts.

6) Should I pause scrub during an incident?

If scrub is competing with critical production I/O and you need to restore service, pausing can be a reasonable short-term move.
But reschedule it and investigate why scrub hurts so much—often it reveals a weak disk or an oversubscribed design.

7) Does adding a SLOG fix latency?

It can fix synchronous write latency when the workload is truly sync-heavy and the SLOG device is low-latency and power-loss-safe.
It does nothing for async writes or reads, and a bad SLOG can make things worse.

8) How do I know if it’s a cable/backplane issue instead of a disk?

Rising UDMA_CRC_Error_Count on SATA devices is a strong hint, along with link resets in dmesg.
Media errors (pending/realloc/uncorrectable) point more toward the disk itself. Sometimes you get both; fix the path first, then reassess.

9) Why is RAIDZ often worse for random write latency than mirrors?

Parity requires additional reads/writes and coordination across disks; small writes can become read-modify-write cycles.
Mirrors do simpler writes. If you need IOPS and low tail latency, mirrors are usually the straightforward choice.

10) What’s the quickest “proof” I can show to non-storage folks?

A short capture of zpool iostat -v -y 1 plus iostat -x 1 where one device has dramatically higher await and queue
while peers are normal. It’s visual, repeatable, and doesn’t require belief in storage folklore.

Conclusion: what to do next, today

When ZFS latency goes sideways, don’t start by “tuning ZFS.” Start by proving whether one disk or one vdev is misbehaving.
zpool iostat -v with an interval is your flashlight; it finds asymmetry fast.
Then corroborate with zpool status, SMART/NVMe logs, and kernel messages.

Practical next steps:

  • Put zpool iostat -v -y 1 into your incident runbook, not your personal memory.
  • During the next scrub window, capture per-disk behavior and baseline it. Boring baselines make exciting incidents shorter.
  • If you find a suspect device, decide quickly: path fix, offline, or replace. Tail latency rarely heals itself.
  • If your architecture is mismatched (sync-heavy DB on wide RAIDZ, special vdev undersized), admit it and plan the redesign instead of worshipping tunables.

The one disk ruining latency is rarely subtle. The subtle part is us: we keep staring at pool totals and hoping averages will tell the truth.
They won’t. Per-device deltas will.
