Your users aren’t complaining that throughput is low. They’re complaining that sometimes everything stalls:
logins hang, queries time out, backups “freeze,” and the dashboard lights up with p99 latency that looks like a heart monitor.
In ZFS land, that’s often not “the pool is slow.” It’s “one disk is having a bad day and everyone is invited.”
zpool iostat -v is how you catch that disk in the act. Not after the fact. Not “maybe SMART said something last week.”
Right now—while latency is ugly—you can isolate the offender, prove it with numbers, and decide whether to offline it, replace it,
or stop blaming ZFS for what is really a hardware problem wearing a software costume.
The mental model: ZFS performance is per-vdev, not per-pool
Start here, because it changes how you read every chart and every incident update.
A ZFS pool is a collection of vdevs. Data is striped across vdevs, but each block lives on a specific vdev.
If one vdev gets slow, any workload that hits blocks on that vdev suffers.
If the vdev is a mirror and one side is slow, ZFS can sometimes route reads around it—until it can’t (or until it’s doing writes,
or resilvering, or your workload is sync-heavy).
There’s also a more annoying truth: ZFS is very good at turning small, intermittent device misery into whole-system tail latency.
It does this because ZFS doesn’t cut corners: it waits for the I/O it issued to complete. Your app doesn’t care that 99.9% of I/Os are fine;
it remembers the 0.1% that took 2 seconds because one disk decided to do internal housekeeping at the worst possible time.
So the goal of zpool iostat -v isn’t “measure the pool.” It’s “find the vdev or disk with a different story than the rest.”
You’re not hunting for low averages. You’re hunting for outliers, spikes, queue growth, and asymmetry.
Quick facts and history that actually help you debug
- ZFS was built around end-to-end checksums. That’s great for integrity, but it also means ZFS will loudly surface bad devices by retrying, healing, and logging errors instead of silently returning junk.
- RAIDZ is not “hardware RAID but in software.” RAIDZ parity math and allocation behavior make small random writes more complex than mirrors, which matters when a single disk slows down.
- “Slower disk ruins the vdev” is older than ZFS. Classic RAID arrays have always been gated by their slowest member; ZFS just gives you better instrumentation to prove it.
- 4K sector reality changed everything. ZFS’s ashift exists because disks lied (or half-lied) about sector sizes for years; misalignment can amplify I/O and latency.
- Scrub is a feature, not a punishment. ZFS scrubs read everything intentionally; the point is to find latent errors before a resilver forces you to learn the hard way.
- ZFS can choose different mirror sides for reads. That can hide a slow disk on reads while writes still suffer, leading to confusing “reads are fine, writes are awful” incidents.
- SLOG is not a write cache. A separate log device accelerates synchronous writes only; it won’t fix async write latency or slow pool members.
- OpenZFS iostat grew useful over time. Older implementations were thinner; modern OpenZFS exposes more per-vdev behavior, and on some platforms you can get latency histograms via other tooling.
- SSDs can be “healthy” and still stall. Firmware GC and thermal throttling can create latency spikes without obvious SMART failures—until you correlate iostat + temperatures.
What zpool iostat -v really shows (and what it hides)
The command you’ll actually use
The workhorse is:
zpool iostat -v with an interval and optionally a count.
Without an interval, you get lifetime averages since boot/import—useful for capacity planning, terrible for incidents.
With an interval, you get per-interval deltas. That’s where the truth lives.
Common variants:
- zpool iostat -v 1 for real-time watching
- zpool iostat -v 5 12 for a 1-minute snapshot
- zpool iostat -v -y 1 to suppress the first “since boot” line, which otherwise distracts people in war rooms
- zpool iostat -v -p for exact bytes (no humanization), which matters when you’re eyeballing small deltas
What the columns mean (and what you should infer)
Depending on your platform/OpenZFS version, you’ll see columns like capacity, operations, bandwidth.
Typical output shows read/write ops and read/write bandwidth at pool, vdev, and leaf-device levels.
That’s enough to catch many latency killers because latency usually manifests as reduced ops plus uneven distribution plus one device doing less work (or weirdly more).
What you often won’t see directly is latency in milliseconds. Some platforms expose it via extended iostat modes or other tools,
but even without explicit latency columns you can still diagnose: when the workload is steady, a slow device shows up as a drop in its ops/bandwidth compared to its peers,
plus increased load elsewhere, plus user-visible stalls.
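If your OpenZFS is recent enough (roughly 0.7 and later), you likely also have explicit latency and queue views; availability and exact column names vary by platform and version, so treat this as a sketch rather than a guarantee:
cr0x@server:~$ zpool iostat -v -l -y 1    # adds average wait columns (total_wait, disk_wait, queue waits) per vdev and leaf
cr0x@server:~$ zpool iostat -w tank 1     # latency histograms, if your version supports -w
If those flags aren’t recognized on your build, fall back to the ops/bandwidth asymmetry approach above plus iostat -x.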
The discipline: don’t stare at pool totals. Pool totals can look “fine” while one disk quietly causes tail latency by intermittently stalling.
Always expand to vdev and disk.
Joke #1: If storage were a team sport, the slow disk is the one who insists on “just one more quick thing” before every pass.
Fast diagnosis playbook (first/second/third)
First: confirm it’s storage latency, not CPU, network, or memory pressure
If the system is swapping, or your NIC is dropping packets, or the CPU is pegged by compression, storage will get blamed anyway.
Do a 60-second sanity pass (a quick sketch follows this list):
- Load average vs CPU usage
- Swap activity
- Network errors
- ZFS ARC size and evictions
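A minimal sketch of that pass on a Linux host; the OpenZFS arcstat helper and the interface name eth0 are assumptions, substitute your own:
cr0x@server:~$ uptime                      # load average vs core count
cr0x@server:~$ vmstat 1 3                  # si/so columns: any swap-in/swap-out?
cr0x@server:~$ ip -s link show eth0        # RX/TX errors and drops
cr0x@server:~$ arcstat 1 3                 # ARC size, hit ratio, eviction pressure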
But don’t overthink it: if your app latency correlates with a pool spike, keep going.
Second: run zpool iostat -v with an interval and watch for asymmetry
Run zpool iostat -v -y 1 during the incident. You’re looking for:
- One leaf device with far fewer ops than its siblings (or periodic zeroing)
- One device with weird bandwidth compared to the rest (read amplification, retries, rebuild traffic)
- A single vdev dragging down pool ops (common with RAIDZ under random I/O)
Third: corroborate with health and error telemetry
Once you have a suspect disk, validate it:
- zpool status -v for errors and resilver/scrub activity
- smartctl for media errors, CRC errors, temperature, and timeout patterns
- iostat / nvme tooling for device-level utilization and latency (platform dependent)
The decision point is usually one of these:
- Offline/replace a failing disk
- Fix a path/cable/controller issue (CRC errors, link resets)
- Stop an “optimization” that’s generating destructive I/O (bad recordsize, misused sync, pathological scrub timing)
- Rebalance or redesign vdev layout if you’ve outgrown it (RAIDZ width, mirrors, special vdev)
Practical tasks: commands, output meaning, decisions
These are the moves you can do on a live system. Each includes what you’re looking at and what decision it drives.
Use an interval. Use -y. And stop copy/pasting lifetime averages into incident channels like they mean anything.
Task 1: Get a clean, real-time per-disk view
cr0x@server:~$ zpool iostat -v -y 1
capacity operations bandwidth
pool alloc free read write read write
tank 7.12T 3.80T 980 420 120M 38.0M
raidz2-0 7.12T 3.80T 980 420 120M 38.0M
sda - - 170 70 21.0M 6.2M
sdb - - 165 71 20.8M 6.4M
sdc - - 168 69 20.9M 6.1M
sdd - - 20 180 2.0M 19.3M
sde - - 170 71 21.1M 6.3M
sdf - - 167 70 20.7M 6.2M
Meaning: One disk (sdd) is doing far fewer reads and far more writes than peers; the pattern is asymmetric.
That could be real workload skew, but in RAIDZ it’s often a hint of retries, reconstruction reads, or a device behaving oddly.
Decision: Mark sdd as a suspect and corroborate with zpool status and SMART. Don’t replace anything yet, but stop arguing about “pool totals.”
Task 2: Narrow to a single pool and reduce noise
cr0x@server:~$ zpool iostat -v -y tank 2 10
capacity operations bandwidth
pool alloc free read write read write
tank 7.12T 3.80T 950 410 118M 37.6M
raidz2-0 7.12T 3.80T 950 410 118M 37.6M
sda - - 165 68 20.6M 6.0M
sdb - - 163 67 20.7M 6.1M
sdc - - 164 66 20.5M 5.9M
sdd - - 10 175 1.1M 18.7M
sde - - 166 67 20.8M 6.0M
sdf - - 162 69 20.5M 6.1M
Meaning: The imbalance persists across intervals. It’s not a one-off burst.
Decision: Escalate from “hunch” to “active investigation.” Someone should start a SMART dump and error check now, not after the incident ends.
Task 3: Check pool health and see if ZFS is already telling you the answer
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced.
scan: scrub repaired 0B in 02:14:33 with 0 errors on Wed Dec 25 03:12:41 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 3 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
errors: No known data errors
Meaning: sdd has read errors. ZFS corrected them (so you’re not corrupt), but it paid latency to do it.
“Applications are unaffected” means data integrity, not user experience.
Decision: Treat sdd as failing until proven otherwise. Prepare replacement, and check cabling/controller too.
Task 4: Pull SMART health and look for the usual liars
cr0x@server:~$ sudo smartctl -a /dev/sdd
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 100 100 010 Pre-fail Always - 12
187 Reported_Uncorrect 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 100 100 000 Old_age Always - 1
199 UDMA_CRC_Error_Count 200 199 000 Old_age Always - 48
194 Temperature_Celsius 031 048 000 Old_age Always - 57
Meaning: “PASSED” is not a performance guarantee. Reallocated sectors and pending sectors hint at media degradation.
CRC errors point to cabling/backplane/controller issues. 57°C is “I’m not dead, I’m just slow and angry.”
Decision: If CRC is climbing, reseat/replace cable/backplane lane. If realloc/pending exists, plan disk replacement.
If temperature is high, fix airflow; heat causes latency spikes before it causes outright failure.
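A quick way to tell whether a path fix actually worked is to snapshot the relevant counters and re-check later; a minimal sketch (attribute names vary slightly by vendor):
cr0x@server:~$ sudo smartctl -A /dev/sdd | egrep -i "crc|pending|realloc"
# Re-run after reseating the cable or moving the port: the CRC raw value should stop incrementing.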
Task 5: Confirm the suspect disk mapping (avoid replacing the wrong drive)
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep sdd | head
lrwxrwxrwx 1 root root 9 Dec 25 09:10 ata-ST12000NM0007_ZL0ABC12 -> ../../sdd
lrwxrwxrwx 1 root root 10 Dec 25 09:10 wwn-0x5000c500a1b2c3d4 -> ../../sdd
Meaning: You have stable identifiers (WWN/by-id) that survive reboots and device renumbering.
Decision: Use by-id/WWN in replacement procedures and in zpool replace where possible.
“We pulled the wrong disk” is an outage genre.
Task 6: Check whether a scrub/resilver is competing for I/O
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Wed Dec 25 09:02:11 2025
3.11T scanned at 1.25G/s, 1.44T issued at 595M/s, 7.12T total
0B repaired, 20.18% done, 00:18:43 to go
Meaning: A scrub is actively reading the pool. On busy pools, scrub can raise latency by saturating disk queues.
If one disk is weak, scrub makes it obvious by dragging its feet.
Decision: During a customer-facing incident, consider pausing scrub (zpool scrub -p) if policy allows.
But don’t “fix” the incident by permanently never scrubbing; that’s how you convert latent errors into data loss later.
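If policy allows the pause, it’s a one-liner, and on reasonably recent OpenZFS rerunning the scrub command later resumes where it left off instead of starting over:
cr0x@server:~$ sudo zpool scrub -p tank    # pause the in-progress scrub
cr0x@server:~$ sudo zpool scrub tank       # later: resumes the paused scrub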
Task 7: Identify sync-write pressure (and stop blaming SLOG incorrectly)
cr0x@server:~$ zfs get -o name,property,value -H sync,logbias tank
tank sync standard
tank logbias latency
Meaning: Sync writes are honored normally; logbias favors the log device if present.
If your app latency is from fsync-heavy writes, SLOG quality matters. If it’s not sync-heavy, SLOG won’t save you.
Decision: If the incident is write-latency and you have no SLOG (or a slow one), consider adding a proper power-loss-safe SLOG.
If you already have one, don’t assume it’s working—measure.
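One way to measure it: generate a small burst of synchronous writes and watch whether the log vdev (if any) actually absorbs them. A minimal sketch; the path /tank/scratch is an assumption, point it at something disposable:
cr0x@server:~$ zpool iostat -v -y tank 1                                                   # terminal 1: watch the logs section (Ctrl-C when done)
cr0x@server:~$ dd if=/dev/urandom of=/tank/scratch/synctest bs=4k count=2000 oflag=dsync   # terminal 2: forced sync writes
cr0x@server:~$ rm /tank/scratch/synctest
If the log device barely registers while latency climbs, either the workload isn’t actually sync-bound or sync/logbias settings are routing writes elsewhere.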
Task 8: Observe per-vdev behavior in a mirror (catch the “one side is sick” case)
cr0x@server:~$ zpool iostat -v -y ssdpool 1
capacity operations bandwidth
pool alloc free read write read write
ssdpool 812G 989G 4200 1800 520M 210M
mirror-0 812G 989G 4200 1800 520M 210M
nvme0n1 - - 4100 900 505M 110M
nvme1n1 - - 100 900 15M 110M
Meaning: Reads are being served mostly from nvme0n1; writes are mirrored so both take them.
This pattern can happen because ZFS prefers the faster side for reads. If nvme1n1 is stalling, you may not notice until write latency or resilver.
Decision: Investigate the “quiet” side anyway. Run SMART/NVMe logs and check for thermal throttling or media errors.
Task 9: Check NVMe health for throttling and resets
cr0x@server:~$ sudo nvme smart-log /dev/nvme1
temperature : 79 C
available_spare : 100%
percentage_used : 12%
media_errors : 0
num_err_log_entries : 27
warning_temp_time : 148
critical_comp_time : 0
Meaning: 79°C plus significant warning-temp time: the device is likely throttling.
Error log entries suggest resets/timeouts even if media_errors is zero.
Decision: Fix cooling and check firmware/controller behavior. Throttling is a latency killer that looks like “random ZFS pauses.”
Task 10: Look for controller or link issues in kernel logs
cr0x@server:~$ sudo dmesg | egrep -i "sdd|ata|reset|timeout|I/O error" | tail -n 8
[ 8123.441122] ata7: hard resetting link
[ 8123.771090] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 8124.112904] blk_update_request: I/O error, dev sdd, sector 223456789 op 0x0:(READ)
[ 8124.112970] sd 7:0:0:0: [sdd] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8124.113001] sd 7:0:0:0: [sdd] Sense Key : Medium Error [current]
Meaning: Link resets and I/O errors. That’s not “ZFS is slow.” That’s the device or the path.
Decision: Replace/repair the path (cable/backplane/HBA port) and likely the disk. If errors persist after path fix, the disk is guilty.
Task 11: Confirm ashift and stop accidental write amplification
cr0x@server:~$ zdb -C tank | egrep "ashift|vdev_tree" -n | head -n 12
56: vdev_tree:
78: ashift: 12
Meaning: ashift: 12 means 4K sectors. Good for modern disks.
If you see 9 (512B) on 4K media, you can create brutal read-modify-write behavior and latency under random writes.
Decision: If ashift is wrong, the real fix is rebuilding the pool correctly. Don’t pretend a tunable will save you.
Task 12: Determine if special vdev is the hidden bottleneck
cr0x@server:~$ zpool status tank | sed -n '1,120p'
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 3 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
nvme2n1 ONLINE 0 0 0
nvme3n1 ONLINE 0 0 0
Meaning: There’s a special vdev mirror. If it’s undersized or throttling, metadata-heavy workloads stall even if the main RAIDZ looks fine.
If the special vdev is lost outright, the pool is lost with it; that’s why it must be redundant.
Decision: Monitor special vdev like it’s production-critical (because it is). If it’s hot or erroring, fix it first.
Task 13: Correlate ZFS-level behavior with block-device utilization
cr0x@server:~$ iostat -x 1 5
avg-cpu: %user %nice %system %iowait %steal %idle
11.20 0.00 6.10 9.40 0.00 73.30
Device r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 168.0 70.0 21000.0 6200.0 224.2 1.10 5.20 1.40 33.2
sdb 166.0 69.0 20800.0 6100.0 223.9 1.08 5.10 1.35 32.9
sdc 167.0 68.0 20900.0 6000.0 224.1 1.12 5.30 1.38 33.5
sdd 10.0 175.0 1100.0 18700.0 225.0 18.40 115.00 5.90 99.8
sde 167.0 69.0 21100.0 6100.0 224.0 1.09 5.20 1.37 33.1
sdf 165.0 70.0 20700.0 6200.0 224.3 1.11 5.10 1.36 33.0
Meaning: sdd has massive queue (avgqu-sz), high await, and 99.8% util.
Peers are cruising at ~33% util with ~5ms awaits. That’s your latency villain.
Decision: You have enough evidence to take action. If redundancy permits, offline and replace sdd, or fix its path.
Don’t wait for it to “fail harder.”
Task 14: Offlining a disk safely (when you know what you’re doing)
cr0x@server:~$ sudo zpool offline tank sdd
Meaning: ZFS stops using that device. A RAIDZ2 vdev can survive up to two missing devices; a two-way mirror can survive one.
Decision: Do this only if redundancy allows it and you’re confident in device identity. Offlining can immediately improve latency by removing the stalling device from the I/O path.
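If offlining was only a mitigation test (for example, to prove latency recovers without that disk), you can return the device to service afterwards; ZFS resilvers whatever it missed:
cr0x@server:~$ sudo zpool online tank sdd    # brings the device back; a short resilver catches up on missed writes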
Task 15: Replace by-id and watch resilver I/O distribution
cr0x@server:~$ sudo zpool replace tank /dev/disk/by-id/wwn-0x5000c500a1b2c3d4 /dev/disk/by-id/wwn-0x5000c500d4c3b2a1
Meaning: You are replacing the exact WWN device with a new one. This avoids device-name roulette.
Decision: After replacement, use zpool iostat -v during resilver to ensure the new disk behaves like its peers and the pool remains responsive.
How to interpret the patterns: latency signatures of common failures
Signature 1: One disk shows lower ops and occasional “flatline” intervals
You’ll see a disk with read/write ops dropping to near-zero for an interval, then “catching up.”
This is classic for firmware stalls, internal GC (SSD), or SATA link resets.
The pool total might not crash because other devices still do work, but your p99 will look terrible.
What to do: Check dmesg for resets/timeouts and SMART/NVMe error logs. Verify temperatures. If it’s recurring, replace.
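For intermittent stalls, it helps to watch kernel messages live while sampling iostat, so a flatline interval can be lined up with a reset. A sketch, assuming a reasonably recent util-linux dmesg:
cr0x@server:~$ sudo dmesg -wT | egrep -i "reset|timeout|I/O error"    # -w follows new messages, -T adds human-readable timestamps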
Signature 2: One disk is at 100% util with huge queues; peers are idle-ish
That’s a smoking gun. ZFS is issuing what it can, but the device can’t keep up.
In RAIDZ, a single device at 100% can throttle reconstruction reads and parity operations. In mirrors, the slow side can still hurt writes.
What to do: Confirm with iostat -x. If it’s a path issue (CRC errors), fix cabling. If it’s media, replace disk.
Signature 3: Writes are slow everywhere, but reads are fine
Often sync writes. Or a SLOG that isn’t actually fast. Or a dataset setting that forces sync (or your app calls fsync constantly).
Mirrors hide read problems better than write problems; RAIDZ tends to punish random writes.
What to do: Confirm dataset settings (sync, logbias), check for a SLOG device, and validate it’s low-latency and power-loss-safe.
Also check if you’re hitting fragmentation + small records on RAIDZ.
Signature 4: During scrub/resilver, everything becomes a potato
Scrub/resilver is a full-contact sport. ZFS will compete with your workload for disk time.
If you have one marginal disk, scrub makes it the center of attention.
What to do: Schedule scrubs for low-traffic windows. Consider throttling via system-level I/O scheduling tools (platform-specific) rather than disabling scrubs.
If scrub reveals errors on a specific disk, don’t argue; replace it.
Signature 5: Special vdev or metadata device stalls cause bizarre “everything is slow but data disks are chill”
Metadata is the choke point for many workloads. If special vdev is overloaded or throttled, file operations and small IO can crawl.
You’ll see the special vdev devices hotter/busier than the main vdev.
What to do: Monitor it like a tier-0 component. Use zpool iostat -v and device telemetry to confirm it’s not thermal throttling.
One idea worth keeping on a sticky note (paraphrasing John Allspaw’s reliability message): you don’t “prevent” failure; you build systems that detect and recover from it quickly.
Joke #2: SMART “PASSED” is like a corporate status report—technically true, emotionally useless.
Three corporate mini-stories from the latency trenches
Mini-story #1: The incident caused by a wrong assumption
A mid-size SaaS shop ran a ZFS-backed NFS tier for CI artifacts and container images. For months it was fine.
Then “random” slowness hit: builds stalled, pulls timed out, and the on-call was told, repeatedly, that “storage is slow again.”
The team assumed it was network saturation because the pool throughput graphs didn’t look terrible.
They spent a morning tuning NFS threads and arguing about MTU. They rebooted a switch.
Nothing changed. The latency spikes were still there, mostly during peak commit hours.
Someone finally ran zpool iostat -v -y 1 during an incident window and noticed one disk in a RAIDZ2 vdev
showing drastically fewer reads than its siblings, with periodic near-zero intervals.
The “wrong assumption” was subtle: they believed “if throughput is okay, storage isn’t the problem.”
But their workload was full of small random reads (metadata-heavy directory traversals and lots of tiny files), where tail latency matters more than aggregate MB/s.
One disk was intermittently resetting the SATA link. ZFS kept the pool online and healed, but the retries translated into user-visible stalls.
They replaced the disk, and the graphs didn’t change much. That’s the point: throughput was never the real symptom.
The customer tickets stopped within the hour because p99 latency stopped taking scenic routes through error recovery.
The lesson they wrote down: you don’t troubleshoot ZFS incidents with pool totals; you troubleshoot them with per-device deltas and evidence.
Mini-story #2: The optimization that backfired
A finance-adjacent company had a ZFS pool serving VMs over iSCSI. Latency was decent but not stellar.
An engineer proposed an “easy win”: change dataset properties to chase performance—smaller recordsize “for databases,” more aggressive compression,
and a blanket change to treat sync writes differently because “the UPS will cover us.”
The immediate benchmark looked better. Then production happened.
The databases started doing more I/O operations per transaction due to recordsize mismatch with workload patterns,
compression increased CPU during spikes, and the sync changes created pathological behavior with the application’s fsync patterns.
To make it worse, the team added a consumer NVMe as SLOG because it was “fast,” ignoring power-loss characteristics and sustained latency.
zpool iostat -v told a very specific story: the SLOG device was pegged with high write ops during peak hours,
and one member of a mirror started to show uneven distribution as it overheated.
Latency got worse exactly when the team expected it to get better.
The rollback fixed the incident. The postmortem wasn’t about shame; it was about discipline.
Optimizations that change write semantics aren’t “tweaks.” They’re architecture changes.
If you can’t explain the new failure modes—and measure them live—you’re just moving risk around until it lands on a customer.
Mini-story #3: The boring but correct practice that saved the day
A healthcare vendor ran ZFS for a document store. Nobody loved that system; it was “legacy,” which meant it was critical and underfunded.
But the storage engineer had one habit: weekly scrub windows, and a simple runbook that included capturing zpool iostat -v during scrub.
Boring. Repetitive. Effective.
One week, scrub time increased noticeably. Not enough to page anyone, but enough to show up in the notes.
During the scrub, per-disk stats showed one drive doing fewer reads and occasionally stalling.
SMART still said “PASSED,” because of course it did, but there were a few growing CRC errors and an elevated temperature.
The engineer filed a ticket to replace the cable and, if needed, the disk at the next maintenance window.
Cable replacement reduced CRC errors but didn’t eliminate the stalls. They replaced the disk proactively, resilvered cleanly, and moved on.
Two months later, a different team had a near-identical failure on a similar chassis and suffered a messy incident because they had no trend history and no habit of looking at per-disk behavior.
The boring practice didn’t just prevent downtime; it prevented ambiguity.
When you can say “this disk is degrading over weeks,” you get to do maintenance instead of heroics.
Common mistakes: symptom → root cause → fix
1) Symptom: Pool throughput looks fine, but p99 latency is terrible
Root cause: One disk stalls intermittently; ZFS retries/heals; averages hide tail latency.
Or special vdev stalls metadata operations.
Fix: Use zpool iostat -v -y 1 during the event; find asymmetry.
Correlate with iostat -x, dmesg, and SMART/NVMe logs. Replace failing disk/path; address overheating.
2) Symptom: Writes are slow on a mirror even though reads are fast
Root cause: ZFS can choose the faster side for reads, hiding a sick disk; writes must hit both.
Or sync writes saturate a weak SLOG device.
Fix: Look for per-leaf imbalance in mirror with zpool iostat -v.
Validate SLOG behavior. Fix/replace the slow mirror member or SLOG device.
3) Symptom: Scrubs make the pool unusable
Root cause: Scrub competes for I/O; pool is near capacity or fragmented; one disk is marginal and becomes the bottleneck.
Fix: Schedule scrubs in low-traffic windows; consider pausing during incidents.
Investigate the slow disk with zpool iostat -v and SMART; replace it if it’s dragging. Keep free space healthy.
4) Symptom: RAIDZ pool has poor random write latency compared to expectations
Root cause: RAIDZ parity overhead plus small blocks; wrong recordsize; misaligned ashift; heavy sync workload without appropriate design.
Fix: Confirm ashift. Align recordsize to workload. For heavy random I/O, prefer mirrors or adjust vdev width and workload patterns.
Don’t “tune” parity away.
5) Symptom: Disk shows many CRC errors but no reallocations
Root cause: Cabling/backplane/HBA issues causing link-level corruption and retries.
Fix: Replace/reseat cables, swap ports, check backplane. Watch if CRC continues to increment after fix.
6) Symptom: Replacing a disk didn’t help; latency still spikes
Root cause: The path/controller is the real issue; or the workload is pounding sync writes; or special vdev is throttling.
Fix: Use dmesg and controller telemetry. Validate sync/logbias and SLOG.
Inspect special vdev iostat and NVMe thermals.
7) Symptom: “zpool iostat shows nothing wrong” but the app is still slow
Root cause: You’re looking at lifetime averages (no interval), or the incident is intermittent and you missed it.
Or the bottleneck is above ZFS (CPU, memory pressure, network) or below (multipath, SAN, hypervisor).
Fix: Always use interval mode. Capture during the event. Add lightweight continuous sampling in your monitoring so you can replay the moment.
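The “lightweight continuous sampling” can be as dumb as a timestamped loop appending to a log you rotate; a minimal sketch run as root (the pool name and log path are assumptions):
cr0x@server:~$ while true; do date -Is; zpool iostat -v -y tank 5 1; done >> /var/log/zpool-iostat-tank.log
Rotate it like any other log; even a day of history turns “storage felt slow yesterday” into per-disk evidence.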
Checklists / step-by-step plan
Step-by-step: catch the slow disk in under 10 minutes
- Run per-interval stats.
  cr0x@server:~$ zpool iostat -v -y 1
  Decision: Identify any leaf device with ops/bandwidth that doesn’t match its siblings.
- Confirm pool health and background activity.
  cr0x@server:~$ zpool status -v
  Decision: If scrub/resilver is active, decide whether to pause for incident mitigation.
- Correlate with device-level queue and await.
  cr0x@server:~$ iostat -x 1 3
  Decision: High await and queue on one disk = action item. Don’t wait.
- Check logs for resets/timeouts.
  cr0x@server:~$ sudo dmesg | egrep -i "reset|timeout|I/O error|blk_update_request" | tail -n 30
  Decision: If link resets exist, suspect cabling/HBA/backplane even if SMART is quiet.
- Pull SMART/NVMe logs for the suspect device.
  cr0x@server:~$ sudo smartctl -a /dev/sdX
  Decision: Pending/reallocated/CRC/temperature issues drive “replace the disk” vs “fix the path.”
- If redundancy allows, offline the offender to restore latency.
  cr0x@server:~$ sudo zpool offline tank sdX
  Decision: Use as an emergency mitigation, not a permanent lifestyle.
- Replace with stable identifiers and monitor resilver.
  cr0x@server:~$ sudo zpool replace tank /dev/disk/by-id/OLD /dev/disk/by-id/NEW
  Decision: If the new disk’s iostat looks different from peers, stop and validate hardware and firmware.
Checklist: what to capture for a useful postmortem
- zpool iostat -v -y 1 samples during the incident window (even 60 seconds helps)
- zpool status -v including error counts and scan state
- SMART/NVMe logs for any suspect disk (including temperature and CRC)
- iostat -x for queue/await/util confirmation
- Kernel logs around the time of the spike (resets, timeouts)
- What changed recently (firmware, cables moved, workload changes, dataset settings)
Checklist: decisions you should make explicitly (not by vibes)
- Is the problem isolated to one disk, one vdev, or systemic?
- Do we have redundancy to offline now?
- Is this a path problem (CRC/resets) or media problem (pending/realloc/uncorrectables)?
- Is a scrub/resilver amplifying the issue, and can it be paused safely?
- Do we need to change architecture (mirrors vs RAIDZ, special vdev sizing) rather than tuning?
FAQ
1) Why does one slow disk hurt the whole pool?
Because ZFS issues I/O to specific vdevs. If a block lives on the slow vdev (or needs reconstruction involving it), the request waits.
Tail latency is dominated by the worst participant, not the average participant.
2) Is zpool iostat showing latency?
Usually it shows ops and bandwidth, not milliseconds (newer OpenZFS releases add -l and -w for latency columns and histograms). But latency also shows up indirectly as reduced ops, uneven distribution, and queue growth, confirmed by iostat -x or platform-specific latency tools.
3) Why should I always use an interval (like zpool iostat -v 1)?
Without an interval, you’re looking at averages since import/boot. Incidents live in spikes and deltas.
Interval mode gives you the per-second (or per-N-seconds) reality.
4) I see one mirror side doing almost all reads. Is that bad?
Not automatically. ZFS can prefer the faster side. It’s bad when the “ignored” side is ignored because it’s unhealthy
(thermal throttling, timeouts, errors). Verify with device telemetry.
5) SMART says PASSED. Can the disk still be the problem?
Yes. SMART’s overall health is coarse. Look at specific attributes (pending sectors, reallocations, CRC errors, temperature)
and error logs. Also check kernel logs for resets/timeouts.
6) Should I pause scrub during an incident?
If scrub is competing with critical production I/O and you need to restore service, pausing can be a reasonable short-term move.
But reschedule it and investigate why scrub hurts so much—often it reveals a weak disk or an oversubscribed design.
7) Does adding a SLOG fix latency?
It can fix synchronous write latency when the workload is truly sync-heavy and the SLOG device is low-latency and power-loss-safe.
It does nothing for async writes or reads, and a bad SLOG can make things worse.
8) How do I know if it’s a cable/backplane issue instead of a disk?
Rising UDMA_CRC_Error_Count on SATA devices is a strong hint, along with link resets in dmesg.
Media errors (pending/realloc/uncorrectable) point more toward the disk itself. Sometimes you get both; fix the path first, then reassess.
9) Why is RAIDZ often worse for random write latency than mirrors?
Parity requires additional reads/writes and coordination across disks; small writes can become read-modify-write cycles.
Mirrors do simpler writes. If you need IOPS and low tail latency, mirrors are usually the straightforward choice.
10) What’s the quickest “proof” I can show to non-storage folks?
A short capture of zpool iostat -v -y 1 plus iostat -x 1 where one device has dramatically higher await and queue
while peers are normal. It’s visual, repeatable, and doesn’t require belief in storage folklore.
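To produce that proof during an incident without babysitting a terminal, capture both views to files for about a minute and attach them; the paths and pool name are placeholders:
cr0x@server:~$ zpool iostat -v -y tank 1 60 > /tmp/zpool-iostat.txt &
cr0x@server:~$ iostat -x 1 60 > /tmp/iostat-x.txt &
cr0x@server:~$ wait    # both captures finish after ~60 seconds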
Conclusion: what to do next, today
When ZFS latency goes sideways, don’t start by “tuning ZFS.” Start by proving whether one disk or one vdev is misbehaving.
zpool iostat -v with an interval is your flashlight; it finds asymmetry fast.
Then corroborate with zpool status, SMART/NVMe logs, and kernel messages.
Practical next steps:
- Put zpool iostat -v -y 1 into your incident runbook, not your personal memory.
- During the next scrub window, capture per-disk behavior and baseline it. Boring baselines make exciting incidents shorter.
- If you find a suspect device, decide quickly: path fix, offline, or replace. Tail latency rarely heals itself.
- If your architecture is mismatched (sync-heavy DB on wide RAIDZ, special vdev undersized), admit it and plan the redesign instead of worshipping tunables.
The one disk ruining latency is rarely subtle. The subtle part is us: we keep staring at pool totals and hoping averages will tell the truth.
They won’t. Per-device deltas will.