ZFS rarely “goes down” in a dramatic blaze. It slows. Quietly. The graph wiggles. Latency stretches.
Apps start timing out. Someone files a ticket about “intermittent slowness.” And then—because the universe hates humility—your
next deployment lands on top of the problem and gets blamed for everything.
The trick is to catch ZFS performance decay while it’s still a maintenance window, not a postmortem. This is a field guide for
reading the logs and the adjacent truth (kernel messages, device errors, ZFS event streams) to spot slowdowns early and decide
what to do next—fast, accurately, and without cargo cult rituals.
The mindset: logs are a timeline, not a vibe
ZFS “logs” are plural. There’s the ZFS event stream, the kernel ring buffer, systemd journal, SMART/device logs,
and ZFS’s own idea of pool health. The goal is not to collect more text. The goal is to align timelines:
what got slower, when, and what else changed.
Here’s the operating posture that keeps you out of trouble:
- Prefer latency over throughput. Users feel 99th percentile latency. Dashboards that only show MB/s will lie to you.
- Assume ZFS is honest about data integrity and conservative about performance. When it slows, it’s usually protecting you from something worse.
- Be suspicious of “it started after X” narratives. ZFS problems often incubate for weeks: one weak drive, one mis-sized recordsize, one sync write path you forgot existed.
- Correlate at the device layer. Most “ZFS performance issues” are either device latency, queueing, or a sync-write path doing exactly what you told it to do.
A log line is a clue, not a verdict. You still have to reconcile it with reality: zpool iostat, arcstat, iostat,
and what your applications are actually doing.
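In practice, that first reconciliation pass can be as small as three commands, run back to back so the timestamps line up. A minimal sketch, assuming a pool named tank and a one-hour incident window:

sudo zpool events -v | tail -n 50
sudo journalctl -k --since "1 hour ago" --no-pager | egrep -i "ata|nvme|scsi|reset|timeout" | tail -n 50
sudo zpool iostat -v -l 1 10

If the kernel resets, the ZFS events, and the latency spikes share a timeline, you already have a suspect.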
Interesting facts and historical context (so you stop guessing)
- ZFS was born in the Solaris era with an end-to-end data integrity model—checksums everywhere—because “silent corruption” was already a thing, just not a popular one.
- The intent log (ZIL) is not a write cache. It’s a mechanism to replay synchronous semantics after a crash. Most writes never live on “the log” long-term.
- SLOG is a device, not a feature. Adding a separate log device (SLOG) only helps synchronous writes and can hurt you if it’s slow or misconfigured.
- Scrubs were designed as proactive auditing, not a “repair when broken” tool. They’re how ZFS proves your data is still your data.
- Resilver behavior evolved. Modern OpenZFS resilvers can be sequential and smarter about what to copy, but you still pay in I/O contention.
- ARC/L2ARC tuning has a long history of bad advice. Many “performance guides” from a decade ago optimized for different workloads and smaller RAM-to-disk ratios.
- ashift is forever. A wrong sector size assumption at pool creation time can lock you into write amplification—quietly expensive, loudly painful.
- Compression became mainstream in ZFS ops because CPU got cheap and I/O did not. But the win depends on your data shape, not your hopes.
What “slow ZFS” actually means: the bottleneck map
“ZFS is slow” is like saying “the city is crowded.” Which street? Which hour? Which lane closure?
In practice, ZFS slowdowns cluster into a few categories. Your logs will usually point to one:
1) Device latency and error recovery
One marginal disk can stall a vdev. In RAIDZ and mirrors, the slowest child often becomes the pace car.
Linux kernel logs may show link resets, command timeouts, or “frozen queue” events. ZFS may show read/write/checksum errors.
Even if the drive “recovers,” the retry costs are paid in wall clock time by your application.
2) Sync write path: ZIL/SLOG pain
If your workload does synchronous writes (databases, NFS, some VM storage, anything calling fsync a lot),
then ZIL latency matters. With a SLOG, your sync latency is frequently the SLOG’s latency.
Without a SLOG, sync writes hit the pool and inherit pool latency. Logs won’t say “fsync is your problem” in those words,
but the pattern shows up: rising await, bursts aligned with txg sync, and a lot of complaints during commit-heavy periods.
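If you want to watch that path under load, recent OpenZFS builds can break out per-vdev queue activity; a hedged sketch (column names vary by version):

sudo zpool iostat -q -v 1
# compare the sync write queue (pending/active) against the async queues during the complaint window;
# a sync queue that stays deep while async queues drain points at ZIL/SLOG latency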
3) Transaction group (txg) sync time spikes
ZFS batches changes into transaction groups. When a txg is committed (“synced”), the system can see short storms of write I/O.
If sync time grows, everything that depends on those commits gets slower. This can show up as periodic pauses, NFS “not responding,”
or application latency spikes every few seconds.
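On Linux, OpenZFS usually exposes per-pool txg statistics as a kstat; a rough sketch for a pool named tank (the file path and columns depend on your build):

watch -n 1 'tail -n 5 /proc/spl/kstat/zfs/tank/txgs'
# each row is one txg with timing counters; if the sync-time column keeps growing,
# it will line up with the periodic pauses users are reporting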
4) Metadata and fragmentation issues
Fragmentation isn’t a moral failing; it’s physics plus time. Certain workloads (VM images, databases, small random writes)
can turn the pool into an expensive seek festival. ZFS logs won’t print “you are fragmented,” but your iostat patterns will,
and your scrub/resilver times will get worse.
5) Memory pressure: ARC thrash
When ARC hit rate drops, reads go to disk. That’s not automatically bad—sometimes the working set is simply bigger than RAM.
But sudden ARC collapse can happen after a memory-hungry deployment, a container density change, or an ill-considered L2ARC setup.
The signal is usually: more disk reads, higher latency, and a kernel that looks… busy.
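If arcstat isn’t installed, the raw counters behind it usually are; a quick sketch against the Linux arcstats kstat (field names can vary slightly by version):

awk '$1 ~ /^(hits|misses|c|c_max|size)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats
# hits/misses are cumulative since boot, so sample twice and diff for a rate;
# c is the current ARC target, c_max the cap, size what ARC actually holds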
One paraphrased idea often attributed to John Allspaw fits here: reliability comes from learning and adapting, not from pretending we can predict everything.
ZFS is adaptable. Your job is to learn what it’s telling you before it starts yelling.
Fast diagnosis playbook (first/second/third checks)
If you’re on call, you don’t have time for interpretive dance. You need a sequence that narrows the search space.
This playbook assumes Linux + OpenZFS, but the logic travels.
First: is the pool healthy right now?
- Run zpool status -x. If it says anything other than “all pools are healthy,” stop and investigate that first.
- Check zpool events -v for recent device faults, link resets, or checksum errors.
- Look for scrubs/resilvers running. A “healthy” pool can still be slow if it’s rebuilding.
Second: is this a device problem or a workload/sync problem?
- Run zpool iostat -v 1 and watch latency distribution by vdev. One slow disk? One slow mirror? That’s your suspect.
- Run iostat -x 1 and check await, svctm (if present), and %util. High await + high util = device/queue saturation.
- Check if the latency correlates with sync write bursts: look for high writes with relatively low throughput but high await.
Third: confirm the failure mode with logs and counters
- Journal/kernel: journalctl -k for timeouts, resets, NCQ errors, transport errors, aborted commands.
- SMART: smartctl for reallocated sectors, pending sectors, CRC errors (often cable/backplane).
- ZFS stats: ARC behavior (arcstat if available), txg sync messages (depending on your build), and event history.
One sentence rule: if you can name the slowest component, you can usually fix the outage.
If you can’t, you’re still guessing—keep narrowing.
Practical tasks: commands, outputs, and decisions (12+)
These are the tasks I actually run in production when ZFS slows down. Each one includes what the output means and what decision you make next.
Copy/paste is allowed. Panic is not.
Task 1: Quick pool health check
cr0x@server:~$ sudo zpool status -x
all pools are healthy
Meaning: No known faults, no degraded vdevs, no active errors. This does not guarantee performance, but it removes one big class of emergencies.
Decision: Move to latency diagnosis (zpool iostat, iostat) rather than repair operations.
Task 2: Full status with error counters and ongoing work
cr0x@server:~$ sudo zpool status
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
scan: scrub repaired 0B in 02:14:33 with 0 errors on Mon Dec 23 03:12:18 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-SAMSUNG_SSD_860-1 ONLINE 0 0 0
ata-SAMSUNG_SSD_860-2 ONLINE 0 0 3
errors: No known data errors
Meaning: The pool is online, but one device has checksum errors. ZFS corrected them using redundancy, but you now have a reliability and performance smell.
Decision: Investigate that device path (SMART, cabling, backplane, HBA). Do not “zpool clear” as therapy; clear only after you understand why errors happened.
Task 3: Watch per-vdev latency live
cr0x@server:~$ sudo zpool iostat -v 1
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 4.12T 3.15T 210 980 23.1M 61.4M
mirror-0 2.06T 1.57T 105 510 11.6M 30.7M
ata-SAMSUNG_SSD_860-1 - - 60 250 6.7M 15.2M
ata-SAMSUNG_SSD_860-2 - - 45 260 4.9M 15.5M
-------------------------- ----- ----- ----- ----- ----- -----
Meaning: Balanced mirror load looks roughly symmetrical over time. If one member shows far fewer ops but higher latency (not shown in this basic view),
or if a vdev’s ops collapse while pool demand remains, that’s a hint the device is stalling or error-retrying.
Decision: If the imbalance persists, correlate with kernel logs and SMART; consider offlining/replacing the suspect device if errors align.
Task 4: Add latency columns (where supported)
cr0x@server:~$ sudo zpool iostat -v -l 1
capacity operations bandwidth total_wait disk_wait
pool alloc free read write read write read write read write
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
tank 4.12T 3.15T 220 1020 24.0M 63.2M 3ms 28ms 2ms 24ms
mirror-0 2.06T 1.57T 110 520 12.0M 31.6M 2ms 30ms 2ms 27ms
ata-SAMSUNG_SSD_860-1 - - 55 260 6.1M 15.8M 2ms 8ms 2ms 7ms
ata-SAMSUNG_SSD_860-2 - - 55 260 5.9M 15.8M 2ms 90ms 2ms 85ms
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
Meaning: One disk has disk_wait spikes (85–90ms) while the other stays low. That’s your “pace car.”
Decision: Pull kernel + SMART evidence. If it’s a cable/HBA path, fix that. If it’s the SSD itself, schedule replacement before it “recovers” into your next outage.
Task 5: Check for scrub/resilver contention
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: resilver in progress since Thu Dec 26 08:11:02 2025
312G scanned at 1.24G/s, 48.2G issued at 192M/s, 7.11T total
48.2G resilvered, 0.68% done, 10:27:11 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
errors: No known data errors
Meaning: Resilvering is in progress. Your pool is doing extra reads/writes, and latency will usually worsen.
Decision: If this is a user-facing production system, decide whether to throttle resilver/scrub (where supported),
or temporarily shift workload away. Also confirm the original failure is fully addressed—don’t let a second disk wobble during resilver.
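If you choose to throttle, the relevant module parameters differ between OpenZFS versions, so treat the names below as a sketch to verify against your build; scrubs (unlike resilvers) can also simply be paused:

# inspect current values before changing anything (parameter names vary by version)
grep . /sys/module/zfs/parameters/zfs_resilver_min_time_ms /sys/module/zfs/parameters/zfs_vdev_scrub_max_active 2>/dev/null
# pause a running scrub and resume it later; resilvers cannot be paused this way
sudo zpool scrub -p tank
sudo zpool scrub tank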
Task 6: Read the recent ZFS event stream
cr0x@server:~$ sudo zpool events -v | tail -n 30
TIME CLASS
Dec 26 2025 08:10:58.123456789 ereport.fs.zfs.vdev.io
pool = tank
vdev_path = /dev/disk/by-id/ata-SAMSUNG_SSD_860-2
vdev_guid = 1234567890123456789
errno = 5
size = 131072
offset = 9876543210
flags = 0x180
Dec 26 2025 08:10:58.223456789 ereport.fs.zfs.vdev.checksum
pool = tank
vdev_path = /dev/disk/by-id/ata-SAMSUNG_SSD_860-2
vdev_guid = 1234567890123456789
Meaning: ZFS is recording I/O errors and checksum problems against a specific device.
Decision: Treat this as hardware path triage: SMART, cables, HBA, enclosure. If it repeats, replace the device.
If it stops after reseating a cable, still keep watching; intermittent CRC errors love comebacks.
Task 7: Check kernel logs for transport resets and timeouts
cr0x@server:~$ sudo journalctl -k --since "2 hours ago" | egrep -i "ata|nvme|scsi|reset|timeout|error" | tail -n 40
Dec 26 09:01:14 server kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 26 09:01:14 server kernel: ata7.00: failed command: READ FPDMA QUEUED
Dec 26 09:01:14 server kernel: ata7: hard resetting link
Dec 26 09:01:18 server kernel: ata7: link is slow to respond, please be patient (ready=0)
Dec 26 09:01:20 server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 26 09:01:20 server kernel: ata7.00: configured for UDMA/133
Meaning: Link reset events. Even when they “recover,” the retry time creates latency spikes and can stall a vdev.
Decision: Check cabling/backplane, power, and HBA firmware. If this is a single drive bay, swap the drive to a different slot to isolate the enclosure path.
Task 8: SMART triage (SATA/SAS devices)
cr0x@server:~$ sudo smartctl -a /dev/sdb | egrep -i "Reallocated|Pending|Offline_Uncorrectable|CRC_Error_Count|Power_On_Hours"
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 23874
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 8
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 2
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
Meaning: Pending sectors and offline uncorrectables are real media issues. CRC count at zero suggests this is not “just a cable.”
Decision: Plan replacement. If the pool is redundant, replace proactively. If it’s single-disk (don’t), backup first and then replace yesterday.
Task 9: NVMe health and error log
cr0x@server:~$ sudo nvme smart-log /dev/nvme0 | egrep -i "critical_warning|media_errors|num_err_log_entries|percentage_used"
critical_warning : 0x00
media_errors : 12
num_err_log_entries : 398
percentage_used : 87%
Meaning: Media errors and a high percentage used can correlate with rising latency and impending failure.
Decision: If this NVMe is a SLOG or special vdev, treat it as urgent—those roles can degrade performance sharply when the device misbehaves.
Task 10: Identify sync-heavy workloads via dataset properties
cr0x@server:~$ sudo zfs get -o name,property,value sync,logbias,primarycache,recordsize tank/app tank/vm
NAME PROPERTY VALUE
tank/app sync standard
tank/app logbias latency
tank/app primarycache all
tank/app recordsize 128K
tank/vm sync always
tank/vm logbias latency
tank/vm primarycache metadata
tank/vm recordsize 16K
Meaning: sync=always forces synchronous semantics even if the app doesn’t ask for it. That can be correct, or it can be a self-inflicted performance incident.
Decision: Verify why sync=always is set. If it’s for a database that already manages durability, you may be double-paying. If it’s for NFS/VM safety, keep it and invest in a proper SLOG.
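The verification itself is two commands; a hedged sketch using the datasets from this example, and only flip sync after you’ve confirmed the application’s durability story:

sudo zfs get sync,logbias tank/vm
# only if the guest/database already guarantees its own durability:
sudo zfs set sync=standard tank/vm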
Task 11: Confirm SLOG presence and basic layout
cr0x@server:~$ sudo zpool status tank | sed -n '1,80p'
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
logs
nvme-SAMSUNG_MZVLB1T0-1 ONLINE 0 0 0
errors: No known data errors
Meaning: A single-device SLOG exists. That’s common, but it’s also a single point of performance and (depending on your tolerance) risk for sync write latency.
Decision: For critical sync workloads, prefer mirrored SLOG devices. And make sure the SLOG is actually low-latency under power-loss-safe conditions.
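Mirroring an existing log device is an attach, not a rebuild; a sketch using the device name above plus hypothetical new devices:

# turn the single log device into a mirror (new device name is a placeholder)
sudo zpool attach tank nvme-SAMSUNG_MZVLB1T0-1 nvme-NEW_SLOG-1
# or, on a pool with no log yet, add a mirrored one in a single step
sudo zpool add tank log mirror nvme-NEW_SLOG-1 nvme-NEW_SLOG-2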
Task 12: Check whether the system is drowning in I/O queueing
cr0x@server:~$ iostat -x 1 3
Linux 6.5.0 (server) 12/26/2025 _x86_64_ (32 CPU)
Device r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await %util
sda 12.0 340.0 480 14560 83.2 18.4 52.6 3.1 54.4 99.2
sdb 10.0 332.0 420 14400 85.7 1.2 3.7 2.8 3.8 34.5
nvme0n1 0.0 25.0 0 2048 163.8 0.4 15.8 0.0 15.8 40.1
Meaning: sda is pegged at ~99% util with a deep queue and high await, while sdb is fine. In a mirror, that can drag the vdev.
NVMe shows moderate await; if that’s your SLOG, 15ms might be too slow for “fast fsync” expectations.
Decision: Investigate why sda is slow: errors, firmware, thermal throttling, controller issues. If this is a mirror member, consider offlining it briefly to see if latency improves (with risk awareness).
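A cautious sketch of that test, assuming the device names above and that you accept running the mirror without redundancy for a few minutes:

sudo zpool offline tank sda     # the mirror keeps serving from sdb, with no redundancy until sda returns
# watch zpool iostat -v -l 1 and application latency for a few minutes
sudo zpool online tank sda      # restore the member; ZFS resilvers the delta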
Task 13: Check ARC behavior (if arcstat is available)
cr0x@server:~$ arcstat 1 5
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
09:12:01 3120 980 31 220 7 710 23 50 1 96.2G 96.0G
09:12:02 2980 940 31 240 8 650 22 50 1 96.2G 96.0G
09:12:03 3050 970 32 210 7 710 23 50 1 96.2G 96.0G
09:12:04 3105 995 32 230 7 715 23 50 1 96.2G 96.0G
09:12:05 3002 960 32 220 7 690 23 50 1 96.2G 96.0G
Meaning: A ~31–32% miss rate may be fine or terrible depending on your storage and workload. If miss% suddenly jumps compared to baseline,
the disks will see more reads and latency will rise.
Decision: Compare to last week’s baseline. If ARC is capped (c equals arcsz) and you have free RAM, consider raising ARC max.
If ARC is being squeezed by something else, fix memory pressure rather than “tuning ZFS” into a corner.
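On Linux the cap is the zfs_arc_max module parameter; a sketch assuming a recent OpenZFS where it can be changed at runtime (the 64 GiB value is an arbitrary example):

cat /sys/module/zfs/parameters/zfs_arc_max            # 0 means "use the built-in default"
echo 68719476736 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
# persist across reboots in /etc/modprobe.d/zfs.conf:  options zfs zfs_arc_max=68719476736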
Task 14: Check dataset compression and logical vs physical I/O
cr0x@server:~$ sudo zfs get -o name,property,value compression,compressratio tank/app
NAME PROPERTY VALUE
tank/app compression lz4
tank/app compressratio 1.62x
Meaning: Compression is working and likely saving I/O. If compressratio is ~1.00x, you’re paying CPU overhead for no I/O benefit (usually small with lz4, but not zero).
Decision: If CPU is a bottleneck and data is incompressible, consider disabling compression on that dataset. Otherwise, leave lz4 alone; it’s one of the few “defaults” that earns its keep.
Task 15: Find who is hammering the pool right now
cr0x@server:~$ sudo iotop -oPa
Total DISK READ: 45.20 M/s | Total DISK WRITE: 112.30 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
18342 be/4 postgres 2.10 M/s 65.40 M/s 0.00 % 84.21 % postgres: checkpointer
20111 be/4 root 0.00 B/s 28.20 M/s 0.00 % 62.10 % zfs send -w tank/app@snap
9321 be/4 libvirt-qemu 1.10 M/s 12.80 M/s 0.00 % 20.33 % qemu-system-x86_64
Meaning: You have a checkpointer doing heavy writes, a zfs send pushing data, and VMs reading/writing. This is a contention map.
Decision: If latency is user-visible, pause or reschedule the bulk transfer (zfs send) or rate-limit it. Don’t argue with physics.
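If you can’t pause the send, rate-limiting it is usually enough to get user latency back; a sketch assuming pv is installed and a hypothetical receiving host called backup:

sudo zfs send -w tank/app@snap | pv -L 50m | ssh backup zfs receive -u backuppool/app
# -L 50m caps the stream at ~50 MiB/s; receiving host and dataset are placeholders,
# and the rate should be whatever your pool can absorb without user-visible latency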
Joke #1: Storage is the only place where “it’s fine on average” is accepted right up until the moment it isn’t.
Three corporate mini-stories (anonymized, painfully plausible)
Mini-story 1: The incident caused by a wrong assumption
A mid-size SaaS company ran customer Postgres on ZFS-backed VM storage. It had been stable for months, and the team was proud:
mirrored SSDs, compression on, weekly scrubs, basic monitoring. The only thing they didn’t monitor was sync write latency. Because, in their heads,
“SSDs are fast.”
A new compliance requirement arrived: ensure durability semantics for a subset of workloads. An engineer flipped sync=always on the dataset that housed
the VM images. The assumption was simple: “This will be safer and only slightly slower.” It was half right.
The next morning, customers reported sporadic timeouts. The pool looked healthy. CPU was fine. Network was fine. Throughput graphs were fine.
But the 99th percentile write latency exploded. The kernel logs showed nothing dramatic. ZFS logs showed no errors. Everyone started staring at the application layer,
because that’s what you do when the storage won’t confess.
The smoking gun was in zpool iostat -l: the SLOG device (a consumer NVMe without power-loss protection) had high and jittery write latency under sustained sync load.
It wasn’t “broken.” It was just being asked to provide consistent low-latency commits and politely declined.
The fix was boring and expensive: replace the SLOG with a device designed for steady sync write latency and mirror it.
The postmortem had one lesson worth tattooing: don’t change sync semantics without measuring the sync path.
Mini-story 2: The optimization that backfired
An enterprise internal platform team ran a ZFS pool for CI artifacts and container images. It was mostly large files, lots of parallel reads,
and occasional big writes. The system was “fine,” but a well-meaning performance initiative demanded “more throughput.”
Someone found an old tuning note and decided the pool should use a separate “special vdev” for metadata to speed up directory traversal and small reads.
They added a pair of small, fast SSDs as a special vdev. Initial benchmarks looked great. Leadership smiled. Everyone moved on.
Months later, performance got weird. Not just slower—spiky. During peak CI hours, builds stalled for seconds at a time.
zpool status stayed green. But zpool iostat -v -l told an uglier story: the special vdev had become the latency bottleneck.
Those “small fast SSDs” were now heavily written, wearing out, and occasionally throttling.
The backfire wasn’t the feature. It was the sizing and lifecycle thinking. Metadata and small blocks can be an I/O magnet.
When the special vdev hiccups, the whole pool feels drunk. The kernel logs had mild NVMe warnings, not enough to trip alerts,
but enough to explain the stalls when correlated with the latency spikes.
The remediation plan: replace the special vdev with appropriately durable devices, expand capacity to reduce write amplification,
and add monitoring specifically for special vdev latency and wear indicators. The moral: every acceleration structure becomes a dependency.
Mini-story 3: The boring but correct practice that saved the day
A financial services shop ran ZFS for an NFS backend serving home directories and shared build outputs. Nothing sexy. No heroic tuning.
What they did have was discipline: monthly scrubs, alerting on zpool status changes, and a runbook that forced engineers to check
kernel transport errors before touching ZFS knobs.
One Tuesday, latency climbed. Users noticed. The on-call followed the runbook: check pool health, check events, check kernel logs.
Within minutes they found repeating SATA link resets on one drive bay. No ZFS errors yet—just retries.
They swapped the cable/backplane component in a scheduled micro-window, before the drive started throwing checksum errors.
Latency dropped back to baseline. No resilver needed. No data risk. No weekend consumed by regret.
The practice that saved them wasn’t genius. It was consistency: scrubs to detect latent issues, and log correlation to catch hardware path degradation early.
Boring is underrated in storage engineering because it works.
Joke #2: If you want an exciting storage career, ignore your scrub schedule; the pager will create excitement for you.
Common mistakes: symptom → root cause → fix
1) “Pool is ONLINE but latency is awful”
Symptom: zpool status looks clean; applications time out; iostat shows high await.
Root cause: Device retries, link resets, or a single slow disk dragging a mirror/RAIDZ vdev.
Fix: Check journalctl -k for resets/timeouts; check SMART/NVMe error logs. Replace the suspect device or repair the transport path. Don’t tune ZFS to compensate for hardware lying.
2) “Every few seconds we get a pause”
Symptom: Periodic latency spikes; NFS stutters; databases show commit stalls.
Root cause: Txg sync taking too long, often because the pool is saturated, fragmentation is high, or a slow device is stalling flushes.
Fix: Use zpool iostat -l to identify the slow vdev, and reduce competing write load. If sync-heavy, fix SLOG latency or reconsider sync=always.
3) “We added a SLOG and performance got worse”
Symptom: Sync-heavy workload slows after adding log device.
Root cause: SLOG device has worse latency than the pool or suffers from throttling; single SLOG becomes a choke point.
Fix: Verify with iostat -x and zpool iostat -l. Replace with low-latency, power-loss-protected device, ideally mirrored. If workload is mostly async, remove SLOG and stop expecting magic.
4) “Checksum errors keep appearing, but scrubs repair them”
Symptom: CKSUM counts rise; scrubs repair; no user-visible data errors—yet.
Root cause: Often cabling/backplane/HBA issues (CRC errors), sometimes drive media failure.
Fix: Check SMART CRC counters and kernel transport logs. Reseat/replace cable/backplane; update firmware; replace drive if media indicators are bad. Then scrub again and monitor whether counters stay flat.
5) “Resilver will finish in 2 hours… for the next 3 days”
Symptom: Resilver ETA grows; pool is sluggish.
Root cause: Competing workload + fragmented pool + slow device. Resilver competes for I/O and can be deprioritized by the system or starved by your applications.
Fix: Reduce workload, schedule resilver in off-hours where possible, and check for a weak device prolonging the process. Confirm ashift and vdev design aren’t causing pathological write amplification.
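Checking ashift after the fact is quick; a sketch using zdb’s cached pool config (output layout varies by version):

sudo zdb -C tank | grep ashift
# ashift: 9 means 512-byte sectors, 12 means 4K; 9 on modern 4K-sector drives is a standing write-amplification tax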
6) “ARC hit rate fell off a cliff after we deployed something unrelated”
Symptom: Sudden increase in disk reads; latency rises; memory usage changes.
Root cause: Memory pressure from new services, container density, or kernel page cache behavior; ARC limited by configuration or squeezed by other consumers.
Fix: Measure memory, don’t guess. If you have RAM headroom, increase ARC cap. If you don’t, reduce memory pressure or move workload. Don’t add L2ARC as a substitute for not having enough RAM unless you understand the write/read patterns.
7) “We tuned recordsize and now writes are slower”
Symptom: After changing recordsize, throughput drops and latency rises.
Root cause: Recordsize mismatch with workload (e.g., too large for random-write DB blocks, too small for sequential streaming).
Fix: Set recordsize per dataset and per workload type. VM images and databases often prefer smaller blocks (e.g., 16K), while large sequential files benefit from larger (128K–1M depending on use). Validate with real I/O traces, not vibes.
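The change itself is per dataset and only applies to blocks written afterwards, so existing data keeps its old layout until rewritten; a sketch using the datasets from earlier (the artifacts dataset name is a placeholder):

sudo zfs set recordsize=16K tank/vm
sudo zfs set recordsize=1M tank/artifacts     # large sequential files only; dataset name is hypothetical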
Checklists / step-by-step plan
Checklist A: When users report “intermittent slowness”
- Confirm whether it’s storage latency: check app-level p95/p99 and I/O wait on hosts.
- Run zpool status -x. If not healthy, treat as incident.
- Run zpool status and look for scrub/resilver in progress.
- Run zpool iostat -v -l 1 for 60–120 seconds. Identify the slowest vdev/device by latency.
- Run journalctl -k filtered for resets/timeouts. Confirm whether the slow device has matching errors.
- Check SMART/NVMe health for the suspect device.
- Decide: isolate (offline/replace), repair transport (cable/backplane/HBA), or reduce workload contention.
Checklist B: When sync writes are suspected (databases/NFS/VMs)
- Check dataset sync and logbias properties for the relevant datasets.
- Confirm whether you have a SLOG and what it is (single vs mirror).
- Measure SLOG latency using iostat -x on the SLOG device during the slowdown window.
- If SLOG latency is worse than the main pool, don’t debate: replace or remove it depending on sync needs.
- If no SLOG exists and sync latency is painful, consider adding a proper mirrored SLOG—after validating that the workload is actually sync-heavy.
Checklist C: When errors appear but the pool “keeps working”
- Capture zpool status and zpool events -v output for the incident record.
- Check kernel logs around the same timestamps for transport issues.
- Check SMART/NVMe media indicators and error counts.
- Fix the path or replace hardware. Only then clear errors with zpool clear.
- Run a scrub after remediation and confirm error counters stop increasing.
Checklist D: Baseline so you can detect regressions
- Record baseline zpool iostat -v -l during “known good” hours (a capture sketch follows this checklist).
- Record baseline ARC stats (hit rate, ARC size, memory pressure indicators).
- Track scrub duration and resilver duration trends (they’re early warnings for fragmentation and device aging).
- Alert on kernel transport errors, not just ZFS faults.
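A minimal capture sketch you can run from cron (as root) during known-good hours; the pool name, output path, and sampling windows are assumptions to adapt:

#!/bin/bash
# capture a ZFS performance baseline for later comparison (pool "tank" assumed)
set -eu
ts=$(date +%Y%m%d_%H%M)
out=/var/log/zfs-baseline
mkdir -p "$out"
zpool status -v tank               > "$out/status_$ts.txt"
zpool iostat -v -l 1 60            > "$out/zpool_iostat_$ts.txt"
iostat -x 1 60                     > "$out/iostat_$ts.txt"
cat /proc/spl/kstat/zfs/arcstats   > "$out/arcstats_$ts.txt"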
FAQ
1) Are ZFS logs enough to diagnose performance issues?
No. ZFS will tell you about integrity signals (errors, faults, events), but performance diagnosis needs device and kernel context.
Always pair ZFS events with kernel logs and iostat/zpool iostat.
2) If zpool status is clean, can I rule out hardware?
Absolutely not. Many hardware/transport issues present as retries and link resets long before ZFS increments a counter.
Kernel logs and SMART often show the “pre-symptoms.”
3) Does adding a SLOG always improve performance?
Only for synchronous writes. For async workloads, it’s mostly irrelevant. And a slow SLOG can make sync performance worse.
Treat SLOG as a latency-critical component, not a checkbox.
4) What’s the fastest way to spot a single bad disk in a mirror?
Use zpool iostat -v -l 1 and look for one member with dramatically higher disk wait latency.
Then confirm with journalctl -k and SMART/NVMe logs.
5) Do checksum errors always mean the disk is dying?
Often it’s the path: cable, backplane, HBA, firmware. SMART CRC errors and kernel transport resets are your tell.
Media errors (pending/reallocated/uncorrectable) implicate the disk more directly.
6) Why does the pool slow down during scrub if scrubs are “background”?
Scrubs are background in intent, not in physics. They consume real I/O and can raise latency.
If scrubs cause user pain, schedule them better, throttle where possible, and verify your pool has enough performance headroom.
7) Should I set sync=disabled to fix latency?
That’s not fixing; it’s negotiating with reality and hoping it doesn’t notice. You’re trading durability guarantees for speed.
If the data matters, fix the sync path (SLOG/device latency) instead.
8) Is high fragmentation always the reason for slowdowns?
No. Fragmentation is common, but the usual first culprit is device latency or a degraded/rebuilding pool.
Fragmentation tends to show up as a long-term trend: scrubs/resilvers get longer, random I/O gets pricier, and latency becomes easier to trigger.
9) When should I clear ZFS errors with zpool clear?
After you’ve fixed the underlying cause and captured the evidence. Clearing too early erases your breadcrumb trail and invites repeat incidents.
10) What if ZFS is slow but iostat shows low %util?
Then the bottleneck might be elsewhere: CPU (compression/encryption), memory pressure, throttling, or a sync path stall.
Also confirm you’re measuring the right devices (multipath, dm-crypt layers, HBAs).
Conclusion: practical next steps
ZFS performance outages are usually slow-motion hardware failures, sync write surprises, or rebuild/scrub contention that nobody treated as a production event.
The good news: you can see them coming—if you look in the right places and keep a baseline.
Do these next:
- Baseline zpool iostat -v -l and iostat -x during healthy hours, then keep those numbers somewhere your future self can find.
- Alert on kernel transport errors (resets, timeouts) in addition to ZFS pool state changes.
- Audit datasets for sync settings and identify which workloads are truly sync-heavy.
- Decide whether your SLOG (if any) is actually fit for purpose: low-latency, power-loss safe, and ideally mirrored for critical environments.
- Keep scrubs scheduled and monitored. Not because it’s fun, but because it’s how you catch the “quiet corruption and weak hardware” class of problems early.
Your goal isn’t to create the perfect ZFS system. It’s to make the slowdowns predictable, diagnosable, and fixable—before they become outages with a meeting invite.