ZFS Log Analysis: Finding Slowdowns Before They Become Outages

ZFS rarely “goes down” in a dramatic blaze. It slows. Quietly. The graph wiggles. Latency stretches.
Apps start timing out. Someone files a ticket about “intermittent slowness.” And then—because the universe hates humility—your
next deployment lands on top of the problem and gets blamed for everything.

The trick is to catch ZFS performance decay while it’s still a maintenance window, not a postmortem. This is a field guide for
reading the logs and the adjacent truth (kernel messages, device errors, ZFS event streams) to spot slowdowns early and decide
what to do next—fast, accurately, and without cargo cult rituals.

The mindset: logs are a timeline, not a vibe

ZFS “logs” are plural. There’s the ZFS event stream, the kernel ring buffer, systemd journal, SMART/device logs,
and ZFS’s own idea of pool health. The goal is not to collect more text. The goal is to align timelines:
what got slower, when, and what else changed.

Here’s the operating posture that keeps you out of trouble:

  • Prefer latency over throughput. Users feel 99th percentile latency. Dashboards that only show MB/s will lie to you.
  • Assume ZFS is honest about data integrity and conservative about performance. When it slows, it’s usually protecting you from something worse.
  • Be suspicious of “it started after X” narratives. ZFS problems often incubate for weeks: one weak drive, one mis-sized recordsize, one sync write path you forgot existed.
  • Correlate at the device layer. Most “ZFS performance issues” are either device latency, queueing, or a sync-write path doing exactly what you told it to do.

A log line is a clue, not a verdict. You still have to reconcile it with reality: zpool iostat, arcstat, iostat,
and what your applications are actually doing.

Interesting facts and historical context (so you stop guessing)

  1. ZFS was born in the Solaris era with an end-to-end data integrity model—checksums everywhere—because “silent corruption” was already a thing, just not a popular one.
  2. The intent log (ZIL) is not a write cache. It’s a mechanism to replay synchronous semantics after a crash. Most writes never live on “the log” long-term.
  3. SLOG is a device, not a feature. Adding a separate log device (SLOG) only helps synchronous writes and can hurt you if it’s slow or misconfigured.
  4. Scrubs were designed as proactive auditing, not a “repair when broken” tool. They’re how ZFS proves your data is still your data.
  5. Resilver behavior evolved. Modern OpenZFS resilvers can be sequential and smarter about what to copy, but you still pay in I/O contention.
  6. ARC/L2ARC tuning has a long history of bad advice. Many “performance guides” from a decade ago optimized for different workloads and smaller RAM-to-disk ratios.
  7. ashift is forever. A wrong sector size assumption at pool creation time can lock you into write amplification—quietly expensive, loudly painful.
  8. Compression became mainstream in ZFS ops because CPU got cheap and I/O did not. But the win depends on your data shape, not your hopes.

What “slow ZFS” actually means: the bottleneck map

“ZFS is slow” is like saying “the city is crowded.” Which street? Which hour? Which lane closure?
In practice, ZFS slowdowns cluster into a few categories. Your logs will usually point to one:

1) Device latency and error recovery

One marginal disk can stall a vdev. In RAIDZ and mirrors, the slowest child often becomes the pace car.
Linux kernel logs may show link resets, command timeouts, or “frozen queue” events. ZFS may show read/write/checksum errors.
Even if the drive “recovers,” the retry costs are paid in wall clock time by your application.
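
If you want to see the pace car before any error counter moves, ask for latency directly. A sketch, assuming a reasonably recent OpenZFS (the histogram and slow-I/O views are version-dependent) and the example pool name tank:

cr0x@server:~$ sudo zpool iostat -w tank 5 2     # per-vdev latency histograms, two 5-second samples
cr0x@server:~$ sudo zpool status -s tank         # adds a SLOW column (slow I/O counts) where supported

One device with a fat-tailed histogram while its siblings stay tight is a suspect, even if its READ/WRITE/CKSUM counters are still zero.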

2) Sync write path: ZIL/SLOG pain

If your workload does synchronous writes (databases, NFS, some VM storage, anything calling fsync a lot),
then ZIL latency matters. With a SLOG, your sync latency is frequently the SLOG’s latency.
Without a SLOG, sync writes hit the pool and inherit pool latency. Logs won’t say “fsync is your problem” in those words,
but the pattern shows up: rising await, bursts aligned with txg sync, and a lot of complaints during commit-heavy periods.
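
If you suspect the sync path, measure it instead of arguing about it. A minimal probe, assuming fio is installed and /tank/app/fio-test is a scratch directory on the dataset in question (the path is illustrative; don’t aim this at a latency-sensitive production dataset at peak):

cr0x@server:~$ sudo mkdir -p /tank/app/fio-test
cr0x@server:~$ sudo fio --name=fsync-probe --directory=/tank/app/fio-test \
    --rw=write --bs=4k --size=128m --fsync=1 \
    --runtime=30 --time_based --numjobs=1 --group_reporting

The completion and fsync latency percentiles it prints are the numbers your database actually feels. Compare them with and without the SLOG in the path before blaming the application.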

3) Transaction group (txg) sync time spikes

ZFS batches changes into transaction groups. When a txg is committed (“synced”), the system can see short storms of write I/O.
If sync time grows, everything that depends on those commits gets slower. This can show up as periodic pauses, NFS “not responding,”
or application latency spikes every few seconds.
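
On Linux OpenZFS builds that expose per-pool kstats, you can watch txg behavior directly. A sketch, assuming the pool is named tank (exact columns vary by version):

cr0x@server:~$ sudo tail -n 20 /proc/spl/kstat/zfs/tank/txgs

Each row shows how much dirty data a txg carried and how long it spent open, quiescing, and syncing. If sync times climb while dirty data grows, your periodic pauses now have a name.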

4) Metadata and fragmentation issues

Fragmentation isn’t a moral failing; it’s physics plus time. Certain workloads (VM images, databases, small random writes)
can turn the pool into an expensive seek festival. ZFS logs won’t print “you are fragmented,” but your iostat patterns will,
and your scrub/resilver times will get worse.
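
The pool will tell you how bad the free space looks if you ask. Two quick checks (FRAG estimates free-space fragmentation, not file layout, so treat it as a trend rather than a verdict):

cr0x@server:~$ sudo zpool get fragmentation,capacity tank
cr0x@server:~$ sudo zpool list -v tank

High fragmentation combined with high capacity is the pairing that hurts: allocations slow down and write latency gets spikier long before anything looks “broken.”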

5) Memory pressure: ARC thrash

When ARC hit rate drops, reads go to disk. That’s not automatically bad—sometimes the working set is simply bigger than RAM.
But sudden ARC collapse can happen after a memory-hungry deployment, a container density change, or an ill-considered L2ARC setup.
The signal is usually: more disk reads, higher latency, and a kernel that looks… busy.
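
A quick way to tell “ARC is shrinking” from “ARC is simply missing,” assuming a Linux OpenZFS build that exposes arcstats (field names shift slightly between versions):

cr0x@server:~$ grep -E "^(size|c|c_max|hits|misses|memory_throttle_count) " /proc/spl/kstat/zfs/arcstats
cr0x@server:~$ arc_summary | head -n 40    # if the arc_summary helper is installed

If size sits well below c_max, something is squeezing memory; if the hit/miss trend worsens against your baseline at a stable size, the working set outgrew the cache. Different problems, different fixes.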

One paraphrased idea often attributed to John Allspaw fits here: reliability comes from learning and adapting, not from pretending we can predict everything.
ZFS is adaptable. Your job is to learn what it’s telling you before it starts yelling.

Fast diagnosis playbook (first/second/third checks)

If you’re on call, you don’t have time for interpretive dance. You need a sequence that narrows the search space.
This playbook assumes Linux + OpenZFS, but the logic travels.

First: is the pool healthy right now?

  • Run zpool status -x. If it says anything other than “all pools are healthy,” stop and investigate that first.
  • Check zpool events -v for recent device faults, link resets, or checksum errors.
  • Look for scrubs/resilvers running. A “healthy” pool can still be slow if it’s rebuilding.

Second: is this a device problem or a workload/sync problem?

  • Run zpool iostat -v 1 and watch latency distribution by vdev. One slow disk? One slow mirror? That’s your suspect.
  • Run iostat -x 1 and check await, svctm (if present), and %util. High await + high util = device/queue saturation.
  • Check whether latency correlates with sync write bursts: high write IOPS with relatively low throughput and high await, especially during commit-heavy periods.

Third: confirm the failure mode with logs and counters

  • Journal/kernel: journalctl -k for timeouts, resets, NCQ errors, transport errors, aborted commands.
  • SMART: smartctl for reallocated sectors, pending sectors, CRC errors (often cable/backplane).
  • ZFS stats: ARC behavior (arcstat if available), txg sync messages (depending on your build), and event history.

One-sentence rule: if you can name the slowest component, you can usually fix the outage.
If you can’t, you’re still guessing, so keep narrowing.
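
While you narrow, capture evidence as you go so the postmortem isn’t archaeology. A minimal snapshot sketch (the directory and pool layout are assumptions; adjust to taste):

cr0x@server:~$ dir=/var/tmp/zfs-triage-$(date +%Y%m%d-%H%M%S); mkdir -p "$dir"
cr0x@server:~$ sudo zpool status -v > "$dir/zpool-status.txt"
cr0x@server:~$ sudo zpool events -v > "$dir/zpool-events.txt"
cr0x@server:~$ sudo zpool iostat -v -l 1 30 > "$dir/zpool-iostat.txt"
cr0x@server:~$ sudo journalctl -k --since "2 hours ago" > "$dir/kernel.log"

Thirty seconds of per-vdev latency plus the matching kernel log is usually enough to name the slowest component later, even after the incident channel has moved on.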

Practical tasks: commands, outputs, and decisions (15)

These are the tasks I actually run in production when ZFS slows down. Each one includes what the output means and what decision you make next.
Copy/paste is allowed. Panic is not.

Task 1: Quick pool health check

cr0x@server:~$ sudo zpool status -x
all pools are healthy

Meaning: No known faults, no degraded vdevs, no active errors. This does not guarantee performance, but it removes one big class of emergencies.
Decision: Move to latency diagnosis (zpool iostat, iostat) rather than repair operations.

Task 2: Full status with error counters and ongoing work

cr0x@server:~$ sudo zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
  scan: scrub repaired 0B in 02:14:33 with 0 errors on Mon Dec 23 03:12:18 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            ata-SAMSUNG_SSD_860-1   ONLINE       0     0     0
            ata-SAMSUNG_SSD_860-2   ONLINE       0     0     3

errors: No known data errors

Meaning: The pool is online, but one device has checksum errors. ZFS corrected them using redundancy, but you now have a reliability and performance smell.
Decision: Investigate that device path (SMART, cabling, backplane, HBA). Do not “zpool clear” as therapy; clear only after you understand why errors happened.

Task 3: Watch per-vdev latency live

cr0x@server:~$ sudo zpool iostat -v 1
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        4.12T  3.15T    210    980  23.1M  61.4M
  mirror-0                  2.06T  1.57T    105    510  11.6M  30.7M
    ata-SAMSUNG_SSD_860-1      -      -     60    250  6.7M  15.2M
    ata-SAMSUNG_SSD_860-2      -      -     45    260  4.9M  15.5M
--------------------------  -----  -----  -----  -----  -----  -----

Meaning: Balanced mirror load looks roughly symmetrical over time. If one member shows far fewer ops but higher latency (not shown in this basic view),
or if a vdev’s ops collapse while pool demand remains, that’s a hint the device is stalling or error-retrying.
Decision: If the imbalance persists, correlate with kernel logs and SMART; consider offlining/replacing the suspect device if errors align.

Task 4: Add latency columns (where supported)

cr0x@server:~$ sudo zpool iostat -v -l 1
                              capacity     operations     bandwidth    total_wait     disk_wait
pool                        alloc   free   read  write   read  write   read  write    read  write
--------------------------  -----  -----  -----  -----  -----  -----  ----- -----    ----- -----
tank                        4.12T  3.15T    220   1020  24.0M  63.2M   3ms   28ms     2ms   24ms
  mirror-0                  2.06T  1.57T    110    520  12.0M  31.6M   2ms   30ms     2ms   27ms
    ata-SAMSUNG_SSD_860-1      -      -     55    260  6.1M  15.8M   2ms   8ms      2ms   7ms
    ata-SAMSUNG_SSD_860-2      -      -     55    260  5.9M  15.8M   2ms   90ms     2ms   85ms
--------------------------  -----  -----  -----  -----  -----  -----  ----- -----    ----- -----

Meaning: One disk has disk_wait spikes (85–90ms) while the other stays low. That’s your “pace car.”
Decision: Pull kernel + SMART evidence. If it’s a cable/HBA path, fix that. If it’s the SSD itself, schedule replacement before it “recovers” into your next outage.

Task 5: Check for scrub/resilver contention

cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
  scan: resilver in progress since Thu Dec 26 08:11:02 2025
        312G scanned at 1.24G/s, 48.2G issued at 192M/s, 7.11T total
        48.2G resilvered, 0.68% done, 10:27:11 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0

errors: No known data errors

Meaning: Resilvering is in progress. Your pool is doing extra reads/writes, and latency will usually worsen.
Decision: If this is a user-facing production system, decide whether to throttle resilver/scrub (where supported),
or temporarily shift workload away. Also confirm the original failure is fully addressed—don’t let a second disk wobble during resilver.
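
If you need latency headroom right now, scrubs can be paused; resilver throttling lives in module parameters whose names vary by OpenZFS version, so check what your build actually exposes instead of copy-pasting from old forum posts:

cr0x@server:~$ sudo zpool scrub -p tank      # pause a running scrub; resume later with: sudo zpool scrub tank
cr0x@server:~$ ls /sys/module/zfs/parameters/ | grep -iE "resilver|scrub|scan"

Resilvers can’t be paused the way scrubs can, so the realistic levers are reducing competing workload and, carefully, the scan tunables your version supports.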

Task 6: Read the recent ZFS event stream

cr0x@server:~$ sudo zpool events -v | tail -n 30
TIME                           CLASS
Dec 26 2025 08:10:58.123456789 ereport.fs.zfs.vdev.io
    pool = tank
    vdev_path = /dev/disk/by-id/ata-SAMSUNG_SSD_860-2
    vdev_guid = 1234567890123456789
    errno = 5
    size = 131072
    offset = 9876543210
    flags = 0x180

Dec 26 2025 08:10:58.223456789 ereport.fs.zfs.vdev.checksum
    pool = tank
    vdev_path = /dev/disk/by-id/ata-SAMSUNG_SSD_860-2
    vdev_guid = 1234567890123456789

Meaning: ZFS is recording I/O errors and checksum problems against a specific device.
Decision: Treat this as hardware path triage: SMART, cables, HBA, enclosure. If it repeats, replace the device.
If it stops after reseating a cable, still keep watching; intermittent CRC errors love comebacks.

Task 7: Check kernel logs for transport resets and timeouts

cr0x@server:~$ sudo journalctl -k --since "2 hours ago" | egrep -i "ata|nvme|scsi|reset|timeout|error" | tail -n 40
Dec 26 09:01:14 server kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 26 09:01:14 server kernel: ata7.00: failed command: READ FPDMA QUEUED
Dec 26 09:01:14 server kernel: ata7: hard resetting link
Dec 26 09:01:18 server kernel: ata7: link is slow to respond, please be patient (ready=0)
Dec 26 09:01:20 server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 26 09:01:20 server kernel: ata7.00: configured for UDMA/133

Meaning: Link reset events. Even when they “recover,” the retry time creates latency spikes and can stall a vdev.
Decision: Check cabling/backplane, power, and HBA firmware. If this is a single drive bay, swap the drive to a different slot to isolate the enclosure path.

Task 8: SMART triage (SATA/SAS devices)

cr0x@server:~$ sudo smartctl -a /dev/sdb | egrep -i "Reallocated|Pending|Offline_Uncorrectable|CRC_Error_Count|Power_On_Hours"
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       23874
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

Meaning: Pending sectors and offline uncorrectables are real media issues. CRC count at zero suggests this is not “just a cable.”
Decision: Plan replacement. If the pool is redundant, replace proactively. If it’s single-disk (don’t), back up first and then replace yesterday.

Task 9: NVMe health and error log

cr0x@server:~$ sudo nvme smart-log /dev/nvme0 | egrep -i "critical_warning|media_errors|num_err_log_entries|percentage_used"
critical_warning                    : 0x00
media_errors                        : 12
num_err_log_entries                 : 398
percentage_used                     : 87%

Meaning: Media errors and a high percentage used can correlate with rising latency and impending failure.
Decision: If this NVMe is a SLOG or special vdev, treat it as urgent—those roles can degrade performance sharply when the device misbehaves.

Task 10: Identify sync-heavy workloads via dataset properties

cr0x@server:~$ sudo zfs get -o name,property,value -s local sync,logbias,primarycache,recordsize tank/app tank/vm
NAME      PROPERTY      VALUE
tank/app  sync          standard
tank/app  logbias       latency
tank/app  primarycache  all
tank/app  recordsize    128K
tank/vm   sync          always
tank/vm   logbias       latency
tank/vm   primarycache  metadata
tank/vm   recordsize    16K

Meaning: sync=always forces synchronous semantics even if the app doesn’t ask for it. That can be correct, or it can be a self-inflicted performance incident.
Decision: Verify why sync=always is set. If it’s for a database that already manages durability, you may be double-paying. If it’s for NFS/VM safety, keep it and invest in a proper SLOG.

Task 11: Confirm SLOG presence and basic layout

cr0x@server:~$ sudo zpool status tank | sed -n '1,80p'
  pool: tank
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
        logs
          nvme-SAMSUNG_MZVLB1T0-1   ONLINE       0     0     0

errors: No known data errors

Meaning: A single-device SLOG exists. That’s common, but it’s also a single point of performance and (depending on your tolerance) risk for sync write latency.
Decision: For critical sync workloads, prefer mirrored SLOG devices. And make sure the SLOG is actually low-latency under power-loss-safe conditions.
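
Converting a single SLOG into a mirror is usually a live operation. A sketch, reusing the log device name from the output above and a placeholder for the new one (use stable /dev/disk/by-id names):

cr0x@server:~$ sudo zpool attach tank nvme-SAMSUNG_MZVLB1T0-1 nvme-NEWLOG-2
cr0x@server:~$ sudo zpool status tank       # the logs section should now show a mirror

And verify the new device has real power-loss protection before trusting it with sync semantics; a fast-but-volatile SLOG is just a more elaborate way to lose the last few seconds of commits.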

Task 12: Check whether the system is drowning in I/O queueing

cr0x@server:~$ iostat -x 1 3
Linux 6.5.0 (server)   12/26/2025  _x86_64_  (32 CPU)

Device            r/s     w/s   rkB/s   wkB/s  avgrq-sz avgqu-sz   await  r_await  w_await  %util
sda              12.0   340.0    480   14560      83.2     18.4    52.6     3.1     54.4   99.2
sdb              10.0   332.0    420   14400      85.7      1.2     3.7     2.8      3.8   34.5
nvme0n1           0.0    25.0      0    2048     163.8      0.4    15.8     0.0     15.8   40.1

Meaning: sda is pegged at ~99% util with a deep queue and high await, while sdb is fine. In a mirror, that can drag the vdev.
NVMe shows moderate await; if that’s your SLOG, 15ms might be too slow for “fast fsync” expectations.
Decision: Investigate why sda is slow: errors, firmware, thermal throttling, controller issues. If this is a mirror member, consider offlining it briefly to see if latency improves (with risk awareness).

Task 13: Check ARC behavior (if arcstat is available)

cr0x@server:~$ arcstat 1 5
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
09:12:01  3120   980     31   220    7   710   23    50    1  96.2G  96.0G
09:12:02  2980   940     31   240    8   650   22    50    1  96.2G  96.0G
09:12:03  3050   970     32   210    7   710   23    50    1  96.2G  96.0G
09:12:04  3105   995     32   230    7   715   23    50    1  96.2G  96.0G
09:12:05  3002   960     32   220    7   690   23    50    1  96.2G  96.0G

Meaning: A ~31–32% miss rate may be fine or terrible depending on your storage and workload. If miss% suddenly jumps compared to baseline,
the disks will see more reads and latency will rise.
Decision: Compare to last week’s baseline. If ARC is capped (c equals arcsz) and you have free RAM, consider raising ARC max.
If ARC is being squeezed by something else, fix memory pressure rather than “tuning ZFS” into a corner.
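
If you do raise the ARC cap, check the current value first and make the change persistent. A sketch for Linux OpenZFS (the value is in bytes; 64 GiB here is an example, not a recommendation):

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_max        # 0 usually means "use the built-in default"
cr0x@server:~$ echo 68719476736 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
cr0x@server:~$ echo "options zfs zfs_arc_max=68719476736" | sudo tee -a /etc/modprobe.d/zfs.conf   # append; don't clobber existing options

Then watch arcstat against your baseline for a few days. A bigger ARC that gets squeezed right back down by memory pressure has not fixed anything.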

Task 14: Check dataset compression and logical vs physical I/O

cr0x@server:~$ sudo zfs get -o name,property,value compression,compressratio tank/app
NAME      PROPERTY       VALUE
tank/app  compression    lz4
tank/app  compressratio  1.62x

Meaning: Compression is working and likely saving I/O. If compressratio is ~1.00x, you’re paying CPU overhead for no I/O benefit (usually small with lz4, but not zero).
Decision: If CPU is a bottleneck and data is incompressible, consider disabling compression on that dataset. Otherwise, leave lz4 alone; it’s one of the few “defaults” that earns its keep.

Task 15: Find who is hammering the pool right now

cr0x@server:~$ sudo iotop -oPa
Total DISK READ: 45.20 M/s | Total DISK WRITE: 112.30 M/s
  PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
18342 be/4  postgres    2.10 M/s   65.40 M/s  0.00 %  84.21 % postgres: checkpointer
20111 be/4  root        0.00 B/s   28.20 M/s  0.00 %  62.10 % zfs send -w tank/app@snap
 9321 be/4  libvirt-qemu 1.10 M/s  12.80 M/s  0.00 %  20.33 % qemu-system-x86_64

Meaning: You have a checkpointer doing heavy writes, a zfs send pushing data, and VMs reading/writing. This is a contention map.
Decision: If latency is user-visible, pause or reschedule the bulk transfer (zfs send) or rate-limit it. Don’t argue with physics.
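
Rate-limiting the bulk transfer is often kinder than killing it. A sketch using pv as a throttle, assuming pv is installed and that the snapshot, host, and target dataset names are placeholders with receive privileges already sorted out:

cr0x@server:~$ sudo zfs send -w tank/app@snap | pv -L 50m | ssh backup-host "zfs receive -u backup/app"

pv -L caps the stream (here around 50 MB/s), turning “replication ate the pool” into “replication finishes a bit later.” mbuffer with a rate limit works too; use whichever your team can read at 3 a.m.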

Joke #1: Storage is the only place where “it’s fine on average” is accepted right up until the moment it isn’t.

Three corporate mini-stories (anonymized, painfully plausible)

Mini-story 1: The incident caused by a wrong assumption

A mid-size SaaS company ran customer Postgres on ZFS-backed VM storage. It had been stable for months, and the team was proud:
mirrored SSDs, compression on, weekly scrubs, basic monitoring. The only thing they didn’t monitor was sync write latency. Because, in their heads,
“SSDs are fast.”

A new compliance requirement arrived: ensure durability semantics for a subset of workloads. An engineer flipped sync=always on the dataset that housed
the VM images. The assumption was simple: “This will be safer and only slightly slower.” It was half right.

The next morning, customers reported sporadic timeouts. The pool looked healthy. CPU was fine. Network was fine. Throughput graphs were fine.
But the 99th percentile write latency exploded. The kernel logs showed nothing dramatic. ZFS logs showed no errors. Everyone started staring at the application layer,
because that’s what you do when the storage won’t confess.

The smoking gun was in zpool iostat -l: the SLOG device (a consumer NVMe without power-loss protection) had high and jittery write latency under sustained sync load.
It wasn’t “broken.” It was just being asked to provide consistent low-latency commits and politely declined.

The fix was boring and expensive: replace the SLOG with a device designed for steady sync write latency and mirror it.
The postmortem had one lesson worth tattooing: don’t change sync semantics without measuring the sync path.

Mini-story 2: The optimization that backfired

An enterprise internal platform team ran a ZFS pool for CI artifacts and container images. It was mostly large files, lots of parallel reads,
and occasional big writes. The system was “fine,” but a well-meaning performance initiative demanded “more throughput.”

Someone found an old tuning note and decided the pool should use a separate “special vdev” for metadata to speed up directory traversal and small reads.
They added a pair of small, fast SSDs as a special vdev. Initial benchmarks looked great. Leadership smiled. Everyone moved on.

Months later, performance got weird. Not just slower—spiky. During peak CI hours, builds stalled for seconds at a time.
zpool status stayed green. But zpool iostat -v -l told an uglier story: the special vdev had become the latency bottleneck.
Those “small fast SSDs” were now heavily written, wearing out, and occasionally throttling.

The backfire wasn’t the feature. It was the sizing and lifecycle thinking. Metadata and small blocks can be an I/O magnet.
When the special vdev hiccups, the whole pool feels drunk. The kernel logs had mild NVMe warnings, not enough to trip alerts,
but enough to explain the stalls when correlated with the latency spikes.

The remediation plan: replace the special vdev with appropriately durable devices, expand capacity to reduce write amplification,
and add monitoring specifically for special vdev latency and wear indicators. The moral: every acceleration structure becomes a dependency.

Mini-story 3: The boring but correct practice that saved the day

A financial services shop ran ZFS for an NFS backend serving home directories and shared build outputs. Nothing sexy. No heroic tuning.
What they did have was discipline: monthly scrubs, alerting on zpool status changes, and a runbook that forced engineers to check
kernel transport errors before touching ZFS knobs.

One Tuesday, latency climbed. Users noticed. The on-call followed the runbook: check pool health, check events, check kernel logs.
Within minutes they found repeating SATA link resets on one drive bay. No ZFS errors yet—just retries.

They swapped the cable/backplane component in a scheduled micro-window, before the drive started throwing checksum errors.
Latency dropped back to baseline. No resilver needed. No data risk. No weekend consumed by regret.

The practice that saved them wasn’t genius. It was consistency: scrubs to detect latent issues, and log correlation to catch hardware path degradation early.
Boring is underrated in storage engineering because it works.

Joke #2: If you want an exciting storage career, ignore your scrub schedule; the pager will create excitement for you.

Common mistakes: symptom → root cause → fix

1) “Pool is ONLINE but latency is awful”

Symptom: zpool status looks clean; applications time out; iostat shows high await.

Root cause: Device retries, link resets, or a single slow disk dragging a mirror/RAIDZ vdev.

Fix: Check journalctl -k for resets/timeouts; check SMART/NVMe error logs. Replace the suspect device or repair the transport path. Don’t tune ZFS to compensate for hardware lying.

2) “Every few seconds we get a pause”

Symptom: Periodic latency spikes; NFS stutters; databases show commit stalls.

Root cause: Txg sync taking too long, often because the pool is saturated, fragmentation is high, or a slow device is stalling flushes.

Fix: Use zpool iostat -l to identify the slow vdev, and reduce competing write load. If sync-heavy, fix SLOG latency or reconsider sync=always.

3) “We added a SLOG and performance got worse”

Symptom: Sync-heavy workload slows after adding log device.

Root cause: SLOG device has worse latency than the pool or suffers from throttling; single SLOG becomes a choke point.

Fix: Verify with iostat -x and zpool iostat -l. Replace with low-latency, power-loss-protected device, ideally mirrored. If workload is mostly async, remove SLOG and stop expecting magic.
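
Both remediations are one-liners; the hard part is choosing. A sketch, with device names as placeholders:

cr0x@server:~$ sudo zpool remove tank nvme-SLOW-SLOG-1                     # drop the log vdev; sync writes fall back to the main pool
cr0x@server:~$ sudo zpool replace tank nvme-SLOW-SLOG-1 nvme-PLP-SSD-1     # or swap it for a power-loss-protected device

After either change, re-measure sync latency under the real workload. The decision was about numbers; the verification should be too.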

4) “Checksum errors keep appearing, but scrubs repair them”

Symptom: CKSUM counts rise; scrubs repair; no user-visible data errors—yet.

Root cause: Often cabling/backplane/HBA issues (CRC errors), sometimes drive media failure.

Fix: Check SMART CRC counters and kernel transport logs. Reseat/replace cable/backplane; update firmware; replace drive if media indicators are bad. Then scrub again and monitor whether counters stay flat.

5) “Resilver will finish in 2 hours… for the next 3 days”

Symptom: Resilver ETA grows; pool is sluggish.

Root cause: Competing workload + fragmented pool + slow device. Resilver competes for I/O and can be deprioritized by the system or starved by your applications.

Fix: Reduce workload, schedule resilver in off-hours where possible, and check for a weak device prolonging the process. Confirm ashift and vdev design aren’t causing pathological write amplification.

6) “ARC hit rate fell off a cliff after we deployed something unrelated”

Symptom: Sudden increase in disk reads; latency rises; memory usage changes.

Root cause: Memory pressure from new services, container density, or kernel page cache behavior; ARC limited by configuration or squeezed by other consumers.

Fix: Measure memory, don’t guess. If you have RAM headroom, increase ARC cap. If you don’t, reduce memory pressure or move workload. Don’t add L2ARC as a substitute for not having enough RAM unless you understand the write/read patterns.

7) “We tuned recordsize and now writes are slower”

Symptom: After changing recordsize, throughput drops and latency rises.

Root cause: Recordsize mismatch with workload (e.g., too large for random-write DB blocks, too small for sequential streaming).

Fix: Set recordsize per dataset and per workload type. VM images and databases often prefer smaller blocks (e.g., 16K), while large sequential files benefit from larger (128K–1M depending on use). Validate with real I/O traces, not vibes.
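
Keep in mind recordsize only affects blocks written after the change; existing data keeps its old block size until it is rewritten. A sketch, with tank/vm from the earlier examples and tank/media as a placeholder:

cr0x@server:~$ sudo zfs set recordsize=16K tank/vm       # small random writes: VM images, many databases
cr0x@server:~$ sudo zfs set recordsize=1M tank/media     # large sequential files
cr0x@server:~$ sudo zfs get recordsize tank/vm tank/media

If existing data must adopt the new size, it has to be rewritten (send/receive into a fresh dataset or a controlled copy), which is its own maintenance project.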

Checklists / step-by-step plan

Checklist A: When users report “intermittent slowness”

  1. Confirm whether it’s storage latency: check app-level p95/p99 and I/O wait on hosts.
  2. Run zpool status -x. If not healthy, treat as incident.
  3. Run zpool status and look for scrub/resilver in progress.
  4. Run zpool iostat -v -l 1 for 60–120 seconds. Identify the slowest vdev/device by latency.
  5. Run journalctl -k filtered for resets/timeouts. Confirm whether the slow device has matching errors.
  6. Check SMART/NVMe health for the suspect device.
  7. Decide: isolate (offline/replace), repair transport (cable/backplane/HBA), or reduce workload contention.

Checklist B: When sync writes are suspected (databases/NFS/VMs)

  1. Check dataset sync and logbias properties for the relevant datasets.
  2. Confirm whether you have a SLOG and what it is (single vs mirror).
  3. Measure SLOG latency using iostat -x on the SLOG device during the slowdown window.
  4. If SLOG latency is worse than the main pool, don’t debate: replace or remove it depending on sync needs.
  5. If no SLOG exists and sync latency is painful, consider adding a proper mirrored SLOG—after validating that the workload is actually sync-heavy.

Checklist C: When errors appear but the pool “keeps working”

  1. Capture zpool status and zpool events -v output for the incident record.
  2. Check kernel logs around the same timestamps for transport issues.
  3. Check SMART/NVMe media indicators and error counts.
  4. Fix the path or replace hardware. Only then clear errors with zpool clear.
  5. Run a scrub after remediation and confirm error counters stop increasing.

Checklist D: Baseline so you can detect regressions

  1. Record baseline zpool iostat -v -l during “known good” hours (a minimal capture sketch follows this checklist).
  2. Record baseline ARC stats (hit rate, ARC size, memory pressure indicators).
  3. Track scrub duration and resilver duration trends (they’re early warnings for fragmentation and device aging).
  4. Alert on kernel transport errors, not just ZFS faults.
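
A minimal baseline capture you can run from cron or a systemd timer (paths, pool name, and retention are assumptions; the point is that “known good” numbers exist somewhere greppable):

#!/bin/sh
# zfs-baseline.sh - capture pool latency and ARC state for later comparison
OUT=/var/log/zfs-baseline/$(date +%Y%m%d-%H%M).txt
mkdir -p /var/log/zfs-baseline
{
  date
  zpool status -v
  zpool iostat -v -l 1 30                                # 30 one-second samples of per-vdev latency
  zpool get fragmentation,capacity tank
  grep -E "^(size|c_max|hits|misses) " /proc/spl/kstat/zfs/arcstats
} > "$OUT" 2>&1

Schedule it for a quiet-but-representative hour as root, and prune old files however your retention policy prefers.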

FAQ

1) Are ZFS logs enough to diagnose performance issues?

No. ZFS will tell you about integrity signals (errors, faults, events), but performance diagnosis needs device and kernel context.
Always pair ZFS events with kernel logs and iostat/zpool iostat.

2) If zpool status is clean, can I rule out hardware?

Absolutely not. Many hardware/transport issues present as retries and link resets long before ZFS increments a counter.
Kernel logs and SMART often show the “pre-symptoms.”

3) Does adding a SLOG always improve performance?

Only for synchronous writes. For async workloads, it’s mostly irrelevant. And a slow SLOG can make sync performance worse.
Treat SLOG as a latency-critical component, not a checkbox.

4) What’s the fastest way to spot a single bad disk in a mirror?

Use zpool iostat -v -l 1 and look for one member with dramatically higher disk wait latency.
Then confirm with journalctl -k and SMART/NVMe logs.

5) Do checksum errors always mean the disk is dying?

Often it’s the path: cable, backplane, HBA, firmware. SMART CRC errors and kernel transport resets are your tell.
Media errors (pending/reallocated/uncorrectable) implicate the disk more directly.

6) Why does the pool slow down during scrub if scrubs are “background”?

Scrubs are background in intent, not in physics. They consume real I/O and can raise latency.
If scrubs cause user pain, schedule them better, throttle where possible, and verify your pool has enough performance headroom.

7) Should I set sync=disabled to fix latency?

That’s not fixing; it’s negotiating with reality and hoping it doesn’t notice. You’re trading durability guarantees for speed.
If the data matters, fix the sync path (SLOG/device latency) instead.

8) Is high fragmentation always the reason for slowdowns?

No. Fragmentation is common, but the usual first culprit is device latency or a degraded/rebuilding pool.
Fragmentation tends to show up as a long-term trend: scrubs/resilvers get longer, random I/O gets pricier, and latency becomes easier to trigger.

9) When should I clear ZFS errors with zpool clear?

After you’ve fixed the underlying cause and captured the evidence. Clearing too early erases your breadcrumb trail and invites repeat incidents.

10) What if ZFS is slow but iostat shows low %util?

Then the bottleneck might be elsewhere: CPU (compression/encryption), memory pressure, throttling, or a sync path stall.
Also confirm you’re measuring the right devices (multipath, dm-crypt layers, HBAs).

Conclusion: practical next steps

ZFS performance outages are usually slow-motion hardware failures, sync write surprises, or rebuild/scrub contention that nobody treated as a production event.
The good news: you can see them coming—if you look in the right places and keep a baseline.

Do these next:

  1. Baseline zpool iostat -v -l and iostat -x during healthy hours, then keep those numbers somewhere your future self can find.
  2. Alert on kernel transport errors (resets, timeouts) in addition to ZFS pool state changes.
  3. Audit datasets for sync settings and identify which workloads are truly sync-heavy.
  4. Decide whether your SLOG (if any) is actually fit for purpose: low-latency, power-loss safe, and ideally mirrored for critical environments.
  5. Keep scrubs scheduled and monitored. Not because it’s fun, but because it’s how you catch the “quiet corruption and weak hardware” class of problems early.

Your goal isn’t to create the perfect ZFS system. It’s to make the slowdowns predictable, diagnosable, and fixable—before they become outages with a meeting invite.
