You can run ZFS for years and think you’re fine—right up until the day you discover the pool has been quietly collecting bad sectors,
your “fast” SSD SLOG is throttling to 5 MB/s, and every VM is waiting on a single angry vdev. ZFS doesn’t usually fail loudly.
It fails politely, with latency.
A ZFS health dashboard is not a vanity project. It’s how you stop arguing with opinions and start arguing with numbers.
If you’re not tracking the right metrics, you’re not “monitoring.” You’re staring at a screensaver while production burns.
What “healthy” means in ZFS (and what it doesn’t)
ZFS “health” is not just “pool is ONLINE.” That’s like declaring a patient healthy because they’re technically upright.
Real health is: data is correct, redundancy still exists, recovery time is bounded, and performance is predictable under load.
A useful dashboard answers five questions continuously:
- Is my data still protected? (Redundancy intact; no silent corruption accumulating; no scary checksum trends.)
- Am I about to lose protection? (Scrub finding increasing errors; SMART reallocations climbing; resilver risk growing.)
- Is the pool fast enough for the workload? (Latency distribution; queueing; sync write behavior; vdev balance.)
- Is the pool getting slower over time? (Fragmentation, full pool, metadata pressure, ARC thrash, special vdev saturation.)
- Can I recover quickly? (Resilver time; spare readiness; replication lag; scrub schedule adherence.)
What it does not mean: “CPU is low,” “network looks fine,” or “nobody has paged me yet.” ZFS can hide a lot of pain behind
buffering and asynchronous writes until a sync-heavy workload shows up and asks for the truth.
Facts and historical context (why these metrics exist)
- ZFS was born at Sun in the mid-2000s with the explicit goal of end-to-end data integrity—checksums on everything, not just metadata.
- “Scrub” isn’t a cute name: it exists because silent corruption is real, and disks lie convincingly until you read every block.
- ARC (Adaptive Replacement Cache) is central to ZFS performance; it’s not a bolt-on. ZFS is designed assuming a big, smart cache.
- L2ARC came later as a “second-level” read cache on fast devices, but it’s not magic: it needs RAM for metadata and can steal CPU.
- ZIL and SLOG are commonly misunderstood: the ZIL is always there; a SLOG is just a separate device to store the ZIL more safely/faster.
- Copy-on-write (CoW) trades rewrite-in-place simplicity for consistency. The price is fragmentation and the need to watch free space.
- “RAIDZ is not RAID” in operational behavior: rebuild (resilver) reads only allocated blocks, which can be faster, but still punishes slow disks.
- OpenZFS broadened the ecosystem across illumos, FreeBSD, Linux, and others; that’s why some commands and kstats differ by platform.
- Special vdevs (metadata/small blocks) are powerful but operationally sharp: lose the special vdev and you can lose the pool.
These are not trivia night facts. They explain why the dashboard needs to be biased toward integrity and latency, not just throughput.
The must-track metrics (and how to interpret them)
1) Pool state and error counters (integrity first)
Your top row should be brutally simple: pool state plus per-vdev and per-device error counts (read, write, checksum).
Checksum errors are the ones that should make you sit up straight.
Reads and writes can fail transiently; checksum errors mean data didn’t match what ZFS wrote.
Track:
- Pool state: ONLINE/DEGRADED/FAULTED
- Error counters per leaf device
- “errors: Permanent errors have been detected…” events
- Number of degraded/missing devices
Decision rule: any non-zero checksum error trend is an incident until proven otherwise. Not a ticket. An incident.
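If you want the same check as a dashboard feed instead of someone's terminal history, a minimal sketch is enough; this assumes a pool named tank and the standard zpool status table layout, and it only treats ONLINE/DEGRADED/FAULTED rows as device rows:
cr0x@server:~$ # sketch: print any vdev/device row whose READ/WRITE/CKSUM counters are non-zero
cr0x@server:~$ zpool status tank | awk '$2=="ONLINE" || $2=="DEGRADED" || $2=="FAULTED" { if (($3+$4+$5) > 0) print "ERRORS:", $1, "read="$3, "write="$4, "cksum="$5 }'
Feed the output into whatever pages you; the point is that a machine reads the counters every few minutes, not a human during an incident.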
2) Scrub metrics: last scrub, duration, found errors
Scrubs are your periodic truth serum. If you don’t scrub, you’re betting the business that disks will behave when it matters most.
A dashboard should show:
- Time since last scrub completed
- Scrub duration (and how it changes)
- Errors found during scrub
- Scrub rate (MB/s), especially on large pools
Pattern to watch: scrubs that take longer each month often correlate with a pool getting fuller, more fragmented, or a disk silently degrading.
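If your monitoring stack only ingests numbers, something like this turns "time since last scrub" into a metric. It is a sketch that assumes GNU date, the scan-line wording shown in the tasks below, and that the last scrub actually completed (an in-progress scrub prints a different scan line):
cr0x@server:~$ # pull the completion timestamp off the "scan:" line and convert it to days
cr0x@server:~$ last=$(zpool status tank | sed -n 's/.*with [0-9]* errors on //p' | head -n 1)
cr0x@server:~$ echo "$(( ( $(date +%s) - $(date -d "$last" +%s) ) / 86400 )) days since last completed scrub"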
3) Resilver metrics: start time, progress, estimated completion
Resilver time is operational risk. While resilvering, you have reduced redundancy and heightened exposure to a second failure.
Track:
- Resilver in progress
- Scan rate and ETA
- Number of resilvers in the last N days (frequent resilvers suggest marginal hardware)
Decision: if resilver time is “days,” you’re not running a storage system; you’re running a lottery kiosk. Adjust design: more vdevs, mirrors,
fewer huge RAIDZ groups, or faster disks, depending on constraints.
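The dashboard flag itself can be trivial; a hedged sketch that leans on the "resilver in progress" wording current OpenZFS prints:
cr0x@server:~$ # "RESILVER ACTIVE" means you are running with reduced redundancy right now
cr0x@server:~$ zpool status | grep -q 'resilver in progress' && echo "RESILVER ACTIVE" || echo "no resilver running"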
4) Capacity pressure: used %, free %, and, crucially, free space fragmentation
ZFS performance has a cliff. It’s not a myth. Past a certain fullness, allocations get harder, fragmentation grows, and latency spikes under load.
Track:
- Pool allocation percentage
- Dataset quotas/reservations (they can hide a real shortage)
- Free space fragmentation (pool-level)
Opinionated threshold: treat 80% pool usage as “yellow,” 90% as “red.”
Some pools survive beyond that, but you’re paying for it in latency.
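Both numbers are pool properties, so there is nothing to compute; a quick check using the standard zpool property names looks like this:
cr0x@server:~$ # human-readable overview: size, allocation, free space, percent used, fragmentation
cr0x@server:~$ zpool list -o name,size,allocated,free,capacity,fragmentation,health tank
cr0x@server:~$ # parsable variant for scripts and alert thresholds
cr0x@server:~$ zpool get -H -p -o name,property,value capacity,fragmentation tank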
5) Latency: not just average, but distribution and sync write latency
Throughput graphs soothe people. Latency graphs keep systems alive. You want:
- Read latency p50/p95/p99 per pool and per vdev
- Write latency p50/p95/p99
- Sync write latency (especially if you serve databases, NFS, or VM storage)
- Queue depth (device and ZFS pipeline)
If p99 is ugly, users will call it “random slowness.” You’ll call it “a Tuesday.”
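On reasonably recent OpenZFS you can get most of this from zpool itself; a hedged starting point before you wire up a full metrics exporter:
cr0x@server:~$ # average wait times (total, disk, sync/async queues) per vdev, refreshed every 5 seconds
cr0x@server:~$ sudo zpool iostat -v -l tank 5
cr0x@server:~$ # latency histograms, if you want percentiles instead of averages
cr0x@server:~$ sudo zpool iostat -w tank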
6) IOPS and bandwidth by vdev (imbalance tells you where the fire is)
ZFS pools are aggregates. The bottleneck is usually one vdev, one disk, or one device class (e.g., special vdev, SLOG).
Track:
- IOPS and bandwidth per top-level vdev
- Latency per vdev
- Disk busy percentage
Decision: if one vdev is consistently hotter, fix layout or workload placement. Don’t “tune” around a structural problem.
7) ARC metrics: size, hit ratio, evictions, and “is ARC fighting the OS?”
ARC can hide sins. Or expose them. You need:
- ARC size vs target
- Hit ratio (overall, and by data vs metadata if available)
- Eviction rate
- Memory pressure signals (swap activity, OOM risk)
Beware: a high ARC hit ratio can be meaningless if your workload is streaming. The real signal is: are you meeting latency SLOs?
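The counters live in /proc/spl/kstat/zfs/arcstats on Linux and are cumulative since boot, so dashboards should graph deltas. As a quick sanity check, a since-boot hit ratio can be derived like this (arc_summary, shipped with most OpenZFS packages, prints a friendlier report of the same data):
cr0x@server:~$ # since-boot ARC hit ratio from cumulative counters; deltas are better for trending
cr0x@server:~$ awk '/^hits /{h=$3} /^misses /{m=$3} END{printf "ARC hit ratio: %.2f%%\n", 100*h/(h+m)}' /proc/spl/kstat/zfs/arcstats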
8) L2ARC metrics (if used): hit ratio, feed rate, and write amplification
L2ARC helps read-heavy, cache-friendly workloads. It can also just burn SSD endurance while accomplishing nothing.
Track:
- L2ARC size and hit ratio
- “Feed” rate (how much you’re writing into it)
- SSD wear indicators (from SMART)
Decision: if L2ARC hit ratio is low and the feed rate is high, it’s probably an expensive placebo.
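The same arcstats file carries the L2ARC counters (field names can shift slightly between OpenZFS versions), so the placebo test is one line:
cr0x@server:~$ # L2ARC hits vs misses, plus bytes written into the cache device since boot
cr0x@server:~$ awk '/^l2_hits /{h=$3} /^l2_misses /{m=$3} /^l2_write_bytes /{w=$3} END{printf "l2 hits=%d misses=%d written=%.1f GiB\n", h, m, w/2^30}' /proc/spl/kstat/zfs/arcstats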
9) ZIL/SLOG metrics: sync write rate, SLOG latency, and device health
If you export NFS with sync, run databases, or host VM disks, ZIL behavior will define your worst day.
Track:
- Sync write IOPS
- SLOG device latency and errors
- SLOG utilization and contention
A SLOG that is “fine” until it hits thermal throttling is a classic outage recipe.
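If the pool has a log vdev, zpool iostat -v lists it under its own logs heading, so you can watch its queueing next to the data vdevs; a hedged example that pairs with the -l latency view shown earlier:
cr0x@server:~$ # active queue depths per device; sustained backlog on the log device means sync writes are waiting
cr0x@server:~$ sudo zpool iostat -v -q tank 5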
10) Dataset-level knobs that change reality: recordsize, volblocksize, compression, atime, sync
ZFS datasets are policy boundaries. Your dashboard should let you correlate performance with the knobs you set:
- recordsize (filesystems) and volblocksize (zvols)
- compression algorithm and ratio
- atime (often wasted writes)
- sync (standard/always/disabled)
Compression ratio is not just “space saved.” It’s also “less IO required” if CPU is available. Watch both.
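compressratio is a read-only dataset property, so you can put the outcome right next to the knobs that produced it; a quick view one level deep (dataset name taken from this article's examples):
cr0x@server:~$ # policy knobs and the resulting compression ratio, per dataset
cr0x@server:~$ zfs get -d 1 -o name,property,value recordsize,compression,compressratio,atime,sync tank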
11) SMART and device health: reallocated sectors, media errors, temperature, and power loss events
ZFS can correct some errors. It can’t correct a drive that is dying in slow motion while you ignore it.
Track:
- Reallocated sector count / media errors
- UDMA CRC errors (cables/backplanes matter)
- Temperature and thermal throttling flags
- NVMe percent used / wear level
Decision: replace drives based on trend, not heroics. Waiting for “FAILED” is how you end up resilvering during a holiday weekend.
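Trend alerts need the raw attributes collected on a schedule, not read by a human mid-incident. A minimal collection sketch for the SATA drives in this article's examples (attribute names and raw-value formats vary by vendor; NVMe devices need smartctl -a /dev/nvme0 or nvme smart-log instead):
cr0x@server:~$ # one block per disk, ready to be shipped into whatever time series you use
cr0x@server:~$ for d in /dev/sd[a-f]; do echo "== $d"; sudo smartctl -A "$d" | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count|Temperature_Celsius'; done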
12) Replication and backup signals: lag, last success, and “did we actually test restore?”
A health dashboard without replication/backup status is theater. Track:
- Last snapshot time and retention adherence
- Replication lag (time and bytes)
- Last successful send/receive
- Restore test recency (yes, it belongs on a dashboard)
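Since zfs can print creation times as raw epoch seconds, snapshot freshness (and, if you run the same check on the replication target, actual lag) is just subtraction; the dataset name below is the one used throughout this article:
cr0x@server:~$ # age of the newest snapshot in seconds; compare source vs target to get real lag
cr0x@server:~$ newest=$(zfs list -H -p -t snapshot -o creation -s creation tank/vm | tail -n 1)
cr0x@server:~$ echo "newest tank/vm snapshot is $(( $(date +%s) - newest )) seconds old"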
One paraphrased idea, often attributed to W. Edwards Deming, fits operations well: without data, you're just another person with an opinion.
Joke #1: ZFS is like a good accountant—everything is checksummed, and it will absolutely remember what you tried to “just ignore.”
Practical tasks: commands, output meaning, and decisions (12+)
These are tasks you can run today. The point isn’t the command. The point is what you do after reading the output.
Examples assume OpenZFS on Linux; adjust paths for your platform if needed.
Task 1: Confirm pool health and error counters
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 05:12:33 with 0 errors on Sun Dec 22 02:18:10 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
errors: No known data errors
What it means: “ONLINE” is necessary but not sufficient. The counters matter. Any non-zero CKSUM is a red flag; READ/WRITE too, but CKSUM is scarier.
Decision: If errors are non-zero, start an incident: identify the device, check cabling/backplane, run SMART, and plan replacement. If repeated CKSUM errors appear, don’t wait.
Task 2: Watch for slow or stalled scrub/resilver
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Mon Dec 23 02:10:03 2025
4.12T scanned at 610M/s, 2.01T issued at 298M/s, 18.3T total
0B repaired, 10.97% done, 0 days 15:22:10 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
errors: No known data errors
What it means: “scanned” vs “issued” rate can diverge; issued reflects actual IO. Long ETAs can indicate slow disks, contention, or throttling.
Decision: If scrub ETA balloons or issued rate collapses under normal load, investigate per-disk latency and errors. Consider scheduling scrubs off-peak and verifying scrub throttles.
Task 3: Check pool capacity and fragmentation risk
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint -r tank
NAME USED AVAIL REFER MOUNTPOINT
tank 72.4T 9.63T 128K /tank
tank/vm 41.1T 9.63T 40.9T /tank/vm
tank/backups 28.7T 9.63T 28.7T /tank/backups
tank/home 2.58T 9.63T 2.58T /tank/home
What it means: “AVAIL” is shared across datasets unless reservations are in play. A pool with ~88% used is already in “latency cliff” territory for many workloads.
Decision: If you’re trending toward 85–90% used, stop adding data. Buy time by expiring snapshots, moving cold data, adding vdevs, or expanding capacity. Don’t “tune” your way out.
Task 4: Surface dataset policies that change write behavior
cr0x@server:~$ zfs get -o name,property,value -s local,received recordsize,compression,atime,sync tank/vm
NAME PROPERTY VALUE
tank/vm recordsize 128K
tank/vm compression lz4
tank/vm atime off
tank/vm sync standard
What it means: These properties directly influence IO size and durability semantics. "sync=always" can crush latency on a weak SLOG; "sync=disabled" can fake performance by risking data loss.
Decision: Validate that properties match the workload. If you changed sync to “disabled” to “fix” performance, you didn’t fix performance—you changed the rules of physics.
Task 5: Measure real-time IO load and latency by vdev
cr0x@server:~$ sudo zpool iostat -v tank 2 3
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 72.4T 9.63T 3.12K 1.88K 612M 241M
raidz2-0 72.4T 9.63T 3.12K 1.88K 612M 241M
sda - - 520 310 98M 41M
sdb - - 510 305 96M 40M
sdc - - 540 312 101M 41M
sdd - - 105 328 22M 39M
sde - - 525 309 99M 40M
sdf - - 520 316 98M 40M
-------------------------- ----- ----- ----- ----- ----- -----
What it means: One disk (sdd) is doing far fewer reads but similar writes: that can indicate read errors, slow reads, or a path issue causing retries elsewhere.
Decision: Drill into that device: SMART, cabling, controller, and kernel logs. If the vdev is imbalanced, resilver and scrub times suffer, and so does your p99 latency.
Task 6: Check ARC behavior (Linux)
cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep '^(size|c_max|c_min|hits|misses|demand_data_hits|demand_data_misses) '
size 4 17179869184
c_min 4 4294967296
c_max 4 34359738368
hits 4 9140284432
misses 4 821739122
demand_data_hits 4 6021139981
demand_data_misses 4 611229880
What it means: ARC size, target max, and hit/miss counters give you cache efficiency. A low hit ratio under a random-read workload suggests you need more RAM or different layout.
Decision: If misses are high and latency is high, consider adding RAM before adding “fancy” SSD caches. If your workload is streaming, don’t overreact to misses; focus on throughput and write behavior.
Task 7: Confirm there’s no memory pressure hiding behind ARC
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 256Gi 118Gi 12Gi 2.0Gi 126Gi 136Gi
Swap: 0B 0B 0B
What it means: “available” memory matters. If available collapses and swap churn begins, ARC will shrink and your storage will suddenly “get slow” everywhere.
Decision: If the host is swapping, treat it as a performance incident. Fix memory pressure, reduce co-located workloads, or cap ARC intentionally.
Task 8: Check per-disk latency and queueing (OS-level)
cr0x@server:~$ iostat -x 2 2
avg-cpu: %user %nice %system %iowait %steal %idle
10.42 0.00 5.18 8.70 0.00 75.70
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await %util
sda 85.0 51.0 98.0 41.0 1680.0 7.2 42.0 48.0 32.0 96.0
sdb 82.0 49.0 96.0 40.0 1676.0 6.8 39.0 44.0 31.0 94.0
sdc 88.0 52.0 101.0 41.0 1672.0 6.9 40.0 46.0 31.0 95.0
sdd 18.0 55.0 22.0 39.0 1490.0 18.5 210.0 420.0 26.0 99.0
What it means: sdd has massive read await and deep queue. That’s your hot shard of misery. ZFS will wait for it because parity/mirror needs it.
Decision: Replace the disk, fix the path, or move the vdev to healthier hardware. Tuning ZFS won’t make a dying disk stop dying.
Task 9: Pull SMART/NVMe health and decide on proactive replacement
cr0x@server:~$ sudo smartctl -a /dev/sdd | egrep -i 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count|Temperature_Celsius'
5 Reallocated_Sector_Ct 0x0033 097 097 010 Pre-fail Always - 48
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 6
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 6
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 041 053 000 Old_age Always - 59
What it means: Pending and uncorrectable sectors plus high temperature: this drive is not “fine.” It’s negotiating with entropy.
Decision: Replace it. If it’s hot, fix airflow too. If multiple drives show similar trends, suspect chassis cooling, backplane, or batch quality.
Task 10: Check ZFS events for silent warnings
cr0x@server:~$ sudo zpool events -v | tail -n 20
TIME CLASS
Dec 24 2025 14:18:03.512410000 ereport.fs.zfs.io
vdev_path: /dev/disk/by-id/ata-WDC_WD140EDFZ-11A0VA0_9JH2ABCD
vdev_guid: 1234567890123456789
zio_err: 5
zio_offset: 8723412992
zio_size: 131072
zio_flags: 180880
parent_guid: 9876543210987654321
What it means: ZFS is telling you about IO errors with context. These events are often earlier than a full “DEGRADED” status.
Decision: Correlate events with SMART and kernel logs. Escalate if the event rate increases. A few IO errors can become a resilver fast.
Task 11: Validate ashift (the performance foot-gun you don’t notice until it hurts)
cr0x@server:~$ sudo zdb -C tank | egrep -n 'ashift|vdev_tree' | head
34: vdev_tree:
57: ashift: 12
What it means: ashift=12 implies 4K sectors. If you mistakenly created a pool with ashift=9 on 4K drives, you get read-modify-write amplification and sad latency.
Decision: If ashift is wrong, plan migration. You can’t fix it in place. This is why pool creation should be treated like schema design.
Task 12: Confirm special vdev usage (if present) and whether it’s saturating
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
special
nvme0n1 ONLINE 0 0 0
What it means: A special vdev exists. That means metadata (and possibly small blocks) live there. If it’s overloaded or fails, you can be in a world of pain.
Decision: Ensure special vdev has redundancy (mirror) and monitor its latency and wear. If it’s a single device, fix that design before it fixes you.
Task 13: Check snapshot load and retention creep
cr0x@server:~$ zfs list -t snapshot -o name,used,refer,creation | head -n 8
NAME USED REFER CREATION
tank/vm@auto-2025-12-25-0000 0B 40.9T Thu Dec 25 00:00 2025
tank/vm@auto-2025-12-24-0000 0B 40.9T Wed Dec 24 00:00 2025
tank/vm@auto-2025-12-23-0000 12.4G 40.9T Tue Dec 23 00:00 2025
tank/vm@auto-2025-12-22-0000 10.1G 40.9T Mon Dec 22 00:00 2025
tank/vm@auto-2025-12-21-0000 11.7G 40.9T Sun Dec 21 00:00 2025
tank/vm@auto-2025-12-20-0000 9.8G 40.9T Sat Dec 20 00:00 2025
tank/vm@auto-2025-12-19-0000 10.9G 40.9T Fri Dec 19 00:00 2025
What it means: Snapshot “USED” isn’t zero once blocks diverge. Snapshot sprawl quietly consumes space and can degrade performance.
Decision: If snapshots are eating the pool, fix retention. If compliance requires long retention, budget capacity accordingly and avoid storing hot churny datasets in the same pool.
Task 14: Spot replication lag via recent snapshot times (simple but effective)
cr0x@server:~$ zfs get -H -o value creation tank/vm@auto-2025-12-25-0000
Thu Dec 25 00:00 2025
What it means: If the latest snapshot is old, replication can’t be current either (unless you replicate without snapshots, which is its own risk).
Decision: If snapshot schedule is drifting, treat replication lag as an RPO incident. Fix the pipeline, not just the one failed run.
Fast diagnosis playbook: find the bottleneck in minutes
The goal is not to become a wizard. The goal is to avoid spending two hours debating whether it’s “the network” while the pool is 92% full.
Here’s the order that wins in production.
Step 1: Confirm you’re not already in a data-risk event
- Check: zpool status -v (errors, DEGRADED, scrub/resilver in progress)
- Check: zpool events -v | tail (new IO errors)
- Check: SMART for suspect devices (pending/uncorrectable sectors)
If there are checksum errors, stop chasing performance. You’re in integrity mode now.
Step 2: Identify whether the pain is latency, throughput, or sync semantics
- Check: zpool iostat -v 1 (who is hot? which vdev?)
- Check: iostat -x 1 (await, %util, queue depth per disk)
- Ask: Is the workload sync-heavy (NFS sync, databases, VM writes)?
Classic symptom: bandwidth looks “fine,” but p99 latency is awful and users complain. That’s queueing, contention, or a slow device.
Step 3: Check capacity and fragmentation pressure
- Check: zfs list and pool usage trends
- Check: snapshot count and growth
If the pool is very full, you can waste days “tuning” while the allocator keeps losing. Buy space first.
Step 4: Check cache effectiveness and memory pressure
- Check: ARC stats (hit/miss, size vs c_max)
- Check: free -h and swap activity
ARC thrash turns small random reads into disk seeks. Your dashboard should let you see that shift over weeks, not after the pager.
Step 5: Check “policy knobs” and recent changes
- Check: dataset properties (sync, recordsize, compression)
- Check: recent rollouts: kernel updates, controller firmware, new workloads, new NFS mount options
Most incidents are not mysterious. They’re just uncorrelated. Correlate them.
Three corporate mini-stories (what actually happens)
Mini-story #1: The incident caused by a wrong assumption (“ONLINE means healthy”)
A mid-sized SaaS company ran customer databases on ZFS-backed VM storage. The pool dashboard had one big green tile:
“tank: ONLINE.” Everyone felt good about it. When performance complaints arrived, the storage team pointed at that tile and said,
“It’s storage, it’s fine. It must be the hypervisors.”
The first real clue was a slow-motion increase in query times that didn’t correlate with CPU or network. Then the backup window started
leaking into business hours. Nothing dramatic—just a creeping sense that everything was heavier than it used to be.
The team eventually ran zpool status -v and found a non-zero checksum count on one disk that had been “stable” for weeks.
Scrubs were passing, but taking longer, and no one had been tracking duration.
The wrong assumption was that ZFS would “handle it” and they’d get paged when it mattered. But ZFS is conservative: it will retry,
correct, and keep serving. That kindness is exactly why you must monitor it. Those checksum counters were telling the story:
a drive path was intermittently corrupting reads, corrected by redundancy—until it wasn’t.
The fix was boring: replace the drive, inspect the backplane slot, and add alerting on checksum deltas (not just absolute count),
plus a “scrub duration” panel. The outcome was even more boring: the next time a path went flaky, they replaced it before customer-facing symptoms.
That is what success looks like in storage: nothing happens.
Mini-story #2: The optimization that backfired (“sync=disabled is free performance”)
An internal platform team had an NFS-backed CI system. It was “slow” during peak hours. Someone found a blog post and flipped
sync=disabled on the dataset, because the workload was “temporary build artifacts anyway.” The dashboard instantly looked better.
Latency dropped. Everybody celebrated. Somebody even suggested rolling the same change across more datasets.
A week later, a power event hit the rack. Not a full datacenter meltdown; just a UPS transfer and a couple of hosts that didn’t like it.
The CI system came back… and then jobs started failing with corrupted artifacts and missing files. The “temporary” nature of artifacts
turned out to be a lie: the system also cached dependency blobs and build outputs that were reused downstream.
ZFS did exactly what it was told: with sync disabled, it acknowledged writes before they were safely on stable storage.
The team had traded durability for speed without a formal risk decision. The outage wasn’t the power event. The outage was the policy choice.
The fix was twofold: set sync=standard again, and build a real solution:
a mirrored, power-loss-protected NVMe SLOG, plus instrumentation for sync write latency.
After that, the dashboard showed the real truth: the bottleneck was sync writes, not random magic performance.
Mini-story #3: The boring practice that saved the day (“scrub schedule + SMART trend alerts”)
A financial services shop ran a large archival pool. Nothing flashy. Mostly sequential reads, periodic writes, and strict retention.
The team had a habit that everyone found slightly annoying: monthly scrubs, with alerts not only for errors but also for “scrub time increased by 30%.”
They also tracked SMART trends for pending sectors and temperature.
One month, scrub duration jumped. No errors, just slower. The dashboard showed one vdev’s read latency p95 rising steadily,
while the rest stayed flat. SMART didn’t show catastrophic failure, but it did show increasing “Current_Pending_Sector” on a single disk.
Not enough to trigger vendor replacement by itself, but enough to trigger the team’s own policy.
They replaced the disk during business hours with no drama. A week later, the old disk failed hard in the test bench.
The team didn’t win an award. They also didn’t get paged at 03:00 with a degraded pool and a CEO who suddenly cares about parity math.
The practice wasn’t clever. It was consistent. Storage rewards consistency like a trained animal: feed it scrubs and trend alerts, and it behaves.
Common mistakes: symptom → root cause → fix
1) “Pool is ONLINE but users report random stalls”
Symptom: Short freezes, p99 latency spikes, especially under mixed load.
Root cause: One slow disk, controller hiccups, or a vdev imbalance causing queueing. Often visible in iostat -x as high await/%util on a single device.
Fix: Identify the hot/slow disk, check SMART and logs, reseat/replace hardware. Add per-disk latency panels, not just pool throughput.
2) “Everything got slower after we added more data”
Symptom: Gradual decline in performance, scrubs take longer, allocation errors appear near the end.
Root cause: Pool too full and/or heavily fragmented; snapshots consuming space; small free segments.
Fix: Get the pool back under sane utilization. Delete/expire snapshots responsibly, add capacity, rebalance data. Put “pool used %” on the front page with hard alerts.
3) “Sync writes are painfully slow”
Symptom: Databases/NFS/VMs show high commit latency; throughput looks fine for async workloads.
Root cause: No SLOG, weak SLOG, or SLOG without power-loss protection; also possible: sync=always on a dataset with no need.
Fix: Use a mirrored, PLP-capable SLOG for serious sync workloads. Monitor SLOG latency and errors. Don’t set sync=disabled to “fix” it.
4) “We’re seeing checksum errors but SMART looks okay”
Symptom: Non-zero CKSUM in zpool status, often increasing slowly.
Root cause: Cabling/backplane/controller issues, not necessarily the disk media. UDMA CRC errors may show up, but not always.
Fix: Swap cables/ports, update firmware, move the drive to a different bay/controller, then re-scrub. Treat checksum deltas as critical signals.
5) “L2ARC didn’t help; SSD wear is high”
Symptom: Little to no improvement in read latency; SSD write volume is large.
Root cause: Working set doesn’t fit or isn’t cache-friendly; L2ARC is being fed aggressively; metadata overhead burns RAM.
Fix: Verify hit ratio and feed rate. If it’s not helping, remove it. Spend budget on RAM or vdev layout before you buy placebo SSDs.
6) “Resilver takes forever and we feel exposed”
Symptom: Rebuilds measured in days; performance during resilver is degraded.
Root cause: Huge vdevs on slow disks, overloaded pool, or too few vdevs (not enough parallelism).
Fix: Redesign: more vdevs, mirror vdevs for random IO workloads, or smaller RAIDZ groups. Keep spares on-hand and test replacement procedures.
Joke #2: If your only storage alert is “pool FAULTED,” your monitoring is basically a smoke detector that only beeps after the house moved out.
Checklists / step-by-step plan
Dashboard checklist: the panels that matter
- Integrity panel: pool state, DEGRADED/FAULTED count, checksum/read/write errors (absolute and delta), recent ZFS events.
- Scrub & resilver panel: last scrub time, duration trend, errors found; resilver active flag, rate, ETA.
- Capacity panel: pool used %, free %, dataset usage, snapshot used space, quota/reservation outliers.
- Latency panel: read/write p50/p95/p99 per pool and per vdev; device await; %util; queue depth.
- Workload panel: IOPS and bandwidth per vdev; sync write rate if measurable; top talker datasets if you have per-dataset accounting.
- Cache panel: ARC size, hit ratio, eviction rate; L2ARC hit ratio and feed rate (if used).
- Device health panel: SMART trends: pending/uncorrectable/reallocated, CRC errors, temperature, NVMe wear.
- Data protection panel: snapshot freshness, replication lag, last successful replication/backup, last restore test date.
Operations checklist: weekly and monthly routines
- Weekly: review checksum deltas and ZFS event logs; investigate anything that moved.
- Weekly: review p99 latency trend by vdev; identify emerging hot spots.
- Weekly: review capacity headroom; confirm you’re not drifting toward 90%.
- Monthly: run a scrub (or ensure it ran); record duration and compare to baseline.
- Monthly: review SMART trend report and temperatures; fix airflow before replacing half the fleet.
- Quarterly: test a restore (file-level and dataset-level), document time-to-restore, and update runbooks.
- After any hardware change: verify device identifiers, ashift, and that alerts still map to the right disks.
Step-by-step: when you add a new pool (do this every time)
- Decide vdev layout based on workload: mirrors for random IO, RAIDZ for capacity and sequential-ish patterns.
- Confirm ashift expectations before creation. Treat this as irreversible.
- Define dataset policies up front: compression on (lz4), atime off for most, sane recordsize/volblocksize, sync standard (a combined creation example follows this list).
- Set scrub schedule and alerts before production data lands.
- Put SMART monitoring in place with trend-based alerting, not just “FAILED.”
- Baseline performance (latency under representative load) and keep it for comparisons.
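To make the first three items concrete, here is a hedged creation sketch; the pool name, layout, and /dev/disk/by-id paths are illustrative placeholders, not a recommendation for your hardware:
cr0x@server:~$ # ashift and dataset-wide defaults declared at creation time, not retrofitted later
cr0x@server:~$ sudo zpool create -o ashift=12 -O compression=lz4 -O atime=off tank raidz2 \
>     /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 /dev/disk/by-id/ata-DISK3 \
>     /dev/disk/by-id/ata-DISK4 /dev/disk/by-id/ata-DISK5 /dev/disk/by-id/ata-DISK6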
FAQ
1) What’s the single most important ZFS metric?
Checksum error deltas (and the events around them). Performance problems hurt; integrity problems end careers.
Track absolute counts and changes over time per device.
2) How often should I scrub?
For most production pools: monthly is a sane default. Very large or very busy pools may need tuning, but “never” is not a strategy.
If scrubs are too disruptive, fix scheduling and investigate why the pool can’t tolerate sequential reads.
3) Why does ZFS get slow when the pool is nearly full?
CoW allocations need contiguous-ish free space. As free space shrinks and fragments, ZFS works harder to allocate blocks, and IO becomes more random.
Latency spikes show up before you hit 100%. That’s why you alert early.
4) Is ARC hit ratio a reliable KPI?
It’s a diagnostic, not a KPI. A low hit ratio can be normal for streaming reads. Use it to explain latency behavior, not to “optimize a number.”
5) When should I use L2ARC?
When your working set is larger than RAM but still cacheable, and you have spare CPU and RAM for metadata overhead.
If you can buy more RAM, do that first in many cases.
6) Do I need a SLOG?
Only if you have significant sync writes and care about latency (databases, NFS with sync, VM storage).
If you add one, use power-loss-protected devices and mirror them for reliability.
7) Are checksum errors always a dying disk?
No. They’re often cabling, a bad HBA, firmware issues, or a flaky backplane slot. But the fix is still urgent: identify the path and eliminate corruption.
8) Should I alert on “pool ONLINE” changing only?
That’s the bare minimum, and it’s too late for comfort. Alert on checksum deltas, scrub duration anomalies, SMART trends, and vdev latency spikes.
“ONLINE” is the headline; the story is in the footnotes.
9) How do I tell if one vdev is the bottleneck?
Use zpool iostat -v and OS-level iostat -x to compare per-device await/%util. A single device pegged at high await with high util is a classic culprit.
If only one top-level vdev is hot, you may be under-provisioned on vdev count (not disk size).
10) What’s a reasonable alert threshold for disk temperature?
It depends on the drive class, but sustained temperatures in the high 50s °C are a bad sign for longevity.
Trend it: if temperature rises over weeks, you likely changed airflow or density.
Conclusion: next steps you can do this week
A ZFS health dashboard isn’t a single graph. It’s a set of opinions encoded as alerts: integrity first, then latency, then capacity, then optimization.
If you track only throughput and “ONLINE,” you’ll learn about problems when users teach you—loudly.
- Add front-page panels for: checksum deltas, scrub duration, pool used %, and p99 latency per vdev.
- Turn SMART into trend alerts: pending/uncorrectable, reallocated, temperature, and NVMe wear.
- Write one runbook page: what to do when checksum errors increase, and what to do when p99 latency spikes.
- Pick a scrub schedule and make it non-optional; alert when it doesn’t happen.
- Stop “performance fixes” that are actually durability tradeoffs unless you’ve made a formal risk decision.
Do those, and you’re no longer blind. You’ll still have problems—this is storage—but they’ll be problems you can see coming, quantify, and fix on your terms.