You can run ZFS for years and think you’re fine—right up until the day you discover the pool has been quietly collecting bad sectors,
your “fast” SSD SLOG is throttling to 5 MB/s, and every VM is waiting on a single angry vdev. ZFS doesn’t usually fail loudly.
It fails politely, with latency.
A ZFS health dashboard is not a vanity project. It’s how you stop arguing with opinions and start arguing with numbers.
If you’re not tracking the right metrics, you’re not “monitoring.” You’re staring at a screensaver while production burns.
What “healthy” means in ZFS (and what it doesn’t)
ZFS “health” is not just “pool is ONLINE.” That’s like declaring a patient healthy because they’re technically upright.
Real health is: data is correct, redundancy still exists, recovery time is bounded, and performance is predictable under load.
A useful dashboard answers five questions continuously:
- Is my data still protected? (Redundancy intact; no silent corruption accumulating; no scary checksum trends.)
- Am I about to lose protection? (Scrub finding increasing errors; SMART reallocations climbing; resilver risk growing.)
- Is the pool fast enough for the workload? (Latency distribution; queueing; sync write behavior; vdev balance.)
- Is the pool getting slower over time? (Fragmentation, full pool, metadata pressure, ARC thrash, special vdev saturation.)
- Can I recover quickly? (Resilver time; spare readiness; replication lag; scrub schedule adherence.)
What it does not mean: “CPU is low,” “network looks fine,” or “nobody has paged me yet.” ZFS can hide a lot of pain behind
buffering and asynchronous writes until a sync-heavy workload shows up and asks for the truth.
Facts and historical context (why these metrics exist)
- ZFS was born at Sun in the mid-2000s with the explicit goal of end-to-end data integrity—checksums on everything, not just metadata.
- “Scrub” isn’t a cute name: it exists because silent corruption is real, and disks lie convincingly until you read every block.
- ARC (Adaptive Replacement Cache) is central to ZFS performance; it’s not a bolt-on. ZFS is designed assuming a big, smart cache.
- L2ARC came later as a “second-level” read cache on fast devices, but it’s not magic: it needs RAM for metadata and can steal CPU.
- ZIL and SLOG are commonly misunderstood: the ZIL is always there; a SLOG is just a separate device to store the ZIL more safely/faster.
- Copy-on-write (CoW) trades rewrite-in-place simplicity for consistency. The price is fragmentation and the need to watch free space.
- “RAIDZ is not RAID” in operational behavior: rebuild (resilver) reads only allocated blocks, which can be faster, but still punishes slow disks.
- OpenZFS broadened the ecosystem across illumos, FreeBSD, Linux, and others; that’s why some commands and kstats differ by platform.
- Special vdevs (metadata/small blocks) are powerful but operationally sharp: lose the special vdev and you can lose the pool.
These are not trivia night facts. They explain why the dashboard needs to be biased toward integrity and latency, not just throughput.
The must-track metrics (and how to interpret them)
1) Pool state and error counters (integrity first)
Your top row should be brutally simple: pool state plus per-vdev and per-device error counts (read, write, checksum).
Checksum errors are the ones that should make you sit up straight.
Reads and writes can fail transiently; checksum errors mean data didn’t match what ZFS wrote.
Track:
- Pool state: ONLINE/DEGRADED/FAULTED
- Error counters per leaf device
- “errors: Permanent errors have been detected…” events
- Number of degraded/missing devices
Decision rule: any non-zero checksum error trend is an incident until proven otherwise. Not a ticket. An incident.
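If you want the same check as a dashboard feed instead of someone's terminal history, a minimal sketch is enough; this assumes a pool named tank and the standard zpool status table layout, and it only treats ONLINE/DEGRADED/FAULTED rows as device rows:
cr0x@server:~$ # sketch: print any vdev/device row whose READ/WRITE/CKSUM counters are non-zero
cr0x@server:~$ zpool status tank | awk '$2=="ONLINE" || $2=="DEGRADED" || $2=="FAULTED" { if (($3+$4+$5) > 0) print "ERRORS:", $1, "read="$3, "write="$4, "cksum="$5 }'
Feed the output into whatever pages you; the point is that a machine reads the counters every few minutes, not a human during an incident.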
2) Scrub metrics: last scrub, duration, found errors
Scrubs are your periodic truth serum. If you don’t scrub, you’re betting the business that disks will behave when it matters most.
A dashboard should show:
- Time since last scrub completed
- Scrub duration (and how it changes)
- Errors found during scrub
- Scrub rate (MB/s), especially on large pools
Pattern to watch: scrubs that take longer each month often correlate with a pool getting fuller, more fragmented, or a disk silently degrading.
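If your monitoring stack only ingests numbers, something like this turns "time since last scrub" into a metric. It is a sketch that assumes GNU date, the scan-line wording shown in the tasks below, and that the last scrub actually completed (an in-progress scrub prints a different scan line):
cr0x@server:~$ # pull the completion timestamp off the "scan:" line and convert it to days
cr0x@server:~$ last=$(zpool status tank | sed -n 's/.*with [0-9]* errors on //p' | head -n 1)
cr0x@server:~$ echo "$(( ( $(date +%s) - $(date -d "$last" +%s) ) / 86400 )) days since last completed scrub"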
3) Resilver metrics: start time, progress, estimated completion
Resilver time is operational risk. While resilvering, you have reduced redundancy and heightened exposure to a second failure.
Track:
- Resilver in progress
- Scan rate and ETA
- Number of resilvers in the last N days (frequent resilvers suggest marginal hardware)
Decision: if resilver time is “days,” you’re not running a storage system; you’re running a lottery kiosk. Adjust design: more vdevs, mirrors,
fewer huge RAIDZ groups, or faster disks, depending on constraints.
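The dashboard flag itself can be trivial; a hedged sketch that leans on the "resilver in progress" wording current OpenZFS prints:
cr0x@server:~$ # "RESILVER ACTIVE" means you are running with reduced redundancy right now
cr0x@server:~$ zpool status | grep -q 'resilver in progress' && echo "RESILVER ACTIVE" || echo "no resilver running"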
4) Capacity pressure: used %, free %, and, crucially, free space fragmentation
ZFS performance has a cliff. It’s not a myth. Past a certain fullness, allocations get harder, fragmentation grows, and latency spikes under load.
Track:
- Pool allocation percentage
- Dataset quotas/reservations (they can hide a real shortage)
- Free space fragmentation (pool-level)
Opinionated threshold: treat 80% pool usage as “yellow,” 90% as “red.”
Some pools survive beyond that, but you’re paying for it in latency.
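Both numbers are pool properties, so there is nothing to compute; a quick check using the standard zpool property names looks like this:
cr0x@server:~$ # human-readable overview: size, allocation, free space, percent used, fragmentation
cr0x@server:~$ zpool list -o name,size,allocated,free,capacity,fragmentation,health tank
cr0x@server:~$ # parsable variant for scripts and alert thresholds
cr0x@server:~$ zpool get -H -p -o name,property,value capacity,fragmentation tank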
5) Latency: not just average, but distribution and sync write latency
Throughput graphs soothe people. Latency graphs keep systems alive. You want:
- Read latency p50/p95/p99 per pool and per vdev
- Write latency p50/p95/p99
- Sync write latency (especially if you serve databases, NFS, or VM storage)
- Queue depth (device and ZFS pipeline)
If p99 is ugly, users will call it “random slowness.” You’ll call it “a Tuesday.”
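On reasonably recent OpenZFS you can get most of this from zpool itself; a hedged starting point before you wire up a full metrics exporter:
cr0x@server:~$ # average wait times (total, disk, sync/async queues) per vdev, refreshed every 5 seconds
cr0x@server:~$ sudo zpool iostat -v -l tank 5
cr0x@server:~$ # latency histograms, if you want percentiles instead of averages
cr0x@server:~$ sudo zpool iostat -w tank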
6) IOPS and bandwidth by vdev (imbalance tells you where the fire is)
ZFS pools are aggregates. The bottleneck is usually one vdev, one disk, or one device class (e.g., special vdev, SLOG).
Track:
- IOPS and bandwidth per top-level vdev
- Latency per vdev
- Disk busy percentage
Decision: if one vdev is consistently hotter, fix layout or workload placement. Don’t “tune” around a structural problem.
7) ARC metrics: size, hit ratio, evictions, and “is ARC fighting the OS?”
ARC can hide sins. Or expose them. You need:
- ARC size vs target
- Hit ratio (overall, and by data vs metadata if available)
- Eviction rate
- Memory pressure signals (swap activity, OOM risk)
Beware: a high ARC hit ratio can be meaningless if your workload is streaming. The real signal is: are you meeting latency SLOs?
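The counters live in /proc/spl/kstat/zfs/arcstats on Linux and are cumulative since boot, so dashboards should graph deltas. As a quick sanity check, a since-boot hit ratio can be derived like this (arc_summary, shipped with most OpenZFS packages, prints a friendlier report of the same data):
cr0x@server:~$ # since-boot ARC hit ratio from cumulative counters; deltas are better for trending
cr0x@server:~$ awk '/^hits /{h=$3} /^misses /{m=$3} END{printf "ARC hit ratio: %.2f%%\n", 100*h/(h+m)}' /proc/spl/kstat/zfs/arcstats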
8) L2ARC metrics (if used): hit ratio, feed rate, and write amplification
L2ARC helps read-heavy, cache-friendly workloads. It can also just burn SSD endurance while accomplishing nothing.
Track:
- L2ARC size and hit ratio
- “Feed” rate (how much you’re writing into it)
- SSD wear indicators (from SMART)
Decision: if L2ARC hit ratio is low and the feed rate is high, it’s probably an expensive placebo.
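The same arcstats file carries the L2ARC counters (field names can shift slightly between OpenZFS versions), so the placebo test is one line:
cr0x@server:~$ # L2ARC hits vs misses, plus bytes written into the cache device since boot
cr0x@server:~$ awk '/^l2_hits /{h=$3} /^l2_misses /{m=$3} /^l2_write_bytes /{w=$3} END{printf "l2 hits=%d misses=%d written=%.1f GiB\n", h, m, w/2^30}' /proc/spl/kstat/zfs/arcstats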
9) ZIL/SLOG metrics: sync write rate, SLOG latency, and device health
If you export NFS with sync, run databases, or host VM disks, ZIL behavior will define your worst day.
Track:
- Sync write IOPS
- SLOG device latency and errors
- SLOG utilization and contention
A SLOG that is “fine” until it hits thermal throttling is a classic outage recipe.
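If the pool has a log vdev, zpool iostat -v lists it under its own logs heading, so you can watch its queueing next to the data vdevs; a hedged example that pairs with the -l latency view shown earlier:
cr0x@server:~$ # active queue depths per device; sustained backlog on the log device means sync writes are waiting
cr0x@server:~$ sudo zpool iostat -v -q tank 5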
10) Dataset-level knobs that change reality: recordsize, volblocksize, compression, atime, sync
ZFS datasets are policy boundaries. Your dashboard should let you correlate performance with the knobs you set:
- recordsize (filesystems) and volblocksize (zvols)
- compression algorithm and ratio
- atime (often wasted writes)
- sync (standard/always/disabled)
Compression ratio is not just “space saved.” It’s also “less IO required” if CPU is available. Watch both.
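compressratio is a read-only dataset property, so you can put the outcome right next to the knobs that produced it; a quick view one level deep (dataset name taken from this article's examples):
cr0x@server:~$ # policy knobs and the resulting compression ratio, per dataset
cr0x@server:~$ zfs get -d 1 -o name,property,value recordsize,compression,compressratio,atime,sync tank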
11) SMART and device health: reallocated sectors, media errors, temperature, and power loss events
ZFS can correct some errors. It can’t correct a drive that is dying in slow motion while you ignore it.
Track:
- Reallocated sector count / media errors
- UDMA CRC errors (cables/backplanes matter)
- Temperature and thermal throttling flags
- NVMe percent used / wear level
Decision: replace drives based on trend, not heroics. Waiting for “FAILED” is how you end up resilvering during a holiday weekend.
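Trend alerts need the raw attributes collected on a schedule, not read by a human mid-incident. A minimal collection sketch for the SATA drives in this article's examples (attribute names and raw-value formats vary by vendor; NVMe devices need smartctl -a /dev/nvme0 or nvme smart-log instead):
cr0x@server:~$ # one block per disk, ready to be shipped into whatever time series you use
cr0x@server:~$ for d in /dev/sd[a-f]; do echo "== $d"; sudo smartctl -A "$d" | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count|Temperature_Celsius'; done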
12) Replication and backup signals: lag, last success, and “did we actually test restore?”
A health dashboard without replication/backup status is theater. Track:
- Last snapshot time and retention adherence
- Replication lag (time and bytes)
- Last successful send/receive
- Restore test recency (yes, it belongs on a dashboard)
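Since zfs can print creation times as raw epoch seconds, snapshot freshness (and, if you run the same check on the replication target, actual lag) is just subtraction; the dataset name below is the one used throughout this article:
cr0x@server:~$ # age of the newest snapshot in seconds; compare source vs target to get real lag
cr0x@server:~$ newest=$(zfs list -H -p -t snapshot -o creation -s creation tank/vm | tail -n 1)
cr0x@server:~$ echo "newest tank/vm snapshot is $(( $(date +%s) - newest )) seconds old"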
One paraphrased idea, often attributed to W. Edwards Deming, fits operations well: without data, you're just another person with an opinion.
Joke #1: ZFS is like a good accountant—everything is checksummed, and it will absolutely remember what you tried to “just ignore.”
Practical tasks: commands, output meaning, and decisions (12+)
These are tasks you can run today. The point isn’t the command. The point is what you do after reading the output.
Examples assume OpenZFS on Linux; adjust paths for your platform if needed.
Task 1: Confirm pool health and error counters
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 05:12:33 with 0 errors on Sun Dec 22 02:18:10 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
errors: No known data errors
What it means: “ONLINE” is necessary but not sufficient. The counters matter. Any non-zero CKSUM is a red flag; READ/WRITE too, but CKSUM is scarier.
Decision: If errors are non-zero, start an incident: identify the device, check cabling/backplane, run SMART, and plan replacement. If repeated CKSUM errors appear, don’t wait.
Task 2: Watch for slow or stalled scrub/resilver
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Mon Dec 23 02:10:03 2025
4.12T scanned at 610M/s, 2.01T issued at 298M/s, 18.3T total
0B repaired, 10.97% done, 0 days 15:22:10 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
errors: No known data errors
What it means: “scanned” vs “issued” rate can diverge; issued reflects actual IO. Long ETAs can indicate slow disks, contention, or throttling.
Decision: If scrub ETA balloons or issued rate collapses under normal load, investigate per-disk latency and errors. Consider scheduling scrubs off-peak and verifying scrub throttles.
Task 3: Check pool capacity and fragmentation risk
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint -r tank
NAME USED AVAIL REFER MOUNTPOINT
tank 72.4T 9.63T 128K /tank
tank/vm 41.1T 9.63T 40.9T /tank/vm
tank/backups 28.7T 9.63T 28.7T /tank/backups
tank/home 2.58T 9.63T 2.58T /tank/home
What it means: “AVAIL” is shared across datasets unless reservations are in play. A pool with ~88% used is already in “latency cliff” territory for many workloads.
Decision: If you’re trending toward 85–90% used, stop adding data. Buy time by expiring snapshots, moving cold data, adding vdevs, or expanding capacity. Don’t “tune” your way out.
Task 4: Surface dataset policies that change write behavior
cr0x@server:~$ zfs get -o name,property,value -s local,received recordsize,compression,atime,sync tank/vm
NAME PROPERTY VALUE
tank/vm recordsize 128K
tank/vm compression lz4
tank/vm atime off
tank/vm sync standard
What it means: These properties directly influence IO size and durability semantics. "sync=always" can crush latency on a weak SLOG; "sync=disabled" can fake performance by risking data loss.
Decision: Validate that properties match the workload. If you changed sync to “disabled” to “fix” performance, you didn’t fix performance—you changed the rules of physics.
Task 5: Measure real-time IO load and latency by vdev
cr0x@server:~$ sudo zpool iostat -v tank 2 3
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 72.4T 9.63T 3.12K 1.88K 612M 241M
raidz2-0 72.4T 9.63T 3.12K 1.88K 612M 241M
sda - - 520 310 98M 41M
sdb - - 510 305 96M 40M
sdc - - 540 312 101M 41M
sdd - - 105 328 22M 39M
sde - - 525 309 99M 40M
sdf - - 520 316 98M 40M
-------------------------- ----- ----- ----- ----- ----- -----
What it means: One disk (sdd) is doing far fewer reads but similar writes: that can indicate read errors, slow reads, or a path issue causing retries elsewhere.
Decision: Drill into that device: SMART, cabling, controller, and kernel logs. If the vdev is imbalanced, resilver and scrub times suffer, and so does your p99 latency.
Task 6: Check ARC behavior (Linux)
cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep '^(size|c_max|c_min|hits|misses|demand_data_hits|demand_data_misses) '
size 4 17179869184
c_min 4 4294967296
c_max 4 34359738368
hits 4 9140284432
misses 4 821739122
demand_data_hits 4 6021139981
demand_data_misses 4 611229880
What it means: ARC size, target max, and hit/miss counters give you cache efficiency. A low hit ratio under a random-read workload suggests you need more RAM or different layout.
Decision: If misses are high and latency is high, consider adding RAM before adding “fancy” SSD caches. If your workload is streaming, don’t overreact to misses; focus on throughput and write behavior.
Task 7: Confirm there’s no memory pressure hiding behind ARC
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 256Gi 118Gi 12Gi 2.0Gi 126Gi 136Gi
Swap: 0B 0B 0B
What it means: “available” memory matters. If available collapses and swap churn begins, ARC will shrink and your storage will suddenly “get slow” everywhere.
Decision: If the host is swapping, treat it as a performance incident. Fix memory pressure, reduce co-located workloads, or cap ARC intentionally.
Task 8: Check per-disk latency and queueing (OS-level)
cr0x@server:~$ iostat -x 2 2
avg-cpu: %user %nice %system %iowait %steal %idle
10.42 0.00 5.18 8.70 0.00 75.70
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await %util
sda 85.0 51.0 98.0 41.0 1680.0 7.2 42.0 48.0 32.0 96.0
sdb 82.0 49.0 96.0 40.0 1676.0 6.8 39.0 44.0 31.0 94.0
sdc 88.0 52.0 101.0 41.0 1672.0 6.9 40.0 46.0 31.0 95.0
sdd 18.0 55.0 22.0 39.0 1490.0 18.5 210.0 420.0 26.0 99.0
What it means: sdd has massive read await and deep queue. That’s your hot shard of misery. ZFS will wait for it because parity/mirror needs it.
Decision: Replace the disk, fix the path, or move the vdev to healthier hardware. Tuning ZFS won’t make a dying disk stop dying.
Task 9: Pull SMART/NVMe health and decide on proactive replacement
cr0x@server:~$ sudo smartctl -a /dev/sdd | egrep -i 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count|Temperature_Celsius'
5 Reallocated_Sector_Ct 0x0033 097 097 010 Pre-fail Always - 48
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 6
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 6
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 041 053 000 Old_age Always - 59
What it means: Pending and uncorrectable sectors plus high temperature: this drive is not “fine.” It’s negotiating with entropy.
Decision: Replace it. If it’s hot, fix airflow too. If multiple drives show similar trends, suspect chassis cooling, backplane, or batch quality.
Task 10: Check ZFS events for silent warnings
cr0x@server:~$ sudo zpool events -v | tail -n 20
TIME CLASS
Dec 24 2025 14:18:03.512410000 ereport.fs.zfs.io
vdev_path: /dev/disk/by-id/ata-WDC_WD140EDFZ-11A0VA0_9JH2ABCD
vdev_guid: 1234567890123456789
zio_err: 5
zio_offset: 8723412992
zio_size: 131072
zio_flags: 180880
parent_guid: 9876543210987654321
What it means: ZFS is telling you about IO errors with context. These events are often earlier than a full “DEGRADED” status.
Decision: Correlate events with SMART and kernel logs. Escalate if the event rate increases. A few IO errors can become a resilver fast.
Task 11: Validate ashift (the performance foot-gun you don’t notice until it hurts)
cr0x@server:~$ sudo zdb -C tank | egrep -n 'ashift|vdev_tree' | head
34: vdev_tree:
57: ashift: 12
What it means: ashift=12 implies 4K sectors. If you mistakenly created a pool with ashift=9 on 4K drives, you get read-modify-write amplification and sad latency.
Decision: If ashift is wrong, plan migration. You can’t fix it in place. This is why pool creation should be treated like schema design.
Task 12: Confirm special vdev usage (if present) and whether it’s saturating
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
special
nvme0n1 ONLINE 0 0 0
What it means: A special vdev exists. That means metadata (and possibly small blocks) live there. If it’s overloaded or fails, you can be in a world of pain.
Decision: Ensure special vdev has redundancy (mirror) and monitor its latency and wear. If it’s a single device, fix that design before it fixes you.
Task 13: Check snapshot load and retention creep
cr0x@server:~$ zfs list -t snapshot -o name,used,refer,creation | head -n 8
NAME USED REFER CREATION
tank/vm@auto-2025-12-25-0000 0B 40.9T Thu Dec 25 00:00 2025
tank/vm@auto-2025-12-24-0000 0B 40.9T Wed Dec 24 00:00 2025
tank/vm@auto-2025-12-23-0000 12.4G 40.9T Tue Dec 23 00:00 2025
tank/vm@auto-2025-12-22-0000 10.1G 40.9T Mon Dec 22 00:00 2025
tank/vm@auto-2025-12-21-0000 11.7G 40.9T Sun Dec 21 00:00 2025
tank/vm@auto-2025-12-20-0000 9.8G 40.9T Sat Dec 20 00:00 2025
tank/vm@auto-2025-12-19-0000 10.9G 40.9T Fri Dec 19 00:00 2025
What it means: Snapshot “USED” isn’t zero once blocks diverge. Snapshot sprawl quietly consumes space and can degrade performance.
Decision: If snapshots are eating the pool, fix retention. If compliance requires long retention, budget capacity accordingly and avoid storing hot churny datasets in the same pool.
Task 14: Spot replication lag via recent snapshot times (simple but effective)
cr0x@server:~$ zfs get -H -o value creation tank/vm@auto-2025-12-25-0000
Thu Dec 25 00:00 2025
What it means: If the latest snapshot is old, replication can’t be current either (unless you replicate without snapshots, which is its own risk).
Decision: If snapshot schedule is drifting, treat replication lag as an RPO incident. Fix the pipeline, not just the one failed run.
Fast diagnosis playbook: find the bottleneck in minutes
The goal is not to become a wizard. The goal is to avoid spending two hours debating whether it’s “the network” while the pool is 92% full.
Here’s the order that wins in production.
Step 1: Confirm you’re not already in a data-risk event
- Check: zpool status -v (errors, DEGRADED, scrub/resilver in progress)
- Check: zpool events -v | tail (new IO errors)
- Check: SMART for suspect devices (pending/uncorrectable sectors)
If there are checksum errors, stop chasing performance. You’re in integrity mode now.
Step 2: Identify whether the pain is latency, throughput, or sync semantics
- Check: zpool iostat -v 1 (who is hot? which vdev?)
- Check: iostat -x 1 (await, %util, queue depth per disk)
- Ask: Is the workload sync-heavy (NFS sync, databases, VM writes)?
Classic symptom: bandwidth looks “fine,” but p99 latency is awful and users complain. That’s queueing, contention, or a slow device.
Step 3: Check capacity and fragmentation pressure
- Check: zfs list and pool usage trends
- Check: snapshot count and growth
If the pool is very full, you can waste days “tuning” while the allocator keeps losing. Buy space first.
Step 4: Check cache effectiveness and memory pressure
- Check: ARC stats (hit/miss, size vs c_max)
- Check: free -h and swap activity
ARC thrash turns small random reads into disk seeks. Your dashboard should let you see that shift over weeks, not after the pager.
Step 5: Check “policy knobs” and recent changes
- Check: dataset properties (sync, recordsize, compression)
- Check: recent rollouts: kernel updates, controller firmware, new workloads, new NFS mount options
Most incidents are not mysterious. They’re just uncorrelated. Correlate them.
Three corporate mini-stories (what actually happens)
Mini-story #1: The incident caused by a wrong assumption (“ONLINE means healthy”)
A mid-sized SaaS company ran customer databases on ZFS-backed VM storage. The pool dashboard had one big green tile:
“tank: ONLINE.” Everyone felt good about it. When performance complaints arrived, the storage team pointed at that tile and said,
“It’s storage, it’s fine. It must be the hypervisors.”
The first real clue was a slow-motion increase in query times that didn’t correlate with CPU or network. Then the backup window started
leaking into business hours. Nothing dramatic—just a creeping sense that everything was heavier than it used to be.
The team eventually ran zpool status -v and found a non-zero checksum count on one disk that had been “stable” for weeks.
Scrubs were passing, but taking longer, and no one had been tracking duration.
The wrong assumption was that ZFS would “handle it” and they’d get paged when it mattered. But ZFS is conservative: it will retry,
correct, and keep serving. That kindness is exactly why you must monitor it. Those checksum counters were telling the story:
a drive path was intermittently corrupting reads, corrected by redundancy—until it wasn’t.
The fix was boring: replace the drive, inspect the backplane slot, and add alerting on checksum deltas (not just absolute count),
plus a “scrub duration” panel. The outcome was even more boring: the next time a path went flaky, they replaced it before customer-facing symptoms.
That is what success looks like in storage: nothing happens.
Mini-story #2: The optimization that backfired (“sync=disabled is free performance”)
An internal platform team had an NFS-backed CI system. It was “slow” during peak hours. Someone found a blog post and flipped
sync=disabled on the dataset, because the workload was “temporary build artifacts anyway.” The dashboard instantly looked better.
Latency dropped. Everybody celebrated. Somebody even suggested rolling the same change across more datasets.
A week later, a power event hit the rack. Not a full datacenter meltdown; just a UPS transfer and a couple of hosts that didn’t like it.
The CI system came back… and then jobs started failing with corrupted artifacts and missing files. The “temporary” nature of artifacts
turned out to be a lie: the system also cached dependency blobs and build outputs that were reused downstream.
ZFS did exactly what it was told: with sync disabled, it acknowledged writes before they were safely on stable storage.
The team had traded durability for speed without a formal risk decision. The outage wasn’t the power event. The outage was the policy choice.
The fix was twofold: set sync=standard again, and build a real solution:
a mirrored, power-loss-protected NVMe SLOG, plus instrumentation for sync write latency.
After that, the dashboard showed the real truth: the bottleneck was sync writes, not random magic performance.
Mini-story #3: The boring practice that saved the day (“scrub schedule + SMART trend alerts”)
A financial services shop ran a large archival pool. Nothing flashy. Mostly sequential reads, periodic writes, and strict retention.
The team had a habit that everyone found slightly annoying: monthly scrubs, with alerts not only for errors but also for “scrub time increased by 30%.”
They also tracked SMART trends for pending sectors and temperature.
One month, scrub duration jumped. No errors, just slower. The dashboard showed one vdev’s read latency p95 rising steadily,
while the rest stayed flat. SMART didn’t show catastrophic failure, but it did show increasing “Current_Pending_Sector” on a single disk.
Not enough to trigger vendor replacement by itself, but enough to trigger the team’s own policy.
They replaced the disk during business hours with no drama. A week later, the old disk failed hard in the test bench.
The team didn’t win an award. They also didn’t get paged at 03:00 with a degraded pool and a CEO who suddenly cares about parity math.
The practice wasn’t clever. It was consistent. Storage rewards consistency like a trained animal: feed it scrubs and trend alerts, and it behaves.
Common mistakes: symptom → root cause → fix
1) “Pool is ONLINE but users report random stalls”
Symptom: Short freezes, p99 latency spikes, especially under mixed load.
Root cause: One slow disk, controller hiccups, or a vdev imbalance causing queueing. Often visible in iostat -x as high await/%util on a single device.
Fix: Identify the hot/slow disk, check SMART and logs, reseat/replace hardware. Add per-disk latency panels, not just pool throughput.
2) “Everything got slower after we added more data”
Symptom: Gradual decline in performance, scrubs take longer, allocation errors appear near the end.
Root cause: Pool too full and/or heavily fragmented; snapshots consuming space; small free segments.
Fix: Get the pool back under sane utilization. Delete/expire snapshots responsibly, add capacity, rebalance data. Put “pool used %” on the front page with hard alerts.
3) “Sync writes are painfully slow”
Symptom: Databases/NFS/VMs show high commit latency; throughput looks fine for async workloads.
Root cause: No SLOG, weak SLOG, or SLOG without power-loss protection; also possible: sync=always on a dataset with no need.
Fix: Use a mirrored, PLP-capable SLOG for serious sync workloads. Monitor SLOG latency and errors. Don’t set sync=disabled to “fix” it.
4) “We’re seeing checksum errors but SMART looks okay”
Symptom: Non-zero CKSUM in zpool status, often increasing slowly.
Root cause: Cabling/backplane/controller issues, not necessarily the disk media. UDMA CRC errors may show up, but not always.
Fix: Swap cables/ports, update firmware, move the drive to a different bay/controller, then re-scrub. Treat checksum deltas as critical signals.
5) “L2ARC didn’t help; SSD wear is high”
Symptom: Little to no improvement in read latency; SSD write volume is large.
Root cause: Working set doesn’t fit or isn’t cache-friendly; L2ARC is being fed aggressively; metadata overhead burns RAM.
Fix: Verify hit ratio and feed rate. If it’s not helping, remove it. Spend budget on RAM or vdev layout before you buy placebo SSDs.
6) “Resilver takes forever and we feel exposed”
Symptom: Rebuilds measured in days; performance during resilver is degraded.
Root cause: Huge vdevs on slow disks, overloaded pool, or too few vdevs (not enough parallelism).
Fix: Redesign: more vdevs, mirror vdevs for random IO workloads, or smaller RAIDZ groups. Keep spares on-hand and test replacement procedures.
Joke #2: If your only storage alert is “pool FAULTED,” your monitoring is basically a smoke detector that only beeps after the house moved out.
Checklists / step-by-step plan
Dashboard checklist: the panels that matter
- Integrity panel: pool state, DEGRADED/FAULTED count, checksum/read/write errors (absolute and delta), recent ZFS events.
- Scrub & resilver panel: last scrub time, duration trend, errors found; resilver active flag, rate, ETA.
- Capacity panel: pool used %, free %, dataset usage, snapshot used space, quota/reservation outliers.
- Latency panel: read/write p50/p95/p99 per pool and per vdev; device await; %util; queue depth.
- Workload panel: IOPS and bandwidth per vdev; sync write rate if measurable; top talker datasets if you have per-dataset accounting.
- Cache panel: ARC size, hit ratio, eviction rate; L2ARC hit ratio and feed rate (if used).
- Device health panel: SMART trends: pending/uncorrectable/reallocated, CRC errors, temperature, NVMe wear.
- Data protection panel: snapshot freshness, replication lag, last successful replication/backup, last restore test date.
Operations checklist: weekly and monthly routines
- Weekly: review checksum deltas and ZFS event logs; investigate anything that moved.
- Weekly: review p99 latency trend by vdev; identify emerging hot spots.
- Weekly: review capacity headroom; confirm you’re not drifting toward 90%.
- Monthly: run a scrub (or ensure it ran); record duration and compare to baseline.
- Monthly: review SMART trend report and temperatures; fix airflow before replacing half the fleet.
- Quarterly: test a restore (file-level and dataset-level), document time-to-restore, and update runbooks.
- After any hardware change: verify device identifiers, ashift, and that alerts still map to the right disks.
Step-by-step: when you add a new pool (do this every time)
- Decide vdev layout based on workload: mirrors for random IO, RAIDZ for capacity and sequential-ish patterns.
- Confirm ashift expectations before creation. Treat this as irreversible.
- Define dataset policies up front: compression on (lz4), atime off for most, sane recordsize/volblocksize, sync standard (a combined creation example follows this list).
- Set scrub schedule and alerts before production data lands.
- Put SMART monitoring in place with trend-based alerting, not just “FAILED.”
- Baseline performance (latency under representative load) and keep it for comparisons.
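To make the first three items concrete, here is a hedged creation sketch; the pool name, layout, and /dev/disk/by-id paths are illustrative placeholders, not a recommendation for your hardware:
cr0x@server:~$ # ashift and dataset-wide defaults declared at creation time, not retrofitted later
cr0x@server:~$ sudo zpool create -o ashift=12 -O compression=lz4 -O atime=off tank raidz2 \
>     /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 /dev/disk/by-id/ata-DISK3 \
>     /dev/disk/by-id/ata-DISK4 /dev/disk/by-id/ata-DISK5 /dev/disk/by-id/ata-DISK6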
FAQ
1) What’s the single most important ZFS metric?
Checksum error deltas (and the events around them). Performance problems hurt; integrity problems end careers.
Track absolute counts and changes over time per device.
2) How often should I scrub?
For most production pools: monthly is a sane default. Very large or very busy pools may need tuning, but “never” is not a strategy.
If scrubs are too disruptive, fix scheduling and investigate why the pool can’t tolerate sequential reads.
3) Why does ZFS get slow when the pool is nearly full?
CoW allocations need contiguous-ish free space. As free space shrinks and fragments, ZFS works harder to allocate blocks, and IO becomes more random.
Latency spikes show up before you hit 100%. That’s why you alert early.
4) Is ARC hit ratio a reliable KPI?
It’s a diagnostic, not a KPI. A low hit ratio can be normal for streaming reads. Use it to explain latency behavior, not to “optimize a number.”
5) When should I use L2ARC?
When your working set is larger than RAM but still cacheable, and you have spare CPU and RAM for metadata overhead.
If you can buy more RAM, do that first in many cases.
6) Do I need a SLOG?
Only if you have significant sync writes and care about latency (databases, NFS with sync, VM storage).
If you add one, use power-loss-protected devices and mirror them for reliability.
7) Are checksum errors always a dying disk?
No. They’re often cabling, a bad HBA, firmware issues, or a flaky backplane slot. But the fix is still urgent: identify the path and eliminate corruption.
8) Should I alert on “pool ONLINE” changing only?
That’s the bare minimum, and it’s too late for comfort. Alert on checksum deltas, scrub duration anomalies, SMART trends, and vdev latency spikes.
“ONLINE” is the headline; the story is in the footnotes.
9) How do I tell if one vdev is the bottleneck?
Use zpool iostat -v and OS-level iostat -x to compare per-device await/%util. A single device pegged at high await with high util is a classic culprit.
If only one top-level vdev is hot, you may be under-provisioned on vdev count (not disk size).
10) What’s a reasonable alert threshold for disk temperature?
It depends on the drive class, but sustained temperatures in the high 50s °C are a bad sign for longevity.
Trend it: if temperature rises over weeks, you likely changed airflow or density.
Conclusion: next steps you can do this week
A ZFS health dashboard isn’t a single graph. It’s a set of opinions encoded as alerts: integrity first, then latency, then capacity, then optimization.
If you track only throughput and “ONLINE,” you’ll learn about problems when users teach you—loudly.
- Add front-page panels for: checksum deltas, scrub duration, pool used %, and p99 latency per vdev.
- Turn SMART into trend alerts: pending/uncorrectable, reallocated, temperature, and NVMe wear.
- Write one runbook page: what to do when checksum errors increase, and what to do when p99 latency spikes.
- Pick a scrub schedule and make it non-optional; alert when it doesn’t happen.
- Stop “performance fixes” that are actually durability tradeoffs unless you’ve made a formal risk decision.
Do those, and you’re no longer blind. You’ll still have problems—this is storage—but they’ll be problems you can see coming, quantify, and fix on your terms.