You don’t usually “discover” a ZFS pool crash. The pool introduces itself—slowly—through weird latency spikes, a scrub that suddenly takes forever, or one innocent-looking counter that increments exactly once and then keeps you awake.
The trick is treating ZFS like the production system it is: watch the few signals that actually predict failure, not the 47 dashboards that only confirm it afterward.
The 3 metrics that actually predict a crash
ZFS failures in the real world rarely look like the cartoons. You don’t always get a neat “disk died, replace disk.” More often you get:
- A slow-motion I/O disaster that causes timeouts in apps, followed by resets and then a device dropping.
- Silent corruption detection (checksums) that starts small and becomes “why are we rewriting half the pool?”
- Capacity pressure that turns normal writes into an allocation nightmare, which turns into latency, which turns into panic.
So here are the three metrics I care about because they tend to move before the pool is on fire:
1) Error counters that change over time
Not the fact that you have a nonzero number once in history. The slope. If READ/WRITE/CKSUM counters move between scrubs, between reboots, or between two “known good” points, the pool is telling you it’s losing its contract with reality.
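To make “the slope” concrete, here’s a minimal sketch that diffs per-device counters between two saved zpool status captures. The device names, temp paths, and sample captures are invented for illustration; in real use the captures come from a cron job saving zpool status output.

```shell
#!/bin/sh
# Sketch: flag per-device error-counter movement between two saved
# `zpool status` captures. Device names and paths are made up.
set -eu

# Pull "device read write cksum" rows out of a capture.
extract() {
  awk '$1 ~ /^(ata-|nvme-|sd)/ && NF >= 5 { print $1, $3, $4, $5 }' "$1"
}

# Tiny stand-in captures; real ones would be full `zpool status tank` output.
cat > /tmp/base.txt <<'EOF'
        ata-WDC_A ONLINE 0 0 0
        ata-WDC_B ONLINE 0 0 0
EOF
cat > /tmp/curr.txt <<'EOF'
        ata-WDC_A ONLINE 0 0 2
        ata-WDC_B ONLINE 0 0 0
EOF

extract /tmp/base.txt | sort > /tmp/base.rows
extract /tmp/curr.txt | sort > /tmp/curr.rows

# Join on device name; report only rows where any counter moved.
slope=$(join /tmp/base.rows /tmp/curr.rows | awk '
  $2 != $5 || $3 != $6 || $4 != $7 {
    printf "SLOPE %s READ %s->%s WRITE %s->%s CKSUM %s->%s\n",
           $1, $2, $5, $3, $6, $4, $7 }')
echo "$slope"
```

The point of the sketch is the comparison, not the parsing: alert on movement between two known points, never on “nonzero once in history.”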
2) Latency + queue pressure at the pool/vdev layer
Throughput is a vanity metric. Latency is the truth. Queue growth and long service times are how you spot an impending device drop, a thrashing pool, or a design that was “fine in staging.”
3) Capacity headroom and fragmentation
ZFS can tolerate a lot, but not a pool that’s near full while also fragmented and being asked to do random writes. The crash predictor here isn’t “80% used.” It’s “we’re on the wrong side of the curve and scrubs/resilvers now take forever.”
One paraphrased idea from Deming (reliability people quote him constantly for a reason): without data, you’re just another person with an opinion. It’s a paraphrase, but the point stands: collect the right data.
One joke (1/2): ZFS doesn’t “randomly fail.” It just waits until you’re on vacation and then files a detailed complaint in the form of latency.
Metric 1: Error counters that move (READ/WRITE/CKSUM)
ZFS gives you a gift most storage stacks don’t: end-to-end checksumming with explicit error accounting. But you have to read it like an operator, not like a tourist.
What counts as “predictive” here
- New checksum errors (CKSUM) on a single device, slowly increasing: often cabling, backplane, controller, firmware, or marginal media.
- Read errors increasing: the device couldn’t return data (or couldn’t do so within the driver’s patience). If this moves, the disk is auditioning for replacement.
- Write errors increasing: can be device, but also HBA/controller paths, power issues, or timeouts under load.
- Errors tied to specific blocks that reappear after a scrub: now you’re looking at persistent corruption or a “fix” that didn’t fix.
How crashes happen from “just a few errors”
A pool crash, in practice, is often a cascade:
- Device starts returning occasional errors or taking too long.
- ZFS retries; queue grows; latency spikes; application timeouts begin.
- Driver resets the device; ZFS marks it FAULTED or DEGRADED.
- Resilver starts; load increases; other marginal devices get stressed.
- Now you’re one more blip away from losing redundancy.
That’s why “it’s only 3 checksum errors” is not a comforting sentence. It’s a sentence that should trigger investigation and trend analysis, because small counts are often the opening credits.
What to do with nonzero counters
You decide based on persistence, locality, and correlation:
- Persistence: Do errors keep increasing after a scrub and after reseating cables?
- Locality: Is it one device or several? One path suggests hardware. Many suggests controller, backplane, firmware, or power.
- Correlation: Do errors line up with load spikes, temperature, or a specific host?
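The locality test can be mechanized: count how many devices show a nonzero CKSUM column in a capture. The sample rows and the “more than one device” threshold below are illustrative.

```shell
#!/bin/sh
# Sketch: locality check -- one noisy device vs many. Counts devices with
# a nonzero CKSUM column in a saved `zpool status` capture (sample below).
set -eu
cat > /tmp/status.txt <<'EOF'
        ata-WDC_A ONLINE 0 0 2
        ata-WDC_B ONLINE 0 0 1
        ata-WDC_C ONLINE 0 0 0
EOF
noisy=$(awk '$1 ~ /^(ata-|nvme-|sd)/ && $5 > 0 { n++ } END { print n+0 }' /tmp/status.txt)
if [ "$noisy" -gt 1 ]; then
  verdict="many devices ($noisy): suspect shared path (HBA/backplane/power)"
else
  verdict="single device: suspect that device or its cabling"
fi
echo "$verdict"
```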
Metric 2: Latency and queue pressure (the pool is “breathing hard”)
When ZFS pools “crash,” they’re often not actually dead. They’re just stuck in pathological latency: everything technically works, but nothing completes in time for your apps. That’s operationally indistinguishable from down.
What latency signals look like in ZFS
You want to watch the pool and vdev level, not just per-disk IOPS:
- High await / long service times during normal workloads.
- Queue depth rising (requests piling up because the pool can’t drain).
- Scrub/resilver taking dramatically longer than historical baselines.
- Sync write pressure (especially with misdesigned SLOG or no SLOG for sync-heavy workloads).
Why this predicts device drops
Timeouts and resets are often a latency story first. Many “disk failures” are “disk became too slow and the OS gave up.” That distinction matters because the fix changes:
- If a single disk is slow: replace it.
- If all disks get slow at once: look at HBA, expander, firmware, PCIe errors, saturation, ARC pressure, or too-full pool behavior.
- If only sync writes are slow: look at SLOG, sync settings, and workload semantics.
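One way to turn “one disk slow vs all disks slow” into a check: compare each device’s write latency against the median of its peers. The device names and latencies below are invented; in real use you’d feed this parsed zpool iostat -v -l output.

```shell
#!/bin/sh
# Sketch: flag a device whose write latency is a large multiple of the
# median of its peers. Rows mimic `zpool iostat -v -l` device lines;
# names and values are illustrative.
set -eu
cat > /tmp/lat.txt <<'EOF'
ata-A 12.2
ata-B 12.4
ata-C 12.6
ata-D 85.0
ata-E 12.5
EOF
suspect=$(sort -k2 -n /tmp/lat.txt | awk '
  { dev[NR] = $1; lat[NR] = $2 }
  END {
    med = lat[int((NR + 1) / 2)]       # median of the sorted latencies
    for (i = 1; i <= NR; i++)
      if (lat[i] > 3 * med)            # >3x median = outlier, per this sketch
        print dev[i]
  }')
echo "$suspect"
```

The 3x-median threshold is a starting point, not a law; tune it against your own baselines.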
One joke (2/2): Queue depth is like email: the longer it gets, the less likely anything important is getting answered.
Metric 3: Capacity headroom + fragmentation (metaslabs and the cliff)
ZFS performance is not linear with fullness. The pool has a phase change: allocation gets harder, free space gets chopped into smaller pieces, and writes become more expensive. The pool doesn’t need to be at 99% to behave like it’s dying.
Headroom isn’t a vibe; it’s a control knob
If you operate ZFS pools in production, you should have an explicit headroom policy. My opinionated baseline:
- General mixed workloads: keep under ~80% used.
- Random write heavy, metadata heavy, snapshots everywhere: aim for under ~70% used.
- Spinning-rust (HDD) vdevs with small blocks: be stricter; fragmentation hurts you earlier.
Those aren’t laws of physics; they’re guardrails. The real signal is trend: when allocation slows and fragmentation rises, you’re past the comfortable zone.
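A headroom policy only works if something enforces it. Here’s a trivial sketch of the check; the 82% figure is a stand-in for what you’d read from something like zpool list -Hp -o capacity.

```shell
#!/bin/sh
# Sketch: enforce a headroom policy. The cap value is a stand-in; on a real
# host it would come from `zpool list` output for the pool in question.
set -eu
cap=82        # illustrative current usage percentage
limit=80      # policy guardrail from the baselines above
if [ "$cap" -ge "$limit" ]; then
  status="ALERT: pool at ${cap}% used, policy limit ${limit}%"
else
  status="OK: ${cap}% used"
fi
echo "$status"
```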
Fragmentation and why it becomes “crash-like”
ZFS allocates from metaslabs. As free space becomes fragmented, allocations require more seeking, more metadata work, more I/O operations, and longer transaction group (TXG) commit times. Symptoms include:
- Write latency climbing even when throughput looks “fine.”
- Scrubs/resilvers stretching from hours to days.
- Bursty stalls where apps pause during TXG sync.
Capacity problems also amplify other risks: resilvers on near-full pools take longer, increasing exposure time where you’re one disk away from data loss.
Fast diagnosis playbook
This is the order I use when an app says “storage is slow” or when monitoring shows pool health degradation. It’s designed to find the bottleneck quickly, not to satisfy curiosity.
First: is the pool logically healthy or already bleeding?
- Check zpool status for DEGRADED/FAULTED, error counters, and any ongoing resilver/scrub.
- If error counters are moving: treat as an incident, not a tuning session.
Second: is this latency/queue pressure (systemic) or one bad actor (a single vdev/device)?
- Use zpool iostat -v to identify which vdev is slow, not just which dataset is busy.
- Correlate with iostat and kernel logs for resets/timeouts.
Third: is capacity/fragmentation the hidden culprit?
- Check pool allocation and fragmentation indicators. If you’re near the cliff, performance “bugs” are usually physics.
- Validate snapshot growth and small-block workloads (VMs, databases, mail spools) that accelerate fragmentation.
Fourth: confirm the workload semantics (sync, recordsize, special vdevs)
- Sync-heavy NFS? VM images with tiny writes? A “clever” SLOG? Special vdev? These shape the I/O path and failure modes.
- Don’t guess. Pull dataset properties and observe actual I/O patterns.
Practical tasks: commands, outputs, and decisions
Below are hands-on tasks you can run today. Each one includes: command, example output, what it means, and what decision to make. These are the boring moves that keep your pool alive.
Task 1: Read pool health and error slope
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
scan: scrub repaired 0B in 03:18:22 with 0 errors on Sun Feb 4 01:10:43 2026
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 2
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
errors: No known data errors
What it means: One disk has 2 checksum errors. ZFS corrected them and reports no known data errors. That’s not “fine”; it’s “investigate.”
Decision: Check if the CKSUM count increases over time. If it increments again after a scrub or after reseating cables, plan to replace the disk and inspect cabling/backplane.
Task 2: Clear errors only after you’ve captured evidence
cr0x@server:~$ sudo zpool clear tank
What it means: Counters reset. This is useful for trend measurement, but it also destroys your before/after comparison if you didn’t record it.
Decision: Only clear after you’ve captured zpool status output (ticket/notes) and you’re ready to observe whether errors recur.
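The “capture before clear” habit is easy to script. The zpool function below is a stand-in so the sketch runs on any machine, and the evidence directory is hypothetical; on a real host, delete the stub and let the real command run.

```shell
#!/bin/sh
# Sketch: archive evidence before `zpool clear`. The zpool function is a
# stand-in so this runs anywhere; remove it on a real host.
set -eu
zpool() { echo "pool: tank (stand-in status output)"; }
dir=/tmp/zfs-evidence                  # hypothetical evidence directory
mkdir -p "$dir"
stamp=$(date +%Y%m%d-%H%M%S)
zpool status -v tank > "$dir/status-$stamp.txt"
echo "saved $dir/status-$stamp.txt"
# Only after the capture is safely stored:
# zpool clear tank
```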
Task 3: Confirm whether a scrub is running and how it behaves
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Mon Feb 3 00:12:10 2026
3.42T scanned at 1.21G/s, 1.88T issued at 684M/s, 8.10T total
0B repaired, 23.19% done, 02:38:56 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
errors: No known data errors
What it means: Scrub is progressing quickly. Compare scan/issue rates to previous scrubs.
Decision: If scrub speed suddenly halves (or worse) without an intentional workload change, investigate latency and fragmentation; it’s an early warning.
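Scrub-duration trending only needs the HH:MM:SS from the scan line. The scan line and the 5400-second baseline below are illustrative.

```shell
#!/bin/sh
# Sketch: convert a `zpool status` scan line into seconds and compare to a
# recorded baseline. The line and the baseline value are illustrative.
set -eu
line='scrub repaired 0B in 03:18:22 with 0 errors'
secs=$(printf '%s\n' "$line" | awk '{
  split($5, t, ":")                    # the HH:MM:SS duration field
  print t[1] * 3600 + t[2] * 60 + t[3]
}')
baseline=5400                          # seconds, from the last known-good scrub
if [ "$secs" -gt $((baseline * 2)) ]; then
  verdict="WARN: scrub time more than doubled (${secs}s vs ${baseline}s)"
else
  verdict="OK: ${secs}s vs baseline ${baseline}s"
fi
echo "$verdict"
```

Record the duration of every scrub somewhere durable; the comparison is worthless without history.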
Task 4: Identify which vdev is slow under load
cr0x@server:~$ sudo zpool iostat -v tank 2 5
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 6.12T 1.98T 320 610 42.1M 81.7M
raidz2-0 6.12T 1.98T 320 610 42.1M 81.7M
ata-WDC_WD80...-part1 - - 40 90 5.1M 12.0M
ata-WDC_WD80...-part1 - - 41 88 5.0M 11.8M
ata-WDC_WD80...-part1 - - 39 92 5.2M 12.3M
ata-WDC_WD80...-part1 - - 40 250 5.0M 35.0M
ata-WDC_WD80...-part1 - - 40 90 5.1M 12.1M
ata-WDC_WD80...-part1 - - 120 0 16.7M 0.0
-------------------------- ----- ----- ----- ----- ----- -----
What it means: One disk is receiving disproportionate writes (250 ops) and another is taking most reads (120 ops). That can be normal depending on parity layout and workload, but persistent skew can indicate a slow device causing the rest to wait.
Decision: If one device consistently shows lower bandwidth with higher ops (small I/O) and system logs show timeouts, treat that device/path as suspect.
Task 5: Get latency stats (OpenZFS) to spot service-time explosions
cr0x@server:~$ sudo zpool iostat -v -l tank 1 3
capacity operations bandwidth latency
pool alloc free read write read write read write
-------------------------- ----- ----- ----- ----- ----- ----- ----- -----
tank 6.12T 1.98T 310 590 41.0M 79.0M 2.1ms 18.4ms
raidz2-0 6.12T 1.98T 310 590 41.0M 79.0M 2.1ms 18.4ms
ata-WDC_WD80...-part1 - - 40 90 5.1M 12.0M 1.8ms 12.2ms
ata-WDC_WD80...-part1 - - 40 90 5.0M 12.0M 1.9ms 12.4ms
ata-WDC_WD80...-part1 - - 40 90 5.2M 12.1M 2.0ms 12.6ms
ata-WDC_WD80...-part1 - - 40 240 5.0M 33.0M 2.1ms 85.0ms
ata-WDC_WD80...-part1 - - 40 90 5.1M 12.0M 2.0ms 12.5ms
ata-WDC_WD80...-part1 - - 110 0 15.6M 0.0 2.2ms 0.0ms
-------------------------- ----- ----- ----- ----- ----- ----- ----- -----
What it means: One disk shows 85ms write latency while peers are ~12ms. That’s your future incident report.
Decision: Correlate with kernel logs; plan replacement or path remediation. Don’t “tune” around a disk that’s developing a personality.
Task 6: Check pool usage and ashift implications
cr0x@server:~$ sudo zpool list -o name,size,alloc,free,cap,health,ashift tank
NAME SIZE ALLOC FREE CAP HEALTH ASHIFT
tank 8.10T 6.12T 1.98T 75% ONLINE 12
What it means: 75% used, ashift=12 (4K sectors). Good baseline for modern disks/SSDs.
Decision: If ashift is wrong (e.g., 9 on 4K drives), you can’t fix it in place; plan a rebuild/migration. Wrong ashift is a long-term performance tax that can become a stability issue under load.
Task 7: Inspect dataset properties that change the I/O path
cr0x@server:~$ sudo zfs get -o name,property,value -s local recordsize,compression,atime,sync,logbias,xattr tank/vmstore
NAME PROPERTY VALUE
tank/vmstore recordsize 16K
tank/vmstore compression lz4
tank/vmstore atime off
tank/vmstore sync standard
tank/vmstore logbias latency
tank/vmstore xattr sa
What it means: This dataset is tuned for VM-ish small blocks (16K recordsize), lz4 compression, atime off, logbias latency. Good, but it implies sync write sensitivity and metadata intensity.
Decision: Verify you actually have a sane SLOG if sync write latency matters. Don’t set recordsize small “because databases”; measure your I/O size first.
Task 8: Identify snapshot pressure (capacity predictor)
cr0x@server:~$ sudo zfs list -t snapshot -o name,used,refer -S used | head
NAME USED REFER
tank/vmstore@auto-2026-02-04-0000 48G 3.1T
tank/vmstore@auto-2026-02-03-0000 44G 3.1T
tank/vmstore@auto-2026-02-02-0000 39G 3.1T
tank/vmstore@auto-2026-02-01-0000 33G 3.1T
What it means: Snapshots are consuming real space (USED), even if REFER stays similar. This can silently eat headroom and drive fragmentation.
Decision: Implement retention and verify it actually deletes snapshots. If USED per snapshot grows faster than expected, your churn is high and resilvers will hurt more.
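Retention is easier to trust when you can dry-run it. This sketch selects date-stamped snapshots older than a cutoff; names and cutoff are illustrative, and a real cleanup would feed each hit to zfs destroy only after review.

```shell
#!/bin/sh
# Sketch: dry-run retention for date-stamped snapshots like the ones above.
# Names and cutoff are illustrative; real cleanup would `zfs destroy` each
# hit after human review.
set -eu
cutoff="2026-02-02"                    # keep this date and newer
cat > /tmp/snaps.txt <<'EOF'
tank/vmstore@auto-2026-02-04-0000
tank/vmstore@auto-2026-02-03-0000
tank/vmstore@auto-2026-02-02-0000
tank/vmstore@auto-2026-02-01-0000
EOF
doomed=$(awk -F'@auto-' -v cut="$cutoff" '
  { d = substr($2, 1, 10)              # yyyy-mm-dd compares lexically
    if (d < cut) print $0 }' /tmp/snaps.txt)
echo "$doomed"
```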
Task 9: Measure ARC pressure (latency predictor in disguise)
cr0x@server:~$ sudo arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:10:01 612 118 16 44 6 60 8 14 2 46.2G 64.0G
12:10:02 590 121 17 50 7 55 8 16 3 46.2G 64.0G
12:10:03 640 130 17 52 7 61 8 17 3 46.3G 64.0G
What it means: ARC size is stable; miss rate ~16–17%. If miss rate spikes and you start pounding disks, latency follows.
Decision: If ARC is constrained (container limits, VM ballooning, misconfigured zfs_arc_max), fix memory pressure first. Don’t blame disks for what RAM didn’t cache.
Task 10: Spot error/reset patterns in kernel logs
cr0x@server:~$ sudo dmesg -T | egrep -i 'reset|timeout|error|ata|nvme' | tail -n 12
[Mon Feb 3 13:42:11 2026] ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0x6 frozen
[Mon Feb 3 13:42:11 2026] ata7.00: irq_stat 0x08000000, interface fatal error
[Mon Feb 3 13:42:12 2026] ata7: hard resetting link
[Mon Feb 3 13:42:17 2026] ata7: link is slow to respond, please be patient (ready=0)
[Mon Feb 3 13:42:22 2026] ata7: COMRESET failed (errno=-16)
[Mon Feb 3 13:42:22 2026] ata7.00: disabled
What it means: Link resets and COMRESET failures point to cabling/backplane/HBA path issues, not necessarily “bad ZFS.” This is a classic precursor to devices dropping from the pool.
Decision: Replace cables, move ports, check power and backplane. If errors follow the port, your disk is innocent and your chassis is lying.
Task 11: Verify SMART/NVMe health for the suspect device
cr0x@server:~$ sudo smartctl -a /dev/sdg | egrep -i 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|CRC_Error_Count|SMART overall'
SMART overall-health self-assessment test result: PASSED
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 2
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 14
What it means: “PASSED” is marketing. Pending sectors + uncorrectables are serious. CRC errors suggest cabling/path problems.
Decision: If CRC errors increase, fix cabling/backplane first. If pending sectors persist/increase, schedule replacement—especially in a redundancy-reduced situation.
Task 12: Check scrub schedule and ensure scrubs aren’t silently failing to run
cr0x@server:~$ systemctl status zfs-scrub-monthly@tank.timer
● zfs-scrub-monthly@tank.timer - Monthly zpool scrub timer for tank
Loaded: loaded (/lib/systemd/system/zfs-scrub-monthly@.timer; enabled; preset: enabled)
Active: active (waiting) since Sun Feb 2 00:00:01 2026
Trigger: Mon Feb 17 00:00:00 2026
What it means: Scrubs are scheduled and enabled.
Decision: If scrubs aren’t running regularly, you lose early detection. Enable and monitor scrub completion times and error counts.
Task 13: Detect near-full datasets and quota surprises
cr0x@server:~$ sudo zfs list -o name,used,avail,refer,mountpoint -S used | head
NAME USED AVAIL REFER MOUNTPOINT
tank/vmstore 5.4T 520G 3.1T /tank/vmstore
tank/backups 580G 2.2T 580G /tank/backups
tank/home 120G 2.7T 120G /tank/home
What it means: vmstore is consuming most space; only 520G “available” at dataset level may be tight depending on churn and snapshot behavior.
Decision: If this dataset is high-churn (VMs), increase headroom by deleting snapshots, moving data, or expanding the pool before performance collapses.
Task 14: Check for special vdev and understand the blast radius
cr0x@server:~$ sudo zpool status tank | sed -n '1,120p'
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
nvme-INTEL_SSDPE... ONLINE 0 0 0
nvme-INTEL_SSDPE... ONLINE 0 0 0
What it means: A special vdev exists (metadata/small blocks depending on settings). Its failure semantics can be severe if critical metadata lives there.
Decision: Treat special vdevs like first-class citizens: mirror them, monitor them, keep firmware sane, and don’t cheap out on endurance.
Task 15: Confirm sync behavior when latency spikes are “only for some writes”
cr0x@server:~$ sudo zfs get -o name,property,value sync tank
NAME PROPERTY VALUE
tank sync standard
What it means: Sync writes are honored. If your workload is sync-heavy and you don’t have a capable SLOG (or you have a bad one), you’ll feel it.
Decision: Do not set sync=disabled as a “performance fix” unless you have explicitly accepted data-loss semantics. If you need low-latency sync, design SLOG properly.
Three corporate-world mini-stories
Mini-story 1: The incident caused by a wrong assumption
They ran a fleet of virtualization hosts with local ZFS pools. It was a comfortable setup: mirrored SSDs for boot, raidz for bulk VM storage, and weekly scrubs. The team’s working assumption was simple: checksum errors mean “the disk is bad.” That’s what the graphs said, and the graphs were never wrong.
One host started accumulating a small number of CKSUM errors on two drives, not one. The on-call replaced the first drive. Errors kept coming. Then they replaced the second. Still creeping. Nobody loved the idea, but the team started planning a bigger maintenance: “maybe we got a bad batch.”
Meanwhile, the incident pattern got nastier. Under peak write load, a disk would briefly disappear, ZFS would degrade the vdev, and the VMs would freeze long enough for guest OSes to panic. Then the device would come back, sometimes. The counters climbed like a staircase—only during specific high-traffic windows.
The root cause was a backplane path: a marginal connector that behaved until sustained temperature and vibration hit it just right. The “disk is bad” assumption delayed the correct fix by a week and burned several maintenance windows. Once they moved the disks to different bays (and replaced the backplane), the checksum counters stopped moving. Same disks. Same data. Different reality.
Lesson: A checksum error is not a disk diagnosis; it’s a data-integrity alarm. When errors spread across devices, suspect shared components first: cabling, backplane, HBA, expander, power.
Mini-story 2: The optimization that backfired
A data platform team wanted faster NFS performance for a busy analytics cluster. The storage backend was ZFS on a decent set of HDDs with plenty of RAM. Latency looked okay until nightly batch jobs, when sync writes piled up and everyone complained.
Someone proposed a quick win: add a “fast” consumer NVMe as SLOG. On paper it was perfect—cheap, screaming benchmarks, easy install. They also changed a couple dataset properties in the same change window to “align for small writes.” The benchmark improved. A round of congratulations happened. This is how it always starts.
Two months later, the pool didn’t exactly crash. It did something more subtle and more expensive: it started stalling for seconds at a time, then tens of seconds. NFS clients would hang. Kernel logs showed occasional NVMe errors. The SLOG device wasn’t dead yet, just intermittently slow and occasionally resetting. Every reset forced painful retry behavior, and because the SLOG was in the critical path for sync writes, the entire service looked down.
The final trigger was a power event in the rack. The consumer NVMe lacked the kind of power-loss protection you actually want in a log device. After the reboot, the team spent a long night validating data integrity and replay semantics. They got lucky, but luck is not an architecture.
Lesson: SLOG is not “a cache.” It is a write-ahead log device with reliability and latency requirements. If you add a SLOG, make it enterprise-grade, mirror it if the workload demands it, and monitor it like it’s production—because it is.
Mini-story 3: The boring but correct practice that saved the day
A mid-sized SaaS company ran ZFS on dedicated storage nodes backing object storage and internal backups. Nothing fancy. They had a policy that sounded almost comically strict: monthly scrubs, mandatory recording of scrub duration, and a simple rule—any device with recurring errors after remediation gets replaced before the next maintenance cycle.
It wasn’t popular at first. Scrubs create load. Replacing disks costs money. And writing down scrub times felt like bureaucracy. But they kept at it because their incident reviews were ruthless about “we didn’t know” being an unacceptable excuse.
One month, scrub time on a key pool jumped significantly without any growth in total data. No one had touched the layout. Error counters were still at zero. But the team treated “scrub time doubled” as a failure signal and started digging. They found a set of drives that were still online but had dramatically increased read error recovery time. The OS wasn’t timing out yet; ZFS hadn’t complained. Users weren’t screaming. That was the window.
They replaced two drives in a controlled, low-stress way—before the pool was degraded, before a resilver had to compete with production traffic, before they were one surprise away from a bad day. A week later, one of those removed drives failed a vendor diagnostic hard.
Lesson: Baselines are cheap insurance. Scrub duration and resilver duration are early indicators of trouble even when “health” looks green.
Common mistakes: symptom → root cause → fix
1) “A few checksum errors, but scrub says 0 repaired”
Symptom: CKSUM increments; scrub reports no repairs; apps seem fine.
Root cause: Transient corruption on the path (cable/backplane/HBA) or marginal media that hasn’t forced a repair yet; counters can reflect corrected data via redundancy without requiring a scrub repair.
Fix: Capture zpool status -v, clear counters, reseat/replace cables, check logs for resets, run another scrub. If counters return, replace device or fix shared hardware.
2) “Pool is ONLINE but everything times out”
Symptom: No obvious ZFS errors, but applications hang; iowait spikes.
Root cause: Latency collapse from a slow device, near-full fragmentation, sync-write bottleneck, or a resilver/scrub competing with workload.
Fix: Use zpool iostat -v -l to find the slow vdev. Check capacity, scrub/resilver activity, and sync path (SLOG). Throttle scrubs if needed and fix the actual slow component.
3) “Resilver takes forever now”
Symptom: Replacing a disk used to take hours; now it takes days.
Root cause: Pool is fuller, more fragmented, and/or workload is heavier; also possible device-level slowdowns increasing service time.
Fix: Increase headroom (delete snapshots, migrate cold data), schedule resilvers during low load, consider wider vdevs cautiously, and verify no device is lagging via latency stats.
4) “We fixed performance by setting sync=disabled” (and then…)
Symptom: Performance improves dramatically; later there’s data loss after a crash/power event.
Root cause: Semantic change: synchronous writes are no longer durable at commit time.
Fix: Put sync=standard back. If you need sync performance, design SLOG and workload behavior. If you truly accept risk, document it like a contract with the business.
5) “Random read performance is terrible on SSD pool”
Symptom: Latency is unexpectedly high; IOPS are low.
Root cause: Wrong ashift, misaligned partitions, or an SSD firmware/controller issue; sometimes also recordsize mismatch causing read amplification.
Fix: Confirm ashift and alignment. If wrong, plan a rebuild. Validate SSD health and firmware; match dataset recordsize to workload.
6) “Special vdev made things faster, then the pool became fragile”
Symptom: Metadata/small-block acceleration helped—until special vdev errors started and everything got scary.
Root cause: Special vdev holds critical allocations; if it’s underprotected or using weak devices, its failure threatens pool availability and sometimes data.
Fix: Mirror special vdev, use high-endurance devices, monitor SMART/NVMe stats, and keep spare devices ready. Treat it like part of the pool, not a plugin.
Checklists / step-by-step plan
Daily/continuous monitoring checklist (what to graph and alert on)
- Error slope: alert when any vdev READ/WRITE/CKSUM counters increase since last baseline.
- Scrub duration: alert when duration increases beyond a threshold vs last 3 runs.
- Resilver duration: track time-to-complete; rising trend is risk exposure increasing.
- Latency percentiles: pool-level and vdev-level write latency (p95/p99), not just average throughput.
- Capacity headroom: pool cap + “days to full” trend; include snapshot growth.
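The “days to full” item above is simple arithmetic over two allocation samples. All byte counts in this sketch are illustrative; real values would come from zpool list output captured on a schedule.

```shell
#!/bin/sh
# Sketch: naive "days to full" from two allocation samples a week apart.
# All byte counts are illustrative stand-ins for scheduled captures of
# pool alloc/size figures.
set -eu
alloc_then=6500000000000   # bytes allocated 7 days ago
alloc_now=6700000000000    # bytes allocated today
size=8900000000000         # pool size in bytes
days=7
growth_per_day=$(( (alloc_now - alloc_then) / days ))
days_to_full=$(( (size - alloc_now) / growth_per_day ))
echo "days to full: ${days_to_full}"
```

Linear extrapolation is crude, but it turns a vague “we’re filling up” into a number you can alert on before the cliff.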
When you see new errors (step-by-step response)
- Capture evidence: zpool status -v, zpool events -v (if available), and relevant kernel log lines.
- Identify scope: one device vs multiple devices vs whole controller path.
- If multiple devices affected: investigate shared components first (HBA, expander, backplane, power).
- Run/verify a scrub if appropriate and safe under current load.
- Clear counters after you have a baseline, then observe recurrence.
- Replace hardware proactively if counters recur or latency degrades, even if the pool remains ONLINE.
When latency spikes (step-by-step response)
- Check if scrub/resilver is running and competing with production.
- Run zpool iostat -v -l and identify the slowest vdev or device by latency.
- Check OS logs for resets/timeouts on that path.
- Check pool fullness and snapshot churn; if near the cliff, create headroom immediately.
- Validate sync workload and SLOG health if sync write latency dominates.
- Only after hardware and capacity are ruled out, tune properties like recordsize and compression.
Quarterly maintenance plan (boring, correct, effective)
- Review scrub times and error trends for each pool; update baselines after major changes.
- Review snapshot retention and verify deletions actually occur.
- Test a disk replacement workflow on a non-critical system: labeling, zpool replace, resilver monitoring, and post-check scrub.
- Audit firmware versions for HBAs and SSDs; apply vendor-approved updates in a controlled window.
- Capacity planning: forecast growth and schedule expansions before crossing your headroom policy.
Interesting facts and historical context (because storage has lore)
- ZFS originated at Sun Microsystems in the mid-2000s as an integrated filesystem + volume manager, explicitly to reduce “RAID + filesystem mismatch” pain.
- End-to-end checksumming was a core design goal: detect silent corruption rather than trusting lower layers to be correct.
- Copy-on-write (CoW) is why ZFS can do consistent snapshots cheaply, but it’s also why fragmentation and near-full behavior can bite under random-write workloads.
- Scrubs exist because disks lie: they can return bad data without reporting errors. ZFS scrubs validate checksums across the pool.
- RAIDZ is not RAID5 in a driver: it’s integrated with the allocator and transaction model, which is powerful but makes performance sensitive to recordsize and workload shape.
- ashift became a long-running operational lesson: picking the wrong sector size alignment doesn’t usually break data, it just taxes performance forever.
- “SLOG” is misunderstood constantly: it’s only used for synchronous writes; it doesn’t accelerate normal async writes, and a bad SLOG can harm you.
- OpenZFS evolved across platforms (Illumos, FreeBSD, Linux) and converged as a shared codebase, which is why some commands/features vary slightly by OS and version.
- Special vdevs are relatively modern operational weapons: powerful for metadata/small blocks, but they create new failure domains that require grown-up operational discipline.
FAQ
1) What’s the single most predictive metric of “imminent pain”?
New error counters combined with rising write latency. Errors predict integrity risk; latency predicts availability risk. When both move, you’re in the danger zone.
2) If zpool status says “No known data errors,” can I relax?
No. It means ZFS believes it repaired or avoided user-visible damage. It does not mean the underlying cause is gone. Trend counters and check logs.
3) Are checksum errors always a bad disk?
No. They can be cabling, backplane, HBA, expander, firmware, or power. If multiple disks show CKSUM increments, suspect shared infrastructure first.
4) Why does a pool “feel crashed” when it’s just slow?
Because your applications have timeouts. A storage system that returns results after 60 seconds is functionally down for systems expecting 200ms.
5) What pool fullness is “too full”?
Depends on workload, but many production pools start behaving badly beyond ~80% used, earlier for random-write-heavy workloads. Use scrub/resilver time trends to find your cliff.
6) Should I run scrubs weekly or monthly?
Monthly is a common baseline; more frequent scrubs can be justified for high-value data or flaky hardware. The key is consistency and tracking completion time and errors.
7) Is adding a SLOG always a good idea?
No. It only helps for synchronous writes. A weak or unstable SLOG can make sync workloads worse and can become a failure amplifier. Choose carefully.
8) What should I alert on for ZFS errors?
Alert on any increase in READ/WRITE/CKSUM per device, DEGRADED/FAULTED states, scrub/resilver start/finish and duration, and sustained high latency at the vdev level.
9) If resilvers are slow, can I “speed them up” safely?
You can influence behavior by reducing competing workload and ensuring headroom. Aggressive tuning can starve applications or increase risk. The safest “speed-up” is fewer problems: headroom and healthy drives.
10) When do I replace a drive if SMART says PASSED?
When operational evidence says it’s trending worse: increasing pending sectors, uncorrectables, recurring ZFS errors, or outlier latency versus peers. “PASSED” is not a warranty.
Conclusion: practical next steps
If you want to predict a ZFS pool crash, stop watching the stuff that makes you feel informed and start watching the stuff that changes your decisions:
- Trend error counters (READ/WRITE/CKSUM) per device and alert on increases, not just nonzero values.
- Measure latency and queue pressure at the vdev layer, and treat outlier latency as a hardware/path incident until proven otherwise.
- Enforce headroom and track scrub/resilver duration as your early warning that the pool is crossing its performance cliff.
Then do the unglamorous thing: write down baselines, keep scrubs regular, and replace components before the pool forces you to. ZFS is extremely honest. It’s just not polite about timing.