The most dangerous moment in ZFS recovery is the first five minutes. Not because ZFS is fragile—because humans are. You’re on call, the pool is “UNAVAIL,” dashboards are red, and someone suggests “just reboot it.” That’s how you turn a survivable incident into a forensic hobby.
This is the field guide for recovering a ZFS pool with minimal drama: what to check first, which commands to run, how to read the output, and the decisions you make based on that output. It’s opinionated. It assumes production pressure. It also assumes you’d like to keep your job and your data.
Recovery mindset: stop digging
ZFS pool recovery is less like “repairing a filesystem” and more like “handling evidence at a crash scene.” The tools are powerful, the data structures are resilient, and you can absolutely make it worse by being creative under stress.
Three rules that keep you out of trouble:
- Minimize writes until you understand the failure. Writes can overwrite the very metadata you need to recover cleanly.
- Prefer observation over action. Your first commands should be read-only: status, import discovery, kernel logs.
- Change one variable at a time. If you swap disks, change cables, and rewrite device paths in one go, your timeline becomes fiction.
Also: don’t “clean up” labels and partitions until you’ve captured enough state to roll back your decisions. ZFS is forgiving; it’s not psychic.
Joke #1: “I’ll just run zpool import -f and see what happens” is the storage equivalent of “I’ll just wiggle the airplane yoke and see what happens.”
Fast diagnosis playbook (first/second/third)
This is the shortest path to “what’s actually broken?” It’s designed to find the failing layer quickly: hardware, topology, metadata, or operator error.
First: Is this a device visibility problem or a ZFS metadata problem?
- Check kernel logs for disk/link resets (SATA, SAS, NVMe timeouts).
- Check whether the OS still sees the devices (by-id paths, multipath, enclosure mapping).
- Run import discovery without importing to see whether ZFS can even read labels.
If devices are missing at the OS level, ZFS can’t save you yet. Fix cables, HBAs, expander issues, zoning, multipath, and only then return to ZFS.
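A minimal read-only first pass looks something like this (a sketch; nothing here writes to the pool, and the bare zpool import only performs discovery):
cr0x@server:~$ sudo zpool status -xv
cr0x@server:~$ sudo zpool import
cr0x@server:~$ sudo dmesg -T | tail -50
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep -v part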
Second: What is the pool state and what type of redundancy do you have?
- Mirror vs RAIDZ changes the risk. A degraded mirror is usually routine. A degraded RAIDZ with multiple issues can be a careful dance.
- Look for “too many errors” vs “device removed.” Those are different failure modes with different actions.
Third: Are you dealing with corruption, or just “can’t assemble the pool”?
- Corruption signs: checksum errors, “permanent errors,” read errors on specific files, repeated resilver restarts.
- Assembly signs: pool won’t import, missing top-level vdevs, stale cachefile, renamed devices, changed HBA order.
Once you can label the incident as “visibility,” “assembly,” “degraded but importable,” or “corruption,” you stop flailing and start executing.
Interesting facts and historical context (quick hits)
- ZFS originated at Sun Microsystems in the mid-2000s with the goal of end-to-end data integrity, not just convenience features.
- The “copy-on-write” model means ZFS usually doesn’t overwrite live metadata in place, which is why many failure modes are recoverable if you avoid extra writes.
- RAIDZ was ZFS’s answer to the classic “RAID5 write hole,” leaning on transactional semantics rather than controller battery voodoo.
- ZFS checksums data and metadata (not optional for metadata), which is why “checksum error” often indicates real corruption, not a cosmetic warning.
- OpenZFS became a multi-platform effort as ZFS moved beyond Solaris-derived systems; today behaviors vary slightly across platforms and versions.
- The ashift property exists because physical sector sizes and drive lies matter; misalignment can quietly wreck performance and sometimes recovery predictability.
- ZFS pool labels exist in multiple places on each vdev, which is why “label damage” is often survivable unless something keeps rewriting the wrong thing.
- Scrubs weren’t always “standard ops.” Many organizations only started treating scrubs as routine after big disk capacities made latent sector errors common.
- Special vdevs and L2ARC made pools faster but also added new ways to hurt yourself if you treat “cache devices” as disposable without understanding their role.
Triage workflow: confirm, isolate, preserve
If your pool just went sideways, don’t start with “repair.” Start with confirm (what’s true), then isolate (stop the bleeding), then preserve (capture state so you can reason later).
Confirm: what exactly changed?
Most ZFS outages are not mystical corruption. They’re mundane: a drive dropped, an HBA reset, a cable got bumped during “maintenance,” a multipath alias changed, a firmware update reordered devices. The pool is fine; your assumptions are not.
Isolate: reduce churn and side-effects
- If applications are hammering a degraded pool, pause or throttle them. Resilvering under write pressure is possible, but it’s slower and riskier.
- If the pool is flapping (devices online/offline), stop the churn: fix the link layer first.
- If you’re tempted to run “cleanup” commands, pause; take snapshots or export the pool only when it’s safe to do so.
Preserve: capture information you’ll wish you had
Collect the state before you change anything: zpool status, zpool import output, lsblk and by-id mappings, and kernel logs. This is how you avoid the “we replaced the wrong disk” classic.
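A sketch of that capture, assuming a Linux host and a scratch directory of your choosing (paths and names here are illustrative):
cr0x@server:~$ D=~/zfs-incident-$(date +%Y%m%d-%H%M); mkdir -p "$D"
cr0x@server:~$ sudo zpool status -v > "$D/zpool-status.txt" 2>&1
cr0x@server:~$ sudo zpool import > "$D/zpool-import.txt" 2>&1
cr0x@server:~$ lsblk -o NAME,SIZE,SERIAL,MODEL,WWN > "$D/lsblk.txt"
cr0x@server:~$ ls -l /dev/disk/by-id/ > "$D/by-id.txt"
cr0x@server:~$ sudo dmesg -T > "$D/dmesg.txt"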
Practical tasks (commands + output meaning + decisions)
These are real operational tasks. Each one includes: command, sample output, what it means, and the decision you make. Adjust pool names and device IDs to your environment.
Task 1: Check current pool health (if imported)
cr0x@server:~$ sudo zpool status -xv
all pools are healthy
Meaning: ZFS thinks everything is fine. If you’re still seeing application errors, the problem may be higher up (NFS/SMB, permissions, app bugs) or lower down (intermittent I/O not yet surfaced).
Decision: If the pool is healthy, don’t run recovery commands “just in case.” Move to logs and application symptoms.
Task 2: Inspect a degraded pool and identify the failing vdev
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
scan: scrub repaired 0B in 0 days 00:18:22 with 2 errors on Fri Dec 20 02:19:11 2025
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-SAMSUNG_MZ7LM1T9-0 ONLINE 0 0 0
ata-SAMSUNG_MZ7LM1T9-1 ONLINE 0 0 2
ata-SAMSUNG_MZ7LM1T9-2 ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
/tank/projects/db/wal/00000001000000A9000000FF
Meaning: The pool is importable but has checksum errors and a specific file is impacted. This is not “a disk is missing.” This is “ZFS couldn’t repair data from redundancy” or “the redundancy itself returned inconsistent data.”
Decision: Identify whether the file can be rebuilt (database WAL, derived artifacts) or must be restored. Don’t start yanking disks because you saw “2” under CKSUM; correlate with SMART and error logs first.
Task 3: Discover importable pools without importing
cr0x@server:~$ sudo zpool import
pool: tank
id: 10293847566554433211
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:
tank ONLINE
mirror-0 ONLINE
ata-WDC_WD80EFAX-68LHPN0_WD-ABC123 ONLINE
ata-WDC_WD80EFAX-68LHPN0_WD-DEF456 ONLINE
Meaning: ZFS sees the pool labels and can assemble the topology. Good sign. If the pool is “not found” here, you likely have device visibility problems, label damage, or you’re looking at the wrong host.
Decision: If import is possible, import read-only first when the situation is ambiguous.
Task 4: Attempt a cautious read-only import
cr0x@server:~$ sudo zpool import -o readonly=on -N tank
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-ABC123 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-DEF456 ONLINE 0 0 0
Meaning: Imported without mounting datasets (-N) and in read-only mode. This is a great state for inspection: you can list datasets and properties without writing new metadata.
Decision: If you need to copy data off, you can mount specific datasets read-only. If you plan repairs, switch to read-write only once you’re confident the device layer is stable.
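For example, to pull one dataset’s contents while staying read-only (tank/projects is a hypothetical dataset name; because the pool was imported with readonly=on, the mount comes up read-only):
cr0x@server:~$ sudo zfs list -r -o name,mountpoint tank
cr0x@server:~$ sudo zfs mount tank/projects
cr0x@server:~$ mount | grep tank/projects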
Task 5: Check OS-level device presence and stable identifiers
cr0x@server:~$ ls -l /dev/disk/by-id/ | head
total 0
lrwxrwxrwx 1 root root 9 Dec 26 03:10 ata-WDC_WD80EFAX-68LHPN0_WD-ABC123 -> ../../sdb
lrwxrwxrwx 1 root root 9 Dec 26 03:10 ata-WDC_WD80EFAX-68LHPN0_WD-DEF456 -> ../../sdc
lrwxrwxrwx 1 root root 9 Dec 26 03:10 nvme-SAMSUNG_MZVL21T0HCLR-00B00_S6XYZ123 -> ../../nvme0n1
Meaning: By-id links exist. This matters because /dev/sdX ordering is not a promise; it’s a suggestion that changes when you blink.
Decision: If the pool was built with /dev/sdX paths, plan a controlled migration to by-id during normal hours. During recovery, map which physical disk is which before replacing anything.
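A sketch of that migration for a later maintenance window (not mid-incident), assuming the pool can tolerate a brief export:
cr0x@server:~$ sudo zpool export tank
cr0x@server:~$ sudo zpool import -d /dev/disk/by-id tank
cr0x@server:~$ sudo zpool status tank | egrep "ata-|nvme-|wwn-"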
Task 6: Read the kernel’s opinion about your storage (timeouts, resets)
cr0x@server:~$ sudo dmesg -T | egrep -i "zfs|sd[a-z]|ata|nvme|sas|reset|timeout|I/O error" | tail -20
[Thu Dec 26 03:01:22 2025] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Thu Dec 26 03:01:22 2025] ata6.00: failed command: READ FPDMA QUEUED
[Thu Dec 26 03:01:22 2025] blk_update_request: I/O error, dev sdc, sector 1219024896 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Thu Dec 26 03:01:23 2025] ata6: hard resetting link
[Thu Dec 26 03:01:29 2025] ata6: link is slow to respond, please be patient (ready=0)
[Thu Dec 26 03:01:35 2025] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Thu Dec 26 03:01:35 2025] sd 6:0:0:0: [sdc] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Dec 26 03:01:35 2025] zfs: vdev state changed, pool_guid=10293847566554433211
Meaning: This is classic flapping: link reset, read failures, and ZFS noticing vdev state changes. ZFS can recover from transient issues, but repeated resets during resilver can create a long incident.
Decision: Fix the physical layer before doing ZFS actions that rely on sustained reads (scrub/resilver). Swap cable, move port, check HBA firmware, check expander.
Task 7: Check SMART/NVMe health to confirm a drive is actually sick
cr0x@server:~$ sudo smartctl -a /dev/sdc | egrep -i "Model|Serial|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count|SMART overall"
Device Model: WDC WD80EFAX-68LHPN0
Serial Number: WD-DEF456
SMART overall-health self-assessment test result: PASSED
5 Reallocated_Sector_Ct 0x0033 095 095 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 8
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 8
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 12
Meaning: “PASSED” is not absolution. Pending and uncorrectable sectors mean the drive has unreadable areas. CRC errors suggest a cable/backplane issue too.
Decision: If pending/uncorrectable are non-zero and growing, replace the drive. If CRC errors are growing without other signs, suspect cabling/backplane/HBA first.
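For NVMe devices the fields differ; a rough equivalent (field names vary slightly by smartctl version and drive):
cr0x@server:~$ sudo smartctl -a /dev/nvme0n1 | egrep -i "critical warning|available spare|percentage used|media and data"
A non-zero critical warning or climbing media/data integrity errors are the NVMe analogues of pending and uncorrectable sectors.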
Task 8: Find which device corresponds to which bay (enclosure mapping)
cr0x@server:~$ sudo lsblk -o NAME,SIZE,SERIAL,MODEL,HCTL,WWN
NAME SIZE SERIAL MODEL HCTL WWN
sdb 7.3T WD-ABC123 WDC WD80EFAX-68L 6:0:0:0 0x50014ee2b1234567
sdc 7.3T WD-DEF456 WDC WD80EFAX-68L 6:0:1:0 0x50014ee2b7654321
nvme0n1 1.8T S6XYZ123 SAMSUNG MZVL21T0 - 0x002538b111223344
Meaning: You’re building a mapping from logical device names to physical devices. This saves you from replacing the wrong disk—the most expensive form of cardio.
Decision: Before any replacement, record serials and slot locations (from your enclosure tools or inventory). If you can’t map it, stop and figure it out.
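If the enclosure supports SES LED control and ledmon is installed, blinking the bay beats squinting at serial stickers (a sketch; vendor tools such as sg_ses or HBA utilities are common alternatives):
cr0x@server:~$ sudo ledctl locate=/dev/sdc
cr0x@server:~$ sudo ledctl locate_off=/dev/sdc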
Task 9: Clear transient errors only after the underlying cause is fixed
cr0x@server:~$ sudo zpool clear tank
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-ABC123 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-DEF456 ONLINE 0 0 0
Meaning: Counters reset. This does not “fix” anything; it just gives you a clean slate for monitoring after remediation.
Decision: Use zpool clear after you’ve repaired cabling or replaced a drive and want to confirm the issue isn’t recurring.
Task 10: Replace a failed disk in a mirror (clean and boring)
cr0x@server:~$ sudo zpool status tank
pool: tank
state: DEGRADED
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-ABC123 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-DEF456 FAULTED 0 0 0 too many errors
cr0x@server:~$ sudo zpool replace tank ata-WDC_WD80EFAX-68LHPN0_WD-DEF456 ata-WDC_WD80EFAX-68LHPN0_WD-NEW999
cr0x@server:~$ sudo zpool status tank
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered.
scan: resilver in progress since Thu Dec 26 03:22:01 2025
1.21T scanned at 1.18G/s, 312G issued at 305M/s, 7.30T total
312G resilvered, 4.17% done, 06:35:11 to go
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-ABC123 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-NEW999 ONLINE 0 0 0 (resilvering)
Meaning: Replacement started and resilver is underway. If resilver rate is extremely low, suspect ongoing I/O errors, SMR drives under load, or competing workload.
Decision: Monitor resilver stability. If the old disk is still readable intermittently, keep it connected until resilver completes—unless it’s actively wedging the bus.
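A low-effort way to watch resilver stability (the trailing number is a repeat interval in seconds; both commands are read-only):
cr0x@server:~$ sudo zpool status -v tank 60
cr0x@server:~$ sudo zpool iostat -v tank 5
If one member’s throughput flatlines while the others idle, that device or its path is your suspect.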
Task 11: Identify what’s preventing import (missing device vs wrong hostid)
cr0x@server:~$ sudo zpool import tank
cannot import 'tank': pool may be in use from other system, it was last accessed by server-a (hostid=0x12ab34cd) on Thu Dec 26 01:50:20 2025
use '-f' to import anyway
Meaning: ZFS believes another host had the pool. This can be legitimate (shared storage, cluster, SAN) or just stale hostid after moving disks.
Decision: Confirm the pool is not actively imported elsewhere. If you are sure, import with force, but treat that as an operational event worth documenting and auditing.
Task 12: Force import safely, ideally read-only first
cr0x@server:~$ sudo zpool import -f -o readonly=on -N tank
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-ABC123 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-DEF456 ONLINE 0 0 0
Meaning: Pool imports and is stable in read-only mode. This is a safe staging point to verify data before you allow writes.
Decision: If the incident involved unexpected reboots or HBA issues, keep it read-only until you’ve confirmed device stability and captured data you can’t recreate.
Task 13: Import using an alternate root (for recovery environments)
cr0x@server:~$ sudo zpool import -R /mnt -N tank
cr0x@server:~$ sudo zfs mount -a
cr0x@server:~$ mount | grep tank
tank/home on /mnt/home type zfs (rw,xattr,noacl)
Meaning: -R sets an alternate mountpoint root. This is useful in rescue shells, recovery hosts, or when you don’t want to collide with production mountpoints.
Decision: Use this when recovering from a live OS or migrating disks to a temporary host for inspection.
Task 14: Verify snapshots exist before you attempt “repairs”
cr0x@server:~$ sudo zfs list -t snapshot -o name,creation -S creation tank | head
NAME CREATION
tank@autosnap_2025-12-26_0200 Thu Dec 26 02:00 2025
tank@autosnap_2025-12-26_0100 Thu Dec 26 01:00 2025
tank@autosnap_2025-12-26_0000 Thu Dec 26 00:00 2025
Meaning: Snapshots provide a logical recovery point even when hardware is shaky. They also give you confidence to roll back datasets if needed (with care).
Decision: If snapshots are missing and you’re considering destructive actions, pause. Your recovery options just got narrower.
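If you want a copy of a snapshot off the box before attempting anything risky, a sketch (recovery-host and backup/tank-copy are hypothetical; zfs send works from a read-only imported pool):
cr0x@server:~$ ls /tank/.zfs/snapshot/autosnap_2025-12-26_0200/ | head
cr0x@server:~$ sudo zfs send tank@autosnap_2025-12-26_0200 | ssh recovery-host sudo zfs receive -u backup/tank-copy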
Task 15: Run a scrub when the pool is stable, then interpret results
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Thu Dec 26 03:40:11 2025
2.88T scanned at 1.02G/s, 1.44T issued at 512M/s, 7.30T total
0B repaired, 19.72% done, 03:05:19 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-ABC123 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_WD-DEF456 ONLINE 0 0 0
Meaning: Scrub reads everything and verifies checksums. “Repaired” indicates ZFS corrected bad data using redundancy. If repairs keep happening, you have a persistent problem (drive, cabling, memory, controller).
Decision: If scrub completes with errors, treat it as a real incident, not a warning. Collect zpool status -v and decide whether to restore impacted files or replace hardware.
Task 16: List and evaluate permanent errors
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
errors: Permanent errors have been detected in the following files:
/tank/projects/db/base/16384/2609
/tank/projects/db/base/16384/2610
Meaning: ZFS is telling you: “I know which files are bad.” That’s a gift. For databases and VM images, those paths matter.
Decision: If these are critical files: restore from backup/snapshot replication, or rebuild the dataset/application. Don’t assume another scrub will magically fix “permanent errors.”
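If a clean snapshot predates the corruption and the affected path lives in that dataset, a file-level restore can look like this sketch (paths are illustrative; for databases, prefer application-level restore or rebuilding from a replica over copying files under a running instance):
cr0x@server:~$ ls -l /tank/.zfs/snapshot/autosnap_2025-12-26_0200/projects/db/base/16384/2609
cr0x@server:~$ sudo cp -a /tank/.zfs/snapshot/autosnap_2025-12-26_0200/projects/db/base/16384/2609 /tank/projects/db/base/16384/
After replacing or removing the damaged files, scrub and re-check zpool status -v.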
Task 17: Check whether you’re dealing with insufficient replicas (a.k.a. you lost redundancy)
cr0x@server:~$ sudo zpool status -v vault
pool: vault
state: DEGRADED
status: One or more devices could not be used because the label is missing or invalid.
action: Replace the device using 'zpool replace'.
config:
NAME STATE READ WRITE CKSUM
vault DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
ata-ST12000NM0008_A1 ONLINE 0 0 0
ata-ST12000NM0008_A2 ONLINE 0 0 0
15553461652766218490 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST12000NM0008_A3
ata-ST12000NM0008_A4 ONLINE 0 0 0
ata-ST12000NM0008_A5 ONLINE 0 0 0
ata-ST12000NM0008_A6 ONLINE 0 0 0
Meaning: ZFS remembers a vdev by GUID and is telling you the old path. This is often “disk is missing” or “disk got re-enumerated but label is gone.”
Decision: Find the missing disk (by serial, bay, HBA path). If it’s present but label is gone, treat that as suspicious—something overwrote it. Don’t run zpool replace until you know which physical disk corresponds to that GUID.
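If a candidate disk is physically present, you can check whether it still carries labels for this pool and which GUID it claims (read-only; the partition suffix depends on how the disk was originally added):
cr0x@server:~$ sudo zdb -l /dev/disk/by-id/ata-ST12000NM0008_A3-part1 | egrep "name:|guid:|state:"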
Task 18: Export a pool cleanly before moving disks to another host
cr0x@server:~$ sudo zpool export tank
cr0x@server:~$ sudo zpool import
no pools available to import
Meaning: Export flushes and cleanly closes the pool. When you move disks between hosts, this reduces “pool may be in use” friction and lowers the risk of stale state.
Decision: If the box is stable enough to export, do it. If it’s not stable enough, plan a read-only import on the recovery host and document why.
Common mistakes: symptoms → root cause → fix
This is the part where I save you from your future self. These are not theoretical. These are things people do at 03:00 while Slack is yelling.
1) Pool won’t import after “a simple reboot”
- Symptoms: zpool import shows the pool, but the import complains about “in use from other system” or hangs.
- Root cause: Stale hostid, pool was not exported cleanly, or devices are flapping, causing the import to stall.
- Fix: Verify no other host has it imported. Do a read-only forced import (-f -o readonly=on -N) to inspect. Fix device-layer resets before allowing writes.
2) “Checksum errors” appear and someone immediately replaces the disk
- Symptoms: Non-zero CKSUM on a vdev; pool degraded; scrub reports errors.
- Root cause: Could be drive media, but also RAM issues, HBA/backplane bit flips, bad SAS cable, or firmware bugs. Replacing the disk might not fix it and can add stress during resilver.
- Fix: Correlate: SMART stats, dmesg, error counters across multiple drives, and whether errors “move” with a cable/port. Replace hardware based on evidence, not vibes.
3) Resilver is “stuck” at low speed
- Symptoms: zpool status shows resilver in progress with hours/days remaining; throughput collapses.
- Root cause: Drive SMR behavior under random writes, ongoing read retries due to marginal media, a pool busy with application writes, or one slow device in RAIDZ dragging the entire vdev.
- Fix: Reduce workload, verify I/O errors in logs, consider temporarily stopping heavy writes, and ensure the replacement disk is not a performance mismatch. If you see repeated resets, fix cabling/HBA first.
4) Pool is UNAVAIL after “optimization” changes
- Symptoms: A special vdev, SLOG, or metadata device fails; pool won’t mount datasets; weird latency spikes before failure.
- Root cause: Treating “fast devices” as optional when they were configured as required (special vdev is not like L2ARC). Or mis-sizing and overloading them.
- Fix: Replace the failed special vdev device(s) like you would a critical disk. Design special vdevs with redundancy. Don’t add one unless you’re willing to operationally own it.
5) Someone runs zpool labelclear on the wrong disk
- Symptoms: A previously visible vdev becomes “missing,” pool degrades further, import output changes in confusing ways.
- Root cause: Misidentification of physical disks, reliance on /dev/sdX, or skipping the serial-to-slot mapping step.
- Fix: Stop. Capture current zpool import output. Rebuild the mapping using serials, WWNs, and enclosure tools. Only clear labels when you are absolutely certain the disk is not part of any needed pool.
6) Scrub “repairs” keep recurring every week
- Symptoms: Scrubs repair small amounts repeatedly; checksum errors rotate among drives; no single disk is consistently guilty.
- Root cause: Often non-disk: flaky RAM (especially without ECC), HBA issues, cabling, backplane, or power instability.
- Fix: Treat as platform integrity issue. Run memory tests during maintenance, check ECC logs, update firmware, and validate cabling/backplane. Replacing random drives is just expensive denial.
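On Linux platforms that expose EDAC counters, a quick read-only look at memory health is cheap compared to another round of drive swaps (a sketch; sysfs paths vary by platform and the counters may simply be absent):
cr0x@server:~$ grep -H . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null
cr0x@server:~$ sudo journalctl -k | egrep -i "edac|mce|machine check" | tail -20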
Checklists / step-by-step plan
Checklist A: When a pool is degraded but online
- Run zpool status -v. Identify whether errors are READ/WRITE/CKSUM and whether “permanent errors” exist.
- Check dmesg for resets/timeouts around the same time.
- Check SMART/NVMe health for the implicated devices.
- Map devices by serial/WWN to physical bays.
- If a disk is clearly failing: replace it and monitor resilver.
- After resilver: scrub the pool. Confirm zero errors.
- Only then: clear counters (zpool clear) and return workload to normal.
Checklist B: When a pool won’t import
- Run zpool import (no args) and save the output.
- Confirm the OS sees all expected devices (ls -l /dev/disk/by-id/, lsblk).
- Check kernel logs for link resets/timeouts.
- If “in use from other system”: verify no other host has it imported; then try a read-only forced import: zpool import -f -o readonly=on -N POOL.
- If devices are missing: fix the hardware path first (cables/HBA/enclosure), then re-run import discovery.
- If import succeeds read-only: mount selectively and copy irreplaceable data off before you attempt any write-enabled recovery.
Checklist C: When you suspect corruption
- Import read-only and without mounting (-o readonly=on -N) if possible.
- Run zpool status -v and identify permanent error paths.
- Decide the file-level response: restore, rebuild, or delete derived artifacts.
- Scrub only when devices are stable.
- If corruption keeps appearing: investigate RAM, HBA, firmware, power, and cabling. ZFS is reporting symptoms; your platform is supplying the disease.
Three corporate-world mini-stories
Mini-story 1: The incident caused by a wrong assumption
They had a storage server with twelve disks in RAIDZ2, a pair of mirrored boot SSDs, and a neat spreadsheet mapping bays to serial numbers. The spreadsheet was “pretty accurate,” which is like saying a parachute is “mostly packed.”
A disk started timing out. The on-call engineer saw /dev/sdj accumulating errors and pulled the “disk in slot 10,” because that’s what last quarter’s mapping said. The pool got worse immediately. Now two members of a RAIDZ2 vdev were missing: the truly failing disk and the perfectly healthy disk that was removed.
The scary part was how normal it looked at first. ZFS didn’t scream in a new way. It just moved from “degraded but available” to “import fails intermittently” because the remaining disks were suddenly under massive reconstruction load and the original flaky link was still flapping.
Recovery took longer than it needed to because they spent hours debating ZFS commands. The fix was boring: re-seat the original suspect disk, rebuild the correct bay-to-serial mapping from enclosure data, then replace the actual failing serial. After that, resilver completed and the pool came back clean.
Postmortem action item: ban /dev/sdX from any operational procedure and require serial confirmation before a disk pull. It didn’t make anyone feel clever. It made the next incident survivable.
Mini-story 2: The optimization that backfired
A team wanted faster metadata operations for millions of small files. They added a special vdev on a pair of fast SSDs. It worked. Latency dropped, directory traversals got snappy, and everyone got their “we fixed storage” dopamine hit.
Months later, one of the SSDs started erroring. Not catastrophically—just enough to intermittently vanish under load. The pool didn’t simply degrade like a normal data mirror. The metadata-heavy workload made the special vdev the center of the universe, so every hiccup cascaded into application timeouts.
During the incident, someone suggested, “If it’s just cache, can we detach it?” That’s where the misunderstanding lived: a special vdev isn’t a cache. It can be a required store for metadata (and optionally small blocks). Lose it, and the pool may become unusable.
The recovery was straightforward but tense: stabilize the device path (it turned out to be a marginal backplane connector), replace the bad SSD, and resilver the special vdev mirror. After that, the pool returned to normal performance and stability.
Lesson learned: if you add an optimization component that can take the whole pool down, it’s not an optimization. It’s a new tier of critical infrastructure. Treat it like production, because it is.
Mini-story 3: The boring but correct practice that saved the day
An enterprise team ran weekly scrubs, monthly restore tests, and they kept a habit that felt almost old-fashioned: after any hardware change, they captured zpool status, zpool get all for key properties, and a device inventory with serials and WWNs. It was paperwork. Nobody bragged about it.
One afternoon, a storage shelf rebooted unexpectedly. The OS came back with different device enumeration order, and one path didn’t return. The pool didn’t import automatically. The incident channel started to form, as it does, like storm clouds around a picnic.
Instead of improvising, they followed the playbook. They compared current by-id inventory against the last captured snapshot from their own records, identified exactly which serial was missing, and correlated it with enclosure logs. It wasn’t a dead disk—it was a SAS path that failed to re-login after the shelf reboot.
They fixed the path, re-ran zpool import, imported read-only to validate, then imported normally. No “force” flags, no label clearing, no magic. The outage was short because the team spent years investing in being boring.
Joke #2: The best storage heroism is so dull nobody notices—like a fire extinguisher that never gets used.
One quote that belongs in your incident channel
Werner Vogels (paraphrased idea): Everything fails all the time; design and operate assuming failure is normal.
Decision points that matter (and why)
Read-only import is your pressure relief valve
When you’re unsure whether devices are stable, read-only import lets you inspect datasets, validate metadata, and copy data off without committing new writes. It’s not a cure; it’s a safer diagnostic posture.
Scrub is not first aid; it’s a full-body scan
A scrub is heavy. It touches everything. On a pool with a marginal disk or a flaky HBA, scrubbing can turn “mostly okay” into “now we’ve discovered every weak sector at once.” That discovery is useful, but not when your transport layer is on fire.
Resilver speed is a health signal, not a comfort metric
Slow resilvers happen for innocent reasons (busy pool, SMR drive behavior). They also happen because the system is doing endless retries on reads. Watch for resets and I/O errors in logs; that’s where the truth lives.
FAQ
1) Should I reboot a ZFS server when the pool is degraded?
Not as a first move. Rebooting can reorder devices, lose transient state, and make a flapping link look like a missing disk. Stabilize hardware and capture outputs first.
2) When is zpool import -f appropriate?
When you’ve confirmed the pool is not imported elsewhere and the “in use” state is stale (host move, crash). Prefer -o readonly=on -N alongside -f for the first import if you’re uncertain.
3) If SMART says “PASSED,” can the drive still be bad?
Yes. “PASSED” is often a threshold check. Pending sectors, uncorrectables, and growing CRC errors are operational signals regardless of the overall result.
4) What do checksum errors actually mean?
ZFS read data, computed its checksum, and it didn’t match what was stored. The bad data could come from disk media, a cable, a controller, RAM, or even firmware. Treat it as evidence and correlate.
5) Is a scrub the same as a resilver?
No. Scrub is proactive verification of all data. Resilver is reconstruction onto a replacement or returning device. Both are heavy reads; resilver also writes.
6) Can I just detach a failing “cache” device to get the pool back?
It depends on what it is. L2ARC is typically removable without losing pool integrity. A SLOG device can usually be removed, with some caveats. A special vdev may be a required part of the pool; don’t treat it like cache.
7) Why does ZFS show a long numeric ID instead of a device name?
That’s the vdev GUID. ZFS uses it to track members even if device paths change. It’s also your clue that ZFS remembers something that the OS isn’t presenting right now.
8) What’s the safest way to move a pool to another machine for recovery?
Export the pool cleanly if you can, move disks, then import read-only with an alternate root (-o readonly=on -R /mnt -N) to inspect before you commit to writes.
9) If the pool is degraded, should I run zpool clear to “fix” it?
No. zpool clear resets error counters. It’s useful after fixing the underlying issue to confirm whether errors return. It does not repair data.
10) How do I decide between “replace disk” and “fix cabling”?
Look at evidence: CRC errors and link resets suggest transport. Pending/uncorrectable sectors suggest media. If multiple disks on the same HBA path show issues, suspect the shared component.
Conclusion: next steps you can actually do
If you remember nothing else: stabilize the device layer, import read-only when unsure, and never replace a disk you can’t positively identify. ZFS is built to survive hardware failure. It’s less tolerant of improvisation.
Practical next steps for your environment:
- Write a one-page on-call runbook using the “Fast diagnosis playbook” and the checklists above.
- Ensure every pool uses stable device paths (by-id/WWN), and document how to map serials to bays.
- Schedule scrubs and alert on new checksum errors, not just pool state changes.
- Test restores. Not in theory. On real data. The day you need it is not the day to discover your backups are performance art.
- Audit any “optimization” vdevs (special, SLOG) and make sure they’re redundant and monitored like first-class citizens.