ZFS Disaster Scenarios: What Fails First and What You Can Save

ZFS doesn’t usually fail with fireworks. It fails like a corporate laptop battery: slowly, quietly, and right before the demo.
One day a pool goes DEGRADED, then a scrub starts “finding” things, then an import hangs while your pager politely
informs you that sleep is now a legacy feature.

This is a field guide for those moments. Not a tour of features. A map of the cliffs: what tends to break first, how to confirm it
fast, and what data you can still salvage—without making the classic “I fixed it by deleting it” move.

How ZFS “dies”: the failure stack

ZFS isn’t a single thing. It’s a set of interlocking promises: checksums, copy-on-write, transactional semantics, replication,
caching, and a storage topology that mixes “disks” with “intent.” When the system is stressed, those promises fail in a predictable
order.

1) The first to fail is usually the environment, not ZFS

In real incidents, the earliest failure is commonly outside the pool:

  • Power (brownouts, flaky PSUs, dual-feed assumptions)
  • Controllers (HBA firmware bugs, expander resets, SATA link flaps)
  • Memory (non-ECC plus bad luck; ECC plus failing DIMM)
  • Thermals (drives erroring under heat, throttling, or outright dropping)
  • Humans (the most compatible component; it works with every system)

2) Then I/O: latency spikes become timeouts become “device removed”

ZFS is conservative. When a vdev starts returning slow or inconsistent I/O, ZFS prefers to stop trusting it. That’s how you get a pool
that’s technically still online but practically unusable: everything blocks on a few dying members, the retry logic piles up, and
the application interprets it as “storage is down.”
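
You can often see this stage forming before any error counter moves: the tell is latency, not failures. A minimal check on Linux with a reasonably recent OpenZFS (the -l flag adds latency columns):

cr0x@server:~$ sudo zpool iostat -v -l tank 5
# -l adds total_wait / disk_wait / queue-wait columns per device.
# One member with disk_wait far above its siblings is the disk ZFS is about
# to stop trusting, even while READ/WRITE/CKSUM still read zero.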

3) Metadata becomes the battleground

User data often survives longer than metadata integrity. If the pool can’t read the structures that describe datasets, snapshots, and
block pointers, your “data” becomes a warehouse with no inventory system. This is why special vdevs and tiny metadata devices matter
more than people think.
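
If you are not sure whether your pool even has that dependency, the layout will tell you. A minimal check (tank as the example pool, as elsewhere in this guide):

cr0x@server:~$ sudo zpool status tank | grep -A 4 special
cr0x@server:~$ sudo zpool list -v tank
# If a 'special' vdev shows up as a single device instead of a mirror,
# treat it as a standing incident, not a configuration detail.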

4) Finally, trust collapses: checksum errors that can’t self-heal

A single checksum error is a warning shot. A pattern of unrecoverable checksum errors is ZFS telling you it can’t find a valid copy.
In mirrors/RAIDZ, that usually means you lost too many replicas or parity can’t reconstruct because the error surface is larger than
redundancy.
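
The counters in zpool status tell you how many; the event log tells you the pattern. A minimal look, assuming Linux with OpenZFS:

cr0x@server:~$ sudo zpool events | grep checksum | tail -20
# Repeated checksum events against one device point at that device or its path.
# The same class spread across many devices points at something shared:
# controller, backplane, power, or memory.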

Paraphrased idea from Werner Vogels (Amazon CTO): “Everything fails, all the time—design and operate like that’s normal.” ZFS helps,
but it won’t repeal physics.

Interesting facts & historical context (why it behaves this way)

ZFS disaster behavior makes more sense when you remember where it came from and what it was designed to survive.

  1. ZFS was born at Sun in the early 2000s to fix silent corruption and admin pain from volume manager + filesystem splits.
  2. Copy-on-write was the point: ZFS writes new blocks and then commits pointers, so it avoids “half-written” filesystem structures after crashes.
  3. Checksums are end-to-end: ZFS checksums data as it is stored, not as the disk claims it is. This is why it can detect “bit rot” rather than guessing.
  4. Scrubs weren’t a luxury; they were a response to real-world corruption that classic RAID happily served as “valid” data.
  5. The pool model was a rebellion: instead of carving disks into volumes, you build a pool and allocate datasets. Admins stopped playing Tetris.
  6. “RAIDZ” exists because write holes do: classic RAID5/6 can produce inconsistent stripes after power loss; ZFS tries to avoid that with transactional writes.
  7. The ARC/L2ARC design reflects expensive RAM at the time: ARC in RAM, optional L2ARC on flash when flash got cheap enough to matter.
  8. OpenZFS became the continuity plan after Sun’s fall; today it’s a cross-platform project with different defaults and feature flags depending on OS.

What fails first: common disaster scenarios

Scenario A: Single disk failure in mirror/RAIDZ

This is the “normal” failure. If your topology has redundancy and the rest of the system is healthy, ZFS does what it’s supposed to:
it degrades, keeps serving, and demands a replacement.

What fails first here is usually performance, not correctness. A degraded RAIDZ vdev has less parallelism. A degraded
mirror is one disk away from a long weekend.

Scenario B: Multiple disk failures during resilver

The dangerous window isn’t “a disk failed.” The dangerous window is “a disk failed and now we’re hammering the remaining disks.”
Resilver and scrub are heavy. They increase load, heat, and error rates. Drives that were marginal become honest.

What fails first: second disks, often in the same batch, same age, same vibration profile. If your procurement policy
buys identical drives on the same day, congratulations: you’ve built synchronized failure.

Scenario C: Controller/HBA resets causing transient device loss

ZFS reacts to devices disappearing. If a controller resets and all disks vanish for a moment, ZFS may mark them as unavailable.
Depending on timing and OS behavior, you can get:

  • Pool goes DEGRADED because one path flapped
  • Pool goes SUSPENDED because I/O errors exceeded safety thresholds
  • Import problems after reboot because device names changed

What fails first: your assumptions about “disk failure”. The disks may be fine. The bus is not.
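
The kernel log is the fastest way to tell them apart: one sick disk complains alone, while a sick bus makes several disks complain in the same second. A rough filter (exact messages vary by driver and kernel):

cr0x@server:~$ sudo dmesg -T | grep -iE 'link reset|hard resetting|i/o error|timeout' | tail -25
# Resets or timeouts across several different devices within a few seconds
# point at the HBA, expander, cabling, or power, not at the disks themselves.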

Scenario D: Power loss + write cache lies

ZFS is transactional, but it still relies on the storage stack to honor flushes. If a drive or controller acknowledges writes it
didn’t persist, you can get corrupted intent logs, inconsistent metadata, and a pool that imports but behaves like it has a concussion.

If you run sync=disabled on anything you claim is “important,” you are not optimizing; you are gambling with someone else’s job.

Joke #1: “sync=disabled” is like removing your car’s seatbelt to improve acceleration. Technically true, strategically questionable.
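
It is worth knowing where that bet has already been placed before a power event finds it for you. A minimal audit of explicitly set sync values (tank as the example pool):

cr0x@server:~$ zfs get -r -s local,received -o name,property,value,source sync tank
# Only datasets where sync was explicitly set are listed. 'disabled' on anything
# holding databases, mail, or VM disks is a decision someone should have to defend.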

Scenario E: Special vdev failure (metadata/small blocks)

Special vdevs are powerful and dangerous. If you put metadata (and possibly small blocks) on a special vdev, you have created a
dependency: lose that vdev, lose the pool. Mirrors help, but the blast radius is real.

What fails first: the pool’s ability to find anything. You may still have most data blocks intact on main vdevs,
but you can’t assemble the filesystem without the metadata structures.

Scenario F: Pool nearly full → fragmentation → “sudden” latency collapse

The most boring disasters are self-inflicted. Pools above ~80–90% full (depending on recordsize, workload, and topology) fragment.
Allocation becomes expensive. Writes amplify. Latency spikes. Applications time out. People blame “ZFS overhead” while the pool is
trying to allocate blocks like it’s packing a moving truck with no empty boxes.
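
Snapshots are the usual silent consumer when a pool "suddenly" has no room. A minimal way to see where the space actually went:

cr0x@server:~$ zfs list -o space -r tank | head -15
# USEDSNAP is space pinned by snapshots; USEDDS is live data in the dataset.
# Deleting files frees nothing if older snapshots still reference those blocks.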

Scenario G: Memory pressure and mis-sized ARC in a noisy neighbor world

ZFS loves RAM. Your database also loves RAM. Your container platform loves RAM. When they fight, the kernel wins by killing someone.
ZFS in heavy memory pressure can thrash, evicting ARC, increasing disk I/O, increasing latency, increasing pressure. A feedback loop
that ends with “storage is slow” tickets.
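
On Linux you can watch that loop forming. A minimal look at ARC size against its targets plus hit/miss counters, using OpenZFS's kstat interface:

cr0x@server:~$ awk '/^(size|c|c_min|c_max|hits|misses) / {print $1, $3}' /proc/spl/kstat/zfs/arcstats
# 'size' shrinking toward c_min while 'misses' climbs means the ARC is being
# evicted under memory pressure and those reads are now landing on disk.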

Scenario H: Encryption key loss

Native ZFS encryption is good engineering. It is also unforgiving. Lose the key (or the passphrase + key location), and the pool
can be perfectly healthy while your data is perfectly inaccessible. This is not a bug. This is the deal you made.
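
An inventory taken before the bad day beats any recovery attempt after it. A minimal check of which datasets are encrypted and whether their keys are currently loadable:

cr0x@server:~$ zfs list -r -o name,encryption,keyformat,keylocation,keystatus tank
# 'keylocation=prompt' means a human must type something during recovery;
# 'file://...' means that file is now part of the disaster plan. Keep key
# backups somewhere that does not die with the pool.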

What you can save (and what you probably can’t)

When you can usually save everything

  • Single-disk failure in a mirror/RAIDZ with prompt replacement
  • Transient controller errors that didn’t corrupt metadata (fix the bus, clear, scrub)
  • Application-level mistakes when you have snapshots (rollback/clone/restore)
  • Accidental deletes if snapshots exist and haven’t been pruned

When you can usually save some things

  • A pool that imports read-only but panics or hangs when imported read-write: you can often zfs send critical datasets out
  • Some unrecoverable checksum errors: you can sometimes copy unaffected datasets or files, then rebuild
  • Metadata damage limited to some datasets: other datasets may mount; prioritize exports immediately

When you are mostly saving logs and lessons

  • Special vdev lost without redundancy: pool is generally not recoverable in practice
  • More vdev members lost than redundancy allows (e.g., RAIDZ1 with 2 disk failures)
  • Encryption keys gone (no key, no data, no exceptions)

Joke #2: Data recovery without backups is like trying to unburn toast with a screwdriver. You can be very busy and still be very hungry.

Fast diagnosis playbook (first/second/third)

The fastest path to sanity is to separate pool health, device health, and performance bottleneck.
Don’t start with heroics. Start with facts.

First: Is the pool safe to touch?

  1. Check pool state: zpool status -xv and zpool list. If you see SUSPENDED, stop and stabilize.
  2. Confirm you’re not out of space: zfs list -o name,used,avail,refer,mountpoint. 95% full pools create “mystery outages.”
  3. Check for active resilver/scrub: status output tells you; if it’s ongoing during an incident, consider pausing competing workloads.

Second: Is this disk failure, bus failure, or software failure?

  1. Look at dmesg/journal for link resets and timeouts (SATA/SAS sense errors, NVMe resets).
  2. Check SMART/NVMe health for the devices ZFS complains about. One bad disk is plausible; eight “bad disks” at once is usually a controller.
  3. Verify stable device IDs (by-id paths). If disk names changed, imports can get weird.

Third: Find the bottleneck quickly

  1. I/O latency: zpool iostat -v 1—find the vdev with high wait and low throughput.
  2. ARC pressure: if ARC is tiny and misses are high, you’re going to disk.
  3. Sync write pressure: if you have a SLOG and it’s slow or dead, synchronous workloads will crawl.
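
If you want this order followed at 03:17, make it a script instead of a memory. A minimal capture sketch, run as root; the output path is an assumption, adjust to taste:

#!/bin/sh
# zfs-triage.sh - record the facts before anyone starts "fixing" things.
TS=$(date +%Y%m%d-%H%M%S)
OUT=/var/tmp/zfs-triage-$TS
mkdir -p "$OUT"

zpool status -v      > "$OUT/zpool-status.txt" 2>&1   # pool and device state
zpool list -v        > "$OUT/zpool-list.txt"   2>&1   # capacity and fragmentation
zpool iostat -v 1 5  > "$OUT/zpool-iostat.txt" 2>&1   # five seconds of per-device I/O
zfs list -o space    > "$OUT/zfs-space.txt"    2>&1   # where the space went
dmesg -T | tail -200 > "$OUT/dmesg-tail.txt"   2>&1   # link resets, timeouts, OOM kills

echo "Triage snapshot written to $OUT"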

Practical tasks: commands, output meaning, and decisions

These are the “do this at 03:17” tasks. Each one includes what to look for and what decision it should trigger. Commands assume Linux
with OpenZFS; adjust for your platform, but keep the logic.

Task 1: Confirm whether anything is actually broken

cr0x@server:~$ sudo zpool status -xv
all pools are healthy

Meaning: ZFS doesn’t see pool-level faults. Your outage is likely above ZFS (application, network) or below it (hardware causing latency without errors).
Decision: Move to performance triage (zpool iostat, ARC, and system metrics) instead of disk replacement.

Task 2: Read the real pool story, not the comforting summary

cr0x@server:~$ sudo zpool status -v
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Restore the file in question if possible.  Otherwise restore the entire pool from backup.
  scan: scrub repaired 0B in 02:11:45 with 3 errors on Thu Dec 26 01:02:10 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            ata-WDC_WD80...-part1   ONLINE       0     0     0
            ata-WDC_WD80...-part1   ONLINE       0     0     1
            ata-WDC_WD80...-part1   ONLINE       0     0     0
            ata-WDC_WD80...-part1   ONLINE       0     0     0
            ata-WDC_WD80...-part1   ONLINE       0     0     0
            ata-WDC_WD80...-part1   ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
        /tank/data/db/index/segments_12

Meaning: The pool is up, but at least one block could not be reconstructed and a file is damaged.
Decision: Treat this as a data incident, not just hardware. Restore the file from application replication or backups; after that, scrub again and consider replacing the disk with CKSUM errors.

Task 3: Identify whether “READ/WRITE/CKSUM” points to disk, cabling, or memory

cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sda     ONLINE       0     0    12
            sdb     ONLINE       0     0     0

Meaning: CKSUM errors on sda suggest bad reads from that device path. It can be the disk, the cable/backplane, or memory during reads (less common with ECC).
Decision: Pull SMART and kernel logs for sda; if link resets show up, suspect cabling/HBA. If SMART shows reallocated/pending sectors, replace disk.

Task 4: Check whether the pool is capacity-choking itself

cr0x@server:~$ zpool list -o name,size,alloc,free,frag,health
NAME   SIZE   ALLOC   FREE  FRAG  HEALTH
tank  43.5T   40.9T  2.6T   73%  ONLINE

Meaning: 73% fragmentation with little free space is a performance trap.
Decision: Stop arguing about tunables and create space. Delete old snapshots, move cold data off, or expand the pool. If the workload is random-write heavy, consider adding vdevs, not replacing disks with bigger ones one-by-one.

Task 5: Find the hot vdev under load

cr0x@server:~$ sudo zpool iostat -v tank 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        40.9T  2.6T     210    980  34.1M  112M
  raidz2-0  40.9T  2.6T     210    980  34.1M  112M
    sda         -      -      35    165  5.6M   19M
    sdb         -      -      36    164  5.5M   19M
    sdc         -      -      34    166  5.7M   18M
    sdd         -      -      35    321  5.5M   37M
    sde         -      -      35      0  5.6M     0
    sdf         -      -      35    164  5.6M   19M

Meaning: One disk (sdd) is taking disproportionate writes; could be normal distribution, could be retries, could be a path issue.
Decision: Correlate with zpool status error counters and kernel logs. If one disk has high latency/timeouts, replace or reseat it.

Task 6: Confirm whether ZFS is throttling due to errors (suspended I/O)

cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     FAULTED     75     0     0  too many errors

Meaning: ZFS stopped I/O to prevent further damage or stalls.
Decision: Do not “clear and pray” repeatedly. Fix the underlying device/path issue first (replace disk, fix HBA/backplane), then zpool clear and scrub.

Task 7: Verify device identity to avoid replacing the wrong disk

cr0x@server:~$ ls -l /dev/disk/by-id/ | grep -E 'sdc|WDC|SEAGATE' | head
lrwxrwxrwx 1 root root  9 Dec 26 01:10 ata-WDC_WD80... -> ../../sda
lrwxrwxrwx 1 root root  9 Dec 26 01:10 ata-WDC_WD80... -> ../../sdb
lrwxrwxrwx 1 root root  9 Dec 26 01:10 ata-WDC_WD80... -> ../../sdc

Meaning: You can map logical device names to stable IDs.
Decision: Use by-id paths in replacements and documentation. If your runbook says “replace sdc,” update your runbook; it’s lying as soon as you reboot.

Task 8: Check SMART for the “it’s dying” markers

cr0x@server:~$ sudo smartctl -a /dev/sdc | grep -E 'Reallocated|Pending|Offline_Uncorrectable|CRC|SMART overall'
SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

Meaning: “PASSED” is not a clearance letter. Pending and uncorrectable sectors are real problems.
Decision: Replace the drive. If CRC errors were high instead, you’d suspect cabling/backplane first.

Task 9: Replace a failed device the safe way

cr0x@server:~$ sudo zpool replace tank /dev/disk/by-id/ata-WDC_WD80_BAD /dev/disk/by-id/ata-WDC_WD80_NEW
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: DEGRADED
scan: resilver in progress since Thu Dec 26 01:22:01 2025
        1.23T scanned at 2.10G/s, 412G issued at 703M/s, 40.9T total
        412G resilvered, 0.98% done, 16:10:12 to go

Meaning: Resilver has started; time estimate is often optimistic fiction.
Decision: Reduce workload if you can. Watch for additional errors. If more disks start erroring during resilver, stop pretending it’s a single-disk event and reassess hardware.

Task 10: Confirm scrub behavior and interpret error counts

cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
scan: scrub in progress since Thu Dec 26 02:01:44 2025
        7.88T scanned at 1.05G/s, 1.12T issued at 153M/s, 40.9T total
        0B repaired, 2.74% done, 06:12:10 to go

Meaning: Scrub is verifying checksums. “0B repaired” is good so far; if it ends with repaired >0 and no permanent errors, redundancy worked.
Decision: If scrub finds permanent errors, identify affected files and restore from snapshots/backup. Don’t ignore “only a few” checksum errors.

Task 11: Mount problems—see what ZFS thinks is mounted

cr0x@server:~$ zfs list -o name,mounted,mountpoint,canmount
NAME                 MOUNTED  MOUNTPOINT     CANMOUNT
tank                 yes      /tank          on
tank/data             no      /tank/data     on
tank/data/db          no      /tank/data/db  on

Meaning: Datasets are present but not mounted. Could be due to dependency failures, locked encryption keys, or mountpoint conflicts.
Decision: Check for encryption state and errors in zpool status and system logs. Attempt explicit mount and capture the error.

Task 12: Encryption triage—can you load keys?

cr0x@server:~$ zfs get -H -o name,property,value encryption,keylocation,keystatus tank/data
tank/data	encryption	aes-256-gcm
tank/data	keylocation	prompt
tank/data	keystatus	unavailable
cr0x@server:~$ sudo zfs load-key -a
Enter passphrase for 'tank/data': 
cr0x@server:~$ zfs get -H -o name,property,value keystatus tank/data
tank/data	keystatus	available

Meaning: Dataset was encrypted and locked; now it’s unlocked.
Decision: If you can’t load keys, stop. Don’t destroy and recreate anything. Your incident is now “key management,” not “storage reliability.”

Task 13: Import a pool safely after a crash (read-only first)

cr0x@server:~$ sudo zpool import
   pool: tank
     id: 1234567890123456789
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        tank        ONLINE
          raidz2-0  ONLINE
            sda     ONLINE
            sdb     ONLINE
            sdc     ONLINE
            sdd     ONLINE
            sde     ONLINE
            sdf     ONLINE
cr0x@server:~$ sudo zpool import -o readonly=on tank
cr0x@server:~$ zpool get readonly tank
NAME  PROPERTY  VALUE   SOURCE
tank  readonly  on      -

Meaning: You imported without allowing writes. That reduces risk while you assess.
Decision: If you suspect corruption, keep read-only and start extracting critical datasets via zfs send or file-level copy to a safe destination.

Task 14: Find what changed recently (snapshots, holds, clones)

cr0x@server:~$ zfs list -t snapshot -o name,creation -S creation | head
NAME                         CREATION
tank/data@auto-2025-12-26-0100  Thu Dec 26 01:00 2025
tank/data@auto-2025-12-26-0000  Thu Dec 26 00:00 2025
tank/data@auto-2025-12-25-2300  Wed Dec 25 23:00 2025

Meaning: You have recent snapshots; good. If you don’t, your future self is about to have a serious talk with you.
Decision: If the incident is “bad deployment deleted stuff,” restore from snapshot (clone or rollback). If the incident is corruption, use snapshots for zfs send to a new pool.

Task 15: Replicate out the most valuable data first

cr0x@server:~$ sudo zfs send -v tank/data@auto-2025-12-26-0100 | sudo zfs receive -u rescue/data
full send of tank/data@auto-2025-12-26-0100 estimated size is 3.12T
total estimated size is 3.12T
TIME        SENT   SNAPSHOT
00:00:10   5.28G   tank/data@auto-2025-12-26-0100

Meaning: You’re streaming a consistent snapshot to a rescue pool/dataset, not trying to “fix in place.”
Decision: If you see send errors, note the dataset and adjust priority. You can often salvage other datasets even if one is damaged.
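
If the source pool is still writable and the evacuation runs for hours, incremental sends keep the rescue copy current without resending everything. A sketch with a hypothetical later snapshot name:

cr0x@server:~$ sudo zfs snapshot tank/data@rescue-0330   # hypothetical snapshot taken mid-evacuation
cr0x@server:~$ sudo zfs send -v -i @auto-2025-12-26-0100 tank/data@rescue-0330 | sudo zfs receive -u rescue/data
# -i sends only the blocks changed since the snapshot the rescue side already has.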

Three corporate-world mini-stories

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran a ZFS-backed VM platform. Two storage nodes, each with a RAIDZ2 pool. Replication existed, but it was “best effort.”
The team’s assumption—never written down, always repeated—was that RAIDZ2 meant “we can lose two drives and be fine.”

One Friday, a pool went DEGRADED. A drive was replaced. Resilver began. During resilver, a second drive started logging timeouts.
Everyone was calm: “We’re still within RAIDZ2.” Then the controller reset. All drives dropped for a moment, came back, and ZFS marked a third
device as faulted due to accumulated errors. The pool went from “fine” to “SUSPENDED” in a blink.

The wrong assumption wasn’t about parity math. It was about the idea that failures are independent and that “losing” a device is the same as
“a disk physically dead.” In reality, a bad HBA or backplane can make multiple disks look dead at once. ZFS doesn’t care why the device stopped
answering; it only knows it stopped.

The recovery was not heroic. They imported read-only, immediately started zfs send for the most critical VM datasets to the other node,
and accepted that some low-priority VMs would be rebuilt from images later. The fix was boring: new HBA firmware, better cabling discipline,
and a policy that “any multi-disk anomaly is bus-first until proven otherwise.”

Mini-story 2: The optimization that backfired

A different shop ran latency-sensitive NFS on ZFS. They benchmarked, saw big gains by disabling sync, and shipped it to production.
The justification sounded reasonable: “The app is already resilient, and we have UPS.”

Months later, a power event hit. Not a full outage—just a brownout that made one node reboot. The pool imported. Services started. Users noticed
odd file corruption: small, random, non-reproducible. The kind of problem that makes everyone question reality.

What happened: “sync disabled” allowed the system to acknowledge writes before they were durable. The UPS didn’t help because the failure mode
wasn’t “power off” so much as “hardware behaved unpredictably under unstable power.” ZFS did its job within the contract it was given; the contract
had been rewritten by a tuning flag.

The fix was painful but straightforward: restore affected datasets from known-good snapshots and replication, re-enable sync, and add a SLOG
device sized and spec’d for sustained synchronous writes. The lesson stuck. They still benchmark. They just benchmark reality, not a fantasy.

Mini-story 3: The boring but correct practice that saved the day

A SaaS provider had a ZFS pool hosting customer uploads and some internal build artifacts. Their data wasn’t “life or death,” but it was the kind
of data that turns into churn when it disappears. They ran weekly scrubs and daily snapshot replication to a separate system. No drama, no
executive slides. Just calendar-driven maintenance.

One week, scrub reported a handful of checksum errors that it repaired. The on-call engineer filed a ticket anyway, because “repaired” isn’t the
same as “fine.” SMART showed a drive with rising pending sectors. The drive was replaced during business hours. Nobody noticed.

Two weeks later, a different drive in the same chassis failed hard during peak load. Resilver started. Another disk began to wobble. This time,
there was a real chance of crossing redundancy limits if things got worse.

The team didn’t play chicken with resilver. They reduced workload, temporarily stopped nonessential writes, and confirmed replication freshness.
When the pool survived, great. If it hadn’t, they were already ready to fail over. The “boring practice” wasn’t just scrubbing; it was having
a plan to stop digging when the hole got deep.

Common mistakes: symptom → root cause → fix

1) Symptom: Pool is ONLINE but everything is slow

  • Root cause: Pool nearly full + fragmentation; or a slow/failing device causing queue buildup.
  • Fix: Check zpool list for free space/frag; run zpool iostat -v 1 to find the lagging device; make space and replace/repair the slow member.

2) Symptom: Many disks “fail” at once

  • Root cause: HBA reset, expander/backplane, power issue, or cabling.
  • Fix: Inspect system logs for link resets/timeouts; reseat/replace HBA or backplane; don’t shotgun-replace drives until you prove they’re bad.

3) Symptom: CKSUM errors increase, SMART looks fine

  • Root cause: Path integrity (cables), controller issues, or memory corruption.
  • Fix: Check UDMA CRC counters, reseat cables, test memory, move the drive to a different port/controller and see if errors follow the drive or the path.

4) Symptom: Scrub finds “permanent errors”

  • Root cause: ZFS cannot reconstruct at least one block from redundancy.
  • Fix: Restore affected files/datasets from snapshot/backup/replica; after restoration, scrub again. Treat it as a data integrity incident.

5) Symptom: Import hangs or takes forever

  • Root cause: One or more devices timing out; extremely slow reads during replay; massive metadata churn.
  • Fix: Import read-only; isolate failing disks (SMART/logs); try importing on a system with known-good HBAs; prioritize exporting critical data.

6) Symptom: Dataset won’t mount after reboot

  • Root cause: Encryption key not loaded; mountpoint conflict; canmount=off; or corrupted/blocked dataset properties.
  • Fix: Check zfs get keystatus, zfs list -o mounted,canmount,mountpoint, and fix keys/properties; mount explicitly and capture errors.

7) Symptom: Resilver is painfully slow

  • Root cause: Heavy concurrent workload; SMR drives; slow replacement disk; ashift mismatch; controller bottleneck.
  • Fix: Reduce load, verify drive type, check zpool iostat for per-disk throughput; don’t “optimize” by raising tunables blindly—find the limiter.

8) Symptom: Everything died after adding a special vdev

  • Root cause: Special vdev was not redundant or not monitored; it failed and took metadata with it.
  • Fix: If it’s gone, recovery options are limited. Prevent it next time: mirror special vdevs, monitor like your job depends on it (because it does), and treat them as tier-0 devices.

Checklists / step-by-step plan

Checklist: When a pool goes DEGRADED

  1. Capture zpool status -v output for the incident record.
  2. Check if resilver/scrub is running; decide whether to reduce load.
  3. Validate capacity/fragmentation: zpool list, zfs list.
  4. Map device names to by-id and physical slots before touching hardware.
  5. Pull SMART/NVMe health and check system logs for link resets.
  6. Replace the failing component (disk vs cable vs HBA) based on evidence.
  7. Monitor resilver; watch for new errors on other members.
  8. Scrub after resilver completes.
  9. If permanent errors are reported: restore affected files/datasets from known-good sources.

Checklist: When you see permanent checksum errors

  1. Stop writing if possible; consider importing read-only if you can reboot safely.
  2. Identify affected files from zpool status -v.
  3. Decide restore source: snapshots, replication, application-level rebuild, or backups.
  4. Restore and verify integrity (application checks, hashes, or domain-specific validation).
  5. Scrub again; if errors persist, suspect additional silent damage and escalate to pool migration.

Checklist: When performance collapses but health looks fine

  1. Check pool fullness and fragmentation.
  2. Run zpool iostat -v 1 and find the slowest device/vdev.
  3. Check for sync workload pressure and SLOG health (if used).
  4. Check memory pressure and swapping; storage “slowness” can be a RAM war.
  5. Look for background activity: scrub/resilver, snapshot destroys, replication receives.
  6. Fix the bottleneck you can prove. Don’t tune first.

FAQ

1) What actually fails first in most ZFS disasters?

The environment: power, controllers, cabling, or human changes. ZFS is often the messenger that refuses to lie about it.

2) Are checksum errors always a dead disk?

No. They can be disk media, but also cabling/backplane, controller issues, or memory corruption. Use SMART plus kernel logs to separate them.

3) Should I scrub during an incident?

If the pool is stable but you suspect latent corruption, a scrub is diagnostic and corrective. If hardware is actively failing, scrubbing can accelerate collapse. Stabilize first.

4) Is RAIDZ2 “safe enough” for big drives?

It’s safer than RAIDZ1, but safety depends on rebuild time, workload, and correlated failures. Mirrors rebuild faster; RAIDZ saves space. Pick based on your recovery objectives, not your spreadsheet.

5) Can I import a pool read-only to rescue data?

Yes, and you often should. zpool import -o readonly=on reduces risk while you assess and extract data.

6) What’s the most unrecoverable ZFS failure mode?

Losing encryption keys is absolute. Losing a non-redundant special vdev is close behind. In both cases, the pool can be “healthy” and still unusable.

7) Does adding a SLOG make my pool safer?

It can make synchronous writes faster and more predictable. It does not replace backups, and a bad SLOG can become a performance or reliability problem if mis-specified.

8) Why does ZFS get slow when the pool is nearly full?

ZFS allocates blocks across free space; when free space is scarce and fragmented, allocation becomes expensive and write amplification increases. The fix is more space, not more hope.

9) Should I ever use sync=disabled?

Only when data loss is acceptable and explicitly agreed upon. For most production systems, it’s a footgun with a nice benchmark result.

10) Can I “clear” errors and move on?

You can clear counters, not reality. If errors are from a transient event and scrubs come back clean, fine. If errors keep increasing, you’re masking a live fault.

Next steps you can do this week

If you want fewer ZFS disasters, do less magic and more routine. ZFS rewards boring competence.

  1. Schedule scrubs and review results like they matter. They do.
  2. Establish a replacement policy based on SMART trends and age, not just failure.
  3. Move to stable device naming (by-id) in pool configs and runbooks.
  4. Test read-only import and snapshot restore on a non-production clone. Don’t wait for the incident to discover the backup is “theoretical.”
  5. Make space management a first-class metric: alert on pool >80–85% and snapshot growth anomalies (a minimal alert sketch follows this list).
  6. Audit special vdev and log devices: ensure redundancy, monitoring, and appropriate hardware.
  7. Write down the triage order (pool health → device/path health → performance bottleneck) and make on-call follow it.
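
For item 5, the alert does not need a monitoring platform to start existing. A minimal cron-able sketch; the threshold is an assumption, and wiring the warning into your alerting is left to you:

#!/bin/sh
# zpool-capacity-check.sh - warn when any pool crosses a capacity threshold.
THRESHOLD=80
zpool list -H -o name,capacity | while read -r name cap; do
    pct=$(printf '%s' "$cap" | tr -d '%')   # strip the trailing percent sign
    if [ "$pct" -ge "$THRESHOLD" ]; then
        echo "WARNING: pool $name is at $cap capacity"
    fi
done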

The goal isn’t to never see DEGRADED. The goal is to see it early, respond calmly, and never have to learn what “permanent errors” feels like in your gut.
