ZFS zpool status: Reading Health Like a Forensics Analyst

ZFS gives you a rare gift in production: it tells you what it knows, and it’s usually honest. But zpool status isn’t a yes/no health check; it’s a crash report, a crime scene photo, and a weather forecast in one. Read it like a forensics analyst and it will save you from the worst kind of outage: the one that looks fine until it doesn’t.

This is a field guide for SREs and storage engineers who want to move from “pool is DEGRADED, panic” to “here’s the failure domain, here’s the blast radius, here’s the next command.” We’ll dissect the output, map symptoms to root causes, and walk through real operational tasks—scrubs, clears, replacements, resilvers, and the awkward moments when the disks are innocent and the cabling is guilty.

What zpool status really is (and isn’t)

zpool status is a summary view of a transactional storage system that can self-heal, self-diagnose, and sometimes self-incriminate. It is not a full SMART report, not a performance dashboard, and not an oracle. It’s a structured set of clues: the pool topology, the state of each vdev, error counters, and the last few notable events.

In forensics terms: it’s the incident timeline plus the suspect list. You still have to interrogate suspects (SMART, cabling, HBAs, multipath), reconstruct the timeline (scrub/resilver logs), and confirm whether the “victim” is data, performance, or both.
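
Interrogating the first suspect usually means smartmontools. A minimal sketch, assuming a SATA disk at /dev/sdb and smartctl installed; SAS and NVMe devices report different attribute names:

cr0x@server:~$ sudo smartctl -a /dev/sdb | grep -Ei 'overall-health|reallocated|pending|udma_crc'

Reallocated or pending sectors point at the media; a climbing UDMA CRC count points at the cable or backplane, which is exactly the distinction zpool status alone cannot make for you.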

One operational truth: zpool status is biased toward correctness, not comfort. If it says “DEGRADED,” it’s not asking you to meditate; it’s asking you to take action, or at least to understand the risk you’ve accepted.

Anatomy of zpool status: every line matters

Let’s start with a representative output and then pull it apart like a postmortem.

cr0x@server:~$ sudo zpool status -v
  pool: tank
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: ZFS-8000-2Q
  scan: scrub repaired 0B in 0 days 02:41:19 with 0 errors on Tue Dec 24 03:12:18 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            ata-SAMSUNG_SSD_860...  ONLINE       0     0     0
            ata-SAMSUNG_SSD_860...  UNAVAIL      0     0     0  cannot open
          raidz1-1                  ONLINE       0     0     0
            ata-WDC_WD40EFRX...     ONLINE       0     0     0
            ata-WDC_WD40EFRX...     ONLINE       0     0     0
            ata-WDC_WD40EFRX...     ONLINE       0     0     0

errors: No known data errors

pool / state / status / action: the executive summary

pool: is the pool name. In incident response, it’s also your blast-radius label: everything mounted from this pool shares fate during an outage.

state: is the coarse health classification. It’s not just “good/bad”; it’s “how many ways can I still read the truth.” ONLINE means every device is working and whatever redundancy you designed is intact. DEGRADED means redundancy is compromised but enough replicas exist to keep serving data. FAULTED means a device, or the pool itself, is broken badly enough that ZFS can no longer provide consistent data from it.

status: is the human-readable interpretation. ZFS will often tell you whether it’s missing a device, has corruption, or has seen I/O errors. It’s not perfect, but it’s usually directionally right.

action: is your next step suggestion. Don’t treat it like a fortune cookie; treat it like a runbook hint. It may recommend zpool clear, zpool online, replacing a device, or restoring from backup.

scan: scrub/resilver output is your timeline

scan: tells you whether a scrub is running, finished, or whether a resilver (reconstruction after device replacement/reattach) is underway. This line answers crucial questions:

  • Did we scrub recently, or are we gambling on silent corruption?
  • Did ZFS repair anything? If yes, did it repair from parity/mirror or just detect?
  • Were there errors during scrub? “0 errors” is comforting; “with N errors” is an escalation.

config: topology is the map of failure domains

The config: tree is the most important section to read like a storage engineer, not like a helpdesk queue. It tells you:

  • Which vdev types you have (mirror, raidz1/2/3, special, log, cache, spare).
  • Which leaf devices back each vdev.
  • Which layer is degraded (a leaf disk, a whole vdev, or the pool).

ZFS pools are a “stripe of vdevs.” That means the pool’s availability depends on each top-level vdev. A single top-level vdev failure can take out the pool, even if other vdevs are fine. This is where people read “only one disk missing” and accidentally translate it to “only one disk matters.”
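
A quick way to enumerate the top-level vdevs and how full each one is, assuming a pool named tank:

cr0x@server:~$ sudo zpool list -v tank

Every first-level row under the pool name is a top-level vdev; lose any one of them and the pool goes with it, no matter how healthy the others look.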

errors: the deceptively calm ending

errors: No known data errors is not the same as “nothing bad happened.” It means ZFS currently knows of no unrepaired, persistent data corruption. You can still have:

  • Transient I/O timeouts that incremented READ/WRITE errors but didn’t corrupt data.
  • Checksum errors that were healed from redundancy (which is good) but are still a warning (which is also good).
  • A device that vanished and came back, leaving you one reboot away from a bad day.

First joke (one you’ll appreciate at 03:00): ZFS is like a good pager—when it’s quiet, it doesn’t mean everything’s fine, it means you haven’t asked the right question yet.

Health states: ONLINE, DEGRADED, FAULTED, UNAVAIL, SUSPENDED

ZFS state words are short because they have to fit on a console at the worst time of your week. Here’s how to interpret them operationally.

ONLINE

ONLINE means the vdev/pool is accessible and redundancy is intact at that layer. You can still have error counters accumulating on a disk while it remains ONLINE. “ONLINE with errors” is a real state of the world and it’s a warning sign for your future self.

DEGRADED

DEGRADED means ZFS has lost some redundancy but can still serve data. In a mirror, one side can be missing/faulted and the mirror is DEGRADED. In RAIDZ, one (RAIDZ1) or two (RAIDZ2) disks can be missing and the RAIDZ vdev stays alive, until it doesn’t.

Operationally: DEGRADED is where you still have time, but your time budget is unknown. It depends on workload, rebuild time, and whether the remaining devices are healthy.

FAULTED

FAULTED usually means ZFS has determined the device or vdev is broken enough that it can’t be used safely. Sometimes it’s a disk. Sometimes it’s the path to the disk. Sometimes it’s a whole controller that’s gone out for coffee and never came back.

UNAVAIL

UNAVAIL means ZFS cannot access the device. The disk might be gone, the OS might have renamed it, the SAS expander might be misbehaving, or multipath changed identities. UNAVAIL with “cannot open” is a classic: ZFS tried, the OS said “nope.”

SUSPENDED

SUSPENDED is the “stop the world” state. ZFS will suspend I/O when continuing would risk worse corruption or when it’s seeing repeated I/O failures. This is not a “clear errors and move on” moment; it’s “stabilize the system” time.

READ/WRITE/CKSUM: the counters that pay your salary

The three counters in zpool status are the fastest way to distinguish “dying disk” from “bad cable” from “cosmic rays” from “driver tantrum.” They’re also commonly misunderstood.

READ errors

A READ error means the device failed to return data ZFS asked for. This can be a media error, a timeout, a bus reset, or a path issue. If the vdev has redundancy, ZFS can fetch the data from elsewhere, then (often) rewrite it to heal. But the presence of READ errors is a strong signal that something in the stack is unreliable.

WRITE errors

A WRITE error means ZFS tried to write and the device didn’t accept it. This can indicate failing media, a controller issue, or that the device disappeared mid-flight. Repeated WRITE errors are scary because they often precede a device dropping entirely.

CKSUM errors

Checksum errors are ZFS doing what it was designed to do: detect data that doesn’t match its expected checksum. A CKSUM error is not “ZFS is broken”; it’s “ZFS caught something that would have been silent corruption elsewhere.”

CKSUM errors typically point to:

  • Bad RAM (less common with ECC, more common without).
  • Bad cabling/backplane/expander causing bit flips or dropped frames.
  • Firmware/driver issues returning wrong data without I/O errors (yes, that happens).
  • Failing disk returning corrupt data that still passes the disk’s own checks.

The big clue: if you see CKSUM errors without READ/WRITE errors, suspect the transport path or memory before you blame the disk.
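
Two cheap checks before blaming a disk, as a sketch: per-drive CRC counters (errors on the link between drive and controller) and whether the host even has ECC RAM. Assumes SATA drives named /dev/sda..sdz plus smartmontools and dmidecode installed; SAS and NVMe report this differently:

cr0x@server:~$ for d in /dev/sd?; do echo "== $d"; sudo smartctl -A "$d" | grep -i udma_crc; done
cr0x@server:~$ sudo dmidecode -t memory | grep -i 'error correction'

A CRC counter that climbs on one bay while the others stay flat is a cabling or backplane story; “Error Correction Type: None” keeps bad RAM on the suspect list.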

Second joke, because we’ve earned it: A checksum error is ZFS politely saying, “I found a lie.” It’s the storage equivalent of catching your GPS calmly suggesting you drive into a lake.

The “errors:” line: when “No known data errors” is a trap

That last line is a summary of whether ZFS believes there’s unrepaired corruption. You can have a pool with meaningful risk while still showing “No known data errors.” Common scenarios:

  • Healed corruption: ZFS detected bad data, repaired it from redundancy, and now everything is consistent. The event still matters because it indicates an underlying reliability problem.
  • Transient device loss: A disk vanished during peak load, errors incremented, then it came back. Data might be fine; redundancy might have been stressed. Your next reboot might be the reboot that doesn’t come back clean.
  • Scrub not recent: If you haven’t scrubbed in months, “No known data errors” just means “we haven’t looked thoroughly.” ZFS verifies checksums on read; cold blocks can stay unverified for a long time. A quick way to check scrub recency across pools is shown below.
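
A one-liner to answer “when did we last scrub?” for every imported pool; a sketch that just greps the scan line, so the exact wording depends on your OpenZFS version:

cr0x@server:~$ for p in $(sudo zpool list -H -o name); do printf '%-10s' "$p"; sudo zpool status "$p" | grep -m1 'scan:'; done
tank        scan: scrub repaired 0B in 0 days 02:41:19 with 0 errors on Tue Dec 24 03:12:18 2025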

Events, scrubs, resilvers, and the timeline of pain

zpool status shows a small timeline: “scan” state and sometimes a few recent events. But the real operational art is correlating these with system logs and your own change history. When did the errors start? Did someone replace a cable? Was firmware updated? Did a scrub start right after a busy batch job?

Two key mechanics to remember:

  • Scrub reads all blocks and verifies checksums; with redundancy it can repair silent corruption. Scrub is your periodic audit.
  • Resilver rebuilds redundancy after a device is replaced/attached/onlined. Resilver is recovery work, often performed while serving live workload.

Resilver speed and scrub speed are not just “disk speed.” They depend on fragmentation, recordsize, compression, concurrent load, and whether you’re dealing with RAIDZ (parity reconstruction) versus mirrors (straight copy). In the real world, the difference between a 2-hour resilver and a 2-day resilver is the difference between a “routine ticket” and an “executive bridge call.”
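
If “scrub is your periodic audit” is policy, put it in cron or a systemd timer instead of someone’s memory. A sketch, assuming a Linux host, a pool named tank, and zpool living in /usr/sbin; many distributions (Debian/Ubuntu’s zfsutils-linux, for example) already ship a monthly scrub job, so check before adding a second one:

cr0x@server:~$ cat /etc/cron.d/zfs-scrub
# Scrub 'tank' at 03:00 on the first day of every month.
0 3 1 * *  root  /usr/sbin/zpool scrub tank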

Interesting facts and historical context

  1. ZFS was designed to detect silent corruption end-to-end with checksums stored separately from data—so it can catch errors introduced by disks, controllers, or transport.
  2. “Scrub” isn’t just a nice-to-have feature; it was built in because disks can return corrupt-but-valid-looking data long after it was written.
  3. The vdev model (stripe across vdevs) is why adding a single RAIDZ vdev increases capacity but also adds a new failure domain that can take the pool down.
  4. RAIDZ was ZFS’s answer to the write hole problem that classic RAID-5 implementations could suffer from during partial stripe writes and power loss.
  5. ZFS error counters are about observed failures, not SMART predictions; you can have a “perfect SMART disk” causing real ZFS I/O problems due to a bad cable.
  6. Ashift exists because drives lie about their sector size; ZFS needs an alignment assumption to avoid read-modify-write penalties.
  7. Special vdevs (metadata/small blocks on fast devices) changed the performance game—but also introduced a new way to lose a pool if misused without redundancy.
  8. L2ARC and SLOG were widely misunderstood early on; many outages were caused by treating them as mandatory “speed upgrades” rather than workload-specific tools.
  9. ZFS’s “self-healing” depends on redundancy; without it, ZFS can still detect corruption but may not be able to fix it—detection alone is still valuable for forensics.

Fast diagnosis playbook (check first, second, third)

This is the “I have five minutes before the next meeting turns into an incident review” sequence. The goal is to quickly classify the problem: availability risk, data integrity risk, or performance risk.

First: classify the failure domain from topology

cr0x@server:~$ sudo zpool status -xv
pool 'tank' is not healthy
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.

Interpretation: -x tells you if ZFS thinks the pool is healthy; -v gives details. If a top-level vdev is DEGRADED/FAULTED, your risk is pool-wide. If a single leaf device is misbehaving inside a redundant vdev, you have a window to act.

Second: read the counters like a detective

cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     UNAVAIL     12     3     0  cannot open

Interpretation: READ/WRITE on a missing device often indicates it was erroring before it vanished. CKSUM staying at 0 suggests the device is failing outright, not silently corrupting—still bad, but a different kind of bad.

Third: decide whether you’re chasing integrity or performance

If the pool is ONLINE but apps are slow, don’t stare at zpool status like it owes you an apology. Use it to confirm there’s no active resilver/scrub and no error storm, then pivot to latency and queue depth.

cr0x@server:~$ sudo zpool iostat -v 1 5
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        12.3T  3.2T    210    980   18.4M  92.1M
  raidz1-0                  12.3T  3.2T    210    980   18.4M  92.1M
    sdc                         -      -     70    340   6.1M  30.2M
    sdd                         -      -     68    320   6.0M  29.1M
    sde                         -      -     72    320   6.3M  32.8M
--------------------------  -----  -----  -----  -----  -----  -----

Interpretation: If one disk is lagging, you’ll often see uneven ops/bandwidth. If everything is evenly busy but latency is high, the workload may be sync-heavy, small-block, or suffering from a misfit recordsize/volblocksize.

Practical tasks: commands and interpretation

These are real things you do at a shell when the pool is trying to tell you a story. For each task: run the command, read the output, and decide the next action.

Task 1: Quick health check across all pools

cr0x@server:~$ sudo zpool status -x
all pools are healthy

Interpretation: Great for cron checks and dashboards. If it reports unhealthy, immediately follow with zpool status -v for details.
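
A minimal wrapper for cron or a monitoring agent; a sketch using the hypothetical path /usr/local/sbin/zfs-health-check, and note that the string match against “all pools are healthy” is fragile across versions:

cr0x@server:~$ cat /usr/local/sbin/zfs-health-check
#!/bin/sh
# Exit non-zero and dump details only when ZFS reports an unhealthy pool.
out="$(zpool status -x)"
if [ "$out" != "all pools are healthy" ]; then
    echo "$out"
    zpool status -v
    exit 1
fi
exit 0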

Task 2: Full forensic view with device paths and error details

cr0x@server:~$ sudo zpool status -vP tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 0 days 01:10:22 with 0 errors on Mon Dec 23 02:11:40 2025
config:

        NAME                         STATE     READ WRITE CKSUM
        tank                         ONLINE       0     0     0
          mirror-0                   ONLINE       0     0     0
            /dev/disk/by-id/ata-A...  ONLINE       0     0     0
            /dev/disk/by-id/ata-B...  ONLINE       0     0     0

errors: No known data errors

Interpretation: Use -P to show full paths; use by-id paths to avoid device renaming surprises. If you see /dev/sdX in a production pool, consider migrating to stable identifiers when you get breathing room.

Task 3: Confirm what ZFS thinks each disk is (GUID-level truth)

cr0x@server:~$ sudo zpool status -g tank
  pool: tank
 state: ONLINE
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      ONLINE       0     0     0
          mirror-0                ONLINE       0     0     0
            11465380813255340816  ONLINE       0     0     0
            13955087861460021762  ONLINE       0     0     0

Interpretation: GUIDs are how ZFS identifies vdevs internally. This is handy when device names change or multipath plays musical chairs.

Task 4: Map a missing vdev to the OS’s view of disks

cr0x@server:~$ lsblk -o NAME,SIZE,MODEL,SERIAL,WWN,TYPE,MOUNTPOINT
NAME   SIZE MODEL            SERIAL        WWN                TYPE MOUNTPOINT
sda   3.6T  HGST_HUS726T4... K8G...        0x5000cca2...      disk
sdb   3.6T  HGST_HUS726T4... K8H...        0x5000cca2...      disk
nvme0n1 1.8T Samsung_SSD...  S5G...        eui.002538...      disk

Interpretation: If zpool status shows a disk UNAVAIL but lsblk doesn’t list it, you likely have a physical/path issue: cable, backplane, HBA, expander, power, or the disk is truly dead.
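
When a disk has vanished from the OS, the by-path symlinks of its surviving neighbors tell you which controller and port the missing bay hangs off, which is the shared-component question you actually care about. A sketch; symlink naming varies by HBA and driver:

cr0x@server:~$ ls -l /dev/disk/by-path/ | grep -v -- '-part'

If every disk behind one pci-.../sas-... prefix is acting up, stop suspecting disks and start suspecting the HBA, expander, or the cable between them.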

Task 5: Online a device that came back (carefully)

cr0x@server:~$ sudo zpool online tank /dev/disk/by-id/ata-SAMSUNG_SSD_860_EVO_S4X...
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
  scan: resilvered 12.4G in 0 days 00:03:12 with 0 errors on Tue Dec 24 10:44:01 2025

Interpretation: If the device was missing briefly, onlining may trigger a resilver. If the device is flapping (disappearing and reappearing), do not celebrate; replace the disk or fix the path before it fails again mid-resilver.

Task 6: Clear error counters after fixing the root cause

cr0x@server:~$ sudo zpool clear tank
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0

Interpretation: Clearing counters is not “fixing ZFS.” It’s resetting the odometer after you repaired the engine. If you clear without fixing the cause, the errors come back—and now you’ve lost the timeline.
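
Before you reset the odometer, photograph it. A sketch that saves the current status and the ZFS event log to /var/tmp; zpool events is a ring buffer, so capture it before it scrolls away:

cr0x@server:~$ sudo zpool status -vP tank > /var/tmp/tank-status-$(date +%Y%m%d-%H%M).txt
cr0x@server:~$ sudo zpool events -v > /var/tmp/zpool-events-$(date +%Y%m%d-%H%M).txt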

Task 7: Start a scrub and watch it like it owes you money

cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Tue Dec 24 11:02:33 2025
        1.32T scanned at 1.01G/s, 412G issued at 322M/s, 12.3T total
        0B repaired, 3.27% done, no estimated completion time

Interpretation: Scrub throughput can be far below “disk speed” when the pool is busy. If scrub ETA is “no estimated completion time,” it can mean the system can’t predict due to changing rate, not necessarily that it’s stuck.
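
If a maintenance script needs to block until the scrub finishes, recent OpenZFS (2.0+) has zpool wait; for eyeballs rather than scripts, watch works fine. A sketch:

cr0x@server:~$ sudo zpool wait -t scrub tank && echo "scrub on tank finished"
cr0x@server:~$ sudo watch -n 60 'zpool status tank | grep -A 3 scan:'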

Task 8: Pause and resume a scrub (when production needs air)

cr0x@server:~$ sudo zpool scrub -p tank
cr0x@server:~$ sudo zpool status tank
  scan: scrub paused since Tue Dec 24 11:10:44 2025
        2.01T scanned, 0B repaired, 16.3% done
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank
  scan: scrub in progress since Tue Dec 24 11:02:33 2025
        2.05T scanned, 0B repaired, 16.6% done

Interpretation: Running zpool scrub again on a paused scrub resumes it from where it left off. Pausing is useful when scrubs collide with peak load, but it’s not a substitute for capacity planning; it’s an operational tool to avoid turning a maintenance task into an outage.

Task 9: Replace a failed disk in a mirror

cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     FAULTED     64     0     0  too many errors
cr0x@server:~$ sudo zpool replace tank sdb /dev/disk/by-id/ata-WDC_WD40EFRX-NEW_DISK
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: DEGRADED
  scan: resilver in progress since Tue Dec 24 11:20:06 2025
        412G scanned at 1.12G/s, 96.1G issued at 268M/s, 3.60T total
        95.9G resilvered, 2.60% done, 0:23:55 to go

Interpretation: In mirrors, resilver is mostly a copy from the healthy side. Watch for new errors during resilver; a second disk acting up during resilver is how “degraded but fine” becomes “restore from backup.”

Task 10: Replace a disk in RAIDZ (and understand the risk)

cr0x@server:~$ sudo zpool replace tank sdd /dev/disk/by-id/ata-WDC_WD40EFRX-REPL
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: DEGRADED
  scan: resilver in progress since Tue Dec 24 11:31:18 2025
        2.14T scanned at 512M/s, 1.02T issued at 244M/s, 12.3T total
        0B resilvered, 8.32% done, 11:42:10 to go

Interpretation: RAIDZ resilver is reconstruction, not a simple copy, and can be slower and more stressful. If you’re already running RAIDZ1 on large disks, resilver time is your enemy; treat the degraded period as a high-risk window.

Task 11: Identify and interpret a “too many errors” situation

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0    14
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        tank/projects/build-cache.bin

Interpretation: This is a different class of emergency: ZFS is telling you there are permanent errors in a specific object. That means redundancy wasn’t sufficient at the time of read/repair, or corruption existed in enough replicas. Your next step is to restore that file (or dataset) from a known good source, then investigate why the CKSUM errors occurred.
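
If the dataset has snapshots from before the corruption, the restore can be a local copy instead of a backup pull. A sketch with hypothetical snapshot names; the .zfs directory is reachable by explicit path even when snapdir=hidden:

cr0x@server:~$ ls /tank/projects/.zfs/snapshot/
daily-2025-12-22  daily-2025-12-23
cr0x@server:~$ sudo cp -a /tank/projects/.zfs/snapshot/daily-2025-12-23/build-cache.bin /tank/projects/build-cache.bin

Verify the restored copy against another replica if you have one, then scrub; the permanent-error entry typically clears once the damaged blocks have been freed and a subsequent scrub completes.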

Task 12: Find which dataset and mountpoint own the damaged file

cr0x@server:~$ zfs list -o name,mountpoint
NAME                 MOUNTPOINT
tank                 /tank
tank/projects        /tank/projects

Interpretation: When ZFS reports a path, confirm it maps to the expected dataset. In complex environments with nested datasets, this helps you scope impact and choose whether to restore a single file, a dataset snapshot, or something larger.

Task 13: Correlate pool health with recent kernel/storage events

cr0x@server:~$ sudo dmesg -T | tail -n 30
[Tue Dec 24 11:05:02 2025] ata9: hard resetting link
[Tue Dec 24 11:05:03 2025] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Tue Dec 24 11:05:04 2025] sd 9:0:0:0: [sdb] tag#18 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[Tue Dec 24 11:05:04 2025] blk_update_request: I/O error, dev sdb, sector 128

Interpretation: Link resets and timeouts often implicate cabling, backplane, or controller as much as the disk. If multiple disks behind the same HBA show resets around the same time, suspect the shared component first.
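
On systemd hosts, the journal lets you scope the same question to a time window instead of whatever is still in the dmesg ring buffer. A sketch; adjust the timestamp and pattern to your incident:

cr0x@server:~$ sudo journalctl -k --since "2025-12-24 10:30" | grep -iE 'ata[0-9]+|reset|timeout|i/o error' | tail -n 40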

Task 14: Check ashift and vdev layout when performance is weird

cr0x@server:~$ sudo zdb -C tank | grep -E 'ashift|vdev_tree|type' -n | head -n 30
56:        vdev_tree:
57:            type: 'root'
73:                    type: 'mirror'
91:                            ashift: 12
118:                            ashift: 12

Interpretation: ashift: 12 means 4K sectors (2^12). Misaligned ashift can cause chronic write amplification. You can’t change ashift in-place for existing vdevs; fixing it is a rebuild/recreate exercise, which is why diagnosing it early matters.
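
To compare ashift against what the drives actually report, lsblk can print logical and physical sector sizes; a sketch, using the RAIDZ members from the earlier example:

cr0x@server:~$ lsblk -o NAME,MODEL,LOG-SEC,PHY-SEC /dev/sdc /dev/sdd /dev/sde

PHY-SEC 4096 under an ashift-12 vdev is aligned; PHY-SEC 4096 under an ashift-9 vdev is the chronic read-modify-write case. Some drives report 512 everywhere while being 4K internally, which is why ashift=12 is the usual conservative choice.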

Task 15: Validate that you’re using stable device IDs (prevent future “UNAVAIL” drama)

cr0x@server:~$ sudo zpool status tank | awk '$1 !~ /:$/ && $2 ~ /^(ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ {print $1}'
tank
mirror-0
sda
sdb
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep -E 'sda|sdb' | head
lrwxrwxrwx 1 root root 9 Dec 24 10:01 ata-HGST_HUS726T4... -> ../../sda
lrwxrwxrwx 1 root root 9 Dec 24 10:01 ata-HGST_HUS726T4... -> ../../sdb

Interpretation: If your pool is built on /dev/sdX, you’re relying on discovery order staying consistent. That’s fine until it isn’t. Plan a maintenance window to replace device references with by-id paths when possible (often done during disk replacements, one leaf at a time).
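
If a short outage is acceptable, you can switch an sdX-based pool to by-id names in one move by exporting and re-importing with a device directory hint; a sketch, assuming nothing is holding the pool’s datasets open:

cr0x@server:~$ sudo zpool export tank
cr0x@server:~$ sudo zpool import -d /dev/disk/by-id tank
cr0x@server:~$ sudo zpool status tank | grep -c ata-

The final grep is just a sanity count confirming the leaves now show by-id names.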

Three corporate-world mini-stories (how things actually break)

Mini-story 1: The incident caused by a wrong assumption

They had a “simple” setup: two top-level vdevs, each a RAIDZ1 of big disks, striped into one pool. Capacity looked great on a slide. The assumption—quietly shared across teams—was that “losing one disk is fine.” It was true in a narrow sense: losing one disk per RAIDZ1 vdev is survivable.

The first disk failed on a Monday morning. zpool status reported DEGRADED, and the team moved fast: new disk ordered (good), replacement planned (good), and a resilver started late afternoon (also good). During the resilver, a second disk in the other RAIDZ1 vdev started throwing CKSUM errors—nothing dramatic at first, just a small number that looked like “noise.” Someone cleared errors to make the dashboard green again, because nobody likes a red box in the weekly review.

By Tuesday, the second disk’s errors weren’t noise. Under rebuild load, marginal components stop being polite. The disk dropped, the second vdev went DEGRADED, and now the pool was one more failure away from catastrophe—across two different vdevs. A batch job kicked off (unrelated, scheduled, and innocent), I/O spiked, and a third disk hiccuped long enough to go UNAVAIL. That was it: the pool faulted because a top-level vdev became unavailable, and the pool is only as alive as its least-available vdev.

The postmortem was painful but productive. The wrong assumption wasn’t “RAIDZ1 survives one disk,” it was treating the pool as if it had a single parity budget. In a striped pool, parity is local to each vdev; risk stacks. The fix wasn’t a hero script. It was a design correction: RAIDZ2 for large-capacity disks, scrubs on schedule, and treating CKSUM errors as a first-class incident, not a metric to be cleared.

Mini-story 2: The optimization that backfired

A different company, different sin: they added a SLOG device to “speed up writes.” The workload was a mix of databases and log ingestion, and someone had read enough to know “sync writes are slow; SLOG helps.” True, sometimes. They installed a consumer NVMe drive because it benchmarked fast and procurement was allergic to “enterprise” pricing.

For a week, the graphs looked great. Latency dropped. The team congratulated itself and moved on. Then an unrelated power event hit the rack—nothing dramatic, just a quick brownout and recovery. Systems came back, but the pool was not happy. zpool status showed the log vdev as FAULTED, and worse, the pool refused to import cleanly on one node. The “fast NVMe” had no power-loss protection; it acknowledged writes that never made it to durable media.

This is where zpool status reads like a coroner: the log vdev wasn’t just “optional cache.” For sync workloads, ZIL/SLOG is part of the transaction path. A lying SLOG device can turn a clean crash into a consistency problem. The recovery involved removing and replacing the log device, replaying what ZFS could, and validating application-level integrity. The outcome wasn’t total data loss, but it was the kind of incident that burns confidence.

The lesson they wrote down (and eventually believed): performance upgrades are changes to the failure model. A SLOG device should be treated like a tiny database: high endurance, power-loss protection, and redundancy if the platform supports it. They later tested with intentional power pulls in a lab. The “optimization” was real, but only when engineered like production, not like a benchmark.

Mini-story 3: The boring but correct practice that saved the day

The third story is not glamorous. It’s about a team that ran monthly scrubs, reviewed zpool status outputs as part of on-call hygiene, and replaced disks on early warning signs instead of waiting for a full failure. They also kept a small inventory of identical replacement drives on-site. Nobody put that in a press release.

One month, a scrub reported a handful of repaired checksum errors on a single disk. The pool stayed ONLINE, and errors: No known data errors looked reassuring. But the team treated “repaired” as “confirmed something was wrong.” They correlated with kernel logs and found intermittent link resets on the same bay. They moved the disk to a different bay during a scheduled window; the errors followed the bay, not the disk. That’s the kind of fact you only get if you bother to look.

They replaced the backplane connector and re-seated cables. Then they ran another scrub. Clean. Error counters stayed at zero. A month later, a different disk failed outright—normal lifecycle stuff. The pool degraded, a replacement was inserted, resilver completed, and the business never noticed.

The saving move wasn’t genius. It was the boring discipline of scrubbing, reviewing, and acting on weak signals while the system was still cooperative. In corporate life, boring correctness is often the highest form of competence, and it rarely gets rewarded until the day it prevents an outage.

Common mistakes (with symptoms and fixes)

Mistake 1: Treating CKSUM errors as “disk is bad, replace it” without checking the path

Symptoms: CKSUM increments on one or more disks; READ/WRITE remain near zero; disks pass basic SMART; errors correlate with load spikes or cable movement.

Fix: Check dmesg for link resets/timeouts; inspect/replace SATA/SAS cables; reseat drives; verify HBA firmware/driver stability. If errors follow the bay or controller port rather than the disk, don’t waste time RMA’ing innocent drives.

Mistake 2: Clearing errors to make monitoring green

Symptoms: zpool clear was run repeatedly; counters come back; nobody can answer “when did it start?”

Fix: Preserve evidence. Capture zpool status -vP, relevant logs, and timestamps before clearing. Clear only after addressing the likely root cause and preferably after a scrub confirms stability.

Mistake 3: Building pools on /dev/sdX and acting surprised when devices move

Symptoms: After reboot or hardware change, one disk shows UNAVAIL though it’s present; device letters changed; import becomes confusing.

Fix: Use /dev/disk/by-id consistently. If you inherited a pool built on sdX, plan to “migrate” leaf devices during replacements by using by-id paths in zpool replace.

Mistake 4: Assuming RAIDZ resilver is “just copying” and scheduling it during peak hours

Symptoms: Resilver takes far longer than expected; application latency spikes; another disk starts throwing errors under stress.

Fix: Treat RAIDZ resilver as a heavy reconstruction job. Reduce concurrent load if possible, schedule rebuild windows, and consider adding redundancy (RAIDZ2) for large drives where rebuild windows are long.

Mistake 5: Misunderstanding “No known data errors” as “no problem”

Symptoms: Pool stays ONLINE, but READ/WRITE/CKSUM counters tick up; scrubs repair data occasionally; intermittent app errors.

Fix: Investigate the trend. One repaired block is not a crisis, but it is a signal. Run a scrub, check logs, and track whether errors concentrate on a single device or path.

Mistake 6: Adding a special vdev or SLOG without redundancy or validation

Symptoms: After device loss, pool becomes unusable or suffers severe performance collapse; unexpected import issues; metadata-heavy workloads stall.

Fix: Special vdevs should be redundant if they hold metadata/small blocks that the pool depends on. SLOG should be power-loss safe and appropriate for sync write workloads. Validate failure scenarios in a lab.

Checklists / step-by-step plan

Checklist A: When you see DEGRADED

  1. Capture evidence: zpool status -vP output and timestamps.
  2. Identify the layer degraded: leaf disk, mirror/raidz vdev, or top-level vdev.
  3. Check counters: are errors READ/WRITE (I/O) or CKSUM (integrity/path)?
  4. Confirm OS visibility: lsblk and dmesg for resets/timeouts.
  5. If device is present but offline/unavail: attempt zpool online once after stabilizing cabling/power.
  6. If device is truly failing: zpool replace with a stable by-id path.
  7. Monitor resilver in zpool status; watch for new errors on other disks.
  8. After resilver: run a scrub during a controlled window.
  9. Only then: consider zpool clear to reset counters for future detection.

Checklist B: When you see CKSUM errors but pool is ONLINE

  1. Don’t replace hardware blindly. First, capture zpool status -vP.
  2. Check if multiple disks behind one HBA/expander show CKSUM increments.
  3. Inspect dmesg -T for link resets, bus errors, or timeouts.
  4. Reseat/replace cables; reseat the drive; consider a different bay.
  5. Run a scrub to force verification and repair: zpool scrub.
  6. If CKSUM continues on the same disk after path remediation: replace the disk.

Checklist C: When performance is bad but pool looks healthy

  1. Confirm no scrub/resilver is running: zpool status scan line.
  2. Check pool/vdev activity: zpool iostat -v 1.
  3. Look for one slow device skewing a vdev: uneven ops/bw in iostat.
  4. Check for sync write pressure: app behavior and ZFS dataset settings (sync, logbias); see the example after this checklist.
  5. Validate ashift and layout: zdb -C for ashift and vdev design assumptions.
  6. Only then consider “performance upgrades” (SLOG, special vdev, more vdevs)—and model failure impact first.
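
For step 4, the dataset answers are one zfs get away; a sketch, with tank/projects standing in for whichever dataset backs the slow application:

cr0x@server:~$ zfs get sync,logbias,recordsize,compression tank/projects

sync=always with no power-loss-safe SLOG, or a default 128K recordsize under a small-block database, are the kinds of findings this check tends to surface.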

FAQ

1) If the pool is DEGRADED, is my data already corrupted?

Not necessarily. DEGRADED usually means redundancy is reduced (a disk missing/faulted) but data can still be served from remaining replicas/parity. The risk is that a second failure in the same vdev can become data loss or pool loss, depending on layout.

2) What’s the difference between READ/WRITE errors and CKSUM errors?

READ/WRITE errors are I/O failures: the device didn’t complete the operation. CKSUM errors mean ZFS received data but it didn’t match the expected checksum—often implicating corruption in transit, firmware, or media that returns wrong data.

3) Can I just run zpool clear to fix a degraded pool?

No. zpool clear clears error counters and some error states; it doesn’t resurrect missing disks or repair underlying causes. Use it after you fix the problem and want fresh counters for monitoring.

4) Why does zpool status show “No known data errors” when counters are non-zero?

Because counters can reflect transient errors that were recovered via redundancy or errors that didn’t result in permanent corruption. It’s still a signal that something is unreliable and worth investigating.

5) Should I replace a disk as soon as I see any errors?

Replace immediately for persistent READ/WRITE errors or a disk that drops offline. For isolated CKSUM errors, first suspect cabling/backplane/HBA and verify with logs and a scrub. Replace the disk if errors persist after path remediation.

6) Why is my RAIDZ resilver so slow?

RAIDZ resilver can involve reconstructing parity across many blocks, and it competes with live workload. Fragmentation and small-block workloads make it worse. Mirrors typically resilver faster because they copy from a surviving replica.

7) What does “too many errors” mean in zpool status?

ZFS decided a device is unreliable enough to fault it. That can be due to repeated I/O failures, timeouts, or corruption events. The right response is to identify whether the disk is bad or the path is bad, then replace/fix accordingly.

8) Is it safe to hot-swap drives while the pool is online?

Often yes, if your hardware and OS support it and your procedure is disciplined. The risk isn’t just the swap—it’s misidentifying the disk, pulling the wrong one, or triggering a flaky backplane into a wider outage. Verify by-id paths before you pull anything.

9) If zpool status lists a specific file with permanent errors, what should I do?

Treat it as data loss in that object. Restore the file from a known-good source (backup, replication, artifact rebuild), then scrub and investigate why redundancy didn’t heal it (multiple device errors, prior corruption, or insufficient redundancy).

Conclusion

zpool status is not a vibe check. It’s evidence. The topology tells you what can fail. The states tell you what already failed. The counters tell you how it failed. And the scan line tells you what ZFS is doing about it.

Read it like a forensics analyst and you’ll stop guessing. You’ll know when a disk is dying, when a cable is lying, when a scrub is overdue, and when a “performance upgrade” quietly rewrote your failure model. In production, that’s the difference between routine maintenance and a very expensive lesson.
