Your Proxmox node is humming along, VMs are happy, and then you see it: “pool is not healthy”. That message is never early. It’s ZFS politely clearing its throat before it starts yelling.
If you treat it like a cosmetic warning, you’ll eventually meet the kind of outage that makes calendar invites feel like personal attacks. If you treat it like a controlled incident, you’ll usually keep the lights on—and often fix the root cause before data is at risk.
What “pool is not healthy” actually means in Proxmox
Proxmox isn’t doing deep ZFS forensics when it shows “pool is not healthy.” It’s typically surfacing ZFS’s own health assessment: the pool state isn’t ONLINE and clean, or ZFS has recorded errors it couldn’t fully correct.
The common states behind the warning:
- DEGRADED: One or more vdevs is missing/faulted, but redundancy still exists. You’re operating on your spare tire.
- FAULTED: The pool (or a vdev) has lost redundancy and/or access. Reads/writes may fail.
- UNAVAIL: Devices are missing, paths are wrong, or the pool can’t be accessed.
- SUSPENDED: ZFS suspended I/O due to repeated failures. This is ZFS saying “I’m not helping you make it worse.”
Separate from state, ZFS also tracks error counters:
- READ errors: device couldn’t read blocks. Might be media, cable, controller, or pathing.
- WRITE errors: writes failed. Often indicates device/controller issues.
- CKSUM (checksum) errors: data read from disk didn’t match its checksum; redundancy may have repaired it. This one is sneaky because the disk may “work,” while silently returning wrong data.
Here’s the key operational point: “pool is not healthy” is not a single problem. It’s a category label. Your job is to classify it quickly: is this a device failure, a pathing/cabling/controller issue, a logic/config issue, or actual data corruption?
Paraphrased idea from Werner Vogels (Amazon CTO): Everything fails, all the time; design and operate as if that’s always true.
One more thing: Proxmox clusters complicate emotions. When storage on one node gets weird, people start blaming quorum, corosync, the network, or the weather. Don’t. Start with ZFS’s own evidence.
Joke #1: A “healthy” ZFS pool is like a quiet toddler—you don’t brag about it, you just enjoy the silence while it lasts.
Fast diagnosis playbook (first/second/third)
This is the “get oriented in five minutes” routine. The goal isn’t a full fix. It’s to identify the failure mode and decide whether you can proceed online, need maintenance, or should stop touching things.
First: classify the pool state and the error type
- Run zpool status -xv. You want: state, which device is implicated, whether errors are READ/WRITE/CKSUM, and the exact message text (a scripted version of this first pass follows the playbook).
- If the pool is SUSPENDED or shows permanent errors in files, treat it as a high-severity incident. Stop “optimizing,” start preserving evidence and making backups.
Second: determine if it’s a disk problem or a path problem
- Check SMART quickly: smartctl -a (or the NVMe smart log). Look for reallocated sectors, pending sectors, CRC errors, media errors.
- Check kernel logs for link resets, I/O errors, transport errors: journalctl -k.
- If you see CRC or link reset storms, suspect cables/backplane/HBA before condemning the disk.
Third: decide your next action based on redundancy and risk
- If redundancy exists (mirror/RAIDZ) and only one device is impacted: plan a controlled replacement/resilver.
- If redundancy is compromised (single-disk vdev, RAIDZ with multiple failures, multiple mirrors degraded): shift priority to data evacuation and minimizing writes.
- If this is a performance + errors scenario (scrub/resilver slow, timeouts): check HBA queueing, SATA link speed negotiation, and workload contention before forcing more I/O.
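To make that first pass repeatable under stress, here is the same routine as a minimal command sketch. The device /dev/sdf is a placeholder borrowed from the examples later in this article; adjust it, and treat this as a starting point rather than a finished runbook. Nothing here changes pool state; it only reads, which is exactly what you want before you’ve classified the failure.
cr0x@server:~$ zpool status -xv                                                          # first: state, implicated vdev, READ/WRITE/CKSUM counters
cr0x@server:~$ smartctl -a /dev/sdf | egrep -i 'realloc|pending|crc|media'               # second: disk-vs-path hints from SMART
cr0x@server:~$ journalctl -k --since "1 hour ago" | egrep -i 'reset|timeout|i/o error'   # second: link resets point at cables/HBA, not platters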
Interesting facts and a little history (because it changes decisions)
These aren’t trivia. They explain why ZFS behaves the way it does—and why some “classic RAID” instincts are actively harmful.
- ZFS started at Sun Microsystems in the mid-2000s, designed to end the “filesystem plus volume manager” split. That’s why it talks about pools and vdevs instead of partitions-as-fate.
- Copy-on-write (CoW) means ZFS never overwrites live blocks in-place. Great for consistency; also means “just run fsck” is not a thing. The repair model is different.
- Every block is checksummed, including metadata. That’s why checksum errors matter: ZFS is detecting lies from the storage stack.
- Scrub is not a performance feature. It’s an integrity audit. Run it because you like sleeping, not because dashboards look better.
- “RAIDZ isn’t RAID” is more than pedantry. RAID controllers often hide disk identity and error detail; ZFS wants direct visibility. HBAs in IT mode are popular for a reason.
- The pool health state is conservative. ZFS will keep complaining about errors even if they were corrected, until you clear them. That’s on purpose: the system wants humans to acknowledge risk.
- 4K sectors changed everything. The ashift choice (sector alignment) is basically irreversible per vdev and can quietly destroy performance if wrong.
- Resilver changed over time. “Sequential resilver” improvements in later OpenZFS versions made rebuilds less punishing, but they still hammer the affected vdev(s).
- Proxmox made ZFS mainstream in small clusters. Great outcome, but it also means people run production with “it worked in the lab” storage habits.
Triage rules: what you do immediately (and what you do not)
Do this immediately
- Capture current state: zpool status, zfs list, relevant logs (a capture sketch follows this list). When things worsen, you’ll want a before/after.
- Stop non-essential churn: postpone migrations, suspend bulk backups, avoid snapshot explosions. You want fewer writes during a shaky period.
- Verify you have a recent backup that restores, not just “a backup job that ran.” If redundancy is compromised, backups become your only adult in the room.
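A minimal capture sketch, assuming you want everything in one timestamped directory; names and retention are up to you.
cr0x@server:~$ D=/root/zfs-incident-$(date +%F_%H%M); mkdir -p "$D"
cr0x@server:~$ zpool status -v > "$D/zpool-status.txt"                  # pool state and error counters before anything changes
cr0x@server:~$ zpool events -v > "$D/zpool-events.txt"                  # ZFS's own event timeline
cr0x@server:~$ zfs list -o name,used,avail,refer > "$D/zfs-list.txt"
cr0x@server:~$ journalctl -k --since "2 hours ago" > "$D/kernel.log"    # widen the window if the incident started earlier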
Avoid these tempting moves
- Don’t “clear” errors to make the alert go away before you understand them. Clearing is acknowledgment, not a fix.
- Don’t run a scrub while hardware is actively failing (timeouts, resets, pool suspended). Scrub increases reads; failing devices often die faster under pressure.
- Don’t offline the wrong disk. Identify by persistent IDs, not by “/dev/sdX is probably…”. “Probably” is how you create a second failure.
- Don’t mix heroics with uncertainty. If you’re guessing, switch to read-only import (when applicable) and evacuate data.
Joke #2: If you pick disks by /dev/sdX order, you’re basically doing storage engineering with a Ouija board.
Practical tasks with commands: read outputs, make decisions
Below are hands-on tasks you can run on a Proxmox host. Each one includes: command, what typical output means, and what decision it drives. Run them as root or with appropriate privileges.
Task 1: Confirm the pool health and identify the offender
cr0x@server:~$ zpool status -xv
pool: rpool
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: message ID: ZFS-8000-2Q
config:
NAME STATE READ WRITE CKSUM
rpool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ata-SAMSUNG_SSD_870... ONLINE 0 0 0
ata-SAMSUNG_SSD_870... UNAVAIL 0 0 0 cannot open
errors: No known data errors
What it means: Pool is degraded because one mirror member is unavailable. No data errors yet.
Decision: Treat as urgent-but-controlled. Identify whether it’s a disk failure or path failure. Plan to restore redundancy ASAP.
Task 2: Check if Proxmox thinks something is wrong at the cluster level
cr0x@server:~$ pvecm status
Cluster information
-------------------
Name: prodcluster
Config Version: 18
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Fri Dec 26 11:02:18 2025
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.2a
Quorate: Yes
What it means: Cluster quorum is fine. This is a storage incident, not a split-brain drama.
Decision: Keep focus on the node’s ZFS/hardware. Don’t change cluster settings as a “fix.”
Task 3: Map ZFS device names to real hardware paths (stop guessing)
cr0x@server:~$ zpool status -P
pool: rpool
state: DEGRADED
config:
NAME STATE READ WRITE CKSUM
rpool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
/dev/disk/by-id/ata-SAMSUNG_SSD_870_EVO_S5... ONLINE 0 0 0
/dev/disk/by-id/ata-SAMSUNG_SSD_870_EVO_S6... UNAVAIL 0 0 0 cannot open
What it means: ZFS is already using stable by-id paths (good). If you see /dev/sdX here, fix that after the incident.
Decision: Use these IDs to locate the physical disk in the chassis, HBA slot, or backplane bay mapping.
Task 4: Check recent kernel messages for I/O errors, link resets, and timeouts
cr0x@server:~$ journalctl -k -n 80 --no-pager
Dec 26 10:58:41 server kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0x6 frozen
Dec 26 10:58:41 server kernel: ata7.00: irq_stat 0x08000000, interface fatal error
Dec 26 10:58:41 server kernel: ata7: SError: { CommWake DevExch }
Dec 26 10:58:42 server kernel: ata7: hard resetting link
Dec 26 10:58:47 server kernel: ata7: link is slow to respond, please be patient (ready=0)
Dec 26 10:58:52 server kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec 26 10:58:52 server kernel: sd 6:0:0:0: [sdf] tag#7 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 26 10:58:52 server kernel: blk_update_request: I/O error, dev sdf, sector 14680064 op 0x0:(READ)
What it means: Interface fatal errors and resets. This smells like cable/backplane/HBA issues, not necessarily “the disk is dead.” Also note the link negotiated at only 3.0 Gbps; for a SATA III SSD, that downshift often points to signal-quality problems.
Decision: Before replacing the disk, reseat/replace cable, move bay, check HBA firmware, verify power stability. If errors continue, then replace disk.
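If you want to confirm what the link actually negotiated without touching hardware yet, these read-only checks are a reasonable sketch; /dev/sdf and ata7 come from the example above, and the “SATA Version” line only appears for SATA devices.
cr0x@server:~$ smartctl -i /dev/sdf | grep -i 'sata version'    # maximum vs. currently negotiated link speed
cr0x@server:~$ dmesg | grep 'ata7' | grep -i 'link up'          # negotiation history for that port since boot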
Task 5: Pull SMART data for SATA/SAS disks and interpret the right fields
cr0x@server:~$ smartctl -a /dev/sdf
SMART Attributes Data Structure revision number: 16
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 199 000 Old_age Always - 37
What it means: No reallocated/pending sectors, but CRC errors exist. CRC errors are often cabling/backplane noise, not platter failure.
Decision: Fix the path (cable/backplane/HBA seating). Then clear counters at the ZFS level (not SMART), and observe. If CRC continues rising, keep digging.
Task 6: Pull NVMe health (different tools, different clues)
cr0x@server:~$ nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0x00
temperature : 44 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 6%
media_errors : 0
num_err_log_entries : 0
What it means: NVMe looks healthy. If ZFS reports checksum errors on an NVMe mirror and SMART is clean, suspect PCIe issues, firmware, or RAM corruption rather than NAND wear.
Decision: Correlate with journalctl -k for PCIe AER errors; consider firmware updates and memory testing if checksums persist.
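A short correlation sketch; nvme0 is the device from this task, and the grep patterns are deliberately broad, so expect some noise.
cr0x@server:~$ journalctl -k | egrep -i 'aer|pcie bus error|corrected error' | tail -n 20   # PCIe-level complaints
cr0x@server:~$ nvme error-log /dev/nvme0 | head -n 20                                       # controller-side error log entries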
Task 7: Verify pool events and timelines (ZFS tells you what it noticed)
cr0x@server:~$ zpool events -v | tail -n 25
Dec 26 10:58:52 sysevent.fs.zfs.vdev_fault
pool: rpool
vdev: /dev/disk/by-id/ata-SAMSUNG_SSD_870_EVO_S6...
errno: 5
description: Vdev I/O failure, zio pool=rpool vdev=/dev/disk/by-id/ata-SAMSUNG...
Dec 26 10:59:10 sysevent.fs.zfs.statechange
pool: rpool
old-state: ONLINE
new-state: DEGRADED
What it means: You have a clear timestamp. That’s gold for correlating with maintenance, power events, kernel updates, or someone “just reseating a cable real quick.”
Decision: Use the timeline to validate/kill hypotheses. If it started right after a reboot or firmware change, suspect configuration/compatibility.
Task 8: Check for scrub status and whether it’s currently safe to run one
cr0x@server:~$ zpool status
pool: rpool
state: DEGRADED
status: One or more devices could not be opened...
scan: scrub repaired 0B in 00:18:44 with 0 errors on Sun Dec 22 03:20:12 2025
config:
...
What it means: Last scrub completed recently with zero errors. That suggests integrity was fine until the vdev went missing/unavailable.
Decision: Fix the missing device and resilver; do not start a scrub until redundancy is restored (unless you’re diagnosing checksum errors and the pool is stable).
Task 9: Clear only after you’ve fixed the cause (and know what you’re acknowledging)
cr0x@server:~$ zpool clear rpool
cr0x@server:~$ zpool status -xv
all pools are healthy
What it means: ZFS error counters were cleared and the pool is back to healthy.
Decision: Only do this after remediation. If you clear too early, you lose signal and make recurring issues harder to prove.
Task 10: Find and validate the physical disk you’re about to touch
cr0x@server:~$ ls -l /dev/disk/by-id/ata-SAMSUNG_SSD_870_EVO_S6*
lrwxrwxrwx 1 root root 9 Dec 26 10:40 /dev/disk/by-id/ata-SAMSUNG_SSD_870_EVO_S6... -> ../../sdf
cr0x@server:~$ udevadm info --query=all --name=/dev/sdf | egrep 'ID_SERIAL=|ID_PATH=|DEVNAME='
DEVNAME=/dev/sdf
ID_SERIAL=SAMSUNG_SSD_870_EVO_S6...
ID_PATH=pci-0000:3b:00.0-ata-7
What it means: You have a stable serial and a bus path. That’s how you avoid yanking the wrong drive.
Decision: Match serial to chassis label or HBA mapping. If you can’t match confidently, stop and build the mapping first.
Task 11: Offline a failing device (when you actually mean it) and replace it
cr0x@server:~$ zpool offline rpool /dev/disk/by-id/ata-SAMSUNG_SSD_870_EVO_S6...
cr0x@server:~$ zpool status -P
pool: rpool
state: DEGRADED
config:
NAME STATE READ WRITE CKSUM
rpool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
/dev/disk/by-id/ata-SAMSUNG_SSD_870_EVO_S5... ONLINE 0 0 0
/dev/disk/by-id/ata-SAMSUNG_SSD_870_EVO_S6... OFFLINE 0 0 0
What it means: You intentionally offlined the device. ZFS will stop trying to use it, which can reduce error spam and latency.
Decision: Replace the disk, then attach the new member. Do not offline disks “to see what happens” in RAIDZ—mirrors are forgiving; RAIDZ is less amused.
Task 12: Attach/replace and watch resilver progress
cr0x@server:~$ zpool replace rpool /dev/disk/by-id/ata-SAMSUNG_SSD_870_EVO_S6... /dev/disk/by-id/ata-SAMSUNG_SSD_870_EVO_S7...
cr0x@server:~$ zpool status
pool: rpool
state: DEGRADED
scan: resilver in progress since Fri Dec 26 11:15:04 2025
3.12G scanned at 410M/s, 1.02G issued at 134M/s, 92.3G total
1.02G resilvered, 1.10% done, 0:11:20 to go
config:
NAME STATE READ WRITE CKSUM
rpool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
/dev/disk/by-id/ata-SAMSUNG_SSD_870_EVO_S5... ONLINE 0 0 0
/dev/disk/by-id/ata-SAMSUNG_SSD_870_EVO_S7... ONLINE 0 0 0 (resilvering)
errors: No known data errors
What it means: Resilver is running and the estimates look sane. “Scanned” is how far ZFS has walked the pool’s metadata; “issued” is how much repair I/O has actually been dispatched, so a large gap between the two usually means the write side is the bottleneck.
Decision: Keep workload moderate. If you see timeouts, plummeting rates, or repeated resets in logs, pause and address hardware path stability.
Task 13: If resilver is painfully slow, check I/O pressure and latency
cr0x@server:~$ iostat -x 2 5
avg-cpu: %user %nice %system %iowait %steal %idle
12.33 0.00 5.21 41.08 0.00 41.38
Device r/s w/s r_await w_await aqu-sz %util
sde 58.2 120.4 8.12 42.77 12.40 98.7
sdf 2.1 90.3 1.90 210.55 19.10 99.9
What it means: One device is saturated with huge write latency. During resilver, a weak disk or weak path drags everything down.
Decision: Reduce competing workload (backups, replication), check write cache settings, verify the replacement disk isn’t an SMR drive in disguise (a common trap in HDD pools), and confirm there is no controller bottleneck.
Task 14: Check whether you’re seeing memory-related corruption signals
cr0x@server:~$ dmesg | egrep -i 'edac|mce|ecc|machine check' | tail -n 20
[ 912.220112] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0
[ 912.220119] mce: [Hardware Error]: Corrected error, no action required.
What it means: Corrected ECC errors are still a signal. If checksum errors appear across multiple disks or vdevs, RAM/CPU/PCIe issues become credible suspects.
Decision: If error patterns are “everywhere,” don’t play whack-a-mole with disks. Stabilize the platform: memory tests, reseat DIMMs, firmware updates, check power.
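If rasdaemon or edac-utils is installed (an assumption; neither is guaranteed to be present on a stock Proxmox node), you can get a per-DIMM summary instead of grepping dmesg:
cr0x@server:~$ ras-mc-ctl --summary     # corrected/uncorrected memory error counts recorded by rasdaemon
cr0x@server:~$ edac-util -v             # live EDAC counters per memory controller and DIMM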
Task 15: Find “permanent errors” and decide whether you need file-level recovery
cr0x@server:~$ zpool status -v
pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data corruption.
action: Restore the file in question if possible.
see: message ID: ZFS-8000-8A
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
tank/vmdata:vm-104-disk-0
What it means: The pool is ONLINE, but a file is corrupted and cannot be repaired from redundancy. In Proxmox terms, that could be a VM disk.
Decision: Restore from backup or replica. For a VM disk, consider restoring the VM disk image, or migrating service to a known-good snapshot if you have it.
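A hedged restore sketch using Proxmox’s own tooling, assuming a vzdump backup of VM 104 exists on a storage named backup, VMID 9104 is free, and tank-vmdata is the target storage; all of those names (and the archive filename) are placeholders. Restoring to a fresh VMID first lets you verify the guest before touching the original.
cr0x@server:~$ pvesm list backup | grep qemu-104                 # find the most recent archive for this VM
cr0x@server:~$ qmrestore backup:backup/vzdump-qemu-104-TIMESTAMP.vma.zst 9104 --storage tank-vmdata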
Task 16: Validate ashift and recordsize choices (performance problems that look like “health”)
cr0x@server:~$ zdb -C rpool | egrep 'ashift|vdev_tree' -n | head -n 20
55: vdev_tree:
61: ashift: 12
cr0x@server:~$ zfs get recordsize,compression,atime tank/vmdata
NAME PROPERTY VALUE SOURCE
tank/vmdata recordsize 128K local
tank/vmdata compression zstd local
tank/vmdata atime off local
What it means: ashift: 12 (4K) is generally sane. Dataset settings show a VM storage dataset with compression and atime off.
Decision: If ashift is 9 on modern 4K drives, expect pain. You can’t “fix” ashift in place; you rebuild vdevs. Recordsize tuning is workload-specific; don’t change it during an incident unless you know exactly why.
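You can’t change ashift in place, but you can set it explicitly when you build the replacement vdev or pool; a minimal sketch with placeholder by-id paths and a placeholder pool name:
cr0x@server:~$ zpool create -o ashift=12 tank2 mirror /dev/disk/by-id/ata-NEW_DISK_A /dev/disk/by-id/ata-NEW_DISK_B
cr0x@server:~$ zdb -C tank2 | grep ashift                        # confirm the new vdev really got ashift=12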
Three corporate mini-stories from the trenches
Mini-story 1: The outage caused by a wrong assumption
They had a tidy Proxmox cluster: three nodes, each with mirrored boot SSDs and a separate RAIDZ2 for VM storage. The alert read “pool is not healthy” on one node’s VM pool. The on-call engineer did what many of us do at 2 a.m.: they searched their memory for the last time it happened.
Last time, it was a dead SSD. So they assumed it was a dead disk again. They saw one device in UNAVAIL and offlined it immediately. Then they went to the rack and replaced the drive in the bay they believed matched /dev/sdf. The pool stayed degraded. The errors continued. Then a second device started flapping. Now the pool wasn’t just degraded, it was becoming unstable.
The root cause was mundane: the HBA had been moved to a different PCIe slot during a previous maintenance, and the bay-to-device mapping they kept in a shared doc was no longer correct. The “dead disk” was fine; it was the wrong bay. They removed a healthy disk from RAIDZ2, turning “one device missing” into “now we’re actually in danger.”
They recovered, but only because the pool was RAIDZ2 and had one more margin for error than their process deserved. The postmortem wasn’t about ZFS. It was about identification discipline: always use /dev/disk/by-id, always validate serial numbers, and never assume OS device names mean anything stable.
Mini-story 2: The optimization that backfired
A different company wanted faster VM performance. Someone read that disabling sync writes can increase throughput, and they set sync=disabled on a dataset used for a database VM. It looked great in benchmarks. Latency graphs became brag-worthy.
Months later, the storage began showing checksum errors. A scrub found a handful of repairs, then later a few permanent errors in a VM disk. The immediate reaction was to blame “bad drives,” and they replaced two disks across two nodes. Errors kept appearing, and now everyone was nervous because the replacements didn’t “fix” it.
The real story: they had a controller firmware issue that occasionally acknowledged writes before they were durable. With synchronous semantics weakened, the system’s tolerance for weird acknowledgements vanished. The failure mode became “application believes data is committed, power flickers, and now ZFS sees inconsistent reality.” ZFS did its job by complaining, but it couldn’t un-commit a database transaction that the VM already believed was safe.
They rolled back the dataset property to sync=standard, added proper SLOG on hardware that actually supported power-loss protection, and updated controller firmware. Performance dropped a bit. Incidents dropped a lot. The lesson wasn’t “never optimize.” It was “optimize only after you understand what safety contract you’re rewriting.”
Mini-story 3: The boring practice that saved the day
This one is less glamorous and more instructive. A medium-size org had a Proxmox environment with ZFS mirrors for VM storage. Nothing fancy. But they had two boring habits: monthly scrubs with alerting, and quarterly restore tests for a small set of critical VMs.
One Monday, “pool is not healthy” appeared with a few checksum errors on a single disk. SMART looked fine. Kernel logs showed intermittent link resets. Instead of replacing the disk immediately, they scheduled a maintenance window, moved VMs off the node, and replaced a suspect SAS cable and a backplane expander that had a known flaky port.
They cleared the ZFS error counters only after the hardware fix. Errors did not return. The pool stayed stable. No emergency replacement, no random disk swaps, no panic scrubs.
And here’s the part that matters: because they already had scrub results and restore drills, leadership didn’t argue about the maintenance window. They had a track record that translated technical risk into operational confidence. The boring practice wasn’t just “good hygiene.” It was organizational leverage.
Common mistakes: symptom → root cause → fix
1) Symptom: “CKSUM errors increasing” but SMART looks clean
Root cause: Often a bad cable, marginal backplane, flaky HBA port, or PCIe/AER issues. Checksums detect bad data anywhere in the chain.
Fix: Correlate with journalctl -k for resets; reseat/replace cables, move the drive to another bay/port, update HBA firmware. Then scrub once stable to validate integrity.
2) Symptom: Pool DEGRADED after reboot, device UNAVAIL
Root cause: Device enumeration changed, missing by-id links inside initramfs, or BIOS/HBA timing causing delayed device availability.
Fix: Ensure pools use /dev/disk/by-id paths; check initramfs updates; verify HBA/BIOS settings and boot order; avoid mixing USB boot with ZFS complexity in production.
3) Symptom: Scrub “repaired bytes” repeatedly every scrub
Root cause: Persistent corruption source: unstable RAM, controller issues, or a disk returning inconsistent reads.
Fix: Don’t celebrate “it repaired it.” That’s the warning. Run memory diagnostics, check ECC logs, verify firmware, and isolate by moving vdev members across ports.
4) Symptom: Resilver is extremely slow and the host feels frozen
Root cause: Resilver competes with VM I/O; replacement disk slower than others; SMR drives in a random-write workload; controller queue bottlenecks.
Fix: Throttle workloads (pause backup jobs, migrate hot VMs), confirm drive class (avoid SMR in VM pools), consider scheduling resilver during low usage.
5) Symptom: Pool SUSPENDED
Root cause: ZFS gave up on I/O due to repeated failures. Often a controller/backplane failing hard, or multiple disks gone.
Fix: Stop writes. Capture logs. Check physical layer and power. Consider read-only import and data evacuation. Replace failing components before attempting normal import.
6) Symptom: Proxmox UI warns, but zpool status -x says healthy
Root cause: Stale cached status, monitoring lag, or pool health recently recovered but alert not cleared.
Fix: Re-check with zpool status -xv, clear resolved alerts in your monitoring, and verify there are no non-zero error counters.
7) Symptom: “Permanent errors in vm-XXX-disk-Y”
Root cause: Redundancy couldn’t reconstruct a block. Could be multiple failures, silent corruption, or data written during a failing period.
Fix: Restore from backup/replication. If you have snapshots, try cloning and running filesystem checks in the guest. Don’t assume “ZFS will fix it” when it explicitly says it can’t.
8) Symptom: Frequent device OFFLINE/ONLINE flapping
Root cause: Loose connection, marginal power, overheating, expander issues, or aggressive link power management.
Fix: Fix physical layer, improve cooling, check PSU rails, disable problematic power-saving features for storage paths where appropriate, and monitor error recurrence.
Checklists / step-by-step plan
Checklist A: “I just saw the alert” (first 15 minutes)
- Run zpool status -xv. Record it (paste into ticket/notes).
- Run zpool status -P to confirm by-id paths and get the exact device identifier.
- Check journalctl -k around the time of the first error; look for resets/timeouts/AER.
- Pull SMART/NVMe health for the implicated device.
- Decide: disk failure vs path failure vs systemic (RAM/controller) suspicion.
- Reduce churn: pause heavy backup/replication jobs; avoid large migrations.
- Verify backup restore capability for the most critical VMs on that pool.
Checklist B: Controlled disk replacement (mirror or RAIDZ, stable system)
- Identify the disk by /dev/disk/by-id and serial number.
- If it’s misbehaving but still present, zpool offline it intentionally.
- Replace hardware (disk, cable, bay, or HBA port depending on evidence).
- Use zpool replace with by-id paths.
- Monitor zpool status until resilver completes.
- After completion, consider a scrub (if this was checksum-related) during a quiet window.
- Only then, zpool clear if needed, and keep watching for recurrence. (A condensed command sequence follows this checklist.)
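As a condensed sketch of Tasks 10–12, the whole sequence looks like this; the by-id paths are placeholders for the failed and replacement disks.
cr0x@server:~$ zpool offline rpool /dev/disk/by-id/ata-FAILED_DISK      # only if the disk is still present and misbehaving
cr0x@server:~$ zpool replace rpool /dev/disk/by-id/ata-FAILED_DISK /dev/disk/by-id/ata-NEW_DISK
cr0x@server:~$ watch -n 30 zpool status rpool                           # follow resilver progress and watch for new errors
cr0x@server:~$ zpool clear rpool                                        # only after the resilver completes cleanly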
Checklist C: When redundancy is compromised (you’re on the edge)
- Stop non-essential writes. If possible, gracefully shut down non-critical VMs.
- Capture evidence: zpool status -xv, zpool events -v, kernel logs.
- Prioritize evacuation: backups, replication, or export VM disks to another storage target (a read-only import and evacuation sketch follows this checklist).
- If import is unstable, consider read-only import where applicable and copy what you can.
- Replace the most obviously failed component first (often path/controller/power).
- After stability returns, rebuild redundancy; then run a scrub to validate.
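A hedged evacuation sketch, assuming the pool is named tank, it isn’t currently imported, and another ZFS host is reachable over SSH; the snapshot, host, and target pool names are placeholders. Note that a read-only import can’t create new snapshots, so you can only send snapshots that already exist; without one, you’re copying at the file or zvol level instead.
cr0x@server:~$ zpool import -o readonly=on tank                          # bring it up without risking further writes
cr0x@server:~$ zfs list -t snapshot -r tank/vmdata | tail -n 5           # see what you actually have available to send
cr0x@server:~$ zfs send -R tank/vmdata@last-good | ssh backuphost zfs receive -d rescue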
Checklist D: After the incident (prevent the sequel)
- Schedule regular scrubs with alerting (monthly is common; tune by capacity and workload). A cron sketch follows this list.
- Track SMART and link reset rates; alert on trends, not just thresholds.
- Document bay-to-serial mapping and keep it current after hardware changes.
- Standardize on HBAs and firmware versions across nodes.
- Test restores quarterly for representative VMs.
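A minimal scheduling sketch using cron. Debian-based ZFS packages often ship a monthly scrub job already (checking for it is the first line), so don’t blindly stack a second one on top; the pool name and timing below are placeholders.
cr0x@server:~$ cat /etc/cron.d/zfsutils-linux                            # stock scrub schedule, if the package installed one
cr0x@server:~$ echo '0 3 1 * * root /usr/sbin/zpool scrub tank' > /etc/cron.d/zfs-scrub-tank
cr0x@server:~$ zpool status tank | grep scan                             # afterwards: confirm scrubs actually ran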
FAQ
1) Does “pool is not healthy” mean I’m already losing data?
Not necessarily. DEGRADED often means redundancy is reduced but data is still intact. “Permanent errors” is the phrase that should make you assume some data is corrupted.
2) Should I run a scrub immediately?
Only if the pool is stable and you’re diagnosing integrity issues. If devices are dropping, timeouts are happening, or the pool is suspended, a scrub can push failing hardware over the edge.
3) What’s worse: READ errors or CKSUM errors?
READ errors indicate the device can’t read. CKSUM errors indicate the device (or path) returned incorrect data. CKSUM errors often point to cabling/controller/RAM problems and can be more insidious.
4) Can I just clear the errors so Proxmox stops complaining?
You can, but it’s a terrible idea before fixing the cause. Clearing removes the evidence trail. Clear after remediation so you can detect recurrence cleanly.
5) How do I know I’m replacing the correct disk?
Use zpool status -P and /dev/disk/by-id. Confirm with udevadm info and the disk serial number. If you can’t match serial-to-bay, pause and build the mapping.
6) Why did the pool go DEGRADED after a reboot even though disks are “fine”?
Common causes are pathing changes, slow device initialization, or using unstable device nodes. Fix by ensuring persistent IDs and checking boot/initramfs behavior.
7) Resilver is taking forever—can I speed it up?
The fastest “speed up” is removing competing I/O: pause backups, avoid migrations, reduce VM write load. If the replacement device is slower or the path is unstable, no tuning knob beats fixing hardware.
8) Is ECC memory required for ZFS on Proxmox?
Required? No. Strongly recommended for production? Yes—especially if you’re running important VMs. Without ECC, you’re betting that memory errors won’t coincide with heavy I/O and scrubs.
9) What if zpool status shows “errors: No known data errors” but the pool isn’t healthy?
That usually means redundancy is reduced or a device is missing, but ZFS hasn’t identified unrecoverable corrupted blocks. You still need to restore redundancy quickly.
10) Should I use hardware RAID with ZFS in Proxmox?
In general: avoid it. ZFS wants direct device control and visibility. Use an HBA (IT mode) unless you have a very specific, well-understood exception.
Next steps that prevent repeat incidents
When Proxmox says a ZFS pool is not healthy, treat it like a smoke alarm, not a dashboard decoration. Your best outcome is boring: identify the failing component, restore redundancy, validate integrity, and tighten your operational habits so the next alert is actionable instead of mysterious.
Practical next steps you can do this week:
- Set a recurring scrub schedule and alert on scrub errors and slowdowns.
- Implement disk identity discipline: /dev/disk/by-id everywhere, and maintain a bay-to-serial map.
- Baseline SMART/NVMe health and kernel error patterns so you can spot “new weird” quickly (a tiny health-check sketch follows this list).
- Run one restore test from backup for a critical VM. Not because you expect failure—because you expect reality.
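And a tiny health-check sketch you can wire into cron; it stays silent until zpool status -x reports anything other than all-healthy. The mail command and recipient are assumptions; route the output into whatever alerting you already trust.
cr0x@server:~$ cat /usr/local/bin/zfs-health-alert.sh
#!/bin/sh
# Alert only when ZFS reports a problem; otherwise produce no output at all.
STATUS=$(/usr/sbin/zpool status -x)
if [ "$STATUS" != "all pools are healthy" ]; then
    echo "$STATUS" | mail -s "ZFS health warning on $(hostname)" ops@example.com
fi
cr0x@server:~$ echo '*/15 * * * * root /usr/local/bin/zfs-health-alert.sh' > /etc/cron.d/zfs-health-alert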
If you do nothing else: don’t clear the alert until you can explain it. ZFS is giving you a chance to fix the problem while you still have options. Take the hint.