Proxmox ZFS Degraded: Replace a Disk Without Collateral Damage

Your Proxmox node is “fine” until it isn’t. Then you notice a yellow banner, a pool that says DEGRADED,
and VMs that suddenly feel like they’re running through syrup. You don’t need a pep talk. You need a disk replaced
correctly—without yanking the wrong drive, without triggering a second failure, and without turning a recoverable event
into a resume update.

ZFS is forgiving, but it’s not magic. A degraded pool is ZFS telling you: “I am operating without a safety net.
Please stop improvising.” This guide is the production way to do it: identify the right device, preserve your topology,
replace with confidence, and verify resilvering and integrity afterwards.

What “DEGRADED” actually means in Proxmox + ZFS

In ZFS, DEGRADED means the pool is still usable, but at least one vdev (virtual device) has lost redundancy
or has a component that’s unavailable or misbehaving. The important part is not the word; it’s the implications:

  • You are one failure away from data loss if you’re on mirrors with one side gone, or on RAIDZ with
    no remaining parity tolerance.
  • Performance is often worse because ZFS may be reconstructing reads, retrying I/O, or working around a flaky device.
  • Scrubs and resilvers become risk events: they stress the remaining disks, which is exactly what you don’t want when they’re aging.

Proxmox is mostly a messenger here. It runs ZFS as the storage backend (often for rpool and sometimes for VM storage pools),
and surfaces zpool status in the UI. The real control surface is the shell.

“Replace a disk” sounds like a hardware task. In ZFS, it’s a data migration operation with a hardware dependency. Treat it that way.

Fast diagnosis playbook (check first/second/third)

When a pool goes degraded, your job is to answer three questions quickly: Which disk? Is it actually failing or just missing?
Is the pool stable enough to resilver safely?

First: confirm the pool state and the exact vdev member that’s sick

Don’t guess from the Proxmox UI. Get the authoritative status from ZFS.
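A quick pre-check before you read the full output: zpool status -x prints only pools that are not healthy, and zpool list can give you a one-line health summary per pool. A minimal sketch, assuming root (or sudo):

zpool status -x                  # prints "all pools are healthy" when nothing is wrong
zpool list -H -o name,health     # script-friendly: one "<pool> <health>" line per pool

If -x shows more than one sick pool, your incident just got bigger; read everything before touching anything.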

Second: determine if this is a “dead disk” or a “pathing problem”

A disk can look failed because a controller reset, a bad SATA cable, a flaky backplane slot, or a device renumbering made it vanish.
The fix is different—and replacing hardware blindly can make things worse.

Third: assess risk before you start stressing the pool

If you have remaining disks showing reallocated sectors, read errors, or timeouts, you may want to slow down I/O,
schedule a window, or take a backup snapshot/replication pass before resilvering.
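One way to do that assessment quickly is to sweep SMART health over every whole disk the OS sees, not just the suspect. A rough sketch, assuming SATA/SAS devices enumerated as /dev/sd*; adjust the glob for NVMe (/dev/nvme*n1), and note that SAS drives word the verdict differently:

for d in /dev/sd[a-z]; do
  echo "== $d"
  smartctl -H "$d" | grep -i 'test result'   # PASSED/FAILED verdict per disk
done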

The fastest way to find the culprit: ZFS status → kernel logs → SMART. If those three line up, you act.
If they disagree, you pause and figure out why.

Interesting facts & historical context (why ZFS behaves this way)

  • ZFS came out of Sun Microsystems in the mid-2000s with end-to-end checksumming as a first-class feature, not an add-on.
  • “Copy-on-write” is why ZFS hates partial truths: it writes new blocks and only then updates pointers, so live data is never overwritten in place and inconsistencies are easier to detect.
  • ZFS doesn’t “rebuild”; it “resilvers”, meaning it only reconstructs the blocks that are actually in use—not the entire raw device.
  • RAIDZ was designed to fix RAID-5/6 write hole problems by integrating parity management with the filesystem transaction model.
  • Device names like /dev/sda are not stable; persistent naming via /dev/disk/by-id became best practice because Linux enumeration changes.
  • Scrubs exist because checksums need exercising: ZFS can detect corruption, but a scrub forces reading and verifying data proactively.
  • Advanced Format (4K sector) drives created a whole era of pain; ashift is ZFS’s way of aligning allocations to physical sector size.
  • SMART isn’t a verdict, it’s a weather report: many disks die “healthy,” while others limp along “failing” for months.

One reliability maxim that remains painfully true is a paraphrase of Gene Kranz (NASA flight director): be tough and competent.
In storage terms: don’t improvise, and don’t touch two things at once.

Before you touch hardware: safety rails that prevent collateral damage

Use stable disk identities (serial-based), not whatever Linux called it today

If you do disk work by /dev/sdX alone, you are playing roulette with a loaded wheel. Proxmox upgrades,
kernel updates, controller resets, or simply rebooting can reshuffle enumeration. ZFS can also store paths in multiple forms.
You want to anchor your decisions to immutable facts: WWN, serial, and bay location.
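If you want one view that ties ZFS member names to kernel device names and serials, something like the following works on most nodes: zpool status -P prints full vdev paths, and lsblk's SERIAL/WWN columns come straight from the drives. A sketch, not a substitute for checking the bay label:

zpool status -P rpool                      # full /dev/disk/by-id/... path for each vdev member
lsblk -d -o NAME,MODEL,SERIAL,WWN,SIZE     # whole disks only, with serial and WWN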

Decide your blast radius

If the pool is hosting VM disks (zvols), a resilver is heavy I/O. If you can evacuate critical VMs,
do it. If you can’t, you still proceed—but you do it intentionally: limit workload, avoid concurrent maintenance,
and watch latency like a hawk.
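If you decide to evacuate, Proxmox's own tooling is enough. A hedged example, assuming a VM with ID 101, a container with ID 201, and a second node named pve2 (all hypothetical), with shared or replicated storage already in place:

qm migrate 101 pve2 --online      # live-migrate VM 101 to node pve2
pct migrate 201 pve2 --restart    # containers can't live-migrate; restart-migrate CT 201 instead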

Don’t “fix” two problems at the same time

Replacing a disk is already a controlled emergency. Do not pair it with a kernel update, HBA firmware update,
ZFS feature flag changes, or a “quick cable tidy.” You want causality. You want reversibility.

Joke #1: Storage arrays don’t have moods. They have consequences, and they always remember what you did.

Practical tasks with commands, outputs, and decisions (12+)

Everything below is written for a typical Proxmox node using ZFS on Linux. Commands assume root privileges or sudo.
Replace pool/device names with your reality. If your pool is rpool, great. If it’s tank, use that.

Task 1: Confirm pool health and identify the failing member

cr0x@server:~$ sudo zpool status -v
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.
action: Replace the device using 'zpool replace'.
  scan: scrub repaired 0B in 00:14:12 with 0 errors on Thu Dec 19 02:10:03 2025
config:

        NAME                                         STATE     READ WRITE CKSUM
        rpool                                        DEGRADED     0     0     0
          mirror-0                                   DEGRADED     0     0     0
            ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K123456   ONLINE       0     0     0
            ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K654321   FAULTED     12     0     0  too many errors

errors: No known data errors

What it means: You have a mirror with one side faulted. Reads are being served from the remaining disk.
The pool is still running, but redundancy is gone for that vdev.
Decision: Replace ...654321. Do not detach the healthy disk. Do not “clear” errors and hope.

Task 2: Map ZFS device IDs to physical disks (WWN/serial)

cr0x@server:~$ ls -l /dev/disk/by-id | egrep 'S2Y3NX0K123456|S2Y3NX0K654321'
lrwxrwxrwx 1 root root  9 Dec 26 09:10 ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K123456 -> ../../sda
lrwxrwxrwx 1 root root  9 Dec 26 09:10 ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K654321 -> ../../sdb

What it means: Right now the healthy disk is /dev/sda and the bad disk is /dev/sdb.
That mapping may change after a reboot.
Decision: Work in terms of /dev/disk/by-id/ata-..., and use /dev/sdb only for immediate SMART checks.

Task 3: Check kernel logs for timeouts/resets (distinguish disk vs path)

cr0x@server:~$ sudo journalctl -k -b | egrep -i 'sdb|ata|I/O error|timeout|reset' | tail -n 25
Dec 26 08:41:12 server kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 26 08:41:12 server kernel: ata2.00: failed command: READ DMA EXT
Dec 26 08:41:12 server kernel: blk_update_request: I/O error, dev sdb, sector 118782976 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Dec 26 08:41:13 server kernel: ata2: hard resetting link
Dec 26 08:41:18 server kernel: ata2: link is slow to respond, please be patient (ready=0)
Dec 26 08:41:23 server kernel: ata2.00: configured for UDMA/133

What it means: This looks like real I/O errors and link resets. Could be the disk, could be the cable/backplane.
Decision: If this is a hot-swap bay, reseat the drive once (carefully) before replacing. If it persists, replace disk and consider swapping bay/cable later.

Task 4: Get SMART health summary (fast)

cr0x@server:~$ sudo smartctl -H /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-5-pve] (local build)
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!

What it means: The drive is admitting it’s toast.
Decision: Replace the drive. No more diagnostics theater.

Task 5: Pull the SMART attributes that matter (timeouts vs wear vs media)

cr0x@server:~$ sudo smartctl -a /dev/sdb | egrep -i 'Serial Number|Reallocated|Pending|Offline_Uncorrectable|CRC|Power_On_Hours|Media_Wearout|Percentage_Used' 
Serial Number:                       S2Y3NX0K654321
Power_On_Hours:                      43120
Reallocated_Sector_Ct:               48
Current_Pending_Sector:              9
Offline_Uncorrectable:               9
UDMA_CRC_Error_Count:                0

What it means: Pending and uncorrectable sectors are classic media failure. CRC count is 0, so it’s less likely a cable problem.
Decision: Replace the drive. If CRC errors were high, you’d suspect cabling/backplane and still likely replace—just with extra scrutiny.

Task 6: Verify the pool’s ashift (sector alignment) before replacement commands

cr0x@server:~$ sudo zpool get -H -o value ashift rpool
12

What it means: ashift=12 (4K sectors). This is what you want in modern systems.
Decision: Ensure the replacement disk is not smaller than the old one and is suitable for 4K alignment. You cannot change ashift on an existing vdev.

Task 7: Check that the replacement disk is visible and not in use

cr0x@server:~$ lsblk -o NAME,SIZE,MODEL,SERIAL,TYPE,MOUNTPOINT
NAME      SIZE MODEL            SERIAL         TYPE MOUNTPOINT
sda     476.9G SAMSUNG MZ7LN512 S2Y3NX0K123456 disk
├─sda1   1007K                                 part
├─sda2      1G                                 part
└─sda3  475.9G                                 part
sdb     476.9G SAMSUNG MZ7LN512 S2Y3NX0K654321 disk
├─sdb1   1007K                                 part
├─sdb2      1G                                 part
└─sdb3  475.9G                                 part
sdc     476.9G SAMSUNG MZ7LN512 S2Y3NX0K777777 disk

What it means: The new disk is sdc and appears blank (no partitions listed). Good.
Decision: Use /dev/disk/by-id for sdc too. Confirm serial matches what’s on the box/bay label.

Task 8: Confirm persistent ID for the new disk

cr0x@server:~$ ls -l /dev/disk/by-id | grep S2Y3NX0K777777
lrwxrwxrwx 1 root root  9 Dec 26 09:16 ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K777777 -> ../../sdc

What it means: You have a stable identifier for the replacement disk.
Decision: Proceed with zpool replace using by-id paths.

Task 9: Replace the member correctly (let ZFS do the right thing)

cr0x@server:~$ sudo zpool replace rpool \
  /dev/disk/by-id/ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K654321 \
  /dev/disk/by-id/ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K777777

What it means: ZFS will attach the new disk and start resilvering onto it.
Decision: Do not reboot. Do not remove the remaining good disk. Monitor resilver progress.

Task 10: Watch resilver progress and verify it’s actually moving

cr0x@server:~$ watch -n 5 sudo zpool status rpool
  pool: rpool
 state: DEGRADED
status: One or more devices is being resilvered.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Dec 26 09:18:41 2025
        112G scanned at 1.21G/s, 38.2G issued at 423M/s, 475G total
        38.2G resilvered, 8.04% done, 00:17:42 to go
config:

        NAME                                         STATE     READ WRITE CKSUM
        rpool                                        DEGRADED     0     0     0
          mirror-0                                   DEGRADED     0     0     0
            ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K123456   ONLINE       0     0     0
            ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K777777   ONLINE       0     0     0  (resilvering)

What it means: Resilver is active, shows throughput, ETA, and scanned vs issued. That’s a good sign.
Decision: If “issued” stays stuck at 0 for minutes while “scanned” increases slowly, you may be bottlenecked or blocked by errors. Check logs and SMART on the surviving disk.
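If you want a timestamped record of progress for the ticket (and proof of whether “issued” is actually moving), a simple logging loop is enough; the log path here is just a choice:

while true; do
  date '+%F %T'
  zpool status rpool | grep -E 'scanned|issued|resilvered|in progress'
  sleep 60
done | tee -a /root/resilver-progress.log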

Task 11: Check load/latency impact on the node (avoid accidental outage)

cr0x@server:~$ sudo iostat -x 2 5
Linux 6.8.12-5-pve (server) 	12/26/2025 	_x86_64_	(32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.12    0.00    3.44   24.66    0.00   63.78

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   w_await wareq-sz  aqu-sz  %util
sda              98.0   91200.0     0.0   0.00    3.10   930.6     21.0    980.0    8.60    46.7     0.41  38.0
sdc              44.0   40120.0     0.0   0.00    4.90   912.7     62.0   15200.0    2.40   245.2     0.44  29.5

What it means: %iowait is elevated. That’s normal during resilver. %util not pegged suggests you still have headroom.
Decision: If iowait is crushing VM latency, throttle workload or schedule a quiet window. Avoid launching backups or heavy migrations mid-resilver.
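iostat shows the block layer; zpool iostat shows the same pressure from ZFS's side, per vdev. On recent OpenZFS releases, -l adds request latency columns, which is usually the number your VMs actually feel. A sketch:

zpool iostat -v rpool 5      # per-vdev bandwidth and ops every 5 seconds
zpool iostat -vl rpool 5     # same view with latency columns (recent OpenZFS)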

Task 12: Confirm pool returns to ONLINE and the old device is gone

cr0x@server:~$ sudo zpool status -v rpool
  pool: rpool
 state: ONLINE
  scan: resilvered 475G in 00:22:19 with 0 errors on Fri Dec 26 09:41:00 2025
config:

        NAME                                         STATE     READ WRITE CKSUM
        rpool                                        ONLINE       0     0     0
          mirror-0                                   ONLINE       0     0     0
            ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K123456   ONLINE       0     0     0
            ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K777777   ONLINE       0     0     0

errors: No known data errors

What it means: You’re back to redundancy. No errors. That’s the goal state.
Decision: Now you do cleanup and verification: scrub scheduling, alerting, labeling, and postmortem notes.

Task 13: If the disk is “UNAVAIL” not “FAULTED,” check if it’s just missing

cr0x@server:~$ sudo zpool status rpool
  pool: rpool
 state: DEGRADED
config:

        NAME                                         STATE     READ WRITE CKSUM
        rpool                                        DEGRADED     0     0     0
          mirror-0                                   DEGRADED     0     0     0
            ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K123456   ONLINE       0     0     0
            ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K654321   UNAVAIL      0     0     0  cannot open

What it means: ZFS cannot open the device; that might be a dead disk, or it might be a path issue.
Decision: Check whether the device exists in /dev/disk/by-id. If it disappeared, look for controller/backplane issues before declaring it “failed.”

Task 14: Validate the missing device path exists (or not)

cr0x@server:~$ test -e /dev/disk/by-id/ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K654321; echo $?
1

What it means: Exit code 1 means the path doesn’t exist. The OS doesn’t see it.
Decision: Check physical seating, backplane slot, HBA logs. If it comes back, you might do zpool online instead of replace.

Task 15: If a disk came back, try onlining it (only if you trust it)

cr0x@server:~$ sudo zpool online rpool /dev/disk/by-id/ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K654321

What it means: ZFS will attempt to bring the device back. If it was transient, the pool might return to ONLINE without replacement.
Decision: If SMART still looks ugly or logs show repeated resets, don’t get sentimental. Replace it anyway.

Task 16: Clear old error counts only after fixing the underlying problem

cr0x@server:~$ sudo zpool clear rpool

What it means: This clears error counters and some fault states.
Decision: Use it to confirm the fix held (errors stay at 0). Don’t use it as a way to “green up” a pool without replacement.

Joke #2: If you “clear” the pool errors without fixing anything, congratulations—you’ve successfully silenced the smoke alarm while the toast is still on fire.

Replacement workflows: mirror vs RAIDZ, hot-swap vs cold-swap

Mirror vdevs: the straightforward case (still easy to mess up)

Mirrors are operationally friendly. Replace one disk, resilver, you’re done. The failure mode is human:
someone pulls the wrong disk, or replaces a disk with one that’s a hair smaller, or uses unstable device names,
or detaches the wrong member.

Recommended approach:

  • Identify the failed disk by serial and bay, not by sdb.
  • Use zpool replace with /dev/disk/by-id paths.
  • Monitor resilver and system I/O. Resilver is not a background whisper; it’s a forklift.
  • Verify zpool status returns to ONLINE, then do a scrub within a maintenance window if you can tolerate the load.

RAIDZ vdevs: replacement is still simple; the consequences aren’t

With RAIDZ, the pool can remain online with one or more missing disks depending on parity level (RAIDZ1/2/3).
But during degradation, every read of data that touched the missing disk may require reconstruction, which stresses remaining drives.
Then resilvering adds more I/O. This is how “one failed disk” turns into “why are three disks timing out.”

The method is the same: zpool replace the member. The operational posture changes:

  • Don’t run a scrub and a resilver concurrently unless you enjoy long nights; a pause sketch follows this list.
  • Consider reducing workload during resilver.
  • If another disk is throwing errors, pause and decide whether to back up/replicate before proceeding.
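On that first point: if a scrub happens to be running when the pool degrades (say, the scheduled scrub fired overnight), you can pause it rather than cancel it on OpenZFS 0.7 and newer. A sketch, using the tank pool name mentioned earlier:

zpool scrub -p tank              # pause the running scrub
zpool status tank | grep scan    # confirm the scan state; run 'zpool scrub tank' later to resume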

Hot-swap bays: trust but verify

Hot swap is not “plug-and-pray.” It’s “hot swap if your backplane, HBA, and OS agree on reality.”
On Proxmox, you can typically replace a failed disk live, but you must:

  • Confirm the right bay LED (if available) or use a mapping process (serial ↔ bay); a locate sketch follows this list.
  • Insert the new disk and ensure it appears under /dev/disk/by-id.
  • Then run zpool replace. Not before.
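For the locate step: if the backplane supports enclosure LEDs and the ledmon package is installed (an assumption; many setups lack both), you can blink the bay from the shell. Verify the /dev name against the serial first, since that mapping is exactly what you are trying to confirm:

ledctl locate=/dev/sdb         # blink the locate LED for the bay holding sdb
ledctl locate_off=/dev/sdb     # turn it back off once you have confirmed the bay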

Cold swap (shutdown): sometimes boring is correct

If you’re on questionable hardware (consumer SATA controllers, flaky backplanes, old BIOS, or a history of link resets),
a cold swap can reduce risk. It’s also operationally cleaner if the pool is already unstable.
You still do the same identity checks after boot, because enumeration can change.

Checklists / step-by-step plan (production-ready)

Checklist A: “I saw DEGRADED” response plan

  1. Capture current state: zpool status -v and save output to your ticket/notes.
  2. Check if errors are rising: run zpool status again after 2–5 minutes. Are READ/WRITE/CKSUM counts increasing?
  3. Check kernel logs for resets/timeouts: journalctl -k -b.
  4. SMART check on suspect disk and on the surviving members in the same vdev.
  5. Decide whether you can proceed live or need a maintenance window.

Checklist B: Safe replacement steps (mirror or RAIDZ member)

  1. Identify the failing member by zpool status name (prefer by-id).
  2. Map it to a serial and bay label using ls -l /dev/disk/by-id and your chassis inventory.
  3. Insert the replacement disk. Confirm it appears in /dev/disk/by-id.
  4. Confirm the replacement disk size is not smaller than the old one: lsblk.
  5. Run zpool replace <pool> <old> <new>.
  6. Monitor resilver: zpool status until done. Watch system latency.
  7. When ONLINE, record the resilver duration and any errors observed.
  8. Optionally run a scrub in the next quiet window (not immediately if the system is under heavy load).

Checklist C: Post-replacement validation

  1. zpool status -v shows ONLINE and 0 known data errors.
  2. SMART for the new disk shows clean baseline (save SMART report).
  3. Confirm alerts are clear in Proxmox and your monitoring system.
  4. Update inventory: bay → serial mapping, warranty tracking, and replacement date.

Checklist D: If resilver is slow or stuck

  1. Check if another disk is now erroring (SMART + kernel logs).
  2. Check for saturation: iostat -x and VM workload.
  3. Consider pausing noncritical jobs (backups, replication, bulk storage moves).
  4. If errors are climbing, stop making changes and plan a controlled outage with backups ready.

Common mistakes: symptoms → root cause → fix

1) Pool still DEGRADED after replacement

Symptom: You replaced a disk, but zpool status still shows DEGRADED and the old device name lingers.

Root cause: You used the wrong identifier (e.g., pointed at /dev/sdb while ZFS tracked the member by-id), or you ran zpool attach or zpool online instead of zpool replace.

Fix: Use zpool status to find the exact device string, then run zpool replace pool <that-exact-old> <new-by-id>. Avoid /dev/sdX.

2) Replacement disk is “too small” even though it’s the same model

Symptom: zpool replace errors with “device is too small.”

Root cause: Manufacturers quietly vary usable capacity across firmware revisions, or the old disk had slightly larger reported size.

Fix: Use an equal-or-larger disk. For mirrored boot pools, buy the next capacity up rather than playing model-number bingo.
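To catch the “too small” surprise before you swap anything, compare exact byte counts rather than marketing capacities. A sketch using the example by-id names from above:

blockdev --getsize64 /dev/disk/by-id/ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K123456   # surviving member, reference size
blockdev --getsize64 /dev/disk/by-id/ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K777777   # replacement must be equal or larger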

3) Resilver is crawling and the node is unusable

Symptom: VM latency spikes, I/O wait is high, users complain, and resilver ETA keeps growing.

Root cause: Workload contention (VM writes + resilver reads/writes), plus potentially a marginal surviving disk.

Fix: Reduce workload (pause heavy jobs, migrate noncritical VMs), check SMART on surviving disks, and consider a maintenance window. If surviving disk is erroring, your real emergency is “second disk is dying.”

4) You pulled the wrong disk and now the pool is OFFLINE

Symptom: Pool drops, VMs pause/crash, and zpool status shows missing devices.

Root cause: Human identification failure: bay mapping not confirmed, device naming instability, or no LED locate procedure.

Fix: Reinsert the correct disk immediately. If you removed a healthy mirror member, put it back first. Then reassess. This is why you label bays and record serials.

5) Proxmox boot pool replaced, but node won’t boot

Symptom: After replacing a disk in rpool, system fails to boot from the new disk if the old one is removed.

Root cause: The bootloader/EFI entry wasn’t installed or mirrored to the new device. ZFS redundancy does not automatically mirror your bootloader state.

Fix: Ensure Proxmox boot tooling/EFI setup is replicated across boot devices. Validate by temporarily setting BIOS boot order or performing a controlled boot test.
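On a default Proxmox ZFS install, the ESP is partition 2 and rpool sits on partition 3; if your layout matches (check first, since your pool members may or may not use -part3 names), the usual pattern is to clone the partition table from the healthy boot disk and let proxmox-boot-tool prepare the new ESP. A sketch, using the example disks from above:

# copy the partition table from the healthy boot disk to the new one, then re-randomize GUIDs
sgdisk /dev/disk/by-id/ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K123456 -R /dev/disk/by-id/ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K777777
sgdisk -G /dev/disk/by-id/ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K777777
# format and register the new ESP, then confirm which ESPs are synced
proxmox-boot-tool format /dev/disk/by-id/ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K777777-part2
proxmox-boot-tool init /dev/disk/by-id/ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K777777-part2
proxmox-boot-tool status

If you go the partition route, the device you hand to zpool replace is the ZFS partition (typically -part3), not the whole disk; match whatever form your existing members already use.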

6) CKSUM errors with “healthy” SMART

Symptom: zpool status shows checksum errors on a device, but SMART looks fine.

Root cause: Often cabling, backplane, HBA issues, or power problems causing data corruption in transit.

Fix: Reseat/replace cables, move the disk to another bay/controller port, check HBA firmware stability. Clear errors after fixing and watch if they return.

Three corporate mini-stories from real life

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran Proxmox on two nodes, each with a ZFS mirror boot pool and a separate RAIDZ for VM storage.
One morning, a node flagged DEGRADED. The on-call engineer saw /dev/sdb in a quick glance at zpool status,
assumed it mapped cleanly to “Bay 2,” and asked facilities to pull that drive.

Facilities did exactly what they were told, except the bays weren’t labeled and the chassis had been re-cabled during a “cleanup.”
The drive pulled was the healthy mirror member, not the faulted one. The pool didn’t die instantly because ZFS is polite—until it isn’t.
The node took a performance hit, then a second disk threw errors under the extra read load.

The recovery was ugly but educational: reinsert the removed disk, let the pool settle, then redo the identification properly using
/dev/disk/by-id serials cross-checked against the physical labels they created during the incident.
The long-term fix wasn’t fancy: they documented bay-to-serial mapping and stopped speaking in sdX.

The wrong assumption wasn’t “people make mistakes.” It was “device names are stable.” They aren’t. Not on a good day.

Mini-story 2: The optimization that backfired

Another org wanted faster resilvers and scrubs. Someone read that “more parallelism is better” and tuned ZFS aggressively:
higher scan rates, more concurrent operations, the whole “make the graph go up” approach.
It looked great in a quiet lab. In production, it collided with real workloads: database VMs, backups, and replication traffic.

During a degraded event, resilver began at impressive throughput, then the node started timing out guest I/O.
The VM cluster didn’t crash; it just slowly turned into molasses. Operators tried to fix it by restarting services and migrating VMs,
which added more I/O. The resilver slowed further. More retries. More pain.

They eventually stabilized by backing off the “optimization” and letting the resilver run at a sustainable pace,
prioritizing service latency over benchmark numbers. After the incident, they kept conservative defaults and built a runbook:
during resilver, pause nonessential jobs and treat storage latency as a primary SLO.

The lesson: resilver speed is not a vanity metric. The only number that matters is “finished without a second failure while users kept working.”

Mini-story 3: The boring but correct practice that saved the day

A regulated business ran Proxmox nodes with ZFS mirrors and a strict habit: every disk bay had a label,
every label corresponded to a recorded serial, and every replacement was done by by-id with screenshots of zpool status
pasted into the ticket.

When a pool degraded during a holiday week, the on-call engineer was a generalist, not a storage person.
They followed the runbook: check ZFS status, map serial, validate SMART, confirm replacement serial,
and only then replace. No cleverness. No shortcuts.

The resilver finished cleanly. A scrub later showed no errors. The post-incident review was short because there wasn’t much to review.
Their “boring practice” prevented the most common disaster: removing the wrong disk or replacing the wrong vdev member.

Boring is underrated. Boring is how you sleep.

FAQ

1) Should I use the Proxmox GUI to replace a disk, or the CLI?

Use the CLI for the actual ZFS operations. The GUI is fine for visibility, but zpool status is the source of truth,
and you want copy-pastable commands and outputs for your incident notes.

2) Can I reboot during a resilver?

Avoid it. ZFS can usually resume resilvering, but rebooting introduces risk: device renumbering, HBA quirks, and the chance that a marginal disk doesn’t come back.
If you must reboot for stability, do it intentionally and document the state before and after.

3) What’s the difference between zpool replace and zpool attach?

replace substitutes one device for another and triggers a resilver. attach adds a new device alongside an existing one (turning a single-disk vdev into a mirror, or widening an existing mirror).
For a failed mirror member, you almost always want replace.
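For completeness, a hypothetical attach (not what you want for a failed member): this turns a single-disk vdev into a two-way mirror. The device names here are placeholders:

zpool attach rpool \
  /dev/disk/by-id/ata-EXISTING_MEMBER_SERIAL \
  /dev/disk/by-id/ata-NEW_DISK_SERIAL
zpool status rpool    # the vdev should now show as mirror-N, resilvering onto the new disk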

4) Should I run a scrub immediately after replacement?

Not immediately, unless you’re in a quiet window and can tolerate the load. A resilver already reads a lot of data.
Schedule a scrub after the system cools down, especially on RAIDZ pools.

5) The disk shows as UNAVAIL. Is it dead?

Not necessarily. UNAVAIL can mean the OS can’t see it (path/cable/controller) or the disk is dead.
Check /dev/disk/by-id, kernel logs, and SMART if the device reappears.

6) Can I replace a disk with a larger one and get more space?

You can replace with larger disks, but you only gain usable space after all members of a vdev are upgraded and ZFS is allowed to expand.
Mirrors expand after both sides are larger; RAIDZ expands after all disks in that RAIDZ vdev are larger.
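The expansion itself is governed by the autoexpand pool property (off by default); with it on, ZFS grows the vdev once every member is larger, and you can nudge individual members with zpool online -e. A sketch with a placeholder device name:

zpool set autoexpand=on rpool
zpool online -e rpool /dev/disk/by-id/ata-LARGER_DISK_SERIAL   # ask ZFS to use the extra space on this member
zpool list rpool                                               # SIZE / EXPANDSZ show whether the pool actually grew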

7) Why does ZFS show errors but applications seemed fine?

Because ZFS can correct many errors transparently using redundancy and checksums. That’s not “fine,” it’s “caught it in the act.”
Treat corrected errors as a warning: something is degrading.

8) What if the surviving disk in a mirror starts showing SMART errors during resilver?

That’s the uncomfortable moment: you may be in a two-disk failure scenario waiting to happen.
If possible, reduce load, prioritize backing up/replicating critical data, and consider whether you should stop and do a controlled recovery plan.
Continuing might work—but you’re gambling with your last good copy inside that vdev.

9) Does ZFS automatically mirror the Proxmox bootloader on rpool?

Not reliably by itself. ZFS mirrors data blocks; bootability depends on EFI/bootloader installation and firmware boot entries.
After replacing boot disks, validate that each disk is independently bootable if your design requires it.

10) Is it safe to “offline” a disk before pulling it?

Yes, when you’re deliberately removing a member (especially in hot-swap). Offlining reduces surprise by telling ZFS the device is going away.
But never offline the last good member of a vdev. Confirm topology first.
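A minimal sketch of that sequence, using the faulted example disk from earlier; the point of the status check is to see OFFLINE with your own eyes before anyone touches a bay:

zpool offline rpool /dev/disk/by-id/ata-SAMSUNG_MZ7LN512HMJP_S2Y3NX0K654321
zpool status rpool    # the member should now read OFFLINE; only then pull the drive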

Conclusion: next steps after you’re back to ONLINE

Getting from DEGRADED to ONLINE is the tactical win. The strategic win is making sure the next disk failure
is boring, fast, and doesn’t require heroics.

  1. Record the incident: paste zpool status -v before/after, SMART output, and which serial was replaced.
  2. Fix identification debt: label bays, maintain a serial-to-slot map, and standardize on /dev/disk/by-id.
  3. Verify monitoring: alerts for ZFS pool state, SMART critical attributes, and kernel link resets.
  4. Schedule scrubs intentionally: regular enough to catch rot, not so aggressive that you’re always stressing disks.
  5. Practice the runbook: the best time to learn zpool replace is not during the first time your pool goes degraded.

ZFS gives you a fighting chance. Don’t squander it with guesswork. Replace the right disk, in the right way, and keep the collateral damage where it belongs: nowhere.
