Replacing a disk in ZFS should feel like changing a tire: methodical, slightly dirty, and not a personality test. Yet in production it often turns into a group improv exercise—someone is staring at blinking LEDs, someone else is SSH’d into the wrong host, and a third person is asking if “DEGRADED” is just a suggestion.
This guide is a practical, field-tested strategy for hot-swapping disks in ZFS without inventing new outage categories. It’s written for people who operate real systems: you care about uptime, audit trails, and that one executive dashboard that turns red if latency spikes for more than five minutes.
1. The mindset: replacing disks is normal
ZFS is built around the assumption that storage fails. Disks fail. Cables fail. Backplanes fail. Humans fail with enthusiasm. ZFS doesn’t demand that you prevent failure; it demands that you respond predictably.
A good hot-swap strategy is less about “the right command” and more about controlling ambiguity:
- Know exactly which device you’re removing. “Probably /dev/sdb” is how you create an outage postmortem with interpretive dance.
- Make ZFS see stable device names. Prefer /dev/disk/by-id/ over ephemeral /dev/sdX.
- Observe before and after. Baselines, logs, and deterministic checks reduce “I think it’s working” to “it is working.”
- Respect resilver. It’s not magic; it’s I/O and CPU and sometimes pain.
One operational truth: the hot-swap is rarely the dangerous part. The dangerous part is the 45 minutes before it, when the team convinces itself it has perfect information. Spoiler: it doesn’t. You build a procedure that works even without perfect information.
Joke #1 (short, relevant): A disk replacement is like a fire drill—you only discover the exits are blocked when you’re already carrying the server down the stairs.
2. Interesting facts and historical context
Some quick context makes modern ZFS hot-swap decisions easier to justify to both engineers and the “why is this taking so long” crowd.
- ZFS was designed with end-to-end checksums so it can detect silent corruption that classic RAID happily serves as “fine.” That’s why checksum errors matter even when the filesystem “looks normal.”
- RAID rebuilds used to be mostly sequential. Modern large drives and random-ish workloads turned rebuilds into long, noisy events; ZFS resilver behavior evolved to avoid reading unallocated space in many cases (depending on vdev type and feature flags).
- The “write hole” problem in traditional RAID5/6 (power loss mid-stripe update) influenced the push toward copy-on-write and transactional updates—ZFS made that a first-class assumption.
- Device naming in Unix has always been slippery. /dev/sdX names are discovered-order artifacts; in the early days of hotplug, the “same disk” could come back as a different letter after reboot. That’s why persistent IDs became standard practice.
- Hot-swap hardware support predates common hot-swap operational maturity. Backplanes and SAS expanders made it easy to pull drives; operational playbooks lagged behind, which is why “we yanked the wrong one” remains a timeless genre.
- SMART was never a guarantee. Many disks die without dramatic SMART preambles. SMART is a probabilistic early-warning system, not prophecy.
- “Bigger drives, same bays” changed failure math. When rebuilds take longer, your exposure window increases. Disk replacement strategy becomes a reliability strategy, not just a maintenance chore.
- ZFS’s self-healing needs redundancy. ZFS can detect corruption alone; it can only repair it when it has a good copy (mirror/RAIDZ or special redundancy features).
- Operationally, the biggest risk is humans under time pressure. The reason many teams use “offline + LED locate + confirm serial” rituals isn’t superstition; it’s hard-earned scar tissue.
3. Core principles of a no-panic hot-swap
3.1 Prefer identity over location, then verify location
Your ZFS config should track disks by stable IDs. But when the pool says “disk X is bad,” you still need to map that to a physical slot and an actual serial number.
The safe pattern is:
- ZFS reports a vdev member in trouble (by-id preferred).
- You map that identifier to a serial.
- You map the serial to a slot (enclosure tools or controller tools).
- You light the locate LED, and you verify a second way.
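A minimal sketch of that chain on Linux, reusing the by-id path and serial from this guide’s later examples; the ledctl calls assume the ledmon package and enclosure support, so treat them as illustrative:
cr0x@server:~$ readlink -f /dev/disk/by-id/ata-WDC_WD101KRYZ-01W..._1SGH3ABE   # ZFS identifier -> current kernel device
/dev/sdo
cr0x@server:~$ sudo smartctl -i /dev/sdo | grep -i serial                      # kernel device -> serial (the truth anchor)
Serial Number:    1SGH3ABE
cr0x@server:~$ sudo ledctl locate=/dev/sdo                                     # serial/slot -> blinking bay LED
cr0x@server:~$ sudo ledctl locate_off=/dev/sdo                                 # turn the beacon off when done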
3.2 Do not “optimize” your way into data loss
Hot-swap procedures are boring on purpose. If you’re tempted to skip steps because “it’s just a mirror,” remember: mirrors fail too, and the fastest rebuild is the one you don’t have to do twice.
Joke #2 (short, relevant): The only thing more permanent than a temporary workaround is a “quick” disk swap done without checking the serial.
3.3 Treat resilver as an incident, not a background task
Resilvering competes with production I/O and can amplify existing bottlenecks. It’s normal to see latency spikes, especially on HDD pools with small random writes. Plan it, monitor it, and communicate it.
3.4 Use spares and staged replacements intentionally
Hot spares can reduce time-in-degraded, but they can also hide a hardware problem (e.g., flaky backplane or controller pathing) by “fixing” the symptom. Decide whether you want:
- Automatic spare activation to minimize risk window, or
- Manual spare use to keep humans in the loop for verification.
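Whichever way you go, make the choice explicit in the pool itself. A quick check with standard zpool property commands (pool name carried over from this guide’s examples):
cr0x@server:~$ zpool get autoreplace tank
NAME  PROPERTY     VALUE    SOURCE
tank  autoreplace  off      default
cr0x@server:~$ sudo zpool set autoreplace=on tank   # only if automatic same-slot replacement is what you actually want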
4. Preflight: what you check before you touch hardware
Before pulling anything, you want to answer three questions:
- Is the pool currently safe enough to survive this? If you’re already one disk away from losing the vdev, stop and think.
- Is the “failed disk” actually the disk? It might be a cable, expander port, HBA, or enclosure slot.
- Do you have a good replacement and a rollback plan? Replacement media can be DOA. Rollback is often “reinsert the old disk,” which requires you not to smash it on the way out.
4.1 Health snapshot and evidence capture
Capture enough state that you can compare before/after and defend your choices later.
cr0x@server:~$ zpool status -v
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Replace the device using 'zpool replace'.
scan: scrub repaired 0B in 03:21:11 with 0 errors on Mon Dec 23 02:10:22 2025
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
ata-WDC_WD101KRYZ-01..._1SGH3ABC ONLINE 0 0 0
ata-WDC_WD101KRYZ-01..._1SGH3ABD ONLINE 0 0 0
ata-WDC_WD101KRYZ-01..._1SGH3ABE FAULTED 0 0 23 too many errors
ata-WDC_WD101KRYZ-01..._1SGH3ABF ONLINE 0 0 0
ata-WDC_WD101KRYZ-01..._1SGH3ABG ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
/tank/vmstore/vm-102-disk-0
Interpretation: This is not “a disk is missing,” it’s worse: a device fault with checksum errors and permanent errors. You may need application-level repair for impacted files. Still, replacing the disk is step one.
4.2 Confirm redundancy margin
Know what your vdev can tolerate. A RAIDZ2 can survive two disk failures per vdev; a two-way mirror can survive one failure per mirror; a RAIDZ1 is living a little dangerously on large disks.
cr0x@server:~$ zpool status tank | sed -n '1,60p'
pool: tank
state: DEGRADED
scan: scrub repaired 0B in 03:21:11 with 0 errors on Mon Dec 23 02:10:22 2025
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
...
Interpretation: Identify vdev type and how many members are already out. If you’re at the cliff edge (e.g., RAIDZ1 already degraded), you plan a different kind of maintenance window.
4.3 Check for systemic issues (HBA, cabling, enclosure)
If multiple drives throw errors on the same controller path, swapping a single disk is sometimes “treating the bruise, not the fracture.”
cr0x@server:~$ dmesg -T | egrep -i 'ata|sas|scsi|reset|I/O error|timeout' | tail -n 30
[Mon Dec 23 11:04:18 2025] sd 3:0:12:0: [sdo] tag#71 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon Dec 23 11:04:18 2025] sd 3:0:12:0: [sdo] Sense Key : Medium Error [current]
[Mon Dec 23 11:04:18 2025] sd 3:0:12:0: [sdo] Add. Sense: Unrecovered read error
[Mon Dec 23 11:04:20 2025] mpt3sas_cm0: log_info(0x31111000): originator(PL), code(0x11), sub_code(0x1000)
Interpretation: “Medium Error” points at the disk surface/media. If you instead see link resets, PHY resets, or timeouts across multiple disks, suspect cabling, expander, or HBA firmware/thermal issues.
5. Disk identification: the part that actually breaks teams
Hot-swap failures are rarely caused by ZFS syntax. They’re caused by someone pulling the wrong drive because the naming scheme was sloppy. Production systems punish ambiguity.
5.1 Use persistent identifiers in ZFS
When building pools, use /dev/disk/by-id paths. If your pool already uses /dev/sdX, you can still replace by specifying the old device as ZFS knows it, but you should plan a cleanup to stable naming in the future.
5.2 Map ZFS device to OS device to physical slot
Start with what ZFS reports (often a by-id string). Then map to the kernel device, then to serial, then to enclosure slot.
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep 1SGH3ABE
lrwxrwxrwx 1 root root 9 Dec 23 10:55 ata-WDC_WD101KRYZ-01W..._1SGH3ABE -> ../../sdo
Interpretation: The failing by-id points to /dev/sdo right now. “Right now” matters—hotplug events can re-enumerate devices. That’s why we verify serial via SMART as well.
cr0x@server:~$ sudo smartctl -a /dev/sdo | egrep -i 'Model|Serial|Capacity|Reallocated|Pending|Offline_Uncorrectable'
Device Model: WDC WD101KRYZ-01W
Serial Number: 1SGH3ABE
User Capacity: 10,000,831,348,736 bytes
5 Reallocated_Sector_Ct 0x0033 001 001 140 Pre-fail Always - 2816
197 Current_Pending_Sector 0x0012 001 001 000 Old_age Always - 12
198 Offline_Uncorrectable 0x0010 001 001 000 Old_age Offline - 12
Interpretation: This disk is not having a philosophical debate about retirement. Thousands of reallocations plus pending sectors is “replace me.” Capture the serial; that’s your truth anchor.
5.3 Light the right LED (when you can)
On servers with proper enclosure management, you can locate the disk bay via SAS enclosure tools. The exact utility differs, but the principle is: use the serial to find the slot, then blink it.
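As an illustration only (tooling differs by HBA, enclosure, and distro), two common options are ledctl from the ledmon package and Broadcom’s sas3ircu; the controller number and enclosure:slot pair below are hypothetical:
cr0x@server:~$ sudo ledctl locate=/dev/sdo        # blink the bay currently holding /dev/sdo
cr0x@server:~$ sudo sas3ircu 0 locate 2:5 ON      # blink enclosure 2, slot 5 on controller 0
cr0x@server:~$ sudo ledctl locate_off=/dev/sdo    # turn the beacon off so it doesn't mislead the next on-call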
If you don’t have locate LEDs: label your trays, maintain a bay map, and do not rely on “third from the left.” That sentence has caused outages.
6. Step-by-step hot-swap procedures (mirror, RAIDZ, spares)
6.1 General safe flow (works for most topologies)
- Confirm pool/vdev state and redundancy margin.
- Confirm the failing disk identity by serial.
- Optionally offline the disk in ZFS (recommended when possible).
- Physically remove and replace the disk.
- Make sure the OS sees the new disk, and confirm its serial matches the replacement paperwork.
- Run zpool replace (or let autoreplace take over if you deliberately enabled it).
- Monitor resilver until completion; verify no new errors; run a scrub if appropriate.
6.2 Mirror replacement
Mirrors are forgiving and fast to operationalize, but don’t confuse “forgiving” with “invincible.” If you pull the wrong side, you can drop the mirror and cause a service incident, especially if your remaining disk is already marginal.
Recommended approach:
- Offline the member you intend to pull.
- Replace it and attach the new disk.
- Prefer explicit device paths by-id.
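A minimal sketch of that flow for a hypothetical two-way mirror pool named mpool, reusing this guide’s example serials for the old and new disks:
cr0x@server:~$ sudo zpool offline mpool /dev/disk/by-id/ata-WDC_WD101KRYZ-01W..._1SGH3ABE
cr0x@server:~$ # pull the offlined disk, insert the replacement, confirm the new serial with lsblk/smartctl
cr0x@server:~$ sudo zpool replace mpool \
/dev/disk/by-id/ata-WDC_WD101KRYZ-01W..._1SGH3ABE \
/dev/disk/by-id/ata-WDC_WD101KRYZ-01W..._2JGH7XYZ
cr0x@server:~$ zpool status mpool    # expect a resilver in progress, then ONLINE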
6.3 RAIDZ replacement
RAIDZ resilvers can be more punishing on performance, especially on HDDs and especially with small blocks. If this is a latency-sensitive workload, plan the replacement for a quieter window or throttle resilver behavior (carefully) rather than hoping for the best.
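If you do decide to adjust resilver pressure, look before you touch. A read-only sketch on Linux/OpenZFS; module parameter names and defaults vary by OpenZFS version, so treat this as an example of where to look rather than a tuning recipe:
cr0x@server:~$ grep -H . /sys/module/zfs/parameters/zfs_resilver_min_time_ms   # minimum ms spent resilvering per txg
/sys/module/zfs/parameters/zfs_resilver_min_time_ms:3000
cr0x@server:~$ ls /sys/module/zfs/parameters/ | grep -iE 'resilver|scrub'      # see which knobs your version actually exposes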
6.4 Spares: hot, cold, and “surprise, it activated”
ZFS supports spares. If a spare has taken over, your job becomes: replace the failed disk, then decide whether to return the spare to the spare pool or keep it in-place. Don’t leave the spare permanently consumed without recording it; six months later someone will assume you still have a spare and you will have a very educational evening.
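For reference, a sketch of the usual spare lifecycle with standard zpool commands; the spare’s by-id name below is illustrative:
cr0x@server:~$ zpool status tank | grep -A 2 spares
        spares
          ata-WDC_WD101KRYZ-01W..._SPR001    INUSE     currently in use
cr0x@server:~$ sudo zpool replace tank /dev/disk/by-id/ata-..._1SGH3ABE /dev/disk/by-id/ata-..._2JGH7XYZ
cr0x@server:~$ sudo zpool detach tank ata-WDC_WD101KRYZ-01W..._SPR001   # return the spare to the spare list (recent OpenZFS usually does this automatically once the replacement resilvers)
Detaching the failed original member instead would promote the spare to a permanent data disk; either way, record the decision in inventory.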
7. Practical tasks (with commands and interpretation)
Below are practical tasks you can lift into a runbook. They are written with a Linux + OpenZFS mindset. Adjust paths for your distro and hardware stack.
Task 1: Get a crisp pool status for the ticket
cr0x@server:~$ zpool status -P -v tank
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Replace the device using 'zpool replace'.
scan: scrub repaired 0B in 03:21:11 with 0 errors on Mon Dec 23 02:10:22 2025
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
/dev/disk/by-id/ata-..._1SGH3ABC ONLINE 0 0 0
/dev/disk/by-id/ata-..._1SGH3ABD ONLINE 0 0 0
/dev/disk/by-id/ata-..._1SGH3ABE FAULTED 0 0 23 too many errors
/dev/disk/by-id/ata-..._1SGH3ABF ONLINE 0 0 0
/dev/disk/by-id/ata-..._1SGH3ABG ONLINE 0 0 0
Interpretation: -P forces full paths; -v includes error details. This output is what you want in your incident timeline.
Task 2: Verify the pool’s on-disk labels match expectations
cr0x@server:~$ sudo zdb -l /dev/sdo | sed -n '1,60p'
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'tank'
state: 0
txg: 12345678
pool_guid: 1234567890123456789
vdev_guid: 9876543210987654321
top_guid: 1111222233334444555
Interpretation: Confirms the disk belongs to the pool you think it does. This is especially useful in multi-pool hosts or when someone “borrowed” bays.
Task 3: Map by-id to current kernel device
cr0x@server:~$ readlink -f /dev/disk/by-id/ata-WDC_WD101KRYZ-01W..._1SGH3ABE
/dev/sdo
Interpretation: Helps connect ZFS device naming to SMART and kernel logs.
Task 4: Check SMART health and error counters
cr0x@server:~$ sudo smartctl -H /dev/sdo
SMART overall-health self-assessment test result: FAILED!
Interpretation: A failed SMART health is not subtle. Even without it, high pending/uncorrectable counts are enough to act.
Task 5: Check whether errors are spreading
cr0x@server:~$ zpool events -v | tail -n 20
TIME CLASS
Dec 23 2025 11:03:55.123456789 ereport.fs.zfs.checksum
vdev_path=/dev/disk/by-id/ata-..._1SGH3ABE
vdev_guid=9876543210987654321
pool=tank
Interpretation: If you see checksum errors across multiple disks, look at controller paths and cabling, not just the “loudest” disk.
Task 6: Offline the disk cleanly before pulling (recommended)
cr0x@server:~$ sudo zpool offline tank /dev/disk/by-id/ata-WDC_WD101KRYZ-01W..._1SGH3ABE
cr0x@server:~$ zpool status tank | sed -n '1,40p'
pool: tank
state: DEGRADED
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
... ...
ata-WDC_WD101KRYZ-01W..._1SGH3ABE OFFLINE 0 0 0
Interpretation: Offlining reduces surprise behavior while you pull the disk. It also makes the “wrong disk pulled” scenario more visible sooner.
Task 7: After physical swap, confirm the OS sees the new disk
cr0x@server:~$ lsblk -o NAME,SIZE,MODEL,SERIAL,WWN,TYPE
NAME SIZE MODEL SERIAL WWN TYPE
sdo 9.1T WDC WD101KRYZ 2JGH7XYZ 0x50014ee2abcd123 disk
...
Interpretation: Confirm you’re looking at the new serial, not the old one reappearing due to a loose tray or backplane bounce.
Task 8: Replace the disk in ZFS explicitly
cr0x@server:~$ sudo zpool replace tank \
/dev/disk/by-id/ata-WDC_WD101KRYZ-01W..._1SGH3ABE \
/dev/disk/by-id/ata-WDC_WD101KRYZ-01W..._2JGH7XYZ
Interpretation: Explicit old->new replacement avoids ambiguity and avoids ZFS “guessing” which new disk you intended.
Task 9: Watch resilver progress the right way
cr0x@server:~$ zpool status tank
pool: tank
state: DEGRADED
scan: resilver in progress since Mon Dec 23 11:22:01 2025
1.23T scanned at 1.8G/s, 412G issued at 620M/s, 9.20T total
82.3G resilvered, 4.36% done, 03:21:44 to go
config:
...
Interpretation: “Scanned” and “issued” matter. If “issued” crawls while “scanned” is fast, your pool is busy serving real I/O or you’re bottlenecked elsewhere.
Task 10: Check per-vdev I/O and spot a bottleneck
cr0x@server:~$ zpool iostat -v tank 5
capacity operations bandwidth
pool alloc free read write read write
------------------------------- ----- ----- ----- ----- ----- -----
tank 62.3T 28.7T 420 1900 86.4M 322M
raidz2-0 62.3T 28.7T 420 1900 86.4M 322M
ata-..._1SGH3ABC - - 80 350 16.0M 64.0M
ata-..._1SGH3ABD - - 82 360 16.2M 64.5M
ata-..._2JGH7XYZ - - 96 520 20.5M 88.0M
ata-..._1SGH3ABF - - 80 330 16.0M 62.0M
ata-..._1SGH3ABG - - 82 340 16.2M 63.5M
------------------------------- ----- ----- ----- ----- ----- -----
Interpretation: The replacement disk often works harder during resilver. If one disk shows dramatically worse bandwidth/ops, suspect a bad new disk, negotiated link speed issue, or a struggling slot/backplane lane.
Task 11: Confirm TRIM / ashift expectations (SSDs especially)
cr0x@server:~$ sudo zdb -C tank | grep -i ashift | head -n 1
ashift: 12
cr0x@server:~$ zpool get autotrim tank
NAME  PROPERTY  VALUE     SOURCE
tank  autotrim  on        local
Interpretation: Mismatched sector size assumptions can hurt performance. You can’t “fix ashift” on an existing vdev member without replacing/rewriting the vdev, so verify before you scale SSD replacements.
Task 12: Validate that the pool returns to healthy and stays there
cr0x@server:~$ zpool status -x
all pools are healthy
Interpretation: The best status output is boring. If it’s not boring, keep digging.
Task 13: Scrub after replacement (when appropriate)
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ zpool status tank | sed -n '1,25p'
pool: tank
state: ONLINE
scan: scrub in progress since Mon Dec 23 15:01:12 2025
2.10T scanned at 1.5G/s, 0B issued at 0B/s, 9.20T total
Interpretation: A scrub is a confidence pass: it reads and verifies. Don’t run it blindly during peak load on a fragile latency budget, but do schedule it soon after.
Task 14: If the pool reports permanent errors, identify impacted files
cr0x@server:~$ zpool status -v tank | sed -n '/errors:/,$p'
errors: Permanent errors have been detected in the following files:
/tank/vmstore/vm-102-disk-0
Interpretation: Replacing the disk stops the bleeding, but it doesn’t resurrect already-corrupt user data. You now need workload-specific recovery: restore from backup, regenerate artifacts, or repair VM images.
8. Resilver reality: performance, time, and how to watch it
Resilvering is ZFS copying data needed to reconstruct redundancy onto the new device. The details vary with topology and ZFS version/features, but in production you care about two questions:
- How long will we be degraded?
- How much pain will users feel while we’re degraded?
8.1 Why resilver time is so hard to predict
- Allocated data vs raw size: A 10 TB disk doesn’t mean a 10 TB resilver. It depends on how full the pool is and how data is laid out.
- Workload interference: ZFS is serving reads/writes while also resilvering. Your customers are part of the benchmark.
- Disk behavior under stress: HDDs can drop into deep error recovery; SSDs can thermal-throttle; SMR drives can turn a rebuild into a slow-motion apology letter.
- Controller limits: SAS expanders, HBAs, PCIe lanes—your bottleneck might be upstream of the disks.
8.2 What “good” looks like during resilver
A healthy resilver is noisy but stable: progress increases steadily, I/O latency rises somewhat, and ZFS doesn’t accumulate new checksum errors.
What “bad” looks like:
- Resilver progress stalls for long periods.
- New read errors appear on other disks.
- The replacement disk shows timeouts in dmesg.
- Application latency goes nonlinear (queueing collapse).
8.3 The operational trick: reduce surprise
Tell people resilver is happening. If you have SLOs, treat it as a known risk period. If you have batch jobs or backups, consider pausing them. The easiest performance incident to resolve is the one you don’t trigger during your own maintenance.
9. Fast diagnosis playbook: bottleneck hunting under pressure
This is the “my pager is loud and management is watching” sequence. The goal is not perfect root cause in five minutes. The goal is to stop making things worse and find the dominant bottleneck quickly.
Step 1: Confirm what state ZFS thinks it’s in
cr0x@server:~$ zpool status -v
cr0x@server:~$ zpool status -x
Look for: DEGRADED vs FAULTED, resilver in progress, increasing error counters, “too many errors,” or a missing device.
Step 2: Check kernel logs for transport vs media errors
cr0x@server:~$ dmesg -T | tail -n 200
Interpretation:
- Media errors (“Unrecovered read error”) usually implicate the disk.
- Transport errors (link resets, timeouts across devices) implicate cabling/HBA/expander/backplane/power.
Step 3: Identify whether the bottleneck is disk, CPU, memory pressure, or queueing
cr0x@server:~$ uptime
cr0x@server:~$ vmstat 1 5
cr0x@server:~$ iostat -x 1 5
Interpretation:
- High wa (iowait) plus high disk utilization suggests storage-bound.
- High run queue with low iowait suggests CPU-bound or lock contention.
- Disk await skyrocketing during resilver may be expected—but if it’s extreme and concentrated on one device, suspect that member/slot.
Step 4: Zoom into ZFS-level I/O distribution
cr0x@server:~$ zpool iostat -v 5
cr0x@server:~$ zpool iostat -v tank 5
Look for: one disk far slower than peers, or a vdev doing disproportionate work. In RAIDZ, a single lagging disk can drag the entire vdev into latency hell.
Step 5: Check ARC pressure and dirty data (if latency is spiking)
cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep 'size|c_max|c_min|memory_throttle_count' | head -n 20
cr0x@server:~$ cat /proc/spl/kstat/zfs/vdev_cache_stats 2>/dev/null | head
Interpretation: If you’re memory-throttling or ARC is under pressure, ZFS may be doing extra work. This is often “background pain” made visible during resilver.
Step 6: If corruption is reported, decide what you’re saving first
If ZFS reports permanent errors, prioritize integrity:
- Stabilize the pool (finish replacement/resilver).
- Identify impacted datasets/files.
- Restore/repair at the application layer.
Performance can be tuned. Corruption is a deadline.
10. Common mistakes (symptoms and fixes)
This is the section written in the ink of past incidents. If you recognize your environment in any of these, congratulations: you’re normal. Now fix it.
Mistake 1: Using /dev/sdX in pool configs
Symptom: After a reboot or hotplug event, ZFS shows missing devices or the wrong disk appears as the failed member; someone swears it “used to be sdc.”
Fix: Use by-id paths for replacements, and when you have a maintenance window, migrate configs toward stable identifiers. At minimum, always replace using by-id rather than sdX.
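One common migration path is to re-import the pool scanning only by-id paths. It requires exporting the pool, so everything using it must be stopped first; a sketch:
cr0x@server:~$ sudo zpool export tank
cr0x@server:~$ sudo zpool import -d /dev/disk/by-id tank   # re-import, resolving members via by-id only
cr0x@server:~$ zpool status -P tank | head -n 15           # members should now display /dev/disk/by-id/... paths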
Mistake 2: Pulling a disk before offlining it (in ambiguous environments)
Symptom: The pool flips from DEGRADED to “missing device,” multipath behaves strangely, or the wrong tray gets pulled because the team had no clear “this one is safe to pull” signal.
Fix: Offline the target member, confirm status reflects OFFLINE for that specific by-id, then pull.
Mistake 3: Confusing a failing slot/backplane with a failing disk
Symptom: You replace the disk and errors continue—often on the new disk—especially timeouts or link resets.
Fix: Move the disk to a different slot (if possible) to test the bay, inspect SAS/SATA cabling, check expander/HBA logs, and review power delivery/thermals.
Mistake 4: Replacing multiple disks “because we’re here anyway”
Symptom: A second disk goes offline mid-resilver, or performance collapses. The team is now doing two stressful things at once.
Fix: Stage replacements. Finish one resilver before starting the next, unless you have a very controlled plan and adequate redundancy margin (and even then, be cautious).
Mistake 5: Assuming resilver speed will be “like last time”
Symptom: A resilver that used to take hours now takes days; stakeholders panic; engineers start “tuning” random knobs.
Fix: Validate pool fullness, workload intensity, and device class (CMR vs SMR HDD, SSD cache behavior, etc.). Use iostat/zpool iostat to locate the true limiter before changing ZFS parameters.
Mistake 6: Ignoring checksum errors because the pool is ONLINE
Symptom: Periodic checksum errors appear; scrubs sometimes “repair” them; months later a second event reveals latent corruption or a broader hardware fault.
Fix: Treat checksum errors as a signal. Correlate with SMART, cabling, controller resets, and scrubs. Replace suspicious components proactively.
Mistake 7: Autoreplace enabled without strong inventory discipline
Symptom: A new disk inserted triggers replacement automatically, but it wasn’t intended for that pool/vdev; in multi-bay environments, it’s easy to consume the wrong disk.
Fix: If you use autoreplace, pair it with strict slot labeling, serial tracking, and change control. Otherwise prefer explicit zpool replace.
Mistake 8: Not checking the new disk before trusting it
Symptom: Resilver starts, then the new disk begins throwing errors; you’re back to degraded, now with less confidence.
Fix: At least confirm SMART identity and basic health. In higher rigor environments, do burn-in testing before putting disks into production pools.
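A non-destructive minimum, as a sketch (full burn-in with write passes or vendor tools is environment-specific); the device node is whatever the new disk enumerates as:
cr0x@server:~$ sudo smartctl -t long /dev/sdo     # start an extended self-test on the candidate disk
cr0x@server:~$ sudo smartctl -l selftest /dev/sdo # review the self-test log once it finishes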
11. Checklists / step-by-step plan
11.1 The “replace one disk” operational checklist
- Open an incident or change record (even if it’s “minor”). Add host, pool, and time.
- Capture baseline state: zpool status -P -v, zpool events -v | tail, and recent kernel errors.
- Confirm redundancy margin: identify vdev type and current degraded members.
- Identify failing disk by serial: map by-id → sdX → SMART serial.
- Verify physical slot: enclosure map/LED locate; second-person verification if available.
- Offline the disk: zpool offline, confirm it shows OFFLINE.
- Hot-swap hardware: pull old disk, insert replacement, wait for OS to detect.
- Verify replacement serial and size/class: lsblk, smartctl.
- Run replacement: zpool replace pool old new.
- Monitor resilver: zpool status and zpool iostat -v.
- Watch for collateral damage: new checksum errors on other disks suggest a systemic issue.
- Close the loop: confirm zpool status -x, schedule scrub, update inventory with old/new serials.
11.2 “We found permanent errors” checklist
- Stabilize the pool (complete replacement/resilver).
- List impacted files: zpool status -v.
- Identify owning datasets and application context.
- Restore from backup or regenerate data.
- Scrub after remediation to confirm cleanliness.
11.3 “We suspect it’s not the disk” checklist
- Check whether multiple disks on the same controller path show resets/timeouts.
- Inspect cables, reseat HBA, review firmware and temperatures.
- Swap slot for the disk (if possible) to isolate bay/backplane faults.
- Consider controller replacement or expander diagnostics if errors persist.
12. Three corporate-world mini-stories from the trenches
12.1 Incident caused by a wrong assumption: “The LED means that disk is safe to pull”
The environment was a fairly standard corporate virtualization cluster: a couple of storage-heavy hosts, a shared ZFS pool per host, and enough dashboards to make you think the system was being paid per metric. One host flagged a disk as degraded. The on-call did the right first move: checked zpool status and saw a by-id path that looked familiar.
Then the wrong assumption showed up: the team treated the chassis LED as the source of truth. The server had two different LED behaviors—one driven by the enclosure, one driven by the RAID controller’s “locate” function. Someone had previously tested “locate” on a different bay and never turned it off. So now two bays were blinking: the failed disk and a perfectly healthy one.
The on-call pulled the wrong drive. The pool went from DEGRADED to a much angrier state, and the host’s VM storage latencies spiked into the kind of chart that makes executives discover adjectives. ZFS did what it could, but losing an extra member in the same vdev moved the system from “repairable while serving traffic” to “you’re going to restore some things.”
The fix wasn’t heroic; it was procedural. They reinserted the accidentally pulled disk (which thankfully was healthy), offlined the correct disk by-id, and then used a second confirmation method: serial number via SMART matched the asset record, and the bay map matched the enclosure slot. After the replacement, they updated the runbook: LEDs are helpful, not authoritative. The authoritative chain is ZFS identifier → OS device → SMART serial → physical bay.
The lesson that stuck: humans love a blinking light because it feels certain. Production systems punish that emotional shortcut.
12.2 Optimization that backfired: “Let’s crank resilver speed to finish faster”
Another shop had a latency-sensitive workload—think internal services that are “not customer-facing” until they’re down, at which point they’re suddenly the most customer-facing thing on Earth. They had a disk fail in a RAIDZ vdev. The team wanted the degraded window to be as short as possible, so someone proposed increasing aggressiveness: let resilver run as fast as the disks could push.
It worked in a narrow sense: resilver throughput went up. But the workload wasn’t idle, and the pool wasn’t mostly sequential. The combination of heavy rebuild reads, parity computations, and the normal random write workload pushed the system into queueing collapse. Latency ballooned, timeouts cascaded, and upstream services began retry storms. Now the infrastructure wasn’t just degraded—it was a performance incident.
What made it worse: during that event, another disk started logging timeouts. Not because it was failing, but because it was starved and the controller queue was saturated. That triggered a round of “is a second disk failing?” panic, and the team nearly started a second replacement mid-resilver.
The eventual stabilization was, again, boring: they backed off the rebuild aggressiveness (and reduced competing batch I/O), prioritized service stability, and accepted a longer degraded window. The resilver finished later, but the business stopped noticing the storage layer every time it blinked.
The lesson: faster resilver is not automatically safer. The safest resilver is the one that doesn’t trigger an outage while attempting to prevent one.
12.3 A boring but correct practice that saved the day: serial-based inventory and two-person verification
This team had a reputation for being almost offensively methodical. Every disk bay had a label. Every disk had its serial recorded at install time. Every time they replaced a drive, they updated an internal inventory and pasted the before/after zpool status -P -v output into the change ticket. It was the kind of process that makes some engineers roll their eyes—until it pays rent.
One afternoon, a pool reported errors on a disk that was physically located in a bay that, according to the label, shouldn’t have been part of that pool. That inconsistency was the alarm bell. Instead of yanking the “obvious” disk, they paused. They verified the ZFS by-id, mapped it to a serial, and discovered the disk had been moved months ago during a chassis swap and the bay map was never updated.
Because they had serial-based inventory, they could reconcile the mismatch without guessing. They updated the bay map, offlined the correct disk, and replaced it cleanly. No accidental removals, no second failures, no unexpected downtime. The only “cost” was ten extra minutes of verification.
In the post-incident review, nobody celebrated the process; they barely mentioned it. That’s how you know it worked. The best operational practices don’t create stories. They prevent them.
13. FAQ
Q1: Should I always offline a disk before physically removing it?
In most production environments, yes. Offlining makes the intent explicit and reduces ambiguity during hotplug events. The exceptions are when the disk is already gone/unresponsive, or your hardware stack doesn’t tolerate offlining well—but those are special cases that should be documented.
Q2: Can I replace a disk with a slightly smaller one if it’s “the same size” on the label?
Don’t rely on marketing sizes. ZFS cares about actual usable sectors. A replacement that’s even slightly smaller can fail to attach. Verify sizes with lsblk or blockdev --getsize64 before you start.
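For example, comparing exact byte counts before committing (device names illustrative; /dev/sdq stands in for the candidate replacement):
cr0x@server:~$ sudo blockdev --getsize64 /dev/sdo   # outgoing disk, or use the size recorded in inventory
10000831348736
cr0x@server:~$ sudo blockdev --getsize64 /dev/sdq   # candidate replacement; must be greater than or equal
10000831348736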
Q3: Mirror vs RAIDZ—does the hot-swap procedure differ?
The command flow is similar, but the risk profile differs. Mirrors are usually simpler and resilver faster; RAIDZ resilvers can be heavier and longer. The “don’t replace multiple disks at once” rule matters more in RAIDZ because resilver stress can reveal weak members.
Q4: What if zpool replace says the device is busy or won’t attach?
Common causes: the new disk has partitions from an old use, multipath is presenting a different node than you expected, or you’re specifying unstable paths. Confirm the new disk identity, wipe old labels if appropriate (carefully), and use by-id paths consistently.
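If stale metadata on the replacement is the culprit, a hedged cleanup sketch; these commands destroy whatever is on that disk, so triple-check the device node (/dev/sdq here is illustrative):
cr0x@server:~$ sudo wipefs -a /dev/sdq            # remove leftover partition/filesystem signatures
cr0x@server:~$ sudo zpool labelclear -f /dev/sdq  # clear stale ZFS labels from a previous life
cr0x@server:~$ sudo zpool replace tank /dev/disk/by-id/ata-..._1SGH3ABE /dev/disk/by-id/ata-..._2JGH7XYZ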
Q5: Should I enable autoreplace?
Autoreplace can be great in tightly controlled environments with consistent slot mapping and disciplined inventory. In mixed environments (multiple pools, shared enclosures, humans swapping “whatever disk fits”), it can create surprising behavior. If you can’t guarantee operational discipline, prefer explicit replacements.
Q6: How do I know if checksum errors mean I lost data?
Checksum errors mean ZFS detected a mismatch. If redundancy allowed repair, ZFS may have corrected it transparently (you’ll see “repaired” during scrub). If ZFS reports “permanent errors,” that means it could not repair some blocks; that’s when you identify impacted files and restore/repair at the application layer.
Q7: Is it safe to run a scrub during resilver?
It’s usually not what you want. Both are heavy read operations competing for the same spindles/controllers. Finish resilver first, then scrub soon after (or schedule a scrub when load is acceptable). There are edge cases where you scrub to confirm broader integrity problems, but that should be an intentional decision.
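If a scrub is already running when a disk fails, or a cron-scheduled scrub collides with your maintenance window, you can pause it and resume later; supported in current OpenZFS releases:
cr0x@server:~$ sudo zpool scrub -p tank    # pause the in-progress scrub
cr0x@server:~$ sudo zpool scrub tank       # resume it later, once the resilver is done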
Q8: Why does ZFS resilver sometimes feel slower than “traditional RAID” rebuilds?
Sometimes it’s not slower; it’s just honest about the work. ZFS may do extra verification, and it often shares bandwidth with production I/O. Also, modern workloads are random and modern drives are huge. Rebuild time inflation is a physics tax, not a software bug.
Q9: After replacement, should I keep the old disk for analysis?
If your environment has recurring failures, yes—at least long enough to confirm whether the failure mode is media wear, transport issues, or environmental (heat/vibration/power). If it’s obviously failing media (pending/uncorrectable exploding), you can usually RMA and move on, but keep enough evidence (SMART output, serial, timestamps) to spot patterns.
14. Conclusion
ZFS hot-swaps don’t have to be adrenaline sports. The system gives you the tools to replace disks safely—if you treat identity as sacred, keep device naming stable, and respect resilver as a real operational event.
The winning strategy is deliberately unsexy: confirm by serial, offline before pull, replace explicitly, watch resilver like it matters, and follow up with integrity checks and inventory updates. That’s how you replace disks without panic—by removing surprise from the process, one boring verification at a time.