ZFS Replacing Disks: The Safe Workflow That Avoids Pool Drama

Replacing a disk in ZFS is one of those chores that feels routine—right up until it isn’t. The pool goes DEGRADED, alerts start screaming, your ticket queue fills with “is it still safe?”, and someone inevitably suggests rebooting “to clear it.” That’s how you turn a single bad drive into a multi-day incident with a side of regret.

This is the workflow I use in production when I want predictable outcomes: identify the correct disk by serial, validate the replacement, choose the right ZFS verb (replace/attach/detach/online), monitor resilvering like you mean it, and finish cleanly so the pool is actually healthy—not “green until the next scrub.”

What “safe” means in ZFS disk replacement

In ZFS, “safe” isn’t a vibe. It’s a sequence of verifications that prevents three classic failures:

  • Replacing the wrong disk (the #1 operator error). That’s how you turn redundancy into a coin toss.
  • Using the wrong device path (hello, /dev/sdX roulette), leading to the pool “healing” onto a different physical disk than you think.
  • Declaring victory too early: resilver finishes, but you didn’t fix the underlying transport/controller issue, so the next disk “fails” in sympathy.

Here’s the operational truth: ZFS is very good at data integrity and very honest about what it knows. But ZFS cannot protect you from confusing bay numbers, bad SAS expanders, mis-cabled backplanes, or a replacement drive that’s DOA. Your workflow must bridge that gap.

Use the right ZFS verb or suffer the consequences

ZFS disk “replacement” can mean different things:

  • zpool replace: replace a specific leaf device with another. This is the usual path for failed disks.
  • zpool attach: add a disk to an existing single-disk vdev to make a mirror, or add another disk to an existing mirror (rare; be careful). Not a replacement.
  • zpool detach: remove a disk from a mirror vdev (never from RAIDZ). Used after attach-based migrations or mirror shrink operations.
  • zpool offline/online: take a disk out of service temporarily. Useful when you want to control the timing.

One more: zpool clear is not a fix. It clears error counters. It can be appropriate after you correct a cabling issue and want to confirm stability. It is not appropriate as a ritual to make alerts go away.
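
For reference, this is what those verbs look like on the command line. It's a minimal sketch with placeholder names (tank, OLD_LEAF, NEW_DISK and friends), not something to paste blindly:

cr0x@server:~$ sudo zpool replace tank OLD_LEAF /dev/disk/by-id/NEW_DISK      # swap a failed leaf for a new device
cr0x@server:~$ sudo zpool attach tank EXISTING_LEAF /dev/disk/by-id/NEW_DISK  # add a side to a disk/mirror (not a replacement)
cr0x@server:~$ sudo zpool detach tank MIRROR_LEAF                             # drop one side of a mirror; never valid on RAIDZ
cr0x@server:~$ sudo zpool offline tank SOME_LEAF                              # take a leaf out of service on purpose
cr0x@server:~$ sudo zpool online tank SOME_LEAF                               # bring it back
cr0x@server:~$ sudo zpool clear tank                                          # reset error counters after you fixed the cause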

Paraphrased idea from Gene Kim: the real work of reliability is making the normal path the safe path. That applies to disk swaps as much as deploys.

Short joke #1: If your plan relies on “I’ll remember which drive it is,” congratulations—you’ve invented non-persistent storage for humans.

Interesting facts and short history (so you stop fighting the system)

  • ZFS was built to detect silent data corruption. Checksums are stored separately from data, so a disk can’t “grade its own homework.”
  • Resilvering changed over time: modern OpenZFS adds sequential resilver for mirrors (the -s flag on replace/attach), which copies allocated space in device order instead of walking block pointers. It is much faster on fragmented pools and spinning disks, and a verifying scrub runs afterward because the sequential pass skips checksum verification.
  • Scrubs predate many “cloud-native” ideas. A ZFS scrub is basically continuous verification at scale, long before “integrity scanning” became fashionable.
  • RAIDZ is not RAID5. Similar goals, very different implementation: ZFS has end-to-end checksumming and a transactional model that avoids the classic write hole.
  • ashift is forever. Once a vdev is created with an ashift (sector size exponent), you don’t change it without rebuilding the vdev.
  • Device naming wars are old. Admins have been burned by unstable device names since before SATA was cool; persistent IDs exist because humans keep losing that fight.
  • ZFS “autoexpand” exists because upgrades are normal. But it’s not magic. It has rules and timing, and it won’t fix mismatched vdev sizes.
  • ZFS can use a separate intent log (SLOG) to accelerate synchronous writes, but it doesn’t help resilver speed and can complicate failure handling.
  • SMART isn’t truth, it’s testimony. Drives can die without warning; others scream for months and keep running. ZFS uses checksums and redundancy to live in reality.

Pre-flight: what to check before touching hardware

Disk replacement is a choreography between ZFS, the OS, the controller/HBA, the chassis/backplane, and the human holding the caddy. The goal is to reduce uncertainty before you pull anything.

Rules I follow in production

  • No disk pulls without serial confirmation. Slot numbers lie. Labels drift. Humans misread. Serial numbers don’t care.
  • Scrub or at least assess scrub health first. If the pool already has checksum errors, resilver is not a victory lap—it’s a stress test.
  • Don’t resilver during a performance-sensitive window unless you deliberately throttle and accept the blast radius.
  • Prefer persistent device identifiers (by-id paths) for replace operations and for the vdev config long-term.
  • Validate the replacement disk: sector size, capacity, transport type, and “is it even the right disk.”

Practical tasks: commands, what the output means, and what decision you make

The following tasks are deliberately repetitive. That’s the point. Disk swaps fail when people improvise.

Task 1: Confirm pool health and identify the problem vdev

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
  scan: scrub repaired 0B in 03:12:44 with 0 errors on Tue Dec 24 01:12:18 2025
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A123            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A124            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A125            UNAVAIL      0     0     0  cannot open
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A126            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A127            ONLINE       0     0     0

errors: No known data errors

Meaning: ZFS can’t open one leaf device; redundancy keeps the pool online. Scrub history is clean. Good.

Decision: Replace ...VKJ0A125. Do not touch other disks. If multiple are flaky, pause and investigate cabling/backplane before swapping more hardware.

Task 2: Get the full vdev tree with persistent names (and see if you’re already using them)

cr0x@server:~$ sudo zpool status -P tank
  pool: tank
 state: DEGRADED
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0A123  ONLINE       0     0     0
            /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0A124  ONLINE       0     0     0
            /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0A125  UNAVAIL      0     0     0
            /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0A126  ONLINE       0     0     0
            /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0A127  ONLINE       0     0     0

Meaning: Pool config uses stable by-id paths. This is what you want.

Decision: Continue using /dev/disk/by-id/... for replacement. If you see /dev/sdX, fix that during/after replacement (details later).
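
If the config does show /dev/sdX names, the usual cleanup is to re-import the pool from the by-id directory once everything is healthy again. A sketch; it needs a short export/import window, so treat it as planned maintenance:

cr0x@server:~$ sudo zpool export tank
cr0x@server:~$ sudo zpool import -d /dev/disk/by-id tank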

Task 3: Confirm the OS sees the failed disk at all (or that it really disappeared)

cr0x@server:~$ ls -l /dev/disk/by-id/ | grep VKJ0A125 || true

Meaning: If the path is missing, the disk isn’t enumerating—could be dead disk, dead slot, or link issue.

Decision: If it’s missing, check controller logs and physical link before assuming the disk is the sole problem.

Task 4: Pull recent kernel/storage errors (spot controller/backplane issues)

cr0x@server:~$ sudo dmesg -T | egrep -i "ata|sas|scsi|reset|link|I/O error|blk_update_request" | tail -n 25
[Wed Dec 24 09:42:11 2025] ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Wed Dec 24 09:42:11 2025] ata9.00: failed command: READ FPDMA QUEUED
[Wed Dec 24 09:42:11 2025] blk_update_request: I/O error, dev sdi, sector 123456789
[Wed Dec 24 09:42:12 2025] ata9: hard resetting link
[Wed Dec 24 09:42:13 2025] ata9: link is slow to respond, please be patient
[Wed Dec 24 09:42:18 2025] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Wed Dec 24 09:42:19 2025] sd 9:0:0:0: [sdi] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

Meaning: Link resets and I/O errors. Could be the disk. Could be the slot/cable.

Decision: If multiple disks on the same port show resets, stop and investigate the HBA/backplane. If it’s isolated to one disk, replacement is reasonable.

Task 5: Map the by-id device to a physical slot (enclosure tools)

On SAS enclosures, you often have sg_ses / sesutil / vendor tools. On Linux with common HBAs, start with lsblk and udevadm.

cr0x@server:~$ lsblk -o NAME,SIZE,SERIAL,MODEL,HCTL,PATH | egrep "sdi|VKJ0A125|NAME"
NAME   SIZE SERIAL    MODEL          HCTL       PATH
sdi   7.3T  VKJ0A125  WDC WD80EFAX   9:0:0:0    /dev/pci0000:3b/0000:3b:00.0/ata9/host8/target8:0:0/8:0:0:0/block/sdi

Meaning: You have the serial and the HCTL/path. That’s your breadcrumb trail to the right bay.

Decision: Use this mapping with your chassis docs or enclosure management to identify the exact slot before pulling anything.
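
To double-check the serial before anyone touches a caddy, udevadm works on most Linux boxes, and if the chassis supports enclosure LEDs, ledctl (from the ledmon package) can blink the exact bay. A sketch, assuming the suspect device is /dev/sdi:

cr0x@server:~$ udevadm info --query=property --name=/dev/sdi | grep -i serial
cr0x@server:~$ sudo ledctl locate=/dev/sdi        # blink the bay LED, if your enclosure supports it
cr0x@server:~$ sudo ledctl locate_off=/dev/sdi    # turn it off once the slot is confirmed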

Task 6: Confirm the replacement disk is sane (size, sector, transport)

cr0x@server:~$ sudo smartctl -a /dev/sdj | egrep "Model|Serial|User Capacity|Sector Size|Rotation Rate|SMART overall|SATA Version"
Device Model:     WDC WD80EFAX-68LHPN0
Serial Number:    VKJ0B555
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Size:      512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
SMART overall-health self-assessment test result: PASSED
SATA Version is:  SATA 3.3, 6.0 Gb/s

Meaning: Capacity matches, 4Kn physical sectors, SMART passes. Good baseline.

Decision: If the replacement is smaller (even slightly), stop. ZFS can’t replace a disk with a smaller one in the same vdev. If sector characteristics differ, proceed but expect performance/ashift consequences (covered later).
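
Compare capacities in bytes, not marketing terabytes. A quick sketch, assuming the outgoing disk is /dev/sdi and the replacement is /dev/sdj:

cr0x@server:~$ lsblk -b -d -o NAME,SIZE,SERIAL /dev/sdi /dev/sdj    # byte-exact sizes side by side
cr0x@server:~$ sudo blockdev --getsize64 /dev/sdj                   # same number from a second tool, as a sanity check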

Task 7: Check the pool’s ashift reality (because surprises happen)

cr0x@server:~$ sudo zdb -C tank | egrep "ashift|vdev_tree|path" | head -n 30
        vdev_tree:
            type: 'raidz'
            id: 0
            ashift: 12
            children[0]:
                path: '/dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0A123'
            children[1]:
                path: '/dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0A124'

Meaning: ashift: 12 (4K). This is typically correct for modern disks.

Decision: If you see ashift: 9 on 4K drives, don’t “fix” it by replacing a disk. Plan a vdev rebuild/migration when you can afford it.
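
To see how the replacement's sector sizes line up with the rest of the vdev, lsblk can print logical and physical sector sizes per disk. A sketch:

cr0x@server:~$ lsblk -d -o NAME,SERIAL,LOG-SEC,PHY-SEC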

Task 8: Offline the failing disk (controlled removal)

cr0x@server:~$ sudo zpool offline tank ata-WDC_WD80EFAX-68LHPN0_VKJ0A125

Meaning: ZFS stops using that disk and expects it to be gone. This reduces chaos when you pull the drive.

Decision: Offline before physical removal unless the disk is already gone. If the pool is already missing the disk, offlining may not matter.

Task 9: Confirm the disk is offline (not some other disk)

cr0x@server:~$ sudo zpool status tank | sed -n '1,35p'
  pool: tank
 state: DEGRADED
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A123            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A124            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A125            OFFLINE      0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A126            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A127            ONLINE       0     0     0

Meaning: The correct leaf is offline. If you see a different disk offline, stop immediately and undo your last command.

Decision: Proceed to physical swap.

Task 10: After inserting the new disk, find its stable by-id path

cr0x@server:~$ ls -l /dev/disk/by-id/ | egrep "VKJ0B555|WDC_WD80EFAX" | tail -n 5
lrwxrwxrwx 1 root root  9 Dec 24 10:11 ata-WDC_WD80EFAX-68LHPN0_VKJ0B555 -> ../../sdj
lrwxrwxrwx 1 root root 10 Dec 24 10:11 ata-WDC_WD80EFAX-68LHPN0_VKJ0B555-part1 -> ../../sdj1

Meaning: The OS sees the new disk as sdj, but you will use the by-id path.

Decision: Use /dev/disk/by-id/ata-...VKJ0B555 for zpool replace.

Task 11: Wipe old labels on the replacement disk (avoid importing ghosts)

cr0x@server:~$ sudo zpool labelclear -f /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0B555

Meaning: Clears ZFS labels that might confuse replacement (common with reused disks).

Decision: If this fails because the disk is busy, stop and check you didn’t accidentally pick a disk already in use.
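
If you want to see what's on the disk before clearing anything, wipefs with no options only lists existing signatures (partition tables, old filesystems, stray labels) without touching them. A sketch:

cr0x@server:~$ sudo wipefs /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0B555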

Task 12: Replace the offline disk with the new disk

cr0x@server:~$ sudo zpool replace tank ata-WDC_WD80EFAX-68LHPN0_VKJ0A125 /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0B555

Meaning: ZFS attaches the new disk alongside the old entry (you may briefly see a replacing vdev in zpool status) and starts resilvering onto it. Once the resilver completes, the old entry is dropped automatically.

Decision: Start monitoring resilver immediately. If the pool doesn’t begin resilvering, check for wrong path, wrong leaf name, or the disk not actually present.

Task 13: Monitor resilver progress (and learn what “good” looks like)

cr0x@server:~$ watch -n 10 sudo zpool status tank
Every 10.0s: sudo zpool status tank

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.
  scan: resilver in progress since Wed Dec 24 10:14:02 2025
        1.82T scanned at 920M/s, 612G issued at 308M/s, 1.82T total
        102G resilvered, 32.83% done, 03:44:10 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A123            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A124            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0B555            ONLINE       0     0     0  (resilvering)
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A126            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A127            ONLINE       0     0     0

Meaning: You get scanned, issued, current throughput, ETA, and the leaf marked “(resilvering)”.

Decision: If throughput collapses or ETA grows unbounded, jump to the Fast diagnosis playbook. Don’t wait hours to “see if it sorts itself out.”
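
If you're scripting the change, or just prefer a blocking wait over staring at watch, recent OpenZFS (2.0+) can wait on the resilver directly. A sketch:

cr0x@server:~$ sudo zpool wait -t resilver tank    # returns when the resilver finishes
cr0x@server:~$ sudo zpool status tank              # then confirm state and error counters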

Task 14: Check for rising read/write/checksum errors during resilver

cr0x@server:~$ sudo zpool status -v tank | sed -n '1,60p'
  pool: tank
 state: DEGRADED
  scan: resilver in progress since Wed Dec 24 10:14:02 2025
        2.34T scanned, 1.01T issued, 180G resilvered, 54.11% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A123            ONLINE       0     0     2
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A124            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0B555            ONLINE       0     0     0  (resilvering)
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A126            ONLINE       0     1     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A127            ONLINE       0     0     0

errors: No known data errors

Meaning: Errors are rising on other disks, not just the replaced one. That’s the smell of a shared problem (cable, expander, HBA, power).

Decision: Pause operational changes. Consider throttling workloads, checking transport errors, and possibly offlining a second unhealthy disk only if redundancy permits and you have a plan.

Task 15: When resilver finishes, verify the pool is truly healthy

cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
  scan: resilvered 1.09T in 04:18:51 with 0 errors on Wed Dec 24 14:32:53 2025
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A123            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A124            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0B555            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A126            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68LHPN0_VKJ0A127            ONLINE       0     0     0

errors: No known data errors

Meaning: Pool is ONLINE, resilver completed, and no errors were recorded.

Decision: Schedule a scrub sooner than usual if the pool had any instability. A scrub after resilver is the “trust but verify” move.

Task 16: Confirm TRIM/autotrim policy (SSDs and some SMR edge cases)

cr0x@server:~$ sudo zpool get autotrim tank
NAME  PROPERTY  VALUE     SOURCE
tank  autotrim  off       local

Meaning: Autotrim is off. On SSD pools, that might be a performance/space-reuse issue; on HDD pools it’s usually irrelevant.

Decision: For SSD-based pools, consider turning it on after verifying firmware/controller behavior. Don’t toggle knobs mid-resilver unless you enjoy debugging your own curiosity.
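
For an SSD pool where you've decided autotrim makes sense, the change is one property plus an optional one-shot trim. A sketch, to be run only after the pool is healthy again:

cr0x@server:~$ sudo zpool set autotrim=on tank
cr0x@server:~$ sudo zpool trim tank    # optional one-time pass; watch it with zpool status -t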

Step-by-step: replacing a disk in a mirror (the least dramatic version)

Mirrors are forgiving. They are also where people get sloppy, because “it’s just a mirror.” Mirrors fail when you replace the wrong side and discover you’ve been running one-legged for months.

Mirror replacement workflow

  1. Identify the failed leaf via zpool status -P. Capture the by-id path and serial.
  2. Confirm physical mapping (slot LED tools if available; otherwise serial → bay mapping).
  3. Offline the failed leaf if it’s still present: zpool offline pool oldleaf.
  4. Replace the disk physically. Wait for the OS to enumerate it.
  5. Labelclear the new disk if it may have been used before.
  6. Replace: zpool replace pool oldleaf newdisk.
  7. Monitor resilver. Mirrors often resilver quickly, but still watch for errors on the “good” disk.
  8. Close out with a scrub scheduled soon-ish if this mirror is important.

When to use attach for mirrors

You use zpool attach when you’re converting a single-disk vdev into a mirror, or when you intentionally want a three-way mirror temporarily. During replacements, replace is the default because it preserves the vdev topology and intent.
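
For completeness, the attach form looks like this. A sketch with placeholder names, for turning a single-disk vdev into a mirror (not for replacing a failed disk):

cr0x@server:~$ sudo zpool attach tank /dev/disk/by-id/EXISTING_DISK /dev/disk/by-id/NEW_DISK
cr0x@server:~$ sudo zpool status tank    # expect a mirror vdev containing both disks, plus a resilver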

Step-by-step: replacing a disk in RAIDZ (where patience is a feature)

RAIDZ resilvers are heavier. They read from many disks and reconstruct data onto the new one. That load can flush out marginal drives, marginal cables, and marginal assumptions.

RAIDZ replacement workflow

  1. Confirm redundancy margin. RAIDZ1 with one failed disk is living on the edge. RAIDZ2 gives you room to breathe, not permission to be casual.
  2. Check scrub history and current errors. If you already have checksum errors, treat this as incident mode.
  3. Stabilize the platform. If you see link resets or timeouts across multiple drives, fix transport first.
  4. Offline the target disk (if present) so you don’t race the OS during hot-swap.
  5. Replace by persistent ID path, not /dev/sdX.
  6. Monitor resilver and system metrics. Resilvering can saturate I/O and make applications sad. That’s normal. The question is: is it progressing consistently?
  7. After completion, verify no new errors. If errors climbed, don’t “clear” and forget—investigate, because the next scrub will remind you.

Short joke #2: RAIDZ resilvering is like watching paint dry, except the paint can occasionally catch fire.

Step-by-step: upgrading capacity with bigger disks (without lying to yourself)

Capacity upgrades are where people accidentally run “mixed reality storage”: the pool reports one size, the chassis contains another, and everyone argues with math. The safe approach is boring: replace one disk at a time, wait for resilver, repeat, then expand.

The boring, correct sequence

  1. Replace a disk with a larger one using zpool replace.
  2. Wait for resilver to finish.
  3. Repeat for every disk in the vdev.
  4. Enable autoexpand if you want ZFS to grow vdevs when possible.
  5. Expand the pool (often automatic after last replacement; sometimes requires manual online/expand depending on platform).

Task 17: Check autoexpand setting and enable if appropriate

cr0x@server:~$ sudo zpool get autoexpand tank
NAME  PROPERTY    VALUE   SOURCE
tank  autoexpand  off     default

Meaning: Autoexpand is currently off.

Decision: If you are intentionally upgrading disks and want ZFS to grow vdevs automatically when all leaves are larger, turn it on. If you’re not doing a planned upgrade, leave it alone.

cr0x@server:~$ sudo zpool set autoexpand=on tank

Task 18: Confirm pool size and per-vdev allocation after upgrades

cr0x@server:~$ sudo zpool list -v tank
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank        29.1T  18.4T  10.7T        -         -    22%    63%  1.00x  ONLINE  -
  raidz2-0  29.1T  18.4T  10.7T        -         -    22%    63%      -  ONLINE  -

Meaning: Pool and vdev sizes now reflect expanded capacity.

Decision: If size didn’t change after all disks are replaced, you likely need to online/expand the devices, or your platform’s partitioning left space unused.

Task 19: Online and expand a leaf device (when needed)

cr0x@server:~$ sudo zpool online -e tank /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0B555

Meaning: The -e attempts to expand the device to use available space.

Decision: Use this if you replaced with a larger disk and ZFS hasn’t picked up the size. If the disk is partitioned and the partition isn’t expanded, you must fix partitioning first (outside this article’s scope, but the point is: ZFS can’t use space it can’t see).
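
Put together, the per-disk upgrade loop looks roughly like this. A sketch with placeholder leaf names, repeated once per disk in the vdev:

cr0x@server:~$ sudo zpool replace tank OLD_LEAF /dev/disk/by-id/BIGGER_DISK
cr0x@server:~$ sudo zpool wait -t resilver tank                           # block until the resilver completes
cr0x@server:~$ sudo zpool online -e tank /dev/disk/by-id/BIGGER_DISK      # nudge expansion if autoexpand didn't
cr0x@server:~$ sudo zpool list -v tank                                    # verify sizes before touching the next disk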

Fast diagnosis playbook (find the bottleneck before you blame ZFS)

When resilvering is slow, stuck, or making the box feel like it’s wading through syrup, don’t guess. Triage in a strict order. The goal is to decide: is this a normal slow resilver, a workload conflict, a failing sibling disk, or a transport/controller problem?

First: confirm ZFS is actually making progress

  • Check zpool status scan line: does “issued” increase over time? Does “resilvered” increase?
  • Decision: If progress is steady, you mostly have a tuning/scheduling issue. If progress stalls, you have a failure.

cr0x@server:~$ sudo zpool status tank | sed -n '1,20p'
  pool: tank
 state: DEGRADED
  scan: resilver in progress since Wed Dec 24 10:14:02 2025
        2.61T scanned at 410M/s, 1.22T issued at 190M/s, 230G resilvered, 61.44% done, 02:31:20 to go

Second: look for transport pain (timeouts, resets, retries)

Transport issues mimic disk failure and ruin resilvers.

cr0x@server:~$ sudo dmesg -T | egrep -i "reset|timeout|link|frozen|SAS|phy|I/O error" | tail -n 40
[Wed Dec 24 11:02:41 2025] ata10: hard resetting link
[Wed Dec 24 11:02:46 2025] ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Wed Dec 24 11:03:10 2025] blk_update_request: I/O error, dev sdh, sector 987654321

Decision: If resets correlate with throughput drops and affect multiple disks, suspect cable/backplane/HBA. Fix that first. Replacing “another disk” won’t help.
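
A quick way to tell whether the pain is concentrated on one device or spread across several is to count kernel I/O errors per disk. A rough sketch; the exact log wording varies by kernel version:

cr0x@server:~$ sudo dmesg -T | grep -i "I/O error" | grep -oE "dev sd[a-z]+" | sort | uniq -c | sort -rn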

Third: measure device-level saturation and queueing

cr0x@server:~$ iostat -x 5 3
Linux 6.8.0 (server)   12/24/2025  _x86_64_  (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.10    0.00    6.20   18.50    0.00   63.20

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   w_await aqu-sz  %util
sdg             98.0  101024.0     0.0   0.00   22.10  1030.9     12.0   1408.0   11.20   2.18   99.0
sdh             96.0   98624.0     0.0   0.00   21.80  1027.3     10.0   1280.0   10.90   2.10   98.5
sdj             30.0   9216.0      0.0   0.00   15.50   307.2     80.0  81920.0   35.10   3.90   99.3

Meaning: Near 100% utilization and high await times. That may be normal during resilver on HDDs, but watch for one disk being an outlier.

Decision: If one disk has much higher await or errors, it’s the next candidate for failure—or it’s on a bad port.

Fourth: check ZFS-level latency and queue pressure

cr0x@server:~$ sudo zpool iostat -v tank 5 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        18.4T  10.7T    420    210   180M   92.0M
  raidz2-0  18.4T  10.7T    420    210   180M   92.0M
    sda         -      -     90     45   38.0M  20.0M
    sdb         -      -     85     40   36.0M  18.5M
    sdj         -      -     60     95   22.0M  38.0M
    sdd         -      -     92     15   40.0M  6.0M
    sde         -      -     93     15   44.0M  9.0M
----------  -----  -----  -----  -----  -----  -----

Decision: If the new disk is the write hotspot (common), that’s expected. If one old disk’s read bandwidth collapses, investigate that disk and its link.
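
If you want real latency numbers instead of inferring them from bandwidth, newer OpenZFS releases let zpool iostat print per-device latencies and queue depths. A sketch:

cr0x@server:~$ sudo zpool iostat -v -l tank 5    # adds per-device latency columns
cr0x@server:~$ sudo zpool iostat -v -q tank 5    # adds queue-depth columns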

Fifth: decide whether to throttle workloads or postpone

If this is a shared production box, resilver plus heavy random reads is a performance tax. Sometimes the correct answer is: throttle or move workloads. ZFS can’t negotiate with your peak traffic.

Common mistakes: symptoms → root cause → fix

1) Symptom: pool still DEGRADED after replacement

Root cause: You replaced the wrong leaf, used the wrong device path, or the new disk didn’t get attached to the right vdev.

Fix: Check zpool status -P. Confirm the replaced leaf now points to the new by-id path. If you see both old and new, you may have used attach instead of replace. Correct the topology intentionally.
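
If the diagnosis really is "attach instead of replace," the cleanup is usually a detach of the accidentally added leaf, but only after you've confirmed in zpool status which leaf is redundant and that removing it leaves the redundancy you expect. A sketch with a placeholder name:

cr0x@server:~$ sudo zpool detach tank ACCIDENTALLY_ATTACHED_LEAF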

2) Symptom: resilver “runs” but never finishes (ETA keeps growing)

Root cause: Read errors or link resets on surviving disks, or the new disk is intermittently dropping.

Fix: Review dmesg for resets/timeouts. Check error counters in zpool status. If errors are on multiple disks on one path, fix the transport.

3) Symptom: cannot replace ... with ... or “device is too small”

Root cause: Replacement disk capacity is slightly smaller (common across models/vendors) or partition is smaller than expected.

Fix: Use a disk with equal-or-larger capacity. If partitioned, ensure the replacement partition matches or exceeds the original.

4) Symptom: new disk shows ONLINE but immediately racks up write errors

Root cause: Bad new disk, bad SATA/SAS link, or backplane slot issues.

Fix: Swap the disk into a different bay if possible to isolate slot vs drive. Run a SMART extended self-test if time allows. Treat repeated link resets as platform problems, not ZFS problems.

5) Symptom: after “upgrade to bigger disks,” pool size doesn’t increase

Root cause: Not all disks in the vdev are upgraded, autoexpand is off, leaf devices not expanded, or partitions weren’t resized.

Fix: Verify each leaf’s size, enable autoexpand if desired, and use zpool online -e where supported. Confirm partition sizes.

6) Symptom: scrub finds new checksum errors right after replacement

Root cause: You had latent corruption or a marginal disk/cable that only showed under heavy read.

Fix: Identify which vdev/disk shows errors. If errors are on one disk, replace it. If scattered, investigate controller/backplane. Don’t “clear” and move on.

7) Symptom: you replaced a disk and now the pool won’t import

Root cause: Multiple disks were removed/offlined, wrong disk pulled, or you crossed the redundancy boundary (especially RAIDZ1). Sometimes it’s also a controller presenting different IDs after reboot.

Fix: Stop. Preserve evidence. Use zpool import to inspect available pools and their state. Avoid force flags unless you understand the transaction groups and what you’re overriding.

Task 20: Inspect importable pools when things look wrong

cr0x@server:~$ sudo zpool import
   pool: tank
     id: 1234567890123456789
  state: DEGRADED
status: One or more devices are missing from the system.
action: The pool can be imported despite missing or damaged devices.  The fault tolerance of the pool may be compromised.
config:

        tank                      DEGRADED
          raidz2-0                DEGRADED
            ata-WDC...VKJ0A123    ONLINE
            ata-WDC...VKJ0A124    ONLINE
            ata-WDC...VKJ0B555    ONLINE
            ata-WDC...VKJ0A126    ONLINE
            ata-WDC...VKJ0A127    UNAVAIL

Meaning: ZFS sees the pool and can likely import, but a device is still missing.

Decision: Do not force anything until you know which physical device is missing and why.

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “Bay 12 is Bay 12 everywhere”

A mid-sized company ran a pair of storage servers in two racks, same chassis model, same number of bays, mirrored pools for a few services. A disk alert fired: zpool status showed a specific serial missing. The on-call engineer did the “normal thing”: asked facilities to pull “Bay 12” because the chassis labels said Bay 12.

Except one chassis had been serviced months earlier. During that service, the backplane wiring had been rerouted to accommodate a different HBA port mapping. The labels on the chassis still looked right; the mapping behind them had changed. “Bay 12” in the UI was not Bay 12 in the metal.

They pulled a healthy disk. The pool went from DEGRADED to “you have a problem.” Fortunately it was RAIDZ2, so the service stayed up, but the resilver plan got uglier. They reinserted the wrong disk quickly—yet ZFS still logged a flurry of errors due to the sudden removal under load.

The postmortem was short and painful: the root cause wasn’t ZFS. It was a human workflow that relied on a bay number without confirming serial. The fix was boring: require serial verification before removal, and keep a living mapping document per chassis (HBA port → expander → bay).

The bigger lesson: in production, “identical hardware” is a myth. Systems drift. People forget. Labels survive longer than truths.

Optimization that backfired: “Let’s speed up resilver by cranking concurrency”

Another team had a large RAIDZ pool on HDDs with mixed workloads: analytics reads during the day, backup writes overnight. They wanted faster resilvers to reduce risk windows. Someone found tunables that promised higher throughput by increasing parallelism and making the system “use the disks more.”

They changed a handful of parameters during business hours—because the pool was already degraded and “we need this done ASAP.” The resilver rate spiked… briefly. Then application latency spiked harder. The box began to log timeouts. Not just on the replaced disk; on several drives.

The real bottleneck wasn’t “ZFS not trying hard enough.” It was the HBA/expander path and a queue depth that became pathological under the new settings. With higher concurrency, the system turned transient latency into outright command timeouts, which ZFS interpreted as device errors. The pool spent cycles retrying and recovering, not copying data.

They rolled back the tunables and instead throttled the workload by moving batch jobs off the host during resilver. The resilver took longer than the best-case spike, but it finished reliably, and the pool didn’t accumulate new errors.

Optimization rule: if you don’t know what the bottleneck is, your “tuning” is just a new way to be wrong, faster.

Boring but correct practice that saved the day: persistent IDs and slow hands

A financial services shop ran ZFS mirrors for low-latency databases and RAIDZ2 for backups. Their runbooks were strict: every disk removal required the ZFS leaf name, the by-id path, the serial number, and a second person to verify the physical drive’s serial on the label. No exceptions, even at 3 a.m.

One night, a disk started throwing errors. The on-call followed the procedure, offlined the exact leaf, and the tech replaced the correct disk. Resilver started. Then, halfway through, another disk on the same backplane began showing link resets.

Here’s where the boring practice paid off: because they were watching the error counters and dmesg during the resilver, they recognized a shared-path issue early. They paused non-essential workloads and reseated the backplane cable during a controlled window. Errors stopped. Resilver completed cleanly.

Later analysis showed the second disk was fine. The cable wasn’t. If they’d “optimized” by swapping disks rapidly to chase alerts, they might have removed a healthy disk and crossed the redundancy line. Instead, they treated the system like a system: disks, links, and humans.

Reliable operations is mostly about refusing to be in a hurry in exactly the moments you feel hurried.

Checklists / step-by-step plan

Checklist A: Standard failed disk replacement (any topology)

  1. Run zpool status -P; copy the exact leaf identifier and by-id path.
  2. Confirm scrub history and current errors. If checksum errors exist, treat as elevated risk.
  3. Check system logs (dmesg) for link resets/timeouts affecting multiple disks.
  4. Map serial → physical bay; confirm with chassis tools/labels.
  5. Offline the target disk (if present) with zpool offline.
  6. Replace the physical disk; confirm the new disk serial in the OS.
  7. Clear labels on the new disk: zpool labelclear -f.
  8. Run zpool replace using by-id paths.
  9. Monitor resilver with zpool status, zpool iostat, and OS metrics.
  10. After completion, confirm pool ONLINE and error counters stable.
  11. Schedule a scrub if this incident involved link resets, checksum errors, or multiple disks misbehaving.
  12. Close the loop: record old serial, new serial, date, and slot mapping changes.

Checklist B: “I might have pulled the wrong disk” containment plan

  1. Stop pulling disks. Put the removed disk back if possible.
  2. Run zpool status -P and take a snapshot of the output for incident notes.
  3. Check redundancy margin: RAIDZ1 with two missing disks is not a place for experiments.
  4. Use serial-based identification to reconcile what’s missing vs what’s physically removed.
  5. If the pool is still importable, do not export/import repeatedly “to see if it helps.” Stabilize first.
  6. Only after the system is stable, proceed with a single, verified replacement at a time.

Checklist C: Capacity upgrade (bigger disks)

  1. Confirm current vdev layout and ashift (so you don’t upgrade into old mistakes).
  2. Ensure replacement disks are genuinely larger (not “marketing larger”).
  3. Replace exactly one disk per vdev at a time, waiting for resilver each time.
  4. After all leaves in a vdev are replaced, verify expansion (zpool list -v).
  5. Use zpool online -e where needed and safe.
  6. Run a scrub after the final replacement and expansion.

Task 21: Baseline performance and latency before/after replacement (so you can spot regressions)

cr0x@server:~$ sudo zpool iostat -v tank 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        18.4T  10.7T    110     65   42.0M  18.0M
  raidz2-0  18.4T  10.7T    110     65   42.0M  18.0M
    sda         -      -     22     13   8.5M   3.0M
    sdb         -      -     20     12   7.9M   2.8M
    sdj         -      -     26     15   9.8M   4.0M
    sdd         -      -     21     12   8.1M   4.1M
    sde         -      -     21     13   7.7M   4.1M
----------  -----  -----  -----  -----  -----  -----

Meaning: Establishes a basic profile. If one disk becomes a laggard after replacement, this is where you’ll see it.

Decision: If the new disk underperforms massively compared to peers, suspect SMR quirks, firmware behavior, or a bad link/port.

Task 22: Run a post-replacement scrub (planned verification)

cr0x@server:~$ sudo zpool scrub tank

Meaning: Begins a full integrity check. On large pools this can take hours.

Decision: Schedule it during a quieter window if the workload is latency-sensitive; monitor for errors and performance impact.

Task 23: Check scrub results and make the “ship it” decision

cr0x@server:~$ sudo zpool status tank | sed -n '1,15p'
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 06:41:08 with 0 errors on Thu Dec 25 02:01:33 2025

Meaning: Scrub found and repaired nothing, with zero errors. This is the closest thing to closure.

Decision: Close the incident/change. If errors were found, keep digging; something is still wrong.

FAQ

1) Should I always offline a disk before pulling it?

Yes, when the disk is still present and you control the timing. Offlining reduces surprise I/O errors and makes the removal an intentional state change. If the disk already vanished, offlining won’t help, but it also won’t magically fix anything.

2) What’s the difference between “resilver” and “scrub”?

Resilver reconstructs data onto a replacement device to restore redundancy. Scrub verifies the entire pool’s checksums and repairs using redundancy where possible. Resilver is targeted recovery; scrub is whole-pool audit.

3) Can I replace multiple failed disks at once?

You can, but you usually shouldn’t. One-at-a-time keeps the system in a known state and avoids crossing redundancy limits accidentally. The exception is when you have spares pre-attached in some designs, or when a backplane failure forces multiple changes—then you operate in incident mode with explicit risk acceptance.

4) Should I use /dev/sdX paths in zpool replace?

No. Use /dev/disk/by-id/... (or the equivalent stable naming on your OS). /dev/sdX is an implementation detail that changes across boots, rescans, and sometimes just because the kernel felt like it.

5) My new disk is “the same model” but slightly smaller. Why?

Because vendors revise firmware, use different platters, or reserve different amounts of space. ZFS is strict: replacement must be equal or larger. Keep a few “known good and known large enough” spares rather than trusting marketing capacity.

6) Why is resilver speed so different from a simple disk-to-disk copy?

ZFS isn’t copying raw blocks from a single source. It’s reconstructing from redundancy, reading across vdev members, validating checksums, and competing with live workloads. Also, fragmented pools tend to resilver slower because metadata and blocks are scattered.

7) After replacement, can I just run zpool clear to reset errors?

You can clear after you’ve fixed the underlying cause and you want to watch for recurrence. Don’t clear as a substitute for investigation. Error counters are evidence, and evidence is useful.

8) Do I need to partition disks for ZFS?

Depends on your platform conventions and boot requirements. Many Linux deployments use whole disks; others use partitions for alignment or tooling. Operationally, consistency matters more than ideology: don’t mix approaches casually, and ensure replacements match the existing scheme.

9) What about hot spares—should I use them?

Hot spares can reduce time-to-resilver, which reduces risk. But they also can hide operational hygiene issues (you still need to replace the failed disk physically) and they can be consumed by the wrong failure mode (like a flaky expander causing multiple “failures”). Use them, but don’t let them replace monitoring and discipline.

10) Is RAIDZ1 acceptable if I replace disks quickly?

Sometimes, for low-value data and small pools. In production, RAIDZ1 plus large disks plus real workloads is a risk decision, not a technical trick. Disk replacement is exactly when you discover how much you regret that decision.

Close-out: practical next steps

If you want fewer ZFS disk replacement incidents—and fewer “why is the pool still angry?” mornings—make the safe workflow your default:

  • Standardize on persistent device names in pool configs and runbooks.
  • Require serial confirmation before any physical pull. Two-person verification if the data matters.
  • Offline intentionally, replace with zpool replace, and monitor resilver like it’s a live migration—because it basically is.
  • When resilver is slow, diagnose transport and sibling disks first. ZFS is usually reporting symptoms, not hallucinating problems.
  • Finish with a scrub when the incident had any whiff of instability. It’s the audit trail your future self will thank you for.

Do this consistently and disk replacements become what they should be: maintenance, not theater.
