Replacing a disk in ZFS is one of those chores that feels routine—right up until it isn’t. The pool goes DEGRADED, alerts start screaming, your ticket queue fills with “is it still safe?”, and someone inevitably suggests rebooting “to clear it.” That’s how you turn a single bad drive into a multi-day incident with a side of regret.
This is the workflow I use in production when I want predictable outcomes: identify the correct disk by serial, validate the replacement, choose the right ZFS verb (replace/attach/detach/online), monitor resilvering like you mean it, and finish cleanly so the pool is actually healthy—not “green until the next scrub.”
What “safe” means in ZFS disk replacement
In ZFS, “safe” isn’t a vibe. It’s a sequence of verifications that prevents three classic failures:
- Replacing the wrong disk (the #1 operator error). That’s how you turn redundancy into a coin toss.
- Using the wrong device path (hello, /dev/sdX roulette), leading to the pool “healing” onto a different physical disk than you think.
- Declaring victory too early: resilver finishes, but you didn’t fix the underlying transport/controller issue, so the next disk “fails” in sympathy.
Here’s the operational truth: ZFS is very good at data integrity and very honest about what it knows. But ZFS cannot protect you from confusing bay numbers, bad SAS expanders, mis-cabled backplanes, or a replacement drive that’s DOA. Your workflow must bridge that gap.
Use the right ZFS verb or suffer the consequences
ZFS disk “replacement” can mean different things:
- zpool replace: replace a specific leaf device with another. This is the usual path for failed disks.
- zpool attach: add a disk to an existing single-disk vdev to make a mirror, or add another disk to an existing mirror (rare; be careful). Not a replacement.
- zpool detach: remove a disk from a mirror vdev (never from RAIDZ). Used after attach-based migrations or mirror shrink operations.
- zpool offline/online: take a disk out of service temporarily. Useful when you want to control the timing.
One more: zpool clear is not a fix. It clears error counters. It can be appropriate after you correct a cabling issue and want to confirm stability. It is not appropriate as a ritual to make alerts go away.
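To make the distinction concrete, here is a minimal sketch of those verbs side by side; tank and the UPPERCASE device names are placeholders, not devices from this pool:

# Replace a failed leaf with a new disk (the normal repair path):
sudo zpool replace tank OLD_LEAF /dev/disk/by-id/NEW_DISK

# Attach a second disk to an existing device, turning it into (or widening) a mirror:
sudo zpool attach tank EXISTING_LEAF /dev/disk/by-id/NEW_DISK

# Detach one side of a mirror (never valid for RAIDZ members):
sudo zpool detach tank MIRROR_LEAF

# Take a disk out of service temporarily, then bring it back:
sudo zpool offline tank LEAF
sudo zpool online tank LEAF

# Clear error counters only after the root cause is actually fixed:
sudo zpool clear tank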
Paraphrased idea from Gene Kim: the real work of reliability is making the normal path the safe path. That applies to disk swaps as much as deploys.
Short joke #1: If your plan relies on “I’ll remember which drive it is,” congratulations—you’ve invented non-persistent storage for humans.
Interesting facts and short history (so you stop fighting the system)
- ZFS was built to detect silent data corruption. Checksums are stored separately from data, so a disk can’t “grade its own homework.”
- Resilvering changed over time: modern OpenZFS supports sequential resilver, which tends to be kinder to fragmented pools and faster on spinning disks.
- Scrubs predate many “cloud-native” ideas. A ZFS scrub is basically continuous verification at scale, long before “integrity scanning” became fashionable.
- RAIDZ is not RAID5. Similar goals, very different implementation: ZFS has end-to-end checksumming and a transactional model that avoids the classic write hole.
- ashift is forever. Once a vdev is created with an ashift (sector size exponent), you don’t change it without rebuilding the vdev.
- Device naming wars are old. Admins have been burned by unstable device names since before SATA was cool; persistent IDs exist because humans keep losing that fight.
- ZFS “autoexpand” exists because upgrades are normal. But it’s not magic. It has rules and timing, and it won’t fix mismatched vdev sizes.
- ZFS can use a separate intent log (SLOG) to accelerate synchronous writes, but it doesn’t help resilver speed and can complicate failure handling.
- SMART isn’t truth, it’s testimony. Drives can die without warning; others scream for months and keep running. ZFS uses checksums and redundancy to live in reality.
Pre-flight: what to check before touching hardware
Disk replacement is a choreography between ZFS, the OS, the controller/HBA, the chassis/backplane, and the human holding the caddy. The goal is to reduce uncertainty before you pull anything.
Rules I follow in production
- No disk pulls without serial confirmation. Slot numbers lie. Labels drift. Humans misread. Serial numbers don’t care.
- Scrub or at least assess scrub health first. If the pool already has checksum errors, resilver is not a victory lap—it’s a stress test.
- Don’t resilver during a performance-sensitive window unless you deliberately throttle and accept the blast radius.
- Prefer persistent device identifiers (by-id paths) for replace operations and for the vdev config long-term.
- Validate the replacement disk: sector size, capacity, transport type, and “is it even the right disk.”
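One boring habit that supports all of the above: capture the pre-flight state to files before touching hardware. A minimal sketch, assuming a scratch directory of your choosing (~/disk-swaps here is arbitrary):

ts=$(date +%Y%m%dT%H%M%S)
mkdir -p ~/disk-swaps
sudo zpool status -v -P tank         > ~/disk-swaps/tank-status-$ts.txt
sudo zpool list -v tank              > ~/disk-swaps/tank-list-$ts.txt
lsblk -o NAME,SIZE,SERIAL,MODEL,HCTL > ~/disk-swaps/lsblk-$ts.txt
sudo dmesg -T | tail -n 200          > ~/disk-swaps/dmesg-$ts.txt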
Practical tasks: commands, what the output means, and what decision you make
The following tasks are deliberately repetitive. That’s the point. Disk swaps fail when people improvise.
Task 1: Confirm pool health and identify the problem vdev
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
scan: scrub repaired 0B in 03:12:44 with 0 errors on Tue Dec 24 01:12:18 2025
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A123 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A124 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A125 UNAVAIL 0 0 0 cannot open
ata-WDC_WD80EFAX-68LHPN0_VKJ0A126 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A127 ONLINE 0 0 0
errors: No known data errors
Meaning: ZFS can’t open one leaf device; redundancy keeps the pool online. Scrub history is clean. Good.
Decision: Replace ...VKJ0A125. Do not touch other disks. If multiple are flaky, pause and investigate cabling/backplane before swapping more hardware.
Task 2: Get the full vdev tree with persistent names (and see if you’re already using them)
cr0x@server:~$ sudo zpool status -P tank
pool: tank
state: DEGRADED
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
/dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0A123 ONLINE 0 0 0
/dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0A124 ONLINE 0 0 0
/dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0A125 UNAVAIL 0 0 0
/dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0A126 ONLINE 0 0 0
/dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0A127 ONLINE 0 0 0
Meaning: Pool config uses stable by-id paths. This is what you want.
Decision: Continue using /dev/disk/by-id/... for replacement. If you see /dev/sdX, fix that during/after replacement (details later).
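If the pool was imported with /dev/sdX names, one common fix (assuming you can afford a short maintenance window) is to export the pool and re-import it from the by-id directory:

# Datasets are unavailable between export and import; plan accordingly.
sudo zpool export tank
sudo zpool import -d /dev/disk/by-id tank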
Task 3: Confirm the OS sees the failed disk at all (or that it really disappeared)
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep VKJ0A125 || true
Meaning: If the path is missing, the disk isn’t enumerating—could be dead disk, dead slot, or link issue.
Decision: If it’s missing, check controller logs and physical link before assuming the disk is the sole problem.
Task 4: Pull recent kernel/storage errors (spot controller/backplane issues)
cr0x@server:~$ sudo dmesg -T | egrep -i "ata|sas|scsi|reset|link|I/O error|blk_update_request" | tail -n 25
[Wed Dec 24 09:42:11 2025] ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Wed Dec 24 09:42:11 2025] ata9.00: failed command: READ FPDMA QUEUED
[Wed Dec 24 09:42:11 2025] blk_update_request: I/O error, dev sdi, sector 123456789
[Wed Dec 24 09:42:12 2025] ata9: hard resetting link
[Wed Dec 24 09:42:13 2025] ata9: link is slow to respond, please be patient
[Wed Dec 24 09:42:18 2025] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Wed Dec 24 09:42:19 2025] sd 9:0:0:0: [sdi] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Meaning: Link resets and I/O errors. Could be the disk. Could be the slot/cable.
Decision: If multiple disks on the same port show resets, stop and investigate the HBA/backplane. If it’s isolated to one disk, replacement is reasonable.
Task 5: Map the by-id device to a physical slot (enclosure tools)
On SAS enclosures, you often have sg_ses / sesutil / vendor tools. On Linux with common HBAs, start with lsblk and udevadm.
cr0x@server:~$ lsblk -o NAME,SIZE,SERIAL,MODEL,HCTL,PATH | egrep "sdi|VKJ0A125|NAME"
NAME SIZE SERIAL MODEL HCTL PATH
sdi 7.3T VKJ0A125 WDC WD80EFAX 9:0:0:0 /dev/pci0000:3b/0000:3b:00.0/ata9/host8/target8:0:0/8:0:0:0/block/sdi
Meaning: You have the serial and the HCTL/path. That’s your breadcrumb trail to the right bay.
Decision: Use this mapping with your chassis docs or enclosure management to identify the exact slot before pulling anything.
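If the enclosure supports SES slot LEDs and the ledmon package is installed, something like the following can blink the bay before you commit to a pull. Treat it as a convenience check, not a substitute for serial verification:

# Blink the locate LED for the suspect device, verify the bay visually, then turn it off.
sudo ledctl locate=/dev/sdi
sudo ledctl locate_off=/dev/sdi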
Task 6: Confirm the replacement disk is sane (size, sector, transport)
cr0x@server:~$ sudo smartctl -a /dev/sdj | egrep "Model|Serial|User Capacity|Sector Size|Rotation Rate|SMART overall|SATA Version"
Device Model: WDC WD80EFAX-68LHPN0
Serial Number: VKJ0B555
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Sector Size: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
SMART overall-health self-assessment test result: PASSED
SATA Version is: SATA 3.3, 6.0 Gb/s
Meaning: Capacity matches, 4Kn physical sectors, SMART passes. Good baseline.
Decision: If the replacement is smaller (even slightly), stop. ZFS can’t replace a disk with a smaller one in the same vdev. If sector characteristics differ, proceed but expect performance/ashift consequences (covered later).
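A quick way to rule out the “slightly smaller” trap is to compare exact byte counts instead of trusting the label. A sketch using a healthy sibling from this pool and the candidate disk from the smartctl check above:

sibling=/dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0A123   # healthy member of the same vdev
replacement=/dev/sdj                                        # candidate replacement disk
sudo blockdev --getsize64 "$sibling"
sudo blockdev --getsize64 "$replacement"
# The replacement must report an equal or larger byte count, or zpool replace will refuse it.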
Task 7: Check the pool’s ashift reality (because surprises happen)
cr0x@server:~$ sudo zdb -C tank | egrep "ashift|vdev_tree|path" | head -n 30
vdev_tree:
type: 'raidz'
id: 0
ashift: 12
children[0]:
path: '/dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0A123'
children[1]:
path: '/dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0A124'
Meaning: ashift: 12 (4K). This is typically correct for modern disks.
Decision: If you see ashift: 9 on 4K drives, don’t “fix” it by replacing a disk. Plan a vdev rebuild/migration when you can afford it.
Task 8: Offline the failing disk (controlled removal)
cr0x@server:~$ sudo zpool offline tank ata-WDC_WD80EFAX-68LHPN0_VKJ0A125
Meaning: ZFS stops using that disk and expects it to be gone. This reduces chaos when you pull the drive.
Decision: Offline before physical removal unless the disk is already gone. If the pool is already missing the disk, offlining may not matter.
Task 9: Confirm the disk is offline (not some other disk)
cr0x@server:~$ sudo zpool status tank | sed -n '1,35p'
pool: tank
state: DEGRADED
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A123 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A124 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A125 OFFLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A126 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A127 ONLINE 0 0 0
Meaning: The correct leaf is offline. If you see a different disk offline, stop immediately and undo your last command.
Decision: Proceed to physical swap.
Task 10: After inserting the new disk, find its stable by-id path
cr0x@server:~$ ls -l /dev/disk/by-id/ | egrep "VKJ0B555|WDC_WD80EFAX" | tail -n 5
lrwxrwxrwx 1 root root 9 Dec 24 10:11 ata-WDC_WD80EFAX-68LHPN0_VKJ0B555 -> ../../sdj
lrwxrwxrwx 1 root root 10 Dec 24 10:11 ata-WDC_WD80EFAX-68LHPN0_VKJ0B555-part1 -> ../../sdj1
Meaning: The OS sees the new disk as sdj, but you will use the by-id path.
Decision: Use /dev/disk/by-id/ata-...VKJ0B555 for zpool replace.
Task 11: Wipe old labels on the replacement disk (avoid importing ghosts)
cr0x@server:~$ sudo zpool labelclear -f /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0B555
Meaning: Clears ZFS labels that might confuse replacement (common with reused disks).
Decision: If this fails because the disk is busy, stop and check you didn’t accidentally pick a disk already in use.
Task 12: Replace the offline disk with the new disk
cr0x@server:~$ sudo zpool replace tank ata-WDC_WD80EFAX-68LHPN0_VKJ0A125 /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0B555
Meaning: ZFS begins resilvering to the new disk. The old disk identity is now associated with the new physical device.
Decision: Start monitoring resilver immediately. If the pool doesn’t begin resilvering, check for wrong path, wrong leaf name, or the disk not actually present.
Task 13: Monitor resilver progress (and learn what “good” looks like)
cr0x@server:~$ watch -n 10 sudo zpool status tank
Every 10.0s: sudo zpool status tank
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered.
scan: resilver in progress since Wed Dec 24 10:14:02 2025
1.82T scanned at 920M/s, 612G issued at 308M/s, 1.82T total
102G resilvered, 32.83% done, 03:44:10 to go
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A123 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A124 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0B555 ONLINE 0 0 0 (resilvering)
ata-WDC_WD80EFAX-68LHPN0_VKJ0A126 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A127 ONLINE 0 0 0
Meaning: You get scanned, issued, current throughput, ETA, and the leaf marked “(resilvering)”.
Decision: If throughput collapses or ETA grows unbounded, jump to the Fast diagnosis playbook. Don’t wait hours to “see if it sorts itself out.”
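On newer OpenZFS releases, zpool wait can block until the resilver finishes, which is handy for scripting follow-up steps (assuming your version ships the subcommand):

# Block until resilver activity on tank completes, then show a terse health summary.
sudo zpool wait -t resilver tank
sudo zpool status -x tank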
Task 14: Check for rising read/write/checksum errors during resilver
cr0x@server:~$ sudo zpool status -v tank | sed -n '1,60p'
pool: tank
state: DEGRADED
scan: resilver in progress since Wed Dec 24 10:14:02 2025
2.34T scanned, 1.01T issued, 180G resilvered, 54.11% done
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A123 ONLINE 0 0 2
ata-WDC_WD80EFAX-68LHPN0_VKJ0A124 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0B555 ONLINE 0 0 0 (resilvering)
ata-WDC_WD80EFAX-68LHPN0_VKJ0A126 ONLINE 0 1 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A127 ONLINE 0 0 0
errors: No known data errors
Meaning: Errors are rising on other disks, not just the replaced one. That’s the smell of a shared problem (cable, expander, HBA, power).
Decision: Pause operational changes. Consider throttling workloads, checking transport errors, and possibly offlining a second unhealthy disk only if redundancy permits and you have a plan.
Task 15: When resilver finishes, verify the pool is truly healthy
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
scan: resilvered 1.09T in 04:18:51 with 0 errors on Wed Dec 24 14:32:53 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A123 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A124 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0B555 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A126 ONLINE 0 0 0
ata-WDC_WD80EFAX-68LHPN0_VKJ0A127 ONLINE 0 0 0
errors: No known data errors
Meaning: Pool is ONLINE, resilver completed, and no errors were recorded.
Decision: Schedule a scrub sooner than usual if the pool had any instability. A scrub after resilver is the “trust but verify” move.
Task 16: Confirm TRIM/autotrim policy (SSDs and some SMR edge cases)
cr0x@server:~$ sudo zpool get autotrim tank
NAME PROPERTY VALUE SOURCE
tank autotrim off local
Meaning: Autotrim is off. On SSD pools, that might be a performance/space-reuse issue; on HDD pools it’s usually irrelevant.
Decision: For SSD-based pools, consider turning it on after verifying firmware/controller behavior. Don’t toggle knobs mid-resilver unless you enjoy debugging your own curiosity.
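If you do decide an SSD pool should trim automatically, the toggle itself is a one-liner; just run it after the pool is healthy again, not mid-resilver:

sudo zpool set autotrim=on tank
sudo zpool get autotrim tank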
Step-by-step: replacing a disk in a mirror (the least dramatic version)
Mirrors are forgiving. They are also where people get sloppy, because “it’s just a mirror.” Mirrors fail when you replace the wrong side and discover you’ve been running one-legged for months.
Mirror replacement workflow
- Identify the failed leaf via zpool status -P. Capture the by-id path and serial.
- Confirm physical mapping (slot LED tools if available; otherwise serial → bay mapping).
- Offline the failed leaf if it’s still present: zpool offline pool oldleaf.
- Replace the disk physically. Wait for the OS to enumerate it.
- Labelclear the new disk if it may have been used before.
- Replace: zpool replace pool oldleaf newdisk.
- Monitor resilver. Mirrors often resilver quickly, but still watch for errors on the “good” disk.
- Close out with a scrub scheduled soon-ish if this mirror is important.
When to use attach for mirrors
You use zpool attach when you’re converting a single-disk vdev into a mirror, or when you intentionally want a three-way mirror temporarily. During replacements, replace is the default because it preserves the vdev topology and intent.
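For completeness, a minimal attach sketch with placeholder device names: converting a single-disk vdev into a mirror, and later detaching one side only if this was an attach-based migration rather than a permanent mirror:

# Turn a lone disk into a two-way mirror by attaching a second device to it.
sudo zpool attach tank /dev/disk/by-id/EXISTING_DISK /dev/disk/by-id/NEW_DISK

# After the resilver, detach the old side only if the goal was migration, not redundancy.
sudo zpool detach tank /dev/disk/by-id/EXISTING_DISK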
Step-by-step: replacing a disk in RAIDZ (where patience is a feature)
RAIDZ resilvers are heavier. They read from many disks and reconstruct data onto the new one. That load can flush out marginal drives, marginal cables, and marginal assumptions.
RAIDZ replacement workflow
- Confirm redundancy margin. RAIDZ1 with one failed disk is living on the edge. RAIDZ2 gives you room to breathe, not permission to be casual.
- Check scrub history and current errors. If you already have checksum errors, treat this as incident mode.
- Stabilize the platform. If you see link resets or timeouts across multiple drives, fix transport first.
- Offline the target disk (if present) so you don’t race the OS during hot-swap.
- Replace by persistent ID path, not /dev/sdX.
- Monitor resilver and system metrics. Resilvering can saturate I/O and make applications sad. That’s normal. The question is: is it progressing consistently?
- After completion, verify no new errors. If errors climbed, don’t “clear” and forget—investigate, because the next scrub will remind you.
Short joke #2: RAIDZ resilvering is like watching paint dry, except the paint can occasionally catch fire.
Step-by-step: upgrading capacity with bigger disks (without lying to yourself)
Capacity upgrades are where people accidentally run “mixed reality storage”: the pool reports one size, the chassis contains another, and everyone argues with math. The safe approach is boring: replace one disk at a time, wait for resilver, repeat, then expand.
The boring, correct sequence
- Replace a disk with a larger one using zpool replace (a scripted sketch follows this list).
- Wait for resilver to finish.
- Repeat for every disk in the vdev.
- Enable autoexpand if you want ZFS to grow vdevs when possible.
- Expand the pool (often automatic after last replacement; sometimes requires manual online/expand depending on platform).
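A sketch of the one-at-a-time loop, assuming a recent OpenZFS with zpool wait and with the old/new by-id names filled in from your own zpool status -P output; the arrays below are placeholders, not real devices:

#!/usr/bin/env bash
set -euo pipefail
pool=tank
# Fill these in from 'zpool status -P'; order matters: olds[i] is replaced by news[i].
olds=(/dev/disk/by-id/OLD_LEAF_1 /dev/disk/by-id/OLD_LEAF_2)
news=(/dev/disk/by-id/NEW_DISK_1 /dev/disk/by-id/NEW_DISK_2)

for i in "${!olds[@]}"; do
  sudo zpool replace "$pool" "${olds[$i]}" "${news[$i]}"
  sudo zpool wait -t resilver "$pool"   # block until this resilver completes
  sudo zpool status -x "$pool"          # stop the loop by hand if this is not healthy
done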
Task 17: Check autoexpand setting and enable if appropriate
cr0x@server:~$ sudo zpool get autoexpand tank
NAME PROPERTY VALUE SOURCE
tank autoexpand off default
Meaning: Autoexpand is currently off.
Decision: If you are intentionally upgrading disks and want ZFS to grow vdevs automatically when all leaves are larger, turn it on. If you’re not doing a planned upgrade, leave it alone.
cr0x@server:~$ sudo zpool set autoexpand=on tank
Task 18: Confirm pool size and per-vdev allocation after upgrades
cr0x@server:~$ sudo zpool list -v tank
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 29.1T 18.4T 10.7T - - 22% 63% 1.00x ONLINE -
raidz2-0 29.1T 18.4T 10.7T - - 22% 63% - ONLINE -
Meaning: Pool and vdev sizes now reflect expanded capacity.
Decision: If size didn’t change after all disks are replaced, you likely need to online/expand the devices, or your platform’s partitioning left space unused.
Task 19: Online and expand a leaf device (when needed)
cr0x@server:~$ sudo zpool online -e tank /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_VKJ0B555
Meaning: The -e attempts to expand the device to use available space.
Decision: Use this if you replaced with a larger disk and ZFS hasn’t picked up the size. If the disk is partitioned and the partition isn’t expanded, you must fix partitioning first (outside this article’s scope, but the point is: ZFS can’t use space it can’t see).
Fast diagnosis playbook (find the bottleneck before you blame ZFS)
When resilvering is slow, stuck, or making the box feel like it’s wading through syrup, don’t guess. Triage in a strict order. The goal is to decide: is this a normal slow resilver, a workload conflict, a failing sibling disk, or a transport/controller problem?
First: confirm ZFS is actually making progress
- Check the zpool status scan line: does “issued” increase over time? Does “resilvered” increase?
- Decision: If progress is steady, you mostly have a tuning/scheduling issue. If progress stalls, you have a failure.
cr0x@server:~$ sudo zpool status tank | sed -n '1,20p'
pool: tank
state: DEGRADED
scan: resilver in progress since Wed Dec 24 10:14:02 2025
2.61T scanned at 410M/s, 1.22T issued at 190M/s, 230G resilvered, 61.44% done, 02:31:20 to go
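If you want a record rather than a feeling, a small loop that logs the scan line once a minute makes “steady vs stalled” an objective call; the log path here is arbitrary:

# Append a timestamped scan line every 60s until the resilver is no longer reported.
while sudo zpool status tank | grep -q "resilver in progress"; do
  { date; sudo zpool status tank | grep -E "scanned|issued|resilvered"; } >> /var/tmp/tank-resilver.log
  sleep 60
done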
Second: look for transport pain (timeouts, resets, retries)
Transport issues mimic disk failure and ruin resilvers.
cr0x@server:~$ sudo dmesg -T | egrep -i "reset|timeout|link|frozen|SAS|phy|I/O error" | tail -n 40
[Wed Dec 24 11:02:41 2025] ata10: hard resetting link
[Wed Dec 24 11:02:46 2025] ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Wed Dec 24 11:03:10 2025] blk_update_request: I/O error, dev sdh, sector 987654321
Decision: If resets correlate with throughput drops and affect multiple disks, suspect cable/backplane/HBA. Fix that first. Replacing “another disk” won’t help.
Third: measure device-level saturation and queueing
cr0x@server:~$ iostat -x 5 3
Linux 6.8.0 (server) 12/24/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.10 0.00 6.20 18.50 0.00 63.20
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await aqu-sz %util
sdg 98.0 101024.0 0.0 0.00 22.10 1030.9 12.0 1408.0 11.20 2.18 99.0
sdh 96.0 98624.0 0.0 0.00 21.80 1027.3 10.0 1280.0 10.90 2.10 98.5
sdj 30.0 9216.0 0.0 0.00 15.50 307.2 80.0 81920.0 35.10 3.90 99.3
Meaning: Near 100% utilization and high await times. That may be normal during resilver on HDDs, but watch for one disk being an outlier.
Decision: If one disk has much higher await or errors, it’s the next candidate for failure—or it’s on a bad port.
Fourth: check ZFS-level latency and queue pressure
cr0x@server:~$ sudo zpool iostat -v tank 5 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 18.4T 10.7T 420 210 180M 92.0M
raidz2-0 18.4T 10.7T 420 210 180M 92.0M
sda - - 90 45 38.0M 20.0M
sdb - - 85 40 36.0M 18.5M
sdj - - 60 95 22.0M 38.0M
sdd - - 92 15 40.0M 6.0M
sde - - 93 15 44.0M 9.0M
---------- ----- ----- ----- ----- ----- -----
Decision: If the new disk is the write hotspot (common), that’s expected. If one old disk’s read bandwidth collapses, investigate that disk and its link.
Fifth: decide whether to throttle workloads or postpone
If this is a shared production box, resilver plus heavy random reads is a performance tax. Sometimes the correct answer is: throttle or move workloads. ZFS can’t negotiate with your peak traffic.
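Before anyone proposes “just crank the resilver tunables,” at least look at what they are currently set to. Parameter names vary by OpenZFS version, so treat these paths as examples of read-only inspection, not a complete or authoritative list:

# Read-only inspection of a couple of resilver-related module parameters (Linux).
grep -H . /sys/module/zfs/parameters/zfs_resilver_min_time_ms \
          /sys/module/zfs/parameters/zfs_vdev_scrub_max_active 2>/dev/null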
Common mistakes: symptoms → root cause → fix
1) Symptom: pool still DEGRADED after replacement
Root cause: You replaced the wrong leaf, used the wrong device path, or the new disk didn’t get attached to the right vdev.
Fix: Check zpool status -P. Confirm the replaced leaf now points to the new by-id path. During a replace, both old and new appear temporarily under a replacing vdev; if both are still listed after the resilver completes, you likely used attach instead of replace. Correct the topology intentionally.
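If the topology really does show an extra mirror leg because attach was used by mistake, the corrective verb is detach. A hedged sketch, to be run only after you have confirmed which leaf is the unintended one:

# Remove the unintentionally attached device from the mirror it created.
sudo zpool detach tank /dev/disk/by-id/UNINTENDED_LEAF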
2) Symptom: resilver “runs” but never finishes (ETA keeps growing)
Root cause: Read errors or link resets on surviving disks, or the new disk is intermittently dropping.
Fix: Review dmesg for resets/timeouts. Check error counters in zpool status. If errors are on multiple disks on one path, fix the transport.
3) Symptom: cannot replace ... with ... or “device is too small”
Root cause: Replacement disk capacity is slightly smaller (common across models/vendors) or partition is smaller than expected.
Fix: Use a disk with equal-or-larger capacity. If partitioned, ensure the replacement partition matches or exceeds the original.
4) Symptom: new disk shows ONLINE but immediately racks up write errors
Root cause: Bad new disk, bad SATA/SAS link, or backplane slot issues.
Fix: Swap the disk into a different bay if possible to isolate slot vs drive. Pull SMART extended tests if time allows. Treat repeated link resets as platform problems, not ZFS problems.
5) Symptom: after “upgrade to bigger disks,” pool size doesn’t increase
Root cause: Not all disks in the vdev are upgraded, autoexpand is off, leaf devices not expanded, or partitions weren’t resized.
Fix: Verify each leaf’s size, enable autoexpand if desired, and use zpool online -e where supported. Confirm partition sizes.
6) Symptom: scrub finds new checksum errors right after replacement
Root cause: You had latent corruption or a marginal disk/cable that only showed under heavy read.
Fix: Identify which vdev/disk shows errors. If errors are on one disk, replace it. If scattered, investigate controller/backplane. Don’t “clear” and move on.
7) Symptom: you replaced a disk and now the pool won’t import
Root cause: Multiple disks were removed/offlined, wrong disk pulled, or you crossed the redundancy boundary (especially RAIDZ1). Sometimes it’s also a controller presenting different IDs after reboot.
Fix: Stop. Preserve evidence. Use zpool import to inspect available pools and their state. Avoid force flags unless you understand the transaction groups and what you’re overriding.
Task 20: Inspect importable pools when things look wrong
cr0x@server:~$ sudo zpool import
pool: tank
id: 1234567890123456789
state: DEGRADED
status: One or more devices are missing from the system.
action: The pool can be imported despite missing or damaged devices. The fault tolerance of the pool may be compromised.
config:
tank DEGRADED
raidz2-0 DEGRADED
ata-WDC...VKJ0A123 ONLINE
ata-WDC...VKJ0A124 ONLINE
ata-WDC...VKJ0B555 ONLINE
ata-WDC...VKJ0A126 ONLINE
ata-WDC...VKJ0A127 UNAVAIL
Meaning: ZFS sees the pool and can likely import, but a device is still missing.
Decision: Do not force anything until you know which physical device is missing and why.
Three corporate mini-stories from the trenches
Incident caused by a wrong assumption: “Bay 12 is Bay 12 everywhere”
A mid-sized company ran a pair of storage servers in two racks, same chassis model, same number of bays, mirrored pools for a few services. A disk alert fired: zpool status showed a specific serial missing. The on-call engineer did the “normal thing”: asked facilities to pull “Bay 12” because the chassis labels said Bay 12.
Except one chassis had been serviced months earlier. During that service, the backplane wiring had been rerouted to accommodate a different HBA port mapping. The labels on the chassis still looked right; the mapping behind them had changed. “Bay 12” in the UI was not Bay 12 in the metal.
They pulled a healthy disk. The pool went from DEGRADED to “you have a problem.” Fortunately it was RAIDZ2, so the service stayed up, but the resilver plan got uglier. They reinserted the wrong disk quickly—yet ZFS still logged a flurry of errors due to the sudden removal under load.
The postmortem was short and painful: the root cause wasn’t ZFS. It was a human workflow that relied on a bay number without confirming serial. The fix was boring: require serial verification before removal, and keep a living mapping document per chassis (HBA port → expander → bay).
The bigger lesson: in production, “identical hardware” is a myth. Systems drift. People forget. Labels survive longer than truths.
Optimization that backfired: “Let’s speed up resilver by cranking concurrency”
Another team had a large RAIDZ pool on HDDs with mixed workloads: analytics reads during the day, backup writes overnight. They wanted faster resilvers to reduce risk windows. Someone found tunables that promised higher throughput by increasing parallelism and making the system “use the disks more.”
They changed a handful of parameters during business hours—because the pool was already degraded and “we need this done ASAP.” The resilver rate spiked… briefly. Then application latency spiked harder. The box began to log timeouts. Not just on the replaced disk; on several drives.
The real bottleneck wasn’t “ZFS not trying hard enough.” It was the HBA/expander path and a queue depth that became pathological under the new settings. With higher concurrency, the system turned transient latency into outright command timeouts, which ZFS interpreted as device errors. The pool spent cycles retrying and recovering, not copying data.
They rolled back the tunables and instead throttled the workload by moving batch jobs off the host during resilver. The resilver took longer than the best-case spike, but it finished reliably, and the pool didn’t accumulate new errors.
Optimization rule: if you don’t know what the bottleneck is, your “tuning” is just a new way to be wrong, faster.
Boring but correct practice that saved the day: persistent IDs and slow hands
A financial services shop ran ZFS mirrors for low-latency databases and RAIDZ2 for backups. Their runbooks were strict: every disk removal required the ZFS leaf name, the by-id path, the serial number, and a second person to verify the physical drive’s serial on the label. No exceptions, even at 3 a.m.
One night, a disk started throwing errors. The on-call followed the procedure, offlined the exact leaf, and the tech replaced the correct disk. Resilver started. Then, halfway through, another disk on the same backplane began showing link resets.
Here’s where the boring practice paid off: because they were watching the error counters and dmesg during the resilver, they recognized a shared-path issue early. They paused non-essential workloads and reseated the backplane cable during a controlled window. Errors stopped. Resilver completed cleanly.
Later analysis showed the second disk was fine. The cable wasn’t. If they’d “optimized” by swapping disks rapidly to chase alerts, they might have removed a healthy disk and crossed the redundancy line. Instead, they treated the system like a system: disks, links, and humans.
Reliable operations is mostly refusing to be in a hurry in exactly the moments you feel hurried.
Checklists / step-by-step plan
Checklist A: Standard failed disk replacement (any topology)
- Run zpool status -P; copy the exact leaf identifier and by-id path.
- Confirm scrub history and current errors. If checksum errors exist, treat as elevated risk.
- Check system logs (dmesg) for link resets/timeouts affecting multiple disks.
- Map serial → physical bay; confirm with chassis tools/labels.
- Offline the target disk (if present) with zpool offline.
- Replace the physical disk; confirm the new disk serial in the OS.
- Clear labels on the new disk: zpool labelclear -f.
- Run zpool replace using by-id paths.
- Monitor resilver with zpool status, zpool iostat, and OS metrics.
- After completion, confirm pool ONLINE and error counters stable.
- Schedule a scrub if this incident involved link resets, checksum errors, or multiple disks misbehaving.
- Close the loop: record old serial, new serial, date, and slot mapping changes.
Checklist B: “I might have pulled the wrong disk” containment plan
- Stop pulling disks. Put the removed disk back if possible.
- Run zpool status -P and take a snapshot of the output for incident notes.
- Check redundancy margin: RAIDZ1 with two missing disks is not a place for experiments.
- Use serial-based identification to reconcile what’s missing vs what’s physically removed.
- If the pool is still importable, do not export/import repeatedly “to see if it helps.” Stabilize first.
- Only after the system is stable, proceed with a single, verified replacement at a time.
Checklist C: Capacity upgrade (bigger disks)
- Confirm current vdev layout and ashift (so you don’t upgrade into old mistakes).
- Ensure replacement disks are genuinely larger (not “marketing larger”).
- Replace exactly one disk per vdev at a time, waiting for resilver each time.
- After all leaves in a vdev are replaced, verify expansion (zpool list -v).
- Use zpool online -e where needed and safe.
- Run a scrub after the final replacement and expansion.
Task 21: Baseline performance and latency before/after replacement (so you can spot regressions)
cr0x@server:~$ sudo zpool iostat -v tank 1 5
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 18.4T 10.7T 110 65 42.0M 18.0M
raidz2-0 18.4T 10.7T 110 65 42.0M 18.0M
sda - - 22 13 8.5M 3.0M
sdb - - 20 12 7.9M 2.8M
sdj - - 26 15 9.8M 4.0M
sdd - - 21 12 8.1M 4.1M
sde - - 21 13 7.7M 4.1M
---------- ----- ----- ----- ----- ----- -----
Meaning: Establishes a basic profile. If one disk becomes a laggard after replacement, this is where you’ll see it.
Decision: If the new disk underperforms massively compared to peers, suspect SMR quirks, firmware behavior, or a bad link/port.
Task 22: Run a post-replacement scrub (planned verification)
cr0x@server:~$ sudo zpool scrub tank
Meaning: Begins a full integrity check. On large pools this can take hours.
Decision: Schedule it during a quieter window if the workload is latency-sensitive; monitor for errors and performance impact.
Task 23: Check scrub results and make the “ship it” decision
cr0x@server:~$ sudo zpool status tank | sed -n '1,15p'
pool: tank
state: ONLINE
scan: scrub repaired 0B in 06:41:08 with 0 errors on Thu Dec 25 02:01:33 2025
Meaning: Scrub found and repaired nothing, with zero errors. This is the closest thing to closure.
Decision: Close the incident/change. If errors were found, keep digging; something is still wrong.
FAQ
1) Should I always offline a disk before pulling it?
Yes, when the disk is still present and you control the timing. Offlining reduces surprise I/O errors and makes the removal an intentional state change. If the disk already vanished, offlining won’t help, but it also won’t magically fix anything.
2) What’s the difference between “resilver” and “scrub”?
Resilver reconstructs data onto a replacement device to restore redundancy. Scrub verifies the entire pool’s checksums and repairs using redundancy where possible. Resilver is targeted recovery; scrub is whole-pool audit.
3) Can I replace multiple failed disks at once?
You can, but you usually shouldn’t. One-at-a-time keeps the system in a known state and avoids crossing redundancy limits accidentally. The exception is when you have spares pre-attached in some designs, or when a backplane failure forces multiple changes—then you operate in incident mode with explicit risk acceptance.
4) Should I use /dev/sdX paths in zpool replace?
No. Use /dev/disk/by-id/... (or the equivalent stable naming on your OS). /dev/sdX is an implementation detail that changes across boots, rescans, and sometimes just because the kernel felt like it.
5) My new disk is “the same model” but slightly smaller. Why?
Because vendors revise firmware, use different platters, or reserve different amounts of space. ZFS is strict: replacement must be equal or larger. Keep a few “known good and known large enough” spares rather than trusting marketing capacity.
6) Why is resilver speed so different from a simple disk-to-disk copy?
ZFS isn’t copying raw blocks from a single source. It’s reconstructing from redundancy, reading across vdev members, validating checksums, and competing with live workloads. Also, fragmented pools tend to resilver slower because metadata and blocks are scattered.
7) After replacement, can I just run zpool clear to reset errors?
You can clear after you’ve fixed the underlying cause and you want to watch for recurrence. Don’t clear as a substitute for investigation. Error counters are evidence, and evidence is useful.
8) Do I need to partition disks for ZFS?
Depends on your platform conventions and boot requirements. Many Linux deployments use whole disks; others use partitions for alignment or tooling. Operationally, consistency matters more than ideology: don’t mix approaches casually, and ensure replacements match the existing scheme.
9) What about hot spares—should I use them?
Hot spares can reduce time-to-resilver, which reduces risk. But they also can hide operational hygiene issues (you still need to replace the failed disk physically) and they can be consumed by the wrong failure mode (like a flaky expander causing multiple “failures”). Use them, but don’t let them replace monitoring and discipline.
10) Is RAIDZ1 acceptable if I replace disks quickly?
Sometimes, for low-value data and small pools. In production, RAIDZ1 plus large disks plus real workloads is a risk decision, not a technical trick. Disk replacement is exactly when you discover how much you regret that decision.
Close-out: practical next steps
If you want fewer ZFS disk replacement incidents—and fewer “why is the pool still angry?” mornings—make the safe workflow your default:
- Standardize on persistent device names in pool configs and runbooks.
- Require serial confirmation before any physical pull. Two-person verification if the data matters.
- Offline intentionally, replace with zpool replace, and monitor resilver like it’s a live migration—because it basically is.
- When resilver is slow, diagnose transport and sibling disks first. ZFS is usually reporting symptoms, not hallucinating problems.
- Finish with a scrub when the incident had any whiff of instability. It’s the audit trail your future self will thank you for.
Do this consistently and disk replacements become what they should be: maintenance, not theater.