You don’t replace multiple disks in ZFS because you’re bored. You do it because SMART is screaming, a vendor batch is aging out, or you’re mid-migration and the calendar is your enemy. The trap is thinking “it’s just swapping drives.” That’s how you turn a resilient pool into a long weekend of restore drills and uncomfortable meetings.
This is the production-minded way to replace multiple disks without punching holes in your redundancy. It’s opinionated. It’s command-heavy. And it’s built around one idea: never do anything that makes the pool temporarily less redundant than you think it is.
The mental model: what “safe order” really means
Replacing multiple disks in ZFS safely is less about which bay you open first and more about the redundancy budget you spend while you do it.
In a mirror, your redundancy is simple: one disk can die and you keep going. During a replacement, you’re intentionally taking a disk out and asking the surviving disk to do all reads while also feeding the resilver. You haven’t just reduced redundancy; you’ve increased stress on the remaining members. This is when “the other disk was fine yesterday” becomes “the other disk is now throwing read errors.” That’s not ZFS being dramatic. That’s physics and probability.
In RAIDZ (RAIDZ1/2/3), the redundancy budget is distributed. You can lose 1, 2, or 3 devices in a vdev and still read data (depending on level). But you don’t get to spend that budget casually: every additional degraded device increases both the risk of an uncorrectable read error and the time you’re running with no slack. And if you’re replacing multiple disks, it’s easy to accidentally run two operations that overlap in the same vdev. That’s the classic “I didn’t think those replacements would overlap” failure.
So, “safe order” means:
- One vdev at a time when possible, and never multiple degraded devices in the same vdev unless you’re already in emergency mode.
- Wait for resilver to complete and stabilize before proceeding to the next disk in that vdev.
- Use persistent device identifiers (by-id/by-path) so you replace the disk you think you’re replacing.
- Validate redundancy and health before and after each step with commands that tell the truth (a one-line health gate follows this list).
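That health gate can be as cheap as one command. zpool status -x reports only pools that have problems, so a healthy pool answers with a single "is healthy" line. A sketch, assuming the pool is named tank like the examples later in this article:
cr0x@server:~$ sudo zpool status -x tank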
One quote to keep you honest: “Hope is not a strategy.” It’s a staple of reliability culture rather than a precise citation, so treat it as a working principle, not an attribution.
Joke #1: A resilver is like a gym membership—easy to start, hard to finish, and it gets really expensive if you ignore it.
Interesting facts and historical context (short, useful)
- ZFS was born at Sun Microsystems in the mid-2000s to end the “filesystem vs volume manager” split and make storage more self-consistent.
- Copy-on-write (CoW) means ZFS never overwrites live blocks; it writes new blocks and flips pointers. That’s why a power loss mid-write doesn’t corrupt existing data structures the way older stacks could.
- “Resilver” is ZFS’s term for reconstruction after a device replacement; it’s usually not a full-device rebuild, because ZFS reconstructs only allocated data (and newer sequential resilver behavior makes that pass faster).
- RAIDZ is not “just RAID-5/6”; ZFS uses variable stripe width and integrates checksums and self-healing, changing failure behavior under read errors.
- 4K sector reality hurt a lot of early ZFS deployments; wrong ashift settings caused write amplification and slow resilvers that looked like “ZFS is slow.” It wasn’t. The configuration was.
- A large recordsize (the popular 1M setting; the default is 128K) is a performance trade: big records are great for streaming IO but can increase read/modify/write costs for random small updates on RAIDZ.
- Drive capacity growth changed rebuild risk; larger disks mean longer resilvers, which increases exposure time while degraded. That’s operational risk, not a theoretical one.
- Persistent device naming became a survival skill; Linux’s /dev/sdX renumbering after reboots has caused enough outages that “always use /dev/disk/by-id” is now tribal law.
Pre-flight: decide if you should even start today
Disk replacement is not a “quick maintenance” if any of these are true:
- The pool is already degraded and serving critical traffic.
- Resilver speed is historically slow on this system (busy pool, SMR disks, or saturated HBA).
- You’re near capacity (high fragmentation and a lack of free space make everything worse).
- You can’t identify disks reliably (no mapping from serial to bay).
- You don’t have a recent scrub with clean results.
Be boring. Schedule a window. Ensure you have console access. If this is production, you want an escape hatch: reduced load, maintenance mode, or the ability to fail over traffic elsewhere.
Core principles: rules you follow even under pressure
1) Replace disks sequentially within a vdev
Do not “parallelize” replacements in the same RAIDZ vdev or mirror. ZFS can handle it in some cases, but you’re turning a controlled operation into a race between resilver completion and the next failure.
2) Prefer explicit zpool replace over “pull and pray”
Physically swapping a disk and hoping ZFS “figures it out” is a method. It’s just not a good one. Use zpool replace with stable identifiers. Validate.
3) Do not clear errors to feel better
zpool clear is for clearing a known transient error after fixing the cause. Using it as emotional support hides evidence you need for good decisions.
4) Every step is: observe → act → observe
Before each disk: verify health. After each command: verify state. After each resilver: verify errors and run a scrub when appropriate.
5) Understand what “safe” means for your topology
A 2-way mirror tolerates one failure. A RAIDZ2 vdev tolerates two device losses but can still die from read errors during resilver if the surviving disks can’t deliver correct data for reconstruction.
Practical tasks (commands, outputs, and decisions)
These are real operator moves: command, sample output, what it means, and what you do next. Assume pool name tank. Adapt names to your environment.
Task 1: Confirm current pool health and locate the damaged device
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Replace the device or use 'zpool clear' to clear the error.
scan: scrub repaired 0B in 03:12:10 with 0 errors on Wed Dec 18 02:12:11 2025
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3Z... ONLINE 0 0 0
ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3Z... ONLINE 0 0 0
ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3Z... FAULTED 12 0 0 too many errors
ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3Z... ONLINE 0 0 0
ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3Z... ONLINE 0 0 0
ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3Z... ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
tank/data/dbfile.dat
What it means: The pool is degraded; a specific by-id device is faulted with read errors. There are permanent errors, meaning data was corrupted or could not be reconstructed at some point.
Decision: Stop and assess impact. If you see permanent errors, decide whether to restore affected files from backup before you proceed. Replacing the disk will not magically fix already-corrupted application data.
Task 2: Confirm the topology and redundancy level
cr0x@server:~$ sudo zpool status tank
pool: tank
state: DEGRADED
scan: scrub repaired 0B in 03:12:10 with 0 errors on Wed Dec 18 02:12:11 2025
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
...
What it means: This is a RAIDZ2 vdev. You can survive one more device failure in this vdev, but you don’t get to be casual about it.
Decision: Replace one disk, wait for full resilver completion, verify, then proceed. No batching.
Task 3: Map the by-id device to a physical bay/serial
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep S3ZABC123456 | head
lrwxrwxrwx 1 root root 9 Dec 26 08:11 ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZABC123456 -> ../../sdf
lrwxrwxrwx 1 root root 10 Dec 26 08:11 ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZABC123456-part1 -> ../../sdf1
What it means: The failing drive is currently /dev/sdf. That can change after reboots; the by-id name is your anchor.
Decision: Use by-id identifiers in all ZFS commands. Use the serial to locate the bay (via chassis LEDs, HBA management, or your DC runbook).
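If you want a second, independent confirmation of the serial before anyone touches a bay, lsblk can print it next to the kernel name. A sketch against the sdf mapping above:
cr0x@server:~$ sudo lsblk -o NAME,SERIAL,MODEL,SIZE /dev/sdf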
Task 4: Check SMART health before you touch anything
cr0x@server:~$ sudo smartctl -a /dev/sdf | egrep "SMART overall-health|Reallocated_Sector_Ct|Media_Wearout_Indicator|UDMA_CRC_Error_Count"
SMART overall-health self-assessment test result: FAILED!
5 Reallocated_Sector_Ct 0x0033 001 001 010 Pre-fail Always - 9812
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
What it means: The disk is genuinely failing (massive reallocations). CRC errors are zero, so it’s not just a cabling issue.
Decision: Replace the disk. If this were only CRC errors, you’d check cabling/backplane first and avoid unnecessary disk churn.
Task 5: Verify pool free space and fragmentation risk
cr0x@server:~$ sudo zfs list -o name,used,avail,refer,mountpoint -r tank | head
NAME USED AVAIL REFER MOUNTPOINT
tank 68.1T 3.2T 128K /tank
tank/data 66.9T 3.2T 66.9T /tank/data
tank/logs 1.2T 3.2T 1.2T /tank/logs
What it means: Only ~3.2T free on a ~71T pool. That’s tight. Resilver and scrub performance can degrade; allocation becomes more complex.
Decision: If you can, free space before starting. If you can’t, accept that resilver will take longer and the exposure window grows.
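zfs list shows dataset usage; the pool-level capacity and fragmentation numbers that actually shape resilver behavior come from zpool list. A sketch, same pool:
cr0x@server:~$ sudo zpool list -o name,size,allocated,free,fragmentation,capacity,health tank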
Task 6: Verify the last scrub and decide whether to scrub first
cr0x@server:~$ sudo zpool status tank | sed -n '1,12p'
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
scan: scrub repaired 0B in 03:12:10 with 0 errors on Wed Dec 18 02:12:11 2025
What it means: A scrub ran recently and found no errors. Good: you have recent evidence that redundancy worked.
Decision: Proceed with replacement. If the last scrub is old or had errors, you may scrub first (if the pool isn’t too fragile) to surface latent issues before you stress it with a resilver.
Task 7: Offline the device (when you want a controlled removal)
cr0x@server:~$ sudo zpool offline tank ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZABC123456
cr0x@server:~$ sudo zpool status tank | grep -A2 S3ZABC123456
ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZABC123456 OFFLINE 0 0 0
What it means: You’ve told ZFS “this disk is intentionally out.” This avoids confusion if the disk disappears mid-IO.
Decision: Offline before physical removal unless you’re in a forced hot-swap scenario and the disk is already gone.
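If plans change before the disk is physically pulled, the offline is reversible: zpool online puts the device back in service and ZFS catches it up on anything it missed. Same by-id name as above:
cr0x@server:~$ sudo zpool online tank ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZABC123456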
Task 8: Replace the disk using persistent identifiers
cr0x@server:~$ sudo zpool replace tank ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZABC123456 ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZNEW987654
cr0x@server:~$ sudo zpool status tank | sed -n '1,25p'
pool: tank
state: DEGRADED
scan: resilver in progress since Fri Dec 26 08:33:02 2025
2.11T scanned at 1.35G/s, 410G issued at 262M/s, 68.1T total
410G resilvered, 0.59% done, 3 days 02:11:22 to go
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
...
replacing-2 DEGRADED 0 0 0
ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZABC123456 OFFLINE 0 0 0
ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZNEW987654 ONLINE 0 0 0 (resilvering)
What it means: ZFS created a replacing vdev and started resilvering to the new device. ETA is long: that’s exposure time.
Decision: Don’t touch another disk in this vdev until this resilver finishes and the pool returns to ONLINE (or at least the vdev returns to full redundancy).
Task 9: Monitor resilver progress and watch for read errors
cr0x@server:~$ watch -n 10 sudo zpool status tank
Every 10.0s: sudo zpool status tank
pool: tank
state: DEGRADED
scan: resilver in progress since Fri Dec 26 08:33:02 2025
9.82T scanned at 1.12G/s, 1.84T issued at 214M/s, 68.1T total
1.84T resilvered, 2.70% done, 2 days 20:05:13 to go
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
...
ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZOLD111222 ONLINE 18 0 0
What it means: One of the remaining disks is now showing read errors. That’s your early warning that resilver stress is surfacing weakness.
Decision: Pause nonessential workloads, reduce IO, and consider proactive replacement planning. If errors climb or another disk degrades, you may need to stop and move to emergency response (backups, replication, restore readiness).
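To separate “the pool is busy” from “one disk is struggling,” watch per-device IO while the resilver runs. A sketch; the trailing 5 is a refresh interval in seconds, and newer OpenZFS releases also accept -l for latency columns:
cr0x@server:~$ sudo zpool iostat -v tank 5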
Task 10: Check kernel logs for link resets and transport errors
cr0x@server:~$ sudo dmesg -T | egrep -i "ata[0-9]+:|link reset|I/O error|blk_update_request" | tail -n 12
[Fri Dec 26 09:01:12 2025] ata9.00: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0x6 frozen
[Fri Dec 26 09:01:12 2025] ata9.00: failed command: READ FPDMA QUEUED
[Fri Dec 26 09:01:13 2025] ata9: hard resetting link
[Fri Dec 26 09:01:18 2025] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Fri Dec 26 09:01:19 2025] blk_update_request: I/O error, dev sdi, sector 1293849216 op 0x0:(READ) flags 0x0 phys_seg 8 prio class 0
What it means: The OS is seeing link resets and read IO errors on a specific disk. This might be a disk failing, or it might be cabling/backplane/HBA issues.
Decision: If CRC counts rise in SMART and logs show link resets, check physical connectivity before you declare the drive dead.
Task 11: Confirm the new disk’s ashift / sector alignment expectations (sanity check)
cr0x@server:~$ sudo zdb -C tank | egrep -n "ashift|vdev_tree" | head -n 8
40: vdev_tree:
56: ashift: 12
57: asize: 12003003719680
What it means: ashift: 12 means 4K sectors (2^12). Good for modern disks. If you somehow had ashift: 9 with 4K drives, performance would suffer and resilvers would be slower.
Decision: If ashift is wrong, don’t “fix it” by replacing disks; ashift is set per vdev at creation. Fix requires vdev recreation/migration. Plan it, don’t improvise.
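What you can sanity-check per disk is the sector size the replacement reports, so a 512e/4Kn mismatch surprises you before the resilver rather than during it. A sketch using the sample new-disk name from Task 8:
cr0x@server:~$ sudo lsblk -o NAME,PHY-SEC,LOG-SEC /dev/disk/by-id/ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZNEW987654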
Task 12: When resilver finishes, verify pool is healthy and error-free
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: ONLINE
scan: resilvered 68.1T in 2 days 21:44:02 with 0 errors on Mon Dec 29 06:17:04 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3Z... ONLINE 0 0 0
ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3Z... ONLINE 0 0 0
ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZNEW987654 ONLINE 0 0 0
...
What it means: Pool is ONLINE, resilver completed, and errors are zero. This is the state you want before moving to the next disk replacement.
Decision: Only now do you proceed to the next device. If you’re rotating a batch, replace the next disk in the same vdev one-by-one with full completion in between.
Task 13: Start a scrub after a replacement cycle (validation step)
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status tank | sed -n '1,10p'
pool: tank
state: ONLINE
scan: scrub in progress since Mon Dec 29 06:22:11 2025
1.14T scanned at 992M/s, 12.3G issued at 10.7M/s, 68.1T total
0B repaired, 0.01% done, 19:05:41 to go
What it means: Scrub is running. Note the difference between scan rate and issued rate; issued rate reflects actual IO performed.
Decision: Let it complete. If scrub surfaces errors, treat them as evidence of a broader problem (another weak disk, controller, cabling, or latent corruption).
Task 14: Use zpool events to see what actually happened
cr0x@server:~$ sudo zpool events -v | tail -n 20
TIME CLASS
Dec 29 2025 06:17:04.123456789 resilver.finish
pool = tank
vdev_path = /dev/disk/by-id/ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZNEW987654
resilvered_bytes = 74882345013248
Dec 29 2025 06:22:11.987654321 scrub.start
pool = tank
What it means: You have an auditable timeline. This matters when someone asks “when did we start stressing the pool?”
Decision: Save this output in your change ticket. Future-you likes evidence.
Safe replacement order by vdev topology
Mirrors (2-way, 3-way)
Order rule: replace one leaf at a time, wait for resilver completion, then move to the next mirror. If you have many mirrors (common in “striped mirrors”), you can replace one disk per mirror sequentially, but don’t degrade multiple mirrors simultaneously unless you can tolerate reduced redundancy across the pool.
Why: When a mirror member is missing, the remaining disk is your single point of failure for that mirror’s data. Also, that surviving disk is now doing all reads, and it’s feeding reconstruction. That’s the moment hidden read errors show up.
Recommended order:
- Pick one mirror vdev.
- Replace the worst disk (SMART failing, errors, or oldest batch).
- Wait for resilver to finish, verify ONLINE, check errors (see the wait sketch after this list).
- Optionally scrub after a batch of replacements.
- Move to the next mirror vdev.
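On OpenZFS 2.0 and newer, “wait for resilver to finish” can be a blocking command instead of a person staring at a terminal, which makes “one mirror at a time” easy to enforce in a runbook script. A sketch, assuming the pool is named tank; older releases need a zpool status polling loop instead:
cr0x@server:~$ sudo zpool wait -t resilver tank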
RAIDZ1
Order rule: never have two concurrent replacements in the same RAIDZ1 vdev. RAIDZ1 has a thin safety margin: one missing disk is okay; a second failure is game over.
Operational reality: RAIDZ1 on large disks is a risk trade. If you’re doing multi-disk replacement on RAIDZ1, your “safe order” is basically “one disk, full resilver, verify, repeat”—and you should still be slightly nervous. That’s healthy.
RAIDZ2 / RAIDZ3
Order rule: still do replacements sequentially within a vdev. Yes, RAIDZ2 can tolerate two failures, but resilver is when weak disks reveal themselves and when your mean time to a second failure shrinks.
Recommended order:
- Replace any disk that is faulted or throwing read/write errors first.
- Next, replace disks with worsening SMART indicators (reallocations, pending sectors, media wear); a quick triage loop follows this list.
- Leave “healthy but old” disks for last, and stop if you observe new errors on surviving members during resilver.
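A minimal triage sketch for building that order, assuming SATA disks whose by-id names start with ata- (adjust the glob for SAS or NVMe). It only prints the wear-related counters; the sorting and the judgment stay with you:
cr0x@server:~$ for d in /dev/disk/by-id/ata-*; do case "$d" in *-part*) continue;; esac; echo "== $d"; sudo smartctl -A "$d" | egrep "Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count"; done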
Special vdevs: SLOG, L2ARC, special allocation class
These aren’t data vdevs in the same way, but they can still wreck your day.
- SLOG (separate intent log): losing it doesn’t lose committed data, but it can cause a performance cliff and client timeouts. Replace carefully, validate sync workload behavior.
- L2ARC: safe to remove/replace from a data integrity standpoint. But if you remove it under load, you may trigger a storm of cache-miss reads as hot data falls back to the main pool disks (a removal sketch follows this list).
- Special vdev: treat like data. Losing it can be catastrophic depending on configuration, because metadata and small blocks may live there. Replacement order is “never degrade it beyond redundancy,” same as mirrors/RAIDZ.
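For the SLOG and L2ARC cases above, the mechanics differ slightly from data disks: cache and log devices are detached with zpool remove, and a mirrored log member is replaced like any other leaf. A sketch with hypothetical device names, not part of the tank example:
cr0x@server:~$ sudo zpool remove tank ata-INTEL_SSDSC2BB480G7_CACHE0001
cr0x@server:~$ sudo zpool replace tank ata-INTEL_SSDSC2BB480G7_LOG0001 ata-INTEL_SSDSC2BB480G7_LOG0002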
Joke #2: Hot-swapping the wrong disk is a great way to learn how much you trust your backups.
Checklists / step-by-step plan
Checklist A: before replacing any disk (the “don’t be clever” list)
- Confirm pool topology and redundancy per vdev (zpool status).
- Confirm last scrub results and date (zpool status).
- Check free space (zfs list) and plan for a longer resilver if space is tight.
- Collect baseline performance and error counters (SMART, dmesg for link errors); see the capture sketch after this list.
- Map the by-id device to bay/serial and label the physical drive you will pull.
- Confirm you have a console/remote-hands path in case the host reboots or hangs.
- Ensure you have a rollback plan: reduce load, fail over, or pause ingestion.
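A baseline-capture sketch for that checklist. The file paths and names are arbitrary choices, not a convention from this article; the point is an honest “before” picture you can diff against later:
cr0x@server:~$ sudo zpool status -v tank | tee ~/pre-swap-zpool-$(date +%F).txt
cr0x@server:~$ sudo smartctl -a /dev/disk/by-id/ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZABC123456 | tee ~/pre-swap-smart-$(date +%F).txt
cr0x@server:~$ sudo dmesg -T | tail -n 200 | tee ~/pre-swap-dmesg-$(date +%F).txt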
Checklist B: one-disk replacement cycle (repeatable, safe)
- Observe: zpool status -v and identify the exact leaf vdev name to replace.
- Offline (optional but recommended): zpool offline that leaf.
- Physical swap: remove/insert the drive; confirm the OS sees the new by-id entry.
- Replace: zpool replace pool old-by-id new-by-id.
- Monitor: watch zpool status for progress and errors.
- Validate: after completion, ensure the pool returns to ONLINE and errors remain zero.
- Record: capture zpool events -v snippets and SMART for the new drive (baseline). A condensed command sequence follows this list.
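The same cycle as a condensed sequence, using the sample by-id names from the tasks above. Substitute your own identifiers; the physical swap happens between the offline and the replace, and zpool wait needs OpenZFS 2.0 or newer:
cr0x@server:~$ sudo zpool status -v tank
cr0x@server:~$ sudo zpool offline tank ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZABC123456
cr0x@server:~$ sudo zpool replace tank ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZABC123456 ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZNEW987654
cr0x@server:~$ sudo zpool wait -t resilver tank
cr0x@server:~$ sudo zpool status -v tank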
Checklist C: replacing a whole batch (when multiple drives are suspect)
This is where people get cocky and ZFS reminds them it’s a filesystem, not a therapist.
- Sort disks by urgency: faulted > read/write errors > SMART pre-fail > old batch.
- Replace one disk at a time per vdev.
- After each resilver, re-check the other disks’ SMART and ZFS error counters; stress changes the truth.
- After 2–3 replacements (or after any incident), run a scrub during a controlled window.
- If you observe new read errors on survivors, stop the batch and reassess. “Finish the schedule” is not an SRE principle.
Fast diagnosis playbook
When a disk replacement goes sideways, you need to answer one question quickly: are we bottlenecked by the disks, the controller/links, or the workload? Here’s the order that gets you to a useful answer fast.
1st: ZFS-level truth (pool state, errors, scan/resilver status)
- Check: zpool status -v
- Look for: DEGRADED/FAULTED states, growing READ/WRITE/CKSUM counters, “too many errors,” and a resilver ETA that keeps growing instead of stabilizing.
- Decision: If errors are increasing on multiple devices in the same vdev, stop making changes and reduce IO. Prepare for restore/replication action.
2nd: OS transport and controller signals (dmesg)
- Check: dmesg -T for link resets, timeouts, “frozen,” “I/O error.”
- Look for: repeated resets on the same port/device, which often implicate cabling/backplane/HBA firmware rather than the disk itself.
- Decision: If multiple disks on the same HBA show resets, pause replacements and fix the transport layer first.
3rd: Per-disk health and error types (SMART)
- Check: smartctl -a for reallocations, pending sectors, CRC counts, media wear.
- Look for: rising CRC errors (cabling), rising reallocated/pending sectors (media), or “FAILED” overall health.
- Decision: Replace disks with media issues. Fix cabling/backplane for CRC issues.
4th: Workload pressure (is the pool being asked to do too much?)
- Check: active IO patterns (large sequential reads, small random writes, heavy sync writes).
- Look for: resilver issued rate collapsing while scan rate stays high; applications timing out; latency spikes.
- Decision: throttle or pause nonessential IO. In extreme cases, temporarily stop ingestion, reschedule batch replacements, or fail over traffic.
Common mistakes: symptom → root cause → fix
Mistake 1: “We replaced disk A, so now we can replace disk B immediately.”
Symptom: pool goes from DEGRADED to UNAVAIL during the second replacement; resilver never finishes.
Root cause: overlapping degraded state in the same vdev (mirror/RAIDZ) reduced redundancy below what the vdev can tolerate.
Fix: replace one disk per vdev at a time; wait for resilver completion and ONLINE state before proceeding.
Mistake 2: Resilver is painfully slow and ETA keeps growing
Symptom: “3 days remaining” becomes “9 days remaining,” issued throughput is low, workloads time out.
Root cause: pool is near full, heavy fragmentation, high concurrent IO, SMR disks, or a saturated controller.
Fix: reduce workload, free space, schedule replacements during low IO, and validate you are not mixing SMR drives into a rebuild-heavy environment.
Mistake 3: New disk shows up as ONLINE but the pool still reports errors
Symptom: zpool status shows non-zero READ/CKSUM counters on surviving disks after resilver.
Root cause: latent read errors during resilver; transport issues; or existing corruption discovered during reconstruction.
Fix: review zpool status -v for permanent errors, run a scrub, check SMART and dmesg. If permanent errors exist, restore affected data from backup.
Mistake 4: “We used /dev/sdX because it was obvious.”
Symptom: wrong disk replaced; the intended bad disk remains in the pool; a healthy disk is now missing.
Root cause: device renumbering; /dev/sdX is not stable across reboots and hotplug events.
Fix: always use /dev/disk/by-id or /dev/disk/by-path in ZFS commands; confirm serial matches the physical drive.
Mistake 5: Clearing errors early to make dashboards green
Symptom: errors vanish, then reappear later; you can’t prove when the issue started.
Root cause: zpool clear erased the counters without fixing the underlying device or link issue.
Fix: only clear errors after corrective action and after capturing evidence (status output, SMART, logs).
Mistake 6: Replacing disks while a scrub is running
Symptom: IO saturation; resilver slows to a crawl; increased timeouts; client impact.
Root cause: scrub and resilver compete for IO; both are read-heavy and stress already-aging disks.
Fix: don’t overlap unless you’re intentionally doing it for a reason (rare). Pause scrub if possible; schedule scrub after resilver.
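On OpenZFS releases that support scrub pausing (0.8 and newer), the pause is one command; running zpool scrub again later resumes it:
cr0x@server:~$ sudo zpool scrub -p tank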
Mistake 7: Assuming a “checksum error” means the disk is bad
Symptom: rising CKSUM errors; disks look healthy in SMART.
Root cause: cabling/backplane/controller corruption; DMA errors; intermittent transport issues.
Fix: inspect and reseat cables, swap ports, update HBA firmware, and watch CRC counters in SMART.
Three corporate-world mini-stories
Mini-story 1: The incident caused by a wrong assumption
They had a striped-mirror pool powering a build farm and artifact store. Nothing exotic, just lots of IO and a respectable amount of “we’ll clean it up later.” A disk started flapping. Someone did the right thing and marked it for replacement.
The wrong assumption was simple: “All mirrors are independent, so we can replace one disk in mirror A and one disk in mirror B at the same time.” Which sounds reasonable until you notice that the workload is not independent. The artifact store was hammering the pool, and the two remaining mirror members were now doing double duty: serving reads and feeding resilvers.
Within hours, a second disk in one of the mirrors began reporting read errors. It wasn’t brand-new, and it wasn’t happy about being the last source of truth for half the data while under peak CI load. The pool didn’t die instantly; it degraded in a slow-motion way that is worse because everyone keeps thinking they can “just finish the rebuild.”
The fix wasn’t clever: they paused CI, drained queues, and finished one resilver at a time. Then they wrote a rule into their change process: “Only one mirror degraded at once unless we’re failing over.” The lesson wasn’t that ZFS is fragile. The lesson was that workload coupling makes “independent vdevs” a lie when you’re saturating the same spindles.
Mini-story 2: The optimization that backfired
A team wanted faster resilvers, so they tuned aggressively. Higher concurrency, more threads, and they let replacements run during business hours because “the pool can handle it.” The metrics looked fine on day one. On day two, the database latency alarms started. On day three, the application team showed up with graphs and pitchforks.
The backfire was not mysterious: resilver is heavy on reads, and those reads compete with everything else. By “optimizing” resilver speed without controlling workload, they created a perfect storm: random reads from applications plus sequential-ish reads from resilver plus sync writes from a journaling workload. The pool started living at high queue depths, and latency became spiky and unpredictable.
Then the second-order effect hit: one marginal disk started timing out, which ZFS interpreted as errors. Now the pool wasn’t just slow—it was degraded. The team had optimized themselves into a higher-risk state while customers were active.
What saved them was backing out the “optimization.” They scheduled resilvers at night, reduced daytime IO, and used operational controls (rate limiting via workload management rather than trying to outsmart the storage). Faster is good. Fast at the wrong time is how you manufacture incidents.
Mini-story 3: The boring but correct practice that saved the day
Another environment: a compliance-heavy shop with strict change control. It was the kind of place where you roll your eyes at the paperwork until something breaks. They had a RAIDZ2 pool with a known aging disk batch, and they planned a rolling replacement over weeks.
The boring practice: every single disk had its serial mapped to a bay, written down, and cross-checked by remote hands with a second person reading the serial off the label. No “it’s probably slot 7.” They also captured zpool status, SMART summaries, and zpool events outputs before and after each replacement.
Midway through the batch, a disk failed hard. Not the one being replaced—another one. In many shops, that’s where panic starts. Here, they looked at their records and saw that the pool had completed scrubs cleanly, the last resilver was error-free, and the new drives had clean SMART baselines. That evidence let them make a calm decision: proceed with a single emergency replacement, pause the rest of the batch, and schedule a scrub once redundancy was restored.
They never had a customer-visible incident. The paperwork didn’t prevent failure. It prevented the human error that turns a failure into an outage. Boring is a feature.
FAQ
1) Can I replace multiple disks at once if I have RAIDZ2?
You can, but you shouldn’t. RAIDZ2 tolerates two missing devices, but overlapping resilvers multiply risk and exposure time. Replace one disk per vdev, wait for completion.
2) Should I offline a disk before pulling it?
Yes, when you can. Offlining makes the removal intentional and reduces surprise behavior. If the disk is already gone, proceed with zpool replace using the missing device’s identifier.
3) What’s the difference between scrub and resilver?
A scrub validates data by reading and verifying checksums across the pool. A resilver reconstructs data onto a replacement device. Both stress disks; don’t overlap them casually.
4) Why does ZFS show “permanent errors” even after I replaced the disk?
Because some data was unreadable or corrupted at the time it was needed. Replacing the disk restores redundancy; it doesn’t retroactively fix corrupted application files. Restore those files from backup or replicas.
5) Should I use /dev/sdX in zpool replace?
No. Use /dev/disk/by-id or the exact leaf vdev name shown in zpool status. /dev/sdX can and will change.
6) Resilver is slow. Is that a sign something is wrong?
Sometimes. Slow resilver can be normal under load, on near-full pools, or with slower media. It’s suspicious when it becomes progressively slower or errors begin to accumulate during the process.
7) Do I need to run a scrub after every single disk replacement?
Not always after every disk, but you should run one after a replacement batch or after any event with errors. The goal is validation, not ritual.
8) What if the new disk is larger than the old disk?
ZFS will generally use the smaller size until all members in the vdev are upgraded and you expand (depending on features and configuration). Don’t assume instant capacity gains.
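If you do want the extra capacity once every member of the vdev has been upgraded, the usual levers are the autoexpand pool property and an explicit expand per device. A sketch, reusing the sample new-disk name; exact behavior depends on your OpenZFS version and partitioning:
cr0x@server:~$ sudo zpool set autoexpand=on tank
cr0x@server:~$ sudo zpool online -e tank ata-SAMSUNG_MZ7LM1T9HMLP-00003_S3ZNEW987654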
9) What if the new disk is smaller than the old one?
Don’t do it. Even tiny size differences can prevent replacement. Match or exceed size, and prefer same model class when possible to avoid performance mismatches.
10) Can I replace disks in different vdevs concurrently?
Sometimes, but it’s still risky because the pool’s overall IO budget is shared. If you must, do it only when you can reduce workload and you have strong operational visibility.
Next steps you can actually do
- Write down your device mapping: by-id ↔ bay ↔ serial. Do it before the next incident forces improvisation.
- Decide your replacement policy: one-at-a-time per vdev, with explicit “stop conditions” (new errors on surviving disks, link resets, performance collapse).
- Automate evidence capture: save zpool status -v, SMART summaries, and zpool events -v to your change record.
- Practice on a non-critical pool: run a controlled replacement so the first time you do it isn’t during a page.
- After a batch, scrub and review: treat the scrub report as your “we’re stable again” certificate.
Replacing multiple disks in ZFS is safe when you respect the redundancy budget and the time dimension. Replace one disk, finish the resilver, verify health, then move on. The order isn’t a superstition. It’s how you stop maintenance from turning into incident response.