ZFS offline/online: Using Maintenance Mode Without Breaking Redundancy

The most dangerous storage work is the work that feels routine: “Just take the disk out for a minute.”
ZFS will happily let you do that—until it doesn’t, and your “minute” becomes a multi-day incident with an executive on the bridge asking if RAID is “still a thing.”

This is about using zpool offline/zpool online as a deliberate maintenance mode—predictably, repeatably—without accidentally turning redundancy into a rumor.
We’ll focus on the mechanics, the failure modes, and the decisions you make from real outputs, not vibes.

Offline/online without superstition: the mental model

ZFS doesn’t do “RAID” the way your hardware controller did in 2012. It does vdevs (virtual devices), and redundancy lives at the vdev level.
A pool is only as healthy as its least redundant vdev, because any vdev loss is pool loss. That’s the headline.

When you run zpool offline, you’re not removing a disk from the pool’s history. You’re telling ZFS:
“Stop trusting this leaf device for I/O right now.” It’s an operational state change, not a structural change.
Contrast that with zpool detach (mirror only), which changes topology by removing a device from a mirror vdev.

What “offline” really means

  • Offline: device is intentionally unavailable. ZFS will not use it. The vdev/pool may become DEGRADED.
  • Faulted: ZFS decided the device is bad (too many errors, timeouts, etc.). You may need to clear/replace.
  • Removed: device disappeared (cable, enclosure, path, HBA, multipath). Sometimes it comes back; sometimes it’s lying.
  • Unavailable: ZFS can’t open the device path. This includes renames, missing by-id, or enclosure shenanigans.
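
A quick way to surface any leaf that is not ONLINE, across every imported pool (the device line below is illustrative, not from a real pool):

cr0x@server:~$ zpool status | grep -E 'OFFLINE|FAULTED|REMOVED|UNAVAIL|DEGRADED'
            ata-EXAMPLE_DISK_SERIAL              FAULTED      0     0     0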

Offline vs replace vs detach: pick the right verb

Here’s the rule I enforce in production: use the smallest hammer that reliably gets you the next safe state.

  • Use zpool offline to take a disk out of service temporarily, especially before physical work.
  • Use zpool replace when you are swapping media and want ZFS to treat the new disk as successor.
  • Use zpool detach only on mirrors, and only when you really mean to reduce mirror width.
  • Avoid “yanking” without offlining first unless you’re already in emergency mode (disk is dead or the bus is on fire).
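
The same three verbs as commands, for reference; pool and device names are placeholders, and the detach form only exists for mirror vdevs:

cr0x@server:~$ sudo zpool offline tank ata-EXAMPLE_OLD                   # pause: temporary, reversible
cr0x@server:~$ sudo zpool replace tank ata-EXAMPLE_OLD ata-EXAMPLE_NEW   # swap: resilver onto the successor
cr0x@server:~$ sudo zpool detach tank ata-EXAMPLE_OLD                    # topology change: mirrors only, permanent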

Redundancy isn’t a philosophical concept. It’s arithmetic: how many devices can be missing in the same vdev before you lose data.
Mirrors generally survive one missing disk (per mirror). RAIDZ1 survives one. RAIDZ2 survives two. RAIDZ3 survives three.
But maintenance often creates a second fault: you offline one disk, then discover another was silently dying. Congratulations, you’ve invented a postmortem.

One quote worth keeping on your terminal: “Hope is not a strategy” — a line often credited to General Gordon R. Sullivan.
Storage teams love hope. ZFS punishes it.

Joke #1: If you’ve never offlined the wrong disk, you either run perfect labeling—or you haven’t done enough maintenance yet.

Interesting facts and history that actually matter

  1. ZFS originated at Sun Microsystems in the mid-2000s as a reaction to split-brain storage stacks (volume manager + filesystem) that couldn’t coordinate integrity.
  2. The name “pool” was a deliberate change in thinking: you don’t manage filesystems on disks; you manage storage capacity and redundancy as a shared resource.
  3. ZFS checksums everything (metadata and data). That’s why a degraded pool is not the same as a risky pool—until you lose the last good copy.
  4. Scrub existed to fight “bit rot” before it was trendy. In ZFS, scrub is a proactive read/verify/repair using redundancy; it’s not a filesystem “fsck.”
  5. Resilver is not a scrub. Resilver is reconstruction of missing replica(s) after replacement/offline/online; scrub is full verification across the pool.
  6. By-id naming became a survival tactic as Linux device naming (/dev/sdX) proved too fluid under hotplug, multipath, and HBA resets.
  7. Early ZFS admins learned the hard way about write caches: disks lying about flushes can defeat a filesystem, even one with transactional semantics.
  8. RAIDZ rebuild behavior differs from traditional RAID5/6: ZFS resilver only copies allocated blocks (plus metadata), which can be dramatically faster—unless your pool is very full.
  9. Device “REMOVED” states increased with SAS expanders and JBODs: a transient enclosure glitch can look like a disk failure, and treating it as such can compound risk.

These aren’t trivia questions. Each one changes how you respond when a disk goes away and someone asks,
“Can we just offline it and keep going?”

Practical tasks: commands, outputs, and decisions

Below are concrete tasks you’ll do during a maintenance window. For each: the command, what the output means, and what decision to make next.
Examples assume a pool named tank. Swap names accordingly.

Task 1: Confirm the pool topology before you touch anything

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
config:

        NAME                                    STATE     READ WRITE CKSUM
        tank                                    ONLINE       0     0     0
          raidz2-0                              ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AA    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AB    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AC    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AD    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AE    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AF    ONLINE       0     0     0

errors: No known data errors

Meaning: You have a single RAIDZ2 vdev. You can tolerate two missing disks in this vdev; a third loss kills the pool.
The by-id names look stable and specific. Good.

Decision: If you plan to offline a disk, you are consuming one redundancy “life.” Verify there are no existing degraded/faulted members first.
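
A one-line pre-flight that reports only unhealthy pools; the response below is what you want to see before proceeding:

cr0x@server:~$ zpool status -x
all pools are healthy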

Task 2: Verify you’re not already degraded (don’t stack failures)

cr0x@server:~$ zpool list -o name,size,alloc,free,health tank
NAME  SIZE  ALLOC   FREE  HEALTH
tank  43.5T  34.9T  8.6T  ONLINE

Meaning: HEALTH is ONLINE. ALLOC is roughly 80% of SIZE. High fullness makes resilvers slower and riskier (less free space, more fragmentation).

Decision: If HEALTH is DEGRADED or ALLOC is near full, treat maintenance like surgery: minimize time degraded, and consider delaying non-critical work.
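
If you want the fullness question answered in one view, CAP and FRAG are standard zpool list columns; the FRAG value below is illustrative:

cr0x@server:~$ zpool list -o name,size,alloc,free,cap,frag,health tank
NAME   SIZE  ALLOC   FREE  CAP  FRAG  HEALTH
tank  43.5T  34.9T   8.6T  80%   31%  ONLINE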

Task 3: Identify the physical disk behind a ZFS leaf device

cr0x@server:~$ ls -l /dev/disk/by-id/ | grep VKJ0A1AD
lrwxrwxrwx 1 root root  9 Dec 25 03:10 ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AD -> ../../sdg

Meaning: The ZFS leaf device maps to /dev/sdg right now. That may change after reboots; by-id is the stable reference.

Decision: Use by-id names in ZFS whenever possible. Use /dev/sdX only for transient OS-level checks.

Task 4: Check SMART before deciding it’s “just a cable”

cr0x@server:~$ sudo smartctl -a /dev/sdg | egrep -i 'Serial Number|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count'
Serial Number:    VKJ0A1AD
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0012   200   200     0    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   200   200     0    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x003e   200   200     0    Old_age   Always       -       0

Meaning: Reallocated and pending sectors exist. That’s media trouble, not just a flaky cable (a rising UDMA_CRC_Error_Count would point to cabling; here it’s zero).

Decision: Plan replacement, not “reseat and pray.” Offline first, replace, resilver, then scrub later.
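
If the window allows, a SMART self-test adds evidence before you commit to a swap; this is a sketch using the device mapped earlier:

cr0x@server:~$ sudo smartctl -t short /dev/sdg     # runs in the background, usually a couple of minutes
cr0x@server:~$ sudo smartctl -l selftest /dev/sdg  # check the result once it finishes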

Task 5: Take the target disk offline (maintenance mode, intentional)

cr0x@server:~$ sudo zpool offline tank ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AD
cr0x@server:~$ zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
config:

        NAME                                    STATE     READ WRITE CKSUM
        tank                                    DEGRADED     0     0     0
          raidz2-0                              DEGRADED     0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AA    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AB    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AC    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AD    OFFLINE      0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AE    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AF    ONLINE       0     0     0

errors: No known data errors

Meaning: Pool is DEGRADED by admin action, not by faults. That’s the safest kind of degraded: you’re in control.

Decision: Proceed with physical work. Also: freeze other risky changes. This is not the moment for kernel upgrades or HBA firmware experiments.

Task 6: Confirm the disk is truly not in use (avoid surprise I/O)

cr0x@server:~$ sudo zpool iostat -v tank 1 3
                                          capacity     operations     bandwidth
pool                                    alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
tank                                    34.9T  8.6T     45     30  4.20M  2.10M
  raidz2-0                               34.9T  8.6T     45     30  4.20M  2.10M
    ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AA        -      -      8      5   710K   380K
    ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AB        -      -      7      5   690K   370K
    ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AC        -      -      9      5   740K   400K
    ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AD        -      -      0      0      0      0
    ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AE        -      -      6      5   670K   360K
    ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AF        -      -      6      5   690K   380K
--------------------------------------  -----  -----  -----  -----  -----  -----

Meaning: The offlined device shows no ops/bandwidth. Expected. If you see I/O on an “offline” leaf, something is off (aliases, multipath, wrong name).

Decision: If there’s unexpected activity, stop and re-check device identity before you pull anything.

Task 7: Replace the disk and tell ZFS explicitly (don’t rely on autodetect)

cr0x@server:~$ ls -l /dev/disk/by-id/ | grep VKJ0A1AD
lrwxrwxrwx 1 root root  9 Dec 25 03:10 ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AD -> ../../sdg
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep NEWDRIVE
lrwxrwxrwx 1 root root  9 Dec 25 03:42 ata-WDC_WD80EFZX-68UW8N0_NEWDRIVE -> ../../sdg
cr0x@server:~$ sudo zpool replace tank ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AD ata-WDC_WD80EFZX-68UW8N0_NEWDRIVE
cr0x@server:~$ zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.
scan: resilver in progress since Thu Dec 25 03:43:11 2025
        1.26T scanned at 1.45G/s, 312G issued at 360M/s, 34.9T total
        52.0G resilvered, 0.86% done, 1 days 02:13:08 to go
config:

        NAME                                    STATE     READ WRITE CKSUM
        tank                                    DEGRADED     0     0     0
          raidz2-0                              DEGRADED     0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AA    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AB    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AC    ONLINE       0     0     0
            replacing-3                         DEGRADED     0     0     0
              ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AD  OFFLINE      0     0     0
              ata-WDC_WD80EFZX-68UW8N0_NEWDRIVE  ONLINE       0     0     0  (resilvering)
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AE    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AF    ONLINE       0     0     0

errors: No known data errors

Meaning: ZFS created a “replacing” subtree, tracking old vs new. Early ETAs are unreliable; they settle once the resilver has mapped the real amount of work.

Decision: Do not offline another disk “just to be safe.” You’re already in a reduced margin state until resilver completes.

Task 8: Monitor resilver progress and decide if you should throttle workloads

cr0x@server:~$ zpool iostat -v tank 5
cr0x@server:~$ zpool status tank
  pool: tank
 state: DEGRADED
scan: resilver in progress since Thu Dec 25 03:43:11 2025
        9.80T scanned at 1.12G/s, 2.01T issued at 235M/s, 34.9T total
        1.62T resilvered, 5.76% done, 0 days 19:04:55 to go

Meaning: “scanned” is what ZFS examined; “issued” is what it actually had to reconstruct/copy. Issued is what stresses disks.

Decision: If application latency spikes, consider temporarily reducing write-heavy jobs, backups, or compaction tasks. Your goal is “finish resilver without a second failure,” not “win benchmarks.”
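
On OpenZFS on Linux you can also nudge how hard resilver pushes relative to application I/O via module parameters. Treat this as a sketch: parameter names and defaults vary across releases, so verify them on your version before changing anything.

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_resilver_min_time_ms
3000
cr0x@server:~$ echo 5000 | sudo tee /sys/module/zfs/parameters/zfs_resilver_min_time_ms   # favor resilver over app I/O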

Task 9: Bring a previously offlined disk back online (when it’s a transient issue)

cr0x@server:~$ sudo zpool online tank ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AD
cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.
config:

        NAME                                    STATE     READ WRITE CKSUM
        tank                                    ONLINE       0     0     0
          raidz2-0                              ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AA    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AB    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AC    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AD    ONLINE       0     0     0  (resilvering)
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AE    ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VKJ0A1AF    ONLINE       0     0     0

errors: No known data errors

Meaning: Online brings the leaf back into service. Depending on what happened, ZFS may resilver to re-sync.

Decision: If the disk was offlined due to suspected media errors, do not online it “to see what happens.” Replace it. Curiosity is expensive on storage.

Task 10: Clear transient errors after you fix the cause (not before)

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
  see: http://zfsonlinux.org/msg/ZFS-8000-8A
config:

        NAME                                    STATE     READ WRITE CKSUM
        tank                                    DEGRADED     0     0     0
          mirror-0                              DEGRADED     0     0     0
            ata-SAMSUNG_MZ7KM960_ABC123          DEGRADED     0     0     7
            ata-SAMSUNG_MZ7KM960_DEF456          ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
        tank/data/app.db
cr0x@server:~$ sudo zpool clear tank

Meaning: zpool clear clears error counters and fault states, but it does not magically heal corrupt application data.

Decision: Clear only after fixing the underlying cause and after you’ve handled any “Permanent errors” at the dataset/app level.
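
A sketch of the order of operations, assuming the damaged file is the one listed above, the dataset is mounted at /tank/data, and a known-good copy exists in backups:

cr0x@server:~$ sudo rm /tank/data/app.db      # or restore a good copy over it
cr0x@server:~$ sudo zpool clear tank          # reset error counters once the cause is fixed
cr0x@server:~$ sudo zpool scrub tank          # confirm no new errors surface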

Task 11: Confirm device identity with ZFS labels (when by-id lies or duplicates exist)

cr0x@server:~$ sudo zdb -l /dev/sdg1 | egrep 'name:|pool_guid|vdev_tree|guid'
    name: 'tank'
    pool_guid: 17219319428190311341
    guid: 1329582641194012239

Meaning: zdb -l reads the on-disk ZFS label and tells you what pool a device belongs to (on whole-disk vdevs the label lives on the data partition, hence sdg1). This is your “forensics” tool when paths are confusing.

Decision: If a disk doesn’t label as expected, stop. You may be looking at a different pool’s disk or a stale spare.

Task 12: Make sure you’re not about to resilver from a slower path (multipath/sas)

cr0x@server:~$ lsblk -o NAME,HCTL,SIZE,MODEL,SERIAL,TRAN
NAME HCTL       SIZE MODEL            SERIAL   TRAN
sdg  2:0:6:0    7.3T WDC WD80EFZX-68U NEWDRIVE sas

Meaning: HCTL gives you the controller:target:lun style path. Good for mapping to HBA ports and expander bays.

Decision: If the “new disk” shows up on an unexpected HCTL, you may have inserted into the wrong bay. Fix it now, not after 12 hours of resilver.

Task 13: Check for ongoing scrub/resilver before starting maintenance (don’t pile on)

cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
scan: scrub in progress since Thu Dec 25 01:12:04 2025
        18.3T scanned at 710M/s, 18.3T issued at 710M/s, 34.9T total
        0B repaired, 52.4% done, 0 days 05:22:19 to go

Meaning: A scrub is already stressing all disks. Offlining a disk mid-scrub forces ZFS to work harder on the remaining members.

Decision: Prefer pausing/ending scrub before planned maintenance if policy allows. If you must proceed, accept higher risk and longer durations.

Task 14: Stop a scrub (with intention) to reduce load during a critical resilver

cr0x@server:~$ sudo zpool scrub -s tank
cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
scan: scrub canceled on Thu Dec 25 03:55:10 2025

Meaning: Scrub was canceled. You didn’t fix integrity; you reduced load. That’s fine when you’re triaging risk.

Decision: Reschedule scrub after resilver and after business hours. Don’t leave the pool unscrubbed forever because you got busy.
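
A minimal scheduling sketch using cron, assuming Sunday 02:00 is a quiet window; check first whether your distro already ships a scrub timer so you don’t double-schedule:

cr0x@server:~$ echo '0 2 * * 0 root /usr/sbin/zpool scrub tank' | sudo tee /etc/cron.d/zfs-scrub-tank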

Task 15: Export/import as a controlled reset (when a device state is stuck)

cr0x@server:~$ sudo zpool export tank
cr0x@server:~$ sudo zpool import -d /dev/disk/by-id tank
cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
errors: No known data errors

Meaning: Export/import can clear certain stale path issues and forces a rescan. It’s disruptive: datasets disappear briefly.

Decision: Use during maintenance windows, not during peak traffic. If your issue is hardware-level, export/import won’t cure it.

Task 16: Set a temporary spare policy (if you have spares configured)

cr0x@server:~$ zpool status tank
  pool: tank
 state: DEGRADED
config:

        NAME                                STATE     READ WRITE CKSUM
        tank                                DEGRADED     0     0     0
          mirror-0                          DEGRADED     0     0     0
            ata-HGST_HUH721010ALE604_AAA111  OFFLINE      0     0     0
            ata-HGST_HUH721010ALE604_BBB222  ONLINE       0     0     0
        spares
          ata-HGST_HUH721010ALE604_SPARE33   AVAIL
cr0x@server:~$ sudo zpool replace tank ata-HGST_HUH721010ALE604_AAA111 ata-HGST_HUH721010ALE604_SPARE33

Meaning: You can actively replace an offline/faulted disk with an available hot spare. ZFS will resilver onto the spare.

Decision: Use spares to buy time, not to avoid proper replacement. Spares need to be restored to “AVAIL” after you install a real disk.
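
When the permanent replacement arrives, the flow is roughly: replace the failed leaf with the new disk, let the resilver finish, then detach the spare so it returns to AVAIL. Device names follow the example above; the new disk’s name is hypothetical.

cr0x@server:~$ sudo zpool replace tank ata-HGST_HUH721010ALE604_AAA111 ata-HGST_HUH721010ALE604_NEW444
cr0x@server:~$ sudo zpool detach tank ata-HGST_HUH721010ALE604_SPARE33    # only after the resilver completes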

Maintenance mode patterns (mirror, RAIDZ, spares)

Pattern A: Mirror vdev — the safest place to practice

Mirrors are forgiving during maintenance because the redundancy math is simple: one side can go away, and you still have a complete copy.
That doesn’t mean you should get cocky. Mirrors fail in boring ways: both devices were bought together, wrote the same workload, and age the same way.

For mirrors, you usually have three sane flows:

  • Offline → replace → resilver (preferred for physical swap).
  • Offline → online (for suspected path/cable issue after fix).
  • Detach (only when intentionally shrinking mirror width, or when splitting a mirror for migration/testing and you accept the risk).
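
If the goal is migration or testing rather than break/fix, zpool split exists specifically for mirrors: it detaches one side of each mirror into a new pool you can import elsewhere. A sketch with a hypothetical new pool name; like detach, this reduces redundancy on the original pool until you attach a disk back:

cr0x@server:~$ sudo zpool split tank tank-migrate
cr0x@server:~$ sudo zpool import tank-migrate        # on this host, or after moving the disks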

Avoid mixing “detach” into routine break/fix. Detach is forever. Offline is a pause button.

Pattern B: RAIDZ — maintenance is a risk-budget exercise

RAIDZ is where people make “reasonable” moves that become unreasonable once the second variable changes.
In RAIDZ2, offlining one disk for planned work is typically fine. But your risk is now concentrated:
any other disk failure or a flaky enclosure that drops two paths at once can push you into an unrecoverable state.

Practical guidance:

  • Minimize degraded time. Have the replacement disk pre-burned-in, labeled, and ready (burn-in sketch after this list).
  • Don’t do firmware experiments while degraded. The HBA reset you’ve “never seen before” will show up right then.
  • Watch pool fullness. High allocation tends to increase resilver time and stress, which increases probability of a second failure.
  • Prefer daylight resilvers. The fastest way to get help is to start while people are awake and Slack is noisy.
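
The burn-in referenced above can be as light as a long SMART self-test or as thorough as a destructive write pass; both are standard tools, and badblocks -w wipes the disk, so only point it at a blank spare:

cr0x@server:~$ sudo smartctl -t long /dev/sdX        # hours; read results with smartctl -l selftest
cr0x@server:~$ sudo badblocks -wsv /dev/sdX          # DESTRUCTIVE write test: blank spares only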

Pattern C: Hot spares — useful, but easy to misuse

Hot spares are a tool for reducing mean time degraded, not a substitute for maintaining hardware.
The common failure pattern is: spare activates, everyone relaxes, weeks pass, and now the spare is “part of the pool” with no spare left.
The next failure is not impressed.

What not to do: “maintenance mode” via yanking drives

People still do this because it’s fast and sometimes it works.
The cost is that you lose the chance to confirm identity, you risk a bus-level event, and you create ambiguous states (“REMOVED,” “UNAVAIL”) that complicate recovery.
Offline is a one-line command. Use it.

Joke #2: The quickest way to find out your labeling process is bad is to do drive swaps at 2 a.m. under a flashlight.

Fast diagnosis playbook (find the bottleneck fast)

When a pool goes DEGRADED or resilver crawls, you don’t have time to read every forum post ever written.
You need a triage order that finds the limiting factor quickly.

First: confirm what ZFS thinks is happening

  • Command: zpool status -v
  • Look for: OFFLINE vs FAULTED vs REMOVED, “replacing” subtree, “scan:” section, error counters.
  • Decision: If a device is FAULTED with read/write/cksum errors, plan replacement. If it’s REMOVED/UNAVAIL, investigate path/HBA/enclosure first.

Second: check if you’re fighting the system (scrub, heavy writes, quotas, snapshots)

  • Command: zpool status for scrub/resilver; zpool iostat -v 1 for load shape
  • Look for: scrub in progress, massive write bandwidth, one disk pegged, very low “issued” rate
  • Decision: If resilver is slow and the pool is busy, either throttle workloads or accept longer degraded time. Degraded time is risk time.

Third: isolate hardware path problems

  • Commands: smartctl -a, lsblk -o NAME,HCTL,SERIAL,TRAN, and kernel logs via dmesg -T
  • Look for: link resets, SAS phy flaps, timeouts, CRC errors, queued command failures
  • Decision: If multiple drives show transport-level errors, stop replacing “bad disks” and start checking the HBA, expander, enclosure power, and cabling.

Fourth: verify you’re not out of redundancy budget

  • Command: zpool status and mentally compute: how many failures can this vdev take right now?
  • Look for: RAIDZ1 with one offline = no margin; mirror with one side failing = no margin
  • Decision: If margin is gone, stop elective work. Your goal becomes “get back to redundant ASAP,” even if that means pausing workloads.

Fifth: check pool fullness and fragmentation as a resilver multiplier

  • Command: zpool list
  • Look for: high ALLOC, low FREE, historically slow scrubs
  • Decision: If resilver is slow due to fullness, you can’t fix it mid-flight. Use it to justify capacity headroom next budget cycle.
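
If you want the five checks above captured in one shot for the ticket, here is a minimal sketch; the pool name and output path are assumptions:

cr0x@server:~$ cat /usr/local/sbin/zfs-triage
#!/bin/bash
# One-shot ZFS triage snapshot: pool state, capacity, per-device load, recent kernel messages.
set -u
POOL="${1:-tank}"
OUT="/tmp/zfs-triage-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "=== zpool status ==="
  zpool status -v "$POOL"
  echo "=== zpool list ==="
  zpool list -o name,size,alloc,free,cap,frag,health "$POOL"
  echo "=== zpool iostat (3 samples) ==="
  zpool iostat -v "$POOL" 1 3
  echo "=== kernel log tail ==="
  dmesg -T | tail -n 200
} > "$OUT" 2>&1
echo "wrote $OUT"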

Common mistakes: symptom → root cause → fix

1) Symptom: pool is DEGRADED after “simple” maintenance; no one knows why

Root cause: Disk was yanked without offlining, and device names changed after a rescan/reboot.

Fix: Use zpool status to identify the missing leaf by by-id, confirm with zdb -l, reinsert into correct bay, then zpool online or zpool replace as appropriate.

2) Symptom: resilver “stuck” at 0% or progresses painfully slowly

Root cause: Pool is saturated by application writes or another scrub; or the replacement disk is SMR/slow; or there’s a transport issue causing retries.

Fix: Confirm with zpool iostat -v which device is bottlenecked. Check smartctl and dmesg -T for errors/timeouts. Reduce workload, stop scrub if necessary, and consider replacing the replacement disk if it’s clearly underperforming.

3) Symptom: you offlined one disk, and a second disk suddenly shows errors

Root cause: You removed redundancy, increasing read pressure on remaining disks; latent sector errors surfaced under load.

Fix: Don’t “online” the bad disk back just to recover margin. Replace the failing disk(s) in the safest order, consider using a spare, and keep workloads conservative until redundancy is restored.

4) Symptom: “cksum errors” on one disk but SMART looks fine

Root cause: Cabling, HBA, expander, or backplane issues corrupting data in transit; sometimes memory, but usually transport.

Fix: Check CRC/transport errors and logs. Reseat/replace cable, move bays, validate HBA firmware stability. Clear errors only after fixing transport and scrubbing to confirm no new corruption.

5) Symptom: replaced disk, but ZFS keeps referencing the old by-id name

Root cause: Replacement was done without zpool replace, or the OS presented a different path; ZFS is tracking the old leaf identity.

Fix: Use zpool status to see the “replacing” tree. If needed, explicitly zpool replace old leaf with the new by-id. Confirm with zpool status that resilver is running.

6) Symptom: pool goes UNAVAIL after exporting/importing during a disk swap

Root cause: Missing too many devices for a vdev, or import was attempted without correct device paths (by-id missing), or one disk belongs to another pool.

Fix: Use zpool import without arguments to list pools, add -d /dev/disk/by-id, confirm labels with zdb -l. Do not force import unless you understand the consequences.

7) Symptom: after bringing a disk online, ZFS doesn’t resilver and you fear silent divergence

Root cause: The device came back with no changes needed (it was offline briefly), or ZFS believes it is up-to-date due to transaction group state.

Fix: Confirm with zpool status (no resilver is not automatically bad). Run a scrub after maintenance to validate end-to-end integrity.

8) Symptom: “too many errors” and ZFS faults a disk that passes vendor diagnostics

Root cause: Intermittent timeouts under load, flaky firmware, or HBA/enclosure resets that vendor tests don’t reproduce.

Fix: Treat timeouts as real. Correlate with system logs, replace questionable components, and prefer stable HBAs/enclosures over “it passed a quick test.”

Three corporate-world mini-stories

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company ran ZFS on Linux for a multi-tenant object store. The pool was RAIDZ2, dense JBOD, nothing exotic.
They had a routine: if a disk showed a few checksum errors, they’d offline it, schedule replacement, and move on.

One night, a disk started logging timeouts. The on-call offlined it and opened a ticket for a swap the next morning.
The assumption was that “offline means safe” and that the rest of the vdev would quietly carry the load.

What they missed: the pool was already heavily allocated, and nightly batch jobs had started—massive reads and writes.
With one disk offlined, ZFS had less parallelism and more reconstruction work per read. Latency rose. Clients retried. The retry storm increased load further.

Two hours in, a second disk (same model, same purchase batch) began returning unreadable sectors under the increased stress.
That disk wasn’t “newly bad.” It was newly tested. The pool went from “planned degraded” to “actively failing.”

They recovered, but the repair window stretched: replacement disks, resilver, then a scrub that found a handful of permanent errors in cold objects.
The postmortem wasn’t about ZFS. It was about the assumption that degraded mode is a stable operating point. It’s not; it’s an emergency lane.

Mini-story 2: The optimization that backfired

A finance org wanted faster rebuilds. Someone proposed a “simple” improvement: schedule weekly scrubs during business hours because “scrubs keep everything healthy”
and “it’s better to find errors early.” Sounds reasonable.

The pool served VMs. The workload was latency-sensitive: small synchronous writes, bursts of reads. Scrub ran, caches warmed, and performance looked fine—at first.
Over a few weeks, a pattern emerged: every scrub day included a few VM pauses and storage alerts. No outage, but a slow bleed of trust.

Then a disk actually failed during a scrub. The team replaced it quickly, but now they had scrub load plus resilver load plus a full business-day workload.
Resilver time stretched. The pool stayed degraded far longer than usual. The “optimization” had extended the risk window.

The fix was boring: move scrubs to low-traffic windows, enforce workload throttles during resilver, and treat scrub scheduling as a production change with SLO impact.
Their rebuilds didn’t get faster. Their incidents got rarer.

Mini-story 3: The boring but correct practice that saved the day

A healthcare platform had strict change control. It annoyed everyone, which is how you know it worked.
Their storage runbook required: by-id naming everywhere, printed bay maps updated with every chassis change, and a two-person verification before any disk pull.

During a planned replacement, the on-site tech found the bay label didn’t match the OS mapping. The easy move would’ve been to “pull the one that looks right.”
Instead, they followed the process: offline the target leaf by-id, confirm zero I/O to that device with zpool iostat -v, then match serial numbers using smartctl.

The mismatch turned out to be a backplane swap months earlier, with bay numbering reversed by the enclosure firmware.
If they had pulled based on the physical sticker, they would have removed a healthy disk while leaving the failing one online—exactly the kind of chaos that makes RAIDZ look fragile.

They updated the bay map, swapped the right disk, resilvered, scrubbed, and moved on. No incident. No bridge call.
The only casualty was someone’s belief that process is “red tape.” Process is how you avoid learning the same lesson twice.

Checklists / step-by-step plan

Planned disk maintenance (single disk) — the safe workflow

  1. Confirm redundancy and current health
    Run zpool status -v and zpool list. If already degraded, stop and reassess. Don’t stack risk.
  2. Identify the disk unambiguously
    Use by-id, then map to /dev/sdX with ls -l /dev/disk/by-id. Verify serial via smartctl -a.
  3. Check whether it’s media or transport
    Pending/reallocated sectors suggest media. CRC/link resets suggest transport. Decide if you’re replacing a disk or fixing a path.
  4. Offline the disk
    zpool offline tank <by-id>. Confirm it shows OFFLINE and no I/O.
  5. Perform physical swap
    Use bay maps. Don’t rely on “it’s probably the blinking one” unless you’re using a validated locate feature.
  6. Confirm the new disk identity
    Check by-id and serial. Ensure you didn’t insert the wrong capacity/model accidentally.
  7. Replace explicitly
    zpool replace tank <old-by-id> <new-by-id>. Confirm “replacing” and resilver started.
  8. Monitor resilver and system health
    Watch zpool status and zpool iostat -v. Check kernel logs for resets/timeouts.
  9. Return to ONLINE and restore spare posture
    When resilver completes, ensure the pool is ONLINE and you still have an AVAIL spare if your policy expects one.
  10. Scrub after the dust settles
    Schedule a scrub in a safe window to validate end-to-end integrity after the event.

Emergency disk event (unexpected FAULTED/REMOVED) — the containment workflow

  1. Stop the bleeding: capture zpool status -v output for the ticket and incident log.
  2. Determine state type: FAULTED suggests replacement; REMOVED/UNAVAIL suggests path investigation first.
  3. Check for systemic issues: if multiple disks report errors, suspect HBA/enclosure/cabling/power.
  4. Reduce load: pause scrubs, defer batch jobs, reduce write-heavy operations if possible.
  5. Restore redundancy ASAP: replace the disk or restore the path. Use a hot spare if you must buy time.
  6. Validate: resilver completion, then scrub later, then review error counters.

Rules I enforce (because I like sleeping)

  • Never offline a disk in RAIDZ1 unless you’re prepared for a same-day replacement and a careful watch.
  • Never do multiple concurrent drive swaps in the same vdev unless you have a tested, documented procedure and a reason beyond impatience.
  • Never clear errors to “make the alert go away” until you understand the cause.
  • Always keep a record of: leaf by-id, physical bay, serial number, purchase batch if known, and last scrub date.
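
A loop like this is enough to seed that record with the by-id to block-device mapping (partition links are skipped, and it assumes SATA disks exposed under ata-*); the serial is already embedded in most by-id names:

cr0x@server:~$ for l in /dev/disk/by-id/ata-*; do [ -e "$l" ] || continue; case "$l" in *-part*) continue;; esac; printf '%-50s %s\n' "${l##*/}" "$(readlink -f "$l")"; done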

FAQ

1) Does zpool offline reduce redundancy?

Yes, in the only way that matters: it removes a participating replica/parity member from a vdev.
The pool may keep serving data, but you have less fault tolerance until the device is back or replaced and resilvered.

2) Should I offline a disk before physically removing it?

For planned maintenance, yes. Offlining is an intentional state change that prevents surprise I/O and helps you confirm you targeted the right device.
If the disk is already dead/unreachable, you may not be able to offline it meaningfully—but you can still proceed with replacement.

3) What’s the difference between offline and detach?

offline is temporary and does not change topology. detach removes a device from a mirror vdev and changes topology permanently.
Use detach when you mean it, not as a maintenance shortcut.

4) Can I offline a disk in RAIDZ2 and keep running production traffic?

Usually, yes. But “can” isn’t “should without changes.” Expect higher latency under load, longer resilvers, and higher risk from latent failures.
Reduce load during the degraded window if you want the maintenance to stay boring.

5) Why did my resilver time estimate jump around?

ZFS estimates are based on observed throughput and work discovered so far. Early in a resilver, it may not have mapped the real amount of data to reconstruct.
Watch “issued” and real bandwidth; treat ETAs as hints, not contracts.

6) Should I scrub right after replacing a disk?

Not immediately if the pool is busy and you’re sensitive to latency. First: finish resilver and return to redundant.
Then scrub in a controlled window to validate integrity across the pool.

7) I see checksum errors, but no read/write errors. Is that a disk problem?

Not always. Checksum errors often implicate transport (cabling/backplane/HBA) because data arrived corrupted.
Verify with SMART (CRC counters), system logs, and whether multiple drives show similar symptoms.

8) Is it safe to online a disk that was offlined due to errors?

If it was offlined for transient path work and you fixed the path, onlining is fine and may trigger resilver.
If it was offlined because the media was failing (pending sectors, uncorrectables), onlining is gambling with production data.

9) What if I accidentally offlined the wrong disk?

First: stop. Confirm redundancy margin. If you still have margin, immediately zpool online the wrong disk and verify it rejoins cleanly.
Then re-identify the correct disk using serial numbers and by-id mappings before proceeding.

10) How do hot spares interact with offline/online?

A spare can be used as a replacement target via zpool replace, reducing time degraded.
But once a spare is in active use, you effectively have no spare. Replace the failed disk promptly and return the spare to AVAIL.

Conclusion: practical next steps

ZFS gives you the tools to treat disk work like a controlled operation instead of a bar fight with a server chassis.
The difference between “maintenance mode” and “incident mode” is usually one thing: whether you changed state intentionally and verified it.

  1. Standardize on by-id naming for pools and for runbooks. If your commands depend on /dev/sdX, you’re building on sand.
  2. Make degraded time a metric. Track how long pools spend without full redundancy; optimize for shorter, not faster.
  3. Write the two-person verification step into policy for physical swaps: map by-id → serial → bay, then offline.
  4. Schedule scrubs like production changes, not like chores. Align them with low-traffic windows and incident response coverage.
  5. Practice the workflow on a non-critical mirror pool. It’s cheaper than practicing during an outage.

Next maintenance window, aim for a single outcome: you can explain every device state in zpool status and why it changed.
If you can do that, you’re not guessing—you’re operating.
