ZFS Resilver Priority: Rebuild Fast Without Killing Production IO

Nothing tests your storage design like the moment a disk drops out at 2pm on a Tuesday and your pool starts resilvering while the business is still very much doing business. You want the rebuild to finish fast—because risk compounds with every minute you stay degraded—but you also don’t want your databases to look like they’re running on a floppy drive.

ZFS resilver “priority” is really a bundle of knobs: scheduling, concurrency, queue depth, and how aggressively ZFS chews through dirty regions and metadata. Get it right and resilver is a controlled burn. Get it wrong and you’ll create your own outage, politely, from the inside.

What resilver priority actually means (and what it doesn’t)

In ZFS terms, a resilver is the process of reconstructing redundancy after replacing or reattaching a device, or after a device temporarily disappears and returns. It’s related to a scrub, but not the same job.

People say “raise resilver priority” as if there’s a single slider labeled Make It Fast. There isn’t. ZFS resilver speed and production impact come from:

  • What ZFS chooses to copy: a resilver is typically incremental, guided by ZFS metadata about what needs to be reconstructed, not a blind full-disk mirror copy (though certain layouts and conditions can make it act closer to one).
  • How many concurrent rebuild IOs ZFS issues: too low and you underutilize disks; too high and you saturate queues and murder latency.
  • Where the bottleneck actually is: often it’s not “disk speed” but random IO amplification, metadata contention, fragmentation, or a single vdev being pinned.
  • IO scheduler and queue behavior in the OS: Linux vs FreeBSD behavior differs; modern schedulers, NVMe, and virtualization layers can change the game.
  • Competing work: scrubs, heavy writes, sync workloads, and small-block random reads all fight for the same spindles.

Priority, in practice, means deciding who gets to be annoying: the resilver or your customers. You can usually find a middle ground where the resilver stays aggressive enough to reduce risk, while production keeps its p99 latency below “paging the CEO.”

One quote worth keeping on your monitor

Paraphrased idea (Werner Vogels, reliability-focused engineering): “Everything fails, all the time—design and operate assuming it will.”

Facts and history that explain today’s behavior

Resilver behavior isn’t arbitrary; it’s the product of design choices and a couple decades of scar tissue. Here are some short, concrete facts that help you reason about what you’re seeing.

  1. ZFS was built around end-to-end checksums and copy-on-write, which means “rebuild” is not a naive sector clone; it’s reconstruction based on what the pool believes is live and valid.
  2. Traditional RAID rebuilds were historically full-device reads, which is why older admins still assume a resilver must read every sector. ZFS can often do less.
  3. ZFS’s scan machinery predates today’s cheap, high-capacity disks; multi-TB drives with mostly unchanged URE rates and IOPS turned “degraded for a day” into “degraded for a week” if you don’t tune.
  4. Wide RAIDZ became popular partly to save bays; the tradeoff is longer rebuild times and more IO pressure during resilver, especially under random write workloads.
  5. “Sequential resilver” and related improvements (implementation varies by platform/version) aimed to reduce seek storms by issuing IO in a more disk-friendly order.
  6. ZFS has to rebuild metadata correctly, not just blocks. Metadata-heavy pools (small files, snapshots, lots of datasets) can resilver in a pattern that looks “random and slow” even on fast disks.
  7. Special devices (metadata/small blocks) can make a pool scream when healthy—and scream in a different way when degraded, because critical reads get concentrated.
  8. Compression and recordsize choices change resilver IO shape. Big records and compressed data can reduce physical reads; tiny records can make resilver look like a million paper cuts.

Joke #1: A resilver is like an elevator: if you watch it, it moves slower. If you graph it, it moves even slower.

Risk model: why “fastest possible” is not always safest

Resilver speed is a risk-control knob. Your pool is degraded; another failure can turn into data loss (or at least forced recovery). So yes, finishing quickly matters. But “finish quickly” doesn’t mean “saturate everything until production falls over.”

Here’s the risk triangle you’re juggling:

  • Time at risk: how long you run with reduced redundancy.
  • Customer impact: latency and error budgets during the rebuild.
  • Hardware stress: aggressive rebuilds can drive disks hot and keep them at 100% duty cycle for days, raising failure probability—especially on older spinners.

The correct posture in most production environments is: aggressive but bounded. You want the resilver to keep forward progress even under load, while preventing it from pushing IO wait and latency into a death spiral.

A useful mental model: a resilver is a background batch job with a safety deadline. Treat it like a batch job you’re allowed to prioritize temporarily—but not at the expense of making the system unusable.

Fast diagnosis playbook

If your resilver is slow or your apps are suffering, don’t start by flipping tunables. Start by finding the bottleneck. This is the shortest path to being right.

First: confirm what job you’re running and where the pool stands

  • Is it actually a resilver, or a scrub you mistook for one?
  • Is the pool degraded because of a missing disk, a faulted disk, or an in-flight replace?
  • Is the new disk slower (SMR surprise, USB bridge, wrong firmware)?

Second: identify the limiting vdev and IO shape

  • Which vdev is doing the most work?
  • Are you bound on random reads, random writes, or sync writes?
  • Is metadata dominating (high IOPS, low throughput)?

Third: check queueing and latency, not just bandwidth

  • What’s the disk queue depth and await time?
  • Are you saturating a single HBA path?
  • Is ARC hit rate dropping, making everything hit disk?
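
To see the same pressure from the ZFS side, newer OpenZFS releases can break out per-class queues with zpool iostat -q; the scrub queue covers traditional (healing) resilver reads, and versions with sequential rebuild report a separate rebuild queue. Columns and numbers below are illustrative and trimmed:

cr0x@server:~$ zpool iostat -q tank 5 1
              capacity     operations     bandwidth    syncq_read    asyncq_read   scrubq_read
pool        alloc   free   read  write   read  write   pend  activ   pend  activ   pend  activ
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
tank        22.1T  13.5T  17.9K  6.3K   505M   138M      1      2      4      6     96     12

A scrub/rebuild queue that stays deep while the sync and async queues back up is the ZFS-side signature of the resilver crowding out foreground IO.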

Fourth: decide the trade you’re making

  • Do you need the fastest possible resilver (high business risk, poor redundancy), even if it hurts?
  • Or do you need to protect p99 latency (customer-facing systems), accepting a longer resilver?

Practical tasks: commands, outputs, decisions (12+)

These are the commands I actually run when a pool is degraded and everyone suddenly remembers storage exists. Each task includes: command, sample output, what it means, and the decision you make.

Task 1: See whether you’re resilvering, scrubbing, or both

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Dec 25 09:12:04 2025
        1.24T scanned at 1.10G/s, 412G issued at 365M/s, 3.80T total
        412G resilvered, 10.57% done, 0 days 02:31:18 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            ata-WDC_WD80...         ONLINE       0     0     0
            replacing-1             DEGRADED     0     0     0
              ata-WDC_WD80...       OFFLINE      0     0     0
              ata-ST8000...         ONLINE       0     0     0
errors: No known data errors

Meaning: It’s a resilver, not just a scrub. “Issued” is the work actually being done; “scanned” can be higher because ZFS may walk metadata faster than it can reconstruct blocks.

Decision: If ETA is reasonable and production is stable, don’t touch anything. If latency is spiking or “issued” is crawling, continue diagnosis.

Task 2: Check whether a scrub is running or queued (and stop it if needed)

cr0x@server:~$ zpool status tank | sed -n '1,25p'
  pool: tank
 state: DEGRADED
  scan: resilver in progress since Wed Dec 25 09:12:04 2025
        1.24T scanned at 1.10G/s, 412G issued at 365M/s, 3.80T total

Meaning: Only the resilver is active. OpenZFS generally won’t run a scrub and a resilver at the same time; the resilver takes precedence, and a scrub request during one is refused or deferred. But a scheduled scrub can be what’s actually running (if the replace hasn’t started yet), or it can be queued to fire the instant the resilver completes.

Decision: If a scrub is what’s running, or your automation keeps trying to start one, stop it and let the resilver go first.

cr0x@server:~$ sudo zpool scrub -s tank

Task 3: Show per-vdev IO to find the hot spot

cr0x@server:~$ zpool iostat -v tank 2 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        22.1T  13.5T  18.2K  6.4K   520M   140M
  mirror-0  7.40T  5.10T  12.1K  3.1K   340M    70M
    sda         -      -  6.1K   1.6K   170M    35M
    sdb         -      -  6.0K   1.5K   170M    35M
  mirror-1  7.40T  4.90T  3.9K   1.7K   110M    40M
    sdc         -      -  2.0K     850   55M    20M
    sdd         -      -  1.9K     840   55M    20M
  mirror-2  7.40T  3.50T  2.2K   1.6K    70M    30M
    sde         -      -  1.1K     820   35M    15M
    sdf         -      -  1.1K     810   35M    15M

Meaning: mirror-0 is doing most reads. That’s likely your resilver vdev or the vdev with the most referenced blocks.

Decision: Focus on the hot vdev. Tuning that improves the wrong vdev is just performance cosplay.

Task 4: Confirm the replacement disk isn’t the slow one (or SMR)

cr0x@server:~$ lsblk -o NAME,MODEL,SIZE,ROTA,TRAN,SERIAL
NAME  MODEL            SIZE ROTA TRAN SERIAL
sda   WDC WD80...      7.3T    1 sata WD-AAA
sdb   WDC WD80...      7.3T    1 sata WD-BBB
sdc   WDC WD80...      7.3T    1 sata WD-CCC
sdd   WDC WD80...      7.3T    1 sata WD-DDD
sde   WDC WD80...      7.3T    1 sata WD-EEE
sdf   ST8000DM004      7.3T    1 sata ZDHFFF

Meaning: One disk is a different model. That can be fine—or it can be an SMR drive that turns rebuilds into molasses under sustained writes.

Decision: If the replacement is an SMR model in a workload that writes during resilver, stop and swap it for CMR. Yes, even if procurement complains.

Task 5: Check disk health quickly (SMART)

cr0x@server:~$ sudo smartctl -H /dev/sdf
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0] (local build)
SMART overall-health self-assessment test result: PASSED

Meaning: “PASSED” is not “healthy,” it’s “not obviously dead.” You still want to check error counters.

Decision: If SMART is failing or attributes look ugly (reallocated/pending sectors), do not trust the disk to finish a resilver. Replace again.

Task 6: Watch OS-level latency and queue depth during resilver

cr0x@server:~$ iostat -x 2 3
Linux 6.8.0 (server)  12/25/2025  _x86_64_  (32 CPU)

avg-cpu:  %user %nice %system %iowait  %steal   %idle
           6.02  0.00    3.11   18.40    0.00   72.47

Device            r/s     w/s   rkB/s   wkB/s  rrqm/s  wrqm/s %util aqu-sz  await
sda            320.1   110.4 165000   42000     0.0     2.1  99.0  41.3  110.2
sdf            310.6   115.2 160000   45000     0.0     2.0  98.7  39.8  105.9

Meaning: %util ~99, queue depth ~40, await ~100ms. On spinners, that’s “we are saturating them and everything else will wait.”

Decision: If production is hurting, you need to reduce concurrent IO or shape it (tune resilver, cap throughput, or move workload). If you’re on SSD/NVMe, 100ms is a red siren.

Task 7: Check ARC pressure (a hidden source of “everything got slow”)

cr0x@server:~$ grep -E 'c_max|c_min|size|arc_meta_used|hits|misses' /proc/spl/kstat/zfs/arcstats | head
c_max                           4    25769803776
c_min                           4    6442450944
size                            4    24191070208
arc_meta_used                   4    4246732800
hits                            4    392004112
misses                          4    48122103

Meaning: ARC is near c_max; metadata usage is high. During resilver, metadata walks can evict useful cache and increase disk reads for production.

Decision: If ARC is thrashing (misses climbing fast) and you have RAM headroom, consider raising ARC max (carefully) or reducing memory pressure elsewhere. If not, you must tune IO, not memory.

Task 8: Confirm your pool layout and ashift (resilver time is partly geometry)

cr0x@server:~$ zdb -C tank | grep -E 'vdev|ashift' -n | head -n 20
45:        vdev_tree:
46:            type: 'root'
47:            id: 0
58:                    type: 'mirror'
59:                    ashift: 12
78:                    type: 'mirror'
79:                    ashift: 12

Meaning: ashift=12 (4K sectors). If you have ashift=9 on 4K-native drives, you’ll pay a write amplification tax forever, and resilvers will be worse.

Decision: If ashift is wrong, fix it by rebuilding the pool (yes, painful). Don’t try to tune your way out of a geometry mismatch.

Task 9: See if you’re bottlenecked on a single HBA or link

cr0x@server:~$ lsscsi -t
[0:0:0:0]    disk    ata:WDC_WD80...   /dev/sda  sata:0:0:0
[0:0:1:0]    disk    ata:WDC_WD80...   /dev/sdb  sata:0:0:1
[0:0:2:0]    disk    ata:WDC_WD80...   /dev/sdc  sata:0:0:2
[0:0:3:0]    disk    ata:WDC_WD80...   /dev/sdd  sata:0:0:3
[1:0:0:0]    disk    ata:WDC_WD80...   /dev/sde  sata:1:0:0
[1:0:1:0]    disk    ata:ST8000DM004   /dev/sdf  sata:1:0:1

Meaning: Drives spread across two controllers. If everything is on one HBA (or a single expander link), you can saturate it during resilver.

Decision: If you find a single chokepoint, reduce concurrency or plan a hardware fix. Tuning can’t create bandwidth you don’t have.

Task 10: Observe ZFS internal IO patterns via zpool iostat with latency (OpenZFS feature dependent)

cr0x@server:~$ zpool iostat -l -v tank 2 2
                           capacity     operations     bandwidth    total_wait     disk_wait
pool                     alloc   free   read  write   read  write   read  write   read  write
-----------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
tank                     22.1T  13.5T  18.0K  6.5K   510M   145M   12ms   18ms    8ms   15ms
  mirror-0               7.40T  5.10T  12.0K  3.2K   335M    72M   14ms   21ms   10ms   18ms
    sda                      -      -  6.0K   1.6K   168M    36M    0ms    0ms    0ms    0ms
    sdb                      -      -  6.0K   1.6K   167M    36M    0ms    0ms    0ms    0ms

Meaning: “total_wait” includes time waiting in ZFS; “disk_wait” is time at the device. High total_wait with low disk_wait points to ZFS-side contention; high disk_wait means the devices are the wall.

Decision: If disk_wait dominates, throttle. If total_wait dominates, look for CPU, lock contention, or pathological metadata workloads.

Task 11: Adjust resilver aggressiveness on Linux (module parameters)

On Linux OpenZFS, resilver behavior is influenced by module parameters, and the names have changed over time: pre-0.8 releases used zfs_resilver_delay, zfs_scrub_delay, and zfs_scan_idle, while 0.8 and later (the sequential-scanner era) expose zfs_resilver_min_time_ms, zfs_scrub_min_time_ms, and zfs_scan_vdev_limit instead. Exact availability depends on version; check first, then change.

cr0x@server:~$ modinfo zfs | grep -E 'resilver|scan' | head -n 20
parm:           zfs_resilver_min_time_ms:Min millisecs to resilver per txg (uint)
parm:           zfs_scrub_min_time_ms:Min millisecs to scrub per txg (uint)
parm:           zfs_scan_vdev_limit:Max bytes in flight per leaf vdev for scrubs and resilvers (ulong)
parm:           zfs_scan_legacy:Scrub using legacy non-sequential method (int)
parm:           zfs_resilver_disable_defer:Process all resilvers immediately (int)

Meaning: The modern knobs exist on this system. Good. Now you can tune intentionally instead of cargo-culting sysctls from a blog post written during the Jurassic era of spinning rust.

Decision: If production latency is suffering, lower the resilver time slice and the per-vdev scan limit. If resilver is too slow and you have IO headroom, raise them.

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_resilver_min_time_ms
3000

Meaning: ZFS will spend at least roughly this many milliseconds per transaction group on resilver work (3 seconds here, the usual default). Higher values push harder; lower values yield more to foreground IO.

Decision: For daytime production, keep it modest. For off-hours or emergency “finish now,” raise it carefully, watching latency.

Task 12: Temporarily throttle ZFS scan work (Linux)

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_scan_vdev_limit
4194304

Meaning: ZFS keeps up to this many bytes of scan/resilver IO in flight per leaf vdev (4 MiB here; defaults differ across versions). Higher means more aggressive scanning/resilvering; lower yields more often to other work.

Decision: If production is timing out, lower this value in small steps and validate with metrics. Don’t whiplash it from 4 MiB to 64 KiB and then wonder why the resilver ETA went to “sometime next quarter.”

cr0x@server:~$ echo 1048576 | sudo tee /sys/module/zfs/parameters/zfs_scan_vdev_limit
1048576

Task 13: Set up a stable, low-noise watch of progress

cr0x@server:~$ watch -n 10 'zpool status tank | sed -n "1,20p"'
Every 10.0s: zpool status tank | sed -n "1,20p"

  pool: tank
 state: DEGRADED
  scan: resilver in progress since Wed Dec 25 09:12:04 2025
        1.46T scanned at 1.05G/s, 501G issued at 360M/s, 3.80T total
        501G resilvered, 13.01% done, 0 days 02:12:03 to go

Meaning: You’re tracking issued and ETA, not just scanned. That’s the number that tends to correlate with “how long until redundancy is back.”

Decision: If “scanned” moves but “issued” stalls, you’re likely blocked on actual reconstruction IO or contention, not scan traversal.

Task 14: Verify you’re not creating a self-inflicted sync storm

cr0x@server:~$ zfs get -o name,property,value -r sync tank | head
NAME           PROPERTY  VALUE
tank           sync      standard
tank/db        sync      standard
tank/vmstore   sync      always

Meaning: A dataset with sync=always will force more synchronous writes. During resilver, that can push spinners into misery.

Decision: If sync=always is set for good reasons (databases without proper barriers, compliance), don’t “fix” it by changing sync. Instead: ensure SLOG is healthy and fast, or shift workload during resilver.

Task 15: Check if a SLOG device is missing or slow (and making sync writes hurt)

cr0x@server:~$ zpool status tank | sed -n '1,120p'
  pool: tank
 state: DEGRADED
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            sda                     ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
          logs
            nvme0n1p2               ONLINE       0     0     0

Meaning: SLOG is present and online. If it were missing/faulted, sync workloads would spill to the main pool, competing harder with resilver.

Decision: If SLOG is degraded, fix that first for sync-heavy production. A fast resilver is pointless if your transaction log is on life support.

Tuning strategy: rebuild fast without crushing latency

Let’s get real: the “best” resilver priority depends on what you’re protecting. For a backup target, you can go full send. For a primary database pool, you need guardrails.

Start with operational levers, not tunables

The cleanest tuning is the kind you don’t do in sysfs at all.

  • Move or reduce load: throttle batch jobs, pause reindexing, shift analytics off the box, delay snapshot send/receive.
  • Stop optional background work: scrubs, heavy snapshot deletes, replication, large zfs destroy operations.
  • Prefer resilver completion windows: if you can, schedule replacements so resilvers start at night. You’re not cheating; you’re being an adult.
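
A concrete version of “stop optional background work,” assuming the usual suspects: an in-flight scrub, replication via zfs send/receive, and some timer or cron job driving them (the unit name below is a placeholder for whatever you actually run):

cr0x@server:~$ sudo zpool scrub -p tank                        # pause a running scrub; zpool scrub -s cancels it
cr0x@server:~$ pgrep -af 'zfs (send|recv|receive)'             # is replication running right now?
12841 zfs send -I tank/vmstore@rep-1200 tank/vmstore@rep-1300
cr0x@server:~$ sudo systemctl stop my-zfs-replication.timer    # placeholder unit name

Pausing is reversible, which is the point: when the pool is healthy again, resume the scrub and restart the jobs instead of quietly forgetting them.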

Understand what ZFS is optimizing for during resilver

ZFS must maintain correctness. It also tries to keep the pool usable. Some platform defaults are conservative because the worst-case is ugly: resilver can generate enormous random IO, especially on fragmented pools with lots of snapshots.

Two practical outcomes:

  • Bandwidth isn’t the only metric. You can show 800MB/s “scanned” while production latency is dying because you’re really doing small random reads/writes and queueing.
  • Incremental isn’t always cheap. If a pool is highly fragmented or has heavy churn, the “only copy live blocks” strategy still touches a lot of scattered regions.
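
Two numbers predict how painful that will be before you start: pool fill and free-space fragmentation, both visible in zpool list. Output below is illustrative:

cr0x@server:~$ zpool list -o name,size,alloc,free,frag,cap,health tank
NAME   SIZE  ALLOC   FREE  FRAG    CAP  HEALTH
tank  35.6T  22.1T  13.5T   41%    62%  DEGRADED

FRAG measures free-space fragmentation rather than file fragmentation, but a high value still means reconstruction and new writes will be fighting over scattered space.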

Pick a policy: business hours vs emergency mode

I recommend defining two modes and practicing them:

  • Business-hours mode: modest resilver aggressiveness, predictable latency, longer ETA.
  • Emergency mode: when redundancy is critically low (second disk showing errors, RAIDZ under stress, or known-bad batch of drives), you accept higher impact to finish faster.

Joke #2: Resilver tuning is like caffeine—there’s a productive dose, and then there’s “I can hear colors” and nothing gets done.
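
Here’s a minimal sketch of what the two modes can look like on Linux OpenZFS 2.x, using the parameters from Tasks 11–12. The script name and the specific values are assumptions: starting points to adapt, not gospel.

cr0x@server:~$ cat /usr/local/sbin/resilver-mode
#!/bin/sh
# Sketch: toggle resilver aggressiveness between two agreed modes.
P=/sys/module/zfs/parameters
case "$1" in
  business)
    echo 1000     > "$P/zfs_resilver_min_time_ms"   # smaller time slice per txg
    echo 1048576  > "$P/zfs_scan_vdev_limit"        # 1 MiB of scan IO in flight per leaf vdev
    ;;
  emergency)
    echo 5000     > "$P/zfs_resilver_min_time_ms"   # bigger time slice per txg
    echo 16777216 > "$P/zfs_scan_vdev_limit"        # 16 MiB in flight per leaf vdev
    ;;
  *)
    echo "usage: $0 business|emergency" >&2; exit 1
    ;;
esac
grep -H . "$P/zfs_resilver_min_time_ms" "$P/zfs_scan_vdev_limit"   # show what is now in effect

Run it with sudo, write down the values you started from, and you’ve already satisfied the change-management point in the FAQ below.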

Linux OpenZFS knobs: what they tend to do in the real world

Exact parameter names and semantics vary by OpenZFS version, but these are common themes:

  • Time-slicing knobs (zfs_resilver_min_time_ms, zfs_scrub_min_time_ms): the minimum time per transaction group ZFS dedicates to scan/resilver work. Lower values yield more frequently to normal IO; higher values push the rebuild harder.
  • In-flight limits (zfs_scan_vdev_limit): cap how many bytes of scan IO can be outstanding per leaf vdev. On HDD pools, lowering this often reduces tail latency a lot because it breaks up queue monopolization.
  • Queue-class knobs (zfs_vdev_scrub_min_active, zfs_vdev_scrub_max_active): how many scan-class IOs the vdev scheduler keeps active alongside the sync and async classes.
  • Legacy delay knobs (zfs_resilver_delay, zfs_scrub_delay, zfs_scan_idle): pre-0.8 only. If a guide tells you to set these on a modern system, the parameter simply won’t be there.

What to avoid: pushing concurrency so high that disk queues are always full. This is how you turn “slower resilver” into “resilver that never completes because everything times out and resets.”
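
One more family of knobs worth knowing before you blame the scan parameters: the ZIO scheduler caps how many scan-class IOs each vdev keeps active. On OpenZFS 2.x the caps look like this (values vary by version; treat the numbers as illustrative):

cr0x@server:~$ grep -H . /sys/module/zfs/parameters/zfs_vdev_scrub_{min,max}_active
/sys/module/zfs/parameters/zfs_vdev_scrub_min_active:1
/sys/module/zfs/parameters/zfs_vdev_scrub_max_active:3

Raising the max is exactly how people let the resilver monopolize the queue; leave it alone unless you’ve measured that the scan class is being starved.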

FreeBSD knobs and general approach

On FreeBSD, you’ll typically use sysctl-driven controls and rely on the kernel’s IO scheduler behavior. The philosophy is similar: shape scan/resilver work so it yields to foreground IO. The specifics differ; the workflow doesn’t: measure, change one thing, observe p99 latency and resilver issued rate.
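
The same parameters surface on FreeBSD with OpenZFS as sysctls under vfs.zfs, with the zfs_ prefix dropped (names and values below are illustrative; verify on your release before writing anything):

cr0x@server:~$ sysctl -a | grep -E 'vfs\.zfs\.(resilver|scrub|scan)_' | head
vfs.zfs.resilver_min_time_ms: 3000
vfs.zfs.resilver_disable_defer: 0
vfs.zfs.scrub_min_time_ms: 1000
vfs.zfs.scan_vdev_limit: 4194304
vfs.zfs.scan_legacy: 0
cr0x@server:~$ sudo sysctl vfs.zfs.resilver_min_time_ms=1000
vfs.zfs.resilver_min_time_ms: 3000 -> 1000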

When faster is actually slower: the seek-storm trap

On HDD pools, resilver can become a random IO generator. If you crank aggressiveness, you increase outstanding IO and the drive does more seeking, which reduces effective throughput. You’ll see “issued” rate flatten while await climbs. That’s your sign you’ve hit the seek wall.

On SSD/NVMe pools, the trap is different: you can saturate controller queues and steal IO budget from latency-sensitive reads. Here, modest throttling can preserve p99 latency with only a small hit to rebuild time.
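
You can catch the seek wall with a dumb loop that logs the resilver’s issued figure next to device await once a minute, so you can see whether pushing harder actually moved anything. Paths and device names are illustrative:

cr0x@server:~$ while sleep 55; do
>   date '+%F %T'
>   zpool status tank | grep -E 'issued|to go'
>   iostat -dxy 5 1 | grep -E '^(Device|sd)'
> done | tee -a /var/tmp/resilver-watch.log

If await keeps climbing while the issued number barely moves between samples, extra aggression is buying you heat, not progress; back off.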

Scrub vs resilver: arbitration and scheduling

A scrub reads data and verifies checksums; a resilver reconstructs redundancy. Both are “scan class” operations in ZFS, and the pool generally runs only one at a time; a resilver takes precedence, and a scrub request during one is refused. The real-world trap is stacking them back to back: automation starting a scrub the moment the resilver completes, while the disks are still hot and production is still catching up. That’s not bravery. That’s unpaid overtime for your IO subsystem.

Rules I follow

  • Don’t queue or force a scrub against an active resilver unless you have a very specific reason (like investigating silent corruption on a pool with known-bad hardware) and you’ve accepted the performance hit.
  • After a resilver, consider a scrub if your operational policy requires it, but schedule it when load is lower.
  • Don’t let automation stack jobs: if you have weekly scrubs, ensure they pause when the pool is degraded or resilvering.
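
Where that automation lives varies. Debian/Ubuntu ZFS packages ship a monthly scrub cron job, and newer OpenZFS packaging adds per-pool systemd timers; check both while you’re degraded (output below is an illustrative Debian-style example):

cr0x@server:~$ systemctl list-timers --all | grep -i scrub
Sun 2026-01-11 00:24:00 UTC  2 weeks left  -  -  zfs-scrub-monthly@tank.timer  zfs-scrub-monthly@tank.service
cr0x@server:~$ cat /etc/cron.d/zfsutils-linux
# Scrub the second Sunday of every month.
24 0 8-14 * * root if [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ]; then /usr/lib/zfs-linux/scrub; fi

Whatever you find, make sure it either skips degraded pools or stays paused until the resilver completes.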

Why ZFS can look “busy” even when throughput is low

Scrub/resilver can be metadata-bound: lots of small IO, checksum verification, and bookkeeping. On a fragmented pool, this can be an IOPS-heavy job with unimpressive MB/s. That’s normal. What’s not normal is letting it starve your foreground reads.
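
If you want to confirm the “lots of IOPS, unimpressive MB/s” diagnosis rather than infer it, zpool iostat -r prints per-vdev request-size histograms; during a metadata-bound scan, the scrub columns pile up in the 4K–16K buckets instead of the fat 128K+ ones.

cr0x@server:~$ zpool iostat -r tank 10 1

The table is too wide to reproduce usefully here. Look at which size rows carry the counts for the scrub columns, and whether aggregation (the agg columns) is managing to merge any of it.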

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

The company had a mixed fleet: some storage nodes were mirror vdevs on SSD, some were RAIDZ2 on HDD. Operations treated them as “the same ZFS.” The monitoring dashboard even had a single “resilver rate” panel for all nodes. It looked neat. It was also lying by omission.

One afternoon, a disk in an HDD RAIDZ2 vdev failed. The on-call engineer replaced it and decided to “speed things up” by applying the same resilver parameters that worked great on the SSD nodes. Queue depths rose, but the resilver “scanned” speed looked fantastic, so everyone relaxed.

Within an hour, the customer-facing API started timing out intermittently. The database wasn’t down; it was just slow enough to trip upstream timeouts. The root cause wasn’t lack of bandwidth. It was a seek storm. The HDDs were pinned at near-constant utilization, with high await times. Foreground reads were queued behind background rebuild IO and small sync writes.

The wrong assumption was subtle: “If SSD likes more concurrency, HDD will also like more concurrency.” HDD does not. HDD likes sequentiality, breathing room, and not being asked to do 40 things at once.

The fix was boring: revert to conservative scan limits, disable the scheduled scrub that automation kept trying to kick off, and temporarily shift a batch job to a different cluster. Resilver took longer than “fast mode,” but the incident ended because p99 latency came back under control.

Mini-story 2: The optimization that backfired

A different org had a habit: whenever a pool was degraded, they’d immediately “help” by pausing normal workload processing and letting the resilver run at maximum speed. The logic was: minimize time at risk. On paper, that’s sound.

Then they got clever. They built automation to detect a degraded pool, bump resilver aggressiveness, and also trigger parallel dataset cleanup (“free space to make it easier for ZFS”). That cleanup included heavy snapshot deletions—lots of metadata work and frees—while resilver was walking the same general metadata structures.

The result was a perfect storm: the pool spent significant CPU time in metadata management, while disks got hammered with random IO. The resilver issued rate actually dropped, even though the system “looked busy.” The automation kept the system in this state for hours because it only stopped when resilver completed.

They didn’t cause data loss, but they caused avoidable customer pain and extended the degraded window. The optimization backfired because it combined two IO-intensive, metadata-heavy operations into one period when the pool was least able to tolerate it.

The fix was policy: during resilver, do not run snapshot deletes, large zfs destroy, or heavy rebalancing-like operations. If you must free space, do it before you’re degraded, not during.

Mini-story 3: The boring but correct practice that saved the day

A financial services team ran ZFS for VM storage. Their performance SLO was simple: keep storage latency within a tight band during trading hours. They also had a reliability rule: never let a pool sit degraded longer than necessary. Those goals conflict. So they wrote it down as an explicit operational playbook.

When a disk failed, the on-call did three things by muscle memory. First: confirm whether it was a transient path issue or a real disk failure. Second: start the replacement and resilver immediately. Third: switch the pool to “business-hours mode” tuning and freeze nonessential batch work.

They also had a standing runbook entry: if p99 latency crosses a threshold for more than 10 minutes, dial the resilver time slice down in steps and re-evaluate. That’s it. No heroics. No 2am sysctl archaeology.

The resilver took longer than it would have in full-throttle mode, but the trading systems stayed stable. After hours, they flipped to “emergency mode” and let the resilver finish more aggressively overnight.

The boring practice was documented, rehearsed toggles—plus the discipline to not change five things at once. It saved them from turning a hardware failure into an SLA failure.

Common mistakes: symptoms → root cause → fix

These are not theoretical. These are the ways people accidentally light their own storage on fire and then blame the fire for being hot.

1) Symptom: “Resilver is stuck at 0% issued but scanned keeps moving”

  • Root cause: metadata traversal continues, but actual reconstruction IO is blocked by device errors, slow replacement disk, or extreme contention.
  • Fix: check zpool status -v for errors; verify replacement device health and link; look at iostat -x await; reduce scan aggressiveness and stop competing jobs.

2) Symptom: “Apps are timing out; disks show 99% util; resilver ‘speed’ looks high”

  • Root cause: queue monopolization/seek storm; scanning metric is misleading; foreground IO is waiting behind rebuild IO.
  • Fix: lower the resilver time slice and the per-vdev scan limit; make sure automation isn’t queuing a scrub; reduce batch writes; watch p99 latency, not MB/s.

3) Symptom: “Replacement disk keeps faulting during resilver”

  • Root cause: bad disk, marginal cable/backplane, or power issues; resilver is the first sustained stress test it’s had.
  • Fix: replace the disk again, swap slot/cable, check SMART error logs, and check controller logs. Don’t keep retrying with the same flaky path.

4) Symptom: “Resilver is painfully slow on a nearly full pool”

  • Root cause: high fragmentation and allocator stress; ZFS has less freedom to place reconstructed blocks efficiently; metadata overhead grows.
  • Fix: keep pools below sane utilization (rule of thumb: avoid living above ~80% for many workloads); add vdevs before you’re desperate; don’t start massive frees during resilver.

5) Symptom: “Resilver speed varies wildly hour to hour”

  • Root cause: competing workload patterns (backup window, compaction, snapshot sends, database checkpoints); on older releases, idle-detection ramping (zfs_scan_idle) adds to the swings.
  • Fix: correlate with workload schedule; pin scan aggressiveness to predictable settings during business hours; optionally raise aggressiveness during known quiet windows.

6) Symptom: “Resilver got slower after we added a special device”

  • Root cause: special device concentrates metadata/small blocks; if it’s slower, oversubscribed, or also degraded, it becomes the chokepoint.
  • Fix: ensure special devices are redundant and fast; monitor their latency separately; avoid letting special fill up; treat them as first-class storage, not a “bonus SSD.”

7) Symptom: “We tuned for fast resilver, but the resilver took longer”

  • Root cause: over-aggressive concurrency causes thrash; HDD seek penalty; increased retries; IO scheduler contention.
  • Fix: back off aggressiveness until issued rate increases and await drops; measure, don’t guess.

Checklists / step-by-step plan

Step-by-step plan: when a disk fails in production

  1. Confirm the failure mode. Is it a real disk failure, or a path/controller hiccup? Check zpool status -v and OS logs.
  2. Replace the disk and start resilver. Don’t wait for a meeting invite to approve physics.
  3. Stop competing background tasks. Pause scrubs, heavy replication, snapshot deletes, and bulk maintenance.
  4. Measure production impact. Use app latency and disk await; don’t rely on resilver MB/s alone.
  5. Choose a mode: business-hours tuning vs emergency mode, based on redundancy risk and customer SLOs.
  6. Adjust one knob at a time. Change the resilver time slice or the per-vdev scan limit in small steps; observe 10–15 minutes.
  7. Verify forward progress. Issued and resilvered bytes should move; error counters should stay quiet.
  8. When complete, validate health. Pool should be ONLINE, no new errors; schedule scrub if policy demands it.

Checklist: business-hours mode (protect latency)

  • Scrub stopped or deferred
  • Batch jobs paused or rate-limited
  • Resilver time slice (zfs_resilver_min_time_ms or platform equivalent) kept modest
  • Per-vdev scan limit (zfs_scan_vdev_limit) reduced moderately
  • Watch p95/p99 latency, disk await, and issued rate

Checklist: emergency mode (finish fast)

  • Confirm redundancy risk justifies it (multiple warnings, RAIDZ under stress, failing batch)
  • Raise the resilver time slice and the per-vdev scan limit
  • Optionally pause non-critical services briefly to accelerate completion
  • Ensure cooling is adequate; watch disk temps and controller errors
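
For the temperature line, a low-tech loop is enough. SMART attribute names differ by vendor, so grep broadly; the drive list and readings here are illustrative, and the output is trimmed to two drives:

cr0x@server:~$ for d in /dev/sd{a..f}; do printf '%s ' "$d"; sudo smartctl -A "$d" | grep -i -m1 temperature || echo 'no temperature attribute'; done
/dev/sda 194 Temperature_Celsius     0x0022   111   104   000    Old_age   Always       -       39
/dev/sdf 190 Airflow_Temperature_Cel 0x0022   055   045   040    Old_age   Always       -       45

If temperatures keep climbing well past your normal operating range during emergency mode, the trade stops being worth it; slow down or fix airflow first.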

FAQ

1) Is a ZFS resilver always incremental?

Often, yes: ZFS can rebuild only the blocks that matter based on metadata. But pool layout, fragmentation, and how the device was removed can make it behave closer to a full-device operation in practice.

2) Why does “scanned” move faster than “issued” during resilver?

Scanning is metadata traversal; issuing is actual reconstruction IO. If issuing lags, you’re bottlenecked on disk IO, contention, or device problems.

3) Should I pause production workload to speed up resilver?

If it’s a customer-facing system, prefer shaping the resilver first. Pausing workload can help, but it’s a blunt instrument. Use it when redundancy risk is high or when you can safely drain traffic.

4) Is it safe to tune zfs_resilver_min_time_ms and zfs_scan_vdev_limit live?

Typically yes on Linux via sysfs module parameters, but treat it as change management: small increments, measure impact, and record what you changed so you can revert.

5) Why did resilver get slower when the pool is >80% full?

Allocator flexibility drops and fragmentation tends to rise, which increases random IO and metadata overhead. It’s not a moral failing; it’s geometry and entropy.

6) Do mirrors resilver faster than RAIDZ?

Usually. Mirrors have simpler reconstruction and often better parallelism characteristics. RAIDZ resilver can involve more parity math and more IO amplification, depending on workload and fragmentation.

7) Should I run a scrub right after resilver completes?

If your operational policy requires periodic scrubs, schedule it—just not immediately if the system is still hot and production load is high. Give the pool time to breathe.

8) What’s the single best metric to watch?

For risk: time remaining until redundancy is restored (issued/resilvered progress). For customer impact: p99 latency and disk await/queue depth.

9) Can I make resilver fast by adding L2ARC?

Not reliably. L2ARC helps read caching but can add write traffic and metadata overhead. During resilver, your bottleneck is usually disk IO and reconstruction, not “more cache devices.”

10) Does a SLOG make resilver faster?

Indirectly. A good SLOG can protect sync-write latency so production suffers less, which lets you keep resilver more aggressive without breaking apps. It doesn’t magically accelerate the rebuild itself.

Conclusion: next steps you can do this week

Resilver priority isn’t a magic setting; it’s operations with a speed limit. The goal is simple: restore redundancy fast enough to reduce risk, while keeping production IO predictable enough that your incident doesn’t create its own sequel.

Next steps that pay off immediately:

  1. Write down two modes (business-hours vs emergency) with your chosen scan/resilver settings and when to use them.
  2. Add monitoring that graphs issued/resilvered progress alongside p99 latency and disk await.
  3. Audit your fleet for SMR surprises, near-full pools, and scrub schedules that ignore degraded states.
  4. Practice once on a non-critical pool: replace a disk, watch behavior, adjust one knob, and record the results.