ZFS Scrub Scheduling: How to Avoid Peak-Hour Pain


You schedule a scrub because you’re responsible. ZFS does what you asked because it’s obedient.
Then Monday morning arrives and your nice, stable latency graph turns into a comb: VM storage stalls,
database queries drag, and someone says the words “network issue” with a straight face.

The scrub didn’t “break” anything. It just showed you—loudly—how thin your performance margins were,
and how little you controlled background I/O. This is the playbook for running scrubs in production
without turning your busiest hours into a live-fire exercise.

What a scrub really does (and why it hurts)

A ZFS scrub walks the pool and verifies data integrity by reading blocks and validating checksums.
If redundancy exists and ZFS detects a bad block, it repairs it by reading a good copy and rewriting
the bad one. Scrub is not a “filesystem scan” in the old sense; it’s a systematic audit of stored data.
That audit has a cost: sustained reads, some writes, and a lot of device queue pressure.

The pain comes from contention. Your production workload wants low-latency random I/O. Scrub wants
high-throughput sequential-ish reads (but not perfectly sequential—metaslabs, fragmentation, and
recordsize realities make it messy). Both hit the same vdevs, share the same queues, and fight over
headroom. If you have HDDs, expect scrub to drag the average seek time into the daylight. If you have
SSDs, scrub can still eat IOPS budget and controller bandwidth, and it can amplify garbage collection
awkwardness.

One more thing: scrub competes with resilvering, and resilvering is not optional. Scrub is elective
surgery; resilver is the ambulance ride. If you schedule scrubs so aggressively that resilvers are
constantly “slow but steady,” you are expanding your risk window. A slow resilver isn’t just annoying.
It’s time spent exposed to a second failure.

“But ZFS is copy-on-write, so it shouldn’t interfere much.” That’s a comforting sentence, not a plan.
CoW changes write behavior and consistency semantics; it doesn’t give you infinite I/O lanes.

Joke #1: A scrub during peak hours is like vacuuming during a Zoom call—technically productive, socially catastrophic.

Scrub vs. resilver vs. SMART long tests

These three are often lumped together under “maintenance” and then scheduled like a dentist appointment.
They are not interchangeable:

  • Scrub: Reads through the pool to verify checksums; repairs silent corruption using redundancy.
  • Resilver: Reconstructs data onto a replacement device; priority is data safety, not convenience.
  • SMART long test: Device self-test; can catch drive issues but doesn’t validate redundancy or ZFS checksums.

A sane production posture uses all three, but never pretends one replaces another. Scrub tells you whether
your stored data remains readable and correct. SMART tells you whether a device is feeling honest today.
Resilver tells you how fast you can stop sweating after a disk swap.
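
If you want the SMART leg of that trio to actually run, it’s two commands per device with smartmontools
installed (the device path here is just an example):

cr0x@server:~$ sudo smartctl -t long /dev/sdd
cr0x@server:~$ sudo smartctl -l selftest /dev/sdd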

Facts and history worth knowing

These aren’t trivia-night facts. They change how you schedule and how you interpret outcomes.

  1. Scrubs exist because silent corruption is real. “Bit rot” isn’t a myth; ZFS’ end-to-end checksums were built to detect it.
  2. ZFS made checksumming mainstream in general-purpose storage. When Sun shipped ZFS in the mid-2000s, end-to-end integrity was not standard on commodity filesystems.
  3. Scrub is a pool operation, not a dataset operation. You can’t scrub only “the important dataset” and call it coverage.
  4. Scrub reads can still turn into writes. If ZFS finds bad data and can repair it, it will rewrite corrected blocks.
  5. RAIDZ rebuild characteristics differ from mirrors. Mirrors can often repair using a straightforward alternate copy; RAIDZ needs parity math and can behave differently under load.
  6. Big pools make “monthly scrub” a lie. If your scrub takes 10 days, “monthly” really means “always.” That’s a scheduling failure, not a calendar problem.
  7. Scrub competes with ARC and prefetch behavior. The cache can help or hurt depending on memory pressure and workload; scrubs can displace hot application data.
  8. vdev layout dominates scrub behavior. Adding vdevs adds parallelism; adding bigger disks adds duration. “Same size pool” does not imply “same scrub time.”
  9. Some scrub slowdowns are side effects of heavy writes. Lots of writes during a scrub increase fragmentation, which makes the current pass messier and future scrubs slower.

Paraphrased idea from John Allspaw: “Reliability comes from designing for failure, not pretending it won’t happen.”
Scrubs are one of the tools that keep failure honest. Scheduling is how you keep the tool from hurting you.

Choose a scrub policy: frequency, windows, and expectations

Frequency: stop copying “monthly” from the internet

“Scrub monthly” is a decent default for many pools, and a terrible rule for others. Frequency should be set by:
(1) how fast you can scrub, (2) how quickly you want to detect latent corruption, and (3) how much risk you
incur by running scrubs under load.

Practical guidance that holds up in production:

  • HDD RAIDZ, large capacity: monthly can be okay if scrubs finish in a day or two. If not, consider every 6–8 weeks and invest in reducing scrub duration (more vdevs, better layout) instead of turning your pool into a perpetual background job.
  • HDD mirrors for latency-sensitive workloads: every 2–4 weeks is often realistic because mirrors scrub faster and the workload often cares about quick detection.
  • All-flash pools: frequency can be higher, but don’t do it “because SSDs are fast.” Controllers saturate, and your busiest hours still matter.
  • Archive pools with low churn: less frequent may be acceptable, but only if you can tolerate longer time-to-detect for silent corruption.

Windowing: pick time by observing, not by guessing

The best scrub window is the one your users don’t notice. That’s not always “2 AM Sunday.” In global companies,
Sunday 2 AM is Monday 10 AM somewhere. In batch-heavy companies, nights are busier than days. In backup-heavy
shops, weekends are their own form of violence.

You pick a window the way you pick a maintenance window for a database: by measuring read latency,
queue depth, and CPU steal, then choosing the least-bad period. If your telemetry is weak, start with:
(a) the hour with lowest 95th percentile disk latency, (b) the hour with lowest synchronous write load,
(c) the hour least likely to be consumed by backups.
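
If you have no long-term telemetry at all, even a crude day-long sample gives you something to argue from.
A sketch, assuming the pool is named tank and your OpenZFS is recent enough for the -l latency columns:

cr0x@server:~$ nohup zpool iostat -l tank 60 1440 >> /var/tmp/tank-latency-$(date +%F).log &

That’s one sample a minute for roughly a day; graph it however you like and pick the quiet band.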

Expectation setting: define “allowed pain”

If you don’t define the acceptable impact, the scrub will define it for you. Put numbers on it:

  • Max acceptable increase in read latency (e.g., +3 ms at p95 for HDD, +0.5 ms for SSD).
  • Max acceptable reduction in IOPS (e.g., not below 70% of baseline for critical pools).
  • Abort conditions (e.g., cancel scrub if pool latency exceeds threshold for 15 minutes).

This matters because ZFS will happily scrub through a fire unless you tell it otherwise.
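
Here is a minimal sketch of what an abort guard can look like. Everything in it is a placeholder you must
adapt: the script path, the pool name, the sd[a-f] disk pattern, and the 30 ms threshold. Run it from cron
every few minutes during the scrub window.

cr0x@server:~$ cat /usr/local/sbin/zfs-scrub-abort-guard
#!/usr/bin/env bash
# Hypothetical abort guard: cancel a running scrub if average read latency on the
# pool's disks exceeds a threshold. Requires sysstat's iostat.
set -euo pipefail
POOL="tank"
DISK_RE="^sd[a-f]$"
LIMIT_MS=30

# Nothing to do if no scrub is running
zpool status "$POOL" | grep -q "scrub in progress" || exit 0

# Second iostat report is a real 30-second sample (the first is since-boot averages)
AVG=$(iostat -dx 30 2 | awk -v re="$DISK_RE" '
  $1 ~ /^Device/ { hdr++; for (i = 1; i <= NF; i++) if ($i == "r_await") col = i; next }
  hdr == 2 && col && $1 ~ re { sum += $col; n++ }
  END { if (n) printf "%.0f", sum / n; else print 0 }')

if [ "$AVG" -ge "$LIMIT_MS" ]; then
  echo "$(date -Is) $POOL: average r_await ${AVG}ms >= ${LIMIT_MS}ms, canceling scrub"
  zpool scrub -s "$POOL"
fi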

Monitoring that actually predicts pain

“Scrub is slow” is not a metric. Scrub is a workload. You need the same signals you’d want for any workload:
throughput, latency, concurrency, and saturation. And you need ZFS-specific context: vdev health,
checksum errors, and whether the pool is repairing.

What to watch during scrub

  • Pool scan rate and ETA from zpool status.
  • vdev-level utilization (queue depth, await, svctm where applicable) via iostat or zpool iostat.
  • Application latency (database p95/p99, VM storage latency).
  • ARC behavior: cache hit ratio shifts and memory pressure can turn a scrub into a cache eviction party.
  • Error counters: read/write/checksum errors, plus SMART hints.

The trap: watching only throughput. High MB/s can still mean terrible latency for small synchronous I/O.
Your users don’t experience MB/s; they experience waiting.
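
If dashboards are thin, two terminals cover most of that list while the scrub runs (pool name is an example;
the -l latency columns need a reasonably recent OpenZFS):

cr0x@server:~$ watch -n 30 "zpool status tank | grep -A 2 'scan:'"
cr0x@server:~$ zpool iostat -v -l tank 10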

Fast diagnosis playbook

When someone says “everything is slow” and you suspect scrub, you need a 3-minute path to a credible answer.
This is the order I use in production because it converges quickly.

First: confirm scrub/resilver and whether it’s repairing

  • Check if a scrub is running and its scan rate.
  • Check whether it’s finding errors (repair work increases write load).
  • Check whether a resilver is happening (that changes priorities).

Second: identify the saturated resource (disk, CPU, or something pretending to be disk)

  • If disks show high await/queue depth and low idle, you’re I/O bound.
  • If CPU is pinned in kernel threads (or iowait dominates), the system is struggling to feed I/O.
  • If network storage paths are involved (iSCSI/NFS), confirm you’re not debugging the wrong layer.

Third: find the single worst vdev

ZFS performance is shaped by the slowest vdev when you need uniform progress. One marginal disk can
drag scrub time and increase time-in-risk. Use per-vdev stats and error counts; don’t guess.
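
The first three steps condense into a handful of commands (pool and device names are examples):

cr0x@server:~$ zpool status -v tank | grep -E 'scan:|repaired|resilver|errors:'
cr0x@server:~$ iostat -dx 5 2
cr0x@server:~$ zpool iostat -v tank 5 3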

Fourth: decide “throttle, move, or abort”

  • Throttle if the pool is healthy and you just need lower impact.
  • Move the window if this is a predictable collision with batch jobs/backups.
  • Abort if latency is causing customer impact and you can safely resume later.
  • Do not abort a resilver unless you are absolutely sure of what you’re trading away.

Practical tasks: commands, outputs, and decisions

These are real commands you can run. Each task includes: command, what the output means, and the decision you make.
I’m assuming typical OpenZFS on Linux or FreeBSD. Some knobs differ by platform; the diagnostic logic doesn’t.

Task 1: Confirm whether a scrub is running and how far along it is

cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Mon Dec 23 01:10:12 2025
        3.12T scanned at 612M/s, 1.74T issued at 341M/s, 21.4T total
        0B repaired, 8.12% done, 17:21:33 to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0

errors: No known data errors

Meaning: Scrub is active, scan and issue rates differ (some reads queued but not issued yet),
ETA is long. No repairs. Errors are clean.
Decision: If this overlaps peak hours, throttle or reschedule rather than panic. If ETA is absurd,
suspect vdev bottleneck or competing workload.

Task 2: Get per-vdev scrub I/O rates to spot a laggard

cr0x@server:~$ zpool iostat -v tank 5 3
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        10.2T  11.2T  3.10K   210   320M  22.1M
  raidz2-0                  10.2T  11.2T  3.10K   210   320M  22.1M
    sda                         -      -    520    35  54.1M  3.7M
    sdb                         -      -    515    34  53.8M  3.6M
    sdc                         -      -    518    36  54.0M  3.8M
    sdd                         -      -    110    33  11.2M  3.6M
    sde                         -      -    521    35  54.3M  3.7M
    sdf                         -      -    516    37  54.0M  3.8M
--------------------------  -----  -----  -----  -----  -----  -----

Meaning: One disk (sdd) is reading far less than its peers. That’s your scrub time thief.
Decision: Pull SMART stats and error logs for sdd. If it’s slow but error-free, it may still be failing
(timeouts, reallocated sectors pending, or firmware issues). Consider replacing proactively.
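
Pulling SMART data for the laggard is the obvious next step (smartmontools assumed; adjust the device path):

cr0x@server:~$ sudo smartctl -H -A /dev/sdd
cr0x@server:~$ sudo smartctl -l error /dev/sdd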

Task 3: Check device error counters and whether ZFS is silently “giving up” and retrying

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Mon Dec 23 01:10:12 2025
        3.30T scanned at 585M/s, 1.92T issued at 340M/s, 21.4T total
        0B repaired, 8.95% done, 16:58:01 to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       2     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0

errors: No known data errors

Meaning: sdd has read errors counted. Even if ZFS repaired transparently, this is a signal.
Decision: Treat it as a reliability event. Don’t wait for it to “get worse.” Plan replacement,
and watch for escalating counters during the scrub.

Task 4: On Linux, check udev names and map disks correctly before replacing anything

cr0x@server:~$ ls -l /dev/disk/by-id/ | grep -E 'sdd$'
lrwxrwxrwx 1 root root  9 Dec 26 10:02 ata-WDC_WD120EDAZ-11F3RA0_9GJ3K2AA -> ../../sdd
lrwxrwxrwx 1 root root 10 Dec 26 10:02 wwn-0x50014ee2b6d12345 -> ../../sdd

Meaning: You have stable identifiers for sdd. Good. Stop using /dev/sdX in replacement procedures
if you like sleeping.
Decision: Use the by-id path in zpool replace and in documentation for on-call.
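
When replacement day comes, the command takes the old device as ZFS names it in zpool status plus a stable
identifier for the new disk; the new-disk ID below is a placeholder:

cr0x@server:~$ sudo zpool replace tank sdd /dev/disk/by-id/ata-NEWDISK_MODEL_SERIAL

ZFS treats the replacement as a resilver, so watch it with the same per-vdev discipline you use for scrubs.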

Task 5: Check whether application latency correlates with scrub

cr0x@server:~$ iostat -x 5 2
Linux 6.8.0 (server) 	12/26/2025 	_x86_64_	(32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.30    0.00    3.10   18.40    0.00   72.20

Device            r/s     w/s   rMB/s   wMB/s  rrqm/s  wrqm/s  %util  await
sda             92.0    6.0    9.6     0.7      0.0     0.3   74.0   14.8
sdb             91.5    6.1    9.6     0.7      0.0     0.2   73.1   15.2
sdc             92.2    6.0    9.6     0.7      0.0     0.2   73.9   15.0
sdd             19.3    5.9    2.1     0.7      0.0     0.1   99.0   87.4
sde             91.8    6.0    9.6     0.7      0.0     0.2   74.2   14.9
sdf             91.6    6.2    9.6     0.7      0.0     0.2   73.5   15.1

Meaning: sdd is pegged at 99% util with huge await. Others are fine-ish. iowait is elevated.
This is classic “one disk ruins the party.”
Decision: Throttling scrub might reduce pain, but it won’t fix sdd. Prioritize device health triage.

Task 6: Validate whether the pool is constrained by one vdev or by overall bandwidth

cr0x@server:~$ zpool iostat tank 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        10.2T  11.2T  3.05K   230    318M  24.0M
tank        10.2T  11.2T  3.12K   240    326M  25.0M
tank        10.2T  11.2T  2.98K   225    310M  23.1M
tank        10.2T  11.2T  1.40K   980    145M  96.2M
tank        10.2T  11.2T  3.10K   235    320M  24.3M

Meaning: One interval shows write spikes; likely application writes colliding with scrub.
Decision: If write-heavy jobs overlap (backups, compaction, log rotation to disk), move the scrub window
or throttle scrub to protect latency.

Task 7: Check whether ARC pressure is causing collateral damage

cr0x@server:~$ arcstat 5 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c
10:14:05  812K  124K     15  108K  87%   10K   8%    6K   5%  96.0G  110G
10:14:10  790K  210K     27  198K  94%    8K   4%    4K   2%  96.0G  110G
10:14:15  820K  260K     32  250K  96%    7K   3%    3K   1%  96.0G  110G

Meaning: Miss percentage increases during scrub; demand misses dominate. Scrub is likely displacing hot data.
Decision: If your workload is cache-sensitive, throttle scrub, consider scheduling when cache churn is low,
and evaluate whether ARC sizing is appropriate for peak.

Task 8: Check ZFS event logs around the time pain started

cr0x@server:~$ zpool events -v | tail -n 12
TIME                           CLASS
Dec 26 2025 10:02:11.123456789 ereport.fs.zfs.io
    pool = tank
    vdev_path = /dev/disk/by-id/wwn-0x50014ee2b6d12345
    vdev_guid = 1234567890123456789
    errno = 5
    io_priority = scrub
Dec 26 2025 10:02:14.987654321 ereport.fs.zfs.io
    pool = tank
    vdev_path = /dev/disk/by-id/wwn-0x50014ee2b6d12345
    errno = 5
    io_priority = scrub

Meaning: I/O errors on a specific vdev during scrub. errno 5 is I/O error.
Decision: Stop debating scheduling and start planning replacement. A scrub is doing its job by finding weak links.

Task 9: Verify scrub history and whether it’s regularly completing

cr0x@server:~$ zpool status tank | sed -n '1,20p'
  pool: tank
 state: ONLINE
  scan: scrub in progress since Mon Dec 23 01:10:12 2025
        3.45T scanned at 610M/s, 2.05T issued at 362M/s, 21.4T total
        0B repaired, 9.30% done, 16:21:02 to go

Meaning: You get current scrub progress, but for historical completion you should log it externally
or query platform tooling. Many teams only notice scrubs “take forever” after a year of silent failures.
Decision: Start recording scrub start/finish/repair bytes in monitoring. If scrubs don’t complete between windows,
change frequency or pool architecture.
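
Until real monitoring exists, even a dumb append-only log beats memory. A sketch (file names, pool, and
interval are examples; the zpool binary may live in /sbin on your distro):

cr0x@server:~$ cat /etc/cron.d/zfs-scrub-history
# Record scrub/resilver progress every 30 minutes
*/30 * * * * root /usr/sbin/zpool status tank | grep -A 2 'scan:' >> /var/log/zfs-scrub-history.log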

Task 10: Start a scrub intentionally (and understand what “start” means)

cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ zpool status tank | head -n 8
  pool: tank
 state: ONLINE
  scan: scrub in progress since Thu Dec 26 10:20:55 2025
        0B scanned at 0B/s, 0B issued at 0B/s, 21.4T total
        0B repaired, 0.00% done, no estimated completion time

Meaning: Scrub queued and started; initial rates may show 0 until it ramps and stats refresh.
Decision: If you’re testing scheduling, start scrubs manually in a controlled window, then measure impact.

Task 11: Pause/stop a scrub when you must (and accept the trade)

cr0x@server:~$ sudo zpool scrub -s tank
cr0x@server:~$ zpool status tank | head -n 8
  pool: tank
 state: ONLINE
  scan: scrub canceled on Thu Dec 26 10:27:12 2025
        3.62T scanned at 600M/s, 2.11T issued at 350M/s, 21.4T total
        0B repaired, 9.86% done

Meaning: zpool scrub -s cancels the scrub outright; the canceled run’s progress is discarded and the next scrub starts over from the beginning.
Decision: Cancel only when customer impact demands it, then reschedule a new scrub in a safer window soon.
If you canceled because of a failing disk, replacement comes first.
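
On OpenZFS 0.7 and newer there is a gentler option: pause instead of cancel. A paused scrub keeps its
progress, and issuing the scrub command again resumes it:

cr0x@server:~$ sudo zpool scrub -p tank
cr0x@server:~$ sudo zpool scrub tank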

Task 12: On Linux, check scrub-related module parameters (example: minimum scan time per txg)

cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_scrub_min_time_ms
1000
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_resilver_min_time_ms
3000

Meaning: These set the minimum time the sync thread spends on scrub or resilver work between transaction
group syncs (note that resilver gets more time than scrub by default). Names and defaults vary by
distro/version, but this is how you verify whether the system is configured to yield or to run aggressively.
Decision: If scrubs crush latency, increase yielding/throttling (when supported) rather than simply moving the schedule.
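
Mechanically, changing one of these on Linux looks like the following; the value is an illustration, not a
recommendation, and parameter names differ across OpenZFS versions:

cr0x@server:~$ echo 500 | sudo tee /sys/module/zfs/parameters/zfs_scrub_min_time_ms
500
cr0x@server:~$ cat /etc/modprobe.d/zfs.conf
# Persist the change across reboots (example value only)
options zfs zfs_scrub_min_time_ms=500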

Task 13: Confirm TRIM/autotrim settings so scrubs aren’t blamed for SSD housekeeping

cr0x@server:~$ zpool get autotrim tank
NAME  PROPERTY  VALUE     SOURCE
tank  autotrim  off       default

Meaning: Autotrim is off; SSD free space reclamation might occur in other ways and at inconvenient times.
Decision: If you run SSD pools, decide deliberately: enable autotrim if appropriate for your environment,
and avoid coinciding heavy trim with scrubs unless tested.
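
If you decide TRIM belongs on, the switch and a manual out-of-window run look like this (test on your
hardware before making it policy):

cr0x@server:~$ sudo zpool set autotrim=on tank
cr0x@server:~$ sudo zpool trim tank
cr0x@server:~$ zpool status -t tank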

Task 14: Measure real application I/O patterns during scrub (not just ZFS stats)

cr0x@server:~$ pidstat -d 5 2
Linux 6.8.0 (server) 	12/26/2025 	_x86_64_	(32 CPU)

10:33:10      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
10:33:15      999     18321  20480.00   5120.00      0.00  postgres
10:33:15        0      1287      0.00  86016.00      0.00  z_wr_iss
10:33:15        0      1288  327680.00      0.00      0.00  z_rd_iss

10:33:15      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
10:33:20      999     18321  19456.00   6144.00      0.00  postgres
10:33:20        0      1287      0.00  90112.00      0.00  z_wr_iss
10:33:20        0      1288  335872.00      0.00      0.00  z_rd_iss

Meaning: You can see scrub kernel threads doing heavy reads, while the database is also active.
Decision: If the workload is latency-sensitive (databases, VM backing stores), schedule scrubs in lower-demand
windows or throttle. If you must run during business hours, you need guardrails.

Throttling and tuning: how to make scrubs behave

Scheduling is necessary, but not sufficient. In many environments you cannot find a truly idle window.
You still need scrubs to run. So you control the blast radius.

Principle: protect latency, accept longer duration

Your users don’t care that the scrub finished in 9 hours instead of 14. They care that the API
stopped timing out. A longer scrub is fine if it stays within your acceptable risk window and
doesn’t overlap too frequently. If your scrubs become “always on,” that’s not a throttling problem.
That’s a capacity and architecture problem.

OS-level I/O priority: coarse, and rarely enough

On Linux it’s tempting to start the scrub under a lower I/O priority class. Understand what you’re buying:
zpool scrub only submits the request and exits, and the actual scrub I/O is issued by kernel threads through
ZFS’s own I/O scheduler, so ionice on the initiating process usually changes little. Treat it as a cheap
experiment, not a lever you can count on.

cr0x@server:~$ sudo ionice -c3 zpool scrub tank
cr0x@server:~$ zpool status tank | head -n 6
  pool: tank
 state: ONLINE
  scan: scrub in progress since Thu Dec 26 11:02:01 2025
        158G scanned at 510M/s, 92.4G issued at 298M/s, 21.4T total

Meaning: Scrub is running; you asked for idle-class I/O on the process that started it.
Decision: Measure the latency impact. If it drops, keep the habit. If nothing changes (the common case), don’t waste time pretending and reach for the ZFS-level knobs instead.

ZFS scan tunables: use sparingly, test aggressively

OpenZFS exposes scan behavior knobs (names and availability vary by platform/version). Some influence
how scan work yields to other I/O, how long scan threads run before sleeping, and how aggressively
the system tries to use available bandwidth.

You should treat these like database knobs: default is reasonable, changes have side effects, and you
only change them with measurement and a rollback plan.

What tends to work:

  • Increase yielding / idle behavior so scrub backs off under load.
  • Reduce scan intensity if latency is the primary SLO.
  • Keep resilver priority higher than scrub; don’t accidentally slow your recovery path.

What tends to backfire:

  • Cranking scan aggressiveness to “finish faster” and then discovering you can’t run scrubs at all during the work week.
  • Turning knobs without understanding the pool is actually blocked by one dying disk (no tune fixes physics).

Workload-aware scheduling beats clever tuning

If your workload has predictable spikes—ETL at 01:00, backups at 02:00, compaction at 03:00—don’t tune ZFS
to fight those spikes. Move the scrub away from them. Tuning is for smoothing edges, not ignoring traffic patterns.

Scheduling mechanics: cron, systemd timers, and guardrails

Scheduling is not “run on Sunday.” Scheduling is “run when safe, and stop when unsafe.” That means you need:
(1) an automated trigger, (2) a safety check, and (3) observability when it runs.

Cron: simple, reliable, and brutally honest

Cron is fine if you add a wrapper script that checks pool health, current load, and whether a scan is already running.
The wrapper is where professionalism lives.

cr0x@server:~$ cat /usr/local/sbin/zfs-scrub-guard
#!/usr/bin/env bash
set -euo pipefail

POOL="${1:-tank}"

# Refuse if a scrub/resilver is already running
if zpool status "$POOL" | grep -qE "scan: (scrub|resilver) in progress"; then
  echo "$(date -Is) $POOL: scan already in progress; exiting"
  exit 0
fi

# Refuse if pool is degraded
if ! zpool status "$POOL" | grep -q "state: ONLINE"; then
  echo "$(date -Is) $POOL: pool not ONLINE; exiting"
  exit 1
fi

# Refuse if 1-minute load is too high (example threshold)
LOAD1=$(cut -d' ' -f1 /proc/loadavg)
LOAD1_INT=${LOAD1%.*}
if [ "$LOAD1_INT" -ge 20 ]; then
  echo "$(date -Is) $POOL: load too high ($LOAD1); exiting"
  exit 0
fi

echo "$(date -Is) $POOL: starting scrub"
exec zpool scrub "$POOL"
cr0x@server:~$ sudo crontab -l
# Scrub on the first Sunday of the month at 01:30.
# Cron ORs day-of-month and day-of-week, so gate on the weekday inside the command instead of using both fields.
30 1 1-7 * * [ "$(date +\%u)" -eq 7 ] && /usr/local/sbin/zfs-scrub-guard tank >> /var/log/zfs-scrub.log 2>&1

Meaning: The script prevents overlapping scans, avoids scrubbing degraded pools, and skips during high load.
Decision: Adjust thresholds to your environment. If you don’t have an SLO-based threshold, you’re guessing—start measuring.

systemd timers: better state, better reporting

systemd timers shine when you want missed-run catch-up behavior, standardized logging, and easy disable/enable controls.
In production, this matters because you will eventually have a maintenance freeze or an incident where you want to pause scrubs.

cr0x@server:~$ cat /etc/systemd/system/zfs-scrub@.service
[Unit]
Description=Guarded ZFS scrub for pool %i

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/zfs-scrub-guard %i
cr0x@server:~$ cat /etc/systemd/system/zfs-scrub@.timer
[Unit]
Description=Monthly ZFS scrub timer for pool %i

[Timer]
OnCalendar=Sun *-*-01..07 01:30:00
Persistent=true

[Install]
WantedBy=timers.target
cr0x@server:~$ sudo systemctl enable --now zfs-scrub@tank.timer
Created symlink /etc/systemd/system/timers.target.wants/zfs-scrub@tank.timer → /etc/systemd/system/zfs-scrub@.timer.
cr0x@server:~$ systemctl list-timers | grep zfs-scrub
Sun 2026-01-04 01:30:00 UTC  1 week 1 day left  Sun 2025-12-01 01:30:00 UTC  zfs-scrub@tank.timer  zfs-scrub@tank.service

Meaning: You have a predictable schedule with persistence (missed runs execute after downtime).
Decision: If “Persistent=true” would cause a scrub to start immediately after a reboot into peak hours,
disable persistence or add a “business hours” guard.
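
A business-hours guard is a few lines in the same wrapper script; the 07:00–22:00 band is an example, not advice:

# Optional addition to zfs-scrub-guard: refuse to start during business hours
HOUR=$((10#$(date +%H)))
if [ "$HOUR" -ge 7 ] && [ "$HOUR" -lt 22 ]; then
  echo "$(date -Is) $POOL: inside business hours; exiting"
  exit 0
fi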

Guardrails that prevent self-inflicted wounds

  • Don’t start a scrub if a resilver is active. Let recovery finish.
  • Don’t start a scrub on a degraded pool unless you have a reason. Scrub will add load during a vulnerable state.
  • Don’t run scrubs concurrently across all pools on the same host. Stagger them; your HBA and backplane have feelings too (a guard for this is sketched after this list).
  • Make scrubs observable. Write start/stop events to a log you actually read, and alert on “scrub didn’t complete in N days.”
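
Enforcing the stagger rule can live in the same wrapper, too (a minimal sketch):

# Optional addition to zfs-scrub-guard: skip if any pool on this host is already scanning
if zpool status | grep -qE "(scrub|resilver) in progress"; then
  echo "$(date -Is) $POOL: another scan is active on this host; exiting"
  exit 0
fi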

Joke #2: If you can’t tell when the scrub ran, it’s basically Schrödinger’s maintenance—both done and not done until the outage.

Three corporate mini-stories from the scrub trenches

1) Incident caused by a wrong assumption: “scrub is just reads”

A mid-sized SaaS company ran ZFS-backed VM storage on a set of HDD RAIDZ2 pools. They had a monthly scrub
scheduled at 02:00 local time, because that’s what the previous admin did. It worked—until the company
moved a chunk of customers to a different region and “quiet hours” stopped being quiet.

The on-call saw latency spikes and assumed scrub couldn’t be the culprit because “scrub is read-only.”
They chased network graphs, tuned database connection pools, and even rolled back an application deploy.
Meanwhile, ZFS was repairing a small number of blocks found during scrub. That repair created writes,
which collided with synchronous VM guest writes. The result was a perfect storm of queue depth and tail latency.

The clue was in plain sight: zpool status showed non-zero repaired bytes and a slow device.
But nobody had trained the team to interpret “issued vs scanned” rates or to treat repair activity as write pressure.
So the scrub kept running, customers kept timing out, and the incident lasted longer than it needed to.

The fix was unglamorous: move the scrub window, add a load-based guard, and set an abort threshold tied to storage latency.
They also added alerts when repaired bytes are non-zero during scrub, because “scrub is reads” stopped being a comforting story.

2) Optimization that backfired: “finish faster by making it aggressive”

An enterprise analytics team had an all-flash ZFS pool feeding a cluster of query engines. Scrubs were taking longer
as the dataset grew, and someone decided to “speed them up” with more aggressive scan behavior. The plan: use as much
bandwidth as possible so scrubs finish before the Monday ETL.

It did finish faster. It also hammered the SSD controllers hard enough that background garbage collection became a
visible part of the performance profile. Latency spikes appeared not during scrub itself, but immediately after,
when the controllers tried to clean up. Worse, the ARC got polluted with scrub reads, and the query engines lost cache locality.

The team’s first conclusion was that the query engine was “unstable.” They tried to tune the application, then the OS,
then the network. The real story was simpler: they optimized for scrub completion time, not for user experience.
They won a race nobody asked them to run.

The rollback was instructive: return scan knobs to defaults, reduce scrub intensity, and schedule scrubs in shorter
daily slices rather than one aggressive session. Completion time increased, but tail latency flattened.
Production regained its manners.

3) Boring but correct practice that saved the day: staggered scrubs plus “one bad disk” alerts

A financial services infrastructure team ran multiple ZFS pools per host, mixing mirrors for databases and RAIDZ for archives.
Their scrub policy looked dull: weekly mirrors, monthly RAIDZ, always staggered by pool, and never overlapping with backups.
They also had a standard alert: “any vdev with 3x higher await than peers during scrub” triggers investigation.

One quarter, during a routine mirror scrub, the alert fired for a single SSD that was still “ONLINE” with zero ZFS errors.
The device wasn’t failing loudly. It was failing quietly, by occasionally stalling long enough to create a tail-latency spike.
Because the team watched per-vdev behavior during scrub, they saw it early—before the database workload became collateral damage.

They swapped the device in a planned maintenance window. A week later, SMART reports started showing worsening indicators.
In other words: the boring practice caught the problem before it graduated into incident territory.

Nobody got a trophy. Nobody wrote a postmortem. That’s the point.

Common mistakes: symptoms → root cause → fix

1) Symptom: “Every scrub takes longer than the last one”

Root cause: Pool fragmentation and growth; added data increases scan set; one disk is aging; or scrubs overlap with heavier workloads as usage changes.

Fix: Identify per-vdev laggards with zpool iostat -v. Replace slow devices proactively. Re-evaluate window. If scrubs don’t complete between windows, change frequency or expand vdev parallelism.

2) Symptom: “Latency spikes only during scrub; throughput looks fine”

Root cause: Queue depth saturation. Scrub is consuming disk service time, harming small sync I/O.

Fix: Throttle scrub (platform tunables, lower I/O priority), move the window, and set abort thresholds based on p95/p99 latency rather than MB/s.

3) Symptom: “Scrub ETA jumps around wildly”

Root cause: Competing workloads or a vdev intermittently stalling; sometimes controller/HBA issues.

Fix: Correlate with iostat -x and per-vdev zpool iostat -v. Check zpool events -v for I/O errors/timeouts. Investigate cabling/backplane if stalls are bursty across multiple disks.

4) Symptom: “Scrub finds checksum errors, but pool stays ONLINE”

Root cause: ZFS repaired from redundancy; underlying media or path errors exist.

Fix: Treat as a hardware/path incident anyway. Pull SMART, check controller logs, and plan device replacement if errors recur. Scrub “fixed it” is not the same as “problem resolved.”

5) Symptom: “Scrubs are always running; there’s never a quiet period”

Root cause: Scrub duration exceeds frequency, often due to huge capacity with too little vdev parallelism or constant workload.

Fix: Increase parallelism (more vdevs), reduce dataset churn where possible, accept less frequent scrubs with targeted monitoring, and ensure the system has enough performance headroom for maintenance.

6) Symptom: “After enabling aggressive settings, scrubs are faster but the next day is slow”

Root cause: Cache pollution, SSD GC side effects, or disrupted I/O patterns. You moved the pain, not removed it.

Fix: Revert to defaults, run scrubs in smaller slices, and measure latency over 24 hours—not just during the scrub window.

7) Symptom: “Scrub starts right after reboot and hurts peak traffic”

Root cause: systemd timer persistence or catch-up behavior triggered immediately post-boot.

Fix: Disable persistence or add time-of-day/business-hours guard checks. Make “post-reboot stabilization time” a rule.

8) Symptom: “Scrub is slow only on one pool, on one host”

Root cause: Specific HBA lane/backplane issues, a single marginal disk, or different vdev geometry than assumed.

Fix: Compare vdev layouts, confirm link speeds, and use per-vdev iostat to spot the outlier. Replace hardware, not hope.

Checklists / step-by-step plan

Step-by-step: establish a scrub schedule that won’t hurt peak hours

  1. Measure baseline latency and utilization for a week. You need p95/p99 disk latency and per-vdev utilization, not just pool throughput.
  2. Run a controlled scrub test in a low-risk window. Record latency deltas and scan rate.
  3. Pick a window based on data, not tradition. If your “off-peak” is actually backup time, don’t fight it.
  4. Define abort thresholds (latency, error spikes, customer-impact SLO breaches). Decide who can cancel a scrub and when.
  5. Add guardrails: skip if pool not ONLINE, skip if resilvering, skip if load too high, skip if backups active.
  6. Stagger pools on the same host and across clusters. If everything scrubs at once, you invented a new peak hour.
  7. Log and alert on completion. “Scrub didn’t finish within N days” is an operational signal, not a curiosity.
  8. Review outcomes monthly: duration trend, errors found, repaired bytes, slowest vdev behavior.

Checklist: before you start a scrub in production

  • Pool state is ONLINE; no active resilver.
  • No known failing disks; SMART and ZFS counters are stable.
  • Backups/ETL/batch jobs are not scheduled to collide.
  • You can see per-vdev I/O and latency in monitoring.
  • You have a clear “cancel criteria” and a plan to reschedule.

Checklist: after the scrub completes

  • Confirm “errors: No known data errors” and review repaired bytes.
  • Compare duration and scan rate to last run; investigate regressions.
  • Identify any vdev outliers (slow or error-prone) and open a ticket for follow-up.
  • Write down whether the window caused user-visible impact. If yes, adjust.

FAQ

1) Should I run ZFS scrubs weekly or monthly?

Monthly is a common starting point, but set frequency by scrub duration and risk tolerance. If scrubs take many days,
monthly becomes continuous load—either scrub less often or redesign for more parallelism.

2) Is it safe to scrub during business hours?

It can be, if you throttle and you have headroom. If you don’t know your headroom, assume you don’t have it.
For latency-sensitive workloads, schedule off-peak or implement load/latency guardrails.

3) Why does scrub affect writes if it’s a read operation?

Because when ZFS finds bad data and redundancy allows repair, it rewrites corrected data. Also, scrub competes for
disk service time, which indirectly slows write I/O.

4) What’s the difference between “scanned” and “issued” in zpool status?

“Scanned” reflects work traversed; “issued” reflects I/O actually submitted. A big gap can indicate throttling,
contention, or stalls. Use per-vdev stats to find where the pipeline is clogged.

5) Should I cancel a scrub when performance tanks?

If customers are impacted and the pool is otherwise healthy, canceling and rescheduling is reasonable.
If a resilver is active, be far more cautious—recovery speed matters more than convenience.

6) Do scrubs wear out SSDs?

Scrubs are mostly reads, but repairs cause writes, and sustained reads still consume controller bandwidth and can
trigger background maintenance. The bigger risk is not “wear,” it’s performance impact and masking a failing device.

7) Can I scrub only part of a pool?

Scrub operates at pool level. You can’t reliably scrub “just this dataset” for integrity coverage. Plan windows
and intensity assuming pool-wide work.

8) My pool is huge. How do I avoid a scrub that never finishes?

First find out why it’s slow: per-vdev outliers, workload collisions, or insufficient parallelism. If the pool simply
can’t be scrubbed within a reasonable window, you need architectural change (more vdevs, different layout) or a revised
risk posture with strong monitoring.

9) Is a scrub required if I have backups?

Backups are not integrity verification for primary storage. Scrub detects and repairs corruption before you discover it
during restore—when time is already against you.

10) How do I know if a disk is “slow but not failing”?

Compare per-vdev utilization and await during scrub. If one disk consistently shows much higher await/%util than peers,
treat it as suspect even with zero ZFS errors. Quiet failures love being ignored.

Next steps that won’t ruin your week

ZFS scrubs are non-negotiable if you care about data integrity. But you don’t have to let them bully your peak hours.
Make scrubs boring: predictable windows, controlled intensity, and loud signals when something is off.

  1. Pick a scrub window using latency data, not habit.
  2. Add a guard script that refuses to start scrubs under load, during resilvers, or on degraded pools.
  3. Instrument per-vdev behavior so “one bad disk” is detected early.
  4. Define cancel criteria tied to customer impact, and rehearse the decision.
  5. Review scrub trends monthly; if duration grows, fix architecture or hardware before the pool makes the decision for you.

Do those things and scrubs become what they should be: routine integrity checks, not surprise performance experiments on your users.
