ZFS: Detecting Silent Data Corruption — What to Scrub, When, and Why

Most outages are loud: a controller dies, a link drops, a disk falls out of the pool like a bad actor leaving a meeting early. Silent corruption is the opposite. Your apps keep returning HTTP 200 while your data quietly rots. Then one day you restore a backup, run a report, open a photo, replay a video, or start a database, and you learn what “checksum mismatch” feels like in your stomach.

ZFS is one of the few mainstream filesystems that treats “my storage said it wrote the data” as a hypothesis to be tested, not a promise. Scrubs are how you test it at scale. Done right, they are boring, regular, and lifesaving. Done wrong, they’re either theater—or they become the reason your production latency graph looks like a seismograph.

Silent corruption: what it is (and why you won’t notice)

Silent data corruption is when bits change without your software being told. No I/O error. No kernel panic. No RAID controller screaming. The read succeeds and returns bytes—just not the bytes you wrote. That’s what makes it a reliability problem, not just a hardware problem.

There are a few common ways it happens:

  • Media errors that don’t surface as errors. Drives can return “successful” reads with wrong data due to internal faults, firmware bugs, marginal sectors, or recovery algorithms that guess incorrectly.
  • Bad writes. Power loss, write caching mistakes, or flaky cables/controllers can cause a write to land incorrectly or incompletely.
  • Memory corruption. The data was fine on disk, but RAM flipped a bit before it got checksummed or before it got written out.
  • DMA/cabling issues. A slightly-bad HBA, backplane, or cable can corrupt data in transit. The disk did exactly what it was told. The problem is you told it the wrong thing.

Classic filesystems generally have two choices: trust the stack below them, or bolt on partial checksumming for metadata only. ZFS doesn’t trust. It verifies.

One dry comfort: silent corruption is rare on well-built systems. One dry warning: “rare” is not a strategy. Neither is “we have RAID.”

Joke #1: RAID stands for “Redundant Array of Inexplicable Decisions” when you assume it verifies your data. It usually doesn’t.

How ZFS catches corruption: checksums, self-healing, and the limits

ZFS verifies data end-to-end by storing a checksum for every block (data and metadata) and validating it when the block is read. This matters because the checksum is stored separately from the data it protects—typically in the parent block pointer—so a single bad sector can’t conveniently damage both data and its checksum.

What “end-to-end” means in practice

When an application writes a block:

  1. ZFS computes a checksum of the block content.
  2. ZFS writes the new block to disk (copy-on-write), then updates metadata to point to it, including the checksum.
  3. Later, when the block is read, ZFS recomputes the checksum and compares it to the stored one.

If the checksum doesn’t match, ZFS knows the data is wrong. That is already a big deal: detection beats silent failure. But ZFS can also repair it if you gave it redundancy.
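The detect-on-read idea can be shown with a toy shell sketch. This is an analogy built from sha256sum and ordinary files, not ZFS internals; the point is only that the checksum lives apart from the data and is recomputed at read time:

```shell
#!/bin/sh
# Toy model of ZFS-style detect-on-read: keep the checksum apart from the
# data it protects, recompute at read time, compare before trusting bytes.
set -eu
dir=$(mktemp -d)

printf 'important payload' > "$dir/block"                       # the "write"
sha256sum "$dir/block" | awk '{print $1}' > "$dir/block.cksum"  # checksum stored separately

printf 'important payl0ad' > "$dir/block"   # silent corruption: the write "succeeds", bytes differ

stored=$(cat "$dir/block.cksum")
actual=$(sha256sum "$dir/block" | awk '{print $1}')
if [ "$stored" = "$actual" ]; then
  result="OK: data verified"
else
  result="MISMATCH: silent corruption detected"
fi
echo "$result"
rm -rf "$dir"
```

The filesystem-level analogue of the else branch is where ZFS goes looking for a second copy.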

Self-healing: only if you have a good copy

With mirrors or RAIDZ, ZFS can read from another replica/parity set, find a correct version, and rewrite the bad copy. That’s the “self-healing” you hear about. Without redundancy (a single disk pool), ZFS still detects corruption, but it can’t conjure correct data from the void.

This is the non-negotiable mental model:

  • Checksums detect corruption.
  • Redundancy repairs corruption.
  • Scrubs force verification at scale.

What scrubs are not

A scrub is not a magical “fix my pool” button. It is a systematic read-and-verify across the pool’s data and metadata, repairing what it can using redundancy. It won’t:

  • Fix corruption that exists identically in all replicas (bad data written consistently).
  • Fix application-level corruption (the app wrote garbage; ZFS faithfully preserved it).
  • Replace backups or versioning. If corruption is old and you only discover it later, you may need historical copies.

And a scrub isn’t the same as a resilver. Resilver is targeted reconstruction after a device is replaced or reattached. Scrub is periodic verification of everything.

One idea that has aged well, paraphrased from John Allspaw (operations/reliability): systems fail in messy, surprising ways, and you need feedback loops that tell you when reality diverges from assumptions. Scrubs are one of those feedback loops.

What a scrub actually does

A ZFS scrub walks the pool’s block tree and issues reads for every allocated block. For each block, it verifies the checksum. If the checksum fails and redundancy exists, ZFS reads alternate copies/parity, repairs the bad one, and records the event. If there’s no redundancy or the damage is unrecoverable, you’ll see permanent errors.

The scrub workload profile

Scrubs are mostly sequential-ish reads at the vdev level, but the access pattern depends on fragmentation and recordsize. On a quiet pool with big records, scrubs can be polite. On a heavily fragmented pool with lots of small random blocks, scrubs behave more like a slow-motion random-read storm.

Scrubs compete with your real workload for:

  • Disk bandwidth and IOPS
  • HBA/controller queue depth
  • ARC (cache) and memory bandwidth
  • CPU (checksumming, decompression, parity)

If your pool is RAIDZ and CPU is weak, checksumming plus parity reconstruction can be more expensive than you’d expect. On modern CPUs it’s usually fine. On “we bought the cheapest thing that boots” boxes, it gets spicy.

What gets checked

Scrub checks allocated blocks, not free space. It validates both metadata and data blocks. If you have snapshots, the scrub covers the blocks referenced by snapshots too, because those blocks are still allocated. This is why scrubs on snapshot-heavy pools can take forever: you asked ZFS to keep old blocks, and it’s going to verify them.

Why scrubs matter even if you “never read old data”

You do read it. Maybe not today, but the first time you need that old data is usually the worst time to discover it’s broken. Scrubs turn “we learned about corruption during an incident” into “we learned about corruption on a Tuesday morning and fixed it before anyone noticed.” That difference is an entire career arc.

What to scrub (and what not to confuse with scrubbing)

Scrub the pool. Not a dataset. Not a directory. ZFS integrity is a pool-level property. Your unit of truth is zpool.

Scrub targets: the practical list

  • Production pools with redundancy (mirrors/RAIDZ): scrub regularly. You want automatic repair when it’s still repairable.
  • Backup pools: scrub even more religiously. Backups that aren’t verified are just optimistic archives.
  • Single-disk pools: scrubs still detect corruption, but don’t pretend they “fix” it. Pair with good backups and preferably add redundancy.
  • Cold storage with infrequent reads: scrubs are your only routine read path. Without scrubs, you’re one “restore day” away from regret.

Don’t confuse these with scrubbing

  • SMART tests validate drive health signals, not end-to-end correctness of your data.
  • RAID patrol read (hardware RAID) reads stripes but usually can’t validate application-visible correctness.
  • Filesystem checks on non-ZFS filesystems mostly validate metadata consistency, not data correctness.
  • Backups are not verification unless you actually test restore or at least verify checksums on the stored objects.

When to scrub: schedules that survive contact with reality

People ask for the “right” scrub cadence. There isn’t one. There is only a cadence that matches your risk tolerance, data temperature, device size, and operational budget.

Here’s the opinionated baseline that works in most shops:

  • General purpose production pools: scrub monthly.
  • Backup/archival pools: scrub monthly, sometimes biweekly if the pool is large and restore RTO matters.
  • High-change databases and VM clusters: scrub monthly, but control impact (off-peak windows, tuning, monitoring). If monthly is too disruptive, fix the underlying I/O headroom problem—don’t stop verifying.
  • Very large pools (hundreds of TB+): scrub frequency must consider scrub duration. If scrubs take 20 days, “monthly” becomes a comedy. You may need to increase throughput headroom, reduce snapshot churn, or split workloads.
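One way to encode the monthly baseline is a root cron entry. This is a sketch: the pool name tank, the path, and the timing are placeholders, and many distributions already ship a periodic scrub cron job or systemd timer, so check for an existing one before adding a duplicate:

```
# /etc/cron.d/zfs-scrub (sketch): scrub 'tank' at 02:00 on the 1st of each month.
# Adjust the pool name and window to your own off-peak hours.
0 2 1 * * root /usr/sbin/zpool scrub tank
```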

Two constraints matter more than calendars:

  1. Time-to-detect vs. time-to-fail. If you have latent sector errors accumulating, you want to discover them before another disk fails and redundancy is compromised.
  2. Scrub completion time. A scrub that never finishes is a ritual, not a control.

Scrubs also interact with resilver risk. If you defer scrubs for a year and then lose a disk, the resilver will read huge amounts of data, and that’s when latent errors pop. Better to find and repair latent errors during a scrub while the pool is healthy.

Joke #2: Skipping scrubs because “performance” is like skipping dental cleanings because “time.” You’ll still pay, just with more screaming.

When to avoid starting a scrub

  • Right before a big migration, restore drill, or data rebalancing event.
  • During known degraded performance periods (rebuilds, resilvers, heavy replication).
  • When you’re already missing redundancy (a degraded pool). In that case, decide deliberately: scrubbing a degraded pool can stress remaining disks, but also can reveal unreadable blocks early. Make it a conscious trade.

Practical tasks: commands, outputs, and decisions

Everything below is meant to be copy-paste runnable. The key is not the command. The key is what you decide after you run it.

Task 1: Check overall pool health and the last scrub result

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
  scan: scrub repaired 0B in 03:12:41 with 2 errors on Wed Feb  4 02:11:03 2026
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        tank/data/app.db

What it means: The pool is ONLINE, but ZFS found permanent errors in a file. “Repaired 0B” indicates it could not heal those blocks from redundancy, or the damaged blocks were in metadata in a way that prevented repair.

Decision: Treat as an incident for that dataset. Restore the affected object from a known-good source (backup, snapshot copy elsewhere). If this is a mirror and still unrepairable, suspect corruption was written consistently, or both sides have the same bad block.

Task 2: Start a scrub (and know what you just did)

cr0x@server:~$ sudo zpool scrub tank

What it means: You told ZFS to begin scanning allocated blocks and verifying checksums.

Decision: Start scrubs in a controlled window unless you’re responding to suspected corruption. If you start it during peak load, don’t act surprised by the graphs.

Task 3: Monitor scrub progress

cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Wed Feb  4 01:13:22 2026
        1.20T scanned at 824M/s, 620G issued at 415M/s, 8.40T total
        0B repaired, 7.38% done, 05:20:11 to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0

errors: No known data errors

What it means: “Scanned” vs “issued” matters. Scanned is logical progress; issued is actual I/O dispatched. If issued is far behind, something is throttling I/O or the scrub is contending with workload.

Decision: If ETA is exploding or issued rate collapses during business hours, consider pausing and rescheduling, or address the contention (see diagnosis playbook).
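The scanned/issued gap can be checked mechanically. A hedged sketch that computes the ratio from the scan line (run here against a captured sample; in practice you would feed it live zpool status output, and the field positions assume the OpenZFS format shown above):

```shell
#!/bin/sh
# Flag a large gap between logical scan progress and actually issued I/O.
scan_line='        1.20T scanned at 824M/s, 620G issued at 415M/s, 8.40T total'

ratio=$(echo "$scan_line" | awk '
  # Convert sizes like 1.20T / 620G / 512M to GiB for a rough comparison.
  function gib(v,    n, u) {
    n = v + 0; u = substr(v, length(v), 1)
    if (u == "T") return n * 1024
    if (u == "G") return n
    if (u == "M") return n / 1024
    return 0
  }
  # Field 1 is the scanned size, field 5 the issued size in this format.
  { printf "%.0f", gib($5) * 100 / gib($1) }')

echo "issued is ${ratio}% of scanned"
if [ "$ratio" -lt 60 ]; then
  echo "WARN: issued lags scanned; check throttling or workload contention"
fi
```

The 60% threshold is an arbitrary starting point; tune it against your own baselines.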

Task 4: Pause a scrub (because production exists)

cr0x@server:~$ sudo zpool scrub -p tank

What it means: The scrub is paused and can be resumed later by running zpool scrub tank again. Pause/resume has been in OpenZFS since 0.8; older or non-OpenZFS implementations may not support it.

Decision: If you pause scrubs routinely, you probably don’t have enough I/O headroom. Fix that, or you’ll eventually stop scrubbing at all.

Task 5: Stop a scrub (the blunt instrument)

cr0x@server:~$ sudo zpool scrub -s tank

What it means: Stops the scrub. Progress is lost; future scrub starts over.

Decision: Stop only if you must. Prefer pause/resume where supported.

Task 6: Identify which vdev/device is producing checksum errors

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices could not be used because the label is missing or invalid.
  scan: scrub repaired 32M in 02:40:11 with 0 errors on Mon Feb  2 03:02:11 2026
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sda     ONLINE       0     0    18
            sdb     UNAVAIL      0     0     0  cannot open

errors: No known data errors

What it means: sda is returning bad data (CKSUM errors). sdb is unavailable. You are running without redundancy and your remaining disk is already suspicious.

Decision: This is a “stop doing nonessential I/O” situation. Restore redundancy immediately. Also pull SMART data for sda and check cabling/HBA. A checksum error is often a path issue, not only a disk issue.

Task 7: Map device names to physical disks (avoid replacing the wrong one)

cr0x@server:~$ ls -l /dev/disk/by-id/ | grep -E 'sda|sdb'
lrwxrwxrwx 1 root root  9 Feb  4 01:22 ata-SAMSUNG_MZ7LM960HAJM_S3Z9NX0K123456 -> ../../sda
lrwxrwxrwx 1 root root  9 Feb  4 01:22 ata-SAMSUNG_MZ7LM960HAJM_S3Z9NX0K654321 -> ../../sdb

What it means: You can identify the model/serial tied to each device node. ZFS device naming should use stable IDs where possible.

Decision: Use /dev/disk/by-id paths in your pool config going forward. When replacing, match the serial number, not your hopes.

Task 8: Pull SMART health and error counters

cr0x@server:~$ sudo smartctl -a /dev/sda
=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZ7LM960HAJM-00005
Serial Number:    S3Z9NX0K123456
Firmware Version: GXT5204Q
...
SMART overall-health self-assessment test result: PASSED
...
ID# ATTRIBUTE_NAME          VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      100   100   000    Old_age   Always       -       2
199 UDMA_CRC_Error_Count    200   200   000    Old_age   Always       -       38

What it means: “PASSED” is not a clean bill of health. UDMA_CRC_Error_Count often points to cabling/backplane/HBA issues. Reported_Uncorrect means the device has seen uncorrectable errors.

Decision: If CRC errors are rising, replace cable/backplane slot, reseat, or move the drive to another port before you replace the drive. If uncorrectables climb, plan a drive replacement.
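Trending CRC counters is easier when you log just the raw values. A sketch that pulls attribute 199 out of smartctl-style output (demonstrated against a captured sample line; in production you would substitute the live output of sudo smartctl -a /dev/sda):

```shell
#!/bin/sh
# Extract the raw UDMA CRC error count (SATA attribute 199) so it can be
# appended to a log and compared across days. Rising values point at the
# cable/backplane/HBA path, not necessarily the drive itself.
smart_sample='199 UDMA_CRC_Error_Count    200   200   000    Old_age   Always       -       38'

crc=$(echo "$smart_sample" | awk '$2 == "UDMA_CRC_Error_Count" {print $NF}')
echo "$(date +%F) crc=$crc"
```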

Task 9: Check ZFS error counters over time (baseline matters)

cr0x@server:~$ zpool status -P tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 03:12:41 with 0 errors on Mon Feb  2 03:02:11 2026
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            /dev/disk/by-id/ata-SAMSUNG...123456        ONLINE       0     0     1
            /dev/disk/by-id/ata-SAMSUNG...654321        ONLINE       0     0     0

errors: No known data errors

What it means: There’s 1 checksum error on one side of the mirror. A single error could be ancient history or a current trend.

Decision: Record counters now and re-check after the next scrub and after heavy I/O. If the counter moves, investigate path and device. If it stays flat across weeks, it might have been transient—but keep watching.
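Baselines only help if you actually record them. A sketch that flattens the config table into timestamped per-device records suitable for appending to a log (parsing a captured sample here; the field layout assumes the standard NAME/STATE/READ/WRITE/CKSUM columns, and live use would pipe zpool status into it):

```shell
#!/bin/sh
# Turn zpool status device lines into log-friendly records:
#   <epoch> <device> read=<n> write=<n> cksum=<n>
status_sample='            sda     ONLINE       0     0     1
            sdb     ONLINE       0     0     0'

records=$(echo "$status_sample" | awk -v ts="$(date +%s)" '
  # Only device rows: NAME STATE READ WRITE CKSUM
  $2 == "ONLINE" && NF >= 5 {
    printf "%s %s read=%s write=%s cksum=%s\n", ts, $1, $3, $4, $5
  }')
echo "$records"
```

Appending this to a file after every scrub turns "is 1 CKSUM error new?" into a diff, not a debate.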

Task 10: See pool and vdev properties that impact scrub behavior

cr0x@server:~$ zpool get all tank | egrep 'ashift|autoreplace|autotrim|failmode|autoexpand'
tank  ashift       12                    local
tank  autoexpand   off                   default
tank  autoreplace  off                   default
tank  autotrim     on                    local
tank  failmode     wait                  default

What it means: ashift=12 (4K sectors) is typical. autotrim affects SSD behavior. These aren’t scrub knobs directly, but they influence write amplification, replacement workflows, and device performance stability.

Decision: If you use SSDs, keep autotrim=on unless you have a compelling reason not to. For replacement automation, consider autoreplace carefully—automation is great until it confidently replaces the wrong thing.

Task 11: Check dataset settings that amplify scrub pain (compression, recordsize, snapshots)

cr0x@server:~$ zfs get -o name,property,value -s local,received recordsize,compression,atime tank/data
NAME       PROPERTY     VALUE
tank/data  compression  lz4
tank/data  recordsize   128K
tank/data  atime        off

What it means: Reasonable defaults: LZ4 compression, 128K records, atime off. Smaller recordsize on VM images can increase metadata and fragmentation, affecting scrub behavior.

Decision: For VM zvols/datasets, tune recordsize/volblocksize deliberately. Don’t blame scrubs for a layout you created.

Task 12: Find out whether errors are tied to a specific file

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 02:58:03 with 1 errors on Mon Feb  2 03:02:11 2026
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        tank/data/logs/2026-01-17.gz

What it means: ZFS can sometimes name the affected file. Not always, but when it can, take the gift.

Decision: Restore or regenerate that file. For logs, you may accept loss; for databases, you don’t. Either way, treat it as a signal to investigate the underlying device/path.

Task 13: Verify whether redundancy can heal by forcing reads

cr0x@server:~$ sudo dd if=/tank/data/logs/2026-01-17.gz of=/dev/null bs=1M status=progress
104857600 bytes (105 MB, 100 MiB) copied, 0.42 s, 249 MB/s

What it means: Reading the file forces checksum verification for every block actually fetched from disk (blocks already cached in ARC are served from memory and not re-verified). If a fetched block is corrupt and redundancy exists, ZFS repairs it on read and increments the CKSUM counter.

Decision: If the read throws I/O errors or the file can’t be read, you need restore-from-backup or restore-from-snapshot. Don’t “cat” your way out of real corruption.

Task 14: Inspect I/O pressure during a scrub (Linux example)

cr0x@server:~$ iostat -x 1 3
Linux 6.6.0 (server)  02/04/2026  _x86_64_ (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.10    0.00    5.55   18.40    0.00   63.95

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s w_await aqu-sz  %util
sda             820.0  345000.0     0.0   0.00   12.40   420.7      5.0     120.0  5.20   10.2   96.0
sdb             790.0  339000.0     0.0   0.00   11.90   429.1      4.0     100.0  4.80   10.0   94.5

What it means: Disks are pegged (~95% util). Read latency (~12 ms) may be acceptable or disastrous depending on your workload.

Decision: If production latency suffers, schedule scrubs off-peak, increase vdev count (more spindles), or separate workloads. “Just don’t scrub” is how you become a cautionary tale.

Task 15: Confirm you’re not scrubbing because of phantom time jumps or missed runs

cr0x@server:~$ zpool history -i tank | tail -n 12
2026-02-02.03:02:11 zpool scrub tank
2026-02-02.05:58:14 zpool scrub -s tank
2026-02-03.02:01:00 zpool scrub tank
2026-02-03.06:12:09 zpool scrub -p tank
2026-02-04.01:13:22 zpool scrub tank

What it means: Operators (or automation) are starting/stopping scrubs frequently. That’s a smell: either the schedule is wrong, the impact is too high, or both.

Decision: Fix the process: define a single owner (automation or humans), set an off-peak window, and alert when scrubs don’t complete.
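"Alert when scrubs don't complete" can be a few lines of shell. A sketch that computes the age of the last completed scrub from the scan line (demonstrated against a captured sample and a fixed reference date so the arithmetic is deterministic; a real check parses live zpool status output and uses date +%s for now; GNU date assumed):

```shell
#!/bin/sh
# Days since the last completed scrub, parsed from the 'scan:' line.
# TZ pinned to UTC so the arithmetic is unambiguous.
scan_line='  scan: scrub repaired 0B in 03:12:41 with 0 errors on Mon Feb  2 03:02:11 2026'

finished=${scan_line##* on }    # the timestamp is everything after the last ' on '
finished=${finished#* }         # drop the weekday; GNU date parses the rest cleanly
last_epoch=$(TZ=UTC date -d "$finished" +%s)

# Fixed "now" for the demo; a real check uses: now_epoch=$(date +%s)
now_epoch=$(TZ=UTC date -d 'Mar  1 03:02:11 2026' +%s)

age_days=$(( (now_epoch - last_epoch) / 86400 ))
echo "last scrub finished ${age_days} days ago"
if [ "$age_days" -gt 35 ]; then
  echo "ALERT: scrub has not completed in over 35 days"
fi
```

Wire the ALERT branch into whatever pages you; the 35-day threshold is a placeholder for "monthly cadence plus slack."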

Fast diagnosis playbook: find the bottleneck fast

When a scrub is slow or errors appear, you need to answer three questions quickly: Is it actually progressing? Is it blocked by I/O, CPU, or something weird? Are we seeing real corruption or transport errors?

First: Is the scrub advancing, and are errors growing?

  1. Run zpool status twice, 60 seconds apart.
  2. Compare “scanned” and “issued,” and check if READ/WRITE/CKSUM counters are increasing.
cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Wed Feb  4 01:13:22 2026
        1.30T scanned at 815M/s, 700G issued at 430M/s, 8.40T total
        0B repaired, 8.10% done, 05:10:11 to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0

errors: No known data errors

If issued doesn’t move: suspect a device hung, controller queue stuck, or the system is throttling due to other load.

If errors are climbing: stop treating this as performance tuning and start treating it as data integrity triage.

Second: Is this IOPS, throughput, CPU, or contention?

  • Disk pegged: iostat -x shows high %util and rising await → you’re I/O bound or contending with workload.
  • CPU pegged: top shows kernel/system usage high, checksum/parity overhead, possibly compression/decompression. Scrub rates can drop on RAIDZ with weak CPU.
  • ARC churn: memory pressure can cause extra reads; check free -h and ARC stats (arcstat, or /proc/spl/kstat/zfs/arcstats on Linux).
cr0x@server:~$ top -b -n 1 | head -n 15
top - 02:11:08 up 31 days,  6:22,  2 users,  load average: 12.44, 10.80, 9.21
Tasks: 312 total,   2 running, 310 sleeping,   0 stopped,   0 zombie
%Cpu(s): 22.0 us,  0.5 sy,  0.0 ni, 70.0 id,  7.5 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64256.0 total,   3120.4 free,  23120.9 used,  38014.7 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  40110.2 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1421 root      20   0       0      0      0 I  18.2   0.0  12:44.21 z_wr_iss
 1412 root      20   0       0      0      0 I  14.7   0.0  10:03.88 z_rd_iss

Decision: If CPU idle is high and iowait is high, you’re storage-bound. If CPU is high and disks aren’t, you’re compute-bound or throttled by something else (like single-threaded bottlenecks on old hardware). Change one variable at a time: schedule, vdev layout, and hardware headroom.

Third: Are checksum errors “disk bad” or “path bad”?

Checksum errors can be caused by:

  • Actual media corruption on a device
  • Bad SATA/SAS cable or expander
  • Flaky HBA firmware or driver issues
  • Power instability to a drive/backplane

Correlate ZFS CKSUM counts with SMART CRC errors and kernel logs.

cr0x@server:~$ dmesg | egrep -i 'ata|sas|scsi|reset|crc|error' | tail -n 10
[123456.789] ata4: hard resetting link
[123457.012] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[123457.345] ata4.00: configured for UDMA/133
[123457.678] ata4.00: failed command: READ DMA EXT
[123457.679] ata4.00: status: { DRDY ERR }
[123457.680] ata4.00: error: { ICRC ABRT }

Decision: If you see ICRC/CRC errors and link resets, suspect the path first. Replace cable, move port, update HBA firmware. Drives get blamed for a lot of sins they didn’t commit.
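Correlation is easier when the kernel log is reduced to event counts per port. A sketch that tallies resets and CRC errors from captured dmesg lines (the sample lines mirror the output above, and the assumption that the second field names the ata port holds for libata messages; live use would pipe dmesg through the same awk):

```shell
#!/bin/sh
# Tally link resets and CRC errors per ATA port from kernel log lines.
dmesg_sample='[123456.789] ata4: hard resetting link
[123457.680] ata4.00: error: { ICRC ABRT }
[123460.120] ata2: hard resetting link'

summary=$(echo "$dmesg_sample" | awk '
  /hard resetting link|ICRC/ {
    split($2, a, /[:.]/)   # $2 looks like "ata4:" or "ata4.00:"
    count[a[1]]++
  }
  END { for (p in count) printf "%s events=%d\n", p, count[p] }')
echo "$summary"
```

If one port dominates the tally while its neighbors are quiet, that is a strong hint the path (cable, backplane slot, HBA port) deserves attention before the drive does.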

Three corporate mini-stories from the real world

Mini-story 1: The incident caused by a wrong assumption

The company was running a big data platform that looked modern from a distance: containers, a service mesh, and dashboards everywhere. Storage was “solved” by a pair of mirrored ZFS pools per node. They had a quarterly “maintenance week” where someone ran a scrub if they remembered. Mostly, they didn’t.

The wrong assumption was simple: “Mirrors protect us from disk failures, and the app replicates data anyway.” The mirror was treated like a RAID1 comfort blanket, not a data integrity system. Nobody asked how quickly corruption would be detected, or what would happen if a bad block existed on both sides of a mirror because the corruption was introduced before the write hit disk.

Then came a restore request—nothing dramatic. A team wanted an old dataset to replay training. The job failed with checksum errors on a handful of blocks. Engineers did what engineers do under time pressure: they retried, then restarted, then moved the workload. The errors moved with it.

Eventually, someone ran zpool status -v and saw permanent errors in files that hadn’t been read in months. The pool was technically “ONLINE,” so monitoring had been quiet. The mirror had been silently serving corrupted blocks because nobody forced verification of cold data. A monthly scrub would have found it early enough to repair from redundancy or from snapshots that still existed elsewhere. Instead, the data’s only remaining copies were already identical and wrong.

The lesson wasn’t “ZFS failed.” ZFS did exactly what it promised: it detected corruption. The system failed because the organization treated verification as optional. Integrity controls that are not exercised are just beliefs with CLI syntax.

Mini-story 2: The optimization that backfired

A mid-sized SaaS provider ran ZFS on a shared storage tier. Scrubs were scheduled monthly, but engineers complained they caused latency spikes on tenant databases. So the team made an “optimization”: they increased scrub frequency to weekly but set it to run during business hours at a lower rate by reducing I/O priority at the OS level and letting the scrub “take as long as it takes.”

On paper, it sounded responsible: constant gentle scrubbing rather than a big monthly event. In practice, it became permanent background noise. The scrub never finished before the next scheduled run. Operators got used to seeing “scrub in progress” as the steady state, so nobody noticed when scrub rates dropped by 80% after an HBA started logging link resets.

Weeks later, a drive failed. Resilvering began on already stressed hardware. The pool was now doing heavy reconstruction reads in addition to a scrub that was still running because nobody stopped it. Latency spiked, timeouts followed, and the incident became customer-visible. The root cause wasn’t “weekly scrubs” by itself. It was the combination of non-finishing scrubs, normalization of abnormal status, and insufficient controls during degraded states.

Afterward, they changed the rule: a scrub must complete within a defined window, and no scrub runs while the pool is degraded or resilvering unless explicitly approved. They also added alerting for “scrub running longer than X” and “scrub issued throughput below Y for Z minutes.” The successful optimization wasn’t tuning; it was putting boundaries on work.

Mini-story 3: The boring but correct practice that saved the day

A finance-adjacent company had a reputation for conservative ops. Not exciting. Not fashionable. But they ran monthly scrubs on every ZFS pool, and they had a small runbook that required operators to record: last scrub time, repaired bytes, and device error counters. They also ran quarterly restore drills for a random sample of datasets. People complained it was paperwork. It was. That was the point.

One month, a scrub repaired a small amount of data on a RAIDZ2 pool—nothing massive, but not zero. The counters showed a few checksum errors isolated to a single device. SMART looked “fine.” Because they tracked trends, they could see the checksum count had increased since last month. They replaced the cable first. Errors continued. They replaced the drive on the next maintenance window.

Two weeks later, another drive in the same vdev started throwing read errors. If the first drive had not been replaced proactively, they would have been in a RAIDZ2 with two marginal devices during a rebuild window—exactly when latent errors show up. Instead, the pool stayed healthy, scrubs stayed clean, and the incident never happened.

Nothing heroic occurred. No late-night war room. No executive escalations. Just a mundane feedback loop: scrub, record, react. Boring is a feature in storage engineering.

Common mistakes: symptom → root cause → fix

1) “Pool is ONLINE, so we’re fine”

Symptom: Everything looks green, but later you see permanent errors in old files.

Root cause: ONLINE only means the pool is accessible. It does not mean all data is currently readable and correct.

Fix: Scrub regularly and alert on any nonzero READ/WRITE/CKSUM counters and on the errors: line in zpool status, not just on pool state. Integrate scrub results into monitoring.

2) Scrubs never finish

Symptom: zpool status always shows “scrub in progress,” ETA is meaningless, operators ignore it.

Root cause: Not enough I/O headroom; scrub scheduled too frequently; heavy snapshot retention; fragmentation; workload contention.

Fix: Enforce a completion SLO: scrub must complete within X hours/days. Reduce snapshot churn, increase vdev parallelism, schedule off-peak, or split pools by workload.

3) Rising CKSUM errors on one device but SMART says PASSED

Symptom: zpool status shows CKSUM increments; SMART overall health says PASSED.

Root cause: Often transport: bad cable, flaky HBA, backplane issues, expander problems. SMART health is not a data-integrity truth source.

Fix: Check UDMA_CRC_Error_Count (SATA), kernel logs for link resets, reseat/replace cable, move ports, update HBA firmware. Replace drive if errors persist.

4) “We scrubbed and it repaired stuff; we’re done”

Symptom: Scrub repaired bytes; team shrugs and moves on.

Root cause: A repair event is a smoke alarm, not a victory lap. Something returned wrong data.

Fix: Investigate which device/vdev saw errors; correlate with SMART and logs; run an additional scrub after changes to confirm stability.

5) Scrubbing a degraded pool without thinking

Symptom: Performance collapses; additional devices error during scrub/resilver.

Root cause: Extra reads on already-stressed disks; scrub competes with resilver; rebuild windows amplify latent errors.

Fix: Default: avoid scrubs during resilver unless you’re hunting suspected corruption. Prioritize restoring redundancy. If you scrub degraded, do it intentionally with monitoring and a rollback plan.

6) Confusing application corruption with storage corruption

Symptom: Database logical inconsistencies; ZFS reports clean scrubs.

Root cause: The app wrote bad data, or a bug/logic issue occurred at the application layer. ZFS faithfully stored it.

Fix: Use application checks, backups, and immutable snapshots with retention. ZFS is necessary, not sufficient, for correctness.

7) Scrub schedule copied from a blog, not from your pool reality

Symptom: Scrubs cause repeated customer impact or get disabled.

Root cause: Schedule doesn’t match device count, pool size, and business load.

Fix: Start with monthly, then measure scrub duration and impact. Adjust window and throughput headroom. Treat scrub as a required control, not an optional job.

Checklists / step-by-step plan

Baseline plan for a new ZFS pool (production-minded)

  1. Choose redundancy intentionally: mirrors for IOPS and simple healing; RAIDZ2/3 for capacity and fault tolerance.
  2. Name devices by stable ID: build pools using /dev/disk/by-id paths.
  3. Set scrub cadence: monthly by default; confirm it can complete within your window.
  4. Decide on snapshot retention: keep what you can afford to scrub. Snapshot sprawl is real.
  5. Monitoring: alert on (a) scrub not run in N days, (b) scrub repaired bytes > 0, (c) any new READ/WRITE/CKSUM errors, (d) pool DEGRADED/FAULTED, (e) scrub taking too long.
  6. Run a test scrub after go-live: measure throughput and impact; record baseline duration.
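The alert conditions in step 5 mostly reduce to pattern checks over `zpool status` text. A minimal sketch: field positions follow the standard status table (NAME STATE READ WRITE CKSUM), and the sample status block plus both helper names are illustrative:

```shell
#!/bin/sh
# Minimal health checks over `zpool status` output, suitable for a cron alert.

pool_degraded() {
  printf '%s\n' "$1" | grep -Eq 'state: *(DEGRADED|FAULTED|UNAVAIL)'
}

device_errors() {
  # Print device table lines whose READ/WRITE/CKSUM counters are nonzero.
  printf '%s\n' "$1" |
    awk 'NF == 5 && $2 ~ /ONLINE|DEGRADED|FAULTED/ && ($3 != "0" || $4 != "0" || $5 != "0")'
}

# Sample; real usage: sample=$(zpool status tank)
sample='  state: DEGRADED
        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sda     ONLINE       0     0    12
            sdb     FAULTED      3     0     0'

if pool_degraded "$sample"; then echo "ALERT: pool not healthy"; fi
device_errors "$sample"   # prints the sda and sdb lines for this sample
```

Wire the two checks to separate alerts: a degraded pool is an incident, while a lone nonzero CKSUM counter is an investigation.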

Monthly operations routine (boring, repeatable, correct)

  1. Run zpool status and record counters.
  2. Run/confirm scrub completion.
  3. Review scrub outcome: repaired bytes, error counts, duration.
  4. Pull SMART summaries for all devices; look for CRC/link errors and reallocated/pending sectors.
  5. Verify at least one restore path: a small dataset restore drill or checksum verification against known-good artifacts.
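Step 5 in miniature: a restore drill is only a drill if something checks the result. This sketch verifies a restored artifact against a recorded known-good checksum; all paths here are placeholders created on the fly:

```shell
#!/bin/sh
# Restore drill: restore an artifact and verify it against a recorded
# known-good SHA-256. In production, "golden" is your source artifact and
# the copy step is your actual restore path (backup tool, zfs send/recv, ...).

workdir=$(mktemp -d)
printf 'known-good artifact\n' > "$workdir/golden"          # stand-in for original data
sha256sum "$workdir/golden" | awk '{print $1}' > "$workdir/golden.sha256"

cp "$workdir/golden" "$workdir/restored"                     # stand-in for the restore step

restored_sum=$(sha256sum "$workdir/restored" | awk '{print $1}')
if [ "$restored_sum" = "$(cat "$workdir/golden.sha256")" ]; then
  echo "restore drill: OK"
else
  echo "restore drill: MISMATCH" >&2
fi
```

The important habit is recording the checksum at backup time, not at restore time; a checksum computed from the restored copy proves nothing.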

Incident routine when you see checksum errors

  1. Don’t panic-write: avoid heavy writes that will churn metadata and make recovery harder.
  2. Capture state: save zpool status -v, dmesg excerpts, and SMART reports.
  3. Identify scope: file-level permanent errors vs device counters only.
  4. Stabilize hardware path: cables/ports/HBA.
  5. Restore redundancy: replace failed/unavailable devices.
  6. Scrub again: confirm errors stop increasing and scrub completes clean.
  7. Restore affected objects: from backup/snapshot; verify integrity.

Interesting facts and historical context

  • Fact 1: ZFS was designed at Sun Microsystems with an explicit goal of end-to-end data integrity, not just capacity management.
  • Fact 2: Traditional RAID checks parity but typically doesn’t validate that the data returned matches what the filesystem originally wrote.
  • Fact 3: ZFS stores checksums in parent metadata, so the checksum for a block is not sitting in the same sector as the block itself.
  • Fact 4: Copy-on-write means ZFS never overwrites live data in place; it writes new blocks then flips pointers, reducing “torn write” risks.
  • Fact 5: Scrub checks allocated blocks, including blocks held by snapshots; snapshot retention directly affects scrub workload.
  • Fact 6: “Self-healing” requires redundancy; a single-disk pool can detect corruption but cannot repair it.
  • Fact 7: Checksum errors can come from the transport layer (cables/HBA), not just from the disk media—SMART “PASSED” doesn’t refute that.
  • Fact 8: Resilver and scrub are different scans: resilver is targeted reconstruction after device events; scrub is full verification of pool contents.
  • Fact 9: Large-capacity disks increase the time window where latent sector errors matter, because rebuild/scrub reads more data and takes longer.

FAQ

1) How often should I scrub a ZFS pool?

Monthly is the default that works in most environments. If your scrub can’t finish monthly, fix headroom/snapshot sprawl or adjust architecture—don’t abandon verification.

2) Does a ZFS scrub check free space?

No. It checks allocated blocks reachable from the metadata tree, including snapshot-referenced blocks. Free space isn’t scanned because there’s nothing to verify.

3) What’s the difference between scrub and resilver?

Scrub is a full integrity scan of allocated data/metadata. Resilver reconstructs data onto a replaced/returned device for the affected vdev(s), typically only the blocks in use.

4) If I have mirrors, do I still need scrubs?

Yes. Mirrors give you a second copy, but without scrubs you may not discover corruption until you read the block during an incident—or during a resilver when you’re already stressed.

5) Can ZFS fix corruption automatically?

If redundancy exists and at least one copy is good, ZFS can repair a bad copy during scrub (and sometimes during normal reads). If all copies are wrong, it can only detect.

6) Why do I see CKSUM errors but no READ/WRITE errors?

Because the device returned data successfully, but the data didn’t match the stored checksum. That points to silent corruption or to transport issues (cable/HBA/backplane).

7) Are scrubs dangerous for disk health?

They increase read load, which can expose marginal drives. That’s not a reason to avoid scrubs; it’s a reason to find weak drives on your schedule, not during a failure.

8) Should I scrub SSD pools differently than HDD pools?

The integrity goal is the same. SSDs often scrub faster due to random-read strength, but they can still have firmware and path issues. Keep scrubs regular and monitor wear/SMART.

9) My scrub repaired bytes. Is that normal?

It can happen, but it shouldn’t be routine. Treat “repaired” as an integrity event: identify the device path, check logs/SMART, and confirm the next scrub is clean.

10) Do I need ECC RAM for ZFS integrity?

ECC is strongly recommended for systems where data correctness matters. ZFS can detect on-disk corruption, but it can’t reliably detect corruption introduced in RAM before checksums are computed.

Conclusion: next steps you can do this week

If you run ZFS and you’re not scrubbing, you’re trusting your storage stack to be perfect forever. That’s adorable. Also, please don’t.

  1. Set a scrub schedule (monthly baseline) and make sure it completes.
  2. Add alerting for: scrub overdue, scrub repaired bytes > 0, any new READ/WRITE/CKSUM errors, and scrub runtime exceeding your window.
  3. Run a scrub now during a controlled window, then record duration and throughput as your baseline.
  4. When you see checksum errors, correlate with SMART CRC/link resets before you replace hardware blindly.
  5. Test restores for at least one representative dataset. Scrubs verify what’s there; restore drills verify you can recover when it isn’t.
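For step 3, recording the baseline is easier if the scan line is turned into a number you can graph. A sketch under the assumption of an HH:MM:SS duration in the scan line (multi-day "N days HH:MM:SS" durations are deliberately not handled here); `scrub_seconds` is an illustrative helper:

```shell
#!/bin/sh
# Convert the completed-scrub scan line into a duration in seconds, so scrub
# runtime can be logged and compared run over run.

scrub_seconds() {
  # $1: zpool status scan line; prints the HH:MM:SS after "in" as seconds
  printf '%s\n' "$1" |
    awk '{ for (i = 1; i < NF; i++) if ($i == "in") { split($(i+1), t, ":"); print t[1]*3600 + t[2]*60 + t[3]; exit } }'
}

# Sample; real usage: scan=$(zpool status tank | grep 'scan:')
scan='  scan: scrub repaired 0B in 05:21:33 with 0 errors on Sun Mar  3 05:45:12 2024'
scrub_seconds "$scan"   # prints 19293 for this sample (5h 21m 33s)
```

Divide pool allocated bytes by this number and you have the throughput figure the conclusion asks you to baseline.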

The goal isn’t to “trust ZFS.” The goal is to operationalize verification. Scrubs are the part where you stop believing and start knowing.
