ZFS Write Hole Explained: Who Has It and Why ZFS Avoids It

You don’t notice the write hole when everything is green. You notice it when a database index won’t rebuild, a VM filesystem won’t mount,
or a single customer’s tarball fails checksum—months after the “harmless” power blip.

The write hole is the storage equivalent of a missing witness statement: the system “remembers” some parts of a write, forgets others,
and confidently hands you a story that never happened. ZFS is famous for not doing that. This is why.

The write hole: what it is (and what it is not)

The classic “write hole” is a parity RAID consistency failure caused by an interrupted write. Not “disk died,” not “controller exploded.”
It’s subtler: a power loss, kernel panic, controller reset, cable sneeze, or firmware hiccup hits in the middle of updating a stripe.
Data blocks and parity blocks no longer agree.

Parity RAID (think RAID5/RAID6 and cousins) maintains redundancy by computing parity over a set of blocks in a stripe. For a simple RAID5 stripe,
parity is an XOR of the data blocks. If you update a data block, you must also update parity. That means multiple writes per logical write.
If the system crashes after one of them hits disk but before the other does, you’ve created a stripe that looks valid structurally but is
mathematically inconsistent.
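
To make the parity math concrete, here is a toy sketch in Python (nothing ZFS- or mdadm-specific, just the XOR relationship):
two data blocks, one parity block, and reconstruction of a lost block from the survivors.

# Toy RAID5 parity math: a minimal sketch, not a real RAID implementation.
# Two data blocks and one parity block per stripe; parity = XOR of the data.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d0 = b"customer-row-0001"    # data block on disk 0
d1 = b"customer-row-0002"    # data block on disk 1
parity = xor_blocks(d0, d1)  # parity block on disk 2

# Lose disk 1: reconstruct its block from the survivors.
rebuilt_d1 = xor_blocks(d0, parity)
assert rebuilt_d1 == d1      # consistent stripe -> correct reconstruction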

That inconsistency isn’t always detected immediately. In fact, that’s the whole problem: the array still comes online, the filesystem still mounts,
and the corruption sits there like a landmine until you read the “wrong” block (or until you need to reconstruct after a disk failure).

What the write hole is not:

  • Not the same as bit rot (media errors over time). Bit rot can happen without crashes; write hole needs an interrupted write path.
  • Not “your RAID is useless.” It’s a specific failure mode triggered by partial updates in parity systems.
  • Not fixed by “more parity” alone. RAID6 reduces rebuild risk, but partial parity updates can still create inconsistent stripes.

The key question is simple: when you write, do you ever end up with a mixture of old and new data/parity that the system can’t later reconcile?
If yes, you have a write hole risk.

Who has the write hole: RAID levels and stacks that are vulnerable

The write hole is most commonly associated with RAID5, but the real classification is: parity-based redundancy where a single logical write becomes
multiple on-disk writes, and the stack cannot guarantee atomicity across them.

Likely vulnerable (unless mitigated)

  • RAID5 / RAID6 in many software RAID stacks and hardware RAID controllers, especially without battery-backed or persistent write cache.
  • mdadm RAID5/6 on Linux carries the classic write hole unless you enable a write journal or PPL (partial parity log); misconfigured barriers/flushes, lying write caches, and mid-update crashes all widen the window.
  • Classic filesystems on top of parity RAID (ext4, XFS, NTFS, etc.) don’t generally know stripe parity state. They assume the block device is honest.
  • Storage appliances that “optimize” by reordering writes without end-to-end integrity may worsen it.

Less vulnerable by design

  • Mirrors (RAID1, RAID10): a write is duplicated; you can still have torn writes at the block level, but you typically won’t poison parity math.
  • ZFS on mirrors: adds checksums + copy-on-write semantics, so it won’t return a torn block as “good.”
  • ZFS RAIDZ: parity is part of a transactional, copy-on-write scheme; it avoids the classic write hole behavior.

“But my controller has a cache”

A cache helps if it is persistent across power loss and if the controller correctly honors flushes and write ordering.
Battery-backed write cache (BBWC) or flash-backed write cache (FBWC) can reduce risk dramatically.
“Write-back cache without persistence” is just speed with a gambling habit.

Joke #1: A RAID controller without a real power-loss-protected cache is like a parachute made of optimism—you only know it’s fake once.

Why it happens: partial stripes, parity, and “torn writes”

Consider a RAID5 array with three disks: two data blocks and one parity block per stripe. Update one 4K block in one stripe.
The array must update parity so that parity still matches the two data blocks.

There are two classic ways to update parity:

  • Read-modify-write (RMW): read old data block and old parity, compute new parity = old_parity XOR old_data XOR new_data, then write new data and new parity.
  • Reconstruct-write: read all other data blocks in stripe, compute parity from scratch, write updated data and new parity.

Either way, multiple reads and multiple writes happen. That’s fine when the system completes the sequence. The write hole appears when it doesn’t.
The nasty part is that a crash can interrupt at many points:

  • New data is written, parity is old.
  • Parity is written, data is old.
  • Only part of a multi-sector write lands (a “torn write”), depending on device semantics and cache behavior.
  • Writes are reordered by cache/controller/drive firmware, despite what the OS thinks.

After reboot, the array has no memory of what it was doing mid-flight. A stripe is just a stripe.
There’s no durable transaction log at the RAID layer that says “this parity update was in progress; please reconcile.”
Many stacks simply trust the parity and move on.

The worst time to discover an inconsistent stripe is during a rebuild, when you lose one disk and parity is now your truth source.
If parity is wrong, the reconstruction produces garbage, and it produces it confidently.
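
Continuing the toy Python sketch from earlier, this is the write hole in miniature: the new data block lands, the crash prevents the
matching parity write, and a later rebuild trusts the stale parity. Block names and contents are invented for illustration.

# An interrupted read-modify-write on the toy stripe.
# new_parity = old_parity XOR old_data XOR new_data, but the crash hits
# after the data write and before the parity write.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d0, d1 = b"customer-row-0001", b"customer-row-0002"
parity = xor_blocks(d0, d1)      # consistent stripe on disk

new_d0 = b"customer-row-9999"    # logical update to d0
# Crash: new_d0 reaches disk, the matching parity update never does.
on_disk = {"d0": new_d0, "d1": d1, "parity": parity}

# Later, disk 1 dies. Reconstruction trusts the (stale) parity:
rebuilt_d1 = xor_blocks(on_disk["d0"], on_disk["parity"])
print(rebuilt_d1 == d1)          # False: confident garbage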

Why ZFS avoids it: COW, checksums, and transactional commits

ZFS doesn’t “magically” defeat physics. It uses a storage model that makes the classic write hole scenario structurally hard to create and
easy to detect. Three pillars matter: copy-on-write, end-to-end checksums, and transactional commit semantics.

1) Copy-on-write (COW): never overwrite live data in place

Traditional filesystems often overwrite metadata and data blocks in place (with journaling reducing, not eliminating, some risks).
ZFS takes a different approach: when it modifies a block, it allocates a new block elsewhere, writes the new content there, and only then
updates pointers to reference the new block.

That pointer update itself is also done via COW, walking up the tree until the top-level metadata (the uberblock) is updated.
The old version remains intact on disk until the new version is committed. If you crash mid-write, you generally fall back to the previous
consistent tree. You don’t end up with “half old, half new” metadata pretending it’s a coherent filesystem.
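
Here is a minimal sketch of the copy-on-write idea, using a made-up in-memory “block store” rather than ZFS’s real on-disk structures:
updates always land at fresh addresses, and nothing becomes visible until the final root swap.

# Copy-on-write in miniature: a hypothetical block store, not ZFS internals.
# Blocks are immutable once written; an "update" writes a new block and a new
# pointer chain, and only the final root swap publishes the new version.

blocks = {}        # fake disk: address -> contents
next_addr = 0

def write_block(data):
    global next_addr
    addr = next_addr
    next_addr += 1
    blocks[addr] = data    # always a fresh address, never an overwrite
    return addr

# Version 1 of a tiny tree: root -> leaf
leaf_v1 = write_block("hello v1")
root_v1 = write_block({"leaf": leaf_v1})
active_root = root_v1      # the committed, visible state

# Version 2 is built entirely out of new blocks...
leaf_v2 = write_block("hello v2")
root_v2 = write_block({"leaf": leaf_v2})
# ...and a crash *here* is harmless: active_root still points at root_v1,
# so readers see the old, consistent tree. Only this last step publishes v2:
active_root = root_v2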

2) End-to-end checksums: detect lies and corruptions

ZFS stores a checksum for every block in its parent metadata. When you read a block, ZFS verifies the checksum. If it doesn’t match,
ZFS knows the block is wrong. This is not “maybe wrong”: the contents demonstrably fail to match the checksum the parent block recorded.

On redundant vdevs (mirrors, RAIDZ), ZFS can then repair: it reads an alternate copy/parity reconstruction, finds a version that matches
the expected checksum, returns good data, and optionally heals the bad copy. The checksum makes corruption detectable; redundancy makes it correctable.
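
A toy model of that detect-and-heal loop, assuming a two-way mirror and SHA-256 as the checksum (ZFS defaults to fletcher4, with sha256
as an option); the function names are invented for illustration.

# End-to-end checksums in miniature: the parent records a checksum for the
# child block, so a corrupted copy is detected and a good copy wins.
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

good = b"block contents"
copies = [bytearray(good), bytearray(good)]   # two sides of a mirror
expected = checksum(good)                     # stored in the parent block pointer

copies[0][3] ^= 0xFF                          # silent corruption on one side

def read_self_healing(copies, expected):
    for c in copies:
        if checksum(bytes(c)) == expected:    # verify against the parent checksum
            for j, other in enumerate(copies):
                if checksum(bytes(other)) != expected:
                    copies[j] = bytearray(c)  # heal the bad copy from the good one
            return bytes(c)
    raise IOError("no copy matches the checksum: permanent error")

assert read_self_healing(copies, expected) == good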

3) Transaction groups (TXGs): atomic-ish filesystem state changes

ZFS batches changes into transaction groups. In memory, it accumulates dirty data and metadata, then periodically syncs a TXG to disk.
The “commit point” is a new uberblock written to disk pointing to the new tree. Uberblocks are written in a rotating set so ZFS can choose
the most recent valid one at import.

This matters for the write hole because ZFS doesn’t need to perform in-place parity updates of existing blocks to “patch” the current truth.
It writes new blocks, new parity, and only later switches the root pointer. If it crashes before the switch, the new blocks are orphaned
and later cleaned up; the old consistent state is still there.
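
A sketch of the “pick the newest valid commit point” idea, with invented slot and field names rather than the real uberblock format:
the interrupted slot fails its checksum, so import falls back to the last one that verifies.

# Choosing a commit point from a rotating set: a toy sketch, not the on-disk format.
import hashlib, json

def make_slot(txg, root):
    body = json.dumps(root, sort_keys=True).encode()
    return {"txg": txg, "root": root, "sum": hashlib.sha256(body).hexdigest()}

def valid(slot):
    body = json.dumps(slot["root"], sort_keys=True).encode()
    return hashlib.sha256(body).hexdigest() == slot["sum"]

ring = [
    make_slot(100, {"tree": "state at txg 100"}),
    make_slot(101, {"tree": "state at txg 101"}),
    {"txg": 102, "root": {"tree": "torn write"}, "sum": "garbage"},  # interrupted commit
]

best = max((s for s in ring if valid(s)), key=lambda s: s["txg"])
print(best["txg"])   # 101: the newest slot that passes its checksum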

Where the write hole used to live, ZFS puts a locked door

Classic write hole: “I updated data but not parity, now the stripe is inconsistent and nobody knows.”
ZFS model: “I wrote new data+parity to new locations. If I didn’t finish, the active filesystem tree still points to the old consistent blocks.”

ZFS can still experience interrupted writes, but it has two advantages:

  • Consistency of the active tree: the on-disk state chosen at import is a coherent snapshot of the filesystem.
  • Detectability: if a block is corrupt, ZFS knows via checksum; it doesn’t silently accept it as good.

One quote worth keeping on a sticky note near your rack:

“Hope is not a strategy.” — General Gordon R. Sullivan

RAIDZ specifics: variable stripe width and what changes

RAIDZ is ZFS’s parity RAID, but it isn’t “RAID5 implemented inside ZFS” in the naive sense. RAIDZ is integrated into ZFS’s allocation and
transactional model.

In classic RAID5, a logical write to a small block often forces a read-modify-write cycle on a full stripe. That’s where partial stripe updates
become dangerous, especially if the system overwrites existing blocks in place.

RAIDZ does something different: it uses variable stripe width. ZFS can pack data across disks based on the actual write size and block layout,
rather than forcing fixed stripe boundaries. It still has parity, but the allocation is coordinated with COW updates.

The key operational consequence: RAIDZ parity is computed for the newly allocated blocks and written as part of the TXG sync.
It is not retrofitting parity onto existing live stripes via in-place updates the same way classic RAID5 RMW does.
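
A hypothetical allocator sketch of that difference: each logical write computes parity over exactly the blocks it writes and lands
entirely on fresh space, so no live stripe is ever patched in place. None of these names map to real RAIDZ code.

# Variable-width stripes in miniature: parity travels with the new allocation.

def xor_all(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

free_offset = 0

def allocate(n):
    global free_offset
    start = free_offset
    free_offset += n
    return list(range(start, start + n))   # always fresh, never a live stripe

def raidz_like_write(data_blocks):
    parity = xor_all(data_blocks)
    locations = allocate(len(data_blocks) + 1)   # room for data + parity
    return {"data": locations[:-1], "parity": locations[-1], "p": parity}

small = raidz_like_write([b"4k-block"])       # narrow stripe for a small write
big = raidz_like_write([b"4k-block"] * 4)     # wider stripe for a larger write
# Neither write touched previously committed stripes; the TXG commit decides
# later whether these new locations join the live tree.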

This doesn’t mean RAIDZ is invincible. You can still lose data if you lose more devices than parity allows, or if your system lies about flushes
and you lose the newest committed uberblock(s). But the classic “silent parity mismatch created mid-write and later used as truth” is not the default failure mode.

The villain side-quest: caches, barriers, and “I swear it was written”

The write hole discussion always drags in caches because most real-world corruptions are not “ZFS bug” or “mdadm bug.” They’re “the storage stack
told me it was on stable storage, but it was actually sitting in volatile cache.”

Your OS uses flushes (FUA/flush/barriers) to say: “commit this to non-volatile media before you confirm.” If the device/controller ignores that,
everything above it is operating on a fantasy timeline.

ZFS is not immune to lies. ZFS assumes that when it issues a flush and gets an acknowledgment, the data is durable. If your hardware says “sure”
and then loses power, you can lose the last TXG(s). Usually that means losing the latest writes, not corrupting old committed data, but the blast
radius depends on how dishonest the device is.

This is where SLOG, sync settings, and power-loss protection enter the chat. If you set sync=disabled to chase benchmarks, you’re
volunteering to re-learn why databases insist on fsync.
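
For reference, this is the durability contract in its smallest form: a Python append that only acknowledges after fsync returns.
The path and record are invented; the point is that sync=disabled (or a device that lies about flushes) quietly turns that fsync into
a promise nobody kept.

# A minimal "ack only after durable commit" sketch.
import os

def append_durably(path: str, record: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, record)
        os.fsync(fd)   # ask the whole stack to put it on stable storage
    finally:
        os.close(fd)
    # Only now is it safe to acknowledge the message / commit the transaction.

append_durably("/tmp/ingest.log", b"message-42\n")   # hypothetical path and record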

Joke #2: Setting sync=disabled in production is like removing the smoke detector because it beeps when there’s smoke.

Interesting facts and historical context

  • Fact 1: The write hole was discussed in RAID literature long before SSDs; it’s tied to parity update semantics, not flash quirks.
  • Fact 2: Hardware RAID vendors pushed BBWC partly because it turns multi-write updates into “atomic enough” operations across power loss.
  • Fact 3: Early enterprise arrays implemented “parity logging” or journaling at the RAID layer to reconcile incomplete stripe updates after a crash.
  • Fact 4: Filesystem journaling (e.g., ext3/ext4) protects filesystem metadata consistency, but it generally can’t fix a parity stripe mismatch underneath.
  • Fact 5: ZFS’s end-to-end checksumming was a direct reaction to silent corruption in storage stacks—controllers, firmware, and even DMA can be wrong.
  • Fact 6: The ZFS “uberblock” concept provides multiple recent commit points; on import, ZFS can pick the newest valid one.
  • Fact 7: RAID5’s reputation worsened as disk sizes grew; rebuild windows got longer, increasing the chance of encountering an unrecoverable read error.
  • Fact 8: “Write hole” isn’t only about power loss; any mid-flight interruption (panic, HBA reset, link flap) can produce the same partial-update result.
  • Fact 9: Some SSDs historically acknowledged flushes too early due to firmware bugs; reliability engineers learned to distrust cheap flash “enterprise” labels.

Three corporate-world mini-stories

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company ran their primary PostgreSQL cluster on a Linux software RAID6 volume with a “reputable” HBA in IT mode.
The assumption was simple: RAID6 means “safe,” and ext4 journaling means “consistent.” They had UPS units in the rack, so power events
were treated as theoretical.

Then came a building maintenance event. Power didn’t fully drop; it sagged, bounced, and the servers rebooted. UPS logs showed a short transfer,
then a second transfer. The hosts came back. The monitoring was green. Everybody moved on.

Two weeks later a disk failed—normal, replaceable, boring. During rebuild, mdadm began throwing read errors that didn’t line up with SMART.
Then PostgreSQL started failing checksum verification on a table that hadn’t been touched since the power event. The data files looked “fine,”
until they weren’t.

The postmortem found likely stripe inconsistency: some stripes had parity that reflected new data blocks, while others had old parity and new data.
The filesystem journal had done its job, but it never had a chance: it was built on top of a block device that returned corrupted blocks as if they were valid.

The wrong assumption wasn’t “RAID6 is bad.” The wrong assumption was believing the block layer always preserves write ordering and atomicity across
a parity update. They migrated to ZFS mirrors for databases and reserved RAIDZ2 for large, sequential-ish object storage where scrubs and checksums
gave them detectable, repairable behavior.

Mini-story 2: The optimization that backfired

Another org—internal analytics platform, lots of batch jobs—ran OpenZFS on RAIDZ2. Performance was fine, until they onboarded a new workload:
a queue-backed ingestion service doing small synchronous writes. Latency jumped.

Someone did what someone always does: they googled “ZFS slow sync writes,” saw the magic phrase sync=disabled, and applied it to the dataset.
The graphs smiled. Tickets stopped. Victory lap.

Months later they had a kernel panic during a routine driver update. On reboot, the pool imported cleanly. No errors. But the ingestion service started
replaying messages that it believed were committed but weren’t actually on stable storage. The downstream system ingested duplicates and then overwrote
derived tables. Nothing exploded loudly; it just became quietly wrong.

The backfire wasn’t ZFS corruption. It was semantic corruption: the application’s durability contract got broken. By disabling sync, they changed
“ack after durable commit” into “ack after RAM felt good about it.” The fix was boring: restore sync=standard, add a proper SLOG device with power-loss protection,
and tune the application’s batching so it didn’t fsync every sneeze.

Mini-story 3: The boring but correct practice that saved the day

A financial services team ran ZFS on mirrors for their VM datastore and ZFS RAIDZ2 for backups. Their storage engineer was aggressively unromantic:
monthly scrubs, SMART tests, and alerting on even single checksum errors. They also refused to buy SSDs without clear power-loss protection.

One quarter, a new batch of drives started returning occasional checksum errors during scrubs. Not read errors. Not SMART failures. Just checksum mismatches
that ZFS corrected from redundancy. The dashboards showed “self-healing” events and a slow climb in corrected errors.

Procurement wanted to ignore it because workloads were fine. The storage engineer insisted: error budget is not a savings account. They drained the pool
of that batch over a few maintenance windows, replaced the drives, and RMA’d the lot.

Two months later, another team in the same building (different vendor stack) had a nasty incident: silent corruption surfaced during a restore test.
The financial services team didn’t even have an incident. Their practice was boring, correct, and therefore effective: scrubs turned invisible corruption
into a scheduled maintenance task instead of a surprise outage.

Practical tasks: commands, outputs, and decisions

These are real operator moves. Each task includes a command, what the output means, and what you decide next.
Examples assume OpenZFS on Linux, but most translate to illumos/FreeBSD with minor changes.

Task 1: Confirm pool health and whether you already have checksum errors

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.
action: Determine if the device needs to be replaced, and clear the errors
  scan: scrub repaired 0B in 02:13:44 with 2 errors on Sun Dec 22 03:10:18 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            ata-WDC_WD80EFAX-00A0  ONLINE       0     0     2
            ata-WDC_WD80EFAX-00A1  ONLINE       0     0     0
            ata-WDC_WD80EFAX-00A2  ONLINE       0     0     0
            ata-WDC_WD80EFAX-00A3  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        tank/vmstore:vm-102-disk-0

What it means: ZFS detected bad data via checksums. The CKSUM column counts where errors were hit; the “Permanent errors” list names data ZFS could not reconstruct from redundancy.

Decision: Treat this like a production incident. Identify the impacted workload, restore the specific file/volume from backup or replica,
and schedule drive replacement if errors keep increasing.

Task 2: Check whether your vdev layout is parity-based (where classic write hole matters)

cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

What it means: RAIDZ2 parity vdev. ZFS will avoid classic write hole patterns, but you must still care about flush honesty and sync semantics.

Decision: If this pool hosts sync-heavy databases, evaluate SLOG and dataset settings. If it’s bulk storage, focus on scrub cadence and drive health.

Task 3: Confirm sync policy and whether someone “optimized” durability away

cr0x@server:~$ zfs get -o name,property,value -s local,default sync tank tank/db tank/vmstore
NAME         PROPERTY  VALUE
tank         sync      standard
tank/db      sync      standard
tank/vmstore sync      disabled

What it means: tank/vmstore is acknowledging synchronous writes without guaranteeing stable storage.

Decision: If this hosts VMs, databases, or anything that expects fsync to mean something, revert to sync=standard.
If you keep disabled, document the data-loss window and get explicit sign-off.

Task 4: Check if you actually have a SLOG and what it is

cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            sda        ONLINE       0     0     0
            sdb        ONLINE       0     0     0
            sdc        ONLINE       0     0     0
            sdd        ONLINE       0     0     0
        logs
          nvme0n1p2    ONLINE       0     0     0

What it means: A separate log device exists. It can accelerate synchronous writes if it has power-loss protection and low latency.

Decision: Validate that the NVMe is enterprise-ish with PLP. If not, you may be trading latency for integrity under power loss.

Task 5: Confirm ashift (misalignment can masquerade as “ZFS parity is slow”)

cr0x@server:~$ zdb -C tank | grep -E "ashift|vdev_tree" -n | head
35:        vdev_tree:
52:                ashift: 12
89:                ashift: 12
126:               ashift: 12

What it means: ashift=12 means 4K sectors, the right default for modern disks/SSDs.

Decision: If you see ashift=9 on 4K drives, expect pain. Fix requires recreating the vdev (no in-place change).

Task 6: Check recordsize/volblocksize against workload (small writes + RAIDZ can bite)

cr0x@server:~$ zfs get -o name,property,value recordsize,volblocksize tank/db tank/vmstore tank/vmstore/vm-102
NAME                  PROPERTY      VALUE
tank/db               recordsize    128K
tank/vmstore          recordsize    128K
tank/vmstore/vm-102   volblocksize  8K

What it means: Filesystem datasets use recordsize. ZVOLs use volblocksize. Mismatch can cause write amplification.

Decision: For VM zvols, pick a volblocksize that matches the guest and workload (often 8K or 16K). For databases on datasets,
consider smaller recordsize if IO is random and small, but measure first.

Task 7: Observe real-time IO and spot whether you’re sync-bound

cr0x@server:~$ zpool iostat -v tank 1 5
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        3.21T  11.3T    112    980  22.4M  18.1M
  raidz2-0                  3.21T  11.3T    112    980  22.4M  18.1M
    sda                         -      -     18    160  3.7M   4.2M
    sdb                         -      -     19    165  3.8M   4.3M
    sdc                         -      -     18    162  3.7M   4.2M
    sdd                         -      -     18    158  3.6M   4.1M
  nvme0n1p2                      -      -      0    510  0     6.1M
--------------------------  -----  -----  -----  -----  -----  -----

What it means: The log device is taking a lot of writes. That’s consistent with sync-heavy workload being absorbed by SLOG.

Decision: If latency is still bad, the SLOG may be slow or saturated. Consider faster PLP NVMe, or reduce sync frequency at the app layer (batching).

Task 8: Check TXG sync behavior and whether you’re thrashing on syncs

cr0x@server:~$ grep -E "txg_sync|spa_sync" /proc/spl/kstat/zfs/* 2>/dev/null | head
/proc/spl/kstat/zfs/dbgmsg:txg_sync_start txg 89213
/proc/spl/kstat/zfs/dbgmsg:spa_sync: syncing tank txg 89213
/proc/spl/kstat/zfs/dbgmsg:txg_sync_done txg 89213

What it means: You can see sync cycles. Frequent syncs under load can mean small sync writes forcing commits.

Decision: Correlate with application fsync patterns. If it’s a DB, tune its checkpointing/batching; if it’s NFS, verify client mount options and sync semantics.

Task 9: Validate drive write cache settings (and decide whether to disable volatile caches)

cr0x@server:~$ sudo hdparm -W /dev/sda
/dev/sda:
 write-caching =  1 (on)

What it means: Drive write cache is enabled. That’s not automatically bad if flushes are honored and you have power-loss protection.

Decision: If you don’t trust the platform (no PLP, sketchy SATA expanders, history of flush bugs), consider disabling write cache:
hdparm -W0 (with testing). Expect performance impact.

Task 10: Confirm TRIM/autotrim for SSD pools (not write hole, but prevents performance decay)

cr0x@server:~$ zpool get autotrim tank
NAME  PROPERTY  VALUE     SOURCE
tank  autotrim  on        local

What it means: ZFS will issue TRIM automatically. Helps SSD sustained performance and wear behavior.

Decision: Keep it on for SSDs unless you have a device that behaves badly under TRIM (rare now, common in the bad old days).

Task 11: Run a scrub and understand what “repaired” vs “errors” means

cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub in progress since Thu Dec 26 01:12:40 2025
        512G scanned at 1.21G/s, 96G issued at 233M/s, 3.21T total
        0B repaired, 2.99% done, 0:21:34 to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors

What it means: Scrub is reading and verifying checksums. “0B repaired” is good. “No known data errors” is what you want.

Decision: If scrubs repeatedly repair data, you have a reliability problem (drives, cabling, HBA, memory).
If scrubs find permanent errors, initiate restore/recovery for affected datasets and investigate hardware.

Task 12: Verify compression and its effect on IO patterns

cr0x@server:~$ zfs get -o name,property,value compression,compressratio tank/db
NAME     PROPERTY       VALUE
tank/db  compression    lz4
tank/db  compressratio  1.78x

What it means: Compression is on and effective. That reduces physical writes for the same logical write volume.

Decision: Usually keep lz4 on. If latency is dominated by CPU (rare on modern CPUs), measure before changing anything.

Task 13: Check ARC behavior (because people blame “ZFS write hole” when it’s actually memory pressure)

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
01:22:10   845    47      5    12   26    35   74     0    0   58G   64G
01:22:11   812    55      6    14   25    41   75     0    0   58G   64G
01:22:12   901    50      5    13   26    37   74     0    0   58G   64G

What it means: Low miss rate, ARC near target. Reads are mostly being served from cache.

Decision: If miss% is high and disks are busy, consider more RAM or L2ARC (carefully). But don’t use L2ARC as a substitute for fixing a sync-write bottleneck.

Task 14: Spot a lying device: confirm write cache, flush support, and kernel messages

cr0x@server:~$ dmesg | grep -iE "flush|fua|barrier|cache" | tail -n 8
[    2.913244] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    2.914881] sd 0:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    2.916402] sd 0:0:0:2: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    2.918103] sd 0:0:0:3: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

What it means: “doesn’t support DPO or FUA” isn’t automatically fatal (flushes can still work), but it’s a hint:
the stack may be relying on cache flushes rather than force-unit-access semantics.

Decision: If this is a sync-heavy integrity-critical pool, prefer devices/HBAs that behave predictably with flushes and have PLP where appropriate.
Consider mirrors for databases.

Fast diagnosis playbook

When someone says “ZFS is corrupt” or “we hit the write hole,” your job is to separate three things:
(1) data integrity failures, (2) durability semantics, and (3) performance bottlenecks. Fast.

First: determine if you have data errors right now

  1. Check pool health: zpool status -v. If you see checksum errors or permanent errors, treat as active incident.
  2. Check recent scrubs: in zpool status scan line. If scrubs repaired data recently, find the source of corruption.
  3. Check system logs: dmesg for link resets, I/O errors, HBA timeouts. If the transport is flaky, everything above it is a victim.

Second: confirm whether the complaint is actually lost acknowledged writes

  1. Check dataset sync settings: zfs get sync. If you see disabled, you likely have a durability contract violation, not corruption.
  2. Check presence/health of SLOG: zpool status logs section. If missing and workload is sync-heavy, latency may be expected.
  3. Ask what “lost” means: Did the app ack after fsync? If yes, investigate sync settings, PLP, and flush honesty.

Third: find the bottleneck (CPU, disk, sync, fragmentation, or capacity)

  1. IO view: zpool iostat -v 1 to see if disks or SLOG are saturated.
  2. Capacity pressure: zfs list and pool alloc%. Near-full pools fragment and slow down writes.
  3. Workload mismatch: small random writes on RAIDZ without proper tuning can be slow; consider mirrors for latency-sensitive datasets.

Common mistakes: symptoms → root cause → fix

1) “We had a reboot and now the database is inconsistent”

  • Symptoms: DB reports missing recent transactions; app replays messages; no ZFS checksum errors.
  • Root cause: sync=disabled (or application was relying on fsync semantics but stack didn’t provide them), or volatile cache lied.
  • Fix: Set sync=standard; add PLP-backed SLOG if needed; validate UPS and test power-loss behavior; ensure hardware honors flushes.

2) “Scrub keeps repairing data every month”

  • Symptoms: zpool status shows repaired bytes or increasing checksum counters; workloads seem fine.
  • Root cause: Silent corruption from drives, cabling, backplane, HBA, firmware, or memory (yes, RAM matters).
  • Fix: Replace suspect drives; reseat/replace cables; update HBA firmware; run memtest; keep scrubs; do not ignore corrected errors.

3) “RAIDZ is slow for VM storage”

  • Symptoms: High latency, especially on sync writes; zpool iostat shows lots of small writes; CPU mostly idle.
  • Root cause: Workload is small random writes; parity RAID has write amplification; volblocksize/recordsize mismatch; no SLOG for sync-heavy patterns.
  • Fix: Use mirrors for VM/databases; tune volblocksize; add PLP SLOG for sync; keep RAIDZ for throughput workloads.

4) “We see permanent errors in a few files after a crash”

  • Symptoms: zpool status -v lists specific files with permanent errors; checksum errors on a disk.
  • Root cause: Actual on-disk corruption that redundancy couldn’t heal (e.g., exceeded parity, bad sector on multiple drives, or stale replicas).
  • Fix: Restore affected files/volumes from backup/snapshot replication; replace failing hardware; run scrub; validate backups with restore tests.

5) “After adding an SSD cache device, things got worse”

  • Symptoms: Latency spikes; SSD at 100% busy; little improvement in hit rate.
  • Root cause: Misused L2ARC or weak SSD; caching the wrong workload; memory pressure due to L2ARC metadata overhead.
  • Fix: Remove or resize L2ARC; add RAM first; focus on SLOG if the problem is sync writes, not reads.

Checklists / step-by-step plan

Checklist: building a ZFS design that doesn’t reintroduce write-hole-like pain

  1. Pick vdevs by latency needs: mirrors for databases/VMs; RAIDZ2/3 for bulk storage and backups.
  2. Assume power events happen: UPS helps, but test behavior. Brownouts and double-transfers are real.
  3. Do not disable sync casually: if you must, isolate it to a dataset and document the risk.
  4. Use PLP devices where durability matters: especially for SLOG and metadata-heavy workloads.
  5. Schedule scrubs: monthly is common; more frequent for critical pools with lots of data churn.
  6. Alert on checksum errors: corrected errors are an early warning, not a success story (a minimal alerting sketch follows this list).
  7. Validate backups via restore tests: “backup succeeded” is not “restore works.”
  8. Keep headroom: avoid running pools near full; fragmentation and allocation pressure will punish you.
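
A minimal alerting sketch for checklist item 6, assuming a host with the zpool CLI and a placeholder alert() you would wire into mail
or your paging system. On OpenZFS, the healthy case for zpool status -x prints “all pools are healthy”; verify the exact string on your platform.

# Page a human when any pool is unhealthy: a sketch, not a monitoring product.
import subprocess

def alert(message: str) -> None:
    print(f"ALERT: {message}")   # placeholder: replace with mail/pager integration

out = subprocess.run(["zpool", "status", "-x"],
                     capture_output=True, text=True).stdout.strip()
if out != "all pools are healthy":
    alert(out)                   # any deviation is worth a human look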

Step-by-step: investigating “possible write hole” allegations in a parity stack

  1. Collect timeline: what crashed, when, and what was writing at that moment.
  2. Look for partial-write symptoms: app-level inconsistencies without corresponding disk read errors.
  3. Check for volatile caches: drive write cache, RAID controller cache mode, missing BBWC/FBWC.
  4. Validate flush semantics: kernel logs; controller settings; virtualization layers.
  5. Run consistency checks: for ZFS run scrub; for mdadm consider check/repair operations (careful); for databases run internal checksums.
  6. Decide remediation: restore known-good copies, rebuild from replicas, or migrate to a design that provides end-to-end integrity.

FAQ

1) Does ZFS completely eliminate the write hole?

ZFS avoids the classic parity RAID write hole by not doing in-place updates and by committing changes transactionally.
But ZFS still relies on hardware honoring flush semantics. If devices lie, you can lose recent writes.

2) Is RAIDZ just RAID5 with a different name?

No. RAIDZ is integrated with ZFS allocation and copy-on-write behavior, supports variable stripe width, and is protected by end-to-end checksums.
It behaves differently under partial-write conditions than classic RAID5 RMW patterns.

3) If I use RAIDZ2/3, can I skip backups?

No. Parity protects against device failure, not against deletion, ransomware, application bugs, or “oops I wrote the wrong dataset.”
ZFS snapshots help, but they’re not off-system backups unless replicated elsewhere.

4) Why do mirrors get recommended for databases?

Latency. Mirrors have simpler write paths and often better IOPS for random workloads. RAIDZ is great at capacity efficiency and throughput,
but parity math plus write amplification can hurt small random sync writes.

5) Do I need a SLOG device?

Only if you have significant synchronous writes and your pool devices are too slow to meet latency targets.
A SLOG is not a general write cache; it accelerates the ZIL for sync writes. Don’t add a cheap SSD without PLP and call it “safer.”

6) What does “checksum error” mean in zpool status?

ZFS read data that didn’t match the checksum stored in metadata. That’s proof of corruption in the read path or on media.
With redundancy, ZFS can often self-heal by reading a good copy.

7) Can ext4 journaling prevent RAID5 write hole corruption?

ext4 journaling can keep filesystem metadata consistent after a crash, but it can’t correct a parity stripe mismatch underneath.
It assumes the block device provides coherent blocks when requested.

8) Isn’t the write hole solved by using a UPS?

UPS reduces risk but doesn’t eliminate it. Crashes, kernel panics, controller resets, and firmware bugs still interrupt write sequences.
Also, many “power events” are messy brownouts and reboots, not clean power loss.

9) If ZFS is safe, why do people still lose data?

Because safety is a system property. Common culprits: running out of redundancy, bad hardware, lying caches, disabling sync,
ignoring checksum errors, and not testing restores.

10) What’s the simplest operational habit that improves ZFS integrity?

Regular scrubs with alerting on any checksum errors. Scrub turns silent corruption into a ticket you can handle on a weekday.

Practical next steps

If you’re running parity RAID outside of ZFS, assume the write hole exists until proven otherwise. If you’re running ZFS, assume your hardware
can still lie and your operators can still “optimize” durability away. Both assumptions are healthy.

  1. Inventory your risk: identify parity arrays, cache modes, and whether write cache is persistent.
  2. Audit ZFS properties: find any datasets with sync=disabled and decide whether you actually meant that.
  3. Scrub schedule + alerting: ensure scrubs run and somebody gets paged on checksum errors.
  4. Align design to workload: mirrors for latency, RAIDZ for capacity/throughput, and don’t pretend they’re interchangeable.
  5. Test a power-loss scenario: not by pulling the plug on a production box, but by validating that your platform honors flushes and that your apps tolerate crash-recovery.