ZFS Power Loss Testing: How to Validate Safety Without Losing Data

Power loss testing is one of those tasks everyone swears they’ll do “next quarter,” right after they finish “just one more migration.”
Then a breaker trips, a PDU reboots, or a remote hands tech unplugs the wrong cord, and suddenly you’re learning your storage truth in public.

ZFS is famously resilient, but it cannot save you from every lie your hardware tells—especially around write caches and “I totally flushed that”
acknowledgements. This guide is how to validate safety like you mean it, without turning production data into a science fair experiment.

What you are actually testing (and what you are not)

“Power loss testing” gets thrown around like a single thing. It’s not. You’re validating a chain of promises:
the OS, the HBA, the drive firmware, the cache policy, and the pool layout all have to agree on what “durable” means.
ZFS sits in the middle trying to keep you honest. Sometimes it succeeds. Sometimes the hardware gaslights it.

Here’s what you are testing:

  • Crash consistency: after abrupt loss, the pool imports cleanly and metadata is coherent.
  • Durability of synchronous writes: data acknowledged as sync is actually on stable media.
  • Behavior of ZIL/SLOG under stress: how much you lose in-flight, and whether replay is clean.
  • Recovery workflow: how fast humans can restore service without making it worse.
  • Truthfulness of your storage stack: whether flushes, barriers, and FUA are respected end-to-end.

What you’re not proving:

  • That corruption can never happen. You’re reducing probability and tightening detection.
  • That async writes are durable. If your app writes async, you are testing “best effort,” not guarantees.
  • That RAIDZ is a backup. It isn’t, and ZFS will not argue with you about it.

One quote worth taping to your monitor is Peter Drucker’s often-cited line (paraphrased):
What gets measured gets managed.
In storage, what gets measured also gets believed. So measure correctly.

Facts and context that change decisions

These aren’t trivia. They’re the reasons certain “obvious” setups are quietly dangerous.

  1. ZFS is copy-on-write (CoW): it never overwrites live blocks in place, which is why pool-wide consistency after a crash is usually excellent.
  2. ZFS checksums everything (metadata and data): it detects corruption even when the disk proudly returns garbage with a smile.
  3. The ZIL exists even without a SLOG: synchronous transaction intent is logged somewhere; a dedicated log device just changes where.
  4. SLOG is not a write cache: it accelerates synchronous writes; it does not make async writes safe, and it doesn’t replace RAM needs.
  5. Write caches lie sometimes: certain disks/SSDs have volatile caches that can acknowledge writes before they’re persistent, unless power-loss protection (PLP) is real and enabled.
  6. Barriers and flushes are a contract: Linux + controller + device must honor cache flush/FUA semantics. One weak link makes “sync” expensive theater.
  7. ZFS scrubs are not fsck: a scrub is integrity verification via checksums and repair via redundancy. It’s not a generic “fix my pool” tool.
  8. Power loss bugs are often timing bugs: the exact same test can pass 99 times and fail on the 100th because the bug lives in a small race window.
  9. ZFS popularized end-to-end checksums in mainstream ops: it wasn’t the first filesystem to do it, but it’s the one many SRE teams met in anger and then refused to live without.

Joke #1: Storage engineers don’t believe in ghosts. We believe in “unexplained latency” and “it went away after a reboot,” which is basically the same thing with better graphs.

Threat model: the failure modes that matter

Treat power loss like a failure injection tool that happens to be dramatic. You’re aiming at specific classes of failures.
If you don’t name them, you’ll run a “test,” feel productive, and learn nothing.

1) Crash without storage lies (the “good” crash)

Kernel stops. Power cuts. Devices stop instantly. On reboot, ZFS replays intent logs, rolls back to the last consistent TXG,
and you lose at most the last few seconds of asynchronous work. Synchronous work survives.

2) Crash with volatile cache lies (the “bad” crash)

Your device or controller acknowledged writes that were sitting in a volatile cache. Power loss drops them on the floor.
ZFS may have believed them durable, so now you have missing blocks that can manifest as checksum errors, permanent errors on scrub,
or—worst case—silent corruption if the lie aligns with a valid checksum path (rare, but don’t build policy on “rare”).

3) Crash during resilver/scrub (the “stress crash”)

This tests long-running metadata operations and their resumability. ZFS is usually solid here, but your bottleneck might shift:
a scrub can amplify IO and reveal queueing issues, controller firmware bugs, or weak drives.

4) Partial power loss / brownout (the “weird crash”)

One shelf drops, another lives. One PSU survives. SAS expanders reboot. NVMe resets.
These are the outages where logs look like a crime scene and you find out if multipath and timeouts are sane.

5) Human recovery mistakes (the “most common crash”)

A clean pool can be turned into a mess by an impatient import with the wrong flags, or by “fixing” what isn’t broken.
Power loss tests should include a recovery runbook, not just a power cut.

Build a lab that tells the truth

If you test on a laptop SSD with a single pool and a polite workload, you’re testing the emotional comfort of the team, not the system.
The lab doesn’t have to be expensive, but it must be honest.

Rules for a trustworthy test environment

  • Same storage class as production: same model SSDs/HDDs, same HBA, same firmware family. “Close enough” is where data goes to die.
  • Same ashift and recordsize philosophy: do not “optimize” the lab with different block sizes and expect identical failure behavior.
  • Reproducible power cut: a switched PDU outlet, a relay, or IPMI power-off. Pulling a random cord is not reproducible science.
  • Out-of-band console logging: serial console, IPMI SOL, or at least persistent journal logs. You want the last 2 seconds.
  • Timeboxed tests: define what “pass” means before you start. Otherwise you’ll keep testing until you get bored.

What you should mirror from production

Mirror the pool topology, the vdev count, and whether you use a dedicated SLOG. Mirror your sync policy.
Mirror your compression and atime choices. And yes, mirror your worst workloads.
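
One low-effort way to make “mirror production” concrete is to dump the production pool and dataset configuration and keep it next to the test plan. A minimal sketch, assuming the production pool is named tank; recreate the lab pool until the same three files match:

cr0x@server:~$ sudo zpool get all tank > ~/prod_zpool_props.txt            # pool-wide properties and feature flags
cr0x@server:~$ sudo zfs get -r -s local all tank > ~/prod_zfs_props.txt    # every locally set dataset property (sync, logbias, recordsize, ...)
cr0x@server:~$ sudo zdb -C tank | grep ashift > ~/prod_ashift.txt          # ashift per vdev, straight from the cached pool config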

What you should not mirror

Don’t mirror production’s lack of monitoring. The lab is where you add extra instrumentation.
You want to see TXG commit cadence, ZIL activity, device flush behavior, and latency distributions.
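
Two cheap instrumentation points for the lab, assuming OpenZFS on Linux and a pool named tank (the kstat path below exists in current OpenZFS releases but may move between versions):

cr0x@server:~$ cat /proc/spl/kstat/zfs/tank/txgs | tail -n 5    # recent TXG history: dirty bytes plus open/quiesce/sync times per commit
cr0x@server:~$ sudo zpool iostat -l -v tank 5                   # per-vdev latency columns, including sync and async queue wait

Long sync times in the TXG history during an ordinary workload are an early hint that device flushes are expensive, before you ever cut power.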

Workloads and metrics: stop “dd”ing and start measuring

“I wrote a big file and it seemed fine” is not a test. It’s a lullaby.
Your workload must include sync writes, metadata churn, small random IO, and sustained throughput. Also, concurrency.

Key behaviors to trigger

  • Synchronous commit pressure: databases, NFS with sync, fsync-heavy apps.
  • Metadata updates: create/unlink storms, snapshots, clones, dataset property changes.
  • Space map activity: frees, rewrites, and fragmentation patterns.
  • Mixed read/write under load: because outages don’t wait for your batch job to finish.

Metrics that matter during testing

  • sync write latency distribution: p50/p95/p99, not just average.
  • TXG commit time: long commit times can extend the window of async data loss and can indicate device flush pain.
  • ZIL/SLOG utilization: how often and how hard you hit it.
  • checksum and IO errors: after reboot and after scrub.
  • import time and replay time: operational impact matters.

Joke #2: If your power loss test plan is “yank the plug and hope,” congratulations—you’ve reinvented chaos engineering, but with fewer dashboards.

Practical tasks: commands, outputs, and decisions (12+)

These are production-grade checks and procedures you can run in a lab and—carefully—in production for validation.
Each task includes: command, what the output means, and what decision you make.
Examples assume OpenZFS on Linux. Adjust device names and pool names.

Task 1: Identify pool health and topology

cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 00:12:41 with 0 errors on Wed Dec 24 02:10:12 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0  ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N1  ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N2  ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N3  ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N4  ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N5  ONLINE       0     0     0

errors: No known data errors

Output meaning: You want ONLINE, zero errors, and a recent scrub with no repairs.

Decision: If you already have checksum errors, stop. Fix baseline health before injecting failures.

Task 2: Check dataset sync policy (the “did we quietly opt out of safety?” check)

cr0x@server:~$ sudo zfs get -o name,property,value -s local,default sync tank
NAME  PROPERTY  VALUE
tank  sync      standard

Output meaning: standard honors application sync requests; disabled lies to apps; always forces sync for all writes.

Decision: For safety validation, temporarily test with sync=always on the dataset(s) that matter to verify real durability.
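
A minimal sketch of that temporary change, assuming the dataset under test is tank/data; zfs inherit puts the property back to whatever the parent or the default provides:

cr0x@server:~$ sudo zfs set sync=always tank/data      # force every write through the ZIL for the test window
cr0x@server:~$ sudo zfs get -s local sync tank/data    # confirm the override is set locally, not inherited
cr0x@server:~$ sudo zfs inherit sync tank/data         # revert once the test run is documented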

Task 3: Verify whether you even have a SLOG and what it is

cr0x@server:~$ sudo zpool status tank | sed -n '1,80p'
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
        logs
          nvme0n1p1 ONLINE       0     0     0

Output meaning: A logs section indicates a separate SLOG device.

Decision: If you rely on sync performance or NFS sync, validate the SLOG’s power-loss protection and latency.

Task 4: Confirm key dataset properties that affect behavior during failures

cr0x@server:~$ sudo zfs get -o name,property,value recordsize,compression,atime,logbias,primarycache tank/data
NAME       PROPERTY     VALUE
tank/data  recordsize   128K
tank/data  compression  lz4
tank/data  atime        off
tank/data  logbias      latency
tank/data  primarycache all

Output meaning: logbias=latency tends to use the SLOG more aggressively for sync writes; logbias=throughput pushes more of that data to the main pool vdevs instead.

Decision: For databases, logbias=latency with a proper PLP SLOG is reasonable; otherwise you might be paying a latency tax for nothing.

Task 5: Confirm write cache policy and “volatile cache” risk on SSD/NVMe

cr0x@server:~$ sudo nvme id-ctrl /dev/nvme0n1 | egrep -i 'vwc|oncs|wzsl'
vwc     : 0x1
oncs    : 0x5f
wzsl    : 0x0

Output meaning: vwc: 0x1 indicates volatile write cache is present. That’s not automatically bad—but it raises the question: does the drive have PLP?

Decision: If the device has volatile cache and no PLP, do not use it as a SLOG. Treat sync durability claims as suspect until proven.

Task 6: Check SATA drive write cache setting (where lies sometimes begin)

cr0x@server:~$ sudo hdparm -W /dev/sda | head -n 2
/dev/sda:
 write-caching =  1 (on)

Output meaning: Write caching is enabled. Many drives are safe; some aren’t; controllers can also interfere.

Decision: If you can’t verify power-loss protection, consider disabling write cache for safety tests (accepting performance hit), or use enterprise drives with PLP.
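
If you want that A/B comparison, here is a sketch assuming /dev/sda is the suspect drive. Many drives reset this setting on power cycle, so re-check it after every reboot of the test host:

cr0x@server:~$ sudo hdparm -W0 /dev/sda    # disable the volatile write cache for a comparison run
cr0x@server:~$ sudo hdparm -W /dev/sda     # verify: should now report write-caching = 0 (off)
cr0x@server:~$ sudo hdparm -W1 /dev/sda    # re-enable once the comparison run is recorded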

Task 7: Verify that ZFS sees ashift and that it’s sane

cr0x@server:~$ sudo zdb -C tank | egrep 'ashift|vdev_tree' | head -n 20
        vdev_tree:
            type: 'root'
            id: 0
            guid: 1234567890123456789
            ashift: 12

Output meaning: ashift: 12 means 4K sectors. Wrong ashift can cause pathological write amplification and long TXG commits.

Decision: If ashift is wrong, fix it by rebuilding the pool. Don’t rationalize it; the pool won’t care about your feelings.

Task 8: Baseline IO latency and queueing before you start breaking things

cr0x@server:~$ iostat -x 1 3
Linux 6.8.0 (server)   12/26/2025  _x86_64_  (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.31    0.00    1.12    0.45    0.00   96.12

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz  aqu-sz  %util
sda              0.00      0.00     0.00   0.00    0.00     0.00   12.00   512.00     0.00   0.00    7.20    42.67    0.09   8.40
nvme0n1          0.00      0.00     0.00   0.00    0.00     0.00   80.00 10240.00    0.00   0.00    0.45   128.00    0.04   2.10

Output meaning: Watch w_await, aqu-sz, and %util. A SLOG with high await during sync tests is a red flag.

Decision: If baseline is already saturated, your power-loss test will mostly measure “how overloaded we were,” not correctness.
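
For the p95/p99-style view without external tooling, OpenZFS can print latency histograms and queue depths itself; a sketch for the tank pool:

cr0x@server:~$ sudo zpool iostat -w tank 5 2      # total/disk/queue wait histograms, two 5-second samples
cr0x@server:~$ sudo zpool iostat -q -v tank 5 2   # per-vdev queue depths; a backed-up sync write queue points at flush-latency pain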

Task 9: Generate a sync-heavy workload that actually calls fsync

cr0x@server:~$ sudo fio --name=syncwrite --directory=/tank/data/test \
  --rw=randwrite --bs=4k --size=2G --iodepth=1 --numjobs=4 \
  --fsync=1 --direct=1 --time_based --runtime=60 --group_reporting
syncwrite: (groupid=0, jobs=4): err= 0: pid=22310: Fri Dec 26 10:12:41 2025
  write: IOPS=410, BW=1640KiB/s (1679kB/s)(96.1MiB/60001msec)
    clat (usec): min=650, max=42000, avg=2400.12, stdev=1100.55
    lat (usec): min=670, max=42120, avg=2420.88, stdev=1102.10

Output meaning: With --fsync=1 and iodepth=1, you’re forcing frequent durability points.

Decision: Use this workload during power-cut tests to validate whether acknowledged sync writes survive.

Task 10: Track ZFS latency and queue stats during the test

cr0x@server:~$ sudo zpool iostat -v tank 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.20T  5.80T     20    450  2.5M  1.7M
  raidz2-0  1.20T  5.80T     20    210  2.5M  1.2M
    sda         -      -      3     35   420K   220K
    sdb         -      -      3     35   420K   220K
    sdc         -      -      3     35   420K   220K
    sdd         -      -      3     35   420K   220K
logs            -      -      -    240     -   520K
  nvme0n1p1     -      -      -    240     -   520K

Output meaning: If writes pile into logs under sync pressure, you’re using the SLOG as intended. If not, your workload might not be sync, or logbias/sync settings differ.

Decision: If the SLOG is saturated, expect high sync latency. Consider a faster PLP log device or additional log vdevs, not a bigger pool; mirroring the SLOG adds redundancy, not throughput.

Task 11: Validate pre-crash: snapshot a marker dataset so you can prove what survived

cr0x@server:~$ sudo zfs snapshot -r tank/data@pre_cut_01

Output meaning: No output means it worked. You now have a known-good reference state.

Decision: Always snapshot before you cut power. It turns “I think it changed” into an actual diff.
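
To make the marker provable rather than anecdotal, write data whose durability you can verify byte-for-byte after recovery. A minimal sketch; the file names are arbitrary, and the checksum deliberately lives outside the pool under test (adjust the path if your home directory sits on that pool). Take the @pre_cut_01 snapshot right after this:

cr0x@server:~$ sudo dd if=/dev/urandom of=/tank/data/marker_pre_cut_01.bin bs=1M count=4 conv=fsync    # 4 MiB of random data, fsync'd before dd returns
cr0x@server:~$ sudo sha256sum /tank/data/marker_pre_cut_01.bin | tee ~/marker_pre_cut_01.sha256        # checksum manifest kept off the pool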

Task 12: After crash: import the pool safely and observe replay

cr0x@server:~$ sudo zpool import
   pool: tank
     id: 1234567890123456789
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        tank        ONLINE
          raidz2-0  ONLINE
            sda     ONLINE
            sdb     ONLINE
            sdc     ONLINE
            sdd     ONLINE
        logs
          nvme0n1p1 ONLINE
cr0x@server:~$ sudo zpool import -o readonly=on -N tank

Output meaning: Importing read-only first lets you inspect without immediately writing new state.

Decision: If you see warnings about missing devices or UNAVAIL, stop and investigate cabling/controller resets before importing read-write.

Task 13: Check for last known transaction group and errors in event history

cr0x@server:~$ sudo zpool history -i tank | tail -n 25
2025-12-26.10:14:03 zpool import -o readonly=on -N tank
2025-12-26.10:14:21 zpool export tank
2025-12-26.10:15:10 zpool import tank

Output meaning: You can correlate imports/exports with your test timeline. Unexpected exports/imports can indicate watchdogs or automation interference.

Decision: If an agent is “helpfully” importing/exporting, disable it during testing. Determinism beats surprises.

Task 14: Run a scrub and interpret the result

cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool status -v tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 00:16:22 with 0 errors on Fri Dec 26 10:32:18 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
        logs
          nvme0n1p1 ONLINE       0     0     0

errors: No known data errors

Output meaning: “0 errors” after a crash is what you want. If you see repaired bytes, investigate whether it was pre-existing or induced.

Decision: Any checksum errors after a controlled test are a serious signal: suspect cache flush lies, a bad SLOG, or a controller issue.

Task 15: Compare state pre/post cut using snapshots (prove data loss boundaries)

cr0x@server:~$ sudo zfs snapshot -r tank/data@post_cut_01
cr0x@server:~$ sudo zfs diff -FH tank/data@pre_cut_01 tank/data@post_cut_01 | head
M	/tank/data/test/file0001
M	/tank/data/test/file0002
-	/tank/data/test/tmp_8831

Output meaning: You can see exactly what changed. If your application claims it fsync’d data that is missing, you have a durability breach.

Decision: Missing fsync’d data is not “expected loss.” It’s a bug or a lie. Escalate to hardware validation.
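
If you wrote a checksummed marker before the cut (as in Task 11), close the loop explicitly; the manifest path is whatever you chose when you created it:

cr0x@server:~$ sudo sha256sum -c ~/marker_pre_cut_01.sha256    # must print OK; a mismatch or a missing file is a durability breach, same escalation as above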

Task 16: Check kernel logs for storage resets and flush failures

cr0x@server:~$ sudo journalctl -b -k | egrep -i 'zfs|nvme|sd |scsi|reset|timeout|flush' | tail -n 40
Dec 26 10:15:18 server kernel: nvme nvme0: controller reset occurred
Dec 26 10:15:19 server kernel: sd 0:0:0:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Dec 26 10:15:20 server kernel: ZFS: Loaded module v2.2.4-1
Dec 26 10:15:27 server kernel: ZFS: pool tank imported, checkpoint disabled

Output meaning: Cache synchronize failures or controller resets around crash/reboot are where truth leaks out.

Decision: If flushes fail, do not trust sync durability. Fix firmware, HBA mode, or replace devices.

How to cut power safely (and what not to do)

The goal is to simulate an outage while still being able to recover and analyze. This is not the time for interpretive dance with power cords.
Choose a method that’s repeatable and doesn’t introduce extra variables.

Preferred methods (most repeatable)

  • Switched PDU outlet: cut power to the host while leaving networking and logging intact elsewhere.
  • IPMI chassis power off: consistent, logged, and remote-friendly. It’s not “pure,” but it’s good enough for many scenarios.
  • Relay-controlled AC cut: the closest you get to “real” while keeping repeatability.

Methods that create more confusion than insight

  • Pulling random PSUs in redundant systems: that’s a PSU-failure test, not a power-loss test. Still useful, but name it correctly.
  • Hard-reset button mashing: it adds human timing noise; you’ll spend the postmortem debating seconds.
  • Triggering kernel panic as your only test: it tests crash recovery, not power loss. It’s a different failure mode.

Staging the cut: what “good hygiene” looks like

Start the workload. Let it reach steady state. Snapshot. Record baseline stats. Then cut power on a known boundary (e.g., 30 seconds into a sync workload).
The intent is not random. Random is for attackers and load balancers.
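
A sketch of a repeatable, timestamped cut driven from a second host that keeps power. The BMC address bmc.lab.example and the password file ~/.ipmi_pass are assumptions; note that chassis power off is an immediate hard cut, not a graceful shutdown:

cr0x@observer:~$ # start the fio workload from Task 9 on the target first, then:
cr0x@observer:~$ sleep 30; date -u | tee -a ~/cut_times.log; \
    ipmitool -I lanplus -H bmc.lab.example -U admin -f ~/.ipmi_pass chassis power off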

Fast diagnosis playbook

After a crash test—or a real outage—you want answers quickly. This is the minimal path to identify the bottleneck and the risk.
Do it in order. Skipping steps is how you convert a minor incident into an expensive one.

First: can the pool import cleanly and what does ZFS think happened?

  • Run zpool import and zpool status -v after import.
  • Check for missing devices, suspended pool, or checksum errors.
  • Decision: If the pool is DEGRADED or errors are non-zero, treat this as a data-integrity incident, not a performance issue.

Second: check the kernel logs for resets, timeouts, and flush failures

  • Use journalctl -b -k filters for nvme/sd/scsi/reset/timeout/flush.
  • Decision: Flush failures or repeated resets point to hardware/firmware/controller issues. Don’t tune ZFS to “work around” lies.

Third: determine if sync latency is the pain (SLOG or main pool)

  • Run sync workload (fio with fsync) and watch zpool iostat -v and iostat -x.
  • Decision: If the log device is hot and high-latency, the SLOG is the bottleneck. If the main vdevs are hot, the pool layout or the disks themselves are limiting.

Fourth: validate integrity with a scrub (don’t guess)

  • Start zpool scrub, then monitor with zpool status.
  • Decision: Any post-crash checksum errors mean your “durable” chain is broken. Escalate before rerunning tests.

Fifth: prove application-level durability boundaries

  • Diff snapshots or compare application markers (transaction IDs, WAL positions, sequence numbers); a generic sequence-marker sketch follows this list.
  • Decision: If the app lost acknowledged commits, the storage stack acknowledged writes it didn’t actually persist.
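
One generic way to produce such markers without touching the application is an acknowledged-sequence writer: each number is printed only after its write has been fsync’d, so after recovery the last printed number must still be in the file. Run it from an observer host so the acknowledgement log survives the cut. A minimal sketch; the host names, the path, and write access to /tank/data are assumptions:

cr0x@observer:~$ ssh cr0x@server 'i=0; while true; do i=$((i+1)); \
      echo "$i" | dd of=/tank/data/seq.log oflag=append conv=notrunc,fsync status=none && echo "acked $i"; \
    done' | tee ~/acked_seq.log

After recovery, compare the two ends:

cr0x@server:~$ tail -n 1 /tank/data/seq.log    # must not be lower than the last "acked" number on the observer; lower means lost acknowledged writes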

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company moved a critical internal service to ZFS. Good choice. They also moved it to “fast NVMe” for the SLOG.
The logic was simple: NVMe is modern, modern is durable, and the spec has words like “flush.”

Their NFS exports were configured with sync semantics. The application team celebrated: latency dropped, dashboards got greener,
and everyone quietly decided the storage problem was solved forever. Then a building power event happened—brief, sharp, and real.
The hosts rebooted. Pools imported. No obvious errors.

Two days later, the weirdness started: a handful of files had older versions. Not missing. Not obviously corrupted. Just… time-traveled.
It was spotted because an engineer compared generated artifacts against a build manifest that did fsync on completion.
The storage team initially suspected an application bug. The application team suspected the storage. Everyone suspected everyone, which is normal.

The breakthrough came from a boring log line: occasional cache flush failures and controller resets during heavy sync bursts.
The “fast NVMe” was a consumer-grade device with volatile write cache and no real power-loss protection.
It could accept sync writes quickly, then lose them when power vanished. ZFS did what it was told. The device did what it wanted.

They replaced the SLOG with a PLP-capable enterprise device and reran the power cut tests. The time-travel stopped.
The lesson wasn’t “NVMe bad.” It was “don’t assume durability because marketing said ‘data center’ somewhere on the box.”

Mini-story 2: The optimization that backfired

A different org had a heavy write workload and hated sync latency. Someone proposed a “pragmatic” tweak:
set sync=disabled on the dataset because “our app is replicated anyway.”
This was not a villain twirling a mustache. It was a well-meaning engineer trying to hit a latency SLO with limited budget.

The performance improvement was dramatic. So dramatic that the change spread from one dataset to several others.
It survived multiple quarters, meaning it survived multiple layers of review, because nobody wants to be the person who reintroduces latency.
They had a UPS. They had redundant PSUs. They had faith.

Then came a storage firmware update and a sequence of host reboots. One host crashed mid-write burst.
Replication didn’t save them because the replication stream was also sourced from the same “sync-disabled” reality.
They lost a slice of acknowledged transactions. Not all of them. Just enough to cause irreconcilable mismatches.

The postmortem was grim but useful. They had optimized the wrong layer: they removed the filesystem’s honesty contract instead of
investing in a proper SLOG and making sure the application’s durability model matched reality.
The fix was not just flipping sync back. It was auditing every dataset, aligning app semantics, and re-testing under failure injection.

The dry punchline: they hit the latency SLO until they hit the “explain to finance why invoices disappeared” SLO, which is harder to graph.

Mini-story 3: The boring but correct practice that saved the day

A third company ran ZFS for VM storage. Nothing fancy. Mirrors for vdevs, a mirrored SLOG with PLP, and a monthly scrub schedule.
They also had a ritual: every quarter, run a controlled power cut test in the staging environment after patch cycles.
It was not exciting work. Nobody got promoted for it. That’s how you know it was the right practice.

One quarter, after a kernel update, their staging power-cut test triggered a pool import delay and a handful of checksum errors on scrub.
Not catastrophic, but unacceptable. Because this was staging, they had time. They rolled back, bisected the change, and discovered
a driver/firmware interaction that caused occasional NCQ/flush weirdness on a subset of SATA SSDs under high sync pressure.

They swapped firmware on the drives and changed one controller setting. The test passed again. Then they patched production.
A month later, a real power outage hit a rack due to an upstream electrical event. Production recovered cleanly.

Nobody outside infra noticed. Which is the point. The “boring” quarterly test converted a surprise outage into a routine reboot.
That team’s reputation was built out of non-events, which is the only honest kind.

Common mistakes: symptoms → root cause → fix

1) Symptom: sync workloads are fast, then post-crash data is missing

Root cause: volatile write cache acknowledged writes without persistence (no PLP), or flush/FUA not honored through controller path.

Fix: use PLP-capable SLOG; verify controller is in IT/HBA mode; update firmware; confirm flush commands succeed in logs; re-test with sync=always.

2) Symptom: pool imports but scrub finds checksum errors after every power-cut test

Root cause: device write cache lies or unstable hardware (controller resets, bad cable/backplane), or power cut is causing partial writes beyond what redundancy can repair.

Fix: inspect journalctl for resets/timeouts; replace questionable components; disable write cache for test comparison; validate power delivery and PSU redundancy.

3) Symptom: import hangs for a long time after crash

Root cause: device enumeration delays, multipath timeouts, or a log device that is slow/unresponsive (SLOG problems can stall replay).

Fix: check kernel boot logs for device timeouts; confirm the SLOG is present and healthy; consider removing a faulty log device (in the lab) only after careful analysis.

4) Symptom: terrible sync latency, but SLOG looks idle

Root cause: workload isn’t actually sync (no fsync/O_DSYNC), or dataset sync and logbias aren’t what you think, or NFS export options differ.

Fix: generate a known sync workload (fio with fsync); verify dataset properties; validate NFS mount/export sync settings; observe zpool iostat -v.

5) Symptom: after crash, applications complain but ZFS reports “No known data errors”

Root cause: application-level durability expectations exceed what it actually requested (async writes, missing fsync), or application bug around ordering.

Fix: instrument the application: confirm fsync points, WAL behavior, or commit markers; use snapshot diffs around known markers; don’t blame ZFS until you prove the app asked.

6) Symptom: scrub takes forever or causes performance collapse after a crash

Root cause: pool is already near saturation, or one slow device drags the vdev; power event exposed a weak disk.

Fix: identify slow devices via iostat -x and zpool iostat -v; replace laggards; consider adding vdevs (not “bigger disks”) if you need IOPS.

7) Symptom: “zpool status” shows write errors but no checksum errors

Root cause: transport-level issues (cables, HBA, expander) or device timeouts rather than corruption.

Fix: check SMART/NVMe error logs, kernel messages, and cabling; don’t chase checksums when the bus is on fire.

Checklists / step-by-step plan

Step-by-step: safe power loss test cycle (repeatable)

  1. Pick the target dataset(s): test what matters, not the whole pool by default.
  2. Baseline pool health: zpool status -v must be clean; recent scrub preferred.
  3. Record configuration: capture zpool get all and zfs get all for relevant datasets; save outputs with timestamps.
  4. Enable evidence collection: ensure persistent journal, remote syslog, or out-of-band console capture.
  5. Create a marker: a small file or app-level transaction ID written with fsync; snapshot @pre_cut.
  6. Start workload: use fio with fsync for sync behavior; optionally run a second mixed workload in parallel.
  7. Observe steady state: collect 60–120 seconds of zpool iostat -v and iostat -x.
  8. Cut power: via PDU/IPMI/relay; log the exact time and method (a scripted version of steps 5 through 9 is sketched after this checklist).
  9. Restore power: boot; do not auto-import read-write if you can avoid it.
  10. Import read-only first: inspect zpool import output; then import normally if healthy.
  11. Validate markers: check marker files, snapshot diffs, and app-level consistency.
  12. Scrub: run zpool scrub, verify zero errors.
  13. Document: record pass/fail and what changed. If you can’t explain it, you didn’t test it.
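
For repeatability, steps 5 through 9 can be wrapped in a small script run from an observer host that keeps power during the test. A minimal sketch under loud assumptions (SSH access and passwordless sudo on the target, pool tank, dataset tank/data, BMC at bmc.lab.example, IPMI password in ~/.ipmi_pass); import, validation, and scrub stay manual on purpose:

#!/usr/bin/env bash
# power_cut_cycle.sh -- controlled power-cut cycle, run from the observer host
set -euo pipefail

TARGET="cr0x@server"                 # host under test
BMC="bmc.lab.example"                # its BMC / IPMI address
RUN_ID="$(date -u +%Y%m%dT%H%M%SZ)"  # tag for markers, snapshots, and logs

# Step 5: fsync'd marker plus recursive snapshot; the checksum stays on the observer
ssh "$TARGET" "sudo dd if=/dev/urandom of=/tank/data/marker_${RUN_ID}.bin bs=1M count=4 conv=fsync \
  && sudo sha256sum /tank/data/marker_${RUN_ID}.bin \
  && sudo zfs snapshot -r tank/data@pre_cut_${RUN_ID}" | tee "marker_${RUN_ID}.sha256"

# Step 6: sync-heavy workload in the background; its SSH session dies at the cut, which is expected
ssh "$TARGET" "sudo fio --name=syncwrite --directory=/tank/data/test --rw=randwrite --bs=4k \
  --size=2G --iodepth=1 --numjobs=4 --fsync=1 --direct=1 --time_based --runtime=300" &

# Step 7: let the workload reach steady state
sleep 90

# Step 8: hard cut, with the exact time recorded on the observer
date -u | tee -a "cut_times_${RUN_ID}.log"
ipmitool -I lanplus -H "$BMC" -U admin -f ~/.ipmi_pass chassis power off

# Step 9: restore power after a pause; everything from import onward is a human decision
sleep 60
ipmitool -I lanplus -H "$BMC" -U admin -f ~/.ipmi_pass chassis power on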

Checklist: “Do not proceed” conditions

  • Existing checksum errors before testing.
  • Flush failures or device resets in kernel logs during normal operation.
  • SLOG device without verified PLP used for durability-critical sync workloads.
  • Pool already running at high utilization where results will be dominated by overload.

Checklist: evidence to capture every run

  • zpool status -v before and after.
  • zfs get for sync/logbias/recordsize/compression for tested datasets.
  • zpool iostat -v during workload.
  • Kernel logs around reboot: controller resets, timeouts, flush failures.
  • Snapshot diff around a known marker.
  • Scrub result after recovery.

FAQ

1) Does ZFS guarantee zero data loss on power failure?

ZFS guarantees on-disk consistency of its structures, and it protects data integrity with checksums.
It does not guarantee that async writes survive, and it cannot force dishonest hardware to persist data it claimed to have written.

2) If my pool imports cleanly after a power cut, am I safe?

You’re safer than many filesystems, but “imports cleanly” is not the same as “sync writes were durable.”
You still need application-level markers and a scrub to validate integrity and boundaries.

3) Is a SLOG required for power-loss safety?

No. A SLOG is primarily for performance of synchronous writes. Safety comes from correct sync semantics and honest devices.
A bad SLOG can actively reduce safety by concentrating sync writes on a device that lies under power loss.

4) Should I set sync=always everywhere?

For testing durability and exposing hardware lies: yes, on the datasets under test. For production: only if your latency budget supports it.
Otherwise, keep sync=standard and ensure your applications request fsync where needed.

5) How do I know if my SSD has power-loss protection?

Vendor specs help, but the reliable answer is testing plus enterprise device classes known for PLP.
If you can’t verify PLP, don’t put that device in the SLOG role for durability-critical workloads.

6) Is pulling the plug the best test?

It’s the most literal, but not always the most repeatable. A switched PDU or relay cut is better for consistent timing.
Use the method that gives you reproducible results and clean logs.

7) What’s the difference between a crash test and a power loss test?

A crash test (kernel panic, reboot) stops the OS but often leaves device power intact, so caches may flush.
A power loss test removes power so volatile caches don’t get a graceful exit. They’re different truths.

8) Can UPS replace power-loss testing?

No. UPS reduces frequency and buys time, but it doesn’t fix firmware bugs, flush lies, or partial power events.
Also: batteries age, and maintenance windows have a way of “discovering” that.

9) Do RAIDZ layouts behave differently under power loss compared to mirrors?

CoW consistency remains, but operationally mirrors usually recover faster and deliver better IOPS, which can reduce
recovery time and scrub impact. RAIDZ can be perfectly safe; it can also be slower under stress.

10) If I see checksum errors after a power cut, is my data permanently corrupted?

Not necessarily. With redundancy, ZFS can often repair. But checksum errors after a controlled test are a loud alarm.
Treat it as a signal that your durability chain is compromised and fix the root cause before trusting the system.

Conclusion: practical next steps

Power loss testing isn’t about bravery. It’s about being unwilling to outsource your confidence to assumptions.
ZFS gives you excellent primitives—copy-on-write consistency, checksums, scrubs, and clear visibility into pool state.
Your job is to make sure the hardware stack doesn’t sabotage those guarantees.

Next steps that actually move the needle:

  • Stand up a lab that matches production storage class and topology.
  • Run a scripted test cycle with sync=always on the target dataset and a real fsync workload.
  • Cut power in a repeatable way, import read-only first, then validate with snapshot diffs and a scrub.
  • If you use a SLOG, treat it like a durability component: PLP-capable, monitored, and tested.
  • Turn the results into a runbook: import procedure, log checks, integrity validation, and “stop the line” criteria.

The best outcome is boring: your test becomes a routine, the graphs look normal, and nobody tells a story about you in a postmortem.
That’s the kind of fame you want.
