Power goes out. Fans spin down. Your phone lights up. And you’re left with that slow dread: did ZFS “handle it,”
or did you just buy yourself a weekend of pool imports, degraded vdevs, and uncomfortable explanations?
A UPS-triggered clean shutdown is one of those boring, unsexy controls that makes the difference between “we lost power”
and “we lost trust.” But it doesn’t protect what most people think it protects. It protects consistency and
predictability—not hardware, not physics, and definitely not the belief that RAID is a force field.
What a clean shutdown actually protects (and what it doesn’t)
Let’s define “clean shutdown” the way production defines it: the OS and services stop in an orderly sequence, ZFS
flushes and commits what it needs to, and the pool is exported or at least left in a consistent, known state.
The UPS is just the messenger. The shutdown workflow is the protection.
What it protects
- ZFS intent log (ZIL) replay workload: With a clean stop, you reduce the amount of synchronous work left to replay at import time. Less replay means less startup latency and fewer surprises for apps that hate recovery windows.
- Application-level consistency opportunities: Databases can checkpoint, flush, or stop accepting writes. A filesystem can be crash-consistent and still be app-inconsistent. A clean shutdown lets you fix the second part.
- Predictable boot/import behavior: Clean shutdown avoids long “hung” imports, heavy txg recovery, and the “why is the pool taking 20 minutes to import” game.
- Reduced chance of cascading failures: If power flaps or the UPS runs out mid-chaos, you want the machine already down, not half-writing metadata while the battery wheezes its last.
- People and process: A clean shutdown means alerts fire in the right order, logs are written, and you can tell the difference between “power outage” and “we just had a storage incident.”
What it does not protect
- Data that was never committed: If an app buffered writes and never fsync’d, it can lose them on a clean shutdown too—unless the app participates (or the OS forces it by stopping services correctly).
- Hardware failures caused by power events: Surges, brownouts, and cheap PSUs can still do damage. A UPS helps, but it’s not a talisman.
- Bad assumptions about “sync=disabled”: If you turned off sync because benchmarks, a clean shutdown won’t resurrect acknowledged writes that were actually sitting in RAM.
- Firmware and controller write caches without power-loss protection: If your HBA or drive lies about flushes, your filesystem can do everything right and still lose. ZFS can’t outvote physics.
- Silent corruption you already had: ZFS will detect many corruptions and repair when redundancy exists, but a clean shutdown doesn’t retroactively make a bad disk good.
One quote that still feels like ops scripture: “Everything fails, all the time.”
— Werner Vogels.
The grown-up move is to decide which failures become boring.
Joke #1: A UPS is like an umbrella—useful, but if you try to stop a hurricane with it, you’ll just get wetter and angrier.
ZFS crash consistency in plain operational terms
ZFS is transactional. It stages changes and then commits them in transaction groups (txg). When a txg commits, on-disk
structures get updated in a way that’s designed to be consistent after a crash. That’s the core promise: you may lose
recent changes that weren’t committed, but you shouldn’t get a half-updated filesystem that needs an fsck-style rebuild.
The catch is that “consistent” is not “no downtime” and not “no data loss.” It’s “no structural nonsense.” And the
second catch is that the rest of the stack—the ZIL, the SLOG, the application, the hypervisor, the network—
can still conspire to make your day longer.
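If you are curious what that commit cadence looks like on a given box, OpenZFS on Linux exposes the txg timeout as a module parameter. A minimal check, assuming a Linux host with the zfs module loaded (the path and the five-second default are the usual ones, but verify on your version):
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_txg_timeout
5
Treat that number as a ceiling, not a metronome: dirty-data thresholds can force a commit sooner, and the value only bounds how much uncommitted async work a hard crash can throw away.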
The parts that matter during power loss
- TXG (transaction group): Batches dirty data and metadata, then commits. A crash means: last committed txg is safe; in-flight txg is discarded or replayed as needed.
- ZIL (ZFS Intent Log): Only relevant for synchronous writes. It stores intent records so that after a crash, ZFS can replay what was promised as “committed” to apps doing fsync/O_DSYNC.
- SLOG (Separate Log device): Optional device to store the ZIL more quickly (and safely, if power-loss protected). It’s not a write cache for everything. It’s a specialized log for sync writes.
- ARC: RAM cache. Great for performance. Useless for durability when the power is gone.
- L2ARC: Secondary read cache. Not a durability tool. On reboot it’s mostly “nice, but you’re cold again.”
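If you want to see the “ZIL is only for sync writes” point instead of taking it on faith, push a synchronous write stream at a scratch dataset and watch the log vdev. A rough sketch, assuming a pool named tank with a log vdev, sync left at standard, and a throwaway path (names are placeholders, not an invitation to test on production data):
cr0x@server:~$ dd if=/dev/zero of=/tank/scratch/syncfile bs=4k count=5000 oflag=dsync
cr0x@server:~$ sudo zpool iostat -v tank 1 3
With oflag=dsync, the log vdev’s write column lights up. Drop the flag and the same dd lands in RAM and the next txg, and the log column stays near zero. That difference is the whole ZIL/SLOG story in one experiment.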
Eight quick facts and historical context (because ops has a memory)
- ZFS was created at Sun Microsystems in the early 2000s to replace the traditional volume manager + filesystem split, because that split caused real operational pain at scale.
- Traditional filesystems often relied on fsck after unclean shutdowns; ZFS’s copy-on-write model was designed to avoid that rebuild cycle by keeping on-disk state self-consistent.
- The “ZIL” isn’t a journal for all writes; it’s specifically for synchronous write semantics. That distinction still confuses people and fuels bad SLOG purchases.
- Enterprise storage arrays historically used battery-backed write cache to acknowledge writes safely; ZFS’s sync path is a software approach that still depends on honest hardware flush behavior.
- Early consumer SSDs were notorious for lying about write completion and lacking capacitor-backed flush, which made “sync” less meaningful than people assumed.
- OpenZFS became a community-driven continuation after Oracle’s stewardship of Sun’s ZFS created a forked ecosystem; operational best practices spread through vendors and hard-earned incident reports.
- The rise of virtualization and NFS/iSCSI made synchronous semantics more common in home labs and SMB setups, increasing the real-world importance of ZIL/SLOG correctness.
- “UPS integration” used to mean serial cables and vendor daemons; now it’s mostly USB HID and networked UPS tools (NUT), which is easier—and easier to misconfigure.
The clean shutdown isn’t about making ZFS “more consistent.” ZFS is already designed to be crash-consistent.
The clean shutdown is about reducing replay, reducing recovery ambiguity, and giving applications time to close the book properly.
UPS integration goals: the contract you want
Don’t treat UPS integration as “install daemon, hope.” Treat it like a contract with three clauses:
detect power events, decide based on runtime, and shut down in the right order. If you can’t state your contract in one breath,
you don’t have one—you have vibes.
Clause 1: Detect the power event reliably
USB is convenient. USB is also the first thing to get flaky when hubs, cheap cables, and kernel updates collide.
If this is production, prefer a networked UPS or a dedicated NUT server that stays up longer than any single host.
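If you go the networked route, the client side boils down to one MONITOR line in upsmon.conf pointing at the NUT server. A minimal sketch, assuming a NUT server reachable as nut01 and credentials defined in its upsd.users (the hostname, user, and password here are placeholders):
cr0x@server:~$ grep '^MONITOR' /etc/nut/upsmon.conf
MONITOR ups@nut01 1 monuser secretpass secondary
The last field is secondary (slave on pre-2.8 NUT): the box physically cabled to the UPS runs the driver and acts as primary; everyone else subscribes over the network and shuts itself down when the primary declares the battery critical.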
Clause 2: Decide based on runtime, not battery percent theater
Battery percentage is not a plan. It’s a feeling with numbers. You want runtime estimates, plus a safety margin,
and you want the shutdown trigger to happen while the UPS is still stable. Brownout mode is not where you want to
discover your shutdown script has a typo.
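NUT can move that decision from the UPS firmware to you. The ups.conf ignorelb flag tells the driver to evaluate the low-battery condition itself from battery.charge.low / battery.runtime.low, and override.* lets you set those thresholds. A hedged sketch, assuming usbhid-ups and a 600-second example threshold (check your driver’s documentation before copying it):
cr0x@server:~$ grep -A4 '^\[ups\]' /etc/nut/ups.conf
[ups]
    driver = usbhid-ups
    port = auto
    ignorelb
    override.battery.runtime.low = 600
With that in place, upsmon sees the low-battery flag once estimated runtime drops below 600 seconds, which is where your measured shutdown time plus margin belongs, not wherever the vendor’s default happens to sit.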
Clause 3: Shut down in the right order
For ZFS, “right order” usually means: stop apps that generate writes, flush/stop network storage exports, then shut down the OS.
If you’re running VMs on top of ZFS, the hypervisor needs time to power them off cleanly before the host drops.
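On systemd hosts, stop order is start order reversed, so the cheapest way to get “apps stop before ZFS quiesces” is a dependency drop-in, not a custom kill script. A minimal sketch, assuming a PostgreSQL unit and the stock zfs-mount.service (the unit names are examples; substitute your own services):
cr0x@server:~$ cat /etc/systemd/system/postgresql.service.d/10-zfs-ordering.conf
[Unit]
# Start after ZFS datasets are mounted, which also means: stop before they go away at shutdown.
After=zfs-mount.service
Requires=zfs-mount.service
cr0x@server:~$ sudo systemctl daemon-reload
Because shutdown walks the dependency graph in reverse, this single drop-in makes systemd stop PostgreSQL (and give it its checkpoint window) before it tears down zfs-mount.service on the way out.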
Joke #2: The only thing more optimistic than “we’ll fix it after the outage” is “the UPS will handle it.”
Fast diagnosis playbook
When someone says “we had a power event and ZFS is slow/weird,” you want a fast triage that finds the bottleneck
in minutes, not hours. This is the order that tends to pay rent.
First: confirm what kind of shutdown you actually had
- Was it a clean systemd shutdown, or a hard power loss?
- Did the UPS daemon fire? Did it call the shutdown command? Did the host reach poweroff?
- Did the pool import automatically, or did it wait/hang?
Second: check pool health and whether you’re replaying or resilvering
- Is the pool ONLINE, DEGRADED, or importing slowly?
- Is a scrub/resilver running, and is it throttling IO?
- Are there checksum errors (which change the whole conversation)?
Third: decide if the issue is synchronous write latency (ZIL/SLOG) or general IO
- If apps complain about fsync latency, focus on SLOG device health, queue depth, and flush behavior.
- If everything is slow, look at vdev latency, one dying disk, controller resets, or a pool stuck in recovery.
Fourth: verify service shutdown ordering
- Did NFS/iSCSI exports stop before ZFS and networking went down?
- Did databases receive a stop signal and have time to checkpoint?
- Did VMs shut down, or did the host drop out from under them?
Practical tasks: commands, outputs, decisions (16 of them)
These are the tasks I actually run after power events or during UPS integration work. Each has: a command, realistic output,
what the output means, and the decision you make. Commands assume Linux + OpenZFS; adjust paths for your distro.
Task 1: Confirm last shutdown was clean (journald)
cr0x@server:~$ journalctl -b -1 -u systemd-shutdown --no-pager | tail -n 20
Dec 26 01:12:41 server systemd-shutdown[1]: Syncing filesystems and block devices.
Dec 26 01:12:44 server systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Dec 26 01:12:46 server systemd-shutdown[1]: Sending SIGKILL to remaining processes...
Dec 26 01:12:47 server systemd-shutdown[1]: Unmounting file systems.
Dec 26 01:12:48 server systemd-shutdown[1]: Deactivating swaps.
Dec 26 01:12:48 server systemd-shutdown[1]: Detaching loop devices.
Dec 26 01:12:49 server systemd-shutdown[1]: All filesystems unmounted.
Dec 26 01:12:49 server systemd-shutdown[1]: Powering off.
Meaning: This looks like an orderly shutdown path. If it was power loss, you’d often see abrupt log truncation.
Decision: If this is clean, focus less on “did ZFS get corrupted” and more on “did apps stop cleanly and did the UPS trigger correctly.”
Task 2: Check if the UPS daemon logged “on battery” and “low battery” events (NUT)
cr0x@server:~$ journalctl -u nut-monitor --since "today" --no-pager | egrep -i "on battery|low battery|shutdown" | tail -n 20
Dec 26 01:10:02 server upsmon[2314]: UPS ups@localhost on battery
Dec 26 01:12:05 server upsmon[2314]: UPS ups@localhost battery is low
Dec 26 01:12:06 server upsmon[2314]: Executing automatic power-fail shutdown
Meaning: UPS events were detected and actioned. If you only see “on battery” but no shutdown, your trigger thresholds are wrong.
Decision: If “battery is low” happens too late (right before power loss), raise shutdown margin (runtime threshold) and retest.
Task 3: Query UPS status live (NUT upsc)
cr0x@server:~$ upsc ups@localhost
battery.charge: 96
battery.runtime: 2480
battery.voltage: 27.3
device.mfr: APC
device.model: Smart-UPS 1500
input.voltage: 121.0
output.voltage: 120.0
ups.status: OL
ups.load: 22
Meaning: ups.status: OL means on line power. Runtime is in seconds here. Load is modest.
Decision: Use runtime (battery.runtime) as your shutdown decision input. Configure shutdown when runtime dips below your required window + margin.
Task 4: Confirm the pool state and error counters
cr0x@server:~$ sudo zpool status -x
all pools are healthy
Meaning: No known issues. This is what you want after an outage.
Decision: If healthy, do not start “recovery rituals” that add risk. Proceed to app-level verification and performance checks.
Task 5: Full pool status with vdev errors (after a nasty outage)
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
scan: scrub repaired 0B in 00:12:31 with 2 errors on Thu Dec 26 01:44:09 2025
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 2 (repairing)
errors: Permanent errors have been detected in the following files:
tank/data/app.db
Meaning: This is no longer “just a power outage.” You have checksum errors and a file flagged as permanently damaged.
Decision: Stop the affected application, restore that file from backups/snapshots, and investigate disk sdd (SMART, cabling, controller resets). Do not hand-wave checksum errors.
Task 6: Check whether a resilver/scrub is in progress and its impact window
cr0x@server:~$ sudo zpool status tank | sed -n '1,20p'
pool: tank
state: ONLINE
scan: resilver in progress since Thu Dec 26 01:20:12 2025
312G scanned at 1.02G/s, 88.4G issued at 289M/s, 3.50T total
88.4G resilvered, 2.46% done, 03:18:44 to go
Meaning: Resilvering is happening and will steal IO bandwidth and IOPS.
Decision: If this is a production workload, consider scheduling IO-heavy jobs away from the resilver window, and touch the resilver/scrub module tunables only if you understand the tradeoffs (the exact parameter names vary by OpenZFS version).
Task 7: Check import history and whether the pool was exported
cr0x@server:~$ sudo zpool history -il tank | tail -n 20
2025-12-26.01:12:48 zpool export tank
2025-12-26.01:12:49 system: shutdown initiated by upsmon
2025-12-26.01:18:02 zpool import tank
Meaning: Someone (or automation) exported the pool before poweroff. That’s excellent hygiene and reduces import ambiguity.
Decision: If you don’t see exports during shutdown events, consider adding a controlled export in your UPS shutdown path—carefully, with ordering.
Task 8: Confirm dataset sync settings (avoid accidental “fast but unsafe”)
cr0x@server:~$ sudo zfs get -r sync tank | head -n 15
NAME PROPERTY VALUE SOURCE
tank sync standard default
tank/data sync standard inherited from tank
tank/data/postgres sync standard local
tank/vm sync disabled local
Meaning: tank/vm has sync=disabled. That dataset is acknowledging writes without guaranteeing durability.
Decision: If tank/vm hosts VM disks or databases, change it to standard (or always where appropriate) and invest in a proper SLOG if sync latency is the driver.
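If the audit turns up a dataset like tank/vm above, the fix is a one-line property change plus a check that it stuck. A minimal sketch using the dataset name from this output:
cr0x@server:~$ sudo zfs set sync=standard tank/vm
cr0x@server:~$ sudo zfs get sync tank/vm
NAME     PROPERTY  VALUE     SOURCE
tank/vm  sync      standard  local
The change applies to new writes immediately. Anything acknowledged while sync was disabled already lived outside the durability guarantee, so treat that history with appropriate suspicion.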
Task 9: See whether you even have a SLOG and what it is
cr0x@server:~$ sudo zpool status tank | egrep -A3 "logs|cache|special"
logs
nvme0n1p2 ONLINE 0 0 0
Meaning: There is a dedicated log device. Good—if it’s the right kind of device.
Decision: Validate the SLOG has power-loss protection (PLP). If it doesn’t, you’ve built a “fast crash” device.
Task 10: Confirm SLOG device health and error history (SMART)
cr0x@server:~$ sudo smartctl -a /dev/nvme0 | egrep -i "model|percentage used|media|power cycles|unsafe shutdowns"
Model Number: INTEL SSDPE2KX040T8
Percentage Used: 3%
Media and Data Integrity Errors: 0
Power Cycles: 38
Unsafe Shutdowns: 7
Meaning: “Unsafe Shutdowns” being non-zero is common over a device lifetime, but if it climbs with every outage, you’re routinely failing to shut down cleanly.
Decision: Correlate unsafe shutdown counts with outages and UPS tests. If they increment during planned tests, your shutdown trigger is too late or ordering is wrong.
Task 11: Check kernel logs for drive/controller resets around the outage
cr0x@server:~$ dmesg -T | egrep -i "reset|link is down|I/O error|ata[0-9]" | tail -n 20
[Thu Dec 26 01:10:11 2025] ata3: link is slow to respond, please be patient (ready=0)
[Thu Dec 26 01:10:15 2025] ata3: COMRESET failed (errno=-16)
[Thu Dec 26 01:10:19 2025] sd 2:0:0:0: [sdd] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Dec 26 01:10:19 2025] blk_update_request: I/O error, dev sdd, sector 812345672 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Meaning: This is not ZFS being dramatic; it’s the kernel telling you the storage path was unstable.
Decision: Treat it as hardware/cabling/controller first. ZFS errors are often downstream of transport errors.
Task 12: Measure pool latency quickly (iostat) and decide if you’re bottlenecked on sync writes
cr0x@server:~$ sudo zpool iostat -v tank 1 5
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
tank 2.01T 1.49T 110 980 12.8M 94.1M
raidz2-0 2.01T 1.49T 110 980 12.8M 94.1M
sda - - 30 220 3.1M 21.1M
sdb - - 28 235 3.2M 22.4M
sdc - - 27 245 3.3M 23.0M
sdd - - 25 280 3.2M 27.6M
logs - - 0 600 0.0K 52.0M
nvme0n1p2 - - 0 600 0.0K 52.0M
-------------------------- ----- ----- ----- ----- ----- -----
Meaning: The log device is taking significant write ops/bandwidth—classic sign of sync-heavy workload. If latency is high here, apps will feel it.
Decision: If performance complaints align with log activity, focus on SLOG performance/PLP and on whether the workload truly needs sync writes.
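If you want a number instead of a feeling, a small fio run with an fsync after every write isolates exactly the path those apps are complaining about. A hedged sketch, assuming fio is installed and /tank/data is somewhere you can scribble (delete the test file afterwards):
cr0x@server:~$ fio --name=synclat --filename=/tank/data/fio-sync.test --rw=write --bs=4k --size=128m --ioengine=psync --fsync=1
Look at the completion-latency (clat) percentiles in the output: low single-digit milliseconds is what a healthy PLP SLOG looks like; tens of milliseconds points at the log device, the flush path, or a device that should never have been a SLOG.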
Task 13: Check ZFS event history for import/replay-related hints
cr0x@server:~$ sudo zpool events -v | tail -n 25
TIME CLASS
Dec 26 2025 01:18:02.412345678 sysevent.fs.zfs.pool_import
pool: tank
guid: 1234567890123456789
Dec 26 2025 01:18:04.998877665 sysevent.fs.zfs.config_sync
pool: tank
Meaning: You can see import events and timing; this helps correlate “boot felt slow” with actual import operations.
Decision: If import events are frequent and associated with crashes, fix shutdown reliability before you chase micro-optimizations.
Task 14: Verify NFS/iSCSI services were stopped cleanly before shutdown
cr0x@server:~$ systemctl status nfs-server --no-pager
● nfs-server.service - NFS server and services
Loaded: loaded (/lib/systemd/system/nfs-server.service; enabled)
Active: active (running) since Thu 2025-12-26 01:18:22 UTC; 12min ago
Docs: man:rpc.nfsd(8)
Main PID: 1842 (rpc.nfsd)
Status: "running"
Meaning: It’s running now, but you still need to check whether it stopped before the last shutdown.
Decision: Pull previous boot logs for service stop timing. If NFS stayed alive while ZFS was trying to quiesce, you have ordering work to do.
Task 15: Validate systemd ordering for UPS-triggered shutdown hook
cr0x@server:~$ systemctl list-dependencies shutdown.target --no-pager | head -n 25
shutdown.target
● ├─systemd-remount-fs.service
● ├─systemd-poweroff.service
● ├─umount.target
● ├─final.target
● └─systemd-shutdown.service
Meaning: This shows the shutdown path. Your UPS daemon should trigger a normal shutdown, not a “kill -9 everything and pray.”
Decision: If your UPS integration bypasses systemd (custom scripts calling poweroff oddly), refactor to a standard shutdown so services can stop correctly.
Task 16: Test shutdown trigger without actually powering off (NUT upsmon -c fsd)
cr0x@server:~$ sudo upsmon -c fsd
Network UPS Tools upsmon 2.8.0
FSD set on UPS [ups@localhost]
Meaning: This simulates “forced shutdown” handling. Depending on configuration, it may start shutdown immediately.
Decision: Run this during a maintenance window and observe service stop order and pool export behavior. If it’s chaotic, fix it before the next real outage.
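Once the test host is back up, the previous boot’s journal is your scorecard for stop ordering. A minimal sketch; the unit names are just the ones used in this article, so substitute your own:
cr0x@server:~$ journalctl -b -1 -o short-precise --no-pager | egrep -i "Stopping |Stopped " | egrep -i "postgres|nfs-server|zfs" | head -n 20
You want application units stopping first, exports next, and ZFS-related units last. If the timestamps say otherwise, that is the ordering bug to fix now, while it’s a maintenance window and not an outage.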
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran a ZFS-backed virtualization host for internal services. They had a UPS. They had confidence.
They also had a critical assumption: “If the UPS is connected over USB, shutdown will always happen.”
The day it failed wasn’t dramatic—just a quick power drop and a generator that took longer than expected to come online.
The host didn’t shut down. The UPS did try to tell it, but the USB device had enumerated differently after a kernel update,
and the monitoring daemon was quietly watching the wrong device path. No alerts, because “the UPS daemon is running” was
their only health check.
After power came back, the ZFS pool imported. No corruption. Everyone relaxed too early. Then the VM filesystem showed
filesystem-level consistency but application-level inconsistency: a database that had been acknowledging writes relied on
host flush semantics and sync behavior they hadn’t verified. The VM booted, the DB complained, and then the “quick power event”
became a staged outage as teams argued about whether it was “storage” or “the app.”
The fix was boring: they added a NUT server on a small machine with stable power, used network-based monitoring, and added
an explicit periodic validation: pull UPS status and alert if the UPS disappears or the daemon reports stale data.
They also audited dataset sync settings and stopped pretending USB equals reliability.
Mini-story 2: The optimization that backfired
Another shop ran ZFS for an NFS datastore. Performance was “fine” until a new workload arrived: a build farm that did a lot
of metadata ops and demanded predictable latency. The team chased benchmarks and found the magic switch: sync=disabled
on the dataset. Latency dropped. Charts looked good. A round of high-fives occurred, which should always be treated as a
monitoring alert.
Months later, a power event hit. The UPS initiated a clean shutdown and the host powered off gracefully. They congratulated
themselves again. Then users reported missing build artifacts and “successful” jobs that produced broken outputs.
Nothing was structurally corrupt. ZFS was consistent. The app-level data, however, included acknowledged writes that had
lived in RAM and never made it to stable storage—because they told ZFS to lie on their behalf.
The postmortem was awkward because the outage was short and the system “shut down cleanly.” That’s the trap:
a clean shutdown protects you from certain failure modes, but it cannot protect you from intentionally disabling durability
guarantees. The optimization had converted “rare power loss” into “eventual data integrity incident.”
The remediation was to re-enable sync and add a proper SLOG device with PLP. They also separated workloads:
build cache datasets got different settings than artifact storage. Performance tuning moved from “flip dangerous bit” to
“design storage tiers like adults.”
Mini-story 3: The boring but correct practice that saved the day
A financial services team ran ZFS-backed iSCSI for a few internal systems that mattered more than they looked.
Their UPS integration wasn’t flashy. It was written down, tested quarterly, and treated as part of change management.
They had one rule: every shutdown path must be testable without pulling the plug.
They used a dedicated NUT server and configured clients to initiate shutdown when runtime fell below a threshold that included
“time to stop services” plus a margin. The shutdown procedure stopped iSCSI targets first, then told application services to stop,
then allowed ZFS to quiet down. They didn’t manually export pools in random scripts; they relied on standard shutdown ordering,
and where they added hooks, they did it as systemd units with clear dependencies.
When a real outage happened—longer than usual—systems shut down predictably, with clean logs. When power returned, boot was dull.
Imports were fast. No resilvers. No mystery errors. Users mostly didn’t notice.
The best part: during the post-incident review, nobody argued about what happened, because the logs made it obvious and the
procedure had been rehearsed. “Boring” was the success criterion, and they hit it.
Common mistakes: symptom → root cause → fix
1) Symptom: pool import takes forever after outages
Root cause: repeated unclean shutdowns leading to more recovery work; sometimes compounded by a marginal disk or controller resets.
Fix: verify UPS triggers earlier; check journalctl for clean shutdown; inspect dmesg for link resets; replace flaky cables/HBAs before tuning ZFS.
2) Symptom: apps report “database corruption” but ZFS status is clean
Root cause: app-level inconsistency due to disabled sync, missing fsync, or abrupt VM power loss.
Fix: restore from backups/snapshots; re-enable sync; ensure VM shutdown ordering; validate DB settings (durability mode) and storage semantics.
3) Symptom: NFS clients hang during outage and come back with stale handles
Root cause: shutdown ordering wrong: exports still active while storage goes away, or clients don’t see a clean server stop.
Fix: systemd ordering: stop nfs-server (or iSCSI target) early in shutdown; ensure UPS triggers standard shutdown, not a kill-script.
4) Symptom: UPS “works” in normal times but fails during real outages
Root cause: monitoring is present but not validated; USB flakiness; daemon running but disconnected from the device.
Fix: alert on UPS telemetry freshness; prefer network UPS; run simulated FSD tests; record expected log lines and check them.
5) Symptom: sync write latency spikes after adding a SLOG
Root cause: SLOG device is slower or lacks PLP, causing flush stalls; or it’s on a shared bus with contention.
Fix: choose an enterprise SSD/NVMe with PLP; isolate it; verify with zpool iostat and device SMART; remove bad SLOG rather than suffer it.
6) Symptom: checksum errors appear after power events
Root cause: real corruption from a dying disk, unstable power path, or controller issues; sometimes exposed by a scrub after reboot.
Fix: treat as hardware incident; run SMART tests; reseat/replace components; restore corrupted files from known-good copies; scrub again after fixes.
7) Symptom: system shuts down too late, battery dies mid-shutdown
Root cause: shutdown threshold set to “low battery” instead of “enough runtime to finish shutdown,” or runtime estimate not calibrated.
Fix: set shutdown on runtime remaining with margin; measure real shutdown duration; include time for VM shutdown and storage export.
8) Symptom: after outage, pool is ONLINE but performance is terrible
Root cause: resilver/scrub running, or one disk has high latency; ARC cold; or apps are replaying their own recovery.
Fix: check zpool status scan line; inspect per-vdev IO in zpool iostat -v; identify slow device; communicate expected recovery window to app owners.
Checklists / step-by-step plan
Step-by-step: build UPS integration that actually protects you
- Decide the objective: “Host powers off cleanly before UPS runtime hits X seconds; apps stop in order; pool imports fast.”
- Measure shutdown time: time a full stop of services and poweroff (see the sketch after this list). Add margin. Don’t guess.
- Choose architecture: if multiple hosts, run a dedicated UPS monitoring server (NUT) and have clients subscribe.
- Configure triggers on runtime, not percent: set shutdown when runtime < (measured shutdown time + margin).
- Validate service order: stop write-heavy services first (databases, hypervisors), then exports (NFS/iSCSI), then the OS.
- Audit ZFS durability knobs: find any sync=disabled datasets and justify them in writing.
- Validate SLOG correctness: if you use a SLOG, require PLP and monitor device health.
- Test without pulling power: simulate FSD and verify expected logs and behavior.
- Alert on UPS telemetry freshness: if the daemon can’t read UPS status, you want to know before the outage.
- Run a post-test import check: after each test, check zpool status, errors, and import timing.
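For the “measure shutdown time” step, you don’t need a stopwatch; the previous boot’s journal brackets the whole sequence. A rough sketch, assuming systemd with persistent journald logging:
cr0x@server:~$ journalctl -b -1 -o short-precise --no-pager | grep -i "Stopping " | head -n 1
cr0x@server:~$ journalctl -b -1 -o short-precise --no-pager | tail -n 1
The gap between the first “Stopping” line and the last journal entry of that boot is a decent proxy for real shutdown duration under that day’s load. Add VM drain time and a healthy margin, and that sum is the runtime threshold you configure on the UPS side.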
Outage response checklist: after power returns
- Check whether it was clean: previous boot logs (journalctl -b -1).
- Check pool health: zpool status -x, then full zpool status -v if anything looks off.
- Check kernel errors: dmesg for resets and IO errors.
- Check for scrub/resilver: expect performance impact and estimate completion time.
- Validate critical datasets: confirm sync settings, SLOG presence, and any checksum errors.
- Then validate applications: DB recovery, VM integrity, and client reconnections.
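If you want the first few checks as one paste-able block for the 3 a.m. version of yourself, something like this is a reasonable starting point (a sketch, not a monitoring system; the pool name tank is an example):
cr0x@server:~$ journalctl -b -1 --no-pager | tail -n 5
cr0x@server:~$ sudo zpool status -x
cr0x@server:~$ sudo dmesg -T | egrep -i "reset|i/o error" | tail -n 10
cr0x@server:~$ sudo zpool status tank | grep -A2 "scan:"
Anything suspicious in those four outputs decides whether you keep walking the checklist calmly or declare an incident and wake people up.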
FAQ
1) Does a clean shutdown prevent ZFS corruption?
It reduces risk and recovery work, but ZFS is designed to be crash-consistent even without it. Clean shutdown mainly protects
application consistency, reduces ZIL replay, and avoids hardware/transport chaos during power loss.
2) If ZFS is crash-consistent, why bother with a UPS at all?
Because crash-consistent isn’t “no consequences.” Imports can be slower, apps can lose acknowledged work (depending on settings),
and repeated hard losses stress hardware. Also: your uptime objectives usually include “don’t reboot unexpectedly.”
3) Is a SLOG required for safety?
No. A SLOG is a performance tool for synchronous workloads. Safety comes from correct sync semantics and honest hardware flush.
If you add a SLOG, it must be reliable (ideally PLP) or you can make things worse.
4) What does sync=disabled actually do?
It tells ZFS to acknowledge synchronous writes without waiting for them to be committed safely. You get speed and you trade away
durability. If your app thinks it committed data, it may be wrong after a crash—even with a clean shutdown if the timing is bad.
5) Should I export the pool during shutdown?
It can help by making the next import cleaner, but you must do it in the right order. If services still have open files or
clients are writing, export can hang or fail. Prefer standard shutdown ordering; add export hooks only if you can test them.
6) Why did my pool scrub start after reboot?
Some systems or admins schedule scrubs and they happen to run after reboot; in other cases, you run it manually to validate.
Scrub after an outage is a good habit when the event was unclean or you suspect hardware instability—expect performance impact.
7) UPS percentage says 40%. Why did it die early?
Load changes, battery age, temperature, and calibration all affect runtime. Percent is not a reliable predictor under changing load.
Base decisions on runtime estimates and periodically run controlled runtime tests to calibrate.
8) NUT vs apcupsd: which should I use?
If you have a single APC UPS connected to one host, apcupsd is simple. If you have multiple hosts, mixed UPS vendors,
or you want a networked monitoring server, NUT is usually the better fit.
9) How do I know if my shutdown ordering is correct?
You test it. Simulate a forced shutdown (maintenance window), then inspect logs to confirm: apps stopped first, exports stopped,
ZFS wasn’t still busy, and the host powered off with time to spare.
10) After an outage, should I immediately run zpool scrub?
If the shutdown was clean and zpool status -x is healthy, you can often wait for your regular schedule.
If the shutdown was unclean, or you see IO errors, or you run critical workloads, scrub sooner—after you’ve stabilized hardware.
Conclusion: next steps that prevent 3 a.m. surprises
A clean UPS-triggered shutdown doesn’t grant immortality to disks, databases, or bad ideas. It buys you a controlled landing:
fewer in-flight promises, less recovery work, and a system that behaves predictably when the lights come back.
That’s what you’re paying for.
- Measure your real shutdown time (including VM/app stop) and set runtime-based triggers with margin.
- Validate UPS telemetry (freshness alerts) and prefer network-based monitoring for multi-host setups.
- Audit ZFS durability settings and stop using sync=disabled as a performance strategy for important data.
- Verify SLOG reality: PLP, health monitoring, and removal if it’s the wrong device.
- Rehearse: run simulated shutdown tests and confirm logs show an orderly stop and clean import behavior.
If you do those five things, power outages become an inconvenience instead of a plot twist. And that’s the whole point.