Nobody wants to learn about storage problems from a ticket titled “the app is slow again” with a screenshot of a spinning wheel.
ZFS gives you better options: it already knows when a disk is getting weird, when a pool goes degraded, when a scrub finds damage,
and when a device vanishes for 12 seconds because a cable is auditioning for a horror movie.
ZED (the ZFS Event Daemon) is the part that turns those internal signals into human-visible alerts and automated responses.
If you run ZFS in production and ZED is not wired to alert you, you’re choosing surprise. And surprise is expensive.
What ZED actually does (and what it doesn’t)
ZFS is a filesystem and volume manager with a built-in sense of self-preservation. It checksums data, validates reads,
detects corruption, and records detailed fault information. But ZFS will not walk into your office and clear its throat.
ZED is the messenger.
At a high level, ZED listens for ZFS events (originating from the ZFS kernel module and userland tools) and runs small handler scripts
called zedlets. Those scripts can send email, log to syslog/journald, trigger a hot spare, record history, or integrate with
whatever alerting system you actually trust at 3 a.m.
The boundary line
- ZFS detects and records: errors, degraded state, resilver start/finish, scrub start/finish, device faults, etc.
- ZED reacts and notifies: “something happened, here are the details, do this next.”
- Your monitoring correlates and escalates: pages humans, opens tickets, tracks MTTR, and makes it someone’s problem.
ZED isn’t a full monitoring system. It’s a trigger-and-context engine. It won’t deduplicate alerts across fleets or give you SLO dashboards.
But it will give you early, specific, actionable signals — the kind that let you replace a disk on Tuesday afternoon instead of
doing surgery during a customer outage on Saturday night.
One operational quote worth keeping near your runbooks:
Hope is not a strategy.
— Gen. Gordon R. Sullivan
Joke #1: Storage failures are like dentists — if you only see them when it hurts, you’re already paying extra.
Facts and history that matter in ops
ZED isn’t just “some daemon.” It’s the operational surface area of ZFS. A few facts and context points make it easier to reason
about what you’re deploying and why it behaves the way it does:
- ZFS originated at Sun Microsystems in the mid-2000s with a “storage as a system” philosophy: checksums, pooling, snapshots, self-healing.
- ZFS was designed to distrust disks by default. End-to-end checksums are not a feature; they’re the assumption.
- OpenZFS emerged as the cross-platform effort after the original Solaris ZFS lineage fragmented; today Linux, FreeBSD, and others track OpenZFS.
- ZED grew out of the need to operationalize fault events. Detecting a fault is useless if nobody gets told.
- ZFS has an internal event stream (think: “state changes and fault reports”), and ZED is a consumer that turns those events into actions.
- Scrubs are a first-class maintenance primitive in ZFS: periodic full reads to find and repair silent corruption while redundancy exists.
- “Degraded” is not “down” in ZFS, which is exactly why it’s dangerous: service continues, but your safety margin is gone.
- Resilver is not the same as scrub: resilver is targeted repair/rebuild after a device replacement or attach; scrub is pool-wide verification.
- Many ZFS “errors” are actually the warning, not the incident: checksum errors often mean the system successfully detected bad data and healed it.
The operational punchline: ZFS is chatty in the ways that matter. ZED is how you listen without living in zpool status like it’s a social network.
How ZED sees the world: events, zedlets, and state
ZED’s job is simple: when a ZFS event happens, run handlers. The complexity is in the details: which events, which handlers,
how to throttle, and how to get enough context into your alerts so you can act without spelunking.
Event sources and the shape of data
ZFS emits events for pool state changes, device errors, scrub/resilver activity, and fault management actions. ZED receives them
and exposes event fields to zedlets as environment variables. The exact set varies by platform and OpenZFS version, but you’ll see
consistent themes: pool name, vdev GUID, device path, state transitions, and error counters.
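The quickest way to see that shape for yourself is a zedlet that does nothing but log what it was handed. A minimal sketch, assuming the common ZEVENT_* variable names (the exact set depends on the event class and your OpenZFS version); the filename is hypothetical, and the all- prefix makes it run for every event:
#!/bin/sh
# Hypothetical zedlet, e.g. /etc/zfs/zed.d/all-context.sh (mode 0755).
# Dumps a few commonly present event fields to syslog; unset fields fall back
# to a placeholder instead of breaking the message.
logger -t zed-context "eid=${ZEVENT_EID:-?} class=${ZEVENT_CLASS:-?} pool=${ZEVENT_POOL:--}"
logger -t zed-context "eid=${ZEVENT_EID:-?} vdev=${ZEVENT_VDEV_PATH:--} guid=${ZEVENT_VDEV_GUID:--}"
exit 0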
Zedlets: tiny scripts with sharp knives
Zedlets are executable scripts placed in a zedlet directory (commonly under /usr/lib/zfs/zed.d on Linux distributions,
with symlinks or enabled sets under /etc/zfs/zed.d). They’re intentionally small. They should do one thing well:
format an email, write to syslog, initiate a spare, record a history line, or call a local integration script.
The discipline: keep zedlets deterministic and fast. If you need “real logic,” have the zedlet enqueue work (write a file, emit to a local socket,
call a lightweight wrapper) and let another service do the heavy lifting. ZED is part of your failure-path. Don’t bloat it.
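A sketch of that enqueue pattern, with an assumed spool path and an assumed consumer (neither is a ZED convention, just an illustration):
#!/bin/sh
# Hypothetical zedlet, e.g. /etc/zfs/zed.d/all-enqueue.sh: record the event and return.
# A separate service (not shown) tails the spool file and does the slow work
# (tickets, chat posts, inventory lookups) outside the ZFS failure path.
SPOOL="/var/spool/zed/events.log"   # assumed location; pre-create with tight permissions
printf '%s eid=%s class=%s pool=%s vdev=%s\n' \
    "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
    "${ZEVENT_EID:-?}" "${ZEVENT_CLASS:-?}" "${ZEVENT_POOL:--}" "${ZEVENT_VDEV_PATH:--}" \
    >> "$SPOOL"
exit 0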
State and deduplication
ZED can generate repeated events for flapping devices or ongoing errors. If you blindly page on every emission, you’ll train your team
to ignore alerts, and then you’ll deserve what happens next. Good ZED setups usually do at least one of these:
- Throttle notifications (per pool/vdev and per time window).
- Send “state change” alerts (ONLINE→DEGRADED, DEGRADED→ONLINE) rather than every increment.
- Send scrubs as summary events (started, finished, errors found) with context.
- Store a small state file that tracks what was already sent.
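That last bullet is simple to sketch. Assuming a state directory of your own choosing (not a ZED default), the idea is to notify only when a pool's health differs from what you last reported; the stock notify zedlets also ship rate-limiting helpers in zed-functions.sh, so check what your version already does before reinventing it:
#!/bin/sh
# Hypothetical dedupe helper a notify zedlet could call; not a ZED built-in.
POOL="${ZEVENT_POOL:-unknown}"
STATE_DIR="/var/lib/zed-notify"                 # assumed path
mkdir -p "$STATE_DIR"
CURRENT="$(zpool list -H -o health "$POOL" 2>/dev/null)"
LAST="$(cat "$STATE_DIR/$POOL.last" 2>/dev/null)"
if [ -n "$CURRENT" ] && [ "$CURRENT" != "$LAST" ]; then
    printf '%s\n' "$CURRENT" > "$STATE_DIR/$POOL.last"
    logger -t zed-notify "pool=$POOL health ${LAST:-unknown} -> $CURRENT"
    # this is where the real page/ticket call belongs
fi
exit 0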
What you should alert on
Don’t alert on everything. Alerting is a contract with sleepy humans. Here’s a sane baseline:
- Pool state changes: ONLINE→DEGRADED, DEGRADED→FAULTED, removed device.
- Scrub results: completed with errors, repaired bytes, or “too many errors.”
- Checksum/read/write errors beyond a threshold or increasing rate.
- Device fault events: timeouts, I/O failures, “device removed,” path changes.
- Resilver completion: success/failure, duration, whether pool returns to ONLINE.
Alerts you should care about (and what to do with them)
A ZED alert should answer three questions: what happened, what’s at risk, and what do I do next.
If your alerts don’t include the pool name, affected vdev, and a copy of zpool status -x or a relevant snippet,
you’re writing mystery novels, not alerts.
DEGRADED pool
“DEGRADED” means redundancy has already absorbed a failure. You are still serving, but you may be one more failure away from data loss (depending on the RAIDZ/mirror level and which vdev is affected).
The right response is time-bounded: investigate immediately; replace promptly; don’t wait for the next maintenance window unless you enjoy gambling.
Checksum errors
Checksum errors are ZFS telling you “I caught bad data.” That’s good news and bad news. Good: detection works. Bad: something is corrupting data
in the stack — disk, cable, HBA, firmware, RAM (if you’re not using ECC), or even power instability. Your decision depends on whether errors are
isolated (single disk, single path) or systemic (across vdevs).
Read/write errors
Read errors indicate the device could not return data. ZFS may be able to reconstruct from parity/mirrors; if not, you see permanent errors.
Write errors often point to connectivity, controller resets, or the drive refusing writes. Either way, treat increasing counters as “replace or fix the path.”
Scrub finished with errors
A scrub that repaired data is a warning that redundancy saved you this time. If you don’t act, next time it might not.
A scrub that found unrepaired errors is a data integrity incident; your job becomes damage assessment and restoration strategy.
Device removed / UNAVAIL
This is often not “the disk died,” but “the path died.” Loose SAS cable, failing expander, HBA firmware bug, flaky backplane.
The fastest way to burn a weekend is to replace a perfectly fine disk when the backplane is the real criminal.
Practical tasks: commands, outputs, and decisions (12+)
These are the moves you’ll make in real life: verify ZED is running, validate it can send mail, trigger test events,
interpret pool health, and take corrective action. Every task below includes: the command, what the output means, and the decision you make.
Task 1: Confirm the ZED service is running (systemd)
cr0x@server:~$ systemctl status zfs-zed.service
● zfs-zed.service - ZFS Event Daemon (zed)
Loaded: loaded (/lib/systemd/system/zfs-zed.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2025-12-22 09:14:31 UTC; 2 days ago
Main PID: 1189 (zed)
Tasks: 3 (limit: 18982)
Memory: 7.4M
CPU: 1min 12s
CGroup: /system.slice/zfs-zed.service
└─1189 /usr/sbin/zed -F
What it means: “active (running)” is table stakes. If it’s inactive, ZFS events still happen; you just don’t hear about them.
Decision: If not running, fix ZED before trusting any “monitoring” that claims to watch ZFS.
Task 2: Inspect recent ZED logs in journald
cr0x@server:~$ journalctl -u zfs-zed.service -n 50 --no-pager
Dec 24 08:03:11 server zed[1189]: ZED: eid=402 class=sysevent.fs.zfs.scrub_finish pool=tank
Dec 24 08:03:11 server zed[1189]: ZED: executing zedlet: /usr/lib/zfs/zed.d/scrub_finish-notify.sh
Dec 24 08:03:11 server zed[1189]: ZED: eid=403 class=sysevent.fs.zfs.vdev_check pool=tank
What it means: You want to see events and zedlet execution lines. Silence during known events suggests misconfiguration or no events.
Decision: If you see events but no notifications, focus on zedlet configuration (mail, permissions, PATH), not ZFS itself.
Task 3: Validate ZED configuration file is sane
cr0x@server:~$ sudo egrep -v '^\s*(#|$)' /etc/zfs/zed.d/zed.rc
ZED_DEBUG_LOG="/var/log/zed.log"
ZED_EMAIL_ADDR="storage-alerts@example.com"
ZED_EMAIL_PROG="mail"
ZED_NOTIFY_INTERVAL_SECS=3600
ZED_NOTIFY_VERBOSE=1
What it means: ZED is configured to log, send email, and throttle alerts. Missing email settings is a common “we thought we had alerts” problem.
Decision: If your org doesn’t do email, set ZED to call a wrapper script that talks to your alert manager, but keep throttling.
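If email is not how your team gets alerted, one approach that keeps the stock notify zedlets intact is pointing ZED_EMAIL_PROG at a small wrapper. A sketch, assuming a hypothetical internal webhook and a wrapper installed as /usr/local/sbin/zed-to-webhook; ZED invokes the configured mail program with ZED_EMAIL_OPTS (typically a subject via -s and a recipient) and pipes the message body to stdin, so the wrapper just behaves like a mailer. Keep ZED_EMAIL_ADDR set to a placeholder so the notification path stays enabled, and keep ZED_NOTIFY_INTERVAL_SECS for throttling:
#!/bin/sh
# Hypothetical ZED_EMAIL_PROG replacement: relay ZED notifications to a webhook.
# The URL and install path are assumptions, not ZED defaults.
WEBHOOK_URL="https://alerts.example.internal/hook/zfs"
SUBJECT="zfs event"
# ZED's default mailer options typically put the subject first: -s "<subject>" <address>
if [ "$1" = "-s" ] && [ -n "${2:-}" ]; then
    SUBJECT="$2"
fi
{
    printf 'subject: %s\n\n' "$SUBJECT"
    cat                # the notification body ZED piped in
} | curl -fsS -m 10 --data-binary @- "$WEBHOOK_URL" \
    || logger -t zed-notify "webhook delivery to $WEBHOOK_URL failed"
Then set ZED_EMAIL_PROG="/usr/local/sbin/zed-to-webhook" in zed.rc and restart ZED.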
Task 4: Confirm the mailer exists and works from the host
cr0x@server:~$ command -v mail
/usr/bin/mail
cr0x@server:~$ echo "zed test message" | mail -s "zed smoke test" storage-alerts@example.com
...output...
What it means: The first command proves ZED’s configured mail program exists. The second proves the host can actually deliver mail (locally queued or relayed).
Decision: If mail fails, fix outbound mail before blaming ZED. ZED can’t notify through a nonexistent pipe.
Task 5: List enabled zedlets (what actions you’re actually taking)
cr0x@server:~$ ls -l /etc/zfs/zed.d
total 0
lrwxrwxrwx 1 root root 30 Dec 10 10:12 all-syslog.sh -> /usr/lib/zfs/zed.d/all-syslog.sh
lrwxrwxrwx 1 root root 40 Dec 10 10:12 statechange-notify.sh -> /usr/lib/zfs/zed.d/statechange-notify.sh
lrwxrwxrwx 1 root root 41 Dec 10 10:12 scrub_finish-notify.sh -> /usr/lib/zfs/zed.d/scrub_finish-notify.sh
What it means: Many distributions ship zedlets in /usr/lib and enable a subset via symlinks in /etc.
Decision: If nothing is enabled, you’ll get nothing. Enable only what you can act on; disable noisy ones until you’re ready.
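Enabling a handler is just a symlink from the shipped directory into /etc/zfs/zed.d; disabling is removing that symlink. A quick sketch, assuming the common /usr/lib/zfs/zed.d location and that you want resilver-completion notices:
cr0x@server:~$ sudo ln -s /usr/lib/zfs/zed.d/resilver_finish-notify.sh /etc/zfs/zed.d/
cr0x@server:~$ sudo systemctl restart zfs-zed.service
A restart isn't always strictly necessary for ZED to notice the change, but it's cheap and removes doubt about whether the new zedlet is active.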
Task 6: Check overall pool health quickly (the “are we on fire” command)
cr0x@server:~$ zpool status -x
all pools are healthy
What it means: This is ZFS being mercifully concise. If it prints anything else, you have work to do.
Decision: Healthy output doesn’t mean “no risk,” but it means you’re not actively degraded/faulted.
Task 7: Deep status when something is wrong
cr0x@server:~$ zpool status tank
pool: tank
state: DEGRADED
status: One or more devices has experienced an unrecoverable error.
action: Replace the device using 'zpool replace'.
scan: scrub repaired 0B in 03:21:18 with 0 errors on Wed Dec 24 08:03:11 2025
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-WDC_WD80EFAX-1 ONLINE 0 0 0
ata-WDC_WD80EFAX-2 ONLINE 0 0 0
ata-WDC_WD80EFAX-3 UNAVAIL 0 0 0 cannot open
errors: No known data errors
What it means: The pool is degraded because one device is unavailable. No known data errors is good; redundancy is still holding.
Decision: Treat UNAVAIL as urgent. Investigate path vs disk, then replace or restore connectivity before a second failure.
Task 8: Correlate ZFS device names to actual hardware
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep WD80EFAX-3
lrwxrwxrwx 1 root root 9 Dec 25 01:12 ata-WDC_WD80EFAX-3 -> ../../sde
What it means: You can map ZFS’s stable by-id path to a kernel device node (/dev/sde), which helps with SMART and physical slot mapping.
Decision: Use /dev/disk/by-id in pools whenever possible; it reduces “wrong disk pulled” incidents.
Task 9: Check SMART health for the suspect disk
cr0x@server:~$ sudo smartctl -a /dev/sde | egrep 'SMART overall-health|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
SMART overall-health self-assessment test result: PASSED
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 8
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 2
What it means: “PASSED” is not a get-out-of-jail-free card. Pending and uncorrectable sectors are bad signs even when the drive claims confidence.
Decision: If pending/uncorrectable is non-zero and growing, replace the disk. If ZFS already marked UNAVAIL, you’re done debating.
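For disks that are still reachable but ambiguous (counters flat, SMART attributes borderline), a long self-test adds evidence before you commit to a swap:
cr0x@server:~$ sudo smartctl -t long /dev/sde
cr0x@server:~$ sudo smartctl -l selftest /dev/sde
Check the self-test log after the estimated runtime; a failed read element there is a replacement signal regardless of the overall-health verdict.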
Task 10: Inspect recent kernel messages for link resets or transport errors
cr0x@server:~$ dmesg -T | tail -n 20
[Thu Dec 25 01:10:22 2025] ata9.00: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0x6 frozen
[Thu Dec 25 01:10:22 2025] ata9.00: irq_stat 0x08000000, interface fatal error
[Thu Dec 25 01:10:23 2025] ata9: hard resetting link
[Thu Dec 25 01:10:28 2025] ata9: link is slow to respond, please be patient (ready=0)
[Thu Dec 25 01:10:31 2025] ata9: COMRESET failed (errno=-16)
[Thu Dec 25 01:10:31 2025] ata9.00: disabled
What it means: This screams “path problem.” Could be the disk, could be the cable/backplane, could be the controller.
Decision: Before replacing disks in bulk, swap cable/backplane slot if you can. If errors follow the slot, you found the real failure domain.
Task 11: Show ZFS error counters and watch for growth
cr0x@server:~$ zpool status -v tank
pool: tank
state: DEGRADED
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-WDC_WD80EFAX-1 ONLINE 0 0 0
ata-WDC_WD80EFAX-2 ONLINE 0 0 0
ata-WDC_WD80EFAX-3 UNAVAIL 3 1 0 cannot open
errors: No known data errors
What it means: Counters (READ/WRITE/CKSUM) are evidence. A few historical errors are not always catastrophic, but increasing counts are a trend.
Decision: If counters increase after reseating cables or reboot, stop “trying things” and replace the component in the failing domain.
Task 12: Replace a failed disk the correct way
cr0x@server:~$ sudo zpool replace tank ata-WDC_WD80EFAX-3 /dev/disk/by-id/ata-WDC_WD80EFAX-NEW
...output...
What it means: ZFS begins a resilver onto the new disk, targeted to allocated blocks (typically faster than classic RAID rebuilds).
Decision: Monitor resilver progress. If the pool is still degraded after resilver, you have additional issues (wrong device, multiple failures, or path instability).
Task 13: Monitor resilver/scrub progress
cr0x@server:~$ zpool status tank
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered.
scan: resilver in progress since Thu Dec 25 01:22:10 2025
312G scanned at 1.12G/s, 44.8G issued at 164M/s, 3.21T total
44.8G resilvered, 1.36% done, 05:20:11 to go
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-WDC_WD80EFAX-1 ONLINE 0 0 0
ata-WDC_WD80EFAX-2 ONLINE 0 0 0
ata-WDC_WD80EFAX-NEW ONLINE 0 0 0 (resilvering)
What it means: “issued at” reflects actual write rate. “scanned at” can be higher due to metadata traversal and read-ahead.
Decision: If resilver is crawling, don’t guess. Check for I/O bottlenecks, errors on other disks, or controller issues.
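One quick way to see where a crawling resilver spends its time is per-vdev I/O statistics; the latency columns (-l) need a reasonably recent OpenZFS:
cr0x@server:~$ zpool iostat -vl tank 5
One slow or busy disk in an otherwise idle vdev is your next suspect; uniform slowness points at the controller, the pool being genuinely busy, or both.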
Task 14: Verify scrub scheduling and last results
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
scan: scrub repaired 0B in 03:21:18 with 0 errors on Wed Dec 24 08:03:11 2025
What it means: You have a last scrub completion record. If this is missing for months, you are flying without headlights.
Decision: If you don’t have periodic scrubs, schedule them. If you do have them but don’t alert on failures, wire ZED now.
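If nothing on the host schedules scrubs, a plain cron entry is enough. Check first: some distributions and newer OpenZFS packagings ship their own scrub cron jobs or systemd timers, and you don't want two schedulers colliding. A sketch with an assumed pool name, cadence, and zpool path:
cr0x@server:~$ echo '0 3 1 * * root /usr/sbin/zpool scrub tank' | sudo tee /etc/cron.d/zfs-scrub-tank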
Task 15: Confirm ZFS event delivery to ZED (sanity check)
cr0x@server:~$ sudo zpool scrub tank
...output...
cr0x@server:~$ journalctl -u zfs-zed.service -n 20 --no-pager
Dec 25 01:30:02 server zed[1189]: ZED: eid=510 class=sysevent.fs.zfs.scrub_start pool=tank
Dec 25 01:30:02 server zed[1189]: ZED: executing zedlet: /usr/lib/zfs/zed.d/all-syslog.sh
What it means: Starting a scrub produces an event. Seeing it in the ZED logs proves event flow.
Decision: If you don’t see the event, troubleshoot ZED service, permissions, or ZFS event infrastructure on that platform.
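To separate “no events are being generated” from “ZED is not reacting,” read the kernel's event queue directly; if entries appear here but never in ZED's logs, the problem is on the ZED side:
cr0x@server:~$ sudo zpool events
cr0x@server:~$ sudo zpool events -v | tail -n 40
Adding -f follows the stream live, which is handy during a drill.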
Task 16: Check that ZED is not blocked by permissions or missing directories
cr0x@server:~$ sudo -u root test -w /var/log && echo "log dir writable"
log dir writable
cr0x@server:~$ sudo -u root test -x /usr/lib/zfs/zed.d/scrub_finish-notify.sh && echo "zedlet executable"
zedlet executable
What it means: ZED failing to write logs or execute zedlets is boring, common, and devastating to alerting.
Decision: Fix file permissions and package integrity. Don’t “chmod 777” your way out; keep it minimal and auditable.
Joke #2: ZED is like a smoke alarm — people only complain it’s loud until the day it keeps their weekend intact.
Fast diagnosis playbook
This is the “get un-stuck fast” sequence. Not perfect. Not elegant. It’s optimized for: what do I check first, second, third to find the bottleneck
and decide whether I’m dealing with a disk, a path, a pool-level problem, or an alerting miswire.
First: is this a real pool problem or just missing alerts?
- Check pool health: zpool status -x. If it's healthy, you might be debugging ZED, not ZFS.
- Check ZED is alive: systemctl status zfs-zed.service and journalctl -u zfs-zed.service.
- Trigger a harmless event: start a scrub on a test pool or run a scrub start/stop cycle (if you can tolerate it). Confirm ZED logs an event.
Second: if the pool is degraded/faulted, localize the failure domain
- Identify the vdev and device: zpool status POOL and note READ/WRITE/CKSUM counters.
- Map by-id to real device: ls -l /dev/disk/by-id/ to get the kernel node.
- Check kernel logs: dmesg -T for link resets, timeouts, transport errors. Path problems often show up here first.
- Check SMART: smartctl -a for pending/uncorrectable sectors and error logs.
Third: decide whether you can stabilize without replacement
- If it looks like a path issue: reseat/replace cable, move the disk to another bay, update HBA firmware (carefully), verify power.
- If it looks like disk media: replace disk. Don’t negotiate with pending sectors.
- After change: watch resilver and re-check error counters. If counters keep climbing, stop and broaden scope to controller/backplane.
Fourth: verify alerting quality
- Ensure alerts are actionable: include pool name, device id, current zpool status, and last scrub results.
- Throttle and dedupe: page on state transitions; email or ticket on repeated soft warnings.
- Do a quarterly fire drill: simulate an event (scrub start/finish, test zedlet) and confirm the right team receives it.
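A minimal drill sketch, assuming a non-production test pool named drill and the journald log format shown earlier (adjust the grep to whatever your ZED actually emits):
#!/bin/sh
# Hypothetical fire-drill helper: prove ZFS -> ZED event flow end to end.
POOL="drill"                     # assumed test pool; do not drill on production
sudo zpool scrub "$POOL" || exit 1
sleep 60                         # give ZED a moment to receive and log the event
if journalctl -u zfs-zed.service --since "5 minutes ago" \
        | grep "scrub_start" | grep -q "$POOL"; then
    echo "OK: ZED logged scrub_start for $POOL"
else
    echo "FAIL: no scrub_start seen from ZED for $POOL" >&2
    exit 1
fi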
Three corporate mini-stories from the storage trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company ran ZFS on Linux for a handful of “durable” storage nodes. They’d migrated from an old RAID controller setup and
felt good about it: checksums, scrubs, snapshots — the works. They also had monitoring. Or so everyone believed.
The wrong assumption was subtle: “ZFS alerts are part of the ZFS package.” Someone had installed OpenZFS, created pools, scheduled scrubs,
and moved on. ZED was installed but not enabled. Nobody noticed because, day to day, ZFS is quiet when things are healthy.
Months later, a disk started logging intermittent timeouts. ZFS retried, healed from parity, and kept serving. The pool went DEGRADED briefly,
then returned to ONLINE after the disk came back. No alert, no ticket, no replacement. The error counters crept up like a slow leak behind a wall.
The actual incident arrived as a second disk failure during a heavy read period. Now the pool went hard DEGRADED and the application saw latency spikes.
Users reported “slow uploads.” Ops started from the wrong end of the problem (app tuning, load balancers) because they had no early signal.
Postmortem action items were boring and correct: enable ZED, wire notifications to the on-call rotation, page on pool degradation, and include
by-id device names so someone can pull the right drive without a séance.
Mini-story 2: The optimization that backfired
A data engineering team wanted fewer emails. They were tired of “scrub started” and “scrub finished” notes cluttering inboxes, and they had a point:
the alerts weren’t prioritized and nobody was reading them carefully.
The “optimization” was to disable scrub-related zedlets entirely. Their reasoning: “We already run scrubs monthly; if something is wrong, the pool will go degraded.”
That last clause is the landmine. Scrub results can reveal corruption that ZFS repaired silently. That’s not a degraded pool. That’s a warning shot.
A few months later, a scrub would have caught and repaired checksum errors on one vdev, pointing to a bad SAS cable. Instead, nobody saw the early signal.
The cable got worse. Eventually the disk dropped during a resilver triggered by an unrelated maintenance operation, dragging the resilver out and
increasing operational risk. The team had engineered a “quiet system” that failed loud.
They fixed it by re-enabling scrub alerts but changing the policy: scrub start events went to low-priority logs; scrub finish with repairs or errors
generated a ticket and a human review. Noise reduced. Signal restored. That’s the correct trade.
Mini-story 3: The boring practice that saved the day
An enterprise IT group ran a fleet of ZFS-backed VM hosts. Their storage platform wasn’t exciting; it was intentionally dull. They had a strict standard:
by-id device naming, quarterly scrub verification, and an on-call “disk replacement” runbook that fit on a page.
One Thursday, ZED paged “pool DEGRADED” with the affected vdev and the physical slot mapping. The host was still serving VMs fine.
The temptation in corporate environments is to postpone work because “no outage.” They didn’t.
The on-call followed the runbook: confirm status, check SMART, check kernel logs, and replace the disk. The resilver completed, pool returned ONLINE,
and they closed the loop by verifying the next scrub. No leadership escalation, no customer impact, no dramatic war room.
Two days later, another host in the same rack had a power event that caused a controller reset. If they’d still been degraded on the first host,
that second event could have turned a routine hardware replacement into a messy restoration. The boring practice bought them slack.
Common mistakes: symptoms → root cause → fix
1) Symptom: “We never get ZFS alerts”
Root cause: ZED service not enabled/running, or zedlets not enabled via /etc/zfs/zed.d.
Fix: Enable and start ZED; verify event flow with a scrub start and check journald for execution lines.
2) Symptom: “ZED logs events but no emails arrive”
Root cause: Missing mail program, blocked outbound SMTP, or misconfigured ZED_EMAIL_ADDR/ZED_EMAIL_PROG.
Fix: Run the same mail command manually from the host; fix relay/firewall/DNS; then re-test ZED.
3) Symptom: “Pager storm during a flaky disk event”
Root cause: No throttling/deduplication; alerting on every error increment rather than state change.
Fix: Configure notification interval; page on pool state transitions; ticket on repeated soft errors with rate thresholds.
4) Symptom: “Pool shows checksum errors on multiple disks at once”
Root cause: Shared failure domain (HBA, backplane, expander, cable, power) or memory corruption on non-ECC systems.
Fix: Stop replacing disks randomly. Inspect dmesg for transport resets, validate HBA firmware/driver, swap cables, and assess RAM/ECC posture.
5) Symptom: “Scrub finished, repaired bytes, but everyone ignored it”
Root cause: Alert policy treats scrub results as noise; no workflow to investigate repaired corruption.
Fix: Route “scrub finished with repairs/errors” to a ticket with a required review and follow-up checks (SMART, cabling, counters).
6) Symptom: “Resilver takes forever and the pool stays fragile”
Root cause: Underlying I/O bottleneck, additional marginal disks, or controller issues causing retries.
Fix: Check other vdev error counters, dmesg for resets, and SMART for slow sectors. If multiple disks are sick, stabilize hardware before pushing resilver hard.
7) Symptom: “ZED runs zedlets but they fail silently”
Root cause: Permissions, missing executable bits, missing dependencies in PATH, or scripts relying on interactive shell behavior.
Fix: Make zedlets self-contained: absolute paths, explicit environment, strict error handling, log failures to journald/syslog.
8) Symptom: “Ops replaced the wrong disk”
Root cause: Pools built on /dev/sdX names; alert doesn’t include stable identifiers; no slot mapping process.
Fix: Use /dev/disk/by-id in pools, include by-id in alerts, and maintain a mapping from bay/WWN to host inventory.
Checklists / step-by-step plan
Checklist A: Minimum viable ZED alerting (do this this week)
- Confirm ZED is installed and running: systemctl status zfs-zed.service.
- Enable ZED at boot: systemctl enable zfs-zed.service.
- Pick a notification destination (email or local integration script).
- Set ZED_EMAIL_ADDR (or wrapper) and ZED_NOTIFY_INTERVAL_SECS in /etc/zfs/zed.d/zed.rc.
- Enable only the zedlets you intend to act on (scrub finish, pool state changes, checksum errors).
- Trigger a scrub on a non-critical pool and verify you see ZED events in journald.
- Make sure on-call receives the alert and can identify the disk by stable name.
Checklist B: When you get a “pool DEGRADED” alert
- Run zpool status POOL. Capture it in the ticket.
- Identify affected vdev and device by-id; map to kernel device node.
- Check dmesg -T for transport errors and resets.
- Run smartctl -a on the device; look for pending/uncorrectable sectors and error logs.
- Decide: path fix (cable/backplane/HBA) vs disk replacement.
- Perform the change, then monitor resilver and re-check counters.
- After return to ONLINE, schedule/verify a scrub and watch for new repairs.
Checklist C: Quarterly alerting fire drill (so you trust it)
- Pick one host per storage class (NVMe mirror, RAIDZ, etc.).
- Start a scrub and confirm ZED sees scrub_start.
- Confirm scrub finish alerts include repaired bytes and errors summary.
- Confirm your paging policy triggers on a simulated degraded state (non-production test pool if possible).
- Review throttling: ensure no pager storms for repeated soft errors.
- Update runbooks with any new event fields your ZED version emits.
FAQ
1) What exactly is ZED?
ZED is the ZFS Event Daemon. It listens for ZFS events and runs handler scripts (zedlets) to notify humans or trigger automated actions.
2) Is ZED required for ZFS to function safely?
ZFS can detect and correct many issues without ZED. ZED is required for you to function safely: it turns silent risk into visible work.
3) What events should page humans vs create tickets?
Page on state transitions that reduce redundancy (DEGRADED/FAULTED, device removed, unrepaired errors). Ticket on scrub repairs and recurring soft errors.
4) Why do I see checksum errors if ZFS “self-heals”?
Because ZFS detected bad data and repaired it from redundancy. The checksum error is the evidence trail that something in the stack misbehaved.
Treat it as a warning to investigate, especially if errors increase.
5) How often should I run scrubs?
Common practice is monthly for large pools, sometimes weekly for smaller or higher-risk fleets. The right cadence depends on rebuild time,
drive size, and risk tolerance. Whatever you choose, alert on failures and repairs.
6) Can ZED send alerts to Slack/PagerDuty directly?
Typically you do it via a wrapper script invoked by a zedlet (or by modifying/adding a zedlet) that calls your internal alerting pipeline.
Keep ZED-side logic minimal and resilient.
7) Why did my pool go DEGRADED and then return to ONLINE?
Devices can flap: brief disconnects, controller resets, or timeout storms. ZFS may mark a device UNAVAIL and then reintegrate it.
That’s not “fine.” It’s a path or device reliability issue.
8) Should I rely on SMART “PASSED” to decide not to replace a disk?
No. SMART overall health is a coarse heuristic. Pending sectors, uncorrectables, and error logs matter more. ZFS error counters matter too.
9) What’s the difference between scrub and resilver for alerting?
Scrub is a planned integrity scan; you alert on completion and whether repairs/errors occurred. Resilver is a rebuild/repair after device changes; you alert on start, progress anomalies, and completion.
10) What if ZED is too noisy?
Don’t mute it globally. Tune it: throttle, page only on state transitions, and send informational events to logs. Noise is a policy bug, not a reason to go blind.
Practical next steps
If you only do three things after reading this, do these:
- Make sure ZED runs everywhere you run ZFS, starts on boot, and logs to a place you actually look.
- Make scrub results actionable: alert on scrub finish with repairs/errors, and create a workflow to investigate and close the loop.
- Page on lost redundancy: DEGRADED/FAULTED is not a suggestion. It’s ZFS telling you your safety margin is gone.
Then do the grown-up version: run a quarterly alerting drill, keep zedlets small and boring, and build alerts that include enough context
that a human can decide in one minute whether to swap a disk, a cable, or a controller.
ZFS is already doing the detection work. ZED is how you stop that work from dying quietly inside the machine.