The outage you remember isn’t the one where a disk exploded loudly. It’s the one where nothing “failed”,
performance got weird, a scrub took forever, and a week later you discovered that a controller was quietly
lying to you. ZFS didn’t lie. You just didn’t read the part where it told you.
zpool events is that part. It’s the stream of “something just happened” signals: timeouts, checksum errors,
device removals, resilver milestones, and the occasional ominous “ereport” that sounds like an HR escalation.
Ignore it long enough and you’ll eventually use it under pressure—at 03:00—while pretending you always had a plan.
What zpool events actually are (and what they are not)
zpool events is ZFS’s event feed. It shows a timeline of significant pool-related events:
disk I/O errors, checksum problems, vdev state changes, scrubs starting and finishing, resilvers, and assorted fault
management reports (“ereports”). It’s the stuff you wish you had preserved when someone says, “When did this start?”
It is not a replacement for zpool status. Think of it like this:
- zpool status tells you the current state and summarized history (recent errors, current resilver/scrub, etc.).
- zpool events tells you the sequence of events that got you there, often with more context and the exact time.
- System logs (kernel messages, journalctl) tell you the surrounding OS-level symptoms: resets, timeouts, link flaps.
Under the hood, on many platforms, ZFS plugs into a fault management pipeline. On illumos/Solaris you’ll see the
Fault Management Architecture (FMA) vocabulary more explicitly. On Linux, ZED (ZFS Event Daemon) and udev-ish reality
provide the “something happened” bridge into scripts, email, and ticket systems. The words differ; the operational
intent is the same: turn low-level weirdness into a page before data is at risk.
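To make that bridge concrete, here is a minimal ZEDLET sketch for a Linux host running ZED. The script name, the zedlet directory (/etc/zfs/zed.d is common, but distros differ), and the webhook URL are all assumptions; the ZEVENT_* variables are documented in zed(8).
cr0x@server:~$ cat /etc/zfs/zed.d/all-webhook.sh
#!/bin/sh
# Hypothetical ZEDLET: forward every event ZED handles to an internal webhook.
# ZED exports event fields as ZEVENT_* environment variables (see zed(8)).
# The URL is a placeholder; point it at your own alerting or ticketing intake.
[ -n "${ZEVENT_CLASS}" ] || exit 0
curl -s -X POST "https://alerts.example.internal/zfs" \
    --data-urlencode "host=$(hostname)" \
    --data-urlencode "eid=${ZEVENT_EID}" \
    --data-urlencode "class=${ZEVENT_CLASS}" \
    --data-urlencode "pool=${ZEVENT_POOL:-unknown}" \
    --data-urlencode "vdev=${ZEVENT_VDEV_PATH:-n/a}" \
    >/dev/null 2>&1
exit 0
cr0x@server:~$ sudo chmod 755 /etc/zfs/zed.d/all-webhook.sh && sudo systemctl restart zfs-zed.service
Naming a zedlet all-something runs it for every event; prefixing it with a subclass (scrub_finish-, statechange-, and so on) scopes it to that event type.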
Your job is not to memorize every event class. Your job is to build the habit of correlating:
event → device → workload symptom → risk → action. Do that, and zpool events stops being trivia and starts being a lever.
First dry joke, as promised: ZFS has a great sense of humor—every “checksum error” is a punchline where your storage vendor is the setup.
Interesting facts and a little history
These aren’t nostalgia. They’re clues to why the tooling looks the way it does and why some behaviors surprise people.
- ZFS was born at Sun Microsystems and designed to treat data integrity as a first-class feature, not a best-effort accessory.
- End-to-end checksumming means ZFS verifies every block it reads, from disk all the way into memory; it can detect silent corruption that RAID controllers happily “complete.”
- “Ereport” language comes from FMA (Fault Management Architecture), where faults and diagnoses are structured, not free-form log spam.
- zpool scrub is proactive verification; it’s not “defrag” and it’s not “maintenance theater.” It’s a controlled way to find latent errors before a disk fails.
- Resilvering is incremental in modern OpenZFS: it can copy only what’s in use (especially with features like sequential resilver), reducing rebuild time compared to old-school RAID rebuilds.
- ZFS intentionally prefers correctness over optimism; it will degrade a pool rather than keep pretending all is well if it can’t trust a device’s reads.
- The ZED daemon exists because events need actions; the command output is useful, but operationally you want hooks: alerting, replacement workflows, and automatic exports for removal events.
- Historically, storage stacks hid errors behind “retries” and controller caches; ZFS made errors visible, which is great—unless you ignore the visibility.
- OpenZFS became multi-platform (Linux, FreeBSD, illumos), and the event plumbing differs per OS, which is why advice must name the platform assumptions.
One real operational implication: people migrating from hardware RAID often assume the controller will “handle it.”
ZFS assumes the opposite: the stack is untrustworthy until proven otherwise. That’s not paranoia. That’s experience.
A practical mental model: from symptoms to events to decisions
1) Events are signals; status is the diagnosis summary
When a vdev goes DEGRADED, zpool status tells you what it is now. But zpool events can tell you whether it was:
a link reset storm, a single bad cable, firmware timeouts, a power issue, or actual media failure. Those have different fixes.
2) Not all errors are equal
Three categories matter operationally:
- Transient transport errors (timeouts, resets): often cabling/HBA/backplane/power. Replace disks and you’ll still be broken.
- Consistent read errors: likely media failure. Replace the device; check your redundancy and start a resilver.
- Checksum errors: can be disk, cable, controller, RAM (less common with ECC), or firmware. You need correlation, not guesswork.
3) “Scrub found errors” means “you just learned something,” not “panic”
Scrub errors are early-warning telemetry. Your response should be calm and methodical:
identify which device(s), what error type, whether errors repeat, and whether redundancy covered the damage.
The event timeline helps you determine whether this is a single incident or a trend.
4) Operationally, you’re chasing two questions
- Is data at risk right now? (pool faulted, no redundancy, errors during scrub/resilver)
- Is the platform lying? (transport instability, intermittent resets, “random” checksum errors across multiple disks)
The first question decides urgency. The second decides whether your fix will actually fix it.
One quote that belongs on every on-call rotation, repeated so often in reliability circles it’s practically folklore:
“Hope is not a strategy.” — often attributed to Gene Kranz, and an SRE proverb regardless of who said it first
Fast diagnosis playbook (what to check first, second, third)
This is the sequence I use when a pool looks sick, performance tanks, or someone posts a screenshot of DEGRADED in chat
with no other context. It’s built for speed and for avoiding the classic mistake: swapping disks before you know what failed.
First: establish current risk and whether you’re already in “stop the bleeding” mode
- Check pool state (zpool status -x, zpool status): are you degraded, faulted, suspended, resilvering, or scrubbing?
- Check whether redundancy is intact: how many vdevs, what parity level, any missing devices?
- Check if errors are still increasing: repeated read/write/checksum counts moving upward indicates an active problem.
Second: pull the event timeline and identify the class of failure
- Look at recent events (zpool events -v): do you see timeouts, removals, checksum ereports, scrub finish with errors?
- Correlate timestamps with kernel logs: link resets and SCSI errors often tell the transport story.
- Look for spread: one disk vs. multiple disks across the same HBA/backplane. Multiple disks failing at once is usually not “bad luck.” A quick way to eyeball spread is sketched just below.
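A quick sketch of that spread check, assuming the vdev_path lines shown in the task examples later in this article; the exact field layout of zpool events -v varies between OpenZFS versions, so treat the awk as a starting point:
cr0x@server:~$ zpool events -v | awk '/vdev_path/ {print $NF}' | sort | uniq -c | sort -rn
     12 /dev/disk/by-id/ata-WDC_WD80...-part1
      1 /dev/disk/by-id/ata-ST4000...-part1
One device dominating the count smells like media; several devices that share an HBA or backplane showing up together smells like transport.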
Third: decide the lane—hardware replacement, transport stabilization, or workload containment
- Single device with persistent read errors: prepare replacement, offline/replace, monitor resilver.
- Multiple devices with intermittent issues: check cables, HBA firmware/driver, backplane, power; avoid cascading failures during heavy rebuild.
- Scrub/resilver is slow: inspect I/O saturation, recordsize/workload mismatch, special vdev issues, SMR behavior, or a sick device throttling the vdev.
If you remember only one thing: the fastest way to lose a day is to replace hardware before you classify the failure.
Events help you classify it.
Hands-on tasks: commands, outputs, and the decision you make
These are real tasks you can run during an incident or as routine hygiene. Each one includes:
the command, an example of what you might see, what it means, and what you decide next.
Hostname and pool names are intentionally bland; production systems rarely are.
Task 1: “Is anything actually wrong?” quick check
cr0x@server:~$ zpool status -x
all pools are healthy
Meaning: ZFS sees no known faults right now.
Decision: If users report slowness anyway, pivot to performance triage (iostat, latency, queue depth). Still check events for recent transient issues.
Task 2: Get the full current health picture (don’t be lazy)
cr0x@server:~$ zpool status
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
scan: scrub repaired 0B in 03:12:11 with 2 errors on Sun Dec 21 01:10:44 2025
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 0
ata-WDC_WD80...-part1 ONLINE 0 0 2
errors: Permanent errors have been detected in the following files:
tank/data/vmstore/guest42.img
Meaning: A scrub found uncorrectable corruption affecting at least one file; checksum errors point at a specific disk path.
Decision: Treat as data-integrity incident. Identify whether corruption is limited to that file, restore from backup/snapshot, and investigate the disk/transport immediately.
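If a snapshot from before the corruption exists, recovering a single file can be as simple as copying it back out of the snapshot directory. A minimal sketch, assuming the dataset is mounted at /tank/data/vmstore and the autosnap names are hypothetical:
cr0x@server:~$ zfs list -t snapshot -o name,creation tank/data/vmstore | tail -n 3
tank/data/vmstore@autosnap-2025-12-19  Fri Dec 19  0:00 2025
tank/data/vmstore@autosnap-2025-12-20  Sat Dec 20  0:00 2025
tank/data/vmstore@autosnap-2025-12-21  Sun Dec 21  0:00 2025
cr0x@server:~$ sudo cp -a /tank/data/vmstore/.zfs/snapshot/autosnap-2025-12-20/guest42.img /tank/data/vmstore/guest42.img
Don’t be alarmed if the errors: list lingers after the restore; it typically clears only once the damaged blocks are freed and a later scrub completes.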
Task 3: Pull recent events with verbose details
cr0x@server:~$ zpool events -v | tail -n 40
time: 2025-12-21.01:10:44
eid: 1876
class: scrub_finish
pool: tank
pool_guid: 1234567890123456789
scrub_errors: 2
scrub_repaired: 0
scrub_time_secs: 11531
time: 2025-12-21.00:58:19
eid: 1869
class: ereport.fs.zfs.checksum
pool: tank
vdev_guid: 9876543210987654321
vdev_path: /dev/disk/by-id/ata-WDC_WD80...-part1
zio_err: 52
zio_objset: 54
zio_object: 102938
zio_level: 0
Meaning: You have a checksum ereport tied to a specific vdev path before scrub finished with errors.
Decision: Correlate this timestamp with kernel logs; decide whether the disk is bad or the transport is unstable. Start with transport checks if similar events exist for multiple disks.
Task 4: Follow events live during a rebuild or suspected flapping
cr0x@server:~$ zpool events -f
time: 2025-12-25.09:14:03
eid: 2101
class: resilver_start
pool: tank
vdev_path: /dev/disk/by-id/ata-WDC_WD80...-part1
time: 2025-12-25.09:16:27
eid: 2104
class: ereport.fs.zfs.io
pool: tank
vdev_path: /dev/disk/by-id/ata-WDC_WD80...-part1
zio_err: 5
Meaning: Errors during resilver are extra dangerous: you’re stressing the system while redundancy is already reduced.
Decision: If errors persist, pause and stabilize: check cables/HBA, consider offlining the suspect device to avoid pool suspension, and reduce workload during resilver.
Task 5: Confirm pool properties that affect behavior during incidents
cr0x@server:~$ zpool get -o name,property,value,source autoreplace,failmode,autotrim tank
NAME PROPERTY VALUE SOURCE
tank autoreplace off default
tank failmode wait default
tank autotrim on local
Meaning: failmode=wait can stall I/O when the pool can’t proceed safely; autoreplace=off means you must explicitly replace.
Decision: For fleets, standardize these. For an incident, knowing failmode explains “why everything froze” and informs whether to export/import or fail over.
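If you do standardize, the changes are one-liners. A sketch; note that failmode is a policy decision (wait stalls I/O until the pool recovers, continue returns errors for new writes instead), so don’t flip it mid-incident without understanding your failover story:
cr0x@server:~$ sudo zpool set autoreplace=on tank
cr0x@server:~$ zpool get autoreplace,failmode,autotrim tank
NAME  PROPERTY     VALUE  SOURCE
tank  autoreplace  on     local
tank  failmode     wait   default
tank  autotrim     on     local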
Task 6: Identify the exact disk behind a vdev path
cr0x@server:~$ ls -l /dev/disk/by-id/ata-WDC_WD80...-part1
lrwxrwxrwx 1 root root 10 Dec 25 08:59 /dev/disk/by-id/ata-WDC_WD80...-part1 -> ../../sdc1
Meaning: The “friendly” by-id path maps to /dev/sdc1 right now.
Decision: Use by-id paths in ZFS config to reduce “sdX roulette” after reboots. For replacement, confirm the physical slot using enclosure tools if available.
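To see the whole mapping at once before you pull anything, lsblk can print model and serial next to the kernel name (serials below are placeholders):
cr0x@server:~$ lsblk -d -o NAME,MODEL,SERIAL,SIZE
NAME MODEL            SERIAL      SIZE
sda  WDC WD80EFZX-68U WD-PLACEH1  7.3T
sdb  WDC WD80EFZX-68U WD-PLACEH2  7.3T
sdc  WDC WD80EFZX-68U WD-PLACEH3  7.3T
Match the serial printed on the drive tray before you pull it; enclosure LED tools, where available, are even better.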
Task 7: Correlate ZFS events with OS-level resets/timeouts (Linux)
cr0x@server:~$ journalctl -k --since "2025-12-21 00:50" --until "2025-12-21 01:05" | egrep -i "sd[cde]|ata|sas|reset|timeout|I/O error" | tail -n 20
Dec 21 00:58:11 server kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 21 00:58:12 server kernel: ata7.00: failed command: READ FPDMA QUEUED
Dec 21 00:58:12 server kernel: ata7: hard resetting link
Dec 21 00:58:17 server kernel: sd 7:0:0:0: [sdc] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Dec 21 00:58:17 server kernel: sd 7:0:0:0: [sdc] Sense Key : Medium Error [current]
Dec 21 00:58:17 server kernel: blk_update_request: I/O error, dev sdc, sector 771920896 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Meaning: This looks like real media trouble (Medium Error) plus resets. It’s not just a “temporary cable wiggle.”
Decision: Replace the disk. While you’re there, still inspect cabling/backplane if you see resets across multiple ports.
Task 8: Verify scrub schedule and last scrub result
cr0x@server:~$ zpool status tank | sed -n '1,20p'
pool: tank
state: ONLINE
scan: scrub repaired 0B in 02:44:09 with 0 errors on Sun Dec 14 03:12:33 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
Meaning: Recent scrub completed cleanly; good baseline.
Decision: If today’s incident is “sudden,” compare with event timing. A clean scrub last week makes “months of silent corruption” less likely.
Task 9: Trigger a scrub intentionally (and watch for trouble)
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ zpool status tank | grep -A2 "scan:"
scan: scrub in progress since Thu Dec 25 09:22:10 2025
312G scanned at 1.12G/s, 74.2G issued at 273M/s, 4.21T total
0B repaired, 1.72% done, no estimated completion time
Meaning: Scrub is running; “issued” rate shows actual reads submitted. If it’s crawling, something is throttling.
Decision: If scrub speed collapses, check for a sick disk, SMR drives, or an overloaded system. Consider scheduling scrubs off-peak or temporarily reducing workload.
Task 10: Find whether errors are concentrated on one vdev (quick per-vdev view)
cr0x@server:~$ zpool status -v tank
pool: tank
state: DEGRADED
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
/dev/disk/by-id/ata-ST4000... ONLINE 0 0 0
/dev/disk/by-id/ata-ST4000... DEGRADED 0 0 34
errors: No known data errors
Meaning: Many checksum errors on one side of a mirror, but ZFS could correct them using the other side (no permanent errors).
Decision: This is still a hardware/transport defect. Plan replacement; don’t congratulate yourself because redundancy saved you this time.
Task 11: Replace a failed disk safely (mirror example)
cr0x@server:~$ sudo zpool offline tank /dev/disk/by-id/ata-ST4000...bad
cr0x@server:~$ sudo zpool replace tank /dev/disk/by-id/ata-ST4000...bad /dev/disk/by-id/ata-ST4000...new
cr0x@server:~$ zpool status tank | egrep -i "state|scan|resilver"
state: DEGRADED
scan: resilver in progress since Thu Dec 25 09:31:02 2025
98.4G scanned at 1.02G/s, 21.7G issued at 231M/s, 98.4G total
21.7G resilvered, 22.06% done, 0:05:09 to go
Meaning: Replacement initiated; resilver is progressing.
Decision: Watch for new events during resilver. If more devices error out, stop and reassess the platform (HBA/backplane/power) before you turn one failure into three.
Task 12: Check ZED is installed and running (Linux systemd)
cr0x@server:~$ systemctl status zfs-zed.service --no-pager
● zfs-zed.service - ZFS Event Daemon (zed)
Loaded: loaded (/lib/systemd/system/zfs-zed.service; enabled; preset: enabled)
Active: active (running) since Thu 2025-12-25 07:12:09 UTC; 2h 19min ago
Docs: man:zed(8)
Main PID: 1124 (zed)
Tasks: 2 (limit: 38454)
Memory: 4.3M
CPU: 1.221s
Meaning: ZED is running, so events can trigger notifications and scripts.
Decision: If ZED isn’t running, you’re relying on humans to notice zpool status. Fix that—today, not “after the quarter closes.”
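If it isn’t running, the fix on a systemd host is usually one line (the unit name can differ slightly by distro):
cr0x@server:~$ sudo systemctl enable --now zfs-zed.service
cr0x@server:~$ systemctl is-active zfs-zed.service
active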
Task 13: Inspect ZED’s recent activity for missed alerts
cr0x@server:~$ journalctl -u zfs-zed.service --since "2025-12-21 00:00" | tail -n 25
Dec 21 00:58:20 server zed[1124]: eid=1869 class=ereport.fs.zfs.checksum pool=tank
Dec 21 00:58:20 server zed[1124]: Executing ZEDLET: /usr/lib/zfs/zed.d/all-syslog.sh
Dec 21 00:58:20 server zed[1124]: Executing ZEDLET: /usr/lib/zfs/zed.d/zed.email
Dec 21 00:58:20 server zed[1124]: email: to=storage-oncall@example.internal subject="ZFS checksum error on tank"
Meaning: ZED saw the checksum ereport and ran alerting hooks.
Decision: If on-call claims “no alert,” investigate mail routing/monitoring ingestion. ZFS did its part; your notification chain might be the weak link.
Task 14: Check for pool suspension symptoms (the “everything hangs” incident)
cr0x@server:~$ zpool status tank | grep -A3 -i "state:"
state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Online the device using 'zpool online' or replace the device with 'zpool replace'.
Meaning: Not suspended here, but this is the place you’ll see it. If the pool is suspended, applications often hang on I/O.
Decision: If suspended, stop pounding it with retries. Stabilize hardware, consider exporting/importing, and prioritize data evacuation or failover.
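One related command worth knowing: once the device or transport is genuinely back, zpool clear resets error counters and asks ZFS to resume I/O on a suspended pool. The healthy result below assumes the fix actually held; if it didn’t, the pool will just suspend again.
cr0x@server:~$ sudo zpool clear tank
cr0x@server:~$ zpool status -x
all pools are healthy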
Task 15: Get a clean “what changed recently” list from events
cr0x@server:~$ zpool events | egrep "vdev_(add|remove)|config_sync|scrub_start|scrub_finish|resilver_(start|finish)" | tail -n 20
time: 2025-12-25.09:14:03 class: resilver_start
time: 2025-12-25.09:20:11 class: config_sync
time: 2025-12-25.09:37:18 class: resilver_finish
time: 2025-12-25.09:40:00 class: scrub_start
Meaning: You can see the lifecycle of operations: resilver start/finish, config sync, scrub start.
Decision: Use this to explain cause/effect in incident notes and to catch “someone started a scrub during peak hours” without guessing.
Task 16: Extract events since a known bad time for incident timelines
cr0x@server:~$ zpool events -v | awk '/time: 2025-12-21/{p=1} p{print}'
time: 2025-12-21.00:58:19
eid: 1869
class: ereport.fs.zfs.checksum
pool: tank
vdev_path: /dev/disk/by-id/ata-WDC_WD80...-part1
zio_err: 52
Meaning: A quick way to pull relevant event blocks for a postmortem or ticket.
Decision: Preserve this output during incidents. Events can roll or be ephemeral depending on platform and tooling; your ticket should not rely on “we’ll reproduce it.”
Three corporate mini-stories (mistakes you can avoid)
Mini-story 1: The incident caused by a wrong assumption
The company ran a medium-sized virtualization cluster on ZFS mirrors. Good choice: simple, fast rebuilds, clear failure domains.
The on-call runbook said: “If a disk shows checksum errors, replace it.” That’s not terrible advice. It’s also incomplete.
One Tuesday, checksum errors started appearing across two hosts—different pools, different disks, same model HBA. The on-call swapped a disk in host A.
Resilver started, then host A threw more checksum errors, then timeouts, then the pool hung long enough to trip guest watchdogs.
“Bad batch of disks,” someone said, because that’s the story everyone’s heard before.
The event trail—ignored until after the fact—showed something more specific: bursts of I/O errors and link resets that coincided with a particular kernel module upgrade.
The disks weren’t failing independently; the transport was flapping under load. Replacing a disk increased load (resilver reads/writes), making the flaps worse.
The fix wasn’t “more disks.” It was pinning the HBA driver version, adjusting queue settings, and scheduling resilvers only after transport stability was confirmed.
They still replaced one disk eventually—because one did have real media errors—but not before stopping the systemic failure.
The wrong assumption: “Checksum errors mean disk.” Sometimes they do. Sometimes they mean “your whole I/O path is suspect.” Events tell you which story you’re in.
Mini-story 2: The optimization that backfired
Another team loved dashboards. They wanted fewer pages, fewer “false positives,” and a calmer on-call experience. Respectable goals.
They decided to alert only on zpool status state changes: ONLINE→DEGRADED, DEGRADED→FAULTED. No alerts for events like timeouts or single checksum ereports.
The environment ran fine for months. Then a backplane started intermittently dropping one disk for a few seconds under heavy write bursts.
ZFS would mark it as unavailable briefly, retry, and recover. The pool never stayed degraded long enough for the “state change” alert to fire.
Meanwhile, zpool events was basically screaming into the void with a repeating pattern: I/O errors, device removal, device return, config sync.
The first real symptom noticed by humans wasn’t an alert. It was a slow database and then a prolonged resilver after the disk finally stopped coming back.
By then, the pool had endured weeks of stress cycles and had less margin than anyone realized.
After the incident, they reintroduced alerts for specific event classes (removals, I/O ereports, checksum bursts) with rate-limiting and correlation.
Fewer pages? Yes. But not by going blind. By being smarter.
Second dry joke: The quietest monitoring system is also the cheapest—until you price in the outage.
Mini-story 3: The boring but correct practice that saved the day
A finance org ran ZFS on a pair of storage nodes. Nothing exotic: RAIDZ2 for bulk datasets, mirrored SLOG for sync workloads, regular scrubs, and ZED wired into tickets.
The setup was so boring that nobody talked about it, which is exactly what you want from storage.
During a routine weekly scrub, ZFS logged a small number of checksum errors on one disk. No permanent data errors. The pool stayed ONLINE.
ZED generated a ticket with the device by-id, the host, and the scrub context. The ticket wasn’t urgent, but it wasn’t ignored.
The storage engineer checked zpool events -v and saw the checksum ereports were clustered within a five-minute window.
Kernel logs in that window showed a single link reset on one SAS lane—classic “transport hiccup.” They reseated the cable at the maintenance window, then scrubbed again.
No errors.
Two months later, the same backplane started failing harder on a different slot. This time, because they had history,
they recognized the pattern immediately. They preemptively moved disks off the suspect enclosure before it caused a multi-disk incident.
The practice that saved the day wasn’t heroics. It was: scheduled scrubs, event-driven tickets, and actually reading the timeline.
Boring is a feature.
Common mistakes: symptom → root cause → fix
This section is intentionally blunt. These are the failure modes I’ve watched people repeat because they “handled ZFS once”
and assumed that meant “understood storage.”
1) “Pool is degraded, but performance is fine” → complacency → surprise outage
- Symptom: DEGRADED state for days; users don’t complain.
- Root cause: You’re running without redundancy margin. The next fault becomes an outage or data loss event.
- Fix: Replace/restore redundancy promptly. Use zpool events -v to confirm the failure is localized and not transport-wide before replacing.
2) “Random checksum errors across multiple disks” → misdiagnosed as “bad disks” → wasted replacements
- Symptom: Checksum errors appear on different disks, sometimes after reboots or load spikes.
- Root cause: HBA firmware/driver issues, bad SAS expander/backplane, marginal cable, power instability.
- Fix: Correlate events by timestamp with kernel logs; look for resets/timeouts across the same bus. Stabilize transport first, then replace any disk that still shows persistent media errors.
3) “Scrub is slow, must be normal” → hidden device sickness → resilver takes forever
- Symptom: Scrubs/resilvers run at a fraction of expected speed; sometimes “no ETA.”
- Root cause: One disk is retrying reads, SMR drives under sustained random I/O, or a saturated system with competing workload.
- Fix: Watch zpool events -f during scrub for I/O errors; check OS logs for read retries/timeouts; consider moving heavy workloads off during resilver and replacing the slow/failing device.
4) “No alerts, so no problem” → ZED not running → silent degradation
- Symptom: Someone discovers issues manually; no tickets/pages were created.
- Root cause: ZED disabled, misconfigured email/handler, or monitoring only checks zpool status daily.
- Fix: Enable and validate ZED; test a controlled event (scrub start/finish) and verify alerts land where humans will see them.
5) “Replace disk immediately” → rebuild stress triggers latent transport faults → cascading failures
- Symptom: The moment resilver starts, other disks start erroring, or the same disk “fails harder.”
- Root cause: The rebuild increases I/O and reveals flaky HBA/backplane/cabling.
- Fix: Before replacement, look for multi-disk patterns in zpool events. If transport instability is suspected, stabilize first, then rebuild.
6) “Permanent errors detected” → panic delete → making recovery harder
- Symptom: errors: Permanent errors have been detected... lists files.
- Root cause: ZFS could not reconstruct certain blocks from redundancy. Deleting the file may destroy forensic value or complicate app-level recovery.
- Fix: Snapshot current state (if possible), identify impacted objects, restore from backup/snapshot where appropriate, and preserve event/log evidence for root cause analysis.
Checklists / step-by-step plan
Checklist A: Set up “I will not be surprised” monitoring for zpool events
- Enable ZED and confirm it’s running: systemctl status zfs-zed.service.
- Verify at least one delivery path: syslog ingestion, email, or a ticket hook via zedlets.
- Alert on these event classes (rate-limited): device removal, I/O ereports, checksum ereports, scrub finish with errors, resilver finish with errors. A sketch of one way to do this follows this checklist.
- Include the vdev by-id path in alerts. Do not alert with /dev/sdX alone.
- Store a short rolling history of zpool events -v output in your incident system during pages (copy/paste is fine; perfection is overrated when you’re losing a disk).
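Here is the sketch mentioned above: page only on risky classes, at most once per class and pool per hour. The class strings match what Linux OpenZFS reports today, but verify them against your own zpool events output; the send-page command is a placeholder for whatever actually reaches a human.
cr0x@server:~$ cat /etc/zfs/zed.d/all-rate-limited-alert.sh
#!/bin/sh
# Hypothetical ZEDLET sketch: page on a short list of risky event classes,
# no more than once per class+pool per hour. "send-page" is a placeholder.
case "${ZEVENT_CLASS}" in
  ereport.fs.zfs.io|ereport.fs.zfs.checksum|resource.fs.zfs.removed|\
  sysevent.fs.zfs.scrub_finish|sysevent.fs.zfs.resilver_finish) ;;
  *) exit 0 ;;
esac
# Simple rate limit: one stamp file per class+pool, refreshed on each page.
stamp="/run/zed-alert-$(printf '%s' "${ZEVENT_CLASS}-${ZEVENT_POOL:-none}" | tr './ ' '___')"
if [ -f "${stamp}" ] && [ -z "$(find "${stamp}" -mmin +60 2>/dev/null)" ]; then
  exit 0   # already paged for this class+pool within the last hour
fi
touch "${stamp}"
send-page "ZFS ${ZEVENT_CLASS} pool=${ZEVENT_POOL:-unknown} vdev=${ZEVENT_VDEV_PATH:-n/a} eid=${ZEVENT_EID}"
exit 0
A production version would also inspect the scrub/resilver finish payload so clean completions don’t page anyone; this sketch errs on the side of waking you up.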
Checklist B: When a pool goes DEGRADED
- Run zpool status and record the full output in the ticket.
- Run zpool events -v | tail -n 200 and record it (a capture sketch follows this checklist).
- Identify whether the problem is single-device or multi-device.
- Correlate with kernel logs for resets/timeouts around the first event timestamp.
- If single-device media errors: offline/replace; monitor resilver events live.
- If transport pattern: pause risky operations; inspect cables/HBA/backplane/power; then proceed with replacements.
- After resilver, run a scrub and verify scrub_finish has scrub_errors: 0.
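The capture sketch referenced in the checklist, assuming nothing about your ticket system beyond the ability to attach a tarball; adjust the paths and the time window to taste:
cr0x@server:~$ dir="/var/tmp/zfs-incident-$(date +%Y%m%d-%H%M%S)"; mkdir -p "$dir"
cr0x@server:~$ zpool status -v tank > "$dir/zpool-status.txt"
cr0x@server:~$ zpool events -v > "$dir/zpool-events.txt"
cr0x@server:~$ journalctl -k --since "2 hours ago" > "$dir/kernel.log"
cr0x@server:~$ tar czf "$dir.tar.gz" -C /var/tmp "$(basename "$dir")"
Attach the tarball to the ticket before you start changing things; the event buffer will not wait for you.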
Checklist C: After an error-free fix (the part people skip)
- Confirm pool is ONLINE and zpool status -x is clean.
- Review events since incident start; ensure no recurring I/O or checksum ereports.
- Make one durable change: firmware pin, cable replacement record, scrub schedule tweak, or alert routing fix.
- Document the “first bad timestamp” and the “last bad timestamp.” It matters for scope.
Checklist D: Fast bottleneck hunt when resilver/scrub is slow
- Check if errors are occurring during the operation: zpool events -f for 5–10 minutes.
- Check OS logs for read retries/timeouts in the same window.
- Check whether one device is limiting the vdev (high latency): if a single disk is sick, replacing it is faster than “waiting it out.” A per-disk latency sketch follows this checklist.
- Reduce competing workload temporarily. Rebuild is a stress test; don’t run benchmarks during it.
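The per-disk latency sketch referenced above: -v breaks statistics out per device and -l adds average wait times. The first sample is cumulative, so judge from the second one; a disk whose waits sit an order of magnitude above its siblings is usually your throttle.
cr0x@server:~$ zpool iostat -v -l tank 5 2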
FAQ
1) What’s the difference between zpool events and zpool status?
zpool status is a current snapshot plus a short summary of recent activity. zpool events is the timeline: what happened, when, and often which device path and error class.
In incidents, timeline beats vibes.
2) If I see checksum errors, do I always replace the disk?
No. If checksum errors are isolated to one disk and persist, replacement is usually correct. If they appear across multiple disks, suspect transport (HBA/backplane/cabling) or firmware first.
Events plus kernel logs decide which one you’re dealing with.
3) Why does ZFS report errors when the application seems fine?
Because ZFS checks data integrity aggressively and can often repair corrupted reads from redundancy without surfacing it to applications.
That’s the whole point. Treat repaired errors as a warning sign, not a victory lap.
4) Are ereport.fs.zfs.* events “serious”?
They’re structured fault reports. Some are informational; many indicate real I/O trouble. What matters is frequency, whether they repeat on the same vdev, and whether they coincide with scrubs/resilvers or state changes.
5) Can I clear events or reset the event history?
Yes, on OpenZFS: zpool events -c clears the in-kernel event buffer, and the buffer is a bounded ring that rolls over on its own. Don’t treat the event history as a durable audit log.
Operationally, you preserve relevant output in your incident record and you focus on stopping new events from occurring.
6) How do I know if it’s a cable/backplane issue instead of a disk?
Look for patterns: multiple disks on the same HBA lane showing timeouts/resets around the same time; devices disappearing and reappearing; errors that spike under load (resilver/scrub) and vanish later.
Single-disk “Medium Error” style logs point more strongly at media failure.
7) Why did performance tank when a disk failed, even though the pool stayed online?
Degraded redundancy changes read/write behavior and increases work (more parity reconstruction, more retries, more scrubbing/resilvering).
Also, a failing disk can be “online” while taking forever to respond, dragging the whole vdev down.
8) Should I run scrubs weekly or monthly?
It depends on capacity, workload, and risk tolerance. Bigger pools benefit from more frequent verification because latent errors accumulate.
The non-negotiable part is consistency: pick an interval you can sustain, and alert on scrub finish with errors.
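A minimal cron sketch, assuming a Linux host with /etc/cron.d and a pool named tank. Debian-family packages already ship a monthly scrub job (check /etc/cron.d/zfsutils-linux before adding a second one), cron needs % escaped, and the zpool path varies by distro:
cr0x@server:~$ cat /etc/cron.d/zfs-scrub-tank
# Hypothetical schedule: scrub 'tank' at 02:30 on the first Sunday of each month.
30 2 * * 0 root [ "$(date +\%d)" -le 7 ] && /usr/sbin/zpool scrub tank
Pair the schedule with alerting on scrub_finish with errors (Checklist A) so a scrub that finds something actually reaches a human.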
9) How do I use events to avoid false positives in alerting?
Rate-limit and correlate. Alert on the first occurrence, then suppress repeats for a window unless the pool state changes or errors increase.
Don’t alert only on state changes; that’s how you miss weeks of flapping.
10) Do events help with “ZFS is slow” complaints?
Yes, indirectly. Events can tell you that the system is retrying I/O, experiencing device removals, or scrubbing/resilvering—exactly the kind of background stress that makes latency ugly.
For pure performance tuning, you still need I/O stats, but events often reveal the “why now.”
Conclusion: next steps that actually reduce risk
zpool events isn’t a novelty command. It’s the narrative your storage stack is writing while you’re busy watching dashboards that average away the truth.
Read it, automate it, and use it to decide whether you’re dealing with a dying disk or a lying transport.
Do these next, in this order:
- Make sure ZED is running and wired to humans (tickets/pages), not just syslog.
- Add a runbook step: whenever zpool status looks bad, capture zpool events -v and kernel logs around the first bad timestamp.
- Alert on event classes that indicate risk (I/O, removals, scrub/resilver errors), with rate-limiting instead of silence.
- Schedule scrubs and treat “repaired errors” as an actionable signal, not an FYI.
- During a disk replacement, watch events live. If errors spread, stop swapping parts and fix the transport.
Storage doesn’t fail politely. It fails in ways that look like network problems, database problems, or “the cloud is slow today.”
ZFS gives you a timeline. Use it before it has to save you.