Upgrading OpenZFS: The Checklist That Prevents Breakage

OpenZFS upgrades don’t usually fail because the code is bad. They fail because humans are optimistic, change windows are short, and compatibility is subtle. The breakage isn’t dramatic, either. It’s the quiet kind: replication that stops two days later, a boot pool that won’t import after a kernel update, or a feature flag you enabled “because it was there” that strands your pool on older hosts.

If you run production storage, you don’t “upgrade ZFS.” You upgrade an ecosystem: kernel modules, userland tools, bootloader support, feature flags, dataset properties, monitoring, and all the scripts that assume last year’s behavior. This is the checklist that keeps that ecosystem from biting you.

What actually breaks during OpenZFS upgrades

Most “ZFS upgrade” guides focus on the command named zpool upgrade. That’s like focusing on the airplane’s seatbelt while ignoring the engines. Real-world breakage clusters into a few categories:

1) You upgraded userland, but not the kernel module (or vice versa)

On Linux, OpenZFS is often a kernel module delivered via DKMS or kABI-tracking packages. If your kernel updates and the module doesn’t build or load, you don’t have ZFS. If userland tools are newer than the module, you get warnings, weird behavior, or missing features. On FreeBSD, ZFS is typically integrated, but you can still create version skew between boot environments or jails.
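
A quick skew check on a Linux host, assuming DKMS packaging (a sketch; adapt for kABI-tracking packages):

cr0x@server:~$ zfs --version
cr0x@server:~$ modinfo -F version zfs
cr0x@server:~$ dkms status | grep -i zfs

If the module version on disk doesn't match userland, or dkms status shows no zfs build for the kernel you're about to boot, fix that before touching anything else.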

2) Feature flags make pools “newer” than some hosts

OpenZFS uses feature flags on pools. Once enabled, some flags are “active” and can’t be turned off. The practical implication: enabling a flag can permanently block importing the pool on older OpenZFS implementations. That becomes a problem when you discover the “older implementation” is your DR site.
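
On OpenZFS 2.1 and newer, the pool-level compatibility property can cap which features zpool upgrade is allowed to enable. A minimal sketch, assuming a 2.1+ stack (the feature-set file names live under /usr/share/zfs/compatibility.d/ and vary by release):

cr0x@server:~$ zpool get compatibility tank
cr0x@server:~$ ls /usr/share/zfs/compatibility.d/
cr0x@server:~$ sudo zpool set compatibility=openzfs-2.1-linux tank

Setting a compatibility file doesn't disable anything already active; it only constrains what future upgrades can enable, which is exactly the guardrail a mixed estate needs.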

3) Boot pool and bootloader support are their own universe

Root-on-ZFS is wonderful until you learn your bootloader understands only a subset of OpenZFS features. The pool might be perfectly healthy, but your system won’t boot because the bootloader can’t read the on-disk structures created by newer features. If you’re upgrading a boot pool, your rollback plan must be bulletproof.
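
For a GRUB-booted boot pool on OpenZFS 2.1+, the same compatibility property is the sane guardrail. A sketch, assuming bpool is your boot pool:

cr0x@server:~$ zpool get compatibility bpool

If that comes back off, nothing but restraint stops a future zpool upgrade from outrunning the loader. Installers that create a separate bpool typically pass -o compatibility=grub2 at zpool create time, and you can set the property on an existing boot pool the same way as shown above.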

4) Replication compatibility is an operational contract

zfs send/zfs receive is your data pipeline. If you enable features or change properties that alter stream compatibility, your replication may fail, silently skip what you expect, or force you into full re-seeds. “It still snapshots” is not the same as “it still replicates.”
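
The only test that counts is an end-to-end dry run against the real receiver. A minimal sketch, where dr1, drpool, and tank/test@compatcheck are placeholders for your own receiver and a small test snapshot; zfs receive -n parses and validates the stream without writing anything:

cr0x@server:~$ zfs send tank/test@compatcheck | ssh dr1 zfs receive -n -v drpool/test

If this fails on the receive side, you have a compatibility problem worth finding before the nightly replication does.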

5) Performance regressions are usually configuration mismatches

Upgrades can change defaults, ARC behavior, prefetch patterns, or how certain workloads interact with compression, recordsize, and special vdevs. The code may be fine; your workload might just be finally honest about your previous tuning. You need a pre/post performance baseline, or you’ll spend a week arguing with graphs.
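
A minimal baseline sketch you can re-run after the upgrade and diff (assumes the arcstat utility is installed; the /root/zfs-baseline path is a placeholder):

cr0x@server:~$ mkdir -p /root/zfs-baseline/$(date +%F)
cr0x@server:~$ zpool iostat -vl 5 12 > /root/zfs-baseline/$(date +%F)/iostat.txt
cr0x@server:~$ arcstat 5 12 > /root/zfs-baseline/$(date +%F)/arcstat.txt

Sixty seconds of per-vdev latency and ARC behavior under a representative workload beats any amount of post-incident archaeology.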

One paraphrased idea from Gene Kim, who has spent a career translating operations pain into language executives understand: reliability comes from fast, safe change with feedback loops. That’s the core of this checklist—make the change safe, observable, and reversible.

Interesting facts and historical context (the useful kind)

  • ZFS popularized end-to-end checksumming for data and metadata, which changes how you think about “silent corruption” compared to traditional RAID stacks.
  • OpenZFS feature flags replaced old pool version numbers so implementations could evolve without a single linear version lockstep.
  • Copy-on-write is why snapshots are cheap, but it also means free space fragmentation patterns can surprise you after heavy churn.
  • The “ARC” isn’t just cache; it’s an adaptive cache with eviction behavior that can dominate memory pressure conversations on mixed workloads.
  • L2ARC is not a read cache in the way people imagine; it’s a second-level cache with warm-up costs and metadata overhead that can hurt if mis-sized or placed on fragile media.
  • Special vdevs (for metadata and small blocks) can be transformational, but they also introduce “small, critical, fast devices” that can take your whole pool down if not redundant.
  • ZFS send streams evolved to support properties, large blocks, embedded data, and resumable receives; not every receiver understands every stream flavor.
  • Root-on-ZFS got mainstream adoption in multiple operating systems because boot environments plus snapshots make upgrades reversible—when you respect bootloader limits.

Preflight: decide what “upgrade” means in your environment

Before you touch packages, answer three questions. If you can’t answer them, you’re not upgrading; you’re rolling dice in a server room.

Define the upgrade scope

  • Userland only? Tools like zfs, zpool, zed (event daemon).
  • Kernel module? On Linux: ZFS module version, SPL, DKMS build status, initramfs.
  • Pool feature flags? Whether you will run zpool upgrade or leave pools as-is.
  • Dataset property changes? Some teams “upgrade” by also turning on compression everywhere. That’s not an upgrade. That’s a migration of I/O behavior.

Inventory the compatibility surface

List every system that might import this pool or receive replication streams:

  • Primary hosts
  • DR hosts
  • Backup targets
  • Forensic/recovery workstations (yes, someone eventually tries to import a pool on a laptop)
  • Bootloader capabilities if it’s a boot pool

Commit to a rollback strategy

There are only two grown-up rollback strategies:

  1. Boot environment rollback (system ZFS, root-on-ZFS): snapshot/clone the root dataset and keep a known-good boot environment selectable at boot.
  2. Out-of-band rollback (non-root ZFS): keep old packages available, keep the old kernel available, and never enable irreversible pool features until you’re confident.
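
For strategy 2, the boring part is making the old versions reachable. A Debian-family sketch (package names and the /root/zfs-rollback path are assumptions; adapt to your package manager):

cr0x@server:~$ mkdir -p /root/zfs-rollback
cr0x@server:~$ dpkg -l | grep -E 'zfs|linux-image' > /root/zfs-rollback/packages-before.txt

If rollback is needed, that file is what tells you which exact versions to reinstall (apt-get install package=version) and which old kernel entry to boot.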

Joke #1: ZFS is like a professional kitchen—everything is labeled, checksummed, and organized, and one intern can still set the place on fire.

Practical tasks: commands, outputs, and decisions (12+)

These are production tasks. Each one includes a command, an example output, what it means, and what you decide next. Run them before and after the upgrade. Keep the outputs in your change ticket. Future-you will be grateful, and future-you is usually the one holding the pager.

Task 1: Confirm what ZFS you’re actually running

cr0x@server:~$ zfs --version
zfs-2.2.2-1
zfs-kmod-2.2.2-1

What it means: Userland and kernel module versions are shown (varies by distro). If you see userland only, check module separately.

Decision: If versions are mismatched after upgrade, stop and fix package/module parity before touching pools.

Task 2: Verify the kernel module is loaded (Linux)

cr0x@server:~$ lsmod | grep -E '^zfs '
zfs                  8843264  6

What it means: ZFS module is loaded; the final number is “users.”

Decision: If it’s not loaded, check DKMS build logs, initramfs, and whether the kernel update broke module compilation.

Task 3: Check pool health and error counters before you change anything

cr0x@server:~$ zpool status -x
all pools are healthy

What it means: No known faults. If you get anything else, you have work to do before upgrading.

Decision: If there are checksum errors, resilvering, or degraded vdevs: postpone upgrade. Fix the pool first, then upgrade.

Task 4: Get the full status, not the comforting summary

cr0x@server:~$ zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.
action: Determine if the device needs to be replaced, and clear the errors
  scan: scrub repaired 0B in 00:12:33 with 0 errors on Thu Dec 19 03:12:01 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sda     ONLINE       0     0     2
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /tank/vmstore/vm-104-disk-0

What it means: “Healthy” can still hide repaired-but-real errors. Permanent errors list affected files.

Decision: Investigate and remediate permanent errors (restore from replica/backup) before upgrade. Also evaluate disk sda for replacement.

Task 5: Confirm you have recent scrubs and they aren’t screaming

cr0x@server:~$ zpool status tank | grep 'scan:'
  scan: scrub repaired 0B in 00:12:33 with 0 errors on Thu Dec 19 03:12:01 2025

What it means: Last scrub result and timestamp.

Decision: If scrubs are old or show errors, scrub before upgrade. You want known-good data before you start changing the stack.

Task 6: Capture feature flags currently enabled and active

cr0x@server:~$ zpool get -H -o name,property,value all tank | grep -E '^tank	feature@'
tank	feature@async_destroy	enabled
tank	feature@empty_bpobj	active
tank	feature@spacemap_histogram	enabled
tank	feature@extensible_dataset	enabled

What it means: enabled means the pool can use it; active means it’s in use on-disk.

Decision: If you see features you don’t recognize as supported by DR/backup hosts, do not run zpool upgrade yet. Build a compatibility matrix first.

Task 7: See what upgrades are even available (and don’t blindly apply them)

cr0x@server:~$ zpool upgrade
This system supports ZFS pool feature flags.

The following pools are formatted with legacy version numbers and can be upgraded:
  tank

The following feature flags are supported:
  spacemap_histogram
  enabled_txg
  hole_birth
  extensible_dataset
  ...

What it means: Shows supported features and whether pools are using legacy versions.

Decision: Upgrading the pool format is an explicit compatibility decision. If you have any chance of importing on older systems, delay pool upgrade until all systems are upgraded and tested.

Task 8: Check dataset properties that affect performance and replication

cr0x@server:~$ zfs get -r -o name,property,value -s local,received compression,recordsize,atime,xattr,acltype,encryption,keylocation tank
NAME                PROPERTY     VALUE
tank                compression  zstd
tank                recordsize   128K
tank                atime        off
tank/vmstore        recordsize   16K
tank/vmstore        compression  lz4
tank/backup         atime        on

What it means: Local and received properties that drive on-disk layout, I/O patterns, and behavior.

Decision: Freeze property changes during the upgrade window. If you want to “fix properties,” make that a separate change with its own rollback thinking.

Task 9: Validate snapshot/replication posture before upgrade

cr0x@server:~$ zfs list -t snapshot -o name,creation -S creation | head
NAME                          CREATION
tank/vmstore@autosnap_2025-12-26_0000  Fri Dec 26 00:00 2025
tank/home@autosnap_2025-12-26_0000     Fri Dec 26 00:00 2025
tank@autosnap_2025-12-26_0000          Fri Dec 26 00:00 2025

What it means: Snapshots exist and are fresh.

Decision: If snapshots aren’t current, fix automation before upgrading. No snapshots means no quick rollback for data mistakes.

Task 10: Check replication stream compatibility in practice (dry-run send estimate)

cr0x@server:~$ zfs send -nP tank/vmstore@autosnap_2025-12-26_0000 | head
size	1234567896

What it means: -nP estimates send size without sending. If this fails, you have a send-side issue.

Decision: If estimation fails after upgrade, you likely hit a feature/property mismatch or a send stream change. Investigate before the next scheduled replication.

Task 11: Confirm you can actually import the pool on the upgraded host (and see why it might not)

cr0x@server:~$ zpool import
   pool: tank
     id: 1234567890123456789
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        tank        ONLINE
          raidz1-0  ONLINE
            sda     ONLINE
            sdb     ONLINE
            sdc     ONLINE

What it means: Importable pool discovered. On a real system you wouldn’t run this on the active host unless you’re in recovery; it’s great on a standby or rescue environment.

Decision: If import shows “unsupported features,” you’ve proven a compatibility break. Do not upgrade pool features until every importer is ready.

Task 12: Validate boot pool constraints (root-on-ZFS environments)

cr0x@server:~$ zpool list -o name,size,allocated,free,ashift,health
NAME   SIZE  ALLOC   FREE  ASHIFT  HEALTH
bpool  1.8G   612M  1.2G      12  ONLINE
rpool  1.8T   1.1T  724G      12  ONLINE

What it means: You likely have a separate bpool (boot pool) with conservative features, plus rpool for the root filesystem.

Decision: Treat bpool as “bootloader-compatible storage.” Be extremely conservative about upgrading or enabling features on it.

Task 13: Confirm ZED is running and will report problems

cr0x@server:~$ systemctl status zfs-zed.service --no-pager
● zfs-zed.service - ZFS Event Daemon (zed)
     Loaded: loaded (/lib/systemd/system/zfs-zed.service; enabled)
     Active: active (running) since Thu 2025-12-26 00:10:11 UTC; 2h 3min ago

What it means: ZFS event daemon is active. Without it, you may miss disk fault events and scrub alerts.

Decision: If ZED isn’t running, fix that before upgrade. Visibility is part of safety.

Task 14: Check ARC behavior before/after (quick sanity, not cargo-cult tuning)

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c
00:00:01   912    34      3     9   1%    21   2%     4   0%  28.1G  31.9G
00:00:02   877    29      3     8   1%    17   2%     4   0%  28.1G  31.9G
00:00:03   940    35      3     9   1%    22   2%     4   0%  28.1G  31.9G

What it means: Miss rates and ARC size. A sudden miss spike after upgrade can indicate changed prefetch behavior or memory pressure.

Decision: If miss% jumps and latency rises, start with workload and memory pressure checks before changing tunables.

Task 15: Confirm your upgrade didn’t flip mount behavior or dataset visibility

cr0x@server:~$ zfs mount | head
tank                          /tank
tank/home                     /tank/home
tank/vmstore                  /tank/vmstore

What it means: Mounted datasets and mountpoints.

Decision: If expected datasets aren’t mounted after upgrade/reboot, check canmount, mountpoint, and whether systemd mount ordering changed.

Task 16: Post-upgrade: ensure the pool is still clean and the event log isn’t hiding drama

cr0x@server:~$ zpool events -v | tail -n 12
TIME                           CLASS
Dec 26 02:11:03.123456 2025    sysevent.fs.zfs.config_sync
    pool: tank
    vdev: /dev/sdb

Dec 26 02:11:04.654321 2025    sysevent.fs.zfs.history_event
    history: zpool scrub tank

What it means: Recent ZFS events. Useful after upgrades to see if devices disappeared, multipath changed, or config sync happened.

Decision: If you see repeated device removal/add events, stop and investigate cabling, HBAs, multipath config, or udev naming changes before you trust the pool.

Checklists / step-by-step plan (the one you can run at 2 AM)

This plan assumes you’re upgrading OpenZFS on a production host. Adjust for your platform, but don’t skip the logic. ZFS punishes improvisation.

Phase 0: Compatibility planning (do this before the change window)

  1. Write down all importers. Every host that might import the pool, including DR, backup, and rescue media.
  2. Write down all replication receivers. Every target that receives zfs send streams.
  3. Determine the oldest OpenZFS version in that set. That version is your compatibility floor.
  4. Decide whether you will run zpool upgrade. Default answer in a mixed estate: no. Upgrade code first, features later.
  5. For boot pools: identify bootloader constraints. If you can’t state what your bootloader can read, treat boot pool upgrades as forbidden until proven safe.
  6. Build a rollback plan that doesn’t rely on hope. Old kernel available, old packages available, boot environment if root-on-ZFS, and a documented “how to get a shell” path (IPMI/iLO/console).
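
For step 3, a one-liner sketch to collect the fleet's versions (prod1, dr1, and backup1 are placeholder hostnames):

cr0x@server:~$ for h in prod1 dr1 backup1; do echo "== $h"; ssh "$h" zfs --version; done

The lowest version that prints is your compatibility floor; write it in the change ticket.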

Phase 1: Preflight checks (right before change)

  1. Confirm ZFS versions (zfs --version).
  2. Confirm module loaded (lsmod | grep zfs on Linux).
  3. Confirm pool health (zpool status -x, then zpool status -v).
  4. Confirm scrub recency (zpool status, the scan: line).
  5. Capture feature flags (zpool get all, filtered to feature@).
  6. Capture key dataset properties (zfs get -r for compression/recordsize/atime/encryption).
  7. Confirm snapshots exist and replication last ran cleanly (your tooling + zfs list -t snapshot).
  8. Confirm free space headroom. You want operational slack for resilvers and metadata growth.
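
A minimal capture script for the preflight evidence, assuming a pool named tank and a /root/zfs-preflight path (both placeholders). Run it again after the upgrade and diff the two directories:

#!/bin/sh
# Capture ZFS state for the change ticket (illustrative; adjust pool names and paths)
OUT=/root/zfs-preflight/$(date +%F-%H%M)
mkdir -p "$OUT"
zfs --version > "$OUT/version.txt"
zpool status -v > "$OUT/pool-status.txt"
zpool get all tank | grep 'feature@' > "$OUT/features.txt"
zfs get -r compression,recordsize,atime,encryption tank > "$OUT/dataset-props.txt"
zfs list -t snapshot -o name,creation -S creation | head -n 50 > "$OUT/snapshots.txt"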

Phase 2: Upgrade execution (code first, features later)

  1. Upgrade packages. Keep notes of the before/after versions.
  2. Rebuild initramfs if applicable. On Linux, ZFS in initramfs matters for boot pools.
  3. Reboot during the window. If you’re not rebooting, you’re not testing the hardest part.
  4. Post-boot validate module + pool import + mounts. Verify zfs mount, services, and application I/O.
  5. Do not run zpool upgrade on day one. Observe stability first.
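
For step 2, the rebuild command depends on the distro family; a sketch (confirm against your platform before pasting):

cr0x@server:~$ sudo update-initramfs -u -k all        # Debian/Ubuntu family
cr0x@server:~$ sudo dracut -f --regenerate-all        # RHEL/Fedora family

Either way, verify the ZFS module actually landed in the image before you reboot (lsinitramfs or lsinitrd, piped through grep zfs).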

Phase 3: Post-upgrade verification (immediately and again 24 hours later)

  1. Check pool status and errors.
  2. Check ZED and alerting pipeline.
  3. Run a scrub (if your window allows) or schedule one soon.
  4. Trigger a replication run and verify receive side.
  5. Compare performance baselines: latency, IOPS, CPU usage, ARC miss rates.
  6. Review zpool events -v for device churn.

Phase 4: Feature flag upgrades (only after the fleet is ready)

When—and only when—every importer and receiver is on compatible OpenZFS, and you have tested rollback paths, then you can consider enabling new pool features.

  1. Review supported features (zpool upgrade output).
  2. Enable features deliberately, in small sets, with a change record.
  3. Verify replication still works afterward.
  4. Update your “compatibility floor” documentation. Your fleet’s minimum version just moved.
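
A sketch of doing step 2 deliberately instead of running a blanket zpool upgrade (the feature name is an example; pick yours from the zpool upgrade output):

cr0x@server:~$ sudo zpool set feature@zstd_compress=enabled tank
cr0x@server:~$ zpool get all tank | grep 'feature@' | grep -v disabled

Enabling one named feature at a time keeps the blast radius, and the change record, readable.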

Joke #2: The only thing more permanent than a feature flag is the memory of the person who enabled it five minutes before vacation.

Fast diagnosis playbook

After an OpenZFS upgrade, you’ll typically get one of three pain signals: it won’t boot/import, it’s slow, or replication is failing. This playbook is ordered to find the bottleneck quickly, not philosophically.

First: can the system see the pool and the devices?

  • Check module loaded: lsmod | grep zfs (Linux) or verify the kernel has ZFS and userland matches.
  • Check device names are stable: look for missing disks, changed WWNs, multipath issues.
  • Check importability: zpool import (on a rescue environment or standby).
  • Check pool status: zpool status -v for degraded vdevs and checksum errors.

Second: is it a compatibility/feature flag issue?

  • If the pool won’t import and you see “unsupported feature(s)”, stop. That’s not a tuning issue.
  • Compare zpool get feature@* between working and failing hosts.
  • For boot failures: suspect bootloader feature support, not “ZFS is broken.”

Third: is it a performance regression or an I/O path problem?

  • Check latency at the pool: zpool iostat -vl 1 10 (the -l columns add per-vdev latency; not shown above, but you should run it).
  • Check ARC misses and memory pressure: arcstat, and OS memory stats.
  • Check CPU usage in kernel threads: high system CPU can indicate checksum/compression overhead changes or a pathological workload pattern.
  • Check recordsize/compression drift: upgrades don’t change existing blocks, but they can expose that your “one size fits all” properties were a lie.

Fourth: is it replication/tooling?

  • Run a manual zfs send -nP and inspect errors.
  • Confirm receiver can accept the stream (version/feature support).
  • Check whether your replication tooling parses outputs that changed subtly.

Common mistakes: symptom → root cause → fix

1) Symptom: pool won’t import after upgrade; message mentions unsupported features

Root cause: Pool features were enabled on another host (or you ran zpool upgrade) and now you’re trying to import on an older OpenZFS implementation.

Fix: Upgrade the importing environment to a compatible OpenZFS version. If this is DR and you can’t, your only path is restoring from replication/backup that targets a compatible pool. You cannot “disable” most active features.

2) Symptom: system boots to initramfs or emergency shell; root pool not found

Root cause: ZFS module not included/built for the new kernel, initramfs missing ZFS, or the module failed to load.

Fix: Boot into an older kernel from the bootloader (keep one), rebuild DKMS/module, rebuild initramfs, and reboot. Validate zfs --version parity afterward.
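
A recovery sketch for that situation, assuming DKMS packaging and a Debian-family initramfs (swap in dracut commands on RHEL-style systems):

cr0x@server:~$ dkms status
cr0x@server:~$ sudo dkms autoinstall -k "$(uname -r)"
cr0x@server:~$ sudo update-initramfs -u -k "$(uname -r)"

Then reboot into the new kernel and confirm zfs --version shows matching userland and kmod versions.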

3) Symptom: bootloader can’t read boot pool, but the pool imports fine from rescue media

Root cause: Boot pool features not supported by the bootloader. You upgraded or enabled something on bpool, or the boot pool was created with a feature set the loader can't read.

Fix: Restore boot pool from a known-good snapshot/boot environment if available. Otherwise, reinstall bootloader with a compatible boot pool design (often: keep boot pool conservative and separate).

4) Symptom: replication starts failing after upgrade with stream errors

Root cause: Sender now produces streams using features receiver can’t accept, or your replication script assumes old zfs send behavior/flags.

Fix: Upgrade receiver side first (or keep sender compatible), and adjust replication tooling to use compatible flags. Verify with zfs send -nP and a small test dataset.

5) Symptom: performance drops; CPU spikes; I/O wait increases

Root cause: Often not “ZFS got slower,” but a change in kernel, I/O scheduler, compression implementation, or memory reclaim behavior interacting with ARC.

Fix: Compare pre/post baselines. Check arcstat miss rates, zpool iostat latency, and whether your workload shifted. Only then consider targeted tuning. Don’t shotgun zfs_arc_max changes based on vibes.

6) Symptom: datasets not mounted after reboot; services fail because paths missing

Root cause: Dataset properties like canmount, mountpoint, or systemd ordering changed; sometimes a received property overrides local expectations.

Fix: Inspect properties with zfs get, correct the intended source (local vs received), and ensure your service dependencies wait for ZFS mounts.
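
A sketch of checking which source wins (tank/app is a placeholder dataset):

cr0x@server:~$ zfs get -o name,property,value,source canmount,mountpoint tank/app
cr0x@server:~$ sudo zfs set canmount=on tank/app

The source column tells you whether a received or inherited value is overriding what you expected; zfs inherit -S reverts a property to its received value if that is the intent. For systemd services, depending on zfs-mount.service (or RequiresMountsFor= on the path) is what actually fixes the ordering.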

7) Symptom: “zfs” commands work but pool operations error oddly; logs show version mismatch

Root cause: Userland and kernel module mismatch after partial upgrade.

Fix: Align versions. On Linux, that means ensuring the ZFS module built for the running kernel and userland packages match the same release line.

Three corporate-world mini-stories (anonymized, painfully plausible)

Mini-story #1: The incident caused by a wrong assumption

A mid-sized company ran a pair of storage servers: primary and DR. Primary was upgraded quarterly. DR was “stable” in the way old bread is stable. The assumption was simple: ZFS replication is just snapshots over the wire, so as long as datasets exist, compatibility will sort itself out.

During a routine upgrade, an engineer ran zpool upgrade because the command looked like the next logical step. The pool remained online, nothing crashed, and the change ticket closed early. In the next few days, replication jobs began failing, but only on some datasets. The failures were intermittent enough to be ignored and loud enough to annoy everyone.

Then a real incident hit primary—an HBA started throwing resets under load. They failed over to DR and discovered the pool wouldn’t import. “Unsupported features” on import. DR was running an older OpenZFS that didn’t understand some now-active feature flags. The pool wasn’t damaged. It was just too modern for the environment that needed it most.

The recovery was boring and expensive: rebuild DR with a newer stack, then restore data from whatever replication was still valid. The actual outage wasn’t because of ZFS. It was because of the assumption that feature flags are optional and reversible. They are not.

Mini-story #2: The optimization that backfired

A different org had a virtualization cluster backed by ZFS. After upgrading OpenZFS, an engineer decided to “take advantage of the new version” by changing compression from lz4 to zstd across the VM dataset. The rationale was straightforward: better compression means less I/O, so performance improves. That’s a good theory in a world where CPU is free and latency is imaginary.

In practice, the cluster ran a mixed workload: small random writes, bursty metadata operations, and occasional backup storms. After the change, latency got worse during peak hours. CPU climbed. The on-call started seeing VM timeouts that previously never happened. The storage graphs looked fine at the disk layer, which made the incident more fun: everyone blamed the network.

The root issue wasn’t that zstd is bad. It was that they changed a workload-defining property in the same window as an OpenZFS upgrade, without baseline comparisons. Compression levels and CPU overhead matter. Also, existing blocks don’t recompress, so the performance behavior was inconsistent across VMs depending on data age. Perfect for confusion.

The fix was to revert compression on the hot VM dataset back to lz4, keep zstd for colder datasets, and separate “upgrade the storage stack” from “change the storage behavior.” The upgrade wasn’t the villain. The bundling was.

Mini-story #3: The boring but correct practice that saved the day

A finance-adjacent company ran root-on-ZFS everywhere. They had a policy that made engineers roll their eyes: every host upgrade required a fresh boot environment, plus a post-upgrade reboot during the window. No exceptions. The policy existed because somebody, years earlier, got tired of “we’ll reboot later” being a synonym for “we will find out during an outage.”

During an OpenZFS upgrade, one host came back up with ZFS services failing to mount a dataset. The reason was mundane: a combination of service ordering and a dataset mount property that had been inherited unexpectedly. The host was technically up, but the applications were dead. The engineer on-call didn’t try clever fixes on a broken system under pressure.

They selected the previous boot environment in the bootloader, came back online on the old stack, and restored service. Then, in daylight hours, they reproduced the issue in staging, fixed the mount ordering, and re-ran the upgrade. The downtime was minimal because rollback wasn’t theoretical—it was part of the muscle memory.

This is the kind of practice that looks slow until it’s faster than every alternative.

FAQ

1) Should I run zpool upgrade immediately after upgrading OpenZFS?

No, not by default. Upgrade the software stack first, validate stability and replication, then upgrade pool features when every importer/receiver is ready.

2) What’s the difference between upgrading OpenZFS and upgrading a pool?

Upgrading OpenZFS changes the code that reads/writes your pool. Upgrading a pool changes on-disk feature flags. The latter can be irreversible and affects cross-host compatibility.

3) Can I downgrade OpenZFS if something goes wrong?

You can usually downgrade the software if you did not enable new pool features and your distro supports package rollback. If you enabled features that become active, older implementations may no longer import the pool.

4) Why does my pool say “healthy” but I still have permanent errors?

zpool status -x is a summary. Permanent errors can exist even when the pool is online. Always review zpool status -v before upgrades.

5) Do I need to scrub before upgrading?

If you haven’t scrubbed recently, yes. A scrub is how you validate data integrity across the whole pool. You want known-good data before changing kernel modules and storage code.

6) What about encrypted datasets—any special upgrade caveats?

Ensure key management works across reboots: confirm keylocation, test unlocking procedures, and validate that early boot can access keys if the root dataset is encrypted.
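
A quick sketch (rpool/secure is a placeholder for an encrypted dataset); zfs load-key -n verifies the key material without changing anything:

cr0x@server:~$ zfs get -r -t filesystem encryption,keyformat,keylocation,keystatus rpool
cr0x@server:~$ sudo zfs load-key -n rpool/secure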

7) My replication target is older. Can I still upgrade the sender?

Often yes, if you avoid enabling incompatible pool features and keep replication streams compatible. But you must test: run zfs send -nP and validate a receive on the older target.

8) Why do people separate boot pool (bpool) from root pool (rpool)?

Bootloaders often support fewer ZFS features than the OS. A conservative boot pool reduces the chance that a feature upgrade makes the system unbootable.

9) If performance changed after upgrade, what’s the first metric to trust?

Latency at the pool and vdev level, plus ARC miss rates under the same workload. Throughput graphs alone can hide tail-latency regressions that break apps.

10) Is L2ARC a good idea after upgrading?

Only if you can prove it helps. L2ARC adds complexity and can steal memory for metadata. Measure before and after; don’t treat it as a rite of passage.

Conclusion: next steps you should actually take

Here’s the practical path that avoids most OpenZFS upgrade pain:

  1. Inventory importers and receivers. Compatibility is a fleet property, not a host property.
  2. Upgrade code first, reboot, validate. If you won’t reboot, you’re postponing the real test to a worse time.
  3. Delay zpool upgrade until the whole ecosystem is ready. Treat feature flags like schema migrations: planned, reviewed, and timed.
  4. Capture evidence. Save pre/post outputs for versions, pool status, feature flags, properties, and replication tests.
  5. Run one controlled replication test. If you can’t prove send/receive still works, you don’t have DR—you have a story.

Upgrading OpenZFS can be boring. That’s the goal. The checklist isn’t ceremony; it’s how you keep the on-call shift from becoming a career development event.
