OpenZFS upgrades don’t usually fail because the code is bad. They fail because humans are optimistic, change windows are short, and compatibility is subtle. The breakage isn’t dramatic, either. It’s the quiet kind: replication that stops two days later, a boot pool that won’t import after a kernel update, or a feature flag you enabled “because it was there” that strands your pool on older hosts.
If you run production storage, you don’t “upgrade ZFS.” You upgrade an ecosystem: kernel modules, userland tools, bootloader support, feature flags, dataset properties, monitoring, and all the scripts that assume last year’s behavior. This is the checklist that keeps that ecosystem from biting you.
What actually breaks during OpenZFS upgrades
Most “ZFS upgrade” guides focus on the command named zpool upgrade. That’s like focusing on the airplane’s seatbelt while ignoring the engines. Real-world breakage clusters into a few categories:
1) You upgraded userland, but not the kernel module (or vice versa)
On Linux, OpenZFS is often a kernel module delivered via DKMS or kABI-tracking packages. If your kernel updates and the module doesn’t build or load, you don’t have ZFS. If userland tools are newer than the module, you get warnings, weird behavior, or missing features. On FreeBSD, ZFS is typically integrated, but you can still create version skew between boot environments or jails.
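A quick parity check on Linux, assuming the module is loaded and exposes its version via sysfs (current OpenZFS builds do):
cr0x@server:~$ zfs --version | head -n1
zfs-2.2.2-1
cr0x@server:~$ cat /sys/module/zfs/version
2.2.2-1
If those two disagree, stop and fix packaging before you go anywhere near the pools.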
2) Feature flags make pools “newer” than some hosts
OpenZFS uses feature flags on pools. Once enabled, some flags are “active” and can’t be turned off. The practical implication: enabling a flag can permanently block importing the pool on older OpenZFS implementations. That becomes a problem when you discover the “older implementation” is your DR site.
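If your OpenZFS release supports the pool compatibility property (2.1 and later), you can cap which features zpool upgrade is allowed to enable. It doesn’t undo anything already active, but it’s a useful guardrail in a mixed fleet. A sketch, assuming a pool named tank and that your build ships feature-set files under /usr/share/zfs/compatibility.d (file names vary by release):
cr0x@server:~$ zpool get compatibility tank
NAME  PROPERTY       VALUE  SOURCE
tank  compatibility  off    default
cr0x@server:~$ sudo zpool set compatibility=openzfs-2.1-linux tank
With that set, a later zpool upgrade can only enable features in the named set instead of everything the host supports.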
3) Boot pool and bootloader support are their own universe
Root-on-ZFS is wonderful until you learn your bootloader understands only a subset of OpenZFS features. The pool might be perfectly healthy, but your system won’t boot because the bootloader can’t read the on-disk structures created by newer features. If you’re upgrading a boot pool, your rollback plan must be bulletproof.
4) Replication compatibility is an operational contract
zfs send/zfs receive is your data pipeline. If you enable features or change properties that alter stream compatibility, your replication may fail, silently skip what you expect, or force you into full re-seeds. “It still snapshots” is not the same as “it still replicates.”
5) Performance regressions are usually configuration mismatches
Upgrades can change defaults, ARC behavior, prefetch patterns, or how certain workloads interact with compression, recordsize, and special vdevs. The code may be fine; your workload might just be finally honest about your previous tuning. You need a pre/post performance baseline, or you’ll spend a week arguing with graphs.
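A baseline doesn’t need to be fancy. A minimal sketch that captures latency, ARC, and property data into a dated directory before the change (pool name and paths are placeholders; run it again afterward and diff):
#!/bin/sh
# Quick pre/post performance baseline. POOL and OUT are examples.
POOL=tank
OUT=/var/tmp/zfs-baseline-$(date +%Y%m%d-%H%M)
mkdir -p "$OUT"
zpool iostat -v "$POOL" 1 30 > "$OUT/iostat.txt"    # per-vdev ops and bandwidth, 30 samples
arcstat 1 30                 > "$OUT/arcstat.txt"   # ARC hit/miss behavior
zpool status -v "$POOL"      > "$OUT/status.txt"
zfs get -r -H -o name,property,value compression,recordsize,atime "$POOL" > "$OUT/props.txt"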
One paraphrased idea from Gene Kim, who has spent a career translating operations pain into language executives understand: reliability comes from fast, safe change with feedback loops. That’s the core of this checklist—make the change safe, observable, and reversible.
Interesting facts and historical context (the useful kind)
- ZFS popularized end-to-end checksumming for data and metadata, which changes how you think about “silent corruption” compared to traditional RAID stacks.
- OpenZFS feature flags replaced old pool version numbers so implementations could evolve without a single linear version lockstep.
- Copy-on-write is why snapshots are cheap, but it also means free space fragmentation patterns can surprise you after heavy churn.
- The “ARC” isn’t just cache; it’s an adaptive cache with eviction behavior that can dominate memory pressure conversations on mixed workloads.
- L2ARC is not a read cache in the way people imagine; it’s a second-level cache with warm-up costs and metadata overhead that can hurt if mis-sized or placed on fragile media.
- Special vdevs (for metadata and small blocks) can be transformational, but they also introduce “small, critical, fast devices” that can take your whole pool down if not redundant.
- ZFS send streams evolved to support properties, large blocks, embedded data, and resumable receives; not every receiver understands every stream flavor.
- Root-on-ZFS got mainstream adoption in multiple operating systems because boot environments plus snapshots make upgrades reversible—when you respect bootloader limits.
Preflight: decide what “upgrade” means in your environment
Before you touch packages, answer three questions. If you can’t answer them, you’re not upgrading; you’re rolling dice in a server room.
Define the upgrade scope
- Userland only? Tools like zfs, zpool, and zed (the event daemon).
- Kernel module? On Linux: ZFS module version, SPL, DKMS build status, initramfs.
- Pool feature flags? Whether you will run zpool upgrade or leave pools as-is.
- Dataset property changes? Some teams “upgrade” by also turning on compression everywhere. That’s not an upgrade. That’s a migration of I/O behavior.
Inventory the compatibility surface
List every system that might import this pool or receive replication streams:
- Primary hosts
- DR hosts
- Backup targets
- Forensic/recovery workstations (yes, someone eventually tries to import a pool on a laptop)
- Bootloader capabilities if it’s a boot pool
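A small collection sketch so the matrix is built from facts rather than memory, assuming SSH access and that the host list is yours to fill in:
#!/bin/sh
# HOSTS is an example list; replace with your real importers and receivers.
HOSTS="primary1 dr1 backup1"
for h in $HOSTS; do
  echo "== $h =="
  ssh "$h" 'zfs --version; zpool get -H -o name,property,value all 2>/dev/null | grep feature@ | grep active'
done
The oldest version and the union of active features in that output are your compatibility floor.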
Commit to a rollback strategy
There are only two grown-up rollback strategies:
- Boot environment rollback (system ZFS, root-on-ZFS): snapshot/clone the root dataset and keep a known-good boot environment selectable at boot.
- Out-of-band rollback (non-root ZFS): keep old packages available, keep the old kernel available, and never enable irreversible pool features until you’re confident.
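Boot-environment tooling varies (beadm, bectl, zectl, or your distro’s wrapper), so here is only the lowest common denominator, assuming a root dataset named rpool/ROOT/default (adjust to yours):
cr0x@server:~$ sudo zfs snapshot -r rpool/ROOT/default@pre-openzfs-upgrade
cr0x@server:~$ zfs list -t snapshot -o name -r rpool/ROOT/default | grep pre-openzfs-upgrade
rpool/ROOT/default@pre-openzfs-upgrade
A snapshot alone is not a selectable boot environment; pair it with your BE tool (or a clone plus a bootloader entry) if you want a one-reboot rollback.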
Joke #1: ZFS is like a professional kitchen—everything is labeled, checksummed, and organized, and one intern can still set the place on fire.
Practical tasks: commands, outputs, and decisions (12+)
These are production tasks. Each one includes a command, an example output, what it means, and what you decide next. Run them before and after the upgrade. Keep the outputs in your change ticket. Future-you will be grateful, and future-you is usually the one holding the pager.
Task 1: Confirm what ZFS you’re actually running
cr0x@server:~$ zfs --version
zfs-2.2.2-1
zfs-kmod-2.2.2-1
What it means: Userland and kernel module versions are shown (varies by distro). If you see userland only, check module separately.
Decision: If versions are mismatched after upgrade, stop and fix package/module parity before touching pools.
Task 2: Verify the kernel module is loaded (Linux)
cr0x@server:~$ lsmod | grep -E '^zfs '
zfs 8843264 6
What it means: ZFS module is loaded; the final number is “users.”
Decision: If it’s not loaded, check DKMS build logs, initramfs, and whether the kernel update broke module compilation.
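On DKMS-based distros, a sketch of what “check the build” looks like in practice (package and kernel versions will differ on your system):
cr0x@server:~$ dkms status
zfs/2.2.2, 6.1.0-28-amd64, x86_64: installed
cr0x@server:~$ sudo dkms autoinstall     # rebuild for the running kernel if it's missing above
cr0x@server:~$ sudo modprobe zfs
If the new kernel doesn’t appear in dkms status as installed, that’s your root cause, not the pool.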
Task 3: Check pool health and error counters before you change anything
cr0x@server:~$ zpool status -x
all pools are healthy
What it means: No known faults. If you get anything else, you have work to do before upgrading.
Decision: If there are checksum errors, resilvering, or degraded vdevs: postpone upgrade. Fix the pool first, then upgrade.
Task 4: Get the full status, not the comforting summary
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error.
action: Determine if the device needs to be replaced, and clear the errors
scan: scrub repaired 0B in 00:12:33 with 0 errors on Thu Dec 19 03:12:01 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
sda ONLINE 0 0 2
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
/tank/vmstore/vm-104-disk-0
What it means: “Healthy” can still hide repaired-but-real errors. Permanent errors list affected files.
Decision: Investigate and remediate permanent errors (restore from replica/backup) before upgrade. Also evaluate disk sda for replacement.
Task 5: Confirm you have recent scrubs and they aren’t screaming
cr0x@server:~$ zpool status tank | grep scan:
scan: scrub repaired 0B in 00:12:33 with 0 errors on Thu Dec 19 03:12:01 2025
What it means: Last scrub result and timestamp.
Decision: If scrubs are old or show errors, scrub before upgrade. You want known-good data before you start changing the stack.
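If the last scrub is stale, start one and let it finish before the window, not during it. A minimal sketch (zpool wait needs OpenZFS 2.0 or newer; otherwise poll zpool status):
cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ sudo zpool wait -t scrub tank
cr0x@server:~$ zpool status tank | grep scan:
scan: scrub repaired 0B in 00:41:07 with 0 errors on Fri Dec 26 01:02:11 2025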
Task 6: Capture feature flags currently enabled and active
cr0x@server:~$ zpool get -H -o name,property,value all tank | grep 'feature@'
tank feature@async_destroy enabled
tank feature@empty_bpobj active
tank feature@spacemap_histogram enabled
tank feature@extensible_dataset enabled
What it means: enabled means the pool can use it; active means it’s in use on-disk.
Decision: If you see features you don’t recognize as supported by DR/backup hosts, do not run zpool upgrade yet. Build a compatibility matrix first.
Task 7: See what upgrades are even available (and don’t blindly apply them)
cr0x@server:~$ zpool upgrade
This system supports ZFS pool feature flags.
The following pools are formatted with legacy version numbers and can be upgraded:
tank
The following feature flags are supported:
spacemap_histogram
enabled_txg
hole_birth
extensible_dataset
...
What it means: Shows supported features and whether pools are using legacy versions.
Decision: Upgrading the pool format is an explicit compatibility decision. If you have any chance of importing on older systems, delay pool upgrade until all systems are upgraded and tested.
Task 8: Check dataset properties that affect performance and replication
cr0x@server:~$ zfs get -r -o name,property,value -s local,received compression,recordsize,atime,xattr,acltype,encryption,keylocation tank
NAME PROPERTY VALUE
tank compression zstd
tank recordsize 128K
tank atime off
tank/vmstore recordsize 16K
tank/vmstore compression lz4
tank/backup atime on
What it means: Local and received properties that drive on-disk layout, I/O patterns, and behavior.
Decision: Freeze property changes during the upgrade window. If you want to “fix properties,” make that a separate change with its own rollback thinking.
Task 9: Validate snapshot/replication posture before upgrade
cr0x@server:~$ zfs list -t snapshot -o name,creation -S creation | head
NAME CREATION
tank/vmstore@autosnap_2025-12-26_0000 Fri Dec 26 00:00 2025
tank/home@autosnap_2025-12-26_0000 Fri Dec 26 00:00 2025
tank@autosnap_2025-12-26_0000 Fri Dec 26 00:00 2025
What it means: Snapshots exist and are fresh.
Decision: If snapshots aren’t current, fix automation before upgrading. No snapshots means no quick rollback for data mistakes.
Task 10: Check replication stream compatibility in practice (dry-run send estimate)
cr0x@server:~$ zfs send -nP tank/vmstore@autosnap_2025-12-26_0000 | head
size 1234567896
What it means: -nP estimates send size without sending. If this fails, you have a send-side issue.
Decision: If estimation fails after upgrade, you likely hit a feature/property mismatch or a send stream change. Investigate before the next scheduled replication.
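Estimating proves only the send side. For the full path, a sketch of a dry-run receive against the real target, assuming SSH access to a host named dr1 and a scratch destination dataset; zfs receive -n parses the stream and validates the destination without writing anything (output wording varies by version):
cr0x@server:~$ zfs send tank/vmstore@autosnap_2025-12-26_0000 | ssh dr1 zfs receive -nv tankdr/vmstore-test
would receive full stream of tank/vmstore@autosnap_2025-12-26_0000 into tankdr/vmstore-test@autosnap_2025-12-26_0000
Note this still pushes the whole stream over the network, so pick a small dataset or an incremental for the test.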
Task 11: Confirm you can actually import the pool on the upgraded host (and see why it might not)
cr0x@server:~$ zpool import
pool: tank
id: 1234567890123456789
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:
tank ONLINE
raidz1-0 ONLINE
sda ONLINE
sdb ONLINE
sdc ONLINE
What it means: Importable pool discovered. On a real system you wouldn’t run this on the active host unless you’re in recovery; it’s great on a standby or rescue environment.
Decision: If import shows “unsupported features,” you’ve proven a compatibility break. Do not upgrade pool features until every importer is ready.
Task 12: Validate boot pool constraints (root-on-ZFS environments)
cr0x@server:~$ zpool list -o name,size,alloc,free,ashift,health
NAME SIZE ALLOC FREE ASHIFT HEALTH
bpool 1.8G 612M 1.2G 12 ONLINE
rpool 1.8T 1.1T 724G 12 ONLINE
What it means: You likely have a separate bpool (boot pool) with conservative features, plus rpool for the root filesystem.
Decision: Treat bpool as “bootloader-compatible storage.” Be extremely conservative about upgrading or enabling features on it.
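On distros that create a separate boot pool, the installer often pins it to a bootloader-safe feature set via the compatibility property (for example the grub2 feature file on GRUB systems). Worth checking before any feature work; a sketch:
cr0x@server:~$ zpool get compatibility bpool rpool
NAME   PROPERTY       VALUE  SOURCE
bpool  compatibility  grub2  local
rpool  compatibility  off    default
If a pool your bootloader must read shows off here, don’t assume anything; verify loader support before touching features.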
Task 13: Confirm ZED is running and will report problems
cr0x@server:~$ systemctl status zfs-zed.service --no-pager
● zfs-zed.service - ZFS Event Daemon (zed)
Loaded: loaded (/lib/systemd/system/zfs-zed.service; enabled)
Active: active (running) since Thu 2025-12-26 00:10:11 UTC; 2h 3min ago
What it means: The ZFS event daemon is active. Without it, you may miss disk fault events and scrub alerts.
Decision: If ZED isn’t running, fix that before upgrade. Visibility is part of safety.
Task 14: Check ARC behavior before/after (quick sanity, not cargo-cult tuning)
cr0x@server:~$ arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% size c
00:00:01 912 34 3 9 1% 21 2% 4 0% 28.1G 31.9G
00:00:02 877 29 3 8 1% 17 2% 4 0% 28.1G 31.9G
00:00:03 940 35 3 9 1% 22 2% 4 0% 28.1G 31.9G
What it means: Miss rates and ARC size. A sudden miss spike after upgrade can indicate changed prefetch behavior or memory pressure.
Decision: If miss% jumps and latency rises, start with workload and memory pressure checks before changing tunables.
Task 15: Confirm your upgrade didn’t flip mount behavior or dataset visibility
cr0x@server:~$ zfs mount | head
tank /tank
tank/home /tank/home
tank/vmstore /tank/vmstore
What it means: Mounted datasets and mountpoints.
Decision: If expected datasets aren’t mounted after upgrade/reboot, check canmount, mountpoint, and whether systemd mount ordering changed.
Task 16: Post-upgrade: ensure the pool is still clean and the event log isn’t hiding drama
cr0x@server:~$ zpool events -v | tail -n 12
TIME CLASS
Dec 26 02:11:03.123456 2025 sysevent.fs.zfs.config_sync
pool: tank
vdev: /dev/sdb
Dec 26 02:11:04.654321 2025 sysevent.fs.zfs.history_event
history: zpool scrub tank
What it means: Recent ZFS events. Useful after upgrades to see if devices disappeared, multipath changed, or config sync happened.
Decision: If you see repeated device removal/add events, stop and investigate cabling, HBAs, multipath config, or udev naming changes before you trust the pool.
Checklists / step-by-step plan (the one you can run at 2 AM)
This plan assumes you’re upgrading OpenZFS on a production host. Adjust for your platform, but don’t skip the logic. ZFS punishes improvisation.
Phase 0: Compatibility planning (do this before the change window)
- Write down all importers. Every host that might import the pool, including DR, backup, and rescue media.
- Write down all replication receivers. Every target that receives zfs send streams.
- Determine the oldest OpenZFS version in that set. That version is your compatibility floor.
- Decide whether you will run zpool upgrade. Default answer in a mixed estate: no. Upgrade code first, features later.
- For boot pools: identify bootloader constraints. If you can’t state what your bootloader can read, treat boot pool upgrades as forbidden until proven safe.
- Build a rollback plan that doesn’t rely on hope. Old kernel available, old packages available, boot environment if root-on-ZFS, and a documented “how to get a shell” path (IPMI/iLO/console).
Phase 1: Preflight checks (right before change)
- Confirm ZFS versions (zfs --version).
- Confirm the module is loaded (lsmod | grep zfs on Linux).
- Confirm pool health (zpool status -x, then zpool status -v).
- Confirm scrub recency (the scan: line in zpool status).
- Capture feature flags (zpool get all piped through a feature@ filter, as in Task 6).
- Capture key dataset properties (zfs get -r for compression/recordsize/atime/encryption).
- Confirm snapshots exist and replication last ran cleanly (your tooling + zfs list -t snapshot).
- Confirm free space headroom. You want operational slack for resilvers and metadata growth.
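The whole preflight list condenses into one capture pass. A sketch that writes the evidence for your change ticket (pool name and output path are placeholders):
#!/bin/sh
# Collect preflight evidence. Run again post-upgrade and diff the two directories.
POOL=tank
OUT=/var/tmp/zfs-preflight-$(date +%Y%m%d-%H%M)
mkdir -p "$OUT"
zfs --version                                   > "$OUT/versions.txt"
zpool status -v "$POOL"                         > "$OUT/status.txt"
zpool get -H -o name,property,value all "$POOL" > "$OUT/pool-props.txt"
zfs get -r -H -o name,property,value,source compression,recordsize,atime,encryption "$POOL" > "$OUT/dataset-props.txt"
zfs list -r -t snapshot -o name,creation -S creation "$POOL" | head -n 50 > "$OUT/snapshots.txt"
echo "Evidence written to $OUT"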
Phase 2: Upgrade execution (code first, features later)
- Upgrade packages. Keep notes of the before/after versions.
- Rebuild initramfs if applicable. On Linux, ZFS in initramfs matters for boot pools (distro-specific commands follow this list).
- Reboot during the window. If you’re not rebooting, you’re not testing the hardest part.
- Post-boot, validate module + pool import + mounts. Verify zfs mount, services, and application I/O.
- Do not run zpool upgrade on day one. Observe stability first.
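The initramfs rebuild mentioned above is distro-specific. A hedged sketch of the two common Linux paths; the goal is simply to confirm the ZFS module actually landed in the image you will boot:
# Debian/Ubuntu family (initramfs-tools)
cr0x@server:~$ sudo update-initramfs -u -k all
cr0x@server:~$ lsinitramfs /boot/initrd.img-$(uname -r) | grep zfs.ko
# RHEL/Fedora family (dracut)
cr0x@server:~$ sudo dracut -f --kver "$(uname -r)"
cr0x@server:~$ lsinitrd | grep zfs.ko
An empty grep here is a boot failure you get to fix now instead of at 2 AM.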
Phase 3: Post-upgrade verification (immediately and again 24 hours later)
- Check pool status and errors.
- Check ZED and alerting pipeline.
- Run a scrub (if your window allows) or schedule one soon.
- Trigger a replication run and verify receive side.
- Compare performance baselines: latency, IOPS, CPU usage, ARC miss rates.
- Review zpool events -v for device churn.
Phase 4: Feature flag upgrades (only after the fleet is ready)
When—and only when—every importer and receiver is on compatible OpenZFS, and you have tested rollback paths, then you can consider enabling new pool features.
- Review supported features (zpool upgrade output).
- Enable features deliberately, in small sets, with a change record (see the sketch after this list).
- Verify replication still works afterward.
- Update your “compatibility floor” documentation. Your fleet’s minimum version just moved.
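Features can be enabled one at a time instead of via a blanket zpool upgrade, which keeps each change reviewable. A sketch, with the feature name as a placeholder for whatever your change record actually approves:
cr0x@server:~$ sudo zpool set feature@zstd_compress=enabled tank
cr0x@server:~$ zpool get feature@zstd_compress tank
NAME  PROPERTY               VALUE    SOURCE
tank  feature@zstd_compress  enabled  local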
Joke #2: The only thing more permanent than a feature flag is the memory of the person who enabled it five minutes before vacation.
Fast diagnosis playbook
After an OpenZFS upgrade, you’ll typically get one of three pain signals: it won’t boot/import, it’s slow, or replication is failing. This playbook is ordered to find the bottleneck quickly, not philosophically.
First: can the system see the pool and the devices?
- Check the module is loaded: lsmod | grep zfs (Linux), or verify the kernel has ZFS and userland matches.
- Check device names are stable: look for missing disks, changed WWNs, multipath issues.
- Check importability: zpool import (on a rescue environment or standby).
- Check pool status: zpool status -v for degraded vdevs and checksum errors.
Second: is it a compatibility/feature flag issue?
- If the pool won’t import and you see “unsupported feature(s)”, stop. That’s not a tuning issue.
- Compare feature flags (zpool get all piped through a feature@ filter) between working and failing hosts.
- For boot failures: suspect bootloader feature support, not “ZFS is broken.”
Third: is it a performance regression or an I/O path problem?
- Check latency at the pool: zpool iostat -v 1 10 (a sample follows this list; add -l for latency columns if your version supports it).
- Check ARC misses and memory pressure: arcstat, plus OS memory stats.
- Check CPU usage in kernel threads: high system CPU can indicate checksum/compression overhead changes or a pathological workload pattern.
- Check recordsize/compression drift: upgrades don’t change existing blocks, but they can expose that your “one size fits all” properties were a lie.
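The sample referenced above. Column layout varies slightly by version; what matters is comparing per-vdev operations and bandwidth under the same workload you baselined:
cr0x@server:~$ zpool iostat -v tank 1 2
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.10T   724G    215    340  12.1M  28.4M
  raidz1-0  1.10T   724G    215    340  12.1M  28.4M
    sda         -      -     71    113  4.03M  9.52M
    sdb         -      -     72    114  4.05M  9.44M
    sdc         -      -     72    113  4.02M  9.47M
----------  -----  -----  -----  -----  -----  -----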
Fourth: is it replication/tooling?
- Run a manual zfs send -nP and inspect errors.
- Confirm the receiver can accept the stream (version/feature support).
- Check whether your replication tooling parses outputs that changed subtly.
Common mistakes: symptom → root cause → fix
1) Symptom: pool won’t import after upgrade; message mentions unsupported features
Root cause: Pool features were enabled on another host (or you ran zpool upgrade) and now you’re trying to import on an older OpenZFS implementation.
Fix: Upgrade the importing environment to a compatible OpenZFS version. If this is DR and you can’t, your only path is restoring from replication/backup that targets a compatible pool. You cannot “disable” most active features.
2) Symptom: system boots to initramfs or emergency shell; root pool not found
Root cause: ZFS module not included/built for the new kernel, initramfs missing ZFS, or the module failed to load.
Fix: Boot into an older kernel from the bootloader (keep one), rebuild DKMS/module, rebuild initramfs, and reboot. Validate zfs --version parity afterward.
3) Symptom: bootloader can’t read boot pool, but the pool imports fine from rescue media
Root cause: Boot pool features not supported by the bootloader. You upgraded/changed something on bpool or used an incompatible ashift/feature set for the loader.
Fix: Restore boot pool from a known-good snapshot/boot environment if available. Otherwise, reinstall bootloader with a compatible boot pool design (often: keep boot pool conservative and separate).
4) Symptom: replication starts failing after upgrade with stream errors
Root cause: Sender now produces streams using features receiver can’t accept, or your replication script assumes old zfs send behavior/flags.
Fix: Upgrade receiver side first (or keep sender compatible), and adjust replication tooling to use compatible flags. Verify with zfs send -nP and a small test dataset.
5) Symptom: performance drops; CPU spikes; I/O wait increases
Root cause: Often not “ZFS got slower,” but a change in kernel, I/O scheduler, compression implementation, or memory reclaim behavior interacting with ARC.
Fix: Compare pre/post baselines. Check arcstat miss rates, zpool iostat latency, and whether your workload shifted. Only then consider targeted tuning. Don’t shotgun zfs_arc_max changes based on vibes.
6) Symptom: datasets not mounted after reboot; services fail because paths missing
Root cause: Dataset properties like canmount, mountpoint, or systemd ordering changed; sometimes a received property overrides local expectations.
Fix: Inspect properties with zfs get, correct the intended source (local vs received), and ensure your service dependencies wait for ZFS mounts.
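On systemd hosts, the usual fix is to make the application unit wait for ZFS mounts explicitly. A sketch using a drop-in; the service name is a placeholder:
cr0x@server:~$ sudo systemctl edit myapp.service
# In the editor that opens, add:
[Unit]
Requires=zfs-mount.service
After=zfs-mount.service zfs.target
If your distro ships the zfs-mount-generator, per-dataset .mount units exist and RequiresMountsFor= on the exact path also works; otherwise ordering after zfs-mount.service is the portable option.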
7) Symptom: “zfs” commands work but pool operations error oddly; logs show version mismatch
Root cause: Userland and kernel module mismatch after partial upgrade.
Fix: Align versions. On Linux, that means ensuring the ZFS module built for the running kernel and userland packages match the same release line.
Three corporate-world mini-stories (anonymized, painfully plausible)
Mini-story #1: The incident caused by a wrong assumption
A mid-sized company ran a pair of storage servers: primary and DR. Primary was upgraded quarterly. DR was “stable” in the way old bread is stable. The assumption was simple: ZFS replication is just snapshots over the wire, so as long as datasets exist, compatibility will sort itself out.
During a routine upgrade, an engineer ran zpool upgrade because the command looked like the next logical step. The pool remained online, nothing crashed, and the change ticket closed early. In the next few days, replication jobs began failing, but only on some datasets. The failures were intermittent enough to be ignored and loud enough to annoy everyone.
Then a real incident hit primary—an HBA started throwing resets under load. They failed over to DR and discovered the pool wouldn’t import. “Unsupported features” on import. DR was running an older OpenZFS that didn’t understand some now-active feature flags. The pool wasn’t damaged. It was just too modern for the environment that needed it most.
The recovery was boring and expensive: rebuild DR with a newer stack, then restore data from whatever replication was still valid. The actual outage wasn’t because of ZFS. It was because of the assumption that feature flags are optional and reversible. They are not.
Mini-story #2: The optimization that backfired
A different org had a virtualization cluster backed by ZFS. After upgrading OpenZFS, an engineer decided to “take advantage of the new version” by changing compression from lz4 to zstd across the VM dataset. The rationale was straightforward: better compression means less I/O, so performance improves. That’s a good theory in a world where CPU is free and latency is imaginary.
In practice, the cluster ran a mixed workload: small random writes, bursty metadata operations, and occasional backup storms. After the change, latency got worse during peak hours. CPU climbed. The on-call started seeing VM timeouts that previously never happened. The storage graphs looked fine at the disk layer, which made the incident more fun: everyone blamed the network.
The root issue wasn’t that zstd is bad. It was that they changed a workload-defining property in the same window as an OpenZFS upgrade, without baseline comparisons. Compression levels and CPU overhead matter. Also, existing blocks don’t recompress, so the performance behavior was inconsistent across VMs depending on data age. Perfect for confusion.
The fix was to revert compression on the hot VM dataset back to lz4, keep zstd for colder datasets, and separate “upgrade the storage stack” from “change the storage behavior.” The upgrade wasn’t the villain. The bundling was.
Mini-story #3: The boring but correct practice that saved the day
A finance-adjacent company ran root-on-ZFS everywhere. They had a policy that made engineers roll their eyes: every host upgrade required a fresh boot environment, plus a post-upgrade reboot during the window. No exceptions. The policy existed because somebody, years earlier, got tired of “we’ll reboot later” being a synonym for “we will find out during an outage.”
During an OpenZFS upgrade, one host came back up with ZFS services failing to mount a dataset. The reason was mundane: a combination of service ordering and a dataset mount property that had been inherited unexpectedly. The host was technically up, but the applications were dead. The engineer on-call didn’t try clever fixes on a broken system under pressure.
They selected the previous boot environment in the bootloader, came back online on the old stack, and restored service. Then, in daylight hours, they reproduced the issue in staging, fixed the mount ordering, and re-ran the upgrade. The downtime was minimal because rollback wasn’t theoretical—it was part of the muscle memory.
This is the kind of practice that looks slow until it’s faster than every alternative.
FAQ
1) Should I run zpool upgrade immediately after upgrading OpenZFS?
No, not by default. Upgrade the software stack first, validate stability and replication, then upgrade pool features when every importer/receiver is ready.
2) What’s the difference between upgrading OpenZFS and upgrading a pool?
Upgrading OpenZFS changes the code that reads/writes your pool. Upgrading a pool changes on-disk feature flags. The latter can be irreversible and affects cross-host compatibility.
3) Can I downgrade OpenZFS if something goes wrong?
You can usually downgrade the software if you did not enable new pool features and your distro supports package rollback. If you enabled features that become active, older implementations may no longer import the pool.
4) Why does my pool say “healthy” but I still have permanent errors?
zpool status -x is a summary. Permanent errors can exist even when the pool is online. Always review zpool status -v before upgrades.
5) Do I need to scrub before upgrading?
If you haven’t scrubbed recently, yes. A scrub is how you validate data integrity across the whole pool. You want known-good data before changing kernel modules and storage code.
6) What about encrypted datasets—any special upgrade caveats?
Ensure key management works across reboots: confirm keylocation, test unlocking procedures, and validate that early boot can access keys if the root dataset is encrypted.
7) My replication target is older. Can I still upgrade the sender?
Often yes, if you avoid enabling incompatible pool features and keep replication streams compatible. But you must test: run zfs send -nP and validate a receive on the older target.
8) Why do people separate boot pool (bpool) from root pool (rpool)?
Bootloaders often support fewer ZFS features than the OS. A conservative boot pool reduces the chance that a feature upgrade makes the system unbootable.
9) If performance changed after upgrade, what’s the first metric to trust?
Latency at the pool and vdev level, plus ARC miss rates under the same workload. Throughput graphs alone can hide tail-latency regressions that break apps.
10) Is L2ARC a good idea after upgrading?
Only if you can prove it helps. L2ARC adds complexity and can steal memory for metadata. Measure before and after; don’t treat it as a rite of passage.
Conclusion: next steps you should actually take
Here’s the practical path that avoids most OpenZFS upgrade pain:
- Inventory importers and receivers. Compatibility is a fleet property, not a host property.
- Upgrade code first, reboot, validate. If you won’t reboot, you’re postponing the real test to a worse time.
- Delay zpool upgrade until the whole ecosystem is ready. Treat feature flags like schema migrations: planned, reviewed, and timed.
- Capture evidence. Save pre/post outputs for versions, pool status, feature flags, properties, and replication tests.
- Run one controlled replication test. If you can’t prove send/receive still works, you don’t have DR—you have a story.
Upgrading OpenZFS can be boring. That’s the goal. The checklist isn’t ceremony; it’s how you keep the on-call shift from becoming a career development event.