ZFS Kernel Upgrade Survival: What to Check Before Reboot

If you run ZFS in production, a kernel upgrade is not “apply patches, reboot, done.” It’s “apply patches, verify the plumbing, and then reboot.” Because ZFS isn’t just another filesystem: it’s a tight coupling of kernel module, userspace tools, initramfs behavior, bootloader assumptions, and pool feature compatibility.

The failure mode is always the same: you reboot with confidence and come back to a host that won’t import its pool, can’t mount root, or is stuck at an initramfs prompt looking betrayed. This is the checklist that keeps you out of that meeting.

The mental model: what can break, and where

A ZFS kernel upgrade failure rarely comes from a single bug. It comes from an assumption. Usually yours. The “kernel upgrade” is really a coordinated change across:

  • Kernel ABI changes (even minor ones) that require a matching ZFS module.
  • DKMS or prebuilt kmods building the module at install time.
  • initramfs content: whether ZFS modules and scripts are included so the system can import pools at boot.
  • Bootloader config: whether you can pick an older kernel quickly when it goes sideways.
  • Pool feature flags: whether a pool created or upgraded on one machine can be imported elsewhere (including in rescue mode).
  • Userspace tooling versions: mismatches that don’t always stop booting, but can make operations misleading.

Your job is to prove, before reboot, that:

  1. The ZFS module will build (or is already built) for the new kernel.
  2. The module will load cleanly.
  3. The initramfs contains what it needs to import and mount.
  4. The pools are healthy enough that a reboot won’t turn “slightly degraded” into “unimportable under pressure.”
  5. You can roll back quickly without remote hands or a long outage.

One quote worth taping to your monitor:

“Hope is not a strategy.” — Gene Kranz

That quote applies to kernel upgrades more than it applies to spaceflight. Spacecraft do not run DKMS. You do.

Interesting facts and history that actually matter operationally

These aren’t trivia; they explain why ZFS upgrades feel different from ext4 upgrades.

  1. ZFS was designed to own the storage stack. It merges volume management and filesystem semantics, which is why failures can hit both “disk discovery” and “mount” layers at once.
  2. OpenZFS feature flags replaced “pool version” upgrades. Feature flags let pools evolve without a single monolithic version number, but they also create subtle compatibility traps in rescue scenarios.
  3. Linux ZFS is out-of-tree. That means kernel updates can break module builds or load behavior even when the kernel itself is fine.
  4. DKMS was invented to reduce manual rebuild pain. It helps—until it fails non-interactively during unattended upgrades, leaving you with a new kernel and no module.
  5. Root-on-ZFS boots depend heavily on initramfs scripts. If the scripts or modules aren’t included, the kernel may boot perfectly and still not find root.
  6. ZFS caches device-to-vdev mappings. That’s great for speed, but stale cache files or changed device names can complicate imports during early boot.
  7. “Hostid” exists for a reason. ZFS uses it to prevent simultaneous imports on shared storage; wrong or missing hostid can change import behavior in surprising ways.
  8. Compression and checksumming are always-on concepts in ZFS’s model. That makes data integrity excellent, but it also means that “just mount it” debugging is less forgiving when metadata is unhappy.
  9. Enterprise distributions often patch kernels aggressively. A “minor” update can still be an ABI-affecting change for external modules.

Fast diagnosis playbook (when you’re already sweating)

This is for the moment after reboot when something doesn’t mount, or ZFS won’t import, or you’re staring at an initramfs prompt. Don’t freestyle; you’ll waste the outage budget on anxiety.

First: prove whether the ZFS module exists and can load

  • Question: Is the ZFS kernel module present for this kernel?
  • Why: If it’s missing, nothing else matters yet.
cr0x@server:~$ uname -r
6.5.0-28-generic
cr0x@server:~$ modinfo zfs | sed -n '1,8p'
filename:       /lib/modules/6.5.0-28-generic/updates/dkms/zfs.ko
version:        2.2.3-1ubuntu1
license:        CDDL
description:    ZFS filesystem
author:         OpenZFS
srcversion:     1A2B3C4D5E6F7A8B9C0D
depends:        znvpair,zcommon,zunicode,zzstd,icp,spl

Decision: If modinfo fails with “not found,” you’re in module-land. Boot an older kernel or rebuild/install ZFS for this kernel. If it exists, attempt to load it and read the error.

cr0x@server:~$ sudo modprobe zfs
modprobe: ERROR: could not insert 'zfs': Unknown symbol in module, or unknown parameter (see dmesg)
cr0x@server:~$ dmesg | tail -n 15
[   12.841234] zfs: disagrees about version of symbol module_layout
[   12.841250] zfs: Unknown symbol spl_kmem_cache_alloc (err -22)

Decision: “Unknown symbol” typically means ABI mismatch: wrong module build for this kernel, or stale modules. Rebuild DKMS or install the correct kmod package for the running kernel. Do not try to “force import.” There is no force flag for missing symbols.

Second: prove pool visibility vs importability

cr0x@server:~$ sudo zpool import
   pool: tank
     id: 1234567890123456789
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        tank        ONLINE
          mirror-0  ONLINE
            sda3    ONLINE
            sdb3    ONLINE

Decision: If the pool is listed, ZFS is running and can see vdevs. If it’s not listed, you’re in device discovery (missing drivers, wrong cabling, dead HBA, wrong /dev paths) or ZFS isn’t running.
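
If the pool isn't listed, a hedged triage sketch to separate "ZFS isn't running" from "the disks aren't there" (device names and paths are examples):

cr0x@server:~$ lsmod | grep -E 'zfs|spl'              # is the module even loaded?
cr0x@server:~$ lsblk -o NAME,SIZE,SERIAL,MODEL        # do the expected disks show up at all?
cr0x@server:~$ sudo zpool import -d /dev/disk/by-id   # rescan using stable paths instead of the cachefile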

Third: if root-on-ZFS, check initramfs assumptions

If you’re in initramfs with no root, your fastest question is: did initramfs include ZFS and the right scripts?

(initramfs) lsmod | grep -E 'zfs|spl'
zfs                  6209536  0
spl                  163840   1 zfs

Decision: If ZFS isn’t loaded in initramfs, you likely shipped a broken initramfs. Boot old kernel/initramfs from GRUB and rebuild initramfs properly.

Joke #1 (short, relevant): A kernel upgrade without a rollback plan is like a parachute with “v2” in the filename—technically improved, spiritually terrifying.

Pre-reboot preflight: tasks with commands, outputs, and decisions

The rest of this piece is a set of boring checks that prevent dramatic outages. Each task includes: command, example output, what it means, and the decision you make.

Scope note: Commands below assume a Debian/Ubuntu-ish environment with OpenZFS on Linux. The logic generalizes to other distros; the packaging details don’t.

Task 1: Confirm what kernel you’re on, and what you’re about to boot

cr0x@server:~$ uname -r
6.2.0-39-generic
cr0x@server:~$ ls -1 /boot/vmlinuz-* | tail -n 3
/boot/vmlinuz-6.2.0-39-generic
/boot/vmlinuz-6.5.0-28-generic
/boot/vmlinuz-6.5.0-31-generic

What it means: You have multiple kernels installed; good. The newest will usually become the default boot entry, depending on your GRUB config.

Decision: If you only have one kernel installed, stop. Install at least one known-good previous kernel before you gamble with reboot.
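
If you do need to pull a rollback kernel back in, a minimal Debian/Ubuntu sketch (the version is this example's previous kernel; pick one you have actually booted):

cr0x@server:~$ sudo apt install linux-image-6.2.0-39-generic linux-headers-6.2.0-39-generic
cr0x@server:~$ sudo apt-mark hold linux-image-6.2.0-39-generic   # stop autoremove from deleting your escape hatch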

Task 2: Verify ZFS userspace version and kernel module version alignment

cr0x@server:~$ zfs --version
zfs-2.2.3-1ubuntu1
zfs-kmod-2.2.3-1ubuntu1

What it means: Userspace tools and kernel module package versions match. This does not guarantee the module is built for the new kernel, but it’s a baseline sanity check.

Decision: If userspace and kmod versions differ widely, expect surprises. Align them via your package manager before reboot.
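
One way to see and fix the drift on Debian/Ubuntu, assuming the usual package names (zfs-initramfs only matters for root-on-ZFS):

cr0x@server:~$ apt policy zfsutils-linux zfs-dkms zfs-initramfs         # compare installed vs candidate versions
cr0x@server:~$ sudo apt install zfsutils-linux zfs-dkms zfs-initramfs   # pull them to matching versions from one source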

Task 3: Check DKMS status for ZFS against installed kernels

cr0x@server:~$ dkms status | grep -i zfs
zfs/2.2.3, 6.2.0-39-generic, x86_64: installed
zfs/2.2.3, 6.5.0-31-generic, x86_64: installed

What it means: DKMS built ZFS for both current and target kernels.

Decision: If the target kernel is missing here, do not reboot yet. Fix DKMS build first, while you still have a working system to do it on.

Task 4: Force a build test for the new kernel (because DKMS sometimes lies by omission)

cr0x@server:~$ sudo dkms autoinstall -k 6.5.0-31-generic
Sign command: /lib/modules/6.5.0-31-generic/build/scripts/sign-file
Signing key: /var/lib/shim-signed/mok/MOK.priv
Public certificate (MOK): /var/lib/shim-signed/mok/MOK.der

zfs.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/6.5.0-31-generic/updates/dkms/

depmod...

What it means: Build succeeded; module installed; depmod ran.

Decision: If you see compiler errors, missing headers, or “bad return status,” treat that as a hard stop. Rebooting will turn a build error into an outage.

Task 5: Confirm headers exist for the new kernel

cr0x@server:~$ dpkg -l | grep -E 'linux-headers-6.5.0-31-generic'
ii  linux-headers-6.5.0-31-generic  6.5.0-31.31~22.04.1  amd64  Linux kernel headers for version 6.5.0 on 64 bit x86 SMP

What it means: Headers are installed, so DKMS has a fighting chance.

Decision: If headers aren’t installed, install them now and rerun the DKMS build check.

Task 6: Check Secure Boot module signing (if applicable)

cr0x@server:~$ mokutil --sb-state
SecureBoot enabled
cr0x@server:~$ sudo modprobe zfs
modprobe: ERROR: could not insert 'zfs': Key was rejected by service

What it means: The module exists but can’t be loaded because it isn’t signed with a key trusted by the system (MOK/UEFI).

Decision: Either sign the ZFS module appropriately and enroll the key, or disable Secure Boot if your environment allows it. “We’ll deal with it after reboot” is not a plan; it’s a trap.
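
A minimal enrollment sketch for the Ubuntu shim/DKMS flow, assuming the MOK paths from the Task 4 output apply to your host; the actual enrollment happens in the MokManager screen at the next boot, which is exactly why this belongs before the window, not during it:

cr0x@server:~$ sudo mokutil --import /var/lib/shim-signed/mok/MOK.der   # queue the DKMS signing cert; you set a one-time password
cr0x@server:~$ mokutil --list-new                                       # confirm the key is queued for enrollment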

Task 7: Verify pool health and error counters (a reboot is not a healing ceremony)

cr0x@server:~$ sudo zpool status -x
all pools are healthy
cr0x@server:~$ sudo zpool status tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
  scan: scrub repaired 0B in 00:17:21 with 0 errors on Thu Dec 12 03:27:01 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda3    ONLINE       0     0     0
            sdb3    ONLINE       1     0     0
errors: No known data errors

What it means: Even if the pool is ONLINE, you have at least one read error on a device. That’s a yellow light.

Decision: If any device shows rising READ/WRITE/CKSUM errors, investigate before reboot. Reboots are when marginal drives decide to become performance art.
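
If a counter is nonzero, a quick triage sketch (smartmontools is assumed installed; /dev/sdb is the device from the example output):

cr0x@server:~$ sudo zpool status -v tank   # -v lists any files with known errors
cr0x@server:~$ sudo smartctl -H /dev/sdb   # overall SMART health verdict
cr0x@server:~$ sudo smartctl -A /dev/sdb   # attributes: reallocated/pending sectors, CRC errors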

Task 8: Scrub status and timing: did you recently validate the pool?

cr0x@server:~$ sudo zpool get -H -o name,value,source autotrim tank
tank    on    local
cr0x@server:~$ sudo zpool status tank | sed -n '1,12p'
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 00:17:21 with 0 errors on Thu Dec 12 03:27:01 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0

What it means: You have a recent scrub with 0 errors, which is the closest thing to “known good” ZFS offers.

Decision: If the last scrub is ancient and this machine matters, run one ahead of the maintenance window. If you don’t have time for a full scrub, at least accept you’re rebooting blind.
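
A minimal "scrub before the window" sketch; zpool wait needs a reasonably recent OpenZFS (2.x), so treat that line as optional:

cr0x@server:~$ sudo zpool scrub tank                     # start the scrub ahead of the change window
cr0x@server:~$ sudo zpool status tank | grep -A1 scan:   # progress and estimated completion
cr0x@server:~$ sudo zpool wait -t scrub tank             # block until the scrub finishes (OpenZFS 2.x)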

Task 9: Check feature flags to avoid “works here, fails in rescue”

cr0x@server:~$ sudo zpool get -H -o name,property,value all tank | grep -E 'feature@|compatibility' | head -n 12
tank    feature@async_destroy  enabled
tank    feature@spacemap_histogram  enabled
tank    feature@extensible_dataset  enabled
tank    feature@bookmarks  enabled
tank    feature@embedded_data  active
tank    feature@device_removal  enabled
tank    feature@obsolete_counts  enabled
tank    feature@zstd_compress  active

What it means: Some features are “active” (in use). A rescue environment with older OpenZFS might refuse to import, or import read-only with warnings.

Decision: Before upgrades, avoid enabling new pool features casually. If you must, make sure your rescue media and rollback hosts can import the pool.
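
If your OpenZFS is 2.1 or newer, the compatibility pool property is the guardrail here. A sketch with an example feature-set name; note that compatibility only restricts which features can be enabled from now on, it does not undo features that are already active:

cr0x@server:~$ sudo zpool get compatibility tank                     # 'off' means any supported feature may be enabled
cr0x@server:~$ ls /usr/share/zfs/compatibility.d | head -n 5         # feature-set definitions shipped with your tools
cr0x@server:~$ sudo zpool set compatibility=openzfs-2.1-linux tank   # example: cap future enables at a 2.1 baseline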

Task 10: Validate that ZFS services will start cleanly at boot

cr0x@server:~$ systemctl status zfs-import-cache.service --no-pager
● zfs-import-cache.service - Import ZFS pools by cache file
     Loaded: loaded (/lib/systemd/system/zfs-import-cache.service; enabled)
     Active: active (exited) since Thu 2025-12-26 01:12:09 UTC; 2h 11min ago
cr0x@server:~$ systemctl status zfs-mount.service --no-pager
● zfs-mount.service - Mount ZFS filesystems
     Loaded: loaded (/lib/systemd/system/zfs-mount.service; enabled)
     Active: active (exited) since Thu 2025-12-26 01:12:11 UTC; 2h 11min ago

What it means: Your current boot path is sane: pools imported, datasets mounted.

Decision: If these are disabled or failing, fix them before changing the kernel. Kernel upgrades are not the time to discover your ZFS import method is “someone manually does it.”
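
A minimal sketch using the unit names shipped by Debian/Ubuntu packaging; pick the import service that matches your method (cache file vs device scan):

cr0x@server:~$ sudo systemctl enable zfs-import-cache.service zfs-mount.service zfs.target
cr0x@server:~$ systemctl list-dependencies zfs.target --no-pager | head -n 10   # sanity-check what boot will pull in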

Task 11: Confirm your import method: cachefile vs by-id paths

cr0x@server:~$ sudo zpool get cachefile tank
NAME  PROPERTY  VALUE                 SOURCE
tank  cachefile /etc/zfs/zpool.cache  local
cr0x@server:~$ ls -l /etc/zfs/zpool.cache
-rw-r--r-- 1 root root 2270 Dec 26 01:12 /etc/zfs/zpool.cache

What it means: The system uses a cache file to speed imports and preserve device mapping. Good—if it’s kept current.

Decision: If you move disks/HBAs or clone VMs, regenerate the cache file intentionally. Stale cache files can cause confusing import behavior at boot.
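
Regenerating is one property set away. A sketch, assuming the cachefile path from above; the initramfs rebuild only matters on root-on-ZFS setups that embed the cache file:

cr0x@server:~$ sudo zpool set cachefile=/etc/zfs/zpool.cache tank   # rewrite the cache from current device mappings
cr0x@server:~$ ls -l /etc/zfs/zpool.cache                           # timestamp should now read "just now"
cr0x@server:~$ sudo update-initramfs -u                             # so early boot sees the same cache file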

Task 12: Inspect initramfs contents for ZFS modules and scripts (root-on-ZFS especially)

cr0x@server:~$ lsinitramfs /boot/initrd.img-6.5.0-31-generic | grep -E 'zfs|spl|zpool' | head -n 12
etc/zfs/zpool.cache
scripts/zfs
usr/sbin/zfs
usr/sbin/zpool
lib/modules/6.5.0-31-generic/updates/dkms/spl.ko
lib/modules/6.5.0-31-generic/updates/dkms/zfs.ko

What it means: The initramfs includes ZFS userspace tools, scripts, and modules for that kernel.

Decision: If ZFS bits are missing from the initramfs, rebuild it now and re-check. Otherwise you’re about to boot a kernel that can’t mount the filesystems you care about.

Task 13: Rebuild initramfs after any ZFS/kernel module changes

cr0x@server:~$ sudo update-initramfs -u -k 6.5.0-31-generic
update-initramfs: Generating /boot/initrd.img-6.5.0-31-generic

What it means: initramfs regenerated for the target kernel.

Decision: Always regenerate initramfs after DKMS builds or ZFS package updates on root-on-ZFS systems. Treat it like part of the module install, not an optional extra.

Task 14: Confirm GRUB has entries for at least one previous kernel

cr0x@server:~$ grep -n "menuentry 'Ubuntu" /boot/grub/grub.cfg | head -n 6
148:menuentry 'Ubuntu' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-11111111-2222-3333-4444-555555555555' {
166:menuentry 'Ubuntu, with Linux 6.5.0-31-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.5.0-31-generic-advanced-11111111-2222-3333-4444-555555555555' {
184:menuentry 'Ubuntu, with Linux 6.5.0-28-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.5.0-28-generic-advanced-11111111-2222-3333-4444-555555555555' {

What it means: Boot menu includes multiple kernels.

Decision: If older kernels aren’t available in GRUB, add them now. During an outage, your fingers will forget how to type “chroot.” GRUB selection is the faster escape hatch.

Task 15: Capture “known good” system state for later comparison

cr0x@server:~$ sudo zpool status | sudo tee /root/pre-reboot-zpool-status.txt > /dev/null
cr0x@server:~$ sudo zfs list -o name,used,avail,refer,mountpoint | sudo tee /root/pre-reboot-zfs-list.txt > /dev/null

What it means: You now have before/after snapshots of reality.

Decision: Always capture this on systems you’ll troubleshoot later. It turns vague “it was fine yesterday” into evidence.

Task 16: Validate that your root dataset and boot pool properties are sane

cr0x@server:~$ sudo zfs get -H -o name,property,value,source mountpoint,canmount,readonly -r rpool/ROOT | head -n 10
rpool/ROOT  mountpoint  /          local
rpool/ROOT  canmount    off        local
rpool/ROOT/ubuntu  mountpoint  /   local
rpool/ROOT/ubuntu  canmount    noauto  local
rpool/ROOT/ubuntu  readonly    off     default

What it means: You’re using a common root-on-ZFS pattern: a parent dataset with canmount=off and a child root dataset.

Decision: If mountpoints or canmount flags are inconsistent, fix them before reboot. Kernel upgrades don’t break these, but they expose boot ordering issues that were already lurking.

Task 17: Confirm hostid stability (avoids import confusion on shared storage or clones)

cr0x@server:~$ hostid
7f000001
cr0x@server:~$ od -An -tx4 /etc/hostid
 7f000001

What it means: hostid exists and is stable.

Decision: If you’re cloning VMs that share pools (or might see the same disks), ensure unique hostid. Kernel upgrades are often when cloned templates become “real servers,” and ZFS remembers.

Task 18: Make a targeted snapshot before touching kernels (for configs and quick rollbacks)

cr0x@server:~$ sudo zfs snapshot -r rpool/ROOT/ubuntu@pre-kernel-upgrade
cr0x@server:~$ sudo zfs list -t snapshot | grep pre-kernel-upgrade | head -n 3
rpool/ROOT/ubuntu@pre-kernel-upgrade   0B  -

What it means: You created a snapshot of the root dataset. On systems with proper boot environments, this can become a fast rollback.

Decision: If you can’t snapshot root (weird layout, policy), at least back up /etc and boot config. “We have backups” is not specific enough during an outage.

Task 19: Check free space (yes, it matters for upgrades and for ZFS behavior)

cr0x@server:~$ sudo zfs list -o name,used,avail,usedbysnapshots,mountpoint tank
NAME   USED  AVAIL  USEDSNAP  MOUNTPOINT
tank  18.2T  1.14T     3.55T  /tank

What it means: You have headroom. ZFS doesn’t like living near 100% full; metadata allocations and performance get ugly.

Decision: If the pool is tight, do cleanup first. Kernel upgrades are not a great time to also discover your pool is out of space and can’t commit transactions smoothly.

Task 20: Dry-run your remote access and console fallback

cr0x@server:~$ sudo systemctl is-active ssh
active
cr0x@server:~$ sudo systemctl is-enabled serial-getty@ttyS0.service
enabled

What it means: SSH is up; serial console is configured (if you have it). This is a recovery affordance, not a luxury.

Decision: If you don’t have out-of-band console access, be extra conservative: keep older kernels, rehearse rollback, and schedule reboot windows when humans can reach the box.

Three corporate mini-stories (all true in spirit)

Mini-story 1: The incident caused by a wrong assumption

They ran a small fleet of analytics nodes. Nothing fancy: a bunch of Linux servers, each with a mirrored boot pair and a big ZFS pool for data. The team treated kernel updates like routine hygiene—because for years, they were.

One week, the distro shipped a new kernel line and the CI pipeline started pushing it. The upgrade looked clean. Packages installed, DKMS printed reassuring messages, and the maintenance window was short. They rebooted, one node at a time.

The first node came back… into initramfs. It could see disks, but couldn’t import the pool. Panic followed the classic route: someone said “ZFS is broken,” someone else said “it’s the HBA,” and a third person started searching old runbooks that assumed ext4.

The real problem was simpler and more embarrassing: the ZFS module had built for the new kernel, but Secure Boot rejected it. Nobody had tested module loading on a Secure Boot-enabled host after the key rotation. The “wrong assumption” wasn’t technical; it was procedural: they assumed the environment was unchanged because the servers were unchanged.

Recovery took longer than it should have, because they had to boot an older kernel, enroll the right key, and then re-run the upgrade properly. The takeaway that stuck: add a modprobe zfs check as a gate before reboot. You can’t argue with a kernel that refuses unsigned code.

Mini-story 2: The optimization that backfired

A different org had a performance culture. Not the good kind. The kind where someone posts iostat screenshots in Slack like they’re workout selfies. They wanted faster reboot times, faster imports, faster everything.

Someone noticed that importing pools by scanning devices was “slow,” so they enforced cachefile imports everywhere and disabled anything that looked like extra work. They also trimmed initramfs to reduce size, because smaller initramfs must be faster. This is how logic works in conference rooms.

Months later, they upgraded kernels on a storage node. The upgrade itself was fine. But the reboot changed device enumeration order—an HBA firmware update had also been applied earlier. The cachefile still referenced old /dev names. Early boot tried to import using stale mapping, and because they’d trimmed scripts, the fallback scan behavior wasn’t there.

The server didn’t import pools automatically. An operator manually imported with -d /dev/disk/by-id, things came up, and everyone agreed it was “weird.” The weirdness repeated on the next node. The “optimization” had removed resilience.

They fixed it by treating import paths like configuration drift: either rely on stable by-id paths and keep fallback scanning enabled, or maintain the cachefile intentionally and regenerate it when hardware changes. “Faster boot” is not a KPI worth a pager.

Mini-story 3: The boring but correct practice that saved the day

Large corporate data platform. Several environments. Lots of stakeholders. The storage team was not loved, but they were respected—the difference is usually paperwork.

They had a checklist for ZFS kernel upgrades. It wasn’t fancy. It was rigid. It included “verify DKMS status for the target kernel,” “confirm initramfs contains zfs.ko,” and “confirm GRUB has an older kernel entry.” Nobody got to skip it because they were “in a hurry.”

One night, a kernel update landed that broke ZFS DKMS builds due to a toolchain mismatch on that image. Their pipeline caught it because the preflight included dkms autoinstall -k as a validation step. The build failed loudly while the system was still up, and the node never rebooted into a broken state.

The fix was mundane: install the right headers and toolchain bits, then rebuild. The heroic part was that nobody needed to be heroic. The cluster stayed healthy, the maintenance window stayed boring, and the morning status update was just “completed successfully.”

That team’s secret weapon was not expertise. It was insisting on the boring checks, every time, especially when things seemed easy.

Checklists / step-by-step plan

Use this as a runbook. Don’t “adapt it live.” Adapt it once, on a calm day, and then follow it.

Step-by-step plan: the safe reboot sequence

  1. Pick a rollback kernel. Confirm an older kernel is installed and selectable in GRUB. If you can’t point to it, you don’t have one.
  2. Check pool health. zpool status -x must be clean or understood. Any active resilver, missing device, or escalating error counters should delay the reboot.
  3. Check recent scrub. If the system is important and the last scrub is old, run one ahead of the change window.
  4. Install kernel + headers. Don’t separate them. Kernel without headers is DKMS roulette.
  5. Confirm DKMS builds for the target kernel. Use dkms status and then force dkms autoinstall -k.
  6. Test module load on the current system (if possible). You can’t load a module built for a different kernel, but you can catch Secure Boot/key issues by validating your signing flow and current module behavior.
  7. Regenerate initramfs for the target kernel. Then inspect it with lsinitramfs to confirm ZFS content exists.
  8. Snapshot root datasets. Especially if you have a boot environment mechanism. Snapshots are cheap; outages are not.
  9. Capture pre-state outputs. Save zpool status, zfs list, and journalctl -b around previous boots if you’re chasing intermittent issues.
  10. Reboot one node, observe, then proceed. If you have multiple machines, do a canary. If you don’t, your canary is “you, being careful.”

Minimal “I have five minutes” checklist (use only when the blast radius is small)

  • zpool status -x is clean
  • dkms status | grep zfs shows the target kernel
  • lsinitramfs /boot/initrd.img-<target> | grep zfs.ko finds the module
  • GRUB has an older kernel entry
  • You have console access

Joke #2 (short, relevant): If your rollback plan is “we’ll SSH in and fix it,” your rollback plan is also “we believe in magic.”

Common mistakes: symptom → root cause → fix

1) Boot drops to initramfs, root pool won’t import

Symptom: You get an initramfs shell. Messages mention missing root filesystem, ZFS import failure, or “cannot mount rpool.”

Root cause: initramfs missing ZFS modules/scripts for the new kernel, or ZFS module failed to load (Secure Boot, ABI mismatch).

Fix: Boot an older kernel from GRUB. Rebuild DKMS for the target kernel, run update-initramfs -u -k <target>, verify with lsinitramfs, then reboot again.

2) ZFS module exists but won’t load: “Unknown symbol”

Symptom: modprobe zfs fails; dmesg shows unknown symbols or module_layout mismatch.

Root cause: Module built against a different kernel, stale modules not replaced, or partial upgrade left inconsistent files under /lib/modules.

Fix: Rebuild ZFS for the exact kernel: dkms autoinstall -k $(uname -r) (or install the right kmod). Run depmod -a. If packaging is broken, reinstall ZFS packages cleanly.

3) Module load fails: “Key was rejected by service”

Symptom: ZFS module won’t load on Secure Boot systems.

Root cause: Unsigned module or untrusted signing key.

Fix: Enroll a Machine Owner Key (MOK) and ensure DKMS signs the module, or disable Secure Boot (policy permitting). Validate in advance by testing module loading after rebuild.

4) Pool imports manually but won’t auto-import at boot

Symptom: After reboot, services are up but datasets are missing until someone runs zpool import.

Root cause: zfs-import services disabled, cachefile missing/stale, or import method not matching environment (e.g., device names changed).

Fix: Enable import services, regenerate /etc/zfs/zpool.cache intentionally by exporting/importing, and prefer stable device paths like /dev/disk/by-id for imports in unusual environments.

5) Pool won’t import in rescue environment, but imports fine on the host

Symptom: Rescue ISO can’t import; host can.

Root cause: Pool feature flags in use are newer than the rescue environment supports.

Fix: Keep rescue environments updated to match your pool features. Avoid enabling new features unless you’ve verified your recovery tooling can handle them.

6) Reboot succeeds, but performance tanks and latency spikes

Symptom: After kernel upgrade, I/O latency increases; ARC behavior changes; CPU usage jumps.

Root cause: New kernel/ZFS module behavior interacts differently with memory limits, cgroup constraints, or tunables. Sometimes the change is real; sometimes your monitoring labels moved.

Fix: Compare /proc/spl/kstat/zfs/arcstats before/after, check memory limits, and revert tunable experiments. Don’t “optimize” during the incident; stabilize first.

FAQ

1) Do I really need to rebuild initramfs if DKMS says ZFS is installed?

On root-on-ZFS or any system that imports pools during early boot: yes. DKMS building a module doesn’t guarantee initramfs contains it. Verify with lsinitramfs.

2) What’s the single best indicator that a reboot will succeed?

For root-on-ZFS: the target initramfs includes ZFS modules and scripts, and you can select a previous kernel in GRUB. For non-root pools: DKMS has built for the target kernel and pools are healthy.

3) Should I enable new pool feature flags during routine maintenance?

Not as a casual “why not.” Feature flags are mostly one-way. Enable them when you have a reason, and after confirming your recovery environments can import the pool.

4) Is it safer to use distro prebuilt ZFS kmods or DKMS?

Prebuilt kmods can be more predictable if your distro maintains them well. DKMS is flexible but adds a build step that can fail at the worst time. Pick one based on your environment’s reliability, not ideology.

5) How do I know if my pool is likely to fail import after reboot?

If zpool status shows missing devices, ongoing resilvering, checksum errors that keep increasing, or you’ve had recent power loss issues, delay reboot and stabilize first.

6) What if I run ZFS only for a data pool, not root?

Your system will boot even if ZFS is broken, which is nice. But your applications won’t see their data. The checklist still applies: verify module build/load and verify auto-import services.

7) Why do device names changing matter if ZFS uses GUIDs?

ZFS does track devices by GUID, but import behavior and cachefile entries can still depend on paths. In early boot or constrained rescue environments, stable paths reduce surprises.

8) Can I just keep the old kernel forever and avoid this problem?

You can, until you can’t—security fixes, hardware enablement, and vendor support windows will eventually corner you. The sustainable move is to make upgrades boring, not to avoid them.

9) What’s the fastest rollback if the new kernel can’t import pools?

Reboot and select the previous kernel from GRUB. That’s why you keep it installed and tested. If you can’t reach GRUB, you needed console access yesterday.

10) Should I export pools before reboot?

Usually no for single-host local disks; it can increase risk if services expect mounts during shutdown. But on shared storage or complex multipath setups, a controlled export can reduce “pool was in use” ambiguity. Make it a deliberate choice, not a superstition.

Conclusion: practical next steps

A ZFS kernel upgrade goes wrong when you treat it like an OS patch instead of a storage stack change. The survival strategy is consistent:

  • Prove the ZFS module exists for the target kernel (DKMS status plus forced build).
  • Prove initramfs contains what boot needs (inspect, don’t assume).
  • Prove your pools are healthy enough to survive stress (status, scrub recency, error counters).
  • Prove you can roll back without heroics (GRUB entries, console, snapshots).

Next time you schedule a reboot window, do the preflight tasks in order, save the outputs, and treat any red flag as a stop. The goal isn’t courage. It’s paperwork-backed confidence.
