If you run ZFS in production, a kernel upgrade is not “apply patches, reboot, done.” It’s “apply patches, verify the plumbing, and then reboot.” Because ZFS isn’t just another filesystem: it’s a tight coupling of kernel module, userspace tools, initramfs behavior, bootloader assumptions, and pool feature compatibility.
The failure mode is always the same: you reboot with confidence and come back to a host that won’t import its pool, can’t mount root, or is stuck at an initramfs prompt looking betrayed. This is the checklist that keeps you out of that meeting.
The mental model: what can break, and where
A ZFS kernel upgrade failure rarely comes from a single bug. It comes from an assumption. Usually yours. The “kernel upgrade” is really a coordinated change across:
- Kernel ABI changes (even minor ones) that require a matching ZFS module.
- DKMS or prebuilt kmods building the module at install time.
- initramfs content: whether ZFS modules and scripts are included so the system can import pools at boot.
- Bootloader config: whether you can pick an older kernel quickly when it goes sideways.
- Pool feature flags: whether a pool created or upgraded on one machine can be imported elsewhere (including in rescue mode).
- Userspace tooling versions: mismatches that don’t always stop booting, but can make operations misleading.
Your job is to prove, before reboot, that:
- The ZFS module will build (or is already built) for the new kernel.
- The module will load cleanly.
- The initramfs contains what it needs to import and mount.
- The pools are healthy enough that a reboot won’t turn “slightly degraded” into “unimportable under pressure.”
- You can roll back quickly without remote hands or a long outage.
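If you want those proofs wired into a single gate you can run before every window, here is a minimal sketch. It assumes a Debian/Ubuntu layout with DKMS-built ZFS; the TARGET variable, paths, and messages are illustrative, not a standard tool.

#!/usr/bin/env bash
# Pre-reboot gate: a minimal sketch, assuming Debian/Ubuntu paths and DKMS-built ZFS.
# TARGET is a placeholder for the kernel you are about to boot.
set -eu
TARGET="6.5.0-31-generic"

# 1) Is the ZFS module built and installed for the target kernel?
dkms status | grep -q "zfs.*${TARGET}.*installed" \
  || { echo "FAIL: ZFS is not built for ${TARGET}"; exit 1; }

# 2) Does the target initramfs actually contain the module?
lsinitramfs "/boot/initrd.img-${TARGET}" | grep -q 'zfs\.ko' \
  || { echo "FAIL: initramfs for ${TARGET} has no zfs.ko"; exit 1; }

# 3) Are the pools clean right now?
zpool status -x | grep -q 'all pools are healthy' \
  || { echo "FAIL: pools are not clean; read zpool status before proceeding"; exit 1; }

echo "Preflight OK for ${TARGET}"

Run it as root before the window; a non-zero exit is your stop sign, not a suggestion.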
One quote worth taping to your monitor:
“Hope is not a strategy.” — a saying beloved of flight directors and SRE handbooks alike
That quote applies to kernel upgrades more than it applies to spaceflight. Spacecraft do not run DKMS. You do.
Interesting facts and history that actually matter operationally
These aren’t trivia; they explain why ZFS upgrades feel different from ext4 upgrades.
- ZFS was designed to own the storage stack. It merges volume management and filesystem semantics, which is why failures can hit both “disk discovery” and “mount” layers at once.
- OpenZFS feature flags replaced “pool version” upgrades. Feature flags let pools evolve without a single monolithic version number, but they also create subtle compatibility traps in rescue scenarios.
- Linux ZFS is out-of-tree. That means kernel updates can break module builds or load behavior even when the kernel itself is fine.
- DKMS was invented to reduce manual rebuild pain. It helps—until it fails non-interactively during unattended upgrades, leaving you with a new kernel and no module.
- Root-on-ZFS boots depend heavily on initramfs scripts. If the scripts or modules aren’t included, the kernel may boot perfectly and still not find root.
- ZFS caches device-to-vdev mappings. That’s great for speed, but stale cache files or changed device names can complicate imports during early boot.
- “Hostid” exists for a reason. ZFS uses it to prevent simultaneous imports on shared storage; wrong or missing hostid can change import behavior in surprising ways.
- Checksumming is always on in ZFS, and compression usually is in practice. That makes data integrity excellent, but it also means that “just mount it” debugging is less forgiving when metadata is unhappy.
- Enterprise distributions often patch kernels aggressively. A “minor” update can still be an ABI-affecting change for external modules.
Fast diagnosis playbook (when you’re already sweating)
This is for the moment after reboot when something doesn’t mount, or ZFS won’t import, or you’re staring at an initramfs prompt. Don’t freestyle; you’ll waste the outage budget on anxiety.
First: prove whether the ZFS module exists and can load
- Question: Is the ZFS kernel module present for this kernel?
- Why: If it’s missing, nothing else matters yet.
cr0x@server:~$ uname -r
6.5.0-28-generic
cr0x@server:~$ modinfo zfs | sed -n '1,8p'
filename: /lib/modules/6.5.0-28-generic/updates/dkms/zfs.ko
version: 2.2.3-1ubuntu1
license: CDDL
description: ZFS filesystem
author: OpenZFS
srcversion: 1A2B3C4D5E6F7A8B9C0D
depends: znvpair,zcommon,zunicode,zzstd,icp,spl
Decision: If modinfo fails with “not found,” you’re in module-land. Boot an older kernel or rebuild/install ZFS for this kernel. If it exists, attempt to load it and read the error.
cr0x@server:~$ sudo modprobe zfs
modprobe: ERROR: could not insert 'zfs': Unknown symbol in module, or unknown parameter (see dmesg)
cr0x@server:~$ dmesg | tail -n 15
[ 12.841234] zfs: disagrees about version of symbol module_layout
[ 12.841250] zfs: Unknown symbol spl_kmem_cache_alloc (err -22)
Decision: “Unknown symbol” typically means ABI mismatch: wrong module build for this kernel, or stale modules. Rebuild DKMS or install the correct kmod package for the running kernel. Do not try to “force import.” There is no force flag for missing symbols.
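If the running kernel is otherwise fine and only the module is stale, a clean rebuild for exactly this kernel usually resolves it. A sketch, assuming DKMS packaging:

cr0x@server:~$ sudo dkms autoinstall -k "$(uname -r)"
cr0x@server:~$ sudo depmod -a "$(uname -r)"
cr0x@server:~$ sudo modprobe zfs
cr0x@server:~$ dmesg | tail -n 5

If the rebuild itself fails, you are back to headers/toolchain problems, which is exactly what the preflight tasks below are designed to catch before you are here.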
Second: prove pool visibility vs importability
cr0x@server:~$ sudo zpool import
pool: tank
id: 1234567890123456789
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:
tank ONLINE
mirror-0 ONLINE
sda3 ONLINE
sdb3 ONLINE
Decision: If the pool is listed, ZFS is running and can see vdevs. If it’s not listed, you’re in device discovery (missing drivers, wrong cabling, dead HBA, wrong /dev paths) or ZFS isn’t running.
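If the pool is not listed, it is worth re-running the scan against stable device paths before blaming hardware. A sketch, assuming your disks appear under /dev/disk/by-id:

cr0x@server:~$ ls /dev/disk/by-id/ | grep -v part | head
cr0x@server:~$ sudo zpool import -d /dev/disk/by-id
cr0x@server:~$ sudo zpool import -d /dev/disk/by-id tank

The second command only reports what is importable from that directory; the third performs the import. If nothing shows up even there, you are genuinely in device-discovery territory.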
Third: if root-on-ZFS, check initramfs assumptions
If you’re in initramfs with no root, your fastest question is: did initramfs include ZFS and the right scripts?
cr0x@server:~$ lsmod | grep -E 'zfs|spl'
zfs 6209536 0
spl 163840 1 zfs
Decision: If ZFS isn’t loaded in initramfs, you likely shipped a broken initramfs. Boot old kernel/initramfs from GRUB and rebuild initramfs properly.
Joke #1 (short, relevant): A kernel upgrade without a rollback plan is like a parachute with “v2” in the filename—technically improved, spiritually terrifying.
Pre-reboot preflight: tasks with commands, outputs, and decisions
The rest of this piece is a set of boring checks that prevent dramatic outages. Each task includes: command, example output, what it means, and the decision you make.
Scope note: Commands below assume a Debian/Ubuntu-ish environment with OpenZFS on Linux. The logic generalizes to other distros; the packaging details don’t.
Task 1: Confirm what kernel you’re on, and what you’re about to boot
cr0x@server:~$ uname -r
6.2.0-39-generic
cr0x@server:~$ ls -1 /boot/vmlinuz-* | tail -n 3
/boot/vmlinuz-6.2.0-39-generic
/boot/vmlinuz-6.5.0-28-generic
/boot/vmlinuz-6.5.0-31-generic
What it means: You have multiple kernels installed; good. Whichever is newest will usually become the GRUB default, depending on configuration.
Decision: If you only have one kernel installed, stop. Install at least one known-good previous kernel before you gamble with reboot.
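If you need to add a fallback kernel, the package names below follow the Ubuntu pattern and reuse the versions from this example; substitute whatever your archive actually ships:

cr0x@server:~$ sudo apt install linux-image-6.2.0-39-generic linux-headers-6.2.0-39-generic
cr0x@server:~$ sudo apt-mark hold linux-image-6.2.0-39-generic

The hold keeps routine cleanup from autoremoving your escape hatch; remember to lift it once the new kernel has proven itself.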
Task 2: Verify ZFS userspace version and kernel module version alignment
cr0x@server:~$ zfs --version
zfs-2.2.3-1ubuntu1
zfs-kmod-2.2.3-1ubuntu1
What it means: Userspace tools and kernel module package versions match. This does not guarantee the module is built for the new kernel, but it’s a baseline sanity check.
Decision: If userspace and kmod versions differ widely, expect surprises. Align them via your package manager before reboot.
Task 3: Check DKMS status for ZFS against installed kernels
cr0x@server:~$ dkms status | grep -i zfs
zfs/2.2.3, 6.2.0-39-generic, x86_64: installed
zfs/2.2.3, 6.5.0-31-generic, x86_64: installed
What it means: DKMS built ZFS for both current and target kernels.
Decision: If the target kernel is missing here, do not reboot yet. Fix DKMS build first, while you still have a working system to do it on.
Task 4: Force a build test for the new kernel (because DKMS sometimes lies by omission)
cr0x@server:~$ sudo dkms autoinstall -k 6.5.0-31-generic
Sign command: /lib/modules/6.5.0-31-generic/build/scripts/sign-file
Signing key: /var/lib/shim-signed/mok/MOK.priv
Public certificate (MOK): /var/lib/shim-signed/mok/MOK.der
zfs.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/6.5.0-31-generic/updates/dkms/
depmod...
What it means: Build succeeded; module installed; depmod ran.
Decision: If you see compiler errors, missing headers, or “bad return status,” treat that as a hard stop. Rebooting will turn a build error into an outage.
Task 5: Confirm headers exist for the new kernel
cr0x@server:~$ dpkg -l | grep -E 'linux-headers-6.5.0-31-generic'
ii linux-headers-6.5.0-31-generic 6.5.0-31.31~22.04.1 amd64 Linux kernel headers for version 6.5.0 on 64 bit x86 SMP
What it means: Headers are installed, so DKMS has a fighting chance.
Decision: If headers aren’t installed, install them now and rerun the DKMS build check.
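The fix is one package plus a re-run of the build check, using the header package name shown above:

cr0x@server:~$ sudo apt install linux-headers-6.5.0-31-generic
cr0x@server:~$ sudo dkms autoinstall -k 6.5.0-31-generic
cr0x@server:~$ dkms status | grep -i zfs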
Task 6: Check Secure Boot module signing (if applicable)
cr0x@server:~$ mokutil --sb-state
SecureBoot enabled
cr0x@server:~$ sudo modprobe zfs
modprobe: ERROR: could not insert 'zfs': Key was rejected by service
What it means: The module exists but can’t be loaded because it isn’t signed with a key trusted by the system (MOK/UEFI).
Decision: Either sign the ZFS module appropriately and enroll the key, or disable Secure Boot if your environment allows it. “We’ll deal with it after reboot” is not a plan; it’s a trap.
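A sketch of checking the signing chain before reboot, reusing the MOK paths from the DKMS output in Task 4; --import prompts for a one-time password and the enrollment completes at the next boot:

cr0x@server:~$ modinfo -F signer zfs
cr0x@server:~$ mokutil --test-key /var/lib/shim-signed/mok/MOK.der
cr0x@server:~$ sudo mokutil --import /var/lib/shim-signed/mok/MOK.der

If modinfo prints no signer at all, the module is unsigned and enrolling a key will not help; fix the DKMS signing configuration first, then rebuild.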
Task 7: Verify pool health and error counters (a reboot is not a healing ceremony)
cr0x@server:~$ sudo zpool status -x
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error.
cr0x@server:~$ sudo zpool status tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
scan: scrub repaired 0B in 00:17:21 with 0 errors on Thu Dec 12 03:27:01 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda3 ONLINE 0 0 0
sdb3 ONLINE 1 0 0
errors: No known data errors
What it means: Even if the pool is ONLINE, you have at least one read error on a device. That’s a yellow light.
Decision: If any device shows rising READ/WRITE/CKSUM errors, investigate before reboot. Reboots are when marginal drives decide to become performance art.
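To decide whether that single read error is a dying disk or a one-off, look at the drive itself. A sketch, assuming smartmontools is installed and the affected member really is /dev/sdb:

cr0x@server:~$ sudo zpool status -v tank
cr0x@server:~$ sudo smartctl -H /dev/sdb
cr0x@server:~$ sudo smartctl -A /dev/sdb | grep -Ei 'realloc|pending|uncorrect'

Rising reallocated or pending sector counts mean you replace the drive before the window, not after.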
Task 8: Scrub status and timing: did you recently validate the pool?
cr0x@server:~$ sudo zpool get -H -o name,value,source autotrim tank
tank on local
cr0x@server:~$ sudo zpool status tank | sed -n '1,12p'
pool: tank
state: ONLINE
scan: scrub repaired 0B in 00:17:21 with 0 errors on Thu Dec 12 03:27:01 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
What it means: You have a recent scrub with 0 errors, which is the closest thing to “known good” ZFS offers.
Decision: If the last scrub is ancient and this machine matters, run one ahead of the maintenance window. If you don’t have time for a full scrub, at least accept you’re rebooting blind.
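Starting and watching a scrub ahead of the window is cheap to script; zpool wait needs OpenZFS 2.0 or newer, otherwise poll zpool status:

cr0x@server:~$ sudo zpool scrub tank
cr0x@server:~$ zpool status tank | grep scan
cr0x@server:~$ sudo zpool wait -t scrub tank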
Task 9: Check feature flags to avoid “works here, fails in rescue”
cr0x@server:~$ sudo zpool get -H -o name,property,value all tank | grep -E 'feature@|compatibility' | head -n 12
tank feature@async_destroy enabled
tank feature@spacemap_histogram enabled
tank feature@extensible_dataset enabled
tank feature@bookmarks enabled
tank feature@embedded_data active
tank feature@device_removal enabled
tank feature@obsolete_counts enabled
tank feature@zstd_compress active
What it means: Some features are “active” (in use). A rescue environment with older OpenZFS might refuse to import, or import read-only with warnings.
Decision: Before upgrades, avoid enabling new pool features casually. If you must, make sure your rescue media and rollback hosts can import the pool.
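On OpenZFS 2.1 and newer, the compatibility pool property makes that policy enforceable instead of aspirational. A sketch; the feature-set name should be one of the files shipped under /usr/share/zfs/compatibility.d on your system:

cr0x@server:~$ zpool get compatibility tank
cr0x@server:~$ ls /usr/share/zfs/compatibility.d | head -n 5
cr0x@server:~$ sudo zpool set compatibility=openzfs-2.1-linux tank

Setting it does not undo features that are already active; it stops zpool upgrade from enabling anything newer than your rescue media understands.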
Task 10: Validate that ZFS services will start cleanly at boot
cr0x@server:~$ systemctl status zfs-import-cache.service --no-pager
● zfs-import-cache.service - Import ZFS pools by cache file
Loaded: loaded (/lib/systemd/system/zfs-import-cache.service; enabled)
Active: active (exited) since Thu 2025-12-26 01:12:09 UTC; 2h 11min ago
cr0x@server:~$ systemctl status zfs-mount.service --no-pager
● zfs-mount.service - Mount ZFS filesystems
Loaded: loaded (/lib/systemd/system/zfs-mount.service; enabled)
Active: active (exited) since Thu 2025-12-26 01:12:11 UTC; 2h 11min ago
What it means: Your current boot path is sane: pools imported, datasets mounted.
Decision: If these are disabled or failing, fix them before changing the kernel. Kernel upgrades are not the time to discover your ZFS import method is “someone manually does it.”
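Enabling the stock import/mount chain, assuming the standard OpenZFS systemd units are installed:

cr0x@server:~$ sudo systemctl enable zfs-import-cache.service zfs-mount.service zfs-import.target zfs.target
cr0x@server:~$ systemctl list-dependencies zfs.target --no-pager | head -n 10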
Task 11: Confirm your import method: cachefile vs by-id paths
cr0x@server:~$ sudo zpool get cachefile tank
NAME PROPERTY VALUE SOURCE
tank cachefile /etc/zfs/zpool.cache local
cr0x@server:~$ ls -l /etc/zfs/zpool.cache
-rw-r--r-- 1 root root 2270 Dec 26 01:12 /etc/zfs/zpool.cache
What it means: The system uses a cache file to speed imports and preserve device mapping. Good—if it’s kept current.
Decision: If you move disks/HBAs or clone VMs, regenerate the cache file intentionally. Stale cache files can cause confusing import behavior at boot.
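Regenerating the cache file is a single property set per pool; do it after any hardware move or any manual import with -d. A sketch:

cr0x@server:~$ sudo zpool set cachefile=/etc/zfs/zpool.cache tank
cr0x@server:~$ ls -l /etc/zfs/zpool.cache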
Task 12: Inspect initramfs contents for ZFS modules and scripts (root-on-ZFS especially)
cr0x@server:~$ lsinitramfs /boot/initrd.img-6.5.0-31-generic | grep -E 'zfs|spl|zpool' | head -n 12
usr/sbin/zpool
usr/sbin/zfs
scripts/zfs
lib/modules/6.5.0-31-generic/updates/dkms/zfs.ko
lib/modules/6.5.0-31-generic/updates/dkms/spl.ko
What it means: The initramfs includes ZFS userspace tools, scripts, and modules for that kernel.
Decision: If ZFS bits are missing from the initramfs, rebuild it now and re-check. Otherwise you’re about to boot a kernel that can’t mount the filesystems you care about.
Task 13: Rebuild initramfs after any ZFS/kernel module changes
cr0x@server:~$ sudo update-initramfs -u -k 6.5.0-31-generic
update-initramfs: Generating /boot/initrd.img-6.5.0-31-generic
What it means: initramfs regenerated for the target kernel.
Decision: Always regenerate initramfs after DKMS builds or ZFS package updates on root-on-ZFS systems. Treat it like part of the module install, not an optional extra.
Task 14: Confirm GRUB has entries for at least one previous kernel
cr0x@server:~$ grep -n "menuentry 'Ubuntu" -n /boot/grub/grub.cfg | head -n 6
148:menuentry 'Ubuntu' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-11111111-2222-3333-4444-555555555555' {
166:menuentry 'Ubuntu, with Linux 6.5.0-31-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.5.0-31-generic-advanced-11111111-2222-3333-4444-555555555555' {
184:menuentry 'Ubuntu, with Linux 6.5.0-28-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.5.0-28-generic-advanced-11111111-2222-3333-4444-555555555555' {
What it means: Boot menu includes multiple kernels.
Decision: If older kernels aren’t available in GRUB, add them now. During an outage, your fingers will forget how to type “chroot.” GRUB selection is the faster escape hatch.
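Two GRUB habits that pay off on Debian/Ubuntu: regenerate the menu after kernel changes, and use a one-shot default so a suspect kernel never becomes permanent. grub-reboot requires GRUB_DEFAULT=saved in /etc/default/grub, and the menu entry title below is only an example; copy yours verbatim from grub.cfg:

cr0x@server:~$ sudo update-grub
cr0x@server:~$ sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 6.5.0-31-generic"
cr0x@server:~$ sudo reboot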
Task 15: Capture “known good” system state for later comparison
cr0x@server:~$ sudo zpool status > /root/pre-reboot-zpool-status.txt
cr0x@server:~$ sudo zfs list -o name,used,avail,refer,mountpoint > /root/pre-reboot-zfs-list.txt
What it means: You now have before/after snapshots of reality.
Decision: Always capture this on systems you’ll troubleshoot later. It turns vague “it was fine yesterday” into evidence.
Task 16: Validate that your root dataset and boot pool properties are sane
cr0x@server:~$ sudo zfs get -o name,property,value,source -H mountpoint,canmount,readonly -r rpool/ROOT | head -n 10
rpool/ROOT mountpoint / local
rpool/ROOT canmount off local
rpool/ROOT/ubuntu mountpoint / local
rpool/ROOT/ubuntu canmount noauto local
rpool/ROOT/ubuntu readonly off default
What it means: You’re using a common root-on-ZFS pattern: a parent dataset with canmount=off and a child root dataset.
Decision: If mountpoints or canmount flags are inconsistent, fix them before reboot. Kernel upgrades don’t break these, but they expose boot ordering issues that were already lurking.
Task 17: Confirm hostid stability (avoids import confusion on shared storage or clones)
cr0x@server:~$ hostid
7f000001
cr0x@server:~$ ls -l /etc/hostid
-rw-r--r-- 1 root root 4 Mar  4  2024 /etc/hostid
What it means: hostid exists and is stable.
Decision: If you’re cloning VMs that share pools (or might see the same disks), ensure unique hostid. Kernel upgrades are often when cloned templates become “real servers,” and ZFS remembers.
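On a clone that needs its own identity, regenerating the hostid is one command. A sketch, with an export/import added because the pool only records the new hostid when it is next imported:

cr0x@server:~$ sudo zgenhostid -f
cr0x@server:~$ hostid
cr0x@server:~$ sudo zpool export tank && sudo zpool import tank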
Task 18: Make a targeted snapshot before touching kernels (for configs and quick rollbacks)
cr0x@server:~$ sudo zfs snapshot -r rpool/ROOT/ubuntu@pre-kernel-upgrade
cr0x@server:~$ sudo zfs list -t snapshot | grep pre-kernel-upgrade | head -n 3
rpool/ROOT/ubuntu@pre-kernel-upgrade 0B - 6.21G -
What it means: You created a snapshot of the root dataset. On systems with proper boot environments, this can become a fast rollback.
Decision: If you can’t snapshot root (weird layout, policy), at least back up /etc and boot config. “We have backups” is not specific enough during an outage.
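If the snapshot exists, the rollback itself is a single command; for a live root dataset you would normally run it from a rescue environment or via your boot-environment tooling rather than from the system you are rolling back:

cr0x@server:~$ sudo zfs rollback rpool/ROOT/ubuntu@pre-kernel-upgrade

Rollback only targets the most recent snapshot unless you pass -r, which destroys anything newer; decide that on a calm day, not mid-outage.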
Task 19: Check free space (yes, it matters for upgrades and for ZFS behavior)
cr0x@server:~$ sudo zfs list -o name,used,avail,usedbysnapshots,mountpoint tank
NAME USED AVAIL USEDSNAP MOUNTPOINT
tank 18.2T 1.14T 3.55T /tank
What it means: You have headroom. ZFS doesn’t like living near 100% full; metadata allocations and performance get ugly.
Decision: If the pool is tight, do cleanup first. Kernel upgrades are not a great time to also discover your pool is out of space and can’t commit transactions smoothly.
Task 20: Dry-run your remote access and console fallback
cr0x@server:~$ sudo systemctl is-active ssh
active
cr0x@server:~$ sudo systemctl is-enabled serial-getty@ttyS0.service
enabled
What it means: SSH is up; serial console is configured (if you have it). This is a recovery affordance, not a luxury.
Decision: If you don’t have out-of-band console access, be extra conservative: keep older kernels, rehearse rollback, and schedule reboot windows when humans can reach the box.
Three corporate mini-stories (all true in spirit)
Mini-story 1: The incident caused by a wrong assumption
They ran a small fleet of analytics nodes. Nothing fancy: a bunch of Linux servers, each with a mirrored boot pair and a big ZFS pool for data. The team treated kernel updates like routine hygiene—because for years, they were.
One week, the distro shipped a new kernel line and the CI pipeline started pushing it. The upgrade looked clean. Packages installed, DKMS printed reassuring messages, and the maintenance window was short. They rebooted, one node at a time.
The first node came back… into initramfs. It could see disks, but couldn’t import the pool. Panic followed the classic route: someone said “ZFS is broken,” someone else said “it’s the HBA,” and a third person started searching old runbooks that assumed ext4.
The real problem was simpler and more embarrassing: the ZFS module had built for the new kernel, but Secure Boot rejected it. Nobody had tested module loading on a Secure Boot-enabled host after the key rotation. The “wrong assumption” wasn’t technical; it was procedural: they assumed the environment was unchanged because the servers were unchanged.
Recovery took longer than it should have, because they had to boot an older kernel, enroll the right key, and then re-run the upgrade properly. The takeaway that stuck: add a modprobe zfs check as a gate before reboot. You can’t argue with a kernel that refuses unsigned code.
Mini-story 2: The optimization that backfired
A different org had a performance culture. Not the good kind. The kind where someone posts iostat screenshots in Slack like they’re workout selfies. They wanted faster reboot times, faster imports, faster everything.
Someone noticed that importing pools by scanning devices was “slow,” so they enforced cachefile imports everywhere and disabled anything that looked like extra work. They also trimmed initramfs to reduce size, because smaller initramfs must be faster. This is how logic works in conference rooms.
Months later, they upgraded kernels on a storage node. The upgrade itself was fine. But the reboot changed device enumeration order—an HBA firmware update had also been applied earlier. The cachefile still referenced old /dev names. Early boot tried to import using stale mapping, and because they’d trimmed scripts, the fallback scan behavior wasn’t there.
The server didn’t import pools automatically. An operator manually imported with -d /dev/disk/by-id, things came up, and everyone agreed it was “weird.” The weirdness repeated on the next node. The “optimization” had removed resilience.
They fixed it by treating import paths like configuration drift: either rely on stable by-id paths and keep fallback scanning enabled, or maintain the cachefile intentionally and regenerate it when hardware changes. “Faster boot” is not a KPI worth a pager.
Mini-story 3: The boring but correct practice that saved the day
Large corporate data platform. Several environments. Lots of stakeholders. The storage team was not loved, but they were respected—the difference is usually paperwork.
They had a checklist for ZFS kernel upgrades. It wasn’t fancy. It was rigid. It included “verify DKMS status for the target kernel,” “confirm initramfs contains zfs.ko,” and “confirm GRUB has an older kernel entry.” Nobody got to skip it because they were “in a hurry.”
One night, a kernel update landed that broke ZFS DKMS builds due to a toolchain mismatch on that image. Their pipeline caught it because the preflight included dkms autoinstall -k as a validation step. The build failed loudly while the system was still up, and the node never rebooted into a broken state.
The fix was mundane: install the right headers and toolchain bits, then rebuild. The heroic part was that nobody needed to be heroic. The cluster stayed healthy, the maintenance window stayed boring, and the morning status update was just “completed successfully.”
That team’s secret weapon was not expertise. It was insisting on the boring checks, every time, especially when things seemed easy.
Checklists / step-by-step plan
Use this as a runbook. Don’t “adapt it live.” Adapt it once, on a calm day, and then follow it.
Step-by-step plan: the safe reboot sequence
- Pick a rollback kernel. Confirm an older kernel is installed and selectable in GRUB. If you can’t point to it, you don’t have one.
- Check pool health. zpool status -x must be clean or understood. Any active resilver, missing device, or escalating error counters should delay the reboot.
- Check recent scrub. If the system is important and the last scrub is old, run one ahead of the change window.
- Install kernel + headers. Don’t separate them. Kernel without headers is DKMS roulette.
- Confirm DKMS builds for the target kernel. Use dkms status and then force dkms autoinstall -k.
- Test module load on the current system (if possible). You can’t load a module built for a different kernel, but you can catch Secure Boot/key issues by validating your signing flow and current module behavior.
- Regenerate initramfs for the target kernel. Then inspect it with lsinitramfs to confirm ZFS content exists.
- Snapshot root datasets. Especially if you have a boot environment mechanism. Snapshots are cheap; outages are not.
- Capture pre-state outputs. Save zpool status, zfs list, and journalctl -b around previous boots if you’re chasing intermittent issues.
- Reboot one node, observe, then proceed. If you have multiple machines, do a canary. If you don’t, your canary is “you, being careful.”
Minimal “I have five minutes” checklist (use only when the blast radius is small)
- zpool status -x is clean
- dkms status | grep zfs shows the target kernel
- lsinitramfs /boot/initrd.img-<target> | grep zfs.ko finds the module
- GRUB has an older kernel entry
- You have console access
Joke #2 (short, relevant): If your rollback plan is “we’ll SSH in and fix it,” your rollback plan is also “we believe in magic.”
Common mistakes: symptom → root cause → fix
1) Boot drops to initramfs, root pool won’t import
Symptom: You get an initramfs shell. Messages mention missing root filesystem, ZFS import failure, or “cannot mount rpool.”
Root cause: initramfs missing ZFS modules/scripts for the new kernel, or ZFS module failed to load (Secure Boot, ABI mismatch).
Fix: Boot an older kernel from GRUB. Rebuild DKMS for the target kernel, run update-initramfs -u -k <target>, verify with lsinitramfs, then reboot again.
2) ZFS module exists but won’t load: “Unknown symbol”
Symptom: modprobe zfs fails; dmesg shows unknown symbols or module_layout mismatch.
Root cause: Module built against a different kernel, stale modules not replaced, or partial upgrade left inconsistent files under /lib/modules.
Fix: Rebuild ZFS for the exact kernel: dkms autoinstall -k $(uname -r) (or install the right kmod). Run depmod -a. If packaging is broken, reinstall ZFS packages cleanly.
3) Module load fails: “Key was rejected by service”
Symptom: ZFS module won’t load on Secure Boot systems.
Root cause: Unsigned module or untrusted signing key.
Fix: Enroll a Machine Owner Key (MOK) and ensure DKMS signs the module, or disable Secure Boot (policy permitting). Validate in advance by testing module loading after rebuild.
4) Pool imports manually but won’t auto-import at boot
Symptom: After reboot, services are up but datasets are missing until someone runs zpool import.
Root cause: zfs-import services disabled, cachefile missing/stale, or import method not matching environment (e.g., device names changed).
Fix: Enable import services, regenerate /etc/zfs/zpool.cache intentionally by exporting/importing, and prefer stable device paths like /dev/disk/by-id for imports in unusual environments.
5) Pool won’t import in rescue environment, but imports fine on the host
Symptom: Rescue ISO can’t import; host can.
Root cause: Pool feature flags in use are newer than the rescue environment supports.
Fix: Keep rescue environments updated to match your pool features. Avoid enabling new features unless you’ve verified your recovery tooling can handle them.
6) Reboot succeeds, but performance tanks and latency spikes
Symptom: After kernel upgrade, I/O latency increases; ARC behavior changes; CPU usage jumps.
Root cause: New kernel/ZFS module behavior interacts differently with memory limits, cgroup constraints, or tunables. Sometimes the change is real; sometimes your monitoring labels moved.
Fix: Compare /proc/spl/kstat/zfs/arcstats before/after, check memory limits, and revert tunable experiments. Don’t “optimize” during the incident; stabilize first.
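A crude but effective before/after comparison, assuming you capture the same arcstats fields pre-reboot the way Task 15 captures pool state; the filenames here are examples:

cr0x@server:~$ awk '/^(size|c_max|hits|misses) / {print $1, $3}' /proc/spl/kstat/zfs/arcstats > /root/post-reboot-arcstats.txt
cr0x@server:~$ diff /root/pre-reboot-arcstats.txt /root/post-reboot-arcstats.txt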
FAQ
1) Do I really need to rebuild initramfs if DKMS says ZFS is installed?
On root-on-ZFS or any system that imports pools during early boot: yes. DKMS building a module doesn’t guarantee initramfs contains it. Verify with lsinitramfs.
2) What’s the single best indicator that a reboot will succeed?
For root-on-ZFS: the target initramfs includes ZFS modules and scripts, and you can select a previous kernel in GRUB. For non-root pools: DKMS has built for the target kernel and pools are healthy.
3) Should I enable new pool feature flags during routine maintenance?
Not as a casual “why not.” Feature flags are mostly one-way. Enable them when you have a reason, and after confirming your recovery environments can import the pool.
4) Is it safer to use distro prebuilt ZFS kmods or DKMS?
Prebuilt kmods can be more predictable if your distro maintains them well. DKMS is flexible but adds a build step that can fail at the worst time. Pick one based on your environment’s reliability, not ideology.
5) How do I know if my pool is likely to fail import after reboot?
If zpool status shows missing devices, ongoing resilvering, checksum errors that keep increasing, or you’ve had recent power loss issues, delay reboot and stabilize first.
6) What if I run ZFS only for a data pool, not root?
Your system will boot even if ZFS is broken, which is nice. But your applications won’t see their data. The checklist still applies: verify module build/load and verify auto-import services.
7) Why do device names changing matter if ZFS uses GUIDs?
ZFS does track devices by GUID, but import behavior and cachefile entries can still depend on paths. In early boot or constrained rescue environments, stable paths reduce surprises.
8) Can I just keep the old kernel forever and avoid this problem?
You can, until you can’t—security fixes, hardware enablement, and vendor support windows will eventually corner you. The sustainable move is to make upgrades boring, not to avoid them.
9) What’s the fastest rollback if the new kernel can’t import pools?
Reboot and select the previous kernel from GRUB. That’s why you keep it installed and tested. If you can’t reach GRUB, you needed console access yesterday.
10) Should I export pools before reboot?
Usually no for single-host local disks; it can increase risk if services expect mounts during shutdown. But on shared storage or complex multipath setups, a controlled export can reduce “pool was in use” ambiguity. Make it a deliberate choice, not a superstition.
Conclusion: practical next steps
A ZFS kernel upgrade goes wrong when you treat it like an OS patch instead of a storage stack change. The survival strategy is consistent:
- Prove the ZFS module exists for the target kernel (DKMS status plus forced build).
- Prove initramfs contains what boot needs (inspect, don’t assume).
- Prove your pools are healthy enough to survive stress (status, scrub recency, error counters).
- Prove you can roll back without heroics (GRUB entries, console, snapshots).
Next time you schedule a reboot window, do the preflight tasks in order, save the outputs, and treat any red flag as a stop. The goal isn’t courage. It’s paperwork-backed confidence.