Proxmox Won’t Boot After a Kernel Update: Roll Back via GRUB the Right Way

There are few feelings like watching a Proxmox host reboot after a “routine” kernel update… and never come back. No web UI. No SSH. Just a console that stares back like it’s waiting for you to admit you didn’t test this anywhere.

This is the practical way out: roll back to a known-good kernel via GRUB, boot cleanly, then fix the actual problem so you’re not one reboot away from reliving the same incident.

Fast diagnosis playbook

When a Proxmox node won’t boot after a kernel update, your job is to get signal fast and avoid “fixes” that burn evidence (or break your ZFS pool, or your bootloader, or your mood). Here’s a triage order that works under pressure.

First: identify what kind of failure you have (seconds, not minutes)

  • No GRUB menu / straight to firmware: likely EFI boot entry/ESP issues, or boot order changed.
  • GRUB appears, kernel selection available: good. Roll back to the previous kernel first.
  • Kernel loads then panic / hangs: usually initramfs missing modules, DKMS driver mismatch, or storage root not found.
  • Boots to emergency shell: root filesystem mount failure, ZFS import failure, or wrong kernel parameters.
  • Black screen after GRUB: often GPU/console handoff or framebuffer issues; still might be booting—check for disk activity or ping from another box.

Second: choose the least invasive recovery path

  1. Use GRUB to boot an older kernel (fast, reversible, keeps your install intact).
  2. Use GRUB edit to add a temporary parameter (e.g., debug or nomodeset) if you must.
  3. Boot rescue media only if GRUB cannot boot anything or your ESP is missing.

Third: decide what you need to preserve

Before you “clean up old kernels” or reinstall anything, decide what matters:

  • Do you need to keep the broken kernel installed for later debugging? Usually yes—at least until the node is stable.
  • Are you on ZFS root? Then you treat the initramfs and ZFS module situation as a coupled system.
  • Are you running vendor DKMS modules (HBA drivers, out-of-tree NIC drivers)? Expect rebuild issues.

Operational rule: your first boot goal is stable console + stable storage + stable network. Proxmox UI comes later.

What actually broke when Proxmox won’t boot

A kernel update doesn’t just swap vmlinuz. It updates a family of tightly related pieces: kernel image, initramfs, modules under /lib/modules, DKMS-compiled drivers, bootloader menu entries, and sometimes microcode and EFI bits. Any one of those can ruin your morning.

Typical failure modes after a kernel update

  • New kernel boots, but can’t find root filesystem: missing storage driver or ZFS module not present in initramfs; wrong root=; broken initramfs generation.
  • Kernel panic during early boot: often module/ABI mismatch, corrupted initramfs, or a bad DKMS build that silently failed.
  • GRUB menu exists but only new kernel shown: old kernels removed, or /boot not updated, or GRUB config not regenerated.
  • UEFI boots to firmware / “No bootable device”: ESP not mounted when updating, or EFI boot entry lost.
  • Boot loops: watchdog reboots, a kernel panic followed by an automatic restart, or an early systemd failure that triggers an immediate reboot before you can read the message.
  • System boots but Proxmox services fail: not a boot failure, but still post-update instability (corosync, pveproxy, storage mounts).

Here’s the slightly annoying truth: GRUB rollback fixes symptoms, not causes. But it buys you a clean environment to repair the cause without doing surgery in an initramfs shell.

Interesting facts and historical context (because it helps decisions)

  1. GRUB 2 became the default in Debian years ago largely because it handles complex boot scenarios (LVM, RAID, multiple kernels) better than legacy GRUB.
  2. Proxmox rides on Debian, which means most “Proxmox boot problems” are classic Debian kernel + initramfs + bootloader problems wearing a PVE badge.
  3. Initramfs is not optional theater: it’s a small filesystem loaded into RAM to get storage and root mounted. If it’s missing the right drivers, the kernel can’t reach your root.
  4. DKMS exists to rebuild drivers across kernels. When it fails, you can get a kernel that boots but has no NIC, or worse, no storage controller driver.
  5. ZFS on Linux is out-of-tree, which means it’s sensitive to kernel ABI changes. A new kernel plus a missing ZFS module equals “root pool not found”.
  6. UEFI made booting more flexible and more fragile: the ESP, NVRAM boot entries, and multiple vendor implementations all add failure surfaces.
  7. Linux keeps multiple installed kernels by design so you can roll back. People sabotage that safety net by aggressively purging “unused” kernels.
  8. Proxmox’s kernel packaging (pve-kernel, renamed proxmox-kernel in newer releases) is tuned for virtualization and includes patches/features; mixing kernels from random repos is asking for surprises.

One dry comfort: this failure is common enough that the recovery steps are mostly boring. And boring is good in production.

Joke #1: A kernel update is like “just one quick change” in a corporate firewall—nobody believes it, but we all pretend anyway.

Rollback via GRUB (BIOS and UEFI) without making it worse

You want the old kernel because it likely matches the working initramfs and installed modules. The right rollback is: select an older kernel, boot, confirm stability, then make it the default temporarily while you fix the broken one.

Step 1: get to the GRUB menu reliably

On many systems the GRUB menu is hidden unless something goes wrong. When things are already wrong, you often still have to be quick.

  • BIOS: press and hold Shift during boot.
  • UEFI: tap Esc (or sometimes Shift) during boot.
  • Remote IPMI/iKVM: send the key early; some consoles buffer poorly.

Step 2: choose an older kernel the safe way

In GRUB:

  1. Select Advanced options for Debian GNU/Linux (on Proxmox it may still say Debian).
  2. Select the previous kernel version (usually the one below the newest).
  3. Prefer the normal entry, not “recovery mode”, unless your normal boot fails.

Which kernel should you pick? In practice: the newest kernel that used to work. If you updated from 6.5 to 6.8 and 6.8 fails, pick 6.5. Don’t jump back multiple major steps unless you have to; older kernels may not match newer userspace assumptions, and you’re here to stabilize, not time travel.

Step 3: when kernel selection isn’t enough

If the screen goes black or you suspect a graphics console issue, try a temporary parameter. In GRUB, highlight the kernel entry, press e, find the line starting with linux, append parameters, then boot with Ctrl+x or F10.

Useful temporary parameters:

  • nomodeset (video driver handoff issues; common on weird GPUs)
  • systemd.unit=multi-user.target (skip graphical targets; not usually relevant for Proxmox servers, but can help)
  • debug or loglevel=7 (more kernel log verbosity)
  • panic=30 (delay reboot on panic; gives you time to read the error)
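
For orientation, an edited linux line might end up looking like this — the kernel version and root= value are illustrative (ZFS-root systems will look different), so don’t copy it verbatim:

linux /boot/vmlinuz-6.8.12-3-pve root=/dev/mapper/pve-root ro quiet nomodeset panic=30

Edits made this way are one-shot: they apply to this boot only and vanish on the next one, which is exactly what you want during triage.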

Don’t “fix” GRUB from the GRUB editor beyond temporary parameters. If you can boot an old kernel, do the real repairs from a running system.

Step 4: make the old kernel the default (temporarily)

Once you boot successfully, pin the default boot kernel by configuring GRUB’s default menu entry (Task 11 below shows the exact commands). This prevents accidental reboots into the broken kernel while you work. You’ll do it from the OS, not from firmware menus.

After you boot: confirm, stabilize, and prevent a repeat

Booting is not “resolved.” Booting is “we can breathe again.” Now you have to learn why the new kernel failed and decide whether you’re going to repair it, hold it, or remove it.

Here’s the mental model that stops you from flailing:

  • If the new kernel panics early: inspect initramfs contents, missing modules, ZFS availability, microcode issues.
  • If it boots but has no network: DKMS module rebuild, firmware packages, renamed interfaces, or driver regression.
  • If it boots but can’t import ZFS: ZFS module mismatch, initramfs missing ZFS, or pool feature flags.
  • If GRUB/UEFI is the problem: ESP mount, grub-install (BIOS) or EFI entry recreation (UEFI), and sanity checks.

Also: if this is a clustered Proxmox environment, treat the node as “out of service” until it’s stable. A half-working node can create more damage than a powered-off one, especially if it starts flapping corosync membership or storage mounts.

Joke #2: Nothing builds team cohesion like everyone silently watching a server boot—together—like it’s a live theater performance.

One quote to keep you honest

“Hope is not a strategy.” — General Gordon R. Sullivan

In SRE terms: make the rollback deterministic, then make the forward path boring.

ZFS root and storage gotchas during kernel rollbacks

If your Proxmox host boots from ZFS, kernel updates can fail in a very specific way: the kernel boots, initramfs loads, then it can’t import the rpool because the ZFS module isn’t available (or doesn’t match the kernel). That produces errors like “cannot import ‘rpool’: no such pool available” or “ZFS: module not found,” followed by an emergency shell.

Why ZFS makes this spicier

  • ZFS is not built into the Linux kernel tree; it’s a separate module. That means it has to be built for each kernel.
  • On ZFS root, the initramfs must include ZFS bits so it can import the pool early in boot.
  • Proxmox packages normally handle this, but partial upgrades, broken DKMS builds, or an unmounted ESP can create “kernel updated, initramfs not really updated” scenarios.

Safe approach

Boot the last known-good kernel. Confirm the pool imports and datasets mount. Then fix the broken kernel by rebuilding initramfs and ensuring ZFS modules exist for it. Only after that do you let the system boot into the newer kernel again.

If you’re on ZFS root and you purge old kernels aggressively, you can trap yourself: the only kernel left is the one that needs ZFS modules that never got built. That’s how you turn a 10-minute rollback into a “boot from ISO and chroot” afternoon.

Practical tasks (commands, output meanings, decisions)

Below are real tasks you can run after you manage to boot (usually via an older kernel). Each one includes: command, what the output means, and the decision you make.

Task 1: confirm which kernel you actually booted

cr0x@server:~$ uname -r
6.5.13-5-pve

Meaning: You’re currently running 6.5.13-5-pve.

Decision: If this is the known-good kernel, keep it as your baseline while you repair the new one. If it’s the new kernel, you didn’t roll back successfully.

Task 2: list installed Proxmox kernels

cr0x@server:~$ dpkg -l 'pve-kernel-*' 'proxmox-kernel-*' 2>/dev/null | awk '/^ii/ {print $2, $3}'
pve-kernel-6.5.13-5-pve 6.5.13-5
pve-kernel-6.8.12-3-pve 6.8.12-3

Meaning: Both the old and new kernels are installed. (Kernel package naming depends on the release: older installs use pve-kernel-*, newer ones proxmox-kernel-*; the two globs cover both.)

Decision: If only the broken kernel is installed, stop and consider installing a known-good kernel package before doing anything else (possibly via chroot/rescue if you can’t boot).

Task 3: inspect the last boot logs for why the new kernel failed

cr0x@server:~$ journalctl -b -1 -p err --no-pager | tail -n 30
Dec 26 10:02:11 server kernel: zfs: module verification failed: signature and/or required key missing
Dec 26 10:02:11 server kernel: ZFS: Failed to load module
Dec 26 10:02:12 server systemd[1]: Failed to mount /.
Dec 26 10:02:12 server systemd[1]: Dependency failed for Initrd Root File System.

Meaning: Previous boot (-1) had ZFS module load failure, so root mount failed.

Decision: This points at Secure Boot/module signing, DKMS build failure, or initramfs missing ZFS. You’ll troubleshoot those specifically, not randomly reinstall GRUB.

Task 4: check if Secure Boot is in play

cr0x@server:~$ mokutil --sb-state
SecureBoot enabled

Meaning: UEFI Secure Boot is enabled.

Decision: If out-of-tree modules (ZFS, vendor drivers) aren’t signed properly, they may not load. Either enroll keys/sign modules or disable Secure Boot (policy decision). Don’t “just disable” if your org relies on it.

Task 5: verify initramfs images exist for each kernel

cr0x@server:~$ ls -lh /boot | grep -E 'vmlinuz|initrd.img' | tail -n 10
-rw-r--r-- 1 root root  78M Dec 25 21:14 initrd.img-6.5.13-5-pve
-rw-r--r-- 1 root root  82M Dec 25 21:16 initrd.img-6.8.12-3-pve
-rw-r--r-- 1 root root 8.6M Dec 25 21:14 vmlinuz-6.5.13-5-pve
-rw-r--r-- 1 root root 9.1M Dec 25 21:16 vmlinuz-6.8.12-3-pve

Meaning: Both kernels have initramfs and vmlinuz present.

Decision: Presence doesn’t guarantee correctness, but absence is an immediate smoking gun: regenerate initramfs and ensure /boot is properly mounted.

Task 6: confirm /boot and EFI System Partition are mounted (UEFI systems)

cr0x@server:~$ findmnt -no TARGET,SOURCE,FSTYPE /boot /boot/efi
/boot /dev/sda2 ext4
/boot/efi /dev/sda1 vfat

Meaning: Both /boot and /boot/efi are mounted.

Decision: If /boot/efi is not mounted during kernel updates, GRUB EFI binaries and NVRAM entries can drift out of sync. Fix mounts before re-running update-grub or reinstalling bootloader pieces.
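
If the ESP turns out to be unmounted, the least invasive fix — assuming /etc/fstab already has a correct entry for it — is simply to mount it and re-check before touching anything else:

cr0x@server:~$ sudo mount /boot/efi
cr0x@server:~$ findmnt -no TARGET,SOURCE,FSTYPE /boot/efi
/boot/efi /dev/sda1 vfat

If there is no fstab entry at all, add one keyed by the ESP’s UUID first, so the next kernel update doesn’t scatter boot files into the wrong place.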

Task 7: regenerate initramfs for the broken kernel

cr0x@server:~$ sudo update-initramfs -u -k 6.8.12-3-pve
update-initramfs: Generating /boot/initrd.img-6.8.12-3-pve
Running hook script 'zz-proxmox-boot'..

Meaning: Initramfs regenerated for that exact kernel.

Decision: If this fails with missing modules or hook errors, stop and fix those errors before rebooting. A “successful” initramfs generation is necessary but not sufficient; validate module presence next.

Task 8: check ZFS module availability for the target kernel

cr0x@server:~$ ls /lib/modules/6.8.12-3-pve/updates/dkms/ 2>/dev/null | head
zfs.ko
zcommon.ko
znvpair.ko
zunicode.ko
zavl.ko

Meaning: DKMS-built ZFS modules exist for the new kernel.

Decision: If this directory is missing or empty, ZFS didn’t build for the new kernel. You’ll look at DKMS status and build logs.

Task 9: inspect DKMS status and rebuild if needed

cr0x@server:~$ dkms status
zfs/2.2.4, 6.5.13-5-pve, x86_64: installed
zfs/2.2.4, 6.8.12-3-pve, x86_64: built

Meaning: For the new kernel, ZFS is only “built” not “installed” (or may be missing entirely).

Decision: Install it for that kernel (or rebuild). A “built” module may not be placed into the right tree.

cr0x@server:~$ sudo dkms install zfs/2.2.4 -k 6.8.12-3-pve
Installing to /lib/modules/6.8.12-3-pve/updates/dkms/
depmod...

Meaning: DKMS installed the ZFS module into the new kernel’s module directory and ran depmod.

Decision: Regenerate initramfs again afterward so the initramfs contains the needed bits.

Task 10: rebuild GRUB menu entries

cr0x@server:~$ sudo update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.8.12-3-pve
Found initrd image: /boot/initrd.img-6.8.12-3-pve
Found linux image: /boot/vmlinuz-6.5.13-5-pve
Found initrd image: /boot/initrd.img-6.5.13-5-pve
done

Meaning: GRUB config now contains both kernels.

Decision: If the broken kernel wasn’t found, you either don’t have it installed, or /boot wasn’t correct. Don’t reboot into the unknown.

Task 11: set the default kernel entry (temporary pin)

First, list menu entries with indices:

cr0x@server:~$ grep -n "menuentry '" /boot/grub/grub.cfg | head -n 8
105:menuentry 'Debian GNU/Linux' --class debian --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-11111111-2222-3333-4444-555555555555' {
153:menuentry 'Advanced options for Debian GNU/Linux' --class debian --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-advanced-11111111-2222-3333-4444-555555555555' {
161:menuentry 'Debian GNU/Linux, with Linux 6.8.12-3-pve' --class debian --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.8.12-3-pve-advanced-11111111-2222-3333-4444-555555555555' {
199:menuentry 'Debian GNU/Linux, with Linux 6.5.13-5-pve' --class debian --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.5.13-5-pve-advanced-11111111-2222-3333-4444-555555555555' {

Meaning: You can reference a menuentry by its title or by its ID; entries nested under the Advanced options submenu must be addressed as submenu>entry. Titles can change between updates; IDs are more stable.

Decision: Use GRUB’s “saved” mechanism for predictable boots, then set the saved entry to the known-good kernel.

cr0x@server:~$ sudo sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT=saved/' /etc/default/grub
cr0x@server:~$ sudo grub-set-default 'gnulinux-advanced-11111111-2222-3333-4444-555555555555>gnulinux-6.5.13-5-pve-advanced-11111111-2222-3333-4444-555555555555'
cr0x@server:~$ sudo update-grub
Generating grub configuration file ...
done

Meaning: GRUB will default to the 6.5 kernel until you change it.

Decision: Keep the pin until the new kernel is verified in a controlled reboot window. On ZFS-root hosts whose boot is managed by proxmox-boot-tool, proxmox-boot-tool kernel pin <version> is the more idiomatic way to achieve the same thing.

Task 12: verify the booted services and cluster health (if applicable)

cr0x@server:~$ systemctl --failed --no-pager
UNIT                     LOAD   ACTIVE SUB    DESCRIPTION
0 loaded units listed.

Meaning: No failed systemd units.

Decision: If you see failed units related to storage, networking, or pve services, you fix those before reintroducing this node to production workloads.

cr0x@server:~$ pvecm status 2>/dev/null | grep -E 'Quorate|Nodes|Name'
Name:             prod-cluster
Nodes:            3
Quorate:          Yes

Meaning: The cluster is quorate and recognizes nodes.

Decision: If the node is isolated or causing membership instability, keep it out and investigate network, corosync config, and time sync before proceeding.

Task 13: confirm ZFS pool health (ZFS root or data pools)

cr0x@server:~$ zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:09:12 with 0 errors on Sun Dec 22 02:10:12 2025
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda3    ONLINE       0     0     0
            sdb3    ONLINE       0     0     0

errors: No known data errors

Meaning: Storage is healthy.

Decision: If pools are degraded or not imported, don’t blame the kernel update blindly. You might have discovered a real disk/controller issue that merely surfaced during reboot.

Task 14: check EFI boot entries if firmware drops you to setup

cr0x@server:~$ sudo efibootmgr -v | head -n 20
BootCurrent: 0003
Timeout: 1 seconds
BootOrder: 0003,0001,0002
Boot0001* UEFI: Built-in EFI Shell
Boot0002* UEFI PXE IPv4
Boot0003* debian	HD(1,GPT,aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee,0x800,0x32000)/File(\EFI\debian\grubx64.efi)

Meaning: There is a valid Debian/GRUB EFI entry pointing to \EFI\debian\grubx64.efi.

Decision: If the Debian entry is missing or points to a wrong disk, you’ll recreate it (carefully) after confirming the ESP content.

Task 15: validate the ESP contents (UEFI)

cr0x@server:~$ sudo ls -R /boot/efi/EFI | head -n 40
/boot/efi/EFI:
BOOT
debian
proxmox

/boot/efi/EFI/debian:
grub.cfg
grubx64.efi
shimx64.efi

Meaning: EFI files exist. The presence of shimx64.efi often indicates Secure Boot support.

Decision: If this directory is empty or missing, your ESP wasn’t mounted during updates or was overwritten. You’ll need to reinstall GRUB EFI binaries and potentially recreate NVRAM entries.

Task 16: check free space on /boot (quietly deadly)

cr0x@server:~$ df -h /boot
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       512M  489M   23M  96% /boot

Meaning: /boot is nearly full.

Decision: A full /boot can lead to incomplete initramfs writes or missing images. You may need to remove old kernels—but only after verifying you have at least one known-good fallback installed and bootable.
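
If you do reclaim space, remove a specific old kernel through the package manager so the GRUB and initramfs hooks run — never rm files in /boot by hand. A hedged example; the version shown is illustrative, and on newer releases the packages are named proxmox-kernel-* rather than pve-kernel-*:

cr0x@server:~$ sudo apt remove pve-kernel-6.2.16-20-pve
cr0x@server:~$ df -h /boot

Keep at least one known-good, currently bootable kernel besides the one you are running before deleting anything.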

Three corporate mini-stories from the trenches

1) The incident caused by a wrong assumption

The node was a perfectly normal Proxmox hypervisor. ZFS root, a couple of NVMe mirrors, and a small handful of critical VMs that people always insisted were “non-critical” right up until they were down. The team had a habit: apply updates late afternoon, reboot, go home. It usually worked. Usually is not a strategy.

One update cycle, the node didn’t come back. IPMI showed a kernel panic early in boot. Someone assumed “GRUB must be corrupted” because that’s the story everyone remembers from 2012. They booted rescue media, reinstalled GRUB, and rebooted. Same panic. Then they reinstalled GRUB again, because repeating yourself is a recognized debugging technique in some industries.

The real issue was simpler: the initramfs for the new kernel did not contain the storage driver for the HBA that exposed the boot pool. The module existed on disk, but it never made it into initramfs due to a hook error during update—triggered by a nearly-full /boot. The old kernel booted fine. The new one could not find root. GRUB was innocent.

Once they rolled back via GRUB to the previous kernel, the fix took 15 minutes: free space in /boot, regenerate initramfs, validate module presence, then reboot into the new kernel. The lesson wasn’t “don’t update.” The lesson was “don’t assume the failure is the bootloader just because you can see the bootloader.”

2) The optimization that backfired

A different environment had a strong “keep things clean” culture. They ran a cron job that purged old kernels to keep /boot tidy and reclaim disk space. Someone had seen a warning about /boot usage once and decided the permanent fix was removing “unused” kernels. It ran everywhere. It was “standard.” Nobody remembered who wrote it.

Then a kernel update landed that introduced a driver regression for the primary NIC. The system booted, but came up without network. That would have been annoying but recoverable via console—except the DC remote hands process required network access to the management plane for escalation. Their out-of-band access was there, but it was tied into the same network segment and policy stack. Fun.

The rollback would have been trivial if there was an older kernel to select in GRUB. There wasn’t. The “optimization” had reduced the safety margin to zero. They had to boot rescue media, mount filesystems, install an older kernel package from local mirrors, rebuild initramfs, and then reboot. It worked, but it burned hours and attention across multiple teams, because the procedure had never been practiced.

The postmortem wasn’t kind to the cron job. They kept /boot under control afterward, but with a rule: keep at least two known-good kernels installed, always, and alert on /boot filling up rather than silently purging your parachute.

3) The boring but correct practice that saved the day

Same class of failure: after a kernel update, the host booted to an initramfs emergency shell with “cannot mount root.” The difference was the team’s discipline. They didn’t poke random commands. They followed a small runbook and treated the console like evidence.

First, they photographed the screen and captured the exact error. Then they rebooted and used GRUB to select the previous kernel. It booted normally. They immediately pinned the old kernel as GRUB default to prevent accidental reboot into the broken one, and they put the node into maintenance mode so workloads wouldn’t land there.

Then they did the unsexy part: checked whether /boot and /boot/efi were mounted, checked free space, checked DKMS status, rebuilt initramfs for the new kernel, and confirmed the ZFS modules existed for that kernel. Only then did they test a reboot into the new kernel during a controlled window.

Nothing heroic happened. No clever hacks. Just careful sequencing, and a bias for reversible actions. That’s what “reliability engineering” looks like when it’s working.

Common mistakes (symptoms → root cause → fix)

1) Symptom: GRUB shows up, but booting the new kernel drops to emergency shell

Root cause: initramfs missing storage drivers or ZFS module; DKMS didn’t install modules for the new kernel; /boot full created partial images.

Fix: boot older kernel; ensure /boot has space; run dkms status; install missing modules for the new kernel; regenerate initramfs for that kernel; run update-grub; test.

2) Symptom: “ZFS: module not found” or pool can’t import during boot

Root cause: ZFS modules not built/installed for the new kernel, or blocked by Secure Boot signature policy.

Fix: boot old kernel; rebuild/install ZFS DKMS for the target kernel; regenerate initramfs; if Secure Boot enabled, ensure module signing/enrollment is correct or adjust policy.

3) Symptom: after update, firmware says “no bootable device”

Root cause: UEFI NVRAM boot entry lost; ESP corruption; ESP not mounted during update and boot files aren’t where firmware expects them.

Fix: confirm ESP mount; inspect /boot/efi/EFI; recreate boot entry with efibootmgr if needed; reinstall GRUB EFI binaries from the running system or chroot.
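
A sketch of that repair from the running system or a chroot — the disk, partition number, and bootloader ID are assumptions; confirm yours with lsblk and efibootmgr -v first:

cr0x@server:~$ sudo grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=debian
cr0x@server:~$ sudo update-grub
cr0x@server:~$ sudo efibootmgr -c -d /dev/sda -p 1 -L "debian" -l '\EFI\debian\grubx64.efi'

grub-install normally recreates the NVRAM entry by itself; reach for efibootmgr -c only if the entry is still missing afterwards. On ZFS-root hosts managed by proxmox-boot-tool, prefer proxmox-boot-tool refresh over a manual grub-install.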

4) Symptom: system boots but has no network

Root cause: NIC driver regression; firmware package missing; DKMS driver failed; predictable interface name changed due to firmware/PCI ordering changes.

Fix: boot older kernel; compare ip link outputs across kernels; check journalctl -k for driver load errors; reinstall firmware packages; rebuild DKMS; adjust /etc/network/interfaces if interface names changed.
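
A minimal triage sequence from the known-good kernel — the driver and interface names in the grep are examples, not a canonical list:

cr0x@server:~$ journalctl -k | grep -iE 'firmware|failed|eth|enp' | tail -n 20
cr0x@server:~$ ip -br link
cr0x@server:~$ grep -nE 'iface|bridge-ports' /etc/network/interfaces

Compare what ip reports against what /etc/network/interfaces and your vmbr bridges expect; a renamed NIC looks exactly like a dead NIC until you check.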

5) Symptom: GRUB menu has only one entry (the broken kernel)

Root cause: old kernels purged; update-grub not run; /boot not mounted; kernel images missing.

Fix: from rescue or chroot, install at least one known-good kernel; ensure /boot mounted; run update-grub. Stop purging kernels without a policy that preserves rollback.
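
A rough chroot sketch from rescue media for a ZFS-root host — the pool name, dataset, devices, and kernel package version are all assumptions, so adapt them to your layout:

cr0x@server:~$ sudo zpool import -N -R /mnt rpool
cr0x@server:~$ sudo zfs mount rpool/ROOT/pve-1
cr0x@server:~$ sudo mount /dev/sda1 /mnt/boot/efi
cr0x@server:~$ for d in dev proc sys; do sudo mount --rbind /$d /mnt/$d; done
cr0x@server:~$ sudo chroot /mnt apt install pve-kernel-6.5.13-5-pve
cr0x@server:~$ sudo chroot /mnt update-grub

You may also need a working resolv.conf inside the chroot for apt, and remember to unwind the mounts and export the pool before rebooting. On non-ZFS roots, replace the import/mount steps with a plain mount of your root filesystem.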

6) Symptom: black screen after GRUB, but fans spin and disks blink

Root cause: console framebuffer/graphics handoff; serial console misconfigured; boot continues but you can’t see it.

Fix: try nomodeset; verify IPMI Serial-over-LAN settings; set kernel console parameters persistently if you rely on serial. Confirm via network reachability from another host.
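
If you depend on serial console, make those settings persistent instead of retyping them in the GRUB editor every boot. A minimal sketch for /etc/default/grub — the serial unit and speed are assumptions, match them to your IPMI Serial-over-LAN configuration:

GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200"

Run update-grub afterwards and verify on the next controlled reboot that both the GRUB menu and kernel messages appear over serial.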

Checklists / step-by-step plan

Checklist A: recover the node right now (minimal risk)

  1. Use console/IPMI to access the system. Don’t guess remotely.
  2. Force GRUB menu (Shift for BIOS, Esc for UEFI).
  3. Select Advanced options, boot the last known-good kernel.
  4. Confirm kernel version with uname -r.
  5. Confirm storage health (ZFS: zpool status; non-ZFS: check mounts).
  6. Confirm network is up enough for management access.
  7. Pin the known-good kernel as default using GRUB saved entry.

Checklist B: repair the broken kernel (so you can upgrade safely)

  1. Check /boot and /boot/efi are mounted and have space.
  2. Inspect last failed boot logs (journalctl -b -1 -p err).
  3. If ZFS/DKMS involved: check dkms status and module files under /lib/modules/<new-kernel>.
  4. Rebuild initramfs for the target kernel: update-initramfs -u -k <version>.
  5. Run update-grub and ensure it finds the kernel + initrd.
  6. If UEFI issues: validate EFI entries via efibootmgr -v and ESP contents.
  7. Schedule a controlled reboot into the new kernel. Stay on console. Capture output if it fails.

Checklist C: harden against the next one

  1. Keep at least two bootable kernels installed. Treat that as policy, not preference.
  2. Monitor /boot usage and alert before it hits 90%.
  3. For ZFS root: ensure ZFS modules are built for new kernels before rebooting (make the DKMS check part of your change process; a quick pre-reboot check sketch follows this list).
  4. Decide Secure Boot policy: either support module signing properly or disable it intentionally. Half-and-half is where outages live.
  5. Test kernel updates on a similar node first (hardware matters for drivers).
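
For item 3, a quick pre-reboot check sketch — the kernel version is an example, and it assumes you use zfs-dkms rather than a kernel with ZFS modules already bundled:

cr0x@server:~$ K=6.8.12-3-pve
cr0x@server:~$ dkms status | grep "$K"
cr0x@server:~$ ls /lib/modules/$K/updates/dkms/ 2>/dev/null || echo "no DKMS modules for $K"
cr0x@server:~$ lsinitramfs /boot/initrd.img-$K | grep -m1 zfs

If any of these come back empty on a ZFS-root host, don’t reboot into that kernel yet.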

FAQ

1) Is rolling back via GRUB “the right way” or just a hack?

It’s the right way. It uses the mechanism designed for exactly this: multiple installed kernels with a bootloader menu. The hack is deleting all old kernels.

2) Why did Proxmox update break boot when the same update worked on another node?

Hardware differences. Different HBAs, NICs, GPUs, firmware versions, Secure Boot states, and ESP layouts. Kernels are polite until they meet your specific PCI device.

3) Should I remove the broken kernel package?

Not immediately. First stabilize on the old kernel, collect logs, and try repairing the new kernel (initramfs + DKMS + ESP). Remove it only if you’re certain it’s irreparably broken for your hardware or you need space in /boot.

4) How many kernels should I keep installed?

At least two bootable kernels (current stable and previous stable). On critical hosts, three isn’t crazy—especially if you’re tight on change windows.

5) My GRUB menu doesn’t show. How do I force it permanently?

On a running system you can force the menu to appear by setting a nonzero timeout and a menu-style timeout in /etc/default/grub, then running update-grub (see the sketch below). Operationally, I prefer keeping the menu mostly hidden and relying on console access when needed. If remote hands depend on seeing it, make it always visible and avoid “quiet” boot until you’re stable.
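
If you do want it always visible, a minimal sketch of the relevant /etc/default/grub settings — the five-second value is only an example:

GRUB_TIMEOUT=5
GRUB_TIMEOUT_STYLE=menu

Run update-grub afterwards so the change lands in /boot/grub/grub.cfg.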

6) I’m on ZFS root. Do I need special steps?

Yes: verify ZFS modules exist for the target kernel and that initramfs includes them. If ZFS can’t load at boot, your root pool won’t import, and you’ll end up in an initramfs shell.

7) Can Secure Boot block ZFS or other modules after a kernel update?

Yes. Secure Boot can prevent unsigned kernel modules from loading. If your logs mention signature verification failures, treat it as a policy mismatch, not a random boot glitch.

8) The system boots with the old kernel, but Proxmox services look weird. What now?

Check failed units and cluster status. A reboot can expose unrelated issues: degraded ZFS pools, time drift, corosync network flaps. Fix platform health first, then worry about upgrading again.

9) If /boot is full, can I just delete old initrd files manually?

Don’t. Remove kernel packages via the package manager so hooks update GRUB and initramfs correctly. Manual deletion creates boot menus pointing at missing files—an exciting way to create a second outage.

10) When should I use rescue media?

When GRUB can’t boot any kernel, when the ESP is missing/corrupted, or when you purged all working kernels. Rescue media is fine; it’s just slower and easier to mess up if you haven’t practiced.

Conclusion: next steps that actually reduce risk

If your Proxmox host won’t boot after a kernel update, the clean path is: use GRUB to boot the previous kernel, pin it as default, then repair the new kernel by fixing initramfs, DKMS modules (especially ZFS), and EFI/ESP consistency. Reboot into the new kernel only when you can watch the console and you’ve removed obvious landmines like a full /boot.

Do these next, in order:

  1. Keep the known-good kernel as default until the upgrade is proven.
  2. Capture the last failed boot errors (journalctl -b -1) and treat them as your roadmap.
  3. Make /boot boring: enough space, correct mounts, and a rule to keep rollback kernels.
  4. For ZFS root: verify ZFS modules are installed for every new kernel before rebooting.
  5. Write a tiny internal runbook from this incident. Next time should take minutes, not hours.

Production systems don’t demand perfection. They demand repeatable recovery. GRUB rollback is your repeatable recovery—use it with intent.
