Proxmox Failed After Update: Roll Back Kernel, Fix Boot, and Recover a Broken Node Safely

Nothing spices up an on-call rotation like a Proxmox node that won’t boot after a routine update. One minute you’re patching for “security,” the next you’re staring at a grub prompt, or a black screen, or a console looping through dracut/initramfs errors while your VMs quietly stop paying rent.

This is a field guide for getting that node back without making it worse. We’ll roll back kernels, repair bootloaders, recover ZFS/LVM-backed storage, and bring the node back into a cluster without accidentally turning a single-node problem into a whole-datacenter personality disorder.

Fast diagnosis playbook

If you remember nothing else: your goal is to locate the first real failure in the boot chain, not the last error message in the scrollback. Boot failures have a way of showing “symptoms” far away from the cause.

First: Can the machine reach a login prompt at all?

  • No, it’s stuck pre-kernel (GRUB prompt / EFI shell): this is bootloader/ESP territory. Skip to Fixing GRUB and EFI.
  • No, kernel panics or can’t mount root: most likely initramfs, missing storage driver, or wrong root UUID. Jump to Initramfs and Storage.
  • Yes, but services are broken (pvedaemon, pveproxy, cluster): treat it as a “boot succeeded, Proxmox stack failed” event. Go to Practical tasks and Cluster recovery.

Second: Is storage present and imported?

  • ZFS root or ZFS VM storage missing: check pool import, hostid, cachefile, and device names. ZFS usually tells you the truth, bluntly.
  • LVM-thin missing: check PV/VG activation, udev settle, and whether the initramfs has the right modules for your HBA/NVMe.

Third: Is the node safe to rejoin the cluster?

  • Single node in a cluster and it’s down: don’t “fix” quorum by randomly deleting files. Confirm which node is authoritative for cluster state.
  • Two nodes down: assume split-brain risk. Pause, pick a source of truth, and proceed deliberately.

And yes, you can absolutely spend two hours “debugging networking” when the actual problem is that the root filesystem is mounted read-only after journal replay. Seen it. Lived it.
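
A first-pass triage you can type from muscle memory looks something like this. It is a sketch, not a diagnosis; the point is to see the running kernel, failed units, the first errors of this boot, disk visibility, and whether / or /boot is full or read-only before you commit to a theory.

cr0x@server:~$ uname -r
cr0x@server:~$ systemctl --failed
cr0x@server:~$ journalctl -b -p err --no-pager | head -n 30
cr0x@server:~$ lsblk -f
cr0x@server:~$ df -h / /boot
cr0x@server:~$ mount | grep -E ' / | /boot '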

Interesting facts and context (why this keeps happening)

  • Proxmox VE rides on Debian’s plumbing. When you upgrade Proxmox, you’re upgrading a Debian system plus a curated stack: kernel(s), QEMU, LXC, ZFS, and management daemons.
  • Multiple kernels installed is a feature, not clutter. Debian-style packaging typically keeps older kernels so you can boot back into a known-good one. Removing all but one kernel is like selling your spare tire to save weight.
  • initramfs was designed to solve “drivers too early” problems. Modern systems need storage and crypto modules before the “real” root filesystem exists; initramfs is that early userspace scaffolding.
  • UEFI changed the failure modes. Old BIOS+MBR failures were often “GRUB is toast.” With UEFI, you can have a perfectly fine OS and a broken EFI System Partition (ESP) mount, NVRAM entry, or shim chain.
  • ZFS is conservative by default, but unforgiving about identity. Hostid mismatches or stale cachefiles can prevent auto-import at boot, especially after cloning disks or moving pools between machines.
  • Corosync quorum is intentionally stubborn. It is designed to stop you from running a cluster in a split-brain configuration even if you’re having a “productive” day.
  • Proxmox’s web UI is just a client. When the node “looks dead” in the browser, it might just be pveproxy down while VMs are running fine underneath.
  • Kernel updates can break out-of-tree modules. NVIDIA, some HBAs with vendor drivers, and DKMS-built modules can fail to build, leaving you without required drivers at boot.

One reliability maxim worth taping to your monitor: “Hope is not a strategy.” It gets repeated in operational circles whenever a team substitutes optimism for a rollback plan.

Safety rules before you touch anything

Recovery work is where good engineers become accidental arsonists. The easiest way to lose data is to “fix” things fast on the wrong node, against the wrong disks, with the wrong assumptions.

Rules I actually follow

  • Prefer reversible changes. Booting an older kernel is reversible. Reinstalling GRUB is usually reversible. “Wiping and reinstalling Proxmox” is not a recovery plan; it’s a confession.
  • Write down what you change. Timestamped notes beat heroic memory every time (see the sketch after this list).
  • Don’t rejoin a node to a cluster until it’s stable. Cluster membership multiplies mistakes.
  • Assume your storage is innocent until proven guilty. Boot issues often look like disk issues. Don’t start zapping pools because the initramfs can’t see an HBA.
  • Get console access. IPMI/iKVM/physical console. SSH is great right up until it isn’t.
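
One cheap way to honor the “write it down” rule under pressure is to record the whole console session with script(1). A minimal sketch; the log path is arbitrary, and on a read-only root you would point it at writable media instead.

cr0x@server:~$ script -a /root/recovery-notes.log
cr0x@server:~$ # recovery work happens here; commands and output are captured in the log
cr0x@server:~$ exit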

Joke #1: The only thing more permanent than a temporary workaround is a temporary workaround done under pressure.

Failure modes after a Proxmox update

“Failed after update” usually means one of these buckets:

  1. Bootloader/EFI break: GRUB menu missing, UEFI boot entry lost, ESP not mounted, or initrd not found.
  2. Kernel boots, root won’t mount: missing storage driver, wrong root UUID, broken initramfs, encrypted root not unlocking.
  3. System boots, Proxmox stack fails: pve services not starting, cluster filesystems stuck, corosync issues, certificate or permission problems.
  4. System boots, networking broken: bridges not coming up, bond mode mismatch after driver update, MTU misalignments, predictable interface renames.
  5. System boots, storage missing: ZFS pools not importing, LVM volumes not activating, multipath confusion, degraded arrays causing timeouts.

The trick is to categorize the failure quickly. Then you choose the least invasive fix that restores service.

Practical recovery tasks with commands, outputs, and decisions

Below are hands-on tasks you can run on a broken node or from rescue mode/chroot. Each includes what the output means and what decision you make next.

Task 1: Identify what kernel you booted (or tried to)

cr0x@server:~$ uname -a
Linux pve01 6.8.12-4-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-4 (2025-02-10T12:00Z) x86_64 GNU/Linux

Meaning: Confirms the running kernel version. If you’re running the newest kernel and things are broken, rollback is your first move.

Decision: If the failure correlates with a kernel jump, plan to boot an older kernel from GRUB and pin it temporarily.

Task 2: List installed kernels and Proxmox kernel packages

cr0x@server:~$ dpkg -l | grep -E 'pve-kernel|proxmox-kernel|linux-image' | awk '{print $1,$2,$3}'
ii pve-kernel-6.5.13-5-pve 6.5.13-5
ii pve-kernel-6.8.12-4-pve 6.8.12-4
ii proxmox-kernel-helper 8.1.1

Meaning: You have at least one older kernel installed. That’s a lifeline.

Decision: If an older kernel exists, try it first. If none exist, you’ll need to install one (possibly from rescue/chroot).
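
If there is genuinely no fallback kernel, you can install one (from the running system or a chroot) before you reboot again. A sketch using the older version from the listing above; treat the exact package name as an assumption, since newer Proxmox releases ship kernels as proxmox-kernel-* rather than pve-kernel-*.

cr0x@server:~$ apt update
cr0x@server:~$ apt install pve-kernel-6.5.13-5-pve
cr0x@server:~$ proxmox-boot-tool kernel list   # only meaningful on proxmox-boot-tool managed systems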

Task 3: Check whether the system dropped into emergency mode and why

cr0x@server:~$ systemctl --failed
UNIT                         LOAD   ACTIVE SUB    DESCRIPTION
pve-cluster.service          loaded failed failed The Proxmox VE cluster filesystem
corosync.service             loaded failed failed Corosync Cluster Engine

Meaning: The OS is up, but cluster services failed. That can block Proxmox management and even VM operations that depend on cluster FS.

Decision: Don’t reboot blindly. Pull logs for the failing unit and decide whether to isolate the node (local mode) or fix corosync.

Task 4: Inspect the journal for the first real failure

cr0x@server:~$ journalctl -b -p err --no-pager | head -n 30
Feb 12 09:14:51 pve01 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Feb 12 09:14:51 pve01 pmxcfs[1203]: [main] unable to open cluster database (Connection refused)
Feb 12 09:14:51 pve01 corosync[1188]: [TOTEM ] Failed to bind to multicast address
Feb 12 09:14:51 pve01 corosync[1188]: [MAIN  ] Corosync Cluster Engine exiting with status '2'

Meaning: This isn’t “Proxmox is broken,” it’s a corosync binding problem, often network-related (interface names, IP changes, MTU, firewall).

Decision: Verify network interfaces and corosync config before you touch storage or reinstall packages.

Task 5: Confirm interface naming didn’t change under you

cr0x@server:~$ ip -br link
lo               UNKNOWN        00:00:00:00:00:00
enp4s0           UP             3c:ec:ef:12:34:56
enp5s0           DOWN           3c:ec:ef:12:34:57
vmbr0            DOWN           3c:ec:ef:12:34:56

Meaning: Bridge is down. That can be as simple as the underlying NIC being down or the bridge config failing to apply.

Decision: Check /etc/network/interfaces and attempt a controlled network restart (or bring up the bridge manually) before blaming corosync.

Task 6: Validate Proxmox is mounting /etc/pve (pmxcfs)

cr0x@server:~$ mount | grep /etc/pve

Meaning: No output means pmxcfs is not mounted; cluster filesystem isn’t up.

Decision: If you need local recovery, you can start Proxmox in a “no cluster” state temporarily (covered later). If this is a multi-node environment, fix corosync first.

Task 7: Check disk visibility and kernel driver state

cr0x@server:~$ lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINTS
sda     1.8T disk
├─sda1  512M part vfat   /boot/efi
├─sda2    1G part ext4   /boot
└─sda3  1.8T part
  └─pve-root  96G lvm    /
nvme0n1 1.7T disk zfs_member
nvme1n1 1.7T disk zfs_member

Meaning: Disks are present; EFI and boot partitions are mounted; ZFS members visible. That reduces the odds of “hardware vanished.”

Decision: If devices are missing, suspect the kernel/HBA module/initramfs. If present but pools/volumes missing, focus on import/activation.

Task 8: Check ZFS pool import status (if you use ZFS)

cr0x@server:~$ zpool import
   pool: tank
     id: 1234567890123456789
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        tank           ONLINE
          mirror-0     ONLINE
            nvme0n1    ONLINE
            nvme1n1    ONLINE

Meaning: Pool exists but is not imported. Boot may have skipped auto-import due to hostid mismatch or timing.

Decision: Import it read-write only if you’re sure this node is the right owner. If in doubt (shared storage, moved disks), import read-only first.
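
If ownership is in doubt, a read-only import lets you inspect datasets without rewriting pool state. A minimal sketch, assuming the pool name from the output above:

cr0x@server:~$ zpool import -o readonly=on tank
cr0x@server:~$ zfs list -r -o name,used,mountpoint tank
cr0x@server:~$ zpool export tank   # export again before deciding on a read-write import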

Task 9: Import ZFS pool carefully, and set cachefile for future boots

cr0x@server:~$ zpool import -o cachefile=/etc/zfs/zpool.cache tank
cr0x@server:~$ zpool status tank
  pool: tank
 state: ONLINE
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            nvme0n1   ONLINE       0     0     0
            nvme1n1   ONLINE       0     0     0

errors: No known data errors

Meaning: Pool imported and healthy. Setting the cachefile helps the initramfs/boot sequence locate pools reliably.

Decision: If import fails due to hostid, address hostid rather than forcing imports repeatedly.

Task 10: Check LVM status (if you use LVM-thin)

cr0x@server:~$ pvs
  PV         VG  Fmt  Attr PSize  PFree
  /dev/sda3  pve lvm2 a--  <1.82t 0
cr0x@server:~$ vgs
  VG  #PV #LV #SN Attr   VSize  VFree
  pve   1   3   0 wz--n- <1.82t 0
cr0x@server:~$ lvs -a
  LV   VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data pve twi-aotz-- <1.60t             72.11  2.40
  root pve -wi-ao----  96.00g
  swap pve -wi-ao----   8.00g

Meaning: PV/VG/LVs exist and are active (ao). If VGs are missing or inactive, storage activation is your problem.

Decision: If inactive, try vgchange -ay and inspect dmesg for device errors before forcing repairs.

Task 11: Validate free space on /boot and ESP (updates fail quietly when /boot is full)

cr0x@server:~$ df -h /boot /boot/efi
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2      1007M  983M   24M  98% /boot
/dev/sda1       511M   12M  500M   3% /boot/efi

Meaning: /boot is basically full. That can produce incomplete initramfs/kernel installs and weird boot behavior.

Decision: Remove old kernels cleanly (after you have a known-good boot) or expand /boot. Don’t delete random files in /boot like it’s 1999.
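
When you do clean up, let the package manager do it so the postinst/postrm hooks keep GRUB and initramfs consistent. A sketch; the version being purged is hypothetical, and the refresh step only applies if your ESP is managed by proxmox-boot-tool.

cr0x@server:~$ apt purge pve-kernel-6.2.16-20-pve   # hypothetical version you have verified you no longer need
cr0x@server:~$ apt autoremove --purge
cr0x@server:~$ df -h /boot
cr0x@server:~$ proxmox-boot-tool refresh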

Task 12: Check for DKMS/module build failures that break drivers

cr0x@server:~$ journalctl -b --no-pager | grep -E 'dkms|module|firmware' | tail -n 20
Feb 12 09:10:22 pve01 kernel: i915 0000:00:02.0: firmware: failed to load i915/skl_dmc_ver1_27.bin (-2)
Feb 12 09:10:25 pve01 dkms[842]: Error! Bad return status for module build on kernel: 6.8.12-4-pve (x86_64)
Feb 12 09:10:25 pve01 dkms[842]: Consult /var/lib/dkms/.../build/make.log for more information.

Meaning: A DKMS module failed to build. If that module is required for storage or networking, boot might succeed but devices won’t work.

Decision: Roll back kernel or fix the DKMS build. If it’s non-critical (e.g., GPU on a headless host), mask the noise but don’t ignore storage-related DKMS errors.
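
To see which modules DKMS owns and retry the build against the new kernel, stock dkms tooling is enough; the kernel version here is the one from the log above.

cr0x@server:~$ dkms status
cr0x@server:~$ dkms autoinstall -k 6.8.12-4-pve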

Task 13: Confirm Proxmox services health (when OS boots but UI is dead)

cr0x@server:~$ systemctl status pveproxy pvedaemon pvestatd --no-pager
● pveproxy.service - PVE API Proxy Server
     Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
     Active: active (running) since Wed 2025-02-12 09:18:02 UTC; 2min ago
● pvedaemon.service - PVE API Daemon
     Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled)
     Active: active (running) since Wed 2025-02-12 09:18:00 UTC; 2min ago
● pvestatd.service - PVE Status Daemon
     Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled)
     Active: active (running) since Wed 2025-02-12 09:18:01 UTC; 2min ago

Meaning: Core services are up. If the web UI still fails, suspect firewall, cert issues, or client-side caching.

Decision: If services are down, check dependencies: /etc/pve mount, DNS/hostname resolution, time sync drift, disk full.

Task 14: Check cluster status and quorum without guessing

cr0x@server:~$ pvecm status
Cluster information
-------------------
Name:             prod-cluster
Config Version:   43
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Feb 12 09:21:12 2025
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.2
Quorate:          Yes

Meaning: Quorum is present. That changes how aggressive you can be: you can repair this node without rewriting cluster state.

Decision: If Quorate: No, avoid changes that require cluster writes; consider temporary local mode or quorum restoration strategy.
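
If quorum is genuinely gone and you must make a surviving node writable (for example, the other nodes are physically dead), Proxmox can lower the expected vote count. Treat this as an emergency lever, not a fix: run it only on the node you have decided is authoritative, and restore normal membership afterwards.

cr0x@server:~$ pvecm expected 1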

Kernel rollback: the least dramatic fix

When a Proxmox node fails right after an update, assume the newest kernel is guilty until proven innocent. Rolling back is usually fast, safe, and reversible. It also buys you time to debug properly.

Boot an older kernel from GRUB

At boot, choose Advanced options and select the previous pve-kernel. If you can’t reach GRUB because the screen is too fast, hold Shift on BIOS systems or press Esc on many UEFI systems during boot.

Pin the known-good kernel (temporary)

Once booted into the working kernel, prevent the system from “helpfully” switching back on the next reboot.

cr0x@server:~$ proxmox-boot-tool kernel list
Manually selected kernels:
None.

Automatically selected kernels:
6.8.12-4-pve
6.5.13-5-pve

Pinned kernel:
None.

Meaning: No pinned kernel; default selection may prefer the newest.

Decision: Pin the working kernel until you finish the investigation.

cr0x@server:~$ proxmox-boot-tool kernel pin 6.5.13-5-pve
Pinned kernel version '6.5.13-5-pve'

If you’re not on a proxmox-boot-tool managed setup, you can also control GRUB defaults, but pinning via Proxmox tooling is usually cleaner when available.
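
For a plain GRUB setup, one approach is the saved-default mechanism: set GRUB_DEFAULT=saved, regenerate the config, then point grub-set-default at the exact menu title. A sketch; the menu titles below are placeholders, so copy the real ones out of your generated grub.cfg, including the submenu prefix.

cr0x@server:~$ grep -E "(menuentry|submenu) '" /boot/grub/grub.cfg | cut -d"'" -f2
cr0x@server:~$ sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT=saved/' /etc/default/grub
cr0x@server:~$ update-grub
cr0x@server:~$ grub-set-default "Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 6.5.13-5-pve"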

Rebuild initramfs and refresh GRUB after rollback

Even if the older kernel boots, refresh boot artifacts so you don’t keep tripping over half-written initramfs images.

cr0x@server:~$ update-initramfs -u -k 6.5.13-5-pve
update-initramfs: Generating /boot/initrd.img-6.5.13-5-pve

Meaning: initramfs rebuilt for the chosen kernel.

Decision: If this fails due to /boot full, stop and fix /boot capacity before proceeding.

cr0x@server:~$ update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.8.12-4-pve
Found initrd image: /boot/initrd.img-6.8.12-4-pve
Found linux image: /boot/vmlinuz-6.5.13-5-pve
Found initrd image: /boot/initrd.img-6.5.13-5-pve
done

Meaning: GRUB sees both kernels and initrds.

Decision: Reboot once you’ve pinned/selected the kernel you trust, and confirm it stays selected.

Fixing GRUB and EFI boot issues

If you land in a GRUB prompt, an EFI shell, or “no bootable device,” you’re not debugging Proxmox anymore. You’re debugging the boot chain. Treat it like a surgical procedure: verify disks, verify partitions, mount correctly, then reinstall the bootloader.

Decide: BIOS/MBR or UEFI?

cr0x@server:~$ [ -d /sys/firmware/efi ] && echo UEFI || echo BIOS
UEFI

Meaning: This host is booted in UEFI mode (or should be).

Decision: Focus on the EFI System Partition at /boot/efi and UEFI NVRAM boot entries.

Check that the ESP is present and mountable

cr0x@server:~$ lsblk -f | grep -E 'vfat|EFI|/boot/efi'
sda1 vfat   FAT32  7A3B-1C2D                            /boot/efi

Meaning: ESP exists and is mounted. If it’s missing or not mountable, you can’t reliably reinstall GRUB.

Decision: If not mounted, mount it. If mount fails, run filesystem checks from rescue mode.
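
A minimal sketch for the “not mounted” case, using the device names from the earlier lsblk output; run the filesystem check only while the partition is unmounted, from rescue mode if necessary.

cr0x@server:~$ mount /dev/sda1 /boot/efi
cr0x@server:~$ fsck.vfat -a /dev/sda1     # only if the mount fails; the partition must be unmounted
cr0x@server:~$ mount /dev/sda1 /boot/efi  # retry after a clean check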

Validate UEFI boot entries

cr0x@server:~$ efibootmgr -v
BootCurrent: 0002
Timeout: 1 seconds
BootOrder: 0002,0001
Boot0001* UEFI OS       HD(1,GPT,...)File(\EFI\BOOT\BOOTX64.EFI)
Boot0002* proxmox       HD(1,GPT,...)File(\EFI\proxmox\grubx64.efi)

Meaning: The boot entry exists and points to a Proxmox GRUB EFI binary. If it points to nowhere, you’ll loop into EFI shell.

Decision: If missing, reinstall GRUB to the ESP and recreate the boot entry.

Reinstall GRUB on UEFI

cr0x@server:~$ grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox
Installing for x86_64-efi platform.
Installation finished. No error reported.

Meaning: GRUB EFI binaries are placed on the ESP.

Decision: Follow with update-grub and verify the entry with efibootmgr.

cr0x@server:~$ update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.5.13-5-pve
Found initrd image: /boot/initrd.img-6.5.13-5-pve
done

If you need rescue mode + chroot (common)

When the node can’t boot at all, boot from a rescue ISO, mount your root filesystem, and chroot. The exact steps depend on whether root is LVM or ZFS. Here’s a representative LVM-root flow.

cr0x@server:~$ vgscan
  Found volume group "pve" using metadata type lvm2
cr0x@server:~$ vgchange -ay
  3 logical volume(s) in volume group "pve" now active
cr0x@server:~$ mount /dev/pve/root /mnt
cr0x@server:~$ mount /dev/sda2 /mnt/boot
cr0x@server:~$ mount /dev/sda1 /mnt/boot/efi
cr0x@server:~$ for i in /dev /dev/pts /proc /sys /run; do mount --bind $i /mnt$i; done
cr0x@server:~$ chroot /mnt /bin/bash

Meaning: You’re now operating inside the installed OS, which is where GRUB and initramfs tooling expects to run.

Decision: Reinstall kernel/initramfs/GRUB from inside the chroot; then exit, unmount cleanly, and reboot.
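
For a ZFS-root install, the equivalent flow imports the pool under an altroot instead of activating LVM. A sketch assuming the installer-default pool and dataset names (rpool, rpool/ROOT/pve-1); adjust to whatever zpool import actually reports.

cr0x@server:~$ zpool import -N -R /mnt rpool
cr0x@server:~$ zfs mount rpool/ROOT/pve-1
cr0x@server:~$ zfs mount -a
cr0x@server:~$ for i in /dev /dev/pts /proc /sys /run; do mount --bind $i /mnt$i; done
cr0x@server:~$ chroot /mnt /bin/bash

ZFS-root systems typically boot via proxmox-boot-tool rather than GRUB installed straight onto the ESP, so inside the chroot prefer proxmox-boot-tool refresh. When you are done: exit the chroot, undo the bind mounts, and zpool export rpool before rebooting.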

Initramfs, missing modules, and storage drivers

The classic post-update brick: kernel boots, then can’t find the root filesystem. You see messages like “waiting for root device,” “unable to mount root fs,” or it drops into an initramfs shell.

This is often one of these:

  • initramfs was not generated correctly (disk full, interrupted upgrade).
  • Required modules for your storage controller weren’t included.
  • Root UUID changed (cloning, disk replacement) but boot config didn’t.
  • ZFS root: pool didn’t import due to hostid/caching issues.

Check what the kernel thinks your root device should be

cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.12-4-pve root=/dev/mapper/pve-root ro quiet

Meaning: Root points to an LVM mapper device. If that mapper never appears, you’ll fail to mount root.

Decision: Verify LVM activation in initramfs (or rebuild initramfs with LVM bits) and confirm the underlying PV is visible early in boot.

See whether the storage driver actually loaded

cr0x@server:~$ lsmod | egrep 'nvme|ahci|megaraid|mpt3sas|vfio' | head
nvme                   61440  2
ahci                   45056  0

Meaning: The running system has NVMe and AHCI modules. In failure scenarios, these might be absent in initramfs, not in the full OS.

Decision: If you suspect missing initramfs modules, rebuild initramfs and ensure required modules are included.
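
One way to confirm the suspicion without rebooting is to list what is actually inside the current image; lsinitramfs ships with initramfs-tools on Debian-based systems, and the kernel version here is the one from the cmdline above.

cr0x@server:~$ lsinitramfs /boot/initrd.img-6.8.12-4-pve | grep -E 'nvme|ahci|lvm' | head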

Rebuild initramfs for all installed kernels

cr0x@server:~$ update-initramfs -u -k all
update-initramfs: Generating /boot/initrd.img-6.5.13-5-pve
update-initramfs: Generating /boot/initrd.img-6.8.12-4-pve

Meaning: Both initrds regenerated. This fixes a lot of “half-installed kernel” boot failures.

Decision: If it errors with “No space left on device,” fix /boot before you do anything else.

Make sure /etc/initramfs-tools/modules includes weird-but-required drivers

Most systems don’t need this. Some do. Especially if you boot from exotic HBAs or iSCSI or have strict ordering issues. Example snippet you might add:

cr0x@server:~$ sed -n '1,120p' /etc/initramfs-tools/modules
# List of modules that you want to include in your initramfs.
# They will be loaded at boot time in the order below.
nvme
ahci

Meaning: These modules will be forced into initramfs.

Decision: Only add modules you understand. Stuffing initramfs with everything is how you get slow boots and new failure modes.
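
Editing that file does nothing on its own; the modules only land in the image after a rebuild, and on proxmox-boot-tool managed systems the ESP copies need refreshing too.

cr0x@server:~$ update-initramfs -u -k all
cr0x@server:~$ proxmox-boot-tool refresh   # only on proxmox-boot-tool managed installs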

Storage recovery: ZFS and LVM-thin

Storage is where “simple” becomes “incident.” Get this wrong and you can create real corruption. Get it right and you look like a magician. The difference is mostly discipline.

ZFS: pools not importing after update

Common reasons:

  • Hostid mismatch after reinstall, cloning, or accidental /etc/hostid change.
  • Cachefile missing/stale; initramfs can’t find pools reliably.
  • Device naming changes (e.g., SATA ordering) and you used raw /dev/sdX paths instead of by-id.
  • Pool is busy or imported elsewhere (in shared-disk scenarios, this is your warning siren).

Check hostid and ZFS expectations

cr0x@server:~$ hostid
1a2b3c4d
cr0x@server:~$ ls -l /etc/hostid
-rw-r--r-- 1 root root 4 Feb 12 09:00 /etc/hostid

Meaning: hostid exists. If it recently changed, ZFS may refuse auto-import or behave differently.

Decision: If pools were moved from another host, set hostid intentionally and document it. Don’t “wing it” with forced imports.
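
If you do need to set the hostid deliberately, OpenZFS ships zgenhostid for exactly that, and the initramfs must be rebuilt so early boot sees the same value. A sketch; -f overwrites an existing /etc/hostid, so only use it once you have decided this host owns the pool.

cr0x@server:~$ zgenhostid -f
cr0x@server:~$ hostid
cr0x@server:~$ update-initramfs -u -k all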

Verify pool health once imported

cr0x@server:~$ zpool status -x
all pools are healthy

Meaning: No known ZFS issues.

Decision: Proceed to Proxmox service recovery. If degraded, address resilvering before you restart high I/O workloads.

LVM-thin: VMs missing, storage “inactive,” or thin pool errors

LVM-thin usually breaks due to activation order problems or underlying device timeouts. Less often, thin metadata fills up.

Activate volume groups and check thin metadata usage

cr0x@server:~$ vgchange -ay
  3 logical volume(s) in volume group "pve" now active
cr0x@server:~$ lvs -o lv_name,vg_name,attr,lv_size,data_percent,metadata_percent
  LV   VG  Attr       LSize   Data%  Meta%
  data pve twi-aotz-- <1.60t  72.11  2.40
  root pve -wi-ao----  96.00g
  swap pve -wi-ao----   8.00g

Meaning: Thin pool metadata is only 2.40% used. That’s fine.

Decision: If metadata is near 100%, stop. Expanding/repairing thin metadata is delicate; don’t restart dozens of VMs and make it worse.
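
If metadata really is near full and the volume group has free extents, the usual move is to grow the metadata LV rather than the data LV. This is a sketch of the shape of the command, not an invitation to run it casually; check vgs for free space first.

cr0x@server:~$ vgs pve
cr0x@server:~$ lvextend --poolmetadatasize +1G pve/data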

Check for filesystem errors if root went read-only

cr0x@server:~$ mount | grep ' / '
/dev/mapper/pve-root on / type ext4 (ro,relatime,errors=remount-ro)

Meaning: Root is mounted read-only. That will break package management, cluster FS, logs, everything.

Decision: Reboot into rescue and run fsck. Do not “just remount rw” and hope; you’ll often make metadata damage worse.
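
From a rescue ISO, the check looks roughly like this for the LVM layout shown earlier; the root LV must not be mounted while e2fsck runs.

cr0x@server:~$ vgchange -ay pve
cr0x@server:~$ e2fsck -f /dev/pve/root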

Cluster recovery: quorum, corosync, and rejoining safely

Cluster recovery is where engineers with good intentions cause bad weeks. Corosync is picky by design. Respect it.

Decide whether you’re dealing with:

  • A single broken node while the rest of the cluster is healthy.
  • Loss of quorum due to multiple nodes down or network partition.
  • Split-brain risk (two sides each think they’re primary).

Check corosync and cluster status

cr0x@server:~$ systemctl status corosync --no-pager
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
     Active: failed (Result: exit-code) since Wed 2025-02-12 09:14:51 UTC; 8min ago
     Process: 1188 ExecStart=/usr/sbin/corosync -f $COROSYNCOPTIONS (code=exited, status=2)

Meaning: corosync didn’t come up. The reason is in logs; don’t guess.

Decision: Pull detailed logs and verify the network interface used by corosync is up and has the expected address.

cr0x@server:~$ journalctl -u corosync --no-pager | tail -n 40
Feb 12 09:14:51 pve01 corosync[1188]: [KNET  ] host: 1 link: 0 is down
Feb 12 09:14:51 pve01 corosync[1188]: [KNET  ] failed to bind to interface 'vmbr0'
Feb 12 09:14:51 pve01 corosync[1188]: [MAIN  ] Corosync Cluster Engine exiting with status '2'

Meaning: Corosync wants vmbr0 but it’s down or missing. Fix network config before you touch cluster config.

Decision: Bring up the bridge/interface, then retry corosync.

Temporary local mode (use sparingly)

Sometimes you need to get a node operational (e.g., to evacuate VMs via storage access) even though corosync can’t form. Proxmox supports running without cluster communications, but you must understand the trade-offs.

A common approach is to stop cluster services and work locally. This can let you access local storage and VMs, but cluster-managed config won’t behave normally.

cr0x@server:~$ systemctl stop pve-cluster corosync
cr0x@server:~$ systemctl status pve-cluster --no-pager
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
     Active: inactive (dead)

Meaning: Cluster FS is stopped; you’re no longer writing cluster state.

Decision: Use this to stabilize the node and verify storage/VM state. Don’t run long-term like this in a real cluster unless you enjoy meetings.
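
If you also need /etc/pve readable and writable while the cluster is down, pmxcfs can be started in local mode. Recovery use only: local writes will not propagate to the cluster, and you should stop it and return to normal services before rejoining.

cr0x@server:~$ pmxcfs -l
cr0x@server:~$ mount | grep /etc/pve   # confirm the cluster FS is now mounted locally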

Joke #2: Split-brain is when both halves of your cluster become very confident and very wrong.

Rejoin strategy

Once the node boots cleanly, networking is stable, and storage is correct, re-enable corosync and pmxcfs and confirm quorum. Don’t “fix” config version mismatches by copying random files; align the node with the cluster’s source of truth.

Network recovery: bridges, bonds, and “why is vmbr0 down?”

Network issues after updates often come from driver changes, interface renaming, or a bridge/bond configuration that depends on something that changed timing-wise.

Inspect the network config

cr0x@server:~$ sed -n '1,200p' /etc/network/interfaces
auto lo
iface lo inet loopback

auto enp4s0
iface enp4s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 10.10.10.11/24
        gateway 10.10.10.1
        bridge-ports enp4s0
        bridge-stp off
        bridge-fd 0

Meaning: vmbr0 depends on enp4s0. If enp4s0 got renamed (e.g., to enp3s0), vmbr0 won’t come up.

Decision: Compare with ip -br link and correct interface names. Then bring networking up carefully.
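
If renames keep biting you, one option is to pin a name to the NIC’s MAC with a systemd .link file and rebuild the initramfs so the rename applies early. A sketch; the file name and chosen interface name are placeholders, the MAC matches the enp4s0 example earlier, and remember to update bridge-ports to the new name before rebooting.

cr0x@server:~$ cat /etc/systemd/network/10-lan0.link
[Match]
MACAddress=3c:ec:ef:12:34:56

[Link]
Name=lan0
cr0x@server:~$ update-initramfs -u -k all

On recent systemd you may prefer PermanentMACAddress= in [Match] so the rule cannot also catch a bridge or bond that inherits the NIC’s MAC.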

Bring up bridge manually to confirm config is the issue, not hardware

cr0x@server:~$ ip link set enp4s0 up
cr0x@server:~$ ifup vmbr0
ifup: interface vmbr0 already configured

Meaning: Either it’s already configured or ifupdown2 thinks it is. If it stays DOWN, inspect ip addr and logs.

Decision: If bridge refuses, check for conflicting network managers, or errors in config syntax.

Confirm corosync bind interface exists and has the expected IP

cr0x@server:~$ ip -br addr show vmbr0
vmbr0            UP             10.10.10.11/24

Meaning: corosync can now bind. That’s often the whole fix for “cluster down after update.”

Decision: Restart corosync and then pve-cluster, in that order, and confirm pvecm status.
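
The order matters because pmxcfs needs corosync for cluster mode; a minimal sketch:

cr0x@server:~$ systemctl restart corosync
cr0x@server:~$ systemctl restart pve-cluster
cr0x@server:~$ pvecm status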

Checklists / step-by-step plan

Plan A: Node won’t boot after update (most common)

  1. Get console access (IPMI/iKVM). Capture the first error on screen.
  2. Try GRUB advanced options: boot the previous Proxmox kernel.
  3. If boot works, pin the working kernel and rebuild initramfs for all kernels.
  4. Check /boot free space; if nearly full, remove old kernels only after confirming stable boot.
  5. Verify storage: ZFS pools imported, LVM active, root filesystem mounted read-write.
  6. Confirm Proxmox services: pveproxy/pvedaemon/pvestatd, then cluster services if applicable.
  7. Reboot once to validate the fix persists.

Plan B: GRUB/UEFI failure (you can’t even reach a kernel)

  1. Boot rescue ISO in the same mode (UEFI vs BIOS) the system uses.
  2. Mount root, /boot, and /boot/efi; bind-mount /dev,/proc,/sys,/run; chroot.
  3. Reinstall GRUB to ESP (UEFI) or disk (BIOS) and run update-grub.
  4. Rebuild initramfs for the known-good kernel.
  5. Verify efibootmgr entries; reboot and test.

Plan C: OS boots, Proxmox stack broken (web UI down, cluster angry)

  1. Check disk full, root mounted read-only, and time drift first.
  2. Check failing units with systemctl --failed.
  3. Confirm /etc/pve mount status; cluster FS issues will break management.
  4. For corosync failures, validate network interface and MTU, then restart corosync and pve-cluster.
  5. Only if necessary, operate temporarily in local mode to recover VMs or storage, then rejoin properly.

Plan D: Storage missing after update

  1. Confirm the kernel sees disks (lsblk, dmesg).
  2. For ZFS: run zpool import, import cautiously, set cachefile, check hostid.
  3. For LVM: run vgscan, vgchange -ay, inspect thin pool usage.
  4. Do not force imports/repairs until you understand whether disks were moved or shared.

Common mistakes (symptoms → root cause → fix)

1) Symptom: “Boots into initramfs shell, can’t find root device”

Root cause: initramfs missing storage drivers or LVM bits; or update left an incomplete initrd due to /boot full.

Fix: Boot older kernel; in rescue/chroot rebuild initramfs (update-initramfs -u -k all), free /boot space, ensure required modules are included.

2) Symptom: “GRUB prompt” or “no bootable device” after update

Root cause: broken ESP mount, missing EFI entry, grub install not completed, or ESP corruption.

Fix: Mount ESP, reinstall GRUB for UEFI, run update-grub, verify efibootmgr entries.

3) Symptom: “Node boots but Proxmox web UI is dead”

Root cause: pveproxy or pvedaemon not running; /etc/pve not mounted; root filesystem read-only; disk full.

Fix: Check systemctl status for pve services, verify mount | grep /etc/pve, fix filesystem state, clear disk space.

4) Symptom: “Cluster shows node offline; pmxcfs errors”

Root cause: corosync cannot bind because network interface name changed or bridge is down.

Fix: Fix /etc/network/interfaces to match actual NIC names, bring up vmbr0, restart corosync then pve-cluster.

5) Symptom: “ZFS pool not importing at boot, but disks are present”

Root cause: hostid mismatch or stale/missing zpool cachefile; device path changes.

Fix: Import with cachefile option, confirm hostid stability, prefer by-id device naming, rebuild initramfs if ZFS root.

6) Symptom: “Upgrade succeeded, but after reboot networking is weird”

Root cause: driver change for NIC, predictable interface name change, or bond mode mismatch after module update.

Fix: Confirm ip -br link and adjust bridge/bond config; check dmesg for driver errors; roll back kernel if needed.

7) Symptom: “Everything looks fine, but packages can’t be installed/rolled back”

Root cause: root mounted read-only or /var full; dpkg left half-configured packages after interrupted upgrade.

Fix: Resolve filesystem state, then run dpkg --configure -a and apt -f install from a stable boot.
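
Once the filesystem is writable and has space, the cleanup is the standard Debian two-step, run from a boot you trust:

cr0x@server:~$ dpkg --configure -a
cr0x@server:~$ apt -f install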

8) Symptom: “VM disks missing from storage view”

Root cause: storage backend not active (ZFS pool not imported, VG not activated), or pmxcfs not mounted so storage config isn’t loaded.

Fix: Restore storage first (import/activate), then ensure cluster FS is mounted and pve services are healthy.

Three corporate-world mini-stories (pain, learning, and one boring win)

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran a three-node Proxmox cluster with Ceph for VM disks and a couple of local ZFS pools for backups. An update weekend rolled around. One node didn’t come back. The on-call engineer saw “ZFS pool not imported” and assumed it was safe to force import because “it’s local anyway.”

The wrong assumption: the “local” pool wasn’t purely local. Months earlier, the hardware team had moved drives between nodes during a chassis replacement. Nobody updated the asset notes. The pool had been imported on a different node during the move, and the hostid + cachefile story was a mess. Force-importing it on the broken node made it look available, but it wasn’t the authoritative copy of everything anymore.

They didn’t immediately corrupt the pool, but they did something almost as expensive: they restored backups from the wrong snapshots. Several VMs came back with stale data that looked consistent at first glance. The incident wasn’t “node down,” it became “data correctness questionable,” which is the kind of sentence that causes calendar invites.

The fix ended up being mostly process. They documented where each pool physically lived, ensured imports use stable by-id device paths, and treated “force import” as an emergency action requiring a second set of eyes. Technically, the recovery play would have been: import read-only, confirm dataset timestamps, then decide. Socially, the recovery play was: don’t assume anything is “local” unless you can point to the bay.

Mini-story 2: The optimization that backfired

A different org had a performance initiative: reduce boot time and patch windows on virtualization hosts. Someone decided to “clean up old kernels” aggressively to reclaim space on /boot and reduce maintenance overhead. They set up automation to keep only the newest kernel installed.

It worked right up until it didn’t. A new kernel landed with a regression affecting their specific RAID controller firmware combination. The host rebooted into the new kernel and couldn’t see its root disk. With no older kernel installed, “just boot the previous one” was no longer an option. They had to do a rescue ISO dance during a maintenance window that was already over-budget.

Recovery took longer than it should have, partly because the on-call path was designed around rollback: boot old kernel, gather logs, then decide. That path was deleted in the name of tidiness.

The eventual correction was delightfully unglamorous: keep at least two known-good kernels, alert on /boot usage, and add a pre-reboot check that verifies initramfs generation succeeded. The optimization wasn’t evil; it was incomplete. Reliability hates “all eggs, one basket” decisions, especially when the basket is a kernel.

Mini-story 3: The boring but correct practice that saved the day

A financial services team had a Proxmox cluster supporting internal CI and a few latency-sensitive services. Their patching practice was dull: one node at a time, evacuate VMs, update, reboot, validate, then proceed. They also had a strict habit: keep iKVM reachable and tested, and snapshot the boot configuration (kernel list, grub defaults, /etc/network/interfaces) before maintenance.

During one update, a node rebooted and came up with networking misconfigured due to an interface rename. Corosync failed. The node was “down” from the cluster’s perspective. But the team had evacuated workloads already. So there was no customer impact, only engineer pride impact.

They used console access to compare interface names, fixed the bridge ports, restarted corosync, and rejoined cleanly. The postmortem was almost boring: “interface name changed; we validated pre/post; no workload impact.” Boring is the goal. Drama is not a KPI.

That same boring discipline made the fix safe: no forced imports, no random reboots, no hasty quorum changes. Just controlled steps and verification at each stage. The most impressive incident response is the one nobody outside the team notices.

FAQ

1) Should I roll back the kernel or reinstall Proxmox?

Roll back the kernel first. Reinstalling Proxmox is last-resort and often unnecessary. A working older kernel plus repaired boot artifacts solves a large fraction of “failed after update” cases.

2) Can I safely remove the broken kernel package?

Yes, but only after you have confirmed stable boot with another kernel and you have enough /boot space. Use package tools, not manual deletion in /boot. Keep at least one fallback kernel installed.

3) My node boots, but the Proxmox UI is unreachable. Are my VMs dead?

Not necessarily. Check systemctl status pveproxy and also check VM processes via qm list (if available) or ps. The UI is a service; it can die while QEMU keeps running.

4) ZFS pool won’t import automatically after reboot. What’s the most common cause?

Hostid/cachefile issues and device naming changes. Import the pool explicitly, set the cachefile, and ensure the hostid is stable and intentional.

5) Is it safe to force-import a ZFS pool?

Sometimes. But it’s also a great way to convert uncertainty into consequences. If there’s any chance the pool is imported elsewhere or was moved between hosts, import read-only first and verify.

6) How do I know if /boot being full caused my failure?

Check df -h /boot. If it’s above ~90% and you recently installed kernels, you likely have incomplete initramfs images or failed postinst scripts. Rebuild initramfs after freeing space.

7) What’s the safe order to bring cluster services back?

Networking first (the interface corosync binds to must be up), then start corosync, then pve-cluster, then confirm /etc/pve is mounted, then check pve services.

8) Can I run a Proxmox node “standalone” if the cluster is broken?

Temporarily, yes, by stopping cluster services. But you’re opting out of cluster-managed configuration and safety rails. Use it to recover data or stabilize the node, then rejoin properly.

9) Why did a kernel update break networking?

Driver regressions, firmware interactions, or interface renaming. The kernel is the hardware contract. When that contract changes, your bridge/bond setup can fail in surprising ways.

10) If I fix boot, how do I prevent this next time?

Stage updates one node at a time, keep fallback kernels, monitor /boot space, validate initramfs generation, and test console access before you need it.

Conclusion: next steps you should actually do

To recover a Proxmox node after a bad update, prioritize reversibility: boot an older kernel, pin it, rebuild initramfs, and only then repair GRUB/EFI if needed. Validate storage import/activation before you touch cluster membership. Bring networking up before corosync. And don’t “force” anything until you’ve proven you’re operating on the right disks in the right place.

Next steps that pay for themselves:

  • Keep at least two working kernels installed and ensure /boot has headroom.
  • Test iKVM/IPMI console access quarterly, not during a crisis.
  • Document storage ownership (which pools live where) and prefer stable by-id device paths.
  • Adopt a one-node-at-a-time update workflow with VM evacuation and a real validation checklist.

If you do those four things, the next “failed after update” event becomes a 20-minute rollback, not a long weekend with increasingly creative swearing.
