Nothing spices up a quiet on-call shift like a server freezing, rebooting, and leaving you a single blunt message: Kernel panic. No graceful shutdown. No apology. Just a machine that decided it would rather stop existing than continue in a state it can’t trust.
If you run production systems long enough, you don’t ask whether you’ll see a kernel panic. You ask when, how loud, and whether the logs survive. The goal isn’t heroics. It’s repeatable diagnosis, safe containment, and preventing the sequel.
What a kernel panic actually is (and what it isn’t)
A Linux kernel panic is the kernel admitting it cannot continue safely. Not “the app crashed.” Not “the box is slow.” It’s the operating system core deciding that continuing could corrupt data, lose track of memory safety, or deadlock permanently. So it stops.
Panics are deliberate. They’re usually triggered by a code path that calls panic() (or an unrecoverable exception that leads there). The kernel prints what it knows, often including a stack trace, CPU state, and the reason (sometimes useful, sometimes insulting). You’ll often see one of these:
- “Kernel panic – not syncing: Attempted to kill init!” — PID 1 died or became unrecoverable.
- “Kernel panic – not syncing: VFS: Unable to mount root fs” — kernel can’t mount the root filesystem.
- “Fatal exception in interrupt” — something broke at a very bad time.
- “BUG: unable to handle kernel NULL pointer dereference” — kernel bug or bad driver behavior, maybe hardware corruption.
- “soft lockup” / “hard lockup” — CPU stuck, watchdog escalates to panic depending on settings.
What it isn’t:
- OOM kill — the kernel killing a process due to memory pressure is normal behavior. You lost a workload, not the kernel. A panic from OOM is rare and usually configuration-driven.
- Userspace crash — systemd, java, nginx, your “critical” Python script… none of those are kernel panics. Important, but different toolchain.
- Hardware reset — sudden reboot with no logs might be power, BMC watchdog, or a triple fault. Treat it differently until proven otherwise.
And yes, panics tend to happen in public—during deploys, maintenance windows, and executive demos. The kernel has impeccable comedic timing.
One idea worth keeping on your desk, paraphrased from John Allspaw: reliability comes from designing systems that assume failure, not from hoping failures won't happen.
Joke #1: A kernel panic is the only time your server is truly honest about its feelings: “I can’t even.”
Interesting facts and a little history
Context matters because a lot of panic behavior is historical baggage: old defaults, compatibility promises, and decades of “this is fine” layered into “this is on fire.” Here are some concrete facts that show up in real investigations:
- “Panic” is older than Linux. UNIX kernels used panic paths to stop the system on corruption risks long before Linux existed.
- Linux kept the blunt message on purpose. The kernel prints directly to consoles and serial lines because those tend to work when nothing else does.
- SysRq is a survival tool, not a toy. Magic SysRq keys existed to regain control during hangs; it remains one of the few last-ditch tools when userspace is toast.
- kdump wasn’t always mainstream. Crash dumping matured over years; early Linux debugging often meant “reproduce under serial console” and pain.
- “Oops” is not “panic.” An oops is a kernel exception that may allow continuing; a panic is the kernel refusing to continue. Ops teams treat repeated oopses as pre-panics.
- Watchdogs can force panics. Soft/hard lockup watchdogs exist because silent hangs are worse than loud deaths in production.
- initramfs made boot both better and worse. It enables modular boot and early userspace, but also adds a whole new failure zone: missing drivers, wrong UUIDs, broken hooks.
- Out-of-tree modules are a known risk multiplier. Proprietary drivers and kernel modules that don’t track the kernel’s internal API are frequent “works until it doesn’t” sources.
- Modern storage is a kernel-adjacent minefield. NVMe, multipath, iSCSI, RDMA, and filesystems like ZFS bring performance and complexity. Complexity enjoys panicking at 3 a.m.
Fast diagnosis playbook
This is the “stop bleeding” sequence. You do it the same way every time, even when Slack is yelling. Especially when Slack is yelling.
First: determine whether it was a panic, a reset, or a power event
- Was there a panic message on console/IPMI? If yes, it’s probably a real panic.
- Do you have a vmcore? If yes, it was a controlled-ish crash path (panic or kdump-triggered).
- No logs at all? Suspect power, firmware watchdog, BMC reset, or hardware fault first.
Second: classify the panic in one sentence
You want a quick label that guides next steps:
- Boot panic: VFS can’t mount root, initramfs issues, kernel cmdline wrong.
- Runtime panic: driver bug, filesystem bug, memory corruption, lockup watchdog.
- Data-path panic: storage timeouts, NVMe resets, HBA firmware, multipath storms, ZFS splats.
- Memory/CPU hardware: MCEs, EDAC errors, random oopses across unrelated subsystems.
Third: pick the fastest evidence source
- Best: vmcore + matching vmlinux debug symbols.
- Good: persistent journal + panic console output + dmesg from previous boot.
- Okay: BMC System Event Log, serial console logs, netconsole.
- Bad: “it rebooted, no one saw anything.” (You’ll fix that later.)
Fourth: decide whether to keep the node up
- One-off and no data corruption risk: bring it back, capture logs, plan deeper analysis.
- Repeated or storage-path panic: quarantine the node. Your priority is preventing filesystem damage and cascading failure.
- Possible hardware memory errors: take it out. Don’t run business workloads on a machine that might be lying about bits.
Capture the evidence: make panics diagnosable
Panics are rarely “one and done.” The kernel panicked for a reason. Your job is to make sure the next one leaves a body.
Kdump: the difference between guessing and knowing
kdump reserves memory for a crash kernel. When the main kernel panics, it boots the crash kernel and writes a vmcore dump. That dump is heavy, slow, and worth it.
Common failure mode: you “enabled kdump” but never tested it, so when the real panic happens, it writes nothing. That’s not a monitoring gap; that’s a leadership gap.
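The only way to know is to test the whole path before you need it. A minimal sketch, assuming Debian-style kdump-tools on a node you are allowed to crash (the last command panics the kernel immediately and reboots the box):
cr0x@server:~$ sudo kdump-config show                     # confirm the crash kernel is loaded and a dump target is set
cr0x@server:~$ echo 1 | sudo tee /proc/sys/kernel/sysrq   # enable Magic SysRq for this boot
cr0x@server:~$ echo c | sudo tee /proc/sysrq-trigger      # force a crash; kdump should boot and write a vmcore
When the node comes back, check for a fresh directory under /var/crash before you call kdump "enabled."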
Persistent logs: journal and console
Make the system keep logs across reboots. Also capture kernel output somewhere that survives: serial console logging, BMC SOL capture, netconsole, or pstore.
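A minimal sketch for persistent journald on a stock systemd distro (with the default Storage=auto, the directory's existence is the switch):
cr0x@server:~$ sudo mkdir -p /var/log/journal              # journald stores logs here instead of volatile /run
cr0x@server:~$ sudo systemctl restart systemd-journald
cr0x@server:~$ journalctl --list-boots | tail -n 3          # older boot IDs should now survive reboots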
Storage engineers’ note: panics can be self-inflicted
If your root filesystem lives on a complex storage stack (dm-crypt + LVM + mdraid + multipath + iSCSI + thin provisioning), your boot path is a Rube Goldberg machine. It can work. It can also fail in exciting ways.
Joke #2: Booting from network storage is like trusting a parachute packed by a committee—possible, but you want a checklist.
Practical tasks: commands, outputs, decisions (12+)
These are real “do this now” tasks. Each one includes what you’re looking at and what decision you make from it.
Task 1: Confirm previous boot crashed and why (journal)
cr0x@server:~$ journalctl -b -1 -k -p err..alert --no-pager | tail -n 40
Jan 21 02:13:19 server kernel: BUG: unable to handle page fault for address: ffff8a0a00000000
Jan 21 02:13:19 server kernel: #PF: supervisor read access in kernel mode
Jan 21 02:13:19 server kernel: Kernel panic - not syncing: Fatal exception
Jan 21 02:13:19 server kernel: CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.1.0-18-amd64 #1 Debian 6.1.76-1
Jan 21 02:13:19 server kernel: Call Trace:
Jan 21 02:13:19 server kernel: ? nvme_complete_rq+0x2a/0xd0
Jan 21 02:13:19 server kernel: ? blk_mq_end_request+0x2b/0x60
What it means: The prior boot ended with a kernel exception escalating to panic. Stack mentions NVMe completion path.
Decision: Treat as runtime panic likely in storage driver/firmware path. Quarantine node if it’s a fleet; don’t just reboot and forget.
Task 2: Check reboot reason and boot IDs (quick sanity)
cr0x@server:~$ last -x | head -n 8
reboot system boot 6.1.0-18-amd64 Tue Jan 21 02:15 still running
reboot system boot 6.1.0-18-amd64 Mon Jan 20 22:01 - crash (04:12)
What it means: The previous boot record ends in "crash" instead of a clean shutdown time, an unclean stop consistent with a panic or reset.
Decision: If “crash” aligns with journal panic lines, proceed with kernel analysis. If journal is empty, pivot to hardware/firmware reset investigation.
Task 3: See if kdump produced a vmcore
cr0x@server:~$ ls -lh /var/crash
total 4.0K
drwxr-xr-x 2 root root 4.0K Jan 21 02:15 127.0.1.1-2026-01-21-02:15:08
What it means: A crash directory exists; likely contains vmcore and metadata.
Decision: Preserve it. Copy it off the node before “cleanup” jobs run, and before someone reimages the server.
Task 4: Validate kdump service state (and whether it was armed)
cr0x@server:~$ systemctl status kdump-tools --no-pager
● kdump-tools.service - Kernel crash dump capture service
Loaded: loaded (/lib/systemd/system/kdump-tools.service; enabled)
Active: active (exited) since Tue 2026-01-21 02:16:01 UTC; 3min ago
Docs: man:kdump-tools(8)
Process: 612 ExecStart=/etc/init.d/kdump-tools start (code=exited, status=0/SUCCESS)
What it means: The service is enabled and appears healthy now. That’s not proof it worked at crash time, but it’s a start.
Decision: If disabled or failed, prioritize making dumps reliable before chasing root cause. No dump, no certainty.
Task 5: Confirm crashkernel reservation actually exists
cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.1.0-18-amd64 root=UUID=8b2c1d90-0a0f-4db3-bf90-4b3d6f1c2f3e ro quiet crashkernel=512M
What it means: Crash kernel memory reserved via kernel command line.
Decision: If missing, add it and reboot in a maintenance window. Otherwise your “kdump enabled” story is fiction.
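If it is missing, a Debian-style sketch of the fix (the 512M value is an assumption; size it for your RAM and your distro's guidance):
cr0x@server:~$ sudoedit /etc/default/grub      # append crashkernel=512M to GRUB_CMDLINE_LINUX_DEFAULT
cr0x@server:~$ sudo update-grub
cr0x@server:~$ # after the next reboot, confirm the reservation actually happened:
cr0x@server:~$ dmesg | grep -i crashkernel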
Task 6: Check for Machine Check Exceptions (hardware canaries)
cr0x@server:~$ journalctl -k -b -1 | egrep -i "mce|machine check|hardware error|edac" | tail -n 20
Jan 21 02:12:57 server kernel: mce: [Hardware Error]: Machine check events logged
Jan 21 02:12:57 server kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0
What it means: Corrected ECC errors occurred. One corrected error isn’t instant doom; patterns are.
Decision: If you see repeated corrected errors or any uncorrected errors, treat as hardware fault. Open a ticket with the vendor and pull the node from critical workloads.
Task 7: Inspect PCIe/NVMe error patterns before the panic
cr0x@server:~$ journalctl -k -b -1 | egrep -i "nvme|pcie|aer|reset|timeout" | tail -n 40
Jan 21 02:12:58 server kernel: nvme nvme0: I/O 487 QID 7 timeout, aborting
Jan 21 02:12:58 server kernel: nvme nvme0: Controller is down; will reset: CSTS=0x3
Jan 21 02:12:59 server kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:03:00.0
Jan 21 02:13:00 server kernel: nvme 0000:03:00.0: enabling device (0000 -> 0002)
What it means: Timeouts and controller resets often precede panics if a driver hits a bug path or the device misbehaves.
Decision: Check NVMe firmware/driver combos; consider firmware update, kernel update, or swapping the device. Also check PCIe slot/backplane issues.
Task 8: Verify root filesystem and initramfs content (boot panics)
cr0x@server:~$ lsinitramfs /boot/initrd.img-6.1.0-18-amd64 | egrep -i "nvme|ahci|virtio|dm-|raid|sd_mod" | head
usr/lib/modules/6.1.0-18-amd64/kernel/drivers/nvme/host/nvme.ko
usr/lib/modules/6.1.0-18-amd64/kernel/drivers/md/dm-crypt.ko
usr/lib/modules/6.1.0-18-amd64/kernel/drivers/md/dm-mod.ko
usr/lib/modules/6.1.0-18-amd64/kernel/drivers/md/raid1.ko
usr/lib/modules/6.1.0-18-amd64/kernel/drivers/scsi/sd_mod.ko
What it means: Early-boot drivers are present. Missing storage or crypto modules is a classic “VFS: unable to mount root” trigger.
Decision: If required modules are missing, rebuild initramfs and verify kernel cmdline UUIDs. Don’t keep rebooting hoping it changes its mind.
Task 9: Confirm block device identity and UUID mapping
cr0x@server:~$ blkid | head -n 6
/dev/nvme0n1p1: UUID="8b2c1d90-0a0f-4db3-bf90-4b3d6f1c2f3e" TYPE="ext4" PARTUUID="0f12c0d4-01"
/dev/nvme0n1p2: UUID="0d6d8a8c-0d90-4d8b-9d8b-7c8e6a6e2c1a" TYPE="swap" PARTUUID="0f12c0d4-02"
What it means: UUID exists and matches cmdline (compare with /proc/cmdline).
Decision: If mismatch, fix bootloader config. If device names changed (common under HBAs/multipath), stop using /dev/sdX in boot configs.
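A quick sanity check that the UUID on the cmdline exists on a current block device (reusing the UUID from the example above):
cr0x@server:~$ grep -o 'root=[^ ]*' /proc/cmdline
cr0x@server:~$ sudo blkid -U 8b2c1d90-0a0f-4db3-bf90-4b3d6f1c2f3e   # prints the device path; non-zero exit if nothing matches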
Task 10: Inspect kernel taint status (is a third-party module involved?)
cr0x@server:~$ cat /proc/sys/kernel/tainted
0
What it means: 0 means untainted: no known flags like proprietary modules or forced module loads.
Decision: If tainted is non-zero, decode it and treat third-party modules as suspects. Supportability changes fast when you’re running out-of-tree code.
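A minimal sketch for decoding the bitmask by hand; the authoritative bit meanings live in the kernel's Documentation/admin-guide/tainted-kernels.rst (bit 0 is a proprietary module, bit 12 an out-of-tree module, for example):
cr0x@server:~$ t=$(cat /proc/sys/kernel/tainted)
cr0x@server:~$ for i in $(seq 0 17); do [ $(( (t >> i) & 1 )) -eq 1 ] && echo "taint bit $i set"; done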
Task 11: Identify the exact kernel build and package provenance
cr0x@server:~$ uname -a
Linux server 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2025-12-18) x86_64 GNU/Linux
What it means: Kernel version, distro packaging, and build date. This is your baseline for known regressions.
Decision: If panic started after a kernel update, consider rolling back as containment while you analyze, especially for driver regressions.
Task 12: Check filesystem integrity signals after reboot (don’t guess)
cr0x@server:~$ dmesg | egrep -i "EXT4-fs|XFS|BTRFS|ZFS|recovery|journal" | tail -n 30
[ 2.911234] EXT4-fs (nvme0n1p1): recovery complete
[ 2.911980] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Quota mode: none.
What it means: Filesystem replayed journal cleanly. That’s good news; you still verify application data as needed.
Decision: If you see corruption warnings, force an offline fsck/xfs_repair plan and keep the node out of service until verified.
Task 13: Check for lockups that might have escalated
cr0x@server:~$ journalctl -k -b -1 | egrep -i "soft lockup|hard lockup|watchdog" | tail -n 30
Jan 21 02:12:50 server kernel: watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kworker/7:2:238]
Jan 21 02:13:12 server kernel: Kernel panic - not syncing: softlockup: hung tasks
What it means: Kernel watchdog detected a CPU stuck and panicked. Often due to deadlocks in drivers, storage stalls, or severe IRQ storms.
Decision: Treat as a kernel/driver or hardware stall. Check storage latency, IRQ distribution, and recent driver changes. Don’t “fix” by disabling watchdog unless you enjoy silent hangs.
Task 14: Pull BMC/firmware event logs when panics are suspiciously quiet
cr0x@server:~$ ipmitool sel list | tail -n 8
91 | 01/21/2026 | 02:13:18 | System Boot Initiated | Initiated by watchdog | Asserted
92 | 01/21/2026 | 02:13:19 | Power Unit | Power off/down | Asserted
93 | 01/21/2026 | 02:15:02 | Power Unit | Power cycle | Asserted
What it means: A watchdog initiated reboot/power cycle. That could be kernel hang, BMC policy, or platform fault.
Decision: If BMC triggered resets, investigate hardware health, watchdog configuration, and whether the OS was alive enough to log. This changes the whole incident narrative.
Task 15: Open the vmcore with crash (when you have real evidence)
cr0x@server:~$ crash /usr/lib/debug/boot/vmlinux-6.1.0-18-amd64 /var/crash/127.0.1.1-2026-01-21-02:15:08/vmcore
crash 8.0.2
VMLINUX: /usr/lib/debug/boot/vmlinux-6.1.0-18-amd64
DUMPFILE: /var/crash/127.0.1.1-2026-01-21-02:15:08/vmcore [PARTIAL DUMP]
CPUS: 16
DATE: Tue Jan 21 02:13:19 2026
PANIC: "Kernel panic - not syncing: Fatal exception"
TASKS: 3247
What it means: You have a usable dump and the panic string. Now you can inspect stacks, tasks, and kernel state.
Decision: If vmcore loads cleanly, stop theorizing in chat and start extracting a real root cause hypothesis with evidence.
Task 16: Get the backtrace and identify the subsystem
cr0x@server:~$ echo "bt" | crash -s /usr/lib/debug/boot/vmlinux-6.1.0-18-amd64 /var/crash/127.0.1.1-2026-01-21-02:15:08/vmcore
PID: 0 TASK: ffffffff81a13c80 CPU: 3 COMMAND: "swapper/3"
#0 [ffffb3b740003d90] machine_kexec at ffffffff8105b2ad
#1 [ffffb3b740003df0] __crash_kexec at ffffffff8112a7b2
#2 [ffffb3b740003ec0] panic at ffffffff8122c1e5
#3 [ffffb3b740003f40] oops_end at ffffffff810f3b9a
#4 [ffffb3b740003f60] page_fault_oops at ffffffff8107d1d2
#5 [ffffb3b740003fd0] exc_page_fault at ffffffff81c0167a
#6 [ffffb3b7400040b0] asm_exc_page_fault at ffffffff81e00b2b
#7 [ffffb3b740004140] nvme_complete_rq at ffffffff815a1e2a
What it means: The crashing context points to NVMe completion handling. This doesn’t prove NVMe is “the problem,” but it’s your best lead.
Decision: Correlate with NVMe reset/timeouts, firmware version, and known kernel issues. Consider updating firmware or kernel, or swapping hardware to confirm.
Failure modes that actually cause panics
1) Boot-time: can’t mount root filesystem
Panics that mention VFS, root=, initramfs, or “not syncing” during boot are often configuration or packaging issues. The kernel cannot proceed because it can’t find or mount /. Causes include:
- Wrong root=UUID= after disk replacement or cloning.
- Initramfs missing storage drivers (common after kernel install glitches or custom images).
- Encrypted/LVM root where hooks didn't build, or /etc/crypttab changed without regenerating initramfs.
- Multipath naming changes causing "root device" confusion.
What to do: Confirm cmdline, confirm UUIDs, inspect initramfs contents, and boot into a rescue environment when needed. Rebuilding initramfs is often the right fix. Reinstalling the OS is usually a lazy fix.
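A minimal sketch of the rebuild-and-verify loop, assuming a Debian-style toolchain (RHEL-family systems use dracut --force instead):
cr0x@server:~$ sudo update-initramfs -u -k all             # rebuild initramfs for all installed kernels
cr0x@server:~$ lsinitramfs /boot/initrd.img-$(uname -r) | grep -Ei 'nvme|dm-|raid'   # confirm the storage stack made it in
cr0x@server:~$ cat /proc/cmdline                           # compare root= against blkid output before rebooting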
2) Runtime: buggy driver paths (network, GPU, storage, filesystem)
Drivers run in kernel space. When they go wrong, they go wrong with privileges. Storage drivers are especially spicy because they involve interrupts, DMA mapping, timeouts, and retries under heavy load.
What to do: Correlate panic stack trace with kernel logs: resets, timeouts, AER messages, link flaps. If you see “tainted” flags, treat out-of-tree modules as prime suspects.
3) Memory corruption (hardware or software) that looks like everything
The most annoying panics are the ones that don’t stay in one subsystem. Today it’s ext4. Tomorrow it’s TCP. Next week it’s page allocator. That pattern often means memory corruption—sometimes a kernel bug, often hardware (DIMM, CPU, motherboard) or unstable overclock/undervolt settings in “performance” servers.
What to do: Look for MCE/EDAC signals, run memory tests in maintenance, and compare crash signatures across nodes. If multiple unrelated stacks appear, stop blaming “the kernel” generically and start treating the host as suspect.
4) Watchdog-induced panics from lockups
Soft lockups mean the kernel scheduler isn’t making progress on a CPU. Hard lockups are worse. Watchdogs exist because silent hangs are operational poison: your node stays “up” but does nothing.
What to do: Don’t disable watchdogs as the first move. Investigate what was running, IRQ storms, storage latency, and deadlocks. Use vmcore to find blocked tasks and locks when possible.
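The escalation behavior is controlled by ordinary sysctls; a sketch of what to inspect before anyone proposes switching the watchdog off (the names are standard, but availability and defaults depend on the kernel config):
cr0x@server:~$ sysctl kernel.watchdog kernel.watchdog_thresh kernel.softlockup_panic kernel.hardlockup_panic
cr0x@server:~$ # softlockup_panic/hardlockup_panic = 1 means a detected lockup escalates to a panic (and a vmcore, if kdump is armed)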
5) Filesystem and storage stack edge cases
Filesystems can panic the kernel when internal invariants break (or when the kernel chooses to stop rather than risk corruption). Some are more “panic-forward” than others, often by design. Storage stacks can trigger panics via bugs, memory pressure, or block layer issues.
What to do: Treat the storage path as a system: firmware, cabling/backplane, PCIe errors, multipath settings, timeouts, queue depths, and kernel version. A kernel panic in storage is rarely “just storage” or “just kernel.” It’s the interface between them.
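When you review that interface, record the timeouts actually in force rather than the ones people remember setting. A hedged sketch; the sysfs paths assume an NVMe device named nvme0n1 and vary by kernel version:
cr0x@server:~$ cat /sys/module/nvme_core/parameters/io_timeout   # NVMe command timeout, in seconds
cr0x@server:~$ cat /sys/block/nvme0n1/queue/io_timeout           # block-layer request timeout, in milliseconds
cr0x@server:~$ cat /sys/block/nvme0n1/queue/nr_requests          # queue depth the block layer allows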
Three corporate-world mini-stories (anonymized)
Mini-story 1: The incident caused by a wrong assumption
They had a fleet of Linux nodes booting from local NVMe, with a small RAID1 for the root filesystem. Someone replaced a failed drive during a routine maintenance window. It was a good day: no alarms, no customer impact, clean swap.
Then the next reboot happened. The node dropped into a panic: VFS: Unable to mount root fs on unknown-block. The on-call assumed it was a “bad kernel update” because the reboot coincided with patching. They rolled back the kernel. Same panic. They reinstalled the bootloader. Same panic. They started muttering about vendor kernels and bad luck.
The real issue was simpler and more embarrassing: the bootloader config referenced the root device by a stale UUID copied from an older image. The system had been surviving because the old drive still existed and the boot order was “lucky.” The replacement changed enumeration just enough to stop the luck.
The fix was to use stable identifiers consistently (UUID/PARTUUID everywhere), rebuild initramfs, and validate that the boot path matched reality. The lasting change was cultural: no more “it’ll be fine” boot configs. Boot is part of production.
Mini-story 2: The optimization that backfired
A performance initiative targeted IO latency in a busy analytics cluster. Someone tuned NVMe settings, adjusted queue depths, and enabled aggressive CPU power management to “reduce jitter.” Benchmarks looked great. Everyone high-fived the graphs.
A week later, nodes started panicking under peak load. The stack traces pointed at the storage completion path and occasionally at scheduler code. It was intermittent, impossible to reproduce in a lab, and only happened when the system was both CPU-busy and IO-busy. That combination is where “mostly safe” optimizations go to die.
After a painful vmcore analysis and a lot of correlation, they found the pattern: a firmware/driver combination didn’t like the new power-state transitions at scale. The “optimization” increased the rate of controller resets and raced a rarely-used code path. The panic wasn’t from the power management tweak alone—it was the interaction.
The right move was boring: revert the power-management change, update NVMe firmware in a controlled rollout, and pin a kernel version known to behave with that hardware. Latency worsened slightly. Uptime improved a lot. The business preferred uptime, shockingly.
Mini-story 3: The boring but correct practice that saved the day
A payments-adjacent platform ran a mixed fleet with strict change control. Their kernel updates were staged, their kdump configuration was tested quarterly, and serial console output was captured centrally. Nobody loved this. Everyone benefited from it.
One night, a subset of nodes panicked after a storage path flap during a datacenter maintenance event. The panic happened fast, and the nodes rebooted. The service stayed up because the workload scheduler drained nodes automatically when heartbeats failed.
The next morning, instead of arguments, they had artifacts: vmcores, console logs, and a clear timeline from journal persistence. The crash dumps showed blocked tasks in the block layer after repeated timeouts, and the console logs captured PCIe AER errors just before the panic.
The fix wasn’t a magical kernel patch. It was replacing a flaky backplane and updating a firmware bundle that the vendor had been quietly recommending. The postmortem was calm because the evidence was complete. “Boring practices” didn’t prevent the failure, but they prevented the chaos—and that’s most of what ops is.
Common mistakes: symptoms → root cause → fix
1) Symptom: “Kernel panic – not syncing: VFS: Unable to mount root fs” after an update
Root cause: Initramfs missing a required storage/crypto module, or root UUID mismatch after image changes.
Fix: Boot a rescue kernel, verify /proc/cmdline intent vs blkid reality, rebuild initramfs (dracut/update-initramfs), and reinstall bootloader if needed.
2) Symptom: Random panics with unrelated stack traces across weeks
Root cause: Memory corruption, often hardware (DIMM) or platform instability.
Fix: Check MCE/EDAC logs, run memory diagnostics, swap DIMMs/hosts, and don’t keep the box in production “because it rebooted fine.”
3) Symptom: Panic mentions “Attempted to kill init!”
Root cause: PID 1 crashed (systemd or init) or was killed due to catastrophic userspace failure (disk full on root, corrupted binaries, broken libc) or kernel bug.
Fix: Examine previous boot journal for userspace failures and filesystem errors. Validate storage health. If PID 1 died due to SIGKILL from OOM, address memory pressure and constraints; if due to disk corruption, repair and restore.
4) Symptom: Panic preceded by NVMe timeouts and resets
Root cause: NVMe firmware bug, PCIe signal integrity, kernel driver regression, or power management interaction.
Fix: Correlate resets with load and kernel version; update firmware; consider kernel update/rollback; check AER logs; reseat/replace device or slot/backplane.
5) Symptom: “soft lockup” messages then panic
Root cause: CPU stuck due to driver deadlock, interrupt storm, or long non-preemptible section under load.
Fix: Use vmcore to identify stuck threads and locks; check IRQ distribution and device errors; upgrade kernel if it’s a known deadlock; avoid disabling watchdog as a “solution.”
6) Symptom: Panics only on one kernel version across the fleet
Root cause: Kernel regression or changed defaults (IOMMU, scheduler, driver behavior).
Fix: Pin a known-good kernel as containment; test candidate versions on a canary ring; gather dumps to file a proper upstream/vendor bug with evidence.
7) Symptom: No panic logs, just sudden reboot
Root cause: Power loss, BMC watchdog reset, firmware bug, or kernel crash too early for logging.
Fix: Pull BMC SEL logs, enable serial console/netconsole/pstore, and verify crashkernel + kdump. Treat “no logs” as an infrastructure failure, not a mystery to tolerate.
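A minimal Debian-style sketch for getting kernel output onto a serial line the BMC can capture (the port and speed are assumptions; match your platform):
cr0x@server:~$ sudoedit /etc/default/grub
cr0x@server:~$ # add: GRUB_CMDLINE_LINUX_DEFAULT="quiet console=tty0 console=ttyS0,115200n8"
cr0x@server:~$ #      GRUB_TERMINAL="console serial"
cr0x@server:~$ #      GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=0 --word=8 --parity=no --stop=1"
cr0x@server:~$ sudo update-grub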
8) Symptom: Panic after enabling a “harmless” kernel module
Root cause: Out-of-tree or poorly tested module; ABI mismatch; tainted kernel causing instability.
Fix: Remove module, return to vendor-supported stack, and reproduce in staging. If you must run it, pin kernel versions and keep a rollback plan that’s practiced.
Checklists / step-by-step plan
Checklist A: When a node panics in production (first 30 minutes)
- Contain: remove node from load balancer / scheduler; avoid repeated crash loops that chew disks.
- Preserve evidence: confirm /var/crash; copy the crash directory to safe storage; snapshot logs.
- Collect kernel logs: journalctl -b -1 -k and dmesg from the current boot.
- Check hardware signals: MCE/EDAC logs, BMC SEL, storage error patterns.
- Classify: boot vs runtime vs storage-path vs hardware-like randomness.
- Pick containment: roll back kernel/driver, swap hardware, or keep quarantined for deeper analysis.
Checklist B: Make future panics diagnosable (one maintenance window)
- Enable and test kdump with a controlled crash on a non-critical node.
- Ensure crashkernel memory is reserved and sufficient for your RAM size.
- Enable persistent journald storage and verify previous boot logs survive.
- Configure serial console logging or SOL capture in your environment.
- Decide where vmcores live and how they’re rotated (and who is allowed to delete them).
- Establish kernel update canaries and rollback procedures that aren’t “pray and reboot.”
Checklist C: Storage-path hardening that reduces panic probability
- Keep firmware (NVMe/HBA/backplane) aligned with vendor recommendations.
- Standardize kernel versions per hardware generation; avoid “snowflake kernels.”
- Monitor and alert on precursors: NVMe timeouts, AER corrected errors, link resets.
- Set sane timeouts and retry policies; overly aggressive tuning can trigger corner cases.
- Test failover paths (multipath, RAID rebuild) under load, not just in slides.
FAQ
1) What’s the difference between a kernel oops and a kernel panic?
An oops is a kernel exception that may allow the kernel to keep running (sometimes limping). A panic is the kernel halting (or rebooting) because it cannot continue safely. Repeated oopses are often a prelude to a panic.
2) Should I set the kernel to auto-reboot on panic?
In most production environments, yes: set a reasonable kernel.panic reboot delay so the node returns and your scheduler can replace capacity. But pair it with kdump and persistent logs; rebooting quickly without evidence is just faster amnesia.
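A minimal sketch of the relevant sysctls (the values are illustrative, not a recommendation for every fleet):
cr0x@server:~$ sudo sysctl -w kernel.panic=10          # reboot 10 seconds after a panic instead of hanging at the console
cr0x@server:~$ sudo sysctl -w kernel.panic_on_oops=1   # escalate oopses so kdump captures them instead of limping on
cr0x@server:~$ # persist both in a file under /etc/sysctl.d/ so they survive reboots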
3) Can a full disk cause a kernel panic?
Not directly in the common case, but it can cause userspace failures that kill critical processes (including PID 1), which can lead to “Attempted to kill init.” Disk-full can also exacerbate corruption recovery paths. Treat disk-full as a serious reliability bug.
4) If the stack trace mentions NVMe, is the NVMe drive definitely at fault?
No. It’s a lead, not a verdict. NVMe timeouts can come from firmware, PCIe path issues, power management, kernel regressions, or the device itself. Correlate with resets/timeouts/AER logs and, ideally, vmcore analysis.
5) Why do panics happen more during heavy IO?
Heavy IO hits complex concurrency: interrupts, DMA mapping, lock contention, and error recovery. Bugs and marginal hardware often hide until those paths are stressed. Production load is a better test than most labs.
6) Is it acceptable to disable watchdogs to stop “soft lockup” panics?
As a temporary containment in a very specific scenario, maybe. As a general fix, no. You’re trading a loud failure for a silent hang, which is worse operationally and harder to detect. Fix the underlying stall.
7) How do I know if my kernel is “tainted,” and why does it matter?
/proc/sys/kernel/tainted shows a bitmask. Taint often indicates proprietary modules, forced module loads, or other conditions that make upstream support harder and can correlate with instability. If tainted, scrutinize third-party modules first.
8) Do containers or Kubernetes “cause” kernel panics?
Containers don’t get special kernel privileges, but they increase load, kernel feature usage (cgroups, overlayfs, networking), and stress patterns. Kubernetes also creates rapid churn that can expose race conditions. The panic is still kernel/driver/hardware—but orchestration changes the blast radius.
9) What’s the single best investment to reduce time-to-root-cause?
Reliable crash dumps (kdump) plus a tested procedure to retrieve them. Everything else—arguments, theories, “I saw this once”—is slower.
Conclusion: next steps you can do this week
If you run Linux in production, you don’t eliminate kernel panics. You make them rare, diagnosable, and non-catastrophic.
- Pick one node and prove kdump works end-to-end: reservation, trigger, vmcore capture, retrieval.
- Make logs survive reboots and capture console output somewhere durable.
- Define containment for panic classes: boot panics, storage-path panics, hardware-suspected panics.
- Stop guessing when you have evidence. vmcores turn drama into engineering.
- Audit your “optimizations”—especially around power management and storage queueing—because the kernel has no patience for cleverness that wasn’t tested under real load.
And when Linux says “nope” in public, it’s not being rude. It’s refusing to corrupt your data quietly. Respect the honesty. Then go get the dump.