“It just rebooted.” The most expensive sentence in operations, usually delivered with the confidence of someone who didn’t get paged at 03:17. You check uptime, see it’s fresh, and everyone shrugs like servers are moody houseplants.
They’re not. Random reboots almost always leave evidence. The trick is knowing where the evidence lands, how it lies to you, and which missing logs are themselves the smoking gun. This is the log trail you’re ignoring—and how to follow it like you mean it.
What “random reboot” really means (and why it’s rarely random)
A Linux host “randomly rebooted” is a story. Your job is to turn it into a timeline with a cause. Most reboots fall into a few buckets:
- Orderly reboot: someone or something ran reboot or shutdown, or initiated a kernel update flow.
- Crash reboot: kernel panic, BUG, lockup, or watchdog fired; sometimes with kdump, sometimes without.
- Power event: PSU hiccup, PDU trip, BMC/firmware reset, hypervisor reset, or a “helpful” remote hands.
- Management controller action: IPMI power cycle, iDRAC/iLO policy, or automated remediation.
- Virtualization event: host rebooted, VM got reset, live migration failed, or snapshot chain went feral.
The main mistake: treating “reboot” as a single event. It’s two events: the end of one boot (the failure) and the start of the next boot (the recovery). Linux logs will talk about both, but not in the same place or with the same honesty.
One idea worth keeping on your desk, paraphrasing W. Edwards Deming: without measurement, you're mostly guessing.
In reboots, the “measurement” is your logs, firmware event records, and crash dumps. If you don’t have them, you’re guessing.
Fast diagnosis playbook (first/second/third)
First: establish the reboot type in 5 minutes
- Was it orderly? Check systemd journal for a shutdown sequence, and “reboot” reason lines.
- Was it a crash? Look for panic/oops/watchdog and whether logs abruptly stop.
- Was it power? Compare OS logs with BMC SEL logs and hypervisor events. Missing OS shutdown logs often means the OS didn’t get a vote.
Second: decide which trail is authoritative
- Physical server: BMC SEL + kernel logs + MCE are king.
- VM: hypervisor events + guest journal + virtual disk errors.
- Cloud instance: provider reboot history + serial console logs + guest journal.
Third: isolate the failure domain
- Compute: MCE, lockups, thermal throttling, microcode, watchdog.
- Memory: ECC corrected storms, DIMM failures, poison, NUMA weirdness.
- Storage: NVMe controller reset, SATA link resets, ext4/xfs abort, multipath flaps, firmware timeouts.
- Network: NIC firmware resets can wedge the kernel in drivers, especially under load.
- Power: PSU, PDU, UPS, datacenter maintenance, loose cable (yes, still).
Fast rule: if the journal ends mid-sentence, suspect power or hard reset. If it contains a clean shutdown sequence, suspect humans, automation, or a graceful panic with reboot policy.
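That fast rule can be mechanized. A minimal sketch, assuming you feed it the tail of the previous-boot journal: classify_prev_boot is a hypothetical helper, but the marker strings it greps for are standard systemd shutdown messages.

```shell
#!/bin/sh
# Classify the tail of a previous-boot journal as "orderly" or "abrupt".
# Orderly shutdowns leave systemd breadcrumbs; power loss leaves silence.
# classify_prev_boot is a hypothetical helper name.
classify_prev_boot() {
  if grep -qE 'Reached target .*(Shutdown|Power-Off|Reboot)|systemd-shutdown|Shutting down\.'; then
    echo orderly
  else
    echo abrupt
  fi
}

# Assumed usage on a live host:
#   journalctl -b -1 --no-pager | tail -n 200 | classify_prev_boot
```

"abrupt" is not a verdict, just a routing decision: it sends you to the SEL and hypervisor logs instead of the kernel trace.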
Joke #1: A “random reboot” is like a magician’s trick—misdirection works until you check the other hand (IPMI SEL).
The log sources map: who knows what, and when
1) systemd-journald: the narrative, with footnotes
On most modern distros, journald is your primary timeline. It tags entries by boot ID, can show the previous boot, and often captures kernel messages too. But it’s not magic: if the machine loses power, the journal may not flush. That absence is itself evidence.
2) Kernel ring buffer (dmesg): the crime scene audio
dmesg is a view of the kernel ring buffer. It’s useful for driver issues, storage link resets, and panic traces. But it only covers the current boot; for the previous boot you need journal-backed kernel logs (journalctl -k -b -1).
3) /var/log/wtmp (last) and /var/log/btmp: the guestbook
last can tell you when the system booted and who logged in. It’s coarse, but it’s fast, and it survives a lot of chaos. If wtmp is missing, rotated badly, or corrupted, that’s another clue.
4) BMC / IPMI SEL: the hardware’s diary
Server management controllers keep a System Event Log (SEL): power cycles, watchdog, thermal events, voltage irregularities. This is the log the OS can’t forge. If SEL says “Power Unit lost AC,” stop arguing with it.
5) Crash dumps (kdump): the autopsy
If you have kdump configured, a kernel crash can leave a vmcore. That’s not just nice-to-have. It’s the difference between “maybe a driver” and “here is the exact stack and lock.”
6) Storage layer logs: where reboots pretend to be “application issues”
Storage errors often precede reboots in a way people miss. NVMe resets, SCSI timeouts, ext4 journal aborts, and controller firmware hiccups can wedge the system until watchdog reboots it. If your logs contain long I/O stalls before the reboot, your reboot was probably a storage story wearing a kernel costume.
Practical tasks: commands, outputs, decisions (12+)
Below are practical tasks you can run on a real system. For each one: what to run, what the output means, and what decision you make from it. Run them in roughly this order when you’re under pressure.
Task 1: Confirm reboot times and rough count
cr0x@server:~$ uptime
10:18:02 up 1 day, 2:41, 2 users, load average: 0.54, 0.62, 0.70
Meaning: This only tells you the last boot time. Not why.
Decision: If uptime is unexpectedly low, you need previous boot logs (-b -1) immediately.
Task 2: Use wtmp to see reboots and shutdowns
cr0x@server:~$ last -x | head -n 12
reboot system boot 6.8.0-41-generic Mon Feb 3 07:36 still running
shutdown system down 6.8.0-41-generic Mon Feb 3 07:35 - 07:36 (00:01)
reboot system boot 6.8.0-41-generic Sun Feb 2 22:11 - 07:35 (09:24)
crash system down 6.8.0-41-generic Sun Feb 2 22:10 - 22:11 (00:01)
Meaning: “shutdown” indicates an orderly shutdown recorded by init/systemd. “crash” can show an unclean stop.
Decision: If there’s no “shutdown” before “reboot,” suspect power loss, reset button, BMC power cycle, or hard hang.
Task 3: Find reboot reason hints in the previous boot journal
cr0x@server:~$ journalctl -b -1 -p warning --no-pager | tail -n 30
Feb 02 22:09:44 server kernel: nvme nvme0: I/O 987 QID 5 timeout, aborting
Feb 02 22:09:45 server kernel: nvme nvme0: Abort status: 0x371
Feb 02 22:09:49 server kernel: INFO: task kworker/u64:2 blocked for more than 120 seconds.
Feb 02 22:10:12 server kernel: watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [ksoftirqd/23:93]
Feb 02 22:10:13 server kernel: Kernel panic - not syncing: softlockup: hung tasks
Meaning: This is classic: storage timeouts → blocked kernel tasks → watchdog lockup → panic.
Decision: Stop chasing “random kernel panic.” Treat storage/NVMe path as primary suspect.
Task 4: Check whether the journal is persistent (critical for post-mortems)
cr0x@server:~$ grep -R "^[# ]*Storage=" /etc/systemd/journald.conf
#Storage=auto
Meaning: auto uses persistent storage if /var/log/journal exists; otherwise volatile.
Decision: If you don’t have /var/log/journal, create it and restart journald. Without persistence, crashes look “mysterious” because you threw away the past.
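One way to make the setting explicit is a journald drop-in; the drop-in path and the size cap here are illustrative assumptions, while Storage=persistent itself creates /var/log/journal if it is missing.

```ini
# /etc/systemd/journald.conf.d/10-persistent.conf  (illustrative path)
[Journal]
Storage=persistent
SystemMaxUse=1G
```

Restart journald (systemctl restart systemd-journald) and verify after the next reboot that journalctl -b -1 actually returns logs.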
Task 5: List boots and align the timeline
cr0x@server:~$ journalctl --list-boots | head
-2 2c6c0a2a0b1f4cc2b4e0a3f57a8d8f55 Sun 2026-02-02 12:01:17 UTC—Sun 2026-02-02 22:11:01 UTC
-1 91f5d1f5e72a4a00b2b0c2b3f9edaa4c Sun 2026-02-02 22:11:10 UTC—Mon 2026-02-03 07:36:01 UTC
0 7bb2c1fdc77a4c1e9c3ce2c1b6a11e0b Mon 2026-02-03 07:36:10 UTC—Mon 2026-02-03 10:18:20 UTC
Meaning: Boot IDs let you pivot precisely and compare “end of boot -1” to “start of boot 0.”
Decision: Identify the boot that ended badly, then inspect the last 5 minutes of that boot and the first 2 minutes of the next.
Task 6: Pull the last minutes of the previous boot (where the body is buried)
cr0x@server:~$ journalctl -b -1 --no-pager | tail -n 80
Feb 02 22:09:41 server systemd[1]: Started Daily apt download activities.
Feb 02 22:09:44 server kernel: nvme nvme0: I/O 987 QID 5 timeout, aborting
Feb 02 22:10:12 server kernel: watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [ksoftirqd/23:93]
Feb 02 22:10:13 server kernel: Kernel panic - not syncing: softlockup: hung tasks
Feb 02 22:10:13 server kernel: Rebooting in 5 seconds..
Meaning: If you see “Rebooting in 5 seconds,” the kernel chose to reboot (panic behavior). If the log stops abruptly, it didn’t.
Decision: Panic implies a crash path; begin kdump checks and driver/firmware triage.
Task 7: Look for explicit shutdown/reboot requests (humans and automation)
cr0x@server:~$ journalctl -b -1 -u systemd-logind --no-pager | grep -E "Power key|reboot|shutdown" | tail -n 20
Feb 02 22:07:03 server systemd-logind[933]: Power key pressed short.
Feb 02 22:07:03 server systemd-logind[933]: System is rebooting.
Meaning: Someone pressed the power button (or the chassis/BMC sent the event).
Decision: Correlate with physical access logs, BMC user sessions, and SEL. Don’t accuse the kernel of what a finger did.
Task 8: Check for watchdog resets (hardware or software)
cr0x@server:~$ journalctl -k -b -1 --no-pager | grep -i watchdog | tail -n 30
Feb 02 22:10:12 server kernel: watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [ksoftirqd/23:93]
Feb 02 22:10:13 server kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu 7
Meaning: Watchdog messages indicate prolonged CPU stalls. Often caused by driver deadlocks, interrupt storms, or I/O stalls that block kernel threads.
Decision: If watchdog is involved, start hunting for the upstream stall (storage timeouts, NIC resets, known kernel bugs) rather than treating watchdog as the cause.
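Hunting the upstream stall can start mechanically: print everything that looks like a stall before the first watchdog line. A sketch; stall_before_watchdog is a hypothetical helper, and the patterns are common kernel messages, not an exhaustive list.

```shell
#!/bin/sh
# Print stall-looking lines that occur BEFORE the first watchdog message,
# i.e. candidate causes rather than the messenger.
stall_before_watchdog() {
  awk '
    /watchdog/ { exit }   # stop at the messenger
    /timeout|blocked for more than|hard resetting|I\/O error/ { print }
  '
}

# Assumed usage: journalctl -k -b -1 --no-pager | stall_before_watchdog
```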
Task 9: Check Machine Check Exceptions (MCE) and ECC memory signals
cr0x@server:~$ journalctl -k -b -1 --no-pager | grep -iE "mce|machine check|hardware error" | tail -n 40
Feb 02 22:08:21 server kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 8: b200000000070005
Feb 02 22:08:21 server kernel: mce: [Hardware Error]: TSC 0 ADDR fef1a140 MISC d012000100000000
Feb 02 22:08:21 server kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1770070101 SOCKET 0 APIC 0 microcode 0x2f
Meaning: Hardware errors can be corrected (no reboot) or fatal (panic/reset). Repeated corrected errors are still actionable; they’re the “check engine light.”
Decision: If MCE exists near the reboot, involve hardware vendor/support and check DIMM/CPU/VRM health. Don’t “patch around” physics.
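Before calling the vendor, it helps to know whether the errors cluster. A sketch that tallies Machine Check lines per bank; a single dominant bank often points at one DIMM or CPU domain. mce_per_bank is a hypothetical helper that parses the "Bank N:" token in the standard mce log line.

```shell
#!/bin/sh
# Tally Machine Check lines per bank from kernel log text.
# mce_per_bank is a hypothetical helper; uses POSIX awk match()/substr().
mce_per_bank() {
  awk '
    match($0, /Bank [0-9]+/) {
      bank = substr($0, RSTART, RLENGTH)
      count[bank]++
    }
    END { for (b in count) printf "%s: %d\n", b, count[b] }
  '
}

# Assumed usage:
#   journalctl -k --no-pager | grep -i "machine check" | mce_per_bank
```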
Task 10: On physical servers, read IPMI SEL for power and thermal events
cr0x@server:~$ ipmitool sel list | tail -n 12
2a0 | 02/02/2026 | 22:09:58 | Power Unit #0x01 | Power lost | Asserted
2a1 | 02/02/2026 | 22:10:02 | Power Unit #0x01 | Power restored | Asserted
2a2 | 02/02/2026 | 22:10:05 | System Event | System Boot Initiated | Asserted
Meaning: SEL confirms an AC/power event. The OS logs may look like a crash because they ended abruptly, but the real cause was “no electricity.”
Decision: Escalate to facilities/power chain (PDU/UPS), check redundant PSUs, and inspect datacenter change logs. Kernel debugging is wasted time here.
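If you scrape SEL centrally, a coarse filter for power and reset assertions is enough to triage quickly. A sketch; sel_power_events is a hypothetical helper, and the event text varies by vendor, so the patterns are examples rather than a complete list.

```shell
#!/bin/sh
# Pull power/reset-looking assertions out of an `ipmitool sel list` dump.
# sel_power_events is a hypothetical filter; patterns are vendor-dependent.
sel_power_events() {
  grep -iE 'power (lost|restored|off|cycle)|watchdog|ac lost'
}

# Assumed usage: ipmitool sel list | sel_power_events
```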
Task 11: Check if kdump captured a vmcore
cr0x@server:~$ sudo ls -lhR /var/crash
/var/crash:
total 4.0K
drwxr-xr-x 2 root root 4.0K Feb 2 22:10 127.0.1.1-2026-02-02-22:10:14

/var/crash/127.0.1.1-2026-02-02-22:10:14:
total 2.1G
-rw------- 1 root root 2.1G Feb 2 22:12 vmcore
-rw-r--r-- 1 root root 18K Feb 2 22:12 vmcore-dmesg.txt
Meaning: You have a crash dump. This is gold: you can identify the crashing thread, locks, and modules involved.
Decision: Preserve it (copy off-host), then do analysis with crash tooling (or hand to kernel support). Don’t reboot five more times and overwrite the evidence.
Task 12: Detect OOM killer and memory pressure events
cr0x@server:~$ journalctl -b -1 -k --no-pager | grep -iE "oom-killer|Out of memory|Killed process" | tail -n 30
Feb 02 21:58:34 server kernel: Out of memory: Killed process 24891 (java) total-vm:18742000kB, anon-rss:14230000kB
Feb 02 21:58:34 server kernel: oom_reaper: reaped process 24891 (java), now anon-rss:0kB, file-rss:0kB
Meaning: OOM killing isn’t a reboot by itself, but it can cascade into watchdog resets if critical daemons die, or if the system thrashes.
Decision: If OOM precedes reboot, fix memory limits, cgroups, overcommit settings, and swap strategy. Also check whether an external watchdog rebooted after service death.
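Knowing which processes died, and how often, shapes the fix. A sketch that summarizes OOM victims from kernel log text; oom_victims is a hypothetical helper that parses the standard "Killed process PID (name)" line.

```shell
#!/bin/sh
# Summarize which processes the OOM killer took out, and how often.
# oom_victims is a hypothetical helper; POSIX awk, no gawk extensions.
oom_victims() {
  awk '
    match($0, /Killed process [0-9]+ \(([^)]*)\)/) {
      line = substr($0, RSTART, RLENGTH)   # "Killed process 24891 (java)"
      gsub(/^.*\(|\)$/, "", line)          # -> "java"
      victims[line]++
    }
    END { for (v in victims) printf "%s: %d\n", v, victims[v] }
  '
}

# Assumed usage: journalctl -k --no-pager | oom_victims
```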
Task 13: Check for filesystem errors that can force remount-ro or hangs
cr0x@server:~$ journalctl -b -1 -k --no-pager | grep -iE "EXT4-fs error|XFS.*corruption|Buffer I/O error|blk_update_request" | tail -n 40
Feb 02 22:09:31 server kernel: blk_update_request: I/O error, dev nvme0n1, sector 2039488 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Feb 02 22:09:33 server kernel: EXT4-fs error (device nvme0n1p2): ext4_find_entry:1539: inode #2: comm systemd: reading directory lblock 0
Feb 02 22:09:33 server kernel: EXT4-fs (nvme0n1p2): Remounting filesystem read-only
Meaning: Storage errors bubbled up to filesystem errors. Remount read-only can cause services to fail, then watchdog or orchestration may reboot.
Decision: Treat as storage reliability incident. Run SMART/NVMe health, review controller/firmware, and schedule fsck where appropriate.
Task 14: Check NVMe health and error log (common reboot precursor)
cr0x@server:~$ sudo nvme smart-log /dev/nvme0 | sed -n '1,25p'
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0x00
temperature : 47 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 4%
media_errors : 12
num_err_log_entries : 98
warning_temp_time : 0
critical_comp_time : 0
Meaning: Media errors and rising error log entries correlate strongly with timeouts and controller resets.
Decision: If media_errors and err_log_entries are non-trivial and increasing, plan a drive replacement and verify firmware compatibility. Don’t wait for “it gets worse.” It will.
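For fleet-wide checks, that judgment call can be a script. A minimal sketch, assuming nvme smart-log output on stdin; nvme_health_flag is a hypothetical helper and the thresholds are illustrative assumptions, not vendor guidance.

```shell
#!/bin/sh
# Flag an NVMe device whose error counters look non-trivial.
# nvme_health_flag is hypothetical; thresholds below are illustrative.
nvme_health_flag() {
  awk -F: '
    /media_errors/        { media = $2 + 0 }
    /num_err_log_entries/ { errs  = $2 + 0 }
    END {
      if (media > 0 || errs > 50) print "SUSPECT"
      else print "OK"
    }
  '
}

# Assumed usage: sudo nvme smart-log /dev/nvme0 | nvme_health_flag
```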
Task 15: Check SATA/SAS link resets (if not NVMe)
cr0x@server:~$ journalctl -k -b -1 --no-pager | grep -iE "link is slow to respond|hard resetting link|SATA link down" | tail -n 50
Feb 02 22:08:58 server kernel: ata3.00: failed command: READ FPDMA QUEUED
Feb 02 22:08:59 server kernel: ata3: hard resetting link
Feb 02 22:09:05 server kernel: ata3: link is slow to respond, please be patient (ready=0)
Meaning: The storage link is unstable. That can freeze I/O and cascade into lockups.
Decision: Inspect cabling/backplane/HBA firmware. This is not a “reinstall the OS” situation.
Task 16: Check if time jumped or RTC issues occurred (can break “timeline reasoning”)
cr0x@server:~$ journalctl -b -1 --no-pager | grep -iE "Time has been changed|System clock time|rtc" | tail -n 20
Feb 02 22:11:12 server systemd-timesyncd[612]: System clock time unset or jumped backwards, restoring from recorded timestamp: 2026-02-02 22:11:11 UTC
Meaning: Time jumps can make reboots look like they occurred earlier/later, ruining correlation with BMC or hypervisor logs.
Decision: Fix time sync (NTP/chrony), check RTC battery/firmware, and be cautious when aligning events across systems.
Task 17: Confirm last package/kernel updates around the reboot window
cr0x@server:~$ journalctl -b -1 --no-pager | grep -iE "apt|dnf|yum|transaction|linux-image|kernel" | tail -n 30
Feb 02 22:05:02 server unattended-upgrades[21122]: Installing: linux-image-6.8.0-41-generic
Feb 02 22:05:48 server unattended-upgrades[21122]: Packages that will be upgraded: linux-image-generic
Meaning: A kernel update can trigger reboot workflows (directly or via orchestration), or introduce a regression.
Decision: If the reboot aligns with updates, validate reboot policies, and consider pinning/rolling back kernel versions if you see new crashes.
Three corporate mini-stories from the reboot trenches
Mini-story 1: The incident caused by a wrong assumption
They had a cluster of physical database servers. A new on-call engineer saw “Kernel panic” in the previous boot logs and did what many people do under stress: blamed the kernel and started drafting a plan to roll back updates.
It was a confident narrative. Too confident. The journal ended abruptly a few seconds after the panic line, and there was no clean shutdown sequence. They assumed that meant the panic was the cause and the reboot was the effect.
Someone finally ran ipmitool sel list. The SEL showed power lost and restored—twice—in the same minute. The kernel panic line? That was from an earlier, unrelated test where panic-on-oops had been enabled temporarily and never reverted. The “random reboot” was a power event. The panic line was background noise that happened to be nearby in time.
The real root cause was upstream: a PDU outlet with a failing relay. The server’s redundant PSU setup was correct, but both PSUs were plugged into the same PDU bank “just for now” during maintenance months earlier. No one updated the documentation. Everyone updated their opinions.
Fixing the PDU and rebalancing power feeds stopped the reboots immediately. Rolling back the kernel would have been theater—comforting, visible, and wrong.
Mini-story 2: The optimization that backfired
A team chased latency. They tuned for performance: aggressive CPU C-state settings in BIOS, disabled some power-saving features, and cranked interrupt coalescing on NICs. Benchmarks improved, charts looked prettier, and the change got rubber-stamped because it was “just performance tuning.”
Two weeks later, sporadic reboots began. Not frequent enough to be obvious. Frequent enough to shred trust. The initial suspicion fell on the application, because of course it did. Then it fell on the kernel. Then storage. Then networking. The usual blame relay race.
The actual issue: the tuning increased sensitivity to a firmware bug in the platform’s watchdog/thermal management path. Under specific load patterns, the BMC would misinterpret delayed sensor polling as a hang and issue a reset. OS logs looked clean—no shutdown—because the OS never got the memo. The SEL showed watchdog resets, but nobody had been collecting SEL logs centrally.
They rolled back the BIOS tuning and updated BMC firmware. Reboots stopped. Performance dipped slightly. Stability returned dramatically. The lesson was not “never optimize.” It was “treat firmware and power management as production dependencies, not as a place to cosplay as a hardware engineer.”
Mini-story 3: The boring but correct practice that saved the day
A different organization had a habit that looked painfully dull: persistent journald storage, kdump enabled on all bare-metal nodes, and a nightly job that scraped IPMI SEL into a central system. No heroics. No “we’ll do it after the next sprint.” It was already done.
When a fleet started rebooting under peak load, the first incident call was calm. They pulled journalctl -b -1 across a few nodes and saw consistent NVMe timeout patterns. Then they checked SEL: no power events. Then they checked crash dumps: a subset had vmcores with stacks pointing into the NVMe driver recovery path.
That narrowed the blast radius to “storage firmware/driver interaction under load.” They coordinated a staged firmware update and temporarily reduced queue depth on affected nodes. No guessing. No tribal knowledge. Just evidence.
The result wasn’t magical uptime. It was a contained incident: fewer surprise reboots, a clear vendor escalation, and a paper trail that justified the maintenance window. Boring practices don’t prevent every failure. They prevent the second failure: the failure to understand what happened.
Joke #2: If you don’t store logs persistently, your server will reboot with the confidence of a cat knocking a glass off the table—no remorse, no explanation.
Common mistakes: symptom → root cause → fix
1) Symptom: “No logs show the reboot”
Root cause: Volatile journal (no persistent storage) or power loss/hard reset prevented log flush.
Fix: Enable persistent journald (/var/log/journal), increase journal flush frequency if appropriate, and collect BMC SEL/hypervisor logs to cover power/reset cases.
2) Symptom: “It always reboots under I/O load”
Root cause: Storage timeouts (NVMe controller resets, HBA firmware issues, multipath instability) causing blocked tasks and watchdog/panic.
Fix: Inspect kernel logs for timeouts, check NVMe SMART/error logs, validate firmware versions, reduce queue depth as a mitigation, and schedule hardware replacement when error counts rise.
3) Symptom: “Kernel panic in logs, so it’s definitely a kernel bug”
Root cause: Wrong causality. Panic may be secondary to hardware errors or forced reset; sometimes panic-on-oops is enabled, turning minor driver faults into reboots.
Fix: Confirm with SEL and MCE; check /proc/sys/kernel/panic and panic-on-oops settings; capture vmcore and analyze stacks.
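Checking those knobs is quick. A sketch that reads them straight from /proc; show_panic_knobs is a hypothetical helper, and the overridable root path exists only so the sketch can be exercised against a copy.

```shell
#!/bin/sh
# Print the panic knobs that decide whether an oops becomes a reboot.
# kernel.panic = seconds until auto-reboot after panic (0 = halt forever);
# kernel.panic_on_oops = 1 turns any oops into a full panic.
# show_panic_knobs is a hypothetical helper; arg 1 overrides the /proc root.
show_panic_knobs() {
  root="${1:-/proc/sys/kernel}"
  for k in panic panic_on_oops; do
    if [ -r "$root/$k" ]; then
      printf 'kernel.%s = %s\n' "$k" "$(cat "$root/$k")"
    fi
  done
}

# Assumed usage on a live host: show_panic_knobs
```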
4) Symptom: “Reboots occur after updates but not immediately”
Root cause: New kernel/driver regression only triggered by specific workload; or automation reboots on a schedule after patching.
Fix: Correlate update timestamps with boot IDs; check orchestration logs; roll back kernel for a canary subset and compare stability.
5) Symptom: “VM reboots but guest logs look clean”
Root cause: Hypervisor reset, host crash, or VM power-cycle action. Guest OS never saw a shutdown.
Fix: Use hypervisor event logs and management plane audit logs; treat as infrastructure incident, not guest OS.
6) Symptom: “Filesystem remounted read-only before reboot”
Root cause: Underlying block device I/O errors; filesystem protected itself, services failed, then watchdog or automation rebooted to “heal.”
Fix: Diagnose the block layer (SMART/NVMe, cabling, controller), run filesystem checks during maintenance, and fix the storage reliability issue first.
7) Symptom: “Nothing obvious, but SEL shows watchdog resets”
Root cause: Hardware watchdog triggered due to system hang, often from driver deadlock, interrupt storm, or firmware issues.
Fix: Update BIOS/BMC/microcode; check kernel versions for known lockup bugs; enable kdump; consider NMI watchdog tuning only after root cause is understood.
8) Symptom: “OOM kills show up before reboot”
Root cause: Memory exhaustion triggers cascading failures; external watchdog or orchestration reboots the node after services die.
Fix: Implement memory limits (cgroups), right-size instances, add swap carefully, and adjust OOM scoring for critical services.
Checklists / step-by-step plan
Step-by-step: single host investigation (30–60 minutes)
- Confirm boot count and timing: last -x, journalctl --list-boots.
- Inspect previous boot end: journalctl -b -1 with -p warning first, then full context.
- Classify reboot: orderly vs abrupt (presence of a shutdown sequence, or a log cutoff).
- Check kernel for panic/oops/watchdog: grep for “panic”, “oops”, “watchdog”, “blocked for more than”.
- Check hardware signals: MCE lines; if physical, SEL events for power/thermal/watchdog.
- Check storage: timeouts, resets, filesystem errors, SMART/NVMe data.
- Check memory pressure: OOM killer lines; swap activity if you have it logged.
- Check changes: updates, config management runs, firmware changes, scheduled tasks around the event.
- Decide next action: mitigate (reduce load, remove node from service), preserve evidence (vmcore/logs), escalate (hardware/vendor).
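The "preserve evidence" step is easiest when it is one command. A sketch of a collector; collect_reboot_evidence, the output layout, and the tool list are all assumptions, and each collector runs only if its tool is actually installed.

```shell
#!/bin/sh
# Snapshot reboot evidence into one directory before it rotates away.
# collect_reboot_evidence is a hypothetical helper; missing tools are
# noted in the manifest instead of failing the whole collection.
collect_reboot_evidence() {
  out="${1:-/tmp/reboot-evidence-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$out"
  {
    echo "collected: $(date -u)"
    echo "host: $(hostname 2>/dev/null || echo unknown)"
  } > "$out/manifest.txt"
  command -v journalctl >/dev/null 2>&1 \
    && journalctl -b -1 --no-pager > "$out/journal-prev-boot.log" 2>&1 \
    || echo "journalctl: unavailable" >> "$out/manifest.txt"
  command -v ipmitool >/dev/null 2>&1 \
    && ipmitool sel list > "$out/ipmi-sel.log" 2>&1 \
    || echo "ipmitool: unavailable" >> "$out/manifest.txt"
  if [ -d /var/crash ]; then
    ls -lR /var/crash > "$out/var-crash-listing.txt" 2>&1
  fi
  echo "$out"
}
```

Run it before anyone reboots the box "to see if it happens again."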
Fleet checklist: when many nodes reboot (and everyone is yelling)
- Pick three nodes: one that rebooted, one that didn’t, one borderline. Compare logs side by side.
- Look for common signatures: same driver messages, same storage model, same firmware, same rack, same kernel build.
- Split by failure domain: if only one rack, suspect power/network; if only one storage model, suspect firmware; if only one kernel, suspect regression.
- Stop the bleeding: drain nodes, reduce queue depth, disable problematic feature flags, pause automated reboots.
- Preserve evidence centrally: copy /var/log/journal segments, SEL dumps, and vmcores. Incidents are temporary; evidence is optional unless you make it mandatory.
Configuration checklist: what to set up before the reboot happens
- Persistent journal on all nodes; size it appropriately.
- Central log shipping (and verify it by querying last boot logs).
- kdump enabled where feasible, and monitored for vmcore creation.
- SEL scraping for bare metal; store off-host.
- Change logging: kernel updates, firmware updates, orchestration actions, and “who rebooted it” audit events.
Interesting facts and historical context (because history repeats)
- Fact 1: wtmp has been a Unix tradition for decades; it’s crude, but it’s still one of the fastest ways to spot reboot patterns.
- Fact 2: The Linux kernel ring buffer was historically volatile; systemd’s journal made cross-boot kernel message retrieval dramatically easier on many distros.
- Fact 3: Machine Check Exceptions (MCE) are a CPU-level mechanism; Linux merely reports them. If MCE says “hardware error,” it’s not being metaphorical.
- Fact 4: Watchdogs exist because hung systems can’t be trusted to recover themselves. They’re blunt tools: useful, but they hide root causes by rebooting the patient.
- Fact 5: Early Linux panic behavior was often “halt and wait.” Modern production defaults commonly reboot after panic to restore service quickly—sometimes too quickly to capture evidence.
- Fact 6: IPMI-based SEL logs live outside the OS. That’s why they’re so valuable when the OS didn’t shut down cleanly.
- Fact 7: Storage timeouts have improved, but modern NVMe stacks can still wedge under particular firmware/driver combinations, especially when error recovery collides with heavy queueing.
- Fact 8: “Corrected” ECC errors are not harmless. A storm of corrected errors can precede uncorrectable failures and reboots, and they can also tank performance.
- Fact 9: Journals and logs are not guaranteed to flush on power loss; that’s why reliable logging uses persistence plus off-host shipping.
FAQ
1) How do I tell power loss from a kernel crash?
Power loss/hard reset usually shows an abrupt end of logs with no shutdown sequence. Confirm with IPMI SEL (physical) or hypervisor events (VM). Kernel crash often leaves panic/oops/watchdog lines and sometimes a kdump vmcore.
2) What’s the single most useful command for “previous boot”?
journalctl -b -1. Pair it with -p warning to reduce noise and with --since to focus on the final minutes.
3) Why do I see “Kernel panic” but the reboot reason is still unclear?
Because panic can be a symptom, not a cause. You need upstream context: storage timeouts, MCE hardware errors, lockups, or external reset signals. Also check if panic-on-oops was enabled.
4) Do random reboots usually come from software or hardware?
In practice: a mix, but hardware/power/firmware are underdiagnosed because people only look at OS logs. If OS logs are missing or cut off, assume the OS wasn’t in control.
5) My VM “reboots” but there’s no shutdown in the guest logs. Why?
The hypervisor can reset the VM like pulling a plug. Guest logs won’t show a graceful shutdown. You need host events and management-plane audit logs.
6) Is enabling kdump worth the memory reservation?
For systems where kernel crashes matter, yes. The reserved memory is an insurance premium. Without vmcore, you’re often stuck with circumstantial evidence and vendor finger-pointing.
7) How do watchdog resets relate to storage problems?
I/O stalls can block kernel threads long enough that the watchdog decides the system is hung. The watchdog is the messenger; storage timeouts may be the message.
8) Can filesystem corruption cause reboots?
Filesystems usually try to protect data by remounting read-only or shutting down components. The reboot often comes later: services fail, orchestration restarts the node, or the kernel panics due to underlying block errors.
9) Why are my logs missing right before the reboot?
Either you’re using a volatile journal, the disk became unavailable, or the system lost power. Missing logs are a data point: treat them as “likely abrupt reset” until proven otherwise.
Conclusion: next steps you can do today
If your servers “randomly reboot,” you’re not dealing with randomness. You’re dealing with missing evidence, misplaced attention, or both. Fix the evidence first.
- Make journald persistent and verify you can query journalctl -b -1 after a reboot.
- Enable kdump where it’s practical, and alert on vmcore creation.
- Collect BMC SEL off-host for bare metal; it’s your lie detector for power/reset events.
- Standardize a reboot triage runbook: previous boot journal, SEL/hypervisor, MCE, storage errors—always in that order.
- When you find the cause, change something durable: firmware updates, power feed corrections, storage replacement policy, or a kernel version strategy. Not just a Slack postmortem and a promise.
The goal isn’t to become a reboot detective as a hobby. The goal is to make the next “it just rebooted” turn into a two-screenshot answer and a fix request that actually sticks.