Ubuntu 24.04 Watchdog resets: detect silent hangs before they cost you uptime (case #18)

Nothing tests an on-call rotation like a server that “reboots itself” with no clean shutdown, no crash dump, and logs that stop mid-sentence. The service comes back, dashboards go green, and everyone pretends it was a fluke—until it happens again during the CEO demo.

Ubuntu 24.04 is perfectly capable of telling you what happened. The trick is knowing where the watchdog fits in the chain, what “silent hang” really means, and how to capture evidence before the reset wipes the crime scene.

What watchdogs actually do (and what they don’t)

A watchdog is a timer with a job: if the system stops making forward progress, reboot it (or at least yell loudly). In Linux you’ll see several layers:

  • Hardware watchdog (BMC/iDRAC/iLO, or a chipset timer). If the kernel doesn’t “pet” it, the hardware resets the machine. This is the one that keeps working when the kernel is totally wedged.
  • Kernel watchdogs (soft lockup, hard lockup/NMI watchdog). These detect CPUs stuck in kernel space, interrupt-disabled regions, or non-schedulable states.
  • Userspace watchdogs (systemd watchdog, application heartbeats). These detect a service that is alive-but-dead, like a daemon stuck in an infinite loop but still holding its PID.

Watchdogs are not “crash detectors.” They’re “we have lost control” detectors. A panic leaves breadcrumbs (stack traces, dumps). A hang is a blackout. Watchdogs exist because, in production, a clean failure is a luxury.

Here’s the painful part: the same reset you love for restoring service also destroys your best evidence. If your strategy is “wait for it to happen again and then SSH in,” you’re doing archaeology with a leaf blower.

Opinionated guidance: run a watchdog in production, but treat it like a tripwire that triggers data capture. Configure the capture first. Then let it reset.

Two terms you’ll keep seeing

  • Soft lockup: the CPU is stuck running kernel code for too long without scheduling. The kernel can still run enough to complain.
  • Hard lockup: the CPU doesn’t respond to interrupts (or appears dead). Detection often relies on NMIs or separate timing sources.

One quote, because it’s true

Werner Vogels, paraphrased: “You build for failure, because failure isn’t optional; it’s part of operating at scale.”

Also, for the record: a “silent hang” is not silent. It’s just talking in the wrong place—like the kernel ring buffer you never persisted, on a disk queue that’s currently frozen.

Interesting facts and historical context

Watchdogs have been around long enough to have baggage. A few context points that help when you’re reading logs at 03:00:

  1. Watchdog timers predate Linux by decades. Industrial controllers used them because a stuck control loop can be a safety issue, not just an SLA issue.
  2. Linux soft lockup detection arrived as kernels grew more preemptible. As the scheduler and preemption evolved, “CPU stuck” became measurable and actionable.
  3. The NMI watchdog’s job changed over time. It used to be a common “is the CPU alive?” tool; modern kernels often use perf events infrastructure as part of that detection path.
  4. Hung task detection was added because “it’s just waiting on I/O” can still kill a system. A task blocked in D state can stall critical subsystems even if CPUs are otherwise fine.
  5. Hardware watchdogs moved into BMCs. Many servers can reset you even if the OS is gone, which is great until your BMC configuration becomes a separate failure domain.
  6. systemd popularized service watchdogs. Not the first, but it made “process must ping me” mainstream in Linux distros.
  7. Virtualized environments complicate time. Watchdog thresholds can fire under extreme host contention because your guest stops getting CPU time.
  8. Storage hangs became more visible with NVMe. It’s fast—until firmware/driver edge cases wedge queues, and then everything that touches the disk looks guilty.
  9. Some “random reboots” are deliberate. A hardware watchdog reset is indistinguishable from a power glitch unless you collect the right out-of-band logs.

Fast diagnosis playbook (first/second/third)

If you’re in incident mode, you don’t have time for a grand theory. You need to locate the bottleneck quickly: CPU lockup, I/O stall, memory pressure, hardware reset, or virtualization starvation.

First: prove whether it was a watchdog reset vs a normal reboot

  • Check previous boot logs for watchdog/lockup messages.
  • Check for clean shutdown markers (they’re usually absent).
  • Check out-of-band/BMC event logs if you have them (even a power-loss looks like a reset).

Second: decide whether the system hung in CPU or in I/O

  • Look for “soft lockup”, “hard LOCKUP”, or NMI watchdog warnings (CPU path).
  • Look for “blocked for more than … seconds”, hung_task warnings, NVMe timeouts, SCSI resets, and ext4/XFS I/O errors (I/O path).
  • Correlate with application symptoms: timeouts vs total unresponsiveness.

Third: collect evidence for next time

  • Enable persistent journald and preserve kernel ring buffer if possible.
  • Set up kdump to capture a vmcore on panic (and configure the watchdog to panic instead of reboot if appropriate).
  • Configure sysrq triggers and remote logging so you have a trail even when local disks stall.

Decision point: if you can’t capture evidence, you’re not debugging—you’re guessing. And guessing is expensive.

From symptoms to evidence: the signals that matter

“It rebooted” is an outcome, not a diagnosis. Watchdog resets are usually downstream of one of these:

  • CPU lockup: kernel stuck with interrupts disabled, or stuck spinning on a lock. Watchdog catches it.
  • I/O deadlock or device timeout spiral: storage queue stalls; tasks pile up in D state; system becomes unresponsive; watchdog eventually triggers.
  • Memory pressure and reclaim storms: not a hang, but looks like one. If kswapd and friends get stuck in pathological reclaim or IO wait, the machine “freezes.”
  • Firmware/hardware faults: corrected errors until they aren’t; PCIe AER storms; NVMe controller resets; ECC events; sometimes the hardware watchdog just pulls the plug.
  • VM host contention: guest stops being scheduled; watchdog fires inside the guest; the host is innocent in logs (as always).

Ubuntu 24.04 ships a modern kernel and systemd stack, which is good news: you have the mechanisms to see what’s happening. The bad news: defaults favor “keep running” over “leave a perfect autopsy report.” You’ll need to make deliberate tradeoffs.

Practical tasks (commands, outputs, decisions)

These are not “run this to feel busy” tasks. Each one gives a specific signal, and you should make a decision based on it.

Task 1: Confirm the reboot was not clean (previous boot)

cr0x@server:~$ journalctl -b -1 -p warning..alert --no-pager | tail -n 40
Aug 12 03:14:22 server kernel: watchdog: BUG: soft lockup - CPU#7 stuck for 26s! [kworker/7:1:1234]
Aug 12 03:14:22 server kernel: Modules linked in: nvme tcp_bbr ...
Aug 12 03:14:30 server kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu 7
Aug 12 03:14:31 server systemd[1]: Starting Reboot...
Aug 12 03:14:31 server kernel: watchdog: Initiating system reboot

What it means: soft lockup escalating to hard lockup detection, then a watchdog-triggered reboot.

Decision: treat as a kernel/hardware/driver hang, not an application restart. Move immediately to kernel and device evidence collection.

Task 2: See whether the system recorded a normal shutdown

cr0x@server:~$ last -x | head -n 8
reboot   system boot  6.8.0-41-generic Mon Aug 12 03:15   still running
crash    system crash 6.8.0-41-generic Mon Aug 12 03:14 - 03:15  (00:01)
reboot   system boot  6.8.0-41-generic Sun Aug 11 18:02 - 03:15  (09:13)
shutdown system down  6.8.0-41-generic Sun Aug 11 18:01 - 18:02  (00:01)

What it means: a crash marker appears rather than a clean shutdown sequence.

Decision: prioritize kernel logs and out-of-band evidence; don’t waste time auditing systemd unit restarts.

Task 3: Identify active watchdog devices and drivers

cr0x@server:~$ ls -l /dev/watchdog /dev/watchdog0 2>/dev/null || echo "no /dev/watchdog node"
crw------- 1 root root 10, 130 Aug 12 08:40 /dev/watchdog
crw------- 1 root root 10, 130 Aug 12 08:40 /dev/watchdog0

What it means: a watchdog character device exists; something can arm it.

Decision: confirm who is using it (systemd, watchdog daemon, vendor agent). Misconfigured userspace can cause surprise resets.
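
If you want to know who is actually holding the device open, a quick check with fuser (from psmisc) or lsof answers it. A minimal sketch, reusing the watchdog0 node from the listing above:

cr0x@server:~$ sudo fuser -v /dev/watchdog0
cr0x@server:~$ sudo lsof /dev/watchdog0

If PID 1 (systemd) holds it, the runtime watchdog covered in Tasks 4 and 5 is the mechanism to review; if it’s a vendor agent or the classic watchdog daemon, that tool’s own config decides when the box gets reset.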

Task 4: Check whether systemd is feeding a hardware watchdog

cr0x@server:~$ systemctl show -p RuntimeWatchdogUSec -p RebootWatchdogUSec
RuntimeWatchdogUSec=0
RebootWatchdogUSec=0

What it means: queried without a unit name, systemctl show reports the watchdog settings of the systemd manager (PID 1). Zero means systemd is not currently petting a hardware watchdog. Don’t confuse these with per-service WatchdogSec= heartbeats; those are a separate mechanism.

Decision: check the global configuration in /etc/systemd/system.conf next; a value configured there but not reflected here usually means the manager hasn’t re-executed (or the machine hasn’t rebooted) since the change.

Task 5: Check global systemd watchdog settings

cr0x@server:~$ grep -E '^(RuntimeWatchdogSec|RebootWatchdogSec|ShutdownWatchdogSec|WatchdogDevice)=' /etc/systemd/system.conf
RuntimeWatchdogSec=30s
RebootWatchdogSec=10min
ShutdownWatchdogSec=2min

What it means: systemd is configured to arm a watchdog during runtime and reboot/shutdown transitions.

Decision: validate the watchdog device/driver and ensure your timeouts are sane for your worst-case I/O stalls (more on that later).
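
To see which driver actually backs the watchdog node and what hardware timeout it supports, util-linux ships wdctl. One caveat worth hedging: querying the device opens it briefly, and on drivers built with nowayout a stray open can leave the watchdog armed, so try this on a non-production box first.

cr0x@server:~$ sudo wdctl /dev/watchdog0

The output lists the driver identity (for example iTCO_wdt or a BMC driver), the current timeout, and status flags. Compare that timeout with RuntimeWatchdogSec so systemd’s ping interval and the hardware deadline actually fit together.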

Task 6: Identify the kernel lockup detection configuration

cr0x@server:~$ sysctl kernel.watchdog kernel.softlockup_panic kernel.hardlockup_panic kernel.watchdog_thresh
kernel.watchdog = 1
kernel.softlockup_panic = 0
kernel.hardlockup_panic = 0
kernel.watchdog_thresh = 10

What it means: the kernel’s lockup detection is on; it warns but doesn’t panic by default. watchdog_thresh is 10 seconds (the common default), and soft lockup messages fire at roughly twice that value, which is why the example in Task 1 reports “stuck for 26s”.

Decision: in production, consider panicking on lockup (with kdump configured) to get a vmcore. Rebooting without evidence is how you end up “monitoring” instead of fixing.

Task 7: Inspect dmesg from the previous boot (if preserved)

cr0x@server:~$ journalctl -k -b -1 --no-pager | egrep -i 'watchdog|lockup|hung|blocked for more than|nvme|reset|I/O error' | tail -n 60
Aug 12 03:13:58 server kernel: nvme nvme0: I/O 123 QID 4 timeout, aborting
Aug 12 03:14:02 server kernel: nvme nvme0: Abort status: 0x371
Aug 12 03:14:06 server kernel: INFO: task jbd2/nvme0n1p2-8:512 blocked for more than 120 seconds.
Aug 12 03:14:06 server kernel:       Tainted: G           O       6.8.0-41-generic #41-Ubuntu
Aug 12 03:14:22 server kernel: watchdog: BUG: soft lockup - CPU#7 stuck for 26s! [kworker/7:1:1234]
Aug 12 03:14:31 server kernel: watchdog: Initiating system reboot

What it means: NVMe timeouts preceded a hung journaling thread, then a CPU lockup warning. This is classic “storage made the kernel miserable.”

Decision: treat the storage path (firmware, driver, PCIe errors, queueing) as primary suspect. Don’t waste days blaming your app’s thread pool.

Task 8: Verify journald persistence (so you don’t lose the trail)

cr0x@server:~$ grep -E '^\s*Storage=' /etc/systemd/journald.conf
Storage=auto
cr0x@server:~$ ls -ld /var/log/journal || echo "no persistent journal directory"
ls: cannot access '/var/log/journal': No such file or directory
no persistent journal directory

What it means: logs may be volatile. After a reset, your best clues may be gone.

Decision: enable persistent journal on servers where you care about post-mortems. Yes, it uses disk. So does downtime.

Task 9: Enable persistent journald storage (safely)

cr0x@server:~$ sudo mkdir -p /var/log/journal
cr0x@server:~$ sudo systemd-tmpfiles --create --prefix /var/log/journal
cr0x@server:~$ sudo systemctl restart systemd-journald
cr0x@server:~$ journalctl --disk-usage
Archived and active journals take up 1.1G in the file system.

What it means: journald is now persisting logs. You have a chance next time.

Decision: set retention limits appropriate for your disks and incident response needs.
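
Retention is easiest to manage as a drop-in rather than editing journald.conf directly. A minimal sketch; the file name and the size/age caps are placeholders you should size for your own disks and incident-response window:

cr0x@server:~$ sudo mkdir -p /etc/systemd/journald.conf.d
cr0x@server:~$ sudo tee /etc/systemd/journald.conf.d/90-retention.conf >/dev/null <<'EOF'
[Journal]
Storage=persistent
# Cap total on-disk journal size (placeholder value).
SystemMaxUse=2G
# Drop entries older than this (placeholder value).
MaxRetentionSec=1month
EOF
cr0x@server:~$ sudo systemctl restart systemd-journald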

Task 10: Check for kernel “hung task” policy and decide if you want panics

cr0x@server:~$ sysctl kernel.hung_task_timeout_secs kernel.hung_task_panic kernel.hung_task_warnings
kernel.hung_task_timeout_secs = 120
kernel.hung_task_panic = 0
kernel.hung_task_warnings = 1

What it means: the kernel will warn about tasks blocked for 120 seconds, but won’t panic.

Decision: if hangs are killing uptime, consider panicking (with kdump) after confirmed deadlocks to capture state. This is a tradeoff: a panic is disruptive, but so is a watchdog reset with zero evidence.
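
If you decide to go the panic route for hung tasks, a sysctl drop-in keeps the change visible and easy to revert. A sketch, assuming kdump has already passed a controlled test (Tasks 11-14); the file name is arbitrary:

cr0x@server:~$ sudo tee /etc/sysctl.d/90-hung-task.conf >/dev/null <<'EOF'
# Panic when a task stays in D state past the timeout, so kdump can write a vmcore.
# Only enable this after kdump has been verified.
kernel.hung_task_timeout_secs = 120
kernel.hung_task_panic = 1
EOF
cr0x@server:~$ sudo sysctl --system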

Task 11: Verify kdump is installed and armed

cr0x@server:~$ systemctl status kdump-tools --no-pager
● kdump-tools.service - Kernel crash dump capture service
     Loaded: loaded (/lib/systemd/system/kdump-tools.service; enabled; preset: enabled)
     Active: active (exited) since Mon 2025-08-12 08:41:12 UTC; 3h ago
       Docs: man:kdump-tools(8)

What it means: kdump is enabled. Not proof it will work, but it’s a start.

Decision: validate crashkernel reservation and do a controlled crash test during a maintenance window.

Task 12: Check crashkernel reservation (without it, kdump often fails)

cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=... ro crashkernel=512M-:192M quiet splash

What it means: crashkernel is reserved. Size matters; too small and dumps fail under load.

Decision: ensure it’s adequate for your RAM size and kernel; adjust if vmcores are truncated or missing.
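
Two quick ways to confirm what was actually reserved, plus a pointer to where the setting usually lives on Ubuntu (the exact GRUB snippet path varies, so treat the comment as a hint rather than gospel):

cr0x@server:~$ dmesg | grep -i crashkernel
cr0x@server:~$ cat /sys/kernel/kexec_crash_size
# To enlarge the reservation, raise crashkernel= on the kernel command line
# (commonly managed via a GRUB defaults snippet installed by kdump-tools),
# then run: sudo update-grub && sudo reboot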

Task 13: Force a controlled crash test (maintenance window only)

cr0x@server:~$ sudo sysctl -w kernel.sysrq=1
kernel.sysrq = 1
cr0x@server:~$ echo c | sudo tee /proc/sysrq-trigger
c

What it means: the kernel panics immediately. If kdump is configured correctly, you’ll get a vmcore after the reboot.

Decision: if you don’t get a dump, fix kdump now—not after the next production hang.

Task 14: After reboot, confirm a dump exists

cr0x@server:~$ ls -lh /var/crash | tail -n 5
drwxr-xr-x 2 root root 4.0K Aug 12 09:02 202508120902
cr0x@server:~$ ls -lh /var/crash/202508120902 | egrep 'vmcore|dmesg'
-rw-r----- 1 root root 2.8G Aug 12 09:03 vmcore
-rw-r----- 1 root root  92K Aug 12 09:03 dmesg.0

What it means: you can capture evidence. That changes the entire game.

Decision: consider changing lockup/hung-task handling from “warn only” to “panic” for classes of hangs you cannot otherwise debug.
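
The dmesg.0 file is plain text and often answers the question by itself. If you need to go deeper, the crash utility can open the vmcore, but it needs the debug vmlinux for the exact kernel that crashed (from Ubuntu’s dbgsym/ddebs packages); the paths below are illustrative:

cr0x@server:~$ grep -iE 'lockup|hung|nvme|Call Trace' /var/crash/202508120902/dmesg.0 | head
cr0x@server:~$ sudo apt install crash
cr0x@server:~$ sudo crash /usr/lib/debug/boot/vmlinux-6.8.0-41-generic /var/crash/202508120902/vmcore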

Task 15: Check for PCIe AER errors that often precede device wedges

cr0x@server:~$ journalctl -k -b -1 --no-pager | egrep -i 'AER|pcie|nvme.*reset|link down|fatal error' | tail -n 40
Aug 12 03:13:40 server kernel: pcieport 0000:3b:00.0: AER: Corrected error received: 0000:3b:00.0
Aug 12 03:13:40 server kernel: pcieport 0000:3b:00.0: AER: [ 0] RxErr
Aug 12 03:13:58 server kernel: nvme nvme0: controller is down; will reset: CSTS=0x3

What it means: the bus is complaining and the NVMe controller reset followed. That’s not “an app bug.”

Decision: loop in hardware/firmware owners: BIOS, NIC/HBA/NVMe firmware, PCIe topology, power management settings.

Task 16: Check I/O pressure and stalls (post-incident baseline)

cr0x@server:~$ cat /proc/pressure/io
some avg10=0.00 avg60=0.05 avg300=0.12 total=11234567
full avg10=0.00 avg60=0.01 avg300=0.03 total=1234567

What it means: PSI shows how often tasks are stalled on I/O. After a hang event, you want to compare to baseline; spikes correlate with “everything feels frozen.”

Decision: if PSI is elevated during normal periods, you likely have chronic storage contention; lockups may be the extreme tail.
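
Baselining PSI doesn’t need an agent; a tiny sampler run from cron or a systemd timer is enough. A minimal sketch; the script path, log path, and sampling interval are all assumptions to adapt:

cr0x@server:~$ sudo tee /usr/local/bin/psi-sample >/dev/null <<'EOF'
#!/bin/sh
# Append one timestamped snapshot of I/O and memory pressure.
# Schedule this every minute (cron or a systemd timer) to build a baseline.
{
  printf '%s io: ' "$(date -Is)"
  tr '\n' ' ' < /proc/pressure/io
  printf 'mem: '
  tr '\n' ' ' < /proc/pressure/memory
  printf '\n'
} >> /var/log/psi-baseline.log
EOF
cr0x@server:~$ sudo chmod +x /usr/local/bin/psi-sample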

Task 17: Confirm NVMe health and error logs (if NVMe is involved)

cr0x@server:~$ sudo nvme list
Node             SN                   Model                                   Namespace Usage                      Format           FW Rev
/dev/nvme0n1     S6XXXXXXXXXXXX       ACME NVMe Datacenter 3.2TB              1         1.20  TB / 3.20  TB        512   B +  0 B   1.2.3
cr0x@server:~$ sudo nvme smart-log /dev/nvme0
critical_warning                    : 0x00
media_errors                        : 0
num_err_log_entries                 : 14
cr0x@server:~$ sudo nvme error-log /dev/nvme0 | head
Entry[ 0]
error_count     : 14
sqid            : 4
cmdid           : 0x001a
status_field    : 0x4004
parm_err_loc    : 0x0000

What it means: the controller has logged errors. Not necessarily fatal, but it’s a lead that matches timeouts.

Decision: plan firmware updates and consider vendor advisories. If you can correlate error entries with hang timestamps, you have a strong case.
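
Correlation is much easier if you snapshot the controller logs on a schedule instead of only after an incident. A sketch, reusing the device name from above; the output paths are placeholders:

cr0x@server:~$ ts=$(date +%Y%m%d-%H%M%S)
cr0x@server:~$ sudo nvme smart-log /dev/nvme0 | sudo tee "/var/log/nvme0-smart-$ts.txt" >/dev/null
cr0x@server:~$ sudo nvme error-log /dev/nvme0 | sudo tee "/var/log/nvme0-error-$ts.txt" >/dev/null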

Task 18: Check CPU starvation signals (virtualization or noisy neighbors)

cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.8.0-41-generic (server)  08/12/2025  _x86_64_  (16 CPU)

08:52:01 AM  CPU   %usr %nice  %sys %iowait  %irq %soft %steal %idle
08:52:02 AM  all   12.0  0.0    5.0   2.0     0.0  0.5   18.0  62.5
08:52:03 AM  all   10.8  0.0    4.7   1.8     0.0  0.4   21.3  61.0

What it means: high %steal indicates the VM isn’t getting scheduled. Watchdogs inside guests can fire because time effectively stops.

Decision: if this is a VM, treat host contention as a first-class suspect; tune watchdog thresholds carefully or fix host capacity.

Task 19: Identify whether CPU frequency/power settings are doing something weird

cr0x@server:~$ cat /sys/devices/system/cpu/cpufreq/policy0/scaling_governor 2>/dev/null || echo "cpufreq not exposed"
performance

What it means: governor is set to performance (often fine for servers). Power-saving modes can interact with latency and device timeouts on some platforms.

Decision: if you see lockups correlated with deep C-states or aggressive power management, coordinate BIOS and kernel params changes; don’t shotgun-tune in production.

Joke #1: A watchdog that reboots your server is like a smoke alarm that puts out the fire by detonating the building. Effective, but you still want the incident report.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

They had a clean narrative: “The database hung.” The graphs showed query latency rising, then a cliff. A watchdog reset followed. So the incident commander did what many of us do under pressure: assigned action items to the DB team, asked for query plans, and penciled in “optimize indexes.”

The DB team was annoyed but professional. They pulled slow query logs, compared baselines, found nothing spectacular. Meanwhile, the hang recurred—again during a period of “normal” query volume. It didn’t line up with deployments either. The only stable feature was the watchdog reboot.

A storage engineer finally asked a rude question: “Do we have persistent kernel logs?” They didn’t. Journald was volatile. The only clue was a truncated console screenshot someone grabbed from a remote KVM showing a line about blocked for more than 120 seconds.

They enabled persistent journald and configured kdump. On the next occurrence, logs showed NVMe timeouts escalating into a journaling thread stuck in D state, then lockups. The database was collateral damage: it was waiting on fsync that never came back. The “database hung” story was comforting, but wrong.

Fix was boring: firmware updates for the NVMe drives, a kernel update that improved controller reset handling, and a rollback of an aggressive PCIe power management BIOS setting. The DB was innocent, but it took a week to prove because the team assumed the loudest component caused the failure.

Mini-story 2: The optimization that backfired

A platform team wanted faster recovery from crashes. They enabled a hardware watchdog with a short timeout and configured systemd to pet it. Their logic: if the kernel freezes, reboot fast, reduce customer impact. This wasn’t crazy; it’s a standard pattern in appliances.

Then they rolled out a “performance improvement”: more aggressive dirty page writeback and higher I/O queue depths on their busiest nodes. It worked in benchmarks. In production, during a heavy compaction window, a subset of nodes would become unresponsive long enough to miss watchdog pings. The hardware watchdog reset them mid-write.

After reset, filesystems replayed journals, services restarted, and the fleet looked “healthy.” But subtle data corruption alarms started popping up—not instantly, but weeks later—because some application-level invariants didn’t like being interrupted at that point. The optimization didn’t directly corrupt disks; it corrupted assumptions about atomicity and the timing of writes.

The post-mortem was sharp: the watchdog reduced mean time to recovery but increased mean time to innocence. It masked the underlying I/O stall pattern and shortened the window for capturing evidence. They had traded “visible slow failure” for “invisible fast reset.”

They fixed it by increasing watchdog timeouts, enabling panic-on-lockup with kdump for a subset of nodes, and instrumenting I/O pressure. The big lesson: a watchdog is not a substitute for capacity planning, and it’s definitely not a substitute for understanding your storage tail latencies.

Mini-story 3: The boring but correct practice that saved the day

A financial services shop had a habit people mocked: every kernel update and firmware change required a one-page “failure evidence plan.” It listed where logs go, whether kdump is tested, and which out-of-band logs are retained. Nobody loved this document. It looked like compliance cosplay.

Then a fleet of servers started doing occasional watchdog resets under peak load. The app teams were ready to blame each other, as tradition demands. But the evidence plan meant two things were already true: journald was persistent, and kdump had been test-passed in the last quarter.

The very first incident produced a vmcore. The kernel stack traces showed a deadlock involving a specific driver path under high interrupt load. They correlated it with a recent firmware update that changed interrupt moderation behavior on a PCIe device sharing a root complex with the NVMe controller.

They didn’t need to wait for three more reboots to “be sure.” They rolled back the firmware, isolated the PCIe topology issue, and applied a kernel update with a relevant fix once validated. Uptime impact was real but contained.

The boring practice—keeping evidence capture ready—didn’t prevent the hang. It prevented the second week of chaos. That’s what good operations looks like: not heroics, just pre-positioned truth.

Tuning watchdogs without self-sabotage

Watchdogs have two failure modes:

  • Too sensitive: they reset healthy-but-busy systems (false positives), turning load spikes into outages.
  • Too lax: they never trigger, leaving you with multi-hour black holes until someone power-cycles.

Ubuntu 24.04 gives you levers at multiple layers. Use them deliberately.

Kernel lockup thresholds: don’t “fix” lockups by hiding them

kernel.watchdog_thresh sets the hard lockup threshold; soft lockup warnings fire at roughly twice that value. Raising it can reduce noise in high-latency environments, but it also delays detection of real deadlocks.

My bias: don’t raise thresholds until you’ve measured CPU scheduling and I/O pressure. If you’re seeing lockup warnings because your VM is starved (%steal high), fix the host. If you’re seeing warnings because storage stalls block critical threads, fix storage.

Consider panic-on-lockup forensics (with guardrails)

If you can’t reproduce the issue and it’s rare but expensive, set:

  • kernel.softlockup_panic=1 and/or kernel.hardlockup_panic=1
  • Ensure kdump works and crashkernel is sized correctly
  • Test on a canary first

This turns a hang into a crash dump. Crashes are debuggable; silent hangs are vibes.
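
In practice that’s two sysctls plus a drop-in so the change survives reboots. A sketch for a canary host, assuming kdump has already passed a controlled crash test; the file name is arbitrary:

cr0x@server:~$ sudo tee /etc/sysctl.d/90-lockup-panic.conf >/dev/null <<'EOF'
# Convert detected lockups into panics so kdump can capture a vmcore.
# Roll out to a canary first; revert if false positives appear under load.
kernel.softlockup_panic = 1
kernel.hardlockup_panic = 1
EOF
cr0x@server:~$ sudo sysctl --system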

systemd RuntimeWatchdogSec: align it with reality

systemd’s runtime watchdog is often wired to the hardware watchdog. If systemd can’t run (because the kernel is wedged), it can’t pet the watchdog, and the hardware resets you. That’s fine—if the timeout is long enough to avoid resets during legitimate stalls (like heavy RAID rebuilds or pathological device timeouts).

Rule of thumb: set the hardware watchdog timeout to be longer than the worst expected storage pause plus enough time for kdump/panic strategy if you use it. If your array can pause I/O for 45 seconds during controller failover, a 30-second runtime watchdog is a self-inflicted outage.
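
Setting it is a one-file change; a drop-in under system.conf.d keeps it easy to audit and revert, and systemd re-reads system.conf when it re-executes (or at the next reboot). The 60s value below is an example tied to the rule of thumb above, not a universal recommendation:

cr0x@server:~$ sudo mkdir -p /etc/systemd/system.conf.d
cr0x@server:~$ sudo tee /etc/systemd/system.conf.d/90-watchdog.conf >/dev/null <<'EOF'
[Manager]
# Hardware watchdog deadline while the system is running.
# Keep it comfortably above the worst-case storage pause you have measured.
RuntimeWatchdogSec=60s
RebootWatchdogSec=10min
EOF
cr0x@server:~$ sudo systemctl daemon-reexec
cr0x@server:~$ systemctl show -p RuntimeWatchdogUSec -p RebootWatchdogUSec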

Joke #2: If you set your watchdog to 10 seconds on a storage box, you’re not running SRE—you’re speedrunning incident response.

Storage and I/O stalls: the hang that looks like compute

As a storage engineer, here’s the pattern I keep seeing: CPU lockup warnings show up in logs, so the blame goes to CPU or kernel scheduler. But the trigger is often I/O. When storage stalls, tasks block in D state. Kernel threads pile up. Memory reclaim gets tangled. Eventually a CPU spins in a path that stops scheduling, and the watchdog fires. The CPU is the messenger.

Tell-tale log lines that implicate storage

  • nvme ... I/O timeout, aborting
  • task ... blocked for more than ... seconds
  • Buffer I/O error, blk_update_request errors
  • Filesystem journal thread blocked: jbd2 (ext4) or log worker stalls (XFS)
  • SCSI resets and aborts (even if you “don’t use SCSI”—your HBA does)

Why Ubuntu 24.04 is relevant here

Ubuntu 24.04 tends to be deployed on newer platforms: NVMe, PCIe Gen4/Gen5, modern firmware stacks, and more virtualization. That’s a great environment for performance—and also for rare edge cases where a device, driver, or PCIe topology does something “creative” under pressure.

Also: many teams moved from SATA SSDs (slow, forgiving) to NVMe (fast, complex). The failure mode shifted from “it’s slow” to “it’s fine until it suddenly isn’t.”

Common mistakes: symptom → root cause → fix

These are not moral failings. They’re the traps people fall into when the system comes back up and everyone wants to move on.

Mistake 1: “Random reboot” with no logs

Symptom: uptime resets, services restart, no obvious errors in current boot logs.

Root cause: logs were in RAM (volatile journald), and the reset wiped the ring buffer.

Fix: enable persistent journald and/or remote logging; confirm previous boot logs are retained (journalctl -b -1 must be useful).
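
For the remote half, plain rsyslog forwarding is usually enough to survive a local disk stall. A minimal sketch, assuming rsyslog is installed; loghost.example.com is a placeholder for your real collector:

cr0x@server:~$ sudo tee /etc/rsyslog.d/90-remote.conf >/dev/null <<'EOF'
# Forward everything to a central syslog host (@@ = TCP, @ = UDP).
*.* @@loghost.example.com:514
EOF
cr0x@server:~$ sudo systemctl restart rsyslog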

Mistake 2: Confusing a hardware watchdog reset with a kernel panic

Symptom: system reboots abruptly; people assume “kernel crashed.”

Root cause: hardware watchdog fired; there was no panic, so no vmcore, and no final console output.

Fix: decide whether you want panic on lockup/hung task (with kdump) instead of blind resets; tune hardware watchdog timeout to allow evidence capture.

Mistake 3: Treating soft lockup warnings as “just noise”

Symptom: occasional BUG: soft lockup lines, system “seems fine.”

Root cause: early warning of driver deadlocks, IRQ storms, or I/O stalls. The system recovers—until it doesn’t.

Fix: correlate with device timeouts and PSI. If recurring, reproduce under load and capture vmcore using panic-on-lockup on a canary.

Mistake 4: Raising watchdog thresholds to stop reboots

Symptom: watchdog resets stop after tuning; later, multi-minute hangs appear with no automatic recovery.

Root cause: you disabled the alarm instead of fixing the fire.

Fix: revert thresholds; fix root cause (storage timeouts, host CPU steal, driver bugs). Use a longer hardware watchdog but keep detection.

Mistake 5: Ignoring %steal in VMs

Symptom: guest watchdog triggers “hard lockup” during noisy neighbor events.

Root cause: the guest is paused/starved by the hypervisor; timeouts fire inside the guest.

Fix: fix host scheduling/capacity; pin vCPUs or adjust VM priorities. If you must, increase thresholds in guests—but treat that as a workaround, not a cure.

Mistake 6: Blaming the database because it was the first to complain

Symptom: DB latency spikes and timeouts before reset.

Root cause: DB is often the first to block on fsync; the real issue is storage or kernel I/O path.

Fix: inspect kernel logs for NVMe/SCSI/filesystem errors and hung tasks; measure I/O pressure and tail latency on the block devices.

Checklists / step-by-step plan

Step-by-step: set yourself up to catch the next hang

  1. Turn on persistent logs (journald to disk) and set retention limits.
  2. Ensure you can read the previous boot: verify journalctl -b -1 contains kernel messages.
  3. Enable and test kdump in a maintenance window using SysRq crash; confirm a vmcore is written.
  4. Decide your strategy:
    • Option A: watchdog reset for rapid recovery (less evidence)
    • Option B: panic-on-lockup + kdump for evidence (more evidence, controlled crash)
  5. Instrument I/O pressure (PSI) and baseline it. If you don’t know what normal looks like, you can’t call out abnormal.
  6. Collect out-of-band logs (BMC SEL, watchdog events) where possible. If you can’t, at least align with the hardware team on how to retrieve them.
  7. Canary changes: apply kernel/firmware updates to a subset; keep a rollback path. The goal is to change one variable at a time.

Incident checklist: when the reboot already happened

  1. Confirm previous boot window: journalctl -b -1 and last -x.
  2. Extract the last 200 kernel lines from previous boot; look for lockups, hung tasks, device resets.
  3. Check storage error logs (NVMe error-log, kernel timeouts, filesystem errors).
  4. Check virtualization signals: %steal and host contention if applicable.
  5. Check firmware and kernel versions for recent changes; correlate with incident start.
  6. Write down one hypothesis and one disproof test. Avoid the “twenty maybes” meeting.

Change control checklist: tuning watchdog settings

  • Don’t tune timeouts before confirming whether you’re dealing with CPU lockups, I/O stalls, or guest starvation.
  • If you enable panic-on-lockup, ensure kdump is functional and storage for dumps is reliable.
  • Keep watchdog timeouts comfortably above worst-case known pauses (controller failover, RAID patrol reads, heavy compaction).
  • Document the rationale. Future you will not remember why RuntimeWatchdogSec=73s seemed like a good idea.

FAQ

1) What’s the difference between a soft lockup and a hard lockup?

A soft lockup is when a CPU runs kernel code too long without scheduling; the kernel can still log warnings. A hard lockup is when the CPU stops responding to interrupts; detection is harder and often involves NMIs or timing sources.

2) Why do watchdog resets feel “random”?

Because the triggering condition is usually the tail of a distribution: rare storage timeouts, rare driver deadlocks, or rare host contention. The reset happens after a timeout, not at the moment the system started going wrong.

3) Should I disable watchdogs to stop reboots?

Only if you like multi-hour hangs that require manual power-cycles. Keep watchdogs. Instead, increase evidence capture (persistent logs, kdump) and fix the underlying stalls.

4) If I enable panic-on-lockup, won’t that reduce uptime?

It can, in the short term, because it converts “maybe it recovers” into “it crashes.” In exchange, you get a vmcore and can fix the root cause. For recurring production hangs, that trade usually pays for itself.

5) Can storage really cause CPU lockup messages?

Yes. I/O stalls can block critical kernel threads, trigger reclaim storms, and create lock contention. The watchdog reports where the CPU got stuck, not necessarily what started the cascade.

6) How do I know if my VM is being starved by the hypervisor?

Look at mpstat and the %steal column. High steal indicates the guest wanted CPU but didn’t get scheduled. That can make timeouts fire even when the guest is “healthy.”

7) Why didn’t I get a crash dump?

Because a watchdog reset is not a panic, and kdump triggers on panics. Also, kdump fails if crashkernel memory isn’t reserved or is undersized, or if the dump target is unavailable during the crash.

8) What’s the quickest way to find the most relevant logs after a reboot?

Use journalctl -b -1 -k and filter for lockups, hung tasks, and device errors. Then correlate timestamps with application alerts. If logs aren’t persistent, fix that first.

9) Is systemd’s watchdog the same as the kernel watchdog?

No. systemd watchdog feeds a watchdog device (often hardware). Kernel watchdogs detect lockups within kernel scheduling/interrupt behavior. They can interact, but they’re different mechanisms with different failure modes.

10) What if the machine resets so hard that even journald doesn’t flush?

Then you need redundancy: remote logging, out-of-band event logs, and/or a panic strategy that writes a vmcore via kdump. If the disk path is the thing hanging, local logs are inherently fragile.

Conclusion: next steps that actually reduce downtime

If you do nothing else this week, do these three things:

  1. Make previous-boot logs reliable: persistent journald, retention configured, and confirm journalctl -b -1 is useful.
  2. Get one working crash dump path: enable and test kdump. A single vmcore can save you months of speculation.
  3. Pick a stance on watchdog behavior: fast reset for availability, or panic-on-lockup for evidence. Don’t drift into a half-configured state that gives you neither.

Then go after root causes with a bias toward the usual suspects: storage timeouts, PCIe errors, VM CPU steal, and driver/firmware interactions. Watchdogs don’t create the hang. They just refuse to let it quietly steal your uptime.
