Hard lockups are the worst kind of “it froze” because they’re rude enough to take the kernel down with them. Your SSH session stalls, the console stops updating, and the only thing still responsive is the power button. Then you reboot and dmesg greets you with: “watchdog: BUG: hard LOCKUP”. Great. Now what?
This is a production-grade, evidence-first approach for Ubuntu 24.04: collect the right artifacts on the next freeze, reduce the search space quickly, and avoid random toggling that makes the problem unprovable.
What a “hard lockup” actually means (and what it doesn’t)
A Linux “hard lockup” is the kernel admitting that one CPU core stopped responding to interrupts for too long. The detector typically checks from NMI context (which can fire even when normal interrupts are disabled) that ordinary timer interrupts are still being serviced on each CPU. If a CPU is stuck with interrupts disabled, spinning in a bad loop, wedged in microcode/firmware, or trapped in a broken driver path, the watchdog shouts.
What it is not: a vague message that “the system was slow.” You can have atrocious latency, runaway iowait, or an overloaded box without hard lockups. Hard lockups are closer to “a CPU went out to lunch and didn’t tell anyone.”
In dmesg you’ll usually see something like:
watchdog: BUG: hard LOCKUP on cpu X
- Sometimes preceded by GPU resets, NVMe timeouts, RCU stalls, or “NMI watchdog” messages
- Sometimes followed by nothing, because the system is too frozen to log
Hard lockups can be caused by:
- CPU power management edge cases (C-states, P-states, cpufreq drivers)
- Firmware/BIOS bugs (SMI storms, broken ACPI tables, microcode interactions)
- Device drivers (GPU, HBA, NIC, out-of-tree modules)
- IOMMU/interrupt routing issues
- Storage firmware timeouts that trigger pathological kernel code paths
- Thermal/voltage instability (yes, “it’s hardware” is sometimes the answer)
There’s a sibling: “soft lockup.” That’s when a CPU runs kernel code too long without yielding, but interrupts still work. Soft lockups are bad; hard lockups are “call your on-call backup” bad.
Facts and small history lessons you can use at 3 a.m.
- Fact 1: The Linux “watchdog” hard-lockup detector historically relied on the NMI watchdog on x86, because NMIs can fire even when normal interrupts are disabled.
- Fact 2: “RCU stall” warnings often show up near hangs, but they’re not the same as hard lockups. RCU stalls mean a CPU isn’t reaching quiescent states; a hard lockup is more like the CPU stopped responding altogether.
- Fact 3: ACPI and SMI (System Management Interrupt) issues have been a classic source of “everything pauses” behavior for decades—especially on vendor firmware that’s optimized for Windows telemetry, not Linux determinism.
- Fact 4: NVMe error handling got significantly more sophisticated over the years, but “reset storms” can still cascade into system-wide stalls if interrupts and timeouts pile up the wrong way.
- Fact 5: The Linux kernel’s lockup detectors are deliberately conservative. They’d rather annoy you with a warning than miss a real dead-CPU scenario.
- Fact 6: The kernel’s “hung task” detector is different again: it detects blocked tasks (often I/O waits), not dead CPUs. People mix these up constantly.
- Fact 7: Early “tickless” kernel work (NO_HZ) improved power use but also changed timing behavior; some bugs only show up with specific timer/idle combinations.
- Fact 8: Microcode updates can fix lockups, but they can also surface latent driver or firmware problems by changing timing. “It got worse after updates” is not automatically “the update broke it.”
One reliability idea that still holds, paraphrased: if you can’t measure it, you can’t manage it.
— W. Edwards Deming (paraphrased)
Fast diagnosis playbook (first/second/third)
This is the “stop scrolling, do this now” section. The goal is to quickly decide whether you’re dealing with: (a) a known driver/firmware problem, (b) a power-management edge case, (c) storage/interrupt path pathology, or (d) failing hardware.
First: confirm it’s a lockup and capture the last good words
- Pull the prior boot logs (persistent journal) and look for the earliest precursor messages: NVMe resets, GPU Xid, iwlwifi firmware asserts, “irq X: nobody cared”, “RCU stall”, MCE errors.
- Decide whether you need a crash dump. If the freeze reproduces only weekly or less often, kdump is mandatory; without it you’ll be arguing with vibes.
- Record the kernel and firmware versions before you touch anything. Upgrades change timing and can “fix” by accident.
Second: isolate the common culprits with controlled toggles
- Out-of-tree modules and GPUs: temporarily boot without proprietary GPU drivers or DKMS modules if feasible.
- Power management: try one change at a time: disable deep C-states (processor.max_cstate=1) or disable intel_pstate (or switch governor). Don’t shotgun ten kernel params.
- IOMMU: toggle IOMMU settings if you see DMA/IOMMU faults or weird interrupt behavior.
Third: stress the suspect subsystem and watch counters
- Storage: run controlled I/O load and watch NVMe error counters, latency, and resets.
- CPU/thermals: check MCE logs, temperatures, and frequency behavior.
- Interrupts: identify interrupt storms or stuck IRQs; pin down which device owns the noisy line.
Joke #1: A hard lockup is your server’s way of requesting a “team-building exercise” between the kernel and your BIOS.
Evidence-first setup before you change anything
When people “debug” lockups by flipping random BIOS toggles, upgrading three packages, and changing kernel parameters all at once, they create the one thing worse than a lockup: a lockup with no reproducibility and no blame surface.
Your job is to produce artifacts that survive the reboot. For Ubuntu 24.04, the shortlist is:
- Persistent journal logs across boots
- Kernel ring buffer from previous boot (or netconsole output)
- Kdump capture (vmcore + dmesg) if the machine can panic/reboot
- Hardware error logs: MCE, EDAC, IPMI SEL (if server), SMART/NVMe logs
- Firmware and microcode versions
- A minimal change log of what you tested and what changed
Make logs survive reboots (journald persistence)
Ubuntu often defaults to persistent logs on servers, but not always, and containers/VMs can be weird. Make it explicit:
cr0x@server:~$ sudo grep -R "^\s*Storage=" /etc/systemd/journald.conf /etc/systemd/journald.conf.d/* 2>/dev/null || true
...output...
If you see nothing, set it to persistent and restart journald:
cr0x@server:~$ sudo install -d -m 0755 /etc/systemd/journald.conf.d
cr0x@server:~$ printf "[Journal]\nStorage=persistent\nSystemMaxUse=2G\n" | sudo tee /etc/systemd/journald.conf.d/persistent.conf
[Journal]
Storage=persistent
SystemMaxUse=2G
cr0x@server:~$ sudo systemctl restart systemd-journald
Decision: If your lockups are infrequent, persistent logs are mandatory. If disk is tiny, cap size; don’t disable persistence.
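After the next reboot, confirm persistence is actually working by checking that more than the current boot is listed; this is what lets you pull logs from before a freeze:
cr0x@server:~$ journalctl --list-boots | tail -n 3
...output...
If only one boot shows up, check that /var/log/journal exists and that journald was restarted after the config change.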
Enable kdump (when you need answers, not guesses)
Kdump is your black box recorder. It won’t always trigger on a hard lockup (because the system can be too dead to panic cleanly), but when it does, it’s gold. On Ubuntu:
cr0x@server:~$ sudo apt-get update
cr0x@server:~$ sudo apt-get install -y linux-crashdump crash kexec-tools
...output...
Then allocate crashkernel memory and enable:
cr0x@server:~$ sudo sed -n '1,200p' /etc/default/grub
...output...
Edit to include something like crashkernel=512M on a typical server (more for huge RAM boxes), then:
cr0x@server:~$ sudo update-grub
...output...
cr0x@server:~$ sudo systemctl enable --now kdump-tools
...output...
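For reference, the edited line in /etc/default/grub might look like this (illustrative; keep whatever parameters you already have, and note that some kdump-tools versions ship their own crashkernel default under /etc/default/grub.d/, so check before duplicating the parameter):
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash crashkernel=512M"
After the reboot, confirm the reservation took effect:
cr0x@server:~$ sudo dmesg | grep -i crashkernel
...output...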
Decision: If you can tolerate the reserved memory, keep kdump on until the root cause is proven fixed. Disable it later, not now.
Set up Magic SysRq (sometimes you can “unfreeze” enough to dump state)
Hard lockups often ignore SysRq because interrupts are dead on the affected CPU. But on SMP systems, another CPU may still run long enough for SysRq to produce backtraces.
cr0x@server:~$ cat /proc/sys/kernel/sysrq
176
Values vary; 176 is the bitmask for sync (16), remount read-only (32), and reboot/power-off (128), which notably does not include the debugging functions you want here. For full enable:
cr0x@server:~$ echo 'kernel.sysrq = 1' | sudo tee /etc/sysctl.d/99-sysrq.conf
kernel.sysrq = 1
cr0x@server:~$ sudo sysctl -p /etc/sysctl.d/99-sysrq.conf
kernel.sysrq = 1
Decision: Enable SysRq temporarily on debugging hosts. On high-security environments, weigh policy; but for outages, evidence beats purity.
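If a root shell on another CPU still responds during a hang, two triggers are worth trying; they dump state into the kernel log (and over netconsole, if configured):
cr0x@server:~$ echo l | sudo tee /proc/sysrq-trigger
l
cr0x@server:~$ echo w | sudo tee /proc/sysrq-trigger
w
The l trigger backtraces all active CPUs (it uses NMIs, so it can sometimes reach a wedged core); w dumps tasks stuck in uninterruptible sleep.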
Consider netconsole for “the log line right before it died”
When disks are stalled and the local console is frozen, shipping kernel messages over UDP to another box can capture the last gasp. This is especially useful for storage and driver lockups.
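A minimal sketch, assuming the sender’s NIC is eno1 with address 10.0.0.50, the collector is 10.0.0.99 listening on UDP 6666, and aa:bb:cc:dd:ee:ff is the collector’s (or gateway’s) MAC; substitute your own values:
cr0x@server:~$ sudo modprobe netconsole netconsole=6665@10.0.0.50/eno1,6666@10.0.0.99/aa:bb:cc:dd:ee:ff
cr0x@collector:~$ nc -l -u 6666 | tee netconsole.log
To survive reboots, put the options in /etc/modprobe.d/ and add netconsole to /etc/modules-load.d/ (or use a syslog receiver instead of nc for something more durable). Remember it’s UDP and best-effort: it catches many last gasps, not all of them.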
Practical tasks: commands, outputs, decisions (12+)
These are field tasks. Run them, read the output, then make a decision. Each task is here because it changes what you do next.
Task 1: Confirm the exact kernel, build, and boot parameters
cr0x@server:~$ uname -a
Linux server 6.8.0-52-generic #53-Ubuntu SMP PREEMPT_DYNAMIC Fri Nov 15 12:34:56 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.0-52-generic root=UUID=... ro quiet splash
What it means: You need the exact kernel version for correlation with known regressions and to verify changes actually took effect.
Decision: If you’re running HWE/edge kernels or custom builds, note it. Don’t mix “maybe it’s the kernel” with unknown provenance.
Task 2: Pull previous-boot kernel logs (journal) and filter lockup signals
cr0x@server:~$ sudo journalctl -k -b -1 | egrep -i "hard LOCKUP|soft lockup|watchdog|RCU stall|NMI|hung task|MCE|IOMMU|nvme|AER|Xid|amdgpu|i915|EDAC" | tail -n 120
...output...
What it means: You’re looking for precursors: NVMe resets, PCIe AER spam, GPU faults, machine check exceptions, IOMMU faults.
Decision: If you see hardware errors (MCE/EDAC/AER), prioritize hardware/firmware lanes before software tuning.
Task 3: Check if logs are persistent and sized sanely
cr0x@server:~$ journalctl --disk-usage
Archived and active journals take up 1.2G in the file system.
What it means: If disk usage is tiny and you reboot often, you may be losing relevant history.
Decision: Increase SystemMaxUse if you’re clipping older boots; decrease if you’re on tiny root disks.
Task 4: See how the previous shutdown looked (clean vs crash)
cr0x@server:~$ last -x | head -n 12
reboot system boot 6.8.0-52-generic Mon Dec 29 09:18 still running
crash system crash 6.8.0-52-generic Mon Dec 29 09:02 - 09:18 (00:16)
...output...
What it means: “crash” entries often indicate abrupt resets or watchdog reboots.
Decision: If you have crash entries with no logs, set up netconsole and kdump to catch more.
Task 5: Check hardware error logs (MCE on x86)
cr0x@server:~$ sudo journalctl -k | egrep -i "mce:|machine check|hardware error" | tail -n 80
...output...
What it means: Machine Check Exceptions indicate CPU/cache/memory/PCIe-level problems.
Decision: If MCEs exist, stop blaming random drivers. Investigate microcode, BIOS, DIMMs, thermals, and PCIe cards.
Task 6: Check EDAC (memory controller reporting) if present
cr0x@server:~$ sudo journalctl -k | egrep -i "EDAC|ECC" | tail -n 80
...output...
What it means: Corrected errors (CE) are early warnings; uncorrected (UE) can crash you.
Decision: Any UE is an incident. Re-seat/replace DIMMs, check memory config, and validate BIOS settings.
Task 7: Check PCIe AER errors (often points at a sick device or link)
cr0x@server:~$ sudo journalctl -k | egrep -i "AER:|pcieport|Corrected error|Uncorrected" | tail -n 120
...output...
What it means: Corrected AER spam can still cause latency storms and driver weirdness. Uncorrected is worse.
Decision: Identify the device (BDF) and inspect that PCIe slot/cable/riser/firmware.
Task 8: NVMe health and error log (common in lockup-adjacent incidents)
cr0x@server:~$ sudo apt-get install -y nvme-cli
...output...
cr0x@server:~$ sudo nvme list
Node SN Model Namespace Usage Format FW Rev
/dev/nvme0n1 S6XXXXXXXXXXXX ACME NVMe 1TB 1 120.0 GB / 1.0 TB 512 B + 0 B 3B2QEXM7
cr0x@server:~$ sudo nvme smart-log /dev/nvme0
...output...
cr0x@server:~$ sudo nvme error-log /dev/nvme0 | head -n 40
...output...
What it means: Look for media errors, increasing error log entries, temperature warnings, or a suspicious number of resets.
Decision: If error counters climb, treat it as a hardware/firmware issue first: update SSD firmware, move slots, check PCIe power and cooling.
Task 9: Identify repeated storage timeouts/reset storms in the kernel log
cr0x@server:~$ sudo journalctl -k -b -1 | egrep -i "nvme.*timeout|resetting controller|I/O error|blk_update_request|Buffer I/O error|task abort" | tail -n 120
...output...
What it means: Storage timeouts can cause systemic stalls, especially if root filesystem or swap is involved.
Decision: If the timeouts are present, stop “tuning.” Fix the storage path: firmware, power, PCIe link, cables/backplanes, multipath config, queue settings.
Task 10: Check interrupt distribution and spot a storm
cr0x@server:~$ head -n 30 /proc/interrupts
CPU0 CPU1 CPU2 CPU3
0: 22 0 0 0 IO-APIC 2-edge timer
1: 0 0 0 0 IO-APIC 1-edge i8042
...output...
cr0x@server:~$ egrep -i "nvme|mlx|ixgbe|i40e|ena|nvidia|amdgpu|ahci|xhci" /proc/interrupts | head -n 20
...output...
What it means: If one IRQ count is skyrocketing on one CPU, you may have an IRQ storm or affinity imbalance.
Decision: If interrupts pin to one CPU, consider irqbalance status, manual affinity, or investigating a misbehaving device/driver.
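If you need to inspect or pin a specific line, the affinity files live under /proc/irq/. The IRQ number 131 below is just an example; use whatever number /proc/interrupts shows for your device:
cr0x@server:~$ cat /proc/irq/131/smp_affinity_list
0-3
cr0x@server:~$ echo 2-3 | sudo tee /proc/irq/131/smp_affinity_list
2-3
Treat pinning as a diagnostic: if moving or spreading the hot IRQ changes the lockup behavior, you’ve learned something about the device. Note that irqbalance may re-spread it later unless you ban that IRQ.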
Task 11: Confirm irqbalance behavior (don’t guess)
cr0x@server:~$ systemctl status irqbalance --no-pager
...output...
What it means: irqbalance can help on general-purpose servers; on low-latency boxes, it can be deliberately disabled.
Decision: If disabled, re-enable temporarily to test whether lockups correlate with interrupt hotspots. If enabled, consider pinning critical IRQs for reproducibility.
Task 12: Check CPU frequency driver and governors (power management suspects)
cr0x@server:~$ sudo apt-get install -y linux-tools-common linux-tools-generic
...output...
cr0x@server:~$ cpupower frequency-info
...output...
cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
intel_pstate
What it means: Different cpufreq drivers and governors can change timing and power-state transitions that trigger lockups.
Decision: If lockups correlate with idle periods, deep sleep states are a prime suspect.
Task 13: Check idle state usage (C-states) if available
cr0x@server:~$ sudo apt-get install -y linux-tools-$(uname -r)
...output...
cr0x@server:~$ sudo turbostat --quiet --Summary --interval 5 --num_iterations 3
...output...
What it means: If you see the CPU spending lots of time in deep C-states right before lockups (hard to catch live, but useful in testing), you have a lead.
Decision: Test with C-states limited via kernel params, one at a time, and document.
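To see which idle states the CPU exposes, and later to confirm a kernel-parameter limit actually took effect, the cpuidle sysfs tree is enough; no extra tools needed:
cr0x@server:~$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
...output...
Fewer (or shallower) states listed after limiting C-states means your change is live.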
Task 14: Check for GPU driver resets or Xid (workstations and compute nodes)
cr0x@server:~$ sudo journalctl -k -b -1 | egrep -i "NVRM|Xid|amdgpu|ring timeout|GPU reset|i915.*reset" | tail -n 120
...output...
What it means: GPU hangs can freeze the box, especially if the GPU driver runs in kernel context and gets stuck.
Decision: If GPU errors precede lockups, reproduce with the GPU driver removed or swapped (open vs proprietary) to prove causality.
Task 15: Confirm kdump is armed and where dumps would land
cr0x@server:~$ sudo systemctl status kdump-tools --no-pager
...output...
cr0x@server:~$ sudo kdump-config show
...output...
What it means: If kdump isn’t loaded or crashkernel isn’t reserved, you won’t get vmcores.
Decision: Fix kdump now, not after the next outage.
Task 16: Force a controlled crash test (only on a test box or agreed maintenance)
cr0x@server:~$ echo 1 | sudo tee /proc/sys/kernel/sysrq
1
cr0x@server:~$ echo c | sudo tee /proc/sysrq-trigger
...output...
What it means: The system should crash and reboot, producing a vmcore if kdump is configured.
Decision: If no dump appears, your entire “we’ll debug later” plan is fictional.
Narrowing it down by subsystem
1) Power management: C-states, P-states, and the idle-time lockup pattern
Many hard lockups appear “random” until you correlate them with idle. The system freezes overnight, or right after load drops, or during low-traffic periods. That’s power management territory: deep C-states, package C-states, and frequency scaling transitions.
What to look for:
- Lockups happen when the system is mostly idle
- No obvious I/O errors beforehand
- Sometimes fixed by disabling deep sleep in BIOS, or by kernel params limiting C-states
Controlled tests (one change at a time):
- Limit C-states: processor.max_cstate=1 and/or intel_idle.max_cstate=1 (Intel)
- Disable the idle driver: idle=poll (heavy-handed; useful as a diagnostic)
- Adjust intel_pstate mode (active/passive), or set governor to performance for a test window
Be strict: apply one parameter per reboot and keep a change log. The temptation to throw a whole Reddit comment into GRUB is strong. Resist.
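Whatever the parameter of the day is, verify after reboot that the kernel actually booted with it; a surprising number of “the workaround didn’t help” reports are really “the workaround was never applied”:
cr0x@server:~$ cat /proc/cmdline | tr ' ' '\n' | egrep 'cstate|idle=|pstate'
processor.max_cstate=1
A change-log entry can be one line: date, host, parameter added/removed, observation window.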
2) Firmware and BIOS: where time goes to die
Firmware bugs can manifest as lockups because the CPU disappears into SMM (System Management Mode) for too long, or because ACPI tables describe broken power/interrupt behavior. Linux can’t fix vendor firmware. It can only route around it.
What to check:
- BIOS/UEFI version, release notes (especially “stability” and “PCIe compatibility” changes)
- Microcode package installed and active
- Power features like ASPM, Global C-states, “package C-state limit,” and “Energy Efficient Turbo” (names vary wildly)
Good practice: update BIOS and device firmware early in the investigation, but do it deliberately: change one layer, then observe for a full reproduction window.
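Recording the firmware layer takes a minute and saves arguments later. These commands cover the BIOS version and the microcode actually loaded (package names assume x86 Ubuntu):
cr0x@server:~$ sudo dmidecode -s bios-version
...output...
cr0x@server:~$ grep -m1 microcode /proc/cpuinfo
microcode	: 0x2b000603
cr0x@server:~$ dpkg -l | egrep 'intel-microcode|amd64-microcode'
...output...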
3) Storage and PCIe: NVMe, AER, and the “it’s not I/O, it’s the bus” trap
Storage-related lockups often start as PCIe problems: marginal links, overheating NVMe drives, power delivery issues, buggy SSD firmware, or a backplane doing backplane things. The kernel log will often show timeouts and resets—if you’re lucky enough to capture them.
Key signals:
- nvme nvme0: I/O X QID Y timeout
- resetting controller loops
- PCIe AER corrected/uncorrected spam pointing to the NVMe’s BDF
- Filesystems complaining (EXT4 journal aborts, XFS log I/O errors) after device errors
Decisions that matter:
- If the root disk is on the problematic NVMe and you see resets, don’t waste time “tuning IO schedulers.” Stabilize the device path first.
- If AER errors exist, treat them as physical-layer evidence: slot, riser, backplane, cable, device firmware, BIOS PCIe settings.
4) GPU and display stacks: when the kernel is collateral damage
On Ubuntu desktops, dev workstations, and compute nodes, GPU lockups are common “hard lockup neighbors.” A GPU hang can trigger a chain of events: driver resets, blocked kernel threads, and sometimes watchdog reports that look CPU-related but started in the GPU path.
Approach:
- Check for GPU reset messages right before the lockup
- Try the opposite driver stack (open vs proprietary) for a test period
- Reduce complexity: disable overclocks, disable fancy power states, test with a simpler compositor/TTY workload if possible
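Before swapping driver stacks, record which one is actually loaded; it’s common to find the open and proprietary modules fighting, or a DKMS build silently failing after a kernel update:
cr0x@server:~$ lsmod | egrep -i 'nvidia|nouveau|amdgpu|i915' | head
...output...
cr0x@server:~$ sudo journalctl -k -b | egrep -i 'nvidia|nouveau|amdgpu|i915' | head -n 40
...output...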
5) IOMMU and virtualization: great power, great weirdness
IOMMU issues can show up as DMA faults, device resets, or lockups. Virtualization stacks (KVM, VFIO passthrough) increase your exposure: more devices, more interrupt routing, more edge cases.
Signals:
- DMAR: [DMA Read] fault or AMD-Vi IOMMU fault logs
- Lockups under passthrough load
- Odd device behavior after suspend/resume (on workstations)
Controlled tests:
- Toggle IOMMU: disable for a test window (intel_iommu=off or amd_iommu=off) if policy permits
- Or enable passthrough mode if you need IOMMU but want fewer translations: iommu=pt
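Before toggling anything, confirm what the IOMMU is doing right now. The sysfs directory is empty when no IOMMU is active (on Intel you’ll typically see dmar* entries; names differ on AMD), and the previous-boot log shows any faults:
cr0x@server:~$ ls /sys/class/iommu/
dmar0  dmar1
cr0x@server:~$ sudo journalctl -k -b -1 | egrep -i "DMAR|AMD-Vi|iommu" | tail -n 60
...output...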
6) Kernel modules: DKMS, third-party drivers, and “works on my laptop” in production
Out-of-tree modules are frequent suspects because they don’t get the same regression testing across kernel updates. Ubuntu 24.04 runs a modern kernel; that’s good for hardware support, but it also means APIs move and timing changes.
Rule: If you have third-party modules, your first diagnostic goal is to reproduce without them. If you can’t, you don’t have a bug; you have a dependency negotiation.
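Two quick checks tell you whether out-of-tree code is even in play: what DKMS has built for the running kernel, and whether the kernel is tainted by proprietary or out-of-tree modules:
cr0x@server:~$ dkms status
...output...
cr0x@server:~$ cat /proc/sys/kernel/tainted
4097
A non-zero taint value isn’t proof of guilt, but it tells you (and any upstream bug report) that the kernel wasn’t running only in-tree code.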
Joke #2: The fastest way to reproduce a hard lockup is to schedule a demo for leadership—systems love an audience.
Three corporate mini-stories (how teams actually get this wrong)
Mini-story 1: The incident caused by a wrong assumption
The team ran a fleet of Ubuntu servers handling background video processing. They started seeing weekly hard lockups on a subset of new hardware. The initial narrative was confident: “It’s the new kernel. Roll back.” They pinned the kernel version across the fleet and… lockups continued. The confidence remained; the evidence did not.
Someone finally pulled the previous-boot kernel logs from a box that had persistent journald configured (quiet hero). Right before each freeze were PCIe AER corrected errors tied to a specific NVMe controller. No one had looked because the storage “seemed fine” after reboot.
The wrong assumption was subtle: they assumed corrected PCIe errors are harmless. They are not harmless when they’re constant. Corrected errors still cost time; they also correlate strongly with marginal links and impending uncorrected failures.
The fix wasn’t a kernel rollback. It was a firmware update for the SSD and a BIOS update that changed PCIe link training defaults. They also moved the drives out of a particular backplane batch. Lockups stopped. The kernel got blamed anyway, because kernels are a convenient villain.
Mini-story 2: The optimization that backfired
A latency-sensitive internal service wanted lower idle power. Someone enabled deeper package C-states in BIOS and changed the CPU governor to a more aggressive power-saving profile. The service looked great in graphs: lower watts, similar p95 latency under steady load. Everyone congratulated themselves for being efficient adults.
Then came the lockups. Not under load. During quiet times. The machines would freeze in the early morning when traffic dipped. Watchdog hard lockups appeared in dmesg after forced reboots, but nothing obvious preceded them.
They tried all the usual: driver upgrades, kernel updates, even NIC firmware. The turning point was a deliberately boring experiment: limit C-states via kernel parameters for two weeks on a subset of hosts, leaving everything else unchanged. The lockups vanished on the test set and continued on the control set.
The “optimization” was real, but it pushed the platform into a firmware/idle-state corner case. They rolled back the BIOS C-state change, documented it, and moved on. Power savings were smaller than an outage anyway.
Mini-story 3: The boring but correct practice that saved the day
A team running a mixed workload cluster (databases, caches, and batch jobs) had a rule: every host must have persistent journal logs and kdump configured, even if it “wastes memory.” It was unpopular. People complained about reserved RAM and disk space. The SRE who enforced it was described as “intensely unfun at parties,” which, to be fair, was accurate.
They hit a nasty series of hard lockups after a routine maintenance window. Some boxes froze completely; others rebooted. Instead of arguing about possible causes, they pulled vmcores from the boxes that crashed cleanly and compared stack traces. Several showed a common code path involving a storage driver error recovery routine triggered by repeated timeouts.
They didn’t need to reproduce blindly in production. They had dumps, timestamps, and consistent precursors in logs. The vendor storage driver was updated, and the issue was mitigated with a specific module parameter while waiting for a permanent fix.
No heroics. No folklore. Just the boring practice of collecting evidence before the incident. That rule paid for itself in one week.
Common mistakes: symptom → root cause → fix
This section is intentionally specific. These are the failure modes I keep seeing because people treat “hard lockup” like a generic Linux curse.
1) Symptom: Hard lockups mostly overnight or during low usage
Likely root cause: Deep idle state / C-state / firmware idle bug, sometimes triggered by tickless timing and low interrupt activity.
Fix: Test limiting C-states with processor.max_cstate=1 and/or intel_idle.max_cstate=1. If confirmed, adjust BIOS power settings, update BIOS/microcode, then remove the kernel workaround if possible.
2) Symptom: Hard lockup preceded by NVMe timeouts or controller resets
Likely root cause: NVMe firmware issue, overheating, PCIe marginal link, or power delivery/backplane problems causing reset storms.
Fix: Update SSD firmware and BIOS. Check cooling. Move drive to another slot. Investigate PCIe AER logs. Replace the device if error counters grow.
3) Symptom: “RCU stall” spam and then lockup
Likely root cause: CPU starvation (interrupt storm, stuck IRQ) or a driver path disabling interrupts too long; sometimes a VM host under interrupt pressure.
Fix: Check /proc/interrupts for storms, verify irqbalance, and identify the device. Consider isolating CPUs or changing IRQ affinity. Fix the misbehaving device/driver.
4) Symptom: Hard lockups after installing proprietary GPU driver
Likely root cause: GPU driver hang or reset loop; kernel stuck in driver code path.
Fix: Reproduce with open driver or without GPU acceleration. Update the driver and kernel in a controlled matrix. If it’s a workstation, reduce power management features for GPU temporarily.
5) Symptom: Lockups started after enabling virtualization passthrough
Likely root cause: IOMMU translation/interrupt remapping issues, device firmware bugs under VFIO, or buggy ACS/interrupt routing.
Fix: Check IOMMU fault logs. Test iommu=pt or temporarily disable IOMMU. Update BIOS and device firmware. Avoid stacking multiple experimental kernel params at once.
6) Symptom: “It’s always CPU X” in watchdog messages
Likely root cause: IRQ affinity hotspot, CPU isolation settings, or a CPU-core-specific hardware problem (rare but real).
Fix: Inspect IRQ distribution and CPU pinning. If it truly follows the core across configurations, consider hardware diagnostics and swapping CPUs/boards where possible.
7) Symptom: No logs at all right before the freeze
Likely root cause: Logs are not persistent, ring buffer overwritten, or the lockup is so hard it stops logging immediately.
Fix: Enable persistent journald, enable kdump, and consider netconsole. Also increase kernel log buffer if needed for noisy boots.
Checklists / step-by-step plan
Checklist A: Evidence capture (do this once, then leave it alone)
- Enable persistent journald storage with a sane cap.
- Install and configure kdump; verify with a controlled crash test (in maintenance).
- Record: kernel version, cmdline, BIOS version, microcode version, major driver versions (GPU, NIC, storage HBA).
- Ensure time sync works (chrony/systemd-timesyncd). Bad clocks ruin correlations.
- If logs still vanish, configure netconsole to a collector host.
Checklist B: Fast narrowing plan (one variable per reboot)
- Remove third-party modules (or boot a known-clean kernel) and attempt reproduction window.
- Power management test: limit C-states for one window. If it improves, pivot to BIOS/microcode investigation.
- Storage path test: check NVMe SMART/error logs and PCIe AER; stress I/O while watching logs.
- IOMMU test if faults exist or passthrough is used: iommu=pt or disable for test (policy permitting).
- Interrupt test: check /proc/interrupts before and after load; look for storms and affinity issues.
Checklist C: If you must mitigate now (production triage)
- Apply the least invasive mitigation that plausibly targets the subsystem implicated by logs.
- Prefer mitigations that are reversible and measurable (kernel params, driver version pinning) over BIOS fishing expeditions.
- If storage errors exist: migrate workload off the host; do not “wait and see.”
- If MCE/EDAC errors exist: take the host out of service and run hardware diagnostics; do not optimize your way out of a flaky DIMM.
FAQ
1) What’s the difference between a soft lockup and a hard lockup?
Soft lockup: a CPU is stuck running kernel code too long without scheduling; interrupts still happen. Hard lockup: a CPU stops responding to interrupts long enough that the watchdog fires. Hard is usually more severe and more likely to indicate interrupt-disabled paths, firmware issues, or hardware.
2) If I see “hard LOCKUP” once, is the machine unreliable forever?
No. Some lockups are real regressions or firmware bugs that get fixed. But treat it as a serious signal until you have a clean reproduction and a verified fix. “It went away” is not a fix; it’s a pause.
3) Does kdump always work for hard lockups?
No. If the system is fully wedged and can’t trigger a panic, you might not get a dump. That’s why persistent logs and sometimes netconsole matter. Kdump is still worth it because when it works, it shortens investigations dramatically.
4) Should I disable the watchdog to stop the messages?
Do not disable it as a “fix.” At best you hide the symptom; at worst you convert a detectable failure into a silent hang that looks like “the network died.” Leave watchdogs on while debugging.
5) Can storage issues really cause CPU hard lockups?
Indirectly, yes. Timeouts, reset storms, and interrupt/pathological error handling can lead to conditions where CPUs spend too long in critical sections or stop servicing interrupts properly. Also, PCIe-level faults can affect system behavior beyond “just I/O.”
6) What’s the fastest single change to test idle-state-related lockups?
Limit C-states for a test window: add processor.max_cstate=1 (and on Intel also intel_idle.max_cstate=1) to the kernel command line. It’s blunt, but it’s a clean diagnostic toggle.
7) I only see the lockup message after reboot. How do I catch the cause before it freezes?
Persistent journal logs help. Netconsole can capture the last kernel messages to another host. If the system still runs on some CPUs during the hang, SysRq backtraces might dump to logs. And if you can panic on hang (carefully), kdump can capture state.
8) Is this more likely on bare metal than VMs?
Hard lockups are more common on bare metal because you’re exposed to firmware, power management, and device drivers directly. VMs can still see soft lockups and stalls, but the hypervisor often absorbs some classes of hardware weirdness.
9) What if the lockups started exactly after upgrading to Ubuntu 24.04?
Then your priority is to correlate with: kernel version changes, driver stack changes (especially GPU/NIC/storage), and power management defaults. Capture evidence first, then test: older kernel on 24.04, or 24.04 kernel on older userspace if you have a lab. Avoid changing BIOS settings at the same time as OS upgrades unless you can isolate variables.
10) When do I stop debugging and replace hardware?
When you have MCE/EDAC errors, rising NVMe error counters, persistent PCIe AER spam tied to a device, or the issue follows a component across OS changes. Hardware is allowed to be guilty.
Conclusion: practical next steps
Hard lockups feel like chaos because they steal your observability at the worst possible moment. Your countermove is disciplined evidence capture and controlled experiments.
- Turn on persistent journald and verify you can read previous-boot kernel logs.
- Configure kdump and run a controlled crash test during maintenance. If you can’t get a vmcore, don’t pretend you can.
- On the next incident, extract precursors from journalctl -k -b -1 and classify the problem: power management, storage/PCIe, GPU/driver, IOMMU/interrupts, or hardware errors.
- Run one change per reboot. Keep a tiny change log. Yes, even if you hate paperwork.
- If you see MCE/EDAC/AER/NVMe errors: stop tuning and start fixing the physical path—firmware, cooling, slots, and parts.
If you do this right, the investigation becomes boring. Boring is good. Boring means you’re turning a haunted system into a measurable one.