The reboot habit is a tax you pay forever. It buys you a few minutes of calm, wipes your fingerprints off the scene, and quietly guarantees you’ll see the same incident again—usually at 2:13 AM, during a deploy, when someone “just needs it back.”
Ubuntu 24.04 is a solid base. When it goes sideways, it’s rarely because the OS woke up feeling chaotic. It’s because something measurable happened: a queue built up, memory pressure crossed a threshold, a kernel worker got stuck, a NIC flapped, a filesystem paused, a service wedged itself into a bad state. Your job isn’t to reboot the evidence. Your job is to isolate the root cause and decide the smallest safe fix.
Rebooting is not a fix: the production mindset
A reboot is a blunt instrument. It clears memory, resets drivers, kills stuck processes, and makes metrics look “good” because you reset the clocks. It’s the technical equivalent of turning the music down to find the noise in your car engine: you’ll feel better, but the engine is still dying.
In production, you optimize for repeatable recovery and root cause isolation. That means:
- Containment first: stop user impact, cap blast radius, reduce load.
- Evidence capture: logs, state, counters, queues—before you reset them.
- Smallest reversible change: restart one service, reload config, drain one node, disable one feature flag (example commands after this list).
- Root cause: prove it with data, not vibes.
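What the smallest reversible change looks like in practice on a systemd host; a minimal sketch, assuming app-api.service and nginx.service stand in for whatever you actually run:
cr0x@server:~$ sudo systemctl restart app-api.service           # reset one unit's state, not the whole machine
cr0x@server:~$ sudo systemctl reload nginx.service              # re-read config without dropping connections
cr0x@server:~$ systemctl status app-api.service --no-pager      # confirm the unit actually came back healthy
Each of these is small, fast, and easy to explain in the incident channel, which is exactly what a reversible change should be.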
If your process is “reboot and hope,” you’re training your org to accept unexplained outages as normal. That’s not resilience. That’s Stockholm syndrome with uptime dashboards.
One idea worth keeping on your wall: Gene Kranz, the Apollo flight director, pushed a culture of being “tough and competent.” Paraphrased: disciplined process under pressure, not improvisation.
That’s ops work on a random Tuesday.
Joke #1: Rebooting to fix production is like microwaving your phone because it’s slow. You may get warmth, but you’re not solving the right problem.
Facts and context: why this keeps happening
Here are some concrete, non-myth facts that help explain why “a reboot fixed it” is so common—and so misleading:
- Linux used to be “reboot to apply changes” far less than Windows, but driver resets and kernel bugs still exist. Many “reboot fixes” are actually “driver state reset” stories.
- systemd made service management radically more consistent than ad-hoc init scripts. That also means you can often fix issues with a targeted systemctl restart instead of a full reboot.
- Ubuntu’s move to predictable network interface names reduced “eth0 disappeared” chaos, but misconfigured netplan can still cause subtle link/route issues that persist until reload.
- cgroups evolved from “nice-to-have” to “the resource control plane” (especially with containers). Many “random slowdowns” are actually cgroup throttling, memory pressure, or IO weights doing exactly what you told them to do.
- NVMe brought amazing latency—plus new failure modes: firmware quirks, PCIe power management oddities, and intermittent media errors that show up as IO stalls, not clean failures.
- The OOM killer is not an error; it’s a policy decision. It kills something to keep the kernel alive. Rebooting after OOM is often a refusal to learn which process ate the machine.
- Journald’s structured logs changed incident response: you can filter by boot, unit, priority, PID, and time window. That’s evidence you didn’t have with scattered text files.
- Modern kernels can “stall” without crashing: soft lockups, RCU stalls, blocked IO, and filesystem transaction waits. The server looks alive, but forward progress is gone.
- Cloud environments made “reboot” dangerously cheap: the habit moved from desktops into fleets, and suddenly nobody remembers to preserve forensic data.
Ubuntu 24.04 (Noble) is especially “observable” if you use what’s already there: systemd, journald, kernel logs, performance counters, and the metrics you should have had anyway. The playbook below assumes you’re willing to look.
Fast diagnosis playbook (first/second/third)
This is the triage sequence that finds bottlenecks quickly without getting lost in trivia. Use it when the page hits and you have minutes to make the system less on fire.
First: establish the failure mode (is it CPU, memory, disk, or network?)
- Is the box responsive? Can you SSH? Can you run commands within 1–2 seconds?
- Is load average lying? Load can be high because of runnable CPU pressure or because tasks are stuck in uninterruptible IO sleep.
- Is there active swapping or OOM? Memory pressure causes cascading failure: slow allocs, swap storms, IO, timeouts.
- Is disk latency spiking? High await / long queues will freeze everything that touches storage.
- Is the network the bottleneck? Retransmits, drops, DNS failures, MTU mismatches—classic “it’s down but it’s up.”
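Each of those questions maps to a quick check. A 60-second snapshot, assuming the usual tooling is present (sysstat for iostat), looks like this:
cr0x@server:~$ uptime && vmstat 1 3                        # load, runnable vs blocked tasks, CPU vs IO wait
cr0x@server:~$ free -h && cat /proc/pressure/memory        # available memory plus PSI memory stall time
cr0x@server:~$ iostat -x 1 3                               # per-device latency, queue depth, utilization
cr0x@server:~$ ss -s && ip -s link                         # socket summary and NIC error/drop counters
You are not diagnosing anything yet. You are deciding which of the four doors to walk through.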
Second: identify the “who” (which process/unit and which dependency)
- Top offenders: CPU hog, memory leak, stuck threads, or too many processes.
- Dependency mapping: is your app waiting on DB, DNS, NFS, object store, or local disk?
- Service health: systemd unit state, restart loops, watchdog triggers.
Third: take the smallest safe corrective action, then confirm with metrics
- Reduce load: drain node from LB, pause batch jobs, rate-limit, disable expensive feature flags.
- Targeted restart: restart the service, not the machine, unless you’re doing a controlled kernel/driver reset for a known issue.
- Confirm: did latency drop, error rate drop, queues drain, resource use normalize?
- Capture evidence: even if you must reboot, save journal excerpts, kernel logs, and state first.
That sequence avoids the trap of “I saw high load, therefore CPU,” which is how people end up resizing instances while the real culprit is a disk queue stuck behind one sad NVMe timeout.
Practical tasks (commands, outputs, decisions)
Below are hands-on tasks you can run on Ubuntu 24.04 during an incident. Each includes: command, a realistic output slice, what it means, and the decision you make.
Task 1: Identify if you’re looking at CPU pressure or IO wait
cr0x@server:~$ uptime
14:22:10 up 18 days, 3:11, 2 users, load average: 18.72, 17.95, 16.40
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
6 0 0 412180 90244 3218120 0 0 12 110 980 2100 45 10 40 5 0
3 2 0 398200 90244 3221200 0 0 240 9010 1200 2600 18 6 22 54 0
2 4 0 390112 90244 3221456 0 0 180 12050 1180 2500 10 4 18 68 0
Meaning: load is high, but b (blocked) is rising and wa (IO wait) is huge. This is not “just CPU.” It’s “threads stuck waiting on storage.”
Decision: stop scaling CPU. Pivot to disk latency and filesystem/RAID/NVMe checks.
Task 2: Check per-disk latency and queue depth
cr0x@server:~$ iostat -x 1 3
Linux 6.8.0-41-generic (server) 12/30/2025 _x86_64_ (8 CPU)
Device r/s w/s rkB/s wkB/s aqu-sz await r_await w_await %util
nvme0n1 12.0 220.0 384.0 9012.0 18.40 78.20 6.10 82.10 99.2
nvme1n1 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
Meaning: %util ~99% and await ~78ms on nvme0n1 is bad for a system expecting low-millisecond latency. aqu-sz is deep: requests are queued.
Decision: find what’s writing, check filesystem, check for saturation or errors, and consider throttling the offender.
Task 3: Identify top IO writers by process
cr0x@server:~$ sudo iotop -oPa
Total DISK READ: 1.10 M/s | Total DISK WRITE: 102.45 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
8921 be/4 postgres 0.00 B/s 68.20 M/s 0.00 % 7.10 % postgres: checkpointer
25140 be/4 root 0.00 B/s 22.30 M/s 0.00 % 3.20 % /usr/bin/rsync -aH --delete /var/lib/app/ /mnt/backup/
Meaning: PostgreSQL checkpointer and a backup job are writing heavily. The backup may be stealing IO at the worst time.
Decision: pause/renice/ionice the backup, confirm DB checkpoints aren’t mis-tuned, and verify you’re not snapshotting at peak.
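One way to take pressure off immediately, assuming the rsync PID from the iotop output above (25140) is the right target. Note that ionice only has real effect if the device’s IO scheduler honors priorities (bfq does; none and mq-deadline mostly don’t):
cr0x@server:~$ sudo ionice -c3 -p 25140        # move the backup into the idle IO class
cr0x@server:~$ sudo renice -n 15 -p 25140      # lower its CPU priority while you’re at it
cr0x@server:~$ sudo kill -STOP 25140           # or pause it outright; kill -CONT resumes it after the peak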
Task 4: Confirm memory pressure and swap behavior
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 31Gi 28Gi 510Mi 2.1Gi 2.5Gi 1.1Gi
Swap: 4.0Gi 3.6Gi 410Mi
Meaning: Swap is heavily used; “available” is low. This often turns “fine” latency into a swamp.
Decision: find the memory consumer, stop the leak, cap it via systemd/cgroups, or scale memory—but only after proving it’s not a runaway job.
Task 5: Catch OOM killer events and the victim
cr0x@server:~$ journalctl -k -b | grep -i -E "oom|out of memory|killed process" | tail -n 5
Dec 30 13:58:41 server kernel: Out of memory: Killed process 18221 (java) total-vm:9643212kB, anon-rss:6123400kB, file-rss:10240kB, shmem-rss:0kB, UID:1001 pgtables:18432kB oom_score_adj:0
Meaning: The kernel killed a Java process due to memory exhaustion. Rebooting “fixes” it because you removed the offender and cleared caches, not because the system healed.
Decision: fix memory sizing/limits, find the leak, add guardrails (systemd MemoryMax, container limits), and confirm the app’s heap settings match reality.
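A sketch of the guardrail side using systemd’s cgroup knobs; the unit name and limits are illustrative and must be sized for the real workload, not copied blindly:
cr0x@server:~$ sudo systemctl set-property app-api.service MemoryHigh=5G MemoryMax=6G   # throttle at 5G, hard cap at 6G
cr0x@server:~$ systemctl show app-api.service -p MemoryHigh -p MemoryMax                # confirm the new values took
cr0x@server:~$ systemd-cgtop -m -b -n 1                                                  # one snapshot of which units hold memory
set-property persists by writing a drop-in; add --runtime if you only want the cap until the next reboot.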
Task 6: Find which unit is flapping or in a restart loop
cr0x@server:~$ systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● app-api.service loaded failed failed App API service
cr0x@server:~$ systemctl status app-api.service --no-pager -l
× app-api.service - App API service
Loaded: loaded (/etc/systemd/system/app-api.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Mon 2025-12-30 14:10:18 UTC; 2min 1s ago
Process: 24410 ExecStart=/usr/local/bin/app-api (code=exited, status=1/FAILURE)
Memory: 1.2G
CPU: 11.203s
Dec 30 14:10:18 server app-api[24410]: FATAL: cannot connect to Redis: dial tcp 10.20.0.15:6379: i/o timeout
Dec 30 14:10:18 server systemd[1]: app-api.service: Main process exited, code=exited, status=1/FAILURE
Dec 30 14:10:18 server systemd[1]: app-api.service: Failed with result 'exit-code'.
Meaning: The app isn’t “broken”; it’s failing because Redis is unreachable or slow. Rebooting the app host won’t fix a network path or Redis overload.
Decision: pivot to dependency: test Redis reachability, check its latency, and validate routing/firewall.
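Quick dependency checks, reusing the address from the error message above; redis-cli assumes the redis-tools package is installed:
cr0x@server:~$ nc -vz -w 2 10.20.0.15 6379                 # can we open a TCP connection within 2 seconds?
cr0x@server:~$ ip route get 10.20.0.15                     # which route and interface this traffic actually takes
cr0x@server:~$ redis-cli -h 10.20.0.15 -p 6379 --latency   # round-trip latency, if the port answers at all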
Task 7: Check kernel warnings that hint at stalls or driver resets
cr0x@server:~$ journalctl -k -b -p warning..alert --no-pager | tail -n 20
Dec 30 14:01:12 server kernel: INFO: task kworker/u16:3:1123 blocked for more than 120 seconds.
Dec 30 14:01:12 server kernel: nvme nvme0: I/O 224 QID 3 timeout, aborting
Dec 30 14:01:42 server kernel: nvme nvme0: Abort status: 0x371
Dec 30 14:02:10 server kernel: EXT4-fs warning (device nvme0n1p2): ext4_end_bio:340: I/O error 10 writing to inode 524313 (offset 0 size 4096)
Meaning: NVMe timeouts and ext4 IO errors are not “random slowness.” They’re a storage problem, potentially hardware or firmware.
Decision: protect data, plan maintenance, check SMART/NVMe health, consider failing the device out, and stop pretending a reboot is “repair.”
Task 8: Inspect NVMe health and error log
cr0x@server:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0x00
temperature : 49 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 7%
data_units_read : 193,421,112
data_units_written : 882,100,420
media_errors : 12
num_err_log_entries : 98
cr0x@server:~$ sudo nvme error-log /dev/nvme0 | head -n 8
Error Log Entries for device:nvme0 entries:64
Entry[ 0]
error_count : 98
sqid : 3
cmdid : 0x00d1
status_field : 0x4004
lba : 182736128
Meaning: media_errors and a growing error log correlate with the kernel timeouts. This is trending toward real failure.
Decision: schedule replacement, reduce write amplification, verify backups, and move critical services off that device.
Task 9: Validate filesystem space/inodes (the boring outage generator)
cr0x@server:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 220G 218G 1.2G 100% /
tmpfs 16G 1.2G 15G 8% /run
cr0x@server:~$ df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/nvme0n1p2 14417920 14399811 18109 100% /
Meaning: Full disk or inode exhaustion causes “random” failures: service can’t write PID files, logs fail, databases refuse commits, apt breaks, sockets fail.
Decision: delete/rotate logs, clear old releases, fix runaway file creation, add quotas, and set alerts before it hits 100% again.
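To find what is actually eating the space and the inodes (the paths below are the usual suspects, not a guarantee):
cr0x@server:~$ sudo du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -n 15    # largest directories on this filesystem
cr0x@server:~$ sudo find /var/log -xdev -type f -size +500M -exec ls -lh {} +     # individual oversized logs
cr0x@server:~$ sudo journalctl --vacuum-size=1G                                    # cap journald’s own disk usage
cr0x@server:~$ sudo find / -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head # which directories hold the most files (inode hunting)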
Task 10: Spot file descriptor exhaustion (the stealth limiter)
cr0x@server:~$ cat /proc/sys/fs/file-nr
10240 0 1048576
cr0x@server:~$ sudo lsof | wc -l
245112
Meaning: The system-wide file handle ceiling is fine here (10240 allocated out of a 1048576 maximum). Keep in mind that lsof also lists memory-mapped files and duplicate entries, so the raw line count overstates unique descriptors; even so, a number this high suggests either very high concurrency or a leak (sockets/files not closing).
Decision: check per-process limits (ulimit, systemd LimitNOFILE=), identify the leaking process, and fix it—don’t reboot to “close files.”
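To narrow it down, assuming app-api.service is the suspect unit and <pid> is its main PID (both placeholders):
cr0x@server:~$ systemctl show app-api.service -p LimitNOFILE                        # the descriptor limit the unit actually runs with
cr0x@server:~$ sudo ls /proc/<pid>/fd | wc -l                                        # descriptors really held by that process
cr0x@server:~$ sudo lsof -p <pid> | awk '{print $5}' | sort | uniq -c | sort -rn     # which types (sock, REG, FIFO) dominate
A leak usually shows up as one type growing monotonically, most often sockets stuck in CLOSE_WAIT.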
Task 11: Prove network loss, retransmits, and route weirdness
cr0x@server:~$ ip -s link show dev enp5s0
2: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 3c:ec:ef:12:34:56 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
10498312323 8123312 0 24810 0 12901
TX: bytes packets errors dropped carrier collsns
9983211121 7921101 0 120 0 0
cr0x@server:~$ ss -s
Total: 2135 (kernel 0)
TCP: 1980 (estab 1120, closed 650, orphaned 3, timewait 645)
Transport Total IP IPv6
RAW 0 0 0
UDP 62 55 7
TCP 1330 1211 119
INET 1392 1266 126
FRAG 0 0 0
Meaning: RX drops are significant. That’s not “the app is slow,” that’s “packets are being dropped,” which leads to retries, timeouts, and sad dashboards.
Decision: investigate NIC ring sizes, CPU softirq saturation, switch port errors, MTU mismatches, and traffic bursts. Rebooting might temporarily clear queues, but it won’t fix a bad link or oversubscription.
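Where to look next. The interface name matches the output above, and the conntrack sysctls only exist if connection tracking is loaded:
cr0x@server:~$ sudo ethtool -g enp5s0                                  # RX/TX ring sizes: current versus hardware maximum
cr0x@server:~$ sudo ethtool -S enp5s0 | grep -iE 'drop|miss|err'       # driver-level drop and error counters
cr0x@server:~$ grep NET_RX /proc/softirqs                              # is packet processing pinned to one busy CPU?
cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max   # conntrack table pressure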
Task 12: Check DNS failures quickly (because everything depends on it)
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.10.0.53
DNS Servers: 10.10.0.53 10.10.0.54
Link 2 (enp5s0)
Current Scopes: DNS
DNS Servers: 10.10.0.53 10.10.0.54
cr0x@server:~$ dig +time=1 +tries=1 api.internal A @10.10.0.53
; <<>> DiG 9.18.24-1ubuntu0.1-Ubuntu <<>> +time=1 +tries=1 api.internal A @10.10.0.53
;; connection timed out; no servers could be reached
Meaning: DNS is timing out. Many services fail “mysteriously” when name resolution stalls.
Decision: fail over to alternate resolvers, fix upstream DNS, and avoid “app restarts” that just amplify load on a dying DNS tier.
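Useful follow-ups with systemd-resolved; the resolver addresses are the ones resolvectl reported above:
cr0x@server:~$ dig +time=1 +tries=1 api.internal A @10.10.0.54     # does the secondary resolver answer?
cr0x@server:~$ sudo resolvectl dns enp5s0 10.10.0.54 10.10.0.53    # prefer the healthy resolver on this link for now
cr0x@server:~$ sudo resolvectl flush-caches                        # drop stale or negative cache entries
cr0x@server:~$ resolvectl statistics                               # cache hits, misses, and failure counts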
Task 13: Inspect cgroup throttling (containers and systemd both do this)
cr0x@server:~$ systemctl show app-api.service -p CPUQuota -p MemoryMax -p IOWeight
CPUQuota=50%
MemoryMax=2147483648
IOWeight=100
Meaning: You explicitly limited CPU to 50% and memory to 2GiB. If the service is slow under load, it may be obeying your constraints, not misbehaving.
Decision: confirm whether throttling aligns with SLOs; adjust quotas deliberately, don’t “fix” it by rebooting into the same limits.
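Under cgroup v2 (the default on Ubuntu 24.04) you can prove throttling instead of guessing; the paths below follow from the unit name app-api.service:
cr0x@server:~$ cat /sys/fs/cgroup/system.slice/app-api.service/cpu.stat          # nr_throttled and throttled_usec show quota enforcement
cr0x@server:~$ cat /sys/fs/cgroup/system.slice/app-api.service/memory.pressure   # PSI: how long tasks stalled waiting on memory
cr0x@server:~$ sudo systemctl set-property --runtime app-api.service CPUQuota=200%   # temporary, deliberate raise; gone after reboot
If nr_throttled is climbing during the slow periods, the service is doing exactly what you configured it to do.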
Task 14: Capture evidence before you change anything
cr0x@server:~$ sudo journalctl -b --since "2025-12-30 13:30" --until "2025-12-30 14:30" -o short-iso > /tmp/journal-incident.txt
cr0x@server:~$ sudo dmesg -T > /tmp/dmesg-incident.txt
cr0x@server:~$ ps -eo pid,ppid,cmd,%cpu,%mem,stat --sort=-%cpu | head -n 15 > /tmp/ps-topcpu.txt
cr0x@server:~$ ps -eo pid,ppid,cmd,%cpu,%mem,stat --sort=-%mem | head -n 15 > /tmp/ps-topmem.txt
Meaning: You’ve now preserved the story. If you must restart services—or even reboot—you’re not throwing away the clues.
Decision: proceed with mitigations knowing you can still do a post-incident RCA based on captured artifacts.
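If the incident smells like network or storage, widen the snapshot a little; these write to /tmp alongside the files above:
cr0x@server:~$ sudo ss -tnp > /tmp/ss-tcp.txt                                    # TCP connections and the processes that own them
cr0x@server:~$ ip -s link > /tmp/iplink.txt && ip route > /tmp/routes.txt        # interface counters and the routing table
cr0x@server:~$ iostat -x 1 3 > /tmp/iostat-incident.txt                          # a few seconds of per-device latency (needs sysstat)
cr0x@server:~$ cat /proc/pressure/cpu /proc/pressure/io /proc/pressure/memory > /tmp/psi-incident.txt   # PSI stall totals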
Joke #2: If rebooting is your only troubleshooting step, congratulations—you’ve invented a very expensive power button.
Three corporate mini-stories (realistic, painful, useful)
Mini-story 1: The incident caused by a wrong assumption
They had an internal API tier on Ubuntu. When latency spiked, the on-call reflex was to restart the API service. When that failed, they rebooted the VM. It worked often enough that nobody questioned it.
One Friday, the reboot didn’t help. Errors came back immediately: timeouts to a cache cluster. The incident commander assumed “cache is down.” So the cache team got paged, then the network team, then security. Classic corporate blood sport: three teams arguing in a chat while customers wait.
The actual issue was stupider and more instructive. A new netplan change had added a static route that accidentally shadowed the cache subnet via the wrong gateway. Most traffic still worked because the default route was fine. Only the cache traffic took the scenic route through a firewall that rate-limited it.
Reboots “fixed” it previously because the route sometimes failed to apply consistently after partial config pushes—timing and state. This time the route applied cleanly, meaning it broke deterministically. The reboot didn’t reset the bad route; it reloaded it.
Afterward they changed two things: (1) treat “dependency unreachable” as a network/path problem until proven otherwise, and (2) store route diffs and validated config in CI. The reboot habit had been masking misconfiguration, not solving incidents.
Mini-story 2: The optimization that backfired
A storage-heavy service was “optimized” by moving logs to faster local NVMe and increasing log verbosity for better debugging. The team also tightened log rotation because they’d been burned by disk-full once. Sensible motives. Bad interaction.
At scale, the verbose logs meant frequent writes. The logrotate schedule was aggressive, and the post-rotate hook sent signals to reload multiple services. Under bursty traffic, logrotate would kick in, compress logs, and spike CPU and IO right when the app needed it most.
Users saw periodic latency cliffs every hour. On-call would reboot nodes because “it clears the slowness.” The reboot helped because it interrupted compression and reset queues. Nobody connected the hourly pattern because people are bad at time series in their heads.
When they finally graphed disk await and CPU iowait, it was obvious: every cliff matched logrotate. Fix was boring: reduce log volume, switch to asynchronous shipping, spread rotation, and remove reload hooks that didn’t need to run. They also added alerts on disk latency, not just “disk used.”
The takeaway: “optimization” without measuring second-order effects is just gambling with better vocabulary.
Mini-story 3: The boring but correct practice that saved the day
A fintech shop ran Ubuntu 24.04 nodes for transaction processing. They had a strict incident rule: before any reboot, capture journal and kernel logs for the relevant window, plus one snapshot of process and network state. No exceptions unless the host is actively corrupting data.
One night, a subset of nodes started stalling. Not crashing. Just stalling: elevated load, rising latencies, intermittent timeouts. The fastest “fix” would have been to recycle the nodes. Instead, the on-call followed the rule and captured evidence.
The kernel logs showed NVMe timeouts, but only after specific patterns: a burst of sync writes from a particular service version. The process snapshots showed that version had flipped a feature flag that increased fsync frequency.
They rolled back the feature flag, latency stabilized, and the nodes recovered without reboots. Later, with the preserved logs, they worked with the vendor to update NVMe firmware and adjusted the app’s durability strategy. The evidence turned a midnight mystery into an actionable plan.
That “boring rule” didn’t just help diagnose. It prevented a bigger failure: rebooting would have hidden a storage reliability issue until it became data loss. In production, boring is often another word for “safe.”
Common mistakes: symptom → root cause → fix
1) “High load average means CPU is pegged”
Symptom: Load average is 20+, app is slow, but CPU graphs look “not that bad.”
Root cause: Tasks blocked on IO (D-state). Load includes uninterruptible sleep; it’s often storage latency.
Fix: Use vmstat (b, wa), iostat -x (await, util), then identify IO-hog processes (iotop).
2) “Reboot fixed it, so it was a kernel issue”
Symptom: Reboot clears problems for hours/days; incident repeats.
Root cause: Memory leak, file descriptor leak, gradual queue buildup, log/disk fill, or stuck dependency.
Fix: Capture evidence; track growth (RSS, open files, disk used, queue lengths). Restart the offending unit and fix underlying leak or limits.
3) “The application is down” when it’s actually DNS
Symptom: Random timeouts to internal services; restarting apps changes nothing.
Root cause: Resolver timeouts, broken upstream, or split-horizon misconfig.
Fix: Check resolvectl and test queries with short timeouts (dig +time=1 +tries=1). Fail over resolvers or fix DNS tier.
4) “Disk isn’t full, so storage isn’t the issue”
Symptom: Plenty of free space, but IO is slow; services hang.
Root cause: Latency spikes from device errors, firmware, write amplification, or a noisy writer. Also inode exhaustion can happen with free bytes remaining (or vice versa).
Fix: Check latency (iostat), errors (journalctl -k), health (nvme smart-log), and inodes (df -i).
5) “It must be networking” when it’s actually CPU softirq or conntrack
Symptom: Packet drops, intermittent timeouts, high connection counts.
Root cause: CPU spent in softirq handling packets, or conntrack table pressure causing drops/timeouts, often from bursts.
Fix: Correlate RX drops with CPU usage and connection stats; mitigate by shaping traffic, tuning, or distributing load across nodes.
6) “Restart loops are harmless”
Symptom: A service flaps; sometimes it “comes back,” sometimes it doesn’t.
Root cause: Dependency timeout, bad config, or resource limit reached. Restart loops amplify load and can cascade (thundering herd).
Fix: Inspect systemctl status, set sane restart backoff, and fix dependency health or config. Prefer circuit breakers over blind restarts.
7) “We increased performance by making everything sync”
Symptom: Latency cliffs during bursts; storage metrics show spikes.
Root cause: Excessive fsync or synchronous writes, especially if multiple services share a device.
Fix: Measure IO pattern; batch writes, adjust durability settings consciously, isolate workloads, or move logs/temporary writes off critical paths.
8) “It’s a memory leak” when it’s actually page cache + workload
Symptom: Free memory is low, people panic.
Root cause: Linux uses memory for cache; low “free” is normal. The signal is “available” and swap activity, not “free.”
Fix: Use free -h, vmstat (si/so), and app RSS trends. Fix only if available collapses or swap churn begins.
Checklists / step-by-step plan
Checklist A: The “do not reboot yet” 10-minute routine
- Confirm impact: what’s failing (API, DB, disk, DNS), and for whom (one AZ/node vs global).
- Capture evidence: journal window, dmesg, top processes, network snapshot. (Use Task 14.)
- Classify bottleneck: CPU vs memory vs disk vs network using vmstat, iostat, free, ip -s link.
- Find the offender: top CPU/mem process, IO writers, flapping unit.
- Check dependency errors: app logs for “cannot connect” and timeouts; validate DNS quickly.
- Mitigate safely: drain from LB, stop batch jobs, reduce concurrency, pause backups.
- Targeted restart if needed: restart one service or reload config; confirm recovery with metrics.
- Only then consider reboot: driver/firmware reset scenario, kernel panic loop, or irrecoverable deadlock—after evidence capture.
Checklist B: If you must reboot, do it like an adult
- State your reason: “Rebooting to reset NVMe controller after timeouts,” not “because slow.”
- Preserve evidence first: journal/dmesg snapshots; record current kernel and firmware versions (snippet after this checklist).
- Drain traffic: take the node out of rotation.
- Reboot and validate: confirm the symptom is gone and watch for reappearance under load.
- Open a follow-up task: root cause analysis is mandatory; otherwise you just scheduled the incident again.
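Recording “what we rebooted from” takes three commands; the NVMe device path is the one from the earlier tasks:
cr0x@server:~$ uname -r > /tmp/pre-reboot-kernel.txt                        # kernel version before the reset
cr0x@server:~$ sudo nvme fw-log /dev/nvme0 > /tmp/pre-reboot-nvme-fw.txt    # active firmware slot on the suspect drive
cr0x@server:~$ sudo dmidecode -s bios-version > /tmp/pre-reboot-bios.txt    # platform firmware, for the vendor ticket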
Checklist C: Post-incident root cause workflow (the part everyone skips)
- Timeline: when did symptoms start, what changed, what got worse, what fixed it.
- Correlate signals: disk latency vs error rate, memory pressure vs timeouts, packet drops vs retries.
- Prove one root cause: pick the smallest set of facts that explains the majority of symptoms.
- Define prevention: config change, limit, alert, capacity plan, runbook update, or rollback policy.
- Add a test/guardrail: CI validation for netplan routes, alerts for inode usage, SLO-based autoscaling, etc.
FAQ
1) When is rebooting actually the right call?
When you have evidence of a kernel/driver deadlock, repeated NVMe resets, or a known bug with a documented workaround—and you’ve captured logs first. Also when the host is corrupting data or can’t make forward progress and containment requires removal.
2) Why does a reboot “fix” memory leaks?
Because it kills the leaking process and clears memory state. The leak is still there. You just reset the timer until it fills RAM again.
3) How do I quickly tell CPU saturation from IO wait?
vmstat 1: high us/sy with low wa suggests CPU. Rising b and high wa suggests IO wait. Confirm with iostat -x for device latency.
4) Why is load average high even when CPU is idle?
Because load includes tasks stuck in uninterruptible sleep (often IO). They count toward load but they aren’t consuming CPU cycles.
5) journald feels “different.” How do I use it without getting lost?
Filter by boot (-b), time window (--since/--until), unit (-u), and priority (-p). Treat it like a query engine, not a scrolling contest.
6) What’s the single most underrated disk failure signal?
Latency spikes and intermittent IO timeouts in kernel logs. Disks can be “up” and “not full” while still failing at the job that matters: completing IO promptly and correctly.
7) What do I do if a dependency is slow but not down?
Measure and isolate: add timeouts, circuit breakers, and backpressure. Reduce load on the dependency (cache stampede control, query limits), and separate noisy tenants. “Not down” can still be “unusable.”
8) How do I keep people from rebooting as a reflex?
Make evidence capture and targeted remediation the default. Add a simple rule: “No reboot without a stated hypothesis and preserved logs.” Then reward fast, correct mitigations—not dramatic midnight heroics.
9) What’s the fastest way to catch “disk full” before it’s an outage?
Alert on both df -h usage and df -i inode usage, plus write latency. Disk-full is often preceded by rising log volume or runaway file creation—both measurable.
10) Why does restarting one service help more than rebooting sometimes?
Because it resets the broken state (dead connections, stuck threads, leaked descriptors) without wiping the whole system. It’s faster, less risky, and preserves other healthy services.
Next steps that reduce incidents (without heroics)
Stop using reboots as emotional support. On Ubuntu 24.04 you already have the tooling to isolate the root cause: systemd, journald, kernel logs, and the performance counters that tell you where the bottleneck really is.
Do these next:
- Adopt the 10-minute routine: capture evidence, classify bottleneck, find offender, then mitigate.
- Make “no reboot without hypothesis” a norm: write the reason in the incident channel/ticket.
- Alert on the right things: IO latency, inode usage, swap activity, RX drops—not just CPU and “disk percent used.”
- Limit blast radius: quotas/limits for services, rate limits for clients, and isolation for noisy workloads.
- Practice targeted recovery: restart units, drain nodes, disable features—prove you can recover without resetting the world.
You don’t need to be a kernel whisperer. You need to be disciplined, suspicious of convenient stories, and allergic to “it worked after reboot” as a final answer.