You SSH into an Ubuntu 24.04 box that’s “on fire”: load average is double digits, services are timing out, and your incident channel is filling with screenshots of top showing… basically no CPU usage. The graphs don’t look like a CPU storm. Yet the system is clearly not okay.
Welcome to the iowait trap: the machine isn’t “busy computing,” it’s busy waiting. And Linux will happily report that waiting as “not using CPU” while your users experience it as “everything is broken.”
What high load + low CPU actually means
Linux load average is not “CPU usage.” It’s a count of tasks that are either:
- Runnable (running or waiting for a CPU, state R)
- In uninterruptible sleep (usually I/O wait, state D)
When storage gets slow, your threads pile up in D state. They’re not consuming CPU cycles, so CPU usage looks low. But they still contribute to load average because they’re ready to do work… as soon as the kernel can complete their I/O.
That mismatch is why you see:
- High load average
- Low user/system CPU
- High iowait % (sometimes), plus elevated disk latency
- Lots of blocked threads (D state) and timeouts everywhere
One more thing: “iowait” is not a resource you can “use up.” It’s a symptom. It tells you CPUs are idle because tasks are waiting on I/O. Don’t fight the symptom; find the wait.
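If you want to see that split on a live box, a quick state count settles the argument. A minimal sketch using standard procps tooling; the output is simply a count followed by a state letter:

# Count tasks by state: R (runnable) and D (uninterruptible sleep) are the
# two states that feed load average.
ps -eo state= | sort | uniq -c | sort -rn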
Joke #1: If your load average is high and CPU is low, congratulations—you’ve built a very expensive waiting room.
Facts & history you’ll care about mid-incident
- Load average predates Linux. It came from Unix systems in the 1970s; the concept assumed “runnable” meant “wants CPU,” long before modern storage stacks and distributed systems.
- Linux counts uninterruptible sleep in load. That’s why I/O stalls inflate load average even if CPUs aren’t saturated.
- “iowait” is per-CPU idle time attribution. The CPU isn’t busy; the scheduler is saying, “I would run work, but it’s blocked on I/O.”
- NVMe made latency visible, not irrelevant. Faster devices reduce average latency, but tail latency (p95/p99) still ruins queues and load when devices or firmware misbehave.
- Writeback can turn reads into a problem. Dirty page thresholds and throttling can stall unrelated work when the kernel decides it must flush.
- Cloud block storage often has burst behavior. You can “run out of performance” even while you have plenty of capacity, and it looks exactly like iowait.
- Filesystems trade consistency for performance differently. ext4 journaling modes, XFS allocation behavior, and ZFS transaction groups each create distinct stall signatures.
- RAID controllers still lie. Some caches acknowledge writes early, then later stall flushes under battery/capacitor issues or writeback policy changes.
- Linux has gotten better at observability. Tools like iostat -x, pidstat -d, and BPF-based tracing make it harder for storage problems to hide behind “CPU looks fine.”
Fast diagnosis playbook (first/second/third)
This is the short list you run when the pager is loud and your brain is doing that fun thing where it forgets how computers work.
First: confirm it’s not actually CPU
- Check load, CPU breakdown, and run queue vs blocked tasks.
- Look for D-state processes and high iowait.
Second: confirm storage latency and queueing
- Is the device busy (%util)?
- Is latency high (await, r_await, w_await)?
- Is the queue deep (aqu-sz)?
Third: identify which workload and which layer
- Which PIDs are doing I/O? Which filesystems? Which mount points?
- Is it local disk, mdraid, LVM, dm-crypt, ZFS, NFS, iSCSI, or a cloud volume?
- Is it writeback throttling, journal contention, or an underlying device problem?
Once you’ve done those three, you can stop debating and start fixing.
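If you want the whole first/second/third pass captured in one shot (so nobody argues about what the numbers were), here’s a minimal sketch. It assumes sysstat is installed (for iostat and pidstat); the output directory and sampling durations are placeholders, not recommendations.

#!/usr/bin/env bash
# Capture one triage snapshot: scheduler, CPU attribution, and device evidence.
out="/tmp/iowait-triage-$(date +%s)"
mkdir -p "$out"
uptime                                        > "$out/uptime.txt"
vmstat 1 5                                    > "$out/vmstat.txt"
iostat -x 1 3                                 > "$out/iostat.txt"
pidstat -d 1 5                                > "$out/pidstat.txt"
ps -eo state,pid,comm,wchan:32 --sort=state   > "$out/ps-states.txt"
echo "Saved triage snapshot to $out"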
Confirming iowait and blocked work: commands that settle arguments
Below are practical tasks you can run on Ubuntu 24.04. Each includes: a command, realistic output, what it means, and the decision you make from it.
Task 1: Take a snapshot of the crime scene (load + CPU breakdown)
cr0x@server:~$ uptime
14:22:08 up 36 days, 6:11, 2 users, load average: 18.42, 17.90, 16.77
Meaning: A load average of ~18 is very high for most servers. On its own, it does not tell you why.
Decision: Immediately check whether the load is run queue (CPU) or blocked tasks (I/O).
Task 2: Use top like an adult (look at wa, not just %CPU)
cr0x@server:~$ top -b -n 1 | head -n 15
top - 14:22:13 up 36 days, 6:11, 2 users, load average: 18.42, 17.90, 16.77
Tasks: 612 total, 9 running, 601 sleeping, 0 stopped, 2 zombie
%Cpu(s): 3.1 us, 1.2 sy, 0.0 ni, 63.9 id, 31.6 wa, 0.0 hi, 0.2 si, 0.0 st
MiB Mem : 64221.1 total, 1234.8 free, 40211.4 used, 22774.9 buff/cache
MiB Swap: 8192.0 total, 8100.0 free, 92.0 used. 18920.2 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18412 postgres 20 0 4267120 2.1g 12536 D 1.3 3.4 12:48.21 postgres
19344 www-data 20 0 621432 31420 8920 D 0.7 0.0 0:38.02 php-fpm
2228 root 20 0 0 0 0 I 0.3 0.0 9:12.33 kworker/u16:2
Meaning: wa at ~30% is a billboard: the CPUs are waiting on I/O. Also notice processes in D state.
Decision: Shift from “what’s using CPU?” to “what’s blocking on I/O?”
Task 3: Measure run queue vs blocked tasks (vmstat)
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
7 38 94208 126412 81200 23110240 0 0 812 9440 3221 6110 4 1 63 32 0
5 41 94208 125980 81200 23109988 0 0 640 10320 3199 5902 3 1 62 34 0
8 44 94208 124992 81200 23109012 0 0 712 11088 3340 6408 3 1 60 36 0
6 39 94208 125220 81200 23108211 0 0 540 9720 3101 6055 3 1 64 32 0
7 45 94208 125104 81200 23107010 0 0 690 10840 3368 6530 4 1 61 34 0
Meaning: Column b (blocked) is huge compared to r (runnable). This is classic I/O wait pressure.
Decision: Go to device-level latency and queueing metrics.
Task 4: Confirm disk saturation and latency (iostat -x)
cr0x@server:~$ iostat -x 1 3
Linux 6.8.0-41-generic (server) 12/28/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
3.2 0.0 1.3 31.4 0.0 64.1
Device r/s w/s rKB/s wKB/s rrqm/s wrqm/s r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme0n1 22.0 340.0 1800.0 93200.0 0.1 12.0 9.80 144.20 52.10 81.8 274.1 2.1 99.2
dm-0 21.8 338.9 1792.0 93140.0 0.0 0.0 10.10 146.90 52.30 82.2 274.9 0.0 0.0
Meaning: %util ~99% indicates the device is saturated or constantly busy. w_await ~144ms is painful for anything remotely transactional. aqu-sz ~52 means a deep queue: requests are piling up.
Decision: Identify what’s writing and why; decide whether to throttle workload, move it, or fix the storage path.
Task 5: Find which processes are doing I/O (pidstat -d)
cr0x@server:~$ pidstat -d 1 5
Linux 6.8.0-41-generic (server) 12/28/2025 _x86_64_ (16 CPU)
14:22:41 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
14:22:42 999 18412 0.00 18240.00 2400.00 11234 postgres
14:22:42 33 19344 0.00 1220.00 180.00 1320 php-fpm
14:22:42 0 1451 0.00 9800.00 0.00 2100 systemd-journald
14:22:42 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
14:22:43 999 18412 0.00 19010.00 2600.00 11620 postgres
14:22:43 0 1451 0.00 9400.00 0.00 2050 systemd-journald
Meaning: postgres and journald are major writers, and iodelay is high (clock ticks spent blocked on I/O). kB_ccwr/s is cancelled writeback: pages that were dirtied and then truncated before they ever hit disk, a hint of churny rewrite or temp-file patterns.
Decision: Investigate the database and journaling/logging patterns; consider log rate limiting, log destination, DB checkpoint settings, or relocating WAL.
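If the database side needs a quick sanity check, here’s a hedged sketch, assuming local access via the postgres OS user and PostgreSQL 16 or earlier (newer versions moved checkpoint counters to pg_stat_checkpointer):

# Read-only checks: do checkpoint/WAL settings plausibly explain the write bursts?
sudo -u postgres psql -Atc "SHOW checkpoint_timeout;"
sudo -u postgres psql -Atc "SHOW max_wal_size;"
sudo -u postgres psql -Atc "SHOW synchronous_commit;"
# Requested (forced) checkpoints vs timed ones: lots of requested checkpoints
# means the write load is outrunning max_wal_size.
sudo -u postgres psql -Atc "SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;"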
Task 6: Confirm lots of tasks stuck in D state (ps)
cr0x@server:~$ ps -eo state,pid,comm,wchan:32 --sort=state | head -n 20
D 18412 postgres io_schedule
D 19344 php-fpm io_schedule
D 20488 nginx io_schedule
D 20501 nginx io_schedule
D 20512 nginx io_schedule
D 22110 rsyslogd io_schedule
R 9123 top -
R 9130 ps -
S 1451 systemd-journald ep_poll
S 1680 sshd do_select
Meaning: Multiple app processes blocked in io_schedule. They’re waiting for I/O completion.
Decision: Treat this as a storage-path incident unless proven otherwise.
Task 7: Map I/O back to filesystems and mount options (findmnt)
cr0x@server:~$ findmnt -lo SOURCE,FSTYPE,OPTIONS,TARGET -t ext4,tmpfs
SOURCE FSTYPE OPTIONS TARGET
/dev/dm-0 ext4 rw,relatime,errors=remount-ro /
/dev/nvme0n1p2 ext4 rw,relatime /var/lib/postgresql
tmpfs tmpfs rw,nosuid,nodev /run
tmpfs tmpfs rw,nosuid,nodev /dev/shm
Meaning: Postgres data is on /dev/nvme0n1p2, root is on dm-0. If iostat shows nvme0n1 is saturated, that matches.
Decision: Focus on the NVMe device and the database write pattern first.
Task 8: Check filesystem-level pressure (df, inodes, and journal symptoms)
cr0x@server:~$ df -hT /var/lib/postgresql
Filesystem Type Size Used Avail Use% Mounted on
/dev/nvme0n1p2 ext4 1.8T 1.6T 120G 94% /var/lib/postgresql
Meaning: 94% full isn’t instant doom, but it’s a known risk: allocation gets harder, fragmentation rises, and some filesystems behave worse near full.
Decision: Plan to free space or extend the filesystem. Meanwhile, proceed—this alone doesn’t explain 150ms write await, but it contributes.
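Two companion read-only checks worth running at the same time; the device and mount point match the earlier output, so adjust for your layout:

# Inode pressure on the hot mount (an exhausted inode table looks like
# "no space" even when free blocks remain).
df -i /var/lib/postgresql
# ext4 reserves ~5% of blocks for root by default; on a 1.8T data volume
# that is a lot of invisible space. Inspect only; don't change it mid-incident.
sudo tune2fs -l /dev/nvme0n1p2 | grep -i 'reserved block count'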
Task 9: Check kernel logs for disk resets/timeouts
cr0x@server:~$ sudo journalctl -k -n 30 --no-pager
Dec 28 14:18:02 server kernel: nvme nvme0: I/O 672 QID 4 timeout, completion polled
Dec 28 14:18:02 server kernel: nvme nvme0: Abort status: 0x371
Dec 28 14:18:03 server kernel: nvme nvme0: resetting controller
Dec 28 14:18:06 server kernel: nvme nvme0: controller reset complete
Dec 28 14:18:11 server kernel: EXT4-fs (nvme0n1p2): Delayed block allocation failed for inode 262401 at logical offset 918234
Dec 28 14:18:12 server kernel: blk_update_request: I/O error, dev nvme0n1, sector 1819238480 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Meaning: Controller resets and I/O errors. That’s not “the app is slow.” That’s “storage is sick.” Also ext4 allocation failures suggest pressure and/or underlying errors.
Decision: Escalate to hardware/cloud volume health immediately; start mitigation (failover, read-only mode, reduce write load) while investigating.
Task 10: Check SMART/NVMe health signals
cr0x@server:~$ sudo nvme smart-log /dev/nvme0 | head -n 20
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0x00
temperature : 63 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 89%
data_units_read : 12,349,221
data_units_written : 81,442,109
host_read_commands : 902,112,441
host_write_commands : 5,110,992,120
controller_busy_time : 145,221
power_cycles : 33
power_on_hours : 14,880
unsafe_shutdowns : 7
media_errors : 18
num_err_log_entries : 18
Meaning: percentage_used at 89% and non-zero media_errors is a strong hint the drive is aging or failing. Not definitive alone, but combined with timeouts/resets it’s damning.
Decision: Replace/evacuate the device (or move the workload) rather than tuning around a dying drive.
Task 11: Verify whether memory pressure is triggering writeback storms
cr0x@server:~$ cat /proc/meminfo | egrep 'Dirty|Writeback|MemFree|MemAvailable|Buffers|Cached'
MemFree: 126412 kB
MemAvailable: 19374132 kB
Buffers: 83192 kB
Cached: 22640480 kB
Dirty: 1562040 kB
Writeback: 81220 kB
Meaning: Dirty data is ~1.5GB; not insane on a 64GB machine. MemAvailable is healthy. So this is less likely “memory is forcing brutal writeback,” more likely “device latency is high.”
Decision: Don’t knee-jerk sysctl writeback tuning yet. Fix the storage path first.
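For the record (and the postmortem), capture the current writeback settings before anyone proposes changing them. A read-only sketch:

# Ratios are percentages of available memory; the *_bytes variants override
# the ratios whenever they are set to a non-zero value.
sysctl vm.dirty_background_ratio vm.dirty_ratio
sysctl vm.dirty_background_bytes vm.dirty_bytes
sysctl vm.dirty_expire_centisecs vm.dirty_writeback_centisecs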
Task 12: Look for direct evidence of block-layer queueing
cr0x@server:~$ cat /sys/block/nvme0n1/queue/nr_requests
1023
Meaning: The device queue allows a lot of outstanding requests. That can be good for throughput, bad for latency if the device is already overloaded (queueing theory is not a vibes-based discipline).
Decision: If this is a latency-sensitive service, consider reducing concurrency at the application or database layer first; don’t randomly tweak kernel queues unless you can measure impact.
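If you want a quick read-only sweep of queue settings across every block device (to compare the hot device with its neighbors), a small sketch:

# Print nr_requests and the active scheduler for each block device.
# Purely informational; some virtual devices may not expose every file.
for q in /sys/block/*/queue; do
  dev=$(basename "$(dirname "$q")")
  printf '%-12s nr_requests=%-6s scheduler=%s\n' \
    "$dev" "$(cat "$q/nr_requests" 2>/dev/null)" "$(cat "$q/scheduler" 2>/dev/null)"
done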
Task 13: Check IO scheduler and multipath/dm stack quickly
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
Meaning: NVMe commonly uses none (aka no traditional scheduler) with multiqueue. That’s fine. If you see a weird stack (dm-crypt, dm-thin, multipath), account for it in your mental model.
Decision: If this is cloud/virtualized storage, look for the virtual disk layer’s own limits; changing schedulers won’t fix burst-credit depletion or backend throttling.
Task 14: Correlate with sar to see if it’s chronic or acute
cr0x@server:~$ sar -u 1 3
Linux 6.8.0-41-generic (server) 12/28/2025 _x86_64_ (16 CPU)
14:23:41 CPU %user %nice %system %iowait %steal %idle
14:23:42 all 3.05 0.00 1.22 32.10 0.00 63.63
14:23:43 all 2.88 0.00 1.11 30.84 0.00 65.17
14:23:44 all 3.18 0.00 1.30 31.66 0.00 63.86
Average: all 3.04 0.00 1.21 31.53 0.00 64.22
Meaning: iowait is consistently high in this window. If you have historical sar data, you can compare to “normal” and identify when it started.
Decision: If it began after a change (deploy, kernel update, storage migration), roll back or isolate. If it’s gradual, suspect capacity and wear.
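If sysstat data collection is enabled, you can also pull per-device history for the comparison. The file name below is an example; on Ubuntu the real files live under /var/log/sysstat/saDD, where DD is the day of the month:

# -d: per-device stats (await, aqu-sz, %util); -p: pretty device names.
sar -d -p -f /var/log/sysstat/sa27 | less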
Task 15: If it’s network storage, check the obvious (mount type and RPC stats)
cr0x@server:~$ mount | egrep 'type nfs|type cifs' || true
cr0x@server:~$ nfsstat -c
Client rpc stats:
calls retrans authrefrsh
12433 118 0
Client nfs v4:
null read write commit open
0 812 9340 220 14
Meaning: Retransmits indicate network or server-side delay. For NFS-backed workloads, iowait can be “disk” but it’s actually “network + remote disk + server queues.”
Decision: If retrans climbs, treat it as a distributed storage incident: check network drops, NFS server load, and storage backend latency.
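For per-mount NFS latency rather than aggregate RPC counters, two options; nfsiostat ships with nfs-common, and /mnt/share below is a hypothetical mount point:

# Per-operation RTT and queue times for one NFS mount, sampled every second.
nfsiostat 1 3 /mnt/share
# The raw counters behind it (per-op timings, retransmits) live here:
less /proc/self/mountstats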
Task 16: Confirm if systemd/journald is amplifying the pain
cr0x@server:~$ sudo journalctl --disk-usage
Archived and active journals take up 6.7G in the file system.
Meaning: Large journals plus chatty services can generate sustained writes. Not usually enough to saturate NVMe alone, but during a storage incident it can keep the knife twisted.
Decision: During mitigation, reduce log verbosity or ship logs elsewhere. Don’t delete logs blindly in an incident unless you’ve agreed on the tradeoff.
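A hedged mitigation sketch, if you and the incident commander agree journald churn is worth reducing right now (the size and rate numbers are placeholders, not recommendations):

# Shrink archived journals immediately (active logging keeps flowing).
sudo journalctl --vacuum-size=2G
# Persistent limits go in /etc/systemd/journald.conf or a drop-in, e.g.:
#   SystemMaxUse=2G
#   RateLimitIntervalSec=30s
#   RateLimitBurst=1000
# Then: sudo systemctl restart systemd-journald (expect a brief logging gap).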
Finding the bottleneck: disk, filesystem, controller, network storage, or kernel
Step 1: Don’t trust a single metric—correlate three
To confidently call iowait-driven load, you want a triangle of evidence:
- Scheduler evidence: high load average + many D-state processes (ps, vmstat)
- CPU attribution: significant iowait (top, sar -u)
- Device evidence: high latency + queueing (iostat -x)
If all three align, stop arguing with the graphs. Start fixing the storage path.
Step 2: Decide whether you’re dealing with throughput or latency
This matters because the fix is different:
- Latency problem: awaits are high, queue is high, %util may be high. Apps time out; DBs get sad; web requests pile up.
- Throughput ceiling: awaits are moderate but you’re pushing huge sustained IO. Users may see slower bulk jobs, not necessarily timeouts.
For the “high load, low CPU” incident, it’s usually latency and queueing.
Step 3: Figure out if it’s reads, writes, or flushes
iostat -x breaks out r_await vs w_await. You care because:
- Write await spikes often correlate with journal commits, WAL/fsync patterns, controller cache policy, or cloud volume throttling.
- Read await spikes suggest cache misses, random IO amplification, or a device that can’t keep up with working set.
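When you need to attribute that latency to specific processes and operations, BPF tooling is the shortcut. A sketch assuming the bpfcc-tools package is installed (Ubuntu ships these tools with a -bpfcc suffix):

# Latency histogram per disk over a 10-second window (one report).
sudo biolatency-bpfcc -D 10 1
# One line per I/O with PID, command, disk, size, and latency.
sudo biosnoop-bpfcc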
Step 4: Identify the layer adding pain
On Ubuntu, the I/O stack can include: filesystem → LVM → dm-crypt → mdraid → device, plus optional network storage. Each layer can multiply latency under load.
Quick ways to map it:
- lsblk -f shows device-mapper layering.
- dmsetup ls --tree makes it explicit.
- multipath -ll (if used) shows path health.
cr0x@server:~$ lsblk -o NAME,TYPE,FSTYPE,SIZE,MOUNTPOINT
NAME TYPE FSTYPE SIZE MOUNTPOINT
nvme0n1 disk 1.8T
├─nvme0n1p1 part vfat 512M /boot/efi
├─nvme0n1p2 part ext4 1.6T /var/lib/postgresql
└─nvme0n1p3 part crypto_LUKS 200G
└─dm-0 crypt ext4 200G /
Meaning: Root is encrypted via dm-crypt; DB data is plain ext4 on partition 2.
Decision: If the DB volume is the hot device, don’t get distracted by dm-crypt tuning on root.
Step 5: Understand the “D state doesn’t die” failure mode
When threads are stuck in uninterruptible sleep, you can’t always kill them. The kernel is waiting for I/O completion; signals don’t help. If the underlying device or path is wedged, processes can remain stuck until the device recovers—or until you reboot.
That’s why “just restart the service” sometimes does nothing. The old processes are still blocked, the new ones also block, and now you’ve added churn.
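You can at least see where a D-state task is stuck. A quick sketch, reusing the postgres PID from the earlier snapshots (root is required for the kernel stack):

# Kernel stack of the blocked task: io_schedule / filesystem / block-layer
# frames confirm it is waiting on storage, not on a lock you control.
sudo cat /proc/18412/stack
# The wait channel symbol: same answer in one word (file has no trailing newline).
cat /proc/18412/wchan; echo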
Joke #2: Sending SIGKILL to a D-state process is like yelling at a loading bar—emotionally satisfying, operationally useless.
Step 6: Use one quote to guide your incident behavior
Werner Vogels (paraphrased idea): “Everything fails, all the time—design and operate as if failure is normal.”
iowait incidents feel mysterious because they’re “not CPU.” But they’re normal: disks stall, networks jitter, controllers reset, and cloud volumes throttle. Build detection and runbooks accordingly.
Three corporate mini-stories (the kind you remember)
Mini-story 1: The incident caused by a wrong assumption
They had a payments service on Ubuntu, healthy CPU graphs, and a sudden spike in load average that set off the “scaling” alarms. The on-call assumed it was a burst of traffic and added two more instances. Load stayed high. Requests still timed out.
Then they did the classic: grep through application logs, find “database connection timeout,” and declare the database “slow.” They restarted the DB. It didn’t help. Worse: the restart took forever because the database process was stuck in D state waiting for disk, and shutdown hooks hung.
Finally someone ran iostat -x and saw write awaits in the hundreds of milliseconds and %util pegged. The storage was a cloud block volume. It wasn’t out of space; it was out of guaranteed IOPS due to a configuration change made weeks earlier, plus a spike in write-heavy audit logging that burned the burst headroom.
The wrong assumption was subtle: “high load means compute pressure.” They treated it like CPU and autoscaling. But iowait doesn’t scale out that way when every instance hits the same storage limit (shared backend) or the same workload pattern (write amplification).
Fix: roll back the logging change, reduce write frequency, and temporarily move the WAL/logs to a faster volume tier. Only then did the system behave. The postmortem was blunt: they had alarms for CPU saturation but none for disk latency. The pager screamed, but the dashboards whispered.
Mini-story 2: The optimization that backfired
A platform team wanted faster deployments. They moved container image layers and build caches onto a shared NFS mount to “deduplicate storage” and “simplify backups.” It was a tidy diagram. Everyone applauded. Then the first big release day arrived.
Nodes showed high load, but top didn’t show CPU usage. Pods failed readiness checks. Engineers chased Kubernetes scheduling, DNS, and “noisy neighbor” CPU quotas. Meanwhile, the NFS server’s backend storage was doing synchronous writes for metadata-heavy operations and getting hammered by thousands of small file operations.
The optimization backfired because it concentrated random metadata I/O. The network storage path amplified tail latency: client caches, server queues, backend disk, and occasional retransmits when the network got busy. Each layer was “fine” on average. Together they were a latency factory.
The recovery wasn’t fancy: move the build caches back to local NVMe on each node and treat the shared NFS mount as an artifact repository, not a scratchpad. The “savings” on storage capacity had been paid for with incident time, lost developer trust, and a lot of useless CPU scaling.
Mini-story 3: The boring but correct practice that saved the day
A company ran a customer analytics pipeline on a handful of big Ubuntu hosts. Nothing glamorous: batch jobs, a database, and a message queue. The SRE team had one unsexy habit: they logged weekly baseline latency numbers for each storage device—iostat -x summaries, peak await, and a note of what “normal” looked like.
One afternoon, load average crept up and the app team reported “CPU seems fine.” The on-call opened the baseline doc and immediately saw the mismatch: normal w_await was single-digit milliseconds; now it was triple digits. No debate. No “maybe it’s the code.” It was storage until proven otherwise.
They checked kernel logs and found occasional controller resets. Not enough to crash the system, but enough to wreck latency. Because they had a baseline, they didn’t waste an hour proving “this is abnormal.” They already knew.
Mitigation was equally boring: fail over the DB to a standby on different hardware, drain batch workers, and schedule a maintenance window to replace the suspect device. The incident report wasn’t exciting. That’s the point. Good operations often looks like someone refusing to be surprised by the same class of failure twice.
Common mistakes: symptom → root cause → fix
This is where teams lose time. The trick is to treat symptoms as clues, not as diagnoses.
1) Symptom: High load average, low CPU, lots of timeouts
Root cause: blocked threads in D state due to storage latency or errors.
Fix: Confirm with vmstat/ps/iostat -x. Mitigate by reducing write load, moving hot paths (WAL/logs) to faster storage, or failing over. Investigate device health and kernel logs.
2) Symptom: iowait is low, but load is high and apps hang
Root cause: load includes D state; iowait percentage can be misleading on multi-core systems or when blocked tasks aren’t sampled the way you expect.
Fix: Count D-state tasks directly (ps) and check block-layer metrics (iostat -x). Don’t gate on iowait alone.
3) Symptom: %util is 100% but throughput is not high
Root cause: the device is busy doing slow I/O (high latency), often random writes, fsync-heavy workloads, or retries due to errors.
Fix: Identify I/O pattern (iostat -x, pidstat -d). Reduce fsync frequency where safe, move write-ahead logs, or fix hardware/cloud volume limits.
4) Symptom: Everything gets worse after “tuning” kernel writeback
Root cause: pushing dirty ratios too high increases burstiness; flush storms become larger and more punishing when the device can’t keep up.
Fix: Revert to defaults unless you have a measured reason. Prefer application-level smoothing (batching writes, rate limits) over kernel roulette.
5) Symptom: Restarting services doesn’t help; shutdown hangs
Root cause: processes stuck in D state waiting for I/O completion; they won’t die until I/O returns.
Fix: Fix the underlying storage stall or reboot as a last resort after ensuring data safety. If it’s a DB, prefer failover to rebooting primary.
6) Symptom: Only one mount point is slow, others fine
Root cause: a specific volume/device is saturated or erroring; not “the whole host.”
Fix: Map mounts to devices (findmnt, lsblk). Move that workload, adjust quotas, or fix/replace that device.
7) Symptom: Bursty slowness at predictable intervals
Root cause: checkpointing, journal commits, ZFS txg syncs, log rotation spikes, or periodic backup snapshots.
Fix: Align schedules, reduce concurrency, stagger jobs, tune DB checkpoints carefully, and isolate logs/WAL from data when possible.
8) Symptom: High iowait on a VM, but disks “look fine” in the guest
Root cause: host-side contention, throttling, or noisy neighbor effects; guest metrics can hide the real queue.
Fix: Check hypervisor/cloud metrics and volume limits. In-guest tools show symptoms; they may not show the cause.
Checklists / step-by-step plan
Checklist A: 10-minute triage (do this before changing anything)
- Record uptime, a top snapshot, and vmstat 1 5.
- Run iostat -x 1 3 and save the output.
- List top I/O processes with pidstat -d 1 5.
- Count D-state tasks: ps -eo state | grep -c '^D'.
- Check kernel logs for storage errors: journalctl -k -n 100.
- Map mounts to devices: findmnt -D and lsblk.
- Check free space and inode pressure on hot mounts: df -hT, df -i.
Checklist B: Mitigation options (choose the least risky that works)
- Reduce write load fast: turn down debug logging, pause batch jobs, slow ingestion.
- Protect the database: if WAL/checkpoints are the pain, move WAL to a faster/dedicated device if available, or fail over.
- Isolate noisy processes: stop or throttle the top I/O offender (carefully—don’t kill the only thing keeping data consistent).
- Move traffic: drain the node from load balancers; shift to healthy replicas.
- Last resort: reboot only after confirming you won’t make recovery worse (especially for storage errors).
Checklist C: Root cause workflow (after stabilizing)
- Determine if latency is read or write dominated (iostat -x).
- Correlate with deploys/changes (journalctl, your change log).
- Confirm hardware/volume health (NVMe SMART, RAID logs, cloud volume events).
- Check filesystem behavior near-full, journal mode, and mount options.
- Evaluate app I/O pattern changes (new logging, new indexes, vacuum, compaction jobs).
- Add alerting: disk await, queue depth, and D-state counts, not just CPU (a minimal check sketch follows right after this list).
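Here is a minimal alerting sketch under those assumptions: it flags the host when the D-state count or write latency crosses placeholder thresholds, and expects you to wire the exit code into whatever monitoring you already run. The device name and numbers are illustrative, not recommendations.

#!/usr/bin/env bash
# Alert when blocked tasks or write latency exceed placeholder thresholds.
set -euo pipefail
D_MAX=10           # illustrative: max tolerated D-state tasks
W_AWAIT_MAX=50     # illustrative: max tolerated w_await in ms
DEV=nvme0n1        # adjust to your hot device

d_count=$(ps -eo state= | grep -c '^D' || true)

# Find the w_await column by name so this survives iostat layout changes,
# then take the value from the second (steady-state) report.
w_await=$(iostat -x 1 2 | awk -v dev="$DEV" '
  $1 ~ /^Device/ { for (i = 1; i <= NF; i++) if ($i == "w_await") col = i }
  $1 == dev && col { val = $col }
  END { print val + 0 }')

status=0
if [ "$d_count" -gt "$D_MAX" ]; then
  echo "ALERT: $d_count tasks in D state (threshold $D_MAX)"
  status=1
fi
if awk -v v="$w_await" -v t="$W_AWAIT_MAX" 'BEGIN { exit !(v > t) }'; then
  echo "ALERT: $DEV w_await=${w_await}ms (threshold ${W_AWAIT_MAX}ms)"
  status=1
fi
exit "$status"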
FAQ
1) Why does load average go up if CPU is idle?
Because Linux load includes tasks in uninterruptible sleep (D), typically waiting on I/O. They aren’t using CPU, but they are “not progressing,” so they count toward load.
2) Is high iowait always bad?
No. High iowait during a known bulk read/write job can be fine if latency-sensitive services are isolated. It’s bad when it correlates with user-facing timeouts and rising D-state counts.
3) Why is iowait sometimes low even when storage is the problem?
iowait is a CPU idle attribution, not a direct measurement of disk latency. On multi-core systems, a few blocked threads may not move the overall percentage much. Also, sampling and tooling can hide short spikes. Count D-state tasks and measure device latency directly.
4) What’s the fastest single command to prove it’s storage?
iostat -x 1 3. If you see high await, high aqu-sz, and high %util on a relevant device, you have real evidence. Pair it with D-state processes for a slam dunk.
5) Does high %util mean the disk is at maximum throughput?
Not necessarily. It can mean the device is busy servicing slow operations (high latency) or dealing with retries/errors. Throughput can be mediocre while %util is pegged.
6) How do I tell if it’s the disk or the filesystem/journal?
If the raw device shows high latency, start there. If raw device latency is fine but a particular mount is slow, look at filesystem-level behavior: journal contention, near-full allocation issues, mount options, and application fsync patterns.
7) Why do processes in D state ignore kill -9?
Because they’re stuck in an uninterruptible kernel wait, typically awaiting I/O completion. The signal is pending, but the process can’t handle it until the kernel call returns.
8) If it’s a cloud VM, what should I suspect first?
Volume performance limits, burst credit depletion, noisy neighbor effects, and host-level contention. Guest tools show the symptom; the cause may be outside the VM.
9) Can swapping cause this exact pattern?
Yes. Heavy swap-in/out creates I/O pressure and blocked tasks. But you’ll usually see swap activity in vmstat (si/so) and reduced MemAvailable. Don’t guess—measure.
10) What alerts should I add so I don’t get surprised again?
At minimum: per-device await (read/write), queue depth (aqu-sz), D-state task count, filesystem fullness, and kernel storage error rate. CPU alone is a liar in this story.
Conclusion: next steps that actually move the needle
When Ubuntu 24.04 shows high load but “nothing is using CPU,” treat it as a blocked-work problem until proven otherwise. Confirm with D-state counts, vmstat, and iostat -x. Once you see elevated latency and queueing, stop poking at CPU and start working the storage path: device health, controller resets, volume limits, filesystem pressure, and the specific processes generating writes.
Practical next steps:
- Put the fast diagnosis playbook into your on-call runbook verbatim.
- Add alerting on disk latency and blocked tasks, not just CPU and load.
- Baseline “normal” await and queue depth for your critical hosts.
- Separate latency-sensitive logs/WAL from bulk data when you can.
- If kernel logs show timeouts/resets: stop tuning and start migrating/replacing the storage. You can’t sysctl your way out of physics.