Ubuntu 24.04 Disk Hangs Under Load: Timeout Settings That Prevent Full Stalls (Case #30)

It usually starts the same way: latency climbs, dashboards turn yellow, and then your “fast” service becomes a museum exhibit. SSH sessions freeze mid-command. A database stops responding. The CPU is bored, RAM is fine, network looks normal—and yet the machine feels like it’s stuck in wet cement.

On Ubuntu 24.04, the most infuriating version of this is a disk “hang” under load: not a clean failure, not an obvious crash, but a system-wide stall because storage I/O won’t complete and the timeouts that should catch it are too long, or missing entirely at the layers that matter. The fix is rarely “buy a faster disk.” It’s usually “make the system fail fast, isolate the damage, and stop one wedged device from holding the whole host hostage.”

What a “disk hang” really is (and why your host freezes)

When people say “the disk hung,” they often mean one of three things:

  1. A device or path stops completing I/O. The kernel waits, retries, and waits some more. Applications block in uninterruptible sleep (D state). If enough critical threads block, the host looks frozen.
  2. I/O completes, but so slowly that it’s indistinguishable from a hang. Queue depths explode, latency goes from milliseconds to minutes, and anything that needs storage becomes effectively down.
  3. A storage stack layer serializes error handling. One dead path holds a lock, one queue gets stuck, one controller enters a reset loop, and the rest of the system queues behind it.

Linux is generally good at not panicking on storage trouble. That’s great for data integrity. But it can be terrible for availability when the system is configured to wait “forever-ish” for a device that is never coming back. Many environments would rather fail a mount, trip a health check, or evict a node than stall an entire fleet member.

There’s a reason this hurts so much: disk I/O is a dependency for almost everything. Logging, state, temp files, package managers, DNS caches, metrics spools, container layers—storage trouble doesn’t just break the database. It breaks your ability to observe the break.

One paraphrased idea often attributed to Werner Vogels (reliability engineering): Everything fails, so design for failure rather than pretending it won’t happen. Storage timeouts are exactly that: deciding how you fail.

Two uncomfortable truths:

  • No single timeout setting “fixes” hangs. You need coherent timeouts across layers so recovery happens within a predictable budget.
  • Too-aggressive timeouts can cause their own outages—especially on busy arrays, overloaded cloud volumes, or shaky multipath fabrics.

Joke #1: Storage never “goes down,” it just enters a deep meditation on the impermanence of packets.

Fast diagnosis playbook: first / second / third checks

This is the “I have 5 minutes before someone declares an incident” routine. The goal isn’t perfect root cause. The goal is to identify the layer that’s wedged and decide whether to fail over, fence, or reboot.

First: confirm it’s storage and identify the victim device

  • Check kernel logs for timeouts/resets. If you see SCSI timeouts, NVMe controller resets, or “blocked for more than … seconds,” you’re in the right article.
  • List blocked tasks and D-state pileups. If lots of threads are in D state, it’s almost always I/O wait on a device or filesystem log/journal.
  • Map mounts → devices. If /var or a database mount is on the trouble device, the whole host will feel worse than if it’s just a cold data mount.

Second: determine whether it’s device, path, or workload

  • Device-level? SMART/NVMe error logs, media errors, link resets, controller resets.
  • Path-level? iSCSI session drops, FC link flaps, multipath path groups failing over, “no path” messages.
  • Workload-level? One process doing insane sync writes, filesystem journal stalls, writeback congestion, queue depth saturation.

Third: choose the least-bad mitigation

  • If a path is dead and multipath is configured: tune path checking and failover timeouts so failover occurs quickly.
  • If a device is wedged: plan for removal/fencing. On a single-disk host, that’s often reboot. On a redundant stack (RAID, multipath, clustered FS), you can usually isolate it.
  • If it’s workload saturation: reduce concurrency, adjust queue depth/scheduler, or fix the workload (buffering, batching, async I/O).

Interesting facts and historical context

  • Linux used to default to CFQ I/O scheduling for fairness on spinning disks; modern kernels favor mq-deadline or none due to multi-queue block I/O.
  • SCSI timeouts date back to an era of slow devices where waiting 30–60 seconds wasn’t outrageous; on today’s services, that’s an eternity.
  • “Hung task” warnings exist because uninterruptible sleep is a feature, not a bug: the kernel cannot always kill a task that is waiting on I/O without risking corruption, so it warns instead.
  • Multipath was built for flaky paths, not flaky thinking. It can hide transient failures beautifully—until timeouts are mismatched and it stalls longer than your SLA.
  • NVMe changed the failure model. It’s fast enough that the old “just retry for a while” logic can become catastrophic under queue pressure.
  • Writeback congestion is old news (the VM and block layer have wrestled with it for decades), but it still bites when dirty ratios and journaling collide.
  • EXT4 and XFS optimize for integrity first. A stalled journal/log can make an entire mount appear dead even if the rest of the device is “fine.”
  • Cloud block storage introduced new latency pathologies. Throttling, noisy neighbors, and backend maintenance can look like “random hangs” at the guest.

The mechanics: timeouts by layer (block, SCSI, NVMe, multipath, filesystems)

To stop full stalls, you need to understand who waits for whom. The storage stack is a chain of promises:

  • The application promises it can block on I/O.
  • The filesystem promises ordering and recovery (journal/log).
  • The block layer promises to queue and dispatch I/O.
  • The device driver promises to talk to hardware and recover from errors.
  • The transport (SATA/SAS/NVMe/FC/iSCSI) promises delivery—or at least an error.

When a disk “hangs,” the worst case is when errors are not returned quickly. Instead, the kernel retries for a long time, the filesystem waits for critical metadata I/O, and the application threads pile up. Many subsystems are correct to wait. Your job is to decide how long “correct” should be.

Block layer timeouts (request timeouts)

The Linux block layer has a request timeout concept. For many devices it’s exposed as:

  • /sys/block/<dev>/device/timeout (common for SCSI)
  • /sys/class/block/<dev>/queue/ parameters (queue depth, scheduler, etc.)

If requests don’t complete within that timeout, the kernel tries to abort/reset. Whether that helps depends on the driver and hardware. If the driver keeps resetting forever, you get the worst of both worlds: no progress and no clean failure.
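
On newer kernels the block layer also exposes its own per-queue request timeout, in milliseconds. A quick peek, assuming a blk-mq device that exposes queue/io_timeout and using sdb as a placeholder (sysfs changes like this are temporary and vanish on reboot):

cr0x@server:~$ cat /sys/block/sdb/queue/io_timeout
30000
cr0x@server:~$ echo 15000 | sudo tee /sys/block/sdb/queue/io_timeout
15000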

SCSI: command timeouts and error handling

SCSI has explicit command timeouts (historically 30 seconds is common). On a local SAS disk, that might be fine. On a SAN with multipath and large caches, it can be too short (false timeouts). On an unstable path, it can be too long (stalling failover).

SCSI error handling can also block. If the device is in a bad state, you may see repeated reset attempts in dmesg. The trick is to align SCSI timeouts with multipath and transport recovery so that one layer takes responsibility for failing over or failing fast—not all of them fighting each other.

NVMe: controller timeouts and reset loops

NVMe is usually “either it flies or it’s on fire.” When it fails, it often fails by controller reset. You’ll see messages about timeouts and resets. The key knob is the NVMe core timeout and controller loss behavior (depending on kernel/driver). Some settings are module parameters; others are per-device sysfs.

In practice: if an NVMe device is intermittently wedging, making timeouts shorter can help you recover faster—or can cause repeated resets under high load if the device is merely slow. Don’t tune this blind; measure latency distribution.
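
To see what your kernel is currently using, the NVMe core timeouts are visible as module parameters (values in seconds; this assumes the nvme_core module is loaded, which it is on any host with NVMe storage):

cr0x@server:~$ cat /sys/module/nvme_core/parameters/io_timeout
30
cr0x@server:~$ cat /sys/module/nvme_core/parameters/admin_timeout
60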

Multipath: where “high availability” becomes “high anxiety”

Device-mapper multipath can either save you or stall you. The biggest failure mode is mismatched timeouts:

  • SCSI command timeout is long.
  • Multipath path checker is slow to declare a path dead.
  • Queueing behavior is set to “queue forever” when no path exists.

That last one is the killer: the system keeps queueing I/O indefinitely, your apps block, and the host looks dead while it politely waits for a path that isn’t returning. In many environments, you want queueing to stop after a bounded time so services can fail and orchestrators can reschedule.

Filesystems: journal/log stalls look like device hangs

EXT4 uses a journal; XFS uses a log. If critical metadata writes can’t complete, the filesystem can stall callers. Even if the underlying block device is only “partially” bad, the mount can feel fully wedged because metadata operations are serialized.

Also: fsync() patterns can make your storage look hung. A single process doing sync writes in a tight loop can dominate latency for everyone. If you’ve ever seen “why is the box frozen?” and the answer was “someone enabled synchronous logging everywhere,” welcome to the club.
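
If you suspect a sync-happy process rather than a sick device, a short syscall sample can confirm it. A sketch, using the postgres PID from the tasks below as a stand-in and keeping the trace brief because strace adds overhead:

cr0x@server:~$ sudo timeout 10 strace -c -f -e trace=fsync,fdatasync -p 2211
strace: Process 2211 attached
... (the -c summary table prints call counts when the timeout expires)

Hundreds of fsync/fdatasync calls in a ten-second window on a latency-sensitive mount is usually your answer.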

Joke #2: The disk isn’t slow—it’s just benchmarking your patience.

Hands-on tasks: 12+ commands, how to read output, what decision to make

These are practical, production-safe commands. Use them to identify the bottleneck and to validate whether timeout tuning is working. Each task includes: command, sample output, what it means, and what decision you make.

Task 1: Check for kernel I/O timeout signatures

cr0x@server:~$ sudo dmesg -T | egrep -i 'timed out|timeout|reset|I/O error|blk_update_request|hung task|blocked for more than' | tail -n 40
[Mon Dec 30 10:11:02 2025] sd 2:0:0:0: timing out command, waited 30s
[Mon Dec 30 10:11:02 2025] blk_update_request: I/O error, dev sdb, sector 12345678 op 0x1:(WRITE) flags 0x0 phys_seg 8 prio class 0
[Mon Dec 30 10:11:05 2025] scsi host2: sas: attempt task abort!
[Mon Dec 30 10:11:22 2025] INFO: task postgres:2211 blocked for more than 120 seconds.

Meaning: You have real device/driver-level distress, not just “the DB is slow.” The presence of “timing out command” messages and abort/reset attempts implies the kernel is trying to recover.

Decision: Identify which device (sdb here), map it to a mount/workload, and decide whether you’re dealing with media failure (replace) or path/transport (failover/tune multipath).

Task 2: Identify D-state pileups quickly

cr0x@server:~$ ps -eo state,pid,comm,wchan:32 --sort=state | head -n 20
S     1 systemd         ep_poll
D  2211 postgres        io_schedule
D  2450 postgres        xfs_log_force_lsn
D  3102 rsyslogd        blk_mq_get_tag
R  4021 ps              -

Meaning: Multiple processes in D state, waiting in I/O-related kernel functions. xfs_log_force_lsn is a smoking gun for XFS log pressure/stalls.

Decision: Pivot to filesystem and underlying device checks. If the blocked processes are critical daemons (journald, rsyslog, database), the host will feel “hung.”
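
To see exactly where a D-state task is stuck, dump its kernel stack (root required; the PID is taken from the ps output above, and the frames shown are illustrative and trimmed):

cr0x@server:~$ sudo cat /proc/2450/stack
[<0>] xfs_log_force_lsn+0x8c/0x130
[<0>] xfs_file_fsync+0x1d2/0x240
[<0>] do_fsync+0x3d/0x70
[<0>] __x64_sys_fsync+0x14/0x20

A stack that bottoms out in filesystem log/journal code or block-layer wait paths confirms the task is waiting on I/O that isn’t completing, not burning CPU.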

Task 3: See real-time I/O latency and queue depth with iostat

cr0x@server:~$ iostat -x 1 5
Linux 6.8.0-xx-generic (server)  12/30/2025  _x86_64_  (16 CPU)

Device            r/s   w/s  rkB/s  wkB/s  await  svctm  %util
sda              5.0  12.0   120.0  900.0    2.1    0.3    3.0
sdb            200.0  80.0 25600.0  6400.0 1800.0    1.2  100.0
dm-2           195.0  75.0 25000.0  6000.0 1850.0    1.1  100.0

Meaning: await is reported in milliseconds, so 1800 means requests are averaging nearly two seconds; combined with %util pegged at 100, the device is saturated or stuck. The dm device matches the underlying disk behavior, suggesting the bottleneck is below the filesystem.

Decision: If the device is truly saturated, reduce concurrency or fix workload. If it’s “stuck,” look for timeouts/resets and consider failover/removal.

Task 4: Map mounts to devices (find what you’re about to break)

cr0x@server:~$ lsblk -f
NAME        FSTYPE FSVER LABEL UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
sda
├─sda1      ext4   1.0         2b4a...                               120G    20%   /
└─sda2      swap   1           9f31...
sdb
└─mpatha    LVM2_member       3JkQ...
  ├─vg0-data xfs                8d1c...                                80G    85%   /var/lib/postgresql
  └─vg0-log  xfs                7a2b...                                10G    40%   /var/log

Meaning: The troubled disk sdb backs LVM volumes that contain PostgreSQL data and logs. That explains system-wide pain.

Decision: If the issue is in the SAN/path, multipath tuning matters. If it’s local disk failure, plan replacement. Either way, prioritize data safety and service recovery.

Task 5: Confirm whether you’re on multipath and inspect its behavior

cr0x@server:~$ sudo multipath -ll
mpatha (3600508b400105e210000900000490000) dm-2 LIO-ORG ,TCMU device
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 2:0:0:0 sdb 8:16  active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  `- 3:0:0:0 sdc 8:32  active ready running

Meaning: You are queueing I/O if no path exists (queue_if_no_path). If both paths flap or disappear, I/O can stall indefinitely.

Decision: Decide whether “queue forever” matches your availability goals. Many clusters prefer bounded queueing, then fail.
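
If you decide mid-incident that bounded failure is the right call, recent multipath-tools can switch off queueing at runtime without touching the config (this immediately fails queued I/O back to the callers, so coordinate with service owners first):

cr0x@server:~$ sudo multipathd disablequeueing map mpatha
ok

Make it permanent in /etc/multipath.conf afterwards (see the tuning section later); the runtime change does not survive a reconfigure.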

Task 6: Check SCSI per-device timeout

cr0x@server:~$ cat /sys/block/sdb/device/timeout
30

Meaning: SCSI commands time out after 30 seconds (common default). That’s per command, and retries/reset behavior can extend the total stall time drastically.

Decision: If you need faster failover (multipath), you may reduce this—carefully. If you have a slow SAN that occasionally stalls, you may need it higher, but then ensure other layers don’t queue forever.
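
For a quick experiment the value can be changed in place; it is not persistent across reboots or device re-adds, so anything you keep should go into a udev rule (see the SCSI tuning section later):

cr0x@server:~$ echo 10 | sudo tee /sys/block/sdb/device/timeout
10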

Task 7: Inspect queue settings and scheduler (performance vs latency)

cr0x@server:~$ cat /sys/block/sdb/queue/scheduler
[mq-deadline] kyber bfq none

Meaning: mq-deadline is active. That’s often a decent default for block devices that benefit from bounded latency. For NVMe, none is also common.

Decision: Don’t thrash schedulers during an incident. But if you’re on bfq on a server block device under heavy concurrency, consider a change after testing.

Task 8: Check if requests are timing out at the block layer

cr0x@server:~$ sudo journalctl -k -b | egrep -i 'blk_update_request|Buffer I/O error|reset|timed out' | tail -n 30
Dec 30 10:11:02 server kernel: blk_update_request: I/O error, dev sdb, sector 12345678 op 0x1:(WRITE) flags 0x0 phys_seg 8 prio class 0
Dec 30 10:11:05 server kernel: sd 2:0:0:0: timing out command, waited 30s
Dec 30 10:11:06 server kernel: scsi host2: sas: attempt task abort!

Meaning: Confirms errors are emitted by the kernel and include the device and operation type (READ/WRITE).

Decision: If writes are failing, filesystems may remount read-only or stall; plan for service failover. If reads fail on a redundant setup, assess whether it’s isolated to a path.

Task 9: Find which process is pounding the disk

cr0x@server:~$ sudo iotop -oPa
Total DISK READ: 25.00 M/s | Total DISK WRITE: 8.00 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN  IO>    COMMAND
 2211 be/4 postgres    18.00 M/s   4.00 M/s  0.00 % 98.00 % postgres: checkpointer
 2305 be/4 postgres     6.00 M/s   2.50 M/s  0.00 % 65.00 % postgres: walwriter

Meaning: The database is the main I/O consumer and is spending massive time waiting on I/O (IO% high).

Decision: If the device is sick, killing the process won’t help; it will just move the waiting elsewhere. If it’s overload, tune DB checkpointing / WAL settings and storage concurrency.

Task 10: Check NVMe error log and controller resets (if NVMe)

cr0x@server:~$ sudo nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
/dev/nvme0n1     S4G8...              ACME NVMe 3.2TB                           1         1.20  TB /   3.20  TB    512   B +  0 B   1.2.3

cr0x@server:~$ sudo nvme smart-log /dev/nvme0
critical_warning                    : 0x00
temperature                         : 44 C
available_spare                     : 100%
percentage_used                     : 3%
media_errors                        : 0
num_err_log_entries                 : 12

cr0x@server:~$ sudo nvme error-log /dev/nvme0 | head -n 10
Entry[0]
  error_count     : 12
  sqid            : 3
  cmdid           : 0x0042
  status_field    : 0x4004
  parm_err_loc    : 0x0000

Meaning: Errors exist (num_err_log_entries). Not all NVMe errors are fatal, but they correlate strongly with reset storms and latency spikes.

Decision: If error counts climb during the hang, treat it as device/firmware/controller trouble and consider replacement or firmware update in your maintenance window.

Task 11: Confirm multipathd timing and path checker events

cr0x@server:~$ sudo journalctl -u multipathd -b | tail -n 40
Dec 30 10:10:51 server multipathd[812]: sdb: tur checker reports path is down
Dec 30 10:10:55 server multipathd[812]: mpatha: remaining active paths: 1
Dec 30 10:11:15 server multipathd[812]: sdb: reinstated

Meaning: Paths are flapping. If failover/reinstate takes too long, the host stalls while multipath decides what reality is.

Decision: Tune multipath check intervals and no-path retry policy to align with service tolerance for pauses.

Task 12: Inspect filesystem health signals (XFS example)

cr0x@server:~$ sudo dmesg -T | egrep -i 'xfs|ext4|journal|metadata' | tail -n 30
[Mon Dec 30 10:11:01 2025] XFS (dm-2): log I/O error -5
[Mon Dec 30 10:11:01 2025] XFS (dm-2): metadata I/O error: block 0x12a3f error 5

Meaning: The filesystem is seeing I/O errors. This is no longer “maybe it’s slow.” It’s failing.

Decision: Stop writes if possible (service failover), capture logs, and plan repair/recovery. Do not keep hammering a failing log device.

Task 13: Check writeback pressure (dirty ratios) that can mimic hangs

cr0x@server:~$ sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500

Meaning: These are reasonable defaults. On very fast ingest workloads into slower disks, dirty limits can be hit, causing user threads to block in writeback—another “hang” look.

Decision: If you see stalls without device errors and with high dirty pages, tune these carefully and fix workload buffering.
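
To check whether dirty pages are actually piling up, read the live counters (values are in kB; the numbers here are illustrative):

cr0x@server:~$ grep -E '^(Dirty|Writeback):' /proc/meminfo
Dirty:            1843200 kB
Writeback:         409600 kB

A large, persistent Dirty value plus tasks blocked in writeback paths supports “congestion, not dead disk.”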

Task 14: Measure latency distribution quickly with fio (safe mode)

cr0x@server:~$ sudo fio --name=latcheck --filename=/var/lib/postgresql/.fio-testfile --size=256M --rw=randread --bs=4k --iodepth=16 --numjobs=1 --direct=1 --time_based --runtime=20 --group_reporting
latcheck: (groupid=0, jobs=1): err= 0: pid=5123: Mon Dec 30 10:20:01 2025
  read: IOPS=18.2k, BW=71.2MiB/s (74.7MB/s)(1.39GiB/20001msec)
   lat (usec): min=70, max=540000, avg=880, stdev=12000

Meaning: a max latency of 540000 µs is a 540 ms worst-case read while the average stays under a millisecond, which is severe tail latency. If the max reaches seconds or minutes, you’re in hang territory.

Decision: Use this to validate improvements after tuning. If latency tail remains huge, timeouts may just be hiding a deeper device/path issue.

Timeout settings that prevent full stalls (what to tune, and how)

Let’s be blunt: the safest “no stall ever” setting is “never do I/O.” Since we can’t do that, the real goal is:

  • Detect failure quickly enough to fail over (if redundant) or fail the node (if not).
  • Prevent indefinite queueing that turns one bad disk/path into a host-wide freeze.
  • Keep timeouts consistent so one layer doesn’t wait 5 minutes while another expects recovery in 10 seconds.

1) Multipath: stop queueing forever (bounded pain)

If you use multipath, your biggest availability lever is whether I/O queues when there is no working path.

Why “queue_if_no_path” is dangerous: It is fantastic for short path blips. It is catastrophic when a fabric is down, a target is gone, or authentication breaks. Your processes block waiting for I/O that is being queued into the void.

Production guidance:

  • For clusters with node-level failover (Kubernetes, Pacemaker, etc.), prefer bounded queueing so the node fails and the service relocates.
  • For single-host, non-redundant systems where waiting is preferable to failing (rare), queueing may be acceptable—but you’re consciously choosing host stalls.

Example multipath configuration concept (not a link, just the knobs):

  • no_path_retry: how long to keep queueing when no paths remain, counted in path-checker intervals; can be a number, fail, or queue.
  • dev_loss_tmo and transport settings: how long the kernel keeps a device around after losing connectivity (common in iSCSI/FC).
  • fast_io_fail_tmo: how quickly to fail I/O when paths are failing (FC/iSCSI dependent).
  • polling_interval: how often multipath checks paths.

On Ubuntu 24.04, multipath is usually configured via /etc/multipath.conf. You want to align:

  • Path detection time (checker + polling) with
  • How long apps can wait (service timeouts, DB timeouts) and
  • How long the kernel retries SCSI commands before declaring a device dead.

Rule of thumb: if your orchestrator will kill/replace a node in 60 seconds, don’t configure storage to stall for 10 minutes “just in case.” You’ll just have a slow-motion outage.
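
A minimal sketch of what the bounded-queueing knobs above can look like in /etc/multipath.conf (values are illustrative and must come from your own timeout budget, not copied from an article):

cr0x@server:~$ cat /etc/multipath.conf
defaults {
    polling_interval    5
    no_path_retry       12      # roughly 12 * polling_interval seconds of queueing, then fail I/O
    fast_io_fail_tmo    15
    dev_loss_tmo        60
}

With something like this, I/O queues for about a minute after the last path dies, then fails cleanly back to the filesystem and application. Reload multipathd after editing and re-check multipath -ll (Task 5).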

2) SCSI device timeout: adjust with care (and never alone)

SCSI timeouts are per command. The system may retry and reset multiple times. Reducing the timeout can speed up failover, but can also cause unnecessary resets under transient load.

When lowering SCSI timeouts helps:

  • Multipath fabrics where one path is dead and you want quick failover.
  • Hosts where “freeze for minutes” is worse than “I/O error quickly.”
  • Environments where the storage backend is reliable and a timeout really indicates failure.

When lowering is risky:

  • Busy SANs that occasionally stall for tens of seconds during maintenance.
  • Workloads with huge I/O sizes on slow media where long operations are expected.
  • Controllers with known long error recovery behavior (some RAID HBAs).

In practice, you tune SCSI timeouts together with multipath’s no-path behavior and transport fail timers. Otherwise you get the classic problem: the kernel keeps retrying while multipath keeps queueing.
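
If you do lower the SCSI timeout, make it persistent with a udev rule so it survives reboots and device re-scans. A sketch that targets all SCSI disk-type devices (the 10-second value is an example, not a recommendation, and the filename is arbitrary):

cr0x@server:~$ cat /etc/udev/rules.d/60-scsi-timeout.rules
ACTION=="add", SUBSYSTEM=="scsi", ATTR{type}=="0", ATTR{timeout}="10"

cr0x@server:~$ sudo udevadm control --reload && sudo udevadm trigger --subsystem-match=scsi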

3) Transport timeouts (iSCSI/FC): make failure explicit

Many “disk hangs” on SAN-backed volumes are actually transport stalls. For iSCSI, session/transport timeouts define how quickly a lost session is detected and how fast I/O fails back to the OS. For FC, similar concepts exist via fast I/O fail timers and device loss timers.

Why this matters: if the OS thinks the device still exists, it will wait. If it decisively declares “device is gone,” multipath can fail over (or the system can fail) cleanly.

Operational advice: Align transport timeouts to be shorter than your application’s “I can wait” budget, but longer than expected transient blips. If you don’t know what transient blips look like, measure them during a controlled maintenance event. Yes, schedule one.
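
For open-iscsi on Ubuntu, the key transport knob lives in /etc/iscsi/iscsid.conf: how long a lost session is allowed to block I/O before errors are returned to the layers above. A sketch (15 seconds is illustrative; multipath setups typically want this short, single-path setups need it longer):

cr0x@server:~$ grep replacement_timeout /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 15

Already-established sessions keep their old value until they are updated with iscsiadm or re-logged in, so treat this as a planned change, not an incident hotfix.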

4) NVMe timeouts: avoid reset storms

On NVMe, too-short timeouts can cause a controller to reset repeatedly under heavy load if completion latencies spike. Reset storms are a special kind of misery because the “recovery action” becomes the outage.

What you do instead of blindly lowering timeouts:

  • Measure tail latency under realistic load (Task 14).
  • Check for firmware issues and error log entries (Task 10).
  • Validate PCIe link stability (kernel logs often show AER errors when things get spicy).
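
If the measurements do justify a change, the usual mechanism is the nvme_core module parameters (in seconds), set at boot. A sketch showing the current defaults rather than a recommendation:

cr0x@server:~$ cat /etc/modprobe.d/nvme-timeout.conf
options nvme_core io_timeout=30 admin_timeout=60

cr0x@server:~$ sudo update-initramfs -u

The module loads from the initramfs, so rebuild it and reboot for new values to take effect.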

5) Filesystem and application timeouts: don’t pretend the disk is immortal

Even if the kernel fails I/O fast, your service can still hang if the app layer waits forever (or has absurdly high timeouts). Set reasonable timeouts and cancellation behavior in:

  • Database drivers and connection pools
  • RPC clients
  • Systemd service watchdogs (where appropriate)
  • Cluster health checks and fencing

The right philosophy: storage failures should make the node sick quickly and obviously, not slow and spooky.
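
One cheap pattern that pays for itself: a health check that actually touches the critical mount with a hard deadline, so a wedged device becomes a failed probe instead of a silent D-state pileup. A minimal sketch (the path and the 5-second budget are hypothetical; adapt both to your service):

cr0x@server:~$ cat /usr/local/bin/check-disk-health
#!/bin/sh
# Fails (non-zero exit) if a small direct, synced write to the critical mount
# cannot complete within 5 seconds.
exec timeout 5 dd if=/dev/zero of=/var/lib/postgresql/.healthcheck \
    bs=4k count=1 oflag=direct,dsync status=none

Wire the exit code into your load balancer or orchestrator probe; storage failures then surface within seconds rather than minutes.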

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “Multipath means no outages”

They ran a set of API servers on Ubuntu. Storage was SAN-backed, multipath enabled, and everyone slept well because there were “two paths.” During a midweek change on the switching layer, one fabric started dropping frames intermittently.

The wrong assumption was subtle: the team believed multipath failover would be near-instant. In reality, their configuration queued I/O when no path existed, and path checking ran slowly. When the “good” path also flapped, the system didn’t fail. It waited. Indefinitely.

From the outside, the APIs looked half-alive. Some requests hung. Others succeeded. Load balancers kept sending traffic because TCP still worked and health checks were too shallow. On the hosts, processes piled into D-state. Restarting services didn’t help; the restart scripts also needed disk I/O.

The fix wasn’t heroic. They changed multipath to stop queueing forever, tightened path check timing, and—this is the part people skip—made application health checks include a tiny disk write/read on critical mounts. After that, a real storage outage made nodes fail fast, get drained, and stop confusing everyone.

Optimization that backfired: “Let’s crank queue depth and iodepth”

A performance tuning sprint targeted a batch ingest pipeline. The team increased concurrency everywhere: more worker threads, higher database connection counts, and bigger I/O depths in their tooling. Benchmarks looked great on quiet storage.

Then a production traffic spike hit during a backend storage maintenance event. Latency rose, queues filled, and those higher depths turned into a self-inflicted denial of service. The system spent more time managing queued I/O than completing useful work. Tail latency went vertical.

Worse, their timeout settings were still defaults. So instead of failing quickly and rerouting jobs, the host stalled in long retry cycles. Monitoring saw “host up,” but the service was practically dead. Operators had to hard reboot nodes to restore scheduling capacity.

The eventual solution was to cap concurrency based on measured device behavior, not on optimism. They set sane queue depths, used backpressure in the application, and tuned timeouts so that real storage failure caused predictable error paths rather than long stalls. Throughput dropped slightly in the happy case; availability improved dramatically in the unhappy one. That trade is usually worth it.

Boring but correct practice that saved the day: “Timeout budget and regular failure drills”

A different org ran a mixed fleet: some local NVMe, some SAN, some cloud volumes. They had a policy that every storage class must have a documented timeout budget: how long the OS can wait, how long multipath can queue, how quickly the orchestrator fences.

It wasn’t glamorous. The document was a table. It was reviewed every quarter. The team also performed controlled “pull the path” drills on a staging SAN and simulated degraded cloud volume performance during load testing.

One day a storage controller in a remote site started glitching. A handful of nodes lost a path and then lost the second path briefly. Instead of hanging, the nodes failed I/O within the agreed budget. The cluster evicted them, workloads moved, and the incident became a minor capacity event, not a platform-wide outage.

Postmortem notes were short: “timeouts behaved as designed.” Which is the best kind of sentence in operations—boring, correct, and over quickly.

Common mistakes: symptom → root cause → fix

1) Symptom: host “freezes,” but CPU is low

Root cause: processes stuck in uninterruptible sleep (D-state) waiting for I/O; often a blocked filesystem journal/log.

Fix: identify the device/mount (Tasks 2–4), check kernel logs (Task 1), and stop indefinite queueing in multipath if applicable.

2) Symptom: multipath device exists, but I/O stalls for minutes when SAN blips

Root cause: queue_if_no_path or overly generous no_path_retry with slow path checking.

Fix: configure bounded retry/fail behavior and align path check intervals with your service timeout budget (Task 5, Task 11).

3) Symptom: repeated “resetting controller” messages (NVMe or HBA)

Root cause: device/firmware instability or timeouts too aggressive under load, causing reset storms.

Fix: validate firmware/health logs (Task 10), check error counters, and don’t shorten timeouts until you understand tail latency.

4) Symptom: “blocked for more than 120 seconds” appears, then clears, then returns

Root cause: intermittent path flaps or backend throttling causing tail latency spikes beyond application tolerance.

Fix: correlate with multipath logs (Task 11) and transport events; tune failover and reduce workload concurrency to control queue buildup.

5) Symptom: I/O errors in dmesg, filesystem remounts read-only

Root cause: real media errors, broken cable/path, or backend device failure returning errors.

Fix: treat as data risk. Fail over services, capture logs, and replace/repair hardware or volume. Tuning won’t fix a dead disk.

6) Symptom: “everything is slow” during backups or batch jobs, no kernel errors

Root cause: queue depth saturation and writeback congestion; sometimes poor I/O scheduler choice for the device type.

Fix: cap concurrency, schedule heavy jobs, consider scheduler tuning after testing (Task 7), and adjust writeback parameters carefully (Task 13).

7) Symptom: only one mount hangs, rest of system ok

Root cause: per-filesystem log/journal stall on a specific device or LVM LV; other mounts on other devices unaffected.

Fix: isolate and fail the mount/service, rather than rebooting the whole host. Consider separating logs/journals onto different devices where appropriate.

Checklists / step-by-step plan

Step-by-step: during an active “hang under load” incident

  1. Capture quick evidence: kernel log snippets (Task 1), blocked tasks (Task 2), and iostat (Task 3). Save them somewhere off-host if possible.
  2. Map impact: identify which mount/device is involved (Task 4). If it’s /, /var, or a database mount, expect broad blast radius.
  3. Determine redundancy: are you on multipath/RAID/cluster? If yes, you can often isolate the bad path/device. If no, plan for a controlled reboot and replacement.
  4. If multipath: confirm whether you’re queueing forever (Task 5). If yes, decide whether to switch to bounded behavior as part of your fix plan.
  5. Check for actual I/O errors: journalctl kernel logs (Task 8) and filesystem messages (Task 12).
  6. Check workload drivers: iotop (Task 9) to see if it’s a single offender or general pressure.
  7. Mitigate: fail over services, drain node, or fence/reboot. Don’t keep restarting apps that are blocked on I/O.
  8. After recovery: run controlled latency measurement (Task 14) to validate whether tail latency is sane again.

Step-by-step: implementing timeout tuning safely (change plan)

  1. Define your timeout budget (example: “a node may stall at most 30 seconds on storage before it is declared unhealthy”). Without this, you’ll tune randomly.
  2. Inventory storage types: local SATA/SAS, NVMe, iSCSI/FC SAN, cloud volumes, dm-crypt, LVM, RAID, multipath.
  3. Record current settings: SCSI timeouts, multipath config, transport timeouts, scheduler, queue depth.
  4. Decide fail behavior: do you prefer I/O errors quickly (fail fast) or waiting (attempt to survive blips)? Most distributed systems should fail fast.
  5. Align layers:
    • Transport detects path loss within budget.
    • Multipath fails over or fails I/O within budget.
    • SCSI/NVMe timeouts don’t exceed budget by an order of magnitude.
  6. Test with fault injection: pull a path, disable a target, throttle a volume, then observe stall duration and application behavior (a minimal sketch follows this list).
  7. Roll out gradually: one environment, one storage class at a time.
  8. Monitor tail latency and error rates during rollout; revert if you cause reset storms or false failures.
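
For step 6, one low-tech injection on a multipath staging host is to offline one path from the OS side and time how long I/O takes to fail over (sdc here is the second path from Task 5; never do this on a production host):

cr0x@server:~$ echo offline | sudo tee /sys/block/sdc/device/state
offline
cr0x@server:~$ echo running | sudo tee /sys/block/sdc/device/state
running

Run the workload between the two commands and measure how long requests stall before multipath fails over and, later, reinstates the path.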

Sanity checklist: what “good” looks like

  • A single dead path triggers multipath failover quickly and visibly in logs.
  • No indefinite queueing on critical mounts unless you explicitly chose it.
  • Applications time out and retry at the app level instead of hanging forever.
  • Node health checks catch storage unavailability (not just “process exists”).
  • Tail latency stays bounded under peak load, not just average latency.

FAQ

1) Is this an Ubuntu 24.04 issue, or a Linux issue?

Mostly Linux-and-hardware reality. Ubuntu 24.04 ships a modern kernel and userspace, but the failure modes are inherent to the storage stack: devices can stop responding, and the OS must decide how long to wait.

2) Why does one bad disk make the whole host feel frozen?

Because critical processes block on I/O. If journald, rsyslog, package services, or your database are blocked in D-state, everything from login to monitoring can stall behind it.

3) Should I just lower every timeout to 1 second?

No. You’ll create false failures and reset storms, especially on SAN and NVMe under load. Set timeouts based on measured tail latency and your service’s failover budget.

4) What’s the single most dangerous multipath setting?

queue_if_no_path without a bounded retry policy. It can turn “no paths available” into “host appears dead,” which is a deeply unhelpful user experience.

5) How do I know if it’s workload saturation vs a failing device?

Saturation usually shows high utilization and rising latency without kernel I/O errors. Failing devices show timeouts, resets, and I/O errors in kernel logs (Task 1 and Task 8).

6) Can filesystems cause hangs even if the disk is OK?

They can amplify problems. A workload that forces frequent sync/journal commits can stall the mount under pressure. But truly “OK disk” with catastrophic filesystem stalls is less common than people hope.

7) My cloud volume sometimes pauses for 20–60 seconds. Should I increase timeouts?

If those pauses are real and expected in your provider, you may need longer timeouts to avoid false failures—but then you must ensure your application and orchestration tolerate those pauses. Otherwise, build for failover to other nodes and don’t let a single stalled volume freeze the host.

8) Why doesn’t killing the blocked process fix the hang?

Because the process is blocked in uninterruptible kernel sleep waiting for I/O. The kernel can’t safely tear it down until the I/O completes or fails.

9) What’s the safest way to test timeout tuning?

Fault injection in staging: deliberately disable a path, pause a target, or throttle a volume while running a representative load. Measure stall duration and recovery behavior, then adjust.

10) When is rebooting the correct answer?

When you have a non-redundant device that is wedged and won’t recover, or when the kernel/driver is stuck in a reset loop and you need the host back fast. Rebooting is not a fix; it’s a tourniquet.

Conclusion: next steps you can do today

If your Ubuntu 24.04 host “hangs” under disk load, don’t accept it as a mystery. Treat it like a timeout budget problem with a hardware subplot. Your next steps:

  1. During the next event, collect the three essentials: dmesg timeout lines, D-state process evidence, and iostat -x output (Tasks 1–3).
  2. Map mounts to devices so you know what’s actually stalling (Task 4). Guessing wastes hours.
  3. If you use multipath, audit whether you’re queueing forever and decide if that matches your availability goals (Task 5). Most modern platforms should prefer bounded failure.
  4. Write a timeout budget for your storage class and align SCSI/NVMe/transport/multipath behavior to it. If you can’t say “we fail I/O within ~N seconds,” you don’t control your outages.
  5. Run a controlled failure drill and observe whether your node fails fast or stalls slowly. Then tune with data, not superstition.

Disk hangs aren’t always preventable. Full-host stalls usually are. Make the system choose a clear failure, quickly, and your on-call life improves immediately—even if the storage vendor’s life remains… character-building.
