Ubuntu 24.04: Disk “hangs” under load — timeout settings that prevent full stalls (case #90)

When a disk gets slow, your system shouldn’t get philosophical about it. Yet under heavy I/O, Ubuntu 24.04 boxes can look “hung” even when CPU is idle and memory is fine—because the kernel is politely waiting for storage that has stopped being polite.

This isn’t just “performance.” It’s liveness. The difference between a latency spike and a full-service stall is often a handful of timeouts, queueing behaviors, and recovery paths you either configured—or inherited by accident.

What a “disk hang” actually looks like on Ubuntu 24.04

“Hang” is a sloppy word, so let’s tighten it. In this case, the machine is not crashed. It is trapped waiting on I/O completion. That means:

  • SSH login works, but commands like ls or df freeze when touching affected mounts.
  • Load average climbs, but CPU usage looks boring. That’s tasks blocked on I/O being counted in the load average, not a compute storm.
  • Systemd units stop responding to stop/restart. They’re stuck in uninterruptible sleep (state D).
  • Kernel logs mention timeouts, resets, “blocked for more than 120 seconds,” or filesystem journal waits (a quick check follows this list).
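
If you want a quick check for that last symptom, the kernel messages in the journal work (a minimal sketch; the time window is arbitrary):

cr0x@server:~$ journalctl -k --since "1 hour ago" | grep -iE 'blocked for more|timed out|timing out|reset'

If that returns jbd2 threads or your data device, you are in storage-liveness territory, not application territory.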

The goal is not “make disks never slow.” The goal is: when disks do go slow or disappear, the system recovers quickly, fails requests quickly, and doesn’t wedge itself trying to be helpful.

One operational truth: the kernel can wait longer than your business can. Your job is to align those clocks.

Fast diagnosis playbook

When production is melting, don’t start by editing sysctls. Start by answering three questions: what is stuck, where it’s stuck, and why recovery isn’t happening.

First: confirm it’s I/O wait and identify the device

  • Check iowait and blocked tasks (one quick command after this list covers both).
  • Find which mount(s) and which block device(s) are involved.
  • Look for obvious kernel messages about resets/timeouts.
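
One command covers the first two signals at a glance (a minimal sketch; read the columns, not the exact numbers):

cr0x@server:~$ vmstat 1 5

Watch the b column (tasks blocked, almost always on I/O) and wa under cpu. A b count that stays above zero while wa is high is the classic storage-wait signature; the device identification still comes from findmnt and the kernel log.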

Second: decide whether you’re in “latency spike” or “path/device failure” territory

  • Latency spike: I/O completes eventually; queues build; timeouts might not trigger.
  • Failure: commands hang indefinitely; multipath might be queueing forever; driver might retry for minutes.

Third: check what policy is making the stall worse

  • Multipath: queue_if_no_path can turn a transient SAN blip into an application freeze.
  • Device timeouts: SCSI or NVMe timeouts determine how long the kernel keeps trying before erroring or resetting.
  • Filesystem behavior: journal commits and metadata ops can serialize and block unrelated work.
  • Service timeouts: systemd may wait too long to fail fast or restart.

Get the facts quickly, then change the right knob. Under pressure, people love turning random knobs. That’s how you end up with a system that fails faster than it recovers.

Facts and context that change how you debug

Here are a few concrete points—some historical—that explain why “disk hang” problems behave the way they do:

  1. Linux block I/O can block in uninterruptible sleep (D state), meaning signals won’t kill the process. That’s why kill -9 looks powerless.
  2. SCSI error handling is intentionally patient. It tries hard to recover without data loss, which is great—until your apps need a quick failure.
  3. The default “blocked task” warning threshold (often 120s) is not a timeout; it’s a complaint. Work can stay blocked long after you see the warning.
  4. Multipath was designed to survive flaky fabrics. Features like “queue when no paths” were meant to preserve writes during path loss, but they can freeze user space.
  5. NCQ and deep queues (SATA/SAS) improved throughput, but large queues can amplify tail latency under contention. One slow command can sit in front of many others.
  6. NVMe brought fast error reporting and controller resets compared to old SATA behavior, but the reset/reconnect policy still matters, especially with PCIe quirks.
  7. Ext4 journaling exists to keep metadata consistent. Under storage stalls, journal commits can block operations that look unrelated, like creating a file in another directory.
  8. Writeback caching policies have been a reliability battleground for decades: fast caches hide latency until they don’t, and then they expose it as giant spikes.
  9. Virtualization added layers of queueing (guest block layer, virtio queues, host HBA queues, array queues). Tail latency compounds across layers.

Dry-funny reality check: disks don’t “hang.” They enter a long-term relationship with your kernel, and your kernel doesn’t believe in giving up quickly.

Timeouts: the model you need in your head

There is no single “disk timeout.” There are several timers and policies that stack, sometimes multiplicatively (a quick inventory sketch follows the list):

  • Device-level command timeouts (the per-device SCSI timeout exposed in sysfs; NVMe controller and I/O timeouts).
  • Transport/link recovery (SAS link resets, FC fabric events, iSCSI session timeouts, TCP retransmit behavior).
  • Mapper policy (dm-multipath queueing behavior, fast_io_fail_tmo, dev_loss_tmo).
  • RAID layer behavior (mdraid timeouts; controller firmware; write cache).
  • Filesystem/journal timeouts (not always explicit, but commit intervals and sync behaviors).
  • Application/service timeouts (systemd unit timeouts, client request timeouts, database statement timeouts).
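
Before touching any of these, inventory what the stack currently believes. A minimal sketch, assuming a SCSI path device sdb behind a multipath map (the device name is a placeholder):

cr0x@server:~$ cat /sys/block/sdb/device/timeout
cr0x@server:~$ sudo multipathd show config | grep -E 'no_path_retry|fast_io_fail_tmo|dev_loss_tmo|polling_interval'
cr0x@server:~$ sudo sysctl kernel.hung_task_timeout_secs

Write the answers down. The practical tasks below walk the same layers in more detail, but you want the "before" values on record before you change any of them.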

When the disk path is broken, the kernel may keep retrying, remapping, and waiting. That can be the correct default for data integrity. In production, you need to decide: do you prefer waiting or failing? For a database write, waiting might be safer. For a stateless API node, failing fast and rescheduling might be the whole point.

One quote that belongs on every on-call rotation: Hope is not a strategy. — a line often attributed to General Gordon R. Sullivan

The trick is to pick timeout values that match your environment’s recovery time. If your SAN failover takes 15–30 seconds, don’t configure a policy that stalls indefinitely. If your SSD firmware occasionally needs 5 seconds to recover, don’t set timeouts to 1 second and cause constant resets.

Practical tasks (commands, what the output means, and what you decide)

You want real commands, not vibes. These tasks are ordered roughly from “safe and fast” to “this changes behavior.” Run them on Ubuntu 24.04; most require root.

Task 1: Confirm high iowait and run queue pressure

cr0x@server:~$ mpstat -P ALL 1 5
Linux 6.8.0-xx-generic (server)  12/31/2025  _x86_64_  (16 CPU)

12:00:01 AM  CPU   %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:00:02 AM  all   2.10  0.00  1.20   41.30 0.00  0.20   0.00   0.00    0.00  55.00

Meaning: iowait at 40%+ suggests CPUs are idle because they’re waiting on storage, not because the system is “busy.”

Decision: Move to device identification. Don’t tune CPU. Don’t chase “load average” yet.

Task 2: Identify processes stuck in D state

cr0x@server:~$ ps -eo pid,stat,comm,wchan:32 | awk '$2 ~ /D/ {print}'
23144 D    postgres  io_schedule
28901 D    rsync     ext4_sync_file

Meaning: These processes are blocked in kernel I/O wait paths.

Decision: Map them to filesystems/devices; don’t waste time trying to kill them.

Task 3: See which mount is slow by touching metadata

cr0x@server:~$ sudo timeout 3s stat /var/lib/postgresql || echo "stat timed out"
stat timed out

Meaning: Even a metadata op blocks. That points to filesystem-level blockage on that mount.

Decision: Identify which block device backs that mount.

Task 4: Map a mount to its block device

cr0x@server:~$ findmnt -no SOURCE,TARGET,FSTYPE /var/lib/postgresql
/dev/mapper/mpatha /var/lib/postgresql ext4

Meaning: This is a dm-multipath device, not a simple /dev/sdX.

Decision: Check multipath policy and path health before touching ext4 knobs.

Task 5: Look for kernel timeout/reset messages

cr0x@server:~$ dmesg -T | egrep -i 'tim(e|ed|ing) out|abort|reset|blk_update_request|I/O error|blocked for more'
[Wed Dec 31 00:01:12 2025] sd 3:0:0:1: timing out command, waited 180s
[Wed Dec 31 00:01:12 2025] blk_update_request: I/O error, dev dm-2, sector 81234567 op 0x1:(WRITE)
[Wed Dec 31 00:03:14 2025] INFO: task jbd2/dm-2-8:123 blocked for more than 120 seconds.

Meaning: The kernel waited 180s before declaring a command timed out; ext4 journal thread is blocked.

Decision: Your liveness budget is being spent in the SCSI/device layer. Tune timeouts and/or multipath fail behavior.

Task 6: Check per-device queue and scheduler settings

cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,MODEL,HCTL,STATE,SCHED,ROTA
NAME            TYPE  SIZE MODEL            HCTL       STATE SCHED    ROTA
sda             disk  1.8T ST2000NM000A      0:0:0:0    running mq-deadline    1
mpatha          mpath 2.0T                   -          running none           1

Meaning: Rotational storage + scheduler choices influence latency under load, but they won’t fix path loss. The mapper shows none because the dm layer defers I/O scheduling to the underlying path devices.

Decision: Keep this in mind later; first stabilize failure behavior with timeouts.

Task 7: Measure latency and queue depth

cr0x@server:~$ iostat -x 1 5
Device            r/s   w/s   rkB/s   wkB/s  await  aqu-sz  %util
sda              5.0  85.0    80.0  2400.0  180.0   32.0   99.8
dm-2             0.0  90.0     0.0  2400.0  185.0   31.5   99.9

Meaning: await ~180ms with a queue size ~32 at near 100% utilization. Under failure, you may see await climb to seconds and never come down.

Decision: If this is a spike, improve throughput/queueing. If it’s a stall, focus on failover and timeouts.

Task 8: Confirm multipath status and whether it’s queueing

cr0x@server:~$ sudo multipath -ll
mpatha (3600508b400105e210000900000490000) dm-2 HP,MSA2040
size=2.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 3:0:0:1 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 4:0:0:1 sdc 8:32 failed faulty running

Meaning: queue_if_no_path is enabled; one path is failed. If all paths fail, I/O will queue instead of failing.

Decision: Decide if your workload prefers queued writes (risk of application stall) or fast failure (risk of write errors but better service liveness).

Task 9: Inspect dm-multipath timeouts (fast fail and device loss)

cr0x@server:~$ sudo cat /sys/block/dm-2/dm/uuid
mpath-3600508b400105e210000900000490000

cr0x@server:~$ sudo cat /sys/block/dm-2/queue/io_timeout
0

Meaning: io_timeout of 0 typically means “use default / no explicit timeout here,” which often equals “wait a long time elsewhere.”

Decision: Prefer explicit policies: set multipath fast fail values and ensure SCSI timeouts are not absurd for your SLAs.

Task 10: Check SCSI device timeout currently in effect

cr0x@server:~$ sudo cat /sys/block/sdb/device/timeout
180

Meaning: 180 seconds per command is a lifetime in request-land. A single stuck command can pin the queue.

Decision: For many environments, 30–60 seconds is a more reasonable ceiling, but only if your storage can actually recover inside that window.

Task 11: Identify transport (SAS/iSCSI/NVMe) to pick the right knobs

cr0x@server:~$ lsblk -S -o NAME,TRAN,TYPE,MODEL,SERIAL
NAME TRAN TYPE MODEL            SERIAL
sdb  sas  disk HP LOGICAL VOLUME 600508b4...
sdc  sas  disk HP LOGICAL VOLUME 600508b4...

Meaning: This is SAS-backed storage, likely via a RAID controller or SAN enclosure, not NVMe.

Decision: SCSI timeout tuning and multipath policies matter more than NVMe controller reset settings here.

Task 12: Check for filesystem-level pain (ext4 errors, journal, remounts)

cr0x@server:~$ sudo dmesg -T | egrep -i 'ext4|jbd2|remount|buffer i/o error' | tail -n 20
[Wed Dec 31 00:03:14 2025] INFO: task jbd2/dm-2-8:123 blocked for more than 120 seconds.
[Wed Dec 31 00:03:18 2025] EXT4-fs (dm-2): Delayed block allocation failed for inode 262145 at logical offset 0 with max blocks 16 with error 5

Meaning: ext4 is seeing I/O errors and the journal thread is blocked. Filesystem operations will domino.

Decision: Fix underlying I/O recovery first. Filesystem tuning won’t outsmart missing blocks.

Task 13: See systemd timeouts and stop-job behavior (liveness of services)

cr0x@server:~$ systemctl show postgresql --property=TimeoutStartUSec,TimeoutStopUSec,KillMode
TimeoutStartUSec=1min 30s
TimeoutStopUSec=1min 30s
KillMode=control-group

Meaning: systemd will wait 90 seconds before declaring start/stop failure. But if the process is in D state, it may ignore termination signals.

Decision: You can’t “systemd your way” out of kernel I/O wait. But you can keep front-end services healthy by isolating storage-dependent units.

Task 14: Confirm whether the kernel is flooding with blocked task warnings

cr0x@server:~$ sudo sysctl kernel.hung_task_timeout_secs
kernel.hung_task_timeout_secs = 120

Meaning: This is diagnostic noise control, not a fix. Changing it changes when you get warnings, not when I/O returns.

Decision: Don’t “fix” the incident by muting the alarm. Use it to time-correlate stalls with path issues.

Task 15: Capture a quick stack trace of blocked tasks

cr0x@server:~$ echo w | sudo tee /proc/sysrq-trigger
w

Meaning: The kernel will dump blocked task stacks to dmesg/journal. It’s ugly, but it tells you which layer is waiting (dm, scsi, filesystem, etc.).

Decision: Use this to prove whether you’re stuck in multipath queueing, SCSI EH, or filesystem journal waits.
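
One hedged caveat: Ubuntu’s default kernel.sysrq bitmask (typically 176) does not include the process-dump bit (8), in which case the w trigger is silently ignored. Check it, and widen it temporarily if needed:

cr0x@server:~$ sudo sysctl kernel.sysrq
cr0x@server:~$ sudo sysctl -w kernel.sysrq=1

Set it back to the distro default after the investigation if your security policy expects the narrower mask.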

Task 16: Watch udev attributes for persistent timeout configuration

cr0x@server:~$ udevadm info --query=all --name=/dev/sdb | egrep 'ID_MODEL=|ID_SERIAL=|ID_WWN=|DEVPATH='
E: DEVPATH=/devices/pci0000:00/0000:00:1f.2/host3/target3:0:0/3:0:0:1/block/sdb
E: ID_MODEL=LOGICAL_VOLUME
E: ID_SERIAL=600508b400105e210000900000490000
E: ID_WWN=0x600508b400105e210000900000490000

Meaning: You can match specific devices (WWN) to apply udev rules that set timeouts consistently across reboots.

Decision: If you tune timeouts, make them persistent via udev or multipath config—not hand-edited sysfs one-offs.

Yes, that’s a lot of commands. But on-call is not the time to “just try a reboot.” Reboots are fine; mystery is expensive.

Timeout settings that prevent full stalls

We’ll talk about the settings that determine whether a slow disk causes “some errors” or “the whole node stops answering.” The right choice depends on whether the storage is local, RAID, SAN multipath, iSCSI, or NVMe. But the shape of the problem is the same: bound how long you wait before you fail, and ensure failover happens inside that bound.

1) SCSI disk command timeout: /sys/block/sdX/device/timeout

For SCSI devices (which cover most SAS, FC, iSCSI, and SAN LUNs), each block device exposes a timeout in seconds.

cr0x@server:~$ sudo cat /sys/block/sdb/device/timeout
180

What it does: how long the kernel waits for a command before declaring it timed out and entering error handling (EH). EH can include retries, resets, and bus recovery.

Why it causes stalls: if you allow a single command to sit for 180 seconds, everything behind it can pile up. Some devices/paths serialize certain commands; you get head-of-line blocking.

What to do: pick a value that’s longer than transient jitter but shorter than “we’ve lost the plot.” For many enterprise SAN environments, 30–60 seconds is a common ceiling. For shaky consumer SATA or SMR drives under write pressure, you may need longer—though that’s your sign you bought the wrong disk.

cr0x@server:~$ echo 60 | sudo tee /sys/block/sdb/device/timeout
60

Operational decision: if reducing this makes you see errors during normal load, your storage is not meeting expectations. Don’t “fix” it by raising timeouts; fix the storage path or workload.

2) Make timeout tuning persistent with udev rules

Sysfs writes vanish on reboot and device reprobe. Use udev rules to set timeouts based on WWN/serial/model.

cr0x@server:~$ sudo tee /etc/udev/rules.d/60-scsi-timeout.rules >/dev/null <<'EOF'
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]*", ENV{ID_WWN}=="0x600508b400105e210000900000490000", ATTR{device/timeout}="60"
EOF
cr0x@server:~$ sudo udevadm control --reload-rules
cr0x@server:~$ sudo udevadm trigger --subsystem-match=block --action=change

Meaning: future adds/changes apply the timeout automatically.

Decision: do this only after you’ve proven the new timeout doesn’t break legitimate failover behavior.

3) dm-multipath: stop queueing forever when paths are gone

Multipath is the classic “disk hang under load” accelerant. If all paths fail and you queue I/O, applications block. They don’t error, they don’t retry intelligently, they just sit there. That can be acceptable for some storage stacks; it’s poison for stateless services.

Key concept: queueing hides failure. Hidden failure becomes a node-level stall.

Look at current features:

cr0x@server:~$ sudo multipath -ll | sed -n '1,2p'
mpatha (3600508b400105e210000900000490000) dm-2 HP,MSA2040
size=2.0T features='1 queue_if_no_path' hwhandler='0' wp=rw

What to change: In many production environments, you want fast failure when all paths are down, plus a bounded retry window.

  • Disable indefinite queueing (avoid queue_if_no_path unless you have a clear reason).
  • Use fast_io_fail_tmo to fail I/O quickly when paths are down.
  • Use dev_loss_tmo to decide how long to keep trying to recover a path before giving up.

Example multipath configuration snippet (conceptual; adjust for your environment):

cr0x@server:~$ sudo tee /etc/multipath.conf >/dev/null <<'EOF'
defaults {
    find_multipaths yes
    user_friendly_names yes
    flush_on_last_del yes
}

devices {
    device {
        vendor "HP"
        product "MSA"
        no_path_retry 12
        fast_io_fail_tmo 5
        dev_loss_tmo 30
        features "0"
    }
}
EOF
cr0x@server:~$ sudo systemctl restart multipathd

Meaning: with fast_io_fail_tmo 5, I/O on a failed path is errored back to multipath after 5 seconds so it can be retried on another path. With dev_loss_tmo 30, a path that stays down is removed after 30 seconds. With no_path_retry 12, an all-paths-down event queues I/O for a bounded window (12 checker intervals) and then fails it instead of queueing forever.

Decision: If your storage fabric failover takes 20–25 seconds, dev_loss_tmo 30 can work. If it takes 2 minutes, you’re either misconfigured or living dangerously, and the timeouts should reflect reality.
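
To confirm multipathd actually picked up the new values (a quick check; the grep just surfaces the relevant keywords from the merged configuration):

cr0x@server:~$ sudo multipathd show config | grep -E 'no_path_retry|fast_io_fail_tmo|dev_loss_tmo'

Your explicit device entries should appear alongside the built-in hardware defaults and take precedence for matching vendor/product strings.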

Second short joke: Multipath queueing is like putting customer complaints in a drawer labeled “later.” Eventually you run out of drawers.

4) iSCSI specifics: session recovery vs app liveness

If your LUNs arrive via iSCSI, you have additional timeouts: TCP retransmits, iSCSI session replacement, and multipath. iSCSI can be rock-solid, but only if you pick an operational stance: quick failover or “hold the writes until the world heals.”

Check session status:

cr0x@server:~$ sudo iscsiadm -m session
tcp: [1] 10.10.10.20:3260,1 iqn.2001-04.com.example:storage.lun1 (non-flash)

Meaning: you have an active iSCSI session. If storage stalls correlate with session drops/reconnect storms, tune iSCSI and multipath together.

Decision: Don’t set multipath to fail in 5 seconds if iSCSI takes 30 seconds to re-establish and you actually want it to ride through.
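
The iSCSI knob that most often interacts with multipath is the session replacement timeout. A minimal sketch for inspecting and bounding it (the IQN, portal, and 15-second value below are placeholders taken from the session output above, not recommendations):

cr0x@server:~$ sudo iscsiadm -m session -P 3 | grep -i 'recovery timeout'
cr0x@server:~$ sudo iscsiadm -m node -T iqn.2001-04.com.example:storage.lun1 -p 10.10.10.20:3260 -o update -n node.session.timeo.replacement_timeout -v 15

A short replacement timeout hands failures to dm-multipath quickly so it can fail over; a long one rides through blips but keeps applications waiting. Pick one stance and make the multipath settings agree with it.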

5) NVMe: controller resets are fast, but not magic

On local NVMe, stalls are usually different: firmware hiccups, PCIe AER events, thermal throttling, or power management quirks. You don’t get SCSI timeouts; you get NVMe controller behavior and driver reset logic.

Check NVMe errors and resets:

cr0x@server:~$ sudo dmesg -T | egrep -i 'nvme|AER|reset|timeout' | tail -n 20
[Wed Dec 31 00:02:01 2025] nvme nvme0: I/O 123 QID 6 timeout, aborting
[Wed Dec 31 00:02:02 2025] nvme nvme0: reset controller

Meaning: the driver is actively aborting and resetting. That’s generally better than waiting 180 seconds, but resets can still wedge the filesystem if they happen frequently.

Decision: If resets are frequent, treat it as a hardware/firmware issue first. Timeout tuning is not a substitute for a stable device.
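
On NVMe the per-command ceiling lives in the driver, not in a per-disk sysfs timeout file. A minimal sketch for checking it and pulling the device’s own error history (the device name is a placeholder):

cr0x@server:~$ cat /sys/module/nvme_core/parameters/io_timeout
cr0x@server:~$ sudo nvme smart-log /dev/nvme0 | grep -iE 'critical_warning|media_errors|num_err_log_entries'
cr0x@server:~$ sudo nvme error-log /dev/nvme0 | head -n 20

If you do change nvme_core.io_timeout, set it as a module parameter so it survives reboots, and treat recurring resets as a hardware or firmware conversation rather than a tuning exercise.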

6) Filesystem and mount options: don’t confuse symptoms for causes

Under true device stalls, filesystem options are secondary. Still, some choices affect how quickly applications feel pain:

  • ext4 commit interval affects how often metadata is forced to disk. Lower commit can increase sync pressure; higher commit can increase recovery window and latency bursts.
  • barriers / write cache are mostly handled well by modern stacks, but misreported cache or disabled flushes can turn stalls into corruption.
  • noatime reduces metadata writes; it won’t fix a hang, but it can reduce background pressure.

Opinion: do not “mount-option your way” out of a path failure. Fix the path. Then optimize.
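
If you do adjust mount options, do it deliberately: see what is currently in effect, then treat the remount below as a hedged example (the mount point and commit value are illustrative, not recommendations):

cr0x@server:~$ findmnt -no OPTIONS /var/lib/postgresql
cr0x@server:~$ sudo mount -o remount,noatime,commit=30 /var/lib/postgresql

Anything you decide to keep belongs in /etc/fstab; a remount that only exists in shell history is a future incident.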

7) Service-level timeouts: contain blast radius

You can’t always prevent storage from going away, especially with networked storage. You can prevent it from taking the whole node hostage by designing for partial failure:

  • Keep critical services off questionable mounts (yes, even if it’s “just logs”).
  • Use systemd TimeoutStopSec and Restart policies appropriately (a drop-in sketch follows this list), but remember D-state processes can’t be killed cleanly.
  • Prefer app-level request timeouts. If your database call blocks forever, your whole threadpool becomes a museum exhibit.
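
A minimal drop-in sketch, assuming a unit named postgresql.service; the values are placeholders for whatever your service contract actually requires:

cr0x@server:~$ sudo mkdir -p /etc/systemd/system/postgresql.service.d
cr0x@server:~$ sudo tee /etc/systemd/system/postgresql.service.d/liveness.conf >/dev/null <<'EOF'
[Service]
TimeoutStopSec=30s
Restart=on-failure
RestartSec=5s
EOF
cr0x@server:~$ sudo systemctl daemon-reload

This bounds how long systemd waits on a graceful stop and how eagerly it restarts; it does nothing for a process parked in D state, which is exactly why the storage-layer timeouts above still matter.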

Three corporate mini-stories from the trenches

1) The incident caused by a wrong assumption: “The SAN will fail over instantly”

A mid-sized company migrated a set of stateful services onto a new SAN-backed cluster. The storage team promised redundant paths and “automatic failover.” The platform team heard “instant.” That was the wrong word to hear.

The Linux nodes were configured with dm-multipath and, because someone read a best-practices blog from an era when write loss was the villain, the devices had queue_if_no_path. The assumption was: if a path drops, multipath will reroute and the app will never notice. They were thinking like storage engineers. The application behaved like an application: it blocked.

A top-of-rack switch rebooted during a maintenance window that ran long. Both active paths disappeared briefly—not long enough to trigger panic at the SAN, but long enough for the Linux nodes to queue a mountain of writes. The database processes went into D state. The API tier, waiting on database responses, filled its own queues. Load balancers saw “healthy TCP,” but requests timed out at the client. It looked like an application outage. It was a storage-liveness outage.

The fix was boring and politically awkward: agree on a recovery time objective for storage path loss, then configure multipath to fail I/O within that window instead of queueing indefinitely. They also added upstream timeouts so user requests didn’t wait on a stuck kernel forever. The next switch reboot caused errors, not a full stall—and the system recovered faster because the failure was visible and bounded.

2) The optimization that backfired: “Increase queue depth to boost throughput”

An analytics team had a nightly batch that hammered a RAID set. A well-meaning performance push increased queue depth and allowed more outstanding I/O. Throughput numbers improved in a benchmark. The team celebrated and went to lunch. Classic.

Under real workload, the deeper queues amplified tail latency. When the array hit internal housekeeping (garbage collection, parity checks, whatever the firmware called its midlife crisis), individual I/O operations got slow. With a deep queue, slow operations accumulated. By the time the system noticed, there were hundreds of requests waiting, and unrelated processes started blocking on metadata operations. The box wasn’t “slow.” It was intermittently unresponsive.

They tried to “fix it” by increasing SCSI timeouts so the kernel wouldn’t error. That reduced visible errors but increased the duration of stalls. The batch ran longer, business dashboards lagged, and the on-call rotation started developing opinions about analytics.

The recovery was counterintuitive: reduce queue depth, accept slightly lower peak throughput, and get predictable latency. They also scheduled array housekeeping away from batch windows. The system stopped doing the “fine for 50 minutes, dead for 5” routine, which is the worst kind of reliability theater.

3) The boring but correct practice that saved the day: “Make timeouts explicit and test failover”

A financial services shop ran Ubuntu nodes with multipath LUNs for a small but critical cluster. Nothing glamorous—just steady I/O, strict SLAs, and a culture that distrusted heroics.

They had a simple practice: timeouts were configuration, not folklore. For each storage class, they documented expected failover times and set fast_io_fail_tmo, dev_loss_tmo, and SCSI device/timeout to align with that. They also ran a quarterly exercise: pull one path, pull all paths briefly, and verify that services either fail fast or ride through—whichever the service contract required.

One day an upstream fabric issue caused intermittent path drops. On nodes without this discipline, that would have been a silent queueing disaster. Here, the multipath layer failed I/O quickly enough that the application’s retry logic kicked in. Errors spiked, but the cluster stayed alive. Operators had something to alert on and a window to reroute traffic.

The postmortem was delightfully dull. No mystery. No “kernel hung.” Just a known failure mode behaving inside known bounds. That’s what you want: boredom with receipts.

Common mistakes (symptoms → root cause → fix)

1) Symptom: SSH works, but ls in a directory hangs

Root cause: the directory is on a filesystem whose underlying block device is blocked; metadata reads are waiting on I/O completion.

Fix: identify mount with findmnt, then inspect kernel logs and device timeouts. If multipath, check for queueing and path loss. If local disk, look for controller resets and hardware errors.

2) Symptom: load average is huge, CPU is mostly idle

Root cause: many tasks are blocked in I/O wait (D state), counted in load average.

Fix: use ps with wchan, iostat -x, and sysrq w stack dumps. Target the storage layer, not CPU scheduling.

3) Symptom: systemd can’t stop a service; stop job hangs forever

Root cause: processes are in uninterruptible sleep; signals won’t terminate them while kernel waits on I/O.

Fix: fix the I/O path. For containment, consider isolating the mount, failing over, or rebooting the node if recovery is impossible. Don’t waste an hour tuning systemd kill modes.

4) Symptom: no errors, just long stalls during SAN events

Root cause: multipath is queueing I/O when paths disappear (queue_if_no_path), hiding failures from apps.

Fix: disable indefinite queueing and set fast_io_fail_tmo/dev_loss_tmo to bounded values aligned with fabric failover behavior.

5) Symptom: intermittent “EXT4-fs error” followed by remount read-only

Root cause: underlying device is returning I/O errors or timing out; filesystem protects itself by remounting read-only.

Fix: treat storage as faulty. Check cabling, HBA logs, array health, SMART/NVMe logs. Don’t “fix” with mount options.

6) Symptom: tuning timeouts “fixed” hangs but now you see random I/O errors

Root cause: you reduced timeouts below the time your storage needs to recover; now slow operations are treated as failures.

Fix: measure actual failover/recovery time, then set timeouts slightly above that. If recovery is unacceptably slow, fix the storage path rather than hiding it.

7) Symptom: everything stalls during log rotation or backups

Root cause: synchronous metadata pressure (fsync storms), or saturating a single device with writes; journal contention makes unrelated operations block.

Fix: rate-limit backups, move logs off the critical volume, use ionice, reduce concurrency, and verify the device isn’t hitting internal GC or SMR write cliffs.

8) Symptom: virtual machine stalls, host looks fine

Root cause: queueing at the hypervisor or storage backend; guest sees blocked I/O but host metrics hide it.

Fix: measure at each layer: guest iostat, host block stats, storage array latency. Align timeouts end-to-end; don’t let the guest wait longer than the host can actually recover.

Checklists / step-by-step plan

Step-by-step: Stabilize a node that “hangs” under disk load

  1. Confirm it’s storage: check mpstat for iowait and ps for D-state tasks.
  2. Map the pain: use findmnt to map affected directories to devices.
  3. Read the kernel story: dmesg for timeouts, resets, path failures, and filesystem messages.
  4. Decide the class: local disk vs multipath SAN vs iSCSI vs NVMe. Don’t apply SAN advice to NVMe, or vice versa.
  5. Measure latency and queues: iostat -x to see if you’re saturated or failing.
  6. Multipath policy check: look for queue_if_no_path and path status; confirm failover timing expectations.
  7. Set bounded failure behavior: tune multipath fast_io_fail_tmo/dev_loss_tmo and SCSI device timeouts based on measured recovery time.
  8. Make it persistent: udev rules for SCSI timeout; multipath.conf for dm policy.
  9. Contain blast radius: ensure critical services have sensible request timeouts; move noncritical writes (logs, temp) off the critical mount if possible.
  10. Test: simulate path loss in a maintenance window (a sketch follows this list). If you haven’t tested failover, you’re running a theory.
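
One way to simulate losing a single path, assuming sdb is one leg of the multipath map and you are inside a maintenance window (this genuinely takes the path offline; treat it as a drill, not a demo):

cr0x@server:~$ echo offline | sudo tee /sys/block/sdb/device/state
cr0x@server:~$ sudo multipath -ll mpatha
cr0x@server:~$ echo running | sudo tee /sys/block/sdb/device/state

Watch application latency and error behavior while the path is down, then confirm the path checker reinstates it after you set the state back to running.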

Checklist: Values to pick (guidance, not gospel)

  • Measure first: how long does a path failover actually take in your environment? Use that as baseline.
  • SCSI device/timeout: often 30–60s for SAN LUNs in stable fabrics; longer if your array needs it (but ask why).
  • Multipath fast_io_fail_tmo: 5–15s if you prefer fast error; longer if you want ride-through.
  • Multipath dev_loss_tmo: 30–120s depending on path recovery expectations.
  • Avoid indefinite queueing unless your application stack is explicitly designed to tolerate it.

Checklist: What to avoid

  • Don’t “solve” stalls by increasing timeouts blindly. You may only increase the duration of outages.
  • Don’t disable safety mechanisms (flushes/barriers) to chase benchmark numbers.
  • Don’t assume storage failover is instantaneous. It rarely is, and it often isn’t consistent.
  • Don’t ignore D-state processes. They’re the kernel waving a red flag, not an application bug.

FAQ

1) Why does kill -9 not terminate a stuck process?

Because it’s in uninterruptible sleep (D state) waiting for kernel I/O completion. The kernel won’t schedule it to handle the signal until the I/O returns or errors.

2) Is kernel.hung_task_timeout_secs a real timeout that fixes hangs?

No. It controls when the kernel warns you about blocked tasks. It’s useful for correlation, not remediation.

3) Should I disable queue_if_no_path in multipath?

In many production setups: yes, unless you have a documented reason to queue indefinitely (and a plan for what the application does while blocked). Prefer bounded failure with explicit retry windows.

4) What’s a reasonable SCSI timeout value?

Reasonable is “slightly above your proven storage recovery time.” In practice, many SAN environments land in the 30–60s range; some require longer. If you need 180s, ask what scenario requires it and whether your services can afford it.

5) If I lower timeouts, will I lose data?

Lower timeouts can convert long waits into I/O errors. Whether that risks data depends on the application and filesystem. Databases usually prefer explicit errors over indefinite stalls, but you must ensure your stack handles write failures correctly.

6) My system “hangs” only during backups. Is that a timeout problem?

Often it’s saturation and queueing rather than path failure. Check iostat -x for 100% util and high await. Then reduce backup concurrency, apply ionice, or move backups to a less critical path.

7) How do I know whether it’s the disk, the controller, or the SAN?

Start with dmesg. Controller resets, link errors, and path failures leave fingerprints. Then correlate with multipath -ll (paths), device timeouts, and latency metrics (iostat). If the mapper has all paths down and is queueing, it’s not ext4’s fault.

8) Does changing the I/O scheduler fix stalls?

Schedulers can improve fairness and latency under load, especially on rotational media. They will not fix a missing path or a device that stops responding. Treat scheduler tuning as phase two.

9) Should I reboot a node that has many D-state tasks?

If storage recovery is not happening and the node is effectively wedged, yes—reboot can be the fastest route to restore service. But you still need to fix the underlying timeout/failover behavior or you’ll reboot again later.

10) Can systemd timeouts protect me from disk stalls?

They protect you from services that hang in user space. They don’t reliably protect you from kernel I/O wait. Use them for hygiene, not as a storage reliability strategy.

Next steps you can do this week

If Ubuntu 24.04 machines “hang” under disk load, the kernel is usually doing exactly what you told it to do: wait. Your job is to set expectations—explicitly—about how long waiting is allowed before recovery or failure kicks in.

  1. Run the fast diagnosis playbook on one affected node and capture dmesg, iostat -x, multipath -ll, and a sysrq blocked-task dump.
  2. Pick a liveness budget (seconds, not minutes) for storage path loss and align SCSI timeouts and multipath fail settings to it.
  3. Make those settings persistent with udev rules and /etc/multipath.conf.
  4. Test failure: pull a path, then pull all paths briefly in a controlled window. Verify the behavior matches what your services can tolerate.
  5. Only after liveness is stable, tune performance: queue depth, scheduling, and workload shaping.

The end state you want is not “no errors.” It’s “errors happen quickly, predictably, and recoverably,” without turning your node into a very expensive meditation app.
