Debian/Ubuntu Disk Latency Spikes: Prove It’s Storage, Not the App (Tools + Fixes)

Everything is “fine” until it isn’t. The API p99 trips a circuit breaker, your database stalls, dashboards look like a seismograph, and someone says the line you’ll hear in every company forever: “The app didn’t change.”

Disk latency spikes are the classic scapegoat-and-guessing game. This is the antidote: a Debian/Ubuntu workflow that produces evidence, not vibes—so you can prove it’s storage (or prove it’s not), and then fix the right thing.

Fast diagnosis playbook

If you have 10 minutes, do this in order. The trick is to separate latency from throughput, and block device from filesystem from application. A busy disk can be fine; a disk with occasional 2–20 second pauses will ruin your day. A one-pass capture sketch follows the playbook.

1) Confirm the symptom: are we stalled on I/O?

  • Check system-wide: run vmstat 1 and look for high wa (iowait) during the spike.
  • Check per-device: run iostat -x 1 and look for await and %util rising during the spike.
  • Check queueing: if aqu-sz (avgqu-sz on older sysstat) grows, you’re stacking requests faster than the device completes them.

2) Identify the victim: which process is blocked?

  • Capture blocked tasks: ps -eo pid,stat,wchan:25,comm | awk '$2 ~ /D/'.
  • Correlate with app: DB worker threads in D state are the OS telling you “I’m waiting on storage.”

3) Decide: local disk, virtual disk, or remote storage?

  • Map mount → block device: findmnt -no SOURCE,TARGET /your/mount.
  • Then: lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINTS,ROTA,MODEL to see if you’re on NVMe, SATA SSD, HDD, dm-crypt, LVM, MD RAID, multipath, or a cloud virtual disk.

4) If it’s a spike: trace, don’t average

  • Use biosnoop (bcc) or bpftrace to catch latency outliers.
  • If you can’t, use blktrace/blkparse and look for long gaps between dispatch and completion.

5) Validate with a controlled repro

  • Run a safe fio profile against a test file on the same filesystem and see if p99/p999 latency spikes align with production symptoms.
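If you want all five checks captured in one pass the moment an alert fires, here is a minimal capture sketch. It assumes sysstat is installed and that the device name and output path are placeholders to adjust:

#!/usr/bin/env bash
# io-triage.sh — capture one 30-second evidence window during a spike.
# Assumptions: sysstat installed; "vda" and the output path are examples.
set -euo pipefail
DEV="${1:-vda}"
OUT="/var/tmp/io-triage-$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$OUT"
date -u +%FT%TZ                >  "$OUT/timestamp"   # correlation dies without time
cat /proc/pressure/io          >  "$OUT/psi-io"
vmstat 1 30                    >  "$OUT/vmstat"  &
iostat -x -t "$DEV" 1 30       >  "$OUT/iostat"  &
for _ in $(seq 1 30); do       # D-state snapshots, once per second
  ps -eo pid,stat,wchan:25,comm | awk '$2 ~ /^D/' >> "$OUT/dstate"
  sleep 1
done
wait
echo "evidence bundle: $OUT"

Attach the resulting directory to the incident ticket; it maps directly onto the checklist later in this post.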

What “disk latency spike” actually means

Disk latency is the time between “kernel submits a block I/O request” and “kernel gets completion.” If that time spikes, everything upstream becomes a liar: the app thread looks “slow,” locks appear “contentious,” queues “mysteriously” build, and humans start rewriting code instead of fixing the bottleneck.

There are three common spike shapes:

  • Queueing spikes: latency climbs because you’re saturating the device or backend. Symptoms: high %util, high aqu-sz, rising await with steady IOPS.
  • Pause spikes: latency jumps from a few ms to seconds with little throughput change. Symptoms: periodic multi-second stalls; sometimes %util doesn’t even look pegged. Causes include firmware hiccups, garbage collection, remote backend throttling, or journal commits.
  • Amplification spikes: small writes become many writes (journaling, copy-on-write, RAID parity, encryption). Symptoms: the app issues “reasonable” IO, storage does a lot more work, latency balloons under load.

One quote worth keeping on a sticky note:

“Hope is not a strategy.” — General Gordon R. Sullivan

Also: your app can be innocent and still be the trigger. A workload change without a deploy (new tenant, different query shape, new index build, background compaction) can push storage over a cliff. Your job is to prove the cliff exists.

Joke #1: Disk latency is like a meeting that “will only take five minutes.” It will not.

Interesting facts and context (because history repeats)

  • “iowait” isn’t “the disk is slow.” It’s CPU time spent idle while the system has outstanding I/O. A CPU-heavy app can still have low iowait and terrible storage latency.
  • The Linux elevator used to be a headline feature. Early schedulers like anticipatory and CFQ were designed for spinning disks and interactive responsiveness; SSDs and NVMe shifted the balance toward mq-deadline/none.
  • NCQ and deep queues changed failure modes. SATA NCQ let devices reorder requests; it also made “one bad command stalls the queue” more visible when firmware gets it wrong.
  • SSDs can pause to clean house. Garbage collection and wear leveling can cause periodic latency spikes, especially when the drive is near full or lacks overprovisioning.
  • Journaling traded data loss for latency predictability. ext3/ext4 journaling made crashes less exciting, but commit behavior can create periodic write bursts and sync-related stalls.
  • Write barriers became the default for a reason. Barriers (flush/FUA semantics) prevent reordering that can corrupt metadata after power loss; they can also expose a slow cache flush path.
  • Virtual disks are political boundaries. In clouds, your “disk” is a slice of shared backend; throttling and burst credits can make latency spikes appear out of nowhere.
  • RAID hides throughput problems better than latency problems. You can add spindles and get more MB/s, but small random writes on parity RAID still pay a tax.

Practical tasks: commands, what the output means, and what you do next

This section is intentionally hands-on. You want repeatable artifacts: logs, timestamps, and a narrative that survives the next meeting.

Task 1: Establish the ground truth (kernel, device, virtualization)

cr0x@server:~$ uname -a
Linux db-01 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC x86_64 GNU/Linux
cr0x@server:~$ systemd-detect-virt
kvm
cr0x@server:~$ lsb_release -ds
Ubuntu 22.04.4 LTS

Meaning: Kernel version and virtualization strongly influence storage behavior (multi-queue, scheduler defaults, virtio). If this is a VM, you also need to think about noisy neighbors and backend throttling.

Decision: If virtualized, plan to collect evidence that survives the “it’s your guest” conversation: per-device latency, queue depth, throttling signals, and time correlation.

Task 2: Map mounts to block devices (don’t guess)

cr0x@server:~$ findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /var/lib/postgresql
/dev/mapper/vg0-pgdata /var/lib/postgresql ext4 rw,relatime,discard
cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINTS,ROTA,MODEL
NAME            TYPE   SIZE FSTYPE      MOUNTPOINTS              ROTA MODEL
vda             disk   500G                                      1    QEMU HARDDISK
└─vda2          part 499.5G LVM2_member                          1
  ├─vg0-root    lvm     50G ext4        /                         1
  └─vg0-pgdata  lvm    449G ext4        /var/lib/postgresql       1

Meaning: The app is on LVM on vda. The rotational flag reports 1 (HDD), but virtual disks often report that regardless of the backing media; treat it as “not NVMe-fast” rather than proof of spinning rust. Also note discard in the mount options; that matters later.

Decision: Your bottleneck might be underneath LVM (virtio, host storage, networked block). Keep mapping: device-mapper stacks add queueing and complexity.
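To keep mapping downward, a few stock commands walk the whole stack (names match the example layout above; adjust for yours):

# Reverse view: logical volume -> physical volume -> disk
lsblk -s /dev/mapper/vg0-pgdata
# Device-mapper dependency tree (dm-crypt, thin pools, multipath all show up here)
sudo dmsetup ls --tree
# Which physical devices back each LV, and the segment type (linear, striped, thin)
sudo lvs -o lv_name,vg_name,segtype,devices
# What the kernel believes about the media underneath it all
cat /sys/block/vda/queue/rotational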

Task 3: Watch system pressure, not just averages

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 612304  64216 931224    0    0   120   410  410  880 12  4 82  2  0
 1  0      0 610992  64216 931260    0    0   180   512  398  860 10  4 84  2  0
 1  5      0 610120  64216 930800    0    0   140  2100  520 1200  6  3 60 31  0
 0  6      0 609880  64216 930744    0    0    90  1800  540 1300  5  3 58 34  0
 1  0      0 610400  64216 931100    0    0   110   600  420  900  9  4 84  3  0

Meaning: During the spike, b (blocked processes) rises and wa jumps to 31–34%. That’s a real stall signature.

Decision: If blocked processes climb during latency spikes, move on to per-device metrics. If wa is low but p99 latency is high, the app might be CPU-bound or waiting on locks, not I/O.

Task 4: Get per-device latency and queue signals

cr0x@server:~$ iostat -x 1 5
Linux 6.5.0-28-generic (db-01) 	12/30/2025 	_x86_64_	(8 CPU)

Device            r/s     w/s   rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
vda              8.00   95.00  512.0  8240.0    0.00   12.00   0.00  11.21   3.10  18.40   1.90    64.0    86.7   1.20  12.40
vda              7.00  110.00  448.0  9100.0    0.00   10.00   0.00   8.33   3.40 120.50  14.20    64.0    82.7   1.50  17.50
vda              6.00  105.00  384.0  9000.0    0.00   11.00   0.00   9.48   3.80 240.10  28.30    64.0    85.7   1.60  18.00
vda              9.00   98.00  576.0  8600.0    0.00   14.00   0.00  12.50   3.20  22.00   2.30    64.0    87.8   1.30  13.50
vda              8.00   92.00  512.0  8100.0    0.00   12.00   0.00  11.54   3.10  19.30   2.00    64.0    88.0   1.20  12.60

Meaning: Writes are the problem (w_await spikes to 120–240ms) while %util is not extreme. That’s classic “backend got slow” or “flush/commit path” rather than “device saturated.” aqu-sz rises during spikes: queueing is happening.

Decision: When await jumps but utilization doesn’t peg, suspect pauses: cache flushes, thin provisioning, remote throttling, or host contention. Move to tracing and flush/queue investigation.
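To line device latency up against app p99 graphs later, keep a timestamped log across the spike window. A sketch, with the device name as an example and the awk column index to be checked against your sysstat header:

# Timestamped device stats spanning a spike window
iostat -x -t vda 1 120 | tee /var/tmp/iostat-vda.log
# Flag seconds where w_await exceeded 100 ms. In the header shown above,
# w_await is column 11 — verify against your sysstat version before trusting it.
awk '$1 == "vda" && $11+0 > 100' /var/tmp/iostat-vda.log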

Task 5: Confirm which processes are stuck in uninterruptible sleep (D state)

cr0x@server:~$ ps -eo pid,stat,wchan:25,comm | awk '$2 ~ /D/'
18423 D    io_schedule             postgres
18431 D    io_schedule             postgres
21102 D    ext4_writepages         postgres

Meaning: Postgres workers are blocked in kernel wait paths related to I/O. That’s not “slow SQL” by itself; that’s “storage completion is late.”

Decision: If app threads are in D state during spike windows, prioritize block layer evidence and storage-side causes. If they’re runnable (R) but slow, focus on CPU, locks, GC, or network.

Task 6: Read pressure stall information (PSI) to prove system-wide I/O contention

cr0x@server:~$ cat /proc/pressure/io
some avg10=0.28 avg60=0.22 avg300=0.15 total=184329210
full avg10=0.07 avg60=0.05 avg300=0.03 total=40210299

Meaning: PSI tells you how often tasks are delayed waiting for I/O. full indicates times when the system had no runnable task because everyone was waiting on I/O. That’s a strong “storage is gating the machine” signal.

Decision: If PSI full climbs during spikes, treat it as infra, not app. If PSI is quiet, your latency might be inside the app (locks) or inside the filesystem cache (page faults) rather than block I/O.
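PSI is cheap enough to log continuously. A throwaway sketch, assuming a small log under /var/tmp is acceptable:

# Append a timestamped PSI "full" sample every 5 seconds; negligible overhead.
# Run it in tmux or leave it backgrounded during an incident window.
while true; do
  printf '%s %s\n' "$(date -u +%FT%TZ)" "$(grep ^full /proc/pressure/io)"
  sleep 5
done >> /var/tmp/psi-io.log &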

Task 7: Inspect block device scheduler and queue settings

cr0x@server:~$ cat /sys/block/vda/queue/scheduler
[mq-deadline] none
cr0x@server:~$ cat /sys/block/vda/queue/nr_requests
256
cr0x@server:~$ cat /sys/block/vda/queue/read_ahead_kb
128

Meaning: Scheduler choice matters for latency. mq-deadline is often a sensible default because it puts an expiry deadline on requests, which bounds how long reads and writes can languish in the queue; none minimizes overhead on fast devices. Queue depth (nr_requests) influences burst behavior and tail latency.

Decision: Don’t “tune” blindly. If you see a long latency tail, you may need to reduce queueing to keep latency bounded, especially on shared backends. Plan controlled tests.
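When you do plan those tests, make the changes non-persistently first; they revert on reboot, which keeps the experiment reversible. A sketch against the vda device from above:

cat /sys/block/vda/queue/scheduler                    # current: [mq-deadline] none
echo none | sudo tee /sys/block/vda/queue/scheduler   # switch for the test window
# Optionally shrink the request queue; some scheduler/driver combinations
# reject this write, which is itself useful information.
echo 64 | sudo tee /sys/block/vda/queue/nr_requests
# Re-run the same fio profile and compare p99/p999 before and after.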

Task 8: Check filesystem mount options that can force synchronous behavior

cr0x@server:~$ findmnt -no TARGET,FSTYPE,OPTIONS /var/lib/postgresql
/var/lib/postgresql ext4 rw,relatime,discard
cr0x@server:~$ tune2fs -l /dev/mapper/vg0-pgdata | egrep 'Filesystem features|Journal features'
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Journal features:         journal_incompat_revoke journal_64bit journal_checksum_v3

Meaning: discard can cause latency spikes depending on backend. Modern recommendation is often periodic fstrim rather than inline discard for latency-sensitive workloads.

Decision: If you see spikes during deletes/vacuum/compaction, try disabling discard and using scheduled fstrim. Validate with change control and measurement.

Task 9: Observe flushes and writeback behavior (dirty throttling)

cr0x@server:~$ sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500
cr0x@server:~$ grep -E 'Dirty:|Writeback:' /proc/meminfo
Dirty:             182340 kB
Writeback:           2048 kB

Meaning: Dirty page thresholds dictate when the kernel forces writeback. When you hit dirty_ratio, app writes can be throttled hard, which looks like random latency spikes.

Decision: If spikes correlate with bursts of dirty memory and writeback, tune dirty settings carefully or reduce write amplification (app batching, DB settings). Be conservative; these knobs can make things worse.

Task 10: Capture storage latency outliers with BPF (biosnoop from bcc tools)

cr0x@server:~$ sudo biosnoop -Q -d vda
TIME(s) COMM           PID    DISK    T  SECTOR    BYTES   LAT(ms)
12.4321 postgres      18423  vda     W  9132456   16384   14.92
12.4398 postgres      18431  vda     W  9132488   16384   18.77
12.9802 postgres      18423  vda     W  9132520   16384   942.51
13.0054 postgres      18431  vda     W  9132552   16384   1103.44

Meaning: This is the money shot: real I/O requests with measured latency, attributed to a process. Those 900–1100ms writes explain p99 app timeouts.

Decision: If BPF shows outliers at the block device, you can stop arguing about the app. Now figure out why: flushes, throttling, backend pauses, queueing, or device errors.
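If per-I/O lines are too noisy, a latency histogram from the same bcc toolbox summarizes the tail. On Ubuntu the tools ship in the bpfcc-tools package with a -bpfcc suffix; a sketch:

# Per-device latency histogram, six 10-second intervals
sudo biolatency-bpfcc -D 10 6
# Same, but include time spent queued in the OS, not just on the device
sudo biolatency-bpfcc -Q -D 10 6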

Task 11: Trace the block layer path with blktrace for deeper timing

cr0x@server:~$ sudo blktrace -d /dev/vda -o - | blkparse -i -
  8,0    1        1     0.000000000 18423  Q  WS 9132520 + 32 [postgres]
  8,0    1        2     0.000310215 18423  G  WS 9132520 + 32 [postgres]
  8,0    1        3     0.000482906 18423  I  WS 9132520 + 32 [postgres]
  8,0    1        4     0.000650102 18423  D  WS 9132520 + 32 [postgres]
  8,0    1        5     0.942912433 18423  C  WS 9132520 + 32 [0]

Meaning: This shows the request lifecycle: queued (Q), request allocated (G), inserted into the scheduler (I), dispatched to the device (D), completed (C). Here, completion arrives ~0.942s after dispatch. That’s device/backend time, not your SQL parser.

Decision: If time is mostly between D and C, focus on device/backend. If delay is between Q and D, you’re queueing in the OS (scheduler/queue depth), often due to saturation or stacked device-mapper layers.
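To quantify that split across thousands of requests instead of eyeballing one, record to files and summarize with btt, which ships alongside blktrace. A sketch:

# Record 30 seconds of traces, then summarize queue-to-dispatch vs dispatch-to-complete
sudo blktrace -d /dev/vda -w 30 -o vda-trace
blkparse -i vda-trace -d vda-trace.bin -O      # merge per-CPU files into one binary
btt -i vda-trace.bin | less                    # compare Q2D (OS queueing) with D2C (device time)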

Task 12: Check for kernel-reported device problems (the boring logs matter)

cr0x@server:~$ sudo dmesg -T | egrep -i 'blk|I/O error|timeout|reset|nvme|scsi|ext4|xfs' | tail -n 8
[Mon Dec 30 10:12:41 2025] blk_update_request: I/O error, dev vda, sector 9132520 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
[Mon Dec 30 10:12:41 2025] Buffer I/O error on dev dm-1, logical block 1141568, lost async page write
[Mon Dec 30 10:12:41 2025] EXT4-fs warning (device dm-1): ext4_end_bio:345: I/O error 10 writing to inode 262411 starting block 1141568)

Meaning: If you have I/O errors or resets, latency is no longer the main story. You have correctness risk. Spikes may be retries, remaps, or backend timeouts.

Decision: Escalate immediately: storage team/cloud provider/hypervisor owner. Start planning for failover and data integrity checks, not micro-optimizations.

Task 13: Verify TRIM behavior (discard vs scheduled trim)

cr0x@server:~$ systemctl status fstrim.timer
● fstrim.timer - Discard unused blocks once a week
     Loaded: loaded (/lib/systemd/system/fstrim.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Mon 2025-12-30 09:00:01 UTC; 1h 12min ago
     Trigger: Mon 2026-01-06 00:00:00 UTC; 6 days left
cr0x@server:~$ sudo fstrim -v /var/lib/postgresql
/var/lib/postgresql: 94.3 GiB (101251604480 bytes) trimmed

Meaning: If you can schedule trim, you can often remove inline discard. That can reduce latency spikes during heavy deletes.

Decision: Prefer fstrim.timer for many backends. If you’re on thin-provisioned SAN or certain cloud disks, validate with your provider/storage team.

Task 14: Reproduce with fio and look at tail latency, not just IOPS

cr0x@server:~$ fio --name=latcheck --filename=/var/lib/postgresql/fio.test --size=2G --direct=1 --ioengine=libaio --rw=randwrite --bs=16k --iodepth=16 --numjobs=1 --time_based --runtime=30 --group_reporting --output-format=normal
latcheck: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=16
fio-3.33
Starting 1 process
  write: IOPS=1420, BW=22.2MiB/s (23.3MB/s)(666MiB/30001msec)
    slat (usec): min=6, max=421, avg=14.52, stdev=9.30
    clat (msec): min=0, max=1840, avg=10.62, stdev=41.10
     lat (msec): min=0, max=1840, avg=10.65, stdev=41.11
    clat percentiles (msec):
     |  50.00th=[   1],  90.00th=[   3],  99.00th=[ 120],  99.90th=[ 820],  99.99th=[1700]

Meaning: Median is fine; the tail is awful. That’s exactly how production feels: mostly okay, occasionally catastrophic. Tail percentiles confirm spike behavior even under a controlled test.

Decision: If fio reproduces the tail spikes, it’s not your app. Now you can test fixes (scheduler, discard, queue depth, dirty settings) and measure improvement.
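One variation worth keeping handy, since flush/commit paths were a suspect earlier: a small sequential write with fdatasync after every write roughly imitates a database commit and isolates the flush path. The test file path is an example:

fio --name=fsync-lat --filename=/var/lib/postgresql/fio.fsync --size=256M \
    --rw=write --bs=8k --ioengine=sync --fdatasync=1 \
    --time_based --runtime=30 --output-format=normal
# Compare the sync/fdatasync percentiles here with the randwrite tail above;
# if only this test spikes, the flush path (cache flush, journal, backend) is the lead.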

Build the case: storage vs app (how to prove it without starting a war)

“Prove it’s storage” means producing a chain of evidence from user-visible latency to kernel-level I/O completion time. You want correlation, attribution, and a plausible mechanism.

Evidence ladder (use it like a courtroom)

  1. User symptom: p95/p99 latency spikes, timeouts, retry storms, queue growth.
  2. Host symptom: blocked tasks, PSI I/O pressure, elevated iowait during incident windows.
  3. Device symptom: iostat -x shows await spikes and queue growth; sometimes without saturation.
  4. Per-I/O proof: BPF tools or blktrace show individual I/Os taking 200ms–seconds, attributed to the process and device.
  5. Mechanism: something explains why: flush storm, thin pool metadata contention, cloud throttling, discard, RAID write penalty, firmware GC, multipath failover, etc.

What not to do

  • Don’t use “CPU iowait is high” as your only argument. It’s suggestive, not definitive.
  • Don’t lean on svctm as gospel. On modern kernels and stacked devices, it’s often misleading.
  • Don’t average your way out of tail latency. Spikes live in p99/p999, not in the mean.

How app issues masquerade as storage (and how to separate them)

Sometimes the app is guilty, and storage is the witness. Here are the common impostors:

  • Lock contention: app threads blocked on mutexes look like “everything is slow,” but the kernel shows runnable tasks, not D-state I/O wait.
  • GC or compaction pauses: the app stalls but disk metrics stay stable; CPU may spike or pause patterns are periodic.
  • Network dependency: remote calls cause latency; disk stays fine; blocked tasks aren’t in I/O wait.
  • Filesystem cache misses: major page faults can look like I/O, but you’ll see a pattern in vmstat and perf counters; still storage, but different layer.

Joke #2: The app team will say it’s storage; the storage team will say it’s the app. Congratulations, you now run diplomacy as a service.

Three corporate mini-stories (how this fails in real life)

Mini-story 1: The incident caused by a wrong assumption

The company had a checkout service on Ubuntu VMs, backed by a managed block volume. A new partner integration launched on a Monday. No deploys, no schema changes, no obvious flags. By lunch, p99 latencies were spiking to seconds and the on-call chat was doing what on-call chats do: producing theories faster than data.

The prevailing assumption: “If storage was the issue, we’d see 100% disk utilization.” The graphs showed %util hovering in the teens. Someone declared storage innocent and blamed the new partner’s API. Engineers started adding caching, adjusting timeouts, and retry logic—helpfully increasing load on the database.

Later that afternoon, an SRE ran biosnoop and caught periodic 800–1500ms write latencies on the volume. The request rate wasn’t high; the backend was just occasionally slow. The missing concept was that latency spikes can happen without local saturation when you’re on a shared or throttled backend.

The fix wasn’t heroic: the workload had shifted to more small writes. They bumped the volume class to one with better baseline latency and eliminated a mount option that was causing synchronous discards under delete-heavy bursts. The partner API was fine. The original assumption wasn’t.

Mini-story 2: The optimization that backfired

A team wanted faster nightly batch jobs. Someone noticed the kernel’s dirty page settings and decided to “let Linux buffer more,” increasing vm.dirty_ratio and vm.dirty_background_ratio. The batch job got faster in the first hour. Slack celebrated. A change request was retroactively written. You can see where this is going.

In production, daytime traffic also writes. With higher dirty thresholds, the kernel accumulated more dirty data and then dumped it in larger writeback bursts. Storage wasn’t saturated on average, but the burst created multi-second stalls when the system hit the dirty limit and throttled writers.

The database didn’t just slow down; it started timing out client requests, which triggered retries and amplified write pressure. The app team saw timeouts and blamed query plans. The infra team saw acceptable average IOPS and shrugged. Tail latency was the only metric that mattered, and it was on fire.

The rollback restored stability immediately. The lasting fix was disciplined load testing with p99/p999 tracking, and a separate batch window using rate limits. The “optimization” had been real for throughput and disastrous for latency. Both can be true.

Mini-story 3: The boring but correct practice that saved the day

At another place, the storage platform team had a habit that looked uncool but worked: every host emitted a small set of storage SLO metrics—device await, PSI I/O full, and a histogram of block I/O latency from an eBPF program sampled during peaks. They also kept mount options and device mapper stacks in inventory.

One Thursday, several services started showing synchronized p99 spikes across unrelated apps. Because the telemetry was consistent, the on-call didn’t start with “what changed in the app.” They started with: “What’s common across these hosts?” The histogram showed long tail writes on volumes attached to one cluster of hypervisors.

They pulled D-state process snapshots and confirmed multiple processes blocked in io_schedule across different services. That shifted the conversation from “app bug” to “shared storage backend.” The hypervisor owner found a maintenance event that had moved a storage pool into a degraded state. No single VM was “saturated.” The backend was.

The outcome was anticlimactic: workloads were migrated off the affected hosts, and the storage pool was repaired. The saving grace wasn’t genius. It was boring measurement plus a habit of correlating time, device, and process. In ops, boring is a feature.

Fixes that actually move the needle (ordered by how often they work)

Once you’ve proven the latency is in the storage path, your fixes should target the mechanism you observed. Don’t shotgun tune sysctls and hope. Tail latency punishes superstition.

1) Fix the backend class/limits (cloud and virtualized environments)

If you’re on a cloud block volume or shared SAN, you may be hitting throttling, burst credit exhaustion, or noisy neighbor contention. The evidence pattern is: await spikes without local saturation, fio tail spikes, and often repeatable periodicity.

  • Action: move to a volume class with better baseline latency / provisioned IOPS; reduce variance by paying for it.
  • Action: distribute hot data across multiple volumes (striping at the app/DB level or LVM) if appropriate.
  • Avoid: “just add retries.” Retries are how small incidents become outages.

2) Remove inline discard for latency-sensitive filesystems (use fstrim)

Inline discard can turn deletes into synchronous trim operations. On some backends it’s fine; on others it’s a latency landmine.

  • Action: remove discard from fstab for the data volume, remount, and rely on fstrim.timer (sketch after this list).
  • Validate: run fio and observe tail latency improvement.
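A minimal sketch of that change, assuming the ext4 data mount from the earlier tasks. With only the mount point given, remount re-reads the options from fstab:

# 1. Edit /etc/fstab: drop "discard" from the data volume's options, e.g.
#    /dev/mapper/vg0-pgdata  /var/lib/postgresql  ext4  defaults  0  2
# 2. Apply without a reboot and confirm:
sudo mount -o remount /var/lib/postgresql
findmnt -no OPTIONS /var/lib/postgresql        # "discard" should be gone
# 3. Make sure scheduled trim is actually on:
sudo systemctl enable --now fstrim.timer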

3) Bound queueing to reduce tail latency

Deep queues improve throughput until they destroy p99. For shared backends, huge queue depths can turn a short stall into a long one by piling work behind it.

  • Action: test scheduler mq-deadline vs none for NVMe/virtio. Pick the one that reduces tail latency under your workload, not the one that wins microbenchmarks (udev persistence sketch after this list).
  • Action: consider reducing queue depth (provider-specific in VMs; in Linux, device queue and application iodepth matter).
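Once a winner is clear, persist it with a udev rule rather than a boot-time script. The file name is an example, and the match should cover only the devices you actually tested:

cat <<'EOF' | sudo tee /etc/udev/rules.d/60-io-scheduler.rules
ACTION=="add|change", KERNEL=="vd[a-z]", ATTR{queue/scheduler}="mq-deadline"
EOF
sudo udevadm control --reload
sudo udevadm trigger --subsystem-match=block    # apply to devices that already exist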

4) Eliminate write amplification

If your app issues small random writes, every layer can multiply that into more I/O than you think.

  • Action: for databases, align checkpointing/flush behavior with storage characteristics; avoid settings that create huge periodic flush storms.
  • Action: check whether you’re accidentally on parity RAID for small-write-heavy workloads; consider mirror/striped mirrors for latency-sensitive writes.
  • Action: avoid stacking LVM + dm-crypt + MD RAID unless you need it; every layer adds queueing and failure modes.

5) Dirty page tuning (only with measurement)

Kernel writeback tuning can reduce burstiness, but it’s easy to make things worse. The safe stance: small changes, tested under load, watching p99 write latency and app SLOs.

  • Action: if you see periodic stalls at dirty limit, lower vm.dirty_ratio slightly to force smoother writeback.
  • Action: consider using vm.dirty_background_bytes and vm.dirty_bytes instead of ratios on hosts with variable memory footprints (example after this list).
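A sketch of the byte-based variant. The values are placeholders; derive yours from the device’s sustained write bandwidth and validate under load:

cat <<'EOF' | sudo tee /etc/sysctl.d/90-writeback.conf
# Setting the *_bytes knobs automatically disables the *_ratio equivalents.
vm.dirty_background_bytes = 268435456   # start background writeback at 256 MiB dirty
vm.dirty_bytes = 1073741824             # hard-throttle writers at 1 GiB dirty
EOF
sudo sysctl --system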

6) Fix error paths and firmware issues

If dmesg shows resets/timeouts/errors, treat it as a reliability incident. Latency spikes are often retries and controller resets wearing a disguise.

  • Action: update drive firmware / hypervisor storage drivers where applicable.
  • Action: replace failing devices; stop trying to tune your way out of physics.

Common mistakes: symptom → root cause → fix

1) App p99 spikes, but disk %util is low

Symptom: Users see timeouts; iostat shows low %util but await spikes.

Root cause: backend pause or throttling (cloud volume, SAN contention, cache flush path) rather than sustained saturation.

Fix: capture per-I/O latency with BPF/blktrace; move to better volume class or reduce flush/trim triggers; tune queueing for tail latency.

2) High iowait leads to “blame the disk,” but await is fine

Symptom: vmstat shows high wa, but iostat -x shows low await.

Root cause: the system is waiting on something else (NFS server hiccups, networked filesystem, swap I/O, or an app causing blocked reads via page faults on a different device).

Fix: map mounts to devices; check NFS stats if applicable; identify blocked processes and their wait channels; trace the right device.

3) Periodic spikes every few seconds/minutes

Symptom: latency spike cadence looks like a metronome.

Root cause: journal commits, checkpoints, writeback timers, periodic trim, or storage backend housekeeping.

Fix: correlate with filesystem/DB logs; remove inline discard; adjust checkpoint/commit behavior; measure with fio and BPF.

4) Spikes during delete-heavy workloads

Symptom: vacuum/compaction/deletion coincides with I/O stalls.

Root cause: synchronous discard, thin provisioning metadata pressure, or SSD GC pressure from invalidation bursts.

Fix: use scheduled trim; ensure adequate free space/overprovisioning; consider better SSD class or backend configuration.

5) “We moved to encryption and now it’s slow”

Symptom: higher latency and lower throughput after enabling dm-crypt.

Root cause: CPU overhead, smaller effective request sizes, loss of device offloads, or queueing interactions in device-mapper stack.

Fix: confirm with perf and CPU usage; ensure AES-NI is available; consider optimizing I/O sizes and iodepth; keep stacks minimal.

6) RAID “works” until small random writes happen

Symptom: reads look fine; writes have awful tail latency under load.

Root cause: parity RAID write penalty and read-modify-write cycles; cache flush behavior.

Fix: use mirrors for latency-sensitive write workloads; ensure write cache is safe (BBU/PLP) and flushes are sane; increase stripe alignment when applicable.

Checklists / step-by-step plan

Step-by-step: from alert to root cause in a controlled way

  1. Capture timestamps. Note start/end of spike windows. Correlation dies without time.
  2. Confirm host impact. Record vmstat 1 and PSI I/O output during the spike.
  3. Collect per-device stats. Run iostat -x 1 for at least 60 seconds spanning a spike.
  4. Identify blocked processes. Snapshot D-state tasks and their wait channels.
  5. Map the data path. Mount → filesystem → dm-crypt/LVM/MD → physical/virtual disk.
  6. Trace outliers. Use biosnoop or blktrace to capture a handful of worst-latency I/Os.
  7. Check logs for correctness risks. Scan dmesg for I/O errors/timeouts/resets.
  8. Reproduce safely. Run fio on a test file to validate tail spikes outside the app.
  9. Form a hypothesis. “Spikes are caused by X because evidence Y shows latency between D and C and correlates with Z.”
  10. Apply one change. One. Not five. Measure again with the same tools.
  11. Lock in monitoring. Keep PSI I/O, iostat await, and a tail-latency probe as standard signals.

Operational checklist: what to attach to the incident ticket

  • iostat output spanning the spike (raw text).
  • PSI I/O snapshot and vmstat snapshot.
  • D-state process list with wait channels.
  • One tracing artifact (biosnoop lines or blktrace excerpt) showing outlier I/O latencies.
  • dmesg excerpt for any storage-related warnings/errors.
  • Storage topology: lsblk output showing stacks (dm-crypt/LVM/MD/multipath).
  • Workload note: what the app was doing (checkpoint, vacuum, compaction, batch job).

FAQ

1) Is high iowait enough to prove storage is the bottleneck?

No. It’s a hint. Prove it with per-device latency (iostat -x) and per-I/O tracing (BPF or blktrace). iowait can be low during short spikes.

2) Why does latency spike when %util isn’t near 100%?

Because utilization is an average and often local to the guest. Backend pauses, throttling, cache flushes, or remote contention can create high completion time without local saturation.

3) What’s the fastest way to attribute slow I/O to a process?

Use biosnoop (bcc) or similar eBPF tooling. It records I/O latency and shows which process issued it. That ends arguments quickly.

4) Should I switch the scheduler to “none” for SSD/NVMe?

Sometimes, but measure. “none” can reduce overhead, but it can also allow unfairness and worse tail latency on shared backends. Test with fio and production-like load.

5) Is inline discard really that bad?

It can be. On some devices it’s cheap; on others it triggers expensive work at terrible times. If you see spikes during deletes, try scheduled trim instead.

6) My fio test shows terrible p99, but the app is “fine” most of the day. What now?

You have a tail latency problem that will surface under the wrong concurrency or background activity. Fix it now, before you get an outage that only happens on Tuesdays.

7) Can filesystem choice (ext4 vs xfs) fix latency spikes?

Sometimes, but it’s rarely the first lever. Most spikes come from backend variance, write amplification, queueing, or flush behavior. Filesystem changes are disruptive; exhaust simpler fixes first.

8) How do I tell queueing in the OS from slow device completion?

Use blktrace lifecycle timing. If delay is between Q and D, you’re queueing before dispatch. If delay is between D and C, the device/backend is slow.

9) Why do spikes get worse when we add retries?

Retries add load precisely when the system is weakest. They increase concurrency, deepen queues, and extend the spike. Prefer backoff, jitter, and circuit breakers—and fix the root cause.

Conclusion: next steps that don’t waste a week

When disk latency spikes show up, the danger isn’t just performance. It’s misdiagnosis. People rewrite code, “optimize” the wrong path, and ship changes that make the incident larger and harder to reason about.

Do this next:

  1. Instrument: keep PSI I/O and iostat -x-style device latency in your standard dashboards.
  2. During the next spike, capture one tracing artifact (BPF or blktrace) that shows an outlier I/O and its latency.
  3. Remove obvious latency multipliers (inline discard, pathological queueing) and retest with fio using percentiles.
  4. If you’re on shared/virtualized storage and you can reproduce tail spikes, stop negotiating with physics: upgrade the backend class or redesign the layout.

Once you can point at a specific I/O that took 1.1 seconds and name the process that issued it, the conversation changes. That’s the goal. Evidence beats opinions, and it also makes you less popular in meetings—which, honestly, is sometimes a bonus.
