Ubuntu 24.04 disk is slow: IO scheduler, queue depth, and how to verify improvements


“The disk is slow” is the least useful incident report and the most common one. It shows up as timeouts in your app, backups that suddenly take the whole night, or a database that starts acting like it’s doing deep philosophical work between writes.

Ubuntu 24.04 ships a modern Linux kernel with multi-queue block IO and a lot of sane defaults. Still, one wrong scheduler, one mismatched queue depth, or one “helpful” tuning knob can turn a fast NVMe into a latency sprinkler. The fix is rarely mystical. It’s usually measurement, a couple of deliberate settings, and a habit of proving the change actually helped.

Fast diagnosis playbook

When someone says “disk is slow”, you do not start by changing tunables. You start by finding whether you have a latency problem, a throughput ceiling, a saturation problem, or a plain old device failure. Here’s the order that catches the most issues with the least drama.

1) Identify what kind of “slow” it is

  • Latency spike (p99/p999 reads or fsyncs jump): often queueing, scheduler mismatch, cache flushes, device firmware/GC, or backend storage.
  • Throughput ceiling (MB/s stuck low): often link speed, PCIe lane issues, RAID/MD constraints, cloud volume limits, or wrong IO pattern.
  • Saturation (high utilization, long queues): often queue depth too low/high, too many concurrent writers, or a noisy neighbor.
  • Stalls/timeouts: can be device errors, multipath failovers, controller resets, or filesystem/journal pain.

2) Check errors and resets before tuning

If the kernel is logging NVMe resets or SCSI timeouts, your “performance issue” is an availability issue wearing a performance hat.

3) Measure queueing and latency at the block layer

Use iostat and pidstat to see if you’re building queues (aqu-sz), suffering latency (await), or just doing a lot of IO. Then map that to scheduler and queue depth settings.

4) Validate scheduler and queue depth against device type

NVMe, SATA SSD, HDD, and cloud block volumes are different animals. If you treat them the same, they will punish you differently.

5) Change one thing, run the same test, compare percentiles

If you can’t reproduce the problem, you can’t prove the fix. That’s not cynicism; it’s how you avoid “tuning” your way into a new incident.

What “slow disk” actually means in production

Storage performance discussions go off the rails because people mix metrics. A developer says “disk is slow” and means the API request is timing out. A sysadmin hears “disk is slow” and checks MB/s. A database person means fsync latency. They’re all right, and all wrong, until you anchor on the same measurement.

Four metrics that matter (and one that lies)

  • IOPS: operations per second. Great for random IO and small blocks. Useless if you ignore latency distribution.
  • Throughput (MB/s): good for streaming reads/writes and large blocks. Misleading for databases and metadata-heavy workloads.
  • Latency (average and percentiles): the only metric your users feel. p99 matters more than “avg”.
  • Queueing: how much work is waiting. Queueing is often where “slow” is born.
  • CPU iowait: the liar. iowait can be low while latency is terrible (async IO), and high while everything is fine (CPU idle, IO busy).

On Ubuntu 24.04, you’re typically on a multi-queue block stack (blk-mq). That means multiple hardware queues, multiple software submission paths, and different scheduler behavior than the old single-queue days. It’s faster and more parallel. It’s also easier to create a mess if your queue depth is mismatched to the device or backend.
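
You can see that multi-queue layout for a device straight from sysfs. A minimal sketch (nvme0n1 is an example name; substitute your own device):

cr0x@server:~$ ls /sys/block/nvme0n1/mq        # one directory per hardware dispatch queue (hctx)
cr0x@server:~$ nproc                           # blk-mq scales queue count with CPU cores, so compare the two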

One quote that belongs on every on-call runbook:

“Hope is not a strategy.” — traditional SRE saying

Storage tuning without measurement is hope, wearing a lab coat.

Interesting facts and history (short, concrete, useful)

  1. Linux used to pick between CFQ, deadline, and noop; with blk-mq, many devices now default to none (scheduler bypass) or mq-deadline.
  2. NVMe was designed for deep queues (many commands in flight) because flash and PCIe thrive on parallelism.
  3. HDDs hate random IO not because they’re “slow”, but because they’re mechanical: every random seek is a tiny physical commute.
  4. Deadline scheduling became popular for databases because it prevents starvation and keeps read latency from going feral under write load.
  5. Write cache and flush behavior changed the game: modern SSDs can acknowledge writes quickly, but a forced cache flush (fsync/fua) can still cost real time.
  6. Queue depth tuning became mainstream with SANs: too shallow wastes the array, too deep melts tail latency with queueing.
  7. Cloud block volumes often enforce performance caps (IOPS/MB/s) regardless of what your instance can do, leading to “slow disk” that is actually “slow wallet”.
  8. Multi-queue was adopted to scale with CPU cores; a single global queue lock was a bottleneck on fast SSDs.
  9. “noop” wasn’t “no scheduling” so much as “FIFO with merging”: it still merged adjacent requests but didn’t reorder them, which made sense for hardware that already reorders efficiently.

IO scheduler on Ubuntu 24.04: what it does and when it matters

The IO scheduler is the traffic cop between your filesystem and your block device. Its job is to decide what gets issued next and in what order, and how aggressively to merge and dispatch requests.

On modern Linux, especially with NVMe, the device and firmware already do a lot of reordering. That’s why “no scheduler” (none) can be the best choice: the kernel gets out of the way. But “can be” is not “always is”.

Schedulers you’ll commonly see

  • none: effectively bypasses scheduling for blk-mq devices; relies on hardware to manage ordering. Often best for NVMe and high-end storage arrays.
  • mq-deadline: multi-queue version of deadline; aims to bound latency and prevent starvation. Often a good default for SATA SSDs and mixed workloads.
  • kyber: targets low latency by controlling dispatch based on target latencies. Can help on certain devices; can also confuse you if you don’t measure correctly.
  • bfq: fairness-oriented, often for desktops; can be useful for interactive latency on rotational media, but it’s not my first pick for servers unless you have a reason.
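
To see what’s active across every disk at once, one line of sysfs reading is enough. A sketch:

cr0x@server:~$ grep . /sys/block/*/queue/scheduler    # the bracketed entry on each line is the active scheduler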

When the scheduler matters

If your workload is mostly sequential IO (big streaming reads/writes), scheduler choice often doesn’t move the needle much. If your workload is mixed random reads/writes under contention (databases, VM hosts, Ceph OSDs), scheduler and queueing behavior can dominate tail latency.

If you’re on a hardware RAID controller or a SAN that already does sophisticated scheduling, adding a heavy scheduler layer can be like asking two project managers to “coordinate” the same sprint. You’ll get more meetings, not more features.

Queue depth: the hidden lever behind throughput and tail latency

Queue depth is how many IO requests you allow to be outstanding (in flight) at once. More depth increases parallelism and can improve throughput and IOPS. Too much depth increases queueing delay and makes latency worse, especially at the tail.

Three queues you should stop confusing

  • Application concurrency: how many threads, async tasks, or processes are issuing IO.
  • Kernel block queue depth: the block layer’s ability to hold and dispatch requests (think nr_requests and per-device limits).
  • Device or backend queue depth: NVMe queue size, HBA queue depth, SAN LUN limits, cloud volume limits.
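
The kernel-side and device-side numbers are readable from sysfs. A minimal sketch (device names are examples; the SCSI path doesn’t exist for NVMe):

cr0x@server:~$ cat /sys/block/nvme0n1/queue/nr_requests    # block-layer request limit
cr0x@server:~$ cat /sys/block/sdb/device/queue_depth       # per-LUN queue depth on a SATA/SAS/SCSI device

Application concurrency has no sysfs file: count the threads or connections actually issuing IO from the application side (pidstat -d in Task 7 gets you close).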

The “right” queue depth is workload and device specific. Databases frequently want bounded latency more than peak throughput. Backup jobs want throughput and can tolerate latency. VM hosts want both, which is why they cause arguments.

Joke #1: Queue depth tuning is like espresso—too little and nothing happens, too much and nobody sleeps, including your storage array.

What happens when queue depth is wrong

  • Too low: you’ll see low utilization and mediocre throughput; the device could do more but doesn’t get enough parallel work.
  • Too high: you’ll see high utilization, high aqu-sz, rising await, and ugly p99 latency. You’re not “busy”; you’re congested.

Cloud volumes and SANs: queue depth is policy, not physics

With EBS-like volumes, you can have a locally fast NVMe instance store and still be capped by a networked block device policy. That’s why changing scheduler on the guest sometimes helps less than changing volume class, provisioned IOPS, or instance type.

Verification: how to benchmark without lying to yourself

If you want to verify improvements, you need a test that matches your production IO pattern and a method that controls for caching, warm-up, and concurrency.

Rules of storage benchmarking that keep you employed

  • Pick an IO pattern: random vs sequential, read vs write, block size, sync vs async, fsync frequency.
  • Use percentiles: average latency is a bedtime story. p95/p99 is your outage.
  • Control caching: page cache can make reads look magical. Direct IO can make filesystems look worse than reality. Choose deliberately (see the cache-drop sketch after this list).
  • Separate device tests from filesystem tests: test the raw block device when you suspect scheduler/queue depth; test the filesystem when you suspect journaling or mount options.
  • Run long enough: SSD GC and thermal throttling can show up after minutes, not seconds.
  • One change at a time: this is not a cooking show.
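
For the caching rule: if you benchmark through the filesystem with buffered IO, flush the page cache between runs so the second run isn’t a cache benchmark. A sketch; do this only on test hosts, because it briefly hurts everything on the box:

cr0x@server:~$ sync
cr0x@server:~$ echo 3 | sudo tee /proc/sys/vm/drop_caches    # drops page cache plus dentries and inodes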

Your best friend here is fio. Not because it’s fancy, but because it’s explicit: you can describe the workload and get detailed latency stats. Pair it with iostat and kernel logs, and you can tell whether your “improvement” is real.
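
The same explicitness works as a job file, which is easier to version-control and rerun identically than a long command line. A minimal sketch of a 4k random-read job (paths and values are examples):

cr0x@server:~$ cat > randread-4k.fio <<'EOF'
[global]
ioengine=io_uring
direct=1
time_based=1
runtime=60
group_reporting=1

[randread-4k]
filename=/dev/nvme0n1
rw=randread
bs=4k
iodepth=32
EOF
cr0x@server:~$ sudo fio randread-4k.fio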

Practical tasks: commands, output meaning, and decisions

These are real tasks you can run on Ubuntu 24.04. Each includes what to look at and what decision to make. Run them as root where needed. If you’re on production, do the low-impact checks first.

Task 1: identify the actual block devices and topology

cr0x@server:~$ lsblk -e7 -o NAME,TYPE,SIZE,ROTA,TRAN,MODEL,SERIAL,MOUNTPOINTS
NAME        TYPE   SIZE ROTA TRAN MODEL             SERIAL            MOUNTPOINTS
nvme0n1     disk  1.8T    0 nvme Samsung SSD 980PRO S64DNE0R123456A
├─nvme0n1p1 part   512M   0 nvme                                  /boot/efi
└─nvme0n1p2 part   1.8T   0 nvme                                  /

What it means: ROTA=0 suggests SSD/NVMe; TRAN tells you if it’s nvme, sata, sas, etc. If you’re on multipath or MD RAID, you’ll see different layers.

Decision: Pick the right device to inspect (nvme0n1 here). Don’t tune the partition and forget the underlying disk.

Task 2: check kernel logs for device errors, resets, timeouts

cr0x@server:~$ sudo dmesg -T | egrep -i "nvme|scsi|blk_update_request|reset|timeout|I/O error" | tail -n 30
[Mon Dec 29 10:11:14 2025] nvme nvme0: I/O 123 QID 4 timeout, aborting
[Mon Dec 29 10:11:14 2025] nvme nvme0: Abort status: 0x371
[Mon Dec 29 10:11:15 2025] nvme nvme0: controller reset scheduled

What it means: Timeouts and resets are not “tuning opportunities”. They’re reliability incidents. Performance becomes chaos when the device is unstable.

Decision: If you see this, pause tuning. Check firmware, cabling/backplane, PCIe AER errors, controller health, and vendor advisories. If it’s cloud, open a support case and consider moving the volume.
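
If dmesg is clean but you still suspect the drive, a health pass with nvme-cli or smartmontools is cheap. A sketch (device paths are examples):

cr0x@server:~$ sudo nvme smart-log /dev/nvme0    # watch critical_warning, media_errors, percentage_used, temperature
cr0x@server:~$ sudo smartctl -a /dev/nvme0       # smartmontools view; use /dev/sdX for SATA/SAS devices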

Task 3: check current scheduler per device

cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq

What it means: The scheduler in brackets is active (none here). Available options follow.

Decision: For NVMe, none is often correct. If you’re seeing tail latency under mixed load, test mq-deadline. Do not assume.

Task 4: check basic queue depth and request limits

cr0x@server:~$ for f in nr_requests read_ahead_kb rotational rq_affinity nomerges; do echo -n "$f="; cat /sys/block/nvme0n1/queue/$f; done
nr_requests=256
read_ahead_kb=128
rotational=0
rq_affinity=1
nomerges=0

What it means: nr_requests caps queued requests in the block layer. It’s not the only queue, but it matters for congestion. read_ahead_kb influences sequential read behavior.

Decision: If your throughput is low and device utilization isn’t maxed, shallow queues can be suspect. If your latency is bad with high concurrency, overly deep queues can worsen queueing. Change cautiously and measure.
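
If you do experiment, the block-layer limit can be changed live and reverted just as easily. A sketch, not a recommended value; like the scheduler change in Task 10, it does not survive a reboot:

cr0x@server:~$ echo 128 | sudo tee /sys/block/nvme0n1/queue/nr_requests    # kernel may clamp or reject values depending on scheduler/driver
cr0x@server:~$ cat /sys/block/nvme0n1/queue/nr_requests                    # confirm what actually took effect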

Task 5: check hardware queue counts and sizes (NVMe)

cr0x@server:~$ sudo nvme id-ctrl /dev/nvme0 | egrep -i "mdts|sqes|cqes|oacs|oncs|nn|cntlid"
nn    : 0x1
mdts  : 0x9
cntlid: 0x0001
oacs  : 0x0017
oncs  : 0x001f
sqes  : 0x66
cqes  : 0x44

What it means: NVMe capabilities shape how big IOs can be and what features are supported. Not a direct “queue depth” number, but it tells you if the device is behaving like a modern NVMe.

Decision: If NVMe tooling reports weirdness, confirm you’re not behind a controller presenting NVMe in a strange mode. For cloud, confirm you’re actually on NVMe and not virtio-blk with different behavior.

Task 6: watch live IO latency and queueing with iostat

cr0x@server:~$ iostat -x -d 1 10 nvme0n1
Linux 6.8.0-xx-generic (server) 	12/29/2025 	_x86_64_	(32 CPU)

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz  aqu-sz  %util
nvme0n1         120.0   3840.0     0.0    0.0    1.20    32.0     95.0   3040.0     0.0    0.0    8.50    32.0    1.05  32.0

What it means: r_await/w_await are average read/write latencies (ms). aqu-sz is average queue size. %util is busy time (not always reliable on fast devices, but still a clue).

Decision: High await with high aqu-sz suggests queueing: you’re piling up requests. High await with low queueing suggests the device/backend is simply slow per IO (or stuck flushing).

Task 7: map “who is doing IO” to a process list

cr0x@server:~$ sudo pidstat -d 1 5
Linux 6.8.0-xx-generic (server) 	12/29/2025 	_x86_64_	(32 CPU)

10:21:01 AM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
10:21:02 AM  1001     21455      0.00  51200.00      0.00       8  postgres
10:21:02 AM     0      1892      0.00   2048.00      0.00       1  systemd-journald

What it means: This is the fastest way to catch a runaway compaction, a backup job, or a logging storm.

Decision: If the “slow disk” coincides with one process dominating writes, fix the workload first (rate limit, schedule, move it) before you touch the scheduler.
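
If the offender runs as a systemd service, you can cap it without touching storage tunables at all. A minimal sketch, assuming cgroup v2 with the io controller (the Ubuntu 24.04 default) and a hypothetical backup.service:

cr0x@server:~$ sudo systemctl set-property --runtime backup.service IOWriteBandwidthMax="/dev/nvme0n1 50M"
cr0x@server:~$ systemctl show backup.service -p IOWriteBandwidthMax    # verify the cap landed

The --runtime flag means the limit disappears on reboot, which is usually what you want during an incident.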

Task 8: confirm filesystem and mount options

cr0x@server:~$ findmnt -no SOURCE,FSTYPE,OPTIONS /
/dev/nvme0n1p2 ext4 rw,relatime,errors=remount-ro

What it means: Filesystem choice and options affect metadata IO, writeback behavior, and barriers.

Decision: Don’t cargo-cult mount options. If someone disabled barriers or forced data=writeback “for performance”, treat it as a data integrity incident waiting for a power glitch.

Task 9: inspect udev rules and tuned profiles that might be overriding settings

cr0x@server:~$ systemctl is-enabled tuned 2>/dev/null || true
disabled
cr0x@server:~$ grep -R "queue/scheduler" -n /etc/udev/rules.d /lib/udev/rules.d 2>/dev/null | head
/lib/udev/rules.d/60-persistent-storage.rules:...

What it means: You might “set scheduler” manually and have udev or a profile revert it on boot.

Decision: If settings keep changing, stop chasing ghosts. Make the configuration persistent via udev rules or kernel cmdline where appropriate, and document it.

Task 10: change scheduler temporarily (for a controlled test)

cr0x@server:~$ echo mq-deadline | sudo tee /sys/block/nvme0n1/queue/scheduler
mq-deadline
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
none [mq-deadline] kyber bfq

What it means: Scheduler changed live. This is reversible and should be tested under the same workload.

Decision: Run your benchmark and compare p99 latency and throughput. If it helps, make it persistent. If it hurts, revert and move on.

Task 11: test raw device latency and throughput with fio (direct IO)

cr0x@server:~$ sudo fio --name=randread --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring --iodepth=32 --rw=randread --bs=4k --numjobs=1 --time_based=1 --runtime=60 --group_reporting
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, ioengine=io_uring, iodepth=32
fio-3.36
Starting 1 process
randread: IOPS=210k, BW=820MiB/s (860MB/s)(49.1GiB/60s)
  lat (usec): min=42, max=8500, avg=145.2, stdev=60.1
  clat percentiles (usec):
   |  1.00th=[   70],  5.00th=[   85], 10.00th=[   95], 50.00th=[  140],
   | 90.00th=[  210], 95.00th=[  260], 99.00th=[  420], 99.90th=[ 1100]

What it means: This bypasses page cache and hits the device. You get percentiles that actually reflect tail latency.

Decision: Use this as your baseline for “device behavior”. If the device is fast here but your app is slow, the bottleneck is above the block layer (filesystem, fsync pattern, fragmentation, app concurrency, network storage).

Task 12: test filesystem performance (includes metadata and journaling)

cr0x@server:~$ mkdir -p /mnt/fio-test
cr0x@server:~$ sudo fio --name=fsyncwrite --directory=/mnt/fio-test --ioengine=io_uring --direct=0 --rw=write --bs=16k --numjobs=4 --iodepth=8 --fsync=1 --size=2G --group_reporting
fsyncwrite: (g=0): rw=write, bs=(R) 16384B-16384B, (W) 16384B-16384B, ioengine=io_uring, iodepth=8
fio-3.36
fsyncwrite: IOPS=9800, BW=153MiB/s (160MB/s)(8192MiB/53s)
  clat percentiles (usec):
   |  1.00th=[  600],  5.00th=[  900], 10.00th=[ 1000], 50.00th=[ 2100],
   | 90.00th=[ 6200], 95.00th=[ 9500], 99.00th=[21000], 99.90th=[45000]

What it means: This is closer to “database-ish” behavior: lots of sync points. Tail latency here is what kills transactions.

Decision: If p99/p99.9 is high here, scheduler and queue depth can help, but also look at filesystem options, journal settings, and competing writers.
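
Two cheap places to look when sync-heavy writes stall behind buffered writers; a sketch, and don’t copy thresholds blindly:

cr0x@server:~$ sysctl vm.dirty_background_ratio vm.dirty_ratio    # when background writeback starts / when writers get throttled
cr0x@server:~$ findmnt -T /mnt/fio-test -no OPTIONS               # look for ext4 commit=, data=, and journal-related options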

Task 13: inspect multipath/DM devices (if applicable)

cr0x@server:~$ lsblk -s -o NAME,TYPE,SIZE,MOUNTPOINTS | head
NAME        TYPE   SIZE MOUNTPOINTS
dm-0        lvm    900G /
└─mpatha    mpath  900G
  ├─sdb     disk   900G
  └─sdc     disk   900G

What it means: Device-mapper layers can hide queue depth constraints. Tuning the wrong layer does nothing.

Decision: Identify the leaf devices and confirm their queue limits and scheduler. For SANs, verify HBA queue depth and multipath settings. If a path failover shows up in your latency graphs, you have work to do.
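
For SAN and multipath stacks, check the per-path SCSI queue depth and path health directly. A sketch (sdb is an example path device):

cr0x@server:~$ cat /sys/block/sdb/device/queue_depth    # per-LUN queue depth on this path
cr0x@server:~$ sudo multipath -ll                       # path groups, path states, and the active policy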

Task 14: observe per-device IO time and merges with sar

cr0x@server:~$ sar -d 1 5
Linux 6.8.0-xx-generic (server) 	12/29/2025 	_x86_64_	(32 CPU)

10:24:01 AM       DEV       tps     rkB/s     wkB/s   areq-sz    aqu-sz     await     %util
10:24:02 AM     nvme0n1   220.00   4096.00   6144.00     46.55      2.10      9.55     40.00

What it means: Another view of queueing and utilization, useful for historical capture.

Decision: If you need to convince someone, collect sar during an incident window. Humans believe charts more than arguments.

Task 15: check discard/TRIM behavior (SSD) and avoid “always on” myths

cr0x@server:~$ lsblk -D -o NAME,DISC-GRAN,DISC-MAX,DISC-ZERO
NAME      DISC-GRAN DISC-MAX DISC-ZERO
nvme0n1        512B       2G         0

What it means: Device supports discard. Whether you mount with discard or run periodic fstrim matters.

Decision: Prefer periodic fstrim (via systemd timer) over continuous discard for most server workloads. Continuous discard can add overhead and unpredictability.
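
On Ubuntu 24.04 the periodic trim timer is usually present already; verifying takes seconds. A sketch:

cr0x@server:~$ systemctl status fstrim.timer
cr0x@server:~$ sudo systemctl enable --now fstrim.timer    # only if it isn't already enabled
cr0x@server:~$ sudo fstrim -av                             # one-off trim of all mounted filesystems that support it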

Task 16: verify that your “improvement” is stable (repeatability)

Careful: unlike the read-only test in Task 11, this mixed job writes to the raw device and destroys any data on it. Run it only against a scratch disk; the device name below is a stand-in, not the mounted system disk from Task 1.

cr0x@server:~$ sudo fio --name=randrw --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring --iodepth=16 --rw=randrw --rwmixread=70 --bs=4k --numjobs=2 --time_based=1 --runtime=180 --group_reporting
randrw: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, ioengine=io_uring, iodepth=16
fio-3.36
randrw: IOPS=120k, BW=469MiB/s (492MB/s)(84.1GiB/180s)
  clat percentiles (usec):
   | 95.00th=[  600], 99.00th=[ 1200], 99.90th=[ 4200]

What it means: A longer mixed workload run catches GC, throttling, and queueing effects.

Decision: If results swing wildly run-to-run, you don’t have a tuning problem; you have an environment problem (contention, throttling, backend variability, or thermal limits).

Three corporate-world mini-stories (anonymized, plausible, technically accurate)

1) Incident caused by a wrong assumption: “NVMe is always fast”

A team migrated a latency-sensitive service to newer servers with NVMe drives. The ticket said “we upgraded storage, disk can’t be the problem.” They also had a new deploy, so the narrative was basically pre-written: it must be application code.

The first hint was that average latency looked fine, but the service saw periodic request timeouts. On-host, iostat showed bursts where aqu-sz jumped and w_await climbed into double digits. It wasn’t constant; it came in waves. The device was “fast” until it wasn’t.

The assumption that “NVMe equals low latency” hid the real story: the workload was write-heavy with frequent syncs, and the drive model had a small SLC cache. Under sustained writes, the drive transitioned into a slower steady state and GC kicked in. The kernel logs were clean. No errors. Just a drive doing what consumer-ish drives do when you use them like a database journal.

They tested a different scheduler (mq-deadline) and tuned concurrency down slightly. Tail latency improved, but the real fix was boring: switching to an endurance-focused SSD class and separating the WAL/journal onto a device with predictable sustained write behavior.

The lesson wasn’t “buy expensive disks.” It was “stop treating product names like performance guarantees.” NVMe is a protocol. Your workload still has to fit the physics and firmware.

2) Optimization that backfired: “Let’s crank queue depth to max”

A virtualization cluster was suffering low throughput on a shared SAN LUN. Someone had read that “deep queues increase IOPS,” which is true in the same way “more cars increase traffic flow” is true: sometimes, up to a point.

They raised queue-related settings at multiple layers: HBA queue depth, multipath settings, and block layer nr_requests. The cluster graphs looked amazing for a day. Then the helpdesk started getting “VM is frozen” complaints. Not slow. Frozen.

Investigation showed the average latency stayed acceptable, but p99 latency went off a cliff under load. The SAN was absorbing the increased parallelism by queueing internally. When a burst arrived (backup plus patching plus someone doing a report), the internal queues grew. That queueing delay translated into seconds-long IO waits for unlucky VMs. Deep queues made the SAN look busy, not responsive.

They backed off. Lower queue depth reduced peak throughput but improved tail latency and eliminated the “frozen VM” experience. The correct approach was to set queue depth to what the array could handle with bounded latency, and to enforce workload isolation (separate LUNs / QoS / scheduling backups).

Joke #2: Queue depth “optimization” is the only performance tweak that can also function as an impromptu outage generator.

3) Boring but correct practice that saved the day: baseline benchmarks and change discipline

A storage-heavy service had a ritual: every kernel update cycle, they ran the same small suite of fio jobs on a staging host. It was dull. Nobody got promoted for it. But it built a baseline over time: typical IOPS, expected p99 latency, and how it varied with scheduler choices.

One week, a routine Ubuntu point update landed. The application team reported “disk is slow,” but the graphs were ambiguous: CPU wasn’t high, network wasn’t saturated, and the storage dashboard looked normal. Normally this is where people argue and the incident drags on.

Because they had baselines, they immediately reran the same fio profile and saw p99 latency roughly doubled for 4k random writes at the same concurrency. That narrowed the search. They found the update had changed a udev rule applied to certain device classes, flipping the scheduler from their chosen setting to the default.

They corrected it with a persistent udev rule and verified the fix with the same benchmark. No heroics, no guessing, no “it feels faster now.” The postmortem was short and boring—exactly what you want.

The practice wasn’t clever. It was repeatable measurement and minimal change scope. The kind of boring that keeps pagers quiet.

Common mistakes: symptom → root cause → fix

1) Symptom: high iowait, but disk looks idle

Root cause: IO isn’t hitting the local disk you’re watching (network filesystem, different LUN, device-mapper layer), or IO is blocked on flushes while the device reports low utilization.

Fix: Use pidstat -d to find the culprit process and findmnt to identify the backing device. If it’s NFS/Ceph/RBD, local scheduler changes won’t help; focus on network/storage backend latency.

2) Symptom: great throughput, terrible p99 latency

Root cause: Queue depth too high, or workload concurrency too high, causing queueing delays. Often seen on shared arrays and cloud volumes.

Fix: Reduce concurrency (app threads, iodepth, number of jobs) and/or pick a scheduler that bounds latency (often mq-deadline). Measure percentiles before/after.

3) Symptom: random read IOPS are far below expectations on SSD/NVMe

Root cause: Shallow queue depth, wrong IO engine, or a test accidentally using buffered IO and being dominated by page cache behavior.

Fix: Use fio with --direct=1, set a reasonable iodepth (e.g., 16–64), and confirm scheduler. Also verify PCIe link and NUMA locality if results are oddly low.
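
To rule out a degraded PCIe link or cross-NUMA access, a sketch (the PCI address below is an example; use whatever the readlink resolves to on your host):

cr0x@server:~$ readlink -f /sys/block/nvme0n1/device             # the controller's PCI address appears in the resolved path
cr0x@server:~$ sudo lspci -vv -s 0000:01:00.0 | grep -i LnkSta   # negotiated PCIe speed/width vs what the drive supports
cr0x@server:~$ cat /sys/class/nvme/nvme0/device/numa_node        # which NUMA node the controller hangs off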

4) Symptom: writes become slow after a few minutes of testing

Root cause: SSD SLC cache exhaustion, garbage collection, thermal throttling, or filesystem journal pressure.

Fix: Extend benchmarks to 3–10 minutes and compare early vs late behavior. Check SMART/NVMe health and temperature. If sustained write performance matters, use enterprise SSDs and avoid consumer drives for journal-heavy workloads.

5) Symptom: changing scheduler seems to do nothing

Root cause: You’re changing the scheduler on the wrong layer (partition vs disk, dm-crypt vs underlying disk), or the device is controlled by hardware RAID/SAN that dominates ordering.

Fix: Use lsblk to find the leaf device and verify scheduler there. In RAID/SAN cases, focus on array policies, HBA queue depth, and workload isolation.

6) Symptom: occasional multi-second stalls with normal behavior otherwise

Root cause: Device resets/timeouts, multipath path flaps, or firmware issues. Sometimes also ext4 journal commits under pathological contention.

Fix: Check dmesg for reset/timeout messages. Stabilize the hardware/backend first. Then tune.

7) Symptom: sequential reads are slow on HDD arrays

Root cause: Read-ahead too small, fragmentation, or competing random IO. Also possible: controller cache policy or rebuild/degraded RAID.

Fix: Confirm array health, check read-ahead, and isolate sequential workloads from random writers. Do not “fix” this by deepening queues blindly.

Checklists / step-by-step plan

Checklist A: “Disk is slow” incident response (15–30 minutes)

  1. Confirm the device path: use findmnt and lsblk so you’re tuning the right thing.
  2. Check for errors: scan dmesg for timeouts, resets, or IO errors.
  3. Measure live latency/queueing: iostat -x 1 on the relevant device during the problem.
  4. Identify top IO processes: pidstat -d 1. Decide if workload control is the real fix.
  5. Establish whether this is read or write pain: compare r_await vs w_await, and consider fsync-heavy workloads.
  6. Take one baseline benchmark (if safe): a short fio run that matches the suspected pattern.
  7. Only then test scheduler/queue changes: one knob at a time, measure, note results.

Checklist B: Controlled tuning plan (change management friendly)

  1. Pick a representative workload: define block size, read/write mix, concurrency, and duration.
  2. Capture baseline: store fio output, iostat snapshots, and kernel version.
  3. Define success criteria: e.g., p99 write latency under X ms at Y IOPS; not just “feels faster”.
  4. Test scheduler candidates: none vs mq-deadline (and kyber if you know why).
  5. Test queue depth range: vary fio --iodepth and job count; do not jump straight to extremes.
  6. Validate under contention: run a second workload concurrently if production usually has mixed IO.
  7. Make persistent changes: once proven, implement via udev rule or appropriate config and document it.
  8. Monitor after rollout: watch p95/p99, not only averages, for at least one business cycle.

Making scheduler changes persistent (without surprises)

Temporary echo changes vanish on reboot. For persistence, udev rules are common. The exact match criteria depend on your hardware naming. Test on staging first.

cr0x@server:~$ sudo tee /etc/udev/rules.d/60-ioscheduler.rules >/dev/null <<'EOF'
ACTION=="add|change", KERNEL=="nvme0n1", ATTR{queue/scheduler}="mq-deadline"
EOF
cr0x@server:~$ sudo udevadm control --reload-rules
cr0x@server:~$ sudo udevadm trigger --name-match=nvme0n1

Decision: If you can’t confidently match devices (because names change), match by attributes like ID_MODEL or WWN. The goal is stable configuration, not a boot-time roulette.
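
If you’d rather not pin the rule to nvme0n1, a broader match is common. A sketch that targets every NVMe disk (not partitions); adjust the scheduler to whatever your testing actually picked:

cr0x@server:~$ sudo tee /etc/udev/rules.d/60-ioscheduler.rules >/dev/null <<'EOF'
# Apply to whole NVMe disks only; partitions have no queue/scheduler attribute
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="nvme*", ENV{DEVTYPE}=="disk", ATTR{queue/scheduler}="none"
# Alternative: match any non-rotational disk instead of NVMe naming
# ACTION=="add|change", SUBSYSTEM=="block", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
EOF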

FAQ

1) Should I use none or mq-deadline on NVMe?

Start with none for pure NVMe local disks, then test mq-deadline if you care about tail latency under mixed load. Pick the one that improves p99 for your real workload.

2) What queue depth should I set?

There isn’t a single number. For NVMe, deeper queues can improve throughput. For shared arrays/cloud volumes, too much depth often increases p99 latency. Tune by measuring: run fio at iodepth 1, 4, 16, 32, 64 and graph latency percentiles.
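
A minimal sweep sketch, read-only against a raw device; paths, runtimes, and the jq path are examples and may differ slightly across fio versions:

cr0x@server:~$ for qd in 1 4 16 32 64; do sudo fio --name=qd$qd --filename=/dev/nvme0n1 \
      --direct=1 --ioengine=io_uring --rw=randread --bs=4k --iodepth=$qd \
      --time_based=1 --runtime=60 --output-format=json --output=qd$qd.json; done
cr0x@server:~$ for f in qd*.json; do printf '%s p99(usec): ' "$f"; \
      jq '.jobs[0].read.clat_ns.percentile."99.000000" / 1000' "$f"; done

Plot p99 against iodepth and stop where latency starts climbing faster than IOPS.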

3) Why does %util show 100% but throughput is low?

On fast devices or when queueing is heavy, %util can reflect “busy waiting” and queueing delay rather than productive throughput. Look at await and aqu-sz to interpret it.

4) Does changing scheduler help on hardware RAID?

Sometimes a little, often not much. Hardware RAID controllers and SANs do their own scheduling and caching. In those cases, focus on controller cache policy, stripe size alignment, array health, and HBA queue depth.

5) My app is slow, but raw device fio is fast. Now what?

The bottleneck is above the device: filesystem journaling, fsync behavior, small sync writes, metadata contention, or the application’s IO pattern. Run filesystem-based fio tests and inspect mount options and writeback behavior.

6) Is bfq ever appropriate on servers?

Rarely, but not never. If you need fairness between competing IO sources (multi-tenant systems) and can afford overhead, it can help. For most server workloads, start with none or mq-deadline and only deviate with evidence.

7) Should I mount ext4 with discard for SSD performance?

Usually no. Prefer periodic trim (fstrim.timer) for predictable overhead. Continuous discard can add latency variance.

8) Can I “fix” slow disk by increasing read-ahead?

Only if the workload is genuinely sequential reads and the current read-ahead is too small to keep the device busy. For random workloads (databases), raising read-ahead often wastes cache and makes things worse. Measure page cache hit rate and workload pattern before touching it.
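
Checking and adjusting read-ahead is cheap and reversible. A sketch; the value is per-device and does not persist across reboot:

cr0x@server:~$ cat /sys/block/nvme0n1/queue/read_ahead_kb    # current read-ahead in KiB
cr0x@server:~$ sudo blockdev --getra /dev/nvme0n1            # the same setting expressed in 512-byte sectors
cr0x@server:~$ sudo blockdev --setra 512 /dev/nvme0n1        # example: 512 sectors = 256 KiB; measure before keeping it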

9) Why does performance change after reboot?

You lost non-persistent tunings, the device name changed, or udev/tuned applied different policies. Make changes persistent and verify after reboot with the same benchmark.

10) Is io_uring always the best fio engine?

It’s a good default on modern kernels, but not universal. For some environments or drivers, libaio is more comparable to legacy apps. Use the engine that matches your production IO stack.

Conclusion: next steps that survive change control

Ubuntu 24.04 isn’t secretly sabotaging your disks. Most “slow disk” problems boil down to three things: you’re measuring the wrong layer, you’re queueing too much (or too little), or the device/backend is unhealthy and tuning is a distraction.

Do this next, in order:

  1. Capture evidence: iostat -x, pidstat -d, and dmesg snippets during the slowdown.
  2. Baseline with fio: one raw-device test and one filesystem test that resemble production.
  3. Test scheduler choices: none vs mq-deadline is the pragmatic starting point. Verify with percentiles.
  4. Tune concurrency/queue depth deliberately: adjust workload parallelism first; touch kernel queue knobs only if you can explain why.
  5. Make it persistent and documented: if the fix isn’t reproducible after reboot, it’s not a fix; it’s a demo.

The goal isn’t to win a benchmark. It’s to make the system predictable under the worst day you’re definitely going to have.
