You bought fast flash. You moved the workload. You stared lovingly at the vendor’s “up to” numbers.
And then production said: “Cute. I’ll do 40 MB/s and 30 ms latency.”
When SSDs are slow, people blame the filesystem, the app, “the network somehow,” or the phase of the moon.
Often the real culprit is boring: the wrong driver path, the wrong queueing mode, or a compatibility layer quietly taking control.
You can’t tune your way out of the wrong stack.
What “SSD is slow” really looks like in production
SSD slowness isn’t usually a single number. It’s a personality. It shows up as:
tail latency that turns into page timeouts, perfectly average throughput with random 200 ms stalls,
a database that “works fine” until a checkpoint, or a message queue that becomes an accidental batch processor.
The scary part: the drive might be fine. The PCIe link might be fine. The NAND might be fine.
But if the OS is talking to your storage through the wrong driver, wrong transport, or wrong queueing model,
you can end up with a Ferrari chassis being pushed by a shopping cart wheel.
What makes this class of problem nasty is how plausible the wrong explanations are. Filesystems can be slow.
Apps can be slow. Cloud volumes can be slow. Sure. But “wrong driver path” is the one that hides in plain sight,
because things still function. They’re just… depressingly functional.
Interesting facts and history (because storage is petty and remembers)
- NVMe wasn’t just “faster SATA.” It was designed for low-latency, parallel command submission with many queues; SATA/AHCI assumed a single queue mindset.
- AHCI’s default queue depth is tiny by modern standards. NCQ exists, but the architecture was built around spinning disks and modest concurrency.
- Linux’s multi-queue block layer (blk-mq) was a turning point. It helped scale IO submission across CPUs, but also introduced new tuning surfaces and new foot-guns.
- IO schedulers didn’t go away—some just became irrelevant. For NVMe, “none” is often right; for some SATA SSDs behind HBAs, a scheduler can still matter.
- Write cache policy has been fighting admins for decades. “It’s faster” and “it’s safe” are frequently opposing religions.
- TRIM/DISCARD is not a free lunch. Online discard can create bursts of latency depending on device firmware and kernel behavior; periodic fstrim often behaves better.
- SCSI emulation layers are everywhere. Many hypervisors and storage appliances expose “SCSI disks” even when the backend is SSD/NVMe; this can cap queueing and complicate tuning.
- Multipath can be a performance feature or a performance tax. A single mis-set policy can turn parallel paths into serialized sadness.
- PCIe link width/speed negotiation failures are more common than people admit. One dirty edge connector can “upgrade” your NVMe into a very expensive SATA device.
Fast diagnosis playbook (first/second/third)
You don’t have time for a spiritual journey through kernel subsystems. You want signal, quickly.
Here’s the order that tends to catch the biggest “SSD is slow” driver mistakes with the least effort.
First: confirm what the kernel thinks the device is
- Is it actually nvme, or is it showing up as sdX through SCSI translation?
- Is it behind hardware RAID/HBA in a mode you didn’t intend?
- Is multipath in play without you realizing it?
Second: confirm the link and queueing basics
- PCIe link width and speed (NVMe): x4 vs x1, Gen4 vs Gen1 matters.
- Queue count and queue depth: one queue can make a many-core box cry.
- IRQ distribution: one CPU handling all NVMe completions is a classic “why are we at 20% CPU and still slow?” moment.
Third: measure latency properly before tuning
- Use iostat -x for utilization and average latency.
- Use nvme smart-log (NVMe) or smartctl (SATA/SAS) for device-side hints.
- Run a controlled fio job to see if the slowness is workload-specific or systemic.
Only after that do you touch schedulers, nr_requests, read-ahead, filesystem mount flags, or DB knobs.
If the driver stack is wrong, tuning is just decorating the problem.
The storage driver mistake behind it
The mistake comes in a few flavors, but they rhyme: the OS is not using the direct, intended driver for the hardware,
or it’s using the right driver with the wrong queueing/interrupt model due to defaults or legacy configuration.
Flavor 1: “It’s NVMe” (but presented as SCSI)
In bare metal, a true NVMe device shows up as /dev/nvme0n1 and uses the nvme driver.
In many environments—especially virtualized and some storage appliances—you’ll see a fast backend presented as a SCSI disk:
/dev/sda, driver chain through virtio-scsi, mpt3sas, device-mapper, or multipath.
Sometimes that’s fine. Sometimes it quietly limits queueing, changes completion behavior, or introduces a single bottleneck queue.
The backend can do 800k IOPS. Your guest can submit 30k. Everyone blames the SSD.
Flavor 2: AHCI/SATA path used when an NVMe path exists
This happens with weird BIOS settings, unusual carrier boards, or devices that can operate in multiple modes.
A drive that should be on a PCIe/NVMe path ends up routed through a SATA controller or a translation layer.
It works. It’s also leaving performance on the table, especially under concurrency.
Flavor 3: blk-mq/queues/IRQs misaligned with CPU topology
NVMe is designed for parallelism. Linux can do parallelism. But you can still get:
one queue, one IRQ, one CPU, and 63 other cores watching it all happen.
Or you can get interrupt storms pinned to CPU0 because of legacy irqbalance settings, old initramfs rules,
or someone “optimized interrupts” in 2019 and forgot about it.
Flavor 4: Device-mapper layers doing more than you think
LVM, dm-crypt, dm-multipath, MD RAID—these can be perfectly fine. They can also be configured into a corner:
wrong chunk size, wrong scheduler, wrong write cache policy, wrong path selector.
The performance profile becomes “fine in benchmarks, sad in production,” because real IO isn’t as polite.
Here’s the operational rule: before you blame the SSD, trace the IO path end-to-end and confirm the driver and queueing model at each layer.
One paraphrased idea from W. Edwards Deming fits storage work painfully well: without data, you’re just another person with an opinion.
Practical tasks: commands, outputs, and decisions (12+)
These are the checks I actually run when an SSD is “slow.” Each task includes: a command, what the output means, and the decision you make.
Run them as root where needed.
Task 1: Identify the device and transport (NVMe vs SCSI/SATA)
cr0x@server:~$ lsblk -o NAME,MODEL,SIZE,ROTA,TYPE,TRAN,MOUNTPOINTS
NAME MODEL SIZE ROTA TYPE TRAN MOUNTPOINTS
nvme0n1 SAMSUNG MZVL21T0HCLR 1.0T 0 disk nvme
├─nvme0n1p1 1G 0 part nvme /boot
└─nvme0n1p2 1023G 0 part nvme /
Meaning: TRAN nvme and ROTA 0 is what you want for a local NVMe SSD.
If you see sda with TRAN sas or blank transport on what you thought was NVMe, you’re likely in a translated path.
Decision: If transport isn’t what you expect, stop tuning and map the IO path (Tasks 2–4).
Task 2: See what driver the kernel bound to the device
cr0x@server:~$ readlink -f /sys/block/nvme0n1/device/driver
/sys/bus/pci/drivers/nvme
Meaning: This confirms the kernel driver. For SATA devices you’ll see AHCI-related PCI drivers; for HBAs you’ll see mpt3sas/megaraid_sas, etc.
Decision: If the driver is not what you expected, investigate BIOS/controller mode and cabling/backplane.
Task 3: For SCSI disks, identify the HBA/virtual controller
cr0x@server:~$ lsscsi -t
[0:0:0:0] disk ATA SAMSUNG SSD 860 3B6Q /dev/sda
\_ transport: sata: phy-0:0:0
Meaning: This tells you what stack is presenting the disk. In VMs, you might see virtio or a vendor SCSI controller.
Decision: If you’re on a legacy emulated controller, switch to virtio-blk/virtio-scsi or NVMe virtual device if supported.
Task 4: Check for device-mapper layers (LVM, crypt, multipath)
cr0x@server:~$ lsblk -o NAME,TYPE,FSTYPE,SIZE,MOUNTPOINTS
NAME TYPE FSTYPE SIZE MOUNTPOINTS
sda disk 1.8T
└─mpatha mpath 1.8T
└─vg0-lvdata lvm xfs 1.8T /data
Meaning: mpath indicates dm-multipath. That’s not “bad,” but it’s a whole policy engine between you and the SSD.
Decision: If multipath is present, verify path policy and queueing (Task 10). If it’s accidental, remove it carefully.
Task 5: Confirm PCIe link width/speed (NVMe)
cr0x@server:~$ lspci -s 01:00.0 -vv | sed -n '/LnkCap:/,/LnkSta:/p'
LnkCap: Port #0, Speed 16GT/s, Width x4
LnkSta: Speed 8GT/s (downgraded), Width x1 (downgraded)
Meaning: The device can do Gen4 x4, but negotiated Gen3 x1. That can absolutely flatten throughput and inflate latency under load.
Decision: Reseat the drive, check backplane, BIOS settings, riser quality, and firmware. Don’t waste time on software tuning until this is fixed.
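Newer pciutils annotate the downgrade for you, which makes this checkable at provisioning time. A minimal sketch, assuming you capture lspci -vv output per host; check_link is a hypothetical helper, and the sample text simply replays the output above:

```shell
# Minimal sketch: flag a PCIe link that negotiated below its capability.
# In real use, pipe `lspci -s <addr> -vv` into check_link; newer pciutils
# print "(downgraded)" on LnkSta when the link fell below LnkCap.
check_link() {
  if grep -q 'downgraded'; then echo "DOWNGRADED"; else echo "OK"; fi
}

sample='LnkCap: Port #0, Speed 16GT/s, Width x4
LnkSta: Speed 8GT/s (downgraded), Width x1 (downgraded)'
printf '%s\n' "$sample" | check_link   # prints DOWNGRADED
```

On older pciutils that don’t print the annotation, compare LnkCap and LnkSta fields by hand instead.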
Task 6: Verify NVMe controller features and namespace info
cr0x@server:~$ nvme id-ctrl /dev/nvme0 | egrep -i 'mn|fr|mdts|oacs|sqes|cqes'
mn : SAMSUNG MZVL21T0HCLR
fr : GXA7401Q
mdts : 9
oacs : 0x17
sqes : 0x66
cqes : 0x44
Meaning: Firmware revision matters; so do limits like MDTS (max data transfer size). A tiny MDTS can cap IO size efficiency.
Decision: If firmware is old or known-problematic in your fleet, schedule an update window. If MDTS is small, watch workloads with large IOs.
Task 7: Check queue count and scheduler for a block device
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
Meaning: For most NVMe, none is appropriate; it relies on the device’s internal scheduling and avoids extra kernel overhead.
For some devices/workloads, mq-deadline can stabilize latency.
Decision: If you’re on bfq for a server workload, that’s suspicious. Use none or mq-deadline unless you have a tested reason.
Task 8: Check queue depth limits (nr_requests, max_sectors_kb)
cr0x@server:~$ cat /sys/block/nvme0n1/queue/nr_requests
64
Meaning: This is the request queue depth at the block layer. Too low can throttle concurrency; too high can inflate latency and memory usage.
Decision: If you see very low values on a busy NVMe device, investigate why (udev rules, tuned profiles). Change only after measuring with fio.
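When auditing a fleet, a tiny filter makes the low-value case easy to spot. A sketch, assuming you feed it the raw sysfs value; the 256 threshold is an illustrative assumption for busy NVMe, not a kernel default:

```shell
# Sketch of a fleet check: flag suspiciously low nr_requests values.
# Threshold (256) is an illustrative assumption, not a standard.
flag_low_queue() {
  awk -v min=256 '{ print ($1 + 0 < min) ? "LOW" : "OK" }'
}

# Real use: cat /sys/block/nvme0n1/queue/nr_requests | flag_low_queue
echo 64 | flag_low_queue   # prints LOW
```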
Task 9: Look at real-time device latency and utilization
cr0x@server:~$ iostat -x 1 3
Linux 6.5.0 (server) 02/04/2026 _x86_64_ (64 CPU)
Device r/s w/s rMB/s wMB/s await svctm %util
nvme0n1 1200 800 90.0 110.0 18.5 0.4 98.0
Meaning: %util near 100% with svctm low but await high hints at queueing: the device is fast, but requests are waiting.
That can be normal under heavy load—but it can also mean you’ve got a single-queue bottleneck upstream.
Decision: If await is high and throughput is unimpressive, dig into queues/IRQs and any DM layers.
Task 10: Check dm-multipath policy and path health
cr0x@server:~$ multipath -ll
mpatha (3600508b400105e210000900000490000) dm-2 NETAPP,LUN
size=1.8T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 3:0:0:1 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
`- 4:0:0:1 sdc 8:32 active ready running
Meaning: Policy and priority decide how IO is distributed. Some policies effectively serialize IO to one path unless configured.
Decision: If you expected load balancing but see one path doing all the work, fix the multipath config (and validate with iostat on sdb/sdc).
Task 11: Check NVMe IRQ distribution (the “CPU0 is sad” test)
cr0x@server:~$ grep -i nvme /proc/interrupts | head
45: 98234120 1023 0 0 0 0 0 0 IR-PCI-MSI 524288-edge nvme0q0
46: 120 98220110 0 0 0 0 0 0 IR-PCI-MSI 524289-edge nvme0q1
Meaning: You want interrupts spread across CPUs reasonably. If all counts pile onto one CPU, latency can spike under load.
Decision: If skewed, check irqbalance, CPU affinity, and whether the system is pinning interrupts due to old tuning.
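To turn “eyeball /proc/interrupts” into a number, sum the per-CPU columns and report the largest single-CPU share. A minimal sketch; irq_skew is a hypothetical helper, and the sample lines replay Task 11 with both queues piled onto one CPU:

```shell
# Sketch: estimate completion-interrupt skew from a /proc/interrupts snippet.
# Real use: grep -i nvme /proc/interrupts | irq_skew
# Numeric fields after the IRQ number are per-CPU counts; the first
# non-numeric field ends the count columns.
irq_skew() {
  awk '{ for (i = 2; i <= NF; i++) {
           if ($i + 0 == $i) { cpu[i] += $i; total += $i } else break } }
       END { max = 0; for (i in cpu) if (cpu[i] > max) max = cpu[i];
             printf "%d%%\n", (total ? 100 * max / total : 0) }'
}

printf '45: 98234120 1023 0 0 nvme0q0\n46: 98220110 120 0 0 nvme0q1\n' | irq_skew
# prints 99%  (almost everything lands on one CPU)
```

Anything near 100% on a multi-queue NVMe device is the “CPU0 is sad” pattern; with N queues spread properly, expect roughly 100/N percent per CPU.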
Task 12: Validate TRIM/discard behavior
cr0x@server:~$ findmnt -no TARGET,OPTIONS /
/ rw,relatime,discard
Meaning: Online discard is enabled. That can be fine, or it can create latency spikes depending on drive and kernel.
Decision: If you see periodic latency storms correlating with deletes, consider removing discard and using scheduled fstrim.
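If you move to scheduled TRIM, most distributions ship an fstrim.timer unit with util-linux, and enabling it is usually the whole job. An ops sketch; verify the timer’s schedule fits your delete patterns, since weekly is a common default, not a law:

```shell
# Ops sketch: switch from online discard to scheduled TRIM.
# 1) Remove "discard" from the mount options in /etc/fstab (not shown).
# 2) Enable the periodic timer shipped with util-linux:
systemctl enable --now fstrim.timer
# 3) Confirm when it last ran and what it trimmed:
journalctl -u fstrim.service | tail
```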
Task 13: Confirm discard support and alignment
cr0x@server:~$ lsblk -D -o NAME,DISC-GRAN,DISC-MAX,DISC-ZERO
NAME DISC-GRAN DISC-MAX DISC-ZERO
nvme0n1 512B 2T 0
Meaning: Discard granularity and max matter. If discard is unsupported (zeros), fstrim won’t help and “discard” mount options are pointless.
Decision: If unsupported, stop thinking TRIM will save you. Look at overprovisioning, write amplification, and workload patterns instead.
Task 14: Run a controlled fio latency test (don’t benchmark your database volume live)
cr0x@server:~$ fio --name=randread --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring --iodepth=32 --rw=randread --bs=4k --numjobs=4 --time_based=1 --runtime=20 --group_reporting
randread: (groupid=0, jobs=4): err= 0: pid=1234: Tue Feb 4 12:00:00 2026
read: IOPS=320k, BW=1250MiB/s (1310MB/s)(24.4GiB/20s)
lat (usec): min=45, max=2200, avg=120.3, stdev=35.7
Meaning: This gives you baseline random read IOPS and latency. If this is terrible on raw device, it’s not your filesystem.
Decision: If fio shows the device is fast but your app is slow, look upward (filesystem, page cache, application IO pattern).
If fio is slow too, keep digging in the driver/queue/PCIe path.
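If the application is fsync-bound, a raw random-read test like the one above will not predict its behavior. A jobfile sketch that mimics fsync-heavy OLTP writes; the filename is a placeholder, so point it at scratch space, never a live volume:

```ini
; Sketch of an fsync-sensitive fio job (run as: fio oltp-sync.fio).
; /mnt/scratch/fio.test is a placeholder path on non-production scratch space.
[global]
filename=/mnt/scratch/fio.test
size=256M
direct=1
ioengine=io_uring
time_based=1
runtime=30
group_reporting=1

[oltp-sync-write]
rw=randwrite
bs=4k
iodepth=1
fsync=1
```

Compare its p99 against Task 14’s numbers: a device that posts great iodepth=32 random reads can still be mediocre at iodepth=1 synchronous writes.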
Task 15: Check for write cache policy (SATA/SAS)
cr0x@server:~$ hdparm -W /dev/sda
/dev/sda:
write-caching = 1 (on)
Meaning: Write cache on can improve performance but depends on drive power-loss protection and your risk tolerance.
Decision: If you’re running without PLP and you care about durability, keep it conservative. If you do have PLP and the cache is off, turning it on can be a safe win.
Joke #1: Storage is the only place where “it’s working” can mean “it’s quietly ruining your quarter.”
Three corporate mini-stories (anonymous, plausible, and technically accurate)
Mini-story 1: The incident caused by a wrong assumption
A team migrated a busy metrics cluster from older SATA SSDs to shiny NVMe. They did the sensible stuff:
cloned the OS images, moved mounts, validated SMART, ran a quick fio test in staging. Numbers looked great.
Then production hit 9 a.m. traffic and the alerting system started paging for “disk latency > 50ms.”
The assumption was simple: “NVMe is NVMe.” The servers had M.2 drives, but they were installed in a carrier board
that could route those devices either as PCIe NVMe or as SATA, depending on a BIOS setting that had been inherited
from a previous hardware revision. Half the fleet negotiated PCIe; half showed up as SATA devices under AHCI.
The symptom pattern was confusing: some nodes were fine, some were awful, and the workload was “evenly balanced.”
The load balancer was doing its job. It was just balancing users across two different storage worlds.
The fix wasn’t a clever sysctl. It was a spreadsheet: serial numbers mapped to BIOS profiles, a maintenance window,
and a strict “validate transport in provisioning” gate. After the change, latency normalized immediately,
and the team got a reminder that hardware defaults are not contracts.
Mini-story 2: The optimization that backfired
A performance-minded engineer noticed that the IO scheduler on several database hosts wasn’t set consistently.
They pushed a change via udev rules to force mq-deadline everywhere “for predictable latency.”
It tested fine on a small subset, and the rollout continued.
Two days later, the OLTP workload started showing periodic stalls. Not constant slowness—worse.
Every few minutes, p99 latency would spike, replication would lag, and then everything would recover.
The metrics looked like the system was doing interval training.
The root cause was not that mq-deadline is bad. It was that the udev rule applied to device-mapper nodes
and the underlying NVMe namespaces inconsistently. Some volumes ended up with scheduler settings fighting each other
across layers, and the IO patterns interacted with a bursty TRIM behavior on one model of drive.
The rollback restored stability. The lasting fix was more boring: set scheduler only on the physical devices,
avoid setting it on dm-* nodes, and validate discard strategy separately. Also: never “standardize” performance knobs
without verifying the device classes you’re standardizing across.
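That “set the scheduler only on the physical devices” rule can be captured in a udev rule. A sketch with an illustrative file name; the KERNEL match hits NVMe namespaces only, so dm-* and md* nodes are left alone:

```
# /etc/udev/rules.d/60-io-scheduler.rules (illustrative path and name)
# Matches physical NVMe namespaces only; dm-* and md* nodes are never touched.
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"
```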
Mini-story 3: The boring but correct practice that saved the day
A storage-heavy service ran on VMs with a mix of local NVMe and network-backed volumes.
The SRE team had a deployment checklist item that everyone mocked: “Record the block device model, transport, and driver binding.”
It looked like paperwork. It was, objectively, paperwork.
One week, a host OS image update changed the default virtual storage controller for a subset of VMs.
They still booted. They still mounted. They still passed application health checks.
But tail latency got worse, and the service started dropping requests under peak load.
Because they had the boring inventory from before, they could diff “known good” vs “now” quickly:
same volume, same hypervisor, different controller and queue behavior. The team didn’t spend two days
blaming the database. They escalated to the virtualization team with evidence.
The fix was to revert the controller choice, then test the new controller properly with the real workload profile.
The checklist item survived, and the people who mocked it quietly stopped mocking it.
Joke #2: The fastest way to improve storage performance is to stop “optimizing” it on a Friday.
Common mistakes: symptoms → root cause → fix
1) Symptom: NVMe drive benchmarks fast, app still slow
Root cause: The app is hitting a different path (dm-crypt, network filesystem, multipath) than your benchmark, or it’s bottlenecked on fsync/journaling patterns.
Fix: Benchmark the actual block device used by the filesystem (follow lsblk chain), then test with fio patterns matching fsync and queue depth.
2) Symptom: Throughput capped at suspiciously low numbers, CPU mostly idle
Root cause: Device negotiated PCIe x1/low Gen, or you’re in AHCI/SATA mode, or a single submission/completion queue is limiting parallelism.
Fix: Check lspci -vv link status, confirm TRAN in lsblk, verify IRQ distribution and queue count.
3) Symptom: High await, low svctm, %util near 100%
Root cause: Requests are waiting in software queues, often because queue depth is constrained or interrupts are pinned poorly.
Fix: Inspect /sys/block/*/queue/nr_requests, NVMe queue settings, and /proc/interrupts. Fix affinity/irqbalance first.
4) Symptom: Random latency spikes after deletions or compactions
Root cause: Online discard/TRIM causing synchronous work, or firmware GC interacting with your workload.
Fix: Remove discard mount option; schedule fstrim. Confirm discard support with lsblk -D.
5) Symptom: Multipath device slower than a single path
Root cause: Multipath policy not balancing, path priorities uneven, or queueing behavior like queue_if_no_path causing stalls.
Fix: Validate with multipath -ll. Set appropriate path selector (e.g., service-time) and verify both paths carry IO.
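Per-path iostat can be summarized into a quick balance check. A sketch, assuming you feed it “device reads-per-second” pairs collected from iostat; path_balance is a hypothetical helper:

```shell
# Sketch: check whether both multipath paths actually carry IO, from saved
# per-path iostat counters. Input layout ("DEV r/s" pairs) is an assumption.
path_balance() {
  awk '{ rs[$1] = $2; total += $2 }
       END { for (d in rs)
               printf "%s %.0f%%\n", d, (total ? 100 * rs[d] / total : 0) }' | sort
}

printf 'sdb 4800\nsdc 12\n' | path_balance
# prints:
# sdb 100%
# sdc 0%
```

A 100/0 split under a policy you expected to balance is the “serialized sadness” case: fix the path selector, then re-measure.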
6) Symptom: “SSD is slow” only inside VMs
Root cause: Emulated controller, wrong paravirtual driver, low queue limits, or host-side throttling.
Fix: Switch to virtio (or virtual NVMe) and raise queue settings where appropriate; verify guest sees expected transport/queues.
7) Symptom: Writes are dramatically slower than reads, even on fresh drives
Root cause: Write cache disabled, PLP assumptions wrong, or drive in a conservative power state; sometimes a RAID controller forces write-through.
Fix: Check cache policy (hdparm -W / controller tools), verify power states, and ensure PLP-capable drives if enabling write-back.
8) Symptom: Performance regressed after “kernel upgrade”
Root cause: Changed defaults: scheduler selection, io_uring behavior, nvme_core parameters, or udev rules overridden by tuned profiles.
Fix: Diff /sys/block/*/queue values, check tuned/udev, confirm driver binding remains the same.
Checklists / step-by-step plan
Step-by-step: diagnose and fix the wrong-driver slowdown safely
1) Inventory the device chain. Use lsblk and confirm whether the filesystem sits on raw NVMe, dm-crypt, LVM, md, or multipath.
2) Confirm driver binding. Use readlink -f /sys/block/DEVICE/device/driver. If it’s not what you expect, stop and identify why.
3) Confirm transport and controller model. Use lsblk -o TRAN, lsscsi -t, and lspci for the device controller.
4) Validate PCIe link (NVMe). Use lspci -vv and look for downgraded speed/width. Fix hardware negotiation issues before software tuning.
5) Check queueing and scheduler defaults. Read /sys/block/*/queue/scheduler and nr_requests. Avoid “fleet-wide” overrides unless device classes are consistent.
6) Check IRQ distribution. Use /proc/interrupts. If one core takes all completions, fix affinity and confirm irqbalance behavior.
7) Measure before touching knobs. Use iostat -x and a safe fio test. Capture baseline p50/p95/p99 latency.
8) Change one thing at a time. Driver path/controller mode changes require maintenance windows. Scheduler/affinity changes can often be done live, but validate carefully.
9) Verify after changes. Repeat fio and iostat checks, then validate with application SLOs (tail latency, queue time, error rates).
10) Make it repeatable. Bake validation into provisioning: transport, driver binding, PCIe link, and queue settings should be checked automatically.
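The provisioning gate can be as small as one awk comparison against your inventory. A sketch with illustrative device names; expect_tran is a hypothetical helper reading “NAME TRAN” pairs such as lsblk -dn -o NAME,TRAN produces:

```shell
# Sketch of a provisioning gate: compare observed transport against the
# inventory's expectation. Device names and expected values are illustrative.
expect_tran() {  # usage: expect_tran DEVICE EXPECTED_TRAN  (reads "NAME TRAN" lines)
  awk -v dev="$1" -v want="$2" '
    $1 == dev { print ($2 == want) ? "PASS" : "FAIL"; found = 1 }
    END { if (!found) print "MISSING" }'
}

# Real use: lsblk -dn -o NAME,TRAN | expect_tran nvme0n1 nvme
printf 'nvme0n1 nvme\nsda sata\n' | expect_tran nvme0n1 nvme   # prints PASS
```

Fail the build on FAIL or MISSING and the “half the fleet is secretly SATA” incident from mini-story 1 becomes a provisioning error instead of a 9 a.m. page.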
Operational checklist: what to capture in a ticket
- lsblk -o NAME,MODEL,SIZE,ROTA,TYPE,TRAN,MOUNTPOINTS
- readlink -f /sys/block/DEVICE/device/driver
- lspci -vv snippet showing link state (NVMe)
- iostat -x 1 10 during the incident window
- /proc/interrupts filtered for storage interrupts
- Multipath output if present: multipath -ll
- TRIM/discard state: findmnt -no OPTIONS, lsblk -D
FAQ
1) Is the “wrong storage driver” problem mostly a Linux thing?
Linux makes it visible because you can inspect the stack. But the class of problem exists everywhere:
wrong controller mode, emulation layers, queue caps, and policy engines can hurt any OS.
2) If my disk is /dev/sda, does that mean it’s not NVMe?
On bare metal, yes: NVMe namespaces show as /dev/nvme*. In VMs or behind some controllers, fast backend storage can still appear as SCSI disks (/dev/sd*).
The key is to identify the actual transport and driver chain.
3) Should I always use IO scheduler none for SSDs?
For NVMe, none is often right. For SATA SSDs or when you have layering (dm-crypt, md, multipath), mq-deadline can smooth latency.
Don’t guess—measure on your workload.
4) Why does changing queue depth sometimes make latency worse?
Deeper queues can raise throughput but increase waiting time for any single IO, especially under contention.
If your app cares about tail latency, “more queue” can be self-harm.
5) Is online discard bad practice?
Not universally. It depends on device firmware, kernel version, and workload. Many production setups prefer periodic fstrim
because it controls when discard work happens.
6) My NVMe link shows “downgraded.” Is it always hardware?
Usually. BIOS settings, bad risers, marginal backplanes, poor seating, or power issues can cause link negotiation to fall back.
Software won’t fix a physical layer negotiation problem.
7) Could encryption (dm-crypt) be the “driver mistake”?
Encryption isn’t a mistake, but it is a layer with its own queueing and CPU costs. If you benchmark raw NVMe and declare victory,
then mount an encrypted volume and wonder where the IOPS went, that’s a measurement mistake.
8) How do I know if multipath is helping or hurting?
If you have redundant paths, multipath is worth having. But validate that it’s distributing IO as intended and not serializing onto one path.
Use multipath -ll plus per-path iostat.
9) What’s the safest “first change” when I suspect a driver/queue issue?
Start with observation: transport, driver binding, link state, IRQ distribution, scheduler. The safest change is often fixing IRQ imbalance
(affinity/irqbalance), because it’s reversible and doesn’t change on-disk format.
10) Can firmware alone make an SSD slow?
Yes—especially with power state handling, GC behavior, and corner-case bugs. But treat firmware as part of the stack:
don’t update randomly, and don’t ignore it when all the OS-level checks are clean.
Conclusion: next steps you can actually do
When an SSD is slow “for no reason,” assume there is a reason—you just haven’t found the layer that’s lying to you yet.
The usual suspect isn’t the NAND. It’s the driver path and queueing model your OS ended up with, often due to defaults,
legacy settings, or virtualization choices.
Do this next:
- On one affected host, capture lsblk output, driver binding, and (for NVMe) PCIe link status.
- Check IRQ distribution and queue/scheduler settings; fix obvious pinning and mismatches.
- Run a controlled fio test on the actual device path the app uses (not the one you wish it used).
- If the device is negotiated down (PCIe) or presented through an unintended controller mode, schedule the hardware/BIOS fix—don’t “tune around” it.
- Codify the checks into provisioning so you never debug this twice.