You ran fio on Debian 13, got heroic numbers, and declared victory. Then production fell over during the first reindex, backup, or compaction, and the “fast” disks suddenly behaved like a shared network drive from 2008.
This is normal. Not because Linux storage is bad—because benchmarks are easy to accidentally rig. fio doesn’t lie; we lie to ourselves with wrong assumptions, convenient defaults, and tests that don’t match the workload.
What fio really measures (and what it doesn’t)
fio is a load generator. It submits I/O with specific patterns (random vs sequential, read vs write, mixed, block sizes, queue depth, sync modes) and reports what happened. That’s it. It doesn’t know if you tested the page cache instead of the disk. It doesn’t know your SSD is about to thermal throttle. It doesn’t know your RAID controller is acknowledging writes that are still living in volatile cache. It doesn’t know that your database does 4k random reads but your benchmark did 1M sequential writes.
To benchmark storage without fooling yourself, you need to answer three questions up front:
- What are you trying to predict? Peak throughput? P99 latency under concurrency? Recovery time? Tail behavior during write bursts?
- What is the unit of truth? IOPS is not truth if your users care about latency. Throughput is not truth if your bottleneck is small-block sync writes.
- What system is under test? Disk alone, OS stack, filesystem, volume manager, encryption, network storage, controller cache, and CPU scheduling all matter. Decide what belongs in scope.
Benchmarks are not supposed to “be fair.” They are supposed to be predictive. If your test doesn’t match the production I/O shape, it’s a confidence generator, not a benchmark.
Interesting facts and context (the stuff that makes today’s mistakes predictable)
- Linux page cache has been a “benchmark amplifier” since the 1990s. Buffered I/O can measure RAM speed and look like disk speed if you don’t force direct I/O or exceed memory.
- SSD performance depends on internal state. Fresh-out-of-box NAND behaves differently than steady-state after sustained writes; preconditioning became a formal test concept because early SSD reviews were basically fantasy.
- Queue depth was popularized by enterprise SANs. A lot of “QD32” folklore comes from controller-era tuning; it’s not automatically relevant for low-latency NVMe and modern apps with limited concurrency.
- NVMe changed the CPU side of storage. I/O submission/completion paths became cheaper and more parallel, so CPU pinning and IRQ distribution can become the “disk bottleneck” even when media is fine.
- Write caching has always been controversial. Controllers acknowledging writes before they’re durable gave the world great benchmarks and the occasional unforgettable outage; battery-backed cache was the compromise.
- Filesystems historically optimized for different failure modes. ext4’s defaults aim for broad safety; XFS often shines with parallel throughput; copy-on-write systems trade certain latencies for snapshots and checksums.
- Alignment issues are older than SSDs. Misaligned partitions hurt on RAID stripe units and still bite today when your 4k logical blocks don’t match underlying geometry.
- “Latency percentiles” moved from academia to ops. The industry learned the hard way that averages hide pain; P99 and P99.9 became operationally meaningful as web-scale systems grew.
One quote worth keeping in your head while benchmarking: "Hope is not a strategy." (Vince Lombardi)
How benchmarks accidentally cheat: the usual suspects
1) You benchmarked RAM (page cache), not storage
If you run file-based fio jobs without --direct=1, Linux can satisfy reads from cache and absorb writes into memory, then flush them later. Your results look amazing. Your database doesn’t get those numbers when it has to wait for actual durability.
And no, “but my test file was big” isn’t always enough. Caches exist at multiple layers: page cache, drive cache, controller cache, even the hypervisor if you’re virtualized.
2) Your request size and sync semantics don’t match reality
Many production systems do small, sync’d writes (journal commits, WAL fsync, metadata updates). A benchmark doing 1M sequential writes with deep queue depth is measuring a different universe.
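To see the gap for yourself, a minimal sketch (the filename and size here are placeholders) is a 4k random write job that fsyncs after every write, which is far closer to WAL/journal behavior than a streaming 1M test:
cr0x@server:~$ fio --name=wal-like --filename=/data/fio-sync.test --size=2G --rw=randwrite --bs=4k --ioengine=psync --fsync=1 --numjobs=1 --runtime=60 --time_based --lat_percentiles=1 --group_reporting
Expect a small fraction of the IOPS an async, deep-queue job reports; that gap is the durability cost your application actually pays.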
3) Queue depth becomes a performance costume
High iodepth can hide latency by keeping the device busy, inflating throughput and IOPS. That can be legitimate—if your application actually issues that many outstanding requests. If it doesn’t, you’re testing a system you don’t run.
4) The device is in a “clean” state that production never sees
SSD FTL (flash translation layer) behavior depends on free blocks and garbage collection pressure. Short tests on empty drives can be unrealistically fast. Long tests can turn slow. Both are true; only one is predictive.
5) You measured a different path than your workload uses
Block device testing can bypass filesystem overhead, journaling, mount options, and metadata contention. File-based testing includes those—but also includes directory layout and free-space behavior. Pick intentionally.
6) CPU, IRQs, and NUMA quietly become “storage performance”
On NVMe, it’s common to bottleneck on interrupt handling, single-queue contention, or a bad affinity setup. Your “disk benchmark” turns into a CPU scheduling benchmark. That’s not wrong—unless you don’t notice.
7) Power management, throttling, and firmware policies change the rules mid-test
SSDs and NVMe drives can thermal throttle. CPUs can downclock. Laptops do laptop things. Servers do “energy saving mode” things. Your benchmark becomes a test of heatsinks, not IOPS.
Short joke #1: A storage benchmark without latency percentiles is like a restaurant review that only lists the average wait time—congratulations on your statistically satisfying hunger.
Practical tasks: 12+ checks with commands, output meaning, and the decision you make
These are not “nice to have.” This is how you keep fio from becoming performance theater on Debian 13.
Task 1: Verify what device you’re testing (and whether it’s virtual)
cr0x@server:~$ lsblk -o NAME,MODEL,SIZE,ROTA,TRAN,TYPE,MOUNTPOINTS
NAME          MODEL             SIZE ROTA TRAN TYPE MOUNTPOINTS
sda           INTEL SSDSC2KB48  447G    0 sata disk
├─sda1                            1G    0      part /boot
└─sda2                          446G    0      part
  └─vg0-lv0                     400G    0      lvm  /
nvme0n1       SAMSUNG MZVL21T0  1.8T    0 nvme disk
└─nvme0n1p1                     1.8T    0      part /data
What it means: You can’t interpret results if you don’t know whether you’re on SATA SSD, NVMe, a virtual disk, or an LVM layer. ROTA=0 indicates non-rotational, but that still includes SATA SSD and NVMe.
Decision: Choose block-device tests for raw device comparison; file tests for filesystem + mount options; don’t mix them and then argue about “the disk.”
Task 2: Identify the active I/O scheduler (and avoid accidental queueing behavior)
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[mq-deadline] none
What it means: The bracketed scheduler is active. NVMe often runs well with none or mq-deadline, but the “right” one depends on your workload and kernel defaults.
Decision: Keep it consistent across tests. If you’re benchmarking for production, match production scheduler. If you’re diagnosing weird latency, test both none and mq-deadline.
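If you do compare schedulers, switching is a runtime sysfs write (it reverts on reboot unless you persist it, e.g. via a udev rule):
cr0x@server:~$ echo none | sudo tee /sys/block/nvme0n1/queue/scheduler
none
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
mq-deadline [none]
Change one variable at a time and rerun the same fio profile, or you'll never know which knob mattered.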
Task 3: Confirm logical/physical block sizes and alignment risk
cr0x@server:~$ sudo blockdev --getss /dev/nvme0n1
512
cr0x@server:~$ sudo blockdev --getpbsz /dev/nvme0n1
4096
What it means: 512-byte logical sectors, 4k physical. Misalignment can cause read-modify-write penalties, especially for small writes.
Decision: Ensure partitions start at 1MiB boundaries (modern tools usually do). For databases, align page size and filesystem block size when feasible.
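A quick way to confirm alignment (partition 1 on the NVMe device in this example) is parted's built-in check:
cr0x@server:~$ sudo parted /dev/nvme0n1 align-check optimal 1
1 aligned
If it reports not aligned, fix the partitioning before benchmarking; no fio flag compensates for a read-modify-write on every small write.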
Task 4: Check filesystem and mount options (journaling and barriers matter)
cr0x@server:~$ findmnt -no SOURCE,FSTYPE,OPTIONS /data
/dev/nvme0n1p1 ext4 rw,relatime,errors=remount-ro
What it means: Options like noatime, barrier (or nobarrier historically), and journaling mode affect write behavior.
Decision: If you’re measuring durability-sensitive performance, don’t casually toggle options “for speed” unless you can prove power-loss safety and app correctness.
Task 5: Detect write cache settings (and whether you’re benchmarking honesty)
cr0x@server:~$ sudo hdparm -W /dev/sda | head
/dev/sda:
write-caching = 1 (on)
What it means: Write cache “on” can be fine for SSDs with power-loss protection; dangerous for consumer drives or controllers without protection if you disable barriers or lie about flushes.
Decision: For production databases, require power-loss protection or conservative settings. For benchmarks, explicitly note caching configuration in your report.
Task 6: Observe current device health and thermal state (before blaming fio)
cr0x@server:~$ sudo smartctl -a /dev/nvme0n1 | sed -n '1,40p'
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.0-amd64] (local build)
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZVL21T0
Serial Number: S6XXXXXXXXXXXX
Firmware Version: 3B2QGXA7
Total NVM Capacity: 1,900,000,000,000 [1.90 TB]
...
Temperature: 68 Celsius
Available Spare: 100%
Percentage Used: 2%
What it means: If temperature is high, throttling might kick in mid-test. “Percentage Used” helps catch drives near end-of-life that behave oddly.
Decision: If thermals are near vendor limits, fix cooling before “tuning fio.” If the drive is worn, expect worse sustained writes.
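For long runs, watch the temperature live instead of checking it once; a simple sketch using nvme-cli's smart log:
cr0x@server:~$ watch -n 5 "sudo nvme smart-log /dev/nvme0n1 | grep -i -E 'temperature|warning'"
If throughput dips line up with temperature climbing toward the drive's warning threshold, you're benchmarking your airflow, not your SSD.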
Task 7: Confirm you can bypass page cache (direct I/O sanity check)
cr0x@server:~$ fio --name=direct-check --filename=/data/fio.test --size=2G --rw=read --bs=128k --direct=1 --ioengine=libaio --iodepth=16 --numjobs=1 --runtime=10 --time_based --group_reporting
direct-check: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=16
fio-3.38
Starting 1 process
direct-check: Laying out IO file (1 file / 2048MiB)
Jobs: 1 (f=1): [R(1)][100.0%][r=912MiB/s][r=7296 IOPS][eta 00m:00s]
direct-check: (groupid=0, jobs=1): err= 0: pid=21199: Mon Dec 29 11:09:10 2025
read: IOPS=7200, BW=900MiB/s (944MB/s)(9000MiB/10001msec)
What it means: --direct=1 requests direct I/O (bypassing page cache). If you omit it, reads can hit RAM after the first pass.
Decision: Use --direct=1 for device characterization. Use buffered I/O only when you are specifically modeling an app that relies on cache (and then measure cache size and behavior too).
Task 8: Prove to yourself that buffered reads can cheat
cr0x@server:~$ fio --name=buffered-cheat --filename=/data/fio.test --size=2G --rw=read --bs=128k --direct=0 --ioengine=psync --iodepth=1 --numjobs=1 --runtime=10 --time_based --group_reporting
buffered-cheat: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=1
fio-3.38
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=5240MiB/s][r=41920 IOPS][eta 00m:00s]
buffered-cheat: (groupid=0, jobs=1): err= 0: pid=21305: Mon Dec 29 11:11:33 2025
read: IOPS=41000, BW=5120MiB/s (5370MB/s)(51200MiB/10001msec)
What it means: That bandwidth is suspiciously close to memory bandwidth behavior on many systems. Congratulations, you benchmarked cache.
Decision: Never publish buffered read numbers as “disk speed” without explicit disclaimers and a cache-cold methodology.
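If you genuinely need buffered numbers, at least start from a cold cache. Dropping caches is blunt and disruptive on a shared host, but it makes a "cache-cold" claim honest:
cr0x@server:~$ sync
cr0x@server:~$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
Run the buffered job once cold and once warm, and report both numbers with labels; the delta is the page cache's contribution.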
Task 9: Measure latency percentiles (because averages are liars)
cr0x@server:~$ fio --name=latency-4k --filename=/dev/nvme0n1 --rw=randread --bs=4k --direct=1 --ioengine=io_uring --iodepth=32 --numjobs=4 --runtime=30 --time_based --group_reporting --lat_percentiles=1
latency-4k: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
fio-3.38
Starting 4 processes
Jobs: 4 (f=4): [r(4)][100.0%][r=1680MiB/s][r=430k IOPS][eta 00m:00s]
latency-4k: (groupid=0, jobs=4): err= 0: pid=21601: Mon Dec 29 11:14:22 2025
read: IOPS=430k, BW=1680MiB/s (1762MB/s)(50400MiB/30001msec)
clat percentiles (usec):
| 1.00th=[ 66], 5.00th=[ 72], 10.00th=[ 76], 50.00th=[ 90]
| 90.00th=[ 120], 95.00th=[ 140], 99.00th=[ 210], 99.90th=[ 420]
| 99.99th=[ 920]
What it means: The median looks great, P99 is okay, and P99.99 is where the “mysterious spikes” live. Tail latency matters for databases, message brokers, and anything with synchronous commit.
Decision: If P99.9/P99.99 is ugly, don’t sign off on the storage based on average throughput. Fix contention, reduce queue depth, or change media/controller.
Task 10: Confirm you’re not CPU-bound during “storage” tests
cr0x@server:~$ mpstat -P ALL 1 5
Linux 6.12.0-amd64 (server) 12/29/2025 _x86_64_ (32 CPU)
11:16:01 AM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
11:16:02 AM all 18.2 0.0 31.5 0.3 0.0 9.2 0.0 40.8
11:16:02 AM 7 11.0 0.0 72.0 0.0 0.0 12.0 0.0 5.0
What it means: If one or two CPUs are pinned at high %sys/%soft while others are idle, you might be bottlenecked on a single IRQ queue or completion handling path.
Decision: Investigate IRQ affinities and multi-queue distribution before buying “faster disks” to fix a CPU scheduling problem.
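A low-tech way to check the distribution is to look at the NVMe interrupt counters while fio runs (queue naming varies by kernel and driver, so treat this as a sketch):
cr0x@server:~$ grep nvme /proc/interrupts | head
Each nvme0qN line is a completion queue with per-CPU counters; if one queue's counters dwarf the rest while the matching CPU shows high %soft/%sys in mpstat, you've found your "slow disk."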
Task 11: Watch real-time disk utilization and latency while fio runs
cr0x@server:~$ iostat -x 1 5
Linux 6.12.0-amd64 (server) 12/29/2025 _x86_64_ (32 CPU)
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await wareq-sz aqu-sz %util
nvme0n1 98000.0 392000.0 0.0 0.0 0.11 4.00 0.0 0.0 0.00 0.00 12.1 98.5
What it means: r_await is the average read latency in milliseconds as seen by the block layer (queue time plus service time). %util near 100% suggests device saturation, but on multi-queue NVMe it can read as saturated while the device still has headroom; treat it as a hint, not a verdict.
Decision: If fio claims high IOPS but iostat shows low utilization, something is off (cache, wrong target, or fio not doing what you think).
Task 12: Inspect NVMe error log and firmware details (silent problems happen)
cr0x@server:~$ sudo nvme id-ctrl /dev/nvme0n1 | grep -E '^(mn|fr|oacs|oncs) '
mn : SAMSUNG MZVL21T0
fr : 3B2QGXA7
oacs : 0x17
oncs : 0x5f
cr0x@server:~$ sudo nvme error-log /dev/nvme0n1 | head
Error Log Entries for device:nvme0n1 entries:64
Entry[ 0]
.................
What it means: Firmware versions and error logs help correlate “weird benchmark dips” with device-level issues. Some drives have known quirks under specific command mixes.
Decision: If you see errors or timeouts, stop. Benchmarks don’t fix hardware. Replace/firmware-update, then retest.
Task 13: Precondition / steady-state test for SSDs (so you don’t test the showroom model)
cr0x@server:~$ fio --name=precondition --filename=/dev/nvme0n1 --rw=write --bs=1M --direct=1 --ioengine=io_uring --iodepth=32 --numjobs=1 --runtime=600 --time_based --group_reporting
precondition: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=io_uring, iodepth=32
fio-3.38
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=1450MiB/s][w=1450 IOPS][eta 00m:00s]
What it means: Sustained writes push the drive into a more realistic state: SLC cache exhausted, GC active, thermal characteristics visible.
Decision: If you care about steady-state, you must precondition, then run your actual test. If you only care about burst speed, explicitly label it burst. And note that this job writes to the raw device and destroys whatever is on it (including the /data partition from Task 1), so run it only on hardware you can wipe.
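After preconditioning, the steady-state measurement itself is a separate job; a sketch for 4k random writes at moderate depth (again, destructive to the raw device):
cr0x@server:~$ fio --name=steady-randwrite --filename=/dev/nvme0n1 --rw=randwrite --bs=4k --direct=1 --ioengine=io_uring --iodepth=16 --numjobs=4 --runtime=1200 --time_based --ramp_time=120 --lat_percentiles=1 --group_reporting
The --ramp_time excludes the first two minutes from the statistics, so leftover burst behavior doesn't flatter the averages.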
Task 14: Verify TRIM/discard behavior (especially for virtual disks and thin provisioning)
cr0x@server:~$ lsblk -D -o NAME,DISC-GRAN,DISC-MAX,DISC-ZERO
NAME DISC-GRAN DISC-MAX DISC-ZERO
sda 2M 2G 0
nvme0n1 4K 2G 0
What it means: Non-zero discard granularity/max indicates the device supports discard. Whether it’s enabled depends on mount options and your environment.
Decision: If you rely on thin provisioning or long-lived SSD performance, ensure discard/TRIM strategy is intentional (periodic fstrim vs online discard).
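On Debian the usual "intentional" strategy is periodic trimming via the systemd timer rather than online discard; checking and exercising it looks like this:
cr0x@server:~$ systemctl status fstrim.timer --no-pager
cr0x@server:~$ sudo fstrim -v /data
Just don't run fstrim in the middle of a timed benchmark unless you want to measure fstrim.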
Designing fio jobs that look like your workload
The fastest way to get misled is to use whatever fio command you found in a blog post that didn’t mention fsync, percentiles, or how big the test file is relative to RAM.
Pick the scope: block device vs filesystem
- Block device tests (--filename=/dev/nvme0n1) are for media/controller characteristics and OS I/O path overhead. They bypass filesystem metadata and fragmentation. Great for comparing devices; less predictive for "app on ext4".
- File tests (--filename=/data/fio.test) include filesystem behavior. Good for mount options, journaling effects, and real-world file allocation behavior. But they are easier to accidentally cache.
Control caching explicitly
For most storage characterization:
- Use --direct=1.
- Use a test size that's not "small enough to stay warm." If you must do file tests, consider sizes larger than RAM when testing reads (see the quick check below).
- Record whether you ran on an idle system or a shared node. "Someone else's backup" is a performance variable.
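The quick check mentioned above is simply knowing your RAM before you pick file sizes:
cr0x@server:~$ free -g
If the box has 64 GiB of RAM and your buffered read test uses a 2 GiB file, you measured memory. Either size read tests several times larger than RAM or stick with --direct=1.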
Choose an IO engine on purpose
On Debian 13, you’ll commonly use:
- io_uring: modern, efficient, good for high IOPS. Can expose CPU/IRQ issues faster.
- libaio: classic async I/O interface for O_DIRECT, still widely used and stable.
- psync: useful for single-threaded sync-ish behavior; don't use it as a default "because it works."
Match concurrency and queue depth to your app
Don’t pick iodepth=256 because it looks serious. Measure how many outstanding I/Os your real workload issues, and emulate that. Databases often have a limited number of concurrent reads, and latency tends to be more important than peak queue-filling throughput.
Use mixed workloads when reality is mixed
Many systems do 70/30 reads/writes, with different block sizes and sync semantics. fio can do that; it’s your job to specify it.
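A minimal sketch of a 70/30 job (the block size, depth, and file are placeholders; substitute your app's real values):
cr0x@server:~$ fio --name=mixed-70-30 --filename=/data/fio-mixed.test --size=16G --rw=randrw --rwmixread=70 --bs=8k --direct=1 --ioengine=io_uring --iodepth=8 --numjobs=4 --runtime=300 --time_based --lat_percentiles=1 --group_reporting
fio reports reads and writes separately; read both sets of percentiles, because mixed workloads often have fine reads and ugly write tails, or the other way around.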
Measure the tail, not just the mean
Percentiles aren’t fancy. They are the part of the distribution where your SLOs go to die.
Short joke #2: If your benchmark report has only “MB/s,” you didn’t run a test—you ran a vibe check.
Fast diagnosis playbook: find the bottleneck without a week of debate
This is the sequence I use when a system “has slow storage” and everyone is already emotionally attached to a theory.
First: prove what path you’re on
- Confirm device and layers: lsblk, findmnt, check for LVM, mdraid, dm-crypt.
- Confirm you're hitting the intended target: the fio filename points to the right device or mount.
- Confirm the direct vs buffered I/O choice matches the question.
Second: determine if it’s saturation or stalls
- Run fio with percentiles and moderate concurrency; watch iostat -x.
- If %util is high and latency rises with iodepth, you're saturating the device or its queueing path.
- If utilization is low but latency is high, suspect contention elsewhere: CPU, IRQ, lock contention, filesystem, encryption, or a misconfigured scheduler.
Third: separate CPU/IRQ bottlenecks from media bottlenecks
- Check CPU distribution: mpstat.
- Check IRQ affinity and NVMe queues if you're deep into it (not shown here as a full tuning guide, but the symptom is usually "one core on fire").
- Try a different ioengine or reduce iodepth to see whether tail latency improves.
Fourth: validate device behavior under sustained load
- Precondition if SSD.
- Run longer tests; watch temperature and throughput over time.
- Check SMART/NVMe logs for errors.
Fifth: confirm filesystem and durability semantics
- Test with sync patterns relevant to the app (fsync, fdatasync, or database-specific patterns).
- Validate mount options; avoid "nobarrier"-style footguns unless you have power-loss protection and a strong reason.
Common mistakes: symptoms → root cause → fix
1) “Reads are 5 GB/s on SATA SSD”
- Symptom: Buffered reads show multi-GB/s; direct reads are much lower.
- Root cause: Page cache satisfied reads after warmup; you benchmarked memory.
- Fix: Use --direct=1, use a cold-cache methodology, and/or exceed RAM.
2) “Writes are fast, but database fsync is slow”
- Symptom: Sequential write throughput looks great; latency spikes under sync writes.
- Root cause: Benchmark used async buffered writes; app waits for flush/fua/journal commits.
- Fix: Run fio with sync semantics: include --fsync=1 or use --rw=randwrite with a relevant bs and iodepth, and measure percentiles.
3) “Random write IOPS collapses after a minute”
- Symptom: Great initial numbers; sustained performance drops hard.
- Root cause: SSD SLC cache exhaustion and garbage collection; device not preconditioned.
- Fix: Precondition (sustained writes), then test steady-state; consider overprovisioning or enterprise SSDs for write-heavy workloads.
4) “Benchmark numbers change between identical runs”
- Symptom: Same fio command, different results day-to-day.
- Root cause: Background load, thermal throttling, CPU frequency scaling, different free-space layout, or cache state.
- Fix: Control the environment: isolate the host, log thermals, pin CPU frequency if appropriate, precondition, and run multiple iterations with recorded variance.
5) “High IOPS but terrible tail latency”
- Symptom: Great average; P99.9 is painful.
- Root cause: Excessive queue depth, contention in the block layer, firmware GC spikes, or shared device.
- Fix: Reduce iodepth, match app concurrency, test isolation, and focus on percentiles rather than peak.
6) “It’s fast on raw device, slow on filesystem”
- Symptom: /dev/nvme tests are great; /data file tests are slow.
- Root cause: Filesystem journaling, metadata contention, small inode tables, mount options, or fragmentation.
- Fix: Benchmark both layers intentionally; tune filesystem for the workload; validate alignment and free space; consider XFS/ext4 differences for parallelism.
7) “Adding encryption barely changed throughput, but latency got weird”
- Symptom: Throughput looks similar; tail latency worsens.
- Root cause: CPU-bound crypto path under bursts; poor CPU/NUMA affinity; small I/O overhead.
- Fix: Measure CPU usage during fio; consider --numjobs and CPU pinning; ensure AES-NI and an appropriate cipher mode; keep iodepth realistic.
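Before blaming the drive, check the crypto path itself: confirm the CPU advertises AES instructions and let cryptsetup report raw cipher throughput (both commands are standard on Debian):
cr0x@server:~$ grep -m1 -o -w aes /proc/cpuinfo
aes
cr0x@server:~$ cryptsetup benchmark
If the aes-xts figures from cryptsetup benchmark sit far above your device's throughput, raw cipher speed isn't the bottleneck; the weird tails are more likely CPU contention during bursts.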
Three corporate mini-stories from the storage trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company upgraded a fleet of database nodes. New NVMe drives, new kernel, fresh Debian installs. The internal benchmark sheet looked fantastic: random reads in the hundreds of thousands of IOPS. The migration plan was signed off and scheduled for a quiet weekend.
After cutover, the database was “fine” for a few hours. Then latency climbed during routine maintenance: index rebuilds, autovacuum, and a backlog of write-heavy background tasks. The app didn’t crash. It just became politely unusable. Support tickets arrived in waves, the way they do when a system is alive but hurting.
The wrong assumption was subtle: the benchmark had used file-based tests on an empty filesystem with buffered I/O, and the test size was small enough that the page cache played hero. In production, the database dataset exceeded memory and forced real reads. Worse, the workload required frequent fsync-like durability behavior; the benchmark didn’t.
The fix wasn’t “buy faster NVMe.” The fix was to rerun tests with --direct=1, realistic concurrency, and latency percentiles; then tune the database’s I/O concurrency limits to match device behavior. Once they stopped chasing the headline IOPS number and started targeting P99 latency under realistic iodepth, the system stabilized. The drives were fine. The benchmark narrative was not.
Mini-story 2: The optimization that backfired
A different org ran a large analytics cluster. Someone noticed their nightly jobs were slower after a storage refresh, despite better hardware. A well-meaning engineer decided the bottleneck must be “filesystem overhead,” so they switched several workloads to raw block devices and cranked up fio-like queue depths in the application layer.
At first, the dashboards improved. Throughput increased, jobs finished sooner, and the weekly status email contained numbers that made everyone feel competent. Then came the backfire: tail latency spikes started appearing during peak hours, affecting unrelated services sharing the same hardware. The NVMe devices were saturated with deep queues, and latency-sensitive services got stuck behind an I/O wall of outstanding requests.
The “optimization” improved throughput by increasing queueing, but it damaged the overall system by increasing contention and tail latency. It also reduced observability: bypassing filesystem semantics meant losing some of the guardrails and tooling the team used for diagnosis.
The recovery involved capping I/O concurrency per workload, restoring file-based access where it made operational sense, and treating queue depth as a resource budget rather than a “make it faster” knob. They still got good throughput, but they stopped sacrificing the rest of the fleet to achieve it.
Mini-story 3: The boring but correct practice that saved the day
A financial services platform had a habit that looked unsexy in architecture reviews: every storage benchmark run had a short “environment record” attached. Device model/firmware, kernel version, scheduler, mount options, caching mode, test size, and whether the system was isolated. It was paperwork, basically.
One quarter, a latency regression appeared after a routine update. The team had good instincts but no single smoking gun. Some blamed the kernel. Others blamed the new batch pipeline. A few wanted to roll back everything and call it a day.
Because the benchmark runs were reproducible and recorded, they quickly noticed a small but meaningful change: the scheduler selection differed between two host images, and the affected nodes had a different IRQ distribution pattern. The raw device wasn’t “slower,” but the CPU path was noisier, and tail latency was worse under the same fio profile.
The fix was boring: standardize the scheduler choice for that device class, apply consistent IRQ affinity policy, rerun the same fio suite as a gate, and only then proceed. No heroics, no midnight vendor calls. The saved-the-day move was not a tuning trick; it was disciplined comparison.
Checklists / step-by-step plan
Step-by-step plan: a benchmark you can defend in a postmortem
- Write down the workload intent. “We need predictable P99.9 latency for 4k random reads at concurrency X” is usable. “We need fast disks” is not.
- Record the environment. Debian version, kernel, fio version, device model/firmware, filesystem, mount options, scheduler.
- Pick test scope. Raw device vs filesystem. If filesystem, record free space and fragmentation risk.
- Control caching. Use --direct=1 for device characterization. If buffered, justify it and prove cache state.
- Precondition if SSD and steady-state matters. Don't skip this if you write a lot in production.
- Use multiple durations. Quick 30s for sanity, 10–30 minutes for sustained behavior.
- Measure percentiles. Always. If you don’t care about tail latency, you’re either lucky or running a batch system that no one watches.
- Run with realistic concurrency. Start with what the app can generate, then explore higher to see headroom and knee points.
- Observe the system during the run: iostat -x, mpstat, device thermals.
- Repeat and report variance. If results vary wildly, you don't have a benchmark; you have a mystery.
- Make a decision tied to the metric. For example: “Accept if P99.9 < 2ms at iodepth=8, numjobs=4.”
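To make that gate mechanical rather than a vibe, have fio emit JSON and extract the percentile you committed to. A sketch assuming jq is installed; the exact JSON field names (clat_ns and its percentile keys) can differ between fio versions, so check them against your own output first:
cr0x@server:~$ fio --name=gate-randread --filename=/data/fio.test --size=16G --rw=randread --bs=4k --direct=1 --ioengine=io_uring --iodepth=8 --numjobs=4 --runtime=120 --time_based --group_reporting --output-format=json --output=gate.json
cr0x@server:~$ jq '.jobs[0].read.clat_ns.percentile."99.900000" / 1000000' gate.json
That prints P99.9 read latency in milliseconds; compare it against the agreed threshold in CI or a runbook, and archive gate.json next to the environment record.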
Quick checklist: “Am I benchmarking the right thing?”
- Did I explicitly choose direct vs buffered I/O?
- Is my test size meaningful relative to RAM?
- Did I record caching settings and durability semantics?
- Did I collect latency percentiles?
- Did I match block sizes and concurrency to production?
- Did I check thermals and CPU saturation?
FAQ
1) Should I always use --direct=1 on Debian 13?
For device characterization and most “storage performance” claims: yes. Use buffered I/O only when modeling an application that intentionally benefits from page cache, and then measure cache behavior explicitly.
2) Is io_uring always better than libaio?
No. io_uring is often faster and more scalable, but it can expose CPU/IRQ bottlenecks and is sensitive to kernel/device quirks. Use both when diagnosing; standardize on one for ongoing regression testing so results are comparable.
3) Why do my fio numbers look great but my database is slow?
Common reasons: fio tested sequential I/O while the database does random; fio didn’t include sync/flush behavior; fio used a different queue depth than the database; filesystem and journaling overhead weren’t represented; or your database is bottlenecked on CPU locks and not I/O at all.
4) What block size should I test with?
Test multiple sizes, but start with what your application uses: 4k random reads are a classic baseline; 16k/32k can matter for some databases; 128k–1M for sequential scans and backups. If you only pick one, you’ll accidentally optimize for that one.
5) How long should fio runs be?
Long enough to include the behavior you care about. For burst: 30–60 seconds may be fine. For steady-state SSD behavior and tail latency: minutes to tens of minutes, after preconditioning if you write heavily.
6) How do I avoid destroying data when testing raw devices?
Assume fio will happily ruin your day. Use a dedicated test device or a disposable LVM LV. Triple-check --filename. Use --rw=read for non-destructive tests, and treat write tests as destructive unless you are targeting a test file on a mounted filesystem.
7) Why is “QD32” such a common default in examples?
It’s a historical artifact from storage systems where deep queues helped saturate controllers and disks. It can still be useful for measuring peak throughput, but it often misrepresents latency-sensitive workloads and single-threaded applications.
8) Do I need to drop caches between runs?
If you are doing buffered tests and trying to simulate cold-cache behavior, yes—carefully. In production-like tests, it’s usually better to use direct I/O and avoid the entire “cache management” problem unless cache is part of the system you’re modeling.
9) My fio results differ between file and block tests. Which is “real”?
Both are real; they measure different systems. Block tests measure device + kernel block layer; file tests include filesystem allocation, metadata, and journaling. Pick the one that matches your question, and don’t average them into a single story.
10) What’s the single most useful fio output field to look at?
Latency percentiles (clat percentiles) under realistic concurrency. Throughput is negotiable; tail latency is where user-visible pain lives.
Conclusion: next steps that actually reduce risk
If you take one operational lesson from this: stop arguing about “fast storage” and start defining “acceptable latency under realistic concurrency.” Then benchmark that, with caching controlled and percentiles reported.
Concrete next steps:
- Pick two or three fio profiles that match your production I/O shapes (random read 4k, mixed 70/30, sync write pattern if applicable).
- Standardize environment capture (device, firmware, kernel, scheduler, filesystem, mount options, direct/buffered).
- Make a simple pass/fail gate based on P99/P99.9 latency, not peak MB/s.
- Run precondition + steady-state tests for SSD-backed write-heavy systems.
- When results surprise you, follow the fast diagnosis playbook instead of “tuning until the graph looks good.”
fio isn’t lying. It’s just obedient. Your job is to ask it the right questions—and to refuse flattering answers that don’t survive contact with production.