You ran fio, got a heroic number, shipped it to procurement, and six weeks later production feels like it’s wading through syrup.
Now everyone’s staring at “the storage” like it’s a sentient villain.
Synthetic benchmarks didn’t just fail to predict reality—they actively convinced you the opposite of what was true.
That’s not bad luck. That’s a failure mode. And you can learn to spot it quickly, before it ships.
What synthetic benchmarks are (and what they quietly assume)
A synthetic benchmark is a designed workload: you choose the read/write mix, block size, queue depth, thread count, randomness, runtime, and data set size.
It’s a lab rat. Useful. Also not your users.
The lie isn’t that synthetic tools are “wrong.” It’s that their output is incomplete by default, and people treat it like a purchase order.
Storage performance is a system property: hardware, firmware, kernel, filesystem, drivers, multipath, network, virtualization, and the noisy neighbor you didn’t know existed.
The benchmark contract you didn’t read
Every synthetic result comes with unstated assumptions. When you run “4k randread QD=64,” you’re asserting:
- Your real workload reads in 4k chunks (or at least behaves similarly at the IO layer).
- Your workload can maintain queue depth 64 without stalling upstream.
- Your app tolerates the same latency distribution, not just the same average.
- Your caches (page cache, controller cache, CDN, app cache) behave similarly.
- Your IO path is the same: same filesystem, same mount options, same encryption/compression, same replication settings.
In production, those assumptions get murdered by reality. The benchmark still prints a clean number. That’s the dangerous part.
Here’s the mental model that keeps you honest: benchmarks do not measure “disk speed.” They measure a pipeline under a very specific set of conditions.
Change the conditions, change the story.
Joke #1: Synthetic benchmarks are like résumés—everybody looks amazing until you check references and discover “expert in Kubernetes” meant “once opened the dashboard.”
How synthetic benchmarks lie: the common mechanisms
Benchmarks “lie” in a predictable set of ways. If you learn the patterns, you can catch them early—sometimes just by looking at the graph and asking one rude question.
1) Cache, cache, cache (and the benchmark that never touched disk)
The classic: your benchmark reads the same data repeatedly, and the OS page cache serves it from RAM.
Or a RAID controller’s write-back cache absorbs writes and later trickles them to disk while your benchmark celebrates.
Real workloads don’t usually get infinite warm cache. They get partial locality, eviction storms, and “it was fast until 09:03.”
Benchmarks that don’t size the dataset beyond cache are just measuring memory bandwidth with extra steps.
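A quick sanity check, assuming a scratch file at /mnt/testfile and a test box you are allowed to abuse (dropping caches on production is a self-inflicted incident): run the read test, drop the page cache, run it again, and compare.
cr0x@server:~$ fio --name=cached --filename=/mnt/testfile --size=4G --rw=randread --bs=4k --runtime=30 --time_based --group_reporting
cr0x@server:~$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
cr0x@server:~$ fio --name=uncached --filename=/mnt/testfile --size=4G --rw=randread --bs=4k --runtime=30 --time_based --group_reporting
If the second run collapses, the first one was measuring RAM, not storage.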
2) Queue depth fantasy: QD=64 is not “more realistic,” it’s “a different universe”
High queue depth inflates throughput on devices that can reorder and parallelize internally. NVMe loves it. SATA tolerates it. Network-attached storage mostly just hides the queueing somewhere you can’t see it.
But applications may never generate that kind of concurrency because they block on locks, commit boundaries, RPC round trips, or CPU.
If your app is single-threaded or serialized by a database WAL fsync, QD=64 results are trivia.
You should benchmark at the queue depth your system can sustain end-to-end.
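As a sketch of what that looks like, here is the same 4k random read measured at a concurrency a serialized application might actually produce (iodepth=1, one job); adjust the numbers to whatever your system can really sustain:
cr0x@server:~$ fio --name=realistic-qd --filename=/mnt/testfile --size=8G --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=1 --numjobs=1 --runtime=60 --time_based --group_reporting
Compare its latency percentiles against the QD=64 datasheet run; the gap is the part of the marketing number your application will never see.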
3) Block size theater: “IOPS” without block size is meaningless
1M sequential throughput and 4k random IOPS are different sports. A system can be great at one and embarrassing at the other.
Some vendors love showing whichever metric flatters them. So do engineers, if we’re being honest.
If your workload is “lots of tiny metadata reads” and you benchmark “128k sequential read,” you didn’t benchmark your workload. You benchmarked your hopes.
4) Latency averages are performance fan fiction
Averages hide the things that page your on-call: tail latency (p95, p99, p99.9) and latency spikes.
Many synthetic tests report an average that looks fine while p99 is a horror show.
What matters depends on the system. For a queueing system, p99 can drive retries, timeouts, and cascading failure.
For a database, a few long fsyncs can block commits and create thundering herds.
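fio will report percentiles if you ask for them; a minimal sketch, assuming a reasonably recent build that supports percentile_list:
cr0x@server:~$ fio --name=tail --filename=/mnt/testfile --size=8G --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=4 --runtime=120 --time_based --percentile_list=50:95:99:99.9 --group_reporting
Report the p99 and p99.9 in every summary. If someone only quotes the average, assume the tail is ugly.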
5) “Direct IO” vs buffered IO: choose wrong, measure the wrong subsystem
Buffered IO includes page cache behavior, writeback, and dirty page thresholds. Direct IO bypasses page cache and changes alignment requirements.
Production is often a mix: some databases use direct IO for data files but rely on buffered IO elsewhere; services may buffer reads without realizing it.
If you benchmark with --direct=1 and your workload is mostly buffered reads, you’ll undervalue caching effects. If you benchmark buffered but expect direct semantics, you’ll overpromise.
Pick the IO mode that matches your application’s actual path, not the benchmarker’s preference.
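One honest way to see how much of your “performance” is really page cache: run the same job twice, changing only the IO mode. A minimal sketch; pick a --size that cannot fit in RAM on your box:
cr0x@server:~$ fio --name=buffered --filename=/mnt/testfile --size=32G --direct=0 --rw=randread --bs=4k --iodepth=1 --runtime=60 --time_based --group_reporting
cr0x@server:~$ fio --name=odirect --filename=/mnt/testfile --size=32G --direct=1 --rw=randread --bs=4k --iodepth=1 --runtime=60 --time_based --group_reporting
The delta between the two runs is the caching effect you are either relying on or pretending doesn’t exist.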
6) Filesystem and mount options: the invisible multiplier
Ext4 with data=ordered behaves differently from XFS under metadata pressure. ZFS has its own universe (ARC, recordsize, sync behavior).
Mount options like noatime can remove metadata writes. Options like barriers, journaling mode, and discard can shift latency.
Synthetic benchmarks usually run on a fresh, empty filesystem with perfect locality. Production is fragmented, has millions of inodes, and is under concurrent load.
7) Device state matters: fresh-out-of-box SSDs are liars too
Many SSDs benchmark higher when empty due to SLC caching and clean flash. Under steady-state random writes, garbage collection and wear leveling kick in.
A “quick test” can show a fantasy that collapses after hours or days of real churn.
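A crude preconditioning sketch: fill the test file sequentially, then sustain random writes long enough for garbage collection to wake up. How long “long enough” is varies by device; for many SSDs it is hours, not minutes:
cr0x@server:~$ fio --name=fill --filename=/mnt/testfile --size=64G --direct=1 --rw=write --bs=1M --ioengine=libaio --iodepth=8
cr0x@server:~$ fio --name=steadystate --filename=/mnt/testfile --size=64G --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=16 --runtime=3600 --time_based --group_reporting
Compare the first five minutes against the last five. If they disagree, the short test was the lie.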
8) Compaction, checksum, encryption, replication: real work costs real cycles
Storage stacks often do extra work: checksums, compression, encryption at rest, deduplication, replication, erasure coding, snapshots.
Synthetic benchmarks that don’t enable the same features aren’t benchmarking your system. They’re benchmarking a different one.
9) Virtualization and neighbors: you benchmarked the host; production runs on the apartment building
On shared infrastructure, you don’t own the controller, the cache, the network uplink, or the “someone is doing a backup” schedule.
Your benchmark might run at night, alone. Production runs at noon, with a thousand siblings.
10) The benchmark becomes the workload (and everything optimizes for it)
Engineers tune sysctls and IO schedulers until fio looks great, then ship. The benchmark becomes the acceptance test.
The system is now optimized for the synthetic pattern, sometimes at the expense of mixed workloads or latency.
Joke #2: If you tune a system until the benchmark is perfect, congratulations—you’ve successfully trained your storage to ace a standardized test.
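If you want to know whether a tuning helps the benchmark or the users, run a throughput job and a rate-limited latency probe at the same time. A minimal sketch of a mixed-mode fio job file, assuming a scratch directory at /mnt/test (the section names are made up; the options are standard fio):
cr0x@server:~$ cat > mixed.fio <<'EOF'
# Hypothetical mixed-mode job: a batch reader plus a rate-limited latency probe.
[global]
directory=/mnt/test
direct=1
ioengine=libaio
time_based
runtime=120

[batch-throughput]
rw=read
bs=1M
iodepth=16
size=16G

[latency-probe]
rw=randread
bs=4k
iodepth=1
rate_iops=200
size=4G
EOF
cr0x@server:~$ fio mixed.fio
Leave per-job reporting on and read the latency-probe percentiles while the batch job runs; that is the starvation a solo benchmark never shows.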
Interesting facts and history (yes, it matters)
- IOPS became fashionable because HDDs were bad at random IO. Early enterprise discussions obsessed over “how many 4k random reads” because spindles were the bottleneck.
- Average latency was historically reported because it was easy. Tail latency became mainstream later, when large-scale distributed systems made p99 spikes a reliability issue.
- RAID controller write-back cache can make small write benchmarks look supernatural. It’s often a battery-backed DRAM cache—fast until it must flush to disks.
- SSDs have “short burst” modes (e.g., SLC caching) that distort brief benchmarks. Many devices are designed to look good in short tests and consumer traces.
- Linux’s page cache can satisfy reads without touching storage at all. Unless you use direct IO or size the dataset correctly, you’re benchmarking RAM.
- IO schedulers evolved because different media needed different queueing. What helped rotational disks (reordering seeks) can be irrelevant or harmful on NVMe.
- “fsync() makes it real” is only partly true. Devices and controllers can acknowledge writes before persistence unless barriers, cache settings, and power-loss protection align.
- Cloud disks often advertise throughput/IOPS separately from latency. You can hit IOPS targets while still failing SLOs because tail latency is what users feel.
- Some benchmark suites became procurement tools. Once a number determines a purchase, vendors optimize for it, sometimes without improving real workloads.
One useful paraphrased idea, commonly known as Goodhart’s law: when a metric becomes a target, it stops being a good metric. The storage version is simple:
if “fio IOPS” is your goal, you will get fio IOPS—whether or not your database stops timing out.
Three corporate-world mini-stories
Mini-story #1: The incident caused by a wrong assumption
A mid-sized SaaS company migrated from local NVMe to a network-backed block storage platform.
The migration plan was approved because a synthetic test showed “similar IOPS” and “higher throughput.” The slide deck was immaculate.
The first week after cutover, customer-facing latency spiked every morning. Not a full outage, just a slow bleeding mess: timeouts, retries, and a growing queue backlog.
The on-call team looked at CPU and memory—fine. Network? Fine. The storage dashboard showed IOPS below the advertised limit.
“So it can’t be storage,” someone said, which is a sentence that has ended many happy careers.
They were assuming that IOPS capacity implies latency stability.
The hidden difference was fsync-heavy writes. The application used a database with strict durability on commit.
The synthetic benchmark was mostly random reads with deep queue depth and big batches. Production was lots of small sync writes with modest concurrency.
Under morning load, the storage system’s tail latency climbed, and commit latency amplified into request latency.
The fix was not “more IOPS.” It was workload alignment: measure sync write latency at the concurrency the database actually produces.
They ultimately adjusted volume type and tuned database commit settings carefully, and they stopped approving storage changes based on a single fio profile.
The lesson was painfully simple: you can’t reason from the wrong benchmark.
Storage incidents are often caused by a mismatch between measured conditions and production conditions, not by a component “getting slower.”
Mini-story #2: The optimization that backfired
A data platform team had a nightly batch workload. They were proud of their benchmark discipline: every new node type ran the same synthetic suite.
One engineer noticed that switching the IO scheduler and increasing queue depth made the benchmark numbers jump.
They rolled the tuning across the fleet and declared victory.
Two weeks later, daytime interactive queries began showing jitter.
Nothing catastrophic, just enough tail latency to make dashboards feel sticky and on-call get grumpy.
The tuning had optimized for throughput under high concurrency, but it changed fairness under mixed workloads.
The real issue was contention: the scheduler and queue settings allowed batch IO to dominate device time slices.
Synthetic benchmarks didn’t include a competing latency-sensitive workload, so they never showed the starvation.
Production did.
Rolling back the tuning improved query p99 immediately, while batch throughput dipped slightly.
The team learned to benchmark “mixed-mode” scenarios: one job pushing throughput while another measures latency.
They also learned that a tuning that makes a benchmark prettier can be a tax on user experience.
Mini-story #3: The boring but correct practice that saved the day
A financial services team ran a storage platform with strict change control. They weren’t glamorous about it.
Before any upgrade, they captured a baseline: device firmware versions, queue settings, filesystem mount options, and a small set of workload-representative tests.
They stored results with timestamps and kernel versions. It was boring. It was also a time machine.
After a routine kernel update, they saw a subtle increase in p99 write latency. Users barely noticed—until month-end load hit.
Because they had baselines, they could say “this drift began exactly after the update,” not “storage feels weird lately.”
That narrowed the search to IO stack changes, not a vague “maybe hardware is failing.”
They used their baseline fio jobs plus application-level metrics to confirm the regression.
Then they compared block layer settings and found that a default had changed in their environment (queueing behavior and scheduler selection differed on the new kernel).
They pinned the previous behavior and scheduled a controlled follow-up to test the new defaults properly.
Month-end passed without drama. Nobody celebrated. That’s the point.
The practice that saved them wasn’t a magic benchmark; it was repeatable measurement, change attribution, and refusing to “tune blind.”
Fast diagnosis playbook: what to check first/second/third
When “storage is slow,” you need a fast funnel. Not a week-long benchmark project. Here’s the order that usually finds the truth quickly.
First: prove whether it’s latency, throughput, or saturation
- Look at tail latency (p95/p99), not just average.
- Check utilization: is the device at 100% busy? Is the queue growing?
- Correlate with load: did concurrency change? Did a backup start? Did compaction kick in?
Second: locate the bottleneck layer
- Application: locks, GC pauses, connection pool exhaustion, fsync frequency.
- Filesystem: journal pressure, metadata storms, fragmentation, mount options.
- Block layer: queue depth, scheduler, merge behavior, throttling.
- Device: SSD steady-state write cliff, firmware quirks, thermal throttling.
- Network/storage fabric: retransmits, congestion, multipath flapping.
- Virtualization/shared environment: noisy neighbor, host CPU steal, throttling policies.
Third: reproduce safely with a representative micro-test
- Pick a test that matches your IO size, read/write mix, sync behavior, and concurrency.
- Run it next to production load if possible (carefully), or replay traces in a staging clone.
- Validate with both system metrics and application metrics. If they disagree, trust the application.
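A “representative micro-test” usually means mixing reads and writes at your real IO size and forcing the same durability behavior. A minimal sketch, assuming a 70/30 read/write mix at 8k with an fsync every 32 writes; swap in your own numbers:
cr0x@server:~$ fio --name=repro --filename=/mnt/testfile --size=16G --direct=1 --rw=randrw --rwmixread=70 --bs=8k --ioengine=libaio --iodepth=4 --fsync=32 --runtime=300 --time_based --group_reporting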
This playbook works because it’s skeptical: it assumes the symptom might be upstream, downstream, or self-inflicted by benchmarking mythology.
Practical tasks: commands, what the output means, and the decision you make
These are the tasks I actually run when someone hands me a benchmark result or a “storage is slow” ticket.
Each task includes: command, sample output, what it means, and the decision it drives.
Task 1: Identify the block devices and their topology
cr0x@server:~$ lsblk -o NAME,MODEL,SIZE,ROTA,TYPE,MOUNTPOINTS
NAME MODEL SIZE ROTA TYPE MOUNTPOINTS
nvme0n1 Samsung SSD 1.8T 0 disk
├─nvme0n1p1 512M 0 part /boot
└─nvme0n1p2 1.8T 0 part /
sda ST8000NM0045 7.3T 1 disk
└─sda1 7.3T 1 part /mnt/archive
Meaning: ROTA=0 indicates solid-state; ROTA=1 indicates rotational. Model hints at controller and class.
Knowing whether you’re on NVMe vs SATA vs HDD changes what “good” looks like and what queueing behavior makes sense.
Decision: Pick benchmark profiles appropriate to the device class. Don’t use NVMe-style queue depths on a single HDD and call it “unfair hardware.”
Task 2: Check filesystem and mount options (the silent performance knobs)
cr0x@server:~$ findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /
/dev/nvme0n1p2 / ext4 rw,relatime,errors=remount-ro
Meaning: Mount options tell you whether you’re paying metadata taxes (atime), what journaling behavior you have, and whether discard is enabled.
Decision: If your benchmark environment uses different mount options than production, stop. Align them, then re-test.
Task 3: Confirm you’re not accidentally benchmarking RAM via page cache
cr0x@server:~$ grep -E 'MemTotal|MemAvailable|Cached' /proc/meminfo
MemTotal: 263824032 kB
MemAvailable: 221443104 kB
Cached: 78455232 kB
Meaning: A huge Cached value plus a dataset smaller than RAM often means reads will come from cache.
Decision: Size the benchmark dataset beyond RAM (or use direct IO). If you can’t, state explicitly that you measured cached performance, not disk.
Task 4: Check current writeback pressure and dirty page thresholds
cr0x@server:~$ sysctl vm.dirty_background_ratio vm.dirty_ratio
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20
Meaning: These control how much dirty data can accumulate before writeback forces throttling.
Benchmarks that write buffered data can look great until the kernel decides it’s time to flush, then latency spikes.
Decision: If you see periodic latency cliffs in buffered write tests, reproduce with direct IO or adjust the test duration and monitor dirty writeback behavior.
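To watch writeback pressure while a buffered test runs, the kernel exposes it directly; a minimal sketch:
cr0x@server:~$ watch -n1 "grep -E 'Dirty|Writeback' /proc/meminfo"
If Dirty climbs, then drains in bursts right when your latency spikes, you have found the cliff.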
Task 5: See if the device is saturated (utilization and queue depth)
cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (server) 01/12/2026 _x86_64_ (64 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.10 0.00 4.20 6.50 0.00 77.20
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
nvme0n1 820.0 52480.0 0.0 0.0 2.10 64.0 610.0 78144.0 0.0 0.0 8.40 128.1 5.10 92.00
Meaning: %util near 100% suggests saturation, though on NVMe and other highly parallel devices %util can pin at 100% before the device is truly out of capacity, so read it together with aqu-sz and the await columns. aqu-sz indicates queue backlog. r_await/w_await show latency including queueing time.
Decision: If saturated, you need either more devices, better parallelism, or less work per IO (compression choices, batching, caching). If not saturated but latency is high, look for firmware, throttling, or upstream stalls.
Task 6: Measure per-process IO behavior (who’s actually doing it)
cr0x@server:~$ pidstat -d 1 3
Linux 6.5.0 (server) 01/12/2026 _x86_64_ (64 CPU)
# Time UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
12:00:01 999 18422 0.00 41280.00 0.00 45 postgres
12:00:01 0 2211 0.00 5120.00 0.00 6 rsync
Meaning: You can separate “storage is slow” from “one process is doing a lot.”
iodelay is a rough indicator of time spent waiting on IO.
Decision: If a background job is dominating, schedule it differently or throttle it. If the main service is waiting, focus on latency and durability path.
Task 7: Confirm IO scheduler and queue settings (especially after kernel updates)
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
Meaning: The selected scheduler is shown in brackets. NVMe devices often run well with none; other devices may benefit from mq-deadline under mixed workloads.
Decision: Don’t blindly tune for benchmark throughput. If you have latency-sensitive workloads, test under contention and pick the scheduler that preserves tail latency.
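If you do want to compare schedulers under your own contention test, you can switch one at runtime on a test host; the change does not persist across reboots. A minimal sketch, assuming a SATA device named sda:
cr0x@server:~$ cat /sys/block/sda/queue/scheduler
cr0x@server:~$ echo bfq | sudo tee /sys/block/sda/queue/scheduler
Re-run your mixed-mode test before and after, and judge by the latency probe’s p99, not by the throughput job’s bandwidth.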
Task 8: Spot device throttling or errors in kernel logs
cr0x@server:~$ dmesg -T | tail -n 12
[Mon Jan 12 11:58:10 2026] nvme nvme0: failed command: WRITE, cmdid 123 qid 4
[Mon Jan 12 11:58:10 2026] nvme nvme0: status: { DNR }
[Mon Jan 12 11:58:11 2026] EXT4-fs (nvme0n1p2): warning: mounting fs with errors, running e2fsck is recommended
Meaning: Benchmark numbers are irrelevant if the device is erroring or the filesystem is compromised.
Latency spikes can be error retries, resets, or degraded modes.
Decision: Stop performance testing. Stabilize the system: check SMART/NVMe logs, verify cabling/controller, and address filesystem errors.
Task 9: Run a direct IO latency-focused fio test (more honest for many DB paths)
cr0x@server:~$ fio --name=lat4k --filename=/mnt/testfile --size=8G --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=4 --numjobs=4 --time_based --runtime=60 --group_reporting
lat4k: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=4
...
read: IOPS=42000, BW=164MiB/s (172MB/s)(9840MiB/60001msec)
slat (nsec): min=1800, max=21000, avg=5400.10, stdev=1100.30
clat (usec): min=45, max=9200, avg=90.20, stdev=40.10
lat (usec): min=49, max=9210, avg=96.10, stdev=40.50
clat percentiles (usec):
| 1.00th=[ 55], 5.00th=[ 62], 10.00th=[ 68], 50.00th=[ 84]
| 90.00th=[ 112], 95.00th=[ 135], 99.00th=[ 240], 99.90th=[ 1200]
Meaning: IOPS is fine, but the percentiles tell the real story. p99.9 at 1.2ms might be okay—or might break an SLO if your request path stacks multiple IOs.
Decision: If tail latency is high, reduce contention, lower queue depth, check GC/throttling, or move to a device class with better steady-state latency.
Task 10: Test sequential throughput with realistic block size and concurrency
cr0x@server:~$ fio --name=seqread --filename=/mnt/testfile --size=16G --direct=1 --rw=read --bs=1M --ioengine=libaio --iodepth=8 --numjobs=2 --time_based --runtime=60 --group_reporting
seqread: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=8
...
read: IOPS=980, BW=980MiB/s (1027MB/s)(58800MiB/60001msec)
Meaning: Good for bulk scans, backups, log shipping. But it does not predict small random IO performance.
Decision: Use this to capacity-plan bulk operations and maintenance windows, not to justify a database latency SLO.
Task 11: Verify discard/TRIM settings (can create latency spikes)
cr0x@server:~$ findmnt -no TARGET,OPTIONS /mnt/test
/mnt/test rw,relatime,discard
Meaning: Continuous discard can add overhead on some devices. Some environments prefer periodic fstrim.
Decision: If you see periodic latency spikes aligned with discards, switch to scheduled trim and re-test.
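The scheduled-trim alternative is simple; a minimal sketch, assuming util-linux fstrim and a systemd-based distribution:
cr0x@server:~$ sudo fstrim -v /mnt/test
cr0x@server:~$ systemctl status fstrim.timer
Remount without discard, enable the timer (or a cron job), and re-run the latency test that showed the spikes.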
Task 12: Check NVMe SMART and media health (steady-state matters)
cr0x@server:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0x00
temperature : 47 C
available_spare : 100%
percentage_used : 3%
data_units_read : 12,345,678
data_units_written : 9,876,543
media_errors : 0
num_err_log_entries : 0
Meaning: Temperature and error counters can explain throttling and retries. percentage_used gives a rough wear indicator.
Decision: If temperature is high or errors appear, fix hardware/airflow before blaming your workload or “bad tuning.”
Task 13: See whether your “random” test is actually hitting the same blocks (locality check)
cr0x@server:~$ fio --name=randcheck --filename=/mnt/testfile --size=64G --direct=1 --rw=randread --bs=4k --iodepth=1 --numjobs=1 --runtime=30 --time_based --group_reporting --randrepeat=0
randcheck: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
read: IOPS=8200, BW=32.0MiB/s (33.6MB/s)(960MiB/30001msec)
Meaning: --randrepeat=0 avoids repeating the same random sequence across runs. Larger --size helps defeat cache.
Decision: If performance collapses when you expand the dataset, your earlier result was probably cache-inflated.
Task 14: Confirm persistence semantics (write cache setting at the device level)
cr0x@server:~$ sudo hdparm -W /dev/sda
/dev/sda:
write-caching = 1 (on)
Meaning: Write cache can improve speed but changes durability unless you have power-loss protection and correct barriers.
Decision: For systems requiring strict durability, ensure the stack is configured correctly and benchmark with fsync/sync patterns—otherwise you’re benchmarking “acknowledged” writes, not persistent writes.
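To measure persistent writes rather than acknowledged writes, make the benchmark issue the same flushes your application does. A minimal sketch of a commit-path probe, assuming an fsync after every write, roughly what a durable database commit looks like:
cr0x@server:~$ fio --name=commitpath --filename=/mnt/testfile --size=4G --rw=randwrite --bs=4k --ioengine=psync --numjobs=1 --fsync=1 --runtime=120 --time_based --group_reporting
The p99 of this run is what you compare against your commit-latency SLO, not the cached randwrite IOPS.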
Task 15: Measure file-level latency from the filesystem perspective
cr0x@server:~$ strace -T -e trace=fsync -p 18422 -s 0
strace: Process 18422 attached
fsync(57) = 0 <0.012341>
fsync(57) = 0 <0.089120>
fsync(57) = 0 <0.007882>
Meaning: Those timings are the application’s reality. If fsync sometimes takes 90ms, your “average 1ms latency” storage report is irrelevant.
Decision: If fsync latency is spiky, look for writeback storms, device GC, contention, or durability path issues (barriers, cache flush behavior, network storage).
Common mistakes: symptom → root cause → fix
1) Symptom: “Benchmark says 500k IOPS, but the app times out”
Root cause: Queue depth and concurrency in the benchmark exceed what the app can sustain; tail latency is the real limiter.
Fix: Benchmark at realistic concurrency (threads, iodepth). Track p95/p99. Validate with app-level timing (fsync, query latency).
2) Symptom: “Reads are insanely fast, then suddenly slower after a few minutes”
Root cause: Page cache warming or controller cache absorbing; dataset too small or repeated.
Fix: Use direct IO or dataset larger than RAM; disable randrepeat; run longer tests and watch cache hit ratios where possible.
3) Symptom: “Writes look great in a 30-second test, terrible after an hour”
Root cause: SSD SLC cache / short-term buffering hides steady-state garbage collection and write amplification.
Fix: Precondition the device (fill and sustain writes), run long time-based tests, and measure steady-state latency percentiles.
4) Symptom: “Throughput is high, but interactive latency is awful during batch jobs”
Root cause: Fairness/priority issues: scheduler, queueing, or lack of IO isolation. Batch dominates device time.
Fix: Test mixed workloads; apply cgroup IO controls or workload scheduling; choose scheduler for latency fairness, not peak throughput.
5) Symptom: “Same benchmark on two ‘identical’ nodes differs massively”
Root cause: Firmware differences, PCIe slot/link width differences, thermal throttling, background scrubs, or filesystem state.
Fix: Inventory firmware/kernel settings; verify link speed/width; check temperatures; confirm background tasks; align filesystem/mount options.
6) Symptom: “Random write IOPS is fine, but fsync is slow”
Root cause: Benchmark isn’t issuing sync writes; durability path (cache flush, barriers, journal) is the bottleneck.
Fix: Use fio with --fsync=1 or sync engines; measure fsync directly in the app; validate controller write cache and power-loss protection assumptions.
7) Symptom: “Storage looks idle, yet latency is high”
Root cause: Upstream stalls (locks, CPU throttling, network retransmits) or intermittent device resets/retries.
Fix: Correlate app wait reasons; check dmesg; verify network stats if remote; inspect per-process IO waits and CPU steal time.
Checklists / step-by-step plan for honest benchmarking
Step-by-step plan: from “pretty numbers” to “decision-grade results”
- Write down the question. Are you choosing hardware? Validating a migration? Debugging a latency regression? Different questions require different tests.
- Extract workload characteristics. IO size distribution, read/write mix, sync behavior, concurrency, working set size, burstiness, and acceptable tail latency.
- Match the stack. Same filesystem, mount options, encryption/compression, replication, kernel, and drivers as production. No “close enough.”
- Decide what you will report. Always include block size, queue depth, numjobs, direct vs buffered, runtime, dataset size, and percentiles.
- Defeat fake speed. Dataset larger than caches, --randrepeat=0, run long enough to reach steady state, and avoid “first 10 seconds” bragging.
- Measure both system and app signals. iostat/pidstat/latency histograms plus application request latency and error rates.
- Test mixed workloads. One throughput job plus one latency probe. Production is not a single fio job running alone at midnight.
- Run at least three times. If results vary wildly, your system is unstable or your method is wrong. Either way, don’t ship it.
- Baseline and store artifacts. Keep fio job files, kernel versions, firmware versions, sysctls, and raw outputs (see the sketch after this list). Future-you will need them during an incident.
- Translate results into decisions. “p99 fsync < 5ms at 200 commits/sec” is a decision. “1M IOPS” is a poster.
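For the baseline step, a minimal sketch of an artifact-capturing wrapper; the job file name and output layout are hypothetical, adjust to your environment:
cr0x@server:~$ cat > run-baseline.sh <<'EOF'
#!/usr/bin/env bash
# Capture environment context plus one fio run into a timestamped directory.
set -euo pipefail
out="baseline-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$out"
uname -r > "$out/kernel.txt"
cat /proc/cmdline > "$out/cmdline.txt"
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS / > "$out/mounts.txt"
lsblk -o NAME,MODEL,SIZE,ROTA,TYPE > "$out/lsblk.txt"
sysctl vm.dirty_ratio vm.dirty_background_ratio > "$out/sysctl.txt"
fio --output-format=json --output="$out/lat4k.json" lat4k.fio
EOF
cr0x@server:~$ bash run-baseline.sh
Commit the directory somewhere durable. During an incident, “what changed since the last baseline” beats any dashboard.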
Benchmark review checklist (use this to catch lies in someone else’s report)
- Does the report include block size, read/write mix, iodepth, numjobs, runtime, and dataset size?
- Does it include p95/p99/p99.9 latency, not just average?
- Was the dataset larger than RAM and controller cache?
- Was direct IO used appropriately for the target workload?
- Was the device preconditioned for steady-state writes?
- Was the filesystem full/fragmented enough to represent production, or was it a fresh empty volume?
- Was there competing load, or was the benchmark run in isolation?
- Are there kernel logs showing errors/retries during the run?
- Do the results align with application-level timings?
What to avoid (strong opinions, earned the hard way)
- Do not accept a single-number benchmark. No single IOPS number survives contact with real workloads.
- Do not tune based on a synthetic test alone. Tuning changes behavior under contention; if you didn’t test contention, you didn’t test the risk.
- Do not benchmark on a system with unknown background tasks. Scrubs, rebuilds, backups, and indexing will produce “mysterious variance.” It’s not mysterious.
- Do not ignore tail latency. It is literally what users feel and what distributed systems amplify.
FAQ
1) Are synthetic benchmarks useless?
No. They’re excellent for controlled comparisons and for isolating variables. They’re useless when treated as a promise of production behavior without matching workload, stack, and contention.
2) What’s the single biggest reason benchmark results don’t match production?
Cache and concurrency mismatch. The benchmark often runs with a dataset that fits in cache and a queue depth the application can’t sustain.
The result is inflated throughput and hidden tail latency.
3) Should I always use direct IO in fio?
If your target workload uses direct IO (many databases do for data files), yes. If your workload depends on page cache (many web services do), benchmark buffered IO too.
The honest answer is: test the path you actually run.
4) Why does higher iodepth improve IOPS but sometimes hurt latency?
Queue depth increases parallelism and reordering opportunities, which boosts throughput. But it also increases queueing delay, especially under saturation.
Tail latency is where the pain shows up.
5) My vendor gave me benchmark numbers. How do I validate them quickly?
Re-run a minimal set: one 4k random read latency test with realistic iodepth/numjobs, one 4k sync write/fsync-focused test, and one sequential throughput test.
Ensure dataset size exceeds cache, and report percentiles.
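A minimal sketch of that validation set, assuming a scratch file big enough to defeat cache; adjust sizes and concurrency to your environment:
cr0x@server:~$ fio --name=randread-lat --filename=/mnt/testfile --size=64G --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=4 --runtime=120 --time_based --group_reporting
cr0x@server:~$ fio --name=sync-write --filename=/mnt/testfile --size=8G --rw=randwrite --bs=4k --ioengine=psync --fsync=1 --runtime=120 --time_based --group_reporting
cr0x@server:~$ fio --name=seq-read --filename=/mnt/testfile --size=64G --direct=1 --rw=read --bs=1M --ioengine=libaio --iodepth=8 --runtime=120 --time_based --group_reporting
If the vendor’s numbers only survive the third test, you learned something important.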
6) What does “steady state” mean for SSD benchmarking?
It means performance after the device has been written enough that garbage collection and wear leveling are active.
Short tests on a clean device often measure burst cache behavior, not long-term performance.
7) How do I benchmark a distributed storage system (network-attached) honestly?
Include the network and the client stack. Measure retransmits, CPU, and tail latency.
Run tests from multiple clients concurrently, because distributed systems often behave differently under fan-in load than under a single client.
8) Why do my fio results differ between runs?
Common causes: caching, background jobs, thermal throttling, device GC behavior, and filesystem state (fragmentation, free space).
If variance is high, treat it as a signal: your environment isn’t controlled, or your storage is unstable.
9) What metrics should I put on a benchmark report so it can’t be misused?
Include: workload definition (rw mix, bs, iodepth, numjobs, direct/buffered), dataset size, runtime, throughput/IOPS, latency percentiles, and system context (kernel, filesystem, mount options).
If any of those are missing, someone will “helpfully” misinterpret the result.
Conclusion: next steps that survive production
Synthetic benchmarks don’t lie because they’re malicious. They lie because they’re narrow, and people are optimistic.
Production systems are not optimistic. They are busy, concurrent, messy, and full of background work you forgot about.
Next steps that actually help:
- Pick 3–5 benchmark profiles that map to your real workload (including sync writes if you care about durability).
- Require latency percentiles and dataset size in every performance report.
- Baseline before changes, and store artifacts so regressions can be attributed to a specific change.
- Test mixed workloads, not just “fio alone on an empty volume.”
- When in doubt, trust the application’s timing over the storage dashboard.
The goal isn’t to ban synthetic benchmarks. The goal is to stop letting them approve decisions they didn’t measure.
Benchmarks are tools. Production is the exam. Act accordingly.