How Not to Get Tricked by “Marketing FPS”: Simple Rules

“It says 1,000,000 FPS.” That’s how it starts. A slide, a procurement spreadsheet, a manager who just wants the ticket closed, and a storage system that—mysteriously—can’t keep a database under 30 ms at peak.

“Marketing FPS” (call it IOPS, FPS, transactions, “ops,” or “up to”) is the number vendors use when they want you to stop asking questions. Production doesn’t care. Production cares about latency under load, tail behavior, read/write mix, working set, and what happens after the cache is cold and the system is moderately unhappy.

What “marketing FPS” really means (and what it hides)

When a vendor says “1M IOPS” (or their rebranded “FPS”), you should mentally add the missing footnotes:

  • Block size: almost always 4K (small blocks inflate operation counts).
  • Pattern: random read (reads are easier than writes; random hides sequential throughput limits).
  • Queue depth: high (deep queues pump IOPS while latency quietly spikes).
  • Cache: warmed, and sometimes not actually hitting media.
  • Data set: fits in cache or in an SLC write buffer, conveniently.
  • Duration: short enough to avoid steady-state behavior (especially for NAND and garbage collection).
  • Drives: the “top bin” model, not what you’ll be shipped after a quarter of supply chain “adjustments.”
  • Host: a benchmark server tuned like a race car, not like your VM farm.

Marketing numbers are not always lies. They’re often true for a narrow, curated scenario. The problem is that procurement people treat them like a warranty for your workload.

Here’s the reality: IOPS is not a capacity metric. It’s a point on a curve. Change one variable—block size, read/write mix, queue depth, data locality—and you’re on a different curve.

One quick sanity check: if a device promises enormous IOPS but can’t move much bandwidth, it’s probably a 4K headline. Example: 1,000,000 IOPS at 4K is about 4 GB/s. That’s real, but only if everything else cooperates and you can feed it.
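
A back-of-the-envelope check you can do in one line (assuming python3 is on the box; any calculator works):

cr0x@server:~$ python3 -c 'iops=1_000_000; bs=4096; print(iops*bs/1e9, "GB/s")'
4.096 GB/s

If the datasheet's bandwidth number can't support its IOPS number at the claimed block size, one of the two came from a different test.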

Joke #1: “Up to 1M IOPS” is like “up to 200 mph” on a rental car—technically possible, socially discouraged, and not covered by warranty.

Simple rules that keep you out of trouble

Rule 1: If you don’t have latency percentiles, you don’t have a performance claim

Ask for P50, P95, P99 latency at the claimed IOPS. Average latency is a bedtime story; tail latency is the plot. If the vendor won’t provide percentile latency under the same conditions, treat the IOPS number as decorative.

Rule 2: Treat queue depth like a knob that trades latency for headline IOPS

Deep queues make IOPS bigger and users sadder. Your application usually doesn’t run at QD=128 per device. Databases, search, and request/response services often have limited concurrency and care about low latency at moderate QD, not “maximum throughput at any cost.”

Rule 3: Separate “burst” from “sustained”

NVMe drives, arrays, and cloud volumes can look heroic for 10–60 seconds. Then they drop to their steady state. Demand at least 30 minutes of testing for write-heavy claims, and ensure the device is in steady state (more on that below).
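
Newer fio builds can detect steady state for you instead of trusting a fixed timer. This is a sketch, and the filename, thresholds, and runtime cap are assumptions to adapt:

cr0x@server:~$ sudo fio --name=sustained-write --filename=/data/fio.test --size=200G --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --time_based --runtime=2h --ss=iops_slope:0.3% --ss_dur=30m --ss_ramp=5m --group_reporting --ioengine=libaio

fio exits once IOPS stop drifting over the steady-state window, and that settled number is the one worth comparing against the brochure.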

Rule 4: If the dataset fits in cache, you’re benchmarking cache

Cache is good. Buying storage to be a cache is also good—if that’s what you intended. But don’t benchmark 20 GB on a system with 512 GB of RAM and call it “disk performance.” That’s “RAM performance with extra steps.”

Rule 5: Demand the full test recipe (block size, rw mix, QD, threads, runtime, preconditioning)

A single number without the recipe is not a benchmark; it’s a slogan. Require a reproducible fio job file or equivalent. If they can’t hand you that, their number is not useful for engineering decisions.
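
For reference, a complete recipe is small enough to paste into a ticket. This one is a placeholder (paths, sizes, and mix are assumptions), not a recommendation:

cr0x@server:~$ cat acceptance-mix.fio
[global]
ioengine=libaio
direct=1
time_based=1
runtime=30m
group_reporting=1
percentile_list=50:95:99:99.9

[mix70-30]
filename=/data/fio.test
size=200G
rw=randrw
rwmixread=70
bs=8k
iodepth=32
numjobs=8

If a vendor (or your own team) can hand over a file like this plus the host spec, the claim is reproducible; if not, it's decoration.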

Rule 6: Use “latency at required throughput” as your sizing target

You don’t buy “maximum IOPS.” You buy the ability to stay under, say, 5 ms P99 at your peak mixed workload. Start with your SLOs and work backwards.
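
One way to make that concrete: cap fio at your peak rate and read the percentiles, instead of letting it run flat out. The rates and mix below are placeholders; note that --rate_iops applies per job, so the total offered load here is 4x those numbers:

cr0x@server:~$ sudo fio --name=slo-check --filename=/data/fio.test --size=100G --direct=1 --rw=randrw --rwmixread=70 --bs=8k --iodepth=16 --numjobs=4 --rate_iops=8000,3500 --time_based --runtime=30m --group_reporting --ioengine=libaio --percentile_list=50:95:99

If P99 stays inside your SLO at that pinned rate, the device is sized for the job; how much higher the unconstrained IOPS number can go is trivia.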

Rule 7: Watch for write amplification and garbage collection

Flash systems can degrade under sustained random writes, especially near high utilization. Precondition drives. Test at realistic fill levels. Ask the vendor what happens at 70% full, not 7%.

Rule 8: If the claim ignores the host, it’s incomplete

Drivers, multipathing, CPU, NUMA, interrupt handling, filesystem, and encryption can bottleneck long before media. A storage claim that doesn’t specify host hardware and software is only half a claim.

Rule 9: “IOPS” without read/write mix is meaningless

70/30 read/write behaves nothing like 100% read. Neither does 30/70. If your workload is mixed, your benchmark must be mixed, and your acceptance criteria must be mixed.

Rule 10: Always test the failure mode you can’t avoid

Rebuilds happen. Background scrubbing happens. One path fails. One controller reboots. If you only test the sunny day, production will schedule a thunderstorm.

Interesting facts and context (storage marketing has a history)

  • Fact 1: “IOPS” rose to fame because early disk arrays could hide terrible seek times behind caching and parallel spindles—so vendors needed a single, countable number for random access.
  • Fact 2: The industry standardized on 4K random read for headline IOPS partly because it matches common sector and OS page sizes, and because small blocks inflate the operation count compared to the 8K/16K pages many databases actually use.
  • Fact 3: Storage benchmarks have been contentious since the SPEC and TPC days; vendors have long tuned configurations to win benchmark runs rather than match typical deployments.
  • Fact 4: Many SSDs use an SLC write cache (even when the NAND is TLC/QLC). It can create spectacular short bursts that vanish in steady state.
  • Fact 5: NAND flash requires erase-before-write at the block level; garbage collection and wear leveling are why steady-state random write performance is often much lower than “fresh out of box.”
  • Fact 6: Queue depth became a marketing lever when NVMe made deep parallelism cheap; high QD can keep the device busy, but also hides latency spikes until your app falls over.
  • Fact 7: In the HDD era, a “15K RPM” drive might do roughly 175–210 random IOPS; arrays got to tens of thousands by striping across many spindles and caching aggressively.
  • Fact 8: Cloud volumes often have explicit burst credits or throughput caps; performance can be contractual but time-limited, which makes short benchmarks deceptively flattering.
  • Fact 9: Early RAID controllers with battery-backed write cache could make writes look fast until a cache flush; the “surprise latency cliff” is older than many of today’s SRE teams.

The only metrics that matter in real systems

1) Latency percentiles (P50/P95/P99) over time

You want a time series, not a single summary. If P99 walks upward during the run, you’re watching cache exhaustion, thermal throttling, garbage collection, or background work.
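
fio can record that time series for you. A minimal sketch (the log prefix is arbitrary, and with --log_avg_msec the logged values are per-second averages rather than percentiles):

cr0x@server:~$ sudo fio --name=randread4k --filename=/data/fio.test --size=40G --direct=1 --rw=randread --bs=4k --iodepth=16 --numjobs=4 --time_based --runtime=600 --group_reporting --ioengine=libaio --write_lat_log=randread4k --log_avg_msec=1000

Plot the resulting randread4k_clat.*.log files; a curve that climbs as the run progresses is the cache-exhaustion/GC signature this paragraph describes.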

2) IOPS and bandwidth together

IOPS without MB/s is how you get a system that looks great on paper and can’t do backups. Bandwidth without IOPS is how you get a system that streams nicely and stalls on metadata.

3) Queue depth and concurrency

Measure the IO queue you actually run. If the app only has 16 outstanding IOs per node, a benchmark at QD=256 is irrelevant.
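
You can see the concurrency your real workload produces without any benchmark at all; the device name is an example and the numbers are illustrative:

cr0x@server:~$ cat /sys/block/nvme0n1/inflight
       2       11

The two columns are reads and writes currently in flight. Sample it while the application runs at peak (or watch aqu-sz in iostat -xz) and benchmark at that depth, not at whatever makes the IOPS chart tallest.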

4) Read/write mix and locality

Random vs sequential is not a binary. Many workloads are “mostly sequential with annoying random metadata,” which is why they feel fine until they don’t.

5) Steady state and fill level

Test at realistic utilization and after preconditioning. Flash performance varies wildly by how full and how “dirty” it is.

6) Tail under contention

Production has neighbors: background compaction, snapshots, rebuilds, scrubs, antivirus, log shipping, other tenants, and your own patching windows. You need numbers under mild pain, not only under no pain.

One paraphrased idea, because it’s too true to ignore: Gene Kranz (NASA flight director) is commonly associated with “failure is not an option,” which in ops translates to “test the ugly path now, not later.”

Practical tasks: commands, outputs, and the decision you make

These are the sorts of checks you can run on a Linux host to avoid getting hypnotized by a single IOPS number. Each task includes: a command, what typical output means, and the decision you make.

Task 1: Identify the storage device and its model (don’t benchmark the wrong thing)

cr0x@server:~$ lsblk -o NAME,MODEL,SIZE,ROTA,TYPE,MOUNTPOINT,FSTYPE
NAME        MODEL               SIZE ROTA TYPE MOUNTPOINT FSTYPE
nvme0n1     Samsung SSD 990 PRO 1.8T    0 disk
├─nvme0n1p1                     512M    0 part /boot      ext4
└─nvme0n1p2                     1.8T    0 part /data      xfs

What it means: You’re on NVMe (ROTA=0), model is visible, and you can see where it’s mounted.

Decision: Benchmark the raw device or a dedicated test file on the correct filesystem. If the model is not what you expected, stop and fix procurement/host build assumptions.

Task 2: Confirm the kernel sees NVMe at the expected link speed and width

cr0x@server:~$ sudo nvme id-ctrl /dev/nvme0 | egrep -i 'mn|fr|mdts'
mn      : Samsung SSD 990 PRO 2TB
fr      : 5B2QJXD7
mdts    : 9

What it means: Correct model/firmware; MDTS hints max transfer size. Not the full link check, but a basic identity sanity check.

Decision: If the device identity is unexpected, don’t continue. Wrong firmware can also mean known throttling/bug behavior; align with your platform baseline.
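
The link part of the check lives in sysfs for a directly attached PCIe drive; the values shown are what a healthy Gen4 x4 device would report, and the exact output format varies by kernel:

cr0x@server:~$ cat /sys/class/nvme/nvme0/device/current_link_speed /sys/class/nvme/nvme0/device/current_link_width
16.0 GT/s PCIe
4

If the drive trained at a lower generation or narrower width (wrong slot, flaky riser, bifurcation surprise), no amount of tuning will reach the datasheet numbers.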

Task 3: Check if the drive is thermally throttling (classic “benchmark starts fast, ends sad”)

cr0x@server:~$ sudo nvme smart-log /dev/nvme0 | egrep -i 'temperature|warning|critical'
temperature                             : 71 C
warning_temp_time                       : 12
critical_comp_time                      : 0

What it means: The drive hit warning temperature for 12 minutes. That’s a big red arrow pointing at throttling.

Decision: Fix cooling/airflow/slot placement before believing any “sustained” performance test.

Task 4: Verify filesystem mount options (you can accidentally benchmark journaling or barriers)

cr0x@server:~$ mount | grep ' /data '
/dev/nvme0n1p2 on /data type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)

What it means: XFS with default-ish options; nothing obviously broken.

Decision: If you see unexpected options (e.g., sync, or exotic barrier changes), align with production configuration before testing.

Task 5: See if you’re already IO-limited (high utilization) before running fio

cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (server) 	01/21/2026 	_x86_64_	(32 CPU)

avg-cpu:  %user %nice %system %iowait  %steal   %idle
          12.1  0.0    4.2     8.7     0.0    75.0

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   w_await aqu-sz  %util
nvme0n1         9.0     320.0     0.0     0.0    1.2     35.6    220.0   8200.0     6.9    1.8   92.4

What it means: %util is ~92%, w_await is 6.9 ms, and iowait is non-trivial. On a parallel NVMe device, %util alone overstates saturation, but combined with rising await it means the device is already working hard.

Decision: If this is a production host, you may already be at the limit—benchmarking now will mix workload noise into results. For troubleshooting: focus on who’s doing IO and whether latency matches your SLO.

Task 6: Find which processes are generating IO (don’t blame storage for an app bug)

cr0x@server:~$ sudo iotop -o -b -n 3
Total DISK READ:         0.00 B/s | Total DISK WRITE:       62.31 M/s
  PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
14221 be/4 postgres    0.00 B/s    41.72 M/s  0.00 % 65.12 % postgres: checkpointer
18802 be/4 root        0.00 B/s    18.49 M/s  0.00 % 22.30 % /usr/bin/rsync -a /var/lib/app/ /backup/

What it means: Postgres checkpoint and an rsync are doing most writes. That is explainable, not mysterious.

Decision: Fix scheduling/coordination (move backup, tune checkpointing) before spending money on storage.

Task 7: Measure latency directly with fio (and record percentiles)

cr0x@server:~$ sudo fio --name=randread4k --filename=/data/fio.test --size=40G --direct=1 --rw=randread --bs=4k --iodepth=16 --numjobs=4 --runtime=120 --time_based --group_reporting --ioengine=libaio --percentile_list=50:95:99:99.9
randread4k: (groupid=0, jobs=4): err= 0: pid=21401: Tue Jan 21 11:10:10 2026
  read: IOPS=185k, BW=723MiB/s (758MB/s)(84.7GiB/120001msec)
    clat (usec): min=62, max=4210, avg=334.18, stdev=91.22
     lat (usec): min=64, max=4215, avg=336.49, stdev=91.30
    clat percentiles (usec):
     | 50.000000th=[  331], 95.000000th=[  470], 99.000000th=[  620], 99.900000th=[  980]

What it means: Good IOPS, and importantly, P99 is 620 µs at this load. That’s a real engineering number.

Decision: If your app needs <2 ms, this is fine. If you need <200 µs, you must reduce queue depth, change architecture, or use faster media/topology.

Task 8: Show how queue depth inflates IOPS while latency worsens

cr0x@server:~$ sudo fio --name=randread4k_qd128 --filename=/data/fio.test --size=40G --direct=1 --rw=randread --bs=4k --iodepth=128 --numjobs=4 --runtime=120 --time_based --group_reporting --ioengine=libaio --percentile_list=50:95:99
randread4k_qd128: (groupid=0, jobs=4): err= 0: pid=21455: Tue Jan 21 11:13:01 2026
  read: IOPS=420k, BW=1641MiB/s (1721MB/s)(192GiB/120001msec)
    clat (usec): min=85, max=25000, avg=1188.74, stdev=655.10
    clat percentiles (usec):
     | 50.000000th=[ 1056], 95.000000th=[ 2200], 99.000000th=[ 3900]

What it means: IOPS more than doubled, but median latency is now ~1 ms and P99 is 3.9 ms. This is the “marketing FPS” trap in one screen.

Decision: Choose QD that matches your workload. For latency-sensitive services, accept lower IOPS to keep tail latency inside SLO.

Task 9: Test mixed workload (70/30) because that’s what real life looks like

cr0x@server:~$ sudo fio --name=mix70_30 --filename=/data/fio.test --size=80G --direct=1 --rw=randrw --rwmixread=70 --bs=8k --iodepth=32 --numjobs=8 --runtime=300 --time_based --group_reporting --ioengine=libaio --percentile_list=50:95:99
mix70_30: (groupid=0, jobs=8): err= 0: pid=21510: Tue Jan 21 11:20:01 2026
  read: IOPS=110k, BW=859MiB/s (901MB/s)
    clat percentiles (usec):
     | 50.000000th=[  540], 95.000000th=[ 1500], 99.000000th=[ 2800]
  write: IOPS=47.1k, BW=368MiB/s (386MB/s)
    clat percentiles (usec):
     | 50.000000th=[  810], 95.000000th=[ 2600], 99.000000th=[ 5200]

What it means: Writes are slower and have worse tail latency. That’s normal—and it’s why “100% read IOPS” doesn’t size a database.

Decision: If write P99 is too high, reduce write amplification (batching, WAL placement), add devices, or move to a system optimized for write consistency.

Task 10: Precondition an SSD before trusting sustained write tests

cr0x@server:~$ sudo fio --name=precond --filename=/dev/nvme0n1 --direct=1 --rw=write --bs=1M --iodepth=32 --numjobs=1 --runtime=1800 --time_based --ioengine=libaio --group_reporting
precond: (groupid=0, jobs=1): err= 0: pid=21602: Tue Jan 21 12:00:01 2026
  write: IOPS=2900, BW=2900MiB/s (3041MB/s)(5100GiB/1800000msec)

What it means: You wrote ~5 TB over 30 minutes; you’re pushing the drive into a more realistic state for subsequent random write tests.

Decision: If you can’t precondition (because it’s shared production), you cannot honestly claim “sustained” write performance from a short benchmark.

Task 11: Detect read-ahead or page cache cheating (accidental or deliberate)

cr0x@server:~$ sudo fio --name=cached-read --filename=/data/fio.test --size=4G --rw=read --bs=1M --iodepth=1 --numjobs=1 --runtime=30 --time_based --group_reporting
cached-read: (groupid=0, jobs=1): err= 0: pid=21688: Tue Jan 21 12:10:01 2026
  read: IOPS=9200, BW=9200MiB/s (9647MB/s)(270GiB/30001msec)

What it means: 9,200 MiB/s (about 9.6 GB/s) from a single NVMe in a typical server is… suspicious, and more than this class of drive can physically deliver. You’re reading from cache (page cache) because direct IO wasn’t used and the dataset is small.

Decision: Add --direct=1, increase dataset beyond RAM, and re-run. Do not use this result for storage sizing.
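
A corrected re-run might look like the line below; the 200G size is an assumption, just pick something comfortably larger than the host's RAM:

cr0x@server:~$ sudo fio --name=uncached-read --filename=/data/fio.test --size=200G --direct=1 --rw=read --bs=1M --iodepth=8 --numjobs=1 --time_based --runtime=60 --group_reporting --ioengine=libaio

Expect a number that looks like one NVMe device, not like your memory bus.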

Task 12: Check whether the device is the bottleneck or CPU is

cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (server) 	01/21/2026 	_x86_64_	(32 CPU)

11:25:01 AM  CPU   %usr %nice %sys %iowait %irq %soft %steal %idle
11:25:02 AM  all   38.2  0.0  18.9   0.7   0.0  2.1    0.0  40.1
11:25:02 AM   7   92.0  0.0   6.0   0.0   0.0  0.0    0.0   2.0

What it means: One CPU is pegged. That could be IRQ affinity, a single fio job pinned, or a driver bottleneck.

Decision: If CPU is saturated, “faster storage” won’t help. Fix CPU pinning, IRQ distribution, or increase parallelism appropriately.

Task 13: Look for IO scheduler and queue settings (small knobs, big consequences)

cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq

What it means: NVMe is using none (often correct). Other schedulers can change latency distribution.

Decision: Don’t cargo-cult. If you have tail latency problems under mixed load, test mq-deadline/kyber with your actual workload profile and measure percentiles.
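
If you do run that experiment, switching schedulers is immediate and reversible; the device name is an example, and this affects live IO, so do it in a test window:

cr0x@server:~$ echo mq-deadline | sudo tee /sys/block/nvme0n1/queue/scheduler
mq-deadline
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
none [mq-deadline] kyber bfq

Re-run the mixed-workload fio profile and compare percentiles before and after; keep whichever distribution your SLO prefers.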

Task 14: Validate that TRIM/discard isn’t sabotaging you (or that it exists when you need it)

cr0x@server:~$ lsblk -D -o NAME,DISC-GRAN,DISC-MAX,DISC-ZERO
NAME      DISC-GRAN DISC-MAX DISC-ZERO
nvme0n1   512B      2T       0

What it means: Discard is supported. Good—now you can choose how to use it (periodic fstrim vs continuous discard).

Decision: If you have sustained write degradation, consider scheduled fstrim off-peak and confirm it doesn’t collide with critical IO windows.
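
On most systemd distributions a weekly trim ships as fstrim.timer; a manual run shows how much space gets released (output is illustrative):

cr0x@server:~$ sudo fstrim -v /data
/data: 812.4 GiB (872313110528 bytes) trimmed

If that run takes minutes and causes a latency blip, that is exactly why you schedule it off-peak instead of letting continuous discard fire during business hours.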

Task 15: Observe per-process IO in real time during an incident

cr0x@server:~$ sudo pidstat -d 1 3
Linux 6.5.0 (server) 	01/21/2026 	_x86_64_	(32 CPU)

11:40:01 AM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
11:40:02 AM   999     14221      0.00  48200.00     0.00  postgres
11:40:02 AM     0     18802      0.00  20600.00     0.00  rsync

What it means: You can correlate IO bursts with process activity, which narrows “storage is slow” into “these two processes are colliding.”

Decision: Implement IO scheduling, cgroups, or maintenance windows. Don’t start with a storage migration unless you’ve proven media saturation.

Task 16: Check md RAID rebuild or background tasks (the invisible performance tax)

cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[0] sdc1[1]
      976630336 blocks super 1.2 [2/2] [UU]
      [======>..............]  resync = 34.1% (333333333/976630336) finish=120.2min speed=89000K/sec

What it means: Resync in progress. Your “normal” performance is not available right now.

Decision: Adjust rebuild speed limits, defer heavy jobs, or accept degraded SLO temporarily. Also: incorporate rebuild testing into your pre-purchase benchmarks.
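
The rebuild throttle is a pair of global sysctls (values in KiB/s per device); the first two numbers below are kernel defaults and the change is an illustrative reduction, not a recommendation:

cr0x@server:~$ sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
dev.raid.speed_limit_min = 1000
dev.raid.speed_limit_max = 200000
cr0x@server:~$ sudo sysctl -w dev.raid.speed_limit_max=50000
dev.raid.speed_limit_max = 50000

Lowering the max trades a longer rebuild window for better foreground latency; raise it back once the array is healthy.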

Fast diagnosis playbook: find the bottleneck quickly

This is the checklist you use when someone pings “storage is slow” and you have 10 minutes before the incident channel turns into interpretive dance.

First: confirm it’s storage latency, not application time

  • Check request latency breakdowns (app metrics) if available.
  • On the host, check iostat -xz 1 for await, aqu-sz, and %util.
  • If %iowait is high but %util is low, suspect something upstream (network storage path, throttling, or per-cgroup limits).

Second: identify the constraint class (device, path, CPU, or policy)

  • Device constraint: %util ~100%, await rising with load, fio reproduces it on the same LUN/device.
  • Path constraint: multipath issues, link errors, congestion; latency spikes without local %util pegging.
  • CPU constraint: one core pegged (interrupts, encryption, checksums), storage underutilized.
  • Policy constraint: cloud volume IOPS cap, SAN QoS, noisy neighbor throttling.

Third: prove it with one targeted test

  • Run a short fio read test with --direct=1 on the affected volume, matching block size and queue depth that resembles the workload.
  • Compare to baseline numbers you trust (from your own runbooks, not a vendor slide).
  • If fio looks fine but the app is slow, the bottleneck is likely in the app, filesystem metadata, lock contention, or network path.

Fourth: decide the quickest safe mitigation

  • Reduce competing IO (pause batch jobs, slow rebuild, move backups).
  • Reduce write amplification (checkpoint tuning, batching, compaction pacing).
  • Scale out reads (replicas) or shard hot partitions.
  • Only then: scale up storage or migrate.

Joke #2: If the vendor benchmark says “zero latency,” they accidentally measured the sales team’s empathy.

Common mistakes: symptom → root cause → fix

1) “We hit the advertised IOPS, but the app is slower”

Symptom: fio at QD=256 shows huge IOPS; app still times out or P99 explodes.

Root cause: The app runs at low concurrency; deep queue benchmarks traded latency for IOPS.

Fix: Benchmark at realistic QD (often 1–32), and size for P99 latency at that QD. If you need both low latency and high throughput, scale horizontally or use multiple devices/paths.

2) “It was fast yesterday; today it’s half the speed”

Symptom: Sustained random write performance drops after some runtime or higher fill level.

Root cause: SSD out-of-box vs steady-state, garbage collection, SLC cache exhaustion, or thin-provisioned pool pressure.

Fix: Precondition, test longer, test at realistic fill, and ensure adequate overprovisioning. Consider periodic TRIM strategy and avoid running pools at the ragged edge.

3) “Reads are fine; writes occasionally stall for seconds”

Symptom: Mostly OK, then periodic multi-second latency spikes on writes.

Root cause: Cache flush events, journal commits, controller destage, or filesystem metadata pressure.

Fix: Check writeback settings, controller cache policy, filesystem journal sizing, and whether background tasks (snapshots, scrub) coincide with stalls.

4) “Adding an optimization made performance worse”

Symptom: After tuning “for speed,” tail latency gets worse or throughput drops.

Root cause: Misapplied mount options, wrong IO scheduler, too-aggressive readahead, or compression/encryption overhead on hot paths.

Fix: Revert, then test changes one at a time with percentiles. Treat tuning as experiments, not beliefs.

5) “SAN is slow, but the array dashboards say everything is green”

Symptom: Host shows high await; array shows low utilization.

Root cause: Path issues (multipath misconfig), congestion, retransmits, HBA queue limits, or QoS throttling per initiator.

Fix: Validate multipath state, check queue depths, look for link errors, and confirm no QoS policies are capping you.

6) “Cloud volume benchmarks great for 30 seconds, then collapses”

Symptom: First minute is amazing; later you hit a wall.

Root cause: Burst credits or baseline caps.

Fix: Benchmark long enough to drain credits. Size for baseline, not burst. If burst is part of your design, prove credit refill behavior under your duty cycle.

Three corporate mini-stories (anonymized, plausible, and technically accurate)

Mini-story 1: The incident caused by a wrong assumption

They replaced a noisy set of aging disks with a shiny array that “did 800k ops.” The project plan had one performance line item: “new storage faster than old storage.” That was the entire acceptance criterion. It passed in a one-hour test window. Everyone went home.

Two weeks later, month-end processing began. The database wasn’t CPU-bound and it wasn’t locked. It was just… slow. Batch jobs backed up. API latencies crept upward until the customer support channel started doing its own monitoring: “Is the system down?”

The storage array was delivering the promised ops, but at a queue depth the application never produced. The array’s caching made reads look brilliant; writes, under a mixed workload with synchronous commits, were bottlenecked by a policy they didn’t realize they’d enabled: write acknowledgement was pinned to a conservative protection mode and small-block random writes were being serialized by a narrow internal path.

The wrong assumption wasn’t “vendors lie.” It was subtler: the team assumed that a single IOPS number meant “faster in every way,” and that protection policies had no performance personality.

Fixing it didn’t require heroics. They collected percentiles at realistic concurrency, changed the protection setting for the specific volume class, and separated WAL/log IO onto a lower-latency tier. Month-end stopped being a ritual sacrifice.

Mini-story 2: The optimization that backfired

A platform team tried to “unlock performance” by raising queue depths everywhere. They increased NVMe queue settings, tweaked multipath parameters, and encouraged teams to run with more outstanding IO. Benchmarks improved. The slide deck was beautiful.

Then production traffic changed shape. A small number of tenants began generating bursts of random reads and writes at the same time as background maintenance: snapshots, compactions, and a rebuild on one node. Throughput looked fine, but tail latency became a horror anthology. Request timeouts climbed because services had strict deadlines and couldn’t wait behind long device queues.

The backfire was predictable: they optimized for aggregate IOPS, not for latency SLOs. Deep queues hid the pain by keeping devices busy, but user-facing services need short queues and quick answers. A device at 100% utilization is not “efficient” if it turns your 99th percentile into a ransom note.

The remediation was to treat queue depth as a per-workload setting. Latency-sensitive services got lower queue depth and stronger isolation. Batch workloads got to use the deep queues when the platform had headroom. They also added IO cgroup controls to keep “helpful” background jobs from eating the lunch of interactive traffic.

The funniest part was how boring the final graph looked. Tail latency flattened; peak IOPS numbers fell. Everyone stopped paging at 2 a.m., which is the only KPI that matters.

Mini-story 3: The boring but correct practice that saved the day

A different company had a rule: every storage platform had a short, versioned benchmark recipe. Same fio jobs, same runtime, same preconditioning steps, same fill targets. No exceptions. Engineers complained, quietly, because consistency is not exciting.

One Friday, a new batch of “equivalent” SSDs arrived due to a supply substitution. The system came up, workloads migrated, and within hours they saw slightly higher P99 write latency. Not catastrophic—just weird. The on-call pulled the standard benchmark results from their internal baseline and ran the same suite on one node in isolation.

The difference was obvious: sustained random write performance degraded faster and the tail was thicker. The drives weren’t broken; they were different. Firmware, NAND type, caching behavior—something had changed.

Because they had boring baseline data, they didn’t argue about vibes. They quarantined the new batch to a less latency-sensitive tier, updated procurement constraints, and required a qualification run for future substitutions. No incident, no customer impact, no weekend lost.

That’s what “operational excellence” often looks like: a graph nobody will ever present, and a pager that stays silent.

Checklists / step-by-step plan

Procurement sanity checklist (before you sign anything)

  1. Demand the benchmark recipe: block size, rw mix, QD, threads, runtime, preconditioning, dataset size.
  2. Demand percentile latency: at least P50/P95/P99 at the claimed load.
  3. Demand steady-state behavior: 30+ minutes for write-heavy; include charts if possible.
  4. Test at realistic fill levels: at least 60–80% for flash pools if that’s how you’ll run.
  5. Specify failure-mode performance: rebuild/scrub/controller failover impacts and recovery time.
  6. Specify host requirements: CPU, PCIe generation, HBA, driver versions, multipath settings.
  7. Clarify burst vs baseline: especially for cloud volumes and cached arrays.

Benchmark execution plan (repeatable and defensible)

  1. Reserve a clean host: disable unrelated cron jobs, backups, and noisy agents for the window.
  2. Validate identity and health: model/firmware, SMART, temperature behavior.
  3. Pin test variables: same fio version, same kernel, same mount options, same NUMA policy where possible.
  4. Precondition where relevant: especially for sustained write tests.
  5. Run multiple profiles: at least 4K randread, 4K randwrite, 70/30 mix, and a sequential throughput test.
  6. Record percentiles and time series: not just summary output; repeat runs.
  7. Compare to your SLO: accept/reject based on latency at required throughput, not peak IOPS.

Production acceptance checklist (after deployment)

  1. Establish baselines: iostat/fio snapshots under known-good conditions.
  2. Instrument latency: collect P95/P99 at the host and app layer.
  3. Test one failure mode: pull a path, trigger a controlled failover, or simulate a rebuild window.
  4. Validate isolation: ensure batch jobs can’t starve interactive workloads.
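
For item 4, cgroup v2 io.max is the blunt but effective tool. A minimal sketch, assuming cgroup v2 with the io controller enabled and a "batch" group that already exists; the device numbers and limits are placeholders:

cr0x@server:~$ cat /sys/class/block/nvme0n1/dev
259:0
cr0x@server:~$ echo "259:0 wbps=209715200 riops=20000" | sudo tee /sys/fs/cgroup/batch/io.max
259:0 wbps=209715200 riops=20000

That caps the batch group at roughly 200 MiB/s of writes and 20k read IOPS on that device; watch interactive P99 before and after to confirm the isolation is actually doing its job.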

FAQ

Q1: Is IOPS useless?

No. It’s just incomplete. IOPS is useful when paired with block size, queue depth, read/write mix, and latency percentiles. Otherwise it’s a number that helps someone win a meeting.

Q2: What’s a reasonable queue depth to benchmark?

Benchmark the queue depth your workload produces. If you don’t know, measure it indirectly via observed outstanding IO and latency behavior. For many latency-sensitive services, QD in the 1–32 range per device is more representative than 128+.

Q3: Why do vendors always use 4K random read?

Because it produces a large, impressive operations number, and it’s a legitimate pattern for some workloads. It is not, however, a universal proxy for “fast storage.”

Q4: How long should I run fio?

Long enough to see steady state and tail behavior. For reads, a few minutes may be OK if the dataset exceeds cache. For sustained writes, 30 minutes is a better starting point, and longer is often justified.

Q5: Should I benchmark on raw block devices or files?

If you want device capability, use raw. If you want “what my application gets,” benchmark through the same filesystem stack you’ll run in production. Both are valid; mixing them without saying so is how teams argue for days.

Q6: Why does performance drop when the drive is fuller?

Flash translation layers need free blocks to manage writes efficiently. As free space shrinks, garbage collection costs rise, and write amplification increases. Many systems look great at 5–10% full and very different at 70–90%.

Q7: What about “FPS” from an application benchmark instead of fio?

Application benchmarks are better for end-to-end sizing—if they reflect your access pattern and concurrency. But they can still be gamed with caches, unrealistic datasets, or disabled durability settings. Treat them with the same skepticism: demand the recipe and the percentiles.

Q8: How do I compare local NVMe vs SAN vs cloud block storage?

Compare latency percentiles at your required throughput, and include failure behavior. SAN and cloud often add path latency and variability. Local NVMe is usually lower latency but less shared and may have different operational risks (replacement, mirroring, node failure).

Q9: Can I trust a benchmark run inside a VM?

You can, if you understand the virtualization layer: shared host queues, throttling policies, and noisy neighbors can dominate results. For capacity planning, test in the same environment you’ll run. For device qualification, test on bare metal or with strict isolation.

Q10: What’s the simplest acceptance criterion that isn’t stupid?

Pick a representative workload profile and require P99 latency under X ms at Y IOPS/MB/s for at least Z minutes, with dataset larger than cache and with durability settings matching production.

Conclusion: practical next steps

If you only remember three moves, make them these:

  1. Refuse single-number performance claims. Ask for the full recipe and latency percentiles.
  2. Benchmark like you operate. Realistic queue depth, realistic mix, realistic dataset size, realistic duration.
  3. Size to your SLO, not to a headline. Buy “P99 under load,” not “up to” anything.

Then do the unglamorous thing that makes you look competent later: write down your benchmark job files, keep baseline outputs, and re-run them after changes. When the next “but the brochure says…” argument happens, you’ll have data, not opinions.
