You bought NVMe. The benchmark charts looked like a religious experience. Then production happened: p99 latency spikes, CPU “idle”
but queries stuck, and a storage graph that looks like a cardiogram. Someone says “it’s just flushing,” someone else says “increase io_capacity,”
and a third person suggests “turn off durability for performance” like that’s a grown-up sentence.
This is the practical guide for running MySQL or MariaDB on NVMe without guessing. We’ll tune redo logs, flush behavior, and IO capacity
with commands you can run today—plus the failure modes you’ll hit if you don’t.
Facts & historical context (the stuff that explains the weirdness)
- InnoDB wasn’t born on flash. Early InnoDB assumptions fit spinning disks: orderly sequential writes were king, random was costly. NVMe flips the pain profile.
- The doublewrite buffer exists because torn pages exist. Power loss during a 16KB page write can leave half-old, half-new data. Doublewrite is the “seatbelt.”
- MySQL 8.0 changed redo logging format. The “new redo” (and ongoing work around atomic writes) changed some performance and recovery characteristics versus older releases.
- MariaDB diverged on features and defaults. It kept InnoDB (shipping the XtraDB fork in earlier releases), but behavior, variables, and implementation details differ enough to matter for tuning.
- Group commit is the reason you can have durability and throughput. Many transactions share one fsync when the engine batches them. Your workload decides if you get that win.
- NVMe is not “a faster disk.” It’s a different interface with deep queues and parallelism. A single-threaded fsync-heavy workload can still look slow.
- Checkpointing is the hidden tax. Redo logs let InnoDB defer flushing, but the bill arrives as checkpoint pressure and page cleaner work.
- Linux’s block layer evolved for flash. Older tunings assumed SATA/SAS; modern NVMe often wants “none” scheduler and careful writeback settings more than old-school queue hacks.
- Cloud NVMe is sometimes “NVMe-shaped.” Many instances expose NVMe devices backed by networked storage. Latency variance can be real even when throughput is high.
One quote worth keeping on a sticky note when you’re tempted to “just tune it until it’s fast”:
Hope is not a strategy.
— General Gordon R. Sullivan
The NVMe + InnoDB mental model: what’s actually happening
The easiest way to lose a week is to treat InnoDB as “it writes stuff to disk” and NVMe as “it writes stuff fast.”
In production, what matters is which writes, when they happen, and how spiky they are.
Three IO streams you must separate in your head
- Redo log writes: sequential-ish appends to ib_logfile/redo log files. These are gated by flush policy and fsync cost.
- Data page flushes: dirty 16KB pages written back to tablespaces as background activity (page cleaners) or under pressure.
- Doublewrite writes: extra writes used to make page flushes crash-safe (unless you use atomic writes / settings that change this).
NVMe helps all of them, but not equally. Redo log writes are typically small and latency-sensitive (fsync). Data page flushes are throughput-heavy.
Doublewrite can become “death by a thousand small writes” if misconfigured, or it can be mostly invisible if aligned with the device’s strengths.
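You can watch all three streams from the server's own counters. A minimal sketch (these are standard InnoDB status variables on both MySQL and MariaDB; sample twice a known interval apart and diff to turn counters into rates):
# Redo stream: Innodb_os_log_written (bytes). Data-page stream: Innodb_buffer_pool_pages_flushed.
# Doublewrite stream: Innodb_dblwr_pages_written and Innodb_dblwr_writes.
cr0x@server:~$ mysql -Nse "SHOW GLOBAL STATUS WHERE Variable_name IN ('Innodb_os_log_written','Innodb_buffer_pool_pages_flushed','Innodb_dblwr_pages_written','Innodb_dblwr_writes');"
If the doublewrite counters grow roughly in step with pages flushed, that's expected; if redo bytes dwarf everything during a latency spike, you're looking at a commit-path problem, not a flushing problem.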
Why your p99 latency spikes even on fast NVMe
InnoDB is an accountant. It happily lets you spend IO credit by buffering dirty pages in memory. When it needs to pay the debt—because redo is near full,
or dirty page percentage is too high, or a checkpoint must advance—it can force foreground threads to help flush.
That’s where you get stalls: transactions waiting on log flush, page flush, or both.
Joke #1: NVMe makes bad flush policy faster, like giving a toddler espresso—things happen quicker, but not better.
MySQL vs MariaDB: what differs for redo/flush/IO capacity
Both engines speak InnoDB, but they don’t ship the same InnoDB, and they don’t always behave identically. If you copy a tuning guide
from one to the other, you can land in the uncanny valley: “it starts, but it’s haunted.”
Redo log configuration differences
- MySQL 8.0: uses innodb_redo_log_capacity (the modern knob) and manages redo log files internally. Many older guides still talk about innodb_log_file_size; that's the old world.
- MariaDB: commonly uses innodb_log_file_size and innodb_log_files_in_group (depending on version). It may not expose the same redo capacity abstraction.
IO capacity and background threads
The semantics of innodb_io_capacity and innodb_io_capacity_max are similar in spirit but don’t assume identical behavior under pressure.
Page cleaner tuning and flush neighbor settings can differ as well.
Atomic writes and doublewrite behavior
This is where “NVMe tuning” gets real. Some deployments rely on filesystem + device guarantees (atomic 16KB writes, DAX-like behavior, or device features)
to reduce the need for doublewrite. Many do not. And cloud NVMe often does not behave like your lab SSD.
Treat doublewrite as required unless you prove your full stack can survive torn pages.
The operational takeaway: pick a server (MySQL or MariaDB), then tune using that server’s variables and status counters.
“Same InnoDB” is close enough for architecture diagrams, not for outage avoidance.
Redo logs on NVMe: size, layout, and checkpoint dynamics
Redo logs are your shock absorber. Bigger redo capacity lets InnoDB buffer more dirty work and flush more smoothly.
But “bigger is better” turns into “recovery takes forever” if you go wild, and it can mask a flushing problem until it’s a 3 a.m. surprise.
What redo logs actually buy you
- They decouple commit from data page flush (mostly). Committing writes redo; pages can flush later.
- They smooth random write patterns into sequential-ish log appends.
- They set the checkpoint budget. If checkpoint can’t advance, you’ll stall.
NVMe-specific reality: redo is about latency variance, not throughput
Many NVMe devices can do ridiculous throughput, but fsync latency variance under pressure is what kills you.
The redo log path is sensitive to:
- device write cache behavior and FUA/flush semantics,
- filesystem journaling mode,
- kernel writeback congestion,
- background page flush bursts that compete for the same device queues.
Redo sizing: opinionated guidance
On modern systems, undersized redo is a self-inflicted wound. If your redo capacity is small, you’ll checkpoint constantly,
and the page cleaners will thrash. That turns “fast NVMe” into “why does commit take 40ms sometimes.”
Do this instead:
- Size redo so that steady-state write bursts don’t slam into “log full” pressure.
- Validate recovery time expectations. Big redo means more redo to scan during crash recovery.
- Watch checkpoint age and dirty page percent; tune to avoid sawtooth patterns.
When redo is too big
If you’re running on a node where restart time is tightly budgeted—think autoscaling or strict failover SLAs—redo that’s too large can make recovery slow.
That’s not theoretical. It’s the difference between a failover that looks like “hiccup” and one that looks like “incident bridge.”
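One way to balance those two pressures is to measure how fast you actually write redo, then decide how many minutes of peak writes the log should absorb. A rough sizing sketch (the numbers in the comments are illustrative, not recommendations):
cr0x@server:~$ A=$(mysql -Nse "SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';" | awk '{print $2}')
cr0x@server:~$ sleep 60
cr0x@server:~$ B=$(mysql -Nse "SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';" | awk '{print $2}')
cr0x@server:~$ echo "redo bytes per minute: $((B - A))"
# Example: ~500 MiB/min at peak with a 30-minute cushion points toward roughly 15 GiB of redo capacity.
# Then weigh that against the restart budget: more redo can mean more to scan during crash recovery.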
Flush policy: durability knobs and what they cost
There are two kinds of database people: those who have lost data, and those who haven’t yet. Flush policy decides which camp you’re in.
innodb_flush_log_at_trx_commit: the big lever
This setting decides when InnoDB flushes redo to durable storage. Common values:
- 1: write and fsync redo at each commit. Best durability; highest sensitivity to fsync latency.
- 2: write redo to the OS at each commit, fsync roughly once per second. A mysqld crash alone loses nothing, but an OS crash or power loss can lose up to ~1 second of transactions.
- 0: write and fsync roughly once per second. Largest loss window; even a mysqld-only crash can lose up to ~1 second of transactions.
My opinion: run 1 for most production systems that matter, and invest in making fsync predictable.
Use 2 only when the business explicitly accepts the loss window and you’ve documented it in the runbook.
sync_binlog: don’t forget replication durability
If you use binary logs (replication, point-in-time recovery, CDC), sync_binlog interacts with durability too.
A common “oops” is durable redo but non-durable binlog, then wondering why a crash causes replication weirdness or PITR gaps.
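If the decision is "fully durable," set both knobs together so they can't drift apart. A minimal sketch (SET PERSIST is MySQL 8.0 syntax; on MariaDB use SET GLOBAL and pin the same values in my.cnf):
cr0x@server:~$ mysql -e "SET PERSIST innodb_flush_log_at_trx_commit = 1; SET PERSIST sync_binlog = 1;"
# Or in the config file, so a restart can't silently revert the decision:
# [mysqld]
# innodb_flush_log_at_trx_commit = 1
# sync_binlog = 1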
Filesystem and mount options matter more than most “database tuning”
On Linux, ext4 and XFS behave differently under fsync. Journaling mode and barriers matter.
If you’re using cloud volumes, the block device may lie about cache semantics in ways that are “fine” until the day they’re not.
This is why SREs learn to distrust a graph that looks too clean.
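A quick way to see how your filesystem-plus-device combination behaves under commit-like load is a small synchronous write test. A sketch, assuming fio is installed; /data/scratch is a placeholder directory on the same filesystem as the datadir (never point this at the datadir itself), and 16k writes are a simplification of the real redo pattern:
cr0x@server:~$ sudo fio --name=commit-sim --directory=/data/scratch --rw=write --bs=16k \
    --size=512m --ioengine=sync --fdatasync=1 --runtime=60 --time_based --numjobs=1
# In the output, read the fsync/fdatasync latency percentiles: the p99/p99.9 values are a rough
# proxy for how jittery commits will feel at innodb_flush_log_at_trx_commit=1.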
Joke #2: Turning off fsync for performance is like removing your smoke alarm because it’s loud.
IO capacity done right: io_capacity, background IO, and dirty pages
innodb_io_capacity is not “how fast your disk is.” It’s a hint to InnoDB about how aggressively it should flush in the background.
If you set it too low, dirty pages pile up and flushing becomes bursty. Too high, and you can create constant write pressure
that competes with reads and increases latency.
The goal: steady flushing, not heroic flushing
Your best-case is boring: page cleaners flush continuously at a rate that keeps dirty pages in a stable band,
checkpoint age healthy, and redo consumption smooth. Your worst-case is “quiet, quiet, quiet, panic flush,” which
looks like periodic latency cliffs.
How to choose starting values
- Start with a realistic baseline. NVMe can do tens of thousands of IOPS, but InnoDB flush patterns are not raw 4k random writes.
- Measure actual sustained write IOPS and latency under database load, not with an empty box benchmark.
- Use innodb_io_capacity_max as the "burst ceiling," not the day-to-day plan.
Dirty page management is the leading indicator
If dirty pages grow steadily during normal traffic and then suddenly collapse during stalls, your background flushing is insufficient.
If dirty pages stay low but you see constant high write IO and elevated read latency, you might be over-flushing.
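Tracking the dirty-page ratio over time is cheap and tells you which of those two modes you're in. A minimal loop (one sample every 10 seconds; stop with Ctrl-C; works on MySQL and MariaDB):
cr0x@server:~$ while sleep 10; do mysql -Nse "SHOW GLOBAL STATUS WHERE Variable_name IN ('Innodb_buffer_pool_pages_dirty','Innodb_buffer_pool_pages_total');" | awk '{v[$1]=$2} END {printf "dirty_pct %.1f\n", 100*v["Innodb_buffer_pool_pages_dirty"]/v["Innodb_buffer_pool_pages_total"]}'; done
A slow upward trend means the cleaners are losing; a sawtooth means flushing is bursty; a flat low line alongside heavy write IO suggests over-flushing.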
Linux + NVMe tuning that actually matters (and what’s snake oil)
There’s a whole genre of “NVMe tuning” that is basically cargo cult. The box is fast; the bottleneck is often flush semantics,
queue contention, CPU scheduling, or filesystem behavior. That said, a few Linux checks are consistently worth your time.
IO scheduler: usually none for NVMe
For NVMe, the kernel’s multiqueue block layer means traditional schedulers often don’t help. Many distros default to none already.
Confirm, don’t assume.
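Confirming takes one line, and so does switching a device; the bracketed entry is the active scheduler (device names here are examples):
cr0x@server:~$ grep -H . /sys/block/nvme*/queue/scheduler
# Expect lines like: /sys/block/nvme0n1/queue/scheduler:[none] mq-deadline kyber bfq
# To switch one device until reboot (persist via a udev rule or tuned profile if you keep it):
cr0x@server:~$ echo none | sudo tee /sys/block/nvme0n1/queue/scheduler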
Writeback and dirty ratios: avoid synchronized storms
Kernel dirty page writeback can synchronize with InnoDB flushing and produce periodic congestion.
You don’t “fix” this by random sysctl changes; you measure if writeback spikes correlate with DB stalls,
then adjust carefully.
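A low-tech way to test that correlation: sample the kernel's dirty and writeback counters during a stall window and line the timestamps up with your commit-latency graph. A sketch (values are in kB):
cr0x@server:~$ while sleep 1; do echo -n "$(date +%T) "; grep -E '^(Dirty|Writeback):' /proc/meminfo | tr '\n' ' '; echo; done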
CPU frequency and interrupts: the quiet sabotage
NVMe + high QD can be CPU-hungry. If your CPU is downclocking aggressively, or interrupts are pinned badly,
your “fast storage” becomes an expensive heater. Latency-sensitive fsync paths particularly hate jitter.
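Two quick checks before blaming the device, sketched below; on some VMs the cpufreq directory doesn't exist at all, which is itself an answer:
cr0x@server:~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
# "powersave" on a latency-sensitive database host deserves a conversation; "performance" is the usual pick.
cr0x@server:~$ grep -i nvme /proc/interrupts | head -5
# Eyeball whether NVMe completions are spread across CPUs or piled onto one or two columns.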
TRIM/discard: treat with respect
Online discard can introduce latency spikes on some devices or stacks. Many operators prefer periodic fstrim
during maintenance windows. NVMe firmware quality varies, and you’re not going to out-tune a bad firmware day.
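Two checks worth a minute: whether online discard is in the mount options, and whether a periodic fstrim is scheduled instead (the timer name assumes a systemd distro; fstrim wants the datadir's mountpoint):
cr0x@server:~$ findmnt -no OPTIONS /var/lib/mysql | grep -wo discard || echo "no online discard"
cr0x@server:~$ systemctl status fstrim.timer --no-pager
# If neither is in place, a manual pass during a maintenance window looks like:
cr0x@server:~$ sudo fstrim -v /var/lib/mysql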
Hands-on tasks (commands + output + the decision)
These are real tasks you can run on a Linux host with MySQL or MariaDB. Each includes what the output means
and the decision you make from it. Don’t do all of them at once in production. Pick the ones that match your symptoms.
Task 1: Confirm whether you’re on MySQL or MariaDB (and version)
cr0x@server:~$ mysql --version
mysql Ver 8.0.36 for Linux on x86_64 (MySQL Community Server - GPL)
What it means: You’re on MySQL 8.0, so redo sizing likely uses innodb_redo_log_capacity rather than older knobs.
Decision: Use MySQL 8.0 variable names and status counters; don’t apply MariaDB-only settings.
Task 2: Capture key InnoDB durability settings
cr0x@server:~$ mysql -Nse "SHOW VARIABLES WHERE Variable_name IN ('innodb_flush_log_at_trx_commit','sync_binlog','innodb_doublewrite','innodb_flush_method');"
innodb_doublewrite ON
innodb_flush_log_at_trx_commit 1
innodb_flush_method O_DIRECT
sync_binlog 1
What it means: This is the “durable by default” posture (redo and binlog synced). O_DIRECT avoids double caching.
Decision: Keep it unless the business accepts loss windows; optimize fsync predictability instead of turning knobs down.
Task 3: Check redo sizing variables (MySQL 8.0)
cr0x@server:~$ mysql -Nse "SHOW VARIABLES LIKE 'innodb_redo_log_capacity';"
innodb_redo_log_capacity 2147483648
What it means: Redo capacity is 2GiB. For heavy write workloads, that may be small and can increase checkpoint pressure.
Decision: If you see checkpoint stalls or “log waits,” plan a change window to increase redo capacity and validate recovery time impact.
Task 4: Check redo sizing variables (MariaDB-style)
cr0x@server:~$ mysql -Nse "SHOW VARIABLES WHERE Variable_name IN ('innodb_log_file_size','innodb_log_files_in_group');"
innodb_log_file_size 268435456
innodb_log_files_in_group 2
What it means: Total redo is ~512MiB. That’s often undersized on NVMe-backed OLTP where write bursts are common.
Decision: Consider increasing total redo, but account for the operational procedure (file recreation on restart in many setups).
Task 5: Check dirty page percentage and buffer pool pressure
cr0x@server:~$ mysql -Nse "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_%';"
Innodb_buffer_pool_pages_data 1048576
Innodb_buffer_pool_pages_dirty 196608
Innodb_buffer_pool_pages_free 1024
Innodb_buffer_pool_pages_total 1050624
What it means: Dirty pages are ~18.7% (196608/1050624). Free pages are near zero, so the buffer pool is fully utilized.
Decision: If dirty pages trend upward under steady load, raise background flushing (innodb_io_capacity) or fix IO contention. If they oscillate wildly, reduce burstiness (redo sizing, flush tuning).
Task 6: Check checkpoint behavior via InnoDB engine status
cr0x@server:~$ mysql -Nse "SHOW ENGINE INNODB STATUS\G" | sed -n '1,120p'
*************************** 1. row ***************************
Type: InnoDB
Name:
Status:
=====================================
2025-12-31 12:14:51 0x7f3c2c0c4700 INNODB MONITOR OUTPUT
=====================================
Log sequence number 879112345678
Log flushed up to 879112300000
Pages flushed up to 879110900000
Last checkpoint at 879110900000
What it means: "Log sequence number" is where writes are, "Log flushed up to" lags slightly behind, and checkpoint age is the gap between the LSN and "Last checkpoint at." If the LSN runs far ahead of the last checkpoint, you're accumulating checkpoint age.
Decision: If checkpoint age grows until stalls, increase redo capacity and/or raise IO capacity; also check for fsync latency spikes and doublewrite overhead.
Task 7: Measure fsync-related waits in Performance Schema (MySQL)
cr0x@server:~$ mysql -Nse "SELECT event_name, count_star, ROUND(sum_timer_wait/1000000000000,2) AS total_s FROM performance_schema.events_waits_summary_global_by_event_name WHERE event_name LIKE 'wait/io/file/innodb/innodb_log_file%' ORDER BY sum_timer_wait DESC LIMIT 5;"
wait/io/file/innodb/innodb_log_file 1289345 842.11
What it means: A lot of time is being spent waiting on redo log file IO. That’s often your commit latency.
Decision: Correlate with p99 commit latency. If high, focus on flush policy path: device latency variance, filesystem journaling, kernel congestion.
Task 8: Check InnoDB log waits and write pressure counters
cr0x@server:~$ mysql -Nse "SHOW GLOBAL STATUS WHERE Variable_name IN ('Innodb_log_waits','Innodb_log_write_requests','Innodb_os_log_fsyncs','Innodb_os_log_written');"
Innodb_log_waits 413
Innodb_log_write_requests 98234567
Innodb_os_log_fsyncs 4512390
Innodb_os_log_written 9876543210
What it means: Innodb_log_waits being non-zero suggests transactions had to wait because the log buffer/redo space was constrained.
Decision: If log waits increase during normal load, increase redo capacity and/or fix checkpointing and flush throughput.
Task 9: Confirm NVMe device model, firmware, and PCIe link (hardware sanity)
cr0x@server:~$ sudo nvme list
Node SN Model Namespace Usage Format FW Rev
/dev/nvme0n1 S6X... SAMSUNG MZVL21T0HCLR-00B00 1 900.19 GB / 1.00 TB 512 B + 0 B GXA7401Q
What it means: You know what device you’re actually running on, including firmware. Firmware matters for latency spikes.
Decision: If you see periodic stalls, check if the device has known firmware issues in your fleet; consider controlled firmware updates.
Task 10: Check IO scheduler for NVMe
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
What it means: Scheduler is none, which is often correct for NVMe.
Decision: Leave it unless you have strong evidence another scheduler improves tail latency under mixed read/write contention.
Task 11: Check filesystem and mount options (flush semantics live here)
cr0x@server:~$ findmnt -no SOURCE,FSTYPE,OPTIONS /var/lib/mysql
/dev/nvme0n1p2 ext4 rw,relatime,errors=remount-ro,data=ordered
What it means: ext4 with data=ordered. This affects journaling behavior and can influence fsync costs.
Decision: If fsync latency is volatile, test alternative filesystems or mount options in staging; don’t flip production mount options casually.
Task 12: Check kernel dirty writeback settings
cr0x@server:~$ sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500
What it means: The kernel will start background writeback around 10% dirty, and throttle around 20% dirty. These defaults can be okay, or can synchronize badly with InnoDB flushing.
Decision: If you observe periodic global IO congestion, consider tuning ratios lower to encourage earlier writeback, but validate impact with real workload.
Task 13: Observe real-time IO latency and queueing
cr0x@server:~$ iostat -x 1 5
Linux 6.1.0 (server) 12/31/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
6.12 0.00 2.11 3.95 0.00 87.82
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
nvme0n1 820.0 52500.0 0.0 0.0 1.20 64.0 1600.0 78000.0 0.0 0.0 8.70 48.8 15.2 92.0
What it means: Writes have ~8.7ms await with high utilization. That’s not “NVMe fast” in the way your commits want.
Decision: If commit latency correlates with w_await, reduce write bursts (redo sizing, IO capacity tuning) and investigate device saturation or noisy neighbors.
Task 14: Identify top MySQL IO consumers at the OS level
cr0x@server:~$ sudo pidstat -d 1 3 -p $(pidof mysqld)
Linux 6.1.0 (server) 12/31/2025 _x86_64_ (32 CPU)
12:16:03 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
12:16:04 999 24187 12000.00 54000.00 0.00 52 mysqld
12:16:05 999 24187 11500.00 61000.00 0.00 61 mysqld
What it means: MySQL is the write generator. Good. If some other process is writing heavily, it can steal IO and destroy tail latency.
Decision: If IO contention exists, isolate MySQL (dedicated volume, cgroups/io.max, separate device, or kill the noisy neighbor).
Task 15: Check InnoDB IO capacity settings live
cr0x@server:~$ mysql -Nse "SHOW VARIABLES WHERE Variable_name IN ('innodb_io_capacity','innodb_io_capacity_max','innodb_page_cleaners');"
innodb_io_capacity 200
innodb_io_capacity_max 2000
innodb_page_cleaners 4
What it means: innodb_io_capacity=200 is a spinning-disk era value. On NVMe, it’s often too low, leading to dirty page buildup and eventual stalls.
Decision: Increase gradually (e.g., 1000–5000 depending on workload and device), watch dirty pages, checkpoint age, and read latency. Don’t jump to 20000 because you saw it on a forum.
Task 16: Confirm doublewrite and atomic-write-related settings (risk management)
cr0x@server:~$ mysql -Nse "SHOW VARIABLES LIKE 'innodb_doublewrite';"
innodb_doublewrite ON
What it means: Doublewrite is enabled, which protects against torn pages.
Decision: Leave it on unless you have a validated atomic write stack and a tested crash/recovery plan that proves it’s safe to change.
Fast diagnosis playbook (first/second/third checks)
When latency is spiking and everyone is staring at the same dashboard, you need a short path to “what is it, really?”
Here’s the sequence that finds the bottleneck quickly in most NVMe-backed MySQL/MariaDB incidents.
First: is it redo fsync latency or data flush pressure?
- Check commit latency and log waits: Innodb_log_waits, Performance Schema redo waits, and application commit timing.
- Check iostat -x for w_await spikes and high %util.
If redo fsync is slow: focus on device latency variance, filesystem journaling, write cache, and IO contention. Redo sizing can help indirectly by smoothing checkpoints, but it won’t fix a bad fsync path.
Second: are page cleaners falling behind?
- Watch dirty page percentage over time.
- Check InnoDB status for checkpoint age growth and flushing activity.
- Confirm innodb_io_capacity isn't set like it's 2009.
If cleaners are behind: raise IO capacity gradually, consider increasing redo capacity, and look for write amplification from doublewrite + filesystem.
Third: is it read latency caused by write contention?
- If reads slow down during flush storms, you have IO queue contention.
- Check if background writeback aligns with DB stalls (kernel dirty writeback settings).
If read latency tracks write bursts: reduce flush burstiness (redo/IO capacity), isolate IO, and verify scheduler and CPU behavior.
Fourth (when it’s “weird”): prove whether the NVMe is real NVMe
- On cloud, confirm the underlying storage class behavior via observed latency variance, not the device name.
- Check for other writers and for throttling at the hypervisor or volume layer.
Three corporate mini-stories from the trenches
1) The incident caused by a wrong assumption: “NVMe means fsync is cheap”
A fintech-ish company migrated a core MySQL cluster from SATA SSDs to local NVMe on new hosts. The team expected the usual:
lower latency, higher throughput, fewer IO-related incidents. They got the first two in the median, and the last one got worse.
The symptom was classic: p95 looked fine, p99.9 periodically went off a cliff. Application threads piled up waiting on commits.
The on-call looked at CPU and saw plenty of idle. Storage throughput was nowhere near device limits. “So it can’t be disk,” someone said,
and the room nodded because graphs are persuasive.
The wrong assumption was that NVMe automatically makes fsync predictable. In reality, their chosen filesystem + mount options
plus the device firmware’s internal housekeeping created periodic fsync stalls under sustained write pressure.
The commits were gated by those stalls because they ran innodb_flush_log_at_trx_commit=1 (correct) and had no slack when fsync jitter hit.
The fix wasn’t “turn durability down.” They measured redo fsync wait events, correlated them with iostat latency spikes,
tested a different filesystem configuration in staging, and added capacity headroom by increasing redo log capacity so checkpoints were less bursty.
Tail latency stabilized. The surprise was that the “fastest” storage produced the most volatile latency until the stack was tuned for predictability.
2) The optimization that backfired: cranking innodb_io_capacity to the moon
Another company had a MariaDB cluster serving a write-heavy service. They saw dirty pages climb during peak traffic and decided flushing was the bottleneck.
Someone set innodb_io_capacity to a value that looked reasonable compared to a synthetic NVMe benchmark, and they pushed it during a maintenance window.
For about an hour, things improved. Dirty pages stayed low, and the dashboards turned green. Then read latency started creeping up,
not just for heavy queries but for simple point lookups. The buffer pool hit rate dropped slightly. The application started retrying due to timeouts.
Everyone blamed “network” because that’s what people do when storage is supposed to be solved.
The reality: they forced the page cleaners to flush aggressively all the time, creating constant write pressure and saturating the device queues.
Reads now had to compete with a background write firehose. NVMe throughput was high, but tail latency for reads got worse. It was an IO scheduling problem they created.
They rolled back to a lower IO capacity, then re-tuned gradually while watching read latency and dirty pages together.
The correct pattern was a moderate baseline flush rate with a sensible max, plus a redo capacity bump to avoid checkpoint panic.
The lesson: “more flushing” is not the same as “better flushing.”
3) The boring but correct practice that saved the day: validating crash recovery time
A SaaS provider ran MySQL with relatively large redo capacity because their workload had bursty writes. It made the system smooth under load.
But they did something unfashionable: they regularly tested crash recovery time in staging with production-like data volume and redo utilization.
During a data center event, a primary crashed hard and restarted. The failover plan assumed a certain restart time budget.
Because they had tested it, they knew exactly what to expect and had already tuned redo capacity to stay within the recovery window.
Replication caught up cleanly because binlog durability settings matched the durability posture.
While other teams were arguing on chat about whether to rebuild from backup or promote a replica, this team followed the runbook:
confirm redo recovery progress, monitor crash recovery stages, and keep traffic drained until the engine reported consistent state.
The end result was not glamorous. It was a contained incident without data loss and without a multi-hour restore.
The “boring practice” was measuring recovery in advance and refusing to tune redo size without considering operational restart time.
Common mistakes: symptom → root cause → fix
1) Symptom: p99 commit latency spikes, throughput looks fine
Root cause: redo fsync latency variance (device firmware, filesystem journaling, writeback congestion, or IO contention).
Fix: measure redo wait events; check iostat -x for write await spikes; isolate IO; validate filesystem options; avoid competing writers on the same volume.
2) Symptom: periodic “stall storms” every few minutes
Root cause: checkpoint pressure and bursty flushing due to undersized redo and/or low innodb_io_capacity.
Fix: increase redo capacity; raise innodb_io_capacity gradually; monitor dirty pages and checkpoint age for stability rather than sawtooth patterns.
3) Symptom: reads slow down when writes spike, even though NVMe isn’t maxed on throughput
Root cause: IO queue contention from background flushing or doublewrite amplification; reads stuck behind writes.
Fix: tune innodb_io_capacity and innodb_io_capacity_max; ensure scheduler is appropriate; consider separating redo/binlog onto different devices only if you can manage the operational complexity.
4) Symptom: “log waits” increasing under normal traffic
Root cause: redo space constrained; checkpoint can’t advance fast enough; log buffer pressure; sometimes too-small redo capacity.
Fix: increase redo capacity; ensure page cleaners can flush steadily; validate that doublewrite and filesystem aren’t multiplying IO unnecessarily.
5) Symptom: after “making it faster” by changing durability, replication/PITR gets flaky
Root cause: inconsistent durability posture between redo and binlog (e.g., innodb_flush_log_at_trx_commit=2 but sync_binlog=0), or assumptions about crash consistency.
Fix: align redo and binlog durability with the business’s data loss tolerance; document it; test crash scenarios.
6) Symptom: NVMe shows high %util, but MySQL isn’t doing that many queries
Root cause: background flushing, doublewrite, or filesystem writeback; or another process writing heavily.
Fix: use pidstat -d to find writers; check InnoDB dirty pages; review kernel dirty writeback settings; consider moving non-DB workloads off the volume.
7) Symptom: tuning works in staging but fails in production
Root cause: different NVMe model/firmware, different cloud storage behavior, different concurrency, or different background jobs (backups, compaction, ETL).
Fix: standardize hardware; test under realistic concurrency; schedule background jobs to avoid peak; measure tail latency, not average.
Checklists / step-by-step plan
Step-by-step: baseline before tuning anything
- Record MySQL/MariaDB version and key variables: flush policy, redo sizing, IO capacity, doublewrite, binlog durability.
- Capture 10 minutes of iostat -x during representative load.
- Capture InnoDB status snapshots at the start and end of that window (checkpoint and LSN movement); a one-shot capture sketch follows this checklist.
- Capture dirty page percentage trends.
- Confirm filesystem and mount options for the datadir.
- Confirm there are no competing heavy writers on the same device.
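If you want that baseline as a single artifact you can attach to a ticket, the capture fits in a handful of commands. A sketch, assuming passwordless client access from this shell and a datadir at /var/lib/mysql; variables that don't exist on your server simply won't appear in the output:
cr0x@server:~$ mkdir -p /tmp/db-baseline && cd /tmp/db-baseline
cr0x@server:~$ mysql -e "SHOW VARIABLES WHERE Variable_name IN ('innodb_flush_log_at_trx_commit','sync_binlog','innodb_doublewrite','innodb_flush_method','innodb_redo_log_capacity','innodb_log_file_size','innodb_io_capacity','innodb_io_capacity_max','innodb_page_cleaners');" > variables.txt
cr0x@server:~$ mysql -e "SHOW ENGINE INNODB STATUS\G" > innodb_status_start.txt
cr0x@server:~$ findmnt -no SOURCE,FSTYPE,OPTIONS /var/lib/mysql > mount.txt
cr0x@server:~$ iostat -x 10 60 > iostat.txt   # 10-minute window in 10-second samples
cr0x@server:~$ mysql -e "SHOW ENGINE INNODB STATUS\G" > innodb_status_end.txt
cr0x@server:~$ mysql -Nse "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_%';" > buffer_pool_end.txt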
Step-by-step: redo and checkpoint stabilization
- If redo capacity is small, increase it to reduce checkpoint churn (planned maintenance if required).
- After the change, monitor recovery time during a controlled restart in staging (a measurement sketch follows this list); don't ship a redo increase without knowing restart cost.
- Watch Innodb_log_waits and checkpoint age; they should trend down.
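Measuring the restart cost doesn't need special tooling. A staging-only sketch (kill -9 is not a power-loss test, but it does force redo recovery; log path, service name, and exact log wording vary by distro and version, and systemd may auto-restart the unit before you type the start command):
cr0x@server:~$ sudo kill -9 "$(pidof mysqld)"
cr0x@server:~$ time sudo systemctl start mysql
# Wall-clock restart time with the current redo size and dirty workload.
cr0x@server:~$ sudo grep -iE 'recovery|redo' /var/log/mysql/error.log | tail -20
# Note the timestamps on the recovery start/finish messages and compare them against your failover budget.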
Step-by-step: tune IO capacity without breaking reads
- Increase innodb_io_capacity in small steps (a one-step sketch follows this list).
- After each step, watch: dirty pages, read latency, write await, and CPU.
- Set innodb_io_capacity_max to allow bursts, but keep it bounded.
- Stop when dirty pages stabilize and read p99 doesn't degrade.
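Both knobs are dynamic, so each step is a one-liner plus an observation window. One step might look like this (values are examples, not recommendations):
cr0x@server:~$ mysql -e "SET GLOBAL innodb_io_capacity = 1000; SET GLOBAL innodb_io_capacity_max = 4000;"
# Watch 30-60 minutes of representative load before the next step: dirty page %, read p99,
# iostat w_await, and CPU. Once satisfied, persist the values (SET PERSIST on MySQL 8.0, or my.cnf on either server).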
Step-by-step: decide on durability knobs like an adult
- Write down acceptable data loss window (0 seconds? 1 second? more?). Get it signed off by someone who owns the consequence.
- Align innodb_flush_log_at_trx_commit and sync_binlog to match that decision.
- Test crash behavior and recovery in staging: power-loss style stop, then recovery, then consistency checks.
FAQ
1) Should I always set innodb_flush_log_at_trx_commit=2 on NVMe?
No. NVMe can make fsync fast, but “fast” isn’t the same as “predictable.” Use 2 only if the business accepts losing up to ~1 second of commits on crash.
2) What’s a good redo size on NVMe?
Big enough to avoid constant checkpoint pressure, small enough to keep crash recovery within your operational budget. Start by measuring checkpoint age and recovery time; don’t pick a number from a blog.
3) Does increasing redo capacity always improve performance?
It often reduces stalls by smoothing checkpoints, but it can increase crash recovery time. Also, it won’t fix a fundamentally bad fsync path.
4) Can I disable the doublewrite buffer on NVMe?
Only if you can prove your full stack prevents torn pages (device + filesystem + configuration) and you’ve tested crash recovery. Otherwise, keep it on and tune around it.
5) Why is innodb_io_capacity still 200 in so many configs?
Because configs live longer than hardware generations. 200 made sense for spinning disks. On NVMe it can be a stall generator.
6) My NVMe shows 90% util but low throughput. Is that normal?
Yes, if you’re dominated by small synchronous writes (fsync) or latency-bound IO. High util can reflect queueing and wait time, not just bandwidth use.
7) Is separating redo logs onto another NVMe device worth it?
Sometimes, especially if data flush writes are starving redo fsyncs. But it adds operational complexity and can fail in new ways (capacity planning, device failure domains). Measure first.
8) MySQL vs MariaDB: which one is “better” on NVMe?
Neither wins by default. Pick based on features, operational tooling, and your team’s competence. Then tune using that engine’s variables and measure the results.
9) Why does tuning in staging not match production?
Concurrency, background jobs, noisy neighbors, and different device firmware. Tail latency is a production-only sport unless your staging is truly production-like.
Conclusion: next steps you can do this week
If you want NVMe to feel like magic in production, don’t chase peak IOPS. Chase predictable fsync latency and steady flushing.
Redo size is your shock absorber. Flush policy is your risk contract. IO capacity is how you keep the engine from panic-paying its debts.
- Run the baseline tasks: capture durability settings, redo sizing, dirty pages, iostat, filesystem options.
- Do the fast diagnosis sequence during a real spike and label it: redo fsync, checkpoint pressure, or IO contention.
- If redo is small and checkpointing is bursty, plan a redo capacity increase with a recovery-time test.
- Raise innodb_io_capacity gradually until dirty pages stabilize without hurting read p99.
- Write down durability decisions (innodb_flush_log_at_trx_commit, sync_binlog) and stop treating them as "performance knobs."
Do those, and your NVMe won’t just be fast. It’ll be boring. And boring is what you want at 2 a.m.