Most database “performance tuning” starts in the wrong place: a config file. Someone remembers a blog post, cranks a knob, restarts production, and hopes the graphs stop screaming. Sometimes it works. Often it just moves the pain to a different subsystem and makes the postmortem longer.
The fastest way to win MariaDB vs MySQL performance debates is boring: identify the bottleneck before you change anything. This checklist does that. It’s the same approach I use when a revenue path is timing out and the only “repro” is “it’s slow sometimes.”
The one checklist: locate the bottleneck, then tune
MariaDB and MySQL share a lot of DNA, especially if you’re mostly living in InnoDB-land. That’s why “which is faster” is usually the wrong question. The right question is: what is currently limiting throughput or latency on this host? CPU. IO. Locking. Memory. Network. Replication. Or the classic: one pathological query that drags everything else into the mud.
Here’s the mental model that keeps you honest:
- Throughput bottleneck: you’re maxing out some resource (CPU, disk bandwidth, mutexes). Symptoms: QPS plateaus while load increases; latency may or may not spike.
- Latency bottleneck: something occasionally blocks (fsync spikes, locks, buffer pool misses, stall in replication apply). Symptoms: tail latency climbs; “it’s slow sometimes.”
- Queueing bottleneck: a single-threaded or serialized stage (redo log flush, replication SQL thread, metadata lock). Symptoms: one queue grows; everyone waits their turn.
When you’re under pressure, tuning knobs feel productive because they’re concrete. But knobs are downstream decisions. If you don’t know the bottleneck, you’re basically changing random variables and declaring victory until the next incident arrives on a different day.
One operational rule: don’t tune until you can say, in one sentence, “the bottleneck is X, proven by Y.” If you can’t, you’re not tuning; you’re gambling.
And yes, you will still tune. But you’ll tune with receipts.
Short joke #1: “We tuned MySQL until it was fast.” That’s adorable. The load balancer remembers.
Facts and history that still matter in production
Some context helps because it explains why defaults differ, why behavior surprises you, and why someone’s “MySQL tip” breaks on MariaDB (or vice versa).
- MariaDB was forked from MySQL in 2009 after concerns around MySQL’s acquisition by Oracle. This isn’t trivia: it’s why ecosystem assumptions drift over time.
- Early MariaDB marketed “drop-in replacement” aggressively. Operationally, it’s mostly true at the SQL layer—but performance behavior can diverge at the edges (optimizer, thread pool, replication features).
- MySQL 5.6/5.7 made InnoDB the center of gravity and improved performance schema and instrumentation substantially. If you’re still treating it like a black box, you’re leaving free diagnosis on the table.
- MySQL 8.0 removed the query cache (which was notorious for mutex contention). If your MariaDB estate still uses query cache, that’s a historical artifact with modern consequences.
- MariaDB diverged with features like Aria, ColumnStore, and Galera integration (in many distributions). Those are not just features; they change failure modes and operational knobs.
- MySQL 8.0’s data dictionary overhaul changed metadata and some upgrade/rollback realities. “It worked on 5.7” is not a plan.
- Both rely heavily on Linux IO behavior (fsync, dirty page flushing, scheduler). Many “database” incidents are just the kernel doing what you asked, not what you meant.
- Replication has evolved differently: MySQL has focused on GTIDs, multi-threaded replication, group replication; MariaDB has its own GTID implementation and often different replication ergonomics. Same word, different footguns.
Also: performance conversations age badly. A tuning guide written for MySQL 5.5 on spinning disks can be actively harmful on MySQL 8.0 on NVMe with cgroups and noisy neighbors.
One principle, paraphrased: “Hope is not a strategy.”
— often attributed in operations circles; use it as a pager-ready principle, not a slogan.
Fast diagnosis playbook (first/second/third)
This is the “walk into a burning room” order. It’s designed for speed, not elegance. The goal is to identify whether you’re CPU-bound, IO-bound, lock-bound, or replication-bound within minutes.
First: prove what kind of bottleneck you have
- Is the host resource-saturated? CPU, memory pressure, IO wait, disk latency, network errors.
- Is the database waiting? Lock waits, fsync/redo waits, buffer pool misses, thread scheduling, connection storms.
- Is the database doing something expensive? Full scans, bad plans, sorting on disk, temp tables, missing indexes.
Second: identify the top offender, not the average
- Top queries by total time (not by count); see the digest query sketch after this list.
- Top waits (locks, IO, flush, metadata locks).
- Top tables (hot rows, missing indexes, heavy churn).
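For the “top queries by total time” bullet, a minimal sketch, assuming performance_schema is enabled and statement digests are being collected (on by default in MySQL 5.7/8.0, off by default in MariaDB, so verify first):
cr0x@server:~$ mysql -e "SELECT LEFT(digest_text, 60) AS query, count_star, ROUND(sum_timer_wait/1e12, 1) AS total_seconds FROM performance_schema.events_statements_summary_by_digest ORDER BY sum_timer_wait DESC LIMIT 5;"
sum_timer_wait is in picoseconds, hence the division. Sorting by it surfaces the queries that burn the most cumulative time, which is usually not the single slowest one.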
Third: decide whether to mitigate, fix, or scale
- Mitigate now: kill runaway queries, shed load, increase pool size carefully, reduce concurrency, move reads to replicas.
- Fix next: index changes, query rewrites, schema changes, partitioning, correcting isolation levels.
- Scale when appropriate: more CPU for CPU-bound workloads, faster storage for IO-bound, replicas for read fan-out, sharding only if you hate yourself (or really need it).
Speed tip: don’t triage from averages. Look for queues and tail latency. Averages are what you tell management. Queues are what you tell the on-call.
Practical tasks (commands, outputs, decisions)
These are runnable tasks you can do on a Linux DB host and inside the database. Each includes (a) a command, (b) what the output means, and (c) the decision you make. That last part is the difference between monitoring and operations.
Task 1: confirm which engine and version you are actually running
cr0x@server:~$ mysql -e "SELECT VERSION() AS version, @@version_comment AS comment, @@version_compile_machine AS arch;"
+-----------------------+----------------+--------+
| version               | comment        | arch   |
+-----------------------+----------------+--------+
| 10.6.16-MariaDB-1:... | MariaDB Server | x86_64 |
+-----------------------+----------------+--------+
Meaning: Don’t debug “MySQL” when you’re running MariaDB (or vice versa). Version differences determine available instrumentation and optimizer behavior.
Decision: Pick the right tools: performance_schema heavy workflows are more mature in MySQL 8.0; MariaDB may require different status counters and sys schema equivalents.
Task 2: quick host pressure check (CPU, IO wait, run queue)
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 1 0 81232 64312 987654 0 0 120 980 900 2100 25 10 45 20 0
8 2 0 79012 64320 987120 0 0 110 1600 1100 3200 35 12 33 20 0
Meaning: High r suggests CPU contention; high wa suggests waiting on IO; sustained b indicates blocked tasks.
Decision: If wa is consistently high and latency is spiky, go straight to disk latency tasks. If r is high with low wa, focus on query CPU and contention.
Task 3: measure disk latency, not just throughput
cr0x@server:~$ iostat -x 1 3
Device r/s w/s rkB/s wkB/s await svctm %util
nvme0n1 1200 9000 48000 320000 18.5 0.3 92.0
Meaning: await is the time requests spend waiting + being serviced. High await with high utilization means the device is saturated or something upstream is queueing.
Decision: If await spikes coincide with DB latency, you’re IO-bound. Mitigate by reducing write bursts (checkpoint/flush), improving buffer pool, or moving to faster storage. If await is low but DB slow, the bottleneck is elsewhere.
Task 4: find the busiest MySQL/MariaDB threads at the OS level
cr0x@server:~$ top -H -p $(pidof mysqld)
top - 12:01:22 up 34 days, 2 users, load average: 9.10, 8.80, 7.90
Threads: 210 total, 8 running, 202 sleeping
%Cpu(s): 70.0 us, 10.0 sy, 0.0 ni, 20.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12345 mysql 20 0 12.3g 8.1g 0.0g R 99.9 52.0 12:11.04 mysqld
Meaning: If you see a few threads pegged, you may have CPU-bound queries or mutex contention. If CPU is low but latency high, focus on waits/locks/IO.
Decision: If CPU is saturated, go to query profiling and plans. If not, go to lock waits and IO waits inside InnoDB.
Task 5: identify runaway queries immediately
cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST\G" | sed -n '1,80p'
*************************** 1. row ***************************
Id: 123
User: app
Host: 10.0.2.14:51234
db: prod
Command: Query
Time: 88
State: Sending data
Info: SELECT ... FROM orders JOIN order_items ...
Meaning: Time is how long it’s been running. State hints at what it’s doing (not perfectly). A pile of long-running “Sending data” often means large scans/joins.
Decision: If a single query is dominating and user-facing, consider killing it and throttling the caller. Then capture the query for EXPLAIN and indexing work.
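If you do decide to kill, both engines accept the thread Id from the processlist (123 in the sample above). A minimal sketch:
cr0x@server:~$ mysql -e "KILL QUERY 123;"   # stops the running statement, keeps the connection
cr0x@server:~$ mysql -e "KILL 123;"         # terminates the whole connection
KILL QUERY is usually the safer first move: the application keeps its connection and can retry, and you avoid a reconnect storm.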
Task 6: find lock contention fast (InnoDB)
cr0x@server:~$ mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,140p'
TRANSACTIONS
------------
Trx id counter 987654321
Purge done for trx's n:o < 987654000 undo n:o < 0 state: running
History list length 12045
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 987654310, ACTIVE 92 sec
2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 123, OS thread handle 1400, query id 7777 10.0.2.14 app updating
UPDATE accounts SET balance=balance-? WHERE id=?
Meaning: Long “ACTIVE” transactions and large History list length often correlate with purge lag, undo growth, and secondary symptoms like “random slow queries.”
Decision: If you see a long-running transaction holding locks, you may need to kill it, fix application transaction scope, or lower isolation where safe. Also investigate why purge can’t keep up (often due to long readers).
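To find the long runners without reading the whole status dump, information_schema.innodb_trx works on both engines; a sketch:
cr0x@server:~$ mysql -e "SELECT trx_mysql_thread_id, trx_state, trx_started, trx_rows_locked FROM information_schema.innodb_trx ORDER BY trx_started LIMIT 5;"
The oldest trx_started entries are your prime suspects for purge lag and lock pileups; trx_mysql_thread_id is the Id you would pass to KILL.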
Task 7: check buffer pool pressure and hit ratio indicators
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';"
+-------------------------+-----------+
| Variable_name | Value |
+-------------------------+-----------+
| Innodb_buffer_pool_reads| 124009876 |
+-------------------------+-----------+
+----------------------------------+-------------+
| Variable_name | Value |
+----------------------------------+-------------+
| Innodb_buffer_pool_read_requests | 98012345678 |
+----------------------------------+-------------+
Meaning: Innodb_buffer_pool_reads are physical reads. If this climbs rapidly relative to read requests, you’re missing cache and paying disk latency.
Decision: If physical reads are high and disk latency is non-trivial, increase buffer pool (if RAM allows) or reduce working set (indexes, query patterns). If buffer pool is fine, don’t touch it.
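A rough miss ratio is just arithmetic on the two counters: miss % = 100 * Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests. With the sample values above:
cr0x@server:~$ mysql -e "SELECT ROUND(100 * 124009876 / 98012345678, 3) AS approx_miss_pct;"
That works out to roughly 0.13%, about one physical read per ~800 logical requests. Treat it as a trend, not a target: sample twice a few minutes apart, because counters accumulated since startup hide recent regressions.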
Task 8: detect temporary table spills (sorts/joins hitting disk)
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';"
+-------------------------+----------+
| Variable_name | Value |
+-------------------------+----------+
| Created_tmp_disk_tables | 9123456 |
| Created_tmp_tables | 12345678 |
+-------------------------+----------+
Meaning: A high ratio of disk temp tables suggests memory limits too low for sorts/joins or queries that are inherently spill-prone (missing indexes, huge GROUP BY).
Decision: If disk temps correlate with slow endpoints, fix query plans first (indexes), then adjust temp table memory limits carefully. Bigger limits can also create memory blowups under concurrency.
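The spill ratio is the same kind of arithmetic: disk % = 100 * Created_tmp_disk_tables / Created_tmp_tables. With the sample counters above:
cr0x@server:~$ mysql -e "SELECT ROUND(100 * 9123456 / 12345678, 1) AS disk_tmp_pct;"
Roughly 74% of temp tables hitting disk is a loud signal. These counters are also cumulative since startup, so compare two samples over a known window before reacting.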
Task 9: check redo log and flushing pressure (fsync pain)
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_os_log_fsyncs'; SHOW GLOBAL STATUS LIKE 'Innodb_log_waits';"
+----------------------+----------+
| Variable_name | Value |
+----------------------+----------+
| Innodb_os_log_fsyncs | 77881234 |
+----------------------+----------+
+------------------+-------+
| Variable_name | Value |
+------------------+-------+
| Innodb_log_waits | 12345 |
+------------------+-------+
Meaning: Innodb_log_waits indicates waits for log buffer space / flush. Combined with high fsync rates and IO latency, this points to write bottlenecks.
Decision: If log waits are rising, you’re hitting a commit/flush ceiling. Mitigate with faster storage, reducing transaction rate, batching writes, or adjusting durability settings only with explicit business sign-off.
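Before anyone proposes “just relax durability,” at least capture what the current settings are; a read-only sketch:
cr0x@server:~$ mysql -e "SELECT @@innodb_flush_log_at_trx_commit AS redo_flush, @@sync_binlog AS binlog_sync, @@innodb_flush_method AS flush_method;"
A value of 1 for the first two is the fully durable pairing; defaults differ between MySQL and MariaDB versions, so anything else should already be a documented, accepted risk rather than a surprise found during an incident.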
Task 10: check replication lag and whether it’s IO, SQL apply, or locks
cr0x@server:~$ mysql -e "SHOW SLAVE STATUS\G" | egrep -i "Seconds_Behind_Master|Slave_IO_Running|Slave_SQL_Running|Relay_Log_Space"
Seconds_Behind_Master: 420
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Relay_Log_Space: 987654321
Meaning: IO thread is fine, SQL thread is running, but lag is high and relay logs are piling up. The replica can fetch but can’t apply fast enough.
Decision: Investigate apply bottlenecks: single-threaded apply, hot rows, DDL, long transactions, or slow storage. Consider multi-threaded replication where supported and safe, and fix the query/transaction patterns that serialize apply.
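To see what the replica is currently allowed to parallelize, a quick sketch that works on both engines; the variable names differ (MariaDB exposes slave_parallel_threads and slave_parallel_mode, MySQL 8.0 uses replica_parallel_workers, older MySQL slave_parallel_workers):
cr0x@server:~$ mysql -e "SHOW GLOBAL VARIABLES LIKE '%parallel%';"
A setting of 0 workers means single-threaded apply, which puts a hard ceiling on how fast the replica can catch up regardless of hardware.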
Task 11: verify connection storms and thread scheduling
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Threads%'; SHOW GLOBAL STATUS LIKE 'Connections';"
+-------------------+-------+
| Variable_name | Value |
+-------------------+-------+
| Threads_connected | 980 |
| Threads_running | 64 |
| Threads_cached | 0 |
+-------------------+-------+
+---------------+--------+
| Variable_name | Value |
+---------------+--------+
| Connections | 987654 |
+---------------+--------+
Meaning: A huge cumulative Connections count with Threads_cached at 0 points to connection churn (apps not pooling) or a too-low thread cache. 980 connected against 64 running also says most sessions are just parked.
Decision: Fix the application first: enable pooling, reuse connections, cap concurrency. Only then adjust thread cache/pools; otherwise you’ll just let the app hurt you faster.
Task 12: check table/index size vs memory to predict cache misses
cr0x@server:~$ mysql -e "SELECT table_schema, SUM(data_length+index_length)/1024/1024/1024 AS gb FROM information_schema.tables GROUP BY table_schema ORDER BY gb DESC;"
+--------------+--------+
| table_schema | gb |
+--------------+--------+
| prod | 820.12 |
| mysql | 0.05 |
+--------------+--------+
Meaning: If your active dataset is hundreds of GB and your buffer pool is 32GB, you’re not caching your way out. Reads will be disk reads. Plans must be selective.
Decision: Either provision RAM to match working set or design queries/indexes to avoid scanning. If neither is possible, push reads to replicas or accept that tail latency is storage-shaped.
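To put a number next to those table sizes, check how big the buffer pool actually is; a sketch:
cr0x@server:~$ mysql -e "SELECT ROUND(@@innodb_buffer_pool_size/1024/1024/1024, 1) AS buffer_pool_gb;"
Comparing that against the per-schema totals above is crude (total size is not the same as the hot working set), but if the gap is 25x, no amount of knob-turning makes the cache fit.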
Task 13: catch metadata lock pileups (DDL and “harmless” ALTERs)
cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST;" | awk '$6 ~ /Waiting/ || $7 ~ /metadata/ {print}'
234 app 10.0.2.21:49910 prod Query 35 Waiting for table metadata lock ALTER TABLE users ADD COLUMN ...
Meaning: Metadata locks block queries that touch the table. One DDL can freeze a hot table path if it waits behind a long transaction.
Decision: Pause the DDL, find and end the long transaction, or use online schema change techniques. Also: schedule schema changes like you schedule risk, not like you schedule lunch.
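On MySQL 8.0 you can often see who holds versus who waits directly (the metadata lock instrument in performance_schema is enabled by default there); a sketch, with the caveat that MariaDB typically needs the METADATA_LOCK_INFO plugin or processlist archaeology instead:
cr0x@server:~$ mysql -e "SELECT object_schema, object_name, lock_type, lock_status, owner_thread_id FROM performance_schema.metadata_locks WHERE object_type='TABLE' AND lock_status IN ('PENDING','GRANTED');"
Rows with PENDING are the blocked statements; the GRANTED rows on the same table point at the session you need to end.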
Task 14: inspect slow query log quickly (if enabled)
cr0x@server:~$ sudo tail -n 20 /var/log/mysql/slow.log
# Time: 2025-12-31T11:55:22.123456Z
# Query_time: 12.400 Lock_time: 0.000 Rows_sent: 50 Rows_examined: 50000000
SELECT ... FROM events WHERE customer_id=... ORDER BY created_at DESC LIMIT 50;
Meaning: Rows examined is the tell. 50M examined to return 50 rows is an index or plan problem, not a “more buffer pool” problem.
Decision: Add/repair the index (likely (customer_id, created_at)), validate with EXPLAIN, and consider query rewrite. Don’t “tune” around a missing index.
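A minimal sketch of that index change, using the table and columns from the slow log entry above (verify the names against your schema first):
cr0x@server:~$ mysql -e "ALTER TABLE events ADD INDEX idx_customer_created (customer_id, created_at), ALGORITHM=INPLACE, LOCK=NONE;"
Spelling out ALGORITHM and LOCK means the statement fails fast if the server cannot do it online, instead of silently taking a more blocking path. It still needs a brief metadata lock, so it can queue behind a long transaction (see Task 13).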
Task 15: validate the plan with EXPLAIN and make it readable
cr0x@server:~$ mysql -e "EXPLAIN SELECT id FROM events WHERE customer_id=42 ORDER BY created_at DESC LIMIT 50\G"
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: events
type: ALL
possible_keys: idx_customer_created
key: NULL
rows: 50000000
Extra: Using where; Using filesort
Meaning: type: ALL and filesort on a large table indicates a full scan + sort. The index exists but isn’t used (wrong order, low selectivity, mismatched collation, function on column, or stats are off).
Decision: Fix the query or index. If stats are stale, run analyze. If the optimizer is wrong, consider hints only as a last resort and only with regression tests.
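If stale statistics are the suspect, refreshing them is usually quick; a sketch (it briefly touches table metadata, so avoid firing it on a table with very long-running queries mid-incident):
cr0x@server:~$ mysql -e "ANALYZE TABLE events;"
cr0x@server:~$ mysql -e "EXPLAIN SELECT id FROM events WHERE customer_id=42 ORDER BY created_at DESC LIMIT 50\G"
Re-running the same EXPLAIN afterwards tells you whether the optimizer changed its mind; if it still refuses the index, look at column order, collation, or a function wrapped around the filtered column.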
Task 16: check deadlocks (they’re not rare; they’re just underreported)
cr0x@server:~$ mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '/LATEST DETECTED DEADLOCK/,+60p'
LATEST DETECTED DEADLOCK
------------------------
*** (1) TRANSACTION:
TRANSACTION 987654400, ACTIVE 2 sec updating or deleting
...
WE ROLL BACK TRANSACTION (1)
Meaning: Deadlocks are concurrency bugs, not “database bugs.” They’re usually caused by inconsistent update order or overly large transactions.
Decision: Fix application ordering and transaction scope. Then ensure your retry logic is correct and bounded. Retrying forever is just deadlock cosplay.
Task 17: measure filesystem and mount choices that affect fsync
cr0x@server:~$ mount | egrep ' /var/lib/mysql | /data '
/dev/nvme0n1p2 on /var/lib/mysql type ext4 (rw,noatime,data=ordered)
Meaning: Filesystem and mount options affect metadata updates and write barriers. You don’t need exotic settings; you need consistent, tested ones.
Decision: If you’re seeing fsync stalls, validate storage firmware, filesystem health, and kernel logs before touching database durability knobs.
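A quick look at the kernel ring buffer for storage complaints; the grep patterns are just a starting point, adjust them for your hardware:
cr0x@server:~$ sudo dmesg -T | egrep -i 'nvme|i/o error|reset|timeout' | tail -n 20
Controller resets, link retrains, or I/O errors here mean the “database slowness” conversation is really a hardware conversation.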
Task 18: verify the database is not swapping (it will ruin everything quietly)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 31Gi 28Gi 0.8Gi 0.2Gi 2.2Gi 1.1Gi
Swap: 4Gi 1.9Gi 2.1Gi
Meaning: If swap is in use on a DB host, latency becomes unpredictable. Even “a little swap” can mean pages you care about get evicted.
Decision: Reduce memory footprint (buffer pool too big, per-connection buffers too high), add RAM, or tighten concurrency. Don’t normalize swap on databases unless you enjoy mystery novels.
Checklists / step-by-step plan
This is the operational checklist I want on the wall (or in the runbook). It’s the minimum set of steps that reliably gets you to an answer without tuning-by-folklore.
Checklist A: 10-minute triage (production is on fire)
- Confirm scope: one endpoint, one tenant, or everything? Grab a timestamp window and error rates from your app monitoring.
- Host saturation: run vmstat and iostat -x. Decide: CPU-bound or IO-bound.
- Database queue: SHOW FULL PROCESSLIST to see if there are many “Waiting for …” states or a few long runners.
- Locks: SHOW ENGINE INNODB STATUS and look for long transactions, lock waits, deadlocks.
- Replication (if relevant): check lag. If reads are served from replicas, lag is a user-facing bug.
- Immediate mitigation: kill the worst query, shed load, pause the DDL, or temporarily route read traffic.
- Capture evidence: slow log snippet, EXPLAIN, relevant status counters before restarting anything.
Checklist B: 1-hour diagnosis (find the true bottleneck)
- Top queries by total time: use slow log aggregation or performance schema views (if available). Focus on total time, not just longest single query.
- Plan validation: EXPLAIN the top offenders. Look for full scans, filesorts, temp tables, bad join order.
- Wait classification: identify whether waits are mostly IO, locks, or CPU. If MySQL 8.0, performance_schema can show this clearly; in MariaDB, rely more on InnoDB status + status counters.
- Data shape check: table sizes, index sizes, skewed distributions, hot rows.
- Concurrency check: threads running, connections, transaction scope, batch sizes.
- Storage check: kernel logs, disk latency, filesystem behavior, writeback spikes.
- Decide the fix class: query/index, app behavior, storage, or capacity.
Checklist C: safe tuning sequence (only after you know the bottleneck)
- Query/index fixes first (highest ROI, lowest systemic risk). Make them testable and reversible.
- Concurrency controls second (pooling, queueing, caps). Prevent thundering herds.
- Memory sizing third (buffer pool, per-connection buffers). Watch for swap and OOM.
- IO/durability knobs last and only with explicit risk acceptance. You can buy performance with durability; you cannot buy durability back after data loss.
Short joke #2: Turning off durability is like removing the smoke detector because it’s noisy. You’ll sleep great right up until you don’t.
Three corporate-world mini-stories (realistic, anonymized)
Mini-story 1: the incident caused by a wrong assumption
A mid-size SaaS company migrated a core service from MySQL to MariaDB because the vendor package made it easy and the app “only uses standard SQL.” The team did a careful schema migration, ran integration tests, and declared success. Latency looked fine in staging. Production was a different animal.
The wrong assumption was subtle: they assumed their workload was query-bound. It was not. In production, they ran a high-concurrency write workload with periodic bursts—end-of-hour billing events, each transaction touching a few hot rows. The app used long-lived transactions with extra reads inside “just to be safe.” Those reads weren’t safe. They held locks and extended purge pressure.
Within a day, they started seeing “random” slow queries. Then replication lag crept up. Then a read replica got promoted during a separate maintenance window and the system fell apart: the new primary had a large history list length, the storage had periodic fsync stalls, and the application amplified it by opening new connections during retries.
They initially blamed the engine choice. That was convenient and wrong. The real culprit was transaction scope and hot-row contention. When they shortened transactions, added a missing composite index, and introduced bounded retries, both MariaDB and MySQL behaved fine. The engine mattered less than the workload discipline.
The postmortem conclusion was blunt: “Drop-in replacement” is about correctness, not performance predictability. They added a migration gate: prove the bottleneck class under production-like load before switching engines.
Mini-story 2: the optimization that backfired
A retail platform had a classic pain: a slow report query during business hours. Someone proposed an easy win: increase per-connection buffers and temp table sizes so sorts won’t spill to disk. On paper, this was reasonable. In practice, it triggered a cascade.
The change went in on a Friday (not malicious, just optimistic). At peak traffic, connection counts spiked due to an upstream pooler issue. Each connection now had permission to allocate more memory during sorts and joins. Memory pressure climbed, the kernel started swapping, and suddenly every query—fast and slow—got slower. Latency got spiky, then flatlined.
They rolled back quickly, but the lesson stuck: memory knobs multiply by concurrency. “Bigger buffers” isn’t a free lunch; it’s a buffet where every thread gets a tray. If your app invites 1,000 guests, you need a bouncer, not a bigger kitchen.
They fixed it the boring way. First, they repaired connection pooling and set sane concurrency caps. Then they added a targeted index and rewrote the report query to pre-aggregate. After that, modest temp table tuning actually helped because it wasn’t fighting swap anymore.
Mini-story 3: the boring but correct practice that saved the day
A financial services team ran both MySQL and MariaDB across different business units. The infra was messy, but their operational hygiene was unusually good. Every database host had a standard “evidence capture” script: it grabbed processlist, InnoDB status, key status counters, IO stats, and a slice of slow logs, and stored it with timestamps.
One afternoon, they got a burst of timeouts. The on-call ran the script twice, two minutes apart. The second run showed rising Innodb_log_waits and a sudden jump in disk await. Processlist showed many short writes waiting on commit. This was not a query plan issue; it was a write latency issue.
Because they had the data, they didn’t waste time debating engine differences. They checked kernel logs and found storage-level resets on one NVMe device. The RAID/controller layer masked it just enough to look like “database slowness.” They failed over cleanly to a standby and replaced the device the same day.
The fix wasn’t clever. It was repeatable evidence. Their “boring script” prevented a week of superstition-driven tuning and an argument about which fork of MySQL was “more stable.”
Common mistakes: symptom → root cause → fix
This is the part where production systems confess their sins. Each item is specific because vague advice is how incidents reproduce.
1) Symptom: “CPU is high, but queries are simple”
Root cause: concurrency and contention. Many “simple” queries can still fight over hot rows, indexes, or internal mutexes. Another common culprit: inefficient collation comparisons or function calls on indexed columns that prevent index use.
Fix: Identify top queries by total time, confirm index usage with EXPLAIN, and reduce concurrency at the app. Consider adding or reshaping indexes to avoid hot spots.
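A concrete version of the “function on an indexed column” trap, using hypothetical orders.created_at names; the first form cannot use a plain index on created_at, the second can:
cr0x@server:~$ mysql -e "EXPLAIN SELECT id FROM orders WHERE DATE(created_at) = '2025-12-31'\G"
cr0x@server:~$ mysql -e "EXPLAIN SELECT id FROM orders WHERE created_at >= '2025-12-31' AND created_at < '2026-01-01'\G"
The rewrite returns the same rows but gives the optimizer a sargable range instead of a value it must compute per row.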
2) Symptom: “Latency spikes every few minutes, then recovers”
Root cause: IO flush storms, checkpointing behavior, or storage hiccups. Also: background jobs or DDL starting on a schedule.
Fix: Correlate spikes with iostat await, Innodb_os_log_fsyncs, and dirty page behavior. Move bursts off peak, batch writes, and verify storage health.
3) Symptom: “Replica lag grows, but CPU is low”
Root cause: replica SQL apply is serialized by hot rows, big transactions, DDL, or single-threaded apply configuration. Low CPU doesn’t mean healthy; it can mean waiting on IO or locks.
Fix: Inspect relay log growth, check for long transactions, enable/adjust multi-threaded replication where applicable, and reduce transaction size on the primary.
4) Symptom: “Lots of connections, but Threads_running is not that high”
Root cause: connection churn and resource overhead; application isn’t pooling, or health checks are too aggressive. Also: max_connections set high to “fix” errors, creating a bigger blast radius.
Fix: Fix pooling, cap concurrency, use a proxy/pooler if needed, and keep max_connections realistic so overload fails fast instead of slowly.
5) Symptom: “Deadlocks started after adding an index”
Root cause: changed access paths altered lock acquisition order. The index didn’t create deadlocks; it revealed ordering bugs under concurrency.
Fix: Standardize update order in code, reduce transaction scope, and ensure deadlock retries are bounded. Use consistent WHERE clauses and avoid range updates on hot tables.
6) Symptom: “ALTER TABLE hangs forever”
Root cause: metadata lock waiting behind a long transaction, often a “harmless” read transaction left open by a connection pool.
Fix: Find the blocker via processlist/InnoDB status, kill or end the long transaction, and use online schema change strategies with operational guardrails.
7) Symptom: “IO utilization is low, but await is high”
Root cause: queueing in the storage stack, throttling, noisy neighbor, or latency injected by firmware/controller. %util can lie on modern devices and virtualized environments.
Fix: Trust latency metrics, check kernel logs, validate device health, and consider isolating IO (dedicated volumes, better QoS) before tuning InnoDB.
8) Symptom: “Queries got slower after upgrading”
Root cause: optimizer plan regressions, changed defaults, changed statistics behavior, or different SQL modes. MariaDB and MySQL diverge here more than people admit.
Fix: Capture EXPLAIN plans before/after, refresh stats, pin critical queries with tested hints if necessary, and run workload-based regression tests—not just unit tests.
MariaDB vs MySQL: where the differences actually bite
If you’re diagnosing bottlenecks, the engine choice matters mainly where it changes instrumentation, optimizer behavior, concurrency control, and replication semantics. Here’s how to think about it without turning it into a religious war.
Instrumentation: can you see what’s happening?
MySQL 8.0’s performance schema is a strong point for production diagnosis: waits, stages, statement digests, and contention can be made visible. MariaDB has instrumentation too, but the maturity and defaults differ across versions and distros.
Operational advice: pick one “golden path” for evidence capture per engine. For MySQL 8.0, lean into performance schema and digest summaries. For MariaDB, validate what’s enabled and rely on a mix of status counters, InnoDB status, slow logs, and targeted sampling.
Optimizer drift: same query, different plan
This is where “drop-in” gets spicy. A query can be correct on both engines yet choose different join orders or index paths. Under load, that’s not academic; it’s the difference between “works” and “melts.”
Operational advice: treat engine changes and major upgrades as plan-change events. Baseline critical query plans and regression-test them under representative data distributions.
Concurrency behavior: thread handling and internal contention
MariaDB often gets discussed alongside thread pool behavior in certain distributions. MySQL has its own thread/concurrency behavior and improvements over versions. In both, the theme is the same: uncontrolled concurrency turns latency into a lottery.
Operational advice: cap concurrency in the app and at the database boundary. Don’t “fix” overload by raising max_connections. That’s like widening the funnel while the pipe is clogged.
Replication and clustering: different tools, different failure modes
MySQL’s ecosystem leans into GTIDs, semi-sync, group replication, and managed-service patterns. MariaDB often appears with Galera in the field. Both can be reliable. Both can also fail in ways that look like “random slowness” when the real problem is flow control, certification conflicts, or apply serialization.
Operational advice: decide what you need: read scaling, failover, multi-writer, or geo distribution. Then design for it. Don’t “enable clustering” as a performance feature; it’s an availability feature with latency costs.
Storage engine reality: most people are just running InnoDB
If your workload is InnoDB, many of your bottlenecks will be the same across both: buffer pool misses, redo log flush, doublewrite behavior, page flushing, and lock contention. That’s why the checklist works regardless of brand.
Operational advice: if someone claims “MariaDB is slower” or “MySQL is slower,” ask: slower at what bottleneck class? And where is the evidence?
FAQ
1) What’s the single fastest way to find the bottleneck?
Classify it: CPU-bound vs IO-bound vs lock-bound vs replication-bound. Use vmstat, iostat -x, and SHOW FULL PROCESSLIST before you touch config.
2) Should I start with slow query log or performance schema?
Start with whatever is already enabled and reliable. Slow query log is low friction and works on both engines. Performance schema is richer on MySQL 8.0, but you need to know what’s enabled and the overhead you’re willing to accept.
3) If disk latency is high, should I just increase the buffer pool?
Only if you have RAM headroom and your working set can benefit. If your active dataset is far larger than memory, buffer pool increases have diminishing returns. Fix selectivity and indexing first.
4) Why is “Threads_running” more important than “Threads_connected”?
Connected tells you how many clients are parked. Running tells you how many are actively consuming CPU or waiting inside the engine. High running counts often correlate with contention and queueing.
5) Is replication lag a database problem or an application problem?
Both. Lag often comes from application transaction shape: big transactions, hot rows, or DDL. The database config (apply parallelism, IO capacity) can help, but it can’t make a fundamentally serialized workload parallel.
6) When is it okay to change durability settings for performance?
When the business explicitly accepts the risk and you can quantify blast radius. Otherwise, treat durability as non-negotiable and fix the actual bottleneck (storage, transaction rate, batching).
7) My query uses an index but is still slow—why?
Because “uses an index” can still mean scanning a large range, doing random IO, or sorting afterward. Look at rows examined, “Using filesort,” “Using temporary,” and whether the index matches the filter + order.
8) How do I avoid plan regressions during MariaDB/MySQL upgrades?
Baseline critical queries (EXPLAIN plans and runtime under representative data), run workload replay or load tests, and compare top query digests before and after. Treat optimizer changes as a feature and a risk.
9) What’s the most common root cause of “it’s slow sometimes”?
Queueing: fsync stalls, lock waits, or background flush bursts. Intermittent latency usually means something is blocking, not that the CPU forgot how to compute.
10) Should we move from MySQL to MariaDB (or the other way) to fix performance?
Not as a first move. Switching engines can help specific cases (features, instrumentation, replication model), but most performance wins come from query/index fixes, transaction discipline, and storage correctness.
Next steps that won’t ruin your weekend
Here’s the practical plan I’d hand to an on-call rotation that wants fewer 3 a.m. mysteries and more predictable systems:
- Standardize evidence capture: a script/runbook that grabs processlist, InnoDB status, key counters, and host IO stats with timestamps (a sketch follows this list).
- Enable at least one query-tracing source: slow query log with sane thresholds, plus plan capture for top offenders.
- Establish your bottleneck taxonomy: CPU, IO, locks, replication, memory. Tag incidents accordingly. Patterns emerge fast.
- Fix the workload shape: shorten transactions, add missing composite indexes, reduce fan-out joins, and cap concurrency upstream.
- Only then tune: adjust memory sizing and IO-related settings based on proven evidence, and do it with rollback steps.
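A minimal sketch of that evidence-capture script, assuming the mysql client authenticates via socket or ~/.my.cnf; the script name and output directory are arbitrary:

#!/usr/bin/env bash
# capture-db-evidence.sh (hypothetical name): snapshot DB and host state with a timestamp.
set -euo pipefail
ts=$(date +%Y%m%dT%H%M%S)
out="/var/tmp/db-evidence-$ts"
mkdir -p "$out"
mysql -e "SHOW FULL PROCESSLIST\G"       > "$out/processlist.txt"
mysql -e "SHOW ENGINE INNODB STATUS\G"   > "$out/innodb_status.txt"
mysql -e "SHOW GLOBAL STATUS;"           > "$out/global_status.txt"
mysql -e "SHOW GLOBAL VARIABLES;"        > "$out/global_variables.txt"
vmstat 1 5                               > "$out/vmstat.txt"
iostat -x 1 5                            > "$out/iostat.txt" || true
free -h                                  > "$out/free.txt"
echo "captured: $out"

Run it twice a few minutes apart during an incident; the diff between the two snapshots is usually more informative than either snapshot alone.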
The real MariaDB vs MySQL performance advantage is not a secret parameter. It’s operational clarity. You win by finding the bottleneck quickly, fixing the real cause, and refusing to confuse activity with progress.