There are two kinds of database “performance issues”: the ones you can reproduce, and the ones that vanish the moment you open a ticket. The slow query log is what turns the second kind into the first—because it records what actually happened, not what someone swears happened.
If you can spend one focused hour with a slow log and walk out with a 2× speedup, it’s usually not because you’re a wizard. It’s because production has been patiently doing the same dumb expensive thing all day, and nobody made it visible.
MySQL vs MariaDB: what actually differs for slow logs
On paper, MySQL and MariaDB both offer a slow query log, and you enable it the same way: set a threshold, log the offenders, and then analyze. In practice, the details matter because the “same” query can show up with different timing, different noise, and different visibility depending on engine defaults, instrumentation, and optimizer behavior.
What’s the same
- Slow query log semantics: a record of statements exceeding a time threshold (or meeting other criteria like “not using indexes” if you choose that path).
- Core knobs: slow_query_log, slow_query_log_file, long_query_time, log_queries_not_using_indexes (with caveats), log_slow_admin_statements.
- Analysis approach: group by fingerprint, sort by total time, then chase the worst offenders with EXPLAIN/EXPLAIN ANALYZE, schema/index changes, and query rewrites.
What differs enough to trip you up
1) Optimizer and execution differences show up as different slow-log “top queries”. MariaDB and MySQL diverged over time: optimizer strategies, derived table handling, and feature defaults can lead to different plans. That means you can’t blindly apply “the fix we used on MySQL” to MariaDB (or vice versa) without validating plans and row estimates.
2) Instrumentation paths differ. MySQL 8 leans heavily on Performance Schema and sys schema views for analysis. MariaDB has Performance Schema too, but in some deployments it’s disabled or less relied upon; MariaDB also has its own diagnostic helpers. Either way, the slow log is the baseline truth, but the supporting telemetry can differ.
3) Version defaults matter more than the brand. A MySQL 5.7 box and a MySQL 8.0 box can differ from each other more than the MySQL-versus-MariaDB gap you picture. Same for MariaDB 10.3 vs 10.11. For slow logs, the important part is: do you have accurate timestamps, are you capturing enough context, and can you connect a slow statement to resource bottlenecks?
Here’s the uncomfortable production truth: the database you have is the one you must tune. Start from the logs, not ideology.
Fast diagnosis playbook: find the bottleneck before it finds you
This is the “I have 20 minutes before the next incident call” flow. It’s not exhaustive. It’s designed to find the dominant bottleneck quickly and stop you from chasing ghosts.
First: confirm the problem is query time, not connection time
- Check if requests are slow because queries are slow, or because threads are saturated, or because you’re waiting on locks.
- Look for symptoms: rising “Threads_running”, “Questions” flatlining while latency rises, spikes in “Innodb_row_lock_time”, or lots of “Sending data” in processlist.
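A quick way to pull those counters in one round trip (treat this as a sketch; what counts as "high" is workload-specific):

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Threads_running','Questions','Innodb_row_lock_time','Innodb_row_lock_waits');"

Run it twice a few seconds apart and diff the values: Questions is throughput, Threads_running is saturation, and the lock counters tell you whether sessions are waiting instead of working.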
Second: identify the top offender by total time, not by worst single query
- A query that takes 1.2s once an hour is not your incident. A query that takes 80ms and runs 50k times is.
- Sort by sum of Query_time in the slow log report, then by count.
Third: classify the bottleneck
Use the offender’s pattern to put it into a bucket:
- CPU/optimizer: high rows examined, poor index usage, bad join order, filesort, temp tables.
- I/O: buffer pool misses, read amplification, temp tables on disk, slow storage latency.
- Locking: high lock time, “Waiting for … lock”, long transactions, hot rows.
- Application behavior: N+1 queries, chatty ORM patterns, no pagination, “SELECT *” in angry loops.
Fourth: pick the lowest-risk, highest-payoff fix
- Create or adjust an index that matches the WHERE + JOIN + ORDER BY pattern.
- Rewrite a query to avoid scanning, or to reduce row width, or to pre-aggregate.
- Reduce lock scope (shorter transactions, consistent access order, better isolation choices).
- Only then consider configuration changes. Config fixes are real, but they’re also the easiest to cargo-cult.
To paraphrase an idea Gene Kim keeps returning to: improvement comes from making work visible and reducing the cost of learning from production.
The one-hour slow log workflow that gets real wins
One hour is enough time to get a 2× speedup if you do less “analysis” and more “triage”. The goal is not perfect understanding. The goal is one dominant query class fixed safely.
Minute 0–10: make sure you’re logging the right thing
Start by confirming the slow log is enabled, the threshold is sensible, and the log format is parseable. If the threshold is too high, you won’t see the death-by-a-thousand-cuts queries. If it’s too low, you’ll drown in noise and your disk will stage a protest.
Recommended starting thresholds in production:
- long_query_time: 0.1–0.5s for OLTP, 1–2s for mixed workloads. If you’re already on fire, start at 0.2s and adjust.
- log_queries_not_using_indexes: usually off at first. It produces a lot of false positives (small tables, legitimate scans). Turn it on briefly if you know what you’re doing.
- log_slow_admin_statements: on, because “ALTER TABLE” can be the silent assassin.
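To make those settings survive a restart, put them in the server config as well. A minimal sketch, assuming the usual [mysqld] section and Debian-style file paths:

[mysqld]
slow_query_log            = ON
slow_query_log_file       = /var/log/mysql/slow.log
long_query_time           = 0.2
log_slow_admin_statements = ON
# leave log_queries_not_using_indexes off unless you are running a short, deliberate experiment

Dynamic SET GLOBAL changes (Task 2 below) take effect immediately; the config file is what makes them stick.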
Minute 10–25: produce a top-N report and pick one victim
Generate a digest report (pt-query-digest is the workhorse). Sort by total query time. Pick the top query class that is (a) frequent, (b) expensive, (c) fixable without rewriting half the app.
This is where people sabotage themselves: they pick the most complex query because it looks “important”. You want the boring query that runs constantly.
Minute 25–45: validate with EXPLAIN ANALYZE and one targeted index or rewrite
Take a representative query sample (not the worst outlier) and run EXPLAIN and, if available, EXPLAIN ANALYZE. You’re looking for the mismatch between what you think happens and what actually happens: table scans, wrong join order, rows examined exploding, temp tables, filesorts, or using the wrong index.
Then do exactly one change:
- Create one index (or adjust one composite index order).
- Rewrite one predicate to be sargable (search-arg-able), e.g., avoid wrapping indexed columns in functions.
- Add a LIMIT/pagination strategy, or split a query into a cheap pre-filter plus a join.
Minute 45–60: prove it improved, and prove you didn’t break anything
Re-run EXPLAIN ANALYZE. Compare the slow log before/after (even a short sample window helps). Watch p95 latency and CPU/IO counters. If the query got faster but the system got worse, you probably pushed load somewhere else (locks, temp tables, replication, disk).
Joke #1: The slow query log is the only coworker who remembers what happened last night, and it never “forgets” to write it down.
Practical tasks (commands, outputs, decisions)
Below are practical, runnable tasks. Each includes: a command, an example of what output looks like, what it means, and what decision you make. Use them as a checklist, not as a ritual. The point is to reduce time-to-truth.
Task 1: Confirm slow log is enabled and where it’s writing
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'slow_query_log'; SHOW VARIABLES LIKE 'slow_query_log_file'; SHOW VARIABLES LIKE 'long_query_time';"
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| slow_query_log | ON    |
+----------------+-------+
+---------------------+-------------------------+
| Variable_name       | Value                   |
+---------------------+-------------------------+
| slow_query_log_file | /var/log/mysql/slow.log |
+---------------------+-------------------------+
+-----------------+----------+
| Variable_name   | Value    |
+-----------------+----------+
| long_query_time | 0.200000 |
+-----------------+----------+
What it means: Logging is active, threshold is 200ms, file path is known.
Decision: If slow_query_log is OFF, turn it ON (temporarily if you must). If long_query_time is 10s, you’re not observing your real workload.
Task 2: Enable slow log dynamically (safe, reversible)
cr0x@server:~$ mysql -e "SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 0.2; SET GLOBAL log_slow_admin_statements = 'ON';"
What it means: You’ve enabled capture without a restart. Note that a new global long_query_time applies to connections opened after the change; existing sessions keep the value they started with.
Decision: Do this during an incident if you lack data. But also schedule a config change later so it persists across restarts.
Task 3: Confirm you’re not accidentally logging everything (file growth sanity check)
cr0x@server:~$ sudo ls -lh /var/log/mysql/slow.log
-rw-r----- 1 mysql adm 1.8G Dec 29 10:12 /var/log/mysql/slow.log
What it means: The file is big. Maybe it’s okay. Maybe it’s eating disk.
Decision: If it’s growing too fast, raise long_query_time slightly or sample during peak, then rotate.
Task 4: Rotate the slow log without restarting mysqld
cr0x@server:~$ mysql -e "FLUSH SLOW LOGS;"
What it means: MySQL/MariaDB closes and reopens the slow log file, allowing logrotate to do its job.
Decision: Use this before starting a one-hour capture window so your dataset is “clean”.
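If logrotate owns the file, a sketch like the following keeps it bounded. The schedule, retention, and the credentials file path are assumptions for illustration; hourly rotation also works if logrotate itself is invoked hourly.

/var/log/mysql/slow.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        /usr/bin/mysql --defaults-extra-file=/etc/mysql/credentials.cnf -e "FLUSH SLOW LOGS;"
    endscript
}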
Task 5: Quick skim: top offenders with mysqldumpslow
cr0x@server:~$ mysqldumpslow -s t -t 10 /var/log/mysql/slow.log
Reading mysql slow query log from /var/log/mysql/slow.log
Count: 842 Time=0.38s (319s) Lock=0.00s (0s) Rows=1.0 (842), root[root]@10.0.2.15
SELECT * FROM orders WHERE user_id=N AND status='S' ORDER BY created_at DESC LIMIT N
Count: 55 Time=2.12s (116s) Lock=0.01s (1s) Rows=30000.0 (1650000), app[app]@10.0.3.9
SELECT customer_id, SUM(amount) FROM invoices WHERE created_at >= 'S' GROUP BY customer_id
What it means: First query is frequent and moderately slow; second is heavy but less frequent. Total time matters: 319s vs 116s in this sample.
Decision: Investigate the top by total time first. The 842-call query is the likely 2× win.
Task 6: Proper grouping and ranking with pt-query-digest
cr0x@server:~$ pt-query-digest /var/log/mysql/slow.log | sed -n '1,120p'
# 452s total, 912 queries, 0.50s avg, 0.92s 95%, 3.4s max
# Rank Query ID           Response time  Calls R/Call V/M   Item
# ==== ================== ============== ===== ====== ===== ============
#    1 0x8F3A1C9E6B0A2C11 319.1240 70.6%   842 0.3790  1.02 SELECT orders
#    2 0xB11D0E2A6F9C0D22 116.6042 25.8%    55 2.1201  1.40 SELECT invoices
# ...
What it means: Query #1 dominates wall clock time. Fixing it buys you the biggest win.
Decision: Pull the exact query sample from the pt-query-digest section and take it to EXPLAIN.
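If the file spans days, bound the analysis to your capture window; pt-query-digest accepts time filters (the timestamps below are illustrative):

cr0x@server:~$ pt-query-digest --since '2025-12-29 09:00:00' --until '2025-12-29 10:00:00' --limit 10 /var/log/mysql/slow.log

The --limit option caps the report at the top N query classes, which is all you need for a one-hour triage.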
Task 7: Find the exact table/index situation (SHOW CREATE TABLE)
cr0x@server:~$ mysql -e "SHOW CREATE TABLE orders\G"
*************************** 1. row ***************************
Table: orders
Create Table: CREATE TABLE `orders` (
`id` bigint NOT NULL AUTO_INCREMENT,
`user_id` bigint NOT NULL,
`status` varchar(16) NOT NULL,
`created_at` datetime NOT NULL,
`total` decimal(10,2) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_status` (`status`),
KEY `idx_user` (`user_id`)
) ENGINE=InnoDB
What it means: There’s no composite index matching (user_id, status, created_at). Sorting by created_at likely causes extra work.
Decision: Add a composite index aligned to WHERE and ORDER BY.
Task 8: EXPLAIN ANALYZE the slow query (reality check)
cr0x@server:~$ mysql -e "EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id=123 AND status='open' ORDER BY created_at DESC LIMIT 20\G"
*************************** 1. row ***************************
EXPLAIN: -> Limit: 20 row(s) (actual time=210.112..210.140 rows=20 loops=1)
-> Sort: orders.created_at DESC (actual time=210.109..210.114 rows=3200 loops=1)
-> Index lookup on orders using idx_user (user_id=123) (actual time=0.210..205.900 rows=3200 loops=1)
What it means: It uses idx_user, then sorts 3200 rows to pick 20. That’s where the time goes.
Decision: Add an index to satisfy filtering and ordering so the engine can stop early.
Task 9: Create the index safely (and know what “online” means)
cr0x@server:~$ mysql -e "ALTER TABLE orders ADD INDEX idx_user_status_created (user_id, status, created_at DESC);"
Query OK, 0 rows affected (12.41 sec)
Records: 0 Duplicates: 0 Warnings: 0
What it means: Index created. Time and locking behavior depend on version and settings; some builds do this “in place”, but it still consumes IO and can stall on metadata locks.
Decision: Run during a low-traffic window if table is large. Monitor for metadata lock waits.
Task 10: Re-run EXPLAIN ANALYZE to verify early exit
cr0x@server:~$ mysql -e "EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id=123 AND status='open' ORDER BY created_at DESC LIMIT 20\G"
*************************** 1. row ***************************
EXPLAIN: -> Limit: 20 row(s) (actual time=1.432..1.455 rows=20 loops=1)
-> Index lookup on orders using idx_user_status_created (user_id=123, status='open') (actual time=0.210..1.410 rows=20 loops=1)
What it means: No sort step, only 20 rows touched. That’s how you buy latency back.
Decision: If the plan didn’t change, investigate why (wrong index order, collation/type mismatch, or optimizer preference).
Task 11: Check for lock contention in the slow log and runtime counters
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_time%'; SHOW GLOBAL STATUS LIKE 'Threads_running';"
+----------------------------+--------+
| Variable_name | Value |
+----------------------------+--------+
| Innodb_row_lock_time | 184223 |
| Innodb_row_lock_time_max | 12000 |
| Innodb_row_lock_waits | 912 |
+----------------------------+--------+
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| Threads_running | 48 |
+-----------------+-------+
What it means: There is meaningful lock waiting. Threads_running is high, indicating concurrency pressure.
Decision: If lock time is a big chunk of Query_time in slow log entries, focus on transaction scope and hot rows, not indexes.
Task 12: Inspect live queries and spot “Waiting for … lock” quickly
cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST;" | sed -n '1,18p'
Id User Host db Command Time State Info
1209 app 10.0.3.9:53122 prod Query 12 Waiting for table metadata lock ALTER TABLE orders ADD COLUMN note varchar(64)
1211 app 10.0.3.7:60111 prod Query 9 Sending data SELECT * FROM orders WHERE user_id=123 AND status='open' ORDER BY created_at DESC LIMIT 20
1220 app 10.0.3.6:42290 prod Query 8 Sending data SELECT * FROM orders WHERE user_id=456 AND status='open' ORDER BY created_at DESC LIMIT 20
What it means: An ALTER TABLE is blocking. That’s not “slow query tuning”; that’s change management.
Decision: Stop the DDL (if safe) or reschedule it. Prevent repeats with disciplined migration practices.
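On MySQL with the sys schema available and metadata-lock instrumentation enabled in performance_schema, you can also ask directly who is blocking whom; treat this as a sketch, since instrumentation defaults vary by version:

cr0x@server:~$ mysql -e "SELECT * FROM sys.schema_table_lock_waits\G"

MariaDB exposes similar data through the metadata_lock_info plugin (information_schema.METADATA_LOCK_INFO), if that plugin is installed.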
Task 13: Check InnoDB buffer pool health (are we reading from disk?)
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';"
+---------------------------------------+-----------+
| Variable_name | Value |
+---------------------------------------+-----------+
| Innodb_buffer_pool_read_requests | 982341234 |
| Innodb_buffer_pool_reads | 19348231 |
+---------------------------------------+-----------+
What it means: Reads from disk are non-trivial. The ratio matters more than the raw number: here roughly 19.3 million disk reads against 982 million buffer pool requests, about a 2% miss rate. Lots of disk reads can be fine on large working sets, but they can also mean your buffer pool is undersized or your queries are scanning.
Decision: If the working set fits in memory but you’re still reading heavily, you likely have poor indexes or access patterns. If the working set truly exceeds memory, tune the biggest scans first and consider memory sizing.
Task 14: Check temporary tables and whether they spill to disk
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';"
+-------------------------+--------+
| Variable_name | Value |
+-------------------------+--------+
| Created_tmp_disk_tables | 48291 |
| Created_tmp_tables | 90210 |
+-------------------------+--------+
What it means: Many temp tables are hitting disk. This correlates with GROUP BY/ORDER BY on non-indexed columns, wide rows, or insufficient tmp table memory settings.
Decision: Prefer query and index fixes first. If query is inherently heavy, adjust tmp_table_size/max_heap_table_size carefully and validate memory impact.
Task 15: Validate storage latency during the slow window (don’t guess)
cr0x@server:~$ iostat -xz 1 5
Linux 6.5.0 (server) 12/29/2025 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
22.14 0.00 7.12 18.33 0.00 52.41
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await aqu-sz %util
nvme0n1 322.0 18232.0 0.0 0.00 18.40 56.60 90.0 9120.0 4.10 6.32 78.00
What it means: r_await ~18ms suggests storage reads are not “fast”. If your workload expects low-millisecond reads, this contributes directly to query time when buffer pool misses occur.
Decision: If latency is high, you can still get a 2× win by reducing reads (indexing, better LIMIT usage). Also investigate noisy neighbors, RAID degradation, fsync storms, and checkpointing behavior.
Task 16: Check binary log and replication lag signals (fixes can move pain)
cr0x@server:~$ mysql -e "SHOW MASTER STATUS\G; SHOW SLAVE STATUS\G" | sed -n '1,40p'
*************************** 1. row ***************************
File: mysql-bin.002341
Position: 91822310
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set:
*************************** 1. row ***************************
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 47
What it means: Replication is behind. Heavy queries or DDL can worsen lag. Some “optimizations” increase write amplification (new indexes) and make replication slower temporarily. (Recent MySQL releases replace SHOW SLAVE STATUS with SHOW REPLICA STATUS; the older form shown here still works on MariaDB and older MySQL.)
Decision: Schedule index builds carefully and watch lag. If lag is business-critical, consider throttling migrations or using online schema change tooling.
Slow log patterns that usually buy you 2×
Most 2× wins are not exotic. They’re removing accidental work. Here are the patterns that show up in slow logs across both MySQL and MariaDB fleets.
1) “ORDER BY … LIMIT” without a matching index
This is the classic: filter by some columns, order by timestamp, limit 20. Without a composite index that matches filter + order, the engine reads many rows and sorts them. Your slow log shows moderate time per call, massive call count, and “Rows_examined” much larger than “Rows_sent”.
Fix: A composite index with equality predicates first, then the ordering column. Use DESC indexes where supported and sensible; otherwise the engine can read backward in some cases, but don’t assume. Validate with EXPLAIN ANALYZE.
2) Non-sargable predicates
If you have WHERE DATE(created_at)=… or WHERE LOWER(email)=…, you’ve forced the optimizer into a corner. It can’t use the index efficiently because you wrapped the indexed column in a function.
Fix: Rewrite to a range (created_at >= … AND created_at < …) or store normalized values in an indexed generated column (version-dependent), or normalize at write time.
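A concrete before/after, reusing the orders table from earlier as illustration (verify the rewrite returns identical rows for your data and time zone handling):

-- non-sargable: the function hides created_at from any index on it
SELECT id, total FROM orders WHERE DATE(created_at) = '2025-12-29';

-- sargable: same result, but the optimizer can use a range scan on created_at
SELECT id, total FROM orders
WHERE created_at >= '2025-12-29 00:00:00'
  AND created_at <  '2025-12-30 00:00:00';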
3) Large IN() lists and “application-made joins”
Slow logs love to show queries like WHERE id IN (…thousands…). Often that’s an ORM pattern or a batching attempt that went feral. It can cause large parse overhead, poor plans, and big temporary structures.
Fix: Use a temporary table, a join against a real table, or redesign the access pattern. If you must, limit batch size and measure.
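One shape of that fix, sketched with a hypothetical staging table name (tmp_order_ids):

-- instead of WHERE id IN (…thousands of literals…)
CREATE TEMPORARY TABLE tmp_order_ids (id BIGINT PRIMARY KEY);
INSERT INTO tmp_order_ids VALUES (101), (102), (103);  -- the app inserts the full list in batches
SELECT o.id, o.status, o.total
FROM orders AS o
JOIN tmp_order_ids AS t ON t.id = o.id;

The join gives the optimizer a real table with a primary key to work with, instead of a parser-melting literal list.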
4) Hot row updates and lock waits
Locking problems look like “random slowness” until you observe Lock_time in slow log entries and see waiting states in processlist. A single hot counter row can slow an entire service.
Fix: Reduce contention: shard counters, use INSERT … ON DUPLICATE KEY UPDATE across sharded keys, or move the hot path to something designed for high write contention. Also shorten transactions and ensure consistent lock ordering.
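A sharded counter, sketched with hypothetical names: writes spread across a handful of rows, reads sum them.

CREATE TABLE page_view_counter (
  page_id BIGINT NOT NULL,
  shard   TINYINT NOT NULL,       -- 0..15, picked per write
  views   BIGINT NOT NULL DEFAULT 0,
  PRIMARY KEY (page_id, shard)
);

-- write path: each request touches one of 16 rows instead of one hot row
INSERT INTO page_view_counter (page_id, shard, views)
VALUES (42, FLOOR(RAND() * 16), 1)
ON DUPLICATE KEY UPDATE views = views + 1;

-- read path: aggregate the shards
SELECT SUM(views) FROM page_view_counter WHERE page_id = 42;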
5) GROUP BY that spills to disk
If you’re aggregating large sets and spilling temporary tables to disk, storage latency becomes your query latency. Slow logs show big Query_time with relatively low CPU usage, and your tmp disk table counters climb.
Fix: Add indexes that support grouping, reduce row width, pre-aggregate, or move reporting workloads off the primary OLTP instance.
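One way to pre-aggregate, sketched against the invoices query from the earlier digest (table and column names for the summary are illustrative):

CREATE TABLE invoice_daily_totals (
  day         DATE NOT NULL,
  customer_id BIGINT NOT NULL,
  total       DECIMAL(14,2) NOT NULL,
  PRIMARY KEY (day, customer_id)
);

-- refreshed periodically (or incrementally on write), not at report time
INSERT INTO invoice_daily_totals (day, customer_id, total)
SELECT DATE(created_at), customer_id, SUM(amount)
FROM invoices
WHERE created_at >= CURRENT_DATE AND created_at < CURRENT_DATE + INTERVAL 1 DAY
GROUP BY DATE(created_at), customer_id
ON DUPLICATE KEY UPDATE total = VALUES(total);

Reports then read the summary table, and the big GROUP BY stops competing with OLTP traffic.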
6) “SELECT *” on wide tables
Wide row reads cost memory, buffer pool, network, and CPU. They also sabotage covering indexes. Slow logs show queries that fetch columns the application never uses.
Fix: Select only needed columns. If you want speed, stop hauling furniture when you only need your keys.
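The fix is rarely more complicated than naming the columns. With the composite index from Task 9, this version can even become index-only on InnoDB, since secondary indexes carry the primary key:

SELECT id, status, created_at
FROM orders
WHERE user_id = 123 AND status = 'open'
ORDER BY created_at DESC
LIMIT 20;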
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran an e-commerce platform with a MySQL primary and a couple of replicas. The app team had just added “recent orders” to a user dashboard. It looked harmless: filter by user_id, status, order by created_at, limit 20. The query was fast in staging, because staging had small tables and warm caches—like all staging environments, it was a comforting lie.
In production, p95 latencies doubled during peak. The on-call engineer did the usual: scaled app pods, restarted a few things, watched it get worse. CPU on the DB climbed, but not to the ceiling. Disk wasn’t pegged. It was “mysterious”.
The wrong assumption was simple: “LIMIT 20 means it only reads 20 rows.” Without a matching composite index, the engine read thousands of rows per user, sorted them, then returned 20. It was the same story repeated for thousands of users, thousands of times. The slow query log showed a query that wasn’t individually terrifying—350ms-ish—but it ran constantly.
They added a composite index (user_id, status, created_at). The query dropped to a couple milliseconds. The incident ended not with a heroic tuning session, but with one index and a reminder that SQL does not read your mind.
The follow-up action was more interesting: they changed their deploy checklist to require capturing a 15-minute slow log sample after any feature that adds a new query path. Not because they love process. Because they love sleeping.
Mini-story 2: The optimization that backfired
A different org ran MariaDB for a billing system. Reporting queries were hammering the primary during business hours. Someone proposed a quick fix: “Let’s turn on log_queries_not_using_indexes and add indexes for everything that shows up.” The logic sounded crisp. It also turned the slow log into a firehose.
Within a day, they had created several new indexes, including some wide composites on columns that were barely selective. Writes slowed. Buffer pool churn increased. The change also increased replication lag because every insert/update now had to maintain more index structures. Meanwhile, the slow queries didn’t improve much—because the worst offenders were legitimate full scans over monthly partitions for financial summaries. Logging “not using indexes” was accusing the database of doing exactly what the query asked.
The backfire was subtle: the team spent time chasing “missing index” warnings instead of addressing workload shape. The right solution was to move reporting to a replica (or a dedicated analytics path), add a few targeted indexes, and change a couple of queries to pre-aggregate into summary tables.
They eventually rolled back several indexes, which is always a fun day because dropping indexes can also take time and IO. Lesson learned: don’t let a diagnostic flag become a product requirement.
Joke #2: Turning on every logging option “for visibility” is like replacing your smoke detector with a fog machine.
Mini-story 3: The boring but correct practice that saved the day
A SaaS company ran MySQL 8 for multi-tenant workloads. They had a practice that nobody bragged about: the slow query log was always enabled at a modest threshold, rotated hourly, and summarized daily. No drama. No “we’ll enable it when needed.” It was just there, quietly capturing truth.
One afternoon, latency spiked. Not catastrophic, but noticeable. The on-call didn’t guess. They pulled the last hour’s slow log, ran pt-query-digest, and saw a new query fingerprint appear at the top. It was an ORM-generated join with an unexpected predicate on a low-cardinality column, causing row explosions.
Because they had the baseline daily summaries, they could prove the query was new and quantify impact without speculation. They also had a policy: any index change required an EXPLAIN ANALYZE before/after, and a quick check of replication lag. That policy is not cool. It is effective.
They shipped a small query rewrite and a narrow composite index. Latency returned to normal. The post-incident note was short: “slow log caught it; fix validated; no collateral damage.” That’s the dream—boring, correct, repeatable.
Common mistakes: symptom → root cause → fix
This section is intentionally specific. If you can map a symptom to a root cause quickly, you get your life back.
1) Symptom: Slow log shows high Query_time but low Rows_examined
Root cause: Waiting, not scanning—locks, fsync pressure, or network stalls. Often Lock_time is high, or storage latency spikes.
Fix: Check Lock_time in slow log entries, SHOW PROCESSLIST states, and InnoDB lock counters. For IO, correlate with iostat and checkpointing. Don’t add indexes “because it’s slow”.
2) Symptom: Queries fast on replica, slow on primary
Root cause: Primary is doing extra work: writes, fsync, buffer pool churn, or lock contention. Also possible: different hardware or config drift.
Fix: Compare config variables and buffer pool stats. Validate workload: are you running reports on the primary? Stop that first.
3) Symptom: Top slow query is a simple SELECT with ORDER BY + LIMIT
Root cause: Missing composite index that matches filter and sort, causing filesort and scanning.
Fix: Add index with equality predicates first, then order column. Confirm with EXPLAIN ANALYZE that sort disappears and rows touched drop.
4) Symptom: Sudden spike in slow queries after deployment, but only for some users/tenants
Root cause: Data skew. The plan is fine for typical tenants but terrible for “big ones”. Statistics might mislead the optimizer.
Fix: Test with representative tenant sizes. Consider better indexes, query changes, or splitting large tenants. Update statistics and validate plans on skewed cases.
5) Symptom: Many “Created_tmp_disk_tables” and slow GROUP BY
Root cause: Disk-spilling temp tables due to insufficient indexes, wide rows, large result sets, or memory limits.
Fix: Reduce row width, add supporting indexes, pre-aggregate, and only then consider tuning tmp_table_size/max_heap_table_size with memory budgeting.
6) Symptom: slow log file is huge, analysis is painful, disk usage climbs
Root cause: long_query_time too low for always-on, or a runaway query flood, or no rotation.
Fix: Implement rotation and retention; sample during peak; tune threshold. Use pt-query-digest on a bounded time window.
7) Symptom: After adding an index, reads got faster but replication lag worsened
Root cause: Additional index maintenance increased write cost. On replicas, applying row events or statement events becomes heavier.
Fix: Keep indexes minimal and purposeful. Add the one that matches your top offender. For migrations, throttle and monitor lag; consider scheduling or offloading.
Checklists / step-by-step plan
Checklist: capture one hour of useful slow logs (production-safe)
- Confirm slow log enabled and file location (Task 1).
- Set long_query_time to a useful threshold for your workload (Task 2).
- FLUSH SLOW LOGS to start with a fresh file (Task 4).
- Capture during a representative busy hour, not during a quiet lull.
- Rotate at the end (Task 4 again) so analysis is bounded.
- Record context: traffic volume, release versions, any migrations running.
Checklist: turn the hour into a ranked action list
- Run pt-query-digest and sort by total response time (Task 6).
- Pick one top query fingerprint that’s frequent and fixable.
- Pull schema and existing indexes for involved tables (Task 7).
- Run EXPLAIN ANALYZE on a representative query instance (Task 8).
- Classify bottleneck: scanning, sorting, temp tables, locking, IO.
- Propose one minimal change: one index or one rewrite.
- Implement carefully; watch for metadata locks and replication lag (Tasks 12 and 16).
- Re-run EXPLAIN ANALYZE and compare slow log ranking after change (Task 10 plus digest again).
Checklist: safety rails (because you will be tempted)
- Don’t add “covering indexes” by default. Wide indexes bloat buffer pool and slow writes.
- Don’t tune global knobs before you have one query nailed down as the offender.
- Don’t trust staging performance if production data distribution differs.
- Don’t run DDL during peak unless you have proven online behavior for your version and table shape.
Interesting facts and historical context
Some context helps because people repeat myths with confidence, and production punishes that habit.
- MariaDB started as a fork of MySQL after Oracle’s acquisition of Sun (2009). The “it’s the same” narrative was true early; divergence is real now.
- MySQL’s Performance Schema matured significantly over major versions. Modern MySQL leans on it for diagnostics; older deployments often relied more on slow logs and external tools.
- The slow query log predates the current observability fashion. It’s one of the oldest “just write down the truth” features in the MySQL ecosystem.
- pt-query-digest (Percona Toolkit) became popular because slow logs are verbose but not immediately actionable. Digests turn a diary into a ranked to-do list.
- MySQL removed the legacy Query Cache in 8.0. It was famous for helping benchmarks and hurting real concurrency. If someone suggests enabling it “for speed,” ask what year they think it is.
- InnoDB became the default storage engine for MySQL long ago, replacing MyISAM in most serious deployments. Many slow-query fixes assume transactional semantics and row-level locking.
- EXPLAIN ANALYZE is a relatively recent gift compared to plain EXPLAIN. It measures actual execution, which is priceless when the optimizer’s estimates are confidently wrong.
- Metadata locks surprise teams because they’re not “slow queries” until they are. DDL can block reads/writes in ways that show up as slow application queries.
FAQ
1) Should I use slow query log or Performance Schema?
Use slow query log as your ground truth for “what was slow.” Use Performance Schema for “why was it slow” (waits, stages, CPU-ish signals). If you can only pick one, pick slow log first.
2) What long_query_time should I set for OLTP?
Start at 0.2s in a busy system if you want actionable data quickly. If that’s too noisy, go to 0.5s. If you set it to 2–10 seconds, you’ll only catch disasters, not chronic waste.
3) Is log_queries_not_using_indexes a good idea?
As a permanent setting: usually no. As a short, deliberate experiment: sometimes. It flags legitimate full scans and encourages index spam. Use it when you already know scans are accidental, not when you’re still diagnosing.
4) Why does the “same” query show different performance on MySQL vs MariaDB?
Different optimizer decisions, different defaults, different statistics behavior, and different engine/version combinations. Treat each server as its own system: validate plans, don’t rely on tribal memory.
5) Can an index make a query slower?
Yes. The optimizer can choose a worse index, or index maintenance can slow writes enough that overall latency rises. That’s why you verify with EXPLAIN ANALYZE and watch system-level metrics, not just one query timing.
6) What if my slow log is full of admin statements like ALTER TABLE?
Then your bottleneck is change management. Schedule DDL, use online techniques appropriate to your version, and prevent surprise migrations during peak. The “fix” is operational discipline.
7) How do I know if my problem is disk IO or bad queries?
Correlate: high Innodb_buffer_pool_reads plus high storage await time points to IO sensitivity. But even then, the best first fix is usually reducing reads via indexes and query shape. Hardware upgrades are valid, but they’re not a substitute for stopping waste.
8) How do I turn one hour of logs into a 2× speedup reliably?
Rank by total time, pick one frequent offender, validate plan with EXPLAIN ANALYZE, apply one minimal index/query change, then re-measure. Avoid “tuning by vibes” (multiple simultaneous changes with no proof).
9) Does enabling slow query log hurt performance?
It adds overhead, but for most systems it’s manageable when configured sanely. The bigger risk is disk growth if you log too much without rotation. If performance is extremely sensitive, sample during peak windows and rotate aggressively.
10) My slow queries are all “Sending data” in processlist. What now?
“Sending data” often means the query is reading rows and processing them (not necessarily network send). Look at Rows_examined, index usage, and whether you’re scanning/sorting. It can also hide IO waits when reading pages.
Next steps you can do today
Do the boring setup once, then collect wins repeatedly:
- Enable slow query logging at a sane threshold and keep it on with rotation. Make it part of the system, not an emergency lever.
- Automate a daily digest (even if it’s just a cron job that runs pt-query-digest and stores the top 20 fingerprints; see the sketch after this list). Trends matter.
- Adopt a one-change rule during incidents: one index or one query rewrite, validated by EXPLAIN ANALYZE and a quick replication/lock sanity check.
- Make “rows examined vs rows returned” a habit. It’s the fastest indicator of wasted work.
- Write down your top 5 query fingerprints and treat them like production dependencies. Because they are.
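A minimal version of that daily digest cron job, as a sketch (the report path, schedule, and retention are assumptions):

cr0x@server:~$ crontab -l
# 06:10 daily: digest the last 24h of slow log into a dated report
10 6 * * * pt-query-digest --since 24h --limit 20 /var/log/mysql/slow.log > /var/reports/slowlog-$(date +\%F).txt 2>&1

Keep the reports; the value is in comparing today’s top fingerprints against last week’s.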
If you want the 2× speedup, don’t hunt for magic settings. Open the slow log, find the dominant query class by total time, and remove its accidental work. Production doesn’t need heroics. It needs fewer pointless reads.