MySQL vs Percona Server Monitoring: Find Killer Queries Without Guessing


Production MySQL doesn’t usually fail with drama. It fails with boredom: p95 latency quietly doubles, CPU looks “fine,” and a product team swears “we didn’t deploy anything.” Meanwhile, the database is on fire—politely.

The difference between a two-hour incident and a ten-minute fix is almost never raw skill. It’s whether your monitoring tells you which query is the villain, why it’s slow (CPU, I/O, locks, or replication), and what changed.

What you’re actually trying to do (and why guesswork is expensive)

“Find killer queries” sounds like a single goal, but in operations it’s three separate jobs:

  1. Detection: identify the handful of statements that dominate latency, CPU, I/O, or lock time.
  2. Attribution: connect those statements to an application action, deployment, feature flag, cron job, analyst notebook, or misconfigured ORM.
  3. Actionability: choose the right fix—index, query rewrite, plan stability, concurrency limits, buffering, or schema changes—without breaking everything else.

The “guessing” trap happens when your monitoring only shows host metrics (CPU, disk, network). Those are necessary, not sufficient. Host metrics tell you that the database is unhappy. They do not tell you who made it unhappy.

You want three layers of visibility:

  • SQL layer: normalized query fingerprints, latency distribution, rows examined vs returned, temp tables, sort merges, errors.
  • Wait layer: time spent on locks, I/O, metadata locks, buffer pool, redo log, and internal mutexes.
  • System layer: CPU saturation, run queue, I/O latency, fsync time, page cache pressure, network retries.

MySQL can do this. Percona Server can do this too. The practical difference is how much you can do before an incident, how expensive the instrumentation is, and how easily you can stitch evidence into a timeline.

MySQL vs Percona Server: what changes for monitoring

They’re cousins, not strangers

Percona Server for MySQL is a downstream build that tracks upstream MySQL closely, with additional instrumentation and operational features. In modern fleets, you’ll see:

  • Oracle MySQL (community or enterprise): baseline features, Performance Schema, sys schema, slow log, EXPLAIN, optimizer trace, etc.
  • Percona Server: upstream-compatible, plus operational enhancements and extra knobs (some are now less “unique” than they used to be, but they still matter in production).
  • Percona Monitoring and Management (PMM): not a server fork, but a monitoring stack that makes query analytics usable for humans.

Monitoring differences that matter in practice

The big question isn’t “which is faster?” It’s “which one makes root cause obvious under stress?” Here’s how I call it:

  • Query analytics experience: With vanilla MySQL, you can get there with slow logs + Performance Schema + custom dashboards. With Percona Server plus PMM, you typically get to “top 10 worst query fingerprints” faster and with fewer homegrown parts.
  • Instrumentation cost control: Both can be configured to avoid heavy overhead, but Percona’s ecosystem tends to make “low overhead, always on” a cultural default rather than an aspirational slide.
  • Extra metrics and knobs: Percona Server has historically shipped extra status variables and features around instrumentation and diagnostics. Even when upstream catches up, Percona tends to keep the operational bias: expose the sharp edges, don’t hide them.

Opinionated guidance: if you run MySQL at modest scale and you already have good observability discipline, upstream MySQL is fine. If you run MySQL at “why is this always urgent” scale and you want query analytics that your on-call can use at 3 a.m., Percona Server plus PMM is hard to beat.

One dry truth: most incidents aren’t caused by “bad queries” in isolation. They’re caused by bad queries at bad concurrency, or a reasonable query that suddenly gets a different plan, or a tiny lock that becomes a global traffic jam.

Joke #1: A query isn’t “slow” until it’s executed in a loop by someone who just discovered analytics.

Fast diagnosis playbook (first/second/third)

This is the sequence that wins time in real incidents. It’s biased toward finding the bottleneck class quickly, then identifying the specific query fingerprint, then proving the cause.

First: classify the bottleneck (60–120 seconds)

  • Is the database CPU-bound? High user CPU, high run queue, stable disk latency.
  • Is it I/O-bound? Moderate CPU but elevated read/write latency, high fsync time, buffer pool misses.
  • Is it lock-bound? CPU is low while Threads_running climbs; many sessions sit in “waiting for…” states; lock wait metrics spike.
  • Is it replication-bound? Replica lag increases, relay logs grow, SQL thread waiting; or primary OK but replicas slow due to query volume.

Second: identify the top offenders (2–5 minutes)

  • Use slow query log (if enabled) to get fingerprints and timings.
  • Use Performance Schema to list top statements by total time, average time, and rows examined.
  • Check active sessions: what is running now, and what are they waiting on?

Third: confirm the failure mode (5–15 minutes)

  • For CPU issues: verify plan regressions, missing indexes, poor predicates, bad cardinality estimates.
  • For I/O issues: verify buffer pool hit rate proxies, read-ahead, temp table spills, redo log pressure, fsync latency.
  • For locks: identify blocking session, lock type (row, metadata, gap/next-key), and why it holds the lock.
  • For replication: compare primary vs replica statement mix; find the long transaction or DDL; check parallel replication configuration.

Do not “optimize randomly.” Diagnose first, then change one thing, then measure. You’re running a system, not an improv class.
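If you want the classification step as commands rather than prose, here is a minimal triage sketch. It assumes shell access to the database host and a mysql client with working credentials; adjust tools and auth to your environment.

# Host view: run queue, CPU split, swap activity, and per-device I/O latency.
vmstat 1 5
iostat -x 1 5

# Database view: how many threads are actually working vs merely connected?
mysql -NBe "SHOW GLOBAL STATUS WHERE Variable_name IN ('Threads_running','Threads_connected');"

# Anyone stuck on row locks right now?
mysql -e "SELECT COUNT(*) AS lock_waiters FROM sys.innodb_lock_waits;"

# Replication at a glance (8.0.22+ syntax).
mysql -e "SHOW REPLICA STATUS\G" | grep -Ei 'Seconds_Behind_Source|Running_State'

One screen of that output is usually enough to pick a lane: CPU, I/O, locks, or replication.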

Interesting facts and historical context (because today’s defaults have a history)

  • Fact 1: The MySQL slow query log predates modern observability stacks; it’s old-school text logging, but it’s still one of the most incident-useful tools because it captures real statements with timings.
  • Fact 2: Performance Schema started life as a controversial feature because early versions could be heavy if enabled indiscriminately. Modern MySQL made it far more practical as an “always on” source of truth—if you configure it with intent.
  • Fact 3: The sys schema was created to make Performance Schema usable without everyone writing their own monstrous SQL. It’s basically “views for humans.”
  • Fact 4: Percona popularized query digesting in the MySQL world with tools like pt-query-digest, which normalized queries into fingerprints long before “query analytics” was a dashboard checkbox.
  • Fact 5: MySQL’s metadata locks (MDL) are a frequent surprise: a harmless-looking ALTER TABLE can block reads in ways that feel like a networking issue until you look at MDL waits.
  • Fact 6: InnoDB’s design is heavily “buffer pool first.” If your working set doesn’t fit and you’re doing random reads, you’ll feel it even on premium disks—because physics charges by the millisecond.
  • Fact 7: Replication lag is often not “network lag.” It’s frequently “SQL thread can’t keep up” because the replica is doing too much work per transaction, often due to different I/O or cache conditions.
  • Fact 8: Newer versions moved internal temporary tables toward in-memory handling (MySQL 8.0’s TempTable engine), which reduced some pain, but “temp table in memory” is not free; it can become memory pressure, which becomes swapping, which becomes an incident.

Practical tasks: commands, outputs, and decisions (12+)

These are the tasks I actually run during incidents or pre-incident hardening. Each one includes a command, what typical output means, and the decision you make.

Task 1: Confirm which server you’re on (and stop arguing in Slack)

cr0x@server:~$ mysql -NBe "SELECT VERSION(), @@version_comment;"
8.0.36-28 Percona Server (GPL), Release 28, Revision 1234567

What it means: The version string and comment tell you if it’s Oracle MySQL, Percona Server, MariaDB, etc. This matters because features and defaults differ.

Decision: Pick the right playbook. Don’t apply a tuning flag that doesn’t exist on that build.

Task 2: Check whether Performance Schema is enabled

cr0x@server:~$ mysql -NBe "SHOW VARIABLES LIKE 'performance_schema';"
performance_schema	ON

What it means: If it’s OFF, you just lost one of the best sources of “what is the server doing right now?”

Decision: If OFF in production, plan a maintenance window to enable it, and keep it ON with careful instrumentation settings.
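What “careful instrumentation settings” can look like, as a minimal sketch. The performance_schema switch itself is read-only at runtime, so it belongs in my.cnf before the restart; the consumer and instrument toggles below are dynamic. The specific patterns are an assumption, a starting point to trim against your overhead budget, not a prescription.

-- Keep the cheap, high-value consumers on (most are on by default in 8.0).
UPDATE performance_schema.setup_consumers
   SET ENABLED = 'YES'
 WHERE NAME IN ('global_instrumentation', 'thread_instrumentation',
                'statements_digest', 'events_statements_current');

-- Example trim: drop per-stage timing if nothing consumes it.
UPDATE performance_schema.setup_instruments
   SET ENABLED = 'NO', TIMED = 'NO'
 WHERE NAME LIKE 'stage/%';

-- These runtime toggles do not survive a restart; mirror them with the
-- performance-schema-instrument / performance-schema-consumer-* options in my.cnf.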

Task 3: See top statements by total latency (Performance Schema)

cr0x@server:~$ mysql -t -e "SELECT DIGEST_TEXT, COUNT_STAR, ROUND(SUM_TIMER_WAIT/1e12,2) AS total_s, ROUND(AVG_TIMER_WAIT/1e9,2) AS avg_ms, SUM_ROWS_EXAMINED, SUM_ROWS_SENT FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 5\G"
*************************** 1. row ***************************
DIGEST_TEXT: SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC LIMIT ?
COUNT_STAR: 48291
total_s: 912.44
avg_ms: 18.89
SUM_ROWS_EXAMINED: 987654321
SUM_ROWS_SENT: 48291

What it means: This query dominates total time and examines absurd rows per result. That’s your lead.

Decision: Investigate indexing and plan. Specifically, look for a composite index on (user_id, created_at) and confirm it’s being used.
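If the missing-index theory holds, the usual fix is one composite index that serves both the filter and the sort. A sketch, assuming orders.user_id and orders.created_at exist as shown and that you have weighed the write-path cost (see the backfired-optimization story later):

-- One index for the access pattern: filter on user_id, sort by created_at.
-- On a large, hot table, run this through online schema change tooling rather than raw DDL.
ALTER TABLE orders ADD INDEX idx_user_created (user_id, created_at);

-- Then confirm the optimizer actually picks it.
EXPLAIN SELECT * FROM orders WHERE user_id = 123 ORDER BY created_at DESC LIMIT 50;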

Task 4: Find “what is running right now” with full statements

cr0x@server:~$ mysql -t -e "SELECT ID, USER, HOST, DB, COMMAND, TIME, STATE, LEFT(INFO,120) AS query FROM information_schema.PROCESSLIST WHERE COMMAND != 'Sleep' ORDER BY TIME DESC LIMIT 10;"
+------+---------+----------------+------+---------+------+---------------------------------+--------------------------------------------------------------------------+
| ID   | USER    | HOST           | DB   | COMMAND | TIME | STATE                           | query                                                                    |
+------+---------+----------------+------+---------+------+---------------------------------+--------------------------------------------------------------------------+
| 8123 | appuser | 10.2.3.4:53210 | prod | Query   |   98 | Waiting for table metadata lock | ALTER TABLE orders ADD COLUMN foo INT                                    |
| 8221 | appuser | 10.2.3.9:49211 | prod | Query   |   97 | Sending data                    | SELECT * FROM orders WHERE user_id=123 ORDER BY created_at DESC LIMIT 50 |
+------+---------+----------------+------+---------+------+---------------------------------+--------------------------------------------------------------------------+

What it means: A DDL is waiting for a metadata lock while queries continue, or vice versa. The presence of “Waiting for table metadata lock” is a giant arrow toward MDL contention.

Decision: If the DDL is blocking critical traffic, stop it (or move it to online schema change tooling). If the DDL is blocked, find the session holding MDL.

Task 5: Identify metadata lock blockers (MDL)

cr0x@server:~$ mysql -t -e "SELECT * FROM sys.schema_table_lock_waits WHERE object_schema='prod' AND object_name='orders'\G"
*************************** 1. row ***************************
object_schema: prod
object_name: orders
waiting_query: ALTER TABLE orders ADD COLUMN foo INT
waiting_pid: 8123
blocking_pid: 7991
blocking_lock_type: SHARED_READ
sql_kill_blocking_query: KILL QUERY 7991
sql_kill_blocking_connection: KILL 7991

What it means: Session 7991 holds a shared metadata lock on the table (usually a long-running SELECT or an open transaction that touched it), and that is enough to block the DDL, or the reverse. The sys schema makes this readable; look up the blocking PID in the processlist to see what it is actually running.

Decision: If it’s safe, kill the blocking query (not the connection unless you must). Then fix the application pattern: long transactions and “SELECT … FOR UPDATE” tend to hold MDL longer than people think.

Task 6: Confirm slow query log is enabled and sensible

cr0x@server:~$ mysql -NBe "SHOW VARIABLES WHERE Variable_name IN ('slow_query_log','slow_query_log_file','long_query_time','log_queries_not_using_indexes');"
slow_query_log	ON
slow_query_log_file	/var/log/mysql/mysql-slow.log
long_query_time	0.200000
log_queries_not_using_indexes	OFF

What it means: Logging queries slower than 200ms. That’s aggressive enough to catch “death by a thousand cuts” but not so aggressive that you melt your disk.

Decision: If it’s OFF, turn it ON. If long_query_time is 10 seconds, you’re blind to the common killers.
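Both settings are dynamic, so this does not need a restart. A sketch; on 8.0 you can use SET PERSIST instead of SET GLOBAL so the change survives restarts, otherwise mirror it in my.cnf:

-- Enable slow logging at a 200ms threshold on the fly.
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 0.2;

-- Note: existing connections keep the long_query_time they started with;
-- only new sessions pick up the new threshold.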

Task 7: Digest the slow log into a ranked list (Percona Toolkit)

cr0x@server:~$ pt-query-digest /var/log/mysql/mysql-slow.log | head -n 30
# 330ms user time, 20ms system time, 25.00M rss, 80.00M vsz
# Current date: Mon Dec 30 01:12:14 2025
# Overall: 12.34k total, 45 unique, 3.43 QPS, 2.28x concurrency ________
# Time range: 2025-12-30T00:10:00 to 2025-12-30T01:10:00
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time        8200s   205ms     38s    663ms     2s      1s     355ms
# Rows examine     9.10G     0     12M   737k     3.2M   1.1M     55k
# Profile
# Rank Query ID           Response time Calls R/Call V/M   Item
# ==== ================== ============= ===== ====== ===== ============
#    1 0xA1B2C3D4E5F6A7B8  4100.0000 50.0%   6140 0.6678 0.10  SELECT orders

What it means: You get a ranked list by response time and a normalized query ID. “Rows examine 9.10G” is a crime scene.

Decision: Focus on the top 1–3 fingerprints. Do not chase the tail of the distribution during an incident.

Task 8: Check InnoDB buffer pool pressure (quick proxy metrics)

cr0x@server:~$ mysql -t -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';"
+---------------------------------------+-----------+
| Variable_name                         | Value     |
+---------------------------------------+-----------+
| Innodb_buffer_pool_read_requests      | 987654321 |
| Innodb_buffer_pool_reads              | 12345678  |
+---------------------------------------+-----------+

What it means: Innodb_buffer_pool_reads are reads from disk; read_requests are logical reads. If disk reads jump relative to requests, your working set is missing cache.

Decision: If you’re suddenly reading from disk, identify what changed: new query pattern, warmed caches lost after restart, or dataset growth. Consider increasing buffer pool (if safe), but also fix the query that is scanning.
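A rough way to put a number on the miss rate, plus the online resize knob, as a sketch. It assumes 5.7+ (where global status lives in performance_schema), and the 32G figure is purely illustrative; size to your working set, host memory, and buffer pool chunk size.

-- Physical reads per 1,000 logical read requests (lower is better).
SELECT ROUND(
         1000 * (SELECT VARIABLE_VALUE FROM performance_schema.global_status
                  WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads')
              / (SELECT VARIABLE_VALUE FROM performance_schema.global_status
                  WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests'), 2)
       AS disk_reads_per_1k_requests;

-- Online resize (example value only; it happens in chunks and does real work under load).
SET GLOBAL innodb_buffer_pool_size = 32 * 1024 * 1024 * 1024;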

Task 9: Identify lock waits at the engine level

cr0x@server:~$ mysql -t -e "SELECT * FROM sys.innodb_lock_waits ORDER BY wait_age DESC LIMIT 5\G"
*************************** 1. row ***************************
wait_started: 2025-12-30 01:09:01
wait_age: 00:00:47
locked_table: `prod`.`orders`
locked_index: PRIMARY
waiting_query: UPDATE orders SET status='shipped' WHERE id=?
blocking_query: UPDATE orders SET status='paid' WHERE id=?
blocking_pid: 7442
waiting_pid: 7551

What it means: You have transactional contention. Two updates are fighting over the same rows or hot index ranges.

Decision: Find the pattern. Is it a queue table? A “last row” hotspot? Apply batching, reorder operations, reduce transaction scope, or redesign the contention point.

Task 10: Confirm whether temp tables are spilling to disk

cr0x@server:~$ mysql -NBe "SHOW GLOBAL STATUS LIKE 'Created_tmp%';"
Created_tmp_disk_tables	184223
Created_tmp_tables	210001
Created_tmp_files	1123

What it means: A high ratio of disk temp tables suggests expensive sorts/group-bys or insufficient temp table memory configuration.

Decision: Identify which queries are creating temp tables (Performance Schema can help), then fix query shapes and indexes before blindly increasing memory limits.
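Performance Schema tracks temp table creation per digest, so you do not have to guess which statements are spilling. A sketch:

-- Which query shapes create the most on-disk temp tables?
SELECT LEFT(DIGEST_TEXT, 80)          AS query,
       COUNT_STAR                     AS calls,
       SUM_CREATED_TMP_DISK_TABLES    AS tmp_disk_tables,
       SUM_CREATED_TMP_TABLES         AS tmp_tables,
       SUM_SORT_MERGE_PASSES          AS sort_merge_passes
  FROM performance_schema.events_statements_summary_by_digest
 ORDER BY SUM_CREATED_TMP_DISK_TABLES DESC
 LIMIT 10;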

Task 11: Look at top wait events (Performance Schema)

cr0x@server:~$ mysql -t -e "SELECT event_name, ROUND(SUM_TIMER_WAIT/1e12,2) AS total_s, COUNT_STAR FROM performance_schema.events_waits_summary_global_by_event_name ORDER BY SUM_TIMER_WAIT DESC LIMIT 10;"
+------------------------------------------+---------+------------+
| event_name                               | total_s | COUNT_STAR |
+------------------------------------------+---------+------------+
| wait/io/file/innodb/innodb_data_file     |  820.11 |   91234567 |
| wait/synch/mutex/innodb/buf_pool_mutex   |  210.44 |   61234567 |
| wait/lock/metadata/sql/mdl               |   98.22 |     120345 |
+------------------------------------------+---------+------------+

What it means: Your time is going into file I/O and buffer pool mutex contention, with a side of metadata locks. That’s a useful narrative.

Decision: If I/O dominates, look for scans and spills. If mutex dominates, check concurrency, buffer pool instances, and hot pages. If MDL dominates, fix DDL scheduling and long transactions.

Task 12: Spot replication lag and who is to blame

cr0x@server:~$ mysql -t -e "SHOW REPLICA STATUS\G" | egrep -i "Seconds_Behind_Source|Replica_SQL_Running_State|Last_SQL_Error|Retrieved_Gtid_Set|Executed_Gtid_Set"
Seconds_Behind_Source: 187
Replica_SQL_Running_State: Waiting for dependent transaction to commit
Last_SQL_Error:
Retrieved_Gtid_Set: 1-1000023
Executed_Gtid_Set: 1-999842

What it means: Replica is behind. The SQL thread is waiting to commit (often commit/flush pressure or dependency constraints).

Decision: Check commit latency and long transactions. Consider tuning durability settings only with explicit risk acceptance. More often: fix the transaction pattern that created the lag.
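Before anyone “tunes durability,” at least read the current settings and find the transaction that created the backlog. A read-only sketch, using 8.0 variable names:

-- Commit/flush pressure suspects (changing these trades durability for speed; decide explicitly).
SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('sync_binlog', 'innodb_flush_log_at_trx_commit', 'binlog_group_commit_sync_delay');

-- Any long-running or huge transaction still open?
SELECT trx_id, trx_started, trx_rows_modified
  FROM information_schema.innodb_trx
 ORDER BY trx_started
 LIMIT 5;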

Task 13: Correlate query time with rows examined (find “looks fast” but isn’t)

cr0x@server:~$ mysql -t -e "SELECT DIGEST_TEXT, COUNT_STAR, ROUND(AVG_TIMER_WAIT/1e9,2) AS avg_ms, ROUND(SUM_ROWS_EXAMINED/COUNT_STAR,0) AS rows_examined_per_call FROM performance_schema.events_statements_summary_by_digest WHERE COUNT_STAR > 1000 ORDER BY rows_examined_per_call DESC LIMIT 5\G"
*************************** 1. row ***************************
DIGEST_TEXT: SELECT id FROM sessions WHERE token = ?
COUNT_STAR: 450000
avg_ms: 1.20
rows_examined_per_call: 12000

What it means: Average latency is low, but it’s examining 12k rows per call. Under load, this becomes a CPU and buffer pool grinder.

Decision: Add or fix an index on token. This is preventative work that avoids a future incident.

Task 14: Verify plan choice with EXPLAIN (and don’t trust your intuition)

cr0x@server:~$ mysql -t -e "EXPLAIN SELECT * FROM orders WHERE user_id=123 ORDER BY created_at DESC LIMIT 50;"
+----+-------------+--------+------------+------+---------------------------+------------------+---------+------+----------+---------------------+
| id | select_type | table  | partitions | type | possible_keys             | key              | key_len | rows | filtered | Extra               |
+----+-------------+--------+------------+------+---------------------------+------------------+---------+------+----------+---------------------+
|  1 | SIMPLE      | orders | NULL       | ref  | idx_user_created,idx_user | idx_user_created | 8       |   50 |   100.00 | Backward index scan |
+----+-------------+--------+------------+------+---------------------------+------------------+---------+------+----------+---------------------+

What it means: It uses the composite index and reads ~50 rows. If your earlier metrics showed millions examined, you’re either looking at a different query shape, parameter skew, or plan instability.

Decision: Compare the exact statement text and parameter patterns. Investigate histogram/statistics and whether queries are sargable.

Task 15: Check filesystem I/O latency in the moment (because “disk is fine” is a lie until proven)

cr0x@server:~$ iostat -x 1 3
Linux 6.5.0 (db01) 	12/30/2025 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          32.10    0.00    6.20   18.40    0.00   43.30

Device            r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm  %util
nvme0n1         820.0  1200.0   38.2    95.1     92.1      8.20   12.40    9.10    14.60   0.35  72.0

What it means: 12ms await on NVMe is suspicious, with iowait at 18%. That’s an I/O story, not “just CPU.”

Decision: Move up the stack: which queries are causing read amplification or temp spills? If this is shared storage, check noisy neighbors.

Task 16: If you have PMM: confirm QAN is ingesting and show top queries

cr0x@server:~$ pmm-admin list
pmm-admin 2.41.0

Service type  Service name  Address and Port  Service ID
MySQL         db01          127.0.0.1:3306    /service_id/abcd1234

Agent type   Status     Agent ID
pmm-agent    Running    /agent_id/efgh5678

What it means: PMM sees the MySQL service. If Query Analytics isn’t showing data, it’s typically a collector/config issue, not “no slow queries exist.”

Decision: Use PMM QAN to pivot from fingerprint → examples → hosts → time range. It’s the shortest path from pain to culprit.

Killer query fingerprints: how to recognize the class of problem

1) The “rows examined tsunami”

Signature: High SUM_ROWS_EXAMINED, modest rows returned, high total latency. CPU rises, buffer pool reads rise, disk reads may follow.

Causes: Missing composite index, non-sargable predicates, implicit type conversion, leading wildcard LIKE '%foo', or a query rewritten by ORM into something the optimizer can’t love.

Fix: Create the right index, rewrite predicate to be sargable, align types, and verify with EXPLAIN. Then watch for plan regressions after stats changes.

2) The “lock convoy”

Signature: Many sessions in “Waiting for…” states, low CPU, throughput collapses. The database looks idle while everyone waits.

Causes: Hot rows, serialized updates, long transactions, gap locks under REPEATABLE READ, or DDL colliding with traffic via MDL.

Fix: Reduce transaction scope, reorder operations, add indexes to reduce locked ranges, use online schema change patterns, and schedule DDL like it’s a production change (because it is).
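For the DDL-versus-traffic collision specifically, online schema change tooling is the usual escape hatch. A hedged sketch with Percona Toolkit: always dry-run first, and treat the load thresholds as assumptions to tune for your host.

cr0x@server:~$ pt-online-schema-change \
    --alter "ADD COLUMN foo INT" \
    --max-load Threads_running=50 \
    --critical-load Threads_running=200 \
    --dry-run \
    D=prod,t=orders

If the dry run is clean, rerun with --execute. The tool copies the table in chunks and swaps it in at the end, so the metadata lock it needs is brief instead of open-ended.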

3) The “temp table spill factory”

Signature: Increased Created_tmp_disk_tables, I/O latency increases, queries with GROUP BY/ORDER BY get slower, and disks get chatty.

Causes: Missing indexes for sort/grouping, selecting too many columns, large result sets, or insufficient tmp table memory settings.

Fix: Index to satisfy ORDER BY/GROUP BY, reduce selected columns, pre-aggregate, or refactor. Increase memory only after you stop the bleeding.

4) The “replication lag masquerading as app slowness”

Signature: Reads routed to replicas get slower or stale; or failovers become risky because lag is high.

Causes: One big transaction, commit pressure, dependency constraints, or a replica with different I/O/cache profile than the primary.

Fix: Break up large writes, tune replication parallelism appropriately, and ensure replicas have comparable storage performance (or adjust routing expectations).
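“Tune replication parallelism appropriately” starts with reading what is configured on both sides. A read-only sketch using the 8.0.26+ variable names (older servers use the slave_* equivalents):

-- On the replica: how many applier workers, and in what order do they commit?
SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('replica_parallel_workers', 'replica_parallel_type', 'replica_preserve_commit_order');

-- On the primary: dependency tracking decides how parallelizable the stream is.
SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('binlog_transaction_dependency_tracking', 'binlog_transaction_dependency_history_size');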

5) The “plan changed and nobody noticed”

Signature: Same query fingerprint, suddenly worse. EXPLAIN shows different index or join order than last week.

Causes: Statistics changed, data distribution changed, or version upgrade altered optimizer behavior.

Fix: Refresh stats intentionally, consider histograms, stabilize with hints only when necessary, and monitor plan drift for critical queries.
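A sketch of “refresh stats intentionally.” The histogram column and bucket count are illustrative assumptions; FORCE INDEX is the blunt instrument you reach for only when a critical query keeps flip-flopping.

-- Recompute index statistics for the table.
ANALYZE TABLE orders;

-- MySQL 8.0+: histogram on a skewed, non-indexed column the optimizer keeps misjudging.
ANALYZE TABLE orders UPDATE HISTOGRAM ON status WITH 64 BUCKETS;

-- Last resort for plan stability on one critical query.
SELECT * FROM orders FORCE INDEX (idx_user_created)
 WHERE user_id = 123
 ORDER BY created_at DESC
 LIMIT 50;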

Joke #2: The optimizer is like a cat: sometimes brilliant, sometimes sleeping on the keyboard, and it will not explain itself.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

The company had a clean split: writes to the primary, reads to replicas. The assumption was simple and comforting: “Read traffic can’t hurt writes.” Everyone repeated it until it became policy.

Then the checkout flow started timing out. The primary’s CPU wasn’t pegged, but commit latency climbed. Replicas were fine. Network was fine. The application team swore it was “just a few extra reads.”

Performance Schema showed the truth: a new “read-only” endpoint performed a multi-table join with a missing predicate, scanning large ranges. It ran on the primary in a fallback path when the read pool was exhausted. The query itself wasn’t even that slow per execution—just frequent enough to evict hot pages and create buffer pool churn.

The wrong assumption wasn’t about “reads.” It was about where reads happen under partial failure. Connection pool exhaustion, replica lag, DNS flaps—any of these can route reads to the primary. The query became a write-path issue by wrecking cache locality and increasing I/O.

They fixed it by adding the missing predicate and an index, then implementing a circuit breaker: if replicas are unavailable, degrade the feature rather than silently pounding the primary. Monitoring was updated to alert on “read query fingerprints on primary” as a first-class risk.

Mini-story 2: The optimization that backfired

A team wanted to reduce p99 latency for a search endpoint. They added an index that seemed obviously correct, and it did improve their test query. The deployment went out, high fives all around.

Two days later, the write workload started to stall in bursts. CPU on the database rose, then fell, but throughput got spiky. The slow query log didn’t show a single “bad query.” That was the clue: the pain wasn’t in query execution time; it was in write amplification.

The new index multiplied the cost of updates. A background job that touched many rows now had to maintain an extra structure with poor locality. The buffer pool got churned by index pages. Performance Schema wait events showed increased mutex contention around buffer pool structures, and iostat showed elevated write latency during bursts.

The fix wasn’t “remove indexes.” It was: make the job less destructive (batching, lower concurrency), and redesign the index to match actual predicates. They replaced one wide index with a narrower composite index that satisfied the read path without punishing updates as much.

Lesson: “Optimization” is a change in tradeoffs. If you don’t measure the write path, you’re not optimizing—you’re gambling.

Mini-story 3: The boring but correct practice that saved the day

A different organization had a ritual that nobody bragged about: they kept slow query logging enabled at a low threshold, rotated logs properly, and digested the results weekly. No heroics, just hygiene.

One Friday afternoon, p95 latency jumped. The on-call opened PMM Query Analytics and saw a new fingerprint at the top within a minute. It was a query added in a minor release: a lookup by a token column, but the token was stored as a different type than the parameter, triggering implicit conversion and killing index usage.

Because slow logs and digests were normal, they had baselines. They could prove “this query did not exist yesterday” and tie it to a deployment. Rollback was immediate. Then the team added the correct index and aligned types in the following patch.

No war room. No all-hands. The fix was almost boring. That’s the point: boring observability prevents exciting outages.

Paraphrased idea from Gene Kranz: tough and competent—stay calm, rely on procedure, and use data over vibes.

Common mistakes: symptom → root cause → fix

1) Symptom: CPU is high, but slow query log is quiet

Root cause: Lots of medium-slow queries under the threshold, or overhead in parsing/execution not captured due to high long_query_time.

Fix: Lower long_query_time to 200–500ms temporarily, or enable Performance Schema digest aggregation and rank by total time.

2) Symptom: Threads connected and Threads running are both high, CPU is low, latency is terrible

Root cause: Lock waits or resource waits (I/O, metadata locks). Everyone is queued.

Fix: Use sys views for lock waits and MDL waits; identify the blocker; reduce transaction length; schedule DDL properly.

3) Symptom: Queries “randomly” get slow after restart

Root cause: Cold buffer pool; working set not warmed; sudden disk reads explode.

Fix: Expect it and plan for it (warmup strategies, sufficient buffer pool, capacity margin). Also verify that a new scan-heavy query isn’t churning the buffer pool and undoing the warmup.
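Part of “expect it and plan for it” is letting InnoDB reload its hot page list after a restart. A quick sketch of what to verify; these are on by default in modern versions, but defaults have a way of getting edited.

-- Buffer pool dump/reload across restarts, and how much of the pool gets saved.
SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('innodb_buffer_pool_dump_at_shutdown', 'innodb_buffer_pool_load_at_startup',
   'innodb_buffer_pool_dump_pct');

-- After a restart, watch warmup progress.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_load_status';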

4) Symptom: Replica lag spikes during a batch job

Root cause: Large transactions or high commit rate saturates replica apply/flush capacity, or dependency constraints limit parallelism.

Fix: Break writes into smaller transactions, adjust job concurrency, and ensure replica durability and storage match intent.

5) Symptom: “Waiting for table metadata lock” appears during deploys

Root cause: DDL collides with long-running queries/transactions; MDL is held longer than expected.

Fix: Eliminate long transactions, run schema changes with online tooling patterns, and enforce DDL windows or throttling.

6) Symptom: p99 worsens, averages look fine

Root cause: Contention, queueing, parameter skew, or intermittent plan changes. The mean is lying.

Fix: Use percentiles in QAN/monitoring, check top statements by max latency and by 95th percentile, and inspect specific query examples.
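On MySQL 8.0 the digest summary carries latency quantiles directly, so you can rank by p95/p99 without exporting anything. A sketch; timer columns are in picoseconds:

-- Worst query shapes by tail latency, not by mean.
SELECT LEFT(DIGEST_TEXT, 80)          AS query,
       COUNT_STAR                     AS calls,
       ROUND(QUANTILE_95 / 1e9, 2)    AS p95_ms,
       ROUND(QUANTILE_99 / 1e9, 2)    AS p99_ms,
       ROUND(MAX_TIMER_WAIT / 1e9, 2) AS max_ms
  FROM performance_schema.events_statements_summary_by_digest
 WHERE COUNT_STAR > 100
 ORDER BY QUANTILE_99 DESC
 LIMIT 10;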

7) Symptom: Disk utilization isn’t 100%, but latency is high

Root cause: Queue depth and latency can rise long before %util hits 100% (especially on shared storage). Small random I/O is the silent killer.

Fix: Trust await and queue depth. Reduce read amplification (indexes, query shape), reduce temp spills, and verify storage health.

8) Symptom: You “fixed” a query with an index, but overall throughput dropped

Root cause: Index maintenance overhead on writes, especially for high-churn tables.

Fix: Reassess index design, consider narrower/composite indexes, reduce write churn, and benchmark both read and write paths.

Checklists / step-by-step plan

Checklist A: Baseline monitoring you should have before the next incident

  1. Enable Performance Schema and keep it on.
  2. Enable slow query log with a threshold that catches real pain (often 200–500ms, adjust to workload).
  3. Rotate and retain slow logs; ensure disk won’t fill.
  4. Run a daily/weekly digest report and track top fingerprints over time.
  5. Collect wait event metrics and have at least one dashboard that ranks them.
  6. Track replication lag and alert on trend, not just absolute value.
  7. Capture query examples for top digests (PMM helps, but you can store samples yourself).

Checklist B: Incident response steps (15-minute version)

  1. Classify: CPU vs I/O vs locks vs replication.
  2. Grab top digests by total time and by average time.
  3. Check active sessions and lock waits; find blockers.
  4. If I/O: check temp disk tables and scans (rows examined).
  5. Validate plan for the top fingerprint with EXPLAIN.
  6. Pick one action: kill a blocking query, add an index (only if safe and tested), throttle a job, or rollback a deploy.
  7. Measure again. If the metric didn’t move, your fix wasn’t the fix.

Checklist C: Hardening changes that reduce future “killer query” risk

  1. Add guardrails in the app: timeouts, circuit breakers, and query concurrency limits (see the statement-timeout sketch after this list).
  2. Prefer prepared statements and stable query shapes (helps digesting and reduces plan chaos).
  3. Standardize schema change practice (online schema change approach, MDL awareness).
  4. Make “rows examined per call” a tracked SLO for critical endpoints.
  5. Write a rollback plan for every query-affecting change (indexes count).
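For item 1, the cheapest database-side guardrail is a server-enforced statement timeout. A sketch: the values are illustrative, and both forms apply to SELECT statements only, which is usually what you want from a guardrail.

-- Per-statement ceiling for one critical query (milliseconds).
SELECT /*+ MAX_EXECUTION_TIME(2000) */ *
  FROM orders
 WHERE user_id = 123
 ORDER BY created_at DESC
 LIMIT 50;

-- Or a session-level default for a reporting connection.
SET SESSION max_execution_time = 5000;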

FAQ

1) Is Percona Server “better” than MySQL for monitoring?

It’s better in the way a good toolbox is better than a wrench. MySQL has the primitives; Percona’s ecosystem tends to make them operationally usable faster, especially with PMM and tooling culture.

2) Should I rely on slow query log or Performance Schema?

Use both. Slow log gives you exact statements and timings in a durable file. Performance Schema gives you live aggregated truth and wait context. When they agree, you can move quickly.

3) What long_query_time should I use?

Start around 200–500ms for OLTP services and adjust based on volume and storage. If your p95 target is 50ms, a 10-second threshold is observational negligence.

4) Does enabling Performance Schema hurt performance?

It can if you enable every instrument and consumer. With sane defaults and selective instrumentation, it’s generally acceptable in production. Measure overhead in staging if you’re anxious; don’t disable it out of superstition.

5) How do I find “killer queries” when everything is fast individually?

Rank by total time and calls, not just average latency. A 5ms query executed 50,000 times per minute can be your whole CPU budget.

6) What’s the fastest way to diagnose lock issues?

Use sys views: sys.innodb_lock_waits and sys.schema_table_lock_waits. Find the blocker PID, look at its query and transaction age, and decide whether to kill or fix the workload pattern.

7) Why do I see replication lag but the primary looks healthy?

Replica apply is a different workload: different caching, different I/O, sometimes different query mix (read load). Lag is often caused by one big transaction or commit pressure, not primary distress.

8) Can PMM replace my existing monitoring?

It complements it. You still need host metrics, logs, and alerting discipline. PMM shines at query analytics and database-specific visibility, which is usually the missing piece during incidents.

9) Should I kill “killer queries” during incidents?

Sometimes. If a query is blocking others or causing a lock convoy, killing it can restore service quickly. But make the decision explicit: kill query vs kill connection, and understand retry storms from the application.

10) What if the “killer query” is actually many similar queries?

That’s exactly why digests matter. Fix the pattern: add the right index, change the query shape, or add caching. Don’t whack individual statements like moles.

Conclusion: next steps you can actually do this week

If you want to stop guessing, your goal is simple: make the database explain itself quickly.

  1. Turn on slow logging with a realistic threshold and rotate it.
  2. Keep Performance Schema enabled, and use sys schema views to avoid writing archaeology-grade SQL.
  3. Adopt a query digest workflow: daily/weekly ranking, track top fingerprints, and treat “rows examined per call” as a smell.
  4. Pick a single “fast diagnosis” dashboard: top digests, top waits, active sessions, replication lag. If it’s not there, it doesn’t exist during an incident.
  5. Decide your stance on Percona: if you need faster operational visibility with less bespoke glue, Percona Server + PMM is a pragmatic choice. If you’re committed to upstream MySQL, copy the same discipline with your own tooling.

Most killer queries aren’t clever. They’re ordinary queries placed in extraordinary conditions. Your job is to make those conditions visible—and make the culprit painfully easy to name.
