Debian 13: MySQL slow query log — find the query that’s silently killing you

Production MySQL rarely “falls over” with a dramatic crash. It just gets slower. Latency climbs, queue depth rises, the app team says “nothing changed,” and your on-call night evaporates into a fog of half-truths and dashboards that agree on exactly nothing.

The slow query log is the one instrument that cuts through the noise. Not because it’s fancy, but because it writes down what your database actually spent time doing. If you’re running Debian 13 and you suspect a single query is quietly turning your server into a space heater, this is how you catch it, prove it, and fix it without making things worse.

Why the slow query log still wins in 2025

MySQL has grown a lot of instrumentation: Performance Schema, sys schema views, optimizer traces, statement digests, and enough knobs to build a cockpit. Yet the slow query log remains the fastest way to connect “users are mad” to “this is the SQL doing it.” It’s not perfect. It’s not complete. It’s just reliably useful.

Here’s what it does better than most dashboards:

  • It captures reality under load. Not the query you wish you were running, the one you actually ran when everything was on fire.
  • It’s cheap to start with. Enable it, sample it, move on. Performance Schema can be heavier if you don’t know what you’re enabling.
  • It’s portable evidence. A slow log file is something you can hand to a teammate, archive with an incident ticket, and compare across releases.
  • It turns vague complaints into ranked suspects. Total time, count, rows examined, lock time—enough to tell you where to dig.

And yes, it can also become a trash heap of “slow” queries that are actually fine. That’s why you tune the thresholds and interpret it like an operator: you care about throughput, tail latency, and resource contention—not just a single number called “Query_time.”

Joke 1: The slow query log is like a workplace time tracker: it’s awkward until you see where your day actually went.

One quote worth carrying around: “Hope is not a strategy,” a traditional SRE saying. Treat performance work the same way: measure first, then act.

Interesting facts and a little history (so you stop repeating old mistakes)

You don’t need trivia to run databases, but a few concrete facts explain why certain knobs exist and why some “obvious” fixes keep failing.

  1. The slow query log predates modern observability. It was one of the earliest “built-in” ways to find query pain without external profilers.
  2. “Slow” is a policy choice, not a universal truth. A 200 ms query is slow for a login endpoint but fast for a nightly report.
  3. MySQL can log queries that don’t use indexes. The log_queries_not_using_indexes flag exists because full table scans are a reliable way to buy surprise outages.
  4. The log can include administrative statements. With log_slow_admin_statements, you’ll catch “innocent” DDL or maintenance that stalls prod.
  5. InnoDB changed the shape of “slow.” Once InnoDB became the default, many slowdowns moved from CPU-bound parsing to I/O waits and locking behavior.
  6. Query cache was removed in MySQL 8.0. A whole era of “we fixed it by caching” ended, and good riddance—query cache invalidation was a chaos generator.
  7. Replication can surface slow queries twice. A statement slow on the primary is often slow on replicas too; plus replicas may lag and compound the pain.
  8. Row-based replication made “same statement” less relevant. Performance issues sometimes hide in trigger logic or side effects rather than raw SELECT text.
  9. Modern MySQL exposes digest-based summaries. Performance Schema can aggregate by normalized statement patterns; slow log shows concrete examples you can reproduce.

Fast diagnosis playbook (first, second, third, fourth)

This is the order that finds the culprit quickly, without making your outage worse. The goal is not “optimize MySQL.” The goal is “stop bleeding.”

First: decide whether you are CPU-bound, I/O-bound, or lock-bound

  • If CPU is pegged: expect bad plans, missing indexes, huge sorts, or too many concurrent queries.
  • If disk is saturated: expect scans, large reads, buffer pool misses, or checkpoints/flush pressure.
  • If threads are waiting: expect locks, metadata lock issues, long transactions, or hot rows. (Quick shell checks for all three bottleneck types are sketched below.)
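
Before touching MySQL, spend sixty seconds confirming which of the three you are in. A minimal first pass from the shell; it assumes the sysstat package is installed for iostat, and the numbers are a smell test, not a verdict:

cr0x@server:~$ uptime
cr0x@server:~$ iostat -x 1 3
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Threads_running','Innodb_row_lock_current_waits');"

Load far above core count with high Threads_running points at CPU and bad plans; %util pinned in iostat while CPU is modest points at the disks; a non-zero Innodb_row_lock_current_waits points at blocking.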

Second: enable or confirm slow log, then capture a clean window

Don’t tail a week-old log and guess. Rotate, then capture 5–15 minutes during pain. You want a tight time window that matches symptoms.

Third: rank by total time, not by “worst single query”

The query that ran for 20 seconds once is dramatic. The query that runs for 120 ms 4,000 times per minute is the one quietly eating your lunch. Rank by sum(Query_time) and by count, then drill into examples.
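
If you want a rough total before reaching for real tooling, a small awk pass over the capture window works. It sums and counts every logged entry but does not normalize statements, so treat it as a sanity check, not a ranking:

cr0x@server:~$ sudo awk '/^# Query_time:/ { n++; total += $3 } END { printf "entries=%d  total_query_time=%.1fs\n", n, total }' /var/log/mysql/mysql-slow.log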

Fourth: verify with EXPLAIN and actual handler statistics

Slow logs tell you what happened. EXPLAIN tells you what MySQL planned to do. Both can be wrong in isolation. Combine them.
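
EXPLAIN gives the plan; handler counters give the work actually done. A minimal sketch, assuming the prod database and the orders query used in the tasks below (substitute your own suspect query with real parameter values), run as a single session so the counters are meaningful:

cr0x@server:~$ mysql prod -e "FLUSH STATUS; SELECT * FROM orders WHERE user_id=84721 ORDER BY created_at DESC LIMIT 20; SHOW SESSION STATUS LIKE 'Handler_read%';"
...result rows, then the Handler_read% counters...

If Handler_read_rnd_next dwarfs the rows you actually returned, the statement is scanning. After a good index, Handler_read_key and Handler_read_next should dominate.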

Enable slow query logging on Debian 13 (and don’t hurt yourself)

Debian 13’s own repositories ship MariaDB; Oracle MySQL 8.x usually comes from the MySQL APT repository instead. The mechanics are similar, but config file locations and package defaults differ. I’m going to assume Oracle MySQL (package mysql-server) with systemd; the same steps mostly apply to MariaDB with slightly different service names and variables.

Where config usually lives

  • /etc/mysql/my.cnf includes other files.
  • /etc/mysql/mysql.conf.d/mysqld.cnf is the common place for server settings.
  • /var/log/mysql/ is a common log directory, but verify permissions and AppArmor profiles (a quick check is shown just below).
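
A quick way to see which files are actually included and whether AppArmor will let mysqld write your chosen log path; the profile path is the one usually shipped with the MySQL packages and may not exist on every install:

cr0x@server:~$ grep includedir /etc/mysql/my.cnf
cr0x@server:~$ sudo aa-status | grep -i mysqld
cr0x@server:~$ sudo grep -i "/var/log/mysql" /etc/apparmor.d/usr.sbin.mysqld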

Minimal sane slow-log config

Don’t overthink this. Start with file logging, a modest threshold, and avoid logging every query on a hot system.

cr0x@server:~$ sudo editor /etc/mysql/mysql.conf.d/mysqld.cnf
...add or adjust...
slow_query_log = ON
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time = 0.2
log_queries_not_using_indexes = OFF
log_slow_admin_statements = ON
log_slow_replica_statements = OFF

Why these choices:

  • long_query_time=0.2 is aggressive enough to catch “death by a thousand cuts” without logging everything (usually). Tune later.
  • log_queries_not_using_indexes=OFF at first because it can flood logs with harmless small table scans.
  • log_slow_admin_statements=ON because online DDL and admin commands can stall the world.

Reload safely

Changing slow log settings is often dynamic, but don’t assume. Use SET PERSIST where appropriate, and validate with SHOW VARIABLES.
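
A minimal sketch on MySQL 8: SET PERSIST applies the change now and writes it to mysqld-auto.cnf in the data directory so it survives restarts; then verify the values the server is actually using:

cr0x@server:~$ mysql -e "SET PERSIST slow_query_log = ON; SET PERSIST long_query_time = 0.2; SET PERSIST slow_query_log_file = '/var/log/mysql/mysql-slow.log';"
cr0x@server:~$ mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('slow_query_log','long_query_time','slow_query_log_file');"

If you manage settings in the files under /etc/mysql, pick one source of truth: persisted values win over the config file at startup, which is exactly the kind of surprise the next on-call does not need.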

Practical tasks: commands, outputs, and decisions (12+)

These are the operator moves that work at 2 a.m. Each one includes what the output means and what decision you make next.

Task 1: Confirm which server you’re running (MySQL vs MariaDB) and version

cr0x@server:~$ mysql --version
mysql  Ver 8.0.36 for Linux on x86_64 (MySQL Community Server - GPL)

What it means: You know which variable names and features apply. MySQL 8 has SET PERSIST; MariaDB differs.

Decision: Use MySQL 8 syntax and expect Performance Schema features to be available by default.

Task 2: Check if slow logging is currently enabled

cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'slow_query_log';"
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| slow_query_log | OFF   |
+----------------+-------+

What it means: You currently have no forensic trail.

Decision: Turn it on immediately (dynamic if possible), then capture a short window during load.

Task 3: Enable slow log dynamically (immediate, not after restart)

cr0x@server:~$ mysql -e "SET GLOBAL slow_query_log = ON;"
...Query OK, 0 rows affected...

What it means: New slow queries will be logged right now.

Decision: Set log file path next, verify it’s writable, then set threshold.

Task 4: Set the slow log file location and verify MySQL can write it

cr0x@server:~$ mysql -e "SET GLOBAL slow_query_log_file = '/var/log/mysql/mysql-slow.log';"
...Query OK, 0 rows affected...
cr0x@server:~$ sudo install -o mysql -g adm -m 0640 /dev/null /var/log/mysql/mysql-slow.log
cr0x@server:~$ sudo ls -l /var/log/mysql/mysql-slow.log
-rw-r----- 1 mysql adm 0 Dec 30 01:12 /var/log/mysql/mysql-slow.log

What it means: The file exists with sane ownership. MySQL won’t fail silently because it can’t open the log.

Decision: If ownership is wrong, fix it before you chase phantom “no slow queries” issues.

Task 5: Set long_query_time for triage

cr0x@server:~$ mysql -e "SET GLOBAL long_query_time = 0.2; SHOW GLOBAL VARIABLES LIKE 'long_query_time';"
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| long_query_time | 0.200 |
+-----------------+-------+

What it means: Queries slower than 200 ms get logged. Note that sessions opened before the change keep their old session value of long_query_time; new connections pick up the global setting.

Decision: During an incident, go lower (e.g., 0.1) if you need more samples. After, raise it to reduce noise.

Task 6: Rotate the slow log to isolate a time window

cr0x@server:~$ sudo mv /var/log/mysql/mysql-slow.log /var/log/mysql/mysql-slow.log.1
cr0x@server:~$ mysql -e "FLUSH SLOW LOGS;"
...Query OK, 0 rows affected...
cr0x@server:~$ sudo ls -lh /var/log/mysql/mysql-slow.log*
-rw-r----- 1 mysql mysql  12K Dec 30 01:15 /var/log/mysql/mysql-slow.log
-rw-r----- 1 mysql adm   1.8M Dec 30 01:12 /var/log/mysql/mysql-slow.log.1

What it means: FLUSH SLOW LOGS only closes and reopens the log file; the rename beforehand is what actually archives the old data. You now have a clean “current” log and the old one with a suffix.

Decision: Capture the next 5–15 minutes of pain and analyze only that.

Task 7: Confirm MySQL is actually writing entries

cr0x@server:~$ sudo tail -n 5 /var/log/mysql/mysql-slow.log
# Time: 2025-12-30T01:16:19.123456Z
# User@Host: app[app] @ 10.0.2.41 []  Id:  9123
# Query_time: 0.412  Lock_time: 0.000  Rows_sent: 20  Rows_examined: 250000
SET timestamp=1767057379;
SELECT * FROM orders WHERE user_id=84721 ORDER BY created_at DESC LIMIT 20;

What it means: Logging works. This sample already smells like “missing index for ORDER BY + filter.”

Decision: Don’t optimize from one line. Aggregate next.

Task 8: Quick aggregation with mysqldumpslow

cr0x@server:~$ sudo mysqldumpslow -s t -t 10 /var/log/mysql/mysql-slow.log
Reading mysql slow query log from /var/log/mysql/mysql-slow.log
Count: 184  Time=0.29s (52s)  Lock=0.00s (0s)  Rows=20.0 (3680), app@app[10.0.2.41]
  SELECT * FROM orders WHERE user_id=N ORDER BY created_at DESC LIMIT N

Count: 21  Time=1.10s (23s)  Lock=0.02s (0s)  Rows=1.0 (21), app@app[10.0.2.55]
  UPDATE inventory SET qty=qty-N WHERE sku='S'

What it means: First query consumes 52 seconds total in this window due to high count. Second has higher per-query time but lower total.

Decision: Fix the top total-time offender first unless the second causes locks or a cascading failure.

Task 9: Install and run pt-query-digest for real ranking

cr0x@server:~$ sudo apt-get update
...output...
cr0x@server:~$ sudo apt-get install -y percona-toolkit
...output...
cr0x@server:~$ sudo pt-query-digest /var/log/mysql/mysql-slow.log | sed -n '1,80p'
# 2025-12-30T01:20:44
# Profile
# Rank Query ID           Response time Calls R/Call V/M   Item
# ==== ================== ============= ===== ====== ===== ====
#    1 0xA1B2C3D4E5F6A7B8  52.1230 65.0%   184 0.2833  0.12 SELECT orders
#    2 0x1122334455667788  23.0041 28.7%    21 1.0954  0.03 UPDATE inventory
#    3 0x9988776655443322   3.9002  4.9%    97 0.0402  0.01 SELECT sessions

What it means: You now have a ranked list by total response time. Query fingerprints prevent “same query, different IDs” noise.

Decision: Pull the full report section for Rank 1 and use it to reproduce and EXPLAIN.

Task 10: Get schema and index context for the top query

cr0x@server:~$ mysql prod -e "SHOW CREATE TABLE orders\G" | sed -n '1,60p'
*************************** 1. row ***************************
Table: orders
Create Table: CREATE TABLE `orders` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `user_id` bigint NOT NULL,
  `created_at` datetime(6) NOT NULL,
  `status` varchar(32) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_status` (`status`)
) ENGINE=InnoDB

What it means: No composite index on (user_id, created_at). The query filters by user_id and orders by created_at—classic case.

Decision: Plan an index that matches the access pattern: (user_id, created_at DESC) in MySQL 8, which supports descending indexes; a plain (user_id, created_at) also helps, since MySQL can scan it in reverse.

Task 11: Validate the execution plan (EXPLAIN) and what it’s scanning

cr0x@server:~$ mysql prod -e "EXPLAIN SELECT * FROM orders WHERE user_id=84721 ORDER BY created_at DESC LIMIT 20\G"
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: orders
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 248901
     filtered: 10.00
        Extra: Using where; Using filesort

What it means: Full table scan (type: ALL) plus filesort. That’s your I/O and CPU.

Decision: Add the right index, then re-run EXPLAIN to confirm it uses it, and measure before/after.

Task 12: Add the index carefully (online where possible) and watch impact

cr0x@server:~$ mysql prod -e "ALTER TABLE orders ADD INDEX idx_user_created (user_id, created_at DESC);"
...Query OK, 0 rows affected...

What it means: Index is created. Depending on table size and MySQL settings, this may take time and consume I/O. On big tables, do this during a maintenance window or use online schema change tooling.

Decision: Immediately validate new plan and check whether slow log volume drops. If replication exists, verify replica lag.

Task 13: Re-run EXPLAIN to confirm plan improvement

cr0x@server:~$ mysql prod -e "EXPLAIN SELECT * FROM orders WHERE user_id=84721 ORDER BY created_at DESC LIMIT 20\G"
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: orders
         type: ref
possible_keys: idx_user_created
          key: idx_user_created
      key_len: 8
          ref: const
         rows: 20
        Extra: Using index condition

What it means: MySQL now uses the index and expects ~20 rows, not ~250k. Filesort is gone.

Decision: Monitor latency and I/O. If it improved, you’ve likely found “the query.” Keep going for the next offender.

Task 14: Check whether the “slow” time was actually lock time

cr0x@server:~$ grep -E "Query_time|Lock_time" -n /var/log/mysql/mysql-slow.log | head
2:# Query_time: 0.412  Lock_time: 0.000  Rows_sent: 20  Rows_examined: 250000
8:# Query_time: 1.905  Lock_time: 1.723  Rows_sent: 1  Rows_examined: 1

What it means: The second sample spent most time waiting on locks, not running.

Decision: Stop “optimizing” SQL text and go find who holds locks: long transactions, hot rows, or metadata locks.

Task 15: Identify current blockers and lock waits

cr0x@server:~$ mysql -e "SHOW PROCESSLIST;" | sed -n '1,20p'
Id	User	Host	db	Command	Time	State	Info
9123	app	10.0.2.41:51022	prod	Query	2	Sending data	SELECT * FROM orders WHERE user_id=84721 ORDER BY created_at DESC LIMIT 20
9150	app	10.0.2.55:53488	prod	Query	35	Waiting for row lock	UPDATE inventory SET qty=qty-1 WHERE sku='ABC-123'
9201	admin	localhost	prod	Query	120	Starting alter table	ALTER TABLE inventory ADD COLUMN last_checked datetime

What it means: You’ve got row lock waits and a DDL running. That DDL may be impacting concurrency depending on algorithm and metadata locks.

Decision: If you’re in incident mode, consider killing the DDL or moving it off-peak, and fix the transaction behavior on inventory updates.
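
SHOW PROCESSLIST shows the waiters; the sys schema shows who they are waiting on. A sketch for MySQL 8, where sys ships by default:

cr0x@server:~$ mysql -e "SELECT * FROM sys.innodb_lock_waits\G" | sed -n '1,30p'
...one block per waiting/blocking pair, including the blocking thread and its query...

If the blocker turns out to be an idle session sitting on an open transaction, KILL <id> is often the fastest way back to health; just remember its work rolls back, which takes time on big transactions.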

Task 16: Check InnoDB buffer pool pressure (I/O vs memory)

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';"
+---------------------------------------+------------+
| Variable_name                         | Value      |
+---------------------------------------+------------+
| Innodb_buffer_pool_read_requests      | 987654321  |
| Innodb_buffer_pool_reads              | 12345678   |
+---------------------------------------+------------+

What it means: Reads from disk are non-trivial. A high Innodb_buffer_pool_reads rate means you’re missing cache and paying I/O.

Decision: If buffer pool is too small, increase it (carefully). But first eliminate scan-heavy queries; otherwise you’ll just cache your mistakes.
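
Before you grow the buffer pool, compare it to the data you are asking it to hold. A rough sketch; data_length plus index_length is an approximation of the working set, not a promise:

cr0x@server:~$ mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';"
cr0x@server:~$ mysql -e "SELECT ROUND(SUM(data_length + index_length)/1024/1024/1024, 1) AS innodb_gb FROM information_schema.tables WHERE engine = 'InnoDB';"

MySQL 8 can resize the buffer pool online, but the resize itself causes churn; do it deliberately, not in the middle of a latency spike.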

Reading slow logs like an SRE, not like a fortune teller

A slow log entry is a small story: who ran it, how long it took, what it returned, and how much it looked at. The trick is to understand which part of that story is actionable.

Fields that matter (and what they’re trying to tell you)

  • Query_time: wall-clock time. Includes waiting on CPU, I/O, locks, and sometimes scheduling delays. It’s your symptom, not your diagnosis.
  • Lock_time: time spent waiting for table/row locks (depending on engine). If this is large, your “slow query” might be fast SQL stuck behind a slower transaction.
  • Rows_examined: how much work was done. High rows examined with low rows sent is a common “missing index” signature (a quick filter for this is sketched after this list).
  • Rows_sent: how much data you returned. If this is huge, the query might be “working as designed” but the design is wrong (chatty endpoints, giant exports).
  • User@Host: identifies a workload source. This often tells you which service owns the fix.
  • Timestamp: lets you correlate to deploys, cron jobs, or traffic spikes.
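
A quick filter for the “examined a lot, returned a little” signature, assuming the header format shown earlier (Rows_sent in field 7, Rows_examined in field 9); the thresholds are arbitrary, so adjust them to your data:

cr0x@server:~$ sudo awk '/^# Query_time:/ && $9 > 10000 && $9 > 100 * ($7 + 1)' /var/log/mysql/mysql-slow.log | head

pt-query-digest does this properly, with the statements attached; the one-liner is for the minutes before it is installed.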

The three archetypes of “silently killing you”

You’ll see these over and over:

  1. High frequency, moderate time: a query that’s not awful but runs constantly. Fixing it improves CPU and tail latency. These are the best ROI.
  2. Low frequency, huge time: reports, batch jobs, and “one big query.” Often acceptable if scheduled; unacceptable if triggered by users.
  3. Lock-dominated: queries that look slow because they were blocked. Here, you fix transaction scope, isolation behavior, or data hot-spots.

Where operators get fooled

The slow log is literal. It records that a query took 1.2 seconds. It does not record why. The why might be:

  • filesystem latency spike,
  • a checkpoint flush,
  • mutex contention,
  • replica I/O saturation,
  • or a single session holding a lock because it started a transaction and then went out for coffee.

That last one happens more often than anyone wants to admit.
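
Finding the coffee-break transaction is one query away. A sketch using information_schema; trx_started tells you how long it has been open, and trx_mysql_thread_id maps back to the processlist Id you would pass to KILL:

cr0x@server:~$ mysql -e "SELECT trx_mysql_thread_id, trx_started, trx_state, trx_rows_locked, LEFT(trx_query, 80) AS stmt FROM information_schema.innodb_trx ORDER BY trx_started LIMIT 10;"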

Joke 2: A long transaction is like leaving the office kitchen microwave running—everyone notices, nobody knows who did it.

Tooling: mysqldumpslow, pt-query-digest, and friends

You can do a lot with nothing but grep and patience, but you have better things to do than manually normalize SQL. Use the tools. Know what they’re good at.

mysqldumpslow: quick and dirty

It ships with MySQL and is good for a first cut. It groups queries by a crude normalization (replacing numbers and strings). It’s fine for “what’s the top offender right now?” It’s not great for deep analysis across many sources or for complex statement variants.

pt-query-digest: production-grade triage

Percona Toolkit’s digest is the workhorse. It fingerprints queries, aggregates by total time and variance, and prints representative samples. It’s opinionated in a good way: it pushes you to focus on what matters.

If you have a stable production environment, consider a scheduled job that digests slow logs and stores summaries somewhere. Not because you love reports, but because incidents are shorter when you already know the usual suspects.
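
A minimal sketch of that habit: a daily cron job that digests the most recently rotated slow log into a dated report. The paths and schedule are assumptions; adapt them to your rotation scheme:

cr0x@server:~$ sudo tee /etc/cron.daily/slow-log-digest >/dev/null <<'EOF'
#!/bin/sh
# Digest yesterday's rotated slow log into a dated report (illustrative paths).
LOG=/var/log/mysql/mysql-slow.log.1
OUT=/var/lib/mysql-reports/digest-$(date +%F).txt
mkdir -p /var/lib/mysql-reports
[ -f "$LOG" ] && pt-query-digest "$LOG" > "$OUT" 2>&1
EOF
cr0x@server:~$ sudo chmod +x /etc/cron.daily/slow-log-digest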

When Performance Schema helps

Slow log is reactive: it tells you about queries that crossed a threshold. Performance Schema can be proactive: it can show top statements by total time even if they’re “not slow” individually.
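
A sketch of that proactive view, using the statement digest table that is collected by default in MySQL 8, so there is nothing to switch on mid-incident; timer columns are in picoseconds, hence the division:

cr0x@server:~$ mysql -e "SELECT LEFT(digest_text, 60) AS stmt, count_star AS calls, ROUND(sum_timer_wait/1e12, 1) AS total_s, sum_rows_examined FROM performance_schema.events_statements_summary_by_digest ORDER BY sum_timer_wait DESC LIMIT 5;"

The sys.statement_analysis view is the same idea with friendlier column names, if you prefer it.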

But if you’re already in pain and you don’t know your current overhead, don’t start flipping a dozen instrumentation switches mid-incident. Your job is to restore service first, not to become a one-person telemetry platform.

Failure modes: when slow logs lie (or at least mislead)

The slow query log isn’t wrong. It’s just incomplete. Here are the common ways it sends you chasing the wrong rabbit.

1) Threshold too high, and the real killer never appears

If long_query_time is 2 seconds, you’ll miss the 150 ms query running 1,000 times per second. That query can max your CPU while never being “slow.”

Fix: temporarily lower long_query_time during investigation or use Performance Schema to rank by total time.

2) Threshold too low, and you drown in noise

If you set long_query_time to 0.01 on a busy system, you might generate logs faster than you can write them. Congrats, you just invented a log-based denial of service.

Fix: sample in short windows. Rotate logs. Analyze small slices. Raise threshold afterward.

3) Lock waits make innocent queries look guilty

A simple UPDATE can show Query_time: 2.0 with Lock_time: 1.9. The SQL isn’t slow; the concurrency model is broken.

Fix: find the blocker (processlist, InnoDB lock waits), shorten transactions, reduce hot rows, or change write patterns.

4) The slow query is a symptom of I/O collapse, not the cause

When storage latency spikes, everything slows down. The log will show a parade of “slow” queries, and you’ll blame the wrong one.

Fix: confirm storage health and latency. Look for buffer pool misses and scans; then check the platform (disk, RAID, virtualization, noisy neighbors).

5) The log location is unwritable, and you get silence

MySQL can’t always write where you told it to, especially with AppArmor profiles. You’ll think you have no slow queries. You do. You just have no log.

Fix: ensure file exists, correct ownership, and check MySQL error log for file open errors.
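
Two checks that settle the “silent log” question in a minute; the error log path is the usual Debian default and may differ on your install:

cr0x@server:~$ sudo grep -i "slow" /var/log/mysql/error.log | tail
cr0x@server:~$ sudo journalctl -k --since "1 hour ago" | grep -i "apparmor" | grep -i "denied"
cr0x@server:~$ mysql -e "SHOW GLOBAL VARIABLES LIKE 'slow_query_log_file';"

If AppArmor is denying the write, either move the log into an allowed path or extend the profile; don’t chmod your way past it.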

Common mistakes: symptom → root cause → fix

1) “CPU is pinned, but slow log is empty”

Symptom: High CPU, high QPS, users complain, slow log has little or nothing.

Root cause: long_query_time too high; your killer is “fast” but frequent.

Fix: temporarily drop long_query_time (e.g., 0.1–0.2), rotate logs, and/or use Performance Schema digests to rank total time.

2) “Everything is slow after enabling log_queries_not_using_indexes”

Symptom: Disk busy, log volume explodes, MySQL feels worse.

Root cause: You logged every small scan on tiny tables and created write amplification.

Fix: turn it off, then use digests to identify the few scans with high rows examined or high total time.

3) “We added an index and performance got worse”

Symptom: Writes slow down, replication lags, buffer pool churn increases.

Root cause: The new index increased write cost and memory footprint; you optimized a read path without considering write volume.

Fix: validate read benefit vs write penalty. Consider a covering index, reduce index width, or redesign query to avoid needing it.

4) “Slow queries are all the same SELECT, but only sometimes”

Symptom: Same statement appears with variable Query_time.

Root cause: parameter values produce different selectivity; plan is unstable; stats are stale; or you’re seeing lock contention spikes.

Fix: compare Rows_examined across samples, run EXPLAIN with representative parameters, update statistics, and investigate locks if Lock_time spikes.

5) “The slowest query is a report; the site is dying”

Symptom: A big reporting query dominates slow log, app latency skyrockets.

Root cause: Shared resources. The report is saturating I/O or buffer pool, harming OLTP workload.

Fix: move report to replica, schedule off-peak, add resource isolation, or pre-aggregate. Don’t “optimize” by adding random indexes to production OLTP tables first.

6) “Queries got slower right after a schema migration”

Symptom: Lock time and wait states increase; slow log shows admin statements.

Root cause: DDL triggered metadata locks or large table rebuilds; migration collided with peak traffic.

Fix: use online schema change strategies, run migrations during controlled windows, and log slow admin statements so you can prove causality next time.
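
A sketch of asking for the least disruptive algorithm explicitly, using the column from the earlier processlist example; if InnoDB can’t honor the request, the statement fails immediately instead of silently falling back to a blocking rebuild:

cr0x@server:~$ mysql prod -e "ALTER TABLE inventory ADD COLUMN last_checked datetime, ALGORITHM=INSTANT;"

If that errors, retry with ALGORITHM=INPLACE, LOCK=NONE, and if that fails too, reach for pt-online-schema-change or gh-ost. Even with LOCK=NONE, the DDL still needs a brief exclusive metadata lock at the start and end, so it can queue behind a long transaction and block everything behind it, which is exactly the symptom this section describes.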

Three corporate mini-stories from the trenches

Incident: the wrong assumption that “replicas don’t matter”

One team ran a busy transactional database with a single primary and two replicas. The replicas were “for reads and backups,” and that phrase became an excuse to ignore replica performance entirely. The primary was monitored obsessively; the replicas were treated like cold spares.

Then a new feature shipped: the application started routing a subset of user profile reads to replicas. Not all reads. Just some. The rollout was gradual, which made it harder: no one saw an obvious cliff. What people saw was a slow increase in p95 latency and a handful of “random timeouts” that didn’t correlate with primary metrics.

The slow query log on the primary looked boring. On the replicas, it was screaming. A particular SELECT with a harmless-looking ORDER BY was running just under the slow threshold on the primary thanks to warm cache, but it was crossing the threshold constantly on replicas because their buffer pools were cold and their disks were slower.

The wrong assumption was that a query that “works on primary” must be fine everywhere. It wasn’t. Replicas had different I/O characteristics, different load shape, and a backup job that spiked reads every hour. The slow logs proved it: same query fingerprint, different Query_time distribution.

The fix was not heroic: add the missing composite index, increase replica buffer pool size, and schedule backups away from the new traffic pattern. The lesson stuck: replicas are production too, just with fewer excuses.

Optimization that backfired: the index that ate the write path

A finance-adjacent service had a table that tracked ledger entries. Reads were frequent, but writes were constant. An engineer saw a slow log entry for a query filtering by (account_id, created_at) and did the “obvious” thing: added a large composite index that also included a wide varchar column to “cover” the query completely.

The read improved. It was beautiful in staging. In production, the write path started to wobble. Insert latency increased, replication lag appeared, and the buffer pool got noticeably less effective. The slow log began to show different queries: not the original SELECT, but inserts and updates that now took longer because every write had to maintain a heavier index structure.

The team spent a day arguing about whether the index “should” have helped. It did help the read. It also changed the economics of the whole table. InnoDB doesn’t give you free indexes; it charges you in write amplification and memory, every time.

The fix was surgical. They replaced the wide covering index with a narrower composite index, then rewrote the query to fetch only needed columns. That reduced index bloat and restored write throughput. They also added a rule: index proposals require an explicit write-cost discussion and a rollback plan.

Boring but correct practice that saved the day: time-windowed logging and rotation

A SaaS platform had a recurring issue: once every few days, the system would “feel slow” for about ten minutes. Nothing dramatic, just enough to trigger support tickets. It was the kind of problem that wastes weeks because it’s intermittent and everyone has a theory.

The on-call engineer did something unglamorous: they added a playbook step to flush and rotate the slow log as soon as the incident started, then capture exactly 10 minutes of data. Not “the log file from last night,” not “maybe it happened around 3.” A crisp window.

After two incidents, the pattern emerged. It wasn’t a single query; it was a batch job that ran every few hours and generated a surge of moderate queries. Individually they were fine. Together they saturated I/O and pushed buffer pool churn over the edge. The slow log, when sliced to the right window, showed a consistent fingerprint set appearing only during the event.

The fix was similarly boring: move the batch workload to a replica and add rate limiting on the job. Support tickets dropped. Nobody got a trophy. Production got quieter, which is the closest thing SREs have to romance.

Checklists / step-by-step plan

When you’re in an incident (30–60 minutes)

  1. Classify the bottleneck: CPU, I/O, or locks. Don’t guess; check.
  2. Confirm slow log is enabled and writable: if it’s not writing, fix that first.
  3. Set a triage threshold: long_query_time around 0.1–0.5 depending on workload.
  4. Flush/rotate logs: isolate a clean window during the pain.
  5. Digest and rank: use pt-query-digest if available; otherwise mysqldumpslow.
  6. Pick the top offender by total time: unless lock time indicates a blocking problem.
  7. EXPLAIN with real parameters: confirm scan vs index usage, filesort, temporary tables.
  8. Fix minimally: add or adjust one index, or rewrite one query path. Don’t refactor the universe.
  9. Re-measure: rotate the log again, compare top offenders before/after.
  10. Document the fingerprint: the normalized query pattern is what matters for recurrence.

After the incident (next day)

  1. Set persistent config: use config file or SET PERSIST so you don’t lose settings on restart.
  2. Decide on log retention: slow logs grow; rotate via logrotate (a sketch follows this list) and ship to central storage if needed.
  3. Establish SLO-aligned thresholds: choose long_query_time that catches user-visible latency, not noise.
  4. Baseline top query fingerprints: your “normal” top 20 is a powerful early warning system.
  5. Build an index review habit: every new index has a write cost and memory footprint; treat it like capacity planning.
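
A sketch of the retention piece via logrotate; Debian’s MySQL packaging usually ships something similar under /etc/logrotate.d/, so adapt what is there rather than duplicating it. How the postrotate command authenticates is an assumption (use whatever admin auth you have standardized on):

/var/log/mysql/mysql-slow.log {
    daily
    rotate 14
    missingok
    notifempty
    compress
    delaycompress
    create 640 mysql adm
    postrotate
        # reopen the slow log so mysqld stops writing to the rotated file
        /usr/bin/mysql -e "FLUSH SLOW LOGS;" || true
    endscript
}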

Operational guardrails (prevent future self-harm)

  • Never enable “log everything” permanently on a busy primary.
  • Keep slow logs on local disk with predictable latency, not on a network filesystem.
  • Make schema migrations observable: log slow admin statements and track metadata lock waits.
  • Teach teams to read Rows_examined like a bill: if it’s huge, someone pays.

FAQ

1) Should I log to file or to table?

File, for most production systems. Logging to table writes into the CSV-engine mysql.slow_log table, which can create contention and complicates both rotation and retention. Use table logging only if you have a controlled pipeline and you’ve tested the overhead.

2) What’s a good long_query_time value?

Start around 0.2–0.5 seconds for OLTP services, then tune based on volume and SLOs. During investigation, lower it briefly to capture more samples. Don’t set it to 0 unless you enjoy filling disks.

3) Why do I see high Rows_examined but low Rows_sent?

That’s usually a missing or mismatched index, or a query that can’t use an index because of functions applied to the indexed column, implicit casts, or collation mismatches. Confirm with EXPLAIN; look for type: ALL or Using filesort.

4) The slow log shows an UPDATE with huge time, but EXPLAIN looks fine. Now what?

Check Lock_time. If it’s large, your UPDATE was blocked. Find the blocking transaction, shorten transaction scopes, and avoid hot-row designs where many sessions update the same row(s).

5) Will enabling the slow query log hurt performance?

Some overhead, yes—mostly from writing logs. With reasonable thresholds and file logging on local disk, it’s typically acceptable. The real risk comes from overly aggressive settings and massive log volume.

6) How do I catch “not slow but too frequent” queries?

Lower long_query_time temporarily and digest by total time, or use Performance Schema statement summaries to rank by total latency. Slow log alone is threshold-based by design.

7) Why are my slow logs empty even though slow_query_log=ON?

Common causes: log file path not writable, AppArmor denying writes, threshold too high, or the workload is dominated by quick queries and lock waits not crossing your threshold. Check error logs for file issues and validate the active path with SHOW VARIABLES.

8) Can I use slow logs to find missing indexes automatically?

You can get strong hints, not certainty. High Rows_examined and Using filesort point to index opportunities. But index design requires understanding read/write mix, cardinality, and query patterns. The slow log tells you where to look; you still have to think.

9) What about replicas—should they have different slow log settings?

Often yes. Replicas can be used for heavier logging and deeper instrumentation because they’re not the primary write bottleneck. But if your application reads from replicas, treat them like first-class production and monitor them the same way.

10) How long should I keep slow logs?

Keep enough to cover incident investigation and regression detection—commonly days to a couple of weeks, depending on volume. Rotate aggressively and digest summaries if you need longer trends without massive storage.

Conclusion: what to do next, tomorrow morning

If your MySQL server on Debian 13 feels like it’s dying quietly, stop guessing. Enable the slow query log with a sane threshold, rotate to capture a clean window, digest by total time, and fix the top offender with proof: EXPLAIN, index changes that match access patterns, and validation after the change.

Then make it routine. Keep slow logging available, run digests during incidents, and treat index changes like production changes—with rollback thinking and write-cost awareness. The “silent killer query” is rarely clever. It’s usually boring, repeated, and expensive. Which is exactly why it gets away with it.
