MySQL vs PostgreSQL: “CPU 100%”—how to prove it’s queries, not hardware

CPU 100% is the production equivalent of a smoke alarm you can’t reach: loud, persistent, and everyone wants it to stop immediately. The first instinct is to blame “the box” (or the cloud instance, or the hypervisor, or the neighbors). That instinct is often wrong.

This is a field guide to proving what’s actually happening when MySQL or PostgreSQL pegs CPU: whether you’re CPU-bound on queries, CPU-bound on background work, stalled on locks, or just measuring the wrong thing. It’s written for people who have to explain their conclusions to skeptical engineers and even more skeptical finance.

What “CPU 100%” really means for databases

When somebody says “CPU is 100%,” ask: whose CPU? Total system? One core? User time? Kernel time? Steal time? A single hot thread? A dozen workers each at 80%? The difference decides whether you tune queries, change configuration, or call your cloud provider with a straight face.

CPU-bound looks different from I/O-bound

Databases burn CPU for a few predictable reasons:

  • Execution work: scanning rows, evaluating predicates, computing aggregates, sorting, hashing, joining.
  • Planning work: generating and costing query plans (usually cheap, but it can spike on complex queries with many joins or partitions; stale statistics then push the planner toward expensive plans).
  • Concurrency work: locks, latches, MVCC bookkeeping, spin loops, context switching overhead.
  • Background work: vacuuming, purge/undo cleanup, checkpointing, replication apply, analyzing statistics.
  • Crypto/compression: TLS, at-rest encryption, compression, hashing for checksums.

CPU-bound means the database is actively executing instructions. I/O-bound means it’s mostly waiting for reads/writes. Both can show “high load,” but only one is fixed with “bigger CPU.”

Load average is not CPU usage. Load counts runnable tasks plus tasks stuck in uninterruptible sleep (often disk I/O). So a high load with moderate CPU is usually an I/O story. A high CPU with modest load can be one hot worker pinning a core.

One more landmine: steal time (virtualization). Your VM thinks it’s busy, but the hypervisor is taking cycles away. That’s not “queries,” but it will look like “CPU pressure” until you measure it.

Joke #1: If your database is at 100% CPU and the first fix is “add an index,” congratulations—you’ve joined the ancient religion of cargo cult performance.

My standard of proof

You don’t need a PhD in microarchitecture. You need repeatable evidence that links CPU consumption to specific sessions and ultimately to specific query fingerprints. If you can produce:

  1. OS-level proof that CPU is consumed by the database process (not irq, not kernel, not steal),
  2. DB-level proof of which sessions and query patterns are responsible,
  3. Plan-level proof explaining why those queries are expensive now,

…then you can say “it’s queries” without sounding like you’re guessing.

One operational quote is worth more than ten Slack debates. As Werner Vogels (Amazon CTO) put it: “Everything fails, all the time.” Your job is to fail with receipts.

MySQL vs PostgreSQL: how CPU saturation differs

MySQL and PostgreSQL can both melt CPUs. They just do it differently, and the diagnostic handles aren’t the same.

PostgreSQL: per-session processes, clear introspection

PostgreSQL typically runs one backend process per connection (plus background processes). That means when CPU is pegged, you can often map it quickly: “PID 12345 is hot” → “that PID is a backend” → “it is running query X.” This is a gift. Use it.

Postgres also tends to expose rich, query-centric telemetry: pg_stat_activity, pg_stat_statements, and auto_explain can tell you what’s running, what’s expensive over time, and what plans look like. If you’re serious, you enable them in production (carefully).
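
A minimal sketch of what “carefully” means in postgresql.conf (the thresholds are illustrative; changing shared_preload_libraries requires a restart, and auto_explain.log_analyze adds real overhead when on):

# postgresql.conf (both modules must be preloaded; restart required)
shared_preload_libraries = 'pg_stat_statements,auto_explain'
pg_stat_statements.track = top            # track top-level statements only
auto_explain.log_min_duration = '500ms'   # log plans for anything slower than this
auto_explain.log_analyze = off            # 'on' captures actual rows but costs extra per query

cr0x@server:~$ sudo -u postgres psql -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"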

MySQL: thread-based execution, performance_schema is the truth serum

MySQL is more thread-centric inside a smaller number of processes, so “PID is hot” is less specific. You lean on performance_schema, the slow query log, and statement digests. If you’re still relying on “SHOW PROCESSLIST and vibes,” you’re flying with instruments turned off.
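
Before trusting it mid-incident, confirm the statement instrumentation is actually on; a quick check (stock consumer names shown):

cr0x@server:~$ mysql -e "SHOW GLOBAL VARIABLES LIKE 'performance_schema';"
cr0x@server:~$ mysql -e "SELECT NAME, ENABLED FROM performance_schema.setup_consumers WHERE NAME LIKE 'events_statements%';"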

InnoDB introduces its own CPU flavor: mutex contention, buffer pool churn, purge/undo cleanup, and page flushing can all create CPU pressure that isn’t directly “my SELECT is slow,” but is still caused by workload and schema choices.

The big practical difference: how quickly you can isolate a culprit

In Postgres, isolating the top offenders often starts with pg_stat_statements and ends with EXPLAIN (ANALYZE, BUFFERS). In MySQL, you often start with statement digests and end with EXPLAIN ANALYZE (8.0+) plus index and schema work. Both require you to accept an uncomfortable truth: most “hardware problems” are actually work amplification—your queries doing more work than you think.

Fast diagnosis playbook (first/second/third)

First: confirm it’s really CPU, and whose

  1. Check CPU breakdown: user vs system vs iowait vs steal.
  2. Identify hot processes/threads: is it mysqld/postgres, or something else?
  3. Check run queue: are threads runnable (CPU-bound) or blocked (I/O/locks)?

Second: map OS hotness to DB activity

  1. Postgres: hot PID → pg_stat_activity to get query and state.
  2. MySQL: hot thread → performance_schema to get statement digest, user, host.
  3. Check concurrency: are you CPU-saturated due to many medium queries or one monstrous query?

Third: decide whether it’s plan regression, stats, contention, or background work

  1. Plan regression: same query got a worse plan after data growth, stats drift, or parameter changes.
  2. Missing/unused indexes: “it scans because it can,” or “it uses the wrong index because stats lie.”
  3. Lock/latch contention: CPU can rise from spin loops, plus throughput collapses.
  4. Background tasks: autovacuum (Postgres), purge/flushing (InnoDB), replication apply.

If you can’t work through those three steps in 10 minutes, you don’t have a “performance problem.” You have an observability problem.

Build an evidence chain: from OS to query text

When CPU pegs, people propose random fixes: restart the DB, fail over, add replicas, scale up, scale out, “tune the kernel,” sacrifice a goat. Don’t do random. Build a chain of evidence that you can show later.

What counts as proof

  • OS: CPU is busy in user space (or kernel) and it’s the database process.
  • DB: top CPU consumers correlate with specific SQL patterns (digests) and clients.
  • Plan: the expensive query is expensive for a reason (rows read, loops, sorts, hash spills, bad join order).
  • Change correlation: a deployment, index drop, stats drift, data growth, or config change preceded the spike.

What does not count as proof

  • “CPU is high, so the instance is too small.”
  • “My friend said Postgres is slower than MySQL for reads.”
  • “We didn’t change anything.” (Yes you did. Data changed. Traffic changed. The world changed.)

Practical tasks: commands, outputs, and decisions (12+)

Everything below is meant to be runnable on a typical Linux host with MySQL or PostgreSQL. Each task includes: the command, what you look for in the output, and what decision it triggers. Don’t run them all at once on a dying box. Pick the smallest hammer that answers the next question.

Task 1: Confirm CPU breakdown (user/system/iowait/steal)

cr0x@server:~$ mpstat -P ALL 1 5
Linux 6.1.0 (db01)  12/31/2025  _x86_64_  (16 CPU)

12:01:10 PM  CPU   %usr %nice  %sys %iowait %irq %soft %steal %idle
12:01:11 PM  all   92.40  0.00  6.80   0.20 0.00  0.10   0.00  0.50
12:01:11 PM    7   99.50  0.00  0.40   0.00 0.00  0.00   0.00  0.10

Meaning: High %usr implies query execution/computation. High %sys suggests kernel work (network, filesystem, contention). High %steal suggests noisy neighbors / oversubscription.

Decision: If %steal is non-trivial, stop arguing about SQL and validate host contention. If %iowait dominates, you’re likely I/O-bound, not CPU-bound.

Task 2: Find the hottest process

cr0x@server:~$ top -b -n 1 | head -20
top - 12:01:22 up 12 days,  3:12,  2 users,  load average: 18.32, 17.90, 16.01
Tasks: 412 total,   2 running, 410 sleeping,   0 stopped,   0 zombie
%Cpu(s): 93.1 us,  6.5 sy,  0.0 ni,  0.3 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64218.3 total,   2311.7 free,  21342.5 used,  40564.1 buff/cache

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 5123 mysql     20   0 6432108  12.8g  49288 S 1342.0  20.5  93:21.11 mysqld

Meaning: It’s actually the database process eating CPU. Also note: %CPU can exceed 100% because it’s per-core.

Decision: If it’s not the DB process, stop. Chase the real offender (backup tool, log shipper, agent, kernel threads).

Task 3: Thread-level CPU to see if it’s “one bad actor” or “death by a thousand cuts”

cr0x@server:~$ pidstat -t -p 5123 1 5
Linux 6.1.0 (db01)  12/31/2025  _x86_64_  (16 CPU)

12:01:31 PM   UID      TGID       TID    %usr %system  %CPU   CPU  Command
12:01:32 PM   112      5123      6120   98.00    1.00 99.00     7  mysqld
12:01:32 PM   112      5123      6121   76.00    2.00 78.00     3  mysqld
12:01:32 PM   112      5123      6177   10.00    0.00 10.00     9  mysqld

Meaning: A few threads are redlining. That often maps to specific connections or internal workers.

Decision: If one thread dominates, you hunt a single query/session. If many threads are hot, you look for a workload shift or global contention.

Task 4: Check run queue pressure and context switching

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
18  0      0 236812  91872 41231840  0    0     3    44 4211 9802 92  6  1  0  0
21  0      0 232104  91872 41241432  0    0     0    32 4877 12120 93  6  1  0  0

Meaning: r is runnable threads. If r is consistently higher than CPU count, you’re queueing for CPU. High cs can indicate excessive concurrency or lock contention.

Decision: If runnable threads are huge, consider connection limits, pool sizing, and query concurrency. If swapping appears, that’s its own emergency.

Task 5: Check steal time and CPU throttling (cloud reality check)

cr0x@server:~$ sar -u 1 3
Linux 6.1.0 (db01)  12/31/2025  _x86_64_  (16 CPU)

12:01:55 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:01:56 PM     all     89.12      0.00      7.01      0.20      3.10      0.57
12:01:57 PM     all     88.90      0.00      7.30      0.10      3.20      0.50

Meaning: %steal is CPU time you wanted but didn’t get. That can mimic “the DB got slower” without any query change.

Decision: If steal is elevated, validate host placement, instance class, and whether burst credits are exhausted. Don’t tune queries to compensate for bad tenancy.

Task 6: Postgres—map hot PID to the query and wait state

cr0x@server:~$ sudo -u postgres psql -x -c "select pid, usename, application_name, client_addr, state, wait_event_type, wait_event, now()-query_start as age, left(query,200) as query from pg_stat_activity where state<>'idle' order by age desc limit 5;"
-[ RECORD 1 ]-----+--------------------------------------------
pid               | 28741
usename           | app_user
application_name  | api
client_addr       | 10.20.3.41
state             | active
wait_event_type   |
wait_event        |
age               | 00:02:14.12031
query             | SELECT o.customer_id, count(*) FROM orders o JOIN order_items i ON i.order_id=o.id WHERE o.created_at >= now()-interval '30 days' GROUP BY o.customer_id;

Meaning: wait_event empty + state=active means it’s on-CPU (or at least not waiting on a tracked wait). The query text is your suspect.

Decision: If you see many actives running the same pattern, you likely have a hot endpoint. If wait events show locks, you treat it as contention, not “needs more CPU.”
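
A quick way to confirm the “many actives, same pattern” case is to group active sessions by a query prefix; a rough sketch (prefix grouping is crude, but it is fast during an incident):

cr0x@server:~$ sudo -u postgres psql -c "select left(query,80) as pattern, count(*) as actives, max(now()-query_start) as oldest from pg_stat_activity where state='active' group by 1 order by 2 desc limit 10;"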

Task 7: Postgres—get top CPU-style offenders by total time

cr0x@server:~$ sudo -u postgres psql -c "select calls, round(total_exec_time::numeric,1) as total_ms, round(mean_exec_time::numeric,2) as mean_ms, rows, left(query,120) as query from pg_stat_statements order by total_exec_time desc limit 10;"
 calls | total_ms | mean_ms | rows  | query
-------+----------+---------+-------+------------------------------------------------------------
  8231 | 912345.4 | 110.81  | 15321 | SELECT o.customer_id, count(*) FROM orders o JOIN order_items...
 12001 | 610112.7 | 50.84   | 12001 | SELECT * FROM sessions WHERE token = $1

Meaning: You now have query fingerprints ranked by total execution time (a proxy for CPU consumption, but not identical).

Decision: High total_exec_time drives incidents. High mean_exec_time drives tail latency. Decide which you’re fixing first.
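
If tail latency is the complaint rather than total burn, sort the same view by mean time; a sketch assuming PostgreSQL 13+ column names (older releases call these total_time and mean_time), with an arbitrary call-count floor to cut noise:

cr0x@server:~$ sudo -u postgres psql -c "select calls, round(mean_exec_time::numeric,2) as mean_ms, rows, left(query,120) as query from pg_stat_statements where calls > 100 order by mean_exec_time desc limit 10;"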

Task 8: Postgres—prove where time goes with EXPLAIN (ANALYZE, BUFFERS)

cr0x@server:~$ sudo -u postgres psql -c "explain (analyze, buffers, verbose) SELECT o.customer_id, count(*) FROM orders o JOIN order_items i ON i.order_id=o.id WHERE o.created_at >= now()-interval '30 days' GROUP BY o.customer_id;"
HashAggregate  (cost=... rows=... width=16) (actual time=2150.113..2150.901 rows=4821 loops=1)
  Buffers: shared hit=120344 read=8123
  ->  Hash Join  (cost=... ) (actual time=310.123..2011.221 rows=241233 loops=1)
        Hash Cond: (i.order_id = o.id)
        Buffers: shared hit=120344 read=8123
        ->  Seq Scan on public.order_items i  (actual time=0.021..1190.332 rows=5200000 loops=1)
              Buffers: shared hit=99844 read=6500
        ->  Hash  (actual time=305.110..305.111 rows=402113 loops=1)
              ->  Seq Scan on public.orders o  (actual time=0.030..221.900 rows=402113 loops=1)
                    Filter: (created_at >= (now() - '30 days'::interval))

Meaning: Sequential scans and large hash aggregates are CPU-hungry. Buffers show cache hits vs reads; this query is doing both and still spending lots of time in compute.

Decision: If it scans millions of rows to answer a small question, you add/adjust indexes, rewrite the query, or pre-aggregate. If it’s already mostly cache hits but still slow, it’s CPU-heavy logic (hashing, sorting, functions).

Task 9: MySQL—identify top statement digests by CPU time

cr0x@server:~$ mysql -e "SELECT DIGEST_TEXT, COUNT_STAR, ROUND(SUM_TIMER_WAIT/1e12,2) AS total_s, ROUND(SUM_LOCK_TIME/1e12,2) AS lock_s, ROUND(SUM_ROWS_EXAMINED/COUNT_STAR,0) AS avg_rows_examined FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 5\G"
*************************** 1. row ***************************
DIGEST_TEXT: SELECT o.customer_id, COUNT ( * ) FROM orders o JOIN order_items i ON i.order_id = o.id WHERE o.created_at >= ? GROUP BY o.customer_id
COUNT_STAR: 8120
total_s: 945.21
lock_s: 2.11
avg_rows_examined: 510220

Meaning: This digest consumes the most statement time. The avg_rows_examined is a giant “hello, I scan too much” signal.

Decision: Target the worst digests first. If lock_s is high relative to total, it might be lock contention, not CPU. If rows examined is huge, you chase indexes and access paths.

Task 10: MySQL—see active threads and what they’re doing

cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST;"
Id	User	Host	db	Command	Time	State	Info
9312	app	10.20.3.41:51844	prod	Query	132	Sending data	SELECT o.customer_id, count(*) FROM orders o JOIN order_items i ON i.order_id=o.id WHERE o.created_at >= NOW()-INTERVAL 30 DAY GROUP BY o.customer_id
9441	app	10.20.3.52:52011	prod	Query	98	Copying to tmp table	SELECT ... ORDER BY ...

Meaning: Long-running active queries are visible. “Copying to tmp table” and “Sending data” often mean big intermediate results, sorts, or poor indexing.

Decision: If one query dominates, you can kill it surgically (with coordination). If many similar long queries exist, the fix is systemic.
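
“Kill it surgically” in MySQL means KILL QUERY, which aborts the statement but keeps the connection; the Id comes from the processlist above, and KILL CONNECTION would drop the session entirely:

cr0x@server:~$ mysql -e "KILL QUERY 9312;"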

Task 11: MySQL—explain the plan and confirm whether you’re scanning

cr0x@server:~$ mysql -e "EXPLAIN SELECT o.customer_id, count(*) FROM orders o JOIN order_items i ON i.order_id=o.id WHERE o.created_at >= NOW()-INTERVAL 30 DAY GROUP BY o.customer_id\G"
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: o
type: range
possible_keys: idx_orders_created_at
key: idx_orders_created_at
rows: 401233
Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: i
type: ref
possible_keys: idx_order_items_order_id
key: idx_order_items_order_id
rows: 12
Extra:

Meaning: “Using temporary; Using filesort” is the classic CPU+memory tax for grouping/sorting without a good path.

Decision: Consider composite indexes to support grouping, reduce row counts early, or change query shape. If the range scan still hits hundreds of thousands of rows per call, you’re paying per request.

Task 12: MySQL 8—use EXPLAIN ANALYZE to see real execution behavior

cr0x@server:~$ mysql -e "EXPLAIN ANALYZE SELECT o.customer_id, count(*) FROM orders o JOIN order_items i ON i.order_id=o.id WHERE o.created_at >= NOW()-INTERVAL 30 DAY GROUP BY o.customer_id;"
-> Group aggregate: count(0)  (actual time=0.333..2150.221 rows=4821 loops=1)
    -> Nested loop inner join  (actual time=0.112..2010.011 rows=241233 loops=1)
        -> Index range scan on orders using idx_orders_created_at  (actual time=0.041..220.333 rows=402113 loops=1)
        -> Index lookup on order_items using idx_order_items_order_id (order_id=o.id)  (actual time=0.003..0.004 rows=12 loops=402113)

Meaning: The nested loop multiplied by 402k rows is the smoking gun. That’s how “fine” indexes still create huge CPU bills.

Decision: Consider rewriting to pre-aggregate order_items, or reduce the order set. In some cases, a different join strategy or covering index can slash loops.

Task 13: Check for lock contention masquerading as CPU trouble (Postgres)

cr0x@server:~$ sudo -u postgres psql -c "select wait_event_type, wait_event, count(*) from pg_stat_activity where state='active' group by 1,2 order by 3 desc;"
 wait_event_type |     wait_event      | count
-----------------+---------------------+-------
 Lock            | relation            |    18
                 |                     |     6

Meaning: Many actives waiting on relation locks means you may see throughput collapse and CPU do weird things (spin, context switch, retry storms).

Decision: Stop tuning SELECTs. Identify blocker sessions and the DDL or long transaction causing it.
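
pg_blocking_pids() (PostgreSQL 9.6+) hands you the blockers directly; a minimal sketch:

cr0x@server:~$ sudo -u postgres psql -c "select pid, pg_blocking_pids(pid) as blocked_by, state, left(query,80) as query from pg_stat_activity where cardinality(pg_blocking_pids(pid)) > 0;"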

Task 14: Check InnoDB internal pressure (MySQL)

cr0x@server:~$ mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,140p'
------------
TRANSACTIONS
------------
Trx id counter 93211245
History list length 81234
...
--------
SEMAPHORES
--------
OS WAIT ARRAY INFO: reservation count 221234, signal count 221100
Mutex spin waits 912334, rounds 1823344, OS waits 22344

Meaning: Huge history list length can imply purge lag (often from long transactions). High mutex spin waits can indicate contention that burns CPU without “slow queries” looking obvious.

Decision: If purge is behind, find long transactions and fix application behavior. If contention is high, reduce concurrency, review hot indexes/tables, and consider config changes—after identifying the workload trigger.
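
The long transactions behind a growing history list are one query away; a sketch against information_schema.innodb_trx:

cr0x@server:~$ mysql -e "SELECT trx_id, trx_started, TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_s, trx_mysql_thread_id, trx_rows_modified FROM information_schema.innodb_trx ORDER BY trx_started ASC LIMIT 5;"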

Task 15: Prove whether you’re swapping or memory-thrashing (both)

cr0x@server:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64218       21342        2311        1024       40564       40112
Swap:              0           0           0

Meaning: No swap use is good. Low “free” is not automatically bad; “available” matters more.

Decision: If swap is in use, CPU can spike from paging overhead and queries slow down unpredictably. Fix memory pressure before rewriting SQL.

Task 16: Capture a CPU profile snapshot (safe-ish) with perf

cr0x@server:~$ sudo perf top -p 5123
Samples: 42K of event 'cycles', 4000 Hz, Event count (approx.): 12000000000
Overhead  Shared Object        Symbol
  18.21%  mysqld               Item_func_like::val_int
  11.03%  mysqld               JOIN::exec
   9.88%  libc.so.6            __memcmp_avx2_movbe

Meaning: You’re seeing where CPU is spent. Here it’s string matching and join execution, not “disk.” Profiles can quickly validate query patterns (LIKE ‘%…%’) and bad joins.

Decision: If CPU hotspots align with known expensive operators (sort, hash, LIKE, JSON extraction), prioritize query rewrites and indexes. If you see kernel hotspots (network stack), look at client storms or TLS overhead.
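
If you need an artifact for the postmortem rather than a live screen, record a short profile instead of watching perf top; a sketch (99 Hz for 30 seconds is an arbitrary choice, and readable stacks need debug symbols for mysqld or postgres):

cr0x@server:~$ sudo perf record -F 99 -g -p 5123 -- sleep 30
cr0x@server:~$ sudo perf report --stdio --sort symbol | head -30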

Interesting facts and historical context (why today’s behavior exists)

  • PostgreSQL started as POSTGRES at UC Berkeley in the 1980s, with early ideas around extensibility and complex data types—why it still loves rich indexing and custom operators.
  • MySQL was built with simplicity and speed for common web workloads in the mid-1990s—why it traditionally emphasized pragmatic defaults and ease of deployment.
  • InnoDB became the default MySQL engine because transactional integrity and crash recovery won in the real world, even if it meant more internal machinery (and more ways to burn CPU).
  • Postgres MVCC makes reads non-blocking in many cases, but it moves cost to vacuuming and tuple visibility checks—CPU can spike when autovacuum falls behind or bloat grows.
  • MySQL’s performance_schema evolved from “nice-to-have” to essential once systems grew and “SHOW STATUS” stopped being enough for attribution.
  • Query planners got smarter and more complex in both systems; smarter planners require statistics, and bad stats are performance debt that compounds quietly.
  • Replication changed the shape of incidents: read replicas reduced read CPU on primaries but introduced new CPU hotspots on replicas (replay/apply, vacuum, or secondary index maintenance).
  • Hardware got faster, and queries got lazier: teams stopped feeling the cost of scanning until data growth made “temporary” plans permanent.

Three corporate mini-stories from the trenches

Incident #1: the outage caused by a wrong assumption (“CPU high means we need bigger boxes”)

The company had a steady MySQL cluster and a quarterly ritual: if latency rose, bump instance size. It worked until it didn’t. One Friday, CPU pinned and error rates rose. The first response was textbook corporate reflex: scale up.

Scaling up helped for about 20 minutes. Then CPU climbed again, only faster. Engineers began blaming the cloud provider. Someone suggested disabling binary logging “just to see.” Cooler heads prevailed—barely.

We pulled thread-level CPU and saw a handful of threads running hot, not a uniform overload. performance_schema showed a digest dominated by a single endpoint: an “export” feature that had recently switched from incremental pagination to “give me everything newer than last month.” It was doing the right work, the wrong way.

The kicker: the data set had doubled, and the query plan drifted into “Using temporary; Using filesort.” The instance wasn’t too small; the query had become a CPU multiplier. We rate-limited the endpoint, added a composite index, and rewrote the query to pre-aggregate. CPU dropped, and the instance size went back down the next day.

The wrong assumption wasn’t technical. It was psychological: treating hardware as an apology for not measuring.

Incident #2: the optimization that backfired (a “helpful index” that made CPU worse)

A different team ran PostgreSQL and hit periodic CPU spikes during peak hours. They noticed a slow query, added an index, and watched it get faster in staging. Change approved, deployed, and celebrated. Briefly.

In production, CPU got worse and p99 latency rose. The index was used, yes—but it caused a plan that performed tons of random accesses and then sorted a much larger intermediate result. The old plan was a sequential scan with a tight filter that benefited from cache warmth.

We used EXPLAIN (ANALYZE, BUFFERS) on production-like data (not staging toys). The planner’s selectivity estimates were off, and the new index handed it an access path that looked cheap on paper. It chose nested loops where a hash join was cheaper. CPU went to join processing and sort comparisons, not I/O.

Fixing it meant admitting that “index == faster” is not a law of physics. We updated statistics targets on key columns, ran analyze, and replaced the index with a composite index matching the query’s filter and join pattern. CPU spikes didn’t vanish—they became proportional again, which is the nicest thing you can say about production behavior.

Incident #3: the boring practice that saved the day (slow query evidence on demand)

A payments platform ran both MySQL and PostgreSQL for historical reasons. They had one habit that looked boring in planning meetings: they kept lightweight query telemetry enabled all the time. Not full text logging of every statement—just enough to attribute time to patterns.

When a CPU spike hit, there was no scramble to “turn on the slow log and wait.” MySQL already had statement digests in performance_schema. Postgres had pg_stat_statements. They could answer, within minutes, which query patterns had changed in total time and call rate.

The culprit was neither exotic nor glamorous: a job scheduler misfired and launched too many parallel workers. Each worker ran a “reasonable” query. Together they created a CPU storm and thrashed caches. The team didn’t argue about hardware. They proved a concurrency change.

They fixed the scheduler, capped concurrency at the pool layer, and added a guardrail alert: “same digest, sudden calls per second jump.” The incident was anticlimactic. That’s the point. Boring practices don’t make good war stories, but they keep you employed.

Common mistakes: symptom → root cause → fix

1) Symptom: CPU 100%, but latency only sometimes spikes

Root cause: Bursty queries or batch jobs competing with OLTP traffic; CPU saturation is time-sliced.

Fix: Separate workloads (queue batches), enforce concurrency limits, and identify top digests by total time and call rate.

2) Symptom: High CPU and high load average, but mpstat shows iowait

Root cause: I/O-bound system with many threads blocked in uninterruptible sleep. Load is not CPU.

Fix: Measure disk latency, buffer cache hit rates, checkpoint pressure, and query plans that trigger large reads.

3) Symptom: CPU high in kernel time (%sys), not user time

Root cause: Network overhead (connection storms), TLS cost, filesystem contention, or excessive context switching.

Fix: Use connection pooling, reduce per-request connections, validate TLS settings, and check for syscall-heavy patterns.

4) Symptom: Postgres CPU high, many “active” sessions, but wait_event shows Lock

Root cause: Lock contention plus retry storms or blocked backends piling up.

Fix: Identify blockers, shorten transactions, avoid long-running DDL in peak, and consider lock timeouts plus safer migration patterns.
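
In Postgres, “lock timeouts” usually means role- or session-level settings so a stuck migration fails fast instead of queueing everyone behind it; the values are illustrative, and app_user is the application role from the earlier examples:

cr0x@server:~$ sudo -u postgres psql -c "ALTER ROLE app_user SET lock_timeout = '2s';"
cr0x@server:~$ sudo -u postgres psql -c "ALTER ROLE app_user SET idle_in_transaction_session_timeout = '60s';"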

5) Symptom: MySQL CPU high, InnoDB history list length grows

Root cause: Long transactions prevent purge; internal cleanup falls behind and wastes CPU.

Fix: Find and fix long transactions, ensure commits happen, and tune workload. Sometimes the fix is “stop doing analytics in the primary.”

6) Symptom: A query that used to be fine is now expensive

Root cause: Data growth + stats drift + plan regression.

Fix: Refresh stats, validate plan changes, add composite indexes aligned with predicates/join keys, and consider query rewrites.
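
Refreshing stats is cheap enough to script; a sketch using the orders table from the earlier examples (the statistics target of 1000 is an example, not a recommendation):

cr0x@server:~$ sudo -u postgres psql -c "ALTER TABLE orders ALTER COLUMN created_at SET STATISTICS 1000; ANALYZE orders;"
cr0x@server:~$ mysql -e "ANALYZE TABLE prod.orders;"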

7) Symptom: CPU high after adding an index

Root cause: Planner picks a worse plan, or the index increases write amplification and maintenance overhead.

Fix: Compare plans before/after on production-like data; adjust stats; drop/replace index with the right shape; measure write overhead.

8) Symptom: CPU high mostly on replicas

Root cause: Replication apply, vacuum, index maintenance, or read workload moved without capacity planning.

Fix: Measure apply lag and background worker activity; right-size replicas; don’t assume replicas are “free.”

Joke #2: A CPU graph at 100% is like a corporate all-hands: everyone’s busy, and somehow nothing moves faster.

Checklists / step-by-step plan

Step-by-step: prove it’s queries (not hardware) in one incident

  1. Capture OS snapshot: mpstat, top, pidstat, vmstat.
  2. Confirm DB is the hot process: identify PID/TGID and top threads.
  3. Check virtualization noise: steal time; confirm no CPU throttling/burst exhaustion.
  4. Postgres path: map hot PID → pg_stat_activity query; check wait events; use pg_stat_statements top offenders.
  5. MySQL path: use performance_schema digest summaries; correlate with processlist; confirm rows examined.
  6. Decide contention vs execution: locks/waits vs pure compute.
  7. Run one plan proof: EXPLAIN (ANALYZE, BUFFERS) or EXPLAIN ANALYZE; capture row counts and loops.
  8. Make the least risky mitigation: rate limit, kill the worst queries (kill commands are sketched after this list), reduce concurrency, or temporarily disable the endpoint.
  9. Implement durable fix: index/query rewrite/stats, then validate with before/after metrics.
  10. Write the postmortem evidence chain: “OS → DB → query digest → plan → change.”
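
For step 8, the Postgres-side kill is a function call: pg_cancel_backend() aborts the query, pg_terminate_backend() drops the whole session (PID 28741 is the backend from Task 6):

cr0x@server:~$ sudo -u postgres psql -c "select pg_cancel_backend(28741);"
cr0x@server:~$ sudo -u postgres psql -c "select pg_terminate_backend(28741);"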

Operational guardrails I recommend (and why)

  • Connection pooling: because thread/process storms are a CPU tax and a scheduling tax.
  • Query telemetry always on (lightweight): so your next incident isn’t “wait 30 minutes for logs.”
  • Concurrency caps per workload: because throughput increases until it doesn’t, and then it falls off a cliff.
  • Change discipline for indexes and stats: because performance changes are production changes, not “just schema.”
  • Production-like staging data for plan tests: because the planner doesn’t care about your synthetic dataset.

How to choose between “fix queries” and “scale hardware”

Scale hardware when you have stable, efficient queries and increasing legitimate load. Fix queries when CPU per request is climbing, plans regress, rows examined explode, or contention rises. The moment you see “rows examined per call” ballooning, scaling up is a temporary anesthetic. It wears off.

FAQ

1) How do I know it’s not a bad CPU or failing hardware?

Modern CPUs rarely “half-fail” in a way that only hurts MySQL. If mpstat shows high steal time, that’s a tenancy issue. If kernel logs show machine check errors, that’s hardware. Otherwise, assume workload first and prove otherwise.

2) Why is CPU high but queries don’t look slow individually?

Because throughput can kill you. A fast query called 50,000 times per minute can dominate CPU more than a single slow query. Look at total time and calls, not just mean latency.

3) Why does Postgres show many “active” sessions with no wait event?

Often it really is running on CPU. Sometimes it’s in a state not represented as a wait event. Use OS tools (pidstat/perf) to confirm the backend is consuming cycles.

4) Can lock contention cause 100% CPU?

Yes. Spinlocks, retries, and scheduler thrash can push CPU up even while useful work goes down. In MySQL, look at InnoDB semaphore metrics; in Postgres, look at wait events and blocked chains.

5) Should I enable the MySQL slow query log during an incident?

Only if you can afford the overhead and you set sane thresholds. Prefer performance_schema digests for attribution. Slow logs are useful, but turning them on mid-fire often produces more smoke than signal.

6) Is PostgreSQL inherently more CPU-heavy than MySQL?

No blanket answer. Postgres can spend CPU on MVCC visibility checks and vacuum-related behavior; MySQL can spend CPU on InnoDB contention and join execution patterns. The workload and schema decide.

7) When should I reach for perf?

When you need to prove where cycles go (sorting, hashing, string functions, JSON processing) and DB-level metrics aren’t telling a coherent story. Use it to validate hypotheses, not as a first reflex.

8) What’s the fastest safe mitigation when CPU is pegged?

Reduce concurrency and stop the bleeding: rate-limit a hot endpoint, pause a batch job, cap connection pool size, or kill the single worst offender. Then do the durable fix after the graph stops screaming.

9) Why did adding an index not reduce CPU?

Because indexes don’t reduce work if the query still touches huge row counts, or if the index causes an execution strategy that’s worse for your data distribution. Validate with EXPLAIN ANALYZE and real row counts.

10) How do I prove a plan regression?

Capture the old plan (from logs, stored explain output, or a baseline environment) and compare to the new plan with actual rows and loops. In Postgres, pg_stat_statements plus auto_explain can help. In MySQL, compare EXPLAIN/EXPLAIN ANALYZE results and digest timing before/after the change window.
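
One cheap way to have a “before” for MySQL digest timing is to snapshot the summary table on a schedule; a sketch assuming an ops schema you own (the baseline table name is hypothetical):

cr0x@server:~$ mysql -e "CREATE TABLE ops.digest_baseline AS SELECT NOW() AS captured_at, DIGEST, DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT, SUM_ROWS_EXAMINED FROM performance_schema.events_statements_summary_by_digest;"

Later snapshots go in with INSERT ... SELECT, and diffing SUM_TIMER_WAIT and COUNT_STAR per DIGEST across captured_at shows exactly what changed around a deploy window.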

Conclusion: practical next steps

If you want to prove “CPU 100% is queries, not hardware,” do it like an operator, not a philosopher. Measure CPU breakdown. Identify hot processes/threads. Map them to sessions and query digests. Get one plan-level proof with real row counts. Then choose the smallest mitigation that buys time.

Next steps that pay off immediately:

  1. Ensure you can query pg_stat_statements (Postgres) or performance_schema statement digests (MySQL) during a live incident.
  2. Write a one-page runbook with the Fast diagnosis playbook above and keep it near on-call rotation notes.
  3. Add one alert for calls per digest jumping suddenly; it catches bad deploys and scheduler storms before CPU hits the ceiling.
  4. Make “EXPLAIN with real data shape” a requirement for performance fixes. Indexes without plans are just expensive opinions.