MariaDB vs PostgreSQL: CPU spikes—who burns cores faster under peak load

You don’t notice CPU until it starts screaming. The graph looks fine, then a promo email lands, QPS triples,
and your database hosts turn into tiny space heaters. Latency spikes. Timeouts stack. Someone asks the worst
question in operations: “Is it the database or the app?”

This isn’t a bench-racing piece about which engine is “faster.” Under peak load, both MariaDB and PostgreSQL can
burn cores quickly—just in different, predictable ways. The goal is to identify which failure mode you’re in,
prove it with commands, and make a fix that survives the next peak without turning your on-call rotation into a
hobby.

What “burning cores” really means under peak load

High CPU isn’t one problem; it’s a family reunion of problems. In practice, “CPU spikes” in databases come from:
(1) doing real useful work (executing queries), (2) doing work you didn’t intend (bad plans, excessive sorting,
repeated parsing), or (3) doing nothing productive but still spinning (contention, lock waits with busy loops,
excessive context switching).

Under peak load, the exact same query can shift from “fine” to “core-eater” because the system’s shape changes:
hot sets grow, caches miss, lock queues form, or a plan flips from index lookup to a scan due to stale stats or a
different bind parameter. Your database didn’t become stupid. It became honest.

When people ask “who burns cores faster,” they usually mean: “Which one hits CPU saturation sooner and collapses
harder when concurrency climbs?” The uncomfortable answer: it depends on how you feed it. MariaDB (InnoDB) often
suffers when you add too many concurrent writers or too many connections without pooling; PostgreSQL often suffers
when you let vacuum debt pile up or when a query plan regression turns a narrow request into a broad scan that
multiplies work across cores. Both can be tuned. Both can be mis-tuned into a bonfire.

A useful mental model: peak-load CPU issues are rarely about raw throughput. They’re about coordination overhead.
Your app asks for work; the database coordinates access to shared state. That coordination is where you lose cores.

One quote worth keeping on a sticky note, because it describes most CPU spikes in production:
“latency is a symptom, not a root cause” (paraphrased idea often attributed to Brendan Gregg’s performance methodology).
Your job is to find the root cause, not to argue with the graph.

How MariaDB spends CPU when it panics

MariaDB in production is usually MariaDB + InnoDB (or XtraDB historically). InnoDB is good at many things:
crash recovery, row-level locking, and handling common OLTP patterns. But peak load has a way of finding the seams.

Common MariaDB CPU spike patterns

1) Concurrency overhead: threads, mutexes, and “too many clients”

MariaDB uses a thread-per-connection model. When you scale connections linearly with traffic, you don’t just add
query work—you add scheduling work. Under high concurrency, CPU gets eaten by context switching and lock
coordination inside the engine. If you see high CPU with modest QPS, suspect thrash: lots of runnable threads,
not enough useful progress.

If you’re running without a thread pool (or with a misconfigured one), MariaDB can look like it’s “using CPU”
while accomplishing less and less. You’ll see it in OS-level run queues and in the database’s own status counters.
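
A minimal sketch of the knobs involved, assuming a recent MariaDB with the built-in thread pool; the numbers are illustrative placeholders, not recommendations for your hardware:

[mysqld]
# Replace one-thread-per-connection with the built-in thread pool
thread_handling = pool-of-threads
# Thread groups; starting near the core count is a common baseline
thread_pool_size = 16
# Hard ceiling on threads the pool may create
thread_pool_max_threads = 1000
# Cap clients at what the host can actually schedule (pair with app-side pooling)
max_connections = 500

The point of the sketch is the shape, not the values: concurrency gets capped at the server, and the pool keeps runnable threads near the core count instead of letting every client fight the scheduler.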

2) InnoDB contention: hot rows, hot indexes, and internal latch pain

Under peak write load, contention hotspots appear: an auto-increment primary key under heavy insert, a single “latest”
row hammered by updates, or secondary indexes that must be maintained for every write. This is real work plus
coordination. If the app creates a single hot row or hot index leaf, you can saturate CPU while throughput plateaus.

InnoDB has improved over the years with more fine-grained locking, but it still has places where contention shows up
as CPU spikes and stalls. Many of those are workload-shaped, not version-shaped.
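
If the hotspot is a single counter-style row, one workload-level fix is to shard it across a handful of slots. A hypothetical sketch (table and column names are invented for illustration):

-- Spread one hot counter across 16 slots so concurrent writers stop
-- serializing on a single row and its index entry.
CREATE TABLE page_view_counts (
  page_id BIGINT  NOT NULL,
  slot    TINYINT NOT NULL,              -- 0..15, chosen per request
  views   BIGINT  NOT NULL DEFAULT 0,
  PRIMARY KEY (page_id, slot)
) ENGINE=InnoDB;

-- Writers pick a random slot instead of hammering one row
INSERT INTO page_view_counts (page_id, slot, views)
VALUES (42, FLOOR(RAND() * 16), 1)
ON DUPLICATE KEY UPDATE views = views + 1;

-- Readers sum at most 16 rows per page
SELECT SUM(views) AS views FROM page_view_counts WHERE page_id = 42;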

3) Query execution: bad plans and expensive sorts

MariaDB can do great work quickly, but it can also commit classic mistakes under pressure:
sorting big intermediate result sets, scanning due to missing indexes, or repeating identical work because the app
sends slightly different text queries that defeat the query cache (and you shouldn’t rely on the query cache anyway).
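
If you don’t already capture the offenders, the slow query log is the cheapest evidence you can buy. A hedged sketch for MariaDB (thresholds are illustrative; put the durable versions in my.cnf):

SET GLOBAL slow_query_log  = ON;
SET GLOBAL long_query_time = 0.5;  -- seconds; log anything slower than 500 ms
-- MariaDB-specific: include plan details with each slow-log entry
SET GLOBAL log_slow_verbosity = 'query_plan,explain';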

4) Replication and binlog overhead

Peak load often coincides with replication lag. If your primary is CPU-saturated, it can’t flush binlogs, can’t keep
replicas fed, and can’t commit quickly. Group commit improvements help, but you still pay CPU for row-based logging
and for the transactional machinery around it. “It’s just writing a log” is famous last words.
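
A quick sanity check on a replica, using fields that appear in any MariaDB replica status output (recent versions also accept SHOW REPLICA STATUS as an alias):

cr0x@replica:~$ mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Seconds_Behind_Master|Slave_SQL_Running_State|Relay_Log_Space'

If Seconds_Behind_Master climbs during peak while the primary is CPU-pegged, the lag is a symptom of the primary’s commit path, not a replica problem.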

MariaDB’s signature failure mode under peak load

When MariaDB collapses from CPU under peak, it often looks like “everyone is busy, nobody is happy”:
tons of connections, lots of threads, high system CPU, and a throughput curve that stops scaling as you add cores.
Your fix is usually to reduce concurrency and remove contention hotspots, not to add more cores and pray.

First joke (allowed, and yes it’s relevant): A database without connection pooling is like an open bar without
bartenders—everyone’s technically served, but nothing good happens fast.

How PostgreSQL spends CPU when it panics

PostgreSQL’s architecture is famously conservative and deeply engineered: process-per-connection, shared buffers,
MVCC, a robust planner, and background workers doing housekeeping. Under peak load, it often fails in a more
“logical” way than MariaDB: it will spend CPU doing exactly what it thinks you asked for, which is sometimes the
real problem.

Common PostgreSQL CPU spike patterns

1) Query plan regression: same query, different plan, suddenly expensive

PostgreSQL’s planner is powerful and sometimes too confident. If statistics drift, data distribution changes, or
bind parameters swing widely, a plan can flip from index nested loop to hash join or a sequential scan. Under peak,
that flip can multiply work by orders of magnitude, and CPU follows.

PostgreSQL gives you excellent tools to see the plan and timing. The trap is not looking until the outage, then
discovering you never captured baselines.
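
One way to stop flying blind is auto_explain, which logs the plan of any statement slower than a threshold. A minimal sketch for postgresql.conf, assuming you can afford a restart for shared_preload_libraries and that 500 ms is an acceptable threshold for your workload:

# postgresql.conf (restart required for shared_preload_libraries)
shared_preload_libraries      = 'auto_explain,pg_stat_statements'
auto_explain.log_min_duration = '500ms'   # log plans for anything slower than this
auto_explain.log_analyze      = off       # leave off unless you accept the timing overhead

With that in place, the next plan flip shows up in the logs with the plan attached, instead of as an unexplained CPU graph.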

2) Autovacuum debt: you don’t pay, it still charges interest

MVCC means dead tuples accumulate until vacuum cleans them. If autovacuum can’t keep up—because of IO limits,
bad settings, or simply because you hit a write-heavy peak—then scans touch more pages, indexes bloat, and CPU
climbs because each query wades through garbage. You’ll see higher CPU even for “reads,” because reads now do more
work per row returned.
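
Global autovacuum settings rarely fit the one or two tables doing most of the churn. A hedged sketch of per-table overrides (the table name matches this article’s examples; the values are illustrative, not a recommendation):

-- Vacuum the hot table after ~2% dead tuples instead of the 20% default,
-- and let its vacuum do more work per cost cycle.
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_vacuum_cost_limit   = 2000
);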

3) Concurrency overhead: many connections means many processes

PostgreSQL’s process-per-connection model is stable and debuggable, but each backend is real. Too many connections
means more context switching, bigger memory footprint, and CPU lost to coordination, even before you execute SQL.
Add peak traffic and you can slam into a wall where CPU burns on process scheduling and lock management.
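
The standard answer is a pooler in front of the database. A minimal pgbouncer.ini sketch, with illustrative values and a hypothetical database name:

[databases]
prod = host=127.0.0.1 port=5432 dbname=prod

[pgbouncer]
; release the server connection back to the pool after each transaction
pool_mode = transaction
; real Postgres backends per database/user pair
default_pool_size = 20
; app-side connections pgbouncer will accept and queue
max_client_conn = 2000

Transaction pooling maps thousands of app connections onto a few dozen backends, which is usually the difference between a scheduler that copes and one that melts. It also breaks some session-level features (prepared statements, for example), so test it before peak, not during.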

4) Sorting, hashing, and memory pressure (work_mem landmines)

PostgreSQL can burn CPU doing sorts and hash operations. If work_mem is too low, operations spill to
disk and you get a mix of IO wait and CPU overhead. If work_mem is too high and you have many
concurrent queries, you get memory pressure, then the kernel starts doing its own “optimizations,” and CPU usage
becomes a chaotic art installation.
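
A common compromise is to keep the global work_mem modest and raise it only where the big sorts actually live. A sketch, assuming a hypothetical reporting role:

-- Modest default for everyone (ALTER SYSTEM writes postgresql.auto.conf; reload to apply)
ALTER SYSTEM SET work_mem = '16MB';
SELECT pg_reload_conf();

-- Bigger budget only for the role running heavy reports
ALTER ROLE reporting SET work_mem = '256MB';

-- Or scope it to one transaction around a known-heavy query
BEGIN;
SET LOCAL work_mem = '256MB';
-- ... the one big sort/hash query ...
COMMIT;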

PostgreSQL’s signature failure mode under peak load

When PostgreSQL collapses from CPU under peak, it often looks like: a few query shapes dominate, they run longer
than usual, autovacuum falls behind, bloat increases, and then everything slows further. It’s a feedback loop:
longer queries hold resources longer, which increases contention and vacuum debt, which makes queries longer.
Your fix is usually to fix the worst queries and keep vacuum healthy, not to “tune random knobs.”

Fast diagnosis playbook (first/second/third checks)

First: is the CPU “real work” or “thrash”?

  • OS run queue high (load average far above CPU count, many runnable tasks): likely thrash or too much concurrency.
  • One or a few hot threads/processes: likely specific queries, vacuum, replication, or a hotspot lock.
  • High system CPU: context switching, kernel overhead (networking, fs), spinlocks.
  • High user CPU: query execution, sorting, hashing, expression evaluation.

Second: identify the top query shapes and wait causes

  • MariaDB: look at processlist, InnoDB status, handler/read metrics, and whether you’re thread-scheduling bound.
  • PostgreSQL: look at pg_stat_activity, top SQL in pg_stat_statements, and vacuum/bloat indicators.

Third: decide whether the bottleneck is CPU-only or “CPU as a symptom”

  • If IO is saturated, CPU can spike due to retries, buffer churn, and kernel overhead. Fix IO first.
  • If locks dominate, CPU spikes are just the visible part. Fix lock order, hot rows, and transaction duration.
  • If plan regression happened, nothing else matters until that plan is fixed.

Practical tasks: commands, outputs, and decisions (12+)

These are the tasks I actually run during an incident or right after. Each has: command, sample output, what it
means, and the decision you make. They’re intentionally boring. Boring is how you get paged less.

Task 1: Confirm whether you’re CPU-saturated or just noisy

cr0x@server:~$ uptime
 14:22:01 up 31 days,  3:07,  2 users,  load average: 42.18, 39.77, 28.54

What it means: Load average ~42 on a 16-core box means you have far more runnable or uninterruptible tasks than cores to run them.

Decision: Stop guessing. Immediately inspect run queue and top processes. If connections are huge, start limiting concurrency (pooling, max connections) while you diagnose.

Task 2: See CPU breakdown (user vs system) and run queue

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
32  1      0 812344  92324 6231880   0    0   120   410 9800 42000 78 18  2  2  0
29  0      0 805112  92328 6239024   0    0    40   210 9600 39000 81 16  1  2  0

What it means: r near 30+ indicates runnable-queue pressure; high cs means heavy context switching; high sy hints at kernel overhead.

Decision: If cs is huge and DB has thousands of connections, prioritize connection pooling and caps. If wa is high, check storage/IO.

Task 3: Identify the top CPU consumers and whether it’s many or few

cr0x@server:~$ ps -eo pid,comm,%cpu,%mem,stat --sort=-%cpu | head
    PID COMMAND         %CPU %MEM STAT
   2314 mariadbd       690.3 18.2 Sl
   4481 postgres       189.4  1.7 R
   4520 postgres       176.0  1.6 R
    911 node            42.1  0.8 R

What it means: One giant mariadbd process burning many cores suggests internal contention or heavy query load in a single process; many postgres backends burning CPU suggests a few expensive queries running concurrently.

Decision: If you see one DB process dominating: focus on engine contention and thread scheduling (MariaDB) or a specific background worker (Postgres). If many backend processes dominate: find top SQL shapes.

Task 4: Measure context switching and migrations (thrash indicator)

cr0x@server:~$ pidstat -w -p $(pgrep -o mariadbd) 1 3
Linux 6.1.0 (server)  12/30/2025  _x86_64_  (16 CPU)

01:22:10      PID   cswch/s nvcswch/s  Command
01:22:11     2314    1200.0   38000.0  mariadbd
01:22:12     2314    1105.0   40120.0  mariadbd

What it means: Massive involuntary context switches (nvcswch/s) usually mean the scheduler is juggling runnable threads/processes. That burns CPU without completing proportional work.

Decision: Reduce concurrency: enforce pooling, lower max connections, enable/configure thread pool (MariaDB), tune pool size (Postgres via pgbouncer), and find lock contention.

Task 5 (MariaDB): Who is running what right now?

cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST\G" | head -n 30
*************************** 1. row ***************************
     Id: 81423
   User: app
   Host: 10.2.4.18:51214
     db: prod
Command: Query
   Time: 12
  State: Sending data
   Info: SELECT ... FROM orders WHERE customer_id=? ORDER BY created_at DESC LIMIT 50
*************************** 2. row ***************************
     Id: 81451
   User: app
   Host: 10.2.4.19:52108
     db: prod
Command: Query
   Time: 12
  State: Sorting result
   Info: SELECT ... FROM orders WHERE status='open' ORDER BY priority DESC

What it means: States like Sorting result and long Time under peak usually indicate missing or ineffective indexes; Sending data can mean large result sets or slow client reads.

Decision: Capture the top 2–3 queries. Run EXPLAIN, validate indexes, and consider short-term query limits/timeouts to stop the bleeding.

Task 6 (MariaDB): Check InnoDB contention and purge pressure

cr0x@server:~$ mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,120p'
=====================================
2025-12-30 14:24:51 0x7f1f6c0a9700 INNODB MONITOR OUTPUT
=====================================
Mutex spin waits 1203328, rounds 3821140, OS waits 44211
RW-shared spins 922110, rounds 1102100, OS waits 21011
History list length 884321
...

What it means: High spin waits and OS waits can indicate contention; a very large history list length suggests undo purge is falling behind, usually because a long-running transaction is holding an old read view, and that backlog inflates the work reads and index maintenance must do.

Decision: Find long-running transactions, reduce transaction time, and evaluate whether write workload or hotspot rows are causing contention. If history list explodes, hunt the session holding an old snapshot.
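
A hedged sketch for finding the oldest open transactions, which are the usual reason the history list keeps growing; read trx_started and trx_mysql_thread_id and follow the thread back to the application:

cr0x@server:~$ mysql -e "SELECT trx_id, trx_started, trx_mysql_thread_id, TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_s FROM information_schema.INNODB_TRX ORDER BY trx_started LIMIT 5;"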

Task 7 (MariaDB): Verify thread/connection pressure quickly

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Threads%'; SHOW VARIABLES LIKE 'max_connections';"
+-------------------+--------+
| Variable_name     | Value  |
+-------------------+--------+
| Threads_connected | 1850   |
| Threads_running   | 280    |
| Threads_created   | 992134 |
+-------------------+--------+
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 4000  |
+-----------------+-------+

What it means: Thousands of connected threads and a high Threads_created count often correlate with CPU spent in connection handling and scheduling.

Decision: Cap connections to what the host can actually schedule. Implement pooling. If you must accept many clients, use a thread pool and keep Threads_running near core count, not 10x it.

Task 8 (PostgreSQL): See active queries and their wait events

cr0x@server:~$ psql -XAt -c "SELECT pid, state, wait_event_type, wait_event, now()-query_start AS age, left(query,120) FROM pg_stat_activity WHERE state<>'idle' ORDER BY age DESC LIMIT 10;"
4481|active|||00:00:17.182913|SELECT ... FROM orders WHERE customer_id=$1 ORDER BY created_at DESC LIMIT 50
4520|active|LWLock|buffer_mapping|00:00:12.504991|SELECT ... FROM events WHERE tenant_id=$1 AND ts>$2 ORDER BY ts DESC LIMIT 200
4602|active|Lock|transactionid|00:00:09.991221|UPDATE accounts SET balance=balance-$1 WHERE id=$2

What it means: No wait event and active means pure CPU work. LWLock waits can indicate internal contention (buffer mapping, WAL, etc.). Lock waits mean you’re stuck behind concurrency, not compute.

Decision: If most are CPU-active: find top SQL and plans. If lock waits dominate: reduce transaction time and fix lock order/hot rows. If LWLock hotspots: look for excessive shared buffer churn, high concurrency, or a few patterns hammering shared structures.
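
If lock waits dominate, a sketch to see who is blocking whom (PostgreSQL 9.6+; the output shows the victim pid and the array of blocker pids):

cr0x@server:~$ psql -XAt -c "SELECT pid, pg_blocking_pids(pid) AS blocked_by, now()-xact_start AS xact_age, left(query,80) FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0 ORDER BY xact_age DESC;"

Kill or fix the blocker at the root of the chain, not the victims.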

Task 9 (PostgreSQL): Identify top CPU-expensive queries via pg_stat_statements

cr0x@server:~$ psql -X -c "SELECT calls, round(total_exec_time::numeric,1) AS total_ms, round(mean_exec_time::numeric,2) AS mean_ms, rows, left(query,120) FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 8;"
 calls | total_ms | mean_ms |  rows  | left
-------+----------+---------+--------+------------------------------------------------------------
 93210 | 982344.6 |   10.54 | 186420 | SELECT ... FROM orders WHERE customer_id=$1 ORDER BY created_at DESC LIMIT 50
 11234 | 621990.2 |   55.36 |  11234 | SELECT ... FROM events WHERE tenant_id=$1 AND ts>$2 ORDER BY ts DESC LIMIT 200
  2011 | 318221.0 |  158.25 |   2011 | SELECT ... FROM line_items JOIN products ON ...

What it means: Total time highlights what consumes the box, not what’s “slowest once.” Under peak, the top-by-total are your CPU budget killers.

Decision: Take the top two queries, run EXPLAIN (ANALYZE, BUFFERS) off-peak or on a replica, and fix plans/indexes first.

Task 10 (PostgreSQL): Check autovacuum and bloat pressure signals

cr0x@server:~$ psql -X -c "SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;"
  relname  | n_live_tup | n_dead_tup |      last_autovacuum       |      last_autoanalyze
-----------+------------+------------+----------------------------+----------------------------
 orders    |   81203421 |   22190311 | 2025-12-29 03:12:44+00     | 2025-12-29 03:22:01+00
 events    |  190233442 |   80311220 |                            | 2025-12-28 22:10:08+00

What it means: A table with a huge dead-tuple count and no recent autovacuum is a CPU trap: scans and index traversals do more work per useful row.

Decision: Fix autovacuum throughput (cost limits, workers, IO), reduce long transactions, and consider targeted vacuuming during controlled windows. If it’s extreme, plan a maintenance action (VACUUM FULL or reindex) carefully.
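
A hedged sketch of the controlled-window actions (the index name is hypothetical; test on a replica or staging copy first):

-- Reclaims dead tuples and refreshes stats; does not shrink the file on disk
VACUUM (VERBOSE, ANALYZE) events;

-- PostgreSQL 12+: rebuild a bloated index without holding an exclusive lock for the whole run
REINDEX INDEX CONCURRENTLY events_tenant_ts_idx;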

Task 11: Confirm whether IO is the real bottleneck (CPU as symptom)

cr0x@server:~$ iostat -x 1 3
Device            r/s     w/s   rKB/s   wKB/s  await  svctm  %util
nvme0n1         820.0  610.0  98240.0 110820.0  7.10  0.32  92.40

What it means: High %util and rising await indicate IO saturation. CPU spikes can accompany this due to buffer churn and increased query work from cache misses.

Decision: If storage is saturated, stop “CPU tuning.” Reduce IO: add indexes to avoid scans, increase cache hit rate, fix bloat, or scale storage/replicas.

Task 12: Validate memory pressure and swapping (silent CPU killer)

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi        96Gi       1.2Gi       2.1Gi        27Gi        18Gi
Swap:           16Gi       3.8Gi        12Gi

What it means: Swap usage during peak can turn CPU graphs into lies: the CPU is “busy” but it’s busy waiting on memory pages to come back from swap.

Decision: Reduce memory footprint (fewer connections, right-size buffers), stop runaway work_mem (Postgres), and fix the root query patterns causing huge sorts/hashes.

Task 13: Capture kernel-level hotspots with perf (when you need proof)

cr0x@server:~$ sudo perf top -p $(pgrep -o mariadbd)
Samples:  915 of event 'cycles', 4000 Hz, Event count (approx.): 145392102
  18.40%  mariadbd           [.] btr_cur_search_to_nth_level
  12.15%  mariadbd           [.] row_search_mvcc
   7.92%  libpthread-2.31.so [.] pthread_mutex_lock
   6.71%  mariadbd           [.] lock_rec_lock

What it means: Seeing mutex locks high in the stack suggests contention; seeing B-tree search dominating suggests read-heavy index traversals (possibly from cache misses or inefficient access patterns).

Decision: If mutex dominates: reduce concurrency and hotspots. If B-tree search dominates: improve indexes, reduce random lookups, and increase cache hit rate (buffer pool/shared buffers and working set).

Task 14 (PostgreSQL): Prove a plan problem with EXPLAIN (ANALYZE, BUFFERS)

cr0x@server:~$ psql -X -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id=123 ORDER BY created_at DESC LIMIT 50;"
                                                         QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..42.89 rows=50 width=312) (actual time=0.212..1.903 rows=50 loops=1)
   Buffers: shared hit=108 read=24
   ->  Index Scan using orders_customer_created_idx on orders  (cost=0.56..71234.12 rows=84123 width=312) (actual time=0.205..1.894 rows=50 loops=1)
         Index Cond: (customer_id = 123)
 Planning Time: 0.295 ms
 Execution Time: 1.982 ms

What it means: Reads vs hits tells you cache behavior. If you see a sequential scan with huge reads where you expected an index scan, that’s your CPU spike in a trench coat.

Decision: Fix indexes, stats (ANALYZE), and query shape. If this query is on the critical path, pin it down: stable indexes, avoid parameter sensitivity if needed, and add guardrails.

Three corporate mini-stories from the trenches

Incident 1: The outage caused by a wrong assumption

A mid-size SaaS company migrated a billing subsystem from MariaDB to PostgreSQL. The migration went smoothly in staging.
Production went smoothly too—until the first end-of-quarter invoice run. CPU pegged, API timeouts, support tickets.
The postmortem had a familiar smell: “We assumed the database would scale like the old one.”

The assumption was that more connections meant more throughput. On MariaDB they had learned, painfully, to cap connections and
use pooling. During the migration, a new service landed with “just open a connection per request, it’s fine,” because the
team didn’t want to deal with pooling under a deadline. Postgres backends multiplied. Context switching went wild. CPU burned
and query latency ballooned.

The most annoying part: the queries weren’t slow in isolation. They were slow together. In peak, the scheduling overhead and
lock queues amplified everything. Autovacuum also fell behind during the invoice write storm, adding bloat and making reads worse,
right when customer portals were busiest.

The fix wasn’t exotic. They introduced a connection pooler, set a hard cap on active DB sessions per service, and made request
concurrency back-pressure a first-class feature instead of a shameful secret. CPU dropped enough that the real query bottlenecks
finally showed themselves, and those were solvable.

Incident 2: The “optimization” that backfired

An e-commerce platform on MariaDB had a write-heavy cart service. Someone “optimized” by adding three composite secondary
indexes to support new reporting queries. The indexes made the reporting endpoints faster in testing. It was celebrated in
a Slack thread and then quietly merged.

Peak day arrived. CPU spikes hit right after the morning traffic wave. Writes slowed first, then checkouts timed out. The DB
wasn’t IO-bound. It was CPU-bound in the storage engine, doing index maintenance and fighting contention in the same hot leaf
pages that every cart update touched. The platform had accidentally turned a write-optimized table into a multi-index write tax.

The rollback plan was missing. They couldn’t drop indexes instantly without risking prolonged locks and more pain. The emergency
mitigation was to divert reporting to a replica and cut reporting features temporarily. Then, during a controlled window, they
removed the worst index and replaced it with a different shape that supported the needed query but reduced write amplification.

The lesson wasn’t “don’t index.” It was “indexing is a write-path feature.” If you can’t explain the write cost of an index under
peak concurrency, you’re not done.

Incident 3: The boring but correct practice that saved the day

A fintech shop running PostgreSQL had a rule: every schema change must include (a) the expected query plan change,
(b) a rollback path, and (c) a canary verification query. Nobody loved the rule. It was paperwork wearing a pager.

One Friday, a seemingly harmless change rolled out: a new predicate on a large table. The canary query ran automatically after deploy
and compared plan shape to baseline. It flagged a surprise sequential scan. Not in production traffic yet—just in the canary.
The rollout paused itself.

The developer’s laptop showed the index being used. Production stats were stale; the distribution differed; the planner made a different
choice. The team ran an ANALYZE on the affected table in a controlled manner, adjusted a single index to match the new predicate,
and re-ran the canary. The plan went back to predictable.

Nothing dramatic happened. That’s the point. The incident that didn’t happen is the best kind—quiet, slightly tedious, and
financially invisible.

Interesting facts and historical context (8 points)

  1. MySQL split history matters: MariaDB was created by the original MySQL founders after Oracle acquired Sun, partly to keep a community-led fork viable.
  2. InnoDB wasn’t always “default everywhere”: Early MySQL deployments used MyISAM heavily; InnoDB became dominant as transactional workloads and reliability expectations rose.
  3. PostgreSQL’s lineage is academic: It descends from the POSTGRES project at UC Berkeley, with a long tradition of correctness-first engineering.
  4. MVCC is both gift and bill: PostgreSQL’s MVCC design avoids many read locks but requires continuous vacuuming; skip the housekeeping and CPU costs show up later.
  5. Thread vs process models shape peak behavior: MariaDB’s threads make connection overhead lighter per connection than full processes, but still vulnerable to scheduling/lock contention at high concurrency.
  6. Postgres got much better at parallelism over time: Modern PostgreSQL can use parallel query execution in more cases, which can improve throughput and also burn more cores faster if a bad plan goes parallel.
  7. MariaDB and MySQL diverged on features: Depending on versions, MariaDB may offer different optimizer behaviors, thread pool options, and instrumentation than upstream MySQL, which changes how CPU spikes manifest.
  8. Both ecosystems learned the hard way about “too many connections”: The operational best practice of pooling and back-pressure is shared folklore now, because physics is persuasive.

Second joke (and the last one, per the rules): Tuning a database during an outage is like changing airplane tires mid-flight.
It’s possible, but everyone will remember your facial expression.

Common mistakes: symptom → root cause → fix

1) Symptom: CPU 90–100%, QPS flat, latency rising

Root cause: Concurrency thrash (too many connections/sessions), scheduler overhead, internal contention.

Fix: Cap connections; enforce pooling; reduce app concurrency; add back-pressure. On MariaDB, consider thread pool; on Postgres, use a pooler and set sane max_connections.

2) Symptom: CPU spikes align with a single endpoint or job

Root cause: One query shape turned expensive (plan regression, missing index, bad join order).

Fix: Capture the query, run EXPLAIN/ANALYZE, add/adjust index, fix predicate sargability, refresh stats. Put a guardrail: statement timeout, query limit, or feature flag.
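
A hedged sketch of those guardrails (the role name and values are illustrative; MariaDB’s max_statement_time is expressed in seconds):

-- PostgreSQL: per-role limits so a regressed query cannot camp on a core
ALTER ROLE app SET statement_timeout = '5s';
ALTER ROLE app SET lock_timeout = '2s';

-- MariaDB: cap statement runtime globally (or per session) while you fix the plan
SET GLOBAL max_statement_time = 5;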

3) Symptom: CPU high, lots of lock waits, throughput collapses

Root cause: Long transactions; hot rows; lock order inversions; “update the same row” patterns.

Fix: Shorten transactions; fix contention hotspots (shard hot counters, avoid single-row “latest” patterns); ensure consistent lock ordering; batch writes sanely.

4) Symptom (Postgres): CPU gradually worsens over days/weeks, not minutes

Root cause: Autovacuum debt and bloat. Queries do more work per result as dead tuples accumulate.

Fix: Tune autovacuum for the heavy tables; avoid long-running transactions; monitor dead tuples; schedule maintenance; consider fillfactor or partitioning for churn-heavy tables.

5) Symptom (MariaDB): CPU high during writes, replication lag grows

Root cause: Write amplification from indexes/binlog; commit path overhead; fsync pressure; hot index pages.

Fix: Remove unnecessary secondary indexes; optimize transaction batching; validate durable settings; ensure fast storage; tune for group commit; move read/reporting off primary.

6) Symptom: CPU high, but clients see “Sending data” / slow reads

Root cause: Large result sets; inefficient pagination; client-side slowness causing server to hold resources longer.

Fix: Limit result size; keyset pagination; only select needed columns; fix N+1 patterns; add timeouts and max rows.

7) Symptom: CPU spikes after deploying a “minor” change

Root cause: Plan change due to predicate change or stats drift; new index changing planner decisions; different parameter distribution.

Fix: Canary query plans; compare baselines; analyze affected tables; adjust indexes; if needed, rewrite query for plan stability.

Checklists / step-by-step plan for sustained peak load

A. During the incident (stabilize first)

  1. Stop the stampede: enable back-pressure in the app; shed non-critical traffic; rate-limit heavy endpoints.
  2. Cap concurrency at the DB boundary: pool connections; temporarily lower max active sessions; prioritize critical services.
  3. Find the top two query shapes: processlist/pg_stat_activity + top SQL stats. Don’t chase the 20th query.
  4. Check for lock pileups: if lock waits dominate, killing random queries is not a strategy—fix the blocker.
  5. Validate IO: if storage is pegged, CPU fixes won’t hold; reduce IO pressure first.

B. Within 48 hours (make it not happen again next week)

  1. Baseline plans: capture EXPLAIN plans for critical queries and store them with the service.
  2. Fix vacuum/purge health: Postgres autovacuum tuning; MariaDB long transaction control and purge visibility.
  3. Right-size indexes: keep what helps the read path, remove what taxes the write path under peak.
  4. Set sane timeouts: statement timeouts, lock timeouts, and app deadlines. A stuck query is a contagion.
  5. Test peak concurrency: load tests must include production-like connection counts and realistic data distribution.

C. Architecture choices that affect “who burns cores faster”

  • If you can pool aggressively: PostgreSQL becomes calmer under peak because you limit backends and protect the scheduler.
  • If you can’t control clients: MariaDB with thread pool can be forgiving, but you still need caps; uncontrolled concurrency will eventually win.
  • If your workload is write-churn heavy: Postgres needs vacuum discipline; MariaDB needs index discipline and contention-aware schema design.
  • If you have a few complex analytics queries on OLTP: both will burn CPU; isolate workloads (replicas, separate systems) instead of praying.

FAQ

1) So, who burns cores faster: MariaDB or PostgreSQL?

Under uncontrolled concurrency, both can incinerate CPU. MariaDB often hits contention and thread scheduling limits hard when
you throw thousands of active connections at it. PostgreSQL often burns CPU faster when a bad plan or vacuum debt multiplies work
per query, and parallel plans can amplify that. The “winner” is whichever one you operate more carelessly.

2) Why does CPU spike when QPS hasn’t increased much?

Because work per request increased. Typical causes: plan regression, cache miss rate increase, bloat/dead tuples, or lock contention
extending query duration. Same QPS, more CPU-seconds per query, higher utilization.

3) Is high CPU always bad?

No. High CPU with stable latency and predictable throughput can be fine; you’re using what you paid for. High CPU with rising
latency and falling throughput is the bad kind: coordination overhead or amplified work.

4) Can adding cores solve peak CPU spikes?

Sometimes, but it’s the least reliable fix. If you’re CPU-bound on pure query execution and scale linearly, more cores buy headroom.
If you’re bound by contention, locks, or context switching, more cores mostly buy you a bigger bill and the same incident.

5) What’s the single best practice to prevent CPU collapse?

Connection pooling with strict caps and back-pressure. It forces the system to behave under peak. Without it, you’re running a
concurrency lottery.

6) For PostgreSQL, how do I know it’s autovacuum debt?

Look for rising n_dead_tup, slow scans, increasing buffer reads, and vacuum falling behind (or blocked by long transactions).
CPU will creep up as queries touch more pages and do more visibility checks.

7) For MariaDB, what’s the fastest indicator of connection/thread pain?

High Threads_connected, high context switching at the OS level, and Threads_running far above core count. Pair it with
InnoDB status showing contention or long history list length for extra confirmation.

8) Do replicas help CPU spikes?

Read replicas help when reads dominate and you can route them cleanly. They don’t fix write-path CPU saturation on the primary.
Also, if your primary is CPU-starved, replication can lag and replicas become stale right when you need them.

9) Should I tune kernel parameters first?

Only after you’ve proven the database bottleneck. Kernel tuning can help (scheduler, networking, IO queues), but it’s not a substitute
for fixing concurrency, query plans, and vacuum/purge health.

10) What if CPU is high but perf shows mostly mutex/lock functions?

That’s contention. You’re paying for coordination, not computation. The fix is to reduce concurrency and remove hotspots:
shorten transactions, shard hot counters, redesign “single-row” patterns, and cap active sessions.

Conclusion: practical next steps

Under peak load, MariaDB and PostgreSQL don’t “randomly” spike CPU. They follow patterns. MariaDB tends to punish uncontrolled
concurrency and write-path amplification; PostgreSQL tends to punish vacuum neglect and plan surprises. If you want fewer spikes,
stop treating CPU as the problem and treat it as the receipt.

Next steps that pay off immediately:

  1. Put hard limits on DB concurrency (pooling + caps) and make the app respect them.
  2. Instrument top SQL by total time and alert on sudden shifts in plan shape or latency distribution.
  3. Keep housekeeping healthy: autovacuum discipline in Postgres; long transaction control and purge visibility in MariaDB.
  4. Make indexing a production decision: every new index must justify its write cost under peak.
  5. Practice the playbook on a staging load test. The first time you run these commands should not be during an outage.