MySQL vs PostgreSQL for High Concurrency: Who Hits the Wall First and Why

High concurrency is where database marketing goes to die. Your workload looks fine at 200 requests/sec, then you add “just one more” feature, traffic doubles, and suddenly the database is a small, angry sun. Latency balloons, CPU pins, and every team discovers they’ve been “temporarily” using the database as a queue, cache, search engine, and truth oracle.

This isn’t a religious war. It’s a wall-finding expedition. MySQL and PostgreSQL both scale to serious concurrency. They just hit different walls first, for different reasons, and the shortest path to stability depends more on your workload and operating habits than on anyone’s benchmark chart.

What “high concurrency” actually means in production

People say “high concurrency” when they mean one of three things:

  • Many simultaneous clients (hundreds to tens of thousands of active connections).
  • Many simultaneous transactions (a smaller number of connections doing lots of work, overlapping heavily).
  • Many simultaneous conflicts (everyone wants the same few rows, the same index pages, the same counters, the same partition, the same cache line).

The third one is where systems go to cry. You can buy more cores for “many transactions.” You can add a pooler for “many clients.” But “many conflicts” is a design problem pretending to be a capacity problem.

Also: concurrency isn’t throughput. You can have low throughput and still be concurrency-bound if you’re stuck in lock waits, context switching, fsync storms, or a queue of threads fighting over the same shared data structure.

The opinionated positioning: where each engine tends to break first

If you force me to generalize (and you are), here’s how it usually plays out in real production systems:

PostgreSQL: you hit the connection/process wall early, then MVCC cleanup if you’re sloppy

  • Thousands of direct connections can hurt fast because each backend is a process with memory overhead and scheduling cost. Postgres can run huge connection counts, but doing so without pooling is how you spend your CPU on overhead instead of queries.
  • Long transactions cause MVCC dead weight: old row versions accumulate, indexes bloat, autovacuum falls behind, and eventually “simple” queries slow down because they’re scanning a trash heap.
  • Heavily updated rows can create bloat and page-level contention. “It’s MVCC so reads don’t block writes” is true until the system is busy cleaning up after you.

MySQL (InnoDB): you hit lock contention and I/O/write amplification walls, especially on hot rows

  • Hot rows and hot indexes show up as lock waits, deadlocks, and throughput collapse. InnoDB is strong, but it’s not magic: “update a counter row” at high concurrency is basically a denial-of-service attack you wrote yourself.
  • Dirty page flushing and redo/log pressure can dominate under heavy writes. Mis-tuned flushing looks like random latency spikes and “why is the disk busy when CPU is idle?”
  • Replication under write load (depending on topology) becomes the next wall: lag grows, failover gets scary, and you discover your app was reading “mostly consistent” data and calling it a feature.

Who hits the wall first? The engine that you operate like it’s the other engine. If you run Postgres without pooling and with long transactions, it’ll tap out early. If you run MySQL with hot-row updates, giant secondary indexes, and optimistic “the buffer pool will save us,” it’ll tap out early.

Exactly two things are always true: the database is blamed first, and the database is guilty more often than anyone wants to admit.

Joke #1: A database under high concurrency is like a restaurant with one waiter and infinite menus—technically everyone can order, practically nobody eats.

Facts and historical context that still matter

Some “old” history is still operationally relevant because it shaped defaults, mental models, and tooling.

  1. PostgreSQL’s lineage goes back to POSTGRES at UC Berkeley (1980s). The “correctness first” culture shows in features like MVCC, robust SQL, and a serious planner.
  2. MySQL’s early dominance (late 1990s–2000s) came from being fast and easy for web apps, often with MyISAM. Many concurrency horror stories are from pre-InnoDB or poorly understood engine switches.
  3. InnoDB became the default in MySQL 5.5 (2010). If someone still talks about “table locks in MySQL” as a universal truth, they’re either time traveling or running something cursed.
  4. Postgres autovacuum arrived to make MVCC survivable at scale. It’s not optional housekeeping; it’s the garbage collector for your data visibility model.
  5. MySQL replication history includes statement-based replication pitfalls, then row-based replication, then mixed. Write-heavy workloads with non-determinism taught a lot of teams humility.
  6. Postgres wait event instrumentation (notably expanded over the years) changed how you debug contention: you can often name the bottleneck precisely instead of guessing.
  7. MySQL Performance Schema matured into a real observability tool. If you’re not using it for contention diagnosis, you’re debugging with vibes.
  8. Both communities learned hard lessons about “more threads” being a trap: scheduler pressure, mutex contention, and I/O saturation don’t negotiate.
  9. Connection poolers became standard practice for Postgres in high-concurrency web environments, shaping how modern stacks are deployed (pgBouncer is practically furniture).

The walls: common bottlenecks under extreme concurrency

Wall #1: connection management and context switching

PostgreSQL uses a process-per-connection model. That’s not “bad”; it’s predictable and isolated. But if you have thousands of active sessions doing tiny queries, the OS scheduler becomes part of your query plan. Memory overhead grows, shared buffers aren’t the whole story, and you start paying for bookkeeping.

MySQL typically runs a thread-per-connection model (implementation details vary by version and configuration). It can handle large connection counts too, but you still pay in scheduling overhead, per-thread memory, and mutex contention on internal structures when concurrency gets pathological.

The real production advice: don’t ask either engine to babysit 20,000 mostly-idle connections. Put a pooler in front of it, keep active concurrency bounded, and make the app queue politely.
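
If pooling is the fix, the important knobs are few. Below is a minimal pgBouncer sketch, assuming transaction pooling fits your app; the database name, auth file, and pool sizes are placeholders to size from your own measurements, not recommendations.

cr0x@server:~$ cat /etc/pgbouncer/pgbouncer.ini
[databases]
; placeholder entry; point it at your actual primary
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling: a server connection is borrowed per transaction, not per session
pool_mode = transaction
; client sockets pgBouncer will accept from the app tier
max_client_conn = 5000
; real Postgres connections per database/user pair
default_pool_size = 40

Transaction pooling is what makes the math work (thousands of client sockets multiplexed over a few dozen server connections), but it sacrifices session state: session-scoped temporary tables, some SET commands, and older prepared-statement patterns need checking before you flip it on.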

Wall #2: lock contention and hot rows

The fastest way to lose is to create a single-row hotspot: counters, “last_seen” updates on the same user row, “global sequence table,” “available inventory count,” or a queue implemented as “update one row where status=ready limit 1.” Congratulations, you built a lock contention generator.

InnoDB uses row-level locks, but the exact behavior depends on isolation level and access path. Under REPEATABLE READ (common default), next-key locking can lock ranges for some patterns, which surprises people who thought “row locks only.” Hot secondary indexes can also serialize inserts/updates.

PostgreSQL uses MVCC so reads don’t block writes, but writes still block writes. A hot row updated constantly becomes a serialization point. Also, index page contention and heavy UPDATE churn can create bloat and vacuum pressure.

The core point: concurrency collapses when you force serialized access. Databases can arbitrate, but they can’t make a single resource parallel.
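
The standard escape hatch is to split the hotspot so concurrent writers land on different rows. Here’s a minimal SQL sketch with hypothetical table and counter names (it runs on both engines); the shard count is something you’d pick based on measured contention.

-- N shard rows per counter instead of one global row.
CREATE TABLE counter_shards (
    counter_name varchar(64) NOT NULL,
    shard_id     int         NOT NULL,
    v            bigint      NOT NULL DEFAULT 0,
    PRIMARY KEY (counter_name, shard_id)
);

-- Each writer picks a random shard_id in the application (say 0..15),
-- so lock waits drop roughly by the shard count.
UPDATE counter_shards SET v = v + 1 WHERE counter_name = 'page_views' AND shard_id = 7;

-- Reads aggregate a handful of rows, which stays cheap.
SELECT sum(v) FROM counter_shards WHERE counter_name = 'page_views';

The same idea generalizes: an append-only events table with periodic aggregation is just sharding by time instead of by slot.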

Wall #3: MVCC cleanup and transaction hygiene

Both engines implement MVCC-ish behavior, but they pay the cleanup tax differently.

PostgreSQL’s MVCC keeps old row versions in the table until vacuum reclaims space. That means long-running transactions prevent cleanup because old snapshots must remain valid. Under high write rates, this becomes a slow-motion disaster: bloat grows, cache efficiency drops, index scans get heavier, and autovacuum starts fighting an uphill battle.

InnoDB stores undo information to provide consistent reads. Long transactions mean undo logs grow and purge can lag, which can impact performance and, in extreme cases, stability. Different shape, same sin: long transactions under write load are expensive.

High concurrency amplifies transaction hygiene problems. You can ignore a few sloppy transactions at low load. At high load they become a choke point and a space heater.
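
The guardrails here are mostly timeouts, and both engines ship them. A hedged sketch below; the values are illustrative, and a global statement_timeout in particular needs care because it also applies to migrations and maintenance unless you override it per role or session.

cr0x@server:~$ psql -X -c "ALTER SYSTEM SET idle_in_transaction_session_timeout = '60s';"
cr0x@server:~$ psql -X -c "ALTER SYSTEM SET statement_timeout = '15s';"
cr0x@server:~$ psql -X -c "SELECT pg_reload_conf();"
cr0x@server:~$ mysql -e "SET GLOBAL innodb_lock_wait_timeout = 10; SET GLOBAL wait_timeout = 300;"

On the MySQL side, innodb_lock_wait_timeout bounds how long a transaction waits on a row lock before erroring, and wait_timeout disconnects idle sessions; neither replaces fixing the client that holds transactions open in the first place.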

Wall #4: I/O amplification and checkpoint pressure

When a database “randomly” stalls under load, it’s often not random. It’s the storage subsystem being asked to do synchronous work right now because you didn’t schedule it earlier.

PostgreSQL checkpoints write dirty buffers to disk. Poorly tuned checkpoint settings can lead to bursty write I/O: quiet periods punctuated by “everything flushes now” and latency spikes. Add heavy WAL generation and you get a system that looks fine until it isn’t.

MySQL/InnoDB manages dirty pages, redo logging, and background flushing. Misconfigured flushing and small redo/log files can create constant pressure, where foreground writes stall waiting for space or durability.

Under high concurrency, the storage layer becomes the arbitration arena. If the underlying disk can’t keep up with fsync patterns and write amplification, both databases will “hit the wall,” just with different instrumentation vocabulary.
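
The knobs live in different files but answer the same question: how much dirty data do you let accumulate, and how smoothly do you pay it down? A sketch of the settings usually involved; the values are placeholders to test against your storage and recovery-time targets, not recommendations.

# postgresql.conf: give WAL room and spread checkpoint writes out
max_wal_size = 16GB
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9

# my.cnf (MySQL 8.0.30+; older versions size innodb_log_file_size instead)
innodb_redo_log_capacity = 8G
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000

Bigger logs trade longer crash recovery for fewer forced flushes; that’s a business decision as much as a tuning one.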

Wall #5: indexes and write amplification

Indexes are concurrency multipliers in both directions. They make reads cheaper and writes more expensive. Under high concurrency, expensive writes become contention factories: more pages touched, more latches, more cache churn, more WAL/redo.

InnoDB secondary indexes embed the primary key as the row pointer, so a wide primary key makes every secondary index heavier. Postgres indexes store the key plus a physical row pointer; any update that can’t use HOT (heap-only tuples) adds new entries to every index on the table, not just the ones whose columns changed. Different mechanics, same outcome: if you index everything “just in case,” your write throughput will eventually ask for a divorce.

If you need high concurrency writes, you should be aggressively skeptical of every index. Make the application earn it.
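
One concrete way to make indexes earn it: check whether they are ever read. Two hedged checks below, one per engine; “unused” means “no scans since statistics were last reset,” so confirm the stats window first, and remember that unique and constraint-backing indexes are doing a job even at zero scans.

cr0x@server:~$ psql -X -c "select relname, indexrelname, idx_scan, pg_size_pretty(pg_relation_size(indexrelid)) as size from pg_stat_user_indexes order by idx_scan asc, pg_relation_size(indexrelid) desc limit 10;"
cr0x@server:~$ mysql -e "select * from sys.schema_unused_indexes limit 10;"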

Wall #6: replication and lag under write load

Replication is where concurrency problems become organizational problems. You can tolerate some lag until the first incident where someone reads stale data and makes a business decision with it.

PostgreSQL streaming replication is physical and WAL-based. It’s solid, but each replica replays WAL in a single recovery process, so under heavy write churn lag grows when replay can’t keep up. Hot standby conflicts can cancel queries on replicas if they block recovery, which shows up as “my read replica randomly kills queries.”

MySQL replication depends on the setup: classic async replication, semi-sync, group replication, etc. Lag is common under heavy writes, and multi-threaded replica apply helps but isn’t a free lunch. The more parallel the workload, the more careful you need to be with schema design and transaction patterns to allow apply parallelism.

Either way: if your scaling plan is “just add read replicas,” you should first check whether your workload is actually read-bound or just blocked on writes and locks.
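
Task 16 below shows the Postgres lag check; for MySQL, the quick check runs on the replica itself. The statement and field names here are for MySQL 8.0.22+ (older versions use SHOW SLAVE STATUS and Seconds_Behind_Master), the output is illustrative, and the lag number is a rough signal rather than a precise measurement.

cr0x@server:~$ mysql -e "SHOW REPLICA STATUS\G" | grep -E 'Replica_IO_Running|Replica_SQL_Running|Seconds_Behind_Source'
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
        Seconds_Behind_Source: 47

Zero means “caught up as far as the replica knows,” not “consistent with the primary”; if the IO thread itself is behind, Seconds_Behind_Source can lie politely.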

Fast diagnosis playbook

The goal is to identify the limiting resource in under 15 minutes. Not to perfect the system. Not to win an argument. To stop the bleeding.

First: confirm whether you’re CPU-bound, I/O-bound, lock-bound, or connection-bound

  • CPU-bound: high CPU usage, runnable queue grows, query times increase uniformly.
  • I/O-bound: disk util high, fsync or write latency spikes, checkpoints/flushing correlate with stalls.
  • Lock-bound: many sessions “active” but not consuming CPU; they’re waiting.
  • Connection-bound: high connection count, high context switching, memory pressure, pool exhaustion.

Second: find the top wait reason, not the top query

In high concurrency, the “slow query” list often lies. The slow thing is the queue. Identify what everyone is waiting on: a lock, a buffer pin, WAL flush, redo log space, a specific index page, a filesystem sync.

Third: identify the hotspot (table, index, row, or code path)

Once you know the wait class, you find the hotspot. The fix is usually one of:

  • reduce contention (shard the hotspot, change algorithm, avoid hot-row updates)
  • reduce work (remove indexes, batch writes, cache reads, avoid pointless transactions)
  • increase capacity (faster storage, more memory, more CPU) — only after you know what you’re feeding
  • add isolation (pooler, queue, backpressure)

Fourth: make one change that reduces concurrency pressure

The fastest stabilization changes are often boring: lower max active transactions, enable pooling, reduce app timeouts, break up huge batch jobs, and stop long transactions.
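
When “stop long transactions” means right now, both engines let you kill offenders by id. The ids below are placeholders; on Postgres, cancelling the query is politer than terminating the session, so try it first.

cr0x@server:~$ psql -X -c "SELECT pg_cancel_backend(20311);"     # cancel the current query
cr0x@server:~$ psql -X -c "SELECT pg_terminate_backend(20311);"  # drop the whole session
cr0x@server:~$ mysql -e "KILL QUERY 18421;"                      # cancel the statement, keep the connection
cr0x@server:~$ mysql -e "KILL 18421;"                            # close the connection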

Joke #2: Adding more app servers to a lock-bound database is like shouting at a traffic jam—louder doesn’t mean faster.

Practical tasks (commands, outputs, decisions)

These are the tasks I actually run during incidents. Each one includes what the output means and what decision you make from it. Mix and match for MySQL or PostgreSQL depending on what’s on fire.

Task 1: Check system load and CPU saturation (Linux)

cr0x@server:~$ uptime
 14:07:21 up 32 days,  6:12,  2 users,  load average: 22.44, 19.10, 13.02

Meaning: Load average far above CPU core count suggests runnable queue pressure or uninterruptible I/O waits.

Decision: If load is high and CPU is high, look for expensive queries. If load is high and CPU is modest, suspect lock waits or I/O waits.

Task 2: Identify I/O wait and swap pressure

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
12  3      0  52124  43120 918420    0    0   112  9840 9400 21000 35 10 35 20  0
18  5      0  49812  42980 915220    0    0   120 11240 9900 24000 30 12 28 30  0

Meaning: High wa indicates CPU waiting on disk. High cs can also hint at too many active threads/processes.

Decision: If wa is consistently high, move quickly to database flush/checkpoint diagnostics and storage latency.

Task 3: Measure storage latency quickly (Linux)

cr0x@server:~$ iostat -x 1 3
Device            r/s     w/s   r_await   w_await  aqu-sz  %util
nvme0n1         120.0   980.0     1.20    18.50    9.40   97.0

Meaning: High w_await and near-100% %util indicate the device is saturated with writes.

Decision: Treat this as a storage bottleneck until proven otherwise. Reduce write rate (batching, fewer indexes, fewer fsync events) or upgrade storage.

Task 4: Count connections and find top talkers (PostgreSQL)

cr0x@server:~$ psql -XAt -c "select state, count(*) from pg_stat_activity group by 1 order by 2 desc;"
active|412
idle|1870
idle in transaction|37

Meaning: A large idle count suggests you need pooling. Any non-trivial idle-in-transaction count is a red flag.

Decision: If idle is huge, put pgBouncer in front (transaction pooling fits most OLTP apps) and cap server connections. If idle-in-transaction sessions exist under load, hunt those clients first.

Task 5: Find what Postgres sessions are waiting on

cr0x@server:~$ psql -X -c "select wait_event_type, wait_event, count(*) from pg_stat_activity where state='active' group by 1,2 order by 3 desc limit 10;"
 wait_event_type |  wait_event  | count
-----------------+--------------+-------
 Lock            | transactionid |   120
 IO              | DataFileRead  |    64
 LWLock          | buffer_content|    38

Meaning: This shows which wait classes dominate among active sessions; here the bottleneck is mostly transactionid locks, with I/O reads and buffer_content LWLocks behind them.

Decision: Transactionid lock pileups often mean hot-row updates or heavy FK checks. DataFileRead indicates cache misses or bloated scans. buffer_content hints at contention on shared buffers/pages.

Task 6: Identify the blocking chain in Postgres

cr0x@server:~$ psql -X -c "select blocked.pid as blocked_pid, blocker.pid as blocker_pid, blocked.query as blocked_query, blocker.query as blocker_query from pg_stat_activity blocked join pg_stat_activity blocker on blocker.pid = any(pg_blocking_pids(blocked.pid)) where blocked.state='active' limit 5;"
 blocked_pid | blocker_pid | blocked_query              | blocker_query
------------+-------------+---------------------------+-------------------------------
      18421 |       17210 | update counters set v=v+1 | update counters set v=v+1

Meaning: A classic hot-row update pileup.

Decision: Stop “update same row” patterns. Use sharded counters, append-only event logs with aggregation, or move to a cache with periodic flush.

Task 7: Check autovacuum health and dead tuples (PostgreSQL)

cr0x@server:~$ psql -X -c "select relname, n_live_tup, n_dead_tup, last_autovacuum from pg_stat_user_tables order by n_dead_tup desc limit 5;"
 relname  | n_live_tup | n_dead_tup |     last_autovacuum
----------+------------+------------+---------------------------
 events   |   52000000 |   18000000 | 2025-12-29 12:11:03+00
 orders   |    9000000 |    2100000 | 2025-12-29 13:40:21+00

Meaning: A large dead-tuple count implies bloat and vacuum pressure; reads get slower and caches get less effective.

Decision: If autovacuum can’t keep up, adjust autovacuum settings per table, reduce update churn, and eliminate long transactions blocking vacuum.

Task 8: Find long-running transactions (PostgreSQL)

cr0x@server:~$ psql -X -c "select pid, now()-xact_start as xact_age, state, left(query,80) from pg_stat_activity where xact_start is not null order by xact_age desc limit 5;"
  pid  | xact_age |        state        | left
-------+----------+---------------------+----------------------------------------
 20311 | 01:12:09 | idle in transaction | SELECT * FROM customer WHERE id=$1

Meaning: An hour-long transaction sitting idle is holding back vacuum and increasing bloat risk.

Decision: Kill it if safe, then fix the app. Add timeouts (idle_in_transaction_session_timeout) and ensure clients don’t hold transactions across network calls.

Task 9: Check Postgres checkpoint behavior

cr0x@server:~$ psql -X -c "select checkpoints_timed, checkpoints_req, checkpoint_write_time, checkpoint_sync_time, buffers_checkpoint from pg_stat_bgwriter;"
 checkpoints_timed | checkpoints_req | checkpoint_write_time | checkpoint_sync_time | buffers_checkpoint
------------------+-----------------+-----------------------+----------------------+-------------------
              128 |             412 |               983210  |               210992 |           8420012

Meaning: checkpoints_req far outnumbering checkpoints_timed means WAL volume is forcing checkpoints early, which often correlates with write spikes and latency cliffs.

Decision: Consider tuning checkpoint settings and WAL sizing, and reduce write amplification (indexes, update patterns). Also verify storage can handle sustained writes.

Task 10: See MySQL thread/connections and running queries

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_%';"
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Threads_cached    | 120   |
| Threads_connected | 1850  |
| Threads_running   | 210   |
+-------------------+-------+

Meaning: Threads_running is the real concurrency pressure. Hundreds of running threads can mean CPU contention, lock waits, or both.

Decision: If running threads is high and latency is high, find what they’re waiting on (locks, I/O, redo). If connected is huge, fix pooling and timeouts.

Task 11: Check InnoDB lock waits and deadlocks

cr0x@server:~$ mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,80p'
=====================================
2025-12-29 14:09:56 INNODB MONITOR OUTPUT
=====================================
LATEST DETECTED DEADLOCK
------------------------
*** (1) TRANSACTION:
TRANSACTION 8912231, ACTIVE 0 sec updating or deleting
...

Meaning: Deadlocks under load are normal-ish; frequent deadlocks are a design smell (hot rows, inconsistent lock order, missing indexes).

Decision: If deadlocks spike, prioritize query/path redesign and consistent transaction ordering. Add/adjust indexes to avoid range locks from full scans.

Task 12: Identify MySQL lock waits via Performance Schema

cr0x@server:~$ mysql -e "select object_schema, object_name, count_star, sum_timer_wait/1000000000000 as seconds_waited from performance_schema.table_lock_waits_summary_by_table order by sum_timer_wait desc limit 5;"
+---------------+-------------+------------+----------------+
| object_schema | object_name | count_star | seconds_waited |
+---------------+-------------+------------+----------------+
| appdb         | counters    |    1203321 |         842.21 |
| appdb         | orders      |     214220 |         190.44 |
+---------------+-------------+------------+----------------+

Meaning: Which tables are causing lock waits, quantified. “counters” is practically a confession.

Decision: Fix the hotspot first. No amount of buffer pool tuning makes a single-row counter scale.

Task 13: Check MySQL redo log pressure signals

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_log_waits';"
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Innodb_log_waits | 1842  |
+------------------+-------+

Meaning: Non-zero and increasing Innodb_log_waits means transactions are waiting for redo log space/flushing.

Decision: Investigate redo log sizing and flush settings; reduce write volume; confirm storage latency. This is a classic high-concurrency write wall.

Task 14: See top MySQL statements by total latency (Performance Schema)

cr0x@server:~$ mysql -e "select digest_text, count_star, round(sum_timer_wait/1000000000000,2) as total_s, round(avg_timer_wait/1000000000000,6) as avg_s from performance_schema.events_statements_summary_by_digest order by sum_timer_wait desc limit 3;"
+-------------------------------------------+------------+---------+---------+
| digest_text                               | count_star | total_s | avg_s   |
+-------------------------------------------+------------+---------+---------+
| UPDATE counters SET v = v + ? WHERE id=?  |  9202211   |  912.11 | 0.000099|
| SELECT * FROM orders WHERE user_id = ?    |   821220   |  410.44 | 0.000500|
+-------------------------------------------+------------+---------+---------+

Meaning: The “small” UPDATE dominates total latency because it runs millions of times and contends.

Decision: Reduce frequency, batch, shard, or redesign. Don’t chase micro-optimizations on the SELECT while the UPDATE is the knife.

Task 15: Check Postgres index bloat signals quickly (approximate)

cr0x@server:~$ psql -X -c "select relname, pg_size_pretty(pg_relation_size(relid)) as table_sz, pg_size_pretty(pg_total_relation_size(relid)) as total_sz from pg_catalog.pg_statio_user_tables order by pg_total_relation_size(relid) desc limit 5;"
 relname | table_sz | total_sz
---------+----------+----------
 events  | 58 GB    | 110 GB
 orders  | 12 GB    | 28 GB

Meaning: Total size far larger than table size suggests heavy indexing and/or bloat. Not proof, but a strong smell.

Decision: Investigate update patterns, autovacuum, and index necessity. Consider reindexing/maintenance windows if bloat is confirmed and harming cache hit rates.

Task 16: Check replication lag basics (Postgres)

cr0x@server:~$ psql -X -c "select client_addr, state, write_lag, flush_lag, replay_lag from pg_stat_replication;"
 client_addr |   state   | write_lag | flush_lag | replay_lag
-------------+-----------+-----------+-----------+------------
 10.0.2.18    | streaming | 00:00:00.2| 00:00:01.1| 00:00:03.8

Meaning: Replay lag growing under concurrency indicates replicas can’t apply WAL fast enough, often due to I/O or CPU constraints.

Decision: If replicas are behind and serving reads, you’re serving old truth. Either fix replica capacity, reduce churn, or route critical reads to primary.

Three corporate mini-stories from the concurrency trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company migrated a customer analytics service from MySQL to PostgreSQL. The team had good reasons: better SQL features, nicer JSON support for their payloads, and a query planner that handled their ad hoc reporting better. They load-tested. It looked fine.

Then the first “real” traffic event happened: a marketing campaign plus a partner integration. Concurrency spiked, and the database started timing out. CPU wasn’t terrible, and storage wasn’t melting. But the app was stuck in a slow panic.

The wrong assumption was simple: “Postgres can handle lots of connections, so we’ll just open one per request like we did before.” They had thousands of short-lived connections thrashing. Authentication overhead, process churn, memory fragmentation, and context switching turned into a tax on every request.

Worse, their transaction boundaries were sloppy. Some requests opened a transaction, did a SELECT, called an external service, and then did an UPDATE. Under load, those idle-in-transaction sessions started to pile up and vacuum fell behind. The system wasn’t just slow; it was slowly getting worse.

The fix was unglamorous: pgBouncer in transaction pooling mode, a sane server-side connection cap, and strict timeouts on idle transactions. Then they refactored the “call external service while holding a transaction” code path. Concurrency stabilized, and the same hardware handled more work. The database didn’t get “faster.” The system got less silly.

Mini-story 2: The optimization that backfired

An e-commerce platform running MySQL (InnoDB) had a performance goal: reduce reads on the product page under peak sale traffic. A well-meaning engineer added a “view_count” column to the products table and updated it on every page view. One row per product. What could go wrong?

Everything, in slow motion at first. During normal days, it worked. During sale events, hot products got hammered. The database started showing rising lock waits and deadlocks. Application retries multiplied. The extra retries increased write load. The write load increased flush pressure. The flush pressure increased latency. The latency increased timeouts. The timeouts increased retries. You know where this goes.

The team initially chased “MySQL tuning”: buffer pool size, thread concurrency, flushing knobs. They got marginal improvements but not stability. The hotspot was the design: a high-frequency, highly-contended update on a small set of rows. They had created a global mutex disguised as a feature.

The real fix: stop updating the same row for every view. They moved to an append-only events table (or even an external counter system) and aggregated asynchronously. Suddenly deadlocks calmed down, replication lag shrank, and the database stopped doing cardio.

The lesson wasn’t “never store counters.” It was: if the write is hotter than the object being counted, you need a write pattern that scales with concurrency, not against it.

Mini-story 3: The boring but correct practice that saved the day

A financial services team ran PostgreSQL for OLTP and had a weekly batch job that did heavy updates for reconciliation. The job was slow but predictable. They had a habit that looked paranoid: they tracked autovacuum behavior per table, they set per-table vacuum/analyze settings on the known churny relations, and they had a strict policy that any transaction older than a threshold triggered an alert.

One week, a new microservice deployment introduced a bug: a connection leak plus a missing commit in an error path. Sessions piled up “idle in transaction” while holding snapshots. The batch job started, generated dead tuples, and autovacuum couldn’t reclaim them. Query latencies rose, but not instantly—just enough to be suspicious.

The on-call engineer didn’t start by staring at CPU graphs. They ran two queries: active wait events, and longest transaction age. The longest transaction was hours old. That was the smoking gun. They killed the worst offenders, mitigated the leak by rolling back, and the system recovered before bloat became a multi-day cleanup.

This is the kind of win nobody brags about at conferences. It’s also how you keep your weekends. Boring guardrails—timeouts, alerts on transaction age, routine vacuum verification—prevent concurrency problems from becoming existential.

Common mistakes: symptoms → root cause → fix

1) “CPU is low but everything is slow”

Symptoms: high request latency, low CPU usage, many active sessions.

Root cause: lock contention or I/O stalls; threads/processes waiting rather than executing.

Fix: identify wait events (Postgres) or lock/redo waits (MySQL). Remove hot-row patterns, fix missing indexes causing range locks, reduce write amplification, and cap concurrency via pooling/backpressure.

2) PostgreSQL autovacuum “mysteriously” can’t keep up

Symptoms: growing table/index size, worsening cache hit rates, increasing query times over days.

Root cause: long-running transactions holding snapshots; update-heavy tables without tuned autovacuum thresholds.

Fix: kill/avoid long transactions, set timeouts, tune autovacuum per table, and reduce UPDATE churn (use INSERT-only with compaction/aggregation when possible).

3) MySQL deadlocks spike after a release

Symptoms: increased deadlock logs, retry storms, throughput collapse.

Root cause: new code path changed lock ordering or introduced hot-row updates; missing index causes a scan with broader locks.

Fix: make transaction lock order consistent; add proper indexes; reduce contention by sharding hotspots; limit retries with jitter and cap.

4) “Adding replicas didn’t help”

Symptoms: primary still overloaded, replicas lagging, read latency inconsistent.

Root cause: workload is write/lock-bound, not read-bound; replicas can’t apply changes fast enough.

Fix: fix write path first (indexes, batching, contention). Route only safe/lag-tolerant reads to replicas. Upgrade replica I/O if apply is the limiter.

5) Connection pool exhaustion causes cascading failures

Symptoms: app timeouts, error rates spike, database shows connection storms.

Root cause: pool sized too small for the request latency, or worse, sized so large it overwhelms the database; missing timeouts leave stuck clients holding connections.

Fix: set a hard cap on DB connections, use pooling, add query and transaction timeouts, and enforce backpressure at the edge.

6) “We tuned memory but it’s still slow”

Symptoms: large caches configured, but lots of disk reads and stalls under concurrency.

Root cause: bloat, bad query plans under parameterization, or working set larger than memory due to too many indexes or large rows.

Fix: reduce bloat and unnecessary indexes; measure cache hit ratios; fix query patterns and statistics; consider partitioning or archiving.

Checklists / step-by-step plan

Step-by-step plan: choosing between MySQL and PostgreSQL for high concurrency OLTP

  1. Define the contention model: are there hotspots (counters, inventory, queues), or mostly independent rows?
  2. Define transaction shape: short and frequent, or long and complex? If long transactions are common, plan controls and timeouts from day one.
  3. Decide on connection strategy: if Postgres, assume pooling is required. If MySQL, still pool; don’t treat DB as a socket server.
  4. Measure write amplification: count indexes, row width, and expected update rate. If write-heavy, be ruthless about indexes.
  5. Pick isolation intentionally: default isolation levels exist for historical reasons, not because they’re best for your workload.
  6. Design for backpressure: your app must degrade gracefully when the DB is saturated. Without this, either database will fail “dramatically.”
  7. Plan replication semantics: what reads can tolerate lag? If the answer is “none,” your architecture must reflect that reality.
  8. Operationalize maintenance: vacuum and bloat (Postgres), purge and redo pressure (MySQL), backups, failovers, and schema changes under load.

Checklist: stabilizing a lock-bound database in an incident

  • Confirm lock waits are the dominant wait class (Postgres wait events / MySQL lock wait summaries).
  • Identify the top contended table/index and the exact query pattern.
  • Temporarily reduce concurrency: lower app worker count, reduce queue consumers, tighten pool size, or shed load.
  • Stop retries from stampeding: cap retries, add jitter, and fail fast for non-critical paths.
  • Apply targeted mitigations: add missing index, change a transaction order, disable a non-essential feature that writes.
  • After stability: redesign hotspot (sharded counters, queue redesign, append-only events, partitioning).

Checklist: preventing Postgres MVCC bloat from killing concurrency

  • Alert on idle in transaction sessions and long transaction age.
  • Review top churn tables weekly: dead tuples, vacuum frequency, table growth.
  • Tune autovacuum per table where churn is high; don’t rely on global defaults (see the sketch after this checklist).
  • Prefer INSERT-only patterns + periodic compaction/aggregation for high-write event streams.
  • Keep transactions short; don’t hold snapshots while calling external systems.
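
The per-table tuning referenced above is plain DDL against storage parameters. A minimal sketch for a hypothetical high-churn table; the scale factors are starting points to measure against, not universal answers.

cr0x@server:~$ psql -X -c "ALTER TABLE events SET (autovacuum_vacuum_scale_factor = 0.02, autovacuum_analyze_scale_factor = 0.01);"
ALTER TABLE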

Checklist: preventing MySQL InnoDB lock storms

  • Identify and eliminate hot-row updates and “global counters.”
  • Ensure critical WHERE clauses are indexed to avoid broad locking and scans.
  • Keep transactions short; avoid large batches that hold locks for long periods.
  • Watch redo log waits and flushing behavior under sustained writes.
  • Validate replication lag behavior during load tests; don’t discover it in production.

FAQ

1) Is PostgreSQL “worse” at high concurrency because it uses processes?

No. It’s worse at careless connection patterns. With pooling and sane session counts, Postgres handles high transactional concurrency extremely well. Without pooling, you’re paying OS overhead unnecessarily.

2) Is MySQL “better” because it’s faster?

Sometimes it’s faster for simple OLTP patterns and can be very efficient. But “faster” collapses under lock contention and write amplification. Under hot-row updates, MySQL will punish you quickly and loudly.

3) Which one hits the wall first at 10,000 connections?

Without pooling: usually Postgres will complain sooner due to per-connection process overhead. With pooling: both can survive, and the wall moves to locks, I/O, or query design.

4) Why do “simple” UPDATE statements become the top latency consumer?

Because they’re not simple under contention. A 0.1ms UPDATE run ten million times with lock waits becomes your dominant cost. Frequency plus contention beats cleverness every time.

5) Can read replicas fix high concurrency problems?

Only if you’re truly read-bound and your reads can tolerate replica lag. If you’re lock-bound or write-bound, replicas won’t fix the primary’s choke point and may add operational complexity.

6) What’s the single best practice for Postgres under high concurrency?

Use a connection pooler and enforce short transactions with timeouts. If you do that, you avoid the most common early walls and keep MVCC cleanup healthy.

7) What’s the single best practice for MySQL under high concurrency?

Eliminate hotspots and ensure indexes support your write queries. Also watch redo log waits; if durability work blocks foreground writes, you’ll see concurrency collapse.

8) Is SERIALIZABLE isolation a bad idea for concurrency?

Not inherently, but it’s expensive and can increase retries/serialization failures under contention. Use it when you need it, and design around retries deliberately. Don’t “turn it on for safety” and hope.

9) How do I know if I should scale up hardware or redesign queries?

If waits are dominated by a single lock hotspot or by bloat-induced scans, redesign first. If you’re saturating storage with steady, unavoidable writes and you’ve reduced amplification, then scale hardware.

10) What’s one metric that reliably predicts an upcoming incident?

Transaction age outliers. One long transaction can poison concurrency by blocking cleanup (Postgres) or purge (InnoDB) and by holding locks longer than expected.

Next steps (what to do Monday morning)

If you’re deciding between MySQL and PostgreSQL for a high-concurrency system, pick based on your expected failure modes and your operational discipline:

  • If you can enforce pooling and transaction hygiene, PostgreSQL is a strong default with excellent observability and SQL depth.
  • If your workload is straightforward OLTP with careful schema/index discipline, MySQL/InnoDB can be brutally efficient—until you introduce hotspots and pretend they’re fine.

Then do the boring setup that prevents future heroics:

  1. Implement connection pooling and enforce a hard DB connection cap.
  2. Add timeouts: query timeout, transaction timeout, and idle-in-transaction timeout (or app-side equivalents).
  3. Build dashboards for waits (Postgres wait events; MySQL lock/redo waits), not just CPU.
  4. Run a load test that includes hotspots: counters, queues, and batch jobs. If you don’t test the ugly parts, production will.
  5. Make a hotspot playbook: sharded counters, append-only events, queue redesign, and partitioning strategies.

One operational principle to keep, paraphrasing an idea from John Allspaw: reliability comes from designing systems that expect failure and respond gracefully, not from pretending failures won’t happen.

Pick your database. Then run it like you mean it.
