If you’ve ever sat in a meeting where someone says, “We should switch databases, it’ll be faster,” you know the feeling: your pager gets heavier. Most “faster” claims are benchmarks with the wrong schema, the wrong dataset, the wrong concurrency, and the wrong failure modes.
PostgreSQL and Percona Server (a performance-focused MySQL variant) can both be brutally fast. They can also both be painfully slow. The difference isn’t vibes. It’s workload shape, operational discipline, and whether you’re accidentally benchmarking your cache.
Performance myths that won’t die
Myth 1: “Percona Server is faster than PostgreSQL”
Percona Server can be faster than PostgreSQL for classic MySQL-shaped OLTP: lots of short indexed point queries, predictable access patterns, and teams that know how to operate InnoDB well. PostgreSQL can also be faster than Percona Server for the same workload if your query patterns prefer Postgres’s planner behavior, or your team’s tuning instincts align with Postgres’s knobs and observability.
“Faster” here is a claim that needs a unit: p95 latency? Throughput at fixed latency? Time-to-recover after failover? Hours spent chasing replication drift? The database that “wins” depends on which pain you’re measuring.
Myth 2: “PostgreSQL is slow because of MVCC bloat”
MVCC is not a tax bill you can’t avoid. It’s a design trade that gives consistent reads and concurrency. The tax arrives when you ignore vacuuming, long-running transactions, and table/index bloat. Operate it correctly and Postgres handles heavy concurrency cleanly. Operate it lazily and it will slowly turn your storage into a scrapbook of dead tuples.
Myth 3: “MySQL is always faster for reads”
Reads are not one thing. There are point reads, range scans, fan-out joins, analytic aggregations, and “reads” that are actually lock waits. InnoDB can scream on point reads with a hot buffer pool and a simple schema. Postgres can also scream—especially with good indexes, effective caching, and query plans that avoid nasty nested-loop surprises.
Myth 4: “Benchmarks settle it”
Benchmarks settle exactly one thing: whether a specific database, on a specific configuration, on a specific dataset, with a specific workload generator, on a specific machine, under a specific failure-free fantasy, achieved some number. Useful. Not universal.
One short joke, as a palate cleanser: Benchmarks are like resumes—everything looks amazing until you check references under load.
Myth 5: “Switching engines fixes performance”
Switching engines changes your bottleneck, not your responsibility. You don’t escape physics. You trade one set of operational gotchas for another. If your problem is missing indexes, bad queries, or undersized storage, migrating is just an expensive way to postpone the same work.
Interesting facts and historical context (the stuff that explains today’s behavior)
- PostgreSQL’s lineage traces back to POSTGRES at UC Berkeley in the 1980s; the “PostgreSQL” name came later, when SQL support was added (via Postgres95) and the renamed project stuck.
- InnoDB became the default storage engine for MySQL in 5.5; before that, MyISAM was common and brought a very different locking and durability story.
- Percona Server emerged because many shops wanted MySQL compatibility plus instrumentation and performance features without waiting on upstream release cycles.
- Postgres added parallel query progressively (not all at once), and it changed the calculus for certain reporting workloads that used to be “export to Spark” by default.
- MySQL replication history started with statement-based replication, which was faster in some cases but could be nondeterministic; row-based replication improved correctness at a cost.
- Postgres WAL (write-ahead log) has been central forever, but features like logical decoding made it easier to build change-data pipelines without triggers.
- InnoDB’s buffer pool is one of the biggest single levers for MySQL/Percona performance; getting it wrong can make a fast server look like slow storage.
- Vacuum is not optional in Postgres; it’s part of how MVCC keeps performance stable. Autovacuum exists because humans are unreliable.
- Both ecosystems have matured into “real” enterprise-grade systems; the argument today is rarely about capability and mostly about fit and operations.
What “faster” actually means: workload archetypes
1) Hot-key OLTP: lots of point reads/writes on a small working set
This is where Percona Server (InnoDB) often looks great: hot pages sit in the buffer pool, change buffering and adaptive behaviors kick in, and the engine is optimized for the “many tiny queries” lifestyle. Postgres can do it too, but you’ll care more about connection management, index choices, and avoiding transaction patterns that block vacuum.
Typical trap: you benchmark against a tiny dataset that fits in RAM and declare victory. Production has a bigger working set, and suddenly you’re measuring storage latency and buffer churn instead of SQL.
2) Write-heavy OLTP: sustained inserts/updates with durability
Here “faster” is constrained by log volume, fsync behavior, and checkpointing. In Postgres, WAL volume and checkpoint tuning matter, as does autovacuum keeping up. In InnoDB, redo log capacity, flushing behavior, and doublewrite buffer implications can dominate. Both can run into I/O walls; they just hit different walls first depending on settings and schema.
3) Mixed workload with complex joins
Postgres’s planner is powerful, and its statistics and indexing options (including expression indexes and partial indexes) can produce very strong plans for complex relational queries. MySQL/Percona has improved a lot, but there are still workloads where Postgres tends to be the “less surprising” engine for joins and advanced SQL semantics.
Decision point: if your application leans into relational design and complex reporting queries alongside OLTP, Postgres often reduces the number of “we rewrote SQL to appease the optimizer” incidents.
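A minimal sketch of what that “advanced indexing” looks like in practice, using the illustrative orders/customers tables from the examples later in this article (the index names and the status predicate are assumptions):

cr0x@server:~$ psql -X -d appdb -c "CREATE INDEX CONCURRENTLY idx_orders_open_customer ON orders (customer_id) WHERE status = 'open';"
cr0x@server:~$ psql -X -d appdb -c "CREATE INDEX CONCURRENTLY idx_customers_email_lower ON customers (lower(email));"

The partial index only covers the rows the hot query actually touches, so it stays small and cheap to maintain; the expression index only helps if queries filter on lower(email) rather than raw email. CREATE INDEX CONCURRENTLY avoids blocking writes, at the cost of taking longer.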
4) Read replicas and fan-out reads
Replication is performance. If you can’t scale reads without replication lag becoming a product bug, your “fast” primary doesn’t matter. Percona Server shops often rely on replica fleets; Postgres shops do too, but the operational patterns differ (physical streaming replication, logical replication, and extensions).
5) “Boring” admin workloads: backups, schema changes, and recoveries
Speed includes how quickly you can ship changes. Online schema changes, index creation time, vacuuming, and restore time decide whether you’re sleeping. Both engines can be operated safely; both can be driven into a ditch if you treat maintenance as optional.
Where the engines differ (in the parts that hurt you)
Concurrency control: MVCC is not the same everywhere
Both systems use MVCC concepts, but the operational consequences differ. Postgres keeps multiple versions of rows in the table itself, and visibility is determined by transaction IDs. That means dead tuples accumulate until vacuum reclaims space. It’s not a flaw; it’s the deal.
InnoDB also uses MVCC with undo logs. Old versions live in undo segments rather than as dead tuples in the table pages. That changes the shape of “bloat” and the symptoms you see: long-running transactions can prevent purge, so the history list grows and performance degrades in a different way.
Durability and I/O patterns: WAL vs redo/undo + doublewrite
Postgres writes WAL records and later flushes dirty pages; checkpoints can cause write bursts if poorly tuned. InnoDB has redo logs and data files, plus mechanisms like doublewrite to protect against partial page writes. The tuning knobs differ, but the physics is the same: your storage subsystem has a budget, and the database will spend it.
The “optimizer tax”: query planners are opinionated machines
Postgres’s planner can make brilliant choices and occasionally terrible ones when statistics are stale or distributions are weird. MySQL/Percona’s optimizer has its own quirks, especially around complex joins and certain subquery patterns. You don’t pick the “best optimizer.” You pick the optimizer you can predict, instrument, and correct.
Connections: Postgres punishes connection storms
Postgres uses a process-per-connection model by default. A thundering herd of short-lived connections can be a self-inflicted DoS. The fix is not “make the database bigger.” It’s pooling (often PgBouncer), right-sizing max connections, and teaching the application to reuse connections.
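A minimal PgBouncer sketch, with pool sizes that are purely illustrative and a database name and auth file path that are assumptions you would replace with your own:

; /etc/pgbouncer/pgbouncer.ini (excerpt)
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 20
max_client_conn = 2000

Transaction pooling gives the biggest reduction in backend processes, but it breaks session-level behavior (SET, advisory locks, session-scoped prepared statements), so verify the application tolerates it before flipping the switch.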
MySQL typically handles many connections with different overhead characteristics. That does not mean you should let clients run wild. It means the failure mode may arrive later, with a different signature.
Observability: both have tools, but the defaults differ
Postgres with pg_stat_statements gives you high-value query visibility. Percona Server is famous for extra instrumentation and the wider Percona toolkit ecosystem. Either way, if you don’t collect the right metrics and query traces, you will end up “tuning” based on superstition.
One quote, because operations people deserve at least one good line: Hope is not a strategy.
— Vince Lombardi
Fast diagnosis playbook
This is what you do when someone says, “The database is slow,” and you have 15 minutes before the incident channel turns into performance art.
First: confirm the symptom and the unit
- Is it latency or throughput? p95 query latency vs total QPS. They fail differently.
- Is it the database? App thread pools, network, and downstream calls can impersonate DB slowness.
- Is it one query class? One bad query can poison a whole pool.
Second: classify the bottleneck in one of four buckets
- CPU-bound: high CPU, stable I/O, slow queries with heavy compute (sorts, hash joins, JSON processing).
- I/O-bound: high read/write latency, low cache hit rates, checkpoint/flush pressure.
- Lock/transaction-bound: waits, lock contention, long transactions, replication apply delays.
- Memory/working-set mismatch: cache thrash, buffer pool churn, frequent temp file usage.
Third: pick the shortest path to proof
- On Postgres: start with pg_stat_activity, pg_stat_statements, cache hit ratio, checkpoint stats, and vacuum progress.
- On Percona: start with SHOW PROCESSLIST, InnoDB buffer pool metrics, redo/flush stats, and slow query summaries.
- On the host: check iostat, pressure stalls, filesystem latency, and whether the storage is quietly screaming.
Second short joke (and we’re done): The fastest database is the one you didn’t accidentally point at the reporting job.
Practical tasks: commands, what the output means, and what decision you make
These are real tasks you can run during an incident or a performance review. Each includes: command, sample output, interpretation, and the next decision. Commands are split between PostgreSQL and Percona/MySQL, plus OS-level checks that decide which direction to dig.
Task 1 (PostgreSQL): See what’s running and what’s waiting
cr0x@server:~$ psql -X -d appdb -c "select pid, usename, state, wait_event_type, wait_event, now()-query_start as age, left(query,80) as query from pg_stat_activity where state <> 'idle' order by age desc limit 10;"
pid | usename | state | wait_event_type | wait_event | age | query
------+--------+--------+-----------------+---------------+----------+--------------------------------------------------------------------------------
8123 | app | active | Lock | transactionid | 00:02:11 | update orders set status='paid' where id=$1
7991 | app | active | IO | DataFileRead | 00:01:05 | select * from order_items where order_id=$1
7902 | app | active | | | 00:00:40 | select count(*) from orders where created_at > now()-interval '1 day'
What it means: You have a lock wait on transactionid (often row-level conflicts escalating into transaction-level waits) and an I/O wait on data file reads.
Decision: If locks dominate, find the blocking PID and fix transaction behavior. If I/O dominates, check cache hit rate and storage latency before touching SQL.
Task 2 (PostgreSQL): Identify the blocker
cr0x@server:~$ psql -X -d appdb -c "select blocked.pid as blocked_pid, blocker.pid as blocker_pid, blocked.query as blocked_query, blocker.query as blocker_query from pg_locks blocked_locks join pg_stat_activity blocked on blocked_locks.pid=blocked.pid join pg_locks blocker_locks on blocked_locks.locktype=blocker_locks.locktype and blocked_locks.database is not distinct from blocker_locks.database and blocked_locks.relation is not distinct from blocker_locks.relation and blocked_locks.page is not distinct from blocker_locks.page and blocked_locks.tuple is not distinct from blocker_locks.tuple and blocked_locks.virtualxid is not distinct from blocker_locks.virtualxid and blocked_locks.transactionid is not distinct from blocker_locks.transactionid and blocked_locks.classid is not distinct from blocker_locks.classid and blocked_locks.objid is not distinct from blocker_locks.objid and blocked_locks.objsubid is not distinct from blocker_locks.objsubid and blocker_locks.pid != blocked_locks.pid join pg_stat_activity blocker on blocker_locks.pid=blocker.pid where not blocked_locks.granted;"
blocked_pid | blocker_pid | blocked_query | blocker_query
------------+------------+------------------------------------------+---------------------------------------------
8123 | 7001 | update orders set status='paid' where id=$1 | begin; update orders set ...; -- long txn
What it means: PID 7001 is holding the lock. Note the begin; pattern: long transaction.
Decision: Fix the app transaction scope. During incident: consider terminating the blocker if safe. Long term: reduce transaction time, avoid user-driven “open transaction while thinking” flows.
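If termination is the call, do it from psql rather than killing the OS process. Using the blocker PID from the sample output above (7001), try a polite cancel first, then terminate:

cr0x@server:~$ psql -X -d appdb -c "select pg_cancel_backend(7001);"
cr0x@server:~$ psql -X -d appdb -c "select pg_terminate_backend(7001);"

pg_cancel_backend stops the current query; pg_terminate_backend ends the whole session and rolls back its open transaction. On modern Postgres you can also skip the big lock-join query above and ask pg_blocking_pids(8123) directly for the PIDs blocking a given session.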
Task 3 (PostgreSQL): Top queries by total time (requires pg_stat_statements)
cr0x@server:~$ psql -X -d appdb -c "select calls, round(total_exec_time::numeric,1) as total_ms, round(mean_exec_time::numeric,2) as mean_ms, rows, left(query,100) as query from pg_stat_statements order by total_exec_time desc limit 5;"
calls | total_ms | mean_ms | rows | query
-------+----------+---------+-------+-----------------------------------------------------------------------------------------------------
12000 | 980000.0 | 81.67 | 12000 | select * from order_items where order_id = $1
800 | 410000.0 | 512.50 | 800 | select count(*) from orders where created_at > now()-interval '1 day'
300 | 220000.0 | 733.33 | 9000 | select * from orders o join customers c on c.id=o.customer_id where c.email like $1
What it means: One query dominates total time due to high call count; another is slow per-call.
Decision: For the high-call query, focus on index and caching. For the slow query, run EXPLAIN (ANALYZE, BUFFERS) and fix plan shape or add the right index.
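For the slow per-call query from the sample output, the proof looks something like this. Keep in mind that ANALYZE actually executes the statement, so treat writes with care and prefer a replica for anything heavy:

cr0x@server:~$ psql -X -d appdb -c "explain (analyze, buffers) select count(*) from orders where created_at > now() - interval '1 day';"

What you’re looking for: sequential scans over big tables, “Rows Removed by Filter” in the millions, and buffer reads that dwarf hits. If the plan scans everything to count one day of orders, an index on created_at (or partitioning by time) is the likely fix.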
Task 4 (PostgreSQL): Check cache hit ratio (signal, not religion)
cr0x@server:~$ psql -X -d appdb -c "select datname, round(100.0*blks_hit/nullif(blks_hit+blks_read,0),2) as cache_hit_pct, blks_read from pg_stat_database where datname='appdb';"
datname | cache_hit_pct | blks_read
--------+---------------+-----------
appdb | 93.41 | 1829934
What it means: 93% can be fine or terrible depending on workload. But if it dropped from 99% yesterday, your working set grew or memory shrank.
Decision: If cache hit is falling and reads rise, verify RAM pressure, shared_buffers sanity, and whether a new query pattern is scanning big tables.
Task 5 (PostgreSQL): Find tables with bloat signals (dead tuples)
cr0x@server:~$ psql -X -d appdb -c "select relname, n_live_tup, n_dead_tup, round(100.0*n_dead_tup/nullif(n_live_tup+n_dead_tup,0),2) as dead_pct from pg_stat_user_tables order by n_dead_tup desc limit 5;"
relname | n_live_tup | n_dead_tup | dead_pct
------------+------------+------------+---------
orders | 12500000 | 4100000 | 24.69
sessions | 800000 | 2900000 | 78.38
events | 90000000 | 1200000 | 1.32
What it means: sessions is a bloat factory. Probably frequent updates/deletes plus poor vacuum progress, maybe long-lived transactions.
Decision: Fix the churn pattern (TTL table partitioning, fewer updates), and tune autovacuum for that table. If it’s already huge, plan a controlled VACUUM (FULL) or table rewrite via maintenance window.
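Per-table autovacuum tuning is a storage-parameter change, not a config file edit. The thresholds below are illustrative, not a recommendation; the point is that a high-churn table like sessions can get more aggressive vacuuming than the global default, and that you verify it afterwards:

cr0x@server:~$ psql -X -d appdb -c "ALTER TABLE sessions SET (autovacuum_vacuum_scale_factor = 0.02, autovacuum_vacuum_cost_delay = 2);"
cr0x@server:~$ psql -X -d appdb -c "select relname, last_autovacuum, autovacuum_count from pg_stat_user_tables where relname = 'sessions';"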
Task 6 (PostgreSQL): See if checkpoints are punishing you
cr0x@server:~$ psql -X -d appdb -c "select checkpoints_timed, checkpoints_req, round(checkpoint_write_time/1000.0,1) as write_s, round(checkpoint_sync_time/1000.0,1) as sync_s, buffers_checkpoint from pg_stat_bgwriter;"
checkpoints_timed | checkpoints_req | write_s | sync_s | buffers_checkpoint
------------------+-----------------+---------+--------+-------------------
124 | 198 | 905.2 | 210.3 | 88412211
What it means: Lots of requested checkpoints suggests WAL pressure or settings forcing frequent checkpoints. That can create I/O spikes and latency. (On Postgres 17 and later, these counters live in pg_stat_checkpointer rather than pg_stat_bgwriter.)
Decision: Consider increasing max_wal_size, tuning checkpoint completion target, and checking storage write latency. Don’t “fix” it by disabling durability.
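A hedged example of that tuning path. The values are placeholders to adjust against your WAL volume and storage, not gospel:

cr0x@server:~$ psql -X -d appdb -c "ALTER SYSTEM SET max_wal_size = '16GB';"
cr0x@server:~$ psql -X -d appdb -c "ALTER SYSTEM SET checkpoint_completion_target = 0.9;"
cr0x@server:~$ psql -X -d appdb -c "select pg_reload_conf();"

Both settings take effect on reload. Watch the checkpoint stats afterwards: the goal is more timed checkpoints, fewer requested ones, and smoother write/sync times, not silence.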
Task 7 (Percona/MySQL): Who is running and who is stuck
cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST\G"
*************************** 1. row ***************************
Id: 22341
User: app
Host: 10.0.2.19:51244
db: appdb
Command: Query
Time: 132
State: Waiting for row lock
Info: update orders set status='paid' where id=?
*************************** 2. row ***************************
Id: 22110
User: app
Host: 10.0.2.20:49821
db: appdb
Command: Query
Time: 141
State: Sending data
Info: select * from order_items where order_id=?
What it means: You have row lock waits and a query stuck in “Sending data” (often meaning it’s scanning/returning lots of rows, or waiting on I/O).
Decision: For row locks, identify the blocking transaction. For “Sending data,” check query plan and indexes and confirm buffer pool hit and disk reads.
Task 8 (Percona/MySQL): Find InnoDB lock waits and the blocker
cr0x@server:~$ mysql -e "SELECT r.trx_id waiting_trx_id, r.trx_mysql_thread_id waiting_thread, r.trx_query waiting_query, b.trx_id blocking_trx_id, b.trx_mysql_thread_id blocking_thread, b.trx_query blocking_query FROM information_schema.innodb_lock_waits w JOIN information_schema.innodb_trx b ON b.trx_id=w.blocking_trx_id JOIN information_schema.innodb_trx r ON r.trx_id=w.requesting_trx_id\G"
*************************** 1. row ***************************
waiting_trx_id: 9012231
waiting_thread: 22341
waiting_query: update orders set status='paid' where id=?
blocking_trx_id: 9012198
blocking_thread: 21990
blocking_query: update orders set status='processing' where id=?; -- txn open
What it means: Thread 21990 is blocking. Likely a long transaction or an application flow holding locks too long. (On MySQL/Percona 8.0, information_schema.innodb_lock_waits is gone; use sys.innodb_lock_waits or performance_schema.data_lock_waits for the same picture.)
Decision: If safe, kill the blocker. Long term, reduce transaction scope and ensure indexes support targeted updates (to avoid locking extra rows).
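Killing the blocker from the sample output above (thread 21990) is one statement. KILL QUERY aborts just the running statement; KILL ends the connection and rolls back its open transaction:

cr0x@server:~$ mysql -e "KILL QUERY 21990;"
cr0x@server:~$ mysql -e "KILL 21990;"

Note that rolling back a transaction that modified many rows can itself take a while; the locks don’t free until the rollback completes.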
Task 9 (Percona/MySQL): Check buffer pool effectiveness
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';"
+---------------------------------------+------------+
| Variable_name | Value |
+---------------------------------------+------------+
| Innodb_buffer_pool_read_requests | 9834421190 |
| Innodb_buffer_pool_reads | 8832211 |
+---------------------------------------+------------+
What it means: The ratio is decent (reads from disk are a small fraction of requests). If Innodb_buffer_pool_reads is climbing rapidly, you’re missing cache.
Decision: If disk reads are high, increase buffer pool size (within RAM limits) or shrink the working set via indexes and query changes.
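On recent MySQL/Percona versions the buffer pool can be resized online; the 48 GB figure below is purely illustrative and assumes the host actually has the headroom:

cr0x@server:~$ mysql -e "SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb;"
cr0x@server:~$ mysql -e "SET GLOBAL innodb_buffer_pool_size = 48 * 1024 * 1024 * 1024;"

The resize happens in chunks in the background; persist the new value in my.cnf (or with SET PERSIST) so a restart doesn’t quietly undo it.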
Task 10 (Percona/MySQL): Look for redo log / checkpoint pressure
cr0x@server:~$ mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,120p'
=====================================
2025-12-30 10:12:01 0x7f2c3c0a0700 INNODB MONITOR OUTPUT
=====================================
Log sequence number 146990331122
Log flushed up to 146990220981
Last checkpoint at 146988102000
0 pending log flushes, 0 pending chkp writes
What it means: LSN, flushed, and checkpoint positions indicate how far behind flushing/checkpointing is. Big gaps under load can mean redo pressure or I/O limits.
Decision: If checkpointing falls behind during peak, review redo log sizing and flush settings, and verify storage write latency. Don’t blindly change durability knobs without a risk sign-off.
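What “review redo log sizing” looks like, hedged for version: on MySQL/Percona 8.0.30 and later the knob is innodb_redo_log_capacity and it can be changed at runtime; on older releases it is innodb_log_file_size and changing it requires a restart. The 8 GB value is a placeholder:

cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'innodb_redo_log_capacity'; SHOW VARIABLES LIKE 'innodb_log_file_size';"
cr0x@server:~$ mysql -e "SET GLOBAL innodb_redo_log_capacity = 8 * 1024 * 1024 * 1024;"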
Task 11 (OS): Verify disk latency and whether you’re I/O-bound
cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (db01) 12/30/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
18.21 0.00 4.12 22.33 0.00 55.34
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await wareq-sz aqu-sz %util
nvme0n1 220.0 18944.0 0.0 0.00 9.40 86.10 480.0 61200.0 18.70 127.50 12.30 98.00
What it means: High %iowait, high w_await, and %util near 100% screams storage saturation. Your database is waiting on the disk to keep promises it can’t keep.
Decision: Stop tuning SQL first. Reduce write amplification (indexes, churn), spread I/O (separate WAL/redo and data if possible), or upgrade storage. Also check for noisy neighbors on shared disks.
Task 12 (OS): Check memory pressure and swapping
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 81264 14000 612000 0 0 120 980 9200 8800 18 4 56 22 0
3 1 0 64220 13890 590100 0 0 110 1100 9400 9100 17 4 54 25 0
What it means: No swapping (si/so are 0), but there’s I/O wait. If you saw swapping, you’d expect dramatic latency and random “it’s slow” reports.
Decision: If swapping is present, stop and fix memory sizing (buffer pools, shared_buffers, work_mem, tmp tables). Swapping databases is a hobby, not a strategy.
Task 13 (PostgreSQL): Confirm temp file usage (sorts/hashes spilling to disk)
cr0x@server:~$ psql -X -d appdb -c "select datname, temp_files, pg_size_pretty(temp_bytes) as temp_size from pg_stat_database where datname='appdb';"
datname | temp_files | temp_size
--------+------------+-----------
appdb | 41290 | 98 GB
What it means: 98 GB of temp data suggests queries are spilling. That’s often poor indexes, bad plans, or insufficient work_mem for the specific workload (not globally).
Decision: Identify the spilling queries via EXPLAIN (ANALYZE, BUFFERS). Prefer fixing query/index first. If you raise work_mem, do it carefully to avoid OOM under concurrency.
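Two hedged examples of the careful version: log which queries spill so you fix the right ones, and raise work_mem only for the role that needs it. The “reporting” role name and both values are assumptions:

cr0x@server:~$ psql -X -d appdb -c "ALTER SYSTEM SET log_temp_files = '10MB';"
cr0x@server:~$ psql -X -d appdb -c "select pg_reload_conf();"
cr0x@server:~$ psql -X -d appdb -c "ALTER ROLE reporting SET work_mem = '256MB';"

log_temp_files writes a log line (including the query) for every temp file above the threshold; the per-role override keeps the blast radius small compared to a global work_mem bump.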
Task 14 (Percona/MySQL): Validate slow query log is actually telling you the truth
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'slow_query_log%'; SHOW VARIABLES LIKE 'long_query_time';"
+---------------------+-------+
| Variable_name | Value |
+---------------------+-------+
| slow_query_log | ON |
| slow_query_log_file | /var/log/mysql/slow.log |
+---------------------+-------+
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| long_query_time | 0.2 |
+-----------------+-------+
What it means: Slow log is enabled and threshold is aggressive enough to capture p95-ish behavior (0.2s here). If it’s set to 10s, it’s basically a “postmortem only” tool.
Decision: Ensure slow logging is on in production with a sensible threshold and rotation. Otherwise you’ll end up guessing which query hurt you.
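Turning it on is dynamic, with two caveats: SET GLOBAL doesn’t survive a restart unless you persist it (my.cnf, or SET PERSIST on 8.0+), and a changed long_query_time only applies to new connections. The threshold is illustrative:

cr0x@server:~$ mysql -e "SET GLOBAL slow_query_log = ON; SET GLOBAL long_query_time = 0.2;"

Pair it with log rotation and something like pt-query-digest from the Percona toolkit to summarize, or the raw file just becomes a very slow way to feel bad.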
Task 15 (PostgreSQL): Check replication lag quickly
cr0x@server:~$ psql -X -d appdb -c "select client_addr, state, write_lag, flush_lag, replay_lag from pg_stat_replication;"
client_addr | state | write_lag | flush_lag | replay_lag
-------------+-----------+-----------+-----------+-----------
10.0.3.12 | streaming | 00:00:01 | 00:00:02 | 00:00:08
What it means: Replay lag is 8 seconds. That might be fine or disastrous depending on read-your-writes requirements.
Decision: If lag is high and growing, check replica I/O/CPU, long-running queries on replicas, and WAL volume on primary. Consider moving reporting load off replicas that serve product reads.
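pg_stat_replication is the primary’s view. The replica’s own view of how stale it is comes from recovery functions; run this on the replica:

cr0x@server:~$ psql -X -d appdb -c "select pg_is_in_recovery(), now() - pg_last_xact_replay_timestamp() as replay_delay;"

One caveat: if the primary is idle, replay_delay keeps growing even though nothing is actually behind, so alert on it together with write activity rather than alone.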
Task 16 (MySQL/Percona): Check replication health and lag
cr0x@server:~$ mysql -e "SHOW SLAVE STATUS\G" | egrep 'Seconds_Behind_Master|Slave_IO_Running|Slave_SQL_Running|Relay_Log_Space'
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 37
Relay_Log_Space: 1849423872
What it means: SQL thread is running but lag is 37 seconds and relay logs are big. Replica can’t apply fast enough.
Decision: Investigate replica resource limits, heavy read queries, and whether single-threaded apply is bottlenecking (and whether parallel replication is configured appropriately for your workload).
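A quick way to check that last point. Newer MySQL/Percona releases use REPLICA terminology for the variables and statements; older versions use the slave_* names that match the status output above. The worker count is illustrative, and changing it requires the applier (SQL) thread to be stopped:

cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'replica_parallel%'; SHOW VARIABLES LIKE 'slave_parallel%';"
cr0x@server:~$ mysql -e "STOP REPLICA SQL_THREAD; SET GLOBAL replica_parallel_workers = 8; START REPLICA SQL_THREAD;"

More workers only help if the primary’s binlog actually exposes parallelism, so verify lag trends after the change instead of assuming the knob worked.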
Three corporate mini-stories from the performance trenches
1) Incident caused by a wrong assumption: “The benchmark said Percona was faster”
The company was mid-growth, with a payments-ish workload: lots of short transactions, but also a nasty reporting sidebar that did ad-hoc filtering across a few big tables. The engineering lead ran a benchmark: simple primary-key lookups, a few inserts, and a tidy dataset that fit comfortably in memory. Percona Server posted prettier numbers than Postgres on the chosen test. Decision made.
Migration happened. Week one looked fine. Week three was when reality showed up: the reporting queries weren’t the ones from the benchmark. They were wide joins with selective filters, and the business users kept changing the filters. The optimizer chose plans that were “reasonable” until data skew grew. Then it started scanning, spilling, and generally turning the replica tier into a space heater.
The incident itself was classic: a quarterly report ran during peak. Replication lag grew. Product reads started hitting stale replicas and showing inconsistent state. Support called it “data corruption.” It wasn’t corruption. It was just lag plus assumptions. Engineers chased phantom bugs for hours before someone checked Seconds_Behind_Master and the slow query log.
Fixing it required admitting the uncomfortable truth: the engine wasn’t the problem; the workload understanding was. They moved reporting to a separate system and redesigned the indexing strategy. The database got “faster” overnight without changing a single binary—because they stopped asking the wrong machine to do the wrong job at the wrong time.
2) Optimization that backfired: “We increased memory and removed fsync waits”
A different org ran Postgres for a high-write event ingestion pipeline. They saw periodic latency spikes. Someone pointed to checkpoints. True enough: checkpoint write bursts were visible, and the p99s had a heartbeat.
The initial tuning was sensible: adjust checkpoint pacing, increase WAL size, watch I/O. But then the “optimization” escalated: they bumped work_mem globally because a few queries were spilling, and they raised connection limits because the app team wanted faster parallelism.
Two days later the database started OOM-killing itself under load. Not because work_mem is evil, but because it’s per-sort/per-hash, multiplied by concurrent sessions. Combine that with too many connections and you get a memory bomb with a slow fuse. The checkpoint spikes were the visible problem; the memory setting was the silent killer.
The fix was boring: cap connections and use pooling, set work_mem conservatively, and apply per-role overrides for the few analytical queries that needed it. The team learned that “more memory” is not a tuning strategy; it’s a bill you pay later in uptime.
3) Boring but correct practice that saved the day: “We rehearsed restores”
This one doesn’t have heroics. It has adults in the room.
A SaaS company ran Percona Server with replicas and nightly backups. They also did something unfashionable: quarterly restore drills. Not just “backup succeeded,” but actually restoring to a staging environment, validating application-level queries, and timing the recovery steps. The runbook had real commands and real estimates.
One day, a storage firmware bug on the primary caused intermittent write stalls. The database didn’t crash; it just became unreliable and slow. The team decided to fail over. During the failover, they discovered the latest backup chain had a gap because a retention job had been misconfigured. Normally this is where the incident turns into a long, sweaty night.
Instead, they calmly restored from the last known-good backup, applied binlogs to a safe point, and brought up a replacement primary. It still took time, but it was time they had already measured. The business noticed a degraded window, not a catastrophe. The practice that “nobody has time for” paid for itself in one incident.
Common mistakes: symptoms → root cause → fix
1) Symptom: p99 latency spikes every few minutes (Postgres)
Root cause: checkpoint write bursts due to aggressive checkpointing and WAL pressure; storage can’t absorb the burst.
Fix: Increase max_wal_size, tune checkpoint pacing, verify storage latency, and consider separating WAL to faster media if applicable.
2) Symptom: steady performance decline over days (Postgres)
Root cause: autovacuum not keeping up; long transactions preventing cleanup; bloat growing.
Fix: Find long transactions, fix app patterns, tune autovacuum per table, and schedule bloat remediation (reindex, rewrite) where necessary.
3) Symptom: “CPU is fine but queries are slow” (either)
Root cause: I/O wait and cache misses; dataset outgrew memory; new query pattern does scans.
Fix: Check cache hit ratios/buffer pool reads, validate indexes, reduce scan queries, and upgrade storage or memory based on measured misses.
4) Symptom: sudden lock storms after a release (either)
Root cause: new transaction scope, missing index on update predicate, or “select then update” patterns increasing conflicts.
Fix: Identify blocker, add the right index, reduce lock hold time, and redesign hot rows (shard counters, avoid global row updates).
5) Symptom: replicas lag during peak but recover off-peak (MySQL/Percona)
Root cause: replica apply thread can’t keep up, heavy reads on replica, or binlog volume spikes from bulk updates.
Fix: Optimize write patterns (avoid massive multi-row updates), use appropriate parallel replication settings, and isolate reporting from product replicas.
6) Symptom: Postgres “too many connections” and high context switching
Root cause: app opens too many connections; process-per-connection overhead dominates.
Fix: Add PgBouncer, cap app pool sizes, reduce max_connections, and treat connection count as a capacity metric.
7) Symptom: MySQL “Sending data” states everywhere
Root cause: queries returning too many rows, missing indexes, or disk-bound scans. Also sometimes network backpressure.
Fix: EXPLAIN the query, add covering indexes, paginate, and verify network saturation vs disk reads.
8) Symptom: performance gets worse after “adding an index” (either)
Root cause: you increased write amplification; every insert/update now pays for extra index maintenance. Or the optimizer picked a worse plan.
Fix: Add only the indexes you can afford. Verify query plans before/after. Use partial/expression indexes (Postgres) or covering indexes thoughtfully (MySQL).
Checklists / step-by-step plan
Step-by-step: choosing between PostgreSQL and Percona Server for performance (without lying to yourself)
- Write down your workload in 10 queries that matter, with concurrency and read/write ratios. If you can’t, you’re not ready to choose.
- Define success metrics: p95/p99 latency for top queries, throughput at fixed latency, replication lag budget, recovery time objective.
- Test with production-like data size and realistic skew. At minimum, exceed RAM so you measure storage behavior.
- Test failure modes: replica lag under peak, failover behavior, and how quickly you can restore.
- Instrument first: enable pg_stat_statements or MySQL slow logs, and collect system metrics.
- Do the boring tuning: memory sizing, checkpoint/log sizing, connection pooling, and sensible autovacuum/flush settings.
- Only then compare. If one engine wins, you’ll know why—and you’ll be able to keep it winning in production.
Operational checklist: keep Postgres fast
- Monitor autovacuum activity and dead tuples; alert on long transactions blocking vacuum.
- Cap connections and use pooling for spiky traffic.
- Track temp file growth; treat spills as a query/index issue before raising memory.
- Watch checkpoint stats and WAL volume; tune for smooth writes, not heroic bursts.
- Rehearse restores and verify backups with actual restore tests.
Operational checklist: keep Percona Server fast
- Size the InnoDB buffer pool for the working set and verify with read metrics.
- Use slow query logging with a threshold that matches your SLOs.
- Monitor lock waits and long transactions; tune hot rows and transaction scope.
- Track replication lag and relay log growth; keep reporting off critical replicas.
- Validate durability settings match business risk, not benchmark fantasies.
FAQ
1) Is Percona Server “just MySQL”?
It’s MySQL-compatible but ships with performance and instrumentation features and a different operational ecosystem. Compatibility is the point; the extras are the reason people pick it.
2) Is PostgreSQL always slower for OLTP?
No. Postgres can do excellent OLTP. It tends to be less tolerant of sloppy connection handling and neglected vacuuming. If you operate it well, it’s very competitive.
3) What’s the fastest way to know if I’m I/O-bound?
Check iostat -xz for latency and utilization, and correlate with DB-level cache metrics (Postgres blks_read, InnoDB buffer pool reads). If disk latency rises with query latency, you have your answer.
4) Should I tune Postgres by increasing shared_buffers a lot?
Not blindly. shared_buffers matters, but OS page cache matters too. Focus on working set, cache hit trends, and avoiding scans. Big numbers without measurements usually buy you new failure modes.
5) Should I just set a huge InnoDB buffer pool?
Size it to fit the working set while leaving headroom for the OS, other services, and spikes. Oversizing can cause swapping or filesystem cache starvation, which is an expensive way to get slower.
6) Why do my benchmarks show massive wins that production never sees?
Because you likely benchmarked warm caches, small datasets, and perfect locality. Production has skew, background jobs, backups, schema changes, and users doing weird things at peak.
7) Which one is better for complex SQL and reporting?
Often Postgres, especially if you lean into advanced indexing and relational querying. But “better” depends on your exact queries and whether you can isolate reporting from OLTP.
8) How do I avoid picking the wrong database based on vibes?
Define the top queries, test at realistic scale, test under concurrency, and test failure modes. Make the choice based on measurable behavior you can reproduce.
9) Is replication lag a performance problem or a correctness problem?
Both. Lag is “performance” when it breaks your SLOs and “correctness” when your product assumes read-your-writes. Treat lag as a first-class metric.
Conclusion: next steps that won’t embarrass you
If you want a real answer to “PostgreSQL vs Percona Server: which is faster,” stop asking it like it’s a single number. Ask which one is faster for your workload, under your operational constraints, with your team’s habits, and with the failure modes you can’t avoid.
- Inventory the workload: top 10 query patterns, write volume, concurrency, and latency targets.
- Instrument both worlds: Postgres with pg_stat_statements and activity views; Percona with slow logs and InnoDB status, plus host metrics.
- Run a realistic test: dataset larger than RAM, same indexes, same query mix, same durability expectations.
- Do one fast diagnosis drill: deliberately introduce load, find the bottleneck using the playbook, and see which system your team can reason about faster.
- Pick the engine you can keep fast: not the one that wins a demo, but the one you can operate cleanly at 3 a.m.