PostgreSQL vs RDS PostgreSQL: performance tuning you still must do (even managed)

“Managed database” is a comforting phrase until your p95 latency doubles during a promotion, the CPU graph looks like a barcode, and your product team asks if you “can just add more RAM.” You can. It might even help. It also might do absolutely nothing while the real problem—locks, bloat, WAL pressure, or a single tragic query—keeps sawing through your SLOs.

RDS PostgreSQL removes whole classes of work: hardware swaps, minor version patching, automated backups, and a lot of dull-but-critical plumbing. It does not remove tuning. It does not remove physics. It definitely does not remove accountability.

Managed does not mean optimized

Running PostgreSQL yourself on EC2 or bare metal gives you knobs. Running PostgreSQL on RDS gives you… fewer knobs, but still plenty of ways to hurt yourself.

The difference is mostly about who owns the platform:

  • Self-managed PostgreSQL: you own everything, from the kernel, filesystem, disks, sysctls, and Postgres build flags to extensions, backups, failover, monitoring, and the 3 a.m. wake-ups.
  • RDS PostgreSQL: AWS owns the host. You own the database behavior: schema, queries, indexes, autovacuum strategy, parameter group tuning, capacity planning, and how you observe and respond.

In practice, performance tuning is still your job because performance is an emergent property of data distribution, workload, and application behavior. RDS doesn’t know your workload. RDS can’t rewrite your ORM. RDS can’t tell your product team “no” when they ship a Cartesian join disguised as “a quick reporting endpoint.”

One dry truth: RDS makes it harder to do some low-level optimizations, but it also makes it easier to stay alive through failures. If you want to be fast and reliable, you still need to treat RDS as a production system you operate—just with fewer sharp metal edges exposed.

Joke #1: The cloud is just someone else’s computer. RDS is someone else’s computer with a nice UI and the same laws of thermodynamics.

Interesting facts and historical context

  • PostgreSQL’s roots go back to the 1980s (POSTGRES at UC Berkeley). Some of today’s durability ideas were being argued over before your CI pipeline existed.
  • MVCC is why PostgreSQL reads don’t block writes (usually). It’s also why vacuum exists and why “just delete rows” is not a performance plan.
  • Autovacuum wasn’t always there; early Postgres required manual vacuuming. People “learned” vacuum the hard way: production outages.
  • PostgreSQL famously values correctness over surprising “fast but wrong” behavior. That shapes planner choices and concurrency semantics.
  • RDS for PostgreSQL launched in 2013 and normalized the idea that many teams don’t need to run database hosts themselves.
  • pg_stat_statements became mainstream as Postgres matured into an observability-first database. You don’t tune what you can’t measure.
  • Postgres 9.6 to 13+ era improvements changed vacuum behavior, parallelism, and indexing performance in ways that make “old rules” unreliable.
  • IO patterns matter more than raw CPU in many OLTP systems; the database can be “CPU idle” while waiting on storage. The graphs love lying by omission.
  • Checkpoint tuning has been a recurring theme for two decades because WAL and checkpoints are the heartbeat of durability—and the source of periodic pain.

Where RDS genuinely helps—and where it doesn’t

What RDS gives you for free (or close to it)

RDS is legitimately good at the “boring infrastructure” parts:

  • Automated backups and point-in-time recovery workflows.
  • Managed patching (within the constraints of maintenance windows and your risk appetite).
  • Multi-AZ failover orchestration, which is not the same as “no downtime,” but is still valuable.
  • Operational metrics in CloudWatch and Performance Insights, which saves you from building everything from scratch.
  • Storage scaling options (depending on storage type), reducing the “we ran out of disk at 2 a.m.” genre of incident.

What RDS does not solve (and sometimes hides)

RDS does not:

  • Fix slow queries caused by missing indexes, bad join order, stale stats, or application behavior.
  • Prevent lock contention when your migration takes an ACCESS EXCLUSIVE lock at noon.
  • Eliminate bloat from high-churn tables. MVCC still needs vacuum, and vacuum still needs headroom.
  • Make IO infinite. EBS has limits; storage throughput is a billable and saturable resource.
  • Choose the right instance class for your workload. You can pay for CPU you don’t use or starve on memory you do.
  • Protect you from “too many connections” when your app scales horizontally and each pod opens 100 sessions like it’s 2009.

The most important difference: on self-managed Postgres you can often “instrument the host” to learn what the database is waiting on. On RDS you’re mostly limited to what AWS exposes: engine metrics, OS-ish proxies, logs, and Performance Insights. That’s enough to run a great system, but only if you take observability seriously.

The tuning you still must do (even on RDS)

1) Query tuning: the planner is not your enemy, but it’s not psychic

If you tune nothing else, tune your queries. PostgreSQL performance is dominated by access paths: index scans vs sequential scans, join strategies, and how much data you force the engine to touch.

On RDS you’ll feel query mistakes faster because you can’t “just tweak the kernel” or pin processes to CPUs. That’s fine; kernel tuning is rarely your first bottleneck anyway.

What to do:

  • Enable pg_stat_statements and treat it as your leaderboard of shame.
  • Use EXPLAIN (ANALYZE, BUFFERS) and read it like a crime scene report: what touched disk, what didn’t, and why (worked example after this list).
  • Keep stats fresh (ANALYZE) and don’t assume autovacuum has your back on heavily skewed tables.
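
A minimal sketch of that workflow, assuming a hypothetical orders table and an email predicate; swap in your own top offender from pg_stat_statements. Note that ANALYZE executes the statement, so wrap data-modifying queries in a transaction you roll back.

cr0x@server:~$ psql "$DATABASE_URL" -X -c "explain (analyze, buffers) select * from orders where customer_email = 'someone@example.com';"

Read the plan for three things: estimated vs. actual row counts (bad estimates usually mean stale or insufficient statistics), shared read buffer counts (pages that came from disk rather than cache), and Rows Removed by Filter (work done and thrown away, often a sign of a missing or mismatched index).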

2) Autovacuum: the difference between “fine” and “why is disk full?”

MVCC means old row versions stick around until vacuum reclaims them. On RDS, vacuum is still your job to size and schedule. If autovacuum falls behind, you don’t just get bloat. You get:

  • Worse cache efficiency (more pages, fewer useful pages).
  • More IO (reading dead tuples is still reading).
  • Higher latency variance (vacuum thrashing at bad times).
  • Potential transaction ID wraparound risk if you neglect it long enough.

Autovacuum tuning is workload-specific. High-churn OLTP tables often need per-table settings. The default thresholds are polite. Your production workload is not.
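
A minimal sketch of per-table overrides, assuming a hypothetical hot orders table; the numbers are starting points to validate against vacuum logs and dead tuple trends, not gospel:

cr0x@server:~$ psql "$DATABASE_URL" -X -c "alter table orders set (autovacuum_vacuum_scale_factor = 0.02, autovacuum_vacuum_threshold = 1000, autovacuum_analyze_scale_factor = 0.02);"
cr0x@server:~$ psql "$DATABASE_URL" -X -c "select reloptions from pg_class where relname = 'orders';"

The second command confirms the overrides actually stuck on the table.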

3) Memory: shared_buffers is not a magic speed knob

On self-managed Postgres you can go wild with memory tuning. On RDS, you still tune the database parameters, but you also live with AWS’s OS choices. You don’t control the page cache directly; you influence it by not doing dumb IO.

Typical priorities:

  • work_mem: too low and your sorts/hash joins spill; too high and concurrency turns RAM into a smoking crater (sizing sketch after this list).
  • shared_buffers: on RDS it’s often set sensibly by default, but “sensible” is not “optimal.”
  • effective_cache_size: a planner hint; wrong values lead to wrong plan choices.
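
A hedged sketch of scoping work_mem instead of raising it globally; the role names are hypothetical:

cr0x@server:~$ psql "$DATABASE_URL" -X -c "alter role reporting set work_mem = '256MB';"
cr0x@server:~$ psql "$DATABASE_URL" -X -c "alter role app_rw set work_mem = '8MB';"

Scoping the big allowance to a reporting role keeps heavy sorts off disk without handing every API connection the same budget.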

4) IO: RDS makes it purchasable, not optional

Most RDS PostgreSQL “performance incidents” are really IO incidents with a SQL accent.

Your core choices:

  • Storage type and sizing (general purpose vs provisioned IOPS, throughput caps, burst behavior).
  • Checkpoint behavior (write spikes vs steady state).
  • Vacuum and index maintenance (background IO that becomes foreground pain).

5) WAL and checkpoints: latency spikes come in waves

Write-heavy systems will eventually meet the WAL subsystem in a dark alley. Symptoms look like periodic stalls: commits slow down, IO jumps, and throughput falls off a cliff at regular intervals.

What you tune:

  • max_wal_size and checkpoint_timeout to reduce checkpoint frequency.
  • checkpoint_completion_target to spread writes out.
  • wal_compression and wal_buffers depending on version/workload.

On RDS, you change these via parameter groups, not postgresql.conf. You also watch replica lag: WAL pressure shows up downstream.
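
Roughly what that looks like with the AWS CLI; the parameter group name is hypothetical, max_wal_size here is in MB (8192 = 8 GB), and you should confirm in the console which parameters are dynamic versus reboot-required for your engine version:

cr0x@server:~$ aws rds modify-db-parameter-group --db-parameter-group-name prod-pg-params --parameters "ParameterName=max_wal_size,ParameterValue=8192,ApplyMethod=immediate" "ParameterName=checkpoint_completion_target,ParameterValue=0.9,ApplyMethod=immediate"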

6) Connections: the slow death by a thousand sessions

PostgreSQL uses a process-per-connection model. That’s robust and simple. It’s also expensive when you treat the database like a stateless HTTP server.

On RDS, max_connections can be high, but that doesn’t mean it should be. A connection storm means context switching, memory overhead, and more contention on shared structures.

Use a pooler (PgBouncer) unless you have a good reason not to. Yes, even with RDS. Especially with Kubernetes. Your pods are not polite.
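
A minimal pgbouncer.ini sketch, assuming transaction pooling is compatible with your app (it breaks session state such as SET and advisory locks); on RDS you run the pooler on your own compute (EC2, ECS, or a sidecar), and the endpoint, pool sizes, and paths below are illustrative:

; pgbouncer.ini -- minimal sketch; endpoint, sizes, and paths are illustrative
[databases]
appdb = host=prod-pg.example.us-east-1.rds.amazonaws.com port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling reuses server connections aggressively but breaks session-level state
pool_mode = transaction
; clients pgbouncer will accept vs. actual Postgres connections it opens per database/user pair
max_client_conn = 2000
default_pool_size = 20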

7) Locks and migrations: your schema changes are production traffic too

RDS won’t stop you from running ALTER TABLE that blocks writes for minutes. It will, however, faithfully preserve the outage in CloudWatch graphs for your future self to admire.

Operationally:

  • Prefer online-ish patterns: create index concurrently, backfill in batches, swap columns carefully.
  • Set lock timeouts for migration sessions so they fail fast instead of taking the system hostage (see the sketch after this list).
  • Monitor lock graphs during deploys like you monitor error rates.
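
A minimal sketch of the session guards, reusing the orders column from the earlier example; the timeouts are illustrative and should match your traffic tolerance:

cr0x@server:~$ psql "$DATABASE_URL" -X <<'SQL'
-- fail fast if the ACCESS EXCLUSIVE lock is not granted quickly, instead of queueing in front of traffic
SET lock_timeout = '5s';
-- abort anything that unexpectedly runs long
SET statement_timeout = '15min';
ALTER TABLE orders ADD COLUMN reconciliation_id uuid;
SQL

If lock_timeout fires, the migration fails and you retry off-peak. That is the point.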

8) Replication and read scaling: replicas aren’t a cheat code

Read replicas can offload read traffic. They can also introduce new failure modes: stale reads, replica lag during write bursts, and query plans that differ because of different cache warmth.

Multi-AZ improves availability, but it can slightly affect latency, and failover events will still hurt unless your application retries correctly.

9) Observability: if you can’t explain p95, you don’t own it

On self-managed Postgres you might use system-level tooling (iostat, vmstat, perf). On RDS your best friends are:

  • Performance Insights (wait events, top SQL).
  • Enhanced Monitoring (OS-ish CPU, memory, load, disk).
  • Database logs (slow query logging, autovacuum logging, checkpoints).
  • pg_stat_* views (ground truth from inside the engine).

Paraphrasing Werner Vogels: “everything fails, all the time.” You design and operate assuming failure, not hoping it won’t happen.

Fast diagnosis playbook: find the bottleneck fast

This is the triage flow that works when the pager is noisy and your calendar is full of meetings you will now miss.

First: is it CPU, IO, locks, or connections?

  1. Check database load and wait events (Performance Insights if on RDS; pg_stat_activity everywhere). If you see lots of Lock waits, stop guessing and start looking for blockers (a quick grouping query follows this list).
  2. Check storage saturation: read/write IOPS, throughput, and latency. If latency is high and IOPS are capped, you are IO-bound no matter how many vCPUs you buy.
  3. Check connection count and active sessions. If active sessions are far above CPU cores, you’re probably queuing and context-switching.
  4. Check top SQL by total time. One query can dominate. It’s often not the one you suspect.
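
A quick way to answer the first question from psql alone; it assumes nothing beyond access to pg_stat_activity, and treats a NULL wait_event_type as “running on CPU”:

cr0x@server:~$ psql "$DATABASE_URL" -X -c "select coalesce(wait_event_type,'CPU/running') as wait_type, wait_event, count(*) from pg_stat_activity where state = 'active' group by 1, 2 order by 3 desc;"

A pile of Lock rows sends you to the blocking-chain query in Task 4 below; IO waits send you to storage metrics; mostly CPU/running with far more active sessions than cores means saturation or queuing.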

Second: is this a planner/stats problem or a data problem?

  1. Look for plan regressions: same query, new plan, worse performance.
  2. Check table bloat and dead tuples. If dead tuples are high and vacuum is behind, you’re paying IO tax.
  3. Check index usage vs sequential scans. Sequential scans are not evil; unexpected ones are suspicious.

Third: confirm with a single “smoking gun” query and fix safely

  1. Pick the top offender from pg_stat_statements.
  2. Run EXPLAIN (ANALYZE, BUFFERS) in a safe environment if possible; in production, run plain EXPLAIN first.
  3. Decide: add an index, rewrite query, adjust work_mem, update stats, or change autovacuum policy.
  4. Validate improvement using the same measurement you used to declare an incident.

Practical tasks: commands, outputs, and decisions

These are deliberately operational. Each task includes a runnable command, sample output, what it means, and the decision you make. Use them on self-managed Postgres and on RDS (from a bastion host or your laptop with network access), unless explicitly noted.

Task 1: Verify basic instance identity and version

cr0x@server:~$ psql "$DATABASE_URL" -X -c "select version();"
                                                           version
-----------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 15.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.3.0, 64-bit
(1 row)

What it means: Confirms major version and build info. Version governs planner behavior, vacuum improvements, and available features.

Decision: If you’re on an older major version, prioritize an upgrade plan before heroic tuning. Many “tuning” problems vanish with modern Postgres.

Task 2: Confirm critical extensions (pg_stat_statements)

cr0x@server:~$ psql "$DATABASE_URL" -X -c "select extname, extversion from pg_extension order by 1;"
      extname       | extversion
--------------------+------------
 pg_stat_statements | 1.10
 plpgsql            | 1.0
(2 rows)

What it means: If pg_stat_statements isn’t installed, you are debugging blind.

Decision: On RDS, enable it in the parameter group (shared_preload_libraries) and reboot if required; then create extension in each database that needs it.
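
Roughly what that looks like, assuming a hypothetical parameter group name; shared_preload_libraries is a static parameter, so the change only takes effect after a reboot (schedule it in a maintenance window), and the extension must then be created in each database:

cr0x@server:~$ aws rds modify-db-parameter-group --db-parameter-group-name prod-pg-params --parameters "ParameterName=shared_preload_libraries,ParameterValue=pg_stat_statements,ApplyMethod=pending-reboot"
cr0x@server:~$ psql "$DATABASE_URL" -X -c "create extension if not exists pg_stat_statements;"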

Task 3: Find the current pain: active sessions and wait events

cr0x@server:~$ psql "$DATABASE_URL" -X -c "\x on" -c "select pid, usename, application_name, state, wait_event_type, wait_event, now()-query_start as age, left(query,120) as query from pg_stat_activity where state <> 'idle' order by age desc limit 10;"
-[ RECORD 1 ]-------+--------------------------------------------
pid                 | 21435
usename             | app
application_name    | api
state               | active
wait_event_type     | Lock
wait_event          | relation
age                 | 00:02:11.4123
query               | update orders set status='paid' where id=$1
-[ RECORD 2 ]-------+--------------------------------------------
pid                 | 21398
usename             | app
application_name    | migrate
state               | active
wait_event_type     | Lock
wait_event          | relation
age                 | 00:05:43.9921
query               | alter table orders add column reconciliation_id uuid;

What it means: You’re blocked on a relation lock; a migration is likely blocking OLTP updates.

Decision: Identify the blocker, stop/rollback the migration, and change your schema change approach (online patterns, lock_timeout, deploy sequencing).

Task 4: Identify blocking chains (who is blocking whom)

cr0x@server:~$ psql "$DATABASE_URL" -X -c "select blocked.pid as blocked_pid, blocked.query as blocked_query, blocker.pid as blocker_pid, blocker.query as blocker_query from pg_locks bl join pg_stat_activity blocked on blocked.pid=bl.pid join pg_locks kl on kl.locktype=bl.locktype and kl.database is not distinct from bl.database and kl.relation is not distinct from bl.relation and kl.page is not distinct from bl.page and kl.tuple is not distinct from bl.tuple and kl.virtualxid is not distinct from bl.virtualxid and kl.transactionid is not distinct from bl.transactionid and kl.classid is not distinct from bl.classid and kl.objid is not distinct from bl.objid and kl.objsubid is not distinct from bl.objsubid and kl.pid <> bl.pid join pg_stat_activity blocker on blocker.pid=kl.pid where not bl.granted;"
 blocked_pid |             blocked_query              | blocker_pid |                 blocker_query
------------+----------------------------------------+------------+-----------------------------------------------
      21435 | update orders set status='paid' ...     |      21398 | alter table orders add column ...
(1 row)

What it means: Concrete blocker identification. Now it’s not “Postgres is slow,” it’s “this PID is blocking 40 sessions.”

Decision: Terminate the blocker if appropriate, then fix the deployment process to avoid ACCESS EXCLUSIVE locks in peak hours.
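
If you decide to kill it, use the blocker PID from the query above; cancel the statement first, terminate the session only if cancel is ignored:

cr0x@server:~$ psql "$DATABASE_URL" -X -c "select pg_cancel_backend(21398);"
cr0x@server:~$ psql "$DATABASE_URL" -X -c "select pg_terminate_backend(21398);"

A terminated migration rolls back: whatever DDL its transaction had done is undone, which mid-incident is usually exactly what you want.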

Task 5: Check connection pressure

cr0x@server:~$ psql "$DATABASE_URL" -X -c "select count(*) as total, sum(case when state='active' then 1 else 0 end) as active from pg_stat_activity;"
 total | active
-------+--------
  612  | 184
(1 row)

What it means: 612 total connections is not automatically bad; 184 active sessions can be disastrous on a 16 vCPU box.

Decision: If active sessions exceed cores for sustained periods, implement pooling, lower max_connections, and fix chatty application patterns (N+1 queries, per-request transactions, etc.).

Task 6: Find the top SQL by total time (pg_stat_statements)

cr0x@server:~$ psql "$DATABASE_URL" -X -c "select queryid, calls, round(total_exec_time::numeric,1) as total_ms, round(mean_exec_time::numeric,3) as mean_ms, rows, left(query,120) as query from pg_stat_statements order by total_exec_time desc limit 5;"
 queryid  |  calls  | total_ms | mean_ms |  rows   |                          query
----------+---------+----------+---------+---------+-----------------------------------------------------------
 91827364 |  184920 | 812349.7 |   4.392 | 184920  | select * from users where email=$1
 11223344 |   12001 | 643221.9 |  53.599 |  12001  | select o.* from orders o join order_items i on ...
(2 rows)

What it means: Total time shows where the database spent its life. High mean time indicates a “slow query.” High calls with moderate mean can still dominate.

Decision: Start with the top total time query unless a single query is causing visible tail latency. Then inspect plans and indexes.

Task 7: Check index usage vs sequential scans

cr0x@server:~$ psql "$DATABASE_URL" -X -c "select relname, seq_scan, idx_scan, n_live_tup from pg_stat_user_tables order by seq_scan desc limit 10;"
    relname     | seq_scan | idx_scan | n_live_tup
---------------+----------+----------+------------
 events        |   982134 |    21011 |    4839201
 order_items   |   431201 |   892301 |     983211
(2 rows)

What it means: High seq_scan on a large table might be fine for analytics; it’s suspicious for OLTP endpoints.

Decision: If seq scans correlate with slow queries, add targeted indexes or rewrite queries. If the workload is analytics, consider partitioning, BRIN indexes, or moving reporting elsewhere.
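
A hedged example of the BRIN option for an append-mostly table; the events table and created_at column are assumptions, and BRIN only pays off when physical row order correlates with the indexed column:

cr0x@server:~$ psql "$DATABASE_URL" -X -c "create index concurrently if not exists events_created_at_brin on events using brin (created_at);"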

Task 8: Spot bloat risk via dead tuples and vacuum history

cr0x@server:~$ psql "$DATABASE_URL" -X -c "select relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze from pg_stat_user_tables order by n_dead_tup desc limit 10;"
   relname   | n_live_tup | n_dead_tup | last_vacuum |     last_autovacuum     | last_analyze |     last_autoanalyze
------------+------------+------------+-------------+--------------------------+--------------+--------------------------
 events     |    4839201 |    1720033 |             | 2025-12-30 08:12:01+00   |              | 2025-12-30 08:13:44+00
 orders     |     312001 |      91011 |             | 2025-12-29 23:02:11+00   |              | 2025-12-29 23:03:09+00
(2 rows)

What it means: A large dead tuple count indicates vacuum is either behind or thresholds are too high for churn. It’s also a hint that index bloat may be growing.

Decision: Tune autovacuum thresholds per table and consider more aggressive vacuuming during off-peak. If bloat is already severe, plan a controlled rebuild (VACUUM FULL is rarely the right answer on production).
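
For index bloat specifically, a rebuild that avoids holding an exclusive lock for the whole operation is usually the right tool on Postgres 12+; the index name here is hypothetical, and for heap bloat look at pg_repack where it is available rather than VACUUM FULL:

cr0x@server:~$ psql "$DATABASE_URL" -X -c "reindex index concurrently events_created_at_idx;"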

Task 9: Check autovacuum settings (global and per table)

cr0x@server:~$ psql "$DATABASE_URL" -X -c "show autovacuum; show autovacuum_vacuum_scale_factor; show autovacuum_vacuum_threshold; show autovacuum_max_workers;"
 autovacuum
------------
 on
(1 row)

 autovacuum_vacuum_scale_factor
-------------------------------
 0.2
(1 row)

 autovacuum_vacuum_threshold
----------------------------
 50
(1 row)

 autovacuum_max_workers
-----------------------
 3
(1 row)

What it means: A scale factor of 0.2 means autovacuum triggers once dead tuples exceed roughly 20% of the table’s rows plus the fixed threshold (50). On big churn tables, that’s often too late.

Decision: For hot tables, set lower scale factor via ALTER TABLE ... SET (autovacuum_vacuum_scale_factor=0.02, autovacuum_vacuum_threshold=1000) and increase workers if IO allows. Validate via vacuum logs.

Task 10: Detect checkpoint pressure

cr0x@server:~$ psql "$DATABASE_URL" -X -c "select checkpoints_timed, checkpoints_req, round(checkpoint_write_time/1000.0,1) as write_s, round(checkpoint_sync_time/1000.0,1) as sync_s from pg_stat_bgwriter;"
 checkpoints_timed | checkpoints_req | write_s | sync_s
------------------+-----------------+---------+--------
              412 |             289 |  9821.4 |  612.7
(1 row)

What it means: High checkpoints_req relative to timed suggests you’re hitting WAL size limits and forcing checkpoints. That’s a common cause of periodic write stalls.

Decision: Increase max_wal_size, adjust checkpoint_timeout and checkpoint_completion_target, then watch write latency and replica lag.

Task 11: Check WAL generation rate proxy

cr0x@server:~$ psql "$DATABASE_URL" -X -c "select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) as wal_written_total;"
 wal_written_total
-------------------
 512 GB
(1 row)

What it means: This is a rough proxy: the LSN measures all WAL written since the cluster was initialized, not a rate. Sample it twice a few minutes apart and diff the results to estimate WAL per minute. If WAL churn is massive, replication and storage will feel it. (On Postgres 13+, pg_stat_wal gives more precise numbers.)

Decision: If WAL volume is unexpectedly high, investigate bulk updates, index maintenance, and vacuum behavior; consider batching writes and avoiding pointless updates.

Task 12: Validate table and index sizes (find the elephants)

cr0x@server:~$ psql "$DATABASE_URL" -X -c "select relname, pg_size_pretty(pg_total_relation_size(relid)) as total_size from pg_catalog.pg_statio_user_tables order by pg_total_relation_size(relid) desc limit 10;"
  relname  | total_size
-----------+------------
 events    | 219 GB
 orders    | 38 GB
(2 rows)

What it means: Big relations drive cache behavior and vacuum cost. On RDS, they also drive storage cost and snapshot time.

Decision: For truly large append-heavy tables, consider partitioning, BRIN indexes, or a data retention policy. If size doesn’t match your mental model, you probably have bloat or runaway retention.

Task 13: Check if sorts/hash joins are spilling to disk

cr0x@server:~$ psql "$DATABASE_URL" -X -c "select datname, temp_files, pg_size_pretty(temp_bytes) as temp_written from pg_stat_database order by temp_bytes desc;"
  datname  | temp_files | temp_written
-----------+------------+--------------
 appdb     |      18231 | 97 GB
(1 row)

What it means: Large temp usage means queries are spilling (often because work_mem is too low for the query shape or because the query shape is bad).

Decision: Identify the offending queries via pg_stat_statements; consider increasing work_mem cautiously (or per role), or rewriting queries to reduce sort/hash requirements.

Task 14: Confirm statistics freshness on key tables

cr0x@server:~$ psql "$DATABASE_URL" -X -c "select relname, last_analyze, last_autoanalyze from pg_stat_user_tables where relname in ('orders','order_items','events');"
   relname    |     last_analyze      |     last_autoanalyze
--------------+-----------------------+--------------------------
 orders       |                       | 2025-12-29 23:03:09+00
 order_items  | 2025-12-28 01:12:43+00 | 2025-12-30 08:13:44+00
 events       |                       | 2025-12-30 08:13:44+00
(3 rows)

What it means: If analyze is stale, the planner estimates row counts badly, leading to poor join choices and scan types.

Decision: Run manual ANALYZE for critical tables after bulk loads or large churn events; tune autovacuum_analyze_scale_factor where needed.

Task 15: Measure cache hit ratio (with skepticism)

cr0x@server:~$ psql "$DATABASE_URL" -X -c "select datname, round(100.0*blks_hit/nullif(blks_hit+blks_read,0),2) as cache_hit_pct from pg_stat_database where datname=current_database();"
 datname | cache_hit_pct
---------+---------------
 appdb   |         98.71
(1 row)

What it means: High hit ratio is good, but it can hide pain: you can be 99% cached and still slow due to locks or CPU.

Decision: Use hit ratio as context, not proof. If IO latency is high despite high cache hit, you may be paying for WAL/checkpoints or temp spills.

Task 16 (RDS-specific): pull recent engine logs for autovacuum and checkpoints

cr0x@server:~$ aws rds describe-db-log-files --db-instance-identifier prod-pg --file-last-written 1735530000000
{
  "DescribeDBLogFiles": [
    {
      "LogFileName": "postgresql.log.2025-12-30-08",
      "LastWritten": 1735543561000,
      "Size": 12839210
    }
  ]
}

What it means: You can’t tail /var/log like on self-managed. You use RDS APIs to find the log segments you care about.

Decision: If you don’t have autovacuum and checkpoint logging enabled, enable it now. Tuning without logs is superstition with extra steps.
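
Roughly what enabling that looks like via the parameter group; the parameter names are real PostgreSQL settings, the group name and thresholds are illustrative (log_autovacuum_min_duration=0 logs every autovacuum run, which can be noisy, and log_min_duration_statement is in milliseconds):

cr0x@server:~$ aws rds modify-db-parameter-group --db-parameter-group-name prod-pg-params --parameters "ParameterName=log_autovacuum_min_duration,ParameterValue=0,ApplyMethod=immediate" "ParameterName=log_checkpoints,ParameterValue=1,ApplyMethod=immediate" "ParameterName=log_min_duration_statement,ParameterValue=500,ApplyMethod=immediate"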

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

The company had moved a core OLTP database from self-managed PostgreSQL on EC2 to RDS PostgreSQL. The migration was a win: fewer on-call pages, backups worked, patching stopped being a quarterly fear festival. A month later, the first real growth spike hit.

The team assumed RDS storage “just scales.” It did scale in capacity. Performance didn’t scale the way they expected. During a traffic surge, p95 API latency climbed, but CPU stayed moderate. The initial reaction was classic: “it’s not CPU, so it must be the app.” They rolled back a deployment twice. Nothing changed.

Performance Insights showed sessions waiting on IO. EBS latency had crept up and IOPS was pegged against the volume’s limits. The workload had changed: more writes per request, larger rows, higher churn. Autovacuum was also behind, because bloat made every read heavier and every vacuum more expensive.

The wrong assumption wasn’t that AWS would fail. It was that “managed” implied “elastic performance.” The fix was dull and effective: move to a storage configuration with appropriate IOPS/throughput, tune checkpoint parameters to reduce write spikes, and tighten autovacuum on the hottest tables. They also added an alert on storage latency and IOPS saturation so they’d never again discover storage physics through user complaints.

Mini-story 2: The optimization that backfired

A different org had a habit: every time a query was slow, someone added an index. At first it worked. The dashboards looked better. People felt productive. Then the write load increased.

On RDS PostgreSQL, they added multiple overlapping indexes on a high-churn table—variations of the same composite keys, plus a couple “just in case” indexes for reporting. Write latency slowly rose. Autovacuum started running longer. Checkpoints got heavier. Replica lag became a daily visitor.

The backfire was predictable: each new index increased write amplification. Every INSERT/UPDATE now had more index maintenance. Vacuum needed to clean more index pages. The workload became more IO-bound, and the storage bill politely climbed as they “fixed performance” by throwing IOPS at it.

When they finally audited index usage, several indexes had near-zero scans. The team removed redundant indexes, replaced a few with better composite indexes aligned to real predicates, and offloaded reporting queries to a replica with stricter query governance. Writes got faster, vacuum calmed down, and replica lag stopped being a personality trait.

Mini-story 3: The boring but correct practice that saved the day

A fintech-ish company ran RDS PostgreSQL with Multi-AZ and a read replica for reporting. Nothing glamorous. Their performance tuning culture was even less glamorous: weekly review of top SQL, monthly vacuum/analyze sanity checks, and a policy that every migration must declare its lock behavior.

Then a vendor integration began hammering their API with bursty traffic, amplifying one endpoint that performed an expensive join. The query wasn’t catastrophic in isolation, but the burst created a thundering herd of identical requests. Connection count rose, active sessions piled up, and the system flirted with timeouts.

Because they had the boring practices, diagnosis took minutes: pg_stat_statements showed the query’s surge, Performance Insights showed CPU plus some IO, and their dashboards already broke out DB load by wait type. They applied a two-part fix: a targeted index that matched the join predicate and a change in the app to enforce request coalescing and caching for that endpoint.

The best part: they didn’t need a war room. No heroics. No “we should rewrite in Rust.” They just followed the playbook they’d been following all year, which is the closest thing to magic that production systems allow.

Joke #2: The only “set and forget” database is the one nobody uses. The moment it’s popular, it’s a job again.

Common mistakes: symptom → root cause → fix

  • Symptom: Periodic latency spikes every few minutes
    Root cause: Checkpoint storms due to low max_wal_size or aggressive checkpoint_timeout
    Fix: Increase max_wal_size, raise checkpoint_timeout, set checkpoint_completion_target near 0.9, and watch write latency and checkpoints_req.
  • Symptom: CPU low, requests slow, “database is fine” claims in chat
    Root cause: IO-bound workload: EBS latency high, IOPS/throughput cap, temp spills, or WAL pressure
    Fix: Validate with wait events and storage metrics; reduce IO via indexes/query fixes, tune work_mem, increase IOPS/throughput if needed.
  • Symptom: Disk usage steadily grows, queries slow down over weeks
    Root cause: Table/index bloat from MVCC + autovacuum lag or misconfigured thresholds
    Fix: Tune autovacuum per table; schedule maintenance; rebuild worst offenders using online-safe methods where possible.
  • Symptom: Sudden outage during deployment; many sessions waiting on locks
    Root cause: Migration took an ACCESS EXCLUSIVE lock (ALTER TABLE, index build without CONCURRENTLY, long transaction)
    Fix: Use CREATE INDEX CONCURRENTLY, backfill in batches, set lock_timeout, and run lock-aware migration tooling/process.
  • Symptom: Read replica lag spikes during bulk updates
    Root cause: WAL generation exceeds replica apply capacity; heavy vacuum/index churn adds more WAL
    Fix: Batch writes, reduce redundant indexes, tune checkpoint/WAL, and don’t route freshness-sensitive reads to replicas without safeguards.
  • Symptom: Increasing timeouts as the app scales horizontally
    Root cause: Connection storm; too many sessions and memory overhead; lock contention amplified by concurrency
    Fix: Add PgBouncer, cap app concurrency, lower max_connections, use sane pool sizing, and remove per-request connection creation.
  • Symptom: Query plan suddenly worse after a data growth event
    Root cause: Stale statistics or skewed distributions; planner estimates wrong cardinalities
    Fix: ANALYZE critical tables, increase statistics target for skewed columns, review query predicates and indexes.
  • Symptom: “We added an index and writes got slower”
    Root cause: Index overuse causing write amplification and vacuum overhead
    Fix: Audit index usage; remove unused/redundant indexes; design composite indexes that match real predicates; consider partial indexes.

Checklists / step-by-step plan

Week 1: Make performance observable

  1. Enable pg_stat_statements (and ensure it survives restarts via shared_preload_libraries).
  2. Enable slow query logging with a sane threshold for your SLOs; log duration and lock waits.
  3. Turn on autovacuum logging (at least for long vacuums) and checkpoint logging.
  4. Set up dashboards: DB load, wait events, active sessions, IOPS/throughput, storage latency, replica lag.
  5. Decide who owns “top queries review” weekly. If it’s “nobody,” it will be “incident commander” later.

Week 2: Stop the bleeding (high ROI fixes)

  1. From pg_stat_statements, pick top 5 by total time and top 5 by mean time; fix at least two.
  2. Add or correct indexes only after reading EXPLAIN and confirming predicate alignment.
  3. Reduce connection count: implement pooling and set sane pool sizes per service.
  4. Identify top churn tables; set per-table autovacuum thresholds and analyze thresholds.

Week 3–4: Make it boring

  1. Create a migration policy: lock_timeout, statement_timeout, index concurrently, backfills in batches.
  2. Capacity plan storage IOPS/throughput; alert before saturation, not after.
  3. Review WAL/checkpoint settings; eliminate forced checkpoint patterns.
  4. Establish a “query budget” culture: endpoints with DB-heavy operations get performance tests and guardrails.

FAQ

1) Is RDS PostgreSQL slower than self-managed PostgreSQL?

Not inherently. Many workloads run the same or better because the platform is stable and well-maintained. The “slowness” usually comes from mismatched instance/storage sizing, insufficient observability, or assuming defaults are optimal.

2) What tuning knobs do I lose on RDS that matter?

You lose host-level control: filesystem choices, kernel parameters, direct disk tools, and some extension flexibility. Most performance wins still come from query/index design, autovacuum strategy, and IO/WAL tuning—knobs you still have.

3) Do I really need pg_stat_statements on RDS?

Yes. Without it you can still guess using logs, but you’ll waste time and ship half-fixes. pg_stat_statements gives you ranking by total time and mean time, which is how you prioritize.

4) Should I crank up work_mem to stop temp spills?

Only with math. work_mem is per operation, per query, per session. If you set it high and have high concurrency, you can OOM the instance or push the OS into ugly swapping behavior. Prefer targeted fixes: query rewrites, indexes, or per-role settings for known heavy sessions.
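
A quick back-of-the-envelope with illustrative numbers: 200 active sessions × 2 sort/hash nodes per query × 64 MB work_mem is roughly 25 GB of potential memory, before shared_buffers and per-connection overhead. Run that multiplication against your instance class before touching the global setting.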

5) Why is my CPU low but latency high?

Because the database is waiting, not working. Lock waits, IO waits, and network waits don’t show up as CPU. Use wait events and storage latency metrics to identify the real constraint.

6) Are read replicas a safe way to scale reads?

They’re useful, but not free. Replica lag is real, and your application must tolerate staleness or route freshness-sensitive reads to primary. Also, heavy reporting queries can still hurt a replica and create operational noise.

7) What’s the biggest autovacuum mistake on RDS?

Relying on defaults for high-churn tables. Defaults are conservative to avoid surprise load. Production workloads are rarely conservative. Set per-table thresholds for the tables that matter, then validate via logs and dead tuple trends.

8) Can I “just scale up” the RDS instance to fix performance?

Sometimes. Scaling up helps CPU-bound and memory-bound cases. It doesn’t fix lock contention, bad queries, forced checkpoints, or storage caps. Scale as part of a diagnosis, not as a substitute for one.

9) How do I prevent migrations from taking the database down?

Use lock-aware migration patterns, set lock_timeout and statement_timeout for migration roles, run heavyweight operations off-peak, and monitor locks during deploys. Test migration lock behavior on production-like data volume.

10) What’s the first metric you alert on for “database is slow”?

Database load by wait type (from Performance Insights or equivalent) plus storage latency. “CPU > 80%” alone is a great way to be wrong quickly.

Conclusion: next steps you can do this week

Self-managed PostgreSQL gives you more levers; RDS gives you fewer levers and more safety rails. Either way, the database is still a living system. It accumulates history (MVCC), it negotiates concurrency (locks), and it depends on storage physics (IO and WAL). Performance tuning remains your responsibility because the workload is yours.

Do these next:

  1. Turn on pg_stat_statements and slow query logging. Make it impossible to argue about what’s slow.
  2. Run the fast diagnosis playbook during the next slowdown and record the outcome: CPU, IO, locks, or connections.
  3. Fix two top queries with EXPLAIN-guided changes, not folklore.
  4. Pick your top three churn tables and tune autovacuum per table; validate with dead tuples and vacuum logs.
  5. Review checkpoints_req vs checkpoints_timed; tune WAL/checkpoint parameters to reduce write spikes.
  6. Cap and pool connections. Your database is not a chat room.