“Randomly slow” is what people say when they don’t yet have a metric that explains the pain. Your app is fine for an hour, then a checkout takes 12 seconds, then everything’s normal again. Nobody changed anything (except the things that definitely changed). You’re on Debian 13, PostgreSQL is “healthy”, and yet the database feels like it’s thinking about its life choices.
This isn’t a vibes problem. It’s almost always one of a few bottlenecks rotating in and out: IO latency, WAL pressure, lock waits, memory reclaim, autovacuum, CPU steal, DNS lookups, or connection storms. The trick is to stop staring at top and start running checks that separate “slow query” from “slow system”.
Fast diagnosis playbook (first/second/third)
If you’re on call, you don’t have time for a philosophy seminar. You need a sequence that narrows the search space fast, without accidentally “fixing” the symptom by restarting things.
First: prove whether it’s PostgreSQL or the host
- Check if sessions are waiting. If you see lots of wait_event activity, this is likely contention or IO, not “slow SQL”.
- Check storage latency right now. If reads/writes are spiking into tens or hundreds of milliseconds, PostgreSQL will look “randomly slow” no matter how clever your indexes are.
- Check CPU run queue and steal. If you’re CPU-starved (or on a noisy VM host), queries stall even when IO is fine.
Second: identify the bottleneck category
- Locks: blocked sessions pile up; one bad transaction holds a lock and ruins everyone’s day.
- WAL / commits: “simple updates are slow” with lots of time in commit; latency points at WAL flush/device cache.
- Autovacuum: periodic slowness; IO increases; vacuum or analyze activity coincides with app complaints.
- Connections: spikes in connection creation; CPU/latency jumps; the database spends time accepting/logging/auth.
Third: confirm with one “ground truth” metric
- For IO: await and utilization in iostat, plus PostgreSQL’s pg_stat_io if available.
- For locks: pg_locks with blocker/blocked mapping.
- For WAL: pg_stat_wal, checkpoints, and pg_stat_bgwriter; correlate with device writes.
- For autovacuum: pg_stat_progress_vacuum and table bloat indicators.
Once you know the category, fixes become mechanical. Until then, every “tuning” change is just a new way to be wrong.
Interesting facts and context (why this happens)
- PostgreSQL’s durability defaults are conservative. fsync=on and synchronous WAL flushes mean storage lies become your latency problem, not PostgreSQL’s.
- Linux will happily use “free” RAM for page cache. That’s good… until memory pressure triggers reclaim, which can look like random stalls.
- Autovacuum exists because PostgreSQL uses MVCC. Old row versions don’t disappear until vacuum cleans them. Ignore it and your “random” slowness becomes permanent.
- Checkpoint tuning has a long history of misunderstanding. People raise checkpoint_timeout to “reduce IO” and accidentally create huge checkpoint spikes later.
- Connection-per-request architectures were common in early web stacks. They still show up in modern systems through misconfigured pools, and they still hurt.
- PostgreSQL 9.6 introduced major autovacuum improvements. The database got better at self-maintenance, but it still needs sane table design and monitoring.
- Modern NVMe can be fast and still cause stalls. Latency spikes from firmware garbage collection or write cache behavior can turn “fast storage” into “moody storage.”
- Debian’s defaults are stability-first. That’s great for not exploding; it also means you should actively choose performance settings for a database host.
- WAL compression and checksums are trade-offs. They can improve IO patterns or safety, but they also change CPU and latency characteristics.
One quote worth keeping around when you’re tempted to guess: “Hope is not a strategy,” a line often attributed to General Gordon R. Sullivan.
The 8 checks that reveal the real bottleneck
Check 1: Are queries slow, or are they waiting?
“Slow” is often “waiting.” PostgreSQL distinguishes active CPU work from waiting on locks, IO, WAL flush, client reads, and a dozen other things. The fastest way to stop guessing is to look at wait_event_type and wait_event.
If most sessions are active with no waits, you have CPU or query plan problems. If most are waiting on Lock, your root cause is lock contention. If they’re waiting on IO or WAL, it’s storage and write path behavior.
Check 2: Is the host showing disk latency spikes?
“Random slowness” is the natural language description of latency variance. Databases hate variance. A few 200ms writes don’t look scary on throughput graphs, but they turn commits into timeouts and make your app feel haunted.
On Debian 13, use iostat for immediate visibility. You’re looking for await and %util. High %util with rising await means the device is saturated or queueing badly. Low %util with high await often hints at underlying storage issues (virtualization, controller, multipath, write cache flushes).
Check 3: Is WAL the real bottleneck (commit latency)?
Many teams chase “slow queries” when the real pain is commits. If your workload is write-heavy, every transaction needs WAL. If WAL flush latency spikes, everything that commits becomes slow—even tiny updates.
Symptoms: inserts/updates that normally take milliseconds occasionally take seconds, and reads aren’t affected as much. That points away from query plans and toward the write path.
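If you want a ground-truth number for flush latency on the WAL device, pg_test_fsync (shipped with PostgreSQL) measures how long each sync method actually takes on that filesystem. The path below assumes Debian’s packaging layout for a version 15 cluster and the default PGDATA location; adjust both, and prefer running it outside the incident window since it generates its own writes.
cr0x@server:~$ sudo -u postgres /usr/lib/postgresql/15/bin/pg_test_fsync -s 5 -f /var/lib/postgresql/15/main/pg_wal/pg_test_fsync.tmp
Single-digit milliseconds per flush is comfortable for OLTP; tens of milliseconds lines up exactly with the “simple updates are slow” symptom.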
Check 4: Are checkpoints and background writer behavior spiking IO?
Checkpoints are necessary. They’re also a common source of periodic stalls when mis-tuned or when the storage can’t absorb bursts. PostgreSQL tries to spread checkpoint IO, but it can still bunch up under pressure.
If you see regular “every N minutes everything slows down,” suspect checkpoint behavior, autovacuum, or external jobs (backups, scrubs, log rotation). Don’t optimize blind. Measure.
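Before blaming checkpoints, check what they are actually configured to do. A minimal query against standard settings:
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_timeout', 'checkpoint_completion_target',
               'max_wal_size', 'min_wal_size');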
Check 5: Are locks stacking up behind one transaction?
Lock contention is the most “random-seeming” database slowdown because it depends on timing. A long transaction that holds a lock for 30 seconds isn’t a problem… until the wrong 30 seconds.
Your goal is to identify the blocker and the victim list, then decide whether to kill the blocker, fix the code path, or change isolation/locking behavior.
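A quick sketch using pg_blocking_pids(), which returns the PIDs holding up a given session; the fuller blocker/victim mapping in Task 7 adds user and application context:
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       now() - xact_start    AS xact_age,
       left(query, 80)       AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0
ORDER BY xact_age DESC NULLS LAST;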
Check 6: Is autovacuum competing with you (or failing silently)?
Autovacuum is like taking out the trash: you only notice it when nobody did it. When it’s too aggressive, it competes with your workload for IO and CPU. When it’s too weak, bloat grows, indexes swell, and performance degrades in slow motion.
On a busy system, a poorly tuned autovacuum can cause periodic IO spikes that look random to application teams but are extremely regular to anyone who graphs them.
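Two quick reads on autovacuum: is it running right now, and which tables are accumulating dead tuples? Both views are standard; the LIMIT just keeps the output readable.
SELECT pid, relid::regclass AS relname, phase, heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;

SELECT relname, n_dead_tup, n_live_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;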
Check 7: Is memory pressure causing reclaim or swapping?
PostgreSQL plus Linux page cache plus “just one more sidecar” can push a host into reclaim. You’ll see CPU usage, but it’s not useful CPU; it’s the kernel trying to find memory. The database “randomly slows” because it’s literally waiting for memory pages to become available.
If you ever see swap activity on a PostgreSQL host and you didn’t explicitly plan for it, treat it like a smoke alarm. It might not be a fire, but you don’t ignore it.
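If your kernel exposes pressure stall information (recent Debian kernels generally do), /proc/pressure gives a direct read on how much time tasks spend stalled on memory or IO, which is exactly the “busy but not useful” CPU described above:
cr0x@server:~$ cat /proc/pressure/memory
cr0x@server:~$ cat /proc/pressure/io
Non-trivial percentages on the “full” line mean all non-idle tasks were stalled at the same time; that is the stall your users feel.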
Check 8: Is your networking/auth/DNS path injecting latency?
Yes, DNS can slow down PostgreSQL. So can reverse DNS lookups in logging, LDAP auth hiccups, or packet loss between app and DB. The database can be fine while connections stall or authentication pauses. “Randomly slow” from the app perspective, perfectly consistent from the network’s perspective.
Also: connection storms. When a pool breaks and every request starts its own connection, PostgreSQL spends time on process management and auth overhead. Your queries didn’t get slower; your system got busier doing the wrong work.
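Two cheap checks; the hostname below is a placeholder, everything else is standard. First, confirm PostgreSQL is not doing a reverse DNS lookup for every connection it logs; second, time name resolution from the application host.
SELECT name, setting
FROM pg_settings
WHERE name IN ('log_hostname', 'log_connections', 'authentication_timeout');
cr0x@server:~$ time getent hosts db.example.internal   # placeholder hostname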
Joke #1: “Random slowness” is what a system says when it wants you to install monitoring but you keep offering it positive vibes.
Hands-on tasks: commands, output meaning, decisions
Below are practical tasks you can run on Debian 13 and in PostgreSQL. Each one includes: a command, what the output means, and the decision you make from it. Run them during a slowdown if you can; otherwise, run them now to establish baselines.
Task 1: See who’s waiting and on what
cr0x@server:~$ sudo -u postgres psql -X -c "SELECT now(), state, wait_event_type, wait_event, count(*) FROM pg_stat_activity GROUP BY 1,2,3,4 ORDER BY count(*) DESC;"
now | state | wait_event_type | wait_event | count
-------------------------------+--------+-----------------+------------+-------
2025-12-29 10:14:02.12345+00 | active | | | 12
2025-12-29 10:14:02.12345+00 | active | IO | DataFileRead | 7
2025-12-29 10:14:02.12345+00 | active | Lock | relation | 5
2025-12-29 10:14:02.12345+00 | idle | Client | ClientRead | 30
What it means: You’re categorizing pain. IO/DataFileRead suggests read latency or cache misses. Lock/relation suggests contention. Lots of ClientRead means clients are idle, not the DB.
Decision: If waits concentrate in one type, jump straight to the relevant check (IO, locks, WAL). If it’s mostly active with no waits, focus on CPU and query plans.
Task 2: Identify the single worst currently running query
cr0x@server:~$ sudo -u postgres psql -X -c "SELECT pid, now()-query_start AS runtime, state, wait_event_type, wait_event, left(query, 120) AS query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC LIMIT 5;"
pid | runtime | state | wait_event_type | wait_event | query
------+-----------+--------+-----------------+--------------+--------------------------------------------------------
8421 | 00:02:31 | active | Lock | transactionid| UPDATE orders SET status = $1 WHERE id = $2
9112 | 00:01:04 | active | IO | DataFileRead | SELECT ... FROM order_items WHERE order_id = $1
What it means: If the top runtime query is waiting on a lock, it’s not “slow SQL”; it’s blocked. If it’s waiting on IO, you’re chasing storage or cache behavior.
Decision: For lock waits, find the blocker (Task 7). For IO waits, check device latency (Task 4) and cache hit ratio (Task 10).
Task 3: Host load and run queue sanity check
cr0x@server:~$ uptime
10:14:10 up 38 days, 2:11, 3 users, load average: 18.42, 17.90, 16.12
What it means: Load average is not CPU usage; it’s runnable plus uninterruptible tasks (including tasks stuck in IO). A load of 18 on this 16-core box means work is queueing: either on CPU or in IO waits.
Decision: Correlate with Task 4 (IO) and Task 5 (CPU). High load + high IO wait points to storage. High load + high CPU points to compute or query plans.
Task 4: Measure disk latency and saturation with iostat
cr0x@server:~$ sudo iostat -xz 1 5
Linux 6.12.0-debian (db01) 12/29/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
22.11 0.00 6.90 18.30 0.00 52.69
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await wareq-sz aqu-sz %util
nvme0n1 820.0 55200.0 0.0 0.0 12.4 67.3 610.0 48800.0 45.8 80.0 42.3 98.7
What it means: w_await at ~46ms with %util ~99% is a red flag. PostgreSQL commits and checkpoints will feel slow. %iowait is also high.
Decision: Treat storage as the primary suspect. Next steps: check if it’s the same device as PGDATA, confirm write cache settings, investigate competing IO, and verify filesystem mount options.
Task 5: Check CPU saturation vs IO wait quickly
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.12.0-debian (db01) 12/29/2025 _x86_64_ (16 CPU)
12:14:20 PM CPU %usr %sys %iowait %steal %idle
12:14:21 PM all 28.0 7.0 2.0 0.0 63.0
12:14:21 PM 3 92.0 5.0 0.0 0.0 3.0
12:14:21 PM 7 88.0 6.0 0.0 0.0 6.0
What it means: A few hot CPUs may indicate a single-threaded hotspot (often one query, one index build, one autovacuum worker, or one WAL writer path). %steal non-zero on VMs indicates noisy neighbors.
Decision: If CPU is pegged with low IO wait, inspect top queries and plans. If steal is high, involve the virtualization/platform team.
Task 6: Catch memory pressure and swapping
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 64Gi 51Gi 1.2Gi 1.1Gi 12Gi 7.8Gi
Swap: 8Gi 2.5Gi 5.5Gi
What it means: Swap in use on a DB host is a performance tax. It may be historical (swapped once, never paged back) or active thrash.
Decision: Confirm swap activity with vmstat (Task 12). If actively swapping, reduce memory pressure: lower shared_buffers if it’s reckless, reduce connection count, stop co-located workloads, or add RAM.
Task 7: Find blockers and victims (lock contention)
cr0x@server:~$ sudo -u postgres psql -X -c "
WITH locked AS (
SELECT pid, locktype, relation::regclass AS rel, transactionid, mode, granted
FROM pg_locks
),
activity AS (
SELECT pid, usename, application_name, client_addr, state, now()-xact_start AS xact_age, left(query,120) AS query
FROM pg_stat_activity
)
SELECT a.pid AS blocked_pid, a.usename, a.application_name, a.xact_age, a.query,
b.pid AS blocker_pid, b.usename AS blocker_user, b.application_name AS blocker_app, b.xact_age AS blocker_xact_age, b.query AS blocker_query
FROM locked l1
JOIN locked l2 ON l1.locktype = l2.locktype
AND coalesce(l1.rel::text,'') = coalesce(l2.rel::text,'')
AND coalesce(l1.transactionid::text,'') = coalesce(l2.transactionid::text,'')
AND l1.pid <> l2.pid
JOIN activity a ON a.pid = l1.pid
JOIN activity b ON b.pid = l2.pid
WHERE NOT l1.granted AND l2.granted
ORDER BY a.xact_age DESC;
"
blocked_pid | usename | application_name | xact_age | query | blocker_pid | blocker_user | blocker_app | blocker_xact_age | blocker_query
------------+---------+------------------+----------+-----------------------------------------------------+------------+--------------+-------------+------------------+----------------------------------
8421 | app | api | 00:02:31 | UPDATE orders SET status = $1 WHERE id = $2 | 7710 | app | worker | 00:18:09 | UPDATE orders SET ... WHERE ...
What it means: One long transaction is blocking others. The blocker has been in transaction for 18 minutes. That’s rarely okay in OLTP.
Decision: Decide whether to terminate the blocker (careful: it will roll back), or to fix the app behavior (missing index causing long update, transaction scope too wide, waiting on external call mid-transaction).
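If you decide to act on the blocker, PostgreSQL gives you two levels of force. The PID here is the blocker from the sample output above; verify it yourself before running either.
SELECT pg_cancel_backend(7710);     -- cancel the blocker's current query, keep the session
SELECT pg_terminate_backend(7710);  -- kill the session; its open transaction rolls back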
Task 8: Inspect WAL and checkpoint health
cr0x@server:~$ sudo -u postgres psql -X -c "SELECT * FROM pg_stat_bgwriter;"
checkpoints_timed | checkpoints_req | checkpoint_write_time | checkpoint_sync_time | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_backend_fsync | buffers_alloc
------------------+-----------------+-----------------------+----------------------+--------------------+---------------+------------------+----------------+-----------------------+--------------
1289 | 412 | 8123456 | 923456 | 73456789 | 1234567 | 9876 | 2345678 | 12345 | 987654321
What it means: High checkpoints_req suggests you’re forcing checkpoints due to WAL volume, not schedule. High checkpoint write/sync time correlates with IO stalls.
Decision: If requested checkpoints are frequent, consider increasing max_wal_size and ensuring checkpoint_completion_target is sane. But do not “tune” until you confirm IO capacity and WAL device behavior.
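If, and only if, the evidence supports it, the usual adjustment looks like this. The values are illustrative, not recommendations; size max_wal_size against your actual WAL volume between checkpoints and your device’s ability to absorb writes. Both settings take effect on reload.
ALTER SYSTEM SET max_wal_size = '8GB';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();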
Task 9: Look at WAL stats (when available)
cr0x@server:~$ sudo -u postgres psql -X -c "SELECT wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, stats_reset FROM pg_stat_wal;"
wal_records | wal_fpi | wal_bytes | wal_buffers_full | wal_write | wal_sync | stats_reset
------------+---------+-----------+------------------+-----------+----------+-------------------------------
88234567 | 123456 | 98 GB | 4212 | 91234 | 45678 | 2025-12-27 00:00:00+00
What it means: Frequent wal_buffers_full suggests WAL buffering pressure. High WAL write/sync counts aren’t inherently bad; correlate them with device latency and commit times.
Decision: If WAL pressure coincides with stalls, consider separate fast storage for WAL, validate synchronous_commit expectations, and inspect commit latency in the app.
Task 10: Check cache hit ratio (with a warning label)
cr0x@server:~$ sudo -u postgres psql -X -c "SELECT datname, blks_hit, blks_read, round(100.0*blks_hit/GREATEST(blks_hit+blks_read,1),2) AS hit_pct FROM pg_stat_database ORDER BY hit_pct ASC;"
datname | blks_hit | blks_read | hit_pct
----------+-----------+-----------+---------
appdb | 987654321 | 45678901 | 95.59
What it means: Hit ratio is not a KPI you worship; it’s a clue. 95% might be fine or terrible depending on workload and storage. A sudden drop during incidents matters more than the absolute number.
Decision: If hit ratio drops during slowness and IO latency rises, you’re spilling to disk. Investigate memory, working set size, and whether query patterns changed.
Task 11: Find top time consumers with pg_stat_statements
cr0x@server:~$ sudo -u postgres psql -X -c "SELECT queryid, calls, round(total_exec_time::numeric,1) AS total_ms, round(mean_exec_time::numeric,2) AS mean_ms, rows, left(query,120) AS query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"
queryid | calls | total_ms | mean_ms | rows | query
----------+--------+----------+---------+-------+--------------------------------------------------------
12345678 | 998765 | 8123456.0| 8.13 | 998765| SELECT * FROM users WHERE email = $1
88776655 | 12345 | 6234567.0| 505.12 | 12345| SELECT ... FROM orders JOIN ... WHERE created_at > $1
What it means: total_exec_time identifies the biggest aggregate cost. High mean time queries are latency killers; high total time queries are capacity killers.
Decision: For high-mean queries, run EXPLAIN (ANALYZE, BUFFERS) in a safe environment. For high-total, consider caching, indexes, or query changes. If the query time spikes only during incidents, correlate with waits and host metrics first.
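A worked example against the top query from the sample output, with a placeholder parameter value. Remember that EXPLAIN (ANALYZE) actually executes the statement, so run it on a replica or with harmless parameters.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE email = 'someone@example.com';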
Task 12: Confirm active swapping or reclaim with vmstat
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa
5 2 262144 312000 80000 9000000 12 38 120 890 900 2400 18 6 55 21
7 1 262200 298000 76000 8800000 0 52 80 1200 1100 2700 20 7 48 25
What it means: Non-zero si/so means swapping in/out. That’s not “maybe.” That’s “your database is paying rent to the swap device.” Also note b (blocked) and wa (IO wait) patterns.
Decision: If swapping is active, stop tuning SQL and fix memory pressure. Reduce concurrency, add RAM, adjust workload placement, or fix runaway processes.
Task 13: Check filesystem and mount options for PGDATA
cr0x@server:~$ findmnt -no SOURCE,FSTYPE,OPTIONS /var/lib/postgresql
/dev/mapper/vg0-pgdata ext4 rw,relatime,errors=remount-ro
What it means: You want to know what filesystem you’re on and whether mount options are odd. For PostgreSQL, “clever” mount options usually become incident reports.
Decision: If you see exotic options (like disabling barriers) without a strong reason and validated hardware behavior, revert to sane defaults and fix the real performance issue elsewhere.
Task 14: Validate that PostgreSQL is not being throttled by cgroups
cr0x@server:~$ systemctl show postgresql@15-main --property=CPUQuotaPerSecUSec,MemoryMax,IOReadBandwidthMax,IOWriteBandwidthMax
CPUQuotaPerSecUSec=infinity
MemoryMax=infinity
IOReadBandwidthMax=
IOWriteBandwidthMax=
What it means: If quotas are set (especially IO bandwidth limits), you may have self-inflicted “random” slowness when load rises and the limits bite.
Decision: Remove inappropriate limits for a database service, or deliberately size them and set expectations. Databases don’t do well on surprise diets.
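To see where a limit comes from and to override it, systemd’s own tooling is enough. The unit name assumes Debian’s per-cluster naming; the drop-in content is a sketch that clears limits rather than setting new ones.
cr0x@server:~$ systemctl cat postgresql@15-main      # shows the unit plus any drop-ins that set limits
cr0x@server:~$ sudo systemctl edit postgresql@15-main
In the drop-in, an empty assignment clears a limit:
[Service]
CPUQuota=
IOReadBandwidthMax=
IOWriteBandwidthMax=
Then restart the cluster during a window you can afford; cgroup limits on a running database are not something to toggle casually.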
Task 15: Check PostgreSQL logs for checkpoint, autovacuum, and duration spikes
cr0x@server:~$ sudo journalctl -u postgresql@15-main -S -2h | grep -Ei "checkpoint|autovacuum|duration|could not|timeout" | tail -n 15
Dec 29 09:02:10 db01 postgres[2210]: LOG: checkpoint complete: wrote 32145 buffers (1.9%); 0 WAL file(s) added, 2 removed, 1 recycled; write=78.123 s, sync=1.992 s, total=80.256 s
Dec 29 09:10:44 db01 postgres[2210]: LOG: automatic vacuum of table "app.public.orders": index scans: 1 pages: 0 removed, 123456 remain, 12000 scanned (9.72% of total) tuples: 0 removed, 500000 remain
Dec 29 09:11:02 db01 postgres[2210]: LOG: duration: 2412.889 ms execute <unnamed>: UPDATE orders SET ...
What it means: A checkpoint that takes 80 seconds is not subtle. Autovacuum activity near incident windows can be causal or just correlated; you still need IO metrics.
Decision: If checkpoint times are huge, prioritize storage performance and checkpoint tuning. If duration spikes align with lock waits, fix locking. If autovacuum is heavy, adjust settings per-table and check bloat.
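Per-table autovacuum tuning is done with storage parameters. The table name comes from the sample log above and the numbers are illustrative starting points, not gospel; a lower scale factor makes vacuum run earlier and therefore smaller.
ALTER TABLE public.orders SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_analyze_scale_factor = 0.02
);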
Task 16: Confirm connection churn (too many new connections)
cr0x@server:~$ sudo -u postgres psql -X -c "SELECT datname, numbackends, xact_commit, xact_rollback, blks_read, blks_hit FROM pg_stat_database;"
datname | numbackends | xact_commit | xact_rollback | blks_read | blks_hit
--------+-------------+-------------+---------------+----------+----------
appdb | 240 | 987654321 | 123456 | 45678901 | 987654321
What it means: numbackends is current connections. Spikes here, especially with high CPU and context switching, often mean a broken pool or an unexpected traffic pattern.
Decision: If connections are high, cap them and enforce pooling. Consider pgbouncer, but don’t install it as a panic response; configure it thoughtfully, then validate transaction semantics.
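If you do introduce pgbouncer, a minimal transaction-pooling config looks roughly like this. Every value here is an assumption to adapt: database name, auth file path, pool sizes. Transaction pooling changes semantics for session state (prepared statements, SET, advisory locks), which is exactly the validation step mentioned above.
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 20
max_client_conn = 500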
Joke #2: A “temporary” debugging setting in production has the same half-life as a coffee stain on a white shirt.
Three corporate mini-stories (what went wrong, what worked)
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company migrated their primary database host from older SATA SSDs to shiny new NVMe. The change window was clean. Benchmarks looked great. Everyone expected fewer incidents.
Two weeks later, the support queue started describing “random slowness” during peak traffic: checkouts timing out, background jobs retrying, and occasional API 504s. The database metrics didn’t scream. CPU wasn’t maxed. Network looked fine. The team’s default explanation became “Postgres is weird under load,” which is the engineering equivalent of shrugging in a spreadsheet.
The wrong assumption: “NVMe is fast, so storage can’t be the bottleneck.” They had looked at throughput, not latency variance. Under sustained write pressure, the device would occasionally spike write latency into the hundreds of milliseconds due to internal behavior. PostgreSQL doesn’t care that your average throughput is great; it cares that one fsync took 400ms and blocked a pile of commits behind it.
They proved it by capturing iostat -xz 1 during an incident and correlating it with slow transaction logs. The spikes lined up perfectly. Once they moved WAL to a dedicated device and adjusted checkpoint/WAL sizing to reduce pressure bursts, “random slowness” turned into boring predictability.
The takeaway: stop treating storage as a binary (fast/slow). For databases, the variance is the product.
Mini-story 2: The optimization that backfired
A finance-adjacent platform had periodic stalls every 15 minutes. Someone noticed checkpoints in logs and did what the internet often suggests: increase checkpoint_timeout and max_wal_size dramatically to “reduce checkpoint frequency.” It worked—sort of.
Checkpoints happened less often, so the graph looked smoother for a while. But when checkpoints did run, they were massive. The IO subsystem wasn’t built for that kind of burst. During those checkpoints, latency went vertical, and the application would pile up retries, causing even more write pressure. The incident got worse, less frequent, and harder to reproduce in staging. Lovely.
When they finally measured the right thing—device latency and checkpoint_write_time—the pattern was obvious. The “optimization” had redistributed pain into fewer, bigger explosions. They rolled back to a smaller checkpoint_timeout, increased checkpoint_completion_target to smooth IO, and right-sized WAL so checkpoints weren’t forced by WAL churn.
They also discovered a second-order issue: a weekly analytics job that did huge updates without batching, creating WAL spikes that forced checkpoints regardless of timeout. Fixing that job reduced both IO and WAL volume more than any knob.
The takeaway: “fewer checkpoints” is not automatically “better.” The goal is steady IO, not rare disasters.
Mini-story 3: The boring but correct practice that saved the day
An enterprise internal platform team ran PostgreSQL for several business units. Nothing glamorous: payroll integrations, HR workflows, procurement. Their best feature was that incidents were rare and short.
What they did wasn’t magic. They kept a written runbook with a fast diagnosis sequence: check waits, check IO latency, check locks, check autovacuum, check memory pressure, then decide. They also had baselines: “normal” iostat latency, typical connection counts, and expected checkpoint times. Every quarter they did a 30-minute fire drill where someone pretended the database was slow and everyone practiced gathering evidence without restarting services.
When a vendor agent update started hammering the disk with log writes, the team didn’t argue about whose fault it was. They saw IO latency rise, confirmed PostgreSQL waits on IO, identified the offending process with per-process IO stats, and throttled the agent. The business never noticed, except that their status page stayed boring.
The takeaway: boring discipline beats heroic tuning. Baselines plus a practiced playbook turns “random slowness” into “a ticket with evidence.”
Common mistakes: symptom → root cause → fix
This section is intentionally specific. If you recognize the symptom, don’t debate it in a meeting—go test the root cause.
1) Symptom: “Everything is slow, but CPU is low”
- Root cause: IO waits (storage latency spikes) or lock contention.
- Fix: Check pg_stat_activity waits and iostat -xz. If IO: reduce competing IO, move WAL, verify storage health. If locks: find and fix the blocker; shorten transactions.
2) Symptom: Writes are slow; reads look normal
- Root cause: WAL flush latency, checkpoints, or synchronous replication waiting (if configured).
- Fix: Inspect pg_stat_wal, checkpoint stats, and device write latency. Validate that WAL is on fast, stable storage. Don’t disable durability as a “performance fix” unless you also accept data loss.
3) Symptom: Periodic slowdowns at regular intervals
- Root cause: Checkpoints, autovacuum cycles, cron jobs, scrubs, backups, log rotation, or batch workloads.
- Fix: Correlate time windows with logs (journalctl) and metrics. Smooth checkpoint IO; tune autovacuum per table; reschedule batch jobs; isolate backup IO.
4) Symptom: Spikes in connection count and latency during traffic bursts
- Root cause: Connection pooling failure or a new client deployment opening too many sessions.
- Fix: Enforce pooling and cap max_connections to something your RAM can support. Use a pooler if needed, but test transaction semantics and prepared-statement behavior.
5) Symptom: “Random” timeouts, but only for certain endpoints
- Root cause: One query plan flips between index scan and seq scan, or a parameter-sensitive plan.
- Fix: Use pg_stat_statements to find the offender and run EXPLAIN (ANALYZE, BUFFERS) with representative parameters. Consider plan_cache_mode adjustments cautiously, or rewrite queries to be more stable.
6) Symptom: Vacuum running constantly, but performance still degrades
- Root cause: Autovacuum can’t keep up, or bloat is already large; long-running transactions prevent cleanup.
- Fix: Identify long transactions, reduce their duration, and tune autovacuum settings per big table. For severe bloat, schedule maintenance (e.g., VACUUM (FULL) or online rebuild strategies) with a real plan.
7) Symptom: After Debian upgrade, PostgreSQL feels slower under the same workload
- Root cause: Kernel/IO scheduler changes, different defaults, cgroup settings, or changed filesystem behavior; sometimes it’s just a new bottleneck being revealed.
- Fix: Compare baselines: IO latency, CPU steal, memory reclaim, and PostgreSQL wait events. Verify that no new service limits were introduced and that huge pages, THP, or governor changes didn’t shift behavior; the quick checks below cover the usual suspects.
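Quick checks for the usual post-upgrade suspects; the device name is taken from the earlier iostat example and the cpufreq path may be absent on some VMs, so adjust to your host:
cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor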
Checklists / step-by-step plan
Step-by-step: during an active slowdown (10 minutes, no heroics)
- Capture PostgreSQL waits: run Task 1 and save output.
- Capture worst active queries: run Task 2.
- Capture IO latency: run Task 4.
- Capture CPU saturation: run Task 5.
- Capture memory pressure: run Task 6 and Task 12.
- If locks show up: run Task 7 and decide on terminating the blocker versus waiting.
- If WAL/checkpoint suspected: run Task 8 and Task 9; correlate with IO writes and logs (Task 15).
- If connections are high: run Task 16; verify pool health and app deployment changes.
- Write down time boundaries: “slow from 10:12:40 to 10:16:10”. Correlation needs timestamps. (A capture-script sketch that gathers all of the above in one pass is shown below.)
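A minimal capture sketch for the steps above, assuming sysstat is installed and the user can sudo to postgres; paths and filenames are arbitrary choices, adjust to your environment. The point is timestamps plus raw output in one place, not elegance.
#!/usr/bin/env bash
# Capture evidence during a PostgreSQL slowdown into one timestamped directory.
set -euo pipefail
out="/var/tmp/pg-slow-$(date +%Y%m%dT%H%M%S)"
mkdir -p "$out"
date -Is > "$out/when.txt"
sudo -u postgres psql -X -c "SELECT now(), state, wait_event_type, wait_event, count(*) FROM pg_stat_activity GROUP BY 1,2,3,4 ORDER BY count(*) DESC;" > "$out/waits.txt"
sudo -u postgres psql -X -c "SELECT pid, now()-query_start AS runtime, state, wait_event_type, wait_event, left(query,120) AS query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC LIMIT 5;" > "$out/top_queries.txt"
iostat -xz 1 5 > "$out/iostat.txt"
mpstat -P ALL 1 3 > "$out/mpstat.txt"
vmstat 1 5 > "$out/vmstat.txt"
free -h > "$out/free.txt"
echo "Evidence saved in $out"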
Step-by-step: after the incident (the part teams skip, then repeat incidents)
- Classify the incident: IO-bound, lock-bound, CPU-bound, memory-bound, WAL-bound, connection-bound, or network/auth-bound.
- Pick one primary metric to alert on: e.g., disk await over threshold, lock wait count, checkpoint time, swap activity.
- Add one dashboard panel that would have made this obvious. Not twelve. One.
- Make one code/config change with a rollback plan. Don’t “tune everything.”
- Update the runbook. If the fix required tribal knowledge, you have a reliability bug.
FAQ
1) Why does PostgreSQL feel slow even when CPU usage is low?
Because it’s often waiting, not computing. Lock waits and IO waits don’t burn CPU. Look at wait_event_type in pg_stat_activity and at disk latency with iostat.
2) What “await” numbers are bad in iostat?
For OLTP, sustained single-digit milliseconds is usually fine; sustained tens of milliseconds is trouble; spikes into hundreds of milliseconds will be user-visible. The exact threshold depends on SLOs, but “await rising with %util near 100%” is a classic saturation signature.
3) Should I increase shared_buffers to fix slowness?
Not as a reflex. Too much shared_buffers can starve the OS page cache and push the system into reclaim or swap. Size it intentionally and verify memory pressure with vmstat and free.
4) Is autovacuum supposed to cause performance drops?
It can, especially when it has to catch up. The goal is to tune it so it runs continuously and quietly rather than occasionally and violently. Per-table autovacuum settings are often the right tool.
5) Can checkpoints cause “every N minutes everything slows down”?
Yes. Checkpoints can create bursts of writes. If storage can’t absorb the burst, latency spikes. Look at checkpoint timing in logs and pg_stat_bgwriter, then correlate with device latency.
6) Is setting synchronous_commit=off a good fix for random slowness?
It’s a trade: you reduce commit latency by accepting potential data loss on crash. For some workloads it’s acceptable; for many it’s not. Treat it as an explicit product decision, not a performance tweak.
7) How do I know if locks are the problem?
If many sessions are waiting on Lock and you can identify a blocker holding locks for a long time (Task 7), locks are the problem. Fix long transactions and hot-row contention patterns.
8) Why do connection storms hurt so much?
Each connection costs memory, CPU, and often triggers auth/logging work. Under storm conditions, the database spends resources managing connections instead of executing queries. Fix at the client with pooling and sensible limits.
9) Does Debian 13 change anything that can affect PostgreSQL latency?
OS upgrades can shift kernel behavior, IO scheduling, and defaults around services and cgroups. Treat it as a new baseline: validate IO latency, CPU steal, and memory reclaim under representative load rather than assuming “same but newer.”
10) What’s the single most useful PostgreSQL extension for performance triage?
pg_stat_statements. It won’t tell you about storage latency directly, but it will quickly identify whether the same few queries dominate time, and whether “random slowness” is actually a known query pattern.
Conclusion: next steps you can do today
PostgreSQL isn’t randomly slow. Your system is. The database just happens to be where your users feel it first.
- Implement the fast playbook. Make “waits, IO latency, locks” the first three checks. Practice it once.
- Baseline your host. Run iostat, vmstat, and connection counts during normal load and save the numbers.
- Turn the biggest unknown into a metric. If you don’t know whether incidents are IO-bound or lock-bound, you don’t have an incident response; you have a superstition.
- Fix one bottleneck at a time. The fastest way to extend an outage is to change five knobs and then argue about which one helped.
If you do nothing else: the next time someone says “Postgres is randomly slow,” respond with a timestamp request and run Task 1 and Task 4. You’ll look psychic. You’re not. You’re just measuring the right things.