Debian 13: PostgreSQL connection storms — pooler vs tuning: what actually works (case #37)

Connection storms don’t announce themselves politely. They arrive as “database is down” pages, a thundering herd of app retries,
CPU that looks fine until it isn’t, and a Postgres that suddenly acts like it’s doing you a favor by accepting any connections at all.
On Debian 13, the tooling is good, the defaults are reasonable, and yet you can still end up with hundreds or thousands of clients
trying to squeeze through the same narrow door.

This is case #37 in my mental notebook: the recurring argument between “just add PgBouncer” and “just tune Postgres.”
Both camps are right—sometimes. More often, one of them is about to waste your week.
Let’s cut through it with the stuff that actually changes outcomes in production.

What a connection storm really is (and why it hurts)

A PostgreSQL “connection storm” is not just “a lot of connections.” It’s the pathological moment when the rate of new connections
or reconnect attempts overwhelms some part of the system: Postgres backend process creation, authentication, TLS handshake, CPU
scheduler, kernel limits, disk latency, or simply the server’s ability to keep up with context switching.

Postgres uses a process-per-connection model (not a thread-per-connection model). That’s a design choice with clear benefits:
isolation, simpler debugging, predictable failure containment. But it also means each connection comes with real overhead:
a backend process, memory consumption (work_mem isn’t the only thing; there’s per-backend state), and scheduling cost.

Storms typically start upstream:

  • App deploy resets connection pools.
  • Autoscaling adds pods that all connect at once.
  • Load balancer health checks are misconfigured and become a login DoS.
  • A network flap triggers retries with tight loops.
  • A bad query makes response times climb; clients time out and reconnect, multiplying load.

The failure mode is nasty because it’s non-linear. The database may be “fine” at 200 steady connections, but it can collapse under
2000 connection attempts per second even if only 200 are active at any time.

Here’s the practical rule: when connection churn is the problem, raising max_connections is like widening a doorway while the
building is on fire. You’ll move more smoke.

Joke #1: A connection storm is like free pizza at an office—nobody “needs” it, but suddenly everyone is very motivated to show up.

Facts & context that change how you think about this

  • Postgres’ process-per-connection model dates back decades and is still a core architecture choice; it prioritizes robustness over raw connection fan-out.
  • pg_stat_activity has been around in some form for a long time; it’s still the first place you should look when reality disagrees with dashboards.
  • TLS everywhere changed the math: what used to be cheap “connect/auth” traffic now includes heavier crypto handshakes if you terminate TLS at Postgres.
  • SCRAM-SHA-256 adoption improved password security, but stronger auth can make storms more CPU-sensitive than older MD5 setups under extreme churn.
  • Linux cgroup and systemd limits became a quiet source of “it worked on the old OS” incidents; Debian 13’s systemd-managed units make these limits more visible (and enforceable).
  • PgBouncer’s transaction pooling became popular not because Postgres is “bad,” but because apps routinely misuse or over-create connections.
  • “Idle in transaction” has been a known Postgres footgun for ages: it holds locks and bloat risk while looking “idle,” which makes incident response deceptively slow.
  • Observability improved: modern Postgres exposes wait events (pg_stat_activity.wait_event) so you can stop guessing whether you’re CPU-bound, lock-bound, or IO-bound.

One paraphrased idea worth keeping in your head comes from a reliability heavyweight:
John Allspaw (paraphrased): “In incidents, the system makes sense to the people inside it; fix the conditions, not the blame.”

Fast diagnosis playbook

When you’re on-call, you don’t have time for philosophical debates about poolers. You need to find the bottleneck in minutes,
not in a postmortem. This is the triage order that works on Debian 13 with Postgres in the real world.

First: determine if you’re failing at “accepting connections” or “serving queries”

  • If clients can’t connect at all: check listen backlog, file descriptors, auth/TLS CPU spikes, and process limits.
  • If clients connect but time out on queries: check locks, IO, CPU saturation, and slow queries causing retry storms.
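
A quick way to separate the two is to run a timed probe, ideally from an app host. A minimal sketch; the hostname and credentials are placeholders, and it assumes non-interactive auth (.pgpass or environment):

cr0x@server:~$ time psql "host=db1.internal port=5432 dbname=appdb user=appuser connect_timeout=3" -XAtc "SELECT 1"
1

real    0m0.190s
user    0m0.020s
sys     0m0.004s

If this hangs, times out, or spends seconds in the handshake, you are failing at connection establishment. If it returns instantly while the app still times out, the pain is in serving queries (locks, IO, CPU).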

Second: classify the storm

  • Churn storm: many connects/disconnects, short-lived sessions, pg_stat_activity dominated by new sessions.
  • Idle storm: connections accumulate and stick around, lots of idle sessions, likely poor app pooling.
  • “Idle in transaction” storm: fewer sessions, but they pin locks and cause pile-ups.
  • Retry storm: query latency leads to timeouts, app retries amplify load; often not a “connections” problem at the root.
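
One query helps tell these apart, using backend age from pg_stat_activity (the output values here are illustrative):

cr0x@server:~$ sudo -u postgres psql -Xc "SELECT state, count(*), min(now() - backend_start) AS youngest, max(now() - backend_start) AS oldest FROM pg_stat_activity WHERE backend_type = 'client backend' GROUP BY 1 ORDER BY 2 DESC;"
 state  | count | youngest |  oldest
--------+-------+----------+----------
 idle   |   160 | 00:00:02 | 06:41:07
 active |    25 | 00:00:00 | 00:00:41
(2 rows)

Mostly very young backends points to a churn storm; old idle backends that never go away points to an idle storm; a growing “idle in transaction” bucket means locks are about to become your problem.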

Third: pick the lever

  • Use a pooler when you need to cap connection fan-out or dampen churn.
  • Tune Postgres when the server is under-resourced, misconfigured, or blocked (locks/IO) and connections are just the symptom.
  • Fix the app when it creates connections per request, doesn’t reuse them, or has broken retry/backoff behavior.

Pooler vs tuning: decision logic, not ideology

You can “solve” a storm three ways: make Postgres accept more connections, reduce the number of connections, or reduce the need for retries.
The trap is assuming these are interchangeable. They’re not.

What tuning can do (and what it can’t)

Tuning helps when Postgres is wasting resources or blocked. Examples:

  • Locks: bad transaction hygiene creates lock queues; tuning won’t “fix” that, but lock monitoring and timeouts will reduce blast radius (example after this list).
  • Memory: poor shared_buffers and runaway work_mem choices can turn normal load into swap misery.
  • IO: slow storage, checkpoint spikes, or bad autovacuum behavior can push latency high enough to trigger retries.
  • CPU: expensive auth (TLS/SCRAM) plus high churn can dominate CPU; tuning can move pieces (e.g., TLS offload) but doesn’t change the churn itself.
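
For the locks bullet in particular, the guardrails are one statement away. A minimal sketch, assuming your application role is the appuser seen in the examples later in this article and that it has no legitimate minutes-long statements; pick values that match your latency budget, and note that new sessions pick them up, existing ones don’t:

cr0x@server:~$ sudo -u postgres psql -Xc "ALTER ROLE appuser SET lock_timeout = '5s'; ALTER ROLE appuser SET statement_timeout = '30s';"
ALTER ROLE
ALTER ROLE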

What tuning won’t do: make 10,000 clients behave responsibly. If they connect like mosquitoes at dusk, you’re going to need a screen door.

What a pooler changes

A pooler (most commonly PgBouncer) sits between clients and Postgres, collapsing many client connections into fewer server connections.
You get:

  • Hard caps on server connections even when clients spike.
  • Faster client connection establishment (especially if the pooler runs close to the app and keeps server connections warm).
  • Protection from deploy storms: app restarts don’t translate into Postgres backend churn.

But poolers also remove some guarantees:

  • Transaction pooling breaks session state: temp tables, prepared statements, session variables, advisory locks—handle carefully.
  • Visibility shifts: you must monitor pooler metrics, not just Postgres.
  • Misconfigurations can be worse than no pooler: wrong pool sizes and timeouts can create self-inflicted queueing delays.

If your workload is webby (lots of short transactions), transaction pooling is usually the win. If your workload is session-heavy
(lots of session state, long transactions), you can still use a pooler—but you may need session pooling or to refactor the app.
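
Concretely, here is the kind of session state that transaction pooling breaks. The object and GUC names below are made up for illustration; each one assumes consecutive statements land on the same backend, which transaction pooling does not guarantee:

CREATE TEMP TABLE staging_rows (id bigint);        -- temp table can vanish between transactions
SET myapp.tenant_id = '42';                        -- session GUC may be gone on the next query
PREPARE fetch_user (bigint) AS SELECT $1;          -- named prepared statement lives in one backend
SELECT pg_advisory_lock(42);                       -- session-level advisory lock outlives the transaction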

Joke #2: Raising max_connections in response to a storm is like buying more chairs when the meeting agenda is the problem.

Practical tasks (commands, outputs, decisions)

These are the tasks I actually run during incidents and during “we should fix this before it happens again” sessions.
Each task includes: command, example output, what it means, and what decision to make.

Task 1: Confirm you’re hitting connection limits (Postgres side)

cr0x@server:~$ sudo -u postgres psql -XAtc "SHOW max_connections; SHOW superuser_reserved_connections;"
200
3

Meaning: Only 197 non-superuser connections are effectively available.

Decision: If you see errors like “too many clients already,” don’t immediately increase this. First find why connections spike or stick.

Task 2: See current connection count and states

cr0x@server:~$ sudo -u postgres psql -Xc "SELECT state, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"
        state        | count
---------------------+-------
 idle                |   160
 active              |    25
 idle in transaction |     9
                     |     3
(4 rows)

Meaning: Lots of idle sessions. A few “idle in transaction” sessions are red flags.

Decision: If idle dominates and connection count is near the max, focus on application pooling and idle timeouts; a pooler is often the fastest containment.

Task 3: Identify top client sources (storms often come from one place)

cr0x@server:~$ sudo -u postgres psql -Xc "SELECT client_addr, usename, count(*) AS conns FROM pg_stat_activity GROUP BY 1,2 ORDER BY conns DESC LIMIT 10;"
 client_addr | usename | conns
-------------+---------+-------
 10.42.7.19  | appuser |   120
 10.42.8.11  | appuser |    95
 10.42.9.02  | appuser |    70
(3 rows)

Meaning: A few app nodes dominate. That’s good news: you can fix or rate-limit a small set of offenders.

Decision: If one client address is exploding, isolate that app deployment, sidecar, or job runner.

Task 4: Measure connect/auth overhead symptoms via Postgres wait events

cr0x@server:~$ sudo -u postgres psql -Xc "SELECT wait_event_type, wait_event, count(*) FROM pg_stat_activity WHERE state='active' GROUP BY 1,2 ORDER BY 3 DESC;"
 wait_event_type |  wait_event   | count
-----------------+---------------+-------
 Lock            | transactionid |    12
 IO              | DataFileRead  |     5
                 |               |     3
(3 rows)

Meaning: Your “connection storm” might actually be a lock storm causing timeouts and retries. Active rows with an empty wait_event aren’t waiting on anything; they’re actually running on CPU.

Decision: If Lock waits dominate, stop thinking about poolers and start thinking about the blocking transaction and timeouts.

Task 5: Find blockers fast (locks cause retry storms)

cr0x@server:~$ sudo -u postgres psql -Xc "SELECT blocked.pid AS blocked_pid, blocker.pid AS blocker_pid, blocked.query AS blocked_query, blocker.query AS blocker_query FROM pg_catalog.pg_locks blocked_locks JOIN pg_catalog.pg_stat_activity blocked ON blocked.pid = blocked_locks.pid JOIN pg_catalog.pg_locks blocker_locks ON blocker_locks.locktype = blocked_locks.locktype AND blocker_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE AND blocker_locks.relation IS NOT DISTINCT FROM blocked_locks.relation AND blocker_locks.page IS NOT DISTINCT FROM blocked_locks.page AND blocker_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple AND blocker_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid AND blocker_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid AND blocker_locks.classid IS NOT DISTINCT FROM blocked_locks.classid AND blocker_locks.objid IS NOT DISTINCT FROM blocked_locks.objid AND blocker_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid AND blocker_locks.pid != blocked_locks.pid JOIN pg_catalog.pg_stat_activity blocker ON blocker.pid = blocker_locks.pid WHERE NOT blocked_locks.granted;"
 blocked_pid | blocker_pid |        blocked_query         |        blocker_query
------------+-------------+------------------------------+-----------------------------
      24811 |       20777 | UPDATE accounts SET ...      | ALTER TABLE accounts ...
(1 row)

Meaning: A DDL statement is blocking updates. That can cascade into app timeouts and reconnects.

Decision: Kill or postpone the blocker if it’s safe, and implement safer migration practices (lock timeouts, online migration patterns).
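
If you don’t want to keep that long query handy, pg_blocking_pids() (PostgreSQL 9.6 and later) gives a shorter view. It reports who blocks whom but not the blocker’s query, so you may still follow up with a pg_stat_activity lookup; the output below mirrors the example above:

cr0x@server:~$ sudo -u postgres psql -Xc "SELECT blocked.pid AS blocked_pid, pg_blocking_pids(blocked.pid) AS blocked_by, left(blocked.query, 40) AS blocked_query FROM pg_stat_activity blocked WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;"
 blocked_pid | blocked_by |      blocked_query
-------------+------------+--------------------------
       24811 | {20777}    | UPDATE accounts SET ...
(1 row)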

Task 6: Check OS-level file descriptor ceiling (classic “connections fail” cause)

cr0x@server:~$ cat /proc/$(pidof postgres | awk '{print $1}')/limits | grep -E "Max open files"
Max open files            1024                 1048576              files

Meaning: Soft limit is 1024 for the main PID. That’s dangerously low for a busy database.

Decision: Fix systemd unit limits (Task 7). If you run with low soft limits, you’ll get random failures under load.

Task 7: Verify systemd LimitNOFILE for the PostgreSQL cluster unit

cr0x@server:~$ systemctl show postgresql@17-main --property=LimitNOFILE
LimitNOFILE=1024

Meaning: The cluster service is capped at 1024 open files, which covers sockets as well as data files. On Debian, postgresql.service is only an umbrella unit; the backends actually run under postgresql@<version>-<cluster> (postgresql@17-main by default on Debian 13), so check the instance unit, not the wrapper.

Decision: Set a sane value via a drop-in on the template unit (Task 8). Then restart in a maintenance window.

Task 8: Apply a systemd drop-in for higher file limits

cr0x@server:~$ sudo systemctl edit postgresql@.service
# (editor opens)
# add:
# [Service]
# LimitNOFILE=1048576
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart postgresql@17-main
cr0x@server:~$ systemctl show postgresql@17-main --property=LimitNOFILE
LimitNOFILE=1048576

Meaning: PostgreSQL now inherits a high file descriptor limit.

Decision: If connection failures disappear, you found at least one hard ceiling. Still investigate why connections spike.

Task 9: Inspect TCP listen backlog drops (kernel-level connection pain)

cr0x@server:~$ ss -ltn sport = :5432
State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 64     4096   0.0.0.0:5432      0.0.0.0:*

Meaning: Recv-Q shows queued connections. If it’s frequently near Send-Q (backlog), you’re dropping or delaying connection establishment.

Decision: If Recv-Q spikes during storms, consider a pooler close to the clients and review kernel backlog tuning (somaxconn) and Postgres listen_addresses/tcp_keepalives strategy.

Task 10: Check kernel backlog and SYN handling

cr0x@server:~$ sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096

Meaning: These values are reasonably high. If they’re low (128/256), storms can overwhelm the TCP handshake queue.

Decision: If values are low and you see connection drops, raise them carefully and test. But remember: backlog tuning helps symptoms; a pooler fixes the behavior.
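
If you do decide to raise them, persist the change in a sysctl drop-in instead of a one-off command. The file name below is arbitrary, and note that the listen backlog is applied when Postgres calls listen(), so a higher somaxconn usually only benefits the socket after a Postgres restart:

cr0x@server:~$ sudo tee /etc/sysctl.d/90-postgres-net.conf >/dev/null <<'EOF'
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
EOF
cr0x@server:~$ sudo sysctl -p /etc/sysctl.d/90-postgres-net.conf
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096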

Task 11: Determine whether authentication is expensive (pg_hba.conf clues)

cr0x@server:~$ sudo -u postgres psql -Xc "SELECT name, setting FROM pg_settings WHERE name IN ('password_encryption','ssl');"
        name         | setting
---------------------+---------
 password_encryption | scram-sha-256
 ssl                 | on
(2 rows)

Meaning: SCRAM + TLS is secure; it can also be heavier during massive churn.

Decision: Don’t weaken auth as a first response. Prefer poolers to reduce handshake volume, or terminate TLS earlier if policy allows.

Task 12: Inspect connection churn rate from Postgres statistics

cr0x@server:~$ sudo -u postgres psql -Xc "SELECT datname, numbackends, xact_commit, xact_rollback, blks_read, blks_hit FROM pg_stat_database ORDER BY numbackends DESC;"
  datname  | numbackends | xact_commit | xact_rollback | blks_read | blks_hit
-----------+-------------+-------------+---------------+-----------+----------
 appdb     |         190 |     9921330 |         12344 |   1209932 |  99881233
 postgres  |           3 |        1200 |             0 |       120 |     22000
(2 rows)

Meaning: High numbackends close to your effective max. Combined with low active work, this screams “connection management problem.”

Decision: Contain with a pooler or strict application-side pooling limits; then fix retry/backoff and pooling discipline.

Task 13: Check for “idle in transaction” and enforce timeouts

cr0x@server:~$ sudo -u postgres psql -Xc "SHOW idle_in_transaction_session_timeout;"
idle_in_transaction_session_timeout
-----------------------------------
0
(1 row)

Meaning: Timeout disabled. “Idle in transaction” sessions can hold locks forever.

Decision: Set a sane value (often 1–5 minutes for OLTP) to reduce incident tail risk.

Task 14: Apply timeouts (safe defaults beat heroic debugging)

cr0x@server:~$ sudo -u postgres psql -Xc "ALTER SYSTEM SET idle_in_transaction_session_timeout = '2min';"
ALTER SYSTEM
cr0x@server:~$ sudo -u postgres psql -Xc "SELECT pg_reload_conf();"
 pg_reload_conf
----------------
 t
(1 row)

Meaning: Config reloaded, new sessions will inherit the timeout.

Decision: If you have legitimate long idle-in-tx sessions (rare and usually wrong), fix that code path; don’t disable the guardrail.

Task 15: Verify client-side pooling is not “infinite” (example: psql shows application_name)

cr0x@server:~$ sudo -u postgres psql -Xc "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC LIMIT 10;"
 application_name | count
------------------+-------
 myapi            |   180
 migration-job    |     9
 psql             |     1
(3 rows)

Meaning: The API is the main consumer. You need to inspect its pool settings and deploy behavior.

Decision: Cap per-instance pools, add jittered startup, and ensure exponential backoff on retries.

Task 16: Check memory risk before increasing max_connections (the trap)

cr0x@server:~$ sudo -u postgres psql -Xc "SHOW shared_buffers; SHOW work_mem; SHOW maintenance_work_mem; SHOW max_connections;"
shared_buffers
--------------
4GB
(1 row)

work_mem
--------
16MB
(1 row)

maintenance_work_mem
--------------------
512MB
(1 row)

max_connections
---------------
200
(1 row)

Meaning: If you jump to 800 connections while leaving work_mem at 16MB, you invite memory pressure. Not all connections use full work_mem, but incidents love worst-case math.

Decision: If you must increase max_connections, revisit memory strategy and consider a pooler first. Better: avoid needing the increase at all.
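
A rough way to put numbers on the worst case, using the settings above. Treat it as a floor rather than a ceiling: a single query can allocate work_mem several times over for multiple sort and hash nodes:

cr0x@server:~$ sudo -u postgres psql -Xc "SELECT current_setting('max_connections')::int AS conns, current_setting('work_mem') AS work_mem, pg_size_pretty(current_setting('max_connections')::int * pg_size_bytes(current_setting('work_mem'))) AS worst_case_sort_memory;"
 conns | work_mem | worst_case_sort_memory
-------+----------+------------------------
   200 | 16MB     | 3200 MB
(1 row)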

Task 17: Install and sanity-check PgBouncer on Debian 13

cr0x@server:~$ sudo apt-get update
cr0x@server:~$ sudo apt-get install -y pgbouncer
Reading package lists... Done
...
Setting up pgbouncer ...

Meaning: Pooler installed. Now you must configure it intentionally; defaults are not a production plan.

Decision: Place PgBouncer close to the apps if possible (same node/cluster) to reduce network handshake overhead and isolate storms.

Task 18: Minimal PgBouncer config checks that prevent self-harm

cr0x@server:~$ sudo grep -E "^(listen_addr|listen_port|pool_mode|max_client_conn|default_pool_size|server_reset_query|ignore_startup_parameters)" /etc/pgbouncer/pgbouncer.ini
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 5000
default_pool_size = 50
server_reset_query = DISCARD ALL
ignore_startup_parameters = extra_float_digits

Meaning: Transaction pooling with a sane pool size can absorb client spikes. Note that server_reset_query only runs in session pooling mode (unless server_reset_query_always is set); in transaction mode PgBouncer assumes clients don’t rely on session state, so DISCARD ALL mostly matters for any databases you keep session-pooled.

Decision: If your app relies on session state, transaction pooling will break it. Either refactor or use session pooling for that subset.
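
The application-side change is usually just the connection target. A sketch, assuming the app reads something like a DATABASE_URL and that PgBouncer listens on the 6432 port configured above (host and password are placeholders):

# before: straight to Postgres
DATABASE_URL=postgresql://appuser:REDACTED@10.42.0.5:5432/appdb
# after: through PgBouncer
DATABASE_URL=postgresql://appuser:REDACTED@pgbouncer.internal:6432/appdb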

Task 19: Observe PgBouncer queues (to know if you’re just moving the pain)

cr0x@server:~$ psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"
 database | user    | cl_active | cl_waiting | sv_active | sv_idle | sv_used | maxwait
----------+---------+-----------+------------+----------+---------+---------+---------
 appdb    | appuser |       120 |         80 |       50 |       0 |       50 |     12
(1 row)

Meaning: Clients are waiting. Server connections are capped at 50 and fully utilized. This is expected under spikes.

Decision: If maxwait climbs and request latency suffers, tune pool size cautiously, and—more importantly—reduce client concurrency and fix slow queries.

Task 20: Confirm the app is actually using the pooler

cr0x@server:~$ sudo -u postgres psql -Xc "SELECT client_addr, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"
 client_addr | count
-------------+-------
 127.0.0.1   |    52
(1 row)

Meaning: Postgres now sees PgBouncer as the client (loopback). That’s good; the fan-out moved to the pooler.

Decision: If you still see many app IPs connecting directly, you have a rollout/config drift problem, not a tuning problem.

Three corporate mini-stories from the trenches

Mini-story #1: The incident caused by a wrong assumption (the “connections are cheap” myth)

A mid-sized company ran a customer API on Debian, Postgres on a beefy VM, and a service mesh that proudly did mTLS everywhere.
Someone did the math on CPU and IO, saw plenty of headroom, and declared the database “overprovisioned.”
The team’s biggest fear was slow queries. Nobody worried about connects.

A routine deploy rolled out to a few hundred containers. Each container started, ran a health check, and established a new database
connection to verify migrations were “fine.” That health check ran every few seconds because “fast detection is good,” and it opened
a fresh connection each time because the codepath used a simple one-off client.

Postgres didn’t die immediately. It just got progressively less responsive. Authentication and TLS handshake volume spiked.
Backend processes multiplied. The VM’s CPU looked deceptively modest because the scheduler overhead and kernel bookkeeping were
the real enemy.

The fix wasn’t heroic tuning. They stopped doing connect-per-healthcheck, added a tiny sidecar pooler for the API pods,
and rate-limited readiness checks. They also stopped pretending that “connect is cheap” in a TLS world.
The incident never returned.

Mini-story #2: The optimization that backfired (turning up max_connections)

Another org had a classic “too many clients already” page. A smart, well-meaning engineer raised max_connections from a couple
hundred to well over a thousand. The change took minutes. The page stopped. Everyone went back to work.

Two weeks later, latency complaints returned—but now with a twist. The database didn’t refuse connections; it accepted them and
quietly sank. Response times got worse across the board. Autovacuum fell behind. Checkpoints got spiky.
The real symptom became “everything is slow.”

The postmortem was uncomfortable because no single query was “the problem.” The problem was concurrency and memory pressure.
More backends meant more memory overhead, more context switching, and more simultaneous work_mem consumption during bursts.
The system behaved fine until it crossed a threshold. Then it behaved like a tired cashier in a supermarket at closing time.

The eventual fix was to roll max_connections back down, deploy PgBouncer in transaction pooling mode, and cap app-side pools.
They also implemented better retry jitter so timeouts didn’t synchronize. The lesson: raising connection limits can turn a visible
failure into an invisible performance collapse.

Mini-story #3: The boring but correct practice that saved the day (timeouts and guardrails)

A finance-adjacent shop ran Postgres with a strict change process. Engineers grumbled about it, as engineers do.
But their DBA had insisted on a handful of “boring defaults”: statement_timeout for certain roles, idle_in_transaction_session_timeout,
and conservative lock timeouts for migrations.

One afternoon, a migration went out that would have been harmless in staging. In production, it collided with a batch job and
took heavier locks than expected. Without guardrails, the migration would have sat there indefinitely holding everyone hostage.

Instead, the migration hit the lock timeout and failed quickly. The app kept running. A few requests retried and succeeded.
The incident ticket was a non-event: “migration failed, re-run later.” No storm, no panic, no 3 a.m. calls.

That’s the thing about boring controls: they’re invisible until the day they prevent you from learning too much about your pager.

Common mistakes: symptom → root cause → fix

1) Symptom: “too many clients already” during deploys

Root cause: app instances start simultaneously and each opens a full-size pool immediately; no jitter; no pool caps.

Fix: cap per-instance pools; add startup jitter; add PgBouncer to collapse fan-out; ensure retries use exponential backoff with jitter.
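
A minimal sketch of the backoff part, assuming a DB_DSN environment variable and psql available in the image; the numbers are illustrative, and most client libraries can do this for you:

#!/usr/bin/env bash
# Retry a connection check with exponential backoff plus jitter.
set -u
max_attempts=6
delay=1
for attempt in $(seq 1 "$max_attempts"); do
  if psql "$DB_DSN" -XAtc "SELECT 1" >/dev/null 2>&1; then
    echo "connected on attempt $attempt"
    exit 0
  fi
  jitter=$(( RANDOM % delay + 1 ))   # desynchronizes instances so retries don't land together
  sleep "$(( delay + jitter ))"
  delay=$(( delay * 2 ))             # 1s, 2s, 4s, 8s, ...
done
echo "giving up after $max_attempts attempts" >&2
exit 1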

2) Symptom: Postgres accepts connections but queries time out

Root cause: lock contention or long transactions, often triggered by DDL during peak traffic.

Fix: identify blockers; adopt lock timeouts; use safer migration patterns; kill or postpone the blocking session during incidents.
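
A sketch of the fail-fast migration preamble, reusing the accounts table from Task 5 and a hypothetical column. When a blocker holds the lock longer than lock_timeout, the migration errors out instead of queueing behind it; on a quiet table it simply completes:

cr0x@server:~$ sudo -u postgres psql -Xc "SET lock_timeout = '5s'; SET statement_timeout = '15min'; ALTER TABLE accounts ADD COLUMN note text;"
ERROR:  canceling statement due to lock timeout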

3) Symptom: Connections fail intermittently under load

Root cause: file descriptor limits (systemd LimitNOFILE), or kernel backlog limits.

Fix: raise service limits via systemd drop-in; validate with /proc/PID/limits; review backlog-related sysctls.

4) Symptom: Lots of “idle in transaction” sessions

Root cause: application starts a transaction and then waits on user input, background work, or network calls.

Fix: set idle_in_transaction_session_timeout; fix app transaction boundaries; consider using shorter transactions and explicit locks only when needed.

5) Symptom: After adding PgBouncer, some features break (temp tables, prepared statements)

Root cause: transaction pooling invalidates session state.

Fix: move those workloads to session pooling, or refactor the application to avoid session state; use server-side prepared statements carefully or disable them in the client driver.

6) Symptom: PgBouncer “fixes” storms but latency becomes spiky

Root cause: queueing moves to PgBouncer; server pool too small or queries are slow; concurrency exceeds DB capacity.

Fix: tune pool sizes conservatively; reduce app concurrency; fix slow queries and IO; treat PgBouncer queue as a signal, not a problem to hide.

7) Symptom: CPU spikes during storms even when query load seems low

Root cause: TLS handshakes and auth cost dominate under churn; also process creation overhead.

Fix: reduce churn with pooler; ensure keepalives; avoid connect-per-request; consider TLS termination strategy if policy allows.

Checklists / step-by-step plan

Checklist A: First hour containment (stop the bleeding)

  1. Confirm whether it’s connection refusal or query timeouts. Use pg_stat_activity, app errors, and ss backlog checks.
  2. Find the dominant client source. If one app group is misbehaving, isolate it (scale down, rollback, or throttle).
  3. Look for locks and “idle in transaction.” Kill the blocker only if you understand the impact.
  4. Check OS hard limits. File descriptors and systemd limits are fast wins.
  5. Reduce reconnect rate. Fix retry loops and add backoff/jitter; temporarily increase client timeouts to reduce thrash.

Checklist B: Two-day fix (make storms less likely)

  1. Implement PgBouncer for web-style workloads and set transaction pooling unless you have strong reasons not to.
  2. Set guardrails: idle_in_transaction_session_timeout, role-based statement_timeout, and migration lock timeouts.
  3. Cap app pools per instance and document the math: instances × pool size should not exceed what Postgres can serve (worked example after this checklist).
  4. Instrument churn: track connection rate, not just connection count.
  5. Run a controlled storm test in staging: deploy restart + load, watch backlog, auth CPU, and PgBouncer queueing.
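
A worked example of the pool math from step 3, with made-up fleet numbers:

# 20 API instances x 10 connections per in-app pool = 200 potential server connections
# That already equals max_connections = 200 from Task 1, so one deploy wave can exhaust Postgres.
# With PgBouncer in transaction mode and default_pool_size = 50, Postgres sees at most 50;
# the remaining clients queue at the pooler instead of failing with "too many clients already".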

Checklist C: Long-term hygiene (stop relearning the same lesson)

  1. Standardize client settings (timeouts, keepalives, retry backoff) as a library, not tribal knowledge.
  2. Make migrations boring: enforce safe patterns and separate schema changes from data backfills.
  3. Capacity plan on “useful work,” not max connections: measure throughput, latency SLOs, and IO headroom; treat connections as a control plane.
  4. Run incident drills focused on lock contention and connection churn, because those are the ones that surprise people.

FAQ

1) Should I always deploy PgBouncer on Debian 13 for Postgres?

For most OLTP/web workloads: yes, it’s a practical default. Not because Postgres is weak, but because clients are messy.
If you have session-heavy workloads, you can still use PgBouncer with session pooling or isolate those clients.

2) Is raising max_connections ever the right move?

Sometimes—when you’re sure you have memory headroom, your workload truly needs more concurrency, and connection churn is not the issue.
It’s rarely the best first response to a storm. Measure per-backend memory overhead and watch swap like a hawk.

3) Why did my database get slower after I “fixed” connection errors?

You likely converted a refusal failure into a queueing failure. More backends means more scheduler overhead and more memory risk.
If you didn’t reduce the actual demand (client concurrency or slow queries), the system still can’t keep up—it just fails differently.

4) What pool mode should I use in PgBouncer?

Transaction pooling for typical API traffic. Session pooling only when you require session state (temporary tables, session GUCs,
advisory locks held across transactions, some prepared statement patterns).

5) How do I know if my “connection storm” is actually locks?

Look at pg_stat_activity wait events and lock graphs. If many sessions wait on Lock events and you can identify a blocker,
you’re in a lock incident that may trigger connection retries. Fix the lock cause; don’t just add pooling.

6) Can PgBouncer hide slow queries?

It can hide the early warning signs by buffering clients. That’s useful for containment. But the queue will grow and latency will spike.
Use PgBouncer metrics (waiting clients, maxwait) as a canary telling you the database is saturated or blocked.

7) What’s the single most underrated setting for storm blast radius?

idle_in_transaction_session_timeout. It prevents a small number of broken requests from holding locks forever and turning your day into archaeology.

8) Do I need to tune kernel parameters for Postgres connection storms?

Sometimes. If you see backlog saturation or SYN queue issues, tuning somaxconn and tcp_max_syn_backlog helps.
But if your app is connecting like a drum solo, kernel tuning is not a business plan. Fix churn with pooling and retry discipline.

9) Does putting PgBouncer on the database host make sense?

It can, and it’s common. But placing it closer to clients (per node, per cluster, or as a sidecar) often reduces network handshake load
and makes storms less able to cross blast-radius boundaries. Choose based on operational simplicity and failure domains.

10) What’s the fastest “good enough” improvement if I can’t deploy a pooler yet?

Cap app-side pools, add exponential backoff with jitter, and add timeouts (idle_in_transaction_session_timeout, lock timeouts for migrations).
Also confirm systemd file limits aren’t sabotaging you.

Next steps you can do this week

If you’re currently firefighting storms, do containment first: identify the dominant client, stop the retry flood, and confirm you aren’t
stuck on OS limits or lock contention. Then pick one structural fix and ship it.

  1. Add PgBouncer (transaction pooling) for your main app workload, and verify usage by observing Postgres client addresses collapsing to the pooler.
  2. Set guardrails: enable idle_in_transaction_session_timeout and add role-based timeouts where appropriate.
  3. Fix connection behavior: cap per-instance pools and implement jittered backoff for retries. If you can’t describe your retry policy, you don’t have one.
  4. Run a storm rehearsal in staging: rolling restart your app fleet and watch ss backlog, PgBouncer queues, and Postgres wait events.
  5. Stop using max_connections as a comfort blanket. Keep it aligned with memory, CPU, and the concurrency Postgres can serve well.

The point isn’t to “avoid errors.” The point is to keep the database doing useful work when the rest of the system is having a noisy day.
Poolers, tuning, and sane client behavior each do different parts of that job. Use the right tool, and your pager gets bored again.
