“Random timeouts” is what people say when the incident channel is moving faster than the graphs. One request hangs for 30 seconds, another finishes in 12 ms, and your database gets blamed because it’s the only shared dependency everyone can spell.
Most of the time, the database is innocent. Or at least: guilty in a more interesting way than “Postgres is slow” or “MySQL dropped connections.” Timeouts live in the cracks—between app and DB—where network queues, DNS caching, load balancers, and connection pools quietly change the rules.
What “random” timeouts really mean
“Random” typically means correlated, but not with the things you’re currently looking at. The same query can time out for one user and not another because it’s not just the query. It’s:
- Which app instance you hit (different DNS cache, different pool, different node kernel, different conntrack table fullness).
- Which DB endpoint you resolved (stale DNS, split-horizon, happy-eyeballs IPv6/IPv4 mismatch).
- Which path the packet took (ECMP hash changes, one congested link, one noisy neighbor VM).
- Which connection you reused (pool gives you a half-dead TCP session; DB already forgot you).
- Which lock you waited on (a transaction sitting on a row lock is indistinguishable from a network stall unless you instrument).
Timeouts are also stackable. An app may have a 2s request timeout, the driver a 5s connect timeout, a pooler a 30s server_login_retry, and a load balancer a 60s idle timeout. When someone says “it times out after about 30 seconds,” that’s not a clue. That’s a confession: a default value did this.
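If you want that stack to be diagnosable, make each client-side deadline explicit instead of inheriting defaults. A minimal sketch for a libpq client (the hostname and values are illustrative, not recommendations):
cr0x@server:~$ psql "host=db.internal.example connect_timeout=5 options='-c statement_timeout=2000'" -c "select 1;"
connect_timeout is seconds at the libpq layer; statement_timeout is milliseconds enforced by the server. Write both down next to whatever the framework and load balancer enforce, and "about 30 seconds" stops being a mystery.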
MySQL vs PostgreSQL: how timeouts tend to look
Where “random timeouts” usually land in MySQL systems
In MySQL estates, “random timeouts” are commonly the side effects of connection churn and thread/connection handling limits:
- Connection storms after deploys (or after a pooler restart) can saturate MySQL with authentication and thread creation overhead.
- wait_timeout and interactive_timeout can kill idle connections that the pool thought were alive, producing sporadic “server has gone away” or “lost connection” errors.
- DNS reverse lookups (when name resolution is in play for grants or logging) can add unexpected latency at connect time when DNS is unhealthy.
- Proxy layers (SQL-aware proxies, L4 load balancers) can introduce idle timeouts that look like “MySQL is flaky.”
Where “random timeouts” usually land in PostgreSQL systems
In Postgres, timeouts often show up as waiting—on locks, on connection slots, or on server resources:
- max_connections saturation creates a very specific pattern: some clients connect instantly, some hang, some fail—depending on pool behavior and backoff.
- Locks and long transactions cause queries to “randomly” hang behind one bad actor.
- statement_timeout and idle_in_transaction_session_timeout can turn a slow path into an error, which is good—until it’s mis-tuned and becomes noise.
- PgBouncer in transaction/statement mode can amplify oddities if your application relies on session state.
Both databases suffer the same physics: TCP is not magic, DNS is a distributed cache of lies, and poolers are wonderful until they aren’t. The differences are in the default failure behavior and what people tend to bolt on around them.
Paraphrased idea from Werner Vogels: “Everything fails, all the time.” It’s not pessimism; it’s a design requirement.
Joke #1: “Random timeouts” are just deterministic failures that haven’t met your dashboards yet.
Interesting facts and a little history (that actually helps)
- PostgreSQL descends from POSTGRES (1980s), designed with extensibility and correctness in mind; it shows in its strictness around transactions and locking behavior.
- MySQL’s early popularity (late 1990s/2000s) came from being fast and easy for web workloads; many operational defaults and ecosystem assumptions still reflect “lots of short queries.”
- MySQL historically leaned on non-transactional engines (like MyISAM) before InnoDB became the norm; operational lore about “MySQL is simple” often ignores how InnoDB behaves under contention.
- Postgres introduced MVCC early and leaned into it; “queries don’t block writes” is mostly true—until locks and DDL show up.
- PgBouncer became common because Postgres connections are not cheap; a large fleet doing TLS + auth per request can look like a DDoS you accidentally paid for.
- MySQL proxies (and load balancers) became common with read scaling; L4/L7 components can inject idle timeouts that mimic server instability.
- DNS TTLs exist because resolvers are caches; clients disagree about TTL honoring, and some libraries cache longer than you think—especially in long-running processes.
- Linux conntrack (netfilter) can be a bottleneck in NAT-heavy setups; it’s not database-specific, but database traffic is steady enough to reveal it.
- Cloud load balancers often have fixed-ish idle timeout defaults; databases are chatty but sometimes quiet, so “idle” can still be “healthy.”
Fast diagnosis playbook
This is the order I use when someone says “DB timeouts” and there’s no time to debate architecture.
First: classify the timeout
- Connect timeout (can’t establish TCP/TLS/auth). Suspects: DNS, routing, firewall, load balancer health, max connections, SYN backlog.
- Read timeout (connected, then stuck). Suspects: locks, long transactions, server CPU/IO, packet loss/retransmits, pool giving you a dead socket.
- Write timeout (often looks like read timeout but on commit). Suspects: fsync/IO stalls, replication pressure, synchronous commit, storage latency, network to storage.
- Application timeout (your own deadline fired). Suspects: everything above plus “you set 500ms and hoped.”
Second: find the scope
- One app node? Look at DNS cache, kernel, conntrack, local pool state, noisy neighbor.
- One AZ/subnet? Look at routing changes, security groups, MTU mismatch, packet loss, brownouts.
- All clients? Look at DB saturation, storage stalls, or a shared proxy/load balancer.
- Only new connections? Look at auth/DNS, TLS handshake, pool exhaustion, max connections, SYN backlog.
Third: decide if you’re waiting or dropping
- Waiting: server shows sessions stuck on locks/IO/CPU; network shows low loss but high latency; queries show “active” but blocked.
- Dropping: TCP resets, broken pipes, “server closed the connection,” spikes in retransmits/timeouts, NAT table pressure.
Fourth: cut the problem surface
- Try one client, one connection, direct to DB IP (bypass DNS + pooler) to separate layers; see the connect sketch after this list.
- Try from a different node/subnet to separate local vs systemic.
- Reduce concurrency temporarily to stop self-inflicted connection storms while you investigate.
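A minimal way to take DNS and the pooler out of the equation (the IPs match the ones used in the tasks below; adjust credentials to your environment):
cr0x@server:~$ psql "host=10.20.30.41 port=5432 connect_timeout=3 dbname=appdb user=ops" -c "select 1;"
cr0x@server:~$ mysql --connect-timeout=3 -h 10.20.30.42 -u ops -p -e "select 1;"
If the direct-to-IP connection is clean while the normal path times out, the problem lives in DNS, the pooler, or whatever middlebox you just bypassed.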
Practical tasks: commands, outputs, and decisions
These are real tasks you can run during an incident. Each includes what the output means and what you decide next. Pick the ones that match your environment (bare metal, VM, Kubernetes).
1) Confirm DNS answer, TTL, and whether it’s flapping
cr0x@server:~$ dig +noall +answer +ttlid db.internal.example A
db.internal.example. 5 IN A 10.20.30.41
What it means: TTL=5 seconds. That’s aggressive. If the endpoint changes or the resolver is slow, you will notice.
Decision: If TTL is very low, check resolver health and client caching behavior. Consider raising TTL for stable services or using a stable VIP/proxy endpoint.
2) Check if your node is using a broken resolver path
cr0x@server:~$ resolvectl status | sed -n '1,80p'
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.0.0.2
DNS Servers: 10.0.0.2 10.0.0.3
What it means: This host is using systemd-resolved with two DNS servers. If 10.0.0.2 is sick, you may see intermittent lookup delays.
Decision: If you see lookup latency spikes, test each resolver directly; consider removing an unhealthy resolver or fixing routing/security to it.
3) Measure DNS latency directly (don’t guess)
cr0x@server:~$ for i in {1..5}; do time dig +tries=1 +timeout=1 @10.0.0.2 db.internal.example A >/dev/null; done
real 0m0.012s
real 0m0.980s
real 0m0.011s
real 0m1.003s
real 0m0.010s
What it means: Half the queries nearly hit the 1s timeout. That’s not “fine.” That’s roulette.
Decision: Treat DNS as part of your critical path. Fix resolver load, packet loss, or upstream recursion. Meanwhile, increase client-side DNS caching only if you understand failover implications.
4) Confirm the route and MTU to the DB
cr0x@server:~$ ip route get 10.20.30.41
10.20.30.41 via 10.20.0.1 dev eth0 src 10.20.10.55 uid 1000
cache
What it means: You have a routed path via 10.20.0.1. If this changes during incidents, that’s a clue.
Decision: If routing is unstable, pull network team in early. Unstable routes produce “random” timeouts that no amount of DB tuning will fix.
5) Quick TCP connect test (bypassing drivers)
cr0x@server:~$ nc -vz -w 2 10.20.30.41 5432
Connection to 10.20.30.41 5432 port [tcp/postgresql] succeeded!
What it means: TCP handshake succeeded. This does not prove authentication or query success, but it rules out “port is blocked” for that moment.
Decision: If TCP fails intermittently, look at security rules, SYN backlog, stateful firewalls, conntrack/NAT exhaustion, and LB health checks.
6) Inspect retransmits and TCP health from the client
cr0x@server:~$ ss -ti dst 10.20.30.41 | sed -n '1,40p'
ESTAB 0 0 10.20.10.55:49822 10.20.30.41:5432
cubic wscale:7,7 rto:204 rtt:2.3/0.7 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:21984 bytes_retrans:2896 segs_out:322 segs_in:310 send 50.4Mbps lastsnd:12 lastrcv:12 lastack:12 pacing_rate 100Mbps unacked:2 retrans:1/7
What it means: There are retransmits (bytes_retrans and retrans counters). A little is normal; spikes correlate strongly with “random” stalls.
Decision: If retransmits spike during incidents, stop debating query plans and start looking at packet loss, congestion, MTU/PMTU issues, or flaky NICs.
7) Check for conntrack pressure (NAT-heavy environments)
cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 248901
net.netfilter.nf_conntrack_max = 262144
What it means: You’re close to the conntrack table limit. When it fills, you get packet drops that look like “intermittent timeouts.”
Decision: Raise conntrack limits (with memory awareness), reduce connection churn, avoid unnecessary NAT, and consider pooling at the edge.
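If the count is pressing against the limit, raising it is a stopgap you can apply immediately (the value is illustrative; each conntrack entry costs kernel memory, so size against available RAM and persist the change under /etc/sysctl.d/ if you keep it):
cr0x@server:~$ sudo sysctl -w net.netfilter.nf_conntrack_max=524288
The durable fix is still less connection churn and less NAT in the database path.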
8) Verify load balancer / proxy idle timeout behavior with a controlled pause
cr0x@server:~$ (echo "ping"; sleep 75; echo "ping") | nc -v 10.20.30.50 3306
Connection to 10.20.30.50 3306 port [tcp/mysql] succeeded!
What it means: nc holds the connection open across a 75-second idle gap (the "ping" lines are junk bytes as far as the MySQL protocol is concerned; they only mark the start and end of the pause). If nc exits or reports a reset around the ~60-second mark instead of surviving the full pause, some middlebox is enforcing an idle timeout.
Decision: Align keepalive settings: kernel TCP keepalives, driver keepalives, and LB/proxy idle timeouts. Or remove the LB from the DB path if it’s the wrong tool.
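For libpq-based clients (psql and most Postgres drivers that wrap libpq), keepalives can be requested per connection; the parameters below are standard libpq connection options, and the values are illustrative:
cr0x@server:~$ psql "host=10.20.30.41 dbname=appdb user=ops keepalives=1 keepalives_idle=20 keepalives_interval=5 keepalives_count=3" -c "select 1;"
Pick values whose total (idle plus interval times count) stays comfortably below the middlebox idle timeout you just measured.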
9) PostgreSQL: see if clients are stuck waiting on locks
cr0x@server:~$ psql -h 10.20.30.41 -U ops -d appdb -c "select pid, wait_event_type, wait_event, state, now()-query_start as age, left(query,80) from pg_stat_activity where state <> 'idle' order by age desc limit 10;"
pid | wait_event_type | wait_event | state | age | left
------+-----------------+---------------+--------+---------+--------------------------------------------------------------------------------
8123 | Lock | transactionid | active | 00:00:31| update orders set status='paid' where id=$1
7991 | Client | ClientRead | active | 00:00:09| select * from orders where id=$1
What it means: A query is waiting on a lock for 31 seconds. That’s not a network timeout. That’s a blocking transaction.
Decision: Find the blocker (next task), then decide whether to kill it, tune the application transaction scope, or add indexes/rewrites to reduce lock duration.
10) PostgreSQL: find the blocking query
cr0x@server:~$ psql -h 10.20.30.41 -U ops -d appdb -c "select blocked.pid as blocked_pid, blocker.pid as blocker_pid, now()-blocker.query_start as blocker_age, left(blocker.query,80) as blocker_query from pg_locks blocked_locks join pg_stat_activity blocked on blocked.pid=blocked_locks.pid join pg_locks blocker_locks on blocker_locks.locktype=blocked_locks.locktype and blocker_locks.database is not distinct from blocked_locks.database and blocker_locks.relation is not distinct from blocked_locks.relation and blocker_locks.page is not distinct from blocked_locks.page and blocker_locks.tuple is not distinct from blocked_locks.tuple and blocker_locks.virtualxid is not distinct from blocked_locks.virtualxid and blocker_locks.transactionid is not distinct from blocked_locks.transactionid and blocker_locks.classid is not distinct from blocked_locks.classid and blocker_locks.objid is not distinct from blocked_locks.objid and blocker_locks.objsubid is not distinct from blocked_locks.objsubid and blocker_locks.pid != blocked_locks.pid join pg_stat_activity blocker on blocker.pid=blocker_locks.pid where not blocked_locks.granted and blocker_locks.granted;"
blocked_pid | blocker_pid | blocker_age | blocker_query
-------------+-------------+-------------+---------------------------------------------------------------
8123 | 7701 | 00:02:14 | begin; select * from orders where customer_id=$1 for update;
What it means: PID 7701 has held locks for over 2 minutes. Your “random” timeouts are your app waiting politely.
Decision: If this is a runaway transaction, terminate it. If it’s normal, change the code: keep transactions short, avoid FOR UPDATE unless you need it, and index the predicate.
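On PostgreSQL 9.6 and later there is a shorter route to the same answer, plus the kill switch; the PID below is the blocker from the example output:
cr0x@server:~$ psql -h 10.20.30.41 -U ops -d appdb -c "select pid, pg_blocking_pids(pid) as blocked_by, now()-xact_start as xact_age, left(query,60) from pg_stat_activity where cardinality(pg_blocking_pids(pid)) > 0;"
cr0x@server:~$ psql -h 10.20.30.41 -U ops -d appdb -c "select pg_terminate_backend(7701);"
pg_terminate_backend ends the session and rolls back its open transaction; use it on runaways, not as a routine lock-management strategy.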
11) PostgreSQL: detect connection slot exhaustion
cr0x@server:~$ psql -h 10.20.30.41 -U ops -d appdb -c "select count(*) as total, sum(case when state='active' then 1 else 0 end) as active from pg_stat_activity;"
total | active
-------+--------
498 | 112
What it means: 498 sessions are present. If max_connections is 500, you’re living on the edge.
Decision: Put PgBouncer in front (carefully), reduce app pool sizes, and reserve connections for maintenance. “Just raise max_connections” is usually a memory plan disguised as optimism.
12) MySQL: check if you’re hitting connection limits or thread pressure
cr0x@server:~$ mysql -h 10.20.30.42 -u ops -p -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW GLOBAL STATUS LIKE 'Threads_running'; SHOW VARIABLES LIKE 'max_connections';"
+-------------------+-------+
| Variable_name | Value |
+-------------------+-------+
| Threads_connected | 942 |
+-------------------+-------+
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| Threads_running | 87 |
+-----------------+-------+
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 1000 |
+-----------------+-------+
What it means: You’re near max connections. Even if CPU is fine, the server may thrash on connection handling and context switching.
Decision: Reduce connection counts via pooling, set sane per-service limits, and stop every microservice from thinking it deserves 200 connections “just in case.”
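One enforcement mechanism MySQL gives you directly is a per-account connection cap (the account name here is hypothetical; pick limits that match your actual service budgets, and the ops account needs privileges to alter users):
cr0x@server:~$ mysql -h 10.20.30.42 -u ops -p -e "ALTER USER 'checkout_app'@'%' WITH MAX_USER_CONNECTIONS 50;"
It will not stop a storm gracefully on its own, but it turns "one service ate every slot" into an error you can attribute.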
13) MySQL: identify “server has gone away” root causes (timeouts vs packet size)
cr0x@server:~$ mysql -h 10.20.30.42 -u ops -p -e "SHOW VARIABLES LIKE 'wait_timeout'; SHOW VARIABLES LIKE 'interactive_timeout'; SHOW VARIABLES LIKE 'max_allowed_packet';"
+--------------------+-------+
| Variable_name | Value |
+--------------------+-------+
| wait_timeout | 60 |
+--------------------+-------+
+--------------------+-------+
| Variable_name | Value |
+--------------------+-------+
| interactive_timeout| 60 |
+--------------------+-------+
+--------------------+----------+
| Variable_name | Value |
+--------------------+----------+
| max_allowed_packet | 67108864 |
+--------------------+----------+
What it means: Idle connections die after 60 seconds. That’s fine for short-lived clients, terrible for pools that keep connections around longer.
Decision: Either: increase timeouts and enable keepalives; or shorten pool idle lifetime so the pool discards connections before the server does.
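If you choose the first option, the server-side change is a one-liner, but it only applies to new sessions, needs a privileged account, and must also go into my.cnf to survive a restart (600 is an example value, not a recommendation):
cr0x@server:~$ mysql -h 10.20.30.42 -u ops -p -e "SET GLOBAL wait_timeout = 600; SET GLOBAL interactive_timeout = 600;"
Whatever you pick, keep the pool's idle and max lifetime settings below it, so the pool retires connections before the server does.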
14) Validate kernel TCP keepalive timers (client side)
cr0x@server:~$ sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
What it means: First keepalive is sent after 2 hours. If your load balancer kills idle sessions after 60 seconds, keepalive won’t save you.
Decision: Tune keepalives (carefully) for DB clients behind LBs/NATs, or stop putting databases behind devices that murder idle connections.
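A hedged example of client-side kernel tuning (values are illustrative; they only affect sockets where the application or driver actually enables SO_KEEPALIVE, and like any sysctl change they need persisting under /etc/sysctl.d/):
cr0x@server:~$ sudo sysctl -w net.ipv4.tcp_keepalive_time=45 net.ipv4.tcp_keepalive_intvl=10 net.ipv4.tcp_keepalive_probes=3
With a 60-second middlebox idle timeout, the first probe at 45 seconds keeps the connection looking alive; with the 2-hour default, it never gets the chance.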
15) Validate application pool saturation at the OS level
cr0x@server:~$ ss -s
Total: 1632 (kernel 0)
TCP: 902 (estab 611, closed 221, orphaned 0, synrecv 0, timewait 221/0), ports 0
Transport Total IP IPv6
RAW 0 0 0
UDP 12 10 2
TCP 681 643 38
INET 693 653 40
FRAG 0 0 0
What it means: 611 established TCP connections. If your app “should” have 50, you have a pool leak or a connection storm.
Decision: Cap the pool, add backpressure, and fix the leak. Scaling the DB to match a buggy pool is how budgets die.
16) Quick server-side view: are you CPU or IO bound?
cr0x@server:~$ iostat -x 1 3
avg-cpu: %user %nice %system %iowait %steal %idle
18.21 0.00 6.10 24.77 0.00 50.92
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await aqu-sz %util
nvme0n1 120.0 40960.0 0.0 0.00 8.10 341.3 220.0 53248.0 21.40 4.90 96.80
What it means: Disk is ~97% utilized; write await is 21 ms. That’s commit latency waiting on storage, not a “driver bug.”
Decision: If IO is pegged, look at checkpointing, fsync behavior, replication settings, storage throttling, and noisy neighbors.
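If Postgres is the server on that disk, checkpoint statistics tell you whether the IO is self-inflicted. On most versions this lives in pg_stat_bgwriter (newer releases split checkpoint counters into pg_stat_checkpointer, so adjust the view name if columns are missing):
cr0x@server:~$ psql -h 10.20.30.41 -U ops -d appdb -c "select checkpoints_timed, checkpoints_req, checkpoint_write_time, checkpoint_sync_time, buffers_backend_fsync from pg_stat_bgwriter;"
A high ratio of requested to timed checkpoints, or non-zero buffers_backend_fsync, points at checkpoint tuning rather than the storage vendor.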
DNS: the silent saboteur
DNS is rarely the root cause of database timeouts. It’s usually the amplifier: a small DNS issue turns routine reconnect logic into a synchronized failure.
How DNS causes “random” DB timeouts
- Slow lookups during connection creation: every reconnect waits on DNS, so your connection pool becomes a DNS benchmark suite.
- Stale answers: clients keep using an old IP after failover; the DB is fine at the new IP, and you’re pounding the old one like it owes you money.
- Negative caching: transient NXDOMAIN or SERVFAIL gets cached, and now you “randomly” can’t resolve the DB for a while.
- Split-horizon mismatches: inside/outside DNS give different answers; half your pods resolve to unreachable addresses.
- Happy Eyeballs quirks: IPv6 AAAA answers lead to attempts over a broken v6 path before v4 works.
MySQL vs Postgres: DNS-related surprises
Both are affected at connect time because the client resolves the host before establishing TCP. The surprise differences tend to come from what sits around them:
- MySQL deployments often include SQL-aware proxies or L4 load balancers for read/write splitting. That adds another hostname, another resolution, and another caching layer.
- Postgres deployments often include PgBouncer. PgBouncer itself resolves upstream hostnames and has its own behavior around reconnects and retries.
Operationally: don’t let every app instance do its own DNS gymnastics during an incident. Put a stable endpoint in front of the DB (VIP, proxy, or managed endpoint) and test the failure mode intentionally.
Network realities: retransmits, MTU, and “it’s not dropped, it’s queued”
Networks don’t usually fail by going completely dark. They fail by becoming unreliable enough that your timeouts become statistically interesting.
Retransmits: the hidden tax
A single retransmit might cost tens or hundreds of milliseconds depending on RTO and congestion control. A few percent packet loss can turn a chatty database protocol into a timeout machine, especially with TLS on top.
MTU and PMTUD: the classic “it works until it doesn’t”
MTU mismatch can show up as:
- Small queries work, large responses stall.
- Connect succeeds, first big result hangs.
- Some paths work (same rack), some fail (cross-AZ).
If ICMP “fragmentation needed” is blocked somewhere, Path MTU Discovery breaks and you get blackhole behavior. It looks like a database timeout because the DB is the first thing that sends large packets consistently.
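A quick way to test for a PMTUD blackhole from the client, assuming a standard 1500-byte MTU (1472 bytes of payload plus 28 bytes of ICMP/IP headers):
cr0x@server:~$ ping -M do -s 1472 -c 3 10.20.30.41
If the large pings disappear (100% loss) while smaller ones (say -s 1200) succeed, something on the path has a smaller MTU and the "fragmentation needed" messages are not making it back.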
Queues and bufferbloat
Not all latency is loss. If you have congested links, your packets might be sitting in a queue. TCP will eventually deliver them, but your application will time out first. That’s why “no packet loss” doesn’t mean “network is fine.”
Joke #2: The network is like the office printer: it works perfectly until someone is watching.
Pooling culprits: PgBouncer, proxies, and “helpful” defaults
Pooling exists because creating a DB connection is expensive—CPU, memory, TLS handshakes, authentication, server bookkeeping. But pooling is also where reality gets abstracted into something your developers misunderstand.
The three pool types (and why you should care)
- Client-side pools (in the app): simplest, but can multiply connections by number of instances. Great at creating connection storms.
- External poolers (PgBouncer, ProxySQL): can cap server connections and absorb client churn. Also adds a hop and another timeout configuration surface.
- Managed proxies (cloud DB proxies): convenient, but their behavior during failover and their idle timeout defaults are often where “random” is born.
Pooler failure modes that look like database timeouts
- Pool saturation: requests queue waiting for a connection; app times out; DB is bored.
- Dead connections reused: pool hands out a socket that a middlebox killed; first query stalls or errors.
- Retry storms: pooler retries aggressively; adds load when the DB is already unhappy.
- Session state assumptions: transaction pooling breaks apps that rely on session variables, temp tables, advisory locks, prepared statements, or SET LOCAL semantics (depending on database and mode).
MySQL vs Postgres: pooling gotchas
Postgres is particularly sensitive to connection counts because each connection maps to a backend process (in its classic architecture). That's why PgBouncer is so common. But PgBouncer's modes matter (a minimal config sketch follows this list):
- Session pooling: safest; fewer surprises; less effective at reducing server connections under bursty concurrency.
- Transaction pooling: effective; breaks anything needing session affinity unless carefully designed.
- Statement pooling: sharp knife; rarely worth it for general apps.
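A minimal sketch of a transaction-pooling PgBouncer config, assuming the Postgres host used elsewhere in this article; authentication settings are omitted and every number is illustrative:
[databases]
appdb = host=10.20.30.41 port=5432 dbname=appdb
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
default_pool_size = 20
max_client_conn = 2000
server_idle_timeout = 300
The point of the sketch: max_client_conn absorbs the app fleet, default_pool_size caps what the server actually sees, and server_idle_timeout retires quiet connections before a middlebox does it for you.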
MySQL handles many connections differently: thread-per-connection by default, with a thread pool available in some editions and forks. Connection count is still a problem, but the failure signature often looks like CPU overhead and scheduler churn rather than an immediate “no slots” error.
MySQL-specific timeout failure modes
Idle timeout mismatch (wait_timeout vs pool lifetime)
One of the most common “random” failures: the server times out idle connections, the pool keeps them, and your next query discovers the corpse.
What to do: Make pool max lifetime and idle timeout shorter than wait_timeout, or raise wait_timeout and enable TCP keepalive. Pick one; don’t let it drift.
Authentication and DNS reverse lookup slowness
If MySQL is configured in ways that require hostname resolution (or the environment does reverse lookups for logging/access control), a DNS hiccup becomes a connect-time stall. It’s not that MySQL “can’t handle it”; it’s that your auth path is now a distributed system.
Too many connections: “it works until the deploy”
MySQL can appear stable at 500 connections and then fall apart at 900—not because queries changed, but because overhead did. Threads, memory per connection, mutex contention, and cache churn all show up as timeouts in the app.
Strong opinion: If you have hundreds of application instances each allowed to open dozens of connections, you don’t have a database problem. You have a coordination problem.
PostgreSQL-specific timeout failure modes
Locks and long transactions
Postgres is excellent at concurrency—until you hold locks longer than you think. The usual culprit is a transaction that begins early, does some “business logic,” then updates rows at the end. Under load, those locks stack up and everything “randomly” waits.
What to do: Keep transactions short, reduce lock scope, index the exact rows you touch, and use timeouts that fail fast enough to protect the system.
Connection slot exhaustion and the myth of “just increase max_connections”
Each Postgres connection consumes memory and management overhead. Raising max_connections without rethinking work_mem, parallelism, and total RAM is how you turn a timeout incident into an OOM incident.
What to do: Put PgBouncer in session or transaction pooling mode, then right-size app pools. Reserve headroom for admin connections.
Timeout settings that are either missing or weaponized
Postgres gives you good knobs: statement_timeout, lock_timeout, idle_in_transaction_session_timeout. The trap is setting them globally to values that match a single endpoint’s expectations.
What to do: Set timeouts per role or per service. Different workloads deserve different failure modes.
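Per-role settings are a one-time change that persists across sessions; the role names below are hypothetical and the values are illustrative:
cr0x@server:~$ psql -h 10.20.30.41 -U ops -d appdb -c "ALTER ROLE checkout_api SET statement_timeout = '2s';"
cr0x@server:~$ psql -h 10.20.30.41 -U ops -d appdb -c "ALTER ROLE checkout_api SET idle_in_transaction_session_timeout = '10s';"
cr0x@server:~$ psql -h 10.20.30.41 -U ops -d appdb -c "ALTER ROLE reporting SET statement_timeout = '5min';"
New connections under each role pick up the values automatically; the interactive OLTP path fails fast while reporting gets room to breathe.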
Three corporate mini-stories from the trenches
Mini-story 1: the incident caused by a wrong assumption
The company had a busy checkout service and a Postgres primary with a hot standby. During a planned failover test, they moved the primary role. The endpoint DNS name was supposed to follow the primary, TTL set low, everything “modern.”
After the failover, a third of requests started timing out. Not errors—timeouts. The database graphs looked fine. CPU normal. Disk fine. Replication caught up. Everyone stared at Postgres, because that’s what you do when you’re scared.
The wrong assumption was simple: “Our Java apps respect DNS TTL.” They didn’t—not consistently. Some instances cached the resolved IP much longer due to JVM-level caching behavior combined with the way the client library resolved names. Those instances kept trying the old primary IP, which now rejected writes or simply wasn’t reachable due to security rules.
The fix wasn’t heroic. They made the DB endpoint stable behind a proxy that didn’t change IP on failover, and they explicitly configured JVM DNS caching to respect TTL for this environment. They also added a runbook check: on failover, measure DNS resolution and actual destination IP from a sample of pods. No more faith-based networking.
Mini-story 2: the optimization that backfired
A team running MySQL for a multi-tenant API wanted to reduce connection overhead. They turned pool sizes way up “to reuse connections” and reduced timeouts “to fail fast.” The result looked great in staging: fewer connects per second, lower median latency.
In production, a weekly traffic burst hit. The service scaled out quickly. Every new instance came online with a large pool, and they all tried to warm it at once. MySQL accepted the connections until it didn’t. Threads ballooned. CPU rose not from queries, but from managing connections and context switching. Then came the real fun: retries.
The client retry policy, combined with a short connect timeout, created a retry storm. Connections that would have succeeded in 300 ms were now failing at 100 ms and retried immediately. The “optimization” made the system less patient precisely when it needed patience.
They backed out the pool growth and replaced it with strict global connection budgets per service, plus a slow-start strategy on startup. They also made retries exponential with jitter and added circuit breaking around DB connectivity. Median latency increased slightly. Incidents decreased dramatically. Production systems prefer boring over clever.
Mini-story 3: the boring but correct practice that saved the day
A different org ran Postgres with PgBouncer in session mode. Nothing fancy. Their SRE team had a habit that looked dull: every quarter, they ran controlled failure drills—kill one app node, restart PgBouncer, simulate DNS slowness, introduce small packet loss in a test environment that mirrored production topology.
One afternoon, a cloud network degradation hit their region. Packet loss was low but non-zero; latency spiked intermittently. Services started reporting “database timeouts.” The on-call followed the drill playbook: first checked retransmits from a client node, then checked PgBouncer queue depth, then checked Postgres lock waits. In ten minutes they knew it was network degradation plus retry amplification, not a database regression.
Because they’d rehearsed, they had a safe mitigation ready: reduce concurrency by lowering app pool sizes dynamically, extend certain client timeouts slightly to avoid retry storms, and temporarily disable one background job that produced large result sets. Postgres stayed healthy. The incident became a blip instead of a pager marathon.
It wasn’t glamorous. It was the operational equivalent of flossing. Uncool, until it saves you thousands in dental work.
Common mistakes (symptom → root cause → fix)
1) Timeouts cluster exactly around 60 seconds
Symptom: failures at ~60s, regardless of query complexity.
Root cause: load balancer / proxy idle timeout, or server-side idle timeout killing pooled connections.
Fix: align idle timeouts and keepalives; set pool max lifetime shorter than DB idle timeout; avoid LBs in DB path unless you know their behavior.
2) Only new connections time out; existing ones work
Symptom: established traffic okay, but reconnects fail during incident.
Root cause: DNS slowness, TLS handshake bottleneck, SYN backlog pressure, or auth bottleneck.
Fix: measure DNS latency; check SYN backlog and firewall drops; cap connection creation rate; use poolers/proxies to reduce handshake churn.
3) Queries “randomly” hang, then resume
Symptom: a query pauses for seconds, then completes; CPU and IO look normal.
Root cause: lock waits (Postgres especially), or network retransmits/queueing.
Fix: check lock wait events; find blockers; reduce transaction scope; check client-side retransmits and network latency.
4) One app node is cursed
Symptom: timeouts only from a specific host/pod/node.
Root cause: local DNS cache issues, conntrack exhaustion, NIC errors, mis-sized pool, or a bad route table entry.
Fix: compare resolver behavior and retransmits across nodes; drain and replace the node; fix config drift; set SLO-based node quarantine rules.
5) “Server has gone away” / “broken pipe” after idle periods
Symptom: first query after idle fails; retry succeeds.
Root cause: server idle timeout, LB idle timeout, NAT timeout killing state.
Fix: lower pool idle lifetime; enable keepalive; adjust DB timeouts; avoid stateful middleboxes between app and DB.
6) Max connections increased, and now timeouts got worse
Symptom: fewer “too many connections” errors, more latency/timeouts.
Root cause: resource contention from too many concurrent backends/threads; memory pressure; context switching overhead.
Fix: add pooling; implement connection budgets; reduce concurrency; scale vertically only after you’ve controlled connection behavior.
7) Timeouts spike after deploys or restarts
Symptom: short incident windows right after rollout.
Root cause: connection warmup storms, synchronized retries, cache cold starts, DNS stampedes.
Fix: slow-start connection creation; add jitter; stagger rollouts; pre-warm carefully; cap retries.
8) Reads time out only for “big” responses
Symptom: small queries fine, big result sets hang.
Root cause: MTU/PMTUD blackholes, bufferbloat, or a proxy that mishandles large payloads.
Fix: validate MTU end-to-end; allow necessary ICMP; test with controlled packet sizes; avoid unnecessary proxies in the path.
Checklists / step-by-step plan
Checklist A: When timeouts start (first 10 minutes)
- Classify timeout: connect vs read vs app deadline. Pull an error sample with timestamps.
- Check DNS latency from at least two client nodes (task 3). If it’s spiky, stop and fix DNS before you do anything fancy.
- Check retransmits from a client node (task 6). Loss explains “random.”
- Check pool saturation: app metrics or OS connections (task 15). If queueing is in the pool, DB tuning won’t help.
- Check DB waiting:
  - Postgres: pg_stat_activity wait events (task 9).
  - MySQL: threads connected/running (task 12) and slow query / lock metrics if available.
- Apply a safe throttle: reduce concurrency and retry rates; stop the loudest background job. Stabilize first, optimize later.
Checklist B: Prevent repeat incidents (next business day)
- Standardize timeouts across app, driver, pooler, and network devices. Document the chosen values and why.
- Set connection budgets per service. Enforce them in config, not in tribal memory.
- Implement backoff with jitter for retries; add circuit breakers on connect failures.
- Introduce a stable DB endpoint (proxy/VIP/managed endpoint) so DNS changes aren’t a client-side chaos experiment.
- Add lock/transaction hygiene rules (especially for Postgres): transaction scope, statement timeouts per role, alert on long transactions (a detection query sketch follows this checklist).
- Run failure drills: simulate DNS slowness and modest packet loss; verify your system degrades predictably.
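For the long-transaction alert, the query to wire into monitoring can be as simple as this (the one-minute threshold is illustrative):
cr0x@server:~$ psql -h 10.20.30.41 -U ops -d appdb -c "select pid, usename, state, now()-xact_start as xact_age, left(query,60) from pg_stat_activity where xact_start is not null and now()-xact_start > interval '1 minute' order by xact_age desc;"
Anything that shows up here repeatedly is a future incident doing reconnaissance.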
Checklist C: Choosing MySQL vs Postgres for operational predictability
- If your organization cannot enforce connection discipline, plan for a pooler/proxy from day one, regardless of database choice.
- If your workload is lock-sensitive and business logic tends to hold transactions open, Postgres will surface the truth—which is painful but fixable.
- If your workload is connection-churn heavy (serverless patterns, bursty fleets), your architecture matters more than the engine. Put a stable pooling layer in front and measure DNS/network like a first-class dependency.
FAQ
1) Why do timeouts feel random even when the root cause is deterministic?
Because distribution hides patterns. Different clients take different routes, use different cached DNS answers, hit different pool states, and contend on different locks. “Random” is often “sharded.”
2) How do I tell if the timeout is in the pool rather than the database?
Look for queue time in the pool metrics, or infer it: if DB CPU/IO is low but app latencies spike and connection counts are high, the pool is likely saturated or churning. OS-level ss counts also help.
3) Should I put a load balancer in front of MySQL or Postgres?
Only if you know exactly what it does with long-lived TCP connections and idle timeouts, and you’ve tested failover behavior. LBs are great at HTTP; databases are not HTTP.
4) Is DNS TTL=5 seconds good for failover?
It can be, but only if your clients actually respect TTL and your resolvers are fast and reliable under load. Low TTL without client correctness turns failover into a lottery.
5) For Postgres, is PgBouncer always the answer?
Often, yes—but mode matters. Session pooling is the least surprising. Transaction pooling is powerful but requires app discipline around session state and prepared statements.
6) For MySQL, is “server has gone away” always a network issue?
No. It can be idle timeouts (wait_timeout), packet size limits (max_allowed_packet), server restarts, or middleboxes closing idle sessions. Start by correlating errors with idle periods and connection reuse.
7) Why do retries make things worse?
Retries convert latency into load. If the bottleneck is shared (DB, DNS, proxy, conntrack), retries synchronize clients and amplify the failure. Use exponential backoff with jitter and a retry budget.
8) What’s the quickest win to reduce “random” DB timeouts?
Control connection behavior. Cap pools, stop connection storms, and align timeouts across layers. Then measure DNS latency and TCP retransmits so you stop debugging ghosts.
9) Are Postgres lock waits a database problem or an application problem?
Usually application behavior expressed through the database. Postgres is just honest about waiting. Fix transaction scope, indexes, and access patterns; then consider isolation levels and timeouts.
10) How do I avoid blame ping-pong between app, DB, and network teams?
Collect three artifacts early: DNS latency samples, TCP retransmit evidence, and DB wait/lock snapshots. With those, the argument turns into a plan.
Next steps (the practical kind)
If you want fewer “random” timeouts, stop treating them as database personality traits. Treat them as interfaces failing under pressure: DNS, TCP, pooling, and concurrency control.
- Adopt the fast diagnosis playbook and make it muscle memory.
- Pick one stable DB endpoint strategy and test failover with real clients, not just a DNS change in theory.
- Set explicit connection budgets and pool caps per service. Enforce them.
- Align timeouts across app, driver, pooler, kernel keepalive, and any middleboxes. Write it down.
- Instrument waiting: Postgres lock waits, MySQL thread pressure, and network retransmits.
Do those, and the next time someone says “random timeouts,” you’ll have the uncomfortable luxury of replying, “Cool. Which layer?”