PostgreSQL vs Redis: how to stop cache stampedes from melting your DB

The page is fast in staging. In production, it’s fast… until it isn’t. One cache key expires, a thousand requests pile in, and suddenly your “database of record” is also your database of regret.
Postgres is now doing cardio at 3 a.m. while Redis watches from the sidelines like a bouncer who forgot why the club is full.

Cache stampedes (aka thundering herds) are one of those failures that feel unfair: the system is “working as designed,” and that’s the problem. This is a practical guide to preventing stampedes with Redis and keeping PostgreSQL alive when the cache lies, expires, or gets flushed by someone “just troubleshooting.”

What a cache stampede looks like in production

A cache stampede is not “the cache missed.” A cache miss is a normal operational event. A stampede is when a cache miss becomes synchronized: many callers miss the same key at the same time and all go to the backing store together.
It’s coordinated failure without any meetings.

Typical timeline

  • T-0: a hot key expires, or a deploy changes a cache namespace, or Redis restarts and loses data.
  • T+1s: request latency spikes; app threads begin to pile up waiting for Postgres.
  • T+10s: Postgres hits connection saturation; CPU rises; I/O queue grows; autovacuum is now your second fire.
  • T+30s: upstream retries kick in; traffic multiplies; caches become a rumor.
  • T+60s: you start scaling app nodes, which increases concurrency, which makes it worse.

The ugly part is that most systems are engineered to fail open: “if cache miss, read from DB.” That’s correct at small scale.
At real scale, it’s a denial-of-service attack you run against yourself, politely, with metrics dashboards open.

Exactly one quote, because it fits: "Hope is not a strategy." (General Gordon R. Sullivan)

Short joke #1: Cache stampedes are like office free donuts—one person mentions them, and suddenly nobody remembers “budget constraints.”

PostgreSQL vs Redis: roles, strengths, and failure modes

PostgreSQL is the truth (and a lot of responsibility)

Postgres is built to be correct: ACID transactions, durable writes, indexes, query planning, isolation levels, and guardrails that prioritize integrity over speed.
It’s not a cache. It can act like one when you have RAM and the working set fits, but stampedes don’t care about your buffer cache.

Postgres fails in predictable ways under stampedes:

  • Connection exhaustion (too many clients, too many concurrent queries)
  • CPU saturation (many identical expensive queries with cold caches)
  • I/O stalls (random reads and index lookups; checkpoint and WAL activity; storage latency)
  • Lock amplification (even reads can contend via metadata or due to queueing effects)
  • Replica lag (read replicas get hammered; lag spikes; eventually you read stale anyway)

Redis is the rumor mill (fast, volatile, and extremely useful)

Redis is an in-memory data structure server. It’s quick, designed for low-latency operations, and command execution is effectively single-threaded per instance (I/O threads and clustering add capacity, but a single hot key still lands on one core).
It’s also not magic. If your stampede protection depends on Redis being perfectly available, you’ve built a single point of failure with nicer dashboards.

Redis fails differently:

  • Eviction storms if memory policy isn’t aligned with key sizes and TTLs
  • Latency spikes during persistence (RDB/AOF fsync) or slow commands
  • Cache flushes (accidental, operational, or during failover)
  • Hot key contention (single key hammered; CPU spikes; network queueing)

The decision boundary: what belongs where

Put these in Postgres:

  • System-of-record data
  • Write paths requiring durability and constraints
  • Complex ad-hoc queries and analytics on current state (within reason)

Put these in Redis:

  • Derived data and read-optimized projections
  • Rate limits, short-lived locks, idempotency keys
  • Shared memoization where recompute is allowed
  • Coordination primitives (carefully) such as “single-flight” locks

The goal isn’t “Redis replaces Postgres.” The goal is “Redis prevents Postgres from seeing the same problem thousands of times per second.”

Interesting facts and quick history (because context helps)

  • Postgres started as POSTGRES in the mid-1980s at UC Berkeley as a successor to Ingres, with early work on extensibility and rules.
  • MVCC isn’t a performance trick; it’s a concurrency model. Postgres’ MVCC reduces read locks, but it also means bloat and vacuum are facts of life.
  • Redis was created in 2009 and became the default “fast cache + simple primitives” choice for a generation of web systems.
  • The “thundering herd problem” predates the web; it’s a classic OS and distributed systems issue where many waiters wake at once.
  • TTL jitter is an old trick from early caching systems: randomize expiration to avoid synchronized cache invalidations.
  • CDNs popularized stale-while-revalidate as an HTTP cache-control pattern; the same idea works in Redis with a little discipline.
  • PgBouncer exists largely because Postgres connections are expensive compared to many other systems; too many direct clients is a known foot-gun.
  • Redis persistence is optional by design; many deployments trade durability for speed, which is fine until someone treats Redis like a database.

Patterns that actually stop stampedes

1) Request coalescing (“single-flight”): one rebuild, many waiters

When a key misses, do not allow every request to rebuild it. Elect one request to do the expensive work; everyone else waits briefly, or gets stale data.
This is the single most effective stampede control because it attacks the multiplicative factor.

You can implement coalescing:

  • In-process (works per instance; doesn’t coordinate across a fleet)
  • Distributed with Redis locks (coordinates across instances; requires careful timeouts; sketched below)
  • Via a dedicated “cache builder” worker queue (most robust, more moving parts)
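
Here is a minimal sketch of the distributed-lock variant using the redis-py client. The key format, the TTL numbers, and the rebuild_from_postgres() callable are illustrative assumptions, not an existing API:

import time
import uuid
import redis

r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

def get_with_single_flight(key, rebuild_from_postgres, ttl=600, lock_ttl=10):
    # Fast path: serve the cached value if it exists.
    value = r.get(key)
    if value is not None:
        return value

    # Miss: try to become the one builder for this key across the fleet.
    lock_key = "lock:" + key
    token = str(uuid.uuid4())
    if r.set(lock_key, token, nx=True, ex=lock_ttl):
        try:
            value = rebuild_from_postgres(key)   # the expensive DB work, done once
            r.set(key, value, ex=ttl)            # value assumed to be a serialized string
            return value
        finally:
            # Best-effort release; see the FAQ for an atomic token-checked release.
            if r.get(lock_key) == token:
                r.delete(lock_key)

    # Someone else is rebuilding: wait briefly, then re-check the cache.
    for _ in range(20):                          # ~2 seconds total
        time.sleep(0.1)
        value = r.get(key)
        if value is not None:
            return value

    # Give up waiting: degrade (serve stale, partial page, error) rather than stampede Postgres.
    raise TimeoutError("cache rebuild for " + key + " did not complete in time")

The waiters never touch Postgres; that is the whole point. Combine this with the stale-serving patterns below so the "give up" branch has something better to do than raise.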

2) Stale-while-revalidate: serve old data while refreshing

A strict TTL cache has a cliff: at expiration, you either have data or you don’t. Stale-while-revalidate replaces that cliff with a slope:
keep a “fresh TTL” and a “stale TTL.” During the stale window, you can serve the old value quickly while one worker refreshes it.

This is the pattern you use when you care more about availability and latency than perfect freshness. Most product pages, recommendation blocks, and “top N” widgets qualify.
“Account balance” does not.
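
A minimal sketch of the read path, assuming redis-py and a short-lived "freshness marker" key alongside the value; the key names, TTLs, and rebuild() callable are assumptions:

import threading
import redis

r = redis.Redis(decode_responses=True)

def get_stale_while_revalidate(key, rebuild, fresh_ttl=300, stale_ttl=3600):
    value = r.get(key)                      # lives for stale_ttl
    fresh = r.get(key + ":fresh")           # cheap marker, lives for fresh_ttl

    if value is not None and fresh is not None:
        return value                        # fresh hit: the common case

    if value is not None:
        # Stale hit: answer immediately, refresh off the request path.
        # In production, wrap _refresh in the single-flight lock from pattern 1
        # so only one refresher runs per key.
        threading.Thread(target=_refresh, args=(key, rebuild, fresh_ttl, stale_ttl), daemon=True).start()
        return value

    # Past the stale window: rebuild synchronously (again, under single-flight).
    return _refresh(key, rebuild, fresh_ttl, stale_ttl)

def _refresh(key, rebuild, fresh_ttl, stale_ttl):
    value = rebuild(key)                    # the expensive Postgres read, serialized to a string
    pipe = r.pipeline()
    pipe.set(key, value, ex=stale_ttl)      # how long stale is tolerable at all
    pipe.set(key + ":fresh", "1", ex=fresh_ttl)  # when to start refreshing
    pipe.execute()
    return value

The two TTLs here are exactly the soft/hard split described in the next pattern.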

3) Soft TTL + hard TTL: two expirations, one system

Store metadata alongside the cached value:

  • soft_ttl: after this, it’s okay to refresh in background
  • hard_ttl: after this, you must refresh or fail closed / degrade

Soft/hard TTL is a practical way to encode business tolerance for staleness without depending on Redis having sophisticated cache semantics.
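
If you prefer a single key over a marker key, store the soft deadline inside the value and let the Redis expiry be the hard TTL. A minimal sketch, assuming redis-py and JSON-serializable values; the field names are arbitrary:

import json
import time
import redis

r = redis.Redis(decode_responses=True)

def cache_set(key, value, soft_ttl=300, hard_ttl=3600):
    # Redis expiry enforces the hard TTL; the soft deadline travels with the value.
    envelope = json.dumps({"v": value, "soft_deadline": time.time() + soft_ttl})
    r.set(key, envelope, ex=hard_ttl)

def cache_get(key):
    # Returns (value, needs_refresh). (None, True) is a hard miss: rebuild or degrade.
    raw = r.get(key)
    if raw is None:
        return None, True
    envelope = json.loads(raw)
    return envelope["v"], time.time() > envelope["soft_deadline"]

When needs_refresh comes back true you still have a value to serve; kick off a coalesced background rebuild instead of blocking the request.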

4) TTL jitter: stop synchronized expiration

If a million keys were written by the same job with the same TTL, they will expire together. That’s not “bad luck,” it’s math.
Add jitter: TTL = base ± random. Keep it bounded.

Good jitter is small enough not to violate product needs and large enough to desynchronize rebuilds. For a 10-minute TTL, ±60–120 seconds is often enough.
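
A tiny sketch; the numbers are examples, not recommendations for your workload:

import random

def jittered_ttl(base_seconds, jitter_seconds):
    # base ± jitter, clamped so it never drops to zero
    return max(1, base_seconds + random.randint(-jitter_seconds, jitter_seconds))

# 10-minute TTL with roughly ±90 seconds of spread:
# r.set(key, value, ex=jittered_ttl(600, 90))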

5) Negative caching: cache “no such thing” and “permission denied”

Not-found results are still results. If a missing user ID or absent product page causes a DB query every time, attackers (or buggy clients) can hammer you with misses.
Cache 404s and “no rows” results briefly. Keep TTL short to avoid hiding newly created records.
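
A minimal sketch of the idea using a sentinel value; the key format, the sentinel string, and query_postgres() are assumptions:

import redis

r = redis.Redis(decode_responses=True)
NOT_FOUND = "__not_found__"          # sentinel; must never collide with a real cached value

def get_product(sku, query_postgres, ttl=600, negative_ttl=30):
    key = "product:" + sku
    cached = r.get(key)
    if cached == NOT_FOUND:
        return None                  # cached miss: no DB query at all
    if cached is not None:
        return cached

    row = query_postgres(sku)        # returns a serialized row, or None for "no such product"
    if row is None:
        r.set(key, NOT_FOUND, ex=negative_ttl)   # short TTL so new products show up quickly
        return None
    r.set(key, row, ex=ttl)
    return row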

6) Postgres-friendly query shaping: reduce per-miss cost

If each cache miss costs Postgres a multi-join query with a sort, you’re betting your uptime on caches never missing. That bet will lose.
Build read models that are cheap to compute (or precomputed), and ensure indexes match your access patterns.

7) Backpressure and timeouts: fail fast before you queue forever

The stampede killer you already have is timeouts. The problem is that most systems set them too high and apply them inconsistently.
Your app should stop waiting long before Postgres is fully down; otherwise you build a backlog that becomes an outage multiplier.

You want:

  • Shorter timeouts on cache rebuild paths than on interactive reads
  • Concurrency limits per key or per endpoint
  • Retry budgets (limited retries, exponential backoff, jitter); sketched below
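
A sketch of a retry budget with exponential backoff and full jitter; the attempt counts and delays are placeholders to tune per dependency:

import random
import time

def call_with_retry_budget(fn, attempts=3, base_delay=0.05, max_delay=2.0, deadline=1.5):
    # attempts caps retries; deadline caps total time so callers fail fast instead of queueing.
    start = time.monotonic()
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            out_of_budget = attempt == attempts - 1 or time.monotonic() - start > deadline
            if out_of_budget:
                raise
            # Full jitter: sleep a random amount up to the exponential ceiling.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, ceiling))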

8) Protect Postgres with pools and admission control

Stampedes often show up as “too many connections” because every app instance opens more connections trying to be helpful.
Use PgBouncer, cap concurrency at the app layer (see the sketch after this list), and consider separate pools for:

  • Interactive reads
  • Background cache rebuilds
  • Batch jobs
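
PgBouncer handles the database-side separation; on the application side, a couple of semaphores give you admission control per path. A sketch with made-up limits:

import threading

# Separate concurrency budgets so rebuilds starve before interactive reads do.
INTERACTIVE_SLOTS = threading.BoundedSemaphore(40)
REBUILD_SLOTS = threading.BoundedSemaphore(5)

def run_rebuild(job):
    # Non-blocking: if the rebuild budget is spent, skip and let callers serve stale.
    if not REBUILD_SLOTS.acquire(blocking=False):
        return None
    try:
        return job()
    finally:
        REBUILD_SLOTS.release()

def run_interactive(query):
    # Interactive reads may wait briefly, but never forever.
    if not INTERACTIVE_SLOTS.acquire(timeout=0.2):
        raise TimeoutError("interactive DB budget exhausted")
    try:
        return query()
    finally:
        INTERACTIVE_SLOTS.release()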

9) Warm the cache, but do it like an adult

Cache warming works when it’s bounded and measurable. It fails when it’s a naive “recompute everything now” job.
A safe warmer (a sketch follows this list):

  • Prioritizes the top hot keys
  • Runs with a strict concurrency limit
  • Stops when Postgres is under stress
  • Uses the same stampede controls as production reads
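
A sketch of a warmer that respects those rules; db_latency_ms() stands in for whatever your metrics expose, and the thresholds are placeholders:

import time

def warm_cache(hot_keys, rebuild, db_latency_ms, max_per_second=20, stress_threshold_ms=50):
    # hot_keys: highest-traffic keys first; rebuild() must use the same single-flight path as real reads.
    for key in hot_keys:
        while db_latency_ms() > stress_threshold_ms:
            time.sleep(5)                     # Postgres is under stress: pause instead of piling on
        rebuild(key)
        time.sleep(1.0 / max_per_second)      # strict rate limit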

Short joke #2: “We’ll just pre-warm the entire cache” is the infrastructure equivalent of saying you’ll “just pay down all tech debt this sprint.”

Fast diagnosis playbook

When latency spikes and your dashboard looks like modern art, don’t start by debating architecture. Start by proving where time is going.
The fastest diagnosis is a disciplined sequence.

First: is Redis missing, slow, or empty?

  • Check Redis latency and command rate.
  • Confirm hit rate isn’t collapsing.
  • Look for eviction or restart events.
  • Identify the top hot keys (or patterns) causing misses.

Second: is Postgres saturated or just queued?

  • Check active connections vs max.
  • Check wait events (locks, I/O, CPU).
  • Find the top repeated queries; see whether they correlate with cache misses.
  • Check replica lag if reads go to replicas.

Third: is the application multiplying the problem?

  • Retries, timeouts, circuit breakers: are they sane?
  • Is there request coalescing or a lock, or are you fanning out rebuilds?
  • Is your connection pool causing queueing (threads waiting for DB connections)?

Decision points

  • If Redis is fine but Postgres is melting, you likely have low hit rate, expired namespace, or a new hot path bypassing cache.
  • If Redis is slow, fix Redis first; a slow cache can be worse than no cache because it adds latency and still forces DB work.
  • If both are fine but app latency is high, you may have thread pool starvation or downstream calls stacking up.

Practical tasks: commands, outputs, and decisions (12+)

These are production-grade checks. Each task includes: a command, sample output, what it means, and the decision you make from it.
The outputs are representative; your numbers will differ. The point is the shape of the data and the action it drives.

Task 1: Verify Redis is up and measure instantaneous latency

cr0x@server:~$ redis-cli -h 127.0.0.1 -p 6379 --latency -i 1
min: 0, max: 3, avg: 0.45 (1000 samples)
min: 0, max: 12, avg: 1.10 (1000 samples)

What it means: Redis is responding, but max spikes (12ms) may hurt tail latency if your SLO is tight.

Decision: If max/avg jumps during incidents, investigate slow commands, persistence fsync, or noisy neighbors before blaming Postgres.

Task 2: Check Redis evictions and memory pressure

cr0x@server:~$ redis-cli INFO memory | egrep 'used_memory_human|maxmemory_human|mem_fragmentation_ratio'
used_memory_human:9.82G
maxmemory_human:10.00G
mem_fragmentation_ratio:1.62

What it means: You’re basically at the ceiling, with fragmentation. Evictions are likely and latency may increase.

Decision: Either raise maxmemory, reduce key sizes, or change eviction policy. Don’t pretend a full cache is “fine.”

Task 3: Confirm Redis eviction policy and whether it matches your cache design

cr0x@server:~$ redis-cli CONFIG GET maxmemory-policy
1) "maxmemory-policy"
2) "noeviction"

What it means: With noeviction, writes will fail under pressure. Your application may interpret failures as misses and stampede the DB.

Decision: For cache workloads, prefer an eviction policy like allkeys-lru or volatile-ttl depending on your key strategy, and handle misses gracefully.

Task 4: Spot hot keys by sampling Redis commands

cr0x@server:~$ redis-cli MONITOR
OK
1735588430.112345 [0 10.2.3.14:52144] "GET" "product:pricing:v2:sku123"
1735588430.112612 [0 10.2.3.22:49018] "GET" "product:pricing:v2:sku123"
1735588430.113001 [0 10.2.3.19:58820] "GET" "product:pricing:v2:sku123"

What it means: One key is getting hammered. If it expires, you will feel it everywhere. Keep MONITOR runs to a few seconds; it streams every command and adds real load on a busy instance.

Decision: Add per-key coalescing and stale-while-revalidate for that class of keys; consider splitting the key or caching per segment.

Task 5: Check Postgres connection pressure

cr0x@server:~$ psql -h 127.0.0.1 -U postgres -d appdb -c "select count(*) as total, sum(case when state='active' then 1 else 0 end) as active from pg_stat_activity;"
 total | active
-------+--------
  480  |    210
(1 row)

What it means: You have a lot of sessions; many are active. If max_connections is ~500, you’re near the wall.

Decision: If you’re near the wall during spikes, move clients behind PgBouncer and cap app concurrency on rebuild paths.

Task 6: Identify what Postgres is waiting on (locks, I/O, CPU)

cr0x@server:~$ psql -h 127.0.0.1 -U postgres -d appdb -c "select wait_event_type, wait_event, count(*) from pg_stat_activity where state='active' group by 1,2 order by 3 desc;"
 wait_event_type |     wait_event      | count
-----------------+---------------------+-------
 IO              | DataFileRead        |    88
 Lock            | relation            |    31
 CPU             |                     |    12
(3 rows)

What it means: Reads are stalling on disk and there’s some lock contention. This is consistent with a miss storm.

Decision: Reduce repeated expensive reads (coalescing/stale), and check indexing. For the lock waits, find the blocking queries.

Task 7: Find the blocking query chain

cr0x@server:~$ psql -h 127.0.0.1 -U postgres -d appdb -c "select a.pid as blocked_pid, a.query as blocked_query, b.pid as blocking_pid, b.query as blocking_query from pg_stat_activity a join pg_stat_activity b on b.pid = any(pg_blocking_pids(a.pid)) where a.state='active';"
 blocked_pid |         blocked_query          | blocking_pid |        blocking_query
------------+--------------------------------+--------------+------------------------------
      9123  | update inventory set qty=qty-1 |         8871 | vacuum full inventory
(1 row)

What it means: A maintenance operation is blocking a write, which may cascade into retries and more load.

Decision: If this is during incident response, consider pausing or rescheduling the maintenance; then address why inventory updates are in the hot path of cached reads.

Task 8: Identify the most expensive repeating queries during the stampede

cr0x@server:~$ psql -h 127.0.0.1 -U postgres -d appdb -c "select calls, total_exec_time, mean_exec_time, left(query,120) as query from pg_stat_statements order by total_exec_time desc limit 5;"
 calls | total_exec_time | mean_exec_time |                         query
-------+-----------------+----------------+---------------------------------------------------------
 92000 |       812340.12 |           8.83 | select price, currency from pricing where sku=$1 and...
 48000 |       604100.55 |          12.59 | select * from product_view where sku=$1

What it means: The same queries are being executed tens of thousands of times. That’s a cache miss storm signature. (On PostgreSQL 12 and older, the columns are total_time and mean_time.)

Decision: Add request coalescing around those cache keys, and consider storing the full read model in Redis to avoid the heavier query.

Task 9: Check replica lag if reads go to replicas

cr0x@server:~$ psql -h 127.0.0.1 -U postgres -d appdb -c "select now() - pg_last_xact_replay_timestamp() as replica_lag;"
 replica_lag
-------------
 00:00:17.412
(1 row)

What it means: Run this on the replica; on the primary, pg_last_xact_replay_timestamp() returns NULL. 17 seconds of lag can turn “cache miss, read replica” into “cache miss, read stale then invalidate then retry.”

Decision: During high load, stop routing critical reads to lagging replicas; prefer stale cache over laggy replica for non-critical data.

Task 10: Inspect PgBouncer pool saturation

cr0x@server:~$ psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "show pools;"
 database | user  | cl_active | cl_waiting | sv_active | sv_idle | sv_used | maxwait
----------+-------+-----------+------------+-----------+---------+---------+---------
 appdb    | app   |       120 |        380 |        80 |       0 |      80 |    12.5
(1 row)

What it means: Clients are waiting (380). Server connections are capped and fully used. You are queueing at the pool.

Decision: Don’t just raise pool sizes blindly. Add admission control to rebuild paths; increase DB capacity only after reducing stampede amplification.

Task 11: Confirm Redis keyspace hit/miss trend

cr0x@server:~$ redis-cli INFO stats | egrep 'keyspace_hits|keyspace_misses'
keyspace_hits:182334901
keyspace_misses:44211022

What it means: Misses are high. If the miss rate suddenly jumped after deploy, you likely changed key format, TTL, or serialization.

Decision: Roll back key namespace changes or add backward-compatible reads for old keys; implement staggered rollout for cache schema changes.

Task 12: Check for Redis persistence stalls (AOF) contributing to latency

cr0x@server:~$ redis-cli INFO persistence | egrep 'aof_enabled|aof_last_write_status|aof_delayed_fsync'
aof_enabled:1
aof_last_write_status:ok
aof_delayed_fsync:137

What it means: Delayed fsync events indicate the OS/storage can’t keep up. Redis latency will spike, and clients may timeout then fall back to DB.

Decision: Consider changing AOF fsync policy for a cache workload, or move Redis to faster storage / separate noisy neighbors. Also fix client timeouts to avoid DB fallback storms.

Task 13: Find whether the application is retrying too aggressively

cr0x@server:~$ grep -R "retry" -n /etc/app/config.yaml | head
42:  retries: 5
43:  retry_backoff_ms: 10
44:  retry_jitter_ms: 0

What it means: 5 retries with 10ms backoff and no jitter will hammer dependencies during incidents.

Decision: Reduce retries, add exponential backoff and jitter, and introduce a retry budget per request class.

Task 14: Inspect top Postgres tables for cache-miss-driven I/O

cr0x@server:~$ psql -h 127.0.0.1 -U postgres -d appdb -c "select relname, heap_blks_read, heap_blks_hit from pg_statio_user_tables order by heap_blks_read desc limit 5;"
   relname    | heap_blks_read | heap_blks_hit
--------------+----------------+---------------
 pricing      |        9203812 |      11022344
 product_view |        7112400 |      15099112
(2 rows)

What it means: High disk reads on a small set of tables suggest your working set isn’t in cache, or your access pattern is random enough to miss buffers.

Decision: Reduce DB work per request (coalescing/stale), add indexes, or redesign the cached projection to be cheaper.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A retail-ish company had a “pricing cache” in Redis with a 5-minute TTL. The team assumed that because Redis was “in-memory,” it was effectively always there and always fast.
Their cache-aside code path looked clean: GET, if miss then query Postgres, then SETEX. Simple. Too simple.

One evening, Redis was restarted during a routine maintenance window. It came back quickly, but the dataset was effectively cold. The app servers took that as a massive wave of misses and went straight to Postgres.
Traffic wasn’t unusual. The cache was the unusual part: it had gone from “mostly hits” to “mostly misses” in seconds.

Postgres didn’t die immediately. It queued. Connections rose. PgBouncer pools filled. App threads blocked waiting for a connection, and timeouts triggered retries.
The wrong assumption wasn’t “Redis is fast.” The wrong assumption was “cache misses are independent events.” They weren’t; they were synchronized by the restart.

The fix wasn’t exotic. They added stale-while-revalidate with a soft TTL and a hard TTL, and request coalescing per key using a short Redis lock.
After that, a Redis restart caused slower pages for a few minutes, not a database incident. It was less dramatic, which is the highest compliment in operations.

Mini-story 2: The optimization that backfired

A B2B SaaS team got aggressive about “freshness.” They shortened TTLs from 10 minutes to 30 seconds on a set of dashboard widgets because sales wanted faster updates.
They also introduced a background job that “warmed” the cache by recomputing popular keys every 25 seconds. On paper, it meant fewer misses and fresher data.

In practice, the warm job and real traffic aligned. Both hit the same set of keys at roughly the same time. With short TTLs, keys expired frequently; with the warm job, they were constantly being rebuilt.
They had created a permanent mini-stampede. Not a single dramatic spike, but a persistent elevated load that made every other Postgres workload fragile.

The symptom was sneaky: no huge cache miss spikes, just increased mean query latency and occasional lock waits. Autovacuum started falling behind because the system was busy enough that maintenance never got a quiet moment.
Engineers chased “slow queries” and “bad plans” for weeks, because the system wasn’t obviously on fire. It was just always too warm.

The fix was to stop rebuilding on a fixed cadence and move to event-driven invalidation for the small subset of keys that truly needed freshness. For everything else: longer TTLs, jitter, and soft TTL refresh only when accessed.
They also put hard caps on the warmer’s concurrency and made it back off when Postgres latency rose.

Freshness is not free. Short TTLs are a tax you pay forever, not a one-time fee you negotiate with product.

Mini-story 3: The boring but correct practice that saved the day

A media company ran Postgres with PgBouncer and had a strict rule: background rebuilds and batch jobs use separate pools and separate DB users.
It was not glamorous. It was a spreadsheet of limits and a couple of config files nobody wanted to touch.

During a traffic surge triggered by a breaking story, their Redis cluster started showing elevated latency due to a noisy neighbor on the underlying hosts.
Cache hit rate dropped and app servers began to fall back to Postgres more than usual. This is the part where most systems faceplant.

Instead, PgBouncer’s pool separation did its quiet work. The interactive pool remained usable because the background pool saturated first. Rebuild jobs queued, not user requests.
The site got slower, but it stayed up. Editors kept publishing, users kept reading, and the incident remained an incident instead of a headline.

Afterward, they tuned Redis, fixed host isolation, and improved client timeouts. But the thing that mattered most that day was the boring discipline of admission control and pool separation.
Reliability is often just refusing to be clever in the wrong places.

Common mistakes: symptoms → root cause → fix

1) Symptom: sudden surge in Postgres QPS right after a deploy

Root cause: cache key namespace changed (prefix/version bump), effectively a global cache invalidation.

Fix: version keys gradually, read-through both namespaces during rollout, or pre-warm with strict concurrency limits plus coalescing.

2) Symptom: Redis is “up” but app timeouts increase and DB load rises

Root cause: Redis latency spikes (persistence fsync, slow command, CPU saturation); clients timeout and fall back to DB.

Fix: fix Redis latency first; increase client timeouts slightly (not infinitely), add hedging carefully, and implement stale reads so DB fallback isn’t the default reaction.

3) Symptom: “too many connections” on Postgres during traffic spikes

Root cause: no pooling or ineffective pooling; app instances open more connections under pressure; rebuild paths share pool with interactive reads.

Fix: deploy PgBouncer, cap pools, separate pools/users for background rebuilds, add app-level concurrency limits.

4) Symptom: cache hit rate is decent, but DB still melts on expiration boundaries

Root cause: synchronized TTLs for hot keys; a cohort expires at once and stampedes.

Fix: TTL jitter; soft TTL with background refresh; per-key request coalescing.

5) Symptom: “lock wait timeout” errors increase during cache misses

Root cause: rebuild queries include writes or lock-heavy reads; vacuum/DDL collides; retries amplify contention.

Fix: isolate writes from read rebuilds; avoid lock-heavy patterns; reduce retries; schedule maintenance; ensure indexes avoid long-running scans.

6) Symptom: Redis memory hits max and keys churn; DB load becomes spiky

Root cause: eviction policy mismatch, oversized values, or unbounded cardinality keys (e.g., caching per request parameter explosion).

Fix: bound cardinality, compress or store smaller projections, pick an eviction policy aligned with TTL usage, and monitor evictions as first-class signals.

7) Symptom: after Redis failover, everything slows even though Redis is back

Root cause: cache cold start; rebuild storm; no coalescing; application uses synchronous rebuild on request path.

Fix: stale-while-revalidate, soft TTL, and background rebuild queue. Make cold starts survivable.

8) Symptom: read replicas fall behind during spikes, then cache invalidation logic goes wild

Root cause: replica lag + “validate against DB” logic; stampede causes lag; lag causes more invalidations/retries.

Fix: don’t validate hot cached reads against lagging replicas; prefer serving stale cache with bounded staleness, and route critical reads to primary only when necessary.

Checklists / step-by-step plan

Step-by-step plan: harden a cache-aside system against stampedes

  1. Inventory hot keys: identify top endpoints and keys by request rate and miss impact.
  2. Define staleness budget: per key class, decide what staleness is acceptable (seconds/minutes/hours).
  3. Implement request coalescing: per key, ensure only one builder runs at a time across the fleet.
  4. Add soft TTL + hard TTL: serve stale during soft window; stop serving after hard TTL unless you explicitly degrade.
  5. Add TTL jitter: desynchronize expirations on hot keys and cohorts.
  6. Add negative caching: cache not-found and empty results briefly.
  7. Separate pools: interactive reads vs rebuilds vs batch jobs; enforce with PgBouncer config and DB users.
  8. Put rebuilds behind admission control: cap concurrency and use backpressure when Postgres latency rises.
  9. Fix retries: enforce retry budgets; exponential backoff; jitter; no infinite retry loops.
  10. Observe the right signals: cache hit rate, rebuild rate, lock waits, PgBouncer waiting clients, Redis evictions, Postgres wait events.
  11. Test failure modes: simulate Redis restart, cache flush, and key namespace change in a staging environment with realistic concurrency.
  12. Write the runbook: include “disable rebuilds,” “serve stale only,” and “shed load” procedures.

Operational checklist: before you ship a cache key change

  • Does the new key format coexist with the old one during rollout?
  • Is there a maximum rebuild concurrency per key class?
  • Is stale-while-revalidate enabled for non-critical keys?
  • Do you have TTL jitter enabled by default?
  • Can you quickly disable rebuild-on-miss and serve stale?
  • Are Redis evictions, latency, and Postgres wait events on the same dashboard?

Emergency checklist: during a live stampede

  • Stop making it worse: reduce retries and disable aggressive warmers.
  • Enable “serve stale” mode if available; extend TTLs on hot keys if safe.
  • Clamp rebuild concurrency; isolate rebuild pool from interactive pool.
  • If Redis is the bottleneck, stabilize Redis first; otherwise you’ll just funnel to Postgres.
  • If Postgres is saturated, shed load: rate limit endpoints that rebuild, degrade non-critical features, and protect login/checkout paths.

FAQ

1) Can PostgreSQL be the cache if I just add more RAM?

Sometimes. But it’s a trap as a primary strategy. Postgres buffer cache helps for repeated reads, but stampedes introduce concurrency and queueing problems that RAM doesn’t solve:
connections, CPU for planning/executing, and I/O bursts for non-resident pages. Use RAM, but also use coalescing and admission control.

2) Is Redis the only way to prevent cache stampedes?

No. You can do in-process single-flight, use a message queue to serialize rebuilds, or precompute projections in a separate store.
Redis is popular because it’s already there and provides coordination primitives. The key is controlling rebuild fan-out, not the brand of tool.

3) Should I use a Redis distributed lock for cache rebuilds?

Yes, but keep it short-lived and treat it as a coordination hint, not a correctness mechanism. Use a TTL on the lock and handle lock loss safely.
Your rebuild should be idempotent, and your system should tolerate occasional double rebuilds. The goal is “mostly one,” not “perfectly one.”
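
The usual way to avoid deleting someone else’s lock is an atomic check-and-delete. A sketch, assuming the lock was written with a unique token (as in the coalescing example earlier):

import redis

r = redis.Redis(decode_responses=True)

# Delete the lock only if it still holds our token, in one atomic step.
RELEASE_LOCK = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
"""

def release_lock(lock_key, token):
    # Returns 1 if we released our own lock, 0 if it expired or was taken over.
    return r.eval(RELEASE_LOCK, 1, lock_key, token)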

4) What TTL should I use?

The honest answer: whatever your staleness budget allows, plus enough headroom to avoid constant rebuilds. Longer TTLs reduce rebuild pressure.
Add soft TTL to keep freshness acceptable and hard TTL to prevent serving ancient data.

5) Why do retries make stampedes worse?

Retries turn partial failure into amplified load. When Postgres is slow, each timed-out request may trigger multiple additional queries.
Add jitter and exponential backoff, cap retries, and prefer serving stale cache over “try again immediately” logic.

6) How do I know if I’m suffering a stampede vs a real DB regression?

Stampedes have a signature: repeated identical queries spike in pg_stat_statements, cache misses rise, and latency worsens sharply around TTL boundaries or cache resets.
A DB regression looks more like one query changed plan or one table got bloated. You confirm by correlating cache hit/miss with top queries and wait events.

7) Is “stale-while-revalidate” safe?

It’s safe when the business semantics allow it. It’s not safe for correctness-critical reads like balances, permissions, or inventory commits.
For those, use transactional reads from Postgres (and cache carefully with explicit invalidation), or design a dedicated read model with strong update rules.

8) What about caching query results in Postgres (materialized views)?

Materialized views and summary tables can reduce compute per request, but they don’t solve stampedes by themselves. If the cache layer still expires and triggers rebuilds, you can stampede the materialized view refresh.
They’re best used as part of a system: cheap read model in Postgres, plus Redis caching, plus coalescing and stale serving.

9) Do I need read replicas to survive stampedes?

Replicas help when the workload is stable and you have predictable read scaling. During stampedes, replicas can lag and become a new failure mode.
Fix stampedes at the cache layer first. Then use replicas for steady-state scale, not as your primary emergency brake.

Conclusion: next steps you can do this week

Cache stampedes aren’t a mysterious distributed systems curse. They’re what happens when “cache miss → go to DB” is allowed to scale linearly with traffic.
Your job is to break that linearity.

If you do nothing else this week, do these three things:

  1. Add request coalescing for your hottest keys (even a rough Redis lock with a TTL is better than a free-for-all).
  2. Implement stale-while-revalidate for any data that can be slightly stale, and add a hard TTL to prevent zombie values.
  3. Protect Postgres with pool separation and admission control so rebuild work can’t starve interactive traffic.

Then measure: hit rate, rebuild rate, lock waits, PgBouncer waiting clients, Redis latency, and Postgres wait events. The dashboard won’t prevent incidents, but it will stop you from guessing.
And guessing is how stampedes become outages.
