PostgreSQL vs Redis for sessions, rate limits, and queues

Production systems don’t fail because you picked “the wrong database.” They fail because you picked the right database for the wrong behavior.
Sessions that outlive deploys, rate limits that must not reset on restart, queues that can’t lose a job but also can’t block the world—those are behaviors.

If you store everything in the same place because it’s convenient, you’ll pay later. Usually at 2:17 AM, when your “simple cache”
becomes your authentication system and the incident channel starts doing cardio.

Decision framework: pick by failure mode, not vibes

PostgreSQL and Redis overlap just enough to tempt you into making a religious choice. Resist that. Choose based on what must be true
during failures: restarts, partitions, overload, partial writes, time skew, client retries, and rolling deploys.

One blunt rule

If losing it is acceptable (or reconstructable), Redis is a great default. If losing it is unacceptable (or legally/financially painful),
PostgreSQL is the adult in the room.

What each system is “for” in practice

  • Redis: fast shared memory with networking, atomic ops, expirations, and multiple data structures. Durability is optional and nuanced.
  • PostgreSQL: a transactional system with write-ahead logging, constraints, queryability, and durable semantics under well-understood conditions.

How the same requirement sounds different in incident review

“We store sessions in Redis” is a design statement. “A Redis restart logged everyone out” is an incident statement.
Your job is to translate design statements into incident statements before production does it for you.

A reliability maxim (paraphrased)

“Hope is not a strategy.” The line has been attributed to plenty of operations leaders, and the sentiment is a staple of reliability engineering. It is also the spirit of this article: design for the failure you expect, not the uptime you hope for.

Two short jokes

Joke #1: A cache is where data goes to retire. Unless you store authentication in it, then it becomes a career with night shifts.

Joke #2: Every “temporary” Redis key lives exactly as long as your on-call rotation.

Decision matrix (use this when people argue)

Ask these in order; stop when you hit a “hard no.”

  1. Can you tolerate losing it? If no: prefer PostgreSQL (or another durable queue/store).
  2. Do you need atomic increments/expirations at high rate? If yes: Redis often wins.
  3. Do you need ad-hoc queries, auditing, or backfills? If yes: PostgreSQL wins.
  4. Do you need fan-out / streams / consumer groups? Redis can be excellent, but commit to operating it like a system, not a toy.
  5. Do you need strict transactional coupling with business writes? If yes: PostgreSQL, because “write in app + write to Redis” is where consistency goes to die.

Threat model: the failures you must design for

  • Restart: process restarts, node reboots, container reschedules.
  • Partition: app can reach one datastore node but not another; clients retry.
  • Overload: latency spikes, backlog builds, timeouts become retries, retries become a storm.
  • Time: TTL and rate-limit windows are time-based; clocks drift, deploys roll, users are impatient.
  • Eviction: Redis may evict keys; “volatile-lru” is not a business continuity plan.
  • Vacuum/compaction: Postgres bloat and vacuum affect latency; Redis fork for RDB snapshots affects memory and latency.

Interesting facts and historical context (the kind that changes decisions)

  • PostgreSQL’s lineage goes back to POSTGRES (1980s, UC Berkeley), designed with a research obsession for correctness and extensibility—still visible in MVCC and WAL.
  • Redis started (late 2000s) as an in-memory data structure server; its killer feature was not “key/value,” but atomic operations on useful structures (lists, sets, sorted sets).
  • Redis persistence began as optional snapshots (RDB). The append-only file (AOF) arrived later to shrink loss windows, at the cost of write amplification and fsync tuning decisions.
  • Postgres MVCC means reads don’t block writes (in general). It also means dead tuples pile up; vacuum is not optional if you like stable latency.
  • Redis single-threaded command execution is a feature: it makes most operations atomic without locks. It’s also a ceiling when heavy Lua scripts or long-running commands hog the event loop.
  • LIST-based queues in Redis were popular long before streams; BRPOP patterns shaped a generation of “good enough” job systems—until people needed replay and consumer groups.
  • “Exactly-once processing” has been a recurring industry mirage. Most real systems achieve at-least-once plus idempotency. Postgres makes idempotency easier to enforce with constraints.
  • Rate limiting evolved from simple fixed windows to sliding windows and token buckets because fairness matters when traffic bursts; Redis’s atomic increments made these patterns practical at scale.

Sessions: sticky, stateless, and the lie in between

What “session storage” really means

Sessions are state, but the application wants to pretend it’s stateless. That tension shows up in three places:
authentication, revocation, and expiration.

The dangerous assumption is that sessions are “just cache.” They’re not, unless you truly don’t care if a user is logged out or re-challenged.
For consumer apps that may be acceptable. For B2B admin consoles during a sales demo, it’s how you meet your next budget cut.

Three patterns and where they belong

  1. Signed tokens (no server-side session)
    Store nothing server-side; put claims in a signed token (JWT-like). Great for read-heavy APIs. Terrible when you need instant revocation;
    the short TTLs you add to compensate just create refresh storms.
  2. Server-side sessions
    Store a session ID in a cookie, keep session data in Redis or Postgres. Operationally boring, which is a compliment.
  3. Hybrid
    Signed token for identity + server-side blacklist/revocation list. This is where Redis is strong: small keys, TTLs, atomic checks.
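
That hybrid is easy to sketch. Below is a minimal Python example, assuming redis-py, a token that carries a unique jti claim plus an expiry timestamp, and an illustrative key scheme; it is a sketch of the pattern, not a drop-in implementation.

import time
import redis

r = redis.Redis(host="redis-01", decode_responses=True)  # hypothetical host

def revoke_token(jti: str, token_exp: int) -> None:
    # Blacklist the token only until its natural expiry; the TTL then cleans up the key.
    ttl = max(int(token_exp - time.time()), 1)
    r.set(f"revoked:{jti}", 1, ex=ttl)

def is_revoked(jti: str) -> bool:
    # One fast existence check per request; a missing key means "still valid".
    return r.exists(f"revoked:{jti}") == 1

Notice the failure mode you are buying: if Redis loses the revocation keys, revoked tokens quietly become valid again until they expire. Decide whether that is acceptable before you ship it.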

Redis for sessions: when it’s right

Redis works well for sessions when:

  • Session loss is acceptable (or you can re-auth easily).
  • You need fast reads and TTL expiration without a cleanup job.
  • You are disciplined about persistence (or explicitly choose to be lossy).
  • You can tolerate the occasional “everyone logs in again” day.

Redis for sessions: when it’s a trap

It’s a trap when the session is effectively an entitlement record (roles, scopes, MFA flags) and loss means:
revoked access returns, or access disappears for valid users. Either one is a support ticket generator.

Also: Redis eviction policies don’t care about your sales funnel. If memory pressure triggers eviction and your session keys are eligible,
you just created a random logout lottery.

PostgreSQL for sessions: boring and correct

Postgres sessions are slower per request, but predictable and queryable. You can enforce uniqueness, track last_seen, run audits,
and backfill cleanup logic. And you can couple session state changes with business writes inside transactions.

The trade: you now own cleanup and bloat control. Session tables are churn factories. If you don’t manage vacuum, you’ll see latency creep.

Practical design: split “session identity” from “session payload”

A good compromise: store the minimal authoritative record in Postgres (session id, user id, created_at, revoked_at, expires_at),
and store optional performance payload in Redis (user preferences, computed permissions) keyed by session id with TTL.
If Redis loses it, you recompute. If Postgres loses it, you have bigger problems.
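
As a concrete sketch of that split, assuming psycopg2, redis-py, and the illustrative table, key names, and hosts below (the permission recomputation is a placeholder):

import json
import psycopg2
import redis

pg = psycopg2.connect("dbname=appdb user=app host=pg-01")  # hypothetical DSN
r = redis.Redis(host="redis-01", decode_responses=True)    # hypothetical host

def load_session(session_id: str):
    # Authority lives in Postgres: a session is valid only if not revoked and not expired.
    with pg, pg.cursor() as cur:
        cur.execute(
            "SELECT user_id FROM sessions "
            "WHERE id = %s AND revoked_at IS NULL AND expires_at > now()",
            (session_id,),
        )
        row = cur.fetchone()
    if row is None:
        return None  # no valid session: force re-auth

    # Acceleration lives in Redis: derived payload that is safe to lose and recompute.
    cached = r.get(f"session:payload:{session_id}")
    if cached is not None:
        payload = json.loads(cached)
    else:
        payload = compute_permissions(row[0])  # hypothetical, potentially expensive
        r.setex(f"session:payload:{session_id}", 300, json.dumps(payload))
    return {"user_id": row[0], **payload}

def compute_permissions(user_id):
    return {"roles": [], "prefs": {}}  # placeholder for the real recomputation

Postgres answers “is this session valid”; Redis only answers “what did we already compute for it.”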

Rate limits: counters, clocks, and fairness

What rate limiting is protecting

Rate limiting isn’t just “stop abusers.” It’s also:
protecting downstream dependencies, shaping noisy tenants, avoiding thundering herds on login or password reset,
and preventing self-inflicted retries from consuming your entire budget.

Redis rate limiting: the default for a reason

Redis is excellent at rate limiting because it has:

  • Atomic increments (INCR/INCRBY) so you don’t race yourself.
  • TTL (EXPIRE) so counters disappear without cron jobs.
  • Lua scripts to combine “increment + check + set TTL” into one atomic operation (sketched just after this list).
  • Low latency so your limiter doesn’t become the bottleneck.
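
Here is that “increment + check + set TTL” idea as one atomic Lua call, in a minimal redis-py sketch; the fixed-window key scheme, default limit, and host are assumptions for illustration:

import time
import redis

r = redis.Redis(host="redis-01", decode_responses=True)  # hypothetical host

# One atomic round trip: increment the window counter, set its TTL on the first hit,
# and report whether the caller is over the limit.
LIMITER = r.register_script("""
local hits = redis.call('INCR', KEYS[1])
if hits == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if hits > tonumber(ARGV[1]) then
  return 0
end
return 1
""")

def allow(tenant: str, limit: int = 100, window_s: int = 60) -> bool:
    window = int(time.time() // window_s)
    key = f"ratelimit:{tenant}:{window}"   # key changes every window, TTL cleans up old ones
    return LIMITER(keys=[key], args=[limit, window_s]) == 1

Fixed windows are slightly unfair at boundaries, but the whole limiter is one round trip and one script you can still read during an incident.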

But: Redis durability is not free

If your rate limiter resets on restart, you might be fine. Or you might flood a partner API for 90 seconds and get your key revoked.
Decide which world you live in.

If you need “survive restarts,” configure persistence thoughtfully and test it. AOF with fsync policies can help,
but it changes the performance envelope and the failure mode (disk IO becomes your limiter).

Postgres rate limiting: when you should do it anyway

Postgres can do rate limiting, usually in one of these forms:

  • Per-window counters with upserts (INSERT … ON CONFLICT DO UPDATE). Works for moderate rates and strong correctness needs; see the sketch below.
  • Token bucket stored per user/tenant in a row with “last_refill” and “tokens.” Needs careful handling of time and concurrency.
  • Leaky bucket via jobs where the database is authoritative and the application caches the decision briefly.

Postgres rate limiting is slower, but gives you auditing and queryability. If you need to answer “why was tenant X throttled yesterday,”
Postgres is where that story is easiest to reconstruct.
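
The per-window upsert counter looks roughly like this; a minimal sketch assuming psycopg2 and an illustrative rate_limits table with a unique (tenant_id, window_start) constraint:

import psycopg2

pg = psycopg2.connect("dbname=appdb user=app host=pg-01")  # hypothetical DSN

def allow(tenant_id: str, limit: int = 100) -> bool:
    # One statement: create or bump this minute's counter and read the new value back.
    # The unique (tenant_id, window_start) constraint serializes concurrent requests safely.
    with pg, pg.cursor() as cur:
        cur.execute(
            """
            INSERT INTO rate_limits (tenant_id, window_start, hits)
            VALUES (%s, date_trunc('minute', now()), 1)
            ON CONFLICT (tenant_id, window_start)
            DO UPDATE SET hits = rate_limits.hits + 1
            RETURNING hits
            """,
            (tenant_id,),
        )
        hits = cur.fetchone()[0]
    return hits <= limit

Every decision leaves a row behind, which is exactly why “why was tenant X throttled yesterday” is easy to answer here.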

Fairness is a product decision disguised as an algorithm

Fixed-window rate limits are easy and unfair at boundaries. Sliding window logs are fair and expensive. Token buckets are fair enough
and cheap enough. The point: pick the fairness you can support operationally. A clever algorithm you can’t debug during an incident
is a liability.
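
If you settle on a token bucket and want it to stay debuggable, it can fit in a single statement. A minimal sketch, assuming psycopg2 and an illustrative rate_buckets table with capacity, tokens, refill_per_s, and last_refill columns; the row lock makes refill-and-consume atomic:

import psycopg2

pg = psycopg2.connect("dbname=appdb user=app host=pg-01")  # hypothetical DSN

def take_token(tenant_id: str) -> bool:
    # Refill based on elapsed time and consume one token in a single UPDATE.
    # The row lock serializes concurrent callers; no row returned means "throttle".
    with pg, pg.cursor() as cur:
        cur.execute(
            """
            UPDATE rate_buckets
            SET tokens = LEAST(capacity,
                               tokens + refill_per_s * EXTRACT(EPOCH FROM now() - last_refill)) - 1,
                last_refill = now()
            WHERE tenant_id = %s
              AND LEAST(capacity,
                        tokens + refill_per_s * EXTRACT(EPOCH FROM now() - last_refill)) >= 1
            RETURNING tokens
            """,
            (tenant_id,),
        )
        return cur.fetchone() is not None

No returned row means “throttle,” and the current token count sits in a table you can query when someone asks why.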

Queues: durability, visibility, and backpressure

A queue is not a list

A real queue is a contract:
messages are durably recorded, workers claim them, failures return them, duplicates happen, and the system remains observable.
If you implement queues as “a list plus hope,” you will learn about edge cases in the worst possible environment: production.

Redis queues: fast, flexible, and easy to get subtly wrong

Redis can implement queues with lists (LPUSH/BRPOP), sorted sets (delayed jobs), streams (consumer groups), and pub/sub (not a queue).
Each comes with tradeoffs:

  • Lists: simple, fast. But you need your own reliability model (ack, retry, dead-letter). BRPOPLPUSH patterns (BLMOVE on newer Redis) help.
  • Streams: closer to a real log with consumer groups, acking, and pending entries. Operationally more complex, but more honest; see the sketch after this list.
  • Pub/Sub: not durable; subscribers that disconnect miss messages. Great for ephemeral notifications, not for jobs.
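
For the streams option, the moving parts look roughly like this; a minimal redis-py sketch with illustrative stream, group, and handler names, and with retries and dead-lettering deliberately left out:

import redis

r = redis.Redis(host="redis-01", decode_responses=True)  # hypothetical host
STREAM, GROUP = "jobs", "workers"                        # illustrative names

def setup():
    try:
        r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except redis.exceptions.ResponseError as e:
        if "BUSYGROUP" not in str(e):   # "group already exists" is fine
            raise

def enqueue(payload: dict) -> str:
    return r.xadd(STREAM, payload)      # appends to the log, returns the entry id

def consume(consumer: str):
    # ">" asks for entries never delivered to this group; unacked entries stay pending
    # and can be inspected (XPENDING) or reclaimed if this worker dies mid-job.
    entries = r.xreadgroup(GROUP, consumer, {STREAM: ">"}, count=10, block=5000)
    for _, messages in entries or []:
        for msg_id, fields in messages:
            handle(fields)              # hypothetical job handler; must be idempotent
            r.xack(STREAM, GROUP, msg_id)

def handle(fields: dict) -> None:
    print("processing", fields)

Unacked entries stay in the pending list, which is what makes crashed consumers recoverable instead of silently lossy.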

Postgres queues: surprisingly strong for many workloads

Postgres can power queues using tables + indexes + row locking:

  • SELECT … FOR UPDATE SKIP LOCKED is the workhorse for “claim a job without two workers taking it.”
  • Transactional enqueue allows coupling job creation with business state changes (the outbox pattern).
  • Queryability lets you build dashboards and investigate stuck jobs without custom tooling.

The trade is throughput and contention. Postgres is great until it’s not: high write rates, hot partitions, or long-running transactions
can turn the queue table into a battlefield. But for many corporate systems—moderate rates, high correctness—Postgres queues are
the boring choice that actually ships.
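
A minimal claim-and-work loop, assuming psycopg2 and an illustrative queue_jobs table with id, state, payload, locked_at, and locked_by columns (locked_at feeds the visibility timeout discussed next):

import psycopg2

pg = psycopg2.connect("dbname=appdb user=app host=pg-01")  # hypothetical DSN

def claim_and_run(worker_id: str) -> bool:
    # Claim one ready job without blocking on rows other workers already hold,
    # mark it in progress, and only then commit; keep the claim transaction short.
    with pg, pg.cursor() as cur:
        cur.execute(
            """
            SELECT id, payload FROM queue_jobs
            WHERE state = 'ready'
            ORDER BY id
            LIMIT 1
            FOR UPDATE SKIP LOCKED
            """
        )
        row = cur.fetchone()
        if row is None:
            return False  # nothing to do
        job_id, payload = row
        cur.execute(
            "UPDATE queue_jobs SET state = 'running', locked_at = now(), locked_by = %s "
            "WHERE id = %s",
            (worker_id, job_id),
        )
    # Work happens outside the claim transaction; completion is a separate, small write.
    process(payload)  # hypothetical handler; must be idempotent (duplicates happen)
    with pg, pg.cursor() as cur:
        cur.execute("UPDATE queue_jobs SET state = 'done' WHERE id = %s", (job_id,))
    return True

def process(payload) -> None:
    print("working on", payload)

The claim transaction stays short on purpose; the actual work runs outside it so a slow job never holds row locks.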

Visibility timeout: the thing that decides whether you sleep

Jobs need a concept of “in progress.” If a worker dies mid-job, that job must become visible again. Redis list queues don’t give you that
automatically. Streams do, but you still need to handle pending entries. Postgres does, if you model “locked_at/locked_by” and reclaim jobs
after a timeout.
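
The reclaim side of that model is one periodic statement; a sketch against the same illustrative queue_jobs table, with the five-minute timeout as an assumption you should size to your real job durations:

import psycopg2

pg = psycopg2.connect("dbname=appdb user=app host=pg-01")  # hypothetical DSN

def reclaim_stale_jobs(timeout_minutes: int = 5) -> int:
    # Any job still 'running' past the visibility timeout goes back to 'ready'
    # so another worker can pick it up; pair this with idempotent handlers.
    with pg, pg.cursor() as cur:
        cur.execute(
            """
            UPDATE queue_jobs
            SET state = 'ready', locked_at = NULL, locked_by = NULL
            WHERE state = 'running'
              AND locked_at < now() - make_interval(mins => %s)
            """,
            (timeout_minutes,),
        )
        return cur.rowcount

Run it from a scheduler or a reaper loop, and remember that reclaiming guarantees occasional duplicates under worker crashes, which is exactly the next topic.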

Idempotency: you will process duplicates

Between retries, timeouts, network partitions, and deploys, duplicates are not hypothetical. “Exactly once” is marketing.
Build idempotency keys into job handlers and enforce them where possible (Postgres unique constraints are your best friend).
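
A minimal enforcement sketch, assuming psycopg2 and an illustrative processed_jobs table whose idempotency_key column carries a unique constraint:

import psycopg2

pg = psycopg2.connect("dbname=appdb user=app host=pg-01")  # hypothetical DSN

def run_once(idempotency_key: str, do_work) -> bool:
    # Record the key and the effect in the same transaction. A duplicate delivery
    # hits ON CONFLICT DO NOTHING, inserts no row, and we skip the work.
    with pg, pg.cursor() as cur:
        cur.execute(
            "INSERT INTO processed_jobs (idempotency_key) VALUES (%s) ON CONFLICT DO NOTHING",
            (idempotency_key,),
        )
        if cur.rowcount == 0:
            return False  # already processed; safe to ack the duplicate
        do_work(cur)      # business writes share the transaction with the key
    return True

Because the key and the business writes commit together, a retry either replays nothing or replays everything; there is no half-processed middle state.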

Fast diagnosis playbook: find the bottleneck in 10 minutes

You’re paged. Latency is up. Logins are failing. Jobs are backing up. Don’t debate architecture. Run the playbook.

First: determine if the problem is datastore latency or app contention

  1. Check Redis latency and blocked clients (if Redis is in the path). If you see slow commands or blocked clients, you’re likely Redis-bound.
  2. Check Postgres active sessions and locks. If you see lock waits or saturated connections, you’re likely Postgres-bound.
  3. Check error patterns: distinguishing timeouts from “OOM”, “too many clients”, and “READONLY” errors narrows the triage quickly.

Second: confirm whether the issue is resource saturation or correctness bugs

  1. Resource saturation: CPU high, IO high, network drops, memory pressure, eviction, connection pool maxed.
  2. Correctness bugs: sudden spike in retries, accidental tight loop, Lua script gone wild, unbounded queue growth due to missing ack.

Third: pick the least dangerous mitigation

  • Rate limit harder (yes, even if product complains) to stabilize.
  • Disable expensive features (session enrichment, deep permission checks) if they hit the datastore.
  • Pause consumers if the queue is melting downstream systems.
  • Scale reads if the workload allows it; do not blindly add writers to fix a lock problem.

What not to do

Don’t restart Redis as a “fix” unless you accept losing volatile data and you know why it’s stuck. Don’t bounce Postgres when you have
long-running transactions unless you enjoy explaining to finance why invoices duplicated.

Practical tasks: commands, outputs, what it means, what you decide

These are the kinds of checks you run during design reviews and incidents. Each includes a command, example output,
what the output means, and the decision it drives.

Redis: health, latency, persistence, memory, eviction

Task 1: Check Redis basic health and role

cr0x@server:~$ redis-cli -h redis-01 INFO replication
# Replication
role:master
connected_slaves:1
master_repl_offset:987654321
repl_backlog_active:1

Meaning: You’re on a master; one replica is connected; replication backlog is active.

Decision: If role is “slave” unexpectedly, your clients may be writing to a read-only node. Fix service discovery/failover first.

Task 2: Measure Redis command latency spikes

cr0x@server:~$ redis-cli -h redis-01 --latency -i 1
min: 0, max: 12, avg: 1.23 (176 samples)
min: 0, max: 97, avg: 4.87 (182 samples)

Meaning: Occasional 97ms spikes. In a rate limiter path, that hurts. In session reads, it can cascade into timeouts/retries.

Decision: Investigate slow commands, persistence forks, or network jitter. If spikes correlate with RDB saves, tune persistence or move to AOF settings that fit.

Task 3: Identify slow Redis commands (built-in slowlog)

cr0x@server:~$ redis-cli -h redis-01 SLOWLOG GET 3
1) 1) (integer) 12231
   2) (integer) 1735600000
   3) (integer) 24567
   4) 1) "EVAL"
      2) "..."
   5) "10.0.2.41:53422"
   6) ""

Meaning: A Lua script took ~24ms. A few of these under load can serialize your server because Redis executes commands mostly single-threaded.

Decision: Rewrite scripts to be simpler, reduce key scans, or move heavy logic out of Redis. If you need complex queue logic, consider Redis streams or Postgres.

Task 4: Check Redis memory use and fragmentation

cr0x@server:~$ redis-cli -h redis-01 INFO memory | egrep 'used_memory_human|maxmemory_human|mem_fragmentation_ratio'
used_memory_human:18.42G
maxmemory_human:20.00G
mem_fragmentation_ratio:1.78

Meaning: You’re close to maxmemory and fragmentation is high. Evictions or OOM errors are around the corner; fork for persistence may fail.

Decision: Increase memory, reduce key cardinality, fix TTL strategy, or choose eviction policy intentionally. If sessions must not disappear, do not rely on eviction-friendly settings.

Task 5: Confirm eviction behavior

cr0x@server:~$ redis-cli -h redis-01 CONFIG GET maxmemory-policy
1) "maxmemory-policy"
2) "allkeys-lru"

Meaning: Any key can be evicted under memory pressure.

Decision: If you store sessions or queue state here, this is a reliability risk. Switch to a “volatile-*” policy and leave must-not-lose keys without TTLs so they are never eviction candidates, use “noeviction” if you would rather fail writes than silently drop data, or move authoritative state to Postgres.

Task 6: Check persistence mode and last save

cr0x@server:~$ redis-cli -h redis-01 INFO persistence | egrep 'aof_enabled|rdb_last_save_time|aof_last_write_status'
aof_enabled:1
aof_last_write_status:ok
rdb_last_save_time:1735600123

Meaning: AOF is enabled and writing successfully; RDB also exists. You have some durability, depending on fsync policy.

Decision: If this Redis is now a session authority or queue source of truth, verify fsync policy and recovery time. If you cannot tolerate loss, Redis still might not be enough.

Task 7: Check for blocked clients (often queue-related)

cr0x@server:~$ redis-cli -h redis-01 INFO clients | egrep 'blocked_clients|connected_clients'
connected_clients:1823
blocked_clients:312

Meaning: Many clients are blocked, likely waiting on blocking pops or slow scripts. This can be normal for BRPOP patterns, but 312 is high.

Decision: If blocked clients correlate with timeouts, redesign queue consumption (streams, bounded concurrency) or increase worker efficiency.

PostgreSQL: connections, locks, vacuum, bloat, IO, query behavior

Task 8: See Postgres connection saturation

cr0x@server:~$ psql -h pg-01 -U app -d appdb -c "select count(*) as total, state from pg_stat_activity group by state order by total desc;"
 total | state
-------+----------------
   120 | active
    80 | idle
    35 | idle in transaction

Meaning: 35 sessions are idle in transaction. That’s often a bug or a pool misconfiguration and it blocks vacuum and holds locks.

Decision: Fix the application to commit/rollback promptly; set statement timeouts; ensure your pool doesn’t leave transactions open.

Task 9: Identify lock waits

cr0x@server:~$ psql -h pg-01 -U app -d appdb -c "select wait_event_type, wait_event, count(*) from pg_stat_activity where wait_event is not null group by 1,2 order by 3 desc;"
 wait_event_type |   wait_event   | count
-----------------+----------------+-------
 Lock            | transactionid   |     9
 LWLock          | BufferMapping   |     4

Meaning: Lock waits on transaction IDs suggest contention (updates/deletes, long transactions) affecting others.

Decision: Find the blocking transaction. If it’s a migration or batch job, stop it or throttle it. If it’s queue workers holding locks, fix worker behavior.

Task 10: Find the blockers

cr0x@server:~$ psql -h pg-01 -U app -d appdb -c "select pid, usename, state, now()-xact_start as xact_age, query from pg_stat_activity where state <> 'idle' order by xact_age desc limit 5;"
 pid  | usename | state  | xact_age |                  query
------+--------+--------+----------+------------------------------------------
 8421 | app    | active | 00:14:32 | update sessions set last_seen=now() ...
 9110 | app    | active | 00:09:11 | delete from queue_jobs where ...

Meaning: Long-running writes on sessions/queue tables. That’s the hot zone for session churn and queue contention.

Decision: Add indexes, reduce update frequency (write-behind last_seen), and ensure queue deletes are batched with sane limits.

Task 11: Check vacuum health on a churny table

cr0x@server:~$ psql -h pg-01 -U app -d appdb -c "select relname, n_dead_tup, last_autovacuum, autovacuum_count from pg_stat_user_tables where relname in ('sessions','queue_jobs');"
   relname   | n_dead_tup |     last_autovacuum     | autovacuum_count
-------------+------------+-------------------------+------------------
 sessions    |     812334 | 2025-12-30 01:10:02+00  |              421
 queue_jobs  |     120998 | 2025-12-30 01:08:47+00  |              388

Meaning: Large dead tuple counts. Autovacuum is running, but you may still be behind under load.

Decision: Tune autovacuum for these tables (lower thresholds), consider partitioning by time, and reduce update churn.

Task 12: Inspect index usage for sessions or rate-limit tables

cr0x@server:~$ psql -h pg-01 -U app -d appdb -c "select relname, idx_scan, seq_scan from pg_stat_user_tables where relname='sessions';"
 relname  | idx_scan | seq_scan
----------+----------+----------
 sessions | 98234122 |     4132

Meaning: Index scans dominate (good). If seq_scan were high, you’d likely be doing full table scans on a hot table.

Decision: If seq_scan spikes, add/repair indexes or fix query predicates. For sessions, you want index lookups by session_id and expires_at.

Task 13: Measure Postgres cache vs disk pressure (rough signal)

cr0x@server:~$ psql -h pg-01 -U app -d appdb -c "select datname, blks_hit, blks_read, round(100.0*blks_hit/nullif(blks_hit+blks_read,0),2) as hit_pct from pg_stat_database where datname='appdb';"
 datname | blks_hit  | blks_read | hit_pct
---------+-----------+-----------+---------
 appdb   | 891234567 |  23123456 |   97.46

Meaning: Cache hit is ~97%. That’s decent. If it drops sharply, disk IO may be your bottleneck.

Decision: If hit_pct falls, check working set size, indexes, and whether a new query pattern is reading lots of cold data.

Task 14: Check queue backlog and oldest job age (Postgres queue)

cr0x@server:~$ psql -h pg-01 -U app -d appdb -c "select count(*) as ready, max(now()-created_at) as oldest_age from queue_jobs where state='ready';"
 ready | oldest_age
-------+------------
 48211 | 02:41:18

Meaning: Backlog is big; oldest job is nearly 3 hours old. This is a throughput or downstream dependency problem.

Decision: Scale consumers carefully, but first confirm DB isn’t the bottleneck. Consider pausing producers or applying backpressure upstream.

Task 15: Inspect Redis keyspace and TTL behavior for sessions

cr0x@server:~$ redis-cli -h redis-01 INFO keyspace
# Keyspace
db0:keys=4123891,expires=4100022,avg_ttl=286000

Meaning: Nearly all keys expire; average TTL ~286 seconds. That’s typical for rate limits, risky for sessions unless intended.

Decision: If these are sessions and TTL is short, you’re generating churn and logouts. Adjust TTL strategy, use refresh tokens, or move to Postgres for authority.

Task 16: Confirm Postgres queue claim behavior (SKIP LOCKED)

cr0x@server:~$ psql -h pg-01 -U app -d appdb -c "select id from queue_jobs where state='ready' order by id limit 3 for update skip locked;"
  id
------
 9912
 9913
 9914
(3 rows)

Meaning: Rows were claimable without blocking on, or double-taking from, other workers. In a real worker, the same SELECT runs inside the worker’s transaction, so claimed rows stay locked until it commits.

Decision: If this blocks or returns nothing while backlog exists, investigate lock contention, missing indexes, or jobs stuck “in progress.”

Three corporate mini-stories from the trenches

1) Incident caused by a wrong assumption: “Sessions are cache”

A mid-sized SaaS company moved sessions from Postgres to Redis to “reduce load.” On paper it was clean:
session_id → JSON blob, TTL 30 days. Reads got faster, the database graphs looked nicer, everyone went home early.

Months later, traffic rose and the Redis node got a memory bump. During the maintenance window, the node restarted.
No big deal, they thought—persistence was “enabled.” It was, technically. Snapshots every 15 minutes.

The blast radius was immediate: users were logged out, but worse, some security controls were coupled to session payload.
A portion of “recently verified MFA” flags were lost and re-triggered MFA challenges. Support tickets piled up, sales demos derailed,
and the incident took on that special tone where everyone is technically calm but emotionally furious.

The wrong assumption wasn’t “Redis is unreliable.” Redis was doing exactly what it was configured to do.
The wrong assumption was that sessions were “reconstructable cache.” They weren’t. Sessions had become an entitlement artifact.

The fix was boring and effective: authoritative session records moved back to Postgres with a revocation table.
Redis stayed, but only for derived session enrichment with TTL. They also changed the runbook: any Redis restart is treated like a planned auth-impacting event unless proven otherwise.

2) Optimization that backfired: the Lua script rate limiter

Another company ran a public API and did rate limiting in Redis. Initially it was a straightforward INCR + EXPIRE.
Someone improved it with a Lua script implementing a sliding window log. Fairness improved, dashboards looked nicer, and the developer got praised.

Then the API launched a batch feature. Clients hammered the endpoint in bursts, which is what batch features do.
The Lua script now ran on hot keys with large sorted sets. Under load, the script started showing up in SLOWLOG.

Redis, being single-threaded for command execution, became the choke point. Latency rose, clients timed out, clients retried,
and the retries increased the load. The rate limiter—meant to protect the system—became the system’s primary failure mode.

They tried scaling Redis vertically, which bought time but not peace. The actual fix was to simplify:
token bucket with atomic ops and short keys; accept slight unfairness at boundaries; add per-tenant smoothing in the application.
Fairness is nice. Survival is nicer.

Post-incident, they added a rule: any Lua script change requires a load test and a rollback plan.
Redis scripting is powerful. It is also the easiest way to introduce a hidden global lock into your architecture.

3) Boring but correct practice that saved the day: Postgres outbox + idempotency

A company processed payments. They needed to emit “payment_succeeded” events to trigger emails, provisioning, and analytics.
They used Postgres for core data and had a separate worker system. Someone proposed pushing events directly into Redis for speed.

The team that had been burned before insisted on an outbox table in Postgres: when a payment row is committed, a row is inserted into outbox
within the same transaction. A background worker reads outbox rows with SKIP LOCKED, publishes to downstream systems, and marks them done.
It’s not glamorous. It is extremely debuggable.

One day, a downstream service degraded and event publishing slowed to a crawl. The outbox backlog grew, but payments continued safely.
Because the outbox was in Postgres, the team could query exact stuck states, replay safely, and run targeted cleanup.

The saving detail was idempotency: the outbox had a unique constraint on (event_type, aggregate_id, version).
When workers retried under timeout, duplicates were harmless. The constraint enforced the contract even under chaos.

The incident ended without data loss, without double-provisioning, and without a week-long forensic saga.
Nobody got applauded for their exciting architecture. They got something better: a quiet postmortem.

Common mistakes (symptoms → root cause → fix)

1) Random logouts and “session not found” spikes

Symptoms: Users intermittently logged out; auth service shows cache misses; Redis memory near max.

Root cause: Redis eviction policy allows session keys to be evicted, or persistence window loses recent sessions after restart.

Fix: Move authoritative session state to Postgres, or configure Redis maxmemory-policy to avoid evicting session keys and ensure persistence matches your loss tolerance.

2) Rate limiting resets after deploy/restart

Symptoms: Traffic bursts pass through limiter after Redis restart; partner API complains; sudden 429 drop to zero.

Root cause: Volatile Redis limiter state without durable persistence, or limiter keys only exist in memory.

Fix: Decide if reset is acceptable. If not, use AOF with appropriate fsync, or store counters in Postgres for critical limits, or implement a hybrid (Redis fast path + periodic reconciliation).

3) Queue backlog grows while workers look “healthy”

Symptoms: Workers running, CPU low, but jobs age increases. Redis blocked_clients high or Postgres locks present.

Root cause: Downstream dependency slow, visibility timeout mis-sized, or workers stuck on a lock or long transaction.

Fix: Instrument job duration and dependency latency; implement timeouts; in Postgres queues, avoid long transactions and keep claim+work separated from business writes when possible.

4) Postgres queue causes lock contention and slows the whole DB

Symptoms: Elevated lock waits; increased latency for unrelated queries; autovacuum lag on queue table.

Root cause: Hot queue table with frequent updates/deletes; missing partial indexes; workers updating rows too often (heartbeats).

Fix: Use “ready” partial index; update minimal columns; batch deletes; partition by time; consider moving high-throughput ephemeral queues to Redis streams.
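
The partial index from that fix is a one-off statement; a sketch assuming psycopg2 and the illustrative queue_jobs table used earlier (note that CONCURRENTLY must run outside a transaction):

import psycopg2

pg = psycopg2.connect("dbname=appdb user=app host=pg-01")  # hypothetical DSN
pg.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction block

# Index only the rows workers actually scan for; finished jobs stay out of the index,
# so "grab the next ready job" stays fast even as history accumulates.
with pg.cursor() as cur:
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS queue_jobs_ready_idx "
        "ON queue_jobs (id) WHERE state = 'ready'"
    )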

5) Redis queue loses jobs on consumer crash

Symptoms: Job disappears after worker dies; no retry; business process incomplete.

Root cause: Using simple list pop without ack/requeue (RPOP/BLPOP) and no in-flight tracking.

Fix: Use BRPOPLPUSH (BLMOVE on newer Redis) with a processing list and a reaper, or use Redis streams with consumer groups, or move to a Postgres queue with visibility timeout semantics.

6) “Too many connections” in Postgres during login spikes

Symptoms: Postgres rejects connections; app threads stall; pgbouncer absent or misconfigured.

Root cause: One connection per request pattern; lack of pooling; session writes on every request.

Fix: Add connection pooling, reduce write frequency (avoid updating last_seen per request), and precompute session-derived data in Redis if safe.

7) Redis latency spikes every few minutes

Symptoms: P99 latency jumps; rate limiter causes timeouts; graphs show periodic spikes.

Root cause: RDB snapshot fork overhead, AOF rewrite, or disk IO saturation if AOF fsync is aggressive.

Fix: Tune persistence schedule, use “everysec” fsync if acceptable, monitor fork time, ensure disk is fast, and avoid huge memory footprints that make forks expensive.

Checklists / step-by-step plan

Checklist A: deciding where sessions go

  1. Classify what a session contains: identity only, entitlements, MFA state, or user preferences.
  2. If entitlements/MFA state is inside the session payload, make Postgres the authority (or split authority from cache).
  3. Define loss tolerance: “users may re-login” vs “revocation must be immediate.”
  4. Pick TTL strategy: absolute expiry + idle timeout; decide who refreshes and when.
  5. Plan cleanup: Postgres vacuum/autovacuum tuning; Redis TTL and eviction policy.
  6. Test restart behavior in staging with realistic load and persistence settings.

Checklist B: implementing rate limits without creating a new bottleneck

  1. Pick the algorithm: token bucket is usually the best compromise.
  2. Define scope: per-IP, per-user, per-tenant, per-endpoint—avoid unbounded cardinality without a plan.
  3. Use Redis if you need high throughput and can tolerate short loss windows.
  4. If limits are contractual (partner APIs), consider Postgres-backed limits or durable Redis configuration.
  5. Instrument limiter latency separately; treat it like a dependency.
  6. Have a “degrade mode”: allow small bursts, block expensive endpoints, and log decisions for forensics.

Checklist C: choosing a queue implementation that won’t betray you

  1. Write down the contract: at-least-once, retry policy, dead-letter, ordering needs, delay scheduling, visibility timeout.
  2. If jobs must be transactionally coupled to DB writes, use a Postgres outbox pattern.
  3. If throughput is high and jobs are ephemeral, Redis streams can be a strong option—operate it seriously.
  4. Design idempotency keys and enforce them (unique constraints in Postgres or dedupe sets in Redis with TTL).
  5. Make backpressure explicit: producers must slow down when consumers lag.
  6. Build operational queries: “oldest job,” “stuck in progress,” “retry count distribution.”

Step-by-step migration plan (from “whatever” to sane)

  1. Inventory state: sessions, rate limits, queues, idempotency keys. For each, define loss tolerance and audit needs.
  2. Pick authority stores: Postgres for durable truth; Redis for derived/ephemeral acceleration.
  3. Add observability: P95/P99 latency for Redis/Postgres calls, queue lag, eviction counts, autovacuum lag.
  4. Implement dual-write only if you can reconcile; prefer staged cutovers and read-fallback strategies.
  5. Load test the “incident path”: restart Redis, failover Postgres, simulate network partition, throttle disk IO.
  6. Write runbooks: what to do on eviction, replication lag, lock storms, backlog spikes.

FAQ

1) Can I store sessions in Redis safely?

Yes, if “safe” means you accept loss windows, eviction risk, and restart behavior—or you configure persistence and capacity to match your requirements.
If sessions are authoritative security artifacts, keep the authority in Postgres and cache derived data in Redis.

2) Is Redis persistence (AOF/RDB) enough to treat it like a database?

Sometimes, but don’t hand-wave it. Persistence settings determine loss windows and performance. Fork-based snapshots and AOF rewrite have operational costs.
If the business cannot tolerate loss, Postgres is usually the simpler durability story.

3) Why not just use Postgres for everything?

You can, and many teams should do more of that. The limits are throughput, latency, and contention under extremely hot counters or queues.
Redis is a better fit when you need atomic ops at very high rates with TTL semantics.

4) Why not just use Redis for everything?

Because “in memory” doesn’t mean “infallible,” and because queryability matters during incidents and audits.
Redis is amazing at certain patterns; it’s not a general-purpose system of record unless you design it that way and accept the tradeoffs.

5) Are Redis streams a real queue?

They’re closer than lists or pub/sub: you get consumer groups, acknowledgments, and pending entries you can inspect.
You still need to design for retries, dead-letter handling, and operational tooling. Streams are a queue toolkit, not a complete queue product.

6) How do I implement a reliable queue in Postgres?

Use a jobs table with a “state” and timestamps, claim jobs via FOR UPDATE SKIP LOCKED, keep transactions short, and implement retry/dead-letter logic.
If enqueue must be coupled to business writes, use an outbox pattern in the same transaction.
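
A minimal outbox enqueue, assuming psycopg2 and illustrative payments and outbox tables where outbox has a unique (event_type, aggregate_id, version) constraint, as in the payment story above:

import json
import psycopg2

pg = psycopg2.connect("dbname=appdb user=app host=pg-01")  # hypothetical DSN

def record_payment(payment_id: str, amount_cents: int) -> None:
    # Business write and event creation commit or roll back together; the unique
    # (event_type, aggregate_id, version) constraint makes re-publishing harmless.
    with pg, pg.cursor() as cur:
        cur.execute(
            "UPDATE payments SET status = 'succeeded' WHERE id = %s",
            (payment_id,),
        )
        cur.execute(
            """
            INSERT INTO outbox (event_type, aggregate_id, version, payload)
            VALUES ('payment_succeeded', %s, 1, %s)
            ON CONFLICT (event_type, aggregate_id, version) DO NOTHING
            """,
            (payment_id, json.dumps({"amount_cents": amount_cents})),
        )

The publisher is just another queue worker: claim outbox rows with SKIP LOCKED, publish them downstream, mark them done, and let the unique constraint absorb retries.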

7) What’s the biggest operational risk with Postgres for sessions?

Churn and bloat. Session tables create dead tuples fast. If autovacuum isn’t tuned, latency becomes unpredictable. Partitioning and reduced write frequency help a lot.

8) What’s the biggest operational risk with Redis for rate limiting?

Turning the limiter into a bottleneck. Heavy Lua scripts, high cardinality keys, and persistence-induced latency spikes can create a failure loop
where timeouts cause retries, which cause more limiter traffic.

9) Should rate limits be durable?

Depends on what they protect. If it’s internal fairness, resets are often acceptable. If it’s contractual (partner APIs, fraud controls),
durability matters—either via persistent Redis configuration or a Postgres-backed authority.

10) How do I avoid double-processing jobs?

Assume at-least-once delivery and make handlers idempotent. In Postgres, enforce idempotency with unique constraints.
In Redis, use dedupe keys (with TTL) or embed idempotency in downstream writes.

Next steps you can execute this week

Stop arguing about which tool is “better.” Decide which failures you can tolerate, then pick the datastore that fails in acceptable ways.
Redis is fantastic for speed and TTL-driven state. Postgres is fantastic for truth, coupling, and forensic clarity.
Most resilient systems use both, with a strict boundary between authority and acceleration.

  1. Write down what happens if Redis loses all keys. If the answer includes “security incident,” move authority to Postgres.
  2. Run the practical tasks above in staging and production. Capture baselines for latency, eviction, lock waits, and vacuum health.
  3. For queues, pick one contract (at-least-once + idempotency is the sane default) and implement visibility timeouts and dead-lettering.
  4. For sessions, split authority from cache: Postgres for revocation/expiry records; Redis for derived, recomputable payload.
  5. For rate limiting, keep the algorithm simple, instrument the limiter, and treat it as a dependency that can take you down.

Store it right now, or pay later. The invoice arrives during the incident. It always does.
