Everything is fast in staging. Then production shows up with a 99th percentile that looks like a ski slope and a Redis cluster that’s “fine” right until it isn’t. Someone adds caching to fix a MySQL hot spot. The hot spot moves. Then the pager moves into your bedroom.
This is the unglamorous truth: MySQL and Redis aren’t competitors in real systems; they’re co-workers. Your job is to keep them from silently disagreeing about reality. That’s what write-through and cache-aside are: contracts between your app, your cache, and your database. One contract breaks less often—if you pick it for the right reasons and operate it like you mean it.
The real question: what breaks less
“MySQL vs Redis” is a fake debate. MySQL is your system of record. Redis is your performance bet. The question that matters is: which caching pattern produces fewer customer-visible failures under normal operational chaos—deploys, partial outages, latency spikes, and the occasional “someone ran a command in the wrong terminal” event.
If you want the opinion up front:
- Cache-aside breaks less for most product apps because it fails open: when Redis is unhappy, you can still read and write to MySQL and limp along.
- Write-through breaks less only when you can invest in strong operational guarantees: disciplined timeouts, queueing or retries with idempotency, and clear rules about what happens if Redis or MySQL is unavailable.
- Write-through breaks more loudly. That’s not always bad. Loud failures are debuggable. Silent inconsistency is where you lose weekends.
There’s no free lunch; there’s only where you want to pay. Cache-aside usually pays in occasional stale reads and stampedes. Write-through pays in write-path latency and more ways to wedge the whole app if your cache hiccups.
One idea to keep taped to your monitor, because it has saved many on-call rotations, is Werner Vogels' maxim (paraphrased): everything fails, all the time; build for failure so that it doesn't become a catastrophe.
Definitions you can deploy
MySQL
Persistent relational database. Durable storage, transactional semantics, and query flexibility. Also: the thing everyone blames when the app is slow, even when the problem is the network, the pool, or the fact that a cache key design turned into performance art.
Redis
In-memory data structure store. Used as cache, queue, rate limiter, session store, and “temporary database we promise isn’t a database.” Redis can persist (RDB snapshots, AOF logs), can replicate, and can cluster, but it remains a different beast from MySQL: it optimizes for speed and simplicity, not relational correctness.
Cache-aside (lazy loading)
The application is responsible for cache reads and cache fills:
- Read path: check Redis → if miss, read MySQL → write to Redis → return.
- Write path: write MySQL → invalidate or update Redis.
Key point: cache is optional. If Redis fails, you can route around it and hit MySQL. Your worst day becomes “slow” instead of “down,” if MySQL can take the load.
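To make the contract concrete, here is a minimal cache-aside sketch in Python, assuming redis-py and hypothetical fetch_user_from_mysql / update_user_in_mysql helpers standing in for your data layer; the point is the shape of the read and write paths and the fail-open behavior, not the exact client code.

import json
import random
import redis

# Hypothetical MySQL helpers; substitute your own data layer.
from myapp.db import fetch_user_from_mysql, update_user_in_mysql

r = redis.Redis(host="redis-01", port=6379, socket_timeout=0.05)  # fail fast

TTL_BASE = 60      # seconds
TTL_JITTER = 15    # spread expiries so hot keys don't expire together

def get_user(user_id):
    key = f"user:{user_id}"
    try:
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
    except redis.RedisError:
        pass  # cache is best effort: fall through to MySQL
    user = fetch_user_from_mysql(user_id)              # source of truth
    if user is not None:
        ttl = TTL_BASE + random.randint(0, TTL_JITTER)
        try:
            r.set(key, json.dumps(user), ex=ttl)
        except redis.RedisError:
            pass                                       # a failed fill is not an error
    return user

def update_user(user_id, fields):
    update_user_in_mysql(user_id, fields)              # commit first
    try:
        r.delete(f"user:{user_id}")                    # then invalidate
    except redis.RedisError:
        pass  # stale-until-TTL is the accepted failure mode here

Notice that every Redis failure path ends in "keep going with MySQL"; that is the whole appeal.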
Write-through (synchronous population)
The application writes to cache and database in the same logical operation. Variants differ, but the spirit is:
- Write path: write to Redis (or cache layer) and MySQL as part of a single request.
- Read path: read from Redis; cache is expected to be warm and correct.
Key point: cache becomes part of correctness. If Redis is unhealthy, your write path is impacted. If your write-through layer lies, your app lies.
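For contrast, a deliberately naive write-through sketch, again Python with redis-py and a hypothetical update_profile_in_mysql helper; it exists mainly to show where the correctness questions live (the except branch), not as a recommended shape.

import json
import redis

from myapp.db import update_profile_in_mysql  # hypothetical MySQL helper

r = redis.Redis(host="redis-01", port=6379, socket_timeout=0.05)

def write_profile(user_id, profile):
    key = f"profile:{user_id}"
    # Dual write: both stores sit on the request's critical path.
    r.set(key, json.dumps(profile))        # if this fails, the write fails
    try:
        update_profile_in_mysql(user_id, profile)
    except Exception:
        # Redis now holds a value MySQL never accepted. You must roll the
        # cache back, retry idempotently, or reconcile later.
        r.delete(key)
        raise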
Short joke #1: Write-through is like a “quick” change request in a big company—fast right up until the approvals start.
Facts & history that actually matter
- Redis started (2009) as a practical response to slow web app data access, not as a grand unified platform. Its “do simple things extremely fast” DNA still shows.
- Memcached popularized cache-aside in the mainstream. Many Redis caching habits are inherited from that era: TTLs everywhere, best-effort invalidation, and a tolerance for occasional inconsistency.
- MySQL query cache was removed (MySQL 8.0) because it caused contention and unpredictable performance. Caching moved outward to application layers and dedicated caches.
- Redis is single-threaded for command execution (with some multi-threaded I/O in newer versions). That’s why it’s fast and predictable—until you run slow commands and block the world.
- Redis persistence is optional and tunable: RDB snapshots trade durability for speed; AOF trades disk write overhead for better recovery granularity. The choice changes what “write-through” even means.
- Replication is asynchronous by default in both Redis and many MySQL topologies. “I wrote it” may mean “a primary accepted it,” not “it’s safe from failure.”
- Cache stampede (thundering herd) was a known issue decades ago in large web systems; mitigations like request coalescing and jittered TTLs are old ideas—still ignored weekly.
- Redis Cluster shards by key hash slot. Multi-key operations get complicated fast, and cross-slot operations can become silent foot-guns for write-through workflows.
Write-through vs cache-aside: decision-grade comparison
What you’re optimizing for
If your primary pain is read latency and your dataset is stable-ish, cache-aside is usually enough. If your primary pain is read amplification caused by complex computed objects (e.g., assembled user profile + permissions + counters), and you want a consistently warm cache, write-through starts looking attractive.
But don’t confuse “warm” with “correct.” Warm is easy. Correct is where the bills arrive.
Operational blast radius
- Cache-aside: Redis outage → higher MySQL load → possible MySQL saturation → slow or partial outage. You can degrade gracefully if you prepared.
- Write-through: Redis outage → your write path may block or fail → cascading failures in app tier → outage even if MySQL is fine.
Consistency profile
Neither pattern gives you transactional consistency across MySQL and Redis without extra machinery. The choice is which inconsistency you prefer:
- Cache-aside risks stale reads after writes (invalidation races, replication lag, missed deletes).
- Write-through risks split-brain truth if one write succeeds and the other fails or retries incorrectly. It’s not stale; it’s contradictory.
Latency profile
Cache-aside keeps the write path mostly bounded by MySQL, which you already tuned for. Write-through adds Redis to the write path. If Redis is in a different AZ, behind TLS, or simply busy, congratulations: you just turned a local disk problem into a distributed systems problem.
When I pick cache-aside
- Read-heavy workloads with tolerable eventual consistency.
- Objects that can be regenerated from MySQL on demand.
- Teams that want a cache that can be bypassed during incidents.
- Data models with frequent writes and complex invalidation can still work—if you keep the rules simple.
When I pick write-through
- You have a clear caching layer/service that owns the write-through contract and can be operated like a database component.
- You can enforce idempotency and ordering for writes (or tolerate last-write-wins).
- You are ready to budget latency and availability for Redis like it’s critical-path infrastructure.
- You want predictable cache warmness for reads and you can keep data structures straightforward.
Failure modes: how each pattern dies
Cache-aside: the classics
1) Stale reads from invalidation races
Typical sequence:
- Request A misses the cache, fetches the MySQL row (old value), then prepares to set Redis.
- Request B updates MySQL row (new value), deletes Redis key.
- Request A now sets Redis with old value after B’s delete.
Result: Redis contains stale data until TTL expires or next invalidation. Customers see old state. Engineers see “but we deleted the key.” Both are true.
Mitigation: versioned keys, compare-and-set (Lua script with version), or write updates to cache with a monotonically increasing version/timestamp.
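The versioned-write idea can be sketched with redis-py's register_script and a JSON value that carries a version field; the Lua script below only overwrites the key when the incoming version is newer. Treat it as a sketch under those assumptions, not a drop-in.

import json
import redis

r = redis.Redis(host="redis-01", port=6379, socket_timeout=0.05)

# KEYS[1] = cache key, ARGV[1] = new JSON value, ARGV[2] = new version, ARGV[3] = TTL
SET_IF_NEWER = r.register_script("""
local current = redis.call('GET', KEYS[1])
if current then
  local cur_ver = tonumber(cjson.decode(current)['version']) or 0
  if cur_ver >= tonumber(ARGV[2]) then
    return 0  -- an older (or equal) write lost the race; keep what we have
  end
end
redis.call('SET', KEYS[1], ARGV[1], 'EX', ARGV[3])
return 1
""")

def cache_user(user):
    return SET_IF_NEWER(keys=[f"user:{user['id']}"],
                        args=[json.dumps(user), user["version"], 60])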
2) Cache stampede after expiry
Hot key expires. Thousands of requests miss at once. They all hit MySQL. MySQL falls over. Redis remains healthy, watching this drama like a cat watching a laser pointer.
Mitigation: request coalescing (single flight), probabilistic early refresh, a per-key mutex, jittered TTLs, and serve-stale-while-revalidate.
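Here is what the per-key mutex variant might look like, assuming redis-py and a hypothetical load_expensive_object helper: SET NX with a short TTL elects one rebuilder, and everyone else briefly waits and re-checks the cache instead of stampeding MySQL.

import json
import random
import time
import redis

from myapp.db import load_expensive_object  # hypothetical slow MySQL query

r = redis.Redis(host="redis-01", port=6379, socket_timeout=0.05)

def get_with_single_flight(key, ttl=60, jitter=15):
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    lock_key = f"lock:{key}"
    # Only one caller wins the lock and rebuilds; the lock auto-expires.
    if r.set(lock_key, "1", nx=True, ex=5):
        try:
            value = load_expensive_object(key)
            r.set(key, json.dumps(value), ex=ttl + random.randint(0, jitter))
            return value
        finally:
            r.delete(lock_key)
    # Losers back off briefly, then re-check the cache instead of hitting MySQL.
    time.sleep(0.05)
    cached = r.get(key)
    return json.loads(cached) if cached is not None else load_expensive_object(key)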
3) Cache penetration (misses for non-existent keys)
Attack traffic or buggy clients request IDs that don’t exist. Cache misses aren’t cached, so MySQL gets hammered with pointless queries.
Mitigation: negative caching with short TTLs, bloom filters, rate limits.
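Negative caching is a few extra lines on the read path; a sketch assuming redis-py and a hypothetical fetch_item_from_mysql helper, where a sentinel value with a short TTL absorbs repeated lookups for IDs that don't exist.

import json
import redis

from myapp.db import fetch_item_from_mysql  # hypothetical MySQL helper

r = redis.Redis(host="redis-01", port=6379, socket_timeout=0.05)

MISSING = b"__missing__"  # sentinel stored for IDs known to be absent

def get_item(item_id):
    key = f"item:{item_id}"
    cached = r.get(key)
    if cached == MISSING:
        return None                        # absorbed without touching MySQL
    if cached is not None:
        return json.loads(cached)
    item = fetch_item_from_mysql(item_id)
    if item is None:
        r.set(key, MISSING, ex=30)         # short TTL: absence may change
        return None
    r.set(key, json.dumps(item), ex=300)
    return item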
4) Silent partial failures
Redis timeouts aren’t treated as failures. The app waits too long and ties up threads. Or the app retries aggressively and becomes the DoS.
Mitigation: strict timeouts, circuit breakers, and clear “cache is best-effort” semantics.
Write-through: fewer misses, more correctness traps
1) Dual-write inconsistency
Write to Redis succeeds; write to MySQL fails. Or vice versa. Or both succeed but retries reorder operations. Now your cache and DB disagree, and your read path is guaranteed to serve something—possibly the wrong thing.
Mitigation: transactional outbox, change-data-capture (CDC) to drive cache updates, or make MySQL authoritative and treat cache writes as derived.
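A sketch of the outbox direction, assuming mysql-connector-python, redis-py, and hypothetical profiles / cache_outbox tables: the data row and the outbox row commit in one MySQL transaction, and a separate worker replays the outbox into Redis, so a Redis hiccup can delay cache updates but never contradict the database.

import json
import mysql.connector
import redis

conn = mysql.connector.connect(host="mysql-01", user="app",
                               password="...", database="prod")
r = redis.Redis(host="redis-01", port=6379, socket_timeout=0.05)

def write_profile(user_id, profile):
    # Request path: MySQL is the only store touched synchronously.
    cur = conn.cursor()
    cur.execute("UPDATE profiles SET data=%s, version=version+1 WHERE id=%s",
                (json.dumps(profile), user_id))
    cur.execute("INSERT INTO cache_outbox (entity, entity_id) VALUES ('profile', %s)",
                (user_id,))
    conn.commit()  # both rows commit together, or neither does

def drain_outbox_once():
    # Worker path: replay committed changes into Redis, then mark them done.
    cur = conn.cursor(buffered=True)
    cur.execute("SELECT id, entity_id FROM cache_outbox ORDER BY id LIMIT 100")
    for outbox_id, user_id in cur.fetchall():
        cur.execute("SELECT data FROM profiles WHERE id=%s", (user_id,))
        (data,) = cur.fetchone()
        # Pair this with the versioned compare-and-set sketched earlier so a
        # delayed replay can never clobber a newer cached value.
        r.set(f"profile:{user_id}", data, ex=300)
        cur.execute("DELETE FROM cache_outbox WHERE id=%s", (outbox_id,))
        conn.commit()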
2) Latency amplification on the write path
Redis hiccups (slow disk for AOF fsync, network jitter, CPU spikes). Write-through turns those into user-visible write latency. Timeouts cause retries. Retries cause load. Load causes more timeouts. You know the rest.
3) Redis persistence surprises
If you rely on Redis as part of write-through correctness, but Redis is configured with snapshotting only, a crash can drop recent writes. MySQL might be correct; Redis might be in the past. If your app reads Redis first, you’ll serve time-travel.
4) Cluster topology + key design pitfalls
Write-through often wants multi-key atomicity (update object + index + counters). Redis Cluster can’t run multi-key transactions across hash slots. People work around it with hash tags, then discover they built a hotspot shard.
Short joke #2: Cache invalidation is one of the hard problems, but at least it doesn’t have meetings. Dual writes do.
Three corporate mini-stories from the trenches
Incident #1: the outage caused by a wrong assumption
A mid-sized SaaS company ran a classic cache-aside setup: MySQL primary with replicas, Redis for hot objects. The team assumed “Redis reads are cheap, so we can do them everywhere.” They sprinkled Redis calls across the codebase—feature flags, rate limits, user sessions, and a few critical authorization checks.
Then came a regional network issue that increased packet loss between the app tier and the Redis nodes. Redis itself was healthy. Latency wasn’t even terrible—just jittery, with timeouts. The app’s Redis client defaulted to a 2-second timeout and a retry. Under load, threads piled up waiting for Redis. Eventually the web workers hit max concurrency and stopped accepting requests.
MySQL was fine. CPU was fine. The incident commander kept hearing “but Redis is up.” Sure. A server can be “up” the way a door can be “closed.” Both are technically true and operationally unhelpful.
The fix wasn’t heroic. They reduced Redis timeouts to something that reflected reality (tens of milliseconds, not seconds), added a circuit breaker, and—this is the key—stopped using Redis as a hard dependency for authorization decisions. For auth, they cached in-process with short TTLs and fell back to MySQL when needed. Redis went back to being a cache, not a judge.
Incident #2: the optimization that backfired
A marketplace platform wanted faster profile reads. They built a write-through flow: when a user updated their profile, the service wrote the denormalized profile blob into Redis and then updated MySQL. Reads were Redis-first, no MySQL fallback except during “maintenance.”
It worked beautifully until a deploy introduced a subtle bug: retries on MySQL write failures were not idempotent. The code appended new preferences rather than replacing them, and the cache write happened before the MySQL write. Under a brief MySQL lock contention event, the service retried. Redis now had the newest blob (with duplicate preference entries), while MySQL had a mixture of old and partially updated rows depending on which retry succeeded.
Customers saw inconsistent profiles depending on which service instance served them and which cache key they hit. Support calls described it as “settings won’t stick.” Engineers described it as “we have no idea which system is right.” The cache had become a second source of truth, without the operational maturity of a database.
The rollback restored some sanity, but the real repair took longer: they moved to MySQL-authoritative writes and used a change stream to update Redis after commit. They also added a version field and rejected older cache writes. The optimization had been faster. It was also a liar.
Incident #3: the boring but correct practice that saved the day
A payments-adjacent service (not the core ledger, but close enough to be sensitive) used cache-aside with strict rules: caches could be stale, but never used for final balance decisions. Every cache key had a TTL, a version, and an owner. Every Redis call had a tight timeout and a fallback strategy documented in a runbook.
One afternoon, a Redis failover (Sentinel-triggered) caused a brief window where some clients wrote to the old primary, some to the new. This wasn’t catastrophic by itself; it was the kind of chaos you should expect. Their app handled it because they treated Redis as best effort. Writes went to MySQL; cache invalidations were attempted but not required for correctness.
The system slowed. Alerts fired. But the service stayed up, and the correctness boundary stayed intact. The on-call followed the runbook: temporarily disabled cache reads for the hottest endpoints, let MySQL handle reads for a while, and watched error rates stabilize.
No heroics. No novel algorithms. Just clear contracts and the willingness to accept a temporary performance hit in exchange for not corrupting state. That “boring” discipline is what people mean when they say reliability is a feature.
Fast diagnosis playbook
When things get slow or inconsistent, you don’t have time for philosophy. You need a short sequence that identifies the bottleneck and the failure domain.
First: decide if it’s Redis-path, MySQL-path, or the app
- Check user-facing symptoms: are reads slow, writes slow, or both? Are errors timeouts or data mismatches?
- Check Redis latency and saturation: instantaneous latency spikes, blocked clients, evictions.
- Check MySQL concurrency: running queries, lock waits, replication lag.
- Check app pool health: thread/connection pools, queue depth, GC pauses.
Second: test bypass paths
- If you can safely bypass Redis reads for a hot endpoint, do it and watch whether latency normalizes. If it does, Redis-path is the culprit (or the client library is).
- If you can safely bypass MySQL replicas and read from primary (briefly), do it to validate replication lag.
Third: validate correctness boundary
- Pick one user/object with a known recent update and compare MySQL vs Redis values directly.
- If they differ, find out whether your pattern can produce that difference (invalidation race vs dual-write failure) and proceed accordingly.
Practical tasks: commands, outputs, and decisions
These are not toy commands. They’re the ones you run at 2 a.m. to stop guessing. Each task includes what the output means and the decision you make from it.
Task 1: Is Redis responding quickly from the app host?
cr0x@server:~$ redis-cli -h redis-01 -p 6379 --latency-history -i 1
min: 0, max: 2, avg: 0.31 (1000 samples) -- 1.01 seconds range
min: 0, max: 85, avg: 1.12 (1000 samples) -- 1.00 seconds range
Meaning: The second line shows occasional 85ms spikes. That’s not fatal, but if your app timeout is 50ms, it becomes errors.
Decision: If max/avg is above your SLO, investigate Redis CPU, persistence fsync settings, network jitter, or slow commands. Consider temporarily relaxing cache dependency (fallback) if you’re on write-through.
Task 2: Are Redis clients piling up?
cr0x@server:~$ redis-cli -h redis-01 INFO clients | egrep 'connected_clients|blocked_clients'
connected_clients:1248
blocked_clients:37
Meaning: blocked_clients counts clients parked in blocking calls (BLPOP/BRPOP, BLMOVE, WAIT, XREAD with BLOCK) waiting on something that isn't happening. In a workload that is supposed to be a pure cache, blocked clients usually mean trouble.
Decision: Identify the blocking commands and which clients issue them; separately check for slow Lua scripts or huge replies stalling the event loop. If blocked_clients rises with latency, treat it as a service degradation and shed cache load.
Task 3: Is Redis evicting keys (memory pressure)?
cr0x@server:~$ redis-cli -h redis-01 INFO stats | egrep 'evicted_keys|keyspace_hits|keyspace_misses'
keyspace_hits:93811233
keyspace_misses:12100444
evicted_keys:482919
Meaning: Evictions mean your cache is not a cache anymore; it’s a churn machine. High misses amplify MySQL load. Evictions also destroy any assumption of write-through “warmness.”
Decision: Increase maxmemory, fix TTLs, reduce value sizes, improve key distribution, or change eviction policy. In an incident: disable caching for low-value endpoints to reduce churn.
Task 4: What eviction policy is configured?
cr0x@server:~$ redis-cli -h redis-01 CONFIG GET maxmemory-policy
1) "maxmemory-policy"
2) "allkeys-lru"
Meaning: allkeys-lru evicts any key under pressure. If you store sessions/locks alongside cache entries, they’ll be evicted too. That’s how you get random logouts and “why did the job run twice?” bugs.
Decision: Move "must not disappear" keys (sessions, locks) to their own Redis instance, or switch the shared instance to a volatile-* policy so only keys with TTLs are eviction candidates. Don't mix critical data with best-effort cache data.
Task 5: Are slow commands blocking Redis?
cr0x@server:~$ redis-cli -h redis-01 SLOWLOG GET 5
1) 1) (integer) 912341
2) (integer) 1766812230
3) (integer) 58321
4) 1) "ZRANGE"
2) "leaderboard"
3) "0"
4) "50000"
5) "WITHSCORES"
5) "10.21.4.19:51722"
6) ""
Meaning: A 58ms ZRANGE returning 50k elements will block the event loop. Redis is fast, but it is not a miracle. Big responses are expensive.
Decision: Cap ranges, paginate, or redesign data access. If this is a cache, large sorted sets are often accidental product requirements disguised as technical ones.
Task 6: Is Redis persistence causing write latency?
cr0x@server:~$ redis-cli -h redis-01 INFO persistence | egrep 'aof_enabled|aof_last_write_status|rdb_bgsave_in_progress'
aof_enabled:1
aof_last_write_status:ok
rdb_bgsave_in_progress:0
Meaning: AOF is enabled. If disk is slow or fsync is aggressive, write latency can spike, especially painful under write-through.
Decision: For pure caching, consider disabling AOF or using a less strict fsync policy. For correctness-adjacent uses, measure disk latency and ensure persistence settings match your risk model.
Task 7: Is MySQL saturated or waiting on locks?
cr0x@server:~$ mysql -h mysql-01 -e "SHOW PROCESSLIST" | head
Id User Host db Command Time State Info
4123 app 10.21.5.11:53312 prod Query 12 Waiting for table metadata lock UPDATE users SET ...
4188 app 10.21.5.18:50221 prod Query 9 Sending data SELECT ...
Meaning: Metadata lock waits suggest DDL or schema changes colliding with traffic. Cache-aside won’t save you if writes are stuck on locks.
Decision: Stop/rollback DDL, or move it off-peak with online schema change tooling. In the meantime, reduce write concurrency or disable features hitting the locked tables.
Task 8: What does InnoDB say is happening right now?
cr0x@server:~$ mysql -h mysql-01 -e "SHOW ENGINE INNODB STATUS\G" | egrep -i 'LATEST DETECTED DEADLOCK|Mutex spin waits|history list length' | head -n 20
LATEST DETECTED DEADLOCK
Mutex spin waits 0, rounds 0, OS waits 0
History list length 987654
Meaning: Large history list length can indicate purge lag from long transactions, leading to bloat and worse performance. Deadlocks may show write contention patterns.
Decision: Identify long-running transactions, fix app transaction scope, or adjust isolation and indexing. If cache-aside invalidations depend on these writes, they’ll back up too.
Task 9: Is replication lag causing stale reads (blamed on cache)?
cr0x@server:~$ mysql -h mysql-replica-01 -e "SHOW REPLICA STATUS\G" | egrep 'Seconds_Behind_Source|Replica_IO_Running|Replica_SQL_Running'
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 47
Meaning: 47 seconds lag will look exactly like “cache staleness” if your app reads from replicas. You’ll invalidate cache and still read old data.
Decision: Route read-after-write traffic to primary (or use GTID-based read consistency). Fix replication bottlenecks before rewriting caching logic.
Task 10: Are Redis keys expiring in a synchronized wave?
cr0x@server:~$ redis-cli -h redis-01 --scan --pattern 'user:*' | head -n 5
user:10811
user:10812
user:10813
user:10814
user:10815
Meaning: You're sampling the keyspace. If most keys share identical TTLs (you'll verify next), you're setting yourself up for stampedes.
Decision: Add TTL jitter (randomized offset), or implement early refresh and single-flight locks for hot keys.
Task 11: Check TTL distribution for a hot key
cr0x@server:~$ redis-cli -h redis-01 TTL user:10811
(integer) 60
Meaning: A clean 60-second TTL is suspiciously synchronized if applied broadly. Many apps do exactly this and then wonder why the DB falls over every minute.
Decision: Change to something like 60±15 seconds jitter, or use soft TTL (serve stale while refreshing).
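The soft-TTL option can be sketched by storing the refresh deadline inside the value, assuming redis-py and a hypothetical rebuild_value helper: past the soft deadline you still serve the cached copy and refresh out of band; only past the hard TTL does Redis actually drop the key.

import json
import random
import threading
import time
import redis

from myapp.db import rebuild_value  # hypothetical expensive MySQL rebuild

r = redis.Redis(host="redis-01", port=6379, socket_timeout=0.05)

HARD_TTL = 300   # Redis really drops the key after this
SOFT_TTL = 60    # after this we still serve the value, but trigger a refresh

def set_soft(key, value):
    refresh_at = time.time() + SOFT_TTL + random.randint(0, 15)   # jittered
    r.set(key, json.dumps({"v": value, "refresh_at": refresh_at}), ex=HARD_TTL)

def get_soft(key):
    raw = r.get(key)
    if raw is None:
        value = rebuild_value(key)            # true miss: pay the MySQL cost once
        set_soft(key, value)
        return value
    envelope = json.loads(raw)
    if time.time() > envelope["refresh_at"]:
        # Serve the stale copy immediately; refresh out of band. Combine with
        # a single-flight lock so only one refresher runs per key.
        threading.Thread(target=lambda: set_soft(key, rebuild_value(key)),
                         daemon=True).start()
    return envelope["v"]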
Task 12: Validate a suspected inconsistency (Redis vs MySQL)
cr0x@server:~$ redis-cli -h redis-01 GET user:10811
{"id":10811,"email":"old@example.com","version":17}
cr0x@server:~$ mysql -h mysql-01 -e "SELECT id,email,version FROM users WHERE id=10811"
id email version
10811 new@example.com 18
Meaning: Redis is behind. With cache-aside, this could be an invalidation race or missed delete. With write-through, it could be a partial dual-write failure or a replay ordering bug.
Decision: If cache-aside: delete key and audit invalidation paths, add versioned writes. If write-through: treat as correctness incident—stop the line, find dual-write failure path, consider moving to DB-driven cache updates.
Task 13: Check Redis replication health (if you run replicas)
cr0x@server:~$ redis-cli -h redis-02 INFO replication | egrep 'role|master_link_status|master_last_io_seconds_ago'
role:slave
master_link_status:up
master_last_io_seconds_ago:1
Meaning: If master_link_status is down or last_io is high, failover may be in progress or replicas are stale. Under write-through with reads from replicas, that’s a correctness trap.
Decision: Prefer reading from master for strong-ish consistency, or accept eventual consistency and treat stale reads as expected. Don’t pretend you get both.
Task 14: Check Linux-level network pain from app to Redis/MySQL
cr0x@server:~$ ss -tan state established '( dport = :6379 or dport = :3306 )' | wc -l
842
Meaning: High established connection count can indicate missing pooling, connection churn, or stuck sockets. This becomes latency and then timeouts.
Decision: Ensure app pools connections, tune client limits, and verify Redis maxclients / MySQL max_connections. In incidents, limit concurrency at the ingress.
Task 15: Check Redis memory and fragmentation
cr0x@server:~$ redis-cli -h redis-01 INFO memory | egrep 'used_memory_human|used_memory_rss_human|mem_fragmentation_ratio'
used_memory_human:12.31G
used_memory_rss_human:16.02G
mem_fragmentation_ratio:1.30
Meaning: Fragmentation ratio 1.30 suggests allocator/fragmentation overhead. Not always fatal, but it reduces effective cache size and can trigger evictions.
Decision: If you’re eviction-bound, consider restarting during a maintenance window, enabling active defrag, or resizing memory. Also reduce value churn.
Common mistakes (symptom → root cause → fix)
1) “Redis is up but the site is down”
Symptom: High request latency and timeouts; Redis shows healthy CPU and memory.
Root cause: Client-side timeouts too high + retries + thread pool exhaustion. Redis is “up,” but your app is stuck waiting.
Fix: Set aggressive timeouts (typically 5–50ms depending on topology), cap retries, add circuit breaker, ensure fallback to MySQL for cache-aside reads.
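The circuit-breaker half of that fix is small enough to sketch; assuming redis-py, this wrapper stops calling Redis for a cool-off period after a run of timeouts, so a jittery cache degrades to MySQL instead of eating the thread pool. Real deployments usually reach for a library, but the mechanism fits on one page.

import time
import redis

class CacheBreaker:
    """Skip Redis entirely for cooloff seconds after max_failures timeouts."""

    def __init__(self, client, max_failures=5, cooloff=10.0):
        self.client = client
        self.max_failures = max_failures
        self.cooloff = cooloff
        self.failures = 0
        self.opened_at = 0.0

    def _open(self):
        return time.monotonic() - self.opened_at < self.cooloff

    def get(self, key):
        if self._open():
            return None                  # behave like a miss: caller goes to MySQL
        try:
            value = self.client.get(key)
            self.failures = 0
            return value
        except (redis.TimeoutError, redis.ConnectionError):
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return None

cache = CacheBreaker(redis.Redis(host="redis-01", socket_timeout=0.02))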
2) “We invalidated the cache and it’s still stale”
Symptom: After updates, some users see old data for minutes.
Root cause: Invalidation race (delete then refill with old value), or replica lag (you read old data from replica and repopulate cache).
Fix: Use versioned keys or compare-and-set writes; route read-after-write to primary or enforce read consistency; add TTL jitter and single-flight.
3) “Writes got slower after we added caching”
Symptom: P95 write latency increases; timeouts appear on update endpoints.
Root cause: Write-through added Redis to the critical path; persistence fsync or network jitter amplifies latency.
Fix: Make cache writes asynchronous (DB-authoritative + CDC), or accept cache-aside with invalidation; tune Redis persistence or isolate cache-only Redis.
4) “Random logouts, duplicate jobs, missing locks”
Symptom: Sessions disappear; background jobs run twice; distributed locks fail.
Root cause: Using a single Redis instance for both volatile cache keys and critical ephemeral coordination keys; eviction policy nukes important keys.
Fix: Separate Redis instances or at least separate memory budgets/policies; avoid eviction for coordination data; monitor evicted_keys.
5) “Every minute the database melts”
Symptom: Periodic spikes in MySQL QPS and latency, aligned with TTL boundaries.
Root cause: Synchronized TTL expiry across hot keys; stampede.
Fix: TTL jitter, early refresh, request coalescing, serve-stale-while-revalidate, and partial caching of expensive components.
6) “Cache hit rate is high but performance is still bad”
Symptom: Redis hit rate looks great; app still slow.
Root cause: Large value payloads cause network overhead; serialization/deserialization costs; slow Redis commands; app CPU bound.
Fix: Measure payload size, compress selectively, store smaller projections, avoid heavy range queries, profile app CPU.
7) “We have data corruption but no errors”
Symptom: Users see contradictory state depending on endpoint.
Root cause: Dual-write without idempotency and without ordering guarantees; write-through design assumes success symmetry.
Fix: Stop dual-write in request path. Use transactional outbox/CDC to update cache after commit; add version checks; define one source of truth.
Checklists / step-by-step plan
Pick the pattern: a practical decision tree
- Can the system function correctly with Redis down?
- Yes → default to cache-aside.
- No → you’re building a distributed datastore. Treat Redis as critical infrastructure and consider whether MySQL is still needed on the hot path.
- Do you require read-after-write consistency for user-facing flows?
- Yes → cache-aside with primary reads for the session, or DB-driven cache updates with versioning.
- No → cache-aside with TTL + jitter is usually fine.
- Are writes frequent and latency-sensitive?
- Yes → avoid synchronous write-through unless Redis is extremely close and very well operated.
- No → write-through can be acceptable if it simplifies reads and you can enforce idempotency.
Cache-aside runbook: “correctness first” implementation steps
- Define the source of truth: MySQL is authoritative. Redis is derived.
- Read path: Redis GET → on miss, read MySQL → set Redis with TTL + jitter.
- Write path: Write MySQL in transaction → after commit, invalidate (DEL) or update Redis.
- Prevent stampedes: implement single-flight per key (mutex with short TTL), or serve stale while refreshing.
- Add versioning: embed version in the value; reject older writes to Redis if you can.
- Time out fast: Redis timeout short; if hit, skip cache and read MySQL.
- Observe: track hit rate, evictions, latency, and MySQL QPS during cache bypass drills.
Write-through runbook: if you insist, do it like you mean it
- Make writes idempotent: retries must not create new state. Use request IDs, versions, or upserts carefully.
- Define ordering: last-write-wins needs a timestamp/version that is monotonic per object.
- Plan partial failure behavior: if Redis write fails but MySQL succeeds, what happens? If MySQL fails but Redis succeeds, how do you repair?
- Prefer DB-driven cache updates: commit in MySQL, then update Redis via async worker consuming an outbox/CDC.
- Budget latency: Redis becomes part of write SLO. Measure, alert, and capacity-plan accordingly.
- Isolation: don’t share this Redis with best-effort caches and random feature experiments.
Incident checklist: keep the service alive
- Reduce blast radius: disable cache reads for the hottest endpoints if safe.
- Cap concurrency at ingress (load shedding beats total collapse).
- Verify Redis evictions and latency; verify MySQL locks and replication lag.
- If correctness is compromised (dual-write inconsistency), freeze writes or route reads to MySQL while you repair.
- After stabilization, backfill cache and run consistency sampling.
FAQ
1) Which breaks less: cache-aside or write-through?
In most product apps: cache-aside breaks less because it can degrade to MySQL when Redis is slow or down. Write-through makes Redis part of your write availability.
2) Can write-through be made safe?
Yes, but “safe” usually means not truly synchronous dual-write. The safer model is MySQL commit first, then async cache update via outbox/CDC with versioning.
3) Why not just rely on Redis persistence and skip MySQL?
Sometimes that’s valid, but it’s a different design. Redis persistence and clustering can work, but you lose relational queries and gain new operational constraints. Don’t sleepwalk into this by accident.
4) What TTL should I use?
Pick TTL based on how painful staleness is and how expensive a miss is. Then add jitter (random offset) to avoid synchronized expiry. Hot keys often need special handling beyond TTL.
5) Should I update cache on write or delete/invalidate?
Invalidate is simpler and often safer, but can increase misses. Update-on-write reduces misses but increases correctness complexity. If you update-on-write, use versioning or CAS semantics to avoid races.
6) How do I prevent cache stampede?
Use single-flight (per-key lock), serve stale while revalidating, early refresh, and TTL jitter. Also consider negative caching for non-existent items.
7) Why is my hit rate high but MySQL still hot?
Because the remaining misses might be the expensive ones, or because Redis calls are slow/large, or because you’re doing extra MySQL work per request (joins, locks, secondary queries) unrelated to cached objects.
8) Is Redis Cluster required for caching?
No. Many caches do fine with primary+replica and Sentinel for failover, or even a single instance if you can tolerate loss. Cluster adds operational overhead and key-slot constraints—worth it when you need horizontal scaling.
9) How do I debug “stale data” complaints fast?
Pick a single object, compare Redis vs MySQL values directly, and check replication lag. Then identify if it’s an invalidation race (cache-aside) or partial dual-write (write-through).
Conclusion: next steps you can ship
If you want something that breaks less, pick cache-aside with tight timeouts, sane fallbacks, and stampede protection. Treat Redis as a performance layer, not a truth layer. When Redis fails, you should get slower—not wrong.
If you truly need write-through semantics, don’t do naive dual writes in the request path. Make MySQL authoritative, update Redis after commit, and add versioning so old cache writes can’t resurrect stale state.
Concrete next steps:
- Audit every Redis call: timeout, retry policy, and whether the request can succeed without Redis.
- Add dashboards/alerts for Redis latency, evictions, blocked clients, and MySQL replication lag.
- Implement TTL jitter and single-flight on your top 20 keys by QPS.
- Run a game day: disable Redis reads for one endpoint and verify MySQL can survive the load long enough for an incident response.
- Write down your correctness boundary in plain English and enforce it in code reviews.
Production systems don’t reward cleverness. They reward contracts that hold under stress.