Some websites don’t “scale.” They just slowly accumulate latency until the first meaningful traffic spike turns your database into a smoking crater. You add indexes. You add replicas. You add “a cache.” The graphs look better—until they don’t, and now you’re debugging missing carts, stale prices, and users randomly logged out.
The uncomfortable truth: caching is easy to bolt on, hard to make correct, and very easy to make fast in the wrong direction. MariaDB and Redis can both be used for caching, but they fail differently. If you understand those failure modes, you can build caches that speed up your site without losing data.
What you’re actually deciding
“MariaDB vs Redis” is rarely a binary choice. The real decision is how you split responsibilities between:
- System of record: the database that must remain correct and recoverable after failures (typically MariaDB).
- Computation and coordination: counters, locks, rate limits, leaderboards, dedupe keys (Redis excels).
- Derived data: cached query results, rendered pages, precomputed aggregates (both can do it, but not equally).
- Operational risk: can you tolerate losing cache contents? If not, you’re talking about persistence, backups, replication, and restore drills—not just “add Redis.”
If you want speed without data loss, you need to be honest about what “data” means. Losing a cached product page is fine. Losing a confirmed payment status is not. Redis can be durable-ish. MariaDB can be used as a cache-ish. But neither should be forced to cosplay as the other without a plan.
Paraphrased idea from Werner Vogels: “Everything fails, all the time,” so you design assuming components will break, and you make recovery routine.
MariaDB as a cache: when it’s sane, when it’s a trap
Using MariaDB as a cache usually means one of these:
- Materialized-ish tables: precomputed aggregates stored in tables and refreshed.
- Denormalized read models: “search index table,” “product_summary,” “user_profile_compact.”
- Result caching by application: storing serialized blobs in a table with TTL-ish fields.
- Read replicas: not caching per se, but offloading reads can feel like it.
Why MariaDB can be a good cache
Because it’s boring. Boring is a feature when correctness matters.
- Durability is native: InnoDB redo logs, doublewrite buffer, crash recovery. It’s built to keep your bits.
- SQL gives you leverage: you can refresh a cache table incrementally, join it, filter it, backfill it.
- Ops already knows it: backups, replication, monitoring, access controls—often already in place.
Why MariaDB as a cache can hurt you
Caching is about reducing expensive work. If you make the “cache” as expensive as the original work, you didn’t cache—you duplicated load.
- Hot rows become bottlenecks: counters, rate limits, “last seen,” “inventory remaining” can turn into update storms and lock contention.
- Buffer pool thrash: caching lots of ephemeral blobs can evict genuinely important pages.
- TTL cleanup hurts: deleting expired rows can cause I/O spikes, purge lag, and replication delay.
Rule of thumb: use MariaDB to cache structured, queryable derived data that you’re willing to manage like a real dataset. Don’t use it for “millions of tiny TTL keys” unless you enjoy explaining to finance why your primary is at 95% CPU on a Tuesday morning.
Redis as a cache: fast, sharp edges included
Redis is an in-memory data structure server. That sentence is doing a lot of work. The value isn’t only speed; it’s the primitives: atomic increments, sets, sorted sets, streams, pub/sub, Lua scripting, and fast expiration.
What Redis is excellent at
- Low-latency lookups: cache-aside for object fetches, HTML fragments, permissions, feature flags.
- High-churn ephemeral state: sessions, CSRF tokens, one-time links, idempotency keys.
- Coordination: distributed locks (carefully), rate limits, queues (with caveats), dedupe sets.
- Fighting stampedes: with TTLs, jitter, and “single flight” patterns.
What Redis is bad at (and you can still make it worse)
- Pretending it’s a database without budgeting for persistence: if you store primary business facts in Redis without a durability plan, you’re betting your company on RAM and configuration defaults.
- Unbounded memory growth: without maxmemory policies and key hygiene, it will happily accept your data until the kernel’s OOM killer shows up like an uninvited auditor.
- Big keys and large values: single keys with huge payloads cause latency spikes due to blocking operations and eviction overhead.
Joke #1: Redis is like espresso—amazing when measured, catastrophic when you keep refilling the cup because “it still fits.”
Interesting facts and small history that matters
- Redis started (2009) as a way to handle real-time stats without hammering a relational database—its DNA is “fast counters and sets,” not “perfect durability.”
- MariaDB was created (2009–2010) as a fork of MySQL after Oracle acquired Sun; many teams adopted it to keep an open development path.
- MySQL’s historical Query Cache (also relevant to MariaDB’s ancestry) was notorious for mutex contention under write load; it was eventually removed in MySQL 8.0 because it hurt more than it helped at scale.
- Redis expiration is lazy + active: keys expire when accessed, plus background sampling. This affects memory planning and “why is stale stuff still there?” debugging.
- Redis persistence has two main modes: RDB snapshots and AOF journaling; both have tradeoffs in write amplification and recovery time.
- InnoDB’s buffer pool is a cache: MariaDB already caches data pages in memory. Sometimes your “cache layer” duplicates what the engine is already doing well.
- Replica lag is an old problem: using read replicas as a “cache” introduces consistency issues that can look like missing writes or “random bugs.”
- Redis’ single-threaded command execution (for the main event loop) is a feature for atomicity, but it means slow commands and big payloads hurt everyone.
- Cache invalidation has been a famous hard problem in distributed systems for decades—not because engineers are bad, but because time, concurrency, and partial failure are annoying.
Caching patterns that don’t eat your data
1) Cache-aside (lazy loading): the default choice
Flow: read from cache → miss → read from MariaDB → set cache with TTL → return.
Why it works: MariaDB remains the source of truth. Cache can be dropped anytime. Recovery is “wait for warmup.”
Failure modes:
- Stampede: many requests miss and all slam MariaDB at once.
- Stale data: cache not invalidated after writes, or TTL too long.
- Hot key: a single popular key causes cache churn and lock contention in your app.
Do this:
- Add TTL with jitter (randomized extra seconds) to prevent synchronized expiry.
- Use “single flight” in the app (one recompute per key) or Redis locks with short TTLs.
- Cache negative results briefly (e.g., "user not found" for 30s) so repeated lookups for missing records don't all fall through to the database.
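A minimal sketch of cache-aside with jittered TTLs and negative caching. Plain dicts stand in for Redis and MariaDB so it runs anywhere; every name here is illustrative, not a real client API.

```python
import random
import time

# Stand-ins for Redis and MariaDB; in production these would be a
# Redis client and a real query against the primary.
cache = {}                           # key -> (value, expires_at)
db = {"user:1": {"name": "alice"}}

BASE_TTL = 300      # seconds; set this from your staleness budget
JITTER = 30         # randomized extra seconds to avoid synchronized expiry
NEGATIVE_TTL = 30   # cache "not found" briefly to absorb repeated misses

def cache_get(key):
    entry = cache.get(key)
    if entry is None:
        return None, False
    value, expires_at = entry
    if time.monotonic() >= expires_at:
        del cache[key]               # expired: treat as a miss
        return None, False
    return value, True

def cache_set(key, value, ttl):
    # TTL + jitter prevents a fleet of keys expiring in the same second
    cache[key] = (value, time.monotonic() + ttl + random.uniform(0, JITTER))

def get_user(key):
    value, hit = cache_get(key)
    if hit:
        return value
    value = db.get(key)              # the expensive MariaDB read
    if value is None:
        cache_set(key, None, NEGATIVE_TTL)   # negative result, short TTL
    else:
        cache_set(key, value, BASE_TTL)
    return value
```

Note that a cached negative result returns as a hit, so a missing record costs one DB read per NEGATIVE_TTL window instead of one per request.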
2) Read-through cache: outsource miss handling
Some client libraries or sidecars can fetch from the DB on miss. Operationally seductive. Usually a bad idea unless it’s battle-tested in your environment, because it hides load from the application and makes DB traffic harder to reason about.
Use case: internal platforms with standardized access patterns and solid observability.
3) Write-through: correctness first, latency later
Flow: write to cache and DB in one path; reads hit cache.
Benefit: cache stays fresh; no invalidation complexity for many use cases.
Cost: higher write latency, more moving parts in the write path. If Redis is down, writes can fail unless you explicitly degrade to DB-only.
Do this: treat cache update as best-effort unless the cached value is needed for correctness. For most websites, it isn’t.
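A sketch of that "best-effort" degradation, again with dicts standing in for the real stores (all names hypothetical). The DB write must succeed; the cache write is allowed to fail.

```python
db = {}
cache = {}
cache_available = True   # toggled below to simulate a Redis outage

def cache_set(key, value):
    # Stand-in for a Redis SET; raises when the cache tier is down
    if not cache_available:
        raise ConnectionError("cache down")
    cache[key] = value

def write_profile(key, value):
    db[key] = value              # durable MariaDB write first; never skipped
    try:
        cache_set(key, value)    # best-effort freshness, not correctness
        return True
    except ConnectionError:
        # Degrade to DB-only: readers take a cache miss and reload later.
        # Tradeoff: a stale copy may linger until its TTL expires.
        return False

ok_up = write_profile("profile:1", "alice")
cache_available = False          # simulate the cache tier going away
ok_down = write_profile("profile:1", "alice-v2")
```

After the simulated outage, the database holds the new value while the cache still holds the old one, which is exactly the staleness window you accept in exchange for not failing the write.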
4) Write-behind (aka write-back): the fastest way to invent data loss
Flow: write to cache, return success to user, asynchronously write to DB later.
When it’s acceptable: rarely. Maybe for analytics counters where losing a few updates is acceptable and you have idempotency and replay.
When it’s a disaster: orders, balances, permissions, entitlements, inventory, user-generated content.
Write-behind is how you turn a cache outage into a postmortem titled “why did 2% of carts evaporate.” It also makes compliance people develop sudden interest in your weekend plans.
5) TTL-only invalidation: cheap, cheerful, and sometimes wrong
If you set a TTL and never explicitly invalidate, you’re relying on time to fix correctness. That’s fine for content that can be stale for a bit (home pages, trending lists), and not fine for anything transactional.
Better: combine TTL with event-driven invalidation for critical keys.
6) Event-driven invalidation: harder, but it scales correctness
Flow: writes go to MariaDB → publish “entity changed” event → consumers invalidate or refresh Redis keys.
Benefit: low staleness, consistent behavior under load.
Risk: message delivery, consumer lag, and ordering. You need idempotency and you must assume events can be duplicated.
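Because events can be duplicated, the consumer has to dedupe before acting. A minimal sketch, with a set standing in for a Redis dedupe structure with TTL (names illustrative):

```python
cache = {"user:7": "cached-profile"}   # stand-in for Redis
processed = set()                       # stand-in for a dedupe set with TTL

def handle_change_event(event):
    # Events may be delivered more than once; dedupe by event id first
    if event["id"] in processed:
        return False
    processed.add(event["id"])
    cache.pop(f"user:{event['entity_id']}", None)  # invalidate derived key
    return True

event = {"id": "evt-42", "entity_id": 7}
first = handle_change_event(event)
duplicate = handle_change_event(event)
```

Invalidation is naturally idempotent (deleting a deleted key is a no-op), but the dedupe guard matters once consumers also refresh keys or emit side effects.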
7) Versioned keys: the anti-stale pattern that actually works
Instead of deleting keys, you change the key namespace by bumping a version number.
- Key: product:123:v17
- Another key stores the current version: product:123:ver
On update, increment the version. New reads go to the new key. Old keys expire naturally. This is boring-good for high-traffic pages where delete storms would crush Redis.
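The whole pattern fits in a few lines. A dict stands in for Redis here; in the real thing the version bump is an atomic INCR and the old keys carry TTLs.

```python
cache = {}   # stand-in for Redis; "<entity>:ver" holds an integer version

def current_version(entity):
    return cache.setdefault(f"{entity}:ver", 1)

def read(entity, loader):
    key = f"{entity}:v{current_version(entity)}"
    if key not in cache:
        cache[key] = loader()    # recompute from MariaDB on a miss
    return cache[key]

def invalidate(entity):
    # One atomic bump (INCR in real Redis) instead of a delete storm;
    # keys under the old version simply expire via their TTLs.
    cache[f"{entity}:ver"] = current_version(entity) + 1

first = read("product:123", lambda: "render-1")
invalidate("product:123")
second = read("product:123", lambda: "render-2")
```

The nice property: invalidation never races with readers of the old key, because the old key is still intact until it expires.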
8) Cache stampede protection patterns
- Probabilistic early recompute: refresh before TTL expires with a probability based on remaining TTL and compute cost.
- Serve stale while revalidating: keep a soft TTL (stale allowed) and a hard TTL (must recompute). Stale buys you time when DB is hot.
- Single-flight: only one worker recomputes; others wait or serve stale.
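Soft/hard TTL plus single-flight can be sketched together. Threading locks stand in for short-TTL Redis locks, and a dict stands in for the cache; this is a single-process sketch of the idea, not a distributed implementation.

```python
import threading
import time

cache = {}   # key -> (value, soft_deadline, hard_deadline)
locks = {}   # per-key locks; in Redis this would be a short-TTL lock key
SOFT_TTL, HARD_TTL = 30, 300

def get(key, recompute):
    now = time.monotonic()
    entry = cache.get(key)
    if entry:
        value, soft, hard = entry
        if now < soft:
            return value          # fresh: serve from cache
        if now < hard:
            # Stale but servable: one worker refreshes, the rest serve stale
            lock = locks.setdefault(key, threading.Lock())
            if lock.acquire(blocking=False):
                try:
                    value = recompute()
                    cache[key] = (value, now + SOFT_TTL, now + HARD_TTL)
                finally:
                    lock.release()
            return value
    value = recompute()           # hard miss: must recompute
    now = time.monotonic()
    cache[key] = (value, now + SOFT_TTL, now + HARD_TTL)
    return value

fresh = get("page:/home", lambda: "render-1")
cache["page:/home"] = ("render-1", 0, time.monotonic() + 100)  # force soft expiry
refreshed = get("page:/home", lambda: "render-2")
```

Under load, the hard-miss branch should also be single-flight; it's left plain here to keep the shape of the soft/hard split visible.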
9) Redis for sessions: fast and usually correct
Sessions are a good Redis use case because they’re ephemeral and naturally TTL’d. But you still need to decide what “no data loss” means. Losing sessions annoys users; it typically doesn’t corrupt money.
Do not store long-lived authorization state in sessions unless you can revoke it correctly. Cache permissions, sure; store “this user is admin forever,” no.
10) MariaDB summary tables: “cache” that stays queryable
If the expensive part is a multi-join aggregation, Redis can cache the result, but you lose queryability. MariaDB summary tables let you index and filter the derived data. You pay with refresh complexity and storage, but you gain explainable SQL and durability.
Joke #2: Cache invalidation is the adult version of “did you try turning it off and on again,” except the cache remembers that you tried.
Durability: “without data loss” in the real world
When someone says “no data loss,” ask: which data, and what’s your acceptable loss window?
- Cacheable derived data: losing it is fine; recompute from MariaDB.
- Ephemeral user state: losing it is tolerable but should be rare; sessions can be re-authenticated.
- Transactional facts: must survive process crashes, node failures, and operator mistakes. Store them in MariaDB (or another true system of record), not only Redis.
MariaDB durability checklist (baseline)
- InnoDB with proper flush settings for your risk appetite.
- Binary logs enabled if you need point-in-time recovery.
- Backups tested by restore, not by hope.
- Replication monitored for lag and errors.
Redis durability options (if you insist on keeping important state)
Redis can persist data, but persistence isn’t magic. It is a set of tradeoffs you must deliberately choose.
- RDB snapshots: periodic snapshots. Faster writes, but you can lose data between snapshots.
- AOF (append-only file): logs every write. Better durability, more write I/O, and rewrite behavior to manage file size.
- Replication: helps availability, not guaranteed zero loss unless you design for it and accept latency.
If you genuinely can’t lose Redis data, you’re no longer “using a cache.” You’re running another database. Treat it like one: persistence, replication, backups, restore drills, and capacity planning.
Practical tasks: commands, outputs, and what you decide
These are the kinds of checks you run during an incident or a performance review. Each task includes: command, sample output, what it means, and the decision you make.
Task 1: Confirm MariaDB is the bottleneck (top queries)
cr0x@server:~$ sudo mariadb -e "SHOW FULL PROCESSLIST\G" | sed -n '1,60p'
*************************** 1. row ***************************
Id: 8421
User: app
Host: 10.0.3.24:41372
db: prod
Command: Query
Time: 12
State: Sending data
Info: SELECT ... FROM orders JOIN order_items ...
*************************** 2. row ***************************
Id: 8422
User: app
Host: 10.0.3.25:41810
db: prod
Command: Query
Time: 11
State: Sending data
Info: SELECT ... FROM orders JOIN order_items ...
Meaning: long-running read queries dominate, and many are identical. That’s prime caching territory.
Decision: implement cache-aside for the expensive read path, or create a summary table if the query is complex and needs filtering.
Task 2: Check MariaDB buffer pool health
cr0x@server:~$ sudo mariadb -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';"
+---------------------------------------+-----------+
| Variable_name | Value |
+---------------------------------------+-----------+
| Innodb_buffer_pool_read_requests | 984332111 |
| Innodb_buffer_pool_reads | 12099122 |
+---------------------------------------+-----------+
Meaning: physical reads vs logical reads. A high ratio of reads to read_requests indicates a cold or undersized buffer pool.
Decision: if the buffer pool miss rate is high, tuning MariaDB may outperform adding Redis for some workloads.
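As a quick sanity check on the counters above: the miss ratio is physical reads divided by logical read requests.

```python
# Counters from the SHOW GLOBAL STATUS output above
read_requests = 984_332_111   # logical reads served from the buffer pool path
physical_reads = 12_099_122   # reads that had to go to disk
miss_ratio = physical_reads / read_requests
print(f"buffer pool miss ratio: {miss_ratio:.2%}")  # prints "buffer pool miss ratio: 1.23%"
```

A miss ratio around 1% is usually unremarkable; a figure in the high single digits on a steady-state workload is a sign the buffer pool is cold or undersized.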
Task 3: Identify slow queries you are about to “cache over”
cr0x@server:~$ sudo mariadb -e "SHOW VARIABLES LIKE 'slow_query_log%'; SHOW VARIABLES LIKE 'long_query_time';"
+---------------------+-------+
| Variable_name | Value |
+---------------------+-------+
| slow_query_log | ON |
| slow_query_log_file | /var/lib/mysql/slow.log |
+---------------------+-------+
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| long_query_time | 1.000 |
+-----------------+-------+
Meaning: you have slow logging enabled with a 1s threshold.
Decision: before caching, fix obvious missing indexes and N+1 query patterns. Cache should reduce load, not hide sloppy access patterns.
Task 4: Measure replication lag (when replicas are your “cache”)
cr0x@server:~$ sudo mariadb -e "SHOW SLAVE STATUS\G" | egrep "Seconds_Behind_Master|Slave_IO_Running|Slave_SQL_Running"
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 37
Meaning: replica is 37 seconds behind. Reads from replica will be stale.
Decision: do not use replica for read-after-write paths (login, checkout). Use primary or build a consistency strategy (sticky reads, GTID-based reads, or explicit cache invalidation).
Task 5: Check Redis memory and eviction risk
cr0x@server:~$ redis-cli INFO memory | egrep "used_memory_human|maxmemory_human|mem_fragmentation_ratio"
used_memory_human:7.83G
maxmemory_human:8.00G
mem_fragmentation_ratio:1.62
Meaning: you’re near maxmemory, fragmentation is high, eviction pressure is likely.
Decision: either increase memory, reduce value sizes, shard, or change eviction policy. Also fix fragmentation by reviewing allocator behavior and key churn.
Task 6: Verify Redis eviction policy (you might be evicting the wrong thing)
cr0x@server:~$ redis-cli CONFIG GET maxmemory-policy
1) "maxmemory-policy"
2) "noeviction"
Meaning: Redis will refuse writes when memory is full. Applications often interpret this as “random errors.”
Decision: for caching, prefer allkeys-lfu or volatile-lfu depending on whether every key has TTL. For critical state, be explicit: maybe you do want noeviction so failure is loud.
Task 7: Look for big keys (latency spikes and memory waste)
cr0x@server:~$ redis-cli --bigkeys
# Scanning the entire keyspace to find biggest keys as well as biggest overall keys per data type
Biggest string found 'pagecache:/product/123' has 5242880 bytes
Biggest hash found 'session:hash' has 183201 fields
Meaning: you have multi-megabyte values and huge hashes; both can create pauses and uneven eviction.
Decision: split large objects, compress carefully (CPU tradeoff), and avoid “one giant hash of everything.” Prefer per-session keys with TTL.
Task 8: Check Redis persistence settings (data loss window)
cr0x@server:~$ redis-cli CONFIG GET save appendonly appendfsync
1) "save"
2) "900 1 300 10 60 10000"
3) "appendonly"
4) "no"
5) "appendfsync"
6) "everysec"
Meaning: RDB snapshots enabled; AOF disabled. Worst-case loss is up to the snapshot interval if the node dies.
Decision: if Redis holds anything more important than disposable cache, enable AOF and decide on fsync policy; otherwise accept cache loss and design for it.
Task 9: Detect blocked Redis clients (slow commands)
cr0x@server:~$ redis-cli INFO clients | egrep "blocked_clients|connected_clients"
connected_clients:812
blocked_clients:17
Meaning: clients are blocked; something is slow (big keys, Lua scripts, slow disk for AOF, or network stalls).
Decision: inspect slowlog, identify the command, and fix the workload. Redis being “in memory” doesn’t mean it’s immune to I/O and CPU.
Task 10: Inspect Redis slowlog to catch self-inflicted pain
cr0x@server:~$ redis-cli SLOWLOG GET 3
1) 1) (integer) 19042
2) (integer) 1735250401
3) (integer) 15423
4) 1) "KEYS"
2) "*"
5) "10.0.2.9:51244"
6) ""
2) 1) (integer) 19041
2) (integer) 1735250399
3) (integer) 8120
4) 1) "HGETALL"
2) "session:hash"
5) "10.0.2.10:42118"
6) ""
Meaning: someone ran KEYS * (blocks Redis on big datasets) and you’re using HGETALL on a massive hash.
Decision: ban KEYS in production (use SCAN), redesign session storage to avoid huge hashes, and add tooling guardrails.
Task 11: Confirm MariaDB indexing on the hot path
cr0x@server:~$ sudo mariadb -e "EXPLAIN SELECT * FROM orders WHERE user_id=123 ORDER BY created_at DESC LIMIT 20\G"
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: orders
type: ref
possible_keys: idx_user_created
key: idx_user_created
key_len: 4
ref: const
rows: 20
Extra: Using where
Meaning: query uses the intended composite index and scans only a small number of rows.
Decision: caching this query might still help, but you’re no longer masking a missing index. Good. Now you can size cache TTL based on business freshness.
Task 12: Check for lock contention in MariaDB (counters gone wrong)
cr0x@server:~$ sudo mariadb -e "SHOW ENGINE INNODB STATUS\G" | sed -n '/LATEST DETECTED DEADLOCK/,+40p'
------------------------
LATEST DETECTED DEADLOCK
------------------------
*** (1) TRANSACTION:
TRANSACTION 928331, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1128, 2 row lock(s)
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 221 page no 9 n bits 80 index PRIMARY of table `prod`.`rate_limits`
Meaning: deadlocks on a rate limit table. Classic “database used as a counter/lock service.”
Decision: move rate limiting and counters to Redis where atomic ops are cheap, and keep MariaDB for durable records.
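The Redis-side replacement is the fixed-window counter pattern: bake the window into the key, INCR it, and let EXPIRE clean up old windows. A sketch with a dict standing in for Redis (in real Redis you'd INCR, then EXPIRE on first increment, or do both atomically in a Lua script):

```python
import time

counters = {}  # key -> count; stand-in for Redis INCR + EXPIRE

WINDOW = 60    # seconds per window
LIMIT = 100    # requests per window per user

def allow(user_id, now=None):
    now = time.monotonic() if now is None else now
    window = int(now // WINDOW)
    # Window number is baked into the key, so old windows expire naturally
    key = f"ratelimit:{user_id}:{window}"
    count = counters.get(key, 0) + 1      # INCR in real Redis: atomic
    counters[key] = count
    return count <= LIMIT

allowed = sum(1 for _ in range(150) if allow("u1", now=0))
```

This moves the hot-row contention off InnoDB entirely: no row locks, no deadlocks, and the counter disappears on its own when the window ends.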
Task 13: Validate Redis key TTL hygiene (are you leaking memory?)
cr0x@server:~$ redis-cli INFO keyspace
db0:keys=1823492,expires=24112,avg_ttl=0
Meaning: almost no keys have expiration; avg_ttl=0 suggests indefinite keys.
Decision: for a cache, most keys should have TTL. If they don’t, you’re building a second datastore with no migration plan.
Task 14: Detect a cache stampede in the application layer (Redis hit rate)
cr0x@server:~$ redis-cli INFO stats | egrep "keyspace_hits|keyspace_misses"
keyspace_hits:12099331
keyspace_misses:8429932
Meaning: miss rate is high; Redis is not serving as an effective cache, or TTLs are too low, or key names are inconsistent.
Decision: standardize key construction, increase TTL where safe, add jitter, and implement single-flight to avoid thundering herds.
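For reference, the hit rate implied by the counters above:

```python
# Counters from the INFO stats output above
hits, misses = 12_099_331, 8_429_932
hit_rate = hits / (hits + misses)
print(f"cache hit rate: {hit_rate:.1%}")  # prints "cache hit rate: 58.9%"
```

A 59% hit rate means four requests in ten still pay the full MariaDB cost; for a cache that exists to shed read load, most teams aim much higher on cacheable endpoints.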
Task 15: Measure OS-level pressure on Redis node (swapping is death)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 16Gi 15Gi 120Mi 1.1Gi 900Mi 300Mi
Swap: 2Gi 1.8Gi 200Mi
Meaning: Redis box is swapping. Latency will go nonlinear, then the incident will become “mysterious.”
Decision: stop swapping (tune, add RAM, reduce dataset). If Redis must be reliable, disable swap or set strict memory margins and alerts.
Task 16: Confirm what MariaDB is spending time on (CPU vs I/O)
cr0x@server:~$ sudo iostat -x 1 3
Linux 6.1.0 (db1) 12/30/2025 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
22.10 0.00 5.33 41.20 0.00 31.37
Device r/s w/s rKB/s wKB/s await svctm %util
nvme0n1 820.0 310.0 52160.0 20120.0 18.2 0.9 92.5
Meaning: high iowait and high disk utilization; the DB is I/O-bound.
Decision: caching hot reads in Redis can help, but also consider buffer pool sizing, query/index tuning, and storage improvements. Don’t cache your way out of slow disks if your working set should fit in memory.
Fast diagnosis playbook
You don’t have time for philosophy when the p95 latency graph looks like a ski jump. Here’s the order that finds bottlenecks quickly, with minimal self-deception.
First: is it the database, the cache, or the app?
- Check Redis hit/miss and latency: if misses are high, Redis may be irrelevant; if latency spikes, Redis might be overloaded or swapping.
- Check MariaDB active queries and lock waits: long “Sending data” points to expensive reads; lock waits/deadlocks point to write contention.
- Check app errors/timeouts: cache errors can cascade into DB overload (classic). DB overload can cascade into cache stampedes (also classic).
Second: look for stampede dynamics
- Did a popular key expire across the fleet at the same moment (no jitter)?
- Is there a deploy that changed key naming or TTL defaults?
- Is traffic pattern different (campaign, crawler, bot)?
Third: confirm resource pressure
- Redis memory: near maxmemory? eviction thrash? fragmentation? swapping?
- MariaDB I/O: high iowait? buffer pool misses? slow query log exploding?
- Network: increased RTT between app and Redis/DB can masquerade as “cache is slow.”
Fourth: pick the least risky mitigation
- Serve stale while revalidating for safe content.
- Temporarily increase TTL and add jitter.
- Rate limit recomputes (single-flight) to protect MariaDB.
- For Redis overload: reduce value size, disable expensive commands, scale out, or fail open (DB-only) if safe.
Common mistakes: symptoms → root cause → fix
1) “Cache added, but DB load didn’t drop”
Symptoms: Redis CPU low, misses high, MariaDB QPS unchanged.
Root cause: caching the wrong layer (e.g., caching after personalization), inconsistent cache keys, TTL too low, or most requests are unique.
Fix: cache at a more shared level (fragments), standardize key construction, and measure hit rate per endpoint. Consider summary tables if the query is inherently uncacheable.
2) “Redis is fast until it suddenly isn’t”
Symptoms: intermittent spikes, blocked clients, timeouts.
Root cause: big keys, slow commands, swapping, AOF fsync stalls, or a single-threaded hotspot.
Fix: run --bigkeys, inspect slowlog, remove blocking commands, keep memory headroom, and avoid giant values.
3) “Users see stale data after updates”
Symptoms: profile updates don’t show, prices revert, admin changes delayed.
Root cause: TTL-only invalidation or missing invalidation events; replica lag used for reads; keys not versioned.
Fix: event-driven invalidation for critical entities, versioned keys for hot objects, and read-after-write consistency (sticky primary reads where needed).
4) “We lost data when Redis restarted”
Symptoms: empty leaderboards, missing sessions, vanished counters; app panics.
Root cause: Redis treated as system of record; persistence off or insufficient; no restore plan.
Fix: move durable facts to MariaDB; enable AOF if Redis must persist; test restart behavior and recovery time.
5) “MariaDB became slower after we added ‘cache tables’”
Symptoms: buffer pool hit rate drops, I/O increases, purge lag, replication delay.
Root cause: ephemeral cache data flooding InnoDB, TTL cleanup deletes causing churn.
Fix: move TTL-heavy keys to Redis; if you keep cache tables, partition them, batch cleanup, and separate workloads by instance if necessary.
6) “Random 500s during traffic spikes”
Symptoms: errors correlate with load; Redis “OOM command not allowed” or timeouts.
Root cause: maxmemory reached with noeviction, or eviction policy fights workload; cache stampede forces recomputation.
Fix: set maxmemory + an eviction policy appropriate for caches; add jitter and single-flight; protect MariaDB with circuit breakers.
7) “Cache makes correctness worse than no cache”
Symptoms: inconsistent reads, impossible states, hard-to-reproduce bugs.
Root cause: caching mixed-consistency objects (partly from DB, partly from other services), multi-key updates without atomicity, or using Redis as a queue without acknowledging semantics.
Fix: cache immutable snapshots; version keys; avoid multi-key transactional illusions; if you need durable messaging, use proper queues/streams with clear delivery guarantees.
Three corporate mini-stories (anonymized)
Story 1: Incident caused by a wrong assumption (replicas are “a cache”)
The team ran a large content site with MariaDB primary + replicas. Someone proposed “free performance” by sending all reads to replicas. It worked in staging, because staging didn’t have meaningful write volume. Production did.
They routed the “account” page (recent purchases, address book, subscription status) to replicas. Support tickets started arriving: “I changed my address and it didn’t save.” Engineers checked the primary—data was correct. The UI still showed old data because the replica was lagging under load.
The wrong assumption was subtle: they treated replicas like a cache where staleness is acceptable. But the account page is a read-after-write surface. Humans notice when their address changes back.
The fix was boring and effective: sticky reads. After a successful write, the user’s next N seconds of reads went to primary. They also instrumented replica lag and made the router refuse replica reads when lag exceeded a threshold.
The lesson: replicas can reduce read load, but they introduce consistency behavior you must design for. A cache miss just costs time. A stale read costs trust.
Story 2: Optimization that backfired (write-behind “for speed”)
An e-commerce company had a slow “add to cart” path. Someone noticed that the cart table in MariaDB was a hotspot—lots of tiny updates, lots of contention. They moved carts to Redis and made it write-behind to MariaDB in a background worker.
Latency improved dramatically. Everyone cheered. Then a Redis node rebooted during a routine kernel patch. The background worker queue lagged, then fell behind, then started retrying. Some updates were applied out of order. A subset of carts reverted or duplicated items, depending on the retry pattern.
The outage wasn’t a total meltdown; it was worse: partial corruption. The checkout system had to add defensive checks, and support had to handle frustrated customers who swore they “added the item three times.” They were right.
They rolled back to MariaDB as the system of record for carts, and used Redis as cache-aside for rendering carts, plus a small Redis structure for “cart dirty flags” to reduce recomputation. They also introduced idempotency tokens in write requests so retries could not duplicate operations.
The lesson: write-behind is a correctness tax you pay later with interest. If you can’t articulate your ordering and retry semantics, don’t do it.
Story 3: Boring but correct practice that saved the day (serve stale + versioned keys)
A SaaS product had a dashboard that hit MariaDB with a heavy aggregation query. They built a Redis cache-aside layer. It helped, but a traffic burst would occasionally line up with mass cache expiry and crush the database.
Instead of chasing cleverness, they implemented two boring patterns: versioned keys and “serve stale while revalidating.” Each dashboard card had a soft TTL (serve cached) and a hard TTL (must recompute). A single-flight mechanism ensured only one recompute per key happened at a time.
They also used versioned keys so invalidation became a single atomic increment of a version number, rather than delete storms. Old keys expired naturally.
Weeks later, MariaDB had a brief I/O performance wobble during storage maintenance. The dashboard stayed responsive by serving slightly stale data for a few minutes. Support never heard about it. The team noticed only because the monitoring was good enough to be mildly annoying.
The lesson: the best caching feature is controlled degradation. When things break, your users shouldn’t have to know.
Checklists / step-by-step plan
Step-by-step: choose MariaDB, Redis, or both
- Classify data:
- Transactional facts → MariaDB.
- Ephemeral state → Redis (with TTL).
- Derived read models → MariaDB tables or Redis depending on queryability needs.
- Pick the caching pattern:
- Default: cache-aside with TTL + jitter.
- High correctness & high read: event-driven invalidation or versioned keys.
- Avoid write-behind unless losses are acceptable and semantics are proven.
- Decide on staleness budget: per endpoint, explicitly. “A few seconds” is not a plan; write it down.
- Design stampede protection: single-flight, stale-while-revalidate, and/or early refresh.
- Capacity plan Redis: maxmemory, eviction policy, headroom, key TTL coverage, value sizing.
- Keep MariaDB healthy: index review, query review, buffer pool sizing, I/O monitoring, replication checks.
- Define failure behavior:
- If Redis is down, do you fail open (DB-only) or fail closed?
- If DB is slow, do you serve stale or return errors?
- Instrument the basics: cache hit rate, p95 latency for Redis and DB, replica lag, Redis memory, DB slow queries.
- Run a game day: restart Redis, simulate eviction, inject replica lag. Make sure the website degrades the way you intended.
Checklist: safe Redis configuration for caching
- Set maxmemory and a deliberate maxmemory-policy.
- Ensure most cache keys have TTL.
- Alert on used_memory/maxmemory, blocked_clients, and evicted_keys.
- Avoid blocking commands in production (KEYS, large SORT, huge HGETALL on giant hashes).
- Keep memory headroom to avoid fragmentation/eviction storms.
Checklist: safe MariaDB usage alongside Redis
- Make read-after-write endpoints explicit; don’t serve them from laggy replicas.
- Use summary tables for heavy aggregations that need filtering and indexing.
- Do not implement rate limits or hot counters as row updates on the primary.
- Monitor deadlocks and lock waits; they are telling you where to move coordination workloads to Redis.
FAQ
1) Should Redis ever be the system of record?
Only if you are prepared to run it like a primary datastore: persistence, replication strategy, backups, restore testing, and strict capacity management. For most websites: no. Keep transactional truth in MariaDB.
2) If MariaDB already caches in the buffer pool, why use Redis?
MariaDB caches pages, not your application’s computed results. Redis helps when the expensive work is joins/aggregations, serialization, template rendering, permission checks, or coordination primitives.
3) What’s the safest default caching pattern?
Cache-aside with TTL + jitter, plus single-flight or stale-while-revalidate for hot keys. It keeps MariaDB authoritative and makes cache loss survivable.
4) How do I prevent stale data without deleting a million keys?
Use versioned keys. Bump a version number on writes, read from the new namespace, let old keys expire. It scales better than delete storms.
5) Is write-through better than cache-aside?
Write-through can reduce staleness but increases write-path complexity and coupling to Redis availability. Use it when you need fresh cached reads and can degrade safely if Redis is down.
6) Why is write-behind so risky?
Because you acknowledge success before data is durable in MariaDB. Crashes, retries, and reordering become correctness bugs. Use it only for non-critical, aggregatable data with idempotent updates.
7) What eviction policy should I use for Redis caching?
If everything is a cache: allkeys-lfu is a strong default. If only TTL keys should be evicted: volatile-lfu. Avoid noeviction unless you want hard failures when full.
8) How do I know if Redis is helping?
Measure hit rate per endpoint and compare MariaDB QPS/latency before and after. If misses stay high or DB load doesn’t drop, you’re caching the wrong thing or keying incorrectly.
9) Can I store sessions in MariaDB instead of Redis?
You can, but it often creates write contention and cleanup churn. Redis with TTL is usually a better fit. If sessions are truly critical, design re-authentication and persistence expectations explicitly.
10) What about caching rendered HTML pages?
Great when personalization is limited. Cache whole pages or fragments in Redis, and invalidate on content changes via events or versioning. Don’t cache per-user pages unless you’ve done the math on key cardinality.
Next steps you can actually do this week
- Pick three endpoints with the worst DB cost. Measure their query patterns and decide what to cache: objects, fragments, or summary tables.
- Add cache-aside with TTL + jitter for one endpoint, and instrument hit rate, p95 latency, and DB QPS impact.
- Implement stampede protection (single-flight or stale-while-revalidate) for one hot key path.
- Set Redis maxmemory and eviction policy deliberately, and alert when memory headroom shrinks.
- Audit “no data loss”: ensure transactional facts are in MariaDB with backups and restore tests; ensure Redis holds only what you can lose or what you can persist correctly.
- Run a restart drill: restart Redis during business hours in a controlled way, watch behavior, and fix what breaks before it breaks for real.
If you do nothing else: stop treating “cache” as a magic performance sticker. Treat it as a second system with its own failure modes—and design for the failures first.