Docker Redis persistence without turning it into a slow disk app

You turned on Redis persistence in a container and suddenly your “in-memory” database started acting like it’s hauling rocks uphill.
Latency spikes, throughput falls off a cliff, and the app team swears nothing changed—except the part where you asked Redis to not forget everything on restart.

This is the Redis durability tax. The good news: you can pay it in a sane currency (predictable I/O patterns and bounded stalls), instead of handing your SSD a blank check.
The bad news: Docker makes it easy to choose the wrong storage path and then blame Redis for being “slow.”

Facts and short history that actually matter

Redis persistence debates can get religious. Let’s keep it concrete: here are a few context points that explain today’s knobs and their sharp edges.

  1. Redis started as an in-memory server with disk as an optional safety net. Persistence is layered on, not fundamental—so you have to align it with your workload.
  2. RDB snapshots came first and are intentionally “chunky.” They trade frequent small writes for occasional heavy writes and a fork-based copy-on-write cost.
  3. AOF (Append Only File) was added to reduce data loss windows. It logs write commands; replay builds the dataset on restart. Great for durability; easy to misuse.
  4. appendfsync everysec exists because “always” is expensive. It aims for a one-second durability window while smoothing write amplification.
  5. Redis rewrites AOF to avoid it growing forever. That rewrite is a classic “background task that still bites” because it competes for I/O and memory.
  6. Fork + copy-on-write means persistence can increase memory usage. During snapshot or rewrite, changed pages get duplicated. In containers, that’s how you meet the OOM killer.
  7. Docker’s copy-on-write filesystem layers are not magical. Put write-heavy files on overlay2 and you’ll benchmark “filesystem metadata gymnastics,” not Redis.
  8. fsync behavior is a kernel + storage contract, not a Redis promise. If your storage lies about flushes or your hypervisor cheats, your “durable” writes become theater.
  9. Redis persistence isn’t a backup strategy. It’s a crash recovery mechanism. Backups require separate retention, off-host copies, and restore testing.

One quote to keep you honest, paraphrasing Charity Majors: everything fails; the difference is whether you can quickly understand and recover.

Persistence model: what Redis really writes and when it hurts

RDB snapshots: big writes, long shadows

RDB persistence writes a point-in-time snapshot to disk. It’s efficient when you accept some loss window and want compact files and fast restarts.
The cost shows up in two places: CPU/memory during fork and disk I/O during the snapshot write.

The fork itself is usually cheap, until your dataset is large and your memory is fragmented. Then fork stalls can show up as latency blips.
The real sneaky cost is copy-on-write: while Redis is writing the snapshot in a child process, the parent continues serving requests. Every page modified becomes duplicated.
That’s extra memory pressure and extra work.

In Docker, memory pressure is less forgiving. The container doesn’t care that you “only have 60% used”—it cares that you crossed the limit now.
With RDB enabled, your peak memory is not your steady-state dataset size. Plan for it.
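
To see why "plan for it" matters, here is a back-of-envelope worst-case estimate. All numbers and the function itself are illustrative assumptions, not measurements from any real deployment:

```python
def peak_memory_during_bgsave(dataset_bytes, dirty_pages_per_sec, page_size, snapshot_secs):
    """Rough upper bound on memory during an RDB snapshot.

    Every page the parent modifies while the child writes the snapshot
    gets duplicated (copy-on-write), so the worst case is roughly
    dataset + (write rate * snapshot duration).
    """
    cow_bytes = dirty_pages_per_sec * page_size * snapshot_secs
    # COW can never duplicate more than the whole dataset
    cow_bytes = min(cow_bytes, dataset_bytes)
    return dataset_bytes + cow_bytes

# Illustrative: 6 GiB dataset, 50k dirty pages/s, 4 KiB pages, 30 s snapshot
GiB = 1024 ** 3
peak = peak_memory_during_bgsave(6 * GiB, 50_000, 4096, 30)
print(round(peak / GiB, 2))  # → 11.72
```

The point isn't the exact number; it's that a busy writer can nearly double peak memory during a snapshot, and your container limit has to absorb that.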

AOF: lots of writes, plus periodic “diet weeks”

AOF logs operations. That means frequent appends, plus fsync depending on configuration.
If you set appendfsync always, Redis will call fsync on every write. That’s maximum durability and maximum opportunity to hate your storage subsystem.

The usual production posture is appendfsync everysec, which is also the Redis default once AOF is enabled. Redis appends to the OS page cache and asks the kernel to flush once per second.
If the kernel can’t keep up, Redis can still stall because the buffer fills, or because the storage backend turns fsync into a blocking event with tail latency.

Then there’s AOF rewrite. Over time, the AOF includes redundant commands. Redis rewrites it to a compact form.
Rewrites are background, but they are not “free.” They allocate, they scan, they write large sequential files, and they can collide with your application’s write path.

The durability knobs that decide your latency budget

  • appendfsync always: lowest loss window, highest latency sensitivity. Use only if you truly need it and your storage is proven.
  • appendfsync everysec: typical production trade-off. One-second loss window; can still spike under I/O contention.
  • appendfsync no: kernel decides when to flush. It’s “fast until you reboot,” and sometimes “not fast either.”

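In redis.conf, that whole decision is a single directive (shown here with the everysec compromise; treat it as a starting point):

```conf
appendonly yes
# always   -> fsync on every write: smallest loss window, highest latency
# everysec -> fsync once per second: the usual production compromise
# no       -> kernel decides when to flush: fastest, loss window unbounded
appendfsync everysec
```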
If you enable appendfsync always on cheap network storage, you’ve invented a new database: Redis But Slower.

What Docker changes (and what it doesn’t)

Redis in a container is still Redis. Same persistence code, same fork model, same fsync calls.
What changes is the path to disk: overlay filesystems, volume plugins, cgroup limits, and the habit of treating containers as disposable pets while quietly expecting them to remember things.

Docker storage: overlay2, bind mounts, named volumes, and why you should care

Do not put Redis persistence on the container writable layer

The container writable layer (overlay2 on most Linux installs) is fine for logs and small state. Redis AOF is not “small state.”
Overlay2 adds copy-on-write semantics and additional metadata operations. Under write-heavy patterns, it can add latency and amplify writes.

If you remember only one rule: Redis persistence files should live on a real mount—a named volume or bind mount backed by a filesystem you understand.
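
A minimal Compose sketch of that rule (image tag, service, and volume names are placeholders, not a prescribed setup):

```yaml
services:
  redis:
    image: redis:7
    command: ["redis-server", "--appendonly", "yes"]
    volumes:
      - redis-data:/data   # persistence lives on a named volume, not overlay2

volumes:
  redis-data:
```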

Bind mount vs named volume

A bind mount points to a host path. It’s transparent and easy to inspect. It also inherits host mount options, which is both power and risk.
A named volume is managed by Docker under /var/lib/docker/volumes. It’s cleaner operationally and tends to avoid “oops wrong permissions” incidents.

Performance-wise, both can be excellent. The deciding factor is usually governance: who manages the path, backups, and mount options.

Filesystem and mount options: you’re not choosing a religion, you’re choosing failure modes

For Redis persistence, you usually want:

  • Fast, consistent fsync: local SSD/NVMe beats networked storage for tail latency.
  • Stable write throughput: AOF rewrite is a sequential writer; RDB is a big sequential writer too.
  • Predictable latency under pressure: not “great on average, tragic at p99.9.”

Ext4 with sane defaults is boring and good. XFS is also solid. If you run ZFS, you can get excellent results, but only if you know how sync writes are handled.
If you don’t know, you’ll learn at 3 a.m.

Storage that lies about flushes

Some layers acknowledge flushes early. Some RAID controllers cache without battery. Some cloud volumes have a flush model that is “eventually consistent in spirit.”
Redis will call fsync. If the stack cheats, your AOF can be “successfully written” and still vanish after a power event.

This is why durability requirements must include the storage platform, not just Redis config.

What to run in production: opinionated persistence recipes

Recipe A: “I need fast cache with restart survival” (common)

Use RDB snapshots, accept a loss window (minutes), and keep Redis fast.
If your source of truth is elsewhere (database, event log), Redis is a cache with benefits.

  • Enable RDB with reasonable save points.
  • Disable AOF.
  • Put dump.rdb on a host volume.
  • Monitor snapshot duration and fork time.
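
A minimal redis.conf sketch for this recipe (the save points are illustrative; tune them to your acceptable loss window):

```conf
# Recipe A: RDB only, minutes-scale loss window
save 900 1
save 300 100
save 60 10000
appendonly no
dir /data
dbfilename dump.rdb
```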

Recipe B: “I can lose 1 second, not 10 minutes” (most stateful uses)

Use AOF with appendfsync everysec.
Keep rewrite behavior predictable. Keep your disk path clean. Don’t run this on overlay2 and then file an issue titled “Redis is slow.”

  • Enable AOF.
  • appendfsync everysec.
  • Keep RDB either disabled or as an additional periodic safety snapshot (depends on restore strategy).
  • Make sure you have memory headroom for rewrite.

Recipe C: “I really mean durable” (rare, and expensive)

If you require “no acknowledged write is lost,” Redis alone is not the whole answer. Still, if you insist:

  • Consider AOF always only if your storage is proven for sync writes.
  • Use replication, and have clients issue the WAIT command to require replica acknowledgments before treating a write as safe (WAIT strengthens guarantees; it does not make them absolute).
  • Test crash scenarios, not just benchmarks.

“Always fsync” is like wearing a helmet in the office: safer, but you’ll still trip over the carpet.

Settings I actually like (and why)

For many production workloads:

  • appendonly yes
  • appendfsync everysec
  • no-appendfsync-on-rewrite yes to reduce rewrite-induced stalls (accepting a slightly larger loss window during rewrite)
  • auto-aof-rewrite-percentage and auto-aof-rewrite-min-size set so rewrites happen before the file becomes absurd
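
Put together, that posture looks like this redis.conf fragment (the thresholds shown are Redis's documented defaults; adjust to your file sizes):

```conf
appendonly yes
appendfsync everysec
# trade a larger loss window during rewrite for fewer stalls
no-appendfsync-on-rewrite yes
# rewrite when the AOF doubles, but never for tiny files
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
```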

The controversial one is no-appendfsync-on-rewrite. If you can’t tolerate additional loss risk during rewrite, don’t enable it.
If you can, it often buys you latency stability.

Container-specific advice that avoids slow disk Redis

  • Mount persistence files on a volume/bind mount, not container layer.
  • Cap memory appropriately, but leave headroom for fork/rewrite copy-on-write.
  • Give the container enough CPU headroom that persistence work (fsync, rewrite) isn’t starved by throttling.
  • Don’t mix data path with noisy neighbors (same disk doing logs, image pulls, and Redis fsync).

Hands-on tasks: commands, outputs, and the decision you make

These are the checks I run when someone says “Redis is slow after we enabled persistence.” Each task includes what the output means and what you decide next.
Run them on the Docker host unless stated otherwise.

Task 1: Verify Redis persistence mode from inside the container

cr0x@server:~$ docker exec -it redis redis-cli CONFIG GET appendonly appendfsync save
1) "appendonly"
2) "yes"
3) "appendfsync"
4) "everysec"
5) "save"
6) "900 1 300 10 60 10000"

What it means: AOF is enabled with fsync every second; RDB snapshots are also enabled.

Decision: If you don’t need both, disable one. Running both increases I/O and fork/rewrite events. Prefer one primary persistence method.

Task 2: Confirm where Redis is writing data (and whether it’s on overlay2)

cr0x@server:~$ docker exec -it redis redis-cli CONFIG GET dir dbfilename appendfilename
1) "dir"
2) "/data"
3) "dbfilename"
4) "dump.rdb"
5) "appendfilename"
6) "appendonly.aof"

What it means: Redis writes to /data inside the container.

Decision: Check if /data is a volume/bind mount. If it’s just the container filesystem, fix it now.

Task 3: Inspect Docker mounts for the container

cr0x@server:~$ docker inspect redis --format '{{json .Mounts}}'
[{"Type":"volume","Name":"redis-data","Source":"/var/lib/docker/volumes/redis-data/_data","Destination":"/data","Driver":"local","Mode":"z","RW":true,"Propagation":""}]

What it means: Good: /data is a Docker volume mapped to a host path.

Decision: If Type is missing or Destination isn’t /data, you’re probably writing to overlay2. Move persistence to a mount.

Task 4: Identify the filesystem backing the data directory

cr0x@server:~$ df -Th /var/lib/docker/volumes/redis-data/_data
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/nvme0n1p2 ext4  450G  120G  307G  29% /

What it means: Data lives on ext4 on local NVMe-backed root filesystem.

Decision: If this shows a network filesystem type (like nfs) or a slow shared disk, expect fsync pain. Consider moving Redis data to local SSD.

Task 5: Check mount options that affect durability and latency

cr0x@server:~$ findmnt -no TARGET,FSTYPE,OPTIONS -T /var/lib/docker/volumes/redis-data/_data
/ ext4 rw,relatime,errors=remount-ro

What it means: Standard ext4 options. Nothing obviously dangerous like data=writeback here.

Decision: If you see exotic options you don’t understand on the Redis data mount, stop and validate. “Unknown” is not a mount option; it’s a future incident.

Task 6: Measure Redis persistence-related latency from Redis itself

cr0x@server:~$ docker exec -it redis redis-cli INFO persistence | egrep 'aof_enabled|aof_last_write_status|aof_last_fsync_status|aof_delayed_fsync|rdb_last_bgsave_status|rdb_last_bgsave_time_sec'
aof_enabled:1
aof_last_write_status:ok
aof_last_fsync_status:ok
aof_delayed_fsync:37
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:4

What it means: aof_delayed_fsync indicates fsync operations that took longer than expected (kernel couldn’t flush quickly).

Decision: If aof_delayed_fsync climbs during spikes, focus on storage latency and contention, not Redis CPU.
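
If you want to automate that check, a small parser over the INFO text is enough. A hedged sketch (parse_info and fsync_regressed are hypothetical helper names; the field names match the INFO persistence output above):

```python
def parse_info(text):
    """Parse 'key:value' lines from redis-cli INFO output into a dict."""
    out = {}
    for line in text.splitlines():
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            out[key] = value.strip()
    return out

def fsync_regressed(prev_info, curr_info):
    """True if aof_delayed_fsync grew between two samples:
    that points at storage latency, not Redis CPU."""
    prev = int(prev_info.get("aof_delayed_fsync", 0))
    curr = int(curr_info.get("aof_delayed_fsync", 0))
    return curr > prev

before = parse_info("aof_enabled:1\naof_delayed_fsync:37")
after = parse_info("aof_enabled:1\naof_delayed_fsync:44")
print(fsync_regressed(before, after))  # → True
```

Sample two readings a minute apart during a latency spike; if the counter moved, go look at the disk.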

Task 7: Check if AOF rewrite is happening and causing stalls

cr0x@server:~$ docker exec -it redis redis-cli INFO persistence | egrep 'aof_rewrite_in_progress|aof_current_size|aof_base_size|aof_pending_rewrite|aof_current_rewrite_time_sec'
aof_rewrite_in_progress:0
aof_current_size:2147483648
aof_base_size:1073741824
aof_pending_rewrite:0
aof_current_rewrite_time_sec:-1

What it means: No rewrite right now; AOF has doubled since base. A rewrite will likely trigger soon depending on thresholds.

Decision: If rewrites coincide with latency spikes, tune rewrite thresholds and confirm I/O headroom. Consider scheduling maintenance windows only if you can’t make it stable.
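
The trigger condition is simple enough to sanity-check offline. A sketch of the documented auto-rewrite rule (function name is mine; the defaults mirror auto-aof-rewrite-percentage 100 and auto-aof-rewrite-min-size 64mb):

```python
def rewrite_due(current_size, base_size, pct=100, min_size=64 * 1024 * 1024):
    """Fire when the AOF has grown pct% over its size after the last
    rewrite, but never while the file is below min_size."""
    if current_size < min_size:
        return False
    growth_pct = (current_size - base_size) * 100 // max(base_size, 1)
    return growth_pct >= pct

# Numbers from the task above: 2 GiB current vs 1 GiB base -> 100% growth
print(rewrite_due(2147483648, 1073741824))  # → True
```

With those sizes and default thresholds, the next write that grows the file will tip it over: that rewrite is imminent, not hypothetical.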

Task 8: Confirm the container isn’t being throttled on CPU (fsync thread starvation is real)

cr0x@server:~$ docker inspect redis --format '{{.HostConfig.NanoCpus}} {{.HostConfig.CpuQuota}} {{.HostConfig.CpuPeriod}}'
0 50000 100000

What it means: CPU quota is 50% of a core (50,000/100,000). Redis under load with persistence can suffer if it’s CPU-starved.

Decision: If latency spikes correlate with CPU throttling, raise CPU quota or remove it. Don’t try to outsmart physics with half a core and sync writes.
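
If the quota/period pair confuses anyone on the team, the math is just division. A tiny helper (name is mine, not Docker's):

```python
def cfs_quota_to_cores(quota_us, period_us):
    """Translate cgroup CFS quota/period (as shown by docker inspect)
    into an effective core count; quota <= 0 means unlimited."""
    if quota_us <= 0:
        return None  # no CPU limit applied
    return quota_us / period_us

print(cfs_quota_to_cores(50000, 100000))  # → 0.5
```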

Task 9: Check memory headroom for fork/rewrite

cr0x@server:~$ docker exec -it redis redis-cli INFO memory | egrep 'used_memory_human|maxmemory_human|mem_fragmentation_ratio'
used_memory_human:7.82G
maxmemory_human:8.00G
mem_fragmentation_ratio:1.41

What it means: You’re basically at the limit. Fork/rewrite will push you into OOM territory, especially with fragmentation at 1.41.

Decision: Increase maxmemory headroom (or reduce dataset), or accept that persistence events may kill the process. Containers don’t negotiate.

Task 10: Observe actual disk latency on the host (quick and dirty)

cr0x@server:~$ iostat -x 1 5
Linux 6.2.0 (server) 	01/03/2026 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.11    0.00    6.05    8.90    0.00   72.94

Device            r/s     rkB/s   rrqm/s  %rrqm  r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm  w_await wareq-sz  aqu-sz  %util
nvme0n1         12.00   1400.00     0.00   0.00    1.10   116.67  950.00  18000.00  120.00  11.20   22.30    18.95   21.40  98.00

What it means: The disk is pegged (%util near 100), and write await is high. Redis fsync is going to feel this.

Decision: Find contention: other workloads, log storms, image pulls, backups, or noisy neighbors on the same device. Move Redis data or isolate the disk.

Task 11: Check if the host is writeback-throttling dirty pages

cr0x@server:~$ sysctl vm.dirty_background_ratio vm.dirty_ratio
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20

What it means: Defaults-ish. If dirty ratios are too low, the kernel may force synchronous writeback more often; too high can cause big writeback storms.

Decision: If you see periodic multi-second stalls that line up with writeback, tune these carefully and test. Don’t “optimize” by cargo culting sysctl values.
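
To reason about what those percentages mean in bytes, rough math on an assumed 32 GiB host helps (the kernel actually computes against available memory, so treat these as upper bounds):

```python
def dirty_thresholds(ram_bytes, background_ratio=10, dirty_ratio=20):
    """Approximate dirty-page thresholds: background writeback starts
    around background_ratio% of memory; writers start blocking around
    dirty_ratio%. Upper bound, since the kernel uses available memory."""
    return (ram_bytes * background_ratio // 100,
            ram_bytes * dirty_ratio // 100)

GiB = 1024 ** 3
bg, hard = dirty_thresholds(32 * GiB)
print(bg // GiB, hard // GiB)  # → 3 6
```

Several gigabytes of dirty pages flushed in a burst is exactly the kind of event that lines up with multi-second fsync stalls.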

Task 12: Confirm Redis is actually persisting to disk (files and sizes)

cr0x@server:~$ ls -lh /var/lib/docker/volumes/redis-data/_data
total 3.1G
-rw-r--r-- 1 redis redis 2.0G Jan  3 10:12 appendonly.aof
-rw-r--r-- 1 redis redis 1.1G Jan  3 10:10 dump.rdb

What it means: Both AOF and RDB exist and are sizable.

Decision: Decide whether you need both. If not, disable one and reclaim I/O budget. If you keep both, plan capacity and test restart times.

Task 13: Check container logs for persistence warnings

cr0x@server:~$ docker logs --tail 200 redis | egrep -i 'AOF|fsync|rewrite|RDB|fork|latency|WARNING'
1:M 03 Jan 2026 10:12:05.101 # Background append only file rewriting started by pid 42
1:M 03 Jan 2026 10:12:12.220 # AOF rewrite child asks to stop sending diffs.
1:M 03 Jan 2026 10:12:12.221 # Parent agreed to stop sending diffs. Finalizing...
1:M 03 Jan 2026 10:12:12.980 # Background AOF rewrite finished successfully
1:M 03 Jan 2026 10:12:13.005 # Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

What it means: Redis is explicitly telling you disk is busy and fsync is slow.

Decision: Treat this as a storage incident, not an application bug. Go to host-level I/O, contention, and mount path.

Task 14: Measure Redis command latency distribution (don’t guess)

cr0x@server:~$ docker exec -it redis redis-cli --latency-history -i 1
min: 0, max: 87, avg: 2.41 (320 samples) -- 1.00 seconds range

What it means: You’re seeing occasional 80+ ms spikes. That’s consistent with fsync stalls or fork/rewrite pauses.

Decision: If spikes align with AOF rewrite or disk saturation, tune persistence and fix storage. If spikes align with CPU throttling, fix CPU limits.

Task 15: Verify that persistence survives container restart (the only test that matters)

cr0x@server:~$ docker exec -it redis redis-cli SET durability_test 123
OK
cr0x@server:~$ docker restart redis
redis
cr0x@server:~$ docker exec -it redis redis-cli GET durability_test
"123"

What it means: Your persistence path and config survive a normal restart.

Decision: If this fails, you’re not persisting where you think you are, or your config isn’t applied. Fix before arguing about tuning.

Task 16: Check whether you’re accidentally using overlay2 for persistence anyway

cr0x@server:~$ docker exec -it redis sh -lc 'mount | egrep " /data |overlay"'
/dev/nvme0n1p2 on /data type ext4 (rw,relatime,errors=remount-ro)
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/...,upperdir=/var/lib/docker/overlay2/.../diff,workdir=/var/lib/docker/overlay2/.../work)

What it means: Great: /data is a real ext4 mount, separate from the overlay root.

Decision: If /data shows as overlay, you’ve found your “why is it slow?” answer. Move it to a volume/bind mount.

Fast diagnosis playbook

When Redis latency goes bad after enabling persistence, you want to identify the bottleneck in minutes, not after three meetings and a spreadsheet.
Check these in order.

First: are we writing to the wrong place?

  • Confirm dir and file names: redis-cli CONFIG GET dir dbfilename appendfilename
  • Confirm a mount exists: docker inspect ...Mounts and mount | grep /data
  • If persistence files are on overlay2, stop. Move them to a volume/bind mount, then re-test.

Second: is storage latency the culprit?

  • Look for Redis warnings about fsync taking too long in logs.
  • Check aof_delayed_fsync and if it increases during spikes.
  • Run iostat -x and look at w_await, aqu-sz, and %util.
  • If disk is saturated, find the noisy neighbor: backups, log ingestion, image pulls, or another database on the same device.

Third: is it memory/fork/rewrite pressure?

  • Check memory headroom: INFO memory and container memory limits.
  • Check whether AOF rewrite or RDB bgsave correlates with spikes.
  • Look for OOM kills in host logs if Redis disappears during persistence events.

Fourth: is CPU throttling making everything worse?

  • Inspect CPU quotas and limits.
  • Correlate latency spikes with CPU throttling metrics (cgroup stats if you have them).
  • Redis persistence is not free; don’t run it like a background toy process and expect deterministic latency.

Three corporate mini-stories from the durability trenches

1) Incident caused by a wrong assumption: “Docker volumes are always persistent”

A mid-sized product team containerized Redis for a session service. They used Docker Compose, set appendonly yes, and felt responsible.
The deployment “worked” for weeks. Then they did a host OS upgrade and rebuilt the stack.

After the maintenance window, sessions were gone. The application recovered poorly because it assumed sessions existed and tried to validate them against missing keys.
Their dashboards showed Redis running, CPU fine, memory fine. But users were effectively logged out.

The root cause wasn’t Redis. It was the assumption that data persisted because they enabled persistence.
They had no volume mount. /data lived in the container writable layer. When they replaced the container, they replaced the filesystem.

The fix was boring: a named volume mounted to /data, plus a restart test in CI that writes a key, restarts the container, and checks it’s still there.
They also documented what “persistence” means in their environment: persistence to a specific host path, not “I set a flag once.”

2) Optimization that backfired: “Put AOF on network storage so it survives the node”

Another company wanted node failure resilience without using Redis replication (long story, mostly org chart).
Someone suggested mounting Redis /data on a network volume so any node could pick it up. It sounded elegant in a slide deck.

In testing, throughput looked fine. In production, p99 latency spiked during peak hours.
The app team saw occasional request timeouts and blamed Redis. The platform team saw Redis CPU at 20% and said “it can’t be Redis.”
The SRE on call saw the AOF fsync warnings and got unpopular fast.

The problem: network-attached storage with inconsistent fsync latency.
The “everysec” flush still needs the backend to complete durable writes regularly; when the storage hit contention, fsync blocked longer.
Redis did the right thing—waited—while the app melted down.

The resolution was to stop pretending that remote persistence is free. They moved AOF back to local SSD, added replication for resilience,
and kept network storage for periodic RDB backups copied out-of-band. Latency stabilized immediately.

3) Boring but correct practice that saved the day: “Capacity and headroom for rewrite”

A finance-adjacent service used Redis as a fast state store. They enabled AOF with everysec and had a tidy SLO.
The team had one unglamorous habit: they budgeted memory and disk for worst-case persistence events, not average usage.

They tracked three numbers weekly: dataset size, AOF size, and peak memory during rewrite. If the AOF grew too quickly, they adjusted rewrite thresholds.
If memory fragmentation rose, they planned a controlled restart window (with replicas) rather than waiting for random OOM.

One day a traffic pattern changed—more writes, more churn. AOF grew, rewrites became more frequent.
On a less disciplined system, this is where you get a rewrite storm and latency spikes that look like a DDoS from inside the building.

Their system didn’t flinch. They had headroom. Rewrites completed before disks saturated, and fork-induced memory spikes stayed under limits.
The incident report was short and deeply unexciting, which is the highest compliment in operations.

Common mistakes: symptom → root cause → fix

1) Redis is fast without persistence, slow with AOF

Symptom: Enabling AOF makes p99 latency jump and throughput drop.

Root cause: fsync latency and storage contention; often worsened by network volumes or saturated disks.

Fix: Use local SSD-backed volume, keep appendfsync everysec, isolate Redis data from noisy I/O, and watch aof_delayed_fsync.

2) Data disappears after container recreation

Symptom: Restarting the container sometimes keeps data; redeploying loses it.

Root cause: Persistence files stored in the container writable layer (overlay2), not a volume/bind mount.

Fix: Mount /data to a Docker volume or bind mount; verify with docker inspect and a restart test.

3) Periodic latency spikes every few minutes

Symptom: Spikes correlate with background tasks.

Root cause: RDB snapshots or AOF rewrite causing fork copy-on-write, disk bursts, and cache pressure.

Fix: Adjust snapshot schedule; tune AOF rewrite thresholds; ensure memory headroom; consider disabling RDB if AOF is primary.

4) Redis gets OOM-killed during rewrite or snapshot

Symptom: Container dies, restarts, and logs show persistence activity around the event.

Root cause: Fork + copy-on-write increases memory temporarily; container memory limit too tight; fragmentation high.

Fix: Increase memory limit/headroom; reduce dataset; consider maxmemory lower than container limit; watch fragmentation ratio.

5) AOF file grows forever and restarts are slow

Symptom: Restart takes a long time; AOF is huge.

Root cause: AOF rewrite thresholds too conservative or rewrite is failing due to disk space.

Fix: Set sane auto-aof-rewrite-percentage/min-size; ensure free disk; check logs for rewrite failures.

6) Redis is “durable” but still loses data on host crash

Symptom: After a power loss, AOF has gaps even though fsync was configured.

Root cause: Storage stack lied about flushes, volatile write cache, or virtualization layer behavior.

Fix: Use storage with verified flush semantics; avoid unsafe RAID caching; validate failure scenarios; consider replication and WAIT for stronger guarantees.

7) Turning on no-appendfsync-on-rewrite “fixed” latency but increased data loss

Symptom: Latency smooths out, but after crash during rewrite, more data missing than expected.

Root cause: You explicitly accepted a larger durability window during rewrite.

Fix: If loss window is unacceptable, disable it and instead fix the disk path and I/O contention; or reduce rewrite frequency.

Checklists / step-by-step plan

Step-by-step: get persistence right in Docker without killing performance

  1. Decide your loss window.
    Minutes? Use RDB. Around a second? Use AOF everysec. “None”? Prepare for replication and serious storage.
  2. Mount /data on a real volume.
    Named volume is fine. Bind mount is fine. Overlay2 is not fine for this.
  3. Pick one primary persistence method.
    Running both is valid but costs I/O and memory. If you keep both, do it intentionally.
  4. Budget memory for fork/rewrite.
    Leave headroom. If you cap the container at “dataset size plus a sandwich,” the sandwich will be removed at runtime.
  5. Confirm storage latency behavior with real checks.
    Watch fsync warnings and aof_delayed_fsync. Don’t rely on “it’s SSD” as a performance metric.
  6. Set rewrite thresholds to avoid surprise rewrites.
    You want rewrites to be regular and boring, not rare and catastrophic.
  7. Test: write key → restart → read key.
    Put it in CI or a pre-deploy hook. If this test fails, everything else is theater.
  8. Monitor the right signals.
    Redis latency, persistence stats, disk latency, disk utilization, memory fragmentation, and container OOM events.

Checklist: when you change persistence settings

  • Confirm config applied (CONFIG GET).
  • Confirm data path is a mount (docker inspect, mount).
  • Confirm disk has free space for rewrite/snapshot.
  • Run a load test that includes writes and measures p95/p99 latency.
  • Simulate restart and verify key survival.
  • Record new baseline: aof_delayed_fsync, snapshot duration, rewrite duration.

Checklist: storage sanity for Redis persistence

  • Local SSD/NVMe preferred for AOF.
  • Avoid sharing the device with heavy log writes and backups.
  • Validate flush semantics if you claim durability.
  • Know your filesystem type and mount options.

FAQ

1) Should I enable both RDB and AOF?

Only if you have a specific restore strategy that benefits from both. AOF gives better durability granularity; RDB gives compact snapshots and sometimes faster cold start.
Running both increases background work and I/O. If you don’t know why you want both, you probably don’t.

2) Why did Redis get slow after enabling AOF?

AOF adds write amplification and flush behavior. Even with everysec, Redis depends on the storage path to complete regular flushes.
Slow fsync, saturated disks, or network volumes with inconsistent latency will show up directly as request latency.

3) Is Docker overlay2 really that bad for Redis persistence?

Overlay2 is fine for container images and small writes. Redis persistence is write-heavy and latency-sensitive.
Overlay2 adds copy-on-write and metadata overhead. You might get away with it at low volume, until you don’t. Put persistence on a mount.

4) What’s the best appendfsync setting?

For most production: everysec. It’s the pragmatic balance between performance and durability.
Use always only when you’ve validated storage and you truly need it. Use no when Redis is disposable cache and you accept more loss.

5) Can I use tmpfs for AOF to make it fast?

You can, but it defeats the purpose of persistence: if the host loses power, tmpfs forgets everything with enthusiasm.
tmpfs can be useful for ephemeral caches or as a staging area combined with a separate replication/backup strategy, but don’t call it durability.

6) My AOF rewrite causes latency spikes. What do I tune first?

First ensure the data is on a fast, isolated disk path. Then tune rewrite thresholds to avoid frequent rewrites.
If the loss window allows it, no-appendfsync-on-rewrite yes can reduce stalls, but it’s a conscious durability trade.

7) How do I know if I’m losing data due to the one-second window?

With everysec, you can lose up to about one second of acknowledged writes on crash, sometimes more if the system is under heavy I/O pressure.
Watch aof_delayed_fsync and storage metrics; if flushes are delayed, your effective loss window grows.

8) Is Redis persistence a backup?

No. Persistence is crash recovery for that node. Backups require separate copies, retention policies, and restore verification.
Treat persistence as “I can restart without reloading everything from upstream,” not “I have archival safety.”

9) What about running Redis on Kubernetes—same story?

Same physics, more abstraction. You still need a persistent volume with known performance, you still need memory headroom for fork/rewrite,
and you still need to test restart behavior. Kubernetes makes it easier to move pods; it does not make fsync faster.

10) How much disk do I need for AOF?

Enough for the current AOF, plus rewrite headroom (since rewrite writes a new file before swapping), plus some safety margin.
If you’re tight on disk, rewrites fail, AOF grows, restarts slow, and you end up debugging disk space instead of serving traffic.
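
A back-of-envelope sizing helper (the 2x-plus-margin rule is a conservative assumption on my part, not a Redis guarantee):

```python
def aof_disk_needed(aof_bytes, margin=0.3):
    """During a rewrite the old AOF and the new (usually smaller) AOF
    coexist, so budget roughly 2x the current file plus a safety margin."""
    return int(aof_bytes * 2 * (1 + margin))

GiB = 1024 ** 3
print(round(aof_disk_needed(2 * GiB) / GiB, 1))  # → 5.2
```

So a 2 GiB AOF wants roughly 5 GiB of free space on its volume before you can call rewrites safe.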

Practical next steps

If you’re running Redis in Docker and you want persistence without turning it into a slow disk app, do these in order:

  1. Confirm persistence files are on a real mount (volume or bind mount) and not overlay2.
  2. Pick your durability target: RDB for coarse recovery, AOF everysec for tighter recovery.
  3. Measure fsync pain using Redis INFO persistence and host iostat -x. Don’t tune blind.
  4. Budget memory headroom so fork/rewrite doesn’t trip OOM.
  5. Run the restart survival test as a gate before any deployment that touches Redis storage or config.

Redis can be fast and persistent. It just won’t do it on your behalf while you hide its files inside a copy-on-write maze and call it “cloud-native.”
