The page is up. The pods are green. The container logs look boring. Meanwhile your host is screaming:
load average in the stratosphere, SSH takes 30 seconds to echo a character, and kswapd is eating CPU like it’s paid by the cycle.
This is the special kind of production failure where everything “works” right until the business notices latency,
timeouts, and mysteriously “slow databases.” Welcome to the swap storm: not a single crash, just a slow-motion meltdown.
What a swap storm is (and why it fools you)
A swap storm is sustained memory pressure that drives the kernel to continuously evict pages from RAM to swap,
then fault them back in, repeatedly. It’s not just “some swap is used.” It’s the system spending so much time
moving pages that useful work becomes secondary.
The nasty part is that many applications keep “working.” They respond, slowly. They retry. They time out and get retried.
Your orchestrator sees processes still alive, health checks still barely passing, and thinks the world is fine.
Humans notice first: everything feels sticky.
Two signals that separate “swap used” from “swap storm”
- Major page faults spike (reading pages from disk-backed swap).
- PSI memory pressure shows sustained stalls (tasks waiting on memory reclaim / IO).
If you only look at “swap used percentage,” you’ll get lied to. Swap can be 20% used and stable for weeks with no drama.
Conversely, swap can be “only” 5% used and still storming if the working set churns.
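A quick way to put numbers on both signals, assuming a reasonably recent kernel with PSI enabled: sample the swap and major-fault counters twice, compare the deltas, then read the pressure file.
# The difference between two samples is your per-interval rate.
grep -E '^(pswpin|pswpout|pgmajfault) ' /proc/vmstat
sleep 10
grep -E '^(pswpin|pswpout|pgmajfault) ' /proc/vmstat
# Sustained, non-trivial avg10/avg60 values here mean tasks are stalling on memory.
cat /proc/pressure/memory
Rising pswpin/pswpout plus a growing pgmajfault count in the same window is the storm signature; swap usage alone is not.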
Interesting facts and historical context (because this mess has history)
- Early Linux OOM behavior was famously blunt. The kernel’s OOM killer evolved over decades; it still surprises people under pressure.
- cgroups arrived to stop “noisy neighbors.” They were built for shared systems long before containers made them fashionable.
- Swap accounting in cgroups has been controversial. It adds overhead and has had real-world bugs; many platforms disabled it by default.
- Kubernetes historically discouraged swap. Not because swap is evil, but because predictable memory isolation is hard when swapping is in play.
- The “free memory” number has been misunderstood since forever. Linux uses RAM for page cache aggressively; “free” being low is often healthy.
- Pressure Stall Information (PSI) is relatively new. It’s one of the best modern tools for seeing “waiting on memory” without guessing.
- SSD swap made storms quieter, not safer. Faster swap reduces pain… until it masks the issue and you hit write amplification and latency cliffs.
- Overcommit defaults are a cultural artifact. Linux assumes many programs allocate more than they touch; this is true until it isn’t.
Why containers look healthy while the host dies
Containers don’t have their own kernel. They’re processes grouped by cgroups and namespaces.
Memory pressure is managed by the host kernel, and the host kernel is the one doing reclaim and swap.
Here’s the illusion: a container can continue to run and respond while the host is swapping heavily, because
the container’s processes are still scheduled and still making progress—just at a terrible cost.
Your container’s CPU usage may even look lower because it’s blocked on IO (swap-in), not burning CPU.
The main failure modes that create the “containers fine, host melting” pattern
- No memory limits (or wrong limits). One container grows until the host reclaims and swaps everyone.
- Limits set but swap not constrained. The container stays under its RAM cap but still pushes pressure into global reclaim via page cache patterns and shared resources.
- Page cache + filesystem IO dominates. Containers doing IO can blow out cache, forcing reclaim and swap for other workloads.
- Overcommit + bursts. Many services allocate aggressively at once; you don’t OOM immediately, you churn.
- OOM policy avoids killing. The system swaps instead of failing fast, trading correctness for “availability” the worst way.
One more twist: container-level telemetry can be misleading. Some tools report cgroup memory usage but not
host-level reclaim pain. You’ll see containers “within limits” while the host spends its day shuffling pages.
Joke #1: Swap is like a storage unit—you feel organized until you realize you’re paying monthly to store junk you still need every day.
Linux memory basics you actually need
You don’t need to memorize kernel code. You do need a few concepts to reason about swap storms without superstition.
Working set vs allocated memory
Most apps allocate memory they don’t actively touch. The kernel doesn’t care about “allocated,” it cares about
“recently used.” Your working set is the pages you touch frequently enough that evicting them hurts.
Swap storms happen when the working set doesn’t fit, or when it fits but the kernel is forced to churn pages due to
competing demands (page cache, other cgroups, or a single offender that keeps dirtying memory).
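A rough first look at the allocated-versus-resident gap, with the caveat that RSS only tells you what is resident, not how hot those pages are:
# VSZ is what the process asked for; RSS is what currently sits in RAM.
ps -eo pid,vsz,rss,comm --sort=-rss | head -n 10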
Anonymous memory vs file-backed memory
- Anonymous: heap and stack; it can only be evicted by writing it out to swap.
- File-backed: page cache; evictable without swap (just re-read from the file) unless dirty.
When you run databases, caches, JVMs, and log-heavy services on the same host, anonymous and file-backed reclaim
interact in entertaining ways. “Entertaining” here means “a postmortem you’ll read at 2am.”
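To see the split on a host, the counters in /proc/meminfo are enough for a first read; cgroup v2 exposes the same breakdown per cgroup in memory.stat (anon, file, file_dirty):
# Anonymous pages need swap to be evicted; Cached (page cache) mostly doesn't, unless Dirty.
grep -E '^(AnonPages|Cached|Dirty|Writeback|SwapCached):' /proc/meminfo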
Reclaim, kswapd, and direct reclaim
The kernel tries to reclaim memory in the background (kswapd). Under heavy pressure, processes themselves may enter
direct reclaim—they stall trying to free memory. That’s where latency goes to die.
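You can tell the two apart from the reclaim counters: if pgscan_direct grows quickly between two samples taken a few seconds apart, application threads are paying for reclaim inline, which is exactly the stall described above. A minimal check, assuming a kernel recent enough to expose these counters:
# pgscan_direct rising means tasks reclaim on their own allocation path;
# pgscan_kswapd rising is background reclaim keeping up (or trying to).
grep -E '^(pgscan_kswapd|pgscan_direct|pgsteal_kswapd|pgsteal_direct) ' /proc/vmstat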
Why swap storms feel like CPU problems
Reclaim burns CPU. Compression can burn CPU (zswap/zram). Faulting pages back in burns CPU and IO.
And your application threads might be blocked, making utilization graphs confusing: low app CPU, high system CPU, high IO wait.
cgroups, Docker, and the sharp edges around swap
Docker uses cgroups to constrain resources. But “memory constraints” are a grab bag depending on kernel version,
cgroup v1 vs v2, and what Docker is configured to do.
cgroup v1 vs v2: the practical differences for swap storms
In cgroup v1, memory and swap were managed with separate knobs (memory.limit_in_bytes, memory.memsw.limit_in_bytes),
and swap accounting could be disabled. In cgroup v2, memory is more unified and the interface is cleaner:
memory.max, memory.swap.max, memory.high, plus pressure metrics.
If you’re on cgroup v2 and not using memory.high, you’re missing one of the best tools to prevent a single cgroup from turning
the host into a swap-powered toaster.
Docker memory flags: what they really mean
- --memory: hard limit. If exceeded, the cgroup reclaims; if it can't, you get an OOM kill (inside that cgroup).
- --memory-swap: in many setups, the total memory+swap limit. The semantics vary; on some systems it's ignored without swap accounting.
- --oom-kill-disable: almost always a bad idea in production. It encourages the host to suffer longer.
The container “working” while the host melts is often the result of a policy decision:
we told the system “do not kill, just try harder.” The kernel complied.
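A minimal sketch of the flags in practice; the image name and sizes are illustrative, not a recommendation:
# --memory is the hard cap; --memory-swap equal to --memory means this container gets no swap.
docker run -d --name api-prod \
  --memory=2g --memory-swap=2g \
  registry.example.com/api:latest
If --memory-swap is left unset while --memory is set, Docker typically allows swap on top of the RAM cap (up to the same amount again), which is exactly how a container stays “within its limit” while still paging.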
One quote you should tattoo onto your runbooks
“Hope is not a strategy.” — an old operations aphorism, variously attributed; the point stands regardless of who said it first.
Fast diagnosis playbook
This is the order I use when someone pings “the host is slow” and I suspect memory pressure. It’s designed to
get you to a decision quickly: kill, cap, move, or tune.
First: confirm it’s actually a swap storm (not just ‘swap used’)
- Check current swap activity (swap-in/out rates) and major faults.
- Check PSI memory pressure for sustained stalls.
- Check whether the IO subsystem is saturated (swap is IO).
Second: find the offender cgroup/container
- Compare per-container memory usage, including RSS and cache.
- Check which cgroups are triggering OOM or high reclaim.
- Look for a workload pattern (JVM heap growth, batch job, log burst, compaction, index rebuild).
Third: decide the mitigation
- If latency matters more than completion: fail fast (tight limits, allow OOM, restart cleanly).
- If completion matters more than latency: isolate (dedicated nodes, lower swappiness, controlled swap, slower but stable).
- If you’re blind: add PSI + per-cgroup memory metrics first. Tuning without visibility is gambling.
Hands-on tasks: commands, outputs, decisions
These are the commands I actually run on a real host. Each task includes what the output means and what decision it enables.
Adjust interface names and paths to match your environment.
Task 1: Confirm swap is present and how much is in use
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 62Gi 54Gi 1.2Gi 1.1Gi 6.8Gi 2.3Gi
Swap: 16Gi 12Gi 4.0Gi
Meaning: Swap is heavily used (12Gi) and available memory is low (2.3Gi). Not proof of storm, but suspicious.
Decision: Move to activity metrics; used swap alone doesn’t justify action.
Task 2: Measure swap activity and page fault pressure
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 1 12453248 312000 68000 512000 60 210 120 980 1200 2400 12 18 42 28 0
2 2 12454800 298000 66000 500000 180 640 200 1800 1800 3200 8 22 30 40 0
4 2 12456000 286000 64000 490000 220 710 260 2100 1900 3300 9 23 24 44 0
3 3 12456800 280000 62000 482000 240 680 300 2000 2000 3500 10 24 22 44 0
2 2 12458000 276000 60000 475000 210 650 280 1900 1950 3400 9 23 25 43 0
Meaning: Non-trivial si/so (swap-in/out) every second and high wa (IO wait). That’s active paging.
Decision: Treat as storm. Next: determine whether IO is saturated and which cgroup is pressuring memory.
Task 3: Check PSI for memory stalls (host-level)
cr0x@server:~$ cat /proc/pressure/memory
some avg10=18.40 avg60=12.12 avg300=8.50 total=192003210
full avg10=6.20 avg60=3.90 avg300=2.10 total=48200321
Meaning: full pressure indicates tasks are frequently stalled because memory reclaim can’t keep up. This correlates strongly with latency spikes.
Decision: Stop looking for “CPU bugs.” This is memory pressure. Find the offender and cap/kill/isolate.
Task 4: Identify whether you’re on cgroup v1 or v2
cr0x@server:~$ stat -fc %T /sys/fs/cgroup
cgroup2fs
Meaning: cgroup v2 is active. You can use memory.high and memory.swap.max.
Decision: Prefer v2 controls; avoid v1-era folklore knobs that don’t apply.
Task 5: See top memory consumers by container
cr0x@server:~$ docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
a1b2c3d4e5f6 api-prod 35.20% 1.8GiB / 2GiB 90.00% 2.1GB / 1.9GB 120MB / 3GB 210
b2c3d4e5f6g7 search-indexer 4.10% 7.4GiB / 8GiB 92.50% 150MB / 90MB 30GB / 2GB 65
c3d4e5f6g7h8 metrics-agent 0.50% 220MiB / 512MiB 42.97% 20MB / 18MB 2MB / 1MB 14
Meaning: search-indexer is near its memory limit and doing huge block IO (30GB reads/writes), which could be page cache churn, compaction, or spill.
Decision: Drill into that container’s cgroup metrics (reclaim, swap, OOM events).
Task 6: Inspect cgroup memory + swap limits (v2) for a suspect container
cr0x@server:~$ CID=$(docker inspect -f '{{.Id}}' search-indexer)
cr0x@server:~$ docker inspect -f '{{.Id}} {{.Name}}' "$CID"
b2c3d4e5f6g7h8i9j0 /search-indexer
cr0x@server:~$ cat /sys/fs/cgroup/system.slice/docker-$CID.scope/memory.max
8589934592
cr0x@server:~$ cat /sys/fs/cgroup/system.slice/docker-$CID.scope/memory.swap.max
max
Meaning: RAM limit is 8GiB, but swap is unlimited (max). Under pressure, this cgroup can push swap hard.
Decision: Set memory.swap.max or configure Docker to bound swap for containers that shouldn’t page.
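One way to bound it on a live host, assuming the systemd cgroup driver shown above; MemorySwapMax is systemd's property name, and the change only lasts as long as the container's scope exists:
# Forbid swap for this container's scope (cgroup v2, systemd driver).
CID=$(docker inspect -f '{{.Id}}' search-indexer)
sudo systemctl set-property --runtime docker-$CID.scope MemorySwapMax=0
For a durable policy, set the bound when the container is created (runtime flags or orchestrator config) instead of patching scopes by hand.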
Task 7: Check per-cgroup events: are you hitting reclaim/OOM?
cr0x@server:~$ cat /sys/fs/cgroup/system.slice/docker-$CID.scope/memory.events
low 0
high 1224
max 18
oom 2
oom_kill 2
Meaning: The cgroup hit high frequently and has had OOM kills. That’s not “random instability”; it’s a sizing problem.
Decision: Either raise memory, reduce working set, or accept restarts but prevent host-level thrash by bounding swap and using memory.high.
Task 8: Observe swap usage per process (find the real hog)
cr0x@server:~$ sudo smem -rs swap | head -n 8
PID User Command Swap USS PSS RSS
18231 root java -jar indexer.jar 6144M 4096M 4200M 7000M
9132 root python3 /app/worker.py 820M 600M 650M 1200M
2210 root dockerd 90M 60M 70M 180M
1987 root containerd 40M 25M 30M 90M
1544 root /usr/bin/prometheus 10M 900M 920M 980M
1123 root /usr/sbin/sshd 1M 2M 3M 8M
Meaning: The Java indexer has 6GiB swapped out. That explains “container alive but slow”: it’s faulting pages constantly.
Decision: If this workload isn’t supposed to swap, cap it and force OOM/restart. If it is, isolate it to a host with faster swap and lower contention.
Task 9: Check disk saturation (swap is IO; IO is latency)
cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (server) 01/02/2026 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
10.21 0.00 22.11 41.90 0.00 25.78
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
nvme0n1 120.0 12800.0 2.0 1.64 18.2 106.7 210.0 24400.0 8.0 3.67 32.5 116.2 9.80 99.20
Meaning: %util near 100% and high await. The NVMe is saturated; swap-in/out will queue, causing stalls everywhere.
Decision: Immediate mitigation: reduce memory pressure (kill offender, lower concurrency). Longer-term: separate swap from workload IO or use faster storage.
Task 10: See which processes are stuck in reclaim or IO waits
cr0x@server:~$ ps -eo pid,stat,wchan:20,comm --sort=stat | head -n 12
PID STAT WCHAN COMMAND
18231 D io_schedule java
19102 D io_schedule java
9132 D io_schedule python3
24011 D balance_pgdat postgres
24022 D balance_pgdat postgres
2210 Ssl ep_poll dockerd
1987 Ssl ep_poll containerd
Meaning: D state plus io_schedule means uninterruptible sleep waiting on IO (swap-in). Reclaim-related wait channels such as balance_pgdat mean tasks are stalled waiting for the kernel to free memory.
Decision: Your latency is kernel-level waiting. Stop scaling traffic; it will worsen the queue. Shed load or stop the offender.
Task 11: Check kernel logs for OOM and reclaim warnings
cr0x@server:~$ sudo dmesg -T | tail -n 12
[Thu Jan 2 10:14:22 2026] Memory cgroup out of memory: Killed process 18231 (java) total-vm:12422392kB, anon-rss:7023120kB, file-rss:10244kB, shmem-rss:0kB
[Thu Jan 2 10:14:22 2026] oom_reaper: reaped process 18231 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Thu Jan 2 10:14:25 2026] Out of memory: Killed process 9132 (python3) total-vm:2048320kB, anon-rss:1192200kB, file-rss:9120kB, shmem-rss:0kB
Meaning: cgroup OOM kills occurred. That’s actually preferable to host-wide swap storms—if your service can restart cleanly.
Decision: Confirm restart policies, adjust limits, and set expectations: controlled OOM beats uncontrolled host collapse.
Task 12: Inspect host swappiness and overcommit posture
cr0x@server:~$ sysctl vm.swappiness vm.overcommit_memory vm.overcommit_ratio
vm.swappiness = 60
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
Meaning: Swappiness 60 is default-ish and can be too eager on mixed container hosts. Overcommit is heuristic (0).
Decision: If this host runs latency-sensitive services, consider lowering swappiness and tightening overcommit, but only after you have per-cgroup limits.
Task 13: Verify zswap/zram status (helps sometimes, hides problems always)
cr0x@server:~$ grep -H . /sys/module/zswap/parameters/enabled /sys/block/zram0/disksize 2>/dev/null
/sys/module/zswap/parameters/enabled:Y
Meaning: zswap is enabled; swap pages may be compressed in RAM. This reduces IO but increases CPU and can mask pressure until it’s too late.
Decision: Keep it if it’s buying stability. Don’t treat it as a license to run memory-unbounded workloads.
Task 14: Check Docker daemon and kernel memory accounting assumptions
cr0x@server:~$ docker info | sed -n '1,40p'
Client:
Version: 26.1.0
Context: default
Debug Mode: false
Server:
Containers: 38
Running: 33
Paused: 0
Stopped: 5
Server Version: 26.1.0
Storage Driver: overlay2
Cgroup Driver: systemd
Cgroup Version: 2
Kernel Version: 6.5.0
Operating System: Ubuntu 24.04 LTS
OSType: linux
Meaning: systemd + cgroup v2. Good. Your controls exist; you just need to use them.
Decision: Implement cgroup v2-aware policies (memory.high, memory.swap.max) rather than legacy guidance.
Three corporate mini-stories from the memory trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-size SaaS company ran a search indexing worker fleet in Docker on a handful of beefy hosts. They had swap enabled “for safety,”
and they had container memory limits set. Everyone felt responsible. Everyone slept.
During a busy week, the indexing backlog grew and an engineer increased worker concurrency inside the container.
The container stayed under its --memory cap most of the time, but the workload’s allocation pattern became spiky:
large transient buffers, heavy file IO, and aggressive caching. The host started swapping.
The wrong assumption was subtle: “If containers have memory limits, the host won’t swap itself into oblivion.”
In reality, limits prevent a single cgroup from using infinite RAM, but they don’t automatically guarantee host-level
fairness or prevent global reclaim pain—especially when swap is unbounded and the IO path is shared.
Symptoms were classic. API latency doubled, then tripled. SSH logins stalled. Monitoring showed “container memory within limits,”
so the team chased network and database tuning for hours. Eventually someone ran cat /proc/pressure/memory
and the story wrote itself.
The fix wasn’t exotic. They set memory.swap.max for the indexing workers, added memory.high to throttle them before OOM,
and moved the indexing workload to dedicated nodes with their own IO budget. The biggest improvement came from the boring bit:
writing down that “memory limits are not host protection” and making it a deployment gate.
Mini-story 2: The optimization that backfired
Another org had a multi-tenant logging pipeline. They wanted to reduce disk write load, so they enabled zswap and bumped swap size.
The initial results looked great: fewer write spikes, smoother IO graphs, fewer immediate OOMs.
Then came a minor incident: a customer turned on verbose logging plus compression at the application layer.
Log volume surged, CPU rose, and memory pressure increased. With zswap, the kernel compressed swapped pages in RAM first.
That reduced swap IO, but increased CPU time spent compressing and reclaiming.
In dashboards, it looked like “CPU saturation,” not “memory pressure.” The team tuned CPU limits, added cores, and made the host bigger.
The system got worse. Bigger RAM meant bigger caches and more to churn; more CPU meant zswap could compress more, delaying the obvious failure.
Latency jitter became permanent.
The backfire wasn’t zswap itself. It was treating it as performance optimization rather than a pressure buffer.
The real problem was uncontrolled memory growth in a parsing stage and no memory.high throttling. Swapping was the symptom,
zswap was the amplifier that made the symptom harder to see.
They fixed it by setting strict per-stage memory ceilings, adding backpressure in the pipeline, and using zswap only on nodes
dedicated to batch processing where latency didn’t matter. They also changed the alert: PSI memory sustained > threshold
became a page-worthy event.
Mini-story 3: The boring but correct practice that saved the day
A financial services team ran a mix of customer-facing APIs and a nightly batch reconciliation on the same Kubernetes cluster.
They were painfully conservative: per-service requests/limits were set, eviction thresholds were reviewed quarterly,
and every node pool had a written “swap policy.” It was not exciting work. It was also why they didn’t have exciting outages.
One night, a batch job started using more memory after a vendor library update. It grew steadily, not explosively.
On a less disciplined team, this becomes a slow swap storm and an incident report with the words “we don’t know.”
Their system did something unglamorous: the job hit its memory limit, got OOM-killed, and restarted with a reduced parallelism setting
(a predefined fallback). The batch ran slower. The APIs stayed fast. The on-call got an alert that was specific:
“batch job OOMKilled; host PSI normal.”
The next morning, they rolled back the library, filed the upstream issue, and adjusted the job’s memory request permanently.
Nobody wrote a heroic Slack message. That’s the point. The correct practice—limits plus controlled failure—kept a “host melts” event from existing at all.
Fixes that stick: limits, OOM, swappiness, and monitoring
1) Put memory limits on every container that matters
No limit is a decision. It’s the decision that the host will be your limit. The host is a bad limit because it fails
collectively: all workloads suffer, then everything collapses.
Use per-service limits sized to the working set, not peak allocation fantasies. For JVMs, that means setting heap
intentionally and leaving headroom for off-heap and native allocations.
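A sketch of that sizing for a JVM service, with illustrative numbers only; JAVA_TOOL_OPTIONS and MaxRAMPercentage assume a container-aware JVM (Java 10+):
# 2 GiB hard cap; heap held to ~60% of the cgroup limit so metaspace,
# thread stacks, and native buffers fit inside the same budget.
docker run -d --name indexer \
  --memory=2g \
  -e JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=60" \
  registry.example.com/indexer:latest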
2) Use memory.high (cgroup v2) to throttle before OOM
memory.max is a cliff. memory.high is a speed limit. When you set memory.high, the kernel starts reclaiming
and throttling allocations once the cgroup exceeds it, which tends to reduce host-wide thrash.
For bursty services, memory.high slightly below memory.max often produces a system that’s slower under pressure but still
controllable, instead of chaotic.
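A minimal sketch of setting it directly through the cgroup v2 files, assuming the systemd driver; systemd may reset hand-written values on reload, so treat this as an experiment, not configuration management:
# Throttle the suspect container at roughly 6 GiB while its hard cap stays at 8 GiB.
CID=$(docker inspect -f '{{.Id}}' search-indexer)
echo 6G | sudo tee /sys/fs/cgroup/system.slice/docker-$CID.scope/memory.high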
3) Bound swap per workload, or turn it off for latency-sensitive services
Unlimited swap is how you end up with a server that is “up” but unusable. For services that must be responsive,
prefer either no swap usage or a very small swap allowance.
If you need swap for batch workloads, isolate them. Swap is not free; it’s a decision to trade latency for completion.
Mixing “must be fast” with “can be slow” on the same swap-backed node is a sociological experiment.
4) Lower swappiness (carefully) on mixed container hosts
vm.swappiness controls how aggressively the kernel swaps anonymous memory versus reclaiming page cache.
On hosts running databases or low-latency services, a lower value (like 10 or 1) can reduce swapping of hot pages.
Don’t cargo-cult vm.swappiness=1 everywhere. If your host relies on swap to avoid OOM and you drop swappiness without fixing sizing,
you’ll just trade swap storms for OOM storms.
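If you do lower it, make the change explicit and persistent instead of a one-off on a single host; the value and the drop-in file name below are conventions, not requirements:
# Apply immediately, then persist across reboots.
sudo sysctl -w vm.swappiness=10
echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/90-swappiness.conf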
5) Prefer controlled OOM over uncontrolled thrash
This is the part people hate emotionally but love operationally: for many services, restarting is cheaper than paging.
If a service can’t run within its budget, killing it is honesty.
Avoid disabling OOM kill for containers unless you deeply understand the consequences. Disabling OOM kill is how you
turn one bad process into a host-wide incident.
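A quick audit sketch for spotting containers that already run with the OOM killer disabled, using the HostConfig.OomKillDisable field that docker inspect exposes:
# Prints any running container whose OOM killer has been turned off.
docker ps -q | xargs -r docker inspect -f '{{.Name}} OomKillDisable={{.HostConfig.OomKillDisable}}' | grep 'OomKillDisable=true'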
Joke #2: Disabling the OOM killer is like removing the fire alarm because it’s loud—now you can enjoy the flames in peace.
6) Watch PSI and reclaim metrics, not just “memory used”
The most useful alerts are the ones that tell you “the kernel is stalling tasks.”
PSI gives you that directly. Pair it with swap-in/out rates and IO latency.
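A minimal sketch of the kind of check worth wiring into real alerting; the 5% threshold is illustrative, and on cgroup v2 the same data exists per cgroup in each memory.pressure file:
# Emit a line only when sustained "full" memory pressure (avg60) exceeds the threshold.
awk '/^full/ { split($3, f, "="); if (f[2] + 0 > 5) print "memory PSI full avg60=" f[2] }' /proc/pressure/memory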
7) Separate IO paths when swap is unavoidable
If swap and your primary workload share the same disk, you’ve created a feedback loop: swap causes IO queueing,
IO queueing slows apps, slow apps hold memory longer, memory pressure increases, swap increases. Congratulations, you built a carousel.
On serious systems, swap belongs on fast storage, sometimes separate devices, or you use zram for bounded emergency headroom
(with CPU headroom accounted for).
8) Make memory budgeting part of deployment, not postmortem
Corporate reality: teams will not “remember to add limits later.” They will ship today. Your job is to turn this into a guardrail:
CI checks for Compose, admission policies for Kubernetes, and runtime alerts for “no limit set.”
Common mistakes: symptoms → root cause → fix
1) Symptom: Host load average is huge; CPU usage looks moderate
Root cause: Threads blocked in IO wait during swap-in or direct reclaim. Load counts runnable + uninterruptible tasks.
Fix: Confirm with vmstat/iostat/ps (D-state). Reduce memory pressure immediately; cap/kill offender; add memory.high.
2) Symptom: Containers show “within memory limit,” but the host swaps heavily
Root cause: Limits exist but swap is unbounded, page cache churn forces global reclaim, or multiple containers collectively exceed host capacity.
Fix: Set per-cgroup swap limits (memory.swap.max) and realistic memory budgets. Don’t oversubscribe without an explicit policy.
3) Symptom: Random latency spikes across unrelated services
Root cause: Global reclaim and IO queueing create cross-service coupling. One memory offender punishes everyone.
Fix: Isolate noisy workloads onto separate nodes/pools; enforce limits; watch PSI.
4) Symptom: “We added more swap and it got worse”
Root cause: More swap increases the time a host can remain in a degraded paging state, amplifying tail latency and operational confusion.
Fix: Treat swap as emergency buffer, not capacity. Prefer OOM/restart for latency-sensitive services, or isolate batch jobs.
5) Symptom: OOM kills happen but the host is still sluggish afterward
Root cause: Swap remains populated; reclaim and refaulting continues; IO queue still draining; caches fragmented.
Fix: After removing offender, allow recovery time; reduce IO pressure; consider temporarily dropping caches only as a last resort and with understanding of blast radius.
6) Symptom: “Free memory” is near zero; someone panics
Root cause: Linux uses RAM for page cache; low free is normal when available is healthy and PSI is low.
Fix: Educate teams to use available and PSI, not free. Alert on pressure and churn, not aesthetics.
7) Symptom: After enabling zswap/zram, CPU increased and throughput dropped
Root cause: Compression overhead plus continued memory pressure. You moved cost from disk to CPU.
Fix: Only enable when CPU headroom exists; cap swap usage; fix the actual memory budget.
8) Symptom: “Swappiness=1 fixed it” (until next week)
Root cause: Reduced swap masked poor memory sizing temporarily; pressure still exists and may now become abrupt OOM.
Fix: Size workloads, set limits, add memory.high/backpressure. Tune swappiness as a finishing move, not the opening act.
Checklists / step-by-step plan
Immediate incident response (15 minutes)
- Run vmstat 1 and cat /proc/pressure/memory. Confirm active paging + stalls.
- Run iostat -xz 1. Confirm disk saturation / await.
- Find offender: docker stats, then per-process swap with smem, or inspect cgroup memory events.
- Mitigate:
- Reduce concurrency / traffic.
- Stop the offender container or restart it with a lower memory footprint.
- If needed, temporarily move the offender to a dedicated host.
- Verify recovery: swap-in/out rates drop, PSI full approaches ~0, IO await normalizes.
Stabilization plan (same day)
- Set memory limits for any container without them.
- On cgroup v2: add memory.high for bursty workloads to reduce thrash.
- Bound swap where appropriate (memory.swap.max), especially for latency-sensitive services.
- Review --oom-kill-disable usage; remove unless you have a very specific reason.
- Adjust node role separation: batch vs latency-sensitive services should not share the same memory/IO fate.
Hardening plan (next sprint)
- Add alerting on PSI memory (some and full) sustained thresholds.
- Add alerting on swap-in/out rate, major faults, and disk await.
- Implement policy-as-code: Compose lints or admission controls requiring memory limits.
- Document per-service memory budgets and restart behavior (what happens on OOM, how fast it recovers).
- Load test with memory pressure: run concurrency spikes and verify the host remains interactive.
FAQ
1) Is swap always bad for Docker hosts?
No. Swap is a tool. It’s useful as an emergency buffer and for batch workloads. It’s dangerous when it becomes a crutch for
undersized services and makes the host “alive but unusable.”
2) Why do my containers show low CPU while the system is slow?
Because they’re blocked. In swap storms, threads often sit in IO wait or direct reclaim. Your CPU graph doesn’t show “waiting on disk”
unless you look at iowait, PSI, or blocked tasks.
3) Should I disable swap to prevent storms?
If you run latency-sensitive services and you have good limits, disabling swap can improve predictability.
But it also makes OOM more likely. The correct approach is usually: set limits and policies first, then decide swap posture.
4) What’s the difference between OOM and swap thrash in terms of user impact?
OOM is abrupt and noisy: a process dies and restarts. Swap thrash is prolonged and sneaky: everything is slow, timeouts cascade,
and you get secondary failures. In many systems, a controlled OOM is the lesser evil.
5) Why does adding RAM sometimes not fix it?
More RAM helps only if the working set fits and you reduce churn. If the workload scales to fill memory (caches, heaps, compactions),
you can end up with the same pressure, just on a larger canvas.
6) How do I set swap limits for Docker containers on cgroup v2?
Docker’s flags can be inconsistent across setups. The reliable way is often to use cgroup v2 controls directly via systemd scopes
or runtime configuration, ensuring memory.swap.max is set for the container’s cgroup. Validate by reading the file in /sys/fs/cgroup.
7) Why is “free memory” low even when the system is fine?
Linux uses spare RAM as cache. That cache is reclaimable. Look at “available” memory and pressure metrics, not “free.”
Low free + low PSI is usually normal.
8) What metrics should I page on to catch swap storms early?
Sustained PSI memory (/proc/pressure/memory), swap-in/out rates, major page faults, disk await/%util,
and per-cgroup memory events (high/max/oom). Alerts should trigger on sustained conditions, not one-second blips.
9) Can overlay2 or filesystem behavior contribute to swap storms?
Indirectly, yes. Heavy filesystem metadata churn and write amplification can saturate IO, making swap-in/out far more painful.
It doesn’t create memory pressure, but it turns memory pressure into a system-wide outage faster.
10) Is zram better than disk swap for containers?
zram avoids disk IO by compressing into RAM. It can soften storms when you have CPU headroom, but it’s still swapping—latency increases,
and it can hide capacity issues. Use it as a buffer, not a permission slip.
Practical next steps
If your Docker host is melting while containers “work,” stop debating whether swap is morally good. Treat it like what it is:
a performance debt instrument with a variable interest rate and a nasty compounding model.
Do these next, in this order:
- Instrument pressure: add PSI-based alerts and dashboards alongside swap activity and IO latency.
- Budget memory per service: set realistic limits; remove “unlimited” as a default.
- Control swap behavior: bound swap per workload (especially latency-sensitive ones) and consider memory.high throttling.
- Prefer controlled failure: allow cgroup OOM + restart for services that can recover, instead of host-wide thrash.
- Isolate the troublemakers: batch indexing, compaction, and anything that likes to balloon should not share nodes with low-latency APIs.
Your goal isn’t “never use swap.” Your goal is “never let swap decide your incident timeline.” The kernel will do what you ask.
Ask for something survivable.