The page is up. The pods are green. The container logs look boring. Meanwhile your host is screaming:
load average in the stratosphere, SSH takes 30 seconds to echo a character, and kswapd is eating CPU like it’s paid by the cycle.
This is the special kind of production failure where everything “works” right until the business notices latency,
timeouts, and mysteriously “slow databases.” Welcome to the swap storm: not a single crash, just a slow-motion meltdown.
What a swap storm is (and why it fools you)
A swap storm is sustained memory pressure that drives the kernel to continuously evict pages from RAM to swap,
then fault them back in, repeatedly. It’s not just “some swap is used.” It’s the system spending so much time
moving pages that useful work becomes secondary.
The nasty part is that many applications keep “working.” They respond, slowly. They retry. They time out and get retried.
Your orchestrator sees processes still alive, health checks still barely passing, and thinks the world is fine.
Humans notice first: everything feels sticky.
Two signals that separate “swap used” from “swap storm”
- Major page faults spike (reading pages from disk-backed swap).
- PSI memory pressure shows sustained stalls (tasks waiting on memory reclaim / IO).
If you only look at “swap used percentage,” you’ll get lied to. Swap can be 20% used and stable for weeks with no drama.
Conversely, swap can be “only” 5% used and still storming if the working set churns.
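A quick way to put numbers on both signals, assuming a reasonably recent kernel with PSI enabled: sample the swap and major-fault counters twice, compare the deltas, then read the pressure file.
# The difference between two samples is your per-interval rate.
grep -E '^(pswpin|pswpout|pgmajfault) ' /proc/vmstat
sleep 10
grep -E '^(pswpin|pswpout|pgmajfault) ' /proc/vmstat
# Sustained, non-trivial avg10/avg60 values here mean tasks are stalling on memory.
cat /proc/pressure/memory
Rising pswpin/pswpout plus a growing pgmajfault count in the same window is the storm signature; swap usage alone is not.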
Interesting facts and historical context (because this mess has history)
- Early Linux OOM behavior was famously blunt. The kernel’s OOM killer evolved over decades; it still surprises people under pressure.
- cgroups arrived to stop “noisy neighbors.” They were built for shared systems long before containers made them fashionable.
- Swap accounting in cgroups has been controversial. It adds overhead and has had real-world bugs; many platforms disabled it by default.
- Kubernetes historically discouraged swap. Not because swap is evil, but because predictable memory isolation is hard when swapping is in play.
- The “free memory” number has been misunderstood since forever. Linux uses RAM for page cache aggressively; “free” being low is often healthy.
- Pressure Stall Information (PSI) is relatively new. It’s one of the best modern tools for seeing “waiting on memory” without guessing.
- SSD swap made storms quieter, not safer. Faster swap reduces pain… until it masks the issue and you hit write amplification and latency cliffs.
- Overcommit defaults are a cultural artifact. Linux assumes many programs allocate more than they touch; this is true until it isn’t.
Why containers look healthy while the host dies
Containers don’t have their own kernel. They’re processes grouped by cgroups and namespaces.
Memory pressure is managed by the host kernel, and the host kernel is the one doing reclaim and swap.
Here’s the illusion: a container can continue to run and respond while the host is swapping heavily, because
the container’s processes are still scheduled and still making progress—just at a terrible cost.
Your container’s CPU usage may even look lower because it’s blocked on IO (swap-in), not burning CPU.
The main failure modes that create the “containers fine, host melting” pattern
- No memory limits (or wrong limits). One container grows until the host reclaims and swaps everyone.
- Limits set but swap not constrained. The container stays under its RAM cap but still pushes pressure into global reclaim via page cache patterns and shared resources.
- Page cache + filesystem IO dominates. Containers doing IO can blow out cache, forcing reclaim and swap for other workloads.
- Overcommit + bursts. Many services allocate aggressively at once; you don’t OOM immediately, you churn.
- OOM policy avoids killing. The system swaps instead of failing fast, trading correctness for “availability” the worst way.
One more twist: container-level telemetry can be misleading. Some tools report cgroup memory usage but not
host-level reclaim pain. You’ll see containers “within limits” while the host spends its day shuffling pages.
Joke #1: Swap is like a storage unit—you feel organized until you realize you’re paying monthly to store junk you still need every day.
Linux memory basics you actually need
You don’t need to memorize kernel code. You do need a few concepts to reason about swap storms without superstition.
Working set vs allocated memory
Most apps allocate memory they don’t actively touch. The kernel doesn’t care about “allocated,” it cares about
“recently used.” Your working set is the pages you touch frequently enough that evicting them hurts.
Swap storms happen when the working set doesn’t fit, or when it fits but the kernel is forced to churn pages due to
competing demands (page cache, other cgroups, or a single offender that keeps dirtying memory).
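A rough first look at the allocated-versus-resident gap, with the caveat that RSS only tells you what is resident, not how hot those pages are:
# VSZ is what the process asked for; RSS is what currently sits in RAM.
ps -eo pid,vsz,rss,comm --sort=-rss | head -n 10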
Anonymous memory vs file-backed memory
- Anonymous: heap and stack; it can only be evicted by writing it out to swap.
- File-backed: page cache; evictable without swap (just re-read from the file) unless dirty.
When you run databases, caches, JVMs, and log-heavy services on the same host, anonymous and file-backed reclaim
interact in entertaining ways. “Entertaining” here means “a postmortem you’ll read at 2am.”
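To see the split on a host, the counters in /proc/meminfo are enough for a first read; cgroup v2 exposes the same breakdown per cgroup in memory.stat (anon, file, file_dirty):
# Anonymous pages need swap to be evicted; Cached (page cache) mostly doesn't, unless Dirty.
grep -E '^(AnonPages|Cached|Dirty|Writeback|SwapCached):' /proc/meminfo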
Reclaim, kswapd, and direct reclaim
The kernel tries to reclaim memory in the background (kswapd). Under heavy pressure, processes themselves may enter
direct reclaim—they stall trying to free memory. That’s where latency goes to die.
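You can tell the two apart from the reclaim counters: if pgscan_direct grows quickly between two samples taken a few seconds apart, application threads are paying for reclaim inline, which is exactly the stall described above. A minimal check, assuming a kernel recent enough to expose these counters:
# pgscan_direct rising means tasks reclaim on their own allocation path;
# pgscan_kswapd rising is background reclaim keeping up (or trying to).
grep -E '^(pgscan_kswapd|pgscan_direct|pgsteal_kswapd|pgsteal_direct) ' /proc/vmstat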
Why swap storms feel like CPU problems
Reclaim burns CPU. Compression can burn CPU (zswap/zram). Faulting pages back in burns CPU and IO.
And your application threads might be blocked, making utilization graphs confusing: low app CPU, high system CPU, high IO wait.
cgroups, Docker, and the sharp edges around swap
Docker uses cgroups to constrain resources. But “memory constraints” are a grab bag depending on kernel version,
cgroup v1 vs v2, and what Docker is configured to do.
cgroup v1 vs v2: the practical differences for swap storms
In cgroup v1, memory and swap were managed with separate knobs (memory.limit_in_bytes, memory.memsw.limit_in_bytes),
and swap accounting could be disabled. In cgroup v2, memory is more unified and the interface is cleaner:
memory.max, memory.swap.max, memory.high, plus pressure metrics.
If you’re on cgroup v2 and not using memory.high, you’re missing one of the best tools to prevent a single cgroup from turning
the host into a swap-powered toaster.
Docker memory flags: what they really mean
- --memory: hard limit. If exceeded, the cgroup reclaims; if it can't, you get an OOM kill (inside that cgroup).
- --memory-swap: in many setups, the total memory+swap limit. The semantics vary; on some systems it's ignored without swap accounting.
- --oom-kill-disable: almost always a bad idea in production. It encourages the host to suffer longer.
The container “working” while the host melts is often the result of a policy decision:
we told the system “do not kill, just try harder.” The kernel complied.
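A minimal sketch of the flags in practice; the image name and sizes are illustrative, not a recommendation:
# --memory is the hard cap; --memory-swap equal to --memory means this container gets no swap.
docker run -d --name api-prod \
  --memory=2g --memory-swap=2g \
  registry.example.com/api:latest
If --memory-swap is left unset while --memory is set, Docker typically allows swap on top of the RAM cap (up to the same amount again), which is exactly how a container stays “within its limit” while still paging.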
One quote you should tattoo onto your runbooks
“Hope is not a strategy.” — an old operations aphorism, variously attributed; the point stands regardless of who said it first.
Fast diagnosis playbook
This is the order I use when someone pings “the host is slow” and I suspect memory pressure. It’s designed to
get you to a decision quickly: kill, cap, move, or tune.
First: confirm it’s actually a swap storm (not just ‘swap used’)
- Check current swap activity (swap-in/out rates) and major faults.
- Check PSI memory pressure for sustained stalls.
- Check whether the IO subsystem is saturated (swap is IO).
Second: find the offender cgroup/container
- Compare per-container memory usage, including RSS and cache.
- Check which cgroups are triggering OOM or high reclaim.
- Look for a workload pattern (JVM heap growth, batch job, log burst, compaction, index rebuild).
Third: decide the mitigation
- If latency matters more than completion: fail fast (tight limits, allow OOM, restart cleanly).
- If completion matters more than latency: isolate (dedicated nodes, lower swappiness, controlled swap, slower but stable).
- If you’re blind: add PSI + per-cgroup memory metrics first. Tuning without visibility is gambling.
Hands-on tasks: commands, outputs, decisions
These are the commands I actually run on a real host. Each task includes what the output means and what decision it enables.
Adjust interface names and paths to match your environment.
Task 1: Confirm swap is present and how much is in use
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 62Gi 54Gi 1.2Gi 1.1Gi 6.8Gi 2.3Gi
Swap: 16Gi 12Gi 4.0Gi
Meaning: Swap is heavily used (12Gi) and available memory is low (2.3Gi). Not proof of storm, but suspicious.
Decision: Move to activity metrics; used swap alone doesn’t justify action.
Task 2: Measure swap activity and page fault pressure
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 1 12453248 312000 68000 512000 60 210 120 980 1200 2400 12 18 42 28 0
2 2 12454800 298000 66000 500000 180 640 200 1800 1800 3200 8 22 30 40 0
4 2 12456000 286000 64000 490000 220 710 260 2100 1900 3300 9 23 24 44 0
3 3 12456800 280000 62000 482000 240 680 300 2000 2000 3500 10 24 22 44 0
2 2 12458000 276000 60000 475000 210 650 280 1900 1950 3400 9 23 25 43 0
Meaning: Non-trivial si/so (swap-in/out) every second and high wa (IO wait). That’s active paging.
Decision: Treat as storm. Next: determine whether IO is saturated and which cgroup is pressuring memory.
Task 3: Check PSI for memory stalls (host-level)
cr0x@server:~$ cat /proc/pressure/memory
some avg10=18.40 avg60=12.12 avg300=8.50 total=192003210
full avg10=6.20 avg60=3.90 avg300=2.10 total=48200321
Meaning: full pressure indicates tasks are frequently stalled because memory reclaim can’t keep up. This correlates strongly with latency spikes.
Decision: Stop looking for “CPU bugs.” This is memory pressure. Find the offender and cap/kill/isolate.
Task 4: Identify whether you’re on cgroup v1 or v2
cr0x@server:~$ stat -fc %T /sys/fs/cgroup
cgroup2fs
Meaning: cgroup v2 is active. You can use memory.high and memory.swap.max.
Decision: Prefer v2 controls; avoid v1-era folklore knobs that don’t apply.
Task 5: See top memory consumers by container
cr0x@server:~$ docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
a1b2c3d4e5f6 api-prod 35.20% 1.8GiB / 2GiB 90.00% 2.1GB / 1.9GB 120MB / 3GB 210
b2c3d4e5f6g7 search-indexer 4.10% 7.4GiB / 8GiB 92.50% 150MB / 90MB 30GB / 2GB 65
c3d4e5f6g7h8 metrics-agent 0.50% 220MiB / 512MiB 42.97% 20MB / 18MB 2MB / 1MB 14
Meaning: search-indexer is near its memory limit and doing huge block IO (30GB reads/writes), which could be page cache churn, compaction, or spill.
Decision: Drill into that container’s cgroup metrics (reclaim, swap, OOM events).
Task 6: Inspect cgroup memory + swap limits (v2) for a suspect container
cr0x@server:~$ CID=$(docker inspect -f '{{.Id}}' search-indexer)
cr0x@server:~$ docker inspect -f '{{.Id}} {{.Name}}' "$CID"
b2c3d4e5f6g7h8i9j0 /search-indexer
cr0x@server:~$ cat /sys/fs/cgroup/system.slice/docker-$CID.scope/memory.max
8589934592
cr0x@server:~$ cat /sys/fs/cgroup/system.slice/docker-$CID.scope/memory.swap.max
max
Meaning: RAM limit is 8GiB, but swap is unlimited (max). Under pressure, this cgroup can push swap hard.
Decision: Set memory.swap.max or configure Docker to bound swap for containers that shouldn’t page.
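One way to bound it on a live host, assuming the systemd cgroup driver shown above; MemorySwapMax is systemd's property name, and the change only lasts as long as the container's scope exists:
# Forbid swap for this container's scope (cgroup v2, systemd driver).
CID=$(docker inspect -f '{{.Id}}' search-indexer)
sudo systemctl set-property --runtime docker-$CID.scope MemorySwapMax=0
For a durable policy, set the bound when the container is created (runtime flags or orchestrator config) instead of patching scopes by hand.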
Task 7: Check per-cgroup events: are you hitting reclaim/OOM?
cr0x@server:~$ cat /sys/fs/cgroup/system.slice/docker-$CID.scope/memory.events
low 0
high 1224
max 18
oom 2
oom_kill 2
Meaning: The cgroup hit high frequently and has had OOM kills. That’s not “random instability”; it’s a sizing problem.
Decision: Either raise memory, reduce working set, or accept restarts but prevent host-level thrash by bounding swap and using memory.high.
Task 8: Observe swap usage per process (find the real hog)
cr0x@server:~$ sudo smem -rs swap | head -n 8
PID User Command Swap USS PSS RSS
18231 root java -jar indexer.jar 6144M 4096M 4200M 7000M
9132 root python3 /app/worker.py 820M 600M 650M 1200M
2210 root dockerd 90M 60M 70M 180M
1987 root containerd 40M 25M 30M 90M
1544 root /usr/bin/prometheus 10M 900M 920M 980M
1123 root /usr/sbin/sshd 1M 2M 3M 8M
Meaning: The Java indexer has 6GiB swapped out. That explains “container alive but slow”: it’s faulting pages constantly.
Decision: If this workload isn’t supposed to swap, cap it and force OOM/restart. If it is, isolate it to a host with faster swap and lower contention.
Task 9: Check disk saturation (swap is IO; IO is latency)
cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (server) 01/02/2026 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
10.21 0.00 22.11 41.90 0.00 25.78
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
nvme0n1 120.0 12800.0 2.0 1.64 18.2 106.7 210.0 24400.0 8.0 3.67 32.5 116.2 9.80 99.20
Meaning: %util near 100% and high await. The NVMe is saturated; swap-in/out will queue, causing stalls everywhere.
Decision: Immediate mitigation: reduce memory pressure (kill offender, lower concurrency). Longer-term: separate swap from workload IO or use faster storage.
Task 10: See which processes are stuck in reclaim or IO waits
cr0x@server:~$ ps -eo pid,stat,wchan:20,comm --sort=stat | head -n 12
PID STAT WCHAN COMMAND
18231 D io_schedule java
19102 D io_schedule java
9132 D io_schedule python3
24011 D balance_pgdat postgres
24022 D balance_pgdat postgres
2210 Ssl ep_poll dockerd
1987 Ssl ep_poll containerd
Meaning: D state plus io_schedule means uninterruptible sleep waiting on IO (swap-in). Reclaim-related wait channels such as balance_pgdat mean tasks are stalled waiting for the kernel to free memory.
Decision: Your latency is kernel-level waiting. Stop scaling traffic; it will worsen the queue. Shed load or stop the offender.
Task 11: Check kernel logs for OOM and reclaim warnings
cr0x@server:~$ sudo dmesg -T | tail -n 12
[Thu Jan 2 10:14:22 2026] Memory cgroup out of memory: Killed process 18231 (java) total-vm:12422392kB, anon-rss:7023120kB, file-rss:10244kB, shmem-rss:0kB
[Thu Jan 2 10:14:22 2026] oom_reaper: reaped process 18231 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Thu Jan 2 10:14:25 2026] Out of memory: Killed process 9132 (python3) total-vm:2048320kB, anon-rss:1192200kB, file-rss:9120kB, shmem-rss:0kB
Meaning: cgroup OOM kills occurred. That’s actually preferable to host-wide swap storms—if your service can restart cleanly.
Decision: Confirm restart policies, adjust limits, and set expectations: controlled OOM beats uncontrolled host collapse.
Task 12: Inspect host swappiness and overcommit posture
cr0x@server:~$ sysctl vm.swappiness vm.overcommit_memory vm.overcommit_ratio
vm.swappiness = 60
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
Meaning: Swappiness 60 is default-ish and can be too eager on mixed container hosts. Overcommit is heuristic (0).
Decision: If this host runs latency-sensitive services, consider lowering swappiness and tightening overcommit, but only after you have per-cgroup limits.
Task 13: Verify zswap/zram status (helps sometimes, hides problems always)
cr0x@server:~$ grep -H . /sys/module/zswap/parameters/enabled /sys/block/zram0/disksize 2>/dev/null
/sys/module/zswap/parameters/enabled:Y
Meaning: zswap is enabled; swap pages may be compressed in RAM. This reduces IO but increases CPU and can mask pressure until it’s too late.
Decision: Keep it if it’s buying stability. Don’t treat it as a license to run memory-unbounded workloads.
Task 14: Check Docker daemon and kernel memory accounting assumptions
cr0x@server:~$ docker info | sed -n '1,40p'
Client:
Version: 26.1.0
Context: default
Debug Mode: false
Server:
Containers: 38
Running: 33
Paused: 0
Stopped: 5
Server Version: 26.1.0
Storage Driver: overlay2
Cgroup Driver: systemd
Cgroup Version: 2
Kernel Version: 6.5.0
Operating System: Ubuntu 24.04 LTS
OSType: linux
Meaning: systemd + cgroup v2. Good. Your controls exist; you just need to use them.
Decision: Implement cgroup v2-aware policies (memory.high, memory.swap.max) rather than legacy guidance.
Three corporate mini-stories from the memory trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-size SaaS company ran a search indexing worker fleet in Docker on a handful of beefy hosts. They had swap enabled “for safety,”
and they had container memory limits set. Everyone felt responsible. Everyone slept.
During a busy week, the indexing backlog grew and an engineer increased worker concurrency inside the container.
The container stayed under its --memory cap most of the time, but the workload’s allocation pattern became spiky:
large transient buffers, heavy file IO, and aggressive caching. The host started swapping.
The wrong assumption was subtle: “If containers have memory limits, the host won’t swap itself into oblivion.”
In reality, limits prevent a single cgroup from using infinite RAM, but they don’t automatically guarantee host-level
fairness or prevent global reclaim pain—especially when swap is unbounded and the IO path is shared.
Symptoms were classic. API latency doubled, then tripled. SSH logins stalled. Monitoring showed “container memory within limits,”
so the team chased network and database tuning for hours. Eventually someone ran cat /proc/pressure/memory
and the story wrote itself.
The fix wasn’t exotic. They set memory.swap.max for the indexing workers, added memory.high to throttle them before OOM,
and moved the indexing workload to dedicated nodes with their own IO budget. The biggest improvement came from the boring bit:
writing down that “memory limits are not host protection” and making it a deployment gate.
Mini-story 2: The optimization that backfired
Another org had a multi-tenant logging pipeline. They wanted to reduce disk write load, so they enabled zswap and bumped swap size.
The initial results looked great: fewer write spikes, smoother IO graphs, fewer immediate OOMs.
Then came a minor incident: a customer turned on verbose logging plus compression at the application layer.
Log volume surged, CPU rose, and memory pressure increased. With zswap, the kernel compressed swapped pages in RAM first.
That reduced swap IO, but increased CPU time spent compressing and reclaiming.
In dashboards, it looked like “CPU saturation,” not “memory pressure.” The team tuned CPU limits, added cores, and made the host bigger.
The system got worse. Bigger RAM meant bigger caches and more to churn; more CPU meant zswap could compress more, delaying the obvious failure.
Latency jitter became permanent.
The backfire wasn’t zswap itself. It was treating it as performance optimization rather than a pressure buffer.
The real problem was uncontrolled memory growth in a parsing stage and no memory.high throttling. Swapping was the symptom,
zswap was the amplifier that made the symptom harder to see.
They fixed it by setting strict per-stage memory ceilings, adding backpressure in the pipeline, and using zswap only on nodes
dedicated to batch processing where latency didn’t matter. They also changed the alert: PSI memory sustained > threshold
became a page-worthy event.
Mini-story 3: The boring but correct practice that saved the day
A financial services team ran a mix of customer-facing APIs and a nightly batch reconciliation on the same Kubernetes cluster.
They were painfully conservative: per-service requests/limits were set, eviction thresholds were reviewed quarterly,
and every node pool had a written “swap policy.” It was not exciting work. It was also why they didn’t have exciting outages.
One night, a batch job started using more memory after a vendor library update. It grew steadily, not explosively.
On a less disciplined team, this becomes a slow swap storm and an incident report with the words “we don’t know.”
Their system did something unglamorous: the job hit its memory limit, got OOM-killed, and restarted with a reduced parallelism setting
(a predefined fallback). The batch ran slower. The APIs stayed fast. The on-call got an alert that was specific:
“batch job OOMKilled; host PSI normal.”
The next morning, they rolled back the library, filed the upstream issue, and adjusted the job’s memory request permanently.
Nobody wrote a heroic Slack message. That’s the point. The correct practice—limits plus controlled failure—kept a “host melts” event from existing at all.
Fixes that stick: limits, OOM, swappiness, and monitoring
1) Put memory limits on every container that matters
No limit is a decision. It’s the decision that the host will be your limit. The host is a bad limit because it fails
collectively: all workloads suffer, then everything collapses.
Use per-service limits sized to the working set, not peak allocation fantasies. For JVMs, that means setting heap
intentionally and leaving headroom for off-heap and native allocations.
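A sketch of that sizing for a JVM service, with illustrative numbers only; JAVA_TOOL_OPTIONS and MaxRAMPercentage assume a container-aware JVM (Java 10+):
# 2 GiB hard cap; heap held to ~60% of the cgroup limit so metaspace,
# thread stacks, and native buffers fit inside the same budget.
docker run -d --name indexer \
  --memory=2g \
  -e JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=60" \
  registry.example.com/indexer:latest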
2) Use memory.high (cgroup v2) to throttle before OOM
memory.max is a cliff. memory.high is a speed limit. When you set memory.high, the kernel starts reclaiming
and throttling allocations once the cgroup exceeds it, which tends to reduce host-wide thrash.
For bursty services, memory.high slightly below memory.max often produces a system that’s slower under pressure but still
controllable, instead of chaotic.
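A minimal sketch of setting it directly through the cgroup v2 files, assuming the systemd driver; systemd may reset hand-written values on reload, so treat this as an experiment, not configuration management:
# Throttle the suspect container at roughly 6 GiB while its hard cap stays at 8 GiB.
CID=$(docker inspect -f '{{.Id}}' search-indexer)
echo 6G | sudo tee /sys/fs/cgroup/system.slice/docker-$CID.scope/memory.high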
3) Bound swap per workload, or turn it off for latency-sensitive services
Unlimited swap is how you end up with a server that is “up” but unusable. For services that must be responsive,
prefer either no swap usage or a very small swap allowance.
If you need swap for batch workloads, isolate them. Swap is not free; it’s a decision to trade latency for completion.
Mixing “must be fast” with “can be slow” on the same swap-backed node is a sociological experiment.
4) Lower swappiness (carefully) on mixed container hosts
vm.swappiness controls how aggressively the kernel swaps anonymous memory versus reclaiming page cache.
On hosts running databases or low-latency services, a lower value (like 10 or 1) can reduce swapping of hot pages.
Don’t cargo-cult vm.swappiness=1 everywhere. If your host relies on swap to avoid OOM and you drop swappiness without fixing sizing,
you’ll just trade swap storms for OOM storms.
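If you do lower it, make the change explicit and persistent instead of a one-off on a single host; the value and the drop-in file name below are conventions, not requirements:
# Apply immediately, then persist across reboots.
sudo sysctl -w vm.swappiness=10
echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/90-swappiness.conf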
5) Prefer controlled OOM over uncontrolled thrash
This is the part people hate emotionally but love operationally: for many services, restarting is cheaper than paging.
If a service can’t run within its budget, killing it is honesty.
Avoid disabling OOM kill for containers unless you deeply understand the consequences. Disabling OOM kill is how you
turn one bad process into a host-wide incident.
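A quick audit sketch for spotting containers that already run with the OOM killer disabled, using the HostConfig.OomKillDisable field that docker inspect exposes:
# Prints any running container whose OOM killer has been turned off.
docker ps -q | xargs -r docker inspect -f '{{.Name}} OomKillDisable={{.HostConfig.OomKillDisable}}' | grep 'OomKillDisable=true'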
Joke #2: Disabling the OOM killer is like removing the fire alarm because it’s loud—now you can enjoy the flames in peace.
6) Watch PSI and reclaim metrics, not just “memory used”
The most useful alerts are the ones that tell you “the kernel is stalling tasks.”
PSI gives you that directly. Pair it with swap-in/out rates and IO latency.
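A minimal sketch of the kind of check worth wiring into real alerting; the 5% threshold is illustrative, and on cgroup v2 the same data exists per cgroup in each memory.pressure file:
# Emit a line only when sustained "full" memory pressure (avg60) exceeds the threshold.
awk '/^full/ { split($3, f, "="); if (f[2] + 0 > 5) print "memory PSI full avg60=" f[2] }' /proc/pressure/memory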
7) Separate IO paths when swap is unavoidable
If swap and your primary workload share the same disk, you’ve created a feedback loop: swap causes IO queueing,
IO queueing slows apps, slow apps hold memory longer, memory pressure increases, swap increases. Congratulations, you built a carousel.
On serious systems, swap belongs on fast storage, sometimes separate devices, or you use zram for bounded emergency headroom
(with CPU headroom accounted for).
8) Make memory budgeting part of deployment, not postmortem
Corporate reality: teams will not “remember to add limits later.” They will ship today. Your job is to turn this into a guardrail:
CI checks for Compose, admission policies for Kubernetes, and runtime alerts for “no limit set.”
Common mistakes: symptoms → root cause → fix
1) Symptom: Host load average is huge; CPU usage looks moderate
Root cause: Threads blocked in IO wait during swap-in or direct reclaim. Load counts runnable + uninterruptible tasks.
Fix: Confirm with vmstat/iostat/ps (D-state). Reduce memory pressure immediately; cap/kill offender; add memory.high.
2) Symptom: Containers show “within memory limit,” but the host swaps heavily
Root cause: Limits exist but swap is unbounded, page cache churn forces global reclaim, or multiple containers collectively exceed host capacity.
Fix: Set per-cgroup swap limits (memory.swap.max) and realistic memory budgets. Don’t oversubscribe without an explicit policy.
3) Symptom: Random latency spikes across unrelated services
Root cause: Global reclaim and IO queueing create cross-service coupling. One memory offender punishes everyone.
Fix: Isolate noisy workloads onto separate nodes/pools; enforce limits; watch PSI.
4) Symptom: “We added more swap and it got worse”
Root cause: More swap increases the time a host can remain in a degraded paging state, amplifying tail latency and operational confusion.
Fix: Treat swap as emergency buffer, not capacity. Prefer OOM/restart for latency-sensitive services, or isolate batch jobs.
5) Symptom: OOM kills happen but the host is still sluggish afterward
Root cause: Swap remains populated; reclaim and refaulting continues; IO queue still draining; caches fragmented.
Fix: After removing offender, allow recovery time; reduce IO pressure; consider temporarily dropping caches only as a last resort and with understanding of blast radius.
6) Symptom: “Free memory” is near zero; someone panics
Root cause: Linux uses RAM for page cache; low free is normal when available is healthy and PSI is low.
Fix: Educate teams to use available and PSI, not free. Alert on pressure and churn, not aesthetics.
7) Symptom: After enabling zswap/zram, CPU increased and throughput dropped
Root cause: Compression overhead plus continued memory pressure. You moved cost from disk to CPU.
Fix: Only enable when CPU headroom exists; cap swap usage; fix the actual memory budget.
8) Symptom: “Swappiness=1 fixed it” (until next week)
Root cause: Reduced swap masked poor memory sizing temporarily; pressure still exists and may now become abrupt OOM.
Fix: Size workloads, set limits, add memory.high/backpressure. Tune swappiness as a finishing move, not the opening act.
Checklists / step-by-step plan
Immediate incident response (15 minutes)
- Run vmstat 1 and cat /proc/pressure/memory. Confirm active paging + stalls.
- Run iostat -xz 1. Confirm disk saturation / await.
- Find offender: docker stats, then per-process swap with smem, or inspect cgroup memory events.
- Mitigate:
- Reduce concurrency / traffic.
- Stop the offender container or restart it with a lower memory footprint.
- If needed, temporarily move the offender to a dedicated host.
- Verify recovery: swap-in/out rates drop, PSI full approaches ~0, IO await normalizes.
Stabilization plan (same day)
- Set memory limits for any container without them.
- On cgroup v2: add memory.high for bursty workloads to reduce thrash.
- Bound swap where appropriate (memory.swap.max), especially for latency-sensitive services.
- Review --oom-kill-disable usage; remove unless you have a very specific reason.
- Adjust node role separation: batch vs latency-sensitive services should not share the same memory/IO fate.
Hardening plan (next sprint)
- Add alerting on PSI memory (some and full) sustained thresholds.
- Add alerting on swap-in/out rate, major faults, and disk await.
- Implement policy-as-code: Compose lints or admission controls requiring memory limits.
- Document per-service memory budgets and restart behavior (what happens on OOM, how fast it recovers).
- Load test with memory pressure: run concurrency spikes and verify the host remains interactive.
FAQ
1) Is swap always bad for Docker hosts?
No. Swap is a tool. It’s useful as an emergency buffer and for batch workloads. It’s dangerous when it becomes a crutch for
undersized services and makes the host “alive but unusable.”
2) Why do my containers show low CPU while the system is slow?
Because they’re blocked. In swap storms, threads often sit in IO wait or direct reclaim. Your CPU graph doesn’t show “waiting on disk”
unless you look at iowait, PSI, or blocked tasks.
3) Should I disable swap to prevent storms?
If you run latency-sensitive services and you have good limits, disabling swap can improve predictability.
But it also makes OOM more likely. The correct approach is usually: set limits and policies first, then decide swap posture.
4) What’s the difference between OOM and swap thrash in terms of user impact?
OOM is abrupt and noisy: a process dies and restarts. Swap thrash is prolonged and sneaky: everything is slow, timeouts cascade,
and you get secondary failures. In many systems, a controlled OOM is the lesser evil.
5) Why does adding RAM sometimes not fix it?
More RAM helps only if the working set fits and you reduce churn. If the workload scales to fill memory (caches, heaps, compactions),
you can end up with the same pressure, just on a larger canvas.
6) How do I set swap limits for Docker containers on cgroup v2?
Docker’s flags can be inconsistent across setups. The reliable way is often to use cgroup v2 controls directly via systemd scopes
or runtime configuration, ensuring memory.swap.max is set for the container’s cgroup. Validate by reading the file in /sys/fs/cgroup.
7) Why is “free memory” low even when the system is fine?
Linux uses spare RAM as cache. That cache is reclaimable. Look at “available” memory and pressure metrics, not “free.”
Low free + low PSI is usually normal.
8) What metrics should I page on to catch swap storms early?
Sustained PSI memory (/proc/pressure/memory), swap-in/out rates, major page faults, disk await/%util,
and per-cgroup memory events (high/max/oom). Alerts should trigger on sustained conditions, not one-second blips.
9) Can overlay2 or filesystem behavior contribute to swap storms?
Indirectly, yes. Heavy filesystem metadata churn and write amplification can saturate IO, making swap-in/out far more painful.
It doesn’t create memory pressure, but it turns memory pressure into a system-wide outage faster.
10) Is zram better than disk swap for containers?
zram avoids disk IO by compressing into RAM. It can soften storms when you have CPU headroom, but it’s still swapping—latency increases,
and it can hide capacity issues. Use it as a buffer, not a permission slip.
Practical next steps
If your Docker host is melting while containers “work,” stop debating whether swap is morally good. Treat it like what it is:
a performance debt instrument with a variable interest rate and a nasty compounding model.
Do these next, in this order:
- Instrument pressure: add PSI-based alerts and dashboards alongside swap activity and IO latency.
- Budget memory per service: set realistic limits; remove “unlimited” as a default.
- Control swap behavior: bound swap per workload (especially latency-sensitive ones) and consider memory.high throttling.
- Prefer controlled failure: allow cgroup OOM + restart for services that can recover, instead of host-wide thrash.
- Isolate the troublemakers: batch indexing, compaction, and anything that likes to balloon should not share nodes with low-latency APIs.
Your goal isn’t “never use swap.” Your goal is “never let swap decide your incident timeline.” The kernel will do what you ask.
Ask for something survivable.