Docker OOM in Containers: The Memory Limits That Prevent Silent Crashes

Your service “randomly restarts.” There’s no stack trace, no nice exception, no goodbye message. One minute it’s serving traffic;
the next it’s reincarnated with a fresh PID and the same unresolved problems.

Nine times out of ten, the culprit is memory: a container hit its limit, the kernel got involved, and the process was killed with the
enthusiasm of a bouncer removing someone who “just wants to talk.”

What “OOM” actually means in Docker (and why it looks like a crash)

“OOM” is not a Docker feature. It’s a Linux kernel decision: Out Of Memory. When the system (or a memory cgroup) can’t satisfy an
allocation, the kernel tries to reclaim memory. If that fails, it kills something to survive.

Docker just provides a convenient arena for this to happen, because containers are usually placed in memory cgroups with explicit
limits. Once you set a limit, you’ve created a little universe where “out of memory” can occur even if the host has plenty of RAM left.
That’s the point: predictable failure boundaries instead of one rogue service eating the node alive.
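You can reproduce that boundary on purpose, which is the fastest way to see what your monitoring does (and doesn’t) catch. A minimal
sketch, assuming the python:3.12-slim image is available; the allocation size and limit are arbitrary:

cr0x@server:~$ # Allocate ~512MiB inside a 128MiB limit, then check how Docker recorded the death.
cr0x@server:~$ docker run --name oom-demo --memory=128m --memory-swap=128m python:3.12-slim python3 -c "x = 'a' * (512*1024*1024)"; echo "exit=$?"
cr0x@server:~$ # Expect exit=137 and OOMKilled=true if the limit was enforced; then clean up.
cr0x@server:~$ docker inspect -f 'OOMKilled={{.State.OOMKilled}}' oom-demo; docker rm oom-demo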

The key nuance: there are two broad OOM scenarios that look similar from the outside.

  • cgroup OOM (container limit hit): the container reaches its memory limit, the kernel kills one or more processes in that
    cgroup. The host may be perfectly healthy.
  • system OOM (node OOM): the whole host runs out of memory. Now the kernel kills processes across the system, including
    dockerd, containerd, and innocent bystanders. This is where “everything went sideways at once” comes from.

In both cases, the process is usually terminated with SIGKILL. SIGKILL means no cleanup, no “flush logs,” no graceful shutdown. Your app
doesn’t so much “crash” as get erased from existence.

One quote worth keeping in your incident channel, attributed to Werner Vogels (paraphrased idea): Everything fails; the job is to design
systems that fail well and recover quickly.

Why you sometimes get “silent” failures

The silence isn’t malicious. It’s mechanics:

  • The kernel kills the process abruptly. Your logger may be buffered. Your last few lines may never hit stdout/stderr.
  • Docker reports the container exited. Unless you query docker inspect or check kernel logs, you may never see “OOMKilled”.
  • Some runtimes (or entrypoints) swallow exit codes by restarting quickly, leaving you with a flapping service and minimal context.

Joke #1: If you think you’ve “handled all exceptions,” congratulations—Linux just found one you can’t catch.

Facts and history that change how you debug

These aren’t trivia for trivia’s sake. Each one nudges a troubleshooting decision in the right direction.

  1. cgroups landed years before “containers” became a product. Google engineers proposed cgroups in the mid-2000s; Docker
    simply made them approachable. Implication: the kernel, not Docker, is the authority.
  2. Early memory cgroups were conservative and sometimes surprising. Historically, memory accounting and OOM behavior in cgroups
    matured over multiple kernel versions. Implication: “works on my laptop” can be a kernel-version mismatch.
  3. cgroups v2 changed the knobs. On many modern distros, memory limit files moved from memory.limit_in_bytes (v1)
    to memory.max (v2). Implication: your scripts need to detect which mode you’re in.
  4. Swap is not “extra RAM,” it’s deferred pain. Swap delays OOM at the cost of latency spikes and lock contention. Implication:
    you can “fix” OOM by enabling swap and still ship an outage—just slower.
  5. Exit code 137 is a clue, not a diagnosis. 137 usually means SIGKILL (128+9). OOM is a common cause, but not the only one.
    Implication: confirm with cgroup counters and kernel logs.
  6. Page cache is memory too. The kernel uses free memory for cache; in cgroups, page cache can be charged to the cgroup depending
    on settings and kernel. Implication: heavy I/O containers can OOM “without leaks.”
  7. Overcommit is a policy, not a promise. Linux can allow allocations that exceed physical memory (overcommit); pages are only
    backed when first touched, and that’s when the OOM killer can strike. Implication: a big allocation can succeed and still OOM later under load.
  8. OOM killer selects victims based on badness scoring. The kernel computes a score; high memory consumers with low “importance”
    are preferred. Implication: the process killed may not be the one you expected.
  9. Containers don’t isolate the kernel. A kernel-level OOM can still take out multiple containers, or the runtime itself.
    Implication: host-level monitoring still matters in “containerized” environments.

cgroups memory accounting: what’s counted, what isn’t, and why you care

Before you touch a limit, understand what the kernel counts toward it. Otherwise you’ll “fix” the wrong thing and keep paging the SRE on
weekends.

Memory buckets you need to recognize

Inside a container, memory usage is not just heap. The usual suspects:

  • Anonymous memory: heap, stacks, malloc arenas, language runtimes, in-memory caches.
  • File-backed memory: memory-mapped files, shared libraries, and page cache associated with files.
  • Page cache: cached disk reads; makes things fast until it makes things dead.
  • Kernel memory: network buffers, slab allocations. Accounting differs across versions and cgroup modes.
  • Shared memory: /dev/shm usage; common for browsers, databases, and IPC-heavy apps.

cgroups v1 vs v2: what changes for OOM behavior

In cgroups v1, memory knobs are scattered across controllers, and the “memsw” controller (memory+swap) is optional. In v2, things are more
unified: memory.current, memory.max, memory.high, memory.swap.max, and better eventing.

Operationally, the big improvement is memory.high in v2: you can set a “throttle” limit to induce reclaim pressure before you
reach the hard kill limit. It’s not magic, but it’s a real lever for “degrade instead of die.”
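Docker’s CLI doesn’t expose memory.high directly, but on a cgroups v2 host you can write it into the container’s cgroup by hand. Treat this
as a sketch only: the path lookup depends on your cgroup driver, and the value doesn’t survive a container restart.

cr0x@server:~$ CID=$(docker inspect -f '{{.Id}}' api-1)
cr0x@server:~$ CGPATH=$(sudo find /sys/fs/cgroup -type d -name "*$CID*" | head -n 1)
cr0x@server:~$ # Ask for reclaim pressure at ~800MiB, well below a 1GiB memory.max.
cr0x@server:~$ echo $((800*1024*1024)) | sudo tee "$CGPATH/memory.high"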

OOM inside the container vs OOM outside the container

If you only remember one thing: a container-level OOM is usually a budgeting problem; a host-level OOM is usually a capacity or oversubscription
problem.

  • Container OOM: you set --memory too low, you forgot about page cache, your app leaks, or you have a one-time spike
    (JIT warm-up, cache fill, compaction).
  • Host OOM: too many containers with generous limits, no limits at all, node memory pressure from non-container processes, or
    swap misconfiguration.

The memory limits that matter (and the ones that just make you feel better)

Docker gives you a handful of flags that look straightforward. They are. The part that isn’t straightforward is how they interact with swap,
the kernel’s reclaim behavior, and your application’s allocation patterns.

Hard limit: --memory

This is the line in the sand. If the container crosses it and cannot reclaim enough, the kernel kills processes in that cgroup.

What to do:

  • Set a hard limit for every production container. No exceptions.
  • Leave headroom above steady-state usage. If your service uses 600MiB, don’t set 650MiB and call it “tight.” Call it “brittle.”
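As a sketch (the image name is a placeholder): a service that steady-states around 600MiB gets a 1GiB hard limit, matching the budget
you’ll see inspected in the tasks below.

cr0x@server:~$ # 1GiB hard limit; --memory-swap equal to --memory means no swap on top of it.
cr0x@server:~$ docker run -d --name api-1 --memory=1g --memory-swap=1g your-api-image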

Swap policy: --memory-swap

Docker’s swap behavior confuses even experienced operators because the meaning depends on cgroup mode and kernel, and because the default
isn’t always what you assume.

Practical stance:

  • If you can afford it, prefer little to no swap for latency-sensitive services, and size the hard limit correctly.
  • If you must allow swap, do it intentionally and monitor major page faults and latency. Swap hides memory pressure until it becomes a fire.
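In flag form, the two stances look like this (image names are placeholders; the 512MiB of swap is an arbitrary example):

cr0x@server:~$ # Latency-sensitive: no swap for this cgroup (same shape as the api-1 example above).
cr0x@server:~$ docker run -d --memory=1g --memory-swap=1g your-api-image
cr0x@server:~$ # Batch worker that tolerates paging: --memory-swap is memory+swap combined, so this allows 512MiB of swap.
cr0x@server:~$ docker run -d --name batch-1 --memory=1g --memory-swap=1536m your-batch-image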

Soft-ish limits: reservations and v2 memory.high

Docker exposes --memory-reservation, which is not a guaranteed minimum. It’s a soft limit: the kernel treats it as a reclaim target when the
host is under memory pressure, and some orchestrators use it for scheduling, but nothing is actually reserved. On cgroups v2 systems,
memory.high is the real “soft cap” tool.

In plain terms: reservations are for planning; hard limits are for survival; memory.high is for shaping.
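A sketch of the reservation flag in context (placeholder image); the hard limit still decides who dies, the reservation only shapes reclaim:

cr0x@server:~$ # Soft target 768MiB, hard ceiling 1GiB.
cr0x@server:~$ docker run -d --memory=1g --memory-reservation=768m your-api-image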

What you should avoid

  • Unlimited containers on a shared node. That’s not “flexible.” That’s roulette with your runtime.
  • Hard limits with no observability. If you don’t measure memory, you’re just choosing a failure time.
  • Setting limits equal to JVM max heap (or Python “expected usage”) without overhead. Runtimes, native libs, threads, and page
    cache will laugh at your spreadsheet.
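For JVM workloads specifically, let the runtime size its heap from the cgroup limit instead of hardcoding -Xmx to the container limit. A
hedged sketch; the image name and the 75% figure are assumptions, not a universal recommendation:

cr0x@server:~$ # Heap gets at most 75% of the 1GiB cgroup limit; the rest is for metaspace, threads, and native allocations.
cr0x@server:~$ docker run -d --memory=1g -e JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0" your-jvm-image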

Fast diagnosis playbook

This is the “I’m on-call and it’s 02:13” sequence. Don’t overthink it. Start here, get signal fast, then dig deeper.

First: confirm it was actually OOM

  • Check container state: was it OOMKilled, did it exit 137, or was it a different SIGKILL?
  • Check kernel logs for OOM events tied to that cgroup or PID.

Second: determine where the OOM happened

  • Was it a cgroup OOM (container limit) or system OOM (node exhausted)?
  • Look at node memory pressure and other containers’ usage at the same timestamp.

Third: decide if it’s a sizing problem or a leak/problematic spike

  • Compare steady-state memory to limit. If usage slowly climbs until death, suspect leak.
  • If it dies during a known event (deploy, cache warm-up, batch job), suspect spike and missing headroom.
  • If it dies under I/O load, suspect page cache, mmap, or tmpfs (/dev/shm) usage.

Fourth: fix safely

  • Short-term: raise the limit or reduce concurrency to stop the bleeding.
  • Medium-term: add instrumentation, capture heap/profile dumps, and reproduce.
  • Long-term: enforce limits everywhere; set budgets; create alarms on “approaching limit” not just restarts.
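A crude “approaching limit” check you can wire into cron until real alerting exists (a sketch, not a monitoring system; the 85% threshold
is arbitrary):

cr0x@server:~$ # Print containers currently above ~85% of their memory limit.
cr0x@server:~$ docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | awk '{gsub(/%/,"",$2); if ($2+0 > 85) print $1, $2"%"}'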

Practical tasks: commands, outputs, and decisions (12+)

These are meant to be run on a Docker host. Some tasks need root. All of them produce information you can act on.

Task 1: Check if Docker thinks the container was OOMKilled

cr0x@server:~$ docker inspect -f '{{.Name}} OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}} Error={{.State.Error}}' api-1
/api-1 OOMKilled=true ExitCode=137 Error=

What it means: OOMKilled=true is your smoking gun; exit code 137 indicates SIGKILL.
Decision: Treat as memory-limit breach unless kernel logs show system-wide OOM.

Task 2: Check restart count and last finish time (correlate with load/deploy)

cr0x@server:~$ docker inspect -f 'RestartCount={{.RestartCount}} FinishedAt={{.State.FinishedAt}} StartedAt={{.State.StartedAt}}' api-1
RestartCount=6 FinishedAt=2026-01-02T01:58:11.432198765Z StartedAt=2026-01-02T01:58:13.019003214Z

What it means: Frequent restarts close together usually indicate a hard failure loop, not a one-off.
Decision: Freeze the deployment pipeline and collect forensic data before the next restart overwrites it.

Task 3: Read the kernel’s OOM narrative

cr0x@server:~$ sudo dmesg -T | tail -n 20
[Thu Jan  2 01:58:11 2026] Memory cgroup out of memory: Killed process 24819 (python) total-vm:1328452kB, anon-rss:812340kB, file-rss:12044kB, shmem-rss:0kB, UID:1000 pgtables:2820kB oom_score_adj:0
[Thu Jan  2 01:58:11 2026] oom_reaper: reaped process 24819 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

What it means: “Memory cgroup out of memory” indicates a container-level OOM, not a host-wide one.
Decision: Focus on container limits and per-container memory behavior, not node capacity—yet.

Task 4: Confirm whether the host experienced a system OOM

cr0x@server:~$ sudo dmesg -T | egrep -i 'out of memory|oom-killer|Killed process' | tail -n 10
[Thu Jan  2 01:58:11 2026] Memory cgroup out of memory: Killed process 24819 (python) total-vm:1328452kB, anon-rss:812340kB, file-rss:12044kB, shmem-rss:0kB, UID:1000 pgtables:2820kB oom_score_adj:0

What it means: You see a cgroup OOM event but no system-wide “Out of memory: Kill process …” preamble.
Decision: You likely don’t need to evacuate the node; you need to stop that container from hitting the wall.

Task 5: Check the container’s configured memory limits

cr0x@server:~$ docker inspect -f 'Memory={{.HostConfig.Memory}} MemorySwap={{.HostConfig.MemorySwap}} MemoryReservation={{.HostConfig.MemoryReservation}}' api-1
Memory=1073741824 MemorySwap=1073741824 MemoryReservation=0

What it means: Hard limit is 1GiB. Swap limit equals memory limit (effectively no swap beyond RAM for that cgroup).
Decision: If the process needs occasional bursts above 1GiB, you either raise the limit or reduce the burst.

Task 6: Identify cgroup version (v1 vs v2) to choose the right files

cr0x@server:~$ stat -fc %T /sys/fs/cgroup
cgroup2fs

What it means: cgroup2fs indicates cgroups v2.
Decision: Use memory.current, memory.max, and memory.events for precise signals.

Task 7: Map a container to its cgroup path and read current usage (v2)

cr0x@server:~$ CID=$(docker inspect -f '{{.Id}}' api-1); echo $CID
b1d6e0b6e4c7c14e2c8c3ad3b0b6e9b7d3c1a2f7d9f5d2e1c0b9a8f7e6d5c4b3
cr0x@server:~$ CG=$(systemctl show -p ControlGroup docker.service | cut -d= -f2); echo $CG
/system.slice/docker.service
cr0x@server:~$ sudo find /sys/fs/cgroup$CG -name "*$CID*" | head -n 1
/sys/fs/cgroup/system.slice/docker.service/docker/b1d6e0b6e4c7c14e2c8c3ad3b0b6e9b7d3c1a2f7d9f5d2e1c0b9a8f7e6d5c4b3
cr0x@server:~$ sudo cat /sys/fs/cgroup/system.slice/docker.service/docker/$CID/memory.current
965312512

What it means: About 920MiB in use right now (bytes). If the limit is 1GiB, you’re close.
Decision: If this is steady-state, raise the limit. If it’s climbing, start leak/spike investigation.

Task 8: Check memory limit and OOM events counter (v2)

cr0x@server:~$ sudo cat /sys/fs/cgroup/system.slice/docker.service/docker/$CID/memory.max
1073741824
cr0x@server:~$ sudo cat /sys/fs/cgroup/system.slice/docker.service/docker/$CID/memory.events
low 0
high 12
max 3
oom 3
oom_kill 3

What it means: The cgroup hit memory.max three times; three OOM kills occurred. high is non-zero,
indicating repeated reclaim pressure.
Decision: This is not a fluke. Either raise the budget or reduce peak memory demands (or both).

Task 9: See which processes in the container are using memory

cr0x@server:~$ docker top api-1 -o pid,ppid,cmd,rss
PID    PPID   CMD                          RSS
25102  25071  python /app/server.py         612m
25134  25102  python /app/worker.py         248m
25160  25102  /usr/bin/ffmpeg -i pipe:0     121m

What it means: The main process plus worker and a native binary (ffmpeg) share the container’s memory budget.
Decision: If the “sidecar-ish” subprocess is unbounded, cap concurrency or move it to its own container with its own limits.

Task 10: Check container memory usage trend quickly (Docker stats)

cr0x@server:~$ docker stats --no-stream api-1
CONTAINER ID   NAME    CPU %     MEM USAGE / LIMIT     MEM %     NET I/O     BLOCK I/O   PIDS
b1d6e0b6e4c7   api-1   184.21%   972.4MiB / 1GiB       94.96%    1.3GB/1.1GB  2.8GB/1.9GB  28

What it means: The container is riding at ~95% of its limit. That’s not “efficient,” it’s “one request away from murder.”
Decision: Increase limit immediately or reduce load. Then investigate why it’s so close.

Task 11: Differentiate anonymous memory vs file cache (cgroup v2 memory.stat)

cr0x@server:~$ sudo egrep 'anon|file|shmem|slab' /sys/fs/cgroup/system.slice/docker.service/docker/$CID/memory.stat | head -n 20
anon 843018240
file 76292096
shmem 0
slab 31260672
file_mapped 21434368

What it means: Most usage is anonymous memory (heap/stack/native allocations), not page cache.
Decision: Focus on application allocations: leak hunting, runtime tuning, request payload sizing, concurrency.

Task 12: Verify whether /dev/shm is quietly too small (or too large)

cr0x@server:~$ docker exec api-1 df -h /dev/shm
Filesystem      Size  Used Avail Use% Mounted on
shm              64M   12M   52M  19% /dev/shm

What it means: Default shm size is 64MiB unless configured. Some workloads (Chromium, Postgres extensions, ML inference) need more.
Decision: If the app uses shared memory heavily, set --shm-size and account for it in your memory budget.
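A sketch of sizing shm explicitly (placeholder image). Remember that shm is tmpfs, so whatever lands in it is charged against the same
memory budget:

cr0x@server:~$ # 512MiB /dev/shm inside a 2GiB memory budget.
cr0x@server:~$ docker run -d --memory=2g --shm-size=512m your-browser-image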

Task 13: Check host memory pressure (to rule out node OOM risk)

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        49Gi       1.2Gi       1.1Gi        12Gi        8.4Gi
Swap:            0B          0B          0B

What it means: Host has low “free” but decent “available” due to cache; swap is disabled.
Decision: Node isn’t currently OOM, but you’re operating with thin margins. Don’t run unlimited containers.

Task 14: See per-container limits quickly to spot “unlimited” landmines

cr0x@server:~$ docker ps -q | xargs -n1 docker inspect -f '{{.Name}} mem={{.HostConfig.Memory}} swap={{.HostConfig.MemorySwap}}'
/api-1 mem=1073741824 swap=1073741824
/worker-1 mem=0 swap=0
/cache-1 mem=536870912 swap=536870912

What it means: mem=0 means no memory limit. That container can consume the node.
Decision: Fix the unlimited container first. One unlimited process can convert a container OOM into a node OOM.

Task 15: Detect whether the container was killed but immediately restarted by policy

cr0x@server:~$ docker inspect -f 'RestartPolicy={{.HostConfig.RestartPolicy.Name}} MaxRetry={{.HostConfig.RestartPolicy.MaximumRetryCount}}' api-1
RestartPolicy=always MaxRetry=0

What it means: “always” restart makes failures noisy only if you watch restarts; otherwise it turns OOM into “it fixed itself.”
Decision: Keep restart policies, but add alerting on restart rate and OOMKilled state.
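A quick audit one-liner for exactly that alerting gap, assuming a snapshot is enough until a real alert pipeline exists:

cr0x@server:~$ # Which containers have been OOM-killed or are quietly restart-looping right now?
cr0x@server:~$ docker ps -aq | xargs -r -n1 docker inspect -f '{{.Name}} OOMKilled={{.State.OOMKilled}} Restarts={{.RestartCount}} Exit={{.State.ExitCode}}'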

Task 16: Confirm the container’s view of memory (important for JVM/Go tuning)

cr0x@server:~$ docker exec api-1 sh -lc 'cat /proc/meminfo | head -n 3'
MemTotal:       65843056 kB
MemFree:         824512 kB
MemAvailable:   9123456 kB

What it means: Some processes still “see” host memory via /proc/meminfo, depending on kernel/runtime.
Decision: Ensure your runtime is container-aware (modern JVMs and Go generally are). If not, set explicit heap limits.
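If the runtime isn’t container-aware (or you don’t trust it to be), set the ceiling yourself. A sketch for a Go service; the image name and
the 800MiB figure are assumptions:

cr0x@server:~$ # GOMEMLIMIT (Go 1.19+) gives the Go runtime a soft ceiling below the 1GiB cgroup hard limit.
cr0x@server:~$ docker run -d --memory=1g -e GOMEMLIMIT=800MiB your-go-service-image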

Three corporate mini-stories from the OOM trenches

Mini-story 1: An incident caused by a wrong assumption

A mid-sized SaaS company migrated a handful of monolith endpoints into a shiny new “API” container. The team set a 512MiB memory limit because
the service “only handled JSON” and “memory is mostly for databases.” That assumption lasted about one business day.

The service used a popular Python web stack with a few native dependencies. Under normal traffic it hovered around 250–300MiB. Under a
marketing-driven spike, it began processing unusually large payloads (still valid JSON; just huge). The request bodies were parsed, copied, and
validated multiple times. Peak memory rose in steps with concurrency.

The failures looked like random restarts. Logs ended mid-line. Their APM showed incomplete traces and weird gaps. Someone blamed the load
balancer. Someone blamed the last deploy. The incident commander did the boring thing: docker inspect, then dmesg.
There it was: cgroup OOM kills.

Fixing it took two moves. First, they raised the limit to stop the hemorrhaging and reduced worker concurrency. Second, they added payload size
limits and streaming parsing for large bodies. The lesson wasn’t “give everything more RAM.” The lesson was that memory limits must be set with
awareness of worst-case input size, not average behavior.

Mini-story 2: An optimization that backfired

A finance-adjacent platform got obsessed with p99 latency. An engineer noticed frequent disk reads and decided to “warm the cache” aggressively on
startup: read a large set of reference data, precompute results, keep it all in memory for speed. It worked beautifully in staging.

Production was different: multiple replicas restarted during deployments, all warming simultaneously. The node had plenty of RAM, but each
container had a strict 1GiB limit because the team was trying to pack instances tightly. The warm-up phase briefly pushed memory to ~1.1GiB per
instance due to temporary allocations during parsing and building indexes. Temporary allocations are still allocations; the kernel doesn’t care
that you meant well.

The OOM pattern was cruel: containers died during warm-up, got restarted, warmed again, died again. A classic crash loop, except no crash stack.
The team’s “optimization” became self-inflicted denial of service during deploys.

The fix was to make warm-up incremental and bounded. They also introduced a startup memory budget: measure peak usage during warm-up and set the
container limit with headroom. Performance work that ignores memory budgets isn’t performance work; it’s just an outage with better graphs.

Mini-story 3: A boring but correct practice that saved the day

A logistics company ran a Docker fleet on a few beefy nodes. Nothing fancy. Their most “innovative” habit was a weekly audit: every container had
explicit CPU and memory limits, and every service had an agreed memory budget with a small buffer.

One week, a vendor library update introduced a leak in a rarely used code path. Memory climbed slowly—tens of megabytes per hour—until the service
hit its 768MiB limit and got OOM-killed. But the blast radius stayed small: only that container died, not the node.

Their alerting wasn’t “container down,” which is too late. It was “container memory at 85% of limit for 10 minutes.” The on-call saw the trend
early, rolled back the library, and opened a ticket for a deeper investigation. Users saw a brief blip, not a multi-service outage.

The boring practice—limits everywhere, budgets reviewed, and alarms on approaching limits—didn’t prevent the bug. It prevented the bug from
becoming a platform incident. That’s the bar.

Common mistakes: symptom → root cause → fix

1) “Container restarts with exit code 137, but there’s no OOMKilled flag”

Symptom: Exit 137, restarts, no clear OOM indicator in Docker state.
Root cause: It was killed by something else: manual kill, orchestrator timeout, host-level kill, or a supervisor sending SIGKILL
after SIGTERM grace expired.
Fix: Check dmesg for OOM lines and inspect orchestrator events. If no OOM evidence exists, treat as forced kill and
look at health checks and stop timeouts.

2) “We set a memory limit, but the node still OOMs”

Symptom: Host gets system OOM, multiple containers die, sometimes dockerd/containerd gets hit.
Root cause: Some containers have no limits, or limits exceed node capacity when summed, or the host has big non-container memory
consumers. Also: page cache and kernel memory don’t negotiate with your spreadsheet.
Fix: Enforce limits on all containers, reserve memory for the OS, and stop overpacking. If you allow swap, do it intentionally and
monitor. If you run without swap, be stricter about headroom.

3) “We increased the limit and it still OOMs”

Symptom: Same pattern, higher number, same death.
Root cause: Memory leak or unbounded workload (queue depth, concurrency, payload size, caching). Raising the limit only changes the
time-to-failure.
Fix: Cap concurrency, cap queue size, cap payload size, and instrument memory. Then profile: heap dumps, alloc profiling, native
memory tracking for mixed workloads.

4) “It OOMs only during deploys / restarts”

Symptom: Stable for days, then dies during rollout.
Root cause: Startup spike: cache warm-up, JIT compilation, migrations, index builds, or thundering herd when many replicas warm at
once.
Fix: Stagger restarts, bound warm-up work, and set limits based on peak startup usage, not just steady state. Consider readiness
gates that prevent traffic until warm-up is complete.
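One way to express “no traffic until warm-up finishes” with plain Docker health checks; the endpoint, timings, and image are assumptions,
and Docker only records health state: whatever routes traffic still has to honor it.

cr0x@server:~$ # start-period gives warm-up time before failed health checks count against the container.
cr0x@server:~$ docker run -d --memory=1536m --health-cmd='curl -fsS http://localhost:8080/ready || exit 1' --health-start-period=90s --health-interval=10s --health-retries=3 your-api-image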

5) “It OOMs under I/O load even though the heap looks fine”

Symptom: Heap metrics look stable; OOM appears during heavy reads/writes.
Root cause: Page cache and memory-mapped files, or tmpfs usage (including /dev/shm), being charged to the cgroup.
Sometimes also kernel buffers for network or storage.
Fix: Examine memory.stat file vs anon usage. Reduce mmap footprint, tune caching strategy, or raise limit to include
cache budget. Ensure tmpfs/shm sizing is intentional.

6) “We disabled swap and now we see more OOMs”

Symptom: OOM kills increased after swap-off change.
Root cause: Swap previously masked memory oversubscription. Without swap, the system reaches hard constraints sooner.
Fix: Don’t re-enable swap as a reflex. First, correct container budgets and reduce oversubscription. If swap is required, set clear
SLO-based rules: which workloads may swap, and how you’ll detect when swap becomes user-visible latency.

7) “OOM kills the wrong process in the container”

Symptom: A helper process dies, or the main process dies first unexpectedly.
Root cause: OOM badness scoring plus per-process memory usage at kill time. Multi-process containers complicate victim selection.
Fix: Prefer one main process per container. If you must run multiple, isolate memory-heavy helpers into separate containers or tune
architecture so a single kill doesn’t create cascading failure.

Joke #2: The OOM killer is the only coworker who never misses a deadline—unfortunately, it also has the soft skills of a guillotine.

Checklists / step-by-step plan

Checklist: set memory limits that prevent silent crashes

  1. Measure steady-state memory for at least one full traffic cycle (day/week depending on workload).
  2. Measure peak memory during deploy/startup, traffic spikes, and background jobs.
  3. Pick a hard limit with headroom above peak (not above average). If you can’t afford headroom, your node density is fantasy.
  4. Decide swap policy: none for latency-sensitive services; limited swap for batch workloads if acceptable.
  5. Account for non-heap overhead: threads, native libs, memory maps, page cache, tmpfs, and runtime metadata.
  6. Set alerts on 80–90% of container memory limit, not just restarts.
  7. Track OOM events via kernel logs and cgroup counters.
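For item 7, a sketch of reading the cgroup v2 counters directly (path discovery is the same idea as Task 7; adapt to your cgroup driver):

cr0x@server:~$ CID=$(docker inspect -f '{{.Id}}' api-1)
cr0x@server:~$ CGPATH=$(sudo find /sys/fs/cgroup -type d -name "*$CID*" | head -n 1)
cr0x@server:~$ # A growing oom_kill counter means the kernel is already killing things in this cgroup.
cr0x@server:~$ sudo awk '$1 == "oom_kill" {print "oom_kill=" $2}' "$CGPATH/memory.events"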

Step-by-step plan: when a container OOMs in production

  1. Confirm OOM: docker inspect for OOMKilled and dmesg for cgroup OOM lines.
  2. Stabilize: temporarily raise limit or reduce concurrency/load. If it’s a crash loop, consider pausing deployments and scaling
    down traffic to the affected instance.
  3. Classify: leak vs spike vs cache. Use memory.stat to split anon vs file.
  4. Collect evidence: capture runtime heap dumps/profiles (where applicable), request sizes, queue depths, and the time correlation
    with deploys.
  5. Fix the trigger: add caps (payload, concurrency, cache size), fix leak, or reduce startup burst.
  6. Make it harder to repeat: enforce limits on all services and codify budgets with CI/CD checks or policy enforcement.
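A minimal “no unlimited containers” gate you could run from CI or a nightly cron; a sketch under the assumption that a non-zero exit is
enough to fail your pipeline (real policy engines do this more gracefully):

cr0x@server:~$ # Exits non-zero if any running container has no memory limit (HostConfig.Memory == 0).
cr0x@server:~$ docker ps -q | xargs -r -n1 docker inspect -f '{{.Name}} {{.HostConfig.Memory}}' | awk '$2 == 0 {print "unlimited:", $1; bad=1} END {exit bad}'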

Checklist: avoid turning container OOM into node OOM

  1. No unlimited memory containers on shared nodes (unless it’s a deliberate, isolated node pool).
  2. Reserve memory for the host: system daemons, filesystem cache, monitoring, and runtime overhead.
  3. Don’t sum hard limits to 100% of RAM; leave real headroom.
  4. Monitor node memory pressure and reclaim activity, not just “free memory.”
  5. Decide swap policy at the node level and ensure it matches workload SLOs.

FAQ (the questions people ask right after the incident)

1) What’s the difference between OOMKilled and exit code 137?

OOMKilled is Docker telling you the kernel killed the container due to memory cgroup OOM. Exit code 137 means the process got SIGKILL,
which is consistent with OOM but can also come from other forced kills. Always confirm with dmesg and cgroup event counters.

2) If the host has free memory, why does my container OOM?

Because you gave the container its own budget via cgroups. The container can hit --memory and be killed even if the node has RAM.
That’s the isolation boundary doing its job—assuming you set the budget correctly.

3) Should I set memory limits on every container?

Yes. Production nodes without per-container limits eventually become “one bad deploy away” from a node-wide OOM. If you need one special
container without a limit, give it a dedicated node or at least a dedicated risk conversation.

4) How much headroom should I leave?

Enough to survive known spikes plus a margin for unknowns. Practically: measure peak and add buffer. If you can’t add buffer, reduce peak via
caps or reduce density. Headroom is cheaper than incidents.

5) Is swap good or bad for Docker containers?

Swap is a trade. For latency-sensitive services, swap often converts “hard failure” into “slow meltdown,” which can be worse. For batch jobs,
limited swap can improve throughput by avoiding kill storms. Decide per workload and monitor page faults and latency.

6) Why do my logs cut off right before the crash?

OOM kills are usually SIGKILL: the process can’t flush buffers. If you need better last-gasp observability, flush logs more frequently, write
critical events synchronously (sparingly), or use an external log collector that doesn’t depend on the dying process to be polite.

7) Can I make the kernel kill a different process instead?

Inside a single-container cgroup, victim selection is limited to processes in that cgroup. You can sometimes influence behavior with
oom_score_adj, but it’s not a reliable “pick that one” mechanism. The real fix is architectural: avoid multi-process containers for
memory-heavy helpers, or isolate them.

8) Docker Compose memory limits: do they actually work?

They work when deployed in a mode that honors them. Compose in classic mode maps to Docker runtime constraints; in swarm mode, constraints are
expressed differently. The only safe answer is to verify with docker inspect and cgroup files on the host.
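For reference, the two places a memory limit usually lives in a Compose file; which one takes effect depends on how you deploy, so verify
on the host as described above. The service and image names are placeholders, and you’d normally pick one style, not both:

cr0x@server:~$ cat docker-compose.yml
services:
  api:
    image: your-api-image
    mem_limit: 1g            # classic/non-swarm docker compose
    memswap_limit: 1g
    deploy:
      resources:
        limits:
          memory: 1g         # swarm (docker stack deploy); newer docker compose versions can honor it too
cr0x@server:~$ docker inspect -f 'mem={{.HostConfig.Memory}}' $(docker compose ps -q api)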

9) How do I know if it’s a memory leak?

Look for a monotonic climb over time under roughly similar workload, ending in OOM. Confirm with runtime-level metrics (heap, RSS) and cgroup
counters. If it climbs only during specific events, it’s more likely a spike or unbounded cache.

10) Why does the container’s view of total memory look like the host’s?

Depending on kernel and runtime configuration, /proc reporting may reflect host totals, while enforcement still happens via cgroups.
Modern runtimes often detect cgroup limits directly, but don’t assume: set explicit memory flags for runtimes that misbehave.

Conclusion: next steps that actually reduce outages

Container OOMs are one of the few failure modes that are both predictable and preventable. Predictable because the limit is explicit.
Preventable because you can size it, observe it, and shape behavior before the kernel pulls the plug.

Do these next, in this order:

  1. Enforce memory limits on every container and hunt down the mem=0 offenders.
  2. Add alerts on approaching the limit (80–90%) and on OOMKilled events, not just restarts.
  3. Measure peak memory during deploy and warm-up; set limits based on the real peak, not the calm moments.
  4. Cap unbounded behaviors: payload size, queue depth, concurrency, cache growth.
  5. Decide swap policy intentionally instead of inheriting whatever the node happens to have.

Memory limits don’t make your service “safe.” They make your failure contained. That’s the whole point of containers: not to prevent
failure, but to keep failure from spreading like gossip in an office.
