Ubuntu 24.04 OOM killer: prove it, fix it, prevent repeats

Your service was healthy. Then it wasn’t. A restart loop. A gap in metrics. Someone says “maybe a deploy?”
You check the timeline and find nothing—except the cold, efficient hand of the OOM killer.

On Ubuntu 24.04, you can lose a process to either the kernel OOM killer or systemd’s
userspace OOM manager. If you don’t prove which one did it (and why), you’ll “fix” the wrong layer and
the incident will return, right on schedule, usually at 02:17.

What “OOM” actually means in 2025 Linux

“Out of memory” sounds binary, like the system ran out of RAM. In production it’s more like a queueing
problem that turns into a knife fight: allocation requests arrive, reclaim and compaction struggle,
the kernel or userspace OOM logic decides forward progress requires sacrifice, and your service becomes
the sacrifice.

On Ubuntu 24.04 you’re typically dealing with:

  • Kernel OOM killer (classic): triggered when the kernel cannot satisfy memory allocations
    and reclaim fails; it selects a victim process based on badness scoring.
  • cgroup v2 memory enforcement: memory limits can trigger OOM within a cgroup even while
    the host has free RAM; the victim is selected inside that cgroup.
  • systemd-oomd: userspace daemon that proactively kills processes or slices when PSI
    (pressure stall information) indicates sustained memory pressure, often before the kernel hits a hard OOM.

Eight facts that make you better at this

  1. The kernel OOM killer predates Linux containers by many years; it was built for “one big system,”
    then retrofitted into cgroups where “OOM in a box” became normal.
  2. PSI (pressure stall information) is relatively new in Linux history and changed the game:
    you can measure “time stalled waiting for memory,” not just “free bytes.”
  3. Ubuntu’s default swap story has changed over the years. Some fleets moved to swapfiles
    by default; others disabled swap in container hosts, which makes OOM events sharper and faster.
  4. Page cache is memory too. “Free” memory can be low while the system is still healthy,
    because cache is reclaimable. The trick is to know when reclaim stops working.
  5. OOM isn’t only about leaks. You can OOM from a legitimate workload spike, a thundering herd,
    or a single query that builds a giant in-memory hash table.
  6. cgroup v2 changed semantics. Under unified hierarchy, memory.current, memory.max, and PSI
    are first-class. If you’re still reasoning in cgroup v1 terms, you’ll misdiagnose limits.
  7. systemd-oomd can kill “the wrong thing” from your point of view because it targets units/slices
    based on pressure and configured policies, not business impact.
  8. Overcommit is policy, not magic. Linux can promise more virtual memory than exists; the bill
    comes due later, often at peak traffic.

One line to keep taped to your monitor: “Hope is not a strategy.” It gets repeated in operations
and reliability circles for a reason. The point stands: you need evidence, then controls.

Short joke #1: The OOM killer is like budget season—everyone’s surprised, and it still picks a victim.

Fast diagnosis playbook (first/second/third)

When a service dies and you suspect OOM, you want fast certainty, not a two-hour archeological dig.
Here’s the triage order that wins time.

First: confirm a kill and identify the killer

  • Look for kernel OOM lines in dmesg / journal: “Out of memory”, “Killed process”.
  • Look for systemd-oomd actions: “Killed … due to memory pressure” in the journal.
  • Check if the service is in a cgroup with memory.max set: container runtime, systemd unit limit,
    Kubernetes QoS, etc.

Second: confirm it was memory pressure, not a crash disguised as one

  • Service exit code: 137 (SIGKILL) is common for OOM-related kills (containers especially), but not exclusive.
  • Check oom_kill counters in cgroup memory.events.
  • Correlate with PSI memory “some/full” pressure spikes.

Third: locate the pressure source

  • Was it one process growing? (RSS growth, heap leak, runaway threads)
  • Was it many processes? (fork bomb, thundering herd, log fanout, sidecars)
  • Was it reclaim failure? (dirty pages, IO stall, swap thrash, memory fragmentation)
  • Was it a limit? (container limit too low, unit MemoryMax, wrong QoS class)

If you do only one thing: decide whether this was kernel OOM, cgroup OOM, or
systemd-oomd. The prevention changes completely.

Prove it from logs: kernel vs systemd-oomd vs cgroup

Kernel OOM killer fingerprints

Kernel OOM leaves very distinctive breadcrumbs: allocation context, the victim process, the “oom_score_adj,”
and often a list of tasks with memory stats. You don’t need guesswork; you need to read the right place.

systemd-oomd fingerprints

systemd-oomd is quieter but still auditable. It will log decisions and the unit it targeted. It acts
on sustained pressure signals and configured policies, not on “allocation failed.”

cgroup limit OOM fingerprints

If you’re in containers or systemd services with memory limits, you can get killed while the host is
fine. Your host graphs show plenty of free RAM. Your service still dies. That’s a cgroup story.
The proof is in memory.events and friends.
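
If you want a single pass over all three fingerprints, the sketch below does it for the myservice unit used in the tasks that follow; swap in your own unit name and time window.

cr0x@server:~$ journalctl -k --since "1 hour ago" | grep -Ei "out of memory|oom-killer|killed process"
cr0x@server:~$ journalctl -u systemd-oomd --since "1 hour ago" --no-pager
cr0x@server:~$ cat /sys/fs/cgroup/system.slice/myservice.service/memory.events
# Kernel hits show up in the first command, oomd decisions in the second,
# and cgroup-limit kills as a non-zero oom_kill counter in the third.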

Practical tasks (commands + output meaning + decisions)

These are the tasks I actually run during incidents. Each one includes what the output means and the decision
you make from it. Run them as root or with sudo where needed.

Task 1: Confirm the service died from SIGKILL (often OOM)

cr0x@server:~$ systemctl status myservice --no-pager
● myservice.service - My Service
     Loaded: loaded (/etc/systemd/system/myservice.service; enabled; preset: enabled)
     Active: failed (Result: signal) since Mon 2025-12-29 10:41:02 UTC; 3min ago
   Main PID: 21477 (code=killed, signal=KILL)
     Memory: 0B
        CPU: 2min 11.203s

Meaning: The main process was killed with SIGKILL. OOM is a prime suspect because the kernel (and oomd)
typically use SIGKILL, but an operator could also have used kill -9.

Decision: Move to journal evidence. Don’t “fix” anything yet.

Task 2: Search the journal for kernel OOM lines around the event

cr0x@server:~$ journalctl -k --since "2025-12-29 10:35" --until "2025-12-29 10:45" | egrep -i "out of memory|oom|killed process|oom-killer"
Dec 29 10:41:01 server kernel: Out of memory: Killed process 21477 (myservice) total-vm:3281440kB, anon-rss:1512200kB, file-rss:1200kB, shmem-rss:0kB, UID:110 pgtables:4120kB oom_score_adj:0
Dec 29 10:41:01 server kernel: oom_reaper: reaped process 21477 (myservice), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Meaning: This is a kernel OOM kill. It names the victim and provides memory stats.

Decision: Focus on why the kernel hit OOM: overall memory exhaustion, swap disabled, reclaim stalled, or a spike.
systemd-oomd is not the culprit in this specific event.

Task 3: If kernel logs are quiet, search for systemd-oomd actions

cr0x@server:~$ journalctl --since "2025-12-29 10:35" --until "2025-12-29 10:45" -u systemd-oomd --no-pager
Dec 29 10:40:58 server systemd-oomd[783]: Memory pressure high for /system.slice/myservice.service, killing 1 process(es) in this unit.
Dec 29 10:40:58 server systemd-oomd[783]: Killed /system.slice/myservice.service (myservice), pid=21477, uid=110, total_vm=3281440kB, rss=1513408kB

Meaning: systemd-oomd acted, targeting a systemd unit because memory pressure stayed high.

Decision: You must inspect oomd configuration and PSI pressure, and consider changing unit protections or policies.
Adding RAM may help, but policy may still kill you early.

Task 4: Verify whether systemd-oomd is enabled and active

cr0x@server:~$ systemctl is-enabled systemd-oomd && systemctl is-active systemd-oomd
enabled
active

Meaning: oomd is in play on this host.

Decision: When you see SIGKILL without kernel OOM logs, treat oomd as a first-class suspect.

Task 5: Identify the cgroup path for your service and check if it has a memory limit

cr0x@server:~$ systemctl show -p ControlGroup -p MemoryMax -p MemoryHigh myservice
ControlGroup=/system.slice/myservice.service
MemoryMax=infinity
MemoryHigh=infinity

Meaning: This unit has no explicit systemd memory limit. If you still get cgroup OOM, it may be a parent slice limit,
a container limit, or a different controller boundary.

Decision: Inspect parent slices and the cgroup v2 files directly.

Task 6: Read cgroup memory.events for hard proof of cgroup OOM

cr0x@server:~$ cgpath=$(systemctl show -p ControlGroup --value myservice); cat /sys/fs/cgroup${cgpath}/memory.events
low 0
high 12
max 3
oom 1
oom_kill 1

Meaning: oom_kill 1 is the smoking gun: a kill happened due to cgroup memory policy.
max 3 indicates the cgroup hit its hard memory limit multiple times; it didn’t always kill, but it was banging the ceiling.

Decision: Fix the limit or memory behavior. Do not waste time on host-level RAM graphs; this is local to the cgroup.

Task 7: Inspect current usage vs limit at the cgroup level

cr0x@server:~$ cat /sys/fs/cgroup${cgpath}/memory.current; cat /sys/fs/cgroup${cgpath}/memory.max
1634328576
2147483648

Meaning: The service is using ~1.52 GiB with a 2 GiB cap. If you see kills with usage near the cap, you have real headroom issues.

Decision: Either raise the limit, reduce the footprint, or add backpressure so it doesn’t sprint into the wall.

Task 8: Check PSI memory pressure to see whether the host is stalling

cr0x@server:~$ cat /proc/pressure/memory
some avg10=0.48 avg60=0.92 avg300=1.22 total=39203341
full avg10=0.09 avg60=0.20 avg300=0.18 total=8123402

Meaning: PSI shows the system spends measurable time stalled on memory. “full” means tasks are completely blocked waiting for memory.
If “full” is non-trivial, you’re not just low on free RAM—you’re losing work time.

Decision: If PSI is high and sustained, pursue systemic fixes: reclaim behavior, swap strategy, workload shaping, or more RAM.
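
The host-wide numbers above can hide which unit is stalling. cgroup v2 exposes the same PSI format per cgroup; a quick sketch for the example unit:

cr0x@server:~$ cat /sys/fs/cgroup/system.slice/myservice.service/memory.pressure
# Same some/full avg10/avg60/avg300 layout as /proc/pressure/memory, scoped to this unit.
# High "full" here while the host file stays calm means the pressure is local to the unit's limit.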

Task 9: Confirm swap state and whether you’re operating with the safety off

cr0x@server:~$ swapon --show
NAME      TYPE SIZE USED PRIO
/swapfile file  8G  512M   -2

Meaning: Swap exists and is being used. That can buy time and avoid sharp OOMs, but it can also hide leaks until latency collapses.

Decision: If swap is off on a general-purpose host, consider enabling a modest swapfile. If swap is on and heavily used,
investigate memory growth and IO stall risks.

Task 10: Look for the kernel’s OOM selection details (badness, constraints)

cr0x@server:~$ journalctl -k --since "2025-12-29 10:40" --no-pager | egrep -i "oom_score_adj|constraint|MemAvailable|Killed process" | head -n 20
Dec 29 10:41:01 server kernel: myservice invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Dec 29 10:41:01 server kernel: Constraint: CONSTRAINT_NONE, nodemask=(null), cpuset=/, mems_allowed=0
Dec 29 10:41:01 server kernel: Killed process 21477 (myservice) total-vm:3281440kB, anon-rss:1512200kB, file-rss:1200kB, shmem-rss:0kB, UID:110 pgtables:4120kB oom_score_adj:0

Meaning: CONSTRAINT_NONE suggests this was global memory pressure (not constrained to a cpuset/mems).

Decision: Look at the whole host: top memory consumers, reclaim, swap, and IO. If you expected a cgroup limit, your assumption is wrong.

Task 11: Identify top memory consumers at the time (or now, if still present)

cr0x@server:~$ ps -eo pid,ppid,comm,rss,vsz,oom_score_adj --sort=-rss | head -n 12
  PID  PPID COMMAND           RSS    VSZ OOM_SCORE_ADJ
30102     1 java         4123456 7258120             0
21477     1 myservice    1512200 3281440             0
 9821     1 postgres      812344 1623340             0
 1350     1 prometheus    402112  912440             0

Meaning: RSS is real resident memory. VSZ is virtual address space (often misleading). If something else dwarfs your service,
the victim may have been “unlucky” rather than “largest.”

Decision: Decide whether to cap or move the heavyweight process, or protect your service via oom_score_adj and unit policies.

Task 12: Check memory overcommit settings (policy that changes when OOM happens)

cr0x@server:~$ sysctl vm.overcommit_memory vm.overcommit_ratio
vm.overcommit_memory = 0
vm.overcommit_ratio = 50

Meaning: Overcommit mode 0 is heuristic. It can allow allocations that later trigger OOM under load.

Decision: For some classes of systems (database-heavy, memory-reservation-sensitive), consider a stricter policy (mode 2).
For many application servers, mode 0 is fine; focus on limits and leaks first.
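
If you do go stricter, know the math first: in mode 2 the kernel refuses allocations once committed memory would pass CommitLimit, which is roughly swap plus overcommit_ratio percent of RAM. A hedged sketch (the ratio is an example, not a recommendation):

cr0x@server:~$ grep -E "CommitLimit|Committed_AS" /proc/meminfo
cr0x@server:~$ sudo sysctl -w vm.overcommit_memory=2 vm.overcommit_ratio=80
# CommitLimit is now roughly SwapTotal + 80% of MemTotal; allocations that would push
# Committed_AS past it fail with ENOMEM up front instead of OOM-killing later.
# Test in staging first: some runtimes handle refused allocations badly.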

Task 13: Check for reclaim/IO trouble that makes “available memory” lie

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 524288 112344  30240 8123448    0    4   120   320  912 2100 11  4 83  2  0
 3  1 524288  80320  30188 8051120    0   64   128  4096 1102 2600 18  6 62 14  0
 4  2 524288  60220  30152 7983340    0  256   140  8192 1200 2901 22  7 48 23  0
 2  1 524288  93210  30160 8041200    0   32   110  1024  980 2400 14  5 74  7  0
 1  0 524288 104220  30180 8100100    0    0   100   512  900 2200 10  4 84  2  0

Meaning: High so (swap out) plus high wa (IO wait) can signal swap thrash or storage contention.
That can push the system into memory stalls and then OOM decisions.

Decision: If IO wait spikes, treat storage as part of the memory incident. Fix IO saturation, dirty page behavior,
and consider faster swap backing or different swap strategy.

Task 14: Inspect systemd unit protections that influence kill choices

cr0x@server:~$ systemctl show myservice -p OOMScoreAdjust -p ManagedOOMMemoryPressure -p ManagedOOMMemoryPressureLimit
OOMScoreAdjust=0
ManagedOOMMemoryPressure=auto
ManagedOOMMemoryPressureLimit=0

Meaning: With ManagedOOMMemoryPressure=auto, systemd may opt in depending on defaults and version behavior.

Decision: For critical services, explicitly choose: either protect them (disable oomd management for the unit)
or enforce a slice policy that kills less important work first.

Task 15: Check whether your service is running inside a container with its own limits

cr0x@server:~$ cat /proc/21477/cgroup
0::/system.slice/myservice.service

Meaning: This example shows a systemd service directly on the host. In Docker/Kubernetes you’d see paths indicating
container scopes/slices.

Decision: If the path points to container scopes, go to the container’s cgroup and read memory.max and memory.events there.

Task 16: If you have a core business service, set a deliberate OOM score adjustment

cr0x@server:~$ sudo systemctl edit myservice
# (editor opens)
# Add:
# [Service]
# OOMScoreAdjust=-500
cr0x@server:~$ sudo systemctl daemon-reload && sudo systemctl restart myservice

Meaning: Lower (more negative) values make the kernel less likely to pick the process as the OOM victim.
This does not prevent OOM; it changes who gets shot.

Decision: Use this only when you also have a plan for what should die instead (batch jobs, caches, best-effort workers).
Otherwise you just move the blast radius somewhere else.

Why it happened: failure modes you can bet on

1) The “host has memory” illusion (cgroup limit OOM)

The host can have gigabytes free and your service still gets killed if its cgroup hits memory.max.
This is common on Kubernetes nodes, Docker hosts, and systemd services where someone set MemoryMax
months ago and forgot.

The proof isn’t in free -h. It’s in memory.events for that cgroup.

2) The “no swap because performance” policy (sharp global OOM)

Disabling swap can be valid for some latency-sensitive environments, but it trades gradual degradation for sudden death.
If you’re doing it, you must have strict memory limits, admission control, and excellent capacity planning. Many shops have none of those.

3) A leak that only shows under real traffic

The classic. A cache without bounds. A metrics label explosion. A request path that retains references.
Everything looks fine in staging because staging doesn’t get hammered by real user diversity.

4) Reclaim doesn’t work because storage is the hidden bottleneck

Memory reclaim often depends on IO: writing dirty pages, swapping out, reading back. When storage is saturated,
memory pressure turns into stalls and then kills.
SRE rule: if memory incidents correlate with IO wait, you’re not having “a memory problem.” You’re having a system problem.

5) systemd-oomd is doing what you asked (or what “auto” decided)

oomd is designed to kill earlier than the kernel, to prevent the entire system from becoming unusable.
That’s good. It’s also surprising when it kills a unit you assumed was protected.
If you run multi-tenant hosts, oomd can be your friend. If you run single-purpose hosts, it can be noise unless configured.

Short joke #2: Turning off swap to “avoid latency” is like removing the fire alarm to “avoid noise.”

Three corporate mini-stories (anonymized, plausible, technically accurate)

Mini-story 1: Incident caused by a wrong assumption

A mid-size SaaS ran a reporting API on a pool of Ubuntu hosts. The on-call saw a rash of restarts and did the normal thing:
checked host memory graphs. Everything looked fine—plenty of free RAM, no swap usage, no smoking gun. The team blamed
“random crashes” and rolled back a harmless library update.

The restarts continued. A senior engineer finally pulled memory.events for the systemd unit and found
oom_kill incrementing. The service had a MemoryMax inherited from a parent slice used for “non-critical apps.”
Nobody remembered setting it because it was done during a cost-cutting sprint, and it lived in an infrastructure repo
that the API team never read.

The wrong assumption was simple: “If the host has free RAM, it can’t be OOM.” In a cgroup world, that assumption is dead.
The host was fine; the service was boxed in.

The fix wasn’t dramatic: raise the unit’s memory limit to match real peak usage, add an alert on cgroup
memory.events for high/max before oom_kill, and document the slice policy.
The postmortem didn’t blame the kernel. It blamed missing ownership and invisible constraints.

Mini-story 2: An optimization that backfired

A payments platform wanted lower p99 latency. Someone proposed disabling swap across application nodes because “swap is slow.”
They also tuned the JVM to use a bigger heap to reduce GC frequency. The first week looked great—clean dashboards and
slightly better tail latency.

Then a traffic campaign hit. Memory usage climbed as caches warmed and request concurrency increased. Without swap,
there was no buffer for transient spikes. The kernel hit OOM quickly and killed random Java workers. Some nodes survived,
others didn’t, creating uneven load and retry storms.

The team tried to fix it by raising heap sizes even more, which made it worse: the heap ate into file cache and
reduced reclaim options. When the system got into trouble, it had fewer places to go.

The eventual fix was a little boring and very effective: re-enable a modest swapfile, lower heap to a safer fraction
of RAM, and enforce per-worker memory limits via systemd slices so a single worker could not consume the entire node.
They kept the latency wins by reducing concurrency spikes instead of removing the system’s safety net.

Mini-story 3: A boring but correct practice that saved the day

A data ingestion service processed large customer files. It wasn’t glamorous: mostly streaming IO, some decompression,
some parsing. The platform team had a strict policy: every service unit must declare memory expectations with
MemoryHigh and MemoryMax, and must emit a “current RSS” gauge. Teams complained it was bureaucratic.

One evening, a new customer sent a malformed file that triggered pathological behavior in a parsing library.
RSS grew steadily. The service didn’t crash immediately; it just kept asking for memory. On a host without limits,
it would have dragged everything down.

Instead, MemoryHigh caused throttling signals early (reclaim pressure increased, performance degraded but stayed functional),
and MemoryMax prevented total node exhaustion. The ingestion worker got killed inside its own cgroup,
not taking down the database sidecars or node exporters.

The on-call saw the alert: memory.events high climbing. They could correlate it to a single customer job,
quarantine it, and ship a parser fix the next day. The boring policy turned a node-wide incident into a single failed job.
Nobody cheered. That’s the point.

Prevention that works: limits, swap, tuning, and runtime discipline

Decide which layer should fail first

You don’t prevent memory exhaustion by hoping the kernel “handles it.” You prevent it by designing a failure order:
which workloads get squeezed, which get killed, and which are protected.

  • Protect: databases, coordination services, node agents that keep the host manageable.
  • Best-effort: batch jobs, caches that can rebuild, async workers that can retry.
  • Never unbounded: anything that can fan out with user input (regex, decompression, JSON parsing, caching layers).
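
The split above maps cleanly onto systemd slices. A minimal sketch; the slice and unit names (batch.slice, nightly-report.service) and the 4G cap are illustrative, not recommendations.

cr0x@server:~$ sudo systemctl edit --force --full batch.slice
# (editor opens)
# Add:
# [Unit]
# Description=Best-effort work that gets squeezed or killed first
#
# [Slice]
# MemoryMax=4G
cr0x@server:~$ sudo systemctl edit nightly-report.service
# Add:
# [Service]
# Slice=batch.slice
cr0x@server:~$ sudo systemctl daemon-reload && sudo systemctl restart nightly-report.service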

Use systemd limits intentionally (MemoryHigh + MemoryMax)

MemoryMax is a hard wall. Useful, but brutal: hit it and you can get killed. MemoryHigh is a pressure threshold:
it triggers reclaim and throttling and gives you a chance to recover before you die.

A pragmatic pattern for long-running services:

  • Set MemoryHigh at a level where performance degradation is acceptable but alarms are loud.
  • Set MemoryMax high enough to allow known peaks, but low enough to protect the host.
cr0x@server:~$ sudo systemctl edit myservice
# Add:
# [Service]
# MemoryHigh=3G
# MemoryMax=4G
cr0x@server:~$ sudo systemctl daemon-reload && sudo systemctl restart myservice

Configure oomd rather than pretending it doesn’t exist

If systemd-oomd is enabled, you need an explicit policy. “auto” is not a policy; it’s a default that will eventually
surprise you during the worst possible hour.

Typical approaches:

  • Multi-tenant nodes: keep oomd, use slices, put best-effort workloads into a slice that oomd may kill first.
  • Single-purpose nodes: consider disabling oomd management for the critical service unit if it’s getting killed
    prematurely, but keep kernel OOM as last resort.
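
What that looks like in practice: a couple of hedged drop-ins using the ManagedOOM* directives. The pressure threshold is a starting point to tune against your own PSI history, not a universal value.

# Tell oomd to avoid picking the critical unit when it must kill something:
cr0x@server:~$ sudo systemctl edit myservice
# Add:
# [Service]
# ManagedOOMPreference=avoid

# Let oomd act early on the best-effort slice instead:
cr0x@server:~$ sudo systemctl edit batch.slice
# Add:
# [Slice]
# ManagedOOMMemoryPressure=kill
# ManagedOOMMemoryPressureLimit=60%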

Swap: pick a strategy and own it

Swap is not “bad.” Unmanaged swap is bad. If you enable swap, monitor swap-in/out rates and IO wait. If you disable swap,
accept that OOMs will be sudden and frequent unless you have strict limits and predictable workloads.
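
If the decision is “modest swap, monitored,” a swapfile takes five commands on a stock ext4 Ubuntu 24.04 host; the 4G size is an assumption, size it for your RAM and workload.

cr0x@server:~$ sudo fallocate -l 4G /swapfile
cr0x@server:~$ sudo chmod 600 /swapfile
cr0x@server:~$ sudo mkswap /swapfile
cr0x@server:~$ sudo swapon /swapfile
cr0x@server:~$ echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
# Then watch si/so in vmstat and PSI: swap that churns constantly is a symptom to chase, not a fix.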

Stop memory spikes at the application boundary

The fastest way to prevent OOM is to stop accepting work that turns into unbounded memory growth.
Production-grade habits:

  • Bound caches (size + TTL) and measure eviction rates.
  • Limit concurrency. Most services don’t need “as many threads as possible”; they need “as many as you can keep in cache and RAM.”
  • Cap payload sizes and enforce streaming parsing.
  • Use backpressure: queues with limits, load shedding, circuit breakers.

Know the difference between RSS and “it looks big”

Engineers love blaming “memory leaks” based on VSZ. That’s how you end up “fixing” mmap reservations that were never resident.
Use RSS, PSS (if you can), and cgroup memory.current. And correlate with pressure (PSI), not just bytes.
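
A quick way to get honest numbers for one process and its cgroup, reusing the example PID from Task 11:

cr0x@server:~$ sudo grep -E "^(Rss|Pss|Swap):" /proc/21477/smaps_rollup
cr0x@server:~$ cat /sys/fs/cgroup/system.slice/myservice.service/memory.current
# Pss splits shared pages across the processes that map them, so a preforked
# worker pool isn't double-counted the way summing Rss would be.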

Storage engineer’s note: memory incidents are often IO incidents wearing a different hat

If your system is reclaiming, swapping, or writing dirty pages under pressure, storage latency becomes part of your memory story.
A slow or saturated disk can turn “recoverable pressure” into “kill something now.”

If you see high IO wait during pressure, fix the storage path: queue depth, noisy neighbor volumes, throttling, writeback settings,
and swap backing.
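
The IO side has its own PSI file, so you can check both in one breath before deciding which subsystem to blame:

cr0x@server:~$ cat /proc/pressure/io /proc/pressure/memory
# If io "full" climbs in step with memory "full", reclaim is blocked on storage:
# fix writeback, queue depth, or swap backing before buying more RAM.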

Common mistakes: symptom → root cause → fix

1) Symptom: “Service got SIGKILL, but there’s no OOM in dmesg”

Root cause: systemd-oomd killed it based on PSI, or the kill happened inside a container/cgroup and you’re looking at the wrong journal scope.

Fix: Check journalctl -u systemd-oomd and the unit’s memory.events. Confirm cgroup path and read the right files.

2) Symptom: “Host shows 30% free RAM, but container keeps OOMing”

Root cause: cgroup memory.max is too low (container limit), or memory spikes exceed headroom.

Fix: Inspect memory.max and memory.events in the container cgroup. Raise limit or reduce peak usage.

3) Symptom: “Everything slows down for minutes, then one process dies”

Root cause: Reclaim and swap thrash; IO wait makes memory stalls unbearable.

Fix: Check vmstat for swap/wa. Reduce dirty writeback pressure, improve storage performance, consider a sane swap strategy,
and cap the worst offenders.

4) Symptom: “OOM kills a small process, not the big hog”

Root cause: OOM badness scoring, oom_score_adj differences, shared memory considerations, or the hog is protected by policy.

Fix: Inspect logs for oom_score_adj; set deliberate OOMScoreAdjust for critical services and constrain best-effort work
with cgroup limits so the kernel has better choices.

5) Symptom: “OOM happens right after deploy, but not always”

Root cause: Cold caches + increased concurrency + new allocations. Also common: a new code path allocates based on user input.

Fix: Add canary load tests that simulate cold start, bound caches, add request size limits, and set MemoryHigh to surface pressure early.

6) Symptom: “We added RAM, OOM still happens”

Root cause: The limit is in a cgroup, not the host. Or a leak scales to fill whatever you buy.

Fix: Prove where the limit is (memory.max). Then measure growth slope over time to confirm leak vs workload.

Checklists / step-by-step plan

Incident checklist (15 minutes, no heroics)

  1. Get the exact death time and signal: systemctl status or container runtime status.
  2. Check kernel log around that time for “Killed process”.
  3. If kernel is quiet, check systemd-oomd journal.
  4. Read memory.events for the service cgroup and its parent slice.
  5. Capture PSI memory: /proc/pressure/memory (host) and cgroup PSI if relevant.
  6. Capture top RSS consumers and their oom_score_adj.
  7. Check swap state and vmstat for swap/IO wait interaction.
  8. Write down: kernel OOM vs oomd vs cgroup limit OOM. Don’t leave it ambiguous.
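
A small helper that runs the checklist in one pass is worth keeping around. A sketch below: the unit name is an argument, and the output path is an assumption.

cr0x@server:~$ cat > /tmp/oom-triage.sh <<'EOF'
#!/bin/bash
# Collect OOM evidence for one systemd unit. Usage: oom-triage.sh <unit>
set -euo pipefail
unit="$1"
cg=$(systemctl show -p ControlGroup --value "$unit")
out="/tmp/oom-evidence-$(date +%s).txt"
{
  echo "== kernel OOM lines (last hour) =="
  journalctl -k --since "1 hour ago" | grep -iE "out of memory|oom-killer|killed process" || true
  echo "== systemd-oomd actions (last hour) =="
  journalctl -u systemd-oomd --since "1 hour ago" --no-pager || true
  echo "== cgroup memory.events and memory.max =="
  cat "/sys/fs/cgroup${cg}/memory.events" "/sys/fs/cgroup${cg}/memory.max" 2>/dev/null || true
  echo "== host PSI =="
  cat /proc/pressure/memory
  echo "== top RSS consumers =="
  ps -eo pid,comm,rss --sort=-rss | head -n 10
} > "$out"
echo "wrote $out"
EOF
cr0x@server:~$ sudo bash /tmp/oom-triage.sh myservice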

Stabilization plan (same day)

  1. If cgroup OOM: raise MemoryMax (or container limit) to stop the bleeding.
  2. If global kernel OOM: add swap if appropriate, reduce concurrency, and temporarily cap non-critical workloads.
  3. If oomd kill: adjust ManagedOOM policy or move best-effort workloads into a killable slice.
  4. Set alerts on cgroup memory.events “high” and PSI “full” to detect pressure before kills.
  5. Protect critical units with OOMScoreAdjust, but only after you’ve given the system a safe victim class.
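
For step 4 you don't need a fancy pipeline to start; two reads cover the early-warning signals, and you can wire them into whatever already runs your alerting (a node exporter textfile collector, a cron check, a watchdog script).

cr0x@server:~$ awk '$1=="high" || $1=="oom_kill"' /sys/fs/cgroup/system.slice/myservice.service/memory.events
cr0x@server:~$ awk '/^full/' /proc/pressure/memory
# Alert when the "high" counter keeps climbing between runs, when oom_kill moves at all,
# or when full avg10 stays above a few percent for several minutes.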

Prevention plan (this week)

  1. Define memory budgets per service: expected steady-state RSS and worst-case peak.
  2. Set MemoryHigh and MemoryMax accordingly; document ownership.
  3. Add application-level guardrails: request size limits, bounded caches, concurrency caps.
  4. Instrument memory: RSS gauges, heap metrics where applicable, and periodic leak checks.
  5. Run a controlled load test that simulates cold start + peak concurrency.
  6. Review swap and IO path: make sure reclaim has somewhere to go that won’t melt storage.

FAQ

1) How do I tell kernel OOM from systemd-oomd quickly?

Kernel OOM shows “Out of memory” and “Killed process” in journalctl -k. systemd-oomd shows kill decisions in
journalctl -u systemd-oomd. If both are quiet, check cgroup memory.events for oom_kill.

2) What does exit code 137 mean?

It usually means the process got SIGKILL (128 + 9). OOM is a common reason, especially in containers, but an operator or watchdog can also SIGKILL.
Always corroborate with logs and cgroup events.

3) Why did the OOM killer pick my important service?

The kernel picks based on badness score and constraints. If everything is critical and unbounded, something critical will die.
Use OOMScoreAdjust to protect key units, but also create killable classes (batch, cache) with limits.

4) Can I just disable systemd-oomd?

You can, but don’t do it as a reflex. On multi-tenant nodes, oomd can prevent total host collapse by acting early.
If you disable it, be confident your cgroup limits and workload controls prevent global pressure.

5) What’s the difference between MemoryHigh and MemoryMax?

MemoryHigh applies reclaim pressure and throttling when exceeded—an early warning and soft control.
MemoryMax is a hard limit; exceed it and you can trigger OOM kills within the cgroup.

6) Why does “free” memory look low even when the system is fine?

Linux uses RAM for page cache. Low “free” isn’t inherently bad. Look at “available” memory and, better, PSI memory pressure
to know whether the system is stalling.

7) Is swap always recommended on Ubuntu 24.04 servers?

Not always. Swap can reduce the frequency of hard OOM kills, but can increase latency under pressure.
For general-purpose hosts, a modest swapfile is often a net win. For strict low-latency workloads, you might disable swap—
but then you must enforce tight memory limits and admission control.

8) How do I prove a container OOM vs host OOM?

Check the container’s cgroup: memory.events with oom_kill indicates cgroup OOM.
Host OOM will show kernel “Killed process” entries and often affects multiple services.

9) What’s the single best early-warning metric?

Memory PSI (/proc/pressure/memory) plus cgroup memory.events high trending upward. Bytes tell you “how much.”
Pressure tells you “how bad.”

Next steps (what to do before the next page)

Don’t treat OOM like weather. It’s engineering. The immediate win is proving the killer: kernel OOM, cgroup OOM, or systemd-oomd.
Once you know that, the fixes stop being superstition.

Do these three things this week:

  1. Add an alert on memory.events (high and oom_kill) for your top services.
  2. Set explicit MemoryHigh/MemoryMax budgets for services that matter, and put best-effort work in a killable slice.
  3. Decide your swap strategy and monitor it—because “we disabled swap once” is not a plan, it’s a rumor.

The next time a service disappears, you should be able to answer “what killed it?” in under five minutes.
Then you can spend the rest of your time preventing the repeat, instead of arguing with a dashboard screenshot.
