The page hits at 03:12. “API latency spiking, nodes swapping, one container OOM-killed.” You log in and see the usual: memory “used” is high, executives are awake, and somebody has already suggested “just add RAM.”
Memory leaks are the worst kind of slow-motion incident: everything works… until it doesn’t, and then it fails at the least convenient moment. The trick on Debian 13 is not heroics. It’s collecting evidence without turning production into a lab experiment, making small reversible changes, and always separating real leaks from expected memory behavior.
Interesting facts & context (quick, concrete)
- Linux doesn’t “free memory” the way humans want. The kernel aggressively uses RAM for page cache; “available” is the metric that matters more than “free.”
- cgroups have changed the game twice. cgroups v1 gave each controller its own hierarchy; v2 unified them into a single tree, and systemd made that unified hierarchy the default control plane for services.
- /proc is older than most incident runbooks. It has been the canonical interface for process stats since the early Linux days, and it still wins for low-disruption visibility.
- “OOM killer” is not a single event type. There’s system-wide OOM, cgroup OOM, and user-space “we died because malloc failed.” They look similar in dashboards and very different in logs.
- Malloc behavior is policy, not physics. glibc’s allocator uses arenas and may not return memory to the OS promptly; this often looks like a leak when it isn’t.
- Overcommit is a feature. Linux can promise more virtual memory than exists; it’s normal until it’s not. The failure mode depends on overcommit settings and workload.
- eBPF made “observe without stopping” mainstream. You can trace allocations and faults with far less performance penalty than old-school ptrace-heavy approaches.
- Java invented new categories of “leak”. Heap leaks are one thing; native memory tracking exists because direct buffers, thread stacks, and JNI can quietly balloon.
- Smaps is the unsung hero. /proc/<pid>/smaps is verbose, expensive-ish to read at scale, and absolutely decisive when you need to know what memory really is.
What “memory leak” really means on Linux
In production, “memory leak” gets used for four different problems. Only one is a classic leak.
If you don’t name the problem correctly, you’ll “fix” the wrong thing and feel productive while the page keeps paging.
1) True leak: memory becomes unreachable and never reclaimed
This is the textbook: allocations continue, frees don’t match, and the live set grows without bound.
You’ll see a monotonic climb in RSS or heap usage that doesn’t correlate with load.
2) Retained memory: reachable, but unintentionally kept
Caches without eviction, unbounded maps keyed by user IDs, request contexts stored globally. Technically not “leaked” because the program can still reach it, but practically the same outcome.
3) Allocator behavior: memory freed internally, not returned to the OS
glibc malloc, fragmentation, per-thread arenas, and “high-water mark” effects can keep RSS high even after peak load passes.
The application may be healthy. Your graphs will accuse it anyway.
4) Kernel/page cache pressure: RAM is used, but not by your process
“Used memory” climbs because the kernel caches file pages. Under pressure, it should drop.
If it doesn’t, you may have dirty page congestion, slow IO, or cgroup reclaim rules that make the cache sticky.
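A quick, read-only sanity check for this category: watch whether Dirty and Writeback in /proc/meminfo drain or keep growing while MemAvailable shrinks. If they stay stuck, reclaim is fighting slow IO rather than your process. A minimal check, numbers will vary by workload:
cr0x@server:~$ grep -E '^(MemAvailable|Cached|Dirty|Writeback):' /proc/meminfo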
The job is to figure out which category you’re in with the smallest blast radius. Debugging by panic is expensive.
One paraphrased idea from Werner Vogels (Amazon CTO): Everything fails eventually; build systems and habits that make failure survivable.
Fast diagnosis playbook (first/second/third)
First: confirm it’s a process problem, not “Linux being Linux”
- Look at MemAvailable, swap activity, and major faults. If MemAvailable is healthy and swap is quiet, you probably don’t have an urgent leak.
- Identify the top RSS consumers and whether growth is monotonic.
- Check if memory is dominated by anon (heap) or file (cache, mmaps).
Second: decide if you’re in a cgroup/systemd boundary
- Is the service constrained by systemd MemoryMax? If yes, leaks will show as cgroup OOM, not system-wide OOM.
- Collect memory.current, memory.events, and per-process RSS under the unit’s cgroup.
Third: pick the least disruptive evidence source
- /proc/<pid>/smaps_rollup for a quick PSS/RSS/Swap view.
- cgroup v2 stats for service-level tracking and OOM events.
- Language-native profilers (pprof, JVM NMT, Python tracemalloc) if you can enable them without restart.
- eBPF sampling when you can’t touch the app but need attribution.
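If you do end up at the eBPF option, here's a minimal bpftrace sketch, assuming the bpftrace package is installed and you have root: it counts allocation-related syscalls per command name for 30 seconds, which gives you attribution without attaching to the process.
cr0x@server:~$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_mmap,tracepoint:syscalls:sys_enter_brk { @calls[comm] = count(); } interval:s:30 { exit(); }'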
Joke #1: Memory leaks are like office snacks—small at first, then suddenly everything is gone and nobody admits anything.
Practical tasks: commands, output meaning, and decisions (12+)
These are production-friendly. Most are read-only. A few change configuration, and those are flagged with the decision you’re making.
Use them in order; don’t skip to the “cool tools” unless you’ve earned it with basic evidence.
Task 1: Check whether the kernel is actually under memory pressure
cr0x@server:~$ grep -E 'Mem(Total|Free|Available)|Swap(Total|Free)' /proc/meminfo
MemTotal: 65843064 kB
MemFree: 1234560 kB
MemAvailable: 18432000 kB
SwapTotal: 4194300 kB
SwapFree: 4096000 kB
What it means: MemFree is low (normal), MemAvailable is still ~18 GB (good), swap mostly free (good).
Decision: If MemAvailable is healthy and swap isn’t trending down, you likely don’t have an acute leak. Move to process-level confirmation before waking everyone.
Task 2: Look for active swapping and reclaim churn
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 1234560 81234 21000000 0 0 5 12 820 1600 12 4 83 1 0
1 0 0 1200000 80000 21100000 0 0 0 8 790 1550 11 3 85 1 0
3 1 0 1100000 79000 21200000 0 0 0 120 900 1700 15 6 74 5 0
5 2 0 900000 78000 21300000 0 0 0 800 1200 2200 18 8 60 14 0
4 2 0 850000 77000 21350000 0 0 0 900 1180 2100 16 7 62 13 0
What it means: si/so are zero (no swap-in/out), but wa (IO wait) is rising, and b (blocked) is non-zero.
Decision: If swapping is active (si/so > 0 sustained), you’re in emergency territory. If not, the issue might be IO-induced stalls or a single process ballooning but not yet forcing swap.
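vmstat's si/so only covers swap; the kernel's reclaim counters in /proc/vmstat show scanning and stealing directly. Sample them twice, a minute apart, and compare the deltas: pgscan_direct climbing means processes are stalling in reclaim themselves, not just kswapd working in the background.
cr0x@server:~$ grep -E '^(pgscan_kswapd|pgscan_direct|pgsteal_kswapd|pgsteal_direct|pswpin|pswpout) ' /proc/vmstat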
Task 3: Identify top RSS consumers (quick triage)
cr0x@server:~$ ps -eo pid,ppid,comm,%mem,rss --sort=-rss | head -n 10
PID PPID COMMAND %MEM RSS
4123 1 api-service 18.4 12288000
2877 1 search-worker 12.1 8050000
1990 1 postgres 9.7 6450000
911 1 nginx 1.2 780000
What it means: api-service is the main resident hog.
Decision: Pick the top suspect and stay focused. If multiple processes are growing together, suspect shared cache pressure, tmpfs growth, or a workload shift.
Task 4: Confirm monotonic growth (don’t trust one snapshot)
cr0x@server:~$ pid=4123; for i in 1 2 3 4 5; do date; awk '/VmRSS|VmSize/ {print}' /proc/$pid/status; sleep 30; done
Mon Dec 30 03:12:01 UTC 2025
VmSize: 18934264 kB
VmRSS: 12288000 kB
Mon Dec 30 03:12:31 UTC 2025
VmSize: 18959000 kB
VmRSS: 12340000 kB
Mon Dec 30 03:13:01 UTC 2025
VmSize: 19001000 kB
VmRSS: 12420000 kB
Mon Dec 30 03:13:31 UTC 2025
VmSize: 19042000 kB
VmRSS: 12510000 kB
Mon Dec 30 03:14:01 UTC 2025
VmSize: 19090000 kB
VmRSS: 12605000 kB
What it means: Both VmSize and VmRSS climb steadily. That’s a leak or retention pattern, not a one-time spike.
Decision: Start collecting attribution (smaps_rollup, heap metrics, allocator stats). Also plan a mitigation (restart, scale-out) because you now have a time-to-failure.
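The time-to-failure estimate is simple arithmetic on numbers you already have: VmRSS grew from 12,288,000 kB to 12,605,000 kB in two minutes, and Task 1 showed roughly 18 GB of MemAvailable. Shell integer math is good enough here:
cr0x@server:~$ echo $(( (12605000 - 12288000) / 2 ))   # kB of growth per minute
158500
cr0x@server:~$ echo $(( 18432000 / 158500 ))   # minutes until MemAvailable is gone at this rate
116
Roughly two hours of headroom: enough to collect evidence, not enough to wait for business hours.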
Task 5: Determine anon vs file-backed memory (smaps_rollup)
cr0x@server:~$ pid=4123; grep -E 'Rss:|Pss:|Private|Shared|Swap:' /proc/$pid/smaps_rollup
Rss: 12631240 kB
Pss: 12590010 kB
Shared_Clean: 89240 kB
Shared_Dirty: 1024 kB
Private_Clean: 120000 kB
Private_Dirty: 12450000 kB
Swap: 0 kB
What it means: Mostly Private_Dirty anon memory. That’s typically heap, stacks, or anonymous mmaps.
Decision: Focus on application allocations and retention. If instead File-backed memory dominated, you’d inspect mmaps, caches, and file IO patterns.
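On Debian 13's kernel, smaps_rollup also reports Pss split by type, which answers the anon-versus-file question in a single read; if these fields are missing on an older kernel, fall back to the Private/Shared lines above.
cr0x@server:~$ pid=4123; grep -E '^(Pss_Anon|Pss_File|Pss_Shmem):' /proc/$pid/smaps_rollup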
Task 6: Check for cgroup OOM events under systemd (Debian 13 default patterns)
cr0x@server:~$ systemctl status api-service.service --no-pager
● api-service.service - API Service
Loaded: loaded (/etc/systemd/system/api-service.service; enabled; preset: enabled)
Active: active (running) since Mon 2025-12-30 02:10:11 UTC; 1h 2min ago
Main PID: 4123 (api-service)
Tasks: 84 (limit: 12288)
Memory: 12.4G (peak: 12.6G)
CPU: 22min 9.843s
CGroup: /system.slice/api-service.service
└─4123 /usr/local/bin/api-service
What it means: systemd is tracking current and peak memory; this is already valuable for leak shape.
Decision: If the Memory: line also shows a high/max limit, or the journal shows recent OOM messages for the unit, move to cgroup stats next.
Task 7: Read cgroup v2 memory counters and OOM events
cr0x@server:~$ cg=$(systemctl show -p ControlGroup --value api-service.service); echo $cg; cat /sys/fs/cgroup$cg/memory.current; cat /sys/fs/cgroup$cg/memory.events
/system.slice/api-service.service
13518778368
low 0
high 0
max 0
oom 0
oom_kill 0
What it means: The service is using ~13.5 GB and has not hit cgroup OOM yet.
Decision: If oom_kill increments, you’re not chasing a “mystery crash”—you’re hitting a known resource wall. Adjust MemoryMax, fix leak, or scale out.
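For a live, whole-host view of the same accounting, systemd-cgtop (ships with systemd, so it's already on the box) ordered by memory shows which unit or slice is actually responsible for pressure:
cr0x@server:~$ systemd-cgtop --order=memory -d 10 -n 3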
Task 8: Detect whether memory is in tmpfs or runaway logs (sneaky offenders)
cr0x@server:~$ df -h | grep -E 'tmpfs|/run|/dev/shm'
tmpfs 32G 1.2G 31G 4% /run
tmpfs 32G 18G 14G 57% /dev/shm
What it means: /dev/shm is huge. That’s memory, not disk.
Decision: If /dev/shm or /run balloons, inspect what’s writing there (shared memory segments, browser-style caches, IPC). This is not a heap leak; restarting the service may not help if it’s another process.
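Two read-only follow-ups to find the writer: du shows which directories inside the tmpfs are growing, and ipcs lists SysV shared memory segments with their owners. Both tools are in a default Debian install.
cr0x@server:~$ sudo du -xh --max-depth=1 /dev/shm | sort -h | tail -n 5
cr0x@server:~$ ipcs -m | head -n 10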
Task 9: Find top consumers inside the service’s cgroup (multi-process units)
cr0x@server:~$ cg=$(systemctl show -p ControlGroup --value api-service.service); for p in $(cat /sys/fs/cgroup$cg/cgroup.procs | head -n 20); do awk -v p=$p '/VmRSS/ {print p, $2 "kB"}' /proc/$p/status 2>/dev/null; done | sort -k2 -n | tail
4123 12605000kB
What it means: One main process accounts for almost all RSS.
Decision: Good: attribution is simpler. If many processes share the growth, suspect workers, a fork bomb pattern, or an allocator arena explosion across threads.
Task 10: Check page fault rate to distinguish “touching new memory” vs “reusing”
cr0x@server:~$ pid=4123; awk '{print "minflt="$10, "majflt="$12}' /proc/$pid/stat
minflt=48392011 majflt=42
What it means: Minor faults are high (normal for allocation), major faults low (not thrashing on disk yet).
Decision: If major faults climb rapidly, you’re already in performance collapse. Prioritize mitigation (restart/scale) before deep profiling.
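If the sysstat package is installed (it isn't by default on Debian), pidstat turns those raw counters into per-second fault rates, which is easier to trend than diffing /proc/<pid>/stat by hand:
cr0x@server:~$ pidstat -r -p 4123 5 3
minflt/s staying high while majflt/s stays near zero matches the "allocating new memory, not thrashing yet" picture above.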
Task 11: See mapped regions by size (big mmaps jump out)
cr0x@server:~$ pid=4123; awk '{print $1, $2, $6}' /proc/$pid/maps | head
55fdb5a6b000-55fdb5b2e000 r--p /usr/local/bin/api-service
55fdb5b2e000-55fdb5e7e000 r-xp /usr/local/bin/api-service
55fdb5e7e000-55fdb5f24000 r--p /usr/local/bin/api-service
7f01b4000000-7f01b8000000 rw-p
7f01b8000000-7f01bc000000 rw-p
What it means: Large anonymous regions (rw-p with no file) suggest heap arenas or explicit mmap allocations.
Decision: If you see giant file-backed mmaps (e.g., data files), the “leak” might be a mapping strategy. Different fix.
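To make "by size" literal instead of eyeballing address ranges, let pmap do the subtraction and sort mappings by resident size; the biggest anonymous regions float to the top:
cr0x@server:~$ pid=4123; sudo pmap -x $pid | grep -v total | sort -k3 -rn | head -n 10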
Task 12: Capture pmap snapshot (read-only, useful for trend diffs)
cr0x@server:~$ pid=4123; sudo pmap -x $pid | tail -n 5
---------------- ------- ------- ------- -------
total kB 18990000 12631240 12591000 0
What it means: Confirms totals; pmap is a good “before/after” artifact for tickets.
Decision: Save this output during incident timeline. If memory drops after a config change or traffic shift, you have causality clues.
Task 13: Inspect journald for OOM and cgroup kill evidence
cr0x@server:~$ sudo journalctl -k --since "1 hour ago" | grep -Ei 'oom|out of memory|killed process|memory cgroup' | tail -n 20
Dec 30 02:55:10 server kernel: Memory cgroup out of memory: Killed process 2877 (search-worker) total-vm:9132000kB, anon-rss:7800000kB, file-rss:12000kB, shmem-rss:0kB
Dec 30 02:55:10 server kernel: oom_reaper: reaped process 2877 (search-worker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
What it means: The kernel killed a process due to memory cgroup OOM. That’s not a segfault; it’s policy enforcement.
Decision: Decide whether the limit is wrong (too low for peak) or the workload is wrong (leak/retention). Often it’s both.
Task 14: Get systemd’s view of memory limits and accounting (and fix what’s missing)
cr0x@server:~$ systemctl show api-service.service -p MemoryAccounting -p MemoryMax -p MemoryHigh -p OOMPolicy
MemoryAccounting=yes
MemoryMax=infinity
MemoryHigh=infinity
OOMPolicy=continue
What it means: Accounting is on, but there’s no cap. OOMPolicy=continue means systemd won’t stop the unit on OOM by itself.
Decision: If you’re running multi-tenant hosts, set MemoryHigh/MemoryMax to protect neighbors. If this is a dedicated node, you may prefer no cap and rely on autoscaling + alerting.
Task 15: Put a temporary “tripwire” cap (carefully) to force earlier failure with better evidence
cr0x@server:~$ sudo systemctl set-property api-service.service MemoryHigh=14G MemoryMax=16G
cr0x@server:~$ systemctl show api-service.service -p MemoryHigh -p MemoryMax
MemoryHigh=15032385536
MemoryMax=17179869184
What it means: You’ve applied a limit live (systemd writes to the cgroup). MemoryHigh triggers reclaim pressure; MemoryMax is the hard stop.
Decision: Only do this if you can tolerate the service being killed sooner. It’s a trade: better containment and clearer signals versus potential user impact. On shared hosts, it’s often the responsible move.
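Note that set-property persists by default (it writes a drop-in). If you want the tripwire to vanish on its own, or to be able to undo it cleanly, use --runtime for the experiment and systemctl revert to clear persisted drop-ins later. Be aware that revert removes all local drop-ins for the unit, so check what's there first.
cr0x@server:~$ sudo systemctl set-property --runtime api-service.service MemoryHigh=14G MemoryMax=16G
cr0x@server:~$ sudo systemctl revert api-service.service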
Task 16: If it’s Java, read Native Memory Tracking (requires -XX:NativeMemoryTracking at startup; low disruption if planned)
cr0x@server:~$ jcmd 4123 VM.native_memory summary
4123:
Native Memory Tracking:
Total: reserved=13540MB, committed=12620MB
- Java Heap (reserved=8192MB, committed=8192MB)
- Class (reserved=512MB, committed=480MB)
- Thread (reserved=1024MB, committed=920MB)
- Code (reserved=256MB, committed=220MB)
- GC (reserved=1200MB, committed=1100MB)
- Internal (reserved=2300MB, committed=1700MB)
What it means: The process is not just “heap”. Threads and internal allocations can dominate.
Decision: If heap is stable but native/internal grows, focus on off-heap buffers, JNI, thread creation, or allocator fragmentation—not GC tuning.
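NMT only answers jcmd if the JVM was started with -XX:NativeMemoryTracking=summary (or detail); that's the "if planned" part. Once it's on, a baseline plus a later diff is the cleanest way to see which native category is actually growing:
cr0x@server:~$ jcmd 4123 VM.native_memory baseline
cr0x@server:~$ jcmd 4123 VM.native_memory summary.diff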
Task 17: If it’s Go, take a pprof heap snapshot (minimal disruption if endpoint exists)
cr0x@server:~$ curl -sS 'localhost:6060/debug/pprof/heap?seconds=30' | head
\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff...
What it means: That’s a gzipped pprof profile. It’s not human-readable in the terminal.
Decision: Save it and analyze off-box. If you can’t expose pprof safely, don’t punch holes in prod during an incident—use SSH port forwarding or localhost-only binding.
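A minimal off-box workflow, assuming the endpoint is bound to localhost on the server and you have the Go toolchain on your workstation (host and user names here are placeholders):
cr0x@workstation:~$ ssh -L 6060:localhost:6060 cr0x@server
cr0x@workstation:~$ go tool pprof -top http://localhost:6060/debug/pprof/heap
-top prints the heaviest allocation sites; the interactive web view works the same way through the tunnel.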
Task 18: If it’s Python, use tracemalloc (best when enabled early)
cr0x@server:~$ python3 -c 'import tracemalloc; tracemalloc.start(); a=[b"x"*1024 for _ in range(10000)]; print(tracemalloc.get_traced_memory())'
(10312960, 10312960)
What it means: tracemalloc reports current and peak tracked allocations (bytes).
Decision: In real services, you enable tracemalloc at startup or via a feature flag. If it’s not enabled, don’t pretend you can reconstruct Python object allocation history from RSS alone.
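What makes tracemalloc useful is comparing snapshots, not reading a single total. A throwaway illustration of the API (the allocation here is synthetic; in a real service you'd take snapshots from a debug endpoint or signal handler):
cr0x@server:~$ python3 -c 'import tracemalloc; tracemalloc.start(); s1 = tracemalloc.take_snapshot(); leak = [b"x" * 1024 for _ in range(10000)]; s2 = tracemalloc.take_snapshot(); print(s2.compare_to(s1, "lineno")[0])'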
Mini-story 1: the wrong assumption (the “RSS equals leak” trap)
A mid-sized SaaS company ran a Debian fleet hosting a multi-tenant API. Every Monday morning, one node would creep up in memory until it hit swap, then latency would go geometric. The on-call did what on-calls do: found the biggest RSS process and filed a “memory leak” ticket against the API team.
The API team responded with the usual defensive art: “works on staging,” a few heap graphs, and a sincere belief that their garbage collector was an innocent bystander. Operations kept pointing at RSS. The debate became religious: “Linux is caching” vs “your service is leaking.”
The wrong assumption was simple: that rising RSS means rising live data. It doesn’t. The service used a library that memory-mapped large read-only datasets for fast lookups. On Monday, a batch job rotated new datasets and the service mapped the new ones before unmapping the old ones. For a while, both sets coexisted. RSS jumped. Not a leak; a rollout pattern.
The fix was not a profiler. It was coordinating dataset rotation and forcing a mapping strategy that swapped the pointer once the new map was ready, then unmapping the old region immediately. After that, RSS still rose on Mondays—just less and for shorter windows. They also changed alerting from “RSS > threshold” to “MemAvailable trending down + major faults + latency.”
Nobody loves the moral of that story because it’s boring: measure the right thing before assigning blame. But it’s cheaper than holding a weekly trial.
Mini-story 2: the optimization that backfired (allocator knobs meet real traffic)
Another org, another quarter, another “cost efficiency initiative.” They ran a C++ service on Debian and wanted tighter memory usage. A well-meaning engineer read about glibc arenas and decided to reduce memory footprint by setting MALLOC_ARENA_MAX=2 across the fleet.
The change worked in pre-prod. RSS dropped under synthetic load. The graphs looked fantastic. Then production happened: traffic patterns had more long-lived connections, more per-request concurrency spikes, and crucially, different object lifetimes. Latency started stuttering. CPU went up. Memory did not actually become stable; it became contended.
With too few arenas, threads fought over allocator locks. Some requests slowed, queues grew, and the service held onto memory longer because it was busy being slow. The on-call saw rising RSS and blamed a leak again. They were measuring an effect, not the cause.
The rollback fixed latency. Memory climbed again—but now it behaved predictably. They eventually moved to jemalloc for that service with profiling enabled in staging and a carefully controlled production rollout. The lesson wasn’t “never tune malloc.” It was “allocator knobs are workload-specific and can turn memory problems into performance problems.”
Joke #2: Tuning malloc in production is like reorganizing the pantry during a dinner party—technically possible, socially questionable.
Mini-story 3: the boring but correct practice that saved the day (budgeting, limits, and reheated runbooks)
A financial services team ran Debian 13 nodes for background processing. The service was not glamorous: a queue consumer that transformed documents and stored results. It also had a history of occasional memory spikes due to third-party parsing libraries.
They did two boring things. First, they used systemd MemoryAccounting with MemoryHigh set to a value that forced reclaim pressure before the host was in trouble. Second, they built a weekly “leak rehearsal” job in staging: ramp load, verify memory plateaus, and capture smaps_rollup snapshots at fixed intervals.
One night, a vendor library update changed behavior and retained huge buffers. Memory started climbing. The service didn’t take the whole node down because the cgroup limit contained it. The node stayed healthy; only that worker group was impacted.
The on-call didn’t need to guess. Alerts included memory.current and memory.events. The runbook already said: if memory.current rises and private dirty dominates, restart the unit, pin the package version, and open an incident for root cause. They followed it, got back to sleep, and fixed the leak properly the next day.
Nothing heroic happened. That was the point. The “boring” practice turned a potential fleet incident into a single-service hiccup.
Low-disruption techniques that actually work
Start with artifacts you can collect without changing the process
Your safest tools are: /proc, systemd, and cgroup counters. They don’t need a restart. They don’t attach debuggers. They don’t introduce the Heisenbug where your profiler “fixes” timing and hides the leak.
- /proc/<pid>/smaps_rollup tells you if memory is private dirty (heap-ish) versus shared/file-backed.
- /sys/fs/cgroup/…/memory.current tells you whether the whole unit is growing, even if it has multiple PIDs.
- journalctl -k tells you if you’re seeing cgroup OOM kills, global OOM, or something else entirely.
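A low-tech collector for the first two: append timestamped snapshots to a file every minute and you get slope plus a change log for the ticket (kernel-log evidence stays in journald anyway). The PID, unit name, and output path are placeholders; run it under tmux or similar.
cr0x@server:~$ cg=$(systemctl show -p ControlGroup --value api-service.service); sudo sh -c "while true; do { date -Is; cat /proc/4123/smaps_rollup /sys/fs/cgroup$cg/memory.current /sys/fs/cgroup$cg/memory.events; echo ---; } >> /var/tmp/mem-evidence.log; sleep 60; done"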
Prefer sampling over tracing when you’re under load
Full tracing of allocations can be expensive and can distort behavior. Sampling profilers and periodic snapshots are the production default.
If you need precise allocation sites, do it briefly, and do it with a rollback plan.
Contain damage with cgroup limits and clean restarts
“Least disruption” doesn’t mean “never restart.” It means you restart intentionally: drain traffic, roll one instance, verify plateau behavior, and continue.
A controlled restart that avoids node-level swap storms is often the kindest act you can perform for your users.
Look for time correlation with workload changes
Leaks often correlate with a specific request type, a cron job, a deploy, or a data shape change.
If memory grows only when a particular queue is active, you’ve just narrowed the search more than any generic tool will.
systemd and cgroups v2: use them, don’t fight them
Debian 13 leans into systemd and cgroups v2. That’s not an ideological statement; it’s the reality you debug in.
If you treat the host as “a bunch of processes” and ignore cgroups, you’ll miss the actual enforcement boundary.
Why cgroup-level thinking matters
A service can be killed while the host still has free memory because it hit its cgroup limit. Conversely, a host can be in trouble even if the service looks fine because another cgroup is hogging memory.
Service-level telemetry prevents you from arguing about which process “looks big” and instead answers: which unit is responsible for pressure.
Use MemoryHigh before MemoryMax
MemoryHigh introduces reclaim pressure; MemoryMax is a hard kill boundary. In practice, MemoryHigh is a gentler early warning that also buys time.
If you set only MemoryMax, your first signal might be a kill event. That’s like discovering your fuel gauge is broken when the engine stops.
Have an explicit OOMPolicy for the unit
If the kernel OOM-kills a process inside the unit, what should systemd do with the rest of it: continue, stop, or kill the whole unit? And should Restart= then bring it back?
Decide based on service behavior: stateless APIs can restart; stateful workers may need careful drain and retry logic.
cr0x@server:~$ sudo systemctl edit api-service.service
cr0x@server:~$ sudo cat /etc/systemd/system/api-service.service.d/override.conf
[Service]
MemoryAccounting=yes
MemoryHigh=14G
MemoryMax=16G
OOMPolicy=kill
Restart=always
RestartSec=5
What it means: You’ve defined the containment boundary and automated recovery. OOMPolicy=kill takes down the whole unit when the kernel OOM-kills any process inside it, and Restart=always brings it back after RestartSec.
Decision: If restarts are safe and your leak is slow, this turns “3am node death” into “controlled instance recycle,” buying time for root cause work.
Don’t confuse “memory used” with “memory charged”
cgroups charge memory differently for anonymous and file cache, and the exact behavior depends on kernel versions and settings.
The point isn’t to memorize every detail; it’s to watch trends and know which counters you’re using.
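When you need to see what a unit's charge is actually made of, the cgroup's memory.stat has the breakdown; a handful of keys settle most arguments (the exact key set varies a bit by kernel version, which is the caveat above in practice):
cr0x@server:~$ cg=$(systemctl show -p ControlGroup --value api-service.service); grep -E '^(anon|file|shmem|slab|kernel_stack|sock) ' /sys/fs/cgroup$cg/memory.stat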
Language-specific leak hunting (Java, Go, Python, C/C++)
Java: split the world into heap vs native
If RSS grows but heap usage is flat, stop blaming GC. You’re in native memory: direct buffers, thread stacks, JNI, code cache, or allocator fragmentation.
NMT (Native Memory Tracking) is your friend, but it is best enabled intentionally (startup flags) for low overhead.
Low-disruption approach: collect NMT summaries periodically, plus GC logs or JFR events if already enabled. If you need a heap dump, prefer a controlled window and ensure you have disk space and IO headroom; heap dumps can be disruptive.
Go: treat heap profiles as evidence, not opinions
Go’s runtime gives you pprof and runtime metrics that are surprisingly good. The risk is exposure: a debug endpoint reachable from the wrong network is an incident of its own.
Keep pprof bound to localhost and tunnel when needed.
Look for: growing in-use heap, rising number of objects, or goroutine counts creeping upward (a different kind of leak).
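Goroutine leaks are trendable from the same endpoint family without any parsing: the first line of the goroutine profile in debug mode is a running total.
cr0x@server:~$ curl -sS 'http://localhost:6060/debug/pprof/goroutine?debug=1' | head -n 1
If that number only ever goes up under flat traffic, you've found a different leak than the heap one.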
Python: leaks can be “native” too
Python services can leak in pure Python objects (tracked by tracemalloc), but also in native extensions that allocate outside Python’s object tracking.
If tracemalloc looks fine and RSS grows anyway, suspect native libs, buffers, and mmap patterns.
C/C++: decide whether you need allocator telemetry or code-level tooling
In C/C++, a real leak is common, but so is “not returning memory to OS.”
If you can afford it, swapping glibc malloc for jemalloc in a controlled rollout can provide profiling and often more stable RSS behavior. But don’t treat allocator replacement as a cure-all.
The least disruptive path: capture smaps_rollup, pmap snapshots, and if needed, short eBPF sampling of allocation stacks rather than always-on tracing.
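If you do trial jemalloc for one service, a systemd drop-in keeps the change scoped and reversible. The library path below is where Debian's libjemalloc2 package lands on amd64; verify it on your host, and note this does require a restart of the unit.
cr0x@server:~$ sudo apt install libjemalloc2
cr0x@server:~$ sudo systemctl edit api-service.service
[Service]
Environment=LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
cr0x@server:~$ sudo systemctl restart api-service.service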
Common mistakes: symptom → root cause → fix
“Used memory is 95%, we’re leaking”
Root cause: page cache is doing its job; the system still has healthy MemAvailable.
Fix: alert on MemAvailable, swap in/out, major faults, and latency. Don’t page humans for Linux using RAM.
“RSS grows, therefore heap leak”
Root cause: file-backed mmaps, allocator fragmentation, or native memory growth (Java direct buffers, Python extensions).
Fix: smaps_rollup to split private dirty vs file-backed; use language tooling (JVM NMT, pprof) to confirm.
“Service died, must be a segfault”
Root cause: cgroup OOM kill (often silent at app level) or system-wide OOM.
Fix: journalctl -k for OOM lines; check memory.events in the service cgroup; decide on MemoryMax/OOMPolicy strategy.
“We’ll fix it by adding swap”
Root cause: swap masks leaks and turns them into latency incidents.
Fix: keep swap modest, monitor swap activity, and treat sustained swap-in/out as a P1. Use containment and restarts, not infinite swap.
“We turned on heavy profiling and the leak disappeared”
Root cause: observer effect; timing changes; different allocation patterns.
Fix: prefer sampling; collect multiple short windows; correlate with load; reproduce in staging with production-like traffic shapes.
“We set MemoryMax and now it randomly restarts”
Root cause: limit set too low for legitimate peaks, or MemoryHigh absent so you get sudden death.
Fix: set MemoryHigh below MemoryMax, measure peaks, and adjust. Use graceful shutdown hooks and autoscaling if possible.
“The container limit is fine; the host OOMed anyway”
Root cause: other cgroups unbounded; kernel memory pressure; file cache and dirty pages; or misconfigured accounting.
Fix: inspect memory.current across top-level cgroups; ensure accounting is enabled; set sane limits for noisy neighbors.
Checklists / step-by-step plan
Checklist A: 10-minute confirmation (no restarts, no new agents)
- Check /proc/meminfo: is MemAvailable dropping over time?
- Run vmstat 1: is swap active (si/so), is IO wait climbing?
- Identify top RSS process via ps --sort=-rss.
- Confirm growth with repeated reads of /proc/<pid>/status (VmRSS).
- Use /proc/<pid>/smaps_rollup: private dirty vs file-backed.
- Check unit memory with systemctl status and cgroup memory.current.
- Search kernel logs for OOM events and victims.
Checklist B: Containment without drama (when the leak is real)
- Estimate time-to-failure: current RSS growth rate vs available headroom.
- Decide containment boundary: per-service MemoryHigh/MemoryMax or node scaling.
- Set MemoryHigh first, then MemoryMax if needed. Prefer small iterative changes.
- Ensure OOMPolicy and Restart behavior match service type.
- Plan a rolling restart window; drain traffic if applicable.
- Capture “before restart” artifacts: smaps_rollup, pmap totals, memory.current, memory.events.
- After restart: verify memory plateau under the same workload.
Checklist C: Root cause work (when users are safe)
- Pick the right tool: pprof/JFR/NMT/tracemalloc/allocator profiling.
- Correlate leak to workload: endpoints, job types, payload sizes.
- Reproduce in staging with production-like concurrency and data.
- Fix retention: add eviction, bounds, timeouts, and backpressure.
- Add dashboards: memory.current, private dirty share, OOM events, restart counts.
- Add a regression test: run the suspected workload for long enough to show slope.
FAQ
1) How do I tell “leak” from “allocator not returning memory”?
Look for monotonic growth in private dirty memory (smaps_rollup) correlated with object counts or heap metrics. If app-level live data drops but RSS stays high, suspect allocator behavior/fragmentation.
2) Why does RSS keep rising even when traffic is flat?
True leak, cache retention, background jobs, or a slow “once per hour” task. Confirm with repeated VmRSS checks and correlate with request/job logs. Flat traffic doesn’t mean flat workload shape.
3) What’s the single most useful /proc file for this?
/proc/<pid>/smaps_rollup. It’s compact enough to grab during incidents and gives you decisive breakdown signals.
4) Should I set MemoryMax for every systemd service?
On shared hosts: yes, almost always. On dedicated nodes: maybe. Limits prevent noisy neighbors from taking the machine down, but they also introduce hard-fail behavior you must design for.
5) How do I know if the OOM killer hit my service?
Check journalctl -k for “Killed process” lines and check the service cgroup’s memory.events (oom/oom_kill counters). Don’t guess.
6) Is adding swap a valid mitigation?
It’s a short-term crutch, not a fix. Swap can prevent immediate OOM but often converts failures into latency spikes and IO storms. Use it sparingly, monitor it aggressively.
7) Can I debug leaks with zero restarts?
Sometimes. /proc and cgroup stats require no restarts. eBPF sampling often requires only privileges, not restarts. Language-level profilers vary: Go and JVM tools can often attach; Python tracemalloc is best enabled at startup.
8) What’s the least disruptive way to collect “before/after” evidence?
Take periodic snapshots: smaps_rollup, pmap totals, memory.current, and memory.events. Store them with timestamps. That gives you a slope and a change log without heavy instrumentation.
9) Why does systemd show “Memory: 12G” but ps shows different RSS?
systemd reports cgroup memory usage for the whole unit (including children and cache charged to the cgroup). ps shows per-process RSS. They answer different questions; use both.
10) When should I stop debugging and just roll back?
If the leak started after a deploy and you have a safe rollback, do it early. Root cause can wait. Users don’t care that you found the perfect allocation site at 04:40.
Conclusion: next actions you can take today
Memory leaks in services aren’t solved by vibes. On Debian 13, the lowest-disruption path is consistent: confirm pressure with MemAvailable and swap, identify the growing unit via cgroups, classify memory via smaps_rollup, and only then choose deeper tooling.
Next steps that pay off immediately:
- Add service dashboards for memory.current, memory.events, and restart counts per unit.
- Set MemoryHigh (and sometimes MemoryMax) for noisy services, plus an explicit OOMPolicy.
- Update alerting to focus on MemAvailable trend, swap activity, major faults, and latency, not raw “used memory.”
- Make a habit of capturing smaps_rollup snapshots during incidents so you can stop arguing and start fixing.
You can’t prevent every leak. You can prevent most leak incidents from becoming host-level disasters. That’s the real win.