The Proxmox host was “fine” for months. Then one Tuesday: VMs pause, the UI lags, ssh takes forever, and the logs read like a crime scene.
Something got shot. It wasn’t the disk. It wasn’t the network. It was memory.
When the Linux OOM-killer shows up on a virtualization host, it’s not being dramatic. It’s doing the last responsible thing it can do:
kill a process so the kernel can keep breathing. Your job is to stop inviting it over.
What an OOM event really is (and what it is not)
On Linux, “out of memory” doesn’t mean “RAM hit 100%”. It means the kernel could not satisfy a memory allocation request
without breaking its own rules, and it couldn’t reclaim enough memory fast enough to remain stable.
At that point, Linux has a short list of bad options. The least bad option is to kill something.
On Proxmox, this is especially spicy because the host is both a hypervisor and a workload platform:
it runs QEMU processes (one per VM), LXC container processes, storage daemons, monitoring agents, and the Proxmox management stack.
If the host loses the wrong process, the hypervisor can survive but your VMs can pause or crash, and your HA story turns into interpretive dance.
The kernel’s priorities during memory pressure
Linux tries to reclaim memory in this rough order: free page cache, reclaim anonymous memory (swap), compact memory, and finally
invoke the OOM-killer. On a host with swap disabled and aggressive overcommit, you skip half the ladder and jump straight to “kill a process”.
That jump is what you see as an OOM event.
OOM-killer vs systemd-oomd vs cgroup limits
Three different “someone killed my process” mechanisms routinely get mixed up:
- Kernel OOM-killer: global, last-resort, chooses a victim based on “badness” scoring.
- cgroup OOM: a container/VM cgroup hits its memory limit; the kernel kills within that cgroup, not globally.
- systemd-oomd: a userspace daemon that kills earlier based on pressure signals (PSI). It can be a lifesaver or a nuisance.
They leave different fingerprints in logs, and the fix depends on which one fired.
If you don’t know which one fired, you’re going to “fix” it by adding RAM and hoping. That works until it doesn’t.
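Before you chase all three, it helps to know whether systemd-oomd is even in play on this node. The check below is generic systemd tooling; on many Proxmox installs the daemon is absent or disabled, so expect the output to vary.
cr0x@server:~$ systemctl is-active systemd-oomd; systemctl is-enabled systemd-oomd
inactive
disabled
If it is inactive, you can strike one suspect off the list immediately and focus on kernel vs cgroup OOM.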
Here’s the dry truth: if you treat the OOM-killer as a bug, you’ll keep getting surprise visits. Treat it like a smoke alarm.
The smoke alarm isn’t your problem; your kitchen habits are.
One idea that holds up in ops, paraphrasing John Allspaw: reliability comes from owning how your systems fail and designing around that reality.
Joke #1: The OOM-killer is like an HR layoff during a budget crunch—fast, unfair, and everyone insists it was “based on metrics.”
Interesting facts and a little history
These are not trivia for trivia’s sake. Each fact explains why your Proxmox host behaves the way it does when memory gets tight.
- Linux has had an OOM-killer for decades, because the kernel cannot simply “return NULL” in many critical allocation paths without risking corruption.
- Early Linux systems relied heavily on swap; modern fleets often disable swap by policy, which makes OOM events more abrupt and less forgiving.
- cgroups changed the game: instead of the whole machine going down, memory can be constrained per service or container—and the kills happen locally.
- “Badness” scoring considers memory usage and adjusters (like oom_score_adj), which is why a big QEMU process is often the victim.
- ZFS ARC is not “stealing” memory; it’s a cache designed to grow and shrink. But it can still contribute to pressure when mis-tuned.
- Ballooning exists because overcommit exists: it’s a way to reclaim guest memory under pressure, but it only helps if guests cooperate and have reclaimable pages.
- PSI (pressure stall information) is a relatively modern Linux feature that measures time spent stalled on resource pressure; it enabled proactive killers like systemd-oomd.
- THP (Transparent Huge Pages) can increase allocation latency and fragmentation issues under pressure, which sometimes turns “slow” into “OOM”.
- File cache is not “free” memory and not “used” memory in the same way; Linux will reclaim it, but not always fast enough for bursty allocators.
How memory gets consumed on a Proxmox host
Three buckets that matter
Think of host memory as three buckets that fight each other:
- Guest memory: QEMU RSS for VMs, plus container memory usage. This is the big one.
- Host services: pveproxy, pvedaemon, corosync, ceph (if you run it), monitoring, backups, and random “one small agent” that quietly eats 1–2 GB.
- Cache and kernel: page cache, slab, ZFS ARC, metadata, and kernel allocations. This can be huge and it is supposed to be huge—until it isn’t.
VMs: QEMU memory isn’t just “the VM’s RAM”
When you assign 16 GB to a VM, the QEMU process typically maps that memory. With KVM, much of it is backed by host RAM.
Ballooning, KSM, and swap can change the shape, but the simple operational rule is:
VM assigned RAM is a reservation against your host unless you have a proven overcommit plan.
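If KSM is part of that overcommit plan, verify it is actually merging pages before you count on it. The sysfs counters below are standard kernel interfaces; the numbers are illustrative.
cr0x@server:~$ grep -H '' /sys/kernel/mm/ksm/pages_sharing /sys/kernel/mm/ksm/pages_shared
/sys/kernel/mm/ksm/pages_sharing:412876
/sys/kernel/mm/ksm/pages_shared:103219
Roughly, pages_sharing × 4 KiB is the memory you are currently saving. If it sits near zero, KSM is not part of your real budget.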
Also: backups, snapshots, and heavy I/O can create transient memory spikes in QEMU and the kernel. If you budget only for steady-state,
you will have OOM events during “boring maintenance windows” at 02:00. That’s when nobody is watching and everything is writing.
Containers: limits are your friend, until they’re lies
LXC on Proxmox uses cgroups. If you set a memory limit, the container can’t exceed it. Great.
But if you set it too low, the container will OOM internally, which looks like “my app randomly died.”
If you set it too high, you haven’t really limited anything.
ZFS: ARC is a performance feature, not a free buffet
Proxmox loves ZFS, and ZFS loves RAM. ARC will grow to improve read performance and metadata caching.
It will shrink under memory pressure, but not always at the speed you need when a workload suddenly allocates a few gigabytes.
ARC can be tuned. Do it deliberately. Don’t do it because a forum post told you “set it to 50%”.
Swap: not a sin, a circuit breaker
Swap on a Proxmox host isn’t about running everything from disk. It’s about surviving spikes and giving the kernel time to reclaim.
A small amount of swap (or zram) can turn a hard OOM into a slow incident you can mitigate. Yes, swapping is ugly. So is downtime.
Joke #2: Disabling swap to “prevent slowness” is like removing your car’s airbags to save weight—technically faster, medically questionable.
Fast diagnosis playbook
You don’t win an OOM incident by reading every log line. You win by proving which mechanism killed what, and why memory pressure became unrecoverable.
This is the order that gets you answers fast.
1) Confirm what killed the process
- Kernel global OOM-killer?
- cgroup memory limit OOM inside a container/service?
- systemd-oomd preemptive kill?
2) Identify the victim and the aggressor
The victim is what got killed. The aggressor is what caused the pressure.
Sometimes they’re the same. Often they’re not: one VM allocates, another VM gets killed because it looked “fatter”.
3) Determine if it’s capacity or spike
Capacity issues show sustained high anon memory and swap usage (if swap exists). Spike issues show sudden stalls, high allocation rate,
and frequent reclaim/compaction. Your fix differs: capacity means re-budgeting; spikes mean smoothing, limiting, or adding a buffer.
4) Check host-level culprits fast
- ZFS ARC too large or slow to shrink?
- Kernel slab growth (e.g., network, ZFS metadata, inode/dentry) due to a leak or workload pattern?
- Backup jobs causing burst allocations?
- Ballooning misconfigured or ineffective?
5) Decide: limit, move, or add
Your three levers are (1) set limits so a single tenant can’t burn the building down, (2) redistribute workloads, and (3) add RAM/swap.
“Just add RAM” is not wrong, it’s just not a plan if you keep overcommitting.
Practical tasks: commands, outputs, and decisions
These tasks are ordered from “confirm the event” to “set guardrails” to “prove you fixed it”.
Each includes a realistic snippet of output and what decision you make from it.
Task 1: Find OOM-killer events in the kernel log
cr0x@server:~$ journalctl -k -g 'Out of memory|oom-killer|Killed process' -n 200
Dec 26 10:41:12 pve kernel: oom-killer: constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0
Dec 26 10:41:12 pve kernel: Out of memory: Killed process 22134 (qemu-system-x86) total-vm:28764564kB, anon-rss:15632240kB, file-rss:12428kB, shmem-rss:0kB, UID:0 pgtables:34912kB oom_score_adj:0
What it means: This is a global kernel OOM kill (constraint=CONSTRAINT_NONE). A QEMU process died; a VM likely crashed.
Decision: Treat as host memory exhaustion, not a per-container limit. Move to host capacity and overcommit checks.
Task 2: Determine whether it was a cgroup memory limit OOM
cr0x@server:~$ journalctl -k -g 'Memory cgroup out of memory' -n 50
Dec 26 10:13:07 pve kernel: Memory cgroup out of memory: Killed process 18872 (node) total-vm:1876540kB, anon-rss:824112kB, file-rss:1220kB, shmem-rss:0kB, UID:1000 pgtables:3420kB oom_score_adj:0
Dec 26 10:13:07 pve kernel: Tasks in /lxc.payload.104.slice killed as a result of limit of /lxc.payload.104.slice
What it means: This is local to a cgroup (here, an LXC container). The host wasn’t necessarily out of memory.
Decision: Increase container limit, fix app memory, or add swap inside the container if appropriate. Don’t “fix” by tuning host ARC.
Task 3: Check whether systemd-oomd killed something
cr0x@server:~$ journalctl -u systemd-oomd -n 100
Dec 26 10:20:55 pve systemd-oomd[782]: Killed /system.slice/pveproxy.service due to memory pressure (memory pressure 78.21%).
Dec 26 10:20:55 pve systemd-oomd[782]: system.slice: memory pressure relieved.
What it means: Userspace killer acted early based on pressure signals.
Decision: Either tune oomd policies or exempt critical services; still investigate why pressure got that high.
Task 4: See current memory layout (including cache and swap)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 62Gi 54Gi 1.2Gi 1.1Gi 6.9Gi 2.8Gi
Swap: 8.0Gi 6.4Gi 1.6Gi
What it means: Available memory is low; swap is heavily used. You’re in sustained pressure.
Decision: This is likely capacity or runaway workload. Consider migrating VMs, increasing RAM, and reviewing per-guest sizing. Don’t just restart things.
Task 5: Check pressure stall information (PSI) for memory
cr0x@server:~$ cat /proc/pressure/memory
some avg10=12.34 avg60=9.88 avg300=4.21 total=9349221
full avg10=3.21 avg60=2.10 avg300=0.88 total=1883211
What it means: “some” indicates time tasks were stalled on memory reclaim; “full” indicates system-wide stalls.
Decision: If full is non-trivial under normal load, you need buffer (swap/zram), less overcommit, or workload redistribution.
Task 6: Identify top memory consumers quickly
cr0x@server:~$ ps -eo pid,comm,rss,vsz,oom_score_adj --sort=-rss | head -n 12
PID COMMAND RSS VSZ OOM_SCORE_ADJ
22134 qemu-system-x86 15632240 28764564 0
9921 qemu-system-x86 8241100 16544320 0
18772 pveproxy 612332 1054320 0
1443 pvedaemon 244120 612044 0
What it means: QEMU processes dominate. That’s normal, but it means your overcommit math must be correct.
Decision: Map PIDs to VMIDs, then decide which guests are oversized or misbehaving.
Task 7: Map QEMU PID to VMID
cr0x@server:~$ pgrep -af 'kvm -id' | head -n 3
22134 /usr/bin/kvm -id 101 -name vm101,debug-threads=on -m 16384 ...
9921 /usr/bin/kvm -id 102 -name vm102,debug-threads=on -m 8192 ...
What it means: PID 22134 is VM 101 with 16 GB assigned.
Decision: Inspect VM 101 workload, balloon settings, and whether the assigned RAM is justified.
Task 8: Inspect VM config for ballooning and memory settings
cr0x@server:~$ qm config 101
agent: 1
balloon: 4096
boot: order=scsi0;net0
cores: 8
memory: 16384
name: vm101
ostype: l26
scsi0: rpool:vm-101-disk-0,size=200G
What it means: VM has 16 GB max, balloon target 4 GB. This allows reclaiming up to ~12 GB if the guest cooperates.
Decision: If host pressure is frequent, ballooning can help, but only if the guest has the balloon driver and can free memory. Otherwise it’s wishful thinking.
Task 9: Check whether ballooning is actually active (virtio-balloon)
cr0x@server:~$ qm monitor 101
Entering QEMU Monitor for VM 101 - type 'help' for help
qm> info balloon
balloon: actual=16384
qm> quit
What it means: The guest is currently holding the full 16 GB; ballooning is not reclaiming.
Decision: If you expected it to shrink, verify guest drivers/agent and whether the guest is genuinely using RAM. Otherwise, stop relying on ballooning to save the host.
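A related sanity check: confirm the guest agent responds at all. This proves the agent channel works, not the balloon driver itself, but a guest that answers neither is a strong hint the virtio drivers were never installed. VMID 101 is the example VM from above; on success the command typically prints nothing.
cr0x@server:~$ qm guest cmd 101 ping
cr0x@server:~$ echo $?
0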
Task 10: Check LXC container memory limit and current usage
cr0x@server:~$ pct config 104 | egrep 'memory|swap|cores'
cores: 4
memory: 2048
swap: 512
cr0x@server:~$ cat /sys/fs/cgroup/lxc.payload.104.slice/memory.current
1938479104
What it means: Container limit is 2 GB; it’s using ~1.8 GB. Close to the ceiling.
Decision: If it OOMs inside the container, raise the limit or fix the workload. If it never should exceed 2 GB, keep the limit and tune the app.
Task 11: See cgroup memory events (OOM counts)
cr0x@server:~$ cat /sys/fs/cgroup/lxc.payload.104.slice/memory.events
low 0
high 0
max 3
oom 3
oom_kill 3
What it means: The cgroup hit max and killed tasks 3 times.
Decision: This is not a host OOM problem. Raise container memory and/or investigate the in-container process memory growth.
Task 12: Check swap status and what kind of swap you have
cr0x@server:~$ swapon --show
NAME TYPE SIZE USED PRIO
/dev/sda3 partition 8G 6.4G -2
What it means: There is swap and it is being used. Good: it gives breathing room. Bad: you might be thrashing.
Decision: If swap is constantly high and latency is bad, reduce overcommit and right-size guests. If swap is zero and you OOM, add a small swap device.
Task 13: Check ZFS ARC size and limits (if using ZFS)
cr0x@server:~$ awk '/^(c_max|size) / {printf "%-6s %.1f GiB\n", $1, $3/2^30}' /proc/spl/kstat/zfs/arcstats
c_max  48.0 GiB
size   28.4 GiB
What it means: ARC is 28.4 GB, allowed up to 48 GB. That’s a lot on a 62 GB host that’s already tight.
Decision: Consider capping ARC. But only after confirming memory pressure is real and not a temporary spike, and understanding your storage workload.
Task 14: Check kernel overcommit settings
cr0x@server:~$ sysctl vm.overcommit_memory vm.overcommit_ratio
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
What it means: Default heuristic overcommit. Many DBs and JVMs can behave badly under pressure regardless.
Decision: Don’t “fix OOM” by flipping overcommit knobs blindly. Use limits and capacity planning first; then tune if you understand the risk.
Task 15: See if the kernel is spending time compacting memory (fragmentation pain)
cr0x@server:~$ grep -E 'compact|pgscan|pgsteal' /proc/vmstat | head
compact_migrate_scanned 184921
compact_free_scanned 773212
compact_isolated 82944
pgscan_kswapd 29133211
pgsteal_kswapd 28911220
What it means: High scan/steal and compaction activity can indicate heavy reclaim and fragmentation.
Decision: Expect latency. Add buffer (swap), reduce huge allocations, review THP, and reduce overcommit.
Task 16: Verify if Transparent Huge Pages is enabled
cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
What it means: THP is set to always. For some virtualization hosts and certain DB/VM mixes, this increases compaction pressure.
Decision: Consider switching to madvise if you observe reclaim/compaction issues and can validate performance impact.
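If you decide to test madvise, you can flip it at runtime first; the change does not survive a reboot, which is convenient for a trial. The sysfs path is standard; persist it later via the kernel command line (transparent_hugepage=madvise) or a small unit file if the results justify it.
cr0x@server:~$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
madvise
cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never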
Task 17: Confirm which guests are configured and how much memory you’ve promised
cr0x@server:~$ qm list
VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID
101 vm101 running 16384 200.00 22134
102 vm102 running 8192 100.00 9921
103 vm103 running 4096 80.00 7744
cr0x@server:~$ awk '/^memory:/{sum+=$2} END{print sum " MB assigned to VMs"}' /etc/pve/qemu-server/*.conf
28672 MB assigned to VMs
What it means: You can quantify “promised” VM memory. This is step zero in overcommit math.
Decision: Compare promised memory plus host services plus ARC/cache needs against physical RAM. If you’re near the line, you need limits and headroom.
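To finish the math, put the promised number next to physical RAM. The figure below assumes the 62 GiB host from Task 4; your number will differ.
cr0x@server:~$ free -m | awk '/^Mem:/{print $2 " MB physical RAM"}'
63514 MB physical RAM
28672 MB promised to VMs plus a ~28 GiB ARC plus host services leaves only a few gigabytes for spikes, which is exactly the “available 2.8Gi” picture from Task 4.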
Task 18: Inspect host memory cgroup and systemd slices (who is protected)
cr0x@server:~$ systemctl show -p MemoryMax -p MemoryHigh -p ManagedOOMMemoryPressure system.slice
MemoryMax=infinity
MemoryHigh=infinity
ManagedOOMMemoryPressure=auto
What it means: No memory caps at the slice level; oomd may manage pressure automatically.
Decision: If oomd kills important Proxmox services, you may need to adjust oomd behavior or protect critical units with OOMScoreAdjust and policies.
Setting sane limits: VMs, containers, host, and ZFS
The one rule you should actually follow
Leave headroom on the host. Real headroom, not theoretical. On a Proxmox node, 15–25% of RAM reserved for the host
is a sane starting point when running ZFS and a mix of guests. If you run Ceph on the same node (people do), reserve more.
If you can’t afford the headroom, you can’t afford the workload density. That’s not philosophical. It’s physics.
VM memory sizing: stop treating it like a wish list
For KVM VMs, you have three common patterns:
- Static allocation (no ballooning): most predictable. Best for important workloads.
- Ballooning with a realistic minimum: OK for general-purpose fleets, not OK as your only safety mechanism.
- Aggressive overcommit: only if you can prove it with metrics, and you accept occasional performance collapse under pressure.
Recommended VM policy (opinionated)
- Production databases: no ballooning, dedicated memory, no host overcommit beyond mild levels.
- General app servers: ballooning allowed, but set balloon minimum no lower than what the app needs under peak.
- Build agents, dev boxes: ballooning and tighter caps are acceptable; they are supposed to be cheap and slightly annoying.
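For the “no ballooning” tier, the knob is per-VM: in Proxmox, setting the balloon value to 0 disables the ballooning device. VM 101 below is just the example VM from earlier, not a recommendation about your VM 101.
cr0x@server:~$ sudo qm set 101 --balloon 0
update VM 101: -balloon 0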
LXC container limits: do it, but do it with intent
Containers share the host kernel. That makes them efficient and also makes bad limits extra painful because the kernel will enforce them hard.
If a container hosts a Java app with a poorly configured heap, a 2 GB limit means the kernel will eventually kill it.
That’s “working as designed.” Your design is just bad.
For LXC, a simple baseline:
- Set memory to what the workload needs under normal peak.
- Set swap to a small value (like 25–50% of memory) if the workload is bursty, unless you have strict latency requirements.
- Track OOM counts via memory.events. If you see oom_kill increasing, you’re under-provisioned or leaking.
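Applying that baseline is one command per container. The values below are placeholders, not recommendations; on a running container the new limits are generally applied live via cgroups.
cr0x@server:~$ sudo pct set 104 --memory 4096 --swap 1024
cr0x@server:~$ pct config 104 | egrep 'memory|swap'
memory: 4096
swap: 1024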
Host swap: boring and effective
For a Proxmox node, a modest swap device is usually the right call. Not because you want swapping. Because you want survivability.
4–16 GB is a common range depending on node size and workload volatility. If you have ultra-low-latency requirements, consider zram or accept the risk.
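If you go the zram route, a minimal manual sketch looks like the following. It does not persist across reboots (a packaged service such as zram-tools is the durable option), and the 4G size and lz4 algorithm are placeholders, not recommendations.
cr0x@server:~$ sudo modprobe zram
cr0x@server:~$ echo lz4 | sudo tee /sys/block/zram0/comp_algorithm
lz4
cr0x@server:~$ echo 4G | sudo tee /sys/block/zram0/disksize
4G
cr0x@server:~$ sudo mkswap /dev/zram0 && sudo swapon -p 100 /dev/zram0
Setting up swapspace version 1, size = 4 GiB (4294963200 bytes)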
Swappiness and why you should not obsess over it
People set vm.swappiness=1 like it’s a badge of honor. On a virtualization host, you want the kernel to prefer RAM,
but you also want it to use swap as a pressure relief valve. Values around 10–30 are often reasonable.
The correct number is the one that matches your workload’s tolerance and your incident budget.
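Checking and persisting the value is two commands. The 20 below is an illustration of the 10–30 range, not a universal answer.
cr0x@server:~$ sudo sysctl -w vm.swappiness=20
vm.swappiness = 20
cr0x@server:~$ echo 'vm.swappiness = 20' | sudo tee /etc/sysctl.d/90-swappiness.conf
vm.swappiness = 20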
ZFS ARC: cap it when the host is a hypervisor
If your Proxmox node is a storage server first and a hypervisor second, you can let ARC grow.
If it’s a hypervisor first, ARC should not be allowed to take half the machine unless you have a big RAM budget.
A practical approach:
- Measure ARC under typical workload. Don’t tune based on a quiet hour.
- Decide on host headroom first (say 20%).
- Cap ARC so that “VMs + containers + host services + ARC + buffer” fits comfortably.
How to set a ZFS ARC max (example)
On Debian/Proxmox, you typically set a module option. Example: cap ARC to 16 GiB.
cr0x@server:~$ echo "options zfs zfs_arc_max=17179869184" | sudo tee /etc/modprobe.d/zfs-arc.conf
options zfs zfs_arc_max=17179869184
cr0x@server:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.8.12-5-pve
What it means: ARC max will apply after reboot (or module reload, which you generally don’t do on a production node unless you enjoy adrenaline).
Decision: Reboot in a maintenance window, then validate ARC behavior and VM performance. If read latency spikes, you capped too hard.
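If you need the cap before the next maintenance window, the module parameter is usually writable at runtime as well; the ARC shrinks gradually rather than instantly, and you keep the modprobe.d file so the setting survives the eventual reboot. The value matches the 16 GiB example above.
cr0x@server:~$ echo 17179869184 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
17179869184
cr0x@server:~$ awk '/^c_max / {printf "%.1f GiB\n", $3/2^30}' /proc/spl/kstat/zfs/arcstats
16.0 GiB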
Protecting the host from the guests (and from itself)
Two practical protections:
- Don’t let the host run with only a sliver of available memory under normal load. If “available” is regularly under a few GB, you are already in the incident.
- Ensure the Proxmox management plane is hard to kill. If pveproxy dies, recovery becomes slower and riskier, especially with HA clusters.
OOM score tuning: use sparingly, but use it for critical services
Linux uses oom_score_adj to bias OOM decisions. You can make critical services less likely to be killed.
Don’t make everything “unkillable” unless you want the kernel to panic or kill something truly essential.
cr0x@server:~$ sudo systemctl edit pveproxy.service
# add these two lines to the override file, then save and exit:
[Service]
OOMScoreAdjust=-800
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart pveproxy.service
What it means: pveproxy is less likely to be selected as a victim.
Decision: Protect management and clustering services. Do not protect memory hogs.
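Verify the override actually landed on the running process; /proc is authoritative. The pgrep -o below assumes the oldest matching PID is the pveproxy main process.
cr0x@server:~$ cat /proc/$(pgrep -o pveproxy)/oom_score_adj
-800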
Three corporate mini-stories from the trenches
1) The incident caused by a wrong assumption: “available memory is free memory”
A mid-sized company ran a Proxmox cluster for internal apps: ticketing, CI runners, a couple of databases, and a few “temporary” analytics boxes.
The node graphs showed RAM at 90–95% used most of the time. Nobody panicked because “Linux uses memory for cache.”
True. Also incomplete. The key metric was available, not used.
A new VM was deployed for a vendor appliance. It had a big Java service and a habit of allocating memory in bursts during log ingestion.
During the first ingestion run, the host dipped into reclaim, then swap, then the kernel had to choose: kill something.
It killed a different VM’s QEMU process because it had the largest RSS. That VM happened to be the ticketing database.
The postmortem was messy because the vendor VM was blamed (“it’s the new thing”), but it wasn’t actually “misbehaving” by its own definition.
The host was running with too little headroom. Page cache wasn’t the villain; it was the canary.
The wrong assumption was that high “used” memory is fine as long as it’s cache. Cache is reclaimable, but reclaim is not free, and under burst load it’s not fast.
The fix wasn’t exotic: reserve host headroom, cap ARC, add a small swap partition, and enforce reasonable memory limits on containers.
They also stopped pretending ballooning would save them when guests were legitimately using their RAM. The OOMs stopped, and so did the 2 a.m. mysteries.
2) The optimization that backfired: “swap off for performance”
Another shop ran latency-sensitive services and decided swap was the enemy. A consultant had recommended disabling swap years earlier on bare-metal database boxes.
Someone applied the same recommendation to Proxmox nodes. It sounded plausible. It was also context-free.
For a while, everything felt faster. Less swap meant fewer surprise stalls during normal operation. Victory.
Then they added a nightly backup routine that did snapshot-based exports, compression, and some encryption.
The node would spike in memory use, reclaim would ramp, and then—no swap available—OOM would hit.
It killed QEMU, which is the hypervisor equivalent of cutting the brake lines to reduce car weight.
The “performance optimization” had turned a survivable spike into a hard crash. Worse, it made the incident pattern unpredictable:
sometimes the host lived, sometimes it killed a different VM, sometimes it killed a host daemon and made troubleshooting harder.
Disabling swap wasn’t inherently evil; doing it without a buffer strategy was.
The eventual compromise was boring and effective: re-enable swap, but set conservative swappiness, and monitor PSI.
They also moved the backup workload to a dedicated node with more memory headroom. Latency-sensitive services stayed happy, and the nightly roulette ended.
3) The boring but correct practice that saved the day: “limits and headroom are non-negotiable”
A third team ran Proxmox for mixed workloads across several departments. They were not unusually skilled; they were unusually disciplined.
Every VM request had a memory budget and a justification. Containers had explicit memory and swap limits.
And every node had a fixed “host reserve” target that was treated like a capacity tax, not an optional suggestion.
One quarter, a vendor pushed an update that introduced a memory leak in an agent running in multiple containers.
The leak was slow and polite—until it wasn’t. Containers started hitting their cgroup limits and getting killed.
Annoying, but contained. The host stayed stable. No global OOM events, no hypervisor crash, no cascading failure.
Their monitoring caught the increasing oom_kill counters per container, and the team rolled back the agent version.
A few services restarted, customers saw minor blips, and the incident ended in the same hour it began.
The “boring” practice—enforced limits and headroom—turned a potential host-level outage into a manageable application bug.
If you’re building reliability, this is what winning looks like: fewer dramatic incidents, more small contained failures, and fast diagnosis.
Not glamorous. Very effective.
Common mistakes: symptoms → root cause → fix
1) Symptom: “Random VM crashes; host stays up”
Root cause: Kernel global OOM-killer killed a QEMU process; the VM died.
Fix: Reduce overcommit, add host headroom, verify swap strategy, and consider ARC capping. Confirm via journalctl -k.
2) Symptom: “Containers restart; host memory looks fine”
Root cause: cgroup memory limit OOM inside the container; local kill.
Fix: Increase container memory/swap, tune the application (heap limits), or enforce better workload sizing. Verify via memory.events.
3) Symptom: “Proxmox web UI dies during pressure; ssh still works”
Root cause: systemd-oomd or kernel OOM killed pveproxy/pvedaemon because they were killable and had non-trivial RSS.
Fix: Protect critical services with OOMScoreAdjust; reduce pressure triggers; tune oomd if enabled.
4) Symptom: “OOM happens during backups or replication jobs”
Root cause: Transient memory spikes from compression/encryption, snapshot operations, and kernel cache behavior.
Fix: Schedule concurrency limits, move backup jobs off-node, add swap buffer, reserve more headroom.
5) Symptom: “No swap, no warning, instant OOM under burst”
Root cause: Swap disabled; reclaim cannot keep up.
Fix: Add modest swap or zram; tune swappiness; use PSI to catch pressure before OOM.
6) Symptom: “ZFS ARC ‘eats’ memory; OOM after heavy reads”
Root cause: ARC grows, then guest allocations spike; ARC shrink lag contributes to pressure.
Fix: Cap ARC based on measured workload; ensure host reserve; validate post-change latency and cache hit ratio.
7) Symptom: “OOM with plenty of free memory shown earlier”
Root cause: Fragmentation/compaction issues (THP, huge allocations) or sudden anonymous allocation rate.
Fix: Consider THP=madvise, add swap, reduce huge allocation patterns, and avoid packing the node too tightly.
8) Symptom: “Fix was to add RAM; OOM returns months later”
Root cause: No limits, no budgeting; consumption expanded to match capacity.
Fix: Implement per-tenant limits, headroom targets, and periodic right-sizing reviews. Add monitoring for assigned vs used.
Checklists / step-by-step plan
Immediate incident checklist (first 15 minutes)
- Confirm the killer: kernel OOM vs cgroup OOM vs systemd-oomd (journalctl queries).
- Identify what died (PID, command, VMID/container slice).
- Capture current state: free -h, swapon --show, PSI (/proc/pressure/memory).
- List top consumers (ps --sort=-rss) and map QEMU PIDs to VMIDs.
- If the host is thrashing: reduce load immediately (migrate/shutdown non-critical guests) rather than restarting daemons.
Stabilization checklist (next 24 hours)
- Set a host headroom target (start at 20% RAM for mixed workloads).
- Ensure some swap exists; set a reasonable swappiness (avoid extremes).
- Audit guest allocations: assigned memory vs actual need; remove “comfort RAM”.
- For containers: set memory and swap limits; track memory.events for OOM counts.
- For VMs: decide where ballooning is acceptable; verify balloon driver effectiveness on those guests.
- If on ZFS: measure ARC and set a cap that respects hypervisor headroom.
Hardening checklist (next sprint)
- Add alerting on: host available memory low, swap usage trend, PSI full avg60, kernel OOM events, cgroup OOM counts.
- Protect critical Proxmox services with OOMScoreAdjust and verify systemd-oomd policies.
- Reduce scheduled concurrency: backups, replication, scrubs, and heavy batch jobs.
- Document an overcommit policy: what is allowed, what isn’t, and how it’s measured.
- Run a controlled load test: simulate burst allocations and confirm the host degrades gracefully rather than killing QEMU.
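As a stopgap before proper alerting exists, a cron-friendly sketch of the PSI check from the first item is below. The 10% threshold, the script path, and logger as the destination are placeholders; wire it into whatever alerting you actually use.
cr0x@server:~$ cat /usr/local/bin/psi-mem-check.sh
#!/bin/sh
# Alert if full memory pressure (avg60) exceeds a placeholder threshold.
THRESHOLD=10
FULL=$(awk '/^full/ {sub("avg60=","",$3); print $3}' /proc/pressure/memory)
awk -v f="$FULL" -v t="$THRESHOLD" 'BEGIN {exit !(f+0 > t+0)}' && \
  logger -t psi-mem-check "memory full avg60=${FULL}% above ${THRESHOLD}%"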
FAQ
1) Why does Linux kill a “random” process instead of the one that caused the problem?
It isn’t random. The kernel computes a badness score based on memory usage and other factors. The “aggressor” might be smaller than the “fattest” victim.
That’s why per-tenant limits are so valuable: they keep the blast radius local.
2) Is it normal for Proxmox to run at 90% memory usage?
“Used” is a misleading number because it includes cache. What matters is available, swap trend, and pressure signals.
A node can look “full” and be healthy, or look “fine” and be one burst away from OOM.
3) Should I disable swap on Proxmox?
Usually no. A small swap is a safety buffer. If you disable it, you’re choosing hard kills over slowdowns.
If you have strict latency requirements, consider a different buffer strategy (like zram) and accept the risk explicitly.
4) Does ZFS ARC cause OOM?
ARC is designed to shrink, but it can still contribute to pressure if the system is already overcommitted or if workload spikes faster than ARC can shrink.
Capping ARC is reasonable on hypervisor-first nodes.
5) Does VM ballooning prevent OOM?
It can help, but it is not a guarantee. Ballooning requires guest cooperation and reclaimable memory inside the guest.
If guests are legitimately using their RAM, ballooning cannot magically create free memory.
6) How do I tell if an OOM happened inside a container vs on the host?
Container/cgroup OOM messages mention “Memory cgroup out of memory” and refer to a cgroup path (like /lxc.payload.104.slice).
Host/global OOM messages look like “oom-killer: constraint=CONSTRAINT_NONE” and may kill QEMU or host daemons.
7) What’s the safest way to reduce OOM risk quickly?
Add a modest swap buffer, enforce container limits, stop extreme overcommit, and ensure host headroom.
If using ZFS, cap ARC only after measuring. Then validate with PSI and workload behavior during backup windows.
8) If I set OOMScoreAdjust very low for everything, will I stop OOM kills?
You’ll just change who dies—or push the kernel into worse failure modes. Protect a small set of critical host services.
Let everything else be killable, and fix the real pressure source with capacity and limits.
9) Why do OOMs often happen during backups, scrubs, or replication?
Those operations increase I/O, metadata churn, and compression/encryption activity. The kernel cache and filesystem layers get busy,
and QEMU may allocate transient buffers. If you sized only for steady-state, maintenance becomes your peak load.
10) What’s a “sane” overcommit ratio for VM memory?
There’s no universal number. Start conservative: do not exceed physical RAM minus host reserve with “guaranteed” allocations for critical VMs.
If you overcommit, do it on low-priority guests, with ballooning where appropriate, and with monitoring that proves it’s safe.
Conclusion: practical next steps
OOM-killer events on Proxmox are rarely mysterious. They’re the bill for memory decisions you made months ago:
overcommit without limits, swap-as-a-moral-failing, and “we’ll tune ZFS later.”
The kernel collects.
Do these next, in this order:
- Classify the event (kernel OOM vs cgroup OOM vs systemd-oomd) using the log queries above.
- Rebuild your budget: host reserve, guest assigned memory, swap buffer, ZFS ARC target.
- Enforce limits on containers and adopt a VM sizing policy that matches workload criticality.
- Add observability: alert on PSI, available memory, swap trend, and OOM counters.
- Validate under maintenance load, not just during quiet business hours.
The goal isn’t to “never OOM.” The goal is to make memory pressure predictable, contained, and boring. In production, boring is a compliment.