It starts as “the VM feels sluggish.” Ten minutes later it’s “why is kswapd eating a core,” and an hour after that somebody suggests “just add swap” like they’re sprinkling holy water on a fire.
Ubuntu 24.04 is perfectly capable of staying stable under memory pressure. But if you mix ballooning, overcommit, cgroup limits, and optimistic defaults, you can manufacture a swap storm that looks like a CPU issue, a disk issue, and a networking issue all at once. It’s not magic. It’s accounting, and it’s fixable.
What “memory ballooning surprises” actually means
Ballooning is simple in theory: the hypervisor reclaims RAM from a VM that “isn’t using it,” and gives it to something else. In practice, it’s a negotiation with imperfect information.
Inside the guest (Ubuntu 24.04 here), the balloon driver pins pages so the guest thinks it has less usable memory. If the guest later needs that memory back, the hypervisor can “deflate” the balloon—if it can. That “if” is where outages are born.
A swap storm is what happens when a system spends more time moving pages between RAM and swap than doing useful work. The classic shape:
- RAM pressure rises → anonymous pages get swapped out
- workload touches swapped pages → major faults explode
- IO waits rise → request latencies spike
- CPU time burns in reclaim and page faults → the box “has CPU” but nothing completes
Now add ballooning surprises:
- The guest’s “available” memory drops quickly (balloon inflates), sometimes faster than the guest can reclaim cleanly.
- The guest starts swapping even though the host still has memory overall—or the host is also pressured, and both layers reclaim simultaneously.
- Metrics look contradictory: the guest shows swapping; the host shows free memory; application latencies look like disk; and CPU is busy but unproductive.
Here’s the opinionated part: if you need predictability, stop treating memory like a floaty suggestion. Put hard limits where they belong, size swap intentionally, and pick one layer to do reclaim first. Otherwise you’re running a distributed paging system, and nobody asked for that.
Joke #1: Swap is like taking boxes to a storage unit during a move—helpful until you need the toaster right now.
Interesting facts and historical context (short, useful, slightly nerdy)
- Ballooning predates “cloud.” VMware popularized ballooning in the early 2000s to make consolidation ratios look great—until workloads became spiky and latency-sensitive.
- Virtio-balloon is cooperative. It relies on the guest to give pages back. If the guest is already struggling, ballooning can amplify the pain because the guest does the work.
- Linux swapping is not “a bug.” Linux historically used swap proactively to increase page cache; modern kernels are more nuanced, but swap activity alone isn’t a conviction.
- cgroup v2 made memory control stricter and more observable. It’s harder to accidentally bypass limits, and easier to see pressure and events—if you actually look.
- PSI (Pressure Stall Information) is fairly new. It was merged around Linux 4.20 era to quantify “time spent waiting on resources,” making “the box feels slow” measurable.
- systemd-oomd is an opinionated user-space killer. It can act earlier than the kernel OOM killer to preserve the system, which surprises people who assume OOM means “only when totally out.”
- Overcommit defaults differ by environment. Hypervisors overcommit; Linux can overcommit; containers can overcommit. Stack them and you get a Russian nesting doll of optimism.
- Swap storms got worse with fast disks. NVMe makes swap “less awful,” which tempts people to lean on it; that often just moves the bottleneck to CPU and latency variance.
Fast diagnosis playbook: find the bottleneck in minutes
This is the order that wins in production because it separates “we’re memory pressured” from “we’re confused.” Do it on the guest first, then the host if you can.
1) Confirm it’s memory pressure, not just “high RAM usage”
- Check swap-in rate and major faults (real pain).
- Check PSI memory (time spent stalled).
- Check reclaim CPU (kswapd and direct reclaim).
2) Identify where reclaim is happening (guest? host? both?)
- In the guest, confirm balloon driver status and current balloon size if exposed.
- On the host, check the VM’s balloon target and host swap activity.
3) Determine the control plane that’s actually enforcing limits
- cgroup v2 memory.max / memory.high for services and containers
- systemd-oomd activity and logs
- Hypervisor memory limits / ballooning policy
4) Decide the immediate mitigation
- If latency is on fire: reduce ballooning target or add real RAM allocation (don’t just add swap).
- If a single service is greedy: cap it via cgroups, or restart with a smaller heap / worker count.
- If the host is pressured: stop overcommit, migrate, or shed load—host swap makes every guest sad.
Practical tasks: commands, outputs, and decisions (12+)
These are not “run and feel good” commands. Each has an interpretation and a decision attached. Use them during an incident and again during a postmortem.
Task 1: Check memory and swap totals quickly
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 15Gi 13Gi 210Mi 220Mi 1.8Gi 820Mi
Swap: 8.0Gi 6.4Gi 1.6Gi
What it means: “available” under 1 GiB plus heavy swap usage suggests active pressure, not just cached memory.
Decision: If latency is high and swap used is growing, move to Tasks 2–4 immediately. If swap used is high but stable and PSI is low, you might be fine.
Task 2: Measure swap-in/out and reclaim churn
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 1 671088 180000 12000 860000 420 980 1800 2100 3200 8800 18 12 10 58 2
2 1 672000 175000 12000 850000 300 760 1400 1800 3100 8200 16 11 12 59 2
4 2 675000 165000 11000 845000 520 1200 2200 2400 3500 9300 20 14 8 56 2
2 1 677000 160000 11000 840000 410 900 1700 2000 3200 8600 18 12 10 58 2
3 2 679000 155000 10000 835000 600 1300 2400 2600 3600 9700 21 15 7 55 2
What it means: Non-zero si/so continuously is active swapping. High wa means IO wait is dominating. This is a swap storm profile.
Decision: Mitigate by reducing memory pressure (more RAM / less ballooning / cap offenders). Tuning swappiness won’t rescue you mid-storm.
Task 3: Look for major page faults (the “I touched swapped pages” counter)
cr0x@server:~$ sar -B 1 3
Linux 6.8.0-xx-generic (server) 12/31/25 _x86_64_ (8 CPU)
12:01:11 AM pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
12:01:12 AM 1800.00 2400.00 52000.00 980.00 14000.00 9200.00 0.00 6100.00 66.30
12:01:13 AM 1600.00 2100.00 50500.00 1020.00 13200.00 8900.00 0.00 5800.00 65.10
12:01:14 AM 2000.00 2600.00 54000.00 1100.00 15000.00 9800.00 0.00 6400.00 65.30
What it means: majflt/s near 1000 is brutal for most services. Each major fault is “wait for disk.”
Decision: Stop the bleeding: reduce ballooning, kill/restart the memory hog, or scale out. If you do nothing, you’ll get cascading timeouts and retries.
Task 4: Read PSI memory pressure (guest-side truth serum)
cr0x@server:~$ cat /proc/pressure/memory
some avg10=35.20 avg60=22.10 avg300=10.88 total=987654321
full avg10=12.90 avg60=7.40 avg300=3.20 total=123456789
What it means: “some” = tasks stalled because of memory pressure; “full” = nobody can run because memory is the bottleneck. Sustained non-trivial “full” is a fire.
Decision: If full avg10 is above ~1–2% for latency-sensitive systems, treat it as a production incident. Reduce pressure; don’t argue with math.
Task 5: See if kswapd is burning CPU
cr0x@server:~$ top -b -n1 | head -n 15
top - 00:01:40 up 12 days, 3:22, 1 user, load average: 8.22, 7.90, 6.40
Tasks: 243 total, 4 running, 239 sleeping, 0 stopped, 0 zombie
%Cpu(s): 21.3 us, 14.9 sy, 0.0 ni, 6.8 id, 56.0 wa, 0.0 hi, 1.0 si, 0.0 st
MiB Mem : 16384.0 total, 15870.0 used, 220.0 free, 180.0 buff/cache
MiB Swap: 8192.0 total, 6600.0 used, 1592.0 free. 820.0 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
431 root 20 0 0 0 0 R 85.0 0.0 120:22.11 kswapd0
2331 app 20 0 5020.0m 3900.0m 4200 S 60.0 23.8 88:10.22 java
1102 postgres 20 0 2800.0m 1500.0m 9800 S 25.0 9.2 40:11.03 postgres
What it means: High IO wait plus kswapd0 hot means reclaim is thrashing. The CPU is “busy” doing chores.
Decision: Don’t optimize the application yet. Fix memory pressure first; otherwise you’re tuning a sinking ship’s playlist.
Task 6: Inspect what the kernel thinks about memory and reclaim
cr0x@server:~$ grep -E 'MemTotal|MemAvailable|SwapTotal|SwapFree|Active|Inactive|Dirty|Writeback|AnonPages|Mapped|SReclaimable' /proc/meminfo
MemTotal: 16384000 kB
MemAvailable: 780000 kB
SwapTotal: 8388604 kB
SwapFree: 1654320 kB
Active: 8920000 kB
Inactive: 5120000 kB
Dirty: 184000 kB
Writeback: 2400 kB
AnonPages: 11200000 kB
Mapped: 420000 kB
SReclaimable: 520000 kB
What it means: Large AnonPages means anonymous memory (heaps, not cache). This is where swap pressure comes from.
Decision: If it’s mostly anonymous, you need fewer processes, smaller heaps, or more RAM. If it’s mostly cache, you might reclaim cleanly.
Task 7: Find top memory consumers (RSS, not VIRT)
cr0x@server:~$ ps -eo pid,comm,rss,pmem --sort=-rss | head
2331 java 3998200 24.3
1102 postgres 1543000 9.4
4120 node 1022000 6.2
1887 redis-server 620000 3.8
2750 python 410000 2.5
987 systemd-journald 180000 1.1
1450 nginx 90000 0.5
812 snapd 82000 0.5
701 multipathd 62000 0.3
655 unattended-upgr 52000 0.3
What it means: RSS is real resident memory. A single large process can destabilize the entire guest once ballooning lowers headroom.
Decision: If one process dominates, cap it (cgroup), reconfigure it, or restart it intentionally—before the system does it for you.
Task 8: Check cgroup v2 memory limits for a systemd service
cr0x@server:~$ systemctl show myapp.service -p MemoryMax -p MemoryHigh -p OOMPolicy -p ManagedOOMMemoryPressure
MemoryMax=infinity
MemoryHigh=infinity
OOMPolicy=stop
ManagedOOMMemoryPressure=auto
What it means: No memory ceilings. The app can eat the guest. With ballooning, this becomes Russian roulette.
Decision: For critical hosts, set MemoryHigh (soft throttle) and MemoryMax (hard cap) to force predictability.
Task 9: Inspect the cgroup’s actual memory current and events
cr0x@server:~$ CG=$(systemctl show -p ControlGroup --value myapp.service); echo $CG
/system.slice/myapp.service
cr0x@server:~$ cat /sys/fs/cgroup$CG/memory.current
4219031552
cr0x@server:~$ cat /sys/fs/cgroup$CG/memory.events
low 0
high 182
max 3
oom 1
oom_kill 1
What it means: The service hit memory.high a lot and even hit memory.max. There was an OOM kill within the cgroup.
Decision: If high is rising rapidly, tune the service (heap/workers) or raise limits. If max/oom_kill increments, you’re underprovisioned or misconfigured.
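Bonus check: the same cgroup exposes its own pressure file, which tells you whether this specific service is the one stalling (PSI is enabled on stock Ubuntu 24.04 kernels). Reusing the CG variable from above:
cr0x@server:~$ cat /sys/fs/cgroup$CG/memory.pressure
The format is the same some/full layout as /proc/pressure/memory; sustained full here pins the pain on this service rather than the whole guest.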
Task 10: See swap usage per process (smaps rollup)
cr0x@server:~$ sudo awk '/^Swap:/ {sum+=$2} END {print sum " kB"}' /proc/2331/smaps_rollup
1540000 kB
What it means: That process has ~1.5 GiB swapped. Even if it’s still “running,” it’s likely stalling whenever it touches swapped regions.
Decision: If it’s a latency-sensitive service, restarting (after lowering memory footprint) can be the least-bad option. Also fix the upstream cause (ballooning/limits).
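If you want that number for every process at once, a small grep/awk pipeline over smaps_rollup works. A minimal sketch (root is needed to read other users’ processes; the stderr redirect hides PIDs that exit mid-scan, and the PID is embedded in the printed path):
cr0x@server:~$ sudo grep -H '^Swap:' /proc/[0-9]*/smaps_rollup 2>/dev/null | awk '$2 > 0 {print $2 " kB", $1}' | sort -rn | head
Anything near the top of that list is a candidate for the decision above.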
Task 11: Check kernel swappiness and reclaim knobs
cr0x@server:~$ sysctl vm.swappiness vm.vfs_cache_pressure vm.watermark_scale_factor
vm.swappiness = 60
vm.vfs_cache_pressure = 100
vm.watermark_scale_factor = 10
What it means: Swappiness 60 is the generic default. It’s not “wrong,” but it’s not tailored for ballooned VMs running latency-sensitive apps.
Decision: Post-incident, consider vm.swappiness=10–30 for many server workloads, but only after you’ve fixed sizing and limits.
Task 12: Confirm which swap devices are active and their priority
cr0x@server:~$ swapon --show --bytes
NAME TYPE SIZE USED PRIO
/swapfile file 8589934592 7088373760 -2
What it means: A single swapfile at the kernel’s auto-assigned negative priority, which is normal. If you have multiple swap backends, priorities decide where pages go first.
Decision: If you introduce zram, give it higher priority than disk swap so you compress before you hit IO latency.
Task 13: Observe disk latency during swapping
cr0x@server:~$ iostat -x 1 3
Linux 6.8.0-xx-generic (server) 12/31/25 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
18.20 0.00 12.10 56.30 2.10 11.30
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
nvme0n1 420.0 16800.0 2.0 0.47 9.80 40.0 520.0 22000.0 3.0 0.57 18.50 42.3 12.10 98.00
What it means: %util ~98% and high awaits mean the disk is saturated. Under swapping, even NVMe can become the bottleneck.
Decision: If swap is the workload, adding “faster disk” isn’t a fix. Reduce swapping first.
Task 14: Check for host-side memory ballooning from within the guest (driver presence)
cr0x@server:~$ lsmod | grep -E 'virtio_balloon|vmw_balloon|xen_balloon'
virtio_balloon 24576 0
virtio 16384 2 virtio_balloon,virtio_net
What it means: The virtio balloon driver is loaded, so ballooning can be active depending on hypervisor policy.
Decision: If you can’t tolerate dynamic memory, coordinate with the hypervisor team to disable or constrain ballooning for this VM class.
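If you also have host access and the platform is libvirt/KVM (an assumption; adjust for your hypervisor), compare the balloon target with what the guest actually holds. The domain name appvm01 is a placeholder:
cr0x@host:~$ virsh dommemstat appvm01
cr0x@host:~$ virsh dumpxml appvm01 | grep -E '<memory|<currentMemory|memballoon'
dommemstat reports the balloon’s current size (the “actual” line) alongside the guest-reported counters; dumpxml shows whether a memballoon device is attached at all.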
Task 15: See recent OOM and oomd actions
cr0x@server:~$ journalctl -k -g 'oom|out of memory' --since '1 hour ago' | tail -n 20
Dec 31 00:12:02 server kernel: Out of memory: Killed process 4120 (node) total-vm:2103456kB, anon-rss:980000kB, file-rss:12000kB, shmem-rss:0kB, UID:1001 pgtables:3400kB oom_score_adj:0
Dec 31 00:12:02 server kernel: oom_reaper: reaped process 4120 (node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
cr0x@server:~$ journalctl -u systemd-oomd --since '1 hour ago' | tail -n 20
Dec 31 00:11:58 server systemd-oomd[812]: Killed /system.slice/myapp.service due to memory pressure for /system.slice.
What it means: Both kernel OOM killer and systemd-oomd can kill. If oomd killed first, your “OOM event” may happen before RAM hits zero.
Decision: Decide which policy you want: proactive oomd (usually good for keeping the host responsive) or “let the kernel handle it” (often noisier). Configure deliberately.
Root causes: ballooning, overcommit, reclaim, and why swap storms happen
Ballooning is a bet on workload stability
Ballooning works best when:
- guests have predictable idle memory
- workloads aren’t latency-sensitive
- host has enough spare to deflate quickly
Ballooning works worst when:
- memory use grows suddenly (caches warming, deploys, traffic spikes)
- the host is also pressured, so balloon deflation can’t happen in time
- the guest is configured with generous swap, masking early warning signs
Double reclaim: guest swaps, host swaps, everybody loses
The nastiest incidents involve reclaim at two layers. The guest swaps because its ballooned “physical” memory shrank. The host swaps because it overcommitted too far. Now every guest page-in may trigger host page-ins. Latency becomes a matryoshka doll.
Memory overcommit is a policy decision, not a default you inherit
There are three different “overcommit” ideas people conflate:
- Hypervisor overcommit: allocating more vRAM total than host RAM, assuming not all VMs peak simultaneously.
- Linux virtual memory overcommit: allowing processes to allocate more virtual memory than RAM+swap, based on heuristics.
- Container density overcommit: scheduling pods/containers with memory requests lower than limits, betting on averages.
Any one of these can be sane. Stack all three and you’re building a peak-load catastrophe machine.
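On the Linux layer, the guest’s overcommit policy is visible in two sysctls. The values below are the usual kernel defaults (0 means heuristic overcommit; the ratio only matters in mode 2); verify them on your own builds before relying on them:
cr0x@server:~$ sysctl vm.overcommit_memory vm.overcommit_ratio
vm.overcommit_memory = 0
vm.overcommit_ratio = 50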
Swap storms are usually triggered by one of four patterns
- Balloon target changes too aggressively (host pressure event, automation “right-sizing,” or a human trying to be clever).
- Heap-based service grows until it hits GC thrash, then starts faulting like crazy once swapped.
- File cache isn’t the problem but you tune cache knobs anyway and starve anonymous memory.
- IO slowdown causes reclaim to lag (noisy neighbor storage, degraded RAID, snapshot storms), pushing the system into direct reclaim stalls.
There’s a paraphrased idea worth keeping in your head when you design limits:
— Gene Kim has emphasized that reliability comes from reducing variance and making work predictable, not from heroics.
Set sane limits: VMs, containers, and the host
“Sane limits” means you decide which layer is allowed to say “no” first, and how. My preference for production systems that people yell at:
- Hard allocation for critical VMs (or at least strict balloon floors) so guests have predictable RAM.
- cgroup limits per service so one runaway process doesn’t turn the OS into a swap benchmark.
- oomd configured intentionally to kill the right thing early, instead of the kernel killing something random late.
VM layer: don’t let “dynamic” mean “surprising”
On many hypervisors you can set:
- min memory (balloon floor)
- max memory (cap)
- shares/priority (who gets memory first)
Rules of thumb that age well:
- Set the floor to at least the observed steady-state RSS + safety margin, not “half of max because it seems fine.”
- For databases and JVMs, floors should be conservative; reclaiming their working set hurts immediately.
- Disable ballooning for extremely latency-sensitive or bursty workloads unless you’ve proven it’s harmless under load tests.
Container layer: memory.max is your seatbelt
If you’re running containers directly under systemd (or even just services), cgroup v2 is already there in Ubuntu 24.04. Use it.
Example: set a soft and hard limit for a systemd service
cr0x@server:~$ sudo systemctl set-property myapp.service MemoryHigh=6G MemoryMax=7G
cr0x@server:~$ systemctl show myapp.service -p MemoryHigh -p MemoryMax
MemoryHigh=6442450944
MemoryMax=7516192768
What it means: The service will be throttled around 6 GiB and killed at 7 GiB (depending on behavior and oomd/kernel policy).
Decision: Pick limits based on load testing and observed RSS. If you can’t test, start conservative and watch memory.events.
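If you manage units as files rather than with set-property, the same budget belongs in a drop-in override. A minimal sketch with the same illustrative sizes; MemorySwapMax=0 additionally forbids this cgroup from swapping, which is a deliberate policy choice, not a default:
cr0x@server:~$ sudo systemctl edit myapp.service
# in the editor, add:
[Service]
MemoryHigh=6G
MemoryMax=7G
MemorySwapMax=0
cr0x@server:~$ sudo systemctl restart myapp.service
Restarting is the unambiguous way to make sure the new limits apply to the running cgroup.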
Host layer: stop host swap before it starts
Host swap is a tax on every VM, paid in latency. If your host is swapping, your “noisy neighbor” is the host itself.
Host policy that works:
- avoid memory overcommit ratios that rely on “it never all peaks”
- keep host swap minimal, or configure it as an emergency brake, not a daily commuter lane
- monitor PSI on the host and alert on sustained “full” memory pressure
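If your monitoring stack can’t read PSI natively yet, even a crude threshold probe beats nothing. A minimal sketch, assuming a 5% threshold on full avg10 (pick your own); it exits non-zero when breached, so you can hang it off cron, a healthcheck, or an exec-based alert:
#!/bin/sh
# Exit 1 if memory PSI "full avg10" exceeds THRESHOLD percent (illustrative value).
THRESHOLD=5
full=$(awk '/^full/ { sub("avg10=", "", $2); print $2 }' /proc/pressure/memory)
# Compare as floats via awk, since POSIX sh arithmetic is integer-only.
awk -v v="$full" -v t="$THRESHOLD" 'BEGIN { exit (v + 0 > t + 0) ? 1 : 0 }'
The same script works on a host or a guest; on the host it is your earliest hint that the overcommit math stopped working.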
Swap strategy on Ubuntu 24.04: less drama, more control
Swap is not evil. It’s a tool. The problem is when it becomes your primary memory tier by accident.
Pick a swap approach that matches your failure mode
- Disk swapfile: simple, persistent, can absorb spikes, but can cause IO amplification under pressure.
- zram: compressed RAM swap, fast, reduces disk IO, but uses CPU and reduces effective RAM at high compression ratios.
- No swap: forces OOM sooner, can be acceptable for stateless services that restart cleanly, risky for systems that need graceful degradation.
My usual stance for VMs running mixed workloads: small disk swap + zram (optional) + strong cgroup limits. The goal is to survive short spikes without letting the system quietly enter permanent swap debt.
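For completeness, a modest disk swapfile on Ubuntu 24.04 is only a few commands. The 2G size is illustrative; size it to absorb spikes, not to satisfy old “RAM times two” folklore. (On copy-on-write filesystems, swapfiles need extra care; this sketch assumes ext4.)
cr0x@server:~$ sudo fallocate -l 2G /swapfile
cr0x@server:~$ sudo chmod 600 /swapfile
cr0x@server:~$ sudo mkswap /swapfile
cr0x@server:~$ sudo swapon /swapfile
cr0x@server:~$ echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab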
Make swappiness a policy, not folklore
Lower swappiness reduces the kernel’s eagerness to swap anonymous memory. For many production servers, 10–30 is a reasonable starting range. For desktops, defaults can be fine. For ballooned guests, I prefer lowering it because ballooning already reduces headroom unpredictably.
cr0x@server:~$ sudo sysctl -w vm.swappiness=20
vm.swappiness = 20
cr0x@server:~$ printf "vm.swappiness=20\n" | sudo tee /etc/sysctl.d/99-swappiness.conf
vm.swappiness=20
What it means: Immediate and persistent change.
Decision: If you lower swappiness, also ensure you have enough RAM headroom; otherwise you may just reach OOM faster. That might be desirable—if you planned for it.
When zram helps (and when it doesn’t)
zram can be a lifesaver in swap storms because it replaces slow IO with faster compression. But it’s not free: under sustained pressure, CPU can become the next bottleneck. Use it when your failure mode is IO wait and major faults, not when your CPU is already pegged.
Quick check: do you already have zram?
cr0x@server:~$ swapon --show
NAME TYPE SIZE USED PRIO
/swapfile file 8G 6.7G -2
/dev/zram0 partition 2G 0B 100
What it means: zram exists and has higher priority (100) than the swapfile (-2), which is correct if you choose this approach.
Decision: If you see disk swap getting hammered while zram is unused, priorities are wrong. Fix priorities or disable one of the swap backends.
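If you decide to add zram by hand (Ubuntu also ships helper packages; this is the manual route using util-linux’s zramctl), a minimal sketch with an illustrative 2G device and zstd compression, at a higher priority than disk swap. Check /sys/block/zram0/comp_algorithm for what your kernel actually offers:
cr0x@server:~$ sudo modprobe zram
cr0x@server:~$ sudo zramctl --algorithm zstd --size 2G /dev/zram0
cr0x@server:~$ sudo mkswap /dev/zram0
cr0x@server:~$ sudo swapon --priority 100 /dev/zram0
cr0x@server:~$ swapon --show
Note this setup does not survive a reboot; persist it with a small systemd unit or one of the packaged zram helpers.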
Joke #2: The only thing that grows faster than swap usage is the confidence of someone who “just increased the swapfile.”
OOM behavior: kernel OOM killer vs systemd-oomd
Ubuntu 24.04 often runs with systemd’s oomd available. The kernel OOM killer is the last resort when the kernel can’t allocate memory. systemd-oomd is a policy engine that can kill earlier based on pressure signals (PSI) and cgroup boundaries.
Why you should care
If you expect “OOM only when RAM is 100% used,” you’ll be surprised. oomd may kill a service when the system is still technically alive but deeply stalled. That’s not cruelty. That’s triage.
What to configure
- Make sure critical system services are protected (or separated) so oomd doesn’t shoot the wrong target.
- Put your main workloads into dedicated slices with clear memory budgets.
- Decide whether “kill the biggest offender” is correct, or whether you want “kill the least important” via unit structure and priorities.
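Concretely, that configuration lives in two places: system-wide defaults for oomd, and per-unit (or per-slice) opt-ins. A minimal sketch with illustrative thresholds; the directive names are systemd’s, the numbers are yours to tune:
# /etc/systemd/oomd.conf.d/10-pressure.conf (system-wide defaults)
[OOM]
DefaultMemoryPressureLimit=60%
DefaultMemoryPressureDurationSec=20s

# per-unit opt-in, e.g. via: sudo systemctl edit myapp.service
[Service]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
Restart systemd-oomd after changing oomd.conf, and verify with oomctl that the unit is actually being monitored.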
Check if oomd is enabled
cr0x@server:~$ systemctl is-enabled systemd-oomd
enabled
What it means: oomd will act if configured/triggered by pressure and unit settings.
Decision: If you run multi-tenant workloads on a VM, oomd is often your friend. If you run one critical monolith with strict SLOs, you may prefer to cap memory tightly and let the kernel OOM within a controlled cgroup.
PSI: measuring pressure instead of guessing
PSI tells you how much time tasks spend stalled due to resource contention. Memory PSI is the best “is the system actually suffering?” metric I’ve used in years. It’s better than “RAM used” and more honest than “load average.”
What good looks like
- some avg10 in the low single digits during bursts is often acceptable.
- full avg10 should be near zero for latency-sensitive services.
What bad looks like
- full avg10 sustained above ~1–2% on production service nodes usually means tail latency is already busted.
- some avg10 above ~20–30% suggests chronic pressure. Your system is spending a third of its life waiting for memory. That’s not a lifestyle choice.
Check CPU PSI as well (swap storms can masquerade as CPU pressure due to reclaim):
cr0x@server:~$ cat /proc/pressure/cpu
some avg10=9.12 avg60=6.10 avg300=2.00 total=55555555
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
What it means: CPU is busy but not saturated in a way that blocks all tasks. If memory PSI is high and CPU PSI is moderate, memory is the culprit.
Decision: Don’t “scale CPU” to fix a memory stall. You’ll just get a faster stall.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
The company: a mid-size SaaS provider with a mix of customer-facing APIs and background workers. The platform team rolled out Ubuntu 24.04 guests on a KVM cluster. They also enabled ballooning because the cluster was “mostly idle overnight,” and the dashboards made it look safe.
The wrong assumption was subtle: “If the host has free memory, the guest can always get memory back quickly.” In reality, the host allocator had plenty of free memory in aggregate, but it wasn’t available where the VM needed it at the moment due to competing VMs spiking together and the balloon targets changing fast.
The API tier took the first hit. Requests began timing out, but CPU wasn’t pegged. Engineers chased an imagined network regression. The tell was hidden in plain sight: major faults spiked, and memory PSI full crept above 5% for minutes at a time.
Eventually, the kernel OOM killer got blamed for “randomly” killing a Node process. It wasn’t random. It was the one that happened to touch the most swapped pages while holding customer traffic. The postmortem fix was not complicated: raise the balloon floor above steady-state RSS, set cgroup limits per service, and alert on PSI and swap-in rate rather than “RAM used.”
Mini-story 2: The optimization that backfired
A different org, same genre of problems. They were proud of their cost controls. Someone suggested reducing VM memory allocations because “Linux uses free memory for cache anyway.” True statement, wrong application.
They trimmed vRAM, then tried to compensate by increasing swap size “so we don’t OOM.” The first week looked fine: fewer OOM events, fewer pager alerts. Then came the end-of-month workload, which included heavier reporting queries and a batch process that warmed large in-memory structures.
The system didn’t crash; it just became unusably slow. That’s the worst kind of failure because it triggers retries and timeouts, not a clean restart. Storage graphs showed high utilization; everyone stared at the SAN like it had committed a personal betrayal.
Root cause: the VM entered sustained swap debt. The bigger swapfile delayed OOM, but it also allowed the process set to grow beyond what the VM could run efficiently. The “optimization” converted a sharp failure into a slow-motion outage. The eventual fix: smaller swap, stricter per-service memory caps, and a policy that said: for latency-tier services, swapping is treated as an incident, not a coping mechanism.
Mini-story 3: The boring but correct practice that saved the day
A finance-adjacent company ran a handful of critical PostgreSQL nodes as VMs. They had a reputation for being conservative—sometimes annoyingly so. No ballooning on the database VMs. Fixed memory allocations. Strict host-side reservations. Boring.
One day, a neighboring cluster had an unrelated memory leak on a set of worker VMs. The hosts started to experience pressure. On a different team’s workloads, the blast radius was ugly: balloon targets changed, guests swapped, and latency spiked. But the database VMs stayed steady. They were protected by policy, not luck.
The DB team still had to handle downstream effects (apps timing out, connection storms), but their own nodes didn’t join the chaos. That mattered: stable databases give you options. You can shed load, drain traffic, and recover with fewer moving parts.
The post-incident lesson wasn’t glamorous: they kept doing the boring thing—reserve memory for stateful systems, cap everything else, and treat overcommit as a controlled risk with monitoring. Sometimes the best optimization is refusing to optimize the wrong thing.
Common mistakes: symptom → root cause → fix
1) “CPU is high, so it must be compute”
Symptom: load average high, CPU system time high, latency high, but user CPU not that high.
Root cause: reclaim and page fault overhead (direct reclaim, kswapd), often triggered by ballooning or a memory leak.
Fix: Check PSI and major faults. Reduce memory pressure first: raise VM floor / reduce ballooning / cap services / fix leak. Only then revisit CPU sizing.
2) “We have free memory on the host, so the guest shouldn’t swap”
Symptom: guest swaps heavily; host shows available memory; people argue about whose graph is wrong.
Root cause: guest sees ballooned memory; hypervisor policy and timing mean “free” doesn’t equal “available to this guest now.”
Fix: Set balloon floors and stop aggressive target changes. For critical VMs, disable ballooning or use static reservations.
3) “Let’s just add more swap”
Symptom: fewer OOM kills, but longer timeouts and bigger performance cliffs under load.
Root cause: swap becomes a crutch that allows memory footprint to exceed working set capacity.
Fix: Keep swap modest. Use cgroup limits and oomd to fail fast and recover, or provision more RAM.
4) “Swappiness tuning will fix the storm”
Symptom: someone sets swappiness to 1 during an incident; nothing improves.
Root cause: once you’re thrashing, the system is already in debt; policy knobs don’t erase swapped pages instantly.
Fix: Immediate mitigation is reducing memory demand or increasing real memory. Apply swappiness changes after stabilization, then validate with load tests.
5) “The kernel OOM killer is random”
Symptom: different processes die each time; people conclude Linux is unpredictable.
Root cause: OOM scoring depends on memory usage and adjustments; under pressure, the “best” victim changes with workload timing.
Fix: Put workloads into cgroups with explicit memory budgets, set OOMScoreAdjust for critical services, and rely on oomd for earlier, policy-driven kills.
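OOMScoreAdjust= is the unit-level way to bias the kernel side of that. A minimal sketch; the unit name and the -500 bias are placeholders for whatever you consider critical (the range is -1000 to 1000, lower means less likely to be killed):
cr0x@server:~$ sudo systemctl edit postgresql.service
# in the editor, add:
[Service]
OOMScoreAdjust=-500
cr0x@server:~$ sudo systemctl restart postgresql.service
cr0x@server:~$ cat /proc/$(systemctl show -p MainPID --value postgresql.service)/oom_score_adj
-500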
6) “Cache is stealing my RAM” (the evergreen complaint)
Symptom: free shows low free memory; folks panic; they drop caches.
Root cause: misunderstanding of Linux page cache versus available memory; dropping caches causes IO spikes and can worsen latency.
Fix: Use MemAvailable, PSI, and major faults. Only drop caches in controlled tests, not in production firefights.
Checklists / step-by-step plan
Step-by-step: stabilize an active swap storm (guest)
- Confirm it’s real pressure: run vmstat, check si/so, run cat /proc/pressure/memory.
- Identify the hog: ps ... --sort=-rss, check per-process swap via /proc/<pid>/smaps_rollup.
- Stop growth: cap the service with systemctl set-property ... MemoryMax= or reduce workers/heap.
- Coordinate with hypervisor: raise allocated memory or reduce ballooning target/floor. This is often the fastest real fix.
- Restart strategically: restart the most swapped-out latency-critical processes after reducing their memory appetite. Restarting without fixing appetite just replays the incident.
- Watch recovery: major faults and PSI should drop first. Swap used may stay high; that’s okay if swap-in stops and latency recovers.
Step-by-step: prevent recurrence (policy)
- Pick your reclaim layer: prefer guest-side cgroup limits and oomd policies over host-level surprise ballooning.
- Set VM floors: ensure balloon minimum covers steady-state + burst headroom for each VM class.
- Set per-service budgets: use MemoryHigh/MemoryMax and validate with memory.events.
- Right-size swap: small-to-moderate swap; consider zram with correct priority; treat swap-in rate as an alertable metric.
- Alert on PSI: especially memory full avg10. It catches “slow death” earlier than most dashboards.
- Load test with ballooning enabled: if you insist on ballooning, test it under peak-like concurrency with realistic memory churn.
- Document kill behavior: which services are allowed to die first, and how they recover (systemd restart policies, graceful shutdown timeouts).
Step-by-step: sanity-check after changes
- Run a controlled load. Capture free -h, PSI, sar -B, iostat -x.
- Verify cgroup enforcement: memory.events should reflect your thresholds during stress, not only after disaster.
- Confirm no host swap under normal peaks (if you control the host). If host swap occurs, your platform policy is lying to you.
FAQ
1) Is swap always bad on Ubuntu 24.04 servers?
No. Swap is a safety buffer. It becomes bad when it’s used continuously or heavily enough to cause major faults and IO wait. Treat swap-in rate and memory PSI full as the real red flags.
2) Why does my guest swap when the host shows free memory?
Because the guest’s available “physical” memory can be reduced by ballooning, independent of host free memory graphs. Also, “free” on the host doesn’t mean “immediately allocatable to your VM” during contention.
3) Should I disable ballooning?
For critical stateful systems (databases, queues) and tight latency SLOs: usually yes, or at least set a conservative floor. For bursty stateless fleets: ballooning can be acceptable if you monitor PSI and have strong per-service limits.
4) What’s the difference between MemoryHigh and MemoryMax?
MemoryHigh is a throttle point: the kernel will start reclaiming within the cgroup and apply pressure. MemoryMax is a hard ceiling: allocations fail and you can trigger OOM within that cgroup.
5) Why did systemd-oomd kill a service even though there was still RAM?
oomd can act on sustained pressure (PSI) to keep the system responsive. That can happen before RAM hits zero, especially under reclaim stalls and swap thrash.
6) Is lowering swappiness enough to stop swap storms?
Not by itself. It may reduce how eagerly the kernel swaps under moderate pressure, but it won’t compensate for undersized RAM, aggressive ballooning, or a memory leak. Fix sizing and limits first.
7) How do I tell if I’m suffering from major faults specifically?
Use sar -B (look at majflt/s) and correlate with latency. Major faults mean disk-backed page-ins. If they’re high, your workload is literally waiting on swap.
8) What should I alert on to catch this early?
At minimum: memory PSI (full and some), swap-in rate (vmstat si trend), major faults, and cgroup memory.events for key services. “RAM used percent” is a weak signal by itself.
9) Should I use zram on production VMs?
Sometimes. If your swap storms are IO-bound and you have CPU headroom, zram can reduce IO wait dramatically. If you’re already CPU-bound, zram can trade one bottleneck for another.
10) Can a swap storm look like a storage outage?
Yes. Swap storms generate lots of random-ish IO, saturate devices, and inflate latency. Storage looks “slow,” but it’s responding to pathological demand. Fix memory pressure and the “storage outage” often disappears.
Conclusion: next steps you can apply today
If you remember one thing: ballooning plus weak limits is how you accidentally build a swap storm generator. Ubuntu 24.04 gives you the tools to avoid that—cgroup v2 controls, PSI visibility, and workable OOM policy. Use them intentionally.
Practical next steps:
- On one troubled VM, baseline: free -h, vmstat 1, sar -B, cat /proc/pressure/memory.
- Set per-service memory budgets with MemoryHigh and MemoryMax; validate via memory.events.
- Agree on a VM memory policy: conservative balloon floors for critical systems, and explicit rules for when ballooning is allowed.
- Right-size swap so it’s a buffer, not a lifestyle. Consider zram only with eyes open.
- Alert on pressure (PSI) and swap-in, not just “percent used.”
Do those, and the next “the VM feels sluggish” page becomes a five-minute diagnosis instead of a three-hour blame tour.