Your Proxmox host has “plenty of free RAM,” yet swap usage climbs like it’s training for a marathon. VMs feel sticky. The node load creeps up.
And every reboot “fixes it” for a while, which is the kind of fix that belongs in a haunted house, not a datacenter.
Swap growth on a hypervisor is rarely mysterious. It’s usually one of three things: you’re actually under memory pressure, you’ve configured the host
so reclaim behaves badly, or you’re reading the wrong metric and chasing ghosts. Let’s make it boring again.
Fast diagnosis playbook
If you only have 10 minutes before someone starts suggesting “just add RAM,” do this in order. You’re trying to answer one question:
is swap growth driven by real pressure, bad reclaim, or a lying dashboard?
1) Confirm the symptom is real (and not just “cached memory” panic)
- Check host swap used and swap I/O rate (not just “swap used”).
- Check memory pressure (PSI) and kswapd activity.
2) Identify the pressure source
- Is ZFS ARC eating headroom?
- Are you overcommitting RAM across VMs or containers?
- Is ballooning forcing guests down and pushing host into reclaim chaos?
3) Decide the stabilizer
- If pressure is sustained: reduce commitments (VM memory), cap ARC, or add RAM.
- If reclaim is pathological: fix swappiness, watermarks, and hugepage/THP interactions; add zram if appropriate.
- If swap is “used but quiet”: consider leaving it alone, or drain it slowly and deliberately without triggering a reclaim storm.
Operational rule: swap used without swap I/O can be fine. Swap used with high swap-in/swap-out is where performance goes to die.
Swap growth isn’t automatically a bug
Linux will opportunistically move cold anonymous pages to swap to keep more RAM available for page cache and filesystem metadata. On a workstation,
this can be a win. On a hypervisor, it depends. Your “cold” pages might belong to qemu processes that suddenly need them. That’s when you see
a VM freeze for a second and your users invent new adjectives.
A key distinction:
- Swap occupancy: how much swap is used right now.
- Swap activity: how much data is moving in/out of swap per second.
Swap occupancy can grow and stay high for months if the kernel swapped out genuinely cold pages and never needed them again. That’s not a fire.
Swap activity during business hours is a fire. A quiet fire, but still.
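The split is easy to check directly: /proc/vmstat exposes cumulative pswpin/pswpout counters (in pages), so two samples a few seconds apart give you an activity rate regardless of how big swap occupancy looks. A minimal sketch, assuming a Linux host with the standard counters:

```shell
# Sketch: measure swap ACTIVITY (not occupancy) from /proc/vmstat.
# pswpin/pswpout are cumulative counters in pages; sample twice and diff.
read_ctr() { awk -v k="$1" '$1 == k {v = $2} END {print v + 0}' /proc/vmstat; }

in1=$(read_ctr pswpin);  out1=$(read_ctr pswpout)
sleep 2
in2=$(read_ctr pswpin);  out2=$(read_ctr pswpout)

page=$(getconf PAGESIZE)                    # usually 4096 bytes
si=$(( (in2 - in1)  * page / 2 / 1024 ))    # swap-in KiB/s over the 2 s window
so=$(( (out2 - out1) * page / 2 / 1024 ))   # swap-out KiB/s
echo "swap-in: ${si} KiB/s, swap-out: ${so} KiB/s"
```

Near-zero rates with high occupancy means cold pages parked in swap; sustained non-zero rates means you are paying for it right now.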
Joke #1: Swap is like a storage unit. You can keep it for years, but the moment you need something from it during a meeting, you’ll regret everything.
Interesting facts and historical context
- Linux used to expose “free memory” as the headline, and people panic-tuned systems for years; modern guidance emphasizes “available” memory.
- Swappiness isn’t “swap aggressiveness” in a simple way; it influences reclaim balance between anonymous pages and file cache, and behavior changes by kernel era.
- The OOM killer is not a bug; it’s Linux choosing “one process must die so the system lives,” which is often correct on a host but painful in virtualization.
- ZFS ARC is designed to consume RAM because caching is performance; without a cap, it can crowd out guests on a hypervisor.
- Pressure Stall Information (PSI) arrived to quantify “time spent stalled” under pressure—finally turning “it feels slow” into a measurable signal.
- Memory cgroups changed the game by enabling per-VM/container memory accounting and reclaim behavior—sometimes improving isolation, sometimes adding surprise.
- KSM (Kernel Samepage Merging) was a big deal for virtualization density, but it trades CPU for RAM and can interact with reclaim in unexpected ways.
- Transparent Huge Pages were introduced for performance, but in virtualization they can increase latency spikes during compaction and reclaim.
- Swappiness defaults and heuristics evolved because SSDs made swapping less catastrophic than on spinning disks—yet random swap I/O still punishes latency-sensitive workloads.
How Linux decides to reclaim memory (and why Proxmox makes it spicy)
Proxmox is “just Debian,” but in production it’s never “just.” You’re running qemu-kvm processes (big anonymous memory consumers), possibly LXC containers,
and maybe ZFS, which is a high-performance filesystem that’s happy to use a lot of RAM. Then you add memory ballooning, which is basically a negotiated
lie: “yes guest, you still have 16 GB,” while the host quietly asks for it back.
Three buckets matter: anonymous, file cache, and reclaimability
When RAM tightens, the kernel reclaims memory mainly by:
- Dropping file cache (easy and fast if the cache can be discarded).
- Writing out dirty pages (can be slow; depends on storage).
- Swapping out anonymous memory (process pages not backed by a file).
Virtualization turns guest RAM into host anonymous pages. That means the hypervisor’s biggest RAM consumer is precisely the type of memory Linux is willing
to swap if it thinks it can keep the system responsive. And you’ve seen how that ends.
Memory pressure is not “low free”
Linux happily uses RAM for cache; it’s not wasted. The metric you want is MemAvailable (from /proc/meminfo) and, better, PSI
pressure signals that show whether tasks are stalling waiting on memory reclaim.
ZFS complicates reclaim
ZFS ARC is technically reclaimable, but it’s not “page cache” in the same way as ext4’s cache. ARC competes with guests for memory. If ARC grows without a cap,
the host may start swapping guest pages while holding a lot of ZFS cache. That’s a trade you rarely want on a hypervisor: swap I/O latency is worse than a smaller ARC.
Ballooning can create “fake headroom”
Ballooning is fine when used intentionally (for elasticity, not for denial). But if you overcommit and rely on ballooning as a safety net,
you can push guests into their own reclaim while the host also reclaims. Two layers of reclaim. Twice the fun. Half the stability.
Quote (paraphrased idea): “Hope is not a strategy” — commonly attributed to operations leaders; treat it as a principle, not a citation.
What usually causes “swap keeps growing” on Proxmox
1) Real overcommit: you allocated more memory than you own
This is the classic. You sum VM “memory assigned” and it exceeds physical RAM by a lot, then you act surprised when the host uses swap.
The host isn’t wrong. Your spreadsheet is.
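You can sanity-check the spreadsheet in one pipeline. A sketch, with a `qm list` output inlined for illustration (on a real node, capture it live with `list=$(qm list)`); the awk simply sums the MEM(MB) column and compares it to MemTotal:

```shell
# Sketch: sum assigned VM memory and compare to physical RAM.
# The listing below is inlined sample data in the `qm list` layout.
list='VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID
101 app01 running 32768 64.00 2101
104 db01 running 16384 200.00 2210
105 cache01 running 32768 32.00 2302
110 winbuild running 24576 120.00 2410'

assigned_mb=$(printf '%s\n' "$list" | awk 'NR > 1 {sum += $4} END {print sum + 0}')
host_mb=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)
echo "assigned: ${assigned_mb} MB, physical: ${host_mb} MB"
if [ "$assigned_mb" -gt "$host_mb" ]; then
    echo "overcommitted by $(( assigned_mb - host_mb )) MB"
fi
```

This ignores LXC containers and host/ARC overhead, so treat the result as a floor, not the whole picture.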
2) ZFS ARC grows into guest headroom
On a dedicated storage box, letting ARC use most memory is great. On a hypervisor, ARC needs boundaries. Otherwise ARC becomes “that one tenant”
who takes all the kitchen space and acts offended when asked to share.
3) Swappiness and reclaim tuning are mismatched to a hypervisor
Defaults are reasonable for general Linux. Hypervisors are not general. The wrong combination can lead to early swapping of qemu pages even when
dropping cache would have been cheaper.
4) Memory fragmentation / compaction issues (THP and hugepages)
If the host struggles to allocate contiguous memory, compaction and reclaim get noisy. This can look like “swap keeps growing,” but the root
cause is latency and CPU time spent in reclaim paths.
5) Something leaks, but not where you think
Sometimes it’s a real leak: a monitoring agent, a backup process, or a runaway container. But more often it’s “steady growth” caused by cache
(ZFS ARC, slab caches) that looks like a leak in the wrong dashboard.
6) Slow storage makes swap “sticky”
If your swap device is slow or saturated, swapped pages don’t come back quickly. The kernel may keep them swapped longer, making swap usage rise
and stay high. Not because the kernel loves swap—because your disks hate you.
Joke #2: The kernel doesn’t “randomly swap.” It swaps with the calm certainty of someone who knows you didn’t capacity-plan and would like you to learn.
Practical tasks: commands, outputs, decisions
Below are field-ready tasks. Each has: a command, what the output means, and the decision you make. Run them on the Proxmox host first.
Then, when needed, inside a misbehaving guest.
Task 1: Confirm swap occupancy and swap devices
cr0x@server:~$ swapon --show --bytes
NAME TYPE SIZE USED PRIO
/dev/sda3 part 17179869184 4294967296 -2
Meaning: You have a 16 GiB swap partition and 4 GiB is currently used.
Decision: If USED is high, do not panic yet—next check activity. If swap is on a slow disk shared with VM storage, consider moving it.
Task 2: Check swap activity (the real “is this hurting?” metric)
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 4194304 812344 53212 921344 0 0 12 30 400 700 3 2 94 1 0
2 0 4194304 790112 53044 910020 0 0 0 64 420 760 4 3 92 1 0
3 1 4194304 120312 52880 302120 0 512 120 2000 800 1500 20 10 55 15 0
2 2 4194816 90344 52000 280000 128 1024 300 5000 1200 2000 25 12 40 23 0
1 1 4198912 81200 51000 270000 256 2048 500 8000 1500 2500 22 15 35 28 0
Meaning: si/so are swap-in/swap-out KB/s. In the last samples they’re non-zero and rising: active swapping.
Decision: Active swapping means performance impact. Move to pressure diagnosis; plan to reduce memory pressure or change reclaim behavior.
Task 3: Check MemAvailable and swap totals
cr0x@server:~$ grep -E 'MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree' /proc/meminfo
MemTotal: 263989996 kB
MemFree: 1123400 kB
MemAvailable: 34122388 kB
SwapTotal: 16777212 kB
SwapFree: 12582912 kB
Meaning: MemFree is small (normal), MemAvailable is ~32 GiB (good headroom).
Decision: If MemAvailable is healthy but swap activity is high, suspect reclaim pathologies, ZFS ARC pressure behavior, or per-cgroup limits.
Task 4: Read PSI memory pressure (are tasks stalling?)
cr0x@server:~$ cat /proc/pressure/memory
some avg10=0.15 avg60=0.20 avg300=0.35 total=18203456
full avg10=0.02 avg60=0.04 avg300=0.05 total=3401120
Meaning: “some” is time where at least one task was stalled on memory; “full” is time where all runnable tasks were stalled.
Non-trivial “full” indicates user-visible stalls.
Decision: If PSI full is elevated during incident windows, treat it as real memory pressure. Stop debating “but free RAM” and fix capacity/tuning.
Task 5: Identify if kswapd is burning CPU (reclaim is working overtime)
cr0x@server:~$ top -b -n 1 | head -n 20
top - 10:01:22 up 41 days, 3:12, 1 user, load average: 6.20, 5.90, 5.10
Tasks: 412 total, 2 running, 410 sleeping, 0 stopped, 0 zombie
%Cpu(s): 22.1 us, 9.3 sy, 0.0 ni, 55.0 id, 13.2 wa, 0.0 hi, 0.4 si, 0.0 st
MiB Mem : 257802.0 total, 1140.2 free, 210000.4 used, 46661.4 buff/cache
MiB Swap: 16384.0 total, 8192.0 free, 8192.0 used. 43000.0 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
158 root 20 0 0 0 0 R 48.0 0.0 120:11.2 kswapd0
2210 root 20 0 9820.0m 12.2g 210.0m S 30.0 4.8 300:10.9 qemu-system-x86
901 root 20 0 2120.0m 10.0g 50.0m S 8.0 4.0 90:22.2 pvestatd
Meaning: kswapd0 at ~48% CPU indicates heavy background reclaim.
Decision: You’re not just “using swap,” you’re paying CPU and latency for it. Investigate ARC, overcommit, and reclaim tunables.
Task 6: See what’s actually in RAM (anon vs file vs slab)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 252Gi 205Gi 1.1Gi 2.0Gi 45Gi 42Gi
Swap: 16Gi 8.0Gi 8.0Gi
Meaning: High used is normal; focus on “available.” Here available is 42 GiB, which suggests you’re not cornered—unless ZFS/VMs need burst.
Decision: If available is low (<~5–10% of RAM) and swap activity is high, you need to reduce load or increase memory. If available is high, focus on why swap isn’t being reclaimed (often “it doesn’t need to”).
Task 7: Identify top swap consumers (per-process)
cr0x@server:~$ for p in /proc/[0-9]*; do awk -v pid="${p#/proc/}" '/^VmSwap:/ {print $2, pid}' "$p/status" 2>/dev/null; done | sort -nr | head
1048576 2210
524288 4120
262144 1987
131072 901
65536 3301
Meaning: PID 2210 (likely a qemu process) has ~1 GiB swapped.
Decision: If qemu is swapping heavily, treat it as a host-level issue (overcommit/ARC/tuning). If a random daemon is swapping, fix/limit that process.
Task 8: Map qemu PID to VMID (Proxmox-specific)
cr0x@server:~$ ps -p 2210 -o pid,cmd --no-headers
2210 /usr/bin/kvm -id 104 -name vm104 -m 16384 -smp 8 -drive file=/dev/zvol/rpool/vm-104-disk-0,if=virtio,cache=none
Meaning: That swapped process is VMID 104 with 16 GiB assigned.
Decision: Check the VM’s memory configuration (ballooning, min/max) and host overcommit. If it’s a business-critical VM, prioritize stabilizing host reclaim.
Task 9: Check Proxmox memory allocation vs physical (sanity, not perfection)
cr0x@server:~$ qm list
VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID
101 app01 running 32768 64.00 2101
104 db01 running 16384 200.00 2210
105 cache01 running 32768 32.00 2302
110 winbuild running 24576 120.00 2410
Meaning: Assigned memory adds up quickly; ballooning may hide it, but physics won’t.
Decision: If you’re near or over host RAM, stop. Reduce allocations, enforce limits, or add nodes/RAM. “But the guests don’t use it” is how incidents begin.
Task 10: Check ballooning settings for a VM
cr0x@server:~$ qm config 104 | grep -E 'memory|balloon'
balloon: 4096
memory: 16384
Meaning: VM has 16 GiB max, balloon target 4 GiB (very aggressive reclaim).
Decision: If balloon target is far below realistic working set, you’re forcing guest reclaim and then host reclaim. Consider raising balloon minimum or disabling ballooning for latency-sensitive VMs.
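The knobs themselves are ordinary qm options. Illustrative values, assuming VMID 104 from the example above; in Proxmox, memory is the maximum and balloon is the minimum, with balloon set to 0 disabling the balloon device:

```shell
# Illustrative values; derive the balloon minimum from the measured working set.
qm set 104 --balloon 8192   # raise the minimum to 8 GiB (memory: stays the max)
qm set 104 --balloon 0      # or: disable ballooning entirely for this VM
```

Pick one of the two, not both; the second line is the conservative choice for latency-sensitive guests.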
Task 11: If you use ZFS, check ARC size and ARC pressure
cr0x@server:~$ arcstat 1 1
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
10:03:41 220 12 5 3 1.3 9 3.7 0 0.0 96.0G 110.0G
Meaning: ARC size is ~96 GiB, target cap c ~110 GiB. That’s a lot on a hypervisor, depending on total RAM and VM needs.
Decision: If ARC is large while guests are swapping, cap ARC. Hypervisor stability beats marginal read cache wins.
Task 12: Confirm ZFS ARC max (if set) and decide whether to cap it
cr0x@server:~$ grep -R "zfs_arc_max" /etc/modprobe.d /etc/sysctl.conf /etc/sysctl.d 2>/dev/null
/etc/modprobe.d/zfs.conf:options zfs zfs_arc_max=68719476736
Meaning: ARC max is set to 64 GiB (in bytes). Good: at least it’s bounded.
Decision: If you have frequent reclaim and swap activity, lower ARC max further (carefully) to keep headroom for guests. If you’re storage-heavy and VM-light, keep it higher.
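A sketch of applying a cap, assuming a 32 GiB budget chosen purely for illustration. The runtime write takes effect as ARC shrinks toward the new target; the modprobe.d line plus an initramfs refresh persists it across boots. Check for existing options in zfs.conf before overwriting it:

```shell
# Illustrative 32 GiB (34359738368 bytes) cap; size it to your RAM and VM commitments.
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max          # runtime; ARC shrinks gradually
echo "options zfs zfs_arc_max=34359738368" > /etc/modprobe.d/zfs.conf  # persist (clobbers the file)
update-initramfs -u                                                # so the option applies at boot
```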
Task 13: Check kernel swappiness and dirty writeback behavior
cr0x@server:~$ sysctl vm.swappiness vm.dirty_ratio vm.dirty_background_ratio
vm.swappiness = 60
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10
Meaning: Swappiness 60 is default-ish. Dirty ratios define when the kernel starts forcing writes.
Decision: On hypervisors, a common stance is lower swappiness (e.g., 1–10) to discourage swapping qemu memory, unless you know you benefit from swap. Adjust dirty ratios if you see writeback stalls.
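One hedged way to apply and persist the change: runtime first, measure swap I/O and PSI, then write it down. The 99-hypervisor.conf filename is illustrative:

```shell
# Runtime change first; only persist once you've confirmed it helps.
sysctl -w vm.swappiness=10
echo "vm.swappiness = 10" > /etc/sysctl.d/99-hypervisor.conf   # illustrative filename
sysctl --system                                                # reload all sysctl config
```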
Task 14: Inspect major page faults (a sign of swap-ins and cache misses)
cr0x@server:~$ pidstat -r -p 2210 1 3
Linux 6.2.16 (server) 12/26/2025 _x86_64_ (32 CPU)
10:05:12 AM PID minflt/s majflt/s VSZ RSS %MEM Command
10:05:13 AM 2210 1200.00 45.00 10055680 12582912 4.8 qemu-system-x86
10:05:14 AM 2210 1100.00 60.00 10055680 12583104 4.8 qemu-system-x86
10:05:15 AM 2210 1300.00 55.00 10055680 12583360 4.8 qemu-system-x86
Meaning: Major faults (majflt/s) often involve disk I/O (including swap-ins). These numbers are high enough to care about.
Decision: High major faults for qemu during load correlates with swapping or heavily pressured memory. Reduce pressure or improve swap device and reclaim behavior.
Task 15: Check THP status (can contribute to reclaim/compaction pain)
cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
Meaning: THP is set to always. That can be fine, but on some virtualization workloads it increases latency spikes.
Decision: If you see compaction/reclaim stalls, try madvise instead of always and measure. Don’t cargo-cult: test on one node first.
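A sketch of the switch; the runtime echo is immediate (for new allocations), and the kernel command line persists it on Debian-family systems:

```shell
# Runtime toggle; measure before persisting.
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# Persist by adding to GRUB_CMDLINE_LINUX in /etc/default/grub:
#   transparent_hugepage=madvise
# then run: update-grub
```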
Task 16: Find out if swap is on slow or contended storage
cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,ROTA,MOUNTPOINTS
NAME TYPE SIZE ROTA MOUNTPOINTS
sda disk 1.8T 1
├─sda1 part 512M 1 /boot/efi
├─sda2 part 1G 1 /boot
├─sda3 part 16G 1 [SWAP]
└─sda4 part 1.8T 1
nvme0n1 disk 1.9T 0
└─nvme0n1p1 part 1.9T 0 /
Meaning: Swap is on a rotational disk (ROTA=1) while the OS is on NVMe. That’s a performance smell.
Decision: Move swap to faster storage (NVMe) or use zram for burst absorption. Rotational swap on a busy hypervisor is a latency factory.
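A minimal manual zram setup sketch, assuming the in-kernel zram module and util-linux's zramctl; the size and algorithm are illustrative, and Debian's zram-tools package is a packaged alternative:

```shell
# One compressed-RAM swap device with high priority, so it absorbs bursts
# before the disk-backed swap does.
modprobe zram
zramctl /dev/zram0 --algorithm zstd --size 8G
mkswap /dev/zram0
swapon --priority 100 /dev/zram0
```

Remember zram spends CPU to save RAM; keep it as a burst buffer, not a substitute for capacity planning.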
Task 17: Check whether you’re hitting memory cgroup limits (containers especially)
cr0x@server:~$ systemd-cgtop -m -n 1
Control Group Memory Current Memory Peak Memory Swap IO Read IO Write
/ 210.0G 211.2G 8.0G 2.1M 30.4M
/system.slice/pve-container@112.service 3.2G 3.4G 1.5G 0B 12K
/system.slice/pve-container@113.service 7.8G 8.0G 2.0G 0B 18K
Meaning: Containers and services use swap too; per-cgroup swap consumption is the number to watch (in cgroup v2 it's backed by memory.swap.current under /sys/fs/cgroup).
Decision: If one container is the swap hog, fix its limit or workload. Host-wide tuning won’t make a badly sized container behave.
Task 18: Drain swap safely (when you’ve fixed the cause)
cr0x@server:~$ sudo sysctl -w vm.swappiness=1
vm.swappiness = 1
cr0x@server:~$ sudo swapoff -a && sudo swapon -a
cr0x@server:~$ swapon --show
NAME TYPE SIZE USED PRIO
/dev/sda3 part 16G 0B -2
Meaning: Swap was cleared and re-enabled.
Decision: Only do this when MemAvailable is comfortably high and swap activity is low. Otherwise, swapoff can trigger a memory storm and an OOM event.
Three corporate mini-stories from the swap mines
Mini-story 1: The incident caused by a wrong assumption
A mid-size SaaS shop ran a Proxmox cluster hosting “miscellaneous” internal services: CI runners, a metrics stack, a couple of databases nobody owned,
and a Windows VM that existed because someone once needed Visio. The hosts had ample RAM on paper. Swap still climbed, slowly, like a bad mood.
The wrong assumption was simple: “If free shows tens of gigabytes available, the system can’t be under memory pressure.” They looked at
MemAvailable and declared victory. Meanwhile, user complaints were about short freezes—30 seconds of nothing—then everything “caught up.”
The key was PSI. During the freezes, memory PSI “full” spiked. kswapd took a CPU core for long stretches. Swap I/O was measurable, not huge, but consistent.
It wasn’t a lack of memory on average; it was a lack of memory when needed, and reclaim couldn’t keep up.
Root cause: a handful of VMs had balloon targets set far below their actual working sets. The host would reclaim from guests aggressively, guests would page,
and then the host would page too. Two layers of paging created latency spikes. The fix was boring: disable ballooning for the latency-sensitive VMs, set realistic
minimums for the rest, and stop pretending overcommit was “free.”
The swap “kept growing” after the fix, but swap activity went near-zero. That was the important part. They stopped rebooting hosts like it was a wellness ritual.
Mini-story 2: The optimization that backfired
Another org decided to “optimize I/O” by moving swap to a ZFS zvol on the same pool as VM disks, because it was convenient and snapshots were cool.
It worked in the lab. Everything works in the lab. The lab is where physics goes to take a nap.
In production, under a mild memory pressure event (a DB VM doing a one-time index rebuild), swap activity increased. ZFS started working harder.
ARC grew because the workload was read-heavy. The pool got busy. Swap I/O competed with VM disk I/O. Latency climbed. Guests slowed down.
The team responded by increasing swap size. That reduced OOM events but increased time spent in misery. They essentially turned “fail fast” into
“fail slowly while everyone watches dashboards.”
The fix: move swap off the ZFS pool and onto dedicated fast local storage (or zram for burst). Cap ARC. Keep swap I/O away from the same queue
your VMs depend on. Sometimes the right optimization is separating concerns, not shaving microseconds.
The lesson stuck: “Convenience architectures are production’s favorite punching bag.”
Mini-story 3: The boring but correct practice that saved the day
A finance-adjacent company had a Proxmox cluster that never made headlines. Not because it was magical—because it was run with the kind of discipline
that looks unimpressive until it’s missing.
They kept a simple weekly report: per-node MemAvailable trends, PSI averages, swap I/O rates, and top swap consumers. Not vanity metrics; the kind
you use to catch slow changes. They also enforced a policy: VM memory allocations couldn’t exceed a defined headroom threshold unless justified in a ticket.
One week, PSI “some” drifted upward on two nodes. Swap I/O remained low, but kswapd CPU ticked up. Nothing was broken yet. That’s the best time to fix things.
They found a new log-ingestion VM with a memory limit set too high and ballooning too low, causing host reclaim pressure during peak ingest.
They adjusted the VM memory, set a sane balloon minimum, and tightened ARC max slightly. The nodes never hit the cliff. No incident. No late-night “why is
the hypervisor swapping” drama. The boring graph saved the day because it made “almost broken” visible.
If you want reliability, you don’t need heroics. You need early signals and permission to act on them.
Common mistakes: symptom → root cause → fix
1) Symptom: swap used is high, but performance is fine
Root cause: Cold pages were swapped out and never needed again. Swap activity is near-zero.
Fix: Do nothing. Monitor PSI and swap I/O. Don’t “swapoff -a” just to feel clean.
2) Symptom: swap used grows daily, kswapd high CPU, short VM freezes
Root cause: Sustained memory pressure or reclaim loop; often VM overcommit plus ZFS ARC or ballooning.
Fix: Reduce allocations or move VMs; cap ARC; lower swappiness; disable or constrain ballooning; verify PSI drops.
3) Symptom: host has plenty of MemAvailable but swap I/O is high
Root cause: Reclaim imbalance or per-cgroup memory pressure (containers/VM configs), or THP/compaction stalls creating pressure patterns.
Fix: Inspect PSI and per-cgroup usage; tune swappiness; consider THP=madvise; ensure swap device is fast.
4) Symptom: after adding swap, system stops OOMing but becomes sluggish
Root cause: You converted a hard failure into thrashing. Swap is masking capacity problems.
Fix: Right-size RAM commitments; keep swap moderate; use swap for safety, not for steady-state.
5) Symptom: swapping spikes during backups/scrubs
Root cause: I/O-induced stalls cause dirty writeback and reclaim delays; ZFS scrubs can change cache behavior and pressure.
Fix: Schedule heavy I/O; cap ARC; tune dirty ratios if writeback stalls; ensure swap not on same busy device.
6) Symptom: one VM is “fine” but everything else is slow
Root cause: One VM or container is forcing the host into reclaim (memory hog), often due to mis-sized memory or runaway workload.
Fix: Identify top swap consumers; cap the offender; set realistic VM memory; isolate noisy workloads to dedicated nodes.
7) Symptom: swap keeps growing after migration changes
Root cause: Memory locality shifts, ARC warms differently, or KSM/THP behavior changes. Also: lingering swapped pages don’t automatically return.
Fix: Re-measure swap activity and PSI. If quiet, accept it. If not, tune and then optionally drain swap during a maintenance window.
Checklists / step-by-step plan
Step-by-step: stabilize a Proxmox host that’s swapping under load
1) Measure activity, not feelings. Use vmstat 1, PSI, and pidstat to confirm active swapping and stalls.
2) Find the pressure source. Identify top swap consumers; map qemu PIDs to VMIDs; check container cgroups.
3) Check commitments. Compare physical RAM to total assigned RAM. If you’re overcommitted without a plan, that’s the plan failing.
4) Fix ballooning policy. For latency-sensitive VMs: disable ballooning or set a realistic minimum. For the rest: keep ballooning conservative.
5) Cap ZFS ARC if using ZFS. Decide a budget: leave headroom for the host plus worst-case VM bursts. Apply and monitor.
6) Make swap fast or make it smaller. Put swap on NVMe if you need it; avoid placing it on the same contended pool as VM disks.
7) Tune reclaim lightly, then re-measure. Adjust swappiness (common: 1–10 for hypervisors), consider THP=madvise, and watch PSI and swap I/O.
8) Only then drain swap (optional). Use swapoff/swapon during a low-load window if swap occupancy annoys you or you need a clean baseline.
9) Lock in guardrails. Alerts on PSI full, swap I/O rate, and kswapd CPU. A dashboard is only useful if it changes behavior.
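A guardrail probe can be a few lines. A sketch, with the PSI line inlined from Task 4 for illustration (on a host, read it live with `psi=$(grep ^full /proc/pressure/memory)`); the 5.0 threshold is an assumption to tune against your own baseline:

```shell
# Sketch of an alert probe: extract "full avg10" from a PSI memory line
# and compare it to a threshold. Sample line inlined for illustration.
psi='full avg10=0.02 avg60=0.04 avg300=0.05 total=3401120'
full10=$(printf '%s\n' "$psi" | sed -n 's/.*avg10=\([0-9.]*\).*/\1/p')
threshold=5.0   # percent of time all tasks were stalled; illustrative
state=$(awk -v v="$full10" -v t="$threshold" 'BEGIN { if (v > t) print "ALERT"; else print "ok" }')
echo "PSI full avg10=${full10}%: ${state}"
```

Wire the same extraction into whatever alerting you already run; the point is alerting on stall time, not on swap occupancy.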
Checklist: what “good” looks like on a stable Proxmox node
- MemAvailable stays above a comfortable floor under normal peak load (define it; don’t guess).
- PSI memory “full” is near-zero most of the time; “some” is low and stable.
- Swap I/O is near-zero in steady-state; swap used may be non-zero and that’s acceptable.
- kswapd is not a top CPU consumer.
- ZFS ARC is capped (if running ZFS) and doesn’t starve guests.
- Ballooning is deliberate, not defaulted into chaos.
Checklist: when to add RAM vs when to tune
- Add RAM when PSI shows sustained stalls and you can’t reduce commitments without business impact.
- Tune when swap activity is high but headroom exists, or when one configuration choice (ARC/ballooning/swap device) is obviously wrong.
- Re-architect when your density goal requires permanent overcommit and you don’t have workload predictability. That’s a strategy decision, not a sysctl.
FAQ
1) Why does swap usage keep increasing even though RAM “looks fine”?
Because swap usage is sticky. Linux can swap out cold pages and never bother pulling them back if there’s no need. If swap I/O is low and PSI is calm,
it’s not necessarily a problem.
2) Should I set vm.swappiness=1 on Proxmox?
Often yes for hypervisors, because swapping qemu memory hurts latency. But don’t treat it as a magic number. Measure swap I/O and PSI before and after.
If you run memory-heavy file cache workloads or ZFS, you still need to manage ARC and commitments.
3) Is it safe to run with no swap?
It’s “safe” in the sense that you’ll hit OOM sooner and harder. Some environments prefer that to thrashing. Most production hypervisors keep some swap
as an emergency buffer, but rely on capacity planning so swap isn’t used under normal load.
4) My swap is used but swap-in/out is zero. Should I clear it?
No urgency. Clearing swap forces those pages back into RAM, which can cause transient pressure. If you want a clean baseline, drain it during a maintenance
window with plenty of MemAvailable.
5) Does ZFS ARC “cause” swapping?
ARC doesn’t directly force swapping, but it competes for RAM. If you let ARC grow large on a hypervisor, the kernel may reclaim anonymous pages
(your VMs) while ARC remains big. Capping ARC is a common stabilization move on Proxmox+ZFS.
6) Should swap live on a ZFS zvol?
You can, but you usually shouldn’t on a busy hypervisor. Swap I/O competes with VM disk I/O and can amplify latency. Prefer dedicated fast local storage
or zram for burst absorption, depending on your constraints.
7) Is memory ballooning good or bad?
It’s a tool. It’s good when you use it to reclaim truly unused guest memory and you set realistic minimums. It’s bad when you use it as a crutch for
overcommit and then wonder why guests and hosts both start paging under load.
8) How do I know if swapping is hurting VMs?
Look for host swap I/O (vmstat), elevated major faults in qemu processes, PSI memory “full,” kswapd CPU, and VM-level symptoms like latency spikes
and I/O wait. Swap used alone is not enough.
9) Can THP cause swap growth?
THP is more about compaction and reclaim cost than swap occupancy directly. But if THP “always” leads to frequent compaction stalls and reclaim churn,
you can see more swapping and latency. If you suspect it, try THP=madvise and measure.
10) What’s the single most reliable stabilization move?
Stop lying to the box about memory. Keep commitments within physical reality (with headroom), cap ARC if on ZFS, and make ballooning conservative.
Then tune swappiness and swap placement as refinements.
Conclusion: next steps that actually stabilize
“Swap keeps growing” is only scary when it’s paired with pressure and activity. Your job is to separate occupancy from thrash,
then remove the reason the host is forced to make ugly tradeoffs.
Do this next:
- Capture a 10-minute sample during peak: vmstat 1, PSI (/proc/pressure/memory), top processes, and per-qemu major faults.
- Map swap consumers to VMIDs; verify ballooning and memory sizes aren’t fantasy.
- If you run ZFS: cap ARC to a deliberate budget that leaves VM headroom.
- Move swap to fast storage or use zram if you need a burst buffer; avoid contended pools.
- Set swappiness low (and persist it) once you’ve confirmed it improves swap activity and stalls.
- Only after stability: optionally clear swap during a quiet window to reset baselines.
The goal is not “zero swap used.” The goal is “no memory stalls, no thrash, and predictable latency.” Boring is the feature.