Proxmox Swap Keeps Growing: What’s Wrong with Memory Pressure and How to Stabilize It

Your Proxmox host has “plenty of free RAM,” yet swap usage climbs like it’s training for a marathon. VMs feel sticky. The node load creeps up.
And every reboot “fixes it” for a while, which is the kind of fix that belongs in a haunted house, not a datacenter.

Swap growth on a hypervisor is rarely mysterious. It’s usually one of three things: you’re actually under memory pressure, you’ve configured the host
so reclaim behaves badly, or you’re reading the wrong metric and chasing ghosts. Let’s make it boring again.

Fast diagnosis playbook

If you only have 10 minutes before someone starts suggesting “just add RAM,” do this in order. You’re trying to answer one question:
is swap growth driven by real pressure, bad reclaim, or a lying dashboard?

1) Confirm the symptom is real (and not just “cached memory” panic)

  • Check host swap used and swap I/O rate (not just “swap used”).
  • Check memory pressure (PSI) and kswapd activity.

2) Identify the pressure source

  • Is ZFS ARC eating headroom?
  • Are you overcommitting RAM across VMs or containers?
  • Is ballooning forcing guests down and pushing host into reclaim chaos?

3) Decide the stabilizer

  • If pressure is sustained: reduce commitments (VM memory), cap ARC, or add RAM.
  • If reclaim is pathological: fix swappiness, watermark, and hugepages/THP interactions; add zram if appropriate.
  • If swap is “used but quiet”: consider leaving it alone, or drain it slowly and deliberately without triggering a swap-in storm.

Operational rule: swap used without swap I/O can be fine. Swap used with high swap-in/swap-out is where performance goes to die.
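
If you want one quick snapshot before working through the tasks below, this trio answers occupancy, activity, and pressure in under a minute (assuming a kernel recent enough to expose PSI, which current Proxmox kernels are):

cr0x@server:~$ swapon --show                  # occupancy: how much swap is in use
cr0x@server:~$ vmstat 1 5                     # activity: watch the si/so columns
cr0x@server:~$ cat /proc/pressure/memory      # pressure: PSI "some" and "full" averages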

Swap growth isn’t automatically a bug

Linux will opportunistically move cold anonymous pages to swap to keep more RAM available for page cache and filesystem metadata. On a workstation,
this can be a win. On a hypervisor, it depends. Your “cold” pages might belong to qemu processes that suddenly need them. That’s when you see
a VM freeze for a second and your users invent new adjectives.

A key distinction:

  • Swap occupancy: how much swap is used right now.
  • Swap activity: how much data is moving in/out of swap per second.

Swap occupancy can grow and stay high for months if the kernel swapped out genuinely cold pages and never needed them again. That’s not a fire.
Swap activity during business hours is a fire. A quiet fire, but still.

Joke #1: Swap is like a storage unit. You can keep it for years, but the moment you need something from it during a meeting, you’ll regret everything.

Interesting facts and historical context

  1. Linux used to expose “free memory” as the headline, and people panic-tuned systems for years; modern guidance emphasizes “available” memory.
  2. Swappiness isn’t “swap aggressiveness” in a simple way; it influences reclaim balance between anonymous pages and file cache, and behavior changes by kernel era.
  3. The OOM killer is not a bug; it’s Linux choosing “one process must die so the system lives,” which is often correct on a host but painful in virtualization.
  4. ZFS ARC is designed to consume RAM because caching is performance; without a cap, it can crowd out guests on a hypervisor.
  5. Pressure Stall Information (PSI) arrived to quantify “time spent stalled” under pressure—finally turning “it feels slow” into a measurable signal.
  6. Memory cgroups changed the game by enabling per-VM/container memory accounting and reclaim behavior—sometimes improving isolation, sometimes adding surprise.
  7. KSM (Kernel Samepage Merging) was a big deal for virtualization density, but it trades CPU for RAM and can interact with reclaim in unexpected ways.
  8. Transparent Huge Pages were introduced for performance, but in virtualization they can increase latency spikes during compaction and reclaim.
  9. Swappiness defaults and heuristics evolved because SSDs made swapping less catastrophic than on spinning disks—yet random swap I/O still punishes latency-sensitive workloads.

How Linux decides to reclaim memory (and why Proxmox makes it spicy)

Proxmox is “just Debian,” but in production it’s never “just.” You’re running qemu-kvm processes (big anonymous memory consumers), possibly LXC containers,
and maybe ZFS, which is a high-performance filesystem that’s happy to use a lot of RAM. Then you add memory ballooning, which is basically a negotiated
lie: “yes guest, you still have 16 GB,” while the host quietly asks for it back.

Three buckets matter: anonymous, file cache, and reclaimability

When RAM tightens, the kernel reclaims memory mainly by:

  • Dropping file cache (easy and fast if the cache can be discarded).
  • Writing out dirty pages (can be slow; depends on storage).
  • Swapping out anonymous memory (process pages not backed by a file).

Virtualization turns guest RAM into host anonymous pages. That means the hypervisor’s biggest RAM consumer is precisely the type of memory Linux is willing
to swap if it thinks it can keep the system responsive. And you’ve seen how that ends.

Memory pressure is not “low free”

Linux happily uses RAM for cache; it’s not wasted. The metric you want is MemAvailable (from /proc/meminfo) and, better, PSI
pressure signals that show whether tasks are stalling waiting on memory reclaim.

ZFS complicates reclaim

ZFS ARC is technically reclaimable, but it’s not “page cache” in the same way as ext4’s cache. ARC competes with guests for memory. If ARC grows without a cap,
the host may start swapping guest pages while holding a lot of ZFS cache. That’s a trade you rarely want on a hypervisor: swap I/O latency is worse than a smaller ARC.

Ballooning can create “fake headroom”

Ballooning is fine when used intentionally (for elasticity, not for denial). But if you overcommit and rely on ballooning as a safety net,
you can push guests into their own reclaim while the host also reclaims. Two layers of reclaim. Twice the fun. Half the stability.

Quote (paraphrased idea): “Hope is not a strategy” — commonly attributed to operations leaders; treat it as a principle, not a citation.

What usually causes “swap keeps growing” on Proxmox

1) Real overcommit: you allocated more memory than you own

This is the classic. You sum VM “memory assigned” and it exceeds physical RAM by a lot, then you act surprised when the host uses swap.
The host isn’t wrong. Your spreadsheet is.

2) ZFS ARC grows into guest headroom

On a dedicated storage box, letting ARC use most memory is great. On a hypervisor, ARC needs boundaries. Otherwise ARC becomes “that one tenant”
who takes all the kitchen space and acts offended when asked to share.

3) Swappiness and reclaim tuning are mismatched to a hypervisor

Defaults are reasonable for general Linux. Hypervisors are not general. The wrong combination can lead to early swapping of qemu pages even when
dropping cache would have been cheaper.

4) Memory fragmentation / compaction issues (THP and hugepages)

If the host struggles to allocate contiguous memory, compaction and reclaim get noisy. This can look like “swap keeps growing,” but the root
cause is latency and CPU time spent in reclaim paths.

5) Something leaks, but not where you think

Sometimes it’s a real leak: a monitoring agent, a backup process, or a runaway container. But more often it’s “steady growth” caused by cache
(ZFS ARC, slab caches) that looks like a leak in the wrong dashboard.

6) Slow storage makes swap “sticky”

If your swap device is slow or saturated, swapped pages don’t come back quickly. The kernel may keep them swapped longer, making swap usage rise
and stay high. Not because the kernel loves swap—because your disks hate you.

Joke #2: The kernel doesn’t “randomly swap.” It swaps with the calm certainty of someone who knows you didn’t capacity-plan and would like you to learn.

Practical tasks: commands, outputs, decisions

Below are field-ready tasks. Each has: a command, what the output means, and the decision you make. Run them on the Proxmox host first.
Then, when needed, inside a misbehaving guest.

Task 1: Confirm swap occupancy and swap devices

cr0x@server:~$ swapon --show --bytes
NAME       TYPE SIZE        USED       PRIO
/dev/sda3  part 17179869184 4294967296  -2

Meaning: You have a 16 GiB swap partition and 4 GiB is currently used.
Decision: If USED is high, do not panic yet; check activity next (Task 2). If swap is on a slow disk shared with VM storage, consider moving it.

Task 2: Check swap activity (the real “is this hurting?” metric)

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 4194304 812344  53212 921344   0    0    12    30  400  700  3  2 94  1  0
 2  0 4194304 790112  53044 910020   0    0     0    64  420  760  4  3 92  1  0
 3  1 4194304 120312  52880 302120   0  512   120  2000  800 1500 20 10 55 15  0
 2  2 4194816  90344  52000 280000 128 1024   300  5000 1200 2000 25 12 40 23  0
 1  1 4198912  81200  51000 270000 256 2048   500  8000 1500 2500 22 15 35 28  0

Meaning: si/so are swap-in/swap-out KB/s. In the last samples they’re non-zero and rising: active swapping.
Decision: Active swapping means performance impact. Move to pressure diagnosis; plan to reduce memory pressure or change reclaim behavior.

Task 3: Check MemAvailable and swap totals

cr0x@server:~$ grep -E 'MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree' /proc/meminfo
MemTotal:       263989996 kB
MemFree:         1123400 kB
MemAvailable:   34122388 kB
SwapTotal:      16777212 kB
SwapFree:       12582912 kB

Meaning: MemFree is small (normal), MemAvailable is ~32 GiB (good headroom).
Decision: If MemAvailable is healthy but swap activity is high, suspect reclaim pathologies, ZFS ARC pressure behavior, or per-cgroup limits.

Task 4: Read PSI memory pressure (are tasks stalling?)

cr0x@server:~$ cat /proc/pressure/memory
some avg10=0.15 avg60=0.20 avg300=0.35 total=18203456
full avg10=0.02 avg60=0.04 avg300=0.05 total=3401120

Meaning: “some” is time where at least one task was stalled on memory; “full” is time where all runnable tasks were stalled.
Non-trivial “full” indicates user-visible stalls.
Decision: If PSI full is elevated during incident windows, treat it as real memory pressure. Stop debating “but free RAM” and fix capacity/tuning.

Task 5: Identify if kswapd is burning CPU (reclaim is working overtime)

cr0x@server:~$ top -b -n 1 | head -n 20
top - 10:01:22 up 41 days,  3:12,  1 user,  load average: 6.20, 5.90, 5.10
Tasks: 412 total,   2 running, 410 sleeping,   0 stopped,   0 zombie
%Cpu(s): 22.1 us,  9.3 sy,  0.0 ni, 55.0 id, 13.2 wa,  0.0 hi,  0.4 si,  0.0 st
MiB Mem : 257802.0 total,   1140.2 free, 210000.4 used,  46661.4 buff/cache
MiB Swap:  16384.0 total,   8192.0 free,   8192.0 used.  43000.0 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  158 root      20   0       0      0      0 R  48.0   0.0  120:11.2 kswapd0
 2210 root      20   0    17.2g  12.2g  210.0m S  30.0   4.8  300:10.9 qemu-system-x86
  901 root      20   0  2120.0m 180.0m   50.0m S   8.0   0.1   90:22.2 pvestatd

Meaning: kswapd0 at ~48% CPU indicates heavy background reclaim.
Decision: You’re not just “using swap,” you’re paying CPU and latency for it. Investigate ARC, overcommit, and reclaim tunables.

Task 6: See what’s actually in RAM (anon vs file vs slab)

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           252Gi       205Gi       1.1Gi       2.0Gi        45Gi        42Gi
Swap:           16Gi       8.0Gi       8.0Gi

Meaning: High used is normal; focus on “available.” Here available is 42 GiB, which suggests you’re not cornered—unless ZFS/VMs need burst.
Decision: If available is low (<~5–10% of RAM) and swap activity is high, you need to reduce load or increase memory. If available is high, focus on why swap isn’t being reclaimed (often “it doesn’t need to”).

Task 7: Identify top swap consumers (per-process)

cr0x@server:~$ for pid in $(ls /proc | grep -E '^[0-9]+$'); do awk '/^VmSwap:/ {print $2 " " pid}' pid=$pid /proc/$pid/status 2>/dev/null; done | sort -nr | head
1048576 2210
524288  4120
262144  1987
131072  901
65536   3301

Meaning: PID 2210 (likely a qemu process) has ~1 GiB swapped.
Decision: If qemu is swapping heavily, treat it as a host-level issue (overcommit/ARC/tuning). If a random daemon is swapping, fix/limit that process.

Task 8: Map qemu PID to VMID (Proxmox-specific)

cr0x@server:~$ ps -p 2210 -o pid,cmd --no-headers
2210 /usr/bin/kvm -id 104 -name vm104 -m 16384 -smp 8 -drive file=/dev/zvol/rpool/vm-104-disk-0,if=virtio,cache=none

Meaning: That swapped process is VMID 104 with 16 GiB assigned.
Decision: Check the VM’s memory configuration (ballooning, min/max) and host overcommit. If it’s a business-critical VM, prioritize stabilizing host reclaim.

Task 9: Check Proxmox memory allocation vs physical (sanity, not perfection)

cr0x@server:~$ qm list
 VMID NAME         STATUS     MEM(MB)    BOOTDISK(GB) PID
 101  app01        running    32768             64.00 2101
 104  db01         running    16384            200.00 2210
 105  cache01      running    32768             32.00 2302
 110  winbuild     running    24576            120.00 2410

Meaning: Assigned memory adds up quickly; ballooning may hide it, but physics won’t.
Decision: If you’re near or over host RAM, stop. Reduce allocations, enforce limits, or add nodes/RAM. “But the guests don’t use it” is how incidents begin.
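
If you want the arithmetic scripted, you can sum the MEM(MB) column and compare it to physical RAM. A sketch that assumes the qm list column layout shown above and ignores containers, so treat the result as a lower bound on commitments:

cr0x@server:~$ qm list | awk 'NR>1 && $3=="running" {sum+=$4} END {printf "%d MB assigned to running VMs\n", sum}'
cr0x@server:~$ awk '/^MemTotal/ {printf "%d MB physical\n", $2/1024}' /proc/meminfo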

Task 10: Check ballooning settings for a VM

cr0x@server:~$ qm config 104 | grep -E 'memory|balloon'
balloon: 4096
memory: 16384

Meaning: VM has 16 GiB max, balloon target 4 GiB (very aggressive reclaim).
Decision: If balloon target is far below realistic working set, you’re forcing guest reclaim and then host reclaim. Consider raising balloon minimum or disabling ballooning for latency-sensitive VMs.
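
If you decide to change the policy, Proxmox exposes the balloon target per VM through qm set. A sketch using VMID 104 from this example (values are illustrative; depending on guest driver support, the change applies live or after the next guest restart):

cr0x@server:~$ qm set 104 --balloon 12288     # raise the balloon minimum/target to 12 GiB
cr0x@server:~$ qm set 104 --balloon 0         # or disable ballooning for this VM entirely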

Task 11: If you use ZFS, check ARC size and ARC pressure

cr0x@server:~$ arcstat 1 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
10:03:41   220    12      5     3  1.3     9  3.7     0  0.0   96.0G  110.0G

Meaning: ARC size is ~96 GiB, target cap c ~110 GiB. That’s a lot on a hypervisor, depending on total RAM and VM needs.
Decision: If ARC is large while guests are swapping, cap ARC. Hypervisor stability beats marginal read cache wins.

Task 12: Confirm ZFS ARC max (if set) and decide whether to cap it

cr0x@server:~$ grep -R "zfs_arc_max" /etc/modprobe.d /etc/sysctl.conf /etc/sysctl.d 2>/dev/null
/etc/modprobe.d/zfs.conf:options zfs zfs_arc_max=68719476736

Meaning: ARC max is set to 64 GiB (in bytes). Good: at least it’s bounded.
Decision: If you have frequent reclaim and swap activity, lower ARC max further (carefully) to keep headroom for guests. If you’re storage-heavy and VM-light, keep it higher.
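
If you lower the cap, remember the runtime knob and the persistent setting are separate. A sketch for a 32 GiB cap (value in bytes, illustrative): the sysfs write applies immediately and ARC shrinks gradually, while the modprobe option only takes effect at module load, so refresh the initramfs so it survives reboots. Edit the existing zfs.conf instead of overwriting it if it carries other options:

cr0x@server:~$ echo 34359738368 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
cr0x@server:~$ echo "options zfs zfs_arc_max=34359738368" | sudo tee /etc/modprobe.d/zfs.conf
cr0x@server:~$ sudo update-initramfs -u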

Task 13: Check kernel swappiness and dirty writeback behavior

cr0x@server:~$ sysctl vm.swappiness vm.dirty_ratio vm.dirty_background_ratio
vm.swappiness = 60
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10

Meaning: Swappiness 60 is default-ish. Dirty ratios define when the kernel starts forcing writes.
Decision: On hypervisors, a common stance is lower swappiness (e.g., 1–10) to discourage swapping qemu memory, unless you know you benefit from swap. Adjust dirty ratios if you see writeback stalls.
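
If a lower swappiness measurably helps, persist it; a one-off sysctl -w resets at reboot. A sketch using a drop-in file (the filename is an arbitrary choice):

cr0x@server:~$ echo "vm.swappiness = 10" | sudo tee /etc/sysctl.d/90-hypervisor-memory.conf
cr0x@server:~$ sudo sysctl --system           # re-reads all drop-ins and applies them now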

Task 14: Inspect major page faults (a sign of swap-ins and cache misses)

cr0x@server:~$ pidstat -r -p 2210 1 3
Linux 6.2.16 (server)  12/26/2025  _x86_64_ (32 CPU)

10:05:12 AM   PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
10:05:13 AM  2210   1200.00    45.00 10055680 12582912   4.8  qemu-system-x86
10:05:14 AM  2210   1100.00    60.00 10055680 12583104   4.8  qemu-system-x86
10:05:15 AM  2210   1300.00    55.00 10055680 12583360   4.8  qemu-system-x86

Meaning: Major faults (majflt/s) often involve disk I/O (including swap-ins). These numbers are high enough to care about.
Decision: High major faults for qemu during load correlates with swapping or heavily pressured memory. Reduce pressure or improve swap device and reclaim behavior.

Task 15: Check THP status (can contribute to reclaim/compaction pain)

cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

Meaning: THP is set to always. That can be fine, but on some virtualization workloads it increases latency spikes.
Decision: If you see compaction/reclaim stalls, try madvise instead of always and measure. Don’t cargo-cult: test on one node first.
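
A sketch of the test itself, assuming the standard sysfs path; the runtime change does not survive a reboot, so if it helps, persist it with transparent_hugepage=madvise on the kernel command line using whatever boot tooling your node uses:

cr0x@server:~$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled    # the brackets should now sit on madvise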

Task 16: Find out if swap is on slow or contended storage

cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,ROTA,MOUNTPOINTS
NAME   TYPE   SIZE ROTA MOUNTPOINTS
sda    disk   1.8T    1
├─sda1 part   512M    1 /boot/efi
├─sda2 part     1G    1 /boot
├─sda3 part    16G    1 [SWAP]
└─sda4 part   1.8T    1
nvme0n1 disk  1.9T    0
└─nvme0n1p1 part 1.9T    0 /

Meaning: Swap is on a rotational disk (ROTA=1) while the OS is on NVMe. That’s a performance smell.
Decision: Move swap to faster storage (NVMe) or use zram for burst absorption. Rotational swap on a busy hypervisor is a latency factory.
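
If zram is the route you take, a minimal manual sketch looks like this; packaged approaches (zram-tools, systemd's zram-generator) are cleaner for persistence, and the size, compression algorithm, and priority here are illustrative assumptions:

cr0x@server:~$ sudo modprobe zram
cr0x@server:~$ sudo zramctl --find --size 8G --algorithm zstd    # prints the device it claimed, e.g. /dev/zram0
cr0x@server:~$ sudo mkswap /dev/zram0
cr0x@server:~$ sudo swapon --priority 100 /dev/zram0             # higher priority than disk swap, so it fills first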

Task 17: Check whether you’re hitting memory cgroup limits (containers especially)

cr0x@server:~$ systemd-cgtop -m -n 1
Control Group                           Memory Current  Memory Peak  Memory Swap  IO Read  IO Write
/                                         210.0G        211.2G        8.0G        2.1M     30.4M
/system.slice/pve-container@112.service      3.2G          3.4G        1.5G          0B       12K
/system.slice/pve-container@113.service      7.8G          8.0G        2.0G          0B       18K

Meaning: Containers can use swap too; the swap column reflects cgroup v2 per-cgroup swap accounting (memory.swap.current).
Decision: If one container is the swap hog, fix its limit or workload. Host-wide tuning won’t make a badly sized container behave.
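
If you prefer raw numbers over systemd-cgtop, cgroup v2 exposes per-cgroup swap directly in memory.swap.current (values in bytes). A sketch that walks the tree and lists the biggest swap users, assuming unified cgroups, which current Proxmox releases use by default:

cr0x@server:~$ find /sys/fs/cgroup -name memory.swap.current 2>/dev/null | \
    while read -r f; do printf '%s %s\n' "$(cat "$f")" "$f"; done | sort -nr | head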

Task 18: Drain swap safely (when you’ve fixed the cause)

cr0x@server:~$ sudo sysctl -w vm.swappiness=1
vm.swappiness = 1
cr0x@server:~$ sudo swapoff -a && sudo swapon -a
cr0x@server:~$ swapon --show
NAME      TYPE SIZE  USED PRIO
/dev/sda3 part  16G    0B   -2

Meaning: Swap was cleared and re-enabled.
Decision: Only do this when MemAvailable is comfortably high and swap activity is low. Otherwise, swapoff can trigger a memory storm and an OOM event.

Three corporate mini-stories from the swap mines

Mini-story 1: The incident caused by a wrong assumption

A mid-size SaaS shop ran a Proxmox cluster hosting “miscellaneous” internal services: CI runners, a metrics stack, a couple of databases nobody owned,
and a Windows VM that existed because someone once needed Visio. The hosts had ample RAM on paper. Swap still climbed, slowly, like a bad mood.

The wrong assumption was simple: “If free shows tens of gigabytes available, the system can’t be under memory pressure.” They looked at
MemAvailable and declared victory. Meanwhile, user complaints were about short freezes—30 seconds of nothing—then everything “caught up.”

The key was PSI. During the freezes, memory PSI “full” spiked. kswapd took a CPU core for long stretches. Swap I/O was measurable, not huge, but consistent.
It wasn’t a lack of memory on average; it was a lack of memory when needed, and reclaim couldn’t keep up.

Root cause: a handful of VMs had balloon targets set far below their actual working sets. The host would reclaim from guests aggressively, guests would page,
and then the host would page too. Two layers of paging created latency spikes. The fix was boring: disable ballooning for the latency-sensitive VMs, set realistic
minimums for the rest, and stop pretending overcommit was “free.”

The swap “kept growing” after the fix, but swap activity went near-zero. That was the important part. They stopped rebooting hosts like it was a wellness ritual.

Mini-story 2: The optimization that backfired

Another org decided to “optimize I/O” by moving swap to a ZFS zvol on the same pool as VM disks, because it was convenient and snapshots were cool.
It worked in the lab. Everything works in the lab. The lab is where physics goes to take a nap.

In production, under a mild memory pressure event (a DB VM doing a one-time index rebuild), swap activity increased. ZFS started working harder.
ARC grew because the workload was read-heavy. The pool got busy. Swap I/O competed with VM disk I/O. Latency climbed. Guests slowed down.

The team responded by increasing swap size. That reduced OOM events but increased time spent in misery. They essentially turned “fail fast” into
“fail slowly while everyone watches dashboards.”

The fix: move swap off the ZFS pool and onto dedicated fast local storage (or zram for burst). Cap ARC. Keep swap I/O away from the same queue
your VMs depend on. Sometimes the right optimization is separating concerns, not shaving microseconds.

The lesson stuck: “Convenience architectures are production’s favorite punching bag.”

Mini-story 3: The boring but correct practice that saved the day

A finance-adjacent company had a Proxmox cluster that never made headlines. Not because it was magical—because it was run with the kind of discipline
that looks unimpressive until it’s missing.

They kept a simple weekly report: per-node MemAvailable trends, PSI averages, swap I/O rates, and top swap consumers. Not vanity metrics; the kind
you use to catch slow changes. They also enforced a policy: VM memory allocations couldn’t exceed a defined headroom threshold unless justified in a ticket.

One week, PSI “some” drifted upward on two nodes. Swap I/O remained low, but kswapd CPU ticked up. Nothing was broken yet. That’s the best time to fix things.
They found a new log-ingestion VM with a memory limit set too high and ballooning too low, causing host reclaim pressure during peak ingest.

They adjusted the VM memory, set a sane balloon minimum, and tightened ARC max slightly. The nodes never hit the cliff. No incident. No late-night “why is
the hypervisor swapping” drama. The boring graph saved the day because it made “almost broken” visible.

If you want reliability, you don’t need heroics. You need early signals and permission to act on them.

Common mistakes: symptom → root cause → fix

1) Symptom: swap used is high, but performance is fine

Root cause: Cold pages were swapped out and never needed again. Swap activity is near-zero.

Fix: Do nothing. Monitor PSI and swap I/O. Don’t “swapoff -a” just to feel clean.

2) Symptom: swap used grows daily, kswapd high CPU, short VM freezes

Root cause: Sustained memory pressure or reclaim loop; often VM overcommit plus ZFS ARC or ballooning.

Fix: Reduce allocations or move VMs; cap ARC; lower swappiness; disable or constrain ballooning; verify PSI drops.

3) Symptom: host has plenty of MemAvailable but swap I/O is high

Root cause: Reclaim imbalance or per-cgroup memory pressure (containers/VM configs), or THP/compaction stalls creating pressure patterns.

Fix: Inspect PSI and per-cgroup usage; tune swappiness; consider THP=madvise; ensure swap device is fast.

4) Symptom: after adding swap, system stops OOMing but becomes sluggish

Root cause: You converted a hard failure into thrashing. Swap is masking capacity problems.

Fix: Right-size RAM commitments; keep swap moderate; use swap for safety, not for steady-state.

5) Symptom: swapping spikes during backups/scrubs

Root cause: I/O-induced stalls cause dirty writeback and reclaim delays; ZFS scrubs can change cache behavior and pressure.

Fix: Schedule heavy I/O; cap ARC; tune dirty ratios if writeback stalls; ensure swap not on same busy device.

6) Symptom: one VM is “fine” but everything else is slow

Root cause: One VM or container is forcing the host into reclaim (memory hog), often due to mis-sized memory or runaway workload.

Fix: Identify top swap consumers; cap the offender; set realistic VM memory; isolate noisy workloads to dedicated nodes.

7) Symptom: swap keeps growing after migration changes

Root cause: Memory locality shifts, ARC warms differently, or KSM/THP behavior changes. Also: lingering swapped pages don’t automatically return.

Fix: Re-measure swap activity and PSI. If quiet, accept it. If not, tune and then optionally drain swap during a maintenance window.

Checklists / step-by-step plan

Step-by-step: stabilize a Proxmox host that’s swapping under load

  1. Measure activity, not feelings.
    Use vmstat 1, PSI, and pidstat to confirm active swapping and stalls.
  2. Find the pressure source.
    Identify top swap consumers; map qemu PIDs to VMIDs; check container cgroups.
  3. Check commitments.
    Compare physical RAM to total assigned RAM. If you’re overcommitted without a plan, that’s the plan failing.
  4. Fix ballooning policy.
    For latency-sensitive VMs: disable ballooning or set a realistic minimum. For the rest: keep ballooning conservative.
  5. Cap ZFS ARC if using ZFS.
    Decide a budget: leave headroom for host + worst-case VM bursts. Apply and monitor.
  6. Make swap fast or make it smaller.
    Put swap on NVMe if you need it; avoid placing it on the same contended pool as VM disks.
  7. Tune reclaim lightly, then re-measure.
    Adjust swappiness (common: 1–10 for hypervisors), consider THP=madvise, and watch PSI and swap I/O.
  8. Only then drain swap (optional).
    Use swapoff/swapon during a low-load window if swap occupancy annoys you or you need a clean baseline.
  9. Lock in guardrails.
    Alerts on PSI full, swap I/O rate, and kswapd CPU. A dashboard is only useful if it changes behavior.
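
For the PSI guardrail specifically, even a small cron-driven check beats nothing while proper monitoring catches up. A hypothetical sketch (the threshold, the memwatch tag, and the choice of logger as the alert channel are all placeholders to adapt):

#!/bin/sh
# Flag sustained memory pressure: warn if PSI "full" avg60 exceeds a threshold.
THRESHOLD=5.0   # percent of wall time all runnable tasks were stalled on memory
FULL_AVG60=$(awk '/^full/ {for (i=1; i<=NF; i++) if ($i ~ /^avg60=/) {split($i, a, "="); print a[2]}}' /proc/pressure/memory)
if awk -v v="$FULL_AVG60" -v t="$THRESHOLD" 'BEGIN {exit !(v > t)}'; then
    logger -t memwatch "memory PSI full avg60=${FULL_AVG60} exceeds ${THRESHOLD}"
fi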

Checklist: what “good” looks like on a stable Proxmox node

  • MemAvailable stays above a comfortable floor under normal peak load (define it; don’t guess).
  • PSI memory “full” is near-zero most of the time; “some” is low and stable.
  • Swap I/O is near-zero in steady-state; swap used may be non-zero and that’s acceptable.
  • kswapd is not a top CPU consumer.
  • ZFS ARC is capped (if running ZFS) and doesn’t starve guests.
  • Ballooning is deliberate, not defaulted into chaos.

Checklist: when to add RAM vs when to tune

  • Add RAM when PSI shows sustained stalls and you can’t reduce commitments without business impact.
  • Tune when swap activity is high but headroom exists, or when one configuration choice (ARC/ballooning/swap device) is obviously wrong.
  • Re-architect when your density goal requires permanent overcommit and you don’t have workload predictability. That’s a strategy decision, not a sysctl.

FAQ

1) Why does swap usage keep increasing even though RAM “looks fine”?

Because swap usage is sticky. Linux can swap out cold pages and never bother pulling them back if there’s no need. If swap I/O is low and PSI is calm,
it’s not necessarily a problem.

2) Should I set vm.swappiness=1 on Proxmox?

Often yes for hypervisors, because swapping qemu memory hurts latency. But don’t treat it as a magic number. Measure swap I/O and PSI before and after.
If you run memory-heavy file cache workloads or ZFS, you still need to manage ARC and commitments.

3) Is it safe to run with no swap?

It’s “safe” in the sense that you’ll hit OOM sooner and harder. Some environments prefer that to thrashing. Most production hypervisors keep some swap
as an emergency buffer, but rely on capacity planning so swap isn’t used under normal load.

4) My swap is used but swap-in/out is zero. Should I clear it?

No urgency. Clearing swap forces those pages back into RAM, which can cause transient pressure. If you want a clean baseline, drain it during a maintenance
window with plenty of MemAvailable.

5) Does ZFS ARC “cause” swapping?

ARC doesn’t directly force swapping, but it competes for RAM. If you let ARC grow large on a hypervisor, the kernel may reclaim anonymous pages
(your VMs) while ARC remains big. Capping ARC is a common stabilization move on Proxmox+ZFS.

6) Should swap live on a ZFS zvol?

You can, but you usually shouldn’t on a busy hypervisor. Swap I/O competes with VM disk I/O and can amplify latency. Prefer dedicated fast local storage
or zram for burst absorption, depending on your constraints.

7) Is memory ballooning good or bad?

It’s a tool. It’s good when you use it to reclaim truly unused guest memory and you set realistic minimums. It’s bad when you use it as a crutch for
overcommit and then wonder why guests and hosts both start paging under load.

8) How do I know if swapping is hurting VMs?

Look for host swap I/O (vmstat), elevated major faults in qemu processes, PSI memory “full,” kswapd CPU, and VM-level symptoms like latency spikes
and I/O wait. Swap used alone is not enough.

9) Can THP cause swap growth?

THP is more about compaction and reclaim cost than swap occupancy directly. But if THP “always” leads to frequent compaction stalls and reclaim churn,
you can see more swapping and latency. If you suspect it, try THP=madvise and measure.

10) What’s the single most reliable stabilization move?

Stop lying to the box about memory. Keep commitments within physical reality (with headroom), cap ARC if on ZFS, and make ballooning conservative.
Then tune swappiness and swap placement as refinements.

Conclusion: next steps that actually stabilize

“Swap keeps growing” is only scary when it’s paired with pressure and activity. Your job is to separate occupancy from thrash,
then remove the reason the host is forced to make ugly tradeoffs.

Do this next:

  1. Capture a 10-minute sample during peak: vmstat 1, PSI (/proc/pressure/memory), top processes, and per-qemu major faults.
  2. Map swap consumers to VMIDs; verify ballooning and memory sizes aren’t fantasy.
  3. If you run ZFS: cap ARC to a deliberate budget that leaves VM headroom.
  4. Move swap to fast storage or use zram if you need a burst buffer; avoid contended pools.
  5. Set swappiness low (and persist it) once you’ve confirmed it improves swap activity and stalls.
  6. Only after stability: optionally clear swap during a quiet window to reset baselines.

The goal is not “zero swap used.” The goal is “no memory stalls, no thrash, and predictable latency.” Boring is the feature.
