Proxmox “cannot allocate memory”: ballooning, overcommit, and how to tune it

You click Start on a VM and Proxmox answers with the digital equivalent of a shrug:
“cannot allocate memory”. Or worse, the VM starts and then the host starts murdering random processes
like a stressed-out stage manager in a theater with one exit.

Memory failures in Proxmox aren’t mystical. They’re accounting problems: what the host thinks it has,
what VMs claim they might use, what they actually touch, and what the kernel is willing to
promise at that moment. Fix the accounting, and most of the drama goes away.

Fast diagnosis playbook

If you’re on-call and the cluster is yelling, you don’t want a philosophy lecture. You want a tight loop:
confirm the failure mode, identify the limiter, make one safe change, repeat.

First: is this a host memory exhaustion or a per-VM limit?

  • If the VM fails to start with cannot allocate memory, suspect host commit limits,
    cgroup limits, hugepages, or fragmentation—often visible immediately in dmesg / journal.
  • If the VM starts then gets killed, it’s usually the guest OOM killer (inside the VM) or
    the host OOM killer (killing QEMU), depending on which logs show the body.

Second: check the host’s “real” headroom, not the pretty graphs

  • Host free memory and swap: free -h
  • Host memory pressure and reclaim stalls: vmstat 1
  • OOM evidence: journalctl -k and dmesg -T
  • ZFS ARC size (if you use ZFS): arcstat or /proc/spl/kstat/zfs/arcstats

Third: verify Proxmox-side allocation and policy

  • VM config: ballooning target vs max, hugepages, NUMA, etc.:
    qm config <vmid>
  • Node-level memory status and datacenter defaults: pvesh get /nodes/<node>/status and
    /etc/pve/datacenter.cfg
  • If it’s a container (LXC), check cgroup memory limit and swap limit:
    pct config <ctid>

Fourth: pick the least-bad immediate mitigation

  • Stop one noncritical VM to free RAM and reduce pressure now.
  • If ZFS is eating the box: cap ARC (persistent) or reboot as a last resort.
  • If you’re overcommitted: reduce VM max memory (not just balloon target).
  • If swap is absent and you’re tight: add swap (host) to avoid instant OOM while you fix sizing.

Joke #1: Memory overcommit is like corporate budgeting—everything works until everyone tries to expense lunch on the same day.

What “cannot allocate memory” actually means in Proxmox

Proxmox is a management layer. The actual allocator is Linux, and for VMs it’s usually QEMU/KVM. When you see
cannot allocate memory, one of these is happening:

  • QEMU can’t reserve the VM’s requested RAM at start time. That can fail even if
    free looks okay, because Linux cares about commit rules and fragmentation.
  • Kernel refuses the allocation due to overcommit/commitlimit logic. Linux tracks how much
    memory processes have promised to potentially use (virtual memory), and it can deny new promises.
  • Hugepages are requested but not available. Hugepages are pre-carved. If they aren’t there,
    the allocation fails immediately and loudly.
  • cgroup limits block the allocation. More common with containers, but can apply if systemd
    slices or custom cgroups are involved.
  • Memory is available but not in the shape you asked for. Fragmentation can prevent large
    contiguous allocations, especially with hugepages or certain DMA needs.

Meanwhile, the “fix” people reach for—ballooning—doesn’t change what QEMU asked for if you still configured a large
maximum memory. Ballooning adjusts what the guest is encouraged to use, not what the host must be prepared
to back at the worst possible time.

Two numbers matter: guest target and guest max

In Proxmox VM options, ballooning gives you:

  • Memory (max): the VM’s ceiling. QEMU is sized against it, and the host’s commit accounting has to cover it.
  • Balloon (min/target): the VM’s runtime target that can be lowered under pressure.

If you set max to 64 GB “just in case” and balloon target to 8 GB “because it usually idles,” you’ve told the host:
“Please be ready to fund my 64 GB lifestyle.” The host, being an adult, may say no.
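
If the max really is fantasy, the fix is mundane (a sketch; VM 104 and the 16 GiB / 8 GiB split are placeholders,
and the new maximum only takes effect at the next VM start):

cr0x@server:~$ qm set 104 --memory 16384 --balloon 8192
cr0x@server:~$ qm config 104 | egrep 'memory|balloon'
memory: 16384
balloon: 8192

Keep the balloon target at or below the max, and remember that lowering the max is the part that actually changes
what the host has to promise.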

Interesting facts and a little history (so you stop repeating it)

  1. Linux overcommit behavior is old and intentional: it exists because many allocations are never fully touched,
    and strict accounting would waste RAM on empty promises.
  2. The OOM killer predates most modern virtualization stacks; it was Linux’s pragmatic answer to “somebody is lying
    about memory” long before cloud marketing turned lying into a feature.
  3. Ballooning became mainstream with early hypervisors because idle guests hoarded cache and made consolidation look
    worse than it had to be.
  4. KSM (Kernel Samepage Merging) was designed to deduplicate identical memory pages across VMs—especially common when
    many VMs run the same OS image.
  5. Transparent Huge Pages (THP) were introduced to improve performance by using larger pages automatically, but
    they can create latency spikes under memory pressure due to compaction work.
  6. ZFS ARC is not “just cache.” It competes with anonymous memory. If you don’t cap it, it will happily take RAM
    until the kernel forces it to give some back—sometimes too late.
  7. cgroups changed the game: instead of the whole host being one happy family, memory limits can now make a single
    VM or container fail even when the host looks fine.
  8. Swap used to be mandatory advice; then people abused it; then people swore it off; then modern SSDs made
    “a small, controlled swap” sensible again in many cases.

One operational idea that remains painfully relevant (paraphrased): Werner Vogels’ point that reliability starts with expecting failure and designing for it, not pretending it won’t happen.

Ballooning: what it does, what it doesn’t, and why it lies to you

What ballooning actually is

Ballooning uses a driver inside the guest (virtio-balloon typically). The host asks the guest to “inflate” a balloon,
meaning: allocate memory inside the guest and pin it so the guest can’t use it. That memory becomes reclaimable
from the host’s perspective because the guest voluntarily gave it up.
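
From inside a Linux guest, one quick check confirms the balloon driver is even present (a sketch; the module name is
the standard virtio_balloon, and Windows guests need the VirtIO balloon driver installed separately):

cr0x@server:~$ lsmod | grep virtio_balloon
virtio_balloon         24576  0

If nothing comes back, the host can ask all it wants; nobody inside is listening.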

It’s clever. It’s also limited by physics and guest behavior:

  • If the guest is under real memory pressure, it can’t give you much without swapping or OOMing itself.
  • If the guest doesn’t have the balloon driver, ballooning is basically interpretive dance.
  • Ballooning is reactive. If the host is already in trouble, you may be too late.

Ballooning in Proxmox: the important gotcha

Proxmox’s ballooning config often gives a false sense of safety. People set low balloon targets and high max
memory, thinking they’re “only using the target.” But QEMU’s accounting and the kernel’s commit logic often
need to consider the maximum.

Operational stance: ballooning is a tuning tool, not an excuse to avoid sizing. Use it for
elastic workloads where the guest OS can cope. Do not use it as your primary strategy to pack a host until
it squeals.

When ballooning is worth it

  • Dev/test clusters where guests idle and spikes are rare and tolerable.
  • VDI-like fleets with many similar VMs, often combined with KSM.
  • General-purpose server fleets where you can enforce sane max values, not fantasy ones.

When ballooning is a trap

  • Databases with strict latency and buffer pools (guest memory pressure becomes IO pressure).
  • Systems with swap disabled in guests (ballooning can force OOM inside the guest).
  • Hosts already tight on memory where ballooning response time is too slow.

Overcommit: when it’s smart, when it’s reckless

Three different “overcommits” people confuse

In practice, you’re juggling three layers:

  1. Proxmox-level accounting: what Proxmox reports and what you plan against—configured RAM, balloon
    targets, and node memory. Proxmox itself won’t hard-block a VM start over this; the kernel does the refusing.
  2. Linux virtual memory overcommit: vm.overcommit_memory and CommitLimit.
  3. Actual physical overcommit: whether the sum of actively used guest memory exceeds host RAM
    (and whether you have swap, compression, or a plan).

Linux commit accounting in one operational paragraph

Linux decides whether to allow an allocation based on how much memory could be used if processes touch it.
That “could be used” number is tracked as Committed_AS. The ceiling is CommitLimit, derived from swap plus a
fraction of RAM set by the overcommit tunables. Strict mode (vm.overcommit_memory=2) enforces it hard; the default
heuristic mode still refuses requests it judges unreasonable. Either way, when Committed_AS closes in on CommitLimit,
new reservations start failing—hello, “cannot allocate memory.”
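
If you want that as one number you can watch or alert on, a one-liner sketch (the threshold is yours to pick; it just
prints the remaining commit headroom in GiB):

cr0x@server:~$ awk '/^CommitLimit:/ {l=$2} /^Committed_AS:/ {a=$2} END {printf "commit headroom: %.1f GiB\n", (l-a)/1048576}' /proc/meminfo
commit headroom: 0.7 GiB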

Opinionated guidance

  • Production: keep overcommit modest, and enforce realistic VM maximums. If you can’t state your
    overcommit ratio and your eviction plan, you’re not overcommitting—you’re gambling. (A quick way to total
    the configured maximums is sketched after this list.)
  • Lab: overcommit aggressively if you accept occasional OOM events. Just label it honestly and
    stop pretending it’s prod.
  • Mixed workloads: either separate noisy memory users (DB, analytics) onto their own nodes,
    or cap them hard. “Coexistence” is what people call it right before the incident review.
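
To state the ratio, you first need the numerator. A rough sketch that totals configured VM max memory on one node
(it sums the memory: line of every local VM config; containers and balloon targets are ignored, and you may need to
adjust the parsing if your configs use the newer property-string memory format):

cr0x@server:~$ for id in $(qm list | awk 'NR>1 {print $1}'); do qm config $id | awk '/^memory:/ {print $2}'; done | awk '{s+=$1} END {printf "configured VM max total: %d MiB\n", s}'
configured VM max total: 98304 MiB

Divide by host RAM and you have the overcommit ratio you should be able to say out loud without flinching.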

ZFS ARC, page cache, and the host memory you forgot to budget

Proxmox often runs on ZFS because snapshots and send/receive are addictive. But ZFS is not shy: it will use RAM
for ARC (Adaptive Replacement Cache). That’s great until it isn’t.

ARC versus “free memory”

ARC is reclaimable, but not instantly and not always in the way your VM start wants. Under pressure, the kernel
tries to reclaim page cache and ARC, but if you’re in a tight loop of allocations (starting a VM, inflating memory,
forking processes), you can hit transient failures.

What to do

  • On ZFS hosts with many VMs, set a sensible ARC maximum (zfs_arc_max). Don’t let ARC “fight” your guests
    (a runtime version is sketched after this list; the persistent cap is in the checklist section).
  • Treat host memory as shared infrastructure. The host needs memory for:
    kernel, slab, networking, ZFS metadata, QEMU overhead, and your monitoring agents that swear they’re lightweight.
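
If you need relief before a reboot, zfs_arc_max can usually be lowered at runtime on current OpenZFS (a sketch;
the value is bytes, here 16 GiB, and the ARC shrinks gradually rather than instantly):

cr0x@server:~$ echo 17179869184 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
17179869184
cr0x@server:~$ awk '/^size|^c_max/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats
size 34359738368
c_max 17179869184

The persistent version (modprobe config plus initramfs update) is in the checklist section below.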

Swap: not a sin, but also not a life plan

No swap means you’ve removed the shock absorbers. With virtualization, that can be fatal because a sudden pressure
spike turns into immediate OOM kills instead of a slow, diagnosable degradation.

But swap can also become a performance tarpit. The goal is controlled swap: enough to survive bursts, not enough to
hide chronic overcommit.

Host swap recommendations (practical, not dogma)

  • If you run ZFS and many VMs: add swap. Even a moderate amount can prevent the host from killing
    QEMU during brief spikes.
  • If your storage is slow: keep swap smaller and prioritize correct RAM sizing. Swapping to a busy
    HDD RAID is not “stability,” it’s “extended suffering.”
  • If you use SSD/NVMe: swap is much more tolerable, but still not free. Monitor swap-in/out rate,
    not just swap used.

Joke #2: Swap is like a meeting that could’ve been an email—sometimes it saves the day, but if you live there, your career is over.

Practical tasks: commands, outputs, and decisions (12+)

These are the checks I actually run when a Proxmox node starts throwing memory allocation errors. Each task includes:
a command, example output, what it means, and what decision it drives.

Task 1: Check host RAM and swap at a glance

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        54Gi       1.2Gi       2.3Gi       6.9Gi       2.8Gi
Swap:            8Gi       1.6Gi       6.4Gi

Meaning: “available” is your near-term headroom before reclaim gets ugly. 2.8 GiB on a 62 GiB host
with virtualization is tight but not instantly doomed.

Decision: If available is < 1–2 GiB and VMs are failing to start, stop noncritical VMs now.
If swap is 0, add swap as a stabilizer while you fix sizing.

Task 2: Identify if the kernel is rejecting allocations due to commit limits

cr0x@server:~$ grep -E 'CommitLimit|Committed_AS' /proc/meminfo
CommitLimit:    71303168 kB
Committed_AS:   70598240 kB

Meaning: You’re close to the commit ceiling. The kernel may reject new memory reservations even if
there’s cache that could be reclaimed.

Decision: Reduce VM max memory allocations, add swap (increases CommitLimit), or move workloads.
Ballooning target changes won’t help if max is the problem.

Task 3: Confirm overcommit policy

cr0x@server:~$ sysctl vm.overcommit_memory vm.overcommit_ratio
vm.overcommit_memory = 0
vm.overcommit_ratio = 50

Meaning: Mode 0 is heuristic overcommit. Ratio matters mostly for mode 2. Still, commit behavior
is in play.

Decision: Don’t flip these in panic unless you understand the impact. If you’re hitting commit limits,
fixing sizing is better than “just overcommit harder.”

Task 4: Look for OOM killer evidence on the host

cr0x@server:~$ journalctl -k -b | tail -n 30
Dec 26 10:14:03 pve1 kernel: Out of memory: Killed process 21433 (qemu-system-x86) total-vm:28751400kB, anon-rss:23110248kB, file-rss:0kB, shmem-rss:0kB
Dec 26 10:14:03 pve1 kernel: oom_reaper: reaped process 21433 (qemu-system-x86), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Meaning: The host killed QEMU. That VM didn’t “crash,” it was executed.

Decision: Treat as host memory exhaustion/overcommit. Reduce consolidation, cap ARC, add swap,
and stop relying on ballooning as a seatbelt.

Task 5: Check memory pressure and reclaim behavior live

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 6  1 1677720 312000  8200 5120000  40  120   180   260  900 1800 18 12 55 15  0
 5  2 1677800 280000  8100 5010000  10  200   140   320  920 1700 15 10 50 25  0
 7  3 1677850 260000  8000 4920000  80  500   220   600 1100 2200 20 15 35 30  0

Meaning: Nonzero si/so means pages are actively moving to and from swap. High wa means CPUs are
stalled waiting on IO. If b grows and id collapses, the host is thrashing.

Decision: If swapping is sustained and IO wait spikes, stop VMs or move load. You cannot “tune” your
way out of a thrash storm in real time.

Task 6: Find the biggest memory consumers on the host (RSS, not VIRT fantasies)

cr0x@server:~$ ps -eo pid,comm,rss,vsz --sort=-rss | head -n 10
 21433 qemu-system-x86 23110248 28751400
 19877 qemu-system-x86 16188012 21045740
  1652 pveproxy          312400  824000
  1321 pvedaemon         210880  693000
   1799 zed                18200  153640
  1544 pvestatd          122000  610000

Meaning: RSS is real resident memory. QEMU processes dominate, as expected.

Decision: If one VM is a runaway, cap its max memory or investigate inside the guest.
If it’s “many medium” VMs, it’s consolidation math, not a single villain.

Task 7: Inspect a VM’s memory configuration (ballooning vs max)

cr0x@server:~$ qm config 104 | egrep 'memory|balloon|numa|hugepages'
memory: 32768
balloon: 8192
numa: 1
hugepages: 2

Meaning: Max is 32 GiB, balloon target 8 GiB. Hugepages are enabled (2 = 2MB hugepages).

Decision: If the node is failing allocations, this VM’s 32 GiB max might be too generous.
If hugepages are enabled, confirm hugepages availability (Task 8) or disable hugepages for flexibility.
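
If you pick flexibility, dropping the hugepages option is one command (a sketch; it applies at the next VM start,
so plan a restart):

cr0x@server:~$ qm set 104 --delete hugepages

Re-run qm config afterwards and confirm the hugepages line is gone before you try the start again.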

Task 8: Validate hugepages availability (classic cause of start failures)

cr0x@server:~$ grep -i huge /proc/meminfo
AnonHugePages:   1048576 kB
HugePages_Total:    8192
HugePages_Free:      120
HugePages_Rsvd:       50
Hugepagesize:       2048 kB

Meaning: Only 120 hugepages free (~240 MiB). If you try to start a VM needing many hugepages, it fails.

Decision: Either provision enough hugepages at boot, or stop using hugepages for that VM class.
Hugepages are a performance tool, not a default.
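
If a VM class genuinely earns hugepages, provision them up front rather than hoping the allocator finds contiguous
memory later. A minimal sketch via sysctl (the count is pages, so 8192 × 2 MiB ≈ 16 GiB; for large pools, setting
hugepages= on the kernel command line is more reliable because boot-time memory is far less fragmented):

cr0x@server:~$ echo 'vm.nr_hugepages = 8192' | sudo tee /etc/sysctl.d/80-hugepages.conf
vm.nr_hugepages = 8192
cr0x@server:~$ sudo sysctl --system | grep nr_hugepages
vm.nr_hugepages = 8192

Re-check HugePages_Total in /proc/meminfo afterwards; if the kernel couldn’t find enough contiguous memory, you got
fewer pages than you asked for.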

Task 9: Check for THP behavior (can cause latency during pressure)

cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

Meaning: THP is always enabled.

Decision: For latency-sensitive nodes, consider madvise or never.
Don’t change this mid-incident unless you’re confident; plan it with a maintenance window and measure.
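
When that window arrives, the runtime switch is trivial and easy to verify (a sketch):

cr0x@server:~$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
madvise
cr0x@server:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

To make it survive reboots, add transparent_hugepage=madvise to the kernel command line (GRUB_CMDLINE_LINUX_DEFAULT
in /etc/default/grub plus update-grub, or the proxmox-boot-tool workflow on systemd-boot installs).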

Task 10: If using ZFS, check ARC size quickly

cr0x@server:~$ awk '/^size/ {print}' /proc/spl/kstat/zfs/arcstats
size                            4    34359738368

Meaning: ARC is ~32 GiB. On a 64 GiB host with many VMs, that may be too much.

Decision: If you’re memory-starved and ARC is large, cap ARC persistently (see checklist section)
and plan a reboot if needed for immediate relief.

Task 11: Confirm KSM status (helps with many similar VMs, can cost CPU)

cr0x@server:~$ systemctl is-active ksmtuned
inactive

Meaning: KSM tuning service isn’t running. On some Proxmox setups, KSM is configured differently;
this is just a quick signal.

Decision: If you run dozens of similar Linux VMs, enabling KSM may reduce memory usage. If CPU is
already hot, KSM can backfire. Test on one node first.
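
If you do test it on one node, the moving parts are the ksmtuned service and the kernel’s KSM counters (a sketch;
on Proxmox the service typically ships in the ksm-control-daemon package—verify the name on your install):

cr0x@server:~$ sudo systemctl enable --now ksmtuned
cr0x@server:~$ cat /sys/kernel/mm/ksm/pages_sharing
183422

pages_sharing multiplied by the page size is a rough upper bound on RAM saved; if it stays near zero, KSM isn’t
buying you anything and you can switch it back off.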

Task 12: Check Proxmox node memory info (what Proxmox thinks is happening)

cr0x@server:~$ pvesh get /nodes/pve1/status --output-format json-pretty | grep -A 4 -E '"(loadavg|memory|swap)"'
"loadavg": [
  "2.61",
  "2.45",
  "2.31"
],
--
"memory": {
  "free": 1288490188,
  "total": 66571993088,
  "used": 65283502900
},
--
"swap": {
  "free": 6871947673,
  "total": 8589934592,
  "used": 1717986919
}

Meaning: Proxmox’s API is giving a view that might differ from your immediate expectations (units,
caching, and timing). Don’t treat it as ground truth; cross-check with free and meminfo.

Decision: Use this for automation and dashboards, but when debugging allocation failures, trust
kernel evidence and QEMU logs first.

Task 13: Inspect a VM start failure in task logs

cr0x@server:~$ journalctl -u pvedaemon -b | tail -n 20
Dec 26 10:18:11 pve1 pvedaemon[1321]: start VM 104: UPID:pve1:0000A3F9:00B2B6D1:676D5A13:qmstart:104:root@pam:
Dec 26 10:18:12 pve1 pvedaemon[1321]: VM 104 qmp command failed - unable to execute QMP command 'cont': Cannot allocate memory
Dec 26 10:18:12 pve1 pvedaemon[1321]: start failed: command '/usr/bin/kvm -id 104 ...' failed: exit code 1

Meaning: The failure is at QEMU start/cont stage, not inside the guest.

Decision: Focus on host commit limits, hugepages, and fragmentation—not guest tuning.

Task 14: Validate container (LXC) memory configuration and swap limit

cr0x@server:~$ pct config 210 | egrep 'memory|swap|features'
memory: 4096
swap: 512
features: nesting=1,keyctl=1

Meaning: Container has 4 GiB RAM and 512 MiB swap allowance. If it spikes above, allocations fail inside the container.

Decision: For containers, “cannot allocate memory” is often a cgroup limit. Increase memory/swap
or fix the application’s memory behavior. Host free RAM won’t save an LXC with a hard ceiling.
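
If the decision is “give the container more room,” that’s one command, and unlike a VM’s max memory it normally
applies to the running container (a sketch; CT 210 and the sizes are placeholders):

cr0x@server:~$ pct set 210 --memory 6144 --swap 1024
cr0x@server:~$ pct config 210 | egrep 'memory|swap'
memory: 6144
swap: 1024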

Task 15: Check fragmentation risk signals (quick and dirty)

cr0x@server:~$ cat /proc/buddyinfo | head
Node 0, zone      DMA      1      1      1      1      0      0      0      0      0      0      0
Node 0, zone    DMA32   1024    512    220     12      0      0      0      0      0      0      0
Node 0, zone   Normal   2048   1880    940    110      2      0      0      0      0      0      0

Meaning: Buddy allocator shows how many free blocks exist at different orders. If higher orders are
mostly zero, large contiguous allocations (including some hugepage needs) may fail even with “enough total free.”

Decision: If hugepages/THP compaction is part of your setup, consider reducing reliance on contiguous
allocations or scheduling periodic maintenance reboots for nodes that must satisfy those allocations.
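
One low-drama mitigation when the higher orders are empty: ask the kernel to compact memory (a sketch; it costs CPU
for a while, it is not guaranteed to produce the order you need, and it is no substitute for provisioning hugepages
at boot):

cr0x@server:~$ echo 1 | sudo tee /proc/sys/vm/compact_memory
1
cr0x@server:~$ cat /proc/buddyinfo | grep Normal
Node 0, zone   Normal    980    720    510    260     64     18      4      1      0      0      0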

Three corporate mini-stories from the trenches

Incident: a wrong assumption (“ballooning means it won’t reserve max”)

A mid-sized company ran an internal Proxmox cluster for line-of-business apps and a few heavy batch jobs.
The team had a habit: set VM max memory high “so nobody has to file a ticket,” then set balloon target low
to “keep utilization efficient.”

It worked—until they upgraded a few VMs and started a quarterly reporting run. New processes spawned, memory maps
expanded, and several VMs were restarted for patching. Suddenly: cannot allocate memory on VM start.
The dashboard still showed “free” memory because cache looked reclaimable.

The root cause wasn’t a leak. It was accounting. The host’s Committed_AS crept near CommitLimit.
Every VM with a generous max contributed to the promised memory total, even if it “usually” sat low. When several
restarts happened together, QEMU tried to reserve what it had been told it might need. The kernel refused. The error
was accurate; their mental model wasn’t.

The fix was dull: they reduced VM max memory to what each service could justify, kept ballooning for elasticity,
and added swap on hosts where it was missing. Most importantly, they stopped treating “max” as a wish.
The next quarter’s run still spiked, but it stopped breaking restarts.

Optimization that backfired (hugepages everywhere)

Another org chased latency. A performance-minded engineer enabled hugepages for a whole class of VMs because a blog
post said it improved TLB behavior. And it can. They also left Transparent Huge Pages on “always,” because more huge
pages sounded like more performance. That’s how optimism becomes configuration.

For weeks, everything looked fine. Then a node started failing VM starts after routine migrations. Same VM starts on
other nodes. On this node: cannot allocate memory. Free memory wasn’t terrible, but hugepages free were near
zero. Buddyinfo showed fragmentation: the memory was there, just not in the right chunks.

They tried to “fix” it by increasing hugepages dynamically. That made it worse: the kernel had to compact memory to
satisfy the request, raising CPU spikes and stalling reclaim. Latency went sideways during peak hours. The best part
is that the incident report called it “intermittent.” It was intermittent in the same way gravity is intermittent
when you’re indoors.

The recovery plan was: disable hugepages for general VMs, reserve hugepages only for a small set of latency-critical
instances with predictable sizing, and set THP to madvise. Performance improved overall because the system
stopped fighting itself.

Boring but correct practice that saved the day (host reservation and caps)

A third team ran Proxmox for mixed workloads: web apps, some Windows VMs, and a couple of storage-heavy appliances.
They had a boring rule: every node keeps a fixed “host reserve” of RAM that is never allocated to guests on paper.
They also capped ZFS ARC from day one.

It wasn’t fancy. It meant they could run fewer VMs per node than the spreadsheet warriors wanted. But during an
incident where a noisy guest suddenly started consuming memory (a misconfigured Java service), the host had enough
headroom to keep QEMU processes alive and avoid host OOM.

The guest still suffered (as it should), but the blast radius stayed inside that VM. The cluster didn’t start
killing unrelated workloads. They drained the node, fixed the guest config, and resumed. No midnight reboot,
no cascading failures, no “why did our firewall VM die?”

The practice that saved them wasn’t a secret kernel tunable. It was budgeting and refusing to spend the emergency fund.

Common mistakes: symptom → root cause → fix

VM won’t start: “Cannot allocate memory” right away

  • Symptom: Start fails instantly; QEMU exits with allocation error.
  • Root cause: Host commit limit reached, hugepages missing, or memory fragmentation for the requested allocation.
  • Fix: Lower VM max memory; add host swap; disable hugepages for that VM; provision hugepages at boot if needed.

VM starts, then randomly shuts down or resets

  • Symptom: VM appears to “crash,” logs show no clean shutdown.
  • Root cause: Host OOM killer killed QEMU, often after a memory spike or heavy reclaim.
  • Fix: Find OOM logs; reduce host overcommit; reserve host memory; cap ZFS ARC; ensure swap exists and monitor swap activity.

Guests become slow, then host becomes slow, then everything becomes philosophical

  • Symptom: IO wait climbs; swap-in/out rates rise; VM latency spikes.
  • Root cause: Thrashing: not enough RAM for working sets, and swap/page reclaim dominates.
  • Fix: Stop or migrate VMs; reduce memory limits; add RAM; redesign consolidation. No sysctl will save you here.

Ballooning enabled but memory never “comes back”

  • Symptom: Host remains full; guests don’t release memory as expected.
  • Root cause: Balloon driver not installed/running, guest can’t reclaim, or the “max” still forces host commitment.
  • Fix: Install virtio balloon driver; verify in guest; set realistic max; use ballooning as elasticity, not a substitute for sizing.

Everything was fine until ZFS snapshots and replication increased

  • Symptom: Host memory pressure increases during heavy storage activity; VM startups fail.
  • Root cause: ARC growth, metadata pressure, slab growth, and IO-driven memory use.
  • Fix: Cap ARC; monitor slab; keep headroom; avoid running the node at 95% “used” and calling it efficient.

Containers show “cannot allocate memory” while host has plenty

  • Symptom: LXC apps fail allocations; host looks okay.
  • Root cause: cgroup memory limit reached (container memory/swap cap).
  • Fix: Raise container limits; tune the application; ensure container swap is allowed if you expect bursts.

Checklists / step-by-step plan

Step-by-step: fix a node that throws allocation errors

  1. Confirm host OOM vs start-time failure.
    Check journalctl -k for OOM kills and pvedaemon logs for start failure context.
  2. Measure commit pressure.
    If Committed_AS is near CommitLimit, you’re in “promises exceeded reality” territory.
  3. List VMs with large max memory.
    Reduce max memory for the offenders. Don’t just adjust balloon targets.
  4. Check hugepages and THP settings.
    If hugepages are enabled for VMs, ensure adequate preallocation or turn it off for general workloads.
  5. Check ZFS ARC if applicable.
    If ARC is big and you’re a VM host first, cap it.
  6. Ensure swap exists and is sane.
    Add swap if none; monitor si/so. Swap is for spikes, not for paying rent.
  7. Reserve host memory.
    Keep a fixed buffer for host + ZFS + QEMU overhead. Your future self will thank you in silence.
  8. Re-test VM starts in a controlled sequence.
    Don’t start everything at once after tuning. Start critical services first.

Persistent tuning: ZFS ARC cap (example)

If the node is a VM host and ZFS is a means to an end, set an ARC maximum. One common method:
create a modprobe config file and update initramfs so it applies at boot.

cr0x@server:~$ echo "options zfs zfs_arc_max=17179869184" | sudo tee /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=17179869184
cr0x@server:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.8.12-4-pve

Meaning: ARC capped at 16 GiB (value is bytes). You just told ZFS it cannot eat the whole machine.

Decision: Pick a cap that leaves enough RAM for guests plus host reserve. Validate after reboot by reading arcstats again.

Persistent tuning: add host swap (file-based example)

cr0x@server:~$ sudo fallocate -l 8G /swapfile
cr0x@server:~$ sudo chmod 600 /swapfile
cr0x@server:~$ sudo mkswap /swapfile
Setting up swapspace version 1, size = 8 GiB (8589930496 bytes)
no label, UUID=0a3b1e4c-2f1e-4f65-a3da-b8c6e3f3a8d7
cr0x@server:~$ sudo swapon /swapfile
cr0x@server:~$ swapon --show
NAME      TYPE SIZE USED PRIO
/swapfile file   8G   0B   -2

Meaning: Swap is active. CommitLimit increases, and you have a buffer against sudden allocation bursts.

Decision: If swap usage becomes sustained with high si/so, that’s not “working as designed.”
It’s a sign to reduce consolidation or add RAM.
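
Two details the example above skips. First, swapon only lasts until reboot; add an fstab entry to make it persistent
(a sketch, standard syntax):

cr0x@server:~$ echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
/swapfile none swap sw 0 0

Second, if the host’s root filesystem is ZFS, don’t put a swapfile on a dataset at all—swapfiles on ZFS aren’t
reliably supported. Use a dedicated partition (or, with care, a zvol) instead.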

Policy: reserve RAM for the host (a simple rule that works)

  • Reserve at least 10–20% of host RAM for the host on mixed nodes.
    More if you run ZFS, Ceph, heavy networking, or many small VMs.
  • Keep a “guest max sum” target you can defend. If the sum of VM max values exceeds a set multiple of host RAM,
    do it intentionally and only where workload behavior supports it.

Ballooning checklist (use it correctly)

  • Enable ballooning only if the guest has virtio-balloon support.
  • Set max memory close to reality; balloon target can be lower for idling.
  • Monitor for guest swap and guest OOM events after enabling ballooning.
  • Don’t balloon databases unless you accept IO spikes and unpredictable latency.

FAQ

1) Why does Proxmox say “cannot allocate memory” when free shows GBs free?

Because free shows a snapshot of physical memory, while the kernel’s commit accounting and fragmentation
rules can deny a new allocation. Also, “free” ignores whether memory is available in the form needed (e.g., hugepages).

2) Does ballooning reduce what the host must reserve?

It reduces what the guest uses at runtime, but if your VM max is high, the host may still be on the hook for the promise.
Ballooning is not a get-out-of-sizing-free card.

3) Should I set vm.overcommit_memory=1 to stop allocation failures?

That’s a blunt instrument. It may reduce start-time failures, but it increases the chance of catastrophic OOM later.
In production, prefer fixing VM sizing and adding swap over loosening the kernel’s safety rails.

4) How much swap should a Proxmox host have?

Enough to survive bursts and improve CommitLimit, not enough to mask chronic overcommit. Commonly: a few GB to
low tens of GB depending on host RAM and workload volatility. Measure swap activity; if it’s constantly busy, you’re undersized.

5) Is ZFS ARC the reason my node “runs out of memory”?

Sometimes. ARC can grow large and compete with VMs. If VM startups fail or the host OOMs while ARC is massive,
cap ARC. If ARC is modest, look elsewhere (commit limits, hugepages, runaway guests).

6) Should I enable KSM on Proxmox?

If you run many similar VMs (same OS, similar memory pages), KSM can save RAM. It costs CPU and can add latency.
Enable it deliberately, measure CPU overhead, and don’t treat it as free memory.

7) Why do containers hit “cannot allocate memory” when the host is fine?

LXC is governed by cgroups. A container can be out of memory inside its limit even if the host has plenty.
Adjust pct memory/swap limits or fix the container workload.

8) Are hugepages worth it?

For certain high-throughput, latency-sensitive workloads: yes. For general consolidation: often no.
Hugepages increase predictability for TLB behavior but reduce flexibility and can create start failures if not provisioned carefully.

9) What’s the difference between guest OOM and host OOM?

Guest OOM happens inside the VM: the guest kernel kills processes, but the VM stays up. Host OOM kills processes on
the hypervisor, including QEMU—your VM disappears. Host OOM is the one that ruins your afternoon.

10) Can I “fix” this permanently without adding RAM?

Often yes: set realistic VM max memory, reserve host RAM, cap ARC if needed, and avoid overcommit ratios that assume
miracles. If working sets genuinely exceed physical RAM, the permanent fix is: more RAM or fewer workloads per node.

Next steps (the sane kind)

“Cannot allocate memory” in Proxmox is not a curse. It’s the kernel enforcing a boundary you’ve already crossed in
policy, configuration, or expectations.

  1. Stop treating VM max memory as a suggestion. Make it a contract.
  2. Use ballooning for elasticity, not denial. Target low, cap realistically.
  3. Give the host an emergency fund. Reserve RAM; add swap; keep ZFS ARC in its lane.
  4. Prefer predictable nodes over heroic tuning. Separate workloads when their failure modes differ.
  5. Operationalize it. Add alerts for CommitLimit proximity, swap-in/out rate, OOM logs, and ARC size.

Do those, and the next time Proxmox complains about memory, it’ll be because you truly ran out—not because your
configuration told a charming story the kernel refused to believe.