Proxmox: Stop “Random” VM Freezes by Fixing One Host Kernel Setting


“Random” VM freezes on Proxmox have a special talent: they happen during the one meeting where you promised the platform was “stable now.” The guest stops responding, monitoring goes quiet, and then—minutes later—it magically comes back like nothing happened. No crash. No reboot. Just a gap in time and a growing distrust of your hypervisor.

Most of the time, this isn’t a haunted VM. It’s a host-level writeback stall: the Linux kernel lets dirty memory pile up, then throttles hard when it finally has to flush. Your VMs feel it as a freeze because QEMU threads get blocked behind host I/O and memory writeback pressure. The fix is boring and precise: tune the host’s vm.dirty_* settings so the kernel starts flushing earlier and throttles less violently.

What a “freeze” actually looks like (and what it isn’t)

People call it a “random VM freeze” because they experience it from the guest: SSH stops responding, RDP hangs, app timeouts, monitoring agents stop reporting, and then it recovers without intervention. But in the host’s eyes, it’s often not random. It’s a predictable stall triggered by a flush of accumulated dirty pages, a burst of synchronous writes, or a storage device that suddenly has opinions about latency.

The classic pattern

  • One or more VMs become unresponsive for tens of seconds to a few minutes.
  • The Proxmox host is “up” (ping works, maybe even the UI loads) but feels sluggish.
  • iowait spikes or CPU usage looks weirdly low while things are stuck.
  • Storage graphs show a latency wall: IOPS drop, latency climbs.
  • After the stall, everything “catches up” and logs show a time gap.

What it usually isn’t

Before we blame writeback, let’s eliminate the usual red herrings:

  • It isn’t necessarily a guest kernel panic. Those leave obvious fingerprints.
  • It isn’t always RAM exhaustion. RAM pressure can contribute, but the symptom here is the host throttling writes.
  • It isn’t always “ZFS is slow.” ZFS can amplify the problem if sync writes or slog behavior are involved, but the stall mechanism can happen on ext4, XFS, Ceph, you name it.
  • It isn’t (only) CPU steal. Steal time hurts, but a multi-minute freeze with minimal CPU usage usually screams I/O throttling and writeback congestion.

Here’s the cynical truth: your VM is a process, and Linux will gladly stall that process to protect storage integrity and keep the system from drowning in dirty data. The kernel isn’t being mean. It’s being consistent.

Joke #1: “Random freezes” are like intermittent DNS issues: they’re always real, and they always happen when you’re trying to prove they aren’t.

The one kernel setting family that usually fixes it: vm.dirty_*

If you want one lever that stops a lot of Proxmox VM “freezes,” it’s this: reduce how much dirty memory the host is allowed to accumulate before it starts writeback and before it throttles.

Linux uses RAM as a write buffer. When a process writes to a file, it often writes into the page cache first (dirty pages). Later, the kernel flushes those dirty pages to disk (“writeback”). This is usually a performance win. But if the system lets dirty pages pile too high, the eventual flush can become a violent event: writeback storms, device queue saturation, and throttling that blocks processes—including your QEMU/KVM guests.
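You can watch this mechanism with nothing but `/proc/meminfo`. A quick demo sketch (assumes a Linux host with ~200 MB free under /tmp and a disk-backed /tmp; on tmpfs the Dirty counter won't move the same way, so pick another path):

```shell
# Watch the Dirty counter react to a buffered write, then to sync.
# The path and 200 MB size are arbitrary demo choices.
grep '^Dirty:' /proc/meminfo                      # baseline
dd if=/dev/zero of=/tmp/dirty-demo bs=1M count=200 status=none
grep '^Dirty:' /proc/meminfo                      # usually much higher now
sync                                              # force writeback to disk
grep '^Dirty:' /proc/meminfo                      # usually back near baseline
rm -f /tmp/dirty-demo                             # clean up
```

Scale that 200 MB up to tens of gigabytes of accumulated dirty pages and the `sync` step becomes the stall your VMs feel.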

The knobs that matter

  • vm.dirty_background_ratio / vm.dirty_background_bytes: when background writeback kicks in.
  • vm.dirty_ratio / vm.dirty_bytes: when foreground processes get throttled to force writeback.
  • vm.dirty_expire_centisecs: how long dirty data can sit before it’s considered old and must be flushed.
  • vm.dirty_writeback_centisecs: how often flusher threads wake up to write dirty data.

The common Proxmox trap is that ratio-based defaults scale with RAM. Big RAM host? Big dirty buffer. Big dirty buffer + slow or bursty storage? Eventually, a writeback stall big enough to freeze VMs.
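To see what the ratio defaults translate to on a given host, a rough calculation works (MemTotal slightly overstates the kernel's "dirtyable" memory, so treat the numbers as an upper bound):

```shell
# Rough upper bound: how much dirty data do the current ratio
# settings allow on this host? Reads only /proc, so it works on
# any Linux machine.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
ratio=$(cat /proc/sys/vm/dirty_ratio)
bg_ratio=$(cat /proc/sys/vm/dirty_background_ratio)
# a ratio of 0 means the bytes-based knob is active instead
echo "dirty_ratio=${ratio}% -> ~$(( total_kb * ratio / 100 / 1024 )) MiB before throttling"
echo "dirty_background_ratio=${bg_ratio}% -> ~$(( total_kb * bg_ratio / 100 / 1024 )) MiB before background flush"
```

On a 256 GB host with the defaults, that first number lands around 51,000 MiB. That's the size of the flush cliff you're building.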

Why this feels like a VM problem

Because the VM’s virtual disk lives on host storage, and QEMU’s I/O threads are just host threads. When the host kernel decides “we are writing now, and everyone else will wait,” your guests wait too. Your app sees a pause. Your SRE sees a pager. Your CFO sees “why do we pay for this?”

One quote worth keeping on your wall, because it’s painfully operational:

“Hope is not a strategy.” — traditional SRE saying

When you’re running hypervisors, “hope the kernel flushes gently” is exactly that: hope.

Why Proxmox makes this failure mode visible

Proxmox is not uniquely broken here. It’s just honest about what you built: a shared storage and memory system that multiplexes many workloads into one kernel. That kernel has to pick winners when contention hits.

Multiplying the burst

On a single bare-metal app server, a writeback storm is “the app is slow.” On a hypervisor, it’s “half the company is slow.” You aggregate I/O patterns: databases doing fsync, log shippers, Windows updates, backup jobs, container layers, and the occasional developer running benchmarks because they “needed numbers.”

The storage stack can amplify stalls

Some common setups are great, but they are also very good at making a pause catastrophic:

  • ZFS with sync-heavy workloads: If guests do lots of synchronous writes (databases, journaling, Windows), latency spikes can appear like freezes.
  • Ceph under recovery/backfill: Latency variability is the norm, and writeback can hit a wall at the worst possible time.
  • Consumer SSDs with SLC cache behavior: They sprint, then they walk, then they crawl. Your dirty page cache believes the sprint.
  • RAID controllers with cache policies: A battery-less write cache is basically a latency prank waiting to happen.

When you combine “big RAM buffer” + “bursty reality,” the host eventually enforces discipline. The enforcement is what you experience as a freeze.

Fast diagnosis playbook (first/second/third)

If you only have 10 minutes between “something froze” and “it recovered,” do this in order. You’re not collecting trivia. You’re deciding whether it’s writeback/I/O, memory pressure, or something else.

First: confirm it’s I/O + writeback, not a guest crash

  • Check host logs for blocked tasks and writeback messages.
  • Check if dirty memory was high and then dropped.
  • Check storage latency in the host during the window.

Second: identify what saturated (disk, controller, networked storage)

  • Run iostat and look at %util and latency.
  • Check per-device queues; one bad actor device can freeze everything if it holds the journal/pool.
  • Confirm whether you’re on ZFS, Ceph, LVM-thin, or plain files. The “right” next step differs.

Third: check memory pressure and host swap behavior

  • Look at MemAvailable, swap in/out, and whether the host is reclaiming aggressively.
  • Confirm ballooning or KSM isn’t creating extra turbulence.

If the evidence points to writeback stalls, stop debating and tune vm.dirty_*. You can still go hunting for the workload that triggered it, but you need the platform not to seize up first.

Practical tasks: commands, outputs, and decisions (12+)

These are real, runnable checks on a Proxmox host (Debian-based). Each task includes: command, what the output means, and what decision you make next. Run them as root or with sudo.

Task 1: Identify Proxmox and kernel version (context matters)

cr0x@server:~$ pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.13-5-pve)
pve-manager: 8.1.3
pve-kernel-6.5: 6.5.13-5

What it means: Kernel behavior and defaults vary. Knowing whether you’re on a PVE kernel and which series helps explain writeback and blk-mq behavior.

Decision: If you’re on an older kernel series and seeing blocked task bugs already fixed upstream, plan a kernel update after stabilizing settings.

Task 2: Check current dirty settings (the suspects)

cr0x@server:~$ sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500

What it means: With 256 GB RAM, dirty_ratio=20 can allow ~51 GB of dirty cache before forced throttling. That’s not a buffer; that’s a lifestyle.

Decision: On big-memory hosts with shared storage, prefer *_bytes over ratios, or use much smaller ratios.

Task 3: See how much dirty memory is currently present

cr0x@server:~$ egrep -i 'dirty|writeback' /proc/meminfo
Dirty:              124812 kB
Writeback:              0 kB
WritebackTmp:           0 kB

What it means: “Dirty” is what’s waiting to be written. During a freeze event, you may see Dirty balloon and then collapse after writeback catches up.

Decision: If Dirty spikes into gigabytes during incidents, you’re a good candidate for stricter dirty thresholds.

Task 4: Look for blocked tasks and writeback stalls in the kernel log

cr0x@server:~$ journalctl -k -b --no-pager | egrep -i 'blocked|hung|writeback|congestion|task'
[ 2893.441122] INFO: task kvm:23144 blocked for more than 120 seconds.
[ 2893.441126]       Tainted: P           O      6.5.13-5-pve #1
[ 2893.441130] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2893.441133] task:kvm             state:D stack:0     pid:23144 ppid:1 flags:0x00004002
[ 2893.441140] Call Trace:
[ 2893.441150]  __io_schedule+0x2d/0x60
[ 2893.441161]  io_schedule+0x12/0x40
[ 2893.441170]  bit_wait_io+0x11/0x60
[ 2893.441185]  __wait_on_bit+0x4c/0x110

What it means: state:D is uninterruptible sleep, commonly I/O wait. If QEMU/KVM threads show up here, the VM is “frozen” because the host thread is stuck waiting on I/O.

Decision: If you see repeated blocked tasks tied to writeback or filesystem waits, prioritize writeback tuning and storage latency investigation.

Task 5: Check I/O latency and utilization quickly

cr0x@server:~$ iostat -x 1 5
Linux 6.5.13-5-pve (server) 	02/04/2026 	_x86_64_	(32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.21    0.00    2.11   38.44    0.00   56.24

Device            r/s     w/s   rkB/s   wkB/s  aqu-sz  await  svctm  %util
nvme0n1         12.0   980.0   512.0 65536.0   45.12   46.8   0.9   92.5

What it means: High %iowait, high await, and high %util suggest the device is saturated and requests are queueing.

Decision: If await is tens/hundreds of ms during freezes, treat storage latency as a first-class incident. Dirty tuning helps, but you also need to reduce burstiness or upgrade the storage path.

Task 6: Confirm the storage backend for the VM disks

cr0x@server:~$ pvesm status
Name             Type     Status           Total            Used       Available        %
local             dir     active        19633960         2341232        16210228   11.92%
local-zfs         zfspool active       191260672        84528128       106732544   44.19%
ceph-vm           rbd     active              0               0               0    0.00%

What it means: Knowing whether a VM lives on ZFS, Ceph RBD, or a directory-backed image changes where you look next.

Decision: If VMs are on ZFS and freezes correlate with sync-heavy workloads or scrubs, you’ll also evaluate ZFS settings and slog. If on Ceph, check cluster health and backfill.

Task 7: If ZFS: check pool health and latency suspects

cr0x@server:~$ zpool status -v
  pool: rpool
 state: ONLINE
config:

	NAME        STATE     READ WRITE CKSUM
	rpool       ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    nvme0n1 ONLINE       0     0     0
	    nvme1n1 ONLINE       0     0     0

errors: No known data errors

What it means: A healthy pool doesn’t guarantee good latency, but an unhealthy pool guarantees bad days.

Decision: If you see degraded vdevs or resilvering, expect stalls and postpone performance tuning debates until redundancy is restored.

Task 8: If ZFS: check whether sync writes are being forced

cr0x@server:~$ zfs get -o name,property,value -s local,default sync rpool
NAME   PROPERTY  VALUE
rpool  sync      standard

What it means: sync=standard means ZFS honors O_SYNC/fsync semantics. If someone set sync=disabled, they “fixed” latency by deleting safety.

Decision: If sync=disabled is set on VM storage, revert it unless you’re comfortable explaining data loss to auditors.

Task 9: Check memory pressure and swap activity

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           251Gi       182Gi       6.2Gi       2.1Gi        63Gi        54Gi
Swap:           16Gi       3.1Gi        13Gi

What it means: If available is low and swap usage is growing fast during incidents, you might have a memory pressure problem that worsens writeback behavior.

Decision: If swapping is active on a Proxmox host, investigate ballooning and overcommit. But don’t confuse swap thrash with writeback stalls; they can co-exist.

Task 10: Catch real-time dirty/writeback changes during a stall

cr0x@server:~$ watch -n 1 'grep -E "Dirty|Writeback" /proc/meminfo; echo; vmstat 1 2 | tail -1'
Every 1.0s: grep -E "Dirty|Writeback" /proc/meminfo; echo; vmstat 1 2 | tail -1

Dirty:           8421120 kB
Writeback:        412512 kB
WritebackTmp:          0 kB

 2  1      0 123456 789012 345678    0    0 204800 512000 4200 9800  3  4 55 38  0

What it means: Dirty in the multi-GB range plus sustained block I/O (bi/bo) and high wa indicates the system is flushing heavily.

Decision: If Dirty climbs until a “flush cliff” occurs, your dirty thresholds are too permissive for your storage reality.

Task 11: Inspect per-VM QEMU processes for D-state hints

cr0x@server:~$ ps -eo pid,stat,comm,args | egrep 'qemu-system|kvm' | head
23144 D    kvm             /usr/bin/kvm -id 103 -name vm103 ...
23188 S    kvm             /usr/bin/kvm -id 104 -name vm104 ...

What it means: D state for QEMU is consistent with uninterruptible I/O wait. One stuck VM can indicate a specific disk; many stuck VMs indicates shared storage saturation.

Decision: If many QEMU processes go D-state at once, stop blaming the guests. Focus on the host kernel + storage path.

Task 12: Check disk scheduler and queue settings (context, not a silver bullet)

cr0x@server:~$ for d in nvme0n1 nvme1n1; do echo "== $d =="; cat /sys/block/$d/queue/scheduler; cat /sys/block/$d/queue/nr_requests; done
== nvme0n1 ==
[mq-deadline] kyber none
1024
== nvme1n1 ==
[mq-deadline] kyber none
1024

What it means: Modern kernels use blk-mq. Schedulers matter less than they used to, but queue depth can still influence latency under burst.

Decision: Don’t “tune” schedulers randomly. If you change these, do it with measurement and a rollback plan. Your first lever remains dirty writeback thresholds.

Task 13: Check whether you’re using ratios or bytes (avoid scaling surprises)

cr0x@server:~$ sysctl vm.dirty_bytes vm.dirty_background_bytes
vm.dirty_bytes = 0
vm.dirty_background_bytes = 0

What it means: 0 means the ratio settings are active. On large RAM, ratios can be too generous.

Decision: Consider switching to bytes-based thresholds to keep behavior consistent across hosts.

Task 14: Validate that writeback threads are doing work (and not stuck)

cr0x@server:~$ ps -eo pid,stat,comm | egrep 'flush|writeback|kworker' | head -n 10
  612 I    kworker/u64:3-flush-259:0
  948 I    kworker/u64:7+flush-259:1
 1203 I    kworker/u64:1-events_unbound

What it means: On modern kernels, writeback is done by kworker threads; the flush-MAJ:MIN suffix names the backing device a worker is (or was last) flushing. If the workers exist but the system still stalls, the bottleneck is usually the device latency or queueing, not “missing threads.”

Decision: If the host stalls and you see blocked tasks waiting on I/O, focus on lowering dirty thresholds and reducing burstiness.

Task 15: Confirm what changed recently (because it always did)

cr0x@server:~$ journalctl --since "24 hours ago" --no-pager | egrep -i 'apt|kernel|zfs|ceph|firmware' | head -n 30
Feb 04 01:12:03 server apt[11902]: upgrade zfsutils-linux:amd64 2.2.2-pve1 to 2.2.3-pve1
Feb 04 01:12:10 server apt[11902]: upgrade pve-kernel-6.5.13-4-pve to 6.5.13-5-pve

What it means: Kernel and storage stack updates can change writeback behavior, defaults, and performance regressions.

Decision: If freezes began after an update, don’t roll back blindly. Capture evidence, tune writeback, and then evaluate regressions with controlled tests.

Apply the fix and make it persistent

Do it live first (so you can observe impact), then persist it. Proxmox hosts are production systems; treat them like it: change control, a maintenance window if required, and a rollback plan.

Apply temporarily (immediate effect)

cr0x@server:~$ sysctl -w vm.dirty_background_bytes=268435456
vm.dirty_background_bytes = 268435456
cr0x@server:~$ sysctl -w vm.dirty_bytes=1073741824
vm.dirty_bytes = 1073741824
cr0x@server:~$ sysctl -w vm.dirty_writeback_centisecs=100
vm.dirty_writeback_centisecs = 100
cr0x@server:~$ sysctl -w vm.dirty_expire_centisecs=3000
vm.dirty_expire_centisecs = 3000

What it means: The host will start background flushing at ~256 MB dirty and throttle harder around 1 GB dirty, rather than tens of GB.

Decision: Watch for improved latency stability under bursty writes. If throughput drops slightly but freezes stop, that’s a win on hypervisors.

Persist across reboots

Create a dedicated sysctl drop-in. Don’t jam it into random files you’ll forget in six months.

cr0x@server:~$ cat > /etc/sysctl.d/99-proxmox-writeback.conf <<'EOF'
# Proxmox host writeback tuning to reduce VM stalls under bursty IO
vm.dirty_background_bytes = 268435456
vm.dirty_bytes = 1073741824
vm.dirty_writeback_centisecs = 100
vm.dirty_expire_centisecs = 3000
EOF
cr0x@server:~$ sysctl --system
* Applying /etc/sysctl.d/99-proxmox-writeback.conf ...
vm.dirty_background_bytes = 268435456
vm.dirty_bytes = 1073741824
vm.dirty_writeback_centisecs = 100
vm.dirty_expire_centisecs = 3000

What it means: Settings are now part of boot configuration.

Decision: Record this change in your infra repo or ticketing system. Future you deserves receipts.

Confirm ratios are no longer active

cr0x@server:~$ sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_bytes vm.dirty_background_bytes
vm.dirty_ratio = 0
vm.dirty_background_ratio = 0
vm.dirty_bytes = 1073741824
vm.dirty_background_bytes = 268435456

What it means: Writing a *_bytes value makes the kernel zero its *_ratio counterpart; only one of each pair can be active at a time. Ratios reading 0 confirm that the bytes-based thresholds are in effect.

Decision: If a ratio still reads non-zero, the matching bytes value never took effect; re-check the sysctl drop-in and rerun sysctl --system.
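If you script health checks across nodes, a small sketch of an "effective threshold" report (pure /proc reads; the function name is illustrative):

```shell
# Report which form of each dirty threshold is in effect.
# The kernel keeps exactly one of each bytes/ratio pair non-zero.
report() {  # $1 = knob base name: "dirty" or "dirty_background"
  bytes=$(cat "/proc/sys/vm/${1}_bytes")
  if [ "$bytes" -gt 0 ]; then
    echo "${1}: ${bytes} bytes (bytes-based)"
  else
    echo "${1}: $(cat "/proc/sys/vm/${1}_ratio")% of dirtyable RAM (ratio-based)"
  fi
}
report dirty
report dirty_background
```

Run it on every node; any host still reporting ratio-based thresholds is a drift candidate.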

Common mistakes: symptoms → root cause → fix

1) Symptom: VM freezes for 30–180 seconds, then recovers; host shows high iowait

Root cause: Dirty page accumulation leads to writeback storm; QEMU threads blocked in D-state.

Fix: Reduce dirty thresholds (prefer vm.dirty_background_bytes and vm.dirty_bytes), increase writeback frequency (vm.dirty_writeback_centisecs=100), verify storage latency.

2) Symptom: Freezes correlate with backups or vzdump windows

Root cause: Backup jobs create write bursts (temporary files, compression, snapshot activity) that overflow page cache and saturate storage.

Fix: Writeback tuning + stagger backups + cap backup I/O via storage-level QoS if available. Also verify backup target latency.

3) Symptom: “Fix” attempted by setting ZFS sync=disabled, freezes reduce, later data integrity scares appear

Root cause: Latency was masked by removing synchronous durability; application assumptions break after power loss or crash.

Fix: Restore sync=standard. If sync latency is real, add proper SLOG on power-loss-protected media or move sync-heavy workloads to appropriate storage.

4) Symptom: Only one VM freezes, others are fine; host not obviously in trouble

Root cause: That VM is hammering a specific volume, sparse file, or storage path; could be a single RBD image, a single ZVOL, or a single disk in a pool causing retries.

Fix: Check per-device latency (iostat -x), per-VM disk stats, and storage health. Dirty tuning still helps overall, but isolate the hot spot.

5) Symptom: Host stays responsive, but guest clocks jump and apps time out

Root cause: Guest vCPU threads are stalled; timekeeping catches up when the VM runs again.

Fix: Same core fix: reduce host writeback stalls. Also verify guest agent/clocksource configuration if time drift becomes chronic.

6) Symptom: Freezes appear after “performance optimization” that increased cache or RAM usage

Root cause: More RAM means ratios allow more dirty data; flush events become larger and less predictable.

Fix: Switch to bytes-based dirty thresholds. Stability beats theoretical throughput on a hypervisor.

7) Symptom: Ceph-backed VMs freeze during recovery/backfill

Root cause: Backend latency spikes cause host writeback to backlog and then throttle.

Fix: Writeback tuning on hypervisors plus Ceph recovery tuning and capacity headroom. Don’t pretend the hypervisor can paper over a saturated cluster.

Three corporate mini-stories from the trenches

Mini-story #1: The incident caused by a wrong assumption

The team had just refreshed their Proxmox nodes: more cores, more RAM, same storage shelf. They assumed the upgrade would only improve things. And for a week, it did. Then the “random freezes” started. Not daily. Not hourly. Just often enough to get blamed on “that one legacy app.”

The wrong assumption was subtle: they believed that if the storage shelf could handle peak throughput in synthetic benchmarks, it could handle production bursts. Benchmarks were neat. Production had backups, log rotations, and a database that liked to checkpoint at the worst times.

During one freeze window, the host wasn’t down. It was just waiting. Kernel logs showed blocked QEMU threads. /proc/meminfo showed Dirty in the tens of gigabytes right before the stall ended. The bigger RAM upgrade had quietly increased the size of the dirty buffer because the system was using ratio-based defaults.

They switched to bytes-based dirty thresholds and set aggressive background flushing. The freezes stopped. Throughput in benchmarks dropped slightly. Nobody cared, because the platform stopped time-traveling during business hours.

Mini-story #2: The optimization that backfired

A different shop got clever with performance tuning. They noticed that storage latency spikes correlated with guest sync writes, so they applied the classic “fix”: set ZFS sync=disabled on the dataset backing VM disks. Immediately, the graphs looked amazing. The “freezes” seemed gone. The tuning ticket was closed with a proud comment about “unlocking ZFS performance.”

Two months later, a power event hit part of the rack. UPS worked, mostly. One node didn’t shut down cleanly. Several VMs came back with unhappy databases. Not all of them; just enough to cause an incident review with unpleasant questions.

The backfire wasn’t that ZFS was unreliable. ZFS did what it was told: it acknowledged sync writes before they were durable. The team had traded latency stability for durability, and then acted surprised when physics filed a complaint.

The recovery was a combination of reverting sync to standard, adding proper power-loss-protected log devices for workloads that genuinely needed sync performance, and—yes—tuning vm.dirty_* so the host stopped building massive writeback cliffs.

Mini-story #3: The boring but correct practice that saved the day

A third org ran a Proxmox cluster for internal services. Nothing glamorous: ticketing, CI runners, a few Windows VMs for licensing tools. They did one boring thing consistently: they had a baseline host sysctl profile, applied via config management, including writeback tuning. Every node looked the same on day one.

One quarter, they added a new workload: a data pipeline that wrote large sequential files and then did bursts of metadata updates. The workload was noisy, and it did cause higher storage utilization. But it didn’t cause “random freezes.” The dirty thresholds prevented giant flush events; writeback happened continuously in the background.

When a new node was built hastily during an expansion, someone forgot to apply the baseline. That one node started freezing VMs under the new pipeline workload. The team didn’t spend days debating storage vendors. They diffed sysctls between nodes, saw the missing writeback profile, applied it, and moved on with their lives.

It wasn’t heroic. It was correct. The best incident is the one you don’t get invited to.

Checklists / step-by-step plan

Step-by-step: stabilize a freezing Proxmox node in production

  1. Capture evidence immediately: kernel log snippet with blocked tasks; Dirty/Writeback values; iostat -x output during the stall.
  2. Confirm storage backend: ZFS vs Ceph vs LVM-thin vs dir images. Don’t troubleshoot blind.
  3. Apply temporary writeback tuning: set bytes-based dirty thresholds and shorter writeback interval.
  4. Observe for one workload cycle: backups, batch jobs, or the time window that used to trigger freezes.
  5. Persist settings: sysctl drop-in, config management, and change record.
  6. Reduce burstiness: stagger backups, cap concurrency, avoid all VMs flushing at the same time.
  7. Validate storage latency headroom: if await remains high, you still have a storage problem. Tuning prevents cliffs; it doesn’t make slow disks fast.
  8. Post-incident review: identify the workload that pushed you over the edge and decide whether to isolate it, throttle it, or move it.

Checklist: what to standardize across a Proxmox cluster

  • Bytes-based dirty thresholds (avoid RAM-size surprises).
  • Consistent kernel series across nodes (don’t mix behaviors casually).
  • Backup scheduling policy (stagger, cap concurrency).
  • Storage monitoring focused on latency, not just throughput.
  • A documented stance on ZFS sync behavior (no ad-hoc disabling).
  • Host swap policy and ballooning policy.

Rollback plan (because you’re an adult)

  1. Keep a copy of the prior sysctl values (or export sysctl -a snapshot for the vm.* subset).
  2. If performance is impacted unacceptably, increase vm.dirty_bytes gradually (e.g., 1 GB → 2 GB) rather than jumping back to ratios.
  3. If the host still freezes, the storage path is likely saturated or misbehaving; pursue latency root cause while keeping thresholds conservative.
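Step 1 of the rollback plan can be this small. A sketch (the snapshot location is an example; store it wherever your change records live):

```shell
# Snapshot the current vm.dirty_* values before changing anything,
# so rollback means restoring known numbers, not guessing.
snap=$(mktemp /tmp/vm-dirty-snapshot.XXXXXX)   # example location
for f in /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio \
         /proc/sys/vm/dirty_bytes /proc/sys/vm/dirty_background_bytes \
         /proc/sys/vm/dirty_expire_centisecs /proc/sys/vm/dirty_writeback_centisecs; do
  printf '%s = %s\n' "vm.$(basename "$f")" "$(cat "$f")"
done > "$snap"
echo "saved to $snap"
```

The output is already in sysctl.conf syntax, so rolling back is copying lines into a drop-in and running `sysctl --system`.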

Interesting facts & history (because kernels have long memories)

  1. Linux writeback has been a recurring battleground since early 2.4/2.6 days; balancing throughput vs latency keeps changing with hardware.
  2. The “dirty” page cache exists because disk is slow, but it becomes dangerous when disk is unpredictably slow—like SSDs under garbage collection.
  3. Ratio-based defaults scale with RAM, which made sense when servers had far less memory; today it can create multi-GB flush events.
  4. “Blocked task” warnings are Linux’s way of telling you: threads are stuck in uninterruptible sleep, usually waiting on I/O completion.
  5. Modern kernels use blk-mq (multi-queue block layer), changing how queues and schedulers behave compared to older single-queue designs.
  6. NVMe made throughput cheap, but it didn’t make tail latency disappear; tail latency is what freezes VMs.
  7. ZFS emphasizes data integrity; tuning it for speed by disabling sync semantics is not a harmless trick—it changes correctness.
  8. Virtualization stacks amplify contention because many workloads share one kernel’s I/O and memory policies; host tuning matters more than on single-purpose servers.
  9. Writeback tuning is a classic SRE move: you’re not optimizing an application; you’re shaping system behavior under load to avoid cascading failures.

FAQ

1) Is vm.dirty_* really “one setting”?

It’s one subsystem with a few knobs. In practice, using bytes-based thresholds plus a shorter writeback interval is the single most impactful host-side change for these freeze patterns.

2) Will this fix every Proxmox freeze?

No. It fixes a common class: host writeback stalls and I/O congestion that blocks QEMU threads. If you have CPU contention, PCIe errors, firmware issues, or a dying disk, this won’t magically cure it.

3) Should I use dirty_ratio or dirty_bytes?

On hypervisors, use dirty_bytes and dirty_background_bytes. Ratios scale with RAM and lead to inconsistent behavior across nodes with different memory sizes.

4) Won’t smaller dirty thresholds reduce performance?

Sometimes, yes—peak throughput can dip because you flush more continuously. On a hypervisor, that’s usually a good trade: stable latency beats occasional cliffs that freeze multiple VMs.

5) What values should I start with?

Start with dirty_background_bytes=256MB, dirty_bytes=1GB, dirty_writeback_centisecs=100, dirty_expire_centisecs=3000. Then adjust based on observed latency and workload burstiness.

6) Does this interact with ZFS ARC?

Yes, indirectly. ARC and page cache both use RAM. ZFS ARC is not the Linux page cache, but memory pressure dynamics can still affect writeback behavior. The tuning here targets Linux dirty page writeback, not ARC itself.

7) What about Ceph-backed VMs?

Still relevant. The host is still caching and writing. Ceph latency variability can trigger bigger backlogs and harsher throttling. Tune writeback and also ensure the Ceph cluster has headroom and isn’t in constant recovery.

8) Should I disable swap on the Proxmox host?

Don’t disable it blindly. A small amount of swap can be a safety net, but active swapping under normal conditions usually indicates overcommit or ballooning issues. Fix the cause, don’t just hide the symptom.

9) Can I set vm.dirty_writeback_centisecs too low?

Yes. Extremely low values can increase background flush activity and overhead. 100 (1 second) is a pragmatic starting point for hypervisors; measure and adjust if needed.

10) How do I prove the tuning worked?

Measure before/after: frequency and duration of stalls, iostat await tail latency during bursts, Dirty/Writeback behavior under backup windows, and whether QEMU threads still hit D-state during incidents.

Next steps you should do this week

  1. Implement bytes-based dirty thresholds on one Proxmox node, observe for a full backup cycle, then roll out cluster-wide.
  2. Add a quick incident capture script (even a shell snippet) that records: iostat -x, /proc/meminfo dirty counters, and relevant kernel log lines when freezes are reported.
  3. Audit your storage durability stance: if anyone “fixed performance” with sync=disabled, revert and design a real solution (SLOG, workload placement, or storage upgrade).
  4. Schedule bursty jobs (backups, scrubs, replication) so they don’t all line up like a synchronized swimming routine.
  5. Set expectations internally: hypervisors are latency systems. Optimize for tail latency and predictability, not just headline throughput.
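The capture snippet from step 2 can be as small as this sketch (output location is an example; tool availability varies, so each command degrades gracefully if missing):

```shell
#!/bin/sh
# Minimal freeze-evidence capture: dirty counters, vmstat, iostat,
# and recent kernel messages into one file.
out=$(mktemp /tmp/freeze-capture.XXXXXX)       # pick your own location
{
  date
  echo "== dirty/writeback =="
  grep -E 'Dirty|Writeback' /proc/meminfo
  echo "== vmstat =="
  vmstat 1 3 2>/dev/null || echo "vmstat unavailable"
  echo "== iostat =="
  iostat -x 1 3 2>/dev/null || echo "iostat unavailable (install sysstat)"
  echo "== kernel log, last 15 min =="
  journalctl -k --since "15 min ago" --no-pager 2>/dev/null | tail -n 50
} > "$out" 2>&1
echo "evidence captured to $out"
```

Wire it to whatever your on-call uses to report freezes; evidence captured during the stall is worth ten post-hoc theories.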

If you do only one thing: set vm.dirty_background_bytes and vm.dirty_bytes to sane numbers. That one “kernel setting” family is often the difference between “random freezes” and a platform you can trust.
