You upgraded Proxmox. The cluster came back up. The graphs look “fine.” Yet everything feels sticky: VM consoles lag, databases wait on I/O, backups crawl, and your users have rediscovered the ancient art of opening tickets.
This is the common shape of an upgrade regression: not a full outage, not a clean breakage—just death by a thousand micro-stalls. The good news: most slowdowns after a Proxmox upgrade come from a small set of changes, and a handful of checks usually exposes the guilty subsystem fast.
Fast diagnosis playbook (first/second/third)
If you have one hour to stop the bleeding, do this. Don’t open twenty rabbit holes. Pick the top-level bottleneck first, then drill down.
First: identify which resource is saturating right now
- Host CPU: check load vs. CPU usage vs. steal time. High load with low CPU usage screams “I/O wait” or lock contention.
- Host I/O: check disk latency and queue depth. A little latency is fine; sustained tens of milliseconds on random I/O is a performance crime.
- Network: check drops, errors, and mismatch in MTU/offloads. Quiet packet loss is “works in dev” for production.
Second: decide whether the problem is host-wide or VM-specific
- If all VMs slowed similarly after the upgrade, suspect host kernel, storage backend, NIC driver/offload, or cluster services.
- If only a subset slowed, suspect specific VM hardware settings (virtio/scsi controller, cache mode), guest drivers, or per-VM limits.
Third: compare “before vs after” with the simplest baseline
- Run a host-side I/O micro-test (carefully) to see if the storage subsystem regressed.
- Run a VM-side CPU and I/O sanity check on a single VM to confirm the hypervisor boundary isn’t the culprit.
- Check whether the upgrade changed the kernel, ZFS/Ceph versions, I/O scheduler, or QEMU defaults (a quick scheduler check follows below).
Rule of thumb: if you can’t point to one of CPU, I/O, or network as “dominant” in the first 15 minutes, you’re not looking at the right counters.
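For example, the active I/O scheduler is easy to confirm and easy to forget about; the device name below is illustrative, so substitute your own:
cr0x@server:~$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline
The bracketed entry is the scheduler in use. If a kernel update quietly moved a fast NVMe device off none (or a spinning disk onto it), note it as a candidate and compare against a node that still feels fine.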
What upgrades change (and why it matters)
Proxmox upgrades are deceptively “one click.” Under that click: a new kernel, new QEMU, updated storage stacks (ZFS, Ceph clients/daemons if you run them), new NIC drivers, new defaults, and occasionally new behaviors around cgroups, scheduling, and power management.
Performance regressions after upgrades typically fall into these buckets:
- Kernel/driver changes: different default I/O scheduler, NIC offload behavior, IRQ handling, CPU frequency governor, or a driver regression.
- Storage stack changes: ZFS version changes affecting ARC behavior or sync write handling; Ceph version changes shifting recovery/backfill behavior.
- VM hardware config drift: controller types, cache modes, iothreads, ballooning, or CPU model changes.
- Cluster services: corosync latency sensitivity, time sync issues, or “helpful” watchdog behavior.
- New background work: scrubs/resilvers, Ceph rebalancing, SMART scans, pvestatd polling, log storms.
Upgrades also expose existing sins. Your platform may have been surviving on lucky caching, quiet underutilization, or a driver quirk that happened to be beneficial. Upgrades remove that luck.
One quote worth keeping on your wall:
“Hope is not a strategy.” — James Cameron
That applies to upgrades too: treat them like change management, not a vibe.
Interesting facts and historical context
- KVM landed in Linux in 2007, and it won largely because it was “just” the Linux kernel doing virtualization, not a separate hypervisor world.
- Virtio was designed to avoid emulating slow legacy hardware. When virtio drivers are missing or mismatched, you can see 10× differences on storage and NIC throughput.
- ZFS came to Linux via a port (OpenZFS), and it’s famous for data integrity—but also for being sensitive to RAM, sync write patterns, and mis-tuned datasets.
- Ceph’s performance can look “fine” while it’s unhealthy because it will keep serving IO while quietly doing recovery and backfill in the background.
- Linux switched many systems to cgroup v2 by default over time; scheduling and accounting changes can affect container-heavy Proxmox hosts in subtle ways.
- Modern NVMe drives are fast but not magically consistent: firmware, power states, and thermal throttling can create “sawtooth” latency patterns after reboots.
- IO schedulers changed philosophy: for fast devices, “none” often wins; for slower disks, schedulers that merge requests can reduce latency spikes.
- CPU frequency scaling is a performance variable, not a green checkbox. A different kernel default can move you from predictable to sluggish under bursty load.
- Corosync relies on timely network delivery. A tiny increase in latency/jitter can trigger membership flaps, which look like “random slowness” elsewhere.
Start from first principles: define “slower” properly
“Slower” is a complaint. You need a symptom you can measure.
Pick one or two representative workloads and define the regression in numbers:
- Database: 95th percentile commit latency, buffer cache hit rate, fsync time.
- Web/API: request latency percentiles, error rate, CPU saturation, run queue.
- File server: SMB/NFS throughput, metadata ops latency, small-file creates.
- Backups: MB/s and elapsed time, plus storage write latency during backup windows.
Then determine where the waiting happens. Most of the time it’s one of these:
- I/O wait on the host: storage latency spikes, queues back up, VMs stall.
- CPU scheduling delay: too many runnable tasks, pinning mistakes, IRQ storms.
- Network loss/jitter: retransmits and timeouts, corosync grumpiness, storage replication lag.
- Guest-level driver mismatch: virtio drivers outdated, wrong controller model, weird caching modes.
Joke #1: Performance regressions are like “just one more meeting”—nobody schedules them, but everyone attends.
Practical tasks: commands, outputs, and decisions
These are the first checks I run on a Proxmox host after an upgrade when someone says “it’s slower.” Each task includes: command, what good/bad looks like, and what decision to make.
Task 1: Confirm what actually changed (kernel, pve, qemu)
cr0x@server:~$ pveversion -v
proxmox-ve: 8.2-1 (running kernel: 6.8.12-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/bb2d...)
qemu-server: 8.2.4
libpve-storage-perl: 8.2.2
Meaning: You now know the kernel and stack versions in play. “It got slower” often maps to “kernel changed” or “QEMU changed.”
Decision: If the regression started exactly after this, prioritize kernel/driver/storage stack checks before chasing VM-specific tuning.
Task 2: Check if the host is CPU-bound, I/O-bound, or just confused
cr0x@server:~$ uptime
14:02:11 up 12 days, 3:45, 3 users, load average: 18.21, 16.77, 15.90
Meaning: Load average counts runnable tasks and tasks stuck in uninterruptible sleep (often I/O). High load doesn’t automatically mean high CPU.
Decision: Pair this with per-CPU usage and iowait next. If load is high but CPU usage is modest, suspect storage latency or kernel lock contention.
Task 3: Look for iowait and steal time on the host
cr0x@server:~$ mpstat -P ALL 1 5
Linux 6.8.12-3-pve (server) 12/26/2025 _x86_64_ (32 CPU)
Average: CPU %usr %nice %sys %iowait %irq %soft %steal %idle
Average: all 12.40 0.00 4.10 18.70 0.00 0.80 0.00 64.00
Average: 0 10.00 0.00 3.00 22.00 0.00 0.50 0.00 64.50
Meaning: Sustained %iowait in the teens (or higher) means the CPU is waiting on storage. %steal is typically 0 on bare metal, but on nested virtualization it matters.
Decision: If iowait is high, stop “tuning CPU.” Go to storage latency and queue depth immediately.
Task 4: Identify the top processes and whether they are blocked on IO
cr0x@server:~$ top -b -n 1 | head -25
top - 14:02:33 up 12 days, 3:45, 3 users, load average: 18.21, 16.77, 15.90
Tasks: 812 total, 5 running, 807 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.3 us, 4.1 sy, 0.0 ni, 64.0 id, 18.7 wa, 0.0 hi, 0.9 si, 0.0 st
MiB Mem : 256000.0 total, 84000.0 free, 42000.0 used, 130000.0 buff/cache
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8421 root 20 0 11.8g 2.1g 10500 S 180.0 0.8 90:12.22 kvm
9211 root 20 0 10.2g 1.9g 9700 D 40.0 0.7 15:44.01 kvm
Meaning: State D means uninterruptible sleep—often blocked on I/O. Seeing KVM threads in D state is a classic “storage is stalling VMs” indicator.
Decision: Move to device-level latency checks (iostat, nvme smart-log, ZFS/Ceph health) before touching VM configs.
Task 5: Measure disk latency and queueing per device
cr0x@server:~$ iostat -x 1 5
Linux 6.8.12-3-pve (server) 12/26/2025 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.35 0.00 4.12 18.69 0.00 64.84
Device r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 120.0 980.0 4200.0 128000.0 240.0 9.80 9.60 2.10 10.50 0.35 38.0
sda 2.0 80.0 40.0 1800.0 46.0 3.20 45.00 8.00 46.00 1.20 95.0
Meaning: Look at await and avgqu-sz. A slow device with high utilization and high await (like sda above) can drag the whole storage path if it’s part of the pool/array or used for logs.
Decision: If one device is pinned at high %util with large await, identify what it’s used for (ZFS special vdev? boot disk? Ceph WAL/DB?) and fix that path.
Task 6: Confirm ZFS pool health and whether it’s doing background work
cr0x@server:~$ zpool status -v
pool: rpool
state: ONLINE
status: One or more devices is currently being resilvered.
action: Wait for the resilver to complete.
scan: resilver in progress since Thu Dec 26 12:11:09 2025
1.22T scanned at 1.80G/s, 220G issued at 320M/s, 3.45T total
220G resilvered, 6.2% done, 02:45:12 to go
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
Meaning: A resilver or scrub can hammer I/O. After an upgrade/reboot, these jobs may start or resume.
Decision: If you’re in a performance emergency, consider scheduling scrubs/resilvers thoughtfully (not “never”), and reduce competing workloads during recovery windows.
Task 7: Check ZFS dataset properties that impact latency (sync, compression, recordsize)
cr0x@server:~$ zfs get -o name,property,value sync,compression,recordsize,logbias,primarycache rpool/data
NAME PROPERTY VALUE
rpool/data sync standard
rpool/data compression zstd
rpool/data recordsize 128K
rpool/data logbias latency
rpool/data primarycache all
Meaning: Upgrades don’t usually flip these by themselves, but migrations and “helpful tuning” sometimes do. sync and logbias are latency levers for sync-heavy workloads.
Decision: If you see sync=always unexpectedly, expect slower databases. Fix the root cause (NFS exports? dataset inherited?) rather than blindly disabling sync.
Task 8: Observe ZFS ARC pressure (memory vs cache behavior)
cr0x@server:~$ arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% size c
14:03:51 812 120 14 28 3 70 8 22 3 96.1G 108G
14:03:52 790 140 17 31 4 81 10 28 4 95.9G 108G
14:03:53 840 160 19 35 4 91 11 34 4 95.8G 108G
Meaning: Rising miss% and shrinking ARC can push more reads to disk. After upgrades, memory pressure patterns can change (containers, services, kernel memory use).
Decision: If ARC is constantly squeezed, validate host RAM allocation, ballooning, and whether new services (e.g., Ceph daemons) are competing for memory.
Task 9: Confirm Ceph health and whether it’s recovering/rebalancing
cr0x@server:~$ ceph -s
cluster:
id: 9e2a...c6
health: HEALTH_WARN
1 osds down
Degraded data redundancy: 24/1024 objects degraded
Reduced data availability: 8 pgs inactive
services:
mon: 3 daemons, quorum pve1,pve2,pve3
mgr: pve1(active), standbys: pve2
osd: 11 osds: 10 up, 11 in
data:
pools: 4 pools, 256 pgs
objects: 1.2M objects, 4.1 TiB
usage: 12 TiB used, 24 TiB / 36 TiB avail
pgs: 8 inactive, 24 degraded
io:
client: 220 MiB/s rd, 45 MiB/s wr, 1.8k op/s rd, 0.6k op/s wr
Meaning: A “WARN” state with degraded PGs and recovery means client IO competes with recovery IO. Performance will be inconsistent.
Decision: Fix the health issue first. Tuning a sick Ceph cluster is like rearranging chairs during turbulence.
Task 10: Check if recovery/backfill is consuming the cluster
cr0x@server:~$ ceph osd pool stats
pool vm-disks id 2
  recovery io 310 MiB/s, 79 objects/s
  client io 220 MiB/s rd, 45 MiB/s wr, 1.80k op/s rd, 600 op/s wr
pool cephfs_data id 3
  nothing is going on
Meaning: Recovery/backfill I/O is listed right next to client I/O, per pool. When recovery bandwidth rivals or exceeds client bandwidth, the cluster is busy fixing itself, and user-facing latency will show it.
Decision: If recovery is heavy and business needs performance now, temporarily tune recovery settings (carefully) and schedule repairs. But don’t leave it permanently throttled out of fear.
Task 11: Verify NIC errors, drops, and link state
cr0x@server:~$ ip -s link show dev bond0
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 3c:fd:fe:12:34:56 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
91234567890 81234567 0 124 0 0
TX: bytes packets errors dropped carrier collsns
78123456789 73456789 0 0 0 0
Meaning: Dropped RX packets after an upgrade can be offload-related, ring buffer sizing, driver regressions, or MTU mismatch causing fragmentation/blackholes.
Decision: If drops are rising steadily, treat it as a network incident. Check offloads, MTU end-to-end, switch counters, and driver/firmware.
Task 12: Check bridge and VLAN configuration actually matches what you think you deployed
cr0x@server:~$ cat /etc/network/interfaces
auto lo
iface lo inet loopback
auto bond0
iface bond0 inet manual
bond-slaves eno1 eno2
bond-miimon 100
bond-mode 802.3ad
mtu 9000
auto vmbr0
iface vmbr0 inet static
address 10.10.0.11/24
gateway 10.10.0.1
bridge-ports bond0
bridge-stp off
bridge-fd 0
mtu 9000
Meaning: Post-upgrade, sometimes an interface rename, missing module, or misapplied config creates a “mostly working” but degraded path (e.g., bond degraded to one link, MTU inconsistent).
Decision: If the config doesn’t match your intended design, fix config first. Don’t benchmark chaos.
Task 13: Check corosync ring latency (cluster “slow” can be cluster comms)
cr0x@server:~$ corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id = 10.10.0.11
status = ring 0 active with no faults
RING ID 1
id = 10.20.0.11
status = ring 1 active with no faults
Meaning: This shows ring status, not performance. If you see faults or ring flaps, your “slowness” may be cluster state churn.
Decision: If rings are unstable, stop and fix the network/time sync. HA, migrations, and storage coordination will all get worse.
Task 14: Look for kernel log hints (driver resets, NVMe timeouts, soft lockups)
cr0x@server:~$ journalctl -k -p warning --since "2 hours ago" | tail -20
Dec 26 12:44:10 server kernel: nvme nvme0: I/O 123 QID 6 timeout, reset controller
Dec 26 12:44:11 server kernel: nvme nvme0: controller reset succeeded
Dec 26 12:58:22 server kernel: ixgbe 0000:3b:00.0: Detected Tx Unit Hang
Meaning: Timeouts and resets are performance killers that often masquerade as “random slow.” After an upgrade, driver behavior changes can expose marginal hardware or firmware bugs.
Decision: If you see resets/hangs, stop blaming Proxmox and start isolating firmware/driver combos, cabling, and hardware health.
Task 15: Validate CPU frequency governor and power settings
cr0x@server:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave
Meaning: “powersave” can be fine on some platforms, terrible on others. After upgrades, defaults can shift, or microcode updates can change boost behavior.
Decision: If latency-sensitive workloads regressed and you’re in powersave, test switching to performance (with change control) and measure.
Task 16: Check VM config for disk controller and cache mode regressions
cr0x@server:~$ qm config 104 | egrep 'scsi|virtio|cache|iothread|cpu|balloon'
balloon: 4096
cpu: x86-64-v2-AES
scsi0: rpool/data/vm-104-disk-0,cache=writeback,iothread=1,discard=on,size=200G
scsihw: virtio-scsi-single
Meaning: Cache modes and controller types matter. cache=writeback can be fast and dangerous without proper storage guarantees; other modes can be safer but slower. virtio-scsi-single with iothread can help parallelism.
Decision: If only certain VMs regressed, compare their configs. A changed CPU model or missing iothread can show up as “everything got slower” for that one workload.
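A low-effort way to spot that drift is to diff a known-good VM against a slow one; the second VMID is illustrative and the output is trimmed to the interesting hunks:
cr0x@server:~$ diff <(qm config 104) <(qm config 117)
2c2
< cpu: x86-64-v2-AES
---
> cpu: kvm64
5c5
< scsihw: virtio-scsi-single
---
> scsihw: lsi
Differences in CPU model, controller type, cache mode, or iothread jump out immediately and give you something concrete to test.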
Storage regressions: ZFS, Ceph, and “where did my IOPS go?”
Most post-upgrade “Proxmox is slow” incidents are actually storage latency incidents. The hypervisor gets blamed because it’s the common layer. Storage is the place where tiny changes become visible pain.
ZFS on Proxmox: the classic regression patterns
ZFS is fantastic when you respect its needs: RAM, sane vdev layout, and a clear story for sync writes. It’s also unforgiving when you run it like ext4 with vibes.
1) Sync writes got expensive
Databases and VM disks do sync writes. If your underlying storage can’t commit them quickly, everything stalls. Common triggers after an upgrade:
- A device used for SLOG (separate log) is now slow, failing, or missing.
- A kernel update changed NVMe behavior or queueing.
- Workload shifted slightly and crossed a threshold where latency spikes become constant.
Check if you have a SLOG and if it’s healthy:
cr0x@server:~$ zpool status
pool: rpool
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
logs
nvme2n1 ONLINE 0 0 0
Decision: If a dedicated log device exists, verify it’s not the bottleneck. A slow SLOG can turn sync writes into molasses.
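To see whether the log device (or any single vdev) is the one dragging, per-vdev latency from zpool iostat is a read-only check that is safe to run in production; the interval and count are illustrative:
cr0x@server:~$ zpool iostat -vl rpool 1 3
Compare the write latency columns (total_wait, disk_wait) for the logs vdev against the data mirrors. If the SLOG's write latency is worse than the pool it is supposed to accelerate, it is no longer a log device, it is an anchor.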
2) Special vdev and metadata hot spots
If you use special vdevs (metadata/small blocks), a regression on that device hurts everything. Check for uneven latency in iostat and in device SMART/health.
3) ARC behavior changed because memory pressure changed
After upgrades, you might be running more services (Ceph mgr modules, exporters, extra daemons), or container accounting changed. ARC gets squeezed, read latency rises, and you get “random slow” that’s actually “cache miss storm.”
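A quick, safe check is whether the ARC has been capped; the values below are illustrative, and 0 means "use the default," which is roughly half of RAM on current OpenZFS releases:
cr0x@server:~$ cat /sys/module/zfs/parameters/zfs_arc_min /sys/module/zfs/parameters/zfs_arc_max
0
0
If zfs_arc_max is non-zero, someone set a cap (often in /etc/modprobe.d). Compare it with the c column from arcstat above and with what your VMs and services actually need before deciding whether the cap still makes sense.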
4) ZFS recordsize vs workload mismatch surfaced
Recordsize is not a tuning toy; it’s a data layout decision. Most VM images do fine with 128K, but some random IO workloads benefit from smaller blocks. Upgrades don’t change recordsize, but they change the environment enough that you finally notice the mismatch.
Ceph on Proxmox: health first, tuning second
Ceph isn’t “a disk.” It’s a distributed system that happens to store bytes. After an upgrade, the most common performance killers are:
- OSDs flapping due to network issues or timeouts
- Recovery/backfill consuming IO and network
- Mis-matched MTU or NIC offloads causing packet loss and retransmits
- CRUSH changes or reweights triggering rebalancing
- Kernel client changes (if using RBD) affecting latency
If you’re slow and Ceph is not HEALTH_OK, treat the warning as the root cause until proven otherwise.
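Before touching any tuning, get the specifics of the warning. ceph health detail names the exact OSDs and placement groups involved; the IDs below are illustrative:
cr0x@server:~$ ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 24/1024 objects degraded (2.344%), 24 pgs degraded
[WRN] OSD_DOWN: 1 osds down
    osd.7 (root=default,host=pve3) is down
[WRN] PG_DEGRADED: Degraded data redundancy: 24/1024 objects degraded (2.344%), 24 pgs degraded
Now you know which host and which OSD to investigate, instead of tuning the whole cluster around a single sick daemon.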
Benchmark carefully, or you’ll benchmark your own mistake
Use fio only when you understand blast radius. Don’t run a random write benchmark on a production pool at noon and call it “diagnostics.” You’ll diagnose yourself into an incident.
That said, a controlled, small benchmark can confirm whether host storage performance fundamentally changed:
cr0x@server:~$ fio --name=latcheck --filename=/rpool/data/latcheck.bin --size=2G --direct=1 --rw=randread --bs=4k --iodepth=16 --numjobs=1 --runtime=20 --time_based=1 --group_reporting
latcheck: (groupid=0, jobs=1): err= 0: pid=18821: Thu Dec 26 14:10:12 2025
read: IOPS=52.1k, BW=203MiB/s (213MB/s)(4062MiB/20001msec)
slat (nsec): min=920, max=112k, avg=2450.1, stdev=2100.4
clat (usec): min=60, max=7800, avg=286.2, stdev=120.4
lat (usec): min=63, max=7805, avg=289.0, stdev=121.0
Meaning: You’re looking for latency distribution: average and max. A few ms max is usually fine; tens/hundreds of ms max under light load indicates storage stalls.
Decision: If this is dramatically worse than your known baseline (or worse than other hosts), focus on device health, kernel logs, ZFS/Ceph background work, and storage configuration changes.
CPU and scheduling: stolen time, pinning, and kernel surprises
When storage is fine, CPU scheduling is the next most common upgrade regression. Proxmox is Linux; Linux is great; Linux is also perfectly capable of scheduling you into a ditch if you give it the wrong constraints.
Common CPU regression after upgrade: frequency scaling
Some upgrades change the CPUfreq driver used (intel_pstate vs acpi-cpufreq), or revert a tuned governor. The symptom is predictable: single-thread performance tanks, latency rises, and overall CPU usage might still look modest.
Check current policy:
cr0x@server:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor 2>/dev/null | head
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor:powersave
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor:powersave
Decision: If you expect consistent performance, consider setting a policy explicitly (and document it). Measure before and after; don’t just flip it and declare victory.
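If you do test the performance governor, the change is a sysfs write per CPU and does not survive a reboot, which makes it a low-risk experiment; make it persistent only after you have measured a win:
cr0x@server:~$ echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor >/dev/null
cr0x@server:~$ grep -h . /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
     32 performance
Re-run your latency-sensitive workload and compare numbers, not feelings.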
IRQ storms and softirq CPU burn
After NIC driver updates, you can get different interrupt behavior. Symptoms:
- High %soft in mpstat
- Network throughput drops while CPU rises
- Packet drops increase
Look at interrupts:
cr0x@server:~$ cat /proc/interrupts | head -15
CPU0 CPU1 CPU2 CPU3
0: 42 0 0 0 IO-APIC 2-edge timer
24: 81234567 79234561 80123455 81111222 PCI-MSI 524288-edge ixgbe
25: 1024 1100 1008 990 PCI-MSI 524289-edge ixgbe
Decision: If one or two CPUs get hammered by interrupts after upgrade, investigate IRQ affinity and NIC queue configuration. Don’t “fix” it by just giving VMs more vCPUs.
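One quick check that often explains a lopsided interrupt pattern is how many queues the NIC driver actually brought up after the upgrade; the interface name and counts are illustrative:
cr0x@server:~$ ethtool -l eno1
Channel parameters for eno1:
Pre-set maximums:
RX:             0
TX:             0
Other:          1
Combined:       63
Current hardware settings:
RX:             0
TX:             0
Other:          1
Combined:       4
If the driver came up with far fewer combined queues than before (or than its maximum), softirq load concentrates on a handful of cores, and that alone can look like a "CPU regression."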
CPU pinning: great when correct, brutal when wrong
Pinning can improve performance for latency-sensitive workloads. It can also create artificial contention when you change CPU topology, microcode, or kernel scheduling behavior.
Check for pinned VMs and their CPU masks. In Proxmox, this is often in VM config (cores, cpulimit, cpuunits) and in host-level tuning tools.
Kernel upgrades can also change how CFS balances tasks across CPUs. If you’re pinned too tightly and have heavy IO completion interrupts on the same cores, you’re basically fighting yourself.
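To see what a running VM is actually allowed to use, combine its Proxmox config with the live affinity of its QEMU process; the VMID is illustrative, and the pidfile path is the usual qemu-server location:
cr0x@server:~$ qm config 104 | egrep 'affinity|cores|cpulimit|cpuunits'
cores: 8
cr0x@server:~$ taskset -cp $(cat /var/run/qemu-server/104.pid)
pid 8421's current affinity list: 0-31
If the affinity list is much narrower than you intended, or overlaps the cores that are busy servicing NIC and storage interrupts, you have found contention you built yourself.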
Network regressions: bridges, offloads, MTU, and corosync
Networking regressions after upgrade are sneaky. Throughput tests may look fine, but latency-sensitive traffic (storage replication, corosync, small RPCs) suffers.
MTU mismatch: the classic “it works but it’s slow”
If you run jumbo frames (MTU 9000) on bonds/bridges, you must ensure it’s consistent end-to-end: NIC, bond, bridge, switch ports, VLANs, and the peer endpoints (storage networks included). A mismatch can lead to fragmentation or drops depending on path and DF bit behavior.
Check MTU at each layer:
cr0x@server:~$ ip link show vmbr0 | grep mtu
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
Decision: If some nodes show MTU 9000 and others show 1500, fix the inconsistency before blaming storage.
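A do-not-fragment ping is the fastest end-to-end jumbo-frame test: 8972 bytes of ICMP payload plus 28 bytes of headers fills a 9000-byte frame. The peer address is illustrative; use another node on the same storage or cluster network:
cr0x@server:~$ ping -M do -s 8972 -c 3 10.10.0.12
PING 10.10.0.12 (10.10.0.12) 8972(9000) bytes of data.
8980 bytes from 10.10.0.12: icmp_seq=1 ttl=64 time=0.211 ms
8980 bytes from 10.10.0.12: icmp_seq=2 ttl=64 time=0.198 ms
8980 bytes from 10.10.0.12: icmp_seq=3 ttl=64 time=0.204 ms
If you see "Message too long" or silent loss instead, something in the path is not carrying 9000-byte frames, and storage latency will suffer long before anything shows up as "down."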
Offloads and driver changes
After upgrades, NIC offload defaults can change. Sometimes offloads help, sometimes they introduce weirdness with bridges/VLANs or specific switch firmware. If you see drops and retransmits, check offload settings and test changes carefully.
cr0x@server:~$ ethtool -k eno1 | egrep 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload'
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
Decision: If you suspect offload-related issues, test toggling one change at a time and measure. Don’t cargo-cult “turn off everything.”
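When you do test an offload change, toggle exactly one feature, measure, and remember that ethtool settings do not persist across reboots; the interface and feature below are just an example:
cr0x@server:~$ ethtool -K eno1 gro off
cr0x@server:~$ ethtool -k eno1 | grep generic-receive-offload
generic-receive-offload: off
Record the result either way. "We turned things off and it felt better" is not evidence you can reuse on the next node.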
Corosync sensitivity
Corosync doesn’t need huge bandwidth. It needs low jitter and low loss. A small network regression can cause cluster instability, which then causes secondary performance damage: migrations fail, HA decisions churn, and storage operations get delayed.
Look for corosync complaints:
cr0x@server:~$ journalctl -u corosync --since "24 hours ago" | tail -20
Dec 26 09:15:22 server corosync[1482]: [TOTEM ] Token has not been received in 2500 ms
Dec 26 09:15:22 server corosync[1482]: [TOTEM ] A processor failed, forming new configuration.
Decision: If you see token timeouts, treat it as a network/time problem. Fix that before you attempt performance tuning elsewhere.
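Because token timeouts are often a time or jitter problem, confirm time sync is healthy on every node before digging deeper; timedatectl works regardless of which NTP client the host runs:
cr0x@server:~$ timedatectl | grep -E 'synchronized|NTP service'
System clock synchronized: yes
          NTP service: active
If any node answers "no" or "inactive," fix that first. Corosync, Ceph, and your log timestamps will all thank you.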
QEMU/VM configuration changes that quietly hurt
Sometimes the host is fine and the regression is specific: only Windows VMs, only a database VM, only VMs on one storage, only VMs created after the upgrade.
Virtio drivers and guest tooling
Windows guests without current virtio drivers can regress after hypervisor changes. Linux guests usually handle it better, but old kernels can still struggle with newer virtio features.
Check whether the QEMU guest agent is flapping or misbehaving:
cr0x@server:~$ qm agent 104 ping
{"return":{}}
Meaning: Agent responds quickly. If this times out across many VMs after upgrade, you might have a broader host issue or a QEMU/agent compatibility problem.
Decision: If agent calls hang, check QEMU logs and host load. Don’t assume it’s “just the agent”—it can be a symptom of bigger stalls.
Disk cache modes and barriers
Cache settings are where people go to “make it faster.” It can work. It can also backfire spectacularly when a host crashes and your “fast” becomes “corrupt.”
After an upgrade, validate that your cache mode and storage stack still match your intended durability (a quick audit is sketched after this list):
- If you rely on writeback caching, you need reliable power-loss protection and a clear understanding of risk.
- If you’re using ZFS, don’t fight it with nonsense settings that create double caching and unpredictable write behavior.
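All VM configs on a node live under /etc/pve/qemu-server/, so auditing explicit cache settings across the host is a single grep; the output below is illustrative:
cr0x@server:~$ grep -H 'cache=' /etc/pve/qemu-server/*.conf
/etc/pve/qemu-server/104.conf:scsi0: rpool/data/vm-104-disk-0,cache=writeback,iothread=1,discard=on,size=200G
/etc/pve/qemu-server/131.conf:scsi0: rpool/data/vm-131-disk-0,cache=writeback,size=500G
Disks without a cache= option use the default (none). Anything running writeback should be a deliberate, documented decision, not an archaeological surprise.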
CPU model changes
A CPU model change can cause a performance regression inside the guest (crypto extensions, vector instructions, timing). Proxmox upgrades can shift defaults or expose old choices.
Compare CPU model settings between a “fast” VM and a “slow” VM (qm config). If one is using a very generic model, it might miss instruction set features that your workload benefits from.
Joke #2: The fastest way to improve VM performance is to stop calling everything “the network.” It’s not the network—until it is.
Three corporate-world mini-stories (anonymized, plausible, and painful)
Mini-story 1: The incident caused by a wrong assumption
They upgraded a three-node Proxmox cluster on a quiet Friday night. The change ticket was clean: kernel update, minor Proxmox point release, reboot each node. Sunday afternoon the on-call got paged: the ERP system “randomly freezes” for 5–20 seconds several times per hour.
The first assumption was classic: “ZFS is slow because ARC got reset by reboot, it’ll warm up.” They waited. The freezes continued, and now backups were also timing out. Someone started tweaking ZFS tunables in production, because nothing says “Sunday” like writing to /etc/modprobe.d under pressure.
Eventually someone looked at journalctl -k and saw periodic NVMe controller resets. Post-upgrade, the NVMe driver behaved slightly differently with that specific firmware revision. Before the upgrade, the device had been marginal but “stable enough.” After, it wasn’t. The result wasn’t a clean failure—just intermittent stalls that aligned perfectly with user complaints.
The fix was boring: update the NVMe firmware to a known good version and move the VM storage off the affected device until the maintenance window. Performance snapped back immediately, and the ZFS “tuning” was rolled back with an apology to future humans.
Lesson: the wrong assumption wasn’t “ZFS warmed up.” It was “upgrades only change software.” They also change how software talks to hardware, and hardware is always listening.
Mini-story 2: The optimization that backfired
A different team had been fighting slow database writes on VMs. Someone read that setting a VM disk to cache=writeback improves performance. It did. The graphs looked glorious. Latency dropped, throughput went up, and everyone felt like a storage engineer for a day.
Then they upgraded Proxmox. After the upgrade, the host started doing more aggressive background I/O during scrubs, and the writeback cache hid it until it didn’t. During peak load, latency spikes became frequent. They “optimized” further by increasing VM iodepth and adding more vCPUs. Now the VM could generate write bursts faster than the storage could durably absorb. Latency spikes got worse. The database started tripping its own timeouts.
They blamed the upgrade. The upgrade was an accomplice, not the mastermind. The real issue was that the caching choice changed failure modes and masked the storage’s true behavior under sync write pressure. The performance boost was real but not stable across operational conditions.
The eventual fix was to align durability and performance expectations: move the database VM to a storage tier with proper sync write performance, ensure SLOG wasn’t a bottleneck, and use safer caching modes. They still got good performance—just without the roulette wheel.
Mini-story 3: The boring but correct practice that saved the day
An enterprise shop ran Proxmox with Ceph and had a ritual: before any upgrade, they recorded three baselines on each node—host I/O latency (iostat -x), Ceph health and recovery state (ceph -s), and a small fio profile on a dedicated test volume. They also had a rolling upgrade plan with a pause after each node to observe.
During one upgrade cycle, right after rebooting node two, the fio latency profile doubled. Not catastrophic, but clearly different. Because they had a baseline from the previous week, it wasn’t “maybe it’s always been like this.” It was new.
They stopped the upgrade. No heroics. They compared kernel logs and found a NIC driver message that correlated with increased drops on the storage network. The node had come up with a different offload setting. Ceph was healthy, but the client latency rose because packets were being retransmitted under load.
The fix was straightforward: normalize the NIC settings across nodes, verify switch counters, rerun baseline tests, then continue. Users never noticed. The only drama was in the change review meeting, where someone asked why they “paused for such a small difference.” The answer was simple: small differences become big incidents at 2 a.m.
Common mistakes: symptoms → root cause → fix
This section is intentionally specific. “Check logs” is not a fix.
1) Symptom: high load average, but CPU usage looks low
- Root cause: tasks blocked on I/O (high iowait), often storage latency spikes after upgrade.
- Fix: confirm with mpstat and iostat -x; then identify which device/pool is slow; check ZFS scrub/resilver or Ceph recovery; check kernel logs for timeouts.
2) Symptom: only Windows VMs got slower, especially disk I/O
- Root cause: virtio storage driver mismatch or the VM switched controller model; guest lacks the right driver features.
- Fix: verify the VM controller (qm config) and update virtio drivers in the guest; avoid changing controllers casually during upgrades.
3) Symptom: random 5–30 second freezes across multiple VMs
- Root cause: NVMe timeouts/controller resets, SATA link issues, or firmware/driver regression revealed by new kernel.
- Fix: check journalctl -k for resets/timeouts; update firmware; test an alternate kernel if available; migrate hot VMs away from the affected device.
4) Symptom: Ceph-backed VMs slow “sometimes,” worse after upgrade
- Root cause: Ceph in HEALTH_WARN, recovery/backfill competing with client I/O; or network drops on the storage network.
- Fix: get Ceph to HEALTH_OK; check OSD status; check NIC drops; verify MTU and offloads; only then tune recovery rates.
5) Symptom: migrations slower, HA actions delayed, cluster feels unstable
- Root cause: corosync token timeouts due to network jitter, MTU mismatch, or time sync issues.
- Fix: confirm corosync logs; verify rings; check NIC counters; validate time sync service status; keep corosync on a clean, boring network.
6) Symptom: throughput is okay, but tail latency is awful
- Root cause: queueing and burst behavior; often IO scheduler change, write amplification, or background work (scrub/resilver, trim, backups).
- Fix: look at await and avgqu-sz in iostat -x; examine background tasks; adjust schedules; verify queue depths; consider iothreads for busy VM disks.
7) Symptom: network drops increased after upgrade, but link is up
- Root cause: NIC driver/offload change, ring buffer sizing, or bonding/LACP negotiation issues.
- Fix: inspect ip -s link; check ethtool offloads; verify bond status; compare against other nodes; standardize settings and firmware.
Checklists / step-by-step plan
Checklist A: 20-minute triage (production-safe)
- Confirm versions: pveversion -v. Write down the kernel and QEMU versions.
- Check load and iowait: uptime, mpstat -P ALL 1 5.
- Check device latency: iostat -x 1 5. Identify the worst device.
- Check kernel warnings: journalctl -k -p warning --since "2 hours ago".
- If ZFS: zpool status -v for scrub/resilver; zfs get for weird dataset settings.
- If Ceph: ceph -s; don't tune performance while degraded.
- Check network drops: ip -s link; confirm MTU and bond state.
- Pick one slow VM and compare its qm config to a known-good VM.
Checklist B: 2-hour isolation (find the subsystem)
- Determine whether regression is host-wide: compare metrics across nodes; find the “bad node.”
- If one node is worse, compare hardware logs and NIC/storage firmware versions (don’t assume uniformity).
- Run a small fio read-only test on a safe path to compare latency between nodes.
- Inspect interrupts and softirq load if network is suspected (/proc/interrupts, mpstat).
- Check for background jobs: ZFS scrub/resilver; Ceph recovery; backups; trim/discard jobs.
- If kernel-related, consider booting the previous kernel (with a rollback plan) to confirm causality; a kernel-pin sketch follows this list.
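If the host boots via proxmox-boot-tool (the default on recent installs), listing and pinning the previous kernel is straightforward; the version strings below are illustrative, so use what pveversion -v reported before the upgrade:
cr0x@server:~$ proxmox-boot-tool kernel list
Manually selected kernels:
None.
Automatically selected kernels:
6.5.13-5-pve
6.8.12-3-pve
cr0x@server:~$ proxmox-boot-tool kernel pin 6.5.13-5-pve
Reboot, reproduce your baseline test, and unpin (proxmox-boot-tool kernel unpin) once you have your answer.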
Checklist C: Hardening after you fix it (so next upgrade doesn’t bite)
- Record baselines (latency distributions, not just averages): iostat, a small fio profile, VM latency metrics.
- Standardize NIC offloads/MTU across nodes; document it in your build process.
- Schedule scrubs/resilvers/backups to avoid peak business windows.
- Keep firmware in lifecycle management: NVMe, NIC, BIOS, HBA.
- Maintain an upgrade canary node; upgrade it first and observe for 24–48 hours if possible.
- Write down the known-good VM hardware profile (controller type, cache mode, CPU model) and apply it consistently.
FAQ
1) “Everything is slower after upgrade.” Is it usually Proxmox itself?
Usually it’s the kernel/storage/network layer that changed under Proxmox. Proxmox is the integration point, so it gets blamed. Start with iowait, disk latency, and NIC drops.
2) How do I quickly tell if it’s storage vs CPU?
mpstat is your friend. High %iowait points to storage. High %usr/%sys with low iowait points to CPU saturation or kernel work. Then confirm with iostat -x.
3) Can a ZFS scrub/resilver really slow down VMs that much?
Yes, especially on HDD pools or pools with a stressed vdev. It competes for I/O and can raise latency. The effect is workload-dependent; databases feel it immediately.
4) Ceph is HEALTH_WARN but “mostly works.” Should I still treat it as the cause?
Yes. “Mostly works” is how distributed systems lure you into complacency. Recovery/backfill and degraded placement groups can absolutely cause user-visible latency.
5) After the upgrade, VM disk performance dropped. Should I change cache mode to writeback?
Not as a first move. It can improve speed but changes durability semantics. Fix underlying latency first (device health, pool layout, Ceph health). If you change cache mode, do it intentionally with risk acceptance.
6) Why do I see high load average but the CPUs are idle?
Load includes tasks stuck in uninterruptible sleep (often I/O). So you can have high load and idle CPUs if the system is waiting on storage.
7) Can an MTU mismatch really show up as “storage slow”?
Absolutely. If your storage network is dropping larger frames or fragmenting unpredictably, you’ll see retransmits and jitter. Storage protocols are sensitive to that.
8) Should I roll back the kernel immediately?
If you have strong evidence the regression is kernel/driver-related (new resets/timeouts, NIC hang messages, clear before/after), rolling back is a valid mitigation. Confirm with logs and one controlled test. Don’t roll back blindly without understanding the risk and security implications.
9) Is it normal for performance to change after reboot because caches are cold?
Some change is normal—ARC and page cache warm up. But persistent slow performance hours later, or periodic stalls, is not “cold cache.” That’s a real bottleneck.
10) How do I avoid this next time?
Baseline metrics before upgrades, use a canary node, keep firmware consistent, and stop treating storage and networking as “set and forget.” They remember.
Conclusion: next steps that actually work
If Proxmox got slower after an upgrade, don’t start by tweaking VM knobs. Start by proving which subsystem is making everything wait.
- Classify the bottleneck: CPU vs storage vs network. Use mpstat and iostat, not gut feelings.
- Check for background work and health issues: ZFS resilvers/scrubs, Ceph recovery, corosync instability.
- Read the kernel warnings: timeouts and driver resets are performance incidents wearing a disguise.
- Compare configs and baselines: find what changed, measure it, and decide based on evidence.
- Once stable, harden: record baselines, standardize NIC/storage firmware and settings, and run a canary upgrade path.
Do those steps and you’ll usually find the cause before the next status meeting invites itself onto your calendar.