If you run Proxmox long enough, live migration will eventually betray you at the worst possible time: a maintenance window, a noisy neighbor incident, or five minutes before a demo. The UI says “failed,” the VM is still running (maybe), and you’ve got a log that reads like a thriller novel written by QEMU.
This is a production-first guide to isolating why Proxmox live migration fails—specifically the big three: network, CPU flags, and storage. We’re going to verify facts, not vibes. You’ll run commands, interpret outputs, and make decisions that stop the bleeding.
Fast diagnosis playbook (what to check first)
Live migration failures feel random because the failure can happen in multiple phases: pre-flight checks, establishing the migration channel, pre-copy RAM transfer, stop-and-copy switchover, and post-migration cleanup. Your job is to find the phase and the bottleneck quickly.
First: identify the phase and the real error (not the UI summary)
- Check the task log in the UI, then go straight to journal logs on both nodes.
- Decision: If you don’t see a concrete error within 2 minutes, you’re reading the wrong log.
Second: confirm you have a working migration network path
- Verify node-to-node connectivity on the actual IPs used for migration (not “management,” not “whatever ping works”).
- Check MTU, packet loss, latency, and firewall rules.
- Decision: If there’s MTU mismatch or intermittent loss, fix network before touching Proxmox settings. Migration is a sustained high-throughput stream; it will find every weakness.
Third: confirm CPU model compatibility and QEMU machine type consistency
- Compare VM CPU type configuration versus host CPU flags.
- Decision: If CPU is set to host and the nodes are not identical, stop. Set a compatible CPU type (e.g., x86-64-v2/v3) and retest.
Fourth: confirm storage assumptions (shared vs local, and what must move)
- Is disk on shared storage? If not, is “with local disks” enabled and is there bandwidth and space?
- Decision: If the VM disk is local and huge, live migration may be “working” but effectively impossible within your window. Plan cold migration or replication.
Fifth: verify cluster health and time sync
- Corosync and quorum issues can block operations, and time drift can create weird TLS and auth symptoms.
- Decision: If the cluster isn’t healthy, do not treat migration as your first fix. It’s a feature that assumes a stable foundation.
Operational truth: when migration fails, the fastest wins come from verifying the network path and CPU model assumptions. Storage is usually the slow burn, not the instant “connection closed” killer.
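If you want those first checks bundled into something you can paste during an incident, here is a minimal triage sketch. The script name and the peer-IP argument are mine; the jumbo ping size matches the MTU 9000 examples later in this guide, so adjust it if you run 1500.
cr0x@server:~$ cat triage-migration.sh
#!/usr/bin/env bash
# Minimal migration triage sketch. Usage: bash triage-migration.sh <peer-migration-ip>
set -u
PEER="$1"
echo "== recent pvedaemon log =="
journalctl -u pvedaemon --since "30 min ago" --no-pager | tail -n 20
echo "== quorum =="
pvecm status | grep -E 'Quorate|Nodes:'
echo "== route to peer =="
ip route get "$PEER"
echo "== jumbo path check (use -s 1472 for MTU 1500) =="
ping -M do -s 8972 -c 3 "$PEER"
echo "== SSH control path =="
ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$PEER" hostname
Run it against the peer's migration IP (for example, bash triage-migration.sh 10.10.10.12) and read the output top to bottom; the first section that looks wrong is where you start.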
Interesting facts and context (why this fails in real life)
- Live migration predates most “cloud” branding. Practical VM live migration was a research topic in the early 2000s; pre-copy migration became the default approach because it minimizes downtime by iteratively copying dirty pages.
- QEMU migration is a protocol, not a magic trick. It streams VM state (RAM pages, CPU state, device state) over a channel that behaves like any other long-lived connection: loss, MTU issues, or firewall meddling can kill it.
- “CPU compatibility” is a contractual promise. If the destination CPU doesn’t support an instruction the VM was using, QEMU can’t safely resume the VM. That’s why CPU models exist—so you can promise less and move more.
- Intel and AMD feature flags are not just marketing names. Things like AVX/AVX2, AES-NI, and virtualization extensions show up as flags; mismatch can block migration when CPU type is too specific.
- Machine type matters. QEMU “machine types” (like i440fx vs q35, and versioned variants) define chipset and device models. Mismatches can break migration even when CPUs look fine.
- MTU mismatches are the silent assassin of high-throughput streams. ICMP ping can pass while large packets fragment or drop; migration traffic is large and sustained, so it hits the wall fast.
- Ceph makes live migration easier and harder. Easier because disks are shared; harder because if the Ceph network is sick, migration becomes a stress test you didn’t schedule.
- Compression and postcopy exist because pre-copy has limits. If a VM dirties memory faster than you can copy it (think hot databases), pre-copy can loop until it times out or forces downtime.
- Corosync is not the migration data plane. Corosync is cluster messaging and quorum; migration uses separate channels (SSH-based orchestration and QEMU migration sockets). But a broken cluster can still block actions.
Mental model: what “live migration” is actually doing
Think of live migration as two workflows running in parallel:
- Control plane: Proxmox coordinates the move: checks target node readiness, locks the VM, sets up tunnels/ports, starts a destination QEMU process in “incoming” mode, then asks the source QEMU to migrate.
- Data plane: QEMU streams VM runtime state: memory pages, vCPU state, device state. Disk state is only streamed if you explicitly migrate local disks (or do storage migration). With shared storage, disks stay where they are; the VM just points to the same block device from another node.
Most “mysterious” failures are one of these:
- Connectivity and negotiation failures (SSH, firewall, wrong IP, MTU, TLS/certs, hostname resolution).
- Compatibility failures (CPU model, flags, machine type, device model differences, firmware/UEFI differences).
- Throughput/latency failures (migration can’t keep up; pre-copy never converges; storage migration stalls; the VM dirties RAM too fast).
- Operational guardrails (cluster not quorate, target node lacks storage, insufficient memory, HA constraints, or pinned resources).
Now we’ll diagnose those with concrete tasks.
Practical verification tasks (commands, outputs, decisions)
You want commands that answer one question each. Run them on the source and target nodes unless noted. Don’t “fix” anything until you can explain the failure with one sentence.
Task 1: Identify the migration failure reason in system logs
cr0x@server:~$ journalctl -u pvedaemon -u pveproxy -u pvestatd -u pve-cluster --since "30 min ago" | tail -n 80
Dec 26 10:12:09 pve1 pvedaemon[2461]: VM 113 qmp command failed - unable to connect to VM 113 qmp socket
Dec 26 10:12:10 pve1 pvedaemon[2461]: start migration failed: command '/usr/bin/ssh -e none -o 'BatchMode=yes' root@10.10.10.12 ...' failed: exit code 255
Dec 26 10:12:10 pve1 pvedaemon[2461]: migration aborted (duration 00:00:03): can't connect to destination
What it means: This is a control-plane failure. Exit code 255 from SSH usually means auth/host key/firewall/routing, not QEMU compatibility.
Decision: Stop looking at CPU/storage. Fix SSH and node-to-node connectivity first.
Task 2: Check the dedicated migration network configuration in Proxmox
cr0x@server:~$ cat /etc/pve/datacenter.cfg
keyboard: en-us
migration: secure,network=10.10.10.0/24
What it means: Proxmox will prefer 10.10.10.0/24 for migration traffic. If that network is broken or misrouted between nodes, migration fails or drags.
Decision: Ensure each node has a stable IP on that subnet and routing is correct. If you don’t have a true migration network, remove the setting or fix the network.
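You can also override the migration network and type for a single attempt, which is useful for testing a fix before committing it to datacenter.cfg. A sketch using the VM and subnet from this article; confirm the option names with qm help migrate on your version:
cr0x@server:~$ qm migrate 113 pve2 --online --migration_type secure --migration_network 10.10.10.0/24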
Task 3: Verify the node IPs Proxmox thinks it should use
cr0x@server:~$ cat /etc/pve/nodes/pve1/config | sed -n '1,120p'
# node config
acme: account=default
cr0x@server:~$ cat /etc/hosts
127.0.0.1 localhost
192.168.1.11 pve1
192.168.1.12 pve2
10.10.10.11 pve1-mig
10.10.10.12 pve2-mig
What it means: Many clusters rely on hostnames resolving consistently. If pve2 resolves to a management IP but migration is configured for another network, you can end up with asymmetric paths.
Decision: Pick a naming scheme that doesn’t lie. Ensure the hostname used for SSH matches the intended network, or use explicit migration network configuration.
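A quick way to confirm what each side actually resolves, in both directions (hostnames are the ones from the example /etc/hosts above):
cr0x@server:~$ getent hosts pve2 pve2-mig
cr0x@server:~$ ssh root@pve2-mig "getent hosts pve1 pve1-mig"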
Task 4: Confirm passwordless root SSH works between nodes (the boring prerequisite)
cr0x@server:~$ ssh -o BatchMode=yes -o ConnectTimeout=5 -e none root@10.10.10.12 "pveversion && hostname && true"
pve-manager/8.2.2/9355359cd (running kernel: 6.8.12-4-pve)
pve2
What it means: Control plane can reach the destination and run commands.
Decision: If this fails, fix SSH keys, known_hosts issues, routing, or firewall before anything else.
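If the SSH trust itself is broken (stale host keys, missing authorized_keys entries), the usual repair is to let Proxmox regenerate and redistribute the cluster key material rather than hand-editing files. A sketch; the retest should pass afterwards:
cr0x@server:~$ pvecm updatecerts --force
cr0x@server:~$ ssh -o BatchMode=yes root@10.10.10.12 hostname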
Task 5: Check for MTU mismatch and jumbo-frame half-configurations
cr0x@server:~$ ip -d link show vmbr1
4: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 2c:fd:a1:11:22:33 brd ff:ff:ff:ff:ff:ff promiscuity 0
bridge forward_delay 1500 hello_time 200 max_age 2000 ageing_time 30000 stp_state 0 priority 32768 vlan_filtering 0
cr0x@server:~$ ip -d link show vmbr1
4: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 2c:fd:a1:aa:bb:cc brd ff:ff:ff:ff:ff:ff promiscuity 0
What it means: One node is MTU 9000, the other 1500. That’s not “sort of fine.” That’s a migration failure waiting to happen.
Decision: Standardize MTU end-to-end (NIC, switch ports, VLANs, bridges). If you can’t guarantee jumbo everywhere, run 1500 and move on with your life.
Joke #1: Jumbo frames are like executive promises—amazing when end-to-end, but one weak link turns them into a very expensive rumor.
Task 6: Validate the migration network path with large packets (don’t trust default ping)
cr0x@server:~$ ping -M do -s 8972 -c 3 10.10.10.12
PING 10.10.10.12 (10.10.10.12) 8972(9000) bytes of data.
8972 bytes from 10.10.10.12: icmp_seq=1 ttl=64 time=0.391 ms
8972 bytes from 10.10.10.12: icmp_seq=2 ttl=64 time=0.402 ms
8972 bytes from 10.10.10.12: icmp_seq=3 ttl=64 time=0.388 ms
--- 10.10.10.12 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2046ms
rtt min/avg/max/mdev = 0.388/0.393/0.402/0.006 ms
What it means: Jumbo packets pass with “do not fragment.” Good sign for a high-throughput stream.
Decision: If it fails with “Frag needed” or packet loss, fix MTU and/or switch config before retrying migration.
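To audit MTU across every interface on both nodes in one pass, a rough sketch (the awk field positions assume standard ip -o link output):
cr0x@server:~$ ip -o link | awk '{print $2, $5}'
cr0x@server:~$ ssh root@10.10.10.12 "ip -o link" | awk '{print $2, $5}'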
Task 7: Check firewall status and migration-related rules
cr0x@server:~$ pve-firewall status
Status: running
Enabled: 1
cr0x@server:~$ iptables -S | grep -E 'PVEFW|DROP|REJECT' | head
-A INPUT -j PVEFW-INPUT
-A FORWARD -j PVEFW-FORWARD
What it means: Firewall is in play. Migration commonly uses SSH and QEMU migration ports negotiated by Proxmox. Overly strict rules can break it.
Decision: If you recently enabled firewall, confirm cluster/migration allowances exist. Temporarily disable at node level only for diagnosis if policy allows, then implement proper rules.
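Two quick checks before you rewrite any rules: what the firewall thinks the cluster network is, and which node-level rules actually exist (the paths are the standard Proxmox firewall config locations):
cr0x@server:~$ pve-firewall localnet
cr0x@server:~$ cat /etc/pve/firewall/cluster.fw /etc/pve/nodes/pve1/host.fw 2>/dev/null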
Task 8: Confirm cluster quorum and corosync health (because operations are gated)
cr0x@server:~$ pvecm status
Cluster information
-------------------
Name: prod
Config Version: 17
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Thu Dec 26 10:20:31 2025
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.4a
Quorate: Yes
What it means: Cluster is quorate. That removes a whole class of “why is everything locked?” problems.
Decision: If not quorate, fix cluster networking first. Don’t attempt migrations as your workaround; you’ll compound the incident.
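If quorum looks shaky, check the corosync links themselves; knet reports per-link connectivity, and the corosync journal usually names the flapping link. A sketch, run on each node:
cr0x@server:~$ corosync-cfgtool -s
cr0x@server:~$ journalctl -u corosync --since "1 hour ago" --no-pager | tail -n 20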
Task 9: Verify VM config for CPU model and migration compatibility
cr0x@server:~$ qm config 113 | egrep '^(name|memory|cores|sockets|cpu|machine|bios|efidisk0|hostpci|args):'
name: api-prod-01
memory: 16384
cores: 6
sockets: 1
cpu: host,hidden=1,flags=+aes
machine: pc-q35-8.1
bios: ovmf
efidisk0: ceph-vm:vm-113-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
What it means: CPU is set to host. That’s great for performance and terrible for heterogeneous clusters. Also note machine type is versioned.
Decision: If destination node has a different CPU generation/vendor, change CPU type to a compatible baseline and keep machine types aligned across nodes.
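Changing the CPU model is a single setting; it lands in the config immediately but only takes effect after the VM is stopped and started again, so plan for that. A sketch using one of the generic x86-64 baseline models shipped with Proxmox VE 8; pick whichever baseline your oldest node can actually provide:
cr0x@server:~$ qm set 113 --cpu x86-64-v2-AES
cr0x@server:~$ qm config 113 | grep ^cpu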
Task 10: Compare host CPU flags across nodes (catch the mismatch fast)
cr0x@server:~$ lscpu | egrep 'Model name|Vendor ID|CPU family|Model:|Flags' | head -n 20
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz
CPU family: 6
Model: 106
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx avx2
cr0x@server:~$ lscpu | egrep 'Model name|Vendor ID|CPU family|Model:|Flags' | head -n 20
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7302P 16-Core Processor
CPU family: 23
Model: 49
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq svm ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx avx2
What it means: Different vendors. With cpu: host, this VM is almost guaranteed to fail migration (or be blocked) because exposed CPU model differs.
Decision: Use a common CPU model in the VM config. In Proxmox, choose a baseline like x86-64-v2/v3 (depending on your fleet) or a named QEMU CPU model supported by all nodes.
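To see exactly which flags differ between two nodes (helpful when choosing between v2 and v3 baselines), diff the flag lists. A sketch assuming a bash shell and GNU userland on both nodes:
cr0x@server:~$ diff \
  <(lscpu | awk -F: '/^Flags/ {print $2}' | tr ' ' '\n' | grep -v '^$' | sort) \
  <(ssh root@10.10.10.12 lscpu | awk -F: '/^Flags/ {print $2}' | tr ' ' '\n' | grep -v '^$' | sort)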
Task 11: Confirm QEMU versions match closely enough (don’t mix major eras casually)
cr0x@server:~$ pveversion -v | egrep 'pve-manager|pve-qemu-kvm|kernel'
pve-manager/8.2.2/9355359cd
pve-qemu-kvm: 8.1.5-5
kernel: 6.8.12-4-pve
What it means: If source is on QEMU 8.1 and target on something older/newer with incompatible machine types or device models, you can hit migration incompatibilities.
Decision: Keep nodes on the same Proxmox/QEMU major versions. Do rolling upgrades with a policy: migrate VMs away from the node being upgraded, not across incompatible versions mid-upgrade.
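A quick way to eyeball version drift across the cluster (node names are the ones used in this article; extend the list to match yours):
cr0x@server:~$ for n in pve1 pve2; do echo "== $n =="; ssh root@$n "pveversion -v | grep -E 'pve-manager|pve-qemu-kvm|kernel'"; done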
Task 12: Inspect the migration attempt details from the task log
cr0x@server:~$ grep -Rh "TASK ERROR\|migrat" /var/log/pve/tasks/ 2>/dev/null | tail -n 30
UPID:pve1:00003C2A:0001B2F4:676D5F9A:qmigrate:113:root@pam:
start migration of VM 113 to node 'pve2' (10.10.10.12)
migration aborted (duration 00:00:19): storage 'local-lvm' is not available on node 'pve2'
TASK ERROR: migration aborted
What it means: Storage gating check failed. This isn’t bandwidth; it’s “you asked for shared storage but you don’t have it.”
Decision: Either move disks to shared storage first, enable “with local disks” migration intentionally, or make the storage available on the target node.
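If you decide local disk migration is acceptable, say so explicitly and point it at a storage that exists on the target. A sketch reusing the shared ceph-vm storage from this article; confirm the options with qm help migrate on your version:
cr0x@server:~$ qm migrate 113 pve2 --online --with-local-disks --targetstorage ceph-vm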
Task 13: Determine whether VM disks are on shared storage or local storage
cr0x@server:~$ qm config 113 | grep -E '^(scsi|virtio|sata|ide)[0-9]+:'
scsi0: local-lvm:vm-113-disk-0,size=120G
scsi1: ceph-vm:vm-113-disk-1,size=500G
What it means: Mixed storage. One disk is local-lvm (node-local), the other is Ceph (shared). Live migration without local disk migration will fail or be blocked.
Decision: Decide: either relocate scsi0 to shared storage (best) or accept local disk migration costs/time.
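Relocating the local disk to shared storage ahead of time is usually the better trade. It can be done while the VM runs, at the cost of IO load during the copy. A sketch (on older releases the subcommand was qm move_disk; --delete 1 removes the source volume after a successful move):
cr0x@server:~$ qm disk move 113 scsi0 ceph-vm --delete 1
cr0x@server:~$ qm config 113 | grep ^scsi0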
Task 14: Verify target node has enough free memory and no ballooning surprises
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 251Gi 118Gi 11Gi 2.1Gi 122Gi 133Gi
Swap: 8.0Gi 0.0Gi 8.0Gi
What it means: You’ve got headroom. Remember: migration needs memory on the destination to receive pages, and the VM may spike during switchover.
Decision: If available memory is tight, do not gamble. Migrate a different VM off first, or schedule a controlled outage.
Task 15: Check whether the VM is dirtying memory too fast (pre-copy may not converge)
cr0x@server:~$ qm monitor 113
qm> info migrate
capabilities: xbzrle: on compress: off events: on postcopy-ram: off
Migration status: active
transferred ram: 7032.4 MB
remaining ram: 6141.2 MB
total ram: 16384.0 MB
duplicate: 82.1 MB
skipped: 0.0 MB
normal: 6950.3 MB
dirty pages rate: 61200 pages/s
expected downtime: 340 ms
What it means: Dirty page rate is high. If that stays high, migration might never finish or will extend downtime beyond your threshold.
Decision: If it doesn’t converge, throttle the workload, enable migration compression carefully, or use postcopy for specific cases (with eyes open about risk).
Task 16: Validate Ceph health if VM disks are on Ceph (because migration stresses it)
cr0x@server:~$ ceph -s
cluster:
id: 9c0ad9b4-1b2b-4b4d-a8b8-4f9b4b0f2a71
health: HEALTH_WARN
1 slow ops, oldest one blocked for 31 sec, osd.7 has slow ops
services:
mon: 3 daemons, quorum a,b,c (age 2h)
mgr: a(active, since 2h)
osd: 12 osds: 12 up (since 2h), 12 in (since 7d)
data:
pools: 3 pools, 256 pgs
objects: 1.20M objects, 4.6 TiB
usage: 13 TiB used, 21 TiB / 34 TiB avail
pgs: 254 active+clean
What it means: Ceph is not fully happy. Slow ops can translate into VM IO latency spikes during migration and can stall storage migration.
Decision: If Ceph is degraded or slow, postpone heavy migrations. Fix the storage subsystem first; it will not “heal faster” under load.
Joke #2: If your storage is already warning about slow ops, adding live migration is like “testing” a parachute by jumping twice.
Network checks that catch 80% of failures
When migration fails fast with connection errors, it’s usually network or SSH. When it fails slowly (or hangs at a percentage), it’s usually throughput, loss, or a VM that won’t converge. Your job is to turn “network seems fine” into measurable reality.
Verify which IPs and routes are actually used
Proxmox can be configured to use a migration network, but Linux routing still decides how packets leave the box. If you have multiple NICs/VLANs, confirm the route to the target migration IP uses the intended interface.
cr0x@server:~$ ip route get 10.10.10.12
10.10.10.12 dev vmbr1 src 10.10.10.11 uid 0
cache
Interpretation: Good: it’s using vmbr1 with a source IP on the migration subnet.
Decision: If it routes via the management NIC, fix routing or adjust migration network configuration so you don’t saturate the wrong link.
Measure loss and latency like an adult
Loss kills TCP throughput; jitter makes convergence harder. Use mtr if available, or at least a decent ping run. For a migration network on the same switch, you want boring numbers.
cr0x@server:~$ ping -c 50 10.10.10.12 | tail -n 5
--- 10.10.10.12 ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 50062ms
rtt min/avg/max/mdev = 0.312/0.398/0.911/0.089 ms
Decision: Any loss in a datacenter LAN is a red flag. Fix cabling, switch ports, NIC offloads, or congestion before blaming Proxmox.
Validate bandwidth (quick-and-dirty)
Migrations are basically large memory copies plus overhead. If you expect a 16–64 GB VM to migrate “fast,” you need real throughput. If you don’t have iperf3, install it. It’s a diagnostic tool, not a lifestyle.
cr0x@server:~$ iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
cr0x@server:~$ iperf3 -c 10.10.10.12 -P 4 -t 10
Connecting to host 10.10.10.12, port 5201
[SUM] 0.00-10.00 sec 9.72 GBytes 8.35 Gbits/sec 0 sender
[SUM] 0.00-10.04 sec 9.70 GBytes 8.31 Gbits/sec receiver
Decision: If you’re getting 1–2 Gbps on a “10G” link, migration will feel stuck. Fix the network (bonding config, duplex, switch, VLAN, NIC drivers, offloads) before tuning QEMU.
Watch the real-time traffic during migration
Don’t guess whether migration is saturating a link. Look.
cr0x@server:~$ ip -s link show dev vmbr1 | sed -n '1,12p'
4: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 2c:fd:a1:11:22:33 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
98234982344 81234567 0 12 0 0
TX: bytes packets errors dropped carrier collsns
112334455667 92345678 0 0 0 0
Decision: If drops increment during migration, you’ve found your villain. Drops are not “fine if small.” They amplify retransmits and stall progress.
Keep firewalls predictable
In tightly controlled environments, the Proxmox firewall is good—until someone enables it without modeling the traffic. Migration uses:
- SSH from source to destination for orchestration.
- QEMU migration channels (often tunneled/managed by Proxmox; behavior varies with “secure migration”).
- Storage traffic (Ceph, NFS, iSCSI) which may spike during and after migration.
What you want is a firewall policy that’s explicit: node-to-node allow on migration networks, and strict everywhere else. “Drop everything and see what breaks” is not a security strategy; it’s a job security strategy for your incident commander.
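In practice, "explicit" is a couple of rules scoped to the migration subnet. A node-level host.fw fragment might look like the sketch below; the SSH macro syntax follows the Proxmox firewall documentation, while the TCP port range for non-tunneled (insecure) migration is an assumption you should confirm against your version's docs before relying on it:
[RULES]
# SSH from the migration subnet (orchestration and secure/tunneled migration)
IN SSH(ACCEPT) -source 10.10.10.0/24 -log nolog
# QEMU migration port range for insecure migration -- assumption: 60000:60050/tcp
IN ACCEPT -source 10.10.10.0/24 -p tcp -dport 60000:60050 -log nolog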
CPU flags and machine type mismatches
CPU compatibility is the cleanest “hard stop” failure category. QEMU can’t teleport CPU features. Migration requires the destination to present a CPU that is compatible with what the guest OS already saw.
The common trap: cpu: host in a mixed cluster
cpu: host exposes the host's CPU model and features to the guest. That's great when you're pinning a VM to a node forever or you truly have identical CPUs across the cluster. It's a foot-gun otherwise.
On heterogeneous hardware, choose a baseline CPU type so the guest sees a consistent virtual CPU everywhere. Yes, you might give up a few instructions. No, your application probably won’t notice. The alternative is “migration doesn’t work,” which is a noticeable performance regression.
Baseline CPU models: pick one and standardize
What to pick depends on your fleet:
- x86-64-v2: conservative baseline for broad compatibility (good when you have older nodes).
- x86-64-v3: more modern baseline with wider instruction set; good when your fleet is newer and consistent.
- Named QEMU models (e.g., Haswell, Skylake-Server): useful in Intel-only fleets, but can be too specific.
Your goal isn’t maximum performance; it’s predictable mobility. If you need max performance for a special VM, document that it’s pinned and not migratable across all nodes.
Machine type alignment is not optional
QEMU machine types like pc-q35-8.1 encode chipset/device model versions. If you have mismatched QEMU packages across nodes, you can hit a device state mismatch during migration.
Operational policy: keep the cluster on the same Proxmox version as much as possible. During upgrades, migrate workloads away from the node being upgraded, upgrade it, then return workloads. Don’t do “half-cluster on new QEMU” and expect smooth live migration for every device combination.
PCI passthrough and vGPU: special pain
If a VM uses hostpci devices, live migration is usually a non-starter unless you have very specific hardware and tooling. The device state can’t be safely moved, and the destination may not have the same physical device mapping.
cr0x@server:~$ qm config 210 | egrep 'hostpci|machine|cpu'
cpu: host
machine: pc-q35-8.1
hostpci0: 0000:65:00.0,pcie=1
Decision: Treat passthrough VMs as pinned. Design maintenance around them (failover at app layer, scheduled outages, or cold migration).
One reliability quote worth keeping on your desk
Hope is not a strategy.
— Traditional SRE saying
This line gets repeated in ops because it’s brutally applicable: don’t “hope” CPUs match, don’t “hope” MTU is consistent, don’t “hope” storage is shared. Verify.
Storage: shared, local, Ceph, ZFS, and the migration data path
Storage is where migrations go to die slowly. Network and CPU mismatches kill migration fast; storage misdesign makes it hang, crawl, or “work” with unacceptable downtime.
Know what you’re migrating: compute state vs disk state
- Live migration (shared storage): move RAM+CPU+device state; disks stay put on shared storage (Ceph RBD, NFS, iSCSI, etc.). Fast-ish, predictable.
- Live migration with local disks: move RAM+state and copy disk blocks over the network. This can be brutal for big disks and busy VMs.
- Storage migration: copy disks between storages; can be done online for some setups, but it’s still heavy IO.
If you don’t have shared storage and you expect fast live migration, you’re not planning; you’re auditioning for an incident report.
Verify storage definitions and availability on both nodes
cr0x@server:~$ pvesm status
Name Type Status Total Used Available %
ceph-vm rbd active 34.00TiB 13.00TiB 21.00TiB 38.24%
local dir active 1.80TiB 0.72TiB 1.08TiB 40.00%
local-lvm lvmthin active 1.75TiB 1.62TiB 130.00GiB 92.70%
What it means: local-lvm is dangerously full. If you attempt local disk migration, you may run out of space mid-copy.
Decision: If thin pools are above ~85–90%, treat it as an incident risk. Free space or expand before migrating disks around.
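For lvmthin specifically, the numbers that matter are the data and metadata usage of the thin pool itself, not just the storage summary. A sketch; the volume group name pve is the stock-install default, so adjust it for your layout:
cr0x@server:~$ lvs -o lv_name,lv_size,data_percent,metadata_percent pve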
ZFS specifics: local ZFS is not shared storage
ZFS is great. Local ZFS is still local. Live migration requires either shared storage or a disk move mechanism.
If you’re using ZFS on each node, the right mental model is: you can use replication (ZFS send/receive) to pre-stage disks, then do a controlled cutover. That’s not the same as “live migrate anytime.”
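Proxmox wraps ZFS send/receive in its replication framework (pvesr), so pre-staging can be scheduled instead of hand-scripted. A sketch; the job ID 113-0 and the 15-minute schedule are made-up examples, and both nodes need a ZFS storage with the same storage ID:
cr0x@server:~$ pvesr create-local-job 113-0 pve2 --schedule "*/15"
cr0x@server:~$ pvesr status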
Ceph specifics: separate networks, separate failure modes
Ceph typically uses a “public” network (clients) and possibly a “cluster” network (replication/backfill). Migration traffic is separate. But operationally they collide because migration increases IO and CPU load, which makes a marginal Ceph cluster show its true personality.
Before large migration events, make Ceph boring:
- All OSDs up/in.
- No deep-scrub storms.
- No recovery/backfill overloads.
- Latency within normal bounds.
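A pre-flight sketch for that list; the scrub flags are optional, and whatever you set you must unset once the window is over:
cr0x@server:~$ ceph health detail
cr0x@server:~$ ceph osd set noscrub
cr0x@server:~$ ceph osd set nodeep-scrub
cr0x@server:~$ # ...run the migrations, then:
cr0x@server:~$ ceph osd unset noscrub && ceph osd unset nodeep-scrub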
Local disk migration: decide if it’s worth it
If you must migrate local disks online, at least quantify it:
- Disk size (allocated and actual).
- Available bandwidth between nodes.
- Write rate of the VM (copy-on-write overhead).
- Business tolerance for degraded performance during copy.
In many production environments, the correct decision is: schedule a maintenance window and do a cold move, or redesign storage for shared access.
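Quantifying it is one line of arithmetic. For example, a 120 GiB allocated disk over a link that sustains roughly 950 MiB/s (about what the iperf3 run earlier suggests) needs around two minutes of raw copy time, before copy-on-write overhead and the VM's own writes stretch it out:
cr0x@server:~$ awk 'BEGIN { printf "%.1f minutes\n", 120 * 1024 / 950 / 60 }'
2.2 minutes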
Common mistakes: symptom → root cause → fix
1) “Migration failed: can’t connect to destination”
- Symptom: Fails within seconds, SSH exit code 255 in logs.
- Root cause: SSH key mismatch, changed host keys, firewall blocking SSH, wrong destination IP/route.
- Fix: Validate that ssh -o BatchMode=yes root@dest works from the source node on the migration IP. Fix routing/hosts/firewall; re-establish cluster SSH trust if needed.
2) Migration stuck at 0% or barely moves
- Symptom: Progress doesn’t advance; UI shows running task for a long time.
- Root cause: Wrong network path (using 1G mgmt instead of 10G), MTU issues causing retransmits, packet loss, or firewall state tracking problems.
- Fix: Confirm the route with ip route get and test throughput (iperf3). Validate jumbo pings with -M do. Fix MTU, routes, or disable misbehaving offloads.
3) “CPU feature mismatch” / “host doesn’t support requested features”
- Symptom: Fails quickly with CPU-related errors; often appears only in QEMU logs.
- Root cause: VM uses cpu: host or an overly specific CPU model; destination lacks a required feature flag.
- Fix: Set VM CPU to a cluster-wide baseline. Keep nodes on consistent hardware families where possible.
4) “storage ‘local-lvm’ is not available on node …”
- Symptom: Migration blocked before it starts; task log mentions storage not available.
- Root cause: VM disk located on node-local storage; target node doesn’t have that storage ID (by design).
- Fix: Move disks to shared storage first, or migrate with local disks intentionally, or adjust storage definitions if you actually intended shared storage.
5) Migration completes but VM pauses too long (“live” felt very offline)
- Symptom: Migration succeeds but downtime is seconds+; application sees timeouts.
- Root cause: VM dirties RAM too fast; pre-copy doesn’t converge; network throughput insufficient; disk IO latency spikes (especially on shared storage under load).
- Fix: Migrate during lower write activity, consider tuning migration settings (compression/postcopy) cautiously, and fix underlying storage latency.
6) Migration fails only for certain VMs (others migrate fine)
- Symptom: Some VMs migrate; others consistently fail.
- Root cause: Those VMs have passthrough devices, special CPU flags, huge pages configuration differences, or specific machine type/device combinations.
- Fix: Normalize VM hardware profiles. Document exceptions (pinned VMs). Don’t pretend every workload is mobile.
7) Migration breaks right after enabling Proxmox firewall
- Symptom: Previously working migrations now fail; other traffic might still be okay.
- Root cause: Missing node-to-node allows for migration network or related services; stateful rules drop long-lived flows.
- Fix: Implement explicit cluster node allowlists on migration networks. Test migrations as part of firewall change rollout.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
The company had two Proxmox nodes that “looked the same” in the rack: same chassis, same vendor, same number of NICs. Someone ordered them six months apart and assumed “same model” meant “same CPU.” It did not. One shipped with a newer stepping and a different microcode baseline, and the other was a different CPU family entirely due to supply chain substitutions.
They set most VMs to cpu: host because performance. It worked fine for months because they never had to migrate the heavy hitters—until a firmware update on node A required a reboot and they tried to evacuate workloads. The first migration failed instantly. The second one too. Now they had a node scheduled for maintenance and a node that couldn’t accept the workloads.
In the incident call, the assumption that “live migration is a button” wasted time. People chased Ceph latency, then firewall rules, then HA settings. The clue was sitting in qm config the whole time: cpu: host. A quick lscpu comparison ended the mystery.
The fix was not heroic. They changed VM CPU models to a baseline supported across both nodes, tested migration on one VM, then rolled it across the fleet. Performance impact was negligible. The lesson stuck: if you want mobility, you must budget for it in CPU feature exposure.
Mini-story 2: The optimization that backfired
A different org got fancy with jumbo frames. They enabled MTU 9000 on the hosts because “10G is expensive and we should optimize.” But the network team only enabled jumbo on some switch ports. A few VLAN trunks were still at 1500. Nobody had a complete end-to-end diagram, because of course they didn’t.
Normal traffic looked fine. Management ping worked. SSH worked. Small packets don’t care much. Then they tried migrating a memory-heavy VM. It would start, transfer a bit, then stall and eventually fail. Sometimes it succeeded but took forever. The team blamed Proxmox, then QEMU, then Ceph, then their life choices.
The breakthrough came from doing the unsexy test: a large ping with “do not fragment.” It failed. Not “maybe.” Failed. The migration network had a path that silently black-holed jumbo packets, leading to fragmentation issues, retransmits, and terrible throughput.
They fixed it by standardizing MTU across the entire migration path. And here’s the twist: after the fix, they went back to MTU 1500 anyway. Why? Operational simplicity. They chose predictable performance over theoretical optimization. That’s what production systems teach you: the fastest network is the one that works every day.
Mini-story 3: The boring but correct practice that saved the day
One team treated live migration as a feature that must be continuously tested, not a button you discover during an outage. Every week, they ran a small set of migrations: one Linux VM, one Windows VM, one “busy” VM, and one VM with UEFI. They logged results and kept a running baseline of migration time and downtime.
It was dull. Nobody got promoted for “everything still works.” But the practice paid off when a switch firmware update introduced intermittent packet loss on a specific VLAN. The first sign wasn’t users complaining—it was the weekly migration check showing a sudden slowdown and occasional failures.
Because they had a baseline, they could say “this changed” and prove it with data: ping loss, iperf drop, and migration duration doubling. The network team took it seriously because the evidence was tight and reproducible.
They rolled back the firmware, restored the baseline, and then upgraded again with proper validation. No major incident. The boring practice didn’t just “save the day”; it prevented a 2 a.m. emergency that would have involved too many people and not enough coffee.
Checklists / step-by-step plan
Checklist A: Before you attempt live migration in production
- Cluster health: pvecm status shows quorate. If not, stop.
- Version alignment: pveversion -v is consistent across nodes (especially pve-qemu-kvm).
- Migration network: Decide whether you use one. If yes, configure it and verify with large ping tests.
- SSH trust: Node-to-node ssh -o BatchMode=yes works on the migration IPs.
- CPU model policy: Fleet baseline CPU model decided; avoid cpu: host unless nodes are truly homogeneous.
- Storage policy: Shared storage for migratable VMs, or a documented plan for local disks.
- Exception list: Passthrough/vGPU/edge-case VMs are documented as pinned.
- Capacity headroom: Destination has memory and CPU headroom for the VM and migration overhead.
Checklist B: When a migration fails (15-minute incident-mode plan)
- Grab the real error: check journalctl -u pvedaemon and the task logs.
- Classify the failure: connect/auth vs CPU compatibility vs storage gating vs slow/convergence.
- Network sanity: ip route get, ping -M do -s ..., a quick iperf3 if allowed.
- CPU sanity: qm config VMID for CPU type; compare nodes with lscpu.
- Storage sanity: qm config to find disk locations; pvesm status and (if Ceph) ceph -s.
- Retry once after a targeted fix. Not five times. Repeated retries create noise, lock contention, and sometimes partial state.
- Escalate with evidence: bring the exact error line and the command outputs. “Migration broke” is not a ticket; it’s a mood.
Checklist C: After you fix it (so it stays fixed)
- Standardize VM CPU models across the migratable fleet.
- Standardize MTU across migration networks, or standardize on 1500.
- Schedule regular test migrations and record duration/downtime.
- Document pinned VMs and why they are pinned.
- Keep node versions aligned; avoid long-running mixed-version clusters.
FAQ
1) Do I need shared storage for live migration?
If you want fast, reliable live migration: yes, in practice. Without shared storage you can migrate local disks, but you’re now doing a giant online copy and calling it “live.” That may be acceptable for small disks and quiet VMs, not for busy production databases.
2) Why does migration work for some VMs but not others?
Different virtual hardware profiles. CPU type set to host, device passthrough, different machine types, or special QEMU args can make a VM non-migratable even when the cluster is healthy.
3) Is cpu: host always bad?
No. It’s fine for single-node “performance first” VMs or truly identical clusters. It’s bad when you assume hardware uniformity that you don’t enforce. If you want mobility, pick a baseline CPU model and stick to it.
4) What’s the fastest way to catch an MTU problem?
Large “do not fragment” ping on the migration IP: ping -M do -s 8972 target for MTU 9000. If it fails, jumbo isn’t working end-to-end. Fix it or stop using jumbo.
5) Why does migration hang around a certain percentage?
Often it’s pre-copy not converging due to a high dirty page rate, or network throughput collapsing due to loss/retransmits. Run qm monitor VMID, type info migrate at the prompt, and look at the dirty pages rate and migration status.
6) Can I migrate between nodes on different Proxmox/QEMU versions?
Sometimes, but it’s not a lifestyle you should adopt. Migration compatibility is best when QEMU versions and machine types align. Rolling upgrades should evacuate a node, upgrade it, then reintroduce it—minimizing mixed-version operation.
7) How do I tell if my migration is using the wrong NIC?
ip route get <destination migration IP> tells you which interface and source IP will be used. If it’s not your migration bridge/NIC, fix routes or migration network configuration.
8) What about Windows VMs and UEFI—any special migration risks?
UEFI itself is usually fine if both nodes support it and the VM config is consistent. Problems tend to come from device differences, machine type mismatches, or storage layout changes. Keep firmware/machine type consistent and avoid ad-hoc hardware changes.
9) Should I enable migration compression?
Only when you’ve measured a network bottleneck and you have CPU headroom. Compression trades CPU for bandwidth. On CPU-bound hosts, it makes everything worse, including the workload you’re trying not to disrupt.
10) What if I have Ceph and migration still fails?
Then the failure is likely not “shared disk availability” but network/CPU/firewall or Ceph health under load. Check ceph -s and watch for slow ops. Ceph can be technically up and still operationally miserable.
Conclusion: next steps that keep you sane
When Proxmox live migration fails, don’t treat it like a mystical event. It’s a set of predictable prerequisites: stable network path, compatible CPU exposure, and storage architecture that matches your expectations.
Next steps that actually reduce future incidents:
- Pick a VM CPU policy (baseline model for migratable workloads) and enforce it.
- Make the migration network boring: consistent MTU end-to-end, verified throughput, explicit firewall allowances.
- Stop pretending local disks are shared. If you need mobility, design for shared storage or a replication-driven cutover process.
- Run test migrations regularly and keep a baseline. If you only test during outages, you’re doing chaos engineering with customers.
If you do nothing else: verify the migration IP path and stop using cpu: host in mixed hardware clusters. Those two changes eliminate a shocking amount of pain.