Proxmox “migration aborted”: common reasons and the repair path

“migration aborted” is Proxmox’s way of saying: something went wrong, the system bailed, and now you’re staring at two nodes that disagree about where a VM lives. It’s the virtualization equivalent of a house key snapping in the lock—nothing is on fire, but you’re not getting in without tools.

This is a field guide for operators who run Proxmox in anger. We’ll triage fast, figure out the real failure mode (network, storage, cluster state, guest config, or human optimism), fix it cleanly, and then make it harder to happen again.

What “migration aborted” really means in Proxmox

In Proxmox VE, migrations come in a few flavors (rough CLI equivalents are sketched right after the list):

  • Live migration (QEMU memory state moves while the VM is running). Usually requires shared storage or a replication strategy that makes the disk available on the target.
  • Offline migration (VM is stopped; config and disks are moved). Slower, often simpler, sometimes the only sane choice.
  • Storage migration (move disks between storage backends, optionally without changing the compute node).
  • Replication-assisted approaches (ZFS replication or Ceph/RBD, where disk locality and consistency rules differ).
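
For orientation, the CLI entry points look roughly like this. VMID 101, node pve2, and the storage names are placeholders, and exact options vary by PVE version, so verify against qm help migrate and the pvesr man page before relying on them:

cr0x@server:~$ qm migrate 101 pve2 --online                          # live migration of a running VM
cr0x@server:~$ qm migrate 101 pve2                                   # offline migration (VM stopped)
cr0x@server:~$ qm disk move 101 scsi0 ceph-rbd                       # storage migration of one disk (qm move_disk on older versions)
cr0x@server:~$ pvesr create-local-job 101-0 pve2 --schedule "*/15"   # ZFS replication job toward the target node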

“migration aborted” is not a single error. It’s a top-level failure state for a multi-step workflow:

  1. Cluster checks (permissions, config, quorum, target node readiness).
  2. Connectivity checks (SSH and/or migration network).
  3. Storage checks (does the disk exist on target? is it shared? enough space? correct storage ID?).
  4. Startup of the QEMU migration channel (pre-copy memory sync, dirty-page tracking).
  5. Final switchover / cleanup (the part that hurts when it fails).

When it aborts, Proxmox may have already done some work: created a transient config on the target, reserved space, started a receiver process, or partially copied disks. Your job is to determine what state you’re in now, not what you hoped the state would be.

Paraphrased idea from James Hamilton (AWS reliability engineering): “Operate by minimizing variability and learning from every failure.” It’s not poetic. It’s correct.

Short joke #1: Live migration is like moving apartments while still cooking dinner—possible, but the odds of spilling soup are non-zero.

Fast diagnosis playbook (check these first)

If you only have five minutes before someone starts “helping,” do this in order. The goal is to identify whether the bottleneck is cluster state, network, storage, or guest constraints.

1) Confirm cluster health and target node readiness

  • Is there quorum? Is corosync flapping?
  • Is the target node in a clean state and not fenced or partially disconnected?
  • Is the VM locked?

2) Pull the task log, then the node logs

  • Proxmox task log gives the first meaningful error line.
  • System logs tell you the real reason (auth failure, permission denial, network reset, OOM, disk full).

3) Identify the storage topology and whether it matches the migration type

  • Shared storage (Ceph, NFS, iSCSI, etc.) makes compute migration easier.
  • Local storage implies disk move/replication and bandwidth constraints.
  • ZFS datasets, thin LVM, and qcow2 behave differently during copy and snapshot handling.

4) Check the migration network path

  • MTU mismatch, asymmetric routing, firewall rules, or a busy bond can kill a migration.
  • Latency spikes can extend pre-copy indefinitely until timeout or operator abort.

5) Check guest blockers

  • Huge pages, passthrough devices, local-only CD-ROM/ISO, or unsupported CPU flags can block or degrade migration.
  • Memory ballooning and dirty-page rate can make live migration “never finish.”

Once you’ve classified the problem, you stop thrashing. You pick the repair path and follow it to completion. That’s the difference between operators and button-clickers.

Interesting facts and context (why this fails the way it does)

  • Fact 1: Proxmox VE uses Corosync for cluster membership and pmxcfs (a FUSE filesystem) to distribute configuration under /etc/pve. If pmxcfs is unhappy, migrations get weird fast.
  • Fact 2: QEMU live migration relies on a pre-copy algorithm: it sends memory pages while the VM runs, then retries “dirty” pages. High write rates can prevent convergence.
  • Fact 3: The migration channel is not “just SSH.” Proxmox uses SSH for orchestration, but QEMU opens its own migration stream (often on a dedicated network) that can be blocked even when SSH works.
  • Fact 4: VM configs in Proxmox are stored as plain-text files under /etc/pve/qemu-server/. A partially written or stale config lock can derail retries.
  • Fact 5: Ceph-backed VMs (RBD) usually migrate compute state without copying disks, but you’re now betting on Ceph health, client permissions, and network paths to the Ceph public and cluster networks.
  • Fact 6: ZFS replication in Proxmox is snapshot-based. It is reliable, but it is not magic: missing snapshots, dataset renames, or “helpful” manual zfs commands can break the chain.
  • Fact 7: The “VM lock” mechanism exists to prevent concurrent operations. If a migration aborts at the wrong time, the lock can stick and block everything until cleared.
  • Fact 8: MTU mismatches commonly show up as “works for small transfers, breaks for large ones.” Migrations are large ones.

Common root causes, with repair paths

A) Cluster state problems (quorum, corosync, pmxcfs)

What it looks like: Migration fails instantly, sometimes with generic errors; GUI may show nodes “?”; configs don’t appear on all nodes; tasks fail with “no quorum” or cannot write under /etc/pve.

What’s really happening: Proxmox needs consistent cluster membership to safely move resources. Without quorum, it refuses to make changes. If pmxcfs is degraded, even reading/writing VM configs can fail.

Repair path: Stabilize cluster first. If your cluster isn’t healthy, don’t migrate. Fix corosync links, quorum, and time sync. Then retry migration.
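
A minimal stability pass before retrying, using the standard units on a PVE node (adjust the time window to your incident):

cr0x@server:~$ systemctl status corosync pve-cluster --no-pager
cr0x@server:~$ journalctl -u corosync -u pve-cluster --since "30 min ago" --no-pager | tail -n 50
cr0x@server:~$ timedatectl | grep -i synchronized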

B) SSH and auth failures (the orchestration layer)

What it looks like: “migration aborted: ssh error” or failures early in the task log. Sometimes it’s intermittent—because nothing says “enterprise” like rotating keys mid-day.

What’s really happening: Proxmox uses SSH between nodes for coordination and for certain file transfers. Wrong host keys, broken known_hosts, permissions, or a forced command in authorized_keys can kill it.

Repair path: Verify root SSH access between nodes (as Proxmox expects), fix host key mismatches, confirm sshd allows what you need, and retry.
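
A quick sketch for testing the orchestration path between nodes (pve2 is a placeholder; pvecm updatecerts refreshes cluster SSH keys and known_hosts and is a common first fix for host-key mismatches):

cr0x@server:~$ ssh -o BatchMode=yes root@pve2 true && echo "ssh ok"
cr0x@server:~$ pvecm updatecerts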

C) Migration network path issues (firewall, MTU, routing, congestion)

What it looks like: Migration starts then aborts; logs show “connection reset,” timeouts, or QEMU migration socket errors. Sometimes it gets to 90% then dies. That’s a hint: initial coordination worked; the data plane didn’t.

What’s really happening: QEMU’s migration stream is sensitive to packet loss and path MTU issues. Firewalls may allow SSH but block the migration port range. Or you’re running migration over a busy production network that’s also carrying Ceph traffic, backups, and your colleague’s “temporary” rsync.

Repair path: Confirm MTU end-to-end, ensure firewall rules allow the migration traffic, put migration on a dedicated interface if you can, and stop saturating the link.
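
One common setup is pinning migration traffic to a dedicated subnet, either cluster-wide in /etc/pve/datacenter.cfg or per job on the CLI. A sketch, assuming a 10.10.0.0/24 migration network; the values are placeholders:

cr0x@server:~$ cat /etc/pve/datacenter.cfg
keyboard: en-us
migration: secure,network=10.10.0.0/24
cr0x@server:~$ qm migrate 101 pve2 --online --migration_network 10.10.0.0/24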

D) Storage mismatch: shared vs local reality

What it looks like: “storage ‘local-lvm’ not found on target,” “volume does not exist,” “cannot open disk image,” or migration aborts after copying some disks.

What’s really happening: The VM config references storage IDs. If the target node doesn’t have the same storage definition (same ID, compatible type), Proxmox can’t place the disk. For local storage, you must either move disks (storage migration) or replicate them (ZFS replication, backups/restore, etc.).

Repair path: Align storage definitions across nodes, confirm space, verify volumes, and choose the correct migration method (live compute move vs offline with disk move).
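
To see the mismatch directly, compare the storage view on both nodes and check the cluster-wide definition (storage IDs live in /etc/pve/storage.cfg; the names below are examples):

cr0x@server:~$ pvesm status
cr0x@server:~$ ssh root@pve2 pvesm status
cr0x@server:~$ grep -A4 'local-lvm' /etc/pve/storage.cfg
cr0x@server:~$ pvesm set local-lvm --nodes pve1,pve2    # only if the backend genuinely exists on both nodes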

E) Disk copy failures (space, permissions, slow storage, snapshot weirdness)

What it looks like: rsync or qemu-img copy fails; “No space left on device”; “Input/output error”; migration is painfully slow then aborts.

What’s really happening: Disk migration is a storage workload: metadata-heavy for qcow2, sequential for raw, and punishing for fragmented thin-provisioned volumes. Also: if the underlying storage is degraded (ZFS errors, Ceph recovery), you’ll discover it during migration.

Repair path: Check space, storage health, and I/O errors. If storage is sick, stop migrating and fix storage. If it’s just slow, schedule downtime and do offline.
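
A short pre-flight before any disk copy; a sketch with the usual placeholder names (qm disk move is the newer alias of qm move_disk):

cr0x@server:~$ df -h /var/lib/vz
cr0x@server:~$ dmesg -T | egrep -i 'i/o error|blk_update' | tail -n 20
cr0x@server:~$ zpool status -x                          # ZFS users: "all pools are healthy" or the bad news
cr0x@server:~$ qm disk move 101 scsi0 ceph-rbd          # move the disk to shared storage first; the source is kept unless you pass --delete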

F) Guest constraints (CPU model, passthrough, hugepages, TPM, local media)

What it looks like: Migration aborts with messages about devices, CPU incompatibility, or “cannot migrate with device.” Sometimes it aborts after QEMU starts because the destination can’t recreate the exact same virtual hardware.

What’s really happening: Live migration requires the destination to emulate a compatible CPU and device set. PCI passthrough, some USB devices, and certain configurations are migration-hostile. TPM state can also complicate matters depending on setup.

Repair path: Use a compatible CPU baseline (e.g., x86-64-v2/v3), avoid passthrough for migratable VMs, detach local media, and choose offline migration when necessary.
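
Typical remediations, with example VMID and drive names (pick a CPU model the target actually advertises, as in Task 15 below):

cr0x@server:~$ qm set 101 --cpu x86-64-v2-AES           # baseline CPU model instead of 'host'
cr0x@server:~$ qm set 101 --ide2 none,media=cdrom       # eject the local ISO
cr0x@server:~$ qm config 101 | grep -E 'hostpci|usb'    # passthrough devices that will block live migration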

G) Locks, leftovers, and half-migrated state

What it looks like: VM shows as locked; subsequent operations say “resource is locked”; config exists on both nodes; disks exist in two places with similar names.

What’s really happening: A migration is a transaction with cleanup steps. If it aborts, the cleanup might not run. You can end up with a lock that blocks future operations and artifacts on the target that confuse the next attempt.

Repair path: Identify where the VM is actually running, then remove stale locks and delete only the artifacts you can prove are safe to delete. “Prove” means: check volume references, VMIDs, and storage content—twice.
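
A careful cleanup pass might look like this; the destructive step stays commented out until you have proven the volume is an orphan (VMID and storage names are examples):

cr0x@server:~$ qm unlock 101
cr0x@server:~$ pvesm list local-lvm --vmid 101                       # volumes on this node claiming VMID 101
cr0x@server:~$ grep -r 'vm-101-disk' /etc/pve/nodes/*/qemu-server/   # is any config, on any node, still referencing them?
cr0x@server:~$ # pvesm free local-lvm:vm-101-disk-1                  # only once proven orphaned, and only on the non-authoritative node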

Practical tasks: commands, what the output means, and the decision you make

These are the moves I reach for in production. Each task includes a command, what “good” and “bad” look like, and what you decide next. Run them on the source and target nodes as appropriate.

Task 1: Grab the exact failure from the Proxmox task log

cr0x@server:~$ tail -n 80 /var/log/pve/tasks/active
UPID:pve1:000A1B2C:0F2B3C4D:676D2A1E:qmigrate:101:root@pam:
cr0x@server:~$ cat /var/log/pve/tasks/*/UPID:pve1:000A1B2C:0F2B3C4D:676D2A1E:qmigrate:101:root@pam: | tail -n 60
starting migration of VM 101 to node 'pve2' (192.168.10.12)
found local disk 'local-lvm:vm-101-disk-0' (in current node)
migration aborted (duration 00:00:14): storage 'local-lvm' not available on target node
TASK ERROR: migration aborted

Meaning: The task log usually contains one line that matters. Here it’s a storage ID mismatch.

Decision: Stop debugging networking. Fix storage definitions or do a storage migration/backup-restore approach.

Task 2: Verify cluster quorum and membership

cr0x@server:~$ pvecm status
Cluster information
-------------------
Name:             prod-cluster
Config Version:   42
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Dec 26 14:12:06 2025
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.2c
Quorate:          Yes

Meaning: Quorate: Yes means cluster writes are allowed. If it’s “No,” migrations are a bad idea.

Decision: If not quorate: fix cluster network/time, or restore quorum. Do not “force” migrations as a workaround unless you enjoy surprise split-brain.

Task 3: Check corosync link health

cr0x@server:~$ corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
        addr    = 10.10.0.11
        status  = OK
LINK ID 1 udp
        addr    = 10.20.0.11
        status  = OK

Meaning: Links should be OK. Flapping links cause pmxcfs delays and “random” failures.

Decision: If a link is down: fix networking before migration. Corosync stability is not optional.

Task 4: Confirm pmxcfs is mounted and responsive

cr0x@server:~$ mount | grep pve
pve on /etc/pve type fuse.pve (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
cr0x@server:~$ ls -la /etc/pve/qemu-server | head
total 0
drwxr-xr-x 2 root www-data 0 Dec 26 13:58 .
drwxr-xr-x 8 root www-data 0 Dec 26 13:58 ..
-rw-r----- 1 root www-data 512 Dec 26 13:40 101.conf

Meaning: If /etc/pve isn’t mounted, you’re not in a healthy cluster state.

Decision: Fix pmxcfs/corosync before anything else. Migrating with a broken config filesystem is how you earn weekend work.

Task 5: Check whether the VM is locked (and why)

cr0x@server:~$ qm config 101 | grep -i lock
lock: migrate

Meaning: A lock indicates an operation in progress or an aborted operation that didn’t clean up.

Decision: Don’t clear locks blindly. First determine where the VM is actually running (Task 6). If it’s stable and you’re cleaning up, then you can unlock.

Task 6: Determine where the VM is really running

cr0x@server:~$ qm status 101
status: running
cr0x@server:~$ pvesh get /cluster/resources --type vm | awk '$1 ~ /qemu/ && $2 ~ /101/ {print}'
qemu 101 running pve1

Meaning: The cluster resource view should show one owner node. If it shows “unknown” or flips, your cluster is unstable.

Decision: If VM is running on source, treat target artifacts as suspect. If it’s running on target, stop trying to “re-migrate” until configs and disks are aligned.

Task 7: Clear a stale lock (only after you verified reality)

cr0x@server:~$ qm unlock 101

Meaning: Unlock removes the operation lock, not the underlying problem.

Decision: If unlocking is required to proceed with cleanup (delete partial volumes, retry migration), do it. If you’re unlocking to “make the error go away,” stop and re-read the task log.

Task 8: Validate storage definitions match across nodes

cr0x@server:~$ pvesm status
Name        Type     Status     Total     Used    Available   %
local       dir      active    196G      22G        164G     11%
local-lvm   lvmthin  active    900G     610G        290G     67%
ceph-rbd    rbd      active     10T       6T          4T     60%

Meaning: The Name column (storage ID) must exist on both source and target for the VM’s disks. Same ID, compatible backend.

Decision: If local-lvm exists on source but not target: either define it on target (if appropriate) or move disks to a shared storage first.

Task 9: Check the actual disk volumes referenced by the VM

cr0x@server:~$ qm config 101 | egrep '^(scsi|virtio|sata|ide)[0-9]+:'
scsi0: local-lvm:vm-101-disk-0,discard=on,iothread=1,size=80G
scsi1: ceph-rbd:vm-101-disk-1,size=200G

Meaning: Mixed storage is common. It is also a common migration trap: one disk is shared, the other is not.

Decision: Either move scsi0 to shared storage (storage migration) or accept an offline migration with disk transfer.

Task 10: Verify free space where the disk would land

cr0x@server:~$ lvs -o lv_name,vg_name,lv_attr,lv_size,data_percent,metadata_percent pve
  LV            VG  Attr       LSize   Data%  Meta%
  data          pve twi-aotz-- 900.00g 67.11  12.03
  root          pve -wi-ao----  96.00g
  swap          pve -wi-ao----   8.00g
  vm-101-disk-0 pve Vwi-aotz--  80.00g 92.30

Meaning: For thin pools, watch Data% on the pool and the specific LV. High pool usage means the next write might fail mid-migration.

Decision: If pool usage is high: free space, extend the pool, or don’t migrate yet. “Disk full during migration” is a classic partial-state generator.

Task 11: Check Ceph health before blaming Proxmox

cr0x@server:~$ ceph -s
  cluster:
    id:     6e0e1d8a-5a1c-4b7a-a7a0-3f1b3e8c9c51
    health: HEALTH_WARN
            1 osds down
            Degraded data redundancy: 12/345678 objects degraded (0.003%)

  services:
    mon: 3 daemons, quorum a,b,c
    mgr: x(active), standbys: y
    osd: 9 osds: 8 up, 9 in

  data:
    pools:   4 pools, 512 pgs
    objects: 115k objects, 420 GiB
    usage:   1.2 TiB used, 8.8 TiB / 10 TiB avail
    pgs:     510 active+clean, 2 active+degraded

Meaning: Live migration with Ceph relies on stable Ceph I/O. If Ceph is degraded, your VM might be fine, but migrations and restarts can become roulette.

Decision: If Ceph isn’t healthy: fix Ceph first, then migrate. The migration is not the place to discover you were one OSD away from a bad day.

Task 12: Validate the migration network and MTU (the silent killer)

cr0x@server:~$ ip -br link show | egrep 'vmbr|bond|enp'
bond0            UP             3c:fd:fe:aa:bb:cc
vmbr0            UP             3c:fd:fe:aa:bb:cc
enp65s0f0        UP             3c:fd:fe:aa:bb:cc
cr0x@server:~$ ip link show vmbr0 | grep mtu
7: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
cr0x@server:~$ ping -M do -s 8972 -c 3 10.10.0.12
PING 10.10.0.12 (10.10.0.12) 8972(9000) bytes of data.
8980 bytes from 10.10.0.12: icmp_seq=1 ttl=64 time=0.412 ms
8980 bytes from 10.10.0.12: icmp_seq=2 ttl=64 time=0.401 ms
8980 bytes from 10.10.0.12: icmp_seq=3 ttl=64 time=0.398 ms

--- 10.10.0.12 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2036ms

Meaning: If jumbo frames are configured, they must work end-to-end. If you see “Frag needed” errors or packet loss, expect migration pain.

Decision: If MTU doesn’t match: either fix the network or drop MTU to 1500 consistently. Mixed MTUs are performance theater with a high casualty rate.

Task 13: Check firewall state and rules (SSH working is not proof)

cr0x@server:~$ pve-firewall status
Status: enabled/running
cr0x@server:~$ iptables -S | head -n 20
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N PVEFW-FORWARD
-N PVEFW-INPUT
-N PVEFW-OUTPUT

Meaning: If Proxmox firewall is enabled, ensure migration traffic is allowed on the interfaces used. The GUI can show it, but CLI truth is faster during incidents.

Decision: If rules are blocking migration ports: fix rules. Don’t “temporarily disable firewall” unless you also enjoy writing incident reports.

Task 14: Look for QEMU-side migration errors in journald

cr0x@server:~$ journalctl -u pvedaemon -u pveproxy -u pvestatd --since "1 hour ago" | tail -n 80
Dec 26 13:57:22 pve1 pvedaemon[1923]: starting migration of VM 101 to node 'pve2'
Dec 26 13:57:33 pve1 pvedaemon[1923]: migration aborted: QEMU exited with code 1
Dec 26 13:57:33 pve1 pvedaemon[1923]: ERROR: vm 101 - unable to migrate - VM uses local cdrom 'local:iso/installer.iso'

Meaning: This is the kind of “obvious in hindsight” blocker that the top-level error hides.

Decision: Detach the local ISO or ensure the same ISO path/storage exists on the target.

Task 15: Validate CPU compatibility settings

cr0x@server:~$ qm config 101 | grep -E '^cpu:'
cpu: host,flags=+aes
cr0x@server:~$ pvesh get /nodes/pve1/capabilities/qemu/cpu | head
cputype
kvm64
qemu64
x86-64-v2-AES
x86-64-v3

Meaning: cpu: host is fast but ties you to the source CPU features. Migrating to a node with different microarchitecture can fail or force a reset.

Decision: For clusters with mixed CPUs, standardize on a baseline CPU model for migratable VMs.

Task 16: Inspect replication status (ZFS replication users)

cr0x@server:~$ pvesr status
JobID      Enabled    Target        LastSync             NextSync             Duration  FailCount State
101-0      Yes        local/pve2    2025-12-26_13:40:02  2025-12-26_14:00:00      11.2          0 OK

Meaning: If replication is stale or failing, migrations that assume “disk already there” will abort or boot the wrong disk version.

Decision: Fix replication errors first; don’t use migration as a replication debugging tool.

Common mistakes: symptom → root cause → fix

1) Symptom: “storage not available on target”

Root cause: Storage ID referenced in VM config doesn’t exist on target, or exists but is different (dir vs lvmthin vs zfs).

Fix: Align storage definitions across nodes (pvesm status), or migrate disks to shared storage first, or do offline migration with disk move.

2) Symptom: Migration aborts after starting, with “connection reset”

Root cause: MTU mismatch, firewall blocking migration stream, or congested link causing QEMU to time out.

Fix: Test MTU with ping -M do, verify firewall rules, move migrations to a dedicated network, or reduce traffic during the window.

3) Symptom: VM remains locked after failure

Root cause: Cleanup didn’t complete; lock stayed in config.

Fix: Verify where VM runs (pvesh get /cluster/resources), then qm unlock <vmid>. Clean partial disks only after proving they’re not in use.

4) Symptom: “no quorum” or cluster shows nodes as unknown

Root cause: Corosync link failure, split network, or time sync issues leading to membership instability.

Fix: Restore corosync communication. Don’t migrate until pvecm status is quorate and stable.

5) Symptom: Migration fails only for some VMs

Root cause: Those VMs use cpu: host, passthrough devices, local ISOs, or special device models incompatible with the target.

Fix: Standardize CPU models, remove non-migratable hardware for those guests, ensure media is accessible from both nodes, or perform offline migration.

6) Symptom: Migration “hangs” at high percentage for a long time

Root cause: Dirty-page rate exceeds migration bandwidth; memory keeps changing faster than it can be copied.

Fix: Reduce guest write churn (temporarily), increase migration bandwidth, or switch to offline migration for that workload window.
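
If bandwidth is the lever you want to pull, check whether a cluster-wide cap applies (the bwlimit key in /etc/pve/datacenter.cfg is in KiB/s) and override it per job; the numbers and names below are placeholders:

cr0x@server:~$ grep -E 'bwlimit|migration' /etc/pve/datacenter.cfg
cr0x@server:~$ qm migrate 101 pve2 --online --bwlimit 1048576        # per-job override, here roughly 1 GiB/s; size it to your link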

7) Symptom: Migration aborts and target has leftover volumes

Root cause: Disk copy partially completed; target now has orphaned volumes or partial datasets.

Fix: Identify volumes belonging to the VMID, confirm they’re not referenced by any config, then remove them. If unsure, keep them and change the plan to restore from backup/replication.

Short joke #2: “Just retry it” is not a strategy; it’s how you turn one error into a recurring calendar event.

Three corporate-world mini-stories (realistic, anonymized)

Mini-story 1: The incident caused by a wrong assumption

They had a two-node Proxmox cluster supporting internal tooling. Nothing fancy: a couple dozen VMs, local LVM-thin on each node, and nightly backups. Someone asked for maintenance on node A, so the on-call engineer decided to “just live migrate everything to node B.”

The assumption was simple: “Migration is a compute move.” They’d used live migration in another environment with shared storage and forgot that in this cluster, disk lived locally. The first migration threw “migration aborted,” and the engineer interpreted it as “network blip” and tried again. And again.

Each attempt created partial disk artifacts on the target. LVM-thin is efficient until it isn’t: metadata grew, the thin pool started to fill, and unrelated VMs on node B began to see I/O pauses. The migration problem turned into a platform problem.

They eventually stopped and looked at the VM config. local-lvm wasn’t a shared storage; it was two separate islands with the same name in human conversation, not in Proxmox reality. The fix was boring: stop doing live migrations, plan an offline migration with storage move, or restore from backup onto the other node.

The real lesson was operational: migration is a storage topology decision first and a compute decision second. If you don’t know where the blocks are, you don’t know what you’re moving.

Mini-story 2: The optimization that backfired

A different company built a “fast lane” network for Proxmox migration and Ceph traffic. Jumbo frames enabled, bonding configured, separate VLAN, the works. The graphs looked great. The operators felt great. This is where the plot thickens.

A switch in the path had one port configured at MTU 1500 due to a template mismatch. Most traffic worked fine. SSH worked. The Ceph cluster mostly worked because it was resilient and because small packets were common. But live migrations intermittently aborted mid-stream with vague socket errors.

Engineers chased ghosts: Proxmox versions, kernel upgrades, QEMU options. A few tried “just disable firewall.” It didn’t help. Someone even blamed a specific VM workload because “it dirties memory too fast.” That was partially true, but not the root cause.

The breakthrough came from a simple MTU probe. Large ICMP with “do not fragment” failed across the migration network. Fixing MTU on that one switch port turned migrations from “coin flip” to boring.

The backfire wasn’t jumbo frames; it was inconsistent jumbo frames. Optimizations that require perfect consistency across infrastructure should be treated like production changes, not like “network seasoning.”

Mini-story 3: The boring but correct practice that saved the day

A team running Proxmox with mixed CPU generations standardized VM CPU types early. They didn’t use cpu: host for anything that might migrate. They picked a baseline CPU model and documented it. It cost them a bit of peak performance. It saved them repeatedly.

One afternoon a node began throwing ECC memory errors. Hardware was degraded but still limping. They needed to evacuate VMs quickly to avoid the kind of corruption you only notice during quarterly audits. Live migration was the plan, and it had to work.

Because CPU compatibility was already standardized, they didn’t get stuck on “destination cannot support requested features.” Because storage IDs were consistent across nodes, they didn’t spend time creating emergency storage mappings. They had a dedicated migration network that was tested quarterly with an MTU probe and an iperf run.

The evacuation succeeded. The postmortem had almost no drama. The “hero move” was a checklist they’d been following for months: consistent CPU models, consistent storage naming, periodic validation of the migration path.

Reliability is mostly repetition. The exciting part is optional.

Checklists / step-by-step plan (clean recovery and prevention)

Step-by-step: Recover cleanly after a migration abort

  1. Stop retrying. Pull the task log and identify the first actionable error line.
  2. Confirm VM reality. Where is it running? Which node owns it according to cluster resources?
  3. Stabilize the cluster. If quorum or corosync is unstable, fix that first.
  4. Check locks. If the VM is locked due to migration and you need to proceed, unlock only after you know which node is authoritative.
  5. Check storage topology. Are disks on shared storage? Do storage IDs exist on both nodes?
  6. Check disk artifacts on target. Identify orphaned volumes/datasets; don’t delete anything you can’t attribute.
  7. Choose the right migration method:
    • Shared storage healthy → live migration is reasonable.
    • Local storage involved → plan storage migration or offline migration.
    • Replication present → ensure replication is current and consistent.
  8. Retry once, with intent. Same error twice means you didn’t change the conditions. Don’t collect errors like trading cards. A condensed command pass is sketched below.
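
Condensed into commands, one pass over an aborted job might look like this; VMID 101 and the node/storage names are placeholders, and nothing here deletes data:

cr0x@server:~$ grep qmigrate /var/log/pve/tasks/active | tail -n 3
cr0x@server:~$ pvecm status | grep -i quorate
cr0x@server:~$ pvesh get /cluster/resources --type vm | grep -w 101
cr0x@server:~$ qm config 101 | grep -iE 'lock|^(scsi|virtio|sata|ide)[0-9]+:'
cr0x@server:~$ qm unlock 101                            # only after the owning node is confirmed
cr0x@server:~$ pvesm list local-lvm --vmid 101          # half-copied volumes on the target?
cr0x@server:~$ qm migrate 101 pve2 --online             # one retry, after the conditions actually changed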

Prevention checklist: make “migration aborted” rarer

  • Standardize storage IDs across nodes. Same names, same types, same expectations.
  • Baseline CPU model for migratable workloads. Avoid cpu: host unless the VM is pinned by design.
  • Separate networks where practical: management, corosync, storage, migration. At least isolate bandwidth contention.
  • Validate MTU end-to-end (quarterly, after switch changes, after firmware). Jumbo frames are a contract. A probe sketch follows this list.
  • Monitor storage health (Ceph, ZFS, SMART). Migrations amplify weak signals.
  • Document non-migratable VMs (passthrough, licensing constraints, special devices) so on-call doesn’t discover it mid-incident.
  • Practice migration on a schedule. If you only migrate during emergencies, you’re doing chaos engineering without the benefits.
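
The recurring validation can be this small (iperf3 must be installed on both nodes; 8972 assumes MTU 9000, use 1472 for MTU 1500; addresses are placeholders):

cr0x@server:~$ ping -M do -s 8972 -c 3 10.10.0.12
cr0x@server:~$ iperf3 -s -D                             # on the target node: start the server, daemonized
cr0x@server:~$ iperf3 -c 10.10.0.12 -t 10 -P 4          # on the source node: 10 seconds, 4 parallel streams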

FAQ

1) Why does Proxmox show only “migration aborted” without a clear reason?

Because the top-level task is a wrapper around multiple subsystems. The actionable line is usually earlier in the task log or in journald for pvedaemon/qemu.

2) Can I live migrate a VM using local-lvm?

Not as a pure compute move. If disks are local to the source node, you need disk availability on the target (shared storage, replication, or a disk copy). Otherwise Proxmox will abort.
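
If you do want to move a running VM together with its local disks, qm migrate can copy them as part of the job. A sketch with placeholder names; the target storage must exist and have space, and expect the disk copy to dominate the runtime:

cr0x@server:~$ qm migrate 101 pve2 --online --with-local-disks --targetstorage local-lvm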

3) Is it safe to run qm unlock after a failed migration?

It’s safe after you confirm where the VM is actually running and what artifacts exist. Unlocking is not destructive, but it can allow you (or automation) to perform destructive actions next.

4) Why does migration fail at 90–99% and then abort?

That’s often the final switchover or convergence phase. Common causes: dirty-page rate too high, network instability, or destination failing to recreate a device/CPU feature set.

5) Does “SSH works” mean migration networking is fine?

No. SSH is the control plane. QEMU migration is a data plane stream that can be blocked by firewall rules, MTU issues, or routing differences even when SSH is perfect.

6) What’s the fastest way to tell if storage is the problem?

Check the VM config disk lines and compare them to pvesm status on the target. If a referenced storage ID doesn’t exist or isn’t shared, it’s a storage problem.

7) How do I handle leftover target disks after a failed storage migration?

Identify volumes by VMID, confirm they’re not referenced by any VM config on the target or source, then remove them. If there’s uncertainty, keep them and recover via backup/restore to avoid deleting the only good copy.

8) Why do CPU settings break migration?

Live migration requires the destination to offer a compatible virtual CPU. Using cpu: host exposes host-specific CPU features; if the target can’t match them, QEMU can refuse to migrate or the VM may not start safely.

9) Can Ceph issues show up as “migration aborted” even if VMs seem fine?

Yes. Migration stresses storage with metadata operations, reconnections, and sometimes concurrent IO bursts. A mildly degraded Ceph cluster can “work” until you ask it to do something ambitious at the same time.

10) When should I stop attempting live migration and just go offline?

When the workload has a high dirty-page rate, when the network is constrained, when storage is local and needs copying anyway, or when you’re in an incident and need determinism more than elegance.

Next steps (what to do after you fix it)

After the migration succeeds—or after you choose the correct offline path—do these practical follow-ups:

  1. Write down the root cause in your ops notes. The next abort will look “new” unless you label it.
  2. Normalize storage IDs and CPU models across the cluster. This is one of those boring chores that pays rent every month.
  3. Test the migration network with an MTU probe and a throughput test during a quiet window. Make it a recurring task, not a heroic discovery.
  4. Audit VM configs for local ISOs, passthrough devices, and “host CPU” use. Decide which VMs are meant to be migratable, and configure them accordingly.
  5. Clean up artifacts from aborted migrations: stale volumes, old snapshots, and leftover replication jobs. Do it with evidence, not vibes.

The goal isn’t to eliminate failures. The goal is to make failures legible, recoverable, and less frequent. “migration aborted” becomes boring once your cluster, network, and storage stop surprising each other.
