“QEMU exited with code 1” is Proxmox’s way of shrugging while your pager screams. It’s not a root cause. It’s a status light: something failed while starting the VM, and Proxmox is reporting the least helpful part of the story.
The fix is almost never “restart the node” (although yes, that sometimes works, like turning your laptop upside down to improve Wi‑Fi). The fix is to read the right log line, in the right place, and understand which subsystem actually blocked QEMU: storage, networking, permissions, config, kernel features, or a stale lock from last Tuesday.
What “exit code 1” really means in Proxmox
Proxmox starts a VM by constructing a QEMU command line and handing it to qemu-system-* (usually via the /usr/bin/kvm wrapper) under a service context. If QEMU immediately returns 1, Proxmox prints the generic “QEMU exited with code 1” banner. That line is not the failure; it’s the epilogue.
Think in layers: which component rejected the launch?
- Config layer: bad VM config syntax, deprecated keys, impossible device combination, missing disk file, wrong controller type, duplicate PCI addresses.
- Permission / security layer: AppArmor/SELinux denial, wrong ownership/permissions on disk images, failed tap device creation due to privileges.
- Storage layer: ZFS zvol busy, LVM-thin activation failure, Ceph/RBD map errors, NFS stale handles, iSCSI session dead, lock timeouts.
- Kernel / virtualization layer: KVM not available, nested virt missing, CPU flags mismatch, hugepages/NUMA pinned but unavailable.
- Networking layer: missing bridge, firewall rules interfering with tap setup, failed tap creation (an MTU mismatch usually isn’t a start failure, but a tap that can’t be created is).
- Resource layer: no RAM, no file descriptors, exhausted inotify, process limits, out-of-space on storage used for logs or state.
The professional move is to find the first concrete error message QEMU printed. Everything else is theater.
One quote worth keeping on your wall:
“Hope is not a strategy.” — Gen. H. Norman Schwarzkopf
You can restart the node and hope. Or you can trace the failure, fix it once, and sleep.
Fast diagnosis playbook (check 1/2/3)
1) Start with the journal for the exact VMID and timestamp
If you do only one thing: capture the log line right before QEMU dies.
- Look for: “cannot open”, “Permission denied”, “Device or resource busy”, “failed to get”, “could not set up”, “lock timeout”, “invalid argument”.
- Decision: if the error names a file/path/device, you’re in storage/permissions land. If it names a bridge/tap, you’re in networking land. If it mentions KVM, CPU, or accel, you’re in kernel/virt land.
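A quick way to surface those strings without reading the whole journal is a filtered pass over the relevant window; the patterns below are a starting set, not an exhaustive list, so adjust them to the error families above.
cr0x@server:~$ journalctl --since "15 min ago" --no-pager | grep -Ei "cannot open|permission denied|resource busy|could not set up|lock timeout|invalid argument" | tail -n 40
The first match near your start attempt is the line to chase; everything after it is usually consequence.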
2) Confirm whether a stale lock or zombie QEMU process exists
Proxmox has a lock mechanism; QEMU has its own PID lifecycle; storage backends sometimes keep devices “busy” after a crash.
- Look for: a lock file present, qm list showing “running” while it isn’t, an old qemu-system-x86_64 still alive, a ZFS zvol held open.
- Decision: if there’s a stale lock, clear it safely (after confirming the VM is not actually running). If a QEMU process exists, don’t start a second one; kill the right PID and clean up.
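A minimal check, assuming default paths: on typical installs qemu-server takes a per-VM flock under /var/lock/qemu-server/ (verify the exact path on your version), and config-level locks show up in qm config.
cr0x@server:~$ ls -l /var/lock/qemu-server/lock-101.conf
cr0x@server:~$ qm config 101 | grep -i '^lock'
If the flock file exists but nothing is running, it’s usually harmless; the config-level lock is the one covered in Task 9.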
3) Validate the backend the VM depends on (storage + network) is healthy
“Code 1” often means the VM config is fine and the world underneath it is not.
- Storage: check pool health, thinpool, Ceph status, NFS mount state, free space/inodes.
- Network: check bridge exists, tap creation works, firewall isn’t failing to load rules.
- Decision: if the backend is degraded, fix the backend first. Starting VMs on a broken substrate is how you grow an incident into a career change.
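One command gives a first read on every configured backend from this node’s point of view; a storage shown as inactive, or with obviously wrong totals, is your cue to stop and fix the substrate before touching the VM.
cr0x@server:~$ pvesm status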
Joke #1: Exit code 1 is QEMU’s way of saying “I’m not mad, I’m just disappointed.”
Where the real error lives: logs and processes
Proxmox layers that emit clues
- pvedaemon / pveproxy: UI task logs, API failures, authentication context.
- qemu-server: builds the QEMU command line; logs why it refused to proceed.
- systemd + journald: canonical timeline; shows QEMU’s stderr, tap errors, permission denials.
- storage stack: ZFS, LVM, Ceph, NFS, iSCSI, multipath; each has its own logs and commands.
- kernel ring buffer: KVM failures, I/O errors, VFIO, AppArmor denials, bridge issues.
How to read a Proxmox task log like an adult
The UI’s “Task viewer” is decent for timestamps and high-level messages. The problem is it often truncates the exact QEMU stderr line you needed. Treat it as an index, not the source of truth. When you see “QEMU exited with code 1,” immediately pivot to journalctl and the VM’s config and storage objects.
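If you want the full task output instead of the truncated view, the task logs also live on disk; on typical installs they’re under /var/log/pve/tasks/ (an index file plus per-UPID logs), but treat the exact layout as version-dependent and verify locally.
cr0x@server:~$ grep qmstart:101 /var/log/pve/tasks/index | tail -n 3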
What you’re hunting for
You want the first failure line that names a resource. Examples of “good” errors:
- could not open disk image /dev/zvol/rpool/data/vm-101-disk-0: Device or resource busy
- failed to create tun device: Operation not permitted
- kvm: failed to initialize: No such file or directory
- Cannot access storage 'ceph-rbd' (500)
- unable to parse value of 'net0'
“Bad” errors are generic wrappers. Ignore them. Keep scrolling up.
Practical tasks: commands, outputs, decisions (12+)
Everything below is designed to be runnable on a Proxmox node with typical packages. Replace 101 with your VMID and adjust storage names.
Task 1: Pull the last start attempt for a VM from journald
cr0x@server:~$ journalctl -u pvedaemon -u pveproxy -u pvescheduler --since "15 min ago" | tail -n 120
Dec 26 10:21:33 pve01 pvedaemon[2214]: starting task UPID:pve01:00005A2B:0001B8F2:676D1E5D:qmstart:101:root@pam:
Dec 26 10:21:34 pve01 pvedaemon[2214]: VM 101 qmp command failed - got timeout
Dec 26 10:21:34 pve01 pvedaemon[2214]: VM 101 - unable to open disk image '/dev/zvol/rpool/data/vm-101-disk-0': Device or resource busy
Dec 26 10:21:34 pve01 pvedaemon[2214]: VM 101 - QEMU exited with code 1
What it means: you have a specific resource: a zvol path is busy. That’s storage, not “Proxmox being weird.”
Decision: stop looking at VM config; investigate who holds the zvol open (Task 7) and whether the VM is half-running (Task 2).
Task 2: Check whether Proxmox thinks the VM is running, and whether a QEMU PID exists
cr0x@server:~$ qm status 101
status: stopped
cr0x@server:~$ pgrep -a -f "qemu-system.*-id 101"
What it means: if qm status says stopped but you still see a QEMU process, you’ve got a stale state or a previous start attempt that didn’t die cleanly.
Decision: if QEMU exists, inspect it (Task 3) and shut it down cleanly or kill it after confirming it’s the right one.
Task 3: Inspect the exact QEMU command line Proxmox tried to run
cr0x@server:~$ qm showcmd 101 --pretty
/usr/bin/kvm \
-id 101 \
-name 'app-prod-01,debug-threads=on' \
-machine type=pc-q35-8.1+pve0 \
-m 8192 \
-smp 4,sockets=1,cores=4,maxcpus=4 \
-drive if=none,id=drive-scsi0,format=raw,file=/dev/zvol/rpool/data/vm-101-disk-0,cache=none,aio=io_uring,discard=on \
-device virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5 \
-netdev tap,id=net0,ifname=tap101i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on \
-device virtio-net-pci,mac=BC:24:11:AA:BB:CC,netdev=net0,bus=pci.0,addr=0x12 \
...
What it means: you can now reproduce or reason about failures: disk path, cache mode, aio backend, machine type, netdev setup.
Decision: if a particular option looks suspicious (e.g., aio=io_uring on an old kernel), you have a concrete experiment: change one thing, retry.
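A clean one-variable experiment, using the disk line from Task 4 as a template: change only the AIO backend, retry, and put it back if it makes no difference. The volume spec below is illustrative; copy yours verbatim from the existing config so nothing else changes by accident.
cr0x@server:~$ qm set 101 --scsi0 rpool:vm-101-disk-0,discard=on,cache=none,aio=threads
cr0x@server:~$ qm start 101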
Task 4: Read the VM configuration for parse errors and drift
cr0x@server:~$ cat /etc/pve/qemu-server/101.conf
boot: order=scsi0;net0
cores: 4
memory: 8192
name: app-prod-01
net0: virtio=BC:24:11:AA:BB:CC,bridge=vmbr0,firewall=1
scsi0: rpool:vm-101-disk-0,discard=on,cache=none,aio=io_uring
scsihw: virtio-scsi-pci
What it means: config is stored in pmxcfs and should be consistent cluster-wide. Small typos (especially in args: lines) can hard-fail QEMU.
Decision: if you see hand-edited exotic settings, plan to revert to defaults temporarily to isolate the issue.
Task 5: Tail the qemu-server logs for the VMID
cr0x@server:~$ journalctl -t pve-qemu-server -S "30 min ago" | grep -E "VM 101|101" | tail -n 80
Dec 26 10:21:34 pve01 pve-qemu-server[58312]: VM 101 - start failed: unable to open disk image '/dev/zvol/rpool/data/vm-101-disk-0': Device or resource busy
Dec 26 10:21:34 pve01 pve-qemu-server[58312]: VM 101 - failed to run /usr/bin/kvm: exit code 1
What it means: confirms it’s QEMU failing, not a scheduler bug or API auth issue.
Decision: pivot to storage-level investigation (Tasks 6–9).
Task 6: Verify ZFS pool health and recent errors
cr0x@server:~$ zpool status -x
all pools are healthy
cr0x@server:~$ zpool status rpool
pool: rpool
state: ONLINE
status: Some supported features are not enabled on the pool.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support the features.
scan: scrub repaired 0B in 00:12:41 with 0 errors on Thu Dec 26 09:10:05 2025
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
errors: No known data errors
What it means: if you see read/write/checksum errors, your “code 1” might be the first symptom of storage flakiness. Healthy pool means look for zvol busy/holds/locks.
Decision: if pool is degraded, stop VM churn and fix the pool first (replace disk, resilver, scrub).
Task 7: Find which process holds a zvol or disk device open
cr0x@server:~$ ls -l /dev/zvol/rpool/data/vm-101-disk-0
lrwxrwxrwx 1 root root 13 Dec 26 10:20 /dev/zvol/rpool/data/vm-101-disk-0 -> ../../../zd16
cr0x@server:~$ lsof /dev/zd16 | head
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
qemu-syst 4121 root 15u BLK 230,16 0t0 123 /dev/zd16
What it means: the disk is busy because a QEMU process already has it open. This could be a previous VM instance, a backup job, or a stuck migration.
Decision: inspect PID 4121, confirm it’s safe to terminate, then cleanly stop it (Task 8). Don’t delete locks blindly until you understand who is holding the device.
Task 8: Stop a stuck VM process (prefer graceful, then force)
cr0x@server:~$ qm stop 101
trying to stop VM 101...
VM 101 stopped
cr0x@server:~$ kill -TERM 4121
cr0x@server:~$ sleep 2; ps -p 4121 -o pid,cmd
PID CMD
What it means: if the process disappears after TERM, great. If it stays, it may be stuck in uninterruptible I/O (D state), which points back to storage.
Decision: if it’s in D state, don’t spam kill -9 and call it “fixed.” Investigate I/O path (NFS/iSCSI/Ceph). If it’s a normal stuck process, a controlled kill can be acceptable after validation.
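To tell the two apart before reaching for kill -9, look at the process state; STAT “D” is uninterruptible sleep, and the wchan column hints at what the kernel is waiting on. PID 4121 here is the holder found in Task 7.
cr0x@server:~$ ps -o pid,stat,wchan:20,cmd -p 4121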
Task 9: Check Proxmox VM locks and remove them safely
cr0x@server:~$ qm config 101 | grep -i lock
lock: backup
cr0x@server:~$ qm unlock 101
unlocking VM 101
What it means: Proxmox can lock a VM during backup, migration, snapshot, etc. If the lock remains after a failed job, starts can fail or behave unpredictably.
Decision: only unlock after confirming the related operation is not actually running (check backup logs, running tasks, and processes). Unlocking mid-backup is how you get partial snapshots and awkward meetings.
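Before unlocking, a quick sweep for anything that might legitimately own the lock; the active-tasks file path is the typical one, so verify it on your install, and pgrep covers backup workers that never made it into a visible task.
cr0x@server:~$ grep ':101:' /var/log/pve/tasks/active
cr0x@server:~$ pgrep -a vzdump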
Task 10: Confirm storage free space and inode availability
cr0x@server:~$ df -hT /var/lib/vz /var/log
Filesystem Type Size Used Avail Use% Mounted on
rpool/ROOT/pve-1 zfs 96G 74G 22G 78% /
rpool/var-log zfs 10G 9.8G 256M 98% /var/log
cr0x@server:~$ df -i /var/log
Filesystem Inodes IUsed IFree IUse% Mounted on
rpool/var-log 524288 11234 513054 3% /var/log
What it means: a nearly full /var/log (or root fs) can cause QEMU to fail when it tries to write state, logs, or sockets. Inodes matter too, especially on ext4-backed mounts.
Decision: if usage is high, rotate logs, expand datasets, or move chatty services. Don’t “just delete random logs” unless you want to delete the evidence you need.
Task 11: Check KVM availability and whether virtualization is actually enabled
cr0x@server:~$ ls -l /dev/kvm
crw-rw---- 1 root kvm 10, 232 Dec 26 08:02 /dev/kvm
cr0x@server:~$ kvm-ok
INFO: /dev/kvm exists
KVM acceleration can be used
cr0x@server:~$ dmesg | tail -n 20
[ 112.345678] kvm: VMX supported
[ 112.345689] kvm: Nested virtualization enabled
What it means: no /dev/kvm or failures here can lead to QEMU exiting early, sometimes with misleading messages. On some systems QEMU will still run without KVM but Proxmox configs may assume KVM features.
Decision: if KVM is unavailable, check BIOS/UEFI VT-x/AMD-V, kernel modules (kvm_intel/kvm_amd), and whether you’re inside a hypervisor without nested virt.
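Two quick checks cover most of that decision: whether the CPU virtualization flags are visible at all, and whether the KVM modules are loaded. A count of zero on the first command usually means VT-x/AMD-V is off in firmware or you’re inside a hypervisor without nested virtualization.
cr0x@server:~$ grep -c -E 'vmx|svm' /proc/cpuinfo
cr0x@server:~$ lsmod | grep -E '^kvm'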
Task 12: Look for AppArmor denials that block QEMU
cr0x@server:~$ journalctl -k --since "30 min ago" | grep -i apparmor | tail -n 20
Dec 26 10:21:34 pve01 kernel: audit: type=1400 apparmor="DENIED" operation="open" profile="pve-qemu-kvm" name="/mnt/pve/nfs-share/images/101/vm-101-disk-0.qcow2" pid=58345 comm="qemu-system-x86" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
What it means: QEMU is being blocked by a security profile when trying to open a disk image. That’s a clean root cause.
Decision: fix the storage path and AppArmor profile expectations (often by ensuring the storage is configured through Proxmox, mounted under the right path, and not an ad-hoc mount), or adjust policy if you know exactly what you’re doing.
Task 13: Validate bridges and tap creation prerequisites
cr0x@server:~$ ip link show vmbr0
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 5a:1e:4c:00:11:22 brd ff:ff:ff:ff:ff:ff
cr0x@server:~$ ip tuntap show | head
tap101i0: tap persist vnet_hdr
What it means: missing vmbr0 or inability to create tap devices can cause QEMU to exit instantly. If tap101i0 already exists unexpectedly, it may be a leftover interface from a failed start.
Decision: if bridge is missing, fix host networking. If tap exists but VM is stopped, remove the stale tap (carefully) after confirming no process uses it.
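Removing a stale tap is one command, but only after confirming the VM is stopped and no process is attached to it; the interface name comes from the VM’s netdev line (tap<vmid>i<index>).
cr0x@server:~$ ip link delete tap101i0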
Task 14: Check for failed storage activation on LVM-thin
cr0x@server:~$ lvs -a -o +devices
LV VG Attr LSize Pool Origin Data% Meta% Devices
pve/data pve twi-aotz-- <1.82t 78.12 2.34 /dev/sda3(0)
vm-101-disk-0 pve Vwi-a-tz-- 64.00g data 91.02 /dev/sda3(12345)
cr0x@server:~$ lvchange -ay pve/vm-101-disk-0
1 logical volume(s) in volume group "pve" now active
What it means: if activation fails or the thinpool is full, QEMU may fail opening the block device. Thin provisioning can lie until it can’t.
Decision: if thinpool is near 100% or metadata is exhausted, you need to free space or extend the pool before you start anything else.
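The numbers that matter for that decision are the thinpool’s own data and metadata usage, not just the per-volume figure; the column names below are standard lvs fields.
cr0x@server:~$ lvs -o lv_name,data_percent,metadata_percent pve/data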
The usual suspects: failure patterns and how to prove them
1) Storage path problems: “cannot open …” is almost always literal
When QEMU can’t open the disk, it exits. No disk, no VM. Causes include:
- ZFS zvol held open by a lingering process
- NFS stale file handle after server failover
- Ceph/RBD mapping failure or auth mismatch
- LVM-thin out of space or metadata full
- Permissions wrong on a directory-mounted storage
Prove it by checking: the exact path in qm showcmd, existence of the device/file, and the backend’s health (ZFS/LVM/Ceph/NFS commands).
2) Locking and concurrency: Proxmox protects you… until it doesn’t
Proxmox uses locks to avoid two operations modifying a VM simultaneously (backup + start, snapshot + migration, etc.). After crashes or failed jobs, locks can remain.
Prove it by reading the VM config for a lock: entry and checking running tasks in the journal. Remove locks only when you’re sure the related operation is dead.
3) Networking setup: tap/bridge failures kill the VM early
If QEMU can’t create or attach the tap interface, it exits with code 1. Common causes: missing bridge, broken /etc/network/interfaces syntax, firewall scripts failing, or privileges/capabilities messed up.
Prove it by checking ip link show, presence of vmbrX, and journald lines mentioning tap, tun, bridge, or vhost.
4) Kernel features and accelerators: KVM, VFIO, hugepages
QEMU is flexible, but Proxmox’s typical VM configs assume KVM and certain CPU features. If you’re trying GPU passthrough (VFIO), the failure modes multiply: IOMMU groups, device reset quirks, permissions on /dev/vfio/*, ROM issues.
Prove it by correlating QEMU stderr lines with kernel messages (journalctl -k, dmesg), and by validating devices exist and are bound to the correct driver.
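For passthrough specifically, the usual proof is “which driver owns the device right now.” The PCI address below is illustrative; take yours from the hostpciX line in the VM config, and expect vfio-pci as the driver in use when passthrough is set up correctly.
cr0x@server:~$ lspci -nnk -s 01:00.0
cr0x@server:~$ journalctl -k -b | grep -iE 'vfio|iommu' | tail -n 20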
5) “Optimization” flags that bite: AIO backends, cache modes, discard
Options like aio=io_uring and aggressive cache modes can be fantastic—until they meet an older kernel, a weird filesystem, or a storage backend that doesn’t support what you asked for. Then QEMU exits immediately and you’re left arguing with a line of config someone copy-pasted from a benchmark blog.
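A cheap audit before this bites you: because /etc/pve replicates every node’s configs, one grep finds every VM in the cluster carrying the flag. The pattern is an example; swap in whatever tweak you’re hunting.
cr0x@server:~$ grep -l 'aio=io_uring' /etc/pve/nodes/*/qemu-server/*.conf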
Joke #2: Nothing is more permanent than a “temporary” performance tweak added on a Friday.
Three corporate mini-stories (mistakes, backfires, boring wins)
Incident #1: The wrong assumption (and a lock that wasn’t the problem)
A mid-sized company ran a Proxmox cluster for internal services. One morning, a critical VM wouldn’t start: “QEMU exited with code 1.” The on-call engineer saw lock: backup in the VM config and assumed the obvious: stale lock. They ran qm unlock, then hammered “Start” in the UI a few times because the UI is right there and impatience is a renewable resource.
It still failed. Now the situation was worse: they had removed a legitimate lock while a storage-side snapshot job was still active on the NFS server. That job wasn’t visible as a Proxmox task because it was triggered by the storage team’s scripts. QEMU was failing because the qcow2 file was temporarily locked on the NFS side and returning inconsistent metadata during the snapshot window.
The “code 1” was just the last domino. The real clue was in the kernel log: intermittent “stale file handle” messages and NFS server timeouts around the start attempts. But those lines were never checked because the lock theory felt neat.
The fix was boring: stop trying to start the VM during the storage snapshot window, mount NFS with sane options for their environment, and add a pre-flight check in operations runbooks: validate the storage backend isn’t in a maintenance state before clearing locks. They also changed policy: Proxmox locks are cleared only after confirming no related operation is running anywhere in the stack.
Afterward, the team added a small script that prints “storage readiness” signals (NFS responsiveness, Ceph health, ZFS status) alongside VM start actions. It didn’t prevent every incident. It did prevent the same incident twice, which is the real KPI.
Incident #2: The optimization that backfired (io_uring meets reality)
Another organization wanted better VM disk performance for a database workload. Someone read that io_uring was the future and rolled out aio=io_uring and some cache tuning across a fleet of VMs via automation. The change looked harmless: a single parameter in each VM config.
A week later, after a kernel update was delayed on a subset of nodes, a migration landed a VM on an older node. The VM started failing with “QEMU exited with code 1.” The task log showed “invalid argument” around the drive definition. It wasn’t obvious because the VM started fine on other nodes.
The root cause: the older kernel/QEMU combo didn’t support the chosen AIO backend for that storage path. The parameter was valid in newer builds but effectively nonsensical in the older environment. The cluster became a compatibility minefield: VM configs assumed features that weren’t uniform across nodes.
The fix wasn’t to abandon performance tuning. It was to stop pretending the cluster was homogeneous when it wasn’t. They enforced node version baselines, added a CI-style check that rejects VM config changes requiring unsupported features on any node in the cluster, and limited the optimization to a label-controlled pool of nodes that were known-good.
Performance improvements stayed. Surprise outages didn’t. The lesson wasn’t “don’t optimize.” It was “optimize like you’ll be on-call for the edge cases.”
Incident #3: The boring practice that saved the day (journal discipline)
A third team had a policy that every failed VM start must be diagnosed from logs before anyone restarts anything. It sounded pedantic. It was, in the best way.
During a power event, one node came back with a slightly different network state. A handful of VMs failed to start with “QEMU exited with code 1.” People were ready to reboot the node again, assuming “the bridges didn’t come up correctly.”
The on-call followed the policy: journalctl -t pve-qemu-server plus journalctl -k. The error was precise: tap device creation failed because the system hit a per-user process limit in a weird corner case triggered by a monitoring agent forking aggressively during startup. Restarting the node would have masked it temporarily and guaranteed recurrence.
They raised the appropriate systemd limits, fixed the monitoring agent configuration, and restarted only the affected services. The VMs started, the incident ended, and the team didn’t add “reboot it again” to the runbook.
The boring practice—treating logs as primary evidence—prevented a loop of guesswork. It also produced a clean incident report with an actual root cause. Auditors love that. So do engineers who enjoy sleeping.
Common mistakes: symptom → root cause → fix
1) Symptom: “Device or resource busy” on a zvol or disk
Root cause: old QEMU process still has the device open; backup/migration process holding it; kernel stuck I/O path.
Fix: find the holder with lsof / fuser, stop the VM or offending process, then retry. If the holder is in D state, investigate storage backend health (NFS/Ceph/iSCSI) rather than playing whack-a-mole with PIDs.
2) Symptom: “Permission denied” opening a disk image on directory storage
Root cause: wrong permissions/ownership on the image path; storage mounted outside Proxmox’s expected mountpoints; AppArmor denial.
Fix: ensure storage is configured via Proxmox, confirm mountpoint is correct (/mnt/pve/<storage>), fix file permissions, check AppArmor denials and align paths.
3) Symptom: “could not create tap device” / “Operation not permitted”
Root cause: bridge missing, firewall scripts failing, capabilities changed, or leftover tap interface causing collisions.
Fix: verify vmbr0 exists and is up; check firewall logs; remove stale tap only after confirming no process uses it; validate /etc/network/interfaces and reload networking cautiously.
4) Symptom: Start fails only on one node
Root cause: node drift: different kernel/QEMU versions, missing CPU flags, different storage connectivity, stale mounts, broken multipath.
Fix: compare node versions and key configs, standardize packages, validate storage access from that node, and avoid config flags requiring newer features unless the cluster is uniform.
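Comparing versions is a one-liner per node: pveversion -v lists the QEMU and kernel packages, and diffing the failing node against a working one usually ends the argument. The remote node name is illustrative.
cr0x@server:~$ pveversion -v | grep -Ei 'qemu|kernel'
cr0x@server:~$ ssh pve02 pveversion -v | grep -Ei 'qemu|kernel'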
5) Symptom: “kvm: failed to initialize” or “failed to get KVM”
Root cause: virtualization disabled in BIOS/UEFI, missing kernel modules, running inside a VM without nested virt, or permissions on /dev/kvm.
Fix: enable VT-x/AMD-V, load modules, ensure nested virt is enabled upstream, validate /dev/kvm ownership/group membership.
6) Symptom: “Cannot access storage … (500)”
Root cause: Proxmox storage plugin error: unreachable NFS/Ceph target, auth failures, stale mount, missing keyrings.
Fix: confirm storage is online from the node (mount status, ceph health), fix auth, remount, and only then retry VM start.
7) Symptom: “invalid argument” in drive/net options
Root cause: incompatible QEMU option for the node’s QEMU version; bad args: line; unsupported AIO/cache setting for backend.
Fix: simplify: remove custom args, revert to defaults, then re-add changes one at a time. Standardize QEMU versions across cluster nodes.
8) Symptom: VM starts from CLI but not from UI/API tasks
Root cause: permission context differences, environment variables, or task-time hooks (firewall, pre-start scripts) failing.
Fix: reproduce via qm start and inspect task logs; check hook scripts and firewall; confirm the same user/context is used.
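Running the start from a shell on the node is the fastest way to see QEMU’s stderr without UI truncation; if it works here but fails via the API, suspect hook scripts, firewall setup, or the task context rather than QEMU itself.
cr0x@server:~$ qm start 101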
Checklists / step-by-step plan
Step-by-step: from “code 1” to root cause in under 10 minutes
- Capture the timestamp of the failed start (UI task log is fine for this).
- Pull journald lines around that time for pve-qemu-server and kernel messages. Identify the first concrete error string.
- Run qm showcmd for the VM and locate the disk paths, netdev config, and any unusual options.
- Check VM lock status in config; unlock only after confirming the related operation is not running.
- Verify backend health:
  - ZFS: zpool status
  - LVM: lvs and pool usage
  - Ceph: cluster health and RBD mapping basics
  - NFS: mount responsiveness and kernel errors
- Check for leftovers: existing QEMU PID, stale tap device, lingering zvol holder.
- Make one change at a time, retry, and re-check logs immediately.
Checklist: storage first, because it fails the most expensively
- Disk path exists and is accessible from the node that’s starting the VM.
- No process holds the block device/image file open unexpectedly.
- Pool health is OK (no degraded vdevs, no read/write errors).
- Free space is sane, including thinpool metadata and root/log partitions.
- No stale NFS handles or iSCSI session issues in kernel logs.
Checklist: networking sanity
- Bridge exists and is up (ip link show vmbr0).
- Tap devices are created/removed cleanly; no stale tap name collisions.
- Firewall scripts aren’t failing (journald errors).
Checklist: version and feature compatibility
- QEMU versions are consistent across cluster nodes, or VM config avoids node-specific features.
- KVM is available and enabled (/dev/kvm exists, modules loaded).
- Passthrough devices are bound correctly (VFIO/IOMMU) if used.
Interesting facts and historical context (things that help you debug)
- QEMU started in 2003 as a fast CPU emulator and grew into the de facto virtualization workhorse in Linux ecosystems.
- KVM arrived in the Linux kernel in 2007, turning QEMU from “pure emulation” into hardware-assisted virtualization on x86.
- Proxmox’s VM configs live in a cluster filesystem (pmxcfs) mounted at /etc/pve, which is why edits replicate and why split-brain hurts.
- Exit code 1 is intentionally generic: QEMU often uses it for “failed to initialize or parse arguments,” which can mean anything from a missing disk to an unsupported option.
- Virtio devices exist because emulating real hardware is expensive: virtio is a paravirtual interface designed to reduce overhead and improve performance.
- io_uring became mainstream in Linux around 2019+ and keeps evolving; mixing kernels and QEMU versions can make “valid” options invalid on some nodes.
- ZFS zvols are block devices backed by datasets, and they can be “busy” due to holders you won’t see in Proxmox UI—process-level tools matter.
- LVM-thin can overcommit, which is great until it isn’t; out-of-space conditions show up as weird VM failures rather than friendly warnings.
- Tap devices are a Linux kernel feature, and failures creating them are often privilege/capability problems, not “a Proxmox bug.”
FAQ
Why does Proxmox only show “QEMU exited with code 1”?
Because Proxmox is reporting QEMU’s exit status, not the underlying stderr detail. The actionable error is typically in journald under pve-qemu-server or kernel logs.
Where is the single best place to find the real error line?
journalctl -t pve-qemu-server around the start timestamp, plus journalctl -k for kernel-level denials and device issues.
Is it safe to run qm unlock whenever a VM won’t start?
No. Unlocking is safe only after you prove the related operation (backup/migration/snapshot) is not running anywhere. Otherwise you risk corruption or inconsistent snapshots.
My VM starts on node A but fails on node B. What should I suspect first?
Node drift: different QEMU/kernel versions, storage access differences (mount missing on node B), or different security policies. Compare qm showcmd output and check storage reachability from node B.
Can a nearly full /var/log or root filesystem cause a QEMU start failure?
Yes. QEMU and Proxmox need to write logs, sockets, and runtime state. A full filesystem can surface as weird start failures, sometimes still reported as “code 1.”
Does “Device or resource busy” always mean a running VM?
Often, but not always. It can also mean a backup process, a stuck QEMU instance, or storage-side locking (especially on NFS/Ceph layers). Use lsof/fuser to prove it.
How do I safely reproduce the failure without guessing?
Use qm showcmd <vmid> to see the exact QEMU command line, then focus on the failing resource mentioned in logs (disk path, tap interface, KVM). Don’t random-walk through config options.
Is it a good idea to add custom args: to Proxmox VM configs?
Only if you also accept the operational burden: node compatibility, upgrade testing, and future you debugging it at 3 a.m. Defaults exist for a reason.
What’s the fastest way to tell if this is storage vs networking vs KVM?
Read the first concrete error string. Disk path errors point to storage. Tap/bridge errors point to networking. KVM/VFIO errors point to kernel/virtualization features.
Conclusion: practical next steps
When Proxmox says “QEMU exited with code 1,” don’t treat it like a mystery. Treat it like a pointer to evidence you haven’t collected yet. Your job is to extract the first specific error message, then validate the subsystem it names.
Next steps you can do today:
- Standardize a runbook: journal first, then locks/processes, then backend health.
- Reduce cluster drift: align kernel/QEMU versions across nodes or constrain VM features.
- Audit “performance” tweaks in VM configs and remove the ones you can’t justify under incident conditions.
- Add a habit: every “code 1” gets a root-cause note. Not a workaround. A note.
If you do that consistently, “QEMU exited with code 1” stops being an incident and becomes a routine diagnostic step. Which is exactly where it belongs.