It’s 02:13, your pager is doing its little dance, and Proxmox is insisting a VM disk “doesn’t exist.” The guests might still be running. Or they might be dead. Either way, the business wants it back, and they want it back five minutes ago.
This failure looks like “storage ate my disk,” but a surprising amount of the time it’s “config can’t find the disk that’s still there.” The difference matters: one path is a careful recovery; the other is a fast fix with minimal risk. This guide teaches you to separate storage reality from Proxmox metadata, prove what exists, and choose the least-dangerous next step.
The only mental model you need: config pointers vs storage objects
When Proxmox says a VM disk is missing, you’re dealing with two layers that get out of sync:
- Config layer: the Proxmox VM config in /etc/pve/qemu-server/<vmid>.conf (and storage definitions in /etc/pve/storage.cfg). This is the "pointer system": which disk name, which storage ID, which volume ID.
- Storage layer: the actual objects on the storage backend (ZFS zvol, LVM LV, qcow2 file, Ceph RBD image, iSCSI LUN, NFS file). This is the "physics."
Most “vanished disk” incidents are one of these:
- The object exists, the pointer is wrong (config mismatch). This is the happy case: fix config, import volume, refresh storage, restart VM.
- The pointer is right, the object is inaccessible (storage down, pool not imported, network path dead, permissions). Still usually recoverable without data loss, but you must stop guessing and start proving.
- The object is gone (deleted volume, rolled-back snapshot, replaced storage, wrong pool). This becomes a restore-from-backup or deep storage recovery story.
So your job is not “make Proxmox stop complaining.” Your job is to answer, with evidence:
- What volume ID does Proxmox think the VM uses?
- Does that volume exist on the backend?
- If not, does a volume that looks like it exist somewhere else?
- Did storage.cfg change, did the storage ID change, or did the backend move?
One idea that has aged well in operations, paraphrasing Gene Kim: reliability comes from designing feedback loops that catch errors early, not from heroics after the fact. The point here: build the habit of verifying config and storage separately.
First short joke: Your VM disk didn’t “vanish”; it just went to live on a farm upstate with all the other missing LUNs.
Fast diagnosis playbook (first/second/third)
This is the playbook you use when people are waiting. It’s biased toward speed and evidence.
First: prove what Proxmox thinks should exist
- Locate VM config, read the disk lines (virtio/scsi/sata/ide). Extract the volume IDs.
- Check whether the storage ID referenced is defined and currently “active.”
- Try a Proxmox-level volume lookup (pvesm list / pvesm status).
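A minimal sketch of that first pass, assuming VMID 101 (adjust for yours). The grep pulls every volume reference out of the config, including EFI and TPM disks, so you have the exact volume IDs in front of you:
cr0x@server:~$ qm config 101 | grep -E '^(scsi|virtio|sata|ide|efidisk|tpmstate)[0-9]+:'
cr0x@server:~$ pvesm status
The first command is just a filtered view of the same config file; the second tells you whether the referenced storage IDs are defined and active before you go any deeper.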
Second: prove whether the backend object exists
- ZFS: list zvols/datasets, confirm pool is imported and healthy.
- LVM-thin: list LVs, ensure VG is present and thinpool is active.
- Directory storage: check file paths and permissions, confirm mount is real (not an empty mountpoint).
- Ceph: list RBD images, check cluster health and keyring auth.
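A compact version of those backend checks, using the example names from this guide (pool tank, VG pve, NFS path under /mnt/pve, Ceph pool rbd); each one is expanded in the tasks below:
cr0x@server:~$ zfs list -t volume | grep vm-101          # ZFS zvols
cr0x@server:~$ lvs pve | grep vm-101                     # LVM / LVM-thin LVs
cr0x@server:~$ ls -lah /mnt/pve/nfs-images/images/101/   # directory/NFS storage
cr0x@server:~$ rbd -p rbd ls | grep vm-101               # Ceph RBD images
These are read-only queries; they tell you what physically exists without changing anything.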
Third: decide the safest recovery path
- If disk exists but pointer is wrong: modify VM config or re-import volume into storage with correct ID.
- If storage is down: fix storage first; don't edit the VM config to "work around" missing storage, or the config will drift even further from what actually exists on the backend.
- If disk is gone: stop poking; pivot to backups, replication targets, snapshots, or filesystem-level recovery. Every extra “try something” is an opportunity to overwrite evidence.
Speed tip: don’t restart the node or “just reboot the VM” until you know whether you’re dealing with a stale mount, a missing pool, or an actually-deleted volume. Reboots can hide transient evidence in logs and can also activate automatic repairs that change the situation.
Interesting facts and context (why this keeps happening)
- /etc/pve is a cluster filesystem: Proxmox stores config in pmxcfs, a distributed, database-backed filesystem. If cluster communication is unhealthy, /etc/pve can go read-only (no quorum) or nodes can temporarily disagree about what the config says.
- Storage IDs are names, not magic: “local-lvm” is just a label in storage.cfg. Renaming or re-creating storage with a different backend can silently invalidate VM disk references.
- ZFS zvols are block devices: in Proxmox, a zvol-backed disk is not a file. “Looking in the directory” won’t find it; you must query ZFS.
- LVM-thin can lie to you under pressure: thin pools can go read-only or refuse activations when metadata is full. That can look like “missing LV,” but it’s really “LV won’t activate.”
- Mountpoints can fail open: if an NFS mount drops and a service recreates the mount directory locally, you can end up writing VM images to the node's root disk. Later, the disk "vanishes" when NFS remounts and hides the local files.
- Ceph “healthy” is scoped: a cluster can be globally healthy but your client may be missing a keyring or have caps that prevent listing/reading an RBD image.
- Proxmox volume IDs are structured: they typically look like local-lvm:vm-101-disk-0 or tank:vm-101-disk-0. If you see a raw path in a config, someone likely bypassed the storage abstraction.
- VM config drift is common in migrations: especially when moving from file-based disks (qcow2) to block volumes (ZFS/LVM/Ceph), or when importing from other platforms.
- Snapshots don’t protect you from pointer mistakes: snapshots protect data; they don’t protect a human from attaching the wrong disk to the wrong VMID.
What “disk vanished” actually means in Proxmox
There isn’t one canonical error string. You’ll see variations depending on the backend:
- Proxmox GUI: “unable to find volume” or “volume does not exist.”
- qm start: errors referencing a volume ID (e.g.,
storage 'local-lvm' does not existorno such logical volume). - qemu-system: can’t open a block device, permission denied, missing file.
Translate the symptom into a question:
- If the error mentions a storage ID (e.g., local-lvm, tank, ceph-ssd): likely a storage.cfg or storage activation problem.
- If it mentions a volume name (vm-101-disk-0): likely a missing backend object, wrong pool, wrong VG, or a disk that was moved/renamed.
- If it mentions a path (e.g., /mnt/pve/nfs/images/101/vm-101-disk-0.qcow2): likely a mount/permissions/path mismatch.
The fastest way to stop flailing is to treat the VM config as a contract. It’s either valid and storage is broken, or storage is fine and the contract is wrong.
Practical tasks: commands, outputs, and decisions (12+)
Everything below is written for a Proxmox host shell. Run as root (or with appropriate privileges). Each task includes: the command, what output typically means, and the decision you make.
Task 1: Identify the exact disk references in the VM config
cr0x@server:~$ vmid=101; sed -n '1,200p' /etc/pve/qemu-server/${vmid}.conf
boot: order=scsi0;net0
cores: 4
memory: 8192
name: app-prod-01
scsi0: local-lvm:vm-101-disk-0,discard=on,iothread=1,size=80G
scsi1: tank:vm-101-disk-1,size=200G
net0: virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0
onboot: 1
ostype: l26
scsihw: virtio-scsi-pci
Meaning: you now have two distinct backends: local-lvm and tank. Don’t assume both are broken; diagnose per-disk.
Decision: copy the volume IDs somewhere. Your next steps are to validate each storage ID and each volume object.
Task 2: Check Proxmox storage status (active, enabled, content types)
cr0x@server:~$ pvesm status
Name Type Status Total Used Available %
local dir active 98420640 21053440 72278144 21.39%
local-lvm lvmthin active 402653184 187904000 214749184 46.67%
tank zfspool active 1099511627776 603979776000 495531851776 54.93%
Meaning: storages are “active.” If your VM still can’t start, we’re likely dealing with missing volumes, permissions, or activation edge cases, not a disabled storage definition.
Decision: if a storage shows inactive or missing, fix that before touching VM config.
Task 3: Validate storage configuration exists and hasn’t drifted
cr0x@server:~$ grep -nE '^(dir|lvmthin|zfspool|rbd|lvm|nfs|iscsi):|^content|^path|^pool|^vgname|^monhost|^nodes' /etc/pve/storage.cfg
dir: local
path /var/lib/vz
content iso,vztmpl,backup
lvmthin: local-lvm
thinpool data
vgname pve
content images,rootdir
zfspool: tank
pool tank
content images,rootdir
Meaning: storage IDs match what the VM config references. If the VM references a storage ID not listed here, Proxmox will treat volumes as nonexistent.
Decision: if storage ID is wrong in VM config, fix VM config (carefully) rather than trying to “create a storage with that name” pointing elsewhere.
Task 4: Ask Proxmox to list the referenced volume (backend-agnostic)
cr0x@server:~$ pvesm list local-lvm --vmid 101
Volid Format Type Size VMID
local-lvm:vm-101-disk-0 raw images 85899345920 101
Meaning: Proxmox can see the disk object via its storage plugin. That strongly suggests the disk exists and is discoverable.
Decision: if the disk lists here but VM won’t start, look for permission/locking, stale device mappings, or QEMU errors, not “missing disk.”
Task 5: Try resolving a volume that Proxmox claims is missing
cr0x@server:~$ pvesm path tank:vm-101-disk-1
pvesm: volume 'tank:vm-101-disk-1' does not exist
Meaning: at Proxmox abstraction level, that volume ID is not present in the storage plugin’s view.
Decision: pivot to backend checks (ZFS list). Do not yet edit the VM config; the disk might exist under a slightly different name or pool.
Task 6: ZFS backend: confirm pool is imported and healthy
cr0x@server:~$ zpool status -x
all pools are healthy
Meaning: pool health isn’t your immediate blocker.
Decision: continue: list datasets/zvols; we’re looking for a naming mismatch or moved volume.
Task 7: ZFS backend: search for the missing zvol
cr0x@server:~$ zfs list -t volume -o name,volsize,used,refer,origin | grep -E 'vm-101|101-disk|disk-1' || true
tank/vm-101-disk-0 80G 12G 12G -
tank/vm-102-disk-1 200G 68G 68G -
Meaning: the expected tank/vm-101-disk-1 is not there. But you do see something suspicious: tank/vm-102-disk-1 exists and is exactly the size you expected for VM 101’s second disk.
Decision: investigate whether disks were attached to the wrong VMID or renamed. This is where “config vs storage” becomes a forensic exercise.
Task 8: Cross-check which VM config references the suspicious volume
cr0x@server:~$ grep -R --line-number "tank:vm-102-disk-1" /etc/pve/qemu-server/*.conf
/etc/pve/qemu-server/102.conf:6:scsi1: tank:vm-102-disk-1,size=200G
Meaning: VM 102 references it, so it’s probably legitimately attached there. But “probably” is not a plan.
Decision: if VM 101’s disk is missing and VM 102 has an extra disk that “looks like” it, verify via guest-level identifiers (filesystem UUIDs, application data) before reattaching anything.
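A low-risk way to gather that guest-level evidence without booting anything, assuming the suspect zvol from the listing above. Both commands only read the block device:
cr0x@server:~$ lsblk -o NAME,SIZE,FSTYPE,LABEL,UUID /dev/zvol/tank/vm-102-disk-1
cr0x@server:~$ blkid /dev/zvol/tank/vm-102-disk-1*
Compare the filesystem UUIDs and labels against what VM 101's guest expects (its fstab, old monitoring, documentation). A match is evidence of a renamed or mis-attached disk; UUIDs that belong to VM 102's own data mean you leave it alone.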
Task 9: LVM-thin backend: confirm VG and thinpool exist
cr0x@server:~$ vgs; lvs -a -o vg_name,lv_name,lv_attr,lv_size,pool_lv,data_percent,metadata_percent
VG #PV #LV #SN Attr VSize VFree
pve 1 12 0 wz--n- 372.51g 1.00g
VG LV Attr LSize Pool Data% Meta%
pve data twi-aotz-- 350.00g 62.15 11.02
pve vm-101-disk-0 Vwi-aotz-- 80.00g data 15.63
Meaning: vm-101-disk-0 exists on LVM-thin and is active. Data/metadata utilization is not alarming here. If metadata is 99% and LV activation fails, you’ll often see missing devices as a symptom.
Decision: if data thinpool is not active or metadata is full, fix thinpool health before trying to “recreate” volumes.
Task 10: Directory storage: detect the “empty mountpoint” trap
cr0x@server:~$ mount | grep -E '/mnt/pve|nfs|cifs' || true
server-nfs:/export/pve-images on /mnt/pve/nfs-images type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2)
cr0x@server:~$ ls -lah /mnt/pve/nfs-images/images/101/ | head
total 80G
drwxr-xr-x 2 root root 4.0K Dec 26 01:50 .
drwxr-xr-x 3 root root 4.0K Dec 26 01:50 ..
-rw-r----- 1 root root 80G Dec 26 01:49 vm-101-disk-0.qcow2
Meaning: the mount exists and the expected directory contains the disk. If mount output is empty but the directory still exists, you might be staring at local disk contents, not your NFS share.
Decision: if the mount is missing, stop the VM immediately if it’s writing to the wrong place. Remount, then reconcile which copy is authoritative.
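To prove the directory really is the NFS share and not a local stand-in, ask the kernel which filesystem backs the path; a sketch using the mountpoint from this example:
cr0x@server:~$ findmnt --target /mnt/pve/nfs-images
cr0x@server:~$ stat -f -c 'fs type: %T' /mnt/pve/nfs-images
findmnt prints the source and fstype of whatever actually backs that path; if it reports your root filesystem instead of the NFS export, you are looking at local files sitting in an unmounted mountpoint.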
Task 11: Check for locks that make a disk look “unusable”
cr0x@server:~$ qm status 101
status: stopped
cr0x@server:~$ qm config 101 | grep -E '^lock:|^scsi|^virtio|^sata|^ide'
scsi0: local-lvm:vm-101-disk-0,discard=on,iothread=1,size=80G
scsi1: tank:vm-101-disk-1,size=200G
Meaning: no lock line shown. If you see lock: backup or lock: migrate, Proxmox may block operations and users interpret that as “disk missing.” It’s not missing; it’s guarded.
Decision: clear locks only after confirming no backup/migration actually runs. Removing a real lock mid-operation is how you earn your next outage.
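If a lock line does appear, a minimal sketch of the safe order of operations: prove the guarding task is dead before removing the lock (check the GUI task list too).
cr0x@server:~$ qm config 101 | grep '^lock:'
cr0x@server:~$ ps aux | grep '[v]zdump'     # any live backup job on this node?
cr0x@server:~$ qm unlock 101                # only once you're sure nothing is still running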
Task 12: Start the VM to capture the real error message (don’t rely on GUI summaries)
cr0x@server:~$ qm start 101
kvm: -drive file=/dev/zvol/tank/vm-101-disk-1,if=none,id=drive-scsi1,format=raw,cache=none,aio=io_uring: Could not open '/dev/zvol/tank/vm-101-disk-1': No such file or directory
start failed: QEMU exited with code 1
Meaning: QEMU is trying to open a zvol path and it doesn’t exist. This confirms it’s not a Proxmox “UI issue.” The backend object is missing or renamed.
Decision: find the actual zvol name (Task 7) or restore it; don’t invent a new blank disk with the same name unless you enjoy irreversible data loss.
Task 13: Search logs for deletion/rename events and storage errors
cr0x@server:~$ journalctl -u pvedaemon -u pvestatd -u pveproxy --since "24 hours ago" | tail -n 40
Dec 26 01:41:12 server pvedaemon[2210]: update VM 101: moved disk scsi1 to tank
Dec 26 01:41:15 server pvedaemon[2210]: end task UPID:server:000008A2:0000F1A3:676C3B6B:qmmove:101:root@pam: OK
Dec 26 02:10:33 server pvedaemon[2210]: start task UPID:server:00000B11:00012A3D:676C4BE9:qmstart:101:root@pam:
Dec 26 02:10:34 server pvedaemon[2210]: command 'qmstart 101' failed: exit code 1
Meaning: you have a clue: someone moved the disk recently. “Move disk to tank” succeeded, but the zvol is missing now. That suggests a later deletion, rollback, or pool change.
Decision: correlate with ZFS history, storage rollback events, or operator actions around 01:41–02:10.
Task 14: ZFS history (if enabled) to catch destroy/rename
cr0x@server:~$ zpool history tank | tail -n 20
2025-12-26.01:41:13 zfs create -b 16k -V 214748364800 tank/vm-101-disk-1
2025-12-26.01:58:02 zfs destroy tank/vm-101-disk-1
Meaning: it was created, then destroyed. This is no longer “mysteriously vanished.” It’s “deleted,” which is very different.
Decision: stop trying to start the VM; proceed to snapshot/backup restore or ZFS recovery options (depending on your environment and policy).
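If the decision is "restore," a sketch of the safe sequence, assuming a backup storage named backup-nfs and an archive name that is purely illustrative:
cr0x@server:~$ pvesm list backup-nfs --vmid 101
cr0x@server:~$ qmrestore backup-nfs:backup/vzdump-qemu-101-2025_12_25-23_00_01.vma.zst 9101 --storage tank
Restoring to a spare VMID (9101 here) first lets you verify the recovered data and then reattach just the missing disk to VM 101, instead of overwriting VM 101's current, partly healthy config.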
Task 15: Ceph RBD backend: verify image presence and permissions
cr0x@server:~$ ceph -s
cluster:
id: 7d2a2f1c-1a11-4f2b-bc1a-0c3f0c0d0d0d
health: HEALTH_OK
services:
mon: 3 daemons, quorum mon1,mon2,mon3
osd: 6 osds: 6 up, 6 in
data:
pools: 2 pools, 128 pgs
objects: 1.2M objects, 4.6 TiB
usage: 13 TiB used, 20 TiB / 33 TiB avail
cr0x@server:~$ rbd -p rbd ls | grep -E 'vm-101-disk-0|vm-101-disk-1' || true
vm-101-disk-0
Meaning: cluster is healthy, but only one image exists. If Proxmox expects a second one, it won’t start. This can also happen if you’re using the wrong pool name or missing RBD namespace awareness.
Decision: verify storage.cfg for the correct pool/namespace; verify which node has the right keyring and caps.
Task 16: Confirm Proxmox resolves the disk path (storage plugin sanity)
cr0x@server:~$ pvesm path local-lvm:vm-101-disk-0
/dev/pve/vm-101-disk-0
Meaning: Proxmox can resolve the volume ID to an actual block device path. If QEMU still fails, the error is likely permissions, stale device nodes, or kernel mapping.
Decision: check whether the device node exists and is accessible, and whether LVM activation is consistent.
Task 17: Verify the block device node exists and has sane permissions
cr0x@server:~$ ls -lah /dev/pve/vm-101-disk-0
lrwxrwxrwx 1 root root 7 Dec 26 00:02 /dev/pve/vm-101-disk-0 -> ../dm-12
cr0x@server:~$ ls -lah /dev/dm-12
brw-rw---- 1 root disk 253, 12 Dec 26 00:02 /dev/dm-12
Meaning: device exists. If it were missing, you’d see “No such file” and you’d focus on LVM activation and udev rules.
Decision: if device nodes are missing, run LVM activation steps and investigate why they didn’t appear (VG not found, multipath, iSCSI session down).
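If device nodes are missing but the LVs exist, a minimal activation sketch (list first, then activate):
cr0x@server:~$ vgs
cr0x@server:~$ lvs pve
cr0x@server:~$ vgchange -ay pve                   # activate everything in the VG
cr0x@server:~$ lvchange -ay pve/vm-101-disk-0     # or just the one volume you need
If activation itself fails, read the error: messages about the thin pool or its metadata point at thinpool health problems (see the LVM-thin mistakes below), not a missing disk.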
Task 18: Find stale mappings for ZFS zvol device paths
cr0x@server:~$ ls -lah /dev/zvol/tank/ | head
total 0
drwxr-xr-x 2 root root 80 Dec 26 01:41 .
drwxr-xr-x 4 root root 80 Dec 26 01:41 ..
lrwxrwxrwx 1 root root 13 Dec 26 01:41 vm-101-disk-0 -> ../../../zd0
Meaning: the missing zvol truly isn’t present; otherwise you’d see a symlink for it. This aligns with the QEMU error from Task 12.
Decision: don’t “touch” VM config. Your fix is: restore zvol (from snapshot/backup) or reattach correct existing zvol if it was renamed.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-size SaaS company ran a Proxmox cluster where “local-lvm” existed on every node, but not identically. On paper, it was “the same VG name everywhere,” which is the kind of statement that sounds true until you ask what it means.
A VM was migrated from Node A to Node B after a hardware alarm. Proxmox updated the VM config, and the VM booted. A week later, the disk “vanished” after a reboot. Panic followed. Storage team blamed Proxmox. Proxmox admin blamed storage. Classic duet.
The wrong assumption: “If the storage ID is the same, the storage content is the same.” On Node B, “local-lvm” pointed to a different VG than on Node A due to a previous rebuild. It worked briefly because the VM had been running and the block devices remained mapped until reboot. On reboot, the LV activation didn’t find the expected volume.
The fix wasn’t heroic: they standardized VG names and storage IDs, and they stopped using a shared label for non-shared storage. The root cause was a naming collision. The lesson: a Proxmox storage ID is not a promise of shared semantics. It’s a string in a config file.
Mini-story 2: The optimization that backfired
An enterprise internal platform team wanted faster clones and fewer snapshots stored on expensive flash. They introduced a scheduled job that aggressively pruned “old” ZFS volumes that matched a pattern used for temporary test VMs. It was neat. It was tidy. It was also wrong.
A production VM had recently been moved from LVM-thin to ZFS. During the move, a human renamed a disk to match internal naming conventions—unfortunately, those conventions overlapped with the “temporary” pattern. The VM ran fine for two days, until the prune job did what it was told and destroyed the zvol.
The incident presented as “VM disk vanished.” Proxmox wasn’t lying; the disk was gone. The team’s first reaction was to recreate a blank zvol with the same name so the VM would start. That would have been a fast way to ensure no forensics or recovery would ever work. Someone stopped them.
They restored from backups, then fixed the policy: deletion jobs now check for references in /etc/pve/qemu-server before destroying anything. Also, disk names stopped being free-form text. “Optimization” is just another word for “introducing a new failure mode,” unless you also add guardrails.
Mini-story 3: The boring but correct practice that saved the day
A financial org ran Proxmox with Ceph RBD for VM disks and had a strict rule: every storage change ticket included a “proof step” screenshot equivalent—output from pvesm status, qm config, and a backend listing (rbd ls or zfs list) before and after.
It was bureaucratic. People rolled their eyes. But it also meant there was always a small trail of objective truth. One day, a node rebooted after kernel updates and several VMs refused to start with “unable to find volume.” The on-call pulled the last change ticket and compared the before/after outputs. The VM configs referenced ceph-ssd, but storage.cfg now listed ceph_ssd due to a well-intended rename to match a naming standard.
The disks never moved. Ceph was fine. Only the storage ID label changed, breaking every reference. They reverted the storage ID name, reloaded, and everything came up.
Nothing fancy. No data recovery. No guessing. Just disciplined change evidence that made the failure obvious. Boring practices don’t win awards. They do win uptime.
Common mistakes: symptom → root cause → fix
1) “Volume does not exist” right after a storage rename
Symptom: VM configs reference old-storage:vm-XXX-disk-Y; pvesm status doesn’t show old-storage.
Root cause: storage ID changed in /etc/pve/storage.cfg but VM configs weren’t updated.
Fix: revert the storage ID name or update VM configs to the new storage ID (only if backend is truly the same). Then verify with pvesm list and qm start.
2) Missing QCOW2 files after an NFS hiccup
Symptom: files “vanished” from /mnt/pve/nfs-*; directory exists but is empty; later reappears.
Root cause: mount dropped; host wrote to local directory; remount hid local files.
Fix: remount properly, then reconcile duplicates. Confirm with mount and compare inode/device IDs (a sketch follows below). Prevent it by using systemd mount units with hard dependencies and by monitoring for stale file handles.
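A sketch of that reconciliation, using the mountpoint from earlier examples. The bind mount exposes whatever is hidden underneath an active mountpoint, so you can inspect both copies without unmounting anything:
cr0x@server:~$ df -T /mnt/pve/nfs-images                                   # should say nfs4, not your root filesystem
cr0x@server:~$ stat -c 'dev=%d size=%s mtime=%y %n' /mnt/pve/nfs-images/images/101/vm-101-disk-0.qcow2
cr0x@server:~$ mkdir -p /mnt/rootview && mount --bind / /mnt/rootview
cr0x@server:~$ ls -lah /mnt/rootview/mnt/pve/nfs-images/images/101/        # the local copy, if one exists
Don't delete either copy until you know which one the VM last wrote to; sizes and mtimes from stat are your first evidence.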
3) ZFS pool imported on one node but not another (cluster confusion)
Symptom: VM starts on Node A but not on Node B; Node B says zvol path missing.
Root cause: local ZFS pool not imported on Node B; Proxmox config assumes shared but it isn’t.
Fix: import pool on correct node or stop treating it as shared storage. Use correct storage type and node restriction in storage.cfg.
4) LVM-thin “missing LV” during thinpool metadata exhaustion
Symptom: LV appears missing, VM won’t start; thinpool metadata near 100%.
Root cause: thinpool cannot allocate metadata; LV activation fails; device nodes may not appear.
Fix: extend metadata LV, reduce snapshots, run thin pool repair as appropriate. Then re-activate VGs and retry.
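A sketch of the thinpool side of that fix, assuming the pve/data pool from earlier examples; check utilization first, grow metadata only if the VG has free extents:
cr0x@server:~$ lvs -o lv_name,data_percent,metadata_percent pve
cr0x@server:~$ lvextend --poolmetadatasize +1G pve/data     # requires free space in the VG
cr0x@server:~$ lvconvert --repair pve/data                  # last resort; pool must be deactivated first
Treat lvconvert --repair as the escalation path, not the first move: it swaps in rebuilt metadata, and you want the old metadata preserved before you run it.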
5) “Permission denied” looks like missing disk
Symptom: QEMU can’t open file/block device; GUI reports disk inaccessible.
Root cause: wrong ownership/mode on directory storage, AppArmor/SELinux policy, or broken ACLs on NFS.
Fix: correct permissions, ensure Proxmox is configured for the storage type, and retest with pvesm path and direct ls -l.
6) Disk exists but under a different name (human rename)
Symptom: VM config references vm-101-disk-1 but backend has vm-101-data or a different VMID disk name.
Root cause: manual rename on backend or during migration/restore.
Fix: verify the correct volume via size, creation time, snapshot lineage, and guest filesystem identifiers; then update VM config to the real volume ID.
7) Backups restored to the wrong storage ID
Symptom: restore job “succeeds,” but VM won’t start because it references a disk on a storage that doesn’t contain it.
Root cause: restore target storage differs from original; config not rewritten as expected.
Fix: restore with explicit storage mapping, then confirm with qm config and pvesm list.
8) Ceph looks fine, but RBD image “missing” to Proxmox
Symptom: ceph -s is OK; Proxmox can’t list images or can’t open one.
Root cause: wrong pool/namespace in storage.cfg, missing keyring on that node, or insufficient caps.
Fix: validate storage.cfg, confirm keyring exists on the node, test rbd -p POOL ls as root and under the same environment Proxmox uses.
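A sketch of testing exactly what the Proxmox RBD plugin would do, with illustrative names (storage ceph-ssd, pool rbd-ssd, client user ceph-ssd); Proxmox typically keeps external-cluster keyrings under /etc/pve/priv/ceph/<storageid>.keyring, while hyperconverged setups often just use client.admin:
cr0x@server:~$ grep -A5 '^rbd: ceph-ssd' /etc/pve/storage.cfg
cr0x@server:~$ rbd -p rbd-ssd --id ceph-ssd --keyring /etc/pve/priv/ceph/ceph-ssd.keyring ls
If this listing fails while rbd ls as client.admin works, the problem is this node's keyring or the client user's caps, not the cluster.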
Second short joke: Storage is like gossip—once a bad name gets out (wrong storage ID), everyone repeats it until you correct the source.
Checklists / step-by-step plan
Step-by-step: when a VM won’t start due to a missing disk
- Freeze the scene. If the VM is down, keep it down until you know whether you’re dealing with deletion vs mapping. If it’s running but reports missing disk on reboot, don’t reboot again.
- Extract disk references. Read /etc/pve/qemu-server/<vmid>.conf. Record each disk line and volume ID.
- Verify storage IDs exist and are active. pvesm status. If inactive, fix storage first (mount, pool import, network).
- Proxmox-level list. pvesm list <storage> --vmid <vmid>. If it lists the volume, your "missing disk" is probably not missing.
- Backend-level list. ZFS: zfs list -t volume. LVM: lvs. Directory: find the file. Ceph: rbd ls.
- If mismatch: search for close matches. Grep VM configs for similar disk names; check for wrong VMID attachments.
- Check logs. journalctl for move/delete tasks; ZFS history or Ceph logs where possible.
- Choose recovery mode.
- Pointer wrong: update VM config or re-import volume.
- Storage inaccessible: restore storage access.
- Volume deleted: restore from backup/snapshot; consider ZFS recovery if policy allows.
- Validate before boot. Ensure pvesm path <volid> works for every disk in the config (a loop for this is sketched after this list).
- Boot once, watch output. qm start from the CLI; capture the full error if it fails.
- Post-incident clean-up. Document which layer failed (config or storage), and add a monitor/guardrail.
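The "validate before boot" step as a one-liner, assuming VMID 101; it resolves every volume reference in the config and prints any that fail (CD-ROMs and "none" placeholders are skipped):
cr0x@server:~$ vmid=101
cr0x@server:~$ for vol in $(qm config $vmid | grep -E '^(scsi|virtio|sata|ide|efidisk|tpmstate)[0-9]+:' | grep -v 'media=cdrom' | awk '{print $2}' | cut -d',' -f1 | grep -v '^none$'); do pvesm path "$vol" >/dev/null 2>&1 || echo "UNRESOLVED: $vol"; done
pvesm path catches bad storage IDs and most missing volumes; pair it with the backend listings above for a full answer.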
Checklist: what you should not do under pressure
- Don’t create a new empty disk with the same name “to make it boot.” That’s how you overwrite the evidence and guarantee data loss.
- Don’t rename storage IDs casually. If you must, plan a bulk update of VM configs and validate with a dry run.
- Don’t assume “active” storage means “correct” storage. An NFS mount can be active but pointed at the wrong export.
- Don’t treat local storage as shared storage. If the pool/VG exists only on one node, restrict it to that node.
- Don’t clear locks blindly. Locks are often the only thing preventing you from making a bad situation worse.
Checklist: prevention that actually works
- Standardize storage IDs and enforce them via review. Names are a dependency.
- Monitor mount presence and “mount correctness” (device ID, not just directory existence).
- Alert on ZFS pool import failures and LVM thinpool metadata usage.
- Run a nightly audit: for each VM disk volid, verify it exists on the backend and is resolvable via pvesm path.
- Require change evidence (before/after outputs) for storage operations: move, rename, delete, pool changes.
FAQ
1) How do I tell if it’s a config problem or a storage problem?
If pvesm list can’t see the volume and backend listing can’t see it either, it’s storage/object reality. If backend shows it but Proxmox can’t resolve it, it’s config/storage definition mismatch.
2) The VM config references local-lvm, but pvesm status shows it inactive. What’s fastest?
Fix storage activation first: confirm the VG exists and is visible. Editing VM config to point elsewhere is usually a short-term hack that becomes long-term corruption.
3) Can I just edit /etc/pve/qemu-server/<vmid>.conf directly?
Yes, but do it like a grown-up: make a copy, change one thing, and validate with qm config and pvesm path. Also remember /etc/pve is cluster-managed.
4) I see a zvol like tank/vm-101-disk-1, but Proxmox says it doesn’t exist.
Usually a storage.cfg mismatch (wrong pool name), node restriction, or the volume is in a different pool/dataset than Proxmox expects. Confirm the storage definition points to the same pool and that the node has access.
5) The disk file exists on NFS, but Proxmox can’t open it.
Check mount options and permissions/ownership. Also verify you’re not looking at a local directory because the mount failed. mount output is the referee.
6) What does it mean if a VM starts on one node but not after migration?
It usually means the storage isn’t truly shared or isn’t identically defined on every node. Proxmox migration can move config, but it can’t conjure a local VG into existence elsewhere.
7) What’s the safest way to recover if the volume was deleted?
Stop, confirm deletion (backend history/logs), then restore from backup or snapshot. If you have ZFS snapshots/replication, restore the dataset/zvol and then reattach by volume ID.
8) Why does Proxmox sometimes show the disk in the GUI but start still fails?
Because listing metadata is easier than opening the device. You can “see” a volume ID, but QEMU can still fail due to permissions, stale device nodes, activation issues, or backend read-only states.
9) How do I prevent a storage rename from breaking VMs?
Don’t rename storage IDs in place unless you plan a coordinated update. If you must, update VM configs in a controlled change window and verify with pvesm path for every disk.
10) Is there a single command that finds broken disk references?
Not a single perfect one, but you can script it: iterate VMIDs, parse disk lines, run pvesm path and flag failures. The absence of that audit is why this incident keeps happening.
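A sketch of such an audit, under the assumptions used throughout this guide: QEMU VMs only, run on each node as root (container configs in /etc/pve/lxc would need the same treatment for their rootfs/mp lines). The script name and disk-key regex are mine, not a Proxmox tool:
#!/usr/bin/env bash
# audit-vm-disks.sh: flag VM disk references that Proxmox cannot resolve.
set -u
for conf in /etc/pve/qemu-server/*.conf; do
    vmid=$(basename "$conf" .conf)
    # Disk keys only; skip CD-ROMs and "none" placeholders.
    # Snapshot sections in the config are included on purpose: their volumes should resolve too.
    grep -E '^(scsi|virtio|sata|ide|efidisk|tpmstate)[0-9]+:' "$conf" \
      | grep -v 'media=cdrom' \
      | awk '{print $2}' | cut -d',' -f1 \
      | while read -r volid; do
            [ "$volid" = "none" ] && continue
            if ! pvesm path "$volid" >/dev/null 2>&1; then
                echo "VM $vmid: cannot resolve $volid"
            fi
        done
done
Run it from cron or a monitoring check; non-empty output is exactly the early feedback loop the rest of this guide keeps asking for.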
Conclusion: next steps that prevent a repeat
Diagnosing “VM disk vanished” is mostly the discipline of not mixing layers. Proxmox config is a set of pointers. Storage is the set of objects. Your job is to find the mismatch with proof, not vibes.
Next steps you should actually take:
- Implement a daily audit: for each VM disk volid, verify pvesm path resolves and the backend object exists.
- Harden mounts and pool imports: treat "mount correctness" and "pool imported" as monitored SLOs, not as assumptions.
- Change control for storage IDs: storage.cfg edits should be reviewed like code. Names are dependencies.
- Write down your recovery rules: “never recreate a missing disk name,” “verify backend existence before config edits,” and “logs first, reboot last.”
The best outcome is not a clever recovery. It’s a boring incident that never happens again because you made it mechanically hard to lose track of where the disks are.