Proxmox snapshot stuck: safely cleaning LVM-thin leftovers

You click “Delete snapshot” in Proxmox. The task spins. Then it fails. Or worse: it says it succeeded, but the storage still shows ghost volumes, your thin pool is still bloated, and backups keep tripping over the same landmine.

This is the gritty reality of LVM-thin under a busy virtualization cluster: snapshots are cheap until they’re not, and cleanup is safe only if you understand what’s actually allocated, what’s merely referenced, and what’s outright orphaned.

The mental model: how Proxmox + LVM-thin snapshots really work

If you treat “snapshot” as a magical time machine, you will eventually delete the wrong thing. Treat it as storage plumbing with reference counts and copy-on-write behavior and you’ll sleep better.

What Proxmox is doing when you click “Snapshot”

For LVM-thin storage (lvmthin in Proxmox), a VM disk is typically a thin volume inside a thin pool. A snapshot is another thin volume that shares blocks with the origin volume. Writes go somewhere else; reads might come from the shared blocks or the delta blocks. That “somewhere else” is still your thin pool. Thin provisioning isn’t free space; it’s deferred billing.

Proxmox tracks snapshots in VM config (/etc/pve/qemu-server/<vmid>.conf) and uses LVM tooling underneath. When snapshot deletion fails, you often end up with disagreement between:

  • Proxmox’s inventory (what it thinks exists)
  • LVM’s reality (what thin volumes exist and who references whom)
  • Device-mapper’s state (what’s active, open, or suspended)
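
A quick way to put those three views side by side before touching anything; a minimal sketch assuming VMID 104, VG pve, and the default config path:

# Proxmox's inventory: lock state and snapshot sections in the VM config
grep -E '^(lock:|\[)' /etc/pve/qemu-server/104.conf

# LVM's reality: thin LVs for this VM and their origin relationships
lvs -a -o lv_name,lv_attr,origin,data_percent pve | grep vm-104

# Device-mapper's state: which of those LVs are actually mapped, and whether they're open
dmsetup info -c | grep 'vm--104'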

What “leftovers” look like in LVM-thin

Leftovers show up as thin volumes that no longer have a corresponding VM config reference, as snapshot volumes whose origin has been removed, or as volumes Proxmox would like to delete but LVM refuses to touch because something still holds them open.

Common leftover patterns:

  • Orphaned snapshot thin LV: the VM snapshot record is gone, but the thin LV remains.
  • Orphaned base disk: the VM disk got moved or deleted at the Proxmox layer, but the thin LV is still there.
  • “busy” devices: qemu, qemu-nbd, backup tooling, or even stale device-mapper mappings keep the LV open.
  • thin pool metadata pressure: deletions and merges get slow or fail when metadata is near full or the pool is unhealthy.

One quote, because it’s still true

Paraphrased idea (Werner Vogels): you should prepare for failure as a normal operating condition, not as a rare exception.

Fast diagnosis playbook (check 1/2/3)

Don’t wander. Don’t “just reboot the node” unless you can tolerate an outage and you know which devices are open. Do this instead.

1) Check whether Proxmox is stuck on its own locks or a running task

  • Is there a running task for snapshot delete/rollback?
  • Is there a stale lock in the VM config?
  • Are you racing with a backup (vzdump) or replication job?
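
A handful of commands answers all three questions; treat this as a sketch (VMID 104 assumed), not a complete audit:

# Any lock recorded in the VM config?
grep '^lock:' /etc/pve/qemu-server/104.conf

# Tasks Proxmox currently considers active on this node
cat /var/log/pve/tasks/active

# Backup or replication activity that could be racing your delete
ps aux | grep -E '[v]zdump|[q]emu-nbd'
pvesr status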

2) Check LVM-thin pool health and metadata usage

  • Is thin pool data or metadata close to 100%?
  • Are there warnings about transaction IDs, needs-check, or read-only behavior?
  • Is dmeventd / the lvm2-monitor service reacting, or has monitoring been disabled?

3) Identify who is holding the LV open

  • Is qemu still running?
  • Is a mapper device still active?
  • Is something like qemu-nbd or a stale mount holding it?

Once you know which of these three is the bottleneck, cleanup becomes mostly mechanical. If you don’t know, every command is a coin toss.

Interesting facts and context (because history repeats)

  1. LVM snapshots predate thin provisioning. Classic LVM snapshots were notoriously space-hungry and slow under heavy write workloads; thin snapshots changed the performance profile but not the operational sharp edges.
  2. Device-mapper is the real engine. LVM is userland orchestration; the Linux kernel’s device-mapper does the actual mapping, reference tracking, and thin provisioning behavior.
  3. Thin pool metadata is its own “disk.” When metadata fills, the pool can flip read-only to prevent corruption. Deletions can become impossible at the worst time—right when you need space.
  4. Deleting snapshots can be expensive. Thin snapshot deletion may trigger block discards and metadata updates, which are not “free,” especially on busy pools.
  5. Proxmox uses config state as truth. If config and storage drift, the UI can lie with a straight face. The system isn’t malicious; it’s just reading a different book than LVM is.
  6. Thin provisioning can hide risk. Overcommit works until it doesn’t; the failure mode is abrupt and ugly, and it tends to happen during snapshot-heavy operations.
  7. Older kernels had rougher thin repair stories. Thin metadata tooling improved over time, but repair is still the kind of “success” that comes with a headache and a postmortem.
  8. Snapshot storms are a real thing. Repeated snapshot create/delete cycles can fragment metadata and amplify latency in ways that surprise people who only look at raw IOPS charts.

Joke #1: A snapshot is like a gym membership—cheap to create, mysteriously expensive to get rid of.

Practical tasks with commands: diagnose, decide, act

These tasks are ordered the way I actually run them in production: prove what’s happening, reduce blast radius, then delete things with confidence.

Task 1: Confirm the storage type and name Proxmox is using

cr0x@server:~$ pvesm status
Name             Type     Status           Total            Used       Available        %
local             dir     active        197919072        45121480       142663256   22.80%
local-lvm      lvmthin    active        9996536384      7125647360      2870889024   71.28%

What it means: You have an lvmthin storage called local-lvm. Snapshot leftovers will be thin LVs inside the VG backing this storage (commonly pve).

Decision: Use LVM-thin tools, not ZFS tooling, not “delete random files.” Your cleanup path is lvs/lvremove/dmsetup, plus Proxmox config hygiene.

Task 2: Identify the VM and see whether Proxmox thinks a snapshot exists

cr0x@server:~$ qm listsnapshot 104
`-> pre-upgrade-2025w51   2025-12-18 02:14:09     VM snapshot
 `-> before-agent-change  2025-12-20 11:07:55     VM snapshot
  `-> current                                     You are here!

What it means: Proxmox believes snapshots exist. If deletion fails, either LVM refused, or Proxmox state drifted mid-operation.

Decision: If Proxmox lists snapshots, prefer deleting via qm delsnapshot first. Manual LVM deletions are for when state is already inconsistent or Proxmox is stuck.

Task 3: Look for VM config locks that block snapshot operations

cr0x@server:~$ grep -E '^(lock:|\[)' /etc/pve/qemu-server/104.conf
lock: snapshot
[pre-upgrade-2025w51]
[before-agent-change]

What it means: VM is locked for a snapshot operation. This could be legitimate (still running) or stale (task died).

Decision: If there is no running task and the lock is stale, unlock the VM before retrying deletes. If there is a running task, stop and investigate that first.
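
If you have confirmed the lock is stale (no running task behind it), clearing it and retrying is two commands; a minimal sketch for VMID 104 and the snapshot from Task 2:

# Only after confirming no task is still running against this VM
qm unlock 104

# Then retry the delete through Proxmox rather than reaching for lvremove
qm delsnapshot 104 pre-upgrade-2025w51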

Task 4: Check Proxmox task history for the failing operation

cr0x@server:~$ journalctl -u pvedaemon -u pveproxy --since "2 hours ago" | tail -n 30
Dec 26 09:10:14 server pvedaemon[2214]: starting task UPID:server:0000B5AA:0001B2C1:676D6E46:qmdelsnapshot:104:root@pam:
Dec 26 09:10:16 server pvedaemon[2214]: command 'lvremove -f pve/vm-104-disk-0-pre--upgrade--2025w51' failed: exit code 5
Dec 26 09:10:16 server pvedaemon[2214]: error: Logical volume pve/vm-104-disk-0-pre--upgrade--2025w51 contains a filesystem in use.
Dec 26 09:10:16 server pvedaemon[2214]: TASK ERROR: command 'lvremove -f pve/vm-104-disk-0-pre--upgrade--2025w51' failed: exit code 5

What it means: This isn’t “Proxmox UI weirdness.” LVM refused because the device is “in use.” That’s usually an open file descriptor, a stale mapper device, or qemu still referencing it.

Decision: Identify the process holding it. Don’t force-remove until you know what “in use” really is.

Task 5: Find LVM thin volumes for the VM and map snapshot names to LVs

cr0x@server:~$ lvs -a -o vg_name,lv_name,lv_attr,origin,devices,lv_size,data_percent,metadata_percent pve | grep -E 'vm-104'
pve  vm-104-disk-0                             Vwi-aotz--                                    thinpool           200.00g  78.12
pve  vm-104-disk-0-pre--upgrade--2025w51       Vwi---tz-k  vm-104-disk-0                     thinpool           200.00g  12.44
pve  vm-104-disk-0-before--agent--change       Vwi---tz-k  vm-104-disk-0                     thinpool           200.00g   3.02

What it means: You have the base disk and two thin snapshots; origin points back to the base. The lv_attr field tells the story: the base disk is active and open (the a and o flags), while the snapshot LVs carry the k flag, meaning activation is skipped by default until something explicitly maps them.

Decision: If LVs exist, Proxmox isn’t hallucinating. If Proxmox thinks a snapshot exists but it’s missing here, you’re dealing with drift and will need to fix config state.

Task 6: Inspect thin pool health (data + metadata usage)

cr0x@server:~$ lvs -o vg_name,lv_name,lv_attr,lv_size,data_percent,metadata_percent pve
VG   LV        Attr       LSize   Data%  Meta%
pve  root      -wi-ao----  96.00g
pve  swap      -wi-ao----   8.00g
pve  thinpool  twi-aotz--   7.30t  71.28  92.10

What it means: Metadata at 92% is a flashing yellow light. Thin metadata exhaustion causes strange failures: deletes hang, pool flips read-only, allocations fail, and you start learning new swear words.

Decision: If metadata is >80%, prioritize cleanup and/or metadata extension before aggressive snapshot churn. If it’s >95%, treat it like an incident.
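
If you’d rather have a pass/fail check than eyeball percentages, a small guard works; a sketch assuming the pool is pve/thinpool and an 80% warning threshold:

# Warn when thin pool metadata usage crosses the threshold
META=$(lvs --noheadings -o metadata_percent pve/thinpool | tr -d ' ')
awk -v m="$META" 'BEGIN { if (m > 80) exit 1 }' \
  || echo "WARNING: thinpool metadata at ${META}%"

Wire the same check into monitoring so you see the trend, not just the cliff.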

Task 7: Check whether the VM is running and could still be referencing snapshot devices

cr0x@server:~$ qm status 104
status: running

What it means: A running VM may keep devices open. Snapshot deletion usually targets snapshot volumes, but depending on operations (rollback, backup, clone), extra mappings can exist.

Decision: For safe cleanup, plan a maintenance window to stop the VM if you need to remove LVs that claim “in use.” If you can’t stop it, you must prove no critical path still references the LV.

Task 8: Find who has the LV open (lsof/fuser on the device path)

cr0x@server:~$ lsof /dev/pve/vm-104-disk-0-pre--upgrade--2025w51 | head
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
qemu-syst 915 root   49u  BLK  253,7      0t0  891 /dev/dm-7

What it means: qemu has the device open. LVM is right to refuse deletion.

Decision: Figure out why qemu is holding a snapshot LV. Commonly: a block job, a backup, or a rollback attempt. Don’t kill qemu blindly unless you’ve accepted VM impact.
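
To cross-check what lsof reports, look at the QEMU process’s open file descriptors directly; a sketch using the PID (915) and dm node (253:7 -> dm-7) from the output above:

# Which of qemu's file descriptors point at the snapshot's dm node
ls -l /proc/915/fd | grep 'dm-7$'

# Confirm which mapper name /dev/dm-7 resolves to
ls -l /dev/mapper | grep 'dm-7$'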

Task 9: Map device-mapper nodes to human names

cr0x@server:~$ dmsetup ls --tree | sed -n '1,80p'
pve-thinpool (253:2)
 ├─pve-vm--104--disk--0 (253:3)
 ├─pve-vm--104--disk--0--pre--upgrade--2025w51 (253:7)
 └─pve-vm--104--disk--0--before--agent--change (253:8)

What it means: The snapshot is a dm device. If it’s open, dm won’t let LVM remove it cleanly.

Decision: Use this mapping to confirm exactly what is open, and avoid “rm -rf” style thinking. You can also use dmsetup info to verify open count.

Task 10: Check open count on the problematic device

cr0x@server:~$ dmsetup info -c /dev/mapper/pve-vm--104--disk--0--pre--upgrade--2025w51
Name  Maj Min Stat Open Targ Event  UUID
pve-vm--104--disk--0--pre--upgrade--2025w51 253 7 L--w 1 1 0 LVM-6l3B0bQyQq0yYV3uQ0sG0fQxY3o2mV3S

What it means: Open is 1. Something has it. That explains the lvremove failure.

Decision: Find and stop the holder (preferred), or end the operation holding it, then retry deletion. If Open is 0 and lvremove still fails, you’re dealing with stale mappings or LVM metadata weirdness.
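
If the open count is 0, nothing holds the device, and lvremove still refuses (typically after a crash mid-operation left a stale mapping), the careful escape hatch is removing just the device-mapper node; a sketch for the snapshot above, only after you have proven it is unused:

# Verify one more time that the open count is really 0
dmsetup info -c /dev/mapper/pve-vm--104--disk--0--pre--upgrade--2025w51

# Remove only the stale dm node; LVM metadata itself is untouched
dmsetup remove pve-vm--104--disk--0--pre--upgrade--2025w51

# Now let LVM perform the actual deletion
lvremove pve/vm-104-disk-0-pre--upgrade--2025w51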

Task 11: Check for backup jobs or snapshot-based backups currently running

cr0x@server:~$ ps aux | grep -E 'vzdump|qemu-nbd|pbs-client' | grep -v grep
root     18444  2.1  0.1  35472 12896 ?        Ss   09:05   0:16 vzdump 104 --mode snapshot --compress zstd

What it means: The backup job may have created temporary snapshot mappings or may be holding devices open during transfer.

Decision: Let the backup finish or stop it cleanly if it’s wedged. Deleting snapshots mid-backup is a great way to manufacture corruption allegations.

Task 12: Attempt deletion the supported way (after removing the root cause)

cr0x@server:~$ qm delsnapshot 104 pre-upgrade-2025w51
delete snapshot 'pre-upgrade-2025w51'
TASK OK

What it means: Proxmox managed the deletion, updated config state, and asked LVM to remove the right LV.

Decision: Re-check thin pool usage and that the LV is actually gone. “TASK OK” is necessary, not sufficient.

Task 13: Verify the LV is actually removed (trust, then verify)

cr0x@server:~$ lvs -a pve | grep -E 'vm-104-disk-0-pre--upgrade'

What it means: No output means the LV is gone.

Decision: If it’s still present, you’ve got a partial failure: Proxmox state says “deleted” but LVM didn’t. That’s when manual cleanup starts, and you do it carefully.

Task 14: Find “unknown” volumes from Proxmox’s perspective

cr0x@server:~$ pvesm list local-lvm | grep -E 'vm-104|unused|snap' | head -n 20
local-lvm:vm-104-disk-0                          raw  images  214748364800  104
local-lvm:vm-104-disk-0-before--agent--change    raw  images  214748364800  104

What it means: Proxmox still sees the “before-agent-change” snapshot LV as a VM image. That may be correct or leftover depending on what qm listsnapshot says.

Decision: If Proxmox doesn’t list the snapshot but pvesm list does, treat it as an orphan candidate and confirm references before removal.

Task 15: Check the VM config for “unused” disks pointing to old LVs

cr0x@server:~$ grep -E '^(scsi|virtio|sata|unused)[0-9]+:' /etc/pve/qemu-server/104.conf
scsi0: local-lvm:vm-104-disk-0,size=200G
unused0: local-lvm:vm-104-disk-0-before--agent--change

What it means: Proxmox sometimes parks no-longer-attached volumes as unusedX. This is a polite form of hoarding.

Decision: If it’s truly unused, remove it through Proxmox so it updates its inventory, then validate with LVM.

Task 16: Remove an unused disk safely via Proxmox tooling

cr0x@server:~$ qm config 104 | grep '^unused'
unused0: local-lvm:vm-104-disk-0-before--agent--change

cr0x@server:~$ qm set 104 --delete unused0
update VM 104: -delete unused0

What it means: The config reference is removed. Depending on storage and settings, the underlying LV may still exist until you explicitly remove it.

Decision: Now remove the volume via pvesm free (preferred) or lvremove (only if you’ve confirmed it’s orphaned).

Task 17: Free a volume using Proxmox storage manager

cr0x@server:~$ pvesm free local-lvm:vm-104-disk-0-before--agent--change
successfully removed 'local-lvm:vm-104-disk-0-before--agent--change'

What it means: Proxmox asked the backend to delete the LV and updated its internal state.

Decision: If this fails with “volume is busy,” go back to open count and process holders. Forcing deletion here is how you get 3 a.m. surprises.

Task 18: If Proxmox is out of sync, locate true orphans by scanning LVM for “vm-” volumes without configs

cr0x@server:~$ ls /etc/pve/qemu-server | sed 's/\.conf$//' | sort -n | tail -n 5
101
102
104
110
132

cr0x@server:~$ lvs --noheadings -o lv_name pve | awk '{print $1}' | grep '^vm-' | head
vm-101-disk-0
vm-102-disk-0
vm-104-disk-0
vm-999-disk-0-oldsnap

What it means: vm-999-... exists in LVM but there is no 999.conf. That’s an orphan candidate.

Decision: Before deleting, confirm it’s not referenced by another VM (linked clones, template chains) and not used by replication or backup tooling.
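
To make this spot check repeatable, diff the two lists; a minimal reconciliation sketch assuming the VG is pve, local lvmthin storage, and QEMU-only workloads on this node (containers under /etc/pve/lxc would need the same treatment):

# VMIDs that own thin LVs in the pool but have no VM config on this node
comm -13 \
  <(ls /etc/pve/qemu-server | sed 's/\.conf$//' | sort -u) \
  <(lvs --noheadings -o lv_name pve | awk '{print $1}' \
      | grep -o '^vm-[0-9]*' | sed 's/^vm-//' | sort -u)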

Task 19: Confirm whether an orphan candidate is referenced anywhere in Proxmox configs

cr0x@server:~$ grep -R "vm-999-disk-0-oldsnap" /etc/pve/qemu-server/ || true

What it means: No output suggests no VM config references it.

Decision: Still verify it isn’t open and isn’t part of an origin/snapshot chain you care about.

Task 20: Verify snapshot/origin relationships for an orphan candidate

cr0x@server:~$ lvs -a -o lv_name,lv_attr,origin,lv_size,data_percent pve | grep -E 'vm-999|Origin'
vm-999-disk-0-oldsnap              Vwi---tz-k  vm-999-disk-0                 100.00g  24.11
vm-999-disk-0                      Vwi-aotz--                                   100.00g  88.02

What it means: The “oldsnap” LV is a snapshot of vm-999-disk-0. If vm-999-disk-0 is also orphaned, you may remove both—but order matters and you should verify open counts.

Decision: Remove snapshots first, then origins, unless you have a reason (and proof) to do otherwise.
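
Before removing anything, it helps to list exactly which thin LVs sit on top of a given origin so the order is explicit; a sketch using lvs selection criteria:

# Every thin snapshot whose origin is vm-999-disk-0
lvs -a -o lv_name,origin,lv_attr,data_percent --select 'origin="vm-999-disk-0"' pve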

Task 21: Confirm nothing has the orphan LV open

cr0x@server:~$ dmsetup info -c /dev/mapper/pve-vm--999--disk--0--oldsnap
Name  Maj Min Stat Open Targ Event  UUID
pve-vm--999--disk--0--oldsnap 253 22 L--w 0 1 0 LVM-0bqVtQyS2L3kVJrQ9xF5jH1gYd2nM3p

What it means: Open count is 0. Good.

Decision: You can remove it, but do it in a way that keeps LVM metadata consistent and makes auditing easy.

Task 22: Remove an orphan LV with LVM directly (only after proving it’s orphaned)

cr0x@server:~$ lvremove -y pve/vm-999-disk-0-oldsnap
Logical volume "vm-999-disk-0-oldsnap" successfully removed.

What it means: The snapshot LV is gone. If it was the leftover, you should see thin pool usage drop over time (sometimes not instantly if discards are deferred).

Decision: Re-check pool usage and metadata. If metadata remains dangerously high, plan a metadata extension or a maintenance window for deeper cleanup.

Task 23: Re-check thin pool status after cleanup

cr0x@server:~$ lvs -o lv_name,data_percent,metadata_percent pve/thinpool
LV       Data%  Meta%
thinpool 69.90  88.40

What it means: Metadata dropped from 92% to 88%. That’s meaningful. Metadata relief often matters more than raw data usage when deletions are failing.

Decision: If metadata is still high, keep pruning snapshots and consider extending thin metadata LV.

Task 24: Extend thin pool metadata (carefully) when you’re near the cliff

cr0x@server:~$ lvdisplay pve/thinpool | grep -E 'LV Size|Pool metadata'
LV Size                7.30 TiB
LV Pool metadata       thinpool_tmeta

cr0x@server:~$ lvs -a -o lv_name,lv_size pve | grep thinpool_tmeta
[thinpool_tmeta]  16.00g

cr0x@server:~$ lvextend -L +4G pve/thinpool_tmeta
Size of logical volume pve/thinpool_tmeta changed from 16.00 GiB (4096 extents) to 20.00 GiB (5120 extents).
Logical volume pve/thinpool_tmeta successfully resized.

What it means: You increased metadata capacity, which is the standard pressure-release valve for a pool that is otherwise healthy. If the pool itself reports metadata damage (needs-check flags, read-only behavior), the repair path is lvconvert --repair, but that generally requires the pool to be deactivated first and deserves its own maintenance window; after a plain metadata extension it is usually unnecessary.

Decision: If you don’t have free extents in the VG, stop here and plan storage expansion. Extending metadata into “no space left” territory is not a clever hack; it’s how you turn warnings into downtime.
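
It is worth confirming the VG actually has room to give before running the lvextend above; a quick sketch for VG pve:

# Free space left in the volume group (this is what lvextend draws from)
vgs -o vg_name,vg_size,vg_free pve

# Current size of the pool's data and metadata sub-LVs, for context
lvs -a -o lv_name,lv_size pve | grep -E 'thinpool_(tdata|tmeta)'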

Task 25: Validate Proxmox inventory after manual LVM changes

cr0x@server:~$ pvesm list local-lvm | grep -E 'vm-999|oldsnap' || true

What it means: No output means Proxmox no longer sees it. If it still shows up, a stale config reference or cached inventory is the likely culprit; rescan and re-check the VM config rather than assuming the storage layer is lying.

Decision: Prefer to keep Proxmox as the actor for deletes. If you must do manual LVM deletes, follow with inventory verification and config cleanup.
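
If inventory and config drifted after manual changes, a rescan asks Proxmox to re-read the storage and park any volumes it finds but does not reference as unusedX entries; a short sketch for VMID 104:

# Re-scan storage and update this VM's config
qm rescan --vmid 104

# Then confirm what the config actually references now
qm config 104 | grep -E '^(scsi|virtio|sata|ide|efidisk|unused)[0-9]+:'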

Joke #2: LVM-thin is like a junk drawer—everything fits until you need to find one specific thing and it’s holding three other things hostage.

Three corporate mini-stories (how this goes wrong in real life)

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran a Proxmox cluster for internal services: build runners, some legacy apps, and a handful of “temporary” VMs that somehow survived three reorganizations. Snapshots were used heavily before patch windows. The storage backend was LVM-thin on fast SSDs, because “it’s simple and local.”

During a routine kernel update window, an engineer assumed that deleting a snapshot is always a metadata operation and therefore basically instant. They queued deletions for dozens of old snapshots across multiple VMs while backups were also running in snapshot mode. The system didn’t immediately fail; it just got slow. Really slow.

Thin pool metadata climbed into the high 90s. LVM started throwing warnings, then the thin pool became read-only to protect itself. That was the moment when “delete snapshots to free space” stopped being an option, because the very act of freeing space required metadata writes.

The outage wasn’t dramatic in the sense of explosions—just a spreading inability to write: VMs hung on disk I/O, services timed out, and every dashboard lit up as if the network was failing. The root cause was storage, but it looked like everything else.

The fix was boring and painful: stop the snapshot-heavy jobs, shut down non-critical VMs to reduce churn, extend metadata, and then carefully prune snapshots in a controlled order. The lesson stuck: snapshot deletion can be the heaviest I/O you do all week, depending on metadata pressure and how many concurrent operations you allow.

Mini-story 2: The optimization that backfired

Another org wanted faster backups. They tuned their backup pipeline to run more VMs in parallel, still in snapshot mode, because it reduced guest downtime. Someone also enabled aggressive scheduling so backups would finish before business hours. It worked great for a few weeks.

Then it started “randomly” failing: sometimes snapshots wouldn’t delete, sometimes thin pool usage wouldn’t drop, sometimes LVs stayed “busy” long after the backup task ended. People suspected Proxmox bugs, then kernel bugs, then “maybe SSD firmware.” The usual tour.

The real issue was a coupling they didn’t model: parallel snapshot backups increased the lifetime and number of snapshot devices mapped at once. That increased open counts, and it increased metadata churn. When metadata usage climbed, operations got slower, which extended the backup window, which created more overlap, which created more open devices. A feedback loop with a nice corporate haircut.

They “optimized” themselves into a system where cleanup couldn’t keep up with creation. The backfiring part wasn’t that parallelism is bad; it’s that they added parallelism without guardrails tied to thin pool health. The fix was to cap concurrency based on metadata percentage and to stagger snapshot operations. Backups got a bit longer, but the cluster stopped playing storage roulette.

Mini-story 3: The boring but correct practice that saved the day

A regulated environment (lots of paperwork, lots of change control) had a Proxmox cluster that was not the fastest, not the newest, but it was stable. Their storage practice was painfully conservative: limit snapshot count per VM, enforce TTL on “temporary” snapshots, and run a weekly audit that compared Proxmox inventory against LVM.

One week, the audit flagged a handful of LVs that didn’t exist in any VM config. Nobody panicked. They did what the runbook said: verify open count, check for replication/backup tasks, confirm origin relationships, then schedule removal in the next window.

Two days later, a node crashed due to an unrelated hardware issue. During recovery, they discovered that the orphan LVs were from a previous half-failed migration attempt. If those LVs had remained, they would have pushed metadata near the cliff during the recovery write burst. Instead, metadata headroom existed, and the recovery stayed boring. Boring is the highest compliment in ops.

The saving practice wasn’t genius. It was regular reconciliation and disciplined limits. Their snapshot system wasn’t “more reliable” because of a fancy filesystem; it was reliable because they treated storage state like financial state: reconcile it, don’t assume it.

Checklists / step-by-step plan: safe cleanup of LVM-thin leftovers

Here’s the plan I’d hand to an on-call who needs to fix this without improvising. The goal: delete stuck snapshots and orphaned LVs without causing accidental data loss or turning the thin pool read-only.

Checklist A: Stabilize the patient

  1. Freeze new snapshot creation. Pause backup jobs and any automation that creates snapshots.
  2. Check thin pool metadata percentage. If it’s very high, treat every operation as riskier and slower.
  3. Identify whether the VM(s) can be stopped. If you can stop them, you’ll solve “busy” conditions faster and safer.
  4. Confirm no storage outage is already in progress. If the pool is read-only or reporting errors, stop and move into incident mode.

Checklist B: Prefer Proxmox-native deletion first

  1. List snapshots with qm listsnapshot <vmid>.
  2. Try deletion with qm delsnapshot.
  3. If it fails, read the task log and capture the exact LV name and the failure message.
  4. If it says “busy,” find open holders and remove the holder, not the LV.

Checklist C: When state is inconsistent, reconcile before you delete

  1. List LVM volumes for the VM ID using lvs.
  2. Search Proxmox configs for references to candidate LVs.
  3. Check snapshot/origin relationships and open count.
  4. Only then use pvesm free or lvremove.

Checklist D: Order of operations for manual cleanup (the safe default)

  1. Stop the VM if possible (or at least stop backup/replication tasks).
  2. Delete snapshots first (thin snapshot LVs that have an origin).
  3. Delete truly unused/orphan LVs next.
  4. Re-check thin pool metadata. If still high, consider extending metadata or scheduling a more comprehensive pruning.
  5. Bring the VM back and run a quick integrity check at the guest level if you did anything invasive.

Checklist E: If thin pool metadata is critically high

  1. Stop snapshot churn (backups, snapshot schedules, migrations that snapshot).
  2. Free easy wins: delete obviously old snapshots on non-critical VMs first.
  3. If you have free extents, extend thin metadata (thinpool_tmeta).
  4. Only then attempt heavier cleanup like bulk snapshot deletions.

Common mistakes: symptom → root cause → fix

1) “Snapshot delete hangs forever”

  • Symptom: Proxmox task runs for a long time; storage stays busy; eventual timeout.
  • Root cause: Thin pool metadata pressure and/or heavy concurrent snapshot activity; deletion is doing real work.
  • Fix: Reduce concurrency, pause backups, check metadata %. If high, extend metadata and prune snapshots in smaller batches.

2) “TASK ERROR: contains a filesystem in use” or “device busy”

  • Symptom: lvremove fails; Proxmox reports busy volume.
  • Root cause: qemu or backup tooling still holds the dm device open.
  • Fix: Use lsof/dmsetup info to identify holders; stop the task or VM; retry Proxmox-native deletion.

3) “Snapshot not listed in UI, but space is still used”

  • Symptom: No snapshots in qm listsnapshot, yet thin pool stays full; LVs exist in lvs.
  • Root cause: Orphaned thin LVs due to partial failures (migration, backup, crash mid-delete).
  • Fix: Reconcile: scan LVM for vm- LVs, grep configs for references, check open count, then remove via pvesm free or lvremove.

4) “Can’t create snapshots anymore”

  • Symptom: Snapshot creation fails; errors about thin pool or insufficient space.
  • Root cause: Thin pool metadata full (more common than data full in snapshot-heavy setups).
  • Fix: Extend thinpool_tmeta if possible; prune snapshots; enforce snapshot limits per VM.

5) “Proxmox shows ‘unused’ disks forever”

  • Symptom: VM config accumulates unused0, unused1 entries; storage never frees.
  • Root cause: Detach without delete; cautious defaults; or operators avoiding destructive actions.
  • Fix: Audit the unusedX entries in the VM config (qm config <vmid>), confirm the disks are not needed, then delete the config references and free the volumes via pvesm free.

6) “After manual lvremove, Proxmox still thinks the volume exists”

  • Symptom: UI shows phantom volume; operations fail with ‘volume does not exist’ or similar mismatches.
  • Root cause: Config references remain (unusedX or disk entries), or Proxmox inventory cache hasn’t refreshed.
  • Fix: Remove references from VM config via qm set --delete; verify with pvesm list.

FAQ

1) Is it safe to delete LVM-thin snapshots while the VM is running?

Sometimes, yes—when Proxmox created the snapshot and is deleting it in a normal flow. If the LV is “busy,” the safe answer becomes “stop the VM or the job holding it.” Don’t fight open devices.

2) Why does thin pool metadata fill faster than data?

Snapshots and frequent block changes create lots of mapping updates. Metadata tracks those mappings. A small metadata LV can become the limiting factor long before you run out of data space.

3) If qm listsnapshot is empty, can there still be snapshots?

Yes. Proxmox tracks snapshots in config state; LVM tracks thin snapshot volumes. If deletion or migration partially failed, LVM can retain snapshot LVs that Proxmox no longer references.

4) What’s the safest way to delete an orphaned LVM thin LV?

First prove it’s unreferenced: grep Proxmox configs, check origin chains, confirm open count is zero. Then delete via pvesm free if Proxmox knows it, otherwise lvremove.

5) Can I just reboot the node to clear “device busy”?

A reboot will often clear open counts, yes. It also often clears your maintenance window and replaces it with an incident. Reboot only when you understand who holds the device and you can tolerate impact.

6) Why didn’t my thin pool usage drop after deleting snapshots?

Space reclamation can be delayed by discard behavior, thin pool internal accounting, and ongoing writes. Also, if the snapshot was holding unique blocks that are still referenced elsewhere, you won’t reclaim as much as you hoped.

7) What’s the difference between “unused disk” and “snapshot leftover” in Proxmox terms?

“Unused disk” is a Proxmox config reference to a volume not attached to the VM. A “snapshot leftover” is typically an LVM LV that exists without a corresponding Proxmox snapshot record (or vice versa).

8) Should I extend thin pool metadata or just delete more snapshots?

If metadata is near the cliff and you have free extents, extending metadata buys safety and time. Still delete snapshots—extension is not a diet plan, it’s a bigger belt.

9) How do I prevent this from happening again?

Limit snapshot count and lifetime, cap backup concurrency based on thin metadata %, and run regular reconciliation between Proxmox configs and LVM volumes. Also: stop letting “temporary” snapshots become permanent residents.
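
One concrete way to tie backup concurrency to pool health is a vzdump hook script that refuses to start a job when metadata is already tight. This is a sketch, not a drop-in: the path, pool name, and threshold are examples, and it assumes your jobs are configured with a hookscript (the script: option in /etc/vzdump.conf or per job) and that aborting the job is acceptable behavior.

#!/bin/bash
# /usr/local/bin/vzdump-thin-guard.sh (example path)
# vzdump calls the hook with the phase name as the first argument
PHASE="$1"
POOL="pve/thinpool"
LIMIT=85

if [ "$PHASE" = "job-start" ]; then
    META=$(lvs --noheadings -o metadata_percent "$POOL" | tr -d ' ')
    if awk -v m="$META" -v l="$LIMIT" 'BEGIN { exit !(m > l) }'; then
        echo "Refusing backup: $POOL metadata at ${META}% (limit ${LIMIT}%)" >&2
        exit 1   # a non-zero exit at job-start aborts the vzdump job
    fi
fi
exit 0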

Conclusion: next steps that keep you out of trouble

When Proxmox snapshots won’t delete, the problem is rarely “a broken button.” It’s almost always one of three things: a stale lock/task, a thin pool under metadata pressure, or an open device that LVM correctly refuses to destroy.

Do the fast diagnosis in order. Use Proxmox-native deletes when possible. When you must go manual, reconcile state first, verify open counts, and delete in the right order. Then give yourself the gift of prevention: enforce snapshot TTLs, cap concurrency, and monitor thin metadata like it’s a first-class SLO.
