Proxmox LVM-thin “out of data space”: free space without destroying VMs

You notice it when backups fail first. Then a VM freezes mid-write. Then Proxmox starts throwing errors like
thin-pool ... out of data space and suddenly your calm little virtualization host is acting like a suitcase
packed by a toddler: nothing else fits and everything is on the floor.

The good news: an LVM-thin pool hitting “out of data space” is usually recoverable without deleting VMs. The bad news:
you need to be deliberate, because “freeing space” in thin-provisioned storage is not the same thing as deleting files.
Let’s fix the pool, not “clean up around it”.

What “out of data space” really means (and what it doesn’t)

Proxmox’s local-lvm storage is often an LVM thin pool: a big logical volume (the “thinpool”) from which VM
disks (thin LVs) are allocated on demand. You can overprovision: present 10 TB of virtual disks backed by 2 TB of real
blocks, as long as the guests don’t actually write all of it.

When the thinpool reports “out of data space”, it means the pool’s data LV (the part that stores actual blocks)
is full. Writes to any thin volume may stall or fail. Depending on your settings, LVM may pause the pool to prevent
corruption, which looks like “everything is hung” at the VM level.

Two critical nuances that stop you from making the worst decision at 2 a.m.:

  • Deleting files inside a VM does not necessarily free thinpool space unless discard/TRIM is enabled and run.
    Without discard, the host still thinks those blocks are in use.
  • Snapshots are not “free insurance”. Thin snapshots can grow dramatically if the base disk keeps changing.
    Snapshots are copy-on-write; sustained churn turns them into silent space-eaters.

If you remember one thing: your job is to determine whether the pool is out of data, out of metadata, or both, then
pick the least risky way to reclaim or add space. Panic is optional.

Joke #1: A thinpool doesn’t “run out of space”, it “achieves full utilization” right before your SLA achieves a new low.

Fast diagnosis playbook (first/second/third checks)

This is the “don’t get lost in the weeds” sequence I use when a Proxmox host is paging me about thinpool space.
The goal is to locate the real bottleneck in under five minutes.

First: confirm what is full (data vs metadata) and whether the pool is paused

  • Check thinpool usage with lvs (data% and meta%).
  • Look for pool state: is it active, read-only, or suspended?

Second: identify what is consuming blocks (snapshots, backups, orphaned volumes)

  • List LVs with sizes and data% to find heavy writers.
  • List snapshots in Proxmox and in LVM (they aren’t always the same “story”).
  • Check for abandoned volumes from deleted VMs or failed restores.

Third: decide the emergency lever (reclaim vs expand vs both)

  • If you can add capacity quickly: extend VG and thinpool.
  • If you cannot: reclaim space via snapshot deletion and discard/trim, then consider temporary measures.
  • If metadata is the limiting factor: extend metadata now; it’s small, fast, and often the real culprit.

If you’re stuck choosing: expanding the thinpool (data + metadata) is the lowest drama move if you have any spare
storage in the VG or can add a disk. Reclaiming space is more nuanced and tends to surprise people.
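
If you want the first and second checks on one screen before working through the tasks below, here is a minimal sketch. It assumes nothing beyond lvs and dmsetup being present and doesn't hardcode a pool name: the awk keeps the header plus any LV whose attributes start with "t" (thin pools), and the grep pulls the thin-pool targets out of dmsetup's full status listing.

cr0x@server:~$ sudo lvs -a -o vg_name,lv_name,lv_attr,data_percent,metadata_percent,lv_when_full | awk 'NR==1 || $3 ~ /^t/'
cr0x@server:~$ sudo dmsetup status | grep thin-pool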

Interesting facts and context (why LVM-thin behaves this way)

These aren’t trivia for trivia’s sake. Each point maps to a failure mode or a better operational habit.

  1. LVM snapshots predate thin provisioning and were infamous for performance cliffs. Classic snapshots used a
    separate COW volume and could slow down heavily when near full; thin snapshots improved mechanics but didn’t remove
    the “space grows with churn” reality.
  2. Thin provisioning is a promise, not a guarantee. It’s mathematically possible to provision more virtual
    capacity than you can physically store; the pool only stays healthy if write growth stays under control.
  3. Thin pools have two constraints: data and metadata. Metadata tracks mappings from virtual blocks to physical
    blocks. You can have plenty of data space and still die from metadata exhaustion.
  4. Metadata pressure rises with fragmentation and snapshots. More mappings, more copy-on-write events, and
    more block-level churn increase metadata consumption.
  5. Discard/TRIM didn’t become operationally normal overnight. Many stacks defaulted to “discard off” for years
    because early implementations caused performance issues or unexpected behavior on some storage.
  6. Proxmox’s “local-lvm” defaults are designed for convenience, not perfection. It’s a sane starting point for
    labs and small clusters, but it expects you to monitor and adjust.
  7. Thin pool auto-extend exists, but it’s not a babysitter. LVM can auto-extend a thinpool when it hits a
    threshold, but only if the VG has free extents. If the VG is also full, you still crash into a wall.
  8. Space reclamation is a multi-layer negotiation. Guest OS must mark blocks free, filesystem must support
    trim, virtual disk path must pass discard, and the thinpool must accept it.

Practical tasks: commands, outputs, decisions

Below are real tasks you can run on a Proxmox node. Each one includes: the command, representative output, and the
decision you make from it. Don’t blindly paste. Read the output like it’s a diagnostic report—because it is.

Task 1: Identify the thinpool and see data% and meta%

cr0x@server:~$ sudo lvs -a -o+seg_monitor,lv_attr,lv_size,data_percent,metadata_percent,pool_lv,origin vg0
  LV                      VG  Attr       LSize   Data%  Meta%  Pool          Origin  Monitor
  root                    vg0 -wi-ao----  96.00g
  swap                    vg0 -wi-ao----   8.00g
  data                    vg0 twi-aotz--   1.60t  99.12  71.43                        monitored
  [data_tmeta]            vg0 ewi-ao----   8.00g
  [data_tdata]            vg0 Twi-ao----   1.60t
  vm-101-disk-0           vg0 Vwi-aotz--  80.00g  61.30         data                  monitored
  vm-102-disk-0           vg0 Vwi-aotz-- 200.00g  97.84         data                  monitored

What it means: The thinpool LV is vg0/data (Attr starts with twi-). Data% is 99.12%: you’re effectively out.
Meta% is 71.43%: metadata is not the immediate blocker, but it’s trending.

Decision: Treat this as an emergency. Stop anything that writes heavily (backups, restores, log storms),
and choose reclaim/expand now. If Meta% were 95–100%, prioritize metadata extension first.

Task 2: Check whether the thinpool is suspended (writes may hang)

cr0x@server:~$ sudo dmsetup status /dev/vg0/data
0 3355443200 thin-pool 0 1498012/2097152 25983713/26214400 - rw discard_passdown queue_if_no_space

What it means: The pool is in rw mode and set to queue_if_no_space. That means writes may queue when full
instead of failing fast. That feels “safer” until you realize it can deadlock your workload and your patience.

Decision: If you’re stuck in a write queue spiral, you must free or add space. Also consider whether
queue_if_no_space is the behavior you want long-term.
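
If you decide queue_if_no_space isn't what you want, the behavior is switchable per pool. A short sketch, assuming the same vg0/data pool: error mode makes writes fail fast instead of hanging, so pick deliberately, because failing fast can crash guests that would otherwise have survived a brief stall.

cr0x@server:~$ sudo lvs -o lv_name,lv_when_full vg0/data
  LV   WhenFull
  data queue
cr0x@server:~$ sudo lvchange --errorwhenfull y vg0/data
  Logical volume vg0/data changed.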

Task 3: Quick Proxmox-level view: what storage is full?

cr0x@server:~$ pvesm status
Name      Type     Status           Total        Used       Avail      %
local     dir      active        100789248    22133704    73419180   21.96%
local-lvm lvmthin  active       1753219072  1738919936     14299136   99.18%

What it means: local (directory storage) is fine. local-lvm is basically full. VM disks on local-lvm are at risk.

Decision: Focus on LVM-thin remediation, not random deletions in /var/lib/vz.

Task 4: Find the worst offenders at the LV level

cr0x@server:~$ sudo lvs -o lv_name,lv_size,lv_attr,data_percent --sort -lv_size vg0
  LV            LSize   Attr       Data%
  data            1.60t twi-aotz--  99.12
  vm-102-disk-0 200.00g Vwi-aotz--  97.84
  root           96.00g -wi-ao----
  vm-101-disk-0  80.00g Vwi-aotz--  61.30
  swap            8.00g -wi-ao----

What it means: For a thin volume, Data% is the share of its virtual size that has actually been allocated in the pool, so LSize multiplied by Data% approximates its physical footprint. Big virtual size alone isn't proof of big physical use; size plus Data% is what gives you the shortlist for "which VM is likely churning".

Decision: Correlate these VMs with recent backup jobs, DB maintenance, log bursts, or snapshot sprawl.

Task 5: Check thin pool detailed report (chunks, transaction id, features)

cr0x@server:~$ sudo lvs -o+chunk_size,lv_health_status,thin_count,discards vg0/data
  LV   VG  Attr       LSize  Pool Origin Data%  Meta%  Chunk  Health  #Thins Discards
  data vg0 twi-aotz-- 1.60t             99.12  71.43  64.00k  ok      12     passdown

What it means: Chunk size is 64K. Discards are set to passdown (good sign). Health is ok (it’s full, but not corrupt).

Decision: If discards are disabled here, host-level trimming won’t help much. You’ll need to enable discard support
and then run trims in guests or on the host where applicable.

Task 6: List Proxmox snapshots (the “human” snapshots)

cr0x@server:~$ qm listsnapshot 102
`-> weekly-retain-2
  `-> weekly-retain-1
    `-> pre-upgrade
      `-> current

What it means: VM 102 has multiple snapshots. Each snapshot can pin blocks and drive growth.

Decision: If you’re out of space, delete snapshots you don’t need—starting with the oldest, after confirming
they’re not part of an active backup/replication workflow.

Task 7: Delete a Proxmox snapshot safely (and what to watch)

cr0x@server:~$ qm delsnapshot 102 weekly-retain-2
Deleting snapshot 'weekly-retain-2'...
TASK OK

What it means: On LVM-thin, deleting a snapshot doesn't merge data the way a qcow2 snapshot delete does; LVM drops the snapshot LV and releases the blocks it held exclusively. That is mostly metadata work, but on a loaded pool it still generates I/O and can take a while. That's not a reason to avoid deletion; it's a reason to do it intentionally.

Decision: If the pool is 99–100% full, consider stopping heavy writers first so the cleanup doesn't collide with production churn.

Task 8: Detect orphaned LVs (the “storage is haunted” problem)

cr0x@server:~$ sudo lvs vg0 | grep -E 'vm-[0-9]+-disk-[0-9]+' | awk '{print $1}'
vm-101-disk-0
vm-102-disk-0
vm-109-disk-0
vm-999-disk-0
cr0x@server:~$ ls /etc/pve/qemu-server/ | sed 's/\.conf$//' | sort
101
102
109

What it means: There’s a vm-999-disk-0 LV but no 999.conf. That disk is likely orphaned: created during a failed restore,
clone, or manual tinkering.

Decision: Verify it’s truly unused, then remove it to reclaim space. Don’t delete it because it “looks wrong”; delete it because you’ve proven it’s unused.

Task 9: Prove a suspected orphan is not referenced by any VM config

cr0x@server:~$ grep -R "vm-999-disk-0" /etc/pve/qemu-server/
cr0x@server:~$ echo $?
1

What it means: Exit code 1 indicates “not found”. That’s evidence (not proof) it’s not attached.

Decision: Also check running QEMU processes and block device mappings before removal.

Task 10: Confirm no running VM has the LV open

cr0x@server:~$ sudo lsof | grep "/dev/vg0/vm-999-disk-0" | head

What it means: No output suggests nothing has it open. On a busy host, lsof can be heavy; use it carefully.

Decision: If it’s not open and not referenced, you can remove it.
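
lsof is expensive on a busy host and can miss kernel-level users; a cheaper cross-check is the device-mapper open count. A sketch against the same suspected orphan (an open count of 0 means nothing currently holds the device):

cr0x@server:~$ sudo dmsetup info /dev/vg0/vm-999-disk-0 | grep -i 'open count'
Open count:       0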

Task 11: Remove an orphaned thin LV

cr0x@server:~$ sudo lvremove -y /dev/vg0/vm-999-disk-0
  Logical volume "vm-999-disk-0" successfully removed.

What it means: You removed the LV mapping. Space should return to the pool immediately (though thin metadata and kernel accounting may lag slightly).

Decision: Re-check thinpool Data% and Meta% to confirm impact. If nothing changed, you didn’t remove the right thing—or the pool is blocked by metadata.

Task 12: Re-check pool usage after deletions

cr0x@server:~$ sudo lvs -o lv_name,lv_size,data_percent,metadata_percent vg0/data
  LV   LSize  Data%  Meta%
  data 1.60t  96.44  70.90

What it means: You clawed back ~3% of the pool. That might be enough to unjam operations, but it’s not a fix if growth continues.

Decision: If you’re still above ~90% and you have ongoing churn, expand the pool. Reclaim is a stopgap unless you change behavior.

Task 13: Check VG free space (can we extend without adding disks?)

cr0x@server:~$ sudo vgs -o vg_name,vg_size,vg_free,vg_free_count
  VG   VSize  VFree   #VFree
  vg0  1.80t  180.00g 46080

What it means: You have 180G free in the volume group. You can extend the thinpool right now.

Decision: Extend the thinpool data LV, and consider extending metadata as well. Do both; it’s cheap insurance.

Task 14: Extend thinpool data (online) and then metadata (online)

cr0x@server:~$ sudo lvextend -L +150G /dev/vg0/data
  Size of logical volume vg0/data changed from 1.60 TiB (419430 extents) to 1.75 TiB (457830 extents).
  Logical volume vg0/data successfully resized.
cr0x@server:~$ sudo lvextend --poolmetadatasize +2G vg0/data
  Size of logical volume vg0/data_tmeta changed from 8.00 GiB (2048 extents) to 10.00 GiB (2560 extents).
  Logical volume vg0/data_tmeta successfully resized.

What it means: Data grew by 150G. Metadata grew by 2G. Thin metadata is small but crucial; growing it proactively is usually wise.

Decision: Re-check Data% and Meta% afterwards. If Meta% remains extremely high, you may need more metadata or fewer snapshots/churn.

Task 15: Verify the thinpool is now healthy and has headroom

cr0x@server:~$ sudo lvs -o lv_name,lv_size,data_percent,metadata_percent vg0/data
  LV   LSize  Data%  Meta%
  data 1.75t  88.12  56.41

What it means: You bought breathing room. 88% is still “watch closely” in thin land, but you’re no longer inches from failure.

Decision: Now you prevent recurrence: enable discard end-to-end, reduce snapshot retention, and set alerting thresholds.

Task 16: If there’s no VG free space, add a new disk and extend the VG

cr0x@server:~$ lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT
NAME           SIZE TYPE FSTYPE      MOUNTPOINT
sda            1.8T disk
├─sda1           1M part
├─sda2           1G part ext4        /boot
└─sda3         1.8T part LVM2_member
  ├─vg0-root    96G lvm  ext4        /
  ├─vg0-swap     8G lvm  swap        [SWAP]
  └─vg0-data  1.75T lvm
sdb            1.0T disk
cr0x@server:~$ sudo pvcreate /dev/sdb
  Physical volume "/dev/sdb" successfully created.
cr0x@server:~$ sudo vgextend vg0 /dev/sdb
  Volume group "vg0" successfully extended
cr0x@server:~$ sudo vgs -o vg_name,vg_size,vg_free
  VG   VSize  VFree
  vg0  2.80t  1.03t

What it means: You added real capacity. Now you can extend the thinpool again.

Decision: Extend vg0/data again and, if metadata is getting tight, grow it with --poolmetadatasize. Then set monitoring so you don't repeat this ritual.
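
The extension itself is the same move as Task 14. If you'd rather hand the pool a share of whatever is now free instead of picking a fixed size, lvextend also accepts extents; a sketch, assuming you want to leave the VG some headroom:

cr0x@server:~$ sudo lvextend -l +80%FREE /dev/vg0/data
cr0x@server:~$ sudo lvextend --poolmetadatasize +2G vg0/data

Leaving roughly 20% of the VG unallocated is an assumption, not a rule; the point is that a VG run to zero free extents takes autoextend and future metadata growth off the table.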

Task 17: Enable weekly TRIM for guests where it makes sense (and confirm)

cr0x@guest-vm:~$ systemctl status fstrim.timer
● fstrim.timer - Discard unused blocks once a week
     Loaded: loaded (/lib/systemd/system/fstrim.timer; enabled)
     Active: active (waiting)

What it means: Guests running systemd can periodically discard freed blocks via fstrim.timer; this check (and the enable step) belongs inside each guest, not on the host. It's one of the few “set it and forget it” wins.

Decision: Make sure discard is allowed in the VM disk settings and supported by filesystem/storage. If not, you’re just scheduling optimism.
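
If the timer is missing or disabled, enabling it plus one manual pass gets you current. A sketch for a systemd-based guest (run inside the guest, not on the Proxmox host; the trimmed amount is obviously illustrative):

cr0x@guest-vm:~$ sudo systemctl enable --now fstrim.timer
cr0x@guest-vm:~$ sudo fstrim -av
/: 12 GiB (12884901888 bytes) trimmed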

Task 18: Check whether a VM disk is configured to pass discard (Proxmox)

cr0x@server:~$ qm config 102 | grep -E 'scsi|virtio|sata'
scsi0: local-lvm:vm-102-disk-0,discard=on,iothread=1,size=200G

What it means: discard=on is enabled for that disk. Good. Without it, guest TRIM won’t reach the thinpool.

Decision: For disks where you want reclamation, enable discard. For some ultra-sensitive latency workloads, test first—but most people should enable it.
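
If discard is off for a disk you care about, you can flip it from the CLI by re-specifying the volume with its options. A sketch using the example disk above: re-state the options you want to keep, since the whole drive string is replaced; the change typically shows as pending until the VM is restarted, and ssd=1 is an optional extra that makes some guests willing to issue TRIM at all.

cr0x@server:~$ qm set 102 --scsi0 local-lvm:vm-102-disk-0,discard=on,iothread=1,ssd=1
update VM 102: -scsi0 local-lvm:vm-102-disk-0,discard=on,iothread=1,ssd=1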

Recovery options that don’t destroy VMs

When you’re out of data space, you have three real levers: stop writes, reclaim blocks, and add capacity. The best fix is usually a combination.
The wrong fix is “delete random stuff until the error goes away.”

Option A: Delete snapshots you don’t need (fastest “real” reclamation)

If you have old snapshots, they’re the first place I look. Snapshots pin older block versions and keep data alive that would otherwise be reusable.
Deleting them can free a lot, but merges create I/O—so do it with awareness.

  • When to do it: pool is full, snapshots exist, and you can tolerate some merge activity.
  • When to avoid it: snapshot is actively used for rollback in a planned change window happening right now (rare).

Option B: Remove orphaned volumes and failed restore debris (the “sweep the warehouse” move)

Proxmox is generally tidy, but humans are creative. Orphans happen after failed backups, interrupted restores, or manual LVM edits.
This can be a clean, low-risk win—if you verify references first.
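
That verification is small enough to script and run on a schedule (the “orphan sweep” you'll see praised in the third mini-story). A minimal sketch, assuming disks live in vg0 and configs live in /etc/pve/qemu-server and /etc/pve/lxc; it only reports, deletion stays a human decision:

#!/bin/bash
# orphan-sweep.sh: report thin LVs that no VM or container config references.
# Only base disk LVs are checked; snapshot LVs (snap_vm-*) belong to their origin's config.
VG=vg0   # assumption: your volume group
for lv in $(lvs --noheadings -o lv_name "$VG" | awk '{print $1}' | grep -E '^vm-[0-9]+-disk-[0-9]+$'); do
  if ! grep -qR "$lv" /etc/pve/qemu-server/ /etc/pve/lxc/ 2>/dev/null; then
    echo "possible orphan: $VG/$lv"
  fi
done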

Option C: Expand the thinpool (the boring, usually best option)

If the volume group has free extents, extending the thinpool is fast and online. If it doesn’t, add a disk and extend the VG.
In incident terms, this is the highest confidence action: it immediately reduces pressure.

Option D: Reclaim from inside guests with TRIM (good, but not instant)

Reclaim is not magic. The guest must issue discards, the virtual controller must pass them, and the thinpool must accept them.
Also: the guest might not discard aggressively unless prompted (fstrim) or mounted with continuous discard (not always recommended).

Option E: Temporary triage: stop the bleeding

If you are at 99–100% and VMs are stuck, the immediate action is to stop high write-rate jobs:
backups, restores, log aggregation, database vacuum/compaction, indexing, CI artifact churn.
You’re not “fixing it,” you’re preventing a cascading failure while you create space.

Joke #2: The fastest way to reclaim storage is to schedule a maintenance window; suddenly everyone remembers what can be deleted.

Three corporate mini-stories from the thinpool trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized org ran a Proxmox cluster for internal services: Git, CI runners, a few app servers, and a database that everyone pretended wasn’t “production”
because it wasn’t customer-facing. It lived on local-lvm, thin provisioned, with generous virtual disks. Monitoring watched the host filesystem,
not the thinpool. Everything looked green—until it didn’t.

The wrong assumption was simple: “If the guest deletes files, the host gets space back.” Their CI runners produced artifacts, uploaded them, then cleaned the
workspace. Inside the VM, disk usage looked fine. On the host, the thinpool crept up anyway. Nobody noticed because they weren’t looking at Data%.

Then a routine OS update created new packages, logs rolled, CI got busy, and the thinpool crossed the line. Writes queued. The CI runners stalled. The Git
service got slow. The database started timing out because fsync wasn’t finishing. It looked like “the network is flaky” until someone actually checked LVM.

The fix wasn’t heroic: enable discard on the VM disks, enable weekly fstrim in the guests, delete stale snapshots, and expand the pool by a few hundred gigs.
The cultural change mattered more: they added thinpool monitoring and stopped treating local-lvm as “infinite because thin”.

Nobody got fired. But the team’s relationship with “assumptions” changed permanently. Thin provisioning is a contract with reality, and reality always reads the fine print.

Mini-story 2: The optimization that backfired

Another shop wanted “faster backups.” They were snapshotting a busy VM and running frequent backup jobs. To reduce backup time, they increased snapshot frequency
and kept more restore points. The graphs looked great—until the thinpool didn’t.

Snapshots are cheap at creation. That’s the trap. The VM was a log-heavy application server with constant writes. Each snapshot pinned old blocks; the more snapshots
they kept, the more versions of hot blocks lived simultaneously. The thinpool filled faster precisely because their “optimization” preserved churn.

When the thinpool hit the ceiling, the backup job that “made things safer” became the trigger for downtime. Worse: deleting snapshots during the incident caused
a burst of metadata and block-release work right when the system was already under I/O stress. That cleanup wasn't the villain, but it was definitely not a soothing lullaby.

The long-term fix was counterintuitive: fewer snapshots, smarter backup scheduling, and separating high-churn workloads onto storage sized for churn.
They also implemented thinpool autoextend with a hard “must have VG free space” rule and alerting when free extents ran low.

Performance improved. Reliability improved. The “backup is slow” complaint was replaced by “backups are boring,” which is the highest compliment operations can receive.

Mini-story 3: The boring but correct practice that saved the day

A conservative team ran Proxmox for internal platforms. They had a simple, almost dull standard: every storage backend had alerting at three levels (warning, urgent,
stop-the-world), and thinpools were treated like production databases—capacity planning, not hope.

They also had a habit that sounds like paperwork: after every restore test or migration test, they ran an “orphan sweep” to verify no stray volumes remained.
It was a small script and a small calendar reminder. People occasionally rolled their eyes.

One afternoon, a restore test failed halfway through due to a network hiccup. It left behind several large thin volumes. The very next day, the orphan sweep caught them.
They removed the debris and moved on. No incident. No emergency expansion. No midnight calls.

Months later, a similar restore failure happened during a busy period. But the same boring practice kicked in. The host never hit the cliff, because the team didn’t let
silent garbage accumulate.

The lesson wasn’t “be paranoid.” It was “be routine.” Most outages aren’t caused by exotic bugs. They’re caused by the slow accumulation of unowned state.

Common mistakes: symptom → root cause → fix

These are patterns I keep seeing in real Proxmox environments. If one matches your situation, don’t debate it—act on it.

1) Symptom: thinpool is full, but guest has plenty of free space

  • Root cause: Discard/TRIM not reaching the thinpool, or never being issued.
  • Fix: Enable discard=on for VM disks; enable and run fstrim in guests; confirm thinpool discards are passdown.

2) Symptom: “out of data space” happens suddenly after enabling frequent snapshots

  • Root cause: Snapshot retention + high churn. Copy-on-write pins old blocks; churn multiplies consumed space.
  • Fix: Reduce snapshot count/retention; schedule snapshots around low churn; expand pool; consider moving high-churn VMs to dedicated storage.

3) Symptom: pool data% is moderate, but writes still fail or pool appears stuck

  • Root cause: Metadata full (Meta% near 100%) or metadata corruption risk.
  • Fix: Extend the pool metadata (lvextend --poolmetadatasize); reduce snapshots; reduce fragmentation/churn; check health and consider maintenance if corruption suspected.

4) Symptom: you delete a VM, but thinpool usage barely changes

  • Root cause: VM disks not actually on that thinpool, or additional snapshots/orphans still exist, or you deleted config but not disks.
  • Fix: Check qm config and storage mapping; list LVs; remove orphans; confirm Proxmox deletion was configured to remove disks.

5) Symptom: after freeing space, pool still reports near-full

  • Root cause: Freed blocks weren’t discarded; thinpool accounting delayed; or you freed space on the wrong storage (directory vs lvmthin).
  • Fix: Re-check with pvesm status and lvs; run trims; confirm discards; ensure you’re operating on local-lvm.

6) Symptom: snapshot deletions make performance worse during the incident

  • Root cause: Deleting snapshots triggers a burst of metadata updates and block-release I/O at the worst possible time.
  • Fix: Pause heavy write jobs first; delete snapshots strategically; add capacity to reduce pressure before large deletions if possible.

Checklists / step-by-step plan

Emergency checklist (thinpool at 95–100%)

  1. Stop the loud writers: pause backups/restores, stop log storms, delay database maintenance jobs.
    The goal is to stop accelerating into the wall.
  2. Confirm the problem: run pvesm status and lvs for Data% and Meta%.
  3. Delete low-value snapshots: start with oldest and least justified. Keep the “one you might need” only if you truly might need it.
  4. Remove proven orphans: grep configs, check open handles, then lvremove.
  5. Extend thinpool if you can: check vgs; if VG has free extents, extend now.
  6. Extend metadata proactively: especially if Meta% > 70% and you run snapshots.
  7. Re-check health: confirm Data% drops to a safer range and VMs recover.

Stabilization checklist (after you’ve stopped the bleeding)

  1. Turn on discard end-to-end: Proxmox disk setting discard=on, guest OS trims, thinpool discards passdown.
  2. Set snapshot policy: cap count, cap age, and align with churn patterns.
  3. Implement alerting: warn at 70–80%, urgent at 85–90%, critical at 95% for thinpool Data% and Meta%.
  4. Measure churn: identify VMs that write constantly; treat them as capacity multipliers.
  5. Test restore processes: failed restores are a top source of orphaned volumes.

Capacity planning checklist (so this doesn’t return)

  1. Define overprovision ratio: pick a number you can defend (e.g., 2:1) and enforce it culturally.
  2. Track thinpool growth rate: if Data% grows 1–2% per day, you are on a countdown clock.
  3. Keep VG free space: if you rely on autoextend, you must keep free extents. Otherwise autoextend is just a nice thought.
  4. Separate workloads: put high-churn VMs on storage sized for churn, not on the same thinpool as “quiet” infrastructure.

Prevention: make it boring, make it permanent

The thinpool emergency is usually not a one-off. It’s a delayed bill for earlier convenience: overprovisioning without monitoring, snapshots without retention discipline,
and guests deleting data without discarding it.

Monitoring that actually works

Watch Data% and Meta% of the thinpool LV. Do not rely on “disk free” inside the guests. Do not rely on df -h on the host filesystem.
Those are different layers with different truths.

A practical rule: alert early enough that you can respond during business hours. The point isn’t to catch 99%. The point is to prevent reaching it.
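
What that looks like in practice can be as small as a cron job. A minimal sketch; the pool name, the threshold, and what you do with the output are all assumptions to adapt (cron MAILTO, your monitoring agent, whatever you already run):

#!/bin/bash
# thinpool-watch.sh: print a warning and exit non-zero when a thin pool crosses a threshold.
# Run as root (e.g. from cron); POOL and THRESHOLD are assumptions to adapt.
POOL=vg0/data
THRESHOLD=80   # percent
read -r data meta < <(lvs --noheadings -o data_percent,metadata_percent "$POOL" | awk '{print $1, $2}')
if [ "${data%.*}" -ge "$THRESHOLD" ] || [ "${meta%.*}" -ge "$THRESHOLD" ]; then
  echo "WARNING: $POOL data=${data}% meta=${meta}% (threshold ${THRESHOLD}%)"
  exit 1
fi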

Snapshot discipline

Snapshots are operational debt with compounding interest. Keep them when they have a purpose: a short safety net for a risky change.
Don’t keep them “because storage is cheap.” On thin provisioning, storage is not cheap; it’s just deferred.

Discard: enable it, then prove it’s working

Enabling discard on a Proxmox disk is necessary but not sufficient. Guests need to issue trims. Some filesystems and workloads benefit from periodic fstrim
rather than continuous discard mounts. Test on your workload, but don’t skip it.
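
Proving it takes two minutes: note the pool's Data%, trim inside a guest, check again. A sketch using the pool and a guest from the examples above; if the number doesn't move at all, work backwards through discard=on, the guest filesystem, and the pool's discards setting.

cr0x@server:~$ sudo lvs --noheadings -o data_percent vg0/data
  96.44
cr0x@guest-vm:~$ sudo fstrim -av
cr0x@server:~$ sudo lvs --noheadings -o data_percent vg0/data
  93.80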

Autoextend: a good servant, a terrible boss

LVM autoextend can save you from slow growth surprises, but it can’t conjure space. It only consumes VG free extents. If you run your VG to zero free space,
autoextend becomes a false sense of safety—like a spare tire made of cardboard.
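
Checking what autoextend would actually do takes one command. A sketch; the values shown are just an example configuration, not a recommendation. A threshold of 100 means autoextend is effectively off, and none of it matters if the VG has no free extents or the pool isn't monitored (see the Monitor column in Task 1).

cr0x@server:~$ sudo lvmconfig activation/thin_pool_autoextend_threshold activation/thin_pool_autoextend_percent
thin_pool_autoextend_threshold=80
thin_pool_autoextend_percent=20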

One quote to keep you honest

Paraphrased idea from John Allspaw: reliability comes from how your system behaves under stress, not from wishing stress won’t happen.

FAQ

1) Can I fix “out of data space” without rebooting the Proxmox host?

Usually yes. Extending the thinpool and deleting snapshots/orphans are typically online operations. Reboots are for when you’ve dug a deeper hole (or for kernel/driver issues),
not for basic capacity fixes.

2) What’s the difference between thinpool “data” and “metadata” space?

Data is the actual stored blocks for VM disks. Metadata is the mapping database that tracks where each virtual block lives in the pool and how snapshots relate.
You can be “out of space” in either, and the symptoms can look similar.

3) Why doesn’t deleting files inside the VM free space on the host?

Because the thinpool only knows blocks are reusable when it receives discards. Deleting a file marks blocks free inside the filesystem, but without TRIM/discard,
the underlying storage still considers them allocated.

4) Is enabling discard=on safe for all VMs?

For most general workloads, yes—and it’s usually the correct default on thin provisioning. For some latency-sensitive workloads, test first.
The bigger risk is running thin without reclamation and then acting surprised when it fills.

5) I deleted snapshots but Data% barely moved. Why?

Either the snapshots weren’t the main consumer, or the pool is filled by active data still referenced by current disks, or you’re actually constrained by metadata.
Also, “barely moved” might still be meaningful if your pool is multi-terabyte; verify with lvs and correlate with churn sources.

6) Should I add more metadata space even if Meta% isn’t high?

If you use snapshots heavily or run high churn, yes—within reason. Metadata extension is cheap compared to an incident where metadata hits 100% and the pool goes sideways.

7) Can I shrink a VM disk to reclaim thinpool space?

Shrinking is possible but rarely the first move: it requires filesystem shrinking inside the guest (and that’s not always supported), and it’s operationally risky.
Prefer discard-based reclamation and snapshot cleanup. Expand capacity if you can.

8) What’s the safest “first action” when the pool is at 99% and VMs are slow?

Stop or pause the largest write generators (backups/restores/log storms), then create space by deleting snapshots/orphans or extending the pool.
Don’t start “cleanup” inside guests and hope it reaches the host. Hope is not a storage driver.

9) Why did it fill so fast when the VM disks aren’t that large?

Thinpool fullness is about physical written blocks, not virtual disk sizes. High-churn patterns (databases, logs, CI workspaces), plus snapshots, can multiply physical usage quickly.

10) If I extend the thinpool, will Proxmox automatically notice?

Yes in most cases; Proxmox reads LVM state and will reflect new capacity in pvesm status. If it doesn’t update immediately, re-check that you extended the right LV
and that the storage definition points to that pool.

Conclusion: next steps you should do today

If you’re reading this mid-incident, do three things in order: confirm whether data or metadata is full, reclaim space by deleting snapshots and proven orphans,
then extend the thinpool (and metadata) if you have any capacity to add. That gets you out of the danger zone without destroying VMs.

Then do the part that prevents the sequel: enable discard end-to-end, run trims, set snapshot retention like an adult, and add monitoring on thinpool Data% and Meta%.
Make storage boring again. Your future self is already tired.
