Ubuntu 24.04: LVM thin pool 100% — save your VMs before it’s too late

If your Ubuntu 24.04 host runs VMs on LVM thin and that thin pool hits 100%, the failure mode is not “kinda slow.” It’s “writes stop, guests wedge, and your hypervisor starts making ugly noises.” You don’t get a graceful countdown. You get a pager and a stomach drop.

This is the playbook I wish every virtualization team had taped inside the rack door: how to diagnose what’s actually full (data or metadata), how to get VMs breathing again without turning recovery into a second incident, and how to keep thin provisioning from quietly becoming thin ice.

What actually happens when a thin pool hits 100%

LVM thin provisioning is a brilliant idea with one sharp edge: it lets you promise more space than you physically have. That’s not inherently evil—storage arrays have done it for years—but it shifts failure from “allocation time” to “write time.” When the thin pool runs out of space, you don’t just fail to create a new volume. Existing VMs can suddenly be unable to write blocks they’ve never written before. That can look like random application failures, filesystem remounts, database corruption warnings, and guests going read-only.

Two separate pools can kill you:

  • Data space (the big one): where your thin LVs’ blocks live.
  • Metadata space (the small one): the mapping table that says “this thin LV block maps to that physical block.”

When data is 100%, new writes that require new block allocation fail. When metadata is 100%, allocation bookkeeping fails even if you still have data space. Both situations can stall I/O. Both can wedge VMs. Both are avoidable if you treat thin provisioning as a controlled substance.

One more nuance: a thin pool can become unhealthy even before it hits 100% if it’s heavily fragmented, has runaway snapshots, or is pressured into constant allocate/free cycles. But 100% is the headline because it’s when your margin becomes negative and you’re now paying interest.

Joke #1: Thin provisioning is like credit: fantastic until you accidentally max it out on “just one more” snapshot.

Facts and short history: why thin pools behave like this

Some context helps you predict behavior instead of guessing. Here are concrete, useful facts—no nostalgia, just mechanics.

  1. Device-mapper thin provisioning (dm-thin) has been in mainline Linux for years; LVM thin uses it under the hood. LVM isn’t “doing magic,” it’s orchestrating dm-thin devices.
  2. Metadata is a separate LV (often named like thinpool_tmeta) because mapping tables must be crash-consistent and fast. Running out of metadata can break allocation even when data looks fine.
  3. Thin pools can be overcommitted by design: the sum of virtual sizes of thin LVs can exceed the pool’s physical size. This is the point—and also the trap.
  4. Discard/TRIM support evolved slowly across stacks (guest FS → virtual disk → host dm-thin). It’s now workable, but only if you actually enable it end-to-end and understand the performance cost.
  5. Snapshots are cheap at creation time in thin provisioning. That’s why people make too many. The cost arrives later as diverging blocks accumulate.
  6. Autoextend exists for thin pools (via LVM’s monitoring and profile settings), but it only helps if the VG has free extents or you can add PVs quickly.
  7. 100% isn’t a polite boundary: dm-thin can switch to an error mode for new allocations. Depending on configuration, I/O errors propagate unpredictably to filesystems and applications.
  8. Metadata repair tools exist (like thin_check and thin_repair), but they are not routine maintenance tools. If you’re using them monthly, your process is the problem.

There’s a reliability maxim worth keeping on your desk, paraphrased from John Allspaw, long-time operations and reliability leader: incidents come from normal work in complex systems; blame is useless—understanding and learning is the job.

Fast diagnosis playbook (first/second/third)

When the thin pool hits 100%, your job is to stop guessing. Follow a strict order. Each step answers a binary question and drives your next action.

First: is it data space or metadata space?

  • If data is full: you need physical space or you need to free blocks (TRIM/discard won’t save you quickly unless you already had it running).
  • If metadata is full: you can often extend the metadata LV quickly (if the VG has free extents) and restore allocations.

Second: are writes failing right now?

  • Check kernel logs for dm-thin errors.
  • Check whether guests have gone read-only or services are failing.

Third: what is consuming space: real writes, snapshots, or stale blocks?

  • Look at per-LV usage.
  • Identify snapshot growth patterns.
  • Decide whether to delete snapshots, migrate a VM, extend pool, or add a new PV.

Time matters. But “fast” doesn’t mean “creative.” It means “do the fewest risky operations that restore write headroom.”

Practical tasks (commands, outputs, and decisions)

Below are real tasks you can run on Ubuntu 24.04. Each includes: command, sample output, what it means, and the decision you make. Run them as root or with sudo. Adjust VG/LV names to your environment.

Task 1: Identify thin pools and their usage

cr0x@server:~$ sudo lvs -a -o vg_name,lv_name,lv_attr,lv_size,data_percent,metadata_percent,origin,pool_lv --units g
  VG     LV                 Attr       LSize  Data%  Meta%  Origin Pool
  vg0    thinpool           twi-aotz--  900.00g 99.82  12.44        -
  vg0    [thinpool_tdata]   Twi-ao----  900.00g
  vg0    [thinpool_tmeta]   ewi-ao----    4.00g
  vg0    vm-101-disk-0      Vwi-aotz--   80.00g 62.10               thinpool
  vg0    vm-102-disk-0      Vwi-aotz--  120.00g 91.34               thinpool

Meaning: twi- indicates a thin pool. Data% is dangerously high (99.82%). Metadata is fine (12.44%).

Decision: Treat this as a data-space emergency. You need immediate headroom: extend the pool, delete snapshots, or move a VM off the pool.

Task 2: Confirm the VG has free space to extend

cr0x@server:~$ sudo vgs -o vg_name,vg_size,vg_free,vg_free_count --units g
  VG   VSize    VFree   VFreeCount
  vg0  953.00g   0.00g  0

Meaning: No free extents in the VG. You can’t extend the thin pool without adding storage or moving data around.

Decision: Plan to add a new PV (new disk/LUN) or evacuate one or more thin LVs to another VG.

Task 3: Confirm whether the pool is in trouble mode (kernel messages)

cr0x@server:~$ sudo dmesg -T | tail -n 20
[Mon Dec 29 11:41:20 2025] device-mapper: thin: 253:10: reached low water mark; sending event.
[Mon Dec 29 11:43:02 2025] device-mapper: thin: 253:10: no free space for data block allocation
[Mon Dec 29 11:43:02 2025] blk_update_request: I/O error, dev dm-7, sector 1048576 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0

Meaning: dm-thin is failing allocations. Guests will see write errors or stalls.

Decision: Stop non-essential writes. Pause/stop backup jobs, snapshot jobs, log storms. Your first priority is to restore headroom.

Task 4: Find the top thin LVs by used blocks

cr0x@server:~$ sudo lvs -o lv_name,lv_size,data_percent,metadata_percent,seg_monitor --sort=-data_percent vg0
  LV             LSize   Data%  Meta%  Monitor
  vm-102-disk-0  120.00g 91.34
  vm-101-disk-0   80.00g 62.10

Meaning: data_percent for a thin LV is the share of its virtual size that currently has blocks allocated in the pool, not the filesystem usage inside the guest.

Decision: If one VM is a runaway writer, consider shutting it down or moving it first.
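
To see which mapped device is taking the writes right now, a quick sketch (iostat ships in the sysstat package, which may not be installed; dm-N numbers map back to LV names via the minor number that dmsetup prints):

cr0x@server:~$ sudo dmsetup ls
# Prints each device-mapper name with its (major:minor); the minor is the N in dm-N.
cr0x@server:~$ iostat -dxy 5 2
# Per-device throughput over 5-second windows; the dm-* row with the biggest write column is your runaway writer.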

Task 5: Check for snapshots that are quietly eating the pool

cr0x@server:~$ sudo lvs -a -o lv_name,lv_attr,origin,lv_size,data_percent --sort=origin vg0 | sed -n '1,40p'
  LV                    Attr       Origin          LSize   Data%
  thinpool              twi-aotz--                 900.00g
  vm-101-disk-0         Vwi-aotz--                 80.00g  62.10
  vm-101-disk-0-snap    Vri-aotz--  vm-101-disk-0  80.00g  18.43

Meaning: Snapshot LVs (Vri-) accumulate changed blocks. If your snapshot policy is sloppy, your pool fills like a bathtub with the drain taped over.

Decision: Identify non-critical snapshots and delete them to reclaim blocks (reclaim is not always immediate; see TRIM notes later).

Task 6: Check thin pool settings (autoextend and thresholds)

cr0x@server:~$ sudo lvs -o lv_name,lv_attr,lv_size,segtype,seg_monitor,lv_profile vg0/thinpool
  LV       Attr       LSize    Type  Monitor Profile
  thinpool twi-aotz-- 900.00g thin-pool monitored

Meaning: “monitored” means lvm2 can emit events, but it doesn’t guarantee automatic growth unless configured.

Decision: After you survive the incident, set a thin pool profile with autoextend or build external alerting. During the incident, don’t waste time “tuning.”
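
After the incident, if you want LVM itself to grow the pool, the knobs are thin_pool_autoextend_threshold and thin_pool_autoextend_percent in /etc/lvm/lvm.conf. A minimal sketch (the 80/20 values are illustrative, and autoextend still only works while the VG has free extents):

cr0x@server:~$ sudo lvmconfig activation/thin_pool_autoextend_threshold activation/thin_pool_autoextend_percent
# Shows the current values; a threshold of 100 effectively means "never autoextend".
# In /etc/lvm/lvm.conf, set for example:
#   activation {
#       thin_pool_autoextend_threshold = 80
#       thin_pool_autoextend_percent = 20
#   }
cr0x@server:~$ sudo systemctl restart lvm2-monitor.service
# dmeventd monitoring has to be running for autoextend to ever fire.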

Task 7: Check LVM event daemon status (sometimes alerts aren’t happening)

cr0x@server:~$ systemctl status lvm2-lvmpolld lvm2-monitor --no-pager
● lvm2-lvmpolld.service - LVM2 poll daemon
     Loaded: loaded (/usr/lib/systemd/system/lvm2-lvmpolld.service; static)
     Active: active (running)
● lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling
     Loaded: loaded (/usr/lib/systemd/system/lvm2-monitor.service; enabled)
     Active: active (exited)

Meaning: Services are present. If you never got alerts, you likely didn’t wire them up (LVM doesn’t magically page you).

Decision: For now: focus on freeing space. Later: integrate with Prometheus/NRPE/whatever you use.

Task 8: Add a new disk and make it a PV (fastest clean expansion)

cr0x@server:~$ lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
NAME        SIZE TYPE MOUNTPOINT
sda         1.0T disk
├─sda1        1G part /boot/efi
├─sda2        2G part /boot
└─sda3      997G part
  ├─vg0-thinpool_tdata 900G lvm
  ├─vg0-thinpool_tmeta   4G lvm
  └─vg0-root            80G lvm /
sdb         500G disk
cr0x@server:~$ sudo pvcreate /dev/sdb
  Physical volume "/dev/sdb" successfully created.

Meaning: You have a new PV ready to be added to the VG.

Decision: If this is a hypervisor, prefer adding capacity over “clever cleanup” when the pool is already at 100%.

Task 9: Extend the VG, then extend the thin pool data LV

cr0x@server:~$ sudo vgextend vg0 /dev/sdb
  Volume group "vg0" successfully extended
cr0x@server:~$ sudo lvextend -L +400G vg0/thinpool
  Size of logical volume vg0/thinpool changed from 900.00 GiB (230400 extents) to <1.27 TiB (332800 extents).
  Logical volume vg0/thinpool successfully resized.

Meaning: The pool is larger. This is often the cleanest “get out of jail” move if you can add storage quickly.

Decision: If data% was ~100% and you’re seeing I/O errors, extending is the least risky fix. Do it first, then deal with why it filled.

Task 10: If metadata is the problem, extend metadata too

cr0x@server:~$ sudo lvs -o lv_name,lv_size,data_percent,metadata_percent vg0/thinpool
  LV       LSize    Data%  Meta%
  thinpool <1.27t  76.10  99.92
cr0x@server:~$ sudo lvextend --poolmetadatasize +2G vg0/thinpool
  Size of logical volume vg0/thinpool_tmeta changed from 4.00 GiB (1024 extents) to 6.00 GiB (1536 extents).
  Logical volume vg0/thinpool_tmeta successfully resized.

Meaning: Metadata was near full; you increased it. This can immediately stop allocation failures.

Decision: If Meta% is high (>80% sustained), treat it as an emergency. Extending metadata is usually quick and low-drama compared to repairing metadata after corruption.

Task 11: Confirm the thin pool is no longer pinned at 100%

cr0x@server:~$ sudo lvs -o lv_name,lv_size,data_percent,metadata_percent vg0/thinpool
  LV       LSize    Data%  Meta%
  thinpool <1.27t  76.09  66.61

Meaning: You have headroom. Not “safe forever,” but safe enough to stop active bleeding.

Decision: Resume services gradually. Keep backups paused until you’re sure you’re stable and you understand the growth driver.

Task 12: Check whether discards are enabled for the thin pool

cr0x@server:~$ sudo lvs -o lv_name,lv_attr,discards vg0/thinpool
  LV       Attr       Discards
  thinpool twi-aotz-- passdown

Meaning: passdown means discards can be passed to underlying storage. This is good, but it doesn’t guarantee guests are issuing TRIM, nor that your hypervisor layer passes it.

Decision: If you rely on “free space will come back,” you must validate TRIM end-to-end. Otherwise thin pools only grow.

Task 13: Check filesystem usage inside the host for log storms (often the trigger)

cr0x@server:~$ df -h /
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg0-root   78G   71G  3.0G  96% /

Meaning: If the host root is full, you may be seeing cascading failures (journald, libvirt, qemu logs). That can cause VM failures unrelated to thin pool capacity—don’t conflate them.

Decision: If host root is nearly full, free it too. A “fixed thin pool” on a dying root filesystem is just a different kind of outage.

Task 14: Inspect per-VM disk files or LVs from the hypervisor layer

cr0x@server:~$ sudo virsh domblklist vm102
 Target   Source
------------------------------------------------
 vda      /dev/vg0/vm-102-disk-0
 vdb      /var/lib/libvirt/images/vm102-seed.img

Meaning: Confirms which LVs are attached. Useful when a VM “mysteriously” writes to a different disk than you think.

Decision: If a noisy VM maps to the thin pool, you now know exactly which LV to snapshot/delete/migrate/limit.

Task 15: If you can shut down a VM, punch out its space usage and snapshot count

cr0x@server:~$ sudo lvs -o lv_name,lv_size,data_percent,lv_time,origin vg0 | grep -E 'vm-102|snap'
  vm-102-disk-0        120.00g 91.34  2025-12-28 22:10:14
  vm-102-disk-0-snap   120.00g 44.02  2025-12-29 09:00:11 vm-102-disk-0

Meaning: This shows whether a recent snapshot is ballooning. Many “thin pool suddenly full” incidents are “snapshot job went rogue.”

Decision: Delete the snapshot if it’s not required for recovery/compliance—after confirming with stakeholders. If it is required, extend capacity instead and fix the snapshot policy later.

Task 16: Delete an unneeded thin snapshot (carefully)

cr0x@server:~$ sudo lvremove -y vg0/vm-102-disk-0-snap
  Logical volume "vm-102-disk-0-snap" successfully removed.

Meaning: Snapshot is gone. Whether space is immediately reusable depends on discard behavior and dm-thin internals, but allocation pressure usually drops.

Decision: If you removed snapshots and Data% doesn’t move quickly, don’t panic. Instead, focus on stopping new writes and adding capacity.

Task 17: Validate thin pool health status

cr0x@server:~$ sudo lvs -o lv_name,lv_attr,health_status vg0/thinpool
  LV       Attr       Health
  thinpool twi-aotz-- ok

Meaning: “ok” is what you want. If it’s not ok, treat the pool as compromised and prioritize backups/evacuation.

Decision: If health is degraded, stop taking new snapshots and schedule a maintenance window to investigate metadata integrity and I/O errors.

Recovery tactics: choose the least-worst option

When the thin pool is full, you’re in a constrained optimization problem: reduce risk, restore write headroom, and preserve recoverability. The best move depends on what’s available: free VG space, extra disks, another host, or the ability to shut down VMs.

Option A (best): add capacity, extend the pool, then investigate

If you can attach a new disk/LUN quickly, do it. Extending the pool is low-risk compared to live surgery inside guests. It also gives you the breathing room to do the second part of the job: figuring out why you got here.

Operationally, this is the “stop the bleeding” step. You still need prevention, but your VMs stop screaming.

Option B: evacuate one VM to another datastore

If you have another VG/storage backend with space, moving one heavy VM can buy time. Depending on your stack (libvirt, Proxmox, custom scripts), you might do a cold migration (shutdown VM and copy) or a live migration (if supported and configured). In a crisis, cold migration is boring and reliable.

Be honest about bandwidth. Copying 500G at 1 Gbit/s is not a “quick fix.” It’s a plan for tomorrow unless you’re lucky.
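
If you go the cold-migration route with libvirt, the shape is roughly this; a sketch only, and the target path /mnt/datastore2 is hypothetical:

cr0x@server:~$ sudo virsh shutdown vm102
cr0x@server:~$ sudo virsh domstate vm102
# Wait until it reports "shut off" before copying anything.
cr0x@server:~$ sudo qemu-img convert -p -O qcow2 /dev/vg0/vm-102-disk-0 /mnt/datastore2/vm-102-disk-0.qcow2
cr0x@server:~$ sudo virsh edit vm102
# Point the disk at the new file, boot the VM, verify it, and only then reclaim the original LV with lvremove.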

Option C: delete snapshots (if you are absolutely sure)

Snapshots feel like backups until they don’t. If you delete the wrong snapshot chain, you can erase the only path to a point-in-time rollback. But if snapshots are clearly unneeded (test snapshots, forgotten nightly jobs, abandoned VM templates), removing them is often the fastest reclamation lever.

Do not delete snapshots blindly. Identify owners, check change windows, and confirm whether any restore workflows depend on them.

Option D: enable/discover discards and try to reclaim space

TRIM can help, but it’s not a defibrillator. If your guests have been deleting lots of data but not issuing discards, your thin pool may be full of dead blocks. Enabling discard can allow space reclamation—but it can also create I/O churn and won’t necessarily bring you from 100% to safe within minutes.

Also: not all guest filesystems issue trim automatically; some need periodic fstrim. Virtualization layers may ignore discard unless explicitly enabled. Thin pools can pass discards down, but that’s only one hop in a chain.
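
Validating the chain hop by hop looks roughly like this (vm102 is the domain from the tasks above; exact XML details depend on your stack):

cr0x@server:~$ sudo lvs -o lv_name,discards vg0/thinpool
# Host layer: the pool should report passdown (or nopassdown if you only want pool-level reclaim).
cr0x@server:~$ sudo virsh dumpxml vm102 | grep -i discard
# Hypervisor layer: the virtual disk's driver element needs discard='unmap', or guest TRIM goes nowhere.
cr0x@server:~$ sudo virsh domfstrim vm102
# Guest layer: asks the guest agent to run fstrim; needs qemu-guest-agent inside the VM,
# otherwise run fstrim -av in the guest directly.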

Option E (last resort): repair tools and metadata intervention

If you’re seeing metadata corruption or severe thin pool health issues, you may need offline checks with thin_check/thin_repair. This is not a “do it live on production at 2pm” task unless your alternative is total data loss.

Repair workflows vary by distro packaging and kernel behavior. The safe principle is consistent: take the pool out of service, capture metadata, validate, repair if necessary, and restore carefully.
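
For orientation only, the offline workflow is shaped like this; read the man pages first, and assume every step needs a maintenance window:

cr0x@server:~$ sudo lvchange -an vg0/vm-101-disk-0 vg0/vm-102-disk-0
# Deactivate every thin LV that lives on the pool first (VMs must be off).
cr0x@server:~$ sudo lvchange -an vg0/thinpool
cr0x@server:~$ sudo lvconvert --repair vg0/thinpool
# Runs the thin metadata repair and swaps in the repaired copy; the old metadata
# is preserved in a separate LV (check lvs -a afterwards) until you decide to remove it.
cr0x@server:~$ sudo lvchange -ay vg0/thinpool
cr0x@server:~$ sudo lvs -o lv_name,health_status vg0/thinpool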

Joke #2: “We’ll just run thin_repair real quick” is the storage equivalent of “I’ll just reboot the database, what’s the worst that could happen?”

Three corporate mini-stories from real life

Mini-story 1: The incident caused by a wrong assumption

The team inherited a virtualization cluster that “had plenty of space.” The previous admin had sized thin LVs generously—2–4× what guests actually used—and everybody loved the flexibility. Developers could request bigger disks without a procurement ticket. Nobody felt the risk because nothing had broken yet.

The wrong assumption was subtle: they assumed that because guests were only 40–50% full at the filesystem level, the thin pool must be safe. They watched df inside VMs, not lvs on the host. They also assumed deletions in guests would automatically return space to the pool.

Then a routine CI workload changed: container image caches started churning hard, creating and deleting large layers. The guests freed space from their perspective. The thin pool did not reclaim those blocks because discard wasn’t enabled end-to-end.

The pool hit 100% during business hours. A subset of VMs froze on writes. A database VM went read-only. The incident response was chaotic because the host still had “free space” according to the guests. People argued with the graphs instead of the kernel logs.

The fix was boring: extend the pool, enable discard carefully, and change monitoring to alert on thin pool Data%/Meta%, not guest filesystem usage. The lesson stuck because it was expensive: “free space” is a layered concept, and each layer lies in its own way.

Mini-story 2: The optimization that backfired

A platform group decided to reduce backup windows. Their hypervisors used LVM thin snapshots for crash-consistent backups: snapshot, mount, copy, delete. Someone noticed snapshots were cheap to create and proposed increasing snapshot frequency to get better RPO for key systems.

On paper, it was elegant: frequent snapshots, faster incremental copies, less data to move each time. In practice, the workload was write-heavy. Snapshots began accumulating changed blocks quickly. The “incremental” copies weren’t as incremental as hoped because hot datasets changed everywhere.

The optimization backfired in a classic way: the system became more complex and less predictable. Thin pool Data% climbed slowly, then sharply. The team had alerts at 95%, but the slope was steep enough that they hit 100% between checks, right when a few large VMs were also running application upgrades.

They deleted snapshots to regain space and discovered a second-order effect: backup jobs retried, recreated snapshots, and immediately refilled the pool. The issue wasn’t “snapshots are bad.” It was “unbounded snapshot concurrency is bad.”

The final remediation was policy: cap snapshot count per VM, serialize snapshot operations, and define a maximum allowed thin overcommit ratio. The team also carved out separate pools: one for volatile workloads, one for stable databases. Not because it’s fancy, but because blast radius is a real thing.

Mini-story 3: The boring but correct practice that saved the day

A finance-adjacent application ran on a small set of VMs. The team running it was not glamorous, but they were disciplined. They had a weekly routine: check thin pool usage, validate backups, test one restore, and review any snapshot drift.

They also practiced one rule: every hypervisor had a documented “capacity add” procedure with pre-approved storage. When a pool approached the warning threshold, they didn’t debate whether it was “real.” They added capacity and then investigated consumption trends after the immediate risk was gone.

One quarter-end, the workload spiked. Logging increased, reports generated intermediate files, and a batch job temporarily doubled write volume. Thin pool Data% climbed fast. Their monitoring alerted at a conservative threshold, and the on-call ran the playbook without improvisation.

They attached a new LUN, extended the VG and thin pool, and kept the business running. There was no hero moment, just a routine task executed under pressure. The post-incident review was short because the system behaved as designed.

That team didn’t “avoid incidents.” They reduced the chance that a capacity incident turned into data loss. In production, that’s what competence looks like.

Common mistakes: symptom → root cause → fix

These aren’t theoretical. These are the ways thin pools punish overconfidence.

1) Symptom: VMs freeze or go read-only, but host CPU/RAM look fine

Root cause: dm-thin cannot allocate new blocks (data full) or cannot update mappings (metadata full). Writes stall or fail.

Fix: Check lvs Data%/Meta% and dmesg. Restore headroom by extending pool or deleting snapshots; extend metadata if Meta% is high.

2) Symptom: Guests show plenty of free space, but thin pool is full

Root cause: Thin provisioning tracks allocated blocks, not guest free space. Deleted files don’t necessarily return blocks unless discards flow through.

Fix: Enable discard end-to-end and schedule fstrim in guests where appropriate. Don’t count on this as an emergency fix when already full; add capacity first.

3) Symptom: Thin pool Meta% climbs faster than expected

Root cause: High churn. Lots of small random writes, many snapshots, or rapid create/delete cycles all increase mapping complexity.

Fix: Extend metadata LV proactively. Reduce snapshot count and lifecycle. Consider separate pools for churn-heavy workloads.

4) Symptom: You extended the pool but Data% still looks high

Root cause: You extended by too little, or active workloads immediately consumed new space. Sometimes you extended the wrong LV (tmeta/tdata confusion).

Fix: Re-check lvs -a and confirm pool LV size changed. Temporarily throttle/stop the worst offenders while you regain margin.

5) Symptom: Snapshot deletion doesn’t “free space” quickly

Root cause: Reclamation depends on how blocks are managed and whether discards are flowing. Also, other writers may be consuming space at the same time.

Fix: Measure again after stopping high-write jobs. Don’t use “I deleted a snapshot” as your only plan—pair it with adding headroom.

6) Symptom: You never got an alert; pool hit 100% silently

Root cause: No monitoring on thin pool Data%/Meta%, or alerts were set too high with too long an interval.

Fix: Alert at multiple thresholds (e.g., 70/80/90/95) and on rate-of-change. Treat thin pools like flight instruments, not like a casual dashboard.

Prevention: monitoring, policy, and design choices

The goal is not “never fill a thin pool.” The goal is “never fill a thin pool unexpectedly.” You want time to make a calm decision, not a desperate one.

Set a thin overcommit policy (yes, a policy)

Thin provisioning without an overcommit limit is just gambling with better UX. Decide your risk tolerance:

  • For mixed workloads: keep virtual sum under ~150–200% of physical, depending on churn and your response time.
  • For volatile CI/build/test: assume growth is real; overcommit less or isolate those VMs.
  • For databases: either don’t thin-provision, or keep generous physical headroom and strict snapshot limits.
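
To know where you stand, compare the sum of the thin LVs’ virtual sizes to the pool’s physical size. A rough sketch using the pool from the examples above:

cr0x@server:~$ sudo lvs --noheadings --units b --nosuffix -o lv_size,pool_lv vg0 | awk '$2 == "thinpool" {sum += $1} END {printf "virtual total: %.0f GiB\n", sum / 2^30}'
cr0x@server:~$ sudo lvs --noheadings --units b --nosuffix -o lv_size vg0/thinpool
# Divide the first number by the second; that ratio is your current overcommit.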

Alert on both Data% and Meta% (and on slope)

Data% is the obvious one. Meta% is the assassin in the back seat. Also alert on how fast Data% changes. A pool at 85% that grows 1% per week is a budgeting task. A pool at 85% that grows 1% per hour is an incident you haven’t noticed yet.
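
You don’t need a monitoring platform to start; a cron-able script that prints warnings is enough to wire into whatever paging you already have. A minimal sketch (the path and thresholds are illustrative):

cr0x@server:~$ cat /usr/local/sbin/check-thinpool.sh
#!/bin/bash
# Warn when a thin pool's data or metadata usage crosses a threshold.
# Usage: check-thinpool.sh vg0/thinpool [data_max] [meta_max]
set -euo pipefail
pool="$1"; data_max="${2:-85}"; meta_max="${3:-80}"
read -r data meta < <(lvs --noheadings --nosuffix -o data_percent,metadata_percent "$pool")
awk -v d="$data" -v m="$meta" -v dm="$data_max" -v mm="$meta_max" -v p="$pool" 'BEGIN {
  if (d + 0 >= dm) printf "WARNING: %s data %.2f%% >= %s%%\n", p, d, dm
  if (m + 0 >= mm) printf "WARNING: %s metadata %.2f%% >= %s%%\n", p, m, mm
}'

Slope alerting needs the previous reading stored somewhere (a state file or your metrics system); the same lvs fields feed it.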

Make snapshot lifecycle a first-class citizen

Unbounded snapshots are the #1 thin pool self-own. Set rules:

  • Max snapshots per VM.
  • Max age (automatic expiry).
  • Serialize snapshot-heavy jobs to avoid burst allocation.
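
Enforcement starts with being able to see the counts. A quick tally of thin snapshots per origin, using the VG from the examples:

cr0x@server:~$ sudo lvs --noheadings -o origin,lv_name vg0 | awk 'NF == 2 {count[$1]++} END {for (o in count) print count[o], "snapshot(s) of", o}'
# Anything over your per-VM cap is a candidate for expiry; lv_time tells you how old each one is.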

Use discard/TRIM deliberately, not religiously

Discard can reduce long-term allocation growth and make thin pools behave more like “real” storage. It can also add overhead. Decide based on workload:

  • For SSD/NVMe-backed pools, discard is often worth it.
  • For certain SANs or thin-provisioned arrays, discard behavior varies; test it.
  • Prefer scheduled trim (fstrim.timer) over continuous discard mounts for some workloads to avoid latency spikes.
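
Inside an Ubuntu guest, the scheduled route is usually just the stock timer; a sketch:

cr0x@guest:~$ sudo systemctl enable --now fstrim.timer
# Weekly trim of mounted filesystems.
cr0x@guest:~$ sudo fstrim -av
# One-off trim now; prints how much each mount discarded.
# None of this reaches the host pool unless the virtual disk and the pool pass discards through (see Task 12).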

Separate pools by blast radius

One giant thin pool is efficient until it’s not. A single runaway VM can consume the last few percent and take unrelated systems down with it. Separate pools for:

  • databases (predictable, high-value, low tolerance for stalls)
  • volatile workloads (CI, scratch, developer sandboxes)
  • backup staging (if you must)
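
Carving out a second pool is a one-liner once a VG has room; vg1 and the sizes here are hypothetical:

cr0x@server:~$ sudo lvcreate --type thin-pool -L 200G --poolmetadatasize 2G -n dbpool vg1
cr0x@server:~$ sudo lvcreate -V 50G --thin -n db-01-disk-0 vg1/dbpool
# Database disks land in dbpool; a runaway CI VM in the other pool can no longer starve them.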

Have pre-approved capacity add paths

In corporate environments, the slowest part of a capacity incident is often procurement approvals, not commands. Pre-approve an emergency disk/LUN expansion mechanism. Make it boring. Boring is fast.

Checklists / step-by-step plan

Checklist A: When the thin pool is at 95%+ (pre-incident)

  1. Run lvs and confirm whether Data% or Meta% is climbing.
  2. Identify top consumers and any recent snapshot spikes.
  3. Pause or reschedule heavy snapshot/backup jobs until headroom is restored.
  4. If VG has free extents, extend thin pool now (don’t wait for 100%).
  5. If VG is full, schedule adding a PV or moving at least one VM off-pool.
  6. Notify stakeholders: “We are in a capacity warning state; action in progress.”

Checklist B: When the thin pool hits 100% (incident mode)

  1. Stop write amplification: pause backups, snapshot automation, log-heavy jobs if possible.
  2. Confirm thin pool usage and which dimension is full: lvs Data% vs Meta%.
  3. Check kernel logs for dm-thin allocation errors.
  4. Restore headroom:
    • Add PV and extend pool (preferred), or
    • Delete non-essential snapshots (careful), or
    • Evacuate a VM to another datastore.
  5. Verify pool is below critical thresholds and health is ok.
  6. Bring paused jobs back gradually and watch slope.
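
For step 6, something as simple as a watch loop shows you the slope while jobs come back:

cr0x@server:~$ sudo watch -n 30 'lvs -o lv_name,data_percent,metadata_percent vg0/thinpool'
# If Data% climbs noticeably between refreshes, pause whatever you just resumed.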

Checklist C: After recovery (post-incident hardening)

  1. Document root cause (workload change, snapshot job, discard gaps, capacity planning miss).
  2. Add monitoring for Data% and Meta%, with slope alerts.
  3. Enforce snapshot lifecycle policies and caps.
  4. Decide on discard strategy and test in a maintenance window.
  5. Re-evaluate overcommit ratio and separate pools if needed.
  6. Run a restore test. Not because it’s fun—because it’s the only honest verification.

FAQ

1) Is “thin pool 100%” always catastrophic?

Not always immediately, but it’s always unacceptable in production. Some workloads may keep running until they need new block allocations. You don’t control when that happens.

2) What’s worse: data full or metadata full?

Both are bad. Metadata full can be more surprising because the pool may still have plenty of data space. The practical difference: metadata is often easier to extend quickly if your VG has free extents.

3) If I delete files inside a VM, why doesn’t the thin pool shrink?

Because the host can’t read the guest’s mind. The guest filesystem marks blocks free internally, but unless it issues discard/trim and those discards are passed through the virtualization layer to dm-thin, the host still considers those blocks allocated.

4) Can I rely on enabling discard to recover from 100%?

No. Discard is preventative and “long-game” reclaim. When you’re already at 100%, you need immediate physical headroom: extend the pool or evacuate data.

5) Should I stop VMs when the pool is full?

If VMs are actively failing writes, stopping the noisiest writers can stabilize the situation while you extend capacity. But prefer a fix that restores headroom quickly; leaving production down while you debate is not a strategy.

6) How much headroom should I keep?

Enough that you can respond calmly. Practically: alert early (70–80%), treat 90% as urgent, and avoid operating above 95% unless you enjoy adrenaline.

7) Do thin snapshots count against the pool even if I never use them?

Yes, as they accumulate changed blocks. Creating a snapshot is cheap. Keeping it around during heavy writes is not.

8) Is autoextend a good idea?

It’s useful when the VG has free extents or you have an automated way to add PVs. It is not a substitute for monitoring and it won’t help when the VG is already full.

9) Why does Meta% grow at all—shouldn’t it be stable?

Metadata tracks mappings. As allocations increase and snapshots multiply mappings, metadata grows. High churn workloads can inflate metadata faster than you expect.

10) Should databases live on thin provisioning?

If you must, keep strict headroom and snapshot discipline, and monitor aggressively. Many teams choose thick provisioning for databases to avoid unpredictable allocation stalls.

Next steps that actually reduce risk

If your Ubuntu 24.04 thin pool is at or near 100%, don’t negotiate with physics. Restore headroom first. Extend the pool if you can. Delete snapshots only when you’re sure. Evacuate a VM if you must. Then—and only then—investigate the why.

Practical next steps for the next 24–48 hours:

  1. Add monitoring on lvs Data% and Meta%, plus rate-of-change alarms.
  2. Set a snapshot lifecycle policy and enforce it technically, not socially.
  3. Validate discard/TRIM end-to-end in a controlled test; decide whether to schedule fstrim in guests.
  4. Define a maximum thin overcommit ratio and separate volatile workloads into their own pool.
  5. Write (and rehearse) the “add PV and extend pool” runbook so the next on-call doesn’t learn it at 3 a.m.

The thin pool won’t care that you were busy. It will fill anyway. Your job is to make sure “full” is a planned event, not an incident.
