Proxmox Storage: ZFS vs LVM-Thin — The Benchmark Lie That Wastes Weeks

You ran fio. You got numbers. Then you built a cluster around those numbers. Two weeks later the helpdesk is forwarding “VM slow” tickets like it’s a competitive sport, and your graphs look like a cardiogram.

This is the benchmark lie: measuring the wrong thing beautifully, then making a permanent architecture decision with it. ZFS and LVM-Thin both work on Proxmox. Both can be fast. Both can be disastrous. The difference is how they fail, and what your workload will punish.

The benchmark lie: why your “fastest” storage loses in production

Most Proxmox storage debates start with a screenshot of fio results, as if storage is a drag race and the winner gets custody of your VMs. The problem is that virtualization storage is rarely limited by “max throughput” on a clean, empty system. It’s limited by latency under mixed I/O, queueing behavior, write amplification, and what happens when a pool is 80% full and someone takes a snapshot every hour.

Benchmarks often lie in four common ways:

1) They benchmark the wrong layer

fio on the host block device is not the same as fio inside a guest with a virtual disk on top of a storage stack with caching, copy-on-write, discard, and flush semantics. Proxmox adds choices too: raw vs qcow2, cache=none vs writeback, aio=native vs io_uring, VirtIO SCSI vs VirtIO Block. That’s before you even pick ZFS or LVM-Thin.

2) They ignore sync writes and flushes

Databases, mail servers, and anything that cares about durability will issue flushes (or use O_DIRECT / FUA-like behavior depending on stack). ZFS has explicit opinions about sync writes. LVM-Thin mostly delegates durability to the underlying filesystem (often ext4/xfs) and the drive write cache policy. If your benchmark doesn’t include sync patterns, you’re not benchmarking production. You’re benchmarking optimism.

3) They don’t include fragmentation and snapshots

LVM-Thin snapshot chains and ZFS snapshots behave differently, but both can turn “fast” into “why is latency 200ms?” when snapshots accumulate or blocks become scattered. The dirty secret: the first week after deployment is always the fastest week. Your benchmark probably measured week one.

4) They measure averages, not tail latency

Your users don’t experience average latency. They experience p95 and p99. They experience the one query that stalls behind a queue of writes. For VM storage, tail latency is the difference between “fine” and “incident”.
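
All four failure modes are testable before production. A fio job that includes flush behavior and reports percentiles instead of hero numbers might look like this sketch (the paths, sizes, and read/write mix are placeholders — shape them after your real workload):

```ini
; sketch.fio -- illustrative, not a definitive benchmark
; run it inside a guest, on the storage you actually plan to use
[global]
ioengine=libaio
direct=1
time_based=1
runtime=300
group_reporting=1
; report tail latency, not just averages
percentile_list=50:95:99:99.9

[mixed-vm-like]
rw=randrw
rwmixread=70
bs=8k
iodepth=16
numjobs=4
size=8G
filename=/var/tmp/fio.testfile
; include flush semantics: fsync after every 16 writes
fsync=16
```

Run it once on an empty pool and again when the pool is realistically full and snapshotted; the delta between those two runs is the number your benchmark screenshot never showed.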

One quote is worth pinning above your dashboards. It’s short, rude, and correct:

“The fastest code is the code you don’t run.” — often attributed to Ken Thompson

Storage version: the fastest I/O is the I/O you don’t force into an expensive path. Pick a storage backend that matches your I/O shape and failure tolerance so you’re not “running” unnecessary pain.

First short joke: Storage benchmarks are like resumes: the best ones are technically true and still wildly misleading.

Interesting facts and short history that actually matters

These aren’t trivia. They explain why ZFS and LVM-Thin behave the way they do, and why their failure modes feel so different.

  1. ZFS was designed to end silent data corruption by combining filesystem and volume management with checksums on every block. That DNA still shows: integrity first, performance second, and “it depends” third.
  2. Copy-on-write is older than most of your servers. The concept predates modern virtualization; ZFS operationalizes it at scale. It’s why snapshots are cheap, and why random writes can get more expensive over time.
  3. LVM pre-dates thin provisioning. Classic LVM was thick by default: every block you allocate is reserved up front, used or not. Thin came later to compete with SAN behavior and virtualization convenience.
  4. Thin provisioning became popular because storage was expensive, not because it was safe. Overcommit helps budgets. It also helps incidents happen at 2 a.m.
  5. Write barriers and flush semantics evolved because disks lied. Drives with volatile write cache can acknowledge writes before they’re durable. Filesystems and storage stacks added barriers/flushes to reduce the lying. It works—until a layer ignores it.
  6. ZFS ARC was built when RAM was precious. ARC is aggressively adaptive and will happily use memory if you let it. On a hypervisor, that can collide with VM memory needs unless you set limits.
  7. SSDs changed the bottleneck but not the physics. Latency got better; queueing still exists. Mixed random I/O remains a tax collector, just with a faster calculator.
  8. Consumer NVMe introduced new failure patterns. Thermal throttling, firmware quirks, and sudden latency spikes can look like “ZFS is slow” or “LVM is slow” when the real culprit is the drive behaving like a drama student.

Decision framework: pick a default and justify it

If you’re running Proxmox in production, you need a default choice that survives ordinary chaos: unexpected growth, backups, snapshots, tired humans, and a CFO who thinks “storage is storage”.

My opinionated default

  • Single-node or small cluster with local disks, no external SAN: choose ZFS unless you have a very specific reason not to.
  • You have stable capacity planning, want predictable performance, and you’re comfortable with classic block + filesystem management: LVM-Thin is fine—if you treat overprovisioning like a loaded weapon.
  • You’re optimizing for maximum “it just works” VM snapshot/backup workflows with reasonable safety: ZFS is usually the safer default.

When ZFS is the better trade

  • You care about end-to-end checksums and easy detection of bit rot.
  • You value simple replication semantics (send/receive) and coherent snapshotting.
  • You can allocate RAM appropriately and you can accept some overhead for safety.
  • You can design vdevs correctly (mirrors for IOPS; RAIDZ for capacity, with caveats).

When LVM-Thin is the better trade

  • You need lean overhead and you’re on fast, reliable storage underneath (good SSDs, RAID controller with BBWC, or enterprise NVMe with PLP).
  • You want simpler mental models: block devices, ext4/xfs, and well-known Linux tooling.
  • You’re willing to enforce strict monitoring on thin pool data and metadata usage, and you have a plan for “pool full” that isn’t “panic”.

The question that decides most cases

What is the cost of being wrong? With ZFS, being wrong often shows up as performance surprises and memory contention. With LVM-Thin, being wrong often shows up as capacity incidents that can become data-loss incidents if you let a thin pool hit 100% or metadata fill up.

Pick your poison based on what your team can operationally handle at 3 a.m. That’s not cynicism; that’s reliability engineering.

ZFS on Proxmox: what it’s really doing to your I/O

ZFS is a storage system, not a filesystem bolt-on

ZFS owns the block layer decisions: allocation, caching, checksumming, compression, and how writes become durable. That integration is why ZFS can protect data better than “filesystem on top of RAID” stacks. It’s also why ZFS performance depends heavily on how you configure the pool and datasets, not just the disks.

Mirrors vs RAIDZ: the IOPS reality

For VM storage, random IOPS and latency matter. Mirrors generally win here because they can service reads from either side and distribute random I/O better. RAIDZ is great for capacity efficiency, but small random writes can be expensive due to parity math and read-modify-write behavior.

If you’re running lots of small VMs with mixed workloads, mirrors are the boring, correct answer. RAIDZ can be fine for bulk storage or sequential-heavy workloads, but VM “miscellaneous chaos” tends to punish it.

ARC: your friend until it isn’t

The ARC (Adaptive Replacement Cache) will use RAM aggressively. On a dedicated storage box, that’s great. On a hypervisor, it competes with VM memory. Starve the host and you get swapping, ballooning, and VM jitter that looks like storage latency.

The fix is simple: cap ARC to leave room for VMs and the host. The hard part is admitting your “128GB RAM is plenty” plan didn’t include cache behavior.
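
The cap itself is one module option. A sketch, assuming you decide ARC gets at most 16 GiB — the number is an example, not a recommendation; size it against your VM commitments:

```ini
# /etc/modprobe.d/zfs.conf -- example ARC cap (16 GiB, in bytes)
# Applied at module load; on Proxmox, refresh the initramfs afterwards
# (update-initramfs -u) and reboot, or echo the value into
# /sys/module/zfs/parameters/zfs_arc_max for a live change.
options zfs zfs_arc_max=17179869184
```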

Sync writes: the production benchmark you forgot

ZFS treats sync writes as sacred: if the application says “this must be durable,” ZFS will honor it. Without a dedicated low-latency log device (SLOG) and proper device characteristics (power-loss protection matters), sync-heavy workloads can get slower than you expect.

There’s also the temptation to set sync=disabled. It makes benchmarks scream. It also changes durability semantics: disabling sync globally is like replacing your seatbelts with motivational posters.

Compression: usually a win, sometimes a trap

Modern CPUs often make compression effectively “free” compared to I/O costs, especially on SSDs. lz4 is commonly the right default. But compression can amplify CPU contention on overloaded hosts, and it can distort benchmarks if your test data compresses better than real data. Random data won’t compress; VM images and logs often do.

ZVOL vs file-based images

On Proxmox, ZFS storage often means ZVOLs (block devices) for VM disks. That’s generally good for performance consistency. But tuning matters: volblocksize affects write amplification and latency. Get it wrong and you can create a performance tax you’ll pay forever, because changing it later is non-trivial.
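
On Proxmox the knob lives in the storage definition: the blocksize of a zfspool storage becomes the volblocksize of newly created ZVOLs (existing disks keep whatever they were created with). A sketch with illustrative values:

```ini
# /etc/pve/storage.cfg -- zfspool entry (values are examples)
zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        blocksize 16k
```

Pick it deliberately: small random writes tend to favor smaller values, large sequential I/O bigger ones, and changing your mind later means recreating the disk.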

LVM-Thin on Proxmox: the quiet efficiency and the sharp edges

LVM-Thin is not “worse ZFS”

LVM-Thin is a thin-provisioned block layer. It does not provide end-to-end checksums. It does not inherently protect you from silent corruption. It doesn’t try to be a storage religion. It’s a pragmatic tool that works extremely well when you give it stable underlying storage and you respect its failure modes.

The big win: simplicity and low overhead

On good hardware, LVM-Thin can be very fast. There’s less metadata gymnastics than a copy-on-write filesystem doing checksums, compression, and transactional semantics. If you want predictable, “Linux classic” behavior, LVM-Thin is comfortable territory.

The thin pool “full” problem is not theoretical

Thin provisioning is great until it isn’t. When the thin pool data area fills up, writes fail. When thin pool metadata fills up, you can get stalls and failures that look like corruption or “VM froze”. And because it’s virtualization, you can fill the pool in ways that aren’t obvious—snapshots, backup jobs, or a single VM writing logs like it’s being paid per line.

Overprovisioning is allowed, but it’s a risk budget. Spend it deliberately, monitor it aggressively.
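
That monitoring can be as dumb as a cron’d shell check. A sketch that parses the two percentages and alerts past a threshold — the thresholds, the pool name, and the alert action are all placeholders:

```shell
# Sketch: alert when thin pool data/metadata usage crosses thresholds.
# Thresholds are placeholders; wire the real numbers in from
# `lvs --noheadings -o data_percent,metadata_percent vg0/data`.
DATA_WARN=85
META_WARN=75

# Captured sample line (stands in for live lvs output in this sketch):
sample="  78.34  62.11"
set -- $sample
data=${1%.*}   # integer part of Data%
meta=${2%.*}   # integer part of Meta%

if [ "$data" -ge "$DATA_WARN" ]; then echo "ALERT: thin pool data at ${data}%"; fi
if [ "$meta" -ge "$META_WARN" ]; then echo "ALERT: thin pool metadata at ${meta}%"; fi
echo "checked: data=${data}% meta=${meta}%"
# prints: checked: data=78% meta=62%
```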

Discard/TRIM: helpful, but only if it’s end-to-end

LVM-Thin can reclaim blocks if discards make it through from guest to host. But the chain is long: guest filesystem → virtual disk driver → QEMU settings → host block stack → thin pool. Miss one link and your thin pool “used” grows forever, even when guests delete data.

Snapshots: convenient, but don’t hoard them

LVM snapshots (thin snapshots) are functional and fast at first. Over time, lots of snapshots can increase metadata work and fragment the pool. This is not unique to LVM; it’s a general truth of snapshot-heavy environments. But thin metadata pressure is a particularly sharp edge: it fails loudly.

Second short joke: Thin provisioning is like office free coffee: people love it until it runs out, and then it’s suddenly everyone’s emergency.

Benchmarks that don’t lie (much): what to measure instead

If you want benchmarks that correlate with production pain, measure these instead of just max throughput:

  • Tail latency (p95/p99) under mixed read/write, not just average IOPS.
  • Sync write latency (or flush-heavy patterns) if you run databases, mail, or journaling-heavy apps.
  • Performance under snapshot/backup load because that’s when users complain.
  • Behavior at 70–85% full because nobody keeps pools empty forever.
  • CPU cost per I/O because hypervisors run compute too.
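
If your tooling only hands you raw latency samples, nearest-rank percentiles are one pipeline away. A sketch over made-up numbers — substitute your own measurements:

```shell
# Sketch: nearest-rank p95/p99 from raw latency samples (ms).
# The sample values are invented; pipe in real measurements instead.
samples="3 4 4 5 5 5 6 6 7 8 9 10 12 15 18 22 30 45 80 200"

printf '%s\n' $samples | sort -n | awk '
  { v[NR] = $1 }
  END {
    p95 = v[int((NR * 95 + 99) / 100)]   # nearest-rank: ceil(NR * 0.95)
    p99 = v[int((NR * 99 + 99) / 100)]
    printf "n=%d p95=%sms p99=%sms max=%sms\n", NR, p95, p99, v[NR]
  }'
# -> n=20 p95=80ms p99=200ms max=200ms
```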

Use the guest for guest experience, use the host for root cause

Run workload-specific tests inside a VM to approximate what users feel. Then use host-level tools to explain the behavior: is it the disk, the queue, the CPU, the ARC, or the thin metadata?

Also: don’t benchmark with empty caches unless you plan to reboot every hour. Cold-cache tests are useful, but they’re not the whole story.

Fast diagnosis playbook: find the bottleneck before the meeting starts

This is the order that usually gets you to truth quickly on Proxmox. Not always. But often enough to be a habit.

First: is it actually storage?

  • Check host CPU steal/ready-like pressure: saturation can look like I/O waits.
  • Check memory pressure and swapping: a swapping hypervisor produces “storage latency” tickets.
  • Check network if it’s remote storage (iSCSI/NFS/Ceph): retransmits and pauses look like disk stalls.

Second: locate the queue

  • Look for iowait and per-device utilization.
  • Check for a single hot disk/vdev, or a single VM doing pathological I/O.
  • Correlate with snapshot/backup jobs and replication.

Third: identify the storage-stack specific failure mode

  • ZFS: ARC pressure, sync write bottlenecks, pool fragmentation, slow vdev, mis-sized recordsize/volblocksize, or bad SLOG choices.
  • LVM-Thin: thin pool data/metadata near full, discard not working, snapshot sprawl, underlying filesystem/RAID cache policy issues.

Fourth: verify with a targeted test

Don’t run a massive benchmark in the middle of an incident. Run a small, representative test: measure latency, not hero throughput. If you can’t explain the numbers in terms of the stack, the numbers are just decorative.

Practical tasks: commands, outputs, and what decision they trigger

These are the tasks I actually do on Proxmox hosts when someone says “storage is slow” or “we need to choose ZFS vs LVM-Thin.” Each includes what the output means and the decision you make from it.

Task 1: Identify what storage a VM disk is actually using

cr0x@server:~$ qm config 101 | egrep '^(boot|scsi|virtio|ide|sata)'
boot: order=scsi0;net0
scsi0: local-zfs:vm-101-disk-0,size=80G

What it means: This VM’s disk is on local-zfs. You troubleshoot ZFS, not LVM, not “the SSD”.

Decision: Use ZFS tools (zpool/zfs) and ZVOL tuning paths, and expect snapshot semantics to be ZFS-native.

Task 2: Check host memory pressure (ARC vs VMs vs swap)

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi        92Gi       3.1Gi       1.2Gi        30Gi        14Gi
Swap:          8.0Gi       2.6Gi       5.4Gi

What it means: Swap is in use. On a hypervisor, that’s a performance smell. If you use ZFS, ARC may be part of the story; if not, VMs may be overcommitted.

Decision: Investigate ARC size (/proc/spl/kstat/zfs/arcstats) and host swapping. Consider capping ARC or adjusting VM memory/ballooning policy.

Task 3: Check I/O wait and top offenders

cr0x@server:~$ iostat -xz 1 3
Linux 6.5.11-8-pve (server) 	02/04/2026 	_x86_64_	(32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.21    0.00    4.13   21.77    0.00   65.89

Device            r/s     rkB/s   rrqm/s  %rrqm  r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm  w_await wareq-sz  aqu-sz  %util
nvme0n1         120.0   18000.0     0.0   0.00    3.20   150.0    640.0   72000.0     0.0   0.00   28.40   112.5   18.40  99.00

What it means: %iowait is high and nvme0n1 is pegged at ~99% util with a high write await. That’s a storage queue problem, not “Proxmox UI is slow”.

Decision: Identify which VMs are driving writes; then determine whether it’s expected load, a single offender, or a misconfiguration (sync writes, snapshots, thin pool issues).

Task 4: See which processes are issuing I/O (host view)

cr0x@server:~$ pidstat -d 1 5
Linux 6.5.11-8-pve (server) 	02/04/2026 	_x86_64_	(32 CPU)

02:14:21      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
02:14:22        0      2310      0.00  82000.00      0.00       90  kworker/u64:2
02:14:22        0     12844      0.00  12000.00      0.00       30  kvm

What it means: QEMU is writing; a kernel worker is also active (could be filesystem, ZFS txg, md/raid, or NVMe housekeeping depending on stack).

Decision: Map QEMU PID to a VM ID, then inspect that VM’s disk and workload. If it correlates with backup jobs, shift schedules or throttle backups.

Task 5: Map QEMU processes to VM IDs

cr0x@server:~$ pgrep -a kvm | head -n 2
12844 /usr/bin/kvm -id 101 -name vm101 -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/101.qmp,server=on,wait=off ...
13201 /usr/bin/kvm -id 102 -name vm102 -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/102.qmp,server=on,wait=off ...

What it means: PID 12844 is VM 101. Now you can correlate offender to a tenant/team/app.

Decision: Look at VM 101 storage configuration and any snapshots/backups running for it.

Task 6: For ZFS, check pool health and basic stats

cr0x@server:~$ zpool status -x
all pools are healthy

What it means: No obvious device failures. That doesn’t mean performance is fine; it means you don’t have a degraded vdev dragging latency.

Decision: Move on to performance counters: zpool iostat, ARC, and sync behavior.

Task 7: For ZFS, locate the slow vdev/disk

cr0x@server:~$ zpool iostat -v rpool 1 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool        2.10T  1.30T    420   1800  32.4M   140M
  mirror     1.05T   650G    210    900  16.2M  70.0M
    nvme0n1      -      -    105    460  8.10M  35.1M
    nvme1n1      -      -    105    440  8.10M  34.9M
  mirror     1.05T   650G    210    900  16.2M  70.0M
    nvme2n1      -      -    200    900  15.9M  70.0M
    nvme3n1      -      -     10      0  0.30M  0.00M

What it means: One device (nvme3n1) is contributing almost nothing. That suggests a device issue, path issue, or a vdev imbalance due to errors.

Decision: Check zpool status in detail, kernel logs, and NVMe SMART. A “healthy” pool can still have a disk behaving badly before it trips a hard fault.

Task 8: For ZFS, see ARC size and hit ratio signals

cr0x@server:~$ awk '$1=="c_min" || $1=="c_max" || $1=="size"' /proc/spl/kstat/zfs/arcstats
c_min 4 10737418240
c_max 4 85899345920
size 4 64424509440

What it means: ARC max is ~80GiB, current ~60GiB. On a hypervisor, that might be fine—or it might be stealing memory from VMs.

Decision: If the host is swapping or VMs balloon unexpectedly, cap ARC. If the host has free memory and storage is the bottleneck, ARC may be helping; don’t “optimize” it into a regression.

Task 9: For ZFS, check dataset/ZVOL settings that affect VM I/O

cr0x@server:~$ zfs get -r compression,atime,recordsize,sync rpool/data | head -n 20
NAME        PROPERTY     VALUE     SOURCE
rpool/data  compression  lz4       local
rpool/data  atime        off       local
rpool/data  recordsize   128K      default
rpool/data  sync         standard  default

What it means: Sensible defaults: lz4, atime=off, recordsize 128K (note: for ZVOLs, the relevant knob is volblocksize per ZVOL).

Decision: If workloads are database-heavy with many 8K writes, consider tuning per-dataset or per-ZVOL behavior rather than global changes. Avoid global “magic” changes during incidents.

Task 10: For ZFS sync behavior, verify if you have a SLOG and what it is

cr0x@server:~$ zpool status rpool | sed -n '1,80p'
  pool: rpool
 state: ONLINE
config:

	NAME        STATE     READ WRITE CKSUM
	rpool       ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    nvme0n1 ONLINE       0     0     0
	    nvme1n1 ONLINE       0     0     0
	logs
	  nvme4n1p1 ONLINE       0     0     0

errors: No known data errors

What it means: There is a dedicated log device. That can dramatically change sync write latency—if the device is low-latency and has power-loss protection.

Decision: If sync-heavy workloads are slow, check log device health/latency. If you don’t have a SLOG and you run sync-heavy apps, consider adding the right hardware rather than turning off sync.

Task 11: For LVM-Thin, check thin pool data and metadata usage

cr0x@server:~$ lvs -a -o lv_name,vg_name,lv_attr,lv_size,data_percent,metadata_percent,seg_monitor vg0
  LV              VG  Attr       LSize   Data%  Meta%  Monitor
  data            vg0 twi-aotz--   3.00t 78.34  62.11  monitored
  [data_tdata]    vg0 Twi-ao----   3.00t
  [data_tmeta]    vg0 ewi-ao----  16.00g
  [lvol0_pmspare] vg0 ewi-------  16.00g
  pve             vg0 -wi-ao---- 200.00g

What it means: The thin pool is at 78% data usage and 62% metadata usage. That’s safe-ish but trending toward “watch it.” Metadata fills faster under snapshots.

Decision: If Meta% rises quickly, reduce snapshot sprawl, ensure discard works, and plan metadata expansion before you hit a wall.

Task 12: For LVM-Thin, confirm monitoring and auto-extension behavior

cr0x@server:~$ lvs -o lv_name,seg_monitor vg0
  LV    Monitor
  data  monitored

What it means: The thin pool is monitored (dmeventd). That’s good: you can trigger warnings/auto-extend policies.

Decision: If it’s not monitored, fix that immediately. Thin pools filling silently is how you end up explaining to management why “storage overcommit” is not free money.
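
Monitoring plus an auto-extend policy lives in lvm.conf; a sketch with illustrative thresholds (auto-extend only helps if the VG actually has free space left to grow into):

```ini
# /etc/lvm/lvm.conf -- activation section (example values)
# When the thin pool crosses 80% usage, extend it by 20%.
activation {
    thin_pool_autoextend_threshold = 80
    thin_pool_autoextend_percent = 20
}
```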

Task 13: Check discard support end-to-end (host side)

cr0x@server:~$ lsblk -D -o NAME,DISC-GRAN,DISC-MAX,ROTA,TYPE,MOUNTPOINT | head
NAME        DISC-GRAN DISC-MAX ROTA TYPE MOUNTPOINT
nvme0n1       512B       2G    0 disk
├─nvme0n1p1   512B       2G    0 part
└─nvme0n1p2   512B       2G    0 part

What it means: The block device reports discard granularity and max, suggesting TRIM is supported at the drive level.

Decision: If DISC-* is blank or zero, discard may not work (or is hidden by RAID). For thin provisioning, that increases the chance of “used space never shrinks.”

Task 14: Confirm VM disk cache mode and discard settings (Proxmox/QEMU)

cr0x@server:~$ qm config 101 | egrep '^(scsi0|virtio0|ide0|sata0)'
scsi0: local-lvm:vm-101-disk-0,discard=on,iothread=1,cache=none,size=80G

What it means: discard=on is enabled (good for thin), cache=none avoids double-caching and respects flush semantics more predictably, and iothread=1 can reduce contention for busy disks.

Decision: If discard is off and you rely on thin provisioning, enable it (and verify in-guest). If cache=writeback is used casually, verify you’re not trading safety for speed without realizing it.

Task 15: Check kernel log for device resets and NVMe drama

cr0x@server:~$ dmesg -T | egrep -i 'nvme|reset|timeout|blk_update_request|I/O error' | tail -n 8
[Tue Feb  4 02:08:10 2026] nvme nvme3: I/O 123 QID 4 timeout, reset controller
[Tue Feb  4 02:08:11 2026] nvme nvme3: controller reset successful

What it means: The device is timing out and resetting. That can present as random latency spikes blamed on ZFS or LVM.

Decision: Treat this as hardware/firmware first. Don’t tune ZFS recordsize to fix a controller reset. Replace or update, then re-evaluate.

Task 16: Check pool fullness and fragmentation (ZFS)

cr0x@server:~$ zpool list -o name,size,alloc,free,capacity,frag,health
NAME   SIZE  ALLOC   FREE  CAPACITY  FRAG  HEALTH
rpool  3.40T  2.10T  1.30T       61%   38%  ONLINE

What it means: Capacity is okay, fragmentation is moderate. If frag climbs and performance degrades, random write behavior may worsen.

Decision: If frag is high and latency is rising, consider workload changes, reducing snapshots, adding vdevs (carefully), or migrating hot VMs to a fresher pool.

Task 17: Check LVM-Thin discard effectiveness via thin usage trends

cr0x@server:~$ lvs -o lv_name,lv_size,data_percent,metadata_percent vg0/data
  LV   LSize Data% Meta%
  data  3.00t 78.34 62.11

What it means: Snapshot cleanup or guest deletes should eventually reduce Data% if discard works and blocks are unmapped. If Data% only ever goes up, discard isn’t making it through (or the workload is truly append-only).

Decision: If discard is broken, fix the chain (guest fstrim, virtio-scsi, discard=on, underlying support). If workload is append-only, stop expecting thin pools to “shrink themselves.”

Task 18: Check Proxmox backup load correlation

cr0x@server:~$ systemctl list-timers --all | egrep 'vzdump|pbs|backup'
Tue 2026-02-04 02:30:00 UTC  12min left Tue 2026-02-04 01:30:02 UTC  47min ago vzdump.timer

What it means: A backup job is scheduled close to your performance complaint window. Backups can trigger snapshot activity and heavy reads.

Decision: Stagger backups, apply bandwidth/ionice controls, or move backup storage to separate spindles/NVMe so user I/O doesn’t fight backup I/O.
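
Throttling is a two-line config change; a sketch for /etc/vzdump.conf with illustrative numbers:

```ini
# /etc/vzdump.conf -- example throttles (values are illustrative)
# bwlimit is in KiB/s; ionice lowers the backup worker's I/O priority
bwlimit: 100000
ionice: 7
```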

Three corporate mini-stories from the storage trenches

1) Incident caused by a wrong assumption: “thin means we can’t run out”

A mid-size company consolidated a pile of aging VMware hosts onto Proxmox. They picked LVM-Thin because it was familiar, lightweight, and looked fast in early tests. Someone made a slide that said “thin provisioning improves utilization.” True. Someone else heard “thin provisioning prevents capacity problems.” Not true.

They overcommitted storage because the old environment had lots of empty space inside guest filesystems. Weeks later, a routine burst happened: Windows updates, log rotation gone wrong in a couple of Linux VMs, and a developer running a local CI job that cached artifacts aggressively. The thin pool hit the wall while everyone was asleep.

The symptoms were confusing: a couple of VMs froze, then more started timing out. Applications reported filesystem errors. People blamed the network. Someone rebooted a VM, which made it worse because the reboot triggered journal replays and extra writes. The hypervisor didn’t “crash,” it just couldn’t satisfy writes reliably.

The root cause wasn’t LVM being “bad.” It was an operational assumption: they treated thin as elastic capacity instead of borrowed capacity. The fix was boring: alarms on data and metadata usage, a hard policy on maximum overcommit, and scheduled reporting so capacity was a weekly conversation instead of a surprise.

Afterward, they kept LVM-Thin. They also stopped letting people create snapshots without an expiry. The incident wasn’t a referendum on technology. It was a referendum on wishful thinking.

2) Optimization that backfired: “sync=disabled made the graphs look great”

A different org ran databases on Proxmox with local ZFS mirrors. Their initial performance was fine but not spectacular. A well-meaning engineer read a forum thread about ZFS sync writes and decided to “fix it.” They set sync=disabled on the dataset hosting the DB ZVOLs.

Benchmarks improved dramatically. Application latency improved too. The engineer took a victory lap and wrote a short internal post: “ZFS default is slow; disable sync.” Nobody challenged it because the numbers were pretty and the tickets got quieter.

Months later they had a power event. Not a clean shutdown; the kind that happens when a building’s infrastructure makes a different choice than your UPS runtime estimate. After reboot, some databases came up with corruption. Not all. Just enough to create a week of forensic misery, restore testing, and awkward conversations.

The hard part was the postmortem: the “optimization” didn’t cause the power failure, but it removed the safety rails that would have contained the damage. The team had to re-learn that performance changes often change durability semantics. That’s not tuning, that’s a contract rewrite.

The eventual fix was to re-enable sync, add appropriate hardware for fast durable sync writes (and validate it), and re-run workload-specific tests. Performance landed in a sane place. The graphs were less sexy. The data was less flammable.

3) Boring but correct practice that saved the day: capacity and latency budgets

A larger enterprise ran mixed workloads: file servers, web apps, a couple of databases, and a sea of “small but important” VMs. They used ZFS on most nodes and LVM-Thin on a few where hardware constraints existed. The difference wasn’t the technology. It was the practice.

They treated storage like a budget with thresholds. ZFS pools had a soft cap (don’t cross ~80% for hot pools without review). Thin pools had alerting on both data and metadata with clear runbooks. Snapshots had TTLs. Backups were staggered. Replication had time windows. None of this was exciting.

Then a vendor pushed a bad update that caused an application to log aggressively. One VM started writing at a rate that would normally create an incident. It didn’t, because the team’s dashboards flagged rising latency and unusual write rates quickly. They throttled the VM’s disk I/O and rolled back the update. Other workloads barely noticed.

The lesson: “boring” controls don’t prevent every problem, but they turn outages into contained events. The tech stack becomes resilient not because it’s magical, but because you can see trouble early and react with intention.

Common mistakes: symptom → root cause → fix

1) Symptom: VMs randomly freeze for seconds under load

Root cause: Host memory pressure causes swapping, or ZFS ARC competes with VM RAM; storage latency is a side effect.

Fix: Check free -h and swap usage; cap ARC if needed; ensure the host has reserved memory headroom; reduce ballooning chaos.

2) Symptom: fio shows huge throughput, but databases complain about latency

Root cause: Benchmark used buffered I/O or didn’t include sync/flush patterns; tail latency not measured.

Fix: Re-test with sync-heavy patterns and measure p95/p99. If on ZFS, evaluate SLOG with PLP; if on LVM, verify cache modes and drive cache policy.

3) Symptom: Thin pool “used” climbs forever even after deleting data in guests

Root cause: Discard/TRIM not working end-to-end, or guest filesystems not trimming.

Fix: Enable discard=on for VM disks, ensure virtio-scsi, run fstrim in guests, verify underlying discard support (lsblk -D).
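
For the in-guest half, most distros already ship a periodic trim timer; enabling it is usually enough, and a drop-in can tighten the schedule. A sketch (the daily cadence is an example):

```ini
# In the guest: systemctl enable --now fstrim.timer
# Optional drop-in to trim daily instead of the default weekly:
# /etc/systemd/system/fstrim.timer.d/override.conf
[Timer]
OnCalendar=
OnCalendar=daily
```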

4) Symptom: After weeks, snapshot-heavy VMs become slow

Root cause: Snapshot sprawl causing fragmentation and metadata overhead (ZFS or LVM-Thin), backup schedules overlapping with peak.

Fix: Enforce snapshot TTLs, reduce retention on hypervisor-level snapshots, move backups off-peak, avoid long snapshot chains.
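
Enforcing a TTL doesn’t need tooling; a sketch that turns snapshot listings into destroy commands — the snapshot names, timestamps, and 7-day TTL are made up, and it only prints what it would do:

```shell
# Sketch: flag ZFS snapshots older than a TTL (prints, never destroys).
# Input mimics `zfs list -H -p -t snapshot -o name,creation`; the two
# sample lines, the fixed "now", and the 7-day TTL are illustrative.
TTL_DAYS=7
now=1770000000   # in real use: now=$(date +%s)

printf '%s\n' \
  'rpool/data/vm-101-disk-0@auto-old 1769000000' \
  'rpool/data/vm-101-disk-0@auto-new 1769990000' |
while read -r snap created; do
  age_days=$(( (now - created) / 86400 ))
  if [ "$age_days" -gt "$TTL_DAYS" ]; then
    echo "would run: zfs destroy $snap  # ${age_days}d old"
  fi
done
# -> would run: zfs destroy rpool/data/vm-101-disk-0@auto-old  # 11d old
```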

5) Symptom: ZFS “feels slow” on writes, especially with many small sync writes

Root cause: No SLOG for sync-heavy workload, or SLOG is the wrong device (high latency, no PLP), or workload is parity-bound on RAIDZ.

Fix: Add proper SLOG only if sync writes dominate; prefer mirrors for VM IOPS; do not set sync=disabled as a casual fix.

6) Symptom: LVM-Thin metadata hits high percentages quickly

Root cause: Lots of snapshots, heavy churn workloads, small block allocations; metadata LV too small.

Fix: Reduce snapshot count, expand metadata LV (carefully, with a plan), monitor Meta% separately from Data%.

7) Symptom: One disk seems to “drag down” the whole host intermittently

Root cause: Device resets/timeouts, thermal throttling, or firmware issues (especially on NVMe). Storage stack gets blamed.

Fix: Check dmesg, SMART/NVMe logs, firmware updates; replace suspect hardware. Stop tuning software to compensate for hardware flakiness.

8) Symptom: ZFS pool is healthy but performance is inconsistent

Root cause: Fragmentation, mixed workloads, or CPU contention from compression/checksums; also possible ashift mismatch from initial pool creation.

Fix: Verify pool settings, measure CPU, review dataset properties; consider migrating hot workloads or adding vdevs rather than “mystery tuning”.

Checklists / step-by-step plan

Choosing ZFS vs LVM-Thin (practical decision checklist)

  1. Define the workload mix. Databases? Many small VMs? Mostly sequential file serving? Don’t guess—sample real I/O if you can.
  2. Decide what failures you can tolerate. Is silent corruption unacceptable? Is capacity surprise unacceptable? Pick which risk you manage better.
  3. Pick your vdev layout or thin pool policy.
    • ZFS for VM IOPS: mirrors are usually the move.
    • LVM-Thin: decide max overcommit ratio and enforce it.
  4. Plan snapshot lifecycle. TTLs, backup windows, and who is allowed to create snapshots.
  5. Plan monitoring before deployment. Not after the first incident.

ZFS deployment plan (safe defaults that age well)

  1. Create pools with correct sector alignment (ashift) from day one. Changing later is painful.
  2. Prefer mirrors for VM-heavy workloads unless you have a very specific capacity-driven design and accept the performance trade.
  3. Enable compression=lz4 and atime=off for VM datasets.
  4. Cap ARC if the host is memory-constrained. Leave headroom for VMs.
  5. Add SLOG only when sync latency is proven to be the bottleneck, and only with appropriate devices.
  6. Establish snapshot TTL and replication/backup schedules that avoid peak.
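The plan above condenses into a few commands. Everything here is a sketch: the pool name, device paths, and the 8 GiB ARC cap are placeholders you must adjust:

```shell
# Mirrored pairs for VM IOPS; ashift=12 for 4K-sector drives.
zpool create -o ashift=12 tank \
  mirror /dev/disk/by-id/nvme-A /dev/disk/by-id/nvme-B \
  mirror /dev/disk/by-id/nvme-C /dev/disk/by-id/nvme-D

# Defaults that age well for VM datasets:
zfs create -o compression=lz4 -o atime=off tank/vms

# Cap ARC (example: 8 GiB) so guests keep their RAM; applied at
# module load, so rebuild the initramfs and reboot.
echo "options zfs zfs_arc_max=$((8 * 1024 * 1024 * 1024))" \
  > /etc/modprobe.d/zfs.conf
update-initramfs -u
```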

LVM-Thin deployment plan (make thin safe enough)

  1. Size thin metadata generously. Metadata is not where you want to “save space.”
  2. Enable monitoring for the thin pool and set alert thresholds for data and metadata.
  3. Decide: allow overcommit or not. If yes, define a hard ceiling and a review process.
  4. Ensure discard works end-to-end and schedule in-guest trims where appropriate.
  5. Keep snapshot counts low and time-bound. Treat snapshots as tools, not collections.
  6. Run backups with throttling and avoid overlapping with business-critical I/O windows.
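As a sketch, steps 1-2 look like this. The VG name "pve", sizes, and thresholds are illustrative, and autoextend requires dmeventd to be running:

```shell
# Thin pool with deliberately generous metadata from day one.
lvcreate --type thin-pool -L 500G --poolmetadatasize 4G -n data pve

# lvm.conf knobs that make thin survivable:
#   thin_pool_autoextend_threshold = 80   (act at 80% full)
#   thin_pool_autoextend_percent   = 10   (grow by 10% each time)
grep -E 'thin_pool_autoextend' /etc/lvm/lvm.conf
```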

Incident response checklist (when storage is “slow”)

  1. Confirm it’s not CPU/memory/network first.
  2. Find the busiest device and the busiest VM.
  3. Correlate with backups/snapshots/replication windows.
  4. Check ZFS pool/vdev stats or LVM thin usage and metadata.
  5. Check kernel logs for device resets.
  6. Make one change at a time; measure; roll back if wrong.
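The checklist above compresses to a few commands, roughly in order (tool availability varies by install; lvs/zpool apply to whichever backend you run):

```shell
uptime; free -m                          # 1) rule out CPU/memory first
iostat -xz 5 3                           # 2) hot device: high %util + await
zpool iostat -v 5 3                      # 4a) ZFS: per-vdev load
lvs -o lv_name,data_percent,metadata_percent   # 4b) thin: data vs metadata
dmesg -T | grep -iE 'reset|timeout' | tail     # 5) device resets
```

For step 3, overlay these samples on your backup/snapshot schedule before touching any knob.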

FAQ

1) Should I use ZFS or LVM-Thin for Proxmox in a homelab?

If you want safety and easy snapshots/replication learning, use ZFS. If you’re RAM-constrained and want minimal overhead, LVM-Thin can be fine—just monitor thin usage like an adult.

2) Is ZFS “slower” than LVM-Thin?

Sometimes, on specific patterns. ZFS does checksums, copy-on-write, and transactional semantics; that costs something. But ZFS can also be faster in real life due to compression and caching. The right question is: which one delivers lower tail latency for your workload under snapshot/backup pressure?

3) Can I fix ZFS write latency by setting sync=disabled?

You can make benchmarks look better and durability worse. If your workload issues sync writes for correctness, disabling sync changes the contract. Fix sync latency with appropriate hardware (or workload configuration) and measure again.

4) Does LVM-Thin protect me from bit rot?

No, not end-to-end like ZFS. You can mitigate with good hardware, RAID with patrol reads, and backups, but LVM-Thin itself doesn’t checksum user data blocks the way ZFS does.

5) Do I need a SLOG for ZFS on Proxmox?

Only if sync writes are a proven bottleneck. Many VM workloads are not sync-bound. If you add a SLOG, use a device with low latency and power-loss protection; otherwise you can make things worse or less safe.

6) Why did performance get worse after months even though hardware didn’t change?

Snapshots, fragmentation, pool fullness, and workload drift. Storage systems age. Measure fragmentation (ZFS), snapshot count, thin metadata usage (LVM), and whether backups shifted into peak windows.

7) Is RAIDZ okay for VM storage on ZFS?

It can be, but it’s not the default I’d pick for mixed VM workloads where latency matters. Mirrors are usually better for random I/O. RAIDZ makes more sense when capacity efficiency is key and the workload is more sequential or tolerant.

8) What VM disk format should I use with each backend?

On ZFS, raw ZVOL-backed disks are common and perform well. On LVM-Thin, raw logical volumes are typical. qcow2 adds features but can add overhead and complexity; use it when you need qcow2 features, not by habit.

9) How full is “too full” for ZFS and LVM-Thin?

For hot ZFS pools, crossing ~80–85% is where performance risk rises and operational flexibility drops. For LVM-Thin, the danger is hitting 100% data or metadata; set alerts well before that and keep a buffer for bursts and snapshots.

10) What’s the fastest way to prove whether it’s the storage backend or a single bad VM?

Use iostat -xz to find the hot device, then map QEMU PIDs to VM IDs and look for correlation with backups/snapshots and in-guest activity. One offender is common.
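Proxmox writes each guest's PID to /var/run/qemu-server/&lt;VMID&gt;.pid, which makes the PID-to-VMID mapping a one-liner; pairing it with per-process I/O stats (pidstat is from sysstat) usually exposes the offender:

```shell
# Map every running VM ID to its QEMU PID.
for f in /var/run/qemu-server/*.pid; do
  [ -e "$f" ] || continue
  printf '%s -> PID %s\n' "$(basename "$f" .pid)" "$(cat "$f")"
done

# Per-process disk I/O over 5 seconds; note the busiest PID and
# match it against the map above.
pidstat -d 5 1
```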

Next steps you can do this week

  1. Stop trusting one benchmark. Add a mixed I/O test that reports p95/p99 latency, and run it inside a VM, not just on the host.
  2. Implement the Fast diagnosis playbook as a runbook. Put the exact commands your team should run in the ticket template.
  3. If you run LVM-Thin: set alerts on Data% and Meta%, confirm monitoring, and prove discard works end-to-end.
  4. If you run ZFS: check ARC sizing vs host RAM, verify vdev layout matches VM workload, and validate sync write behavior before anyone “tunes” it.
  5. Put snapshots on a leash. TTLs, owner accountability, and backup windows that don’t coincide with peak load.
  6. Do one controlled failure drill. Fill a test thin pool. Pull a disk in a test ZFS mirror. Practice recovery while everyone is awake.
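For step 1, a mixed-I/O fio run that reports tail latency might look like this sketch, run inside a guest against the VM's own filesystem (path, size, and runtime are placeholders):

```shell
# 70/30 random read/write at 4k, direct I/O, 2 minutes. The report's
# clat percentiles (p95/p99) are the numbers that predict "VM slow"
# tickets — not the bandwidth line.
fio --name=vm-mix --filename=/var/tmp/fio.test --size=4G \
    --rw=randrw --rwmixread=70 --bs=4k --iodepth=16 \
    --ioengine=libaio --direct=1 --runtime=120 --time_based \
    --percentile_list=95:99 --group_reporting
rm -f /var/tmp/fio.test
```

Run it during business hours on a test VM, with your normal snapshot and backup schedule active; that's the environment your users live in.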

Pick ZFS or LVM-Thin, but don’t pick it because someone posted a pretty fio chart. Pick it because you understand what it optimizes for, what it refuses to compromise on, and exactly how it will punish you when you get lazy.
