ZFS on Proxmox vs VMFS on ESXi: Snapshots, Performance, Recovery, and Real-World Gotchas

The outage didn’t start with a bang. It started with a “backup successful” email and a datastore that got 3% slower every day until it fell over.
When you’re running VMs for real workloads, storage is where optimism goes to die—quietly, at 2 a.m., after the third “snapshot delete” stalls.

This is a production-grade comparison: ZFS on Proxmox versus VMFS on ESXi. Not features on a slide deck—failure modes, performance cliffs, snapshot behavior,
and how you actually recover when someone deleted the wrong thing with confidence.

Make the decision like an operator

If you want a short version: ZFS is a filesystem + volume manager with opinions. VMFS is a clustered filesystem built to host VM files safely on shared storage.
They solve different problems, and the trap is pretending they’re interchangeable.

Choose ZFS on Proxmox when

  • You own the disks (local NVMe/SATA/SAS) and want end-to-end integrity, scrubs, and predictable recovery tools.
  • You value snapshots as a first-class primitive you can replicate (zfs send/receive) and reason about.
  • You can enforce a storage discipline: recordsize/volblocksize, ashift correctness, and avoiding write amplification footguns.
  • You want debugging transparency: ZFS tells you what it knows. It also tells you what it suspects.

Choose VMFS on ESXi when

  • You’re on shared SAN storage (FC/iSCSI/NVMe-oF) and you need clustered access with VMware’s locking, multipathing, and ecosystem integration.
  • You’re operationally standardized on vSphere: vCenter workflows, DRS, HA, vendor support playbooks.
  • You want predictable vendor-blessed behavior even when it’s less transparent. Sometimes boring is the point.

Don’t pick based on ideology (“open source” vs “enterprise”). Pick based on what you can keep healthy at scale with your people,
your on-call rotation, and your hardware reality.

One quote worth keeping in your pocket during storage debates: Werner Vogels’ idea (paraphrased) that “everything fails, all the time,”
and systems must be designed around that.

Joke #1: If someone says “storage is easy,” they’ve either never paged themselves, or they’re about to.

Interesting facts and historical context

  • ZFS debuted at Sun in the mid-2000s as a combined filesystem and volume manager, aiming to replace layers of RAID + LVM + filesystem with one coherent stack.
  • Copy-on-write (CoW) wasn’t new, but ZFS made it operationally useful at scale: snapshots, checksums, and self-healing were integrated rather than bolted on.
  • VMFS was designed for clustered virtualization storage, focusing on safely letting multiple ESXi hosts access the same datastore concurrently.
  • VMware’s snapshot mechanism is not “a backup” by design; it’s a delta chain for short-lived operational tasks—yet people keep treating it like a time machine.
  • ZFS checksums every block (data and metadata), enabling detection of silent corruption that traditional RAID might faithfully return as “successful.”
  • VMFS has evolved from older SCSI reservations to finer-grained locking (ATS and related mechanisms), reducing contention on shared LUNs.
  • ZFS’s ARC changed how people think about cache: it’s adaptive, metadata-aware, and can dominate memory planning for hypervisors if you let it.
  • Both stacks can suffer from “successfully slow” failures: no hard errors, just rising latency until applications time out and humans panic.

Snapshots: what you think happens vs what happens

ZFS snapshots on Proxmox: clean semantics, sharp edges

ZFS snapshots are cheap to create and expensive to keep—specifically, expensive in the sense that they retain old blocks and prevent free space from being reused.
The snapshot itself is metadata; the space cost is the divergence after the snapshot. Operationally, this is great: you get fast point-in-time views,
predictable rollback semantics, and efficient replication via zfs send.

In Proxmox, you’ll typically store VM disks either as ZVOLs (block devices backed by ZFS) or as files (qcow2/raw) inside datasets.
ZVOL snapshots map nicely to VM disk snapshots because the hypervisor gets a consistent block device snapshot. It’s not magic, though:
application consistency still needs guest cooperation (qemu-guest-agent, filesystem quiesce, database-aware hooks).
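
A minimal sketch of the guest-cooperative version on Proxmox, assuming VM 101 has qemu-guest-agent installed and the agent enabled in its config (the snapshot name is just an example):

cr0x@server:~$ qm agent 101 ping                                                 # exits 0 silently if the guest agent answers
cr0x@server:~$ qm snapshot 101 pre-upgrade --description "before app upgrade"    # with the agent enabled, Proxmox freezes/thaws guest filesystems around the snapshot
cr0x@server:~$ qm listsnapshot 101                                               # confirm the snapshot exists before doing the risky thing

Database-aware hooks still belong inside the guest; a filesystem freeze buys you consistency, not a clean database checkpoint.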

The sharp edges are familiar:

  • Snapshot sprawl: 15-minute snapshots retained for months will turn “free space” into “accounting fiction.” ZFS won’t delete blocks still referenced.
  • Fragmentation: heavy snapshot churn can fragment allocations; reads get slower and resilvers get uglier.
  • Rollback risk: rolling back a dataset or ZVOL is a blunt instrument. It’s fast. It’s also final for anything written after the snapshot.

VMFS snapshots on ESXi: delta chains and consolidation drama

ESXi snapshots create delta VMDKs. Writes go to the delta; the base stays mostly unchanged. This is operationally convenient for short maintenance windows.
The cost shows up later: the longer the snapshot chain exists and the more it changes, the more I/O becomes a scavenger hunt across multiple files.

The most common failure mode is not “snapshot exists.” It’s “snapshot delete triggers consolidation, which triggers a large copy/merge,
which triggers latency, which triggers application timeouts, which triggers people making it worse.”

If you’ve never watched a consolidation stall while the datastore is 92% full, you haven’t truly experienced the special genre of fear
where every click is a moral decision.

Snapshot guidance you can live by

  • ZFS: keep snapshot schedules tight and retention intentional. Replicate off-host. Monitor used-by-snapshots explicitly.
  • ESXi/VMFS: treat snapshots as temporary operational scaffolding. If you need restore points, use a backup product that understands VMware.
  • Both: measure snapshot overhead by latency, not by ideology.

Performance: latency, IOPS, and the “it depends” you can measure

What performance actually means for hypervisors

Hypervisors don’t “need IOPS.” They need low, predictable latency. A VM with a database does not care that your storage can do 400k IOPS at queue depth 128
if the 99th percentile write latency spikes to 40 ms during snapshot merges. Your users will experience the spikes, not your benchmarks.
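
When you need a number instead of a feeling, a short fio probe reports completion-latency percentiles by default. The target file, size, and runtime below are illustrative; point it at a test dataset, not underneath a live database, and drop --direct=1 if your ZFS version rejects O_DIRECT:

cr0x@server:~$ fio --name=p99probe --filename=/rpool/data/fio-test.bin --size=2G \
      --rw=randwrite --bs=4k --iodepth=16 --direct=1 --ioengine=libaio \
      --runtime=60 --time_based
cr0x@server:~$ rm /rpool/data/fio-test.bin

Read the "clat percentiles" block in fio's output: the 99.00th value is what your users feel during spikes.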

ZFS performance characteristics

ZFS performance is dominated by a few levers:

  • Recordsize / volblocksize: mismatch them with workload and you manufacture write amplification.
  • ARC sizing: starving the host page cache and starving VMs are different ways to have a bad day.
  • SLOG behavior: only matters for synchronous writes; it is not a “write cache” in the way people wish it was.
  • Compression: can be a free win on modern CPUs, but it’s workload-dependent and can increase CPU contention on busy nodes.

For VM workloads, ZVOLs with sane volblocksize (often 8K–16K depending on workload) and compression=lz4 are common starting points.
But don’t cargo-cult. Measure with your actual guests.
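
As a sketch of what "decide at creation time" means (Proxmox normally creates these volumes for you, and the zfspool storage's blocksize option in /etc/pve/storage.cfg controls what it picks; the name and size here are illustrative):

cr0x@server:~$ zfs create -s -V 100G -o volblocksize=16K -o compression=lz4 rpool/data/vm-102-disk-0
cr0x@server:~$ zfs get -o name,property,value volblocksize,compression rpool/data/vm-102-disk-0

volblocksize is fixed at creation, which is exactly why this is a planning decision and not a tuning knob.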

VMFS performance characteristics

VMFS performance is shaped by the backing storage and the path: array cache, RAID layout, controller saturation, iSCSI/FC settings, HBA firmware,
multipathing policy, queue depths, and contention between hosts.

VMFS itself is usually not the bottleneck unless:

  • you’re doing heavy metadata operations on a stressed datastore,
  • you’re suffering from locking contention, path thrashing, or APD/PDL events,
  • you’re running deep snapshot chains and forcing random reads across deltas.

Where people get performance comparisons wrong

They compare local ZFS NVMe to VMFS on a midrange array over iSCSI and conclude ZFS is “faster.” No kidding.
Or they compare VMFS on a tuned FC SAN to ZFS on a single SATA mirror and conclude VMware is “faster.” Also no kidding.
Compare architectures honestly: local vs shared, redundancy level, controller cache, network, and failure domains.

Joke #2: Benchmarking storage is like testing a parachute by reading the manual—comforting right up until the jump.

Recovery: the parts you’ll touch during an incident

ZFS recovery: deterministic tools, unforgiving reality

ZFS shines when something silently corrupts. Checksums detect it. Mirrors/RAIDZ can self-heal. Scrubs find problems before users do.
When a disk fails, resilvering is generally safer than classic RAID rebuild behavior because ZFS knows what blocks are allocated and relevant.
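
The replacement itself is short; the device names below are placeholders for your actual by-id paths:

cr0x@server:~$ zpool replace rpool nvme-SAMSUNG_MZVLB1T0_2 /dev/disk/by-id/nvme-NEW_DISK_SERIAL
cr0x@server:~$ zpool status rpool        # watch "resilver in progress" and the estimated completion time
cr0x@server:~$ zpool scrub rpool         # after the resilver finishes, a scrub confirms the pool is clean end to end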

But ZFS demands you respect physics:

  • Capacity matters: ZFS does not like running nearly full. As pools fill, fragmentation and write performance degrade, and resilver times grow.
  • Wrong ashift is forever: pick the wrong sector alignment and you keep paying in latency for the life of the pool.
  • RAIDZ rebuilds are not magic: they still read a lot, and they still stress the remaining disks.

VMFS recovery: operational paths and vendor boundaries

VMFS recovery often depends on the storage array’s behavior, the SAN fabric, and VMware’s handling of path failures.
If a LUN goes away briefly, you can get APD (All Paths Down) or PDL (Permanent Device Loss) scenarios.
Your recovery is less about “repairing a filesystem” and more about stabilizing access and ensuring metadata consistency.
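
Two first checks during a suspected APD/PDL event, as a sketch; exact log wording varies between ESXi releases:

cr0x@server:~$ esxcli storage core device list | grep -E 'Display Name|Status:'
cr0x@server:~$ grep -iE 'apd|permanently inaccessible' /var/log/vmkernel.log | tail

Device status plus the vmkernel log tells you whether you are waiting out a transient APD or dealing with a PDL that needs a decision.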

VMFS also forces discipline about snapshots and free space. You can often recover from “oops, I deleted a VM” if you have backups,
or if the array provides its own snapshotting. But VMFS won’t save you from thin provisioning fantasies or an array that lies about free blocks.

The recovery question that matters

Ask: “When things go weird, do I have a deterministic procedure that an on-call engineer can execute without a storage vendor on the phone?”
ZFS tends to score better on local storage. VMFS tends to score better when you already have a mature SAN practice and support model.

Real-world gotchas that bite adults

ZFS gotchas on Proxmox

  • ARC vs VMs: ZFS will happily take memory. If you don’t cap it, you’ll starve guests and blame “random slowness.”
  • zvol vs qcow2: qcow2 brings its own CoW and metadata overhead; stacking CoW-on-CoW can be fine or awful depending on sync/trim behavior.
  • SLOG misunderstandings: a fast SLOG helps sync writes, but it can’t fix a pool that can’t sustain your write workload.
  • TRIM/discard expectations: thin provisioning and discard require alignment between guest OS, hypervisor, and ZFS settings.
  • “Snapshots are free” thinking: they aren’t. They’re delayed billing.

VMFS gotchas on ESXi

  • Snapshot chains: long-lived snapshots destroy predictability. Consolidation can wreck performance at the worst time.
  • Thin provisioning across layers: thin VMDK on thin LUN on thin array is a confidence trick. Eventually, someone runs out of real space.
  • Multipath policy assumptions: round robin is not always the default, and the default is not always appropriate. Path imbalance can look like random latency.
  • Queue depth and HBA limits: one host can bully the datastore while others suffer. It looks like “VMware is slow” but it’s contention.
  • APD/PDL handling: if you haven’t rehearsed it, the first time will be during an incident, which is a terrible rehearsal environment.

The shared gotcha: monitoring the wrong thing

Both worlds fail when you monitor averages and ignore tail latency, free space trends, and background maintenance work.
Storage doesn’t need your attention until it really, really does. Your job is to notice earlier.
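
On the ZFS side, the boring trend data is one command away; logging it daily is enough to see the cliff coming weeks out:

cr0x@server:~$ zpool list -o name,size,allocated,free,capacity,fragmentation,health rpool
cr0x@server:~$ zfs list -r -o name,usedbysnapshots -s usedbysnapshots rpool/data | tail   # the datasets carrying the most snapshot debt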

Fast diagnosis playbook

When storage is “slow,” don’t guess. Triage fast, isolate the layer, and only then start tuning.
This is the order that most often reduces time-to-truth in production.

First: confirm it’s storage latency, not CPU steal or memory pressure

  • On the hypervisor: check CPU ready/steal, host swapping, ballooning, and general load.
  • On the VM: check whether the app is blocked on I/O (iowait) or stalled on locks.
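
On a Proxmox node, a quick host-side way to answer both checks (assuming a kernel with pressure-stall information enabled):

cr0x@server:~$ cat /proc/pressure/io          # "some avg10" climbing means tasks are stalling on I/O right now
cr0x@server:~$ cat /proc/pressure/memory      # memory stalls with quiet io pressure point away from storage
cr0x@server:~$ vmstat 1 5                     # watch wa (iowait) and si/so (swap) while the complaint is live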

Second: identify read vs write, sync vs async, random vs sequential

  • Write latency spikes during backup windows often means snapshot/replication/consolidation.
  • Read latency spikes with stable writes often means fragmentation, cache misses, or path issues.
  • Sync write pain suggests SLOG (ZFS) or array cache/BBU issues (SAN).

Third: find the contention point

  • ZFS: pool busy? one vdev saturated? ARC thrashing? scrub/resilver running?
  • VMFS: datastore contention? path thrashing? array port saturation? queue depth limits?

Fourth: check free space and snapshot debt

  • ZFS: pool at 80%+ with lots of snapshots is a predictable performance cliff.
  • VMFS: datastore near full makes snapshot operations risky and can trigger emergency consolidation pain.

Fifth: pick a safe action

  • Reduce write load, pause non-critical backups, reschedule scrubs, evacuate hot VMs.
  • Do not “tune everything” during the incident. Change one thing you can roll back.

Practical tasks: commands, output, and decisions

These are real tasks you’ll run when something feels off. Each includes: the command, what the output means, and the decision you make.
Commands are written for a Proxmox/ZFS host or an ESXi host where appropriate.

1) ZFS: check pool health and error counters

cr0x@server:~$ zpool status -v
  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool.
action: Upgrade the pool to enable all supported features.
  scan: scrub repaired 0B in 00:12:41 with 0 errors on Sun Dec 22 03:00:18 2025
config:

        NAME                         STATE     READ WRITE CKSUM
        rpool                        ONLINE       0     0     0
          mirror-0                   ONLINE       0     0     0
            nvme-SAMSUNG_MZVLB1T0    ONLINE       0     0     0
            nvme-SAMSUNG_MZVLB1T0_2  ONLINE       0     0     0

errors: No known data errors

Meaning: ONLINE with zero READ/WRITE/CKSUM errors is what you want. Scrub completed with 0 errors is quiet confidence.
“Some supported features…” is usually not an incident; it’s a maintenance choice.

Decision: If you see non-zero CKSUM errors, assume hardware/path issues until proven otherwise; plan disk replacement and run a scrub after.
If scrub is running during peak, consider rescheduling but don’t cancel casually.

2) ZFS: identify who is consuming space (especially snapshots)

cr0x@server:~$ zfs list -o name,used,avail,refer,usedbysnapshots,mountpoint -r rpool
NAME                      USED  AVAIL  REFER  USEDSNAP  MOUNTPOINT
rpool                     820G   120G   192K        0B  /rpool
rpool/data                810G   120G   192K        0B  /rpool/data
rpool/data/vm-100-disk-0  220G   120G   180G       40G  -
rpool/data/vm-101-disk-0  310G   120G   140G      170G  -

Meaning: usedbysnapshots shows retained blocks due to snapshots. That 170G is “space you can’t reclaim” until you delete snapshots.

Decision: If pool headroom is tight, delete snapshots intentionally (oldest first), or replicate and then prune. Don’t wait until 95% full.
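
A sketch of pruning deliberately, reusing the snapshot names from this article's examples; the -n flag makes it a dry run so you see what would be reclaimed before committing:

cr0x@server:~$ zfs destroy -nv rpool/data/vm-101-disk-0@monthly-2025-09-01    # prints "would destroy" / "would reclaim" without touching anything
cr0x@server:~$ zfs destroy rpool/data/vm-101-disk-0@monthly-2025-09-01        # the real deletion, once you are sure
# Ranges like dataset@snapA%snapB delete a consecutive series; always dry-run ranges first.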

3) ZFS: list snapshots and spot retention drift

cr0x@server:~$ zfs list -t snapshot -o name,used,creation -s creation | tail -n 6
rpool/data/vm-101-disk-0@auto-2025-12-27_0200    1.2G  Sat Dec 27 02:00 2025
rpool/data/vm-101-disk-0@auto-2025-12-27_0300    1.1G  Sat Dec 27 03:00 2025
rpool/data/vm-101-disk-0@auto-2025-12-27_0400    1.3G  Sat Dec 27 04:00 2025
rpool/data/vm-101-disk-0@auto-2025-12-27_0500    1.4G  Sat Dec 27 05:00 2025
rpool/data/vm-101-disk-0@weekly-2025-11-24       32G   Mon Nov 24 00:00 2025
rpool/data/vm-101-disk-0@monthly-2025-09-01      58G   Mon Sep  1 00:00 2025

Meaning: hourly snapshots are fine until weekly/monthly ones hang around forever and accumulate divergence.

Decision: enforce retention; if monthly snapshots are required, ensure pool sizing accounts for it or replicate off-node and prune locally.

4) ZFS: confirm recordsize/volblocksize and compression for VM storage

cr0x@server:~$ zfs get -o name,property,value -H recordsize,compression rpool/data
rpool/data	recordsize	128K
rpool/data	compression	lz4

cr0x@server:~$ zfs get -o name,property,value -H volblocksize,compression rpool/data/vm-101-disk-0
rpool/data/vm-101-disk-0	volblocksize	8K
rpool/data/vm-101-disk-0	compression	lz4

Meaning: datasets have recordsize (file I/O); ZVOLs have volblocksize (block I/O). A DB-heavy VM doing small random writes on a 128K recordsize dataset (qcow2 files) or an oversized volblocksize pays for it in read-modify-write.

Decision: if you see pathological write amplification, create new ZVOLs with an appropriate volblocksize and migrate disks. You can’t change volblocksize in place.

5) ZFS: check pool I/O latency and which vdev is hurting

cr0x@server:~$ zpool iostat -v 1 3
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
rpool                        820G   120G    210    980  12.3M  55.1M
  mirror-0                   820G   120G    210    980  12.3M  55.1M
    nvme-SAMSUNG_MZVLB1T0       -      -    102    480   6.1M  27.5M
    nvme-SAMSUNG_MZVLB1T0_2     -      -    108    500   6.2M  27.6M
--------------------------  -----  -----  -----  -----  -----  -----

Meaning: this shows per-vdev distribution. If one side of a mirror is doing all reads or writes, you might have a device problem or path issue.

Decision: if bandwidth/ops are uneven or errors appear, check SMART and cabling/backplane; prepare to replace the suspect device.

6) ZFS: see if a scrub/resilver is stealing your lunch

cr0x@server:~$ zpool status
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Sun Dec 28 10:03:11 2025
        312G scanned at 1.10G/s, 84G issued at 302M/s, 820G total
        0B repaired, 10.24% done, 00:41:12 to go
config:

        NAME                         STATE     READ WRITE CKSUM
        rpool                        ONLINE       0     0     0
          mirror-0                   ONLINE       0     0     0
            nvme-SAMSUNG_MZVLB1T0    ONLINE       0     0     0
            nvme-SAMSUNG_MZVLB1T0_2  ONLINE       0     0     0

Meaning: scrub is I/O heavy. On some pools it’s fine; on others it’s a performance event.

Decision: if latency spikes align with scrubs, reschedule scrubs to off-peak; consider zfs_vdev_scrub_max_active tuning only after measuring impact.
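
If you do have to intervene mid-scrub, pausing is safer than cancelling; the minimal version:

cr0x@server:~$ zpool scrub -p rpool     # pause; progress is kept
cr0x@server:~$ zpool scrub rpool        # run again later to resume from where it paused
# On Debian-based Proxmox the monthly schedule typically lives in /etc/cron.d/zfsutils-linux; move it, don't delete it.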

7) Proxmox: check what storage type backs a VM disk (zvol vs qcow2)

cr0x@server:~$ qm config 101 | egrep 'scsi|virtio|ide|sata'
scsi0: local-zfs:vm-101-disk-0,size=200G
scsi1: local-lvm:vm-101-disk-1,size=50G

Meaning: local-zfs indicates a ZFS-backed volume (commonly a ZVOL). This impacts snapshot mechanics and performance tuning knobs.

Decision: if a VM is performance-sensitive, prefer a straightforward block-backed disk (ZVOL or raw on fast storage) over stacked CoW formats unless you need qcow2 features.

8) Linux/Proxmox: check whether ZFS is starving the host of memory

cr0x@server:~$ arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
10:21:11   812    44      5     9   20    10   23    25   57   62.1G  64.0G
10:21:12   790    51      6    11   21     9   18    31   61   62.2G  64.0G
10:21:13   805    47      5    10   21     8   17    29   62   62.2G  64.0G

Meaning: ARC size near c indicates ZFS is using what it’s allowed. If the host is swapping while ARC is huge, guests suffer.

Decision: cap ARC (zfs_arc_max) if the host needs memory for VMs; don’t “let ZFS eat RAM” on a busy hypervisor unless you’ve proven it’s safe.
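
A sketch of capping ARC at 16 GiB; the value is an arbitrary example, so size it from your guest memory plan, not from a blog post:

cr0x@server:~$ echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max                # takes effect at runtime; ARC shrinks gradually
cr0x@server:~$ echo "options zfs zfs_arc_max=17179869184" >> /etc/modprobe.d/zfs.conf   # persists across reboots
cr0x@server:~$ update-initramfs -u                                                      # needed when root is on ZFS so the option lands in the initramfs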

9) ZFS: check sync write behavior and whether you’re forcing sync

cr0x@server:~$ zfs get -o name,property,value -H sync,logbias rpool/data/vm-101-disk-0
rpool/data/vm-101-disk-0	sync	standard
rpool/data/vm-101-disk-0	logbias	latency

Meaning: sync=standard respects application requests. sync=disabled is a data-loss lever disguised as a performance setting.

Decision: leave sync alone unless you understand your workload’s durability requirements. If you need sync performance, invest in proper SLOG and power-loss protection.

10) ESXi: list datastores and free space (snapshot safety check)

cr0x@server:~$ esxcli storage filesystem list
Mount Point                                        Volume Name  UUID                                 Mounted  Type        Size        Free
-------------------------------------------------  -----------  -----------------------------------  -------  ----  ----------  ----------
/vmfs/volumes/DS_VMFS01                             DS_VMFS01    64c1d7f1-0b12c9a0-3b7d-001b21aabbcc  true     VMFS-6  10.00T      1.12T
/vmfs/volumes/BOOTBANK1                             BOOTBANK1    5f2a0f13-5caa1122-0000-000000000000  true     vfat     3.99G      2.10G

Meaning: 1.12T free might be plenty—or dangerously low—depending on snapshot/consolidation behavior and thin provisioning beneath.

Decision: if free space is trending down and you have snapshots, stop adding risk: consolidate, evacuate VMs, expand datastore, or delete safely with a plan.

11) ESXi: identify snapshot state (from the host)

cr0x@server:~$ vim-cmd vmsvc/getallvms | head
Vmid   Name                File                                   Guest OS      Version   Annotation
12     app-portal-01       [DS_VMFS01] app-portal-01/app-portal-01.vmx  ubuntu64Guest  vmx-19

cr0x@server:~$ vim-cmd vmsvc/snapshot.get 12
GetSnapshot: Snapshot tree:
|-ROOT
   +-Snapshot Name        : pre-upgrade
     Snapshot Id          : 1
     Snapshot Created On  : 12/27/2025 01:12:44
     Snapshot State       : poweredOn

Meaning: snapshot exists and is not new. That’s risk. Age matters because delta growth and chain overhead compound.

Decision: schedule consolidation during low load. If the VM is write-heavy, expect a noisy merge; plan capacity and performance headroom.
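
If you drive it from the host shell, the minimal sequence looks like this; run it in a low-load window and expect the merge to take as long as the delta demands:

cr0x@server:~$ vim-cmd vmsvc/snapshot.removeall 12     # "delete all" consolidates the entire chain back into the base disk
cr0x@server:~$ vim-cmd vmsvc/get.tasklist 12           # confirm the task is active; progress also shows in the client UI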

12) ESXi: spot pathing issues and dead paths

cr0x@server:~$ esxcli storage core path list | egrep -A3 'Runtime Name|State:|Preferred'
Runtime Name: vmhba64:C0:T2:L10
   State: active
   Preferred: true
Runtime Name: vmhba64:C0:T3:L10
   State: dead
   Preferred: false

Meaning: a dead path can create intermittent latency and failover thrash depending on multipathing policy and array behavior.

Decision: fix the fabric before tuning anything. Replace optics/cables, check switch ports, validate zoning, and confirm array target health.

13) ESXi: check device latency from the hypervisor’s view

cr0x@server:~$ esxtop -b -n 1 > /tmp/esxtop-sample.csv
# Batch mode writes one CSV header line plus one sample per iteration; it's meant for offline analysis, not eyeballing.

Meaning: In practice you’ll use interactive esxtop and look at disk metrics like DAVG (device), KAVG (kernel), and GAVG (guest).
High DAVG suggests array/storage; high KAVG suggests ESXi queuing; high GAVG is what VMs feel.

Decision: if DAVG is high across hosts, talk to the array team. If KAVG is high on one host, look at queue depth, HBA, and “noisy neighbor” VMs.

14) Proxmox: measure node-level I/O pressure and latency quickly

cr0x@server:~$ iostat -x 1 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.01    0.00    4.20    8.55    0.00   75.24

Device            r/s     w/s   rMB/s   wMB/s  await  r_await  w_await  svctm  %util
nvme0n1         210.0   980.0   12.3    55.1   6.20    3.10     6.90   0.40   48.0

Meaning: await gives you a rough latency picture. High %util with rising await suggests saturation.

Decision: if you’re saturating a single device/vdev, move load or redesign (more vdevs, faster disks, separate workloads). Don’t “sysctl” your way out of physics.

Three corporate-world mini-stories

Mini-story #1: an incident caused by a wrong assumption

A mid-sized company migrated a cluster from an aging array to local NVMe in Proxmox with ZFS mirrors. The performance was instantly better.
The team celebrated by turning on frequent snapshots “because snapshots are cheap.”

They assumed “cheap” meant “free.” After a couple of months, the pool sat at around 88% allocated, but monitoring still looked fine
because raw free space never hit zero and nothing was throwing errors. Then the nightly backup window started taking longer,
and the helpdesk started tagging tickets with “slowness.”

One weekend, a large application upgrade wrote a lot of new data. Latency spiked. The pool crossed into the zone where fragmentation and space maps
made allocations expensive. VMs didn’t crash; they just became unresponsive. That’s the worst kind of failure because it looks like a network issue,
an application issue, a DNS issue—anything but “storage.”

The on-call engineer deleted a handful of old snapshots, expecting immediate relief. It helped, but not quickly; deletion still had to free metadata,
and the pool was already stressed. The real fix was painful but straightforward: implement retention with hard caps, replicate off-host,
and keep operational headroom. They also learned to graph “used by snapshots” per dataset.

Wrong assumption: “If the UI lets me create them forever, it must be safe.” Storage UIs are polite. Physics is not.

Mini-story #2: an optimization that backfired

Another org ran ESXi with VMFS on a shared iSCSI array. They were trying to cut latency on a busy SQL VM.
Someone read that “thin provisioning is slow,” so they converted several large VMDKs to eager-zeroed thick during business hours.

The conversion itself generated a massive write stream. The array’s write cache handled it until it didn’t.
The iSCSI network started showing microbursts. Other hosts experienced intermittent latency spikes, which triggered application timeouts,
which triggered retries, which increased load. Classic feedback loop.

The team saw the VMFS datastore was “fine” and the array was “healthy.” That’s the trap: health indicators often mean “not dead,” not “fast.”
vCenter showed increasing storage latency alarms. The database team saw deadlocks and blamed the application.

After stabilizing by pausing the conversions and evacuating a few noisy VMs, they did the boring analysis:
storage path counters, queue depths, and array performance charts. The array wasn’t broken; it was saturated by a well-intentioned bulk write job.
The optimization wasn’t wrong. The timing was.

Backfired optimization: “Let’s do heavy storage transformations on the production datastore during peak.” Sometimes performance work is just moving pain around.

Mini-story #3: the boring but correct practice that saved the day

A financial services shop ran Proxmox with ZFS for branch workloads and ESXi with VMFS for core workloads. Two different stacks, one shared habit:
they rehearsed recovery and kept runbooks that were actually executable.

A Proxmox node lost a disk in a mirror. No drama: alert fired, engineer checked zpool status, replaced the drive, watched resilver progress,
and validated a scrub later. The VMs kept running. Nobody wrote a postmortem because nothing caught fire.

Weeks later, a SAN switch firmware bug caused brief path flaps. ESXi hosts saw intermittent dead paths.
Because the team had rehearsed their APD/PDL playbook, they didn’t chase ghosts in the guest OS.
They verified paths, stabilized the fabric, and only then started evacuating VMs from the worst-affected datastore.

The saving practice wasn’t a fancy feature. It was routine scrubs, tested restores, and a culture of “prove it works” rather than “assume it works.”
Boring won. Again.

Common mistakes: symptom → root cause → fix

1) ZFS pool looks healthy, but VMs get slower every week

Symptom: no disk errors, but read latency climbs and random I/O feels worse.

Root cause: snapshot churn and high pool occupancy causing fragmentation; ARC misses increasing; pool too full.

Fix: reduce snapshot retention, keep more free space (aim for meaningful headroom), consider adding vdevs (not just bigger disks), and measure before/after with latency percentiles.

2) ZFS “free space” seems fine, but you can’t reclaim space after deleting data

Symptom: data deleted in VM, but dataset used stays high; pool still tight.

Root cause: snapshots retain old blocks; guest discard not passed through; TRIM not configured/working end-to-end.

Fix: check usedbysnapshots, prune snapshots, enable discard where appropriate, and validate with controlled tests—not hope.
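
A sketch of lining the layers up, using VM 101 from the earlier examples; re-specify any other disk options you rely on when rewriting the drive line:

cr0x@server:~$ qm set 101 --scsi0 local-zfs:vm-101-disk-0,discard=on    # the virtual disk must advertise discard (virtio-scsi handles this well)
cr0x@server:~$ zpool get autotrim rpool                                  # TRIM toward the physical SSDs is a separate, pool-level decision
# Inside the guest, after a reboot: fstrim -av, or enable the distribution's fstrim.timer.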

3) ESXi VM freezes during snapshot delete

Symptom: snapshot removal stalls, datastore latency spikes, VM becomes unresponsive.

Root cause: consolidation merge is heavy; datastore too full; underlying array is saturated; snapshot chain is deep.

Fix: ensure datastore free space, do consolidation off-peak, reduce competing I/O (pause backups), and avoid long-lived snapshots as policy.

4) VMFS datastore reports plenty of space, then suddenly hits “out of space” behavior

Symptom: writes fail, VMs pause, array alarms, but datastore didn’t look “full” last week.

Root cause: thin provisioning across layers; array real capacity exhausted while VMFS still thought it had room.

Fix: monitor actual array capacity, set conservative overcommit limits, and enforce alarms on both vCenter and storage side.

5) ZFS sync writes are painfully slow after “tuning”

Symptom: database commits crawl; latency spikes; someone says “add a SLOG” and it doesn’t help.

Root cause: workload is sync-heavy and pool can’t sustain it; SLOG device is slow or lacks power-loss protection; or sync was forced unintentionally.

Fix: validate sync behavior, use proper enterprise SSD/NVMe with PLP for SLOG, and fix the underlying pool performance (more vdevs, better media).
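
What the fix usually looks like when a SLOG genuinely is the answer; the device paths are placeholders, and both devices should be enterprise parts with power-loss protection:

cr0x@server:~$ zpool add rpool log mirror /dev/disk/by-id/nvme-PLP_DEVICE_A /dev/disk/by-id/nvme-PLP_DEVICE_B
cr0x@server:~$ zpool iostat -v rpool 1                                          # the new log vdev should absorb the sync bursts
cr0x@server:~$ zfs get -o name,property,value sync rpool/data/vm-101-disk-0     # and confirm nobody "fixed" latency with sync=disabled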

6) ESXi latency spikes only on one host

Symptom: one host sees high latency while others are fine on the same datastore.

Root cause: pathing misconfiguration, queue depth differences, firmware mismatch, or a single host saturating its HBA.

Fix: compare multipath config across hosts, check dead paths, align firmware/driver versions, and identify noisy neighbors with esxtop.
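
A quick way to compare policy across hosts, as a sketch; the device identifier is a placeholder, and the right policy is whatever your array vendor documents, not reflexively round robin:

cr0x@server:~$ esxcli storage nmp device list | grep -E 'Device Display Name|Path Selection Policy:'
cr0x@server:~$ esxcli storage nmp device set --device <naa.id-from-the-list-above> --psp VMW_PSP_RR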

Checklists / step-by-step plan

Planning checklist: choosing between ZFS local storage and VMFS shared storage

  1. Define failure domains: do you want a host failure to take storage with it (local), or do you need shared storage continuity?
  2. Define recovery objectives: RPO/RTO for VM restore, and whether you can restore without vendor involvement.
  3. Inventory write patterns: databases, logging, message queues—these punish latency spikes.
  4. Decide snapshot policy: operational snapshots vs backups; retention; replication target; tested restore procedure.
  5. Capacity plan with headroom: include snapshot retention, rebuild/resilver overhead, and “bad day” growth.
  6. Monitoring plan: tail latency, free space trend, snapshot count/age, scrub/resilver state, path health.

ZFS on Proxmox: build and run it like you mean it

  1. Pick the right topology: mirrors for IOPS/latency; RAIDZ for capacity; don’t pretend RAIDZ is mirror-fast.
  2. Set ashift correctly at pool creation: assume 4K sectors (or higher for some SSDs). Wrong ashift is a lifetime tax.
  3. Decide ZVOL vs dataset files: default to ZVOL for VM disks unless you specifically need qcow2 features.
  4. Set sane defaults: compression=lz4; atime=off for VM datasets; consistent volblocksize for new VM disks based on workload class.
  5. Cap ARC if needed: hypervisors need memory for guests first, ARC second.
  6. Schedule scrubs: monthly is common; more frequent for flakier hardware. Monitor duration changes.
  7. Snapshot and replicate: local snapshots for quick rollback, replicated snapshots for actual recovery.
  8. Test restores: “zfs send succeeded” is not the same as “VM booted.”

VMFS on ESXi: operate it safely under real load

  1. Get multipathing right: policy consistency across hosts, dead path alerting, and switch/array hygiene.
  2. Monitor latency at the right layer: know DAVG/KAVG/GAVG patterns and what they implicate.
  3. Set snapshot policy: operational snapshots only; enforce age limits; automate reporting.
  4. Capacity discipline: keep datastore headroom; don’t bet the company on thin-on-thin illusions.
  5. Rehearse APD/PDL scenarios: decide when to failover, when to evacuate, and when to stop touching it.
  6. Backups that understand VMware: application-consistent where needed, and restore tests that include networking and boot verification.

FAQ

1) Are ZFS snapshots “better” than ESXi snapshots?

They’re better as a storage primitive: consistent semantics, cheap creation, and replication via send/receive. ESXi snapshots are operational tools that become liabilities when kept too long.
If you want restore points, treat ESXi snapshots as transient and use backups for durability.

2) Does ZFS replace RAID controllers?

In many designs, yes: use HBAs/JBOD mode and let ZFS manage redundancy and integrity. But you must design vdevs properly and monitor them.
Hardware RAID can still be appropriate in certain constrained environments, but stacking RAID under ZFS often reduces observability and can complicate recovery.

3) Should I use qcow2 on ZFS in Proxmox?

Only if you need qcow2 features (e.g., certain sparse behaviors) and you’ve tested performance. For most production VM disks on ZFS, ZVOLs are simpler and often faster.
CoW-on-CoW can be fine, but it’s also an easy way to create latency you’ll misdiagnose for weeks.

4) Is SLOG mandatory for ZFS VM storage?

No. SLOG matters for synchronous writes. Many VM workloads are largely asynchronous. If you have databases or NFS with sync requirements, a good SLOG can help.
A bad SLOG device can hurt—or at least fail in exciting ways.

5) How full is “too full” for ZFS pools?

There isn’t one number, but performance and resilver risk worsen as you fill the pool. Past ~80% you should start acting like you’re in the danger zone,
especially with heavy snapshots. Past ~90% you’re negotiating with entropy.

6) How full is “too full” for VMFS datastores?

If you rely on snapshots or have unpredictable growth, “comfortably full” is a myth. Keep headroom for consolidation events and unexpected delta growth.
The exact threshold depends on VM size and change rate, but near-full datastores make snapshot operations risky.

7) Can ZFS give me shared storage like VMFS?

Not directly in the same way. ZFS is typically local to a host. You can build shared storage on top (NFS/iSCSI exported from ZFS),
or use clustered filesystems and replication strategies, but that’s a different architecture with different failure modes.

8) What’s the cleanest way to replicate Proxmox ZFS VM storage?

ZFS send/receive replication of snapshots is the clean primitive. Proxmox has tooling around it, but the underlying truth remains:
you’re shipping snapshot deltas. Validate bandwidth, retention alignment, and the ability to boot restored VMs.
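
A sketch of that truth in command form, reusing snapshot names from earlier; the target host and pool are placeholders, and Proxmox's built-in replication (pvesr) wraps the same mechanism:

cr0x@server:~$ zfs send -R rpool/data/vm-101-disk-0@weekly-2025-11-24 | ssh backup-host zfs receive -F backup/vm-101-disk-0
cr0x@server:~$ zfs send -I @weekly-2025-11-24 rpool/data/vm-101-disk-0@auto-2025-12-27_0500 | ssh backup-host zfs receive backup/vm-101-disk-0

The first line seeds the full stream; the second ships every snapshot between the two points. Schedule the incrementals, and test-boot the result.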

9) Why do ESXi snapshot consolidations sometimes take forever?

Because consolidation is a large copy/merge under load, and it competes with production I/O. If the delta is huge, the datastore is busy,
or the array is already close to saturation, the merge crawls. The “fix” is prevention: keep snapshots short-lived.

10) If my array has snapshots, do I still care about VMFS snapshots?

Yes. Array snapshots can be great for crash-consistent rollback at the LUN level, but you must understand coordination with VMware,
quiescing, and restore workflows. VMFS snapshots are not a substitute; they’re a different tool with different blast radius.

Next steps

If you’re running Proxmox with ZFS: audit snapshot retention, graph usedbysnapshots, cap ARC if hosts are memory-tight, and verify scrubs complete on schedule.
Then do one restore test that boots a VM and validates application health. Not “the dataset exists”—the VM actually works.

If you’re running ESXi with VMFS: inventory snapshots across the fleet, enforce a maximum age, check datastore headroom, and validate multipath health.
Then rehearse a consolidation under controlled conditions so the first time isn’t during an incident.

If you’re choosing between them: stop treating it as a religious war. Map your failure domains, operational maturity, and recovery requirements.
Pick the stack you can keep boring.
