ZFS Proxmox: VM Storage Defaults You Should Change Immediately

Most Proxmox+ZFS performance “mysteries” aren’t mysteries. They’re defaults. Defaults that were chosen to be safe, generic, and broadly compatible—then quietly deployed into your very specific storage reality: consumer SSDs with shaky power-loss protection, RAIDZ pools doing random I/O, guests issuing sync writes, and a hypervisor that loves to cache on top of caching.

If your VMs feel fast until they don’t—until a database checkpoint, a Windows update, or a backup window hits and everything turns to syrup—this is for you. We’ll change the defaults that actually matter, explain why, and show you how to prove the results with commands that don’t lie.

The mindset: stop accepting “works on my lab” storage

ZFS is opinionated. Proxmox is opinionated. Your workload is also opinionated—especially if it includes databases, mail servers, CI runners, or anything with lots of small sync writes. When those three opinions disagree, you get latency spikes that look like “random slowness,” and you burn days blaming the wrong layer.

The central trap is assuming that virtualization “smooths things out.” It does the opposite. Virtualization takes many I/O patterns—sequential, random, bursty, sync-heavy, metadata-heavy—and multiplexes them into one shared pool. That pool then has to make promises: durability promises (sync semantics), allocation promises (copy-on-write), caching promises (ARC/L2ARC), and fairness promises (scheduler). Defaults are a starting point, not a strategy.

Here’s the framing I use in production:

  • Know which writes must be safe. If the guest asked for sync, assume it matters unless you’re deliberately trading durability for speed.
  • Match block sizes to reality. ZFS can’t read what it didn’t write. If you pick a zvol block size that fights your workload, you will pay forever.
  • Minimize “double work.” Double caching, double checksumming, double copy-on-write layers: each is defensible alone; combined, they’re a tax.
  • Measure at the right layer. iostat inside the guest can look fine while the host is on fire. Conversely, ZFS can look fine while the guest is doing pathological fsync loops.

Interesting facts and historical context (short, useful)

  1. ZFS popularized end-to-end checksumming in mainstream deployments, turning silent corruption from “mythical” into “measurable.” It changed how ops teams think about storage trust.
  2. Copy-on-write (CoW) is why snapshots are cheap, and also why random writes can be expensive—especially on RAIDZ under virtualization.
  3. Compression on ZFS is often a performance feature, not just a space feature, because fewer bytes moved can beat the CPU cost (particularly with lz4).
  4. “Sync writes are slow” used to be shrugged off as a database problem; ZFS made it a system design question by honoring sync semantics very strictly.
  5. SLOG devices became a cottage industry because people learned the hard way that “an SSD” is not the same as “a safe, low-latency log device.”
  6. Proxmox moved many users from LVM-thin to ZFS because snapshots and replication are operationally addictive—until you discover what defaults do to latency.
  7. The industry spent years rediscovering write amplification: small random writes on CoW filesystems can explode into much larger physical work, especially with parity RAID.
  8. ashift mistakes are forever (practically speaking): if you build a pool with the wrong sector size assumption, performance can be permanently kneecapped.

The Proxmox+ZFS defaults that deserve immediate skepticism

1) Sync behavior: you don’t get to ignore it anymore

Proxmox defaults won’t save you from sync write latency. Your guests can issue sync writes (fsync, O_DSYNC, barriers/flushes), and ZFS will treat those writes as “must be stable on power loss.” If you have no SLOG and your pool is HDDs or saturated SSDs, sync-heavy workloads will feel like a haunted house.

But the worst part is the ambiguity: some workloads are sync-heavy only during specific phases—database commits, journal flushes, VM backups triggering filesystem behavior. So you get intermittent misery and a false sense of “it’s mostly fine.”

Decision point: if you care about durability, keep sync=standard and build the pool to handle it. If you don’t care (lab, ephemeral CI), you can choose sync=disabled deliberately—never accidentally.
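
Both paths are one property away. A minimal sketch, assuming the VM dataset is rpool/data/vmdata and that "scratch" is a hypothetical zvol holding disposable CI data; sync=standard is already the default, so setting it explicitly just records the decision:

cr0x@server:~$ sudo zfs set sync=standard rpool/data/vmdata
cr0x@server:~$ sudo zfs set sync=disabled rpool/data/vmdata/scratch
cr0x@server:~$ zfs get -r -o name,property,value,source sync rpool/data/vmdata

The SOURCE column shows whether each value is inherited, defaulted, or set locally, which is exactly the audit trail you want when someone asks "did we mean to do this?"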

2) volblocksize: the setting that quietly decides your I/O economics

ZVOLs have volblocksize. Datasets have recordsize. People mix these up, then wonder why their tuning didn’t move the needle.

volblocksize is set at zvol creation and cannot be changed once data has been written to the volume; fixing a bad choice means creating a new zvol with the right block size and migrating onto it. Proxmox often creates zvols with a default that may not match your workload.

Typical guidance (not gospel):

  • 16K: often good for general VM disks, mixed workloads, and databases that do lots of small random I/O.
  • 8K: can help for particularly sync-heavy or log-heavy patterns, but can increase metadata overhead.
  • 64K/128K: can be great for large sequential workloads (media, backups), often wrong for OS disks.

On RAIDZ, smaller blocks can be especially punishing due to parity and read-modify-write behavior. This is where “ZFS is slow” rumors are born.
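
A sketch of how you steer this in Proxmox, assuming a zfspool storage ID of local-zfs (the zfspool storage type exposes a blocksize option) and a hypothetical test volume; existing zvols keep whatever they were created with:

cr0x@server:~$ sudo pvesm set local-zfs --blocksize 16k
cr0x@server:~$ sudo zfs create -V 32G -o volblocksize=16K rpool/data/vmdata/vm-103-disk-0
cr0x@server:~$ zfs get volblocksize rpool/data/vmdata/vm-103-disk-0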

3) Compression: leaving performance on the table

If you’re not using compression=lz4 on VM storage, you’re basically choosing to move more bytes than necessary. For many VM workloads (OS files, logs, text, package repositories), compression reduces physical I/O and improves latency under pressure.

Compression can backfire on already CPU-starved hosts or on incompressible workloads (pre-compressed media, encrypted volumes). But for most Proxmox hypervisors, lz4 is the default you should want.

4) atime: death by a thousand reads

atime=on means reads become writes because access times are updated. On VM storage, that’s usually wasted churn. If you like your SSD endurance and your latency, set atime=off for VM datasets.

5) Primary cache: double caching is not a personality trait

ZFS ARC caches aggressively. Guests also cache. If you let ZFS cache VM data that the guest will cache again, you burn memory on both sides and get less effective caching overall.

Common approach: for zvol-based VM disks, consider primarycache=metadata (cache metadata, not data) to reduce double caching. This is situational: if your guests are tiny and your host has huge RAM, caching data can still help. But you should make that decision intentionally.
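
A minimal sketch, assuming zvol-backed disks live under rpool/data/vmdata; the property is inheritable, so setting it on the parent dataset covers future disks, and it is trivially reversible with primarycache=all:

cr0x@server:~$ sudo zfs set primarycache=metadata rpool/data/vmdata
cr0x@server:~$ zfs get -o name,property,value,source primarycache rpool/data/vmdata/vm-101-disk-0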

6) Thin provisioning: “free space” is a social construct

Thin zvols feel like efficiency until the pool hits a cliff. Once a ZFS pool gets too full, performance can collapse due to allocation pressure and fragmentation. “But we still have 15% free!” is how people talk right before a bad day.

Operational rule: keep meaningful free space. For many VM-heavy pools, treat 20–30% free as normal. If that sounds wasteful, you’re about to learn what emergency storage procurement feels like.
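
This is easy to watch mechanically. A tiny check you could wire into existing monitoring, assuming the pool is rpool and 80% is your action threshold:

cr0x@server:~$ CAP=$(zpool list -H -o capacity rpool | tr -d '%')
cr0x@server:~$ [ "$CAP" -ge 80 ] && echo "rpool is ${CAP}% full - start capacity work now"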

Joke #1: RAIDZ with a nearly full pool is like a meeting that “should have been an email.” It keeps going, gets slower, and nobody can leave.

ZVOL vs dataset file images (raw/qcow2): pick the least-worst for your use

ZVOLs (block devices): simple path, strong semantics

ZVOLs are block devices backed by ZFS. Proxmox likes them because snapshots and replication integrate cleanly, and you avoid some filesystem-on-filesystem weirdness.

Tradeoffs:

  • volblocksize matters a lot, and you don’t get infinite do-overs.
  • Discard/TRIM behavior can be tricky depending on versions and settings. You need to verify space reclamation actually works.
  • Cache semantics (primarycache) become important to avoid fighting the guest.

Dataset files (raw/qcow2): flexibility with an extra layer

Storing VM disks as files on a ZFS dataset can be perfectly fine. The main risk is stacking CoW layers (qcow2 is CoW; ZFS is CoW). That can amplify fragmentation and metadata overhead, especially under random-write pressure.

Opinionated guidance:

  • Prefer raw over qcow2 on ZFS unless you have a specific qcow2 feature you truly need.
  • Use qcow2 cautiously for niche cases (sparse + internal snapshots), but understand you are paying for that flexibility.

How to decide quickly

  • If you run databases or latency-sensitive services: zvols with sane volblocksize, and plan for sync writes.
  • If you run mostly sequential bulk data: dataset files can be fine, tune recordsize, and keep it simple.
  • If you run lots of small VMs and value operational ease: zvols are often easier to reason about in Proxmox tooling.

The changes I make on day one (and why)

Change 1: Enable lz4 compression on VM storage

Do it unless you have evidence not to. lz4 is low-latency, and VM disks are commonly compressible. The win is often most visible under load when the pool is busy.

Change 2: Turn off atime for VM datasets

This is a classic “small” setting that prevents pointless churn. VM storage doesn’t need access-time writes.

Change 3: Decide your sync policy explicitly

Most people accidentally run sync=standard on hardware that can’t sustain it, then “fix” it by disabling sync globally, and then forget they did that. Don’t be that person.

Pick one:

  • Durable path: sync=standard, add a proper SLOG if needed, and verify latency.
  • Speed-at-all-costs path: sync=disabled on specific datasets/zvols where data loss is acceptable.

Change 4: Set primarycache thoughtfully for zvol-heavy VM storage

If the host is memory-constrained, ARC competing with guests is a knife fight. For many zvol-backed VM disks, primarycache=metadata is a sane default. If you have huge RAM and mostly read-heavy guests, caching data may help. Measure, don’t vibe.

Change 5: Stop overfilling pools

Write this down: a ZFS pool at 85–90% full is not “fine.” It’s a latency incident queued up in advance.

Change 6: Use the right virtualization disk options (and verify)

Proxmox disk settings like cache, discard, iothread, and controller model matter. But “matter” means “show up in measurements,” not “sound cool in a forum post.” Your baseline should be stable latency and predictable failure modes.

Joke #2: Turning on every performance knob at once is like chugging five energy drinks to “sleep better.” You’ll learn something, but it won’t be what you wanted.

Practical tasks: commands, outputs, and the decision you make

These are real checks I run on a Proxmox host when VM storage is slow, inconsistent, or suspiciously “fine until it isn’t.” Each task includes a command, example output, what the output means, and the decision you make next.

Task 1: Identify pools, health, and obvious red flags

cr0x@server:~$ zpool status
  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool.
action: Upgrade the pool to enable all features.
  scan: scrub repaired 0B in 00:12:44 with 0 errors on Sun Dec 22 03:12:26 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        rpool                       ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            ata-SAMSUNG_SSD_1TB_A   ONLINE       0     0     0
            ata-SAMSUNG_SSD_1TB_B   ONLINE       0     0     0

errors: No known data errors

What it means: Pool is online, no errors, scrub clean. Feature warning is not a performance issue by itself.

Decision: If you see degraded vdevs, checksum errors, or constant resilvers, stop tuning and fix hardware first. Performance tuning on a failing pool is how you get creative outage reports.

Task 2: Check pool fullness (the silent performance killer)

cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint -r rpool
NAME                 USED  AVAIL  REFER  MOUNTPOINT
rpool                720G   180G   104K  /rpool
rpool/data           680G   180G   104K  /rpool/data
rpool/data/vmdata    640G   180G    96K  /rpool/data/vmdata

What it means: ~80% used. You’re entering the zone where fragmentation and allocation overhead start to show up.

Decision: If you’re above ~80–85% used on a VM-heavy pool, plan capacity work now: add vdevs, migrate cold VMs, or split workloads. Do not “optimize” your way out of physics.

Task 3: Inspect critical dataset properties (compression, atime, sync)

cr0x@server:~$ zfs get -o name,property,value -s local,default compression,atime,sync,primarycache rpool/data/vmdata
NAME               PROPERTY      VALUE
rpool/data/vmdata  compression   off
rpool/data/vmdata  atime         on
rpool/data/vmdata  sync          standard
rpool/data/vmdata  primarycache  all

What it means: Compression off and atime on: you’re paying extra I/O. Sync standard: durable semantics are in play. primarycache all: ARC may be double-caching guest data.

Decision: Enable compression=lz4 and atime=off on VM datasets unless there’s a reason not to. Evaluate primarycache based on RAM pressure and guest caching.

Task 4: Apply safe, reversible dataset changes (lz4 + atime off)

cr0x@server:~$ sudo zfs set compression=lz4 atime=off rpool/data/vmdata
cr0x@server:~$ zfs get -o name,property,value compression,atime rpool/data/vmdata
NAME               PROPERTY     VALUE
rpool/data/vmdata  compression  lz4
rpool/data/vmdata  atime        off

What it means: Future writes will be compressed; access-time updates stop generating noise.

Decision: This is typically a net win. If CPU becomes a bottleneck (rare with lz4), you’ll see it in host CPU metrics under I/O load.

Task 5: List zvols and check volblocksize

cr0x@server:~$ zfs list -t volume -o name,volblocksize,used,refer,logicalused -r rpool/data/vmdata
NAME                         VOLBLOCKSIZE  USED  REFER  LOGICALUSED
rpool/data/vmdata/vm-101-disk-0  128K       64G   64G       120G
rpool/data/vmdata/vm-102-disk-0  8K         20G   20G        22G

What it means: You have inconsistent volblocksize. A 128K OS disk often performs poorly for random I/O. LOGICALUSED > REFER indicates compression/snapshots or thin provisioning effects.

Decision: For general-purpose VM disks, standardize (often 16K) unless you have a measured reason not to. Existing disks keep their volblocksize for life, so plan a migration to freshly created zvols.

Task 6: Check whether you’re abusing sync semantics (latency spikes)

cr0x@server:~$ sudo zpool iostat -v rpool 1 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool        720G   180G    150    980  12.4M  5.1M
  mirror-0   720G   180G    150    980  12.4M  5.1M
    ata-SAMSUNG_SSD_1TB_A     -      -     75    490  6.2M  2.6M
    ata-SAMSUNG_SSD_1TB_B     -      -     75    490  6.2M  2.6M

What it means: Moderate bandwidth but high write operations: classic small-write workload. If latency is bad during this, sync writes may be forcing waits.

Decision: Next, inspect ZFS latency and sync behavior (tasks below). If this is a database host, assume sync matters until proven otherwise.

Task 7: Check ZFS latency directly

cr0x@server:~$ sudo zpool iostat -rlv rpool 1 3
                              read                              write
pool        r/s   w/s  rMB/s  wMB/s  rlat  wlat  cholat  dlat
----------  ----  ---- -----  -----  ----  ----  ------  ----
rpool        120   950  10.1   4.9   2ms   38ms   42ms   1ms

What it means: Write latency ~38ms, higher when busy. That’s “VMs feel laggy” territory, especially for metadata and journaling.

Decision: If wlat regularly jumps into tens/hundreds of ms, you need to address sync path, pool saturation, or device behavior (firmware/PLP). Tuning guests won’t fix host latency.
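
To separate sync pain from general saturation, a short fio run against the VM dataset gives you fsync-backed write latency you can compare before and after any change. A sketch, assuming fio is installed (apt install fio) and the dataset is mounted at /rpool/data/vmdata; remove the test file afterwards:

cr0x@server:~$ sudo fio --name=synctest --directory=/rpool/data/vmdata --rw=randwrite \
    --bs=4k --size=1G --iodepth=1 --numjobs=1 --fsync=1 --runtime=30 --time_based --group_reporting
cr0x@server:~$ sudo rm /rpool/data/vmdata/synctest.0.0

Watch the write and fsync latency lines: if the high percentiles sit in tens of milliseconds, your guests are feeling exactly that.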

Task 8: See if you even have a separate log (SLOG)

cr0x@server:~$ zpool status -v rpool
  pool: rpool
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        rpool                       ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            ata-SAMSUNG_SSD_1TB_A   ONLINE       0     0     0
            ata-SAMSUNG_SSD_1TB_B   ONLINE       0     0     0

errors: No known data errors

What it means: No logs section. Sync writes are landing on the main vdevs.

Decision: If you have sync-heavy workloads and latency pain, evaluate a proper SLOG (mirrored, power-loss protected). If you don’t have PLP, you’re just buying new ways to lose data quickly.
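
Adding one is a single command once you have suitable devices. A sketch with hypothetical device IDs; mirror it, and insist on power-loss protection, because a lying SLOG is worse than none:

cr0x@server:~$ sudo zpool add rpool log mirror \
    /dev/disk/by-id/nvme-EXAMPLE_PLP_SSD_A /dev/disk/by-id/nvme-EXAMPLE_PLP_SSD_B
cr0x@server:~$ zpool status rpool

After this, zpool status should show a separate "logs" section, and sync-heavy write latency is the number to re-measure.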

Task 9: Check ARC pressure and memory reality

cr0x@server:~$ grep -E 'MemTotal|MemFree|MemAvailable' /proc/meminfo
MemTotal:       131900192 kB
MemFree:         4021184 kB
MemAvailable:   18944128 kB
cr0x@server:~$ cat /proc/spl/kstat/zfs/arcstats | egrep '^(size|c |c_min|c_max|hits|misses) '
size                            4    82463313920
c                               4    85899345920
c_min                           4    4294967296
c_max                           4    85899345920
hits                            4    2840093011
misses                          4    390129220

What it means: ARC is ~82G, can grow to ~85G. MemAvailable ~18G suggests the host isn’t starving, but guests might be.

Decision: If the host is swapping or MemAvailable is consistently tiny, reduce ARC (or add RAM). If guests are memory-starved, consider primarycache=metadata for zvol VM disks to stop the host from “helping” too much.
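
Capping ARC is one module parameter. A sketch that limits ARC to 32 GiB, assuming that number fits your guest memory plan; the runtime write takes effect immediately, the modprobe file makes it persistent (merge it with any existing options zfs line), and on ZFS-root Proxmox hosts you refresh the initramfs so the setting applies at boot:

cr0x@server:~$ echo 34359738368 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
cr0x@server:~$ echo "options zfs zfs_arc_max=34359738368" | sudo tee /etc/modprobe.d/zfs.conf
cr0x@server:~$ sudo update-initramfs -u -k all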

Task 10: Verify discard/TRIM is enabled end-to-end

cr0x@server:~$ lsblk -D
NAME        DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda                0      512B       2G         0
sdb                0      512B       2G         0
cr0x@server:~$ sudo zpool get autotrim rpool
NAME   PROPERTY  VALUE     SOURCE
rpool  autotrim  off       default

What it means: Drives support discard, but pool autotrim is off.

Decision: Consider enabling autotrim on SSD pools where space reclamation matters and your workload won’t be hurt by extra trim activity. Then verify guest discard is enabled in Proxmox for the VM disks.

Task 11: Enable autotrim (if appropriate) and observe

cr0x@server:~$ sudo zpool set autotrim=on rpool
cr0x@server:~$ sudo zpool get autotrim rpool
NAME   PROPERTY  VALUE  SOURCE
rpool  autotrim  on     local

What it means: The pool will pass trims down. This can help long-term SSD performance and space behavior.

Decision: If enabling trim correlates with latency spikes on some SSDs, you may disable it and instead run scheduled trims during quiet windows.
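
Scheduled trims are a one-liner plus whatever timer you already trust. A sketch using cron, assuming Sunday 03:30 is a quiet window for this pool:

cr0x@server:~$ sudo zpool trim rpool
cr0x@server:~$ zpool status -t rpool
cr0x@server:~$ echo '30 3 * * 0 root /usr/sbin/zpool trim rpool' | sudo tee /etc/cron.d/zpool-trim

zpool status -t shows per-device trim progress, so you can confirm the run finishes well before the morning load arrives.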

Task 12: Check fragmentation (especially on long-lived VM pools)

cr0x@server:~$ zpool list -o name,size,alloc,free,frag,cap,dedupratio
NAME   SIZE  ALLOC   FREE  FRAG  CAP  DEDUP
rpool  900G   720G   180G   42%  80%  1.00x

What it means: 42% fragmentation is not catastrophic, but it can contribute to random I/O latency on busy pools, especially when combined with high fullness.

Decision: If frag is high and the pool is also full, stop. Plan a migration or expansion. Fragmentation “tuning” is mostly capacity management wearing a disguise.

Task 13: Check whether dedup is accidentally enabled

cr0x@server:~$ zfs get -o name,property,value dedup -r rpool/data/vmdata
NAME               PROPERTY  VALUE
rpool/data/vmdata  dedup     off

What it means: Dedup is off. Good.

Decision: If dedup is on for VM storage and you didn’t design for it, turn it off for new writes and plan remediation. Dedup without sizing and testing is how you turn RAM into a performance bottleneck.

Task 14: Observe per-vdev device latency with iostat

cr0x@server:~$ iostat -x 1 3
Linux 6.8.12-pve (server) 	12/25/2025 	_x86_64_	(32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.12    0.00    2.91   11.44    0.00   77.53

Device            r/s     w/s   rkB/s   wkB/s  avgrq-sz avgqu-sz   await  r_await  w_await  svctm  %util
sda             85.0   490.0  6200.0  2600.0     27.4     3.10   38.2     3.1     44.1    0.8   46.2
sdb             85.0   490.0  6200.0  2600.0     27.4     3.06   37.9     3.0     43.7    0.8   46.0

What it means: Host iowait is non-trivial (11%). Writes have ~44ms await: this matches ZFS wlat. Devices are not at 100% util, but latency is still high—common when the device is doing internal garbage collection or handling flushes.

Decision: If await is high while %util isn’t pinned, suspect flush behavior, firmware, or sync pressure rather than pure throughput saturation.

Task 15: Check Proxmox VM disk config for risky cache modes

cr0x@server:~$ qm config 101 | egrep 'scsi|virtio|ide|sata|boot'
boot: order=scsi0;net0
scsi0: rpool/data/vmdata/vm-101-disk-0,discard=on,iothread=1,cache=writeback,ssd=1

What it means: cache=writeback can be fast, but it changes failure semantics. With writeback, the host may acknowledge writes before they’re safe on disk.

Decision: For databases and anything that cares about correctness, prefer safer cache modes (and rely on ZFS + proper hardware). If you knowingly choose writeback, document it like it’s a loaded weapon.
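
Switching to the safer mode is one qm call: repeat the volume reference exactly as qm config prints it and override the options. A sketch based on the config above; the new cache mode applies only after the VM is fully stopped and started, not on a guest-internal reboot:

cr0x@server:~$ sudo qm set 101 --scsi0 rpool/data/vmdata/vm-101-disk-0,cache=none,discard=on,iothread=1,ssd=1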

Task 16: Confirm pool ashift (because you can’t fix it later)

cr0x@server:~$ sudo zdb -C rpool | egrep 'ashift|vdev_tree' -n | head
54:        vdev_tree:
86:            ashift: 12

What it means: ashift=12 (4K sectors). That’s usually correct for modern drives.

Decision: If you ever see ashift=9 on drives that are really 4K, expect performance pain. Fixing it usually means rebuilding the pool. Plan accordingly.

Task 17: Check snapshot bloat (because “free space” might be trapped)

cr0x@server:~$ zfs list -o name,used,usedbysnapshots,usedbydataset,logicalused -r rpool/data/vmdata | head
NAME               USED  USEDSNAP  USEDDS  LOGICALUSED
rpool/data/vmdata  640G     210G    96K       1.1T

What it means: 210G used by snapshots. That’s not “bad”; it’s a policy decision. But it impacts pool fullness and performance.

Decision: If snapshots are eating space and pushing you toward the high-capacity danger zone, tighten retention or move backups elsewhere. “We keep everything forever” is not a storage strategy.
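
To see where the trapped space actually lives, sort snapshots by size and prune deliberately; the dry-run flag shows what a destroy would reclaim before you commit. A sketch with a hypothetical snapshot name:

cr0x@server:~$ zfs list -t snapshot -o name,used,creation -s used -r rpool/data/vmdata | tail -15
cr0x@server:~$ sudo zfs destroy -nv rpool/data/vmdata/vm-101-disk-0@daily-2025-10-01
cr0x@server:~$ sudo zfs destroy rpool/data/vmdata/vm-101-disk-0@daily-2025-10-01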

Task 18: Verify compression is actually working (ratio check)

cr0x@server:~$ zfs get -o name,property,value compressratio -r rpool/data/vmdata | head
NAME               PROPERTY      VALUE
rpool/data/vmdata  compressratio 1.38x

What it means: 1.38x is a meaningful win. You are reducing physical I/O for many workloads.

Decision: If compressratio is ~1.00x for encrypted/incompressible data, that’s expected. Still, lz4 overhead is usually low enough to keep enabled.

Paraphrased idea — John Allspaw: Reliability comes from designing systems to fail safely, not from assuming they won’t fail.

Fast diagnosis playbook (first/second/third checks)

This is the quick triage flow when “VM storage is slow” is the ticket and you have 15 minutes before the next meeting you’ll ignore anyway.

First: is the host storage layer sick or just busy?

  • Run zpool status: look for degraded devices, checksum errors, resilvering, or scrubs running during peak.
  • Run zpool iostat -rlv 1 3: check write latency (wlat). If wlat is high, the storage layer is the problem.
  • Run iostat -x 1 3: confirm device await and iowait. High await confirms the pain is real at the device level.

Second: are you drowning in sync writes?

  • Check dataset/zvol sync property (and don’t assume). If sync=standard, sync writes will wait for stable media.
  • Confirm whether you have a SLOG. If you don’t, sync latency is paid on the main vdevs.
  • Look for workloads that trigger sync storms: databases, mail, NFS inside guests, journaling filesystems under stress.

Third: are you capacity- and fragmentation-bound?

  • Check pool cap and frag. High capacity plus high frag is a latency multiplier.
  • Check snapshot usage. Snapshots can trap space and accelerate “pool too full” behavior.
  • Check for thin provisioning surprises: guests think they have space, the pool does not.

If those three checks are clean

Then you look at VM-level configuration: cache mode, disk bus, iothread, queue depth, and guest filesystem behavior. But don’t do it first. That’s how you end up tuning the dashboard while the engine is missing a cylinder.

Common mistakes: symptom → root cause → fix

1) Symptom: “Everything freezes for 1–5 seconds randomly”

Root cause: Sync write latency spikes (flushes, fsync storms) on a pool without a proper SLOG, often combined with consumer SSDs that stall under flush pressure.

Fix: Keep sync=standard for important data, add a mirrored, PLP-capable SLOG if needed, and validate with zpool iostat -rlv. If data is disposable, set sync=disabled on that specific dataset/zvol and document the risk.

2) Symptom: “VMs benchmark fast, but real apps are slow”

Root cause: Benchmarks hit sequential paths; apps hit random sync-heavy patterns and metadata. Also common: volblocksize mismatch for zvols.

Fix: Match volblocksize to workload (often 16K for general VM disks), avoid qcow2-on-ZFS when you don’t need it, and measure latency not just throughput.

3) Symptom: “Pool has space, but performance is terrible”

Root cause: Pool is too full (80–95%), snapshots trap space, allocator is under pressure; fragmentation rises.

Fix: Reduce snapshot retention, migrate data, expand pool. Treat free space as a performance reserve, not a suggestion.

4) Symptom: “After enabling autotrim, latency got worse”

Root cause: Some SSDs handle continuous trims poorly; trim operations compete with foreground writes.

Fix: Disable autotrim and run scheduled trims during quiet windows. Consider enterprise SSDs with predictable trim behavior.

5) Symptom: “Host has lots of RAM, but guests still swap under load”

Root cause: ARC grows aggressively and steals memory guests need, or you’re double caching VM data on host and guest.

Fix: Consider primarycache=metadata for VM zvols; cap ARC if necessary; validate with arcstats and guest memory metrics.

6) Symptom: “Backups make production unusable”

Root cause: Snapshot/backup I/O collides with production random write I/O; backups can induce additional read amplification and metadata churn.

Fix: Schedule backups, throttle if possible, isolate backup storage, and avoid putting backup targets on the same stressed pool.

7) Symptom: “We added a fast SSD cache and nothing improved”

Root cause: L2ARC doesn’t help write latency; it helps read caching, and only if your working set and access patterns fit. Also, cache devices can steal RAM and add overhead.

Fix: Solve write latency at the vdev/sync layer. Only add L2ARC after confirming reads are the bottleneck and ARC hit rate is insufficient.

Three corporate-world mini-stories from the storage trenches

Mini-story #1: The incident caused by a wrong assumption

They migrated a small fleet of application VMs onto Proxmox with ZFS mirrors on “good SSDs.” It was a sensible, budget-approved design: two drives, mirrored, plenty of IOPS on paper. The apps were mostly stateless web services with a small database VM and a message broker.

The wrong assumption was subtle: “If it’s mirrored SSD, sync writes are basically free.” Nobody said it out loud, which is how assumptions survive. They also assumed the database’s durability settings were conservative but not aggressive. The platform went live, looked great, and then started producing short, sharp latency spikes during peak traffic.

The on-call team chased ghosts: network jitter, noisy neighbors, CPU steal, even “maybe Linux has a scheduler regression.” Meanwhile, the database VM was doing perfectly reasonable things—commits with fsync—and ZFS was doing perfectly reasonable things—waiting for stable storage.

Once they graphed host write latency alongside application timeouts, the story wrote itself. The SSDs were consumer models with inconsistent flush latency. When the database hit a burst of sync writes, the devices stalled. The mirror didn’t save them; it just provided two devices that could stall in sympathy.

The fix wasn’t exotic. They deployed a mirrored SLOG on power-loss-protected devices and stopped pretending flush latency didn’t matter. The spikes disappeared. The team also documented which datasets could tolerate sync=disabled (few) and which absolutely could not (the database and broker). The key lesson was operational: never assume your storage honors durability quickly just because it’s “SSD.”

Mini-story #2: The optimization that backfired

Another org wanted “maximum performance” and got aggressive with tuning. They turned off sync globally. They set VM disks to writeback cache. They enabled every guest-side performance toggle they could find. They also chose qcow2 because it made moving disks around “easier.”

For a while, it worked. Benchmarks looked heroic. Deployments were fast. Everyone high-fived and went back to feature work. Then a host reboot happened—routine kernel update, nothing dramatic—and a handful of VMs came back with corrupted filesystems. The database recovered. The mail VM did not. The postmortem was… educational.

The backfire wasn’t because any single setting was “wrong” in a vacuum. It was the combination: writeback caching plus sync disabled plus qcow2’s own write patterns created a system that acknowledged writes optimistically. The system was fast because it was, in practical terms, lying about what was durable.

They rolled back to safer defaults, but not blindly. They moved performance tuning into a policy: for disposable workloads, speed mattered more than durability; for stateful workloads, they optimized the storage hardware path instead of turning off correctness. They also migrated from qcow2 to raw where possible to reduce CoW stacking and metadata overhead.

The real cost wasn’t the corruption itself—it was the week of lost trust. Users don’t care that you can explain cache semantics. They care that their data didn’t come back.

Mini-story #3: The boring but correct practice that saved the day

A mid-sized enterprise ran Proxmox clusters for internal services. Nothing flashy: domain controllers, Git, monitoring, a few small databases, and a surprising amount of file sync. Their storage was ZFS mirrors and some RAIDZ for bulk. They had one practice that felt painfully boring: weekly scrubs, monitored SMART data, and strict pool capacity thresholds.

They also standardized VM storage datasets with consistent properties: lz4 compression on, atime off, and explicit decisions about sync. Most importantly, they had a rule: no VM pool over 80% without a capacity plan signed off. People complained. Finance complained. Everyone complains when you tell them “no.”

One quarter, a batch of SSDs started showing rising media errors. Nothing exploded. No red lights. Just a trend in SMART and a few slow reads that showed up as slight latency increases during scrubs. Because scrubs ran regularly and alerts were tied to changes, they caught it early.

They replaced drives during business hours, one at a time, with controlled resilvers. No incident bridge. No customer-visible outage. The boring practice—scrub discipline and capacity discipline—meant the pool never got into the “too full to resilver comfortably” danger zone. They didn’t win an award. They just didn’t have a bad week.

Checklists / step-by-step plan

Checklist A: Day-one storage defaults for a new Proxmox ZFS VM pool

  1. Create the pool with correct ashift (usually 12). Verify with zdb -C before you put data on it.
  2. Create a dedicated dataset for VM storage (don’t dump everything in root datasets).
  3. Set compression=lz4 on the VM dataset.
  4. Set atime=off on the VM dataset.
  5. Decide sync policy per dataset: durable by default; disable only for workloads you can lose.
  6. If using zvols, standardize volblocksize (often 16K for general VM disks). Decide before creating disks.
  7. Decide primarycache behavior for VM disks (consider metadata-only for memory contention).
  8. Set and enforce capacity thresholds (alert at 70–75%, action at 80%).

Checklist B: When a VM feels slow (15-minute triage)

  1. Check zpool status (errors? resilver? scrub?). If yes, stop and stabilize.
  2. Check zpool iostat -rlv 1 3 (write latency?). If high, storage path is the bottleneck.
  3. Check iostat -x 1 3 (device await, iowait). Confirm it’s not just guest perception.
  4. Check pool fullness and frag (zpool list, zfs list snapshot usage). If too full, you need capacity, not vibes.
  5. Check VM disk cache mode (qm config) and dataset properties (zfs get).

Checklist C: Controlled migration to fix bad volblocksize

  1. Create a new zvol with the desired volblocksize.
  2. Use Proxmox storage migration or a controlled block copy while the VM is down (preferred for correctness); see the sketch after this checklist.
  3. Validate performance and latency with zpool iostat -rlv and application checks.
  4. Remove the old zvol and monitor fragmentation and space.
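
Steps 1 and 2 might look like this in practice, assuming the zfspool storage ID is local-zfs, the VM is 102, and the new zvol name is free; adjust sizes and names to your environment and keep the old zvol until the VM boots cleanly from the new one:

cr0x@server:~$ sudo pvesm set local-zfs --blocksize 16k
cr0x@server:~$ sudo qm shutdown 102
cr0x@server:~$ sudo zfs create -V 20G -o volblocksize=16K rpool/data/vmdata/vm-102-disk-1
cr0x@server:~$ sudo dd if=/dev/zvol/rpool/data/vmdata/vm-102-disk-0 \
    of=/dev/zvol/rpool/data/vmdata/vm-102-disk-1 bs=1M conv=sparse status=progress

Point the VM's disk at the new volume with qm set, boot it, run the validation from step 3, and only then destroy the old zvol.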

FAQ

1) Should I use ZFS or LVM-thin for Proxmox VM storage?

If you want simple snapshots/replication and end-to-end integrity, ZFS is a strong choice. If you want simpler mental models for block storage and less CoW interaction, LVM-thin can be easier. For many shops, ZFS wins operationally—provided you treat sync, capacity, and block sizing as first-class concerns.

2) Is compression=lz4 safe for VM disks?

Yes. It’s transparent and commonly deployed. The “risk” is mostly CPU overhead, which is usually minor compared to I/O saved. The real risk is not using compression and then wondering why the pool is always busy moving unnecessary bytes.

3) What volblocksize should I use for VM zvols?

Common defaults for general VM disks are 16K (often a good balance). For specialized workloads, measure: databases sometimes like 8K or 16K; large sequential workloads may like 64K+. The key is consistency and intent: choose based on workload, not superstition.

4) Can I change volblocksize after the zvol is created?

Not in practice. ZFS refuses to change volblocksize once the volume has been written. The fix is to create a new zvol with the correct volblocksize and migrate the data onto it (see Checklist C).

5) Should I disable sync to make Proxmox faster?

Only if you’re willing to lose recent writes on power loss or crash—and only for the datasets where that’s acceptable. For real systems with real data, build a storage path that can handle sync=standard rather than turning off correctness globally.

6) Do I need a SLOG for ZFS on SSD mirrors?

Not always. If your workload is mostly async writes and reads, you may be fine. If you have significant sync writes and you care about latency, a proper SLOG can help a lot. “Proper” means low-latency and power-loss protected, and ideally mirrored.

7) Is qcow2 a bad idea on ZFS?

Often, yes—because it stacks CoW on CoW, increasing fragmentation and metadata overhead. If you need qcow2 features, use it knowingly. Otherwise, raw on ZFS is usually the calmer, more predictable choice.

8) Why does performance collapse when the pool gets full?

ZFS needs free space to allocate efficiently. As free space shrinks, allocations become harder, fragmentation increases, and write amplification grows—especially with VM random I/O. Keeping 20–30% free isn’t waste; it’s buying stable latency.

9) Should I set primarycache=metadata for all VM storage?

It’s a good default when guests are large and memory pressure exists, because it reduces double caching. If the host has abundant RAM and guests are small/read-heavy, caching data can help. Don’t guess: verify ARC behavior and guest memory health.

10) Does autotrim always help on SSD pools?

It often helps long-term space behavior and can maintain SSD performance, but some devices handle continuous trims poorly. Enable it, observe latency, and be willing to switch to scheduled trims if needed.

Conclusion: practical next steps

If you run Proxmox on ZFS and you haven’t touched storage defaults, you’re probably running a system that behaves great on Tuesday and betrays you on Thursday. Fixing it isn’t magic. It’s policy.

  1. Set lz4 compression and atime off on VM storage datasets today. It’s low risk, usually high reward.
  2. Audit sync behavior. Decide what must be durable, and architect for it. Don’t “accidentally” disable sync across the board.
  3. Standardize volblocksize for new VM disks (often 16K) and plan migrations for the worst offenders.
  4. Stop overfilling pools. Capacity is a performance feature. Make it a monitored SLO, not a late-night surprise.
  5. Measure latency, not just throughput. Use zpool iostat -rlv and iostat -x to keep yourself honest.

Do those five things, and most “ZFS is slow” complaints disappear. The ones that remain are at least honest problems—hardware limits, workload realities, and the occasional decision you made on purpose.
