You can run ZFS inside a VM on Proxmox. People do it every day. Some of them even sleep at night.
But the “how” matters: HBA passthrough is usually the grown-up choice, while virtual disks are the tempting shortcut that turns into a 3 a.m. incident once you mix caching layers, write ordering, and “helpful” RAID controllers.
Make the decision first: what you’re optimizing for
There are only a few legitimate reasons to put ZFS inside a VM:
- You want ZFS features inside the guest: native send/receive workflows, dataset-level policies, encryption, snapshots that the guest owns.
- You’re offering “storage as a service” to something like Kubernetes and you want the storage stack isolated from the hypervisor lifecycle.
- You’re migrating from bare-metal ZFS and want to keep operational muscle memory intact.
- You have hardware constraints (shared hosts, limited bays) and you’re trying to consolidate without giving up ZFS semantics.
There are also reasons that sound legitimate until they aren’t:
- “I’ll just create a few virtual disks and let ZFS manage them.” That’s not “wrong,” but it stacks caching, write barriers, and error reporting in a way that makes failures harder to reason about. It’s a reliability tax you pay later.
- “Passthrough is scary.” It’s not scary. It’s picky. Those are different.
If you want a crisp default recommendation:
- For serious data: pass through an HBA (or entire NVMe devices) and let the guest see real drives.
- For disposable or test workloads: virtual disks are fine, but choose safe cache modes and accept the limitations.
Here’s your only allowed shortcut: if the data matters and you’re asking “is it okay?”, it’s not okay. Do passthrough.
Interesting facts and short history (you’ll use this later)
- ZFS was built to end silent data corruption, and it assumes it can see device errors and ordering. Virtualized storage often blurs both.
- The “ZIL” isn’t a separate device by default. It’s an on-pool intent log; a SLOG is just a separate device that hosts that log, and it only matters for synchronous writes.
- 4K sector drives forced the world to care about alignment. That’s why `ashift` exists, and why “it benchmarks fine” can still be wrong.
- VT-d/IOMMU passthrough used to be a luxury feature. Now it’s common, but motherboard PCIe topology still decides whether your day is easy.
- Virtio was designed to be “paravirtualized” and fast, but it’s still an abstraction. Abstractions hide things, including pain.
- Write caches on RAID controllers caused a decade of surprise data loss. Some of those controllers are still in servers, smiling politely.
- ZFS checksums cover data and metadata, but it cannot validate what it never receives. If the hypervisor lies, ZFS can’t argue.
- Consumer NVMe got fast before it got predictable. Latency spikes matter more to VMs than headline throughput.
Two models: HBA passthrough vs virtual disks
Model A: HBA passthrough (or device passthrough)
This is the “ZFS gets real disks” model. You pass through an HBA (LSI/Broadcom IT mode is the classic), or pass through entire NVMe devices, and the guest owns the storage stack from controller to pool.
What you gain:
- Real error reporting (SMART where supported, real sense codes, real timeouts).
- Predictable write ordering: fewer “guest thought it was durable” situations.
- Cleaner performance tuning: you tune ZFS, not ZFS plus QEMU plus host filesystem plus storage backend.
- Scrubs and resilvers behave like real ZFS, not like “ZFS on top of somebody’s file.”
What you pay:
- IOMMU groups may force you to pass through more than you want (or block you entirely).
- Live migration becomes difficult or impossible unless you also move the physical device (you can’t).
- Host visibility is reduced. The host can’t easily back up “inside” the pool without guest cooperation.
- Operations: you now maintain ZFS in the guest. That’s fine—just admit it.
Model B: Virtual disks (qcow2/raw on ZFS, LVM-thin, Ceph, etc.)
This is the “ZFS inside the guest, but the guest disks are virtual” model. The guest sees /dev/vda or /dev/sdX virtual devices. Those devices are backed by something on the host: a ZFS zvol, a file on ZFS, an LVM logical volume, or a distributed backend like Ceph.
What you gain:
- Easy lifecycle: snapshots, backups, cloning at the hypervisor layer.
- Live migration is feasible (depending on backend).
- Hardware agnostic: no IOMMU drama.
- Centralized monitoring from the host side.
What you pay:
- You are layering filesystems/storage stacks. When it fails, you debug the whole lasagna.
- Cache mode choices matter. One wrong bit and “durable write” becomes “optimistic fiction.”
- TRIM/discard behavior can be surprising, especially with snapshots.
- SMART and detailed drive telemetry usually disappears (or becomes “best effort”).
Opinionated rule: If you plan to run ZFS inside the guest, do not back it with qcow2 unless you’re intentionally trading correctness and performance for convenience. Use raw or block devices if you insist on virtual disks.
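If you take the virtual-disk route anyway, here is a minimal sketch of attaching a zvol-backed raw disk with durability-friendly options, reusing the `local-zfs` storage and VM 130 that appear later in this article; the size and slot are illustrative:

```shell
# Allocate a new 100G volume on the local-zfs ZFS storage and attach it as
# scsi1 with safe options: no host page-cache games, discard enabled, and a
# dedicated IO thread.
qm set 130 --scsi1 local-zfs:100,cache=none,discard=on,iothread=1

# Confirm what the VM will actually get: backend volume, cache mode, discard.
qm config 130 | grep -E 'scsi1|scsihw'
```

Because `local-zfs` is ZFS-backed storage, Proxmox allocates a raw zvol rather than a qcow2 file, which is exactly the trade this rule recommends.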
Joke #1: A storage stack with three caching layers is like a corporate org chart: everyone claims ownership, nobody takes blame.
IOMMU reality check: groups, ACS, and why your NIC moved
Passthrough is not “flip a switch.” It’s “convince the platform to isolate a PCIe device well enough that the hypervisor can hand it to a guest without turning DMA into a horror story.” That’s what IOMMU does: it provides remapping and isolation for device DMA.
What usually goes wrong
- Your HBA shares an IOMMU group with something you need (a NIC, a USB controller, sometimes the entire PCIe root port). If you pass it through, you lose the other thing on the host.
- ACS is missing or limited. Some consumer boards are built for gaming GPUs, not for strict device isolation.
- Reset quirks: some HBAs don’t reset cleanly between VM starts, leading to “works after host reboot” syndrome.
- Interrupt handling and CPU pinning: not strictly IOMMU, but it’s where latency shows up once you “successfully” pass the device.
What “ACS override” really means
Linux can sometimes split IOMMU groups with an ACS override kernel parameter. This is a hack. Sometimes it’s a reasonable hack, sometimes it’s the kind of hack that becomes a compliance meeting. If the platform cannot provide isolation, you are asking software to pretend it can.
For home labs, ACS override is often “fine.” For production with meaningful risk, buy hardware with proper PCIe isolation. Your future self will send you a fruit basket. Or at least not wake you up.
Performance and correctness: where the bodies are buried
Write ordering: the quiet killer
ZFS cares about write ordering because it’s transactional. It expects that if the OS says “this is on stable storage,” the device (or stack beneath) agrees. Virtualization can break this in subtle ways:
- Host cache mode “writeback” can acknowledge writes before they are durable.
- Storage backend may reorder writes unless barriers/flushes are honored end-to-end.
- UPS-less hosts with aggressive caches turn power loss into a data integrity workshop.
The practical takeaway: if you use virtual disks, choose cache modes and backends that preserve durability semantics. If you use passthrough, validate that the controller isn’t doing RAID or lying about flushes.
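A quick way to audit both ends of that contract, assuming the pool name `tank` and VM 130 used in the tasks below (`hdparm` applies to SATA drives only):

```shell
# Guest side: is ZFS honoring sync requests, or did someone set sync=disabled?
zfs get sync tank

# Guest side (SATA): is the drive's volatile write cache on? "write-caching = 1"
# means acknowledged writes can sit in drive DRAM until a flush arrives.
sudo hdparm -W /dev/sda

# Host side: is the virtual disk acknowledging writes early? Anything other
# than cache=none deserves a written justification.
qm config 130 | grep -i cache
```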
Double ZFS: yes, people do it; no, you shouldn’t by default
Running ZFS on the host and ZFS inside the guest (with the guest backed by a zvol or a file) is common. It’s also a great way to create confusing failure domains:
- The guest ZFS thinks it’s managing disks; it’s actually managing “a slice of a pool.”
- Scrub timing and IO patterns can fight each other: host scrub plus guest scrub equals “why is everything slow?”
- Compression and recordsize choices can stack badly.
- Write amplification increases, especially for small random writes.
If you want ZFS features, pick the layer that owns them. Host ZFS with zvols for VM disks is fine. Guest ZFS with passthrough is fine. ZFS-on-ZFS is a special case, not a default.
SLOG and sync writes inside a VM
If your workload is sync-heavy (NFS, databases with fsync, VM images stored inside the guest), ZFS will hit the ZIL path. Inside a VM, this gets tricky:
- A fast SLOG helps only if sync semantics are actually end-to-end durable.
- If the host lies about flushes, your “fast sync” is just faster corruption.
- If you passthrough an NVMe as SLOG, make sure it has power-loss protection if you care about acknowledged sync writes.
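Adding a SLOG to the guest pool is one command; the device paths below are placeholders for your actual power-loss-protected NVMe devices:

```shell
# Mirror the SLOG: a dead log device during a crash is exactly the moment
# you want redundancy. Device names here are hypothetical.
sudo zpool add tank log mirror \
  /dev/disk/by-id/nvme-EXAMPLE-PLP-SSD-A \
  /dev/disk/by-id/nvme-EXAMPLE-PLP-SSD-B

# Verify the log vdev shows up and absorbs sync write traffic.
zpool iostat -v tank 1 2
```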
TRIM/discard and thin provisioning: space lies in both directions
With virtual disks, thin provisioning is convenient until you discover the guest can’t effectively return space, or the host snapshots prevent reclamation. With passthrough, TRIM is closer to reality (though still dependent on drive/controller support). The operational difference is huge: one model fails by “mysteriously full pools,” the other fails by “mysteriously dead SSD endurance.” Pick your adventure, then monitor accordingly.
Paraphrased idea (attributed): Werner Vogels has a well-known operations mantra: “Everything fails, all the time.” Design like you believe it.
Practical tasks: commands, outputs, and what the output means (and what you do next)
You don’t debug storage with vibes. You debug it with evidence. Below are concrete checks that work on real Proxmox hosts and real ZFS guests.
Task 1: Confirm IOMMU is actually enabled on the Proxmox host
cr0x@server:~$ dmesg | egrep -i 'DMAR|IOMMU|AMD-Vi' | head -n 20
[ 0.823456] DMAR: IOMMU enabled
[ 0.823789] DMAR: Host address width 39
[ 0.824001] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
What it means: You have Intel VT-d/DMAR enabled and the kernel is using it.
Decision: If you do not see “IOMMU enabled,” fix BIOS settings and kernel boot params before touching passthrough configs.
Task 2: Check kernel boot parameters for IOMMU and passthrough friendliness
cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.12-4-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on iommu=pt
What it means: intel_iommu=on enables IOMMU; iommu=pt reduces overhead for non-passthrough devices by using passthrough mappings.
Decision: If you’re missing the IOMMU flags, add them and reboot. If you’re AMD, you’d look for amd_iommu=on.
Task 3: Identify your HBA and its PCI address
cr0x@server:~$ lspci -nn | egrep -i 'sas|sata|lsi|megaraid'
03:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [1000:0072] (rev 03)
What it means: The controller is at 03:00.0 and is an LSI SAS2008 class device.
Decision: Use the PCI address in passthrough configuration. Also confirm it’s in IT mode if you want ZFS to see individual disks.
Task 4: Inspect the IOMMU group for that device
cr0x@server:~$ readlink -f /sys/bus/pci/devices/0000:03:00.0/iommu_group
/sys/kernel/iommu_groups/18
cr0x@server:~$ ls -l /sys/kernel/iommu_groups/18/devices
total 0
lrwxrwxrwx 1 root root 0 Feb 4 10:12 0000:03:00.0 -> ../../../../bus/pci/devices/0000:03:00.0
What it means: The HBA is alone in its group. That’s the good timeline.
Decision: If other critical devices share the group, you either move PCIe slots, change motherboard settings, or abandon passthrough on this hardware.
Task 5: Check what driver currently owns the HBA (host side)
cr0x@server:~$ lspci -k -s 03:00.0
03:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 (rev 03)
Subsystem: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2
Kernel driver in use: mpt3sas
Kernel modules: mpt3sas
What it means: The host is using mpt3sas. For passthrough, you want vfio-pci to own it instead.
Decision: Bind it to vfio-pci; if the host is using it for its own storage, stop here and rethink your architecture.
Task 6: Bind the HBA to vfio-pci (and verify)
cr0x@server:~$ echo "options vfio-pci ids=1000:0072" | sudo tee /etc/modprobe.d/vfio.conf
options vfio-pci ids=1000:0072
cr0x@server:~$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.8.12-4-pve
cr0x@server:~$ sudo reboot
Connection to server closed by remote host.
cr0x@server:~$ lspci -k -s 03:00.0
03:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 (rev 03)
Kernel driver in use: vfio-pci
Kernel modules: mpt3sas
What it means: vfio-pci owns the device; the host will no longer touch its disks.
Decision: Now it’s safe to attach it to a VM. If it still binds to mpt3sas, you missed the initramfs update, or you need a softdep/blacklist entry so vfio-pci claims the device before mpt3sas does.
Task 7: Attach the HBA to a Proxmox VM and confirm QEMU sees it
cr0x@server:~$ sudo qm set 120 -hostpci0 03:00.0,pcie=1
update VM 120: -hostpci0 03:00.0,pcie=1
cr0x@server:~$ sudo qm config 120 | egrep -i 'hostpci|machine|bios'
bios: ovmf
hostpci0: 03:00.0,pcie=1
machine: q35
What it means: You’re using q35 and OVMF, both friendly for modern PCIe passthrough.
Decision: If you’re on legacy i440fx, move to q35 unless you have a hard reason not to.
Task 8: Inside the guest, confirm disks are visible and identify them safely
cr0x@zfs-vm:~$ lsblk -o NAME,SIZE,MODEL,SERIAL,TYPE
NAME SIZE MODEL SERIAL TYPE
sda 14.6T ST16000NM001G ZR123ABC disk
sdb 14.6T ST16000NM001G ZR124DEF disk
sdc 14.6T ST16000NM001G ZR125GHI disk
sdd 14.6T ST16000NM001G ZR126JKL disk
What it means: The guest sees real disks with model/serial. That’s what you want for ZFS device identity.
Decision: Use persistent IDs (by-id) when creating pools. If disks show up as generic “QEMU HARDDISK,” you’re not doing true device passthrough.
Task 9: Build the pool with persistent device paths and correct ashift
cr0x@zfs-vm:~$ ls -l /dev/disk/by-id | egrep 'ZR123ABC|ZR124DEF|ZR125GHI|ZR126JKL'
lrwxrwxrwx 1 root root 9 Feb 4 10:20 ata-ST16000NM001G_ZR123ABC -> ../../sda
lrwxrwxrwx 1 root root 9 Feb 4 10:20 ata-ST16000NM001G_ZR124DEF -> ../../sdb
lrwxrwxrwx 1 root root 9 Feb 4 10:20 ata-ST16000NM001G_ZR125GHI -> ../../sdc
lrwxrwxrwx 1 root root 9 Feb 4 10:20 ata-ST16000NM001G_ZR126JKL -> ../../sdd
cr0x@zfs-vm:~$ sudo zpool create -o ashift=12 tank raidz1 \
/dev/disk/by-id/ata-ST16000NM001G_ZR123ABC \
/dev/disk/by-id/ata-ST16000NM001G_ZR124DEF \
/dev/disk/by-id/ata-ST16000NM001G_ZR125GHI \
/dev/disk/by-id/ata-ST16000NM001G_ZR126JKL
cr0x@zfs-vm:~$ zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-ST16000NM001G_ZR123ABC ONLINE 0 0 0
ata-ST16000NM001G_ZR124DEF ONLINE 0 0 0
ata-ST16000NM001G_ZR125GHI ONLINE 0 0 0
ata-ST16000NM001G_ZR126JKL ONLINE 0 0 0
What it means: Pool is online and uses stable identifiers. ashift=12 is generally right for 4K-sector drives (and safe for most modern disks).
Decision: If you guess ashift wrong, you don’t “fix it later.” You rebuild. Decide now, not after you store 40 TB.
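The arithmetic behind ashift is just log2 of the sector size; a tiny sketch you can sanity-check against what the drive reports (`blockdev --getpbsz /dev/sdX` shows the physical sector size):

```shell
# ashift = log2(sector size in bytes): 512 -> 9, 4096 -> 12, 8192 -> 13.
sector_size=4096

ashift=0
s=$sector_size
while [ "$s" -gt 1 ]; do
  s=$((s / 2))
  ashift=$((ashift + 1))
done

echo "ashift=$ashift"   # 4096 bytes -> ashift=12
```

Drives that lie about their physical sector size (512e emulation) are the reason to prefer ashift=12 even when the report says 512.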
Task 10: Validate sync behavior and latency signals from ZFS
cr0x@zfs-vm:~$ sudo zfs set sync=standard tank
cr0x@zfs-vm:~$ zpool iostat -v tank 1 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 1.20T 42.5T 0 210 0 52.0M
raidz1-0 1.20T 42.5T 0 210 0 52.0M
sda - - 0 52 0 13.1M
sdb - - 0 52 0 13.0M
sdc - - 0 53 0 13.0M
sdd - - 0 53 0 12.9M
What it means: IO distribution looks sane. If one disk is consistently slower, that’s a drive/controller/cable problem, not “ZFS being ZFS.”
Decision: If you see wildly uneven writes, investigate that specific device path, cabling, or HBA port.
Task 11: On the Proxmox host, verify the VM disk backend and cache mode (virtual disk scenario)
cr0x@server:~$ qm config 130 | egrep -i 'scsi|virtio|cache|discard'
scsi0: local-zfs:vm-130-disk-0,cache=none,discard=on,iothread=1,size=200G
scsihw: virtio-scsi-single
What it means: This VM uses a ZFS zvol on the host (local-zfs), with cache=none (safer for durability), discard enabled, and an IO thread.
Decision: If you see cache=writeback on a non-BBU, non-UPS host, change it. You’re currently relying on luck and electricity behaving politely.
Task 12: Measure actual latency from the guest perspective
cr0x@zfs-vm:~$ iostat -x 1 3
Linux 6.5.0 (zfs-vm) 02/04/2026 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
2.13 0.00 3.45 8.20 0.10 86.12
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %util await svctm
sda 0.00 55.00 0.00 14080.0 0.00 1.00 62.00 11.20 0.35
sdb 0.00 54.00 0.00 13824.0 0.00 0.00 61.50 11.10 0.34
sdc 0.00 56.00 0.00 14336.0 0.00 0.00 62.30 11.30 0.35
sdd 0.00 55.00 0.00 14080.0 0.00 0.00 62.10 11.25 0.35
What it means: await is ~11ms for writes, which is plausible for HDDs under moderate load. If await is spiking into hundreds of ms, you have queueing or backend stalls.
Decision: If %steal is high, you’re CPU-contended; storage tuning won’t save you. If await is high with low util, suspect virtualization overhead or host IO stalls.
Task 13: Confirm ZFS isn’t memory-starved (ARC thrash looks like “slow disks”)
cr0x@zfs-vm:~$ cat /proc/meminfo | egrep 'MemTotal|MemFree|Cached'
MemTotal: 33554432 kB
MemFree: 1823400 kB
Cached: 16200480 kB
cr0x@zfs-vm:~$ sudo arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
10:33:01 812 142 17 60 42 82 58 0 0 12.0G 12.0G
10:33:02 790 150 19 65 43 85 57 0 0 12.0G 12.0G
10:33:03 805 155 19 70 45 85 55 0 0 12.0G 12.0G
What it means: ARC is stable at 12G and miss rate is moderate. If ARC is tiny or constantly shrinking, performance will feel like “random IO is cursed.”
Decision: If the VM doesn’t have enough RAM for ARC, either give it RAM, move ZFS to the host, or accept the performance profile.
Task 14: Verify you’re not accidentally running ZFS atop a RAID volume
cr0x@zfs-vm:~$ sudo smartctl -a /dev/sda | egrep 'Vendor|Product|Rotation|SMART support'
Vendor: SEAGATE
Product: ST16000NM001G
Rotation Rate: 7200 rpm
SMART support is: Available - device has SMART capability.
What it means: The guest can query SMART, suggesting a relatively direct path to disk. If you see a RAID virtual volume instead, you’ll often get generic vendor strings and missing SMART.
Decision: If you’re on a hardware RAID volume, stop and reflash the controller to IT/HBA mode, or replace it. ZFS wants disks, not theater.
Task 15: On the Proxmox host, check for IO wait and underlying device saturation
cr0x@server:~$ pveperf
CPU BOGOMIPS: 76800.00
REGEX/SECOND: 1975679
HD SIZE: 98.23 GB (rbd)
FSYNCS/SECOND: 2321.41
DNS EXT: 37.82 ms
DNS INT: 0.86 ms
What it means: FSYNCS/SECOND gives a rough feel for sync performance to the Proxmox root/storage backend. It’s not a benchmark suite, but it catches obvious “this node is sick” problems.
Decision: If FSYNCS/SECOND is terrible on a node that should be healthy, investigate host storage first. Guest tuning won’t beat a broken backend.
Fast diagnosis playbook (what to check first/second/third)
This is the “it’s slow” and “ZFS is angry” playbook. Don’t freestyle. Follow the chain.
First: decide where the storage truth lives
- HBA/device passthrough: treat the guest like bare metal. Start in the guest for disk health and ZFS signals.
- Virtual disks: start on the host. The guest can only report what it sees, and what it sees may be a polite lie.
Second: determine if it’s latency, throughput, or CPU scheduling
- In the guest: `iostat -x 1`, and look at `await` and `%steal`.
- On the host: `iostat -x 1` and `top` (or `htop`) for IO wait.
If %steal is elevated, your “storage issue” might be vCPU scheduling. If await is huge with low util, suspect stalls higher up (hypervisor queueing, backend flushes, or a sick controller).
Third: check ZFS itself for what it’s doing
- `zpool status -xv`: errors, degraded vdevs, checksum issues.
- `zpool iostat -v 1`: per-vdev imbalance.
- `zfs get compressratio,recordsize,atime,sync`: properties of the dataset involved.
Fourth: confirm you didn’t build a write cache time bomb
- Virtual disks: confirm the Proxmox cache mode is `none` or a consciously chosen safe alternative.
- Passthrough: confirm the controller is in IT mode and not doing writeback caching without protection.
Fifth: look for the virtualization-specific “gotchas”
- IO threads enabled for virtio-scsi when appropriate.
- Multi-queue settings, vCPU pinning, and NUMA alignment for heavy IO.
- Ballooning disabled for ZFS guests (memory pressure + ARC = sadness).
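The memory and CPU gotchas translate into a couple of `qm` settings, sketched here against VM 120 from the passthrough tasks:

```shell
# Disable ballooning: a fixed memory footprint means ARC isn't fighting
# surprise evictions from the host.
qm set 120 --balloon 0

# Expose the host CPU model and enable NUMA awareness for heavy IO guests.
qm set 120 --cpu host --numa 1

# Confirm the memory and CPU stanza.
qm config 120 | grep -E 'balloon|memory|cpu|numa'
```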
Three corporate mini-stories from the storage trenches
Incident: the wrong assumption (“the hypervisor will keep it safe”)
A mid-sized company ran Proxmox for internal services. A team wanted ZFS features inside a VM because they had a slick replication workflow built around ZFS send/receive. Reasonable. They created a VM, attached a set of qcow2 disks on host storage, and built a raidz pool inside the guest. Benchmarks looked good. Everyone high-fived.
Months later, there was a power event. Not dramatic. A short outage, UPS did what it could, but a couple nodes went down hard. When the cluster came back, one VM’s pool imported but started throwing checksum errors. Another wouldn’t import without forcing. It smelled like partial writes.
The root cause wasn’t a single bug. It was a stack of assumptions: host cache mode acknowledged writes before they were durably committed, and the underlying storage backend had its own caching behavior. ZFS inside the guest believed it was getting correct flush semantics. It wasn’t. The pool survived enough to be terrifying: not a clean failure, not an obvious loss. A long, expensive week followed.
The fix was boring and structural: they rebuilt the design. For ZFS-in-VM workloads that mattered, they moved to HBA passthrough and validated durability. For the ones that needed live migration and host-level snapshots, they stopped running ZFS inside the guest and used host ZFS instead. Same hardware. Different contract.
Optimization that backfired: “Let’s use a faster cache mode”
Another organization ran a storage-heavy application with a lot of fsync calls. Someone noticed high latency and low throughput on their virtual disks. They changed Proxmox disk cache to writeback because it made the graphs look nicer. And it did—latency dropped, throughput improved, everyone went back to ignoring storage.
Then came a maintenance reboot during a noisy period. The app restarted, but the database reported corruption. Not immediately catastrophic—more the “I found things I cannot reconcile” type. Backups were available, but recovery meant data loss between snapshots, plus an ugly restore window.
Postmortem found the obvious: writeback caching improved performance by acknowledging writes early. The less obvious part was cultural: nobody had defined what “durable write” meant in their environment. They had optimized a metric without specifying the reliability requirement behind it.
They rolled back to safer cache settings, added power-loss protection where it mattered, and re-ran tests that included crash scenarios. The performance was lower, but it was honest. That’s the kind you can operate.
Boring but correct practice that saved the day: “We test resilver and scrub like it’s a feature”
A third shop ran ZFS inside a VM with HBA passthrough. It wasn’t glamorous; it was chosen because they wanted predictable failure behavior and clean telemetry. The team had a habit that looked excessive: they scheduled regular scrubs, and once per quarter they intentionally offlined a disk in a maintenance window to exercise replacement and resilver procedures.
One day, a disk started throwing intermittent errors. Nothing dramatic: a handful of read errors that auto-corrected at first. Scrub flagged it, and because the team had seen this movie before, they didn’t debate. They replaced the drive, resilvered, and moved on.
Two weeks later, another disk in the same batch started doing the same thing. Again: replace, resilver, move on. No panic, no “why is the pool degraded” Slack storm.
The real win came later: the vendor acknowledged an issue with a production run. Many companies found out when they lost a vdev during resilver. This one found out when the scrub graph twitched. The practice was boring. The outcome was not.
Joke #2: The only thing more persistent than bit rot is a spreadsheet that claims your storage is “green.”
Common mistakes: symptoms → root cause → fix
1) VM ZFS pool corrupt after host crash
Symptoms: Pool imports with errors, checksum errors appear, datasets behave oddly after an unclean shutdown.
Root cause: Virtual disk cache mode and backend didn’t honor flush semantics end-to-end; guest ZFS trusted durability that wasn’t real.
Fix: Use safer cache modes (cache=none for many setups), avoid qcow2 for ZFS-in-guest if you care about integrity, or move to HBA/device passthrough.
2) “Passthrough doesn’t work” even though IOMMU is enabled
Symptoms: VM won’t start, Proxmox reports VFIO errors, device disappears or causes host instability.
Root cause: Device shares IOMMU group with host-critical hardware, or the device has reset quirks.
Fix: Check group membership; move PCIe slots; choose different motherboard/CPU platform; sometimes update BIOS/firmware. Avoid relying on ACS override for production risk profiles.
3) ZFS guest is slow, but disks look fine
Symptoms: High latency, low throughput, no obvious SMART issues, ZFS shows no errors.
Root cause: vCPU contention and high %steal, or memory ballooning causing ARC thrash.
Fix: Pin vCPUs for heavy IO workloads, disable ballooning for ZFS guests, allocate sufficient RAM, and avoid overcommitting CPU on storage VMs.
4) Pool keeps “mysteriously filling up” with thin-provisioned virtual disks
Symptoms: Host storage fills while the guest shows free space; discards don’t reclaim space as expected.
Root cause: Discard/TRIM not enabled end-to-end, snapshots prevent space reclamation, or backend doesn’t punch holes efficiently.
Fix: Enable discard on Proxmox disks, ensure guest issues TRIM, manage snapshots intentionally, and monitor actual allocated size on the host.
5) Scrubs are painfully slow inside a VM
Symptoms: Scrub takes forever; production workload suffers during scrub windows.
Root cause: Scrub IO competes with VM IO; on virtual disks you can get compounded contention (host plus guest). On passthrough, it’s often just “disks are busy” but might be queue depth issues.
Fix: Schedule scrubs, set scrub/resilver priorities, avoid running scrub on both host and guest ZFS layers simultaneously, and validate HBA queue settings if applicable.
6) “ZFS can’t see SMART” after passthrough
Symptoms: smartctl returns limited info or errors; disks show generic identifiers.
Root cause: You passed through a RAID controller still doing RAID, or you’re behind an expander/controller combination that needs specific smartctl device types.
Fix: Put controller into IT/HBA mode; use correct smartctl flags for SAS; confirm the guest sees actual devices, not logical volumes.
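When SMART looks missing, it’s often just a device-type mismatch; `smartctl` can usually be told what it’s talking to:

```shell
# Enumerate devices and the -d types smartctl would pick for each.
sudo smartctl --scan

# SAS drives behind an HBA often need an explicit SCSI device type.
sudo smartctl -a -d scsi /dev/sdb

# SATA drives behind some SAS HBAs need the SAT translation layer instead.
sudo smartctl -a -d sat /dev/sdb
```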
Checklists / step-by-step plan
Plan A: You want ZFS inside the VM for real data (recommended path)
- Buy/choose hardware that supports clean IOMMU isolation. Server boards tend to behave better; PCIe topology matters.
- Enable VT-d/AMD-Vi in BIOS, boot with IOMMU flags, verify in `dmesg`.
- Identify the HBA, confirm it’s in IT mode, and verify it is alone in its IOMMU group.
- Bind HBA to vfio-pci, reboot, confirm driver ownership.
- Create a dedicated storage VM: q35 + OVMF, no ballooning, enough RAM, sensible vCPU count, consider CPU pinning if latency-sensitive.
- Pass through the HBA, then build the pool inside the guest using `/dev/disk/by-id` paths.
- Set ZFS properties intentionally per dataset (recordsize, compression, atime).
- Monitor and test failure behavior: scrubs, drive replacement procedure, reboot tests, and a planned power-loss simulation if you’re brave and prepared.
Plan B: You insist on virtual disks but want to reduce the blast radius
- Prefer raw or zvol-backed disks over qcow2 for ZFS-in-guest.
- Use virtio-scsi with `iothread=1` for heavy IO patterns.
- Set the cache mode consciously: default to `cache=none` unless you can prove durability another way.
- Enable discard end-to-end if thin provisioning matters.
- Do not run host ZFS scrub at the same time as guest ZFS scrub if you can avoid it. Stagger maintenance.
- Test recovery: simulate unclean VM poweroff and verify pool import behavior and application integrity.
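That last step can be a boring, scripted drill rather than an improvisation; a sketch against VM 130, to be run only against data you can afford to lose:

```shell
# Hard-stop the VM mid-write: 'qm stop' is an immediate kill, unlike
# 'qm shutdown', which is graceful and proves nothing about durability.
qm stop 130

# Bring it back and see whether the pool imports cleanly.
qm start 130

# Inside the guest afterwards: anything other than a healthy report means
# some layer of the stack is lying about flushes.
# zpool import            # if the pool didn't auto-import
# zpool status -xv
```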
Plan C: You don’t actually need ZFS in the guest
- Run ZFS on the Proxmox host and use zvols for VM disks.
- Use Proxmox snapshots and backups at the hypervisor layer.
- Expose storage to guests via virtio and keep the storage brain in one place.
FAQ
1) Is HBA passthrough always faster?
No. It’s often more predictable. Virtual disks can be fast on good backends, but passthrough reduces layers that add latency spikes and semantic mismatches.
2) Can I live-migrate a VM that uses HBA passthrough?
Not in any meaningful way. The disks are physically attached to one host. You can migrate compute, not cables.
3) Is it okay to run ZFS on qcow2?
For testing and non-critical data, sure. For important data, it’s an attractive way to add fragmentation and complexity while making failure analysis harder.
4) If I use Proxmox host ZFS, should I also use guest ZFS?
Default answer: no. Pick one layer to own ZFS responsibilities. ZFS-on-ZFS is a niche design, not a best practice.
5) What cache mode should I use for virtual disks?
Start with cache=none. It’s not the fastest, but it’s honest. If you change it, do it with a clear durability story (power protection, backend guarantees, crash testing).
6) Does a SLOG help ZFS in a VM?
Sometimes. It helps synchronous write latency when the workload forces sync writes. But it only helps if flush/durability semantics are real end-to-end.
7) Why does my HBA share an IOMMU group with my NIC?
Because PCIe topology and ACS support are determined by motherboard and CPU design. Linux can’t manufacture isolation that the hardware didn’t provide.
8) Should I enable ballooning on a ZFS VM?
No, not if you care about performance consistency. ZFS uses memory for ARC caching; ballooning turns caching into a surprise eviction party.
9) Do I need IT mode specifically?
If you want ZFS to manage individual disks correctly, yes. RAID mode hides disks behind logical volumes and often interferes with error visibility and recovery behavior.
10) What’s the simplest “safe” architecture for Proxmox + storage?
Host ZFS + zvol VM disks + Proxmox backups is the simplest safe setup for many environments. Add passthrough only when you need guest-owned ZFS.
Practical next steps
Pick your design based on what you’re trying to guarantee:
- If data integrity and clear failure modes are the priority: build a storage VM with HBA/device passthrough, and treat it like a storage server.
- If operational convenience and mobility are the priority: don’t run ZFS inside the guest; run it on the Proxmox host and keep VM disks simple.
- If you must run ZFS-in-guest on virtual disks: use raw/zvol-backed devices and `cache=none`, enable discard deliberately, and test crash behavior before production.
Then do the unglamorous part: run the commands above, record the outputs, and decide based on evidence. Storage is one of the few places where reality always wins. It just doesn’t always win quickly.