The ticket says “storage is slow.” The graph says “latency spiked.” The application team says “it was fine yesterday.”
And you’re staring at a ZFS pool running inside a VM, backed by “some disks,” presented by “some controller,” through “some hypervisor magic.”
This is where ZFS either looks like a miracle filesystem—or like a murder mystery with too many suspects.
The controller path you choose (HBA passthrough vs virtual disks) determines what ZFS can see, what it can fix, and what it can only guess.
Guessing is not a storage strategy.
The actual decision: what ZFS needs vs what hypervisors offer
ZFS is not shy. It wants direct-ish access to disks, stable identities, correct cache flush semantics, predictable latency, and visibility into errors.
Hypervisors, by design, abstract hardware. Sometimes beautifully. Sometimes catastrophically.
Your controller choice is really three choices bundled together:
- Visibility: Can ZFS see SMART, error counters, and true device behavior—or only a virtual block device?
- Ordering and flush guarantees: When ZFS says “sync this,” does the stack actually sync it to non-volatile media?
- Failure blast radius: Does one layer’s “helpful caching” turn a host crash into pool corruption?
If you remember one line, make it this: ZFS can tolerate slow disks; it cannot tolerate lies.
“Lies” here means write acknowledgements that aren’t durable yet, or error reporting that gets swallowed, rewritten, or delayed until it’s too late.
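A quick sanity check you can run at any layer: ask the drive (or whatever is pretending to be the drive) whether its volatile write cache is enabled. A minimal sketch for a SATA device—the device name is just an example, and SAS drives need sdparm instead of hdparm.
cr0x@server:~$ hdparm -W /dev/sda
/dev/sda:
 write-caching = 1 (on)
Write cache on is fine when the flush path is honest end to end; it becomes a lie when something between ZFS and the platters acknowledges flushes it never performed.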
So… Proxmox vs ESXi?
If you want ZFS as your primary storage stack on the host: Proxmox is the straightforward option because it’s Linux, and ZFS-on-Linux is a normal citizen there.
If you want ESXi for its virtualization ecosystem but still want ZFS features: you’re usually talking about ZFS inside a VM (TrueNAS, OmniOS, Debian+ZFS, etc.).
That’s viable, but your controller path becomes the whole game.
Opinionated guidance:
- Best practice for ZFS-in-a-VM: PCIe passthrough of an HBA in IT mode to the ZFS VM.
- Acceptable for light duty / lab: virtual disks (VMDK/virtio-blk) only if you understand what you’re losing.
- What I avoid in production: putting ZFS on top of a RAID controller presenting logical volumes, or stacking ZFS on thin-provisioned virtual disks without guardrails.
Interesting facts and historical context (that still bites today)
- ZFS was born at Sun Microsystems and shipped in Solaris (mid-2000s), built around end-to-end checksumming and copy-on-write to prevent silent corruption.
- Copy-on-write is why ZFS hates “lying” caches: it relies on transaction groups and ordered durability; acknowledgements that aren’t durable can poison consistency after a crash.
- VMware’s VMFS datastores were designed for shared-block virtualization, not for exposing raw disk semantics to a guest filesystem with its own RAID logic.
- HBAs in “IT mode” became popular largely because ZFS (and later Ceph) wanted simple JBOD behavior: no RAID metadata, no controller caching surprises.
- LSI SAS2008/SAS2308 era cards (and their many OEM rebrands) became the de facto homelab and SMB standard because they were cheap and supported passthrough well.
- 4K sector drives forced the industry’s hand: ashift mismatches can permanently waste IOPS and space; virtualization layers sometimes hide true sector size.
- Write cache policies have caused real outages for decades—long before ZFS—because “cache enabled” plus “no battery/flash backup” is just “data loss later.”
- SMART passthrough is not a given: many virtual disk paths don’t provide drive health details to the guest, which breaks proactive ops workflows.
Disk presentation paths: passthrough HBA, RDM, VMDK, and “it depends”
Path A: PCIe passthrough HBA (IT mode) to the ZFS VM
This is the “stop being cute” approach. The ZFS VM owns the HBA, enumerates the disks directly, sees SMART, sees serials, and gets real error behavior.
It’s the closest you get to bare metal while still virtualizing compute around it.
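On Proxmox, handing the HBA to the storage VM is a one-liner once IOMMU is enabled; on ESXi you toggle passthrough per device in the host UI and add it to the VM as a PCI device. The VM ID and PCI address below are placeholders—substitute your own from lspci.
cr0x@server:~$ qm set 100 -hostpci0 0000:03:00.0
After the VM boots, verify the guest actually sees the controller and disks (Tasks 1 and 3 below).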
Pros
- Best disk visibility: SMART, temps, error counters, real device IDs.
- Best alignment with ZFS assumptions about flushes and error handling.
- Predictable performance characteristics: queueing and latency are less “mystery meat.”
- Easier incident response: when a drive is dying, ZFS tells you which one, not “naa.600…something.”
Cons
- You lose some hypervisor conveniences (vMotion for that VM, snapshots in the usual sense).
- Passthrough can complicate host upgrades and hardware changes.
- Some consumer boards/IOMMU groupings make it annoying.
Path B: ESXi RDM (Raw Device Mapping) to a ZFS VM
RDM is VMware’s “kind of raw” mapping. In practice it’s still mediated.
Depending on mode (virtual vs physical compatibility), it may or may not forward what ZFS wants.
It can be workable, but it’s rarely the best option when passthrough is available.
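If you are stuck with RDM, physical-compatibility mappings are created on the ESXi host with vmkfstools and then attached to the VM as existing disks. The device identifier and datastore path below are hypothetical—list your devices first and map each disk individually.
cr0x@esxi:~$ vmkfstools -z /vmfs/devices/disks/naa.5000c500a1b2c3d4 /vmfs/volumes/datastore1/storage-vm/disk1-rdm.vmdk
The -z flag requests physical compatibility mode, which forwards more SCSI commands to the guest than virtual mode (-r). Still test whether SMART actually comes through—don’t assume.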
Pros
- Better than a generic VMDK for some use cases.
- Can allow some tooling integration with ESXi.
Cons
- SMART and error visibility are often limited or inconsistent.
- More layers where cache/flush semantics can get weird.
- Operational ambiguity: “Is the guest really seeing the disk?” becomes a recurring question.
Path C: Virtual disks (VMDK on VMFS; virtio disks on Proxmox)
This is the most common way people try ZFS in a VM because it’s convenient and looks clean in the UI.
It’s also where ZFS gets the least truth about the underlying media.
If you do this for anything you care about, treat it like a product decision: define reliability requirements, test crash behavior, and accept the monitoring blind spots.
Pros
- Easy lifecycle operations: migrate, snapshot, clone.
- Hardware-agnostic: no special HBA needed.
- Great for dev/test, CI, disposable environments.
Cons
- ZFS loses direct SMART and some error semantics.
- Misaligned sector sizes and volatile caches can create performance and integrity issues.
- Double caching and queueing (guest + host) can produce latency spikes that look like “random” application stalls.
Joke #1: Virtual disks for production ZFS are like putting racing slicks on a forklift—technically it rolls, but it’s not the kind of speed you wanted.
Path D: Hardware RAID controller logical volumes presented to ZFS
Don’t. ZFS already is the RAID layer. Putting it on top of another RAID layer is a great way to get:
mismatched failure handling, masked errors, write hole behavior, and a support blame game.
There are niche exceptions (single-disk RAID0 per disk on a controller purely for pass-through behavior),
but that’s usually a hack for when you bought the wrong controller.
For new builds: buy the right hardware.
Proxmox: ZFS is first-class, but hardware still matters
On Proxmox, ZFS typically runs on the host. That’s the cleanest architecture: the kernel sees disks directly,
ZFS manages them directly, and your VMs see storage over zvols, datasets, or network shares.
The “controller decision” still exists, but it’s simpler: you either attach disks to the Proxmox host via a proper HBA (or onboard SATA in AHCI),
or you complicate your life with RAID controllers and caching policies.
What to buy (and how to configure it)
- HBA in IT mode (LSI/Broadcom family) is the classic answer for SAS/SATA backplanes and expanders.
- Onboard SATA in AHCI mode is fine for small direct-attached SATA builds, assuming the chipset and cabling are sane.
- Avoid RAID mode unless you can force true HBA/JBOD behavior with caching disabled and stable identifiers—still second-best.
Performance reality check
Proxmox with ZFS isn’t “slow.” It’s honest. If your pool is six rust disks doing sync writes, latency will reflect physics.
ZFS will gladly show you the bill for your design decisions.
ESXi: excellent hypervisor, awkward ZFS guest story
ESXi is great at what it was built to do: run VMs, manage them at scale, abstract hardware, and provide stable ops tooling.
The friction happens when you want a guest filesystem (ZFS) to behave like it’s on bare metal while ESXi is doing what hypervisors do.
What to do on ESXi if you want ZFS features
- Best: dedicate an HBA and use PCIe passthrough to a storage VM (TrueNAS, etc.). Present storage back to ESXi over NFS/iSCSI.
- Okay-ish: RDM in physical mode for each disk, if passthrough is impossible. Validate SMART visibility and flush behavior.
- Risky: ZFS over VMDKs on VMFS, especially thin-provisioned, with snapshots, and no monitoring for underlying datastore pressure.
The philosophical problem: who is responsible for integrity?
In an ESXi + VMFS world, VMware’s stack is responsible for datastore integrity, redundancy, and recovery.
In a ZFS world, ZFS wants to own that job end-to-end.
If you let both stacks “help,” you can end up with neither being fully accountable.
One quote worth keeping on a sticky note:
Hope is not a strategy.
— a traditional SRE saying (origin disputed)
Why controller choice changes failure modes (and your pager)
1) ZFS needs stable disk identity
ZFS labels disks and expects them to stay the same device.
Virtual disks are stable in their own way, but they hide the underlying drive topology.
If a physical disk starts erroring, ZFS in the guest might only see “virtual disk read error,” not “Seagate SN1234 is dying on bay 7.”
2) SMART and predictive failure signals
SMART isn’t perfect, but it’s what we have. Reallocated sectors, pending sectors, CRC errors—these are early warnings.
With passthrough, you can monitor them inside the ZFS system and replace disks before a resilver turns into a week-long drama.
With VMDKs, you’re often blind unless you also monitor from the host with separate tooling and correlate manually.
3) Cache flush semantics and sync writes
ZFS cares deeply about what “sync” means. Databases care too.
If the hypervisor or controller acknowledges writes before they’re durable, a power loss can leave ZFS believing data is safely on disk when it isn’t.
ZFS is robust, but it can’t checksum data that never made it.
4) Queue depth and latency variance
Many “ZFS is slow in a VM” reports are really “your IO path has three queues, two caches, and one tiny bottleneck.”
HBAs with passthrough let the guest manage the queueing directly.
With virtual disks, you can get unpredictable latency spikes from host contention, datastore metadata operations, or snapshot chains.
5) Recoverability when the host is sick
With passthrough HBA, the ZFS VM is more self-contained: move the HBA + disks to another host, import pool, carry on.
With VMDKs, your pool is entangled with VMFS datastores, ESXi host configuration, and sometimes the state of snapshots.
Recovery becomes a multi-layer archaeological dig.
Joke #2: Storage layers are like lasagna—too many and you’ll spend the night regretting your choices.
Practical tasks: commands, output interpretation, decisions
The point of commands is not to feel busy. It’s to reduce uncertainty until the right decision is obvious.
Below are real tasks I use in production triage and design reviews. Each includes what to look for and what you decide next.
Task 1: Confirm ZFS sees real disks (not a single giant virtual blob)
cr0x@server:~$ lsblk -o NAME,MODEL,SERIAL,SIZE,ROTA,TYPE
NAME MODEL SERIAL SIZE ROTA TYPE
sda ST12000NM0008 ZHZ1ABC1 10.9T 1 disk
sdb ST12000NM0008 ZHZ1ABC2 10.9T 1 disk
sdc ST12000NM0008 ZHZ1ABC3 10.9T 1 disk
nvme0n1 Samsung SSD 980 S64DNF0R1234 931G 0 disk
What it means: You have per-disk model and serial. That usually indicates direct attachment or passthrough HBA.
If you see “VMware Virtual disk” (or identical models/serials), you’re on virtual disks.
Decision: If this is a ZFS VM and you care about ops visibility, move toward HBA passthrough.
Task 2: Check whether the controller is a real HBA and in a sane mode
cr0x@server:~$ lspci -nn | egrep -i 'sas|sata|raid|lsi|broadcom'
03:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 [1000:0087]
What it means: LSI SAS2308 class device is a common HBA suitable for IT mode.
If you see a RAID controller model, double-check it’s not doing write caching or RAID presentation.
Decision: Prefer an HBA in IT mode for ZFS; avoid hardware RAID volumes as ZFS vdevs.
Task 3: Verify passthrough from the guest perspective (ZFS VM)
cr0x@server:~$ dmesg | egrep -i 'mpt2sas|mpt3sas|scsi host' | tail -n 8
[ 2.114012] mpt2sas_cm0: LSISAS2308: FWVersion(20.00.07.00), ChipRevision(0x05), BiosVersion(07.39.02.00)
[ 2.116842] scsi host0: ioc0
[ 2.116904] scsi host1: ioc1
What it means: The guest loaded the HBA driver and sees the controller directly.
Decision: Proceed with per-disk vdevs; enable SMART monitoring in the guest.
Task 4: Confirm ZFS pool topology (and spot accidental single points of failure)
cr0x@server:~$ zpool status -v
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
errors: No known data errors
What it means: RAIDZ2 across six disks, no errors. Note the vdevs were added by sdX name; those names can shuffle after reboots, controller changes, or expander re-enumeration.
Decision: If device names are unstable, switch to /dev/disk/by-id paths when creating/replacing drives.
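Mapping kernel names to stable IDs takes one listing; the serial below matches the earlier lsblk example and is, of course, a placeholder for your own.
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep -v part | grep sda
lrwxrwxrwx 1 root root 9 Dec 28 01:02 ata-ST12000NM0008_ZHZ1ABC1 -> ../../sda
lrwxrwxrwx 1 root root 9 Dec 28 01:02 wwn-0x5000c500a1b2c3d4 -> ../../sda
Use the ata-… or wwn-… names in zpool create and zpool replace so the pool doesn’t care which controller port a disk lands on after maintenance.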
Task 5: Check ashift (4K alignment) to avoid permanent performance sadness
cr0x@server:~$ zdb -C tank | egrep 'ashift|vdev_tree' -n | head
41: vdev_tree:
57: ashift: 12
What it means: ashift: 12 implies 4K sectors. If it’s 9 (512B) on 4K drives, you can get write amplification and lower IOPS.
Decision: If ashift is wrong, you rebuild the pool. There is no magic “fix ashift” button.
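Because ashift is fixed per vdev at creation time, the only fix is setting it explicitly when you rebuild. A minimal sketch—pool name, layout, and device IDs are placeholders:
cr0x@server:~$ zpool create -o ashift=12 tank raidz2 \
    /dev/disk/by-id/ata-ST12000NM0008_ZHZ1ABC1 \
    /dev/disk/by-id/ata-ST12000NM0008_ZHZ1ABC2 \
    /dev/disk/by-id/ata-ST12000NM0008_ZHZ1ABC3 \
    /dev/disk/by-id/ata-ST12000NM0008_ZHZ1ABC4 \
    /dev/disk/by-id/ata-ST12000NM0008_ZHZ1ABC5 \
    /dev/disk/by-id/ata-ST12000NM0008_ZHZ1ABC6
cr0x@server:~$ zdb -C tank | grep ashift
ashift: 12
Confirm ashift: 12 before you put data on it, not after.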
Task 6: Measure sync write penalty (tells you whether SLOG matters)
cr0x@server:~$ zfs get -o name,property,value -s local,default sync tank
NAME PROPERTY VALUE
tank sync standard
cr0x@server:~$ dd if=/dev/zero of=/tank/testfile bs=1M count=1024 oflag=dsync status=progress
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 24.6 s, 43.6 MB/s
What it means: oflag=dsync approximates sync write behavior. 43 MB/s on spinning disks might be normal; 3 MB/s suggests a flush/queue problem.
Decision: If sync writes are a bottleneck and you have real sync workloads (databases, NFS with sync), add a proper SLOG device with power-loss protection.
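Adding a SLOG later is non-destructive and takes one command. The device ID below is a placeholder for an enterprise SSD/NVMe with power-loss protection; mirror the log if the pool’s sync guarantees really matter.
cr0x@server:~$ zpool add tank log /dev/disk/by-id/nvme-EXAMPLE_PLP_SSD_SN0001
Re-run the dsync test afterward; if nothing improves, the bottleneck wasn’t the log path.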
Task 7: Verify SLOG presence and what it’s doing
cr0x@server:~$ zpool status tank | egrep -A3 'logs|cache|special'
logs
nvme0n1p2 ONLINE 0 0 0
What it means: Dedicated log device exists. If it’s a consumer NVMe without PLP, it may make things faster right up until it doesn’t.
Decision: Use enterprise SSD/NVMe with PLP for SLOG. If you can’t, don’t pretend you can.
Task 8: Check ZFS compression and recordsize (common “free performance” lever)
cr0x@server:~$ zfs get -o name,property,value compression,recordsize tank
NAME PROPERTY VALUE
tank compression lz4
tank recordsize 128K
What it means: lz4 is typically a win; recordsize 128K is a general-purpose default.
Decision: For VM images, consider a dataset with recordsize=16K or 32K. For backups/media, keep 128K or larger.
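A sketch of that dataset split, with hypothetical dataset names—small records for VM images, large records for bulk data (zvols use volblocksize instead, also set at creation):
cr0x@server:~$ zfs create -o recordsize=16K -o compression=lz4 tank/vmstore
cr0x@server:~$ zfs create -o recordsize=1M -o compression=lz4 tank/backups
cr0x@server:~$ zfs get -o name,property,value recordsize tank/vmstore tank/backups
NAME PROPERTY VALUE
tank/vmstore recordsize 16K
tank/backups recordsize 1M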
Task 9: Look for checksum errors (they tell you about the path, not just the disk)
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda ONLINE 0 0 12
sdb ONLINE 0 0 0
errors: No known data errors
What it means: Non-zero CKSUM can be cabling, backplane, controller, expander, firmware, or the drive.
ZFS corrected it thanks to redundancy, but the path is noisy.
Decision: Check SMART, swap cables, reseat, update HBA firmware. If errors persist on one disk, replace it.
Task 10: Pull SMART data (only works well with real disk access)
cr0x@server:~$ smartctl -a /dev/sda | egrep -i 'model|serial|reallocated|pending|crc|temperature'
Device Model: ST12000NM0008
Serial Number: ZHZ1ABC1
Reallocated_Sector_Ct 0
Current_Pending_Sector 0
UDMA_CRC_Error_Count 18
Temperature_Celsius 41
What it means: CRC errors often scream “cable/backplane,” not “disk media.”
Decision: Fix the transport layer before you blame ZFS. Replace SATA/SAS cables; check backplane seating and expander health.
Task 11: Identify whether the hypervisor path is adding latency (guest view)
cr0x@server:~$ iostat -x 1 5
Device r/s w/s rkB/s wkB/s await svctm %util
sda 5.0 42.0 320.0 8192.0 38.21 2.11 98.7
sdb 4.0 40.0 256.0 8064.0 41.02 2.03 97.9
What it means: High await with moderate svctm can indicate queueing above the device (controller/hypervisor contention).
Decision: If this is a VM with virtual disks, investigate host datastore contention or queue depth settings; consider moving to passthrough HBA.
Task 12: Check TRIM/discard behavior (important for SSD pools and virtual disks)
cr0x@server:~$ zpool get autotrim tank
NAME PROPERTY VALUE SOURCE
tank autotrim off default
cr0x@server:~$ zpool set autotrim=on tank
What it means: If you’re on SSDs, autotrim helps long-term performance. In some virtual disk setups, discard may be unsupported or slow.
Decision: Enable autotrim for SSD pools where supported; benchmark in virtualized environments to ensure trims aren’t causing latency spikes.
Task 13: Confirm whether you’re accidentally double-caching
cr0x@server:~$ zfs get -o name,property,value primarycache,secondarycache tank
NAME PROPERTY VALUE
tank primarycache all
tank secondarycache all
What it means: ZFS will use RAM (ARC) and optional L2ARC. In a VM, the host may also cache aggressively.
Decision: If the VM memory is constrained or ballooning, reduce ZFS ARC size or adjust caches; avoid running ZFS where the hypervisor can steal its memory unpredictably.
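If you do cap ARC, it’s a single module parameter—settable live and persisted via modprobe config. The 8 GiB value below is a placeholder; size it against the host’s real memory budget.
cr0x@server:~$ echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
cr0x@server:~$ echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
cr0x@server:~$ update-initramfs -u
The initramfs rebuild matters on Debian/Proxmox hosts, where ZFS module options are read at early boot.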
Task 14: Validate scrub behavior and time-to-detect problems
cr0x@server:~$ zpool scrub tank
cr0x@server:~$ zpool status tank | egrep -i 'scan|scrub'
scan: scrub in progress since Sun Dec 28 01:10:11 2025
1.23T scanned at 1.05G/s, 220G issued at 187M/s, 10.8T total
0B repaired, 2.04% done, 16:29:15 to go
What it means: Scrub throughput tells you about end-to-end read performance and contention.
Decision: If scrubs are painfully slow in a VM, check host contention, controller queueing, and whether virtual disks are throttled by datastore limits.
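Scheduling is the unglamorous half of scrubbing. A minimal cron sketch, assuming an early Sunday window fits your maintenance calendar—file name, time, and pool name are placeholders, and Debian-based systems may already ship a scrub job in /etc/cron.d:
cr0x@server:~$ cat /etc/cron.d/zfs-scrub
30 2 * * 0 root /usr/sbin/zpool scrub tank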
Three mini-stories from corporate life
1) The incident caused by a wrong assumption: “The hypervisor flushes writes, obviously”
A mid-sized company ran an ESXi cluster with a storage VM providing NFS back to ESXi. The storage VM used ZFS.
Someone built it quickly during a hardware refresh window. They used VMDKs on a VMFS datastore because it made the build “portable.”
It passed basic tests: file copies, a few VM boots, a synthetic benchmark that made everyone feel productive.
Weeks later, a host reboot happened during a power event. UPS worked, mostly. But one host dropped hard.
After power returned, the storage VM came up and ZFS imported the pool. Applications started, and everything looked normal—until a database started throwing logical corruption errors.
Not “disk failed,” not “pool faulted.” Quietly wrong data in a place that mattered.
The investigation was ugly because every layer had plausible deniability. ESXi said the datastore was healthy.
ZFS said the pool was online. The database said, essentially, “I was promised durability and I didn’t get it.”
The core assumption—that sync writes from the guest were faithfully durable on the physical media—was never validated under crash conditions.
Fixing it wasn’t a magical ZFS setting. They rebuilt: HBA passthrough to the storage VM, verified SMART visibility, disabled unsafe write caching on the controller, and retested with deliberate crash simulations.
The takeaway wasn’t “ESXi is bad.” It was that abstraction without validation turns “should” into “probably.”
Production does not run on “probably.”
2) The optimization that backfired: “Let’s add an L2ARC and crank compression settings”
Another org had Proxmox with ZFS on the host, mostly SSDs plus a few HDD mirrors for bulk.
They had a performance complaint: VM boot storms during patch windows caused latency spikes.
Someone proposed adding an L2ARC SSD and tuning knobs: bigger ARC, bigger L2ARC, and some dataset tweaks that were copied from a forum post like a spell.
For a week, things looked improved. Cache hit ratios climbed. Graphs went green.
Then a different symptom appeared: unpredictable stalls during daytime. Not constant slowness—just sharp pauses that made chatty applications time out.
Scrubs also slowed down and stretched into business hours.
Root cause: they added a consumer SSD as L2ARC without power-loss protection and underestimated the extra write load and metadata churn.
Worse, memory pressure from oversized ARC plus VM memory demands pushed the host into reclaim behavior.
The system wasn’t out of RAM; it was out of stable RAM. ZFS cache effectiveness fell off a cliff whenever the host had to juggle.
They backed out the L2ARC, capped ARC sanely, and focused on the actual bottleneck: sync write behavior and queueing during bursts.
They added a proper SLOG device and adjusted VM storage to spread IO peaks.
The “optimization” wasn’t evil—it was just misapplied. ZFS tuning is a scalpel, not confetti.
3) The boring but correct practice that saved the day: consistent burn-in + SMART baselines + scrub schedule
A team running mixed workloads on Proxmox had a habit that looked almost quaint: every new disk got a burn-in.
Not a quick format. A real one—SMART long tests, badblocks, and a baseline snapshot of SMART attributes.
Then they added each disk to a pool only after it survived a weekend of being stressed and bored.
Months later, they started seeing occasional checksum errors on one vdev. Nothing dramatic. ZFS corrected them.
But their monitoring flagged a rising UDMA CRC error count on a single drive, and the baseline made it obvious: it wasn’t “always like that.”
They swapped the cable and the errors stopped.
Two weeks later, a different disk showed reallocated sectors rising. Again: baseline comparison made the trend unambiguous.
They replaced the disk during business hours, resilvered cleanly, and nobody outside infra noticed.
The magic wasn’t a fancy architecture. It was disciplined, repetitive hygiene: scrubs, SMART trend monitoring, and not trusting brand-new disks.
Boring practices don’t get applause. They do prevent the applause from being replaced by screaming.
Fast diagnosis playbook
When “ZFS is slow” hits your queue, you need a path to an answer in minutes, not a weekend of vibe-based tuning.
This playbook assumes you want to identify the bottleneck quickly and decide whether it’s a disk/controller issue, a sync/flush issue, or a virtualization layering issue.
First: identify the storage presentation path
- Is ZFS on the Proxmox host? Or inside a VM (ESXi or Proxmox)?
- Does the ZFS system see real disks with serials? (lsblk, smartctl)
- Is there a RAID controller doing “helpful” caching?
Second: separate latency from throughput
- Use iostat -x to see await and %util per device.
- Use a sync write test (dd ... oflag=dsync) to see if the pain is sync-specific.
- Check pool health (zpool status) for errors that imply retries.
Third: look for queueing and contention above the disks
- In a VM: check whether the host datastore is saturated or snapshot chains exist.
- Check whether ZFS ARC is memory-thrashing due to VM ballooning or host pressure (a quick ARC check follows this list).
- Confirm you’re not running a scrub/resilver during peak IO windows.
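The ARC check answers itself from the kernel’s own counters—no extra tooling needed. Field names below come from /proc/spl/kstat/zfs/arcstats on ZFS-on-Linux; the values are illustrative.
cr0x@server:~$ awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
size 8341234688
c_max 8589934592
If size keeps slamming into c_max while the host is also under memory pressure, ARC and your VMs are fighting; cap ARC or rebalance memory.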
Fourth: decide the fix class
- Design fix: move from VMDKs/RDM to HBA passthrough, add SLOG/metadata devices, rebuild with correct ashift.
- Operational fix: cable/firmware replacement, adjust scrub schedule, tune ARC cap.
- Expectation fix: the workload is random sync IO on rust; physics says “no.” Add SSDs or change architecture.
Common mistakes: symptoms → root cause → fix
1) Symptom: Random latency spikes, especially during snapshots/backups
Root cause: ZFS pool built on VMDKs with snapshot chains on the hypervisor, causing extra metadata IO and copy operations.
Fix: Avoid hypervisor snapshots for storage VMs; use ZFS snapshots/replication instead. Prefer HBA passthrough so the pool is not a set of VMDKs.
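What “use ZFS snapshots/replication instead” looks like in practice—a minimal sketch with hypothetical dataset, snapshot, and host names:
cr0x@server:~$ zfs snapshot -r tank/vmstore@nightly-20251228
cr0x@server:~$ zfs send -R tank/vmstore@nightly-20251228 | ssh backup01 zfs receive -uF backup/vmstore
Subsequent runs use incremental sends (-i previous-snapshot) so you’re not shipping the full dataset every night.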
2) Symptom: ZFS reports checksum errors, but SMART looks clean
Root cause: Transport issues (SAS/SATA cable, backplane, expander, marginal connector) or controller firmware quirks.
Fix: Check CRC errors, swap cables, reseat drives, update HBA firmware, and ensure proper cooling for HBAs and expanders.
3) Symptom: Sync writes are painfully slow; async seems fine
Root cause: No SLOG for sync-heavy workloads, or write cache/flush semantics causing forced flushes to stall.
Fix: Add an enterprise SLOG device with PLP; verify controller cache is safe; validate with dd oflag=dsync and workload-specific tests.
4) Symptom: Pool imports slowly or devices “move around” after reboot
Root cause: Unstable device naming (using /dev/sdX), especially in virtualized environments or with expanders.
Fix: Use /dev/disk/by-id when creating vdevs and replacing drives; document slot-to-serial mappings.
5) Symptom: Great benchmark numbers, awful application performance
Root cause: Benchmark is measuring cache (ARC/host cache), not disk. Or it’s sequential throughput while the app is random IO.
Fix: Test with sync/random patterns relevant to the app; watch latency metrics, not just MB/s; cap ARC to avoid host memory fights.
6) Symptom: After power loss, ZFS pool is online but apps show corruption
Root cause: Write acknowledgements were not durable (unsafe cache, virtualization flush mismatch), causing torn or lost writes above ZFS’s awareness.
Fix: Use HBA passthrough, disable volatile write caches without protection, validate crash consistency, and use proper SLOG for sync workloads.
7) Symptom: SSD pool performance decays over time
Root cause: TRIM/discard not functioning through the stack; high fragmentation and no reclamation.
Fix: Enable autotrim on ZFS where supported; ensure the underlying virtual disk path passes discard; otherwise schedule manual trim or redesign.
8) Symptom: Resilver takes forever and impacts everything
Root cause: Pool is near-full, disks are slow, and virtualization adds contention. Or you’re resilvering through a constrained controller/expander.
Fix: Keep pools below sensible fullness, prefer mirrors for IOPS-heavy workloads, schedule resilvers/scrubs off-peak, and avoid oversubscribed HBAs.
Checklists / step-by-step plan
Design checklist: pick the controller path
- Decide where ZFS lives: Proxmox host (preferred for Proxmox-first shops) or storage VM (common for ESXi-first shops).
- If ZFS is in a VM: require PCIe passthrough HBA unless you can justify the monitoring and integrity tradeoffs.
- Choose HBA: SAS HBA in IT mode; avoid RAID firmware and volatile caches.
- Plan for monitoring: SMART, zpool alerts, scrub schedule, and serial-to-slot mapping.
- Plan for recovery: can you move the disks/HBA to another host and import?
Implementation checklist: Proxmox host ZFS
- Set BIOS to AHCI for onboard SATA; enable IOMMU if you need passthrough for other devices (see the IOMMU sketch after this checklist).
- Install Proxmox; verify disks show real models/serials in lsblk.
- Create pool with by-id paths; verify ashift before you commit.
- Enable compression (lz4) and set dataset recordsize per workload.
- Configure scrub schedule and SMART monitoring.
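The IOMMU step from the first bullet, sketched for a GRUB-booted Intel host. Assumptions: AMD hosts typically need nothing beyond enabling IOMMU in firmware, and root-on-ZFS Proxmox installs using systemd-boot edit /etc/kernel/cmdline instead of GRUB.
cr0x@server:~$ grep CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
cr0x@server:~$ update-grub
Reboot, then confirm with dmesg | grep -i -e DMAR -e IOMMU that remapping is actually active before you try passthrough.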
Implementation checklist: ESXi with ZFS storage VM
- Install an HBA in IT mode; confirm it’s in its own IOMMU group and supported for passthrough.
- Enable passthrough on the ESXi host and attach the HBA to the storage VM.
- In the VM, confirm the HBA driver loads and disks appear with serial numbers.
- Create the ZFS pool directly on disks; set up periodic scrubs and SMART checks.
- Export storage back to ESXi over NFS/iSCSI; measure sync behavior based on your workloads (a sharenfs sketch follows this checklist).
- Document the operational constraints: no vMotion for that VM; patching requires planning.
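If the storage VM is a plain Linux+ZFS build (TrueNAS has its own sharing UI), the NFS export can be driven by the sharenfs property, assuming an NFS server is installed. Dataset name, network, and export options below are placeholders—adapt them to your environment and security posture.
cr0x@server:~$ zfs set sharenfs="rw=@10.10.10.0/24,no_root_squash" tank/vmstore
cr0x@server:~$ zfs get -o name,property,value sharenfs tank/vmstore
NAME PROPERTY VALUE
tank/vmstore sharenfs rw=@10.10.10.0/24,no_root_squash
Mount it in ESXi as an NFS datastore, and remember that ESXi issues sync writes over NFS—this is exactly where the SLOG discussion stops being theoretical.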
Ops checklist: monthly boring hygiene
- Review zpool status for errors and slow resilvers.
- Review SMART deltas: reallocations, pending sectors, CRC errors, temperatures.
- Confirm scrubs are completing within expected windows.
- Validate free space headroom and fragmentation risks.
- Test restores (file-level and VM-level) like you mean it.
FAQ
Is HBA passthrough always required for ZFS in a VM?
Required? No. Recommended for production where you care about data integrity, monitoring, and predictable recovery? Yes.
Virtual disks can work, but you accept blind spots and more complex failure modes.
Is RDM “good enough” on ESXi?
Sometimes. It’s typically better than VMDKs for presenting disks, but it’s still not as clean as passthrough HBA.
The big question is whether the guest can reliably see errors and SMART and whether flush semantics behave under stress.
Can I put ZFS on top of a RAID controller?
You can, in the same sense you can tow a boat with a sports car: it might move, and it will create stories.
For ZFS redundancy, use JBOD/HBA behavior so ZFS owns redundancy and repair logic.
Why do people say ZFS needs “direct disk access”?
Because ZFS’s self-healing and integrity model is strongest when it can see true device errors and control ordering/durability assumptions.
Abstraction layers can hide the signals ZFS uses to diagnose and repair.
What virtual controller is best if I must use virtual disks?
On Linux guests, paravirtual SCSI (where available) usually behaves better than older emulated controllers.
On Proxmox, virtio-scsi tends to be the sane default. But remember: controller choice can’t restore SMART visibility lost by VMDKs.
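On Proxmox, that sane default looks roughly like this for an existing VM—the VM ID, storage name, and volume are placeholders, and discard/iothread are optional but usually worth enabling on SSD-backed storage:
cr0x@server:~$ qm set 100 --scsihw virtio-scsi-single --scsi0 local-zfs:vm-100-disk-0,discard=on,iothread=1,ssd=1
None of this restores SMART visibility in the guest; it just keeps the virtual path from being the slowest and dumbest part of the stack.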
Does ZFS inside a VM eliminate the need for a SAN/NAS?
It can replace it for some environments—especially SMBs—if you treat the storage VM as a real storage appliance:
dedicated hardware path, disciplined upgrades, tested restores, and clear failure domain boundaries.
Should I run ZFS on the Proxmox host or in a VM?
On Proxmox: run ZFS on the host unless you have a compelling reason not to. It’s simpler and more observable.
If you need a storage appliance feature set (NAS services, app ecosystem) and accept the complexity, a VM approach can work.
How do I know if my sync writes are the bottleneck?
If databases or NFS workloads stall and your async benchmarks look fine, test sync behavior with dd oflag=dsync and observe latency.
If sync is slow, consider SLOG with PLP and verify caching policies.
What’s the single biggest red flag in a ZFS virtualization design review?
A storage VM running ZFS on thin-provisioned VMDKs with hypervisor snapshots enabled and no clear monitoring for datastore fullness.
It’s a slow-motion outage machine.
Will HBA passthrough prevent all corruption?
No. It reduces layers that can lie about durability and improves error visibility.
You still need redundancy, scrubs, SMART monitoring, tested backups, and sane operational processes.
Practical next steps
If you’re building new and ZFS is central: choose the architecture that lets ZFS see and control the disks.
On Proxmox, that usually means ZFS on the host with a real HBA (or AHCI SATA) and no RAID cleverness.
On ESXi, that usually means a dedicated HBA passed through to a storage VM, with storage exported back over NFS/iSCSI.
If you already deployed ZFS on virtual disks in production: don’t panic, but stop improvising.
Run the diagnostic tasks above, validate sync write behavior, check whether you can see SMART, and decide whether the risk profile matches your business.
Then schedule the migration to passthrough HBA the same way you schedule any risk-reduction project: deliberately, tested, and with a rollback plan.
The controller path isn’t a cosmetic detail. It’s the difference between ZFS being a reliable narrator and ZFS being trapped in the back seat with a blindfold.
Let it drive.