The ticket says “storage is slow.” The graph says “latency spiked.” The application team says “it was fine yesterday.”
And you’re staring at a ZFS pool running inside a VM, backed by “some disks,” presented by “some controller,” through “some hypervisor magic.”
This is where ZFS either looks like a miracle filesystem—or like a murder mystery with too many suspects.
The controller path you choose (HBA passthrough vs virtual disks) determines what ZFS can see, what it can fix, and what it can only guess.
Guessing is not a storage strategy.
The actual decision: what ZFS needs vs what hypervisors offer
ZFS is not shy. It wants direct-ish access to disks, stable identities, correct cache flush semantics, predictable latency, and visibility into errors.
Hypervisors, by design, abstract hardware. Sometimes beautifully. Sometimes catastrophically.
Your controller choice is really three choices bundled together:
- Visibility: Can ZFS see SMART, error counters, and true device behavior—or only a virtual block device?
- Ordering and flush guarantees: When ZFS says “sync this,” does the stack actually sync it to non-volatile media?
- Failure blast radius: Does one layer’s “helpful caching” turn a host crash into pool corruption?
If you remember one line, make it this: ZFS can tolerate slow disks; it cannot tolerate lies.
“Lies” here means write acknowledgements that aren’t durable yet, or error reporting that gets swallowed, rewritten, or delayed until it’s too late.
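A quick sanity check you can run at any layer: ask the drive (or whatever is pretending to be the drive) whether its volatile write cache is enabled. A minimal sketch for a SATA device—the device name is just an example, and SAS drives need sdparm instead of hdparm.
cr0x@server:~$ hdparm -W /dev/sda
/dev/sda:
 write-caching = 1 (on)
Write cache on is fine when the flush path is honest end to end; it becomes a lie when something between ZFS and the platters acknowledges flushes it never performed.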
So… Proxmox vs ESXi?
If you want ZFS as your primary storage stack on the host: Proxmox is the straightforward option because it’s Linux, and ZFS-on-Linux is a normal citizen there.
If you want ESXi for its virtualization ecosystem but still want ZFS features: you’re usually talking about ZFS inside a VM (TrueNAS, OmniOS, Debian+ZFS, etc.).
That’s viable, but your controller path becomes the whole game.
Opinionated guidance:
- Best practice for ZFS-in-a-VM: PCIe passthrough of an HBA in IT mode to the ZFS VM.
- Acceptable for light duty / lab: virtual disks (VMDK/virtio-blk) only if you understand what you’re losing.
- What I avoid in production: putting ZFS on top of a RAID controller presenting logical volumes, or stacking ZFS on thin-provisioned virtual disks without guardrails.
Interesting facts and historical context (that still bites today)
- ZFS was born at Sun Microsystems and shipped in Solaris (mid-2000s), built around end-to-end checksumming and copy-on-write to prevent silent corruption.
- Copy-on-write is why ZFS hates “lying” caches: it relies on transaction groups and ordered durability; acknowledgements that aren’t durable can poison consistency after a crash.
- VMware’s VMFS datastores were designed for shared-block virtualization, not for exposing raw disk semantics to a guest filesystem with its own RAID logic.
- HBAs in “IT mode” became popular largely because ZFS (and later Ceph) wanted simple JBOD behavior: no RAID metadata, no controller caching surprises.
- LSI SAS2008/SAS2308 era cards (and their many OEM rebrands) became the de facto homelab and SMB standard because they were cheap and supported passthrough well.
- 4K sector drives forced the industry’s hand: ashift mismatches can permanently waste IOPS and space; virtualization layers sometimes hide true sector size.
- Write cache policies have caused real outages for decades—long before ZFS—because “cache enabled” plus “no battery/flash backup” is just “data loss later.”
- SMART passthrough is not a given: many virtual disk paths don’t provide drive health details to the guest, which breaks proactive ops workflows.
Disk presentation paths: passthrough HBA, RDM, VMDK, and “it depends”
Path A: PCIe passthrough HBA (IT mode) to the ZFS VM
This is the “stop being cute” approach. The ZFS VM owns the HBA, enumerates the disks directly, sees SMART, sees serials, and gets real error behavior.
It’s the closest you get to bare metal while still virtualizing compute around it.
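On Proxmox, handing the HBA to the storage VM is a one-liner once IOMMU is enabled; on ESXi you toggle passthrough per device in the host UI and add it to the VM as a PCI device. The VM ID and PCI address below are placeholders—substitute your own from lspci.
cr0x@server:~$ qm set 100 -hostpci0 0000:03:00.0
After the VM boots, verify the guest actually sees the controller and disks (Tasks 1 and 3 below).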
Pros
- Best disk visibility: SMART, temps, error counters, real device IDs.
- Best alignment with ZFS assumptions about flushes and error handling.
- Predictable performance characteristics: queueing and latency are less “mystery meat.”
- Easier incident response: when a drive is dying, ZFS tells you which one, not “naa.600…something.”
Cons
- You lose some hypervisor conveniences (vMotion for that VM, snapshots in the usual sense).
- Passthrough can complicate host upgrades and hardware changes.
- Some consumer boards/IOMMU groupings make it annoying.
Path B: ESXi RDM (Raw Device Mapping) to a ZFS VM
RDM is VMware’s “kind of raw” mapping. In practice it’s still mediated.
Depending on mode (virtual vs physical compatibility), it may or may not forward what ZFS wants.
It can be workable, but it’s rarely the best option when passthrough is available.
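If you are stuck with RDM, physical-compatibility mappings are created on the ESXi host with vmkfstools and then attached to the VM as existing disks. The device identifier and datastore path below are hypothetical—list your devices first and map each disk individually.
cr0x@esxi:~$ vmkfstools -z /vmfs/devices/disks/naa.5000c500a1b2c3d4 /vmfs/volumes/datastore1/storage-vm/disk1-rdm.vmdk
The -z flag requests physical compatibility mode, which forwards more SCSI commands to the guest than virtual mode (-r). Still test whether SMART actually comes through—don’t assume.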
Pros
- Better than a generic VMDK for some use cases.
- Can allow some tooling integration with ESXi.
Cons
- SMART and error visibility are often limited or inconsistent.
- More layers where cache/flush semantics can get weird.
- Operational ambiguity: “Is the guest really seeing the disk?” becomes a recurring question.
Path C: Virtual disks (VMDK on VMFS; virtio disks on Proxmox)
This is the most common way people try ZFS in a VM because it’s convenient and looks clean in the UI.
It’s also where ZFS gets the least truth about the underlying media.
If you do this for anything you care about, treat it like a product decision: define reliability requirements, test crash behavior, and accept the monitoring blind spots.
Pros
- Easy lifecycle operations: migrate, snapshot, clone.
- Hardware-agnostic: no special HBA needed.
- Great for dev/test, CI, disposable environments.
Cons
- ZFS loses direct SMART and some error semantics.
- Misaligned sector sizes and volatile caches can create performance and integrity issues.
- Double caching and queueing (guest + host) can produce latency spikes that look like “random” application stalls.
Joke #1: Virtual disks for production ZFS are like putting racing slicks on a forklift—technically it rolls, but it’s not the kind of speed you wanted.
Path D: Hardware RAID controller logical volumes presented to ZFS
Don’t. ZFS already is the RAID layer. Putting it on top of another RAID layer is a great way to get:
mismatched failure handling, masked errors, write hole behavior, and a support blame game.
There are niche exceptions (single-disk RAID0 per disk on a controller purely for pass-through behavior),
but that’s usually a hack for when you bought the wrong controller.
For new builds: buy the right hardware.
Proxmox: ZFS is first-class, but hardware still matters
On Proxmox, ZFS typically runs on the host. That’s the cleanest architecture: the kernel sees disks directly,
ZFS manages them directly, and your VMs see storage over zvols, datasets, or network shares.
The “controller decision” still exists, but it’s simpler: you either attach disks to the Proxmox host via a proper HBA (or onboard SATA in AHCI),
or you complicate your life with RAID controllers and caching policies.
What to buy (and how to configure it)
- HBA in IT mode (LSI/Broadcom family) is the classic answer for SAS/SATA backplanes and expanders.
- Onboard SATA in AHCI mode is fine for small direct-attached SATA builds, assuming the chipset and cabling are sane.
- Avoid RAID mode unless you can force true HBA/JBOD behavior with caching disabled and stable identifiers—still second-best.
Performance reality check
Proxmox with ZFS isn’t “slow.” It’s honest. If your pool is six rust disks doing sync writes, latency will reflect physics.
ZFS will gladly show you the bill for your design decisions.
ESXi: excellent hypervisor, awkward ZFS guest story
ESXi is great at what it was built to do: run VMs, manage them at scale, abstract hardware, and provide stable ops tooling.
The friction happens when you want a guest filesystem (ZFS) to behave like it’s on bare metal while ESXi is doing what hypervisors do.
What to do on ESXi if you want ZFS features
- Best: dedicate an HBA and use PCIe passthrough to a storage VM (TrueNAS, etc.). Present storage back to ESXi over NFS/iSCSI.
- Okay-ish: RDM in physical mode for each disk, if passthrough is impossible. Validate SMART visibility and flush behavior.
- Risky: ZFS over VMDKs on VMFS, especially thin-provisioned, with snapshots, and no monitoring for underlying datastore pressure.
The philosophical problem: who is responsible for integrity?
In an ESXi + VMFS world, VMware’s stack is responsible for datastore integrity, redundancy, and recovery.
In a ZFS world, ZFS wants to own that job end-to-end.
If you let both stacks “help,” you can end up with neither being fully accountable.
One quote worth keeping on a sticky note:
Hope is not a strategy.
— a traditional SRE saying (origin disputed)
Why controller choice changes failure modes (and your pager)
1) ZFS needs stable disk identity
ZFS labels disks and expects them to stay the same device.
Virtual disks are stable in their own way, but they hide the underlying drive topology.
If a physical disk starts erroring, ZFS in the guest might only see “virtual disk read error,” not “Seagate SN1234 is dying on bay 7.”
2) SMART and predictive failure signals
SMART isn’t perfect, but it’s what we have. Reallocated sectors, pending sectors, CRC errors—these are early warnings.
With passthrough, you can monitor them inside the ZFS system and replace disks before a resilver turns into a week-long drama.
With VMDKs, you’re often blind unless you also monitor from the host with separate tooling and correlate manually.
3) Cache flush semantics and sync writes
ZFS cares deeply about what “sync” means. Databases care too.
If the hypervisor or controller acknowledges writes before they’re durable, a power loss can leave ZFS believing data is safely on disk when it isn’t.
ZFS is robust, but it can’t checksum data that never made it.
4) Queue depth and latency variance
Many “ZFS is slow in a VM” reports are really “your IO path has three queues, two caches, and one tiny bottleneck.”
HBAs with passthrough let the guest manage the queueing directly.
With virtual disks, you can get unpredictable latency spikes from host contention, datastore metadata operations, or snapshot chains.
5) Recoverability when the host is sick
With passthrough HBA, the ZFS VM is more self-contained: move the HBA + disks to another host, import pool, carry on.
With VMDKs, your pool is entangled with VMFS datastores, ESXi host configuration, and sometimes the state of snapshots.
Recovery becomes a multi-layer archaeological dig.
Joke #2: Storage layers are like lasagna—too many and you’ll spend the night regretting your choices.
Practical tasks: commands, output interpretation, decisions
The point of commands is not to feel busy. It’s to reduce uncertainty until the right decision is obvious.
Below are real tasks I use in production triage and design reviews. Each includes what to look for and what you decide next.
Task 1: Confirm ZFS sees real disks (not a single giant virtual blob)
cr0x@server:~$ lsblk -o NAME,MODEL,SERIAL,SIZE,ROTA,TYPE
NAME MODEL SERIAL SIZE ROTA TYPE
sda ST12000NM0008 ZHZ1ABC1 10.9T 1 disk
sdb ST12000NM0008 ZHZ1ABC2 10.9T 1 disk
sdc ST12000NM0008 ZHZ1ABC3 10.9T 1 disk
nvme0n1 Samsung SSD 980 S64DNF0R1234 931G 0 disk
What it means: You have per-disk model and serial. That usually indicates direct attachment or passthrough HBA.
If you see “VMware Virtual disk” (or identical models/serials), you’re on virtual disks.
Decision: If this is a ZFS VM and you care about ops visibility, move toward HBA passthrough.
Task 2: Check whether the controller is a real HBA and in a sane mode
cr0x@server:~$ lspci -nn | egrep -i 'sas|sata|raid|lsi|broadcom'
03:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 [1000:0087]
What it means: LSI SAS2308 class device is a common HBA suitable for IT mode.
If you see a RAID controller model, double-check it’s not doing write caching or RAID presentation.
Decision: Prefer an HBA in IT mode for ZFS; avoid hardware RAID volumes as ZFS vdevs.
Task 3: Verify passthrough from the guest perspective (ZFS VM)
cr0x@server:~$ dmesg | egrep -i 'mpt2sas|mpt3sas|scsi host' | tail -n 8
[ 2.114012] mpt2sas_cm0: LSISAS2308: FWVersion(20.00.07.00), ChipRevision(0x05), BiosVersion(07.39.02.00)
[ 2.116842] scsi host0: ioc0
[ 2.116904] scsi host1: ioc1
What it means: The guest loaded the HBA driver and sees the controller directly.
Decision: Proceed with per-disk vdevs; enable SMART monitoring in the guest.
Task 4: Confirm ZFS pool topology (and spot accidental single points of failure)
cr0x@server:~$ zpool status -v
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
errors: No known data errors
What it means: RAIDZ2 across six disks, no errors. Note the vdevs were added by sdX name; those names can shuffle after reboots, controller changes, or expander re-enumeration.
Decision: If device names are unstable, switch to /dev/disk/by-id paths when creating/replacing drives.
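Mapping kernel names to stable IDs takes one listing; the serial below matches the earlier lsblk example and is, of course, a placeholder for your own.
cr0x@server:~$ ls -l /dev/disk/by-id/ | grep -v part | grep sda
lrwxrwxrwx 1 root root 9 Dec 28 01:02 ata-ST12000NM0008_ZHZ1ABC1 -> ../../sda
lrwxrwxrwx 1 root root 9 Dec 28 01:02 wwn-0x5000c500a1b2c3d4 -> ../../sda
Use the ata-… or wwn-… names in zpool create and zpool replace so the pool doesn’t care which controller port a disk lands on after maintenance.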
Task 5: Check ashift (4K alignment) to avoid permanent performance sadness
cr0x@server:~$ zdb -C tank | egrep 'ashift|vdev_tree' -n | head
41: vdev_tree:
57: ashift: 12
What it means: ashift: 12 implies 4K sectors. If it’s 9 (512B) on 4K drives, you can get write amplification and lower IOPS.
Decision: If ashift is wrong, you rebuild the pool. There is no magic “fix ashift” button.
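Because ashift is fixed per vdev at creation time, the only fix is setting it explicitly when you rebuild. A minimal sketch—pool name, layout, and device IDs are placeholders:
cr0x@server:~$ zpool create -o ashift=12 tank raidz2 \
    /dev/disk/by-id/ata-ST12000NM0008_ZHZ1ABC1 \
    /dev/disk/by-id/ata-ST12000NM0008_ZHZ1ABC2 \
    /dev/disk/by-id/ata-ST12000NM0008_ZHZ1ABC3 \
    /dev/disk/by-id/ata-ST12000NM0008_ZHZ1ABC4 \
    /dev/disk/by-id/ata-ST12000NM0008_ZHZ1ABC5 \
    /dev/disk/by-id/ata-ST12000NM0008_ZHZ1ABC6
cr0x@server:~$ zdb -C tank | grep ashift
ashift: 12
Confirm ashift: 12 before you put data on it, not after.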
Task 6: Measure sync write penalty (tells you whether SLOG matters)
cr0x@server:~$ zfs get -o name,property,value -s local,default sync tank
NAME PROPERTY VALUE
tank sync standard
cr0x@server:~$ dd if=/dev/zero of=/tank/testfile bs=1M count=1024 oflag=dsync status=progress
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 24.6 s, 43.6 MB/s
What it means: oflag=dsync approximates sync write behavior. 43 MB/s on spinning disks might be normal; 3 MB/s suggests a flush/queue problem.
Decision: If sync writes are a bottleneck and you have real sync workloads (databases, NFS with sync), add a proper SLOG device with power-loss protection.
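Adding a SLOG later is non-destructive and takes one command. The device ID below is a placeholder for an enterprise SSD/NVMe with power-loss protection; mirror the log if the pool’s sync guarantees really matter.
cr0x@server:~$ zpool add tank log /dev/disk/by-id/nvme-EXAMPLE_PLP_SSD_SN0001
Re-run the dsync test afterward; if nothing improves, the bottleneck wasn’t the log path.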
Task 7: Verify SLOG presence and what it’s doing
cr0x@server:~$ zpool status tank | egrep -A3 'logs|cache|special'
logs
nvme0n1p2 ONLINE 0 0 0
What it means: Dedicated log device exists. If it’s a consumer NVMe without PLP, it may make things faster right up until it doesn’t.
Decision: Use enterprise SSD/NVMe with PLP for SLOG. If you can’t, don’t pretend you can.
Task 8: Check ZFS compression and recordsize (common “free performance” lever)
cr0x@server:~$ zfs get -o name,property,value compression,recordsize tank
NAME PROPERTY VALUE
tank compression lz4
tank recordsize 128K
What it means: lz4 is typically a win; recordsize 128K is a general-purpose default.
Decision: For VM images, consider a dataset with recordsize=16K or 32K. For backups/media, keep 128K or larger.
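A sketch of that dataset split, with hypothetical dataset names—small records for VM images, large records for bulk data (zvols use volblocksize instead, also set at creation):
cr0x@server:~$ zfs create -o recordsize=16K -o compression=lz4 tank/vmstore
cr0x@server:~$ zfs create -o recordsize=1M -o compression=lz4 tank/backups
cr0x@server:~$ zfs get -o name,property,value recordsize tank/vmstore tank/backups
NAME PROPERTY VALUE
tank/vmstore recordsize 16K
tank/backups recordsize 1M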
Task 9: Look for checksum errors (they tell you about the path, not just the disk)
cr0x@server:~$ zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda ONLINE 0 0 12
sdb ONLINE 0 0 0
errors: No known data errors
What it means: Non-zero CKSUM can be cabling, backplane, controller, expander, firmware, or the drive.
ZFS corrected it thanks to redundancy, but the path is noisy.
Decision: Check SMART, swap cables, reseat, update HBA firmware. If errors persist on one disk, replace it.
Task 10: Pull SMART data (only works well with real disk access)
cr0x@server:~$ smartctl -a /dev/sda | egrep -i 'model|serial|reallocated|pending|crc|temperature'
Device Model: ST12000NM0008
Serial Number: ZHZ1ABC1
Reallocated_Sector_Ct 0
Current_Pending_Sector 0
UDMA_CRC_Error_Count 18
Temperature_Celsius 41
What it means: CRC errors often scream “cable/backplane,” not “disk media.”
Decision: Fix the transport layer before you blame ZFS. Replace SATA/SAS cables; check backplane seating and expander health.
Task 11: Identify whether the hypervisor path is adding latency (guest view)
cr0x@server:~$ iostat -x 1 5
Device r/s w/s rkB/s wkB/s await svctm %util
sda 5.0 42.0 320.0 8192.0 38.21 2.11 98.7
sdb 4.0 40.0 256.0 8064.0 41.02 2.03 97.9
What it means: High await with moderate svctm can indicate queueing above the device (controller/hypervisor contention).
Decision: If this is a VM with virtual disks, investigate host datastore contention or queue depth settings; consider moving to passthrough HBA.
Task 12: Check TRIM/discard behavior (important for SSD pools and virtual disks)
cr0x@server:~$ zpool get autotrim tank
NAME PROPERTY VALUE SOURCE
tank autotrim off default
cr0x@server:~$ zpool set autotrim=on tank
What it means: If you’re on SSDs, autotrim helps long-term performance. In some virtual disk setups, discard may be unsupported or slow.
Decision: Enable autotrim for SSD pools where supported; benchmark in virtualized environments to ensure trims aren’t causing latency spikes.
Task 13: Confirm whether you’re accidentally double-caching
cr0x@server:~$ zfs get -o name,property,value primarycache,secondarycache tank
NAME PROPERTY VALUE
tank primarycache all
tank secondarycache all
What it means: ZFS will use RAM (ARC) and optional L2ARC. In a VM, the host may also cache aggressively.
Decision: If the VM memory is constrained or ballooning, reduce ZFS ARC size or adjust caches; avoid running ZFS where the hypervisor can steal its memory unpredictably.
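If you do cap ARC, it’s a single module parameter—settable live and persisted via modprobe config. The 8 GiB value below is a placeholder; size it against the host’s real memory budget.
cr0x@server:~$ echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
cr0x@server:~$ echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
cr0x@server:~$ update-initramfs -u
The initramfs rebuild matters on Debian/Proxmox hosts, where ZFS module options are read at early boot.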
Task 14: Validate scrub behavior and time-to-detect problems
cr0x@server:~$ zpool scrub tank
cr0x@server:~$ zpool status tank | egrep -i 'scan|scrub'
scan: scrub in progress since Sun Dec 28 01:10:11 2025
1.23T scanned at 1.05G/s, 220G issued at 187M/s, 10.8T total
0B repaired, 2.04% done, 16:29:15 to go
What it means: Scrub throughput tells you about end-to-end read performance and contention.
Decision: If scrubs are painfully slow in a VM, check host contention, controller queueing, and whether virtual disks are throttled by datastore limits.
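Scheduling is the unglamorous half of scrubbing. A minimal cron sketch, assuming an early Sunday window fits your maintenance calendar—file name, time, and pool name are placeholders, and Debian-based systems may already ship a scrub job in /etc/cron.d:
cr0x@server:~$ cat /etc/cron.d/zfs-scrub
30 2 * * 0 root /usr/sbin/zpool scrub tank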
Three mini-stories from corporate life
1) The incident caused by a wrong assumption: “The hypervisor flushes writes, obviously”
A mid-sized company ran an ESXi cluster with a storage VM providing NFS back to ESXi. The storage VM used ZFS.
Someone built it quickly during a hardware refresh window. They used VMDKs on a VMFS datastore because it made the build “portable.”
It passed basic tests: file copies, a few VM boots, a synthetic benchmark that made everyone feel productive.
Weeks later, a host reboot happened during a power event. UPS worked, mostly. But one host dropped hard.
After power returned, the storage VM came up and ZFS imported the pool. Applications started, and everything looked normal—until a database started throwing logical corruption errors.
Not “disk failed,” not “pool faulted.” Quietly wrong data in a place that mattered.
The investigation was ugly because every layer had plausible deniability. ESXi said the datastore was healthy.
ZFS said the pool was online. The database said, essentially, “I was promised durability and I didn’t get it.”
The core assumption—that sync writes from the guest were faithfully durable on the physical media—was never validated under crash conditions.
Fixing it wasn’t a magical ZFS setting. They rebuilt: HBA passthrough to the storage VM, verified SMART visibility, disabled unsafe write caching on the controller, and retested with deliberate crash simulations.
The takeaway wasn’t “ESXi is bad.” It was that abstraction without validation turns “should” into “probably.”
Production does not run on “probably.”
2) The optimization that backfired: “Let’s add an L2ARC and crank compression settings”
Another org had Proxmox with ZFS on the host, mostly SSDs plus a few HDD mirrors for bulk.
They had a performance complaint: VM boot storms during patch windows caused latency spikes.
Someone proposed adding an L2ARC SSD and tuning knobs: bigger ARC, bigger L2ARC, and some dataset tweaks that were copied from a forum post like a spell.
For a week, things looked improved. Cache hit ratios climbed. Graphs went green.
Then a different symptom appeared: unpredictable stalls during daytime. Not constant slowness—just sharp pauses that made chatty applications time out.
Scrubs also slowed down and stretched into business hours.
Root cause: they added a consumer SSD as L2ARC without power-loss protection and underestimated the extra write load and metadata churn.
Worse, memory pressure from oversized ARC plus VM memory demands pushed the host into reclaim behavior.
The system wasn’t out of RAM; it was out of stable RAM. ZFS cache effectiveness fell off a cliff whenever the host had to juggle.
They backed out the L2ARC, capped ARC sanely, and focused on the actual bottleneck: sync write behavior and queueing during bursts.
They added a proper SLOG device and adjusted VM storage to spread IO peaks.
The “optimization” wasn’t evil—it was just misapplied. ZFS tuning is a scalpel, not confetti.
3) The boring but correct practice that saved the day: consistent burn-in + SMART baselines + scrub schedule
A team running mixed workloads on Proxmox had a habit that looked almost quaint: every new disk got a burn-in.
Not a quick format. A real one—SMART long tests, badblocks, and a baseline snapshot of SMART attributes.
Then they added each disk to a pool only after it survived a weekend of being stressed and bored.
Months later, they started seeing occasional checksum errors on one vdev. Nothing dramatic. ZFS corrected them.
But their monitoring flagged a rising UDMA CRC error count on a single drive, and the baseline made it obvious: it wasn’t “always like that.”
They swapped the cable and the errors stopped.
Two weeks later, a different disk showed reallocated sectors rising. Again: baseline comparison made the trend unambiguous.
They replaced the disk during business hours, resilvered cleanly, and nobody outside infra noticed.
The magic wasn’t a fancy architecture. It was disciplined, repetitive hygiene: scrubs, SMART trend monitoring, and not trusting brand-new disks.
Boring practices don’t get applause. They do prevent the applause from being replaced by screaming.
Fast diagnosis playbook
When “ZFS is slow” hits your queue, you need a path to an answer in minutes, not a weekend of vibe-based tuning.
This playbook assumes you want to identify the bottleneck quickly and decide whether it’s a disk/controller issue, a sync/flush issue, or a virtualization layering issue.
First: identify the storage presentation path
- Is ZFS on the Proxmox host? Or inside a VM (ESXi or Proxmox)?
- Does the ZFS system see real disks with serials? (lsblk, smartctl)
- Is there a RAID controller doing “helpful” caching?
Second: separate latency from throughput
- Use iostat -x to see await and %util per device.
- Use a sync write test (dd ... oflag=dsync) to see if the pain is sync-specific.
- Check pool health (zpool status) for errors that imply retries.
Third: look for queueing and contention above the disks
- In a VM: check whether the host datastore is saturated or snapshot chains exist.
- Check whether ZFS ARC is memory-thrashing due to VM ballooning or host pressure (a quick ARC check follows this list).
- Confirm you’re not running a scrub/resilver during peak IO windows.
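The ARC check answers itself from the kernel’s own counters—no extra tooling needed. Field names below come from /proc/spl/kstat/zfs/arcstats on ZFS-on-Linux; the values are illustrative.
cr0x@server:~$ awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
size 8341234688
c_max 8589934592
If size keeps slamming into c_max while the host is also under memory pressure, ARC and your VMs are fighting; cap ARC or rebalance memory.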
Fourth: decide the fix class
- Design fix: move from VMDKs/RDM to HBA passthrough, add SLOG/metadata devices, rebuild with correct ashift.
- Operational fix: cable/firmware replacement, adjust scrub schedule, tune ARC cap.
- Expectation fix: the workload is random sync IO on rust; physics says “no.” Add SSDs or change architecture.
Common mistakes: symptoms → root cause → fix
1) Symptom: Random latency spikes, especially during snapshots/backups
Root cause: ZFS pool built on VMDKs with snapshot chains on the hypervisor, causing extra metadata IO and copy operations.
Fix: Avoid hypervisor snapshots for storage VMs; use ZFS snapshots/replication instead. Prefer HBA passthrough so the pool is not a set of VMDKs.
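What “use ZFS snapshots/replication instead” looks like in practice—a minimal sketch with hypothetical dataset, snapshot, and host names:
cr0x@server:~$ zfs snapshot -r tank/vmstore@nightly-20251228
cr0x@server:~$ zfs send -R tank/vmstore@nightly-20251228 | ssh backup01 zfs receive -uF backup/vmstore
Subsequent runs use incremental sends (-i previous-snapshot) so you’re not shipping the full dataset every night.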
2) Symptom: ZFS reports checksum errors, but SMART looks clean
Root cause: Transport issues (SAS/SATA cable, backplane, expander, marginal connector) or controller firmware quirks.
Fix: Check CRC errors, swap cables, reseat drives, update HBA firmware, and ensure proper cooling for HBAs and expanders.
3) Symptom: Sync writes are painfully slow; async seems fine
Root cause: No SLOG for sync-heavy workloads, or write cache/flush semantics causing forced flushes to stall.
Fix: Add an enterprise SLOG device with PLP; verify controller cache is safe; validate with dd oflag=dsync and workload-specific tests.
4) Symptom: Pool imports slowly or devices “move around” after reboot
Root cause: Unstable device naming (using /dev/sdX), especially in virtualized environments or with expanders.
Fix: Use /dev/disk/by-id when creating vdevs and replacing drives; document slot-to-serial mappings.
5) Symptom: Great benchmark numbers, awful application performance
Root cause: Benchmark is measuring cache (ARC/host cache), not disk. Or it’s sequential throughput while the app is random IO.
Fix: Test with sync/random patterns relevant to the app; watch latency metrics, not just MB/s; cap ARC to avoid host memory fights.
6) Symptom: After power loss, ZFS pool is online but apps show corruption
Root cause: Write acknowledgements were not durable (unsafe cache, virtualization flush mismatch), causing torn or lost writes above ZFS’s awareness.
Fix: Use HBA passthrough, disable volatile write caches without protection, validate crash consistency, and use proper SLOG for sync workloads.
7) Symptom: SSD pool performance decays over time
Root cause: TRIM/discard not functioning through the stack; high fragmentation and no reclamation.
Fix: Enable autotrim on ZFS where supported; ensure the underlying virtual disk path passes discard; otherwise schedule manual trim or redesign.
8) Symptom: Resilver takes forever and impacts everything
Root cause: Pool is near-full, disks are slow, and virtualization adds contention. Or you’re resilvering through a constrained controller/expander.
Fix: Keep pools below sensible fullness, prefer mirrors for IOPS-heavy workloads, schedule resilvers/scrubs off-peak, and avoid oversubscribed HBAs.
Checklists / step-by-step plan
Design checklist: pick the controller path
- Decide where ZFS lives: Proxmox host (preferred for Proxmox-first shops) or storage VM (common for ESXi-first shops).
- If ZFS is in a VM: require PCIe passthrough HBA unless you can justify the monitoring and integrity tradeoffs.
- Choose HBA: SAS HBA in IT mode; avoid RAID firmware and volatile caches.
- Plan for monitoring: SMART, zpool alerts, scrub schedule, and serial-to-slot mapping.
- Plan for recovery: can you move the disks/HBA to another host and import?
Implementation checklist: Proxmox host ZFS
- Set BIOS to AHCI for onboard SATA; enable IOMMU if you need passthrough for other devices (see the IOMMU sketch after this checklist).
- Install Proxmox; verify disks show real models/serials in lsblk.
- Create pool with by-id paths; verify ashift before you commit.
- Enable compression (lz4) and set dataset recordsize per workload.
- Configure scrub schedule and SMART monitoring.
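The IOMMU step from the first bullet, sketched for a GRUB-booted Intel host. Assumptions: AMD hosts typically need nothing beyond enabling IOMMU in firmware, and root-on-ZFS Proxmox installs using systemd-boot edit /etc/kernel/cmdline instead of GRUB.
cr0x@server:~$ grep CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
cr0x@server:~$ update-grub
Reboot, then confirm with dmesg | grep -i -e DMAR -e IOMMU that remapping is actually active before you try passthrough.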
Implementation checklist: ESXi with ZFS storage VM
- Install an HBA in IT mode; confirm it’s in its own IOMMU group and supported for passthrough.
- Enable passthrough on the ESXi host and attach the HBA to the storage VM.
- In the VM, confirm the HBA driver loads and disks appear with serial numbers.
- Create the ZFS pool directly on disks; set up periodic scrubs and SMART checks.
- Export storage back to ESXi over NFS/iSCSI; measure sync behavior based on your workloads (a sharenfs sketch follows this checklist).
- Document the operational constraints: no vMotion for that VM; patching requires planning.
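If the storage VM is a plain Linux+ZFS build (TrueNAS has its own sharing UI), the NFS export can be driven by the sharenfs property, assuming an NFS server is installed. Dataset name, network, and export options below are placeholders—adapt them to your environment and security posture.
cr0x@server:~$ zfs set sharenfs="rw=@10.10.10.0/24,no_root_squash" tank/vmstore
cr0x@server:~$ zfs get -o name,property,value sharenfs tank/vmstore
NAME PROPERTY VALUE
tank/vmstore sharenfs rw=@10.10.10.0/24,no_root_squash
Mount it in ESXi as an NFS datastore, and remember that ESXi issues sync writes over NFS—this is exactly where the SLOG discussion stops being theoretical.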
Ops checklist: monthly boring hygiene
- Review zpool status for errors and slow resilvers.
- Review SMART deltas: reallocations, pending sectors, CRC errors, temperatures.
- Confirm scrubs are completing within expected windows.
- Validate free space headroom and fragmentation risks.
- Test restores (file-level and VM-level) like you mean it.
FAQ
Is HBA passthrough always required for ZFS in a VM?
Required? No. Recommended for production where you care about data integrity, monitoring, and predictable recovery? Yes.
Virtual disks can work, but you accept blind spots and more complex failure modes.
Is RDM “good enough” on ESXi?
Sometimes. It’s typically better than VMDKs for presenting disks, but it’s still not as clean as passthrough HBA.
The big question is whether the guest can reliably see errors and SMART and whether flush semantics behave under stress.
Can I put ZFS on top of a RAID controller?
You can, in the same sense you can tow a boat with a sports car: it might move, and it will create stories.
For ZFS redundancy, use JBOD/HBA behavior so ZFS owns redundancy and repair logic.
Why do people say ZFS needs “direct disk access”?
Because ZFS’s self-healing and integrity model is strongest when it can see true device errors and control ordering/durability assumptions.
Abstraction layers can hide the signals ZFS uses to diagnose and repair.
What virtual controller is best if I must use virtual disks?
On Linux guests, paravirtual SCSI (where available) usually behaves better than older emulated controllers.
On Proxmox, virtio-scsi tends to be the sane default. But remember: controller choice can’t restore SMART visibility lost by VMDKs.
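On Proxmox, that sane default looks roughly like this for an existing VM—the VM ID, storage name, and volume are placeholders, and discard/iothread are optional but usually worth enabling on SSD-backed storage:
cr0x@server:~$ qm set 100 --scsihw virtio-scsi-single --scsi0 local-zfs:vm-100-disk-0,discard=on,iothread=1,ssd=1
None of this restores SMART visibility in the guest; it just keeps the virtual path from being the slowest and dumbest part of the stack.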
Does ZFS inside a VM eliminate the need for a SAN/NAS?
It can replace it for some environments—especially SMBs—if you treat the storage VM as a real storage appliance:
dedicated hardware path, disciplined upgrades, tested restores, and clear failure domain boundaries.
Should I run ZFS on the Proxmox host or in a VM?
On Proxmox: run ZFS on the host unless you have a compelling reason not to. It’s simpler and more observable.
If you need a storage appliance feature set (NAS services, app ecosystem) and accept the complexity, a VM approach can work.
How do I know if my sync writes are the bottleneck?
If databases or NFS workloads stall and your async benchmarks look fine, test sync behavior with dd oflag=dsync and observe latency.
If sync is slow, consider SLOG with PLP and verify caching policies.
What’s the single biggest red flag in a ZFS virtualization design review?
A storage VM running ZFS on thin-provisioned VMDKs with hypervisor snapshots enabled and no clear monitoring for datastore fullness.
It’s a slow-motion outage machine.
Will HBA passthrough prevent all corruption?
No. It reduces layers that can lie about durability and improves error visibility.
You still need redundancy, scrubs, SMART monitoring, tested backups, and sane operational processes.
Practical next steps
If you’re building new and ZFS is central: choose the architecture that lets ZFS see and control the disks.
On Proxmox, that usually means ZFS on the host with a real HBA (or AHCI SATA) and no RAID cleverness.
On ESXi, that usually means a dedicated HBA passed through to a storage VM, with storage exported back over NFS/iSCSI.
If you already deployed ZFS on virtual disks in production: don’t panic, but stop improvising.
Run the diagnostic tasks above, validate sync write behavior, check whether you can see SMART, and decide whether the risk profile matches your business.
Then schedule the migration to passthrough HBA the same way you schedule any risk-reduction project: deliberately, tested, and with a rollback plan.
The controller path isn’t a cosmetic detail. It’s the difference between ZFS being a reliable narrator and ZFS being trapped in the back seat with a blindfold.
Let it drive.