Every virtualization platform has a moment where the CPU graph looks fine, memory has headroom, and yet VMs feel like they’re wading through syrup. That moment is usually storage. Not capacity. Not “we need more TB.” Latency, consistency, and the ugly interactions between hypervisors, filesystems, networks, and drive firmware.
Proxmox makes storage choices look easy—click a wizard, add a pool, mount an export, enable Ceph. Production makes them expensive. Let’s pick the right kind of expensive.
Decision rules you can actually use
If you remember one thing, make it this: you’re not choosing “storage,” you’re choosing failure modes. Performance is a feature; predictability is a product.
Pick ZFS (local) when
- You can tolerate that a single node’s storage is not automatically shared.
- You want strong data integrity and transparent recovery behavior.
- Your workload is mostly “VMs live on the node they run on,” with migration/HA handled by replication or backups.
- You value operational simplicity: one box, one pool, one set of tools.
Opinionated guidance: For 1–3 nodes, ZFS is the default answer unless you have a hard requirement for shared block storage and live migration without planning.
Pick Ceph when
- You need shared storage across many hypervisors and you expect nodes to die without drama.
- You can afford 3+ nodes (really) and enough disks to make replication sane.
- You have the network to support it: low-latency, non-oversubscribed, ideally dedicated for storage traffic.
- You’re willing to run a distributed system on purpose, not by accident.
Opinionated guidance: Ceph is great when you have a cluster. It’s a hobby when you have two nodes and a dream.
Pick iSCSI when
- You already own a storage array or SAN that does snapshots, replication, and support contracts.
- You want block semantics and central storage management.
- You can configure multipath correctly and you understand the blast radius of a single array.
Opinionated guidance: iSCSI is fine if the array is the product. If your “array” is a Linux VM with a couple of disks, you’re reinventing disappointment.
Pick NFS when
- You want the fastest path to shared storage with understandable behavior.
- You’re storing VM images or backups and can live with “the server is the server.”
- You want easy human inspection and simpler recovery workflows.
Opinionated guidance: NFS is the minivan of virtualization storage: not sexy, almost always adequate, and everyone has an opinion about the cup holders.
One idea, paraphrased from Jim Gray on reliability: if you can't explain your failure model, you don't have a system, just a demo.
A mental model: what VMs do to storage
Virtual machines do three things that storage hates:
- Random writes from multiple guests at once. Even “sequential” guest writes get fragmented by the host.
- Metadata churn: snapshots, cloning, thin provisioning, and copy-on-write turn “write data” into “write data plus bookkeeping.”
- Latency sensitivity: a VM isn’t a database server; it’s a stack of schedulers. Add 5 ms at the storage layer and you might add 50 ms of “why is SSH laggy?” up top.
Then you add Proxmox-specific realities:
- VM disks may be raw (usually best), qcow2 (featureful, but with overhead), or RBD (Ceph block devices).
- Backups and snapshots happen on the host, not in the guest, which is great until your storage backend interprets that as “surprise write amplification.”
- Live migration wants shared storage or fast copy. “Fast copy” is still copy. Copy is still IO.
So the question becomes: where do you want complexity to live—on each node (ZFS), in a shared NAS/SAN (NFS/iSCSI), or inside your cluster fabric (Ceph)?
Joke 1/2: Storage is the only place where “it worked in testing” is code for “nobody tested the noisy neighbor VM.”
ZFS on Proxmox: local-first, reliability-first
ZFS is a filesystem and volume manager that behaves like it’s personally offended by silent corruption. Checksums everywhere. Copy-on-write semantics. Snapshots that are actually useful. Send/receive replication that’s boring in the best way.
What ZFS is great at in virtualization
- Predictable recovery: a mirrored vdev loses a disk, you replace it, you resilver. No magic.
- Snapshots with real tooling: ZFS snapshots are cheap; what they cost you is long-term fragmentation and metadata growth if you keep too many.
- Local performance: with NVMe mirrors and enough RAM, ZFS is brutally fast for VM workloads.
- Data integrity: end-to-end checksums catch bad disks, bad cables, bad controllers, and bad days.
What ZFS is not
ZFS is not shared storage by default. You can replicate, you can back up, you can build a shared layer above it, but ZFS itself won’t give you transparent shared block storage across nodes.
Design rules that keep you out of trouble
- Mirrors beat RAIDZ for VM latency. RAIDZ can be fine for capacity-heavy, read-heavy, sequential workloads. VM random write latency is not that workload.
- Don’t obsess over SLOG unless you understand sync writes. For many VM workloads, you either don’t need a SLOG or you need a very good one. A cheap “SLOG SSD” is a reliability trap if it lies about power-loss protection.
- Use raw disks/volumes where possible. qcow2 has features; it also has overhead. ZVOLs or raw files typically behave better for hot IO.
- Keep snapshots under control. Hundreds of snapshots on busy VM datasets will slowly turn your “fast pool” into a “fast pool (historical reenactment).”
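A quick way to tell whether snapshot retention has drifted is to count them per dataset. A minimal sketch, assuming the VM dataset is tank/vmdata (a placeholder; substitute your own):
cr0x@server:~$ zfs list -H -t snapshot -r tank/vmdata | wc -l
cr0x@server:~$ zfs list -t snapshot -r tank/vmdata -o name,used,creation -s creation | tail -n 5
If the count is in the hundreds on a busy dataset, retention is now a performance decision, not just a backup decision.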
The settings that matter (and why)
Three knobs show up in postmortems more than they should:
- ashift (sector size alignment). Get it wrong at pool creation and you bake in write amplification.
- recordsize / volblocksize. Mismatch them with the workload and you waste IO or amplify writes.
- sync. “sync=disabled” is how you turn a UPS battery into a data-loss lottery ticket.
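For reference, a hedged sketch of where each knob lives. ashift is fixed at pool creation, volblocksize is fixed when a zvol is created, and sync can be changed at any time; the pool, device, and dataset names below are placeholders:
cr0x@server:~$ zpool create -o ashift=12 tank mirror /dev/nvme0n1 /dev/nvme1n1
cr0x@server:~$ zpool get ashift tank
cr0x@server:~$ zfs create -V 100G -o volblocksize=16k tank/vm-101-disk-0
cr0x@server:~$ zfs set sync=standard tank/vmdata
On Proxmox the zvols are normally created for you when you add a VM disk, and the ZFS storage's blocksize option is what decides volblocksize, so settle it before you create disks, not after.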
Ceph on Proxmox: distributed storage with real teeth
Ceph is what happens when you want storage to behave like a cloud primitive: self-healing, replicated, scalable, and network-native. It can be glorious. It can also be a slow-motion lesson in physics.
What Ceph is great at
- Shared block storage (RBD) across hosts, with live migration that doesn’t involve copying disks around.
- Failure tolerance when designed correctly: OSD dies, host dies, disks die, and the cluster keeps serving.
- Operational leverage at scale: add nodes, add OSDs, rebalance, keep going.
What Ceph demands
- Network quality: latency and packet loss will show up as “random VM slowness” and “blocked tasks.”
- IOPS budget: replication means you pay writes multiple times. Small writes get multiplied and journaled. This is not optional.
- Capacity headroom: running near full is not “efficient,” it’s a performance cliff. Rebalancing needs space.
- Operational maturity: you must be able to interpret cluster health, backfill states, and placement group behavior without panic-clicking.
Ceph for Proxmox: practical guidance
For VM disks, use RBD unless you have a specific need for a POSIX filesystem; CephFS is great, but it’s typically better for shared files, not hot VM disks. If your workload is latency-sensitive and write-heavy, budget for fast OSDs (NVMe or very good SSDs) and a real storage network.
Ceph tuning is less about magic sysctls and more about sane architecture: enough OSDs, appropriate replication, enough CPU per host, and not running Ceph on the same weak nodes you hoped to use as compute.
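If you are mixing device types, confirm what Ceph thinks it has before you put VM pools on it. A minimal sketch; vm-pool is a placeholder pool name:
cr0x@server:~$ ceph osd crush class ls
cr0x@server:~$ ceph osd df tree
cr0x@server:~$ ceph osd pool get vm-pool size
Separate device classes, with pools pinned to them, keep latency-sensitive RBD images on the fast class instead of averaging NVMe with whatever else got racked.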
Joke 2/2: Ceph is “simple” the way a shipping container is “just a box.” The interesting parts are everything around it.
iSCSI: block storage with sharp edges
iSCSI is block storage over IP. It’s not modern, not trendy, and not going away. In corporate environments it’s often the path of least resistance because there’s already a SAN and a team that speaks fluent LUN.
When iSCSI is the right answer
- You need shared block storage and you trust your array more than you trust DIY distributed storage.
- You want central snapshots/replication handled by the array (and the vendor takes the midnight calls).
- You can implement multipath and redundant networking properly.
Failure modes you must accept
- Centralized blast radius: array outage is cluster-wide pain.
- Network path sensitivity: microbursts and misconfigured MTU can look like “VM disk corruption” when it’s really timeouts and resets.
- Queue depth and latency interactions: one chatty VM can push the LUN into tail-latency hell if the array isn’t provisioned well.
On Proxmox, you’ll often pair iSCSI with LVM or ZFS over iSCSI (yes, you can, but do it with intent). If you already have ZFS locally, stacking ZFS on top of iSCSI usually means you’re doubling down on complexity for unclear gains.
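For the common case of array-backed LUNs, a hedged sketch of the wiring: discover the target, then let Proxmox track it. The portal IP and IQN are placeholders, and if your pvesm version names the options differently, the GUI exposes the same two fields:
cr0x@server:~$ iscsiadm -m discovery -t sendtargets -p 10.10.10.50
cr0x@server:~$ pvesm add iscsi san01 --portal 10.10.10.50 --target iqn.2020-01.com.vendor:san01.lun1
Most setups then layer LVM on top of the LUN for VM disks rather than formatting it directly; keep the stack shallow and multipathed.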
NFS: simple shared storage (until it isn’t)
NFS is file storage over the network. It’s easy to set up, easy to reason about, and easy to outgrow. For many Proxmox clusters, it’s the right compromise: shared storage for ISO/templates/backups and even VM disks if performance needs are moderate and the server is decent.
NFS shines when
- You want shared storage without running a distributed storage system.
- You value straightforward recovery: mount the export somewhere else and your files are just there.
- Your workload is not extremely latency-sensitive or write-heavy.
NFS hurts when
- Your NAS is underpowered and becomes the single choke point.
- Mount options are wrong for VM images (caching and locking behavior matters).
- You treat NFS like block storage and expect it to behave like a SAN.
Proxmox supports NFS well. Just don’t pretend it’s magic. The NFS server is now a critical dependency: monitor it, patch it, and give it storage that doesn’t fall over during scrub/raid rebuild.
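A minimal sketch of adding an NFS storage with explicit options; the server name, export path, and NFS version are placeholders for whatever your NAS actually supports:
cr0x@server:~$ showmount -e nas01
cr0x@server:~$ pvesm add nfs nfs-vm --server nas01 --export /export/proxmox --content images,backup --options vers=4.2
Then check the mount that actually resulted (see Task 9 below); a storage definition is only as good as the options it negotiated.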
Interesting facts and a little history (the useful kind)
- ZFS was born at Sun Microsystems in the mid-2000s with end-to-end checksumming as a core feature, not an add-on.
- Copy-on-write predates modern virtualization; it’s an old idea that became mainstream because snapshots and clones are operational gold.
- Ceph started as a university project and became a major open-source storage platform adopted heavily in OpenStack-era private clouds.
- RBD exists because filesystems weren’t enough for cloud block semantics: images, clones, and snapshots at scale wanted a native block interface.
- NFS dates back to the 1980s and still survives because “a shared filesystem” remains a fundamental need, and it’s simple enough to debug with packets.
- iSCSI was standardized around the early 2000s to bring SAN-like access to commodity Ethernet networks.
- Write amplification is not a theoretical problem: COW filesystems plus VM workloads can multiply backend writes dramatically, especially with snapshots and small blocks.
- Ceph’s “CRUSH” placement algorithm is designed to avoid central lookup tables for data placement, which matters when you scale.
- Proxmox popularized “batteries included” clustering for small and mid-sized shops: you get KVM, management, backups, and storage integration without assembling a dozen components yourself.
12+ practical tasks: commands, output, decisions
These are the checks I run when someone says “storage is slow” or “migrations got weird.” Each task includes: command, what the output means, and what decision you make.
Task 1: Identify the storage backend per VM
cr0x@server:~$ qm config 101
boot: order=scsi0;ide2;net0
cores: 4
memory: 8192
name: app01
scsi0: local-zfs:vm-101-disk-0,size=120G
ide2: local:iso/debian-12.iso,media=cdrom
net0: virtio=DE:AD:BE:EF:00:01,bridge=vmbr0
Meaning: local-zfs tells you it’s a ZFS-backed volume on this node, not shared. If you expected shared storage for live migration, you just found the mismatch.
Decision: For HA/live migration without downtime, either move the VM disk to Ceph/NFS/iSCSI or implement ZFS replication + planned failover.
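Both options are one command away. A hedged sketch; the VM ID, disk slot, target storage, and node names are placeholders, and on recent Proxmox releases the first subcommand also exists as qm disk move:
cr0x@server:~$ qm move_disk 101 scsi0 ceph-rbd
cr0x@server:~$ pvesr create-local-job 101-0 pve2 --schedule "*/15"
The first copies the disk to shared storage while the VM runs; the second (assuming the pvesr CLI for ZFS storage replication, the same job the GUI Replication tab creates) ships snapshots to a second node every 15 minutes for planned failover.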
Task 2: Confirm what storages Proxmox thinks exist
cr0x@server:~$ pvesm status
Name Type Status Total Used Available %
local dir active 102305280 25165824 77175168 24.60%
local-zfs zfspool active 9999220736 2130706432 7868514304 21.31%
ceph-rbd rbd active 29997662208 8996782080 21000880128 29.99%
Meaning: Confirms which backends are online and whether Proxmox can reach them. If Ceph shows weird capacity or goes inactive, stop blaming “the VM.”
Decision: If a shared store is flapping, treat it as an incident on the storage layer, not on compute.
Task 3: ZFS health check (don’t guess)
cr0x@server:~$ zpool status -x
all pools are healthy
Meaning: Healthy pool. If performance is bad, it’s not because ZFS is currently degraded.
Decision: If this reports degraded vdevs or checksum errors, stop performance work and start data safety work.
Task 4: Check ZFS pool layout (mirrors vs RAIDZ)
cr0x@server:~$ zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
Meaning: Mirror vdev. Good for VM latency. If you see RAIDZ1/2 with a small number of disks and write-heavy VMs, expect higher tail latency.
Decision: If VM latency is the problem and you’re on RAIDZ, consider migrating hot VMs to mirrors or SSD/NVMe vdevs.
Task 5: Check ZFS properties that affect VM behavior
cr0x@server:~$ zfs get -o name,property,value -s local recordsize,compression,atime,sync tank/vmdata
NAME PROPERTY VALUE
tank/vmdata atime off
tank/vmdata compression lz4
tank/vmdata recordsize 128K
tank/vmdata sync standard
Meaning: Sensible defaults: lz4 helps more often than it hurts; atime=off avoids metadata writes; sync=standard is safe.
Decision: If you see sync=disabled on VM datasets, decide whether you’re okay losing recent writes after a crash. Most production shops shouldn’t be.
Task 6: Find ZFS ARC pressure (RAM as a performance component)
cr0x@server:~$ arcstat 1 3
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:01:01 812 74 9 9 12 65 88 0 0 48G 56G
12:01:02 790 80 10 8 10 72 90 0 0 48G 56G
12:01:03 805 77 10 7 9 70 91 0 0 48G 56G
Meaning: ARC size vs target (arcsz vs c) and miss rate. A high miss rate during steady-state workloads can mean your working set doesn’t fit in RAM.
Decision: If misses are consistently high and disks are busy, add RAM or move hot workloads to faster media; don’t micro-tune before you size correctly.
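If, after sizing RAM, you do decide to cap (or raise) ARC, the knob is a kernel module parameter, not a ZFS property. A hedged sketch capping ARC at 32 GiB; the value is an example, not a recommendation:
cr0x@server:~$ cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=34359738368
cr0x@server:~$ update-initramfs -u
cr0x@server:~$ echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max
The last line applies the limit at runtime; the modprobe file makes it stick across reboots.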
Task 7: Check live IO latency on the host
cr0x@server:~$ iostat -x 1 3
Linux 6.8.12-pve (pve1) 12/28/25 _x86_64_ (16 CPU)
Device r/s w/s rkB/s wkB/s await svctm %util
nvme0n1 120.0 310.0 4096.0 18560.0 2.10 0.25 11.0
nvme1n1 118.0 305.0 4000.0 18200.0 2.35 0.24 10.7
Meaning: await is the practical “how bad does it feel” metric. If await is 20–200 ms, guests will complain. If %util is pegged, you’re saturated.
Decision: High await with low util suggests queueing elsewhere (controller, filesystem locks, network). High await with high util suggests you need more spindles/IOPS.
Task 8: Map which process/VM is doing IO
cr0x@server:~$ pvesh get /nodes/pve1/qemu --output-format json-pretty
[
{
"cpu": 0.17,
"diskread": 1610612736,
"diskwrite": 987654321,
"mem": 0.62,
"name": "db01",
"pid": 24510,
"status": "running",
"uptime": 182003,
"vmid": 110
},
{
"cpu": 0.04,
"diskread": 10485760,
"diskwrite": 7340032,
"mem": 0.28,
"name": "app01",
"pid": 25111,
"status": "running",
"uptime": 55321,
"vmid": 101
}
]
Meaning: A quick per-VM view. The “db01” VM is doing real IO. Correlate with latency metrics.
Decision: If one VM dominates IO and causes collateral damage, isolate it (dedicated pool, QoS, separate OSD class, or move it off shared NFS).
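If the decision is QoS rather than a move, Proxmox exposes per-disk IO limits as drive options. A hedged sketch; the volume name must match what qm config 110 already shows for that slot, any other drive options you had set should be re-specified in the same string, and the numbers are placeholders:
cr0x@server:~$ qm set 110 --scsi0 ceph-rbd:vm-110-disk-0,iops_rd=1500,iops_wr=1500,mbps_rd=200,mbps_wr=200
This caps db01 before db01 caps everyone else. It's a tourniquet, not a cure; the cure is usually dedicated media for the bully.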
Task 9: NFS mount sanity check (options matter)
cr0x@server:~$ findmnt -t nfs4 -o TARGET,SOURCE,FSTYPE,OPTIONS
TARGET SOURCE FSTYPE OPTIONS
/mnt/pve/nfs-vm nas01:/export/proxmox nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,sec=sys
Meaning: You want hard mounts for VM images (to avoid silent corruption via short writes), and reasonable rsize/wsize. NFSv4.2 is typically good if the NAS supports it.
Decision: If you see soft mounts, fix it. “Soft” is for non-critical reads, not VM disks.
Task 10: NFS server latency check from the client
cr0x@server:~$ nfsstat -c
Client rpc stats:
calls retrans authrefrsh
1248933 120 0
Client nfs v4:
ops count retrans
READ 402911 18
WRITE 311220 47
COMMIT 1021 0
Meaning: Retransmits indicate network/server trouble. A few happen; a rising rate correlates with stalls and VM freezes.
Decision: If retrans spikes during incidents, stop tuning Proxmox and start looking at NAS load, NIC errors, switch buffers, MTU mismatch.
Task 11: iSCSI session and multipath status
cr0x@server:~$ iscsiadm -m session
tcp: [1] 10.10.10.50:3260,1 iqn.2020-01.com.vendor:san01.lun1 (non-flash)
tcp: [2] 10.10.11.50:3260,1 iqn.2020-01.com.vendor:san01.lun1 (non-flash)
cr0x@server:~$ multipath -ll
mpatha (3600a098038314c6a6b5d4a6f4b6a6f30) dm-3 VENDOR,Array
size=2.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 3:0:0:10 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
`- 4:0:0:10 sdc 8:32 active ready running
Meaning: Two iSCSI sessions: likely two fabrics/VLANs. Multipath shows active path and enabled path. If you have one path, you don’t have redundancy.
Decision: If paths flap or one is down, fix networking before blaming the array or Proxmox.
Task 12: Ceph cluster health at a glance
cr0x@server:~$ ceph -s
cluster:
id: 3c3b1e2d-aaaa-bbbb-cccc-111122223333
health: HEALTH_WARN
1 slow ops, oldest one blocked for 12 sec
services:
mon: 3 daemons, quorum pve1,pve2,pve3 (age 4h)
mgr: pve1(active), standbys: pve2
osd: 12 osds: 12 up (since 3h), 12 in (since 3h)
data:
pools: 2 pools, 128 pgs
objects: 1.02M objects, 3.9 TiB
usage: 11 TiB used, 22 TiB / 33 TiB avail
pgs: 127 active+clean, 1 active+clean+scrubbing
Meaning: “slow ops” is a red flag for latency. Scrubbing is happening. Sometimes scrub coincides with user-visible jitter.
Decision: If slow ops persist, check OSD latency and network. If scrub causes pain, adjust scrub windows and priorities rather than disabling scrubs forever.
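If scrubbing lines up with user pain, move it to quiet hours instead of turning it off. A hedged sketch of the usual knobs; the hours are placeholders for your own off-peak window:
cr0x@server:~$ ceph config set osd osd_scrub_begin_hour 22
cr0x@server:~$ ceph config set osd osd_scrub_end_hour 6
cr0x@server:~$ ceph config get osd osd_max_scrubs
Fencing the schedule and keeping scrub concurrency modest usually buys enough relief without giving up the integrity checking scrubs exist for.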
Task 13: Ceph OSD latency and perf counters
cr0x@server:~$ ceph osd perf
osd commit_latency(ms) apply_latency(ms)
0 3 4
1 4 5
2 45 52
3 3 4
Meaning: OSD 2 is sick compared to peers. That’s often a bad disk, a saturated device, or a noisy neighbor on that node.
Decision: Drain/mark out the bad OSD if needed, or investigate the specific host/disk. Don’t “tune Ceph” globally to compensate for one failing device.
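If osd.2 keeps lagging and its disk looks suspect, take it out deliberately and let the cluster rebalance around it. A sketch assuming the OSD ID from the output above and a placeholder device path:
cr0x@server:~$ smartctl -a /dev/sdc | grep -iE "reallocated|pending|wear"
cr0x@server:~$ ceph osd out 2
cr0x@server:~$ ceph -s
Watch recovery in ceph -s; if client latency degrades badly during backfill, that is a message about headroom and recovery settings, delivered early and cheaply.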
Task 14: Check Proxmox backup pressure (storage can be fine, backups can kill it)
cr0x@server:~$ grep -E "vzdump|INFO:|ERROR:" /var/log/syslog | tail -n 8
Dec 28 01:10:02 pve1 vzdump[31120]: INFO: starting new backup job: vzdump 110 --storage nfs-backup --mode snapshot
Dec 28 01:10:05 pve1 vzdump[31120]: INFO: VM 110 - starting backup
Dec 28 01:10:09 pve1 vzdump[31120]: INFO: VM 110 - using inotify to track modifications
Dec 28 01:12:41 pve1 vzdump[31120]: INFO: VM 110 - backup finished
Meaning: Backups are running during the pain window. Snapshot mode is good, but it still generates read IO and metadata work.
Decision: If storage latency aligns with backup windows, change schedules, limit concurrency, or move backup targets off the same bottleneck.
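If backups keep colliding with peak IO, vzdump has global throttles in /etc/vzdump.conf, and per-job flags override them. The numbers below are placeholders, not recommendations:
cr0x@server:~$ cat /etc/vzdump.conf
bwlimit: 200000
ionice: 7
cr0x@server:~$ vzdump 110 --storage nfs-backup --mode snapshot --bwlimit 100000
Throttling stretches the backup window, so pair it with a schedule change rather than hoping one knob fixes both problems.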
Fast diagnosis playbook: what to check first/second/third
This is the order that finds answers fast. The goal isn’t perfection; it’s to stop arguing and isolate the layer.
First: determine if it’s host-local, shared, or networked
- Check VM disk location: qm config <vmid>.
- Check storage list: pvesm status.
- If it's Ceph/NFS/iSCSI, assume network is involved until proven otherwise.
Second: measure latency where it matters
- Host disk latency: iostat -x 1.
- Ceph “slow ops” and ceph osd perf.
- NFS retransmits: nfsstat -c.
- iSCSI path health: multipath -ll.
Third: check for degraded/recovery states
- ZFS: zpool status for resilver/scrub/checksum errors.
- Ceph: recovery/backfill/scrub states in ceph -s.
- NAS/array: is it rebuilding, scrubbing, or snapshotting?
Fourth: find the bully VM and the time correlation
- Per-VM IO counters: pvesh get /nodes/<node>/qemu.
- Check backup windows and snapshot jobs.
- Look for one tenant saturating shared queues.
Fifth: validate the boring stuff
- NIC errors and drops: ip -s link, switch counters.
- MTU consistency (especially if you tried jumbo frames); see the sketch after this list.
- CPU steal/ready time and IO wait: storage pain can masquerade as CPU contention.
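A minimal sketch of the NIC and MTU checks above; the interface name and storage-network IP are placeholders. The 8972-byte ping payload is 9000 minus 28 bytes of headers, so it only succeeds if jumbo frames work end to end:
cr0x@server:~$ ip -s link show enp65s0f0
cr0x@server:~$ ping -M do -s 8972 -c 3 10.10.10.50
Rising error or drop counters, or pings that fail with the don't-fragment bit set, mean the storage problem is a network problem wearing a costume.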
Common mistakes: symptoms → root cause → fix
1) VM freezes for 30–120 seconds on NFS
Symptoms: Guest logs show disk timeouts; host shows “server not responding” messages; then everything recovers.
Root cause: NFS server hiccups (load, storage latency, or network loss). Hard mounts cause clients to wait (which is safer than corrupting data).
Fix: Fix the NAS bottleneck or network. Validate findmnt options. Consider moving hot VM disks off NFS to local ZFS or Ceph RBD.
2) ZFS pool looks healthy but VMs are slow during snapshots
Symptoms: Latency spikes during backup windows; IO patterns turn random; ARC misses rise.
Root cause: Snapshot-heavy workloads increase metadata and fragmentation; backups can induce read storms.
Fix: Reduce snapshot counts/retention, stagger backups, keep hot datasets separate, and prefer raw/ZVOL for performance-critical disks.
3) Ceph feels fast for weeks, then suddenly gets “sticky”
Symptoms: HEALTH_WARN slow ops; VM IO jitter; occasional blocked tasks.
Root cause: Running too close to capacity or an OSD slowly failing. Backfill/recovery competes with client IO.
Fix: Maintain headroom, replace failing disks early, and schedule scrubs sensibly. Don’t ignore a single lagging OSD in ceph osd perf.
4) iSCSI “works” but you get random filesystem errors in guests
Symptoms: Guests show ext4/xfs errors; multipath shows path flaps; dmesg has SCSI resets.
Root cause: Unstable network path, wrong MTU, bad cables/optics, or misconfigured multipath/timeout settings.
Fix: Stabilize links, enforce redundant paths, align MTU end-to-end, and validate multipath health. Don’t hide it with retries.
5) ZFS “sync=disabled” made things fast… until power flickered
Symptoms: After crash, some VMs boot with corrupted filesystems; databases need repair.
Root cause: Disabling sync acknowledged writes before they were safe on stable media.
Fix: Set sync back to standard, add proper SLOG with power-loss protection if you need sync performance, and use UPS properly (still not a substitute).
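A hedged sketch of the safe configuration: sync back to standard, plus a mirrored SLOG on devices with power-loss protection. Device paths are placeholders; in real life use stable /dev/disk/by-id names:
cr0x@server:~$ zfs set sync=standard tank/vmdata
cr0x@server:~$ zpool add tank log mirror /dev/nvme2n1 /dev/nvme3n1
cr0x@server:~$ zpool status tank
The SLOG only absorbs synchronous writes; it will not make async workloads faster, and a consumer SSD that lies about flushes makes things worse, not better.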
6) Ceph on 1 GbE “kind of works” but migrations are brutal
Symptoms: High latency during peak; client IO stalls; recovery takes forever.
Root cause: Network is the backplane for Ceph. 1 GbE is a tax you pay every write.
Fix: Upgrade to at least 10 GbE (often 25 GbE in serious deployments), separate storage traffic, and stop treating the network as optional.
Checklists / step-by-step plan
If you’re building a 1–3 node Proxmox setup
- Default to local ZFS mirrors for VM disks.
- Use a separate storage (often NFS) for ISOs/templates/backups.
- Decide upfront how you’ll handle node failure:
- ZFS replication between nodes for important VMs, or
- Restore from backups with defined RTO/RPO.
- Size RAM with ARC in mind; don’t starve ZFS.
- Keep snapshot retention sane; design backup schedules around IO.
If you’re building a 4+ node cluster and want shared VM storage
- Pick Ceph RBD if you can provide:
- 3+ nodes minimum (for quorum and replication),
- Fast storage devices,
- A serious storage network.
- Separate failure domains: host, disk, rack (if you have one).
- Plan capacity headroom: don’t operate near full.
- Operationalize health checks: slow ops, OSD perf, scrub windows.
- Test failure: pull an OSD, reboot a host, validate recovery and performance impact.
If your company already owns a SAN/NAS
- Use iSCSI for VM block storage if the array is proven and supported.
- Use NFS for shared files, templates, and backups where file semantics help.
- Implement multipath and redundant switching like you mean it.
- Get array-side telemetry access or you’ll troubleshoot blind.
A simple “what should I choose” matrix
- ZFS: best for single-node performance and data integrity; shared storage via replication/backups.
- Ceph: best for shared storage and resilience at cluster scale; demands network and ops maturity.
- iSCSI: best when you already have a real SAN; high blast radius but predictable under a good array.
- NFS: best for simplicity and shared files; acceptable for VM disks with good NAS and moderate workloads.
Three corporate mini-stories from the storage trenches
Incident caused by a wrong assumption: “NFS is just slower local disk”
They had a tidy Proxmox cluster: three compute nodes and one NAS appliance that everyone liked because it had a friendly UI and blinking lights. VM disks lived on NFS because it made migration easy and the storage team said “it’s redundant.”
The assumption was that NFS behaves like a slower local disk. Meaning: if the NAS is busy, VMs get slower. The real behavior is harsher: if the NAS pauses long enough, clients stall hard. That’s not a bug; it’s how “hard mounts” preserve correctness.
One afternoon, a scheduled NAS scrub overlapped with a Proxmox backup job. The NAS didn’t crash. It just took longer to answer. Guest kernels started logging IO timeouts. Databases did what databases do when they get scared: they stopped trusting their own world.
The postmortem was uncomfortable because the graph everyone watched was throughput. Throughput looked fine. Latency was the killer, and nobody was graphing NFS retransmits or NAS disk queue depth.
Fix was boring: move the latency-sensitive VMs to local ZFS mirrors, keep NFS for backups/templates, and add monitoring that shows tail latency and retransmits. Migration became a little less magical. Incidents became a lot less frequent.
Optimization that backfired: “Let’s disable sync; it’s faster”
A different shop had ZFS on SSD mirrors and a handful of VMs that ran a transactional workload. They saw periodic write latency spikes and someone proposed the classic fix: set sync=disabled on the dataset. Performance improved instantly. Applause. A ticket was closed.
Two months later, they had a brief power event. Not a dramatic outage—more like the kind that makes lights flicker and reminds you the building is older than your CI pipeline. The hosts rebooted. Most VMs came back. A few didn’t. One database did, but the application layer started throwing subtle data inconsistencies.
Now the “performance optimization” was an incident response problem with a legal department hovering nearby. The team discovered the hard truth: when you disable sync, you’re trading durability for speed. Sometimes that trade is acceptable in a lab. In production, you need to be explicit about it, document it, and get sign-off from people who will be blamed later.
The final fix wasn’t exotic. They reverted to safe sync behavior and added a proper SLOG device with power-loss protection for the workloads that needed low-latency sync writes. They also improved their UPS testing and stopped trusting “it has batteries” as a design spec.
Boring but correct practice that saved the day: “Headroom and rehearsed failure”
A company running a Proxmox + Ceph cluster had a rule everyone complained about: keep free space above a threshold and never let the cluster drift close to full. It was unpopular because it looked like wasted budget. Finance loves a full disk; engineers love a quiet pager.
They also ran a quarterly failure rehearsal. Not theatrical chaos—just pulling an OSD, rebooting a node, and watching what happened. They tracked the time to recover and the effect on VM latency. It wasn’t fun. It was work.
Then a host died on a Monday morning. A real death: motherboard, not a reboot. Ceph rebalanced, clients kept going, and the cluster remained usable. Latency bumped up, but it didn’t go nonlinear because there was space to backfill and the network wasn’t already at the edge.
The recovery felt anticlimactic. That’s the highest compliment you can pay storage.
What saved them wasn’t a secret Ceph flag. It was headroom, monitoring, and having already watched the system misbehave in a controlled way.
FAQ
1) Should I use ZFS or Ceph for a two-node Proxmox cluster?
Use ZFS locally and replicate or back up. Two-node Ceph is possible in strange ways, but it’s rarely worth the operational risk and performance compromises.
2) Is RAIDZ bad for Proxmox?
Not “bad,” but often the wrong tool for VM latency. Mirrors usually deliver better IOPS and lower tail latency for random write workloads typical of mixed VMs.
3) ZFS on SSDs: do I need a SLOG?
Only if you have a meaningful volume of sync writes and you care about their latency. If you add a SLOG, make it enterprise-grade with power-loss protection, or you’ve built a reliability footgun.
4) For Ceph on Proxmox, should I use RBD or CephFS for VM disks?
RBD for VM disks in most cases. CephFS is excellent for shared filesystems, but VM disk workloads typically map better to block semantics.
5) Can I run Ceph and VMs on the same Proxmox nodes?
Yes, and many do. But size CPU/RAM accordingly and remember: when VM load spikes, Ceph daemons compete for resources. If you want predictable storage, consider dedicated storage nodes at larger scale.
6) Is NFS acceptable for VM disks?
Yes, if the NAS is strong and your workload isn’t ultra latency-sensitive. It’s common for small clusters. Monitor latency and retransmits, and don’t use “soft” mounts.
7) iSCSI vs NFS for Proxmox: which is faster?
It depends more on the server/array and network than the protocol. iSCSI can offer predictable block behavior; NFS can be simpler operationally. Pick based on failure modes and manageability, not a benchmark screenshot.
8) Should VM disks be qcow2 or raw?
Raw is typically faster and simpler. qcow2 gives features (like internal snapshots) but adds overhead. On Proxmox, prefer raw/ZVOL for hot disks unless you need qcow2’s features.
9) What’s the fastest way to tell if Ceph is the problem or the VM is the problem?
Check ceph -s for slow ops and ceph osd perf for outliers, then correlate with host iostat and VM disk latency symptoms. If Ceph shows slow ops, it’s not “just the VM.”
10) Can I mix HDD and SSD in Ceph?
You can, but do it deliberately (separate device classes, separate pools). Mixing them blindly usually means the fast disks spend their lives waiting for the slow ones.
Conclusion: practical next steps
Pick a storage backend based on how you want outages to look, not on how you want dashboards to look.
- If you’re small (1–3 nodes), build local ZFS mirrors for VM disks, keep NFS for backups and shared artifacts, and implement replication/backups with a tested restore.
- If you’re a real cluster (4+ nodes) and you can fund the network and disks, use Ceph RBD and treat it like the distributed system it is: monitor it, leave headroom, rehearse failures.
- If you already own a proper array, iSCSI is a sane corporate answer—just do multipath correctly and accept the centralized dependency.
- If you want shared storage without a distributed storage platform, NFS remains the simplest thing that works, and simplicity is a performance feature at 3 a.m.
Your next action should be specific: run the diagnostic tasks above on a normal day, capture baseline latency, and write down what “healthy” looks like. The incident will arrive later. It always does. At least make it arrive to prepared adults.