You don’t feel storage decisions on day one. You feel them the first time a node dies at 2:13 a.m., the CEO is on a plane with spotty Wi‑Fi, and your “HA” story turns into “we’re investigating.” That’s when you discover whether you bought resilience or just purchased a new hobby.
In Proxmox, the Ceph vs ZFS question looks like a feature comparison. In production, it’s a failure-mode decision. This piece is the decision tree I wish more teams had before they built the wrong thing confidently.
The decision tree (use this, not vibes)
Here’s the blunt version. If you disagree, fine—run the tests later in this article and let your graphs settle the argument.
Step 1: Do you need shared storage for live migration and HA across nodes?
- If yes, and you want it built-in: choose Ceph (RBD). That’s the whole point: distributed block storage that survives node loss and keeps serving.
- If no (or you can accept “migration with downtime” or replica-based approaches): ZFS on each node is usually the best reliability-per-effort deal you can buy.
Step 2: How many nodes do you have, really?
- 1 node: ZFS. Ceph is not a single-node product unless you’re doing a lab or you enjoy self-inflicted complexity.
- 2 nodes: ZFS, plus a third vote (qdevice) for cluster quorum; a setup sketch follows this list. Ceph on 2 nodes is a trap; you’ll invent new forms of sadness.
- 3 nodes: Ceph becomes viable. ZFS is still simpler. Your requirement for shared storage and fault tolerance decides.
- 4–8 nodes: Ceph shines if you need shared storage and can afford the network and disks to do it properly.
- 9+ nodes: Ceph is often the correct default for shared storage, but only if you treat it as a storage system, not a checkbox.
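For the two-node case above, the third vote is a small external helper: any always-on Linux box that is not part of the cluster. A minimal sketch, assuming the helper is reachable at 192.0.2.10 over SSH (address and host are hypothetical); package names are the standard Debian ones:
# on the external helper: the quorum daemon
apt install corosync-qnetd
# on each Proxmox node: the qdevice client
apt install corosync-qdevice
# from one Proxmox node: register the helper as the third vote
pvecm qdevice setup 192.0.2.10
# verify: pvecm status should now show three expected votes
pvecm status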
Step 3: What’s your failure budget?
- “A host can die and nothing should notice”: Ceph with replication (size 3) and sane failure domains.
- “A host can die and we can fail over in minutes, maybe restore some VMs”: ZFS with replication (zfs send/receive; a sketch follows this list), backups, and realistic RTO.
- “We just need good local disks, we’ll restore from backup”: ZFS, mirrors, and a backup plan that actually restores.
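For the “fail over in minutes” option, a minimal zfs send/receive sketch; dataset, snapshot, and node names are hypothetical, and Proxmox’s built-in storage replication can schedule this for ZFS-backed VMs if you’d rather not script it:
# snapshot the VM disk's zvol on the source node
zfs snapshot rpool/data/vm-101-disk-0@repl-1
# first pass: full send to the standby node (assumes tank/replica exists there)
zfs send rpool/data/vm-101-disk-0@repl-1 | ssh pve2 zfs receive tank/replica/vm-101-disk-0
# afterwards: incremental sends between consecutive snapshots (-F rolls the target back to the last common snapshot first)
zfs snapshot rpool/data/vm-101-disk-0@repl-2
zfs send -i @repl-1 rpool/data/vm-101-disk-0@repl-2 | ssh pve2 zfs receive -F tank/replica/vm-101-disk-0
Your RTO is then “how fast can you attach that replica to a VM on the standby node,” which is a number you should measure, not estimate.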
Step 4: What network do you have for storage traffic?
- 10GbE with shared switching and no isolation: ZFS tends to win unless your Ceph cluster is small and lightly loaded. Ceph can run on 10GbE, but performance and recovery will fight you.
- 25GbE+ dedicated storage network (or at least well-separated VLAN/QoS): Ceph becomes much more predictable, especially during backfill/rebalance.
Step 5: What kind of IO do your VMs actually do?
- Latency-sensitive random writes (databases, message queues): ZFS mirrors often feel faster. Ceph can do it, but you must engineer for it: NVMe OSDs, good network, and tuned recovery.
- Mostly reads, moderate writes, lots of VMs: Ceph is a good fit if you want shared storage and accept the write amplification of replication/EC.
- Big sequential workloads: both can do it; choose based on operational model and failure behavior.
Step 6: Who will operate it at 2 a.m.?
- Small team, low tolerance for storage-specific on-call: ZFS. Fewer moving pieces, fewer cluster-wide emergent behaviors.
- Team that can invest in operational maturity: Ceph can be boring—in the good way—once you’re disciplined.
Opinionated rule: if you’re choosing Ceph to avoid buying a shared SAN, you must still pay for the equivalent in network, disks, and operational effort. Ceph is not “cheap shared storage.” It’s “shared storage you own.”
What you’re really buying: semantics and failure modes
ZFS in Proxmox: local truth, strong integrity
ZFS is a filesystem and volume manager with end-to-end checksumming, copy-on-write, snapshots, replication, and a caching model that can make your VMs feel snappy. In Proxmox, ZFS is typically used as local storage per node: pools for VM disks (zvols) and datasets for backups or templates.
What you get: predictable performance, excellent data integrity, and operational simplicity. If a node dies, the storage dies with it—unless you’ve replicated elsewhere.
What you don’t get: shared block storage out of the box. You can build shared semantics via replication plus orchestration, but that’s not the same as “any node can run any VM right now with its disk attached.”
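To see what “local truth” means on a node, list the zvols backing your VM disks; rpool/data is the Proxmox default path, adjust to your pool:
# every VM disk listed here lives on this node and nowhere else
zfs list -t volume -r -o name,volsize,used,refreservation rpool/data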
Ceph in Proxmox: distributed block storage, shared by design
Ceph gives you a storage cluster: OSDs store data, MONs keep cluster state, and clients (your Proxmox nodes) talk to the cluster over the network. With RBD, you get shared block devices that Proxmox can use for VM disks. If a node or disk dies, the cluster heals by re-replicating data.
What you get: shared storage, self-healing replication, failure-domain awareness, and the ability to lose a node without losing access to VM disks.
What you don’t get: simplicity. You’re operating a distributed system. Distributed systems are great, right up until you realize the network is part of your RAID controller now.
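The equivalent look on the Ceph side is cluster-wide rather than per node; the pool name rbd is an assumption, use whatever your Proxmox RBD storage actually points at:
# capacity and per-pool usage as the whole cluster sees it
ceph df
# VM disk images in the shared pool, visible from any node
rbd -p rbd ls | head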
Dry truth: ZFS tends to fail “locally and loudly.” Ceph tends to fail “globally and subtly” if misconfigured, because cluster health and recovery behaviors can degrade everything at once.
Paraphrased idea (attributed): Werner Vogels (Amazon CTO) has long pushed the idea that “everything fails, so you design for failure.” Storage choices are where you either mean that or you don’t.
Interesting facts and historical context (so you stop repeating 2012 mistakes)
- Ceph started as a research project (Sage Weil’s PhD work), and its early design bet heavily on commodity hardware and software-driven reliability.
- CRUSH (Ceph’s placement algorithm) is designed to avoid central metadata bottlenecks by computing object placement, not looking it up.
- RBD (RADOS Block Device) became the “VM disk” workhorse because block storage semantics map cleanly to hypervisors and image formats.
- ZFS was born at Sun with a “storage pool” model that treated disks as a managed resource, not a fixed set of partitions with a prayer.
- ZFS popularized end-to-end checksumming in mainstream admin consciousness: silent corruption stops being a myth when ZFS shows you the receipts.
- Early Ceph clusters had a reputation for operational complexity because recovery tuning, PG counts, and hardware variance could make behavior unpredictable. Much of that got better, but the physics didn’t.
- Erasure coding became a big deal for Ceph because replication (3x) is expensive; EC reduces overhead but increases complexity and write cost.
- Proxmox embraced both Ceph and ZFS early because small-to-mid shops needed real storage options without buying a full SAN ecosystem.
Joke #1: RAID stands for “Redundant Array of Inexpensive Disks,” and then you buy the 25GbE switches and it becomes “Remarkably All-Incostly Decision.”
Hardware and topology: the part everyone under-budgets
Ceph hardware: the minimums are not the same as “works well”
Ceph wants consistent disks, consistent latency, and enough network headroom to survive recovery events. It will run on all sorts of hardware. It will also punish you for creative choices.
Disk layout choices
- HDD OSDs: good for capacity, bad for random write latency. Fine for archival-ish VM workloads, not fine for busy databases unless you accept higher latency.
- SSD/SATA OSDs: workable middle ground, but watch endurance and sustained write performance.
- NVMe OSDs: the “Ceph feels like local storage” experience is usually NVMe plus good networking.
Ceph’s cost is often dominated by write amplification: replication (size 3) writes multiple copies; EC spreads data and parity across OSDs; small random writes get expensive. Don’t “budget” Ceph by raw TB. Budget it by usable TB at the latency you need during recovery.
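A back-of-envelope example, assuming 12 OSDs of 4 TB each and a replicated pool with size 3: raw capacity is 12 × 4 = 48 TB; after 3× replication that’s 48 / 3 = 16 TB of logical space; keep roughly 20% free for recovery, imbalance, and nearfull thresholds, and you’re really planning around 12–13 TB. That’s the number to compare against a SAN quote or a bigger ZFS box, not the 48.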
Network
Ceph is a networked storage system. That means your storage bus is Ethernet. Under-provision it, and you get the storage equivalent of trying to drink a milkshake through a coffee stirrer.
- 10GbE: can work for small clusters, but recovery will eat your lunch. You need separation from client traffic or extremely careful traffic shaping.
- 25GbE: the sane baseline for “we care about performance and recovery time.”
- Dual networks: still useful (public/client vs cluster/backfill), but many deployments succeed with a single well-designed network if you’re disciplined.
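Two quick checks that catch most storage-network surprises before Ceph does; the peer address is hypothetical, and 8972 assumes a 9000-byte MTU (payload minus 28 bytes of IP/ICMP headers):
# verify jumbo frames pass end to end without fragmentation
ping -M do -s 8972 -c 4 10.10.10.2
# measure usable bandwidth between nodes (run iperf3 -s on the peer first)
iperf3 -c 10.10.10.2 -t 30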
ZFS hardware: boring is a feature
ZFS rewards you for doing the basics: ECC memory (preferably), mirrored vdevs for VM storage, and keeping pools comfortably under full. ZFS can do RAIDZ for capacity, but VM workloads are not “one big sequential file.” They’re a pile of small random writes in a trench coat.
Mirror vs RAIDZ in Proxmox VM workloads
- Mirrors: lower latency, higher IOPS, simpler rebuild behavior. Great default for VM pools.
- RAIDZ2/3: better capacity efficiency, but random write penalty and resilver time can be painful. Better for bulk storage, backups, and less latency-sensitive workloads.
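A minimal sketch of the two layouts over the same four disks; device names are placeholders, and in practice you’d use /dev/disk/by-id paths:
# striped mirrors: two mirror vdevs, random IOPS scale with vdev count
zpool create vmpool mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd
# raidz2: better capacity efficiency, roughly one vdev's worth of random-write IOPS
zpool create bulkpool raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd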
SLOG and L2ARC: stop buying magic parts first
Most teams don’t need a separate SLOG device. If your workload is mostly async writes (common for many VM workloads), SLOG won’t help much. If you’re running sync-heavy databases and you care about latency, then yes, a high-end, power-loss-protected SLOG can matter.
L2ARC can help reads, but it also consumes RAM for metadata and can create a false sense of security. Start with enough RAM and a sensible pool layout. Add L2ARC when you can prove a cache miss problem.
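Before buying a SLOG, prove the sync-write story on the pool in question; the directory and sizes below are placeholders, and run it off-hours because it generates real load:
# small random writes with an fsync after each one: the pattern a SLOG would actually help
fio --name=synctest --directory=/vmpool/fio-test --rw=randwrite --bs=4k --size=1G --iodepth=1 --numjobs=1 --fsync=1 --runtime=60 --time_based
If the reported latency is already acceptable, a SLOG is a solution in search of a problem.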
Performance reality: latency, IOPS, rebuilds, and the “noisy neighbor” tax
Ceph: steady-state vs failure-state
In steady state, Ceph can deliver excellent throughput and good latency—especially on NVMe and fast networks. In failure-state (OSD down, node reboot, network hiccup), Ceph starts doing what it’s designed to do: re-replicate, backfill, rebalance. That recovery work competes with client IO.
Your job is to ensure recovery doesn’t turn into a cluster-wide brownout. That means:
- Right-sized network and disk performance headroom.
- Reasonable recovery/backfill settings (not “unlimited, YOLO”); a sketch follows this list.
- Failure domains that match reality (host, chassis, rack).
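A sketch of the “not unlimited” part; these are the classic knobs, but option names and defaults vary by Ceph release, and newer releases with the mClock scheduler prioritize recovery differently, so treat this as a starting point rather than gospel:
# slow recovery/backfill while clients are hurting
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
# off-hours, give recovery more room so the degraded window shrinks
ceph config set osd osd_max_backfills 3
ceph config set osd osd_recovery_max_active 3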
ZFS: the ARC makes you look smart (until it doesn’t)
ZFS performance is often perceived as “fast” because ARC (in-memory cache) hides read latency. That’s real value. But it can also hide that your pool is under-provisioned for writes, or that you’re about to hit fragmentation and get wrecked by sync writes.
Another fun ZFS reality: once you get above ~80% pool utilization, performance often degrades. People argue about the exact number; your graphs won’t. Keep VM pools with headroom. You’re not storing family photos. You’re storing other people’s deadlines.
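The headroom check is one command; watch the CAP and FRAG columns and act before CAP crosses your line, not after:
zpool list
# or just the two numbers for a single pool (pool name hypothetical)
zpool get capacity,fragmentation vmpool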
“Noisy neighbor” in practice
With ZFS local pools, a noisy VM tends to hurt that node. With Ceph, a noisy VM can amplify into cluster-level pain if it drives heavy write IO that causes widespread replication traffic and recovery delays. This is why quality-of-service (QoS) and sane VM disk limits matter more in Ceph land.
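In Proxmox, per-disk limits live on the drive definition. The VM ID, storage, and volume names below are hypothetical, and qm set re-specifies the whole drive string, so carry over your existing options:
# cap a noisy VM's writes so it can't monopolize the shared pool
qm set 101 --scsi0 ceph-rbd:vm-101-disk-0,iothread=1,discard=on,mbps_wr=200,iops_wr=2000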
Operational overhead: who’s on-call for what
ZFS ops: fewer moving parts, but you must do replication/backups
ZFS’s operational workload is mostly:
- Monitor pool health (scrubs, SMART, error counters).
- Manage snapshots and retention (don’t snapshot forever; that’s not a plan, it’s procrastination); a quick audit command follows this list.
- Replication if you want resilience beyond a single host.
- Capacity management: add vdevs, don’t paint yourself into a RAIDZ expansion corner.
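For the snapshot-retention point, a quick way to see which snapshots are quietly eating the pool, sorted by space consumed:
zfs list -t snapshot -o name,used,creation -S used | head -20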
Ceph ops: a storage cluster is a living thing
Ceph requires comfort with:
- Cluster health: placement groups, backfill, degraded objects.
- Network troubleshooting as a storage skill, not a separate team’s hobby.
- Change management: adding OSDs changes data placement and triggers rebalancing.
- Understanding recovery behaviors: speed vs impact tradeoffs.
Joke #2: Ceph is like having a pet octopus—brilliant, powerful, and if you ignore it for a weekend it redecorates the house.
Three corporate mini-stories (all real enough to hurt)
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS shop moved from local ZFS to Ceph because they wanted seamless live migration. The storage engineer (good intentions, limited time) assumed the existing 10GbE network would be “fine” because the steady-state benchmarks looked acceptable. They built a 4-node Ceph cluster, replicated size 3, mixed SATA SSD OSDs and a few older drives “temporarily,” and went live.
Two months later, an OSD started flapping—up/down every few minutes—because of a marginal drive and an HBA that didn’t like its cabling. Ceph did what Ceph does: backfill started, then paused, then resumed, then started again. Client IO latency climbed. VM disks started timing out. The hypervisors looked “healthy,” but the storage was essentially thrashing.
The wrong assumption was subtle: they assumed storage recovery traffic was a background task. It’s not. In Ceph, recovery is a first-class workload. On 10GbE, recovery plus client IO can saturate the same links, causing latency spikes that look like “random application issues.”
The fix wasn’t glamorous. They isolated storage traffic, upgraded interconnects, and set conservative recovery limits during business hours. They also standardized OSD media classes and stopped mixing “temporary” disks that had different latency profiles.
The lesson: if you can’t afford the network headroom for recovery events, you can’t afford Ceph. You’re just borrowing reliability from the future at terrible interest.
Mini-story 2: The optimization that backfired
A different org ran Proxmox with ZFS mirrors. They were capacity-constrained and decided to “optimize” by switching new pools to RAIDZ2. The math was compelling. The dashboards looked greener. Finance was thrilled.
Then the VM workload changed. A product team added a write-heavy analytics pipeline, lots of small random writes, and a couple of databases configured for aggressive fsync behavior. Latency went from “fine” to “my app is haunted.” ZFS started spending a lot of time doing parity math and dealing with random write patterns that RAIDZ isn’t excited about. Scrubs and resilvers got longer. Performance under load became spiky.
They tried to band-aid it with faster SSDs as L2ARC, which mostly helped reads and did nothing for their write-latency pain. Then they bought a SLOG device without verifying that their workload was actually sync-write bound. It wasn’t, at least not in the way they thought. The SLOG improved a few metrics but didn’t fix the user-visible problem.
Eventually they migrated the busiest VM storage back to mirrored vdevs and kept RAIDZ2 for bulk datasets and backups. The capacity win was real, but it was the wrong tier for that workload.
The lesson: optimization without a workload model is just performance roulette. Sometimes you win. Operations remembers when you don’t.
Mini-story 3: The boring but correct practice that saved the day
A healthcare-adjacent company ran Proxmox with Ceph for shared VM disks. Nothing fancy: consistent NVMe OSDs, dedicated storage VLAN, and a strict change window. The unsexy part was their weekly routine: review Ceph health, verify scrubs, check SMART counters, and test restore of at least one VM from backups.
One Thursday, a top-of-rack switch started dropping packets under load. Not fully down. Just enough to cause intermittent latency spikes. The applications started complaining. The hypervisors looked fine. The storage graphs showed a rise in retransmits and a dip in client throughput.
Because they had boring baselines, they quickly saw what changed: network errors increased, and Ceph started showing slow ops. They throttled recovery, shifted some traffic, and coordinated with networking to swap hardware. The cluster stayed available.
The real save came later. During the network event, one OSD got marked out and started backfilling. They noticed it immediately from routine health checks, confirmed the data was rebalancing safely, and prevented a second failure from turning into a data-risk situation by pausing non-essential maintenance tasks.
The lesson: boring practices—baselines, routine health checks, and restore tests—turn “mystery outage” into “controlled incident.” It’s not glamorous. It’s how you keep weekends.
Practical tasks with commands: what to run, what it means, what to decide
Below are practical tasks you can run on Proxmox nodes. Each one includes: a command, a realistic sample output, what it means, and the decision it drives. Run them. Don’t debate them in Slack.
Task 1: Confirm what storage types Proxmox thinks you have
cr0x@server:~$ pvesm status
Name Type Status Total Used Available %
local dir active 196.00G 21.30G 164.70G 10%
local-zfs zfspool active 1.75T 612.00G 1.14T 35%
ceph-rbd rbd active 10.00T 6.20T 3.80T 62%
Output meaning: You have both local ZFS and Ceph RBD configured, and Ceph is at 62% usage.
Decision: If Ceph is your primary VM disk store, 62% is not “full,” but it’s far enough along that recovery/rebalance will cost more. Plan capacity before 75–80%.
Task 2: Check ZFS pool health and spot silent trouble early
cr0x@server:~$ zpool status -x
all pools are healthy
Output meaning: No known errors, no degraded vdevs.
Decision: If this isn’t “all pools are healthy,” stop arguing about Ceph vs ZFS and fix your current pool first. Storage hygiene precedes architecture debates.
Task 3: Inspect ZFS layout (mirror vs RAIDZ tells you performance expectations)
cr0x@server:~$ zpool status rpool
pool: rpool
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-SAMSUNG_SSD_1 ONLINE 0 0 0
ata-SAMSUNG_SSD_2 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
ata-SAMSUNG_SSD_3 ONLINE 0 0 0
ata-SAMSUNG_SSD_4 ONLINE 0 0 0
Output meaning: Two mirrored vdevs (striped mirrors). Great for VM IOPS and rebuild behavior.
Decision: If you’re on RAIDZ for VM disks and you’re latency-sensitive, this is the moment to consider migrating hot workloads to mirrors.
Task 4: Check ZFS compression and volblocksize (quiet performance killers)
cr0x@server:~$ zfs get -o name,property,value -s local compression,volblocksize local-zfs/vmdata
NAME PROPERTY VALUE
local-zfs/vmdata compression lz4
local-zfs/vmdata volblocksize 8K
Output meaning: LZ4 compression enabled (usually good), volblocksize 8K (common for VM disks).
Decision: If compression is off, turn it on unless you have a specific reason. If volblocksize is huge for random-write workloads, expect write amplification.
Task 5: Check ARC size pressure (ZFS performance often equals RAM reality)
cr0x@server:~$ arcstat 1 1
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
12:01:11 892 73 8 39 53 34 47 0 0 48G 56G
Output meaning: ARC is 48G, target 56G, miss rate 8%. Not terrible.
Decision: If miss% is consistently high and your disks are busy, consider more RAM before buying cache SSDs. RAM is the cheapest performance upgrade in ZFS land.
Task 6: Confirm whether sync writes are a bottleneck (SLOG decisions live here)
cr0x@server:~$ zfs get -o name,property,value sync local-zfs/vmdata
NAME PROPERTY VALUE
local-zfs/vmdata sync standard
Output meaning: Default sync behavior. Apps requesting sync will get it; others won’t.
Decision: Don’t set sync=disabled to “fix performance” for VM disks. If you do, rename your cluster “maybe-later” and update your resume.
Task 7: Check Ceph cluster health at a glance
cr0x@server:~$ ceph -s
cluster:
id: 8e0c2d8f-7e2b-4c2a-9d1b-2c9e5d2e3a1a
health: HEALTH_WARN
12 slow ops, oldest one blocked for 38 sec
1 osds down
services:
mon: 3 daemons, quorum pve1,pve2,pve3
mgr: pve1(active), standbys: pve2
osd: 12 osds: 11 up, 12 in
data:
pools: 4 pools, 256 pgs
objects: 1.20M objects, 4.6 TiB
usage: 13 TiB used, 24 TiB / 37 TiB avail
pgs: 240 active+clean
16 active+degraded
Output meaning: One OSD is down but still “in,” and you have slow ops and degraded PGs.
Decision: Treat HEALTH_WARN with slow ops as a production incident if it persists. Find the down OSD and the reason before you reboot anything else.
Task 8: Identify the down or flapping OSD quickly
cr0x@server:~$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 37.260 root default
-3 12.420 host pve1
0 ssd 3.105 osd.0 up 1.00000 1.00000
1 ssd 3.105 osd.1 down 1.00000 1.00000
2 ssd 3.105 osd.2 up 1.00000 1.00000
3 ssd 3.105 osd.3 up 1.00000 1.00000
-5 12.420 host pve2
4 ssd 3.105 osd.4 up 1.00000 1.00000
5 ssd 3.105 osd.5 up 1.00000 1.00000
6 ssd 3.105 osd.6 up 1.00000 1.00000
7 ssd 3.105 osd.7 up 1.00000 1.00000
-7 12.420 host pve3
8 ssd 3.105 osd.8 up 1.00000 1.00000
9 ssd 3.105 osd.9 up 1.00000 1.00000
10 ssd 3.105 osd.10 up 1.00000 1.00000
11 ssd 3.105 osd.11 up 1.00000 1.00000
Output meaning: osd.1 is down on host pve1.
Decision: Decide whether this is a transient daemon issue (restart) or a device issue (SMART, kernel logs). If hardware, mark out and replace properly.
Task 9: Check if Ceph is backfilling/recovering (performance impact predictor)
cr0x@server:~$ ceph -w
2026-02-04T12:03:10.123+0000 mon.pve1 [WRN] Health check update: 1 osds down (OSD_DOWN)
2026-02-04T12:03:12.456+0000 osd.3 [INF] 1.2% backfill_recovery 120/10000 objects recovered
2026-02-04T12:03:13.222+0000 mgr.pve1 [WRN] Slow OSD heartbeats on back (osd.6)
Output meaning: Recovery/backfill is active, plus heartbeat warnings (network or overloaded OSD).
Decision: If client latency is hurting, consider temporarily throttling recovery to keep services alive, then speed it up off-hours.
Task 10: Confirm Ceph pool replication settings (you can’t guess durability)
cr0x@server:~$ ceph osd pool ls detail
pool 1 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on
pool 2 'cephfs_data' erasure profile ec-4-2 crush_rule 3 pg_num 64 pgp_num 64 autoscale_mode on
Output meaning: Your VM disk pool (rbd) is replicated size 3, min_size 2.
Decision: If you’re on size 2 because “space,” understand you’re trading away failure tolerance. Size 2 often turns a second failure into downtime or data risk.
Task 11: Check Ceph OSD latency and utilization (who is the slow kid in group projects)
cr0x@server:~$ ceph osd perf
osd commit_latency(ms) apply_latency(ms)
0 5 7
1 0 0
2 6 8
3 52 75
4 6 9
5 5 7
6 7 10
Output meaning: osd.3 is dramatically slower than peers.
Decision: Investigate osd.3 host hardware, disk health, and network. A single slow OSD can cause cluster-wide latency via slow ops.
Task 12: Validate the storage network (packet loss turns into “Ceph is slow”)
cr0x@server:~$ ip -s link show eno2
2: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
RX: bytes packets errors dropped missed mcast
9823456789 8123456 0 0 0 12345
TX: bytes packets errors dropped carrier collsns
1234567890 7123456 0 0 0 0
Output meaning: No RX/TX errors or drops on this interface.
Decision: If you see errors/dropped packets, fix the network before tuning Ceph. Storage traffic hates “mostly works.”
Task 13: Verify disk health signals (ZFS and Ceph both die from boring disks)
cr0x@server:~$ nvme smart-log /dev/nvme0n1 | egrep -i "critical_warning|media_errors|num_err_log_entries|percentage_used"
critical_warning : 0x00
percentage_used : 7%
media_errors : 0
num_err_log_entries : 0
Output meaning: NVMe looks healthy; low wear, no media errors.
Decision: If media_errors climb or critical_warning is set, stop trusting the device. Plan replacement before it decides for you.
Task 14: Measure IO latency on the node (prove whether it’s disk or network)
cr0x@server:~$ iostat -x 1 3
avg-cpu: %user %nice %system %iowait %steal %idle
12.11 0.00 4.25 8.90 0.00 74.74
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s r_await w_await aqu-sz %util
nvme0n1 120.0 850.0 4800.0 55200.0 0.0 2.0 1.2 18.4 6.5 92.0
Output meaning: Device is heavily utilized, and write await is ~18ms—high for NVMe, suggesting saturation or queueing.
Decision: If %util is pegged and await rises, you’re disk-bound on that node/OSD host. In Ceph, that can be one slow OSD; in ZFS, maybe one vdev is overloaded.
Task 15: Check Proxmox VM disk IO settings (sometimes the bottleneck is self-inflicted)
cr0x@server:~$ qm config 101 | egrep -i "scsi|virtio|iothread|cache|discard"
scsihw: virtio-scsi-single
scsi0: ceph-rbd:vm-101-disk-0,discard=on,iothread=1,ssd=1
Output meaning: virtio-scsi-single with iothread enabled and discard on, using Ceph RBD.
Decision: Good baseline for many workloads. If you see cache=unsafe anywhere, treat it as a risk register item with a short deadline.
Task 16: Watch Ceph client IO from the cluster perspective
cr0x@server:~$ ceph osd pool stats
pool rbd id 1
client_io_rate: 48 MiB/s rd, 22 MiB/s wr, 310 op/s rd, 420 op/s wr
pool cephfs_data id 2
client_io_rate: 12 MiB/s rd, 4 MiB/s wr, 90 op/s rd, 60 op/s wr
Output meaning: The RBD pool is doing the bulk of IO.
Decision: When users say “storage is slow,” validate whether IO actually spiked, or whether latency rose without throughput (often network or one OSD).
Fast diagnosis playbook: find the bottleneck in minutes
This is the triage order I use when someone says “VMs are slow” and the room starts blaming storage as a ritual.
First: is this node-local or cluster-wide?
- Check Proxmox node load: if one node is hot, it may be local IO or CPU scheduling.
- Check if only VMs on Ceph are affected: if yes, focus on Ceph health and network; if no, it may be host CPU, memory pressure, or a shared network issue.
Second: is Ceph unhealthy or recovering?
- Run ceph -s and look for HEALTH_WARN/ERR, slow ops, degraded PGs, OSD down, recovery/backfill.
- Decision: If recovery is happening, assume it’s contributing to latency. Decide whether to throttle recovery temporarily.
Third: is it network loss/latency?
- Check interface errors/drops with ip -s link on all nodes and storage VLAN ports.
- Look for Ceph heartbeat warnings: they often point to network micro-loss or saturated links.
- Decision: If you see drops/retransmits, treat network as the primary suspect.
Fourth: is it one slow disk/OSD?
- Ceph: run ceph osd perf to identify outliers; then check SMART and dmesg on the host.
- ZFS: check zpool status for errors and iostat -x for saturation; check whether one vdev is carrying more load.
- Decision: Replace failing hardware early. A degraded-but-still-working disk is how outages start.
Fifth: is it VM configuration or guest behavior?
- Check disk bus and iothreads: misconfigured devices can cap performance.
- Check guest fsync behavior: databases can turn storage into a latency microscope.
- Decision: If one VM is the culprit, apply IO limits or move it to a tier designed for it.
Common mistakes: symptom → root cause → fix
1) Symptom: “Ceph is slow during the day, fine at night”
Root cause: Recovery/backfill competes with client IO; business-hour load leaves no headroom.
Fix: Throttle recovery during peak; increase network capacity; use faster media classes; reduce variance in OSD performance.
2) Symptom: Random VM freezes/timeouts on Ceph
Root cause: Packet loss or microbursts on storage network; a flapping OSD; slow ops accumulating.
Fix: Validate error counters and switch health; isolate storage traffic; fix MTU mismatches; replace unstable disks/OSDs.
3) Symptom: ZFS pool “healthy” but performance keeps degrading over months
Root cause: Pool too full; fragmentation; snapshots retained forever; small random writes on RAIDZ.
Fix: Keep headroom; prune snapshots; migrate hot VM storage to mirrored vdevs; add vdevs before crisis.
4) Symptom: “Adding disks made Ceph worse”
Root cause: Rebalancing/backfill kicked in and saturated network/disk; OSDs not uniform; CRUSH failure domains mismatched.
Fix: Add capacity in controlled windows; throttle backfill; keep OSD classes consistent; validate CRUSH topology.
5) Symptom: ZFS corruption scare after power event
Root cause: No UPS; consumer SSDs without power-loss protection; risky write cache settings.
Fix: UPS plus proper SSDs for critical workloads; avoid unsafe caching; test recovery procedures.
6) Symptom: Ceph looks healthy but latency spikes persist
Root cause: One slow OSD (high apply latency) or CPU contention on OSD hosts; noisy neighbor VM saturating IO.
Fix: Use ceph osd perf and host-level iostat; move heavy VMs; apply IO limits; ensure OSD hosts have CPU headroom.
7) Symptom: Live migration is slow or fails under load
Root cause: Migration traffic shares constrained network with Ceph; or storage is local ZFS and you’re actually copying disks.
Fix: Separate migration network; if you need shared semantics, use Ceph or accept planned downtime migration with replication.
8) Symptom: “We set sync=disabled and everything got fast”
Root cause: You traded durability for speed; you’re now vulnerable to power loss and kernel panic data loss.
Fix: Undo it. If sync writes are too slow, fix the storage path (mirrors, SLOG with PLP, faster media, better tuning), not the laws of physics.
Checklists / step-by-step plan
Checklist A: If you’re leaning ZFS (local storage, simpler ops)
- Choose mirrors for VM pools unless you have a proven workload that fits RAIDZ.
- Enable compression (lz4) and verify it on VM datasets/zvols.
- Plan headroom: target <70% used for VM pools; treat 80% as a performance cliff, not a suggestion.
- Schedule scrubs and monitor for checksum errors.
- Implement replication (zfs send/receive) to another node or a backup target if you need fast restores.
- Test restores monthly at minimum; more often if you change retention policies.
- Decide your failure story: node loss = restore from backup, or failover via replicated VM disks.
Checklist B: If you’re leaning Ceph (shared storage, HA semantics)
- Start with 3+ nodes with consistent OSD hardware; avoid mixed latency classes in the same pool.
- Budget network properly: 25GbE baseline for serious workloads; isolate storage traffic.
- Pick replication size intentionally (often size 3). Write down what failure it tolerates.
- Define failure domains (host/rack) so CRUSH matches physical reality.
- Set recovery/backfill limits for daytime stability; have an “after hours” profile if needed.
- Monitor slow ops and OSD perf outliers; treat them as early warning.
- Practice OSD replacement on a non-crisis day. You don’t want your first time to be during a degraded state.
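A starting sketch for that practice run, assuming osd.7 is the victim; do it during a window, on a healthy cluster, and stop if ceph -s shows anything beyond the recovery you expected:
# drain the OSD: mark it out and let data rebalance away
ceph osd out osd.7
# watch until PGs return to active+clean
ceph -s
# confirm nothing still depends on it before you destroy and replace it
ceph osd safe-to-destroy osd.7
# then remove it (GUI or pveceph osd destroy 7) and add the replacement as a new OSD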
Checklist C: Migration plan if you chose wrong last year
- Inventory workloads: which VMs need low latency, which need capacity, which need shared semantics.
- Create tiers: hot storage (mirrors/NVMe), general storage, bulk/backup storage.
- Move one workload first and measure before/after (latency, throughput, tail lat).
- Automate rollback so your migration isn’t a one-way door.
- Document operational runbooks for the new tier (how to replace disk, how to diagnose slow IO, who pages whom).
FAQ
1) Can I run Ceph on 3 nodes with mixed SSD and HDD?
You can, but you probably shouldn’t for VM disks. Mixed latency in the same pool tends to create tail-latency pain. If you must mix, separate by device class and use different pools.
2) Is Ceph always slower than local ZFS?
No. But Ceph adds network hops and replication overhead. On great hardware and network, Ceph can be fast and consistent. On “leftover 10GbE,” local ZFS will often feel better for latency-sensitive writes.
3) Is RAIDZ “bad” for ZFS VM storage?
Not morally. Practically, RAIDZ can be a poor fit for random-write VM workloads because parity overhead and IOPS behavior can hurt. Mirrors are usually the safer default for VM disks.
4) Should I use Ceph erasure coding for VM disks?
Usually no for primary VM disks unless you have a strong reason and you understand the performance tradeoffs. EC shines for capacity efficiency and tends to fit object/file-ish workloads better than latency-sensitive random-write block IO.
5) What’s the minimum “serious” Ceph network?
For production with meaningful load: 25GbE and sane switching. 10GbE can work, but you’ll end up tuning around recovery events and living with more variance.
6) Do I need ECC RAM for ZFS?
Strongly recommended, especially for critical workloads. ZFS protects data on disk with checksums, but RAM errors can still cause bad outcomes. ECC reduces a class of silent, ugly failures.
7) How do snapshots differ operationally between Ceph and ZFS?
ZFS snapshots are extremely mature and easy to reason about. Ceph RBD snapshots exist and are useful, but your operational muscle memory will be stronger with ZFS-style workflows. For Proxmox backups, you’ll typically use Proxmox Backup Server regardless.
8) Can I combine them: ZFS for local and Ceph for shared?
Yes, and many shops do. Put latency-sensitive or “single-node affinity” workloads on local ZFS, and use Ceph for VMs that benefit from shared storage and HA. The trick is having clear placement rules and monitoring both paths.
9) What’s the most common reason Ceph deployments disappoint?
Under-provisioned network and inconsistent OSD hardware. People expect distributed storage to behave like a local SSD. Ceph can, but only if you pay for it in design.
10) What’s the most common reason ZFS deployments disappoint?
Capacity pressure and snapshot sprawl. Pools get too full, performance drops, and then someone “fixes it” by deleting the wrong thing. Plan headroom and retention from day one.
Next steps you can do this week
- Run the command tasks above on your current cluster and write down what’s true (layout, health, latency outliers, network drops). That becomes your baseline.
- Decide what you actually need: shared storage semantics (Ceph) or strong local storage with replication/backups (ZFS). Don’t pretend you need both if you don’t.
- If you’re on Ceph: validate network isolation and headroom; confirm pool size/min_size; identify any slow OSDs and fix them before they become the incident.
- If you’re on ZFS: confirm mirrors for VM pools, compression on, and plenty of headroom; audit snapshot retention; do a restore test that involves booting a VM.
- Write the failure story in one page: “If a node dies, we do X; if a disk dies, we do Y; if the network flakes, we do Z.” If you can’t write it, you don’t own it yet.
The Ceph vs ZFS decision isn’t “which is better.” It’s “which set of tradeoffs do you want to debug at 2 a.m.” Pick the one you can operate, not the one that wins a forum argument.