Ceph on Proxmox: when it’s worth it, when it’s a trap, and how to size it correctly


You built a Proxmox cluster because you wanted simple virtualization, not a second career in distributed storage.
Then someone said “just add Ceph” and promised magic: HA, live migration, no single point of failure, pool-based growth.
Next thing you know, VM disks feel like they’re stored on spinning rust over a VPN, recovery traffic is eating the network,
and the CEO’s favorite app is “a little slow today”.

Ceph can be a power tool. It can also remove fingers. This guide is about knowing when it’s the right tool, when it’s a trap,
and how to size it so it behaves like production infrastructure instead of an ongoing science fair.

When Ceph on Proxmox is worth it

Ceph on Proxmox is worth it when you need shared block storage for virtualization that can grow,
survive failures, and be operated without vendor lock-in. The key phrase is “need”, not “it would be cool”.

It’s a fit when these statements are true

  • You need HA + live migration at scale without a big central SAN/NAS.
  • You can dedicate real hardware (NVMe, CPU, RAM, network) to storage behavior, not just capacity.
  • You expect failures (a disk per month, a host per quarter) and you want the cluster to take it.
  • You can run at least three nodes (and preferably more) with consistent hardware and networking.
  • Your workload is mostly VM disks (RBD) with reasonable IO patterns, not a chaos parade of tiny sync writes.
  • You have ops discipline: monitoring, maintenance windows, firmware control, and a human who owns storage health.

What Ceph gives you that’s hard to fake

The real value is not “distributed storage”. It’s predictable failure tolerance and operational elasticity.
When you add nodes, you add capacity and performance. When a host dies, data is still there. When a disk dies, you replace it
and Ceph heals. If you set it up sanely, this is boring in the best way.

If you want a single sentence decision rule: Ceph is worth it when you want storage to fail like a service, not like a device.

When it’s a trap (and what to do instead)

The trap is thinking Ceph is “three servers and some disks”. That’s like thinking a datacenter is “a room and some power”.
Ceph is a distributed system. It’s latency-sensitive, network-hungry during recovery, and extremely honest about physics.

Red flags that usually mean “don’t do Ceph here”

  • Two nodes or “we’ll add the third later”. Ceph needs quorum and failure domains; “later” is not a design.
  • 1GbE or “storage and VM traffic share the same cheap switch”. Recovery will eat your lunch.
  • Mixed hardware where half the nodes are old and half are new. The slowest OSDs set the tone during recovery.
  • Capacity-first budgeting: lots of HDDs, tiny SSDs, minimal RAM, minimal CPU.
  • Write-heavy databases with fsync storms, or applications that do sync writes per transaction without batching.
  • No operator time. If nobody owns storage, the cluster will eventually own you.

What to do instead (realistic alternatives)

  • Single-node or small cluster lab: ZFS on each node, replicate with ZFS send, accept that live migration isn’t free.
  • Two nodes in production: use a shared SAN/NAS or a third lightweight witness host for quorum (but don’t fake Ceph).
  • Latency-critical DBs: local NVMe with replication at the application layer; or a dedicated storage appliance.
  • Budget constrained: fewer, stronger nodes beats many weak ones. Ceph doesn’t reward “more junk”.

Joke #1: Ceph on 1GbE is a great way to learn patience, mindfulness, and how to explain “eventual consistency” to angry humans.

How Ceph actually behaves under VM workloads

Proxmox typically uses Ceph RBD for VM disks. That means your VM reads and writes are turned into object operations
against placement groups (PGs), spread across OSDs, and replicated (or erasure-coded) across failure domains.
It’s elegant. It’s also relentless: every write becomes multiple writes.

Replicated pools: the default for a reason

A replicated pool with size=3 writes data to three OSDs. Your “1GB write” is three writes plus metadata work.
The upside is low read amplification and simple rebuild logic. For VM disks, replicated pools remain the sane default.

Erasure coding: great on paper, complicated in practice

Erasure coding saves usable capacity, but it increases CPU cost, network traffic, and small-write amplification.
It can work for large sequential workloads. For VM boot disks and random write patterns, it often becomes a latency machine.
Use EC pools cautiously, usually for cold data, backups, or object storage patterns—not as your primary VM datastore.

BlueStore, WAL/DB, and the “small write tax”

Modern Ceph uses BlueStore (not FileStore). BlueStore has its own metadata (RocksDB) and uses a WAL.
On HDD OSDs, you want those DB/WAL components on SSD/NVMe, or you’ll pay latency for every metadata-heavy operation.
On all-flash OSDs, keeping DB/WAL on the same device is fine; splitting can help but isn’t mandatory.
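
If you’re building hybrid OSDs, the split happens at OSD creation time. A minimal sketch, assuming an HDD at /dev/sdb and an NVMe partition reserved for its DB/WAL (device paths are illustrative; check your pveceph version for the exact wrapper flags):

# Ceph-native: HDD holds data, DB/WAL carved out of NVMe
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

# Proxmox wrapper equivalent (flag names may differ slightly between releases)
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1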

Latency beats throughput for virtualization

VM disks are dominated by small random IO, metadata updates, and fsync behavior that depends on the guest OS.
If you’re only looking at “GB/s”, you will build a cluster that benchmarks well and feels terrible.
What matters: p95 latency during normal operations and during recovery.

One quote, because operations keeps receipts

Paraphrasing Gene Kranz: “tough and competent” systems come from discipline and preparation, not optimism.

Facts and historical context (because Ceph didn’t appear by accident)

  • Ceph began as a research project at UCSC in the mid‑2000s, focused on eliminating centralized metadata bottlenecks.
  • Its CRUSH algorithm was designed to avoid lookup tables and scale placement decisions without a central coordinator.
  • Ceph’s RADOS layer is the real product; block (RBD), file (CephFS), and object (RGW) are “clients” on top of it.
  • BlueStore replaced FileStore to avoid double journaling through a POSIX filesystem and improve performance consistency.
  • CephFS required stable metadata server behavior; it took years before it was treated as production-ready in conservative shops.
  • Proxmox integrated Ceph deeply to make hyperconverged storage accessible without a separate storage vendor stack.
  • Placement groups exist to batch object placement; too few PGs bottleneck parallelism, too many increase overhead.
  • Ceph’s health states (HEALTH_OK/WARN/ERR) are intentionally blunt so operators don’t ignore “minor” issues until they’re outages.

Sizing that doesn’t lie to you

Sizing Ceph is about matching failure tolerance, latency targets, and recovery time.
Capacity is the easiest part—and the most misleading. The right way is to work backwards from the workload.

Start with the only question that matters: what happens when a node dies?

When a node dies, Ceph will backfill/rebalance. That means a lot of reads from surviving OSDs and writes to others.
Your cluster must survive this while keeping VM IO tolerable. If you size for “happy path only”, recovery becomes the outage.

Node count: three is minimum; five is where it starts feeling adult

  • 3 nodes: workable, but failure domains are tight. Recovery pressure is intense; maintenance is stressful.
  • 4 nodes: better, but still awkward for some CRUSH layouts and maintenance scheduling.
  • 5+ nodes: smoother recovery, easier to take nodes down, better distribution of PGs and IO.

Networking: treat it like a storage backplane, not “just Ethernet”

For Ceph on Proxmox, the network is the chassis. Under normal IO, you push replication traffic. Under recovery,
you push a storm. If your network is underbuilt, everything becomes “random latency” and the blame game begins.

  • Minimum viable: 10GbE, ideally dedicated or VLAN-separated public/cluster traffic.
  • Comfortable: 25GbE for mixed workloads, or 10GbE for smaller all-flash clusters with careful tuning.
  • Switches: non-blocking, with adequate buffers, consistent MTU, and no “mystery” QoS policies.

OSD media: all-flash is simpler; HDD requires SSD/NVMe help

If you’re running VM storage on HDD OSDs, you’re choosing a world where random write latency is your constant companion.
You can make it usable with SSD/NVMe for DB/WAL, enough spindles, and realistic expectations. But don’t pretend it’ll feel like SAN flash.

Rule-of-thumb hardware profiles (practical, not theoretical)

  • Small all-flash VM cluster (3–5 nodes): 8–16 NVMe drives per node, or fewer but larger enterprise NVMe devices; 128–256GB RAM; 16–32 cores; 25GbE if you can.
  • Hybrid (HDD capacity + SSD metadata): 8–16 HDD per node plus 1–2 NVMe for DB/WAL; 128GB+ RAM; 25GbE strongly recommended.
  • “We found disks in a closet”: don’t.

CPU and RAM: Ceph will happily use what you deny it

OSDs consume CPU for checksums, compression (if enabled), EC math (if used), and general IO pipelines.
MONs and MGRs need memory to keep cluster maps and serve clients without stalling.

  • RAM: plan for OSD memory + host overhead + Proxmox. Starving the host causes jitter that looks like “network” or “Ceph bug”.
  • CPU: do not oversubscribe to the point where OSD threads get preempted under VM load. Latency spikes are the symptom.

Capacity math that matches reality

For replicated pools, usable capacity is roughly: raw / replica_size, minus overhead.
Replica size 3 means you get about one-third usable. Then reserve headroom: Ceph needs space to move data during failure and recovery.

  • Target: keep pools under ~70–75% full in steady state if you value your weekends.
  • Why: recovery and backfill need free space across OSDs; near-full clusters amplify rebalance pain and risk.
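
A worked example with illustrative numbers, assuming a five-node all-flash cluster:

  raw:        5 nodes × 8 × 3.84 TB NVMe         ≈ 153.6 TB raw
  usable:     153.6 TB / 3 (replica size)         ≈ 51.2 TB
  plannable:  51.2 TB × 0.70 (headroom target)    ≈ 36 TB you should actually provision

If the spreadsheet only shows the first line, the spreadsheet is lying to you.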

IOPS and latency sizing: the uncomfortable part

You size Ceph by understanding IO patterns:

  • Small random writes: punish HDD clusters and underpowered networks.
  • Sync writes (fsync): reveal journal/WAL and device latency; guests running databases can dominate cluster behavior.
  • Reads: usually easier, but cold-cache reads plus recovery can still hurt.

If you can’t measure the workload, assume it’s worse than you think. Production workloads always are.
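
If you need a number rather than a feeling, a short fio run inside a representative guest shows the latency picture that throughput benchmarks hide. A minimal sketch (file path, size, and runtime are assumptions; pick values that won’t fill the guest disk):

# 4k random writes with an fsync after every write: close to worst-case VM behavior
fio --name=sync-randwrite --filename=/var/tmp/fio.test --size=2G \
    --rw=randwrite --bs=4k --iodepth=1 --ioengine=psync --fsync=1 \
    --runtime=120 --time_based --group_reporting

Read the clat percentiles (p95/p99), not just the IOPS line, and run it again during a backfill.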

Pool, CRUSH, and data placement decisions

Replicated pool sizing: pick the failure you want to survive

Most Proxmox deployments use size=3, min_size=2. That generally survives one node failure without going read-only.
Dropping to size=2 is tempting for capacity. It’s also a reliability downgrade that often becomes a surprise outage during a second failure.
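
For reference, creating such a pool by hand looks like this sketch (pool name and PG count are illustrative; Proxmox can also do this from the GUI or with pveceph):

ceph osd pool create vm-fast 128
ceph osd pool set vm-fast size 3
ceph osd pool set vm-fast min_size 2
ceph osd pool application enable vm-fast rbd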

Failure domains: host, chassis, rack

CRUSH rules decide where replicas land. If your failure domain is “host” but your three nodes share a single power strip,
your failure domain is actually “power strip”. Design your physical layout like you mean it.

PG count: don’t hand-tune like it’s 2016

PG autoscaling exists for a reason. Still, you need to understand PG pressure:
too few PGs can create hotspots; too many can overload OSD memory and CPU.
Use autoscaler, then sanity-check distribution and performance.

Separate pools by performance class, not by vibes

  • Fast pool: all-flash NVMe OSDs for latency-sensitive VMs.
  • Capacity pool: HDD+SSD DB/WAL for bulk storage, less sensitive workloads.
  • Don’t mix: putting a slower OSD class into the same pool drags client IO during recovery and PG peering (see the CRUSH sketch below).
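
A minimal sketch of keeping device classes apart with CRUSH rules (rule and pool names are assumptions):

# One rule per device class, failure domain = host
ceph osd crush rule create-replicated fast-ssd default host ssd
ceph osd crush rule create-replicated bulk-hdd default host hdd

# Pin each pool to its rule
ceph osd pool set vm-fast crush_rule fast-ssd
ceph osd pool set vm-bulk crush_rule bulk-hdd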

Practical tasks: commands, outputs, what it means, what you decide

These are the checks I run when someone says “Ceph is slow” or “Ceph is weird”.
Each one includes: command, example output, what the output means, and the decision you make.

Task 1 — Confirm cluster health and the real headline

cr0x@server:~$ ceph -s
  cluster:
    id:     2c8c8b9a-1c61-4a9f-9a19-4c0d7cfe2a91
    health: HEALTH_WARN
            1 slow ops, oldest one blocked for 23 sec, osd.7 has slow ops

  services:
    mon: 3 daemons, quorum pve1,pve2,pve3 (age 2h)
    mgr: pve1(active, since 2h), standbys: pve2
    osd: 24 osds: 24 up (since 2h), 24 in (since 2h)

  data:
    pools:   2 pools, 256 pgs
    objects: 1.12M objects, 4.3 TiB
    usage:   13 TiB used, 21 TiB / 34 TiB avail
    pgs:     254 active+clean
             2  active+remapped+backfilling

Meaning: Not “everything is broken”. It’s one OSD with slow ops and backfill ongoing.

Decision: Treat performance complaints as expected during backfill; investigate why osd.7 is slow (disk, CPU, network).

Task 2 — Check if you’re in recovery/backfill mode

cr0x@server:~$ ceph health detail
HEALTH_WARN 1 slow ops; 2 pgs backfilling
SLOW_OPS 1 slow ops, oldest one blocked for 23 sec, osd.7 has slow ops
PG_BACKFILLING 2 pgs backfilling

Meaning: Recovery activity is present; client latency may spike.

Decision: Decide whether to throttle recovery temporarily or ride it out; check if backfill is expected (recent OSD restart, disk replacement).

Task 3 — Identify which OSD is the villain (or the victim)

cr0x@server:~$ ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
  0                  6                 7
  1                  5                 6
  7                120               145
  8                  7                 8

Meaning: osd.7 is far slower than peers; this is not “Ceph is slow”, it’s “one component is slow”.

Decision: Investigate osd.7 host disk health, saturation, or network errors; consider marking it out if it’s harming the cluster.
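
To see what osd.7 is actually stuck on, the admin socket on its host tells you more than guessing. A sketch (run on the node that hosts osd.7; output format varies by release):

# Operations currently in flight and the recent historic ones
ceph daemon osd.7 dump_ops_in_flight
ceph daemon osd.7 dump_historic_ops

# Quick write test through the OSD itself, to separate “slow disk” from “slow everything”
ceph tell osd.7 bench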

Task 4 — Check OSD tree and ensure topology matches reality

cr0x@server:~$ ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME      STATUS  REWEIGHT  PRI-AFF
-1         34.00000 root default
-3         11.33333     host pve1
 0    ssd   1.33333         osd.0      up   1.00000  1.00000
 1    ssd   1.33333         osd.1      up   1.00000  1.00000
 2    ssd   1.33333         osd.2      up   1.00000  1.00000
-5         11.33333     host pve2
 7    ssd   1.33333         osd.7      up   1.00000  1.00000
 8    ssd   1.33333         osd.8      up   1.00000  1.00000
 9    ssd   1.33333         osd.9      up   1.00000  1.00000

Meaning: OSD class and weights appear consistent. If you see wildly different weights, you’ll get uneven data and IO.

Decision: If one host has fewer OSDs/weight, expect hotspots; plan expansion or reweighting.

Task 5 — Check PG autoscaler status

cr0x@server:~$ ceph osd pool autoscale-status
POOL     SIZE  TARGET SIZE  RATE  RAW CAPACITY  RATIO  TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
rbd         3            -   3.0       34 TiB   0.65             -     128         192  on
cephfs      3            -   3.0       34 TiB   0.10             -     128          64  on

Meaning: Autoscaler suggests rbd needs more PGs (more parallelism), while cephfs could use fewer.

Decision: Accept autoscaler changes during calm periods; avoid massive PG churn during incidents.
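
The levers behind that decision are per-pool settings. A sketch, assuming the rbd pool from the output above:

# Let the autoscaler act, or set to "warn" if you prefer to apply changes yourself
ceph osd pool set rbd pg_autoscale_mode on

# Tell it roughly how much of the cluster this pool will eventually consume
ceph osd pool set rbd target_size_ratio 0.6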

Task 6 — Check pool settings that affect VM latency

cr0x@server:~$ ceph osd pool get rbd size
size: 3
cr0x@server:~$ ceph osd pool get rbd min_size
min_size: 2
cr0x@server:~$ rbd pool stats rbd
Total Images: 124
Total Snapshots: 37
Provisioned Size: 19.7 TiB
Total Used Size: 6.2 TiB

Meaning: Replication is set for typical HA. Provisioned vs used helps spot thin-provisioning risk.

Decision: If provisioned approaches usable capacity, enforce quotas or expand before backfill becomes impossible.

Task 7 — Validate network health on the Ceph interfaces

cr0x@server:~$ ip -s link show eno2
2: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    RX:  bytes  packets  errors  dropped  missed  mcast
    9123489123  8123492       0        3       0      0
    TX:  bytes  packets  errors  dropped  carrier  collsns
    8451239912  7923341       0        0        0      0

Meaning: Errors are zero; a few drops may be okay, but rising drops under load can mean congestion or buffer issues.

Decision: If errors/drops climb, stop blaming Ceph and start debugging switch ports, MTU mismatch, or saturation.

Task 8 — Confirm MTU consistency (silent killer)

cr0x@server:~$ ip link show eno2 | grep mtu
2: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000

Meaning: Interface MTU is jumbo. That’s fine only if the entire path supports it.

Decision: If some nodes are 1500 and some 9000, pick one and standardize; inconsistent MTU yields weird latency and fragmentation.
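
A quick way to prove the whole path handles jumbo frames, assuming a peer node on the storage network at 10.10.10.12 (the address is an assumption):

# 9000 MTU minus 28 bytes of IP+ICMP headers = 8972-byte payload, with don't-fragment set
ping -M do -s 8972 -c 3 10.10.10.12

If this fails while a normal ping works, something in the path is not jumbo-clean.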

Task 9 — Check host IO saturation (is the disk pegged?)

cr0x@server:~$ iostat -x 1 3
Device            r/s     w/s   rkB/s   wkB/s  await  svctm  %util
nvme0n1         120.0   980.0  2400.0  8820.0   2.1   0.3   34.0
sdb              10.0   320.0   160.0  5120.0  48.0   2.5   98.0

Meaning: sdb is nearly 100% utilized with high await. If that’s an OSD device, it will cause slow ops.

Decision: If HDD OSDs are saturated, reduce recovery rate, move hot workloads to flash pool, or add spindles/nodes.

Task 10 — Inspect a specific OSD daemon for stalls or device errors

cr0x@server:~$ systemctl status ceph-osd@7 --no-pager
● ceph-osd@7.service - Ceph object storage daemon osd.7
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled)
     Active: active (running) since Sun 2025-12-28 08:12:11 UTC; 2h 3min ago
   Main PID: 23871 (ceph-osd)
      Tasks: 92
     Memory: 5.1G
        CPU: 1h 12min
     CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@7.service
             └─23871 /usr/bin/ceph-osd -f --cluster ceph --id 7 --setuser ceph --setgroup ceph

Meaning: OSD is running; memory use is visible. This doesn’t prove it’s healthy, but it rules out “it’s down”.

Decision: If CPU time is low but latency is high, suspect disk or IO waits; check logs and device stats.

Task 11 — Look for kernel-level disk trouble

cr0x@server:~$ dmesg -T | tail -n 8
[Sun Dec 28 09:55:14 2025] blk_update_request: I/O error, dev sdb, sector 219902314
[Sun Dec 28 09:55:14 2025] Buffer I/O error on dev sdb, logical block 27487789, async page read
[Sun Dec 28 09:55:15 2025] sd 2:0:5:0: [sdb] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Sun Dec 28 09:55:15 2025] sd 2:0:5:0: [sdb] Sense Key : Medium Error [current]
[Sun Dec 28 09:55:15 2025] sd 2:0:5:0: [sdb] Add. Sense: Unrecovered read error

Meaning: The disk is failing. Ceph will try to cope, but your latency will be awful and recovery may thrash.

Decision: Mark the OSD out, replace the disk, and plan the recovery window; don’t “wait and see”.

Task 12 — Safely mark an OSD out when it’s harming the cluster

cr0x@server:~$ ceph osd out 7
marked out osd.7.

Meaning: Ceph will begin moving data off osd.7 to restore the desired replication state.

Decision: Monitor backfill impact. If the cluster can’t handle recovery without melting, your sizing is marginal.
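
Once backfill finishes and the data has drained, a typical replacement sequence looks like this sketch (run the daemon command on the host that owns osd.7; the device path is illustrative):

# Stop the daemon and remove the OSD from the cluster
systemctl stop ceph-osd@7
ceph osd purge 7 --yes-i-really-mean-it

# Swap the physical drive, then recreate the OSD on the new device
ceph-volume lvm create --data /dev/sdb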

Task 13 — Observe recovery/backfill throttles and tune carefully

cr0x@server:~$ ceph config get osd osd_max_backfills
1
cr0x@server:~$ ceph config get osd osd_recovery_max_active
3

Meaning: Conservative recovery settings reduce impact on clients but extend rebuild time.

Decision: During business hours, keep conservative. During maintenance windows, increase slightly and watch latency.
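
A sketch of raising these for a maintenance window and putting them back afterwards (values are illustrative, not recommendations; on releases that use the mClock scheduler these options may be ignored unless overrides are explicitly enabled, so verify on your version):

# Let recovery move faster during a quiet window
ceph config set osd osd_max_backfills 3
ceph config set osd osd_recovery_max_active 5

# Put the conservative values back when the window closes
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 3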

Task 14 — Check for “nearfull” before it becomes “full”

cr0x@server:~$ ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
ssd    34 TiB   21 TiB   13 TiB     13 TiB     38.25
TOTAL  34 TiB   21 TiB   13 TiB     13 TiB     38.25

--- POOLS ---
POOL  ID  PGS  STORED  OBJECTS  USED     %USED  MAX AVAIL
rbd    1  128  4.1TiB  950k     12.3TiB  57.0     6.2TiB
cephfs 2  128  0.2TiB  170k     0.6TiB    2.8     6.2TiB

Meaning: Raw usage looks fine, but pool %USED is what matters for that pool’s effective capacity.

Decision: If rbd approaches 70–75%, schedule expansion; don’t run close to full and expect painless recovery.
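
It’s also worth looking at per-OSD fill and the warning thresholds, because one overfull OSD can block writes long before the pool average looks scary. A sketch:

# Per-OSD utilization: look for outliers, not the average
ceph osd df tree

# Current full / backfillfull / nearfull thresholds
ceph osd dump | grep ratio

# Make the warning fire earlier if you want more runway (value illustrative)
ceph osd set-nearfull-ratio 0.70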

Task 15 — Verify Proxmox sees and uses the Ceph datastore correctly

cr0x@server:~$ pvesm status
Name     Type     Status    Total     Used     Avail    %
local    dir      active   1.8T     220G     1.6T   12.0%
ceph-rbd rbd      active   6.2T     3.5T     2.7T   56.0%

Meaning: Proxmox storage plugin reports capacity and usage; mismatch can indicate auth/config issues.

Decision: If Proxmox shows “unknown” or “inactive”, fix Ceph client config before chasing performance ghosts.
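
For reference, a hyperconverged Proxmox node typically defines the Ceph datastore in /etc/pve/storage.cfg with an entry along these lines (storage ID and pool name are assumptions; an external Ceph cluster additionally needs monhost and a keyring):

rbd: ceph-rbd
        content images,rootdir
        pool rbd
        krbd 0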

Fast diagnosis playbook (find the bottleneck without a week-long meeting)

This is the order that saves time. Not always. But often enough that it should be muscle memory.

First: is the cluster healthy or recovering?

  • Run ceph -s and ceph health detail.
  • If you see backfill/recovery/peering, accept that performance is degraded and decide whether to throttle or wait.
  • If you see slow ops, identify the OSD(s) with ceph osd perf.

Second: is one host or disk dragging everyone down?

  • On the suspected host: iostat -x for %util/await.
  • Check kernel logs: dmesg -T for IO errors/reset storms.
  • Check SMART if available in your environment (often via vendor tools).
  • If a device is failing, mark out the OSD and replace; do not negotiate with physics.

Third: is the network lying?

  • Check counters: ip -s link (errors/drops).
  • Confirm MTU: ip link show.
  • Look for congestion: spikes during backfill are normal; constant drops are not.

Fourth: is it a pool/client configuration issue?

  • Confirm pool replication settings and PG autoscaler.
  • Check whether you used EC pools for VM disks (common self-inflicted wound).
  • Validate Proxmox storage configuration and that clients hit the right network.

Fifth: is it just undersized?

If all components are “healthy” but latency is still bad under normal load, your cluster may simply not have enough
IOPS, CPU, or network. This is the moment to be honest: tuning won’t turn HDD into NVMe.

Joke #2: Tuning an undersized Ceph cluster is like rearranging deck chairs—except the chairs are on fire and the ship is your SLA.

Common mistakes: symptoms → root cause → fix

1) “Everything gets slow when a node goes down”

Symptoms: latency spikes, VM timeouts, Proxmox feels sluggish during disk replacement or host reboot.

Root cause: recovery/backfill saturates network or OSDs; cluster sized only for happy path.

Fix: increase node count and network bandwidth; throttle recovery during business hours; reserve headroom (stay under ~75%).

2) “Random slowdowns, no obvious errors”

Symptoms: occasional 5–30s stalls, HEALTH_WARN with slow ops, then it clears.

Root cause: one OSD device intermittently failing, firmware hiccups, or host-level IO contention.

Fix: use ceph osd perf to identify offenders; check dmesg; replace flaky drives; stop mixing consumer SSDs.

3) “Ceph is full, but df says there’s space”

Symptoms: pool hits nearfull/full, writes block, despite “raw” capacity still available.

Root cause: pool-level effective capacity is constrained by replication and distribution; uneven fill across OSDs; thin provisioning oversold.

Fix: expand before pools exceed safe thresholds; rebalance weights; enforce quotas and provisioning discipline.

4) “Great throughput benchmarks, terrible VM experience”

Symptoms: sequential tests look fine; real workloads feel laggy; p95 latency is ugly.

Root cause: benchmarks don’t match small random IO + fsync patterns; HDD/hybrid without adequate SSD DB/WAL; CPU starvation.

Fix: measure latency; move VM disks to all-flash pool; ensure adequate CPU scheduling for OSDs; don’t overcommit storage nodes.

5) “We used erasure coding for VM disks to save space”

Symptoms: higher write latency, especially under load; recovery is brutal; CPU spikes.

Root cause: EC small-write amplification and compute cost; not aligned with VM disk IO patterns.

Fix: use replicated pools for VM disks; keep EC for large-object, less latency-sensitive workloads.

6) “Ceph keeps flapping quorum / MONs complain”

Symptoms: MONs out of quorum, cluster pauses, odd errors during peak traffic.

Root cause: overloaded or undersized MON nodes, network instability, or running MONs on nodes starved by VM load.

Fix: ensure three MONs on stable nodes; reserve CPU/RAM; isolate network; stop putting MONs on the most abused hypervisor.

7) “Backfill never ends”

Symptoms: weeks of remapped/backfilling PGs, constant degraded states after small changes.

Root cause: too little spare capacity, too little bandwidth, or too aggressive changes (adding/removing multiple OSDs at once).

Fix: plan changes; add capacity in batches; throttle recovery; avoid operating near full; increase network.

Three corporate-world mini-stories (anonymized, plausible, and painfully familiar)

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company migrated from a single virtualization host with local SSDs to a three-node Proxmox cluster with Ceph.
The assumption was simple: “We have three nodes, so we have HA now. It’ll be at least as fast.”
Nobody said “replication adds writes” out loud. Nobody did failure testing with real workloads.

The first maintenance reboot happened on a Tuesday morning. One node went down for firmware updates. Ceph did what Ceph does:
started backfilling to maintain replication. VM latency climbed, then climbed again, then the helpdesk started forwarding tickets
with subject lines written in all caps.

They blamed Proxmox, then Ceph, then the network. The truth was boring: 10GbE was shared with client traffic,
and the OSDs were SATA SSDs that looked fine in a single-node context but crumbled under replicated random writes plus recovery.
The wrong assumption wasn’t “Ceph is fast”. It was “Ceph behaves like local storage, just shared.”

The fix was not a magical config. They separated storage traffic, throttled recovery during office hours,
and added two more nodes to reduce recovery pressure. The long-term fix was more uncomfortable:
they stopped selling internal stakeholders on “HA is free”.

Mini-story 2: The optimization that backfired

Another organization had an all-flash Ceph cluster that was “fine” but not exciting. They wanted more capacity efficiency.
Someone proposed erasure coding for the main VM datastore. The spreadsheet was beautiful.
The migration plan was clean. The change ticket had a confident tone. Danger, in corporate form.

At first, it worked. Usable capacity went up. Then performance complaints started—subtle at first, then persistent.
The worst part wasn’t average latency; it was tail latency. Apps would stall briefly, then recover. Users described it as “sticky.”

Under the hood, small random writes were getting hammered by EC write amplification and additional compute cost.
During recovery events, the network and CPU load spiked in ways the team hadn’t modeled. Every minor incident turned into
an extended period of degraded performance.

They rolled back the main datastore to replicated pools, kept EC for less interactive workloads,
and stopped “optimizing” without a workload-aligned performance budget. Capacity efficiency is great—until it costs you credibility.

Mini-story 3: The boring but correct practice that saved the day

A financial services team ran a five-node Proxmox+Ceph cluster. Nothing exotic: replicated pools, 25GbE, consistent NVMe,
conservative recovery settings. Their secret weapon wasn’t gear. It was process.

They kept a small weekly habit: review Ceph health, check OSD latency outliers, and look at pool utilization trends.
They also enforced a rule: storage firmware updates and kernel changes happened in controlled maintenance windows, with one node at a time,
and a “stop if recovery is angry” checkpoint.

One week, they noticed a single OSD creeping up in commit/apply latency. No errors yet. Just “different.”
They marked it out during a quiet window, replaced the drive, let the cluster heal, and moved on. Two days later,
a similar drive model in a different environment started throwing read errors and caused a customer-visible incident.

Their outage was a non-event because they treated “slightly weird” as a to-do, not a curiosity. The practice was boring.
The outcome was elite.

Checklists / step-by-step plan

Decision checklist: should you run Ceph on Proxmox?

  • Do you have at least 3 nodes now (not “soon”)? Preferably 5+?
  • Do you have 10GbE minimum, ideally 25GbE, with sane switching?
  • Can you keep steady-state utilization under ~70–75%?
  • Do you have consistent hardware across nodes?
  • Do you have a human owner for storage health (alerts, maintenance, upgrades)?
  • Do you accept that failure recovery is a performance event you must budget for?

Build checklist: a sane baseline for a new cluster

  1. Pick node count and failure domains. Decide if “host” failure is enough or you need rack-level separation.
  2. Choose network speed and topology. Separate Ceph traffic logically (VLAN) and, if possible, physically.
  3. Choose media class per pool. Don’t mix HDD and NVMe in the same performance pool.
  4. Set replicated pools for VM disks (size=3, min_size=2 is common). Be explicit about why.
  5. Enable PG autoscaler and review suggested PG changes periodically.
  6. Define alerting for: HEALTH_WARN/ERR, slow ops, nearfull/full, MON quorum changes, OSD flaps, disk errors.
  7. Document recovery behavior: what “normal degraded” looks like and when to throttle.

Operations checklist: weekly habits that prevent drama

  • Review ceph -s and ceph health detail.
  • Check ceph osd perf for outliers and investigate one per week.
  • Review capacity trends with ceph df and Proxmox storage usage.
  • Confirm no network errors/drops spikes on storage NICs.
  • Do one controlled maintenance activity at a time (one host reboot, one OSD replacement), then observe.
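
The whole review collapses into a short script you can run and paste into a ticket. A minimal sketch, assuming eno2 is the storage NIC (adjust names to your environment):

#!/bin/sh
# Weekly Ceph snapshot: health, slow OSDs, capacity, Proxmox view, NIC counters
echo "== ceph status ==";        ceph -s
echo "== health detail ==";      ceph health detail
echo "== osd latency ==";        ceph osd perf
echo "== capacity ==";           ceph df
echo "== proxmox storage ==";    pvesm status
echo "== storage nic (eno2) =="; ip -s link show eno2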

Expansion checklist: adding nodes/OSDs without chaos

  1. Confirm you have headroom for backfill (space and bandwidth).
  2. Add in batches small enough that recovery doesn’t crush client IO.
  3. Watch PG states and OSD perf while rebalancing.
  4. After recovery, validate pool utilization and PG distribution.
  5. Only then proceed to the next batch.
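
When you add a batch of OSDs, it can help to stage the data movement instead of letting it start the moment each OSD comes up. A sketch using standard cluster flags (remember to unset them, and expect HEALTH_WARN while they’re set):

# Pause data movement while the batch is created
ceph osd set norebalance
ceph osd set nobackfill

# ... create the new OSDs on each node ...

# Release the flags and let recovery proceed at the throttled rate
ceph osd unset nobackfill
ceph osd unset norebalance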

FAQ

1) What’s the minimum number of nodes for Ceph on Proxmox?

Three nodes is the minimum for a real cluster with quorum and replicated storage. Five nodes is where maintenance and recovery stop feeling like a stunt.

2) Can I run Ceph on 1GbE if my workload is small?

You can, but you probably shouldn’t. Even small clusters have recovery events, and 1GbE turns recovery into “everything is slow”.
If you must, keep expectations low and avoid HA promises.

3) Do I need separate networks for public and cluster traffic?

Not strictly, but separation (VLAN at least) reduces blast radius and makes troubleshooting sane. If you have the ports, dedicate links.

4) Should I use CephFS for VM disks?

For VM disks on Proxmox, RBD is typically the right choice. CephFS is great for shared file workloads, not as a replacement for block storage semantics.

5) Replication size 2 vs 3: is size 2 acceptable?

Size 2 is acceptable only when you explicitly accept reduced fault tolerance and understand failure scenarios.
In most business environments, size 3 is the safer default because it tolerates more messy reality.

6) Should I use erasure coding to save space?

Use EC when the workload fits: large objects, less latency sensitivity, and you can afford extra CPU/network overhead.
For primary VM datastores, replicated pools are usually the correct trade.

7) How full is “too full” for Ceph?

Treat ~70–75% as the “start planning expansion” range for replicated VM pools.
Past that, recovery becomes slower and riskier, and you’ll eventually hit backfill constraints.

8) Why is performance worse during rebuilds even if the cluster is healthy?

Because rebuilds consume the same resources your clients need: disk IO, CPU, and network.
A healthy cluster can still be busy. You size and tune for acceptable degradation during recovery, not for perfection.

9) Can I mix different SSD models or different node generations?

You can, but it’s a common source of tail latency and uneven behavior. In Ceph, heterogeneity shows up during recovery and hotspot workloads.
If you must mix, isolate by device class and pool, and expect operational complexity.

10) What’s the first metric to watch for “Ceph feels slow”?

ceph osd perf for commit/apply latency outliers, plus whether the cluster is backfilling. One slow OSD can poison the whole experience.

Next steps you can execute

  1. Run the fast diagnosis playbook the next time someone complains. Capture ceph -s, ceph osd perf, and host iostat.
  2. Decide your failure budget: how bad can things get during recovery and still be “acceptable”?
  3. Fix the big rocks first: network bandwidth, consistent media, and headroom. Tuning comes after.
  4. Separate pools by performance class and stop mixing slow and fast devices in the same VM datastore.
  5. Institutionalize boring checks: weekly health review, outlier hunting, capacity trend tracking, and one-change-at-a-time maintenance.

If Ceph on Proxmox fits your needs and you size it honestly, it’s a solid way to run resilient virtualization without buying a storage religion.
If you try to cheat the physics, it will respond with latency graphs that look like modern art.
