Ceph performance on Proxmox is slow: 10 checks that actually find the bottleneck

Ceph on Proxmox has a special talent: it works fine until the day you migrate a few VMs, start a backup, and suddenly everything feels like it’s writing to a USB stick from 2009. Latency spikes. iowait climbs. Your “hyperconverged” setup becomes “hyper-concerned.”

The fix is rarely mystical. It’s usually one of ten boring, measurable bottlenecks—network, recovery traffic, mismatched disk classes, mis-sized BlueStore metadata, underpowered CPUs, or a CRUSH rule that quietly hates you. Let’s catch it in the act.

Fast diagnosis playbook

You want answers fast, not a weekend with Grafana and regrets. This is the order I use when someone says “Ceph is slow” and there’s a production queue forming behind them.

First: is the cluster unhealthy or just slow?

  • Check health, recovery, and stuck ops. If recovery/backfill is running, you’re not diagnosing performance—you’re diagnosing a rebuild.
  • Confirm OSDs are up/in and no one is flapping.

Second: is it the network?

  • Most “Ceph is slow” complaints are “the cluster network is congested” in disguise.
  • Look for drops, retransmits, NIC offload weirdness, MTU mismatch, and oversubscribed switches.

Third: is it the disks/OSDs?

  • Check OSD commit/apply latency, BlueStore slow ops, and whether DB/WAL are on the wrong media.
  • Make sure you didn’t put HDD OSDs in the same CRUSH rule as SSD OSDs and then act surprised.

Fourth: is it the client path (Proxmox hosts/VMs)?

  • Proxmox nodes can bottleneck on CPU, kernel network stack, or too many RBD clients.
  • Misconfigured caching, wrong IO scheduler, and backup jobs can turn “fine” into “flatline.”

If you only remember one thing: always prove whether the bottleneck is network, OSD media, or recovery before you touch tuning knobs. Tuning the wrong layer is how outages become “learning experiences.”

A few facts (and history) that explain today’s pain

Performance debugging gets easier when you remember why Ceph behaves the way it does.

  1. Ceph’s design goal was reliability at scale—not “maximum IOPS on three nodes.” It came out of research at UC Santa Cruz and grew into a planet-sized storage system.
  2. CRUSH (Controlled Replication Under Scalable Hashing) is why Ceph can place data without a central metadata server deciding every location. That’s great for scale, but it means topology and device classes matter a lot.
  3. RBD is copy-on-write at the block layer and likes consistent latency. It will absolutely reflect microbursts and network jitter into VM “disk” latency.
  4. BlueStore replaced FileStore to remove filesystem overhead and improve performance, but it introduced the DB/WAL separation story—done right it’s fast, done wrong it’s slow in a very specific way.
  5. Ceph’s “replication 3” default is a cultural artifact from operators who prefer sleeping at night. You pay for that safety with write amplification and network traffic.
  6. Placement groups (PGs) are a scaling lever, not a tuning superstition. Too many PGs burns memory and CPU; too few concentrates load and slows recovery.
  7. Recovery and backfill are intentional throttles. Ceph tries to keep serving IO while rebuilding, but you still share disks and network.
  8. 10GbE made Ceph common in “affordable” clusters. Unfortunately, it also made oversubscription common, and oversubscription is how you get mystery latency.
  9. Proxmox made Ceph approachable with a friendly UI—and also made it easy to deploy Ceph without doing the unsexy network and disk homework.

One idea worth remembering, paraphrased from Werner Vogels (the reliability/operations mindset): everything fails eventually, so design and operate as if failure is normal. Performance debugging is the same mentality: assume contention is normal and prove where it comes from.

The 10 checks (with commands, outputs, and decisions)

These are not “tips.” They’re checks with a clear outcome: you run a command, read the output, then choose a concrete action. That’s how you stop guessing.

Check 1: Cluster health, recovery, and slow ops (the “are we rebuilding?” check)

If the cluster is recovering, your “performance issue” might be expected behavior. Decide whether to wait, throttle recovery, or stop the job causing churn.

cr0x@server:~$ ceph -s
  cluster:
    id:     9f1b2d9a-1b2c-4b9b-8d2d-2f7e5f0f2c1a
    health: HEALTH_WARN
            12 slow ops, oldest one blocked for 34 sec, daemons [osd.3,osd.7] have slow ops
  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 2h)
    mgr: mgr1(active, since 2h)
    osd: 12 osds: 12 up (since 2h), 12 in (since 2h)
  data:
    pools:   4 pools, 256 pgs
    objects: 1.2M objects, 4.6 TiB
    usage:   13 TiB used, 21 TiB / 34 TiB avail
    pgs:     220 active+clean
             36  active+clean+scrubbing
  io:
    client:   220 MiB/s rd, 55 MiB/s wr, 2.1k op/s rd, 900 op/s wr

What it means: “slow ops” usually indicates OSDs can’t keep up (disk) or are blocked (network or internal queues). Scrubbing also shows up; it’s not evil, but it’s not free.

Decision: If slow ops correlate with scrub/recovery windows, schedule scrub off-hours and tune recovery QoS (later checks). If slow ops appear at random under normal IO, keep going—this is a real bottleneck.
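
If scrub keeps landing on top of peak hours, the schedule is a plain config option. A minimal sketch, assuming your quiet window is roughly 22:00–06:00 (pick hours that match your own off-peak; deep-scrub intervals are a separate conversation):

cr0x@server:~$ ceph config set osd osd_scrub_begin_hour 22
cr0x@server:~$ ceph config set osd osd_scrub_end_hour 6
cr0x@server:~$ ceph config get osd osd_scrub_begin_hour
22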

cr0x@server:~$ ceph health detail
HEALTH_WARN 12 slow ops, oldest one blocked for 34 sec, daemons [osd.3,osd.7] have slow ops
[WRN] SLOW_OPS: 12 slow ops, oldest one blocked for 34 sec, daemons [osd.3,osd.7] have slow ops
    slow op 1, oldest at osd.3, committed, currently waiting for subops from [osd.7, osd.9]
    slow op 2, oldest at osd.7, waiting for BlueStore kv commit

What it means: “waiting for subops” smells like replication/network; “waiting for BlueStore kv commit” smells like BlueStore DB/WAL or device latency.

Decision: Split your hypothesis: network path vs OSD media/BlueStore metadata.
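
A quick way to split that hypothesis before diving deeper: compare per-OSD commit latency against the network counters you’ll pull in Check 3. A minimal sketch (illustrative numbers, trimmed to a few OSDs):

cr0x@server:~$ ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
  7                  46                  46
  3                  38                  38
  9                   3                   3
  0                   2                   2

If the OSDs named in the slow-ops warning also show the worst commit latency, lean toward media/BlueStore (Checks 7–8). If their numbers look like everyone else’s, lean toward the replication/network path (Checks 3–5).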

Check 2: Prove the problem is in Ceph, not the VM or host filesystem

Operators love blaming Ceph. Sometimes it’s deserved. Sometimes the VM is doing 4k sync writes to a filesystem with barriers and you’re watching physics happen.

cr0x@server:~$ pvesm status
Name             Type     Status           Total            Used       Available        %
local             dir     active        98.00 GiB        22.16 GiB        70.80 GiB   22.61%
local-lvm     lvmthin     active       870.00 GiB       412.00 GiB       458.00 GiB   47.36%
ceph-rbd         rbd     active        21.00 TiB        13.10 TiB         7.90 TiB   62.38%

What it means: Confirms which VMs are actually on RBD. I’ve seen “Ceph is slow” tickets where the VM was on local-lvm the whole time. That’s not an advanced incident; that’s a scavenger hunt.

Decision: If only some workloads are on Ceph, compare behavior across storages before you touch Ceph tuning.

cr0x@server:~$ rbd -p ceph-vm ls | head
vm-100-disk-0
vm-101-disk-0
vm-104-disk-0
vm-105-disk-0

What it means: Confirms RBD images exist and are accessible. If this command is slow, it can indicate MON/MGR slowness or network issues.

Decision: Slow metadata operations push you toward network, MON disk, or overall cluster load problems.

Check 3: Network basics—latency, drops, MTU mismatch (the silent killer)

Ceph is a distributed storage system. That means your network is a storage bus. Treat it like one.

cr0x@server:~$ ip -s link show dev eno2
2: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 3c:ec:ef:12:34:56 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
    9876543210 7123456      0   18422       0  112233
    TX:  bytes packets errors dropped carrier collsns
    8765432109 6234567      0       9       0       0

What it means: RX drops in the tens of thousands are not “normal.” They’re a symptom. On busy Ceph links, drops often translate into retransmits, which translate into tail latency, which translates into angry VM owners.

Decision: If drops are rising during incidents, fix the network before you tune Ceph. Check switch buffers, oversubscription, flow control, and NIC/driver issues.

cr0x@server:~$ ping -c 5 -M do -s 8972 10.10.10.12
PING 10.10.10.12 (10.10.10.12) 8972(9000) bytes of data.
8980 bytes from 10.10.10.12: icmp_seq=1 ttl=64 time=0.321 ms
8980 bytes from 10.10.10.12: icmp_seq=2 ttl=64 time=0.309 ms
8980 bytes from 10.10.10.12: icmp_seq=3 ttl=64 time=0.315 ms
8980 bytes from 10.10.10.12: icmp_seq=4 ttl=64 time=0.311 ms
8980 bytes from 10.10.10.12: icmp_seq=5 ttl=64 time=0.318 ms

--- 10.10.10.12 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 0.309/0.315/0.321/0.004 ms

What it means: Jumbo frames are either consistently working end-to-end, or they’re a trap. This checks path MTU with “do not fragment.”

Decision: If you see “Frag needed” or packet loss, stop. Fix MTU consistency across NICs, bonds, bridges, and switches. Mixed MTU creates a special kind of performance misery: it sort-of works, slowly.

cr0x@server:~$ ss -ti dst 10.10.10.12 | head -n 12
ESTAB 0 0 10.10.10.11:50212 10.10.10.12:6805
	 cubic wscale:7,7 rto:204 rtt:0.31/0.02 ato:40 mss:8960 pmtu:9000 rcvmss:8960 advmss:8960
	 bytes_sent:123456789 bytes_acked:123450000 bytes_received:9876543 segs_out:123456 segs_in:122999
	 retrans:12/3456 lost:0 sacked:123 fackets:12 reordering:0

What it means: Retransmits during a latency incident are a smoking gun. Not always the root cause, but always relevant.

Decision: If retransmits jump under load, reduce oversubscription, separate Ceph cluster traffic, and verify NIC ring buffers and interrupt settings. Don’t “tune BlueStore” to fix packet loss.
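
Counters tell you something is wrong; a controlled test tells you how bad the path is under load. A minimal sketch with iperf3, assuming it’s installed on both nodes and 10.10.10.12 is the peer from the examples above (run it in a quiet window; it will happily saturate the link):

cr0x@server:~$ iperf3 -s                        # on the peer node (10.10.10.12)
cr0x@server:~$ iperf3 -c 10.10.10.12 -P 4 -t 30
[SUM]   0.00-30.00  sec  32.8 GBytes  9.39 Gbits/sec  214             sender

The Retr column (214 here) counts TCP retransmits; a healthy dedicated 10GbE path should show a number close to zero.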

Joke #1: Jumbo frames are like diet plans: either everyone follows them, or you get weird results and a lot of denial.

Check 4: Are public and cluster networks separated (or at least not fighting)?

Proxmox makes it easy to run Ceph public traffic and replication/backfill on the same interface “just for now.” “Just for now” is how it stays forever.

cr0x@server:~$ ceph config get mon public_network
10.10.10.0/24
cr0x@server:~$ ceph config get mon cluster_network
10.20.20.0/24

What it means: If cluster_network is empty, replication and recovery share the same network as clients. That can be fine in tiny clusters, until it isn’t.

Decision: If you have separate NICs/VLANs, configure cluster_network. If you don’t, consider adding it before chasing micro-optimizations.
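
If cluster_network came back empty on your cluster and you do have a spare NIC/VLAN, the change itself is small; the disruption is not zero. A minimal sketch, assuming a dedicated 10.20.20.0/24 subnet already exists on every node (OSDs only pick this up after a restart, so roll it one host at a time and wait for HEALTH_OK in between):

cr0x@server:~$ ceph config set global cluster_network 10.20.20.0/24
cr0x@server:~$ ceph config get mon cluster_network
10.20.20.0/24
cr0x@server:~$ systemctl restart ceph-osd@3    # repeat per OSD/host, not all at once

On Proxmox, this setting typically also lives in /etc/pve/ceph.conf; keep the config database and that file telling the same story.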

Check 5: Replication/EC overhead—are you paying for durability twice?

Performance complaints often come from mismatched expectations. Someone wanted “fast VM disks,” someone else wanted “three copies across hosts,” and nobody priced the IO bill.

cr0x@server:~$ ceph osd pool ls detail
pool 1 'ceph-vm' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on
pool 2 'ceph-ct' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on
pool 3 'cephfs_data' erasure size 6 min_size 5 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
pool 4 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on

What it means: A replicated pool with size 3 turns every write into three writes across the network and disks. EC pools reduce capacity overhead but cost CPU and can increase small-write latency.

Decision: For VM workloads with lots of small random writes, replicated pools are usually the sane baseline. If you use EC for VMs, do it because you measured it and accept the tradeoffs.

cr0x@server:~$ ceph tell osd.* perf dump | head -n 20
{
  "osd": {
    "op_wip": 12,
    "op_latency": 0.018,
    "op_process_latency": 0.012,
    "op_r_latency": 0.006,
    "op_w_latency": 0.021,
    "subop_latency": 0.019
  }
}

What it means: Rising subop_latency relative to op_latency can indicate replication sub-operations are slow—often network, sometimes slow peers (mixed disk classes, busy OSDs).

Decision: If a subset of OSDs have much worse latency, isolate them: they may be on worse disks, wrong firmware, or overloaded hosts.

Check 6: Disk classes and CRUSH rules—stop mixing SSDs and HDDs in the same performance path

This is one of the most common “it worked in the lab” failures. A few HDD OSDs sneak into an SSD pool’s rule, and now every write occasionally lands on the slow kid in the group project.

cr0x@server:~$ ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME         STATUS REWEIGHT PRI-AFF
-1         34.00000 root default
-3         11.33333     host pve1
 0   ssd    1.80000         osd.0         up  1.00000 1.00000
 1   ssd    1.80000         osd.1         up  1.00000 1.00000
 2   hdd    3.06667         osd.2         up  1.00000 1.00000
-5         11.33333     host pve2
 3   ssd    1.80000         osd.3         up  1.00000 1.00000
 4   ssd    1.80000         osd.4         up  1.00000 1.00000
 5   ssd    1.80000         osd.5         up  1.00000 1.00000

What it means: You have HDD OSDs in the same root as SSDs. That’s not automatically wrong, but it’s a risk if your pool rule doesn’t constrain by class.

Decision: Ensure the pool’s CRUSH rule selects only SSD (or only HDD) depending on intent. If you want mixed tiers, do it explicitly with separate pools and policies, not accidental roulette.

cr0x@server:~$ ceph osd crush rule dump replicated_rule
{
  "rule_id": 0,
  "rule_name": "replicated_rule",
  "type": 1,
  "steps": [
    { "op": "take", "item": -1, "item_name": "default" },
    { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
    { "op": "emit" }
  ]
}

What it means: This rule doesn’t filter device class. So HDDs can be chosen as replicas even for SSD-heavy pools.

Decision: Create a class-based rule and move performance-sensitive pools to it. Yes, it’s work. It’s less work than explaining random 200ms fsyncs to a database team.
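
The commands are short; the data movement they trigger is not. A minimal sketch, assuming the goal is to pin the ceph-vm pool to SSDs only (the rule name is arbitrary, and switching crush_rule rebalances data, so do it in a maintenance window):

cr0x@server:~$ ceph osd crush rule create-replicated replicated-ssd default host ssd
cr0x@server:~$ ceph osd pool set ceph-vm crush_rule replicated-ssd
set pool 1 crush_rule to replicated-ssd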

Check 7: BlueStore DB/WAL placement and sizing—your “SSD OSDs” might still be metadata-starved

BlueStore uses RocksDB for metadata (the DB) and a WAL. On fast media, DB/WAL placement can make or break latency under small writes.

cr0x@server:~$ ceph-bluestore-tool show-label --dev /dev/sdb | head -n 30
{
  "osd_uuid": "f3b6b7b3-2a3a-4b7c-9b4c-1f2e3d4c5b6a",
  "size": 1920383410176,
  "btime": "2025-10-01T11:12:13.000000+0000",
  "description": "main",
  "whoami": "3"
}
cr0x@server:~$ ceph-volume lvm list | sed -n '1,80p'
====== osd.3 =======
  [block]       /dev/ceph-2c1a3b4d-.../osd-block-9a8b7c6d-...
      devices              /dev/sdb
  [db]          /dev/ceph-2c1a3b4d-.../osd-db-1a2b3c4d-...
      devices              /dev/nvme0n1
  [wal]         /dev/ceph-2c1a3b4d-.../osd-wal-5e6f7a8b-...
      devices              /dev/nvme0n1

What it means: This OSD has block on /dev/sdb (likely SSD/SATA) with DB/WAL on NVMe, which is generally good for latency. If DB/WAL are on the same slow device as block for HDD OSDs, you’ll feel it.

Decision: If you have HDD OSDs, strongly consider placing DB/WAL on SSD/NVMe. If you have SSD OSDs but still see high commit latency, confirm DB isn’t undersized or contended (multiple OSDs sharing a tiny NVMe partition).

cr0x@server:~$ ceph daemon osd.3 perf dump bluestore | head -n 40
{
  "bluestore": {
    "kv_commit_lat": {
      "avgcount": 124567,
      "sum": 1015.22,
      "avgtime": 0.00815
    },
    "deferred_write_ops": 0,
    "deferred_write_bytes": 0
  }
}

What it means: kv_commit_lat is reported as avgcount/sum/avgtime in seconds, so an avgtime of 0.00815 is roughly 8ms per KV commit. That might be acceptable on HDD, suspicious on “all NVMe,” and catastrophic if it spikes into tens or hundreds of milliseconds. It also correlates with “waiting for BlueStore kv commit” slow ops.

Decision: If KV commit latency is high, investigate DB device saturation, write cache settings, and whether you’re pushing too many OSDs onto one DB device. Consider re-provisioning with proper DB sizing and isolation.
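
One specific failure mode worth ruling out here is DB spillover: RocksDB no longer fits on the fast device and BlueFS quietly starts using the slow one. A minimal sketch (exact counter names can shift between Ceph releases, so treat the fields as an assumption to verify on your version):

cr0x@server:~$ ceph health detail | grep -i spillover
cr0x@server:~$ ceph daemon osd.3 perf dump bluefs | grep -E 'db_total|db_used|slow_used'
        "db_total_bytes": 64420392960,
        "db_used_bytes": 21474836480,
        "slow_used_bytes": 0,

No spillover warning and slow_used_bytes at zero is what you want. Anything else means metadata traffic is landing on the slow device, no matter how nice the NVMe looks on paper.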

Check 8: OSD CPU, memory, and scheduler—when the storage is fast, the host becomes the bottleneck

Ceph is not just disks. OSDs do checksums, compression (if enabled), networking, and bookkeeping. On small clusters, a little CPU starvation becomes a lot of tail latency.

cr0x@server:~$ ceph tell osd.* dump_historic_ops | head -n 25
{
  "ops": [
    {
      "description": "osd_op(client.1234:5678 1.2e3f4b5c ::ffff:10.10.10.50:0/12345 1) [write 0~4096]",
      "duration": 1.238,
      "initiated_at": "2025-12-28T10:11:12.123456+0000",
      "age": 1.238
    }
  ]
}

What it means: Ops taking >1s are not normal for healthy SSD-based clusters under moderate load. The description shows 4k writes; small sync writes expose latency immediately.

Decision: Correlate with CPU steal, load average, and OSD thread contention on the same node.

cr0x@server:~$ top -b -n 1 | head -n 15
top - 10:22:01 up 12 days,  3:10,  1 user,  load average: 18.22, 16.90, 12.40
Tasks: 512 total,   2 running, 510 sleeping,   0 stopped,   0 zombie
%Cpu(s): 28.1 us,  6.4 sy,  0.0 ni, 48.2 id, 15.9 wa,  0.0 hi,  1.4 si,  0.0 st
MiB Mem : 128000.0 total,   2200.0 free,  41200.0 used,  84600.0 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.  82200.0 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 8123 ceph      20   0 8421376 4.1g  120m S  220.0   3.3  92:12.34 ceph-osd

What it means: 15.9% iowait suggests storage stalls, but the load average plus hot OSD process suggests the node is busy. If this is a hyperconverged node running many VMs, CPU and IO contention is real.

Decision: If OSDs and VMs fight for CPU, pin resources, reduce VM density, or separate roles. “Just add tuning” is not a capacity plan.

cr0x@server:~$ cat /sys/block/sdb/queue/scheduler
[mq-deadline] none

What it means: For SSDs, mq-deadline is often a reasonable default; for HDDs, scheduler choice matters more. For NVMe, scheduler often matters less, but “none” is fine.

Decision: If you see pathological latency on HDD OSDs, validate you’re not using a scheduler that worsens seek storms. Don’t expect miracles: Ceph on HDD is still Ceph on HDD.
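
If you do conclude that HDD OSDs need a different scheduler, set it persistently instead of echoing into sysfs and forgetting after the next reboot. A minimal udev sketch, assuming rotational SATA/SAS OSD disks (the file name is made up; test on one node first, because this is persistence plumbing, not a performance cure):

cr0x@server:~$ cat /etc/udev/rules.d/60-osd-scheduler.rules
# rotational disks (HDD): use mq-deadline; NVMe is left alone
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
cr0x@server:~$ udevadm control --reload && udevadm trigger --subsystem-match=block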

Check 9: Recovery/backfill/scrub pressure—your cluster is eating its vegetables during lunch rush

Ceph background work is good. It prevents data loss. It also competes for the same disks and network your clients use.

cr0x@server:~$ ceph status | sed -n '1,30p'
  cluster:
    id:     9f1b2d9a-1b2c-4b9b-8d2d-2f7e5f0f2c1a
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 2h)
    mgr: mgr1(active, since 2h)
    osd: 12 osds: 12 up (since 2h), 12 in (since 2h)
  data:
    pools:   4 pools, 256 pgs
    pgs:     256 active+clean
  progress:
    Recovery event (35s)
      [============================..] (remaining: 9s)

What it means: Even with HEALTH_OK, recovery can be active. On small clusters, a recovery event can dominate performance for minutes to hours.

Decision: If user-facing IO is more important than fast recovery during business hours, throttle recovery during peaks and schedule maintenance windows for heavy rebalancing.

cr0x@server:~$ ceph config get osd osd_recovery_max_active
3
cr0x@server:~$ ceph config get osd osd_max_backfills
2

What it means: These values determine how aggressive recovery/backfill is. Higher is faster recovery and worse client latency (usually). Lower is gentler on clients and slower rebuilds.

Decision: If your cluster is small and user IO matters, keep these conservative. If you’re in a failure scenario and need redundancy quickly, temporarily increase them—then revert.

cr0x@server:~$ ceph config set osd osd_recovery_sleep 0.1

What it means: Adding a small recovery sleep can smooth client latency by giving disks breathing room.

Decision: Use this when you see recovery-induced latency spikes and you can tolerate slower recovery. Test carefully; don’t “set and forget” without understanding rebuild time requirements.
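
There is also a blunter lever for planned maintenance: pause rebalance/backfill while you do something disruptive, then unset the flags. A minimal sketch (while these flags are set, degraded data is not being re-protected, so do not set them and walk away):

cr0x@server:~$ ceph osd set norebalance
norebalance is set
cr0x@server:~$ ceph osd set nobackfill
nobackfill is set
cr0x@server:~$ ceph osd unset nobackfill && ceph osd unset norebalance
nobackfill is unset
norebalance is unset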

Check 10: Benchmark the right thing—RADOS vs RBD, reads vs writes, sync vs async

Benchmarks are useful when they answer a specific question. “How fast is Ceph?” is not a question. “Is raw OSD write latency acceptable?” is.

cr0x@server:~$ rados bench -p ceph-vm 30 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes for at least 30 seconds.
Total time run:         30.422
Total writes made:      259
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     34.06
Stddev Bandwidth:       6.12
Max bandwidth (MB/sec): 45.01
Min bandwidth (MB/sec): 18.77
Average IOPS:           8
Stddev IOPS:            1
Average Latency(s):     1.932
Max latency(s):         5.221
Min latency(s):         0.412

What it means: 4MB writes with ~2s average latency is a red flag unless the cluster is heavily recovering or on HDD with extreme contention. This isn’t “tuning territory”; it’s “something is wrong” territory.

Decision: If RADOS bench is bad, the problem is cluster-side (network/OSDs/recovery). If RADOS bench is good but VM disks are slow, the problem is client-side (RBD settings, VM IO pattern, host contention).
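
One housekeeping note: --no-cleanup leaves the benchmark objects in the pool (handy if you want to follow up with rados bench -p ceph-vm 30 seq for a read test), so remove them when you are done:

cr0x@server:~$ rados -p ceph-vm cleanup
Removed 259 objects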

cr0x@server:~$ rbd perf image iostat --pool ceph-vm --iterations 5
NAME                 WR       RD    WR_BYTES    RD_BYTES     WR_LAT    RD_LAT
vm-100-disk-0     340/s    120/s    24 MiB/s   1.2 MiB/s   41.80 ms   7.20 ms
vm-100-disk-0     360/s    110/s    26 MiB/s   1.1 MiB/s   55.30 ms   8.10 ms
vm-100-disk-0     390/s     95/s    28 MiB/s   980 KiB/s   72.10 ms   6.90 ms
vm-100-disk-0     310/s    130/s    22 MiB/s   1.3 MiB/s   38.90 ms   7.50 ms
vm-100-disk-0     320/s    125/s    23 MiB/s   1.2 MiB/s   44.00 ms   7.00 ms

What it means: Write latency is climbing while read latency stays steady. That often points to replication/commit pressure, DB/WAL contention, or recovery interference.

Decision: If it’s one image/VM: check workload (fsync-heavy DB, journaling). If it’s many images: check OSD commit latency and network retransmits.
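
If you want to see what the VM itself experiences, a small fio run inside the guest answers the sync-write question directly. A minimal sketch, run inside the guest rather than on the Proxmox host, assuming fio is installed and /var/tmp sits on the RBD-backed disk (the vm100 prompt and the path are placeholders):

cr0x@vm100:~$ fio --name=synctest --filename=/var/tmp/fio.test --size=1G \
    --rw=randwrite --bs=4k --ioengine=libaio --direct=1 --fsync=1 \
    --runtime=60 --time_based --group_reporting

Read the completion latency percentiles (clat p99), not just IOPS. The p99 number is the one your database feels; the average is the one that hides it.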

Joke #2: Benchmarking without a question is like load testing your coffee maker: you’ll learn something, but not what you needed.

Three corporate mini-stories from the performance trenches

Mini-story 1: The incident caused by a wrong assumption (“10GbE is plenty”)

They had a tidy Proxmox cluster, three nodes, Ceph replicated size 3, and a 10GbE switch stack that looked respectable in procurement spreadsheets. The assumption was simple: “10GbE is plenty for our workload.” It often is—until it isn’t.

The first symptom wasn’t a Ceph alert. It was the helpdesk: “VMs freeze for a second during backups.” Then it became “databases are stuttering.” Latency graphs looked like a city skyline. The Ceph dashboard stayed mostly calm, because health wasn’t the issue. Performance was.

We pulled interface counters and found RX drops climbing only during two events: nightly backups and a weekly scrub window. The backup traffic wasn’t even on Ceph—at least not intentionally. It shared the same bonded uplinks and the same switch buffers. Ceph replication traffic and “backup to NAS” traffic were fighting in a narrow hallway.

The wrong assumption was that bandwidth was the only metric. The real villain was contention and microbursts. Ceph’s replication traffic is spiky; backup traffic is sustained; put them together and you get tail latency that makes RBD clients sad.

The fix was boring: separate networks properly (VLANs plus QoS on the switch, eventually dedicated NICs), stop scrubbing during backup windows, and enforce a policy that cluster traffic never shares the same choke point with bulk transfers. The performance incident evaporated. Nobody was impressed, which is how you know it was the right fix.

Mini-story 2: The optimization that backfired (“Let’s crank recovery so it finishes faster”)

A different company, different mood: they hated seeing “backfill” in the Ceph status output. They had a node reboot loop after a power event, and the cluster was rebalancing. Someone decided to “speed it up” by increasing recovery and backfill concurrency.

For a few minutes, it looked great. Recovery progress bars moved faster. Slack got optimistic. Then VM latency went vertical. Applications started timing out. The cluster wasn’t unhealthy; it was simply too busy doing the right thing too aggressively.

What happened: they turned recovery into a priority workload, unintentionally. Backfill hammered the same disks serving client IO. The network got noisy. OSD queues filled. Client requests didn’t fail; they just waited. And waited is what databases interpret as “maybe the disk is dying.”

We rolled back the settings, added a small recovery sleep, and chose a more nuanced strategy: aggressive recovery only during a defined incident window, and conservative recovery during business hours. They still recovered redundancy quickly—just not at the cost of turning production into a slideshow.

The lesson is not “never tune recovery.” It’s “recovery is a performance workload.” If you don’t schedule it, it schedules you.

Mini-story 3: The boring but correct practice that saved the day (“Class-based CRUSH rules and capacity discipline”)

One team did something unfashionable: they kept their storage design simple. Separate SSD and HDD pools. Separate CRUSH rules by device class. No mixed-media heroics. They also kept headroom: they treated 70% utilization as “getting full,” not “plenty of space.”

When a batch of SSDs started showing elevated latency (not failing, just getting weird), the cluster didn’t implode. Why? Because the CRUSH rules meant HDD OSDs never accidentally became replicas for SSD-backed VM pools. Performance-sensitive workloads stayed on SSD. The slow devices didn’t drag the entire pool’s tail latency down.

They still had work to do: mark suspect OSDs out, replace disks, let recovery run. But the incident stayed contained. No cascading tickets from unrelated services. No emergency “move everything off Ceph” plan.

What saved them wasn’t a magical sysctl. It was design hygiene. The kind that looks like overkill until the day it quietly turns a crisis into a routine maintenance task.

Common mistakes: symptom → root cause → fix

This is the part where you recognize your own cluster and feel mildly judged. Good. Production systems respond well to mild judgment.

1) Symptom: “Random” write latency spikes across many VMs

  • Root cause: Network drops/retransmits, MTU mismatch, or switch oversubscription causing tail latency.
  • Fix: Check ip -s link, ss -ti, MTU end-to-end. Separate cluster/public traffic. Reduce oversubscription. Avoid mixing backup traffic onto the same links.

2) Symptom: Reads look fine, writes are awful

  • Root cause: Replication/commit path bottleneck: slow OSD peers, BlueStore KV commits, undersized DB/WAL, or recovery interfering.
  • Fix: Inspect slow ops detail, BlueStore perf, and recovery settings. Move DB/WAL to fast devices; throttle recovery during peaks; ensure pool rule avoids slow disks.

3) Symptom: “Ceph is slow” only during scrub/backfill

  • Root cause: Background work competing for the same IO and network.
  • Fix: Schedule scrub windows. Tune recovery/backfill concurrency and add osd_recovery_sleep if needed. Accept longer recovery time as a business choice.

4) Symptom: One VM is terrible; others are fine

  • Root cause: Workload pattern mismatch (sync-heavy DB, small random writes, fsync storms), or that image is on a busy PG/OSD set.
  • Fix: Use rbd perf image iostat. Check inside the VM for filesystem and application settings. Consider moving that workload to a pool with different settings or faster media.

5) Symptom: Performance degraded after “adding capacity”

  • Root cause: You added slower disks into the same CRUSH rule/pool, or you triggered heavy rebalancing during peak.
  • Fix: Class-based CRUSH rules. Add capacity in a controlled window. Consider staged reweights to reduce the rebalancing blast radius.

6) Symptom: Latency gets worse as utilization crosses ~70–80%

  • Root cause: Fragmentation, reduced free space for BlueStore/allocators, and more expensive placement/recovery behavior as the cluster fills.
  • Fix: Keep headroom. Plan capacity early. Don’t run Ceph near full unless you enjoy emergency migrations.

7) Symptom: Ceph commands feel slow (status, ls, etc.)

  • Root cause: MON disk latency, overloaded monitors, network issues, or general cluster overload.
  • Fix: Check MON host IO and network. Ensure MONs run on reliable media and aren’t starved by VMs on the same node.

Checklists / step-by-step plan

Step-by-step triage plan (do this in order)

  1. Capture the moment: run ceph -s and ceph health detail. Save output with a timestamp (a small capture-script sketch follows this list).
  2. Confirm scope: which VMs, which storage, which nodes. Use pvesm status and check where the disks live.
  3. Check recovery/scrub: if active, decide whether to throttle or wait.
  4. Network counters: ip -s link on all Ceph NICs, check drops and errors.
  5. MTU sanity: path MTU test with ping -M do between all nodes on the Ceph networks.
  6. Retransmits: ss -ti between nodes; look for increasing retrans under load.
  7. OSD outliers: identify OSDs with worse latency via slow ops and perf dumps.
  8. Disk class and CRUSH: verify pools use correct rules and classes; fix accidental mixing.
  9. BlueStore DB/WAL: confirm placement and check KV commit latency.
  10. Benchmark responsibly: RADOS bench to validate cluster, then RBD iostat to validate client path.
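
To make step 1 a reflex instead of a scramble, wrap the evidence-gathering commands in a small script and run it while the incident is live. A minimal sketch, assuming eno2 is the Ceph-facing NIC and 10.10.10.12 is a peer node (the script name and path are made up; adjust to your topology):

cr0x@server:~$ cat /usr/local/bin/ceph-triage-snapshot
#!/bin/bash
# Capture point-in-time Ceph and network evidence into a timestamped file
out="/root/ceph-triage-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "=== ceph -s ===";            ceph -s
  echo "=== ceph health detail ==="; ceph health detail
  echo "=== ceph osd perf ===";      ceph osd perf
  echo "=== NIC counters ===";       ip -s link show dev eno2
  echo "=== TCP retransmits ===";    ss -ti dst 10.10.10.12
} > "$out" 2>&1
echo "Saved $out"

Run it a few times during the bad period and once during a good one; the difference between captures is usually more useful than any single snapshot.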

Operational checklist for keeping it fast (the unglamorous routine)

  • Keep cluster utilization comfortably below “full.” Plan capacity like an adult.
  • Separate Ceph public and cluster networks when possible, or at least isolate them from bulk traffic.
  • Enforce device classes and CRUSH rules; treat mixed media in one pool as a design review item.
  • Schedule scrub and heavy maintenance tasks away from backups and batch workloads.
  • Document recovery/backfill tuning and reset it after incidents.
  • Track latency distributions (p95/p99), not just averages. Averages are how you get surprised.

FAQ

1) Why is Ceph on Proxmox slower than local NVMe?

Because it’s not the same thing. Local NVMe is a single device with microsecond-scale access. Ceph writes are distributed, replicated, and acknowledged across the network and multiple OSDs. You’re trading latency for availability and operational flexibility.

2) Do I really need a separate cluster network?

Not always, but if you have unpredictable latency and any mix of client IO plus recovery/backups on the same links, separation is one of the highest ROI fixes. If you can’t add NICs, VLAN separation plus switch QoS is still better than “everything everywhere.”

3) Is replication size 3 mandatory?

No. It’s a default that matches a common risk posture. Size 2 reduces write amplification but increases risk and reduces failure tolerance. The correct size depends on business requirements, node count, and how much you hate downtime.

4) Should I use erasure coding for VM disks?

Usually not for small-write, latency-sensitive VM workloads. EC can be great for capacity efficiency and large objects, but it adds CPU overhead and can punish small random writes. If you want EC for VMs, measure with your actual workload and accept the complexity.

5) What does “slow ops” actually mean?

It means a client IO operation is taking longer than expected inside the OSD pipeline—waiting for sub-ops, waiting for commits, waiting on disk, or stuck behind queues. The detail output often hints whether it’s network/subop latency or BlueStore commit latency.

6) Can a few slow disks really hurt the whole pool?

Yes. Replication means your write latency is often gated by the slowest replica involved in the operation. If your CRUSH rule lets HDDs participate in an SSD-intended pool, you’ve built a randomness generator for p99 latency.

7) Why does performance tank during recovery even if the cluster is “HEALTH_OK”?

Because HEALTH_OK is about data safety and cluster invariants, not your application’s latency SLO. Recovery is heavy IO and network traffic; Ceph can stay correct while still being slow.

8) How many OSDs per node is “too many”?

It depends on CPU, RAM, media, and network. If OSD processes are CPU-starved or if BlueStore DB devices are shared too aggressively, you’ll see commit latency and slow ops. The right answer comes from measuring per-OSD latency and host contention, not from a magic number.

9) Should I “just add more PGs” to fix hotspots?

No, not reflexively. PG count affects memory, peering overhead, and recovery behavior. Use autoscaling where appropriate and only adjust manually when you understand why distribution is poor.

10) What’s the quickest single indicator of “network vs disk” bottleneck?

Look at retransmits/drops for network and BlueStore KV commit latency plus OSD apply/commit delays for disk/metadata. If both are ugly, congratulations: you have a real distributed systems problem.

Conclusion: next steps that move the needle

If your Ceph-on-Proxmox performance is slow, don’t start by changing tunables. Start by proving where time is spent. Run the fast diagnosis playbook, then do the ten checks with discipline: network counters, MTU, retransmits, recovery pressure, disk classes/CRUSH rules, BlueStore DB/WAL placement, and finally targeted benchmarks.

Your practical next steps:

  • Within an hour: capture ceph -s, ceph health detail, NIC drops, MTU tests, and retransmits during the slow period.
  • Within a day: verify pool rules don’t mix disk classes; confirm cluster/public network separation; audit recovery/scrub schedules.
  • Within a week: fix structural issues—DB/WAL on proper devices, remove slow outlier disks, and address oversubscription. Tune only after the architecture stops fighting itself.

Ceph can be fast. It can also be reliable. But it won’t be either if you treat the network like an afterthought and the disk layout like a suggestion.
