Proxmox Ceph Slow Ops: Locate the Bottleneck (Disk, Network, or CPU)

“Slow ops” is Ceph’s way of telling you: something is jammed, and it’s not going to unjam itself out of politeness.
In Proxmox, this usually shows up right when someone is migrating VMs, restoring a backup, or doing “just a small” clone.
Suddenly you’re staring at warnings, latency graphs that look like a lie detector test, and users asking why the “cloud” is slow.

The trick is to stop guessing. A Ceph cluster is a pipeline: client IO → network → OSD queue → disk → replication/erasure coding → acknowledgements.
“Slow ops” happens where that pipeline backs up. Your job is to find which stage and fix the constraint, not to random-walk through tunables.

What “slow ops” actually means in Ceph (and why Proxmox surfaces it)

In Ceph, an “op” is an operation the cluster must complete: a client read, a client write, a metadata request, a peering step, a scrub chunk,
a backfill copy, a recovery update. When Ceph warns about “slow ops,” it’s telling you that some operations have exceeded a latency threshold
and are now stuck long enough to be considered unhealthy.

Crucially, Ceph reports slow ops from the point of view of the daemon that is waiting.
If an OSD is waiting for its disk, it reports slow ops. If it’s waiting for network acknowledgements, it reports slow ops.
If it’s CPU-bound and can’t schedule work, it reports slow ops. The symptom is generic; the cause is not.

Proxmox surfaces the warnings because it integrates Ceph health checks into the cluster status UI and logs. That’s helpful… and dangerous.
Helpful because you notice early. Dangerous because many admins treat it like a Proxmox issue. It’s not. Proxmox is the passenger.
Ceph is the aircraft. If you smell smoke, don’t blame the seatbelt.

One quote that should live in your head during incidents: “Hope is not a strategy,” a paraphrased idea often attributed to operations leadership.
(We’ll avoid the attribution rabbit hole; the idea is what matters: measure, then act.)

Fast diagnosis playbook (first/second/third)

First: confirm it’s real pain, not harmless noise

  • Check cluster health and where slow ops are reported. Are they on one OSD, one host, or everywhere?
  • Check client-facing latency. If VM IO latency is flat and users are happy, you may be seeing transient recovery noise.
  • Check whether recovery/backfill/scrub is running hard. Many “slow ops” events are self-inflicted by recovery settings.

Second: identify the bottleneck class (disk vs network vs CPU)

  • Disk bottleneck usually looks like: high OSD commit/apply latency, high device utilization, long NVMe/SATA latencies, BlueStore stalls, WAL/DB contention.
  • Network bottleneck usually looks like: elevated msgr RTT, packet drops, retransmits, uneven latency between nodes, busy switches, MTU mismatch, one NIC pegged.
  • CPU bottleneck usually looks like: OSD threads runnable, high system time/softirq, ksoftirqd spikes, IRQ imbalance, high context switching, ceph-osd pegged.

Third: isolate scope, then stop the bleeding

  • Scope: one OSD? one node? one rack? one pool? one client? If it’s everywhere, suspect network core or global recovery pressure.
  • Stop bleeding: throttle recovery/backfill, pause scrubs during business hours, move hot VMs, or temporarily reduce client IO concurrency.
  • Fix: replace failing disks, correct MTU, rebalance IRQs, separate public/cluster network, move BlueStore DB/WAL to fast devices, tune recovery sanely.

Joke #1: Ceph slow ops are like traffic jams: adding more cars (clients) doesn’t make the road wider, it just improves your dashboard’s error vocabulary.

Interesting facts and context (short, concrete, useful)

  1. Ceph’s design goal was “no single point of failure,” but your NIC firmware, switch buffers, or a bad SSD can still create a very real single point of pain.
  2. The term “OSD” comes from object storage device, but in practice an OSD is a process with queues, threads, and latency behavior that often matters more than raw disk speed.
  3. BlueStore replaced FileStore because the “filesystem on top of a filesystem” approach had unavoidable overhead and awkward write amplification under load.
  4. Ceph’s monitors (MONs) are small but critical: they don’t serve data IO, yet MON latency and quorum issues can stall cluster operations and cause cascading slowdowns.
  5. “Recovery” is a feature and a tax: Ceph’s self-healing is why you bought it, but recovery IO competes with client IO unless you intentionally manage it.
  6. CRUSH (data placement) is deterministic, which is great for scale, but it also means a bad topology description can concentrate load in ways that look like random misery.
  7. Ceph’s network stack (msgr) evolved over time, and modern deployments often use msgr2; misconfigurations can show up as latency spikes rather than obvious outages.
  8. Proxmox made Ceph popular in smaller shops by making deployment clicky; that’s convenient until you forget Ceph is still a distributed system with distributed failure modes.

A mental model: where latency hides (disk, network, CPU, or “Ceph work”)

If you want to debug slow ops quickly, stop thinking “Ceph is slow” and start thinking “this queue is growing.”
Every slow op is an op that entered a queue and didn’t exit on time.

Disk bottlenecks: the boring villain

Disks fail loudly when they die. They fail quietly for months before that.
Media errors aren’t required for latency to explode. A drive can be “healthy” and still have 50–200ms tail latency.
For Ceph, tail latency matters because replication waits on the slowest required acknowledgement.

BlueStore adds its own flavor: it uses RocksDB for metadata, and it cares deeply about WAL/DB placement and device latency characteristics.
If your DB/WAL is on the same slow device as your data, or worse, on a saturated SATA SSD, you can get periodic stalls that look like network issues.

Network bottlenecks: the sneaky villain

Ceph is chatty. Not just in the “I send a lot of bytes” sense; it also needs predictable latency and low packet loss for its internal protocols.
A network can have plenty of bandwidth and still be a disaster if it has microbursts, drop storms, or MTU mismatches.

Also: your “public network” (clients to OSDs) and “cluster network” (OSD replication/backfill) are not the same traffic pattern.
Mixing them is not forbidden, but you’re signing up for contention. If your cluster is busy, the internal replication can starve client IO.
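
If you're not sure whether replication even has its own network, check before theorizing. A minimal sketch assuming a standard Proxmox layout where ceph.conf lives under /etc/pve; the subnets shown follow the example addresses used elsewhere in this article, and a missing cluster_network means replication shares the public network.

cr0x@server:~$ grep -E 'public_network|cluster_network' /etc/pve/ceph.conf
         cluster_network = 10.20.10.0/24
         public_network = 10.20.0.0/24

If nothing is defined there, also check the centralized config database with ceph config dump | grep -i network before concluding the networks are merged.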

CPU bottlenecks: the modern villain

With NVMe and fast networking, the next bottleneck is frequently CPU: checksum calculations, BlueStore overhead, encryption, compression,
and plain old interrupt handling. A host can have “only” 30% CPU usage overall and still be CPU-bound where it matters: one NUMA node,
one core handling the NIC IRQs, or a handful of ceph-osd threads stuck runnable.

“Ceph work” bottlenecks: recovery, backfill, scrub, peering

Ceph does background work to keep data safe and consistent. That work is non-negotiable, but its rate is negotiable.
Aggressive recovery settings can make a cluster look heroic in dashboards while it quietly suffocates client IO.
Conversely, recovery set too low keeps the cluster degraded for longer, increasing risk and sometimes increasing total work.
You manage it. You don’t ignore it.

Practical tasks: commands, what the output means, and the decision you make

These tasks are meant to be run from a Proxmox node with Ceph tools installed (or a dedicated admin node).
Each task includes: the command, example output, how to interpret it, and what decision you make next.
Don’t run all of them blindly in production during peak hours. Run the right ones in the right order.

Task 1: Identify where Ceph thinks the problem is

cr0x@server:~$ ceph -s
  cluster:
    id:     9c6c2d4a-1c3b-4b8d-8a5e-2f2f3b2a7b61
    health: HEALTH_WARN
            17 slow ops, oldest one blocked for 54 sec, daemons [osd.12,osd.31] have slow ops
  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 7d)
    mgr: mgr1(active, since 6d)
    osd: 36 osds: 36 up (since 7d), 36 in (since 7d)
  data:
    pools:   6 pools, 512 pgs
    objects: 4.8M objects, 18 TiB
    usage:   55 TiB used, 98 TiB / 153 TiB avail
    pgs:     510 active+clean
             2 active+clean+scrubbing
  io:
    client:   620 MiB/s rd, 210 MiB/s wr, 4.2k op/s rd, 1.9k op/s wr

Meaning: The warning names specific daemons (osd.12, osd.31). That’s gold. If it’s always the same OSDs, suspect host/disk.
If it rotates across many OSDs, suspect network or global recovery pressure.

Decision: Focus on the named OSDs first. Don’t start tuning cluster-wide settings until you know whether this is localized.

Task 2: Read the detailed health output (it contains the “blocked on” clue)

cr0x@server:~$ ceph health detail
HEALTH_WARN 17 slow ops, oldest one blocked for 54 sec, daemons [osd.12,osd.31] have slow ops
[WRN] SLOW_OPS: 17 slow ops, oldest one blocked for 54 sec, daemons [osd.12,osd.31] have slow ops
    osd.12 has 9 slow ops, oldest one blocked for 54 sec, committed 0, applied 0
    osd.31 has 8 slow ops, oldest one blocked for 42 sec, committed 0, applied 0

Meaning: “committed 0, applied 0” often points to the write path being stuck early, frequently disk/WAL/DB latency.
If you see ops committed but not applied, you may be looking at a different stage of the pipeline.

Decision: Check OSD perf counters next; don’t touch CRUSH or pool settings yet.

Task 3: Check OSD commit/apply latency (fast signal for disk trouble)

cr0x@server:~$ ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
 12               87.23             132.71
 31               65.10             110.44
  7                2.11               3.05
  8                1.98               2.80
  9                2.45               3.40

Meaning: Two OSDs are orders of magnitude slower than the rest. That’s almost never “Ceph tuning.” It’s usually disk path, controller,
firmware, queueing, or a host-level contention issue.

Decision: Jump to per-host disk and CPU investigation for the hosts running those OSD IDs. Confirm which host owns them.

Task 4: Map OSD IDs to hosts (stop guessing where to log in)

cr0x@server:~$ ceph osd find 12
{
    "osd": 12,
    "ip": "10.20.10.12:6803/12345",
    "crush_location": {
        "root": "default",
        "host": "pve3"
    }
}
cr0x@server:~$ ceph osd find 31
{
    "osd": 31,
    "ip": "10.20.10.31:6809/23456",
    "crush_location": {
        "root": "default",
        "host": "pve7"
    }
}

Meaning: Now you know exactly which Proxmox nodes to inspect: pve3 and pve7.

Decision: Go host-local: disk latency, device errors, IO scheduler, IRQ handling, NIC drops.

Task 5: Look for disk latency and saturation on the affected host

cr0x@server:~$ iostat -xz 1 5
Linux 6.8.12-pve (pve3)  12/26/2025  _x86_64_  (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.24    0.00    6.18    9.80    0.00   71.78

Device            r/s     w/s   rkB/s   wkB/s  rrqm/s  wrqm/s  %util  await  svctm
nvme0n1          42.0   310.0   1800   52600     0.0     0.0   98.7  24.10   0.32
sdb               0.2   120.4      4    2048     0.0     0.0   99.9  87.55   0.41

Meaning: %util near 100% with high await screams “device queueing.”
NVMe at 24ms and SATA/SAS at 87ms await is not a happy Ceph OSD. Notice %iowait is non-trivial too.

Decision: Identify which OSD uses which device. If it’s a single failing disk, schedule replacement.
If it’s DB/WAL on a saturated device, redesign placement.

Task 6: Confirm which block devices back the OSD (BlueStore mapping)

cr0x@server:~$ ceph-volume lvm list | sed -n '1,120p'
====== osd.12 =======
  [block]       /dev/ceph-4f3c7f9d-.../osd-block-1a2b3c4d-...
      devices              /dev/sdb
  [block.db]    /dev/ceph-4f3c7f9d-.../osd-db-9f8e7d6c-...
      devices              /dev/nvme0n1p3
  [block.wal]   /dev/ceph-4f3c7f9d-.../osd-wal-11223344-...
      devices              /dev/nvme0n1p3

Meaning: Data on /dev/sdb, DB/WAL on NVMe partition. That’s a normal pattern, but only if the NVMe isn’t also overloaded by other OSDs.

Decision: If multiple OSDs share a tiny DB/WAL device, you can create a metadata bottleneck even when data disks are fine.
Consider dedicating more NVMe capacity or reducing OSDs per DB device.
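
One quick way to count how many DB/WAL volumes are stacked on that NVMe. A small sketch using the device from the example above; the output is illustrative, but the osd-db/osd-wal volume names follow the LVM naming you just saw in ceph-volume lvm list.

cr0x@server:~$ lvs -o lv_name,devices | grep nvme0n1
  osd-db-9f8e7d6c-...    /dev/nvme0n1p3(0)
  osd-wal-11223344-...   /dev/nvme0n1p3(7680)
  osd-db-5e6f7a8b-...    /dev/nvme0n1p4(0)
  osd-wal-99887766-...   /dev/nvme0n1p4(7680)

Every volume in that list shares one device queue. A few volumes on a good NVMe is usually fine; a dozen on a small, busy SSD is how metadata bottlenecks are born.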

Task 7: Check kernel logs for “quiet” drive failure signals

cr0x@server:~$ dmesg -T | egrep -i 'blk|nvme|scsi|ata|reset|error|timeout' | tail -n 20
[Thu Dec 26 10:14:02 2025] blk_update_request: I/O error, dev sdb, sector 112394232 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Thu Dec 26 10:14:02 2025] sd 3:0:8:0: [sdb] tag#218 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[Thu Dec 26 10:14:03 2025] sd 3:0:8:0: [sdb] rejecting I/O to offline device
[Thu Dec 26 10:14:05 2025] nvme nvme0: I/O 932 QID 7 timeout, aborting

Meaning: This is not “Ceph being Ceph.” This is the kernel telling you the storage stack is falling over.
Timeouts and resets will amplify latency and create slow ops.

Decision: Replace or reseat hardware; check HBA firmware; move OSD off the suspect device. Software tuning is wasted effort here.
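
Before you close the ticket as “hardware,” capture the drive's own counters; they are also useful evidence for an RMA. A minimal sketch, assuming smartmontools and nvme-cli are installed; attribute names vary by vendor, so grep loosely, and the output here is illustrative.

cr0x@server:~$ smartctl -a /dev/sdb | egrep -i 'reallocated|pending|uncorrect|crc'
  5 Reallocated_Sector_Ct   0x0033   094   094   010    Pre-fail  Always       -       312
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       24
cr0x@server:~$ nvme smart-log /dev/nvme0n1 | egrep -i 'critical_warning|media_errors|percentage_used'
critical_warning                    : 0
media_errors                        : 0
percentage_used                     : 9%

A drive can report an overall “PASSED” while reallocated or pending sectors keep climbing. That is exactly the quiet, months-long failure mode described earlier.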

Task 8: Check Ceph’s own view of disk slowness per OSD (internal metrics)

cr0x@server:~$ ceph daemon osd.12 perf dump | egrep -i 'commit_latency|apply_latency|op_wip|op_queue|bluestore|kv|rocksdb' | head -n 30
    "op_wip": 128,
    "op_queue_age_hist": {
        "avgcount": 4312,
        "sum": 183.221
    },
    "bluestore_kv_commit_lat": 0.084,
    "bluestore_commit_lat": 0.132,
    "bluestore_state_deferred_write": 1

Meaning: op_wip high means backlog. BlueStore commit latency elevated. If RocksDB commit latency spikes, DB/WAL device is often implicated.

Decision: If backlog correlates with DB commit latency, look at DB device utilization and consider moving DB/WAL to faster media or more lanes.

Task 9: Determine whether recovery/backfill is the bully in the room

cr0x@server:~$ ceph -s | sed -n '1,120p'
  cluster:
    health: HEALTH_WARN
            17 slow ops, oldest one blocked for 54 sec, daemons [osd.12,osd.31] have slow ops
  data:
    pgs: 480 active+clean
         22 active+clean+remapped+backfilling
         10 active+recovering
  io:
    client: 620 MiB/s rd, 210 MiB/s wr
    recovery: 480 MiB/s, 120 objects/s

Meaning: Recovery is heavy. That doesn’t automatically mean it’s the cause, but it absolutely can be the amplifier.

Decision: If client latency is the priority (business hours), throttle recovery/backfill temporarily.
If data safety is the priority (degraded redundancy), you may accept higher client latency but keep an eye on user impact.

Task 10: Throttle recovery responsibly (and know what you changed)

cr0x@server:~$ ceph config get osd osd_recovery_max_active
3
cr0x@server:~$ ceph config get osd osd_max_backfills
1
cr0x@server:~$ ceph config get osd osd_recovery_sleep
0.000000
cr0x@server:~$ ceph config set osd osd_recovery_max_active 1
cr0x@server:~$ ceph config set osd osd_max_backfills 1
cr0x@server:~$ ceph config set osd osd_recovery_sleep 0.05
cr0x@server:~$ ceph config get osd osd_recovery_max_active
1

Meaning: You’re reducing parallel recovery work and inserting tiny sleeps. This often stabilizes client latency quickly, at the cost of slower healing.

Decision: If slow ops disappear and client latency improves, you’ve proven recovery pressure was a contributor. Keep it throttled until the cluster is clean, then revert.
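
When the cluster is clean again, remove the overrides rather than leaving mystery tuning behind. A short sketch: ceph config rm drops the override so the OSDs fall back to their defaults (or to whatever your documented policy sets instead), and an empty grep on ceph config dump confirms nothing lingers.

cr0x@server:~$ ceph config rm osd osd_recovery_max_active
cr0x@server:~$ ceph config rm osd osd_max_backfills
cr0x@server:~$ ceph config rm osd osd_recovery_sleep
cr0x@server:~$ ceph config dump | egrep 'recovery|backfill'
cr0x@server:~$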

Task 11: Check the network for drops and retransmits (don’t trust “link up”)

cr0x@server:~$ ip -s link show dev bond0
2: bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 3c:ec:ef:aa:bb:cc brd ff:ff:ff:ff:ff:ff
    RX:  bytes  packets  errors  dropped  missed   mcast
    98765432109 123456789 0      124      0       1234
    TX:  bytes  packets  errors  dropped  carrier collsns
    87654321098 98765432  0      0        0       0
cr0x@server:~$ nstat -az | egrep 'TcpRetransSegs|IpInDiscards|IpOutDiscards'
TcpRetransSegs            18422
IpInDiscards              91
IpOutDiscards             0

Meaning: RX drops and TCP retransmits correlate strongly with Ceph latency spikes, especially under replication load.
Drops can be host-side ring buffer exhaustion, switch congestion, MTU mismatch causing fragmentation, or a broken bond/LACP setup.

Decision: If drops climb during incidents, fix the network path before touching Ceph tunables. Check MTU end-to-end and switch ports for errors.
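
Bond counters aggregate and can hide which member is suffering, so check the physical NICs too. A sketch with placeholder interface names (substitute the real members of your bond0); driver stat names vary, so the grep is deliberately broad and the output is illustrative.

cr0x@server:~$ cat /sys/class/net/bond0/bonding/slaves
enp65s0f0 enp65s0f1
cr0x@server:~$ ethtool -S enp65s0f0 | egrep -i 'drop|discard|miss|fifo' | egrep -v ': 0$'
     rx_dropped: 1243
     rx_missed_errors: 88

If missed/fifo counters climb during incidents, the NIC ring buffers are too small for the bursts; ethtool -g shows the current and maximum ring sizes, and ethtool -G raises them.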

Task 12: Measure network latency between Ceph nodes (simple, but telling)

cr0x@server:~$ ping -c 20 -M do -s 8972 10.20.10.31
PING 10.20.10.31 (10.20.10.31) 8972(9000) bytes of data.
8972 bytes from 10.20.10.31: icmp_seq=1 ttl=64 time=0.385 ms
8972 bytes from 10.20.10.31: icmp_seq=2 ttl=64 time=0.401 ms
8972 bytes from 10.20.10.31: icmp_seq=3 ttl=64 time=3.912 ms
8972 bytes from 10.20.10.31: icmp_seq=4 ttl=64 time=0.392 ms
--- 10.20.10.31 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 19022ms
rtt min/avg/max/mdev = 0.361/0.612/3.912/0.802 ms

Meaning: Jumbo frames work (no “Frag needed”), but the max RTT spikes to ~4ms. On a local cluster network, that’s suspicious.
Spikes without packet loss often come from congestion or bufferbloat.

Decision: If RTT spikes correlate with slow ops, investigate switch queues, QoS, and whether public and cluster traffic share the same oversubscribed uplinks.

Task 13: Check CPU saturation and softirq pressure on an OSD host

cr0x@server:~$ mpstat -P ALL 1 5 | sed -n '1,80p'
Linux 6.8.12-pve (pve7)  12/26/2025  _x86_64_  (32 CPU)

01:22:10 PM  CPU  %usr %nice %sys %iowait %irq %soft %steal %idle
01:22:11 PM  all  22.1  0.0  9.8   1.2    0.0  7.9    0.0   59.0
01:22:11 PM   3  18.0  0.0  12.2  0.9    0.0  28.1   0.0   40.8
01:22:11 PM   4  12.1  0.0  10.8  0.8    0.0  25.4   0.0   50.9

Meaning: Softirq is high on specific CPUs. That’s often NIC interrupts concentrated on a few cores.
Ceph can look “disk-bound” when in reality the CPU can’t keep up with packet processing.

Decision: Inspect IRQ affinity and RPS/XPS. Consider isolating Ceph traffic NICs, enabling multiqueue, and balancing interrupts across cores.
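
A quick way to see whether the NIC's interrupt vectors are piled onto a couple of cores. The interface name is a placeholder and the output is illustrative; the point is the mapping, which here matches the two busy CPUs from the mpstat sample above.

cr0x@server:~$ grep -c enp65s0f0 /proc/interrupts
16
cr0x@server:~$ for irq in $(awk '/enp65s0f0/ {sub(":","",$1); print $1}' /proc/interrupts); do
>   echo "IRQ $irq -> CPUs $(cat /proc/irq/$irq/smp_affinity_list)"
> done | head -n 4
IRQ 104 -> CPUs 3
IRQ 105 -> CPUs 3
IRQ 106 -> CPUs 4
IRQ 107 -> CPUs 4
cr0x@server:~$ systemctl is-active irqbalance
inactive

Sixteen queues funneled into two cores explains the softirq pattern. Spread the affinities (run irqbalance, or pin queues deliberately) and re-measure.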

Task 14: Check for one monitor or manager being slow (control-plane lag can amplify pain)

cr0x@server:~$ ceph mon stat
e3: 3 mons at {mon1=10.20.0.11:6789/0,mon2=10.20.0.12:6789/0,mon3=10.20.0.13:6789/0}, election epoch 98, leader 0 mon1, quorum 0,1,2 mon1,mon2,mon3
cr0x@server:~$ ceph time-sync-status
{
  "time_skew_status": {
    "mon1": "ok",
    "mon2": "ok",
    "mon3": "ok"
  },
  "timechecks": {
    "epoch": 98,
    "round": 21834,
    "round_status": "finished"
  }
}

Meaning: MON quorum is stable and clock skew is fine. If time skew is bad, you can get bizarre behavior and delayed decisions.

Decision: If monitors are unstable, fix time sync and MON resource contention first. A shaky control plane makes data-plane debugging miserable.
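
If time-sync-status had complained, the next stop is the host clock discipline itself. A quick sketch assuming chrony, which current Proxmox releases ship by default; with systemd-timesyncd or ntpd the equivalent checks differ.

cr0x@server:~$ chronyc tracking | egrep 'System time|Last offset|Leap status'
System time     : 0.000021312 seconds fast of NTP time
Last offset     : +0.000014312 seconds
Leap status     : Normal
cr0x@server:~$ timedatectl | grep -i synchronized
System clock synchronized: yes

Sub-millisecond offsets are fine. If MONs drift apart by tens of milliseconds or more, fix the NTP sources before trusting anything else the cluster tells you about timing.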

Task 15: Verify whether a specific pool is suffering (crush rules, PG count, EC overhead)

cr0x@server:~$ ceph osd pool stats
pool rbd:     client_io 1200op/s rd, 800op/s wr, 410MiB/s rd, 190MiB/s wr
pool backups: client_io  50op/s rd,  30op/s wr,  20MiB/s rd,  15MiB/s wr
pool cephfs:  client_io 300op/s rd, 200op/s wr,  90MiB/s rd,  20MiB/s wr
cr0x@server:~$ ceph osd pool ls detail | egrep 'pool|crush_rule|size|min_size|pg_num|pgp_num'
pool 1 'rbd' replicated size 3 min_size 2 crush_rule 0 pg_num 256 pgp_num 256
pool 2 'backups' replicated size 2 min_size 1 crush_rule 0 pg_num 64 pgp_num 64

Meaning: The hot pool is rbd with size 3. That’s normal. If slow ops are only seen under one pool,
you might have a CRUSH rule placing data on a subset of hosts, or an EC pool with high CPU overhead.

Decision: If one pool is the hotspot, validate CRUSH rules and device classes. Don’t punish the entire cluster for one pool’s design.
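
To see whether a pool's rule quietly restricts placement to a subset of the cluster, dump the rule and look at what it takes and which failure domain it chooses. A minimal sketch; the rule name here is the modern default, yours may differ, and the output is trimmed.

cr0x@server:~$ ceph osd crush rule ls
replicated_rule
cr0x@server:~$ ceph osd crush rule dump replicated_rule | egrep 'rule_name|item_name|"type"'
    "rule_name": "replicated_rule",
            "item_name": "default"
            "type": "host"

If the rule takes a class-specific root (something like default~ssd), ceph osd crush tree --show-shadow shows which OSDs actually belong to that shadow hierarchy, and therefore which devices carry the whole pool.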

Task 16: Correlate slow ops with PG states (peering/backfill can be the trigger)

cr0x@server:~$ ceph pg stat
512 pgs: 480 active+clean, 22 active+clean+remapped+backfilling, 10 active+recovering; 18 TiB data, 55 TiB used, 98 TiB / 153 TiB avail
cr0x@server:~$ ceph pg dump pgs_brief | egrep 'backfill|recover|peering|stuck' | head
1.2a0  active+clean+remapped+backfilling  [12,7,8]  12  12  1012  1012
1.2a1  active+recovering                  [31,9,10] 31  31  980   980

Meaning: The problematic OSDs are part of PGs that are actively backfilling/recovering. That is a strong correlation.

Decision: If slow ops are driven by recovery, throttle it (Task 10) and/or temporarily mark the worst OSD out if it’s failing hardware.

Task 17: If a single OSD is sick, check its heartbeat and mark-out decision carefully

cr0x@server:~$ ceph osd tree | egrep 'pve3|osd.12|pve7|osd.31'
-3     2.91089 host pve3
 12    0.97030     osd.12  up  1.00000  1.00000
-7     2.91089 host pve7
 31    0.97030     osd.31  up  1.00000  1.00000
cr0x@server:~$ ceph osd out 12
marked out osd.12.

Meaning: Marking an OSD out triggers data movement. If you do this during peak load, you can trade “slow ops” for “slow everything.”

Decision: Mark out only when you have evidence of hardware failure or persistent pathological latency.
If you do it, throttle recovery and communicate expected impact.
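
Before you pull the trigger, let Ceph sanity-check the move. A small sketch; both commands exist in current releases, the exact output wording varies by version, and they answer different questions: ok-to-stop asks whether stopping the OSD right now hurts availability, safe-to-destroy asks whether its data is already fully replicated elsewhere.

cr0x@server:~$ ceph osd ok-to-stop 12
OSD(s) 12 are ok to stop without reducing availability or risking data, provided there are no other concurrent failures or interventions.
cr0x@server:~$ ceph osd safe-to-destroy 12
Error EBUSY: OSD(s) 12 have 37 pgs currently mapped to them.

“Ok to stop” plus “not safe to destroy” is the normal state for a live OSD: you can take it down, but its data has not been re-replicated yet, so expect recovery traffic once you mark it out.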

Joke #2: The only thing slower than Ceph recovery on a busy cluster is the meeting where someone suggests “let’s just add more PGs.”

Corporate mini-story #1: the wrong assumption incident

A mid-sized company ran Proxmox with Ceph on three nodes. It worked fine for months. Then “slow ops” started during business hours,
mostly when the backup job kicked off. The storage graphs showed plenty of free IOPS headroom—at least on paper—so the team assumed
Ceph was “just being sensitive.”

The wrong assumption: “If bandwidth is fine, the network isn’t the problem.” They had 10GbE links, and iperf tests between nodes looked great
late at night. During the incident, they focused on BlueStore settings and recovery tunables, because that’s what the internet argues about.
The slow ops persisted, and VM latency got ugly.

Eventually someone did the boring thing: checked ip -s link on the Ceph bond during the backup window.
RX drops climbed steadily. Not a few. Enough to matter. The switch ports looked “up,” but their buffers were being overrun by microbursts:
backup traffic, Ceph replication, and client IO all shared the same uplink, and the switch’s default queue settings were not their friend.

The fix was unglamorous and decisive: separate Ceph cluster traffic from public/client traffic (VLAN and physical ports),
reduce backup concurrency, and fix MTU consistency end-to-end. Slow ops stopped showing up as a daily ritual.
The team learned a painful truth: distributed storage doesn’t need your network to be fast; it needs your network to be predictable.

The postmortem note that mattered: they had measured network bandwidth, but not network loss and tail latency under production load.
They were testing the highway at 2 AM and declaring it safe for rush hour.

Corporate mini-story #2: an optimization that backfired

Another shop wanted to speed up recovery after node maintenance. They read that higher recovery concurrency improves healing time,
so they increased recovery-related settings across the cluster and felt clever for about a day.

Then a routine OSD restart turned into a weekday incident. Client latency spiked. Slow ops piled up. The cluster technically stayed “up,”
which made it worse: workloads didn’t fail cleanly; they just slowed until timeouts and retries made everything noisier.
It looked like the disks were failing—until they realized the disks were fine, and the cluster was simply drowning in its own recovery enthusiasm.

The backfired optimization was not “tuning is bad.” It was “tuning without a limit is bad.”
They had effectively told every OSD to prioritize recovery work at a level that made sense only when the cluster was otherwise idle.
In production, recovery competed with client IO at every step: reads to copy objects, writes to place them, and network to replicate them.

The fix was to treat recovery like a scheduled workload. During business hours, recovery was throttled.
Off-hours, it was allowed to run hotter, but still within observed safe limits.
They also began tracking “recovery throughput vs client latency” as a first-class SLO tradeoff, not a vague hope.

The lesson: Ceph will happily do exactly what you ask—even if what you ask is “please set my production cluster on fire, but evenly.”

Corporate mini-story #3: the boring practice that saved the day

A regulated environment (think audits, change windows, and people who love spreadsheets) ran Proxmox+Ceph with strict operational discipline.
They were not exciting. They were, however, consistently online.

Their “boring practice” was a weekly latency and health baseline: capture ceph osd perf, host iostat, and network drop counters
under a known workload window. They didn’t optimize constantly. They watched for drift.
When drift was detected—one OSD slowly creeping from 2ms to 8ms apply latency—they acted early.

One week the baseline flagged an OSD host with rising NVMe timeouts in dmesg, but no SMART failure yet.
They drained that host, replaced the NVMe under a planned window, and avoided an incident that would have occurred during quarter-end reporting.
Everyone forgot about it a month later, which is the highest compliment you can pay to maintenance.

They also kept recovery throttles as a documented policy, not tribal knowledge. When an OSD went out unexpectedly,
the on-call had a known-safe knob set ready to apply, rather than improvising under pressure.

The lesson is annoyingly consistent: boring operations is not laziness. It’s the discipline of not learning the same lesson twice.

Common mistakes: symptoms → root cause → fix

These are not theoretical. These are the mistakes that show up at 3 AM, wearing your pager as a hat.

1) Slow ops on a few OSDs only

  • Symptoms: ceph osd perf shows 1–3 OSDs with 10–100x higher commit/apply latency; warnings name the same daemons.
  • Root cause: One failing disk, one HBA lane, shared DB/WAL bottleneck, or a host-level contention problem.
  • Fix: Map OSD to host/device; check iostat and dmesg; replace hardware or move DB/WAL; avoid cluster-wide tuning.

2) Slow ops everywhere during recovery/backfill

  • Symptoms: Many OSDs report slow ops; PGs show backfill/recovering; recovery throughput is high; client latency spikes.
  • Root cause: Recovery concurrency too aggressive for your hardware/network; mixing client and cluster traffic on oversubscribed links.
  • Fix: Throttle recovery/backfill; separate networks if possible; schedule heavy recovery for off-hours when feasible.

3) Slow ops plus RX drops or retransmits

  • Symptoms: ip -s link shows drops; nstat shows TCP retransmits; ping RTT spikes under load.
  • Root cause: Congestion, bufferbloat, MTU mismatch, bad bond/LACP hashing, NIC driver/firmware bugs, switch port errors.
  • Fix: Verify MTU end-to-end; check switch counters; pin down bond mode; balance IRQs; consider separate Ceph networks.

4) “Disks look idle” but Ceph is still slow

  • Symptoms: iostat shows moderate %util; Ceph apply latency still high; CPU softirq or system time is high.
  • Root cause: CPU bottleneck (interrupt handling, checksums, encryption), NUMA locality issues, IRQ imbalance.
  • Fix: Balance interrupts; verify multiqueue; keep Ceph on adequate cores; avoid oversubscribing CPU with VM workloads on OSD nodes.

5) Spiky slow ops with no obvious errors

  • Symptoms: Cluster is mostly fine, but latency periodically goes bad; no clear disk errors.
  • Root cause: Background scrubs at the wrong time, RocksDB compactions, or noisy-neighbor VMs saturating the same host resources.
  • Fix: Schedule scrubs; ensure DB/WAL devices have headroom; isolate Ceph traffic; place noisy workloads carefully.

6) “We increased PGs and performance got worse”

  • Symptoms: More PGs, more memory usage, more CPU overhead; OSDs busier; slow ops increase.
  • Root cause: PG count increased beyond what the cluster can manage; metadata overhead grows; peering and maintenance costs rise.
  • Fix: Use sane PG sizing; reduce PG count if it’s excessive; focus on hardware bottlenecks and placement, not “more shards.”

7) CephFS “slow requests” masquerading as general slow ops

  • Symptoms: CephFS clients slow; MDS reports slow requests; RBD maybe fine.
  • Root cause: MDS under-provisioned CPU/RAM, too many caps, or metadata-heavy workload without tuning.
  • Fix: Scale MDS, pin to reliable hosts, and separate CephFS metadata pool onto faster devices if needed.

Checklists / step-by-step plan

Checklist A: Triage in 10 minutes

  1. Run ceph -s and ceph health detail. Write down which daemons and how old the oldest slow op is.
  2. Run ceph osd perf. Identify whether it’s localized (few OSDs) or systemic (many elevated).
  3. Check PG states with ceph pg stat. If recovery/backfill is active, note recovery throughput.
  4. On affected hosts, run iostat -xz 1 5 and ip -s link. Look for high await/%util or drops.
  5. If you see kernel errors/timeouts in dmesg, treat it as hardware until proven otherwise.

Checklist B: Decide “disk vs network vs CPU” with evidence

  1. Disk case: High commit/apply latency on specific OSDs + high await/%util + kernel IO errors/timeouts. Replace or migrate.
  2. Network case: Drops/retransmits + RTT spikes under load + many OSDs affected across hosts. Fix MTU, congestion, bonds, switching.
  3. CPU case: High softirq/system time + IRQ concentration + fast disks + fast network but still slow ops. Balance IRQs and reduce contention.
  4. Recovery pressure case: PGs backfilling/recovering + recovery throughput high + latency spike starts with a failure event. Throttle, then heal.

Checklist C: Safe “stop the bleeding” actions (ranked)

  1. Throttle recovery/backfill (temporary). It’s reversible and often immediately helpful.
  2. Pause scrubs during peak hours if scrubbing is contributing (then re-enable later; see the flag sketch after this checklist).
  3. Move the worst hot VMs away from OSD hosts that are already struggling (reduce noisy neighbor contention).
  4. Mark out a clearly failing OSD only when you have strong evidence. Then control recovery rate.
  5. Do not mass-restart OSDs hoping it “clears.” That often multiplies recovery work and turns a warning into a day.
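
Pausing scrubs (item 2) is a pair of cluster-wide flags, and the cluster will show a health warning until you unset them, which is a feature: it keeps the temporary change visible. A minimal sketch of the sequence:

cr0x@server:~$ ceph osd set noscrub
noscrub is set
cr0x@server:~$ ceph osd set nodeep-scrub
nodeep-scrub is set
... peak hours pass, latency is back under control ...
cr0x@server:~$ ceph osd unset noscrub
noscrub is unset
cr0x@server:~$ ceph osd unset nodeep-scrub
nodeep-scrub is unset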

Checklist D: Structural improvements (the stuff you do when not on fire)

  1. Separate Ceph public and cluster traffic if you can (physical NICs or at least VLANs with capacity planning).
  2. Ensure DB/WAL placement is intentional: enough NVMe, not oversubscribed, not sharing with unrelated workloads.
  3. Right-size OSD hosts: CPU and RAM headroom matters with fast media and 25/40/100GbE.
  4. Baseline and trend: capture weekly OSD latency, disk latency, and network drops. Drift is your early warning system (a capture sketch follows this checklist).
  5. Standardize recovery policy: business-hours throttles, off-hours faster healing, documented rollback steps.
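
The baseline in item 4 doesn't need to be fancy; it needs to exist and be comparable week to week. A minimal sketch you could run from cron on an admin-capable node; the output directory and file names are assumptions, adjust to your environment.

#!/bin/bash
# Weekly Ceph/host baseline capture (sketch): the same commands used for triage, saved for later comparison.
set -eu
TS=$(date +%Y%m%d-%H%M)
OUT=/var/log/ceph-baseline/$TS          # assumed location; pick your own
mkdir -p "$OUT"
ceph -s             > "$OUT/ceph-status.txt"
ceph health detail  > "$OUT/health-detail.txt"
ceph osd perf       > "$OUT/osd-perf.txt"       # per-OSD commit/apply latency
ceph osd df tree    > "$OUT/osd-df-tree.txt"    # capacity and balance per host/OSD
iostat -xz 1 5      > "$OUT/iostat.txt"         # device await/%util under current load
ip -s link          > "$OUT/ip-link.txt"        # NIC error/drop counters
nstat -az           > "$OUT/nstat.txt"          # TCP retransmits and discards

Diffing last week's osd-perf.txt against this week's is usually enough to catch the 2ms-to-8ms creep before it becomes an incident.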

FAQ

1) Are “slow ops” always a production emergency?

No. A few slow ops during a controlled recovery event can be acceptable. If slow ops persist, grow, or correlate with VM IO latency spikes, treat it as an incident.

2) If only one OSD shows high apply latency, is it safe to restart it?

Restarting may temporarily mask a bad device by clearing queues, but it won’t fix the cause. First check dmesg and device latency. If hardware looks suspect, plan a replacement or mark-out.

3) What’s the fastest way to tell disk vs network?

Disk issues show up as a small set of OSDs with extreme commit/apply latency and high device await. Network issues show up as broader impact plus drops/retransmits and RTT spikes under load.

4) Does separating public and cluster networks really matter for small clusters?

If the cluster is lightly used, you can often get away with a single network. Once you have heavy backups, migrations, or recovery events, separation becomes a reliability feature, not a luxury.

5) Can CPU really be the bottleneck if overall CPU usage is only 30%?

Yes. IRQ handling and softirq can saturate a few cores while the rest are idle. Also, NUMA locality and per-daemon thread behavior can create “local saturation” that averages hide.

6) Should I increase recovery settings so the cluster heals faster?

Only after measuring client impact. Faster recovery is good, but not if it turns client IO into sludge. Use a policy: conservative during peak, more aggressive off-hours, and verify with latency metrics.

7) How do I know if BlueStore DB/WAL is my bottleneck?

Look for elevated BlueStore/RocksDB commit latencies in ceph daemon osd.X perf dump, and check utilization/latency on the DB device. If DB is shared among too many OSDs, it’s a common choke point.

8) Is it normal for recovery to cause slow ops?

Some impact is normal. Large impact is usually a sign that recovery is competing too aggressively with client IO, or that hardware/network capacity is marginal for the chosen replication/EC settings.

9) Does Proxmox change how Ceph behaves under load?

Proxmox doesn’t change Ceph’s fundamentals, but it influences workload patterns (migrations, backups, snapshots) and resource contention (VMs and OSDs sharing hosts). The physics still apply.

10) What’s a safe first knob to turn during an incident?

Throttling recovery/backfill is often the safest reversible change, especially when slow ops appear after a failure and PGs are actively recovering. Don’t “tune” your way out of failing hardware.

Conclusion: next steps that actually change outcomes

When Ceph says “slow ops,” it’s not asking for your feelings. It’s asking for your measurements.
Start with scope (which daemons), then classify the bottleneck (disk, network, CPU, or recovery pressure), then act with reversible controls before you redesign anything.

Practical next steps:

  1. Build a tiny runbook from the Fast diagnosis playbook and Tasks section. Put it where on-call can find it.
  2. Baseline ceph osd perf, host iostat, and network drop counters weekly. Watch drift.
  3. Decide on a recovery policy and document it (including how to revert).
  4. If you’re mixing public and cluster traffic on one congested network, plan separation. It’s the closest thing to buying reliability with a credit card.
  5. If specific OSDs are consistently slower, stop debating. Replace the hardware or fix the device topology. Ceph is rarely wrong about which OSD is hurting.