You glance at Proxmox, see a yellow banner, and your brain does the same thing your cluster is doing: backfilling anxiety. The message is “Ceph HEALTH_WARN”. It might be harmless housekeeping. It might be the first cough before pneumonia.
The trap is reacting fast in the wrong direction. This guide is about safe moves: the checks that won’t make things worse, the commands that tell the truth, and the decisions you can defend in a postmortem.
What HEALTH_WARN actually means (and what it doesn’t)
Ceph health has three high-level states: HEALTH_OK, HEALTH_WARN, HEALTH_ERR. HEALTH_WARN is the cluster telling you: “Something is not ideal, but I’m still functioning.”
That doesn’t mean “ignore it.” It also doesn’t mean “panic and start restarting daemons.” HEALTH_WARN is a bucket. The meaningful content is the list of warnings underneath: nearfull, OSD down, PGs degraded, slow ops, clock skew, mon quorum issues, scrub errors, and about a dozen other ways storage can politely ask for help.
Two guiding principles:
- Ceph is a distributed state machine. Most “fixes” are about letting it converge (or helping it converge safely).
- Your safest first moves are read-only. Observe, measure, then change.
Here’s the operationally useful interpretation:
- HEALTH_WARN with stable I/O and no data availability risk: schedule maintenance, reduce risk, don’t thrash.
- HEALTH_WARN with client impact (latency, timeouts, stalled VMs): treat it like an incident and identify the bottleneck fast.
- HEALTH_WARN with “misplaced”, “degraded”, or “undersized” PGs: you’re one more failure away from a bad day. Prioritize redundancy recovery.
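A zero-risk way to turn the banner into that list is to pull only the warning lines out of ceph health detail. This is a read-only sketch; the pattern simply matches the [WRN]/[ERR] prefixes that command prints.
cr0x@server:~$ ceph health detail | grep -E '^\[(WRN|ERR)\]'    # one line per active warning type
cr0x@server:~$ ceph health detail | grep -Ec '^\[(WRN|ERR)\]'   # count them for the incident note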
Dry truth: Ceph is like a group project—if you don’t check what everyone is doing, you’ll be debugging “communication problems” at 2 a.m.
Fast diagnosis playbook (first/second/third)
First: confirm what’s actually wrong (single command, one screen)
Your first job is to translate “yellow” into one or two specific failure modes.
cr0x@server:~$ ceph -s
cluster:
id: 7c2f0b7c-9c55-4e2e-8b3f-8c3c7c1d8a2a
health: HEALTH_WARN
1 osds down
2 nearfull osd(s)
36 slow ops, oldest one blocked for 94 sec, daemons [osd.12] have slow ops.
services:
mon: 3 daemons, quorum pve1,pve2,pve3 (age 2h)
mgr: pve1(active, since 2h), standbys: pve2
osd: 12 osds: 11 up, 12 in
data:
pools: 2 pools, 256 pgs
objects: 2.1M objects, 8.2 TiB
usage: 24 TiB used, 36 TiB / 60 TiB avail
pgs: 240 active+clean
12 active+undersized+degraded
4 active+peering
Meaning: This is not one problem; it’s a cluster under pressure. One OSD is down, a couple are nearfull (which throttles), and there are slow ops (client-visible pain). PGs are undersized/degraded (risk).
Decision: Treat as an incident. Focus first on availability risk (OSD down → degraded PGs), then on performance (slow ops), while immediately starting nearfull mitigation planning.
Second: determine if the problem is “control plane” or “data plane”
Control plane problems: monitor quorum flapping, mgr dead, time skew. Data plane problems: OSDs down, disk errors, network loss, slow ops, backfill storms.
- If mon quorum is unstable, do not start “repairing” PGs. Fix quorum first.
- If OSDs are down, don’t chase performance tweaks. Restore OSDs and redundancy.
- If all daemons are up but slow ops persist, chase network/disk latency and recovery settings.
Third: identify the bottleneck type in 10 minutes
Most real HEALTH_WARN incidents collapse into one of these:
- Capacity pressure (nearfull/full, backfill blocked, writes throttled).
- Single-node failure (OSD down, host reboot, disk dead, HBA resets).
- Network pain (packet loss, bad MTU, congested switch, asymmetric routing).
- Recovery storm (backfill/rebalance competing with client I/O).
- Slow storage (one OSD device dragging everyone down).
- Time/clock issues (mon warnings, auth oddities, weird flaps).
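If you want a read-only pass that usually lands on one of those buckets, the following sketch does it with standard Ceph and Linux tooling; eno2 is just this article's example storage interface, so substitute yours.
cr0x@server:~$ ceph health detail                       # which warning types, which daemons
cr0x@server:~$ ceph osd df | sort -nk11 | tail -n 5     # capacity pressure: fullest OSDs (%USE is column 11)
cr0x@server:~$ ceph osd tree | grep -w down             # single-node failure: what is down and where
cr0x@server:~$ ceph osd perf                            # slow storage: latency outliers
cr0x@server:~$ ip -s link show dev eno2                 # network pain: errors and drops on the storage NIC
cr0x@server:~$ ceph -s                                  # recovery storm: check the io/progress sections
cr0x@server:~$ ceph health detail | grep -i skew        # time/clock issues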
Safety rules: what to avoid when the cluster is yellow
Ceph rewards calm. It punishes improvisation.
1) Don’t restart daemons “to clear the warning”
Restarting mons/mgrs/osds changes state, triggers peering, can restart recovery, and can make you lose the one thing you had: a stable cluster that was slowly healing. Only restart when you know what you’re fixing (stuck process, confirmed crash loop, kernel reset, etc.).
2) Don’t mark OSDs out unless you mean it
ceph osd out triggers data movement. On a busy cluster that can turn “one OSD down” into “the whole cluster is slow for hours.” If the OSD is down due to a transient host issue, get it back up before you declare it out.
3) Don’t “fix nearfull” by changing the full ratios mid-incident
Yes, you can raise mon_osd_nearfull_ratio and friends. That’s not capacity. That’s denial with extra steps. If you must adjust ratios to unblock emergency writes, treat it as a temporary exception with a rollback plan.
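If you do take that temporary exception, record the current ratios first and schedule the rollback in the same breath. A minimal sketch, assuming the stock defaults of 0.85/0.90/0.95 (verify with the dump before changing anything):
cr0x@server:~$ ceph osd dump | grep -i ratio          # record the current nearfull/backfillfull/full ratios
cr0x@server:~$ ceph osd set-nearfull-ratio 0.88       # temporary exception, noted in the incident log
cr0x@server:~$ ceph osd set-backfillfull-ratio 0.92
# ...free space or add capacity, then put the recorded values back, e.g.:
cr0x@server:~$ ceph osd set-nearfull-ratio 0.85
cr0x@server:~$ ceph osd set-backfillfull-ratio 0.90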
4) Don’t run aggressive repair commands blindly
ceph pg repair and OSD-level tooling can be correct, but they’re not a first response. Prove you have scrub errors or inconsistent PGs, and confirm quorum and networking are stable first.
5) Don’t optimize recovery while you’re still diagnosing
Changing osd_max_backfills or osd_recovery_max_active can help—but you can also starve client I/O or overload a weak node. Diagnose first, tune second.
Joke #1: If you’re about to “just reboot the Ceph node,” remember: the cluster has feelings, and it expresses them in backfill traffic.
Interesting facts and context (Ceph and Proxmox)
- Ceph started at UC Santa Cruz as a research project in the mid‑2000s, aiming for self-managing storage at scale. That DNA still shows: it wants to heal itself, sometimes noisily.
- CRUSH isn’t a filesystem; it’s the placement algorithm that decides where data goes without a centralized lookup table, which is why Ceph can scale without a single metadata bottleneck for object placement.
- PGs (placement groups) are virtual buckets used to manage replication and recovery. They’re not “partitions” and they’re not “volumes,” but they control the blast radius of recovery.
- HEALTH_WARN is often a capacity signal, not a failure signal. Many clusters run “fine” until they hit nearfull, then performance falls off a cliff because Ceph throttles to protect itself.
- Ceph’s scrub model is preventive medicine: light scrubs and deep scrubs trade background I/O for early corruption detection. Turning scrub off is like removing smoke detectors because you don’t like the beeping.
- BlueStore replaced FileStore as the default storage backend years ago, primarily to reduce journal complexity and improve performance, especially on SSD/NVMe.
- Public and cluster networks are separate concepts in Ceph design: mixing client traffic and replication/backfill on the same saturated network is a classic “it was fine in testing” story.
- Proxmox integrates Ceph tightly (pveceph tooling, UI health panels), but it doesn’t change Ceph’s fundamental behavior: you still troubleshoot with Ceph tools, not GUI vibes.
- Clock skew warnings exist for a reason: Ceph monitors rely on timeouts and quorum logic; time drift can look like failures and can lead to flapping.
One quote worth keeping on a sticky note, because it’s how you should troubleshoot distributed systems: “Hope is not a strategy.”
— General Gordon R. Sullivan.
Practical tasks: commands, meaning, decision
These are ordered roughly from safest/most general to more targeted. Use them like a checklist: run, interpret, decide. If you only do one thing from this article, do the “meaning + decision” part every time. That’s what keeps you from random-walking into downtime.
Task 1: Snapshot the cluster health details
cr0x@server:~$ ceph health detail
HEALTH_WARN 1 osds down; 2 nearfull osd(s); 36 slow ops
[WRN] OSD_DOWN: 1 osds down
osd.7 down since 2025-12-26T09:01:33.123+0000, last_up 2025-12-26T08:42:10.991+0000
[WRN] OSD_NEARFULL: 2 nearfull osd(s)
osd.3 is near full at 84%
osd.9 is near full at 85%
[WRN] SLOW_OPS: 36 slow ops, oldest one blocked for 94 sec, daemons [osd.12] have slow ops
Meaning: This is the actionable list. It tells you what Ceph thinks is wrong, with IDs and timestamps.
Decision: Open an incident note. Copy/paste this output. You’ll need it to know whether things are improving after changes.
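If you want the incident note to write itself, a timestamped snapshot of the read-only basics is enough. The file path here is just an example.
cr0x@server:~$ ts=$(date -u +%Y%m%dT%H%M%SZ)
cr0x@server:~$ { ceph -s; ceph health detail; ceph osd tree; ceph osd df; } > /root/ceph-incident-$ts.txt
cr0x@server:~$ less /root/ceph-incident-$ts.txt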
Task 2: Check monitor quorum and basic control-plane stability
cr0x@server:~$ ceph quorum_status --format json-pretty
{
"election_epoch": 42,
"quorum": [0,1,2],
"quorum_names": ["pve1","pve2","pve3"],
"quorum_leader_name": "pve1",
"monmap": {
"mons": [
{"rank":0,"name":"pve1","public_addrs":{"addrvec":[{"type":"v2","addr":"10.10.0.11:3300"},{"type":"v1","addr":"10.10.0.11:6789"}]}},
{"rank":1,"name":"pve2","public_addrs":{"addrvec":[{"type":"v2","addr":"10.10.0.12:3300"},{"type":"v1","addr":"10.10.0.12:6789"}]}},
{"rank":2,"name":"pve3","public_addrs":{"addrvec":[{"type":"v2","addr":"10.10.0.13:3300"},{"type":"v1","addr":"10.10.0.13:6789"}]}}
]
}
}
Meaning: Quorum is healthy when it’s stable and contains a majority of mons. If quorum is missing a mon or oscillating, everything else becomes ambiguous.
Decision: If quorum is unstable: stop and fix network/time/mon host health first. Don’t touch OSD state or PG repair until quorum is boring again.
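“Stable” is the keyword. One hedged way to confirm quorum isn’t quietly flapping is to watch the election epoch for a few minutes: a constant number is boring and healthy; a climbing number means the mons keep re-electing.
cr0x@server:~$ for i in 1 2 3 4 5 6; do ceph quorum_status --format json | grep -o '"election_epoch":[0-9]*'; sleep 20; done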
Task 3: Confirm the manager is active and not wedged
cr0x@server:~$ ceph mgr stat
{
"active_name": "pve1",
"num_standbys": 1
}
Meaning: MGR provides dashboards and some orchestration logic (depending on modules). If it’s down, you lose visibility and some automation.
Decision: If no active mgr: restore it, but don’t assume mgr issues cause I/O issues. It’s usually observability, not data path.
Task 4: Identify which PGs are unhealthy and why
cr0x@server:~$ ceph pg stat
256 pgs: 240 active+clean, 12 active+undersized+degraded, 4 active+peering
data: 8.2 TiB, 24 TiB used, 36 TiB / 60 TiB avail
io: 120 MiB/s rd, 45 MiB/s wr, 410 op/s rd, 190 op/s wr
Meaning: undersized+degraded means replication requirements aren’t met. peering means PGs are negotiating who owns what; it can be normal briefly, or a sign of a deeper issue if it persists.
Decision: If degraded/undersized persists more than a few minutes, find missing OSDs and restore them. If peering persists, suspect network partition or OSD flaps.
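To connect unhealthy PGs to specific OSDs without touching anything, list the affected PGs with their OSD sets. A read-only sketch; the awk fields follow the pgs_brief column order (PG id, state, up set, acting set).
cr0x@server:~$ ceph pg dump pgs_brief 2>/dev/null | awk '$2 ~ /undersized|degraded|peering/ {print $1, $2, $3, $5}' | head -n 20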
Task 5: Find the down OSD and its host quickly
cr0x@server:~$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 5.45600 root default
-3 1.81900 host pve1
0 hdd 0.45500 osd.0 up 1.00000 1.00000
3 hdd 0.45500 osd.3 up 1.00000 1.00000
7 hdd 0.45500 osd.7 down 1.00000 1.00000
-5 1.81900 host pve2
1 hdd 0.45500 osd.1 up 1.00000 1.00000
4 hdd 0.45500 osd.4 up 1.00000 1.00000
9 hdd 0.45500 osd.9 up 1.00000 1.00000
-7 1.81800 host pve3
2 hdd 0.45500 osd.2 up 1.00000 1.00000
5 hdd 0.45500 osd.5 up 1.00000 1.00000
12 hdd 0.45500 osd.12 up 1.00000 1.00000
Meaning: This gives you the physical blast radius: which node contains the down OSD.
Decision: If it’s a host-wide issue (multiple OSDs down on one host), treat it like a node incident (power, kernel, NIC, HBA). If it’s a single OSD, it’s likely a disk/HBA path problem.
Task 6: Check if the down OSD is actually “down” or just “out”/stopped
cr0x@server:~$ ceph osd dump | grep -E '^osd\.7|^epoch'
epoch 12872
osd.7 down in weight 0.455 last_clean_interval [12640,12869) ...
Meaning: The OSD is reported down but still in. Ceph still expects it to hold data, but the daemon isn’t responding. Because it’s still “in”, no rebalancing has started yet; by default the monitors will mark a down OSD out automatically after a timeout (mon_osd_down_out_interval), and that is when the heavy data movement begins.
Decision: Prefer restoring the daemon/disk path so it becomes up again. Avoid marking it out unless you expect a long outage or hardware replacement.
Task 7: On the affected host, check the OSD service and recent logs
cr0x@server:~$ systemctl status ceph-osd@7 --no-pager
● ceph-osd@7.service - Ceph object storage daemon osd.7
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled)
Active: failed (Result: exit-code) since Fri 2025-12-26 09:02:01 UTC; 3min ago
Process: 19444 ExecStart=/usr/bin/ceph-osd -f --cluster ceph --id 7 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 19444 (code=exited, status=1/FAILURE)
cr0x@server:~$ journalctl -u ceph-osd@7 -n 80 --no-pager
Dec 26 09:01:58 pve1 ceph-osd[19444]: bluestore(/var/lib/ceph/osd/ceph-7) _mount failed: (5) Input/output error
Dec 26 09:01:58 pve1 ceph-osd[19444]: failed to mount object store
Dec 26 09:02:01 pve1 systemd[1]: ceph-osd@7.service: Main process exited, code=exited, status=1/FAILURE
Meaning: This isn’t a Ceph “logic” issue; it’s an I/O error mounting BlueStore. Think disk, controller, cabling, kernel.
Decision: Stop guessing and check hardware and kernel logs next. If the disk is failing, plan an OSD replacement workflow rather than repeated restarts.
Task 8: Check kernel messages for disk resets/timeouts on that node
cr0x@server:~$ dmesg -T | tail -n 40
[Fri Dec 26 09:01:41 2025] sd 6:0:2:0: [sdc] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Fri Dec 26 09:01:41 2025] blk_update_request: I/O error, dev sdc, sector 918274048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Fri Dec 26 09:01:43 2025] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Fri Dec 26 09:01:43 2025] ata7.00: failed command: READ DMA
[Fri Dec 26 09:01:44 2025] ata7: hard resetting link
Meaning: The kernel is seeing real I/O errors/resets. Ceph can’t outvote physics.
Decision: Replace the disk or fix the path (HBA, cable, backplane). If you can’t do it immediately, mark the OSD out and begin controlled recovery (after evaluating cluster capacity and performance headroom).
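Before declaring the disk dead, one SMART read of the device the kernel named (sdc here) is cheap and read-only. This assumes smartmontools is installed; drives behind some HBA/RAID controllers need the controller-specific -d option.
cr0x@server:~$ smartctl -H /dev/sdc                                                    # overall health verdict
cr0x@server:~$ smartctl -a /dev/sdc | grep -iE 'reallocated|pending|uncorrect|error'   # the attributes that usually matter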
Task 9: Identify “slow ops” offenders and whether it’s one OSD or many
cr0x@server:~$ ceph health detail | sed -n '/SLOW_OPS/,$p'
[WRN] SLOW_OPS: 36 slow ops, oldest one blocked for 94 sec, daemons [osd.12] have slow ops
cr0x@server:~$ ceph daemon osd.12 dump_historic_ops | head -n 25
{
"ops": [
{
"description": "osd_op(client.48219:9123 3.2f3a 3:1f9d9b4f:::rbd_data.1a2b...:head [write 0~4194304] ...)",
"duration": 92.334,
"initiated_at": "2025-12-26T09:00:11.112233Z",
"type_data": { "op_type": "write" }
}
]
}
Meaning: Ceph is telling you which daemon is stalling operations. If it’s one OSD consistently, suspect that disk or its host (latency, queueing, NIC).
Decision: If one OSD is the hotspot, inspect its disk and node-level metrics. If slow ops spread across many OSDs, suspect network congestion or recovery storm.
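Two more read-only views help separate “slow disk” from “slow everything”: what osd.12 is blocked on right now, and what its device latency looks like at the OS level. The ceph daemon command has to run on the host that carries osd.12; iostat comes from the sysstat package, and you’ll need to know which device backs that OSD.
cr0x@server:~$ ceph daemon osd.12 dump_ops_in_flight | head -n 30   # operations currently stuck in the OSD
cr0x@server:~$ iostat -x 1 5                                        # watch await and %util for the OSD’s device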
Task 10: Check for recovery/backfill pressure (the silent performance killer)
cr0x@server:~$ ceph -s | sed -n '/io:/,$p'
io:
client: 120 MiB/s rd, 45 MiB/s wr, 410 op/s rd, 190 op/s wr
recovery: 1.2 GiB/s, 18 objects/s
progress:
Global Recovery Event (1m)
[==========..........] (remaining: 2m)
cr0x@server:~$ ceph osd perf
osd commit_latency(ms) apply_latency(ms)
0 5 8
1 6 9
2 5 8
3 7 11
4 6 9
5 5 8
9 8 14
12 60 95
Meaning: Recovery is happening and one OSD (12) has far worse latency. That OSD can become the drag anchor for both recovery and client I/O.
Decision: If recovery is competing with production I/O, throttle recovery carefully. If one OSD is unhealthy, fix or isolate it before increasing recovery speed.
Task 11: Confirm nearfull is real and find which OSDs are the culprits
cr0x@server:~$ ceph df detail | head -n 40
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 60 TiB 36 TiB 24 TiB 24 TiB 40.0
POOLS:
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
rbd 1 224 7.8 TiB 2.0M 23 TiB 62.5 11 TiB
cephfs 2 32 0.4 TiB 0.1M 1.1 TiB 18.0 11 TiB
cr0x@server:~$ ceph osd df | sort -nk11 | tail -n 6
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
3 hdd 0.45500 1.00000 5.5T 4.6T 4.5T 0B 4.0G 0.9T 84.1 1.20 24 up
9 hdd 0.45500 1.00000 5.5T 4.7T 4.6T 0B 4.1G 0.8T 85.2 1.21 26 up
Meaning: Nearfull is per-OSD, not cluster-average. Two OSDs are hot, likely from CRUSH imbalance, device size mismatch, or a past “out/in” causing uneven placement.
Decision: If a few OSDs are nearfull, don’t just add capacity “somewhere.” Fix imbalance (reweight/balancer), or replace small drives, or redistribute with care. Also plan immediate headroom: delete/move data if you’re close to full.
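If you suspect imbalance rather than a genuine lack of space, there is a dry run that shows what a utilization-based reweight would do without moving a byte. Treat the output as information, not as a button to press during peak load.
cr0x@server:~$ ceph osd test-reweight-by-utilization   # dry run: which OSDs would be reweighted, and by how much
# Only if the plan looks sane and the cluster has performance headroom:
cr0x@server:~$ ceph osd reweight-by-utilization        # this one actually triggers data movement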
Task 12: Check if the balancer is enabled and whether it’s helping or hurting
cr0x@server:~$ ceph balancer status
{
"active": true,
"mode": "upmap",
"optimize_result": "no_optimization_needed",
"last_optimize_duration": "0.000000s"
}
Meaning: Upmap mode can smooth distribution without massive data movement, but it still changes mappings. If it’s disabled, you may have long-term imbalance. If it’s enabled but stuck, you may have constraints (nearfull, misconfigured CRUSH rules).
Decision: If nearfull is localized and the balancer isn’t making progress, investigate CRUSH, device classes, and OSD weights. Avoid forcing aggressive rebalance during peak load.
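If the balancer is off or in a mode you don’t want, the knobs are small and explicit. A sketch, assuming you want upmap mode (which requires all clients to speak a recent enough protocol version):
cr0x@server:~$ ceph balancer eval        # current distribution score (the output says lower is better)
cr0x@server:~$ ceph balancer mode upmap
cr0x@server:~$ ceph balancer on
cr0x@server:~$ ceph balancer status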
Task 13: Verify time sync and look for clock skew warnings
cr0x@server:~$ ceph health detail | grep -i skew
[WRN] MON_CLOCK_SKEW: clock skew detected on mon.pve2, mon.pve3
cr0x@server:~$ timedatectl
Local time: Fri 2025-12-26 09:06:17 UTC
Universal time: Fri 2025-12-26 09:06:17 UTC
RTC time: Fri 2025-12-26 09:06:16
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
Meaning: Clock skew warnings can be caused by real drift, paused VMs, or overloaded nodes delaying NTP. In practice, it’s often “the node is too busy to keep time well.”
Decision: If NTP isn’t synchronized, fix time first. If NTP is fine but skew persists, look for CPU saturation or VM pauses on mon hosts.
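If timedatectl claims “synchronized” but skew warnings persist, ask the time daemon itself on each mon host; offsets and unreachable sources show up there long before they turn into quorum weirdness. This assumes chrony, the default on recent Proxmox releases; ntpd or systemd-timesyncd hosts have their own equivalents.
cr0x@server:~$ chronyc tracking      # current offset, stratum, and sync source
cr0x@server:~$ chronyc sources -v    # are the time sources reachable and agreeing?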
Task 14: Check network errors and dropped packets on Ceph interfaces
cr0x@server:~$ ip -s link show dev eno2
2: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 3c:ec:ef:12:34:56 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
9123491234 12349123 0 412 0 10123
TX: bytes packets errors dropped carrier collsns
8234123412 11234123 0 95 0 0
Meaning: Drops on the storage network correlate strongly with slow ops, peering delays, and OSD flaps. MTU mismatches can look like random loss.
Decision: If drops/errors are non-zero and climbing, treat network as suspect. Confirm MTU end-to-end and check switch ports. Don’t tune Ceph recovery to “fix” a packet-loss problem.
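MTU mismatches deserve a direct test rather than an argument. With MTU 9000 on the storage interface, a do-not-fragment ping with an 8972-byte payload (9000 minus 28 bytes of IP and ICMP headers) must pass between every pair of Ceph nodes. The address below is this article’s example mon IP; test each peer.
cr0x@server:~$ ping -M do -s 8972 -c 3 10.10.0.12    # must succeed; “message too long” means an MTU mismatch somewhere in the path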
Task 15: Check cluster flags that might be blocking recovery
cr0x@server:~$ ceph osd dump | grep flags
flags noout,noscrub,nodeep-scrub
Meaning: Flags like noout can be intentional during maintenance, but they can also keep the cluster degraded longer than expected. noscrub may hide corruption signals.
Decision: If flags were set for maintenance and forgotten, remove them deliberately when safe. If you’re in an incident, don’t remove flags without understanding why they were set.
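When the flags have outlived their purpose, remove them explicitly, one at a time, and watch what the cluster starts doing afterwards.
cr0x@server:~$ ceph osd unset noout
cr0x@server:~$ ceph osd unset noscrub
cr0x@server:~$ ceph osd unset nodeep-scrub
cr0x@server:~$ ceph -s    # expect scrubs, and possibly recovery, to resume; that is the point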
Task 16: Inspect a specific PG that’s stuck or degraded
cr0x@server:~$ ceph pg dump_stuck undersized | head
PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
1.2a active+undersized+degraded [3,12] 3 [3,12] 3
cr0x@server:~$ ceph pg 1.2a query | head -n 35
{
"state": "active+undersized+degraded",
"up": [7,3,12],
"acting": [7,3,12],
"info": {
"stats": {
"state": "active+undersized+degraded"
}
},
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2025-12-26T09:01:40.000000Z"
}
]
}
Meaning: The PG is serving I/O from only two replicas. Its third copy lived on OSD 7, which is down but still “in”, so CRUSH hasn’t placed a replacement; the PG can’t meet its configured size until OSD 7 returns or is replaced.
Decision: Fix OSD 7 or replace it. Don’t waste time on PG repair when the missing replica is the real cause.
Task 17: If you must throttle recovery, do it with intent (and revert later)
cr0x@server:~$ ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
injected data to osd.0
injected data to osd.1
injected data to osd.2
injected data to osd.3
injected data to osd.4
injected data to osd.5
injected data to osd.9
injected data to osd.12
Meaning: You reduced recovery concurrency on all OSDs. This can stabilize client latency at the cost of slower healing.
Decision: Use only when client impact is severe and you need the cluster responsive. Create a reminder to revert when the incident ends, or you’ll be “mysteriously slow to recover” forever.
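Two hedged notes on mechanics: injectargs changes last only until a daemon restarts, so if you want the throttle to survive (and want a clean, auditable revert) set it in the config database instead; and recent releases using the mClock scheduler may ignore these options unless you enable the recovery-settings override, so check the documentation for your version.
cr0x@server:~$ ceph config set osd osd_max_backfills 1
cr0x@server:~$ ceph config set osd osd_recovery_max_active 1
# When the incident is over, drop the overrides and fall back to defaults:
cr0x@server:~$ ceph config rm osd osd_max_backfills
cr0x@server:~$ ceph config rm osd osd_recovery_max_active
cr0x@server:~$ ceph config get osd osd_max_backfills   # confirm what is in effect now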
Task 18: Verify pool replication/min_size isn’t set to self-sabotage
cr0x@server:~$ ceph osd pool ls detail | sed -n '1,80p'
pool 1 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 224 pgp_num 224 autoscale_mode on
pool 2 'cephfs' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
Meaning: Size 3 / min_size 2 is typical. If min_size is 3, a single OSD down can block writes. If size is 2 in a small cluster, your failure domain tolerance is thinner than you think.
Decision: If writes are blocked because min_size is too strict for your risk appetite, adjust during a planned change, not mid-chaos—unless business impact forces it and you document it.
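Reading and changing these is one command each; the dangerous part is the judgment, not the syntax. A sketch against this article’s rbd pool:
cr0x@server:~$ ceph osd pool get rbd size
cr0x@server:~$ ceph osd pool get rbd min_size
# Only as a documented, conscious risk trade (for example, min_size was left at 3 and writes are blocked with one OSD down):
cr0x@server:~$ ceph osd pool set rbd min_size 2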
Three corporate mini-stories from the trenches
1) Incident caused by a wrong assumption: “Yellow means safe”
A mid-sized SaaS company ran Proxmox with Ceph for VM storage. They’d seen HEALTH_WARN plenty of times—usually a scrub warning, a nearfull OSD that went away after routine cleanup, or a transient OSD flap during kernel updates. The on-call habit was to glance at ceph -s, shrug, and move on unless the UI turned red.
One Friday, HEALTH_WARN reported a single OSD down and a handful of degraded PGs. The assumption was “replication will cover it.” That’s not wrong in principle; it’s wrong in timing. The cluster was already running hot: high write load, recovery throttled too conservatively from an earlier incident, and one node had slightly worse disks than the others.
Over the weekend, a second disk in the same host started throwing latency spikes. The OSD didn’t go fully down, it just became slow enough to trigger slow ops. Clients began timing out. The storage was technically available, but performance became the outage. The business saw it as “the platform is down” because VMs that can’t write are VMs that can’t live.
The postmortem had a single sentence that mattered: they treated redundancy loss as cosmetic. Degraded PGs are not a beauty mark; they’re a countdown clock that ticks faster when you’re already stressed.
What fixed it wasn’t heroics. They restored the down OSD, replaced the failing disk, and—most importantly—added a rule: degraded/undersized PGs trigger an incident regardless of color. The UI color is not your risk model.
2) Optimization that backfired: “Let’s speed up recovery”
A different org had frequent node reboots due to firmware updates. They got annoyed at long recovery times and decided to “tune Ceph.” Someone increased osd_max_backfills and osd_recovery_max_active across the cluster to make backfill run faster.
It worked—on an empty staging cluster. In production, the same change turned every reboot into a brownout. Recovery competed with client writes and reads, saturating disks and the storage network. Latency spikes triggered more slow ops. Slow ops led to timeouts. Timeouts caused application retries. Retries amplified load. The whole thing became a self-inflicted DDoS, but with more YAML.
The really painful part: the engineers interpreted the symptoms as “Ceph is unstable after reboots,” so they rebooted nodes again to “reset it,” restarting recovery each time. That’s how you turn a tuneable into an incident generator.
The fix was to tie recovery tuning to time-of-day and observed latency. They created two profiles: conservative during business hours, aggressive overnight. They also learned to look at ceph osd perf and network drops before changing anything. Recovery speed is not a free lunch; it’s a trade with your users.
3) Boring but correct practice that saved the day: “Noout during maintenance, then undo it”
A regulated enterprise (lots of process, lots of paperwork, lots of opinions) scheduled a maintenance window to replace a failing drive in a Ceph node. They did the boring dance: set noout before taking the host down, recorded the start time, and assigned one person to be responsible for removing the flag at the end.
During the window, another node experienced a brief network issue and dropped an OSD. In many environments, this is where the cluster begins reshuffling data and the maintenance turns into an all-night festival of backfill. But with noout set, Ceph didn’t immediately decide the temporarily missing OSD was gone forever, so it didn’t start the heavy data movement while they were already operating with reduced capacity.
They restored network, brought the maintenance node back, and then removed noout deliberately. The cluster healed with minimal drama. The business never noticed.
This wasn’t clever engineering. It was operational hygiene: knowing which cluster flags exist, when to use them, and—crucially—how to exit maintenance cleanly. The team’s best tool that day was a checklist.
Joke #2: Ceph maintenance without a checklist is like hot-swapping a drive with your eyes closed—technically possible, emotionally expensive.
Common mistakes: symptom → root cause → fix
1) Symptom: “HEALTH_WARN nearfull” and performance is getting weird
Root cause: One or a few OSDs are nearfull, causing Ceph to throttle/back off, and recovery may be blocked or slowed by full ratio constraints.
Fix: Identify the top %use OSDs (ceph osd df). Reduce data growth immediately (delete/move snapshots/VMs), add capacity, and correct imbalance (balancer/crush weights). Avoid “ratio hacks” unless you need a temporary emergency write window.
2) Symptom: “1 osds down” but you mark it out immediately and the cluster becomes slow
Root cause: Marking out triggered a large rebalance on already busy disks/network.
Fix: If the OSD can return quickly, bring it back up instead of marking out. If you must mark out, throttle recovery and schedule the movement when client load is low.
3) Symptom: PGs stuck “peering” for a long time
Root cause: OSD flapping, network packet loss, or a partition causing inconsistent state exchange.
Fix: Check OSD up/down history, network drops, MTU consistency, and switch logs. Stabilize network first. Restarting OSDs randomly often extends peering.
4) Symptom: Slow ops pinned to one OSD
Root cause: That OSD’s disk is slow, its host is overloaded, or it’s seeing kernel-level resets.
Fix: Use ceph osd perf to confirm, then inspect dmesg and service logs on the host. Replace the device or fix the path. Don’t “tune Ceph” to compensate for a dying disk.
5) Symptom: Slow ops across many OSDs, especially during recovery
Root cause: Recovery/backfill is saturating network or disks. Or the storage network is lossy/congested.
Fix: Check recovery progress, OSD perf, and NIC drops. Throttle recovery temporarily if client impact is unacceptable. Fix the network if you see loss.
6) Symptom: “MON_CLOCK_SKEW” appears intermittently
Root cause: Bad or unstable time synchronization, or hosts overloaded such that timekeeping is delayed.
Fix: Ensure NTP/chrony is active and synchronized. Investigate CPU steal, load, and VM pauses on mon hosts. Time warnings aren’t cosmetic; they correlate with quorum weirdness.
7) Symptom: Scrub errors or “inconsistent PG” warnings
Root cause: Data inconsistency detected during scrub, often due to past disk errors, bit rot, or interrupted recovery events.
Fix: Confirm which PGs are inconsistent, ensure cluster stability, then run targeted repair with caution. Also investigate underlying hardware for silent errors. If it’s widespread, treat hardware fleet quality as the real incident.
8) Symptom: Everything looks “up” but writes are blocked
Root cause: Pool min_size too strict for current failures, or the cluster is hitting full thresholds, or some PGs are stuck undersized.
Fix: Verify pool settings, check full ratios, restore missing OSDs. Adjust min_size only as a conscious risk trade, documented and ideally planned.
Checklists / step-by-step plan
Checklist A: First 5 minutes (no-risk, read-only)
- Run ceph -s and ceph health detail. Paste outputs into your incident log.
- Confirm quorum: ceph quorum_status. If unstable, stop and fix that first.
- Look at PG state: ceph pg stat. Identify degraded/undersized/peering counts.
- Find obvious down OSDs: ceph osd tree.
- Check whether nearfull is per-OSD: ceph osd df.
Checklist B: Next 15 minutes (pinpoint bottleneck)
- If any OSD is down, map it to a host and check systemctl status and journalctl for that OSD.
- Run ceph osd perf. Find outliers.
- If slow ops exist, identify which daemons: ceph health detail, then use ceph daemon osd.X dump_historic_ops selectively.
- Check network drops on Ceph interfaces: ip -s link. If drops increase, treat network as first-class suspect.
- Look for forgotten flags: ceph osd dump | grep flags.
Checklist C: Safe intervention ladder (change as little as possible)
- Restore what’s missing: bring down OSDs back up if hardware allows; fix host issues.
- Reduce client pain: if recovery is crushing latency, throttle recovery temporarily (and document the exact change).
- Restore redundancy: if hardware is dead, mark OSD out and begin replacement—only after confirming you have capacity headroom and failure-domain safety.
- Address capacity: prioritize nearfull mitigation before it becomes full; add space, rebalance carefully, delete data if needed.
- Repair only with evidence: scrub inconsistencies get targeted repair after stability is restored.
Checklist D: Post-incident hardening (the stuff you wish you had done before)
- Create alerting that triggers on degraded/undersized PGs, not just HEALTH_ERR.
- Baseline OSD commit/apply latency and track outliers.
- Track per-OSD utilization spread (VAR) and investigate imbalance early.
- Document recovery tuning profiles and when to use each.
- Standardize storage network MTU and validate it end-to-end during change windows.
- Make “remove maintenance flags” a required step with a named owner.
FAQ
1) Is HEALTH_WARN always urgent?
No. It’s a severity level, not a diagnosis. Some warnings are informational (like recent daemon crashes that already recovered). Others are pre-failure indicators (nearfull, degraded PGs). Treat degraded/undersized PGs and slow ops as urgent because they map to risk and user pain.
2) Should I restart Ceph services when I see HEALTH_WARN?
Not as a first move. Restarting changes cluster state and can restart recovery and peering. Restart when you have evidence: a daemon crash loop, a stuck process, or a confirmed fix that requires restart.
3) What’s the difference between “OSD down” and “OSD out”?
Down means not responding. Out means Ceph no longer places data there and will move data away. Down can be transient; out is a policy decision that triggers data movement.
4) Why does nearfull on a couple of OSDs matter when the cluster has lots of free space?
Because Ceph writes are constrained by the fullest devices. If two OSDs are nearfull, placement gets constrained, backfill gets harder, and the cluster may throttle or refuse writes even if average utilization looks fine.
5) What’s the fastest way to find the bottleneck: network, disk, or recovery?
Run ceph health detail (what Ceph complains about), ceph osd perf (who is slow), and ip -s link (is the network dropping). Those three usually tell you where to look next.
6) Can I safely change recovery settings during an incident?
Yes, if you treat it like a temporary mitigation and you know why you’re doing it. Lower recovery concurrency to protect client latency; raise it to reduce time-at-risk when client load is low and hardware is healthy. Always record the change and revert later.
7) What do “active+clean”, “degraded”, “undersized”, and “peering” really imply?
active+clean: fully replicated, serving I/O normally. degraded: one or more replicas are missing or not yet recovered. undersized: the PG currently has fewer replicas than the pool’s configured size; if it drops below min_size, writes to that PG block. peering: PG members negotiating state; prolonged peering suggests instability.
8) When do I mark an OSD out versus trying to bring it back?
Bring it back if the issue is transient (host reboot, service stopped, short network hiccup). Mark it out if the device/path is failing or you expect the OSD to be unavailable long enough that waiting increases risk. If you mark it out, plan for recovery load.
9) Is Proxmox’s GUI enough for Ceph troubleshooting?
The GUI is fine for a quick overview. For decisions that affect data movement and recovery, use Ceph CLI outputs. The CLI shows you the exact warning types, timestamps, and daemon IDs you need.
10) If I see scrub errors, should I repair immediately?
First stabilize the cluster (quorum, network, OSD stability). Then identify which PGs are inconsistent and repair those specifically. Repairing during instability can waste time and sometimes makes the situation noisier.
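For the “identify those specifically” part, a hedged sketch using this article’s rbd pool and an example PG id; run it only once quorum and the network are stable.
cr0x@server:~$ ceph health detail | grep -i inconsistent                 # which PGs are flagged
cr0x@server:~$ rados list-inconsistent-pg rbd                            # inconsistent PGs in a given pool
cr0x@server:~$ rados list-inconsistent-obj 1.2a --format=json-pretty     # what exactly disagrees (example PG id)
cr0x@server:~$ ceph pg repair 1.2a                                       # targeted repair, one PG at a time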
Conclusion: next steps you can run tomorrow
When Proxmox shows Ceph HEALTH_WARN, don’t treat it as a single problem and don’t treat it as a color. Treat it as a list. Your safest path is: capture ceph health detail, confirm quorum, identify degraded PGs and down OSDs, and then prove whether the bottleneck is a disk, a network, or a recovery storm.
Practical next steps:
- Write a one-page on-call runbook that starts with the Fast diagnosis playbook and includes the commands you actually use.
- Add alerting on degraded/undersized PGs, nearfull OSDs, and slow ops, not just HEALTH_ERR.
- Decide (in calm daylight) what your recovery tuning policy is during business hours versus off hours.
- Audit your storage network: MTU consistency, drops, and switch port errors. Ceph will faithfully amplify whatever your network is doing.
- Plan capacity headroom. Ceph nearfull warnings are early warnings; treat them like it.
Ceph is reliable when you let it be predictable. Your job is to keep the system stable enough to converge—and to avoid “fixes” that create more movement than information.