Ceph doesn’t fail politely. It fails like a distributed system: loudly, intermittently, and with enough red text to make you question your life choices. In Proxmox, an OSD going down can be a dead disk, a sick NIC, a confused service, or the kind of “it’s fine” that turns into a 3 a.m. incident.
This is a production-first guide to identifying the cause fast. Not “read everything,” not “run every command,” but the shortest path from OSD down to a correct decision: replace hardware, fix networking, or restart/repair the daemon without making it worse.
Fast diagnosis playbook (first/second/third)
When an OSD goes down, your job is not to be clever. Your job is to be fast, correct, and boring. Start with the questions that collapse uncertainty the most.
First: Is the node reachable and time sane?
- Check host reachability: Can you SSH to the node? If not, you’re not debugging Ceph; you’re debugging the node, hypervisor, or network.
- Check time: If clocks drift, Ceph behaves like you replaced physics with improv theatre. Fix NTP before chasing ghosts; a quick check is sketched below.
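A minimal sketch of that time check, assuming a stock Proxmox node (chrony may or may not be installed; timedatectl is always there):
cr0x@pve1:~$ timedatectl | egrep 'synchronized|NTP service'   # "yes" and "active" are what you want
cr0x@pve1:~$ chronyc tracking   # only if chrony is installed; look at the "System time" offset
If the node isn’t synchronized, fix that before interpreting any heartbeat or timeout message.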
Second: Is the OSD process failing, or is it running but isolated?
- Service state: If systemd says the OSD is crash-looping, you likely have disk/BlueStore/corruption/auth issues. If it’s “active (running)” but the cluster shows it down, suspect network, firewall, or wrong IP binding.
- Ceph view: Compare ceph osd tree with systemd status. Mismatches are diagnostic gold.
Third: Is the disk healthy enough to sustain journal/BlueStore IO?
- Kernel logs: One glance at dmesg and journalctl often tells you if the drive is throwing errors or timing out.
- SMART and IO timeouts: If you see resets, timeouts, or SMART reallocations, stop restarting daemons and start planning replacement.
Stop conditions (avoid making it worse)
- If multiple OSDs are down across hosts: treat as network or cluster-wide config until proven otherwise.
- If one host loses many OSDs at once: suspect HBA/backplane/PSU or that host’s networking.
- If OSD flaps: assume intermittent IO stalls or packet loss, not “Ceph is weird.” Ceph is deterministic; your hardware isn’t.
A practical mental model: what “OSD down” actually means
Ceph has two different “truths” about an OSD:
- What the host thinks: systemd starts ceph-osd@ID, the process is running, it can read its store, it binds to an IP, and it connects to monitors.
- What the cluster thinks: OSDs heartbeat their peers and report failures to the monitors. If the monitors conclude an OSD isn’t responding within tolerances, it’s marked down. Separately, an OSD can be out (excluded from data placement) even if it’s up. A quick way to inspect both states is sketched below.
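A read-only sketch for seeing both levers at once; osd.2 is the example ID used throughout this article:
cr0x@pve1:~$ ceph osd dump | grep '^osd.2 '     # shows up/down, in/out, and weights on one line
cr0x@pve1:~$ ceph osd tree down                 # only entries currently marked down
cr0x@pve1:~$ systemctl is-active ceph-osd@2     # the host's opinion, for comparison
Nothing here changes cluster state; it just lines up the host’s view against the monitors’ view.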
So “OSD down” is not “disk dead” by definition. It’s “the cluster cannot reliably hear this OSD.” That can be:
- Disk/IO: OSD process stalls or crashes because reads/writes to BlueStore block device time out.
- Network: OSD is healthy locally but can’t reach mons or peers (routing, VLAN, MTU, bonding, switch, firewall).
- Service/config: OSD won’t start, can’t authenticate (cephx), binds wrong address, or hits a version/config mismatch.
What you want is a minimal set of checks that splits these categories early. Everything else is detail.
Paraphrased idea (Gene Kim, reliability/operations author): “Improvement happens when you shorten and stabilize feedback loops.” That’s what this playbook is: short loops.
Interesting facts and context (Ceph + Proxmox reality)
- Ceph’s “OSD down” decision is made by the monitors but informed by peers: OSDs heartbeat each other and report failures upward. Your OSD can be busy, alive, and still declared down if heartbeats don’t land in time.
- “Down” and “out” are different levers: down is health detection; out is data placement. Confusing them causes self-inflicted rebalancing storms.
- BlueStore replaced FileStore because FileStore’s double-write and filesystem overhead weren’t keeping up with modern drives and latency expectations.
- Ceph’s CRUSH algorithm dates back to research focused on controlled failure domains. It’s why you can survive losing hosts or racks, assuming your topology is honest.
- Proxmox integrated Ceph management to make hyperconverged storage approachable, but it also makes it easy to click your way into trouble when you don’t understand the underlying daemon behavior.
- Network mistakes masquerade as disk failures. Packet loss can look like “slow ops” and timeouts. A bad switch port has ruined more weekends than firmware bugs.
- OSD flapping is usually not software randomness. It’s often power saving, thermal throttling, HBA issues, or marginal links—things that “mostly work” until load spikes.
- Ceph recovery can amplify problems. When an OSD goes down, backfill/recovery increases IO and network traffic, which can push marginal components over the edge.
Joke #1: Distributed storage is just single-host storage, plus networking, plus consensus, plus the chance to be wrong in three places at once.
Triage tasks: commands, outputs, decisions (12+)
These are practical tasks you can run on Proxmox nodes (or from a node with Ceph admin access). Each task includes the command, what typical output implies, and what you decide next. Run them in roughly this order unless you already know the category.
Task 1: Confirm the cluster sees an OSD down, and which one
cr0x@pve1:~$ ceph -s
cluster:
id: 7c2a0d3e-0f8a-4e5a-9a5b-5e9b0c7d3c12
health: HEALTH_WARN
1 osds down
Degraded data redundancy: 12 pgs undersized
services:
mon: 3 daemons, quorum pve1,pve2,pve3 (age 6h)
mgr: pve1(active, since 2h), standbys: pve2
osd: 18 osds: 17 up (since 4m), 18 in (since 6h)
data:
pools: 4 pools, 128 pgs
objects: 2.1M objects, 8.3 TiB
usage: 25 TiB used, 40 TiB / 65 TiB avail
pgs: 12 undersized+degraded
Meaning: This is real: the cluster is warning about at least one OSD down, and PGs are impacted.
Decision: Identify which OSD(s) and which host(s) immediately. Don’t restart random services yet.
Task 2: Map OSD ID to host and device location
cr0x@pve1:~$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 65.00000 root default
-3 21.66667 host pve1
0 hdd 3.61111 osd.0 up 1.00000 1.00000
1 hdd 3.61111 osd.1 up 1.00000 1.00000
2 hdd 3.61111 osd.2 down 1.00000 1.00000
-5 21.66667 host pve2
3 hdd 3.61111 osd.3 up 1.00000 1.00000
...
Meaning: osd.2 is down on pve1.
Decision: Move to the specific host and check the OSD daemon + disk.
Task 3: Check whether the OSD service is running (service vs network split)
cr0x@pve1:~$ systemctl status ceph-osd@2 --no-pager
● ceph-osd@2.service - Ceph object storage daemon osd.2
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Fri 2025-12-26 08:51:22 UTC; 2min 10s ago
Process: 18342 ExecStart=/usr/bin/ceph-osd -f --cluster ceph --id 2 (code=exited, status=1/FAILURE)
Main PID: 18342 (code=exited, status=1/FAILURE)
Dec 26 08:51:22 pve1 ceph-osd[18342]: bluestore(/var/lib/ceph/osd/ceph-2) _read_bdev_label failed: (5) Input/output error
Dec 26 08:51:22 pve1 systemd[1]: ceph-osd@2.service: Main process exited, code=exited, status=1/FAILURE
Dec 26 08:51:22 pve1 systemd[1]: ceph-osd@2.service: Failed with result 'exit-code'.
Meaning: Not running; and the log screams IO error while reading BlueStore label. That’s usually disk path trouble, not networking.
Decision: Stop restarting; inspect disk/HBA and kernel logs. Prepare for replacement.
Task 4: If it’s running, verify Ceph still calls it down (network suspicion)
cr0x@pve1:~$ systemctl is-active ceph-osd@2
active
Meaning: Locally active. If ceph osd tree still shows it down, it’s often network, auth, or binding.
Decision: Check connectivity to monitors and the OSD’s bound addresses.
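If the CLI works but the OSD still can’t reach the monitors, probe the monitor ports from this node. A hedged sketch: the IP below is illustrative, take real addresses from ceph mon dump; mons listen on 3300 (msgr2) and 6789 (msgr1) by default.
cr0x@pve1:~$ ceph mon dump | grep 'mon\.'       # real monitor addresses for this cluster
cr0x@pve1:~$ nc -vz 10.10.10.12 3300            # msgr2; requires netcat, substitute your mon IP
cr0x@pve1:~$ nc -vz 10.10.10.12 6789            # msgr1, still used by older daemons/clients
If the ports answer, the problem is more likely auth or binding than raw reachability.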
Task 5: Pull OSD-specific logs fast
cr0x@pve1:~$ journalctl -u ceph-osd@2 -n 80 --no-pager
Dec 26 08:49:10 pve1 ceph-osd[18011]: starting osd.2 at /var/lib/ceph/osd/ceph-2
Dec 26 08:49:11 pve1 ceph-osd[18011]: bluestore: using bdev /dev/ceph-2/block
Dec 26 08:49:15 pve1 ceph-osd[18011]: heartbeat_map is_healthy_to_peer osd.7 down
Dec 26 08:49:22 pve1 ceph-osd[18011]: slow ops, oldest one is now 31.122s
Dec 26 08:49:29 pve1 ceph-osd[18011]: monclient(hunting): authenticate timed out
Dec 26 08:49:31 pve1 ceph-osd[18011]: unable to talk to monitor
Dec 26 08:49:35 pve1 ceph-osd[18011]: dropped 3 slow requests due to timeout
Meaning: Slow ops then monitor auth timeouts. This could be disk stalls or network issues reaching mons. The “unable to talk to monitor” pushes you toward network, but don’t assume yet.
Decision: Check kernel logs for IO timeouts vs link/MTU issues. One will usually confess.
Task 6: Check kernel ring buffer for disk resets/timeouts (disk split)
cr0x@pve1:~$ dmesg -T | egrep -i 'blk_update_request|I/O error|reset|timed out|nvme|sd '
[Fri Dec 26 08:48:58 2025] sd 6:0:12:0: [sdl] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[Fri Dec 26 08:48:58 2025] blk_update_request: I/O error, dev sdl, sector 182400 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Fri Dec 26 08:48:59 2025] sd 6:0:12:0: [sdl] rejecting I/O to offline device
[Fri Dec 26 08:49:02 2025] ata12: hard resetting link
[Fri Dec 26 08:49:07 2025] ata12: link is slow to respond, please be patient
Meaning: That’s a drive or link path misbehaving (disk, cable, backplane, HBA). Ceph is only the messenger.
Decision: Treat as hardware incident: stop flapping by marking out (if needed), and schedule replacement/maintenance.
Task 7: Check network link, errors, and drops (network split)
cr0x@pve1:~$ ip -s link show dev bond0
2: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 3c:fd:fe:11:22:33 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
981G 744M 0 18422 0 11922
TX: bytes packets errors dropped carrier collsns
904G 701M 0 0 0 0
Meaning: Dropped RX packets at the bond level. On a storage network, drops are latency spikes wearing a trench coat.
Decision: Check slaves, switch ports, MTU consistency, and congestion. If drops climb during recovery, you’ve found your bottleneck.
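To see whether the bond itself is healthy, the kernel exposes per-member state under /proc/net/bonding. A minimal sketch (bond0 matches the output above; the member name eno1 is illustrative):
cr0x@pve1:~$ grep -E 'Slave Interface|MII Status|Link Failure Count|Aggregator ID' /proc/net/bonding/bond0
cr0x@pve1:~$ ethtool eno1 | egrep 'Speed|Duplex|Link detected'   # repeat for each member
A member stuck at the wrong speed, a non-matching aggregator ID, or a climbing link-failure count is your answer before you ever touch the switch.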
Task 8: Validate MTU consistency (classic Proxmox/Ceph foot-gun)
cr0x@pve1:~$ ip link show dev vmbr1 | sed -n '1p'
6: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
cr0x@pve2:~$ ip link show dev vmbr1 | sed -n '1p'
6: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
Meaning: MTU mismatch. Jumbo frames on one side and not the other is how you buy “random” timeouts.
Decision: Fix MTU end-to-end (hosts, bonds, bridges, switch ports). Then re-evaluate OSD health.
Task 9: Confirm monitor connectivity from the affected node (network vs auth)
cr0x@pve1:~$ ceph quorum_status --format json-pretty | sed -n '1,40p'
{
"election_epoch": 88,
"quorum": [
0,
1,
2
],
"quorum_names": [
"pve1",
"pve2",
"pve3"
],
"quorum_leader_name": "pve1"
}
Meaning: From this node, the CLI can talk to mons. That doesn’t guarantee the OSD can, but it reduces “total network isolation.”
Decision: If the OSD can’t authenticate but CLI can, suspect cephx keyring/permissions or OSD binding to the wrong interface.
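To separate “can’t authenticate” from “can’t connect,” compare the key the cluster expects with the key the daemon has on disk. Read-only; osd.2 is the example ID:
cr0x@pve1:~$ ceph auth get osd.2 | grep key                      # what the monitors expect
cr0x@pve1:~$ grep key /var/lib/ceph/osd/ceph-2/keyring           # what the daemon will present
If the two keys differ, the OSD will fail cephx no matter how good the network is.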
Task 10: Inspect which IPs Ceph thinks it should use (binding mistakes)
cr0x@pve1:~$ ceph config get osd public_network
10.10.10.0/24
cr0x@pve1:~$ ceph config get osd cluster_network
10.10.20.0/24
Meaning: Ceph is configured to bind to specific networks. If your host interfaces don’t match (or routes are wrong), the OSD may bind somewhere useless.
Decision: Verify host IP addressing and that both networks exist and route correctly on this node.
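A quick sanity check that this node actually has addresses and routes in both configured networks (the 10.10.10.0/24 and 10.10.20.0/24 prefixes come from the output above; the peer IP is illustrative):
cr0x@pve1:~$ ip -br addr | egrep '10\.10\.10\.|10\.10\.20\.'     # does the host hold an IP in each network?
cr0x@pve1:~$ ip route get 10.10.20.12                            # which interface would carry cluster-network traffic
If either network is missing or routes out the wrong interface, the OSD will bind somewhere useless, exactly as described.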
Task 11: Verify what address the OSD is advertising
cr0x@pve1:~$ ceph osd find 2
{
"osd": 2,
"ip": "10.10.10.11:6806/12345",
"crush_location": {
"host": "pve1",
"root": "default"
}
}
Meaning: The OSD advertises 10.10.10.11. If that IP is wrong, moved, or now on a different interface/VLAN, peers won’t reach it.
Decision: If this doesn’t match the intended network, fix addressing and restart the OSD cleanly. Don’t “out” it yet unless data risk requires it.
Task 12: Check Ceph “health detail” for immediate clues
cr0x@pve1:~$ ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 12 pgs undersized
[WRN] OSD_DOWN: 1 osds down
osd.2 is down
[WRN] PG_DEGRADED: Degraded data redundancy: 12 pgs undersized
pg 1.2c is undersized+degraded, acting [7,5]
pg 1.2f is undersized+degraded, acting [9,4]
Meaning: This is the cluster screaming “I’m missing osd.2.” It doesn’t say why, but it tells you impact and urgency.
Decision: If client IO is impacted and you need stability, consider marking the OSD out (with caution) to start recovery—after confirming it’s truly dead or unreachable for a while.
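If you need time to investigate without triggering rebalancing, the usual middle ground is the noout flag: the OSD stays down, but the cluster won’t mark it out and start backfill on its own. A minimal sketch; the flag is cluster-wide, so clear it when you’re done:
cr0x@pve1:~$ ceph osd set noout        # pause automatic out-marking while you diagnose
cr0x@pve1:~$ ceph osd unset noout      # clear it once you've decided: repair or replace
cr0x@pve1:~$ ceph osd out 2            # only when you've confirmed osd.2 is not coming back soon
Setting noout and forgetting it is its own failure mode, so put the unset step in your ticket, not your memory.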
Task 13: Verify the block device mapping for that OSD (disk path accuracy)
cr0x@pve1:~$ ceph-volume lvm list --osd-id 2
====== osd.2 =======
[block] /dev/ceph-2/block
type block
lv_path /dev/ceph-2/osd-block-3c6f8a1d-9c6e-4ad2-9e4c-5d2c6c4a9f12
vg_name ceph-2
lv_name osd-block-3c6f8a1d-9c6e-4ad2-9e4c-5d2c6c4a9f12
devices /dev/sdl
Meaning: The OSD depends on /dev/sdl. Now you can check SMART, controller, cabling, and whether /dev/sdl is still present.
Decision: If the device is missing or unstable, stop. Hardware handling time.
Task 14: Quick device presence and latency sanity
cr0x@pve1:~$ lsblk -o NAME,SIZE,MODEL,SERIAL,STATE,ROTA,TYPE,MOUNTPOINT | egrep 'sdl|ceph--2|NAME'
NAME SIZE MODEL SERIAL STATE ROTA TYPE MOUNTPOINT
sdl 3.6T ST4000NM000A ZC1ABC12 running 1 disk
ceph--2-osd--block--3c6f8a1d 3.6T 1 lvm
Meaning: Device exists. Existence is not health, but it’s better than “gone.”
Decision: If service fails with IO errors despite presence, run SMART and check link resets.
Task 15: SMART health check (don’t overinterpret, but do respect it)
cr0x@pve1:~$ smartctl -a /dev/sdl | egrep -i 'Reallocated|Pending|Offline_Uncorrectable|SMART overall|Power_On_Hours'
SMART overall-health self-assessment test result: FAILED!
9 Power_On_Hours 0x0032 091 091 000 Old_age Always - 39876
5 Reallocated_Sector_Ct 0x0033 001 001 010 Pre-fail Always - 3200
197 Current_Pending_Sector 0x0012 001 001 000 Old_age Always - 88
198 Offline_Uncorrectable 0x0010 001 001 000 Old_age Offline - 88
Meaning: The drive is effectively telling you it’s leaving. Believe it.
Decision: Replace the disk. Don’t waste time tuning Ceph for failing hardware.
Task 16: If you suspect network loss, test loss and MTU properly (not just ping)
cr0x@pve1:~$ ping -c 5 -M do -s 8972 10.10.10.12
PING 10.10.10.12 (10.10.10.12) 8972(9000) bytes of data.
From 10.10.10.11 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 10.10.10.11 icmp_seq=2 Frag needed and DF set (mtu = 1500)
--- 10.10.10.12 ping statistics ---
5 packets transmitted, 0 received, +2 errors, 100% packet loss, time 4081ms
Meaning: Path MTU is 1500 somewhere. Jumbo is not end-to-end.
Decision: Fix MTU configuration before blaming Ceph. Then restart the affected OSD if it bound correctly.
Task 17: If service is failing, inspect BlueStore and FS permissions quickly
cr0x@pve1:~$ ls -ld /var/lib/ceph/osd/ceph-2
drwxr-x--- 5 ceph ceph 4096 Dec 26 08:40 /var/lib/ceph/osd/ceph-2
Meaning: Ownership looks normal. If it’s wrong (root:root or too restrictive), the OSD might crash early.
Decision: Fix permissions only if you can explain how they changed (package scripts, manual edits, restore). Otherwise you risk masking a deeper issue.
Task 18: Check Ceph crash reports for this daemon (often underused, very telling)
cr0x@pve1:~$ ceph crash ls | head
ID ENTITY NEW
2025-12-26_08:51:22.18342 osd.2 *
2025-12-20_14:11:03.22991 osd.7
cr0x@pve1:~$ ceph crash info 2025-12-26_08:51:22.18342 | sed -n '1,60p'
{
"crash_id": "2025-12-26_08:51:22.18342",
"entity_name": "osd.2",
"utsname_hostname": "pve1",
"assert_condition": "bdev_label_valid",
"backtrace": [
"ceph-osd(...",
"libc.so.6(..."
]
}
Meaning: The crash is recorded and points to a label/BlueStore block device read problem. This supports the “disk path / IO” hypothesis.
Decision: Don’t treat it as “restart fixes it.” Plan hardware swap or BlueStore repair only if you know what you’re doing and have redundancy.
Disk vs network vs service: telltale signs
The fastest diagnosis is pattern recognition backed by one or two confirmatory commands. Here are the patterns that matter in production.
Disk/IO problem indicators
- OSD fails to start with BlueStore read errors, “Input/output error,” or “failed to read bdev label.”
- Kernel logs show resets/timeouts on the underlying device (SATA link reset, NVMe timeout, SCSI “rejecting I/O”).
- SMART shows pending/reallocated sectors or overall health FAIL.
- OSD flaps under load (recovery/backfill triggers it). Marginal drives collapse when you stop being polite and start being real.
Practical decision: If kernel logs show IO timeouts, assume hardware until disproven. Reboots and restarts are palliative care, not treatment.
Network problem indicators
- OSD process is running but Ceph marks it down; logs show “unable to talk to monitor” or heartbeat timeouts.
- Multiple OSDs down across different hosts around the same time.
- Packet drops/errors on storage-facing interfaces; switch counters climb; bonding flaps.
- MTU mismatch (jumbo frames half-enabled) causing intermittent fragmentation failures or silent drops with DF set.
- Asymmetric routing (public network works, cluster network doesn’t) leading to weird partial connectivity.
Practical decision: Fix packet loss before “tuning Ceph.” Ceph can’t out-configure physics.
Service/config problem indicators
- systemd shows exit-code failures without kernel IO errors (think: auth, config, permissions, missing keyring).
- cephx errors in the OSD log: “permission denied,” “bad key,” “authenticate timed out” (can be network too, so validate).
- OSD binds wrong address after an interface rename, bridge change, or network refactor.
- Version skew or config drift after partial upgrades (common in corporate change windows where someone “only had time” for two nodes).
Practical decision: If it’s service/config, you can usually fix it without replacing hardware—but only after you confirm the disk path is stable and the network is clean.
Deep dives: the fast paths inside each failure mode
1) Disk path failures: the “OSD is down” symptom is the last domino
Ceph OSDs are extremely good at tolerating transient slowness—up to the point they can’t. Once IO stalls exceed heartbeat tolerances, the monitors declare the OSD down. Sometimes the OSD process is still alive, blocked in IO, looking like it’s “running.” That’s why you don’t stop at systemd status.
Disk path failures come in flavors:
- Drive media failure: reallocated/pending sectors, uncorrectables, read retries, thermal issues.
- Transport failure: bad SATA/SAS cable, marginal backplane, HBA firmware bugs, expander issues.
- Power issues: a shared PSU rail or backplane power causing resets under load.
- Queueing collapse: a drive that “works” but has catastrophic latency spikes (the killer for distributed storage).
What to do when you strongly suspect disk path issues:
- Collect evidence (dmesg, SMART, OSD logs) once.
- Stop flapping. Flapping increases recovery churn, which increases load, which makes flapping worse. It’s a feedback loop with a job title.
- Decide: keep it in temporarily if it’s coming back quickly and you need to avoid rebalancing, or mark it out if it’s truly dead or unsafe.
Opinionated guidance: If you see repeated IO timeouts/resets for that device over more than a few minutes, mark the OSD out and plan replacement. If you keep it in “to see if it settles,” you’re gambling with cluster health while the house edge is 100%.
2) Network failures: Ceph is a latency detector that also stores data
Ceph is merciless about latency because it has to be. If heartbeats and messages are delayed, it must assume failure to preserve consistency. That makes Ceph a great early warning system for network problems—assuming you don’t shoot the messenger by restarting daemons until the counters reset.
Network failures that commonly present as OSD down:
- MTU mismatch between vmbr/bond, switch ports, or VLAN interfaces.
- Bonding misconfiguration (LACP on host, static on switch, or vice versa), causing intermittent hashing blackholes.
- VLAN tagging inconsistencies after switch changes.
- Firewall rules blocking Ceph messenger ports (6800–7300 typical range) or monitor ports.
- Congestion and microbursts during recovery/backfill.
Fast way to separate “network” from “disk stall” when logs show timeouts: check kernel logs. Disk stalls often show device messages. Network problems often show dropped packets, link flaps, bonding messages, or nothing at all while Ceph complains. In that “nothing at all” case, look at interface counters and switch logs.
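When Ceph complains and the kernel stays silent, rule out the firewall explicitly before blaming the switch. A hedged sketch, assuming the Proxmox firewall and an nftables backend (adjust for iptables if that’s what the node runs):
cr0x@pve1:~$ pve-firewall status                          # is the Proxmox firewall active on this node?
cr0x@pve1:~$ ss -tlnp | grep ceph-osd                     # which ports the OSD daemons actually bound
cr0x@pve1:~$ nft list ruleset | grep -i drop | head       # eyeball drop rules that could hit 3300/6789 or 6800–7300
Five minutes here is cheaper than an afternoon of switch-port archaeology.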
3) Service/config failures: the subtle ones
Service issues are where humans cause the most damage because they’re confident. The disk exists, the network pings, so the OSD “should” start. But Ceph is not a single binary; it’s a set of identities, keyrings, configs, and device mappings. Break one, and you get an OSD that politely refuses.
Common service/config triggers:
- Keyring missing or wrong permissions under /var/lib/ceph/osd/ceph-ID/keyring.
- OSD directory permissions changed during manual repair or restore.
- Stale device symlink if udev names changed and the OSD points to a dead path.
- Partial upgrades causing messenger/protocol mismatches or config assumptions.
Service fixes should be reversible. If you find yourself “trying things,” stop and take a snapshot of facts: logs, config, device mapping. Then change one thing, validate, and move on.
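A minimal evidence-capture sketch in that spirit: one pass, one timestamped file, nothing modified. The script name and output path are illustrative.
#!/bin/bash
# collect-osd-evidence.sh: read-only snapshot before you start changing things (sketch; adjust to taste)
set -u
OSD_ID="${1:?usage: collect-osd-evidence.sh <osd-id>}"
OUT="/root/osd-${OSD_ID}-evidence-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "### ceph -s";               ceph -s
  echo "### ceph health detail";    ceph health detail
  echo "### ceph osd tree";         ceph osd tree
  echo "### systemd status";        systemctl status "ceph-osd@${OSD_ID}" --no-pager
  echo "### osd journal (200)";     journalctl -u "ceph-osd@${OSD_ID}" -n 200 --no-pager
  echo "### dmesg (tail)";          dmesg -T | tail -n 200
} > "$OUT" 2>&1
echo "evidence written to $OUT"
Run it once before the first change; if the symptom shifts later, you have a baseline to diff against.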
Joke #2: An OSD that “just needs a restart” is like a printer that “just needs a restart”—it’s rarely the whole story, it’s just the part you can do without tools.
Three corporate-world mini-stories (what went wrong, what worked)
Mini-story #1: The incident caused by a wrong assumption
They had a small Proxmox cluster running Ceph for VM disks. One OSD went down after a routine maintenance reboot. The on-call looked at ceph -s, saw “1 osd down,” and assumed the disk died. Reasonable… if you like being wrong efficiently.
The team marked the OSD out immediately to “start recovery.” That kicked off backfill across the cluster. Network utilization climbed. Latency climbed with it. Then two more OSDs started flapping on other nodes. Now the dashboard was a Christmas tree and client IO was intermittently stalling.
The root cause wasn’t disks. The reboot had reordered interface names on that host after a kernel update, and the Ceph public network setting still pointed to the old bridge. The OSD started, bound to the wrong address, and became effectively unreachable to peers. The “disk failure” was a story the humans told themselves because it fit the symptom.
Once they corrected the interface binding and restarted the OSD, it came back instantly. But the damage was done: the cluster was already rebalancing hard. They had turned a single OSD identity problem into a multi-OSD performance incident.
What changed their behavior afterward was a rule: never mark an OSD out until you’ve validated disk path errors or confirmed sustained network isolation. It slowed them down by five minutes and saved them hours later.
Mini-story #2: The optimization that backfired
A different team wanted “more performance” and “less overhead.” They enabled jumbo frames on the storage network. Good instinct. They updated the Proxmox bridges and bonds to MTU 9000, updated a pair of ToR switch ports, tested with a basic ping, and declared victory.
Weeks later, during a heavy recovery event (a planned OSD replacement), OSDs started dropping in and out. Slow ops warnings piled up. The on-call saw monitor auth timeouts and suspected cephx issues. They rotated keys. They restarted daemons. They even restarted a monitor. Everything got worse.
The culprit was boring: an intermediate switch path still had MTU 1500 on one VLAN trunk. Most traffic “worked” because it fell back to smaller packets, but specific Ceph messenger patterns under load started hitting fragmentation constraints. The original ping test didn’t use DF + large payload. The optimization created a latent failure mode that only triggered when the system was stressed—exactly when you least want surprises.
After they fixed MTU end-to-end and standardized their validation (DF pings, interface counters, switch port checks), the flapping stopped. Performance improved too, but the real win was predictability.
The lesson: optimizations that rely on correctness across multiple layers should be treated like config migrations, not tweaks. Validate like a skeptic, not an optimist.
Mini-story #3: The boring but correct practice that saved the day
One org ran Ceph on Proxmox with a strict change procedure that everyone mocked until it mattered. Each OSD disk had a recorded mapping: chassis bay → serial number → Linux device path → OSD ID. They also kept a small store of identical spare drives and had a tested replacement runbook.
When an OSD went down at 2 p.m. on a Tuesday (the nicest kind of incident), the on-call did not guess. They ran ceph osd tree, mapped the OSD ID to the drive serial via their inventory, then confirmed with ceph-volume lvm list and smartctl. SMART showed pending sectors and kernel logs showed resets. No drama.
They marked the OSD out after confirming the failure was real, set recovery flags to avoid saturating client IO, and replaced the drive in the correct bay on the first try. The replacement OSD came up, backfilled, and the cluster returned to clean health without a single “where is that disk?” meeting.
No heroics. No mystical Ceph commands. Just accurate inventory and a runbook that had been used before it was needed.
The lesson: boring practices scale better than clever people. Also, they sleep more.
Common mistakes (symptom → root cause → fix)
1) “OSD is down, so the disk is dead”
Symptom: ceph -s reports 1 OSD down; admins immediately plan disk replacement.
Root cause: OSD is running but unreachable due to MTU mismatch, VLAN issue, or wrong bind address.
Fix: Compare systemctl is-active ceph-osd@ID with ceph osd tree. If active locally but down in cluster, validate MTU with DF ping, check interface drops, and confirm OSD advertised IP via ceph osd find.
2) “Restarting the OSD will fix it” (flapping amplifier)
Symptom: OSD alternates up/down every few minutes; restarting seems to help briefly.
Root cause: IO stalls or link resets under load; restarting only resets timers and hides the underlying problem until load returns.
Fix: Check dmesg for resets/timeouts; check SMART. If hardware signals exist, stop restarting and schedule replacement. If network signals exist (drops/errors), fix networking first.
3) Marking OSD out too early
Symptom: One OSD down; operator marks it out immediately; cluster performance tanks.
Root cause: Rebalancing/backfill starts while the underlying issue was transient or fixable (e.g., service misbind). Recovery traffic overloads the cluster, causing more timeouts.
Fix: Wait long enough to confirm persistence (minutes, not seconds), validate category, and throttle recovery if client IO matters. Only mark out when you’re confident it won’t come back quickly.
4) Jumbo frames half-enabled
Symptom: Random OSD down events during recovery; pings work; TCP sessions reset under load.
Root cause: MTU mismatch across bridges, bonds, VLANs, or switch trunks.
Fix: Standardize MTU end-to-end. Validate with ping -M do -s 8972 between all Ceph nodes on the relevant network(s). Check switch configuration too.
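A small loop in that spirit, assuming MTU 9000 on the storage network (8972 bytes of payload plus 28 bytes of ICMP/IP headers = 9000); the peer list is illustrative:
#!/bin/bash
# mtu-check.sh: DF-ping every storage peer with a jumbo-sized payload (sketch; edit PEERS for your cluster)
PEERS="10.10.10.11 10.10.10.12 10.10.10.13"
for ip in $PEERS; do
  # -M do forbids fragmentation, so anything below 9000 on the path makes this fail
  if ping -c 3 -W 2 -M do -s 8972 "$ip" > /dev/null 2>&1; then
    echo "OK   $ip carries 9000-byte frames"
  else
    echo "FAIL $ip (path MTU below 9000, or loss)"
  fi
done
Run it from every node, not just one; asymmetric configuration is exactly the kind of thing this catches.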
5) Ignoring a “mostly fine” disk that has latency spikes
Symptom: No SMART FAIL, but frequent slow ops and occasional OSD down.
Root cause: Drive/HBA/backplane causing intermittent long tail latency without outright failure. Ceph punishes tail latency.
Fix: Use kernel logs and OSD logs timing. Correlate down events with IO stalls. Consider replacing the suspect component even without a dramatic SMART summary.
6) Firewall or ACL changes blocking Ceph ports
Symptom: OSD starts but cannot connect to mons; logs show timeouts; issue begins after “security hardening.”
Root cause: Blocked monitor/OSD messenger ports or asymmetric rules across nodes.
Fix: Verify reachability on required ports from OSD node to monitors and peers. Roll back or correct rules. Consistency matters more than cleverness.
Checklists / step-by-step plan
Checklist A: Five-minute “what is it?” triage
- Run ceph -s and ceph osd tree. Identify OSD IDs and hosts.
- On the host: systemctl status ceph-osd@ID and journalctl -u ceph-osd@ID -n 80.
- On the host: dmesg -T | egrep -i 'I/O error|timed out|reset|rejecting I/O'.
- On the host: ip -s link on Ceph-facing interfaces; look for drops/errors.
- Validate MTU with DF ping between storage IPs if jumbo is enabled.
Checklist B: If it looks like disk/IO
- Confirm mapping: ceph-volume lvm list --osd-id ID.
- Check SMART: smartctl -a /dev/DEVICE.
- Check kernel errors and resets: dmesg -T.
- If confirmed failing: plan replacement. If it’s flapping badly, mark out to stabilize placement.
- Throttle recovery if client IO matters (coordinate with your team, don’t freestyle during peak hours); a throttling sketch follows this list.
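For that throttling step, a hedged sketch using two well-known OSD options. Defaults vary by release, and newer clusters using the mClock scheduler may ignore these unless you also adjust its profile, so record the current values first:
cr0x@pve1:~$ ceph config get osd osd_max_backfills            # note the current value before touching it
cr0x@pve1:~$ ceph config set osd osd_max_backfills 1          # fewer concurrent backfills per OSD
cr0x@pve1:~$ ceph config set osd osd_recovery_max_active 1    # fewer active recovery ops per OSD
When recovery completes, set both back to the values you recorded.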
Checklist C: If it looks like network
- Check interface counters (drops/errors) and link state.
- Validate MTU end-to-end with DF pings.
- Check bond state and LACP consistency; verify slaves aren’t flapping.
- Confirm OSD advertised IP: ceph osd find ID.
- Fix network first, then restart OSD once, not ten times.
Checklist D: If it looks like service/config
- Check systemd and logs for cephx/config errors.
- Validate ownership and keyring presence in the OSD directory.
- Confirm Ceph network config values match real interfaces and routes.
- Check for recent upgrades and version skew across nodes.
- Apply one change at a time, document what you touched, and roll back if the symptom changes but doesn’t improve.
FAQ
1) What’s the difference between an OSD being “down” and “out”?
Down is health detection: the monitors have concluded the OSD isn’t responding, based on peer heartbeat failure reports and missing OSD beacons. Out changes CRUSH placement so the cluster stops trying to store data there and begins recovery elsewhere.
2) The OSD service is “active (running)” but Ceph says it’s down. How?
The process can be alive but unreachable: wrong bind IP, network isolation, firewall rules, MTU mismatch, or severe packet loss. Validate advertised IP (ceph osd find) and interface drops.
3) Should I restart the OSD service when it goes down?
Only after you’ve checked journalctl and dmesg. If kernel logs show IO timeouts/resets, restarting just makes the outage noisier. Fix hardware first.
4) When should I mark an OSD out?
When you have evidence it won’t return quickly (hardware failure, persistent network isolation) and keeping it in is causing instability. Marking out triggers recovery load, so do it deliberately.
5) Can a single bad disk take down multiple OSDs?
Yes, if multiple OSDs share an HBA, expander, backplane, or if a single host has systemic IO issues. Also yes if your “single disk” is actually a shared failure domain you didn’t model.
6) Why do OSDs go down during recovery/backfill more often?
Recovery increases IO depth and network traffic. Marginal disks, HBAs, cables, or switch ports that “work fine” at idle can collapse under real load.
7) Is SMART enough to declare a disk healthy or dead?
SMART is useful, not omniscient. A SMART FAIL is decisive. A SMART PASS is not a guarantee, especially for latency spikes and controller path issues.
8) How do I tell MTU mismatch from general packet loss?
MTU mismatch shows up when you use DF pings with large payloads and get “Frag needed and DF set.” General packet loss shows rising drops/errors and inconsistent ping loss even for small packets.
9) We have separate public and cluster networks. Which one matters for OSD down?
Both can matter. OSDs talk to monitors on the public network, and often use the cluster network for replication/backfill. A failure on either can trigger down behavior depending on config and traffic patterns.
10) What’s the safest way to collect evidence before changing anything?
Grab: ceph -s, ceph health detail, ceph osd tree, systemctl status ceph-osd@ID, journalctl -u ceph-osd@ID, and dmesg -T. That set usually pins the category.
Next steps you should actually do
If you want fewer “OSD down” surprises, do the unglamorous work:
- Standardize your fast triage: make the playbook above your default muscle memory. Consistency beats brilliance.
- Harden your network assumptions: validate MTU end-to-end, monitor drops, and treat storage VLANs like production-critical systems (because they are).
- Inventory your disks like you mean it: map OSD ID to serial and bay. The time to learn which drive is which is not during an outage.
- Reduce flapping incentives: stop blind restarts. Collect evidence, decide disk vs network vs service, then act once.
- Practice OSD replacement when nothing is on fire. The runbook you’ve used calmly is the runbook you’ll use correctly under pressure.
The point isn’t to become a Ceph wizard. The point is to stop losing time to misclassification. Disk, network, or service: pick the right one fast, and the rest becomes straightforward engineering instead of interpretive dance.