Proxmox Ceph PG Stuck/Inactive: What to Do Before Data Risk Escalates

You notice VMs freezing. Backups stall. Latency spikes. Then Proxmox lights up with Ceph warnings: PGs stuck, PGs inactive, maybe “peering”
for longer than your patience. This is the part where teams either do calm, reversible work—or they start “fixing” things until the cluster becomes
a science experiment.

A stuck or inactive PG is not a cosmetic issue. It’s Ceph telling you it cannot safely serve or confirm data placement for some objects. Your job is to
find which constraint is blocking progress—disk, network, quorum, OSD state, pool rules, or plain old full devices—then apply the smallest change that
gets the PGs back to active+clean without gambling with data.

The mental model: what “PG stuck/inactive” actually means

In Ceph, objects live in Placement Groups (PGs). A PG is not data by itself; it’s the unit of placement and consistency. Each PG maps to a set of OSDs
using CRUSH rules. Ceph cares about PG state because it’s how the cluster decides: “Can I serve reads/writes safely?” and “Do I know who has the latest
version of each object?”

Inactive means the PG cannot currently serve I/O because it has not completed peering into a consistent acting set (the OSDs currently
responsible for it). It might be missing OSDs, lacking quorum knowledge, or stuck waiting for history. Stuck is a time-based alarm:
the PG has been in some non-ideal state (peering, activating, backfilling, undersized, degraded) longer than expected.

The dangerous trap is treating “inactive” as a single bug. It’s not. It’s a symptom class. Sometimes the fix is boring (bring an OSD back, replace a disk,
restore network). Sometimes it’s surgical (adjust recovery limits). Sometimes it’s decision-heavy (reduce pool size temporarily, mark an OSD out, or as a last
resort, declare data lost).

Your safe operating principle: don’t change the cluster’s understanding of data until you’ve confirmed why it can’t agree. If the PG can’t
peer, it’s usually because some participant is missing, inconsistent, or too slow to answer. Make it answer, or remove it cleanly—and only with a clear
impact assessment.

Joke #1 (short and relevant): Ceph recovery is like an airport baggage system—everything’s “eventually consistent” until your suitcase is the one missing.

Key PG states you’ll see in Proxmox Ceph clusters

  • active+clean: the goal state.
  • active+degraded: I/O works, but some replicas are missing; recovery will try to fix.
  • active+undersized: acting set has fewer OSDs than pool size wants; risk is up.
  • peering / activating: agreement in progress; if stuck, something’s blocking it.
  • backfill_wait, backfilling, recovering: data movement; can be slow or blocked by throttles/full disks.
  • stale: monitor hasn’t heard from OSD(s) responsible; often network or host down.
  • incomplete: Ceph can’t find authoritative data for the PG; this is where you slow down and think.

What “before data risk escalates” means in practice

Data risk escalates when you cross any of these thresholds:

  • Multiple OSDs down in the same failure domain (same host, same rack, same power feed) for pools with size 3.
  • PGs become incomplete or inactive for a pool that serves VM disks and you keep writing anyway (or keep forcing restarts).
  • OSD disks are near-full and you continue backfill, causing write amplification and potential BlueStore allocation failures.
  • You start using destructive commands (ceph pg repair, ceph-objectstore-tool, ceph osd lost) without evidence.
  • Monitors are flapping quorum; cluster maps churn; PGs can’t settle.

Interesting facts and context (Ceph and PGs)

  • PGs exist to scale metadata: Ceph avoids per-object metadata decisions by grouping objects into PGs; that’s why PG count matters for performance and recovery.
  • “Peering” is Ceph’s consensus-lite per-PG: it’s not Paxos per object; it’s a PG-level history exchange to decide the authoritative log.
  • CRUSH (2006-era research roots): Ceph’s placement is based on a deterministic algorithm designed to avoid central lookup tables.
  • Ceph’s “nearfull/backfillfull/full” flags are guardrails: they’re there because running out of space mid-recovery can strand PGs in ugly states.
  • Backfill is not “just copying”: it competes with client I/O, amplifies writes, and can expose latent disk/SSD firmware issues.
  • BlueStore changed the failure profile: compared to FileStore, BlueStore reduced double-write penalties but made device health (DB/WAL placement, latency) more visible.
  • Mon quorum is a hard dependency for maps: even if OSDs are “up,” map instability can keep PGs bouncing through peering.
  • PG autoscaler exists because humans are bad at PG math: manual PG tuning caused years of avoidable incidents, especially after adding OSDs.
  • Scrub scheduling became an ops discipline: scrubs catch bitrot-ish issues, but aggressive scrub settings can ruin recovery windows.

Fast diagnosis playbook (first/second/third/fourth checks)

When PGs are stuck/inactive, you don’t need more dashboards. You need a clean decision tree.
The goal is to identify the bottleneck that prevents peering/activation, and whether you’re facing availability pain or integrity risk.

First: confirm the blast radius and the exact PG states (2 minutes)

  • How many PGs, which pools, and which states?
  • Is it inactive, stale, incomplete, or just slow recovery?
  • Are monitors in quorum and stable?

Second: find the missing participants (5 minutes)

  • Which OSDs are down/out?
  • Are they down because the host is dead, the disk is dead, or the daemon is wedged?
  • Is the network partitioned (OSD heartbeats failing)?

Third: check capacity and throttles (5–10 minutes)

  • Are any OSDs nearfull/backfillfull/full?
  • Are recovery/backfill settings too strict (cluster crawling) or too loose (cluster drowning)?
  • Are there slow ops indicating a specific device or node?

Fourth: decide the recovery strategy (then act)

  • If OSD is recoverable: bring it back, let peering complete, keep changes minimal.
  • If OSD disk is failing: mark out, replace, and rebuild; don’t keep rebooting it into further corruption.
  • If PG is incomplete: stop improvising; identify the last authoritative OSDs, check logs, and plan carefully. “Force” is not a strategy.

Joke #2 (short and relevant): The only thing more permanent than a temporary Ceph tweak is the ticket asking why it’s still set six months later.

Practical tasks with commands, outputs, and decisions (12+)

The commands below assume you’re using Proxmox’s Ceph integration (so ceph CLI is available on a node with admin keyring),
and systemd-managed OSDs/mon/mgr. Adjust hostnames/IDs to your reality.

Task 1: Get the real health message (don’t guess)

cr0x@server:~$ ceph -s
  cluster:
    id:     6c2a6d0c-3a7f-4c50-9c90-2d14c5d1f9aa
    health: HEALTH_WARN
            12 pgs inactive
            4 pgs peering
            1 osds down
            37 slow ops, oldest one blocked for 412 sec

  services:
    mon: 3 daemons, quorum pve1,pve2,pve3 (age 17m)
    mgr: pve1(active), standbys: pve2
    osd: 24 osds: 23 up (since 3m), 24 in (since 2h)

  data:
    pools:   4 pools, 512 pgs
    objects: 3.1M objects, 12 TiB
    usage:   36 TiB used, 48 TiB / 84 TiB avail
    pgs:     496 active+clean
             12 inactive
             4 peering

What it means: health already tells you whether this is “just recovery” or “we cannot serve I/O safely.” Inactive PGs are availability-impacting.
The “osds down” count hints at a missing participant; “slow ops” hints at a performance bottleneck.

Decision: If any PGs are inactive or incomplete, prioritize restoring peering over tuning performance. Tuning can wait; correctness can’t.

Task 2: Identify which PGs are stuck and why Ceph thinks so

cr0x@server:~$ ceph health detail
HEALTH_WARN 12 pgs inactive; 4 pgs peering; 1 osds down; 37 slow ops
[WRN] PG_AVAILABILITY: 12 pgs inactive
    pg 1.2f is stuck inactive for 611.243 seconds, current state inactive, last acting [3,7,12]
    pg 2.9a is stuck inactive for 603.991 seconds, current state inactive, last acting [5,9,21]
[WRN] PG_DEGRADED: 4 pgs peering
    pg 1.31 is stuck peering for 503.102 seconds, current state peering, last acting [7,12,18]
[WRN] OSD_DOWN: 1 osds down
    osd.12 is down
[WRN] SLOW_OPS: 37 slow ops, oldest one blocked for 412 sec

What it means: You now have specific PG IDs and their last acting set. That acting set is your breadcrumb trail.
If multiple stuck PGs share an OSD ID, that OSD is usually the villain (or a victim).

Decision: Pivot to the shared OSD(s). Don’t scattershot restart everything.

Task 3: Dump PG state for one problem PG (see who it’s waiting on)

cr0x@server:~$ ceph pg 1.2f query
{
  "state": "inactive",
  "acting": [3,7,12],
  "up": [3,7,12],
  "blocked_by": ["osd.12"],
  "peering_blocked_by": ["osd.12"],
  "info": {
    "last_update": "145'9023",
    "last_complete": "145'9001"
  }
}

What it means: This is gold. It tells you the PG can’t activate because osd.12 isn’t participating.
Sometimes blocked_by isn’t present; then you look at peering logs and OSD perf.

Decision: Focus on restoring osd.12 or cleanly removing it (out) if the disk is toast.

Task 4: Confirm OSD topology and state (up/in/out and host mapping)

cr0x@server:~$ ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME      STATUS  REWEIGHT  PRI-AFF
-1         84.00000 root default
-3         28.00000     host pve1
 0   ssd    3.50000         osd.0      up   1.00000  1.00000
 1   ssd    3.50000         osd.1      up   1.00000  1.00000
 2   ssd    3.50000         osd.2      up   1.00000  1.00000
-5         28.00000     host pve2
 3   ssd    3.50000         osd.3      up   1.00000  1.00000
 7   ssd    3.50000         osd.7      up   1.00000  1.00000
12   ssd    3.50000         osd.12    down  1.00000  1.00000
-7         28.00000     host pve3
 5   ssd    3.50000         osd.5      up   1.00000  1.00000
 9   ssd    3.50000         osd.9      up   1.00000  1.00000
21   ssd    3.50000         osd.21     up   1.00000  1.00000

What it means: You can see where osd.12 lives (host pve2) and whether it’s merely down or also out.
In this output it’s down but still in, which will stall PGs that need it for peering.

Decision: Go to pve2, investigate why the OSD is down, and decide “restart” vs “replace.”

Task 5: Check the OSD service status and recent logs on the host

cr0x@server:~$ ssh pve2 -- systemctl status ceph-osd@12 --no-pager
● ceph-osd@12.service - Ceph object storage daemon osd.12
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled)
     Active: failed (Result: exit-code) since Fri 2025-12-26 08:14:02 UTC; 2min 11s ago
    Process: 11472 ExecStart=/usr/bin/ceph-osd -f --cluster ceph --id 12 (code=exited, status=1/FAILURE)
   Main PID: 11472 (code=exited, status=1/FAILURE)

Dec 26 08:14:02 pve2 ceph-osd[11472]: bluestore(/var/lib/ceph/osd/ceph-12) _read_bdev_label failed: (5) Input/output error
Dec 26 08:14:02 pve2 ceph-osd[11472]: OSD:init: unable to mount object store
Dec 26 08:14:02 pve2 systemd[1]: ceph-osd@12.service: Failed with result 'exit-code'.

What it means: This isn’t “Ceph being Ceph.” It’s a block device I/O error. BlueStore can’t read its label.
That’s usually a dying disk, a broken controller path, or a miswired replacement device.

Decision: Do not loop on restarts. Move to device health checks and replacement workflow.

Task 6: Verify the device layer (SMART, dmesg). Confirm it’s real hardware pain.

cr0x@server:~$ ssh pve2 -- sudo dmesg -T | tail -n 12
[Fri Dec 26 08:13:51 2025] blk_update_request: I/O error, dev sdd, sector 129024 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Fri Dec 26 08:13:51 2025] Buffer I/O error on dev sdd1, logical block 16128, async page read
[Fri Dec 26 08:13:52 2025] sd 6:0:9:0: [sdd] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Fri Dec 26 08:13:52 2025] sd 6:0:9:0: [sdd] Sense Key : Medium Error [current]
[Fri Dec 26 08:13:52 2025] sd 6:0:9:0: [sdd] Add. Sense: Unrecovered read error

What it means: Kernel confirms unrecovered read errors. That’s not a tuning problem.

Decision: Treat osd.12 as failed hardware. Mark it out and plan replacement.
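
If you want independent confirmation before writing the disk off, SMART usually tells the same story as dmesg. A minimal sketch, assuming the OSD’s data device is /dev/sdd as the kernel log suggested; attribute names vary by vendor, NVMe output looks different, and smartmontools must be installed:

cr0x@server:~$ ssh pve2 -- sudo smartctl -H /dev/sdd                                          # overall health verdict
cr0x@server:~$ ssh pve2 -- sudo smartctl -A /dev/sdd | egrep -i 'realloc|pending|uncorrect'   # reallocated/pending/uncorrectable sectors
cr0x@server:~$ ssh pve2 -- sudo smartctl -l error /dev/sdd                                    # the drive's own error log

If SMART and dmesg agree, stop debating and plan the replacement.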

Task 7: Confirm monitors are in quorum and not flapping (PGs can’t settle without maps)

cr0x@server:~$ ceph quorum_status
{
  "quorum": [0,1,2],
  "quorum_names": ["pve1","pve2","pve3"],
  "quorum_age": 1052,
  "election_epoch": 94
}

What it means: Quorum is present and stable (quorum_age ~17 minutes). Good. If quorum were flapping, you’d fix that first.

Decision: Proceed with OSD remediation; don’t chase phantom PG behavior caused by mon churn.

Task 8: Check for nearfull/backfillfull/full (capacity can block recovery and keep PGs stuck)

cr0x@server:~$ ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
ssd    84 TiB   48 TiB   36 TiB    36 TiB       42.86
TOTAL  84 TiB   48 TiB   36 TiB    36 TiB       42.86

--- POOLS ---
POOL         ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
rbd          1   256  9.8 TiB  2.1M     29 TiB   41.18  14 TiB
cephfs_data  2   128  2.0 TiB  0.7M     6.0 TiB  12.50  14 TiB

What it means: Raw usage is fine. If you saw a pool with tiny MAX AVAIL, or raw at 85%+, you’d expect blocked backfill or “nearfull” warnings.

Decision: Capacity isn’t the blocker; proceed with OSD replacement/out.
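
Cluster-wide totals can hide a single nearly-full OSD, and the full/backfillfull checks apply per OSD, not per cluster. A quick per-OSD sanity check, worth running even when ceph df looks comfortable:

cr0x@server:~$ ceph osd df tree                     # per-OSD utilization, variance, and PG counts, grouped by host
cr0x@server:~$ ceph health detail | grep -i full    # any nearfull/backfillfull/full warnings hiding below the fold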

Task 9: Before marking out, see if the cluster is already degraded/undersized and how much

cr0x@server:~$ ceph osd stat
24 osds: 23 up, 24 in; epoch: e4123

What it means: OSD is down but still “in,” so Ceph keeps trying to use it. That can stall peering depending on the PGs and their history.

Decision: If the disk is dead and won’t return quickly, mark it out so Ceph can remap and recover elsewhere.

Task 10: Mark the failed OSD out (controlled remap) and watch for recovery

cr0x@server:~$ ceph osd out 12
marked out osd.12.

What it means: Ceph will start remapping PGs that had osd.12 in their acting set. This triggers backfill/recovery.

Decision: If you are capacity-tight or already degraded elsewhere, consider pausing client-heavy workloads or temporarily adjusting recovery throttles (carefully).
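
If you do touch throttles, change one knob, a little, and write it down. A conservative sketch using the classic options; on recent releases the mClock scheduler manages recovery limits itself, so check which scheduler is active before assuming these apply:

cr0x@server:~$ ceph config get osd osd_op_queue                 # 'mclock_scheduler' means the knobs below are largely ignored
cr0x@server:~$ ceph config set osd osd_max_backfills 1          # one backfill per OSD at a time
cr0x@server:~$ ceph config set osd osd_recovery_max_active 1    # one active recovery op per OSD
cr0x@server:~$ ceph config rm osd osd_max_backfills             # revert overrides when done (repeat for osd_recovery_max_active)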

Task 11: Check recovery progress and whether PGs leave inactive/peering

cr0x@server:~$ ceph -s
  cluster:
    health: HEALTH_WARN
            0 pgs inactive
            18 pgs degraded
            52 pgs recovering
            19 slow ops, oldest one blocked for 98 sec
  data:
    pgs:     442 active+clean
             52 active+recovering
             18 active+degraded

What it means: The immediate availability problem (inactive PGs) is gone. Now you’re in recovery mode. That’s better: I/O works, risk is reduced, cluster is healing.

Decision: Keep recovery stable. Don’t restart OSDs “to speed it up.” If slow ops persist, identify the slow device or node.
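
Watching recovery does not require anything fancy. A sketch; ceph progress depends on the mgr progress module and is present on reasonably recent releases:

cr0x@server:~$ watch -n 10 ceph -s         # the success signal is PG state counts trending toward active+clean
cr0x@server:~$ ceph progress               # recovery/rebalance progress events, where available
cr0x@server:~$ ceph -w                     # live cluster event stream, if you prefer scrolling logs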

Task 12: Identify slow ops offenders (often one OSD or one host)

cr0x@server:~$ ceph health detail | grep -A3 -i slow
[WRN] SLOW_OPS: 19 slow ops, oldest one blocked for 98 sec
    osd.7 has slow ops
    osd.21 has slow ops

What it means: Now you have candidate OSDs to inspect. Slow ops during recovery are common, but sustained slow ops can keep peering/recovery slow or stuck.

Decision: Check OSD perf stats and host-level latency; don’t just crank recovery settings blindly.

Task 13: Check OSD performance counters (commit/apply latency)

cr0x@server:~$ ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
0    7                   11
3    9                   14
7    120                 220
21   85                  160

What it means: OSDs 7 and 21 are slow relative to others. This can be disk contention, saturated NIC, bad SSD, or a DB/WAL device issue.

Decision: Investigate those hosts before tuning. If one OSD is pathological, recovery will queue behind it and PGs can appear “stuck.”

Task 14: Check network health quickly (Ceph is a distributed storage system, not a local disk cosplay)

cr0x@server:~$ ceph config get osd public_network
10.10.0.0/24
cr0x@server:~$ ssh pve2 -- ping -c 3 10.10.0.11
PING 10.10.0.11 (10.10.0.11) 56(84) bytes of data.
64 bytes from 10.10.0.11: icmp_seq=1 ttl=64 time=0.411 ms
64 bytes from 10.10.0.11: icmp_seq=2 ttl=64 time=0.399 ms
64 bytes from 10.10.0.11: icmp_seq=3 ttl=64 time=0.405 ms

--- 10.10.0.11 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2033ms
rtt min/avg/max/mdev = 0.399/0.405/0.411/0.005 ms

What it means: Latency is low and stable on a basic ping. This doesn’t prove the network is perfect, but it eliminates obvious partitions.

Decision: If ping shows loss/latency spikes, stop. Fix the network first, or you’ll “fix” Ceph into a deeper hole.
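
A basic ping also does not exercise jumbo frames or a separate cluster network. A slightly deeper sketch, assuming the 10.10.0.0/24 public network from the output above; the 8972-byte probe only makes sense if you actually run MTU 9000:

cr0x@server:~$ ceph config get osd cluster_network               # empty means replication shares the public network
cr0x@server:~$ ssh pve2 -- ping -M do -s 8972 -c 3 10.10.0.11    # don't-fragment probe; fails immediately on MTU mismatch
cr0x@server:~$ ssh pve2 -- ip -s link show                       # rx/tx errors and drops on the Ceph-facing NICs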

Task 15: Inspect the stuck PG list in bulk (pattern matching saves time)

cr0x@server:~$ ceph pg dump_stuck inactive
PG_STAT  STATE     UP      UP_PRIMARY  ACTING  ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP  LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP

What it means: If this output is empty, you cleared inactive PGs. If not, the list can show whether the same OSD IDs keep recurring.

Decision: Shared acting sets imply a specific node/OSD issue; scattered patterns imply network/quorum/rules/capacity.

Task 16: Check if any pools are misconfigured (size/min_size mismatch can keep PGs unhappy)

cr0x@server:~$ ceph osd pool ls detail
pool 1 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode on
pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on

What it means: For replicated pools, size is number of replicas; min_size is the minimum required to serve writes.
If min_size is too high relative to current OSD availability, you can end up with blocked writes or unhealthy PGs during outages.

Decision: Don’t casually lower min_size. It can keep the cluster writable, but it increases risk. Use it only with explicit acceptance of potential data loss on further failures.
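
If min_size ever has to move, make it explicit, scope it to one pool, and put the revert on a timer. A hedged sketch; the pool name rbd comes from the listing above, and the middle command is exactly the risk trade the decision above warns about:

cr0x@server:~$ ceph osd pool get rbd min_size      # confirm what you are changing from
cr0x@server:~$ ceph osd pool set rbd min_size 1    # emergency-only: accepts writes with a single surviving replica
cr0x@server:~$ ceph osd pool set rbd min_size 2    # revert as soon as recovery restores redundancy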

Task 17: Validate the cluster isn’t paused (yes, this happens)

cr0x@server:~$ ceph osd dump | egrep 'pause|noup|nodown|noin|nobackfill|norecover|noscrub|nodeep-scrub'
flags nodown,noin,nobackfill,norecover

What it means: Someone set flags that prevent normal healing (norecover, nobackfill). Sometimes it’s deliberate during maintenance. Sometimes it’s forgotten.

Decision: If you’re not in a controlled maintenance window, remove the flags. Otherwise your PGs will stay degraded/stuck forever.

cr0x@server:~$ ceph osd unset norecover
unset norecover
cr0x@server:~$ ceph osd unset nobackfill
unset nobackfill
cr0x@server:~$ ceph osd unset noin
unset noin
cr0x@server:~$ ceph osd unset nodown
unset nodown

Task 18: If a specific OSD is slow, check host IO saturation (quick sanity)

cr0x@server:~$ ssh pve2 -- iostat -x 1 3
Linux 6.8.12-pve (pve2)  12/26/2025  _x86_64_ (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.11    0.00    6.24   18.52    0.00   63.13

Device            r/s     w/s   rkB/s   wkB/s  avgrq-sz avgqu-sz   await  r_await  w_await  svctm  %util
nvme0n1         120.0   310.0  7800.0  21400.0    92.0     8.40   24.10   12.30   28.70   1.20  51.00
sdd              15.0    40.0   800.0   1900.0    84.0    22.10  410.00  380.00  421.00  7.10  98.50

What it means: sdd is pinned at ~98% utilization with ~410 ms await. That will absolutely create slow ops and prolong recovery.

Decision: If this is an OSD data disk, prepare to replace it. If it’s a DB/WAL device shared by multiple OSDs, you found your cluster-wide latency amplifier.
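
To tell whether a slow device is one OSD’s data disk or a DB/WAL device shared by several OSDs, the OSD metadata is the quickest map. A sketch; exact field names vary a little between releases:

cr0x@server:~$ ceph osd metadata 12 | egrep 'devices|bluefs_db_devices|bluefs_wal_devices'   # which block devices back data, DB, and WAL
cr0x@server:~$ ceph device ls                                                                # physical devices and the daemons using them, per host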

Task 19: Map PG to acting OSDs and then to hosts (for correlated failure)

cr0x@server:~$ ceph pg map 1.2f
osdmap e4123 pg 1.2f (1.2f) -> up [3,7,18] acting [3,7,18]

What it means: After marking out osd.12, the PG remapped to a new acting set. This should eliminate “blocked_by osd.12” conditions.

Decision: If PG mapping keeps bouncing, suspect mon quorum issues, flapping OSDs, or unstable network.

Task 20: Attempt a PG repair only with evidence of replica inconsistency (rare, cautious)

cr0x@server:~$ ceph pg repair 1.31
instructing pg 1.31 on osd.7 to repair

What it means: ceph pg repair asks the primary to reconcile replicas that scrubbing has flagged as inconsistent. It can also burn CPU and I/O and
make things worse if the underlying issue is missing OSDs or broken disks.

Decision: Run repair only after restoring stable quorum and OSD availability. If you’re missing an OSD, repair is usually theatre.
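
“Evidence of inconsistency” means a scrub actually flagged something, not a hunch. A minimal sketch for gathering that evidence first; list-inconsistent-obj only returns useful data after a deep scrub of the PG in question:

cr0x@server:~$ ceph health detail | grep -i inconsistent                 # scrub errors surface as inconsistent PGs
cr0x@server:~$ rados list-inconsistent-pg rbd                            # inconsistent PGs in a given pool
cr0x@server:~$ rados list-inconsistent-obj 1.31 --format=json-pretty     # which objects and shards disagree, and how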

One quote to keep your hands steady, paraphrasing Werner Vogels: “Everything fails, all the time. Design and operate as if that’s normal.”

Failure modes that keep PGs stuck

1) An OSD is down, but still “in,” and the PG needs it to peer

This is the most common Proxmox-on-Ceph incident shape: a host reboot, a disk dies, a controller resets, or an OSD crashes. Ceph marks it down.
If it stays “in,” some PGs will wait for it (especially if it previously held authoritative data and the others need its log to agree).

The right move depends on the root cause:

  • Daemon crash, device healthy: restart the OSD, confirm it stays up.
  • Device I/O errors: mark out, replace, rebuild (a replacement sketch follows this list). Don’t keep power-cycling a dying disk; it doesn’t become healthier out of spite.
  • Host down: restore host/network/power first; decide whether to mark out based on expected MTTR and current redundancy.
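
When you do get to the replacement, the order matters more than the speed. A hedged sketch of the usual flow on a Proxmox node; verify each step against your Ceph and Proxmox versions, and note that pveceph osd destroy can wrap the stop-and-purge steps for you:

cr0x@server:~$ ceph osd safe-to-destroy osd.12                # refuses while PGs still depend on this OSD's data
cr0x@server:~$ ssh pve2 -- systemctl stop ceph-osd@12         # only once safe-to-destroy agrees
cr0x@server:~$ ceph osd purge 12 --yes-i-really-mean-it       # removes the OSD from CRUSH, the osdmap, and auth
cr0x@server:~$ ssh pve2 -- pveceph osd create /dev/sdX        # after swapping the disk; the device name here is a placeholder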

2) Mon quorum instability causes constant map churn

If monitors can’t keep quorum, OSD maps and PG mappings may churn. PGs peer, then re-peer, then re-peer again. You’ll see stuck peering,
but the real problem is governance: the cluster can’t agree on the current truth.

Root causes include network partitions, clock skew, overloaded mons, or slow disks where mon DB lives. In Proxmox, running mons on busy nodes is common.
It can work—until it doesn’t.
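
Clock skew is a boringly common cause of monitor weirdness, and Ceph will tell you about it directly. A quick sketch:

cr0x@server:~$ ceph time-sync-status             # per-monitor skew, as the mons measure it
cr0x@server:~$ ssh pve2 -- chronyc tracking      # local NTP state on a suspect node
cr0x@server:~$ ceph mon stat                     # leader, quorum members, and election epoch at a glance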

3) Nearfull/backfillfull/full blocks movement and creates deadlocks

Ceph protects itself from running out of space by restricting backfill and, at worst, writes. If some OSDs are nearfull, CRUSH may keep placing data on the
remaining ones, making them nearfull too. Recovery slows, and your “fix” (adding load or restarting) makes it worse.

If you’re nearfull and have inactive PGs, your options narrow:

  • Add capacity (the best answer, if you can do it fast).
  • Delete data (only if you’re absolutely sure what you’re deleting).
  • Temporarily adjust full ratios (risky; you’re trading correctness safeguards for time; see the sketch below).
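
If you go the ratio route, know what you are changing and by how little. A hedged sketch; the defaults are commonly 0.85/0.90/0.95, and nudging them a point or two buys time, not capacity:

cr0x@server:~$ ceph osd dump | grep -i ratio         # current nearfull/backfillfull/full thresholds
cr0x@server:~$ ceph osd set-nearfull-ratio 0.87      # small, temporary, documented
cr0x@server:~$ ceph osd set-backfillfull-ratio 0.92
cr0x@server:~$ ceph osd set-full-ratio 0.96          # last resort; revert the moment new capacity lands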

4) Backfill/recovery tuning turned into self-harm

Recovery settings are not performance “boosters.” They are trade-offs. If you set recovery too aggressive, you can saturate disks and networks,
causing client IO to time out, OSDs to look “down” due to heartbeat delays, and PGs to become stuck peering because participants are overloaded.

If you set recovery too conservative, you can stretch a failure window from minutes into hours—long enough for the next failure to happen.
Clusters don’t like long exposure windows.
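
On releases that use the mClock scheduler (roughly Quincy and newer), the supported way to shift the client-versus-recovery balance is the profile, not the per-OSD backfill counters. A sketch, hedged because option names have moved between releases:

cr0x@server:~$ ceph config get osd osd_mclock_profile                      # usually 'balanced' or 'high_client_ops'
cr0x@server:~$ ceph config set osd osd_mclock_profile high_recovery_ops    # only when clients can tolerate it; revert afterwards
cr0x@server:~$ ceph config set osd osd_mclock_profile high_client_ops      # the opposite trade: protect clients, stretch the recovery window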

5) A single pathological device (often DB/WAL) poisons the whole cluster

BlueStore uses RocksDB and a WAL. If DB/WAL is on a shared SSD that is failing or saturated, multiple OSDs can become “slow but not dead.”
That’s the worst kind: everything is technically up, but PGs can’t progress fast enough, and slow ops stack.

6) CRUSH/rule changes or pool parameter changes create surprises

Changing failure domains, device classes, or CRUSH rules can trigger massive remaps. During a degraded state, that’s a great way to create a second incident
without resolving the first. If PGs are inactive, the cluster is already in a fragile mode: reduce change, not increase it.

7) Incomplete PGs: when Ceph can’t find authoritative data

incomplete is where “availability incident” becomes “data integrity incident.” It can happen if too many OSDs that held a PG are gone
(or marked lost), or if the remaining replicas don’t have enough log history to agree.

Your job is to determine whether:

  • Those OSDs can be brought back (best; the query sketch after this list shows which ones the PG still wants).
  • You can restore from backup/snapshot (often the correct business choice).
  • You must declare data lost for specific PGs (last resort, business decision, not a CLI reflex).
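
To see which missing OSDs an incomplete PG still wants to hear from, query it and look for the probe list. A sketch, assuming a PG ID like 1.2f; the exact JSON layout varies by release and by the PG’s current state:

cr0x@server:~$ ceph pg 1.2f query | grep -A6 down_osds_we_would_probe     # OSD IDs Ceph would still consult if they came back
cr0x@server:~$ ceph pg 1.2f query | grep -A4 peering_blocked_by           # ties back to the blocked_by breadcrumb from Task 3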

Three mini-stories from corporate life

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran Proxmox with Ceph on three nodes. Someone saw 12 pgs inactive after a routine kernel update rebooted one node.
They assumed “inactive” meant “rebalancing.” The helpdesk ticket said “storage is slow,” so they treated it like performance.

They cranked recovery settings up: more backfills, more active recovery. It looked like progress—more I/O, more network. Meanwhile, the rebooted node
had a NIC that came up at the wrong speed due to an autonegotiation issue. Heartbeats were dropping. OSDs were flapping, not failing cleanly.

The wrong assumption was subtle: that Ceph was free to move data if it needed to. But peering needed stable membership first. The increased recovery traffic
made the already-wobbly NIC drop more packets, which made more OSDs look down, which made more PGs try to peer again. A nice little feedback loop.

Eventually, they “fixed” the issue by rebooting everything—because rebooting is the universal solvent of bad hypotheses. The cluster came back, but recovery
took longer and clients saw more timeouts than they would have with the boring fix: force the NIC to the correct speed/duplex, stabilize heartbeats, then let
peering finish.

The takeaway: PG inactivity is often a membership problem, not a bandwidth problem. Fix participation (OSDs, network, quorum) before you tune recovery.

Mini-story 2: The optimization that backfired

A larger enterprise wanted faster recovery after disk failures. Their storage team increased concurrency across the board: more backfills, more recovery threads,
higher priority for recovery. On paper, it reduced MTTR in the lab.

In production, it coincided with a bad SSD firmware batch in a subset of nodes. Those SSDs didn’t fail hard; they got slow under sustained write pressure.
Normal client I/O was fine. Recovery traffic was not. The “optimization” forced the slow SSDs into their worst behavior: long latency spikes and occasional resets.

Once a device starts stalling, Ceph does what distributed systems do: it retries, it queues, it re-peers, it logs slow ops. Suddenly the cluster wasn’t just
recovering—it was collectively waiting on a handful of devices that were technically alive. PGs started sticking in peering and backfill_wait because the
cluster couldn’t keep a steady rhythm.

They rolled back tuning and instituted a policy: during a degraded event, recovery is allowed to be aggressive only if device latency remains within a defined
envelope. If not, recovery settings are reduced to preserve service and prevent OSD flapping. That policy stopped future “optimizations” from turning failures
into meltdowns.

The takeaway: recovery tuning must be bounded by observed latency, not optimism. Concurrency is a weapon; don’t point it at your own foot.

Mini-story 3: The boring, correct practice that saved the day

A finance-ish organization had a rigid change discipline for their Proxmox Ceph cluster. It wasn’t glamorous. They maintained a runbook and required a
“cluster health screenshot” (really just ceph -s and ceph health detail outputs pasted into the ticket) before and after maintenance.

One morning, they saw pgs inactive after a top-of-rack switch reboot. The on-call followed the runbook: check mon quorum, check OSD down,
identify shared acting sets, and verify network reachability between Ceph public/cluster networks. They found that one VLAN trunk hadn’t come back correctly.

Here’s the boring part: they did not restart OSDs. They did not tweak recovery. They did not mark anything lost. They fixed the trunk, waited for OSD heartbeats
to stabilize, watched PGs peer, then monitored recovery. The incident stayed an availability blip, not a data event.

Later, in the postmortem, they compared it to a similar event months earlier (before the runbook existed) where someone had marked an OSD out prematurely,
causing a larger recovery storm during business hours. The runbook didn’t make engineers smarter. It made them less improvisational.

The takeaway: boring processes reduce creative damage. In storage, creativity is overrated.

Common mistakes: symptom → root cause → fix

1) Symptom: “PGs stuck peering” after a node reboot

Root cause: OSDs are flapping because the node came back with broken networking, wrong MTU, or time sync issues; peering never stabilizes.

Fix: Stabilize the node: verify NIC speed/duplex, MTU consistency, routes/VLANs, and NTP/chrony. Only then restart the affected OSD daemons if needed.

2) Symptom: “PGs inactive” and ceph pg query shows blocked_by osd.X

Root cause: That OSD is down, hung, or too slow to respond, and it’s needed for peering history.

Fix: If recoverable: restore OSD (service + device). If not: mark out and replace. Don’t run ceph pg repair to compensate for missing OSDs.

3) Symptom: Recovery doesn’t start, PGs stay degraded, but “everything is up”

Root cause: Cluster flags like norecover or nobackfill were set during maintenance and forgotten.

Fix: Clear flags (ceph osd unset ...). Then monitor ceph -s to confirm recovery resumes.

4) Symptom: Slow ops explode during recovery; OSDs start timing out

Root cause: Recovery concurrency too high for the hardware/network. Heartbeats get delayed, triggering OSD flaps and peering churn.

Fix: Reduce recovery/backfill concurrency. In Proxmox terms: tune conservatively, watch ceph osd perf, and prioritize cluster stability over speed.

5) Symptom: PGs stuck with backfill_wait for a long time

Root cause: Either throttles are too strict, or there’s a hidden bottleneck: nearfull OSDs, slow target OSDs, or network constraints.

Fix: Check ceph df and ceph osd perf. Fix the slow device or capacity pressure before increasing backfill limits.

6) Symptom: Inactive/incomplete PGs after multiple disk failures on the same host

Root cause: Failure domain mismatch. CRUSH thought replicas were separated, but they weren’t (or host had too many OSDs for the domain).

Fix: Re-evaluate CRUSH failure domain (host/rack) and OSD distribution. This is a design fix, not an incident band-aid. For the incident: restore missing OSDs or recover from backup.

7) Symptom: PGs stuck after changing pool size/min_size or rules

Root cause: You changed placement while the cluster was already unhealthy; remaps stacked on top of recovery.

Fix: Stop changing rules during degradation. Roll back if safe, stabilize OSD availability, then re-apply changes in a controlled window.

8) Symptom: Ceph shows “stale” PGs

Root cause: Mons/OSDs are not receiving heartbeats (network partition, host freeze, or firewall mistake on Ceph ports).

Fix: Fix network reachability and host health. Don’t mark stale OSDs lost unless you’re prepared for permanent data loss declarations.

Checklists / step-by-step plan

Step-by-step plan for a live incident (PG stuck/inactive)

  1. Freeze risky changes.
    Stop tuning, stop “let’s upgrade real quick,” stop restarting daemons like you’re shaking a vending machine.
  2. Capture the current truth.
    Run ceph -s and ceph health detail. Save outputs in the incident channel/ticket.
  3. List stuck PGs and identify shared acting OSDs.
    Use ceph health detail and ceph pg dump_stuck inactive.
  4. Confirm mon quorum stability.
    ceph quorum_status. If quorum is unstable, fix that first.
  5. Check OSD state and location.
    ceph osd tree and ceph osd stat.
  6. For a suspect OSD, inspect service + logs + kernel messages.
    If you see I/O errors, treat it as hardware, not software mood.
  7. Decide: restore vs mark out.
    If the OSD can return quickly and cleanly, restore it. If not, mark it out and replace.
  8. Watch PG states change.
    Your first success condition is “no inactive PGs.” Your second is “degraded count trending down.”
  9. Manage recovery load if needed.
    If slow ops and client pain are severe, reduce recovery concurrency. If recovery is too slow and the cluster is stable, increase a little. Never jump from 1 to 11.
  10. After stabilization, fix the root cause.
    Replace hardware, correct networking, and document the chain of evidence (outputs, logs, and decisions).

Safety checklist before using “sharp tools”

  • Mon quorum stable for at least several minutes (no rapid elections).
  • You can name the specific PG(s) and the specific OSD(s) involved.
  • You have checked device/kernel logs for hardware errors.
  • You understand pool size and min_size impact on data safety.
  • You have a rollback plan or at least a “stop condition.”
  • Stakeholders are aware if you’re considering ceph osd lost or lowering min_size.

Stabilization checklist after the incident

  • Clear temporary flags: verify no lingering norecover, nobackfill, noscrub.
  • Confirm PGs are active+clean (or you have a known, acceptable degraded state with a plan).
  • Review OSD perf outliers (ceph osd perf) and replace/repair weak devices.
  • Verify time sync across nodes; Ceph hates time drama.
  • Record: what failed, detection time, decision points, and what you changed.

FAQ

1) Are “PGs stuck” and “PGs inactive” the same thing?

No. “Stuck” is a duration alarm: a PG has remained in some state too long. “Inactive” is a state: the PG cannot serve I/O safely. Inactive is more urgent.

2) Can I just restart Ceph services on all nodes?

You can, but that’s a blunt instrument and it often hides the root cause. If the issue is disk I/O errors, network partitions, or quorum instability,
restarts can make peering take longer and increase map churn. Restart only the component you have evidence for.

3) When should I mark an OSD out?

Mark it out when you have a credible reason it won’t return quickly and cleanly: confirmed disk I/O errors, repeated crashes, host hardware failure,
or long MTTR. If it’s a short reboot with healthy disks, it’s often better to wait a few minutes—unless you’re already one failure away from trouble.

4) What’s the fastest way to see what an inactive PG is waiting on?

Start with ceph health detail for the PG IDs and acting set, then ceph pg <pgid> query. Look for blocked_by or peering blockers.

5) Should I run ceph pg repair when peering is stuck?

Usually no. If the PG is stuck because an OSD is missing or the cluster is unstable, repair doesn’t solve the missing participant. Use repair when
you have evidence of inconsistency with all required OSDs present and stable.

6) What does active+undersized mean for my VM disks?

It means the PG is active (I/O can proceed) but fewer replicas exist than the pool’s size. Risk is increased: another failure in the wrong place can
make data unavailable or lost. Treat it as “running on the spare tire.”

7) Why do PGs get stuck during backfill even when hardware looks fine?

Common reasons: throttles/flags (nobackfill, norecover), nearfull constraints, or one OSD being much slower than the rest.
Recovery tends to move at the speed of the slowest critical participant.

8) Is it safe to lower min_size to get writes flowing?

It can be operationally necessary, but it’s a deliberate risk trade. Lowering min_size can allow writes with fewer replicas, which increases the chance
of data loss if another OSD fails before recovery completes. Make it a documented, time-bound change with a clear exit.

9) Do Proxmox updates commonly cause PG inactive events?

Updates themselves aren’t the cause; the reboots and service restarts are. Rolling reboots without checking cluster health, or rebooting multiple Ceph nodes at once,
is a reliable way to learn new PG states.

10) If I have incomplete PGs, what should I do first?

Stop making changes and focus on restoring any missing OSDs that might contain authoritative data. If they’re gone, move to backup/restore planning.
“Force” options are last-resort and should be treated as business decisions, not technical bravado.

Conclusion: next steps that reduce future pain

When PGs are stuck/inactive, the cluster is telling you it can’t complete a safety-critical handshake. Your job is not to “make the warning go away.”
Your job is to restore stable participation: quorum, network, OSD availability, and sane capacity headroom.

Practical next steps:

  • Codify the fast diagnosis playbook in your on-call notes: ceph -s, ceph health detail, ceph pg query, ceph osd tree, quorum, capacity, perf.
  • Track and replace slow devices before they become “not dead enough” to fail fast. ceph osd perf outliers are early warnings.
  • Keep recovery tuning minimal and time-bound. If you must change it, record it, set a reminder, and revert.
  • Design for failure domains that match reality. Host-level separation is not optional when your “hosts” share power or a switch.
  • Practice the boring approach. Evidence, smallest change, observe, repeat. It’s not heroic. It works.

If you remember one thing: PG inactivity is a correctness alarm first, a performance issue second. Treat it that way and you’ll have fewer “we escalated risk” postmortems.
