Proxmox RBD “error opening”: auth/keyring mistakes and fixes


“Error opening” is the Ceph equivalent of a dashboard check-engine light. It tells you almost nothing, it happens at the worst possible time,
and it can be caused by a single missing character in a keyring path that you last touched six months ago.

In Proxmox, this usually surfaces when you try to create/attach a disk, start a VM, or migrate between nodes using RBD-backed storage. One node works.
Another throws “error opening”. Your Ceph cluster looks “HEALTH_OK”. Everyone’s annoyed. Let’s make this boring again.

What “error opening” actually means in Proxmox RBD terms

When Proxmox says “RBD: error opening”, you’re usually seeing a failure bubble up from librbd (the userspace library used to access RBD images).
The library tries to:

  1. Load Ceph configuration (monitors, auth settings, fsid, etc.).
  2. Authenticate (cephx) using a key for some client ID (client.admin, client.pve, or a custom user).
  3. Talk to monitors (MONs), get the cluster map, and locate OSDs.
  4. Open the RBD image (which requires permissions on the pool and the image).

“Error opening” is commonly thrown for:

  • Wrong or missing keyring/key in Proxmox storage configuration.
  • Client ID mismatch: you have the right key, but for the wrong client name.
  • Caps don’t allow the operation (read-only caps but you’re creating images; missing profile rbd; missing access to rbd_children metadata, etc.).
  • Monitors unreachable from one node (routing, firewall, wrong mon_host, IPv6 vs IPv4 confusion).
  • Ceph config differences between nodes (one node has a stale /etc/ceph/ceph.conf or wrong fsid).
  • Keyring file permissions on disk: root can read it, but a process is running as a different user (common in custom tooling; less common in stock Proxmox).

The fastest way to stop guessing is to reproduce the exact open operation from the failing node using rbd CLI with the same ID and keyring.
If rbd ls works but rbd info pool/image fails, you’re staring at a caps mismatch. If nothing works, start at monitors + keyring.

Joke #1: “Error opening” is what Ceph says when it’s too polite to say “your keyring is garbage.”

Fast diagnosis playbook (check 1/2/3)

This is the order that ends incidents fastest. Not the order that feels emotionally satisfying.

1) Confirm you can reach monitors and authenticate from the failing node

  • If monitor connectivity or cephx auth fails, nothing else matters. Fix that first.
  • Use ceph -s and ceph auth get client.X where applicable.

2) Confirm Proxmox is using the keyring you think it’s using

  • Inspect /etc/pve/storage.cfg and the per-storage keyring path (or embedded key).
  • Validate the file exists on every node (Proxmox config is shared, but keyring files are local unless you manage them).

3) Validate caps against the pool and operation

  • List caps: ceph auth get client.pve.
  • Test with rbd commands that mirror the failing action: rbd ls, rbd info, rbd create, rbd snap ls.

4) Only then: chase Proxmox UI errors, qemu logs, and edge cases

  • Look at task logs and journalctl for pvedaemon, pveproxy, and qemu-server.
  • Most “error opening” incidents are auth/caps/config. The exotic ones exist, but they’re not your first bet.
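
If you want to run checks 1-3 in a single pass from the failing node, here's a minimal sketch. The client name (pve), pool (vmdata), keyring path, and config path are assumptions taken from the examples later in this article; substitute whatever your own /etc/pve/storage.cfg and task logs actually say.

#!/usr/bin/env bash
# Minimal triage for RBD "error opening" -- run on the failing node.
# ID, POOL, KEYRING, CONF are assumptions; use the values from your own
# /etc/pve/storage.cfg and the Proxmox task log.
ID=pve
POOL=vmdata
KEYRING=/etc/ceph/ceph.client.${ID}.keyring
CONF=/etc/pve/ceph.conf

echo "== 1) monitors reachable + cephx auth for client.${ID} =="
ceph --conf "$CONF" --id "$ID" --keyring "$KEYRING" -s || exit 1

echo "== 2) storage definition and keyring file on this node =="
grep -nA6 '^rbd:' /etc/pve/storage.cfg
ls -l "$KEYRING" || exit 1

echo "== 3) caps vs. the operation (list the pool with the same identity) =="
ceph auth get "client.${ID}"          # uses the node's default (admin) credentials
rbd --conf "$CONF" --id "$ID" --keyring "$KEYRING" -p "$POOL" ls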

Interesting facts and context (because the past is still running in prod)

  • Ceph’s “cephx” auth was designed to avoid shared cluster-wide secrets. You can scope keys to pools and operations, which is why caps matter so much.
  • RBD’s original audience was cloud platforms. The whole “image + snapshot + clone” model is very VM-centric, which is why Proxmox and OpenStack latched onto it early.
  • Proxmox stores cluster config in a distributed filesystem. /etc/pve is shared across nodes, but local files like /etc/ceph/ceph.client.pve.keyring are not magically replicated.
  • Historically, many deployments used client.admin everywhere. It “works” until it becomes an audit nightmare and an incident amplifier.
  • Caps syntax evolved over time. Older blog posts show outdated patterns; modern Ceph likes profile rbd plus explicit pool scoping.
  • Ceph monitors are a consistency gate. You can have healthy OSDs and still fail basic RBD opens if MON quorum or reachability is broken from one node.
  • RBD “open” can require metadata operations. Even reads can require access to pool metadata (and depending on features, to omap keys). “I gave it read-only” can be accidentally too strict.
  • Ceph config discovery has multiple paths. Environment variables, default paths, and explicit flags can lead to “works in my shell” but fails in Proxmox tasks.

Common symptoms: what you’ll see and where

Proxmox can surface the same underlying failure through several layers. Learn the patterns:

  • Proxmox task log: “rbd: error opening” during disk create, attach, snapshot, migrate, or VM start.
  • QEMU start failures: VM won’t start; qemu logs mention inability to open RBD image.
  • CLI mapping errors: rbd map returns “permission denied” or “error connecting to the cluster”.
  • Ceph side hints: MON logs show auth failures; OSD logs show denied ops; but in many cases Ceph stays quiet unless debug levels are raised.
  • Node-specific behavior: One Proxmox node can access RBD; another can’t. That screams “local keyring/config file mismatch”.

Practical tasks: commands, outputs, and decisions (12+)

The point of these tasks is to turn vague UI errors into crisp decisions. Run them from the failing Proxmox node first, then from a known-good node to compare.

Task 1: Confirm Ceph cluster is reachable and you’re not chasing ghosts

cr0x@server:~$ ceph -s
  cluster:
    id:     2c3d1d5e-9a2e-4e2b-8a59-4f1c7c0c1c2f
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 2h)
    mgr: mgr1(active, since 2h)
    osd: 12 osds: 12 up (since 3h), 12 in (since 3h)

  data:
    pools:   4 pools, 256 pgs
    objects: 3.2M objects, 12 TiB
    usage:   36 TiB used, 72 TiB / 108 TiB avail
    pgs:     256 active+clean

What it means: This confirms the CLI can talk to monitors and authenticate using whatever default config/key your shell is picking up.

Decision: If this fails on the broken node but works elsewhere, fix monitor reachability and local Ceph config before touching Proxmox.

Task 2: Identify what Proxmox thinks your RBD storage is

cr0x@server:~$ grep -nE '^(rbd:|[[:space:]]*(pool|monhost|username|keyring|content))' /etc/pve/storage.cfg
12:rbd: ceph-rbd
13:        monhost 10.10.0.11 10.10.0.12 10.10.0.13
14:        pool vmdata
15:        username pve
16:        keyring /etc/ceph/ceph.client.pve.keyring
17:        content images,rootdir

What it means: Proxmox will try to connect to those monitor IPs, authenticate as client.pve, using that keyring file.

Decision: If keyring is missing or points to a file that doesn’t exist on some nodes, you found your root cause.

Task 3: Verify the keyring file exists on this node and is readable

cr0x@server:~$ ls -l /etc/ceph/ceph.client.pve.keyring
-rw------- 1 root root 151 Dec 26 10:41 /etc/ceph/ceph.client.pve.keyring

What it means: It exists and only root can read it, which is normal on Proxmox.

Decision: If it’s missing on one node, copy it securely or re-create it. If permissions are too open, fix them anyway; sloppy secrets become incidents.

Task 4: Confirm the keyring actually contains the expected client name

cr0x@server:~$ sed -n '1,120p' /etc/ceph/ceph.client.pve.keyring
[client.pve]
	key = AQB7qMdnJg0aJRAA7i9fJvQW9x0o0Jr8mGmNqA==
	caps mon = "profile rbd"
	caps osd = "profile rbd pool=vmdata"

What it means: The section header must match the username Proxmox uses. storage.cfg omits the client. prefix, so username pve corresponds to [client.pve] here.

Decision: If the file says [client.admin] but storage.cfg says username pve, Proxmox will fail to authenticate.

Task 5: Test RBD access explicitly using the same identity as Proxmox

cr0x@server:~$ rbd -p vmdata ls --id pve --keyring /etc/ceph/ceph.client.pve.keyring
vm-101-disk-0
vm-102-disk-0
base-9000-disk-0

What it means: Authentication works and the user can list images in the pool.

Decision: If listing works but Proxmox still errors on open, the problem is likely image-specific permissions/features or a different pool/image name than you think.

Task 6: Reproduce the open on a specific image (most useful for “error opening”)

cr0x@server:~$ rbd info vmdata/vm-101-disk-0 --id pve --keyring /etc/ceph/ceph.client.pve.keyring
rbd image 'vm-101-disk-0':
	size 100 GiB in 25600 objects
	order 22 (4 MiB objects)
	snapshot_count: 2
	id: 1a2b3c4d5e6f
	block_name_prefix: rbd_data.1a2b3c4d5e6f
	format: 2
	features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
	op_features:
	flags:
	create_timestamp: Tue Dec 24 09:12:33 2025
	access_timestamp: Tue Dec 24 09:12:33 2025
	modify_timestamp: Thu Dec 26 10:01:07 2025

What it means: If this succeeds, “open” works at the RBD level. Proxmox should be able to start the VM unless it’s using different credentials/config.

Decision: If this fails with “permission denied”, your caps are insufficient for metadata operations or you’re targeting the wrong pool.

Task 7: Confirm caps for the client user (don’t guess)

cr0x@server:~$ ceph auth get client.pve
[client.pve]
	key = AQB7qMdnJg0aJRAA7i9fJvQW9x0o0Jr8mGmNqA==
	caps mon = "profile rbd"
	caps osd = "profile rbd pool=vmdata"

What it means: This is the authoritative truth inside Ceph (not whatever’s copied into a keyring file).

Decision: If caps don’t include the target pool, fix caps. If the key differs from the keyring file, update the file everywhere.
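
To compare the key inside Ceph with the key in the local file without eyeballing base64, here's a hedged one-liner; the client name and keyring path are the same assumed values used throughout these tasks.

cr0x@server:~$ [ "$(ceph auth get-key client.pve)" = "$(awk '$1=="key"{print $3}' /etc/ceph/ceph.client.pve.keyring)" ] && echo "key matches" || echo "key MISMATCH"
key matches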

Task 8: Check the Ceph config that Proxmox will implicitly use

cr0x@server:~$ cat /etc/ceph/ceph.conf
[global]
fsid = 2c3d1d5e-9a2e-4e2b-8a59-4f1c7c0c1c2f
mon_host = 10.10.0.11 10.10.0.12 10.10.0.13
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx

What it means: Wrong fsid or missing/incorrect mon_host can cause a node to talk to the wrong cluster or no cluster.

Decision: If this differs between nodes, standardize it. A split-brain of configuration is how you get “it worked yesterday” without a real change.

Task 9: Confirm monitor reachability from the failing node (routing/firewall)

cr0x@server:~$ for m in 10.10.0.11 10.10.0.12 10.10.0.13; do echo "== $m =="; nc -vz -w2 $m 3300; nc -vz -w2 $m 6789; done
== 10.10.0.11 ==
Connection to 10.10.0.11 3300 port [tcp/*] succeeded!
Connection to 10.10.0.11 6789 port [tcp/*] succeeded!
== 10.10.0.12 ==
Connection to 10.10.0.12 3300 port [tcp/*] succeeded!
Connection to 10.10.0.12 6789 port [tcp/*] succeeded!
== 10.10.0.13 ==
Connection to 10.10.0.13 3300 port [tcp/*] succeeded!
Connection to 10.10.0.13 6789 port [tcp/*] succeeded!

What it means: Ceph MON uses 3300 (msgr2) and sometimes 6789 (legacy). You want connectivity to at least what your cluster uses.

Decision: If this fails on only one node, fix firewall/routing/VLAN/MTU. Don’t “fix” auth to compensate for a broken network.
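
If the ports answer but you still suspect the address list, you can ask the cluster which monitor addresses it actually advertises and compare them against monhost in storage.cfg. The exact output shape varies by Ceph release; the addresses shown below simply restate the monitors from the earlier examples.

cr0x@server:~$ ceph mon dump 2>/dev/null | grep -E '^[0-9]+:'
0: [v2:10.10.0.11:3300/0,v1:10.10.0.11:6789/0] mon.mon1
1: [v2:10.10.0.12:3300/0,v1:10.10.0.12:6789/0] mon.mon2
2: [v2:10.10.0.13:3300/0,v1:10.10.0.13:6789/0] mon.mon3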

Task 10: Pull the Proxmox task log that contains the failure

cr0x@server:~$ journalctl -u pvedaemon -S -2h | tail -n 40
Dec 26 10:50:14 pve3 pvedaemon[2211]:  starting task UPID:pve3:00008A1B:0002A1C4:676D7F46:qmstart:101:root@pam:
Dec 26 10:50:15 pve3 pvedaemon[1032]: command '/usr/bin/kvm -id 101 -name vm101 ... -drive file=rbd:vmdata/vm-101-disk-0:conf=/etc/pve/ceph.conf:id=pve:keyring=/etc/ceph/ceph.client.pve.keyring,if=none ...' failed: exit code 1
Dec 26 10:50:15 pve3 pvedaemon[1032]: TASK ERROR: start failed: error opening 'rbd:vmdata/vm-101-disk-0': (13) Permission denied

What it means: This confirms the exact arguments QEMU used, including id=pve and the keyring path. That’s gold.

Decision: Re-run the equivalent RBD command manually with those parameters. If it fails, it’s not “a Proxmox issue.”

Task 11: Verify Proxmox is pointing QEMU at the right ceph.conf

cr0x@server:~$ ls -l /etc/pve/ceph.conf
-rw-r----- 1 root www-data 232 Dec 26 10:40 /etc/pve/ceph.conf

What it means: Proxmox can generate/use /etc/pve/ceph.conf. This may differ from /etc/ceph/ceph.conf.

Decision: If your CLI tests are using /etc/ceph/ceph.conf but QEMU uses /etc/pve/ceph.conf, align them or test with the same --conf.

Task 12: Re-test with the same config file that QEMU used

cr0x@server:~$ rbd --conf /etc/pve/ceph.conf -p vmdata ls --id pve --keyring /etc/ceph/ceph.client.pve.keyring
rbd: listing images failed: (13) Permission denied

What it means: This narrows it down. Your earlier success may have been from a different config (different cluster, different mon_host, different auth settings).

Decision: Inspect /etc/pve/ceph.conf and fix it or stop using it. Consistency beats cleverness.

Task 13: Confirm which Ceph cluster you’re talking to (fsid sanity check)

cr0x@server:~$ ceph --conf /etc/pve/ceph.conf fsid
2c3d1d5e-9a2e-4e2b-8a59-4f1c7c0c1c2f

What it means: If fsid differs from the expected cluster, you’re authenticating against the wrong Ceph cluster (or an old lab leftover).

Decision: Fix the config file and restart affected services; don’t “just add more mons” to both clusters and hope.

Task 14: Fix caps for a Proxmox RBD client (typical safe pattern)

cr0x@server:~$ ceph auth caps client.pve mon "profile rbd" osd "profile rbd pool=vmdata"
updated caps for client.pve

What it means: You’re granting RBD-appropriate monitor permissions and pool-scoped OSD permissions. This is the sane default for VM disks in one pool.

Decision: If you have multiple pools used by Proxmox, add each pool explicitly. Avoid broad allow * unless you enjoy explaining it later.

Task 15: Update (or create) the keyring file consistently across nodes

cr0x@server:~$ ceph auth get client.pve -o /etc/ceph/ceph.client.pve.keyring
exported keyring for client.pve

What it means: You’re writing the authoritative key/caps to the node’s filesystem. Repeat on each node or distribute securely.

Decision: If only one node had a stale keyring, this eliminates node-specific “error opening” failures.

Task 16: Validate Proxmox storage definition is healthy

cr0x@server:~$ pvesm status
Name       Type     Status           Total       Used        Available        %
ceph-rbd   rbd      active            0           0           0               0.00
local      dir      active        1966080    1126400          839680         57.29

What it means: For RBD, capacity may show as 0 depending on setup, but the storage should be active.

Decision: If it’s inactive or errors, re-check monitor hosts, username, and keyring path in storage.cfg.

Ceph auth model in Proxmox: clients, keyrings, caps, and where Proxmox hides things

Client names: the most common foot-gun is a one-word mismatch

Ceph users are named like client.pve, client.admin, client.proxmox. In Proxmox storage.cfg, you often specify
username pve, which Proxmox treats as client.pve.

The mismatch patterns:

  • Keyring header mismatch: file contains [client.proxmox] but Proxmox uses pve. Authentication fails.
  • Key mismatch: file header correct but key is from an older rotation. Authentication fails.
  • Caps mismatch: auth succeeds but operations fail at open/create/snapshot time.
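
A quick way to catch the first mismatch pattern is to compare what storage.cfg says with what the keyring file actually contains. A minimal sketch, assuming the storage is named ceph-rbd as in the earlier excerpt (adjust the name to yours):

# Check that the storage.cfg username and the keyring section header agree.
# Storage name "ceph-rbd" is an assumption from the earlier excerpt.
CLIENT=$(awk '/^rbd: ceph-rbd/{f=1} f && $1=="username"{print $2; exit}' /etc/pve/storage.cfg)
KEYRING=$(awk '/^rbd: ceph-rbd/{f=1} f && $1=="keyring"{print $2; exit}' /etc/pve/storage.cfg)
grep -q "^\[client\.${CLIENT}\]" "$KEYRING" \
  && echo "keyring header matches client.${CLIENT}" \
  || echo "MISMATCH: ${KEYRING} has no [client.${CLIENT}] section"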

Keyring location: shared config, local secrets

Proxmox’s cluster filesystem makes it tempting to think everything in your configuration is replicated. It isn’t.
/etc/pve/storage.cfg is replicated. Your keyring file under /etc/ceph is just a file.

This is why “works on node1, fails on node3” happens so often:

  • You added the storage in the UI once, it updated /etc/pve/storage.cfg across the cluster.
  • You copied the keyring to only one node (or you copied a different version).
  • Proxmox happily schedules a VM start on a node that cannot authenticate, and you get “error opening”.
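
The boring fix is to make keyring distribution explicit instead of hoping it happened. A minimal sketch, assuming root SSH between nodes and hypothetical node names pve1-pve3 (use your own node list and keyring path):

# Push the keyring to every node with tight permissions.
# Node names are hypothetical; root SSH between cluster nodes is assumed.
KEYRING=/etc/ceph/ceph.client.pve.keyring
for node in pve1 pve2 pve3; do
  scp -p "$KEYRING" "root@${node}:${KEYRING}.new"
  ssh "root@${node}" "chown root:root ${KEYRING}.new && chmod 0600 ${KEYRING}.new && mv ${KEYRING}.new ${KEYRING}"
done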

Caps: “profile rbd” is the baseline, pool scoping is the safety rail

For Proxmox RBD usage, the operational sweet spot is:

  • mon = "profile rbd" so the client can query necessary maps and RBD-related metadata.
  • osd = "profile rbd pool=<poolname>" so the client can access images in a specific pool.

If you’re using multiple pools (e.g., vmdata, fast-ssd, templates), you either:

  • Grant multiple pool clauses (separate clients is cleaner; see the example after this list), or
  • Accept broader caps and live with the security tradeoff.
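
If one client really must cover several pools, the OSD cap string takes multiple comma-separated clauses. A hedged example using the hypothetical pool names above:

cr0x@server:~$ ceph auth caps client.pve mon "profile rbd" osd "profile rbd pool=vmdata, profile rbd pool=fast-ssd, profile rbd pool=templates"
updated caps for client.pve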

Proxmox and /etc/pve/ceph.conf: the subtle config split

Proxmox can maintain a Ceph configuration under /etc/pve/ceph.conf, and QEMU processes invoked by Proxmox tasks may reference it directly.
Meanwhile, your shell commands might default to /etc/ceph/ceph.conf. If those differ, you’ll waste hours “proving” contradictory facts.

Decide on one source of truth and make it consistent. If Proxmox is using /etc/pve/ceph.conf, keep it correct and keep it synced with the actual cluster.
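
A quick way to see whether the two files have drifted, using the same paths discussed above (if one of them is a symlink to the other, as on stock hyperconverged Proxmox, the diff will simply be empty):

# Compare the config QEMU may use with the one your shell defaults to,
# and confirm both name the same cluster and monitors.
diff -u /etc/ceph/ceph.conf /etc/pve/ceph.conf
grep -E '^(fsid|mon_host)' /etc/ceph/ceph.conf /etc/pve/ceph.conf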

One reliability quote you should actually take seriously

Paraphrased idea from John Allspaw (operations/reliability): “Incidents come from normal work and ordinary decisions, not just rare incompetence.”

Common mistakes: symptom → root cause → fix

1) Symptom: Works on one node, fails on another with “error opening”

Root cause: Keyring file missing or different on the failing node (or different ceph.conf).

Fix: Ensure the keyring and config exist and match on every node.

cr0x@server:~$ sha256sum /etc/ceph/ceph.client.pve.keyring /etc/ceph/ceph.conf /etc/pve/ceph.conf
e1d0c0d2f0b8d66c3f2f5b7a20b3fcb0a1f6e42a2bfafbfcd1c4e2a8fcbcc3af  /etc/ceph/ceph.client.pve.keyring
9b1f0c3c4f74d5d5c22d5e4e2d0a2a77bff2f5bd3d92a0e7db6c2f4f122c8f10  /etc/ceph/ceph.conf
9b1f0c3c4f74d5d5c22d5e4e2d0a2a77bff2f5bd3d92a0e7db6c2f4f122c8f10  /etc/pve/ceph.conf

Decision: Hash mismatch across nodes? Stop. Standardize. Don’t keep debugging higher layers.

2) Symptom: “(13) Permission denied” when starting VM or creating disk

Root cause: Caps too narrow for what Proxmox is doing (create, snapshot, clone), or wrong pool scoping.

Fix: Update caps to include correct pool and profile rbd. Verify with rbd create test.

cr0x@server:~$ rbd create vmdata/caps-test --size 64M --id pve --keyring /etc/ceph/ceph.client.pve.keyring
rbd: create error: (13) Permission denied

Decision: This confirms it’s caps, not a flaky VM config. Fix caps, then retest create and delete the test image.

3) Symptom: “no keyring found” or “failed to load keyring” in logs

Root cause: Wrong keyring path in storage.cfg, or file exists but wrong permissions/SELinux/AppArmor context (rare on default Proxmox).

Fix: Correct the path; use absolute path; set 0600 root:root.
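
A hedged example of locking the file down, using the keyring path from the earlier storage.cfg excerpt:

cr0x@server:~$ chown root:root /etc/ceph/ceph.client.pve.keyring && chmod 0600 /etc/ceph/ceph.client.pve.keyring && ls -l /etc/ceph/ceph.client.pve.keyring
-rw------- 1 root root 151 Dec 26 10:41 /etc/ceph/ceph.client.pve.keyring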

4) Symptom: “error connecting to the cluster” or MON connection timeouts

Root cause: Monitor IPs wrong in storage.cfg/ceph.conf, firewall blocks 3300/6789, or DNS/IPv6 mismatch.

Fix: Use stable monitor addresses; validate connectivity; avoid hostnames unless DNS is truly boring.

5) Symptom: RBD list works, but open fails for some images

Root cause: Image is in another pool, or image features require ops your caps block, or the image name is wrong (typo, stale reference after rename).

Fix: Verify exact pool/image; run rbd info and rbd snap ls using the same identity Proxmox uses.
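
For example, opening both the image metadata and its snapshot list with the same identity Proxmox uses (pool, image, and paths are the assumed values from the earlier tasks):

# Open the exact image and its snapshot list with the identity Proxmox uses.
# If rbd ls works but these fail, you are looking at caps or a wrong pool/image name.
rbd info vmdata/vm-101-disk-0 --id pve --keyring /etc/ceph/ceph.client.pve.keyring
rbd snap ls vmdata/vm-101-disk-0 --id pve --keyring /etc/ceph/ceph.client.pve.keyring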

6) Symptom: After rotating keys, old VMs won’t start

Root cause: One node still has the old keyring; Proxmox schedules starts there; you get “error opening”.

Fix: Roll out keyring updates atomically across nodes, then validate with a small start/migrate test set.

Joke #2: Key rotation is like flossing—everyone agrees it’s good, and almost nobody does it on the schedule they claim.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran a Proxmox cluster with Ceph RBD for VM disks. They added a new node, joined it to the Proxmox cluster, and called it done.
The next morning, routine maintenance triggered a handful of VM migrations onto the new node.

Half the migrated VMs didn’t come back. Proxmox showed the same blunt message: “error opening”.
Ceph health was fine. The storage was defined in /etc/pve/storage.cfg, so the team assumed “the storage config replicated; therefore storage access replicated.”

That assumption was the entire incident. The new node didn’t have /etc/ceph/ceph.client.pve.keyring. The existing nodes did.
The Proxmox UI made it worse by being consistent: same storage name, same pool, same monitors, same failure message.

The fix was unglamorous: distribute the keyring to every node, verify hashes match, then re-run the starts.
The postmortem action item was even more boring: a node-join checklist with a “Ceph keyrings present and verified” gate.

Mini-story 2: The optimization that backfired

Another org wanted to reduce blast radius, so they created separate Ceph users for different Proxmox clusters and aggressively minimized caps.
Good instinct. Then they went one step too far: read-only caps for a user that Proxmox also used for snapshot operations and clone-based templating.

Everything looked fine for weeks because day-to-day VM reads and writes mostly worked—until the template pipeline ran at scale.
Suddenly, provisioning tasks started failing with “error opening” and “permission denied,” and the team chased networking because failures were bursty and time-correlated.

The real cause was that some operations needed metadata writes (snap create, clone, flatten) that their caps blocked.
The failures were periodic because those operations were periodic.

They fixed it by splitting responsibilities: one Ceph user for “VM runtime I/O” with strictly scoped pool access,
another for “image management” tasks run by automation, with additional permissions and tighter operational controls.
Least privilege survived. It just needed to be aligned to actual workflows, not wishful thinking.

Mini-story 3: The boring but correct practice that saved the day

A financial services team had a habit that looked almost comical: every node had a small local script that validated Ceph client access daily.
It ran ceph -s, rbd ls, and rbd info against a known image, using the exact credentials Proxmox used.
It logged results locally and also surfaced a simple “ok/fail” metric.

One afternoon, a Ceph admin rotated keys during a change window. The change was correct, caps were fine, and the Ceph cluster stayed healthy.
But one Proxmox node missed the key update due to a temporary configuration management failure.

Their daily validation caught it within hours—before a maintenance migration moved workloads onto the broken node.
Instead of an outage, they had a ticket: “Node pve7 fails RBD open using client.pve.” The remediation was a keyring sync and a retest.

Nothing heroic happened. Nobody got paged. This is what “reliability engineering” looks like on a good day: fewer stories to tell.

Checklists / step-by-step plan

Checklist A: When a VM fails to start with “error opening”

  1. From the failing node, get the exact error and parameters from logs (journalctl -u pvedaemon).
  2. Extract the id=, keyring=, pool, image name, and conf= file path.
  3. Run rbd --conf ... info pool/image --id ... --keyring ....
  4. If auth fails: verify keyring existence, correctness, and client name header.
  5. If permission denied: inspect caps and pool scoping; fix caps; retest.
  6. If monitor connectivity fails: validate ports 3300/6789; verify mon_host and routing/MTU.
  7. Once fixed, re-run VM start and verify it can read/write.

Checklist B: Adding a new Proxmox node to a Ceph-backed cluster

  1. Install Ceph client packages as needed for your Proxmox version.
  2. Copy /etc/ceph/ceph.conf (or ensure /etc/pve/ceph.conf is correct and used consistently).
  3. Copy required keyrings: typically /etc/ceph/ceph.client.pve.keyring.
  4. Verify file permissions: 0600 root:root for keyrings.
  5. Run: ceph -s and rbd -p <pool> ls --id pve --keyring ....
  6. Only then allow migrations/HA onto that node.

Checklist C: Safe-ish key rotation for Proxmox RBD clients

  1. Create or update the Ceph auth entry (ceph auth get-or-create / ceph auth caps), keeping pool scoping correct.
  2. Export the updated keyring file.
  3. Distribute the keyring to all Proxmox nodes (atomically if possible).
  4. Verify hashes match across nodes.
  5. Run RBD open tests from each node using the same --conf that QEMU uses.
  6. Perform a small canary: start one VM per node, do a migration, create a snapshot if you use them.
  7. Only then consider the rotation “done”.

Commands that help automate the checklist validation

cr0x@server:~$ rbd --conf /etc/pve/ceph.conf info vmdata/vm-101-disk-0 --id pve --keyring /etc/ceph/ceph.client.pve.keyring
rbd image 'vm-101-disk-0':
	size 100 GiB in 25600 objects
	order 22 (4 MiB objects)
	snapshot_count: 2
	id: 1a2b3c4d5e6f
	format: 2
	features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
	op_features:
	flags:
	create_timestamp: Tue Dec 24 09:12:33 2025
	access_timestamp: Tue Dec 24 09:12:33 2025
	modify_timestamp: Thu Dec 26 10:01:07 2025

Decision: If this works on every node, you’ve eliminated most auth/keyring causes of “error opening.”
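
To run that same test from every node in one pass, here's a minimal sketch assuming root SSH between nodes; the node names are hypothetical and the image, client, and paths are the assumed values from earlier.

# Run the same open test on every node; any FAIL points at that node's local
# keyring/config. Node names are hypothetical -- use your cluster's.
IMG=vmdata/vm-101-disk-0
for node in pve1 pve2 pve3; do
  ssh "root@${node}" \
    "rbd --conf /etc/pve/ceph.conf info ${IMG} --id pve --keyring /etc/ceph/ceph.client.pve.keyring >/dev/null" \
    && echo "${node}: ok" || echo "${node}: FAIL"
done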

FAQ

1) Why does Proxmox show “error opening” instead of the real Ceph error?

Because the error bubbles through QEMU/librbd layers and gets summarized. The detailed reason is often in journalctl lines showing
“permission denied”, “no such file”, or connection errors. Always pull logs from the node that failed.

2) I can run ceph -s successfully, so why does Proxmox fail?

Your shell may be using a different config file (/etc/ceph/ceph.conf) and different key (client.admin via default keyring).
Proxmox might be using /etc/pve/ceph.conf and client.pve. Test using the same --conf, --id, and --keyring you see in Proxmox logs.

3) Can I just use client.admin to make it go away?

You can, and it will “work,” and it’s a bad habit. It expands blast radius and makes audits painful. Use a dedicated client with pool-scoped caps.
Reserve client.admin for administrative tasks, not routine VM I/O.

4) What are the minimum caps for Proxmox RBD usage?

Typically: mon "profile rbd" and osd "profile rbd pool=<pool>". If you use additional workflows (snapshots, clones, flatten),
you still usually want profile rbd, but you may need to ensure your cluster and clients support the needed ops. Validate by testing the operation with the same identity.

5) Why does it fail only during migration or snapshot?

Because migrations and snapshots exercise different API calls. Listing images isn’t the same as opening an image with certain features, creating snapshots, or cloning.
If it fails on those operations, suspect caps mismatch first.

6) Where does Proxmox store Ceph secrets?

Proxmox stores the storage definition in /etc/pve/storage.cfg. The key itself is typically in a keyring file under /etc/ceph referenced by path.
Some setups embed secrets differently, but the “node-local keyring file” pattern is common and is exactly why node-to-node mismatch happens.

7) How do I tell if it’s a monitor connectivity problem versus auth?

If you see timeouts and “error connecting to the cluster,” validate network reachability to MON ports (3300/6789) and confirm mon_host.
If you see “permission denied” quickly, monitors are reachable and auth/caps are the likely culprit.

8) Do I need to restart Proxmox services after fixing keyrings or caps?

Often no; new tasks will pick up the updated keyring file. But if you changed which config file is used or updated storage definitions,
restarting pvedaemon and retrying the task can remove stale state. Keep it targeted; don’t reboot nodes as therapy.

9) What’s the fastest safe test to validate a fix?

Run rbd info pool/image using the same --conf, --id, and --keyring QEMU uses, from the node that failed.
Then start one VM that uses that image. If you rely on snapshots/clones, test one of those too.

10) Could this be a Ceph bug or data corruption?

It can be, but if the cluster is healthy and the error is “permission denied” or “keyring not found,” it’s not corruption.
Start with auth/config; 95% of “error opening” incidents are self-inflicted paper cuts.

Conclusion: next steps you can do today

If you want “error opening” to stop being a recurring character in your on-call life, do three things:

  1. Standardize what config file QEMU uses (/etc/pve/ceph.conf vs /etc/ceph/ceph.conf) and make them consistent across nodes.
  2. Use a dedicated Ceph client (e.g., client.pve) with pool-scoped profile rbd caps. Stop using client.admin for routine VM I/O.
  3. Make keyrings a first-class deployment artifact: distribute them to every node, verify hashes, and validate access with an automated rbd info test (a minimal sketch follows).
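
For item 3, a minimal daily-check sketch you could drop into cron on each node. The image name, client, and paths are assumptions carried over from this article's examples; adjust them to your setup.

#!/usr/bin/env bash
# Daily RBD access check -- run on every Proxmox node (e.g. from cron).
# All names/paths below are assumptions from this article; adjust to your setup.
CONF=/etc/pve/ceph.conf
ID=pve
KEYRING=/etc/ceph/ceph.client.${ID}.keyring
IMG=vmdata/vm-101-disk-0

if rbd --conf "$CONF" info "$IMG" --id "$ID" --keyring "$KEYRING" >/dev/null 2>&1; then
  logger -t rbd-check "ok: $(hostname) can open ${IMG} as client.${ID}"
else
  logger -t rbd-check "FAIL: $(hostname) cannot open ${IMG} as client.${ID}"
  exit 1
fi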

The good news: once you treat keyrings and caps like production configuration (not tribal knowledge), Ceph becomes predictably boring. That’s the goal.
