Backups are green. Schedules are happy. Grafana looks calm. Then you run the restore you actually need and it faceplants: missing chunks, permission denied, timeout, checksum errors, or a restore that “runs” forever while doing nothing measurable.
This is the failure mode that hurts: you did the responsible thing (back up), but you didn’t prove the only thing the business actually cares about (restore). The good news is that PBS failures are usually boring in the best possible way—misaligned permissions, storage semantics, network realities, or lifecycle jobs (prune/GC/verify) stepping on each other. This checklist is designed to catch the classics fast, and the weird stuff methodically.
Restore failures: a mental model that actually helps
PBS is built for efficient, deduplicated, incremental backups. That design is exactly why “backup succeeds” is not strong evidence that “restore will succeed.” A successful backup can be mostly streaming new chunks into a datastore, while a restore must be able to locate, decrypt, verify, and rehydrate a potentially huge graph of chunks across time, across snapshots, across namespaces, sometimes across permissions and network paths that differ from the backup path.
Backups are mostly “write path”; restores are “read path under stress”
Backups tend to be sequential writes plus some metadata updates. Restores are read-amplified: lots of random-ish chunk reads, decompression, optional verification, and more sensitivity to latency. If your storage backend has quirks (NFS attribute caching, ZFS recordsize mismatch, thin-provisioned SAN that lies, a RAID controller with cache policy surprises), restores will find them.
Three layers must agree: identity, integrity, and location
- Identity: You must be authorized (PBS ACLs), and your client must be presenting the expected credentials or API token. Restore tools often run under a different identity than backup jobs.
- Integrity: The chunk store must be consistent. Missing chunks, corrupt chunks, or incomplete garbage collection can hide until you do a restore or verify.
- Location: The network route, DNS, firewall, MTU, proxy, TLS trust, and time must line up. Backups might traverse one path; restores might traverse another (especially if you restore to a different node).
A decent heuristic: if backups are green but restores fail, assume read path + permissions + lifecycle jobs first, and assume “my backup software lied” last.
Fast diagnosis playbook (first/second/third)
This is the “stop the bleeding” flow. Do it in order. Don’t freestyle. The goal is to find the bottleneck quickly, then deepen only where needed.
First: identify what kind of restore failure you have
- Hard fail immediately: permission denied, authentication error, fingerprint mismatch, repository not found, TLS errors.
- Hard fail mid-stream: missing chunks, checksum errors, I/O errors, read-only filesystem.
- Soft fail (hang/slow): stuck at 0%, “still running,” ETA is fantasy.
Second: confirm the basics that make restores different from backups
- Who is doing the restore? Same PBS user/token as the backup job? Same namespace? Same ACL path?
- Where are you restoring to? Different Proxmox node, different network, different storage target?
- Is PBS under load from prune/GC/verify? Those jobs can be “fine” during backups and brutal during restores.
Third: choose one deep-dive track
- Auth/ACL/TLS track if the error appears before data moves.
- Datastore integrity track if you see missing/corrupt chunks, verify errors, or restore reads fail.
- Storage performance track if restores are slow/hanging and the datastore is on NFS/Ceph/ZFS-on-RAID.
- Network track if you see timeouts, resets, MTU weirdness, or restores fail from only one node.
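The track selection above can be sketched as a tiny classifier over the captured error text. The match patterns below are illustrative fragments of common messages, not an exhaustive catalog of PBS error strings:

```shell
#!/bin/sh
# Route a captured restore error message to a deep-dive track.
# Patterns are illustrative examples, not a complete PBS error catalog.
classify() {
    case "$1" in
        *"permission check failed"*|*"authentication failed"*|*"bad certificate"*|*fingerprint*)
            echo "auth-acl-tls" ;;
        *"missing chunk"*|*checksum*|*"verification failed"*)
            echo "datastore-integrity" ;;
        *"timed out"*|*timeout*|*"connection reset"*|*"connection refused"*)
            echo "network" ;;
        *)
            echo "storage-performance-or-unknown" ;;
    esac
}

classify "permission check failed: missing Datastore.Read"   # → auth-acl-tls
classify "ERROR: missing chunk 3a2f0f9d"                     # → datastore-integrity
```

Anything that falls through to the default bucket is usually the slow/hanging case, which is exactly when you reach for the storage performance track.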
Interesting facts and context (why PBS behaves like this)
- PBS uses a chunk store with deduplication: restores can depend on chunks created weeks ago, not just last night’s backup.
- Prune and garbage collection are separate concepts: prune removes snapshot references; GC reclaims unreferenced chunks later. A snapshot can disappear “logically” before space is reclaimed “physically.”
- Verification is not just a feel-good button: it’s your early warning system for chunk corruption or incomplete writes that didn’t surface during backup.
- Incremental backups reduce write load but can increase restore dependency depth: the more you rely on history, the more you need long-term chunk integrity.
- Client-side encryption changes the failure shape: wrong key or missing key file often looks like “data exists but cannot be read,” because it’s literally true.
- Restores are latency-sensitive: a datastore on high-latency storage (or NFS with poor options) may back up fine but restore painfully due to read amplification.
- Namespaces and ACLs are powerful and sharp: it’s easy to grant backup rights without granting restore rights, especially when tokens are scoped narrowly.
- Time matters more than people think: TLS, ticket validation, and log correlation become messy when nodes drift. You can “authenticate” while still breaking long-running transfers.
- “Backup succeeded” often means “server accepted the stream”: it does not guarantee the storage backend preserved it correctly under later reads (hello, flaky disks and NFS semantics).
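To make the first point concrete: PBS stores chunks content-addressed under the datastore's .chunks/ directory, sharded by the first four hex digits of the digest. A minimal sketch (the datastore path and digest are made-up examples):

```shell
#!/bin/sh
# Sketch: compute where a chunk with a given digest lives on disk.
# The datastore path and digest below are hypothetical examples.
chunk_path() {
    # PBS shards .chunks/ into directories named by the first
    # four hex characters of the chunk digest.
    printf '%s/.chunks/%s/%s\n' "$1" "$(printf '%s' "$2" | cut -c1-4)" "$2"
}

digest="3a2f0f9d3c0a1111222233334444555566667777888899990000aaaabbbbcccc"
chunk_path "/mnt/datastore/vmstore" "$digest"
# → /mnt/datastore/vmstore/.chunks/3a2f/3a2f0f9d3c0a1111222233334444555566667777888899990000aaaabbbbcccc
```

A restore of last night's snapshot can read chunks written months ago out of these directories, which is why long-term chunk integrity matters more than last night's green checkmark.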
Practical tasks: commands, outputs, decisions (12+)
These tasks are meant to be runnable on a typical Proxmox VE + PBS setup. Adjust hostnames, datastore names, and IDs. Each task includes: command, what the output means, and what decision you make next.
Task 1: Check PBS service health (server-side)
cr0x@pbs01:~$ systemctl status proxmox-backup
● proxmox-backup.service - Proxmox Backup Server API and daemon
Loaded: loaded (/lib/systemd/system/proxmox-backup.service; enabled)
Active: active (running) since Tue 2026-02-04 08:11:02 UTC; 3h 12min ago
Docs: man:proxmox-backup-api(1)
man:proxmox-backup-proxy(1)
Main PID: 1123 (proxmox-backup)
Tasks: 14 (limit: 154225)
Memory: 312.4M
CPU: 19min 22.341s
CGroup: /system.slice/proxmox-backup.service
├─1123 proxmox-backup-api
└─1131 proxmox-backup-proxy
Meaning: If it’s not active/running, restores will fail in creative ways (timeouts, 500s, connection refused).
Decision: If not running, fix PBS first (service, disk full, corrupted config). Don’t chase client-side ghosts.
Task 2: Read the restore-time logs, not the backup logs
cr0x@pbs01:~$ journalctl -u proxmox-backup --since "30 min ago" --no-pager
Feb 04 11:52:18 pbs01 proxmox-backup-api[1123]: authentication failed for user 'backup@pbs'
Feb 04 11:52:18 pbs01 proxmox-backup-api[1123]: permission check failed: missing Datastore.Read on /datastore/vmstore
Feb 04 11:52:21 pbs01 proxmox-backup-api[1123]: client disconnected (TLS error: alert bad certificate)
Meaning: PBS often tells you exactly what permission or TLS condition failed. “Datastore.Read” missing is the classic restore-only ACL gap.
Decision: If you see auth/ACL/TLS errors here, stop and fix identity/trust before touching storage.
Task 3: Confirm the PBS storage is active from the node doing the restore (PVE side)
cr0x@pve01:~$ pvesm status --storage pbs-vmstore
Name              Type     Status           Total            Used       Available        %
pbs-vmstore        pbs     active     68719476736      4849664000     63869812736    7.06%
Meaning: “active” is good; pvesm talks to PBS the same way a restore will. If the storage is inactive or missing, the PVE node may still show cached metadata but will fail to read chunks, or fail mid-restore.
Decision: If inactive, fix the PBS side first (datastore mount gone, NFS down, ZFS pool degraded, permissions) and confirm the mount on the PBS host itself (findmnt or df) before attempting any restore again.
Task 4: Check filesystem-level errors (read-only flips are real)
cr0x@pbs01:~$ dmesg -T | tail -n 20
[Mon Feb 4 11:41:07 2026] EXT4-fs error (device sdb1): ext4_find_entry:1453: inode #262145: comm proxmox-backup: reading directory lblock 0
[Mon Feb 4 11:41:07 2026] Aborting journal on device sdb1-8.
[Mon Feb 4 11:41:07 2026] EXT4-fs (sdb1): Remounting filesystem read-only
Meaning: If the backing filesystem remounted read-only, restores will fail (sometimes after “starting fine”). Backups might have succeeded earlier and then you hit the cliff.
Decision: Treat this as a storage incident. Stop PBS writes, repair filesystem, validate datastore integrity afterward.
Task 5: Verify time synchronization (TLS and long sessions hate drift)
cr0x@pbs01:~$ timedatectl
Local time: Tue 2026-02-04 12:03:44 UTC
Universal time: Tue 2026-02-04 12:03:44 UTC
RTC time: Tue 2026-02-04 12:03:45
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
Meaning: If “System clock synchronized: no” on either PBS or PVE nodes, expect certificate validation weirdness, token expiry confusion, and logs that don’t line up.
Decision: Fix time first. It’s fast and removes a whole class of “impossible” failures.
Task 6: Confirm the certificate PBS presents (server-side; this is what clients must trust)
cr0x@pbs01:~$ proxmox-backup-manager cert info
Subject Name: /CN=pbs01
Issuer Name: /CN=pbs01
Valid Since: 2025-09-14 09:20:48
Valid Until: 2035-09-12 09:20:48
Fingerprint (sha256): 7E:1A:3B:9F:4B:41:2C:2A:9B:AD:9C:3D:25:0C:11:1A:AF:3A:0F:DE:41:19:92:33:6E:9B:7A:45:4F:0C:8A:21
Meaning: Fingerprint mismatch between what the client trusts and what PBS presents causes sudden restore failures after reinstallation or certificate regeneration.
Decision: If PBS was rebuilt or its cert changed, re-trust the new fingerprint on PVE nodes (and document it).
Task 7: Validate the repository config on the node doing the restore
cr0x@pve01:~$ cat /etc/pve/storage.cfg
pbs: pbs-vmstore
	datastore vmstore
	server pbs01
	content backup
	fingerprint 7E:1A:3B:9F:4B:41:2C:2A:9B:AD:9C:3D:25:0C:11:1A:AF:3A:0F:DE:41:19:92:33:6E:9B:7A:45:4F:0C:8A:21
	username backup@pbs!backup-token
	namespace prod
Meaning: Restores happen from the PVE node’s view of PBS. An API token is referenced as user@realm!tokenname in the username field, and its secret lives separately in /etc/pve/priv/storage/pbs-vmstore.pw. A wrong namespace, wrong token ID, missing token secret, or stale fingerprint will break restore even if backups from some other node still work.
Decision: Align the repository definition on every node that might perform restores. Don’t assume the “backup node” is the “restore node.”
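When comparing the fingerprint from the PBS certificate against what a node has configured, case and colon separators are a common source of false mismatches. A small helper sketch for eyeball-free comparison:

```shell
#!/bin/sh
# Normalize certificate fingerprints before comparing, so that
# "7e:1a:3b..." and "7E1A3B..." compare equal.
norm_fp() {
    printf '%s' "$1" | tr -d ':' | tr 'abcdef' 'ABCDEF'
}

same_fp() {
    [ "$(norm_fp "$1")" = "$(norm_fp "$2")" ]
}

if same_fp "7e:1a:3b:9f" "7E1A3B9F"; then echo "match"; else echo "MISMATCH"; fi
# → match
```

Feed it the fingerprint string from the server and the one from the node's config; if this says MISMATCH, the cert was regenerated and the client trust is stale.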
Task 8: Check PBS ACLs for restore permissions (server-side)
cr0x@pbs01:~$ proxmox-backup-manager acl list
┌─────────────────┬────────────────────┬───────────┬─────────────────┐
│ ugid            │ path               │ propagate │ roleid          │
├─────────────────┼────────────────────┼───────────┼─────────────────┤
│ backup@pbs      │ /datastore/vmstore │         1 │ DatastoreBackup │
│ restore-ops@pbs │ /datastore/vmstore │         1 │ DatastoreReader │
│ root@pam        │ /                  │         1 │ Admin           │
└─────────────────┴────────────────────┴───────────┴─────────────────┘
Meaning: A role like “DatastoreBackup” may be sufficient for backups but not for browsing/restoring, depending on your setup and tooling. Reader permissions (or more) may be required for restore workflows.
Decision: Ensure the identity used for restore has the necessary read permissions on the datastore and namespace. If you’re using API tokens, ensure the token is not more restricted than the user.
Task 9: List snapshots and ensure you’re restoring what you think you are
cr0x@pbs01:~$ proxmox-backup-client snapshot list --repository backup@pbs@pbs01:vmstore --ns prod | head
vm/101/2026-02-03T01:00:12Z
vm/101/2026-02-02T01:00:10Z
vm/101/2026-02-01T01:00:09Z
ct/203/2026-02-03T01:10:05Z
ct/203/2026-02-02T01:10:02Z
Meaning: If the snapshot you want isn’t listed under the namespace you’re using, restores will “fail” by not finding anything.
Decision: Confirm namespace and backup group type (vm vs ct vs host). Don’t waste an hour restoring the wrong thing flawlessly.
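A snapshot path encodes the group type, backup ID, and timestamp, and plain parameter expansion is enough to split it before deciding which restore workflow (VM vs CT vs host) applies:

```shell
#!/bin/sh
# Split a PBS snapshot path into its parts with shell parameter expansion.
snap="vm/101/2026-02-03T01:00:12Z"

btype="${snap%%/*}"    # group type: vm, ct, or host
rest="${snap#*/}"
bid="${rest%%/*}"      # backup ID: 101
btime="${rest#*/}"     # snapshot timestamp

echo "type=$btype id=$bid time=$btime"
# → type=vm id=101 time=2026-02-03T01:00:12Z
```

Trying to restore ct/203 with a VM workflow (or vice versa) produces exactly the kind of "not found" failure that wastes an hour.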
Task 10: Run a datastore verify and interpret the results (integrity track)
cr0x@pbs01:~$ proxmox-backup-manager verify vmstore
starting datastore verify: vmstore
found 8123 snapshot(s)
verified 131072 chunks (12.5 GiB)
ERROR: missing chunk 3a2f0f9d3c0a... referenced by vm/101/2026-02-03T01:00:12Z
ERROR: verification failed: 1 errors found
Meaning: “missing chunk” is not a cosmetic issue. That snapshot (and possibly others) cannot be restored completely.
Decision: Stop assuming. Now you do incident triage: identify blast radius (which groups/snapshots), check storage backend health, and decide whether you can restore from an older snapshot or a different datastore/replica.
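When verify reports missing chunks, the first triage question is blast radius: which snapshots are affected? A sketch that extracts the affected snapshot list from a saved verify log (the log content here is fabricated, in the same shape as the output above):

```shell
#!/bin/sh
# Build the list of snapshots hit by missing-chunk errors in a verify log.
# The log below is a made-up example for illustration.
cat > /tmp/verify.log <<'EOF'
verified 131072 chunks (12.5 GiB)
ERROR: missing chunk 3a2f0f9d3c0a referenced by vm/101/2026-02-03T01:00:12Z
ERROR: missing chunk 77aa12bc0def referenced by vm/101/2026-02-02T01:00:10Z
ERROR: verification failed: 2 errors found
EOF

# Unique snapshots referenced by missing-chunk errors = the blast radius
affected="$(grep -oE 'referenced by [a-z]+/[0-9]+/[0-9TZ:-]+' /tmp/verify.log \
    | sed 's/referenced by //' | sort -u)"

printf '%s\n' "$affected"
# → vm/101/2026-02-02T01:00:10Z
# → vm/101/2026-02-03T01:00:12Z
```

Any snapshot not on that list is a candidate "last known-good" restore point.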
Task 11: Check prune/GC schedules and current running tasks (contention track)
cr0x@pbs01:~$ proxmox-backup-manager task list
UPID:pbs01:0000A1B3:0001C2D4:67A1F2C3:garbage_collection:vmstore:root@pam:
UPID:pbs01:0000A1C0:0001C2F1:67A1F2DA:verify:vmstore:root@pam:
Meaning: GC + verify while you attempt a big restore is like scheduling roadwork during an evacuation. It can work, but it’s a choice.
Decision: For urgent restores, pause or reschedule heavy maintenance tasks. Then re-run restore and observe if the failure mode changes from “timeout/slow” to “works.”
Task 12: Measure datastore I/O latency the simple way (is storage lying?)
cr0x@pbs01:~$ iostat -x 1 5
Linux 6.5.11 (pbs01) 02/04/2026 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
4.12 0.00 2.98 22.45 0.00 70.45
Device r/s w/s rkB/s wkB/s await svctm %util
sdb 82.0 310.0 9120.0 50240.0 87.35 3.12 99.80
Meaning: High await and near-100% utilization indicates the disk subsystem is saturated or suffering. Restores get punished first because they’re read-heavy and latency-sensitive.
Decision: If await is high, you’re not debugging PBS; you’re debugging storage performance and contention. Consider moving the datastore, adding cache, or reducing competing jobs.
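A rough triage rule for the await column (milliseconds): single digits are healthy, tens of milliseconds mean contention, and values approaching 100 ms mean restores are effectively queuing behind the disk. The thresholds below are illustrative rules of thumb for this sketch, not official limits:

```shell
#!/bin/sh
# Rough classification of a device's await value (ms) from iostat -x.
# Thresholds are rules of thumb for this sketch, not official limits.
judge_await() {
    awk -v a="$1" 'BEGIN {
        if (a < 10)      print "ok"
        else if (a < 50) print "degraded"
        else             print "saturated"
    }'
}

judge_await 3.2     # → ok
judge_await 87.35   # → saturated
```

Calibrate the numbers to your own hardware; the point is to classify before you tune.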
Task 13: Confirm network stability and MTU assumptions (timeout track)
cr0x@pve01:~$ ping -M do -s 8972 -c 3 pbs01
PING pbs01 (10.10.10.20) 8972(9000) bytes of data.
ping: local error: message too long, mtu=1500
ping: local error: message too long, mtu=1500
ping: local error: message too long, mtu=1500
--- pbs01 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss
Meaning: Someone assumed jumbo frames end-to-end. They were wrong. Backups might tolerate fragmentation or different flow patterns; restores might hit a path that doesn’t.
Decision: Either fix MTU consistently across the path or stop pretending and use 1500 everywhere. Mixed MTU is a slow-motion outage generator.
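The 8972 in that ping is not arbitrary: the largest ICMP payload that fits an IPv4 MTU without fragmentation is the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header. A one-liner to compute the right -s value for any MTU:

```shell
#!/bin/sh
# Largest ping -s payload that fits a given IPv4 MTU unfragmented:
# MTU - 20 (IPv4 header) - 8 (ICMP header).
max_icmp_payload() {
    echo $(( $1 - 28 ))
}

max_icmp_payload 9000   # → 8972
max_icmp_payload 1500   # → 1472
```

If `ping -M do -s 1472` succeeds but `-s 8972` fails, some hop in the path is not doing jumbo frames, whatever the switch config claims.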
Task 14: Confirm DNS and reverse DNS (the boring culprit)
cr0x@pve01:~$ getent hosts pbs01
10.10.10.20 pbs01 pbs01.internal
Meaning: If name resolution differs between nodes, you can end up trusting one certificate name but connecting via another, or routing differently for restores.
Decision: Standardize how PVE nodes refer to PBS (name vs IP), and make certificate CN/SAN match that reality.
Task 15: Attempt a file-level restore mount (tests read path without overwriting guests)
cr0x@pve01:~$ proxmox-backup-client mount ct/203/2026-02-03T01:10:05Z root.pxar --repository backup@pbs@pbs01:vmstore --ns prod /mnt/pbs-test
mounted file system at "/mnt/pbs-test"
Meaning: If the mount fails with missing chunks or permission errors, you’ve reproduced the restore failure in a low-risk way. Note that mount works on .pxar file archives; for VM image archives (e.g. drive-scsi0.img), use proxmox-backup-client map to expose the disk as a loop device instead.
Decision: Use this as a pre-flight test after any fix. If the mount works and browsing files is fine, a full restore is much more likely to succeed.
Task 16: Inspect backup group ownership and encryption mode (when keys are the problem)
cr0x@pve01:~$ proxmox-backup-client snapshot list --repository backup@pbs@pbs01:vmstore --ns prod --output-format json-pretty | head -n 13
[
  {
    "backup-id": "101",
    "backup-time": 1706922012,
    "backup-type": "vm",
    "files": [
      {
        "crypt-mode": "encrypt",
        "filename": "drive-scsi0.img.fidx"
      },
      {
        "crypt-mode": "encrypt",
        "filename": "qemu-server.conf.blob"
Meaning: Snapshot listing can work even when restore would fail, because listing metadata is cheaper than decrypting payload. The crypt-mode field tells you whether the payload is client-side encrypted, which means a restore additionally needs the key on the restore host.
Decision: If you use client-side encryption, validate the key files are present and accessible on the restore node (and protected properly). Don’t “temporarily” disable encryption for convenience and then forget.
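A cheap pre-flight for encrypted restores is checking that the key file is actually readable on the restore node before you start. The path below is the client's default key location; adjust it if you pass --keyfile explicitly:

```shell
#!/bin/sh
# Pre-flight: verify the client encryption key is readable before a restore.
check_key() {
    if [ -r "$1" ]; then
        echo "key readable: $1"
    else
        echo "MISSING or unreadable key: $1" >&2
        return 1
    fi
}

# Default key location used by proxmox-backup-client when none is given
# explicitly; the || true keeps this sketch from aborting on hosts without one.
check_key "${HOME}/.config/proxmox-backup/encryption-key.json" || true
```

Run it as the same user that will perform the restore; root's key and an operator's key are not the same file.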
Joke #1: Backups are like parachutes—having one is nice, but you only learn if it works when you’re already having a day.
Common mistakes: symptom → root cause → fix
1) “Permission denied” during restore, but backups run nightly
Symptom: Restore fails immediately; logs show missing Datastore.Read or namespace permission.
Root cause: Backup token/user has rights to push backups but not to browse/restore snapshots, or restore is attempted from a different node with a different token.
Fix: Grant the restore identity read permissions on the datastore path and the correct namespace. Align /etc/pve/storage.cfg across nodes. Prefer a dedicated “restore-ops” identity with controlled scope.
2) “Fingerprint mismatch” after PBS maintenance
Symptom: Client refuses to connect; errors about certificate fingerprint or bad certificate.
Root cause: PBS certificate was regenerated (reinstall, hostname change), but PVE nodes still trust the old fingerprint in storage.cfg.
Fix: Update the fingerprint everywhere. Standardize the PBS identity (stable hostname) and treat cert regeneration as a change-controlled event.
3) Restore fails mid-way with “missing chunk” or verification errors
Symptom: Restore starts, then fails with missing chunk references; datastore verify shows errors.
Root cause: Underlying storage corruption, incomplete writes, filesystem errors, flaky disks, or a datastore moved/restored incorrectly (rsync without xattrs/permissions, snapshot inconsistency).
Fix: Stop write load, fix the storage layer (SMART, RAID, ZFS scrub, fsck where appropriate), run verify, identify last known-good snapshot, and restore from that. If you have a replicated datastore, fail over to it.
4) Restore “hangs” or is extremely slow; backups are fine
Symptom: Restore job runs but progress crawls; iowait spikes; PBS feels sluggish.
Root cause: Read amplification meets slow storage (NFS latency, SMR drives, overloaded HDD RAID, ZFS tuned for throughput not latency), or prune/GC/verify running concurrently.
Fix: Pause heavy maintenance tasks during restores, measure disk latency (iostat), and move the datastore to storage designed for random reads (SSD, proper caching, sane RAID level). If on NFS, revisit mount options and server performance.
5) “Snapshot exists but cannot be found” in the UI or via CLI
Symptom: Backups show up in one view but not for the restore workflow; “group not found” errors.
Root cause: Namespace mismatch, different repository definition, or attempting to restore a VM backup as a CT (or vice versa).
Fix: Confirm --ns and group type. List snapshots from the same node and same identity used by the restore.
6) Restore fails only from one Proxmox node
Symptom: Restore works from pve02 but fails from pve01 with timeouts or TLS errors.
Root cause: Per-node storage.cfg drift, firewall rules, routing, DNS differences, or MTU mismatch on a specific path.
Fix: Compare configs. Run the same connectivity tests from both nodes. Standardize and automate config distribution where possible.
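A low-tech way to catch repository config drift between standalone nodes or separate clusters is comparing each node's copy byte for byte. The files below are local stand-ins for copies you would fetch from the nodes being compared:

```shell
#!/bin/sh
# Sketch: detect storage.cfg drift by diffing per-node copies.
# The two files are stand-ins for copies fetched from different nodes.
printf 'pbs: pbs-vmstore\n\tnamespace prod\n' > /tmp/storage.pve01
printf 'pbs: pbs-vmstore\n\tnamespace dev\n'  > /tmp/storage.pve02

if cmp -s /tmp/storage.pve01 /tmp/storage.pve02; then
    echo "configs match"
else
    echo "DRIFT detected"
    diff -u /tmp/storage.pve01 /tmp/storage.pve02 || true
fi
```

Automate this as a periodic check and drift becomes a ticket instead of an incident.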
7) Restores fail after “optimizing” retention/GC
Symptom: After changing prune schedules or retention, old restores fail, or restores become slow during business hours.
Root cause: GC moved into peak time, or prune policies removed snapshots you assumed you still had. Sometimes “keep-last” works until you realize you needed “keep-daily” for compliance.
Fix: Put GC and verify in off-hours. Model retention policies against real restore requirements (RPO/RTO and audit needs), not against disk usage anxiety.
Three corporate mini-stories (how this fails in real life)
Mini-story 1: The incident caused by a wrong assumption
The company had a clean Proxmox cluster and a single PBS appliance on fast-enough storage. Backups were green for months. Nobody worried. Then a host failed and they needed to restore a production VM to a different node. The restore failed instantly: permission denied.
The wrong assumption was subtle: “the token that can back up can also restore.” Their backup token was scoped tightly (good instinct), but it only had rights aligned with the backup workflow. Restores were never tested from a non-primary node, and the token stored in /etc/pve/storage.cfg on the other nodes was… different. An engineer had copied a config snippet weeks ago and used a placeholder user that didn’t have datastore read rights.
They burned time debugging storage and the network because the backup logs looked perfect. Meanwhile, the PBS logs were blunt: missing Datastore.Read. Once they aligned the repository config across nodes and created a dedicated restore role with explicit ACLs, restores worked immediately.
The follow-up was the real fix: they added a monthly “restore-from-a-random-node” drill and treated storage.cfg as configuration that must not drift. The green backup checkmark went back to being a useful signal, not a comforting lie.
Mini-story 2: The optimization that backfired
A different org had a cost-cutting sprint. They moved the PBS datastore from local disks to an existing NFS filer because “it’s just backups.” Backups remained fine—mostly big sequential writes overnight, dedupe doing its thing, nobody complaining. Then a ransomware simulation turned into a restore simulation. Restores were slow enough to be functionally broken.
The NFS backend had decent throughput but mediocre latency under concurrency. Restores hammered it with chunk reads while the filer also served home directories and CI artifacts. During the restore window, GC also ran because someone rescheduled maintenance to “use the daytime when people are around.” That decision turned a slow restore into a hopeless one.
They tried tuning NFS mount options and added network bandwidth. It helped a bit, but not enough. The underlying issue was physics: chunked, deduplicated restores are not a sequential workload, and they will punish latency and metadata performance. Eventually, they moved PBS back to local SSD-backed storage and set GC/verify to off-hours only.
They also learned a cruel lesson: optimizing for “backup success” is not optimizing for “business recovery.” Those are related goals, not identical ones.
Mini-story 3: The boring but correct practice that saved the day
A regulated environment. Lots of process. The kind of place where you can’t just “SSH in and see what happens.” The storage team insisted on quarterly restore tests with evidence: a restore to an isolated network, boot validation, and a checksum comparison for a known dataset. Everyone groaned because it was predictable work and nobody got promoted for it.
One quarter, the restore test failed. Not catastrophically—just a missing chunk error on a subset of snapshots. Because the test was routine, it happened early enough that the affected data was still available from an older snapshot and from an offline copy. They treated it as a storage integrity incident, not as a one-off glitch.
The root cause ended up being intermittent disk errors on the PBS host that hadn’t yet triggered obvious alarms. SMART counters were creeping up, but the system was still “working.” Verification caught it. Restore testing proved it mattered. They replaced the disks, re-verified, and documented the incident with calm language and clear actions.
When a real production restore was needed months later, it worked. Nobody cheered. That’s the point.
Checklists / step-by-step plan (make restores boring)
Checklist A: When a restore fails right now (triage)
- Capture the exact error from the restore UI/CLI and the PBS journal around the same timestamp.
- Classify: auth/ACL/TLS vs missing chunks vs timeout/slow vs target storage errors.
- Confirm identity: same user/token/namespace on the node performing restore.
- Confirm datastore state: mounted, writable, no filesystem remount read-only, no pool degradation.
- Check running tasks: pause GC/verify if you need the restore fast.
- Do a low-risk test: file-level mount of the snapshot (proves read path) before full VM restore.
- If integrity errors: run datastore verify, determine blast radius, choose last known-good restore point.
Checklist B: Weekly hygiene that prevents “green backups, dead restores”
- Run datastore verify on a schedule sized to your datastore and change rate. Rotate through snapshots if needed.
- Review prune and GC timing so it never competes with restore windows or peak production.
- Check storage health (SMART, RAID, ZFS scrubs) and alert on early warning signs, not just failures.
- Standardize repository config across all PVE nodes; treat drift as a bug.
- Perform a restore drill (mount + selective file restore + full VM restore quarterly at minimum).
Checklist C: Hardening decisions (what I’d actually do)
- Separate identities: one token for backup jobs, another for restore operations, both least-privileged but not self-sabotaging.
- Prefer local SSD-backed datastore for PBS if you care about RTO. NFS can work, but you must design for latency and metadata, not just throughput.
- Use replication if the datastore matters: it’s your insurance against corruption and single-host failure. Backups that can’t be restored are performance art.
- Schedule verify + GC intentionally: off-hours, and never overlapping with your backup ingest peak if you can avoid it.
- Document the certificate fingerprint process and treat PBS rebuilds as changes that require re-trust steps.
Joke #2: If your retention policy is “keep everything forever,” congratulations—you’ve invented a very expensive way to avoid making decisions.
One operations quote to keep you honest
Everything fails, all the time.
— Werner Vogels
It’s short, not comforting, and operationally correct. Design your restore practice around it.
FAQ
1) Why can backups succeed if the datastore has integrity problems?
Because the backup path can succeed at accepting new chunks and writing metadata even while older chunks are missing/corrupt. Restores often need those older chunks to reconstruct a snapshot. Verification and restore drills are what surface this.
2) What’s the fastest safe restore test without overwriting a VM?
Use proxmox-backup-client mount against a recent snapshot and browse a few files. It exercises authentication, namespace access, chunk reads, and decompression with minimal risk.
3) Is NFS a bad idea for PBS datastores?
Not inherently, but it’s easy to get wrong. Backups are write-friendly; restores are read-latency-sensitive. If your NFS server is shared, has latency spikes, or has quirky caching, restores will suffer first. If you must use NFS, measure restore performance under load and schedule maintenance jobs carefully.
4) How do namespaces cause “restore not found” issues?
Namespaces partition backup groups. If your PVE node is configured to use namespace prod but the backups were written to the root namespace (or another), listing may differ and restore workflows won’t find the snapshot where you expect it. Always confirm with snapshot list using the same --ns.
5) What’s the relationship between prune and garbage collection?
Prune removes snapshot references based on retention. Garbage collection reclaims chunks no longer referenced. You can prune aggressively and still not reclaim space until GC runs. Conversely, running GC at the wrong time can starve restores and backups of I/O.
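The selection half of that story is easy to demonstrate in miniature. A keep-last sketch over the snapshot timestamps listed earlier (prune drops the references of everything not selected; the chunks stay on disk until GC):

```shell
#!/bin/sh
# Miniature keep-last selection: the newest N snapshots survive, the rest
# lose their references (their chunks are reclaimed later by GC).
snapshots="2026-02-01T01:00:09Z
2026-02-02T01:00:10Z
2026-02-03T01:00:12Z"

keep_last() {
    printf '%s\n' "$snapshots" | sort -r | head -n "$1"
}

keep_last 2
# → 2026-02-03T01:00:12Z
# → 2026-02-02T01:00:10Z
```

Real prune policies combine keep-last with keep-daily/weekly/monthly buckets, but the principle is the same: selection first, reclamation later.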
6) Restores fail only from one node. What should I compare first?
/etc/pve/storage.cfg (username/token/namespace/fingerprint), DNS resolution (getent hosts), routing/firewall differences, and MTU settings. Remember that /etc/pve is cluster-shared, so within one cluster storage.cfg is identical by design; config drift shows up between separate clusters or standalone nodes, while per-node DNS, firewall, and MTU differences are the usual culprits inside a cluster.
7) How often should I run datastore verify?
Often enough that you detect corruption before your retention window removes good restore points. For many environments: daily or weekly verify of recent snapshots, plus a rolling verify across older snapshots. The exact schedule depends on datastore size and performance headroom.
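Back-of-envelope math helps size a rolling verify. Using the 8123 snapshots from the verify output above and an assumed nightly verification rate (600 is a made-up number here; measure your own):

```shell
#!/bin/sh
# Days for a rolling verify to cover every snapshot once, given a
# measured per-night verification rate (ceiling division).
days_to_cover() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

days_to_cover 8123 600   # → 14
```

If that number exceeds your retention window, corruption can outlive your last good restore point before verify ever sees it.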
8) What does “missing chunk” usually mean, and can it be repaired?
It means a referenced chunk file is absent from the datastore, often due to storage loss, corruption, or an incomplete migration/copy. Repair is not magical; you typically restore from a different snapshot that doesn’t reference the missing chunk, or from a replicated datastore, or from an independent copy.
9) Do I need separate tokens for backup and restore?
You don’t need them, but you should. It reduces blast radius and makes auditing sane. Just don’t scope the backup token so tightly that nobody can restore under pressure.
10) Why do restores expose performance issues more than backups?
Deduplicated restores are chunk-read-heavy and can be random I/O intensive. Backups tend to be more sequential and forgiving. If your storage backend has high latency or gets saturated, restores will be the first thing to feel “broken.”
Next steps (what to do this week)
- Run a restore drill from a non-primary node: mount a snapshot, then restore one VM into an isolated network. Record the steps and timing.
- Schedule verify and GC intentionally: off-hours, no overlap with backup ingest, and not during your realistic restore window.
- Audit identities and ACLs: ensure the restore path has Datastore.Read and correct namespace access. Stop relying on “it worked for backups.”
- Measure storage latency under restore-like load: if iostat shows ugly await, treat it as a storage design issue, not a PBS bug.
- Standardize repository configuration: same fingerprint, same namespace conventions, same token management across all PVE nodes.
The end goal is not “successful backups.” The end goal is “boring restores.” If you can restore on demand, under pressure, without heroics, you’ve built something real.