Proxmox on ZFS: the backup strategy that doesn’t lie (snapshots vs real backups)

If you run Proxmox on ZFS, you can create snapshots so fast it feels like cheating. And that’s the trap.
The day you need a real restore—after a pool fault, a fat-fingered rm, or a hypervisor upgrade that goes sideways—your “snapshot strategy” may turn out to be a strategy for feeling calm, not for getting data back.

Backups are a product: they must ship restores. ZFS snapshots are a feature: they ship convenience. Stop mixing them up.
Here’s the hard-edged playbook that keeps Proxmox clusters honest.

The core claim: snapshots are not backups

A ZFS snapshot is a point-in-time reference to blocks in the same pool. It’s fantastic for quick rollback and for “oops” moments.
But it lives and dies with the pool. If the pool is gone, your snapshots are gone. If the pool is encrypted and the key is gone, your snapshots are still “there” in the same way your files are “there” on a drive you can’t unlock: technically, spiritually, not operationally.

A backup is something else: it’s independent. It’s stored elsewhere, controlled by different failure modes, and it is routinely proven by restores.
If your “backup” doesn’t leave the failure domain, you don’t have a backup; you have a time machine bolted to the same car that just drove into the lake.

The operational litmus test is blunt: if the hypervisor motherboard fries, can you restore your VMs on another host without heroic forensics?
If the answer involves “well, the snapshots are on the pool,” that’s not a plan. That’s an anecdote.

Joke #1: A snapshot is like a save button that only works if your laptop doesn’t fall into the ocean. Great feature, questionable disaster recovery.

Interesting facts and short history (that actually helps decisions)

  • ZFS snapshots are copy-on-write: they don’t “freeze” data by copying it; they preserve block references and write new blocks for changes.
  • ZFS was designed with end-to-end checksums: data and metadata blocks are checksummed, so silent corruption is detectable during reads and scrubs.
  • Solaris heritage matters: ZFS grew up in enterprise environments where “rollback” and “replication” were separate ideas—snapshots and send/receive play different roles.
  • zfs send is snapshot-based streaming: replication and many “backup-like” workflows rely on snapshots as the unit of transfer.
  • ZFS scrubs are patrol reads: they’re not backups, but they’re how ZFS finds bit rot before it becomes a restore event.
  • Proxmox historically leaned on vzdump: classic backups were file-based archives to storage targets; modern deployments increasingly use Proxmox Backup Server (PBS) for chunked, deduplicated backups.
  • Snapshots became popular because they’re cheap: the “near-zero cost” feeling leads teams to overuse them as retention, then get surprised by space accounting (see the quick demo after this list).
  • RAID is not a backup—still: RAID (including RAIDZ) reduces downtime for disk failures but does nothing for logical deletion, ransomware, or “oops I wiped the dataset.”
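
You can watch that copy-on-write space accounting happen with a throwaway dataset. A minimal demo, assuming the Proxmox default mountpoint under /rpool and with illustrative numbers:

cr0x@server:~$ zfs create rpool/cowdemo
cr0x@server:~$ dd if=/dev/urandom of=/rpool/cowdemo/blob bs=1M count=512 status=none
cr0x@server:~$ zfs snapshot rpool/cowdemo@before
cr0x@server:~$ zfs list -t snapshot -o name,used rpool/cowdemo@before
NAME                  USED
rpool/cowdemo@before    0B
cr0x@server:~$ dd if=/dev/urandom of=/rpool/cowdemo/blob bs=1M count=512 status=none
cr0x@server:~$ zfs list -t snapshot -o name,used rpool/cowdemo@before
NAME                  USED
rpool/cowdemo@before  512M
cr0x@server:~$ zfs destroy -r rpool/cowdemo

The snapshot cost nothing at creation; rewriting the file made it pin the old 512M of blocks. Multiply that by months of retention on a write-heavy ZVOL and the “mysterious” space loss stops being mysterious.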

Threat model: what you are really defending against

Your backup strategy should map to failure modes, not vibes. Here are the ones that actually show up:

1) The human factor (a.k.a. “I am the outage”)

Accidental deletes, wrong dataset destroyed, wrong node reinstalled, wrong disk wiped. This is not a rare event; it’s a recurring tax.
Snapshots are excellent for this when the pool is healthy.

2) Storage failure domain loss

HBA dies and takes multiple disks offline, backplane fault, firmware bug, pool corruption, or just the slow-motion horror of a pool that won’t import after a reboot.
If your snapshots are on that pool, they’re a nice memorial.

3) Ransomware and credential compromise

If an attacker gets root on the hypervisor, they can delete snapshots. They can delete local backups. They can zfs destroy your history with one line.
You need immutability and separation of credentials. This is where PBS shines, but only if you deploy it with sane auth boundaries.

4) “Everything works” until it doesn’t: restore uncertainty

Backups fail quietly. Restores fail loudly. The only backup that counts is one you’ve restored recently, ideally into a network-isolated test environment that resembles production.

5) Performance regressions that turn backups into their own incident

Backups can destroy latency, especially with small-block random I/O workloads, busy pools, or aggressive snapshot retention.
You need to manage when backups run, what they touch, and what they compete with.

How Proxmox uses ZFS (and why it matters for backups)

Proxmox usually stores VM disks as ZVOLs (block devices) when you select ZFS storage. Containers may use datasets.
The distinction matters: ZVOL backups and dataset backups behave differently, especially for snapshotting and space use.
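
A quick way to see which type you are actually dealing with; the names below follow typical Proxmox conventions (vm-NNN-disk-N for VM ZVOLs, subvol-NNN-disk-N for containers) and the output is illustrative:

cr0x@server:~$ zfs list -o name,type,volblocksize,recordsize -r rpool/data | head -n 4
NAME                          TYPE        VOLBLOCK  RECSIZE
rpool/data                    filesystem         -     128K
rpool/data/subvol-200-disk-0  filesystem         -     128K
rpool/data/vm-101-disk-0      volume           16K        -

Volumes get snapshotted and replicated just like filesystems, but space accounting, block-size tuning, and in-guest consistency behave differently, which is why the rest of this article keeps calling out the distinction.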

VMs on ZVOLs: fast snapshots, tricky space

A VM disk as a ZVOL is a block device with a volblocksize (often 8K/16K). Writes come from the guest filesystem, not from ZFS-aware semantics.
Snapshots will preserve old blocks; long retention can create serious space pressure on write-heavy VMs (databases love to do this to you).

Containers on datasets: more visibility, but still not magic

Datasets can be snapped and replicated cleanly, and ZFS can do better compression decisions based on recordsize.
But containers still suffer from “it was on the same pool,” meaning snapshots don’t save you from pool loss.

Proxmox backups: vzdump vs Proxmox Backup Server

vzdump creates archive files (.vma.zst, .tar.zst, etc.) on a storage target. It can use snapshots to get consistency while copying.
PBS stores backups in chunks, deduplicates across VMs, and supports encryption and (optionally) retention policies that don’t involve “delete old tarballs by hand.”

The best setups often use both concepts:
local ZFS snapshots for fast rollback and operational mistakes;
PBS (or another off-host backup) for actual disaster recovery.

Snapshots: what they do well, what they absolutely don’t

They are great for:

  • Fast rollback after bad package upgrades, broken VM config changes, or an application deployment that detonates.
  • Short-term safety rails during risky operations: migrations, filesystem changes, schema changes.
  • Replication inputs: snapshots are the unit of incremental transfer with zfs send -i (a minimal sketch follows this list).
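
Here is a minimal sketch of that flow, assuming the replica target used later in this article (backupzfs, dataset bpool/replica) and that the last common snapshot never expires on either side:

cr0x@server:~$ zfs snapshot rpool/data/vm-101@auto-A
cr0x@server:~$ zfs send rpool/data/vm-101@auto-A | ssh root@backupzfs zfs receive bpool/replica/vm-101     # full seed, once
cr0x@server:~$ zfs snapshot rpool/data/vm-101@auto-B
cr0x@server:~$ zfs send -i @auto-A rpool/data/vm-101@auto-B | ssh root@backupzfs zfs receive bpool/replica/vm-101     # only the delta

The incremental only works while @auto-A exists on both sides; prune too aggressively and you are back to full sends.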

They are not great for:

  • Pool loss: they don’t leave the pool.
  • Credential compromise: root can destroy them.
  • Long retention without capacity planning: old blocks pin space; your pool fills “mysteriously.”
  • Application-consistent recovery: a crash-consistent snapshot is often okay for journaling filesystems, not always okay for databases without coordination.

Snapshot retention should be short and intentional. Hours to days, maybe a couple of weeks for “oops” windows—depending on change rate.
If you’re keeping months of snapshots on production pools, you’re not doing “backup.” You’re doing “space leak with an API.”

Real backups: what “counts” in production

A real backup must meet three conditions:

  1. Separation: stored off-host or at least off-pool, ideally off-site. Different credentials, different blast radius.
  2. Recoverability: you can restore to a clean environment, within an RTO you can live with.
  3. Verification: you’ve performed test restores recently. Not “checked logs,” not “validated checksums,” but actually booted or verified data integrity.

In Proxmox land, the most common “real backup” implementations are:

  • Proxmox Backup Server: best default for many teams: dedupe, encryption, pruning, verification jobs.
  • vzdump to NFS/SMB/another ZFS box: workable and simple, but be honest about immutability and verification.
  • ZFS send/receive to a backup target: powerful, transparent, fast for incrementals; also easy to misconfigure into a synchronized disaster.

“Backups without restore drills” are compliance theater. And yes, theater has budgets.

Quote (paraphrased idea) from W. Edwards Deming: Without measurement, you’re not managing—just guessing. Backups are measurement by restore.

A backup design that survives bad days

Here’s a strategy that works for small shops and scales up without changing its personality:

Layer 1: Local ZFS snapshots for fast rollback

  • Keep them short-lived.
  • Snapshot before risky changes and on a schedule (e.g., hourly for 24h, daily for 7d).
  • Automate with a snapshot tool or your own cron, but keep naming consistent (a minimal sketch follows this list).
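
If you do roll your own, keep it boring. A minimal sketch, assuming a hypothetical /usr/local/sbin/zfs-auto-snap.sh run hourly from cron; the dataset list, prefix, and retention window are placeholders to adapt:

#!/usr/bin/env bash
# zfs-auto-snap.sh -- hypothetical helper, not a Proxmox or ZFS tool.
# Takes an hourly snapshot of each listed dataset and prunes snapshots
# with the same prefix once they are older than RETAIN_HOURS.
set -euo pipefail

DATASETS="rpool/data/vm-101 rpool/data/vm-102"   # adjust to your environment
PREFIX="auto"
RETAIN_HOURS=72

now=$(date +%Y-%m-%d_%H%M)
cutoff=$(date -d "-${RETAIN_HOURS} hours" +%s)

for ds in $DATASETS; do
    zfs snapshot "${ds}@${PREFIX}-${now}"
    # -H -p: no headers, creation time as epoch seconds -- easy to compare
    zfs list -H -p -t snapshot -o name,creation -r "$ds" | while read -r snap created; do
        case "$snap" in
            "${ds}@${PREFIX}-"*) ;;   # our naming scheme: eligible for pruning
            *) continue ;;            # manual snapshots like @pre-upgrade: never touch
        esac
        if [ "$created" -lt "$cutoff" ]; then
            zfs destroy "$snap"
        fi
    done
done

Established tools like sanoid or zfs-auto-snapshot do the same job with more polish; the point is consistent naming and enforced retention, whichever way you get there.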

Layer 2: Backups to a separate system (PBS preferred)

  • PBS on separate hardware, ideally separate rack/power domain if you can.
  • Use separate credentials; don’t mount the datastore read-write from hypervisors “just because.”
  • Enable encryption at backup time if your threat model includes someone stealing the backup storage.

Layer 3: Off-site copy (replication or sync)

  • Either PBS sync to a second PBS, or ZFS replication to another ZFS target, or export backups to object storage via a controlled pipeline.
  • Make sure this layer is not a mirror of deletion. Delayed replication is a feature, not a flaw.

Retention: stop pretending it’s a feeling

Retention is a policy decision based on:
regulatory needs, business tolerance, and storage costs.
A sane baseline:

  • Local snapshots: 24–72 hours of hourly + 7–14 days daily (adjust to write rate).
  • PBS backups: 14–30 days of daily + 3–12 months of weekly/monthly, depending on what your business calls “oops.”
  • Off-site: at least weekly, ideally daily, with longer retention for ransomware recovery windows.

Consistency: decide what you mean

For many VMs, crash-consistent is fine. For databases, you want application-consistent where possible: flush, freeze, or use database-native backups.
ZFS doesn’t know what “a transaction” is. It knows blocks.
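
When the QEMU guest agent is installed and enabled, Proxmox already asks the guest to quiesce filesystems during vzdump/PBS backups. For ad-hoc ZFS snapshots you can do the same by hand; a sketch, with VM 101 and the dataset name as placeholders, and a freeze window you keep as short as possible:

cr0x@server:~$ qm guest cmd 101 fsfreeze-freeze     # agent flushes and freezes guest filesystems
cr0x@server:~$ zfs snapshot rpool/data/vm-101@pre-maintenance
cr0x@server:~$ qm guest cmd 101 fsfreeze-thaw       # thaw immediately; applications stall while frozen

That buys filesystem consistency, not transactional awareness: databases still deserve native dumps or their own backup mechanism layered on top.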

Joke #2: “We replicate instantly to the DR site” is the backup version of “I always text back immediately.” It’s impressive until you regret what you sent.

Practical tasks with commands: inspect, decide, act

These are the sorts of tasks you run during setup, during an incident, and during the monthly “prove it still works” ritual.
Each task includes: command, example output, what it means, and what decision you make.

Task 1: Verify pool health before blaming backups

cr0x@server:~$ zpool status -v
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 02:11:40 with 0 errors on Sun Dec 21 03:00:01 2025
config:

        NAME                        STATE     READ WRITE CKSUM
        rpool                       ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            ata-SAMSUNG_SSD_1       ONLINE       0     0     0
            ata-SAMSUNG_SSD_2       ONLINE       0     0     0

errors: No known data errors

What it means: Your pool is healthy, scrubs are completing, and no checksum errors exist.
Decision: Proceed with snapshot/backup tuning; you’re not fighting underlying storage corruption.
If you see DEGRADED, FAULTED, or checksum errors, fix that first—backup churn on a sick pool is a great way to make it worse.

Task 2: Check capacity and whether snapshots are pinning space

cr0x@server:~$ zfs list -o name,used,avail,refer,usedbysnapshots -r rpool/data
NAME                  USED  AVAIL  REFER  USEDBYSNAPSHOTS
rpool/data            3.2T   410G    96K            0B
rpool/data/vm-101     420G   410G   110G          210G
rpool/data/vm-102     610G   410G   240G          290G

What it means: For these ZVOL datasets, a huge chunk of used space is “used by snapshots.” That’s old blocks being retained.
Decision: Tighten snapshot retention, or move long retention to PBS/off-host. Consider whether write-heavy VMs should be on their own pool or different schedule.

Task 3: See your snapshots and whether naming is sane

cr0x@server:~$ zfs list -t snapshot -o name,creation,used -r rpool/data/vm-101 | tail -n 5
rpool/data/vm-101@auto-2025-12-27_0100  Sat Dec 27 01:00  1.2G
rpool/data/vm-101@auto-2025-12-27_0200  Sat Dec 27 02:00  1.0G
rpool/data/vm-101@auto-2025-12-27_0300  Sat Dec 27 03:00  980M
rpool/data/vm-101@auto-2025-12-27_0400  Sat Dec 27 04:00  1.1G
rpool/data/vm-101@pre-upgrade           Sat Dec 27 04:22  0B

What it means: You have regular snapshots plus a manual “pre-upgrade” snapshot. Good.
Decision: Ensure retention matches intent. The “used” column creeping upward is normal; sudden spikes point to big changes (guest updates, DB churn).

Task 4: Identify the worst snapshot offenders quickly

cr0x@server:~$ zfs list -o name,usedbysnapshots -s usedbysnapshots -r rpool/data | tail -n 5
rpool/data/vm-215     180G
rpool/data/vm-101     210G
rpool/data/vm-102     290G
rpool/data/vm-330     340G
rpool/data/vm-407     390G

What it means: These datasets are pinning the most snapshot space.
Decision: Review retention and backup approach for these VMs first. If you need longer history, don’t keep it on the primary pool.

Task 5: Confirm ZVOL properties that affect backup/replication

cr0x@server:~$ zfs get -o name,property,value -s local,default volblocksize,compression,recordsize rpool/data/vm-101
NAME               PROPERTY      VALUE
rpool/data/vm-101   volblocksize  16K
rpool/data/vm-101   compression   zstd

What it means: ZVOL uses 16K volblocksize, compression enabled. Recordsize doesn’t apply to ZVOLs.
Decision: Don’t randomly change volblocksize after provisioning unless you understand the consequences. Compression is usually good; measure CPU and latency if you’re near the edge.

Task 6: Verify scrub schedule and results

cr0x@server:~$ zpool status rpool | grep scan
  scan: scrub repaired 0B in 02:11:40 with 0 errors on Sun Dec 21 03:00:01 2025

What it means: The scan line records when the last scrub finished and whether it had to repair anything. Scrubs catch latent errors before a rebuild or restore needs those blocks.
Decision: If you aren’t scrubbing monthly (or more for large/old disks), schedule it. If scrubs are failing or never complete, treat it as a red alarm.

Task 7: Check ARC pressure before you blame “backup slowness”

cr0x@server:~$ arc_summary | grep -iE "arc size|target size|hit ratio|miss ratio"
ARC size (current):                     97.6 %   31.2 GiB
        Target size (adaptive):        100.0 %   32.0 GiB
        Cache hit ratio:                92.1 %     1.1G
        Cache miss ratio:                7.9 %    94.8M

What it means: ARC is near target and hit rate is healthy.
Decision: Your pool isn’t obviously starving for memory. If hit rate is low and misses are high during backups, consider scheduling, throttling, or memory increases—especially on all-in-one hypervisor/storage nodes.

Task 8: Watch I/O latency while backups run

cr0x@server:~$ zpool iostat -v rpool 5 3
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
rpool                        3.1T   410G    820   1600   110M   220M
  mirror-0                   3.1T   410G    820   1600   110M   220M
    ata-SAMSUNG_SSD_1           -      -    410    800    55M   110M
    ata-SAMSUNG_SSD_2           -      -    410    800    55M   110M
--------------------------  -----  -----  -----  -----  -----  -----

What it means: You see sustained writes during backups. This is normal, but the key is whether guest latency spikes.
Decision: If your VMs complain during this window, move backup windows, reduce concurrency, or add a separate backup target to avoid hammering the main pool.

Task 9: Confirm Proxmox storage definitions (and avoid backing up to “local” by accident)

cr0x@server:~$ pvesm status
Name             Type     Status     Total     Used     Available        %
local             dir     active      94G      12G         82G       12.8%
local-zfs         zfspool active     3.5T     3.1T        410G       88.3%
pbs-prod          pbs     active      18T      6.2T       11.8T      34.4%

What it means: You have a PBS datastore available. Good. The local dir storage is not a backup destination unless it’s on separate hardware (it usually isn’t).
Decision: Ensure scheduled backups target pbs-prod, not local or local-zfs.

Task 10: Inspect backup jobs and their targets

cr0x@server:~$ cat /etc/pve/jobs.cfg
vzdump: backup-nightly-pbs
        enabled 1
        schedule 03:15
        storage pbs-prod
        mode snapshot
        compress zstd
        mailnotification always
        prune-backups keep-daily=14,keep-weekly=8,keep-monthly=6

What it means: Nightly backups go to PBS, with pruning defined.
Decision: If pruning is missing, add it. If schedule overlaps scrub, replication, or peak workload, move it.

Task 11: Run a manual vzdump for a single VM and read the result

cr0x@server:~$ vzdump 101 --storage pbs-prod --mode snapshot --compress zstd --notes-template '{{vmid}} {{name}}'
INFO: starting new backup job: vzdump 101 --storage pbs-prod --mode snapshot --compress zstd
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2025-12-28 03:22:11
INFO: status = running
INFO: VM Name: app-prod-01
INFO: including disk 'scsi0' 'local-zfs:vm-101-disk-0' 110G
INFO: creating Proxmox Backup Server archive 'vm/101/2025-12-28T03:22:11Z'
INFO: transferred 33.1 GiB in 04:18 min, average speed 131.2 MiB/s
INFO: backup successful
INFO: Backup finished at 2025-12-28 03:26:30
INFO: Finished Backup of VM 101 (00:04:19)

What it means: Success is not just “exit code 0”; it’s “archive created on PBS” with plausible throughput and duration.
Decision: If speed is wildly low, check pool iostat, network, and PBS load. If it says it backed up to local storage, fix job targets immediately.

Task 12: Verify PBS datastore health from the Proxmox node (basic reachability + authentication)

cr0x@server:~$ pvesm list pbs-prod | head
Volid                                                        Format  Type         Size  VMID
pbs-prod:backup/vm/101/2025-12-28T03:22:11Z                   pbs     backup   35567616   101
pbs-prod:backup/vm/102/2025-12-28T03:15:01Z                   pbs     backup   74448896   102
pbs-prod:backup/vm/215/2025-12-28T03:18:45Z                   pbs     backup   118382592  215

What it means: The node can list backups on PBS. That validates storage config and auth at a basic level.
Decision: If listing fails intermittently, investigate network reliability and PBS service health before trusting nightly jobs.

Task 13: Check ZFS replication sanity (if you use send/receive)

cr0x@server:~$ zfs send -nP -i @auto-2025-12-27_0300 rpool/data/vm-101@auto-2025-12-27_0400 | grep ^size
size	11811160064

What it means: The dry run shows the size of the incremental stream between two snapshots. This is how you forecast replication time and bandwidth.
Decision: If the size is huge for a “small change”, the VM churns blocks aggressively, or your real replication job isn’t actually sending increments. Adjust schedule and retention, or switch to PBS for dedupe efficiency.

Task 14: Confirm you can actually receive on the target (and you’re not overwriting production)

cr0x@server:~$ ssh root@backupzfs "zfs list -o name,used,avail -r bpool/replica | head"
NAME                 USED  AVAIL
bpool/replica        1.1T  7.8T
bpool/replica/rpool  1.1T  7.8T

What it means: The target has a dedicated replica dataset/pool. This is basic hygiene.
Decision: If you’re receiving into a similarly named pool without separation, you’re one script bug away from sending destruction in both directions.

Task 15: Do a test restore of a VM into an isolated ID (the only test that matters)

cr0x@server:~$ qmrestore pbs-prod:backup/vm/101/2025-12-28T03:22:11Z 9101 --storage local-zfs
restore vma archive: vm/101/2025-12-28T03:22:11Z
creating VM 9101 on target node
  restoring disk scsi0 size 110G to local-zfs:vm-9101-disk-0
progress 15% (read 5.0 GiB, write 5.0 GiB)
progress 62% (read 21.0 GiB, write 21.0 GiB)
progress 100% (read 33.1 GiB, write 33.1 GiB)
restore successful

What it means: You can restore. Not “in theory,” not “the logs are green,” but actual VM creation and disk materialization.
Decision: Boot it on an isolated network, validate app health, then destroy it. Schedule this monthly at minimum for critical services.

Task 16: Check snapshot/backup load correlation with VM latency (quick view)

cr0x@server:~$ iostat -x 5 3
Linux 6.8.12 (server)   12/28/2025

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.10    0.00    6.55    8.40    0.00   72.95

Device            r/s     w/s   rkB/s   wkB/s  await  svctm  %util
nvme0n1         220.0  1400.0  95000  210000   9.80   0.35  58.00

What it means: Elevated %iowait and await show the system is waiting on storage. During backups, that can be expected; sustained high await correlates with VM slowness.
Decision: If await jumps into tens/hundreds of ms, reduce backup concurrency, change schedule, or separate backup I/O from production storage.

Fast diagnosis playbook

When backups get slow, fail, or “succeed” but restores are painful, you don’t want a philosophy seminar. You want a sequence.

First: confirm you’re backing up to the right place

  • Check Proxmox job config: storage target, mode, pruning.
  • Confirm the destination is reachable and has free space.

Second: identify the bottleneck domain (disk, CPU, network, PBS)

  • Disk latency: zpool iostat, iostat -x
  • CPU pressure: top and “is compression pegging cores?”
  • Network: ip -s link, ss -s, or your switch counters
  • PBS load: datastore verification jobs, garbage collection, concurrent clients

Third: check snapshot pressure and pool capacity

  • zfs list -o usedbysnapshots to find pinned space.
  • zpool list for overall capacity; avoid living above ~80–85% on busy pools.

Fourth: validate restore mechanics

  • Pick one VM and run a test restore to a temporary VMID.
  • If restore is slow, the same bottleneck likely exists, just reversed (read-heavy from backup target).

Fifth: reduce concurrency before you redesign the world

Many Proxmox backup “incidents” are self-inflicted by running too many backups at once.
Concurrency is a knob. Use it.
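
A read-only triage sketch that covers the storage-side checks above in one pass; the pool name is an assumption, pass yours as the first argument:

#!/usr/bin/env bash
# backup-triage.sh -- hypothetical read-only helper for the playbook above.
set -euo pipefail
POOL="${1:-rpool}"

echo "== Pool capacity and health =="
zpool list -o name,size,alloc,free,cap,frag,health "$POOL"

echo "== Top 5 datasets by snapshot-pinned space =="
zfs list -H -p -o name,usedbysnapshots -r "$POOL" | sort -k2 -n | tail -n 5

echo "== Last scrub and error counters =="
zpool status "$POOL" | grep -E "scan:|errors:"

It changes nothing, so it is safe to run mid-incident; the point is to answer “is the pool the problem?” in seconds instead of opinions.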

Common mistakes: symptom → root cause → fix

1) “We have snapshots, so we’re backed up.”

Symptom: The pool won’t import, and everything on it, VM disks and snapshots alike, is inaccessible.

Root cause: Snapshots never left the pool’s failure domain.

Fix: Implement PBS or off-host ZFS replication. Require monthly test restores as a gate for calling it “backed up.”

2) Backup jobs are green, but restores fail

Symptom: You can see backup entries, but restore errors occur (missing chunks, auth failures, corrupted archives).

Root cause: No restore drills; problems accumulate quietly (permissions, datastore pruning mistakes, intermittent network).

Fix: Add scheduled restore tests. For PBS, run verification jobs and watch failure rates. Fix auth boundaries and network flakiness.

3) ZFS pool fills unexpectedly

Symptom: df shows plenty of room somewhere, but ZFS reports low AVAIL. VM writes start failing.

Root cause: Snapshot retention pins blocks; heavy churn VMs keep old blocks alive. Also common: living at 90%+ pool usage.

Fix: Reduce snapshot retention. Move long retention to PBS. Keep pools under ~80–85% for performance and fragmentation control.

4) Replication “works” and then you realize it replicated the disaster

Symptom: A delete or ransomware encryption appears on the replica quickly.

Root cause: Continuous replication without delayed checkpoints; replica is treated as a mirror, not a backup.

Fix: Use delayed replication, keep snapshot history on the replica, and protect replica from production credentials. For ransomware, immutability matters.
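
Two cheap guards on the backup target go a long way; a sketch using the replica names from the earlier examples:

cr0x@backupzfs:~$ sudo zfs set readonly=on bpool/replica                                  # zfs receive still works; stray writes don't
cr0x@backupzfs:~$ sudo zfs hold keep bpool/replica/vm-101@auto-2025-12-27_0400            # zfs destroy now fails until someone runs zfs release

Neither stops an attacker who owns the backup box itself, which is why the credentials that can log in there shouldn’t live on the hypervisors.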

5) Backups cause VM latency spikes

Symptom: Application timeouts during backup window; storage latency graphs spike.

Root cause: Backup jobs competing for IOPS and cache; too many concurrent VMs; scrubs/resilver happening simultaneously.

Fix: Change schedule, reduce concurrency, move backups off-host, separate pools, or add faster media for the hot workload.

6) “We encrypted the pool so backups are secure”

Symptom: Backup target is unencrypted, or keys are stored on the same hypervisor.

Root cause: Confusing at-rest encryption for operational access controls; key management treated as an afterthought.

Fix: Encrypt backups at backup time (PBS supports this). Store keys separately. Limit who can delete backups and how.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran Proxmox on a beefy host with mirrored SSDs. They had hourly ZFS snapshots for two weeks.
The team believed they were “covered,” because rollbacks had saved them more than once after bad application deployments.

Then came an unplanned reboot after a power event. The pool didn’t import cleanly. They had a boot environment, they had snapshots,
and they had confidence. What they didn’t have was a second copy of the VM disks anywhere else.

Recovery turned into a storage archaeology project: import attempts with different flags, hardware checks, and a growing fear that the “fix” might worsen damage.
Eventually they recovered most of it, but it took long enough that the business started making operational decisions based on partial service.

The postmortem was uncomfortable because the “wrong assumption” wasn’t technical; it was semantic. They called snapshots “backups.”
Once the language was corrected, the architecture followed. PBS was deployed on separate hardware, restore tests became a recurring ticket, and snapshots were demoted to what they are: local safety rails.

Mini-story 2: The optimization that backfired

Another shop decided backups were too slow, so they chased throughput. They increased backup concurrency and tuned compression to be as aggressive as possible.
The backup window shrank. The graphs looked great. Everyone high-fived.

Two weeks later, users started complaining of random latency spikes in the early morning. Nothing obvious was “down,” but everything felt sticky.
The team looked at CPU first—fine. Network—fine. Then storage latency: ugly.

The optimization had turned backups into a sustained random read/write storm right when nightly batch jobs also ran inside guests.
ZFS was doing its job, but the pool was fragmented and heavily utilized. Backups competed with production, and production lost.

The fix was boring: reduce concurrency, move the heaviest VMs to a different window, and stop treating backup throughput as the only KPI.
They also added a rule: if a change improves backup speed but increases p95 latency for production, it’s not an optimization. It’s a trade you didn’t price.

Mini-story 3: The boring but correct practice that saved the day

A financial services team had a habit that felt like paperwork: every month, one engineer restored two critical VMs from PBS into an isolated network.
They verified the service health checks, logged into the app, and then destroyed the test VMs. It took an hour and produced no exciting dashboards.

One day, a hypervisor upgrade went wrong. The node didn’t boot, and the local pool import was flaky. It wasn’t a total loss, but it wasn’t trustworthy.
The team made a decision quickly: stop poking the broken host, restore to another node, and bring services back cleanly.

The restore plan worked because they had rehearsed it. They already knew which backups were fastest to restore, which networks to attach, and which credentials were needed.
Their outage was measured in a few hours, not days of “maybe we can fix the pool.”

The kicker: the monthly restore ritual had previously been questioned as “busywork.”
Afterward, nobody questioned it again. Boring is a feature when the building is on fire.

Checklists / step-by-step plan

Step-by-step: build a backup strategy that doesn’t lie

  1. Define RPO/RTO per service. If you can’t say “we can lose X hours and be down Y hours,” you can’t pick tooling honestly.
  2. Keep local snapshots short. Use them for rollback, not for history.
  3. Deploy PBS on separate hardware. Different disks, different power if possible. Separate credentials.
  4. Schedule backups away from peak I/O. Avoid overlap with scrubs, resilvers, heavy batch jobs.
  5. Set retention in the backup system. Prune on PBS (or equivalent) so retention is enforceable and visible.
  6. Enable encryption where it makes sense. Especially if backups are off-site or accessible by many systems.
  7. Do monthly restore drills. Restore a VM to a new VMID, boot it, validate, then delete.
  8. Monitor the boring signals. Pool capacity, scrub completion, backup job duration, restore duration, verification failures.
  9. Document the restore runbook. Where backups live, who has access, how to restore networking, what order to restore dependencies.

Checklist: before you trust your setup

  • Backups stored off-host and off-pool
  • Separate credentials for deleting backups
  • Retention and pruning configured
  • At least one successful test restore in the last 30 days
  • ZFS scrubs scheduled and completing
  • Pool capacity under control (not living at 90% full)
  • Backup windows don’t hurt production latency

Checklist: when you change something risky

  • Take a manual snapshot with a human name (@pre-upgrade); see the sketch after this checklist
  • Confirm last successful PBS backup exists for critical VMs
  • Ensure you can reach the backup target right now
  • Have a rollback plan and a restore plan—these are different plans
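
A minimal version of that pre-change ritual, reusing the VM and storage names from the tasks above:

cr0x@server:~$ zfs snapshot rpool/data/vm-101@pre-upgrade
cr0x@server:~$ pvesm list pbs-prod | grep "vm/101" | tail -n 1
pbs-prod:backup/vm/101/2025-12-28T03:22:11Z                   pbs     backup   35567616   101
cr0x@server:~$ pvesm status | grep pbs-prod
pbs-prod          pbs     active      18T      6.2T       11.8T      34.4%

If any of those three commands surprises you, stop and fix that before touching production.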

FAQ

1) Are ZFS snapshots ever “enough” as backups?

Only if your definition of “backup” excludes hardware loss, pool corruption, and root compromise—which is a definition used mainly by people selling optimism.
For real environments: no.

2) If I replicate ZFS snapshots to another host, is that a backup?

It can be, if the target is independent and protected. But replication often mirrors mistakes quickly.
Make replication delayed, retain history on the target, and use separate credentials so production root can’t destroy the replica.

3) Should I use Proxmox Backup Server or vzdump to NFS?

PBS is the better default: dedupe, encryption, verification, pruning. NFS can work, but you must design immutability and verification yourself,
and you’ll likely pay for it later with operational mess.

4) Do I need both snapshots and backups?

In practice, yes. Snapshots are for fast rollback and change management. Backups are for disaster recovery and audit-grade retention.
Trying to force one to do the other usually produces pain.

5) How many snapshots should I keep on the production pool?

Keep as few as your rollback needs require. For many environments: hourly for 24–72 hours and daily for 7–14 days.
If you need months, keep them off-host in the backup system.

6) My pool is 90% full but “it still works.” Why are you telling me to care?

Because ZFS performance and allocation behavior degrade as free space shrinks, especially on busy pools. Also, snapshots make “free space” deceptive.
You don’t want to discover the cliff during an incident when you need writes for restores.

7) How do I make backups resilient to ransomware?

Use separation of credentials, immutability (where available), delayed/offline copies, and a backup system that can’t be deleted by compromised hypervisors.
Also: restore drills, because ransomware recovery is mostly “how fast can we restore cleanly.”

8) Are ZFS scrubs a substitute for backups?

No. Scrubs detect and repair corruption using redundancy. They don’t protect against deletion, encryption, or catastrophic pool loss.
Scrubs help ensure that when you need your data, it reads correctly.

9) What’s the single most important metric for backups?

Restore time for a representative workload. Not backup duration, not dedupe ratio. The business experiences restore time.

10) What should I do if backups slow down over time?

Check capacity (including snapshot pinning), check latency during the backup window, and check concurrency. Then check the backup target health.
Slow backups are often a symptom of “pool too full” or “too many jobs at once,” not a mystical curse.

Conclusion: next steps that pay rent

Stop calling snapshots backups. Promote them to “rollback tooling,” and you’ll immediately make better decisions about retention, capacity, and risk.
Then build actual backups that leave the failure domain, are protected from easy deletion, and are proven by restores.

Practical next steps:

  1. Pick one critical VM and perform a test restore from your backup system this week. Time it. Write down the steps.
  2. Audit where your backups land. If anything says “local” on the same host, treat it as convenience, not protection.
  3. Measure snapshot space pinning and reduce retention on the worst offenders.
  4. Schedule scrubs and make sure they complete.
  5. Implement off-host backups (PBS recommended) with pruning and verification, and separate credentials for deletion.

When the bad day arrives—and it will—you want your backup strategy to be boring, repeatable, and a little smug. Not because you love smugness,
but because your users love service.
