Proxmox replication failed: why it breaks and how to recover

Replication fails at the worst possible time: right after you promised the business that the DR node is “fully in sync.” Then a single red line appears—replication failed—and suddenly your plan is a screenshot in a slide deck.

This is a practical field guide for Proxmox VE replication when it breaks: what’s actually happening under the hood, how to diagnose the bottleneck quickly, and how to recover without turning a small outage into a long weekend.

What Proxmox “replication” really does (and what it doesn’t)

In Proxmox VE, “replication” is not a magical block-level mirror for everything you can store. It’s a scheduled job that—when you use ZFS-based storage—creates VM disk snapshots and ships them from a source node to a target node using ZFS send/receive. For local ZFS pools, this is reliable, fast, and relatively boring. When it’s not, it’s because one of the layers underneath stopped being boring.

Under the covers, a typical replication run looks like this (a hand-run approximation follows the list):

  1. Proxmox coordinates the job (per VM, per schedule).
  2. ZFS snapshots are created for the VM disks (datasets or zvols depending on your layout).
  3. Incremental ZFS send is performed from the last successful snapshot to the new snapshot.
  4. The stream goes over SSH to the target node.
  5. The target does a ZFS receive into its local dataset.
  6. Proxmox records success/failure and prunes old replication snapshots.
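
If you strip away the orchestration, the data-plane portion is roughly the following. Treat it as a hand-run sketch, not what Proxmox literally executes (the scheduler wraps the transfer in its own storage tooling); the node and dataset names are the ones used in the tasks later in this article, and @last_common_snapshot is a placeholder for whatever snapshot both sides already share:

cr0x@server:~$ zfs snapshot rpool/data/vm-102-disk-0@replicate_manual_demo
cr0x@server:~$ zfs send -v -i rpool/data/vm-102-disk-0@last_common_snapshot \
    rpool/data/vm-102-disk-0@replicate_manual_demo \
    | ssh pve02 -- zfs receive -u rpool/replica/vm-102-disk-0

Here -i makes the send incremental from the common snapshot, and -u tells the receive not to mount anything on the target.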

What it is not:

  • Not an HA failover mechanism by itself. It feeds data to a node; HA decides where to run the VM.
  • Not continuous replication. It’s periodic; your RPO is at best the schedule interval, plus whatever delay the job accumulates.
  • Not a substitute for backups. Replication happily replicates corruption, deletion, and “oops.”
  • Not storage-agnostic. If you’re not using storage that supports ZFS replication, you’re in a different troubleshooting universe.

One quote, because it still captures production reality: “Hope is not a strategy,” an idea long repeated by reliability-minded operations people.

Interesting facts and historical context

  • ZFS snapshots are cheap because they’re metadata references, not full copies—until you change blocks, then space use grows with divergence.
  • ZFS send/receive dates back to early Solaris ZFS days and became popular because it was a portable, streamable replication primitive with built-in consistency.
  • Incremental sends require a common snapshot on both sides; remove it on either end and you’ve broken the chain.
  • Proxmox replication has always been “storage-first”: it’s built around ZFS semantics, not generic disk copying.
  • SSH is part of your storage pipeline for ZFS replication; key management and host keys can break “storage” in a very non-storage way.
  • Resume tokens (ZFS feature) can allow interrupted receives to resume without starting over, but they also create head-scratching states if you don’t recognize them.
  • ZFS can detect silent corruption with checksums, but replication can still propagate logical corruption (like deleting a database file cleanly).
  • Replication snapshot naming conventions matter because tooling expects patterns; ad-hoc snapshot cleanup can make Proxmox look “wrong” when ZFS is actually fine.
  • Time drift is an old enemy: scheduling, job ordering, log correlation, and even certificate validation are all made worse by sloppy NTP.

Fast diagnosis playbook (check 1/2/3)

This is the “stop guessing” order. You’re trying to identify the bottleneck in minutes, not by reading every log since last Tuesday.

1) Is the failure control-plane (Proxmox) or data-plane (ZFS/SSH/network)?

  • If the job doesn’t start, can’t resolve nodes, or reports permission issues: control-plane.
  • If it starts but fails mid-stream, stalls, or reports ZFS receive errors: data-plane.

2) Confirm the replication job state and last error

  • Look at task logs and journal for the replication worker.
  • Extract the actual ZFS error (dataset exists, no space, invalid stream, etc.).

3) Validate the three boring prerequisites

  • SSH works non-interactively from source to target for root (or the configured user).
  • ZFS pools are healthy on both ends and have free space/headroom.
  • A common snapshot exists for incremental replication.

4) If it’s slow/stuck rather than “failed,” isolate throughput

  • Is it network-limited? (iperf3, interface errors, duplex, MTU mismatch)
  • Is it disk-limited? (zpool iostat, txg sync contention)
  • Is it CPU-limited? (compression, encryption, single-thread bottlenecks)

Decision rule: if you can’t name the bottleneck after 10 minutes, you’re not checking the right things—you’re reading.
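
Two of the three boring prerequisites can be checked from one terminal in under a minute; this is a sketch assuming the node name pve02 used in the tasks below (the common-snapshot check is sketched in the next section):

cr0x@server:~$ ssh -o BatchMode=yes -o ConnectTimeout=5 pve02 -- echo ssh-ok
cr0x@server:~$ zpool status -x; ssh pve02 -- zpool status -x
cr0x@server:~$ zpool list -H -o name,capacity; ssh pve02 -- zpool list -H -o name,capacity

Anything other than ssh-ok, “all pools are healthy,” and capacity comfortably below the 80–90% range deserves attention before you look at replication itself.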

Why replication breaks: the real failure modes

1) Snapshot chain breaks (the classic)

ZFS incremental replication works by sending the delta from snapshot A to snapshot B, which requires snapshot A to exist on both source and target. If someone deletes snapshots on the target to “save space” or runs an aggressive prune, your next incremental send fails because there’s no common base.

This failure mode is common because it looks like housekeeping. People see snapshots, assume they’re “temporary,” and clean them up. Then replication starts demanding the one snapshot you just deleted.
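
A hedged way to find the newest common snapshot by name, assuming the source and replica dataset names used in the tasks below:

cr0x@server:~$ comm -12 \
    <(zfs list -H -t snapshot -o name rpool/data/vm-102-disk-0 | sed 's/.*@/@/' | sort) \
    <(ssh pve02 -- zfs list -H -t snapshot -o name rpool/replica/vm-102-disk-0 | sed 's/.*@/@/' | sort) \
    | tail -n 1

With Proxmox’s timestamped snapshot names, lexical order matches chronological order, so the last line is the newest common snapshot; with other naming schemes you’d compare creation times instead. If the command prints nothing, there is no common base and you’re re-seeding.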

2) Target dataset mismatch (wrong place, wrong type, wrong name)

ZFS receive is picky in the best way. If the destination dataset exists but doesn’t match expectations (type mismatch: dataset vs zvol, incompatible properties, wrong mountpoint semantics), receive can error out. Proxmox expects certain dataset structures under a storage ID. Manual changes on the target can make the next receive collide with reality.

3) Out of space (and not just “df -h” space)

ZFS needs headroom for copy-on-write, metadata, and transaction group behavior. A pool at 95% full is not “fine.” It’s a pool preparing a small opera called ENOSPC.

Also watch (quick checks are sketched after this list):

  • quota/reservation on datasets
  • refquota/refreservation
  • special allocation classes (special vdev full)
  • slop space behavior
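
Hedged quick checks for the items above, using the source dataset name from the tasks below (run the zfs get against the target’s replica dataset as well):

cr0x@server:~$ zfs get -o name,property,value quota,refquota,reservation,refreservation rpool/data/vm-102-disk-0
cr0x@server:~$ zpool list -v rpool | head -n 12

zpool list -v breaks usage down per vdev, which is where a full special vdev shows up. Slop space isn’t directly visible; treat the last few percent of any pool as unusable and plan headroom accordingly.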

4) SSH and key management failures (storage gets mugged in a dark alley)

Replication uses SSH. That means host key changes, tightened ciphers, revoked keys, new jump-host policies, or a rotated root password can break replication with errors that look like “storage” but aren’t.

Joke #1: SSH is like a corporate badge—everything works until Security “improves” it five minutes before your maintenance window.

5) Network path issues: MTU, drops, asymmetric routing

ZFS send is a steady stream. If the network drops packets, you’ll see stalls, resets, or corrupted streams. If MTU is mismatched (one side set to 9000, other not), you can get fragmentation, weird performance, or outright failure depending on the path.

6) Pool health problems: degraded, errors, slow I/O

If a pool is degraded, resilvering, or throwing checksum errors, replication can fail or become glacial. Proxmox will report “failed,” but the real problem is the storage subsystem gasping for air.

7) ZFS feature flags and version mismatches

If the target pool doesn’t support a feature used in the send stream (or you’re using raw/encrypted sends without matching support), receive can fail. This matters when you replicate between nodes on different ZFS versions or with different feature flag sets enabled.
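
One hedged way to spot drift is to diff the feature flags of the two pools directly (this assumes both pools are named rpool, as in the examples in this article):

cr0x@server:~$ diff \
    <(zpool get -H -o property,value all rpool | grep feature@) \
    <(ssh pve02 -- zpool get -H -o property,value all rpool | grep feature@)

No output means the feature states match; < and > lines show exactly which features differ, which is usually enough to explain a refusing receive.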

8) Encryption key and raw send/receive issues

Encrypted datasets can replicate as raw streams (keeping data encrypted) or as decrypted streams (requiring keys). If your policy changed mid-flight, or keys aren’t loaded on the target when needed, you’ll see “cannot mount” or “key not loaded” style issues.
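
Before blaming the stream, check key state on the target replica (a sketch using the dataset name from the tasks below):

cr0x@server:~$ ssh pve02 -- zfs get -H -o name,property,value encryption,keystatus,encryptionroot rpool/replica/vm-102-disk-0

A keystatus of “unavailable” means the key isn’t loaded on the target; zfs load-key against the encryption root fixes usability, but that’s a separate step from replication itself.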

9) Scheduler and lock contention

Replication jobs can conflict with backups, snapshot-heavy operations, scrub/resilver, or other I/O spikes. Sometimes the failure is just a timeout or lock issue. Sometimes it’s a dead-simple “too many jobs at once” problem dressed up as complexity.

10) Corrupted or incomplete receive state (resume tokens, partial datasets)

An interrupted receive can leave behind partial state. Modern ZFS can issue a resume token so you can continue. But if the receive was interrupted and then someone manually “fixed it” by deleting snapshots/datasets, you can end up with a tangle that needs careful cleanup.

Hands-on tasks: commands, outputs, and decisions

These are real tasks you can run on Proxmox nodes. Each includes: command, example output, what it means, and the decision you make.

Task 1: Confirm replication jobs and last status

cr0x@server:~$ pvesr status
JobID          Guest  Target     Status    Last Sync
local-zfs:101  101    pve02      ok        2025-12-26 09:10:02
local-zfs:102  102    pve02      failed    2025-12-26 08:55:14

Meaning: VM 102’s job failed; you have a timestamp for correlation.

Decision: Pull the specific task log for that time window and VM/job ID before you touch anything.

Task 2: Read the Proxmox task log for the failed replication

cr0x@server:~$ grep -R "local-zfs:102" /var/log/pve/tasks/active /var/log/pve/tasks/index 2>/dev/null | tail -n 5
/var/log/pve/tasks/index:UPID:pve01:0000A3F1:0B7C9E2A:676D22E2:pvesr:local-zfs:102:root@pam:
cr0x@server:~$ cat /var/log/pve/tasks/UPID:pve01:0000A3F1:0B7C9E2A:676D22E2:pvesr:local-zfs:102:root@pam:
starting replication job 'local-zfs:102'
command: /usr/bin/ssh -o BatchMode=yes pve02 -- zfs receive -o readonly=on -u -F rpool/replica/vm-102-disk-0
cannot receive incremental stream: most recent snapshot of rpool/replica/vm-102-disk-0 does not match incremental source
TASK ERROR: command 'ssh ... zfs receive ...' failed: exit code 1

Meaning: Incremental base snapshot mismatch. Someone deleted or altered snapshots on one side, or the dataset diverged.

Decision: Verify snapshots on both sides; plan for a new full send (or rebase) rather than blind retries.

Task 3: Inspect snapshots on the source dataset

cr0x@server:~$ zfs list -t snapshot -o name,creation -s creation rpool/data/vm-102-disk-0 | tail -n 8
rpool/data/vm-102-disk-0@replicate_2025-12-26_08-00-01  Fri Dec 26 08:00 2025
rpool/data/vm-102-disk-0@replicate_2025-12-26_08-30-01  Fri Dec 26 08:30 2025
rpool/data/vm-102-disk-0@replicate_2025-12-26_08-45-01  Fri Dec 26 08:45 2025
rpool/data/vm-102-disk-0@replicate_2025-12-26_09-00-01  Fri Dec 26 09:00 2025

Meaning: Source has replication snapshots continuing. Good.

Decision: Compare with target. If target is missing the base snapshot, you’ll need a full re-seed or a matched snapshot restore.

Task 4: Inspect snapshots on the target dataset

cr0x@server:~$ ssh pve02 -- zfs list -t snapshot -o name,creation -s creation rpool/replica/vm-102-disk-0 | tail -n 8
rpool/replica/vm-102-disk-0@replicate_2025-12-26_07-30-01  Fri Dec 26 07:30 2025
rpool/replica/vm-102-disk-0@replicate_2025-12-26_08-00-01  Fri Dec 26 08:00 2025

Meaning: The target stops at the 08-00 snapshot, while the source (and the job’s recorded state) expects a later snapshot as the incremental base.

Decision: Determine last common snapshot and whether Proxmox’s recorded state matches reality. If state is inconsistent, re-seed.

Task 5: Verify SSH non-interactive access and host key sanity

cr0x@server:~$ ssh -o BatchMode=yes -o ConnectTimeout=5 pve02 -- "echo ok && hostname"
ok
pve02

Meaning: SSH works without prompts. That eliminates the “it’s waiting for password/hostkey confirmation” class of failures.

Decision: If this fails, fix SSH first. Don’t touch ZFS until control-plane connectivity is clean.

Task 6: Check pool health on both nodes

cr0x@server:~$ zpool status -x
all pools are healthy
cr0x@server:~$ ssh pve02 -- zpool status -x
all pools are healthy

Meaning: No obvious pool errors.

Decision: If you see “DEGRADED,” “too many errors,” or ongoing resilvering, treat it as storage incident first; replication is a symptom.

Task 7: Check free space and dataset quotas (the ZFS way)

cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint rpool | head -n 5
NAME    USED  AVAIL  REFER  MOUNTPOINT
rpool   3.21T  620G   192K  /rpool
cr0x@server:~$ ssh pve02 -- zfs list -o name,used,avail,refer,mountpoint rpool | head -n 5
NAME    USED  AVAIL  REFER  MOUNTPOINT
rpool   3.79T  110G   192K  /rpool

Meaning: Target has only 110G available. That’s flirting with failure.

Decision: If the target is tight, stop trying to “force” replication. Free up space or expand the pool before re-seeding; otherwise you’ll create partial receives and snapshot sprawl.

Task 8: Identify if the target has a resume token (interrupted receive)

cr0x@server:~$ ssh pve02 -- zfs get -H -o name,property,value receive_resume_token rpool/replica/vm-102-disk-0
rpool/replica/vm-102-disk-0  receive_resume_token  -

Meaning: No resume token. The target is not waiting to resume a partial stream.

Decision: If you do see a long token value, decide whether to resume (preferred) or abort/rollback (riskier) depending on policy and time.
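
If you do find a token, the two options look roughly like this; the token value is whatever the property returned, and whether a Proxmox-managed receive leaves resumable state depends on how it was started, so treat these as manual-recovery tools:

cr0x@server:~$ zfs send -t <paste-resume-token> | ssh pve02 -- zfs receive -s rpool/replica/vm-102-disk-0
cr0x@server:~$ ssh pve02 -- zfs receive -A rpool/replica/vm-102-disk-0

The first resumes the interrupted stream from where it stopped; the second aborts and discards the partial receive state so a clean send can start over. Pick one deliberately, don’t do both.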

Task 9: Run a manual dry-run of incremental send size (capacity planning)

cr0x@server:~$ zfs send -nPv -i rpool/data/vm-102-disk-0@replicate_2025-12-26_08-00-01 rpool/data/vm-102-disk-0@replicate_2025-12-26_09-00-01
send from @replicate_2025-12-26_08-00-01 to rpool/data/vm-102-disk-0@replicate_2025-12-26_09-00-01 estimated size is 18.4G
total estimated size is 18.4G

Meaning: The delta is ~18.4G. Useful for deciding if bandwidth and target free space are adequate.

Decision: If estimated delta is huge, schedule replication away from peak hours or increase interval; huge deltas often mean your schedule is too sparse or jobs are failing repeatedly.

Task 10: Measure network throughput quickly

cr0x@server:~$ iperf3 -c pve02 -t 10
Connecting to host pve02, port 5201
[  5]   0.00-10.00  sec  8.76 GBytes  7.52 Gbits/sec  0             sender
[  5]   0.00-10.00  sec  8.75 GBytes  7.52 Gbits/sec                receiver

Meaning: ~7.5 Gbit/s effective. Not perfect 10G, but likely fine unless your deltas are huge.

Decision: If you see 200–800 Mbit/s on a 10G link, stop blaming ZFS. Check MTU, errors, routing, and congestion.
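
The client test assumes an iperf3 server is already listening on pve02; a throwaway way to start one for the duration of the test, and to clean it up afterwards:

cr0x@server:~$ ssh pve02 -- iperf3 -s -D
cr0x@server:~$ ssh pve02 -- pkill iperf3

(-s is server mode, -D backgrounds it; run the pkill once you’re done measuring.)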

Task 11: Inspect interface errors and MTU

cr0x@server:~$ ip -s link show dev bond0
2: bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    RX:  bytes  packets  errors  dropped  missed  mcast
    91234567890  81234567  0  0  0  0
    TX:  bytes  packets  errors  dropped  carrier  collsns
    82345678901  72345678  0  0  0  0
cr0x@server:~$ ssh pve02 -- ip link show dev bond0 | head -n 2
2: bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000

Meaning: MTU mismatch. That can make replication slow or fragile depending on the path.

Decision: Standardize MTU end-to-end (including switches). Don’t “half-jumbo” and hope for the best.
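
After standardizing, prove the path actually carries jumbo frames end-to-end; with a 9000-byte MTU, the largest unfragmented ICMP payload is 8972 bytes (9000 minus 20 for IP and 8 for ICMP):

cr0x@server:~$ ping -M do -c 3 -s 8972 pve02

“Message too long” or silent packet loss here, while a plain ping works, means something in the path (often a switch port) is still at 1500.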

Task 12: Check Proxmox and system logs around the failure

cr0x@server:~$ journalctl -u pvedaemon -u pveproxy -u pvescheduler --since "2025-12-26 08:45" --until "2025-12-26 09:05" | tail -n 30
Dec 26 08:55:14 pve01 pvescheduler[1123]: starting replication job local-zfs:102
Dec 26 08:55:18 pve01 pvedaemon[1109]: VM 102: replication job failed: exit code 1

Meaning: Confirms timing and that it’s a job-level failure, not a GUI glitch.

Decision: If you see repeated retries, timeouts, or overlapping jobs, you may need to adjust schedules or concurrency.

Task 13: Validate the target dataset exists and is the expected type

cr0x@server:~$ ssh pve02 -- zfs list -o name,type,volsize,used,avail rpool/replica/vm-102-disk-0
NAME                         TYPE   VOLSIZE  USED  AVAIL
rpool/replica/vm-102-disk-0   volume 128G     91G   110G

Meaning: Target is a zvol (“volume”). That’s consistent with VM disks stored as zvols.

Decision: If the type is “filesystem” but you expect “volume,” stop and fix the dataset layout—forcing receives into the wrong type is how you create creative disasters.

Task 14: Perform a controlled full re-seed (manual ZFS send/receive)

cr0x@server:~$ zfs snapshot rpool/data/vm-102-disk-0@replicate_reseed_2025-12-26_09-30
cr0x@server:~$ zfs send -Pv rpool/data/vm-102-disk-0@replicate_reseed_2025-12-26_09-30 | ssh pve02 -- zfs receive -u -F rpool/replica/vm-102-disk-0
send from rpool/data/vm-102-disk-0@replicate_reseed_2025-12-26_09-30 estimated size is 92.1G
total estimated size is 92.1G
TIME        SENT   SNAPSHOT
00:00:10    5.62G  rpool/data/vm-102-disk-0@replicate_reseed_2025-12-26_09-30

Meaning: You’re overwriting the target, rolling it back to match the source snapshot state; -F forces that rollback on the target dataset.

Decision: Use -F only when you are absolutely sure the target is a replica and not hosting a running VM. If that target is “active,” you’re about to destroy the active disk state.
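
A cheap sanity check before the -F, assuming the replica corresponds to VM 102 and lives under rpool/replica as in this example (fuser may need installing; lsof works too):

cr0x@server:~$ ssh pve02 -- qm list | grep -w 102
cr0x@server:~$ ssh pve02 -- fuser -v /dev/zvol/rpool/replica/vm-102-disk-0

No running VM 102 on the target and nothing holding the zvol device open means you’re genuinely overwriting a cold replica, not someone’s production disk.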

Task 15: After re-seed, validate snapshot alignment and replication readiness

cr0x@server:~$ ssh pve02 -- zfs list -t snapshot -o name -s creation rpool/replica/vm-102-disk-0 | tail -n 3
rpool/replica/vm-102-disk-0@replicate_2025-12-26_08-00-01
rpool/replica/vm-102-disk-0@replicate_2025-12-26_09-00-01
rpool/replica/vm-102-disk-0@replicate_reseed_2025-12-26_09-30

Meaning: Target now has the reseed snapshot and prior ones (depending on what was sent). You have a common base again.

Decision: Re-enable Proxmox replication for that VM/job and watch the next incremental run. If it fails again, you likely have a systematic issue (space, pruning, schedule collisions).
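
To re-enable the job and trigger a run immediately instead of waiting for the schedule, pvesr can do it from the source node; the job ID is the one pvesr status shows for this VM:

cr0x@server:~$ pvesr enable <job-id>
cr0x@server:~$ pvesr schedule-now <job-id>
cr0x@server:~$ pvesr status

Watch that first run closely; a clean incremental after a re-seed is the real confirmation that the chain is healthy again.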

Task 16: Check for ZFS receive errors that hint at feature/encryption mismatch

cr0x@server:~$ ssh pve02 -- zpool get -H -o value feature@encryption rpool
active

Meaning: Target pool supports encryption feature flag.

Decision: If this is “disabled” or “inactive” while the source uses encrypted datasets, expect receive failures. Align ZFS feature support or adjust send mode.

Recovery plans that don’t create new problems

There are two broad recovery strategies: repair the chain or re-seed. The wrong choice wastes time; the reckless choice risks data loss.

Plan A: Repair the snapshot chain (when you can)

Use this when the target still has a common snapshot, but Proxmox state is confused or a recent snapshot is missing due to an interrupted job.

  • Confirm last common snapshot exists on both sides.
  • Don’t delete anything “to clean up” until you know what Proxmox expects.
  • If there’s a resume token, try resume before re-seeding.
  • Run a manual incremental send using explicit snapshot names to prove viability.

The advantage: minimal bandwidth and time. The disadvantage: it requires discipline and careful inspection.
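
A minimal sketch of that manual incremental proof, reusing the snapshot and dataset names from the tasks above; substitute your actual last common snapshot, and add -F on the receive only if you’re certain the target is a pure replica that has drifted past the base:

cr0x@server:~$ zfs send -v -i rpool/data/vm-102-disk-0@replicate_2025-12-26_08-00-01 \
    rpool/data/vm-102-disk-0@replicate_2025-12-26_09-00-01 \
    | ssh pve02 -- zfs receive -u rpool/replica/vm-102-disk-0

If this succeeds, the chain is provably intact and the remaining work is getting Proxmox’s recorded state back in line with reality.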

Plan B: Full re-seed (the blunt instrument that works)

Use this when the chain is broken, snapshots have been pruned inconsistently, or the target dataset has diverged.

  • Ensure the target dataset is not in use by a running VM.
  • Ensure enough free space on the target pool (real headroom).
  • Take a new “reseed” snapshot on the source.
  • Send it with a forced receive into the replica dataset.
  • Verify snapshots and re-enable scheduled replication.

Joke #2: A re-seed is like reinstalling your OS to fix a driver—effective, slightly embarrassing, and sometimes the best use of time.

When to stop and escalate

Stop “trying things” and escalate to a storage incident if you see:

  • pool errors increasing, checksum errors, or repeated device timeouts
  • replication speed collapses while zpool iostat shows long waits
  • the target pool is near-full and fragmentation is high
  • scrub/resilver running during replication windows

Three corporate-world mini-stories (anonymized, plausible, instructive)

Mini-story 1: The wrong assumption that broke DR

The organization had two Proxmox clusters: one primary, one “warm DR.” Replication ran hourly, and it had been green for months. Storage was ZFS on both sides, different hardware generations, same dataset naming scheme. Everyone slept well.

Then a new engineer joined and did what conscientious people do: cleaned up what looked like clutter. On the DR node, they saw a pile of snapshots named like replication artifacts and assumed they were old leftovers. They pruned them with a quick script. It freed space. The dashboards stayed green for a while because nothing ran immediately.

The next replication cycle failed for a third of the VMs. Errors said incremental base snapshots didn’t match. The engineer (trying to be helpful) deleted more snapshots, thinking Proxmox would rebuild them. That’s the assumption: “replication snapshots are disposable.” They aren’t. They’re the chain.

Recovery wasn’t technically hard—full re-seeds fixed it—but operationally it was ugly. Several large VMs needed full sends across a link that was already busy during business hours. The team had to throttle and reschedule jobs, and the DR RPO was poor for a day. The postmortem had one lesson in bold: never manually prune replication snapshots unless you’re intentionally re-seeding.

Mini-story 2: The optimization that backfired

A different company wanted to reduce replication time. Someone noticed that compression settings differed between pools. They also saw CPU headroom on the nodes. Great: enable aggressive compression, crank up replication frequency, and watch the deltas shrink.

They changed ZFS properties on the source datasets to a heavier compression mode. On paper: less bandwidth, faster replication, smaller storage. In reality: during peak hours, CPU contention hit the hypervisor and increased VM latency. Not by much—just enough for a few latency-sensitive services to become flaky. The network graphs looked better, the storage graphs looked better, and the application graphs looked worse.

Then the backfire got sharper: replication jobs started overlapping with backup jobs. Both were snapshot-heavy. The system spent more time snapshotting, pruning, and syncing txgs than actually moving data. Replication failed intermittently due to timeouts and lock contention, which increased delta sizes, which made replication longer, which caused more overlap. A nice little feedback loop.

The fix was boring: back off compression to a balanced setting, separate replication and backup windows, and enforce concurrency limits. They kept some of the gains, lost the self-inflicted latency, and regained predictability—the real currency in ops.

Mini-story 3: The boring practice that saved the day

A finance company had a habit that looked paranoid: every month, they performed a controlled DR test for a handful of VMs. Not a big bang. Just a rotating sample. They verified that the replicated datasets were present, that the VM could boot, and that application-level checks passed.

One month, the test failed. The VM wouldn’t boot on the DR node because the replica dataset was present but incomplete. The replication job status was “ok” for the last run, but a previous partial receive had left the dataset in an inconsistent state that merely looked complete. Nobody had noticed because the primary was fine and dashboards love to lie by omission.

Because they tested, they found it before it mattered. They paused new replication jobs, fixed the underlying network flakiness, and re-seeded a handful of affected disks. The incident was a ticket, not a headline.

That’s the secret: a recurring, boring DR drill forces you to validate the entire chain—ZFS, SSH, Proxmox state, and bootability. It also forces you to keep runbooks accurate, because you actually use them.

Common mistakes (symptom → root cause → fix)

1) “Most recent snapshot does not match incremental source”

Symptom: Replication fails with incremental mismatch errors.

Root cause: Snapshot chain broken—snapshots deleted on target/source, or target dataset diverged due to manual changes.

Fix: Identify last common snapshot; if none, perform a controlled full re-seed with a new snapshot and forced receive. Stop deleting snapshots “until it works.”

2) “cannot receive: destination exists” or “dataset already exists”

Symptom: Receive fails immediately.

Root cause: Target dataset exists but is not the expected replica target, or your receive flags don’t match (missing -F where appropriate).

Fix: Verify dataset type and intended structure. If it’s a true replica, use controlled zfs receive -F. If not, rename or reconfigure storage layout to avoid collisions.

3) “No space left on device” during receive

Symptom: Job fails mid-stream, sometimes leaving partial state.

Root cause: Target pool/dataset full, quotas, or insufficient headroom causing ENOSPC under copy-on-write pressure.

Fix: Free space properly (delete old backups, expand pool, adjust quotas). Then clean up partial receives and re-seed if required.

4) Replication “stuck” or takes forever, but doesn’t fail

Symptom: Jobs run for hours, queue behind each other.

Root cause: I/O contention with backups/scrubs, network throughput collapse, or slow disks on target.

Fix: Measure: iperf3, zpool iostat, and interface errors. Separate windows, limit concurrency, and avoid scrubs during replication peaks.

5) “Host key verification failed” or password prompts in logs

Symptom: Replication fails immediately; logs mention SSH.

Root cause: Changed host keys, missing or rotated keys, or BatchMode failing because something now prompts interactively.

Fix: Repair SSH trust and keys. Confirm ssh -o BatchMode=yes works. Then rerun replication.

6) Replication status green, but DR boot fails

Symptom: You try to start a replicated VM and it panics or disk is missing/corrupt.

Root cause: Replica dataset incomplete, mis-mounted, encryption keys not loaded, or wrong dataset received.

Fix: Validate dataset properties, encryption key state, and run periodic boot tests. If in doubt, re-seed the affected disk(s).

Checklists / step-by-step plan

Checklist: Before you attempt recovery

  1. Confirm which node is source and which is target for the failed job.
  2. Confirm the target dataset is not actively used by a running VM.
  3. Capture the failure log (task log + exact ZFS error). Don’t rely on memory.
  4. Check SSH BatchMode connectivity.
  5. Check pool health and free space on both ends.
  6. List snapshots on both ends and locate last common snapshot.

Step-by-step: Repair incremental replication (preferred when possible)

  1. Identify last common snapshot name (exact spelling).
  2. Estimate send size with zfs send -nPv -i.
  3. Run a manual incremental send/receive to validate the chain works.
  4. Restart the Proxmox replication job and watch the next run.
  5. After success, review snapshot retention so the base snapshot won’t be pruned prematurely.

Step-by-step: Full re-seed (when the chain is broken)

  1. Stop or pause the replication job (so it doesn’t race you).
  2. Ensure target has enough headroom; expand or clean up first.
  3. Create a new “reseed” snapshot on source.
  4. Send full snapshot to target with forced receive (-F) only if safe.
  5. Verify target snapshot exists and matches.
  6. Re-enable Proxmox replication and monitor.

Checklist: After recovery (don’t skip this)

  1. Run one more replication cycle successfully for the VM.
  2. Confirm target pool space is stable (not trending toward full).
  3. Confirm schedules don’t overlap heavily with backups/scrubs.
  4. Perform a DR boot test for at least one recovered VM if policy allows.
  5. Document the root cause in one paragraph that a tired on-call can understand.

FAQ

1) Is Proxmox replication the same as backup?

No. Replication keeps a nearline copy of VM disks on another node, typically for quick failover. Backups are point-in-time archives designed for restore and long retention. You want both.

2) Why does replication fail after I “cleaned up old snapshots”?

Because incremental replication requires a common snapshot on both nodes. Deleting replication snapshots breaks the chain. If you must prune, do it through Proxmox policies or be prepared to re-seed.

3) Can I just click “Retry” until it works?

You can, but that’s how you turn a chain-break into a queue backlog. Retry is fine after fixing the root cause (SSH, space, MTU). If the error says incremental mismatch, retries won’t create missing snapshots.

4) What’s the safest “big hammer” fix?

A full re-seed to the target dataset, performed deliberately: verify the target is not in use, ensure space, snapshot the source, and pipe zfs send into zfs receive -F. The hammer is safe when you confirm what you’re hitting.

5) How do I know if the network is the problem?

Measure it. Use iperf3, check interface errors with ip -s link, and confirm MTU matches end-to-end. If throughput is low and errors/drops exist, replication is innocent.

6) Does ZFS compression help replication?

Sometimes. It can reduce bytes sent, but it increases CPU and can amplify contention during busy periods. Optimize for predictability first, speed second.

7) What about encrypted datasets—can they replicate cleanly?

Yes, but you must align expectations: raw encrypted send keeps data encrypted; decrypted sends require keys loaded and can change behavior. Mismatched ZFS feature support or missing keys will break receives.

8) Why does replication show “ok” but the replica won’t boot?

Because “ok” usually means the last job returned success—not that you validated bootability. Partial datasets, wrong destination mapping, or key issues can still exist. Test boot periodically.

9) Should I replicate every VM as frequently as possible?

No. Replication has cost: snapshots, I/O, pruning, and job contention. Classify VMs by RPO needs and size. Make the schedule match reality.

10) What’s the one thing you’d automate?

Alerting on: replication failures, target pool >80% usage, interface errors increasing, and missing recent snapshots on the target. Humans are good at fixing problems, bad at noticing slow drift.
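
A hedged starting point for the pool-usage piece, suitable for cron on each node; the 80% threshold and the alert delivery (mail, webhook, monitoring agent) are assumptions you should adapt:

cr0x@server:~$ zpool list -H -o name,capacity | awk -F'\t' '{ gsub(/%/, "", $2); if ($2+0 > 80) print "ALERT: pool", $1, "at", $2"%" }'

It prints nothing while every pool is under the threshold; pipe the output into whatever already wakes humans up.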

Next steps you should actually do

If replication failed today, do this in order:

  1. Pull the exact task log and extract the ZFS/SSH error line.
  2. Verify SSH BatchMode and pool health on both ends.
  3. Check target headroom and snapshot presence; find the last common snapshot.
  4. If the chain is intact, repair incrementals. If it’s broken, re-seed deliberately.
  5. After recovery, fix the policy issue that caused it: snapshot pruning, space trends, schedule overlap, MTU mismatch, or key management.

The point of replication isn’t to make dashboards green. It’s to make failure boring. When replication breaks, treat it like any other production pipeline: control-plane first, data-plane second, and no “cleanup” until you’ve proven what’s actually missing.
