Replication only feels boring when it works. When it doesn’t, it turns into a slow-motion horror film:
the CEO asks “Do we have a copy?” and you answer “Yes,” while your stomach quietly files for bankruptcy.
ZFS makes it dangerously easy to believe you’re safe because some snapshots exist somewhere.
This piece is about proving the replica is real: current, consistent, decryptable, restorable, and usable
under the exact failure you’re actually going to have.
What “real replication” means in production
“We replicate ZFS” can mean at least six different things, and five of them are “we feel good until we don’t.”
A real replica is not just “data exists on the other host.” It’s:
- Current enough: your RPO is met by observation, not optimism.
- Consistent: it is derived from a coherent snapshot chain, not a patched-together accident.
- Complete: all datasets you care about are included (including children, props, and holds where relevant).
- Decodable: if encryption is involved, you can actually load keys and mount it during the worst day of the year.
- Restorable: you can promote, rollback, mount, and hand the result to an application without archaeology.
- Operationally safe: replication doesn’t sabotage production via load spikes, huge snapshots, or backpressure.
ZFS replication is snapshot-based. That’s both its superpower and its trap. Superpower: it’s precise and efficient.
Trap: if you’re missing one link in the chain, the “incremental” you thought you were shipping is not incremental anymore.
Also, the send stream is faithful to the snapshot it was produced from; it does not care that your database needed
quiescing and you didn’t.
Here’s the guiding rule: audit the restore path, not the send path. Sending data is easy.
Restoring it under pressure is where your unexamined assumptions come to die.
One dry truth from the operations world: “We have backups” is not a state; it’s a claim that requires evidence.
If you can’t answer “What is the newest recoverable snapshot for dataset X, and how long would it take to put it in service?”
you do not have replication, you have vibes.
Paraphrased idea from John Allspaw (operations/reliability): reliability comes from learning in the messy real world, not from believing the plan.
Auditing replication is that learning loop, applied to storage.
Interesting facts and historical context
- ZFS was designed around end-to-end checksums (data and metadata), making silent corruption detectable—if you scrub and monitor.
- Snapshots are cheap in ZFS because they are copy-on-write references, not full copies. They can still become expensive if you keep them forever.
- The original ZFS came out of Sun Microsystems; its replication approach (send/receive) was built to move snapshot deltas safely and deterministically.
- OpenZFS became a cross-platform effort after Sun’s decline, and feature parity varies by OS distribution and version—important for resume tokens and encryption behavior.
- Replication streams can include dataset properties depending on flags; missing properties is a classic “it restored but it’s wrong” failure.
- Resume tokens exist because long sends fail in real networks. They’re not decoration; they’re what keeps you from re-sending 40 TB after a 3-second link flap.
- ZFS encryption is per-dataset, not per-pool, and key handling on the receiver is a whole separate operational problem.
- Scrubs are not backups, but they are an audit tool: they tell you whether the replica can read its own data reliably.
An audit model that catches the usual lies
Lie #1: “The replica exists” (but it’s not the right datasets)
You replicated tank/data but not tank/data/mysql because someone created it later
and your script enumerated datasets once, in 2022, and never again. The receiver looks healthy. The restore is a blank stare.
Your audit needs to compare “what should be protected” vs “what is protected” and flag drift.
Lie #2: “It’s current” (but the snapshot schedule is broken)
Replication cannot be more current than the newest snapshot on the sender. If snapshots stop, replication “succeeds” while
moving nothing. Your audit must verify snapshot creation, snapshot naming, and that the newest snapshot arrives at the receiver.
Lie #3: “It’s consistent” (but the chain is fractured)
Incremental replication assumes both sides share a common snapshot. Delete one on the receiver, and your next incremental send fails.
Or worse: it appears to succeed because you fell back to a full send without noticing, spiking bandwidth and runtime.
Auditing means verifying the common base snapshot and tracking unexpected full sends.
Lie #4: “It’s restorable” (but the receiver is not usable under DR)
The receiver may be read-only, may have canmount=off, may need encryption keys, may require promotion of clones,
may lack mountpoints you expect, or may restore with wrong recordsize/compression properties and crater performance.
A real audit includes a non-destructive restore test to a staging host, plus a documented “promote and mount” runbook.
Lie #5: “We have bandwidth” (but you don’t have time)
Your replication might “finish eventually” but still miss RPO and RTO because the delta grew faster than the link can move it.
Auditing means measuring send sizes, receive rates, and growth of snapshots over time. No guessing. No “it should be fine.”
Joke #1: Replication plans are like umbrellas—everyone remembers them right after it starts raining.
Fast diagnosis playbook
This is the “it’s 02:13 and replication is behind” sequence. The goal is to find the bottleneck in minutes, not hours.
Check these in order; stop when you find the first hard constraint.
First: Is there anything to replicate?
- Is snapshot creation running on the sender?
- Is the newest snapshot on the sender newer than the newest on the receiver?
- Are you stuck on a missing common snapshot (incremental chain broken)?
Second: Is the replication process actually moving data?
- Is zfs receive running and consuming the stream?
- Are you stalled on a resume token?
- Is the network saturated or flaky?
Third: Is the receiver able to write?
- Pool health: degraded, suspended, or out of space?
- IOPS latency: are disks choking or is SLOG missing/misused?
- Is there heavy scrub/resilver competing with replication?
Fourth: Is the sender the bottleneck?
- Compression/encryption overhead during send?
- Huge snapshot deltas due to retention mistakes?
- Dataset layout (recordsize, sync) causing pathological IO?
Fifth: Are you losing time to avoidable rework?
- Are you accidentally doing full sends because the base snapshot doesn’t match?
- Are you not using resume tokens over unreliable links?
- Are you re-sending properties and forcing churn?
Practical audit tasks (commands, outputs, decisions)
These are not “nice to know.” These are the commands you run when you want to stop believing and start knowing.
I’ll assume a sender host zfs-prod-01 with pool tank, and a receiver zfs-dr-01 with pool backup.
Task 1: Confirm pool health on both sides
cr0x@server:~$ zpool status -x
all pools are healthy
What it means: no known faults, no degraded vdevs, no suspended IO.
If you see anything else, replication correctness is secondary to pool survival.
Decision: If not healthy, pause aggressive replication and fix the pool first; a degraded receiver can silently extend RTO.
Task 2: Verify the receiver has enough free space for worst-case deltas
cr0x@server:~$ zfs list -o name,used,avail,refer,mountpoint backup
NAME USED AVAIL REFER MOUNTPOINT
backup 62.1T 8.40T 128K /backup
What it means: only 8.4T free. If your next delta is 10T because of a retention blunder, receive will fail mid-stream.
Decision: Set a hard “minimum free space” policy (and alert on it). If below threshold, prune snapshots or expand capacity before your next large send.
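That free-space gate is easy to script. A minimal sketch, assuming a POSIX shell; `MIN_FREE_BYTES` and the sample byte counts are illustrative, and in production the input would come from `zfs list -Hp -o avail backup` (with `-p`, avail is printed as plain bytes):

```shell
#!/bin/sh
# Free-space gate sketch. Threshold is a policy choice, not a recommendation.
MIN_FREE_BYTES=$((10 * 1024 * 1024 * 1024 * 1024))   # 10 TiB safety margin

check_free_space() {
    # $1 = available bytes on the receiver pool (from zfs list -Hp -o avail)
    avail_bytes="$1"
    if [ "$avail_bytes" -lt "$MIN_FREE_BYTES" ]; then
        echo "ALERT: receiver below free-space threshold"
        return 1
    fi
    echo "OK: free space within policy"
}

# Example with a captured value instead of a live pool (~8.4 TB free):
check_free_space 9235000000000 || echo "action: prune snapshots or expand the pool"
```

Wire the alert branch into whatever pages your on-call; the point is that the check runs before the next large send, not after it fails.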
Task 3: Enumerate what you believe you replicate (sender)
cr0x@server:~$ zfs list -r -o name,type,mountpoint tank/data
NAME TYPE MOUNTPOINT
tank/data filesystem /tank/data
tank/data/mysql filesystem /tank/data/mysql
tank/data/pg filesystem /tank/data/pg
tank/data/home filesystem /tank/data/home
What it means: this is your intended scope. Your replication tooling must cover all of it (or explicitly exclude parts).
Decision: If your scripts replicate only a fixed list, replace them with recursive dataset discovery plus allow/deny rules that live in version control.
Task 4: Enumerate what actually exists on the receiver
cr0x@server:~$ zfs list -r -o name,type,mountpoint backup/data
NAME TYPE MOUNTPOINT
backup/data filesystem /backup/data
backup/data/mysql filesystem none
backup/data/home filesystem none
What it means: backup/data/pg is missing. Also, mountpoints are none (which might be intentional for DR).
Decision: Missing datasets are a stop-the-line issue. Fix replication scope before arguing about performance.
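Scope drift is a set difference, which makes it scriptable. A sketch, assuming the dataset lists have been captured (e.g. via `ssh zfs-prod-01 'zfs list -H -o name -r tank/data'` and `zfs list -H -o name -r backup/data`) and the pool prefixes stripped so the trees are comparable; file paths here are illustrative:

```shell
#!/bin/sh
# Flag datasets that exist on the sender but not on the receiver.
find_missing_datasets() {
    # $1 = file of sender dataset names, $2 = file of receiver dataset names
    # (prefixes stripped). comm requires sorted input, so sort defensively.
    sort "$1" > /tmp/sender.$$
    sort "$2" > /tmp/receiver.$$
    comm -23 /tmp/sender.$$ /tmp/receiver.$$   # lines only in the sender list
    rm -f /tmp/sender.$$ /tmp/receiver.$$
}

# Example with captured lists matching the transcripts above:
printf 'data\ndata/home\ndata/mysql\ndata/pg\n' > /tmp/sender.txt
printf 'data\ndata/home\ndata/mysql\n'          > /tmp/receiver.txt
find_missing_datasets /tmp/sender.txt /tmp/receiver.txt   # prints: data/pg
```

Run it on a schedule and treat any output as an incident, not a curiosity.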
Task 5: Check newest snapshot timestamps on sender and receiver (RPO reality)
cr0x@server:~$ zfs list -t snapshot -o name,creation -s creation -r tank/data/mysql | tail -n 3
tank/data/mysql@rep_2026-02-04_0000 Wed Feb 4 00:00 2026
tank/data/mysql@rep_2026-02-04_0030 Wed Feb 4 00:30 2026
tank/data/mysql@rep_2026-02-04_0100 Wed Feb 4 01:00 2026
cr0x@server:~$ zfs list -t snapshot -o name,creation -s creation -r backup/data/mysql | tail -n 3
backup/data/mysql@rep_2026-02-04_0000 Wed Feb 4 00:00 2026
backup/data/mysql@rep_2026-02-04_0030 Wed Feb 4 00:30 2026
backup/data/mysql@rep_2026-02-04_0100 Wed Feb 4 01:00 2026
What it means: receiver matches the sender up to 01:00. RPO for that dataset is currently “minutes,” not “maybe.”
Decision: Track this automatically. If receiver’s newest snapshot lags beyond policy, page the on-call. This is the whole point.
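The RPO check above reduces to comparing two epoch timestamps. A sketch, assuming the epochs are captured with something like `zfs list -Hp -t snapshot -o creation -s creation tank/data/mysql | tail -n 1` on each side (`-Hp` prints creation as Unix epoch seconds); `MAX_LAG_SECONDS` and the example values are illustrative:

```shell
#!/bin/sh
# RPO check sketch: compare newest snapshot times, sender vs receiver.
MAX_LAG_SECONDS=3600   # policy: receiver may lag at most one hour

rpo_ok() {
    # $1 = newest sender snapshot epoch, $2 = newest receiver snapshot epoch
    sender_epoch="$1"
    receiver_epoch="$2"
    lag=$((sender_epoch - receiver_epoch))
    if [ "$lag" -gt "$MAX_LAG_SECONDS" ]; then
        echo "PAGE: replica lag ${lag}s exceeds policy"
        return 1
    fi
    echo "OK: replica lag ${lag}s"
}

# Example with captured epochs: exactly one hour of lag is still in policy.
rpo_ok 1770166800 1770163200
```

Run it per critical dataset and page on failure; this single number is your RPO, stated as a fact rather than a hope.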
Task 6: Confirm incremental base snapshot exists on both sides
cr0x@server:~$ zfs list -t snapshot -o name -r tank/data/mysql | grep rep_2026-02-03_2300
tank/data/mysql@rep_2026-02-03_2300
cr0x@server:~$ zfs list -t snapshot -o name -r backup/data/mysql | grep rep_2026-02-03_2300
backup/data/mysql@rep_2026-02-03_2300
What it means: the common base exists. If it doesn’t, the next -i send will fail with “incremental source does not exist.”
Decision: If base missing, decide between (a) restore the missing snapshot (if available), (b) do a full send, or (c) rebuild the receiver dataset.
Task 7: Detect unexpected full sends by estimating delta sizes
cr0x@server:~$ zfs send -nPv -i tank/data/mysql@rep_2026-02-04_0000 tank/data/mysql@rep_2026-02-04_0100
size 2.31G
incremental tank/data/mysql@rep_2026-02-04_0000 tank/data/mysql@rep_2026-02-04_0100
no-op bytes 132K
What it means: expected delta is ~2.31G. If you run this and see “size 4.8T,” you’re about to find religion.
Decision: If delta size is far above normal, stop and investigate snapshot retention, runaway files, or a broken incremental chain before you saturate links for days.
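You can automate the "is this delta sane?" decision. A sketch that parses a captured `zfs send -nP` line (with `-P`, the size line is machine-readable as `size <bytes>`); `EXPECTED_BYTES`, `ALERT_FACTOR`, and the sample line are illustrative and should come from your own history:

```shell
#!/bin/sh
# Flag an incremental whose estimated size suggests a silent full send.
EXPECTED_BYTES=2500000000   # typical delta for this dataset, from history
ALERT_FACTOR=10             # alert when the estimate is 10x normal

delta_sane() {
    # $1 = estimated stream size in bytes (from zfs send -nP)
    estimate_bytes="$1"
    limit=$((EXPECTED_BYTES * ALERT_FACTOR))
    if [ "$estimate_bytes" -gt "$limit" ]; then
        echo "ALERT: estimated delta ${estimate_bytes}B looks like a full send"
        return 1
    fi
    echo "OK: estimated delta ${estimate_bytes}B"
}

# Example: extract the byte count from a captured -nP size line.
size_line="size 2478571520"
estimate=$(echo "$size_line" | awk '{print $2}')
delta_sane "$estimate"
```

Gate the actual send on this check so a broken chain trips an alert instead of a days-long WAN transfer.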
Task 8: Check resume tokens on the receiver (stuck replication)
cr0x@server:~$ zfs get -H -o name,value receive_resume_token backup/data/mysql
backup/data/mysql 1-EMyVd...AAAB
What it means: a token exists: a receive was interrupted. Your automation may be blindly starting new sends that can’t attach.
Decision: Either resume the stream using the token or explicitly abort and clear state (carefully). Treat tokens like a transaction log, not clutter.
Task 9: Resume an interrupted send safely
cr0x@server:~$ TOKEN=$(zfs get -H -o value receive_resume_token backup/data/mysql)
cr0x@server:~$ ssh zfs-prod-01 "zfs send -t $TOKEN" | zfs receive -s backup/data/mysql
cr0x@server:~$ echo $?
0
What it means: exit code 0: the resumed receive completed.
Decision: Add token-aware logic to your replication tooling. If your process doesn’t support this, you are one flaky link away from repeating terabytes.
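The core of token-aware tooling is a single branch: resume if a token exists, otherwise send the normal incremental. A sketch, assuming the hostnames, dataset, and snapshot names from this article; `zfs get` prints `-` when no token is set, which is what the check keys on:

```shell
#!/bin/sh
# Token-aware send decision. The real send/receive commands are shown as
# comments so the selection logic itself stays runnable anywhere.
choose_send_mode() {
    # $1 = value of: zfs get -H -o value receive_resume_token <dataset>
    token="$1"
    if [ "$token" != "-" ] && [ -n "$token" ]; then
        echo "resume"
        # ssh zfs-prod-01 "zfs send -t $token" | zfs receive -s backup/data/mysql
    else
        echo "incremental"
        # ssh zfs-prod-01 "zfs send -i @base tank/data/mysql@latest" \
        #   | zfs receive -s backup/data/mysql
    fi
}

choose_send_mode "1-EMyVd...AAAB"   # prints: resume
choose_send_mode "-"                # prints: incremental
```

The point of the branch is that a fresh send never starts while a partial receive is waiting; that is exactly the state where automation re-ships terabytes.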
Task 10: Validate properties replication (the “it restored but it’s wrong” check)
cr0x@server:~$ zfs get -H -o name,property,value compression,recordsize,atime,canmount tank/data/mysql
tank/data/mysql compression zstd
tank/data/mysql recordsize 16K
tank/data/mysql atime off
tank/data/mysql canmount on
cr0x@server:~$ zfs get -H -o name,property,value compression,recordsize,atime,canmount backup/data/mysql
backup/data/mysql compression zstd
backup/data/mysql recordsize 128K
backup/data/mysql atime on
backup/data/mysql canmount noauto
What it means: receiver differs: recordsize and atime don’t match, and canmount is deliberately different.
Recordsize mismatch can wreck database performance after failover.
Decision: Decide which properties must match for DR. Replicate them intentionally (or enforce via post-receive policy).
Don’t accidentally “optimize” the replica into being unusable.
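Property drift is another set difference worth automating. A sketch over captured dumps of `zfs get -H -o property,value compression,recordsize,atime <dataset>` from each side (file names are illustrative; in production you would pull the sender's dump over ssh):

```shell
#!/bin/sh
# Print properties whose values differ between sender and receiver.
diff_props() {
    # $1 = sender property dump, $2 = receiver property dump
    # Each line: "<property>\t<value>". comm needs sorted input.
    sort "$1" > /tmp/p1.$$
    sort "$2" > /tmp/p2.$$
    comm -23 /tmp/p1.$$ /tmp/p2.$$ | awk '{print $1}'   # drifted property names
    rm -f /tmp/p1.$$ /tmp/p2.$$
}

# Example mirroring the transcripts above: recordsize and atime drifted.
printf 'compression\tzstd\nrecordsize\t16K\natime\toff\n' > /tmp/sender.props
printf 'compression\tzstd\nrecordsize\t128K\natime\ton\n' > /tmp/receiver.props
diff_props /tmp/sender.props /tmp/receiver.props
```

Feed it your "must-match" property list and whitelist the properties (like canmount) where divergence is intentional.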
Task 11: Confirm encryption state and key availability (don’t discover this during DR)
cr0x@server:~$ zfs get -H -o name,property,value encryption,keystatus,keylocation tank/secure/hr
tank/secure/hr encryption aes-256-gcm
tank/secure/hr keystatus available
tank/secure/hr keylocation prompt
cr0x@server:~$ zfs get -H -o name,property,value encryption,keystatus,keylocation backup/secure/hr
backup/secure/hr encryption aes-256-gcm
backup/secure/hr keystatus unavailable
backup/secure/hr keylocation prompt
What it means: receiver has encrypted data but no key loaded. That may be correct (security) but it must be operationally planned.
Decision: Implement a key escrow process and a DR key-loading runbook. If nobody can load keys at 03:00, you do not have a replica.
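A first audit step is simply enumerating which datasets would be unmountable today. A sketch over a captured dump of `zfs get -H -o name,value keystatus -r backup` (keystatus is `none` for unencrypted datasets, `available`/`unavailable` for encrypted ones); the dataset names are the illustrative ones from this article:

```shell
#!/bin/sh
# List datasets on DR whose encryption keys are not currently loaded.
unavailable_keys() {
    # stdin: "<dataset>\t<keystatus>" lines; prints datasets lacking keys
    awk -F'\t' '$2 == "unavailable" {print $1}'
}

printf 'backup/secure/hr\tunavailable\nbackup/data/mysql\tnone\n' | unavailable_keys
# For each printed dataset, your drill would then exercise:
#   zfs load-key backup/secure/hr   (key material fetched via your escrow process)
```

Keys being unavailable may be the correct steady state; the audit's job is to prove someone can change that state at 03:00.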
Task 12: Perform a non-destructive mount test on the receiver (prove restore path)
cr0x@server:~$ zfs clone -p backup/data/mysql@rep_2026-02-04_0100 backup/test-restore/mysql
cr0x@server:~$ zfs set canmount=noauto mountpoint=/mnt/restore-mysql backup/test-restore/mysql
cr0x@server:~$ zfs mount backup/test-restore/mysql
cr0x@server:~$ zfs list -o name,mountpoint backup/test-restore/mysql
NAME MOUNTPOINT
backup/test-restore/mysql /mnt/restore-mysql
What it means: you can materialize a point-in-time view without touching the main replicated dataset.
Decision: Make this a scheduled exercise. If mounting requires “tribal knowledge,” write it down and rehearse.
Task 13: Verify data integrity with scrub status on the receiver
cr0x@server:~$ zpool status backup
pool: backup
state: ONLINE
scan: scrub repaired 0B in 05:14:02 with 0 errors on Mon Feb 2 03:12:44 2026
config:
NAME STATE READ WRITE CKSUM
backup ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
What it means: scrub completed with zero errors. That’s evidence your replica can read its own blocks.
Decision: If scrubs show errors, treat the receiver as suspect until repaired; replication to broken media is not “safety.”
Task 14: Look for snapshot retention drift (sender and receiver must agree)
cr0x@server:~$ zfs list -t snapshot -o name -r tank/data/mysql | wc -l
336
cr0x@server:~$ zfs list -t snapshot -o name -r backup/data/mysql | wc -l
92
What it means: receiver has far fewer snapshots. Maybe intentional. Maybe someone runs cleanup only on DR and broke the incremental chain.
Decision: Align retention policy with replication method. If you prune on the receiver, you must ensure the base snapshot needed for incrementals is preserved.
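Protecting the base can be mechanical: hold the newest N snapshots on the receiver so pruning can never delete the incremental base. A sketch where the selection logic runs on a captured list (as produced oldest-first by `zfs list -s creation`); `KEEP`, the tag, and the snapshot names are illustrative, and the actual `zfs hold` call is shown as a comment:

```shell
#!/bin/sh
# Select the newest snapshots to protect with holds on the receiver.
KEEP=2
TAG="repl_base"

newest_to_hold() {
    # stdin: snapshot names, oldest first; prints the newest $KEEP
    tail -n "$KEEP"
    # For each printed snapshot you would then run:
    #   zfs hold "$TAG" "$snap"
    # and release the hold only after a newer base has replicated.
}

printf 'backup/data/mysql@rep_0000\nbackup/data/mysql@rep_0030\nbackup/data/mysql@rep_0100\n' \
    | newest_to_hold
```

Pair this with a release step that drops holds older than the newest base, or holds themselves become the retention leak.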
Task 15: Verify replication is not competing with resilver/scrub at the wrong time
cr0x@server:~$ zpool iostat -v backup 5 2
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
backup 62.1T 8.40T 120 980 9.2M 112M
raidz2-0 62.1T 8.40T 120 980 9.2M 112M
sda - - 30 240 2.3M 28M
sdb - - 28 260 2.1M 29M
sdc - - 32 250 2.4M 28M
sdd - - 30 230 2.4M 27M
What it means: writes are high; if latency is also high, receive may crawl. If you see scrubs/resilver running, expect it to worsen.
Decision: Schedule replication windows or throttle send rate when the pool is already busy with healing work.
Task 16: Validate what snapshots are holding space (why deltas are huge)
cr0x@server:~$ zfs holds -r tank/data/mysql@rep_2026-02-01_0000
NAME TAG TIMESTAMP
tank/data/mysql@rep_2026-02-01_0000 keep Sun Feb 1 00:00 2026
What it means: a hold prevents deletion. Holds are useful, but they can also pin months of churn.
Decision: Audit holds and make them intentional. If “keep” is set by a human, record why and for how long.
Joke #2: A resume token is like that drawer of mystery cables—useless until it’s suddenly the only thing saving your weekend.
Three corporate mini-stories (how this fails)
1) Incident caused by a wrong assumption: “Replication succeeded because the job said OK”
A mid-sized company ran ZFS replication from production to a DR site. The replication job was a tidy shell script
run by cron. It wrote logs. It even emailed “success” if the pipeline exit code was zero.
Everyone slept well. Too well.
The catch: snapshots were created by a different job on the production host. That job failed silently after a package update
changed a script path. Replication kept running, found no new snapshots, and performed zero work—cleanly.
The logs showed successful runs because the send step correctly sent nothing and the receive step correctly received nothing.
Months later, a production outage required failover. The DR datasets existed. They mounted. The services started.
And the data was old enough to vote.
Nobody had been measuring “newest snapshot on receiver vs sender,” so no alert fired. The replication system was “green”
in exactly the way that should scare you.
The fix was boring and effective: they added a hard RPO check that compared snapshot timestamps per dataset, plus an alert if
“latest replicated snapshot age” exceeded policy. They also made snapshot creation part of the same unit of work:
one orchestration, one log, one set of metrics.
2) Optimization that backfired: “Let’s reduce snapshot count on DR to save space”
Another organization had a tight DR pool. Someone looked at snapshot counts and decided the DR side didn’t need as many.
The DR host ran a cleanup policy that deleted old snapshots more aggressively than production.
On paper, it was efficient: fewer snapshots, less space, less metadata overhead.
Then incrementals began failing intermittently. The replication tool would attempt to send from the last common snapshot,
but that snapshot had been deleted on the receiver. The tool reacted by falling back to a full send—because the script writer
thought “better to replicate slowly than fail.” It worked, until it didn’t.
Full sends ran for days. They collided with other IO. They saturated the WAN. They increased snapshot age, which increased deltas,
which made the next sends even worse. Classic positive feedback loop. The system didn’t collapse from one bug. It collapsed from one
“optimization” and three missing guardrails.
The eventual solution: unify retention logic across sender and receiver, and explicitly protect the most recent N snapshots on the
receiver with holds so incrementals always had a base. The team also stopped allowing automatic fallback to full sends without
a noisy alert and a human decision.
3) Boring but correct practice that saved the day: “Quarterly restore drills on a staging box”
A regulated business (think: compliance paperwork that breeds at night) had a habit that engineers initially mocked:
every quarter, they performed a restore drill. Not a tabletop exercise. A real restore to a staging host, with the same OS family,
same ZFS feature flags, and a cut-down version of the application.
The drill checklist was plain. Confirm the latest snapshot exists. Clone it. Mount it. Run a lightweight integrity check.
Start the service pointed at the restored dataset. Validate basic app behavior. Document the runtime.
Then destroy the clone. No heroics, no “we’ll remember later.”
When they later suffered a ransomware incident on production, the storage team didn’t have to invent a recovery plan in the meeting room.
They already had a muscle memory for: “identify last known-good snapshot, clone on DR, promote if required, bring up service.”
The restore still wasn’t fun—nothing is fun during ransomware—but it was executable.
The biggest win was psychological: because they’d practiced, they didn’t waste hours arguing about whether the replica was valid.
They were debating which snapshot to use, not whether snapshots existed.
That’s what competence looks like: unglamorous, rehearsed, and fast.
Common mistakes (symptom → root cause → fix)
Replication “succeeds” but DR data is stale
- Symptom: jobs report success; newest snapshot on receiver is hours/days old.
- Root cause: snapshot creation stopped; replication ran with no new snapshots; no RPO check existed.
- Fix: alert on “age of latest snapshot per dataset” and “age of latest replicated snapshot.” Treat “no new snapshots” as a failure unless explicitly expected.
Incremental sends fail: “does not exist” or “incremental source” errors
- Symptom: send/receive errors mentioning missing snapshots; automation retries forever.
- Root cause: receiver pruned snapshots or never received the base; naming mismatch; replication started mid-chain.
- Fix: enforce retention symmetry for the “base window”; use holds on the receiver for required base snapshots; rebuild dataset with a fresh full send when chain is irreparably broken.
Replication is slow despite plenty of bandwidth
- Symptom: network isn’t saturated, but receive crawls; pool latency high.
- Root cause: receiver disks are the bottleneck; competing scrub/resilver; small recordsize datasets causing high IOPS.
- Fix: schedule or throttle replication during healing; measure with zpool iostat; consider tuning dataset properties on production (carefully) rather than “fixing” DR in isolation.
Replica mounts, but application performance is terrible after failover
- Symptom: DR service starts but is sluggish; DB metrics look like molasses.
- Root cause: properties diverged (recordsize, compression, atime, logbias); replication didn’t include properties or receiver enforced different defaults.
- Fix: define a “must-match” property set; replicate properties intentionally; validate via periodic property diffs and restore drills.
Encrypted datasets replicate, but DR cannot mount them
- Symptom: datasets exist on DR; keystatus=unavailable; mounts fail during incident.
- Root cause: key management not integrated into DR; keys require manual entry by someone who is asleep or unavailable.
- Fix: establish key escrow and break-glass procedure; test key load on DR quarterly; ensure feature flag compatibility.
Replication randomly restarts from scratch
- Symptom: large transfers repeat; job time grows; WAN bills get interesting.
- Root cause: interrupted streams without resume; resume tokens ignored; automation starts fresh full sends.
- Fix: use zfs receive -s and token-aware resume; alert when a resume token exists longer than a threshold; avoid “silent fallback to full.”
Replication appears fine, but DR dataset hierarchy is wrong
- Symptom: some children missing; mountpoints odd; quotas/reservations absent.
- Root cause: replication script didn’t use recursive mode; datasets created later were never added; properties not included.
- Fix: replicate recursively with explicit exclusions; audit dataset list drift; enforce expected hierarchy using a “desired state” file.
Checklists / step-by-step plan
Weekly audit checklist (15–30 minutes)
- Pool health: check zpool status -x on sender and receiver; any non-healthy state is a priority.
- RPO check: compare newest snapshot timestamps on both sides for top-tier datasets (databases, home dirs, object stores).
- Scope check: verify the dataset tree matches what you think you’re protecting; look for new children not covered.
- Retention check: compare snapshot counts and ensure base snapshots required for incrementals exist on receiver.
- Resume token check: ensure no long-lived receive_resume_token exists without action.
- Capacity check: verify receiver free space against your worst-case delta and safety margin.
Monthly audit checklist (1–2 hours)
- Property diff: sample critical datasets and compare properties that affect runtime behavior (recordsize, compression, atime, sync, logbias, quotas).
- Scrub review: confirm receiver scrub completed with zero errors; investigate any checksum or read errors immediately.
- Delta sizing: run zfs send -nPv for a few representative incrementals and record sizes; watch for step-function changes.
- Failure injection: intentionally interrupt a replication run and confirm resume token logic works.
Quarterly restore drill (half day, but it pays rent)
- Select a dataset: choose one high-value dataset (DB or key file share) and one “wide” dataset (lots of small files).
- Clone latest snapshot on DR: mount it in a controlled path.
- Application-level validation: run a minimal service startup or integrity check that resembles reality.
- Measure time: record how long each step took; that’s your RTO evidence.
- Destroy clones: clean up; ensure the drill doesn’t bloat space usage for weeks.
- Update runbook: if someone had to “just know,” it goes into the procedure.
Hard rules (the ones that prevent the 03:00 incident review)
- No silent full sends. If incrementals fail and you choose a full, that’s a human decision with an alert.
- No replication without RPO monitoring. “Job succeeded” is a meaningless metric by itself.
- No encryption without key DR procedures. “Secure but unrecoverable” is a form of data loss.
- No receiver pruning that breaks chains. If you must prune, you must preserve base snapshots required for incrementals.
FAQ
1) What’s the single best metric for replication health?
Age of the newest recoverable snapshot on the receiver, per critical dataset. Not “job success,” not throughput.
Snapshot age maps directly to RPO.
2) Is “snapshot exists on DR” enough?
No. You need to prove you can use it: clone it, mount it, load keys if encrypted, and validate data at the application level.
3) Should the DR dataset be mounted automatically?
Usually no. Common practice is canmount=noauto or mountpoint=none on DR to avoid accidental use.
But you must document exactly how to mount during DR and test it.
4) Can I replicate encrypted datasets safely?
Yes, but “safe” includes key handling. Decide whether DR should have keys loaded by default. Many environments keep keys unavailable
and use a break-glass process. Either way, test key loading and mounting.
5) Why do incrementals sometimes balloon in size?
Common causes: large churn (databases, VM images), retention mistakes (holding old snapshots pins changed blocks),
or a workload change (e.g., a new cache directory). Use zfs send -nPv to measure before you ship.
6) What breaks incremental replication most often?
Missing common snapshots. Typically caused by pruning on the receiver, manual snapshot deletion, or mismatched naming/policies.
Guard the base snapshot window.
7) Are scrubs required on the DR pool?
If you care about the data, yes. Replication can copy corruption faithfully if it occurred before the snapshot, and disks can rot quietly.
Scrub results are evidence the replica can read itself.
8) How do I know if the receiver is the bottleneck?
Check pool IO with zpool iostat -v during receive, and look at latency (via your OS metrics).
If network isn’t full but writes are slow and disks are saturated, the receiver is your limiter.
9) Is it OK to compress the send stream?
Sometimes. With zfs send -c, blocks travel already compressed as they are on disk, so extra external compression rarely helps; without -c, the stream is decompressed and external compression can pay off on slow links.
Measure either way. If the sender or receiver is CPU-bound, you’ll miss RPO even on a fat link.
10) Should I replicate properties?
Replicate what matters for correctness and performance. Then audit it.
If you intentionally diverge on DR (like canmount), enforce that divergence explicitly so it doesn’t drift accidentally.
Next steps you can do this week
- Implement an RPO check per dataset. Script it if you must: compare newest snapshot creation times on sender vs receiver. Alert when it exceeds policy.
- Do one restore drill. Pick one dataset, clone the newest snapshot on DR, mount it, and validate at the application layer. Measure the time. Write down every step you had to “remember.”
- Audit scope drift. Enumerate datasets recursively on sender and confirm they exist on receiver. If you discover missing children, treat it like a real incident.
- Make resume tokens first-class. If you have unreliable links, not handling tokens is choosing pain. Alert on long-lived tokens and build a safe resume workflow.
- Align retention. Make sure receiver snapshot pruning cannot delete snapshots needed as incremental bases. Use holds where appropriate and make “silent full sends” socially unacceptable.
- Decide how encryption works in DR. Who can load keys? How? Under what approval? Practice it.
The goal isn’t to have replication that “runs.” The goal is to have replication that you can prove is recoverable,
and a restore path that works when you are tired, stressed, and being watched.
That’s what “the replica you thought you had” is supposed to be: not a belief, but a demonstrated capability.