ZFS send/receive DR drill: Practicing Restore, Not Just Backup

Your backup dashboard is green. Your snapshots are “running.” Your auditors are placated.
Then a real incident hits—ransomware, accidental rm -rf, a controller dies, or someone “helpfully” destroys a dataset—and suddenly you’re learning in public what “restore” actually means.

A ZFS send/receive DR drill is where you trade comforting theory for measured reality: how long it takes, what breaks, what’s missing, and what you wish you had tested last quarter.
Backups are a process. Restore is a performance.

Why ZFS DR drills matter (and what they really test)

ZFS send/receive replication is seductive because it looks deterministic: snapshot A goes in, snapshot A comes out, and you can even measure bytes in transit.
That encourages a dangerous habit: treating replication as synonymous with recovery.

A DR drill is not “can we copy data.” It’s “can we restore a service,” with all the unpleasant details attached:

  • Identity: are dataset names, mountpoints, and permissions correct on the target?
  • Dependencies: do you also need PostgreSQL WAL, application config, secrets, TLS keys, or VM metadata?
  • Time: do you actually hit RPO and RTO when the network is busy, and the on-call is tired?
  • Safety: can you do it without clobbering the last good copy?
  • People: does anyone besides “the ZFS person” know what to type?

You don’t run DR drills to prove ZFS works. ZFS works. You run DR drills to catch the parts that don’t: your assumptions, your automation, your key management,
your naming conventions, and your habit of not writing down the weird thing you did three years ago at 2 a.m.

One operational quote that aged well: Hope is not a strategy. — James Cameron.
It applies to backup restores with embarrassing accuracy.

Joke #1: Backups are like parachutes—if you haven’t tested one, you’re about to learn a lot very quickly.

Interesting facts and quick history

A little context helps because ZFS send/receive is not “a file copy.” It’s a stream of filesystem state, and the design decisions behind it explain
many of the sharp edges you’ll meet in drills.

  1. Snapshot replication predates cloud fashion: send/receive has been around since early ZFS days, built for real enterprise replication workflows.
  2. ZFS streams are transactional: you’re sending dataset blocks and metadata as-of a snapshot, not replaying “file changes” like rsync.
  3. Incrementals depend on lineage: an incremental stream requires the receiver to have the exact base snapshot (same GUID lineage), not just a snapshot with the same name.
  4. Properties ride along (sometimes): with -p and related flags, dataset properties can be replicated, which is either salvation or chaos depending on your target conventions.
  5. Encryption changed the game: native ZFS encryption introduced key management realities into replication—especially “raw” sends and key loading on the receiver.
  6. Compression is not “free bandwidth”: ZFS stream size varies wildly based on recordsize, compressratio, and whether you’re sending compressed or raw data.
  7. Receive is not always idempotent: repeated receives can fail or diverge if snapshots were pruned, rollbacks happened, or a prior receive left partial state.
  8. Replication isn’t backup retention: send/receive mirrors snapshot history you choose; it does not automatically solve retention, legal hold, or offline copies.
  9. ZFS has always been about integrity: checksums are end-to-end, but drills still catch the real-world parts: bad cables, flaky RAM, an undersized ARC, or an overwhelmed target pool.

Define the drill: scope, RPO/RTO, and “done”

Don’t start with commands. Start with definitions, because DR drills fail most often as project management failures wearing a technical costume.
Define three things before you touch a shell:

1) What are you restoring?

Pick one service with enough complexity to be honest: a database-backed web app, a file share with ACLs, a VM workload, or an analytics pipeline.
Include “everything needed to operate” for that service, not just its main dataset.

2) What is the RPO and RTO you will measure?

  • RPO: the newest acceptable snapshot you can restore (e.g., “no more than 15 minutes of data loss”).
  • RTO: the time from declaring disaster to users being back in service (e.g., “under 2 hours”).

For ZFS replication, you can often control RPO with snapshot cadence and replication frequency. RTO is where reality bites:
network bandwidth, target pool performance, key loading, mount ordering, service checks, DNS, and application-level recovery.

3) What does “done” look like?

“Dataset received successfully” is not “done.” “Application passes smoke tests” is closer.
Your definition of done should include:

  • Datasets mounted as expected (or intentionally not mounted until cutover).
  • Services started and healthy.
  • Permissions and ownership correct.
  • At least one real read/write path verified (login works, query works, file create works).
  • Measured time stamps for each phase (receive duration, mount duration, application recovery duration).

Preparation: what to build before you practice

A DR drill should feel rehearsed. Not because it’s a theater piece, but because when it’s real you won’t have time to invent the choreography.
Build these ahead of time.

Names that won’t hurt you later

Decide how the DR side names pools and datasets. If you replicate tank/prod/app into backup/prod/app, you’re making a choice about
mountpoints, fstab expectations, and tooling.

I recommend receiving into a dedicated pool and namespace on the DR host, like drpool/recv/prod/app, then using controlled promotion/rename at cutover.
It avoids accidental mounts over production paths during a drill.
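If you want that skeleton in place before the first receive, here is a minimal sketch, assuming the drpool names used in the tasks below (canmount=off and mountpoint=none keep the skeleton datasets themselves inert, and zfs receive -d gets an existing hierarchy to land under):

cr0x@dr1:~$ zfs create -o canmount=off -o mountpoint=none drpool/recv
cr0x@dr1:~$ zfs create -o canmount=off -o mountpoint=none drpool/recv/prod
cr0x@dr1:~$ zfs create -o canmount=off -o mountpoint=none drpool/restore
cr0x@dr1:~$ zfs create -o canmount=off -o mountpoint=none drpool/restore/prod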

Key management (if you use native encryption)

If your datasets are encrypted, replication isn’t just “data.” It’s keys, key locations, and who can load them under pressure.
Decide whether the DR host stores keys (riskier but faster restore) or requires a manual key load (safer but slower).
Either way, drill it.
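A quick way to see what you are signing up for is to read the key properties on the source; a sketch with illustrative values (your keyformat and keylocation will differ):

cr0x@prod1:~$ zfs get -o name,property,value keyformat,keylocation,keystatus tank/prod/app
NAME           PROPERTY     VALUE
tank/prod/app  keyformat    raw
tank/prod/app  keylocation  file:///etc/zfs/keys/prod_app.key
tank/prod/app  keystatus    available

Whatever keylocation points at is what your DR runbook has to reproduce on the other side (or deliberately override with zfs load-key -L during restore).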

Automation that’s boring on purpose

DR automation should be predictably dull: explicit snapshot names, explicit receive targets, logging to a file, and immediate failure on mismatch.
Don’t get clever with implicit “latest snapshot” logic unless you also build guardrails.
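A minimal sketch of what “boring” looks like, assuming the tank/prod/app and drpool/recv layout used in the tasks below; the script name, log path, and snapshot prefix are illustrative, not a finished tool:

#!/bin/sh
# replicate-prod-app.sh: deliberately boring replication (illustrative sketch)
set -eu

SRC="tank/prod/app"                    # dataset tree to replicate
DST_HOST="dr1"                         # DR receiver
DST="drpool/recv"                      # inert receive namespace on the DR host
LOG="/var/log/zfs-replication.log"

NEW_SNAP="dr-$(date +%Y-%m-%d_%H%M)"   # explicit, predictable snapshot name

# The newest snapshot already on the receiver becomes the incremental base.
BASE=$(ssh "$DST_HOST" "zfs list -H -t snapshot -o name -s creation -r $DST/prod/app" \
      | awk -F'@' 'END{print $2}')
[ -n "$BASE" ] || { echo "$(date +%F_%T) no base snapshot on receiver, refusing to guess" >> "$LOG"; exit 1; }

zfs snapshot -r "$SRC@$NEW_SNAP"

# Raw, recursive, incremental; the receiver stays unmounted.
if zfs send -w -R -I "$SRC@$BASE" "$SRC@$NEW_SNAP" \
     | ssh "$DST_HOST" "zfs receive -u -d $DST" >> "$LOG" 2>&1; then
    echo "$(date +%F_%T) OK  $BASE -> $NEW_SNAP" >> "$LOG"
else
    echo "$(date +%F_%T) FAIL $BASE -> $NEW_SNAP" >> "$LOG"
    exit 1
fi

The point is not this particular script; it’s that every name is explicit, every failure is loud, and the next on-call can read it in one sitting.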

Capacity and fragmentation reality

Replication will not save you from undersized DR storage. ZFS is not impressed by optimism.
Size the DR pool to hold: replicated datasets, snapshot retention, and space for receive to breathe (you need headroom).
If you’re running close to full, your drill should include the moment when the receive stops mid-stream.

Hands-on tasks (commands, outputs, decisions)

These tasks are meant to be run during a DR drill (or while preparing for one). Each includes:
the command, realistic output, what it means, and the decision you make from it.
Hostnames and pools are examples: prod1 sends from tank, dr1 receives into drpool.

Task 1: Confirm pool health before you replicate

cr0x@prod1:~$ zpool status -x
all pools are healthy

What it means: No known vdev errors. This is your baseline.

Decision: If this is not healthy, stop. Don’t replicate corruption and call it DR.

Task 2: Check free space and fragmentation (sender and receiver)

cr0x@prod1:~$ zpool list -o name,size,alloc,free,frag,cap,health
NAME   SIZE  ALLOC   FREE  FRAG  CAP  HEALTH
tank  7.25T  5.10T  2.15T   28%  70%  ONLINE
cr0x@dr1:~$ zpool list -o name,size,alloc,free,frag,cap,health
NAME    SIZE  ALLOC   FREE  FRAG  CAP  HEALTH
drpool  9.06T  3.40T  5.66T   12%  37%  ONLINE

What it means: The receiver has plenty of free space; fragmentation is modest.

Decision: If CAP is > 80% or FRAG is high, expect slower receives and potential ENOSPC as snapshot retention grows. Fix capacity before the drill.

Task 3: Verify the dataset list and critical properties

cr0x@prod1:~$ zfs list -r -o name,used,avail,recordsize,compression,encryption,mountpoint tank/prod/app
NAME                USED  AVAIL  RECORDSIZE  COMPRESS  ENCRYPTION  MOUNTPOINT
tank/prod/app       820G  1.20T  128K        zstd     aes-256-gcm /srv/app
tank/prod/app/db    540G  1.20T  16K         zstd     aes-256-gcm /var/lib/postgresql
tank/prod/app/logs  120G  1.20T  128K        zstd     aes-256-gcm /srv/app/logs

What it means: Recordsize differs between app and DB, compression is zstd, encryption enabled.

Decision: Ensure the DR target is compatible with these properties. For DR drills, avoid blindly inheriting mountpoints that could collide with local paths.

Task 4: Confirm snapshot cadence and “latest safe point” (RPO)

cr0x@prod1:~$ zfs list -t snapshot -o name,creation -s creation -r tank/prod/app | tail -n 5
tank/prod/app@dr-2025-12-26_1000  Fri Dec 26 10:00 2025
tank/prod/app@dr-2025-12-26_1015  Fri Dec 26 10:15 2025
tank/prod/app@dr-2025-12-26_1030  Fri Dec 26 10:30 2025
tank/prod/app@dr-2025-12-26_1045  Fri Dec 26 10:45 2025
tank/prod/app@dr-2025-12-26_1100  Fri Dec 26 11:00 2025

What it means: Snapshots every 15 minutes. Your RPO is bounded by this schedule plus replication lag.

Decision: If snapshots aren’t regular, replication cannot meet an RPO you haven’t engineered.
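If nothing is generating those snapshots yet, even a cron sketch beats an irregular habit (the file path, schedule, and zfs binary path are illustrative; dedicated snapshot tools handle retention and pruning better):

# /etc/cron.d/zfs-dr-snapshots  (illustrative; % must be escaped in cron)
*/15 * * * * root /sbin/zfs snapshot -r tank/prod/app@dr-$(date +\%Y-\%m-\%d_\%H\%M)

Remember that creating snapshots and pruning old ones are separate jobs; this only covers the first.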

Task 5: Measure replication lag by comparing latest snapshots on sender vs receiver

cr0x@dr1:~$ zfs list -t snapshot -o name,creation -s creation -r drpool/recv/prod/app | tail -n 3
drpool/recv/prod/app@dr-2025-12-26_1030  Fri Dec 26 10:30 2025
drpool/recv/prod/app@dr-2025-12-26_1045  Fri Dec 26 10:45 2025
drpool/recv/prod/app@dr-2025-12-26_1100  Fri Dec 26 11:00 2025

What it means: Receiver has the latest snapshot. Lag is effectively zero right now.

Decision: If receiver is behind, decide whether your drill uses the most recent received snapshot (realistic) or you pause production to force alignment (less realistic).
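To turn “looks current” into a number, compare the creation time of the newest snapshot on each side; a sketch, assuming dr1 can reach prod1 over SSH:

cr0x@dr1:~$ src=$(ssh prod1 "zfs list -Hp -t snapshot -o creation -s creation -r tank/prod/app | tail -n 1")
cr0x@dr1:~$ dst=$(zfs list -Hp -t snapshot -o creation -s creation -r drpool/recv/prod/app | tail -n 1)
cr0x@dr1:~$ echo "replication lag: $(( src - dst )) seconds"
replication lag: 0 seconds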

Task 6: Dry-run estimate of an incremental send size

cr0x@prod1:~$ zfs send -nv -I tank/prod/app@dr-2025-12-26_1045 tank/prod/app@dr-2025-12-26_1100
send from @dr-2025-12-26_1045 to tank/prod/app@dr-2025-12-26_1100 estimated size is 14.2G
total estimated size is 14.2G

What it means: Roughly 14.2G changed between snapshots. That’s your replication workload for that interval.

Decision: Compare to available bandwidth and your replication window. If you can’t ship the delta in time, you won’t meet RPO under load.
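Back-of-the-envelope math is enough here. Assuming a dedicated 1 Gbit/s link with roughly 110 MiB/s usable after protocol overhead:

  14.2 GiB ≈ 14,540 MiB; 14,540 / 110 ≈ 132 seconds, comfortably inside a 15-minute window.
  200 GiB ≈ 204,800 MiB; 204,800 / 110 ≈ 31 minutes, which blows the window before disk and CPU overhead even enter the picture.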

Task 7: Perform an initial full replication into a safe namespace

cr0x@prod1:~$ zfs send -w -R tank/prod/app@dr-2025-12-26_1100 | ssh dr1 "zfs receive -u -v -o mountpoint=none -d drpool/recv"
receiving full stream of tank/prod/app@dr-2025-12-26_1100 into drpool/recv/prod/app@dr-2025-12-26_1100
receiving full stream of tank/prod/app/db@dr-2025-12-26_1100 into drpool/recv/prod/app/db@dr-2025-12-26_1100
receiving full stream of tank/prod/app/logs@dr-2025-12-26_1100 into drpool/recv/prod/app/logs@dr-2025-12-26_1100

What it means: -R replicates the dataset and its descendants. -w sends raw encrypted streams (keys stay on the sender; encryption properties are preserved). -u prevents auto-mount on receive, and -o mountpoint=none neutralizes the top-level mountpoint; descendants keep their received properties, which Task 9 checks. -d strips the source pool name, so the tree lands under drpool/recv/prod/app.

Decision: If you need the DR host to mount and serve data, you must plan key loading and mountpoint adjustments. If this is just a replica, keep it unmounted by default.

Task 8: Incremental replication (the real daily work)

cr0x@prod1:~$ zfs send -w -R -I tank/prod/app@dr-2025-12-26_1100 tank/prod/app@dr-2025-12-26_1115 | ssh dr1 "zfs receive -u -v -dF drpool/recv"
receiving incremental stream of tank/prod/app@dr-2025-12-26_1115 into drpool/recv/prod/app@dr-2025-12-26_1115

What it means: -I sends all intermediate snapshots between the two named points, -R keeps the descendant datasets in the stream, and -F forces a rollback of the target to its most recent snapshot so the stream can apply.

Decision: Use -F only when you are sure the receiver dataset is a replica and you accept losing any local changes. In drills, document when you used it.

Task 9: Confirm receive-side state and that datasets are not mounted

cr0x@dr1:~$ zfs get -r -o name,property,value,source mounted,mountpoint drpool/recv/prod/app | head
NAME                       PROPERTY    VALUE                SOURCE
drpool/recv/prod/app       mounted     no                   -
drpool/recv/prod/app       mountpoint  none                 local
drpool/recv/prod/app/db    mounted     no                   -
drpool/recv/prod/app/db    mountpoint  /var/lib/postgresql  received

What it means: Top dataset is unmounted with mountpoint none, but a child has a received mountpoint set. That’s a classic “surprise mount later” situation.

Decision: Normalize mountpoints on the receive side. For a replica namespace, I prefer forcing mountpoint none on the entire tree, then setting explicit mountpoints only on the restored clone.

Task 10: Override inherited/received mountpoints safely

cr0x@dr1:~$ for ds in $(zfs list -H -r -o name drpool/recv/prod/app); do zfs set mountpoint=none "$ds"; done
cr0x@dr1:~$ zfs get -r -o name,property,value,source mountpoint drpool/recv/prod/app
NAME                       PROPERTY    VALUE  SOURCE
drpool/recv/prod/app       mountpoint  none   local
drpool/recv/prod/app/db    mountpoint  none   local
drpool/recv/prod/app/logs  mountpoint  none   local

What it means: You’ve neutralized mountpoints across the replica tree. zfs set has no recursive option, hence the loop; each dataset now has an explicit local mountpoint=none that overrides any received value.

Decision: Keep replicas inert. Create a separate “restored instance” dataset (clone or promote) with real mountpoints for service startup.

Task 11: Create a writable restored copy without disturbing the replica

cr0x@dr1:~$ zfs clone drpool/recv/prod/app@dr-2025-12-26_1115 drpool/restore/prod/app
cr0x@dr1:~$ zfs set mountpoint=/srv/app drpool/restore/prod/app
cr0x@dr1:~$ zfs list -o name,mountpoint,origin drpool/restore/prod/app
NAME                     MOUNTPOINT  ORIGIN
drpool/restore/prod/app  /srv/app    drpool/recv/prod/app@dr-2025-12-26_1115

What it means: You now have a writable clone for the drill while keeping the replica pristine. It is not mounted yet; with native encryption, the mount only succeeds after the key is loaded (next task).

Decision: Always restore into a new dataset when possible. It’s a safety valve for human error and for repeated drills.

Task 12: Load encryption keys on DR side (if needed)

cr0x@dr1:~$ zfs get -o name,property,value keystatus,encryptionroot drpool/restore/prod/app
NAME                     PROPERTY        VALUE
drpool/restore/prod/app  keystatus       unavailable
drpool/restore/prod/app  encryptionroot  drpool/recv/prod/app
cr0x@dr1:~$ zfs load-key -L file:///etc/zfs/keys/prod_app.key drpool/recv/prod/app
cr0x@dr1:~$ zfs get -o name,property,value keystatus drpool/restore/prod/app
NAME                     PROPERTY   VALUE
drpool/restore/prod/app  keystatus  available
cr0x@dr1:~$ zfs mount drpool/restore/prod/app

What it means: Keys were unavailable until you loaded them at the encryption root (the replica dataset the clone came from); once the key is available, the clone can be mounted. Without this step, mounts and service startup will fail.

Decision: In your runbook, make key loading explicit, with who has access and where keys live. If the answer is “in someone’s home directory,” fix it before the next drill.

Task 13: Measure actual throughput of a send/receive pipeline

cr0x@prod1:~$ zfs send -w -R -I tank/prod/app@dr-2025-12-26_1115 tank/prod/app@dr-2025-12-26_1130 | pv -brt | ssh dr1 "zfs receive -u -d drpool/recv"
2.10GiB 0:00:18 [ 119MiB/s] [   <=>  ]

What it means: End-to-end throughput is ~119MiB/s. That includes CPU, network, and disk on both ends.

Decision: Compare to your delta sizes and RPO window. If you need 14G every 15 minutes, 119MiB/s is fine; if you need 200G, you’re dreaming.

Task 14: Identify whether the bottleneck is disk I/O, CPU, or network

cr0x@dr1:~$ iostat -xz 1 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.42    0.00    6.11   21.37    0.00   54.10

Device            r/s     rkB/s   rrqm/s  %rrqm  r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm  w_await wareq-sz  aqu-sz  %util
nvme0n1         12.0   10240.0     0.0   0.00    1.20   853.3    410.0   92800.0     2.0   0.49    9.80   226.3    4.10  92.0

What it means: High %util and write workload suggest the receiver disk is near saturation. iowait is significant.

Decision: If disk is pinned, tuning network won’t help. Consider faster vdevs, recordsize alignment, or staging to a faster pool.

Task 15: Confirm snapshot lineage (why incrementals fail)

cr0x@prod1:~$ zfs get -o name,property,value guid tank/prod/app@dr-2025-12-26_1100
NAME                           PROPERTY  VALUE
tank/prod/app@dr-2025-12-26_1100  guid      1638672096235874021
cr0x@dr1:~$ zfs get -o name,property,value guid drpool/recv/prod/app@dr-2025-12-26_1100
NAME                                     PROPERTY  VALUE
drpool/recv/prod/app@dr-2025-12-26_1100  guid      1638672096235874021

What it means: GUID matches; the receiver has the exact snapshot instance needed for incrementals.

Decision: If GUIDs differ, your incremental will not apply. You’ll need a new full send or a corrected replication chain.

Task 16: Validate dataset properties that can break restores (ACLs, xattrs, casesensitivity)

cr0x@prod1:~$ zfs get -o name,property,value acltype,xattr,casesensitivity -r tank/prod/app
NAME                PROPERTY         VALUE
tank/prod/app        acltype          posixacl
tank/prod/app        xattr            sa
tank/prod/app        casesensitivity  sensitive
tank/prod/app/db     acltype          posixacl
tank/prod/app/db     xattr            sa
tank/prod/app/db     casesensitivity  sensitive

What it means: Properties that impact application semantics are set.

Decision: Ensure the receive preserves them (property replication) or that you explicitly set them on the restore dataset. Mismatched xattr storage can become a performance surprise.

Fast diagnosis playbook: what to check first/second/third

During a drill (or a real outage), the replication pipeline gets slow and people immediately start changing things.
Don’t. Diagnose in a strict order so you don’t “fix” the wrong layer and add variables.

First: is it blocked on the receiver pool?

  • Check: zpool status for resilver/scrub in progress; iostat -xz for %util near 100% and high await.
  • Interpretation: If the receiver is busy (resilver, scrub, SMR pain, saturated vdevs), receive throughput collapses regardless of sender speed.
  • Action: Pause the drill, reschedule around scrub/resilver, or temporarily receive into a faster staging pool.

Second: is it network or SSH overhead?

  • Check: observed throughput with pv; CPU on sender/receiver; NIC errors and drops with OS tooling.
  • Interpretation: If CPU is pegged on a single core during SSH encryption, you’ve found your ceiling.
  • Action: Use faster ciphers, enable hardware offload where available, or replicate over a trusted network with alternative transport (policy-dependent). Don’t “optimize” security away in a panic.
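To separate the transport from ZFS, push junk data through the same SSH path and compare; a sketch (2 GiB of zeros is enough to see the ceiling, and the numbers are illustrative):

cr0x@prod1:~$ dd if=/dev/zero bs=1M count=2048 2>/dev/null | pv -brt | ssh dr1 "cat > /dev/null"
2.00GiB 0:00:17 [ 120MiB/s] [   <=>  ]

If that number roughly matches your send/receive throughput, SSH or the network is the ceiling; if it is much higher, look back at the receiver pool and ZFS behavior.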

Third: is it ZFS-level behavior (recordsize, compression, sync, special vdevs)?

  • Check: dataset properties, zvol vs filesystem, and whether the workload is small random writes or large sequential.
  • Interpretation: A DB dataset with recordsize=16K behaves differently than a media dataset with recordsize=1M.
  • Action: Align recordsize and consider separating datasets by workload. In DR, the restore path often reveals mismatches you ignored in production.

Fourth: is it a replication-chain issue?

  • Check: missing base snapshots, GUID mismatch, or forced rollbacks that pruned needed history.
  • Interpretation: “Incremental send fails” is often “snapshot history is not what you think it is.”
  • Action: Stop trying random flags. Confirm lineage, then decide: rebuild from a full send or reconstruct the expected snapshot chain.

Common mistakes: symptoms → root cause → fix

1) Symptom: incremental receive fails with “does not exist” or “incremental source … is not found”

Root cause: the receiver doesn’t have the exact base snapshot (wrong name, pruned, or different GUID lineage after a restore/rollback).

Fix: list snapshots on both sides; compare GUIDs; perform a new full send to re-establish a clean chain. Avoid manual snapshot deletions on the receiver replica tree.

2) Symptom: receive “succeeds” but datasets mount into production paths on DR host

Root cause: received mountpoint properties were applied, and zfs mount -a (or boot) mounted them.

Fix: receive with -u; force mountpoint=none on replica tree; only set mountpoints on restore clones.

3) Symptom: restore drill stalls at “load keys” step and nobody can find them

Root cause: encryption key management was treated as “someone’s problem,” not as an operational dependency.

Fix: define key custody, storage location, and access procedure; test key loading in every drill; document emergency access with approvals.

4) Symptom: replication throughput is fine at night but awful during business hours

Root cause: shared network links, QoS policy, competing workloads on sender/receiver, or scrubs scheduled during peak.

Fix: measure during peak; schedule scrubs away from replication windows; implement traffic shaping if needed; consider staging targets or dedicated links.

5) Symptom: “cannot receive: failed to read from stream” mid-transfer

Root cause: broken SSH session, MTU mismatch, network drops, or out-of-space causing aborts that bubble up as stream errors.

Fix: check receiver free space; check logs for disconnect; rerun with monitoring; prefer resumable streams if your ZFS version supports it, and keep enough headroom.
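If your OpenZFS supports resumable receives, start the receiver with -s; an interrupted transfer then leaves a resume token on the partially received dataset that you hand back to zfs send -t. A sketch (the token value is illustrative and truncated):

cr0x@dr1:~$ zfs get -H -o value receive_resume_token drpool/recv/prod/app
1-e604ea4bf-e8-789c63a2...
cr0x@prod1:~$ zfs send -t "$(ssh dr1 zfs get -H -o value receive_resume_token drpool/recv/prod/app)" | ssh dr1 "zfs receive -s -u drpool/recv/prod/app"

If you decide to start over instead, zfs receive -A <dataset> abandons the partial state.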

6) Symptom: application starts, but data is subtly wrong or old

Root cause: you restored the wrong snapshot (naming ambiguity), or the app needs additional state (WAL, object storage, secrets) not covered by dataset replication.

Fix: enforce snapshot naming and selection rules; add non-ZFS dependencies to the DR scope; validate with a real business transaction in the drill.

7) Symptom: receive is slow and CPU is pegged, but disks are idle

Root cause: SSH cipher overhead, compression/decompression cost, or single-thread bottleneck in the pipeline.

Fix: profile CPU; adjust cipher selection; consider replication on a trusted isolated network with appropriate transport decisions; avoid stacking compression in multiple layers blindly.

8) Symptom: restore works once, then subsequent drills get messier and slower

Root cause: the DR host becomes a junk drawer of half-restored datasets, mounts, and modified replicas; “just one quick fix” turns into state.

Fix: treat the DR environment as code: rebuild or reset between drills; keep replicas read-only; restore via clone and destroy after drill.
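The reset itself can be a one-liner, as long as it only ever touches the restore namespace; a sketch (triple-check the host and dataset before any destroy -r):

cr0x@dr1:~$ zfs destroy -r drpool/restore/prod/app
cr0x@dr1:~$ zfs list -r -o name drpool/restore
NAME
drpool/restore
drpool/restore/prod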

Three corporate mini-stories from the real world

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company had two sites and a tidy ZFS replication setup. Production sent hourly snapshots to a DR host.
The team was confident because the receive logs showed “success” for months.

Then a storage controller issue took out the primary pool. Not catastrophic—exactly what DR is for.
They promoted the DR dataset and tried to bring services up. The filesystem mounted. The app started. The database refused.

The wrong assumption: “We replicate the app dataset, so we replicate the whole system.” In reality, the database logs lived on a separate dataset that had been added later.
It wasn’t included in the replication job. Nobody noticed because nothing failed during replication; it just silently omitted the new dependency.

Under outage pressure, someone tried to “fix it” by copying the missing dataset from a stale VM backup. It booted the database, but with a timeline mismatch.
Now they had a service that was up but logically inconsistent—worse than downtime because data correctness was in question.

The post-incident improvement was not exotic: an inventory of datasets per service, replication coverage checks, and a DR drill that required a real end-to-end write path.
The lesson wasn’t “ZFS failed.” The lesson was “our mental model failed, and ZFS didn’t stop us.”

Mini-story 2: The optimization that backfired

Another organization wanted faster replication. Someone noticed CPU usage during SSH-based replication and decided to optimize.
They changed the pipeline, added extra compression, and tuned a few flags. It looked great in a quick test: smaller streams, higher peak throughput.

A month later, DR receives started falling behind during peak hours. Not by minutes—by hours. The replication queue grew.
The team chased the network, then the disks, then the scheduler.

The backfire: the extra compression layer made the sender CPU-bound under real workload, and it increased latency variance.
The tests were run on an idle system with warm ARC, not under the messy reality of production writes.
Replication became fragile: sometimes fast, sometimes glacial, always unpredictable.

The fix was almost embarrassing: remove the extra compression layer, rely on ZFS’s native compression characteristics, and measure with a representative workload window.
They also capped concurrency and scheduled replication to avoid stepping on scrub windows.

The best optimization is the one you can explain to the next on-call in two sentences. If it requires a whiteboard and a prayer, it’s not an optimization; it’s a personality trait.

Mini-story 3: The boring but correct practice that saved the day

A financial services team had a reputation for being painfully methodical. They ran quarterly DR drills with a checklist, timestamps, and a habit of destroying and rebuilding the DR restore datasets each time.
It wasn’t glamorous, so it was often mocked by teams that preferred “innovation.”

A real incident arrived via a routine maintenance window: an engineer typed a dataset destroy command against the wrong host.
It was the kind of mistake that happens when the environment looks the same and your terminal tabs outnumber your attention span.

The DR team didn’t improvise. They followed the runbook: identify the last good snapshot on the receiver, clone into the restore namespace, load keys, mount, start services, run smoke tests.
Because they had practiced on the same commands repeatedly, nobody argued about which snapshot to use or where mountpoints should land.

The recovery was not instant, but it was controlled. Most importantly, they didn’t contaminate the replica by making ad-hoc changes to “get it running.”
The replica remained the source of truth, and the restored clone was disposable.

Later, executives called it “good luck.” It wasn’t. It was a boring habit that kept humans from making the situation worse.

Checklists / step-by-step plan

Drill design checklist (before the day of the drill)

  • Pick one service and list every dataset it depends on (data, DB, logs, configs, secrets if applicable).
  • Define RPO and RTO targets in minutes/hours, not vibes.
  • Decide restore target namespace (drpool/recv for replicas, drpool/restore for writable restores).
  • Decide key management method for encrypted datasets; verify access paths for on-call.
  • Define snapshot naming convention used for DR (e.g., dr-YYYY-MM-DD_HHMM).
  • Agree on what constitutes “service restored”: smoke tests, synthetic transaction, or specific health endpoints.
  • Schedule the drill during a realistic load window at least once per year (peak reveals lies).

Replication readiness checklist (day of drill, pre-flight)

  • Sender and receiver pools healthy (zpool status -x).
  • Receiver has sufficient free space for snapshots + headroom.
  • Replica datasets are unmounted by default (-u, mountpoint=none policy).
  • Confirm snapshot presence and freshness on both sides.
  • Confirm your rollback policy (zfs receive -F acceptable or not for this replica).
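The same pre-flight, as commands you can paste before declaring the drill open (hostnames and datasets as in the tasks above; output illustrative):

cr0x@prod1:~$ zpool status -x; ssh dr1 zpool status -x
all pools are healthy
all pools are healthy
cr0x@dr1:~$ zpool list -H -o cap,free drpool
37%   5.66T
cr0x@dr1:~$ zfs list -H -t snapshot -o name -s creation -r drpool/recv/prod/app | tail -n 1
drpool/recv/prod/app@dr-2025-12-26_1130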

Restore execution plan (the drill itself)

  1. Declare the drill start time. Write it down. DR without timestamps is just cosplay.
  2. Select the snapshot. Use the newest snapshot actually present on DR that satisfies your RPO.
  3. Create a restore clone. Never restore by modifying the replica tree directly unless you like living dangerously.
  4. Set mountpoints explicitly. Avoid inherited received mountpoints; they’re optimized for surprises.
  5. Load keys (if encrypted). Validate keystatus=available.
  6. Mount datasets in correct order. DB before app if your service expects it.
  7. Start services. Use standard service manager commands and capture logs.
  8. Run smoke tests. One real write operation, one read path, one query, or one user login.
  9. Record RTO. Stop the clock when the service passes the test criteria, not when a dataset mounts.
  10. Clean up. Destroy restore clones and temporary mountpoints; keep replica intact.
  11. Write down what surprised you. That’s the entire point.

Joke #2: A DR drill is the only meeting where “we should destroy everything and start over” is both correct and encouraged.

FAQ

1) Is ZFS replication a backup?

It can be part of a backup strategy, but by itself it’s often a mirror of your mistakes. If ransomware encrypts files and you replicate those changes quickly,
congratulations: you have two copies of the problem. You still want retention, immutability controls, and ideally at least one offline or logically isolated copy.

2) Should I receive directly into the final mountpoints?

For DR drills, no. Receive into an inert namespace with -u and neutral mountpoints, then clone into a restore namespace with explicit mountpoints.
It reduces the chance you mount over something important or start services against the wrong dataset.

3) When is zfs receive -F appropriate?

When the receiver dataset is a pure replica and you are comfortable rolling it back to accept incoming streams.
It’s inappropriate when the receiver has local changes you care about, or when you don’t understand why it’s needed.

4) How do I pick which snapshot to restore?

Use the newest snapshot that is actually present on the receiver and meets your RPO. Then validate application integrity.
Snapshot naming should encode purpose (DR vs ad-hoc) so “latest” doesn’t accidentally mean “someone’s experiment.”

5) What about encrypted datasets—do I need keys on the DR host?

If you send raw encrypted streams (zfs send -w), the receiver can store encrypted data without keys, but it can’t mount for use until keys are loaded.
Decide whether keys are stored on DR (faster) or retrieved during incident (safer). Drill whichever you choose.

6) Should I replicate properties with -p?

Sometimes. Property replication can preserve important behavior (compression, recordsize, ACL settings), but it can also drag in mountpoints and local conventions you don’t want.
A common compromise: replicate most properties, but enforce mountpoint policy on receive and restore via clone.
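On reasonably recent OpenZFS you can express that compromise directly at receive time; a sketch, assuming your zfs receive supports the -o/-x property overrides:

cr0x@prod1:~$ zfs send -w -R tank/prod/app@dr-2025-12-26_1130 | ssh dr1 "zfs receive -u -d -x mountpoint -x canmount drpool/recv"

Here -x tells receive to ignore those properties from the stream (the datasets inherit or use defaults instead), so compression and recordsize travel with the data while mountpoint policy stays local to the DR host.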

7) Why is my receive slow when the sender is fast?

Because receive performance is usually bounded by the receiver: disk write IOPS, fragmentation, scrubs/resilvers, and CPU for checksumming/decompression.
Measure end-to-end throughput and then look at receiver iostat and pool activity first.

8) Do I need to test restores if replication logs show success?

Yes. Logs tell you “a stream applied.” They don’t tell you “the service is recoverable,” “keys can be loaded,” “mountpoints are safe,” or “RTO is achievable.”
DR drills catch operational gaps, not filesystem bugs.

9) How often should we run a DR drill?

Quarterly if you can, semi-annually if you must, annually only if your environment changes slowly (it doesn’t).
Run at least one drill per year under realistic load, and run smaller tabletop exercises whenever key staff or architecture changes.

10) Should the DR host be identical hardware to production?

Not always, but you need performance predictability. If DR is materially slower, your RTO will inflate.
If DR uses different disk layouts or network paths, drills must measure the real restore speed, not a theoretical one.

Conclusion: next steps you can schedule this week

If you only “test backups” by seeing snapshots exist, you’re testing paperwork. ZFS send/receive is powerful, but it’s not magic.
The restore path is where your system design meets human reality: naming, key access, mountpoints, and time pressure.

Practical next steps:

  1. Pick one service and inventory its datasets and non-ZFS dependencies.
  2. Define RPO/RTO and write down what “done” means at the application layer.
  3. Build a safe receive namespace with unmounted replicas and neutral mountpoints.
  4. Run one full restore drill using clone-based restores and explicit key loading.
  5. Record timings and bottlenecks, then fix the biggest one—capacity, keys, or throughput.
  6. Repeat until the drill feels boring. Boring is reliable.