Encryption is the easy part. Key management is the part you only notice when it’s missing—usually at 02:13, during an outage, with a senior manager asking whether “the data is still there” like it’s a philosophical question.
ZFS native encryption is solid, fast, and deceptively simple to turn on. The problem is that “encryption=on” is not a plan. A plan is: how keys are created, stored, loaded, rotated, audited, backed up, and recovered when the box is dead, the pool is imported read-only, and your password manager is also on that pool. This is a guide for people who actually run production systems—and want to keep doing so.
What you are actually managing (and why it’s harder than it looks)
ZFS encryption isn’t a single switch. It’s a set of properties attached to datasets, plus a set of behaviors that show up at the worst possible moments: boot, import, failover, restore, replication, and audits.
At a high level, ZFS native encryption works like this:
- You create an encrypted dataset with encryption=on.
- ZFS uses a wrapping key derived from a passphrase or stored key material (keyformat) to protect a dataset key.
- Data blocks are encrypted; metadata is partially protected depending on settings and on-disk realities.
- To mount an encrypted dataset, you generally need to load the key (unlock), then mount.
Key management is everything around that unlock moment: where the key comes from, who can access it, how you rotate it, and how you prove you can recover without improvising.
One joke, because we’re going to need it: encryption without a recovery-tested key process is like a parachute you bought online but never tried on—technically present, emotionally unhelpful.
The three “planes of reality” you must align
Most ZFS key disasters happen when these three planes aren’t aligned:
- The dataset plane: properties like encryption, keyformat, keylocation, keystatus, inheritance boundaries, and whether child datasets inherit the key.
- The host plane: boot process, initramfs, service ordering, availability of key files, passphrase prompts, consoles that don’t exist (hello, cloud serial consoles), and the reality that “we’ll type it at boot” stops working when you have 50 nodes.
- The people plane: who knows the passphrase, how it’s stored, how to rotate it without downtime, how to revoke access, and how to prove this isn’t “security theater” with a shared password in a chat topic.
ZFS key terminology you should be precise about
In ops conversations, the words “key” and “password” get mashed together. For ZFS encryption, be explicit:
- Dataset encryption key (DEK): the internal key that encrypts data. You don’t type this; ZFS manages it.
- Wrapping key / key material: what protects the DEK. This is derived from a passphrase or comes from a key file, depending on keyformat.
- Loaded key: key material present in memory and usable. This is what zfs load-key accomplishes.
- Key rotation: usually means changing the wrapping key (re-wrapping the DEK) rather than re-encrypting all data blocks.
Interesting facts and context: how we got here
Key management mistakes aren’t a personal failing. They’re an industry tradition. A few useful context points:
- ZFS was born at Sun Microsystems in the mid-2000s with an “everything is checksummed” philosophy that still feels futuristic when you compare it to many legacy stacks.
- Native ZFS encryption arrived later than many people assume; for years, folks used GELI (FreeBSD), LUKS/dm-crypt (Linux), or hardware encryption under ZFS.
- Early “encrypt the whole disk” approaches made recovery and boot workflows simpler in some ways (one unlock), but complicated replication and granular access control.
- Operationally, encryption shifted from niche to default when compliance regimes (and breach headlines) made “at rest” encryption table stakes even for internal systems.
- ZFS’s dataset model is a double-edged sword: it gives fine-grained encryption boundaries, but it also gives you 400 ways to create inconsistent inheritance you won’t notice until restore day.
- Key management failures are often “success failures”: teams succeed at automating unlock for convenience and accidentally succeed at removing any meaningful access control.
- Backup systems historically struggled with encrypted datasets because administrators conflated “encrypted on disk” with “safe in transit” or “safe in backups,” which are different problems.
- The industry learned (repeatedly) that storing the only copy of the key on the encrypted storage is a form of performance art, not engineering.
Threat model: what ZFS encryption does (and does not) protect
If you don’t write this down, your org will invent it later during a breach review.
What ZFS native encryption is good at
- Stolen disks / decommissioned drives: if someone walks off with drives or a whole chassis, encrypted datasets protect data at rest (assuming keys aren’t stored with the disks).
- “Oops, we shipped the wrong RMA” situations: encryption reduces impact when hardware escapes your custody.
- Multi-tenant-ish boundaries on shared storage: dataset keys can separate access, within limits of the host environment.
What it is not good at
- Malware on the host: if the dataset is mounted and the key is loaded, ransomware reads/writes just fine.
- Root on the host: root can typically read memory, intercept key loading, or access key files depending on your setup.
- Exfiltration from applications: ZFS encryption is below the filesystem API; apps see plaintext.
Threat model decisions you must make explicitly
- Is unattended boot required? If yes, your key is either stored locally (less secure) or retrieved from a network service (more moving parts). Both are valid; pretending you can avoid the trade-off is not.
- Do you need per-dataset keys? Great for least privilege; painful for ops if you don’t standardize naming and inheritance.
- What’s your “break glass” path? Who can unlock during an incident, and how do they do it without Slack archaeology?
Key design patterns that survive real life
Pattern 1: One encrypted root with inherited keys (boring, effective)
This is the “I want to sleep” pattern. You encrypt a top-level dataset (often the pool’s root dataset or a designated parent), set children to inherit, and keep the number of distinct keys low.
Pros: fewer keys, simpler unlock procedure, easier disaster recovery. Cons: less granular separation between datasets.
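A minimal sketch of this pattern, assuming a pool named pool and an illustrative keyfile path (adapt names to your environment):
cr0x@server:~$ install -d -m 0700 /etc/zfs/keys
cr0x@server:~$ head -c 32 /dev/urandom > /etc/zfs/keys/pool.data.key
cr0x@server:~$ chmod 0400 /etc/zfs/keys/pool.data.key
cr0x@server:~$ zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///etc/zfs/keys/pool.data.key pool/data
cr0x@server:~$ zfs create pool/data/app
cr0x@server:~$ zfs create pool/data/db
cr0x@server:~$ zfs get -r -o name,value encryptionroot pool/data
Children created under pool/data share its encryption root, so a single zfs load-key pool/data unlocks the whole subtree.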
Pattern 2: Multiple keys by domain (security-friendly, ops-taxing)
You group datasets by sensitivity: e.g., pool/app, pool/db, pool/secrets, each with separate wrapping keys. You accept that unlock and recovery require careful choreography.
Pros: better isolation. Cons: more key rotation work, more boot-time dependencies, more chances to forget one dataset until a restore fails.
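A sketch of the per-domain layout, again with illustrative dataset and keyfile names; each domain gets its own encryption root and its own key material:
cr0x@server:~$ for d in app db secrets; do head -c 32 /dev/urandom > /etc/zfs/keys/pool.$d.key; chmod 0400 /etc/zfs/keys/pool.$d.key; zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///etc/zfs/keys/pool.$d.key pool/$d; done
Unlock and recovery now require three keys instead of one, and your runbook and escrow have to track every one of them.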
Pattern 3: Passphrase vs keyfile (choose based on your boot reality)
Passphrase: humans can type it, but humans are unreliable infrastructure. It’s great for laptops, less great for headless servers unless you have a robust remote console and procedures.
Keyfile: automation-friendly, but you must protect the file. If the keyfile lives on the same machine, you’ve mostly moved the problem. If it’s pulled from a secure source at boot, you’ve built a key distribution system—congratulations on your new subsystem.
Pattern 4: Keylocation is an interface, not a storage strategy
ZFS lets you set keylocation to prompt or a file path. That’s not a key management solution; it’s a pointer. Your actual strategy is: where that key file comes from, how it’s permissioned, how it’s rotated, and how it’s recovered if the box is toast.
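One way to internalize “it’s a pointer”: zfs load-key accepts a -L override, so the stored property and the actual source of key material can diverge at unlock time. A sketch with illustrative paths:
cr0x@server:~$ zfs load-key -L file:///run/keys/pool.data.key pool/data
cr0x@server:~$ zfs load-key -L prompt pool/archive
Enter passphrase for 'pool/archive':
The property only tells ZFS where to look by default; your real strategy is everything that puts correct key material at that location (or hands it to -L) at the right moment.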
Pattern 5: The “two-control” rule for production datasets
In practice: don’t let a single admin’s laptop password manager be the only path to unlock production data. Use either shared escrow (with audit) or a dual-control process. Your auditors will like it. Your incident commander will love it.
Practical tasks with commands (and what the output means)
Commands below assume OpenZFS on Linux, run as root unless stated otherwise. Output is representative; your exact fields may vary.
Task 1: Inventory encryption state across a pool
cr0x@server:~$ zfs list -r -o name,encryption,keyformat,keylocation,keystatus,mounted pool
NAME ENCRYPTION KEYFORMAT KEYLOCATION KEYSTATUS MOUNTED
pool off - - - yes
pool/system aes-256-gcm passphrase prompt available yes
pool/system/var aes-256-gcm passphrase prompt available yes
pool/data aes-256-gcm raw file:///etc/zfs/keys/pool.data.key available yes
pool/data/backups aes-256-gcm raw file:///etc/zfs/keys/pool.data.key available yes
pool/archive aes-256-gcm passphrase prompt unavailable no
Interpretation: keystatus=unavailable means the dataset is locked (key not loaded). If mounted=no and keystatus=available, you have a mount problem, not a key problem.
Task 2: Confirm which datasets inherit a key vs have their own
cr0x@server:~$ zfs get -r -o name,property,value,source keylocation pool/system
NAME PROPERTY VALUE SOURCE
pool/system keylocation prompt local
pool/system/var keylocation prompt inherited from pool/system
Interpretation: Inheritance boundaries are where key sprawl starts. The source column tells you where you accidentally went “local” and forked your key management.
Task 3: Create an encrypted dataset with passphrase prompt
cr0x@server:~$ zfs create -o encryption=on -o keyformat=passphrase -o keylocation=prompt pool/secure
Enter passphrase:
Re-enter passphrase:
Interpretation: This creates the dataset and sets it to require a passphrase when loading keys. Good for interactive unlock workflows; risky for unattended boot.
Task 4: Create an encrypted dataset using a key file
cr0x@server:~$ install -d -m 0700 /etc/zfs/keys
cr0x@server:~$ head -c 32 /dev/urandom > /etc/zfs/keys/pool.app.key
cr0x@server:~$ chmod 0400 /etc/zfs/keys/pool.app.key
cr0x@server:~$ zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///etc/zfs/keys/pool.app.key pool/app
Interpretation: keyformat=raw expects key material bytes. Permissions matter; the file is now a crown jewel. If it’s readable by casual users, your encryption is mostly decoration.
Task 5: Lock and unlock a dataset (and verify)
cr0x@server:~$ zfs unmount pool/app
cr0x@server:~$ zfs unload-key pool/app
cr0x@server:~$ zfs get -o name,property,value keystatus pool/app
NAME PROPERTY VALUE
pool/app keystatus unavailable
cr0x@server:~$ zfs load-key pool/app
cr0x@server:~$ zfs mount pool/app
cr0x@server:~$ zfs get -o name,property,value keystatus pool/app
NAME PROPERTY VALUE
pool/app keystatus available
Interpretation: Unmount first, then unload key. If you unload while mounted, ZFS will generally refuse because the key is in use.
Task 6: Import a pool without mounting, then unlock deliberately
cr0x@server:~$ zpool import -N pool
cr0x@server:~$ zfs load-key -a
cr0x@server:~$ zfs mount -a
Interpretation: -N imports without mounting, which is what you want in controlled recovery: import first, unlock keys second, mount last. This ordering makes failures obvious and each step safer to retry.
Task 7: Identify why a dataset won’t mount (key vs mountpoint vs busy)
cr0x@server:~$ zfs mount pool/archive
cannot mount 'pool/archive': encryption key not loaded
cr0x@server:~$ zfs load-key pool/archive
Enter passphrase for 'pool/archive':
Interpretation: ZFS is politely telling you the real root cause. When it says “key not loaded,” don’t go tuning ARC and writing performance postmortems.
Task 8: Rotate a wrapping key (passphrase change) without re-encrypting data
cr0x@server:~$ zfs change-key pool/secure
Enter new passphrase:
Re-enter new passphrase:
Interpretation: This changes the wrapping key for that dataset’s encryption root. It’s fast because it rewraps key material rather than rewriting every data block.
Task 9: Rotate key material for a keyfile-managed dataset
cr0x@server:~$ head -c 32 /dev/urandom > /etc/zfs/keys/pool.app.key.new
cr0x@server:~$ chmod 0400 /etc/zfs/keys/pool.app.key.new
cr0x@server:~$ zfs set keylocation=file:///etc/zfs/keys/pool.app.key.new pool/app
cr0x@server:~$ zfs change-key pool/app
cr0x@server:~$ mv /etc/zfs/keys/pool.app.key.new /etc/zfs/keys/pool.app.key
Interpretation: Update keylocation, then change-key. Keep the old key until you verify unlock after reboot/import. Yes, it’s a footgun if you delete too early.
Task 10: Find the encryption root and confirm hierarchy
cr0x@server:~$ zfs get -o name,property,value -r encryptionroot pool/data
NAME PROPERTY VALUE
pool/data encryptionroot pool/data
pool/data/backups encryptionroot pool/data
Interpretation: Child datasets inherit keys from the encryption root. If you see children with different encryption roots, your unlock procedure must account for it.
Task 11: Validate that keys are not accidentally stored on the encrypted pool
cr0x@server:~$ zfs get -o name,property,value keylocation pool/data
NAME PROPERTY VALUE
pool/data keylocation file:///etc/zfs/keys/pool.data.key
cr0x@server:~$ findmnt -T /etc/zfs/keys
TARGET SOURCE FSTYPE OPTIONS
/ /dev/sda1 ext4 rw,relatime
Interpretation: The key file lives on /dev/sda1 (likely the OS disk), not on pool. If /etc is on the encrypted pool you’re trying to unlock, you’ve built a locked door with the key taped to the other side.
Task 12: Check boot-time unlock failures via systemd journal
cr0x@server:~$ journalctl -b -u zfs-import-cache -u zfs-mount -u zfs-load-key --no-pager
-- Journal begins at ...
systemd[1]: Starting Load ZFS keys...
zfs-load-key[1023]: cannot open 'file:///etc/zfs/keys/pool.data.key': No such file or directory
systemd[1]: zfs-load-key.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: Failed to start Load ZFS keys.
systemd[1]: Dependency failed for ZFS mount.
Interpretation: This is not a ZFS bug; it’s an ordering or filesystem availability issue. Your key path didn’t exist at the time the service ran.
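If the key path lives on a filesystem that mounts late, one hedged fix (assuming your distribution ships the zfs-load-key.service seen in the journal above; unit names vary) is a systemd drop-in that makes the unit wait for that mount:
cr0x@server:~$ mkdir -p /etc/systemd/system/zfs-load-key.service.d
cr0x@server:~$ cat > /etc/systemd/system/zfs-load-key.service.d/wait-for-keys.conf <<'EOF'
[Unit]
RequiresMountsFor=/etc/zfs/keys
EOF
cr0x@server:~$ systemctl daemon-reload
Then verify with a reboot in a maintenance window, not by hoping.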
Task 13: List loaded keys quickly (at scale)
cr0x@server:~$ zfs list -H -o name,keystatus -r pool | awk '$2=="unavailable"{print $1}'
pool/archive
pool/secrets/hr
Interpretation: This is the “what’s still locked?” view. Useful during recovery or when a reboot left datasets unavailable.
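A natural follow-up during recovery is to feed that list back into zfs load-key, one dataset at a time so a single failure doesn’t stop the rest. A sketch (passphrase datasets will prompt interactively):
cr0x@server:~$ for ds in $(zfs list -H -o name,keystatus -r pool | awk '$2=="unavailable"{print $1}'); do zfs load-key "$ds" || echo "FAILED: $ds"; done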
Task 14: Verify that an encrypted dataset is actually encrypted on disk
cr0x@server:~$ zfs get -o name,property,value encryption,keyformat pool/secure
NAME PROPERTY VALUE
pool/secure encryption aes-256-gcm
pool/secure keyformat passphrase
Interpretation: encryption shows the cipher mode. If it’s off, you don’t have encryption; you have optimism.
Task 15: Export/import cycle test (the closest thing to a fire drill)
cr0x@server:~$ zfs unmount -a
cr0x@server:~$ zfs unload-key -a
cr0x@server:~$ zpool export pool
cr0x@server:~$ zpool import -N pool
cr0x@server:~$ zfs load-key -a
cr0x@server:~$ zfs mount -a
cr0x@server:~$ zfs list -o name,mounted,keystatus -r pool | head
NAME MOUNTED KEYSTATUS
pool yes -
pool/system yes available
pool/data yes available
Interpretation: This is the operational test you should run before you trust any key strategy. It’s controlled pain that prevents uncontrolled pain later.
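If you run this drill on a schedule, a tiny wrapper that fails loudly is worth keeping next to the runbook. A minimal sketch, assuming a single pool named pool and treating only locked datasets as failures (mounted=no can be legitimate with canmount=off):
#!/bin/sh
# zfs-unlock-drill.sh -- sketch of the export/import/unlock drill from Task 15
# Note: unmount -a and unload-key -a affect all imported pools on the host.
set -eu
POOL="pool"    # assumption: adjust to your pool name
zfs unmount -a
zfs unload-key -a
zpool export "$POOL"
zpool import -N "$POOL"
zfs load-key -a
zfs mount -a
# anything still locked means the drill failed
locked=$(zfs list -H -o name,keystatus -r "$POOL" | awk '$2=="unavailable"{print $1}')
if [ -n "$locked" ]; then
  echo "DRILL FAILED, still locked:"; echo "$locked"; exit 1
fi
echo "DRILL PASSED for $POOL"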
Replication with encrypted datasets: send/receive without heartbreak
Replication is where good intentions go to die. People encrypt datasets, then assume replication “just works” the way it did before. It can—but you need to choose what you’re replicating: plaintext, ciphertext, or something in between.
The three replication modes you should understand
- Send plaintext (receiver re-encrypts): data is decrypted on send side and received as plaintext stream; receiver can encrypt with its own key. This can be acceptable inside a trusted network, but it’s a different security story.
- Send raw (ciphertext): receiver gets encrypted blocks and (often) the same dataset key structure. This preserves encryption end-to-end but changes how keys and properties behave.
- Hybrid workflows: send raw for archives, plaintext for operational datasets, depending on compliance and restore needs.
Task 16: Send an encrypted dataset as a raw stream
cr0x@server:~$ zfs snapshot -r pool/data@replica-001
cr0x@server:~$ zfs send -w -R pool/data@replica-001 | zfs receive -uF backup/data
Interpretation: -w sends a raw stream (keeps encryption). -u receives without mounting. This is how you avoid auto-mount chaos on a backup target. After receive, you still need keys to mount if you want to read it.
Task 17: Receive and set a different key on the destination (when not sending raw)
cr0x@server:~$ zfs snapshot pool/app@replica-001
cr0x@server:~$ zfs send pool/app@replica-001 | zfs receive -o encryption=on -o keyformat=passphrase -o keylocation=prompt backup/app
Interpretation: This re-encrypts at the destination with destination-controlled keying. It’s useful when the backup environment has different access control requirements.
Replication gotcha: “raw send” can replicate your mistakes
If you used the same key hierarchy everywhere, raw send happily preserves it. That might be exactly what you want—or exactly what makes your “backup environment” not meaningfully separated from production.
Three corporate-world mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption (“It’ll prompt us at boot”)
They had a tidy fleet: a dozen database nodes and a handful of storage-heavy app servers. Someone turned on ZFS encryption for “compliance,” chose passphrase prompts, and called it done. It worked in the lab because there was always a person at the console.
Then came the mundane trigger: a kernel update and a coordinated reboot. Half the hosts were in a colocation facility with remote hands; the other half were cloud instances with a serial console that was technically available but practically unusable under time pressure. The boot process waited for passphrase input that nobody could provide at scale.
The on-call did what on-calls do: they improvised. They tried to mount datasets manually, then realized the datasets weren’t even unlocked. They tried to load keys, but the “right” passphrases were scattered across a few humans and one password manager vault—stored on an encrypted dataset that was, of course, locked on one of the down hosts.
The outage lasted long enough to become a cross-team incident. Nobody had done a reboot drill with encryption enabled. The wrong assumption wasn’t technical; it was organizational: they assumed a human prompt was an acceptable production dependency.
Afterward, they moved to a split approach: passphrases for the highest-sensitivity dataset that could tolerate manual unlock, and network-retrieved keyfiles for the rest with strict access controls. The key lesson wasn’t “never use prompts.” It was: don’t build a critical path that depends on a calm human at a console.
Mini-story 2: The optimization that backfired (“Let’s make unlock fully automatic”)
Another shop went the other way. They hated boot prompts, so they fully automated unlocking by storing keyfiles on the root filesystem of each server. They set permissions tightly, used configuration management, and even rotated keys quarterly. It looked mature.
Then a contractor account was compromised. The attacker didn’t need kernel exploits or crypto wizardry. They got enough privilege to read the keyfile on a subset of hosts. From there, encrypted ZFS datasets were a speed bump, not a barrier.
The post-incident review was uncomfortable because nobody had done anything “wrong” in the usual sense. The team optimized for availability and operability, and they achieved it. The failure was that they optimized past the threat model: encryption was meant to protect against lost drives and decommissioning leaks, but leadership believed it also protected against host compromise.
The fix wasn’t to rip out automation; the fix was to implement a real separation of duties. They moved keys off-host, required authenticated retrieval at boot, and reduced who could access the retrieval mechanism. They also introduced a “keys loaded” monitoring signal: if datasets were unlocked outside normal windows, it triggered investigation.
Second joke, because we’ve earned it: the only thing worse than a key you can’t find is a key everyone can find.
Mini-story 3: The boring but correct practice that saved the day (the “restore rehearsal” nobody wants)
A finance-adjacent company had a ritual: once per quarter, they did an encrypted restore rehearsal. It was not glamorous. It cost time. It annoyed people. It was also the reason they didn’t have a career-ending incident.
The rehearsal was simple: pick a dataset, export and import on a clean host, load keys using the documented method, mount read-only, verify data integrity, and document every deviation. The point wasn’t “can we do it eventually”; the point was “can we do it while tired, under pressure, and with a change freeze in effect.”
One quarter, the rehearsal failed. Not because of ZFS itself, but because the key escrow process had drifted. A rotation had occurred, and the updated key material never made it to the offline escrow. They caught it during a planned test, not during a ransomware event.
They fixed the process and added a rule: a key rotation is not complete until restore rehearsal passes. People complained. Then a storage shelf later died in an unplanned way and they had to rebuild from replication and backups. Unlocking and mounting were the least exciting parts of the recovery—which is exactly the point.
Fast diagnosis playbook: what to check first, second, third
This is for the moment when encrypted datasets won’t mount and everyone is staring at you like you personally misplaced the laws of mathematics.
First: determine whether it’s a key problem or a pool problem
cr0x@server:~$ zpool status -x
all pools are healthy
cr0x@server:~$ zpool import
pool: pool
id: 1234567890
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
Interpretation: If the pool can’t import or is degraded, fix that first. Keys don’t help if the pool isn’t online.
Second: check key status and encryption root
cr0x@server:~$ zfs list -r -o name,keystatus,encryptionroot,mounted pool | sed -n '1,25p'
NAME KEYSTATUS ENCRYPTIONROOT MOUNTED
pool - - yes
pool/system available pool/system yes
pool/archive unavailable pool/archive no
Interpretation: If keystatus=unavailable, don’t waste time on mountpoint tuning. Your next move is zfs load-key.
Third: validate keylocation availability and service ordering
cr0x@server:~$ zfs get -o name,property,value keylocation pool/archive
NAME PROPERTY VALUE
pool/archive keylocation file:///etc/zfs/keys/pool.archive.key
cr0x@server:~$ ls -l /etc/zfs/keys/pool.archive.key
ls: cannot access '/etc/zfs/keys/pool.archive.key': No such file or directory
Interpretation: Missing key file is the most common boot-time failure. If it exists, check permissions. If it doesn’t, check whether it lives on a filesystem that wasn’t mounted yet.
Fourth: try controlled import and unlock
cr0x@server:~$ zpool export pool
cr0x@server:~$ zpool import -N pool
cr0x@server:~$ zfs load-key -a
Interpretation: Controlled sequences reduce surprises. If load-key fails, the error message is usually actionable (wrong passphrase, missing file, invalid URI).
Fifth: if still stuck, collect evidence before improvising
cr0x@server:~$ zfs get -r -o name,property,value,source encryption,keyformat,keylocation,keystatus,encryptionroot pool > /root/zfs-encryption-audit.txt
cr0x@server:~$ journalctl -b --no-pager > /root/boot-journal.txt
Interpretation: This preserves the state for later analysis, and it stops the team from “fixing” the problem into a mystery.
Common mistakes: symptoms and fixes
Mistake 1: Key file stored on the encrypted pool
Symptom: after reboot, zfs load-key fails because the key file path doesn’t exist; datasets remain locked; chicken-and-egg situation.
Fix: store key material on an unencrypted root filesystem, removable media, or a network retrieval mechanism available before ZFS mount. Re-test with zpool import -N and zfs load-key.
Mistake 2: “We used encryption, so backups are safe”
Symptom: backups are raw encrypted streams but keys are not escrowed; restore fails when original host is gone.
Fix: escrow the keys (or passphrases) necessary to unlock replicas, and perform restore rehearsals. If using raw send, ensure the receiving side can unlock independently of the sender.
Mistake 3: Key rotation done without a reboot/import test
Symptom: everything works until the next reboot; then datasets won’t unlock because keylocation points to a rotated key that wasn’t deployed everywhere.
Fix: after rotation, run an export/import test (Task 15) in a maintenance window or on a canary host. Treat rotation as incomplete until it passes.
Mistake 4: Inheritance drift creates “mystery keys”
Symptom: some child datasets require different keys; unlock automation loads most keys, but one dataset stays locked and an application fails in a weird way.
Fix: audit encryptionroot and keylocation sources. Standardize on an encryption-root strategy and enforce it via code review or policy checks.
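A quick audit sketch: list every dataset that is its own encryption root, then count how many datasets hang off each one. A surprise entry here is a key your unlock automation probably doesn’t know about.
cr0x@server:~$ zfs list -r -H -o name,encryptionroot pool | awk '$2 != "-" && $1 == $2 {print $1}'
cr0x@server:~$ zfs get -r -H -o value encryptionroot pool | sort | uniq -c | sort -rn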
Mistake 5: Unattended boot enabled with locally stored keys, then declared “secure”
Symptom: compliance review asks “who can unlock the data?” and the honest answer is “anyone with root on the host,” which is not what leadership believed.
Fix: align the threat model. If you need protection against host compromise, local keys won’t do. Move key retrieval off-host and reduce who can access retrieval paths.
Mistake 6: Confusing pool import failures with key failures
Symptom: engineers repeatedly run zfs load-key but the pool isn’t imported or is read-only due to errors.
Fix: start with zpool status and zpool import. Get the pool healthy/imported first; then unlock.
Mistake 7: Forgetting about snapshots and replication access
Symptom: backup system receives encrypted datasets but can’t mount them for verification; or verification mounts leak plaintext into environments you didn’t plan for.
Fix: decide whether the backup target is allowed to decrypt. If yes, manage keys there. If not, verify via checksum/metadata-level checks and keep it locked by default.
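If the backup target stays locked by default, you can still verify integrity and replication freshness without loading any key: scrub validates on-disk checksums, and snapshot listings don’t require decryption. A sketch, assuming a backup pool named backup:
cr0x@server:~$ zpool scrub backup
cr0x@server:~$ zpool status backup
cr0x@server:~$ zfs list -t snapshot -r backup/data -o name,creation | tail -n 5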
Checklists / step-by-step plan
Checklist 1: Design decisions (before you type commands)
- Write down your threat model: stolen disks? rogue admin? compromised host? compliance checkbox?
- Decide unattended boot policy per system class (database nodes vs archival vs laptops).
- Pick key granularity: single encryption root vs per-domain vs per-dataset.
- Define escrow/break-glass: who can unlock, how, and where secrets live when the primary system is down.
- Define rotation cadence and what “rotation complete” means (hint: tested import/unlock).
Checklist 2: Build an encrypted dataset hierarchy safely
- Create an encryption root dataset for a domain (e.g., pool/data).
- Set child datasets to inherit keys unless there is a justified separation.
- Document the encryption roots and keylocations in your infrastructure repo (not in someone’s head).
- Run an export/import drill (Task 15) before production cutover.
Checklist 3: Operational runbook for reboot and recovery
- Import pools without mounting: zpool import -N.
- Load keys: zfs load-key -a.
- Mount: zfs mount -a.
- Verify critical datasets are mounted and keys are available.
- Check application health only after storage is confirmed.
Checklist 4: Key rotation step-by-step (safe version)
- Announce maintenance window and identify blast radius (encryption roots affected).
- Generate new key material (passphrase policy or keyfile bytes).
- Deploy key material to all relevant nodes (or key retrieval system), but do not delete old material yet.
- Change key: zfs change-key on the affected encryption roots.
- Perform a controlled export/import test on a canary host.
- Only after successful reboot/import: remove old key material from active locations, keep escrow per policy.
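A condensed sketch of the generate/deploy/change steps for a keyfile-based encryption root (paths illustrative; it mirrors Task 9 and adds a dry-run check before anyone touches the old material):
cr0x@server:~$ head -c 32 /dev/urandom > /etc/zfs/keys/pool.data.key.new
cr0x@server:~$ chmod 0400 /etc/zfs/keys/pool.data.key.new
cr0x@server:~$ zfs set keylocation=file:///etc/zfs/keys/pool.data.key.new pool/data
cr0x@server:~$ zfs change-key pool/data
cr0x@server:~$ zfs load-key -n pool/data && echo "new key material verified"
Escrow the new material and pass the canary export/import test before the old keyfile goes anywhere.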
FAQ
1) If I encrypt a parent dataset, are children automatically encrypted?
If children are created under an encrypted dataset, they’ll typically inherit encryption. But inheritance can drift if someone sets properties locally or creates datasets in unexpected places. Confirm with zfs get -r encryption,encryptionroot.
2) Does zfs change-key re-encrypt all my data?
No, it generally rewraps key material (fast). It’s still operationally risky because you can lock yourself out if you mishandle keylocation or escrow. Treat it like a change that needs testing.
3) What’s the difference between keyformat=passphrase and keyformat=raw?
passphrase derives key material from a human-entered secret. raw expects actual key bytes (often stored in a file). Raw is automation-friendly and can be high entropy, but it increases the importance of key file protection.
4) Can I unlock a dataset without mounting it?
Yes. zfs load-key unlocks (loads key into memory); mounting is separate. This is useful for controlled recovery and to avoid mounting in the wrong environment.
5) If a dataset is encrypted, is the pool itself encrypted?
ZFS encryption is dataset-level. A pool can contain both encrypted and unencrypted datasets. Your audits should check the dataset hierarchy, not just assume “the pool is encrypted.”
6) Can I replicate encrypted datasets to a backup site and keep them encrypted end-to-end?
Yes, with raw sends (e.g., zfs send -w). But then you must ensure the backup site has an independent, tested way to unlock—or intentionally keep it locked and accept different verification methods.
7) Why did my system boot but services failed, even though the pool imported fine?
Commonly: pool imported, but keys weren’t loaded, so encrypted datasets didn’t mount. Check zfs list -o name,keystatus,mounted and system logs for zfs-load-key failures.
8) Is it safe to store key files on the OS disk?
It depends on your threat model. It’s operationally convenient and protects against stolen data disks, but it does not protect against an attacker with root on the host. If you need stronger guarantees, use off-host key retrieval and tighter access controls.
9) How do I prove to auditors (and myself) that recovery works?
By performing an import/unlock/mount drill on a clean system, using only documented procedures and escrowed secrets. Save the output of zfs get and the steps you took. Repeat on a schedule.
Conclusion: boring beats heroic
ZFS encryption is good engineering. But the winning move isn’t enabling it—it’s operationalizing it. Your future self does not want a clever setup; they want a setup that survives reboots, restores, staff turnover, and the awkward moment when your “secure” system is inaccessible because the key was stored somewhere… secure.
If you take only one habit from this: run the export/import + load-key drill before you need it. Disaster Day doesn’t care that the design looked clean on a diagram. It cares whether the person on call can unlock data without guessing.