The worst time to learn your backups are “secure” is during a restore. The second-worst time is
during an audit where someone asks, calmly, why production database dumps are sitting in object
storage as plain text with a timestamp in the filename.
Docker makes shipping apps easy. It also makes it easy to back up the wrong thing, encrypt it in
the wrong place, and then discover your restore process depends on a secret that lived on the box
you just lost. This is a guide for people who run production systems and want encryption that
survives real disasters—operator errors included.
Threat model first, tools second
“Encrypt the backups” is not a requirement; it’s a vague wish. Requirements are specific:
who must not read the backups, what failures you can tolerate, and how you’ll
restore when half your infrastructure is on fire and the other half is waiting for change
control.
Start with three common threat models:
- Cloud storage compromise: credentials leak, bucket policy is wrong, or an internal account goes rogue. Encryption should still prevent readable data exfiltration.
- Ransomware / attacker on the host: the attacker can read local files and maybe your backup scripts. Encryption may or may not help depending on where keys live.
- Operator error: the most common. Someone uploads a "temporary" dump to a shared place, or rotates keys without a restore plan. Encryption should reduce blast radius and restores should remain routine.
For Docker environments, the operational truth is this: you’re not “backing up Docker.” You’re
backing up the data behind containers, plus enough metadata to reconstruct it. Encryption needs
to wrap the artifact you actually store and move—not just the disk, not just the transport, and
definitely not “somewhere in the stack we assume is encrypted.”
Interesting facts and historical context
- Fact 1: Early Unix backup culture treated tapes as “physical security.” Encryption was rare because the tape stayed in a locked cabinet—until it didn’t.
- Fact 2: GPG (1990s) became the default “encrypt a file” tool for ops because it worked offline and across platforms, not because it was easy.
- Fact 3: Many “backup encryption” failures are really key availability failures: the ciphertext is fine; the restore cannot decrypt in time.
- Fact 4: Compression-before-encryption is a long-standing best practice: encryption destroys redundancy, so compressing ciphertext afterwards achieves almost nothing.
- Fact 5: The rise of object storage made “offsite backups” cheap and common—also making accidental public exposure a recurring industry headline.
- Fact 6: Deduplication and encryption have been in a cold war for years: encryption randomizes data and can kill dedupe efficiency unless designed carefully.
- Fact 7: A surprising number of teams still “back up containers” (images) instead of backing up state, then wonder why the database is empty after restore.
- Fact 8: Key rotation is older than cloud. The hard part has always been compatibility with old data and predictable operational rituals.
One reliable paraphrased idea from Richard Cook (resilience engineering): success and failure often come from the same everyday work; the difference is context.
Backup encryption is exactly that. The same pipeline that protects you can strand you—depending on whether your restore path was treated as a first-class feature.
Principles: encrypt without breaking restores
1) Encrypt the artifact you store, not the machine you wish you had
Disk encryption (LUKS, cloud EBS encryption, encrypted ZFS datasets) is good hygiene. But
backups are portable by design. You will copy them to other machines, other buckets, other
regions, maybe a laptop during an incident. So encryption must travel with the backup artifact.
2) The restore path is the product
If decrypting needs a VPN to a dead datacenter, you built a trap. Restores must work from a cold
start: new host, minimal tooling, keys obtainable via a controlled but available process.
3) Prefer simple crypto workflows over clever ones
Complex key hierarchies and bespoke envelope schemes can be correct and still be un-restorable at 03:00.
Most teams do best with:
- One file per backup artifact
- Compression then encryption (streaming if big)
- Recipient-based encryption (no shared passphrases)
- Documented and tested key retrieval
4) Separate confidentiality from integrity, but verify both
Modern encryption modes (and tools like age) provide authenticated encryption: confidentiality plus integrity.
Still, you should verify the artifact’s checksum and test decryption and restore. Silent corruption is not a vibe.
5) Plan for key loss, not just key theft
Most orgs obsess over preventing theft and ignore loss. Loss is how you end up with “perfectly secure”
backups that cannot be restored. Your key management should have:
- At least two independent recovery paths (e.g., two operators, or vault + break-glass)
- Documented rotation
- Access logs
- Regular restore drills
What to back up in Docker (and what not to)
Containers are cattle; volumes and external systems are the pets you actually need. In Docker terms,
your backup scope is usually:
- Named volumes (e.g., Postgres data dir, app uploads)
- Bind mounts where state lives (careful with permissions and path drift)
- Database logical dumps (pg_dump, mysqldump) or physical backups (pg_basebackup, xtrabackup)
- Configuration state: docker-compose.yml, env files (but treat secrets carefully), reverse proxy configs
- Secrets: not inside git, not inside the backup unless you have a reason and a separate protection story
Things people often back up by mistake:
- Container layers/images (you can rebuild; also images can contain secrets if you’re sloppy)
- /var/lib/docker wholesale (possible, but brittle across versions/storage drivers; and it’s huge)
- Runtime caches (wasteful and can slow restores)
Opinion: back up data and a rebuildable deployment definition. If your restore depends on copying
a Docker daemon state directory between hosts, you’re already negotiating with chaos.
Where to encrypt: at rest, in flight, or in the pipeline
Transport encryption (TLS)
Necessary, not sufficient. TLS protects data in transit, but the endpoint still sees plaintext. If
the bucket is compromised later, TLS did nothing.
Storage-side encryption (SSE)
S3-style server-side encryption (SSE-S3 or SSE-KMS) is useful, especially for compliance and to
reduce “oops bucket” impact. But it still means the storage provider can decrypt, and the object
is stored in a form that could be decrypted with access to the right controls. It’s not end-to-end.
Client-side encryption (encrypt before upload)
This is the default recommendation for “keep secrets safe.” You encrypt the backup artifact
locally, upload ciphertext. Even if your storage is compromised, the attacker gets encrypted blobs.
Reality check: client-side encryption is only as good as key hygiene. If the same host that
encrypts also stores the key unprotected, you mainly improved compliance paperwork.
Crypto choices: age, GPG, OpenSSL, S3 SSE
age: boring in the right way
age is a modern file encryption tool designed to be hard to misuse. Recipient-based encryption
is straightforward, its behavior in scripts is predictable, and it’s less “mystical” than GPG.
Use it when you want a clean workflow and don’t need the full PGP ecosystem.
Use case: daily encrypted tarballs, streamed database dumps, offsite copies.
GPG: powerful, sharp edges included
gpg is everywhere and supports complex key management. It also has complexity debt: trust models,
keyrings, pinentry, agent behavior, and human confusion.
Use case: when your org already has PGP key distribution and you need interoperability.
OpenSSL: flexible, but easy to shoot yourself in the foot
OpenSSL can encrypt files with a passphrase, but passphrase-based workflows are routinely abused:
shared secrets, weak passphrases, passphrases embedded in scripts, etc. If you use OpenSSL, use
modern AEAD modes and keep passphrases out of process lists and shell history.
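If OpenSSL is genuinely what you have, a minimal sketch of keeping the passphrase out of argv, process lists, and shell history is to read it from a root-only file. The paths and iteration count below are placeholders, and since the openssl enc utility itself does not expose AEAD modes, pair it with a checksum manifest (see Task 12 later) for integrity.
# Encrypt with a passphrase read from a root-only file (placeholder paths; never pass it as an argument).
umask 077
openssl enc -aes-256-cbc -pbkdf2 -iter 600000 -salt \
  -pass file:/secure/backup.pass \
  -in /backups/app_pg.dump.gz -out /backups/app_pg.dump.gz.enc

# Decrypt on the restore host with the same KDF settings:
openssl enc -d -aes-256-cbc -pbkdf2 -iter 600000 \
  -pass file:/secure/backup.pass \
  -in /backups/app_pg.dump.gz.enc -out /backups/app_pg.dump.gz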
Server-side encryption: good baseline, not the finish line
SSE helps, particularly SSE-KMS with strong IAM boundaries and audit logs. But for “cloud account
compromised” scenarios, client-side encryption is the stronger story.
Joke 1: If your backup encryption key is in the same bucket as the backups, you didn’t build security—you built a themed escape room.
Key management that doesn’t eat your weekend
Pick a key model you can operate under stress
The goal is not “no one can ever decrypt.” The goal is “only authorized people can decrypt, and
they can do it during an incident without improvising.”
Recommended patterns
- Recipient-based encryption (preferred): age recipients or GPG public keys. Encrypt to multiple recipients (e.g., ops team key + break-glass key). No shared passphrase. No single human bottleneck (see the sketch after this list).
- Break-glass key stored offline: printed QR in a safe, or an HSM-protected export process, depending on your world. The point is independence from the production host.
- Vault/KMS for key distribution, not necessarily encryption: store the private key or a wrapped data key in a system with strong access control and audit logs.
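A minimal sketch of the recipient-based pattern with age; the two recipient strings are placeholders for your ops team key and your break-glass key.
# Encrypt to two recipients: either private key alone can decrypt (placeholder key strings).
OPS_RECIPIENT="age1...opsteam"
BREAKGLASS_RECIPIENT="age1...breakglass"

age -r "$OPS_RECIPIENT" -r "$BREAKGLASS_RECIPIENT" \
  -o /backups/app_pg.dump.age /backups/app_pg.dump
Losing one identity is then an inconvenience, not a data-loss event.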
What to avoid
- Single passphrase shared in a chat channel (now it’s in backups of chat too)
- Keys stored on the same host without additional protection
- Key rotation with no re-encryption plan (you’ll discover this during restore)
- “We can always ask Alice” dependency (Alice goes on vacation, or leaves)
Key rotation: do it without breaking old restores
Rotation needs a compatibility window. Keep old private keys available (under controls) until all
backups encrypted under them have expired. If you must revoke access, re-encrypt the retained
archives, and record which key ID encrypted which object.
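A hedged sketch of retention-aware rotation: decrypt with the retiring identity, re-encrypt to the new recipients in one pipe, and record which recipients cover which object. All paths and key names below are placeholders.
# Re-encrypt a retained archive from the old key to the new recipients (placeholder names).
OLD_IDENTITY=/secure/keys/backup-old.agekey
NEW_RECIPIENT="age1...newops"
BREAKGLASS_RECIPIENT="age1...breakglass"

src=/backups/app_pg.dump.age
age -d -i "$OLD_IDENTITY" "$src" \
  | age -r "$NEW_RECIPIENT" -r "$BREAKGLASS_RECIPIENT" -o "${src}.new"
mv "${src}.new" "$src"

# Keep a record of which recipients encrypted which object.
echo "$(date -u +%F) ${src} recipients=new-ops,break-glass" >> /backups/key-map.log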
Practical encrypted backup pipelines (with restore paths)
Pipeline A: volume snapshot via tar → compress → age → upload
This is the workhorse for “named volume contains files” cases. It’s not database-consistent unless
your application can tolerate it. For databases, prefer logical/physical backups from the DB itself.
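A minimal end-to-end sketch of Pipeline A, reusing the example recipient key from the tasks below; the bucket name and paths are placeholders, and the plaintext never touches disk.
# Volume files -> tar -> gzip -> age -> object storage, streamed end to end (placeholder bucket).
RECIPIENT="age1k8t6y7z8n6k5m2p9k4d3s2q1w0e9r8t7y6u5i4o3p2a1s0d9f8g7h6j"
STAMP=$(date -u +%F)

docker run --rm -v app_uploads:/data:ro alpine:3.20 \
    sh -c "cd /data && tar -cf - . | gzip -1" \
  | age -r "$RECIPIENT" \
  | aws s3 cp - "s3://prod-backups/app_uploads/${STAMP}/app_uploads.tgz.age" --sse aws:kms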
Pipeline B: Postgres logical dump streamed and encrypted
For many orgs, pg_dump is the sweet spot: consistent snapshot, portable, and you can stream it so the plaintext
doesn’t land on disk. Then encrypt and ship.
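A sketch of Pipeline B as a single pipe, again with the example recipient and a placeholder bucket; the custom dump format already compresses internally, so no extra gzip stage is needed.
# pg_dump (custom format, internally compressed) -> age -> object storage in one pipe (placeholder bucket).
RECIPIENT="age1k8t6y7z8n6k5m2p9k4d3s2q1w0e9r8t7y6u5i4o3p2a1s0d9f8g7h6j"
STAMP=$(date -u +%F)

docker exec -i pg pg_dump -U postgres -d app --format=custom \
  | age -r "$RECIPIENT" \
  | aws s3 cp - "s3://prod-backups/app_pg/${STAMP}/app_pg.dump.age" --sse aws:kms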
Pipeline C: object storage with SSE-KMS plus client-side encryption
Layered security is normal. SSE-KMS gives you provider-side controls and audit logs; client-side encryption
gives you “bucket compromise still yields ciphertext.”
Restore path thinking
For every pipeline, document:
- Where the encrypted artifact lives
- How to fetch it (credentials and who has them)
- How to get the decryption key (primary and break-glass)
- How to verify integrity (checksum/signature)
- How to restore into a clean Docker environment
12+ real tasks: commands, outputs, and decisions
These are the kinds of checks you run when you’re building the system or debugging why a restore is slow, failing, or suspiciously “successful” with missing data.
Each task includes: command, what the output means, and the decision you make from it.
Task 1: Enumerate containers and identify stateful ones
cr0x@server:~$ docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}'
NAMES IMAGE STATUS
pg postgres:16 Up 3 days
app myco/app:2.8.1 Up 3 days
nginx nginx:1.25 Up 3 days
Meaning: You’re looking for databases, file stores, queues—anything with state.
Decision: Mark pg and likely app as requiring backups beyond redeploying images.
Task 2: List volumes and map them to containers
cr0x@server:~$ docker volume ls
DRIVER VOLUME NAME
local pg_data
local app_uploads
cr0x@server:~$ docker inspect pg --format '{{json .Mounts}}'
[{"Type":"volume","Name":"pg_data","Source":"/var/lib/docker/volumes/pg_data/_data","Destination":"/var/lib/postgresql/data","Driver":"local","Mode":"z","RW":true,"Propagation":""}]
Meaning: pg_data is a named volume; backing up that directory raw risks inconsistency unless the DB is stopped or you use DB-native backups.
Decision: Use pg_dump or pg_basebackup rather than tarring pg_data while running.
Task 3: Verify Docker Compose definition exists and is versioned
cr0x@server:~$ ls -l /srv/app/docker-compose.yml
-rw-r--r-- 1 root root 2481 Jan 2 09:41 /srv/app/docker-compose.yml
Meaning: You can recreate the stack. Without this, restores turn into archaeology.
Decision: Put compose files in version control; keep secrets out of it or use external secret stores.
Task 4: Confirm where secrets come from (env vars vs files)
cr0x@server:~$ docker inspect app --format '{{range .Config.Env}}{{println .}}{{end}}' | grep -E 'PASSWORD|TOKEN|SECRET' || true
DATABASE_URL=postgres://app:REDACTED@pg:5432/app
JWT_SECRET=REDACTED
Meaning: If secrets live in env vars, they can leak via diagnostics, crash dumps, and overly-helpful scripts.
Decision: Decide whether your backups should include secret material. Usually: no. Back up data; recover secrets from a separate controlled path.
Task 5: Create an age key pair (operator workstation or secure host)
cr0x@server:~$ age-keygen -o /secure/keys/backup.agekey
Public key: age1k8t6y7z8n6k5m2p9k4d3s2q1w0e9r8t7y6u5i4o3p2a1s0d9f8g7h6j
Meaning: You now have a recipient public key (safe to distribute) and a private key (treat as break-glass-worthy).
Decision: Store the private key outside the Docker host and outside the backup location. Encrypt to multiple recipients if possible.
Task 6: Encrypt a small test file and verify decryption works
cr0x@server:~$ printf "hello-restore\n" > /tmp/restore-test.txt
cr0x@server:~$ age -r age1k8t6y7z8n6k5m2p9k4d3s2q1w0e9r8t7y6u5i4o3p2a1s0d9f8g7h6j -o /tmp/restore-test.txt.age /tmp/restore-test.txt
cr0x@server:~$ age -d -i /secure/keys/backup.agekey -o /tmp/restore-test.out /tmp/restore-test.txt.age
cr0x@server:~$ cat /tmp/restore-test.out
hello-restore
Meaning: The key is valid, tooling works, and decryption is not “future you’s problem.”
Decision: Bake age install into your restore environment or keep a known-good static binary accessible.
Task 7: Stream a Postgres dump and encrypt it (no plaintext file on disk)
cr0x@server:~$ docker exec -i pg pg_dump -U postgres -d app --format=custom | age -r age1k8t6y7z8n6k5m2p9k4d3s2q1w0e9r8t7y6u5i4o3p2a1s0d9f8g7h6j -o /backups/app_pg.dump.age
cr0x@server:~$ ls -lh /backups/app_pg.dump.age
-rw-r--r-- 1 root root 192M Jan 3 01:00 /backups/app_pg.dump.age
Meaning: You produced an encrypted artifact. The dump traveled through a pipe; the host didn’t store plaintext at rest.
Decision: Adopt streaming for large sensitive backups; reduce the “plaintext residue” problem.
Task 8: Validate you can decrypt and restore that dump into a fresh database
cr0x@server:~$ docker run --rm -d --name pg-restore -e POSTGRES_PASSWORD=restore -p 5433:5432 postgres:16
d4b2f8f15c65f6b6f8b0b76d1cc9d9d2a2b0a6f0d3f0d8f9a1c2b3d4e5f6a7b8
cr0x@server:~$ age -d -i /secure/keys/backup.agekey /backups/app_pg.dump.age | pg_restore -h 127.0.0.1 -p 5433 -U postgres -d postgres --clean --if-exists
Password:
Meaning: This proves the end-to-end restore path, not just encryption.
Decision: Make this a scheduled drill. If you can’t restore into a clean instance, you don’t have backups—you have artifacts.
Task 9: Back up a non-DB volume with tar + compression + encryption
cr0x@server:~$ docker run --rm -v app_uploads:/data:ro -v /backups:/out alpine:3.20 sh -c "cd /data && tar -cf - . | gzip -1" | age -r age1k8t6y7z8n6k5m2p9k4d3s2q1w0e9r8t7y6u5i4o3p2a1s0d9f8g7h6j -o /backups/app_uploads.tgz.age
cr0x@server:~$ file /backups/app_uploads.tgz.age
/backups/app_uploads.tgz.age: age encrypted file
Meaning: You made a portable encrypted archive of file data.
Decision: For file volumes, this is fine. For databases, don’t do this while the DB is live unless you have a consistency mechanism.
Task 10: Restore the encrypted tarball into a new empty volume
cr0x@server:~$ docker volume create app_uploads_restore
app_uploads_restore
cr0x@server:~$ age -d -i /secure/keys/backup.agekey /backups/app_uploads.tgz.age | gunzip -c | docker run --rm -i -v app_uploads_restore:/restore alpine:3.20 sh -c "cd /restore && tar -xvf - | head"
./
./avatars/
./avatars/u123.png
./docs/
./docs/terms.pdf
Meaning: Data is recoverable and lands in a Docker-managed volume.
Decision: Use a restore volume first, then swap it in. Direct restores into live volumes are how you turn a small incident into a long incident.
Task 11: Upload encrypted backups to S3-compatible storage and confirm server-side encryption status
cr0x@server:~$ aws s3 cp /backups/app_pg.dump.age s3://prod-backups/app_pg/2026-01-03/app_pg.dump.age --sse aws:kms
upload: backups/app_pg.dump.age to s3://prod-backups/app_pg/2026-01-03/app_pg.dump.age
cr0x@server:~$ aws s3api head-object --bucket prod-backups --key app_pg/2026-01-03/app_pg.dump.age --query '{Size:ContentLength,SSE:ServerSideEncryption,KMS:SSEKMSKeyId}'
{
"Size": 201326592,
"SSE": "aws:kms",
"KMS": "arn:aws:kms:us-east-1:111122223333:key/REDACTED"
}
Meaning: The object is ciphertext from client-side encryption, and additionally protected by SSE-KMS.
Decision: Keep SSE on. It won’t replace client-side encryption, but it adds guardrails and audit trails.
Task 12: Generate and store a checksum manifest for integrity checks
cr0x@server:~$ sha256sum /backups/app_pg.dump.age /backups/app_uploads.tgz.age > /backups/manifest.sha256
cr0x@server:~$ cat /backups/manifest.sha256
2f9e0f1c0a8c5d4b3e2a1f0e9d8c7b6a5f4e3d2c1b0a9e8d7c6b5a4f3e2d1c0b /backups/app_pg.dump.age
9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a8b /backups/app_uploads.tgz.age
Meaning: You can detect corruption or partial transfers before you try to restore.
Decision: Store the manifest alongside backups (and ideally also in a separate log system). Verify before decrypting large files.
Task 13: Verify remote object size matches local before restore
cr0x@server:~$ aws s3 ls s3://prod-backups/app_pg/2026-01-03/app_pg.dump.age
2026-01-03 01:02:11 201326592 app_pg.dump.age
Meaning: Size matches expectations; not a guarantee, but catches obvious truncation.
Decision: If size is off, stop. Investigate transfer, multipart upload failures, lifecycle rules, or storage-side corruption.
Task 14: Check lifecycle/immutability posture (ransomware resistance)
cr0x@server:~$ aws s3api get-bucket-versioning --bucket prod-backups
{
"Status": "Enabled"
}
cr0x@server:~$ aws s3api get-object-lock-configuration --bucket prod-backups
{
"ObjectLockConfiguration": {
"ObjectLockEnabled": "Enabled",
"Rule": {
"DefaultRetention": {
"Mode": "GOVERNANCE",
"Days": 30
}
}
}
}
Meaning: Versioning and object lock reduce the chance an attacker (or script bug) deletes your only good backup.
Decision: If you can enable immutability, do it. Encryption protects confidentiality; immutability protects existence.
Task 15: Confirm the restore environment has the right tools before you begin
cr0x@server:~$ age --version
age 1.2.0
cr0x@server:~$ pg_restore --version
pg_restore (PostgreSQL) 16.1
cr0x@server:~$ docker --version
Docker version 24.0.7, build afdd53b
Meaning: Toolchain compatibility is a classic restore failure mode, especially with database restore utilities.
Decision: Keep a known-good restore image or a runbook specifying versions. Don’t discover version mismatches mid-incident.
Fast diagnosis playbook
When encrypted backups are “slow” or restores “hang,” it’s usually one of four bottlenecks: CPU, disk, network, or the database apply phase.
Don’t guess. Check in this order and stop as soon as you find the limiting factor.
1) Confirm you’re restoring the right artifact (and it’s complete)
- Check remote object size matches expectations.
- Verify checksum manifest.
- Decision: if integrity is unclear, do not proceed with restore into a live target.
2) Identify which phase is slow: download, decrypt, decompress, or apply
- Time each segment with simple pipelines (see the timing sketch below).
- Decision: focus optimization where time is actually spent; encryption is often blamed but not guilty.
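One way to do that, using the example objects from the tasks below as placeholders; this is for diagnosis only, since the intermediate plaintext lands on disk.
# Time each phase separately instead of guessing (placeholder object and key paths).
time aws s3 cp s3://prod-backups/app_pg/2026-01-03/app_pg.dump.age /restore/app_pg.dump.age
time age -d -i /secure/keys/backup.agekey -o /restore/app_pg.dump /restore/app_pg.dump.age
time pg_restore -h 127.0.0.1 -p 5433 -U postgres -d postgres /restore/app_pg.dump
# Clean up /restore afterwards; this is a diagnostic, not the normal streaming path.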
3) Check CPU saturation and entropy-starvation myths
- Modern systems rarely “wait for entropy” for age/GPG in normal use; CPU is the usual suspect.
- Decision: if CPU is pegged, consider parallelism or faster compression settings, not weaker encryption.
4) Check disk I/O and filesystem behavior
- Restore targets can be slower than sources, especially on networked storage.
- Decision: restore to fast local storage, then move/attach if feasible.
5) For databases: the apply phase dominates
- pg_restore and index creation can dwarf download/decrypt times.
- Decision: tune restore options (jobs, maintenance settings) and validate WAL/checkpoint settings for the restore environment; a parallel-restore sketch follows below.
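A hedged example of the jobs knob: --jobs needs a seekable custom-format file rather than stdin, so decrypt to disk first; 4 jobs is a starting point, not a recommendation.
# Parallel apply for a custom-format dump (placeholder paths; --jobs cannot read from stdin).
age -d -i /secure/keys/backup.agekey -o /restore/app_pg.dump /backups/app_pg.dump.age
pg_restore -h 127.0.0.1 -p 5433 -U postgres -d postgres \
  --clean --if-exists --jobs=4 /restore/app_pg.dump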
Common mistakes (symptoms → root cause → fix)
Mistake 1: “Decrypt failed: no identity found” during restore
Symptoms: age refuses to decrypt; restore blocked immediately.
Root cause: You encrypted to the wrong recipient, lost the private key, or the restore host doesn’t have access to the identity file.
Fix: Encrypt to multiple recipients (team key + break-glass key). Store identity retrieval in the runbook. Test decryption weekly with a small artifact.
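A hedged sketch of that weekly test, assuming your backup job also uploads a small age-encrypted canary object (for example, the restore-test file from Task 6) under a canary/ prefix; the bucket, key path, and alert hook are placeholders.
#!/bin/bash
# Weekly decrypt canary: fetch the newest small canary object and prove the key still works.
set -uo pipefail

KEY=/secure/keys/backup.agekey
latest=$(aws s3 ls --recursive s3://prod-backups/canary/ | sort | tail -n 1 | awk '{print $4}')

if aws s3 cp "s3://prod-backups/${latest}" - | age -d -i "$KEY" > /tmp/canary.out \
   && grep -q "hello-restore" /tmp/canary.out; then
  echo "decrypt canary OK: ${latest}"
else
  echo "decrypt canary FAILED: ${latest}" | mail -s "backup decrypt canary failed" ops@example.com
  exit 1
fi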
Mistake 2: Backups exist but restore yields empty application state
Symptoms: App boots after restore but no users/files/data appear.
Root cause: You backed up images/compose but not volumes or database dumps; or you restored only one of the state components.
Fix: Define a backup inventory: DB + uploads + config. Run restore drills that validate application-level correctness (not just “container is up”).
Mistake 3: Encrypted backups are huge and slow to move
Symptoms: Transfer windows blow out; storage costs climb.
Root cause: Encrypting already-compressed data is fine, but encrypting uncompressed data wastes space (and the ciphertext won’t compress downstream); often combined with backing up too much (logs, caches).
Fix: Compress before encrypting; exclude non-critical paths; consider DB-native compression or custom formats.
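Two small examples of the right ordering, with the recipient variable from earlier sketches and placeholder paths: compress file archives before age, and let pg_dump’s custom format handle its own compression.
# Files: compress first, then encrypt (RECIPIENT is the age public key from earlier sketches).
tar -cf - -C /var/lib/docker/volumes/app_uploads/_data . \
  | gzip -1 \
  | age -r "$RECIPIENT" -o /backups/app_uploads.tgz.age

# Postgres: the custom format compresses internally; no extra gzip stage before age.
docker exec -i pg pg_dump -U postgres -d app --format=custom --compress=6 \
  | age -r "$RECIPIENT" -o /backups/app_pg.dump.age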
Mistake 4: “We use SSE-KMS so we’re safe” (until the account is compromised)
Symptoms: Audit passes, but threat modeling doesn’t; a compromised IAM principal can fetch and decrypt.
Root cause: Relying solely on server-side encryption and assuming IAM is unbreakable.
Fix: Add client-side encryption. Treat cloud account compromise as a real scenario, not a theoretical one.
Mistake 5: Passphrase-based encryption embedded in scripts
Symptoms: Anyone with access to the script, process list, or CI logs can decrypt backups.
Root cause: Operational convenience winning a fistfight with security.
Fix: Move to recipient-based encryption. If you must use passphrases, fetch them securely at runtime and keep them out of logs and shell history.
Mistake 6: Volume tar backups of live databases
Symptoms: Restore completes but DB is corrupted or inconsistent; sometimes it “works” until it doesn’t.
Root cause: Filesystem-level capture of a changing database without snapshots or DB coordination.
Fix: Use pg_dump, pg_basebackup, or snapshot mechanisms coordinated with the database. Don’t tar live database directories.
Mistake 7: Key rotation breaks access to old backups
Symptoms: New backups decrypt; old ones don’t; compliance retention becomes a liability.
Root cause: Rotated away old keys without a retention-aware plan.
Fix: Keep old keys available under strict controls until old backups expire, or re-encrypt retained backups as part of rotation.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran a Docker Compose stack: Postgres, a web app, and a file upload volume. They had “encrypted backups”
because the VM disk was encrypted and the S3 bucket required SSE-KMS. The backup job copied a nightly pg_dump into
the bucket. It passed audits. Everyone slept.
Then a developer’s cloud access keys leaked via a misconfigured CI job. The attacker didn’t need to break encryption. With those keys,
they listed the bucket, pulled the dumps, and—because IAM allowed it—downloaded them. SSE-KMS did its job exactly as designed: it
decrypted for an authorized principal.
The team’s wrong assumption was subtle: “server-side encryption means the storage compromise can’t read backups.” It’s true only if the
attacker can’t become a principal that the service honors. In cloud environments, credential compromise is the whole game.
The fix was client-side encryption to a key not accessible to the compromised environment. They encrypted dumps with recipient keys held
outside the cloud account and uploaded ciphertext. The attacker could still steal objects, but they were just expensive noise.
Mini-story 2: The optimization that backfired
Another org wanted faster backups. They noticed compression took time and decided to “optimize” by encrypting first, then compressing.
The idea sounded reasonable to non-storage people: “encryption reduces data; compression will shrink it more.” That’s backwards in reality.
Their backup artifacts ballooned. Network transfer time doubled. Costs rose quietly until finance noticed. Restore time got worse too,
because now they had to move bigger objects before they could even start the decrypt-and-apply dance.
The deeper problem: they also lost any chance of deduplication in their downstream backup tooling, because encryption randomizes the stream.
They had accidentally forced “unique bytes every day,” which storage systems interpret as “please bill me for everything.”
They rolled back to compress-then-encrypt, with a low CPU compression level for streaming. Backups were smaller, transfers stabilized,
restores sped up, and nobody had to pretend the original change was “an experiment” (it was).
Mini-story 3: The boring but correct practice that saved the day
A regulated business ran Dockerized services on a couple of hosts. Nothing fancy. They did one unfashionable thing: a monthly restore drill.
Not a tabletop exercise. A real restore into a disposable environment, with a checklist and timeboxes.
They also maintained two decryption paths: the normal key in an internal secrets system with RBAC, and a break-glass private key stored offline
with a documented retrieval process that required two people. It was annoying, which is how you know it was working.
During an incident—storage failure plus a messy host rebuild—they discovered their secrets system was temporarily unreachable due to unrelated
networking work. This is where many teams stall and start improvising. They didn’t. They invoked the break-glass path, decrypted the backups,
restored Postgres and uploads, and got the business running.
Afterwards, the postmortem was calm. Not because they were lucky, but because the restore path wasn’t theoretical. It had been practiced,
timed, and made intentionally boring.
Joke 2: Backups are like parachutes—if you only test them in the lab, you’re still doing “research,” not safety.
Checklists / step-by-step plan
Step-by-step: implement encrypted backups for a Docker Compose stack
- Inventory state: list containers, volumes, bind mounts, and external dependencies (DBs, object stores). Decide what “restore complete” means at the application layer.
- Pick a primary backup method per component:
  - Postgres: pg_dump for portability; pg_basebackup for faster large restores (with more complexity).
  - Uploads/files: tar+gzip from the volume.
- Pick the encryption tool and model: age recommended; encrypt to multiple recipients.
- Design key custody: where private keys live, who can access, and the break-glass path.
- Implement streaming pipelines: minimize plaintext on disk.
- Generate integrity metadata: checksum manifests; optionally sign manifests if you need stronger provenance.
- Ship offsite: object storage with versioning and object lock if available.
- Test restore: restore into a fresh container/volume, validate app behavior and data counts.
- Schedule restore drills: at least monthly for critical systems; quarterly minimum for less critical.
- Log and alert on backup failures: “silent failure” is the default state of unattended scripts (a minimal wrapper sketch follows).
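A minimal wrapper sketch for that last item; the backup script path, log location, and mail alert are placeholders for whatever you actually run and however you actually page people.
#!/bin/bash
# Run the real backup job, log the outcome, and alert loudly on any failure (placeholder paths).
set -euo pipefail

LOG=/var/log/backups/app_pg.log
mkdir -p "$(dirname "$LOG")"

if /usr/local/bin/backup-app-pg.sh >> "$LOG" 2>&1; then
  echo "$(date -u +%FT%TZ) backup OK" >> "$LOG"
else
  echo "$(date -u +%FT%TZ) backup FAILED" >> "$LOG"
  tail -n 50 "$LOG" | mail -s "app_pg backup FAILED on $(hostname)" ops@example.com
  exit 1
fi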
Operational checklist: before you rotate keys
- List which backup objects are encrypted under which key (store key IDs/recipients in metadata).
- Define retention windows and ensure old keys remain available until data expires or is re-encrypted.
- Perform a restore using the “old key” path the week before rotation.
- Rotate keys, then immediately perform a restore using the “new key” path.
- Update the runbook, not just the code.
Incident checklist: you need to restore now
- Pick the restore point (RPO) and confirm it matches business needs.
- Fetch the artifact and verify checksum/size before decrypting.
- Obtain decryption key via primary path; if blocked, invoke break-glass.
- Restore into clean targets first (new volumes / new DB instance).
- Validate at the application layer (logins, record counts, critical workflows).
- Only then redirect traffic.
FAQ
1) Should I rely on disk encryption (LUKS/EBS) for backups?
No. Use it, but don’t rely on it. Disk encryption protects a disk. Backups move. Encrypt the backup artifact itself so protection travels with it.
2) Is server-side encryption (SSE-KMS) enough?
It’s a strong baseline and often required, but it’s not end-to-end. If an attacker gains IAM access that permits reads, SSE will happily decrypt for them.
Add client-side encryption for “bucket/account compromise” scenarios.
3) age or GPG for Docker backups?
If you’re starting fresh, pick age. It’s simpler and harder to misuse. Use GPG if your organization already has PGP key distribution and operational muscle memory.
4) Where should I store the private key for decrypting backups?
Not on the Docker host and not next to the backups. Store it in a controlled secrets system with audit logs, plus a break-glass offline copy with a documented retrieval procedure.
5) Should backups include application secrets?
Usually no. Back up data and configuration needed to redeploy, but recover secrets from a separate system. If you must back up secrets, treat that backup as higher sensitivity and isolate access.
6) How do I avoid plaintext landing on disk during backups?
Stream: dump → compress → encrypt → upload. Pipes are your friend. Validate that intermediate temp files aren’t created by your tools and that logs don’t capture sensitive content.
7) Do encrypted backups prevent ransomware?
Encryption prevents unauthorized reading; it does not prevent deletion or encryption-by-attacker. For ransomware resistance, you need immutability (object lock, versioning, offline copies) and strong IAM boundaries.
8) Why are restores slow even though encryption is “fast”?
Because restores are usually dominated by download time, decompression, or database apply/index rebuild. Measure each phase. Optimize the real bottleneck, not the tool you suspect.
9) How many recipients should I encrypt to?
At least two: an ops team key (or service key) and a break-glass recovery key. More recipients increase operational flexibility but also broaden who can decrypt, so keep it intentional.
10) Can I back up /var/lib/docker and call it done?
You can, but you probably shouldn’t. It’s large, version-sensitive, and mixes caches with state. Back up volumes and database-native artifacts; keep Docker daemon state out unless you have a proven need.
Conclusion: next steps you can actually do
If you want encrypted Docker backups that restore cleanly, stop treating encryption as a checkbox and treat restore as a feature.
The practical path is:
- Inventory state: volumes, bind mounts, databases, and “hidden” dependencies.
- Use DB-native backups for databases; tar volumes for file data.
- Compress then encrypt (streaming where possible).
- Use recipient-based encryption (age is a solid default).
- Keep keys off the host; provide a break-glass path that’s independent and documented.
- Make restore drills routine and measurable.
- Add immutability/versioning to protect backup existence, not just confidentiality.
Do those, and your next restore won’t be a thriller. It’ll be work. Boring work. The best kind.