It always starts with a simple request: “Can we get last week’s file back?” Then a pause. Then someone says, “We should have it… somewhere.” That “somewhere” is usually a Slack thread, a dead laptop, or a storage array that died with dignity and zero remorse.
“No backups” isn’t a scary story because it’s rare. It’s scary because it’s common, boring, and entirely preventable. And because when it hits, everyone discovers—at the same time—that their organization has been confusing data existence with data recoverability.
What “no backups” really means
Most teams that “have no backups” don’t literally have nothing. They have artifacts. Snapshots in the wrong place. A cron job that ran once. A bucket with a lifecycle rule that quietly deletes the only copy. A tape rotation policy lovingly designed in 2009 and never executed since the person who understood it left.
In operational terms, “no backups” means at least one of these is true:
- No independent copy exists. All copies share the same failure domain: same account, same region, same storage system, same credentials, same ransomware blast radius.
- No verified restore path exists. The backup may be present, but you can’t restore it within your RTO, or at all.
- No known-good point exists. The backups might be corrupt, incomplete, encrypted with missing keys, or captured mid-transaction without consistency.
- No one owns recovery. Backups “belong to IT” and restores “belong to the app team,” which translates to “no one will touch it until it’s too late.”
- No monitoring detects failure. The job hasn’t run in 47 days and nobody noticed because green dashboards are decorative.
You can store data in ten places and still have no backups. Backups are not about copying. They’re about recovery: controlled, repeatable, time-bounded recovery that you can perform under stress with incomplete sleep.
One opinion that will save you money and pain: if you haven’t performed a restore drill recently, you don’t have backups—you have hope.
There’s a reason operations people talk about backups with the tone usually reserved for electrical fires. It’s the one category of failure where the system looks fine right until it doesn’t, and then it’s not the system that fails—it’s the story you told yourself about the system.
Short joke #1: Backups are like parachutes: if you only test them once, you’ll be very confident for a short time.
RPO and RTO: the only words that matter when the lights go out
RPO (Recovery Point Objective) is how much data you’re willing to lose. If your RPO is 15 minutes, then a backup that runs nightly is not a backup plan; it’s a resignation letter written in YAML.
RTO (Recovery Time Objective) is how long you’re willing to be down. If your RTO is 1 hour, then a restore that requires a human to find a decryption key in an ex-employee’s personal vault is not a restore; it’s a scavenger hunt.
Don’t negotiate RPO/RTO in the middle of an incident. Decide in advance, get sign-off, and build the system to meet it. You’ll still have incidents. You just won’t have surprises.
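One habit keeps these numbers honest: turn them into arithmetic before an incident does. A back-of-the-envelope sketch with hypothetical figures (a roughly 2 TB critical dataset and the ~124 MiB/s repository read speed measured in Task 10 later in this article):
cr0x@server:~$ awk 'BEGIN { printf "%.1f hours just to stream the data, before decompression and replay\n", 2*1024*1024/124/3600 }'
4.7 hours just to stream the data, before decompression and replay
If that number is larger than your RTO, no amount of heroics during the incident will fix it; the architecture has to change.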
Facts and history: why this keeps happening
“No backups” has been around longer than most programming languages in production. It persists because it sits at the intersection of human optimism, budget pressure, and systems that fail in creative ways.
8 facts and context points (short, concrete)
- Early enterprise backups were built around tape because disks were expensive; operational discipline mattered more than hardware. The medium changed, the discipline didn’t.
- The “3-2-1 rule” (three copies, two media types, one offsite) predates cloud hype and still captures a core truth: independence beats elegance.
- RAID was widely misunderstood as “backup” for decades; RAID is availability, not recoverability. It prevents some failures and amplifies others.
- Snapshots became mainstream because they’re cheap and fast (copy-on-write), which made them easy to overtrust—until the snapshot system shares the same failure domain.
- Ransomware shifted the threat model from “hardware failure” to “active adversary.” If attackers can delete backups, you don’t have backups.
- Cloud introduced a new illusion: that durability automatically implies recoverability. Object durability doesn’t help if you overwrite the wrong thing or delete the bucket with admin credentials.
- Backup software failures are often silent because success is usually measured by “job completed” rather than “restore verified.” You can “complete” garbage.
- Compliance regimes increased retention demands (and costs), which pushed teams into aggressive deduplication and lifecycle rules—two tools that are powerful and dangerous when misconfigured.
The reliability idea worth carrying around
Here’s a single sentence that should live in your runbooks:
“Everything fails, all the time.” — Werner Vogels
If you’re building systems, that’s the job. Not preventing failure. Surviving it.
Failure modes: how “we have backups” turns into “we never did”
1) Same account, same keys, same blast radius
Your backups are in the same cloud account as production. The same IAM role that can delete prod can delete backups. During a ransomware event or credential leak, attackers don’t politely avoid your “backup” bucket.
Fix: put backups in a separate account/project/tenant with separate credentials and strict cross-account write-only patterns. If you can browse the backup repository from your laptop with your everyday SSO role, you’ve already made it too easy to destroy.
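One concrete way to enforce the write-only pattern is an identity policy on the production-side upload role that allows PutObject and nothing else. A sketch, not a drop-in policy: the bucket name and prefix simply match the examples used later in this article, and retention pruning has to live in the backup account (lifecycle rules or Object Lock), never in production credentials.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BackupUploadWriteOnly",
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::backup-prod/daily/*"
    }
  ]
}
No GetObject, no DeleteObject, no ListBucket: a compromised production host can add backups but can’t enumerate or destroy them. Combined with versioning (Task 6), even malicious overwrites stay recoverable.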
2) Snapshots mistaken for backups
Snapshots are excellent. They are also frequently not backups:
- Snapshots on the same volume don’t survive volume loss.
- Snapshots on the same storage array don’t survive array loss.
- Snapshots accessible with the same admin credentials don’t survive ransomware.
Snapshots are best treated as fast local recovery. Backups are independent recovery.
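What independent recovery looks like for ZFS, as a sketch (the dataset and host names are illustrative and match the ZFS tasks later in this article; the first transfer must be a full send, incrementals after that):
cr0x@server:~$ zfs send -i tank/appdata@2026-01-22T0200Z tank/appdata@2026-01-22T0300Z | ssh backup-target zfs receive -F tank/appdata
Now a copy exists on different hardware behind different credentials. Pair it with restricted SSH keys, snapshot holds, or an immutable export if ransomware is in your threat model.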
3) “Backups succeed” but restores fail
This is the classic. The job logs “success.” The repository fills with files. Then you try to restore and learn:
- the chain depends on missing incrementals,
- the encryption key is gone,
- the database backup is not consistent,
- the restore instructions were never written down,
- the target platform changed (kernel, filesystem, DB major version).
Backups are a product; restores are the acceptance test.
4) Retention policies that erase your past
A simple lifecycle misconfiguration can delete the only viable restore points. Common patterns:
- Retention tuned for cost and accidentally set to “keep 7 days” when ransomware sat undetected for 21.
- “Move to cheaper storage” rules that break restore time objectives because retrieval takes hours.
- Object versioning disabled because “it costs too much,” then an automated job overwrites everything.
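Lifecycle rules are easy to write and rarely audited. One command shows exactly what a bucket will quietly delete; the rule below is a hypothetical, and alarming, example:
cr0x@server:~$ aws s3api get-bucket-lifecycle-configuration --bucket backup-prod
{
    "Rules": [
        {
            "ID": "expire-daily",
            "Filter": { "Prefix": "daily/" },
            "Status": "Enabled",
            "Expiration": { "Days": 7 }
        }
    ]
}
Seven days of retention on your only daily backups means any incident with more than a week of dwell time has no restore point left. Read these rules with your RPO and your detection time in mind, not just the storage bill.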
5) The “optimization” trap: deduplication and incremental chains
Dedup and incrementals are fantastic when you can guarantee integrity and test restores. They’re also a way to concentrate risk: corrupt a chunk in the dedup store, and many backups become simultaneously useless.
When someone says “we saved 60% storage,” your next question should be: “Show me last month’s full restore test.”
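If your repository is built on one of the dedup-based tools, integrity checks that actually read data back are the non-negotiable counterpart to the savings. With restic, for example (purely an illustration; use whatever verification your tool provides):
cr0x@server:~$ restic -r /backups/restic check --read-data
A plain check run verifies metadata; --read-data re-reads the chunks that every incremental depends on. It is slow, and it is worth scheduling. Treat a failure as a data-loss incident, not a warning.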
6) Backups without identity and provenance
A backup with no metadata is a sealed box. If you don’t know what it contains, what produced it, when it was taken, what version it matches, and how to restore it, it’s a liability with a storage bill.
At minimum, every backup artifact should be traceable to:
- source system identity (host/db/cluster),
- time window,
- consistency method (filesystem freeze, DB snapshot, WAL archive),
- retention class (daily/weekly/monthly),
- encryption method and key ownership,
- restore procedure reference (runbook section, not a tribal memory).
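In practice that means writing a manifest next to every artifact. A minimal sketch (field names are illustrative, not a standard; the point is that a stranger could act on it):
{
  "source": "db01.prod.example.internal",
  "dataset": "postgres/appdb",
  "taken_at": "2026-01-22T01:10:02Z",
  "consistency": "pg_basebackup + WAL archive",
  "retention_class": "daily",
  "encryption": "gpg, key owned by infra on-call",
  "restore_runbook": "runbooks/restore-postgres.md#pitr"
}
Every field is a question someone will otherwise be asking you at 2 a.m.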
Three corporate mini-stories from the backup trenches
Mini-story #1: the incident caused by a wrong assumption
A mid-sized SaaS company ran its primary database on managed Postgres. They also had “backups” on a separate VM: a nightly job that ran pg_dump and copied a file to object storage. On paper, it looked safe: managed backups plus their own belt-and-suspenders copy.
The wrong assumption was subtle and extremely common: they assumed pg_dump output implied a consistent, restorable snapshot. It usually does—until it doesn’t. Their largest tables were under constant write load, and the dump ran long enough that it overlapped multiple schema changes. The dump completed, reported success, and uploaded to storage. Everyone slept.
Then an operator mistake happened: a migration script dropped a table in the wrong schema. Managed point-in-time recovery existed, but the team had never practiced it and didn’t know the provider’s constraints. They turned to the “simple” dump file.
Restore failed with errors about missing relations and inconsistent schema state. The dump wasn’t a clean point-in-time view of the database as the application expected; it was an artifact of an evolving system. Their “backup” wasn’t wrong; it was just not the backup they needed for the recovery they were attempting.
They ultimately recovered via managed PITR after a long, stressful learning session and a lot of downtime. The fix wasn’t exotic: stop treating dumps as the primary safety net for a hot, changing database. Use a backup method designed for consistency (physical backups with WAL, snapshots coordinated with DB, or provider-native PITR), and run restore drills quarterly.
Mini-story #2: the optimization that backfired
A financial services team decided to cut backup costs. Their repository was growing quickly, and the monthly bill triggered executive interest—the kind that arrives without technical context and leaves with “action items.” A well-meaning engineer turned on aggressive deduplication and changed full backups from weekly to monthly, relying heavily on incrementals.
Everything looked better. Storage growth flattened. Backups ran faster. Dashboards glowed green. They even bragged about the “efficiency win” in a cost review. The scary part of the story is that none of this was a lie.
Three months later, a storage controller bug corrupted a small portion of the dedup store. Not enough to crash the backup system. Not enough to show up in the “job success” metrics. Just enough to make restore chains unreliable.
When they needed to recover an application server after a botched OS upgrade, the restore repeatedly failed late in the process. Different files each time. The system was reading corrupted chunks that were referenced by many backups. The optimization had concentrated risk into a single shared pool of blocks, and the long incremental chains meant there was no recent “clean anchor” full backup to fall back to.
They rebuilt the repository, added periodic synthetic fulls (verified), and implemented automated restore tests that mounted recovered filesystems and validated checksums. The cost went up. So did their ability to sleep.
Mini-story #3: the boring but correct practice that saved the day
A manufacturing company had a small SRE team supporting a mix of legacy VMs and containerized services. Their backup system was not fancy: ZFS snapshots replicated nightly to a second site, plus weekly exports to immutable object storage. The runbooks were painfully detailed, and the team ran a restore drill every month.
During a routine patch window, a firmware update on the primary storage array went sideways. The array didn’t die dramatically; it just stopped presenting several LUNs correctly. From the OS perspective, filesystems became read-only and then inconsistent. Applications started failing in confusing ways. The incident wasn’t flashy. It was a slow-motion stumble.
The team didn’t debate. They executed the documented plan: freeze writes, fail services over to the replicated dataset, and bring up a minimal set of critical apps first. Their first restore target wasn’t “everything”—it was the systems that made money and paid people.
They were back in service before the business fully understood what had happened. The postmortem was almost dull: no heroics, no magical recovery scripts, no midnight “we found a copy on Bob’s laptop.” Just replication, immutability, and a practiced procedure.
The quiet lesson: boring recovery beats clever backup. If your backup plan requires brilliance during an outage, it is not a plan.
Fast diagnosis playbook: find the bottleneck fast
When someone says “backup is failing” or “restore is too slow,” you need a quick path to the limiting factor. Don’t start by rewriting everything. Start by finding what’s actually constrained: credentials, capacity, IOPS, network, CPU, repository health, or correctness.
First: verify what failure you have (availability vs correctness vs time)
- Availability: backup jobs not running, repository unreachable, permissions denied.
- Correctness: backups “succeed” but restore fails or data is missing/corrupt.
- Time: restore takes longer than RTO, backups overlap and never finish, replication lag breaks RPO.
Second: check the four chokepoints in order
- Identity and permissions: expired credentials, broken IAM policies, rotated keys, MFA/SSO changes.
- Storage capacity and retention: repo full, snapshots pruned early, object lifecycle deleting.
- Throughput: network saturated, disk I/O capped, cloud throttling, single-threaded compression.
- Consistency: application-aware quiesce missing, DB backups not coordinated with logs.
Third: prove restore feasibility with a small, deterministic test
Don’t aim for a full environment restore first. Pick one artifact and restore it to a scratch host:
- restore a single VM disk and mount it read-only,
- restore a Postgres base backup + replay WAL to a known timestamp,
- restore one directory tree and verify checksums and counts.
This reduces the problem to facts: can you retrieve data, can you decrypt it, can you mount it, can the application read it, and how long does it take?
Practical tasks (with commands): prove you can restore
Below are real tasks you can run today. Each one includes: a command, sample output, what it means, and the decision you make. The goal is not to admire command output. The goal is to change your posture from “we think” to “we know.”
Task 1: Confirm you actually have recent backups (filesystem level)
cr0x@server:~$ ls -lh /backups/daily | tail -n 5
-rw-r----- 1 root backup 1.8G Jan 22 01:05 app01-etc-2026-01-22.tar.zst
-rw-r----- 1 root backup 22G Jan 22 01:12 app01-varlib-2026-01-22.tar.zst
-rw-r----- 1 root backup 4.1G Jan 22 01:18 db01-config-2026-01-22.tar.zst
-rw-r----- 1 root backup 91G Jan 22 01:55 db01-basebackup-2026-01-22.tar.zst
-rw-r----- 1 root backup 3.4M Jan 22 02:01 manifest-2026-01-22.json
Output meaning: You can see today’s artifacts and their sizes, plus a manifest. Size anomalies (too small/too large) are early warning signs.
Decision: If today’s files are missing or clearly wrong-sized, stop trusting success emails. Investigate job logs and upstream permissions immediately.
Task 2: Check backup job success from systemd timers
cr0x@server:~$ systemctl list-timers --all | grep backup
Fri 2026-01-23 01:00:00 UTC 6h left Thu 2026-01-22 01:00:04 UTC 18h ago backup-daily.timer backup-daily.service
Fri 2026-01-23 02:30:00 UTC 8h left Thu 2026-01-22 02:30:02 UTC 16h ago backup-verify.timer backup-verify.service
Output meaning: Timers exist and ran recently. If the “last” time is ancient, your backups may have stopped weeks ago.
Decision: If timers aren’t running, treat it as data-loss risk, not a minor ticket. Fix scheduling and alerting before doing anything else.
Task 3: Inspect the last backup service run status
cr0x@server:~$ systemctl status backup-daily.service --no-pager
● backup-daily.service - Nightly backup job
Loaded: loaded (/etc/systemd/system/backup-daily.service; enabled)
Active: inactive (dead) since Thu 2026-01-22 01:55:13 UTC; 18h ago
Process: 18422 ExecStart=/usr/local/sbin/backup-daily.sh (code=exited, status=0/SUCCESS)
Main PID: 18422 (code=exited, status=0/SUCCESS)
Jan 22 01:00:04 server backup-daily.sh[18422]: starting backup run id=2026-01-22T010004Z
Jan 22 01:55:12 server backup-daily.sh[18422]: upload complete: s3://backup-prod/daily/...
Jan 22 01:55:13 server backup-daily.sh[18422]: finished OK
Output meaning: The unit exited successfully and logged upload completion.
Decision: Success here is necessary but not sufficient. You still need to verify integrity and restoreability (tasks below).
Task 4: Detect “repo full” before it becomes an outage
cr0x@server:~$ df -h /backups
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 7.3T 7.1T 110G 99% /backups
Output meaning: 99% used is a failure in progress. Many backup tools will start failing, some will start deleting, and a few will lie.
Decision: Pause non-essential backups, extend capacity, and review retention. Also verify the latest backup is complete; near-full disks lead to partial artifacts.
Task 5: Validate checksums of a backup artifact (spot check)
cr0x@server:~$ cd /backups/daily && sha256sum -c manifest-2026-01-22.sha256 | head
app01-etc-2026-01-22.tar.zst: OK
app01-varlib-2026-01-22.tar.zst: OK
db01-config-2026-01-22.tar.zst: OK
db01-basebackup-2026-01-22.tar.zst: OK
Output meaning: The files match recorded hashes. This catches corruption in storage or transfer.
Decision: If any file is “FAILED,” quarantine that backup set, investigate storage health, and confirm you have a different restore point.
Task 6: Verify object storage versioning and lock status (ransomware resilience)
cr0x@server:~$ aws s3api get-bucket-versioning --bucket backup-prod
{
"Status": "Enabled"
}
Output meaning: Versioning is enabled; deletes become delete-markers and overwrites keep previous versions.
Decision: If versioning is “Suspended” or empty, you’re easier to wipe. Enable versioning and consider Object Lock (immutability) where appropriate.
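Turning versioning on is a single call and produces no output on success (reusing the bucket name from this example):
cr0x@server:~$ aws s3api put-bucket-versioning --bucket backup-prod --versioning-configuration Status=Enabled
Versioning alone is not enough: an attacker with full permissions can still delete old versions. That gap is what Object Lock (next task) closes.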
Task 7: Confirm S3 Object Lock configuration (immutability)
cr0x@server:~$ aws s3api get-object-lock-configuration --bucket backup-prod
{
"ObjectLockConfiguration": {
"ObjectLockEnabled": "Enabled",
"Rule": {
"DefaultRetention": {
"Mode": "COMPLIANCE",
"Days": 30
}
}
}
}
Output meaning: Default retention is immutable for 30 days in compliance mode. Even admins can’t shorten it.
Decision: If Object Lock is absent and your threat model includes ransomware, you need an immutable tier (Object Lock, WORM, offline, or separate controlled system).
Task 8: Check replication lag for ZFS snapshots (RPO reality check)
cr0x@server:~$ zfs list -t snapshot -o name,creation -s creation | tail -n 3
tank/appdata@2026-01-22T0100Z Thu Jan 22 01:00 2026
tank/appdata@2026-01-22T0200Z Thu Jan 22 02:00 2026
tank/appdata@2026-01-22T0300Z Thu Jan 22 03:00 2026
Output meaning: Snapshots exist hourly. That’s your theoretical best-case RPO locally.
Decision: Next, confirm those snapshots are replicated off-host. Local snapshots alone don’t meet disaster recovery needs.
Task 9: Confirm ZFS replication actually advanced on the target
cr0x@server:~$ ssh backup-target "zfs list -t snapshot -o name,creation -s creation | grep tank/appdata | tail -n 3"
tank/appdata@2026-01-22T0000Z Thu Jan 22 00:00 2026
tank/appdata@2026-01-22T0100Z Thu Jan 22 01:00 2026
tank/appdata@2026-01-22T0200Z Thu Jan 22 02:00 2026
Output meaning: The target is one snapshot behind the source. That’s replication lag: if the source died right now, your off-host recovery point would be 02:00, so the effective RPO is already over an hour and growing until the 03:00 snapshot lands on the target, not the clean hourly cadence the schedule suggests.
Decision: If lag exceeds your RPO, investigate replication throughput, failures, or scheduling. Consider continuous replication for databases or WAL shipping.
Task 10: Measure repository throughput during restore (RTO reality check)
cr0x@server:~$ pv /backups/daily/db01-basebackup-2026-01-22.tar.zst > /dev/null
91.2GiB 0:12:31 [ 124MiB/s] [==================================>] 100%
Output meaning: You can read the backup file at ~124 MiB/s from the repo. That’s a hard ceiling on restore speed before decompression and application replay.
Decision: If your RTO requires restoring 2 TB in under an hour, this repo path cannot do it. Add parallelism, faster storage, or tiered restore strategies.
Task 11: Validate that your backup archive is intact (zstd integrity test)
cr0x@server:~$ zstd -t /backups/daily/app01-etc-2026-01-22.tar.zst
/backups/daily/app01-etc-2026-01-22.tar.zst : OK
Output meaning: The compressed file passes integrity checks at the compression layer.
Decision: If this fails, don’t attempt restore from it. Find another point-in-time and investigate hardware and upload paths.
Task 12: Perform a real restore drill of a directory to a scratch path
cr0x@server:~$ mkdir -p /restore-test/app01-etc && tar -I zstd -xpf /backups/daily/app01-etc-2026-01-22.tar.zst -C /restore-test/app01-etc
cr0x@server:~$ ls -l /restore-test/app01-etc/etc | head
total 64
-rw-r--r-- 1 root root 296 Jan 21 23:59 hostname
-rw-r--r-- 1 root root 1177 Jan 21 23:59 hosts
drwxr-xr-x 2 root root 4096 Jan 21 23:59 systemd
drwxr-xr-x 2 root root 4096 Jan 21 23:59 ssh
Output meaning: You successfully extracted the backup and can see expected files. This is a minimum viable restore test.
Decision: If extraction fails or files are missing, treat backups as compromised. Fix before the next incident forces your hand.
Task 13: Check Postgres backups: is WAL archiving configured?
cr0x@server:~$ sudo -u postgres psql -c "show wal_level; show archive_mode; show archive_command;"
wal_level
----------
replica
(1 row)
archive_mode
--------------
on
(1 row)
archive_command
-------------------------------------
test ! -f /wal-archive/%f && cp %p /wal-archive/%f
(1 row)
Output meaning: WAL archiving is enabled. This is essential for point-in-time recovery beyond the base backup.
Decision: If archiving is off, your “backups” might only restore to the last base backup time. Turn on archiving and test PITR.
Task 14: Verify WAL archive is actually receiving files (not just configured)
cr0x@server:~$ ls -lh /wal-archive | tail -n 3
-rw------- 1 postgres postgres 16M Jan 22 02:54 00000001000000A7000000FE
-rw------- 1 postgres postgres 16M Jan 22 02:58 00000001000000A7000000FF
-rw------- 1 postgres postgres 16M Jan 22 03:02 00000001000000A700000100
Output meaning: New WAL segments are arriving every few minutes. That’s live evidence your PITR chain is progressing.
Decision: If the directory is stale, your RPO is drifting quietly. Fix archiving before you trust any restore plan.
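To use that chain, a point-in-time restore on a scratch instance looks roughly like this (paths are illustrative; the parameters are standard PostgreSQL 12+ recovery settings). Restore the base backup into a scratch data directory, append the recovery settings, create recovery.signal, and start the instance:
cr0x@server:~$ cat >> /restore-test/pgdata/postgresql.conf <<'EOF'
# replay archived WAL up to this point in time, then promote
restore_command = 'cp /wal-archive/%f "%p"'
recovery_target_time = '2026-01-22 02:30:00+00'
recovery_target_action = 'promote'
EOF
cr0x@server:~$ touch /restore-test/pgdata/recovery.signal
On startup, Postgres replays WAL from /wal-archive until the target time and then promotes. The first time you walk through this should not be during an incident.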
Task 15: Confirm that backups are not only present but restorable (Postgres restore smoke test)
cr0x@server:~$ createdb restore_smoke
cr0x@server:~$ pg_restore --list /backups/daily/appdb-2026-01-22.dump | head
;
; Archive created at 2026-01-22 01:10:02 UTC
; dbname: appdb
; TOC Entries: 248
; Compression: -1
; Dump Version: 1.14-0
Output meaning: The dump is readable and contains objects. This is a lightweight check before a full restore.
Decision: If pg_restore can’t read the file, your backup is effectively imaginary. Move to a different backup type and add verification steps.
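If the listing looks sane, the next lightweight step is an actual restore into the scratch database created above, reusing the names from this task:
cr0x@server:~$ pg_restore --no-owner --exit-on-error -d restore_smoke /backups/daily/appdb-2026-01-22.dump
cr0x@server:~$ psql -d restore_smoke -c "select count(*) from information_schema.tables where table_schema = 'public';"
pg_restore prints nothing on success, and --exit-on-error makes silent partial restores impossible to miss. Compare the table count (and a few row counts that matter to the business) against production before you call the backup good.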
Task 16: Verify you can decrypt your backups (key management reality)
cr0x@server:~$ gpg --decrypt /backups/daily/secrets-2026-01-22.tar.gpg > /dev/null
gpg: encrypted with 4096-bit RSA key, ID 7A1C2B3D4E5F6789, created 2025-10-03
gpg: decryption ok
Output meaning: The decrypt operation succeeds with currently available keys.
Decision: If decryption fails, you don’t have backups—you have encrypted paperweights. Fix key escrow, access procedures, and rotation policies.
Short joke #2: The only thing worse than no backups is having backups you can’t decrypt—like a safe where you stored the key inside.
Common mistakes: symptoms → root cause → fix
This section is designed for incident response and postmortems. Symptoms are what you see at 2 a.m. Root causes are why it happened. Fixes are specific actions you can execute.
1) “Backups say SUCCESS, but restores fail”
Symptoms: Job logs show success; restore errors include missing incrementals, corrupt archive, checksum mismatch, or application won’t start.
Root cause: Success criteria were “uploaded file exists” rather than “restore completed and validated.” Or corruption in dedup store, or inconsistent DB snapshot.
Fix: Add automated restore verification (mount/extract + checksum + application-level smoke test). Keep periodic verified full backups. For databases, use application-aware backup methods and validate by restoring to a scratch instance.
2) “We can restore, but it takes days”
Symptoms: Restore throughput is low; retrieval from cold storage takes hours; decompression is CPU-bound; network saturates.
Root cause: RTO never informed architecture. Backups stored in cheapest tier without modeling retrieval time. Restore pipeline is single-threaded or constrained by repository IOPS.
Fix: Measure restore throughput (not backup throughput). Use tiered backups: hot recent backups for fast restore, cold for retention. Parallelize restores, pre-stage critical datasets, and ensure you can scale restore targets.
3) “Ransomware deleted our backups too”
Symptoms: Backup bucket emptied, snapshots deleted, repository encrypted, backup server compromised.
Root cause: Shared credentials and lack of immutability. Backups accessible with admin roles. No separation of duties.
Fix: Separate accounts and credentials, enforce write-only upload roles, enable Object Lock/WORM, restrict deletion, and keep an offline or logically isolated copy. Audit IAM for backup delete permissions.
4) “We restored the wrong thing”
Symptoms: Restore completed but data is older than expected, missing tables, wrong environment, or wrong customer dataset.
Root cause: Poor labeling/metadata, inconsistent naming, no manifest, no runbook for selecting restore points.
Fix: Standardize backup naming and manifests. Track backup provenance (source identity, timestamp, consistency method). Create a restore selection procedure with approval steps for production restores.
5) “Backups stopped weeks ago; nobody noticed”
Symptoms: Last successful backup is old. Monitoring didn’t alert.
Root cause: Monitoring looks at job execution, not end-to-end success. Alerts routed to dead channels. On-call not paged for backup failures.
Fix: Alert on “time since last verified restore point.” Add repository freshness metrics. Page the team that is accountable for recovery.
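A sketch of what can drive that alert, assuming (hypothetically) that your verification job touches a marker file only after a successful restore test; the path, unit name, and threshold are illustrative:
cr0x@server:~$ cat /usr/local/sbin/check-restore-freshness.sh
#!/usr/bin/env bash
# Page when the last VERIFIED restore point is too old.
# Assumes backup-verify.service touches this marker only after a restore test passes.
set -euo pipefail
marker=/var/lib/backup/last-verified-restore
max_age_hours=36
now=$(date +%s)
last=$(stat -c %Y "$marker" 2>/dev/null || echo 0)
age_hours=$(( (now - last) / 3600 ))
if (( age_hours > max_age_hours )); then
  echo "CRITICAL: last verified restore point is ${age_hours}h old (limit ${max_age_hours}h)"
  exit 2
fi
echo "OK: last verified restore point is ${age_hours}h old"
Wire the exit code into whatever pages your on-call. The metric that matters is the age of verified restores, not the count of completed jobs.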
6) “We have snapshots, so we’re fine”
Symptoms: Confident statements until the disk/array/account dies; then no off-host copy exists.
Root cause: Confusing availability tools (RAID, snapshots) with backup independence.
Fix: Replicate snapshots to another system/account/region. Add immutable retention. Document what failures snapshots do and do not cover.
7) “Retention pruned our only good restore point”
Symptoms: You need a restore older than retention window (e.g., ransomware dwell time), but it’s gone.
Root cause: Retention based on cost, not detection time and business risk. Lifecycle rules too aggressive.
Fix: Set retention with security and legal input. Keep longer “monthly” or “quarterly” restore points. Use immutable storage for a subset of backups.
Checklists / step-by-step plan
Step-by-step: go from “we think” to “we can restore” in 10 steps
- Define RPO and RTO per system. Not “the company.” The payroll DB and the marketing wiki do not deserve the same engineering.
- Classify data by recovery method. VM image restore, file-level restore, database PITR, object version rollback, etc.
- Map failure domains. Identify what can fail together: account, region, array, credentials, backup server, KMS keys.
- Implement at least one independent copy. Separate account/tenant with restricted permissions. Prefer immutability for ransomware scenarios.
- Make backups consistent. Application-aware snapshots, WAL archiving, filesystem freeze, or database-native tooling.
- Write the restore runbook. Include prerequisites, keys, access steps, commands, and how to validate success. Treat it like a production feature.
- Automate verification. At minimum: checksum verification + restore-to-scratch monthly. Better: continuous small restore tests.
- Monitor “time since last good restore point.” Not “job succeeded.” Alert to on-call.
- Run a quarterly DR exercise. Pick one critical system. Restore it under a stopwatch and document what broke.
- Review and tighten permissions. Separate duties: production admins should not be able to delete immutable backups casually.
Operational checklist: before you approve a backup design
- Can production credentials delete backups? If yes, fix that first.
- Is there an immutable retention tier for critical systems? If no, add one.
- Is restore tested on the current platform/version? If no, schedule a drill.
- Does the design meet RPO/RTO in measured tests, not estimates?
- Are encryption keys recoverable under incident conditions (SSO down, on-call limited access)?
- Is monitoring based on restore points and verification, not job completion?
- Does retention cover detection delays (security) and legal requirements?
- Is the runbook written so a competent engineer unfamiliar with the system can execute it?
Restore drill checklist: what “done” looks like
- Restore completed to an isolated environment without touching production.
- Integrity validated (checksums, row counts, application health checks).
- Recovery time measured and compared to RTO.
- Recovery point measured and compared to RPO.
- Issues logged as work items with owners and dates.
- Runbook updated immediately based on surprises encountered.
FAQ
1) Are snapshots backups?
Sometimes. If snapshots are replicated to an independent system and protected from deletion, they can be part of a backup strategy. Local-only snapshots are a fast undo button, not disaster recovery.
2) Is RAID a backup?
No. RAID handles certain disk failures without downtime. It won’t help if data is deleted, encrypted, corrupted by software, or the entire array is lost.
3) What’s the minimum viable backup strategy for a small team?
Start with: (1) automated daily backups, (2) an off-host copy in a separate account, (3) monthly restore drill to a scratch environment, (4) alerting on missed backups. Add immutability if ransomware is in scope (it is).
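For the scheduling piece, a plain systemd timer is usually enough; a sketch matching the unit names used in the tasks earlier (the service it triggers is whatever script produces and uploads your artifacts):
cr0x@server:~$ cat /etc/systemd/system/backup-daily.timer
[Unit]
Description=Nightly backup trigger
[Timer]
OnCalendar=*-*-* 01:00:00 UTC
Persistent=true
RandomizedDelaySec=300
[Install]
WantedBy=timers.target
cr0x@server:~$ sudo systemctl enable --now backup-daily.timer
Persistent=true runs a missed job after downtime instead of silently skipping it, which is exactly the gap Task 2 checks for.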
4) How often should we test restores?
For critical systems: at least quarterly full restore drills, plus smaller automated restore verifications more frequently (weekly or daily). The right cadence is the one that catches drift before an incident does.
5) What retention window is “enough”?
It depends on detection time and legal needs. Many ransomware incidents involve days-to-weeks of dwell time. If you only keep 7 days, you’re betting your business on perfect detection. That’s a bad bet.
6) Can we rely on our cloud provider’s backups?
Provider backups can be excellent, but you still need to validate restore workflows, understand constraints, and consider independence. A second copy under your control (separate account, different credentials) is often warranted for critical data.
7) Should backups be encrypted?
Yes—usually. Encrypt in transit and at rest. But encryption without recoverable key management is self-sabotage. Treat keys as part of the restore plan: access-controlled, audited, and testable during drills.
8) What should we monitor: backup jobs or backup data?
Monitor the existence of a recent verified restore point. A job can succeed while producing unusable output. Track “age of last verified restore,” repository capacity, and restore test outcomes.
9) How do we keep costs under control without increasing risk?
Tier your backups: keep recent restore points in a fast tier, older ones in a cheaper tier. Use compression thoughtfully. Be cautious with aggressive dedup and long incremental chains unless you have strong integrity checks and restore testing.
10) Who should own backups and restores?
Ownership should be explicit. Infrastructure teams can own the platform, but application teams must own recovery correctness (does the app work after restore?). A shared runbook and joint drills prevent finger-pointing.
Conclusion: next steps you can do this week
If you’re currently in the “we should really do backups” phase, congratulations: you’ve identified the problem while you still have choices. The goal is not to build the fanciest backup system. The goal is to make data loss a bounded event, not an existential crisis.
Practical next steps that move the needle:
- Pick one critical system and define RPO/RTO with the business. Write it down. Make it someone’s responsibility.
- Run a restore drill to a scratch environment. Time it. Document every surprise. Fix one surprise.
- Separate failure domains: ensure at least one copy is off-host and not deletable by everyday production credentials.
- Enable immutability for the backups that would matter most during ransomware.
- Change monitoring from “job succeeded” to “last verified restore point age.” Page on drift.
Systems fail. People make mistakes. Attackers exist. The only adult response is designing for recovery and proving it with drills. Everything else is storytelling.