RAID is not backup: the sentence people learn too late

The call usually comes in when the dashboard is green and the data is gone. The array is “healthy.” The database is “running.”
And yet the CFO is staring at an empty report, the product team is staring at an empty bucket, and you’re staring at the one sentence
you wish you’d tattooed onto the purchase order: RAID is not backup.

RAID is great at one thing: keeping a system online through certain kinds of disk failure. It is not designed to protect you from
deletion, corruption, ransomware, fire, fat fingers, broken firmware, or the strange and timeless human urge to run rm -rf
in the wrong window.

What RAID actually does (and what it never promised)

RAID is a redundancy scheme for storage availability. That’s it. It’s a way to keep serving reads and writes when one disk
(or sometimes two) stops cooperating. RAID is about continuity of service, not continuity of truth.

In production terms: RAID buys you time. It reduces the probability that a single disk failure becomes an outage. It may improve
performance depending on level and workload. It can simplify capacity management. But it does not create a separate, independent,
versioned copy of your data. And independence is the word that keeps your job.

Availability vs durability vs recoverability

People mash these into one bucket labeled “data safety.” They are not the same:

  • Availability: can the system keep working right now? RAID helps here.
  • Durability: will bits remain correct over time? RAID sometimes helps, sometimes lies about it.
  • Recoverability: can you restore a known-good state after an incident? That’s backup, snapshots, replication, and process.

RAID can keep serving corrupted data. RAID can faithfully mirror your accidental deletion. RAID can replicate your ransomware-encrypted blocks
with extreme enthusiasm. RAID is a loyal employee. Loyal doesn’t mean smart.

What “backup” means in a system you can defend

A backup is a separate copy of data that is:

  • Independent of the primary failure domain (different disks, different host, ideally different account/credentials).
  • Versioned so you can go back to before the bad thing happened.
  • Restorable within a time bound you can live with (RTO) and to a point in time you can accept (RPO).
  • Tested, because “we have backups” is not a fact until you have restored from them.

Snapshots and replication are great tools. They are not automatically backups. They become backups when they are independent, protected from
the same admin mistakes, and restorable under pressure.

Joke #1: RAID is the seatbelt. Backup is the ambulance. If you’re counting on the seatbelt to perform surgery, you’re going to have a long day.

Why RAID fails as backup: the failure modes that matter

The reason “RAID is not backup” gets repeated is that the failure modes are non-intuitive. Disk failure is just one kind of data loss.
Modern systems lose data through software, humans, and attackers more often than through a single drive popping its SMART cherry.

1) Deletion and overwrite are instantly redundant

Delete a directory. RAID mirrors the deletion. Overwrite a table. RAID stripes that new truth across the set. There is no “undo” because RAID’s
job is to keep copies consistent, not to keep copies historical.

2) Silent corruption, bit rot, and the “looks fine” trap

Disks, controllers, cables, and firmware can return the wrong data without throwing an error. Filesystems with checksums (like ZFS, btrfs) can
detect corruption, and with redundancy they can often self-heal. Traditional RAID under a filesystem that doesn’t checksum at the block level
can happily return corrupted blocks and call it success.
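
If you do run a checksumming filesystem, exercise that detection deliberately instead of waiting for it. A minimal sketch, assuming a ZFS pool
named tank (the pool name is an assumption):

sudo zpool scrub tank     # read every block and verify checksums; repair from redundancy where possible
sudo zpool status tank    # report scrub progress plus any repaired or unrepairable errors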

Even with end-to-end checksums, you can still corrupt data at a higher layer: bad application writes, buggy compaction, half-applied migrations.
RAID will preserve the corruption perfectly.

3) Ransomware doesn’t care about your parity

Ransomware encrypts what it can access. If it can access your mounted filesystem, it can encrypt your data on RAID1, RAID10, RAID6,
ZFS mirrors, whatever. Redundancy doesn’t stop encryption. It just ensures the encryption is highly available.

4) Controller and firmware failures take the array with them

Hardware RAID adds a failure domain: the controller, its cache module, its firmware, its battery/supercap, and its metadata format.
If the controller dies, you may need an identical controller model and firmware level to reassemble the array cleanly.

Software RAID also has failure domains (kernel, md metadata, userspace tooling), but they tend to be more transparent and portable.
Transparent does not mean safe. It just means you can see the knife before you step on it.

5) Rebuilds are stressful and get worse as drives get bigger

Rebuild is where the math meets physics. During rebuild, every remaining disk is read heavily, often close to full bandwidth, for hours or days.
That’s a perfect storm for surfacing latent errors on the remaining drives. If you lose another disk in a RAID5 during rebuild, you lose the array.
RAID6 buys you more margin, but rebuild still increases risk and degrades performance.

6) Human error: the most common, least respected failure mode

A tired engineer replaces the wrong disk, pulls the wrong tray, or runs the right command on the wrong host. RAID doesn’t protect against
humans. It amplifies them. One wrong click gets replicated at line rate.

7) Site disasters and blast radius

RAID is local. Fire is also local. So are theft, power events, and “oops we deleted the whole cloud account.” A real backup strategy assumes
you will lose an entire failure domain: a host, a rack, a region, or an account.

Interesting facts and a little history (the useful kind)

A few concrete facts make this topic stick because they show how RAID ended up being treated like a magic spell.
Here are nine, all relevant, none romantic.

  1. RAID was named and popularized by a UC Berkeley paper (written in 1987, published in 1988) that framed “redundant arrays of inexpensive disks” as an alternative to single large expensive disks.
  2. Early RAID marketing leaned hard on “fault tolerance,” and a lot of people quietly translated that into “data protection,” which is not the same contract.
  3. RAID levels were never a single official standard. Vendors implemented “RAID5” with different behaviors and cache policies, then argued about semantics in your outage window.
  4. Hardware RAID controllers historically used proprietary on-disk metadata formats, which is why controller failure can turn into archaeology.
  5. The rise of multi-terabyte disks made RAID5 rebuilds dramatically riskier because the rebuild time grew and the probability of encountering an unreadable sector during rebuild rose.
  6. URE (unrecoverable read error) rates were widely discussed in the 2000s as a practical reason to prefer dual-parity for large arrays, especially under heavy rebuild load.
  7. ZFS (first released in the mid-2000s) pushed end-to-end checksums into mainstream operations and made “bit rot” a boardroom-friendly phrase because it could finally be detected.
  8. Snapshots became common in enterprise storage in the 1990s but were often stored on the same array—fast rollback, not disaster recovery.
  9. Ransomware shifted the backup conversation from “tape vs disk” to “immutability vs credentials,” because attackers learned to delete backups first.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-size SaaS company ran its primary PostgreSQL cluster on a pair of high-end servers with hardware RAID10. The vendor pitch sounded
comforting: redundant disks, battery-backed write cache, hot spares. The team heard “no data loss” and mentally filed backups under “nice-to-have.”

One afternoon, a developer ran a cleanup script against production. It was supposed to target a staging schema; it targeted the live one.
Within seconds, millions of rows were deleted. The database kept serving traffic, and the monitoring graphs looked fine—queries got faster, actually,
because there was less data.

They tried to recover using the RAID controller’s “snapshot” feature, which was not a snapshot in the filesystem sense. It was a configuration
profile for caching behavior. The storage vendor, to their credit, did not laugh. They simply asked the question that ends careers:
“What are your last known-good backups?”

There were none. There was a nightly logical dump configured months ago, but it wrote to the same RAID volume, and the cleanup script deleted
the dump directory too. The company rebuilt from application logs and third-party event streams. They recovered most, but not all, and they spent
weeks fixing subtle referential damage.

The wrong assumption wasn’t “RAID is safe.” It was “availability implies recoverability.” They had high uptime and low truth.

Mini-story 2: The optimization that backfired

A media platform was obsessed with performance. They moved their object storage metadata from a conservative setup to a wide RAID5 to squeeze
more usable capacity and better write throughput on paper. They also enabled aggressive controller caching to improve ingest rates.

In normal operation, it looked great. The queue depths were low. Latency was down. Leadership got their “storage efficiency” slide for the quarterly
deck. Everyone slept better for about a month.

Then a single disk started throwing intermittent read errors. The array marked it as “predictive failure” but kept it online. A rebuild was initiated
to a hot spare during peak hours because the system was “redundant.” That rebuild saturated the remaining disks. Latency spiked, timeouts climbed,
and application retries created a feedback loop.

Mid-rebuild, another disk hit an unreadable sector. RAID5 can’t handle that during rebuild. The controller declared the virtual disk failed.
The result wasn’t just downtime. It was partial metadata corruption that made recovery slower and nastier than a clean crash would have been.

The optimization wasn’t evil; it was unbounded. They optimized for capacity and benchmark performance, then paid for it with rebuild risk and
a larger blast radius. They replaced the layout with dual parity, moved rebuild windows off-peak, and—most importantly—built an off-array backup
pipeline so the next failure would be boring.

Mini-story 3: The boring but correct practice that saved the day

A financial services firm ran a file service used by internal teams. The storage was a ZFS mirror set: simple, conservative, not exciting.
The exciting part was their backup hygiene: nightly snapshots, offsite replication to a different admin domain, and monthly restore tests.
Everyone complained about the restore tests because they “wasted time.” The SRE manager made them non-optional anyway.

A contractor’s laptop was compromised. The attacker obtained VPN access and then a privileged credential that could write to the file share.
Overnight, ransomware started encrypting user directories. Because the share was online and writable, the encryption propagated quickly.

ZFS did exactly what it was asked to do: it stored the new encrypted blocks with integrity. RAID mirroring ensured the encryption was durable.
The next morning, users found their files renamed and unreadable. The mirror was “healthy.” The business was not.

The firm pulled the network share offline, rotated credentials, and checked the immutable backup target. The backups were stored in a separate
environment with restricted delete permissions and retention locks. The attacker couldn’t touch them.

Restore was not magical; it was practiced. They restored the most critical directories first based on a pre-agreed priority list, then the rest
over the next day. The postmortem was dull in the best way. The moral was also dull: boring process beats fancy redundancy.

Fast diagnosis playbook: find the bottleneck and the blast radius

When something is wrong with storage, teams waste time arguing about whether it’s “the disks” or “the network” or “the database.”
The right approach is to establish: (1) what changed, (2) what is slow, (3) what is unsafe, and (4) what you can still trust.

First: stop making it worse

  • If you suspect corruption or ransomware, freeze writes where you can: remount read-only, stop services, revoke credentials (a minimal sketch follows this list).
  • If an array is degraded and rebuilding, consider reducing workload to avoid a second failure during rebuild.
  • Start an incident log: commands run, timestamps, changes made. Memory is not evidence.
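
To make that first bullet concrete, here is a hedged sketch of freezing writes, assuming the data lives under /data and the writer is a systemd
unit named app.service; both names are placeholders for your environment:

sudo systemctl stop app.service                  # stop the process doing the writing
sudo mount -o remount,ro /data                   # freeze the filesystem, if nothing else must write to it
sudo script -a /root/incident-$(date +%F).log    # record the rest of the session as evidence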

Second: identify whether this is performance, integrity, or availability

  • Performance: high latency, timeouts, queue depth, iowait. Data may still be correct.
  • Integrity: checksum errors, application-level corruption, unexpected file changes. Performance may look fine.
  • Availability: devices missing, arrays degraded/failed, filesystems not mounting. The system is screaming.

Third: localize the fault domain quickly

  1. Host: kernel logs, disk errors, controller state.
  2. Storage stack: RAID/mdadm/ZFS, filesystem health, scrub status.
  3. IO path: multipath, HBA, SAS expander, NICs, switches if network storage.
  4. Application: query plans, lock contention, retry storms.
  5. Backup/recovery posture: do you have a clean restore point, and is it reachable?

Fourth: decide on the objective

In an outage, you must pick one objective to lead with:

  • Keep it running (availability): stabilize, accept degraded mode.
  • Protect data (integrity): freeze writes, take forensic copies, restore from known-good.
  • Recover service (recoverability): fail over, rebuild elsewhere, restore backups.

These objectives conflict. Pretending they don’t is how you end up with a working system serving the wrong data.

Practical tasks with commands: what to run, what it means, what you decide

Below are hands-on tasks you can run on Linux systems to understand your redundancy posture and your actual recoverability.
Each task includes: command, example output, what it means, and the decision you make from it.

Task 1: Check current block devices and RAID membership

cr0x@server:~$ lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,MODEL,SERIAL
NAME      SIZE TYPE  FSTYPE            MOUNTPOINT MODEL            SERIAL
sda       3.6T disk                               HGST_HUS726T4TAL K8H1ABCD
├─sda1    512M part  vfat              /boot/efi
└─sda2    3.6T part  linux_raid_member
  └─md0   3.6T raid1 ext4              /data
sdb       3.6T disk                               HGST_HUS726T4TAL K8H1EFGH
└─sdb1    3.6T part  linux_raid_member
  └─md0   3.6T raid1 ext4              /data

What it means: You have a software RAID1 device md0 mounted at /data, built from partitions.

Decision: If you thought you had “backups,” you don’t. This is redundancy only. Confirm backup location is separate.

Task 2: Inspect mdadm RAID health and rebuild status

cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda2[0]
      3906886464 blocks super 1.2 [2/2] [UU]

unused devices: <none>

What it means: [UU] indicates both members are up. During rebuild you’d see [U_] and a progress line.

Decision: If degraded, reduce load and plan disk replacement. Also: take a backup snapshot now if you don’t have one off-host.

Task 3: Get detailed mdadm information, including event counts

cr0x@server:~$ sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Mon Oct  2 11:22:09 2023
        Raid Level : raid1
        Array Size : 3906886464 (3726.02 GiB 4000.79 GB)
     Used Dev Size : 3906886464 (3726.02 GiB 4000.79 GB)
      Raid Devices : 2
     Total Devices : 2
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0
              Name : server:0
              UUID : 1a2b3c4d:5e6f7890:abcd1234:ef012345
            Events : 12891

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       17        1      active sync   /dev/sdb1

What it means: “clean” is good, but it says nothing about whether files are correct, consistent, or restorable.

Decision: Use this to confirm topology and identify which physical disk maps to which member before touching hardware.

Task 4: Check kernel logs for IO errors and resets

cr0x@server:~$ sudo journalctl -k -S "1 hour ago" | egrep -i "ata|sas|scsi|i/o error|reset|timeout" | tail -n 8
Jan 22 10:11:41 server kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 22 10:11:41 server kernel: ata3.00: failed command: READ DMA EXT
Jan 22 10:11:41 server kernel: blk_update_request: I/O error, dev sdb, sector 9175040 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jan 22 10:11:42 server kernel: ata3: hard resetting link
Jan 22 10:11:47 server kernel: ata3: link is slow to respond, please be patient
Jan 22 10:11:52 server kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

What it means: Link resets and IO errors are early warning signs. Could be disk, cable, backplane, or controller.

Decision: Treat as “integrity at risk.” Start a fresh backup if possible; plan maintenance and hardware isolation.

Task 5: Query SMART health and key counters

cr0x@server:~$ sudo smartctl -a /dev/sdb | egrep -i "SMART overall|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Power_On_Hours"
SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       41231

What it means: “PASSED” is not reassurance. Pending/offline-uncorrectable sectors matter more. This disk is deteriorating.

Decision: Replace proactively. If in RAID5/6, rebuild risk rises; schedule rebuild with reduced load and verified backups.

Task 6: For hardware RAID, check controller/virtual disk state (storcli example)

cr0x@server:~$ sudo storcli /c0/vall show
Controller = 0
Status = Success
Description = Show Virtual Drives

DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name
0/0   RAID5 dgrd  RW     No      RWBD  -   OFF  10.913 TB data_vd0

What it means: Virtual drive is dgrd (degraded). “Consist No” suggests a consistency check is needed.

Decision: Pause nonessential writes, identify failed/predictive disks, and ensure you have a restorable backup before rebuild.

Task 7: Confirm write cache policy and battery/supercap status

cr0x@server:~$ sudo storcli /c0 show battery
Controller = 0
Status = Success
Description = Battery Status

BatteryType = iBBU
Status = Failed
Replacement required = Yes

What it means: If cache protection is failed, controllers often disable write-back cache or risk losing acknowledged writes on power loss.

Decision: Expect performance changes and potential data integrity risk if policy is misconfigured. Replace battery/supercap and review cache mode.

Task 8: Measure if you are CPU-starved or IO-starved (iostat)

cr0x@server:~$ iostat -xz 1 3
Linux 6.1.0 (server) 	01/22/2026 	_x86_64_	(16 CPU)

avg-cpu:  %user %nice %system %iowait  %steal   %idle
          12.34  0.00    5.12   31.45    0.00   51.09

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   w_await aqu-sz  %util
md0              85.0   5420.0     0.0    0.0   18.20    63.76     40.0   3120.0   44.10   2.90   98.7

What it means: High %iowait and %util near 100% indicate IO bottleneck. Write latency is high.

Decision: Throttle heavy jobs, check for rebuild/scrub, and consider moving hot workload off the array while you stabilize.

Task 9: Find which processes are hammering IO (iotop)

cr0x@server:~$ sudo iotop -oPa -n 5
Total DISK READ: 55.43 M/s | Total DISK WRITE: 12.10 M/s
  PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN  IO>  COMMAND
18422 be/4   postgres  40.22 M/s   8.10 M/s  0.00 % 92.00 % postgres: checkpointer
27109 be/4   root      12.11 M/s   0.00 B/s  0.00 % 15.00 % rsync -aH --delete /data/ /mnt/backup/

What it means: Your backup job and database maintenance are competing. That’s not a morality tale; it’s physics.

Decision: Reschedule backups/maintenance windows or implement rate limiting so backups don’t cause outages (or vice versa).

Task 10: Check filesystem errors quickly (ext4 example)

cr0x@server:~$ sudo dmesg | egrep -i "EXT4-fs error|I/O error|Buffer I/O error" | tail -n 6
[915230.112233] EXT4-fs error (device md0): ext4_find_entry:1531: inode #524301: comm nginx: reading directory lblock 0
[915230.112240] Buffer I/O error on device md0, logical block 12345678

What it means: The filesystem is seeing read errors. RAID may be masking some failures, but not all.

Decision: Stop services if possible, capture logs, plan a controlled fsck (or restore) rather than letting corruption spread.

Task 11: Verify ZFS pool health and error counters

cr0x@server:~$ sudo zpool status -v
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.
action: Replace the faulted device, or use 'zpool clear' to mark the device repaired.
  scan: scrub repaired 0B in 00:42:18 with 0 errors on Sun Jan 18 02:15:01 2026
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sdc     FAULTED      0     0     8  too many errors
            sdd     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        tank/data/app.db

What it means: ZFS detected checksum errors and can tell you which file is affected. This is the difference between “we think” and “we know.”

Decision: Treat named files as suspect. Restore affected data from backup or application-level replication; replace the faulted disk.

Task 12: Check ZFS snapshots and whether you’re confusing them with backups

cr0x@server:~$ sudo zfs list -t snapshot -o name,creation -s creation | tail -n 5
tank/data@hourly-2026-01-22-0600  Thu Jan 22 06:00 2026
tank/data@hourly-2026-01-22-0700  Thu Jan 22 07:00 2026
tank/data@hourly-2026-01-22-0800  Thu Jan 22 08:00 2026
tank/data@hourly-2026-01-22-0900  Thu Jan 22 09:00 2026
tank/data@hourly-2026-01-22-1000  Thu Jan 22 10:00 2026

What it means: Nice. But if these snapshots live on the same pool, they won’t survive pool loss, account compromise, or site failure.

Decision: Replicate snapshots to an independent target with different credentials and deletion protections.
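
A minimal replication sketch, assuming a receiving host named backup01, a dedicated zfs-recv account on it, and a target dataset backup/tank-data;
all three names are assumptions, not part of the tasks above:

# one-time full copy of a baseline snapshot
sudo zfs send tank/data@hourly-2026-01-22-0900 | \
  ssh zfs-recv@backup01 zfs receive -u backup/tank-data
# from then on, cheap incrementals between consecutive snapshots
sudo zfs send -i tank/data@hourly-2026-01-22-0900 tank/data@hourly-2026-01-22-1000 | \
  ssh zfs-recv@backup01 zfs receive -u backup/tank-data

Give the receiving account only the delegated permissions it needs (via zfs allow), not destroy rights, so a compromised production host cannot erase the history it just shipped.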

Task 13: Confirm backups exist and are recent (restic example)

cr0x@server:~$ restic -r /mnt/backup/restic-repo snapshots --last
repository 9b2f1c12 opened (version 2, compression level auto)
ID        Time                 Host        Tags        Paths
a1b2c3d4  2026-01-22 09:00:14  server                  /data

What it means: You have a backup snapshot from today. That’s a start.

Decision: Validate restore capability, not just existence. If snapshots stop updating, treat it as an incident.

Task 14: Do a test restore of a single file to prove recoverability

cr0x@server:~$ mkdir -p /tmp/restore-test && restic -r /mnt/backup/restic-repo restore latest --target /tmp/restore-test --include /data/important/report.csv
repository 9b2f1c12 opened (version 2, compression level auto)
restoring <Snapshot a1b2c3d4 of [/data] at 2026-01-22 09:00:14 by cr0x@server> to /tmp/restore-test
Summary: Restored 1 files/dirs (42.133 KiB) in 0:00

What it means: This is the moment “we have backups” becomes a fact. You successfully restored a real file.

Decision: Make this a scheduled drill with random file selection and documented results.

Task 15: Check whether your “backup” target is actually independent

cr0x@server:~$ mount | egrep "/data|/mnt/backup"
/dev/md0 on /data type ext4 (rw,relatime)
server:/export/backup on /mnt/backup type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2)

What it means: Backups go to NFS, and the export comes from a host named server, which looks a lot like this host. Independence depends on where that NFS actually lives and who can delete it.

Decision: If NFS is on the same server, same rack, or same admin credential set, it’s not independent enough. Fix that.

Task 16: Verify retention and immutability at the filesystem layer (chattr)

cr0x@server:~$ sudo lsattr -d /mnt/backup
-------------e---- /mnt/backup

What it means: No immutability flags here. That might be fine, but then immutability must come from the backup system or storage target.

Decision: If ransomware is in your threat model (it is), implement retention locks/immutability outside the primary admin’s easy reach.
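
One way to put the repository out of the primary admin’s easy reach, sketched under the assumption that you front it with restic’s rest-server on a
separate backup host; the host name, port, and paths are placeholders:

# on the backup host: serve the repository append-only, so stolen production
# credentials can add new backups but cannot delete or overwrite old ones
rest-server --path /srv/restic --listen :8000 --append-only
# on production: talk to the REST endpoint instead of a writable mount
restic -r rest:http://backup01:8000/server snapshots

Pruning and retention then run from the backup side, under credentials production never sees.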

Task 17: Check if you’re one typo away from deleting backups (permissions)

cr0x@server:~$ namei -l /mnt/backup/restic-repo | tail -n 4
drwxr-xr-x root root /
drwxr-xr-x root root mnt
drwxr-xr-x root root backup
drwxrwxrwx root root restic-repo

What it means: World-writable backup repository. That’s not a backup; it’s a community art project.

Decision: Lock down permissions, separate backup credentials, and consider append-only or immutable targets.
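
A small sketch of tightening what this task found, assuming a dedicated backup system user and group own the repository (the account is an assumption):

sudo chown -R backup:backup /mnt/backup/restic-repo   # repository owned by the backup role only
sudo chmod -R o-rwx /mnt/backup/restic-repo           # strip world access from everything inside
sudo chmod 700 /mnt/backup/restic-repo                # nobody else can even traverse the top directory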

Task 18: Spot a rebuild or scrub that is quietly killing performance

cr0x@server:~$ sudo zpool iostat -v 1 3
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
tank                        2.10T  1.40T    820    210  92.1M  18.2M
  mirror-0                  2.10T  1.40T    820    210  92.1M  18.2M
    sdc                         -      -    420    105  46.0M   9.1M
    sdd                         -      -    400    105  46.1M   9.1M
--------------------------  -----  -----  -----  -----  -----  -----

What it means: Sustained high reads can indicate scrub/resilver or a workload shift. You need to correlate with pool status and cron jobs.

Decision: If this coincides with user pain, reschedule scrubs, tune resilver priority, or add capacity/performance headroom.

Joke #2: A RAID rebuild is the storage equivalent of “just a quick change in production.” It’s never quick, and it definitely changes things.

Common mistakes: symptoms → root cause → fix

This section is intentionally specific. Generic advice doesn’t survive an incident; it just gets quoted in the postmortem.

1) “Array is healthy, but files are corrupted”

  • Symptoms: Application errors reading specific files; checksum mismatches at app layer; users see garbled media; RAID shows optimal.
  • Root cause: Silent corruption on disk/controller/cable, or application wrote bad data. RAID parity/mirroring preserved it.
  • Fix: Use checksumming filesystem (ZFS) or application checksums; run scrubs; restore corrupted objects from independent backups; replace flaky hardware.

2) “We can’t rebuild: second disk failed during rebuild”

  • Symptoms: RAID5 virtual disk fails mid-rebuild; UREs appear; multiple drives show media errors.
  • Root cause: Single-parity plus large disks plus heavy rebuild read load; insufficient margin for latent sector errors.
  • Fix: Prefer RAID6/RAIDZ2 or mirrors for large arrays; keep hot spares; run patrol reads/scrubs; replace drives proactively; ensure you have restorable backups before rebuild.

3) “Backups exist but restores are too slow to meet RTO”

  • Symptoms: Backup job reports success; restore is days; business needs hours.
  • Root cause: RTO was never engineered; backup target bandwidth too low; too much data, too little prioritization; no tiered restore plan.
  • Fix: Define RTO/RPO per dataset; implement fast local recovery (snapshots) plus offsite backups; pre-stage critical datasets; practice partial restores.

4) “Snapshots saved us… until the pool died”

  • Symptoms: Confident snapshot schedule; then catastrophic pool loss; snapshots gone with it.
  • Root cause: Snapshots stored in the same failure domain as primary data.
  • Fix: Replicate snapshots to a different system/account; add immutability; treat “same host” as “same blast radius.”

5) “Ransomware encrypted production and backups”

  • Symptoms: Backup repository deleted/encrypted; retention purged; credentials used legitimately.
  • Root cause: Backup system writable/deletable by the same credentials compromised on production; no immutability/air gap.
  • Fix: Separate credentials and MFA; write-only backup roles; immutable object lock or append-only targets; offline copy for worst-case; monitor deletion events.

6) “Performance collapsed after we replaced a disk”

  • Symptoms: Latency spikes after disk replacement; systems time out; nothing else changed.
  • Root cause: Rebuild/resilver saturating IO; controller throttling; degraded mode on parity arrays.
  • Fix: Schedule rebuild windows; throttle rebuild; move workloads; add spindles/SSDs; keep extra headroom; don’t rebuild at peak unless you enjoy chaos.

7) “Controller died and we can’t import the array”

  • Symptoms: Disks appear but array metadata not recognized; vendor tool can’t see virtual disk.
  • Root cause: Hardware RAID metadata tied to controller family/firmware; cache module failure; foreign config confusion.
  • Fix: Standardize controllers and keep spares; export controller configs; prefer software-defined storage for portability; most importantly, have backups that don’t require the controller to exist.

Checklists / step-by-step plan: build backups that survive reality

Here’s the plan that works when you’re tired, understaffed, and still expected to be right.
It’s opinionated because production is opinionated.

Step 1: Classify data by business consequence

  • Tier 0: authentication/identity, billing, customer data, core database.
  • Tier 1: internal tools, analytics, logs needed for security/forensics.
  • Tier 2: caches, build artifacts, reproducible datasets.

If everything is “critical,” nothing is. Define RPO and RTO per tier. Write it down where finance can see it.

Step 2: Choose the baseline rule and then exceed it

The classic baseline is 3-2-1: three copies of data, on two different media/types, with one copy offsite. It’s a starting point, not a medal.
For ransomware, “offsite” should also mean “not deletable by the same creds.”

Step 3: Separate failure domains on purpose

  • Different hardware: not “a different directory.”
  • Different administrative boundary: separate accounts/roles; production should not have delete on backups.
  • Different geography: at least one copy outside the site/rack/region you can lose.

Step 4: Use snapshots for speed, backups for survival

Local snapshots are for fast “oops” recovery: accidental deletes, bad deploys, quick rollback. Keep them frequent and short-retention.
Backups are for when the machine, the array, or the account is gone.
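
A hedged cadence sketch for the ZFS case, assuming the dataset tank/data, hourly snapshots, and a 48-snapshot local retention; the names and
numbers are assumptions to tune, and this is local rollback only, not the backup:

#!/bin/sh
# run hourly from cron: take a snapshot, then keep only the newest 48 hourlies
zfs snapshot "tank/data@hourly-$(date +%F-%H)"
zfs list -H -t snapshot -o name -s creation -d 1 tank/data \
  | grep '@hourly-' \
  | head -n -48 \
  | xargs -r -n 1 zfs destroy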

Step 5: Encrypt and authenticate the backup pipeline

  • Encrypt at rest and in transit (and manage keys as if they matter, because they do).
  • Use dedicated backup credentials with minimal permissions (a sketch follows this list).
  • Prefer write-only paths from production to backup when possible.
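
For the credential and encryption bullets above, a minimal restic sketch, assuming a dedicated backup account on a separate host backup01 and a
key file readable only by root; host, account, and paths are assumptions:

export RESTIC_PASSWORD_FILE=/root/.restic-pass         # repository encryption key, kept off the data volume
restic -r sftp:backup@backup01:/srv/restic/server init
restic -r sftp:backup@backup01:/srv/restic/server backup /data

restic encrypts repository contents by default and SSH covers transit; the remaining work is making sure the backup account cannot also delete history.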

Step 6: Make retention a policy, not a vibe

  • Short: hourly/daily for fast rollback.
  • Medium: weekly/monthly for business/legal needs.
  • Long: quarterly/yearly if required, stored cheaply and immutably (a retention sketch follows this list).
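
Written down as configuration rather than vibes, the policy can be one scheduled command; the keep counts below are assumptions to map onto your tiers:

restic -r /mnt/backup/restic-repo forget \
  --keep-hourly 24 --keep-daily 14 --keep-weekly 8 \
  --keep-monthly 12 --keep-yearly 3 \
  --prune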

Step 7: Test restores like you mean it

The most expensive backup is the one you never restore until the day you need it. Restore tests should be scheduled, logged, and owned.
Rotate responsibility so knowledge doesn’t live in one person’s head.
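
A drill you can schedule, reusing the repository and file from the tasks above; the 5% verification sample is an assumption, sized to your window:

restic -r /mnt/backup/restic-repo check --read-data-subset=5%       # verify a random slice of the repository data
restic -r /mnt/backup/restic-repo restore latest \
  --target /tmp/restore-drill --include /data/important/report.csv  # prove one real file comes back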

Step 8: Monitor the right things

  • Backup freshness: last successful snapshot time per dataset (a freshness check is sketched after this list).
  • Backup integrity: periodic verification or test restore.
  • Deletion events: alerts on unusual backup deletions.
  • Storage health: SMART, RAID state, ZFS errors, scrub results.
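
For the freshness item, a small check you can run from cron and wire into alerting; the 24-hour threshold, jq, and GNU date are assumptions:

#!/bin/sh
# alert if the newest restic snapshot is older than 24 hours
last=$(restic -r /mnt/backup/restic-repo snapshots --json latest | jq -r '.[0].time')
age_hours=$(( ( $(date +%s) - $(date -d "$last" +%s) ) / 3600 ))
[ "$age_hours" -gt 24 ] && echo "ALERT: last backup is ${age_hours}h old"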

Step 9: Run a tabletop exercise for the ugly scenarios

Practice:

  • Accidental delete (restore a directory).
  • Ransomware (assume attacker has production admin).
  • Controller failure (assume primary array is unrecoverable).
  • Site loss (assume the whole rack/region is gone).

Step 10: Decide what RAID level is for (and stop asking it to be a backup)

Use RAID/mirrors/erasure coding to meet availability and performance goals. Use backups to meet recoverability goals.
If your RAID choice is being driven by “we don’t need backups,” you’re doing architecture by wishful thinking.

One quote worth keeping above your monitor

Paraphrased idea: Hope is not a strategy. — origin disputed, but repeated in engineering and operations circles often enough to count as doctrine.

If you’re building storage on hope, you’re not building storage. You’re building a future incident report with a long lead time.

FAQ

1) If I have RAID1, do I still need backups?

Yes. RAID1 protects against one disk failing. It does not protect against deletion, corruption, ransomware, controller bugs, or site loss.
RAID1 makes the system keep running while the wrong thing is happening.

2) Are snapshots a backup?

Not automatically. Snapshots are point-in-time references, usually stored on the same system. They become “backup-like” only when replicated
to an independent target with retention you can’t casually delete.

3) Is RAID6 “safe enough” to skip backups?

No. RAID6 reduces the chance of array loss from disk failures during rebuild. It does nothing for logical failures (delete, overwrite),
malware, or catastrophic events. Backups exist because disk failure isn’t the only threat.

4) What about cloud storage with redundancy—does that count as backup?

Cloud provider redundancy is typically about durability of stored objects, not your ability to recover from your own mistakes.
If you delete or overwrite, the cloud will do it reliably. You still need versioning, retention locks, and independent copies.
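
If that copy lives in S3 or something S3-compatible, versioning is the minimum; a small sketch with the AWS CLI, where the bucket name is an assumption:

aws s3api put-bucket-versioning --bucket example-backups \
  --versioning-configuration Status=Enabled
aws s3api get-bucket-versioning --bucket example-backups    # confirm it actually took effect

Retention locks (Object Lock or the vendor’s equivalent) are the next step if ransomware is in your threat model.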

5) What’s the minimum viable backup plan for a small company?

Start with: daily backups to an independent target, at least 30 days retention, and one offsite copy. Add weekly/monthly retention as needed.
Then schedule restore tests. If you only do one “advanced” thing, do the restore tests.

6) How often should we test restores?

For critical systems, monthly is a reasonable baseline, with smaller partial restores more frequently (weekly is great).
After major changes—new storage, new encryption keys, new backup tool—test immediately.

7) What’s the difference between replication and backup?

Replication copies data to another place, often near-real-time. That’s great for high availability and low RPO, but it can replicate bad changes instantly.
Backups are versioned and retained so you can go back to before the failure. Many environments use both.

8) How do I protect backups from ransomware?

Separate credentials and restrict delete. Use immutability/retention locks on the backup target. Keep at least one copy offline or in a separate
admin domain. Monitor for suspicious deletion and disable backup repository access from general-purpose hosts.

9) Does ZFS eliminate the need for backups?

ZFS improves integrity with checksums and self-healing (with redundancy), and snapshots are excellent for fast rollback.
But ZFS doesn’t stop you from deleting data, encrypting it, or losing the whole pool. You still need independent backups.

10) What RPO/RTO should we pick?

Pick based on business pain, not what the storage team wishes were true. For Tier 0 data, RPO of minutes/hours and RTO of hours might be necessary.
For lower tiers, days may be acceptable. The key is that the numbers must be engineered and tested, not declared.

Next steps you can do this week

RAID is a tool for staying online through certain hardware failures. It is not a time machine. It is not a courtroom witness. It does not care
whether the data is correct; it cares whether the bits are consistent across disks.

If you run production systems, do these next steps this week:

  1. Inventory your storage: RAID level, controller type, disk ages, and rebuild behavior.
  2. Write down RPO/RTO for your top three datasets. If you can’t, you don’t have a backup plan—you have a hope plan.
  3. Verify independence: confirm backups live outside the primary failure domain and outside easy-delete credentials.
  4. Run one restore test: a single file, a directory, and (if you’re brave) a database restore to a test environment.
  5. Set alerts for backup freshness and deletion anomalies, not just disk health.

Then, and only then, enjoy your RAID. It’s useful when you treat it honestly: as redundancy, not salvation.
