You inherit a system that pays salaries, ships orders, or clears trades. It has the uptime of a lighthouse and the readability of a ransom note.
Everyone calls it “stable.” What they mean is: nobody has dared to move it in years.
Then a disk dies, a certificate expires, a vendor goes away, or the cloud bill doubles. Suddenly you’re expected to touch the “magic.”
This is how you do that without turning production into a group therapy session.
What “legacy magic” really is
“Nobody knows how it works, so don’t touch it” is not a technical assessment. It’s an organizational scar. It usually means:
the system is business-critical, poorly instrumented, under-documented, and has accumulated fixes that are correct only under
a specific and mostly forgotten set of constraints.
The most dangerous part isn’t the old code. It’s the invisible contracts: timeouts tuned to a specific storage latency profile,
cron schedules designed around batch windows, schema quirks depended on by one weird export job, kernel parameters set by someone
who’s now a scuba instructor. Legacy systems run on assumptions. When those assumptions rot, “stable” becomes “haunted.”
There’s a kind of mythology around these systems: the “wizard” who wrote it, the one database trigger that “must never be changed,”
the special server that “needs” a certain NIC. Mythology makes teams cautious; caution is good. Mythology also makes teams lazy; laziness
is how outages become traditions.
Why it survives (and why it fails)
Legacy magic survives because it was once the best option
Most “mystery” stacks weren’t created by incompetent people. They were created by people under pressure, with limited tooling,
and often with hardware constraints that forced cleverness. If you’re young enough to have always had cheap SSDs and managed
databases, you’re living in luxury. Older systems had to squeeze performance out of spinning disks, tiny RAM, slow networks,
and occasionally a budget that looked like a typo.
The system survives because it’s aligned with the business: it does one thing reliably enough, and the risk of changing it feels higher
than the risk of leaving it alone. That calculus is rational—until it isn’t.
Legacy magic fails because the environment changes, not because it “expires”
Systems don’t die of old age. They die of interface drift: libraries change behavior, TLS requirements tighten, DNS or NTP issues get
amplified by strict time checks, virtualized storage behaves differently than local disks, and your new “faster” instance type ships
with a completely different CPU topology. Then the system that worked for years collapses in a week.
Reliability engineering is mostly about keeping assumptions true, or making them irrelevant. If you treat legacy magic like a sacred artifact,
you’re guaranteeing you will eventually break it—just at the worst possible time.
Facts and historical context you can weaponize
- “Write once, run anywhere” had a shadow: early portability often hid performance assumptions, especially around filesystem semantics and time handling.
- POSIX filesystems weren’t built for your microservice swarm: many legacy apps assume local, low-latency, strongly consistent file operations.
- NFS became default glue in enterprises: it also normalized “it’s slow sometimes” as an operational reality—many apps quietly coded around it.
- RAID controller caches shaped database defaults: old DB tuning often assumed battery-backed write cache; remove it and your “stable” system becomes a latency machine.
- VMs changed time: clock drift and scheduling jitter in virtualized environments broke older software that assumed monotonic time or predictable timers.
- Log files used to be the observability stack: lots of “magic” relies on log scraping or ad-hoc grep pipelines that no one admits are production-critical.
- Init scripts were operational logic: before systemd standardized behavior, many orgs baked fragile sequencing rules into custom init scripts.
- Filesystem barriers and write ordering evolved: toggles like write barriers changed safety/performance tradeoffs, and old advice can be actively harmful now.
- “Batch windows” were a design primitive: nightly processing shaped data models, locking strategies, and backup plans; always-on workloads stress them differently.
Operational rules: how to touch it safely
Rule 1: Treat “unknown” as a dependency, not a shame
Mystery is not a moral failure. It’s a state of the system. Your job is to reduce mystery the way you reduce latency: by measuring,
isolating, and iterating. If the team’s culture treats not knowing as embarrassing, people will hide uncertainty—and you’ll ship risk.
Rule 2: Don’t change behavior until you can observe behavior
Before you refactor, upgrade, or “clean up,” you need a baseline. That baseline is not “it seems fine.” It’s: latency distribution,
error rates, saturation points, queue depths, and the exact shape of normal.
One line operations people keep rediscovering, because it keeps being true: “You can’t improve what you don’t measure” (an idea usually attributed to Peter Drucker).
Measurements won’t make the system safe, but they make change testable.
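If you need a starting point, here is a minimal baseline capture: a dated directory of plain-text snapshots, nothing fancier. The path, the one-minute sampling window, and the assumption that sysstat is installed for iostat are all arbitrary choices, not a standard.
cr0x@server:~$ B=~/baseline/$(date +%F); mkdir -p "$B"
cr0x@server:~$ vmstat 1 60 > "$B/vmstat.txt" &        # one minute of CPU/memory/iowait samples
cr0x@server:~$ iostat -xz 1 60 > "$B/iostat.txt" &    # per-device latency and utilization over the same minute
cr0x@server:~$ df -hT > "$B/df.txt"; df -i > "$B/df-inodes.txt"
cr0x@server:~$ findmnt > "$B/mounts.txt"; uname -a > "$B/kernel.txt"
Crude, but a dated folder of snapshots beats arguing from memory when the change lands.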
Rule 3: Reduce blast radius before you improve performance
Performance work is seductive: you see a graph spike, you want to fix it. But legacy performance “fixes” often depend on obscure ordering,
backpressure behavior, or timing. First make failures smaller: canary changes, feature flags, smaller maintenance windows,
smaller datasets, smaller replication scopes. Then optimize.
Rule 4: Prefer reversible changes
If your change can’t be rolled back quickly, it must be tested like a surgical implant. In legacy land, you rarely have that luxury.
Choose changes that can be toggled off: config flags, runtime parameter changes, read-only probes, shadow traffic, parallel pipelines.
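A small sketch of what reversible looks like in practice: a runtime-only kernel parameter change that records the old value first, so the revert is one command. vm.dirty_ratio is only an illustrative knob here, not a recommendation.
cr0x@server:~$ sysctl vm.dirty_ratio | tee ~/revert-note.txt     # record the current value somewhere durable
vm.dirty_ratio = 20
cr0x@server:~$ sudo sysctl -w vm.dirty_ratio=12                  # runtime only; does not survive a reboot
vm.dirty_ratio = 12
cr0x@server:~$ sudo sysctl -w vm.dirty_ratio=20                  # the rollback, copied straight from the note
Because nothing was written to /etc, a reboot also reverts it for free, which on a legacy box is a feature.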
Rule 5: Separate “how it works” from “how it breaks”
You don’t need to fully understand a legacy system to operate it safely. You need to understand:
what good looks like, what bad looks like, and which levers change outcomes.
Build runbooks around failure modes, not architecture diagrams.
Joke #1: Legacy systems are like cast iron pans—if you scrub too hard, the “seasoning” comes off and everyone gets upset.
Rule 6: Document intent, not trivia
“Set vm.dirty_ratio=12” is trivia. “Keep writeback latency under X so the DB checkpoint doesn’t stall” is intent.
The second survives hardware changes and kernel upgrades. Intent is what lets the next engineer make a different but correct choice.
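One low-effort way to keep intent attached to the setting is a comment in the config fragment itself. The file name and the exact value below are invented for illustration; the point is the comment, not the number.
cr0x@server:~$ cat /etc/sysctl.d/90-db-writeback.conf
# Intent: keep page-cache writeback small enough that DB checkpoints
# don't stall behind a burst of dirty pages on this storage backend.
# If the storage latency profile changes, re-measure before trusting this value.
vm.dirty_ratio = 12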
Rule 7: Storage is usually the silent accomplice
As a storage engineer, I’ll be blunt: when legacy services misbehave, storage is frequently involved—sometimes as the cause, sometimes
as the amplifier. Old apps bake in assumptions about fsync cost, inode lookup speed, rename atomicity, and free space behavior.
Put that app on a different storage backend and you’ve changed the laws of physics it evolved under.
Rule 8: “Don’t touch it” is a risk register item, not a strategy
If a system is too scary to change, it’s too risky to depend on. Put it in the risk register with concrete triggers:
certificate expiration, OS end-of-life, disk model end-of-sale, vendor contract renewal, key person loss.
Then build a plan that replaces fear with steps.
Fast diagnosis playbook
When the “magic” system slows down, you don’t have time for philosophy. You need a triage loop that finds the bottleneck quickly,
without making it worse. This is the order that tends to win in production.
First: confirm impact and bound the blast radius
- Is it user-facing latency, throughput, error rate, or data correctness?
- Is it one host, one AZ, one shard, one tenant, or global?
- Did anything change in the last hour/day: deploy, config, kernel, storage path, network ACLs, certificate renewal?
Second: decide whether you’re CPU-bound, memory-bound, or I/O-bound
- CPU-bound: high run queue, high user/system CPU, low I/O wait, stable latency until saturation.
- Memory-bound: rising major faults, swap activity, reclaim stalls, OOM kills, cache thrash.
- I/O-bound: high iowait, high disk utilization, high await/service times, blocked threads, fsync storms.
Third: verify saturation and queuing at the lowest layer
- Storage: device queue depth, latency, errors, multipath failover, filesystem full, ZFS pool health, md RAID resync.
- Network: retransmits, drops, MTU mismatch, duplex issues, DNS latency, TLS handshake spikes.
- Kernel: dmesg warnings, hung tasks, soft lockups, block layer timeouts.
Fourth: identify the chokepoint process and its dependency
- Which PID is consuming CPU, blocked on I/O, or holding locks?
- Which files, sockets, or disks is it waiting on?
- Is contention internal (locks) or external (storage/network)?
Fifth: apply the smallest safe mitigation
- Rate limit noisy jobs (cron, batch, compactions).
- Fail over reads, drain traffic, or shift workloads.
- Increase headroom (temporary capacity, cache, or queue tuning) only if you understand the side effects.
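As one sketch of a small, reversible mitigation: demoting a noisy batch process's I/O and CPU priority. The PID is a placeholder, and whether the idle I/O class actually helps depends on the I/O scheduler in use.
cr0x@server:~$ sudo ionice -c3 -p 4242     # idle I/O class: the job only gets disk time nobody else wants
cr0x@server:~$ sudo renice 19 -p 4242      # lowest CPU priority for the same process
Both changes vanish when the process exits, which is exactly the kind of blast radius you want while you are still learning the system.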
Practical tasks with commands, what the output means, and what decision you make
These are the bread-and-butter moves when you’re facing a mystery service on Linux with storage in the mix.
Each task includes a command, sample output, what it means, and the decision you make from it.
Run them read-only first. Curiosity is fine; surprise reboots are not.
Task 1: Find what changed recently (kernel warnings and service events)
cr0x@server:~$ journalctl --since "6 hours ago" -p warning | tail -n 20
Jan 12 10:14:03 app01 kernel: nvme nvme0: I/O 123 QID 4 timeout, aborting
Jan 12 10:14:03 app01 kernel: blk_update_request: I/O error, dev nvme0n1, sector 987654 op 0x1:(WRITE) flags 0x0 phys_seg 16 prio class 0
Jan 12 10:18:11 app01 systemd[1]: myapp.service: Main process exited, code=killed, status=9/KILL
Meaning: kernel I/O timeouts plus errors, then a service got killed. This is not “the app is slow.” This is “storage is on fire.”
Decision: stop tuning the application and start protecting data: check device health, multipath, controller, and filesystem integrity. Consider draining traffic.
Task 2: Get a one-screen health view (CPU, memory, load, iowait)
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 81232 41200 621000 0 0 120 210 950 1400 12 6 78 4 0
4 1 0 50440 39800 612400 0 0 110 980 1100 2100 10 7 55 28 0
6 3 0 49600 39000 610800 0 0 90 1600 1200 2600 8 6 40 46 0
Meaning: rising b (blocked processes) and wa (iowait) indicate threads stuck waiting on I/O.
Decision: pivot to disk and filesystem checks; do not “just add CPU.” CPU isn’t the bottleneck yet.
Task 3: Identify which disks are slow and how (latency vs utilization)
cr0x@server:~$ iostat -xz 1 3
avg-cpu: %user %nice %system %iowait %steal %idle
9.21 0.00 6.14 31.55 0.00 53.10
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s await svctm %util
nvme0n1 220.0 310.0 9024.0 12800.0 0.0 0.0 18.2 0.9 99.5
nvme1n1 10.0 12.0 400.0 520.0 0.0 0.0 1.2 0.2 3.1
Meaning: %util near 100% with high await means the device is saturated and requests are queuing.
Decision: find what is hitting nvme0n1 and reduce its write rate, move hot data, or add capacity/IOPS. If this is “supposed to be mirrored,” verify it actually is.
Task 4: Find which process is generating I/O
cr0x@server:~$ sudo iotop -o -b -n 3
Total DISK READ: 8.12 M/s | Total DISK WRITE: 45.33 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
7321 be/4 myapp 1.23 M/s 22.10 M/s 0.00 % 6.12 % myapp --worker
9012 be/4 postgres 0.00 B/s 18.44 M/s 0.00 % 5.90 % postgres: checkpointer
Meaning: the app plus Postgres checkpoint are the big writers. That combination often means dirty pages are flushing in bursts.
Decision: consider smoothing writes (DB checkpoint tuning) and check whether storage latency changed recently. Do not kill checkpointer blindly; it’s doing damage control.
Task 5: Check filesystem fullness and inode exhaustion (the dumb outage)
cr0x@server:~$ df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/nvme0n1p2 ext4 200G 196G 2.1G 99% /
/dev/nvme1n1p1 ext4 500G 120G 356G 26% /data
Meaning: root filesystem is at 99%. Expect weirdness: temp files failing, logs not writing, services crashing in non-obvious ways.
Decision: free space immediately (rotate logs, remove old artifacts), then fix the growth path. If you need time, redirect writes to /data temporarily.
Task 6: Validate inode availability (small files can kill you)
cr0x@server:~$ df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/nvme0n1p2 1310720 1310100 620 100% /
/dev/nvme1n1p1 3276800 210000 3066800 7% /data
Meaning: at 100% inode usage you can have “free space” and still be unable to create files. Legacy batch systems love making tiny files.
Decision: find the directory with the explosion (usually logs, spool, temp, cache). Clean up and add guardrails (logrotate, retention policies).
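A quick way to locate the explosion, assuming GNU coreutils (the --inodes option) and that /var is the suspect; the numbers in the output are illustrative.
cr0x@server:~$ sudo du -x --inodes /var 2>/dev/null | sort -n | tail -n 5
15644   /var/spool/myapp/tmp
81230   /var/cache/myapp
812400  /var/spool/myapp/queue
829561  /var/spool/myapp
998231  /var
Whatever sits just under the total is where the tiny files live; clean that, then add the retention guardrail so it doesn't refill.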
Task 7: Identify mount options and risky behaviors (barriers, noatime, etc.)
cr0x@server:~$ findmnt -no TARGET,SOURCE,FSTYPE,OPTIONS /
/ /dev/nvme0n1p2 ext4 rw,relatime,errors=remount-ro
Meaning: relatime is fine; errors=remount-ro means the filesystem will flip read-only on errors, which can look like “the app froze.”
Decision: if you saw I/O errors, plan for read-only flip and ensure services handle it. On legacy systems, they often don’t.
Task 8: Check kernel for hung tasks and block layer pain
cr0x@server:~$ dmesg -T | tail -n 30
[Sun Jan 12 10:14:03 2026] nvme nvme0: I/O 123 QID 4 timeout, aborting
[Sun Jan 12 10:14:09 2026] INFO: task myapp:7321 blocked for more than 120 seconds.
[Sun Jan 12 10:14:09 2026] Tainted: G W OE 5.15.0-91-generic #101-Ubuntu
[Sun Jan 12 10:14:09 2026] Call Trace:
[Sun Jan 12 10:14:09 2026] __schedule+0x2c5/0x6f0
[Sun Jan 12 10:14:09 2026] io_schedule+0x46/0x70
Meaning: tasks blocked in I/O scheduling: classic symptom of storage stalls.
Decision: escalate to storage layer: check device SMART/NVMe logs, controller, multipath, SAN, and any recent firmware changes.
Task 9: Map hot files to devices (which mount is the culprit)
cr0x@server:~$ sudo lsof -p 7321 | head -n 10
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
myapp 7321 myapp cwd DIR 259,2 4096 131073 /
myapp 7321 myapp 10w REG 259,2 1048576 262200 /var/log/myapp/worker.log
myapp 7321 myapp 11w REG 259,2 5242880 262199 /var/lib/myapp/spool/events.dat
Meaning: the process is writing under /var on the root disk (DEVICE 259,2), not on /data. That can be accidental and catastrophic.
Decision: relocate spool/logs to a proper data volume, ideally with a symlink or bind mount for minimal change. First, verify nothing depends on the old path behavior.
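A hedged sketch of the bind-mount version, assuming /data has space, the service tolerates a short stop, and rsync is available (cp -a works too). The paths mirror the lsof output above; verify them on the real host before copying anything.
cr0x@server:~$ sudo systemctl stop myapp.service
cr0x@server:~$ sudo mkdir -p /data/myapp/spool
cr0x@server:~$ sudo rsync -a /var/lib/myapp/spool/ /data/myapp/spool/      # copy, don't move, until verified
cr0x@server:~$ sudo mount --bind /data/myapp/spool /var/lib/myapp/spool    # the old path keeps working
cr0x@server:~$ sudo systemctl start myapp.service
Add the bind mount to /etc/fstab only after the service has run cleanly for a while, and keep the original directory contents until you are certain nothing reads them directly.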
Task 10: Check service restart loops and dependency ordering
cr0x@server:~$ systemctl status myapp.service --no-pager
● myapp.service - MyApp Worker
Loaded: loaded (/etc/systemd/system/myapp.service; enabled; vendor preset: enabled)
Active: activating (auto-restart) (Result: exit-code) since Sun 2026-01-12 10:18:12 UTC; 6s ago
Process: 9441 ExecStart=/usr/local/bin/myapp --worker (code=exited, status=1/FAILURE)
Main PID: 9441 (code=exited, status=1/FAILURE)
CPU: 210ms
Jan 12 10:18:12 app01 myapp[9441]: FATAL: cannot create /var/lib/myapp/spool/lock: No space left on device
Meaning: not a mystery anymore: it’s failing due to space, and systemd is hammering restarts.
Decision: stop the restart storm if it worsens I/O (temporary systemctl stop), free space, then restart. Add monitoring/alerts for disk/inodes to prevent repeats.
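If the restart storm itself is making things worse, a systemd drop-in can slow it down without touching the vendor unit file. The numbers below are arbitrary examples, not recommendations.
cr0x@server:~$ sudo systemctl edit myapp.service
# In the editor that opens, add:
[Unit]
StartLimitIntervalSec=600
StartLimitBurst=3

[Service]
RestartSec=30
cr0x@server:~$ sudo systemctl daemon-reload
Now a broken service retries a few times and then stays down loudly, instead of grinding the disk with restart attempts every few seconds.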
Task 11: Inspect cron/system timers (the invisible workload)
cr0x@server:~$ systemctl list-timers --all --no-pager | head -n 12
NEXT LEFT LAST PASSED UNIT ACTIVATES
Sun 2026-01-12 10:30:00 UTC 7min left Sun 2026-01-12 10:00:00 UTC 23min ago logrotate.timer logrotate.service
Sun 2026-01-12 11:00:00 UTC 37min left Sun 2026-01-12 10:00:03 UTC 23min ago fstrim.timer fstrim.service
Sun 2026-01-12 12:00:00 UTC 1h 37min Sun 2026-01-12 00:00:00 UTC 10h ago apt-daily-upgrade.timer apt-daily-upgrade.service
Meaning: fstrim on thin-provisioned or busy storage can cause latency spikes; logrotate can cause bursts of rename/compress I/O.
Decision: if spikes correlate, reschedule timers to low-traffic windows or tune them (e.g., trim frequency). Validate with metrics before changing.
Task 12: Network sanity for “storage” that is actually remote
cr0x@server:~$ ip -s link show dev eth0 | sed -n '1,12p'
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:ab:cd:ef brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
987654321 1234567 0 1241 0 0
TX: bytes packets errors dropped carrier collsns
123456789 2345678 0 0 0 0
Meaning: RX drops can cause NFS/iSCSI pain that looks like “disk slowness” at the app layer.
Decision: investigate NIC queueing, driver/firmware, MTU, and upstream congestion. If storage is remote, treat network as part of the disk.
Task 13: Confirm what storage you’re actually on (LVM/MD/DM)
cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINT
NAME TYPE SIZE FSTYPE MOUNTPOINT
nvme0n1 disk 238G
├─nvme0n1p1 part 1G vfat /boot/efi
└─nvme0n1p2 part 200G ext4 /
nvme1n1 disk 477G
└─nvme1n1p1 part 477G ext4 /data
Meaning: there’s no RAID/LVM shown. If someone told you “it’s mirrored,” they might be remembering a different era.
Decision: verify redundancy claims. If none exists, plan risk mitigation: backups, replication, or actual RAID/mirroring.
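Three read-only checks that turn “it's mirrored” into evidence, one per layer where the redundancy might actually live (any of them may come back empty, which is itself the answer).
cr0x@server:~$ cat /proc/mdstat          # md software RAID arrays, if any
cr0x@server:~$ sudo pvs && sudo lvs      # LVM physical and logical volumes, if any
cr0x@server:~$ sudo zpool status         # ZFS pools, if the tools are even installed
If all three are empty and the hardware has no RAID controller, the mirror exists only in the oral tradition.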
Task 14: Measure latency distribution to separate “slow” from “spiky”
cr0x@server:~$ sudo fio --name=latcheck --filename=/data/fio.test --size=1G --direct=1 --rw=randread --bs=4k --iodepth=32 --runtime=30 --time_based --group_reporting
latcheck: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.28
slat (nsec): min=800, max=21000, avg=3200.4, stdev=900.1
clat (usec): min=180, max=12000, avg=340.2, stdev=410.7
lat (usec): min=190, max=12010, avg=344.0, stdev=411.0
clat percentiles (usec):
| 1.00th=[ 210], 5.00th=[ 230], 10.00th=[ 245], 50.00th=[ 290],
| 90.00th=[ 520], 95.00th=[ 780], 99.00th=[ 2100], 99.90th=[ 7200]
Run status group 0 (all jobs):
READ: bw=363MiB/s (381MB/s), 363MiB/s-363MiB/s (381MB/s-381MB/s), io=10.6GiB (11.4GB), run=30001-30001msec
Meaning: median latency is fine, tail latency is ugly. Legacy systems often die from the 99.9th percentile.
Decision: investigate periodic stalls (GC, compaction, backups, trim, snapshots, controller hiccups). Optimize for tails, not just averages.
Task 15: Validate backup capability without trusting the story
cr0x@server:~$ sudo -n test -r /var/backups/myapp/latest.sql.gz && echo "backup readable" || echo "backup missing"
backup readable
Meaning: existence and readability are the bare minimum. Many “we have backups” claims end here.
Decision: schedule a restore test to a scratch host. If you can’t restore, you don’t have backups; you have compressed regret.
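A minimal restore drill, written here as if the backup is a compressed PostgreSQL dump; scratch-host, scratchdb, and the invoices table are placeholders for whatever your environment actually uses.
cr0x@server:~$ scp /var/backups/myapp/latest.sql.gz scratch-host:/tmp/
cr0x@server:~$ ssh scratch-host 'createdb scratchdb && gunzip -c /tmp/latest.sql.gz | psql -d scratchdb -v ON_ERROR_STOP=1'
cr0x@server:~$ ssh scratch-host 'psql -d scratchdb -c "SELECT count(*) FROM invoices"'   # a table that should never be empty
The count doesn't prove correctness, but it catches the most common failure: a restore that technically succeeds and contains nothing.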
Task 16: Check for silent data corruption signals (ZFS example)
cr0x@server:~$ sudo zpool status -v
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
scan: scrub repaired 0B in 00:12:44 with 2 errors on Sun Jan 12 09:40:11 2026
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
sda DEGRADED 0 0 2
sdb ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
/data/myapp/index/segment-00017
Meaning: checksums caught corruption. That’s not ZFS being dramatic; that’s ZFS doing its job and telling you the uncomfortable truth.
Decision: restore affected data from a known-good replica/backup, replace suspect hardware, and run scrubs on a schedule. If your legacy app can’t tolerate file corruption, treat this as a sev-1.
Three corporate-world mini-stories
1) Incident caused by a wrong assumption: “It’s redundant, right?”
A mid-sized company ran a billing pipeline on a single Linux host. Everyone believed the database lived on “a mirrored RAID”
because, historically, it had. The original builder had left, and the current team saw two disks in the chassis and assumed safety.
No one had checked in years, because checking felt like tempting fate.
They migrated the host into a new rack during a data center cleanup. After reboot, the system came up, ran for a day, and then
started throwing filesystem errors under write load. Kernel logs showed I/O timeouts. The on-call did the reasonable thing:
fail “over” to the second disk. There was no failover. There was only a second disk, formatted for something else, quietly unused.
Recovery became archaeology. The last known good backup was older than leadership was comfortable admitting, because backup jobs had been
“green” but were actually backing up an empty directory after a path change. The incident wasn’t caused by a disk dying; disks die all the time.
It was caused by an assumption surviving longer than its author.
The fix wasn’t heroic. It was boring: inventory with lsblk, verify redundancy, restore backups to a scratch environment monthly,
and add an alert for “backup produced suspiciously small output.” The new runbook included a single line: “If someone says ‘mirrored,’ show the command output.”
2) Optimization that backfired: “We turned on the faster thing”
Another organization had a legacy Java service that wrote small records to disk and relied on fsync for durability. It ran on local SSDs
and behaved predictably. A performance initiative moved it onto a shared networked storage platform that offered better utilization
and centralized snapshots. The migration hit the KPI: average throughput improved and the platform team declared victory.
Two weeks later, the system began experiencing periodic latency spikes. Not constant slowness—spikes. Every few hours, writes would stall,
request queues grew, and upstream services timed out. Engineers chased garbage collection, thread pools, and random config flags.
Someone “optimized” further by increasing application concurrency to “hide latency.” That raised the write burst size, which made the
storage platform’s tail latency worse, which caused even bigger backlogs. A clean feedback loop—just the wrong kind.
The root issue was a mismatch: the app’s durability model assumed cheap fsync; the storage platform implemented durability with different
semantics and periodic background work. The app wasn’t wrong. The platform wasn’t wrong. The pairing was wrong.
They stabilized by limiting concurrency, moving hot journals back to local NVMe, and leaving bulk data on the shared backend.
Then they measured p99.9 latency as a first-class metric, not a postmortem footnote. The “optimization” lesson stuck:
if you don’t measure tails, you’re measuring your optimism.
3) Boring but correct practice that saved the day: “The runbook did”
A financial services team had a crusty batch processor that was older than some of the engineers. It was never rewritten because it worked.
But the team treated it like a first-class production service anyway: it had dashboards, a weekly restore test, and a runbook that described
failure modes with simple checks and known-good baselines.
One evening, processing time doubled. No errors. Just slow. The on-call followed the runbook: check disk utilization, check iowait,
check NTP, check job concurrency, check recent timer events. Within ten minutes they found the culprit: a logrotate job compressing a
very large log on the same volume as the input dataset, causing periodic I/O contention. The system wasn’t broken; it was competing.
The fix was as thrilling as a spreadsheet: move logs to a separate volume and cap compression CPU. They also added an alert:
“log file exceeds retention size” and “compression overlaps batch window.” The next run was normal.
The interesting part is what didn’t happen: no panic tuning, no random restarts, no “maybe the SAN is haunted.” Boring practice worked because
it turned unknowns into checks. In production, boredom is a feature.
Joke #2: If your monitoring only checks averages, congratulations—you’ve built a system that’s always healthy in retrospect.
Common mistakes: symptoms → root cause → fix
1) “It’s slow after migration” → hidden fsync assumptions → move journals or change durability strategy
Symptoms: p99 write latency spikes, threads blocked, application timeouts, especially during bursts.
Root cause: workload moved from local SSD to networked storage/NFS/iSCSI with different flush semantics and tail latency.
Fix: keep latency-sensitive WAL/journals on local NVMe; benchmark with fio; measure p99.9; cap concurrency and add backpressure.
2) “Random crashes” → disk full or inode full → enforce retention and alerting
Symptoms: services exit with weird errors, lock files fail, logs stop, package updates fail.
Root cause: filesystem at 95–100% or inode exhaustion; legacy apps create many tiny files.
Fix: df -hT and df -i alerts; logrotate that actually runs; move spools off root; set quotas where possible.
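Until proper monitoring exists, even a crude cron-able script closes the gap. A minimal sketch, assuming GNU df; the 90% threshold and the mount list are yours to adjust.
cr0x@server:~$ cat /usr/local/bin/check-fs-headroom.sh
#!/bin/sh
# Print a warning when space or inode usage crosses the threshold on key mounts.
THRESHOLD=90
for mnt in / /data; do
    space=$(df --output=pcent "$mnt" | tail -n 1 | tr -dc '0-9')
    inodes=$(df --output=ipcent "$mnt" | tail -n 1 | tr -dc '0-9')
    [ "$space" -ge "$THRESHOLD" ] && echo "WARN: $mnt space at ${space}%"
    [ "$inodes" -ge "$THRESHOLD" ] && echo "WARN: $mnt inodes at ${inodes}%"
done
exit 0
Wire its output into whatever alerting you already have; the exact transport matters less than someone hearing about 90% before they hear about 100%.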
3) “CPU is high, so it must be compute” → CPU is high because of I/O wait or compression → separate signal from noise
Symptoms: load average high, users complain, but adding CPU doesn’t help.
Root cause: blocked threads inflate load; CPU cycles spent in kernel or compression; iowait hidden under load.
Fix: use vmstat and iostat -xz; examine blocked tasks in dmesg; move compression jobs out of peak windows.
4) “We rebooted and it got worse” → dependency ordering + stateful recovery → stop restart storms
Symptoms: service keeps restarting, disks thrash, logs flood, recovery takes longer.
Root cause: systemd restart loops plus long warm-up tasks (reindex, cache rebuild) plus saturated storage.
Fix: temporarily stop the service; confirm disk headroom; raise restart backoff; add explicit dependencies and readiness checks.
5) “Backups are green” → wrong path or empty dataset → test restores
Symptoms: backups “succeed,” but restores fail or restore empty data.
Root cause: path moved, permissions changed, or backup job captures the wrong mount; monitoring checks exit code only.
Fix: periodic restore drills; validate backup size ranges; store backup manifests; alert on unexpectedly small outputs.
6) “Storage says healthy” → tail latency and micro-stalls → observe percentiles and queue depth
Symptoms: overall throughput looks fine; users see freezes; graphs show periodic cliffs.
Root cause: platform health checks focus on averages; background tasks (scrubs, trims, snapshots) create tail spikes.
Fix: instrument p95/p99/p99.9 latency; monitor device await, queue depth; reschedule background tasks.
7) “We tuned the kernel like a blog post” → old advice on new kernels → revert to baseline and test one change at a time
Symptoms: unpredictable latency, memory reclaim issues, odd stalls after “sysctl tuning.”
Root cause: cargo-culted sysctls that conflict with modern kernels or cgroup behavior.
Fix: track sysctl changes; revert to distro defaults; introduce changes with a hypothesis and measurement plan.
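To see what has been overridden locally before deciding what to revert, listing the non-comment lines in the sysctl config and comparing them with the live values is usually enough. The 99-tuning.conf file and its contents below are hypothetical.
cr0x@server:~$ grep -rHv '^[[:space:]]*#' /etc/sysctl.conf /etc/sysctl.d/ 2>/dev/null | grep -v ':[[:space:]]*$'
/etc/sysctl.d/99-tuning.conf:vm.swappiness = 1
/etc/sysctl.d/99-tuning.conf:net.core.somaxconn = 65535
cr0x@server:~$ sysctl vm.swappiness net.core.somaxconn       # confirm the running values match the files
vm.swappiness = 1
net.core.somaxconn = 65535
Each line you can't attach a hypothesis to is a candidate for reverting to the distro default, one at a time, with a measurement around it.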
8) “It only fails on one node” → hardware/firmware divergence → enforce homogeneity and capture versions
Symptoms: one host shows higher errors, different latency, or weird resets.
Root cause: different NVMe firmware, RAID controller cache state, NIC driver, BIOS settings.
Fix: inventory firmware and kernel modules; standardize; isolate the node; replace suspect components.
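A handful of read-only commands that capture the versions worth comparing across nodes. nvme-cli, smartmontools, ethtool, and dmidecode may need installing, and the device names are placeholders.
cr0x@server:~$ uname -r                                            # kernel version
cr0x@server:~$ sudo nvme list                                      # NVMe model and firmware per drive
cr0x@server:~$ sudo smartctl -i /dev/sda | grep -i firmware        # SATA/SAS drive firmware
cr0x@server:~$ sudo ethtool -i eth0 | grep -Ei 'driver|firmware'   # NIC driver and firmware versions
cr0x@server:~$ sudo dmidecode -s bios-version                      # BIOS/UEFI version
Dump the output of each into the inventory next to the hostname; "one node is different" is much easier to prove when the diff is a text file.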
Checklists / step-by-step plan
Checklist A: “Touching the magic” without breaking it
- Define the safety boundary: what data must never be lost, what downtime is acceptable, what rollback means in minutes.
- Capture a baseline: CPU, memory, disk latency percentiles, error rates, queue depths. Save screenshots if you must; just save something.
- Inventory reality: disks, mounts, RAID/LVM/ZFS, network paths, timers, cron, and backups. “We think” doesn’t count.
- Write the first runbook page: how to tell healthy vs unhealthy, and the five commands that prove it.
- Choose a reversible change first: logging, metrics, read-only probes, or moving non-critical writes.
- Make one change: one. Not three “while we’re here.”
- Measure again: if you can’t see improvement or degradation, you didn’t control the experiment.
- Document intent: why the change exists, what metric it affects, and how to revert.
Checklist B: Storage-specific safety steps (because storage is where careers go to die)
- Check free space and inodes on all relevant mounts; enforce headroom targets.
- Validate redundancy claims with commands, not folklore.
- Confirm write durability expectations (fsync frequency, journaling mode, DB WAL settings).
- Inspect kernel logs for device timeouts, resets, and filesystem remounts.
- Test tail latency with a controlled benchmark off-peak, not during an incident.
- Run a restore test before any risky migration.
Checklist C: Incident response on legacy systems
- Stabilize: stop the restart loop, throttle batch jobs, reduce traffic if needed.
- Identify the bottleneck layer: CPU vs memory vs storage vs network.
- Confirm the symptom with two sources: kernel + app logs, or iostat + latency metrics.
- Apply smallest mitigation: move one hot path, reschedule one timer, add one GB of headroom, not a redesign.
- Record what worked: commands, outputs, and timestamps. You’re writing tomorrow’s runbook under pressure.
- Post-incident: convert one assumption into a check, and one check into an alert.
FAQ
1) Is “don’t touch it” ever correct?
Temporarily, yes—during peak revenue windows or when you lack rollback. Permanently, no. If it can’t be changed, it must be isolated,
measured, and put on a path to replacement or containment.
2) How do I start documenting a system I don’t understand?
Start with behavior: inputs, outputs, SLOs, and failure modes. Then dependencies: storage mounts, databases, network endpoints, cron/timers,
and backup/restore. Architecture diagrams are nice; operational truth is nicer.
3) What’s the first metric to add to a legacy service?
Latency percentiles for the critical operation, plus error rate. If storage is involved, add device latency and utilization.
Averages will lie to you with a straight face.
4) Why do legacy systems hate shared storage?
They often assume local disk semantics: low jitter, predictable fsync, and simple failure modes. Shared storage introduces queuing,
multi-tenant interference, and different durability mechanisms. Sometimes it still works—just not by default.
5) Should we rewrite it?
Not as a first response. First, make it observable and safe to operate. Rewrites are justified when the cost of risk reduction exceeds
the cost of replacement, or when dependencies (OS, runtime, vendor) are genuinely untenable.
6) How do I convince management that “stable” is risky?
Bring concrete triggers: OS end-of-life dates, certificate expirations, single points of failure, lack of restore tests, and measured tail latency.
Risk that can be graphed gets funded faster than fear.
7) What’s a safe first “touch” change?
Add read-only observability: dashboards, logs with correlation IDs, disk/inode alerts, and a backup restore drill.
These change your visibility, not production behavior.
8) How do I handle “the wizard” who guards the system?
Respect their experience, then extract it into artifacts: runbooks, baseline outputs, and test procedures. If knowledge stays in one head,
the system is already in a degraded state—you just haven’t had the outage yet.
9) What if the system is too fragile to test?
Then your first project is to create a staging environment or a shadow read-only replica. If you truly can’t, reduce risk by narrowing
changes to reversible config toggles and operational controls.
10) How do I avoid cargo-cult tuning?
Write a hypothesis for every change: “This reduces checkpoint burstiness and lowers p99 write latency.” Measure before/after. Keep a revert plan.
If you can’t explain the failure mode you might introduce, don’t ship it.
Conclusion: next steps you can actually do
Legacy magic is just undocumented engineering plus time. Treat it like production, not folklore. The goal isn’t to be fearless.
The goal is to be methodical enough that fear stops being your primary control mechanism.
Do this next, in order:
- Baseline and observe: capture iostat/vmstat snapshots under normal load; add latency percentiles to dashboards.
- Inventory storage reality: mounts, redundancy, free space/inodes, kernel error logs. Write it down.
- Prove backups by restoring: one dry run to a scratch host. Schedule it.
- Reduce blast radius: separate logs/spools from root, add rate limits to batch jobs, implement canaries for changes.
- Convert assumptions into checks: if someone says “it’s mirrored,” automate verification; if someone says “backups are fine,” test restore monthly.
Touch the system, but touch it with gloves: measurements, reversibility, and a plan that values boring correctness over exciting heroics.
That’s how legacy stops being magic and starts being just another service you run confidently.