It’s 03:17. Your dashboard says “green,” your pager says “absolutely not,” and your CEO says “but we’re on the premium cloud tier.” Meanwhile your users are watching spinners like it’s a nostalgic hobby.
By 2026 we’ve added AI copilots, “serverless everything,” NVMe everywhere, service meshes, eBPF, and enough managed offerings to wallpaper an aircraft hangar. Yet the outages still rhyme with the ones from 2006: overloaded queues, mismatched assumptions, capacity denial, and the kind of “optimization” that looks brilliant until the first real incident.
Why 2026 feels familiar: the same physics, new marketing
Most “new” reliability failures are old failures with a different costume. The core constraints didn’t change:
- Queuing theory still runs your datacenter. When utilization approaches 100%, latency explodes. Not linearly. Explosively. (The quick calculation after this list shows how fast.)
- Tail latency still dominates user experience. Your median can be gorgeous while your p99 is actively on fire.
- Storage is still the shared fate of stateful systems. Whether it’s a managed database, a Kubernetes PersistentVolume, or a homegrown ZFS pool, you can’t “microservice” your way out of I/O.
- Networks are still lossy and variable. You can wrap them in gRPC, QUIC, meshes, and policies, but physics still has the last word.
- Humans still assume. And assumptions still break at 2x scale, 10x load, or at exactly the wrong moment.
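To put numbers on the first bullet: a single-queue model (M/M/1, a simplification, but the shape holds) says time-in-system scales with 1/(1 - utilization). Here's a minimal sketch in Python; the 5 ms service time is an arbitrary assumption:

# Minimal M/M/1 sketch: how latency grows as utilization approaches 100%.
# Assumes a fixed 5 ms average service time -- illustrative, not a benchmark.
service_time_ms = 5.0

for utilization in (0.50, 0.70, 0.90, 0.95, 0.99):
    # Mean time in system for M/M/1 is service_time / (1 - utilization).
    latency_ms = service_time_ms / (1.0 - utilization)
    print(f"util={utilization:.0%}  avg latency={latency_ms:6.1f} ms")

Same hardware, same code: 10 ms at 50% busy, 100 ms at 95%, 500 ms at 99%. That cliff is why "we still have 5% headroom" is not a plan.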
The “2026 twist” is that complexity is now sold as safety. If you use a managed service, surely someone else solved reliability for you. If you add another abstraction layer, surely the system gets simpler. If the dashboard is green, surely production is fine. None of those are guaranteed. They’re just ways to outsource understanding.
Here’s a paraphrased idea often attributed to Richard Cook: Complex systems fail in complex ways, and understanding normal work is key to understanding failures.
It’s not poetic. It’s operational advice: stop only studying outages; study how things succeed on a normal Tuesday. That’s where your real safety margins are hiding.
One rule for 2026: if you can’t explain where the queue is, you don’t know why the latency is high.
Nine facts and historical context points (that explain today)
- RAID didn’t disappear; it moved. Many “cloud disks” are still built on some form of mirroring/erasure coding plus caching, just not your server’s RAID card.
- IOPS marketing has misled teams for decades. A million 4k reads isn’t the same as a thousand 1MB writes, and “up to” performance is not a promise.
- Write amplification is older than SSDs. Filesystems, databases, and log-structured designs have always traded sequential writes for compaction work later.
- The fallacy of “stateless fixes stateful pain” keeps recurring. People split services while keeping one shared database, then wonder why the database is the new monolith.
- Tail latency became a mainstream topic in the 2010s. Before that, many teams measured averages and were surprised by user complaints. Now we measure p99 and are surprised anyway—because we still don’t manage queues.
- Containers didn’t eliminate noisy neighbors; they industrialized them. cgroups constrain CPU and memory, but shared storage and shared network paths still couple workloads.
- “Eventually consistent” systems still have consistency costs. Retries, reconciliation, and compaction can shift pain into the background until it becomes foreground.
- NVMe reduced latency but increased speed of failure. When a system can generate load faster, it can also overwhelm downstream dependencies faster.
- Incident retrospectives are often too polite. The real root cause is frequently “we believed a property that was never true,” but it gets written up as “transient latency.”
The shiny packaging: where old failure modes hide now
1) “Managed” does not mean “bounded”
Managed databases and managed disks are fantastic until you treat them like magic. They still have:
- IO credit systems and burst behavior
- maintenance windows and background compactions
- multi-tenant contention
- limits you won’t notice until you hit them
In 2026, the most common managed-service outage pattern I see is soft throttling. No obvious 5xx spike. No dramatic “down.” Just a slow slide into timeouts as latency increases and retries multiply. The service is “up.” Your SLO is not.
2) Observability got better—and also worse
We have more telemetry than ever: distributed traces, profiles, eBPF-based network visibility, storage metrics, queue depth, saturation signals. And yet teams still fail because their observability is uncurated.
If your on-call has to mentally join 14 dashboards and translate six different percentiles into one decision, you don’t have observability. You have a museum.
3) The cache is still the most dangerous “performance feature”
Caches are good. Overconfidence in caches is not. In 2026 the classic failures still happen:
- cache stampedes after invalidation
- evictions during a traffic spike
- hot key amplification from a single popular object
- cache warming jobs that quietly DDoS the database
Joke #1: A cache is like a shared office fridge: everyone loves it until someone puts fish in it on Friday afternoon.
4) “Just add retries” remains a reliability anti-pattern
Retries are a tax you pay for uncertainty. They can be lifesaving if bounded and jittered. They can also be the reason your outage becomes a catastrophe. In 2026, retries are often hidden inside SDKs, service meshes, and client libraries—so you think you have one request, but your dependencies see three.
Retry budgets and circuit breakers aren’t “nice-to-have.” They’re the difference between a transient blip and a total cascade.
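What "bounded and jittered" looks like, as a minimal sketch. The RetryBudget class and the 10% ratio are illustrative assumptions, not a standard API; most meshes and SDKs expose equivalent knobs under different names, and the point is the shape, not the names:

import random
import time

class RetryBudget:
    """Allow retries only while they stay under a fraction of real requests."""
    def __init__(self, ratio=0.1):
        self.ratio = ratio        # retries may add at most ~10% extra load
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def allow_retry(self):
        if self.retries < self.ratio * max(self.requests, 1):
            self.retries += 1
            return True
        return False

def call_with_retries(call, budget, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Capped attempts, full-jitter exponential backoff, gated by a shared budget."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1 or not budget.allow_retry():
                raise
            # Full jitter: a random sleep up to the exponential cap, so
            # synchronized clients don't all retry at the same instant.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))

Attempts are capped, sleeps are randomized, and a shared budget makes a retry storm mechanically hard to create. If your platform already has these controls, the work is setting them deliberately instead of trusting defaults.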
5) Storage is still where performance optimism goes to die
Stateful systems haven’t become less important. They’ve become more centralized. You might have 200 microservices, but you still have:
- a few databases that matter
- a message bus that matters
- an object store everyone leans on
- a persistent volume layer that quietly determines your p99
When these saturate, the rest of your architecture becomes decorative.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
They migrated a customer-facing API from VMs to Kubernetes. Same app, same database, “more resilient platform.” The new cluster used a CSI driver backed by networked block storage. Nobody panicked, because the vendor’s datasheet said the volume type was “high performance.”
The wrong assumption was subtle: the team believed storage latency would be stable as long as average IOPS stayed under the published limit. But their workload wasn’t average. It was spiky writes, fsync-heavy, and sensitive to tail latency. During peak, the block device hit intermittent latency spikes—single-digit milliseconds jumping to hundreds. Not constant, just enough to break request deadlines.
The first symptom wasn’t storage alarms. It was application timeouts and a surge in retries. That doubled the write rate, which increased queue depth at the storage layer, which increased latency further. The system “self-amplified” into an outage without ever truly going down.
When they finally pulled node-level metrics, they saw it: iowait rising, but only on a subset of nodes. The scheduling spread was uneven; a few nodes had multiple stateful pods pinned close together, hammering the same underlying storage path.
The fix was unglamorous: set explicit pod anti-affinity for stateful workloads, tune client timeouts with jittered retries, and switch the database volume to a class with predictable latency (and pay for it). Most importantly, they updated their runbook: “Published IOPS is not a latency guarantee.”
Mini-story 2: The optimization that backfired
A platform team wanted to cut cloud spend. They noticed their database instances were underutilized on CPU. So they downsized CPU and memory, expecting the same throughput. It worked in staging. It even worked for a week in production.
Then the monthly billing cycle hit, along with the usual reporting jobs. The database started checkpointing more aggressively. With less memory, the buffer cache churned. With fewer CPUs, background maintenance competed harder with foreground queries. Latency went up.
Someone “optimized” the application by increasing connection pool size to reduce perceived queuing. That increased concurrency at the database, which increased lock contention and write amplification. The p50 improved for a few endpoints. The p99 got wrecked.
The backfire wasn’t just performance. It was operational clarity. The system became unstable in a way that made every dashboard look “sort of plausible.” CPU wasn’t pegged. Disk wasn’t pegged. Network wasn’t pegged. Everything was just… slower.
They rolled back the downsizing, reduced pool sizes, and introduced admission control: if the queue is growing, shed load or degrade features rather than “pushing through.” The lesson landed hard: when you optimize for averages, you usually pay in tails.
Mini-story 3: The boring but correct practice that saved the day
A mid-sized SaaS company had a reputation for being “too cautious.” They did quarterly restore drills. Not a tabletop exercise—actual restores into an isolated environment, plus application-level verification. People teased them about it until the day it mattered.
An engineer ran a routine storage expansion on a ZFS-backed database host. The change was correct, but a separate, unrelated issue existed: one drive in the mirror was quietly throwing errors that hadn’t crossed the alert threshold. During resilver, error rates increased. The pool degraded. The database started logging I/O errors.
They declared an incident early—before data corruption symptoms reached customers. Because the team had a practiced restore process, the decision wasn’t “can we restore?” It was “which restore path is fastest and safest?” They failed over to a warm replica, quarantined the degraded pool, and restored the affected datasets to fresh storage for verification.
The post-incident review wasn’t dramatic. It was almost boring. And that’s the point: the “boring” practice of restore drills converted a potential existential event into a controlled exercise with a few hours of degraded performance.
Joke #2: Nothing builds team unity like a restore drill—except an untested restore during an outage, which builds unity in the way a shipwreck builds teamwork.
Fast diagnosis playbook: find the bottleneck in minutes
This is the playbook I wish more orgs printed and taped near the on-call desk. Not because it’s fancy, but because it’s ordered. The order matters.
First: confirm you have a latency problem, not a correctness problem
- Are requests timing out or returning errors?
- Is p95/p99 rising while p50 stays flat? That’s a queue or contention signal.
- Is only one endpoint failing? That’s likely a dependency or lock hotspot.
Second: identify the constrained resource using the “USE” lens
For each layer (app, node, storage, network, database), check:
- Utilization: near the limit?
- Saturation: queue depth, wait time, backpressure?
- Errors: timeouts, retransmits, disk errors, throttling?
Third: hunt for the queue
The queue is rarely where you want it to be. In 2026 it often hides in:
- client connection pools
- service mesh retries
- thread pools and async executors
- kernel block layer (aqu-sz in iostat; older sysstat calls it avgqu-sz)
- storage controller queues
- database lock waits
- message broker partitions
Fourth: stop the bleeding with bounded actions
Pick actions that are reversible and predictable:
- reduce concurrency (connection pool, worker threads)
- disable non-essential jobs (analytics, batch imports)
- increase timeouts only if you also reduce retries
- route traffic away from a hot shard/zone
- temporarily shed load with feature flags
Fifth: fix the cause, not the symptom
After stability: remove the hidden coupling. If two workloads contend, isolate them. If storage latency is variable, buy predictability or change access patterns. If retries create storms, implement budgets and backoff.
Practical tasks: commands, outputs, and decisions (12+)
These are not “toy” commands. They’re what you run when production is slow and you need answers without writing a thesis. Each task includes: the command, what the output means, and the decision you make from it.
Task 1: Confirm load average vs CPU pressure
cr0x@server:~$ uptime
03:22:18 up 41 days, 7:10, 2 users, load average: 18.42, 16.90, 12.11
Meaning: High load average can be CPU runnable tasks or tasks blocked on I/O. Load alone doesn’t tell you which.
Decision: Don’t scale CPU blindly. Next check iowait and run queue.
Task 2: Check CPU iowait and steal time
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (db-01) 02/02/2026 _x86_64_ (16 CPU)
03:22:31 CPU %usr %nice %sys %iowait %irq %soft %steal %idle
03:22:32 all 12.10 0.00 4.20 38.50 0.00 0.40 0.10 44.70
03:22:33 all 11.80 0.00 4.10 40.20 0.00 0.30 0.10 43.50
03:22:34 all 13.00 0.00 4.50 39.70 0.00 0.30 0.10 42.40
Meaning: ~40% iowait is screaming “storage latency or saturation.” Low steal suggests it’s not the hypervisor stealing CPU.
Decision: Move quickly to disk queue/latency; scaling CPU won’t fix this.
Task 3: Identify which block device is hurting
cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (db-01) 02/02/2026 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.4 0.0 4.3 39.5 0.1 43.7
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %util await svctm
nvme0n1 120.0 880.0 6144.0 32768.0 0.0 120.0 98.5 42.30 0.85
Meaning: %util ~98% plus await ~42ms indicates the device is saturated or experiencing high latency. Ignore svctm: it is deprecated and removed in newer sysstat releases; focus on await and %util.
Decision: Confirm queue depth and whether latency is workload-driven or device throttling.
Task 4: Check NVMe health and media errors
cr0x@server:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0x00
temperature : 43 C
available_spare : 100%
percentage_used : 2%
media_errors : 0
num_err_log_entries : 0
warning_temp_time : 0
critical_comp_time : 0
Meaning: No media errors, low wear. This points away from “disk failing” and toward “disk overloaded” or “latency elsewhere in the stack.”
Decision: Look for filesystem/dataset behavior, sync writes, and database checkpoints.
Task 5: Find the top I/O consumers by process
cr0x@server:~$ sudo iotop -oPa
Total DISK READ: 6.01 M/s | Total DISK WRITE: 33.12 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
2411 be/4 postgres 1.20 M/s 18.50 M/s 0.00 % 62.10 % postgres: checkpointer
2388 be/4 postgres 0.80 M/s 9.10 M/s 0.00 % 31.40 % postgres: wal writer
3102 be/4 app 0.20 M/s 2.10 M/s 0.00 % 7.20 % java -jar api.jar
Meaning: Checkpointer and WAL writer dominating suggests write pressure and checkpoint activity, often tied to memory pressure or write-heavy bursts.
Decision: Reduce write bursts (batching), tune checkpoint settings, or add IOPS/latency headroom. Consider isolating WAL onto a separate device.
Task 6: Inspect filesystem space and inode pressure
cr0x@server:~$ df -h /var/lib/postgresql
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 900G 812G 42G 96% /var/lib/postgresql
Meaning: 96% full is a performance smell on some filesystems and leaves almost no headroom for bursty consumers (autovacuum, logs, temp files) before a hard failure.
Decision: Free space now, and check inodes with df -i, since inode exhaustion mimics "disk full" even with space left. Then implement capacity alerts with "days to full," not just percentage used.
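"Days to full" is one subtraction and one division; the hard part is having a growth rate you trust. A rough sketch, where the daily growth number is an assumption you'd pull from your metrics history, not from a single df sample:

# Rough "days to full" estimate from two usage samples of the same filesystem.
def days_to_full(used_now_gb, used_yesterday_gb, size_gb):
    daily_growth_gb = used_now_gb - used_yesterday_gb
    if daily_growth_gb <= 0:
        return float("inf")       # not growing; fall back to percentage alerts
    return (size_gb - used_now_gb) / daily_growth_gb

# With the df output above (812G used of 900G) and an assumed ~6 GB/day growth:
print(round(days_to_full(812, 806, 900), 1))   # -> 14.7 days

Alert on that number dropping below your lead time for adding capacity, not on a fixed percentage.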
Task 7: Check for kernel-level block queue pressure
cr0x@server:~$ cat /proc/diskstats | egrep 'nvme0n1 '
259 0 nvme0n1 112233 0 987654 12000 998877 0 776655 54000 0 62000 66000 0 0 0 0
Meaning: Diskstats alone is raw. You use it to correlate with iostat, or to confirm I/O is happening when higher-level metrics lie.
Decision: If app says “idle” but diskstats increments fast, your observability is missing something (sidecars, host processes, backup jobs).
Task 8: Identify TCP retransmits and drops (network-induced latency)
cr0x@server:~$ ss -s
Total: 1542 (kernel 0)
TCP: 1123 (estab 980, closed 77, orphaned 0, timewait 25)
Transport Total IP IPv6
RAW 0 0 0
UDP 12 10 2
TCP 1046 1002 44
INET 1058 1012 46
FRAG 0 0 0
Meaning: This is just connection volume. Useful for spotting runaway connection counts or TIMEWAIT storms.
Decision: If estab explodes, suspect connection pooling misconfig or load balancer behavior; move to retransmit metrics next.
cr0x@server:~$ netstat -s | egrep -i 'retrans|segments retransmited|listen|reset'
3420 segments retransmited
217 resets received for embryonic SYN_RECV sockets
Meaning: Retransmits correlate strongly with tail latency. SYN_RECV resets can indicate overload or SYN floods, but also aggressive timeouts.
Decision: If retransmits are rising during the incident, treat network as a bottleneck contributor, not just “the thing between services.”
Task 9: Kubernetes: locate uneven pod placement (noisy neighbor by scheduling)
cr0x@server:~$ kubectl get pods -n prod -o wide | egrep 'db|cache'
db-0 1/1 Running 0 14d 10.2.3.14 node-a
db-1 1/1 Running 0 14d 10.2.5.21 node-c
cache-0 1/1 Running 0 14d 10.2.3.51 node-a
cache-1 1/1 Running 0 14d 10.2.3.52 node-a
Meaning: Multiple heavy stateful pods on node-a. That’s a classic “we didn’t mean to pin them together” problem.
Decision: Add pod anti-affinity and/or topology spread constraints. Consider node taints for storage-heavy workloads.
Task 10: Kubernetes: inspect persistent volume class and access mode
cr0x@server:~$ kubectl get pvc -n prod db-data-db-0 -o yaml | egrep -A2 'storageClassName|accessModes|resources'
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 500Gi
storageClassName: standard
Meaning: “standard” often implies shared backend with bursty behavior. ReadWriteOnce also means failover and rescheduling have constraints.
Decision: If p99 is a business requirement, use a storage class designed for predictable latency, and test it under realistic fsync workloads.
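If you want a quick sanity check before committing to a storage class, a crude fsync latency probe looks like this. It's a smoke test, not a benchmark (use fio for real load tests), and the path is an assumption you'd point at the volume in question:

import os
import time

# Crude fsync latency probe: 200 small synchronous writes, report p50/p99.
PATH = "/mnt/test-volume/fsync_probe.tmp"   # assumption: point at your volume
samples = []
with open(PATH, "wb") as f:
    for _ in range(200):
        f.write(b"x" * 4096)
        start = time.monotonic()
        f.flush()
        os.fsync(f.fileno())
        samples.append((time.monotonic() - start) * 1000)
os.remove(PATH)
samples.sort()
print(f"fsync p50={samples[100]:.2f} ms  p99={samples[198]:.2f} ms")

If p99 here is tens of milliseconds while the datasheet promises sub-millisecond, believe the probe.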
Task 11: Database: find lock waits that look like “random slowness”
cr0x@server:~$ psql -U postgres -d appdb -c "select now(), wait_event_type, wait_event, count(*) from pg_stat_activity where wait_event is not null group by 1,2,3 order by 4 desc;"
now | wait_event_type | wait_event | count
------------------------------+-----------------+---------------+-------
2026-02-02 03:25:10.112+00 | Lock | transactionid | 37
2026-02-02 03:25:10.112+00 | IO | DataFileRead | 12
Meaning: Many sessions waiting on transactionid locks indicates contention (long transactions, hot rows). Some I/O waits also present.
Decision: Kill/mitigate the long transaction if safe; reduce concurrency; fix the hot-row pattern. Don’t just throw storage at lock contention.
Task 12: Database: check checkpoint pressure (PostgreSQL example)
cr0x@server:~$ psql -U postgres -d appdb -c "select checkpoints_timed, checkpoints_req, buffers_checkpoint, checkpoint_write_time, checkpoint_sync_time from pg_stat_bgwriter;"
checkpoints_timed | checkpoints_req | buffers_checkpoint | checkpoint_write_time | checkpoint_sync_time
------------------+-----------------+--------------------+-----------------------+----------------------
1200 | 980 | 98765432 | 83423456 | 1223344
Meaning: High requested checkpoints (checkpoints_req) can mean the system is forcing checkpoints due to WAL volume, which can spike I/O.
Decision: Tune checkpoint settings, reduce write spikes, and ensure WAL and data are on storage that can sustain the write rate.
Task 13: ZFS: detect pool health and read/write latency hints
cr0x@server:~$ sudo zpool status
pool: tank
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
scan: scrub repaired 0B in 02:31:12 with 0 errors on Sun Feb 1 03:10:01 2026
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
sda ONLINE 0 0 0
sdb UNAVAIL 5 12 0
Meaning: DEGRADED plus an unavailable member is not “we’ll handle it later.” It changes performance and risk immediately.
Decision: Replace the device now, and avoid heavy maintenance (like big compactions) until redundancy is restored.
Task 14: Linux memory pressure (cache vs reclaim thrash)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 64Gi 52Gi 1.2Gi 1.1Gi 11Gi 3.0Gi
Swap: 0B 0B 0B
Meaning: Only 3Gi available suggests reclaim pressure; storage performance can degrade when the kernel is constantly evicting and refaulting pages.
Decision: Add memory headroom, reduce cache churn, or isolate memory-heavy sidecars. Avoid “we have free in buff/cache so we’re fine” arguments during incidents.
Task 15: Spot runaway retries from a client side (rate + error)
cr0x@server:~$ sudo journalctl -u api.service -n 20 --no-pager
Feb 02 03:24:51 api-01 api[3102]: WARN upstream timeout talking to payments, retry=1
Feb 02 03:24:51 api-01 api[3102]: WARN upstream timeout talking to payments, retry=2
Feb 02 03:24:52 api-01 api[3102]: WARN upstream timeout talking to payments, retry=3
Feb 02 03:24:52 api-01 api[3102]: ERROR request failed after retries
Meaning: Retries are happening fast and likely aligned across requests. That can become a retry storm.
Decision: Add jittered exponential backoff, cap retries, and consider circuit breaking to preserve the dependency and your own queue.
Common mistakes: symptoms → root cause → fix
1) “Everything is slower” but CPU is fine
Symptoms: p99 climbs, CPU looks modest, dashboards show nothing pegged.
Root cause: Hidden queue: storage await, lock waits, network retransmits, or connection pool saturation.
Fix: Measure saturation directly (iostat await/%util, DB wait events, retransmits). Reduce concurrency and retries; isolate hot components.
2) Scaling the app makes it worse
Symptoms: More pods/instances increases error rate and latency.
Root cause: A shared dependency is saturated (DB, cache, disk). More clients create more contention and queueing.
Fix: Scale the bottleneck, not the callers. Add admission control and backpressure. Cap concurrency at the dependency boundary.
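One concrete form of "cap concurrency at the dependency boundary" is a semaphore in front of the shared dependency. A minimal sketch; the limit of 50 is an assumption you'd derive from what the database actually sustains, and remember that every replica multiplies it:

import asyncio

# Per-process cap: at most 50 in-flight calls reach the shared database,
# no matter how much concurrency the app layer generates.
DB_CONCURRENCY_LIMIT = 50
db_semaphore = asyncio.Semaphore(DB_CONCURRENCY_LIMIT)

async def query_db(run_query):
    # Excess work waits here (visibly, measurably) instead of piling onto
    # the database as extra connections, locks, and queue depth.
    async with db_semaphore:
        return await run_query()

The same idea applies to connection pools: the pool size is a concurrency cap whether you meant it to be or not, so size it against the dependency's total budget, not per replica in isolation.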
3) “But it’s NVMe, it can’t be disk”
Symptoms: iowait spikes, database fsync latency spikes, but the device is “fast.”
Root cause: NVMe can still saturate; also thermal throttling, firmware behavior, or write amplification from checkpoints/compaction.
Fix: Check iostat await/%util, nvme smart-log, and workload patterns (checkpoint/compaction). Separate WAL/log from data if needed.
4) Kubernetes makes stateful workloads “mysteriously” flaky
Symptoms: Only some pods slow; rescheduling changes behavior; nodes differ.
Root cause: Uneven placement, shared node resources, storage path differences, or PV backend variance.
Fix: Topology spread, anti-affinity, dedicated node pools, and explicit storage classes validated under fsync-heavy load tests.
5) Cache invalidation triggers outages
Symptoms: After deploy or cache flush, DB load spikes; timeouts; recovery is slow.
Root cause: Stampede: many callers miss cache simultaneously and hammer origin.
Fix: Request coalescing, stale-while-revalidate, jittered TTLs, and rate limiting to the origin. Never “flush all” during peak.
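Two of those fixes fit in a page of code: coalesce concurrent misses so only one caller refreshes a key, and jitter TTLs so a keyspace never expires all at once. A minimal sketch, assuming an in-process cache; the 300-second TTL and ±20% jitter are arbitrary, and stale-while-revalidate would add a "serve the old value while one worker refreshes" branch on top:

import random
import threading
import time

cache = {}        # key -> (value, expires_at)
locks = {}        # key -> per-key lock used to coalesce concurrent misses
locks_guard = threading.Lock()

def get(key, load_from_origin, ttl=300):
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]
    with locks_guard:
        lock = locks.setdefault(key, threading.Lock())
    with lock:                   # only one caller per key hits the origin
        entry = cache.get(key)   # re-check: someone may have refreshed it already
        if entry and entry[1] > time.time():
            return entry[0]
        value = load_from_origin(key)
        # Jittered TTL: spread expirations so they don't line up after a deploy.
        cache[key] = (value, time.time() + ttl * random.uniform(0.8, 1.2))
        return value

Distributed caches need the distributed version of the same ideas (locks or leases in the cache itself, plus origin rate limits), but the failure mode they prevent is identical.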
6) Backups “work” until you need them
Symptoms: Backup jobs succeed; restore fails or is incomplete; RTO is fantasy.
Root cause: Untested restores, missing app-level verification, or backups that capture inconsistent state.
Fix: Scheduled restore drills and verification queries; document the restore path like it’s a deployment.
7) “We increased timeouts and it stabilized” (temporarily)
Symptoms: Timeouts reduced, but overall latency and resource usage increase, and the next spike is worse.
Root cause: Timeouts were providing a safety valve. Increasing them allowed queues to grow, increasing tail latency and memory usage.
Fix: Keep deadlines realistic. Reduce concurrency, implement backpressure, and prioritize work. Timeouts are not a performance tuning knob.
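The alternative to "just raise the timeout" is one deadline set at the edge and propagated down, with each hop spending only what's left. A minimal sketch; the names and the 50 ms floor are illustrative assumptions:

import time

class DeadlineExceeded(Exception):
    pass

def remaining_budget(deadline, floor=0.05):
    """Seconds this hop may spend, given the caller's overall deadline."""
    remaining = deadline - time.monotonic()
    if remaining <= floor:
        # Not enough time to do useful work downstream: fail fast rather than
        # queue a request whose answer nobody will be waiting for.
        raise DeadlineExceeded()
    return remaining

# Usage sketch: the edge sets one deadline; every hop derives its timeout from it.
deadline = time.monotonic() + 2.0             # the whole request gets 2 seconds
# db_timeout = remaining_budget(deadline)     # pass as the driver/client timeout

With deadlines propagated, "increase the timeout" becomes a deliberate product decision about how long users wait, not a knob someone turns at 3 a.m.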
Checklists / step-by-step plan: build systems that don’t repeat history
Step-by-step: before you adopt the next shiny platform
- Write down the invariants you’re assuming. Example: “Storage latency is stable under X load.” Then test that exact claim.
- Define your SLOs in percentiles. If you only define averages, you’ll engineer for the wrong reality.
- Find the queue at every layer. App thread pools, client pools, broker partitions, kernel queues, storage queues.
- Budget retries. Per endpoint and per dependency. Make retry storms mechanically hard to create.
- Prove restore, not just backup. Restore drills with verification, on a schedule, owned by a team.
- Capacity plan for tails. Plan headroom so that burst periods don’t push utilization into the “latency cliff.”
- Separate workloads by failure domain. Noisy neighbor isn’t a theory; it’s a bill you pay later with interest.
- Prefer predictable performance over peak performance. Stable latency wins more incidents than “up to 1M IOPS.”
Operational checklist: when deploying performance “improvements”
- Can you roll back quickly without data migration?
- Did you test with production-like concurrency and data volume?
- Did you measure p95/p99 and error rates, not only throughput?
- Did you test failure: dependency slow, not just dependency down?
- Did you confirm timeouts and retries are aligned across services?
- Did you verify the cache behavior under invalidation and cold start?
- Did you confirm storage class performance under fsync-heavy load?
Architecture decisions that age well (and why)
- Admission control: It prevents overload collapse. If the dependency can only handle N, enforce N (see the sketch after this list).
- Isolation for state: Dedicated nodes/disks for databases and queues. Shared fate is real.
- Explicit budgets: Retry budgets, error budgets, capacity budgets. They convert “hope” into numbers.
- Runbooks that name the queue: “Check iostat await” is better than “investigate performance.”
- Load tests that include background work: Compaction, checkpointing, vacuuming, backups. The stuff that actually happens.
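Admission control sounds grand, but the smallest useful version is "check the queue before accepting work." A minimal sketch; the depth limit of 100 is an assumption, and the honest version derives it from measured capacity and your latency SLO:

import queue

MAX_QUEUE_DEPTH = 100     # assumption: beyond this, waiting only adds latency
work_queue = queue.Queue()

class Overloaded(Exception):
    pass

def admit(job):
    # Reject early and cheaply instead of letting the queue (and latency) grow
    # without bound. Callers get a fast, explicit signal they can back off on.
    if work_queue.qsize() >= MAX_QUEUE_DEPTH:
        raise Overloaded("shedding load: queue depth over limit")
    work_queue.put(job)

Pair it with a degraded-mode response (cached data, partial results, a polite error) so "we shed load" is a product behavior, not just an exception.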
FAQ
1) Why do the same outages keep happening even with better tooling?
Because tools don’t change incentives. Teams still ship features faster than they build understanding, and complexity still hides queues. Better tooling can even raise confidence while lowering comprehension.
2) What’s the single biggest reliability trap in 2026?
Assuming “managed” implies predictable performance and clear failure modes. Managed services can be excellent, but their limits are often softer, multi-tenant, and harder to see.
3) Is p99 always the right metric?
No, but it’s usually closer to user pain than averages. Use p50 to understand typical behavior, p95/p99 to understand queuing, and error rate to understand correctness. If you have strict deadlines (payments, search, auth), the tail is your product.
4) How do I tell if it’s storage or the database?
Check both: node-level iowait/iostat for device saturation, and database wait events/locks for contention. Storage saturation often shows high await/%util; DB contention shows lock waits and long transactions. They can also compound each other.
5) Why does adding replicas or pods sometimes increase latency?
Because it increases contention on shared dependencies and increases coordination overhead. More callers can mean more retries, more locks, more cache churn, and more background work. Scale the bottleneck, and cap the callers.
6) Are retries bad?
No. Unbounded retries are bad. Use exponential backoff with jitter, cap attempts, and design a retry budget so that a slow dependency doesn’t get hammered into failure.
7) What’s a quick way to spot a hidden queue?
Look for rising p99 with stable p50, rising iowait, rising DB waiters, or rising retransmits. Then find where the wait time accumulates: thread pool, connection pool, disk queue, or lock.
8) How should I capacity plan in a world of bursty workloads and autoscaling?
Plan for the slowest dependency, not the easiest tier to scale. Autoscaling helps CPU-bound stateless tiers. It doesn’t magically scale your database locks, your disk latency, or your shared network path.
9) What’s the most underrated “boring” practice?
Restore drills with verification. Not “we have snapshots,” but “we restored, ran integrity checks, and timed it.” It turns existential risk into a predictable procedure.
Conclusion: next steps you can do this week
The industry in 2026 isn’t short on innovation. It’s short on humility about old constraints. The shiny packaging is real—better products, faster hardware, smarter automation—but the physics didn’t sign up for your rebrand.
Do three things this week:
- Write a “where is the queue?” runbook for your top two user journeys. Include the exact commands you’ll run (like the ones above) and what decisions they drive.
- Set one SLO that forces you to see tails (p95/p99) and tie it to a dependency budget (retries, concurrency, or capacity headroom).
- Run a restore drill that includes application verification, not just data restoration. Time it. Record the steps. Fix the parts that were “tribal knowledge.”
If you do that, you’ll still have incidents. Everyone does. But you’ll stop paying for the same ones twice—once in downtime, and again in surprise.