It's 03:17. Your dashboard says "green," your pager says "absolutely not," and your CEO says "but we're on the premium cloud tier." Meanwhile your users are watching spinners like it's a nostalgic hobby.
By 2026 we've added AI copilots, "serverless everything," NVMe everywhere, service meshes, eBPF, and enough managed offerings to wallpaper an aircraft hangar. Yet the outages still rhyme with the ones from 2006: overloaded queues, mismatched assumptions, capacity denial, and the kind of "optimization" that looks brilliant until the first real incident.
Why 2026 feels familiar: the same physics, new marketing
Most "new" reliability failures are old failures with a different costume. The core constraints didn't change:
- Queuing theory still runs your datacenter. When utilization approaches 100%, latency explodes. Not linearly. Explosively.
- Tail latency still dominates user experience. Your median can be gorgeous while your p99 is actively on fire.
- Storage is still the shared fate of stateful systems. Whether it's a managed database, a Kubernetes PersistentVolume, or a homegrown ZFS pool, you can't "microservice" your way out of I/O.
- Networks are still lossy and variable. You can wrap them in gRPC, QUIC, meshes, and policies, but physics still has the last word.
- Humans still assume. And assumptions still break at 2x scale, 10x load, or at exactly the wrong moment.
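If you want to feel the first bullet rather than take it on faith, the math is short. Here's a minimal sketch (illustrative numbers, and the simplest M/M/1 queue assumptions) of what utilization does to response time:

```python
# Illustrative M/M/1 sketch of the "latency explodes near 100%" claim:
# mean response time is W = 1 / (mu - lambda), where mu is the service rate
# and lambda is the arrival rate. All numbers here are made up for clarity.

def mean_response_time_ms(service_rate_per_s: float, utilization: float) -> float:
    """M/M/1 mean response time (ms) at a given utilization (0 < rho < 1)."""
    arrival_rate = service_rate_per_s * utilization
    return 1000.0 / (service_rate_per_s - arrival_rate)

# A server that handles 1000 req/s with no queue in front of it:
for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"utilization {rho:.0%}: {mean_response_time_ms(1000.0, rho):.1f} ms")
```

Going from 50% to 99% utilization is not a 2x latency change; it's a 50x change. That cliff is why "we still have CPU headroom" and "latency is fine" are different claims.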
The "2026 twist" is that complexity is now sold as safety. If you use a managed service, surely someone else solved reliability for you. If you add another abstraction layer, surely the system gets simpler. If the dashboard is green, surely production is fine. None of those are guaranteed. They're just ways to outsource understanding.
Here's a paraphrased idea often attributed to Richard Cook: complex systems fail in complex ways, and understanding normal work is key to understanding failures.
It's not poetic. It's operational advice: stop only studying outages; study how things succeed on a normal Tuesday. That's where your real safety margins are hiding.
One rule for 2026: if you can't explain where the queue is, you don't know why the latency is high.
Nine facts and historical context points (that explain today)
- RAID didn't disappear; it moved. Many "cloud disks" are still built on some form of mirroring/erasure coding plus caching, just not your server's RAID card.
- IOPS marketing has misled teams for decades. A million 4k reads isn't the same as a thousand 1MB writes, and "up to" performance is not a promise.
- Write amplification is older than SSDs. Filesystems, databases, and log-structured designs have always traded sequential writes for compaction work later.
- The fallacy of "stateless fixes stateful pain" keeps recurring. People split services while keeping one shared database, then wonder why the database is the new monolith.
- Tail latency became a mainstream topic in the 2010s. Before that, many teams measured averages and were surprised by user complaints. Now we measure p99 and are surprised anyway, because we still don't manage queues.
- Containers didn't eliminate noisy neighbors; they industrialized them. cgroups constrain CPU and memory, but shared storage and shared network paths still couple workloads.
- "Eventually consistent" systems still have consistency costs. Retries, reconciliation, and compaction can shift pain into the background until it becomes foreground.
- NVMe reduced latency but increased speed of failure. When a system can generate load faster, it can also overwhelm downstream dependencies faster.
- Incident retrospectives are often too polite. The real root cause is frequently "we believed a property that was never true," but it gets written up as "transient latency."
The shiny packaging: where old failure modes hide now
1) "Managed" does not mean "bounded"
Managed databases and managed disks are fantastic until you treat them like magic. They still have:
- IO credit systems and burst behavior
- maintenance windows and background compactions
- multi-tenant contention
- limits you won't notice until you hit them
In 2026, the most common managed-service outage pattern I see is soft throttling. No obvious 5xx spike. No dramatic "down." Just a slow slide into timeouts as latency increases and retries multiply. The service is "up." Your SLO is not.
2) Observability got better, and also worse
We have more telemetry than ever: distributed traces, profiles, eBPF-based network visibility, storage metrics, queue depth, saturation signals. And yet teams still fail because their observability is uncurated.
If your on-call has to mentally join 14 dashboards and translate six different percentiles into one decision, you don't have observability. You have a museum.
3) The cache is still the most dangerous "performance feature"
Caches are good. Overconfidence in caches is not. In 2026 the classic failures still happen:
- cache stampedes after invalidation
- evictions during a traffic spike
- hot key amplification from a single popular object
- cache warming jobs that quietly DDoS the database
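The stampede item deserves code, because the fix is simple and almost nobody ships it. Here's a hedged sketch of "single-flight" request coalescing: on a miss, exactly one caller goes to the origin and everyone else reuses the result. The loader callable is a stand-in for whatever your origin is (a hypothetical example, not any specific library's API):

```python
# Single-flight cache sketch: a per-key lock guarantees that a cache miss
# sends exactly one request to the origin, no matter how many callers pile in.
# The loader below stands in for a real origin fetch (DB query, HTTP call).

import threading

class SingleFlightCache:
    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()

    def get(self, key):
        if key in self._values:          # fast path: cache hit
            return self._values[key]
        with self._guard:                # one lock object per key
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:                       # only one thread per key recomputes
            if key not in self._values:
                self._values[key] = self._loader(key)
        return self._values[key]

origin_calls = []
cache = SingleFlightCache(lambda k: origin_calls.append(k) or f"value:{k}")
workers = [threading.Thread(target=cache.get, args=("hot",)) for _ in range(8)]
for w in workers: w.start()
for w in workers: w.join()
print(len(origin_calls))  # 1: eight concurrent misses, one origin hit
```

In production you'd add TTLs with jitter and a bound on how long waiters block, but the core idea fits in twenty lines.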
Joke #1: A cache is like a shared office fridge: everyone loves it until someone puts fish in it on Friday afternoon.
4) "Just add retries" remains a reliability anti-pattern
Retries are a tax you pay for uncertainty. They can be lifesaving if bounded and jittered. They can also be the reason your outage becomes a catastrophe. In 2026, retries are often hidden inside SDKs, service meshes, and client libraries, so you think you have one request, but your dependencies see three.
Retry budgets and circuit breakers aren't "nice-to-have." They're the difference between a transient blip and a total cascade.
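Here's what "bounded and jittered" plus a budget might look like; this is a sketch under stated assumptions (names, thresholds, and the TimeoutError signal are illustrative), not any particular SDK's retry API:

```python
# Bounded, jittered retries with a shared retry budget. The budget is the
# piece most teams skip: it caps retries across all requests, so a slow
# dependency cannot be hammered into total failure. Illustrative values.

import random

class RetryBudget:
    """Process-wide cap on total retries; shared across requests."""
    def __init__(self, max_retries: int):
        self.remaining = max_retries

def backoff_seconds(attempt: int, base: float = 0.05, cap: float = 2.0) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

def call_with_retries(op, budget: RetryBudget, max_attempts: int = 3):
    for attempt in range(max_attempts):
        try:
            return op()
        except TimeoutError:
            if attempt + 1 == max_attempts or budget.remaining <= 0:
                raise                      # out of attempts or budget: fail fast
            budget.remaining -= 1
            _ = backoff_seconds(attempt)   # real code would time.sleep() this
    raise TimeoutError                     # unreachable; keeps the contract explicit
```

When the budget is empty, failures surface immediately instead of multiplying. That's the difference between one dependency having a bad minute and your whole mesh having a bad hour.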
5) Storage is still where performance optimism goes to die
Stateful systems haven't become less important. They've become more centralized. You might have 200 microservices, but you still have:
- a few databases that matter
- a message bus that matters
- an object store everyone leans on
- a persistent volume layer that quietly determines your p99
When these saturate, the rest of your architecture becomes decorative.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
They migrated a customer-facing API from VMs to Kubernetes. Same app, same database, "more resilient platform." The new cluster used a CSI driver backed by networked block storage. Nobody panicked, because the vendor's datasheet said the volume type was "high performance."
The wrong assumption was subtle: the team believed storage latency would be stable as long as average IOPS stayed under the published limit. But their workload wasn't average. It was spiky writes, fsync-heavy, and sensitive to tail latency. During peak, the block device hit intermittent latency spikes: single-digit milliseconds jumping to hundreds. Not constant, just enough to break request deadlines.
The first symptom wasn't storage alarms. It was application timeouts and a surge in retries. That doubled the write rate, which increased queue depth at the storage layer, which increased latency further. The system "self-amplified" into an outage without ever truly going down.
When they finally pulled node-level metrics, they saw it: iowait rising, but only on a subset of nodes. The scheduling spread was uneven; a few nodes had multiple stateful pods pinned close together, hammering the same underlying storage path.
The fix was unglamorous: set explicit pod anti-affinity for stateful workloads, tune client timeouts with jittered retries, and switch the database volume to a class with predictable latency (and pay for it). Most importantly, they updated their runbook: "Published IOPS is not a latency guarantee."
Mini-story 2: The optimization that backfired
A platform team wanted to cut cloud spend. They noticed their database instances were underutilized on CPU. So they downsized CPU and memory, expecting the same throughput. It worked in staging. It even worked for a week in production.
Then the monthly billing cycle hit, along with the usual reporting jobs. The database started checkpointing more aggressively. With less memory, the buffer cache churned. With fewer CPUs, background maintenance competed harder with foreground queries. Latency went up.
Someone "optimized" the application by increasing connection pool size to reduce perceived queuing. That increased concurrency at the database, which increased lock contention and write amplification. The p50 improved for a few endpoints. The p99 got wrecked.
The backfire wasn't just performance. It was operational clarity. The system became unstable in a way that made every dashboard look "sort of plausible." CPU wasn't pegged. Disk wasn't pegged. Network wasn't pegged. Everything was just... slower.
They rolled back the downsizing, reduced pool sizes, and introduced admission control: if the queue is growing, shed load or degrade features rather than "pushing through." The lesson landed hard: when you optimize for averages, you usually pay in tails.
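Admission control doesn't need to be clever to work. A toy sketch of the rule "don't enqueue what you can't serve within the deadline" (the service rate and deadline below are illustrative assumptions, not their real numbers):

```python
# Admission control sketch: cap queue depth at roughly service_rate * deadline
# (a Little's-law style bound). Beyond that depth, every queued request is
# already doomed to miss its deadline, so shed it instead. Numbers are made up.

from collections import deque

class AdmissionController:
    def __init__(self, service_rate_per_s: float, deadline_s: float):
        self.queue = deque()
        self.max_depth = int(service_rate_per_s * deadline_s)

    def submit(self, request) -> bool:
        if len(self.queue) >= self.max_depth:
            return False                  # shed load / serve a degraded response
        self.queue.append(request)
        return True

# 80 req/s drain rate, 250 ms deadline => depth cap of 20.
ac = AdmissionController(service_rate_per_s=80.0, deadline_s=0.25)
accepted = sum(ac.submit(i) for i in range(50))
print(accepted)  # 20 accepted; the other 30 are shed instead of queued
```

Rejecting early looks brutal on a dashboard and feels wonderful during an incident: the requests you do accept actually finish.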
Mini-story 3: The boring but correct practice that saved the day
A mid-sized SaaS company had a reputation for being "too cautious." They did quarterly restore drills. Not a tabletop exercise: actual restores into an isolated environment, plus application-level verification. People teased them about it until the day it mattered.
An engineer ran a routine storage expansion on a ZFS-backed database host. The change was correct, but a separate, unrelated issue existed: one drive in the mirror was quietly throwing errors that hadn't crossed the alert threshold. During resilver, error rates increased. The pool degraded. The database started logging I/O errors.
They declared an incident early, before data corruption symptoms reached customers. Because the team had a practiced restore process, the decision wasn't "can we restore?" It was "which restore path is fastest and safest?" They failed over to a warm replica, quarantined the degraded pool, and restored the affected datasets to fresh storage for verification.
The post-incident review wasn't dramatic. It was almost boring. And that's the point: the "boring" practice of restore drills converted a potential existential event into a controlled exercise with a few hours of degraded performance.
Joke #2: Nothing builds team unity like a restore drill, except an untested restore during an outage, which builds unity in the way a shipwreck builds teamwork.
Fast diagnosis playbook: find the bottleneck in minutes
This is the playbook I wish more orgs printed and taped near the on-call desk. Not because it's fancy, but because it's ordered. The order matters.
First: confirm you have a latency problem, not a correctness problem
- Are requests timing out or returning errors?
- Is p95/p99 rising while p50 stays flat? That's a queue or contention signal.
- Is only one endpoint failing? That's likely a dependency or lock hotspot.
Second: identify the constrained resource using the "USE" lens
For each layer (app, node, storage, network, database), check:
- Utilization: near the limit?
- Saturation: queue depth, wait time, backpressure?
- Errors: timeouts, retransmits, disk errors, throttling?
Third: hunt for the queue
The queue is rarely where you want it to be. In 2026 it often hides in:
- client connection pools
- service mesh retries
- thread pools and async executors
- kernel block layer (avgqu-sz)
- storage controller queues
- database lock waits
- message broker partitions
Fourth: stop the bleeding with bounded actions
Pick actions that are reversible and predictable:
- reduce concurrency (connection pool, worker threads)
- disable non-essential jobs (analytics, batch imports)
- increase timeouts only if you also reduce retries
- route traffic away from a hot shard/zone
- temporarily shed load with feature flags
Fifth: fix the cause, not the symptom
After stability: remove the hidden coupling. If two workloads contend, isolate them. If storage latency is variable, buy predictability or change access patterns. If retries create storms, implement budgets and backoff.
Practical tasks: commands, outputs, and decisions (12+)
These are not "toy" commands. They're what you run when production is slow and you need answers without writing a thesis. Each task includes: the command, what the output means, and the decision you make from it.
Task 1: Confirm load average vs CPU pressure
cr0x@server:~$ uptime
03:22:18 up 41 days, 7:10, 2 users, load average: 18.42, 16.90, 12.11
Meaning: A high load average can mean CPU-runnable tasks or tasks blocked on I/O. Load alone doesn't tell you which.
Decision: Don't scale CPU blindly. Next, check iowait and the run queue.
Task 2: Check CPU iowait and steal time
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (db-01) 02/02/2026 _x86_64_ (16 CPU)
03:22:31 CPU %usr %nice %sys %iowait %irq %soft %steal %idle
03:22:32 all 12.10 0.00 4.20 38.50 0.00 0.40 0.10 44.70
03:22:33 all 11.80 0.00 4.10 40.20 0.00 0.30 0.10 43.50
03:22:34 all 13.00 0.00 4.50 39.70 0.00 0.30 0.10 42.40
Meaning: ~40% iowait is screaming "storage latency or saturation." Low steal suggests it's not the hypervisor stealing CPU.
Decision: Move quickly to disk queue/latency; scaling CPU won't fix this.
Task 3: Identify which block device is hurting
cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (db-01) 02/02/2026 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.4 0.0 4.3 39.5 0.1 43.7
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %util await svctm
nvme0n1 120.0 880.0 6144.0 32768.0 0.0 120.0 98.5 42.30 0.85
Meaning: %util ~98% plus await ~42ms indicates the device is saturated or experiencing high latency. svctm is not reliable on modern kernels; focus on await and util.
Decision: Confirm queue depth and whether latency is workload-driven or device throttling.
Task 4: Check NVMe health and media errors
cr0x@server:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0x00
temperature : 43 C
available_spare : 100%
percentage_used : 2%
media_errors : 0
num_err_log_entries : 0
warning_temp_time : 0
critical_comp_time : 0
Meaning: No media errors, low wear. This points away from "disk failing" and toward "disk overloaded" or "latency elsewhere in the stack."
Decision: Look for filesystem/dataset behavior, sync writes, and database checkpoints.
Task 5: Find the top I/O consumers by process
cr0x@server:~$ sudo iotop -oPa
Total DISK READ: 6.01 M/s | Total DISK WRITE: 33.12 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
2411 be/4 postgres 1.20 M/s 18.50 M/s 0.00 % 62.10 % postgres: checkpointer
2388 be/4 postgres 0.80 M/s 9.10 M/s 0.00 % 31.40 % postgres: wal writer
3102 be/4 app 0.20 M/s 2.10 M/s 0.00 % 7.20 % java -jar api.jar
Meaning: Checkpointer and WAL writer dominating suggests write pressure and checkpoint activity, often tied to memory pressure or write-heavy bursts.
Decision: Reduce write bursts (batching), tune checkpoint settings, or add IOPS/latency headroom. Consider isolating WAL to separate device.
Task 6: Inspect filesystem space and inode pressure
cr0x@server:~$ df -h /var/lib/postgresql
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 900G 812G 42G 96% /var/lib/postgresql
Meaning: 96% full is a performance smell on some filesystems and a risk for sudden failure modes (autovacuum bloat, logs, temp files).
Decision: Free space now; then implement capacity alerts with "days to full," not just percentage used.
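"Days to full" is a one-liner, which is exactly why there's no excuse for alerting only on percentages. A sketch; the growth rate would come from your own df history, and these numbers just mirror the output above:

```python
# "Days to full" from capacity and growth rate. Percentage-used alerts fire
# too late on big volumes and too often on small ones; time-to-full doesn't.

def days_to_full(size_gb: float, used_gb: float, growth_gb_per_day: float) -> float:
    if growth_gb_per_day <= 0:
        return float("inf")              # flat or shrinking: no ETA
    return (size_gb - used_gb) / growth_gb_per_day

# The volume above: 900G total, 812G used; assume ~6 GB/day of growth.
print(f"{days_to_full(900.0, 812.0, 6.0):.1f} days to full")
```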
Task 7: Check for kernel-level block queue pressure
cr0x@server:~$ cat /proc/diskstats | egrep 'nvme0n1 '
259 0 nvme0n1 112233 0 987654 12000 998877 0 776655 54000 0 62000 66000 0 0 0 0
Meaning: Diskstats alone is raw. You use it to correlate with iostat, or to confirm I/O is happening when higher-level metrics lie.
Decision: If the app says "idle" but diskstats increments fast, your observability is missing something (sidecars, host processes, backup jobs).
Task 8: Identify TCP retransmits and drops (network-induced latency)
cr0x@server:~$ ss -s
Total: 1542 (kernel 0)
TCP: 1123 (estab 980, closed 77, orphaned 0, timewait 25)
Transport Total IP IPv6
RAW 0 0 0
UDP 12 10 2
TCP 1046 1002 44
INET 1058 1012 46
FRAG 0 0 0
Meaning: This is just connection volume. Useful for spotting runaway connection counts or TIMEWAIT storms.
Decision: If estab explodes, suspect connection pooling misconfig or load balancer behavior; move to retransmit metrics next.
cr0x@server:~$ netstat -s | egrep -i 'retrans|segments retransmited|listen|reset'
3420 segments retransmited
217 resets received for embryonic SYN_RECV sockets
Meaning: Retransmits correlate strongly with tail latency. SYN_RECV resets can indicate overload or SYN floods, but also aggressive timeouts.
Decision: If retransmits are rising during the incident, treat network as a bottleneck contributor, not just âthe thing between services.â
Task 9: Kubernetes: locate uneven pod placement (noisy neighbor by scheduling)
cr0x@server:~$ kubectl get pods -n prod -o wide | egrep 'db|cache'
db-0 1/1 Running 0 14d 10.2.3.14 node-a
db-1 1/1 Running 0 14d 10.2.5.21 node-c
cache-0 1/1 Running 0 14d 10.2.3.51 node-a
cache-1 1/1 Running 0 14d 10.2.3.52 node-a
Meaning: Multiple heavy stateful pods on node-a. That's a classic "we didn't mean to pin them together" problem.
Decision: Add pod anti-affinity and/or topology spread constraints. Consider node taints for storage-heavy workloads.
Task 10: Kubernetes: inspect persistent volume class and access mode
cr0x@server:~$ kubectl get pvc -n prod db-data-db-0 -o yaml | egrep 'storageClassName|accessModes|resources'
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 500Gi
storageClassName: standard
Meaning: "standard" often implies a shared backend with bursty behavior. ReadWriteOnce also means failover and rescheduling have constraints.
Decision: If p99 is a business requirement, use a storage class designed for predictable latency, and test it under realistic fsync workloads.
Task 11: Database: find lock waits that look like "random slowness"
cr0x@server:~$ psql -U postgres -d appdb -c "select now(), wait_event_type, wait_event, count(*) from pg_stat_activity where wait_event is not null group by 1,2,3 order by 4 desc;"
now | wait_event_type | wait_event | count
------------------------------+-----------------+---------------+-------
2026-02-02 03:25:10.112+00 | Lock | transactionid | 37
2026-02-02 03:25:10.112+00 | IO | DataFileRead | 12
Meaning: Many sessions waiting on transactionid locks indicates contention (long transactions, hot rows). Some I/O waits also present.
Decision: Kill/mitigate the long transaction if safe; reduce concurrency; fix the hot-row pattern. Don't just throw storage at lock contention.
Task 12: Database: check checkpoint pressure (PostgreSQL example)
cr0x@server:~$ psql -U postgres -d appdb -c "select checkpoints_timed, checkpoints_req, buffers_checkpoint, checkpoint_write_time, checkpoint_sync_time from pg_stat_bgwriter;"
checkpoints_timed | checkpoints_req | buffers_checkpoint | checkpoint_write_time | checkpoint_sync_time
------------------+-----------------+--------------------+-----------------------+----------------------
1200 | 980 | 98765432 | 83423456 | 1223344
Meaning: High requested checkpoints (checkpoints_req) can mean the system is forcing checkpoints due to WAL volume, which can spike I/O.
Decision: Tune checkpoint settings, reduce write spikes, and ensure WAL and data are on storage that can sustain the write rate.
Task 13: ZFS: detect pool health and read/write latency hints
cr0x@server:~$ sudo zpool status
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Replace the device and restore the pool from backup if necessary.
scan: scrub repaired 0B in 02:31:12 with 0 errors on Sun Feb 1 03:10:01 2026
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
sda ONLINE 0 0 0
sdb UNAVAIL 5 12 0
Meaning: DEGRADED plus an unavailable member is not "we'll handle it later." It changes performance and risk immediately.
Decision: Replace the device now, and avoid heavy maintenance (like big compactions) until redundancy is restored.
Task 14: Linux memory pressure (cache vs reclaim thrash)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 64Gi 52Gi 1.2Gi 1.1Gi 11Gi 3.0Gi
Swap: 0B 0B 0B
Meaning: Only 3Gi available suggests reclaim pressure; storage performance can degrade when the kernel is constantly evicting and refaulting pages.
Decision: Add memory headroom, reduce cache churn, or isolate memory-heavy sidecars. Avoid "we have free in buff/cache so we're fine" arguments during incidents.
Task 15: Spot runaway retries from a client side (rate + error)
cr0x@server:~$ sudo journalctl -u api.service -n 20 --no-pager
Feb 02 03:24:51 api-01 api[3102]: WARN upstream timeout talking to payments, retry=1
Feb 02 03:24:51 api-01 api[3102]: WARN upstream timeout talking to payments, retry=2
Feb 02 03:24:52 api-01 api[3102]: WARN upstream timeout talking to payments, retry=3
Feb 02 03:24:52 api-01 api[3102]: ERROR request failed after retries
Meaning: Retries are happening fast and likely aligned across requests. That can become a retry storm.
Decision: Add jittered exponential backoff, cap retries, and consider circuit breaking to preserve the dependency and your own queue.
Common mistakes: symptoms → root cause → fix
1) "Everything is slower" but CPU is fine
Symptoms: p99 climbs, CPU looks modest, dashboards show nothing pegged.
Root cause: Hidden queue: storage await, lock waits, network retransmits, or connection pool saturation.
Fix: Measure saturation directly (iostat await/%util, DB wait events, retransmits). Reduce concurrency and retries; isolate hot components.
2) Scaling the app makes it worse
Symptoms: More pods/instances increases error rate and latency.
Root cause: A shared dependency is saturated (DB, cache, disk). More clients create more contention and queueing.
Fix: Scale the bottleneck, not the callers. Add admission control and backpressure. Cap concurrency at the dependency boundary.
3) "But it's NVMe, it can't be disk"
Symptoms: iowait spikes, database fsync latency spikes, but the device is "fast."
Root cause: NVMe can still saturate; also thermal throttling, firmware behavior, or write amplification from checkpoints/compaction.
Fix: Check iostat await/%util, nvme smart-log, and workload patterns (checkpoint/compaction). Separate WAL/log from data if needed.
4) Kubernetes makes stateful workloads "mysteriously" flaky
Symptoms: Only some pods slow; rescheduling changes behavior; nodes differ.
Root cause: Uneven placement, shared node resources, storage path differences, or PV backend variance.
Fix: Topology spread, anti-affinity, dedicated node pools, and explicit storage classes validated under fsync-heavy load tests.
5) Cache invalidation triggers outages
Symptoms: After deploy or cache flush, DB load spikes; timeouts; recovery is slow.
Root cause: Stampede: many callers miss cache simultaneously and hammer origin.
Fix: Request coalescing, stale-while-revalidate, jittered TTLs, and rate limiting to the origin. Never "flush all" during peak.
6) Backups "work" until you need them
Symptoms: Backup jobs succeed; restore fails or is incomplete; RTO is fantasy.
Root cause: Untested restores, missing app-level verification, or backups that capture inconsistent state.
Fix: Scheduled restore drills and verification queries; document the restore path like itâs a deployment.
7) "We increased timeouts and it stabilized" (temporarily)
Symptoms: Timeouts reduced, but overall latency and resource usage increase, and the next spike is worse.
Root cause: Timeouts were providing a safety valve. Increasing them allowed queues to grow, increasing tail latency and memory usage.
Fix: Keep deadlines realistic. Reduce concurrency, implement backpressure, and prioritize work. Timeouts are not a performance tuning knob.
Checklists / step-by-step plan: build systems that don't repeat history
Step-by-step: before you adopt the next shiny platform
- Write down the invariants you're assuming. Example: "Storage latency is stable under X load." Then test that exact claim.
- Define your SLOs in percentiles. If you only define averages, you'll engineer for the wrong reality.
- Find the queue at every layer. App thread pools, client pools, broker partitions, kernel queues, storage queues.
- Budget retries. Per endpoint and per dependency. Make retry storms mechanically hard to create.
- Prove restore, not just backup. Restore drills with verification, on a schedule, owned by a team.
- Capacity plan for tails. Plan headroom so that burst periods don't push utilization into the "latency cliff."
- Separate workloads by failure domain. Noisy neighbor isn't a theory; it's a bill you pay later with interest.
- Prefer predictable performance over peak performance. Stable latency wins more incidents than "up to 1M IOPS."
Operational checklist: when deploying performance "improvements"
- Can you roll back quickly without data migration?
- Did you test with production-like concurrency and data volume?
- Did you measure p95/p99 and error rates, not only throughput?
- Did you test failure: dependency slow, not just dependency down?
- Did you confirm timeouts and retries are aligned across services?
- Did you verify the cache behavior under invalidation and cold start?
- Did you confirm storage class performance under fsync-heavy load?
Architecture decisions that age well (and why)
- Admission control: It prevents overload collapse. If the dependency can only handle N, enforce N.
- Isolation for state: Dedicated nodes/disks for databases and queues. Shared fate is real.
- Explicit budgets: Retry budgets, error budgets, capacity budgets. They convert "hope" into numbers.
- Runbooks that name the queue: "Check iostat await" is better than "investigate performance."
- Load tests that include background work: Compaction, checkpointing, vacuuming, backups. The stuff that actually happens.
FAQ
1) Why do the same outages keep happening even with better tooling?
Because tools don't change incentives. Teams still ship features faster than they build understanding, and complexity still hides queues. Better tooling can even raise confidence while lowering comprehension.
2) What's the single biggest reliability trap in 2026?
Assuming "managed" implies predictable performance and clear failure modes. Managed services can be excellent, but their limits are often softer, multi-tenant, and harder to see.
3) Is p99 always the right metric?
No, but itâs usually closer to user pain than averages. Use p50 to understand typical behavior, p95/p99 to understand queuing, and error rate to understand correctness. If you have strict deadlines (payments, search, auth), the tail is your product.
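The arithmetic behind that answer is worth internalizing. A minimal nearest-rank percentile over fabricated latencies shows how a handful of slow requests vanish from the median and dominate the tail:

```python
# Nearest-rank percentile: the smallest sample such that at least p percent
# of samples are <= it. Latency values below are fabricated for illustration.

import math

def percentile(samples, p):
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies_ms = [20] * 97 + [900, 950, 1000]  # 97 fast requests, 3 awful ones
print(percentile(latencies_ms, 50))  # 20: the median says everything is fine
print(percentile(latencies_ms, 99))  # 950: the tail says otherwise
```

Three bad requests out of a hundred don't move p50 at all. If those three are your biggest customer's checkout, the median is lying to you.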
4) How do I tell if itâs storage or the database?
Check both: node-level iowait/iostat for device saturation, and database wait events/locks for contention. Storage saturation often shows high await/%util; DB contention shows lock waits and long transactions. They can also compound each other.
5) Why does adding replicas or pods sometimes increase latency?
Because it increases contention on shared dependencies and increases coordination overhead. More callers can mean more retries, more locks, more cache churn, and more background work. Scale the bottleneck, and cap the callers.
6) Are retries bad?
No. Unbounded retries are bad. Use exponential backoff with jitter, cap attempts, and design a retry budget so that a slow dependency doesn't get hammered into failure.
7) What's a quick way to spot a hidden queue?
Look for rising p99 with stable p50, rising iowait, rising DB waiters, or rising retransmits. Then find where the wait time accumulates: thread pool, connection pool, disk queue, or lock.
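Little's law (L = λ × W) makes this hunt quantitative: arrival rate times time-in-system equals the average number of requests sitting inside the system. Strictly it applies to means, so treat the tail version below as a rough heuristic, not a theorem; the numbers are illustrative:

```python
# Little's law as a queue detector: L = lambda * W. If the implied number of
# in-flight requests is far larger than your concurrency limits can explain,
# something is queueing. (Rigorous only for means; a heuristic for tails.)

def in_flight(arrival_rate_per_s: float, time_in_system_s: float) -> float:
    return arrival_rate_per_s * time_in_system_s

# 500 req/s at a 40 ms median implies ~20 requests in flight.
print(in_flight(500.0, 0.040))
# If slow requests spend 2 s in the system, that implies ~1000 in flight
# somewhere: a thread pool, a connection pool, a disk queue, or a lock.
print(in_flight(500.0, 2.0))
```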
8) How should I capacity plan in a world of bursty workloads and autoscaling?
Plan for the slowest dependency, not the easiest tier to scale. Autoscaling helps CPU-bound stateless tiers. It doesn't magically scale your database locks, your disk latency, or your shared network path.
9) What's the most underrated "boring" practice?
Restore drills with verification. Not "we have snapshots," but "we restored, ran integrity checks, and timed it." It turns existential risk into a predictable procedure.
Conclusion: next steps you can do this week
The industry in 2026 isn't short on innovation. It's short on humility about old constraints. The shiny packaging is real: better products, faster hardware, smarter automation. But the physics didn't sign up for your rebrand.
Do three things this week:
- Write a "where is the queue?" runbook for your top two user journeys. Include the exact commands you'll run (like the ones above) and what decisions they drive.
- Set one SLO that forces you to see tails (p95/p99) and tie it to a dependency budget (retries, concurrency, or capacity headroom).
- Run a restore drill that includes application verification, not just data restoration. Time it. Record the steps. Fix the parts that were "tribal knowledge."
If you do that, you'll still have incidents. Everyone does. But you'll stop paying for the same ones twice: once in downtime, and again in surprise.