The worst alerting system is a human being who pays you. It arrives late, incomplete, and emotionally formatted.
“Your site is broken” is not a symptom. It’s an indictment.
Running production with no monitoring is like flying with no instruments and calling it “agile.”
You can do it right up until the moment you can’t, and then you discover the true meaning of “mean time to innocence.”
What “no monitoring” really means (and why it happens)
“No monitoring” rarely means literally nothing. It usually means no monitoring that helps during an incident.
There might be dashboards nobody looks at, logs nobody can query, and a status page updated by someone shouting across the room.
In practice, “no monitoring” is one of these failure modes:
- No external signal: nothing checks the service from the outside, so internal success lies to you.
- No alert routing: metrics exist, but nobody gets paged, or alerts go to an inbox that died in 2019.
- No golden signals: you collect CPU and RAM, but not latency, errors, saturation, or traffic.
- No correlation: you have metrics, logs, and traces, but they don’t share identifiers, time sync, or retention.
- No ownership: “platform team” is an email alias, not a team; SRE is “whoever has the most context.”
- No budgets: the cheapest monitoring is “we’ll add it later,” which is also the most expensive.
The reason it happens is rarely ignorance. It’s incentives. Shipping features is visible. Preventing outages is invisible.
Until the day it’s extremely visible.
Here’s the hard truth: if you can’t answer “is it down?” in under 60 seconds, you don’t have monitoring.
You have vibes.
Facts and history that explain today’s mess
Observability didn’t appear because engineers love graphs. It appeared because production kept lighting itself on fire, and humans got tired.
A few concrete context points that matter:
- SNMP (late 1980s) made it possible to poll network devices at scale, but it also normalized “polling is monitoring” even when apps were the problem.
- Nagios-era checks (2000s) popularized “is it up?” plugins and page-on-fail patterns; great for hosts, mediocre for distributed systems.
- The Google SRE book (mid-2010s) pushed SLOs and error budgets into mainstream ops, reframing reliability as a product feature, not a hobby.
- Microservices (2010s) multiplied failure modes: one user request now touches 10–100 components, turning “logs on the box” into archaeology.
- Time-series databases and cheap storage made it feasible to keep high-resolution metrics, but also enabled the anti-pattern of collecting everything and understanding nothing.
- Container orchestration changed “server down” into “pod rescheduled,” which is nice, until you realize rescheduling can hide a crash loop for weeks.
- Major cloud outages taught companies that “the cloud is someone else’s computer” also means “someone else’s outage domain.” You still need instrumentation.
- Modern incident culture moved from blame to systems thinking—at least in companies that want to keep engineers employed longer than one on-call rotation.
One paraphrased idea, because it stays true across decades: “Hope is not a strategy.”
— attributed to Gen. Gordon R. Sullivan, and endlessly repeated in engineering ops culture.
The technical reality of “the customer is the pager”
When a customer tells you you’re down, three things are already true:
- You lost time: the incident started earlier than the report. Your MTTR clock is already running.
- You lost fidelity: customer symptoms are filtered through browsers, networks, and human interpretation.
- You lost trust: they discovered your failure before you did, and that’s not a great brand moment.
The operational problem is not just detection. It’s triage.
Monitoring isn’t “graphs.” It’s the ability to quickly answer a sequence of production questions:
- Is the service down or just slow?
- Is the issue global or regional, one tenant or all tenants?
- Did it start after a deploy, a config change, or a load spike?
- Is the bottleneck CPU, memory, disk, network, dependency latency, or a lock?
- Is it degrading (leak) or sudden (crash)?
- Are we making it worse by retry storms, autoscaling, or failover loops?
Without monitoring, you end up doing “production forensics” under pressure: SSH into boxes, tail logs, guess, restart, hope.
That works occasionally, mostly when the problem is trivially fixed by a reboot. It does not scale, and it’s a great way to create a second incident.
Joke #1: Running without monitoring is like driving at night with the headlights off because “the road looks fine in the parking lot.”
Why “we’ll notice quickly” is almost always wrong
Teams assume users will report outages quickly. Sometimes they do. Sometimes they don’t.
The quietest outages are often the most damaging:
- Partial failures: one API endpoint fails, one region fails, or one customer segment fails.
- Slow failures: latency creeps upward until customers churn rather than complain.
- Data correctness failures: responses are fast and wrong. You don’t get complaints; you get audits.
- Background job failures: billing, email, exports, ETL. Nobody checks until month-end.
Monitoring isn’t just about uptime. It’s about correctness, performance, and capacity.
It’s about knowing that the system is doing the right thing, at the right speed, for the right people.
Fast diagnosis playbook: the first five checks, in order
This is the “customer says we’re down” playbook. The goal is not to be clever; it’s to be fast, repeatable, and hard to screw up.
You are looking for the bottleneck and the blast radius.
First: confirm the symptom from outside
- Check the public endpoint from two networks (your corp network and a phone hotspot) to eliminate local DNS/VPN issues.
- Capture HTTP status, latency, and a request ID if available.
- Decide: is it hard-down (connection refused/timeouts) or soft-down (500s/slow)?
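A minimal sketch of that outside-in check, assuming the service echoes a request ID in an X-Request-Id response header (the header name is an assumption; yours may differ):
cr0x@server:~$ curl -sS -o /dev/null -D /tmp/edge-headers.txt -w "code=%{http_code} connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n" https://api.example.com/health
cr0x@server:~$ grep -i '^x-request-id' /tmp/edge-headers.txt
Repeat the same two commands from a phone hotspot. Matching numbers point at the service; wildly different numbers point at your network path.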
Second: determine the blast radius
- Is it one region/zone? One customer/tenant? One endpoint?
- Check edge components first: DNS, load balancer, ingress, CDN, cert expiry.
- Decide: is this a traffic routing issue or an application capacity issue?
Third: isolate the bottleneck category
- Compute: CPU pegged, load high, run queue high.
- Memory: OOM kills, swap storms, RSS growth.
- Disk: full filesystems, high iowait, latency spikes, degraded RAID arrays or ZFS pools.
- Network: packet loss, conntrack exhaustion, misrouted traffic, TLS handshake stalls.
- Dependencies: DB saturated, cache down, upstream API slow, DNS resolution delays.
Fourth: stop the bleeding safely
- Roll back the last deploy/config if it correlates with onset.
- Rate-limit or shed load (return 429/503) instead of melting.
- Scale only if you’re sure you’re not amplifying a dependency failure.
- Decide: mitigate now, diagnose after stability returns.
Fifth: preserve evidence
- Snapshot logs, capture process state, record timestamps.
- Make sure system clocks are sane; time drift ruins correlation.
- Decide: what evidence the postmortem will need, before “restart fixes it” erases it.
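A minimal evidence-capture sketch, assuming a systemd service named myapp (a hypothetical unit name; adjust names and paths to your stack):
cr0x@server:~$ ts=$(date -u +%Y%m%dT%H%M%SZ) && mkdir -p /tmp/incident-$ts
cr0x@server:~$ (date -u; uptime; free -m; df -h; ps auxww; ss -s) > /tmp/incident-$ts/system.txt 2>&1
cr0x@server:~$ dmesg -T | tail -n 200 > /tmp/incident-$ts/dmesg.txt
cr0x@server:~$ journalctl -u myapp -S "-30 min" --no-pager > /tmp/incident-$ts/app.txt
Thirty seconds of copying beats three hours of arguing about what the system “probably” looked like before the restart.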
Practical tasks: commands, output, and decisions
Below are concrete tasks you can run during an incident. Each has: a command, representative output, what it means, and the decision you make.
Pick the ones matching your stack; don’t cosplay a Linux wizard if you’re on managed services. But learn the shape of the signals.
Task 1: Confirm DNS is returning what you think
cr0x@server:~$ dig +short api.example.com
203.0.113.10
What it means: You got an A record. If it’s empty, returns an unexpected IP, or changes between queries, you might have DNS propagation or split-horizon issues.
Decision: If DNS looks wrong, stop poking the app. Fix routing (record, TTL, zone, health checks) or bypass with a known-good IP for debugging.
Task 2: Check TLS validity and handshake from outside
cr0x@server:~$ echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null | openssl x509 -noout -dates
notBefore=Jan 10 00:00:00 2026 GMT
notAfter=Apr 10 23:59:59 2026 GMT
What it means: If this fails to connect or shows an expired cert, your “outage” might be a cert rotation failure.
Decision: If cert expired or chain broken, prioritize cert fix over app debugging; no amount of scaling fixes cryptography.
Task 3: Measure endpoint behavior (status + latency)
cr0x@server:~$ curl -sS -o /dev/null -w "code=%{http_code} total=%{time_total} connect=%{time_connect} ttfb=%{time_starttransfer}\n" https://api.example.com/health
code=503 total=2.413 connect=0.012 ttfb=2.390
What it means: Fast connect, slow TTFB implies the server accepted the connection but couldn’t respond quickly—app or dependency latency, not pure network.
Decision: Treat as “soft down” and focus on saturation or dependency timeouts rather than DNS or firewall.
Task 4: Identify whether the host is overloaded (CPU/run queue)
cr0x@server:~$ uptime
14:02:11 up 36 days, 3:18, 2 users, load average: 38.12, 35.77, 29.05
What it means: Load average far above CPU count indicates many runnable or blocked tasks. “Blocked” often means I/O wait, not CPU.
Decision: Immediately check iowait and per-process resource usage; don’t just add CPUs and hope.
Task 5: See CPU, iowait, and saturation quickly
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (app-01) 02/02/2026 _x86_64_ (16 CPU)
02:02:15 PM CPU %usr %sys %iowait %idle
02:02:16 PM all 12.3 6.1 62.4 19.2
02:02:17 PM all 10.9 5.7 64.0 19.4
02:02:18 PM all 11.1 5.9 63.2 19.8
What it means: High %iowait means CPUs are idle but waiting on disk. Your problem is storage latency or disk saturation.
Decision: Stop scaling app instances as the first move; find what’s hammering disk (logs, DB, swap, compactions) and mitigate.
Task 6: Find the top offenders by CPU and memory
cr0x@server:~$ ps -eo pid,comm,%cpu,%mem,rss --sort=-%cpu | head
PID COMMAND %CPU %MEM RSS
7321 java 380 12.4 2068420
914 nginx 45 0.6 98120
5210 node 33 2.1 349812
What it means: One Java process using multiple cores heavily suggests compute-bound work or a runaway loop.
Decision: If CPU-bound and correlated with a traffic spike, consider shedding load, reducing concurrency, or rolling back. If unexpected, capture stack traces before restart.
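If that runaway process is a JVM, a stack snapshot before any restart is cheap. A sketch, assuming a JDK with jstack on the box and that the process runs as appuser (both assumptions):
cr0x@server:~$ sudo -u appuser jstack 7321 > /tmp/java-stacks-$(date +%s).txt   # thread dump to a file
cr0x@server:~$ kill -3 7321   # alternative: HotSpot prints a thread dump to the JVM's stdout/log
For other runtimes, grab whatever your platform offers (profiler snapshot, core dump) before the evidence restarts itself away.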
Task 7: Check for OOM kills (silent service death)
cr0x@server:~$ dmesg -T | tail -n 8
[Mon Feb 2 13:58:41 2026] Out of memory: Killed process 5210 (node) total-vm:4123456kB, anon-rss:1765432kB, file-rss:0kB, shmem-rss:0kB
[Mon Feb 2 13:58:41 2026] oom_reaper: reaped process 5210 (node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
What it means: The kernel killed a process. Your service may “randomly” restart, flap, or disappear.
Decision: Mitigate by reducing memory usage, adjusting limits, adding RAM, or fixing leaks. Also add alerting on OOM events; they are not subtle, just ignored.
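Until real alerting exists, even a cron-able check is better than silence. A sketch, assuming systemd-journald holds the kernel log:
cr0x@server:~$ journalctl -k -S "-1h" --no-pager | grep -ci 'out of memory'
A non-zero count in the last hour is worth a page; zero is not proof of health, just the absence of this particular failure.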
Task 8: Check disk space (the classic boring outage)
cr0x@server:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 200G 200G 0 100% /
What it means: Disk full. Expect bizarre side effects: failed writes, stuck processes, database corruption risk, logging failures, package manager failures.
Decision: Free space safely: delete old logs, rotate, move artifacts, expand disk. Then fix the root cause (retention policies, runaway logs, oversized core dumps).
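Two safe-ish first moves while you breathe, assuming journald plus an application log whose path here is hypothetical:
cr0x@server:~$ sudo journalctl --vacuum-size=500M               # shrink the journal to a bounded size
cr0x@server:~$ sudo truncate -s 0 /var/log/myapp/debug.log      # truncate in place; do NOT rm a file a process still holds open
Truncating keeps the file handle valid; deleting an open file frees nothing until the writing process restarts.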
Task 9: Find what’s eating disk right now
cr0x@server:~$ sudo du -xhd1 /var | sort -h
120M /var/cache
2.1G /var/lib
38G /var/log
What it means: /var/log is huge. That’s often the “we turned on debug logging” incident.
Decision: Rotate/truncate the biggest offenders, then set sane log levels and log rotation. If you need debug logs, ship them off-box with retention.
Task 10: Check disk latency and saturation (where outages hide)
cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (app-01) 02/02/2026 _x86_64_ (16 CPU)
avg-cpu: %user %system %iowait %idle
11.2 5.8 63.7 19.3
Device r/s w/s rkB/s wkB/s await svctm %util
nvme0n1 12.0 850.0 512.0 98432.0 48.3 1.2 99.8
What it means: %util near 100% and high await means the device is saturated; writes are queued and everything slows down.
Decision: Identify the writer (database checkpoint, log flood, backup job). Throttle it, move it, or scale storage. Don’t pretend this is “just CPU.”
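A sketch for naming the writer, assuming the sysstat package (pidstat) is installed; iotop is an alternative if you have it:
cr0x@server:~$ pidstat -d 1 5        # per-process kB_rd/s and kB_wr/s; the top writer is your suspect
cr0x@server:~$ sudo iotop -obn 3     # only show processes actually doing I/O, three batch samples
Match the offending PID against deploys, cron jobs, and backup schedules before you touch anything.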
Task 11: Validate service health under systemd
cr0x@server:~$ systemctl status nginx --no-pager
● nginx.service - A high performance web server and a reverse proxy server
Loaded: loaded (/lib/systemd/system/nginx.service; enabled)
Active: active (running) since Mon 2026-02-02 13:40:07 UTC; 21min ago
What it means: The process is running. That does not mean the service is working; it might be deadlocked or returning errors.
Decision: If “active” but users are failing, check upstream health, connection counts, and logs. Don’t stop here and declare victory.
Task 12: Check recent logs for obvious errors
cr0x@server:~$ journalctl -u nginx -S "-15 min" --no-pager | tail -n 8
Feb 02 13:55:10 app-01 nginx[914]: connect() failed (111: Connection refused) while connecting to upstream, client: 198.51.100.33, server: api.example.com, request: "GET /v1/orders HTTP/2.0", upstream: "http://127.0.0.1:8080/v1/orders"
What it means: Nginx can’t reach the upstream app on 127.0.0.1:8080. That’s likely an app crash, bind failure, or local firewall.
Decision: Investigate the upstream service: port listening, crash loop, OOM, config. Fix app first; nginx is just the messenger.
Task 13: Confirm a port is listening (and by whom)
cr0x@server:~$ ss -lntp | grep ':8080'
LISTEN 0 4096 127.0.0.1:8080 0.0.0.0:* users:(("java",pid=7321,fd=123))
What it means: The port is open and owned by the Java process. If nothing is listening, the upstream error is explained.
Decision: If not listening, check crash logs and config. If listening, suspect the app is wedged or dependency-bound; probe it locally.
Task 14: Local health check bypassing the load balancer
cr0x@server:~$ curl -sS -o /dev/null -w "code=%{http_code} total=%{time_total}\n" http://127.0.0.1:8080/health
code=500 total=0.083
What it means: The app responds quickly but returns 500: internal error, misconfig, failed dependency, or startup incomplete.
Decision: Look for stack traces and dependency failures; don’t waste time on network debugging.
Task 15: Check database connectivity quickly (PostgreSQL example)
cr0x@server:~$ psql -h db-01 -U app -d appdb -c "select now(), count(*) from pg_stat_activity;"
now | count
-------------------------------+-------
2026-02-02 14:02:44.221+00 | 198
What it means: The DB is reachable and has 198 open connections (active and idle). If that’s near your connection limit, you’re flirting with a self-inflicted outage.
Decision: If connections are high, enable pooling, reduce concurrency, or kill runaway queries. If unreachable, treat DB as primary incident.
Task 16: Identify lock contention (PostgreSQL example)
cr0x@server:~$ psql -h db-01 -U app -d appdb -c "select wait_event_type, wait_event, count(*) from pg_stat_activity where state='active' group by 1,2 order by 3 desc;"
wait_event_type | wait_event | count
-----------------+---------------------+-------
Lock | transactionid | 62
Client | ClientRead | 21
What it means: Many sessions waiting on locks indicates the database is blocked, not “slow.” Your app is queueing behind contention.
Decision: Find the blocker query and remediate (terminate the session, roll back the deploy, adjust indexes/transactions). Scaling app servers will increase pressure and worsen it.
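A sketch for finding the blocker, assuming PostgreSQL 9.6 or newer (pg_blocking_pids) and the same hypothetical db-01/appdb from above:
cr0x@server:~$ psql -h db-01 -U app -d appdb -c "select pid, pg_blocking_pids(pid) as blocked_by, now()-xact_start as xact_age, left(query,60) as query from pg_stat_activity where cardinality(pg_blocking_pids(pid)) > 0 order by xact_age desc;"
The PIDs in blocked_by are your suspects; look at what they are running before reaching for pg_terminate_backend().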
Task 17: Check kernel-level network errors
cr0x@server:~$ netstat -s | egrep -i 'listen|overflow|dropped' | head
124 times the listen queue of a socket overflowed
124 SYNs to LISTEN sockets dropped
What it means: Your server is dropping incoming connections. This looks like “random timeouts” from clients.
Decision: Increase backlog settings, tune accept queues, reduce per-request work, or scale horizontally. Also confirm the app can actually keep up.
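A sketch of the knobs involved; the values are illustrative, not recommendations:
cr0x@server:~$ sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog   # current kernel accept/SYN queue limits
cr0x@server:~$ sudo sysctl -w net.core.somaxconn=4096                   # temporary raise; persist via sysctl.d only if it actually helps
The kernel cap only matters if the application also requests a larger backlog in its listen() call; raising one without the other changes nothing.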
Task 18: Storage integrity check (ZFS example)
cr0x@server:~$ sudo zpool status
pool: tank
state: DEGRADED
status: One or more devices has experienced an unrecoverable error.
scan: scrub repaired 0B in 01:12:33 with 0 errors on Mon Feb 2 01:10:02 2026
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
sda ONLINE 0 0 0
sdb FAULTED 9 12 0 too many errors
errors: No known data errors
What it means: Your pool is degraded. Performance may drop, resilver may be pending, and another disk failure could become data loss.
Decision: Replace the failed disk, plan for resilver impact, and consider throttling heavy jobs. Also: add monitoring on pool health yesterday.
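The replacement itself is one command; sdc here is a placeholder for whatever device you actually add:
cr0x@server:~$ sudo zpool replace tank sdb sdc
cr0x@server:~$ sudo zpool status tank    # watch resilver progress; expect extra I/O load while it runs
Schedule heavy batch jobs around the resilver, not on top of it.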
Task 19: Check if you’re swapping (a silent latency killer)
cr0x@server:~$ free -m
total used free shared buff/cache available
Mem: 64000 61200 300 210 2500 900
Swap: 16000 15850 150
What it means: Swap is almost fully used; the system is likely thrashing. Latency becomes random, CPU iowait rises, and everything looks haunted.
Decision: Reduce memory pressure immediately (restart leaky workers, scale up RAM, tune caches). Long term: set limits, right-size, and alert on swap usage.
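A sketch to name the top swap consumers without extra tooling (it reads per-process VmSwap from /proc):
cr0x@server:~$ grep -H VmSwap /proc/[0-9]*/status 2>/dev/null | sort -t: -k3 -n | tail -n 5
The biggest numbers (in kB) tell you which workers to restart or right-size first.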
Joke #2: If your only alert is a customer email, congratulations—you’ve built a distributed monitoring system powered by human anxiety.
Three corporate-world mini-stories (anonymized and plausible)
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company ran a single “API” service behind a load balancer. They had host metrics: CPU, memory, and disk.
They had no synthetic checks and no error-rate alerts. The on-call rotation was two developers and one infra person who also owned backups.
One Monday morning, a customer reported that file uploads were failing. The team assumed “storage issue,” because uploads and storage are married in everyone’s head.
They SSH’d into app hosts, checked disk usage (fine), checked the object store status page (green), and restarted the upload workers just to “clear it.”
That made it worse, because restarts dropped in-flight uploads and triggered client retries.
The actual issue was a clock drift on one node after a hypervisor migration. TLS handshakes intermittently failed because the node believed it was in the future.
The load balancer kept sending a fraction of upload traffic to that node. CPU, memory, disk all looked normal. The service was “up.”
From the customer’s browser, it was roulette.
The fix was simple: restore NTP, drain the node, then reintroduce it. The lesson was not “NTP matters” (it does).
The lesson was that the team’s mental model was too narrow: they assumed every failure shows up as resource saturation.
Without request-level success metrics and external checks, they were blind to correctness failures.
Afterward they added two things: a synthetic upload test every minute and an alert on TLS handshake errors at the edge.
Not a hundred dashboards. Two signals that mapped to user pain.
Mini-story 2: The optimization that backfired
A different company had cost pressure and decided to “optimize” logging. They moved from application logs shipped off-box to local disk logs with a batch shipper.
The pitch was good: fewer network calls, lower ingest costs, and “we can ship in bulk.”
During a traffic spike, the batch shipper couldn’t keep up. Logs piled up locally. Disk filled slowly, then suddenly.
When disk hit 100%, the database started failing checkpoints and the app started throwing errors it couldn’t write to its own logs.
The team didn’t see the errors because the errors were trapped behind the very disk problem causing them.
Customers reported timeouts. The team scaled the API tier, which increased write traffic and log volume, which filled disks faster.
Classic positive feedback loop: “scale to fix it” became “scale to accelerate it.”
The postmortem was blunt: local buffering without strict quotas is a denial-of-service you schedule for yourself.
The correct optimization would have included hard disk reservations, log rate limiting, and backpressure behavior (drop debug logs first, not the database).
They eventually implemented per-service log volume caps and kept a minimal error stream shipped in real time.
Mini-story 3: The boring but correct practice that saved the day
An enterprise internal platform team ran a payments-adjacent service. They were not glamorous. They were the people who said “no” a lot.
Their monitoring was also boring: synthetic checks, error-rate alerts, latency percentiles, and dependency timeouts. They had runbooks.
A Thursday deploy introduced a subtle connection leak to a downstream database. Connections accumulated slowly.
Because they had a connection pool dashboard and an alert on “pool saturation,” the on-call got paged before customers noticed anything.
The alert didn’t say “something is weird.” It said “DB pool is at 90% and rising.”
The response was equally boring: they rolled back immediately, then verified the pool drained, then re-deployed with a fix behind a feature flag.
They also rate-limited the one endpoint that was worst-affected to reduce pressure while validating.
The customers never filed tickets. The business never held a war room. The team got no praise, which is the highest compliment in reliability engineering.
Their practice wasn’t genius; it was disciplined: alerts that match user impact, and the authority to roll back without a committee.
Common mistakes: symptom → root cause → fix
When you have no monitoring, you fall into predictable traps. Here are the ones I see most, framed in a way you can act on during an incident.
1) “The service is running” but users get errors
- Symptom: systemctl status says active; customers see 500/503.
- Root cause: Dependency failure, thread pool exhaustion, or app returning errors while the process stays alive.
- Fix: Add health checks that validate dependencies; alert on error rate and latency, not process existence.
2) Random timeouts that “go away” after restart
- Symptom: Timeouts under load; restart helps briefly.
- Root cause: Resource leak (connections, file descriptors, memory), or kernel limits (listen backlog, conntrack).
- Fix: Alert on FD usage, connection counts, and kernel network drops; enforce limits and add pooling.
3) Only one customer complains
- Symptom: “It’s down for us, not for others.”
- Root cause: Tenant-specific data issue, geo routing, shard imbalance, or auth policy differences.
- Fix: Add per-tenant/per-region success metrics and tracing; implement canary synthetic checks from multiple regions.
4) Everything is slow, CPU is low
- Symptom: Low CPU, high latency.
- Root cause: Disk iowait, lock contention, or upstream dependency latency.
- Fix: Track iowait and disk await; alert on dependency latency; add database lock monitoring.
5) Scaling makes it worse
- Symptom: Add instances, error rate increases.
- Root cause: Thundering herd/retry storm, or saturating a shared dependency (DB, cache, filesystem).
- Fix: Implement exponential backoff with jitter, circuit breakers, and per-endpoint rate limits; alert on dependency saturation.
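A minimal client-side sketch of backoff with jitter, reusing the hypothetical endpoint from the tasks above; real services should do this in the HTTP client or SDK, not in shell:
cr0x@server:~$ for attempt in 1 2 3 4 5; do curl -fsS -m 5 https://api.example.com/v1/orders && break; sleep $(( RANDOM % (2 ** attempt) + 1 )); done
The jitter (the RANDOM part) is the point: without it, every failed client retries on the same beat and you rebuild the stampede you were trying to avoid.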
6) Outage “starts” when customers notice it
- Symptom: No internal timestamp; vague incident start time.
- Root cause: No external synthetic monitoring, no event timeline (deploy markers), poor log retention.
- Fix: Add synthetics and deploy annotations; retain logs/metrics long enough to see trends (not just the last 15 minutes).
7) Health checks pass while real traffic fails
- Symptom: /health returns 200; /checkout fails.
- Root cause: Shallow health checks that don’t exercise real dependencies or paths.
- Fix: Add layered checks: liveness (process), readiness (dependency), and synthetic transactions (real user path).
Checklists / step-by-step plan
Step-by-step: going from “nothing” to “customers aren’t your monitoring”
- Define what “down” means for your users. Pick one or two user journeys (login, checkout, API request) and measure them.
- Add external synthetic checks. Run them from at least two networks/regions. Alert on failure and on latency regression (a minimal sketch follows this list).
- Adopt the golden signals per service. Traffic, errors, latency, saturation. If you can’t name them, you can’t alert on them.
- Instrument request IDs. Every request gets a correlation ID at the edge and logs include it end-to-end.
- Create a minimal on-call page. One page, not ten: current status, last deploy, error rate, p95 latency, dependency health.
- Alert on symptoms, not causes. “500s spiking” is a symptom. “CPU 70%” is trivia.
- Establish ownership. Every alert has a team and a runbook. If nobody owns it, it will page everyone and help nobody.
- Set retention and quotas. Logs fill disks. Metrics fill budgets. Decide the retention you need for debugging trends and regressions.
- Time sync and clocks. NTP/chrony everywhere. If time is wrong, observability is fiction.
- Practice rollback. Make rollback routine, not a sacred ritual requiring approvals while production burns.
- Write postmortems that change systems. If the action item is “be more careful,” you wrote a diary entry, not an engineering artifact.
- Review alerts monthly. Kill noisy alerts. Add missing ones. Monitoring is a living system, not a one-off project.
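The synthetic-check item above, as a minimal sketch you could run from cron on a machine outside your production network. The URL, threshold, and webhook are placeholders (assumptions), and a purpose-built checker is usually better; the point is how little it takes to stop using customers as your pager:
#!/usr/bin/env bash
# Minimal synthetic check (sketch): complain loudly if the real user journey errors or slows down.
URL="https://api.example.com/v1/orders"       # placeholder: your most important endpoint
ALERT="https://alerts.example.com/hook"       # placeholder: webhook for your paging tool
SLOW=2.0                                      # latency threshold in seconds; pick your own
out=$(curl -sS -o /dev/null -m 10 -w "%{http_code} %{time_total}" "$URL") || out="000 10.0"
code=${out% *}; total=${out#* }
if [ "$code" != "200" ] || awk -v t="$total" -v s="$SLOW" 'BEGIN{exit !((t+0) > (s+0))}'; then
  curl -sS -m 5 -X POST -d "synthetic check: code=$code total=${total}s url=$URL" "$ALERT"
fi
Run it every minute from two networks, and alert if the checker itself goes quiet; a dead checker looks exactly like a healthy service.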
Incident checklist: when a customer reports an outage
- Confirm externally (status code, latency, error text).
- Record time, customer region, endpoint, request ID if possible.
- Check last deploy/config change.
- Check edge: DNS, TLS, load balancer health, cert expiry.
- Check app: error logs, dependency connectivity, saturation (CPU/mem/disk/network).
- Mitigate: rollback, shed load, disable non-critical features.
- Preserve evidence: logs, system snapshots, top output, DB lock info.
- Communicate: one owner, one channel, clear customer-facing status update.
Build-it-right checklist: minimum viable monitoring (MVM)
- External synthetic: one “real” endpoint, one “health” endpoint.
- Service metrics: request rate, error rate, p95 latency, saturation (queue depth, thread pool, DB connections).
- Dependency metrics: DB latency, cache hit rate, upstream timeouts.
- Host metrics: disk usage, iowait, memory pressure, network drops.
- Events: deploy markers, config changes, scaling events.
- Alert hygiene: paging only for user-impacting symptoms; ticket for slow-burn capacity issues.
FAQ
1) Isn’t “no monitoring” okay if the app is small?
Small apps fail in small ways, and then grow. Monitoring isn’t about size; it’s about time-to-diagnosis.
Even a single VM benefits from disk-full alerts and a basic synthetic check.
2) What’s the first monitoring signal I should add?
An external synthetic check of the most important user journey, with latency and status code recorded.
If it fails, you page. If it slows, you investigate before it fails.
3) Why do CPU and memory graphs feel useless during outages?
Because many outages are about dependencies and saturation elsewhere: disk latency, lock contention, or network drops.
Host graphs are necessary but not sufficient; they’re the vitals, not the diagnosis.
4) How do we avoid alert fatigue?
Page only on user-impacting symptoms and actionable thresholds. Everything else becomes a ticket or a dashboard.
If an alert can’t lead to a decision, it’s noise dressed as responsibility.
5) Do I need distributed tracing to be “real” SRE?
No. You need clarity. Tracing helps in microservices, but you can get far with request IDs, error rate alerts, and dependency latency metrics.
Add tracing when correlation becomes your bottleneck.
6) What’s the minimum set of dashboards that actually helps?
One service dashboard per critical service: traffic, errors, p95/p99 latency, saturation, and top dependencies.
Plus one “edge” dashboard: DNS/TLS/LB health and synthetic checks.
7) How do we handle “partial outages” when only some users complain?
Segment your signals: per region, per tenant, per endpoint, and per build version.
Partial outages are what happens when your monitoring aggregates away the truth.
8) Should health checks hit the database?
Readiness checks should validate critical dependencies, including the database, but carefully: they must be lightweight and rate-limited.
Use a simple query or connection check, not a full transaction.
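For example, the database part of a readiness check can be this small (reusing the hypothetical db-01/appdb from the tasks above; -tAc keeps the output machine-friendly):
cr0x@server:~$ timeout 2 psql -h db-01 -U app -d appdb -tAc "select 1" >/dev/null && echo db-ok || echo db-degraded
The hard timeout matters more than the query: a readiness check that hangs is worse than one that fails.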
9) What if we can’t install agents due to security policy?
You can still do a lot with black-box monitoring (synthetics), application-level metrics endpoints, and central log shipping via approved collectors.
Security policies should change architecture, not eliminate visibility.
10) How do I convince leadership to fund monitoring?
Don’t sell “observability.” Sell reduced outage cost and faster incident resolution.
Show a recent incident timeline and highlight how much time was lost to guessing. Make the waste visible.
Conclusion: next steps you can do this week
If you learned you were down from a customer, treat it as a production defect, not a bad day.
The fix isn’t a prettier dashboard. It’s building a system that tells you the truth quickly and consistently.
Do these next steps in order:
- Add one external synthetic check that measures status code and latency for a real endpoint.
- Create one paging alert on error rate (not CPU), and route it to a real on-call schedule.
- Instrument request IDs and ensure logs include them end-to-end.
- Alert on disk full, OOM kills, and dependency timeouts—because these are the “silent killers” that customers find first.
- Write a short runbook for the top three failure modes you’ve actually seen (disk full, DB saturation, crash loops).
Then do the boring part: keep it maintained. Monitoring that isn’t tended becomes decorative.
And decorative monitoring is just an expensive way to learn you’re down from a customer—again.