DNS outages are uniquely cruel. Your web servers can be green, your databases can be humming, and your incident channel can be quiet—while users
stare at “site can’t be reached” like it’s your hobby. When DNS breaks, everything looks broken, and the first signal is often your CEO’s phone.
The fix isn’t “monitor DNS” in the abstract. The fix is monitoring the right DNS failure modes with checks that are cheap, specific, and fast
to interpret at 2 a.m. You want alerts that fire before customers notice, and dashboards that tell you what to do next—not charts you admire while production burns.
What you’re actually monitoring (and why DNS lies to you)
DNS monitoring sounds like one thing. In production it’s at least four:
authoritative DNS (your zone is served correctly),
recursive resolution (users can actually resolve it),
record correctness (the answers point to the right place),
and change safety (you don’t break it during updates).
The biggest trap is thinking “DNS is up” is a boolean. DNS is a distributed, cached, time-dependent system. It can be “up” in one country and “down” in another.
It can be “up” for the exact record you tested and “down” for the record your mobile clients actually use. It can be “up” for clients that have a warm cache and “down”
for new resolvers. And it can be “up” until TTL expires, at which point it becomes very down, very fast.
The four layers that fail differently
- Authoritative nameservers: Your NS set answers queries for your zone. Failures show as SERVFAIL, timeouts, wrong answers, lame delegations, and DNSSEC errors.
- Delegation layer (registrar + parent zone): The parent zone points to your NS records and glue. Failures show as intermittent resolution or total NXDOMAIN depending on caching.
- Recursive resolvers: What most users actually hit (ISP resolvers, public resolvers, enterprise resolvers). Failures show as delays, stale answers, or SERVFAIL spikes.
- Client behavior: Apps do their own caching, some ignore TTLs, some prefer IPv6, some race A and AAAA differently. Failures show as “works on my laptop” incidents.
Monitor at least two perspectives: authoritative correctness and recursive user experience. If you only do one,
your alerts will be either late (authoritative-only) or vague (recursive-only).
Exactly one quote, because it’s operationally true: “Hope is not a strategy.” — Vince Lombardi.
Engineers didn’t invent it, but we sure live by it.
Joke #1: DNS is like a group chat—one wrong participant (NS record) and suddenly nobody can find you, but somehow it’s still your fault.
Interesting DNS facts that matter in incidents
Some “trivia” becomes very practical when you’re diagnosing a weird outage. Here are concrete facts that tend to show up in postmortems.
- DNS was specified in the early 1980s to replace the central HOSTS.TXT model. That’s why it’s decentralized and cached by design, and why debugging is inherently distributed.
- UDP is still the default transport for most DNS queries because it’s cheap and fast. That also means packet loss, fragmentation, and MTU quirks can look like “DNS is down.”
- EDNS(0) exists because classic DNS payloads were too small. It lets resolvers advertise larger UDP sizes, which interacts with firewalls that hate “large DNS.”
- TCP fallback isn’t optional in real life. Truncation (TC=1) forces TCP; if your network blocks TCP/53, some answers will fail intermittently.
- Negative caching is a thing. NXDOMAIN responses can be cached based on the SOA’s negative TTL. A brief misconfiguration can haunt you longer than you deserve.
- TTL is advisory, not a moral law. Many resolvers and clients clamp TTLs (min/max), and some applications cache longer than they should.
- Glue records exist because nameservers need names too. If your NS names are inside the zone they serve, glue in the parent is what prevents circular dependency.
- DNSSEC is about authenticity, not availability. When it breaks (bad DS, expired signatures), it often fails “hard” for validating resolvers and looks like random outage.
- Root and TLD infrastructure is engineered for resiliency (anycast, massive distribution). Most DNS outages are self-inflicted at the delegation or zone layer.
SLIs and alert design that don’t waste your life
The goal is not to produce graphs. The goal is to catch user-impacting resolution failures early, and to route the alert to a human who can act.
That means you need SLIs that represent failure and latency in places that match reality.
Pick SLIs that map to user pain
- Recursive success rate: percentage of queries that return a valid A/AAAA/CNAME chain within a deadline from multiple vantage points.
- Recursive latency: p50/p95 query time to recursive resolvers (or your own). Spikes often precede failures.
- Authoritative availability: success rate and latency when querying each authoritative NS directly.
- Answer correctness: returned records match expected targets (IP ranges, CNAMEs, MX hosts) and expected TTL bounds.
- RCODE distribution: spikes in SERVFAIL, NXDOMAIN, REFUSED are early smoke signals.
Alerting rules that are hard to hate
A practical set:
- Page on user-impacting failure: recursive success < 99.5% for 5 minutes across at least two regions, for critical names.
- Ticket on rising latency: recursive p95 > 250 ms for 15 minutes, or sudden 3× baseline increase.
- Page on authoritative errors: any authoritative NS timing out > 20% over 5 minutes, or SOA/NS mismatch detected.
- Ticket on correctness drift: unexpected IP/CNAME changes, TTL out of policy, missing AAAA where required, DNSSEC signature nearing expiry.
Keep a short allowlist of “names that matter”: login, api, www, mail, and whatever your mobile app hardcodes. Monitor those like you mean it.
Then monitor zone health (SOA/NS/DNSKEY) like you’re going to be the one waking up if it breaks. Because you are.
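If you want to turn that allowlist into a working check today, here is a minimal sketch in bash, assuming dig is installed; the resolver IPs and hostnames are placeholders you would replace with your own, and the non-zero exit is what your scheduler or alerting hooks on.
#!/usr/bin/env bash
# Minimal recursive-success check for a short allowlist of critical names.
RESOLVERS="1.1.1.1 8.8.8.8"                    # vantage resolvers (placeholders)
NAMES="www.example.com api.example.com login.example.com"
fail=0
for r in $RESOLVERS; do
  for n in $NAMES; do
    rcode=$(dig @"$r" "$n" A +time=2 +tries=1 +noall +comments \
      | grep -oE 'status: [A-Z]+' | awk '{print $2}')
    if [ "$rcode" != "NOERROR" ]; then
      echo "FAIL $n via $r rcode=${rcode:-TIMEOUT}"
      fail=1
    fi
  done
done
exit "$fail"
Run it from two different networks and you already have the two vantage points the SLIs above ask for.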
Simple, effective DNS checks (with real commands)
You don’t need a fancy platform to start. You need repeatable commands, expected outputs, and explicit decisions.
Below are practical tasks you can run from a Linux host with dig, delv, and basic tooling.
The commands are intentionally blunt. They’re meant to work when your brain is running on fumes.
Task 1: Confirm recursive resolution works (baseline)
cr0x@server:~$ dig +time=2 +tries=1 www.example.com A
; <<>> DiG 9.18.24 <<>> +time=2 +tries=1 www.example.com A
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5312
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; QUESTION SECTION:
;www.example.com. IN A
;; ANSWER SECTION:
www.example.com. 60 IN A 203.0.113.10
;; Query time: 23 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Tue Dec 31 12:00:00 UTC 2025
;; MSG SIZE rcvd: 59
What it means: You got NOERROR, an A record, and a query time. This is the “most things are fine” baseline.
Decision: If this fails, stop guessing. You have a resolver path problem, not an application problem.
Task 2: Force a known public recursive resolver (compare paths)
cr0x@server:~$ dig @1.1.1.1 +time=2 +tries=1 www.example.com A
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4011
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
www.example.com. 60 IN A 203.0.113.10
;; Query time: 19 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
What it means: If local fails but a public resolver works, your local stub resolver, corporate DNS, or network path is broken.
Decision: Route to the team that owns resolver infrastructure/VPN/edge.
Task 3: Query each authoritative nameserver directly (is auth DNS alive?)
cr0x@server:~$ dig +norecurse @ns1.example.net example.com SOA
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1209
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
example.com. 300 IN SOA ns1.example.net. hostmaster.example.com. 2025123101 3600 600 1209600 300
;; Query time: 14 msec
What it means: The aa flag says the server is authoritative and answering. SOA serial is visible.
Decision: If one NS times out or returns different SOA serials than others, you likely have propagation/zone transfer issues or a dead anycast node.
Task 4: Verify all NS agree on the SOA serial (consistency check)
cr0x@server:~$ for ns in ns1.example.net ns2.example.net ns3.example.net; do echo "== $ns =="; dig +norecurse +time=2 +tries=1 @$ns example.com SOA +short; done
== ns1.example.net ==
ns1.example.net. hostmaster.example.com. 2025123101 3600 600 1209600 300
== ns2.example.net ==
ns1.example.net. hostmaster.example.com. 2025123101 3600 600 1209600 300
== ns3.example.net ==
ns1.example.net. hostmaster.example.com. 2025123009 3600 600 1209600 300
What it means: ns3 is behind (older serial). That can cause different answers depending on which NS gets queried.
Decision: Treat as a correctness incident. Fix zone distribution/transfer or your deployment pipeline; don’t hand-wave it as “it’ll catch up.”
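The same consistency check is easy to run on a schedule. A minimal sketch, assuming bash and dig; the zone and NS names are the placeholders used throughout this article.
#!/usr/bin/env bash
# Compare SOA serials across all authoritative NS; non-zero exit on mismatch or timeout.
ZONE="example.com"
NSLIST="ns1.example.net ns2.example.net ns3.example.net"
serials=""
for ns in $NSLIST; do
  s=$(dig +norecurse +time=2 +tries=1 @"$ns" "$ZONE" SOA +short | awk '{print $3}')
  if [ -z "$s" ]; then echo "ERROR: no SOA answer from $ns"; exit 2; fi
  echo "$ns serial=$s"
  serials="$serials $s"
done
# More than one distinct serial means inconsistent authoritative state.
if [ "$(echo $serials | tr ' ' '\n' | sort -u | wc -l)" -ne 1 ]; then
  echo "MISMATCH: authoritative NS disagree on SOA serial"
  exit 1
fi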
Task 5: Validate delegation from the parent (catch lame delegation and glue issues)
cr0x@server:~$ dig +trace example.com NS
; <<>> DiG 9.18.24 <<>> +trace example.com NS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21044
...
example.com. 172800 IN NS ns1.example.net.
example.com. 172800 IN NS ns2.example.net.
example.com. 172800 IN NS ns3.example.net.
;; Received 117 bytes from a.gtld-servers.net#53(a.gtld-servers.net) in 28 ms
What it means: You can see the delegation chain. If it stops, loops, or points at the wrong NS set, your registrar/parent delegation is wrong.
Decision: If the parent’s NS set is wrong, changing your zone file won’t help. Fix delegation at the registrar/DNS provider.
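If you would rather have this as an automated check than an eyeballed +trace, one approach is to compare the delegation the parent hands out against the NS RRset your zone serves. A sketch, assuming bash and dig; the parent server below is taken from the trace output above and would differ for other TLDs.
#!/usr/bin/env bash
# Compare the parent's delegation NS set with the NS RRset served by the zone itself.
ZONE="example.com"
PARENT="a.gtld-servers.net"     # a .com parent server (see the +trace output); adjust per TLD
CHILD_NS="ns1.example.net"      # any one of your authoritative servers
parent_set=$(dig +norecurse @"$PARENT" "$ZONE" NS +noall +authority | awk '$4=="NS" {print $5}' | sort)
child_set=$(dig +norecurse @"$CHILD_NS" "$ZONE" NS +noall +answer | awk '$4=="NS" {print $5}' | sort)
echo "parent delegation:"; echo "$parent_set"
echo "zone NS RRset:";     echo "$child_set"
if [ "$parent_set" != "$child_set" ]; then
  echo "MISMATCH: fix the delegation at the registrar or the NS RRset in the zone"
  exit 1
fi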
Task 6: Check for CNAME chains and “looks fine but isn’t” answers
cr0x@server:~$ dig www.example.com +noall +answer
www.example.com. 60 IN CNAME www.example.cdn-provider.net.
www.example.cdn-provider.net. 20 IN A 198.51.100.77
What it means: The user-visible hostname relies on another domain’s DNS. If that provider has trouble, you inherit it.
Decision: Monitor both the vanity name and the provider target. Alert on either failing, because users don’t care whose fault it is.
Task 7: Measure query time explicitly (latency SLI from a shell)
cr0x@server:~$ dig @1.1.1.1 www.example.com A +stats +time=2 +tries=1 | tail -n 5
;; ANSWER SECTION:
www.example.com. 60 IN A 203.0.113.10
;; Query time: 312 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
What it means: 312 ms is high for many regions. This might be transient, or it might be packet loss forcing retries elsewhere.
Decision: If p95 is rising, don’t wait for SERVFAIL. Start checking path MTU, firewall changes, resolver load, and EDNS behavior.
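One-off numbers lie, so sample. Here is a sketch that takes a small burst of measurements and prints an approximate p95; bash and dig assumed, and the resolver, name, and sample size are placeholders.
#!/usr/bin/env bash
# Sample query latency N times; print max and an approximate p95 in milliseconds.
RESOLVER="1.1.1.1"; NAME="www.example.com"; N=20
times=$(for i in $(seq 1 "$N"); do
  dig @"$RESOLVER" "$NAME" A +time=2 +tries=1 +noall +stats | awk '/Query time:/ {print $4}'
done | sort -n)
echo "$times" | awk 'END {print "max_ms=" $1}'      # last line of sorted input = max
echo "$times" | awk '{a[NR]=$1} END {i=int(NR*0.95); if (i<1) i=1; print "p95_ms=" a[i]}'
Timed-out queries print no "Query time" line and silently shrink the sample; a real probe should count them as failures too.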
Task 8: Detect truncation and TCP fallback problems
cr0x@server:~$ dig @8.8.8.8 DNSKEY example.com +dnssec +bufsize=512 +ignore +time=2 +tries=1
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6512
;; flags: qr rd ra tc; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: Message truncated
What it means: The tc flag indicates truncation. The +bufsize=512 +ignore options force dig to show the truncated UDP reply instead of quietly retrying over TCP itself; a real resolver will retry over TCP when it sees tc=1. If TCP/53 is blocked, validating resolvers may fail badly.
Decision: Test TCP explicitly and open TCP/53 where required (especially for DNSSEC-heavy responses).
Task 9: Test DNS over TCP explicitly (prove firewall isn’t eating it)
cr0x@server:~$ dig +tcp @8.8.8.8 DNSKEY example.com +dnssec +time=2 +tries=1 +stats | tail -n 6
;; ANSWER SECTION:
example.com. 1800 IN DNSKEY 257 3 13 rX9W...snip...
example.com. 1800 IN DNSKEY 256 3 13 dE2Q...snip...
;; Query time: 41 msec
;; SERVER: 8.8.8.8#53(8.8.8.8) (TCP)
What it means: TCP works; truncation is survivable. If TCP fails, you’ll see timeouts.
Decision: If UDP works but TCP fails, fix the network. If both fail, fix DNS or routing.
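A small sketch that runs both probes and tells you which of those cases you are in; bash and dig assumed, and the resolver and zone are placeholders.
#!/usr/bin/env bash
# Classify UDP vs TCP behavior for a DNSSEC-heavy query.
RESOLVER="8.8.8.8"; ZONE="example.com"
probe() {  # $1 is an extra dig flag: "" for UDP, "+tcp" for TCP
  dig @"$RESOLVER" "$ZONE" DNSKEY +dnssec +time=2 +tries=1 $1 +noall +comments \
    | grep -oE 'status: [A-Z]+' | awk '{print $2}'
}
udp=$(probe ""); tcp=$(probe "+tcp")
echo "udp=${udp:-TIMEOUT} tcp=${tcp:-TIMEOUT}"
if [ "$udp" = "NOERROR" ] && [ "$tcp" != "NOERROR" ]; then
  echo "UDP works, TCP fails: fix the network path for TCP/53"
elif [ "$udp" != "NOERROR" ] && [ "$tcp" != "NOERROR" ]; then
  echo "Both fail: look at DNS or routing, not truncation"
fi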
Task 10: Check negative caching behavior (NXDOMAIN that sticks)
cr0x@server:~$ dig +noall +authority does-not-exist.example.com A
example.com. 300 IN SOA ns1.example.net. hostmaster.example.com. 2025123101 3600 600 1209600 300
What it means: NXDOMAIN responses carry the zone’s SOA in the authority section; the negative-caching TTL is the lower of that SOA record’s own TTL and its last field (the SOA minimum, here both 300).
Decision: If you accidentally removed a record and put it back, expect some clients to keep failing until negative TTL expires. Plan changes accordingly.
Task 11: Validate DNSSEC from a validating tool (separate “DNS is up” from “DNS is valid”)
cr0x@server:~$ delv @1.1.1.1 example.com A
; fully validated
example.com. 300 IN A 203.0.113.20
What it means: “fully validated” indicates DNSSEC chain validates from trust anchors.
Decision: If validation fails, treat as high severity for users behind validating resolvers (many enterprises). Fix DS/DNSKEY/RRSIG issues fast.
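The same check works as a scheduled probe. A sketch, assuming delv is installed; it greps for the output shown above rather than relying on exit codes, and the resolver and name are placeholders.
#!/usr/bin/env bash
# Alert if DNSSEC validation stops succeeding for a critical name.
RESOLVER="1.1.1.1"; NAME="example.com"
out=$(delv @"$RESOLVER" "$NAME" A 2>&1)
if echo "$out" | grep -q "fully validated"; then
  echo "OK: $NAME validates"
else
  echo "FAIL: $NAME did not validate"
  echo "$out" | head -n 5   # delv output usually hints at why (expired RRSIG, missing DS, ...)
  exit 1
fi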
Task 12: Spot TTLs that are too low (self-inflicted load and jitter)
cr0x@server:~$ dig +noall +answer api.example.com A
api.example.com. 5 IN A 203.0.113.44
What it means: TTL of 5 seconds is effectively “please DDoS my authoritative servers with legitimate traffic.”
Decision: Raise TTL to something sane for stable endpoints (often 60–300 seconds for dynamic front doors, longer for static). Use low TTL only with a plan.
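A sketch that flags TTLs below a policy floor for a handful of names; bash and dig assumed, and the 60-second floor, names, and NS below are placeholders for your own policy.
#!/usr/bin/env bash
# Flag answer TTLs below a policy floor. Query an authoritative server so you see
# the configured TTL, not a cache countdown.
MIN_TTL=60
AUTH="ns1.example.net"
NAMES="www.example.com api.example.com"
for n in $NAMES; do
  # Names that are CNAMEs at the authoritative won't match $4=="A" and are skipped here.
  ttl=$(dig +norecurse @"$AUTH" "$n" A +noall +answer | awk '$4=="A" {print $2; exit}')
  if [ -n "$ttl" ] && [ "$ttl" -lt "$MIN_TTL" ]; then
    echo "POLICY: $n TTL=${ttl}s is below ${MIN_TTL}s"
  fi
done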
Task 13: Verify multi-record behavior (A and AAAA can fail independently)
cr0x@server:~$ dig www.example.com A AAAA +noall +answer
www.example.com. 60 IN A 203.0.113.10
www.example.com. 60 IN AAAA 2001:db8:abcd::10
What it means: Dual-stack is in play. If AAAA is wrong or points to a broken path, some clients will prefer IPv6 and “randomly” fail.
Decision: Monitor both A and AAAA correctness and reachability. If you can’t operate IPv6, don’t publish AAAA “for later.”
Task 14: Check for split-horizon surprises (internal vs external answers)
cr0x@server:~$ dig @10.0.0.53 www.example.com A +noall +answer
www.example.com. 60 IN A 10.20.30.40
What it means: Internal resolver returns a private IP. That can be correct (internal routing) or disastrous (leaking private answers to public).
Decision: Confirm views are intentional. Add monitoring from inside and outside the network to ensure each audience gets the intended answer.
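To catch the disastrous variant automatically, here is a sketch that fails loudly if the external view ever returns private address space; bash and dig assumed, and the public resolver and name are placeholders.
#!/usr/bin/env bash
# Fail if the external view of a public name returns RFC1918 space.
EXTERNAL_RESOLVER="1.1.1.1"
NAME="www.example.com"
answers=$(dig @"$EXTERNAL_RESOLVER" "$NAME" A +time=2 +tries=1 +short | grep -E '^[0-9]+\.')
echo "$answers"
if echo "$answers" | grep -qE '^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)'; then
  echo "LEAK: external view returns a private IP for $NAME"
  exit 1
fi
Run the mirror-image check against your internal resolver to confirm the internal view still returns what you intend.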
Task 15: Confirm authoritative server identity (are you hitting the right box?)
cr0x@server:~$ dig +norecurse @ns1.example.net hostname.bind TXT chaos +short
"ns1-anycast-pop3"
What it means: If enabled, this can help identify which anycast POP you’re reaching. Not all providers allow it, and that’s fine.
Decision: Use it to correlate errors to a specific POP or backend. If it’s disabled, don’t fight it; rely on latency, traceroutes, and provider tooling.
Task 16: Capture RCODEs and timeouts at scale (turn ad-hoc checks into metrics)
cr0x@server:~$ for i in $(seq 1 20); do dig @1.1.1.1 +time=1 +tries=1 www.example.com A +noall +comments +stats; done | egrep "status:|Query time:"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34325
;; Query time: 18 msec
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2462
;; Query time: 21 msec
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 991
;; Query time: 1001 msec
What it means: A SERVFAIL in a small sample is already suspicious. If you see intermittent SERVFAILs with ~1s timing, you’re hitting timeouts and retries.
Decision: Treat intermittent errors as real. They become outages under load or when caches expire.
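To turn that loop into numbers a dashboard can graph, here is a sketch that counts RCODEs (and timeouts) over a burst of queries; the label format below just mimics a common metrics style, and everything here is a placeholder for your own pipeline.
#!/usr/bin/env bash
# Count RCODEs and timeouts over N queries and print metrics-style lines.
RESOLVER="1.1.1.1"; NAME="www.example.com"; N=20
for i in $(seq 1 "$N"); do
  rcode=$(dig @"$RESOLVER" "$NAME" A +time=1 +tries=1 +noall +comments \
    | grep -oE 'status: [A-Z]+' | awk '{print $2}')
  echo "${rcode:-TIMEOUT}"
done | sort | uniq -c | awk '{printf "dns_rcode_total{rcode=\"%s\"} %d\n", $2, $1}'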
Joke #2: DNS caching is the only place where “it worked five minutes ago” is a design feature, not a confession.
Fast diagnosis playbook (first/second/third)
When DNS is on fire, speed matters. Not because you’re heroic, but because caching makes the blast radius time-dependent. Here’s the order that finds bottlenecks fast.
First: Is this user-experience or authoritative?
- Query a public recursive resolver from at least two networks. If it fails broadly, it’s likely authoritative/delegation or DNSSEC.
- Query your authoritative NS directly with +norecurse. If direct auth is fine but recursion is failing, it’s a resolver ecosystem problem (or DNSSEC validation).
- Check RCODE distribution: NXDOMAIN vs SERVFAIL vs timeout. Each points to different causes.
Second: Identify the class of failure
- Timeouts: network path, firewall, DDoS mitigation misfire, anycast POP issue, upstream packet loss.
- SERVFAIL: DNSSEC validation failure, broken authoritative response, resolver unable to reach NS, lame delegation, backend failure.
- NXDOMAIN: record missing, wrong zone deployed, negative caching, querying the wrong suffix, split-horizon confusion.
- Wrong answer: stale zone on one NS, view leakage, bad automation, CNAME target changed, compromised configuration.
Third: Pinpoint where in the chain it breaks
- Run +trace to see delegation and where it stops.
- Compare SOA serial across all authoritative NS. Inconsistent serials explain “only some users.”
- Test UDP and TCP (especially for DNSKEY/large answers). Truncation + blocked TCP is a classic.
- Validate DNSSEC with delv. If validation fails, your zone can look “fine” to non-validating resolvers while enterprises can’t resolve you at all.
If you do those steps in that order, you avoid the two common time-wasters: staring at application logs when name resolution is failing, and arguing about “propagation”
when you actually have inconsistent authoritative state.
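If you want that order captured as one thing to paste during an incident, here is a sketch; bash, dig, and delv assumed, and every resolver, zone, and NS name is a placeholder.
#!/usr/bin/env bash
# One-shot triage in the order above: recursive view, authoritative view, DNSSEC.
NAME="www.example.com"; ZONE="example.com"
NSLIST="ns1.example.net ns2.example.net ns3.example.net"
echo "== 1. recursive view (public resolvers) =="
for r in 1.1.1.1 8.8.8.8; do
  echo -n "$r: "
  dig @"$r" "$NAME" A +time=2 +tries=1 +noall +comments | grep -oE 'status: [A-Z]+' || echo "timeout"
done
echo "== 2. authoritative view (direct, +norecurse) =="
for ns in $NSLIST; do
  echo -n "$ns SOA: "
  dig +norecurse +time=2 +tries=1 @"$ns" "$ZONE" SOA +short
done
echo "== 3. DNSSEC validation =="
delv @1.1.1.1 "$ZONE" A 2>&1 | head -n 2
Compare the serials by eye (or reuse the script from Task 4) and you have done the first two steps in under a minute.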
Three corporate mini-stories from the DNS trenches
Mini-story 1: The outage caused by a wrong assumption
A mid-sized SaaS company migrated their marketing site to a new CDN. The change request was simple: swap an A record for a CNAME,
lower TTL for the cutover, and call it done. The engineer assumed that “if the CNAME resolves for me, it resolves for everyone.”
That assumption has ended many peaceful evenings.
The cutover looked good from the office network. It looked good from a public resolver. But enterprise customers started reporting failures:
browsers would hang for a while, then give up. The incident channel filled with screenshots and guesswork. The application was healthy; the origin was healthy; TLS was healthy.
The only common factor was “can’t reach the site.”
It turned out the CDN’s target name returned a large set of records with DNSSEC, triggering truncation. Some enterprise networks allowed UDP/53 but blocked TCP/53
(yes, still). For those users, the recursive resolver saw TC=1, tried TCP, got blocked, and returned SERVFAIL. Home users never noticed because their resolvers could use TCP.
The fix wasn’t in the web stack. It was a policy and monitoring fix: test DNS over TCP in synthetic checks, monitor SERVFAIL rates from “corporate-ish” vantage points,
and add a preflight step for “does this change increase response size or DNSSEC complexity.” The assumption died. Production got better.
Mini-story 2: The optimization that backfired
A different company ran their own authoritative DNS on a pair of anycasted clusters. Someone noticed a measurable chunk of latency on cold lookups
and decided to “optimize DNS” by setting TTLs to 10 seconds across the board. The theory: faster failover during incidents and quicker rollouts.
The reality: DNS is not a feature flag system. It’s a load multiplier.
It didn’t explode immediately. That’s the worst part. For a few days, everything felt snappier during a rollout and the change got praised as “more dynamic.”
Then a routine traffic increase hit: not a DDoS, just normal growth and a marketing campaign. Resolver caches expired constantly, authoritative query volume went up,
and the anycast clusters started dropping packets under peak load. Timeouts increased, then retries increased, then the retries created more load. Classic feedback loop.
Users experienced intermittent failures that were hard to reproduce. Some had cached answers; others were unlucky. Monitoring that only checked one resolver
and one name missed the early warning. It took far too long to connect the dots: “the optimization” was making DNS depend on perfect authoritative performance.
The rollback was simple: restore sane TTLs (60–300 seconds for front doors; longer where possible), and add rate-based alerting on authoritative QPS and error rates.
They also kept a small set of low-TTL records for real failover scenarios, behind change controls. The lesson: optimize for resiliency, not for the dopamine hit of quick changes.
Mini-story 3: The boring practice that saved the day
An enterprise with multiple business units had a habit that sounded dull: every DNS change required a “two-view validation” and “two-recursive validation.”
That meant checking internal and external answers, and checking at least two different recursive resolvers (one corporate, one public).
It was written down, enforced in code review, and occasionally mocked as bureaucracy.
One Friday, a team added a new internal-only hostname for a service. The change was supposed to land only in the internal view,
but a template variable was mis-set. The record was added to the external zone too, pointing at a private RFC1918 address.
If that leaked widely, external users would try to connect to an unroutable private IP—best case a timeout, worst case hitting something unintended on their own network.
The preflight checks caught it within minutes. The external check saw a private A record and failed loudly. The change was reverted before caches spread it widely.
The incident ticket never became a customer incident. The team got to enjoy their weekend, which is the real SRE KPI.
The boring practice wasn’t fancy. It was just specific, repeatable validation that matched known failure modes. It didn’t prevent mistakes.
It prevented mistakes from becoming outages.
Common mistakes: symptom → root cause → fix
DNS incidents are repetitive. So are the mistakes. Here are the common ones with crisp mapping from what you see to what to do.
1) Symptom: “Works for me, broken for users”
- Root cause: cached answers on your network; some resolvers still have old data; inconsistent authoritative NS; split-horizon differences.
- Fix: compare SOA serial across all authoritative NS; query multiple recursive resolvers; lower TTL before planned migrations, not during emergencies.
2) Symptom: Intermittent SERVFAIL, especially on enterprises
- Root cause: DNSSEC validation failing (expired RRSIG, wrong DS), or truncation requiring TCP that is blocked.
- Fix: run delv to validate; test with dig +tcp; fix the DS/DNSKEY rollover process; ensure TCP/53 works end-to-end.
3) Symptom: NXDOMAIN spikes after a deploy
- Root cause: record removed/renamed; wrong zone deployed; typo in name; negative caching extends impact.
- Fix: restore record; confirm SOA serial increment; understand negative TTL in SOA; monitor NXDOMAIN rate per-name.
4) Symptom: Slow resolution but not failing
- Root cause: packet loss; overloaded resolvers; upstream connectivity; EDNS UDP size issues causing retries/fallback.
- Fix: measure p95 latency; test with different resolvers; check truncation; adjust EDNS buffer size if you own resolvers; fix network loss.
5) Symptom: Only IPv6 users fail (or only some mobile networks)
- Root cause: broken AAAA record or broken IPv6 routing; clients preferring IPv6; asymmetric reachability.
- Fix: monitor AAAA separately; validate reachability of IPv6 endpoints; remove AAAA if you can’t support it (temporarily, with a plan).
6) Symptom: One specific record fails, everything else fine
- Root cause: CNAME target domain problem; record too large; wrong record type; wildcard interactions.
- Fix: monitor and test the entire chain; keep record sets small; avoid clever wildcard setups unless you enjoy mystery.
7) Symptom: Some resolvers get REFUSED
- Root cause: ACLs on authoritative servers; rate limiting too aggressive; geo-blocking; misconfigured views.
- Fix: review ACLs and response policy; tune RRL; test from multiple regions; don’t block recursive resolvers you don’t control unless legally required.
8) Symptom: Zone changes “randomly” revert or differ across NS
- Root cause: multiple sources of truth; partial deployments; broken NOTIFY/AXFR/IXFR; manual edits on one node.
- Fix: enforce single source of truth; automate deployment; audit zones across NS; alert on SOA mismatch immediately.
Checklists / step-by-step plan
A. Build a minimal DNS monitoring set (one afternoon, no drama)
- Pick 5–10 critical names: www, api, login, mail, plus mobile endpoints.
- For each name, define expected answers: A/AAAA ranges, CNAME targets, MX hosts, TTL bounds (a drift-check sketch follows this checklist).
- Run recursive checks from at least two networks (cloud region A and B; or cloud + on-prem).
- Run authoritative checks against each NS directly with +norecurse.
- Alert on failures: timeouts, SERVFAIL, wrong answers, and SOA serial mismatch.
- Dashboard RCODEs and latency per-name, not just aggregated across your whole zone.
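Here is the drift-check sketch promised above: a flat name/type/pattern list fed to dig; bash assumed, and every name and pattern is a placeholder for your own expectations.
#!/usr/bin/env bash
# Diff actual answers against expected patterns; alert on drift, not just on outages.
# Each input line: name, record type, regex the +short output must match.
while read -r name type pattern; do
  [ -z "$name" ] && continue
  got=$(dig "$name" "$type" +time=2 +tries=1 +short)
  if ! echo "$got" | grep -qE "$pattern"; then
    echo "DRIFT: $name $type returned: ${got:-nothing}"
  fi
done <<'EOF'
www.example.com A ^203\.0\.113\.
api.example.com A ^203\.0\.113\.
www.example.com CNAME cdn-provider\.net\.
EOF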
B. Make DNS changes safer (the “don’t be clever” plan)
- Lower TTL ahead of planned migrations (hours to a day before), not in the middle of an incident.
- Use staged rollouts: validate on a canary name first (e.g., canary-api); a preflight sketch follows this checklist.
- Two-view validation: internal and external answers must be intentionally different, never accidentally different.
- Two-recursive validation: test both a corporate resolver path and a public resolver.
- SOA serial discipline: enforce monotonic serials and alert on mismatch across NS.
- DNSSEC hygiene: track signature expiration and DS/DNSKEY rollovers with explicit runbooks.
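And the preflight sketch promised above, codifying the two-recursive and serial-discipline items for an externally published canary name; bash and dig assumed, and the resolvers, names, and expected serial are placeholders your pipeline would fill in.
#!/usr/bin/env bash
# Change preflight: the canary must resolve identically via two recursive paths,
# and every authoritative NS must already serve the expected SOA serial.
CANARY="canary-api.example.com"
CORP_RESOLVER="10.0.0.53"; PUBLIC_RESOLVER="1.1.1.1"
ZONE="example.com"; EXPECTED_SERIAL="2025123101"
a=$(dig @"$CORP_RESOLVER"   "$CANARY" A +time=2 +tries=1 +short | sort)
b=$(dig @"$PUBLIC_RESOLVER" "$CANARY" A +time=2 +tries=1 +short | sort)
[ -n "$a" ] && [ "$a" = "$b" ] || { echo "PREFLIGHT FAIL: recursive views disagree"; exit 1; }
for ns in ns1.example.net ns2.example.net ns3.example.net; do
  s=$(dig +norecurse @"$ns" "$ZONE" SOA +short | awk '{print $3}')
  [ "$s" = "$EXPECTED_SERIAL" ] || { echo "PREFLIGHT FAIL: $ns serial=${s:-none}"; exit 1; }
done
echo "PREFLIGHT OK"
For names that are intentionally different between views, the comparison flips: the two answers must differ in exactly the way the change intended.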
C. Production-ready alert routing
- Page the team that can fix it: DNS provider team, network team, or platform team. If the alert can’t be actioned, it’s noise.
- Include immediate commands in the alert: “Run dig +trace”, “Query each NS for SOA”, “Run delv”.
- Include expected vs actual in alert payload: record mismatch details, TTL values, and which resolvers failed.
- Correlate with change windows: DNS incidents are often change-correlated; your alert should show recent DNS pushes.
FAQ
1) Should I monitor DNS from inside my VPC, from the public internet, or both?
Both. Inside catches split-horizon mistakes and resolver issues in your environment. Outside catches what customers see. One viewpoint is how you end up surprised.
2) What’s the single most valuable DNS check?
Query each authoritative nameserver for the SOA and compare serials. It’s the fastest way to detect partial deployments, broken transfers, and “some users” incidents.
3) Isn’t “dig from one box” enough?
No. DNS failures are often regional, resolver-specific, or cache-dependent. One box gives you false confidence or false panic, depending on its cache and network.
4) How do I alert on “propagation” issues?
Stop calling it propagation and call it what it is: inconsistent authoritative state or delegation mismatch. Alert on SOA serial mismatch and on answers differing across NS.
5) How low should TTL be for load balancer front doors?
Commonly 60–300 seconds. Lower only if you have a tested failover procedure and authoritative capacity to handle increased QPS. “5 seconds everywhere” is a reliability smell.
6) Why do I see SERVFAIL when the authoritative server seems fine?
Often DNSSEC validation fails at the resolver, or the resolver can’t reach one of your NS due to network issues. Test with delv and query each NS directly.
7) Do I really need to care about TCP/53?
Yes. DNSKEY responses, large TXT records, and DNSSEC can trigger truncation. If TCP is blocked, some resolvers will fail. Monitor UDP and TCP paths.
8) How do I detect “wrong answer” issues automatically?
Define expected values: CNAME targets, IP ranges, and TTL bounds. Then run synthetic queries and diff the answer section. Alert on drift, not just on outages.
9) What about DNS over HTTPS (DoH) and DNS over TLS (DoT)?
If your user base includes environments that force DoH/DoT, include at least one synthetic check through those paths. Otherwise you’ll declare victory while browsers keep failing.
10) How many names should I monitor without drowning in alerts?
Start with 5–10 critical names plus zone health checks. Expand only when you can explain what action each alert triggers. Monitoring is not stamp collecting.
Conclusion: next steps you can do this week
DNS monitoring doesn’t need to be complex. It needs to be specific. Monitor the names that matter, from more than one vantage point, and test both recursion and authority.
Alert on the failure modes that predict real pain: SOA mismatch, SERVFAIL spikes, timeouts, wrong answers, and DNSSEC validation failures.
Practical next steps:
- Implement SOA serial consistency checks across all authoritative NS and page on mismatch.
- Add recursive checks from two networks and alert on sustained failure rate and p95 latency.
- Add UDP+TCP DNS checks for at least one DNSSEC-heavy query (DNSKEY) to catch truncation/TCP blocks.
- Define expected answers and TTL policy for your critical names, and alert on drift.
- Write a one-page DNS fast diagnosis runbook using the playbook section above, and attach it to alerts.
If you do just those, you’ll catch most DNS failures before your customers do. And if you’re lucky, before your CEO does.