When DNS breaks, it never breaks politely. It breaks while your CEO is demoing, your CI pipeline is “just doing a quick deploy,” and half the company suddenly can’t reach “the internet” (which, to them, is one SaaS login page).
The fastest way to get your time back is to stop treating every DNS error like a vibe. NXDOMAIN and SERVFAIL are not the same kind of failure. They point to different broken things, owned by different teams, fixed with different moves.
NXDOMAIN vs SERVFAIL: two codes, two failure worlds
NXDOMAIN in one sentence
NXDOMAIN means: “The domain name you asked for does not exist” (specifically: the server believes that name does not exist in DNS). It’s an answer, not a crash.
SERVFAIL in one sentence
SERVFAIL means: “I can’t answer this right now.” It’s a failure to produce an answer—often due to upstream issues, DNSSEC validation failures, lame delegations, broken recursion, or internal resolver trouble.
The production shortcut: what each usually implies
- NXDOMAIN: most often a data/config issue (wrong name, missing record, wrong zone, split-horizon mismatch, stale negative cache). The system is working; it’s telling you “no.”
- SERVFAIL: most often a path/validation/availability issue (resolver can’t reach authority, authority is misbehaving, DNSSEC chain is busted, recursion is blocked, MTU/fragmentation, EDNS issues, timeouts). The system is not successfully answering.
That’s the whole game. NXDOMAIN pushes you toward “does this name exist where I think it exists?” SERVFAIL pushes you toward “why can’t the resolver complete resolution safely?”
Joke #1: DNS is like a phone book that can return “no such person” or “I’m having a day”—and both ruin your weekend equally.
What to do immediately when you see each
If it’s NXDOMAIN:
- Confirm spelling and FQDN. Don’t laugh; do it.
- Check whether the record exists in the authoritative zone (not just in your IaC repo).
- Check whether you’re querying the right view (public vs internal / split-horizon).
- Check negative caching (SOA minimum / negative TTL) if you just created the record.
If it’s SERVFAIL:
- Check whether the resolver is failing for everyone or only you (local stub vs corporate recursive vs public resolvers).
- Check DNSSEC validation: try with and without validation to isolate.
- Check delegation health: NS records, glue, reachability, and whether authorities are “lame.”
- Check transport oddities: UDP fragmentation, EDNS buffer size, firewall rules, rate limiting.
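If you want those first checks as commands, here is a minimal sketch. It reuses the hypothetical names and addresses from the tasks later in this article (api.example.com, ns1.example.net, and a corporate resolver at 10.10.0.53); substitute your own.

cr0x@server:~$ dig @ns1.example.net api.example.com A +norecurse    # NXDOMAIN branch: ask an authority directly, bypassing caches and views
cr0x@server:~$ dig @10.10.0.53 api.example.com A                    # SERVFAIL branch: your resolver...
cr0x@server:~$ dig @1.1.1.1 api.example.com A                       # ...versus a public resolver
cr0x@server:~$ dig @10.10.0.53 api.example.com A +cdflag            # ...and with the CD bit set, to skip DNSSEC validation

If the authority has the record but your resolver says NXDOMAIN, it’s data/view/cache. If only the +cdflag query succeeds, it’s validation.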
Fast diagnosis playbook (what to check first)
This is the “I have five minutes before the incident channel becomes performance art” playbook. Follow it in order. Stop when you find the bottleneck.
Step 1: Reproduce with dig and capture the exact error
- Use dig with +norecurse and with normal recursion, and note status:, flags:, and SERVER:.
- Decision: If status is NXDOMAIN, pivot to “data and zone.” If SERVFAIL, pivot to “path and validation.”
Step 2: Change resolver to isolate where it fails
- Query your usual resolver, then a known-good public resolver, then the authoritative nameserver directly.
- Decision: If public works but corporate fails, it’s your recursion/egress/DNSSEC policy. If authoritative fails, it’s the zone/authority itself.
Step 3: Identify authoritative servers and test them individually
- Use dig NS, then query each NS directly for the exact record (see the loop sketch below).
- Decision: If one NS differs, you have zone propagation or hidden-master/transfer issues. If all fail, it’s global to the zone.
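A minimal loop for this step, assuming the zone is example.com and the record under investigation is api.example.com (hypothetical names):

cr0x@server:~$ for ns in $(dig example.com NS +short); do
>   echo "== $ns"
>   dig @"$ns" api.example.com A +norecurse +noall +comments +answer
> done

If one server answers NOERROR with the record while another returns NXDOMAIN or REFUSED, you have found your divergent authority.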
Step 4: If SERVFAIL, do the DNSSEC split test
- Use +dnssec, and also try disabling validation on a validating resolver (or use a non-validating resolver you control); see the sketch below.
- Decision: If disabling validation makes it work, you have a DNSSEC chain problem (DS mismatch, expired signatures, missing DNSKEY, bad NSEC/NSEC3, etc.).
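A sketch of the split test, assuming the failing name is secure.example.com and your validating resolver is 10.10.0.53 (both hypothetical):

cr0x@server:~$ dig @10.10.0.53 secure.example.com A +dnssec     # normal path: validation enforced
cr0x@server:~$ dig @10.10.0.53 secure.example.com A +cdflag     # CD bit set: asks the resolver not to validate
cr0x@server:~$ dig @1.1.1.1 secure.example.com A +dnssec        # second opinion from another validating resolver

If the +cdflag query returns NOERROR while the normal query returns SERVFAIL, you are looking at a validation failure, not a reachability problem.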
Step 5: Check caching and TTL behavior
- If you just fixed it and it still fails, flush caches where you can (local stub, recursive resolver) or wait out negative TTLs.
- Decision: If authoritative is correct but clients still see NXDOMAIN/SERVFAIL, you’re in cache territory.
Step 6: Look for network constraints that look like “DNS is down”
- Test UDP/53 and TCP/53 reachability. Watch for ICMP frag-needed. Check firewall logs, NAT timeouts, and egress allowlists.
- Decision: If TCP works but UDP fails (or vice versa), fix the network. DNS is just the messenger.
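A quick transport sketch, assuming ns1.example.net is one of your authorities (hypothetical name):

cr0x@server:~$ dig @ns1.example.net example.com SOA +notcp +bufsize=512      # small UDP answer: should work or come back truncated
cr0x@server:~$ dig @ns1.example.net example.com SOA +tcp                     # force TCP/53: proves the fallback path exists
cr0x@server:~$ dig @ns1.example.net example.com DNSKEY +dnssec +bufsize=1232 # larger answer: exercises EDNS buffer and fragmentation behavior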
How DNS gets to NXDOMAIN or SERVFAIL (without the fairy tale)
DNS resolution is a chain of responsibility. Your application typically asks a stub resolver (often on the host), which asks a recursive resolver (corporate DNS, ISP, or a public resolver), which walks the tree by querying authoritative servers.
NXDOMAIN: an authoritative “no” (or a believable one)
NXDOMAIN is returned when the responder believes the name does not exist. In a clean world, that comes from the authoritative server for the zone, backed by proof (SOA in authority section, possibly NSEC/NSEC3 with DNSSEC).
But in production, NXDOMAIN can also be:
- Split-horizon mismatch: internal view says “no such name,” public view has it (or the reverse).
- Search suffix accidents: the client asked for api, the resolver tried api.corp.example, got NXDOMAIN, and your app logs “api does not exist.”
- Negative caching: a resolver cached NXDOMAIN; you created the record; everyone keeps seeing NXDOMAIN until the negative TTL expires.
- Wildcard interactions: you thought a wildcard covers it; it doesn’t cover all types or all labels the way you assumed.
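To see the search-suffix effect directly, compare a bare name against an explicit FQDN; a sketch assuming the host’s search domain is corp.example (hypothetical):

cr0x@server:~$ grep -E '^(search|nameserver)' /etc/resolv.conf
cr0x@server:~$ dig api +search        # applies the search list, so this may really query api.corp.example
cr0x@server:~$ dig api.example.com.   # trailing dot: no search list, no surprises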
SERVFAIL: a resolver refusing to lie
SERVFAIL is the resolver saying, “I failed to obtain a reliable answer.” That can happen for many reasons, and a good resolver is conservative: it would rather fail than hand out an answer it can’t validate or complete.
Common SERVFAIL paths:
- DNSSEC validation failure: the resolver got an answer, but the signatures didn’t validate. It returns SERVFAIL to the client.
- Timeouts and unreachable authorities: recursion can’t reach the authoritative servers, often due to firewall rules, DDoS mitigation, broken anycast routes, or just dead nameservers.
- Lame delegation: NS records point at servers that aren’t authoritative for the zone. Resolvers waste time; some return SERVFAIL.
- Truncation and TCP issues: large DNS responses (DNSSEC makes them larger) get truncated over UDP, then TCP/53 is blocked, causing failures.
- EDNS/fragmentation problems: EDNS buffer sizes, PMTU blackholes, or middleboxes that “helpfully” drop fragmented UDP.
Joke #2: SERVFAIL is DNS saying “I could tell you, but then I’d have to validate you.”
A useful mental model: “Where is the truth stored, and who is failing to fetch it?”
Authoritative servers store the truth. Recursive resolvers fetch, validate, and cache that truth. NXDOMAIN is usually a truth problem. SERVFAIL is usually a fetch/validate problem.
Facts and history that actually help during incidents
These aren’t trivia-night facts. They change how you debug.
- NXDOMAIN is defined as RCODE 3. It’s not “no data.” “No data” is a NOERROR response with an empty answer section (often called NODATA).
- SERVFAIL is RCODE 2. It’s intentionally vague: it covers many internal failures, including upstream timeouts and DNSSEC validation failures.
- Negative caching is standardized. Resolvers cache NXDOMAIN and NODATA responses using the zone’s SOA values (commonly the SOA minimum field / negative TTL behavior), which is why “I just added the record” can still look broken.
- DNS was designed for UDP first. TCP exists for zone transfers and truncated answers, but real-world networks often treat TCP/53 as suspicious. That’s a recipe for SERVFAIL when responses get large.
- DNSSEC made answers bigger and debugging sharper. Validation failures often surface as SERVFAIL at the resolver, even when authoritative servers are happily serving data.
- “Lame delegation” is a classic failure mode. Delegating to nameservers that don’t actually serve the zone burns resolver time and increases tail latency—sometimes all the way to SERVFAIL.
- EDNS(0) was introduced to extend DNS without breaking it. Middleboxes sometimes break EDNS, leading to weird behaviors: timeouts, truncation loops, or SERVFAIL depending on resolver logic.
- The root and TLD infrastructure evolved to reduce fragility. Anycast deployments and multiple NS records are meant to improve availability, but can also mask partial failures until a route change exposes them.
- Search domains (resolv.conf) are a corporate foot-gun. They create queries you didn’t intend. Debugging needs fully-qualified names to avoid phantom NXDOMAINs.
One reliability idea worth keeping in your pocket, paraphrased from Werner Vogels (Amazon CTO): if you design for failure as normal, you’ll spend less time pretending it’s exceptional.
Practical tasks: commands, outputs, decisions (12+)
These are the things you actually run at 02:00. Each task includes: the command, an example of what you might see, what it means, and what you do next.
Task 1: Get the basic status and confirm which resolver answered
cr0x@server:~$ dig api.example.com A
; <<>> DiG 9.18.24 <<>> api.example.com A
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 42118
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; QUESTION SECTION:
;api.example.com. IN A
;; AUTHORITY SECTION:
example.com. 300 IN SOA ns1.example.net. hostmaster.example.com. 2025123101 7200 3600 1209600 300
;; Query time: 21 msec
;; SERVER: 10.10.0.53#53(10.10.0.53) (UDP)
;; WHEN: Wed Dec 31 10:12:01 UTC 2025
;; MSG SIZE rcvd: 115
Meaning: Your recursive resolver at 10.10.0.53 is returning NXDOMAIN. It also shows an SOA, suggesting the resolver believes it has authoritative proof of non-existence (or cached proof).
Decision: Don’t chase network yet. Verify whether api.example.com exists in the authoritative zone and whether you’re querying the right DNS view.
Task 2: Compare against a different recursive resolver
cr0x@server:~$ dig @1.1.1.1 api.example.com A
; <<>> DiG 9.18.24 <<>> @1.1.1.1 api.example.com A
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5329
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
api.example.com. 60 IN A 203.0.113.10
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; Query time: 18 msec
Meaning: Public recursion returns a valid A record; your corporate resolver returns NXDOMAIN. That’s not “DNS is down.” That’s a policy/view/caching issue on your side.
Decision: Check split-horizon configuration, forwarding rules, RPZ/blocklists, or stale negative cache on the corporate resolver.
Task 3: Force a fully qualified name to avoid search-domain lies
cr0x@server:~$ dig api A
; <<>> DiG 9.18.24 <<>> api A
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 1204
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; QUESTION SECTION:
;api. IN A
;; AUTHORITY SECTION:
. 86400 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2025123100 1800 900 604800 86400
Meaning: You asked for api. (the literal top-level name), not api.example.com. Apps and humans do this accidentally through search suffixes and missing dots.
Decision: Use FQDNs in configs. In incident debugging, always query the full name ending with a dot if you want to be pedantic: api.example.com.
Task 4: Check if it’s NXDOMAIN vs NODATA (NOERROR, empty answer)
cr0x@server:~$ dig www.example.com AAAA
; <<>> DiG 9.18.24 <<>> www.example.com AAAA
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 64430
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; AUTHORITY SECTION:
example.com. 300 IN SOA ns1.example.net. hostmaster.example.com. 2025123101 7200 3600 1209600 300
Meaning: The name exists, but there is no AAAA record. That’s not NXDOMAIN, and your fix is different.
Decision: If your app requires IPv6, add AAAA or fix client behavior. If not, ignore and stop paging yourself.
Task 5: Find the authoritative nameservers (and whether delegation looks sane)
cr0x@server:~$ dig example.com NS +noall +answer
example.com. 172800 IN NS ns1.example.net.
example.com. 172800 IN NS ns2.example.net.
Meaning: These are the servers the world will ask for example.com.
Decision: Query each NS directly for the record. If answers differ, you likely have zone sync problems.
Task 6: Query the authoritative server directly to confirm reality
cr0x@server:~$ dig @ns1.example.net api.example.com A +norecurse
; <<>> DiG 9.18.24 <<>> @ns1.example.net api.example.com A +norecurse
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33201
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
api.example.com. 60 IN A 203.0.113.10
Meaning: The authority says it exists and marks the answer authoritative (aa flag). If your recursive resolver gives NXDOMAIN, the issue is caching, filtering, split-horizon, or a broken forwarder chain.
Decision: Investigate the recursive resolver: cache state, RPZ, views, forwarding, and DNSSEC behavior.
Task 7: Check for lame delegation
cr0x@server:~$ dig @ns2.example.net example.com SOA +norecurse
; <<>> DiG 9.18.24 <<>> @ns2.example.net example.com SOA +norecurse
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 9012
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
Meaning: The server listed as an NS refuses to answer authoritatively. That’s a delegation problem (or ACL problem), and resolvers may end up SERVFAIL-ing under load or on unlucky selection.
Decision: Fix NS records to point to actual authoritative servers, or fix the server ACLs so it answers authoritatively for the zone.
Task 8: Check DNSSEC at the resolver boundary
cr0x@server:~$ dig secure.example.com A +dnssec
; <<>> DiG 9.18.24 <<>> secure.example.com A +dnssec
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 4411
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; SERVER: 10.10.0.53#53(10.10.0.53) (UDP)
Meaning: SERVFAIL with DNSSEC in play often signals validation failure, especially if other resolvers behave differently.
Decision: Compare with a resolver known to validate correctly, and inspect DS/DNSKEY/RRSIG chain for the zone.
Task 9: Use +trace to watch delegation step-by-step
cr0x@server:~$ dig secure.example.com A +trace
; <<>> DiG 9.18.24 <<>> secure.example.com A +trace
. 518400 IN NS a.root-servers.net.
...
com. 172800 IN NS a.gtld-servers.net.
...
example.com. 172800 IN NS ns1.example.net.
example.com. 172800 IN NS ns2.example.net.
secure.example.com. 60 IN A 203.0.113.20
Meaning: Trace shows the chain: root → TLD → your authoritative servers → final record. If it breaks mid-chain, you know which delegation layer is wrong.
Decision: If trace fails at the TLD delegation, check registrar-side NS/DS settings. If it fails at your NS, fix your authoritative infrastructure.
Task 10: Check TCP fallback (truncation problems)
cr0x@server:~$ dig dnssec-heavy.example.com DNSKEY +dnssec +bufsize=512
; <<>> DiG 9.18.24 <<>> dnssec-heavy.example.com DNSKEY +dnssec +bufsize=512
;; Truncated, retrying in TCP mode.
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20031
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
Meaning: The answer didn’t fit in UDP, so the resolver retried over TCP. If TCP/53 is blocked somewhere, clients may see SERVFAIL or timeouts.
Decision: Ensure TCP/53 is allowed between recursive resolvers and authoritative servers (and sometimes between clients and resolvers, depending on architecture).
Task 11: Confirm what your host is actually using for DNS
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Link 2 (eth0)
Current Scopes: DNS
Protocols: +DefaultRoute
Current DNS Server: 10.10.0.53
DNS Servers: 10.10.0.53 10.10.0.54
DNS Domain: corp.example
Meaning: Your systemd-resolved stub is forwarding to corporate resolvers and appending corp.example as a search domain.
Decision: If you see surprising queries, disable or constrain search domains for critical services. Also verify both listed DNS servers behave the same.
Task 12: Detect “one resolver is sick” by querying both explicitly
cr0x@server:~$ dig @10.10.0.53 api.example.com A +time=1 +tries=1
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31990
;; Query time: 1001 msec
;; SERVER: 10.10.0.53#53(10.10.0.53) (UDP)
cr0x@server:~$ dig @10.10.0.54 api.example.com A +time=1 +tries=1
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31991
;; ANSWER SECTION:
api.example.com. 60 IN A 203.0.113.10
Meaning: One recursive resolver is failing; the other works. This is more common than anyone wants to admit: bad cache, stuck upstream, CPU thrash, broken DNSSEC state.
Decision: Pull the bad resolver from rotation (DHCP option, load balancer, Anycast withdrawal), then debug it offline.
Task 13: Check BIND/unbound logs for validation and upstream errors
cr0x@server:~$ sudo journalctl -u unbound --since "10 minutes ago" | tail -n 10
Dec 31 10:05:22 resolver1 unbound[1321]: info: validation failure <secure.example.com. A>: signature expired
Dec 31 10:05:22 resolver1 unbound[1321]: info: resolving secure.example.com. A
Dec 31 10:05:22 resolver1 unbound[1321]: info: error: SERVFAIL secure.example.com. A IN
Meaning: The resolver is explicitly telling you the DNSSEC signature is expired. This is one of the cleanest SERVFAIL root causes you can get.
Decision: Fix DNSSEC signing for the zone (re-sign, correct time sync on signer, ensure automated rollovers). Temporary mitigation: route critical clients to a resolver policy that doesn’t validate (carefully), but treat that as an emergency bandage, not a lifestyle.
Task 14: Confirm authoritative server health and zone loading
cr0x@server:~$ sudo rndc status
version: BIND 9.18.24 (Stable Release)
running on ns1: Linux x86_64 6.5.0
boot time: Wed, 31 Dec 2025 08:12:07 GMT
last configured: Wed, 31 Dec 2025 09:58:41 GMT
number of zones: 128
debug level: 0
xfers running: 0
up time: 1h 59m
Meaning: BIND is up and not obviously overloaded. This doesn’t prove your zone is correct, but it rules out “named is dead.”
Decision: If you suspect zone load failures, inspect logs and confirm the zone serial and records.
Task 15: Verify the zone serial and whether secondaries are in sync
cr0x@server:~$ dig @ns1.example.net example.com SOA +noall +answer
example.com. 300 IN SOA ns1.example.net. hostmaster.example.com. 2025123101 7200 3600 1209600 300
cr0x@server:~$ dig @ns2.example.net example.com SOA +noall +answer
example.com. 300 IN SOA ns1.example.net. hostmaster.example.com. 2025123009 7200 3600 1209600 300
Meaning: The serials differ. One nameserver is serving an older zone. That can manifest as intermittent NXDOMAIN (depending on which NS the resolver picks).
Decision: Fix zone transfers (TSIG keys, allow-transfer ACLs, notify configuration, network reachability). Don’t ship “eventual consistency DNS” unless you like mystery outages.
Three corporate mini-stories (and what they teach)
Incident 1: The outage caused by a wrong assumption (NXDOMAIN masquerading as “network”)
At a mid-size company with a hybrid cloud setup, a team migrated an internal service from api.int.corp to api.internal.corp. They did the polite things: updated service discovery, changed a few configs, merged the PR, and moved on.
Then an on-call got paged: “Service unreachable.” The application logs showed repeated NXDOMAIN for api.int.corp. The on-call assumed it was a resolver outage because “NXDOMAIN means DNS can’t find it, so DNS must be down.” That assumption is how you lose an hour.
The actual problem was boring. One legacy batch job had the old hostname hardcoded. It ran once per hour, triggered retries, and made enough noise to look like a platform issue. DNS was working perfectly and consistently telling the truth: that name no longer existed.
What fixed it wasn’t heroics. They searched the config management repo for the string, found the job, deployed the new name, and added a temporary CNAME from the old name to the new name for a week. They also tightened monitoring to differentiate NXDOMAIN spikes by name, not just “DNS errors.”
Lesson: NXDOMAIN is often DNS doing its job. When you see NXDOMAIN, suspect your inputs first: names, zones, views, and caches. Don’t “reboot DNS” to fix a typo embedded in 200 nodes.
Incident 2: The optimization that backfired (SERVFAIL via DNSSEC + “smaller packets” lore)
A platform team wanted faster DNS. They tuned their recursive resolvers: smaller UDP buffer sizes, aggressive timeouts, and a couple of “we don’t need that” EDNS settings copied from an old forum thread. The graphs looked good for a month. Latency shaved, dashboards green. Everyone nodded wisely.
Then a partner enabled DNSSEC on a zone that the company depended on. Suddenly, a slice of queries started returning SERVFAIL. Not all. Just enough to ruin customer logins in one geography and make the incident commander question reality.
The root cause wasn’t the partner. It was the “optimization”: reduced EDNS buffer size meant more truncation, which forced TCP fallback. But TCP/53 egress from those resolvers was inconsistently allowed across NAT gateways. Sometimes it worked; sometimes it didn’t. Resolvers that couldn’t complete validation returned SERVFAIL. The partner’s DNSSEC just made the packets big enough to hit the trap.
The fix was unglamorous: revert the EDNS tweaks, standardize firewall policy to allow TCP/53 from recursive resolvers, and set timeouts based on real upstream behavior instead of ideology. They also added a canary check that deliberately queries a DNSSEC-heavy record to ensure TCP fallback works.
Lesson: DNS “optimizations” that ignore the full protocol (UDP and TCP, EDNS, DNSSEC) are just time bombs with better metrics.
Incident 3: The boring practice that saved the day (consistent delegation checks)
A larger enterprise ran their own authoritative DNS plus a managed secondary provider. Every change went through a checklist: update zone, increment serial, validate syntax, push to hidden master, confirm secondaries got the new serial, then run an external delegation check from at least two networks.
One day, a registrar-side change happened during a domain consolidation project. NS records were updated. A perfectly reasonable change—except one of the newly listed nameservers wasn’t actually serving the zone yet. Lame delegation: textbook.
The difference was that their post-change check caught it within minutes. They queried the listed NS set individually, saw one returning REFUSED, and rolled back the delegation before caches across the internet absorbed the bad state. Users never noticed. The project manager still got their consolidation.
Lesson: Delegation validation is boring the way seatbelts are boring. You don’t feel it working. You just notice when it’s missing.
Common mistakes: symptom → root cause → fix
1) “NXDOMAIN for a record we just created”
Symptom: dig returns NXDOMAIN for a name you added minutes ago.
Root cause: Negative caching at recursive resolvers (cached NXDOMAIN), or you updated only one DNS view/zone (internal vs public), or secondaries aren’t in sync.
Fix: Query authoritative directly to confirm the record exists. Check SOA serial on each NS. If authoritative is correct, wait out negative TTL or flush caches on resolvers you control.
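Two commands that help here. The flush assumes the host runs systemd-resolved; if your resolver is BIND or Unbound, use its own cache-flush mechanism instead.

cr0x@server:~$ dig @ns1.example.net example.com SOA +noall +answer    # the last SOA field is the negative-caching TTL
cr0x@server:~$ sudo resolvectl flush-caches                           # clears only the local stub cache, not upstream resolvers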
2) “Intermittent NXDOMAIN”
Symptom: Sometimes it resolves, sometimes NXDOMAIN.
Root cause: Split-brain authoritative set (some NS have the record, others don’t), or anycast nodes serving different zone versions.
Fix: Query each authoritative NS directly. Compare SOA serials. Fix transfer/notify and ensure deployment is atomic across authoritative nodes.
3) “SERVFAIL from corporate DNS, but public resolvers work”
Symptom: dig @10.10.0.53 returns SERVFAIL; dig @1.1.1.1 works.
Root cause: DNSSEC validation failure due to stale trust anchors, broken time sync on resolvers, policy filtering (RPZ), or blocked TCP/53 preventing validation completion.
Fix: Check resolver logs for validation errors. Ensure NTP is correct. Verify TCP/53 egress. Review RPZ hits and exceptions.
4) “SERVFAIL only for DNSSEC-enabled domains”
Symptom: Some zones always SERVFAIL, others fine.
Root cause: DNSSEC chain issues (expired RRSIG, DS mismatch after key rollover, incorrect DNSKEY).
Fix: Use dig +trace +dnssec and check DS at the parent vs DNSKEY at child. Re-sign zone or fix DS at registrar/parent.
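To compare the parent and child sides without becoming a cryptographer, a minimal sketch (example.com and ns1.example.net are hypothetical):

cr0x@server:~$ dig example.com DS +noall +answer                         # what the parent zone publishes for this child
cr0x@server:~$ dig @ns1.example.net example.com DNSKEY +noall +answer    # what the child zone itself serves
cr0x@server:~$ dig example.com A +dnssec +multiline +noall +answer       # RRSIG inception/expiration timestamps appear in the output

If the DS key tag matches no DNSKEY, or the RRSIG expiration is in the past, you’ve found the chain break.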
5) “Everything fails in Kubernetes, but nodes can resolve”
Symptom: Pods see SERVFAIL/NXDOMAIN; host resolves fine.
Root cause: CoreDNS upstream misconfig, NodeLocal DNSCache issues, iptables rules, or stub resolver differences.
Fix: Query from inside the pod to CoreDNS service IP and upstream. Check CoreDNS logs, configmap, and network policies for UDP/TCP 53.
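A quick way to test from inside the cluster is a throwaway debug pod; a sketch assuming kubectl access (the pod name and image tag are placeholders, and the CoreDNS label is the common default, not a guarantee for your cluster):

cr0x@server:~$ kubectl run dns-debug --rm -it --restart=Never --image=busybox:1.36 -- nslookup api.example.com
cr0x@server:~$ kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50    # CoreDNS pods usually carry the kube-dns label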
6) “NOERROR but app says ‘host not found’”
Symptom: DNS returns NOERROR with no answers; app fails.
Root cause: You queried a record type that doesn’t exist (AAAA vs A), or the app requires SRV/TXT and you only checked A.
Fix: Query the exact record types the app uses. For modern service discovery, check SRV/TXT and verify library behavior.
7) “SERVFAIL spikes after enabling DDoS protection”
Symptom: Sudden SERVFAIL under load or after security changes.
Root cause: Rate limiting or firewall rules dropping fragments, blocking TCP/53, or dropping EDNS traffic.
Fix: Validate UDP and TCP paths. Tune rate limits. If you must filter, do it knowingly and test DNSSEC-heavy responses.
8) “NXDOMAIN for internal name from laptops on VPN”
Symptom: On VPN it fails; off VPN it works (or vice versa).
Root cause: Split DNS not correctly pushed via VPN, or corporate resolver views depend on source IP ranges.
Fix: Verify which resolvers clients use when on VPN. Confirm view ACLs and that VPN address pools are included.
Checklists / step-by-step plan
Checklist A: When you see NXDOMAIN
- Confirm the exact FQDN being queried (including trailing dot behavior and search suffixes).
- Query authoritative directly for the record type in question (A, AAAA, CNAME, SRV).
- Check for split-horizon: query from inside and outside networks; query internal and public authorities.
- Check SOA serial on each authoritative NS; confirm they match.
- If it was recently created/changed, account for negative caching: wait or flush caches you control.
- Confirm you didn’t publish only a CNAME without its target, or publish a record in the wrong zone.
- Fix at the source of truth (zone file / DNS provider), not in random resolvers.
Checklist B: When you see SERVFAIL
- Identify which resolver returns SERVFAIL (local stub, corporate recursive, public).
- Query the same name via multiple resolvers to isolate scope.
- Run dig +trace to see where the chain breaks.
- Test authoritative servers individually; look for timeouts, REFUSED, or inconsistent answers.
- Investigate DNSSEC: compare DS and DNSKEY, check for expired signatures, confirm time sync on signers and resolvers.
- Test UDP and TCP behavior (truncation and fallback). Ensure TCP/53 isn’t blocked.
- Check resolver logs for validation failures, upstream timeouts, or “lame delegation” warnings.
- Mitigate: shift clients to healthy resolvers, withdraw broken anycast nodes, or temporarily disable validation only where risk is understood.
Step-by-step plan: the “restore service first, then fix the cause” flow
- Stabilize: If one resolver is failing, remove it from rotation. If one authoritative is broken, take it out of NS set (carefully) or fix it fast.
- Confirm truth: Query authoritative sources directly. Determine whether the record exists and is consistent across NS.
- Fix correctness: For NXDOMAIN, fix zone data. For SERVFAIL, fix reachability/validation/delegation.
- Fix propagation: Ensure zone transfers complete, serial increments, and caches expire or are flushed.
- Prevent recurrence: Add delegation checks, DNSSEC expiry monitoring, TCP/53 reachability tests, and canaries from multiple networks.
FAQ
1) Is NXDOMAIN always an authoritative answer?
No. It’s supposed to reflect authoritative non-existence, but you can see NXDOMAIN from recursive caches, RPZ policies, or the wrong DNS view. Always verify by querying the authoritative NS directly.
2) What’s the difference between NXDOMAIN and “no answer”?
NXDOMAIN means the name doesn’t exist. “No answer” is usually NOERROR with an empty answer section (NODATA): the name exists, but not that record type.
3) Why does SERVFAIL happen when the authoritative server seems fine?
Because resolvers validate and enforce policy. DNSSEC validation failures, blocked TCP fallback, or upstream timeouts can cause SERVFAIL even if the authority is serving data.
4) Can a firewall cause NXDOMAIN?
Firewalls more commonly cause timeouts or SERVFAIL, but they can indirectly contribute to NXDOMAIN if your resolver fails over to a different view or forwarder chain that returns NXDOMAIN. Treat it as “unlikely but possible,” and isolate by querying authority directly.
5) Why is SERVFAIL sometimes intermittent?
Resolvers may pick different authoritative servers each time. If one NS is unreachable, lame, or serving broken DNSSEC, some resolution paths succeed and others fail. That’s why per-NS direct queries are a staple.
6) How do I tell if DNSSEC is the problem without becoming a cryptographer?
Do a comparison test: query a validating resolver and a non-validating resolver you control, then use dig +trace +dnssec. Resolver logs often spell it out (expired signatures, DS mismatch).
7) Should I disable DNSSEC validation to “fix SERVFAIL”?
Only as a temporary, explicitly risk-accepted mitigation for critical paths, and only if you control the resolver policy. The real fix is to repair the DNSSEC chain or transport issues.
8) Why does switching to TCP fix some DNS issues?
Large responses (especially with DNSSEC) may not fit in UDP. DNS uses truncation to signal “retry over TCP.” If UDP fragmentation or EDNS is broken, TCP can be more reliable—assuming TCP/53 is allowed.
9) What’s the quickest way to prove it’s split-horizon DNS?
Query the same FQDN against internal resolvers and external resolvers, and compare. If public returns NOERROR and internal returns NXDOMAIN (or different IPs), you have split-horizon behavior by design or by accident.
10) How do I avoid getting fooled by cached failures?
Always include an authoritative direct query in your workflow. If authority is correct and recursion is wrong, you’re dealing with caching, policy, or resolver health—not missing records.
Conclusion: what to do next time
NXDOMAIN and SERVFAIL are not just two flavors of “DNS angry.” They are different signposts.
- NXDOMAIN: assume the system is answering. Verify the name, the zone, the view, and the caches.
- SERVFAIL: assume the resolver couldn’t complete resolution safely. Investigate DNSSEC, delegation health, reachability, and UDP/TCP behavior.
Practical next steps you can do this week:
- Build a tiny runbook page that starts with “NXDOMAIN vs SERVFAIL” and includes the Step 1–4 commands above.
- Add a periodic delegation/authority consistency check (query each NS and compare SOA serials).
- Add a canary DNSSEC-heavy query that validates TCP fallback works end-to-end (a sketch follows this list).
- Teach your incident process to capture dig output including status, SERVER, and flags. Screenshots are not telemetry.
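For the canary item above, a minimal sketch under stated assumptions: the resolver IP and test name are placeholders, and you should pick a record you actually depend on that returns a large, signed answer.

#!/bin/sh
# Hypothetical canary: fails if a large, signed answer can't be resolved through the resolver under test.
# The small +bufsize also forces truncation, so the client-to-resolver TCP retry gets exercised too.
RESOLVER=10.10.0.53
NAME=dnssec-heavy.example.com
out=$(dig @"$RESOLVER" "$NAME" DNSKEY +dnssec +bufsize=512 2>&1)
echo "$out" | grep -q 'status: NOERROR' || { echo "DNS canary failed for $NAME via $RESOLVER"; echo "$out"; exit 1; }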
DNS will still break. But it won’t get to waste your time by being mysterious about it.