DNS SERVFAIL from Upstream: How to Prove Your Provider Is the Problem

SERVFAIL is the DNS equivalent of “something went wrong, don’t ask me.” Your app sees timeouts, your load balancer goes “healthy?” then “lol no,” and the business blames “the network.” Meanwhile your resolver is shrugging because someone upstream returned an error and recursion stopped.

When the upstream resolver is your ISP, a cloud DNS “forwarder,” or a managed recursive service, you don’t get to fix it—only to prove it. This is the field guide for doing exactly that: gathering evidence that survives the corporate escalation gauntlet, and separating “our bug” from “their outage” without guessing.

What SERVFAIL actually means (and what it doesn’t)

SERVFAIL is DNS response code 2. It’s intentionally vague: “the name server was unable to process this query due to a problem with the name server.” In practice, for recursive resolvers, it often means: “I tried recursion and hit an internal error, policy block, DNSSEC validation failure, timeout, or upstream failure.”

SERVFAIL is not NXDOMAIN

NXDOMAIN is a clean “does not exist.” SERVFAIL is “something prevented me from deciding.” Treat them differently. NXDOMAIN is data. SERVFAIL is a reliability event.

SERVFAIL might be your fault, even when it looks upstream

If you forward all queries to a provider resolver and your local caching resolver is configured as a pure forwarder, you’ve created a single chokepoint. When that upstream resolver misbehaves, your local server reports SERVFAIL and you feel like you’re losing your mind. But the root cause might still be your design choice: no fallback, no diversity, bad MTU, broken DNSSEC trust anchors, or overly strict policy.

We’re not here to philosophize. We’re here to prove where the fault lives and what to do about it.

Joke #1: DNS is the only system where “it’s cached” can mean “it’s fine,” “it’s broken,” or “it’s broken but in a different way.”

Fast diagnosis playbook (first/second/third)

First: decide if it’s local, upstream recursive, or authoritative

  • Query your local resolver (whatever your hosts actually use). If it SERVFAILs, keep going.
  • Query a known-good public resolver from the same host. If it works there, the domain is probably fine and your upstream path is suspect.
  • Trace to authoritative (non-recursive walk). If authoritative answers are healthy, upstream recursion/validation is likely the issue.

Second: differentiate DNSSEC failures vs transport failures vs policy blocks

  • DNSSEC: SERVFAIL on validating resolvers; works with +cd (checking disabled) or from non-validating resolvers.
  • Transport/MTU/fragmentation: UDP responses truncated or dropped; TCP fallback fails; intermittent by network path.
  • Policy/RPZ/filtering: consistent SERVFAIL/NXDOMAIN only on provider resolvers; other resolvers succeed; sometimes the provider admits “security filtering.”

Third: capture proof with time, resolver IPs, and packet traces

  • Log the exact resolver IP that returned SERVFAIL, timestamps, query name/type, and whether TCP was attempted.
  • Collect dig outputs with +stats and +dnssec.
  • Run a short targeted tcpdump to show upstream responses (or the lack of them).

If you do only one thing from this article: prove the authoritative chain answers correctly, and show that only your provider’s recursive resolvers fail. That turns “maybe” into “ticket.”
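
A minimal sketch of that comparison, assuming bash and dig; the resolver IPs and test name below are placeholders to replace with your own:

name="www.example.com"
for r in 10.0.0.53 10.0.0.54 203.0.113.53 1.1.1.1; do
  # Pull the RCODE out of dig's header; an empty result means no response at all.
  rcode=$(dig @"$r" "$name" A +time=3 +tries=1 | grep -oE 'status: [A-Z]+' | cut -d' ' -f2)
  printf '%s resolver=%s rcode=%s\n' "$(date -u +%FT%TZ)" "$r" "${rcode:-TIMEOUT}"
done

Run it from each vantage point and paste the output into the ticket verbatim.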

Facts & history that matter in real outages

  1. DNS predates the web. It was designed in the early 1980s as a distributed database to replace HOSTS.TXT; reliability assumptions were… optimistic.
  2. RCODE 2 (SERVFAIL) was always vague by design. The protocol didn’t want to leak internal details, which is great for simplicity and terrible for incident response.
  3. EDNS0 (1999) changed the failure modes. Larger UDP payloads reduced TCP usage, but introduced fragmentation/MTU problems that still cause “random” SERVFAIL/timeout behavior.
  4. DNSSEC (2000s) made “data integrity” a first-class feature. It also made “perfectly reachable domains” fail validation when signatures, DS records, or clocks are wrong.
  5. Negative caching became standardized behavior. Caching NXDOMAIN and other negative answers improves performance but can prolong pain after a fix if TTLs are high.
  6. Anycast resolvers changed debugging. “8.8.8.8” is not one box; provider resolvers often anycast too, so behavior can vary by geography and time.
  7. Glue records are a recurring source of outages. Delegations that rely on glue can break in subtle ways when registrars/parents are updated incorrectly.
  8. TCP fallback for DNS is not optional in practice. Firewalls that block DNS over TCP create failures that look like upstream SERVFAIL.
  9. Resolvers implement policy layers. RPZ, malware blocking, “family filters,” and enterprise policies can turn a normal query into SERVFAIL or NXDOMAIN—by design.

A mental model: stub → recursive → authoritative

Most incidents get messy because people don’t agree what “DNS server” they’re talking about. You need a clean mental model:

  • Stub resolver: the client library on your host (often via systemd-resolved), asking someone else to do recursion.
  • Recursive resolver: your caching/validating server (Unbound, BIND, Knot Resolver) or your provider’s managed resolver, performing recursion and caching results.
  • Authoritative servers: the servers for the domain, answering from zone data (often behind DNS providers, load balancers, and anycast).
  • Parent zone: delegation and DS records live above you (.com, .net, etc.). Many “authoritative problems” are actually parent/registrar problems.

“SERVFAIL from upstream” usually means: your stub asked your recursive, and your recursive asked its upstream forwarder. That upstream forwarder returned SERVFAIL or timed out. Your recursive then returned SERVFAIL to the stub. The trick is to prove exactly which hop failed and why.
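
One way to pin the failing hop is to ask every layer the same question explicitly. A sketch, using this article's placeholder resolver and forwarder addresses plus example.com's authoritative servers (shown later in the trace output):

cr0x@server:~$ resolvectl query www.example.com                      # stub path (systemd-resolved)
cr0x@server:~$ dig @10.0.0.53 www.example.com A                      # your recursive/forwarder
cr0x@server:~$ dig @203.0.113.53 www.example.com A                   # its upstream forwarder
cr0x@server:~$ dig @a.iana-servers.net www.example.com A +norecurse  # an authoritative server

The first hop that flips from NOERROR to SERVFAIL (or to silence) is the one you document.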

One paraphrased idea, often attributed to Gene Kim, is worth keeping in mind here: “Hope is not a strategy; measure and verify.”

Practical tasks: commands, expected output, and decisions (12+)

These tasks are designed to build an evidence chain. Run them from: (1) an affected host, (2) a host in a different network segment, and ideally (3) a clean vantage point like a tiny VM outside your provider.

Task 1: Identify what resolver the host is actually using

cr0x@server:~$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub

Link 2 (eth0)
    Current Scopes: DNS
         Protocols: +DefaultRoute
Current DNS Server: 10.0.0.53
       DNS Servers: 10.0.0.53 10.0.0.54

What it means: Your host is sending DNS to 10.0.0.53/10.0.0.54. Those are your first suspects, not “the internet.”

Decision: All further testing must query these resolvers explicitly. If you skip this, you’ll “prove” the wrong thing.

Task 2: Reproduce the SERVFAIL against the resolver in use

cr0x@server:~$ dig @10.0.0.53 www.example.com A +stats

; <<>> DiG 9.18.24 <<>> @10.0.0.53 www.example.com A +stats
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 41277
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; QUESTION SECTION:
;www.example.com.                IN      A

;; Query time: 58 msec
;; SERVER: 10.0.0.53#53(10.0.0.53) (UDP)
;; WHEN: Tue Dec 31 12:04:55 UTC 2025
;; MSG SIZE  rcvd: 40

What it means: The resolver returned SERVFAIL quickly (58 ms). That’s not a generic timeout; it’s an active failure path.

Decision: Now test the same name against multiple other resolvers to isolate scope.

Task 3: Compare with a known-good public resolver from the same host

cr0x@server:~$ dig @1.1.1.1 www.example.com A +stats

; <<>> DiG 9.18.24 <<>> @1.1.1.1 www.example.com A +stats
; (1 server found)
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5013
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
www.example.com.         300     IN      A       93.184.216.34

;; Query time: 14 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; WHEN: Tue Dec 31 12:05:08 UTC 2025
;; MSG SIZE  rcvd: 60

What it means: The domain resolves fine elsewhere. This points away from authoritative outage and toward your recursive path/provider behavior.

Decision: Start building “only these resolver IPs fail” evidence.

Task 4: Check if it’s DNSSEC-related (the classic SERVFAIL trap)

cr0x@server:~$ dig @10.0.0.53 www.example.com A +dnssec +multi

; <<>> DiG 9.18.24 <<>> @10.0.0.53 www.example.com A +dnssec +multi
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 60172
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

What it means: Still SERVFAIL. Not conclusive yet.

Decision: Try +cd (checking disabled). If that works, you likely have DNSSEC validation failure in the upstream resolver.

Task 5: Query with checking disabled to isolate validation failures

cr0x@server:~$ dig @10.0.0.53 www.example.com A +cd +stats

; <<>> DiG 9.18.24 <<>> @10.0.0.53 www.example.com A +cd +stats
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28212
;; flags: qr rd ra cd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
www.example.com.         300     IN      A       93.184.216.34

;; Query time: 31 msec
;; SERVER: 10.0.0.53#53(10.0.0.53) (UDP)
;; WHEN: Tue Dec 31 12:06:01 UTC 2025
;; MSG SIZE  rcvd: 60

What it means: With DNSSEC checking disabled, it resolves. That strongly suggests DNSSEC validation failure in the resolver path (trust anchor, bad DS, expired signatures, clock skew, or upstream bug).

Decision: Gather DNSSEC chain evidence and check resolver logs; prepare to tell the provider “validates incorrectly” rather than “DNS is down.”
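
A short sketch for gathering that chain evidence with dig (same placeholder resolvers as above); compare the RRSIG inception/expiration timestamps against the current time:

cr0x@server:~$ dig @1.1.1.1 example.com DS +dnssec +multi
cr0x@server:~$ dig @1.1.1.1 example.com DNSKEY +dnssec +multi
cr0x@server:~$ dig @a.iana-servers.net www.example.com A +dnssec +norecurse +multi

If the DS in the parent matches a DNSKEY and the signatures are current, the domain's chain is sound and the validating resolver is the likely culprit; if not, the domain owner has work to do.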

Task 6: Walk the delegation chain using trace (authoritative path proof)

cr0x@server:~$ dig +trace www.example.com A

; <<>> DiG 9.18.24 <<>> +trace www.example.com A
.                       518400  IN      NS      a.root-servers.net.
...
com.                    172800  IN      NS      a.gtld-servers.net.
...
example.com.            172800  IN      NS      a.iana-servers.net.
example.com.            172800  IN      NS      b.iana-servers.net.
www.example.com.        300     IN      A       93.184.216.34

What it means: You can reach the authoritative chain and get an A record. This is your “authoritative is fine” receipt.

Decision: If trace works reliably while your provider resolver SERVFAILs, you have a strong upstream-resolver case.

Task 7: Confirm the provider resolver is the one failing (query it directly)

cr0x@server:~$ dig @203.0.113.53 www.example.com A +stats

; <<>> DiG 9.18.24 <<>> @203.0.113.53 www.example.com A +stats
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 1400
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; Query time: 22 msec
;; SERVER: 203.0.113.53#53(203.0.113.53) (UDP)
;; WHEN: Tue Dec 31 12:06:44 UTC 2025
;; MSG SIZE  rcvd: 40

What it means: A specific upstream recursive resolver IP returned SERVFAIL quickly. This is gold for escalation: it’s precise and reproducible.

Decision: Repeat against the provider’s secondary resolver(s). If only one fails, you’ve likely hit a sick anycast POP or a partial outage.

Task 8: Check whether TCP fallback works (fragmentation/MTU suspicion)

cr0x@server:~$ dig @203.0.113.53 www.example.com DNSKEY +tcp +dnssec +stats

; <<>> DiG 9.18.24 <<>> @203.0.113.53 www.example.com DNSKEY +tcp +dnssec +stats
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 32999
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; Query time: 35 msec
;; SERVER: 203.0.113.53#53(203.0.113.53) (TCP)
;; WHEN: Tue Dec 31 12:07:18 UTC 2025
;; MSG SIZE  rcvd: 1102

What it means: TCP works and returns DNSKEY. If UDP queries SERVFAIL but TCP succeeds, you might be dealing with UDP path issues, fragmentation drops, or firewall weirdness.

Decision: If TCP works but UDP doesn’t, focus on MTU/EDNS0 sizing and network filtering rather than “DNS server is down.”

Task 9: Force small EDNS0 buffer to avoid fragmentation

cr0x@server:~$ dig @203.0.113.53 www.example.com A +bufsize=1232 +stats

; <<>> DiG 9.18.24 <<>> @203.0.113.53 www.example.com A +bufsize=1232 +stats
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49811
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
www.example.com.         300     IN      A       93.184.216.34

;; Query time: 20 msec
;; SERVER: 203.0.113.53#53(203.0.113.53) (UDP)
;; WHEN: Tue Dec 31 12:07:47 UTC 2025
;; MSG SIZE  rcvd: 60

What it means: With a conservative buffer size, it succeeds. That implicates fragmentation/PMTUD problems between you and the resolver or between the resolver and authoritative servers.

Decision: Your provider may be fine but the network path (possibly their edge) is dropping fragments. Still their problem, but now you can describe it in packets, not feelings.

Task 10: Inspect your local caching resolver/forwarder configuration (Unbound example)

cr0x@server:~$ sudo unbound-control status
version: 1.19.0
verbosity: 1
threads: 2
modules: 2 [ validator iterator ]
uptime: 38122 seconds
options: control(ssl)
unbound (pid 1221) is running...

What it means: Unbound is validating (validator module loaded). If upstream fails only on validating queries, this matters.

Decision: If you’re forwarding to a validating upstream while also validating locally, don’t. Pick one validation point and make it predictable.

Task 11: Check resolver logs for validation failures (systemd-journald)

cr0x@server:~$ sudo journalctl -u unbound --since "10 min ago" | tail -n 8
Dec 31 12:05:59 server unbound[1221]: info: validation failure <www.example.com. A>: signature expired
Dec 31 12:06:00 server unbound[1221]: info: error: SERVFAIL <www.example.com. A>: validation failure

What it means: This is explicit: signatures expired (or appear expired). That can be genuine or it can be clock skew.

Decision: Immediately verify system time and NTP health on the resolver. If time is sane, this becomes an authoritative or provider DNSSEC issue.

Task 12: Verify time sync (DNSSEC cares, a lot)

cr0x@server:~$ timedatectl
               Local time: Tue 2025-12-31 12:08:21 UTC
           Universal time: Tue 2025-12-31 12:08:21 UTC
                 RTC time: Tue 2025-12-31 12:08:21
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

What it means: Clock is synchronized. Good. One less self-own.

Decision: If DNSSEC failures persist with correct time, the failure is upstream or in the domain’s signing/DS chain.

Task 13: Check for TCP/53 being blocked (locally or by network policy)

cr0x@server:~$ nc -vz 203.0.113.53 53
Connection to 203.0.113.53 53 port [tcp/domain] succeeded!

What it means: TCP/53 is reachable. If it failed, you’d expect UDP truncation to become SERVFAIL/timeouts on larger responses.

Decision: If TCP/53 is blocked, fix that before blaming anyone. If you can’t, you must tune EDNS0 down aggressively.
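
If the resolver you control happens to be Unbound, a conservative EDNS0 setting looks roughly like this (file path illustrative; reload afterwards):

# /etc/unbound/unbound.conf.d/edns.conf
server:
    # Keep UDP answers under common fragmentation thresholds; TCP handles anything larger.
    edns-buffer-size: 1232
    max-udp-size: 1232

cr0x@server:~$ sudo unbound-control reload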

Task 14: Packet capture to prove the upstream behavior

cr0x@server:~$ sudo tcpdump -ni eth0 -c 6 host 203.0.113.53 and port 53
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:09:01.120001 IP 10.0.1.20.53422 > 203.0.113.53.53: 41277+ A? www.example.com. (33)
12:09:01.141883 IP 203.0.113.53.53 > 10.0.1.20.53422: 41277 ServFail 0/0/1 (40)
12:09:02.001220 IP 10.0.1.20.39511 > 203.0.113.53.53: 5013+ A? www.example.com. (33)
12:09:02.015430 IP 203.0.113.53.53 > 10.0.1.20.39511: 5013 1/0/1 A 93.184.216.34 (60)
6 packets captured
6 packets received by filter
0 packets dropped by kernel

What it means: You have a timestamped, on-wire record that the provider resolver sometimes emits SERVFAIL and sometimes succeeds. If you’re seeing intermittent behavior, this is how you prove it’s real.

Decision: If the SERVFAIL arrives quickly from upstream, it’s not your firewall dropping replies. If there’s no reply, you pivot to reachability/packet loss/anycast path issues.
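
When the provider accepts pcaps, a write-to-file variant keeps the capture small and easy to correlate (filename and packet count are illustrative):

cr0x@server:~$ sudo tcpdump -ni eth0 -w servfail-evidence.pcap -c 200 host 203.0.113.53 and port 53

Note the capture’s start and stop times in the ticket so the provider can line it up with their own resolver logs.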

Task 15: Test multiple anycast paths by querying from another network namespace or host

cr0x@server:~$ dig @203.0.113.53 www.example.com A +stats
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 21001
;; Query time: 19 msec
;; SERVER: 203.0.113.53#53(203.0.113.53) (UDP)

cr0x@server:~$ ssh cr0x@bastion 'dig @203.0.113.53 www.example.com A +stats'
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10442
;; Query time: 12 msec
;; SERVER: 203.0.113.53#53(203.0.113.53) (UDP)

What it means: Same resolver IP, different vantage point, different outcome. That’s classic anycast POP inconsistency or regional filtering.

Decision: Your escalation should include “fails from ASN/location A, succeeds from ASN/location B,” plus timestamps.

Task 16: Confirm whether your resolver is forwarding (and to whom)

cr0x@server:~$ sudo grep -R "forward-addr" /etc/unbound/unbound.conf /etc/unbound/unbound.conf.d 2>/dev/null
/etc/unbound/unbound.conf.d/forward.conf:forward-addr: 203.0.113.53
/etc/unbound/unbound.conf.d/forward.conf:forward-addr: 203.0.113.54

What it means: Your recursion is outsourced. That’s not inherently wrong, but now your provider’s resolver is a dependency you must treat like one.

Decision: If the provider is flaky and you can’t fix them quickly, switch to full recursion locally or diversify forwarders (with care).
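
Both options sketched for Unbound, matching the forwarder config found above (the second address is a placeholder for an independently operated resolver):

# /etc/unbound/unbound.conf.d/forward.conf
# Option A: keep forwarding, but to two independently operated upstreams.
forward-zone:
    name: "."
    forward-addr: 203.0.113.53
    forward-addr: 198.51.100.53
    # forward-first: yes   # assumption: lets Unbound fall back to its own recursion if the forwarders fail

# Option B: full local recursion. Remove or comment out the forward-zone for "."
# entirely and let Unbound iterate from the roots itself.

Either way, reload Unbound and keep the change in configuration management so the next incident responder knows which mode they are in.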

How to build provider-grade proof (without drama)

Providers aren’t evil. They’re busy. Also their first response is often “works for us.” Your job is to make that response impossible.

What “proof” looks like to a DNS operations team

  • Exact resolver IP that returns SERVFAIL (not “your DNS”).
  • Exact query: name, type, DO/CD bits if relevant, and whether UDP or TCP.
  • Timestamps with timezone and frequency (“1/20 queries fail” or “fails for 7 minutes”); a quick way to measure that rate is sketched after this list.
  • Comparison data: same query succeeds against another recursive resolver.
  • Authoritative path sanity: dig +trace works (or shows where it breaks).
  • Packet capture excerpt showing SERVFAIL response code or missing responses.
  • Network context: your source IP range/ASN, region, whether NAT is involved.
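
A minimal sketch for measuring that failure rate (assumes bash and dig; the resolver IP and name are this article’s placeholders):

fail=0; total=20
for i in $(seq "$total"); do
  # Anything that isn't a clean NOERROR -- SERVFAIL, REFUSED, or no reply -- counts as a failure.
  dig @203.0.113.53 www.example.com A +time=2 +tries=1 | grep -q 'status: NOERROR' || fail=$((fail+1))
  sleep 1
done
echo "$(date -u +%FT%TZ) resolver=203.0.113.53 name=www.example.com failures=$fail/$total"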

Don’t send them a novel; send a dossier

Good escalation is tight:

  • One paragraph summary: what’s broken, impact, when it started.
  • A bullet list of resolver IPs tested and outcomes.
  • Three command outputs maximum in the first message. Attach the rest if the ticketing system allows.
  • One tcpdump excerpt or pcap if they accept it.

What to avoid

  • “DNS is down.” No. Name the resolver and the query.
  • “It’s random.” It’s intermittent. Provide a failure rate, time window, and whether it correlates with EDNS0 size.
  • “We rebooted things.” That’s a smell, not evidence.

Joke #2: The fastest way to get a provider to act is to say “I have a packet capture,” because now it’s their problem in high definition.

Three corporate mini-stories (anonymized, technically plausible, and painfully familiar)

Incident caused by a wrong assumption: “SERVFAIL means authoritative is down”

A mid-sized SaaS company started getting alarms: API error rates spiked, and a few regions reported login failures. The on-call saw SERVFAIL for the identity provider hostname and did what many people do under stress: assumed the domain’s authoritative DNS must be broken.

They escalated to the vendor, opened a severity ticket, and started drafting a customer incident notice. The vendor replied that their authoritative servers were healthy and provided their own query logs showing normal traffic. The on-call doubled down. “But my dig says SERVFAIL.”

A calmer engineer asked a basic question: “Which resolver did you query?” The answer was the company’s own forwarder, which forwarded to a managed recursive service bundled with their connectivity provider. Queries to a public resolver worked instantly. dig +trace showed the delegation and authoritative answers were fine.

The real failure was DNSSEC validation in one anycast POP of the managed resolver. The provider’s resolver returned SERVFAIL quickly for DNSSEC-valid responses during a key rollover window. The company’s forwarder had no diversity; it forwarded exclusively to that managed service.

They fixed the immediate incident by switching forwarders temporarily and permanently by running local recursion in two regions with diverse upstreams. The postmortem had one uncomfortable line: “We escalated to the wrong party because we didn’t verify the authoritative path.” That line prevented future embarrassment.

Optimization that backfired: “Let’s forward everything to reduce latency”

A finance company had a well-tuned Unbound cluster doing full recursion, with caching and DNSSEC validation. Someone noticed that the cloud provider offered “low-latency DNS resolvers” inside the VPC, and proposed a “simple optimization”: change Unbound to forward all queries to the cloud resolvers.

In normal times, it looked great. Median latency dropped. Resolver CPU usage fell. Everyone congratulated the change in a meeting where nobody asked about failure modes.

Three weeks later, a subset of domains began returning SERVFAIL intermittently. Not all domains, not all queries. The team chased ghosts: application retries, connection pools, even TLS handshakes. It was DNS, but only sometimes, and only through the cloud resolvers.

Root cause: the cloud resolver fleet had an EDNS0 fragmentation issue on a specific path to certain authoritative servers. UDP responses above a certain size were dropped, and TCP fallback was inconsistent due to a security appliance policy. Full recursion on their own Unbound boxes had previously handled this better because they had conservative EDNS0 settings and predictable TCP behavior.

The “optimization” created a new blast radius: a managed dependency they couldn’t tune. They rolled back to full recursion and kept the cloud resolver as a secondary forwarder only for specific internal zones. The lesson wasn’t “never use managed DNS.” It was: don’t outsource a critical path unless you can observe it and you have a fallback.

Boring but correct practice that saved the day: diverse resolvers + continuous probes

A retail platform had been burned by DNS before. Not catastrophically—just enough to make it expensive. So they implemented a policy that nobody found exciting: every region would have two recursive resolvers, each capable of full recursion, each with independent upstream connectivity. Clients would have two resolvers in resolv.conf, and their service mesh would also use resolver diversity.

They also ran continuous probes: every minute, from each region, they queried a small set of critical names (their own auth endpoints, payment gateway, and a few external dependencies) against each resolver, logging RCODE, latency, and whether TCP was used.

One afternoon, one resolver pool began returning SERVFAIL for a payment gateway domain, but only in one region. Because they had per-resolver telemetry, the issue was obvious in minutes: resolver A in that region was bad; resolver B was fine. Clients automatically failed over, so customer impact was minimal.

When they opened a ticket with the connectivity provider, they didn’t say “DNS is flaky.” They said: “Resolver IP X returns SERVFAIL for name Y starting at time T; resolver IP Z does not. Here are minute-by-minute RCODE counts and a tcpdump excerpt.” The provider fixed a broken validating resolver node in that POP the same day.

Nothing heroic happened. That’s why it worked.

Common mistakes: symptoms → root cause → fix

1) Symptom: SERVFAIL only on your provider resolver; public resolvers work

Root cause: Provider recursive outage, anycast POP inconsistency, or provider policy (filtering/RPZ).

Fix: Query specific resolver IPs directly, collect tcpdump evidence, and switch/augment resolvers short-term. Escalate with resolver IP + timestamps.

2) Symptom: SERVFAIL disappears when using +cd

Root cause: DNSSEC validation failure (bad DS chain, expired signatures, wrong trust anchor, or resolver clock issues).

Fix: Verify time sync; run dig +trace and inspect DS/DNSKEY chain; if upstream validating resolver is wrong, escalate with “validation fails on resolver IP X; succeeds elsewhere.”

3) Symptom: Small answers work; large answers SERVFAIL/time out

Root cause: EDNS0 fragmentation/PMTUD problems; TCP/53 blocked; middleboxes mangling fragments.

Fix: Test with +bufsize=1232 and +tcp. If that fixes it, tune resolver EDNS0 size, ensure TCP/53 allowed, and escalate to provider/network team with packet captures.

4) Symptom: Intermittent SERVFAIL varies by region

Root cause: Anycast routing to different resolver nodes; one POP is sick; inconsistent cache/validation state.

Fix: Test from multiple vantage points; include source IP/region in evidence; ask provider to drain/repair the POP.

5) Symptom: SERVFAIL for a domain you just updated

Root cause: Broken delegation/DS during DNSSEC rollover; stale negative caching; inconsistent authoritative propagation.

Fix: Check dig +trace for delegation/DS correctness; reduce TTLs before changes; if already broken, coordinate registrar/parent fix and wait out caches.

6) Symptom: Only some clients fail; others succeed on same network

Root cause: Different resolvers configured (DHCP), split DNS, VPN pushing resolvers, or local stub caching differences.

Fix: Verify resolver configuration per client (resolvectl), standardize DHCP options, and stop relying on “whatever DNS the laptop got today.”

7) Symptom: SERVFAIL but your resolver logs are empty

Root cause: You’re not actually querying the resolver you think you are; or queries are being intercepted by a local stub / sidecar / NAT device.

Fix: Use dig @IP explicitly; capture packets at the client; confirm traffic hits the expected resolver IP.

8) Symptom: SERVFAIL after “hardening” changes

Root cause: Blocking DNS over TCP, blocking fragments, or disabling EDNS0 incorrectly; overly strict DNSSEC settings.

Fix: Re-enable TCP/53 and fragments as needed; tune EDNS0 buffer; validate DNSSEC with controlled tests before rolling changes broadly.

Checklists / step-by-step plan

Step-by-step: from first alert to provider escalation

  1. Confirm impact scope: is this one hostname, one resolver, one region, or global?
  2. Identify the active resolver on an affected host (resolvectl status).
  3. Reproduce with explicit queries: dig @resolver name type +stats.
  4. Compare with a second resolver (public or alternate internal). Note differences in RCODE and latency.
  5. Run dig +trace to prove the authoritative chain works (or pinpoint where it breaks).
  6. Test DNSSEC angle: run with and without +cd; check validating resolver logs; verify time sync.
  7. Test transport angle: +tcp, +bufsize=1232, and verify TCP/53 reachability.
  8. Capture packets showing SERVFAIL response from upstream or missing responses.
  9. Document “matrix” evidence: resolver IPs × vantage points × outcomes.
  10. Mitigate: switch to alternate resolvers, enable fallback, or temporarily disable DNSSEC checking only if you understand the risk and have approval (a time-boxed sketch follows this list).
  11. Escalate: send a short summary plus key evidence, ask for POP/node investigation if anycast suspected.
  12. Follow-up: keep probing and add permanent monitoring; write a postmortem that includes “how we will detect this in 2 minutes next time.”
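
For step 10, if the validating resolver you control is Unbound, one time-boxed way to stop failing closed on validation errors is permissive mode. This is a mitigation sketch under that assumption, not a recommendation; revert it as soon as the upstream issue is fixed:

# /etc/unbound/unbound.conf.d/temporary-mitigation.conf
server:
    # Answer queries that fail DNSSEC validation instead of returning SERVFAIL.
    # Temporary: record the owner, the ticket, and the revert date.
    val-permissive-mode: yes

cr0x@server:~$ sudo unbound-control reload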

Quick mitigation checklist (when production is bleeding)

  • Switch clients to at least two diverse resolvers (different providers if possible).
  • On forwarders, add multiple upstreams and health-check them; prefer ones with independent anycast networks.
  • If fragmentation suspected: reduce EDNS0 UDP size and ensure TCP/53 is allowed.
  • If DNSSEC suspected: verify time first; don’t disable validation globally as your first move.
  • Reduce dependency on one recursive resolver: it’s a single point of failure wearing a trench coat.

Evidence bundle checklist (what to attach to the ticket)

  • Resolver IPs tested and their RCODEs (SERVFAIL vs NOERROR) for the same query.
  • Timestamps and timezone.
  • dig +trace output showing authoritative answers.
  • dig output showing +cd difference if DNSSEC suspected.
  • tcpdump excerpt or pcap showing SERVFAIL response from upstream resolver.
  • Your source IP/public egress and region (especially if anycast).

FAQ

1) Why does my resolver return SERVFAIL instead of timing out?

Because something upstream responded with SERVFAIL or your resolver hit a validation/policy error. Fast SERVFAIL is often an intentional decision, not packet loss.

2) If dig +trace works, can the provider still be at fault?

Yes. +trace bypasses your recursive resolver entirely: dig walks the hierarchy itself, from the roots down. If trace works but the provider resolver SERVFAILs, the provider resolver is failing at recursion, caching, validation, transport, or policy.

3) Does SERVFAIL always mean DNSSEC problems?

No. DNSSEC is a common cause, but so are upstream outages, rate limiting, timeouts to authoritative servers, broken TCP fallback, and middlebox interference.

4) What does it mean when +cd makes the query succeed?

It strongly suggests DNSSEC validation failure on the resolver you queried. It’s not proof of a domain problem by itself—misconfigured resolvers and bad clocks can also cause it.

5) Should I just turn off DNSSEC validation to stop the incident?

Only as a time-boxed mitigation with explicit risk acceptance. Disabling validation trades integrity for availability. Sometimes that’s the right business call; often it’s a panic move that becomes permanent.

6) Why do only some regions see SERVFAIL when resolvers are anycast?

Anycast routes you to a nearby POP, and “nearby” can change with routing events. One POP can be broken while others are fine. Your evidence needs regional vantage points.

7) How do I prove it’s not our firewall/NAT?

If you capture packets showing an upstream SERVFAIL response arriving, it’s not your firewall dropping replies. If you see queries leave but no responses come back, you need more tests: TCP vs UDP, EDNS buffer sizing, and checks from a different egress path.

8) Why does reducing +bufsize to 1232 help?

It avoids fragmentation on common internet paths and aligns with modern operational guidance for DNS over UDP. If it helps, you likely have fragmentation loss or PMTUD issues.

9) Can provider “security DNS” cause SERVFAIL?

Yes. Some filtering implementations return NXDOMAIN, some return SERVFAIL, and some do “helpful” redirection. If public resolvers succeed and provider resolvers fail consistently, suspect policy.

10) What should I ask the provider to do?

Ask them to investigate the specific resolver IP/POP for recursion or DNSSEC validation failures, check for fragmentation/TCP fallback issues, and confirm whether filtering/RPZ is applied.

Conclusion: practical next steps

SERVFAIL is not a diagnosis. It’s a demand for better questions. The fastest route to clarity is a controlled comparison: your resolver vs another resolver, recursion vs trace, validation on vs off, UDP vs TCP. Then you capture the proof.

Next steps you can do today:

  • Write down the resolver IPs your fleet actually uses. Put them in monitoring.
  • Add a small set of continuous DNS probes that record RCODE, latency, and TCP usage per resolver and per region; a minimal probe sketch follows this list.
  • Ensure TCP/53 is allowed where it must be, and tune EDNS0 buffer conservatively if you operate resolvers.
  • Design for resolver diversity. One upstream recursive is a single point of failure, even if it’s anycast and has a fancy SLA.
  • When the provider is the problem, escalate with a dossier: resolver IP, exact query, timestamps, trace proof, and packets. You’ll get action instead of poetry.
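
A minimal sketch of those probes, assuming bash and dig; the resolver IPs, names, and log path are placeholders, and in production you would run this per region under systemd or cron rather than as a bare loop:

log=/var/log/dns-probe.csv
while true; do
  for r in 10.0.0.53 203.0.113.53; do
    for name in www.example.com payments.example.net; do
      out=$(dig @"$r" "$name" A +time=2 +tries=1)
      rcode=$(grep -oE 'status: [A-Z]+' <<<"$out" | cut -d' ' -f2)      # NOERROR, SERVFAIL, ...
      msec=$(grep -oE 'Query time: [0-9]+' <<<"$out" | grep -oE '[0-9]+')
      proto=$(grep -oE '\((UDP|TCP)\)' <<<"$out" | tr -d '()')          # did TCP fallback kick in?
      echo "$(date -u +%FT%TZ),$r,$name,${rcode:-TIMEOUT},${msec:-NA},${proto:-NA}" >> "$log"
    done
  done
  sleep 60
done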