DNSSEC Fails Randomly: Debugging Validation Errors Without Panic

It’s 02:13. Half your fleet can reach a partner API; the other half gets SERVFAIL. Somebody says “DNS is down.” Somebody else says “It’s DNSSEC.” You feel the urge to reboot a resolver and hope nobody notices.

Don’t. “Random” DNSSEC failures are usually deterministic. They just look random because different resolvers, paths, caches, MTUs, clocks, and key states create different outcomes. This is a field guide to turning panic into a short list of checks, with commands you can run while the incident channel is still arguing about whose change it was.

What “random DNSSEC failure” actually means

DNSSEC validation is a yes/no operation: either the chain of trust validates for the RRset at the time you asked, or it doesn’t. So why does it look intermittent?

  • Different resolvers see different data. Cache state, negative caching, aggressive NSEC, and prefetching mean two resolvers can be “correct” and still disagree for minutes.
  • Different paths treat big UDP differently. DNSSEC responses are bigger. EDNS0 increases the UDP payload size; some networks drop fragments or block ICMP “Fragmentation needed.” The result: one client retries over TCP, another doesn’t.
  • Time matters. RRSIG records carry inception and expiration times. Clock skew on the validator can turn valid signatures into “expired” or “not yet valid.”
  • Different trust anchor states. Managed trust anchors and RFC 5011 behavior can lead to “works here, fails there” during key rollovers or after a long outage.
  • Upstream differences. If your resolvers forward to different upstreams (or auto-discover), you may be comparing apples to an orange that’s also on fire.

The goal is to stop treating it like a ghost. You’re going to pin down: which resolver fails, for which name/type, with which DNSSEC status, and why.

Facts and history that matter in practice

Here are some concrete points that aren’t trivia—they change how you debug:

  1. DNSSEC’s core specs landed in 2005 (RFC 4033/4034/4035). A lot of “legacy DNS” tooling predates it and lies by omission.
  2. The root zone was signed in 2010. Before that there was no root trust anchor to chain from, so validators relied on per-zone trust anchors; after it, a broken link anywhere in the chain can cause failures globally.
  3. Algorithm rollovers are real events, not theory. SHA-1-based DS digests (and older algorithms) have been phased out; mismatches often appear during transitions.
  4. DNSKEY and DS are not the same thing. DS sits in the parent; DNSKEY in the child. If you change keys without updating DS, validators will (correctly) call your zone “bogus.”
  5. Large DNS responses were historically fragile. DNS started with 512-byte UDP responses; DNSSEC made “big answers” common, hence EDNS0 and more TCP fallback.
  6. NSEC3 was introduced to reduce zone-walking. It also increases complexity and response sizes; you’ll see it in denial-of-existence proofs.
  7. Some resolvers cache “bogus” decisions briefly. That can make a short misconfiguration look longer than it was, and it’s why “we fixed it but it still fails” is a thing.
  8. There was a major root KSK rollover in 2018. It taught everyone that “trust anchor management” is operational work, not a checkbox.

One quote to keep you honest, because incidents punish wishful thinking:

“Hope is not a strategy.” — General Gordon R. Sullivan

Fast diagnosis playbook

When users say “DNSSEC is flaky,” you need a tight loop. Here’s the order that finds the bottleneck fastest in real systems.

1) Confirm it’s DNSSEC validation (not plain DNS)

  • Pick one failing client and one known-good client.
  • Query the same resolver IP directly (don’t rely on /etc/resolv.conf search domains and split DNS).
  • Compare AD (Authenticated Data), RA, and the actual response code (NOERROR, SERVFAIL, NXDOMAIN).

2) Identify the failing validator and isolate it

  • If you run a pool of resolvers, check them one by one (a quick loop like the sketch after this list helps). Intermittent often means “subset.”
  • If you forward to upstream resolvers, bypass forwarding temporarily (or test direct recursion) to localize where validation breaks.
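
A minimal loop for the one-by-one check, assuming a small pool (the 192.0.2.x addresses and the test name are placeholders):

# Sketch: run the same query against every resolver in the pool and compare
# the status and flags lines. Replace the IPs and the name with your own.
for r in 192.0.2.53 192.0.2.54 192.0.2.55; do
  echo "== $r"
  dig @"$r" partner-api.example.com A +dnssec +noall +comments | grep -E 'status:|flags:'
done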

3) Verify the chain: DS in parent, DNSKEY in child, signatures valid “now”

  • Use delv (or drill -S to chase signatures) to force a full validation trace.
  • Check for DS mismatch, missing RRSIG, expired signatures, or algorithm unsupported.

4) Check for MTU / fragmentation / TCP fallback

  • Look for truncated UDP responses (TC bit set), rising TCP query counts, or timeouts only on some networks.
  • Test with reduced EDNS buffer sizes to reproduce.

5) Check time and trust anchors

  • Clock skew on validators is a silent killer.
  • Stale root trust anchors show up after long outages or frozen images.

Decision rule: if the same query to the same resolver flips between AD and SERVFAIL without any zone changes, suspect network path/fragmentation or overloaded resolver first, not cryptography.

A mental model: where DNSSEC validation can fail

Think in layers. DNSSEC isn’t one thing; it’s several dependencies lined up, any of which can fail in ways that resemble each other.

Layer A: Transport (UDP/TCP, EDNS0, fragmentation)

DNSSEC often increases response size: DNSKEY sets, DS records, RRSIGs, NSEC/NSEC3 proofs. If UDP packets are fragmented and fragments are dropped, the resolver may never assemble the answer, may retry, or may mark the upstream as “lame.”
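
One way to see this layer in isolation is to advertise a tiny EDNS buffer and tell dig not to retry over TCP. This is a sketch; the resolver IP and zone are placeholders.

# Force a small UDP size and ignore truncation (+ignore = don't fall back to TCP).
dig @192.0.2.53 example.com DNSKEY +dnssec +bufsize=512 +ignore +noall +comments
# A "tc" in the flags line means the signed answer didn't fit in UDP at this size;
# a client or path that can't complete the TCP fallback is stuck right here.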

Layer B: Data integrity (signatures and chain of trust)

Validation requires a path from a trust anchor (usually the root) through DS and DNSKEY records down to the RRset you asked for. Break the DS link, publish bad signatures, or let signatures expire, and a validator must return SERVFAIL (or “bogus” internally). That’s the whole point.

Layer C: Time (RRSIG validity windows)

RRSIG inception/expiration is not forgiving. A validator with a bad clock can invalidate perfectly good zones. This is one of the few DNS problems that really is fixed by NTP, not by arguing harder.
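
If you want the raw validity windows next to the clock you are trusting, something like this is enough (the name and resolver are placeholders):

# Print current UTC time, then the RRSIG timestamps for the record in question.
date -u
dig @192.0.2.53 example.com A +dnssec +noall +answer | grep RRSIG
# RRSIG timestamps are YYYYMMDDHHMMSS in UTC: expiration first, then inception.
# If "now" falls outside that window (beyond any allowed skew), validation must fail.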

Layer D: Validator behavior (caching, RFC 5011, aggressive NSEC)

Resolvers are not identical. Unbound, BIND, PowerDNS Recursor, and others differ in defaults: how aggressively they cache, how they handle prefetch, how they log validation failures, and how they manage trust anchors. If you operate a mixed fleet, you’re also operating a mixed set of failure modes.
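
As one concrete example of what a fleet baseline pins down, here is roughly what the relevant Unbound options look like. Treat it as a sketch, not a drop-in config; paths and values vary by distro and policy.

# unbound.conf (server: block) -- illustrative values only
server:
    auto-trust-anchor-file: "/var/lib/unbound/root.key"  # RFC 5011 managed root anchor
    val-log-level: 2               # log the reason for every validation failure
    harden-dnssec-stripped: yes    # treat missing DNSSEC data as bogus where it is expected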

Joke #1: DNSSEC is like airport security: it improves safety, but it will make you take your shoes off at the worst possible time.

Practical tasks: commands, outputs, decisions

The fastest way to debug DNSSEC is to collect “proof,” not feelings. Below are real tasks you can run on a client, a resolver, or both. Each has: a command, what typical output means, and what decision you make next.

Task 1: Check whether the resolver is validating (AD flag)

cr0x@server:~$ dig @192.0.2.53 www.cloudflare.com A +dnssec +multi

; <<>> DiG 9.18.24 <<>> @192.0.2.53 www.cloudflare.com A +dnssec +multi
; (1 server found)
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1122
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
www.cloudflare.com.  300 IN A 104.16.132.229
www.cloudflare.com.  300 IN RRSIG A 13 3 300 20260101000000 20251201000000 34505 cloudflare.com. ...

;; Query time: 21 msec
;; SERVER: 192.0.2.53#53(192.0.2.53) (UDP)

Meaning: ad indicates the resolver validated the answer. The RRSIG in the answer shows DNSSEC data was requested and returned.

Decision: If the failing case lacks ad or returns SERVFAIL, keep digging. If it returns NOERROR without ad, the resolver may be non-validating or configured to strip AD.

Task 2: Confirm the failure is SERVFAIL from validation, not an upstream timeout

cr0x@server:~$ dig @192.0.2.53 broken.dnssec-failed.org A +dnssec +comments

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 44110
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

Meaning: SERVFAIL is ambiguous by itself. The key is to correlate with resolver logs (tasks below) and test with a validating tool (delv).

Decision: If multiple unrelated names SERVFAIL, suspect resolver health or connectivity. If only one zone, suspect zone DNSSEC data or path MTU issues for that zone’s responses.

Task 3: Use delv to get a validation explanation (local proof)

cr0x@server:~$ delv @192.0.2.53 example.com A

; fully validated
example.com.  300 IN A 93.184.216.34

Meaning: “fully validated” means the resolver provided enough data and validation succeeded.

Decision: If delv says “resolution failed” or “bogus,” treat it as DNSSEC until proven otherwise and move to chain inspection.

Task 4: Make delv show why it’s bogus

cr0x@server:~$ delv @192.0.2.53 bad.example A +rtrace

...
; validation failure <bad.example/A>: no valid RRSIG
; resolution failed: SERVFAIL

Meaning: This tells you the category: missing/invalid signature, DS mismatch, unsupported algorithm, etc.

Decision: “no valid RRSIG” usually means broken signing, expired signatures, or the answer was truncated/dropped and the resolver didn’t receive signatures.

Task 5: Compare behavior with DNSSEC disabled at query-time (control test)

cr0x@server:~$ dig @192.0.2.53 example.com A +nodnssec +multi

; <<>> DiG 9.18.24 <<>> @192.0.2.53 example.com A +nodnssec +multi
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5881
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
example.com.  300 IN A 93.184.216.34

Meaning: If +nodnssec works but +dnssec fails, DNSSEC data size/validation is implicated. If both fail, it’s plain DNS or transport.

Decision: Use this to keep stakeholders calm: “DNS works, DNSSEC validation fails” is a narrower problem than “DNS is broken.”

Task 6: Inspect DS at the parent (the “is the parent pointing correctly?” test)

cr0x@server:~$ dig @192.0.2.53 example.com DS +dnssec +multi

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19981
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
example.com.  86400 IN DS 370 13 2 9F3E...C1A7
example.com.  86400 IN RRSIG DS 8 2 86400 20260101000000 20251201000000 12345 com. ...

Meaning: The parent publishes DS. The digest and key tag must match a DNSKEY in the child zone. ad here is a good sign: the parent chain is fine up to this point.

Decision: If DS is missing, the zone is “insecure” (not signed from the parent’s perspective). If DS exists but doesn’t match child DNSKEY, you get bogus.

Task 7: Inspect DNSKEY in the child (the “does the child publish the right key?” test)

cr0x@server:~$ dig @192.0.2.53 example.com DNSKEY +dnssec +multi

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5022
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
example.com.  3600 IN DNSKEY 257 3 13 mGx...==
example.com.  3600 IN DNSKEY 256 3 13 tZk...==
example.com.  3600 IN RRSIG DNSKEY 13 2 3600 20260101000000 20251201000000 370 example.com. ...

Meaning: You’ll typically see a KSK (flag 257) and one or more ZSKs (flag 256). They must be signed, and the signatures must validate.

Decision: If DNSKEY queries are timing out or truncated, you likely have MTU/fragmentation issues. If DNSKEY exists but doesn’t align with DS, it’s a DS rollover problem.
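
If you have BIND's dnssec-dsfromkey on hand, you can compute what the DS should be from the child's DNSKEY set and compare it with what the parent actually publishes. A sketch, with placeholder names and resolver:

# Derive DS records from the child's DNSKEYs, then fetch the parent's DS for comparison.
dig @192.0.2.53 example.com DNSKEY +noall +answer > /tmp/example.com.dnskey
dnssec-dsfromkey -f /tmp/example.com.dnskey example.com
dig @192.0.2.53 example.com DS +noall +answer
# At least one computed DS (key tag, algorithm, digest) must match a DS the parent
# serves; if none do, validators will (correctly) treat the zone as bogus.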

Task 8: Force TCP to rule out UDP fragmentation

cr0x@server:~$ dig @192.0.2.53 example.com DNSKEY +dnssec +tcp

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61001
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; Query time: 37 msec
;; SERVER: 192.0.2.53#53(192.0.2.53) (TCP)

Meaning: If TCP works reliably while UDP fails intermittently, your “random” problem is probably packet fragmentation, firewall behavior, or broken PMTUD.

Decision: Mitigate by lowering the EDNS UDP size on the resolver, making sure DNS over TCP/53 is allowed end to end, or fixing the network path that drops fragments/ICMP.

Task 9: Constrain EDNS buffer size to reproduce “big answer” failures

cr0x@server:~$ dig @192.0.2.53 example.com DNSKEY +dnssec +bufsize=1232

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4555
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232

Meaning: 1232 is a common “safe” EDNS size for avoiding fragmentation over typical paths. If lowering the bufsize makes failures disappear, you’ve learned something useful about your network.

Decision: Set resolver-side EDNS buffer to a conservative value if you operate in hostile networks (VPNs, overlays, hotel Wi‑Fi, certain enterprise firewalls).
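
Where that knob lives depends on the resolver. A sketch of the usual options (values illustrative):

Unbound (unbound.conf, server block):
    edns-buffer-size: 1232

BIND (named.conf, options block):
    edns-udp-size 1232;
    max-udp-size 1232;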

Task 10: Check resolver logs for explicit validation errors (Unbound)

cr0x@server:~$ sudo journalctl -u unbound --since "10 min ago" | tail -n 30
Dec 31 01:59:12 resolver-a unbound[912]: info: validation failure <example.com. DNSKEY IN>: signature expired
Dec 31 01:59:12 resolver-a unbound[912]: info: resolving example.com. DNSKEY IN
Dec 31 01:59:13 resolver-a unbound[912]: info: error: SERVFAIL example.com. A IN

Meaning: That’s not “random.” It’s “signature expired,” which usually means the zone operator failed to re-sign in time, or your clock is wrong.

Decision: Check local time first (task 12). If local time is good, escalate to the zone owner with evidence.

Task 11: Check resolver logs for validation errors (BIND named)

cr0x@server:~$ sudo journalctl -u named --since "10 min ago" | tail -n 40
Dec 31 02:01:22 resolver-b named[1044]: resolver: info: validating example.com/A: no valid signature found
Dec 31 02:01:22 resolver-b named[1044]: resolver: info: DNSKEY example.com is not secure
Dec 31 02:01:22 resolver-b named[1044]: resolver: info: client @0x7f... 198.51.100.27#51622 (example.com): query failed (SERVFAIL) for example.com/IN/A at query.c:...

Meaning: “No valid signature found” often correlates with DS mismatch, missing RRSIG, or a truncated response where the resolver never received the signatures it needed.

Decision: Immediately test with +tcp and +bufsize=1232 to see if it’s transport-related. If not, inspect DS/DNSKEY alignment.

Task 12: Confirm clocks are sane (validator time skew)

cr0x@server:~$ timedatectl
               Local time: Wed 2025-12-31 02:06:41 UTC
           Universal time: Wed 2025-12-31 02:06:41 UTC
                 RTC time: Wed 2025-12-31 02:06:41
                Time zone: UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

Meaning: DNSSEC validation depends on correct time. If System clock synchronized is “no,” you’re in the danger zone.

Decision: Fix NTP/time sync before touching DNS configs. If the host time is wrong, every other conclusion is suspect.
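
If the clock is off, the follow-up depends on which time daemon the host runs; usually only one of these exists. A sketch:

# chrony: check the measured offset and leap status
chronyc tracking
# classic ntpd: check the offset column (milliseconds) per peer
ntpq -p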

Task 13: Spot a trust anchor problem (root key issues) with Unbound’s anchor utility

cr0x@server:~$ sudo unbound-anchor -a /var/lib/unbound/root.key
/var/lib/unbound/root.key has content
success: the anchor is ok

Meaning: This checks whether the configured root trust anchor file is present and reasonable.

Decision: If this fails on only some resolvers, you’ve likely got image drift or a stale appliance. Update trust anchors in a controlled way.

Task 14: Check if you’re forwarding (and therefore inheriting someone else’s DNSSEC behavior)

cr0x@server:~$ sudo unbound-control list_forwards
zone=.
forward-addr=203.0.113.9@53
forward-addr=203.0.113.10@53

Meaning: You are not doing full recursion; you’re trusting upstream forwarders. If they validate differently, filter EDNS, or break TCP fallback, your outcomes vary.

Decision: For debugging, test direct recursion from one resolver (temporarily in a lab or isolated instance) to see if the problem is yours or upstream.
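
With Unbound, forwarding for the root zone can be toggled at runtime, which makes the “is it us or upstream?” test quick. Do this on one isolated instance, not the whole pool. A sketch (the addresses are the placeholders from the output above):

sudo unbound-control forward off                      # recurse directly instead of forwarding
sudo unbound-control flush_zone example.com           # drop cached answers for the test zone
dig @127.0.0.1 example.com A +dnssec +comments        # does validation succeed now?
sudo unbound-control forward 203.0.113.9 203.0.113.10 # restore the original forwarders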

Task 15: Verify that your resolver is actually attempting DNSSEC validation

cr0x@server:~$ sudo unbound-control get_option val-permissive-mode
val-permissive-mode: no

Meaning: Permissive mode can turn hard failures into soft ones. Great for emergency mitigation; terrible for security expectations and debugging clarity.

Decision: Keep permissive mode off in normal operation. If it’s on during an incident, document it, time-box it, and plan the rollback.

Task 16: Measure UDP vs TCP query mix (transport signal)

cr0x@server:~$ sudo unbound-control stats_noreset | egrep 'num.query.tcp|num.query.udp|num.answer.rcode.SERVFAIL'
num.query.udp=1849921
num.query.tcp=219004
num.answer.rcode.SERVFAIL=3812

Meaning: A sudden spike in TCP queries can indicate UDP fragmentation issues or deliberate truncation. A spike in SERVFAIL correlated with specific zones points to DNSSEC chain problems.

Decision: If TCP ratio rises during the same window as failures, look at network/MTU. If SERVFAIL rises without TCP changes, look at validation errors and trust anchors.

Task 17: Test path MTU and fragmentation behavior (quick-and-dirty)

cr0x@server:~$ ping -M do -s 1472 198.51.100.53 -c 3
PING 198.51.100.53 (198.51.100.53) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1480
ping: local error: message too long, mtu=1480
ping: local error: message too long, mtu=1480

--- 198.51.100.53 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2041ms

Meaning: MTU constraints exist on the path. DNS over UDP with large EDNS payloads is now suspicious. Also: this doesn’t prove DNS fragments are dropped, but it’s a strong smell.

Decision: Reduce EDNS UDP size on the resolver, ensure TCP fallback works, and fix ICMP handling where possible.
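
tracepath gives a slightly better picture than raw ping because it reports where along the path the MTU drops. The address is a placeholder:

tracepath -n 198.51.100.53
# Look for the reported "pmtu" value; anything below 1500 means large UDP answers
# will fragment (or disappear) somewhere on this path.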

Task 18: Confirm that the authoritative servers are reachable and consistent

cr0x@server:~$ dig @ns1.example.net example.com DNSKEY +dnssec +norec
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3101
;; flags: qr aa; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

Meaning: aa confirms an authoritative answer. If one auth server gives different DNSKEY/RRSIG than another, you may have propagation issues, split-brain signers, or a partially rolled key.

Decision: Query each authoritative nameserver directly. Inconsistent DNSSEC data across auth servers is a high-confidence root cause for “random” failures.
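
A quick way to do that for every listed nameserver, assuming the delegation's NS set is accurate (names are placeholders):

# Ask each authoritative server for the zone's SOA serial and DNSKEY set.
for ns in $(dig +short example.com NS); do
  echo "== $ns"
  dig @"$ns" example.com SOA +norec +noall +answer
  dig @"$ns" example.com DNSKEY +norec +noall +answer
done
# Compare serials and DNSKEY sets: a server lagging behind the others is a strong
# candidate for the "random" subset of failures.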

Joke #2: DNSSEC debugging is when you learn the difference between “distributed system” and “distributed blame.”

Common mistakes: symptom → root cause → fix

This is the part where we stop pretending it’s always “the Internet” and name the usual suspects.

1) Some clients get SERVFAIL, others succeed

  • Symptom: Intermittent failures correlated with office/VPN/region; retries sometimes work.
  • Root cause: UDP fragmentation drops, broken PMTUD, or a firewall that mangles EDNS0. DNSSEC responses are large; the path is not kind.
  • Fix: Lower EDNS UDP size on resolvers (often 1232), ensure TCP/53 is permitted, and fix ICMP handling on the path. Validate with dig +tcp and dig +bufsize=1232.

2) Everything worked until a key rollover, then it “randomly” broke

  • Symptom: A zone that was stable now intermittently bogus; some resolvers recover faster than others.
  • Root cause: DS record in the parent doesn’t match the active KSK in the child; or one authoritative is still serving old DNSKEY/RRSIG.
  • Fix: Verify DS at parent and DNSKEY in child. Ensure all authoritative servers serve the same signed data. Re-run rollover steps correctly; don’t “just remove DNSSEC” unless you enjoy customer calls.

3) Failures start after “security hardening” on the network

  • Symptom: DNS is fine for small records; DNSKEY queries time out; TCP/53 blocked.
  • Root cause: Firewall blocks fragments, blocks TCP/53, or rate-limits “unknown UDP” patterns, accidentally DoSing DNSSEC.
  • Fix: Permit TCP/53 to resolvers, and either allow returning fragments or tune EDNS size downward. Confirm with stats: rising TCP fallback, truncation, or timeouts.

4) Validation failures appear on a subset of resolvers only

  • Symptom: Resolver A validates; Resolver B returns SERVFAIL for the same query.
  • Root cause: Different trust anchors, stale root.key, clock skew, different forwarders, or different resolver versions/defaults.
  • Fix: Standardize resolver configuration and trust anchor management. Check time sync. Compare unbound-anchor, version, and forwarding settings.

5) NXDOMAIN becomes SERVFAIL under DNSSEC

  • Symptom: A non-existent name should be NXDOMAIN but comes back SERVFAIL on validating resolvers.
  • Root cause: Broken denial-of-existence proofs (NSEC/NSEC3) or missing RRSIGs on NSEC/NSEC3.
  • Fix: Validate with delv +rtrace. Fix authoritative signing for NSEC/NSEC3, ensure correct parameters, and confirm all auth servers serve consistent denial records.

6) “It works with public resolvers but not ours”

  • Symptom: Public resolvers validate fine; your internal validating resolvers fail.
  • Root cause: Your resolvers are forwarding to an upstream that strips DNSSEC, alters EDNS, or blocks TCP fallback; or your internal network drops fragments.
  • Fix: Test direct recursion (no forwarding) on one resolver. Compare EDNS buffer, TCP behavior, and logs.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

Company A ran a neat resolver layer: two anycast VIPs per region, Unbound behind them, and a “simple” forwarder policy to a third-party DNS service for “better cache hit rates.” DNSSEC was enabled because their security team had a quarterly checkbox that required the words “DNSSEC validation.”

A partner rotated their DNSSEC keys. Nothing unusual, just routine hygiene. Within an hour, a handful of application pods in one region started failing TLS handshakes because they couldn’t resolve an OCSP responder hostname. The incident channel got busy and loud.

The on-call engineer assumed: “If the public Internet resolves it, our resolvers must be broken.” That assumption is backwards. Public resolvers weren’t forwarding to the same upstream, weren’t behind the same firewall, and weren’t subject to the same EDNS handling.

It turned out their upstream forwarder did validation too—but with a different policy. It would return SERVFAIL for zones during a DS transition window that their internal setup could have handled if it were doing full recursion. The internal resolvers were innocent; they were obediently relaying upstream failure.

The fix was unglamorous: they temporarily disabled forwarding for the affected zone (policy-based forwarding), did direct recursion, and the failures stopped. Then they redesigned: either forward and accept upstream behavior, or recurse and own validation end-to-end. Mixing the two without explicit expectations is how you get paged.

Mini-story 2: The optimization that backfired

Company B had a global network with “helpful” WAN optimizers. Someone noticed DNS traffic was “chatty” and decided to “optimize” it. The WAN team deployed a policy that de-prioritized UDP fragments during congestion because “fragmentation is often garbage.”

Nothing exploded immediately. Most DNS responses were small. But once DNSSEC became common in their vendor ecosystem, DNSKEY and some signed TXT records started fragmenting. The optimizers didn’t drop all fragments—just enough to make it intermittent under load. Perfect: a bug that reproduces best when you’re busiest.

What made it worse was the fallback behavior. Some resolvers retried over TCP quickly; some clients had short timeouts; some middleboxes treated TCP/53 suspiciously. Users reported “random DNS failures,” and the first reaction was to add more resolvers.

Adding resolvers changed cache behavior and query distribution, which slightly altered packet sizes and timing. It also made the incident graphs harder to interpret. The optimization had backfired twice: it caused the issue and then disguised it.

The eventual resolution was to set a conservative EDNS UDP size and to adjust the WAN policy to stop treating fragments as disposable. It wasn’t heroic. It was just admitting that DNSSEC made “large UDP” a first-class citizen, like it or not.

Mini-story 3: The boring but correct practice that saved the day

Company C didn’t do anything fancy. They ran BIND resolvers with a strict config baseline, NTP everywhere, config management with drift detection, and a canary resolver per region. They also had a habit: once a day, automated jobs ran delv against a small set of critical external domains and a few internal signed zones.

One Tuesday, a registrar-side DS update for a customer-managed zone went wrong. The zone was still signed, but the DS in the parent no longer matched the KSK in the child. That means: validating resolvers must fail. Non-validating resolvers will keep happily resolving, which is how you get a split-brain user experience.

The daily canary job caught it within minutes of the change. Not because the company was psychic—because they measured validation status as a first-class SLO signal. Their on-call had a playbook, logs were centralized, and the evidence was already in the ticket.

They notified the registrar, rolled back the DS, and the incident never became a full-scale outage. The boring practice wasn’t “DNSSEC expertise.” It was repeating a small validation test every day so that a bad day looks like a known pattern, not a novel horror.
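
Company C's habit is cheap to copy. A minimal canary along those lines might look like this; the resolver address, domain list, and alerting hook are all placeholders you would replace with your own:

#!/bin/sh
# Sketch: daily DNSSEC canary. Run from cron; raise a ticket or page on any failure.
RESOLVER=192.0.2.53
DOMAINS="partner-api.example.com example.org internal-signed.example"
for d in $DOMAINS; do
  if ! delv @"$RESOLVER" "$d" A 2>&1 | grep -q 'fully validated'; then
    echo "DNSSEC canary FAILED for $d via $RESOLVER" | logger -t dnssec-canary
    # placeholder: swap logger for your real alerting hook
  fi
done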

Checklists / step-by-step plan

Step-by-step: isolate whether this is zone data, resolver config, or the network

  1. Pick a single failing name and record type. Write it down. Don’t debug “DNS” in general.
  2. Query the same resolver directly from a failing host. Use dig @IP and include +dnssec.
  3. Repeat from a known-good host. If results differ, you’ve got path or resolver selection differences.
  4. Try +tcp. If TCP fixes it, you’re looking at UDP/EDNS/fragmentation.
  5. Try +bufsize=1232. If that fixes it, your network path is hostile to large UDP.
  6. Run delv +rtrace to get the reason. It’s faster than guessing.
  7. Check resolver logs for “bogus” reasons. Signature expired, DS mismatch, unsupported algorithm, etc.
  8. Validate time sync on the resolver. If time is wrong, stop and fix it.
  9. Query parent DS and child DNSKEY. Ensure they align and are consistent across authoritative servers.
  10. Decide ownership.
    • If DS/DNSKEY mismatch: zone owner/registrar problem.
    • If only some resolvers fail: your fleet drift/time/trust anchors.
    • If TCP works and UDP doesn’t: network path/EDNS sizing.

Checklist: what to capture for an escalation (so it gets fixed)

  • Failing FQDN and record type (A/AAAA/TXT/DNSKEY/DS).
  • Resolver IP(s) tested and whether forwarding is involved.
  • dig output with +dnssec and with +tcp.
  • delv +rtrace output showing the reason (expired sig, DS mismatch, etc.).
  • Timestamp and your resolver’s timezone/clock status.
  • Which authoritative servers were queried directly, and whether answers differed.

FAQ

1) Why does DNSSEC failure show up as SERVFAIL instead of a clearer message?

Because DNS is a protocol built for minimalism and caching. Validators generally don’t leak detailed validation errors to clients. The resolver knows “bogus,” the client sees “SERVFAIL.” Use resolver logs and delv for the explanation.

2) What’s the fastest way to prove it’s DS/DNSKEY mismatch?

Query DS at the parent and DNSKEY at the child, then run delv +rtrace. If the DS digest doesn’t match any DNSKEY, validators will fail consistently once caches converge.

3) Can I just disable DNSSEC validation during an incident?

You can, but do it like a controlled burn: time-box it, document it, and understand what you’re trading away. A better emergency mitigation is often lowering EDNS UDP size or ensuring TCP/53 is allowed—fixing transport without dropping validation.

4) Why do public resolvers succeed while ours fail?

Different resolver implementations, different trust anchor states, different network paths, and different policies. Public resolvers may have better anycast reachability, more tolerant TCP fallback, or simply not be behind your firewall rules.

5) What does the AD flag actually mean?

AD (Authenticated Data) is set by a validating resolver when it believes the data is validated. It can be stripped or not set depending on configuration. Treat it as a signal, not gospel, and confirm with delv when it matters.

6) Is lowering EDNS UDP size “bad for performance”?

Sometimes it increases TCP fallback, which can add latency. But a slightly slower answer beats a random failure. For many enterprise paths, a conservative EDNS size is the practical choice.

7) How can time drift cause only intermittent failures?

Because RRSIG validity windows are time-based, and different RRsets have different signature timings. Also, caches and retries can mask it until a specific record’s signature crosses the validator’s skew boundary.

8) What’s the difference between “insecure” and “bogus”?

Insecure means there’s no chain of trust (no DS in parent). The resolver can answer without validation. Bogus means there is a chain expected but it fails (bad signatures, mismatched DS/DNSKEY, etc.). Bogus should fail closed.

9) Do mixed resolver fleets really matter that much?

Yes. Defaults differ: caching behavior, trust anchor management, TCP fallback, and logging. If you want predictable behavior, standardize or at least document differences and test both paths.

Conclusion: next steps you can ship this week

DNSSEC doesn’t fail randomly. Your system does a good impression of randomness when different resolvers, caches, clocks, and networks disagree.

Do these next:

  1. Implement the fast diagnosis playbook. Put it in your on-call runbook and make “dig +dnssec, dig +tcp, delv +rtrace” muscle memory.
  2. Standardize resolver baselines. Same version, same trust anchor handling, same EDNS sizing policy, same logging level during incidents.
  3. Pick a conservative EDNS UDP size if you operate across messy networks. Then measure TCP ratio so you know the cost.
  4. Monitor validation outcomes. Track SERVFAIL by zone, not just total. Run daily canary validations of critical domains.
  5. Get serious about time. If you run validators, you run clocks. Make NTP drift an alert, not a footnote.