It’s 02:13. Your app is fine, your load balancer is fine, and your database is minding its own business. Yet users can’t log in because DNS lookups are returning SERVFAIL. In the resolver logs you see the words that ruin weekends: bogus, validation failed.
DNSSEC is supposed to make DNS safer. In production, it occasionally turns a small upstream mistake into a global outage for exactly the people who tried to do the right thing.
What “bogus/validation failed” actually means
“Bogus” is not a vibe. It’s a verdict. A validating resolver has cryptographic evidence that the answer it got is not what the zone’s owner signed, or that the resolver can’t build a trusted chain from the root down to the record you asked for.
That distinction matters:
- “Insecure” means “no DNSSEC here.” The resolver returns answers normally.
- “Secure” means “validated.” The resolver returns answers normally.
- “Bogus” means “validation failed.” The resolver returns SERVFAIL, because returning untrusted data defeats DNSSEC’s purpose.
DNSSEC validation fails for a handful of reasons, and they cluster into two buckets:
- Trust chain breaks: DS record wrong/missing, DNSKEY mismatch, bad rollover, wrong algorithm, wrong key tag, missing signatures, incorrect NSEC/NSEC3 proofs.
- Reality breaks: time skew, truncated UDP + broken fallback, middleboxes mangling EDNS0, resolvers misconfigured, caches holding old data during a rollover.
Practical approach: treat it like any other distributed systems problem. Gather evidence from multiple vantage points, identify the first component that produces a wrong assumption, and apply the smallest mitigation that restores service without papering over the root cause forever.
Paraphrased idea from Werner Vogels (reliability and operations): Everything fails, and it’s your job to design and operate like that’s normal.
One more thing: DNSSEC failures are embarrassingly binary. Either the math checks out, or it doesn’t. There’s no “mostly signed,” no “works on my resolver,” and no “it passed staging.”
Fast diagnosis playbook
This is the order that gets you to the real bottleneck fastest, with minimal thrash. Stick to it. Random-walking through resolvers at 3 AM is how you end up debugging your own laptop’s Wi‑Fi.
1) Confirm it’s DNSSEC-related (not general DNS outage)
- Query a known validating resolver and a known non-validating resolver.
- If the domain works without validation but fails with validation, you’re in DNSSEC land.
2) Identify where the trust chain breaks
- Check DS at the parent zone.
- Check DNSKEY at the child zone.
- Check that RRSIGs exist and are within validity windows.
3) Rule out “reality problems” quickly
- Time sync on resolvers (timedatectl), MTU/fragmentation, UDP truncation, EDNS0 issues.
- Make sure your resolver isn’t stuck on stale trust anchors or caching old DNSKEY sets.
4) Apply a safe mitigation while you fix upstream
- If you run validating resolvers for users: use a Negative Trust Anchor (NTA) for the affected zone to restore availability temporarily.
- If you’re the zone owner: fix DS/DNSKEY and signatures correctly; do not “just turn off DNSSEC” unless you can cleanly remove DS first.
Decision rule: If you can’t fix the root in under 30 minutes and it’s customer-impacting, mitigate. Then fix properly.
Facts and history that matter in outages
- DNSSEC was specified in the late 1990s, but broad deployment took years because operational complexity beat cryptography in real life.
- The root zone was signed in 2010, which made end-to-end validation practical at scale—before that, you were stitching trust anchors manually.
- DS records live in the parent zone (like .com), which means your registrar is part of your security boundary whether you like it or not.
- Key rollovers are the #1 human-caused DNSSEC outage pattern: timing, caching, and “which key signs what” details all matter.
- Algorithm agility is real: older algorithms (like RSA/SHA-1) became undesirable; migrating algorithms can break validators if done sloppily.
- NSEC3 exists largely because of zone walking concerns; it also adds complexity and has its own failure modes (salt/iterations mistakes don’t help outages).
- EDNS0 enabled larger DNS responses over UDP; it also introduced “middlebox breaks DNS” class problems that show up as DNSSEC failures.
- There was a major root KSK rollover (2018), and it surfaced how many resolvers were misconfigured or stale—an object lesson in operational hygiene.
- Some resolvers fail “closed” by design: when validation fails, they prefer outage over potential cache poisoning. That’s the point, but it changes your incident playbook.
Practical tasks (commands, outputs, decisions)
Below are hands-on tasks you can run during an incident. Each includes a command, a sample output shape, what it means, and what decision to make.
Task 1 — See the user-visible symptom: SERVFAIL vs NOERROR
cr0x@server:~$ dig +noall +comments +answer www.example.com A
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 12014
Meaning: The resolver couldn’t produce an answer it trusts. Not necessarily NXDOMAIN; likely DNSSEC validation, upstream timeout, or recursion failure.
Decision: Immediately compare with a different resolver and with DNSSEC-disabled behavior to confirm it’s validation.
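A shortcut for the “DNSSEC-disabled behavior” comparison is dig’s +cd flag (checking disabled), which asks the same resolver to answer without validating. A minimal sketch; the resolver address is a placeholder:
cr0x@server:~$ dig +noall +comments www.example.com A @192.0.2.53 | grep "status:"
cr0x@server:~$ dig +cd +noall +comments www.example.com A @192.0.2.53 | grep "status:"
If the first query returns SERVFAIL and the +cd query returns NOERROR from the same resolver, validation is the blocker, not reachability or the zone’s data.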
Task 2 — Compare behavior across different public resolvers
cr0x@server:~$ dig +noall +answer www.example.com A @8.8.8.8
www.example.com. 300 IN A 93.184.216.34
cr0x@server:~$ dig +noall +comments +answer www.example.com A @9.9.9.9
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 5112
Meaning: Different resolvers, different outcomes. Both of these public resolvers validate, so a split like this usually points to caches at different rollover states, transport quirks, or differing validation policy.
Decision: Move to DNSSEC-specific queries (+dnssec, delv) and check DS/DNSKEY chain.
Task 3 — Ask for DNSSEC records and look for the AD bit
cr0x@server:~$ dig +dnssec www.example.com A @1.1.1.1
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
www.example.com. 300 IN A 93.184.216.34
www.example.com. 300 IN RRSIG A 13 3 300 20260101000000 20251201000000 12345 example.com. ...
Meaning: ad indicates the resolver validated the answer. If you never see ad from a validating resolver, it may not be validating or can’t validate.
Decision: If your own resolver doesn’t set AD where public resolvers do, debug your resolver config/trust anchors first.
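To check whether your own resolver validates at all, query a zone that is deliberately broken. dnssec-failed.org is a long-standing test zone for exactly this (assumption: it is still maintained); any intentionally bogus domain works the same way:
cr0x@server:~$ dig +noall +comments dnssec-failed.org A @192.0.2.53 | grep "status:"
SERVFAIL here means your resolver validates; NOERROR with an answer means it doesn’t, and the AD-bit debugging belongs on your resolver, not the target zone.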
Task 4 — Use delv to get a validator’s explanation
cr0x@server:~$ delv www.example.com A
; fully validated
www.example.com. 300 IN A 93.184.216.34
Meaning: delv performs DNSSEC validation and tends to tell you what failed.
Decision: If delv says “fully validated” but your production resolver says bogus, suspect caching state, time skew, or resolver software issues.
Task 5 — Catch a real DNSSEC failure reason from delv
cr0x@server:~$ delv broken.example A
; validation failed: no valid signature found
; resolution failed: DNSSEC validation failure
Meaning: The zone responded, but signatures didn’t validate against DNSKEY, or the right RRSIG is missing/expired.
Decision: Inspect DNSKEY set and RRSIG validity windows; check for bad rollover or signing outage.
Task 6 — Inspect DS at the parent (chain-of-trust checkpoint)
cr0x@server:~$ dig +noall +answer example.com DS @a.gtld-servers.net
example.com. 86400 IN DS 12345 13 2 8B4F...A1C9
Meaning: Parent publishes a DS pointing to a child DNSKEY (by key tag, algorithm, digest).
Decision: If DS exists but child DNSKEY doesn’t match, you’ll get bogus. Fix DS (at registrar/parent) or fix child keys to match DS.
Task 7 — Inspect DNSKEY in the child zone
cr0x@server:~$ dig +noall +answer example.com DNSKEY @ns1.example.net
example.com. 3600 IN DNSKEY 257 3 13 AwEAAbc...KSK...
example.com. 3600 IN DNSKEY 256 3 13 AwEAAc9...ZSK...
Meaning: You should see at least one KSK (flag 257) and one ZSK (256) for typical setups. Algorithms must match DS.
Decision: If DNSKEYs changed recently, check if DS was updated (KSK rollover) and if caches could still have old DS/DNSKEY.
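To see whether DS and DNSKEY actually agree, recompute the DS from the child’s published KSK and compare it with what the parent serves (Task 6). A minimal sketch, assuming BIND’s dnssec-dsfromkey is installed; output trimmed for comparison:
cr0x@server:~$ dig +noall +answer example.com DNSKEY @ns1.example.net > /tmp/example.com.dnskey
cr0x@server:~$ dnssec-dsfromkey -2 -f /tmp/example.com.dnskey example.com
example.com. IN DS 12345 13 2 8B4F...A1C9
If the key tag, algorithm, or digest differs from the parent’s DS, you’ve found the break; decide whether the parent or the child is the wrong one before changing anything.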
Task 8 — Verify RRSIG timing (expired signatures are classic)
cr0x@server:~$ dig +dnssec +noall +answer example.com SOA @ns1.example.net
example.com. 3600 IN SOA ns1.example.net. hostmaster.example.com. 2025123101 7200 3600 1209600 3600
example.com. 3600 IN RRSIG SOA 13 2 3600 20251231120000 20251201000000 12345 example.com. ...
Meaning: RRSIG has inception and expiration times. If your signing pipeline broke, expirations hit and validators start failing.
Decision: If expired/near-expired, fix signing automation immediately and re-sign the zone. If times look “in the future,” check clock skew on authoritative servers or signer.
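If you want the expiration as a number you can alert on, the timestamp field parses cleanly. A minimal sketch using GNU date; the awk field positions assume dig’s default single-line RRSIG output:
cr0x@server:~$ exp=$(dig +dnssec +noall +answer example.com SOA @ns1.example.net | awk '$4 == "RRSIG" {print $9; exit}')
cr0x@server:~$ echo $(( ( $(date -u -d "${exp:0:8} ${exp:8:2}:${exp:10:2}:${exp:12:2}" +%s) - $(date -u +%s) ) / 86400 )) days until RRSIG expiry
If that number is smaller than your re-signing interval plus slack, the signing pipeline is already behind; page on it before it becomes a SERVFAIL.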
Task 9 — Check your resolver’s time (yes, still)
cr0x@server:~$ timedatectl
Local time: Wed 2025-12-31 02:21:10 UTC
Universal time: Wed 2025-12-31 02:21:10 UTC
RTC time: Wed 2025-12-31 02:21:10
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
Meaning: DNSSEC depends on time validity windows. If your resolver thinks it’s last week, signatures look “not yet valid” or “expired.”
Decision: If clock sync is off, fix NTP/chrony before touching DNSSEC config. Don’t “debug crypto” with a broken clock.
Task 10 — Read Unbound logs for the specific failure
cr0x@server:~$ sudo journalctl -u unbound --since "10 min ago" | tail -n 20
... unbound[1123]: info: validation failure <example.com. A IN>: no keys have a DS with algorithm 13
... unbound[1123]: info: resolving example.com. A
... unbound[1123]: info: error: SERVFAIL example.com. A IN
Meaning: That line is gold: it tells you the resolver expected a DS/algorithm relationship that doesn’t exist.
Decision: Check DS algorithm at the parent and DNSKEY algorithm at the child. This is often a botched algorithm rollover or stale DS.
Task 11 — Read BIND named logs for “bogus” and key IDs
cr0x@server:~$ sudo journalctl -u named --since "10 min ago" | tail -n 30
... named[905]: validating example.com/A: no valid signature found
... named[905]: validating example.com/A: got insecure response; parent indicates it should be secure
Meaning: BIND gives a similar story. “No valid signature found” often means missing/expired RRSIG or wrong DNSKEY set.
Decision: Confirm RRSIG presence and validity from authoritative servers directly. If missing, fix signing. If present, look for mismatch between DNSKEY and RRSIG signer.
Task 12 — Query authoritative servers directly (bypass recursion)
cr0x@server:~$ dig +norecurse +dnssec example.com DNSKEY @ns1.example.net
;; flags: qr aa; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
example.com. 3600 IN DNSKEY 257 3 13 AwEAAbc...KSK...
example.com. 3600 IN RRSIG DNSKEY 13 2 3600 20251231120000 20251201000000 12345 example.com. ...
Meaning: aa confirms the server is authoritative and returned signed DNSKEYs. If authoritative lacks RRSIG/DNSKEY, you have a signing or publication problem.
Decision: If authoritative data is wrong, fix at the source. If authoritative is fine but validators fail, look at DS at parent or transport/middlebox issues.
Task 13 — Check for truncation and TCP fallback issues
cr0x@server:~$ dig +dnssec +noall +comments example.com DNSKEY @ns1.example.net +bufsize=512
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39011
;; flags: qr aa tc; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
Meaning: tc means truncated UDP response. A resolver should retry over TCP. If TCP is blocked, validation can fail mysteriously.
Decision: Ensure TCP/53 is allowed to authoritative servers and from resolvers. If you see tc often, consider response sizing (fewer keys, proper rollover hygiene) and MTU paths.
Task 14 — Confirm TCP/53 connectivity when UDP truncates
cr0x@server:~$ dig +tcp +dnssec +noall +answer example.com DNSKEY @ns1.example.net
example.com. 3600 IN DNSKEY 257 3 13 AwEAAbc...KSK...
example.com. 3600 IN DNSKEY 256 3 13 AwEAAc9...ZSK...
Meaning: If TCP works and UDP truncates, your network needs to allow TCP fallback. Some “security” devices break it and call it a feature.
Decision: If TCP is blocked, fix the firewall policy. Do not disable DNSSEC to accommodate a broken middlebox.
Task 15 — Use a Negative Trust Anchor (NTA) in Unbound to mitigate
cr0x@server:~$ sudo unbound-control status
version: 1.17.1
verbosity: 1
threads: 2
modules: 2 [ validator iterator ]
uptime: 86400 seconds
options: control(ssl)
cr0x@server:~$ sudo unbound-control insecure_add example.com
ok
cr0x@server:~$ sudo unbound-control list_insecure
example.com.
Meaning: You’ve told the validator to treat that zone as insecure temporarily (Unbound calls this domain-insecure; it’s the NTA equivalent), bypassing DNSSEC validation failures for it.
Decision: Use this only as a time-bounded mitigation. Create a ticket to remove the NTA once upstream fixes DS/DNSKEY/signatures. Add monitoring so it doesn’t live forever.
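To make “doesn’t live forever” enforceable, here is a small check you can run from cron or a monitoring agent. A sketch, assuming unbound-control is usable from wherever this runs; wire the alert into your own pager:
#!/usr/bin/env bash
# Alert while any temporary domain-insecure (NTA-style) override is still configured in Unbound.
set -euo pipefail
overrides=$(unbound-control list_insecure)
if [ -n "$overrides" ]; then
  echo "WARNING: DNSSEC validation still bypassed for: ${overrides}" >&2
  exit 1  # non-zero exit so the scheduler/monitoring marks this check as failing
fi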
Task 16 — Flush resolver cache after upstream fixes (carefully)
cr0x@server:~$ sudo unbound-control flush_zone example.com
ok
Meaning: Old DNSKEY/DS/negative cache entries can keep you broken after the upstream is fixed.
Decision: Flush only the affected zone when possible. Full cache flush in peak hours is a self-inflicted DDoS against your upstream.
Joke #1: DNSSEC outages are great for team bonding—you’ll all stare at the same hex strings together, silently questioning your life choices.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
The company had a clean “ownership boundary” mindset: app team owns the app, platform team owns Kubernetes, network team owns DNS. The domain was managed by marketing through a registrar account no engineer had ever logged into. That seemed fine because the authoritative DNS was hosted by a reputable DNS provider and “DNSSEC is enabled.”
Then a certificate renewal job started failing. Not because ACME was down, but because the renewal worker couldn’t resolve the DNS provider’s API endpoint consistently. Sometimes it worked from laptops, sometimes not from production. The incident channel filled with the usual suspects: NAT gateways, egress firewalls, VPC resolver, “maybe IPv6.”
The wrong assumption was subtle: “If the DNS provider is healthy, DNSSEC must be fine.” In reality, the DS record at the parent had been changed during a registrar “security upgrade” workflow. Nobody noticed because authoritative DNS continued to serve records normally, and non-validating resolvers continued to resolve normally. Only validating resolvers started returning SERVFAIL.
They proved it by querying DS from the TLD servers and comparing it with the DNSKEY set. The DS pointed to a key tag that did not exist anymore. The zone was signed, the provider was fine, but the chain of trust was broken one level up—at the registrar. The fix was not heroic engineering. It was logging into the registrar, correcting DS to match the current KSK, and waiting for caches to settle.
The actual lesson: if you don’t control your registrar and DS workflow like production infrastructure, you don’t control DNSSEC. You have a hope-based security model.
Mini-story 2: The optimization that backfired
A platform team wanted to reduce DNS latency and external dependency. Reasonable. They deployed local validating resolvers on every node pool and forced pods to use them. They also tightened firewall rules, because “why would DNS ever need TCP?” In their threat model, TCP/53 looked suspicious.
A few months later, a partner domain rolled keys and briefly served a larger DNSKEY response. Over UDP it frequently truncated. The local resolvers tried to retry over TCP. The firewall blocked it. Validation failed intermittently, and because the resolvers were local, the blast radius was huge: every workload on that node pool saw sporadic failures to reach the partner API.
The on-call spent hours chasing “random network flakiness,” because packet loss graphs were fine and the authoritative DNS servers were reachable over UDP. The clue was in the resolver logs: repeated truncation and TCP retry failures, followed by bogus validation and SERVFAIL.
The “optimization” was not having local resolvers; that part was good. The backfire was assuming DNS is UDP-only and treating TCP as optional. DNSSEC made larger answers more common. Truncation is normal. TCP fallback is not a luxury feature.
They fixed it by allowing TCP/53 egress, and by adding a canary check that specifically tests a DNSKEY query over TCP from each cluster. That check caught the next firewall change before it shipped.
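The canary itself doesn’t need to be sophisticated. A minimal sketch of the per-cluster check; the partner zone and resolver address are placeholders:
cr0x@server:~$ dig +tcp +dnssec +noall +comments partner.example DNSKEY @10.0.0.53 | grep -q "status: NOERROR" || echo "ALERT: DNSKEY over TCP failed from this node pool"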
Mini-story 3: The boring but correct practice that saved the day
Another org ran its own authoritative DNS with a separate signer pipeline. Nothing fancy. But they had one boring ritual: every DNSSEC change required a preflight run that compared DS, DNSKEY, and RRSIG validity from three independent networks. Not just “it resolves from my laptop,” but actual validation.
During a planned KSK rollover, the signer produced the new key material on schedule. The DNS admins published the new DNSKEYs, then updated DS at the registrar. Everything looked good internally. The preflight, however, failed from one external vantage point: the parent zone was still serving the old DS to that network, likely due to propagation delay and caching behavior on some recursive path.
Because the team anticipated this, they didn’t panic. They kept both keys published longer than the minimum, avoided aggressive TTL reductions that cause resolver stampedes, and waited until DS convergence was observed in all three checks. Then they proceeded to remove the old key.
No outage. No drama. The root cause that never became an incident was “we assumed propagation is uniform.” It isn’t. Their boring practice turned an unpredictable internet into a manageable rollout process.
Joke #2: The internet doesn’t do “uniform propagation,” it does “eventually, with opinions.”
Common mistakes: symptoms → root cause → fix
1) Symptom: Some resolvers return NOERROR, others SERVFAIL
Root cause: DNSSEC validation differences (one validates, one doesn’t), or caches at different rollover states.
Fix: Use delv and dig +dnssec to confirm validation. Check DS vs DNSKEY match. Flush affected zones on your resolvers after fixing upstream.
2) Symptom: Everything broke right after “turning off DNSSEC”
Root cause: DS record left in the parent while the zone stopped serving signed data. Validators now expect signatures that no longer exist.
Fix: Remove DS at the registrar/parent first, keep serving DNSKEY/RRSIG until the old DS has expired from caches (its TTL at the parent), then disable signing.
3) Symptom: Unbound says “no keys have a DS with algorithm X”
Root cause: DS algorithm/digest mismatch with published DNSKEYs; common after algorithm rollover or mistaken DS update.
Fix: Recompute DS from the intended KSK and publish the correct DS at the parent. Ensure the child publishes that KSK.
4) Symptom: “no valid signature found” but DNSKEY exists
Root cause: Missing RRSIG for the RRset, expired signatures, or RRSIG created with a key no longer published.
Fix: Re-sign the zone; confirm signatures cover all critical RRsets (SOA, NS, A/AAAA, etc.). Validate time sync on signer and authoritative servers.
5) Symptom: Works for small queries, fails on DNSKEY or ANY
Root cause: UDP truncation plus TCP blocked, or EDNS0/fragmentation issues causing large DNSSEC responses to fail.
Fix: Allow TCP/53; verify path MTU; avoid oversized responses by cleaning up stale keys and not publishing unnecessary DNSKEY baggage.
6) Symptom: Failure appears “random” across sites
Root cause: Split-horizon DNS returning different DNSKEY/RRSIG depending on source, or anycast nodes out of sync, or different authoritative sets behind geo policies.
Fix: Ensure DNSSEC material is consistent across all authoritative nodes/edges. Avoid split-horizon for signed zones unless you truly know what you’re doing.
7) Symptom: Validators report “NSEC/NSEC3 proof failed” for NXDOMAIN
Root cause: Broken negative responses: missing/invalid NSEC/NSEC3 records or signatures during signing glitches.
Fix: Re-sign the zone and ensure the signer correctly generates negative proofs. Confirm authoritative servers serve the same NSEC/NSEC3 chain.
8) Symptom: Only your internal resolvers fail; public resolvers succeed
Root cause: Your resolvers have stale trust anchors, misconfigured validation, bad forwarders, or broken time sync.
Fix: Check resolver config, trust anchor updates, NTP, and whether you’re forwarding to a resolver that strips DNSSEC records.
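For that last case, check what trust anchor the resolver actually has on disk before blaming the zone. A sketch for Unbound; the file path varies by distro, the autotrust file format usually annotates each key with its tag, and the current root KSK key tag is 20326:
cr0x@server:~$ sudo unbound-anchor -a /var/lib/unbound/root.key
cr0x@server:~$ sudo grep "20326" /var/lib/unbound/root.key
No match, or a file that hasn’t changed in years, points at a stale or hand-managed trust anchor; fix that before touching validation settings.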
Checklists / step-by-step plan
Checklist A — You are the resolver operator (you run validating resolvers)
- Confirm DNSSEC is the trigger: compare results from your resolver and an external validating resolver using dig +dnssec and check the AD bit.
- Collect evidence: resolver logs (Unbound/BIND), DS from parent, DNSKEY from child, RRSIG timing for affected RRsets.
- Rule out local environment: time sync, TCP/53 reachability, EDNS0 buffer behavior, packet fragmentation, forwarder stripping.
- Decide mitigation: If business impact is high and upstream fix isn’t immediate, add an NTA for the minimum scope (ideally the exact zone, not the whole TLD).
- Communicate clearly: “We have a DNSSEC validation failure for zone X; service impact; mitigation applied; upstream owner engaged.” Not “DNS is down.”
- Remove mitigation: after upstream fix is validated from multiple networks; flush zone cache; remove NTA; confirm AD bit returns.
Checklist B — You are the zone owner (you control authoritative DNS and DNSSEC)
- Identify the failure type:
- If DS mismatch: parent DS doesn’t match child KSK.
- If signatures missing/expired: signing pipeline or publication broke.
- If transport: responses truncated and TCP blocked somewhere.
- Fix DS/DNSKEY correctness before anything else: publish correct keys; update DS; keep overlap for rollovers; don’t remove old keys too fast.
- Validate negative answers: NXDOMAIN responses must be signed with correct NSEC/NSEC3 proofs.
- Manage TTLs like a grown-up: lowering TTLs can speed changes, but it can also increase query load and expose brittle resolver behavior during incidents.
- Coordinate with registrar: DS changes are production changes. Treat them like production changes.
Checklist C — You’re stuck because “upstream won’t fix it fast”
- Mitigate locally: NTA (Unbound domain-insecure, BIND rndc nta) or equivalent policy-based bypass where appropriate; a BIND example follows this checklist.
- Pinpoint blast radius: only bypass the affected zone, not global validation.
- Set an expiry: calendar reminder, ticket, and monitoring alert for the NTA still present after a deadline.
- Escalate with evidence: provide DS output, DNSKEY output, and the exact validator error message. Vendors respond better to proofs than to feelings.
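If your resolvers run BIND rather than Unbound, the local mitigation in Checklist C maps to rndc nta, which is already time-bounded (default lifetime one hour); exact flag support depends on your BIND 9 version:
cr0x@server:~$ sudo rndc nta example.com
cr0x@server:~$ sudo rndc nta -dump
cr0x@server:~$ sudo rndc nta -remove example.com
The built-in expiry removes the worst failure mode of mitigations: being forgotten.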
FAQ
1) What does “bogus” mean in DNSSEC terms?
It means validation failed: the resolver cannot cryptographically verify the answer against the signed chain of trust, so it returns SERVFAIL.
2) Why do some resolvers still resolve a bogus domain?
Because they aren’t validating, or they’re configured not to enforce validation. Also, some might have cached data from before the break. Validation behavior is policy.
3) Is disabling DNSSEC on my resolver a good emergency fix?
As a last resort, temporarily, and only with explicit risk acceptance. Better: apply an NTA for the specific zone. That restores service while keeping DNSSEC for everything else.
4) We “turned off DNSSEC” on the zone but users got more failures. Why?
If the DS record remains at the parent, validators expect signatures. Turning off signing without removing DS breaks validation harder. Remove DS first, then disable signing.
5) How can clock skew cause DNSSEC validation to fail?
RRSIGs have inception and expiration times. If your resolver’s clock is wrong, valid signatures can appear “expired” or “not yet valid.” Fix NTP before anything else.
6) What’s the difference between DS and DNSKEY, operationally?
DNSKEY is published in the child zone. DS is published in the parent and points (by digest) to a child KSK. DS is what links the chain of trust.
7) Why does TCP/53 matter for DNSSEC?
DNSSEC adds signatures and keys, increasing response sizes. UDP can truncate; resolvers retry over TCP. If TCP/53 is blocked, large signed responses can fail.
8) Do I need to flush resolver caches after fixing DNSSEC?
Often yes, at least for the affected zone. Resolvers may cache DNSKEY, DS-related state, and negative responses. Prefer zone-specific flushes over full cache wipes.
9) What’s a safe way to validate from multiple vantage points?
Run delv or dig +dnssec from at least two networks (e.g., your DC and a cloud VM) and compare DS/DNSKEY and AD bit behavior.
10) If the authoritative servers answer correctly, why would validators fail?
Because the parent DS might be wrong, or the signatures might not validate even if they exist. Also, intermediate transport issues can prevent validators from retrieving necessary records reliably.
Conclusion: next steps that prevent repeats
If you remember one thing, make it this: DNSSEC failures are usually not mysterious. They’re just distributed, cached, and owned by multiple parties. The way out is disciplined evidence collection and controlled mitigation.
Do this next
- Add a DNSSEC canary that runs delv (or equivalent validation) against your critical domains from at least two networks, and alerts on validation failures—not just NXDOMAIN/timeout. A minimal sketch follows this list.
- Document DS ownership: who can change DS at the registrar, how approvals work, how rollbacks work. If the answer is “marketing,” fix your org chart or your access controls.
- Make TCP/53 non-negotiable for resolvers. If someone wants to block it, make them prove they understand truncation and EDNS0 behavior first.
- Practice key rollovers with a runbook that includes external validation checks and explicit overlap periods. “We’ll just rotate it” is how you buy a new incident.
- Prefer targeted mitigations (NTA per zone) over disabling validation globally. If you must bypass, set an expiry and an alert.
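A minimal canary sketch, assuming delv is available and that the domain list and the alerting hook get replaced with your own:
#!/usr/bin/env bash
# DNSSEC canary: alert on validation failures specifically, not just timeouts/NXDOMAIN.
set -u
DOMAINS="example.com www.example.com"   # replace with your critical zones
rc=0
for d in $DOMAINS; do
  out=$(delv "$d" A 2>&1)
  if ! printf '%s\n' "$out" | grep -q "fully validated"; then
    echo "DNSSEC canary FAILED for ${d}:"
    printf '%s\n' "$out" | head -n 3
    rc=1
  fi
done
exit $rc   # run this from at least two networks; page on non-zero exit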
DNSSEC is like a seatbelt: mildly annoying until the day you need it. The trick is making sure it doesn’t lock up because somebody installed it backwards.