DNSSEC: The Rollover Mistake That Nukes Email and Web at the Same Time


When DNSSEC goes wrong, the failure mode is brutal: resolvers don’t “sort of” fail. They validate, they don’t like what they see, and they return SERVFAIL. Your website looks dead. Your API looks dead. And your email—often routed through the same domain’s MX and related records—quietly stops arriving.

This is the outage that confuses everyone because the origin servers are fine, the load balancers are green, and the on-call is staring at dashboards like they’ve been personally betrayed by math. DNSSEC is math. It keeps receipts.

What actually breaks when DNSSEC breaks

DNSSEC doesn’t break “DNS.” It breaks resolution for validating resolvers. That detail matters because your internal recursive resolvers, your customers’ ISPs, and public resolvers may behave differently at the same moment. Some validate. Some don’t. Some are misconfigured and validate “sometimes” depending on upstream. That’s how you get the worst kind of incident: partial, geography-dependent, resolver-dependent.

When your domain goes bogus (DNSSEC-speak for “cryptographically invalid”), validating resolvers will stop answering. They return SERVFAIL for records in the affected zone:

  • Web breaks because A/AAAA lookups fail for www or the apex.
  • API breaks because service discovery stops working.
  • Email breaks because MX lookups fail, plus TXT records for SPF/DMARC may fail, and some MTAs treat that as a hard error or defer for hours.
  • Auth flows break because OIDC/SAML metadata endpoints, callback domains, and token introspection hostnames often share the same zone.
  • Certificate issuance breaks because ACME challenges depend on DNS or HTTP reachability that starts with DNS resolution.

There’s also a dark comedy angle: you can still resolve your own domain from a laptop using a non-validating resolver and conclude “DNS is fine.” That is how you earn the kind of postmortem where the root cause is a screenshot.

Facts and historical context you can use in a postmortem

DNSSEC has history, and some of it explains today’s operational sharp edges. Here are concrete facts that help teams make better decisions:

  1. DNSSEC was conceived in the 1990s after cache poisoning attacks made it obvious that DNS needed authenticity, not just availability.
  2. The root zone was signed in 2010, a major milestone that made global DNSSEC validation feasible without custom trust anchors.
  3. The first root KSK rollover happened in 2018, after years of planning, delays, and careful measurement of resolver readiness.
  4. DNSSEC doesn’t encrypt DNS; it signs it. Confidentiality is a different fight (now handled by things like DoT/DoH at the transport layer).
  5. DNSSEC adds response size, which historically increased fragmentation risk and triggered path MTU/firewall issues—especially before EDNS0 was widely handled correctly.
  6. NSEC and NSEC3 exist partly because authenticated denial of existence is required; you need a signed way to say “this name doesn’t exist.”
  7. Registrar/registry DS workflows vary, and many incidents are caused not by cryptography but by ticket queues, UI quirks, and propagation timing.
  8. TTL and caching rules dominate incident duration. Even after you fix a DS record, broken state can persist until caches expire.
  9. Some email systems treat DNS failures differently: a transient DNS error can become hours of deferred mail, then a burst of retries, then silent drops if queues overflow.

Chain of trust in operational terms (not crypto theater)

Forget the whitepapers for a minute. In production, DNSSEC is a supply chain:

  • The parent zone publishes a DS record for your zone.
  • Your child zone publishes DNSKEY records. One of them is the KSK (key-signing key), one or more are ZSKs (zone-signing keys). In many implementations they’re just flags and operational roles.
  • Your zone’s records are signed with RRSIG. Validators verify signatures using DNSKEYs, and verify that the DNSKEY is authorized by comparing digests to the DS in the parent.

The point: the parent vouches for the child by publishing DS. If DS and DNSKEY stop matching, validators do not shrug. They mark the zone as bogus and stop answering.
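The KSK/ZSK distinction above is, on the wire, just the flags field of the DNSKEY record. A minimal sketch of the conventional mapping (the flag values are standard; the helper name is mine):

```shell
# key_role: map the DNSKEY flags field to its conventional operational
# role. 257 means the SEP bit is set (by convention, the KSK); 256 is a
# ZSK. These are roles, not cryptographic enforcement.
key_role() {
  case "$1" in
    257) echo "KSK" ;;
    256) echo "ZSK" ;;
    *)   echo "unknown ($1)" ;;
  esac
}

# Live usage would pull the flags field from a real DNSKEY query, e.g.:
#   dig example.com DNSKEY +short | awk '{print $1}'
key_role 257   # prints: KSK
key_role 256   # prints: ZSK
```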

The operational consequence is simple and harsh: a DNSSEC rollover is a distributed change across at least two administrative domains (your zone and the parent/registry), plus the caching layer of the entire internet. That’s why “just rotate keys weekly” is not a sign of maturity; it’s a sign you enjoy avoidable excitement.

One reliability quote worth keeping on the wall—because DNSSEC failures are usually process failures:

“Hope is not a strategy.” — a traditional SRE saying

The rollover mistake that nukes web and email

The classic meltdown is this sequence:

  1. You roll the KSK or change DNSKEYs in the zone.
  2. You forget to update the DS at the parent, or you update it incorrectly, or the update is delayed.
  3. Validating resolvers can no longer build a chain of trust from parent DS to your DNSKEY.
  4. They return SERVFAIL for names in your zone.

That’s the simple version. The realistic version has more ways to hurt you:

  • You publish a new DNSKEY but you haven’t yet signed the zone correctly with it (or you stopped publishing the old signatures too early).
  • You updated DS, but used the wrong digest type or the wrong key tag because your tool output was misunderstood.
  • You updated DS at the registrar UI, but the UI accepted it while silently truncating, reformatting, or delaying submission to the registry.
  • You are using a DNS provider that automates DNSSEC, and you switched providers without coordinating DS removal/addition.
  • You changed nameservers or migrated DNS hosting and assumed DNSSEC “moves with the zone.” It doesn’t unless you rebuild the chain and coordinate DS.

DNSSEC rollovers are also where “email and web die together” becomes very literal. Web fails because clients can’t resolve A/AAAA. Email fails because senders can’t resolve MX and treat that as a temporary failure, which delays mail and piles up retries. Even worse, your SPF and _dmarc TXT records may become unresolvable, and because a lookup failure is a temporary error rather than a clean “no record” answer, sender behavior diverges in unpredictable ways.

Joke #1: DNSSEC is like a bouncer who checks IDs with a microscope. Great security, until you spell your own name wrong and get denied entry to your own party.

Why this failure is so catastrophic: validators fail closed

DNSSEC is designed to protect against forgery. If validators accepted “maybe valid” answers, an attacker could exploit uncertainty. So validation is fail-closed. Your records are either signed correctly and chain to the parent DS, or they’re treated as untrustworthy.

This is not negotiable and not “a bug in resolvers.” It’s the point. Your job is to make rollovers boring.

The two rollovers you must distinguish: ZSK vs KSK

Most teams can survive a ZSK rollover mistake with less blast radius because ZSK changes do not require updating DS in the parent (assuming KSK remains stable and the DNSKEY RRset is signed appropriately). KSK rollovers, by definition, are the ones that require parent coordination.

Operational rule of thumb:

  • ZSK rollover: your zone, your signatures, your TTLs. Mostly under your control.
  • KSK rollover: your zone and the parent DS pipeline, plus caching. You are no longer the only adult in the room.

Three corporate mini-stories from the trenches

Mini-story 1: The wrong assumption incident

A mid-size SaaS company decided to migrate authoritative DNS from one provider to another. The project plan included lowering TTLs, testing record parity, and gradually moving NS at the registrar. It looked like a textbook move—until the cutover produced a sudden flood of support tickets: “site is down,” “can’t log in,” “emails bouncing.”

The on-call started with the usual suspects: load balancers, WAF, TLS certs. Everything was green. Then a customer sent a screenshot: SERVFAIL from their ISP resolver. That narrowed it to DNS. The company’s own office resolvers still worked, because their internal recursive resolvers were configured without DNSSEC validation. The war room spent an hour arguing about whether DNSSEC “really matters” because “we’ve never had issues.”

The wrong assumption was simple: the team assumed DNSSEC was “attached to the domain” at the registrar and would keep working after the nameserver move. In reality, the domain’s parent zone still published a DS record pointing to the old provider’s DNSKEY. The new provider served a different DNSKEY set. Validators saw a DS that didn’t match the child DNSKEY and failed closed.

The fix was equally simple but operationally painful: either remove DS (disabling DNSSEC) or publish the correct DNSKEY and update DS to match. The registrar UI update took time to hit the registry. During that window, caches held onto the bogus state. Web was down for a chunk of the internet; email queued and then delivered late, with a backlog that caused secondary pain in downstream systems.

The postmortem action item that mattered was not “be careful with DNSSEC.” It was: treat DNS hosting migrations as a two-phase commit with DS coordination. DNSSEC must be explicitly moved, verified, and monitored. It is not a checkbox; it is a chain.

Mini-story 2: The optimization that backfired

An enterprise team had a goal: reduce operational overhead by automating cryptographic key rotation. They set an aggressive schedule—rotate DNSSEC keys frequently, like clockwork. The automation was impressive: pipelines, HSM integration, alerts if signatures were missing. The team was proud. They should have been. Then reality showed up.

The first few rotations were ZSK-only and went fine. Then someone “optimized” by including KSK rotation in the same automation. The script generated new keys and pushed updated DNSKEYs to the authoritative service. It also used the registrar’s API to update DS. But the API had quirks: it accepted a DS payload format that was slightly different from what the tool produced. The API returned success while actually storing a malformed DS. The zone went bogus for validating resolvers.

The worst part was timing. The automation ran at night. The DS update “succeeded” according to logs, and the script deleted the old KSK early to reduce “key clutter.” Validation broke, and the rollback path was not trivial because the old key material was already rotated out of the HSM slot used by the system. Everything still existed, but restoring required an HSM operator and a change window. Meanwhile, the public internet had no sympathy.

The team eventually stabilized by adopting a less sexy plan: separate ZSK rotation from KSK rotation, make KSK rotation rare and manual-with-checks, and require external validation from multiple public resolvers before and after DS changes. The optimization that backfired wasn’t automation itself—it was automating a cross-org, cache-dependent workflow without building guardrails for the parent link.

Mini-story 3: The boring practice that saved the day

A retailer with heavy seasonal traffic had a habit that nobody bragged about: they kept a runbook with screenshots of their registrar’s DS UI, the exact commands to compute DS from DNSKEY, and a calendar for planned DNSSEC events. They also had a standing policy: don’t touch DNSSEC during peak weeks. Boring. Correct.

One morning, their DNS provider had an issue that required emergency migration to secondary authoritative nameservers hosted elsewhere. This could have become a DNSSEC horror story. Instead, the team already had a pre-published secondary DNS setup, including signed zones on both providers with the same KSK, and they had tested validation weeks earlier.

During the incident, they switched NS records at the parent with minimized TTL impact. Because DNSSEC keys and signatures remained consistent and the DS stayed valid, validating resolvers never saw a broken chain. Customers saw no widespread SERVFAIL. Email kept flowing. The only real pain was the change process itself, not customer impact.

The lesson: boring operational discipline beats cleverness. If your DNSSEC plan depends on “we’ll just fix it quickly,” you’ve already lost.

Fast diagnosis playbook

When web and email both look dead, you don’t have time for philosophy. You need a tight loop that tells you whether you’re dealing with DNSSEC validation, authoritative availability, or something else.

First: identify if it’s DNSSEC validation (not just DNS)

  1. Query with DNSSEC and look for SERVFAIL from a validating resolver.
  2. Compare against a non-validating path (for example, rerun the same query with dig’s +cd, checking-disabled, flag) to confirm the records exist but validation fails.
  3. Check the chain point of failure: is it the child signatures, the DNSKEY, or the parent DS?
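The first two steps above reduce to comparing two dig statuses. A sketch of that decision, with the statuses passed in as parameters so the logic is testable offline (resolver addresses in the comments are examples):

```shell
# classify: given the status of a validating query and of a +cd
# (checking-disabled) query against the same resolver, decide whether
# this looks like a DNSSEC validation failure. SERVFAIL with validation
# on plus NOERROR with +cd is the classic bogus-zone signature.
classify() {
  local with_validation="$1" with_cd="$2"
  if [ "$with_validation" = "SERVFAIL" ] && [ "$with_cd" = "NOERROR" ]; then
    echo "dnssec-bogus"
  elif [ "$with_validation" = "SERVFAIL" ]; then
    echo "dns-broken"
  else
    echo "not-a-validation-problem"
  fi
}

# Live usage sketch (extract "status:" from the dig header):
#   v=$(dig +dnssec www.example.com A @1.1.1.1 | sed -n 's/.*status: \([A-Z]*\).*/\1/p')
#   c=$(dig +cd     www.example.com A @1.1.1.1 | sed -n 's/.*status: \([A-Z]*\).*/\1/p')
classify SERVFAIL NOERROR   # prints: dnssec-bogus
```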

Second: confirm if the parent DS matches the child DNSKEY

  1. Fetch DS from the parent view.
  2. Fetch DNSKEY from the child authoritative servers.
  3. Compute DS from DNSKEY and compare.

Third: decide your emergency posture

  • If you can quickly restore a valid chain: do it. Re-publish old DNSKEY/signatures or update DS correctly.
  • If parent update will take too long: consider temporarily disabling DNSSEC by removing DS at the parent, but treat it as an incident-level decision with explicit approval and a re-enable plan.
  • Communicate early: “validating resolvers impacted; non-validating may work; email delays expected.” That messaging reduces noise.

Joke #2: DNS is the phonebook, DNSSEC is the notary. If the notary goes on strike, nobody gets to call anyone.

Practical tasks: commands, outputs, and decisions (12+)

These are the tasks I actually want an on-call to run. Each has: a command, example output, what it means, and what you decide next. Use your own domain and nameserver hostnames. The point is the workflow.

Task 1: Check if a validating resolver returns SERVFAIL

cr0x@server:~$ dig +dnssec www.example.com A @1.1.1.1

; <<>> DiG 9.18.24 <<>> +dnssec www.example.com A @1.1.1.1
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 1203
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; Query time: 42 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; WHEN: Tue Feb 04 10:12:01 UTC 2026
;; MSG SIZE  rcvd: 56

What it means: A validating public resolver can’t provide an answer. SERVFAIL is consistent with DNSSEC bogus, but could also be upstream failures.

Decision: Immediately test from a second validating resolver and then compare with a non-validating query path.

Task 2: Compare against another validating resolver to rule out a single resolver issue

cr0x@server:~$ dig +dnssec www.example.com A @8.8.8.8

; <<>> DiG 9.18.24 <<>> +dnssec www.example.com A @8.8.8.8
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 5550
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; Query time: 31 msec
;; SERVER: 8.8.8.8#53(8.8.8.8) (UDP)

What it means: Multiple validating resolvers fail. That strongly suggests a DNSSEC validation issue or a widespread authoritative issue.

Decision: Query the authoritative servers directly to see whether the records exist and whether DNSKEY/RRSIG are present.

Task 3: Query authoritative nameserver directly for the record

cr0x@server:~$ dig www.example.com A @ns1.dns-provider.net +norecurse

; <<>> DiG 9.18.24 <<>> www.example.com A @ns1.dns-provider.net +norecurse
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2086
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
www.example.com. 300 IN A 203.0.113.10

What it means: The authoritative server has the record and answers NOERROR. So “DNS exists.” The problem is likely validation (DNSSEC chain/signatures), not missing records.

Decision: Pull DNSKEY and signatures from the authoritative server next.

Task 4: Fetch DNSKEY from authoritative and verify it’s being served

cr0x@server:~$ dig example.com DNSKEY @ns1.dns-provider.net +dnssec +norecurse

; <<>> DiG 9.18.24 <<>> example.com DNSKEY @ns1.dns-provider.net +dnssec +norecurse
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3919
;; flags: qr aa; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
example.com. 300 IN DNSKEY 257 3 13 ( ...KSK_PUBLIC_KEY... ) ; key id = 20345
example.com. 300 IN DNSKEY 256 3 13 ( ...ZSK_PUBLIC_KEY... ) ; key id = 48711
example.com. 300 IN RRSIG DNSKEY 13 2 300 20260212000000 20260201000000 20345 example.com. ( ... )

What it means: DNSKEY RRset is present and signed (RRSIG over DNSKEY exists). Key id/key tag values matter for DS comparisons.

Decision: Compare parent DS with the KSK (flag 257) digest. If DS doesn’t match, you’ve found the nuke.

Task 5: Fetch DS from the parent view

cr0x@server:~$ dig example.com DS @a.gtld-servers.net +norecurse

; <<>> DiG 9.18.24 <<>> example.com DS @a.gtld-servers.net +norecurse
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6421
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
example.com. 86400 IN DS 14567 13 2 8A1C5F2D1B3E1B4C4F...C0FFEE

What it means: The parent (TLD) is publishing DS key tag 14567. If your current KSK key tag is 20345, that mismatch is suspicious (not automatically wrong, but often is).

Decision: Compute DS from the KSK you’re serving and compare digest/key tag.

Task 6: Compute DS from DNSKEY and compare with parent DS

cr0x@server:~$ dig example.com DNSKEY @ns1.dns-provider.net +norecurse > /tmp/example.com.dnskey
cr0x@server:~$ dnssec-dsfromkey -2 -f /tmp/example.com.dnskey example.com
example.com. IN DS 20345 13 2 3B7F2E0B7C7C0C2B7A...BADA55

What it means: The DS you computed from the currently served KSK (20345) does not match the parent DS (14567 … C0FFEE). That’s a broken chain of trust.

Decision: Either restore the old KSK/DNSKEY that matches the parent DS, or update DS at the parent to the new value. Choose based on which can be done faster and more safely.

Task 7: Prove the zone is “bogus” using a validator tool (Unbound)

cr0x@server:~$ unbound-host -D example.com

example.com has address 203.0.113.10 (secure)
example.com has no AAAA record (secure)

What it means: If you see (secure), validation succeeded. If you see (bogus) or errors about DS/DNSKEY, you have a DNSSEC failure.

Decision: If bogus: focus on DS/DNSKEY/signatures. If secure but you still have issues: you’re dealing with a different DNS or application problem.

Task 8: Check for missing or expired signatures on key records

cr0x@server:~$ dig example.com DNSKEY @ns1.dns-provider.net +dnssec +norecurse | sed -n '/RRSIG DNSKEY/,$p'

example.com. 300 IN RRSIG DNSKEY 13 2 300 20260212000000 20260201000000 20345 example.com. ( ... )

What it means: RRSIG has an expiration and inception. If expiration is in the past (or inception is in the future due to clock skew in signing), validators reject it.

Decision: If signatures are expired/not-yet-valid: trigger a re-sign with correct time and ensure signer clock sync.
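A sketch of that expiry check, parameterized on a “now” epoch so it can be exercised with a fixed clock. Assumes bash and GNU date; the function name is mine:

```shell
# rrsig_expired: check whether an RRSIG timestamp (YYYYMMDDHHMMSS, as
# printed by dig) is already in the past relative to a given epoch.
# Assumes bash substring expansion and GNU date.
rrsig_expired() {
  local ts="$1" now="$2"
  local iso="${ts:0:4}-${ts:4:2}-${ts:6:2} ${ts:8:2}:${ts:10:2}:${ts:12:2} UTC"
  local exp
  exp=$(date -u -d "$iso" +%s) || return 1
  if [ "$exp" -le "$now" ]; then echo "expired"; else echo "ok"; fi
}

# The expiration field from the output above, checked against two clocks:
rrsig_expired 20260212000000 "$(date -u -d '2026-02-01 UTC' +%s)"   # prints: ok
rrsig_expired 20260212000000 "$(date -u -d '2026-02-13 UTC' +%s)"   # prints: expired
```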

Task 9: Check SOA serial and propagation across authoritative nameservers

cr0x@server:~$ for ns in ns1.dns-provider.net ns2.dns-provider.net; do dig example.com SOA @$ns +norecurse +short; done
ns1.dns-provider.net. hostmaster.example.com. 2026020401 3600 600 1209600 300
ns1.dns-provider.net. hostmaster.example.com. 2026020309 3600 600 1209600 300

What it means: Different SOA serials indicate inconsistent zone versions between authoritative servers. With DNSSEC, inconsistency can be fatal if signatures/keys differ.

Decision: Stop rollouts, fix your zone distribution/AXFR/CI pipeline, and ensure all auth servers serve identical DNSKEY and signed RRsets before touching DS.

Task 10: Check if resolvers are getting truncated responses (TCP fallback issues)

cr0x@server:~$ dig +dnssec example.com DNSKEY @1.1.1.1

;; Truncated, retrying in TCP mode.
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53026
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

What it means: DNSKEY responses can be large. Truncation is normal; TCP fallback must work. If firewalls block TCP/53, validators may fail.

Decision: If you suspect TCP/53 is blocked between resolvers and your authoritative servers, verify network policy and allow TCP/53.

Task 11: Validate MX resolution from a validating resolver

cr0x@server:~$ dig +dnssec example.com MX @9.9.9.9

; <<>> DiG 9.18.24 <<>> +dnssec example.com MX @9.9.9.9
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49990
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

What it means: Email routing is impacted. Even if your mail provider is healthy, senders can’t discover where to send mail.

Decision: Start customer comms: expect delayed inbound mail. If you operate MTAs, monitor queues and deferred mail volume.

Task 12: Check SPF and DMARC TXT resolution (deliverability side-effects)

cr0x@server:~$ dig +dnssec example.com TXT @1.1.1.1 +short
cr0x@server:~$ dig +dnssec _dmarc.example.com TXT @1.1.1.1 +short

What it means: With +short, a failed lookup prints nothing at all; rerun without +short and you’ll see SERVFAIL in the header. If policy records can’t be fetched, some senders treat this as a temporary error; others downgrade enforcement; some will defer.

Decision: If DNSSEC is broken, fixing validation comes first. Don’t “tune DMARC” during a cryptographic outage.

Task 13: Don’t trust panel or WHOIS-style output for DS state; confirm via DNS

cr0x@server:~$ dig example.com DS +trace

; <<>> DiG 9.18.24 <<>> example.com DS +trace
.			518400	IN	NS	a.root-servers.net.
...
com.			172800	IN	NS	a.gtld-servers.net.
...
example.com.		86400	IN	DS	14567 13 2 8A1C5F2D1B3E1B4C4F...C0FFEE

What it means: +trace shows the delegation path and the DS actually published in DNS, not what some control panel claims.

Decision: Treat DNS as truth. If the panel disagrees, escalate with registrar/registry or fix the API integration.

Task 14: For BIND operators—check signing status and key publication locally

cr0x@server:~$ rndc signing -list example.com
Done signing with key 20345/ECDSAP256SHA256
Done signing with key 48711/ECDSAP256SHA256

What it means: BIND thinks the zone is signed with specific keys. If this disagrees with what authoritative servers publish, your deployment is inconsistent.

Decision: If mismatch: stop and reconcile. Ensure key material and signed zone files are deployed consistently to all auth nodes.

Common mistakes: symptoms → root cause → fix

This is the section that should prevent your next outage. These are not theoretical; they are the patterns that keep happening.

1) Symptom: SERVFAIL on validating resolvers; authoritative answers directly

Root cause: DS does not match the child KSK DNSKEY (or DNSKEY RRset is not properly signed).

Fix: Compute DS from the active KSK and update the parent DS. Or restore the prior KSK/DNSKEY that matches the published DS. Don’t guess—compare digests.

2) Symptom: Works for some users, fails for others, flips over hours

Root cause: Caching and TTL differences; inconsistent authoritative servers; some resolvers pinned to old DS/DNSKEY in cache.

Fix: Ensure all authoritative servers serve identical signed data and keys. Wait for TTLs. For emergency: fix chain quickly; then allow caches to converge.

3) Symptom: DNSKEY queries are slow or time out; lots of TCP retries

Root cause: UDP fragmentation or blocked TCP/53; oversized DNSSEC responses; middleboxes mishandling EDNS.

Fix: Allow TCP/53 to authoritative servers. Tune authoritative response size via EDNS buffer considerations. Avoid network policies that treat TCP/53 as suspicious.

4) Symptom: Suddenly bogus after a “routine” resign

Root cause: Signer clock skew or bad signature validity windows causing RRSIG inception/expiration to be unacceptable.

Fix: Fix NTP. Re-sign. Verify RRSIG times from authoritative with dig.

5) Symptom: Email bounces or defers while web seems “mostly fine”

Root cause: MX and TXT lookups failing for validating resolvers; web clients may be using cached DNS or non-validating resolvers; MTAs are stricter about repeated failures.

Fix: Validate MX/TXT specifically from major validating resolvers. Communicate delay expectations. Fix DNSSEC chain.

6) Symptom: Outage immediately after DNS provider migration

Root cause: DS still points to old provider’s key; new provider serving different DNSKEYs.

Fix: Plan migrations with DNSSEC: either keep KSK stable across providers (if supported) or coordinate DS update as a cutover step with validation gates.

7) Symptom: Zone is secure, but only a specific record type fails (e.g., TXT)

Root cause: Selective signing or broken NSEC/NSEC3 chain for denial-of-existence; or inconsistent signing across nodes for that RRset.

Fix: Verify that the RRset has RRSIG and that denial of existence proofs are consistent. Re-sign entire zone; ensure consistent deployment.

8) Symptom: Panel shows DS updated, but trace still shows old DS

Root cause: Registrar accepted input but didn’t submit to registry yet, or registry update window delayed, or wrong delegation updated (multiple accounts/registrars).

Fix: Use dig +trace as truth. Escalate with registrar, include trace output. Don’t keep rotating keys while waiting.

Checklists / step-by-step plan

Emergency rollback plan (when you are already on fire)

  1. Confirm DNSSEC validation failure with two validating resolvers using dig +dnssec.
  2. Query authoritative directly to confirm records exist and isolate validation.
  3. Compare DS and DNSKEY using dnssec-dsfromkey and a DS query from the parent.
  4. Choose the fastest safe repair path:
    • If you still have the old KSK and can serve it: republish it and its signatures so it matches the parent DS.
    • If parent DS can be updated quickly and reliably: update DS to new KSK digest and verify via dig +trace.
    • If neither is fast: remove DS to disable DNSSEC temporarily, document the decision, and plan re-enable.
  5. Validate after change using unbound-host -D or equivalent, and multiple public resolvers.
  6. Monitor mail queues and retries. Expect delayed delivery even after fix due to sender retry behavior.
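Step 5 can be sketched as a sweep across several public validating resolvers. Resolver list and domain below are examples; the status extractor is split out so it can be tested on canned dig output:

```shell
# dig_status: extract the "status:" field from a dig header on stdin.
dig_status() {
  sed -n 's/.*->>HEADER<<-.*status: \([A-Z]*\).*/\1/p'
}

# Sweep: one line per resolver, e.g. "1.1.1.1  NOERROR". If dig is
# unavailable or the query fails entirely, report "no-answer".
for r in 1.1.1.1 8.8.8.8 9.9.9.9; do
  s=$(dig +dnssec www.example.com A @"$r" 2>/dev/null | dig_status)
  printf '%-10s %s\n' "$r" "${s:-no-answer}"
done
```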

Planned KSK rollover: the boring-but-correct runbook

  1. Inventory your dependencies: registrar DS update method, registry timing, DNS provider capabilities, whether you use CDS/CDNSKEY automation, and who has access.
  2. Lower TTLs ahead of time for DNSKEY and relevant RRsets if your platform supports it safely. Do this days before, not minutes before.
  3. Pre-publish the new KSK (publish new DNSKEY alongside old). Do not remove old yet.
  4. Sign the DNSKEY RRset properly so validators can trust the new key when the DS changes.
  5. Compute DS for the new KSK and have a second person verify key tag, algorithm, digest type, and the hex digest.
  6. Update DS at the parent and track actual DNS publication via dig +trace.
  7. Wait through TTL and propagation windows. This is not bureaucracy; it’s cache physics.
  8. Only then remove the old KSK after you are confident the new DS is universally published and validation is secure across resolvers.
  9. Post-roll verification: validate A/AAAA, MX, TXT, DNSKEY, and a nonexistent name (to test NSEC/NSEC3 behavior) from multiple validating resolvers.
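Steps 5 and 6 boil down to comparing two DS RRdata strings that may come from different tools. A sketch of a comparison gate that normalizes case and spacing first, so formatting differences don’t cause false alarms (function names are mine; live inputs would come from dnssec-dsfromkey and a DS query against the parent):

```shell
# normalize_ds: lowercase the digest and squeeze whitespace so DS RRdata
# from different tools ("keytag alg digesttype digest") compares cleanly.
normalize_ds() {
  echo "$1" | tr 'A-F' 'a-f' | tr -s ' \t' ' ' | sed 's/^ //; s/ $//'
}

# ds_gate: refuse to proceed unless child-computed DS and parent-published
# DS agree.
ds_gate() {
  if [ "$(normalize_ds "$1")" = "$(normalize_ds "$2")" ]; then
    echo "match"
  else
    echo "MISMATCH - do not remove the old key"
  fi
}

# Live usage sketch (compare the RRdata portions of these two outputs):
#   dnssec-dsfromkey -2 -f /tmp/example.com.dnskey example.com
#   dig example.com DS @a.gtld-servers.net +norecurse
ds_gate "20345 13 2 3B7F2E0B" "20345 13 2 3b7f2e0b"   # prints: match
```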

Planned ZSK rollover: keep it routine

  1. Pre-publish new ZSK in DNSKEY RRset.
  2. Start signing with both ZSKs (double-sign) during transition if supported.
  3. After caches have aged out old signatures, stop signing with old ZSK.
  4. Remove old ZSK after safe delay.

FAQ

1) Why does DNSSEC failure show up as SERVFAIL instead of NXDOMAIN?

SERVFAIL means the resolver couldn’t produce a valid answer. With DNSSEC, “I got data but it didn’t validate” is treated as a failure, not “name doesn’t exist.” NXDOMAIN is a signed statement about nonexistence; SERVFAIL is “I can’t vouch for anything here.”

2) If I disable DNSSEC by removing DS, will everything instantly recover?

No. Some resolvers cache failure states and negative responses. You typically see improvement quickly, but full recovery depends on TTLs and resolver behavior. Plan for a tail.

3) Why do web and email fail together?

They share the same DNS zone. Web needs A/AAAA. Email needs MX and usually TXT for SPF/DMARC. If validating resolvers can’t resolve any of these due to bogus DNSSEC, both channels degrade at once.

4) Is this only a KSK rollover problem?

KSK rollovers are the most common “nuke it from orbit” scenario because DS must match KSK. But ZSK rollovers can also break validation if signatures are inconsistent, expired, or not deployed uniformly across authoritative servers.

5) Can I rely on my DNS provider’s “automatic DNSSEC” feature?

You can, but only if you also trust and understand how DS is managed at the parent. Automation that stops at the zone edge is not automation; it’s a half-built bridge. Demand visibility: key tags, digest types, and validation checks.

6) What about CDS/CDNSKEY—can that prevent DS mistakes?

It can reduce human error if your registrar/registry supports it correctly and you understand the publication timing. It can also amplify mistakes if your signer publishes bad CDS/CDNSKEY and the parent accepts it. Treat it like any other automation: verify.

7) How do I test like a customer, not like an engineer with special resolvers?

Use multiple public validating resolvers, and test from networks you don’t control (or at least different recursive stacks). Also test both web and mail-related records: A/AAAA, MX, TXT, and DNSKEY.

8) Why did email keep failing after DNS was fixed?

Because email is store-and-forward. Senders retry on temporary failures, often with backoff. Once DNS works again, queues drain over hours. If the outage was long, some senders may have given up or hit queue limits.

9) What’s the safest “break glass” move?

Fastest safe move is usually to restore the last-known-good DNSKEY/signature set that matches the currently published DS. Removing DS disables DNSSEC and can be effective, but it’s a security regression and should be time-boxed and carefully tracked.

10) How do I prevent “it works for me” during DNSSEC incidents?

Ensure your internal recursive resolvers validate DNSSEC (or at least have a validation-checking path available). If your own environment doesn’t validate, you are blind to a whole class of customer failures.

Next steps you can do this week

DNSSEC is not hard because cryptography is hard. It’s hard because you’re coordinating state across organizations and caches. So do the boring things that make coordination survivable:

  1. Make validation visible internally: run at least one validating resolver you trust, and have a one-liner check in your incident tooling for dig +dnssec against it.
  2. Write your DNSSEC runbook: include DS update method, how to compute DS from DNSKEY, the names of the authoritative servers, and rollback options.
  3. Practice in a non-production zone: do a full KSK rollover rehearsal where you actually update DS at the parent and validate from public resolvers.
  4. Add monitoring that checks “secure” status: not just “does A record resolve,” but “does it resolve as secure from validating resolvers.”
  5. Separate ZSK and KSK processes: rotate ZSK routinely if you must, and treat KSK rollovers as planned maintenance with explicit gates.
  6. Stop doing DNS migrations without DNSSEC planning: if a project plan doesn’t mention DS, it’s incomplete.
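For items 1 and 4 above, “resolves as secure” means the validating resolver set the AD (authenticated data) flag in its response, not merely that an answer came back. A monitoring-style sketch; the flag check is split out so it can be tested on canned dig output, and resolver/domain are examples:

```shell
# ad_flag: report "secure" if the dig output on stdin shows the AD flag
# set in the header flags line, "insecure" otherwise.
ad_flag() {
  grep -Eq '^;; flags:[^;]* ad[ ;]' && echo "secure" || echo "insecure"
}

# Live usage (the monitoring one-liner):
#   dig +dnssec www.example.com A @1.1.1.1 | ad_flag
printf ';; flags: qr rd ra ad; QUERY: 1, ANSWER: 1\n' | ad_flag   # prints: secure
printf ';; flags: qr rd ra; QUERY: 1, ANSWER: 1\n' | ad_flag      # prints: insecure
```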

If you take only one lesson: DNSSEC rollovers are change management, not crypto. Your keys can be perfect and you can still take down your business with a mismatched DS. Make the chain of trust a first-class production dependency, because the internet already does.
