DNS: Your Domain Works… Until It Doesn’t — The Delegation Trap Explained

The nastiest DNS outages aren’t the ones where nothing resolves. The nastiest ones are where
some people can reach you, some can’t, and everyone is convinced it’s “the network.”
Your status page is up (ironically on a different domain), your origin is healthy, and yet you’re
watching support tickets roll in from one ISP, one country, or one specific mobile carrier.

That’s the delegation trap: the uncomfortable gap between “my zone file looks correct” and “the world
can reliably discover my authoritative servers and trust the answers.” Delegation is the plumbing.
When it leaks, it leaks sideways.

Delegation: what it actually means on the wire

People talk about DNS like it’s a database you update and everyone magically sees the new truth.
In practice it’s a scavenger hunt. A recursive resolver starts at the root, walks to the TLD,
walks to your domain’s parent zone, and only then learns where your authoritative nameservers live.
That “learn where to go” step is delegation.

Delegation is not your zone file. Delegation is the parent zone’s published NS set for your domain,
plus any glue records (A/AAAA in the parent) needed to reach those nameservers.
Your zone file can be immaculate, and the internet can still be walking toward the wrong door.

The parent and the child: two sources of truth that must agree

For example.com, the parent is .com. The parent publishes:

  • NS records at the delegation point: which authoritative servers should be queried.
  • Glue records (sometimes): IP addresses for nameservers that are inside the delegated zone (in-bailiwick).
  • DS record (if DNSSEC): the chain-of-trust link to the child zone’s DNSKEY.

The child zone (your zone file on your authoritative servers) publishes:

  • Its own NS set: what it claims are its authoritative nameservers.
  • Everything else: A/AAAA, MX, TXT, CNAME, etc.
  • DNSKEY records (if DNSSEC), and signatures (RRSIG).

If the parent NS set and the child NS set disagree, you can end up in “split delegation”:
some resolvers follow one path, some follow another, and caches freeze the divergence in place.
If glue is wrong, you can have a correct NS name pointing at an unreachable IP. If DS is wrong,
you can get hard failures (SERVFAIL) even though the records are present.

Dry truth: delegation is a distributed system. Distributed systems don’t “propagate,” they
converge—sometimes slowly, sometimes never, unless you fix the inconsistency.
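
Checking for that divergence is mechanical: fetch the parent's delegation NS set and the child zone's own NS set, then compare. A minimal sketch of the comparison logic (a hypothetical helper; in practice you would feed it names extracted from `dig` output):

```python
def check_delegation(parent_ns: set[str], child_ns: set[str]) -> list[str]:
    """Compare the parent's delegation NS set with the child zone's NS set.

    Returns human-readable problems; an empty list means the sets agree."""
    # NS names are case-insensitive and usually printed with a trailing dot.
    norm = lambda names: {n.lower().rstrip(".") for n in names}
    parent, child = norm(parent_ns), norm(child_ns)
    problems = []
    for ns in sorted(parent - child):
        problems.append(f"parent delegates to {ns}, but child zone does not list it")
    for ns in sorted(child - parent):
        problems.append(f"child lists {ns}, but parent does not delegate to it")
    return problems

# Split delegation: the two sources of truth disagree on one nameserver.
print(check_delegation(
    {"ns1.example.com.", "ns2.example.com."},
    {"ns1.example.com.", "ns3.example.com."},
))
```

If the list is empty, delegation and authority agree; anything else is exactly the drift that caches will freeze in place.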

Why delegation fails in production (even when your zone is perfect)

1) The registrar UI is not the DNS

Your registrar’s UI is a control plane. The actual delegation lives in the registry’s parent zone.
Many registrars do the right thing. Some do it eventually. Some do it after you click “save” twice
and wait for a background job that’s running on a server last rebooted during a budget meeting.

Worse: registrars sometimes accept invalid states (like missing glue for in-bailiwick NS names),
or they normalize and reorder records in ways that surprise automation.

2) Glue records: the “just enough” address book that can go stale

Glue exists to break circular dependency. If your domain is example.com and one of your
nameservers is ns1.example.com, a resolver can’t look up ns1.example.com
without first being able to resolve example.com. That’s recursion eating its own tail.
Glue is the parent zone providing the IP directly.

Glue is also a trap: people change nameserver IPs and forget the parent glue. The child zone updates,
but resolvers keep using the old glue. Or worse, only some resolvers do—because caches differ, and
some resolvers are more aggressive about “helpful” extra lookups.
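
Whether glue is required at all comes down to one string comparison: is the nameserver's hostname at or below the zone it serves? A sketch of that check (hypothetical helper, pure string logic):

```python
def needs_glue(zone: str, ns_name: str) -> bool:
    """True if ns_name is in-bailiwick for zone, i.e. the parent must publish
    glue A/AAAA records, because resolving the NS name requires the zone itself."""
    zone = zone.lower().rstrip(".")
    ns = ns_name.lower().rstrip(".")
    return ns == zone or ns.endswith("." + zone)

print(needs_glue("example.com", "ns1.example.com"))   # in-bailiwick: glue required
print(needs_glue("example.com", "ns1.dns-host.net"))  # out-of-bailiwick: no glue
```

Out-of-bailiwick nameservers (like a managed provider's `ns1.dns-host.net`) sidestep the glue trap entirely, which is one quiet argument for not naming your nameservers inside the zones they serve.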

3) DNSSEC: the delegation fails closed

DNSSEC adds integrity, and it’s worth it. But DNSSEC makes mistakes louder. A stale DS record in the
parent that no longer matches the child’s KSK means validators reject your answers. Non-validating
resolvers may still work. That’s your classic “it works for me” outage with a cryptographic punchline.

4) Negative caching: your fix is correct, but the internet is still mad

Resolvers cache “no” as enthusiastically as they cache “yes.” If a resolver asked for
A www.example.com and got NXDOMAIN, it can cache that result for the
negative TTL (from SOA). That means you can fix a record and still have customers swearing it’s broken
for minutes to hours.

5) Anycast, load balancers, and “clever” NS topologies

Many authoritative DNS providers use anycast, which is great when done correctly. But if you roll
your own NS and put it behind a load balancer that does health checks on TCP/53 while your clients
mostly use UDP/53, you’re auditioning for an incident report.

6) Child zone inconsistencies: serials, NOTIFY, and stealth masters

Delegation can be correct, but your authoritative servers disagree. You updated the primary, one
secondary is stale, and your NS set sends resolvers to both. Some users get the new answer, some get
the old, and everyone blames “propagation.” The correct term is “you have inconsistent authority.”

Joke #1: DNS propagation is like office gossip—everyone hears it eventually, but not in the same order, and rarely with the original meaning intact.

Facts & history that explain today’s weirdness

  • 1983: DNS (RFC 882/883, later replaced by RFC 1034/1035) was designed to replace a single HOSTS.TXT file—delegation was the scalability feature.
  • Root servers aren’t “one server”: there are 13 logical root letters, but each is anycasted to many sites worldwide.
  • Glue is intentionally limited: parent zones provide glue only for in-bailiwick nameservers to avoid becoming a general-purpose address directory.
  • Bailiwick checking exists for safety: resolvers treat glue differently depending on whether it’s within the authority they’re querying.
  • DNS was built for UDP: TCP fallback exists, but many networks still mishandle TCP/53, making large responses (DNSSEC!) brittle.
  • EDNS(0) changed the game: it allows larger UDP payloads, but middleboxes sometimes drop fragmented UDP, causing “random” failures.
  • DNSSEC (1990s–2000s): adds signatures and keys; the delegation point uses DS to connect parent and child trust.
  • TTL is not a promise: it’s a hint; some resolvers cap TTLs or apply local policies, so convergence is messier than your spreadsheet.
  • NXDOMAIN can be “helpfully” rewritten: some providers historically used wildcarding at the TLD or resolver level; your debugging may face lies.

One operational truth has survived every DNS evolution: the path matters as much as the data.
Delegation is the path.

Fast diagnosis playbook (first/second/third)

When a domain “kind of works,” your job is to find where resolution diverges. Don’t start by
staring at your zone file. Start by proving what the world is being told.

First: determine whether it’s delegation, authority, or validation

  1. Check with trace: Does the resolver find the right NS set from the parent?
    If not, it’s delegation (registrar/registry/glue/DS).
  2. Query each authoritative directly: Do all authoritative servers answer the same?
    If not, it’s zone consistency/transfer/rollout.
  3. Check DNSSEC and response codes: SERVFAIL on validating resolvers but
    success on non-validating suggests DNSSEC chain issues.
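
These three checks collapse into a small decision function that encodes the same ordering; a sketch (hypothetical helper, inputs are the yes/no outcomes of the checks above):

```python
def triage(parent_ns_correct: bool, authorities_agree: bool,
           validators_succeed: bool) -> str:
    """Map the three checks to a fault domain, in the order that matters."""
    if not parent_ns_correct:
        return "delegation: fix registrar/registry, glue, or DS"
    if not authorities_agree:
        return "authority: fix zone distribution (transfers, rollout)"
    if not validators_succeed:
        return "validation: fix the DNSSEC chain (DS/DNSKEY/RRSIG)"
    return "look elsewhere: delegation, authority, and validation are consistent"

print(triage(True, True, False))
```

The order is the point: a broken delegation makes the later checks meaningless, so don't start debugging DNSSEC until the parent is sending resolvers to the right servers.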

Second: look for “some networks only” patterns

  • Works on one ISP, fails on another: likely caching differences or DNSSEC validation differences.
  • Works on v4 but not v6: AAAA issues, broken v6 glue, or unreachable v6 authoritative.
  • Works from your laptop, fails from a monitoring region: anycast routing differences or firewall geo rules.

Third: decide if you can mitigate fast

  • Temporary mitigation: add a working NS back into the parent delegation; reduce blast radius.
  • Fix forward: correct glue/DS/NS and wait out caches; communicate expected recovery window.
  • If DNSSEC is broken: either repair DS/keys quickly or remove DS (coordinated), but don’t half-do it.

Paraphrased idea (attributed): Werner Vogels has long pushed that you should “design for failure” as a normal state, not an exception.

Hands-on tasks: commands, expected output, and what you decide

Below are practical tasks you can run from a Linux box. Each one includes what the output tells you
and the decision you make from it. These aren’t toy commands; they’re the ones you use at 2 a.m.
when Slack is on fire.

Task 1: Confirm the parent delegation NS set (with trace)

cr0x@server:~$ dig +trace example.com NS

; <<>> DiG 9.18.24 <<>> +trace example.com NS
.			518400	IN	NS	a.root-servers.net.
.			518400	IN	NS	b.root-servers.net.
;; Received 811 bytes from 127.0.0.53#53(127.0.0.53) in 1 ms

com.			172800	IN	NS	a.gtld-servers.net.
com.			172800	IN	NS	b.gtld-servers.net.
;; Received 1171 bytes from 198.41.0.4#53(a.root-servers.net) in 18 ms

example.com.		172800	IN	NS	ns1.dns-host.net.
example.com.		172800	IN	NS	ns2.dns-host.net.
;; Received 212 bytes from 192.5.6.30#53(a.gtld-servers.net) in 22 ms

What it means: This shows what the parent zone publishes as the delegation NS set.
The TTL here is the parent’s TTL, not your zone’s.

Decision: If these NS names aren’t the ones you expect, stop blaming your authoritative servers.
Fix delegation at the registrar/registry. If they are correct, move to checking glue and child zone consistency.

Task 2: Check for glue at the parent (in-bailiwick case)

cr0x@server:~$ dig @a.gtld-servers.net example.com NS +norecurse +authority +additional

; <<>> DiG 9.18.24 <<>> @a.gtld-servers.net example.com NS +norecurse +authority +additional
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 2

;; AUTHORITY SECTION:
example.com.	172800	IN	NS	ns1.example.com.
example.com.	172800	IN	NS	ns2.example.com.

;; ADDITIONAL SECTION:
ns1.example.com.	172800	IN	A	203.0.113.10
ns2.example.com.	172800	IN	A	203.0.113.11

What it means: The parent is providing glue A records for ns1.example.com/ns2.example.com.
If the IPs are wrong, resolvers may never reach your authoritative servers.

Decision: If glue is wrong, update host objects / glue at the registrar. Updating the child zone won’t fix it.

Task 3: Verify the child zone’s NS set matches the parent delegation

cr0x@server:~$ dig @ns1.example.com example.com NS +noall +answer

example.com.	3600	IN	NS	ns1.example.com.
example.com.	3600	IN	NS	ns2.example.com.

What it means: This is what the child zone claims is authoritative. The TTL is from your zone.

Decision: If the child NS set differs from the parent’s, fix it. Don’t leave “extra” NS in the child or parent as a casual habit.
Consistency matters more than optimism.

Task 4: Query every authoritative nameserver directly for the problem record

cr0x@server:~$ for ns in ns1.example.com ns2.example.com; do echo "== $ns =="; dig @"$ns" www.example.com A +noall +answer; done
== ns1.example.com ==
www.example.com.	300	IN	A	198.51.100.20
== ns2.example.com ==
www.example.com.	300	IN	A	198.51.100.21

What it means: Your authoritative servers disagree. That’s not propagation; that’s inconsistency.

Decision: Fix zone distribution (AXFR/IXFR, API push, hidden master pipeline), then re-check. Do not proceed to “wait it out.”

Task 5: Check SOA serials across authoritative servers

cr0x@server:~$ for ns in ns1.example.com ns2.example.com; do dig @"$ns" example.com SOA +noall +answer; done
example.com.	3600	IN	SOA	ns1.example.com. hostmaster.example.com. 2026020401 7200 3600 1209600 300
example.com.	3600	IN	SOA	ns1.example.com. hostmaster.example.com. 2026020304 7200 3600 1209600 300

What it means: Different serials mean not all servers have the same zone version.

Decision: Investigate transfers/updates. If you can’t quickly repair the stale authoritative, remove it from delegation until it’s healthy.
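
Note that SOA serials compare under sequence-space arithmetic (RFC 1982), not plain integer order, so “bigger number wins” only holds modulo 2^32. A comparison sketch:

```python
SERIAL_BITS = 32
HALF = 2 ** (SERIAL_BITS - 1)  # 2**31, the RFC 1982 comparison window

def serial_newer(a: int, b: int) -> bool:
    """True if serial a is 'after' serial b under RFC 1982 serial arithmetic."""
    a, b = a % 2**SERIAL_BITS, b % 2**SERIAL_BITS
    return (a > b and a - b < HALF) or (a < b and b - a > HALF)

# The serials from the transcript above: ns1 is ahead, ns2 is stale.
print(serial_newer(2026020401, 2026020304))
# Wraparound case: 1 is 'after' 4294967295 despite being numerically smaller.
print(serial_newer(1, 4294967295))
```

Date-based serials like 2026020401 rarely wrap in practice, but zone-management tools that use raw counters do, and a stale-secondary check that ignores the wraparound will lie to you at the worst moment.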

Task 6: Check DNSSEC DS record at the parent

cr0x@server:~$ dig @a.gtld-servers.net example.com DS +noall +answer

example.com.	86400	IN	DS	12345 13 2 4C2B9D3C8B4E9B0F0F3E2C2B2D1A9A0D0B2A6F9E6E7C0A1B2C3D4E5F67890A1B

What it means: The parent says the child zone should validate with this DS digest.

Decision: If you recently rotated keys or moved DNS providers, verify DS matches the current KSK. If not, fix DS or you’ll keep getting validation failures.

Task 7: Confirm DNSKEY at the child (and that it exists where you think)

cr0x@server:~$ dig @ns1.example.com example.com DNSKEY +noall +answer | head -n 3
example.com.	3600	IN	DNSKEY	257 3 13 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=
example.com.	3600	IN	DNSKEY	256 3 13 yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy=

What it means: Presence of KSK (257) and ZSK (256) records. This doesn’t prove correctness, but absence is a smoking crater.

Decision: If DNSKEY is missing but DS exists at parent, expect SERVFAIL for validators. Either publish DNSKEY/sign the zone or remove DS.
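
The first field of the DS record (12345 in Task 6) is the key tag, a 16-bit checksum over the DNSKEY RDATA defined in RFC 4034, Appendix B. Recomputing it is a quick sanity check that the parent's DS can even refer to the child's current KSK. A sketch (the key material below is a placeholder, not a real key):

```python
import struct

def key_tag(rdata: bytes) -> int:
    """RFC 4034 Appendix B key tag over the wire-format DNSKEY RDATA
    (flags, protocol, algorithm, public key)."""
    acc = 0
    for i, byte in enumerate(rdata):
        acc += byte << 8 if i % 2 == 0 else byte
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF

# Wire-format RDATA for a KSK: flags=257, protocol=3, algorithm=13, then the key.
fake_key = bytes(range(32))  # placeholder bytes standing in for key material
rdata = struct.pack("!HBB", 257, 3, 13) + fake_key
print(key_tag(rdata))  # compare against the first field of the parent's DS
```

Key tags are not unique (they're a checksum, and collisions happen), so a matching tag is necessary but not sufficient; a mismatching tag, however, proves the DS points at a key the child no longer publishes.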

Task 8: Validate as a resolver would (check AD flag via a validating resolver)

cr0x@server:~$ dig @1.1.1.1 www.example.com A +noall +answer +comments

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40251
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1
www.example.com.	300	IN	A	198.51.100.20

What it means: The ad flag indicates the resolver considers the answer authenticated (DNSSEC validated).

Decision: If you get SERVFAIL here but your authoritative answers look fine, it’s likely DS/DNSKEY/RRSIG mismatch or broken chain.

Task 9: Compare behavior across resolvers (spot “some networks only”)

cr0x@server:~$ for r in 1.1.1.1 8.8.8.8 9.9.9.9; do echo "== $r =="; dig @"$r" www.example.com A +time=2 +tries=1 +noall +answer +comments; done
== 1.1.1.1 ==
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27364
www.example.com.	300	IN	A	198.51.100.20
== 8.8.8.8 ==
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 58312
== 9.9.9.9 ==
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11427
www.example.com.	300	IN	A	198.51.100.20

What it means: Divergent results suggest validation differences, cache states, or reachability issues to one authoritative anycast site.

Decision: If only some validating resolvers fail, suspect DNSSEC edge cases or packet size/fragmentation. If only one fails, check that resolver’s path to your NS.

Task 10: Check for UDP truncation (TC bit) and TCP fallback

cr0x@server:~$ dig @ns1.example.com example.com DNSKEY +noall +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61643
;; flags: qr aa; QUERY: 1, ANSWER: 0
;; WARNING: Message parser reports malformed message packet.
cr0x@server:~$ dig +tcp @ns1.example.com example.com DNSKEY +noall +answer | head -n 2
example.com.	3600	IN	DNSKEY	257 3 13 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=
example.com.	3600	IN	DNSKEY	256 3 13 yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy=

What it means: Large DNSSEC responses can trigger truncation or middlebox weirdness. TCP succeeding while UDP fails is a red flag.

Decision: Ensure authoritative supports EDNS properly, allow fragmented UDP if possible, and ensure TCP/53 is reachable. If you can’t guarantee UDP, your DNSSEC deployment will be fragile.

Task 11: Verify NS reachability on UDP and TCP from your vantage point

cr0x@server:~$ nc -vz -u ns1.example.com 53
Connection to ns1.example.com 53 port [udp/domain] succeeded!
cr0x@server:~$ nc -vz ns1.example.com 53
Connection to ns1.example.com 53 port [tcp/domain] succeeded!

What it means: This checks basic reachability. It does not confirm correct DNS behavior, but it rules out obvious firewall drops.

Decision: If UDP fails but TCP works, expect intermittent resolution and timeouts. Fix network ACLs/security groups/firewalls before changing DNS records.

Task 12: Find the authoritative set as seen by a recursive resolver

cr0x@server:~$ dig @1.1.1.1 example.com NS +noall +answer
example.com.	172800	IN	NS	ns1.example.com.
example.com.	172800	IN	NS	ns2.example.com.

What it means: The recursive is telling you which NS set it believes is authoritative (from cached delegation).

Decision: If this differs from what dig +trace shows now, you’re looking at cached old delegation. You can’t purge the world; plan mitigations accordingly.

Task 13: Inspect negative caching behavior (NXDOMAIN TTL via SOA)

cr0x@server:~$ dig @ns1.example.com example.com SOA +noall +answer
example.com.	3600	IN	SOA	ns1.example.com. hostmaster.example.com. 2026020401 7200 3600 1209600 300

What it means: The last field (here 300) is the SOA MINIMUM, which governs negative caching: per RFC 2308, the effective negative TTL is the lesser of this field and the SOA record’s own TTL (here min(3600, 300) = 300).

Decision: If you’re planning a cutover where records may temporarily not exist, lower this ahead of time. If you just fixed an NXDOMAIN, expect some resolvers to keep the “no” for up to this value.
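
A sketch that extracts both values from a dig answer line and applies the RFC 2308 minimum rule (pure string parsing, no network):

```python
def negative_ttl(soa_answer_line: str) -> int:
    """Effective negative-caching TTL from a dig SOA answer line.

    RFC 2308: negative TTL = min(SOA record TTL, SOA MINIMUM field)."""
    fields = soa_answer_line.split()
    record_ttl = int(fields[1])     # TTL of the SOA record itself
    soa_minimum = int(fields[-1])   # last RDATA field: MINIMUM
    return min(record_ttl, soa_minimum)

line = ("example.com.\t3600\tIN\tSOA\tns1.example.com. hostmaster.example.com. "
        "2026020401 7200 3600 1209600 300")
print(negative_ttl(line))  # 300: how long resolvers may keep an NXDOMAIN
```

Worth internalizing: lowering MINIMUM alone isn't enough if the SOA record's own TTL is already the smaller of the two; check both before a cutover.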

Task 14: Confirm no accidental CNAME-at-apex or illegal combinations

cr0x@server:~$ dig @ns1.example.com example.com CNAME +noall +answer
cr0x@server:~$ dig @ns1.example.com example.com A +noall +answer
example.com.	300	IN	A	198.51.100.10

What it means: Zone apex should not be a CNAME in classic DNS. If you see CNAME at apex and also NS/SOA, some resolvers will behave badly.

Decision: Fix the zone structure (use ALIAS/ANAME at provider-specific level if needed) and keep standards-compliant outputs on authoritative servers.
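
The apex rule reduces to set logic: a CNAME cannot coexist with any other data at the same name, and the apex always carries SOA and NS. A sketch of the check (hypothetical helper, given the record types present at the apex):

```python
def apex_problems(apex_rrtypes: set[str]) -> list[str]:
    """Flag illegal combinations at a zone apex from the record types present."""
    problems = []
    if "CNAME" in apex_rrtypes and len(apex_rrtypes) > 1:
        problems.append("CNAME coexists with other data (illegal at any name)")
    if "CNAME" in apex_rrtypes:
        problems.append("CNAME at apex (apex must carry SOA/NS, so this is never valid)")
    return problems

print(apex_problems({"SOA", "NS", "A"}))      # healthy apex: []
print(apex_problems({"SOA", "NS", "CNAME"}))  # broken: both rules violated
```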

Three corporate mini-stories from the delegation trenches

Mini-story 1: The outage caused by a wrong assumption

A mid-size SaaS company decided to “professionalize DNS” by moving authoritative hosting to a managed provider.
The migration runbook was clean: copy zone, reduce TTLs, swap NS at registrar, monitor.
It worked in staging. It worked for the engineer running the change. It even worked for most users.

Then sales escalations started: a cluster of enterprise customers couldn’t log in from a particular corporate network.
The login domain resolved to the old IP for them. For everyone else, it resolved to the new one.
The team did the usual dance—restart things, roll back app deploys, stare at dashboards that had no idea what DNS was doing.

The wrong assumption: “If the registrar shows the new nameservers, the parent delegation is updated.”
In reality, the registrar UI was ahead of the registry publication. Some recursive resolvers had cached the old delegation NS set
with a long TTL from the parent, and they kept asking the old authoritative provider—who was now serving an old zone snapshot.

The fix wasn’t heroic. They temporarily restored the old provider’s zone to match the new answers (so either delegation path worked),
then waited out the parent TTL window. After things stabilized, they retired the old authoritative servers again—this time after verifying
with dig +trace and cross-resolver checks, not screenshots.

What changed culturally: they stopped treating “propagation” as a mystical force and started treating it as cached state with measurable TTLs.
They also added a hard requirement: before shutting off old DNS, confirm that multiple resolvers worldwide are receiving the new delegation from the parent.

Mini-story 2: The optimization that backfired

A large enterprise ran its own authoritative DNS on a pair of regional load balancers. The reasoning was familiar:
“We can keep traffic local, reduce latency, and have full control.” They also turned on DNSSEC, because security reviews were coming.

In a calm week, they “optimized” firewall rules: allow UDP/53 from everywhere, allow TCP/53 only from known resolver IPs.
The idea was to reduce exposure. It even passed initial tests—most queries are UDP.

Then a key rollover increased DNS response sizes and triggered more truncation and TCP fallback. Suddenly, a slice of resolvers started timing out.
They weren’t on the allowlist. They were perfectly legitimate recursors with changing egress IPs (cloud-based resolvers, enterprise forwarders,
and some ISPs with large fleets). Users saw intermittent failures that correlated with nothing obvious.

The symptom pattern was classic delegation-adjacent pain: some resolvers worked, others got SERVFAIL or timeouts, and a retry
against a different NS would sometimes succeed. The team wasted hours blaming the DNS provider—except the DNS provider was them.

The eventual fix was unglamorous: allow TCP/53 broadly, rate-limit it sensibly, and monitor it.
If you publish DNSSEC, you don’t get to pretend TCP is optional. You can reduce risk with sane controls, but you can’t deny the protocol reality.

Mini-story 3: The boring but correct practice that saved the day

Another company ran multiple brands and domains, and they had lived through one painful delegation outage years earlier.
After that, they implemented a rule: every DNS change is validated by an automated “delegation health” job.
No exceptions. The job ran from several networks and checked parent NS, glue, DS, authoritative consistency, and basic resolution.

One day, a routine change request came in: “Move ns2 to a new IP; old subnet is being retired.”
The engineer updated the authoritative zone’s A record for ns2.example.com and the new server came up.
Everything looked fine locally.

The automation blocked the change from being declared complete. It flagged: “Parent glue for ns2.example.com still points to old IP.”
The engineer sighed, logged into the registrar, and updated the host object. The glue TTL was long, so they kept the old IP live until expiry.

The outcome was boring, which is the nicest thing you can say about DNS. Customers never noticed. The subnet retirement happened on schedule.
The team didn’t earn kudos, but they also didn’t earn a postmortem. That’s the deal.

Joke #2: DNS is the only place where “boring” is a feature, and “exciting” is a career-limiting event.

Common mistakes: symptoms → root cause → fix

1) Symptom: works from some networks, fails from others

Root cause: Split delegation (parent NS differs from child NS), or caches holding old delegation with long parent TTLs.

Fix: Use dig +trace to confirm current parent delegation. Make parent and child NS sets identical. Keep old authoritative serving correct data until parent TTLs expire.

2) Symptom: intermittent timeouts, especially on DNSKEY/TXT

Root cause: UDP fragmentation issues, EDNS problems, or TCP/53 blocked. DNSSEC and large TXT records amplify it.

Fix: Ensure TCP/53 is reachable. Tune authoritative for EDNS. Reduce response size where possible (avoid bloated TXT, use sane DNSSEC algorithms/parameters).

3) Symptom: SERVFAIL on validating resolvers, but authoritative queries look correct

Root cause: DNSSEC DS mismatch, missing DNSKEY, expired signatures, or wrong NSEC/NSEC3 chain.

Fix: Compare parent DS with child DNSKEY. Repair DS or re-sign zone; if necessary as emergency mitigation, remove DS at parent (coordinated) to stop validation failures.

4) Symptom: after changing NS, domain is “dead” for hours

Root cause: Parent TTL is high; resolvers cached old delegation. Or new authoritative not reachable globally.

Fix: Plan migrations: lower TTLs ahead (where applicable), keep old authoritative serving, validate reachability from multiple networks, and only then retire.

5) Symptom: IPv6-only clients fail, IPv4 works

Root cause: Broken AAAA glue for in-bailiwick NS, unreachable v6 on authoritative, or inconsistent dual-stack routing.

Fix: Verify NS AAAA records and v6 reachability. If you can’t support v6 reliably for authoritative, don’t publish AAAA for NS.

6) Symptom: different answers depending on which NS is hit

Root cause: Stale secondary, failed zone transfer, or multi-provider split-brain without a disciplined pipeline.

Fix: Enforce SOA serial consistency checks. Fix transfer pipeline. Remove unhealthy NS from delegation until consistent.

7) Symptom: some resolvers get NXDOMAIN even after record is created

Root cause: Negative caching TTL holding NXDOMAIN.

Fix: Wait out negative TTL, or if you need faster recovery next time, reduce SOA minimum/negative TTL ahead of planned changes.

8) Symptom: delegation looks correct, but resolvers still query old NS names

Root cause: Some recursive resolvers ignore TTL hints or cap them; others cache aggressively. Also, intermediary forwarders may cache longer.

Fix: Treat it as a phased convergence. Keep compatibility, monitor, and communicate timelines based on observed resolver behavior, not wishful thinking.

Checklists / step-by-step plan

Migration checklist: changing authoritative DNS providers without getting hurt

  1. Inventory current delegation: capture parent NS set, glue, DS, and TTLs using dig +trace and direct parent queries.
  2. Lower TTLs in the child zone for records you expect to change (A/AAAA, CNAME) well ahead of time. Don’t pretend this changes parent TTLs.
  3. Stage the new provider: import zone, verify all records, verify DNSSEC status (off/on) deliberately, not accidentally.
  4. Direct query test: query new authoritative servers directly for critical names. Confirm answers, SOA, and DNSSEC if enabled.
  5. Reachability test: ensure UDP/53 and TCP/53 are reachable from the public internet. Confirm v4/v6 as applicable.
  6. Change delegation at registrar and immediately verify with dig +trace from multiple vantage points.
  7. Run dual-service window: keep old authoritative serving the same answers until parent TTLs and observed recursor caches converge.
  8. Monitor resolution, not just uptime: monitor from multiple regions and multiple resolvers; alert on NXDOMAIN/SERVFAIL spikes.
  9. Only then retire old authoritative and remove old NS from child and parent.

Emergency checklist: “domain partially down” suspected delegation failure

  1. Confirm symptom scope: which networks/resolvers fail? capture examples with exact resolver IPs.
  2. Run trace: confirm what parent is delegating right now.
  3. Check glue: if in-bailiwick NS, confirm parent glue addresses are correct and reachable.
  4. Check authoritative consistency: SOA serial and target records across all NS.
  5. Check DNSSEC chain: DS at parent vs DNSKEY at child; validate using at least one known validating resolver.
  6. Mitigate: if one NS is broken, remove it from parent delegation (and child NS set) or fix it fast; avoid leaving a dead NS in rotation.
  7. Communicate realistically: publish what you changed, and the expected convergence window based on TTLs and observed behavior.
  8. Post-incident: add automation to prevent recurrence (delegation checks, NS reachability, DNSSEC expiry monitoring).

Operational hygiene checklist: boring controls that prevent delegation traps

  • Keep parent and child NS sets identical. Drift is not redundancy; it’s confusion.
  • Maintain at least two authoritative servers on independent networks/providers where possible.
  • Monitor from outside your network and outside your resolver—internal views are comforting lies.
  • Track glue changes as infrastructure changes, not as “DNS content” changes.
  • For DNSSEC: monitor signature expiration, automate rollovers carefully, and treat DS updates as production changes with rollbacks.
  • Document the registrar process (including who has access and how long updates take).

FAQ

1) What exactly is “delegation” in DNS?

Delegation is the parent zone telling resolvers which nameservers are authoritative for your domain (NS records),
plus glue IPs when needed, and DS records if DNSSEC is enabled.

2) If I fixed my zone file, why are users still seeing the old behavior?

Because they might not be reaching your zone file. They may be following cached old delegation, or using stale glue,
or stuck behind negative caching. Fixing the child zone doesn’t automatically fix the path to it.

3) What’s the difference between parent NS and child NS records?

Parent NS records live in the parent zone (registry). Child NS records live in your zone on authoritative servers.
They should match. When they don’t, you get inconsistent resolution depending on which path a resolver takes.

4) When do I need glue records?

When your nameserver hostname is inside the domain it serves (in-bailiwick), like ns1.example.com for example.com.
The parent must provide glue A/AAAA so resolvers can reach the NS without already resolving the zone.

5) Can wrong glue break everything even if the NS names are correct?

Yes. Resolvers may have the right NS names but get sent to the wrong IP address. You’ll see timeouts or inconsistent reachability.
Glue is a common failure point during IP renumbering and provider migrations.

6) Why does DNSSEC cause SERVFAIL instead of just “wrong answer”?

Validating resolvers fail closed: if the chain of trust breaks (DS doesn’t match DNSKEY, signatures expired, etc.),
they won’t return potentially tampered data. They return SERVFAIL because they can’t validate.

7) Is “DNS propagation” a real thing?

Not in the way people mean it. There’s no global push. There’s caching with TTLs, plus resolver policies,
plus a multi-step lookup path. The internet converges toward your change; it doesn’t instantly receive it.

8) How do I quickly prove whether it’s delegation or my authoritative servers?

Use dig +trace to see what the parent delegates, then query each authoritative directly for the failing name.
If the trace points to wrong NS/glue/DS, it’s delegation. If the authoritative servers disagree with each other, it’s your distribution.

9) Should I run my own authoritative DNS?

You can, but do it with production discipline: diverse networks, anycast done right (or don’t), TCP/53 allowed,
DNSSEC lifecycle monitoring, and real external monitoring. If that sounds like work, it is.

10) How many nameservers should I publish?

Two is the minimum; three or four can help resilience, but only if they are truly independent and consistently updated.
Publishing five mediocre NS is worse than publishing two excellent ones.

Conclusion: next steps that prevent repeat incidents

Delegation failures feel personal because they make you look incompetent in public while your systems are quietly fine.
The fix is to treat DNS delegation as production infrastructure with its own observability, change control, and rollback plan.
Not “set and forget.” Not “the registrar says it’s fine.”

Do these next, in this order

  1. Build a delegation health check that runs dig +trace, verifies parent NS/glue/DS, and queries each authoritative for SOA and critical records.
  2. Enforce NS set consistency between parent and child. Make drift a page-worthy alert.
  3. Make TCP/53 non-negotiable if you run DNSSEC or large responses. Monitor UDP and TCP success rates separately.
  4. Practice migrations with a dual-service window: keep old authoritative serving correct data until caches converge.
  5. Write down who controls the registrar and how quickly they can update glue/DS. When you need it, you won’t have time to hunt for access.

The delegation trap is avoidable. But only if you stop thinking of DNS as “records” and start thinking of it as a distributed lookup path
with multiple authorities, caches, and trust anchors. That’s the system you’re actually operating.
