DNS: Dig/drill power moves — commands that answer 90% of DNS mysteries

DNS failures don’t arrive with fireworks. They arrive as “some users can’t log in,” “payments are slow,” or the classic: “works on my laptop.” Then someone says “it’s probably DNS,” and you’re either the hero with a terminal or the person refreshing dashboards like it’s a ritual.

This is the terminal-first playbook I use in production: dig and drill commands that quickly answer the boring but decisive questions—what name was asked, who answered, what did they say, how long will it stick, and where the chain broke.

The DNS mindset: stop guessing, start pinning down the hop

DNS is a distributed database with caching, delegation, and partial failure as a design feature. If you treat it like a single server (“the DNS is down”), you’ll waste time. If you treat it like a chain of custody, you’ll fix things faster.

Most DNS mysteries reduce to one of these:

  • You asked the wrong place. Different resolvers, different views (split-horizon), different cache contents.
  • You asked the right place for the wrong name/type. A/AAAA vs CNAME/ALIAS, missing trailing dot in zone files, search domains in resolvers.
  • The chain is broken. Delegation mismatch, glue wrong, lame delegation, registrar not updated, DS mismatch.
  • It’s cached. TTLs, negative caching, stale answers, resolver prefetching, client caches.
  • It’s “correct” but operationally wrong. You pointed at the load balancer that isn’t reachable from that network, or you returned an IPv6 AAAA for a service that isn’t actually listening on v6.
  • It’s policy. DNSSEC validation, filtering, NXDOMAIN rewriting, enterprise proxies, or internal resolvers doing “helpful” things.

Your job during an incident is to answer, with evidence: what answered, what path was followed, and what will change if we fix it now (TTL/caches). dig and drill are your pocket knife and your flashlight.

One quote worth keeping in your head, because DNS debugging is a reliability problem disguised as a lookup:

Paraphrased idea (John Allspaw): “You don’t solve incidents by blaming people; you solve them by improving systems and understanding how work actually happens.”

Joke #1: DNS is the only system where “it’s cached” is both an excuse and a law of physics.

Interesting DNS facts (useful in outages)

  1. DNS predates the modern internet as you experience it. It replaced the centralized HOSTS.TXT model in the early 1980s, when “everyone edit the same file” stopped scaling.
  2. “Recursive” and “authoritative” are different jobs. Authoritative servers publish truth for zones; recursive resolvers fetch truth and cache it for clients.
  3. Negative caching is a thing. An NXDOMAIN (or NODATA) can be cached, driven by the zone’s SOA “minimum”/negative TTL rules—so a fixed record might still “not exist” for a while.
  4. Glue records exist because chicken-and-egg is real. If a nameserver’s name is inside the zone it serves, the parent must provide glue A/AAAA to bootstrap resolution.
  5. CNAME flattening was an operational workaround. The DNS spec forbids CNAME at the zone apex, yet people want apex-to-CDN behavior—hence ALIAS/ANAME or provider flattening.
  6. EDNS0 was added because DNS messages got bigger. Classic DNS over UDP had size limits; EDNS0 extends it so modern records (especially DNSSEC) can fit without immediate TCP fallback.
  7. DNSSEC failure modes look like “random SERVFAIL.” Many resolvers hide the details unless you ask the right questions; validation errors are often indistinguishable from outages at first glance.
  8. Resolvers are not obligated to behave the same. Cache algorithms, prefetching, “serve stale,” aggressive NSEC caching, and timeout strategies vary—so “works on 8.8.8.8” is evidence, not a verdict.

Fast diagnosis playbook (first/second/third)

First: confirm what the client is actually doing

  • What resolver is the client using? Corporate VPN, local router, systemd-resolved stub, browser DoH?
  • What name and type? A vs AAAA, search suffix expansion, missing trailing dot, mixed-case doesn’t matter (DNS is case-insensitive in labels).
  • What is the symptom exactly? NXDOMAIN, SERVFAIL, timeout, wrong IP, intermittent?
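
A quick way to pin down the first question (which resolver is actually in play) on a typical Linux client; this sketch assumes systemd-resolved and that dig is installed:

cr0x@server:~$ resolvectl status                  # per-link DNS servers behind the local stub
cr0x@server:~$ cat /etc/resolv.conf               # what libc clients read (often just the 127.0.0.53 stub)
cr0x@server:~$ dig example.com A | grep SERVER    # who actually answered this particular query

Browsers doing their own DoH bypass all of the above, which is exactly why you confirm the SERVER line instead of assuming.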

Second: compare recursive vs authoritative truth

  • Ask a known public recursive resolver (or two) and compare: do they agree?
  • Ask the authoritative servers directly. If auth is correct but recursive is wrong, you’re in caching/propagation/validation territory.
  • Trace delegation. If the chain breaks at the parent, fix registrar/NS/glue/DS—not the zone file.

Third: decide whether you’re fighting caching, delegation, reachability, or policy

  • Caching: TTLs, negative TTL, client caches, resolver serve-stale, stale NS in caches.
  • Delegation: mismatched NS at parent vs child, glue wrong, lame delegation.
  • Reachability: UDP/53 blocked, TCP/53 blocked (matters for DNSSEC/large answers), anycast path issues.
  • Policy: DNSSEC validation, RPZ filtering, NXDOMAIN rewrite, split-horizon view mismatch.

If you do those three steps, you’ll have the right problem statement fast. “Authoritative is correct, but Resolver X caches NXDOMAIN for 10 minutes because of the SOA negative TTL” beats “DNS is weird.”

Dig/drill power moves: real tasks, commands, outputs, decisions

Below are practical tasks you’ll do in real incidents. Each one includes: a command, what the output means, and what decision you make next. Use dig if you have it; use drill when you want built-in tracing and a slightly different perspective. I use both because production systems don’t care about your preferences.

Task 1: Identify the resolver you’re actually querying (the “who answered” check)

cr0x@server:~$ dig example.com A +comments

; <<>> DiG 9.18.24 <<>> example.com A +comments
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14067
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;example.com.                   IN      A

;; ANSWER SECTION:
example.com.            300     IN      A       93.184.216.34

;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Wed Dec 31 10:12:44 UTC 2025
;; MSG SIZE  rcvd: 56

Meaning: The SERVER line tells you who answered. Here it’s 127.0.0.53, the systemd-resolved stub, not your corporate resolver or a public resolver.

Decision: If you’re debugging “why does my laptop behave differently,” query the real upstream resolver directly (next task) and inspect local resolver configuration separately.

Task 2: Query a specific recursive resolver to compare behavior

cr0x@server:~$ dig @1.1.1.1 example.com A +noall +answer +comments

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50289
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
example.com.            300     IN      A       93.184.216.34

Meaning: You asked Cloudflare’s resolver directly. If this differs from your local answer, you’re likely dealing with split-horizon, filtering, or different cache state.

Decision: Compare 2–3 resolvers (your corporate one, one public). Agreement across resolvers suggests authoritative truth is consistent; disagreement suggests view/policy/caching differences.
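
A minimal comparison loop, assuming bash and dig; 10.0.0.53 stands in for your corporate resolver:

for r in 1.1.1.1 8.8.8.8 10.0.0.53; do
  echo "== $r"
  dig @"$r" example.com A +noall +answer    # add +comments if you also want the status line
done

The same answer and similar TTLs everywhere is boring, which is what you want. One resolver disagreeing is your lead.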

Task 3: Check TTLs to predict “how long until users see the fix”

cr0x@server:~$ dig @8.8.8.8 api.corp.example A +noall +answer

api.corp.example.       17      IN      A       203.0.113.10

Meaning: TTL is 17 seconds. That’s the remaining cache lifetime in this resolver at the moment, not necessarily the zone’s configured TTL.

Decision: If TTL is low, fixes will propagate quickly. If TTL is high, you either wait, or you route around it (temporary IP accept list, parallel record, or change client resolver if you must). Don’t promise instant recovery with a 12-hour TTL unless you enjoy uncomfortable calls.
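
To watch the remaining TTL tick down on a specific resolver (handy for estimating when users will actually see the fix), a small loop is enough; assumes bash and dig:

while sleep 5; do
  date -u +%T
  dig @8.8.8.8 api.corp.example A +noall +answer    # the second field is the seconds left in this cache
done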

Task 4: Ask authoritative servers directly to bypass recursive caching

cr0x@server:~$ dig @ns1.dns-provider.example api.corp.example A +norecurse +noall +answer

api.corp.example.       300     IN      A       203.0.113.10

Meaning: +norecurse clears the RD (recursion desired) bit, so the server won’t recurse on your behalf. If it still returns the record, you’re talking to a server that is authoritative for that zone (or a resolver answering from its cache; check for the aa flag in a full response to be sure).

Decision: If authoritative answers are correct but resolvers are wrong, stop editing zone files and start thinking caches, DNSSEC, or delegation mismatch.

Task 5: Trace the delegation path end-to-end (find where the chain breaks)

cr0x@server:~$ dig +trace www.example.org A

; <<>> DiG 9.18.24 <<>> +trace www.example.org A
;; Received 811 bytes from 127.0.0.53#53(127.0.0.53) in 1 ms

org.                    86400   IN      NS      a0.org.afilias-nst.info.
org.                    86400   IN      NS      a2.org.afilias-nst.info.
...
example.org.            172800  IN      NS      ns1.dns-provider.example.
example.org.            172800  IN      NS      ns2.dns-provider.example.
...
www.example.org.        300     IN      A       198.51.100.44

Meaning: +trace walks root → TLD → authoritative. If it fails at the TLD step, your registrar/parent delegation is broken. If it fails at the auth step, your authoritative servers aren’t answering correctly.

Decision: Fix the layer where the trace stops. Don’t touch application code. DNS doesn’t care about your deployment pipeline.

Task 6: Use drill to chase the DNSSEC chain and see trust hints more plainly

cr0x@server:~$ drill -S www.example.org

;; Number of trusted keys: 1
;; Chasing: www.example.org. A
www.example.org.        300     IN      A       198.51.100.44
;; Chase successful

Meaning: drill -S chases the signature chain from the answer up to a trusted key (add -k <keyfile> if your build doesn’t ship a default trust anchor); drill -T traces delegation from the root if you want the full path. A “Chase successful” is a quick sanity check that the chain validates in principle.

Decision: If drill can’t chase due to DNSSEC, suspect DS mismatch, expired signatures, or broken DNSKEY publication.

Task 7: Diagnose SERVFAIL by checking DNSSEC validation explicitly

cr0x@server:~$ dig @1.1.1.1 broken-dnssec.example A +dnssec +multi

;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 24670
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

Meaning: Some resolvers return SERVFAIL when DNSSEC validation fails (or when upstream lookups fail). +dnssec asks for DNSSEC records; it doesn’t force validation, but it helps you see whether RRSIGs are present when querying authoritative servers.

Decision: Ask authoritative directly with +dnssec (next task). If auth serves broken signatures or missing DNSKEY, fix DNSSEC at the provider or remove DS from parent temporarily (carefully).

Task 8: Verify authoritative DNSSEC material exists (DNSKEY, RRSIG)

cr0x@server:~$ dig @ns1.dns-provider.example broken-dnssec.example DNSKEY +norecurse +dnssec +noall +answer

broken-dnssec.example.  3600    IN      DNSKEY  257 3 13 ( ... )
broken-dnssec.example.  3600    IN      RRSIG   DNSKEY 13 2 3600 ( ... )

Meaning: You’re checking the zone publishes DNSKEY and signs it (RRSIG). If DNSKEY exists but resolvers SERVFAIL, the DS at parent may not match, or signatures may be expired.

Decision: If DNSKEY/RRSIG missing or wrong, fix zone signing. If they look present, check DS in the parent (Task 9).

Task 9: Check DS record at the parent (is the chain of trust intact?)

cr0x@server:~$ dig +trace broken-dnssec.example DS

; <<>> DiG 9.18.24 <<>> +trace broken-dnssec.example DS
...
example.                86400   IN      NS      a.iana-servers.net.
...
broken-dnssec.example.  86400   IN      DS      12345 13 2 ( ... )

Meaning: The DS in the parent points to the child zone’s DNSKEY. If you rotated keys and didn’t update DS (or removed signing but left DS), validating resolvers fail.

Decision: Align DS and DNSKEY. If you’re mid-incident and the business needs resolution now, removing the DS at the parent can restore availability (but treat it as a controlled rollback, with follow-up).
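
One way to confirm a DS/DNSKEY mismatch is to regenerate the DS from the child’s DNSKEY and compare it with what the parent publishes. This sketch assumes BIND’s dnssec-dsfromkey is installed (it usually ships alongside dig):

cr0x@server:~$ dig broken-dnssec.example DNSKEY +noall +answer | dnssec-dsfromkey -f - broken-dnssec.example
cr0x@server:~$ dig broken-dnssec.example DS +noall +answer

The key tag, algorithm, and digest from the first command should match a DS from the second. If they don’t, that’s your mismatch.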

Task 10: Detect CNAME chains and where they break

cr0x@server:~$ dig cdn.example.net A +noall +answer

cdn.example.net.        60      IN      CNAME   cdn.vendor.example.
cdn.vendor.example.     60      IN      CNAME   edge123.vendor.example.
edge123.vendor.example. 20      IN      A       192.0.2.80

Meaning: You don’t have “an A record problem,” you have a chain. A break anywhere returns NXDOMAIN or SERVFAIL depending on circumstances.

Decision: If there’s a chain, query each hop directly. When something goes missing, fix that specific zone/provider. Also consider reducing chain depth; long chains amplify TTL and failure modes.
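
A small chain walker, assuming bash and dig; it follows CNAMEs one hop at a time so you can see which zone the break lives in (cdn.example.net is a placeholder):

name=cdn.example.net
while target=$(dig +short "$name" CNAME) && [ -n "$target" ]; do
  echo "$name -> $target"
  name=$target
done
dig "$name" A +noall +answer    # end of the chain; an empty answer means this is where it breaks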

Task 11: Differentiate NXDOMAIN vs NODATA (same name, different failure)

cr0x@server:~$ dig nonexistent.example.com A +noall +comments

;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 10832
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
cr0x@server:~$ dig www.example.com TXT +noall +comments

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40420
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

Meaning: NXDOMAIN means the name doesn’t exist. NOERROR with zero answers often means the name exists, but not that record type (NODATA). They cache differently and imply different misconfigurations.

Decision: If NXDOMAIN, check zone content and delegation. If NODATA, you likely forgot to publish that record type (e.g., TXT for verification) or you published it in a different view.

Task 12: Inspect negative caching TTL via SOA (why NXDOMAIN persists)

cr0x@server:~$ dig example.com SOA +noall +answer

example.com.            300     IN      SOA     ns1.dns-provider.example. hostmaster.example.com. 2025123101 7200 900 1209600 600

Meaning: The last SOA field (here 600) is the “minimum” value used for negative caching; strictly, resolvers cache a negative answer for the lesser of this field and the SOA record’s own TTL. If you accidentally removed a record and put it back, resolvers may keep “it doesn’t exist” for that long.

Decision: If negative TTL is high, plan for delayed recovery after fixing NXDOMAIN. In emergencies, you can sometimes mitigate by answering with a different name (new label) that isn’t negatively cached.
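
A one-liner to pull that negative-caching value out during an incident; it assumes dig +short’s SOA field order (mname, rname, serial, refresh, retry, expire, minimum):

cr0x@server:~$ dig +short example.com SOA | awk '{print "negative-cache cap:", $7, "seconds"}'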

Task 13: Validate delegation: compare NS at parent vs NS at child

cr0x@server:~$ dig example.org NS +noall +answer

example.org.            172800  IN      NS      ns1.dns-provider.example.
example.org.            172800  IN      NS      ns2.dns-provider.example.
cr0x@server:~$ dig @ns1.dns-provider.example example.org NS +norecurse +noall +answer

example.org.            3600    IN      NS      ns1.dns-provider.example.
example.org.            3600    IN      NS      ns2.dns-provider.example.
example.org.            3600    IN      NS      ns3.dns-provider.example.

Meaning: The parent says ns1+ns2; the child zone says ns1+ns2+ns3. That mismatch isn’t always fatal, but it’s a classic source of inconsistent behavior and “some resolvers ask the wrong server.”

Decision: Align parent and child NS sets. During incidents, remove dead or wrong NS from both sides; lame servers increase timeouts and can trigger resolver retry storms.

Task 14: Detect lame delegation (authoritative server listed but not authoritative)

cr0x@server:~$ dig @ns3.dns-provider.example example.org SOA +norecurse +noall +comments

;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 61120
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

Meaning: If an NS is listed but refuses or returns SERVFAIL/REFUSED to authoritative queries, some resolvers will waste time on it. That’s “lame delegation” in practice.

Decision: Remove that NS from delegation (and/or fix its configuration). Timeouts here show up as slow page loads and intermittent failures.
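
A loop that tests every advertised NS in one pass, so lame servers and lagging secondaries both show up; assumes bash and dig, with example.org as a placeholder zone:

zone=example.org
for ns in $(dig +short "$zone" NS); do
  echo "== $ns"
  dig @"$ns" "$zone" SOA +norecurse +time=2 +tries=1 +noall +answer +comments | grep -E 'status:|SOA'
done

Healthy output is NOERROR plus an SOA with the same serial everywhere. REFUSED, timeouts, or drifting serials tell you exactly which server to fix or remove.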

Task 15: Check whether TCP fallback works (big answers, DNSSEC, truncated UDP)

cr0x@server:~$ dig @8.8.8.8 dnssec-large.example TXT +dnssec +noall +comments

;; Truncated, retrying in TCP mode.
cr0x@server:~$ dig @8.8.8.8 dnssec-large.example TXT +dnssec +tcp +noall +answer

dnssec-large.example.   300     IN      TXT     "v=some-long-value..."

Meaning: If UDP is truncated, resolvers retry with TCP. If TCP/53 is blocked somewhere (firewall, security group, corporate proxy rules), you get mysterious failures that correlate with record size.

Decision: Ensure TCP/53 is allowed between resolvers and authoritative servers (and from clients to resolvers if applicable). Don’t “optimize security” by blocking TCP/53 without understanding DNSSEC and modern response sizes.
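
A fast way to prove or rule out a TCP/53 problem from wherever you’re standing: run the same query once over UDP and once forced over TCP (a sketch, assuming dig):

dig @8.8.8.8 example.com SOA +time=3 +tries=1 +noall +answer
dig @8.8.8.8 example.com SOA +time=3 +tries=1 +noall +answer +tcp    # if this times out while UDP works, TCP/53 is blocked on the path

Repeat against your authoritative servers too; resolver-to-auth TCP matters just as much as client-to-resolver.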

Task 16: Test split-horizon DNS by querying internal vs external resolvers

cr0x@server:~$ dig @10.0.0.53 app.internal.example A +noall +answer

app.internal.example.   30      IN      A       10.10.20.15
cr0x@server:~$ dig @1.1.1.1 app.internal.example A +noall +comments

;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 5904
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

Meaning: Internal resolver answers with RFC1918 address; public resolver says NXDOMAIN. That’s intentional split-horizon—until it isn’t. If VPN clients use public DoH, they won’t see internal records.

Decision: Decide policy: either force internal resolver usage on managed devices (and disable unmanaged DoH), or publish public equivalents. Don’t pretend split-horizon is invisible; it leaks into app behavior.

Task 17: Verify reverse DNS (PTR) for mail, logging, and “why does this vendor block us?”

cr0x@server:~$ dig -x 203.0.113.10 +noall +answer

10.113.0.203.in-addr.arpa. 3600 IN PTR mailout.example.com.

Meaning: Reverse DNS exists and points to a hostname. Many systems (especially mail-related) use it as a weak identity signal.

Decision: If PTR is missing or wrong, fix it with the IP owner (often your cloud provider). Also ensure forward-confirmed reverse DNS where it matters (PTR points to name, and name resolves back to the same IP).
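
A quick forward-confirmed reverse DNS check, assuming bash and dig; the PTR target should resolve back to the IP you started from:

ip=203.0.113.10
name=$(dig +short -x "$ip")
echo "PTR: ${name:-<none>}"
[ -n "$name" ] && dig +short "$name" A    # forward-confirmed if this output includes $ip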

Task 18: Measure latency and spot timeouts quickly

cr0x@server:~$ dig @9.9.9.9 www.example.com A +stats +tries=1 +time=2

; <<>> DiG 9.18.24 <<>> @9.9.9.9 www.example.com A +stats +tries=1 +time=2
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31869
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
www.example.com.        60      IN      A       198.51.100.44
;; Query time: 187 msec
;; SERVER: 9.9.9.9#53(9.9.9.9) (UDP)
;; WHEN: Wed Dec 31 10:20:12 UTC 2025
;; MSG SIZE  rcvd: 59

Meaning: You get query time in milliseconds. A spike here can be network path, resolver overload, upstream timeouts due to lame delegation, or packet filtering.

Decision: If query times are high, compare resolvers and query authoritative directly. If authoritative is fast but recursive is slow, the resolver is the bottleneck. If authoritative is slow/unreachable, fix that layer.
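
A quick latency comparison across resolvers, assuming bash and dig; the awk just pulls out the Query time statistic:

for r in 1.1.1.1 8.8.8.8 9.9.9.9; do
  t=$(dig @"$r" www.example.com A +tries=1 +time=2 | awk '/Query time/ {print $4 " msec"}')
  echo "$r: ${t:-timed out}"
done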

Joke #2: If you’re doing DNS at 3 a.m., remember: the records are immutable until you stop typing.

Three corporate mini-stories from the DNS trenches

Incident #1: The outage caused by a wrong assumption

Company: mid-size SaaS, global customers, two cloud regions. A new service endpoint was launched behind a CDN. The team added a CNAME and moved on. Everyone tested from laptops. Everything worked.

The incident began with “some customers can’t reach the API.” The symptoms were cleanly split by geography. In one region it was fine; in another it timed out. The application metrics looked normal, because the traffic never arrived.

The wrong assumption: “If public DNS resolves, it resolves everywhere.” In reality, a subset of enterprise customers used recursive resolvers that still queried an old authoritative nameserver from the previous DNS provider—because the delegation at the parent had been updated, but an old NS remained listed for weeks during a half-migration. Some resolvers cached the older delegation and kept trying the dead server. They’d eventually succeed after retries… sometimes. Intermittent, slow, and maddening.

Debugging was a matter of tracing from the failing network. dig +trace showed the parent advertising an NS set that included a nameserver that no longer served the zone. Querying that server directly returned REFUSED. The “it works for me” crew had resolvers that happened not to hit the lame NS.

The fix was boring: align NS records at the parent and the child; remove the dead NS; lower NS TTLs during the migration window; and validate from multiple resolvers before declaring victory. The postmortem was more about process: migrations need a checklist and a hard “done when delegation matches” gate.

Incident #2: The optimization that backfired

Different company, more security posture. They decided DNS should be “fast” and reduced TTLs aggressively for most records—30 seconds across the board. The logic: faster failover, faster rollbacks, less risk.

Then they introduced a dependency on an external authentication provider with a long CNAME chain and DNSSEC. On a normal day, fine. On a bad day, their recursive resolvers got hammered. Low TTL meant constant cache misses. The resolvers were forced to re-walk the chain repeatedly, pulling DNSKEY/RRSIG overhead and occasionally falling back to TCP when responses were large.

The first symptom was elevated login latency, not outright failure. The second symptom was “random” SERVFAILs, because the resolvers were under load and timing out upstream. The third symptom was an infrastructure team convinced it was “the auth provider,” while the auth provider pointed out that they were healthy and that the struggling resolvers were on the company’s own side.

The eventual fix: stop treating TTL like a performance knob you can crank down without consequences. They raised TTLs for stable records, kept low TTL only for the few failover-controlled names, and added resolver capacity. They also stopped blocking TCP/53 in egress “because UDP is enough.” It wasn’t.

Lesson: low TTL is not a free lunch. It’s a recurring bill you pay in resolver load and upstream dependency amplification, especially with DNSSEC and chains.

Incident #3: The boring but correct practice that saved the day

Another org, another day. They ran their own authoritative DNS with a managed provider and had a habit that looked like paperwork: before any change, they captured “before” outputs for SOA, NS, and the affected records from both authoritative servers and two public resolvers. They kept the snippets in the change ticket.

A Friday change introduced a subtle issue: an engineer added an AAAA record pointing to a new IPv6 address for a service that wasn’t actually reachable from all networks. The service was dual-stack in one region and v4-only in another. Users on networks preferring IPv6 got timeouts. IPv4-only users were fine. Monitoring, mostly IPv4, stayed green.

Because they had “before” snapshots, they quickly proved the only DNS change was AAAA. Then they reproduced by querying A and AAAA separately and testing connectivity. The DNS was “correct,” but the deployment wasn’t. The fastest mitigation was to remove AAAA (or route IPv6 properly), not to tinker with resolvers.

The boring practice—capturing baseline dig outputs and checking A/AAAA separately—turned a cross-team blame spiral into a 30-minute fix. The incident report was short. That’s the dream.

Common mistakes: symptom → root cause → fix

  • Symptom: “Some users get NXDOMAIN for a record that exists now.”

    Root cause: Negative caching after a recent deletion or typo; SOA negative TTL is long; some resolvers cached NXDOMAIN.

    Fix: Check SOA (negative TTL), wait it out, or change label temporarily; keep negative TTL sane for zones that change often.

  • Symptom: “Works on public DNS, fails on corporate VPN.”

    Root cause: Split-horizon views, RPZ filtering, or internal resolver returning private IPs not reachable from the client network.

    Fix: Query both resolvers directly; align views or enforce correct resolver usage; avoid returning unroutable addresses to roaming clients.

  • Symptom: “SERVFAIL on validating resolvers, NOERROR on non-validating.”

    Root cause: DNSSEC DS/DNSKEY mismatch, expired signatures, missing DNSKEY on some authoritative nodes.

    Fix: Validate DS via trace; fix signing or DS; ensure all authoritative nodes serve consistent DNSSEC material.

  • Symptom: “Intermittent slow DNS; sometimes 2–5 seconds.”

    Root cause: Lame delegation or dead NS in parent/child sets causing retries; resolver cycling through NS list.

    Fix: Compare NS at parent and child; query each NS for SOA with +norecurse; remove dead NS; lower NS TTL during transition.

  • Symptom: “A record looks right, but clients still hit old IP.”

    Root cause: TTL still high in caches; local stub caches; browsers with DoH; application-level caching.

    Fix: Check TTL from affected resolver; query authoritative; flush the right cache (systemd-resolved, nscd, unbound) if you control it; otherwise wait/roll forward with overlap.

  • Symptom: “Only large TXT records fail; small lookups fine.”

    Root cause: UDP fragmentation issues, EDNS size path MTU problems, or TCP/53 blocked for fallback.

    Fix: Test dig +tcp; allow TCP/53; consider lowering EDNS UDP size on resolvers; avoid gigantic records when possible.

  • Symptom: “CDN cutover didn’t work; some traffic still goes to origin.”

    Root cause: CNAME chain cached at different stages; stale records in resolvers; multiple views or multiple authoritative providers during migration.

    Fix: Map the chain and TTLs at each hop; minimize chain depth; ensure single authoritative source of truth; plan cutovers with staged TTL reductions.

  • Symptom: “Email vendor says our reverse DNS is wrong.”

    Root cause: Missing PTR, PTR points to a name that doesn’t resolve back, or you changed outbound IP without updating rDNS.

    Fix: Verify with dig -x and forward lookup; update PTR via IP owner; keep rDNS aligned with sending identity.

Checklists / step-by-step plan

Checklist: “DNS is broken” triage in 10 minutes

  1. Write down the exact name and type. Example: api.example.com AAAA, not “the API DNS.”
  2. Query the client’s configured resolver. Confirm the SERVER line and status (NOERROR/NXDOMAIN/SERVFAIL/timeout).
  3. Query two known recursive resolvers. If they disagree, you have view/policy/caching differences.
  4. Query authoritative servers directly. If auth is correct but recursive is wrong, stop changing zone data blindly.
  5. Run +trace on the problematic name. Note exactly which step fails.
  6. Check TTLs (positive and negative). Decide whether you can wait, or need a workaround.
  7. Check A and AAAA separately. Dual-stack surprises are common and painful.
  8. If SERVFAIL, suspect DNSSEC early. Validate DS/DNSKEY chain instead of arguing with application teams.
  9. If slow/intermittent, suspect delegation/NS health. Test each NS with SOA queries.
  10. Record evidence. Paste dig outputs into the incident doc. DNS arguments die quickly when you show the chain.
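
A minimal evidence-capture sketch tied to step 10, assuming bash and dig; the name, zone, and resolver list are placeholders for whatever you’re actually debugging:

name=api.example.com
zone=example.com
out="dns-evidence-$(date -u +%Y%m%dT%H%M%SZ).txt"
{
  date -u
  for r in 1.1.1.1 8.8.8.8; do
    echo "== resolver $r"
    dig @"$r" "$name" A +comments
    dig @"$r" "$name" AAAA +comments
  done
  echo "== delegation and SOA"
  dig "$zone" NS +noall +answer
  dig "$zone" SOA +noall +answer
} > "$out"
echo "wrote $out"

Paste the file (or the relevant chunks) into the incident doc and the “is it DNS?” debate gets a lot shorter.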

Checklist: planned DNS change (so you don’t create next week’s outage)

  1. Decide the propagation model. TTL reduction 24–48 hours before cutover for high-traffic names is normal. Do it deliberately.
  2. Capture “before” snapshots: SOA, NS, target records, and any CNAME chain outputs from authoritative and at least one recursive resolver.
  3. Make one change at a time. Bundling NS + DNSSEC + service cutover is how you earn a weekend.
  4. Validate from multiple networks/resolvers. Especially if you run split-horizon or have enterprise customers.
  5. Keep rollback simple. Know what you’ll set records back to, and understand caches will still exist.
  6. After change, verify delegation consistency. Parent NS vs child NS must match, and all NS must answer authoritatively.
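
For step 6, a sketch that compares the NS set the parent advertises with the NS set the zone itself publishes; it assumes bash (process substitution), dig, and that ns1.dns-provider.example is one of your authoritative servers (a placeholder):

zone=example.org
parent=$(dig +short "${zone#*.}." NS | head -1)   # a nameserver for the parent zone (.org here); adjust if your parent zone is deeper
child=ns1.dns-provider.example
diff \
  <(dig @"$parent" "$zone" NS +norecurse +noall +authority | awk '{print $NF}' | sort) \
  <(dig @"$child" "$zone" NS +norecurse +noall +answer | awk '{print $NF}' | sort) \
  && echo "parent and child NS sets match"

Empty diff output (plus the match message) is what “done” looks like for a delegation change.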

Cache flushing (only when you control the box)

cr0x@server:~$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
 resolv.conf mode: stub
Current DNS Server: 10.0.0.53
       DNS Servers: 10.0.0.53 10.0.0.54
cr0x@server:~$ sudo resolvectl flush-caches

Meaning: On systemd-resolved systems, this clears the local cache. It won’t fix upstream caches on recursive resolvers you don’t control.

Decision: Flush local caches only to prove a point or during controlled testing. In production, you usually design around caches rather than fight them.

FAQ

1) What’s the practical difference between dig and drill?

dig (BIND tools) is ubiquitous and script-friendly. drill (from ldns) has nice tracing behavior and DNSSEC chase features. In incidents, use whichever is installed first—and keep the other handy when you need a second opinion.

2) When should I use +trace?

When you suspect delegation issues, registrar mistakes, wrong NS/glue, or you need to prove where the chain breaks. It’s also the fastest way to stop people arguing about “our DNS provider is down” when the parent delegation is the real issue.

3) Why does one resolver return an IP and another returns NXDOMAIN?

Common causes: split-horizon views, RPZ/filtering, stale caches (including negative caches), or querying different names due to search domains. Query both resolvers directly and compare status, ANSWER, and AUTHORITY sections.

4) What does SERVFAIL actually mean?

It means “the resolver couldn’t complete the query successfully.” That can be upstream timeouts, DNSSEC validation failure, lame delegation, or resolver overload. SERVFAIL is a symptom, not a diagnosis—so you trace and query authoritative servers directly.

5) Why do I see correct records on the authoritative server but clients still get the old answer?

Caches. Recursive resolvers cache answers until TTL expires. Clients may also cache. Your “fix” is correct but not yet observed. Read TTLs from the resolver your clients actually use, and plan changes with TTL windows.

6) How do I prove DNSSEC is the problem without specialized tooling?

Check whether validating resolvers SERVFAIL while non-validating resolvers (or direct auth queries) return data. Then inspect DNSKEY/RRSIG on the authoritative servers and DS in the parent via dig +trace DS. That combination is usually enough to make the case.

7) Should I always publish AAAA records if I have IPv6 somewhere?

No. Publish AAAA when the service is reliably reachable over IPv6 for the users who will receive it. Partial IPv6 deployments create slow failures that look like “the app is flaky” because clients often try v6 first.

8) What’s a “lame delegation” in practical terms?

An NS listed for your zone that doesn’t actually serve it authoritatively (or refuses/servfails). Resolvers waste time on it. You get intermittent slowness, retries, and occasionally failures. Fix by removing/repairing the lame server and aligning parent/child NS.

9) Why does a TXT verification record show up for me but not for the vendor?

Either you’re querying different resolvers with different caches, you published it in the wrong zone/view, or you created NODATA (name exists but no TXT at that name) vs the intended record. Ask the vendor what resolver they use, then query that resolver directly.

10) Is flushing caches a good incident response?

Only if you control the cache and you’re doing it to validate a hypothesis. You can’t flush the world’s resolvers. The real skill is understanding TTLs and planning safe cutovers with overlap.

Next steps you can do today

  1. Build a “DNS evidence” habit. During incidents, paste dig/drill outputs showing SERVER, status, TTL, and authoritative answers. It shortens meetings.
  2. Standardize a resolver comparison. Pick two public resolvers and your corporate resolver, and keep a short command set to compare them quickly.
  3. Know your delegation. For critical zones, periodically verify parent NS matches child NS and every NS answers authoritatively.
  4. Review TTL strategy. Low TTL only where you actively need fast steering; sane TTL elsewhere to reduce dependency load.
  5. Make TCP/53 non-negotiable for DNS infrastructure. Modern DNS (especially with DNSSEC) needs it when UDP truncates.
  6. Practice a controlled DNSSEC rollback. Not during an incident. Know how you’d remove DS if needed and how you’d re-enable safely.

DNS isn’t magic. It’s paperwork with packet loss. Use the commands above, pin down where the truth diverges, and your “mystery” becomes a fix.
