DNS Resolvers: Why Caching Makes Outages Worse (And How to Tame It)

You deploy a clean failover. Health checks are green. The new IP is live. Half your users still hit the dead endpoint like it’s a lifestyle choice.
You check the authoritative DNS: correct. You check the load balancer: fine. Your on-call channel fills with “DNS is broken.”

DNS is rarely “broken.” It’s just doing what it was designed to do: cache aggressively, everywhere, by default, and sometimes for longer than you thought you negotiated.
That’s great for latency and scale. It’s also how a one-minute incident becomes a one-hour outage with a smug little TTL printed right there in your zone file.

What actually caches DNS (spoiler: everyone)

“DNS caching” isn’t one cache. It’s a stack of caches, each with its own rules, bugs, and opinions. When your service changes address or
your authoritative servers go sideways, these layers don’t fail together. They fail in a staggered, user-specific, geography-specific mess
that makes incident timelines look like modern art.

The layers that cache DNS answers

  • Application-level caches: browsers, JVM resolvers, language runtimes, and libraries that “helpfully” remember answers.
  • OS stub resolver: the thing your app calls. It may cache (depending on platform and configuration).
  • Local caching daemon: systemd-resolved, dnsmasq, nscd, Unbound in “local” mode, etc.
  • Node-local / sidecar caches: common in Kubernetes and service meshes to reduce latency and load.
  • Recursive resolvers: ISP resolvers, enterprise resolvers, public resolvers. These are big, shared caches.
  • Forwarders: corporate networks often chain resolvers (branch office → HQ → upstream).
  • Authoritative servers: not “caches,” but they influence caching via TTL, SOA values, and response behavior.
  • Middleboxes: yes, some “security” devices do DNS manipulation and caching. The packets will never testify.

When someone says “the TTL is 60 seconds,” you should reply: “At which layer, under which failure mode, and with which resolver software?”
The correct answer is usually a long sigh.

Why caching amplifies outages

DNS caching is a force multiplier. It multiplies reliability when answers are correct and stable. It multiplies pain when an answer is wrong,
missing, or temporarily unreachable. The key is that caches make your system stateful across the internet. You can roll back code in minutes;
you can’t roll back cached NXDOMAIN across recursive resolvers you don’t control.

Failure mode #1: wrong answers persist

Publish a bad A record (fat-fingered IP, stale endpoint, wrong region), and you’ve minted a distributed outage token. Every recursive resolver
that sees it can hold it until TTL expires. Your fix might be instantaneous at the authoritative layer, but the world is now running on cached state.

Failure mode #2: timeouts and SERVFAIL turn into “sticky” behavior

Some resolvers cache failures, or at least behave as if they do by throttling retries, marking servers as lame, or preferring one NS due to
past latency. During an authoritative incident, recursors may decide one of your name servers is “bad” and avoid it long after it’s healthy.

Failure mode #3: negative caching makes recovery slower than the outage

Delete a record (or have it go missing due to a bad deploy), and resolvers can cache “does not exist” (NXDOMAIN) based on the zone’s SOA.
That negative TTL can be minutes to hours. So even after you restore the record, clients keep believing in the void.

Joke #1: DNS is like a rumor: once it’s cached, the correction never travels as fast as the scandal.

Failure mode #4: load spikes from cache stampedes

When a popular record’s TTL expires across many resolvers at roughly the same time, you get a burst of cache misses. If your authoritative
servers are already under stress, this burst can push them from “slow” into “down,” creating more misses, creating more bursts. It’s a feedback loop.
Some resolver software has prefetching and serve-stale features to reduce this, but those are knobs with sharp edges.
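The jitter idea behind those knobs is easy to sketch: expire entries slightly early by a random amount so caches that stored a record at the same moment don't all miss at the same instant. A minimal illustration, assuming nothing about any real resolver's internals (the helper name and jitter fraction are invented for this sketch):

```python
import random

def jittered_expiry(inserted_at, ttl, jitter_fraction=0.1, rng=None):
    """Expire up to jitter_fraction early, never late, so synchronized
    inserts don't become synchronized misses. Illustrative, not any
    resolver's actual algorithm."""
    rng = rng or random.Random()
    early_by = ttl * jitter_fraction * rng.random()  # in [0, ttl * fraction)
    return inserted_at + ttl - early_by

# Ten caches that all stored the same record at t=0 with TTL 300
# now expire at slightly different moments instead of all at t=300:
expiries = [jittered_expiry(0, 300, rng=random.Random(seed)) for seed in range(10)]
```

The trade is the usual one: you give up a sliver of cache lifetime to avoid handing your authoritative servers a synchronized burst of misses.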

TTL is not a promise: the dirty realities

TTL is the “cache for this many seconds” field in a DNS record. Operators treat it like a contract. It isn’t. It’s a suggestion, and different layers
interpret it with policy.

Common ways TTL gets “adjusted” in practice

  • Minimum TTL enforcement: some enterprise resolvers and public resolvers impose a floor (for performance and abuse resistance).
  • Maximum TTL enforcement: some impose a ceiling (to keep things from being cached for days).
  • Local caches ignoring low TTLs: certain stub resolvers and apps keep answers longer than the TTL.
  • Connection reuse and app pooling: apps may only resolve at startup, effectively creating an infinite TTL until restart.
  • Happy Eyeballs and fallback logic: clients may pick an address family (A vs AAAA) and stick with it even after DNS changes.
  • Resolver server selection memory: recursive resolvers track which authoritative servers were responsive and prefer them.

If you want DNS-based failover, treat it like a distributed systems problem, not a magic switch. DNS can route around damage,
but it does not provide synchronized convergence. You design for “some percentage of clients will be wrong for a while.”
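A back-of-envelope convergence model helps set expectations during a cutover. This toy assumes caches were populated uniformly in time and that every layer honors the TTL exactly, both generous assumptions for the reasons this article lists, so treat the result as a lower bound on how long clients stay wrong:

```python
def stale_fraction(seconds_since_change, ttl):
    """Fraction of caches still serving the old answer, under the
    optimistic assumptions above. Real fleets converge slower: TTL
    floors, pinned apps, and long-lived connections all add tail."""
    if ttl <= 0:
        return 0.0
    return max(0.0, 1.0 - seconds_since_change / ttl)

half_gone = stale_fraction(30, 60)   # 0.5: half the caches have aged out
all_gone = stale_fraction(90, 60)    # 0.0: one full TTL window has passed
```

Even in this best case, a 60-second TTL means a minute of mixed answers; the real tail is longer.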

One guiding line worth keeping on a sticky note is a paraphrased idea attributed to Werner Vogels: Everything fails, all the time; design accordingly.
It’s not a DNS quote, but DNS is where that philosophy gets audited by reality.

Negative caching: when “not found” sticks

NXDOMAIN caching is the most underappreciated outage extender. You think “we fixed the record.” Your customers keep getting “host not found.”
Why? Because “not found” is cacheable.

How negative caching works

When a resolver gets NXDOMAIN (name does not exist) or NODATA (name exists but not that type), it can cache the negative answer.
The TTL for that cache comes from the zone’s SOA record: per RFC 2308, resolvers use the lesser of the SOA record’s own TTL and its MINIMUM field.
In operational terms: your SOA settings matter even when you’re not touching SOA.
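The RFC 2308 rule fits in one line, which makes it easy to sanity-check your own zone (the function name is illustrative):

```python
def negative_ttl(soa_record_ttl, soa_minimum):
    """RFC 2308: NXDOMAIN/NODATA answers are cached for the lesser of
    the SOA record's own TTL and its MINIMUM field."""
    return min(soa_record_ttl, soa_minimum)

# Zone with SOA TTL 3600 but MINIMUM 900: "does not exist" sticks for
# up to 900 seconds, regardless of the missing record's intended TTL.
window = negative_ttl(3600, 900)  # 900
```

So a zone with a one-hour SOA TTL but a 15-minute MINIMUM still pins "not found" for 15 minutes.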

Two practical consequences

  • A brief missing record can linger. If your deployment accidentally removes an A record for 30 seconds, resolvers might remember that
    absence for minutes or longer.
  • New records “propagate” slowly if clients asked too early. Launching a new subdomain and announcing it before it exists is a classic
    foot-gun: early requests seed negative caches.

The fix is not “set everything to TTL 0.” The fix is controlling negative TTLs, planning cutovers, and using cache-busting techniques when you must.

Resolver stacks in the real world (Linux, macOS, Windows, containers)

The phrase “check /etc/resolv.conf” is the SRE equivalent of “have you tried turning it off and on again.” It’s not wrong; it’s just incomplete.
Modern systems may route DNS through a local stub (127.0.0.53), a VPN plugin, a corporate agent, or a container runtime.

Linux: systemd-resolved and friends

Many distros use systemd-resolved, which can cache and also implement split DNS for VPNs. /etc/resolv.conf may point to 127.0.0.53, which is not
your real resolver; it’s a local stub that forwards upstream. When debugging, you need to inspect both the stub and the upstream configuration.

macOS: mDNSResponder and per-interface DNS

macOS keeps per-interface DNS settings. You can be “on the same Wi-Fi” and still have different resolver behavior because one machine also has a VPN
profile installed. Flushing caches helps, but the more important step is confirming which resolvers are being queried.

Windows: DNS Client service and NRPT

Windows caches DNS in the DNS Client service. Enterprises also use Name Resolution Policy Table (NRPT) rules, especially with DirectAccess/VPN setups,
which can steer certain domains to specific resolvers. Debugging requires looking at policy, not just cache.

Containers and Kubernetes: CoreDNS plus something else

Kubernetes adds at least one layer: CoreDNS (or kube-dns), plus the node’s resolver, plus whatever the container image does. Some clusters add NodeLocal
DNSCache to reduce CoreDNS load and latency. This improves steady-state performance and makes some outages worse by extending stale answers closer to workloads.
You are trading load for staleness. That trade can be worth it, but you must measure it.

Interesting facts and historical context (the stuff that bites later)

  1. DNS replaced HOSTS.TXT because a central file didn’t scale; caching was a feature from day one, not an afterthought.
  2. TTL was designed to control query load on authoritative servers; it wasn’t designed to support rapid failover.
  3. Negative caching wasn’t always consistent; standardization evolved because early resolvers behaved wildly differently for NXDOMAIN.
  4. Resolvers track “lame” name servers and avoid them; this is good… until a transient packet loss event marks your best NS as “bad.”
  5. Some resolvers cap TTLs (both minimum and maximum) for policy reasons, which makes “propagation math” unreliable.
  6. Glue records can bite: a cached delegation plus glue can keep clients going to old NS IPs even after you fixed the zone.
  7. DNSSEC increases failure modes: a broken chain of trust can turn “works for me” into SERVFAIL globally, with caching on top.
  8. EDNS0 and larger UDP payloads improved functionality, but also increased sensitivity to middleboxes that drop fragments or block EDNS.
  9. TCP fallback for DNS exists for a reason; if your path blocks DNS over TCP, large responses can fail intermittently.

Three corporate mini-stories from the trenches

Mini-story #1: The outage caused by a wrong assumption

A mid-size SaaS company planned a “simple” migration of an API from one load balancer to another. The plan was: lower TTL to 60 seconds, switch the A record,
monitor, then raise TTL back. They did the first part carefully. They waited a day. They switched during a quiet window. And then support tickets arrived anyway.

Some customers kept connecting to the old IP for hours. The on-call team blamed “ISP DNS propagation.” The network team blamed the customer.
The application team blamed the network team, because that’s what applications do when scared.

The actual issue was painfully mundane: a popular client integration was written in Java, and the runtime’s DNS caching behavior on that fleet was effectively
“cache forever” unless configured otherwise. Those clients resolved once at startup and then pinned the IP. Changing TTL did nothing because the app wasn’t
re-resolving.

The fix wasn’t a DNS change. The fix was communicating with customers: restart the integration service, or set the JVM’s networkaddress.cache.ttl property so the runtime re-resolves. Internally, the company
changed the cutover playbook to include “clients may pin DNS; provide an alternate endpoint during transition.” DNS did exactly what it said. The assumption did not.

Mini-story #2: The optimization that backfired

Another company had CoreDNS under load. Latency spikes, occasional timeouts, and a steady stream of complaints from developers who assumed DNS should be invisible.
The platform team added a node-local DNS cache to reduce CoreDNS QPS. It worked. Charts looked better. The team celebrated quietly, as engineers do when they don’t
want fate to notice.

Weeks later, an upstream authoritative provider had a partial outage. The right behavior would have been “some lookups fail quickly, retries recover, and once
upstream heals, everything returns to normal.” What happened instead was a slow-motion incident: node-local caches held onto stale failures and intermittent answers
long enough that pods across the cluster saw inconsistent resolution for far longer than upstream downtime.

The node-local cache also changed the cluster’s failure surface. Before, CoreDNS issues were centralized and obvious. After, failures became node-specific. Some
nodes were “fine,” others were “cursed,” and the scheduler happily placed workloads on cursed nodes because CPU was available.

The postmortem conclusion was not “never cache.” It was “cache with explicit serve-stale policy, explicit monitoring, and a way to flush or bypass quickly.”
They added health checks, metrics per node-local instance, and an emergency DaemonSet that could restart caches cluster-wide. Optimization is great until it becomes
the thing you can’t unwind during an incident.

Mini-story #3: The boring, correct practice that saved the day

A payments company ran authoritative DNS in two providers. That sounds fancy, but the interesting part was their boring discipline: they ran continuous tests from
multiple networks that queried both providers directly, validated DNSSEC responses, and compared answers against an expected set.

One afternoon, the monitoring alerted: queries to one provider started returning SERVFAIL intermittently for a signed zone. From most client networks, everything
still looked fine because caches were hiding the problem. The company didn’t get customer reports because the failure was young and masked.

The DNS on-call did not start by flushing caches or changing TTLs. They isolated which authoritative set was misbehaving, temporarily adjusted delegation to favor
the healthy provider, and kept the TTLs stable. They also paused a planned deployment that would have introduced new records, because negative caching during a
DNSSEC wobble is a special kind of misery.

The incident never became a full outage. The customers never knew. The “boring practice” was not heroics; it was having tests that bypass caches and validate the
real source of truth continuously. The best DNS incident is the one you resolve while everyone else is still convinced everything is fine.

Fast diagnosis playbook

When DNS is blamed, your job is to figure out where the lie entered the system: authoritative data, recursive caching, local caching, or the application.
You do not have time for philosophy. Here’s the order that gets you to a root cause fast.

First: determine if the name is wrong everywhere or only for some clients

  • Check authoritative directly (query the authoritative servers by name/IP, bypassing recursors).
  • Check one or two known public recursors to see if the cached view differs.
  • Check the affected client’s resolver path (what server is it actually asking?).

Second: classify the failure type

  • Wrong IP: cached stale record or bad publish.
  • NXDOMAIN: missing record and negative caching.
  • SERVFAIL: DNSSEC, broken delegation, unreachable authoritative, or resolver policy.
  • Timeout: network path, MTU/fragmentation, firewall, or overloaded DNS component.
  • Intermittent: mixed authoritative behavior, anycast path issues, packet loss, or one NS lame.

Third: find the caching layer that’s “sticky”

  • Local stub cache (systemd-resolved, nscd, dnsmasq).
  • Node-local / CoreDNS (Kubernetes).
  • Enterprise forwarders (chains add state).
  • Recursive resolvers (ISP/public).
  • Application runtime (JVM, Go, custom caching).

Fourth: decide whether to wait, flush, or route around

  • Wait if the cached value is correct but you need convergence; communicate expected recovery time based on TTL and cache behavior.
  • Flush if you control the resolver and can do it safely (and if you won’t cause a stampede).
  • Route around by temporarily adding alternate records, adjusting delegation, or using a different hostname during the incident.

Practical tasks: commands, outputs, and decisions

These are not “toy” commands. They’re the ones you run while someone is asking for ETAs. Each task includes what the output tells you and what decision
you make next. Assume Linux unless otherwise noted.

Task 1: Identify which resolver the host is actually using

cr0x@server:~$ cat /etc/resolv.conf
nameserver 127.0.0.53
search corp.example
options edns0 trust-ad

What it means: This host is pointing to a local stub (127.0.0.53), likely systemd-resolved.
Decision: Don’t assume upstream resolvers yet; inspect systemd-resolved configuration and cache behavior next.

Task 2: Inspect systemd-resolved upstream servers and per-link DNS

cr0x@server:~$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.10.0.53
       DNS Servers: 10.10.0.53 10.10.0.54

Link 2 (ens192)
    Current Scopes: DNS
         Protocols: +DefaultRoute
Current DNS Server: 10.10.0.53
       DNS Servers: 10.10.0.53 10.10.0.54

What it means: systemd-resolved forwards to 10.10.0.53/54.
Decision: If only some hosts fail, compare this output across “good” and “bad” machines. Differences often explain everything.

Task 3: Query a name via the default resolver and observe TTL

cr0x@server:~$ dig api.example.com A +noall +answer +ttlid
api.example.com.  42  IN  A  203.0.113.20

What it means: You got an answer with 42 seconds remaining in cache.
Decision: If the IP is wrong, you’re looking at cached state. Next: compare against authoritative and other recursors.

Task 4: Bypass local and query a specific recursive resolver

cr0x@server:~$ dig @1.1.1.1 api.example.com A +noall +answer +ttlid
api.example.com.  60  IN  A  198.51.100.77

What it means: A different resolver returns a different IP (and a fresh TTL). That’s classic “different cache contents.”
Decision: Determine which answer is correct by querying authoritative. If authoritative matches one, the other resolver is stale or poisoned.

Task 5: Query authoritative name servers directly (bypassing recursion)

cr0x@server:~$ dig @ns1.example.com api.example.com A +norecurse +noall +answer
api.example.com. 300 IN A 198.51.100.77

What it means: Authoritative says 198.51.100.77 with TTL 300.
Decision: The resolver returning 203.0.113.20 is stale or following a different view (split-horizon, old zone, or bad forwarding).

Task 6: Check for NXDOMAIN and whether it’s being cached negatively

cr0x@server:~$ dig new.api.example.com A +noall +comments +authority
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 1443
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

example.com.  900  IN  SOA  ns1.example.com. hostmaster.example.com. 2026020401 7200 3600 1209600 900

What it means: NXDOMAIN, and the SOA sets a 900-second negative caching window (the lesser of the SOA TTL and its MINIMUM field, both 900 here).
Decision: If you just created this record, you may need to wait out negative caches or change strategy (temporary alternate name, or pre-create before announcement).

Task 7: Confirm whether DNSSEC is turning things into SERVFAIL

cr0x@server:~$ dig api.example.com A +dnssec +noall +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 39012
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

What it means: Resolver gave up with SERVFAIL. DNSSEC validation failures often present like this, but so do upstream timeouts.
Decision: Compare results using a resolver with DNSSEC validation disabled (in a controlled environment) and query authoritative directly for DNSKEY/DS correctness.

Task 8: Measure authoritative reachability and latency from your network

cr0x@server:~$ dig @ns1.example.com api.example.com A +tries=1 +time=1 +stats
api.example.com. 300 IN A 198.51.100.77
;; Query time: 18 msec
;; SERVER: 192.0.2.53#53(ns1.example.com) (UDP)
;; WHEN: Tue Feb 04 12:02:51 UTC 2026
;; MSG SIZE  rcvd: 56

What it means: Authoritative responds quickly from here.
Decision: If clients elsewhere time out, you may have geographic routing/anycast issues, firewalling, or upstream resolver selection problems.

Task 9: Detect MTU/fragmentation issues impacting DNS (EDNS0 size)

cr0x@server:~$ dig dnskey example.com +dnssec +bufsize=4096 +noall +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58807
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

What it means: Large UDP responses work from this host. If this fails elsewhere but works with smaller bufsize or TCP, you’ve got a path or middlebox issue.
Decision: Test with TCP and/or smaller bufsize; if TCP works but large UDP fails, involve network teams with packet captures and MTU checks.

Task 10: Force DNS over TCP to confirm UDP is the problem

cr0x@server:~$ dig dnskey example.com +dnssec +tcp +noall +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4926
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

What it means: TCP succeeds. If UDP intermittently fails, you likely have UDP fragmentation drops, rate limits, or stateful firewall weirdness.
Decision: Tune EDNS buffer sizes, validate firewall rules, and consider authoritative/provider guidance; don’t “fix” this by raising TTLs.

Task 11: Inspect Unbound cache statistics (local resolver)

cr0x@server:~$ sudo unbound-control stats_noreset | egrep 'total.num.queries|total.num.cachehits|total.num.cachemiss|unwanted'
total.num.queries=184392
total.num.cachehits=151004
total.num.cachemiss=33388
unwanted.queries=219

What it means: Cache hit ratio is high; Unbound is doing work. “unwanted” can indicate dropped queries (policy, rate limits, or issues).
Decision: If cache misses spike during incident, you may be seeing a stampede; consider prefetch/serve-stale policy and authoritative capacity.

Task 12: Flush a single name from Unbound (when you control the resolver)

cr0x@server:~$ sudo unbound-control flush api.example.com
ok

What it means: That name is removed from the cache on this resolver.
Decision: Re-query to confirm authoritative data is now used. Be careful: flushing at scale can trigger a cache-miss surge.

Task 13: Verify BIND recursion and cache behavior on a resolver

cr0x@server:~$ sudo rndc status
version: BIND 9.18.18 (Stable Release)
running on server: Linux x86_64
boot time: Tue Feb 04 09:11:03 UTC 2026
last configured: Tue Feb 04 09:12:01 UTC 2026
number of zones: 104 (104 automatic)
recursive clients: 12/1000/1000
tcp clients: 3/100
server is up and running

What it means: Resolver is alive, not saturated with clients.
Decision: If clients still fail, check query logs for SERVFAIL/timeouts and confirm upstream reachability and DNSSEC validation state.

Task 14: Check if Kubernetes CoreDNS is erroring or slow

cr0x@server:~$ kubectl -n kube-system logs -l k8s-app=kube-dns --tail=20
[INFO] 10.244.1.18:39976 - 56512 "A IN api.example.com. udp 44 false 512" SERVFAIL qr,rd,ra 44 0.002153s
[INFO] 10.244.2.21:34771 - 22341 "A IN api.example.com. udp 44 false 512" NOERROR qr,rd,ra 60 0.001122s

What it means: Mixed SERVFAIL/NOERROR suggests upstream or intermittent validation/reachability issues, not a simple “record missing.”
Decision: Identify which upstream CoreDNS forwards to, and test that resolver directly from the node network namespace.

Task 15: Confirm whether an application is re-resolving or pinning DNS

cr0x@server:~$ sudo strace -f -e trace=network -p 24817 2>&1 | head
connect(12, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("203.0.113.20")}, 16) = 0

What it means: The process is connecting directly to an IP; no sign of DNS lookups in this snippet. It may have resolved earlier and is reusing.
Decision: If the IP is stale, restarting the process may “fix” it, which is also evidence of application-level caching/pinning. Long-term fix is configuration or code.
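When the long-term fix is code, the core change is re-resolving on every connection attempt instead of once at startup. A hedged sketch, assuming a plain TCP client; the helper name is invented, error handling is minimal, and real clients would add backoff and smarter dual-stack logic:

```python
import socket

def connect_fresh(hostname, port, timeout=5.0):
    """Re-resolve on each attempt instead of pinning one IP forever.
    getaddrinfo consults the resolver every call (still subject to
    OS-level caches, but no longer to process-lifetime pinning)."""
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            hostname, port, type=socket.SOCK_STREAM):
        try:
            # Try each returned address in order; skip the ones that fail.
            return socket.create_connection(sockaddr[:2], timeout=timeout)
        except OSError:
            continue
    raise OSError(f"could not connect to {hostname}:{port}")
```

Pair this with bounded connection lifetimes (recycle pooled connections periodically) so a DNS change eventually reaches even busy clients.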

Task 16: Inspect live socket destinations to spot stale IP usage

cr0x@server:~$ ss -ntp | egrep ':443' | head
ESTAB 0 0 10.0.5.12:53014 203.0.113.20:443 users:(("myclient",pid=24817,fd=12))

What it means: Active connections are going to 203.0.113.20.
Decision: If authoritative now points elsewhere, you have pinned connections or stale DNS in some layer. Decide whether to drain, restart, or add temporary routing.

Joke #2: Flushing DNS caches during an incident is like rearranging chairs on a ferry—sometimes it helps, but you’d better know what’s leaking first.

Common mistakes: symptoms → root cause → fix

1) “Authoritative is correct, but users still hit the old IP”

Symptom: Some clients keep connecting to old endpoints long after a DNS change.

Root cause: Recursive resolvers cached old answers; local caches exist; apps pin DNS at startup or pool connections for a long time.

Fix: Plan cutovers with overlap (keep old endpoint alive), shorten TTL well in advance, and verify client runtime DNS behavior. Provide alternate hostnames for transition.

2) “NXDOMAIN persists after we add the record”

Symptom: New subdomain fails for minutes/hours; some networks can resolve, others can’t.

Root cause: Negative caching based on SOA; early queries seeded NXDOMAIN caches.

Fix: Pre-create records before announcement. Tune SOA negative TTL to something sane. For urgent fixes, use a new hostname (cache-busting) rather than waiting for NXDOMAIN expiry.

3) “Intermittent SERVFAIL on one resolver, fine on another”

Symptom: Some resolvers return SERVFAIL; others return NOERROR.

Root cause: DNSSEC validation failure, broken DS/DNSKEY chain, or intermittent authoritative reachability (one NS unreachable).

Fix: Validate DNSSEC chain end-to-end and confirm all authoritative servers answer consistently. If one NS is bad, remove or fix it; partial authority is worse than fewer good servers.

4) “Timeouts only for large responses”

Symptom: A/AAAA lookups mostly fine, but DNSKEY/TXT or some responses time out.

Root cause: EDNS0/fragmentation issues, MTU mismatch, middleboxes dropping UDP fragments or blocking DNS over TCP fallback.

Fix: Test with +tcp and adjust EDNS buffer sizes. Fix the network path. Don’t paper over it with lower response sizes unless you understand the trade.

5) “Split-horizon surprises after adding a VPN”

Symptom: Internal names resolve wrong externally or vice versa; only VPN users fail.

Root cause: Per-domain forwarding/split DNS policies; resolvers differ by interface; corporate agents override resolvers.

Fix: Document split-horizon domains, enforce consistent resolver settings, and test from both VPN and non-VPN paths. Ensure internal zones don’t leak and external zones are accessible.

6) “Lowering TTL didn’t help our failover”

Symptom: You reduced TTL to 30–60 seconds, but failover still took ages.

Root cause: Some resolvers enforce minimum TTLs; some clients pin DNS; caches already held older answers before TTL change.

Fix: Lower TTL days ahead of planned changes. Test with real client populations. Use application-level retries and multi-endpoint logic; DNS alone is not a failover system.
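What "multi-endpoint logic" means in practice can be sketched in a few lines. This is an illustration, not a library: probe() stands in for a real health check (TCP connect, HTTP ping), and the names are invented:

```python
def first_healthy(endpoints, probe):
    """Failover in the application: DNS supplies candidate endpoints,
    but the client verifies before committing, so a stale cached IP
    costs one failed probe instead of an outage."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint available")

# The old IP may still be cached somewhere; the client just skips past it:
healthy = first_healthy(["203.0.113.20", "198.51.100.77"],
                        probe=lambda ip: ip != "203.0.113.20")  # -> "198.51.100.77"
```

The point is that the application, not the resolver, makes the final routing decision, which is what turns "some clients will be wrong for a while" into a non-event.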

7) “We flushed the cache and made it worse”

Symptom: After flushing resolvers, authoritative QPS spikes and more queries time out.

Root cause: Cache stampede: you turned your resolver fleet into a synchronized miss generator.

Fix: Flush surgically (one name, one resolver tier), stagger restarts, and ensure authoritative capacity. Prefer serve-stale/prefetch strategies where appropriate.

Checklists / step-by-step plan

Checklist A: Before a planned DNS cutover

  1. Lower TTL early (at least one full previous TTL window; often 24–48 hours) so existing caches age out.
  2. Confirm negative caching SOA values so a brief mistake doesn’t linger for an hour.
  3. Keep old endpoints alive long enough to ride out clients that pin DNS or keep long-lived connections.
  4. Test from multiple resolvers (enterprise, public, mobile) and from multiple regions.
  5. Verify authoritative health and consistency across all NS (including DNSSEC if enabled).
  6. Have a rollback plan that doesn’t rely on TTL: alternate hostname, routing override, or dual-stack endpoints.
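The timing math behind steps 1 and 6 is worth making explicit. A sketch with illustrative names, all values in seconds, assuming caches honor TTLs (the real tail is longer for all the reasons above):

```python
def cutover_timeline(old_ttl, new_ttl, ttl_lowered_at, switch_at):
    """Old answers can survive in honest caches until
    ttl_lowered_at + old_ttl; after the switch, convergence takes
    up to new_ttl more. A planning aid, not a guarantee."""
    earliest_safe_switch = ttl_lowered_at + old_ttl
    converged_by = max(switch_at, earliest_safe_switch) + new_ttl
    return earliest_safe_switch, converged_by

# TTL 86400 lowered to 60 at t=0, switch planned for t=90000 (25h later):
plan = cutover_timeline(86400, 60, 0, 90000)  # -> (86400, 90060)
```

Switching before `earliest_safe_switch` means some caches still hold the old answer under the old, long TTL, which is exactly the failure the checklist is trying to avoid.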

Checklist B: During an incident blamed on DNS

  1. Establish ground truth: query authoritative directly; write down expected answers.
  2. Compare caches: query at least two different recursors; note differences.
  3. Identify the failing layer: local stub, node-local cache, enterprise forwarder, public resolver, or app.
  4. Classify the failure: wrong answer, NXDOMAIN, SERVFAIL, timeout, intermittent.
  5. Pick a mitigation: wait, flush (surgical), bypass (alternate hostname), or route around (delegation preference).
  6. Communicate realistic timelines: include negative caching windows and “some clients pin DNS” in your ETA language.

Checklist C: After the incident (the part people skip and then repeat)

  1. Add monitoring that bypasses caches: query authoritative directly from multiple networks.
  2. Track resolver error rates: SERVFAIL, timeouts, NXDOMAIN spikes, and latency distributions.
  3. Document resolver topology: which networks use which resolvers; include VPN and branch office forwarders.
  4. Standardize cache controls: serve-stale policy, prefetch policy, flush procedures, and safe restart patterns.
  5. Audit application DNS behavior: especially JVM settings, connection pooling, and libraries that cache.

FAQ

1) Why did my DNS change “not propagate” even though TTL was low?

Because the TTL you set only controls caching for resolvers that respect it and for clients that re-query DNS. Some resolvers enforce minimum TTLs,
and some apps resolve once at startup and never again. Also, lowering TTL only helps after existing caches have aged out; it doesn’t rewrite the past.

2) What’s the difference between authoritative DNS and recursive resolvers?

Authoritative servers host the zone data (source of truth). Recursive resolvers fetch that data on behalf of clients and cache it. During incidents,
you must query authoritative directly to know what “should” be true, and query recursive resolvers to know what clients “think” is true.

3) Can I set TTL to 0 to avoid caching?

You can set very low TTLs, but many resolvers won’t honor 0, and you’ll increase query load significantly. Low TTLs also don’t fix application-level
pinning and can cause stampedes. Use low TTLs tactically for planned cutovers, not as a permanent crutch.

4) What is negative caching and why do I care?

Negative caching is caching “this name does not exist” (NXDOMAIN) or “no data” responses. It matters because a brief mistake—like a missing record—can
persist in caches long after you restore it, slowing recovery more than the original error window.

5) When should I flush DNS caches?

Flush when you control the resolver and you are confident the authoritative data is correct, and you need convergence faster than TTL allows.
Do it surgically (one name, one tier) and be aware of stampedes. Flushing everything is often a performance incident disguised as a fix.

6) How do I tell if SERVFAIL is DNSSEC-related?

Compare behavior across resolvers with known DNSSEC validation behavior, query with +dnssec, and inspect DS/DNSKEY consistency.
SERVFAIL can also come from timeouts, so measure reachability and test with TCP to rule out UDP issues.

7) Why do some clients work and others fail at the same time?

Because they use different resolvers (ISP vs enterprise vs public), have different cache contents, or are behind different network policies (VPN split DNS).
Also, some clients keep long-lived connections and never need to resolve again until reconnect.

8) Does Kubernetes make DNS problems worse?

It can. Kubernetes adds DNS layers (CoreDNS, optional node-local cache), increases query volume, and creates new failure modes where a single node’s DNS cache
becomes a localized outage. Done well, it’s fine. Done casually, it’s a distributed mystery box.

9) Should I rely on DNS for failover between regions?

Use DNS as one tool, not the only tool. DNS failover is eventually consistent and client-dependent. Pair it with application retries, multi-endpoint support,
and health-checked load balancing where possible. Assume some clients will be wrong for a while and design so “wrong” is still survivable.

Conclusion: next steps that prevent repeat incidents

DNS caching doesn’t “cause” outages. It preserves them, distributes them, and makes them look random. If you treat DNS like a simple config file,
you’ll keep having incidents where the fix is correct and the customer impact stubbornly continues.

Do three things this quarter:

  1. Map your resolver path for every major environment (prod nodes, office, VPN, CI, Kubernetes). Write it down. Keep it current.
  2. Build cache-bypassing monitors that query authoritative servers directly and validate expected answers (including DNSSEC if used).
  3. Operationalize DNS changes: lower TTL early, control negative TTLs, plan overlap, and treat cache flush as a scalpel, not a fire alarm.

The goal isn’t to eliminate caching. The goal is to make caching predictable, observable, and boring. Boring is a compliment in production.
