You updated a DNS record. You waited "a few minutes." Your dashboard still shows the old IP. Someone in Slack says,
"DNS is broken again," and another person suggests restarting random things until the graph turns green.
Here's the reality: DNS usually isn't broken. Your change is simply stuck behind a specific cache, somewhere
between your authoritative nameserver and the user's eyeballs. The trick is to find which cache is lying to you,
flush only what helps, and leave the rest alone, because the wrong flush can turn a minor delay into a full-blown incident.
The mental model: where DNS answers actually come from
When people say "DNS propagation," they imagine a single global system slowly syncing. That's not what's happening.
DNS is a chain of delegations and caches, and every link can independently decide what to return and for how long.
You publish records on an authoritative server (or provider). But most clients never talk to it.
They ask a recursive resolver (often your ISP, corporate network, or a public resolver), and that resolver
caches answers. Then your OS may cache. Your browser may cache. Your application may cache. Your service mesh may do "helpful"
caching. Your load balancer might keep a stale upstream mapping. And then Kubernetes CoreDNS might have its own opinion.
A "DNS change not visible" is nearly always one of these:
- You didn't change what you think you changed (wrong zone, wrong record, wrong name, wrong view).
- You changed it, but TTLs say "not yet" (including negative caching, which is sneakier).
- You changed it, but you're testing through a cache (local resolver, corporate forwarder, browser, etc.).
- You changed it, but different users get different answers (geo DNS, EDNS Client Subnet, split-horizon).
The goal is not "flush everything." The goal is: identify the bottleneck cache, then choose the smallest safe action.
Production systems reward precision.
Fast diagnosis playbook (check 1/2/3)
If you're on-call, you want a 60–180 second path to clarity. Here's the order that tends to minimize self-inflicted damage.
1) Check authoritative truth (bypass caches)
Query the authoritative nameservers directly for the record you changed. If authoritative doesn't show it, stop.
Your problem is not "propagation." Your problem is publication.
- Find the zone's NS records.
- Query each NS directly with dig @nsX.
- Verify the record, TTL, and any CNAME chain.
2) Check the resolver you actually use in the failing path
If authoritative is correct, test the recursive resolver that the impacted client uses. That might be:
corporate forwarder, VPC resolver, node-local cache, or a public resolver.
Compare:
- Answer section (A/AAAA/CNAME)
- TTL remaining (if it's high, you're waiting, not broken)
- NXDOMAIN vs NOERROR (negative caching means "you're waiting" too)
3) Check the client-side caches last (OS, browser, app)
Only after authoritative and recursive are sane do you go after the local machine. Because local flushing is cheap,
but it's also a distraction: you can "fix it on your laptop" while production users still see stale data.
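The decision at each layer reduces to: does this layer's answer match authoritative truth, and how long until its cache expires? A minimal sketch of that check, operating on captured dig-style answer lines so it needs no network; the hostnames, IPs, and the check_answer helper are illustrative, not a real tool:

```shell
# Sketch: given one "dig +noall +answer" line and the value you expect,
# decide whether this layer is correct or stale (and how long you'd wait).
# Answer line format: NAME TTL CLASS TYPE VALUE
check_answer() {
  local line="$1" expected="$2"
  local ttl value
  ttl=$(echo "$line" | awk '{print $2}')
  value=$(echo "$line" | awk '{print $5}')
  if [ "$value" = "$expected" ]; then
    echo "correct (TTL ${ttl}s remaining)"
  else
    echo "stale: got ${value}, want ${expected}; wait up to ${ttl}s or flush this layer"
  fi
}

# Example with captured output (values mirror the tasks below):
check_answer "www.example.com. 245 IN A 203.0.113.42" "203.0.113.42"
check_answer "www.example.com. 3240 IN A 198.51.100.77" "203.0.113.42"
```

Run it against the output of each layer in order (authoritative, recursive, local) and the first "stale" result names your bottleneck.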
Paraphrased idea from Richard Cook (resilience engineering): "Success and failure come from the same everyday processes."
DNS caching is one of those processes. It's doing its job, until you need it to stop doing its job.
Which caches exist (and what they're guilty of)
1) Authoritative server behavior (not a cache, but often blamed)
Your authoritative servers serve whatever is in the zone file/database right now. If they're wrong:
wrong record, wrong zone, not synced, stale secondary, wrong view, or you edited a UI that writes to a different place
than you think.
A classic failure mode: you update www.example.com but traffic is actually using api.example.com
via a CNAME chain you forgot existed.
2) Recursive resolver caches (the big one)
Recursive resolvers cache per TTL. That's the point. If you set TTL to 3600 yesterday and change the record now,
some resolvers will keep the old answer for up to an hour from when they last looked it up.
Worse: resolvers also cache negative answers (NXDOMAIN, NODATA). If a resolver recently asked for
a record that didn't exist, it may cache that "doesn't exist" response based on the zone's SOA minimum/negative TTL.
People forget this, then swear the internet is gaslighting them.
3) Forwarders and corporate DNS layers (where truth goes to get "optimized")
Many networks use a chain: client → local stub → corporate forwarder → upstream recursive. Each hop can cache.
Each hop can also rewrite behavior: DNS filtering, split-horizon, conditional forwarding, or "security" appliances
that MITM DNS.
4) OS stub resolver caches (systemd-resolved, mDNSResponder, Windows DNS Client)
Modern OSes often cache DNS answers locally for performance. This is usually a small cache, but itâs enough to make
your laptop disagree with your server, which is enough to waste an afternoon.
5) Application-level caches (Java, Go, glibc behaviors, and friends)
Some runtimes cache DNS results aggressively or unpredictably. Java historically cached forever unless configured.
Some HTTP clients pool connections and keep using an old IP without re-resolving. That's not DNS caching. That's
connection reuse. Different bug, same symptom: "I changed DNS and nothing happened."
6) Browser DNS caches
Browsers keep their own caches and prefetch names. They may also keep established connections and keep talking to the
old endpoint even after DNS changes.
7) CDN / edge resolvers and geo features
If you use geo DNS, latency-based routing, or EDNS Client Subnet, different recursive resolvers get different answers.
You can be "right" in one place and "wrong" in another without any bug, just policy.
8) Kubernetes DNS (CoreDNS) and node-local caches
In clusters, DNS is a dependency like any other. CoreDNS caches. Node-local DNS caches.
And then your app might cache again. If you're debugging a service that can't see a new endpoint, you must decide
which layer you're testing from: pod, node, or outside the cluster.
Practical tasks: commands, outputs, decisions (12+)
These are real tasks you can run. Each includes: the command, what typical output means, and what decision you make.
Use dig when you can. nslookup is fine, but it hides details you often need (TTL, flags, authority).
Task 1: Find the authoritative nameservers for a zone
cr0x@server:~$ dig +noall +answer example.com NS
example.com. 3600 IN NS ns1.dns-provider.net.
example.com. 3600 IN NS ns2.dns-provider.net.
What it means: These are the authoritative NS records the world should use.
Decision: Query these directly next. If you're not seeing the provider you expect, you're editing the wrong zone or delegation is wrong.
Task 2: Query authoritative directly (bypass recursive caches)
cr0x@server:~$ dig @ns1.dns-provider.net www.example.com A +noall +answer +authority
www.example.com. 300 IN A 203.0.113.42
example.com. 3600 IN SOA ns1.dns-provider.net. hostmaster.example.com. 2025123101 7200 900 1209600 300
What it means: Authoritative says the A record is 203.0.113.42 with TTL 300 seconds.
Decision: If this is wrong, fix publication (record/value/zone). If it's right, your issue is downstream caching or client behavior.
Task 3: Compare all authoritative servers (catch stale secondaries)
cr0x@server:~$ for ns in ns1.dns-provider.net ns2.dns-provider.net; do echo "== $ns =="; dig @$ns www.example.com A +noall +answer; done
== ns1.dns-provider.net ==
www.example.com. 300 IN A 203.0.113.42
== ns2.dns-provider.net ==
www.example.com. 300 IN A 198.51.100.77
What it means: Split-brain at authoritative. Different NS serve different answers.
Decision: Stop debugging caches. Fix zone distribution/AXFR/IXFR/hidden primary, or provider sync. Until authoritatives agree, everything else is noise.
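This comparison is easy to automate: collect the answer value from each authoritative server and count distinct values. A sketch over captured answers (the example values mirror the ns1/ns2 outputs above; in practice you would fill the arguments from the dig loop):

```shell
# Sketch: detect authoritative split-brain from a list of answer values,
# one per nameserver. Prints "agree" or "split-brain".
authoritative_agree() {
  local unique
  unique=$(printf '%s\n' "$@" | sort -u | wc -l)
  if [ "$unique" -eq 1 ]; then echo "agree"; else echo "split-brain"; fi
}

authoritative_agree "203.0.113.42" "198.51.100.77"   # prints "split-brain"
```

Anything other than "agree" means stop: fix zone distribution before looking at any cache.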
Task 4: Check what a public recursive resolver sees
cr0x@server:~$ dig @1.1.1.1 www.example.com A +noall +answer
www.example.com. 245 IN A 203.0.113.42
What it means: Public resolver has the new value; TTL remaining is 245 seconds.
Decision: If users still see old data, they may be using a different resolver (corporate/VPC), or the problem is local/app caching.
Task 5: Check a corporate/VPC resolver explicitly
cr0x@server:~$ dig @10.20.30.40 www.example.com A +noall +answer
www.example.com. 3240 IN A 198.51.100.77
What it means: This resolver still has the old answer cached, with nearly an hour remaining.
Decision: Wait (best), or flush that resolver's cache if you own it and flushing is safe. Do not restart half the fleet.
Task 6: Detect negative caching (NXDOMAIN) at the resolver
cr0x@server:~$ dig @10.20.30.40 newhost.example.com A +noall +comments +authority
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 41433
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
example.com. 300 IN SOA ns1.dns-provider.net. hostmaster.example.com. 2025123101 7200 900 1209600 300
What it means: NXDOMAIN is being returned and the SOA in authority hints at negative caching TTL behavior.
Decision: If you just created newhost, this resolver may be caching "doesn't exist" for up to the SOA's negative TTL. You can flush the resolver cache or wait it out; don't keep "trying again" and expect different results.
Task 7: Follow CNAME chains to the real target
cr0x@server:~$ dig www.example.com A +noall +answer
www.example.com. 300 IN CNAME www.example.com.cdn.vendor.net.
www.example.com.cdn.vendor.net. 60 IN A 203.0.113.42
What it means: Your "record change" might be at the CDN hostname, not the vanity name.
Decision: Debug the right zone. If you control only www.example.com but the vendor controls the target, flushing your caches won't change the vendor's TTLs.
Task 8: Verify what resolver your Linux host is actually using (systemd-resolved)
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.20.30.40
DNS Servers: 10.20.30.40 10.20.30.41
What it means: Your machine is using 10.20.30.40 as its resolver, not the one you tested earlier.
Decision: Query that resolver directly. If it's stale, flushing your browser won't help.
Task 9: Flush systemd-resolved cache (client-side)
cr0x@server:~$ sudo resolvectl flush-caches
What it means: Local stub cache cleared.
Decision: Re-test with dig from the same host. If results don't change, the problem is upstream or not DNS at all.
Task 10: Inspect glibc NSS path (are you even using DNS?)
cr0x@server:~$ grep -E '^\s*hosts:' /etc/nsswitch.conf
hosts: files mdns4_minimal [NOTFOUND=return] dns
What it means: Lookups check /etc/hosts first, then mDNS behavior, then DNS.
Decision: If someone pinned an old IP in /etc/hosts, no amount of DNS flushing will fix it. Check /etc/hosts next.
Task 11: Check whether /etc/hosts is overriding your change
cr0x@server:~$ grep -n 'www.example.com' /etc/hosts
12:198.51.100.77 www.example.com
What it means: The host is hard-coded to the old IP.
Decision: Remove or update the entry. This is not a DNS problem; it's a local override.
Task 12: Observe actual resolution via getent (what your apps often use)
cr0x@server:~$ getent ahostsv4 www.example.com
203.0.113.42 STREAM www.example.com
203.0.113.42 DGRAM
203.0.113.42 RAW
What it means: The system resolver stack (NSS + stub) is returning the new IP.
Decision: If your app still connects to the old endpoint, suspect connection pooling, pinned upstreams, or an internal service discovery layer.
Task 13: Check DNS TTLs and caching at BIND (resolver you own)
cr0x@server:~$ sudo rndc status
version: BIND 9.18.24
running on dns-cache-01: Linux x86_64 6.8.0
number of zones: 102
recursive clients: 17/1000/1000
What it means: You're running a recursive resolver (or at least BIND is present).
Decision: If this is the caching layer returning stale answers, plan a controlled cache flush (next task) rather than restarting the daemon blindly.
Task 14: Flush BIND resolver cache safely (targeted approach)
cr0x@server:~$ sudo rndc flushname www.example.com
What it means: BIND drops cache entries for that name (and related data).
Decision: Prefer flushname over global flush during business hours. Global flush can stampede your upstream and make latency look like an outage.
Task 15: Check CoreDNS behavior inside Kubernetes
cr0x@server:~$ kubectl -n kube-system get configmap coredns -o yaml | sed -n '1,120p'
apiVersion: v1
kind: ConfigMap
metadata:
name: coredns
namespace: kube-system
data:
Corefile: |
.:53 {
errors
health
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
ttl 30
}
forward . /etc/resolv.conf
cache 30
loop
reload
loadbalance
}
What it means: CoreDNS is explicitly caching for 30 seconds, plus the Kubernetes plugin has TTL 30.
Decision: If "DNS changes not visible" lasts minutes, CoreDNS cache likely isn't the culprit. Look upstream or at app caching.
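One way to reason about it: cache 30 means CoreDNS adds at most roughly 30 seconds of staleness on top of whatever the upstream resolver still holds, so the upstream's remaining TTL dominates. A back-of-envelope sketch with illustrative numbers:

```shell
# Sketch: rough worst-case staleness seen from a pod = upstream resolver's
# remaining TTL + the local cache layer's cap. Numbers are illustrative.
worst_case_staleness() {
  local upstream_remaining="$1" coredns_cache="$2"
  echo $(( upstream_remaining + coredns_cache ))
}

worst_case_staleness 3000 30   # upstream still holds a stale answer for 3000s
# prints 3030: almost all of the wait is upstream, not CoreDNS
```

This is why restarting CoreDNS rarely helps: the 30-second slice is not the part you are waiting on.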
Task 16: Test from inside a pod (remove your laptop from the story)
cr0x@server:~$ kubectl run -it --rm dns-debug --image=alpine:3.20 -- sh -lc "apk add --no-cache bind-tools >/dev/null && dig www.example.com A +noall +answer"
www.example.com. 300 IN A 203.0.113.42
What it means: The cluster sees the new answer.
Decision: If an in-cluster workload still hits the old IP, suspect the application (connection reuse) or a sidecar/service mesh.
Task 17: Prove itâs not DNSâcheck where connections go
cr0x@server:~$ curl -sS -o /dev/null -w "remote_ip=%{remote_ip}\n" https://www.example.com/
remote_ip=198.51.100.77
What it means: You're still connecting to the old IP even if DNS says otherwise (or TLS/SNI is routing you there).
Decision: Check proxies, load balancers, CDN config, and whether your client is reusing an existing connection. DNS might already be correct.
Task 18: Check whether a local caching daemon is in play (nscd)
cr0x@server:~$ systemctl is-active nscd
inactive
What it means: nscd is not running, so it's not your cache layer.
Decision: Don't waste time flushing what doesn't exist. Find the actual caching component.
Caches you should not flush (unless you enjoy outages)
Flushing caches is like power-cycling a router: it feels productive, it's occasionally necessary, and it's dangerously habit-forming.
Some caches are safe to clear locally. Others are shared infrastructure and flushing them can cause a thundering herd.
Do not globally flush large recursive resolvers during peak hours
If you run a corporate resolver fleet or a shared VPC resolver tier, a global flush can trigger:
- Upstream QPS spikes
- Increased latency and timeouts
- Amplified dependency on external resolvers
- Cascading failures in apps that treat DNS timeouts as fatal
Prefer targeted flushing (flushname / per-zone / per-view) or, better, wait for TTL when it's safe.
Do not restart CoreDNS as your first move
Restarting CoreDNS can break name resolution cluster-wide. That's a lot of blast radius for a problem that's often just TTL.
If you suspect CoreDNS caching, confirm with in-pod dig and inspect the Corefile first.
Do not "fix" it by dropping TTLs in a panic
Lower TTLs are a planning tool, not an emergency lever. Lowering TTL now doesn't help clients who already cached the old record.
It only affects the next cache fill.
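The arithmetic behind that: a resolver that fetched the record just before you lowered the TTL keeps both the old answer and the old TTL until that TTL expires. A sketch of the planning math; the epoch and TTL values are examples:

```shell
# Sketch: after lowering a TTL, the change is only fully effective once every
# cache that fetched under the OLD TTL has expired. Example numbers only.
ttl_lowering_effective_at() {
  local lowered_at_epoch="$1" old_ttl="$2"
  echo $(( lowered_at_epoch + old_ttl ))
}

# Lowered TTL from 3600 to 300 at t=1700000000; safe to rely on the new TTL at:
ttl_lowering_effective_at 1700000000 3600   # prints 1700003600
```

This is why the boring practice of lowering TTL a week ahead works, and the panic edit during the incident does not.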
Joke #1: DNS caching is like office gossip: once it's out there, correcting it takes longer than starting it.
Interesting facts and a little DNS history
- DNS replaced HOSTS.TXT scaling pain. Early ARPANET hosts used a shared hosts file; as the network grew, distribution became the bottleneck.
- TTL wasn't designed for your deploy cadence. It was designed to make a global naming system scalable and resilient, not to make marketing redirects instant.
- Negative caching is standardized. Caching "doesn't exist" is intentional; otherwise resolvers would repeatedly hammer authoritative servers for typos and non-existent names.
- The SOA record influences negative caching. The SOA "minimum"/negative TTL field has a long history of confusion; different tooling labels it differently.
- Resolvers cache more than A/AAAA. They also cache NS delegations and glue behavior, which can make delegation changes feel "sticky" even when a record update is fast.
- DNS is usually UDP. That's great for speed, but it means packet loss and MTU weirdness can masquerade as "stale DNS." TCP fallback exists, but not always reliably.
- EDNS Client Subnet changed caching dynamics. Some resolvers vary answers based on the client's subnet, which reduces cache hit rates and increases "but it works for me" moments.
- Browsers became DNS participants. Modern browsers pre-resolve, cache independently, and sometimes race multiple connections, so DNS is no longer just an OS concern.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company migrated a customer-facing API to a new load balancer. The plan was clean: update the A record for
api.example.com from the old VIP to the new VIP. They set TTL to 300 a week before. Nice.
On cutover day, half the traffic moved. The other half kept hitting the old infrastructure and started timing out because
the old load balancer pool was being drained. The incident channel filled up with "DNS didn't propagate."
The wrong assumption: everyone was querying api.example.com. They weren't. A legacy mobile client had hardcoded
api-v2.example.com, which was a CNAME to a vendor name with a TTL of 3600. The team had changed the vanity name
that humans knew, not the dependency chain that devices used.
The fix wasn't flushing caches. The fix was updating the actual CNAME target (through the vendor workflow) and temporarily
keeping the old VIP alive. They also documented the full CNAME chain in their runbook, because tribal knowledge is not a
change-management strategy.
Mini-story 2: The optimization that backfired
Another org got tired of DNS latency spikes and decided to "optimize" by adding aggressive caching everywhere:
node-local DNS caching on Kubernetes nodes, plus an internal forwarder tier, plus a caching library inside the app.
The pitch sounded good: fewer queries, faster responses, lower cost.
Then they introduced blue/green deployments behind DNS. During a rollout, some pods resolved the new endpoint and succeeded.
Others kept using the old endpoint and failed. Their dashboards looked like a bar-code: success/failure alternating every few seconds.
The backfire came from cache layering and mismatched TTL semantics. The app cache kept entries longer than the OS TTL.
The node-local cache respected TTL but had a bug that pinned negative responses longer under load. The forwarder tier had
stale delegations after a nameserver change. Each layer was "working," but the composition was chaos.
The recovery was boring: remove application-level DNS caching, standardize on one caching layer (node-local or central,
not both), and alert on resolver SERVFAIL rates. They also stopped using DNS as a fine-grained traffic shifting tool
and moved to load balancer weights for fast cutovers.
Mini-story 3: The boring but correct practice that saved the day
A finance company had a compliance-driven change process that everyone mocked until it paid rent. They needed to move
an external vendor integration endpoint. The integration used a hostname, not a fixed IP, which was already a good sign.
Two weeks before the move, they lowered TTL from 3600 to 300. Not in a panic; on schedule. They validated the new TTL by
querying authoritative and several recursors. They also captured baseline answers in a ticket, including CNAME chain and SOA.
On cutover night, they changed the record, verified authoritative, then verified through their corporate resolvers.
They did not flush caches. They watched TTLs count down. For the small number of clients behind a stubborn resolver,
they applied a targeted flush on the resolver tier they owned.
The move landed with zero customer impact. Nobody wrote a victory email because boring correctness doesn't trend.
But the on-call engineer slept, which is the only KPI that matters at 2 a.m.
Common mistakes: symptoms → root cause → fix
1) Symptom: "Authoritative shows new IP, but some users still hit old IP for hours"
Root cause: Recursive resolvers cached the old answer with a high TTL, or you lowered TTL too late.
Fix: Wait out TTL; next time lower TTL at least one full TTL window before the change. If you control the resolver, do a targeted flush for the name.
2) Symptom: "New hostname returns NXDOMAIN even though we created it"
Root cause: Negative caching at recursive resolvers from earlier lookups when the record didn't exist.
Fix: Check SOA negative TTL; flush the resolver cache if you own it, or accept the wait. Avoid creating records "just-in-time" when you can pre-stage.
3) Symptom: "Works on my laptop, fails in production"
Root cause: Different resolvers. Your laptop uses a public resolver or VPN; production uses VPC resolver or internal forwarder with stale cache.
Fix: Query the production resolver directly. Don't use your own machine as the measurement device for production.
4) Symptom: "dig shows new IP, but the service still connects to old IP"
Root cause: Connection pooling / keep-alives / long-lived clients (not DNS). Or proxy routing based on SNI/Host.
Fix: Verify remote IP with curl -w %{remote_ip}; restart or reload the client pool, not the DNS layer. Consider lower keep-alive or proactive DNS refresh in the client.
5) Symptom: "Some regions see new answer, others don't"
Root cause: Geo DNS / latency routing / EDNS Client Subnet, or split-horizon views.
Fix: Test with resolvers in those regions or corporate egress points. Confirm provider routing policies. Make sure you changed the right view.
6) Symptom: "After we flushed the resolver, everything got slower"
Root cause: Cache stampede. You forced a cold cache across many names, raising upstream QPS and latency.
Fix: Avoid global flushes. Use targeted flushes. If you must flush globally, do it off-peak and watch upstream saturation and SERVFAIL rates.
7) Symptom: "Only one server sees the old record"
Root cause: Local override in /etc/hosts, local caching daemon, or different resolv.conf/resolved settings.
Fix: Check /etc/hosts, resolvectl status, and getent. Flush local stub cache only after confirming the local resolver path.
8) Symptom: "We changed NS records and now some clients can't resolve the domain"
Root cause: Delegation caching and glue/parent zone TTLs; some resolvers still use old NS set.
Fix: Plan NS changes with longer lead time. Keep old nameservers serving the zone during the transition. Verify parent zone NS and glue correctness.
Joke #2: Flushing DNS caches is the only time "have you tried turning it off and on again" can DDoS your own infrastructure.
Checklists / step-by-step plan
Checklist A: You changed an A/AAAA record and users don't see it
- Verify authoritative truth: query each authoritative NS directly. Confirm value and TTL.
- Check CNAME chain: make sure you didn't change a name nobody actually queries.
- Identify the resolver in the failing path: client's resolver, corporate DNS, VPC resolver.
- Measure TTL remaining at that resolver: if TTL is high, waiting is correct.
- Choose action:
- If you own the resolver: targeted flush for that name.
- If you don't: wait; consider temporary mitigations (keep old endpoint alive).
- Only then: flush OS stub cache on test clients to reduce confusion.
- Validate reality: confirm where connections go (remote IP), not just what DNS returns.
Checklist B: You created a new hostname and it returns NXDOMAIN
- Query authoritative directly: does the record exist there yet?
- If authoritative is correct, query the failing resolver and check NXDOMAIN + SOA in authority.
- Decide: wait for negative TTL to expire, or flush the resolver if you control it.
- Prevent next time: pre-create records before go-live to avoid negative caching on launch day.
Checklist C: You changed NS records (danger zone)
- Confirm the parent zone delegation is correct (what the registry/parent serves).
- Confirm new nameservers serve the zone correctly and consistently.
- Keep old nameservers up and serving the zone for at least the maximum relevant TTL window.
- Expect mixed behavior during transition; don't interpret that as "random." It's cached delegation.
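A rough floor for "at least the maximum relevant TTL window": resolvers can hold the old delegation for the parent zone's NS TTL (commonly 172800 seconds, i.e. 48 hours, at large TLDs) or your zone's own NS TTL, whichever is larger. A sketch of that calculation; the TTL values are typical examples, not universal:

```shell
# Sketch: minimum time to keep old nameservers serving the zone after an NS
# change = max(parent delegation NS TTL, zone's own NS TTL). Example values.
keep_old_ns_for() {
  local parent_ns_ttl="$1" zone_ns_ttl="$2"
  if [ "$parent_ns_ttl" -gt "$zone_ns_ttl" ]; then
    echo "$parent_ns_ttl"
  else
    echo "$zone_ns_ttl"
  fi
}

keep_old_ns_for 172800 3600   # prints 172800 (2 days)
```

Check the parent's actual NS TTL with a direct query to the parent zone's servers rather than assuming the common value.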
Checklist D: Decide what to flush (minimal blast radius)
- Flush local stub if only your workstation is wrong: resolvectl flush-caches.
- Flush a single name on a resolver you own if business impact is real: rndc flushname.
- Do not flush global resolver caches unless you have a capacity plan and an incident commander who enjoys pain.
- Do not restart DNS services as a substitute for understanding.
FAQ
1) Why does my DNS provider UI show the new value but users still get the old one?
Provider UI shows authoritative data. Users usually query recursive resolvers that cached the old value until TTL expiry.
Confirm by querying authoritative directly, then the user's resolver.
2) If I lower TTL now, will it speed up the current change?
Mostly no. Resolvers already holding the old answer will keep it until their cached TTL runs out.
Lower TTL helps the next cache fill.
3) What does "DNS propagation" actually mean?
It's not a single synchronized wave. It's independent caches expiring at different times across recursive resolvers,
forwarders, OS stubs, browsers, and apps.
4) What cache should I flush first?
None. First, confirm authoritative is correct. Second, identify the resolver in the failing path. Flush only the layer
that's proven stale, and only if you own it and the blast radius is acceptable.
5) Why do I see NXDOMAIN for a record that exists now?
Negative caching. A resolver may have cached "doesn't exist" when the record wasn't present yet. That cache can persist
based on SOA/negative TTL. Flush the resolver if you can; otherwise wait.
6) Why does dig show the new IP but my app still uses the old one?
Because your app may not be re-resolving. It might be reusing a pooled connection, caching DNS in-process,
or routing via a proxy that has its own upstream mapping.
7) Should we run node-local DNS caching in Kubernetes?
It can be great for performance and resilience, but it adds another layer to debug. If you do it, standardize and document:
where caching happens, expected TTLs, and how to test from pod/node. Avoid stacking multiple caches without a reason.
8) Is flushing browser DNS cache useful?
Sometimes, for a single-user "my laptop is weird" case. It won't fix production users. Use it as a last-mile cleanup,
not a propagation strategy.
9) What's the safest way to handle planned DNS cutovers?
Pre-stage, lower TTL ahead of time, validate authoritative answers, keep old endpoints alive for at least the TTL window,
and monitor real connection destinations. Use targeted resolver flush only when necessary.
10) Why do different public resolvers show different answers?
They may have cached at different times, be in different regions, apply different policies, or receive different answers
due to EDNS Client Subnet or geo routing. That diversity is normal.
Next steps you can do today
- Add "authoritative first" to your runbook: always verify direct NS answers before touching caches.
- Document your real resolver path: which resolvers production uses, where caches exist, and who owns them.
- Standardize your tools: prefer dig + getent + "check remote IP" over guesswork.
- Plan TTL changes: lower TTLs ahead of planned migrations; don't expect last-minute TTL edits to save you.
- Adopt targeted flushing: if you operate resolvers, support per-name flush. Make global flush an explicitly approved action.
DNS changes "not visible" are rarely mysterious. They're usually just caching doing exactly what it was designed to do,
at the worst possible moment for your timeline. Your job is to find the one cache that matters, and leave the rest of the planet alone.