DNS Changes Not Visible: Which Caches to Flush (and Which Not to Touch)


You updated a DNS record. You waited “a few minutes.” Your dashboard still shows the old IP. Someone in Slack says,
“DNS is broken again,” and another person suggests restarting random things until the graph turns green.

Here’s the reality: DNS usually isn’t broken. Your change is simply stuck behind a specific cache, somewhere
between your authoritative nameserver and the user’s eyeballs. The trick is to find which cache is lying to you,
flush only what helps, and leave the rest alone—because the wrong flush can turn a minor delay into a full-blown incident.

The mental model: where DNS answers actually come from

When people say “DNS propagation,” they imagine a single global system slowly syncing. That’s not what’s happening.
DNS is a chain of delegations and caches, and every link can independently decide what to return and for how long.

You publish records on an authoritative server (or provider). But most clients never talk to it.
They ask a recursive resolver (often your ISP, corporate network, or a public resolver), and that resolver
caches answers. Then your OS may cache. Your browser may cache. Your application may cache. Your service mesh may do “helpful”
caching. Your load balancer might keep a stale upstream mapping. And then Kubernetes CoreDNS might have its own opinion.
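
If you want to see that chain instead of imagining it, a trace query walks the delegation from the root down and bypasses
every recursive cache along the way. A minimal sketch, using the same hypothetical zone the tasks below use (output trimmed
to roughly one line per hop):

cr0x@server:~$ dig +trace www.example.com A +nodnssec
.                       518400  IN      NS      a.root-servers.net.
com.                    172800  IN      NS      a.gtld-servers.net.
example.com.            172800  IN      NS      ns1.dns-provider.net.
example.com.            172800  IN      NS      ns2.dns-provider.net.
www.example.com.        300     IN      A       203.0.113.42

The last hop is what the authoritative server says right now; everything your users see on top of that is caching.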

A “DNS change not visible” complaint is nearly always one of these:

  • You didn’t change what you think you changed (wrong zone, wrong record, wrong name, wrong view).
  • You changed it, but TTLs say “not yet” (including negative caching, which is sneakier).
  • You changed it, but you’re testing through a cache (local resolver, corporate forwarder, browser, etc.).
  • You changed it, but different users get different answers (geo DNS, EDNS Client Subnet, split-horizon).

The goal is not “flush everything.” The goal is: identify the bottleneck cache, then choose the smallest safe action.
Production systems reward precision.

Fast diagnosis playbook (check 1/2/3)

If you’re on-call, you want a 60–180 second path to clarity. Here’s the order that tends to minimize self-inflicted damage.

1) Check authoritative truth (bypass caches)

Query the authoritative nameservers directly for the record you changed. If authoritative doesn’t show it, stop.
Your problem is not “propagation.” Your problem is publication.

  • Find the zone’s NS records.
  • Query each NS directly with dig @nsX.
  • Verify the record, TTL, and any CNAME chain.

2) Check the resolver you actually use in the failing path

If authoritative is correct, test the recursive resolver that the impacted client uses. That might be:
corporate forwarder, VPC resolver, node-local cache, or a public resolver.

Compare:

  • Answer section (A/AAAA/CNAME)
  • TTL remaining (if it’s high, you’re waiting, not broken)
  • NXDOMAIN vs NOERROR (negative caching means “you’re waiting” too)
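
If you want proof that you are waiting rather than broken, watch the TTL at that resolver tick down. A minimal sketch,
assuming the resolver in the failing path is 10.20.30.40 (swap in yours):

cr0x@server:~$ watch -n 10 "dig @10.20.30.40 www.example.com A +noall +answer"

If the TTL decreases on every refresh and the answer flips when it reaches zero, the cache is behaving exactly as designed.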

3) Check the client-side caches last (OS, browser, app)

Only after authoritative and recursive are sane do you go after the local machine. Local flushing is cheap, but it’s also
a distraction: you can “fix it on your laptop” while production users still see stale data.

Paraphrased idea from Richard Cook (resilience engineering): “Success and failure come from the same everyday processes.”
DNS caching is one of those processes. It’s doing its job—until you need it to stop doing its job.

Which caches exist (and what they’re guilty of)

1) Authoritative server behavior (not a cache, but often blamed)

Your authoritative servers serve whatever is in the zone file/database right now. If they’re wrong:
wrong record, wrong zone, not synced, stale secondary, wrong view, or you edited a UI that writes to a different place
than you think.

A classic failure mode: you update www.example.com but traffic is actually using api.example.com
via a CNAME chain you forgot existed.

2) Recursive resolver caches (the big one)

Recursive resolvers cache per TTL. That’s the point. If you set TTL to 3600 yesterday and change the record now,
some resolvers will keep the old answer for up to an hour from when they last looked it up.

Worse: resolvers also cache negative answers (NXDOMAIN, NODATA). If a resolver recently asked for
a record that didn’t exist, it may cache that “doesn’t exist” response based on the zone’s SOA minimum/negative TTL.
People forget this, then swear the internet is gaslighting them.

3) Forwarders and corporate DNS layers (where truth goes to get “optimized”)

Many networks use a chain: client → local stub → corporate forwarder → upstream recursive. Each hop can cache.
Each hop can also rewrite behavior: DNS filtering, split-horizon, conditional forwarding, or “security” appliances
that MITM DNS.
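
When a chain like this exists, send the same query to each hop; the hop where the answer or remaining TTL diverges is your
bottleneck. A minimal sketch, assuming 127.0.0.53 is the local systemd-resolved stub, 10.20.30.40 is a hypothetical corporate
forwarder (the same one used in the tasks below), and a public resolver is included for contrast:

cr0x@server:~$ for r in 127.0.0.53 10.20.30.40 1.1.1.1; do echo "== $r =="; dig @$r www.example.com A +noall +answer; done
== 127.0.0.53 ==
www.example.com.        41      IN      A       198.51.100.77
== 10.20.30.40 ==
www.example.com.        3240    IN      A       198.51.100.77
== 1.1.1.1 ==
www.example.com.        245     IN      A       203.0.113.42

Here the stub is just mirroring the forwarder, and the forwarder is the layer holding the old answer; flushing your laptop
would accomplish nothing.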

4) OS stub resolver caches (systemd-resolved, mDNSResponder, Windows DNS Client)

Modern OSes often cache DNS answers locally for performance. This is usually a small cache, but it’s enough to make
your laptop disagree with your server, which is enough to waste an afternoon.

5) Application-level caches (Java, Go, glibc behaviors, and friends)

Some runtimes cache DNS results aggressively or unpredictably. Java historically cached forever unless configured.
Some HTTP clients pool connections and keep using an old IP without re-resolving. That’s not DNS caching. That’s
connection reuse. Different bug, same symptom: “I changed DNS and nothing happened.”
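
If Java is in your stack, check what the JVM is actually configured to do before blaming anything else. The knobs are the
documented JVM security properties networkaddress.cache.ttl and networkaddress.cache.negative.ttl; the file path and defaults
below are typical for modern JDKs but vary by version, so treat this as a sketch:

cr0x@server:~$ grep 'networkaddress.cache' "$JAVA_HOME/conf/security/java.security"
#networkaddress.cache.ttl=-1
networkaddress.cache.negative.ttl=10

A commented-out or missing ttl line means the JDK default applies (historically, with a security manager installed, successful
lookups were cached forever). Setting an explicit small value, or the legacy -Dsun.net.inetaddr.ttl system property at launch,
makes the runtime re-resolve on a predictable schedule.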

6) Browser DNS caches

Browsers keep their own caches and prefetch names. They may also keep established connections and keep talking to the
old endpoint even after DNS changes.

7) CDN / edge resolvers and geo features

If you use geo DNS, latency-based routing, or EDNS Client Subnet, different recursive resolvers get different answers.
You can be “right” in one place and “wrong” in another without any bug—just policy.
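
You can approximate what clients in another region might get by sending an EDNS Client Subnet hint yourself. A minimal sketch
(dig’s +subnet option is real, but not every resolver honors it, and the subnets here are illustrative):

cr0x@server:~$ dig @8.8.8.8 www.example.com A +subnet=198.51.100.0/24 +noall +answer
www.example.com.        60      IN      A       203.0.113.42
cr0x@server:~$ dig @8.8.8.8 www.example.com A +subnet=192.0.2.0/24 +noall +answer
www.example.com.        60      IN      A       198.51.100.77

Different answers for different subnets is not staleness; it’s the routing policy doing its job.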

8) Kubernetes DNS (CoreDNS) and node-local caches

In clusters, DNS is a dependency like any other. CoreDNS caches. Node-local DNS caches.
And then your app might cache again. If you’re debugging a service that can’t see a new endpoint, you must decide
which layer you’re testing from: pod, node, or outside the cluster.

Practical tasks: commands, outputs, decisions (12+)

These are real tasks you can run. Each includes: the command, what typical output means, and what decision you make.
Use dig when you can. nslookup is fine, but it hides details you often need (TTL, flags, authority).

Task 1: Find the authoritative nameservers for a zone

cr0x@server:~$ dig +noall +answer example.com NS
example.com.            3600    IN      NS      ns1.dns-provider.net.
example.com.            3600    IN      NS      ns2.dns-provider.net.

What it means: These are the authoritative NS records the world should use.
Decision: Query these directly next. If you’re not seeing the provider you expect, you’re editing the wrong zone or delegation is wrong.

Task 2: Query authoritative directly (bypass recursive caches)

cr0x@server:~$ dig @ns1.dns-provider.net www.example.com A +noall +answer +authority
www.example.com.        300     IN      A       203.0.113.42
example.com.            3600    IN      SOA     ns1.dns-provider.net. hostmaster.example.com. 2025123101 7200 900 1209600 300

What it means: Authoritative says the A record is 203.0.113.42 with TTL 300 seconds.
Decision: If this is wrong, fix publication (record/value/zone). If it’s right, your issue is downstream caching or client behavior.

Task 3: Compare all authoritative servers (catch stale secondaries)

cr0x@server:~$ for ns in ns1.dns-provider.net ns2.dns-provider.net; do echo "== $ns =="; dig @$ns www.example.com A +noall +answer; done
== ns1.dns-provider.net ==
www.example.com.        300     IN      A       203.0.113.42
== ns2.dns-provider.net ==
www.example.com.        300     IN      A       198.51.100.77

What it means: Split-brain at authoritative. Different NS serve different answers.
Decision: Stop debugging caches. Fix zone distribution/AXFR/IXFR/hidden primary, or provider sync. Until authoritatives agree, everything else is noise.

Task 4: Check what a public recursive resolver sees

cr0x@server:~$ dig @1.1.1.1 www.example.com A +noall +answer
www.example.com.        245     IN      A       203.0.113.42

What it means: Public resolver has the new value; TTL remaining is 245 seconds.
Decision: If users still see old data, they may be using a different resolver (corporate/VPC), or the problem is local/app caching.

Task 5: Check a corporate/VPC resolver explicitly

cr0x@server:~$ dig @10.20.30.40 www.example.com A +noall +answer
www.example.com.        3240    IN      A       198.51.100.77

What it means: This resolver still has the old answer cached, with nearly an hour remaining.
Decision: Wait (best), or flush that resolver’s cache if you own it and flushing is safe. Do not restart half the fleet.

Task 6: Detect negative caching (NXDOMAIN) at the resolver

cr0x@server:~$ dig @10.20.30.40 newhost.example.com A +noall +comments +authority
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 41433
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

example.com.            300     IN      SOA     ns1.dns-provider.net. hostmaster.example.com. 2025123101 7200 900 1209600 300

What it means: The resolver is returning NXDOMAIN, and the SOA record in the authority section is what determines how long that negative answer can be cached.
Decision: If you just created newhost, this resolver may be caching “doesn’t exist” for up to the SOA’s negative TTL. You can flush the resolver cache or wait it out; don’t keep “trying again” and expect different results.

Task 7: Follow CNAME chains to the real target

cr0x@server:~$ dig www.example.com A +noall +answer
www.example.com.        300     IN      CNAME   www.example.com.cdn.vendor.net.
www.example.com.cdn.vendor.net. 60     IN      A       203.0.113.42

What it means: Your “record change” might be at the CDN hostname, not the vanity name.
Decision: Debug the right zone. If you control only www.example.com but the vendor controls the target, flushing your caches won’t change the vendor’s TTLs.

Task 8: Verify what resolver your Linux host is actually using (systemd-resolved)

cr0x@server:~$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.20.30.40
       DNS Servers: 10.20.30.40 10.20.30.41

What it means: Your machine is using 10.20.30.40 as its resolver, not the one you tested earlier.
Decision: Query that resolver directly. If it’s stale, flushing your browser won’t help.

Task 9: Flush systemd-resolved cache (client-side)

cr0x@server:~$ sudo resolvectl flush-caches

What it means: Local stub cache cleared.
Decision: Re-test with dig from the same host. If results don’t change, the problem is upstream or not DNS at all.

Task 10: Inspect glibc NSS path (are you even using DNS?)

cr0x@server:~$ grep -E '^\s*hosts:' /etc/nsswitch.conf
hosts:          files mdns4_minimal [NOTFOUND=return] dns

What it means: Lookups check /etc/hosts first, then mDNS behavior, then DNS.
Decision: If someone pinned an old IP in /etc/hosts, no amount of DNS flushing will fix it. Check /etc/hosts next.

Task 11: Check whether /etc/hosts is overriding your change

cr0x@server:~$ grep -n 'www.example.com' /etc/hosts
12:198.51.100.77 www.example.com

What it means: The host is hard-coded to the old IP.
Decision: Remove or update the entry. This is not a DNS problem; it’s a local override.

Task 12: Observe actual resolution via getent (what your apps often use)

cr0x@server:~$ getent ahostsv4 www.example.com
203.0.113.42    STREAM www.example.com
203.0.113.42    DGRAM
203.0.113.42    RAW

What it means: The system resolver stack (NSS + stub) is returning the new IP.
Decision: If your app still connects to the old endpoint, suspect connection pooling, pinned upstreams, or an internal service discovery layer.

Task 13: Check DNS TTLs and caching at BIND (resolver you own)

cr0x@server:~$ sudo rndc status
version: BIND 9.18.24
running on dns-cache-01: Linux x86_64 6.8.0
number of zones: 102
recursive clients: 17/1000/1000

What it means: You’re running a recursive resolver (or at least BIND is present).
Decision: If this is the caching layer returning stale answers, plan a controlled cache flush (next task) rather than restarting the daemon blindly.

Task 14: Flush BIND resolver cache safely (targeted approach)

cr0x@server:~$ sudo rndc flushname www.example.com

What it means: BIND drops cache entries for that name (and related data).
Decision: Prefer flushname over global flush during business hours. Global flush can stampede your upstream and make latency look like an outage.
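
If the stale data spans several names under one domain (a CNAME chain, for example), BIND can drop the whole subtree in one
targeted operation instead of a global flush; a sketch, assuming you operate the resolver:

cr0x@server:~$ sudo rndc flushtree example.com

flushtree removes the name and everything below it from the cache, which is still far narrower than emptying the cache outright.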

Task 15: Check CoreDNS behavior inside Kubernetes

cr0x@server:~$ kubectl -n kube-system get configmap coredns -o yaml | sed -n '1,120p'
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }

What it means: CoreDNS is explicitly caching for 30 seconds, plus the Kubernetes plugin has TTL 30.
Decision: If “DNS changes not visible” lasts minutes, CoreDNS cache likely isn’t the culprit. Look upstream or at app caching.

Task 16: Test from inside a pod (remove your laptop from the story)

cr0x@server:~$ kubectl run -it --rm dns-debug --image=alpine:3.20 -- sh -lc "apk add --no-cache bind-tools >/dev/null && dig www.example.com A +noall +answer"
www.example.com.        300     IN      A       203.0.113.42

What it means: The cluster sees the new answer.
Decision: If an in-cluster workload still hits the old IP, suspect the application (connection reuse) or a sidecar/service mesh.

Task 17: Prove it’s not DNS—check where connections go

cr0x@server:~$ curl -sS -o /dev/null -w "remote_ip=%{remote_ip}\n" https://www.example.com/
remote_ip=198.51.100.77

What it means: You’re still connecting to the old IP even though DNS now resolves correctly. Either the client reused an existing connection, something pinned the address, or a proxy/load balancer in front is routing by Host/SNI.
Decision: Check proxies, load balancers, CDN config, and whether your client is reusing an existing connection. DNS might already be correct.
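
To see whether long-lived connections are the reason, list established sockets to the old address; a sketch, reusing the
hypothetical old IP from the earlier tasks (the local address is also hypothetical):

cr0x@server:~$ ss -tn dst 198.51.100.77
State  Recv-Q Send-Q Local Address:Port   Peer Address:Port
ESTAB  0      0      10.0.0.15:51234      198.51.100.77:443
ESTAB  0      0      10.0.0.15:51240      198.51.100.77:443

Connections that predate the DNS change will happily outlive it until something closes them.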

Task 18: Check whether a local caching daemon is in play (nscd)

cr0x@server:~$ systemctl is-active nscd
inactive

What it means: nscd is not running, so it’s not your cache layer.
Decision: Don’t waste time flushing what doesn’t exist. Find the actual caching component.

Caches you should not flush (unless you enjoy outages)

Flushing caches is like power-cycling a router: it feels productive, it’s occasionally necessary, and it’s dangerously habit-forming.
Some caches are safe to clear locally. Others are shared infrastructure and flushing them can cause a thundering herd.

Do not globally flush large recursive resolvers during peak hours

If you run a corporate resolver fleet or a shared VPC resolver tier, a global flush can trigger:

  • Upstream QPS spikes
  • Increased latency and timeouts
  • Amplified dependency on external resolvers
  • Cascading failures in apps that treat DNS timeouts as fatal

Prefer targeted flushing (flushname / per-zone / per-view) or, better, wait for TTL when it’s safe.

Do not restart CoreDNS as your first move

Restarting CoreDNS can break name resolution cluster-wide. That’s a lot of blast radius for a problem that’s often just TTL.
If you suspect CoreDNS caching, confirm with in-pod dig and inspect the Corefile first.

Do not “fix” it by dropping TTLs in a panic

Lower TTLs are a planning tool, not an emergency lever. Lowering TTL now doesn’t help clients who already cached the old record.
It only affects the next cache fill.
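
A quick worked example, assuming the old TTL was 3600: a resolver that cached the old answer at 11:59 may legitimately serve it
until 12:59, no matter what you publish at 12:00, and a resolver that refreshes at 12:01 only then picks up whatever TTL it sees.
That’s why the lead time matters: lower the TTL at least one full old-TTL window before the cutover, so that by change time every
cache that refreshes is already holding the short value.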

Joke #1: DNS caching is like office gossip: once it’s out there, correcting it takes longer than starting it.

Interesting facts and a little DNS history

  • DNS replaced HOSTS.TXT scaling pain. Early ARPANET hosts used a shared hosts file; as the network grew, distribution became the bottleneck.
  • TTL wasn’t designed for your deploy cadence. It was designed to make a global naming system scalable and resilient, not to make marketing redirects instant.
  • Negative caching is standardized. Caching “doesn’t exist” is intentional; otherwise resolvers would repeatedly hammer authoritative servers for typos and non-existent names.
  • The SOA record influences negative caching. The SOA “minimum”/negative TTL field has a long history of confusion; different tooling labels it differently.
  • Resolvers cache more than A/AAAA. They also cache NS delegations and glue behavior, which can make delegation changes feel “sticky” even when a record update is fast.
  • DNS is usually UDP. That’s great for speed, but it means packet loss and MTU weirdness can masquerade as “stale DNS.” TCP fallback exists, but not always reliably.
  • EDNS Client Subnet changed caching dynamics. Some resolvers vary answers based on the client’s subnet, which reduces cache hit rates and increases “but it works for me” moments.
  • Browsers became DNS participants. Modern browsers pre-resolve, cache independently, and sometimes race multiple connections, so DNS is no longer just an OS concern.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company migrated a customer-facing API to a new load balancer. The plan was clean: update the A record for
api.example.com from the old VIP to the new VIP. They set TTL to 300 a week before. Nice.

On cutover day, half the traffic moved. The other half kept hitting the old infrastructure and started timing out because
the old load balancer pool was being drained. The incident channel filled up with “DNS didn’t propagate.”

The wrong assumption: everyone was querying api.example.com. They weren’t. A legacy mobile client had hardcoded
api-v2.example.com, which was a CNAME to a vendor name with a TTL of 3600. The team had changed the vanity name
that humans knew, not the dependency chain that devices used.

The fix wasn’t flushing caches. The fix was updating the actual CNAME target (through the vendor workflow) and temporarily
keeping the old VIP alive. They also documented the full CNAME chain in their runbook, because tribal knowledge is not a
change-management strategy.

Mini-story 2: The optimization that backfired

Another org got tired of DNS latency spikes and decided to “optimize” by adding aggressive caching everywhere:
node-local DNS caching on Kubernetes nodes, plus an internal forwarder tier, plus a caching library inside the app.
The pitch sounded good: fewer queries, faster responses, lower cost.

Then they introduced blue/green deployments behind DNS. During a rollout, some pods resolved the new endpoint and succeeded.
Others kept using the old endpoint and failed. Their dashboards looked like a bar-code: success/failure alternating every few seconds.

The backfire came from cache layering and mismatched TTL semantics. The app cache kept entries longer than the OS TTL.
The node-local cache respected TTL but had a bug that pinned negative responses longer under load. The forwarder tier had
stale delegations after a nameserver change. Each layer was “working,” but the composition was chaos.

The recovery was boring: remove application-level DNS caching, standardize on one caching layer (node-local or central,
not both), and alert on resolver SERVFAIL rates. They also stopped using DNS as a fine-grained traffic shifting tool
and moved to load balancer weights for fast cutovers.

Mini-story 3: The boring but correct practice that saved the day

A finance company had a compliance-driven change process that everyone mocked until it paid rent. They needed to move
an external vendor integration endpoint. The integration used a hostname, not a fixed IP, which was already a good sign.

Two weeks before the move, they lowered TTL from 3600 to 300. Not in a panic—on schedule. They validated the new TTL by
querying authoritative and several recursors. They also captured baseline answers in a ticket, including CNAME chain and SOA.

On cutover night, they changed the record, verified authoritative, then verified through their corporate resolvers.
They did not flush caches. They watched TTLs count down. For the small number of clients behind a stubborn resolver,
they applied a targeted flush on the resolver tier they owned.

The move landed with zero customer impact. Nobody wrote a victory email because boring correctness doesn’t trend.
But the on-call engineer slept, which is the only KPI that matters at 2 a.m.

Common mistakes: symptoms → root cause → fix

1) Symptom: “Authoritative shows new IP, but some users still hit old IP for hours”

Root cause: Recursive resolvers cached the old answer with a high TTL, or you lowered TTL too late.

Fix: Wait out TTL; next time lower TTL at least one full TTL window before the change. If you control the resolver, do a targeted flush for the name.

2) Symptom: “New hostname returns NXDOMAIN even though we created it”

Root cause: Negative caching at recursive resolvers from earlier lookups when the record didn’t exist.

Fix: Check SOA negative TTL; flush the resolver cache if you own it, or accept the wait. Avoid creating records “just-in-time” when you can pre-stage.
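
To see how long that wait can be, read the zone’s SOA; the values below match the hypothetical zone used throughout this article:

cr0x@server:~$ dig example.com SOA +noall +answer
example.com.            3600    IN      SOA     ns1.dns-provider.net. hostmaster.example.com. 2025123101 7200 900 1209600 300

The last field (300 here) is the negative-caching value; per RFC 2308, resolvers cap the NXDOMAIN cache time at the smaller of
that field and the SOA record’s own TTL.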

3) Symptom: “Works on my laptop, fails in production”

Root cause: Different resolvers. Your laptop uses a public resolver or VPN; production uses VPC resolver or internal forwarder with stale cache.

Fix: Query the production resolver directly. Don’t use your own machine as the measurement device for production.

4) Symptom: “dig shows new IP, but the service still connects to old IP”

Root cause: Connection pooling / keep-alives / long-lived clients (not DNS). Or proxy routing based on SNI/Host.

Fix: Verify the remote IP with curl -w %{remote_ip}; restart or reload the client pool, not the DNS layer. Consider lowering keep-alive timeouts or adding proactive re-resolution in the client.

5) Symptom: “Some regions see new answer, others don’t”

Root cause: Geo DNS / latency routing / EDNS Client Subnet, or split-horizon views.

Fix: Test with resolvers in those regions or corporate egress points. Confirm provider routing policies. Make sure you changed the right view.

6) Symptom: “After we flushed the resolver, everything got slower”

Root cause: Cache stampede. You forced a cold cache across many names, raising upstream QPS and latency.

Fix: Avoid global flushes. Use targeted flushes. If you must flush globally, do it off-peak and watch upstream saturation and SERVFAIL rates.

7) Symptom: “Only one server sees the old record”

Root cause: Local override in /etc/hosts, local caching daemon, or different resolv.conf/resolved settings.

Fix: Check /etc/hosts, resolvectl status, and getent. Flush local stub cache only after confirming the local resolver path.

8) Symptom: “We changed NS records and now some clients can’t resolve the domain”

Root cause: Delegation caching and glue/parent zone TTLs; some resolvers still use old NS set.

Fix: Plan NS changes with longer lead time. Keep old nameservers serving the zone during the transition. Verify parent zone NS and glue correctness.
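
To verify what the parent actually delegates (as opposed to what your zone claims), ask a parent-zone server directly; a sketch
for a .com domain, where for other TLDs you would first find the parent zone’s nameservers with a dig for the TLD’s NS records:

cr0x@server:~$ dig @a.gtld-servers.net example.com NS +norecurse +noall +authority
example.com.            172800  IN      NS      ns1.dns-provider.net.
example.com.            172800  IN      NS      ns2.dns-provider.net.

If the parent still lists the old nameservers, no amount of cache flushing downstream will help; the delegation itself is what
needs to change, and it carries its own, usually long, TTL.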

Joke #2: Flushing DNS caches is the only time “have you tried turning it off and on again” can DDoS your own infrastructure.

Checklists / step-by-step plan

Checklist A: You changed an A/AAAA record and users don’t see it

  1. Verify authoritative truth: query each authoritative NS directly. Confirm value and TTL.
  2. Check CNAME chain: make sure you didn’t change a name nobody actually queries.
  3. Identify the resolver in the failing path: client’s resolver, corporate DNS, VPC resolver.
  4. Measure TTL remaining at that resolver: if TTL is high, waiting is correct.
  5. Choose action:
    • If you own the resolver: targeted flush for that name.
    • If you don’t: wait; consider temporary mitigations (keep old endpoint alive).
  6. Only then: flush OS stub cache on test clients to reduce confusion.
  7. Validate reality: confirm where connections go (remote IP), not just what DNS returns.

Checklist B: You created a new hostname and it returns NXDOMAIN

  1. Query authoritative directly: does the record exist there yet?
  2. If authoritative is correct, query the failing resolver and check NXDOMAIN + SOA in authority.
  3. Decide: wait for negative TTL to expire, or flush the resolver if you control it.
  4. Prevent next time: pre-create records before go-live to avoid negative caching on launch day.

Checklist C: You changed NS records (danger zone)

  1. Confirm the parent zone delegation is correct (what the registry/parent serves).
  2. Confirm new nameservers serve the zone correctly and consistently.
  3. Keep old nameservers up and serving the zone for at least the maximum relevant TTL window.
  4. Expect mixed behavior during transition; don’t interpret that as “random.” It’s cached delegation.

Checklist D: Decide what to flush (minimal blast radius)

  • Flush local stub if only your workstation is wrong: resolvectl flush-caches.
  • Flush a single name on a resolver you own if business impact is real: rndc flushname.
  • Do not flush global resolver caches unless you have a capacity plan and an incident commander who enjoys pain.
  • Do not restart DNS services as a substitute for understanding.

FAQ

1) Why does my DNS provider UI show the new value but users still get the old one?

Provider UI shows authoritative data. Users usually query recursive resolvers that cached the old value until TTL expiry.
Confirm by querying authoritative directly, then the user’s resolver.

2) If I lower TTL now, will it speed up the current change?

Mostly no. Resolvers already holding the old answer will keep it until their cached TTL runs out.
Lower TTL helps the next cache fill.

3) What does “DNS propagation” actually mean?

It’s not a single synchronized wave. It’s independent caches expiring at different times across recursive resolvers,
forwarders, OS stubs, browsers, and apps.

4) What cache should I flush first?

None. First, confirm authoritative is correct. Second, identify the resolver in the failing path. Flush only the layer
that’s proven stale—and only if you own it and the blast radius is acceptable.

5) Why do I see NXDOMAIN for a record that exists now?

Negative caching. A resolver may have cached “doesn’t exist” when the record wasn’t present yet. That cache can persist
based on SOA/negative TTL. Flush the resolver if you can; otherwise wait.

6) Why does dig show the new IP but my app still uses the old one?

Because your app may not be re-resolving. It might be reusing a pooled connection, caching DNS in-process,
or routing via a proxy that has its own upstream mapping.

7) Should we run node-local DNS caching in Kubernetes?

It can be great for performance and resilience, but it adds another layer to debug. If you do it, standardize and document:
where caching happens, expected TTLs, and how to test from pod/node. Avoid stacking multiple caches without a reason.

8) Is flushing browser DNS cache useful?

Sometimes, for a single-user “my laptop is weird” case. It won’t fix production users. Use it as a last-mile cleanup,
not a propagation strategy.

9) What’s the safest way to handle planned DNS cutovers?

Pre-stage, lower TTL ahead of time, validate authoritative answers, keep old endpoints alive for at least the TTL window,
and monitor real connection destinations. Use targeted resolver flush only when necessary.

10) Why do different public resolvers show different answers?

They may have cached at different times, be in different regions, apply different policies, or receive different answers
due to EDNS Client Subnet or geo routing. That diversity is normal.

Next steps you can do today

  1. Add “authoritative first” to your runbook: always verify direct NS answers before touching caches.
  2. Document your real resolver path: which resolvers production uses, where caches exist, and who owns them.
  3. Standardize your tools: prefer dig + getent + “check remote IP” over guesswork.
  4. Plan TTL changes: lower TTLs ahead of planned migrations; don’t expect last-minute TTL edits to save you.
  5. Adopt targeted flushing: if you operate resolvers, support per-name flush. Make global flush an explicitly approved action.

DNS changes “not visible” are rarely mysterious. They’re usually just caching doing exactly what it was designed to do,
at the worst possible moment for your timeline. Your job is to find the one cache that matters, and leave the rest of the planet alone.
