DNS failures rarely announce themselves as “DNS failures”. They show up as a TLS error, a timeout in an API client, a Kubernetes node that “can’t pull images”, or a laptop that only breaks on the office Wi‑Fi. Then someone says the ancient incantation: “flush DNS”. Ten minutes later, it still fails. Now you’ve wasted time and built confidence in the wrong ritual.
Ubuntu 24.04 is a particularly good place to learn this lesson, because its default resolver stack can involve multiple caching layers and more than one resolver path. The result: your DNS caches can lie (not maliciously, just mechanically). Your job is to flush the cache that's actually serving the answer, and prove it with tools that don't lie back.
The mental model: DNS is a pipeline, not a lookup
When someone says “DNS cache”, they’re usually imagining a single bucket of answers. In reality, DNS on a modern Ubuntu machine is a pipeline:
- The application’s resolver behavior (glibc NSS, Go’s pure resolver, Java, curl with c-ares, Chromium’s own cache).
- Local stub resolver (often systemd-resolved listening on 127.0.0.53).
- A local caching forwarder (sometimes dnsmasq, sometimes none).
- Upstream recursive resolvers (corporate DNS, ISP, cloud resolver, Unbound, BIND).
- Authoritative DNS (Route 53, Cloud DNS, on-prem BIND, etc.).
- Anycast and CDN edges (because “DNS answer” can vary by where you are).
Any one of those layers can cache. Any one of them can have different rules about TTL, negative caching (caching “NXDOMAIN”), DNSSEC behavior, or which DNS server is “preferred”. So yes, your machine can keep insisting that api.internal.example points to an old IP address even after you “flushed DNS”. You probably flushed a cache that wasn’t being used.
Practical rule: Always start by proving which component is answering. Flushing before you know what to flush is like rebooting before you’ve looked at logs. Sometimes it works. That doesn’t make it a strategy.
One paraphrased idea from Gene Kim (operations and reliability): Improve outcomes by shortening feedback loops; make problems visible quickly, and fix the system rather than rely on heroics.
Interesting facts and historical context (why this is so messy)
- Fact 1: DNS predates the web by years; it was designed for a smaller, slower-changing internet. Caching was a feature, not an accident.
- Fact 2: The “stub resolver” concept exists because most applications shouldn’t implement full recursive DNS. They ask a local agent, which asks the network.
- Fact 3: The classic file /etc/resolv.conf was built for static servers. Modern systems often generate it dynamically, sometimes as a symlink.
- Fact 4: Negative caching (caching “does not exist”) is standardized. That’s why a briefly missing record can haunt you longer than you expect.
- Fact 5: “Split-horizon DNS” (different answers depending on where you ask from) is common in corporate networks. It also makes debugging feel like gaslighting.
- Fact 6: systemd-resolved exists partly to unify a chaotic world of per-interface DNS, VPN DNS, and DNSSEC decisions. You may not like it, but it’s trying to stop the bleeding.
- Fact 7: Some languages bypass glibc and do their own DNS. Go is the famous example, but not the only one. Your app may not use the same path as dig.
- Fact 8: Browsers cache DNS too. Even if the OS cache is perfect, a browser can keep a stale answer and make you blame the network.
DNS debugging is annoying because it’s distributed state. When state is distributed, confidence is a luxury. Evidence is cheaper.
Ubuntu 24.04 resolver stack: who answers your question?
On Ubuntu 24.04, the most common default is:
- systemd-resolved running as a service.
- A stub resolver at 127.0.0.53, exposed via a generated /etc/resolv.conf.
- Upstream DNS servers provided by NetworkManager, netplan/systemd-networkd, DHCP, or VPN clients.
But production machines are never “default” for long. You might have:
- dnsmasq installed for local caching or split DNS for containers.
- nscd (Name Service Cache Daemon) still lurking from older tuning guides.
- Kubernetes components, Docker, or systemd-networkd doing per-interface DNS manipulation.
- A VPN that pushes its own DNS servers and routing rules.
- Applications using DoH (DNS over HTTPS) or their own resolver libraries.
The point: “Flush DNS” is not one command. It’s a decision tree.
Joke #1: DNS is like office gossip: it spreads fast, it’s cached everywhere, and the retraction has a much lower TTL.
What “flush” even means
Flushing only helps if:
- The layer actually caches.
- The layer is actually in use for the failing query.
- The failure is due to stale cached state (not routing, firewall, TLS SNI mismatch, proxy config, or IPv6 selection issues).
If a record is wrong in authoritative DNS, flushing clients won’t fix it. If upstream recursive resolvers are serving stale data because of their own cache, flushing a laptop won’t fix it. And if the application never asks the OS for DNS, flushing systemd-resolved is performance art.
Fast diagnosis playbook: find the bottleneck in minutes
This is the order I use on-call. It’s optimized for “what changed, what’s broken, and what’s the fastest proof”.
1) Confirm the symptom is DNS, not “can’t reach the IP”
- If you have an IP address you expect, try connecting directly by IP (HTTP with Host header, TLS with SNI considerations).
- If direct IP works but hostname fails, you’re probably in DNS land.
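A minimal version of that check, assuming an HTTPS service; the hostname, IP, and /healthz path below are placeholders:

cr0x@server:~$ curl -sv --resolve api.internal.example:443:10.70.8.19 https://api.internal.example/healthz -o /dev/null
cr0x@server:~$ curl -sv -H 'Host: api.internal.example' http://10.70.8.19/healthz -o /dev/null

The first form keeps TLS SNI and certificate validation honest while pinning the IP; the second is the plain-HTTP Host-header variant. If these work but resolving the hostname normally fails, keep debugging DNS.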
2) Identify which resolver path the failing app uses
- Does it use glibc NSS? (Most Linux native tools do.)
- Is it Go with pure resolver? Is it Java? Is it a browser with its own cache/DoH?
- Does it run in a container with a different /etc/resolv.conf?
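A few quick probes, sketched with placeholder names (a ./worker binary and an app1 container); GODEBUG=netdns is Go's documented switch for logging which resolver a Go program chose:

cr0x@server:~$ GODEBUG=netdns=1 ./worker 2>&1 | grep 'go package net'   # logs whether Go picked its pure resolver or cgo/glibc
cr0x@server:~$ docker exec app1 cat /etc/resolv.conf                    # containers often carry their own resolver config
cr0x@server:~$ file /usr/local/bin/worker                               # "statically linked" usually means no glibc NSS path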
3) Ask the OS stub what it believes
- Use resolvectl query and check cache status, server selection, and records.
4) Ask upstream resolvers directly
- Use dig @server name against the corporate resolver, then against a known-good resolver in the same network.
- Compare answers and TTLs.
5) Flush the specific layer that is wrong
- Flush the systemd-resolved cache if it’s wrong.
- Restart dnsmasq if it’s in the path and caching.
- Clear browser/app caches if the OS is right but the app is wrong.
6) If caches look consistent: suspect split DNS, VPN routing, or IPv6 selection
- Check per-interface DNS and routing rules.
- Check whether the app is preferring AAAA records that route to nowhere.
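A quick way to separate a wrong record from a broken IPv6 path (placeholder hostname):

cr0x@server:~$ dig @127.0.0.53 api.internal.example A +noall +answer
cr0x@server:~$ dig @127.0.0.53 api.internal.example AAAA +noall +answer
cr0x@server:~$ getent ahostsv6 api.internal.example

If an AAAA record exists but IPv6 routing is broken, applications that prefer IPv6 will hang or time out while A-only tools look perfectly healthy.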
Practical tasks (with commands, outputs, and decisions)
These are real tasks you can run on Ubuntu 24.04. Each one includes: command, example output, what it means, and what decision to make.
Task 1: See what /etc/resolv.conf really is
cr0x@server:~$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Jun 21 08:12 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
Meaning: You’re using the systemd-resolved stub (likely 127.0.0.53). Many “flush DNS” guides for other distros won’t apply.
Decision: Use resolvectl and systemctl to inspect and flush. Don’t edit /etc/resolv.conf by hand unless you’re intentionally overriding the system.
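If you do intend an override, for example to have tools talk straight to the learned upstream servers without the stub in the middle, systemd-resolved also maintains a non-stub file you can point the symlink at. A sketch (deliberate change, not a casual fix):

cr0x@server:~$ cat /run/systemd/resolve/resolv.conf
cr0x@server:~$ sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf   # intentional override; point back at stub-resolv.conf when done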
Task 2: Confirm systemd-resolved is running
cr0x@server:~$ systemctl status systemd-resolved --no-pager
● systemd-resolved.service - Network Name Resolution
Loaded: loaded (/usr/lib/systemd/system/systemd-resolved.service; enabled; preset: enabled)
Active: active (running) since Mon 2025-12-30 09:10:14 UTC; 2h 11min ago
Docs: man:systemd-resolved.service(8)
man:resolvectl(1)
Main PID: 812 (systemd-resolve)
Status: "Processing requests..."
Meaning: The local cache and stub are active.
Decision: Flushing nscd or restarting random networking services is probably the wrong move. Start with resolvectl.
Task 3: Inspect resolver configuration (servers, interfaces, DNSSEC)
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.20.0.53
DNS Servers: 10.20.0.53 10.20.0.54
Link 2 (ens3)
Current Scopes: DNS
Protocols: +DefaultRoute
DNS Servers: 10.20.0.53 10.20.0.54
DNS Domain: corp.example
Meaning: Your system is using corporate resolvers, and the active server is 10.20.0.53. This tells you who to interrogate next.
Decision: If the upstream server is wrong or inconsistent, flushing local caches won’t fix it. You’ll need to query the upstream resolvers directly and possibly escalate to whoever owns them.
Task 4: Ask the OS resolver for a name and see what it returns
cr0x@server:~$ resolvectl query api.internal.example
api.internal.example: 10.70.8.19 -- link: ens3
-- Information acquired via protocol DNS in 11.2ms.
-- Data is authenticated: no
Meaning: This is what your OS resolver believes right now. It also tells you which link (interface) and thus which per-interface DNS settings were used.
Decision: If the OS answer is wrong, flush systemd-resolved cache. If it’s right but your app is wrong, suspect app-level caching or a different resolver path.
Task 5: Check if the result is coming from cache
cr0x@server:~$ resolvectl statistics
DNSSEC supported by current servers: no
Transactions
Current Transactions: 0
Total Requests: 14382
Cache Hits: 9112
Cache Misses: 5270
Cache
Current Cache Size: 216
Cache Hits: 9112
Cache Misses: 5270
Meaning: This confirms caching is active and being used. High cache hits during an incident can mean you’re repeatedly serving stale answers.
Decision: If you suspect stale cache, flush it and re-query; then verify whether the upstream answers changed.
Task 6: Flush systemd-resolved cache (the right “flush” most of the time)
cr0x@server:~$ sudo resolvectl flush-caches
Meaning: No output is normal. This clears systemd-resolved caches (positive and negative).
Decision: Immediately re-run resolvectl query. If the answer stays wrong, the upstream resolver is likely serving the same wrong data.
Task 7: Prove what the upstream resolver says (bypass local cache)
cr0x@server:~$ dig @10.20.0.53 api.internal.example +noall +answer +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 60124
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
api.internal.example. 300 IN A 10.70.8.19
Meaning: Upstream resolver returns the same A record with TTL 300 seconds. If it’s wrong, upstream is wrong (or authoritative is wrong).
Decision: If you control upstream, flush there or fix authoritative. If you don’t, use an alternate resolver only if policy allows and split DNS won’t break you.
Task 8: Compare with another resolver to detect split DNS or resolver inconsistency
cr0x@server:~$ dig @10.20.0.54 api.internal.example +noall +answer +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28490
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
api.internal.example. 60 IN A 10.70.8.42
Meaning: Two corporate resolvers disagree. That’s not “DNS being DNS”; that’s a broken replication/forwarding/caching situation or split-horizon misconfiguration.
Decision: Stop flushing clients. Pick the resolver that’s correct (temporarily) or fail over via resolver ordering, and open an incident for the DNS team.
Task 9: See if you’re dealing with negative caching (NXDOMAIN)
cr0x@server:~$ resolvectl query new-service.internal.example
new-service.internal.example: resolve call failed: 'new-service.internal.example' not found
Meaning: NXDOMAIN (or equivalent) can be cached. If the record was just created, your resolver path may keep denying its existence for a while.
Decision: Flush caches at the layer that’s producing NXDOMAIN. Then query authoritative/upstream directly. If authoritative still says NXDOMAIN, stop blaming caches and fix DNS data.
Task 10: Inspect NSS order (is DNS even consulted?)
cr0x@server:~$ grep -E '^\s*hosts:' /etc/nsswitch.conf
hosts: files mdns4_minimal [NOTFOUND=return] dns
Meaning: This system consults /etc/hosts first, then mDNS, then DNS. A stale entry in /etc/hosts will override everything.
Decision: If name resolution is “wrong but consistent”, check /etc/hosts. Flushing caches won’t override a file.
Task 11: Check for an override in /etc/hosts
cr0x@server:~$ grep -n 'api.internal.example' /etc/hosts
12:10.70.8.19 api.internal.example
Meaning: That hostname is pinned locally. Your DNS changes won’t matter until you remove or update this line.
Decision: Fix the entry (or remove it). Then retest. This is the rare case where editing a file beats flushing a cache.
Task 12: Confirm what glibc-based tools see (getent is your friend)
cr0x@server:~$ getent ahosts api.internal.example
10.70.8.19 STREAM api.internal.example
10.70.8.19 DGRAM
10.70.8.19 RAW
Meaning: getent uses the system’s NSS configuration. If getent matches resolvectl, the OS path is consistent.
Decision: If the application still behaves differently, it may be using its own DNS resolver or caching. Move your debugging up the stack.
Task 13: Identify whether a local caching forwarder like dnsmasq is in play
cr0x@server:~$ systemctl is-active dnsmasq
inactive
Meaning: dnsmasq is not running as a system service. Good: one less cache layer.
Decision: Don’t restart dnsmasq “just in case”. If it’s inactive, it’s not your cache. Focus elsewhere.
Task 14: Check if nscd is caching host lookups
cr0x@server:~$ systemctl is-active nscd
inactive
Meaning: No NSCD. Many “flush DNS cache on Linux” posts are NSCD-era fossils.
Decision: If it’s inactive, stop trying to flush it. If it’s active in your environment, you need to manage it deliberately (and consider removing it if it fights your resolver stack).
Task 15: Confirm which nameserver your tools actually query (127.0.0.53 vs direct)
cr0x@server:~$ cat /etc/resolv.conf
nameserver 127.0.0.53
options edns0 trust-ad
search corp.example
Meaning: Most libc-based resolution goes through the stub. dig reads /etc/resolv.conf too, so by default it also hits 127.0.0.53, but it skips NSS entirely (no /etc/hosts, no mDNS) and can be pointed at any server with @.
Decision: When comparing tools, make sure they query the same thing. Otherwise you’re debugging two different systems and calling it “inconsistent”.
Task 16: Query the stub resolver directly with dig
cr0x@server:~$ dig @127.0.0.53 api.internal.example +noall +answer +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49821
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
api.internal.example. 287 IN A 10.70.8.19
Meaning: This is the stub’s current cached view. TTL decreasing suggests you’re looking at cached data (which is normal).
Decision: If you flush and TTL resets (or answer changes), you’ve proven the stub cache mattered. If nothing changes, the lie is upstream.
Task 17: Look for per-interface DNS that might be different under VPN
cr0x@server:~$ resolvectl dns
Global: 10.20.0.53 10.20.0.54
Link 2 (ens3): 10.20.0.53 10.20.0.54
Link 3 (tun0): 172.16.100.1
Meaning: VPN interface tun0 has its own DNS server. Depending on routing domains, queries can go there.
Decision: If internal names only resolve on VPN, that’s expected. If public names break on VPN, you may be leaking/overriding DNS in a way the app didn’t expect.
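To prove which link answers for a given name, resolvectl can usually be restricted to one interface (tun0/ens3 match the example above; the hostname is a placeholder, and the flag is worth confirming in resolvectl(1) on your build):

cr0x@server:~$ resolvectl query --interface=tun0 api.internal.example
cr0x@server:~$ resolvectl query --interface=ens3 api.internal.example

If only one of these returns the answer you expect, the routing domains (next task) are deciding for you.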
Task 18: Check if you have a routing/domain search rule that changes resolution
cr0x@server:~$ resolvectl domain
Global: corp.example
Link 2 (ens3): corp.example
Link 3 (tun0): ~internal.example
Meaning: The ~ prefix indicates a routing-only domain (split DNS). Queries for internal.example go to the VPN resolver, others do not.
Decision: If the wrong resolver answers, fix the split DNS configuration rather than flushing caches. Flushing doesn’t change routing rules.
Joke #2: Flushing the wrong cache is the ops equivalent of unplugging the monitor to fix a database deadlock.
Three corporate mini-stories from the DNS trenches
Mini-story 1: The incident caused by a wrong assumption (the “dig is truth” trap)
A mid-sized SaaS company had a staged migration of an internal API from one VPC to another. The plan was clean: lower TTLs ahead of time, update A records during a maintenance window, monitor error rates, then flip clients back if needed. The team tested resolution with dig from a bastion host, saw the new IP address, and declared DNS “done”.
Thirty minutes later, worker fleets started failing health checks. The error wasn’t “NXDOMAIN” or “could not resolve host”. It was connect timeouts. On-call chased security groups, NAT gateways, and load balancers. Everything looked fine. Meanwhile, dig still showed the correct new address. Confidence remained misplaced.
The catch: their workers were built in Go, using a resolver configuration that bypassed the local stub and held its own cache behavior. On the affected Ubuntu 24.04 nodes, systemd-resolved was correct, but the Go processes were pinned to old answers longer than the team expected, due to the application’s own lookup and connection reuse patterns. Some processes weren’t even doing fresh lookups because keepalive connections were sticking to dead endpoints.
Resolution came from changing the rollout procedure, not “flushing harder”. They added a controlled process restart after DNS flips for that particular service class, and they validated resolution through the same code path the workers used (a small diagnostic binary linked similarly). They also stopped using “dig on the bastion” as their definition of reality.
The lesson stuck: the resolver you test with must match the resolver your workload uses. Otherwise you are measuring something else and calling it validation.
Mini-story 2: The optimization that backfired (local caching forwarder edition)
An enterprise platform team decided to “improve DNS performance” on build agents. They installed dnsmasq as a local caching forwarder, reasoning that build scripts do lots of lookups against internal artifact repositories. They also enabled aggressive caching settings to cut latency and reduce load on corporate resolvers.
It worked—until it didn’t. A certificate rotation coincided with moving an internal service behind a new VIP. The service name stayed the same, IP changed. The authoritative DNS and corporate resolvers behaved correctly. But the build agents kept trying the old IP well past the TTL. Some would recover after a while, some wouldn’t, and the failures were oddly “sticky” per machine.
The root cause wasn’t a simple “dnsmasq bug”. It was configuration drift: some agents had different cache size and TTL enforcement settings, and a few had two local resolver layers (dnsmasq in front of systemd-resolved) because images evolved over time. The result was inconsistent caching semantics and un-debuggable variance. Engineers started restarting networking randomly, which sometimes helped and mostly wasted time.
The fix was boring and effective: remove dnsmasq from most agents, keep it only where split DNS was genuinely required, and rely on stable upstream resolvers designed to cache at scale. For the few cases that needed local forwarding, they standardized configs and added a simple “what resolver am I using” health check to their provisioning pipeline.
Optimization lesson: adding caching layers increases the number of ways you can be wrong. Performance wins are real, but you must price in operability.
Mini-story 3: The boring but correct practice that saved the day (TTL discipline and staged verification)
A financial services company ran an internal service discovery setup where DNS records were updated by automation during failovers. The team had been burned before by stale caches, so they did something uncool: they wrote down a runbook and followed it like adults.
Before any planned move, they reduced TTLs 24 hours in advance. Not to “5 seconds”—to a value that their resolvers and clients would actually respect without causing query storms. They also had a standard verification sequence: query authoritative, query each corporate recursive resolver, then query a representative client through the stub. Only after those matched did they flip traffic.
During one incident, a failover happened cleanly but a subset of clients still hit the old target. The runbook made the investigation quick: authoritative was correct, one corporate recursive resolver was serving stale data. Because they already had the list of recursors and a standard query set, they isolated the bad resolver, removed it from DHCP scope temporarily, and restored service while the DNS team fixed the resolver.
Nothing heroic happened. No one “found the magic flush command”. They just had evidence, in order, and a pre-agreed plan. That’s what saved them.
Common mistakes: symptom → root cause → fix
1) Symptom: “dig shows the new IP, but the app still hits the old one”
Root cause: Different resolver paths or app-level caching/connection reuse. dig is not your application.
Fix: Validate with getent ahosts (glibc path) and with the application runtime’s DNS behavior. Consider restarting the service after DNS flips if the runtime caches or keeps connections open.
2) Symptom: “Flushing local DNS does nothing”
Root cause: The wrong cache. The upstream recursive resolver is serving stale data, or the name is pinned in /etc/hosts.
Fix: Query upstream resolvers directly with dig @server. Check /etc/hosts. Escalate to DNS owners if upstream is wrong.
3) Symptom: “Works on Ethernet, fails on VPN (or the opposite)”
Root cause: Split DNS and per-interface resolver rules. Different domains are routed to different resolvers, sometimes unexpectedly.
Fix: Inspect resolvectl dns and resolvectl domain. Fix routing-only domains and VPN-pushed DNS settings rather than flushing caches.
4) Symptom: “New hostname returns NXDOMAIN for minutes after creation”
Root cause: Negative caching at some resolver layer.
Fix: Flush caches where NXDOMAIN is being produced (systemd-resolved, dnsmasq, upstream recursors). Confirm authoritative has the record.
5) Symptom: “Random clients fail, others fine”
Root cause: Resolver pool inconsistency (two recursors disagree), or mixed images/config drift causing different caching behavior.
Fix: Query each configured resolver directly and compare. Standardize resolver configuration and remove extra caching layers unless they solve a real problem.
6) Symptom: “Hostname resolves to IPv6 and connections fail; IPv4 works”
Root cause: AAAA records present but network path for IPv6 is broken or filtered; Happy Eyeballs behavior varies per app.
Fix: Query AAAA and A separately, validate IPv6 routing, or temporarily adjust records/policy. Don’t treat it as a cache problem until you’ve proven it is (a quick check follows this list).
7) Symptom: “After changing DNS servers, machine still queries the old ones”
Root cause: Stale per-interface configuration, VPN pushing settings, or a static /etc/resolv.conf override.
Fix: Use resolvectl status to see current servers. Fix netplan/NetworkManager/VPN configuration; avoid hand-editing generated files.
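For mistake 6 above, a minimal check that compares address families instead of flushing anything (placeholder hostname):

cr0x@server:~$ dig api.internal.example AAAA +short
cr0x@server:~$ curl -6 -sv --connect-timeout 5 https://api.internal.example/ -o /dev/null
cr0x@server:~$ curl -4 -sv --connect-timeout 5 https://api.internal.example/ -o /dev/null

If IPv4 connects and IPv6 times out, you have a routing or filtering problem wearing a DNS costume.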
Checklists / step-by-step plan (do this, not vibes)
Checklist A: “DNS change just happened, some clients are wrong”
- On an affected client: capture the OS view with resolvectl query name.
- Capture the upstream view: dig @current_upstream name +noall +answer +comments.
- Compare with a second upstream resolver if present (a combined capture sketch follows this checklist).
- If upstream differs: treat as resolver inconsistency. Escalate and/or remove the bad resolver from rotation.
- If upstream matches but is wrong: query authoritative servers (from a place allowed to reach them) and fix DNS data.
- If OS is right but the app is wrong: check app runtime resolver and caches; restart/reload the service if appropriate.
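A combined capture sketch for the steps above, assuming the corporate resolvers used earlier in this article (10.20.0.53/54) and a placeholder name; adapt the list to your environment:

NAME=api.internal.example
resolvectl query "$NAME"
for srv in 10.20.0.53 10.20.0.54; do
  echo "== upstream $srv =="
  dig @"$srv" "$NAME" +noall +answer +comments
done
getent ahosts "$NAME"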
Checklist B: “Stop flushing the wrong cache”
- Check ls -l /etc/resolv.conf. If it points to systemd’s stub, you’re in systemd-resolved world.
- Check systemctl is-active systemd-resolved. If it’s inactive, flushing it is a placebo.
- Check for extra layers: systemctl is-active dnsmasq, systemctl is-active nscd.
- Use getent ahosts as your baseline for “what libc-based apps see”.
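The same checks as a one-shot, the kind of “what resolver am I using” health check a provisioning pipeline can run (a sketch; the hostname is a placeholder):

ls -l /etc/resolv.conf
for svc in systemd-resolved dnsmasq nscd; do
  printf '%-18s %s\n' "$svc" "$(systemctl is-active "$svc")"
done
getent ahosts api.internal.example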
Checklist C: “Planned migration with DNS flips”
- Lower TTLs ahead of time (hours to a day, depending on your environment’s caching behavior).
- Document which resolvers are in use (DHCP scopes, VPN profiles, datacenter recursors).
- During the window, verify in this order (a scripted sketch follows this checklist):
- Authoritative answer is correct.
- Each recursive resolver answer is correct and TTL is sane.
- Representative clients via stub answer is correct.
- Applications resolve correctly through their real resolver path.
- Have a rollback plan that doesn’t rely on “flush everyone’s caches”. If your rollback plan is human cache invalidation, it’s not a plan.
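One way to script the verification order, assuming you know your authoritative server and recursors; every name and IP below is a placeholder:

NAME=api.internal.example
AUTH=ns1.corp.example                 # authoritative server (placeholder)
RECURSORS="10.20.0.53 10.20.0.54"     # resolvers advertised via DHCP/VPN

echo "== authoritative =="; dig @"$AUTH" "$NAME" +norecurse +noall +answer
for r in $RECURSORS; do
  echo "== recursor $r =="; dig @"$r" "$NAME" +noall +answer
done
echo "== client stub =="; resolvectl query "$NAME"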
FAQ
1) What is the “right” DNS cache to flush on Ubuntu 24.04?
Most of the time: systemd-resolved. Use sudo resolvectl flush-caches, then re-check with resolvectl query.
2) Why does “restart networking” sometimes fix DNS?
Because it may restart or reconfigure the resolver stack (interfaces, DNS servers, search domains). It’s also a blunt instrument that can cause new outages. Prefer targeted checks with resolvectl status.
3) Why does dig show one answer but getent shows another?
dig queries the DNS server you tell it to (or your default, depending on how you run it). getent follows NSS and often uses the local stub. They’re not equivalent unless you make them query the same resolver.
4) Does Ubuntu 24.04 cache DNS in glibc?
glibc itself doesn’t provide a general-purpose DNS cache like a daemon would. Caching typically happens in systemd-resolved, dnsmasq, nscd, or inside applications.
5) How do I know if /etc/hosts is overriding DNS?
Check /etc/nsswitch.conf for the hosts: order, then search /etc/hosts for the hostname. If it’s present, DNS is irrelevant until you fix the file.
6) What about browser DNS cache on Ubuntu?
Browsers often cache DNS and can use their own resolver or DoH. If OS tools show the correct answer but the browser doesn’t, clear the browser’s host cache or disable/adjust DoH in that environment.
7) What if I flushed systemd-resolved and the answer is still wrong?
Then the wrong answer is coming from upstream or authoritative DNS, or you’re not using systemd-resolved for that query path. Prove upstream behavior with dig @upstream and inspect per-interface DNS with resolvectl status.
8) Can VPN split DNS cause “DNS cache” symptoms?
Yes. It can look like a cache because answers vary depending on interface and routing domains. Use resolvectl domain and resolvectl dns to see which names go where.
9) Should I disable systemd-resolved to make DNS “simple”?
Usually no. Disabling it can make VPN and per-interface DNS worse, not better. If you need a different resolver architecture (e.g., Unbound locally), do it intentionally and document it. “Simple” isn’t simple if no one knows what’s running.
10) What’s the safest way to validate a DNS change?
Query authoritative, then each recursive resolver, then representative clients through their real resolver path. If any layer disagrees, treat that as the incident until proven otherwise.
Conclusion: next steps you can actually do today
Stop treating DNS cache flushes as spiritual cleansing. On Ubuntu 24.04, the most common cache that matters is systemd-resolved, and the most common failure mode is flushing it when the lie is upstream—or when the app never asked it in the first place.
Do these next:
- On your fleet images, standardize on one resolver stack (and remove extra caching layers unless you can justify them).
- Teach your team to use resolvectl status, resolvectl query, and getent ahosts as the baseline evidence.
- For every DNS-related incident, record: the client’s stub answer, the upstream resolver answer, and whether the application uses the OS resolver path.
- For planned migrations, build a runbook that verifies authoritative → recursive → client, and includes a rollback that doesn’t rely on flushing human brains.