Docker DNS Caching Lies: Flush the Right Cache (Inside vs Outside Containers)

You change a DNS record. You wait. You even do the ritual: “it’s fine, TTL is low.”
And yet one container keeps calling the old IP like it’s in a long-distance relationship with your retired load balancer.

The trap is simple: you’re flushing a cache, not the cache. Docker adds at least one resolver layer, Linux adds more, and your application may be hoarding answers like it pays rent per lookup. If you don’t know where the lie lives, you’ll keep chasing ghosts.

A practical mental model: where DNS answers can stick

DNS troubleshooting in containers is rarely about “DNS is down.” It’s about time, context, and caching in places you forgot existed.
The fastest way to get unstuck is to stop asking “what is my DNS?” and start asking “who answered, who cached, and who is still reusing it?”

DNS has multiple “clients,” not one

A containerized application usually doesn’t talk directly to your corporate DNS servers. It talks to something closer:

  • Your application’s own resolver logic (JVM, Go net.Resolver behavior, Node’s caching, Envoy, nginx upstream resolution rules).
  • libc + NSS (glibc’s resolver, musl, /etc/nsswitch.conf, /etc/hosts, search domains, ndots, timeout/attempts).
  • A local caching daemon (nscd, dnsmasq, systemd-resolved) — sometimes inside the container, often on the host.
  • Docker’s embedded DNS (commonly at 127.0.0.11) which forwards and does name-to-container service discovery on user-defined networks.
  • The host’s upstream resolvers (corporate DNS, VPC resolver, CoreDNS, Unbound, bind).

If you flush the host cache but the application caches forever, nothing changes. If you restart the container but Docker’s embedded resolver still forwards to a stale upstream, nothing changes either.
You need to isolate which layer is serving stale answers.
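
A quick way to make that concrete (a minimal sketch, assuming a container named web-1 like the examples later in this article):

cr0x@server:~$ docker exec web-1 cat /etc/resolv.conf            # what libc inside the container sees
cr0x@server:~$ docker exec web-1 cat /etc/hosts                  # static overrides that beat DNS entirely
cr0x@server:~$ cat /etc/resolv.conf                              # what the host resolves against
cr0x@server:~$ sudo cat /etc/docker/daemon.json 2>/dev/null      # pinned upstreams for embedded DNS, if any

Four files, four different answers to “who resolves my names,” and any of them can disagree with the others.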

“Flush DNS” is not a single command

On Linux alone, “flush DNS” can mean:

  • restart systemd-resolved or drop its caches,
  • restart nscd or dnsmasq,
  • restart Docker (or just the affected containers / networks),
  • restart the application process to clear its internal cache,
  • or flush an upstream caching resolver you don’t control.

The correct flush is the one that actually changes the next query path.
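
For reference, the commands behind those flushes (a sketch; daemon names vary by distro, and you should only run the one you have proven is the stale layer):

cr0x@server:~$ sudo resolvectl flush-caches            # systemd-resolved cache
cr0x@server:~$ sudo systemctl restart nscd             # or dnsmasq, if that is your local caching daemon
cr0x@server:~$ docker restart web-1                    # one container, not the whole daemon
cr0x@server:~$ sudo systemctl restart docker           # heavy hammer: affects every container on the host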

One operational principle that ages well: observe before you intervene. Flushing hides evidence. Gather a few “before” snapshots, then flush the minimum necessary.

One quote worth keeping near your on-call brain: “Hope is not a strategy.” — General Gordon R. Sullivan

Interesting facts and a little history (so you stop blaming “Docker magic”)

  • DNS caching wasn’t always expected at the OS layer. Early Unix resolvers tended to be simple stub resolvers; caching became common as networks got slower and name lookups got frequent.
  • TTL is advisory, not a universal law. A resolver may cap TTLs (min/max), and applications can cache longer than DNS says—especially when they’re trying to “help.”
  • Docker added embedded DNS to make service discovery on user-defined networks workable. Otherwise, container-to-container naming becomes painful fast.
  • Docker’s default bridge networking historically copied the host’s resolv.conf behavior. That made containers inherit host DNS quirks—good and bad—until embedded DNS became the norm on custom networks.
  • systemd-resolved changed expectations on many distros. It introduced a local stub resolver (often 127.0.0.53) with split DNS and caching, which interacts with Docker in surprising ways.
  • Alpine (musl) resolver behavior differs from Debian/Ubuntu (glibc). The failure modes around search domains, ndots, and timeouts can look like “random DNS.”
  • Go’s DNS resolver behavior changed across versions. Depending on build flags and environment, it may use the pure Go resolver or cgo (glibc), impacting caching and configuration parsing.
  • Corporate DNS often does its own caching and “helpful” rewriting. Split-horizon DNS, internal zones, and policy-based responses mean the same name can resolve differently inside/outside a VPN.

That last point matters because “it works on my laptop” might be literally true: your laptop is on a different DNS view than the container host.

The DNS caching stack: inside container vs host vs upstream

Layer 0: the application itself (the most common liar)

If you only remember one thing: many apps cache DNS longer than you think, and some effectively cache forever unless you re-resolve.
Common patterns:

  • Connection pools keep sockets to old IPs. DNS can change and the app won’t care until the pool churns.
  • Runtime DNS caches (Java’s InetAddress caching, some HTTP clients, service meshes).
  • Long-lived processes that resolve once at startup and never again.

Operational consequence: flushing DNS below the app does nothing if the app never asks again.
Your best “flush” might be a rolling restart or a signal that forces reload (if supported).
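
For JVM apps specifically, the cache lives in InetAddress and is governed by security properties. A minimal check, assuming a JDK 9+ image with JAVA_HOME set (the 30-second value below is an illustration, not a recommendation):

cr0x@server:~$ docker exec web-1 sh -lc 'grep -n "networkaddress.cache" "$JAVA_HOME/conf/security/java.security"'
cr0x@server:~$ # set networkaddress.cache.ttl=30 in that file (or via Security.setProperty at startup), then restart the app

Other runtimes have their own knobs; the point is that the flush for this layer is an application-level change, not a resolver command.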

Layer 1: libc, NSS, and resolver config

This is where /etc/resolv.conf, /etc/nsswitch.conf, /etc/hosts, search domains, and options like ndots live.
This layer doesn’t usually “cache” aggressively by itself (glibc doesn’t keep a big shared cache), but it can create behavior that looks like caching:

  • search domains + ndots cause multiple queries per lookup. One of those queries may succeed and stick in upstream caches.
  • timeout/attempts cause long hangs that look like partial outages.
  • /etc/hosts overrides DNS entirely, resulting in “stale DNS” that is actually a static file.
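
To watch the search/ndots multiplication happen, let dig walk the search list and print each intermediate query (run it in a container that actually has search domains configured; the short name api is purely illustrative):

cr0x@server:~$ docker exec web-1 sh -lc 'dig +search +showsearch api A'

Every intermediate lookup you see there is a query a library resolver would also send, and a query some upstream cache will remember.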

Layer 2: caching daemons (nscd, dnsmasq, systemd-resolved)

If a node runs a local caching resolver, containers may query it directly (via copied resolv.conf) or indirectly (Docker forwards to it).
If that daemon is caching bad data, containers will keep seeing it until the daemon’s cache expires or you clear it.

On modern Ubuntu, systemd-resolved frequently sits on 127.0.0.53, acting as a local stub and cache.
Docker sometimes struggles if it copies 127.0.0.53 into containers: that address is inside the container namespace and doesn’t point to the host stub unless special routing exists.
That failure mode is not caching; it’s a namespace mismatch.
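
A two-line sanity check for that mismatch (assuming the web-1 container used in the tasks below):

cr0x@server:~$ readlink -f /etc/resolv.conf                       # stub-resolv.conf means the host talks to 127.0.0.53
cr0x@server:~$ docker exec web-1 grep ^nameserver /etc/resolv.conf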

Layer 3: Docker’s embedded DNS (127.0.0.11)

On user-defined bridge networks, Docker typically injects nameserver 127.0.0.11 into the container’s /etc/resolv.conf.
That IP is not the host resolver; it’s Docker’s in-engine resolver bound inside the container’s network namespace.

What it does well:

  • resolves container names to container IPs within the Docker network,
  • handles aliases and service discovery in Compose,
  • forwards other queries to upstream resolvers configured for Docker/host.

What it does poorly (or at least opaquely):

  • it becomes an extra hop where timeouts and caching behavior can be misunderstood,
  • it hides upstream resolver changes unless containers are recreated or Docker reloads configuration,
  • it makes “flush DNS” ambiguous because you don’t directly manage that cache as a typical daemon.
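
If you want to watch the embedded resolver doing its job in isolation, here is a disposable sketch (network, image, and names are illustrative; clean up afterwards):

cr0x@server:~$ docker network create demo_net
cr0x@server:~$ docker run -d --name api --network demo_net nginx:alpine
cr0x@server:~$ docker run --rm --network demo_net alpine:3 nslookup api   # answered by 127.0.0.11 with api's container IP
cr0x@server:~$ docker rm -f api && docker network rm demo_net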

Layer 4: upstream resolvers (CoreDNS, Unbound, bind, cloud resolvers)

Upstream resolvers cache based on TTL, but they also have:

  • negative caching (NXDOMAIN cached for some time),
  • prefetch and serve-stale features in some implementations,
  • policy / split DNS that changes answers depending on source network.

If you don’t control upstream caching, the only reliable “flush” is to query a different resolver (temporarily) or wait out the TTL/negative TTL.
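
When you suspect a recursive cache, comparing it against the authoritative servers is the cleanest cross-check (ns1.corp.internal and api.internal are the illustrative names used elsewhere in this article; substitute your zone):

cr0x@server:~$ dig +noall +answer corp.internal NS
cr0x@server:~$ dig @ns1.corp.internal +norecurse +noall +answer api.internal A

If the authoritative answer is new and the recursive answer is old, you have found the cache. If they match, keep looking lower in the stack.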

Joke #1: DNS stands for “Definitely Not Synchronized.” It doesn’t, but it explains most on-call weekends.

Fast diagnosis playbook

The goal here is not to become a DNS scholar. It’s to find the bottleneck and the liar quickly, with minimal collateral damage.

First: verify the symptom and scope

  • Is it one container, one host, or all hosts? Single-container issues usually mean app cache, container config, or namespace-specific resolver path.
  • Is it one name or all names? One name points to a record or caching issue; all names point to resolver connectivity or config.
  • Is it stale (old IP) or failure (NXDOMAIN/timeouts)? Stale is cache; timeouts are network/MTU/firewall; NXDOMAIN can be negative caching or split DNS.

Second: identify what resolver the container is using

  • Check /etc/resolv.conf in the container: 127.0.0.11 means Docker embedded DNS; anything else is direct.
  • Check Docker network mode and whether it’s user-defined bridge, host network, or something unusual.

Third: do side-by-side queries from container and host

  • Query the name from container using dig against the configured resolver.
  • Query the same name from host against the upstream resolver(s).
  • If answers differ, the divergence is somewhere between container stub and upstream. If answers match but app still calls old IP, the liar is the app or connection pooling.

Fourth: decide the smallest reset that changes behavior

  • If app caches: restart the process or force re-resolution (if supported).
  • If host cache: flush systemd-resolved/dnsmasq/nscd.
  • If Docker embedded DNS path is suspect: recreate the container (or network), or adjust Docker daemon DNS settings and restart Docker in a controlled window.
  • If upstream cache: query a different resolver temporarily, or wait for TTL/negative TTL to expire.

Hands-on tasks: commands, output meaning, and decisions (12+)

Task 1: Find the container’s configured resolver

cr0x@server:~$ docker exec -it web-1 cat /etc/resolv.conf
nameserver 127.0.0.11
options ndots:0

What it means: This container uses Docker’s embedded DNS (127.0.0.11). Flushing host caches might not directly change what it sees if Docker is caching/forwarding oddly.

Decision: Query via dig inside container to see what Docker DNS returns; also inspect Docker daemon DNS configuration.

Task 2: Confirm whether the container is on a user-defined network

cr0x@server:~$ docker inspect -f '{{json .NetworkSettings.Networks}}' web-1
{"app_net":{"IPAMConfig":null,"Links":null,"Aliases":["web-1","web"],"NetworkID":"8d0e...","EndpointID":"0c4c...","Gateway":"172.20.0.1","IPAddress":"172.20.0.10","IPPrefixLen":16,"IPv6Gateway":"","GlobalIPv6Address":"","GlobalIPv6PrefixLen":0,"MacAddress":"02:42:ac:14:00:0a","DriverOpts":null}}

What it means: User-defined bridge network (app_net) typically implies embedded DNS is active and service discovery is in play.

Decision: If the problem is “service name points to wrong container,” you’re debugging Docker network DNS, not corporate DNS.

Task 3: Compare DNS answers inside the container using dig

cr0x@server:~$ docker exec -it web-1 sh -lc 'apk add --no-cache bind-tools >/dev/null 2>&1 || true; dig +noall +answer api.internal A'
api.internal.         30      IN      A       10.40.12.34

What it means: Docker’s resolver (or what it forwards to) is currently returning 10.40.12.34 with TTL 30 seconds.

Decision: If you expected a new IP and TTL has passed, something is caching above/below. Now query from host and from a specific upstream resolver.

Task 4: Query the same name from the host against its default resolver

cr0x@server:~$ dig +noall +answer api.internal A
api.internal.         30      IN      A       10.40.12.99

What it means: Host sees 10.40.12.99, container sees 10.40.12.34. The divergence is real and local.

Decision: Docker embedded resolver might be forwarding to different upstream resolvers than the host, or caching stale data.

Task 5: Find what DNS servers Docker daemon is using

cr0x@server:~$ docker info | sed -n '/DNS:/,/Registry Mirrors:/p'
DNS: 10.0.0.2
  10.0.0.3

What it means: Docker is configured to use 10.0.0.2 and 10.0.0.3 as upstream resolvers for embedded DNS forwarding.

Decision: Query those resolvers directly from host to see what they return and whether they have stale cache.

Task 6: Query Docker’s upstream resolver directly

cr0x@server:~$ dig @10.0.0.2 +noall +answer api.internal A
api.internal.         30      IN      A       10.40.12.34

What it means: The upstream resolver Docker uses is returning the old IP. Docker isn’t necessarily lying; it’s faithfully repeating someone else’s lie.

Decision: Fix or flush the upstream resolver cache, or repoint Docker to the correct resolvers.

Task 7: Detect systemd-resolved involvement on the host

cr0x@server:~$ resolvectl status | sed -n '1,35p'
Global
         Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
  resolv.conf mode: stub
Current DNS Server: 10.0.0.2
       DNS Servers: 10.0.0.2 10.0.0.3
        DNS Domain: corp.internal

What it means: Host uses systemd-resolved stub mode; it may be caching, and it may also be the source of /etc/resolv.conf contents Docker reads.

Decision: If containers copy a stub address that’s unreachable in their namespace, fix Docker’s DNS config; if host caching is bad, flush resolved.

Task 8: Flush systemd-resolved cache on the host (when it is the cache)

cr0x@server:~$ sudo resolvectl flush-caches
cr0x@server:~$ resolvectl statistics | sed -n '1,25p'
DNSSEC supported by current servers: no
Transactions: 1482
Cache Hits: 312
Cache Misses: 1170

What it means: Cache was flushed; subsequent lookups should be cache misses until it warms.

Decision: Re-run host/container queries. If container still sees old answers, systemd-resolved was not the bottleneck.

Task 9: Check whether the container can even reach the host’s stub resolver

cr0x@server:~$ docker exec -it web-1 sh -lc 'grep -E "^nameserver" /etc/resolv.conf; nc -zu -w1 127.0.0.53 53; echo $?'
nameserver 127.0.0.11
1

What it means: The container isn’t configured to use 127.0.0.53 anyway, and even if it were, that address would be local to the container namespace.

Decision: Stop trying to “flush host stub” as a fix for a container using 127.0.0.11. Work on Docker DNS/upstream.

Task 10: Determine if the application is pinning connections to old IPs

cr0x@server:~$ docker exec -it web-1 sh -lc 'ss -tnp | head -n 10'
State  Recv-Q Send-Q Local Address:Port   Peer Address:Port  Process
ESTAB  0      0      172.20.0.10:49218   10.40.12.34:443    users:(("app",pid=1,fd=73))
ESTAB  0      0      172.20.0.10:49222   10.40.12.34:443    users:(("app",pid=1,fd=74))

What it means: The app is currently connected to the old IP. Even if DNS now resolves to the new IP, active sockets keep flowing.

Decision: Force connection churn (reload, restart, lower keepalive), or fix pool behavior. DNS flush won’t tear down established TCP sessions.

Task 11: Inspect /etc/hosts inside container for “DIY DNS” surprises

cr0x@server:~$ docker exec -it web-1 cat /etc/hosts
127.0.0.1	localhost
172.20.0.10	web-1
10.40.12.34	api.internal

What it means: Someone pinned api.internal in /etc/hosts. That’s not caching. That’s a permanent override.

Decision: Remove the entry (rebuild image, fix entrypoint, or stop injecting hosts records). Then redeploy. Flushing any DNS cache is irrelevant until this is gone.

Task 12: Check NSS order (hosts vs dns) inside container

cr0x@server:~$ docker exec -it web-1 sh -lc 'cat /etc/nsswitch.conf | sed -n "1,25p"'
passwd:         files
group:          files
hosts:          files dns
networks:       files

What it means: files is checked before DNS, so an /etc/hosts entry wins. (musl-based images like Alpine ignore nsswitch.conf entirely, but they still consult /etc/hosts first.)

Decision: If you must use /etc/hosts for a one-off migration, treat it as a change-managed configuration, not a hack you forget about.

Task 13: Observe negative caching (NXDOMAIN) behavior

cr0x@server:~$ docker exec -it web-1 sh -lc 'dig +noall +authority does-not-exist.corp.internal'
corp.internal.        300     IN      SOA     ns1.corp.internal. hostmaster.corp.internal. 2026010301 3600 600 604800 300

What it means: The SOA negative TTL is 300 seconds. NXDOMAIN can be cached for 5 minutes by resolvers.

Decision: If you just created the record, waiting can be the correct move. Or query authoritative servers directly if you can.

Task 14: Check Docker daemon config for fixed DNS settings

cr0x@server:~$ sudo cat /etc/docker/daemon.json
{
  "dns": ["10.0.0.2", "10.0.0.3"],
  "log-driver": "json-file"
}

What it means: Docker’s upstream resolvers are pinned. If corporate DNS changed, Docker won’t follow the host automatically.

Decision: Update this file (via config management), then restart Docker in a maintenance window, and recreate containers if necessary.

Task 15: Confirm what the host’s /etc/resolv.conf actually is

cr0x@server:~$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Jan  3 09:12 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf

What it means: The host points to systemd’s stub resolv.conf. Docker reading this and copying it into containers can be disastrous if it results in nameserver 127.0.0.53 inside containers.

Decision: Prefer explicit Docker DNS settings (daemon.json). Recent Docker releases detect systemd-resolved and read the “real” /run/systemd/resolve/resolv.conf instead of the stub, but don’t rely on that behavior silently across a fleet.

Task 16: Test the path from container to resolver and measure latency

cr0x@server:~$ docker exec -it web-1 sh -lc 'time dig +tries=1 +timeout=2 api.internal A +noall +answer'
api.internal.         30      IN      A       10.40.12.34

real	0m0.018s
user	0m0.009s
sys	0m0.004s

What it means: Lookup is fast. If your app reports “DNS timeout,” it may be doing multiple lookups (search domains) or using a different resolver path than your test tool.

Decision: Check search domains and ndots; inspect application resolver behavior; capture traffic if needed.

Three corporate-world mini-stories (anonymized)

Incident: the wrong assumption (“we flushed DNS, so it must be fine”)

A mid-sized SaaS company moved an internal API behind a new VIP during a datacenter tidy-up. They lowered TTL days in advance, watched dashboards, and did the cutover on a Tuesday morning because nobody enjoys a Friday.
Half the fleet moved cleanly. A few application pods didn’t. Errors climbed, but only for a subset of customers.

The incident channel filled with the usual DNS folklore. Someone flushed systemd-resolved on a couple of nodes. Someone else restarted Docker on one host. A third person ran dig on their laptop, got the new IP, and declared victory.
Meanwhile, a set of long-lived worker containers kept talking to the old IP. They weren’t even doing DNS anymore; they had connection pools holding TLS sessions open for hours.

The clue came from inside the container: ss -tnp showed established connections to the old address. DNS was “correct” and still irrelevant.
The fix was not a cache flush. It was a controlled restart of the workers (rolling, with queue draining) plus tuning the client’s connection lifetime.

Postmortem action items were boring: document that “DNS changed” is not the same as “traffic moved,” add a canary that checks actual peer IPs, and put connection lifetime under configuration rather than code defaults.

Optimization that backfired: “Let’s speed up DNS by adding a cache”

Another team had a legitimate problem: their hosts were doing a lot of DNS lookups, and latency spikes occasionally showed up in request traces.
The obvious fix was to put a local caching resolver on each node. They deployed dnsmasq and pointed the host to it. Lookups got faster. Everyone celebrated.

Then came an incident that didn’t look related: a blue/green deployment switched an internal service to a new IP, but a segment of containers kept hitting the old backend for far longer than TTL.
The caching resolver had been configured with aggressive minimum TTLs “to reduce lookup load.” It wasn’t malicious; it was “efficient.”
Unfortunately, efficiency is how you create stale answers at scale.

The team chased Docker first, because containers were the visible change. But the real issue was the caching policy on dnsmasq combined with negative caching for a record that briefly disappeared during the cutover.
They fixed it by aligning cache behavior with their actual change process: stop overriding TTLs unless you’re willing to own the consequences, and add a runbook step to validate upstream resolver answers during migrations.

Lesson: caching is a performance feature that becomes a reliability feature only when it’s configured like you’ll be paged for it. Otherwise it’s a time bomb with a configurable fuse.

Boring but correct: the practice that saved the day

A finance-adjacent company ran Docker hosts across two networks: corporate LAN and a restricted PCI segment.
DNS was split-horizon: the same name could resolve differently depending on where you stood. This was intentional, audited, and annoying.

Their SRE team had a habit that looked like bureaucracy: every container image included a tiny diagnostic toolbox and every on-call runbook started with “print resolv.conf and query the exact resolver shown.”
People rolled their eyes—until a Thursday evening outage.

A deployment in the PCI segment failed to reach an internal service name. From laptops it worked. From some hosts it worked. From others it didn’t.
The runbook quickly separated the problem: containers were using Docker embedded DNS which was forwarding to a corporate resolver that did not have the PCI view.
Someone had “standardized” daemon.json DNS servers across environments earlier in the week.

Because they had the resolver path documented and the diagnostic steps consistent, they reversed the daemon.json change in minutes, restarted Docker on a small subset, validated, then rolled the fix safely.
No heroics. No midnight packet captures. Just disciplined, repeatable observation.

Joke #2: Adding a DNS cache to fix reliability is like adding a second fridge to fix hunger—you’ll just store more bad decisions for later.

Common mistakes: symptom → root cause → fix

1) Symptom: “Container resolves old IP, host resolves new IP”

Root cause: Container uses Docker embedded DNS forwarding to different upstream resolvers (daemon.json pinned, or Docker picked different servers than the host).

Fix: Compare docker info DNS with host resolver; query Docker’s upstream resolvers directly; update Docker daemon DNS configuration and restart Docker in a controlled manner.

2) Symptom: “Flushing host DNS cache changed nothing”

Root cause: Application caches DNS or keeps persistent connections; it never re-queries.

Fix: Restart or reload the app; configure connection max lifetime; reduce keepalive where safe; verify with ss that peer IP changes.

3) Symptom: “DNS works on host, container gets timeouts”

Root cause: Container has nameserver 127.0.0.53 copied from host, but that’s not reachable inside container namespace.

Fix: Configure Docker to use real upstream resolvers (not host stub), or ensure Docker uses the non-stub resolv.conf; recreate containers.

4) Symptom: “One specific name always resolves ‘wrong’ inside container”

Root cause: /etc/hosts entry in image or injected at runtime; NSS checks files before DNS.

Fix: Remove the hosts entry; fix build/entrypoint; validate hosts: files dns order and contents.

5) Symptom: “Intermittent NXDOMAIN after creating a record”

Root cause: Negative caching (SOA minimum/negative TTL) in upstream resolvers or local caches.

Fix: Inspect SOA negative TTL; wait it out or query authoritative; avoid delete-and-recreate patterns during migrations.

6) Symptom: “DNS is slow only in containers”

Root cause: Search domains and ndots causing multiple queries; or an upstream resolver reachable from host but not from container network due to firewall rules.

Fix: Inspect /etc/resolv.conf options; trim search domains; ensure UDP/TCP 53 path from container namespace; measure with timed dig.

7) Symptom: “Docker Compose service names resolve inconsistently”

Root cause: Multiple networks, aliases, or stale containers; embedded DNS answers differ depending on network attachment.

Fix: Inspect network attachments; ensure services share the intended network; remove orphan containers and recreate the project network.

8) Symptom: “After changing corporate DNS, some hosts never pick it up”

Root cause: Docker daemon has pinned DNS servers; containers keep using embedded DNS which keeps forwarding to old servers.

Fix: Update daemon.json via config management; restart Docker in a maintenance window; redeploy containers that were created under the old config.

Checklists / step-by-step plan (boring on purpose)

Step-by-step: isolate the liar in 15 minutes

  1. Pick one failing container and one healthy container (if you have a healthy one). Same image preferred.
  2. Record container resolver config: cat /etc/resolv.conf, cat /etc/nsswitch.conf, and cat /etc/hosts.
  3. Query the name with a tool that shows TTL (dig) from inside the container.
  4. Query the same name from the host and against the specific upstream resolvers Docker uses.
  5. If DNS answers differ: identify the first layer where they diverge (Docker upstream vs host upstream vs split DNS).
  6. If DNS answers match but app still uses old IP: inspect existing TCP connections, pool settings, and application DNS caching.
  7. Apply the smallest change that forces a new resolution path: flush the right cache or restart the right process.
  8. Validate with evidence: show new DNS answer and show new peer IP in ss or request logs.
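
A minimal evidence-capture sketch for the “before” snapshots (assuming the web-1 container and api.internal name used in the tasks above; dig and ss must exist in the image):

cr0x@server:~$ C=web-1; N=api.internal                             # adjust to your container and name
cr0x@server:~$ docker exec "$C" sh -lc 'cat /etc/resolv.conf /etc/hosts /etc/nsswitch.conf 2>/dev/null' > before-container.txt
cr0x@server:~$ docker exec "$C" sh -lc "dig +noall +answer $N A; ss -tnp" >> before-container.txt
cr0x@server:~$ dig +noall +answer "$N" A > before-host.txt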

Deployment-time checklist: don’t create DNS surprises

  • Do not ship /etc/hosts hacks in images unless you also ship a removal plan.
  • Keep Docker daemon DNS settings explicit per environment; don’t assume host DNS equals container DNS.
  • For services behind DNS-based load balancing, tune client connection lifetimes so traffic can actually move.
  • During migrations, avoid transient NXDOMAIN events; negative caching will make them linger.
  • Have one canonical diagnostic container (or toolbox) available on every host/cluster.
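
That last item doesn’t need to be fancy. A sketch of a toolbox image (the packages are standard Alpine packages; the network and record names echo the earlier examples):

cr0x@server:~$ cat > Dockerfile.toolbox <<'EOF'
FROM alpine:3
RUN apk add --no-cache bind-tools iproute2 curl
CMD ["tail", "-f", "/dev/null"]
EOF
cr0x@server:~$ docker build -t dns-toolbox -f Dockerfile.toolbox .
cr0x@server:~$ docker run --rm --network app_net dns-toolbox dig +noall +answer api.internal A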

Change plan: updating Docker DNS safely on a host fleet

  1. Identify which hosts run Docker embedded DNS heavily (user-defined networks, Compose stacks).
  2. Update /etc/docker/daemon.json with correct dns servers and (if needed) options.
  3. Schedule a restart of Docker daemon; understand impact (containers may restart depending on policy).
  4. Restart Docker on a canary host first; validate container /etc/resolv.conf and dig results.
  5. Roll through the fleet; recreate containers that were created with old DNS config if behavior persists.
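
For steps 2 and 4, a minimal sketch of what the canary validation looks like (the resolver IPs and search domain echo the earlier examples; substitute your own, and remember the restart can bounce containers depending on live-restore and restart policies):

cr0x@server:~$ sudo cat /etc/docker/daemon.json
{
  "dns": ["10.0.0.2", "10.0.0.3"],
  "dns-search": ["corp.internal"]
}
cr0x@server:~$ sudo systemctl restart docker
cr0x@server:~$ docker run --rm alpine:3 cat /etc/resolv.conf      # on the default bridge you should see the new upstreams copied in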

FAQ

1) Why does docker exec show nameserver 127.0.0.11?

That’s Docker’s embedded DNS for user-defined networks. It provides container/service name resolution and forwards other queries upstream.
It’s normal, and it’s also why “flush host DNS cache” often doesn’t do what you expect.

2) Can I flush Docker’s embedded DNS cache directly?

Not in a clean “one command” way like systemd-resolved. Operationally, you influence it by changing upstream resolvers, recreating containers/networks, or restarting Docker.
Before doing that, prove the embedded layer is where divergence starts by comparing answers across layers.

3) Why does restarting the container sometimes fix DNS?

It can change multiple things at once: the container gets a fresh /etc/resolv.conf, the application restarts and drops internal caches/pools, and Docker may rebuild parts of the networking state.
It’s a blunt instrument. Useful, but it hides root cause if you do it first.

4) My TTL is 30 seconds. Why did stale answers persist for minutes?

Because TTL applies to DNS caches, not to your application’s existing connections, and not necessarily to application-level caching.
Also, upstream resolvers can cap TTLs or enforce minimum TTLs, and negative caching has its own timers.

5) Why does the host resolve correctly but containers don’t after enabling systemd-resolved?

If containers end up with nameserver 127.0.0.53, that address points at the container’s own loopback, not at the host’s stub resolver.
Configure Docker to use real upstream resolvers or a reachable caching resolver IP, not the host loopback stub.

6) Should I run a DNS cache inside every container?

No, unless you enjoy debugging multiple layers of caching under incident pressure.
If you need caching, do it at the node level (owned and observable) or via a dedicated resolver tier, and keep container resolver paths simple.

7) What’s the fastest way to prove it’s not DNS at all?

Show that dig returns the new IP, but ss -tnp shows established connections to the old IP, or request logs show the old peer.
That’s connection pooling / keepalive / app cache territory.

8) Does Kubernetes change this story?

Yes, but the shape is similar: containers often query CoreDNS, which caches and forwards. You still have application caching, libc behavior, and upstream resolver behavior.
The main difference is that the “embedded DNS” is typically CoreDNS plus kube-dns conventions, not Docker’s 127.0.0.11.

9) Why does BusyBox nslookup disagree with dig?

Different tools have different resolver behavior, output, and sometimes different defaults (TCP vs UDP, search behavior, retries).
Prefer dig for clarity (TTL, authority/additional sections) and use timed runs to spot retries.

10) When is flushing caches the wrong move?

When you haven’t proven caching is the problem. Timeouts due to firewall, MTU, or unreachable resolver IPs won’t be fixed by flushing.
Also, flushing upstream caches can create thundering herds if lots of nodes re-query at once.

Conclusion: next steps you can actually take

Docker DNS problems are rarely mysterious. They’re layered. The lie usually lives in one of three places:
the application that never re-resolves, the resolver path that differs between host and container, or an upstream cache that’s doing exactly what it was configured to do.

Next time you see “stale DNS”:

  • Start by printing /etc/resolv.conf inside the container and stop guessing.
  • Do side-by-side dig from container and host, and then directly against Docker’s upstream resolvers.
  • If DNS is correct but traffic is wrong, inspect established connections and restart the right process, not the universe.
  • Standardize a small diagnostic toolkit and a runbook that forces evidence collection before flushing anything.

If you do those four things consistently, “DNS caching lies” becomes a manageable nuisance instead of a recurring incident theme.
