Here’s a familiar punchline: dig returns the right IP, your DNS dashboards look green, and yet your application behaves like it’s been cut off from civilization. Requests time out. TLS handshakes fail. Retries pile up until your load balancer starts sweating. Everyone stares at DNS because it’s the easiest villain to name.
Sometimes DNS is guilty. More often, DNS is just the witness—and the actual crime scene is a caching layer you forgot existed, sitting between your app and the authoritative answer, confidently serving yesterday’s truth.
The mental model: “DNS works” is not a thing
When someone says “DNS works,” what they usually mean is: from my laptop, right now, using my preferred tool, I can resolve a name. That’s a nice moment. It is not an incident closure.
In production, DNS is a pipeline. Your application calls a resolver API (often through libc), which may consult local caches, a stub resolver, a node-local cache, a cluster DNS service, an upstream recursive resolver, and finally authoritative servers. At every hop, someone may decide to cache. At every hop, timeouts and retries can multiply. At every hop, configuration can diverge across environments.
The result is the classic split-screen failure:
- Humans: “Look, dig works.”
- Apps: “I can’t connect, and I will now hold a stale IP in memory until the heat death of the universe.”
That mismatch happens because DNS isn’t one cache. It’s several caches, each with its own TTL rules, eviction policy, and failure semantics. You don’t debug DNS by arguing about A records. You debug it by finding the caching layer that’s lying.
The caching layers you’re actually dealing with
1) Application-level caches (the “my code is innocent” cache)
Many runtimes and libraries cache DNS results, sometimes aggressively, sometimes forever, and sometimes in ways that don’t match the record TTL you set so carefully.
- JVM: DNS caching is famously “helpful.” Depending on the networkaddress.cache.ttl security properties and whether a SecurityManager is installed (less common now), it may cache positive answers for a long time, or even forever, and it caches negative answers too. If you run Java and you don’t know your effective DNS TTL settings, you are gambling (see the sketch after this list).
- Go: The resolver path depends on build flags and environment. On Linux, Go may use the system resolver via cgo (and inherit libc and NSS behavior) or its pure Go resolver, which does not cache but has its own retry and timeout behavior. Containers add spice.
- Node.js / Python / Ruby: Behavior varies by library. The runtime might not cache much, but higher-level HTTP clients, connection pools, or service discovery SDKs often do.
- HTTP clients and pools: Even if DNS refreshes correctly, a pool may keep connecting to an old IP until the socket breaks. DNS “fixed” doesn’t mean traffic moved.
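If you want to stop gambling, make the runtime knobs explicit and verify which resolver path you are actually on. A minimal sketch, assuming a Java service packaged as app.jar and a Go binary called ./app (both names are placeholders; the TTL values are illustrative, not a recommendation):
# JVM: pin positive/negative DNS TTLs instead of trusting defaults
# (the networkaddress.cache.ttl / networkaddress.cache.negative.ttl security properties are the equivalent knobs).
cr0x@server:~$ java -Dsun.net.inetaddr.ttl=30 -Dsun.net.inetaddr.negative.ttl=5 -jar app.jar
# Go: log which resolver path (pure Go vs cgo) the binary picks, or force one for a test.
cr0x@server:~$ GODEBUG=netdns=2 ./app
cr0x@server:~$ GODEBUG=netdns=go+1 ./app
Whichever mechanism you choose, the point is that the TTLs and resolver path are written down somewhere you can audit, not inherited from defaults nobody remembers.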
App-level caching problems are brutal because they are invisible to your normal DNS tools. A JVM holding an IP in memory will keep failing even while dig succeeds. That’s not DNS being flaky. That’s your process being stubborn.
2) libc and NSS (the “it depends on /etc/nsswitch.conf” cache)
On Linux, name resolution is routed through NSS (Name Service Switch). That means your app might consult files (hosts file), then dns, then mDNS, LDAP, or whatever someone configured three years ago during a “quick fix.” The order matters. The failure behavior matters. Timeouts matter.
glibc itself doesn’t maintain a big persistent DNS cache the way some people assume. But it can still bite you through:
- /etc/hosts overrides (stale, forgotten, or baked into images).
- NSS modules that cache (e.g., SSSD, nscd) or introduce delays.
- resolv.conf options like timeout, attempts, and rotate that can change failure modes dramatically.
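For reference, this is what those options look like in a resolv.conf you actually control (values illustrative; on many hosts the file is managed by systemd-resolved, NetworkManager, or the kubelet, so edit the manager, not the file):
cr0x@server:~$ cat /etc/resolv.conf
nameserver 10.0.0.2
nameserver 10.0.0.3
options timeout:1 attempts:2 rotate
Here timeout:1 caps each query at one second, attempts:2 allows a single retry, and rotate spreads queries across the listed servers. Multiply timeout by attempts by the number of nameservers and you get the worst-case stall a broken resolver can inflict on every lookup.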
3) Local caching daemons: nscd, dnsmasq, systemd-resolved
If you have systemd-resolved, it likely runs a stub resolver on 127.0.0.53 and caches answers. If you have dnsmasq on laptops or nodes, it caches. If you still have nscd (it’s not dead everywhere), it can cache host lookups.
These caches are usually well-intentioned: reduce latency, reduce upstream query rate, survive transient resolver outages. But they can become the place where wrong answers linger.
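If you suspect systemd-resolved is one of those places, it will tell you how busy its cache is. A minimal check; the exact fields and layout vary by version, and the numbers here are illustrative:
cr0x@server:~$ resolvectl statistics
Cache
  Current Cache Size: 142
  Cache Hits: 83922
  Cache Misses: 4127
A large hit count means this layer is answering most of your queries, which also makes it the layer most capable of serving yesterday’s answer; resolvectl flush-caches (Task 4 below) takes it out of the equation for a controlled test.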
4) Node-local DNS caches (especially in Kubernetes)
Kubernetes clusters often run CoreDNS as a service. Many teams add NodeLocal DNSCache to reduce load and improve tail latency. That’s fine—until it isn’t. Now you have:
- Pods → node-local cache (iptables to a local IP)
- Node-local cache → CoreDNS service
- CoreDNS → upstream recursive resolvers
More layers, more opportunities for stale entries, inconsistent config, and confusing telemetry. If one node’s cache is wedged, one slice of pods will fail while others look normal. That’s the kind of incident that generates long Slack threads and short tempers.
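When that happens, the fastest sanity check is to map failing pods to their node and look at the node-local cache pod on that node. A sketch, assuming the standard NodeLocal DNSCache addon with its usual k8s-app=node-local-dns label (output trimmed):
cr0x@server:~$ kubectl -n kube-system get pods -l k8s-app=node-local-dns -o wide
NAME                   READY   STATUS    RESTARTS   AGE   IP           NODE
node-local-dns-7x2kq   1/1     Running   4          90d   10.20.1.11   worker-03
node-local-dns-9fj4d   1/1     Running   0          90d   10.20.1.12   worker-04
A cache pod that is restarting, not ready, or missing on exactly the nodes with failing workloads is a strong hint that the node-local layer, not CoreDNS or the app, is the one lying.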
5) Recursive resolvers (the “we pay money for this” cache)
Your upstream recursive resolvers (corporate DNS, cloud provider resolvers, or managed services) cache aggressively. They also enforce TTL caps in some environments: minimum TTLs, maximum TTLs, prefetching behaviors, and stale serving during upstream failures.
Some resolvers will return expired-but-cached answers (“serve stale”) to preserve availability. That’s great when auth servers are down. It’s terrible when you’re trying to evacuate traffic from a dead IP.
6) Authoritative DNS (the place you keep editing while nothing changes)
Authoritative DNS is where your TTL settings live. It’s also where you expect control. But your authority only matters if every caching layer respects it. Plenty won’t. Plenty will “respect it” in ways that are technically allowed but operationally inconvenient.
7) The non-DNS caches that look like DNS failures
Not every “DNS issue” is DNS. Some problems are caused by:
- Connection pooling: clients keep using existing sockets to old backends.
- Happy Eyeballs / IPv6 preference: your app tries AAAA first, fails slowly, then falls back (see the quick check after this list).
- TLS session reuse and SNI mismatches: you hit the wrong IP and get a cert mismatch; it looks like “DNS wrong,” but it’s routing plus TLS.
- Load balancer health propagation delays: DNS points to an LB that’s still sending to broken targets.
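The Happy Eyeballs case in particular is cheap to rule out: compare the AAAA and A paths for the same name (timings illustrative):
cr0x@server:~$ dig +tries=1 +time=2 api.internal.example AAAA | grep -E 'status:|Query time'
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4821
;; Query time: 1204 msec
cr0x@server:~$ dig +tries=1 +time=2 api.internal.example A | grep -E 'status:|Query time'
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4822
;; Query time: 3 msec
A slow or failing AAAA lookup next to a fast A lookup points at the IPv6 leg (or the resolver’s handling of it), not at “DNS is broken.”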
Joke #1: DNS is like a rumor mill—by the time it reaches everyone, it’s technically accurate and operationally useless.
Facts and historical context that explain today’s mess
- DNS was designed in the early 1980s to replace HOSTS.TXT distribution; caching was a feature from the start, not an afterthought.
- The concept of TTL was meant to control cache lifetime, but many recursive resolvers apply policy: TTL floors/ceilings that override your intent.
- Negative caching became standardized later: NXDOMAIN and “no data” responses can be cached, meaning “it didn’t exist” can stick around after you create it.
- Early operational guidance assumed relatively stable records. Modern microservices plus autoscaling produce more DNS churn than the system was socially prepared for.
- Some resolvers implement “serve stale” behaviors to improve availability during upstream failures, at the cost of faster cutovers.
- systemd-resolved (mid-2010s) normalized stub resolvers on localhost in many distributions, changing debugging patterns and surprising people who expect /etc/resolv.conf to list real upstreams.
- Kubernetes made DNS a critical control plane dependency for service discovery; CoreDNS became one of the most important pods in the cluster, whether you like it or not.
- CDNs and global traffic managers made DNS “part of routing”. That’s powerful, but it means DNS answers are sometimes location-dependent and policy-driven.
- Split-horizon DNS is common in enterprises (internal and external answers differ), which means “works on my laptop” may reflect a different view than your workload gets.
Three corporate mini-stories (anonymized, plausible, technically accurate)
Mini-story #1: The incident caused by a wrong assumption
The team ran a multi-region API behind a DNS name that returned different A records per region. During a migration, they reduced TTLs ahead of time and scheduled a cutover. The official plan was: update records, wait a minute, verify new traffic distribution.
Cutover day came. dig from a bastion host showed the new targets immediately. Monitoring, however, showed a stubborn stream of traffic still hitting the old region. Latency spiked. Some requests timed out. The on-call thought the change hadn’t propagated and started “fixing DNS” repeatedly, which accomplished nothing except more anxiety.
The real culprit wasn’t authoritative DNS or recursive resolvers. It was an application layer: a Java service that performed DNS lookup once at startup for an upstream dependency and cached the result in-process. It wasn’t malicious. It was “an optimization,” written years ago when upstreams never moved. The service had been stable for so long that nobody remembered it existed.
They restarted the deployment to force fresh resolution. Traffic moved instantly. The postmortem was awkward because the DNS team had done everything right, and the app team had done something understandable. The lasting fix was to remove the in-process cache, honor TTL, and add a deploy-time check to prove that the runtime refreshes DNS without restart.
Mini-story #2: The optimization that backfired
A platform group introduced NodeLocal DNSCache to reduce CoreDNS load and fix intermittent DNS latency under bursty traffic. The rollout looked good. CoreDNS CPU dropped. P99 resolution latency got better. The team declared victory.
Two months later, a subset of nodes started showing mysterious, localized failures: pods on certain nodes couldn’t resolve newly created service names for several minutes. Engineers saw NXDOMAIN in application logs. Meanwhile, pods scheduled on other nodes resolved fine. That asymmetry made people suspect Kubernetes itself, or “networking,” or the moon phase.
The node-local cache was negative-caching NXDOMAIN for a service name that only appeared partway through a deploy. A race in the deploy sequence let clients query the name before the Service existed, producing NXDOMAIN. Once that answer was cached, those nodes kept believing the service did not exist. The authoritative truth had changed, but the negative cache had not caught up yet.
They fixed it by tightening deploy ordering (Service before workloads), reducing negative caching TTL in the DNS cache configuration, and adding an alert on NXDOMAIN rates per node. NodeLocal DNSCache stayed—it was still a good idea—but the team learned that improving P99 latency can also improve the speed at which you cache the wrong thing.
Mini-story #3: The boring but correct practice that saved the day
A payments platform ran with a conservative DNS practice: every critical dependency had a runbook entry listing the entire resolution path. App → libc → node stub → cluster DNS → upstream resolver → authoritative. For each hop, they documented what logs existed, what metrics to watch, and how to bypass it.
It was boring work. Nobody got promoted for “documented the resolver chain.” But it meant that during an outage where an upstream recursive resolver started timing out intermittently, they didn’t waste hours arguing with each other’s laptops.
They immediately reproduced the failure from inside an affected pod, then from the node, then directly against the upstream resolver IP. That isolated the problem: the upstream resolver was the bottleneck, not CoreDNS, not the app. They temporarily reconfigured nodes to use a secondary resolver pool and restored service.
Later, they used their existing dashboards to show query latency and timeout rates per resolver. They escalated to the owner of the recursive resolver with proof rather than feelings. The incident was short. The postmortem was calm. The boring practice paid rent.
Fast diagnosis playbook: what to check first/second/third
This is the workflow that saves hours. The goal is to identify which caching layer is lying, and whether the failure is resolution, routing, or connection reuse.
First: reproduce from the same network namespace as the failing app
- If it’s Kubernetes: exec into the pod (or a debug pod on the same node).
- If it’s a VM: run from the host, not your laptop.
- If it’s a container: run from inside the container.
If you can’t reproduce from the same place, you’re debugging vibes.
Second: confirm what resolver the workload is actually using
- Check /etc/resolv.conf inside the workload.
- Check whether it points to 127.0.0.53 (systemd-resolved stub), a node-local IP, or the cluster DNS service.
- Check search domains and ndots. These can turn one lookup into five.
Third: isolate whether it’s “wrong answer” or “slow answer”
- Wrong answer: the IP returned is stale, points to a dead target, or differs by client.
- Slow answer: timeouts, long retries, occasional SERVFAIL. Apps often treat this as a dependency outage.
Fourth: bypass layers on purpose
- Query the configured resolver.
- Query the upstream recursive resolver directly.
- Query authoritative (or at least a known-good recursive) directly.
Every bypass is a test that eliminates a caching layer.
Fifth: check app/runtime DNS caching settings
- JVM TTL and negative TTL.
- Proxy sidecars (e.g., Envoy): DNS refresh rate and circuit-breaker behavior.
- Service discovery clients (Consul, etcd-based, custom SDKs) that cache endpoints.
Sixth: verify connections actually moved
- If DNS updated but traffic didn’t, it’s usually connection pooling or LB behavior.
- Check existing sockets, keep-alives, and retry logic.
Paraphrased idea (attributed): Werner Vogels has emphasized that everything fails, so systems should be designed to expect and handle failures.
Practical tasks: commands, what the output means, and what decision you make
These are the hands-on moves I want engineers to actually run. Each task includes a command, representative output, what it means, and the decision you make.
Task 1: See what resolv.conf looks like inside the failing environment
cr0x@server:~$ cat /etc/resolv.conf
nameserver 127.0.0.53
options edns0 trust-ad
search svc.cluster.local cluster.local
Meaning: You’re using a local stub resolver (127.0.0.53), and search domains will be appended. Resolution behavior depends on systemd-resolved, not “the nameserver you thought.”
Decision: Don’t waste time querying random upstreams yet. First interrogate systemd-resolved and confirm where it forwards.
Task 2: Check NSS order and whether non-DNS sources can override
cr0x@server:~$ grep -E '^\s*hosts:' /etc/nsswitch.conf
hosts: files mdns4_minimal [NOTFOUND=return] dns myhostname
Meaning: /etc/hosts is consulted first. mDNS can short-circuit. DNS is not the first stop.
Decision: If the symptom is “works on one box but not another,” compare /etc/hosts and NSS configs. Your “DNS issue” might be a local override.
Task 3: Confirm systemd-resolved status, upstream servers, and cache stats
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.0.0.2
DNS Servers: 10.0.0.2 10.0.0.3
Meaning: The stub forwards to 10.0.0.2 and 10.0.0.3. If those are slow or poisoned, everything above them suffers.
Decision: Test those upstream resolver IPs directly next. If direct queries are slow, escalate to the resolver owners.
Task 4: Flush systemd-resolved cache (for a controlled test)
cr0x@server:~$ sudo resolvectl flush-caches
Meaning: You removed local cached entries. If behavior changes immediately, you’ve found a lying layer.
Decision: If flush fixes it, focus on why stale entries were served (TTL policy, serve-stale, negative caching, or upstream inconsistency).
Task 5: Query through the configured resolver and time it
cr0x@server:~$ dig +tries=1 +time=2 api.internal.example A
;; ANSWER SECTION:
api.internal.example. 30 IN A 10.40.12.19
;; Query time: 3 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
Meaning: Local stub answered fast with TTL 30. That’s good, but it’s still only one layer.
Decision: If the app still fails, suspect app-level caching or connection reuse. Also compare with querying upstream directly.
Task 6: Query the upstream recursive resolver directly (bypass stub)
cr0x@server:~$ dig @10.0.0.2 api.internal.example A +tries=1 +time=2
;; ANSWER SECTION:
api.internal.example. 30 IN A 10.40.12.19
;; Query time: 210 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
Meaning: The upstream is slower than the stub. That can be normal (cache hit locally) or a warning sign (upstream overloaded).
Decision: If upstream queries are consistently slow or timing out, fix upstream capacity, network path, or resolver health. Don’t tune app retries first; that just hides pain.
Task 7: Check for negative caching by querying a name you just created
cr0x@server:~$ dig newservice.svc.cluster.local A +tries=1 +time=2
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 1122
;; Query time: 4 msec
Meaning: NXDOMAIN can be cached. If the service now exists but some clients keep seeing NXDOMAIN, you’re looking at negative caching behavior.
Decision: Validate creation order in deployments and tune negative cache TTL where you control it (CoreDNS, node-local caches). If you can’t tune, adjust rollouts to avoid querying before records exist.
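How long a cached NXDOMAIN can legally linger is governed by the SOA record returned in the authority section of the negative answer: per RFC 2308, caches use the smaller of the SOA record’s TTL and its minimum (last) field. The values below are illustrative, not what your cluster will show:
cr0x@server:~$ dig +noall +authority newservice.svc.cluster.local A
cluster.local.  30  IN  SOA  ns.dns.cluster.local. hostmaster.cluster.local. 1712345678 7200 1800 86400 30
Here both values are 30 seconds, so a freshly created name can stay “nonexistent” for up to half a minute on any resolver that saw the NXDOMAIN first.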
Task 8: Observe search domain expansion and ndots behavior (common Kubernetes trap)
cr0x@server:~$ cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5 timeout:1 attempts:2
Meaning: With ndots:5, a name like api.internal.example (two dots) may be treated as “relative” and tried with search domains first, generating multiple queries and delays.
Decision: For external names, use FQDN with a trailing dot (api.internal.example.) in configs where supported, or adjust ndots carefully (understanding cluster defaults).
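A related subtlety: dig ignores the search list by default, which is yet another reason dig and your application can disagree. A quick way to watch the expansion happen, reusing the name from above (the flags are standard dig options):
# Make dig behave like libc: walk the search list and print each attempt.
cr0x@server:~$ dig +search +showsearch api.internal.example A
# Expect several NXDOMAIN attempts (one per search domain) before the final answer.
# An absolute name with a trailing dot skips search expansion entirely: one query.
cr0x@server:~$ dig api.internal.example. A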
Task 9: Use getent to see what the application likely sees via libc/NSS
cr0x@server:~$ getent hosts api.internal.example
10.40.12.19 api.internal.example
Meaning: getent uses NSS, which matches many application resolution paths better than dig.
Decision: If dig works but getent fails or returns different results, focus on NSS config, hosts file, and local caching daemons—not authoritative DNS.
Task 10: Check whether /etc/hosts is overriding your name
cr0x@server:~$ grep -n "api.internal.example" /etc/hosts
12:10.20.1.99 api.internal.example
Meaning: Someone hardcoded it. This will override DNS if NSS checks files first (common).
Decision: Remove the override, rebuild images without it, and add CI checks to prevent hostfile pinning for production service names.
Task 11: In Kubernetes, query CoreDNS directly and compare answers
cr0x@server:~$ kubectl -n kube-system get svc kube-dns -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP 2y k8s-app=kube-dns
Meaning: CoreDNS service IP is 10.96.0.10.
Decision: Query it directly from a pod to bypass node-local caches or stubs when isolating blame.
Task 12: Inspect CoreDNS config for caching and stub domains
cr0x@server:~$ kubectl -n kube-system get configmap coredns -o yaml | sed -n '1,120p'
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . 10.0.0.2 10.0.0.3
        cache 30
        loop
        reload
        loadbalance
    }
Meaning: CoreDNS forwards to upstream resolvers and caches for 30 seconds. That’s not “bad,” but it explains propagation timing and stale behavior during cutovers.
Decision: If you need faster cutover for certain records, consider lower TTLs end-to-end and verify resolver policies. Don’t blindly set cache to 0; you’ll DoS yourself with queries.
Task 13: Look for CoreDNS errors and timeouts
cr0x@server:~$ kubectl -n kube-system logs -l k8s-app=kube-dns --tail=20
[ERROR] plugin/errors: 2 api.internal.example. A: read udp 10.244.1.10:45712->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 api.internal.example. AAAA: read udp 10.244.1.10:46891->10.0.0.2:53: i/o timeout
Meaning: CoreDNS is timing out to upstream resolver 10.0.0.2. Apps will see intermittent resolution failures or delays.
Decision: Fix upstream resolver health, capacity, or routing. As a mitigation, adjust the upstream list and reduce timeout/attempts to fail fast, but don’t pretend it’s “just Kubernetes.”
Task 14: Confirm what IP your app actually connects to (DNS vs connection pooling)
cr0x@server:~$ sudo ss -tnp | grep -E ':(443|8443)\s' | head
ESTAB 0 0 10.244.3.21:48712 10.40.12.10:443 users:(("java",pid=2314,fd=214))
Meaning: The process is connected to 10.40.12.10, which may be an old IP even if DNS now returns 10.40.12.19.
Decision: If DNS is correct but sockets point to old targets, your issue is connection reuse. Reduce keep-alive lifetime, implement periodic re-resolution, or rotate pods/clients during cutovers.
Task 15: Measure resolution latency and failure rate quickly
cr0x@server:~$ for i in {1..10}; do dig +tries=1 +time=1 api.internal.example A | awk '/Query time/ {print $4}'; done
3
4
1002
2
1001
3
4
1001
2
3
Meaning: Some queries take ~1 second (a timeout and retry somewhere in the resolver chain) before an answer comes back, which is poison for tail latency.
Decision: Treat this like a dependency outage: fix the slow resolver hop, reduce attempts, and avoid multiplying retries at multiple layers.
Joke #2: DNS caching is the only place where being “eventually consistent” means “eventually you get paged.”
Common mistakes: symptoms → root cause → fix
1) Symptom: dig works, app still hits old IP
Root cause: Connection pooling or in-process DNS caching (JVM, SDK, custom cache). DNS changed, but the app didn’t re-resolve or didn’t reconnect.
Fix: Ensure the runtime honors TTL and periodically re-resolves; cap keep-alive lifetimes; during cutovers, recycle clients safely. Validate with ss and runtime settings.
2) Symptom: Some pods fail, others fine (same deployment)
Root cause: Node-local DNS cache stuck, node-specific network path to resolver, or per-node stub resolver issues.
Fix: Compare /etc/resolv.conf and cache layer per node; flush caches; restart node-local DNS pods; fix upstream reachability from the node.
3) Symptom: Newly created service name returns NXDOMAIN for minutes
Root cause: Negative caching in node-local caches or recursive resolvers; deploy ordering queries before record exists.
Fix: Create DNS objects before clients query them; reduce negative caching TTL where possible; add rollout readiness gates that verify resolution.
4) Symptom: Slow first request, then fast thereafter
Root cause: Resolver timeout/attempts too high, search domain expansion, or AAAA attempts timing out before A succeeds.
Fix: Tune resolver timeouts and attempts; fix upstream IPv6 reachability or adjust address family strategy; reduce unnecessary search expansions by using absolute FQDNs.
5) Symptom: Works on laptop, fails in production
Root cause: Split-horizon DNS, different resolver chain, VPN/corporate resolvers, or internal-only zones not visible to production networks.
Fix: Test from the same network namespace and resolvers as the workload; document which view (internal/external) each environment uses; avoid relying on laptop DNS as proof.
6) Symptom: Random SERVFAIL spikes
Root cause: Upstream recursive resolver issues, DNSSEC validation failures in some paths, packet loss (UDP fragmentation), or overloaded CoreDNS.
Fix: Check CoreDNS logs and upstream query latency; test TCP fallback; reduce EDNS payload size if needed; scale resolver capacity and add diversity.
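Two quick probes for the fragmentation and TCP-fallback theories, reusing the upstream resolver IP from earlier (illustrative):
# Force TCP: if UDP queries fail or truncate but TCP works, suspect fragmentation or a UDP-hostile path.
cr0x@server:~$ dig +tcp @10.0.0.2 api.internal.example A | grep -E 'status:|Query time'
# Advertise a smaller EDNS buffer: if answers that break at the default size succeed at 1232 bytes,
# the problem is packet size on the path, not the zone data.
cr0x@server:~$ dig +bufsize=1232 @10.0.0.2 api.internal.example A | grep -E 'status:|Query time'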
7) Symptom: DNS query rate explodes after a “small” config change
Root cause: Disabling caching, setting TTL too low globally, or creating many unique names (e.g., per-request hostnames).
Fix: Re-enable caching with sane TTLs; avoid per-request hostnames; use service discovery patterns that don’t turn DNS into a hot loop.
8) Symptom: Canary succeeds; bulk traffic fails
Root cause: Different runtime stack: canary uses a different client, sidecar config, or JVM flags; bulk uses connection pooling or stale resolution.
Fix: Ensure canary and bulk match resolver chain and runtime settings; compare process flags and sidecar configs; run getent and socket inspection in both.
Checklists / step-by-step plan
Checklist A: During an incident (15–30 minutes)
- Reproduce from the workload (pod/container/host). Capture timestamp and node name.
- Record resolver chain evidence: /etc/resolv.conf, resolvectl status (if relevant), and /etc/nsswitch.conf.
- Run both dig and getent for the failing name. If they differ, prioritize NSS/local cache debugging.
- Measure query time distribution (loop dig 10–20 times). Look for intermittent 1s/2s spikes.
- Bypass layers: query the configured resolver, then the upstream resolver IP directly.
- Check for negative caching: confirm whether NXDOMAIN is involved; identify if record/service was created recently.
- Confirm actual connections: use ss to see where processes are connected. DNS might be correct while sockets are stuck.
- Mitigate safely: restart the lying cache layer (stub/node-local/CoreDNS) only if you understand blast radius; otherwise reroute to healthy resolvers.
Checklist B: After the incident (make it not happen again)
- Document the resolver chain for each environment. Include who owns each hop.
- Set explicit runtime DNS caching policy for JVM and sidecars. Don’t rely on defaults.
- Standardize tooling: getent for “what the app sees,” dig for “what DNS says.” Teach the difference.
- Align TTLs with operational reality: if you need 30-second cutovers, ensure caches don’t enforce 5-minute minimum TTLs.
- Add observability: query latency, SERVFAIL/NXDOMAIN rates, per-node cache health in Kubernetes.
- Practice cutovers: do a planned record change in a non-prod environment and verify clients actually move without restarts.
Checklist C: Before you add a new caching layer (be honest)
- Define why you need it: tail latency, resolver load, resilience. Pick one primary objective.
- Decide who owns it and how it’s monitored. “It’s just DNS” is not an ownership model.
- Decide negative caching policy explicitly.
- Plan bypass and emergency controls: how to point clients at a different resolver quickly.
- Test failure modes: upstream timeout, SERVFAIL, NXDOMAIN, packet loss, and auth changes.
FAQ
1) Why does dig work but the app fails to resolve?
dig queries DNS directly and bypasses parts of the system resolver path. Your app likely uses NSS/libc (or runtime resolver behavior) plus caches. Use getent hosts name to match app behavior more closely.
2) If I set TTL to 30 seconds, why do clients still use old answers for minutes?
Because some caching layers apply TTL floors/ceilings, serve-stale policies, or the app caches independently. Also, even with correct DNS refresh, existing TCP connections won’t magically migrate to new IPs.
3) Is flushing DNS cache a real fix?
It’s a diagnostic and sometimes an emergency mitigation. If flushing fixes it, you’ve found a cache layer serving stale/incorrect data. The real fix is aligning TTL policy, resolver health, and runtime caching behavior.
4) Why do only some Kubernetes pods fail DNS?
Often because resolution is node-dependent (node-local DNS cache, node networking, iptables redirection, or node-specific resolver reachability). Confirm the node and test from another pod scheduled on the same node.
5) What’s negative caching and why should I care?
Negative caching means caching “does not exist” responses (NXDOMAIN/no data). If your deploy queries a name before it exists, caches can remember the NXDOMAIN and keep failing after the name is created.
6) Does glibc cache DNS results?
glibc primarily follows NSS and resolver configuration; persistent caching is usually done by external daemons (systemd-resolved, nscd, dnsmasq) or higher layers (app/runtime). Don’t assume “glibc cache” without evidence.
7) How do I tell whether this is DNS or connection pooling?
If DNS answers are correct but the process still connects to an old IP, it’s pooling/reuse. Check active connections with ss -tnp and compare to current DNS answers.
8) Should we disable DNS caching to avoid staleness?
No, not as a general policy. Disabling caching shifts load upstream, increases latency, and can create cascading failures during resolver hiccups. Instead, tune TTLs, negative caching, and ensure clients behave correctly.
9) Why do we see delays resolving short names inside clusters?
Search domains and ndots can cause multiple queries per lookup. A name may be tried as name.namespace.svc.cluster.local variants before the intended FQDN, and timeouts compound.
10) What’s the single most useful metric for DNS reliability?
Resolver query latency and timeout/SERVFAIL rates at each hop (node-local, CoreDNS, upstream recursive). “Query rate” alone is misleading; you want to know when resolution becomes slow or flaky.
Conclusion: practical next steps
If your DNS “works” but apps still fail, stop treating DNS as a single component. Treat it as a chain of caches and policies. Your job is to find the hop that’s lying or slow, then make it boring again.
- Standardize your debugging baseline: always collect /etc/resolv.conf, /etc/nsswitch.conf, and a getent result from the workload environment.
- Instrument the resolver chain: CoreDNS, node-local cache, and upstream resolver latency and failure rates. If you can’t see it, you can’t operate it.
- Make runtime DNS caching explicit: JVM, proxies, and SDKs. Defaults are not a strategy.
- Plan cutovers that account for connections: DNS changes don’t drain existing sockets. Budget time or force reconnection safely.
- Write the runbook you wish you had: resolver path, bypass commands, and ownership. It’s dull. It’s also how you avoid the 3 a.m. improv show.