You’re on-call. A deploy is half done. Suddenly every box starts complaining: Temporary failure in name resolution.
Not “NXDOMAIN”, not “connection refused”. Just that vague, maddening shrug.
This error is DNS’s way of saying: “I tried, I waited, and I gave up.” The fix isn’t “restart networking” (stop doing that).
The fix is understanding which layer is failing—and checking them in the right order so you don’t waste an hour arguing with /etc/resolv.conf like it owes you money.
What the error actually means (and what it does not)
“Temporary failure in name resolution” is usually the userland resolver saying: “I could not get an answer right now.”
On Linux with glibc, this often corresponds to EAI_AGAIN from getaddrinfo()—a transient lookup failure.
That’s different from “the name does not exist.”
If you want to reason about it like an SRE, translate the message into one of these buckets:
- I can’t reach a resolver (network path, firewall, wrong IP, resolver down).
- I reached a resolver but it’s not answering (timeouts, UDP blocked, EDNS issues, rate-limits).
- I got an answer but the client rejected it (DNSSEC validation, truncation, TCP fallback blocked, broken cache).
- I asked the wrong question (search domains, ndots, split-horizon surprises).
The key: the error is about the process (resolution attempt), not the truth of the name.
That’s why random restarts “help” sometimes—because you reset a cache, a socket, or a race. It’s also why those same restarts come back to haunt you.
One quote worth keeping on the wall:
Hope is not a strategy.
— General Gordon R. Sullivan
The 5 root causes (in the order you should fix them)
Root cause #1: The host is not using the resolver you think it is
Modern Linux has multiple layers that can “own” DNS: NetworkManager, systemd-resolved, DHCP clients, VPN agents,
containers, and good old handwritten files. “Temporary failure” is common when those layers disagree, and your app is faithfully querying a dead end.
Typical failure modes:
- /etc/resolv.conf points to 127.0.0.53 but systemd-resolved is stopped or misconfigured.
- /etc/resolv.conf was overwritten by DHCP or a VPN, leaving a nameserver that is unreachable from this network.
- Inside containers, resolv.conf is inherited from the host but doesn’t make sense in that namespace.
Fix order: confirm what resolver is configured, then confirm that resolver is reachable, then confirm it answers for the right zones.
Don’t touch upstream DNS until you can prove the client is asking the right place.
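A minimal first pass, assuming systemd-resolved and the illustrative paths used later in this article (the full versions appear in the task list below):
cr0x@server:~$ ls -l /etc/resolv.conf        # symlink into systemd-resolved, or a hand-managed file?
cr0x@server:~$ cat /etc/resolv.conf          # which nameserver and search lines is libc actually reading?
cr0x@server:~$ resolvectl status | head -25  # which upstream servers does resolved think it has, per link?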
Root cause #2: The network path to the resolver is broken (routing, VPN, firewall)
DNS is a network service. Sometimes the network is the problem. Shocking, I know.
A resolver can be “up” and “healthy” and still unreachable from a specific subnet because someone changed routing,
tightened egress, or turned on a VPN that hijacked your default route.
DNS is also special because it uses UDP 53 by default and falls back to TCP 53. Plenty of security devices treat UDP and TCP differently.
If UDP is blocked, you’ll see timeouts. If TCP fallback is blocked, large responses (or DNSSEC) fail in a way that looks “temporary.”
Fix order: verify L3 reachability to the resolver IP(s), verify UDP 53, then verify TCP 53.
The fastest mistake is to debug “DNS records” while packets are dying in a firewall rule written by someone who hates joy.
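A quick sketch of that order, using the illustrative resolver IP 10.20.30.53 from this article; substitute your own:
cr0x@server:~$ ip route get 10.20.30.53                        # which interface and gateway would carry the query?
cr0x@server:~$ dig @10.20.30.53 example.com +time=2 +tries=1   # UDP 53 with a short timeout
cr0x@server:~$ dig @10.20.30.53 example.com +tcp +time=2       # the same query forced over TCP 53
If the UDP query answers but the TCP one hangs, you have found exactly the class of failure described above.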
Root cause #3: Your resolver is sick (overloaded, misconfigured, or selectively failing)
Recursive resolvers are infrastructure. Infrastructure gets tired.
Under load, resolvers drop packets, queue too much, or rate-limit. Some failures are “partial”: only certain domains,
only AAAA records, only DNSSEC-signed zones, only during cache misses.
Common triggers:
- Cache miss storms after a restart, or after TTLs expire on popular names.
- Broken forwarding to upstream resolvers (ISP/enterprise forwarders down or blocked).
- Mis-sized EDNS buffers causing truncation and TCP fallback issues.
- Rate limiting against NATed clients (a whole cluster looks like one source IP).
Fix order: test the resolver directly with dig, watch latency and SERVFAIL/timeout rates, then validate recursion and forwarding.
If the resolver is yours, look at CPU, memory, socket buffers, and query logs. If it’s not yours, use another resolver as a control.
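One rough way to sample latency and response codes from the client side, again with the illustrative resolver IP; this is a probe, not monitoring:
cr0x@server:~$ for i in $(seq 1 20); do dig @10.20.30.53 example.com +time=2 +tries=1 | grep -E 'status:|Query time'; done
A healthy resolver shows NOERROR with stable low-millisecond query times; scattered SERVFAILs or times jumping into the hundreds of milliseconds point at resolver health or forwarding.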
Root cause #4: Negative caching, search domains, and resolver behavior surprises
The resolver does more than “ask DNS.” It tries suffixes, it retries, it picks A vs AAAA in a particular order,
and it caches negative results in ways that feel personal.
A classic: search domains plus “ndots” behavior. A short hostname like api might be tried as:
api.corp.example, then api.dev.example, and only then the bare name api.
If those intermediate domains are slow or broken, your “simple” lookup now has multi-second latency and occasional “temporary failure.”
Fix order: reproduce with dig and fully-qualified names, inspect search domains, and test A/AAAA separately.
Avoid hacking TTLs or flushing caches until you prove it’s not the client’s algorithm biting you.
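A sketch of that reproduction, using the illustrative name api.corp.example; the trailing dot disables search expansion, and +search makes dig apply resolv.conf search/ndots the way a client would:
cr0x@server:~$ dig api.corp.example. A +time=2 +tries=1     # fully qualified: no search-list expansion
cr0x@server:~$ dig api.corp.example. AAAA +time=2 +tries=1  # IPv6 records tested separately
cr0x@server:~$ dig +search api                              # short name, expanded via search domains like libc does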
Root cause #5: Authoritative DNS problems (broken zone, lame delegation, DNSSEC, or upstream meltdown)
Sometimes it really is DNS itself. The authoritative side is wrong: delegation points at dead servers, glue records are missing,
DNSSEC signatures expired, or a zone transfer didn’t happen and the secondary is serving stale nonsense.
These issues often show up as SERVFAIL or timeouts from the recursive resolver. The client sees “temporary failure,” because recursion failed.
Fix order: run dig +trace, validate delegation and authoritative reachability, then address DNSSEC or zone content issues.
This is last in the order for a reason: it’s frequently blamed and less frequently guilty.
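When a trace stalls, a useful follow-up is to ask the listed authoritative server directly, sketched here with the illustrative names that appear in the trace example later in this article:
cr0x@server:~$ dig +trace app.corp.example | tail -20                       # where does delegation stop answering?
cr0x@server:~$ dig @ns1.corp.example app.corp.example A +norecurse +time=2  # the authoritative server itself; expect the aa flag
If the authoritative server answers correctly but your recursive resolver still returns SERVFAIL, look at DNSSEC validation or the path between the two.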
Fast diagnosis playbook (first/second/third)
Your goal is not to become a philosopher of DNS. Your goal is to find the bottleneck in minutes.
Here’s the playbook I use when prod is on fire and people are typing “is DNS down?” into every chat.
First: confirm whether it’s local config, local daemon, or the network
- Check which resolver you’re using (resolv.conf, systemd-resolved status).
- Try a lookup against the configured resolver IP directly with dig.
- If that fails, test basic reachability to the resolver (ping isn’t enough, but it’s a start).
Second: establish a control resolver and compare
- Query the same name via a known-good resolver (a public resolver or a different internal resolver).
- If control works, the problem is likely your resolver or your path to it.
- If control also fails, suspect authoritative DNS or wider network restrictions.
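A minimal comparison, assuming egress to a public resolver is allowed; internal names will (correctly) not resolve on public resolvers, so use an external name for that leg:
cr0x@server:~$ dig @10.20.30.53 app.corp.example +time=2 +tries=1   # configured internal resolver, internal name
cr0x@server:~$ dig @10.20.30.53 example.com +time=2 +tries=1        # same resolver, external name
cr0x@server:~$ dig @9.9.9.9 example.com +time=2 +tries=1            # public control resolver, external name only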
Third: decide whether you’re dealing with timeouts, SERVFAIL, or client behavior
- Timeouts suggest network path/firewall or overloaded resolver dropping packets.
- SERVFAIL suggests recursion failure, DNSSEC issues, or broken delegation upstream.
- Slow then fail suggests search domain expansion, ndots, or TCP fallback blockage.
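The same dig command separates the three cases; the annotations below show the output patterns to look for (illustrative, not captured from a single run):
cr0x@server:~$ dig @10.20.30.53 app.corp.example +time=2 +tries=1
# timeout:  ";; connection timed out; no servers could be reached"  -> path/firewall or an overloaded resolver
# SERVFAIL: "status: SERVFAIL" in the header                        -> recursion, DNSSEC, or delegation trouble
# NXDOMAIN: "status: NXDOMAIN" in the header                        -> the name genuinely does not exist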
Joke #1: DNS is like a phone book that occasionally decides it’s a poetry anthology. Technically still “text,” emotionally unhelpful.
Practical tasks: commands, outputs, decisions (12+)
Below are real tasks you can run on a Linux host. Each includes what the output means and the decision you make.
Run them in order until you hit a contradiction. Contradictions are where the truth lives.
Task 1: See what /etc/resolv.conf really is
cr0x@server:~$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Jan 2 10:11 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
Meaning: This host is using systemd-resolved’s stub listener (127.0.0.53). If systemd-resolved is down, DNS is down.
Decision: Check systemd-resolved health next. If resolv.conf is a regular file, inspect its nameserver lines directly.
Task 2: Inspect nameservers and search domains
cr0x@server:~$ cat /etc/resolv.conf
nameserver 127.0.0.53
options edns0 trust-ad
search corp.example dev.example
Meaning: Lookups may expand via search. Short names can cause multiple queries and delays.
Decision: When testing, always try the FQDN and avoid ambiguity. Keep search domains in mind for “slow then fail.”
Task 3: Check systemd-resolved status and which upstream DNS it uses
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.20.30.53
DNS Servers: 10.20.30.53 10.20.30.54
DNS Domain: corp.example
Meaning: systemd-resolved is active and has upstream resolvers configured (10.20.30.53/54).
Decision: Query those upstream resolvers directly with dig @10.20.30.53. If status is missing or empty, fix local config first.
Task 4: Verify the local stub listener is actually listening
cr0x@server:~$ ss -ulpn | grep ':53 '
UNCONN 0 0 127.0.0.53%lo:53 0.0.0.0:* users:(("systemd-resolve",pid=812,fd=14))
Meaning: The stub resolver is listening on UDP 53 on loopback.
Decision: If you don’t see it, restart/repair systemd-resolved (or stop using the stub and point resolv.conf at real resolvers).
Task 5: Query a name through the system resolver (baseline)
cr0x@server:~$ getent ahosts example.com
93.184.216.34 STREAM example.com
93.184.216.34 DGRAM example.com
93.184.216.34 RAW example.com
Meaning: The libc resolver path works (NSS + DNS). This is closer to what applications use than dig.
Decision: If getent fails but dig works, suspect NSS configuration, nscd, or application-specific resolver behavior.
Task 6: Query the configured upstream resolver directly (bypass local stub)
cr0x@server:~$ dig @10.20.30.53 example.com +time=2 +tries=1
; <<>> DiG 9.18.24-1ubuntu1.3-Ubuntu <<>> @10.20.30.53 example.com +time=2 +tries=1
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50612
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
example.com. 300 IN A 93.184.216.34
;; Query time: 14 msec
;; SERVER: 10.20.30.53#53(10.20.30.53) (UDP)
;; WHEN: Tue Jan 02 10:22:48 UTC 2026
;; MSG SIZE rcvd: 56
Meaning: The upstream resolver is reachable and responding quickly.
Decision: If this works but the system resolver fails, focus on local resolver daemon, NSS, or search/ndots behavior. If it times out, go to network path checks.
Task 7: Distinguish timeout vs SERVFAIL vs NXDOMAIN
cr0x@server:~$ dig @10.20.30.53 does-not-exist.example.com +time=2 +tries=1
; <<>> DiG 9.18.24-1ubuntu1.3-Ubuntu <<>> @10.20.30.53 does-not-exist.example.com +time=2 +tries=1
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 1243
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
Meaning: NXDOMAIN is a definitive “name doesn’t exist.” That’s not “temporary failure.”
Decision: If your app shows “temporary failure” while dig shows NXDOMAIN, suspect search domains and the app querying a different name than you think.
Task 8: Check raw reachability to the resolver IP and route selection
cr0x@server:~$ ip route get 10.20.30.53
10.20.30.53 via 10.20.10.1 dev eth0 src 10.20.10.44 uid 1000
cache
Meaning: The kernel knows how it would reach the resolver and which interface it uses.
Decision: If this route is wrong (e.g., goes through a VPN interface), you’ve found your “temporary failure.” Fix routing/VPN split tunneling.
Task 9: Prove UDP 53 works (not just ICMP)
cr0x@server:~$ nc -u -vz 10.20.30.53 53
Connection to 10.20.30.53 53 port [udp/domain] succeeded!
Meaning: You can send UDP to port 53. It doesn’t prove responses return, but it’s a quick signal.
Decision: If this fails, stop debating DNS records. You have a firewall, security group, or routing problem.
Task 10: Prove TCP 53 works (needed for truncation/DNSSEC/large answers)
cr0x@server:~$ nc -vz 10.20.30.53 53
Connection to 10.20.30.53 53 port [tcp/domain] succeeded!
Meaning: TCP fallback is possible. If UDP works but TCP doesn’t, some queries will “randomly” fail—especially with DNSSEC or lots of records.
Decision: If TCP is blocked, fix it. Workarounds like “disable DNSSEC” are a last resort and often the wrong one.
Task 11: Detect truncation and TCP fallback behavior
cr0x@server:~$ dig @10.20.30.53 dnssec-failed.org +dnssec +bufsize=4096
; <<>> DiG 9.18.24-1ubuntu1.3-Ubuntu <<>> @10.20.30.53 dnssec-failed.org +dnssec +bufsize=4096
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 32801
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
Meaning: SERVFAIL on a known DNSSEC-broken domain indicates DNSSEC validation is being enforced by the resolver.
Decision: If your internal domains are SERVFAIL due to DNSSEC, fix signing/DS records. Don’t “just turn off DNSSEC” unless you understand the blast radius.
Task 12: Trace delegation to see if authoritative servers are broken
cr0x@server:~$ dig +trace app.corp.example
; <<>> DiG 9.18.24-1ubuntu1.3-Ubuntu <<>> +trace app.corp.example
. 518400 IN NS a.root-servers.net.
...
example. 172800 IN NS ns1.example.
example. 172800 IN NS ns2.example.
corp.example. 86400 IN NS ns1.corp.example.
corp.example. 86400 IN NS ns2.corp.example.
app.corp.example. 60 IN A 10.50.0.25
Meaning: Trace shows the chain of delegation. If it stalls at some step, you’ve found where resolution breaks.
Decision: If trace fails at your authoritative tier, focus there: firewall to auth servers, broken NS records, stale glue, or downed masters.
Task 13: Confirm the resolver’s view of a specific interface (systemd-resolved)
cr0x@server:~$ resolvectl dns eth0
Link 2 (eth0): 10.20.30.53 10.20.30.54
Meaning: DNS servers are assigned per-link; VPN interfaces can override this.
Decision: If the wrong link has the active DNS servers, fix NetworkManager/VPN settings or set DNS routing domains properly.
Task 14: Watch real DNS traffic (prove packets leave and replies return)
cr0x@server:~$ sudo tcpdump -ni eth0 '(udp port 53 or tcp port 53)' -vv
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:25:11.122334 IP 10.20.10.44.41321 > 10.20.30.53.53: 47924+ A? example.com. (29)
10:25:11.136702 IP 10.20.30.53.53 > 10.20.10.44.41321: 47924 1/0/1 A 93.184.216.34 (45)
Meaning: You have bidirectional traffic. If you only see outgoing queries and no replies, the issue is upstream (resolver down) or path (ACL/security).
Decision: Use this to stop arguments. Packets don’t lie; people do.
Task 15: Check NSS order (it can make DNS look broken)
cr0x@server:~$ grep '^hosts:' /etc/nsswitch.conf
hosts: files mdns4_minimal [NOTFOUND=return] dns myhostname
Meaning: If mDNS is consulted first, some names can stall or be short-circuited unexpectedly.
Decision: For servers, keep it boring: typically files dns (plus whatever your environment genuinely needs).
Task 16: Spot “ndots” and resolver options that change behavior
cr0x@server:~$ grep '^options' /etc/resolv.conf
options edns0 trust-ad
Meaning: Options change packet size, trust behavior, and retry patterns.
Decision: If Kubernetes is involved, check its ndots (often 5). High ndots + many search domains can create lookup storms and “temporary failures.”
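If Kubernetes is involved, the pod’s resolv.conf is the thing to read; a quick sketch, with a hypothetical pod name:
cr0x@server:~$ kubectl exec -it my-app-pod -- cat /etc/resolv.conf   # my-app-pod is illustrative
Look for options ndots:5 and a long search list; together they can turn one lookup into five or six queries.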
Joke #2: The only thing more temporary than “temporary failure in name resolution” is the promise that “we’ll fix DNS later.”
Three mini-stories from corporate life (anonymized)
Mini-story 1: The incident caused by a wrong assumption
A mid-size SaaS company ran an internal recursive resolver pair in two data centers. Dev environments pointed at them too.
One Friday, a team rolled out a new VPN client profile “to simplify remote access.” It was tested. It worked. People could reach internal apps.
Then builds started failing, then production deploys started timing out fetching packages, then monitoring lit up with the classic message.
The wrong assumption: “DNS is internal, so sending DNS over the VPN is always correct.”
The profile pushed a DNS server reachable only from inside the VPN—fine for laptops—then the same profile was accidentally applied to a set of CI runners.
Those runners did not have the VPN interface up. They simply got a new /etc/resolv.conf pointing at an IP they could never reach.
Engineers spent the first 45 minutes staring at authoritative zone files, because someone saw “name resolution” and immediately blamed records.
The breakthrough came when one person ran ip route get to the resolver IP and noticed it tried to route through an interface that didn’t exist on that host.
The fix was boring: separate DHCP/VPN DNS configuration from server infrastructure, lock down who can change resolver settings on CI,
and add a canary job that runs getent hosts against a few critical names from the runner pool.
DNS wasn’t “down.” It was simply never reachable from the machines that mattered.
Mini-story 2: The optimization that backfired
A larger enterprise had performance issues: internal resolvers were handling a huge volume of lookups from Kubernetes clusters.
Someone did the sensible-sounding thing: tuned caching aggressively and increased the EDNS UDP buffer size “to reduce TCP fallbacks.”
They also enabled DNSSEC validation “because security wanted it,” without doing a careful inventory of internal zones.
For a week everything looked fine. Then intermittent resolution failures started. Not everywhere. Only for a set of apps that depended on a partner domain.
The error was “temporary failure.” The resolver metrics showed an uptick in SERVFAIL for that partner domain, but latency was otherwise good.
The backfire: the partner’s authoritative servers mishandled large EDNS responses and occasionally sent malformed replies.
Previously, the resolvers had fallen back to TCP more often and succeeded. Now, with bigger UDP and DNSSEC validation, the failure rate increased and turned into SERVFAIL.
Meanwhile, Kubernetes retries amplified the load and made the resolver look “flaky.”
The fix: reduce EDNS buffer size to a safer value, keep TCP 53 open everywhere, and implement per-zone forwarding exceptions for the problematic domain.
The lesson was not “don’t optimize.” The lesson was “treat DNS as a distributed system with many broken edges.”
If your optimization depends on everyone else being perfect, it’s not an optimization; it’s a new outage mode.
Mini-story 3: The boring but correct practice that saved the day
A financial services org ran internal resolvers in three sites. Nothing fancy. Two instances per site. Simple health checks.
The unglamorous part: they maintained a tiny “DNS sanity” dashboard that tested resolution from every major subnet against every resolver, every minute,
using both UDP and TCP. It was the kind of thing nobody celebrated, which is how you know it was good engineering.
One afternoon, a change in a firewall policy blocked TCP 53 from a subset of application subnets “because we only use UDP for DNS.”
Most names still resolved. Then a few critical ones started failing—specifically those with larger answers and DNSSEC-signed responses.
Apps reported “temporary failure.” Teams started to panic because the failures were sporadic and dependent on the name.
The DNS sanity dashboard lit up immediately: UDP success, TCP failure, localized to a set of subnets.
That shrank the problem space from “maybe the resolver is dying” to “a specific firewall rule broke TCP 53.”
They rolled back the policy quickly. No heroics. No packet captures during a war room.
The boring practice—testing both protocols and multiple vantage points—turned an afternoon incident into a brief inconvenience.
Common mistakes: symptom → root cause → fix
1) “Works on one server but not another”
Symptom: Same app, same subnet, different outcome. One host resolves; another returns “temporary failure.”
Root cause: Different DNS configuration sources (systemd-resolved vs static resolv.conf, VPN overwrite, per-link DNS).
Fix: Compare ls -l /etc/resolv.conf and resolvectl status across hosts; standardize ownership and prevent random agents from rewriting resolver config.
2) “dig works, app fails”
Symptom: dig returns answers, but application logs show temporary failures.
Root cause: App uses libc/NSS path; dig does not. NSS order, nscd, or search/ndots expansion can break or slow things.
Fix: Test with getent ahosts; check /etc/nsswitch.conf; reproduce with FQDN; reduce search domains; audit ndots in container environments.
3) “Intermittent failures, especially for some domains”
Symptom: Some names resolve, others time out; failures vary by domain.
Root cause: TCP fallback blocked, EDNS/fragmentation issues, or authoritative DNS instability for certain zones.
Fix: Verify TCP 53 end-to-end; run dig +trace; consider lowering EDNS buffer size on resolvers; confirm the firewall permits fragmented UDP, or simply rely on TCP fallback.
4) “Everything fails right after reboot / deploy”
Symptom: After restart, lots of timeouts; later it stabilizes.
Root cause: Cache miss storm against a small resolver pool; resolver CPU/socket exhaustion.
Fix: Add resolver capacity, enable caching at the right layer, stagger restarts, and monitor qps/latency. Don’t restart all resolvers at once unless you enjoy chaos.
5) “Only inside containers / only in Kubernetes pods”
Symptom: Node resolves; pod fails with temporary failure.
Root cause: Cluster DNS (CoreDNS) down, upstream forwarding broken, ndots + search domains generating query storms, or network policies blocking DNS.
Fix: Query CoreDNS service IP directly from a pod; check network policy for UDP/TCP 53; reduce ndots if appropriate; ensure node-local caching (if used) is healthy.
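A sketch of that direct query, assuming the pod image ships nslookup or getent and that the cluster DNS Service carries the usual kube-dns name (common even when CoreDNS serves it); the pod name and the 10.96.0.10 address are placeholders:
cr0x@server:~$ kubectl get svc -n kube-system kube-dns                   # find the cluster DNS service IP
cr0x@server:~$ kubectl exec -it my-app-pod -- nslookup kubernetes.default.svc.cluster.local 10.96.0.10
cr0x@server:~$ kubectl exec -it my-app-pod -- getent hosts example.com   # libc path inside the pod, if getent exists in the image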
6) “VPN users fine, office users broken (or vice versa)”
Symptom: Resolution depends on whether you’re on VPN.
Root cause: Split-horizon DNS or split tunneling misconfiguration; wrong resolver for the network context.
Fix: Use per-domain routing (DNS routing domains) rather than a single global resolver; ensure resolvers are reachable from each network segment.
7) “SERVFAIL for internal domains after enabling DNSSEC”
Symptom: Internal names return SERVFAIL; clients see temporary failures.
Root cause: Broken DNSSEC chain (wrong DS, expired signatures, mismatched keys), or internal zones not suitable for validation.
Fix: Fix signing properly or disable validation only for the specific internal zones via resolver policy. Blanket disable is a security and reliability footgun.
Checklists / step-by-step plan
Checklist A: Single-host “temporary failure”
- Identify the resolver in use: ls -l /etc/resolv.conf and cat /etc/resolv.conf.
- If using systemd-resolved: resolvectl status and ss -ulpn | grep ':53 '.
- Test libc path: getent ahosts example.com.
- Test direct resolver reachability: dig @<resolver-ip> example.com +time=2 +tries=1.
- Differentiate failure type: is it timeout, SERVFAIL, or NXDOMAIN?
- Validate network path: ip route get <resolver-ip>, then check UDP and TCP 53.
- Packet-level proof if needed: tcpdump to see queries and replies.
- Only then look at upstream or authoritative DNS with dig +trace.
Checklist B: Multi-host / incident mode
- Pick three vantage points: one failing host, one healthy host, one host in a different subnet.
- Compare resolver configuration across them.
- Pick two test names: one internal critical name, one external stable name.
- Query via configured resolver and via a control resolver.
- Look for patterns: only AAAA fails? only large answers fail? only certain domains?
- Check resolver health metrics/logs if you own it (qps, latency, SERVFAIL rate, memory, file descriptors).
- Confirm firewall changes for UDP/TCP 53 and MTU/fragmentation issues.
- Communicate clearly: “Timeouts to resolver IP from subnet X” beats “DNS seems flaky.”
Fix order summary (print this mentally)
- Client config ownership and correctness (what resolver am I using?)
- Path to resolver (routing + firewall, UDP and TCP)
- Resolver health and recursion/forwarding behavior
- Client-side search/ndots/NSS behavior causing retries and delays
- Authoritative DNS and delegation/DNSSEC correctness
Interesting facts & historical context (DNS has lore)
- DNS replaced HOSTS.TXT: early networks distributed a single hosts file to everyone, and it stopped scaling once the network grew.
- Paul Mockapetris designed DNS in 1983, introducing the distributed, hierarchical naming system still used today.
- UDP was chosen for speed, but TCP has always been part of the protocol for larger responses and zone transfers.
- Resolvers cache “no” answers too: negative caching exists to reduce load from repeated misses, which can make typos feel “sticky.”
- TTL is advisory but powerful: aggressive TTL changes can amplify traffic patterns and cause cache-miss storms during outages or deploys.
- EDNS (Extension mechanisms for DNS) arrived to extend DNS without replacing it, including larger UDP payloads—also a source of fragmentation headaches.
- DNSSEC adds authenticity but also size and complexity; validation failures commonly surface as SERVFAIL rather than a neat explanation.
- “Lame delegation” is a real term: it means a nameserver is listed for a zone but can’t authoritatively answer for it.
- Search domains were designed for convenience in enterprise environments, but at scale they can multiply query volume and latency dramatically.
FAQ
1) Is “Temporary failure in name resolution” always a DNS server outage?
No. It’s often local: wrong resolver configured, systemd-resolved down, or a network path/firewall issue.
Treat it as “resolution attempt failed,” not “DNS records are wrong.”
2) Why does restarting the service sometimes fix it?
Because you might be clearing a cache, reopening sockets, or changing timing. That’s not a root cause; it’s a coin flip with better marketing.
Use restarts only as a stopgap while you collect evidence (dig/getent/tcpdump).
3) What’s the difference between NXDOMAIN and this error?
NXDOMAIN means the name does not exist in DNS (authoritative negative answer). “Temporary failure” usually means timeouts or SERVFAIL during recursion.
They lead to different actions: NXDOMAIN is usually configuration/typo; temporary failure is connectivity/health/delegation.
4) Why does it work with dig but not with curl or my app?
dig queries DNS directly. Many apps use libc’s resolver via NSS, which can consult files, mDNS, LDAP, or other sources first,
and can apply search domains and retry policies. Test with getent ahosts to mimic application behavior.
5) How do I tell whether UDP or TCP is the problem?
Test both. Use nc -u -vz <resolver> 53 and nc -vz <resolver> 53, then confirm with tcpdump.
If TCP 53 is blocked, large DNS answers will fail unpredictably.
6) Can MTU or fragmentation cause “temporary failure”?
Yes. Large UDP DNS responses can fragment. If fragments are dropped (common across some tunnels and firewalls), you get timeouts.
That looks “temporary” because small responses still succeed. Allow TCP 53 and consider safer EDNS UDP sizes on resolvers.
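One way to probe this, sketched with the article’s illustrative resolver; DNSKEY answers are usually large enough to exercise EDNS, and 1232 bytes is the commonly recommended conservative EDNS size:
cr0x@server:~$ dig @10.20.30.53 example.com DNSKEY +dnssec +bufsize=4096 +time=3   # large UDP answer, may fragment
cr0x@server:~$ dig @10.20.30.53 example.com DNSKEY +dnssec +bufsize=1232 +time=3   # conservative EDNS size
cr0x@server:~$ dig @10.20.30.53 example.com DNSKEY +dnssec +tcp +time=3            # same query over TCP as a control
If only the first variant times out, fragments are being dropped somewhere on the path.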
7) What’s the quickest way to check if it’s my resolver or upstream authoritative DNS?
Query a control resolver. If your configured resolver fails but the control works, suspect your resolver or your path to it.
If both fail, run dig +trace to see where delegation breaks.
8) How does Kubernetes make this worse?
Kubernetes commonly uses multiple search domains and a high ndots value, which can multiply DNS queries per lookup.
During partial DNS issues, that multiplication becomes a load amplifier. Fix by ensuring CoreDNS health, allowing UDP/TCP 53,
and being intentional about ndots/search configuration.
9) Should we just hardcode IPs to avoid DNS failures?
Hardcoding turns a dynamic system into a brittle one. You dodge one failure mode and buy three new ones: stale endpoints, broken failover, and messy rotations.
If DNS is unreliable, fix the resolver layer and monitoring. Don’t carve the problem into stone.
10) What monitoring actually catches this early?
Multi-vantage DNS checks against each resolver, for both UDP and TCP, for a small set of critical names (internal and external).
Track latency percentiles and SERVFAIL/timeout rates. Alert on trend, not just total failure.
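A minimal probe in shell, assuming dig is installed and reusing the illustrative names and resolver IPs from this article; a real check would feed a metrics system instead of printing:
#!/usr/bin/env bash
# Query each resolver over UDP and TCP for a few critical names and print the status line.
NAMES="app.corp.example example.com"    # illustrative: one internal name, one external name
RESOLVERS="10.20.30.53 10.20.30.54"     # illustrative resolver IPs
for r in $RESOLVERS; do
  for n in $NAMES; do
    udp=$(dig @"$r" "$n" +time=2 +tries=1 +noall +comments | grep -m1 'status:')
    tcp=$(dig @"$r" "$n" +tcp +time=2 +tries=1 +noall +comments | grep -m1 'status:')
    echo "$r $n UDP=${udp:-timeout} TCP=${tcp:-timeout}"
  done
done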
Conclusion: practical next steps
“Temporary failure in name resolution” is rarely mysterious. It’s just layered.
The fix is to stop guessing and work the stack in the right order: client config, network path, resolver health, client behavior, authoritative DNS.
Next steps you can do today:
- Standardize DNS ownership on hosts (decide: systemd-resolved, NetworkManager, or static; not “all of the above”).
- Ensure both UDP and TCP 53 are permitted between clients and resolvers, and between resolvers and upstreams.
- Add a simple DNS sanity check from key subnets (UDP + TCP) and treat rising SERVFAIL/timeouts as an early warning.
- Audit search domains and ndots in container platforms; reduce query multiplication before it reduces your sleep.
- When the next incident hits, use the playbook: prove where packets stop, then fix that layer—no DNS séances required.