Slow DNS doesn’t feel like a “DNS problem.” It feels like SSH lag, package installs hanging, CI timing out, Kubernetes probes flapping, and developers insisting “the network is fine” because ping works.
On modern Linux, that misery often routes through systemd-resolved. Not because it’s bad software, but because it’s sitting at the intersection of bad assumptions: broken search domains, VPN split DNS, NAT64, DNSSEC expectations, captive portals, and ancient resolver behavior baked into libc. The fix is rarely “disable it.” The fix is to observe the whole resolver chain, then change the one thing that actually causes the seconds.
Fast diagnosis playbook
If DNS lookups feel slow, don’t start by editing config files. Start by proving where the time is spent. You’re looking for: retries, search domain expansion, IPv6/IPv4 fallback delays, DNSSEC validation stalls, or a local stub that’s not actually local.
1) Confirm it’s DNS, not TCP
Pick a hostname you know is slow in your app (not a toy domain). Measure name resolution separately from connection.
cr0x@server:~$ time getent ahosts api.internal.example
192.0.2.41 STREAM api.internal.example
192.0.2.41 DGRAM
192.0.2.41 RAW
real 0m2.183s
user 0m0.010s
sys 0m0.006s
What it means: getent uses the same NSS path your apps do. If this takes seconds, you’ve found the right subsystem.
Decision: If getent is slow but a direct dig @server is fast, your bottleneck is local resolver logic (search domains, NSS ordering, stub behavior), not upstream DNS.
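If you want to see cache effects too, repeat the NSS lookup a few times and print each duration. A minimal sketch, assuming GNU date is available (the hostname is a placeholder; substitute one of yours):
cr0x@server:~$ for i in 1 2 3 4 5; do s=$(date +%s%N); getent ahosts api.internal.example >/dev/null; echo "lookup $i: $(( ($(date +%s%N) - s) / 1000000 )) ms"; done
A slow first lookup followed by fast repeats points at expensive cache misses (timeouts, search expansion); uniformly slow lookups point at something on every query path.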
2) Identify the active resolver path
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.0.0.53
DNS Servers: 10.0.0.53 10.0.0.54
DNS Domain: corp.example
Link 2 (ens192)
Current Scopes: DNS
Protocols: +DefaultRoute -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 10.0.0.53
DNS Servers: 10.0.0.53 10.0.0.54
DNS Domain: corp.example
What it means: This tells you whether systemd-resolved is in play and which DNS servers and domains it believes.
Decision: If resolv.conf mode is stub, apps query the 127.0.0.53 stub. If it’s static, apps bypass the stub and talk to resolved’s upstream servers directly, losing per-query split-DNS routing and caching. If it’s foreign, something else is owning /etc/resolv.conf and you must debug that owner.
3) Look for retries and timeouts
cr0x@server:~$ resolvectl query api.internal.example
api.internal.example: 192.0.2.41 -- link: ens192
-- Information acquired via protocol DNS in 2.127s.
-- Data is authenticated: no
What it means: The “acquired in” time is the user-visible pain.
Decision: If it’s >200ms consistently on a LAN, you’re probably seeing timeouts/retries or a broken search domain. Move to packet capture or resolved logs.
4) Differentiate “slow upstream DNS” from “slow resolver behavior”
cr0x@server:~$ dig +tries=1 +time=1 @10.0.0.53 api.internal.example A
; <<>> DiG 9.18.24-1ubuntu1.2-Ubuntu <<>> +tries=1 +time=1 @10.0.0.53 api.internal.example A
;; global options: +cmd
;; ANSWER SECTION:
api.internal.example. 60 IN A 192.0.2.41
;; Query time: 7 msec
;; SERVER: 10.0.0.53#53(10.0.0.53) (UDP)
;; WHEN: Wed Dec 31 12:12:12 UTC 2025
;; MSG SIZE rcvd: 68
What it means: Upstream DNS is fast. Your local resolution path isn’t.
Decision: Focus on systemd-resolved configuration, NSS, and the stub. Don’t waste a week yelling at the DNS team.
The resolver chain: where the seconds hide
Linux name resolution is not “a DNS query.” It’s a chain of decisions, and each decision can add delay.
The path most people actually run
- An application calls getaddrinfo() (often with both A and AAAA). Some apps do this in the hot path. Some do it on every request because caching is “hard.”
- glibc consults NSS via /etc/nsswitch.conf and walks the configured sources (files, resolve, dns, mdns, wins…). Ordering matters. Failure handling matters more.
- If systemd-resolved is enabled and NSS includes resolve, glibc talks to resolved through the nss-resolve module. Otherwise, glibc uses /etc/resolv.conf and sends DNS packets directly.
- If /etc/resolv.conf points at 127.0.0.53, your “nameserver” is a local stub listening on localhost. That stub forwards upstream, applies split DNS routing, caches, may validate DNSSEC, and tries to be helpful.
- Upstream servers might be local caching resolvers, corporate DNS, a VPN-provided resolver, or something your laptop got from DHCP in a hotel that should be illegal.
The symptom “DNS is slow” can come from any layer. And yes, dig can be fast while your app is slow, because dig bypasses parts of this chain. If you diagnose with only dig, you’re debugging the wrong system.
One quote worth keeping in your head while you chase this: “Hope is not a strategy.” — General Gordon R. Sullivan. That’s not DNS-specific, but it’s painfully applicable.
Joke #1: DNS is the only place where “it’s always the network” and “it’s never the network” are both somehow true.
Facts and history that explain today’s weirdness
- The original resolver behavior predates your cloud. The default “search domain” logic comes from an era where printer resolving to printer.office was normal, and the cost of extra queries was ignored.
- NSS is a plug-in system, not just DNS. Name lookups can involve local files, LDAP, mDNS, WINS, and systemd’s resolver. Misordering can add seconds per call.
- systemd-resolved popularized a local stub. That 127.0.0.53 address isn’t a “real DNS server”; it’s a convenience layer to support split DNS, caching, and policy.
- Negative caching is a performance feature. Without it, typos and dead domains turn into repeated timeouts. With it, misconfigurations can look “sticky” until caches expire.
- AAAA lookups became standard, and that changed latency. Dual-stack clients often query AAAA and A; broken IPv6 paths can add delays even when IPv4 works fine.
- EDNS(0) improved DNS, then MTUs ruined it. Larger UDP responses help modern records, but when path MTU discovery is broken, you get retries and fallbacks that look like “random slowness.”
- DNSSEC is computationally cheap until it isn’t. Validation overhead is usually small, but time sync issues and broken upstream chains can cause extra queries and delays.
- VPNs made split DNS normal. Corporate environments now routinely route certain suffixes to internal resolvers and everything else to public resolvers. That’s a policy engine, not “just DNS.”
What “slow DNS” looks like (and why dig can lie)
Here are the classic patterns:
- First lookup slow, then fast. Caching at some layer is working, but the first miss is expensive (timeouts or search list expansion).
- Only some hostnames slow. Usually split DNS routing (wrong link), an internal zone leak, or a DNS server that can resolve public names but not private (or vice versa).
- Only some programs slow. Apps using libc/NSS are slow; tools like dig are fast. That points at NSS ordering, resolved stub behavior, or IPv6/IPv4 fallback.
- Everything slow after VPN connect. The VPN pushed DNS/search domains, and now every single-label name generates a small storm of queries.
- Intermittent multi-second spikes. Retransmits due to UDP loss, MTU issues, or a DNS server that’s occasionally unavailable.
- NXDOMAIN is slow. Negative answers are taking the long route (search domains) or timing out due to DNS server mismatch.
Joke #2: If you want to experience time travel, run apt update on a laptop with a bad search domain list.
Practical tasks: commands, outputs, decisions
These are not “random commands.” Each one answers a specific question, and each question should change what you do next.
Task 1: Prove the slow path uses libc/NSS
cr0x@server:~$ time getent hosts github.com
140.82.121.3 github.com
real 0m1.612s
user 0m0.003s
sys 0m0.004s
Meaning: getent hosts uses NSS. If this is slow, your apps will be slow.
Decision: Continue with NSS and resolved checks. If getent is fast but your app is slow, your app is doing something else (proxy, TLS, OCSP, or retry loops).
Task 2: Compare direct DNS query speed
cr0x@server:~$ time dig +tries=1 +time=1 github.com A | sed -n '1,20p'
; <<>> DiG 9.18.24-1ubuntu1.2-Ubuntu <<>> +tries=1 +time=1 github.com A
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48219
;; ANSWER SECTION:
github.com. 60 IN A 140.82.121.3
Meaning: If dig is quick while getent is slow, upstream DNS is fine.
Decision: Focus on resolver chain (search domains, stub, DNSSEC, IPv6).
Task 3: Identify who owns /etc/resolv.conf
cr0x@server:~$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Dec 31 10:02 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
Meaning: If it’s symlinked to systemd’s stub resolv.conf, localhost stub is in use.
Decision: Don’t hand-edit /etc/resolv.conf. You’ll lose. Configure the manager (resolved, NetworkManager, netplan, or your DHCP client).
Task 4: Inspect stub content and search domains
cr0x@server:~$ cat /etc/resolv.conf
# This file is managed by man:systemd-resolved(8). Do not edit.
nameserver 127.0.0.53
options edns0 trust-ad
search corp.example dev.corp.example
Meaning: Long or wrong search lists multiply queries: every unqualified name gets tried against each suffix, for both A and AAAA. The options line tunes libc resolver behavior, and ndots-style rules decide when the search list is applied.
Decision: If the search list is long, trim it at the source (DHCP/VPN config) or set per-link domains properly.
Task 5: Check NSS ordering (silent performance killer)
cr0x@server:~$ grep -E '^\s*hosts:' /etc/nsswitch.conf
hosts: files mdns4_minimal [NOTFOUND=return] resolve [!UNAVAIL=return] dns
Meaning: This order tries local files, then mDNS, then systemd-resolved, then DNS. mDNS can add delay for non-.local names if misconfigured. The bracket rules matter.
Decision: On servers, prefer deterministic behavior. If you don’t use mDNS, remove it. If you use resolved, keep resolve before dns to avoid split-DNS bypass.
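If you decide to trim, a deterministic server-oriented hosts: line looks like this in /etc/nsswitch.conf (a sketch; keep the mdns entries only if you actually resolve .local names):
hosts: files resolve [!UNAVAIL=return] dns
The [!UNAVAIL=return] guard means glibc only falls through to plain dns when resolved itself is unavailable, so you keep split-DNS routing without double-querying.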
Task 6: Observe resolved’s view of the world
cr0x@server:~$ resolvectl dns
Global: 10.0.0.53 10.0.0.54
Link 2 (ens192): 10.0.0.53 10.0.0.54
Meaning: Confirms which DNS servers resolved will use. Per-link DNS matters with VPNs and multiple interfaces.
Decision: If the wrong link has the “DefaultRoute” DNS, fix per-link settings (NetworkManager, netplan, or resolved config).
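To test a per-link theory before persisting it, resolvectl can change link settings at runtime. A sketch; the link names and addresses are placeholders, and the changes last only until the link is reconfigured:
cr0x@server:~$ sudo resolvectl dns ens192 10.0.0.53 10.0.0.54
cr0x@server:~$ sudo resolvectl default-route tun0 false
If the symptom disappears, make the same change permanent in NetworkManager or netplan rather than leaving a runtime-only fix in place.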
Task 7: Measure with resolvectl (bypasses NSS but hits resolved)
cr0x@server:~$ resolvectl query -t A api.internal.example
api.internal.example: 192.0.2.41 -- link: ens192
-- Information acquired via protocol DNS in 2.044s.
Meaning: Slow here means resolved’s forwarding path is slow (server choice, retries, DNSSEC, packet issues), not NSS search expansion alone.
Decision: Check logs and packet capture next.
Task 8: Turn on resolved debug logs (temporarily)
cr0x@server:~$ sudo resolvectl log-level debug
cr0x@server:~$ sudo journalctl -u systemd-resolved -n 50 --no-pager
Dec 31 12:20:11 server systemd-resolved[612]: Sending query for api.internal.example IN A on interface 2/ens192.
Dec 31 12:20:12 server systemd-resolved[612]: Timeout reached on transaction 45123.
Dec 31 12:20:13 server systemd-resolved[612]: Retrying transaction 45123.
Meaning: You can literally see timeouts and retries.
Decision: If you see timeouts, suspect UDP loss, firewall, wrong server, or MTU/EDNS issues. If you see repeated search domain attempts, suspect the search list and single-label names.
Task 9: Packet capture: prove retransmits or fragmentation problems
cr0x@server:~$ sudo tcpdump -ni any '(udp port 53 or tcp port 53)' -vv -c 20
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
12:22:01.123456 ens192 Out IP 10.0.0.10.51234 > 10.0.0.53.53: 45123+ A? api.internal.example. (40)
12:22:02.124987 ens192 Out IP 10.0.0.10.51234 > 10.0.0.53.53: 45123+ A? api.internal.example. (40)
12:22:03.127001 ens192 In IP 10.0.0.53.53 > 10.0.0.10.51234: 45123* 1/0/0 A 192.0.2.41 (56)
Meaning: Same query sent twice before an answer arrives: that’s retransmit behavior, not “CPU is slow.”
Decision: If retransmits are common, look for packet loss, overloaded DNS, firewall rate limits, or MTU issues. If responses arrive only over TCP, investigate EDNS and fragmentation.
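To spot truncated UDP answers specifically (the trigger for TCP fallback), filter on the TC bit in the DNS header. A sketch, assuming plain DNS on UDP port 53:
cr0x@server:~$ sudo tcpdump -ni any 'udp src port 53 and udp[10] & 0x02 != 0' -c 10
udp[10] is the DNS flags byte inside the UDP payload and 0x02 is the TC (truncated) flag. Frequent matches mean answers don’t fit the negotiated UDP size, and every affected query pays for a TCP retry.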
Task 10: Check if resolved is falling back to TCP
cr0x@server:~$ resolvectl statistics
Transactions: 2145
Cache: size 1024, hits 1432, misses 713
DNSSEC Verdicts: secure 0, insecure 0, bogus 0, indeterminate 0
Meaning: Stats can hint whether caching is helping. It won’t directly say TCP fallback, but combined with tcpdump it tells a story.
Decision: If cache misses are high for repeated names, your TTLs may be tiny, or your apps are generating unique names (service discovery gone wild).
Task 11: Identify IPv6-induced delays (AAAA problems)
cr0x@server:~$ getent ahosts example.com | head
2606:2800:220:1:248:1893:25c8:1946 STREAM example.com
2606:2800:220:1:248:1893:25c8:1946 DGRAM
2606:2800:220:1:248:1893:25c8:1946 RAW
Meaning: If AAAA exists, your app may try IPv6 first. If IPv6 routing is broken, you’ll see connection delays that look like DNS delays.
Decision: If name resolution is fast but connections stall, fix IPv6 routing/RA/firewall rather than touching DNS. If resolution itself is slow, continue debugging resolved.
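curl can report resolution and connection time separately in one call, which settles this question quickly. A sketch; the URL is a placeholder:
cr0x@server:~$ curl -o /dev/null -sS -w 'dns=%{time_namelookup}s connect=%{time_connect}s\n' https://api.internal.example/
If dns is small and connect is large, chase routing and firewalls, not the resolver. Add -4 or -6 to pin the address family and compare the two.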
Task 12: Trace a single slow lookup with strace
cr0x@server:~$ strace -f -tt -o /tmp/ga.trace getent hosts api.internal.example
192.0.2.41 api.internal.example
cr0x@server:~$ tail -n 12 /tmp/ga.trace
12:24:41.100112 connect(3, {sa_family=AF_UNIX, sun_path="/run/systemd/resolve/io.systemd.Resolve"}, 110) = 0
12:24:41.100321 sendto(3, "{...}", 224, 0, NULL, 0) = 224
12:24:43.142901 recvfrom(3, "{...}", 4096, 0, NULL, NULL) = 312
12:24:43.143210 close(3) = 0
Meaning: You can see the 2-second gap between send and receive. That’s resolver latency, not application CPU.
Decision: If delays occur before contacting resolved (e.g., iterating files, mdns), fix NSS ordering. If delays occur waiting on resolved, fix resolved/upstream.
Task 13: Check systemd-resolved configuration (what you actually set)
cr0x@server:~$ systemd-analyze cat-config systemd/resolved.conf
# /etc/systemd/resolved.conf
[Resolve]
DNS=10.0.0.53 10.0.0.54
Domains=corp.example
DNSSEC=no
DNSOverTLS=no
Cache=yes
Meaning: This is the effective config, including drop-ins. It beats guessing what file is used.
Decision: If DNSSEC is “allow-downgrade” or “yes” and you’re in a messy enterprise environment, consider setting it explicitly (but do it with intent, not panic).
Task 14: Ensure /etc/resolv.conf isn’t fighting your network manager
cr0x@server:~$ resolvectl status | grep -E 'resolv\.conf mode'
resolv.conf mode: stub
Meaning: “stub” means the intended integration path is active.
Decision: If you see resolv.conf mode: foreign, find who overwrote it (often a VPN client or legacy resolvconf package) and stop the tug-of-war.
Fix systemd-resolved the right way (tuning patterns)
“Disable systemd-resolved” is the DNS equivalent of removing the smoke alarm because it’s loud. Sometimes you do need to replace it, but most of the time you need to configure it so it stops doing expensive things on your behalf.
Pattern A: Fix search domains (the quiet query multiplier)
If your /etc/resolv.conf has multiple search domains, a lookup for db becomes:
- db.corp.example
- db.dev.corp.example
- db (depending on options)
Multiply by A and AAAA, multiply by retries, multiply by every microservice. You get the idea.
What to do: On servers, prefer FQDNs in configs and keep the search list short. If DHCP pushes junk, fix DHCP. If a VPN pushes junk, fix the VPN profile. If neither is possible, enforce sane domains in resolved per-link configuration via your network manager.
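If systemd-networkd manages the link, the per-link domain list lives in the .network file. A minimal sketch, assuming ens192 and corp.example (substitute your own):
# /etc/systemd/network/10-ens192.network
[Match]
Name=ens192

[Network]
DNS=10.0.0.53
Domains=corp.example
Reload with networkctl reload (or restart systemd-networkd) and verify the result with resolvectl status.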
Pattern B: Use per-link DNS and routing domains for split DNS
The right approach for corporate + internet is usually split DNS: internal suffixes to internal resolvers, everything else to a general resolver. systemd-resolved is built for that. The common failure mode is that “DefaultRoute” DNS ends up on the wrong interface (VPN vs Wi‑Fi), or the VPN only wants corp.example but accidentally becomes the resolver for everything.
What to do: Make internal domains routing-only where appropriate. In systemd-resolved semantics, prefixing a domain with ~ makes it a routing domain. That means “send queries for this suffix to these DNS servers,” not “append this suffix to unqualified names.” You reduce pointless search expansions while still routing correctly.
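At runtime it looks like this; tun0 and the address are placeholders for your VPN link:
cr0x@server:~$ sudo resolvectl dns tun0 10.8.0.2
cr0x@server:~$ sudo resolvectl domain tun0 '~corp.example'
Queries under corp.example now go to the tunnel resolver, everything else keeps using the default-route link, and nothing gets appended to unqualified names. Persist the same intent in your VPN client or network manager profile.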
Pattern C: Decide on DNSSEC explicitly (don’t let defaults surprise you)
DNSSEC is great when the chain is intact and time is synced. In enterprises, you’ll meet middleboxes that rewrite DNS, captive portals that hijack, and internal zones that were never signed. That doesn’t mean “turn DNSSEC off everywhere.” It means decide where you want validation and where you don’t.
What to do: On servers inside a controlled network, set DNSSEC policy explicitly in /etc/systemd/resolved.conf, and verify your upstream supports it. On laptops roaming the world, “allow-downgrade” may avoid support tickets but it’s a security trade. Make the trade consciously.
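A drop-in keeps the decision explicit and versionable. A sketch; choose the value deliberately (resolved accepts yes, no, and allow-downgrade):
# /etc/systemd/resolved.conf.d/10-dnssec.conf
[Resolve]
DNSSEC=allow-downgrade
cr0x@server:~$ sudo systemctl restart systemd-resolved
cr0x@server:~$ resolvectl status | grep -i dnssec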
Pattern D: Don’t fight /etc/resolv.conf ownership
Modern distros expect something to manage /etc/resolv.conf. If you manually edit it, it will be overwritten. If you install a second manager, you get heisenbugs.
What to do: Pick one authority: NetworkManager, systemd-networkd/netplan, or a deliberate static setup. Ensure your VPN client integrates with that authority rather than scribbling over /etc/resolv.conf directly.
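If NetworkManager is your authority, confirm it hands DNS to resolved instead of writing /etc/resolv.conf itself. The relevant stanza (a sketch; many distros already default to this when resolved is active):
# /etc/NetworkManager/NetworkManager.conf
[main]
dns=systemd-resolved
The point isn’t the specific value; it’s that exactly one component owns the file and everything else goes through it.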
Pattern E: Fix timeouts by fixing the network, not by hiding them
There are knobs for retry and timeout behavior, but if you need them on a LAN, something else is wrong: packet loss, firewall rules, an overloaded resolver, MTU blackholes. You can “optimize” around it and still lose the war at 2 a.m.
If your logs show timeouts, capture packets, identify which server is slow, and remove it from rotation or fix it. If one of your configured DNS servers is dead, your “fast DNS” is actually “fast half the time.”
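Timing each configured server independently exposes the flaky one fast. A sketch; the addresses are placeholders for your server list:
cr0x@server:~$ for s in 10.0.0.53 10.0.0.54; do echo -n "$s: "; dig +tries=1 +time=1 @"$s" example.com A | grep -o 'Query time:.*' || echo NO RESPONSE; done
A server that prints NO RESPONSE, or answers an order of magnitude slower than its peers, is your removal or repair candidate.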
Three corporate mini-stories from the trenches
Incident 1: The wrong assumption (dig is fast, so DNS is fine)
A mid-sized company had a wave of “intermittent outages” on internal APIs. From the application side it looked like TCP connect stalls. Engineers ran dig against the internal name and got answers in a few milliseconds. Case closed, they thought. The API team started investigating thread pools and load balancers instead.
Then someone ran getent ahosts and saw 1–3 second delays, mostly for names that didn’t exist. That was the first clue: the service wasn’t slow, the resolver was slow when asked for typos, stale service names, or feature-flag endpoints that were disabled.
The root cause was a long search domain list pushed by a VPN profile. Every single-label hostname triggered multiple queries across split DNS boundaries. Worse, some suffixes were routed to resolvers that couldn’t answer them and simply timed out instead of responding NXDOMAIN. Each timeout was a full second or two. Multiply that by retries and by A+AAAA, and you get the “intermittent outage.”
The fix wasn’t to disable systemd-resolved. The fix was to convert the VPN’s “Domains” into routing domains for the internal suffixes, shorten the search list, and stop single-label lookups in production configs. The outage evaporated. The postmortem included a new rule: for Linux services, measure with getent, not just dig.
Incident 2: The optimization that backfired (forcing a “faster” DNS server)
At another shop, someone decided the corporate resolvers were “slow,” based on a handful of laptop tests from coffee shops. The proposed optimization: hardcode public resolvers in /etc/systemd/resolved.conf on all developer workstations. They shipped it as a standard image tweak.
Performance improved for public domains. Then internal tooling started failing. Some internal zones were only resolvable on corporate DNS. Developers started adding hacks: hosts file entries, ad-hoc VPN DNS scripts, and a habit of copying IPs into configs “just to get work done.” Predictably, those IPs changed, and the hacks turned into outages.
The most fun part: the public resolvers were blocked on a subset of corporate networks, so the “optimization” caused timeouts and long fallbacks exactly where you least want them—during incident response on a restricted segment.
The fix was a rollback and a better split DNS design: corporate suffixes routed to corporate resolvers, everything else using the local network resolvers with caching. Developers got speed without breaking internal names. Security got policy. Ops got fewer tickets. Everybody stopped pretending that “one DNS server fits all” is a strategy.
Incident 3: The boring correct practice that saved the day (making resolver behavior observable)
A large enterprise didn’t have perfect DNS. They had multiple resolvers, VPNs, and a mix of distros. What they did have was a boring, disciplined practice: every fleet image shipped with a lightweight “resolver health” script and a standard debugging bundle that captured resolvectl status, /etc/nsswitch.conf, /etc/resolv.conf, and a short packet trace when triggered.
One afternoon, a new firewall policy started rate-limiting UDP fragments. Nobody announced it as “DNS-related.” The first symptom was random login slowness, then package install hiccups, then microservices timing out.
Because the resolver bundle existed, the on-call didn’t guess. They compared packet traces from healthy and unhealthy hosts. The unhealthy hosts showed truncated responses, TCP fallback, and repeated retries. The correlation to a specific network segment was immediate. They handed the firewall team evidence, not vibes.
The fix was surgical: allow fragmented DNS responses or enforce smaller EDNS payload sizes at the right layer. The incident ended without the usual week-long blame carousel. Boring practice won. Again.
Common mistakes: symptom → root cause → fix
1) “dig is fast but curl is slow”
Symptom: dig returns quickly; applications hang before connecting.
Root cause: Apps use libc/NSS; dig bypasses NSS ordering and may query a different server. Also, apps often do A+AAAA plus search expansion; your dig test likely did a single query.
Fix: Test with getent ahosts and resolvectl query. Audit /etc/nsswitch.conf and search domains; ensure resolved routing is correct.
2) “Everything got slow after VPN”
Symptom: Name lookups are fast before VPN, slow after.
Root cause: VPN pushes long search domains and/or steals default DNS route, sending all queries to a resolver reachable only over a congested tunnel.
Fix: Configure split DNS with routing domains for internal suffixes; keep default DNS on the local link unless policy requires otherwise.
3) “NXDOMAIN takes 5 seconds”
Symptom: Typos or missing records are painfully slow.
Root cause: Search list expansion + upstream servers that timeout instead of answering; sometimes a dead secondary DNS server causing retries.
Fix: Reduce search domains, remove dead DNS servers, and ensure upstream resolvers respond correctly for non-existent names.
4) “Only internal domains are slow”
Symptom: Public domains resolve quickly; internal zones lag.
Root cause: Split DNS misrouting: internal suffix sent to the wrong resolver, or multiple interfaces competing (Wi‑Fi + VPN + Docker bridge).
Fix: Use per-link DNS and routing domains; verify with resolvectl status which link owns the domain.
5) “Random spikes: sometimes 20ms, sometimes 2s”
Symptom: Mostly fine, sometimes stalls.
Root cause: UDP loss, firewall rate limiting, MTU/EDNS issues, or one of the configured DNS servers intermittently failing.
Fix: Packet capture to see retransmits; remove or fix flaky resolvers; address MTU or fragmented UDP handling.
6) “systemd-resolved keeps changing my resolv.conf”
Symptom: Manual edits disappear.
Root cause: That file is not yours. It’s managed by resolved or another network component.
Fix: Configure the owner (resolved conf, NetworkManager connection settings, netplan/systemd-networkd). Stop editing /etc/resolv.conf directly.
Checklists / step-by-step plan
Step-by-step: from “slow” to “fixed” without guesswork
- Measure with the right tool: run time getent ahosts name. Record the real delay.
- Check ownership: ls -l /etc/resolv.conf. If it’s a stub symlink, resolved is in the loop.
- Inspect search domains: cat /etc/resolv.conf and resolvectl status. If search is long, it’s a suspect.
- Inspect NSS ordering: check the hosts: line. Remove sources you don’t use (especially on servers).
- Compare layers: dig @upstream vs resolvectl query vs getent. Identify where the delay enters.
- Turn on debug briefly: set resolved log level to debug and read timeouts/retries.
- Capture packets: confirm retransmits, truncation, TCP fallback, or “no response” behavior.
- Fix the cause: wrong DNS server, dead secondary, broken MTU, split DNS misrouting, DNSSEC mismatch, or search list bloat.
- Rollback debug settings: set log level back to normal and keep the change log clean.
- Re-test and document: re-run the same measurements and capture before/after timing.
Operational checklist: keep it from coming back
- Standardize how DNS is managed on each distro (NetworkManager vs networkd) and don’t mix managers on the same host.
- Define allowed search domains per environment; keep production servers conservative.
- Monitor resolver latency and error rates from hosts, not just from the DNS servers.
- Have a known-good “resolver sanity” command set for on-call: getent, resolvectl, journalctl, tcpdump. A sketch of a bundle script follows below.
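A minimal sketch of such a bundle script (paths and the probe name are placeholders; extend it to match your fleet):
#!/bin/bash
# resolver-sanity.sh: snapshot resolver state into a bundle for comparison
set -u
out="/tmp/resolver-bundle-$(date +%s)"
mkdir -p "$out"
resolvectl status > "$out/resolvectl-status.txt" 2>&1
cp /etc/nsswitch.conf /etc/resolv.conf "$out/" 2>/dev/null
{ time getent ahosts example.com >/dev/null; } 2> "$out/getent-timing.txt"
journalctl -u systemd-resolved -n 100 --no-pager > "$out/resolved-journal.txt" 2>&1
echo "bundle written to $out"
Run it on a healthy host and a slow one, then diff the two bundles; differences usually point straight at the culprit.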
FAQ
1) Should I disable systemd-resolved to fix slow DNS?
Usually no. Disabling it often breaks split DNS, VPN routing, and consistent behavior across apps. Fix the actual bottleneck (search domains, wrong per-link DNS, timeouts) and keep the policy engine.
2) Why is 127.0.0.53 in /etc/resolv.conf?
That’s systemd-resolved’s local stub resolver. Apps query localhost; resolved forwards to the real DNS servers and applies routing/caching. If it’s slow, it’s either forwarding slowly or doing too much work per query.
3) Why is dig fast but getent slow?
dig is a DNS client; getent follows NSS and uses the same path as most applications. Slow getent usually means NSS ordering, search domains, or resolved integration is causing extra queries or timeouts.
4) How do I tell if the delay is search-domain expansion?
Look at resolved debug logs and packet captures. You’ll see repeated queries for the same base name with different suffixes. Also, compare a single-label lookup (db) vs an FQDN (db.corp.example).
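A sketch of that comparison; db and corp.example are placeholders for your own names:
cr0x@server:~$ time getent hosts db
cr0x@server:~$ time getent hosts db.corp.example
If the single-label lookup is slower by roughly a multiple of your search-list length, expansion is the culprit.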
5) Does DNSSEC make lookups slow?
It can, but it’s not the default villain. Most DNSSEC overhead is small. Slowness appears when validation triggers extra queries, time is wrong, or upstream DNS breaks the chain. Decide policy explicitly instead of hoping defaults behave.
6) Can IPv6 cause “DNS slowness” even if DNS is fine?
Yes. Many apps resolve AAAA and then try IPv6 first. If IPv6 connectivity is broken, the connection delay can be mistaken for DNS delay. Separate resolution timing (getent) from connection timing (curl -v or nc).
7) What’s the safest way to change resolver settings on servers?
Change them through the system’s network manager and commit them as code (netplan configs, NetworkManager profiles, or systemd-networkd units). Avoid one-off edits to /etc/resolv.conf.
8) I have multiple DNS servers configured. Why would it be slower?
If one server is dead or drops responses, your resolver may wait, retry, then switch. That can add seconds. Multiple servers are good when they’re healthy; they’re terrible when one is silently broken.
9) Why do NXDOMAIN responses sometimes take longer than successful ones?
Because the resolver may try search domains and multiple record types, and some upstream servers timeout instead of answering cleanly for certain suffixes. Negative answers should be fast; if they’re not, something’s misrouted or broken.
Next steps you can do today
Do three things, in this order:
- Measure with getent ahosts and resolvectl query so you’re debugging the same path your applications use.
- Make the resolver chain boring: sane hosts: ordering, short search lists, correct per-link DNS routing, and no turf war over /etc/resolv.conf.
- When you see seconds, capture evidence: resolved debug logs and a short tcpdump. Then fix the actual cause (dead resolvers, wrong routes, MTU/EDNS breakage) instead of papering over it.
Once DNS is fast and predictable, the rest of your system stops feeling haunted. Which is the whole point of reliability engineering: fewer mysteries, more sleep.