It’s 02:17, your deployment window is closing, and apt suddenly refuses to fetch anything. The error is almost insulting in its simplicity: Could not resolve host. You know the box has a link. You can ping an IP. Yet every hostname looks like it’s been erased from the universe.
This is where production instincts matter. Don’t “try random fixes” and definitely don’t reboot as a diagnostic tool. Triage name resolution like you would a storage incident: find the boundary, prove where it breaks, and change one thing at a time.
Fast diagnosis playbook
This is the “stop bleeding” path. It’s biased toward speed, not elegance. You’re trying to answer three questions fast:
- Is this box able to reach any resolver at all?
- Is the resolver answering correctly?
- Is the application bypassing your resolver because of proxy/IPv6/NSS oddities?
First: prove whether DNS is the problem or just the first thing to fail
- Check IP connectivity to something stable (gateway, resolver, known public IP if policy allows). If IP connectivity fails, DNS debugging is theater.
- Resolve using the system path (getent) and a direct tool (dig or resolvectl). If they disagree, you’ve found the fracture line.
Second: locate the resolver in use and its upstream
- Read /etc/resolv.conf and check if it’s a stub (127.0.0.53) or real servers.
- Use resolvectl to see which DNS servers are actually configured per link and whether DNSSEC/DoT is involved.
Third: check “invisible” application routing: proxy and IPv6 preference
- Proxy: validate environment variables and APT proxy snippets. A proxy can make DNS appear broken because the proxy is doing the resolution (or failing to).
- IPv6: if your host gets AAAA records but can’t route IPv6, you’ll see timeouts and “could not resolve host” style failures depending on the client.
If you’re in a hurry: start with Tasks 1–6 below. They’ll usually tell you where to dig deeper.
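If you want those three answers before opening the full playbook, a minimal first pass looks like this (a sketch; the gateway address and hostname are examples from this article, substitute your own):
cr0x@server:~$ ping -c 1 -W 2 10.10.20.1
cr0x@server:~$ getent ahosts deb.debian.org
cr0x@server:~$ dig deb.debian.org +time=2 +tries=1 +short
cr0x@server:~$ env | grep -i proxy
The first proves IP reachability, the second exercises the NSS path applications use, the third asks DNS directly, and the fourth reveals whether a proxy is quietly in the loop.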
What “Could not resolve host” really means
The phrase shows up in multiple clients (curl, wget, git, sometimes apt via its HTTP stack). It’s not a single error. It’s a family of failures that all end with the same shrug: “I tried to turn a name into an address, and I didn’t get one.”
In Debian 13, the usual moving parts are:
- glibc name resolution (NSS) via /etc/nsswitch.conf
- /etc/resolv.conf (either a real file or a symlink into systemd’s world)
- systemd-resolved (if enabled) and its stub listener
- NetworkManager or systemd-networkd feeding DNS servers to resolved
- VPN clients doing split-DNS, sometimes badly
- Proxy configuration (environment variables, APT config, corporate PAC files)
- IPv6 preference and policy (Happy Eyeballs helps, but it doesn’t absolve you from broken v6)
One reliable mantra: “Could not resolve host” is the symptom. Your job is to identify which layer is lying.
Joke #1: DNS is the only distributed system where forgetting a dot can ruin your day and your weekend.
Interesting facts and historical context
These aren’t trivia for trivia’s sake. Each one explains why modern Linux name resolution behaves the way it does, and why “simple” fixes can be traps.
- /etc/hosts predates DNS. Early ARPANET systems used a single, centrally distributed HOSTS.TXT file; DNS replaced that scaling disaster with delegation.
- The resolver library is not the same as “DNS.” On Linux, glibc NSS decides whether to consult files, DNS, mDNS, LDAP, SSSD, etc. DNS is only one module.
- resolv.conf was designed for static networks. Laptops, DHCP, VPN split DNS, and containers made it a contested file where multiple daemons fight politely (or not).
- systemd-resolved introduced a local stub by default in many distros. That 127.0.0.53 entry isn’t “wrong”; it’s a local forwarder and cache, until something else breaks around it.
- Negative caching is a thing. Resolvers can cache “NXDOMAIN” results, so a transient upstream failure can linger as “this host doesn’t exist” for a while.
- DNS timeouts can look like resolution failures. Many clients map “timed out contacting DNS” into generic errors because they only care that they didn’t get an address.
- DNSSEC and DoT are reliability trade-offs. Validation and encryption add failure modes: clock skew, blocked port 853, or middleboxes that mishandle EDNS0.
- IPv6 broke a lot of assumptions. Dual-stack means you can “successfully resolve” AAAA records yet fail to connect if v6 routing is half-configured.
- Corporate proxies sometimes resolve names for you. Your client may never do DNS at all; it hands the hostname to the proxy and waits. That changes the debugging path.
The decision tree: DNS vs proxy vs IPv6 vs routing
Here’s the mental model that keeps you from randomly editing files. Your aim is to find the first layer that deviates from expected behavior.
1) Can you reach anything by IP?
If you can’t ping your default gateway or reach a known IP, stop. Fix link/routing/firewall. DNS is not the primary incident; it’s collateral damage.
2) Does the system resolver path work?
Use getent ahosts because it exercises the same path most applications use (NSS). If getent fails, you’ve got a system-wide resolver issue. If getent works but curl fails, you’re looking at proxy settings, application-level DNS overrides, or IPv6 connect behavior.
3) Are you using systemd-resolved, and is /etc/resolv.conf honest?
In a lot of breakages, /etc/resolv.conf points at a stub that isn’t listening (resolved is disabled), or points at stale DHCP-provided resolvers you can’t reach (VPN moved you). It’s boring, and it’s common.
4) Are you in a proxy regime?
Corporate networks love proxies. Some use explicit proxies configured via environment variables; others enforce transparent proxies or PAC files. If APT has a proxy configured but the proxy is unreachable, you can get misleading errors that feel like DNS.
5) Is IPv6 creating a “resolution succeeded, connect failed” loop?
Many tools resolve AAAA and try IPv6 first. If IPv6 DNS works but IPv6 routing is broken, you’ll see slow failures. Some clients summarize that as a resolution failure. Your job is to separate “I got an address” from “I can reach the address.”
Practical tasks: commands, output, and decisions
These are real tasks you can run on Debian 13. Each one includes (a) the command, (b) what typical output means, and (c) what decision you make next. Do them in order until you hit the divergence.
Task 1: Confirm the error surface (which tool, which hostname)
cr0x@server:~$ apt-get update
Ign:1 http://deb.debian.org/debian trixie InRelease
Err:1 http://deb.debian.org/debian trixie InRelease
Could not resolve 'deb.debian.org'
Reading package lists... Done
W: Failed to fetch http://deb.debian.org/debian/dists/trixie/InRelease Could not resolve 'deb.debian.org'
W: Some index files failed to download. They have been ignored, or old ones used instead.
What it means: APT can’t resolve the repo hostname. It might be DNS, a proxy config, or a resolver mismatch.
Decision: Don’t edit anything yet. Move to IP connectivity and system resolver checks.
Task 2: Verify interface state and default route
cr0x@server:~$ ip -br addr
lo UNKNOWN 127.0.0.1/8 ::1/128
ens3 UP 10.10.20.15/24 2001:db8:20::15/64
cr0x@server:~$ ip route
default via 10.10.20.1 dev ens3
10.10.20.0/24 dev ens3 proto kernel scope link src 10.10.20.15
What it means: You have an IPv4 address and a default route. At least on paper, outbound should work.
Decision: Test reachability to the gateway and a known IP.
Task 3: Test raw IP connectivity (gateway, then a known IP)
cr0x@server:~$ ping -c 2 10.10.20.1
PING 10.10.20.1 (10.10.20.1) 56(84) bytes of data.
64 bytes from 10.10.20.1: icmp_seq=1 ttl=64 time=0.358 ms
64 bytes from 10.10.20.1: icmp_seq=2 ttl=64 time=0.311 ms
--- 10.10.20.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1020ms
rtt min/avg/max/mdev = 0.311/0.334/0.358/0.023 ms
cr0x@server:~$ ping -c 2 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=55 time=8.29 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=55 time=8.10 ms
--- 1.1.1.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 8.100/8.195/8.290/0.095 ms
What it means: Network path works to the internet by IP. This strongly points back to name resolution or proxy.
Decision: Check system resolver behavior with getent next.
Task 4: Test resolution through NSS (what most apps actually use)
cr0x@server:~$ getent ahosts deb.debian.org
What it means if empty: NSS couldn’t produce any address records. That’s a system resolver issue, not “APT being weird.”
Decision: Inspect /etc/nsswitch.conf and the resolver configuration (/etc/resolv.conf, resolvectl).
Task 5: Inspect NSS order (hosts resolution sources)
cr0x@server:~$ grep -E '^\s*hosts:' /etc/nsswitch.conf
hosts: files mdns4_minimal [NOTFOUND=return] dns
What it means: The system checks /etc/hosts, then mDNS, then DNS. That’s a common default on desktops; on servers it can be a footgun if mDNS misbehaves.
Decision: If resolution is slow or erratic, consider whether mDNS is interfering. But don’t change this yet—first confirm whether DNS is reachable.
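If you want to rule mDNS in or out quickly, two read-only checks help (a sketch; output is illustrative, and avahi-daemon may not even be installed on a server build):
cr0x@server:~$ systemctl is-active avahi-daemon
inactive
cr0x@server:~$ time getent ahosts deb.debian.org >/dev/null
real	0m0.034s
user	0m0.004s
sys	0m0.006s
Sub-second timing means the NSS chain isn’t stalling on mDNS; multi-second pauses before a DNS answer are worth a closer look at that mdns4_minimal entry.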
Task 6: Check /etc/resolv.conf identity and contents
cr0x@server:~$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Dec 29 01:40 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
cr0x@server:~$ cat /etc/resolv.conf
nameserver 127.0.0.53
options edns0 trust-ad
search corp.example
What it means: You’re using systemd’s local stub at 127.0.0.53. That’s fine if systemd-resolved is running and upstream DNS servers are configured correctly.
Decision: Verify resolved is running and has upstream servers. If resolved is stopped, this stub becomes a black hole.
Task 7: Check systemd-resolved health and upstream DNS
cr0x@server:~$ systemctl status systemd-resolved --no-pager
● systemd-resolved.service - Network Name Resolution
Loaded: loaded (/lib/systemd/system/systemd-resolved.service; enabled; preset: enabled)
Active: active (running) since Mon 2025-12-29 01:39:12 UTC; 52min ago
Docs: man:systemd-resolved.service(8)
man:org.freedesktop.resolve1(5)
Main PID: 512 (systemd-resolve)
Status: "Processing requests..."
Tasks: 1 (limit: 4662)
Memory: 6.2M
CPU: 620ms
CGroup: /system.slice/systemd-resolved.service
└─512 /lib/systemd/systemd-resolved
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Link 2 (ens3)
Current Scopes: DNS
Protocols: +DefaultRoute
Current DNS Server: 10.10.0.53
DNS Servers: 10.10.0.53 10.10.0.54
DNS Domain: corp.example
What it means: resolved is active and forwarding to 10.10.0.53/54. No DoT, DNSSEC not validating. Looks sane.
Decision: Now test queries directly against the configured DNS server to detect firewall/VPN/split DNS problems.
Task 8: Query the upstream DNS directly (bypass stub)
cr0x@server:~$ dig @10.10.0.53 deb.debian.org A +time=2 +tries=1
; <<>> DiG 9.18.33-1-Debian <<>> @10.10.0.53 deb.debian.org A +time=2 +tries=1
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
What it means: Your upstream DNS server is not reachable (timeout). Could be routing, firewall, or you’re on the wrong network for that resolver (classic VPN split DNS failure).
Decision: Check if you can reach 10.10.0.53 at all (ICMP may be blocked; try TCP/UDP 53 tests). Validate VPN and routing.
Task 9: Test reachability to DNS server on port 53 (TCP and UDP)
cr0x@server:~$ nc -vz -w 2 10.10.0.53 53
nc: connect to 10.10.0.53 port 53 (tcp) timed out: Operation now in progress
cr0x@server:~$ sudo timeout 2 bash -c 'cat < /dev/null > /dev/udp/10.10.0.53/53' && echo ok || echo failed
failed
What it means: Both TCP and UDP attempts fail. This is not a mistyped hostname or a bad record; it’s a network reachability problem between this host and the resolver.
Decision: Fix routing/VPN/firewall to the resolver, or temporarily switch to reachable resolvers (policy permitting) to restore service.
Task 10: Identify who configured DNS (DHCP, NetworkManager, networkd, VPN)
cr0x@server:~$ systemctl is-active NetworkManager
inactive
cr0x@server:~$ systemctl is-active systemd-networkd
active
cr0x@server:~$ networkctl status ens3 --no-pager
● 2: ens3
Link File: /usr/lib/systemd/network/99-default.link
Network File: /etc/systemd/network/10-ens3.network
Type: ether
State: routable (configured)
Online state: online
Address: 10.10.20.15/24
2001:db8:20::15/64
Gateway: 10.10.20.1
DNS: 10.10.0.53
10.10.0.54
DNS Domain: corp.example
What it means: systemd-networkd is providing DNS servers. If those servers aren’t reachable from this network segment, you need to adjust the networkd config or DHCP scope.
Decision: Validate that the DNS servers are correct for this VLAN/VPN context. If you recently moved networks, this is your smoking gun.
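For context, those DNS entries come from the .network file networkctl showed. A minimal sketch of what /etc/systemd/network/10-ens3.network might contain for this host (the file name and addresses mirror the example output above, not a guaranteed layout):
[Match]
Name=ens3

[Network]
Address=10.10.20.15/24
Gateway=10.10.20.1
DNS=10.10.0.53
DNS=10.10.0.54
Domains=corp.example
If DHCP should own DNS on this segment, replace the static DNS= lines with DHCP=yes and restart systemd-networkd, so the resolvers track the network you are actually on.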
Task 11: Verify proxy settings (environment variables)
cr0x@server:~$ env | grep -iE '^(http|https|no)_proxy='
http_proxy=http://proxy.corp.example:3128
https_proxy=http://proxy.corp.example:3128
no_proxy=localhost,127.0.0.1,.corp.example
What it means: Your shell is configured to use a proxy. Tools like curl, wget, and sometimes git will obey this.
Decision: If proxy is required, test reachability to it. If proxy is not required on this network, unset it and retest resolution.
Task 12: Check APT’s proxy configuration (can override env)
cr0x@server:~$ grep -R --line-number -E 'Acquire::(http|https)::Proxy' /etc/apt/apt.conf /etc/apt/apt.conf.d 2>/dev/null
/etc/apt/apt.conf.d/30proxy:1:Acquire::http::Proxy "http://proxy.corp.example:3128";
/etc/apt/apt.conf.d/30proxy:2:Acquire::https::Proxy "http://proxy.corp.example:3128";
What it means: APT is pinned to a proxy regardless of your environment variables. If that proxy is unreachable, APT breaks even if DNS is fine.
Decision: Validate proxy connectivity; if you’re off-corp-network, disable or conditionalize this config.
Task 13: Test proxy connectivity and name resolution path
cr0x@server:~$ nc -vz -w 2 proxy.corp.example 3128
nc: getaddrinfo for host "proxy.corp.example" port 3128: Temporary failure in name resolution
cr0x@server:~$ nc -vz -w 2 10.10.30.40 3128
Connection to 10.10.30.40 3128 port [tcp/*] succeeded!
What it means: DNS is failing for the proxy hostname, but the proxy IP is reachable. That’s a classic situation where “DNS is down” is true, but the fastest mitigation is to use the proxy IP temporarily (if policy allows) or fix DNS.
Decision: Choose mitigation: fix DNS upstream, or use a temporary IP-based proxy entry to restore package installs during the incident.
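If you take the IP-based bridge, make it obvious and disposable. A sketch of a temporary APT drop-in (the file name 99proxy-incident and the IP are illustrative; delete the file once DNS is fixed):
cr0x@server:~$ printf 'Acquire::http::Proxy "http://10.10.30.40:3128";\nAcquire::https::Proxy "http://10.10.30.40:3128";\n' | sudo tee /etc/apt/apt.conf.d/99proxy-incident
Acquire::http::Proxy "http://10.10.30.40:3128";
Acquire::https::Proxy "http://10.10.30.40:3128";
Because APT reads apt.conf.d in lexical order, this file overrides the existing 30proxy entry; removing it restores the original behavior.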
Task 14: Detect IPv6 preference problems (AAAA resolves, connect fails)
cr0x@server:~$ getent ahosts deb.debian.org | head
2a04:4e42:600::644 deb.debian.org
2a04:4e42:400::644 deb.debian.org
146.75.106.132 deb.debian.org
cr0x@server:~$ curl -I https://deb.debian.org --max-time 5
curl: (6) Could not resolve host: deb.debian.org
What it means: This is an example of a mismatch: getent returns addresses, but curl claims it can’t resolve. In real incidents, this usually points to proxy settings (curl hands the name to a proxy and fails there), less often to curl’s own resolver configuration or an NSS/plugin mismatch in containerized environments.
Decision: Run curl with proxy disabled and force IPv4/v6 to isolate.
Task 15: Force curl to bypass proxy and pin address family
cr0x@server:~$ curl -I https://deb.debian.org --noproxy '*' --max-time 5
HTTP/2 200
server: envoy
content-type: text/html
date: Mon, 29 Dec 2025 02:32:11 GMT
cr0x@server:~$ curl -I https://deb.debian.org -4 --max-time 5
HTTP/2 200
server: envoy
content-type: text/html
date: Mon, 29 Dec 2025 02:32:14 GMT
cr0x@server:~$ curl -I https://deb.debian.org -6 --max-time 5
curl: (28) Failed to connect to deb.debian.org port 443 after 5000 ms: Connection timed out
What it means: DNS is fine. IPv4 works. IPv6 connect times out. That’s not “resolution,” that’s routing or firewall for IPv6.
Decision: Either fix IPv6 routing properly or temporarily deprioritize/disable IPv6 for the affected client/system while you repair the network.
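If the decision is to park IPv6 on this host until routing is repaired, a reversible, non-persistent way to do it is a per-interface sysctl (a sketch; this affects ens3 only and does not survive a reboot):
cr0x@server:~$ sudo sysctl -w net.ipv6.conf.ens3.disable_ipv6=1
net.ipv6.conf.ens3.disable_ipv6 = 1
cr0x@server:~$ curl -I https://deb.debian.org --max-time 5
HTTP/2 200
Revert with disable_ipv6=0 once the v6 default route is back; prefer fixing RA over leaving this in place.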
Task 16: Check IPv6 routes and RA status
cr0x@server:~$ ip -6 route
2001:db8:20::/64 dev ens3 proto kernel metric 256 pref medium
fe80::/64 dev ens3 proto kernel metric 256 pref medium
What it means: You have no IPv6 default route. You can resolve AAAA, but you can’t reach the IPv6 internet. That’s the worst kind of “almost configured.”
Decision: Fix RA/default route, or disable IPv6 on that interface (or configure policy routing) based on your environment’s intent.
Task 17: Check whether resolved is listening on the stub
cr0x@server:~$ ss -lntu | grep -E '127\.0\.0\.53:53|:53'
udp UNCONN 0 0 127.0.0.53:53 0.0.0.0:*
tcp LISTEN 0 4096 127.0.0.53:53 0.0.0.0:*
What it means: The stub listener is present. If you see nothing listening on 127.0.0.53:53 while /etc/resolv.conf points there, resolution will fail for almost everything.
Decision: If not listening, start/enable resolved or repoint /etc/resolv.conf to real resolvers (carefully, and preferably temporarily).
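If the stub should exist and doesn’t, the repair is usually one command plus a verification (a sketch; also re-check that /etc/resolv.conf still points at the stub afterwards):
cr0x@server:~$ sudo systemctl enable --now systemd-resolved
cr0x@server:~$ ss -lntu | grep '127.0.0.53:53'
udp   UNCONN 0      0      127.0.0.53:53      0.0.0.0:*
tcp   LISTEN 0      4096   127.0.0.53:53      0.0.0.0:*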
Task 18: Inspect resolved logs for upstream failures and DNSSEC/DoT issues
cr0x@server:~$ journalctl -u systemd-resolved --since "30 min ago" --no-pager | tail -n 30
Dec 29 02:10:21 server systemd-resolved[512]: Using degraded feature set UDP instead of UDP+EDNS0 for DNS server 10.10.0.53.
Dec 29 02:10:23 server systemd-resolved[512]: DNS server 10.10.0.53:53 timed out.
Dec 29 02:10:23 server systemd-resolved[512]: Switching to DNS server 10.10.0.54:53.
Dec 29 02:10:25 server systemd-resolved[512]: DNS server 10.10.0.54:53 timed out.
Dec 29 02:10:25 server systemd-resolved[512]: All attempts to contact name servers or networks failed.
What it means: Upstream resolvers aren’t responding. This is where you stop poking the client and start looking at network ACLs, VPN state, or upstream DNS service health.
Decision: Escalate to network/DNS owners with evidence: server IP, resolver IPs, time range, and whether UDP/TCP fail.
Task 19: Validate the resolver config from the application’s perspective
cr0x@server:~$ strace -f -e trace=network,connect,sendto,recvfrom -s 128 getent hosts deb.debian.org 2>&1 | tail -n 15
sendto(3, "\250\310\1\0\0\1\0\0\0\0\0\0\3deb\6debian\3org\0\0\1\0\1", 32, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.53")}, 16) = 32
recvfrom(3, 0x7ffd8c3c6b10, 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.53")}, [16]) = -1 EAGAIN (Resource temporarily unavailable)
What it means: It’s actually trying 127.0.0.53, and it’s not getting an answer in time. That’s a strong confirmation of stub/resolved trouble.
Decision: Fix resolved/upstream reachability. Don’t waste time rewriting /etc/hosts unless you need a surgical temporary workaround for one hostname.
Task 20: Quick temporary mitigation (only if policy allows)
If you need packages now and you have a reachable resolver IP, you can temporarily replace the stub with direct nameservers. This is tactical, not a “final fix.”
cr0x@server:~$ sudo cp -a /etc/resolv.conf /etc/resolv.conf.bak
cr0x@server:~$ sudo rm -f /etc/resolv.conf
cr0x@server:~$ printf "nameserver 1.1.1.1\nnameserver 8.8.8.8\n" | sudo tee /etc/resolv.conf
nameserver 1.1.1.1
nameserver 8.8.8.8
cr0x@server:~$ getent ahosts deb.debian.org | head -n 3
2a04:4e42:600::644 deb.debian.org
2a04:4e42:400::644 deb.debian.org
146.75.106.132 deb.debian.org
What it means: You’ve bypassed the local stub and corporate resolvers. This may violate corporate policy or break internal name resolution. Use only as an incident bridge.
Decision: After the incident, revert and fix the real upstream problem. Permanent hacks rot into outages later.
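Reverting is short because the backup made with cp -a is still the original symlink, not a copy of its target (a sketch matching the commands above):
cr0x@server:~$ sudo mv /etc/resolv.conf.bak /etc/resolv.conf
cr0x@server:~$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Dec 29 01:40 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf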
Common mistakes (symptom → root cause → fix)
This section is deliberately specific. These patterns repeat because they’re easy to create and hard to spot under pressure.
1) Symptom: /etc/resolv.conf shows 127.0.0.53, but nothing resolves
Root cause: systemd-resolved is disabled/stopped, or the stub listener isn’t bound due to conflicts.
Fix: Enable/start resolved and verify it’s listening; or repoint /etc/resolv.conf to a real resolver. Also look for a competing local DNS service (dnsmasq, unbound) binding port 53.
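To find out what, if anything, owns port 53, one read-only check is enough (a sketch; the dnsmasq process in the output is illustrative):
cr0x@server:~$ sudo ss -lntup 'sport = :53'
Netid State  Recv-Q Send-Q Local Address:Port  Peer Address:Port Process
udp   UNCONN 0      0          127.0.0.1:53         0.0.0.0:*    users:(("dnsmasq",pid=1234,fd=4))
If a second resolver is bound there, decide which one owns the host and remove or reconfigure the other.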
2) Symptom: dig works but apt and curl fail with “could not resolve host”
Root cause: Proxy configuration. Your CLI DNS test is direct, but the application is using a proxy (APT proxy config or environment variables) and failing before DNS or delegating DNS to the proxy.
Fix: Validate Acquire::http::Proxy and http_proxy/https_proxy. Test with proxy disabled (--noproxy) to confirm.
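A quick way to see the proxy exactly as APT sees it, and to prove the direct path works (a sketch using the repo host from earlier):
cr0x@server:~$ apt-config dump | grep -i proxy
Acquire::http::Proxy "http://proxy.corp.example:3128";
Acquire::https::Proxy "http://proxy.corp.example:3128";
cr0x@server:~$ curl -I https://deb.debian.org --noproxy '*' --max-time 5
HTTP/2 200
If the direct request succeeds while the proxied tools fail, the incident lives in the proxy path, not in DNS.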
3) Symptom: Resolution is slow, intermittent, and CPU idle; logs show timeouts
Root cause: Upstream resolver unreachable due to VPN route changes, firewall, or wrong DHCP DNS. systemd-resolved cycles servers and retries.
Fix: Confirm reachability to configured resolvers with UDP/TCP 53 tests. Correct the DNS server list for the network you’re on.
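Testing the resolver over both transports is two dig invocations (a sketch; on a healthy path both return an address, and a timeout on only one transport usually points at a firewall rule):
cr0x@server:~$ dig @10.10.0.53 deb.debian.org +time=2 +tries=1 +short
146.75.106.132
cr0x@server:~$ dig @10.10.0.53 deb.debian.org +tcp +time=2 +tries=1 +short
146.75.106.132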
4) Symptom: AAAA records exist; connections hang; sometimes error says “could not resolve host”
Root cause: IPv6 is enabled enough to be preferred but not enough to work (no v6 default route, broken firewall, misconfigured RA).
Fix: Either fix IPv6 routing properly (preferred in real networks), or temporarily force IPv4 for the failing client or disable IPv6 on the interface if it’s not supported.
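If you need IPv4 preferred fleet-wide without disabling IPv6, the usual lever is glibc’s address-selection policy in /etc/gai.conf (a sketch; uncommenting this one line makes getaddrinfo sort IPv4 results first):
precedence ::ffff:0:0/96  100
This only changes how clients that use getaddrinfo order addresses; it doesn’t change what DNS returns, and it doesn’t fix the broken v6 route.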
5) Symptom: Internal names fail, external names work (or vice versa)
Root cause: Split DNS and search domains. VPN expects corporate resolvers for corp.example, but you’re using public resolvers; or your search domain makes short names resolve incorrectly.
Fix: Use per-link DNS and routing domain configuration with systemd-resolved (~corp.example routing domains) or correct VPN-provided DNS settings.
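With systemd-resolved, split DNS is configured per link. A sketch of steering corp.example queries to the corporate resolvers on ens3 (addresses and domain follow the earlier examples; a well-behaved VPN client sets this for you, and these runtime settings are lost when the link is reconfigured):
cr0x@server:~$ sudo resolvectl dns ens3 10.10.0.53 10.10.0.54
cr0x@server:~$ sudo resolvectl domain ens3 '~corp.example'
cr0x@server:~$ resolvectl status ens3 | grep -E 'DNS Servers|DNS Domain'
       DNS Servers: 10.10.0.53 10.10.0.54
        DNS Domain: ~corp.example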
6) Symptom: Only one host is “unresolvable” and everyone edits /etc/hosts
Root cause: Stale negative caching, misconfigured authoritative DNS, or a recent cutover that didn’t propagate. /etc/hosts “fixes” one box and creates a maintenance debt.
Fix: Verify TTLs and authoritative answers. Flush caches where appropriate (resolvectl flush-caches) and fix the DNS records instead of carving exceptions into hosts files.
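To separate a stale cache from a genuinely wrong record, flush locally and then ask the delegation chain directly (a sketch; the tail of +trace shows the answer and TTL the authoritative servers hand out, which you can compare with what your resolver returns):
cr0x@server:~$ sudo resolvectl flush-caches
cr0x@server:~$ dig deb.debian.org A +trace +nodnssec | tail -n 5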
7) Symptom: Container can’t resolve, host can
Root cause: Container runtime writes its own resolv.conf or uses a different DNS server, often blocked by network policy.
Fix: Inspect container /etc/resolv.conf and runtime DNS settings. Align with host’s reachable resolvers or use a dedicated internal resolver accessible from that network.
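A quick comparison between host and container, assuming a Docker-style runtime (the container name mycontainer is illustrative; Podman and Kubernetes have their own DNS plumbing):
cr0x@server:~$ docker exec mycontainer cat /etc/resolv.conf
nameserver 10.10.0.53
search corp.example
If the container’s resolver differs from the host’s, or the container network can’t reach it, that’s your divergence.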
8) Symptom: DNS works right after reboot, then fails later
Root cause: Race/override between services writing resolver configuration: DHCP client vs NetworkManager vs networkd vs VPN up/down scripts.
Fix: Pick one network stack owner per host. Remove competing services. Ensure the resolver configuration is managed consistently and monitored.
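Checking for dueling owners is one command; retiring the loser is another (a sketch; which service to keep depends on the host’s role, here networkd stays):
cr0x@server:~$ systemctl is-active NetworkManager systemd-networkd
active
active
cr0x@server:~$ sudo systemctl disable --now NetworkManager
Two “active” lines on a server is usually the bug: both services believe they own the interface and its DNS settings.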
Joke #2: If you “fix” DNS by hardcoding IPs in /etc/hosts, congratulations—you just invented configuration drift with extra steps.
Three corporate mini-stories from the trenches
Mini-story 1: The outage caused by a wrong assumption
They had a standard build for Debian servers: networkd, resolved, and a security hardening profile. A team spun up a new environment in a different data center segment and cloned the same image. Everything booted. Monitoring agents checked in. Then deployments started failing on “Could not resolve host”.
The on-call assumed “DNS must be down” because that’s what the error said. They escalated to the DNS team, who replied (correctly) that their service was healthy and answering queries. Meanwhile the incident kept rolling because package updates couldn’t proceed, container images couldn’t pull, and the deploy pipeline retried itself into a small self-inflicted DDoS.
The actual issue was boring: the image had DNS servers pinned to the old segment’s resolver IPs. Those resolver IPs were reachable only inside the original VLAN; the new segment couldn’t route to them by design. The assumption was that “DNS is global,” which is true philosophically and false practically. Corporate DNS is often per-segment for latency, control, and policy.
They fixed it by sourcing DNS servers via DHCP in that environment instead of hardcoding them, and by adding a boot-time validation: if configured DNS servers aren’t reachable on UDP/TCP 53, fail provisioning early. The next day, everyone remembered that “resolve host” is half naming and half routing to the resolver.
Mini-story 2: The optimization that backfired
A platform team wanted faster builds. They enabled DNS over TLS on a fleet because encrypted DNS sounded like a free win: privacy, integrity, and modern vibes. It also looked neat in security reviews.
Two weeks later, they started seeing sporadic failures during peak hours. Not total outages—worse. Some hosts took 20–30 seconds to resolve certain names. CI jobs flaked. Developers filed tickets like “network is haunted.” The DNS service itself was fine, but the path to it wasn’t.
The culprit was an upstream firewall policy that rate-limited or intermittently dropped port 853 when a certain traffic pattern emerged. UDP/53 was allowed and stable; TCP/853 was “allowed” but not reliably. The clients fell back in messy ways: some retried, some blocked, some cached negative results. The optimization improved security posture but added a fragile dependency on middleboxes that didn’t fully support it under load.
The pragmatic fix was to make DoT conditional: enabled only on networks with verified support, with monitoring that compared resolution latency and failure rates between modes. They also documented it in the build profiles so future teams wouldn’t inherit “security features” that silently reduce reliability.
Mini-story 3: The boring practice that saved the day
A finance-adjacent system (read: nobody wants downtime, and nobody wants change) ran Debian and had strict egress rules. They could only talk to a small set of internal services: a proxy, internal DNS, and a few repositories mirrored inside the company.
The team did one boring thing exceptionally well: they kept a runbook with known-good outputs for resolver state. Not theory. Actual snapshots from a healthy system: resolvectl status, ip route, getent hosts for key internal domains, plus which service owns DNS configuration on that host.
When “Could not resolve host” hit during a change window, they didn’t debate. They compared the current outputs to the known-good ones. The diff was instant: the DNS servers had changed to a generic DHCP-provided list after a network service restart. That list was correct for desktops and wrong for locked-down servers. The “boring” runbook meant they didn’t chase the proxy, didn’t blame upstream DNS, and didn’t waste 45 minutes arguing about IPv6.
They reverted the DNS source configuration, pinned network ownership back to networkd, and added an alert: if the configured DNS servers deviate from the approved set, page early—before application errors start. It wasn’t glamorous, but it kept the incident under control.
Checklists / step-by-step plan
This is the plan you follow when you want repeatable outcomes, not heroic debugging.
Checklist A: Fast triage (5–10 minutes)
- Capture the failing command and hostname. APT repo? Git remote? Proxy hostname? Write it down.
- Check IP address and default route: ip -br addr, ip route.
- Ping the gateway and a known IP. If IP fails, stop and fix routing.
- Check the system resolution path: getent ahosts <host>.
- Inspect /etc/resolv.conf. Stub vs direct servers.
- If stub, verify resolved: systemctl status systemd-resolved, resolvectl status, ss -lntu | grep :53.
- Test upstream resolvers directly: dig @server host A, then port tests if needed.
- Check proxies: env | grep -i proxy and the /etc/apt/apt.conf.d proxy entries.
- Check IPv6 reality: ip -6 route; test curl -4 vs curl -6.
Checklist B: Safe mitigations (choose one, don’t stack them blindly)
- Flush caches (good when you fixed upstream but clients still fail):
cr0x@server:~$ sudo resolvectl flush-caches
Decision: If resolution recovers after the cache flush, investigate upstream flaps or negative caching events.
- Restart resolved (good when stub is wedged, not when upstream is unreachable):
cr0x@server:~$ sudo systemctl restart systemd-resolved
cr0x@server:~$ resolvectl query deb.debian.org
deb.debian.org: 146.75.106.132                 -- link: ens3
                2a04:4e42:600::644             -- link: ens3
Decision: If the restart helps, you still owe an explanation: why did it wedge? Look at logs and upstream reachability.
- Temporarily force IPv4 for a critical operation (good when IPv6 is broken and you need a bridge):
cr0x@server:~$ sudo apt-get -o Acquire::ForceIPv4=true update
Hit:1 http://deb.debian.org/debian trixie InRelease
Reading package lists... Done
Decision: If this works, schedule proper IPv6 remediation, or configure the policy permanently if IPv6 isn’t supported.
- Temporarily bypass proxy for known-good endpoints (good when proxy config is wrong):
cr0x@server:~$ unset http_proxy https_proxy no_proxy
cr0x@server:~$ curl -I https://deb.debian.org --max-time 5
HTTP/2 200
server: envoy
content-type: text/html
date: Mon, 29 Dec 2025 02:41:22 GMT
Decision: If unsetting fixes it, treat the proxy as the incident domain. Reconfigure it properly rather than leaving it half-on.
Checklist C: Make it stay fixed (post-incident hardening)
- Single source of truth for DNS config. Pick NetworkManager or networkd. Remove the other if possible.
- Monitor resolver reachability. Alert when configured DNS servers don’t answer on UDP/TCP 53 (see the sketch after this checklist).
- Record known-good outputs. Keep a baseline for resolvectl status, the /etc/resolv.conf target, and proxy config.
- Document proxy ownership. Is it env vars, APT config, or system-wide policy? Make it explicit.
- Decide what IPv6 should be. Fully supported, or explicitly disabled. “Kind of working” is not a strategy.
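For the monitoring item above, a minimal reachability probe can be a few lines of shell wired into cron or a systemd timer (a sketch; the APPROVED list, probe name, and script path are illustrative, and dig’s non-zero exit status on “no servers could be reached” is what drives the alert):
cr0x@server:~$ cat /usr/local/sbin/check-dns-baseline.sh
#!/bin/bash
# Alert if any approved resolver stops answering on port 53.
# Exit non-zero so a cron job, systemd timer, or monitoring agent can page on it.
APPROVED="10.10.0.53 10.10.0.54"
PROBE="deb.debian.org"
status=0
for ns in $APPROVED; do
    # dig exits non-zero when it gets no reply at all (timeouts, unreachable server)
    if ! dig @"$ns" "$PROBE" A +time=2 +tries=1 +short > /dev/null; then
        echo "resolver $ns did not answer for $PROBE" >&2
        status=1
    fi
done
exit $status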
FAQ
1) Why does ping to an IP work but hostnames fail?
Because routing is fine but name resolution is failing. You’ve proven L3 connectivity; now isolate the resolver path (getent, /etc/resolv.conf, resolvectl).
2) Why does dig work but getent fails?
dig queries DNS directly. getent uses NSS, which may consult files, mDNS, SSSD, or other sources before DNS. Check /etc/nsswitch.conf and whether the DNS module is reachable through your configured resolver.
3) Is 127.0.0.53 in /etc/resolv.conf a bug?
No. It’s the systemd-resolved stub resolver. It’s correct when resolved is running and configured. It’s wrong when resolved is stopped or upstream servers are unreachable and nobody notices.
4) How do I tell if the proxy is causing “Could not resolve host”?
Disable proxy for one test. For curl: curl --noproxy '*'. For APT: temporarily comment proxy config in /etc/apt/apt.conf.d (or run in a controlled environment). If the error disappears, the proxy path is your culprit.
5) Why does forcing IPv4 fix it?
Because DNS resolution returns both A and AAAA records, and the client may prefer IPv6. If IPv6 routing/firewall is broken, connections stall. Forcing IPv4 avoids that path. It’s a mitigation, not a moral victory.
6) Should I hardcode DNS servers in /etc/resolv.conf?
As a temporary incident mitigation, sometimes. As a long-term fix, usually no—because it will be overwritten by your network stack and it breaks split DNS and internal naming. Fix the actual configuration source (DHCP, networkd, NetworkManager, VPN client).
7) Why do internal short names resolve differently on different hosts?
Search domains and split DNS. One host might have search corp.example, another might not. Some environments rely on short names; others shouldn’t. Standardize and prefer FQDNs for automation.
8) How do I know which component “owns” DNS configuration on Debian 13?
Check which network service is active (NetworkManager vs systemd-networkd), then use resolvectl status and networkctl status. Also inspect whether /etc/resolv.conf is a symlink into systemd’s resolve directory.
9) What’s the safest way to debug without breaking production?
Prefer read-only commands first (ip, getent, resolvectl, dig, ss, logs). When you change something, change one knob, record it, and be ready to revert.
10) What quote do you keep in mind during these incidents?
Hope is not a strategy.
— paraphrased idea often repeated in reliability and operations circles.
Conclusion: next steps you can do today
If you only remember one thing: separate resolution from reachability, and separate system resolver path from application behavior. That stops the pointless edits and gets you to the right owner fast.
Next time “Could not resolve host” shows up on Debian 13:
- Prove IP connectivity (ip route, ping the gateway, ping a known IP).
- Prove NSS resolution (getent ahosts) and compare to direct DNS (dig).
- Confirm resolver ownership (the /etc/resolv.conf symlink, resolvectl status).
- Interrogate proxies and IPv6, because they lie in ways that look like DNS.
- Apply a single mitigation, restore service, then fix the upstream cause and remove the mitigation.
Make the boring investment: baseline outputs, monitoring for resolver reachability, and a decision on IPv6 support. That’s how you keep “Could not resolve host” from becoming a recurring character in your incident reports.