DNS problems don’t announce themselves as DNS problems. They show up as “the login page is slow,” “Kubernetes is weird,” “apt can’t fetch,” or “our monitoring is green but users are angry.” And when you finally look closely, you realize every host is doing its own recursive lookups over a flaky network path, or worse, you’ve built a resolver chain that eats itself.
Unbound is the boring, sharp tool for this job: a validating, caching resolver that you can run locally or centrally, with sane defaults and predictable behavior. You can set it up in 15 minutes. The trick is avoiding the common trap: creating a loop with your existing resolver (often systemd-resolved), and spending the next hour wondering why everything is SERVFAIL.
Why Unbound (and what you’re actually building)
Unbound is a recursive DNS resolver with a cache and optional DNSSEC validation. “Recursive” means it will walk the DNS hierarchy itself (root → TLD → authoritative), rather than asking someone else to do it. “Caching” means if 5000 processes ask for the same record, you don’t do 5000 upstream lookups. And DNSSEC validation means you can reject forged answers—at the cost of complexity if your environment is messy.
What you’re really building is a dependency boundary:
- Your hosts depend on your resolver, not whatever DHCP handed them in a coffee shop last year.
- Your applications get lower tail latency for DNS-heavy workflows (service discovery, CDNs, API calls, package managers).
- You gain one place to observe DNS: logs, cache behavior, failure modes.
The classic production pattern is one of these:
- Local node cache: Unbound on each machine, listening on 127.0.0.1, forwarding to upstream. Great for laptops, single servers, or latency-sensitive nodes.
- Central resolver: A small pool of Unbound servers, clients point to them. Great for fleets, consistent policy, and easier observability.
- Hybrid: Local Unbound forwarding to central Unbound, which does recursion. Great when you want local cache + central policy.
Pick deliberately. “Whatever works” is how you end up with a resolver loop, inconsistent DNSSEC, and a war room full of people who suddenly remember that DNS is hard.
Fast facts and short history that actually matters
Facts and context aren’t trivia here; they explain why certain defaults exist and why some failure modes repeat.
- DNS is older than the web. The original DNS RFCs (882 and 883) date to 1983, designed for a network that was smaller, friendlier, and less adversarial.
- TTL is a contract, not a suggestion. Caches are expected behavior. If your app “needs” instant DNS changes, your app is the problem.
- Negative caching exists. NXDOMAIN can be cached too, which is why a typo can haunt you for minutes.
- DNSSEC was bolted on later. It adds integrity verification, but it also adds more ways to fail (clock skew, broken chains, MTU/fragmentation issues).
- Resolvers aren’t just “a server.” They’re state machines: outstanding queries, retries, timeouts, cache, and policy.
- Root hints matter for full recursion. If you do true recursion, you need a way to find the root servers. Unbound ships with compiled-in root hints and modern packages keep them current, but the concept matters.
- EDNS0 changed the game. Larger UDP payloads and DNSSEC records made DNS answers bigger, which exposed network devices that hate fragmentation.
- systemd-resolved popularized a stub-resolver model. A local stub on 127.0.0.53 can be fine—until you stack it with another local resolver and create a loop.
- Unbound comes from the NLnet Labs ecosystem. It’s built by people who live and breathe DNS correctness; it’s not a hobby project.
One quote worth keeping in your head while you do this:
“Hope is not a strategy.” — James Cameron
DNS reliability work is where that line earns its keep. You don’t “hope” your resolver path is sane. You prove it.
15-minute setup: Ubuntu/Debian, production-safe
This setup is optimized for “get a stable caching resolver running fast” without painting you into a corner. It assumes systemd is present (Ubuntu/Debian), and you’re okay with Unbound listening on localhost first. You can later bind it to a LAN IP for clients.
Step 1: Install Unbound (and tools)
cr0x@server:~$ sudo apt-get update
...output...
cr0x@server:~$ sudo apt-get install -y unbound dnsutils
...output...
What the output means: You want to see Unbound installed and a systemd unit created. If apt pulls in something unexpected (like another resolver you didn’t plan), pause and read the dependency list.
Decision: Proceed only if you understand what’s now providing DNS on the host (systemd-resolved vs Unbound). Two resolvers can coexist, but only if you’re explicit about roles.
Step 2: Create a minimal, safe Unbound config
On Debian/Ubuntu, Unbound’s main config is commonly in /etc/unbound/unbound.conf and includes snippets from /etc/unbound/unbound.conf.d/. We’ll drop a snippet so package updates don’t stomp you.
cr0x@server:~$ sudo tee /etc/unbound/unbound.conf.d/local-cache.conf >/dev/null <<'EOF'
server:
    # Listen only on localhost to start. Expand later.
    interface: 127.0.0.1
    port: 53
    # Allow local queries only (for now).
    access-control: 127.0.0.0/8 allow
    # Basic hardening and privacy.
    hide-identity: yes
    hide-version: yes
    qname-minimisation: yes
    # Cache tuning: safe defaults.
    cache-min-ttl: 0
    cache-max-ttl: 86400
    cache-max-negative-ttl: 300
    prefetch: yes
    # Reliability.
    do-daemonize: no
    use-syslog: yes
    verbosity: 1
    # DNSSEC validation on (default on many distros, but be explicit).
    auto-trust-anchor-file: "/var/lib/unbound/root.key"
    # Avoid UDP fragmentation pain by advertising a conservative buffer.
    edns-buffer-size: 1232
EOF
...output...
What the output means: If you see the file echoed back, it was written. If you see “Permission denied,” you forgot sudo or your path is wrong.
Decision: Keep it localhost-only until you’ve verified it’s correct. Exposing a resolver to a network without access-control is how you end up running an open resolver. That ends badly and noisily.
Step 3: Start and enable Unbound
cr0x@server:~$ sudo systemctl enable --now unbound
...output...
cr0x@server:~$ systemctl status unbound --no-pager
● unbound.service - Unbound DNS server
Loaded: loaded (/lib/systemd/system/unbound.service; enabled; vendor preset: enabled)
Active: active (running) ...
...output...
What the output means: “active (running)” is the baseline. If it’s “failed,” read the last lines; Unbound is usually candid about config parse errors.
Decision: If Unbound isn’t running, don’t touch resolv.conf yet. Fix Unbound first. Otherwise you’re about to cut your own SSH session’s throat.
Step 4: Point the host at Unbound (without breaking systemd-resolved)
Here’s where the common trap is born. On many modern systems, /etc/resolv.conf is a symlink to systemd-resolved’s stub file, and it points at 127.0.0.53. If you naively configure Unbound to forward to “whatever is in resolv.conf,” you can create a loop: Unbound → stub → Unbound → stub… until everything times out.
We’ll do it cleanly:
- Option A (simple): Disable systemd-resolved and let Unbound be the resolver.
- Option B (coexist): Keep systemd-resolved but configure it to use 127.0.0.1 (Unbound) upstream, and ensure Unbound does not forward back to the stub.
Option A: Unbound as the only local resolver
cr0x@server:~$ sudo systemctl disable --now systemd-resolved
...output...
cr0x@server:~$ sudo rm -f /etc/resolv.conf
...output...
cr0x@server:~$ printf "nameserver 127.0.0.1\noptions edns0 trust-ad\n" | sudo tee /etc/resolv.conf
nameserver 127.0.0.1
options edns0 trust-ad
What the output means: /etc/resolv.conf is now a real file, pointing at Unbound. The trust-ad option tells the glibc stub to set the AD (authenticated data) bit in queries and trust it in responses, which is only appropriate for a trusted local resolver like this one.
Decision: Use Option A on servers where you want fewer moving parts and you control networking. It’s also easier to reason about at 03:00.
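One caveat with Option A: anything that manages the network (DHCP hooks, NetworkManager) may quietly rewrite /etc/resolv.conf later. If NetworkManager runs on this host, a small drop-in tells it to leave DNS alone. A minimal sketch, assuming the standard conf.d path:
cr0x@server:~$ sudo tee /etc/NetworkManager/conf.d/no-dns.conf >/dev/null <<'EOF'
[main]
# Stop NetworkManager from rewriting /etc/resolv.conf.
dns=none
EOF
cr0x@server:~$ sudo systemctl reload NetworkManager
...output...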
Option B: Keep systemd-resolved, use Unbound upstream
This avoids some edge cases with VPN clients and split DNS that rely on systemd-resolved. Configure resolved to use Unbound:
cr0x@server:~$ sudo mkdir -p /etc/systemd/resolved.conf.d
...output...
cr0x@server:~$ sudo tee /etc/systemd/resolved.conf.d/unbound-upstream.conf >/dev/null <<'EOF'
[Resolve]
DNS=127.0.0.1
Domains=~.
DNSStubListener=yes
EOF
...output...
cr0x@server:~$ sudo systemctl restart systemd-resolved
...output...
What the output means: resolved is still the stub at 127.0.0.53, but it forwards to Unbound on 127.0.0.1. That’s fine as long as Unbound does not forward back.
Decision: Choose Option B when you need systemd-resolved features (per-link DNS, VPN integration), but still want Unbound’s caching/validation.
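To prove the full stub-to-Unbound path works (not just direct queries), resolve through resolved itself. The output below is illustrative:
cr0x@server:~$ resolvectl query example.com
example.com: 93.184.216.34
...output...
If this hangs or fails while dig @127.0.0.1 works, the stub-to-Unbound hop is the problem, not Unbound.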
The common trap: resolver loops (symptoms, proof, fix)
The loop usually happens like this:
- Your system points to the stub resolver at 127.0.0.53 (systemd-resolved).
- You install Unbound and set it to forward to "whatever is in /etc/resolv.conf".
- Unbound forwards to 127.0.0.53.
- systemd-resolved is configured (explicitly or via some automation) to use 127.0.0.1 (Unbound) or "the local DNS."
- Congratulations: you've made a DNS ouroboros.
Symptom pattern: short bursts of success, then timeouts; SERVFAIL; high CPU in Unbound; logs showing repeated queries; dig taking ~5 seconds and returning nothing useful.
Proof: identify who your system is actually querying
cr0x@server:~$ readlink -f /etc/resolv.conf
/run/systemd/resolve/stub-resolv.conf
What the output means: You’re using systemd-resolved’s stub, not Unbound directly.
Decision: If you planned Option A, fix resolv.conf. If you planned Option B, ensure Unbound is not forwarding to 127.0.0.53.
Proof: see where Unbound is forwarding (if at all)
cr0x@server:~$ sudo unbound-checkconf
unbound-checkconf: no errors in /etc/unbound/unbound.conf
What the output means: Config parses. It does not mean your forwarding is sane; it just means Unbound can read the file without choking.
Decision: If you have a forward-zone configured, verify it points to real upstream resolvers (not your local stub unless you absolutely mean it).
Fix: pick a direction and make it one-way
One-way rules that keep you out of trouble:
- If Unbound is recursive, do not forward to systemd-resolved.
- If Unbound is forwarding, forward to known upstream IPs (corporate resolvers, or external resolvers) — not “whatever resolv.conf says.”
- If systemd-resolved forwards to Unbound, Unbound must not forward to systemd-resolved.
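A quick way to prove the one-way rule holds, assuming your config lives under /etc/unbound/: grep everything Unbound reads for the stub address.
cr0x@server:~$ grep -rn "127.0.0.53" /etc/unbound/
cr0x@server:~$ echo $?
1
No matches (grep exits 1) means Unbound cannot forward to the stub, so at least that half of the loop is impossible.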
Joke #1 (short, relevant): DNS loops are like office org charts: they look fine until you realize everyone reports to themselves.
Verification: prove it’s caching and validating
DNS is one of those systems where “it seems fine” is a trap. Verify:
- Queries are answered by Unbound (not bypassing it).
- Second query is faster (cache is working).
- DNSSEC status is sane (if enabled).
- Latency and timeouts are in the range you expect.
Basic query test
cr0x@server:~$ dig @127.0.0.1 example.com A +noall +answer +stats
example.com. 86396 IN A 93.184.216.34
;; Query time: 28 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
...output...
What the output means: The SERVER: line must be 127.0.0.1 (Unbound). Query time in tens of milliseconds for a cold cache is normal.
Decision: If the server is not 127.0.0.1, your clients are not using Unbound. Fix resolver configuration before tuning anything.
Cache proof: run it twice
cr0x@server:~$ dig @127.0.0.1 example.com A +noall +answer +stats
example.com. 86396 IN A 93.184.216.34
;; Query time: 1 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
...output...
What the output means: 1–2 ms indicates cache hit (or at least a very fast path). This is the entire point.
Decision: If second query is not faster, check whether you’re bypassing Unbound, or if cache is effectively disabled by configuration/policy.
DNSSEC signal: ask for the AD bit
cr0x@server:~$ dig @127.0.0.1 dnssec-failed.org A +dnssec +noall +comments +answer
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 12345
...output...
What the output means: dnssec-failed.org is intentionally mis-signed; a validating resolver should fail. SERVFAIL here is a good sign.
Decision: If you need DNSSEC validation for policy/security, keep this behavior. If this breaks critical internal names (because your internal DNS is “creative”), you may need to scope DNSSEC or forward internal zones.
Practical tasks (commands, output meaning, decisions)
You asked for real tasks. Here are more than a dozen. Each one tells you something operationally actionable.
Task 1: Confirm Unbound is listening where you think
cr0x@server:~$ sudo ss -lntup | grep -E ':(53)\b'
udp UNCONN 0 0 127.0.0.1:53 0.0.0.0:* users:(("unbound",pid=1234,fd=5))
tcp LISTEN 0 128 127.0.0.1:53 0.0.0.0:* users:(("unbound",pid=1234,fd=6))
What the output means: UDP and TCP 53 on 127.0.0.1. TCP matters for large responses and fallback behavior.
Decision: If you see 0.0.0.0:53 unexpectedly, you might have exposed your resolver. Fix interfaces/access-control before continuing.
Task 2: Confirm what resolver your host uses (and whether it’s a symlink)
cr0x@server:~$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Jan 1 10:00 /etc/resolv.conf -> /run/systemd/resolve/stub-resolv.conf
What the output means: systemd-resolved stub is in play.
Decision: Decide: disable resolved (Option A), or configure it to use Unbound upstream (Option B). Don’t drift in the middle.
Task 3: Check systemd-resolved’s current upstreams
cr0x@server:~$ resolvectl status
Global
LLMNR setting: yes
MulticastDNS setting: no
DNSOverTLS setting: no
DNSSEC setting: allow-downgrade
DNSSEC supported: yes
Current DNS Server: 127.0.0.1
DNS Servers: 127.0.0.1
...output...
What the output means: resolved is forwarding to 127.0.0.1 (Unbound). That’s coherent only if Unbound is not forwarding back to resolved.
Decision: If Current DNS Server is 127.0.0.53 or some stale DHCP server you don’t trust, fix resolved config or your network manager.
Task 4: Validate Unbound config syntax (fast failure prevention)
cr0x@server:~$ sudo unbound-checkconf /etc/unbound/unbound.conf
unbound-checkconf: no errors in /etc/unbound/unbound.conf
What the output means: Syntax is fine. If it prints an error, it will point to the exact line.
Decision: Never reload/restart Unbound in production without a checkconf step in automation. Humans make typos. DNS does not forgive.
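A minimal guard for scripts and automation: chain the check before the reload so a typo never reaches the running daemon. If your unit lacks a reload action, restart works the same way.
cr0x@server:~$ sudo unbound-checkconf && sudo systemctl reload unbound
unbound-checkconf: no errors in /etc/unbound/unbound.conf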
Task 5: Check live Unbound stats (is it actually serving?)
cr0x@server:~$ sudo unbound-control status
version: 1.17.1
verbosity: 1
threads: 1
modules: 2 [ validator iterator ]
uptime: 320 seconds
options: control(ssl)
unbound-control statistics not enabled
What the output means: Control works, but statistics are not enabled by default on some builds.
Decision: If you want observability, enable stats and remote-control explicitly (later section). If control fails, fix permissions/certs before you depend on it.
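If you want those stats, here is a minimal sketch of enabling them via a drop-in. The option names are standard Unbound; verify them against your build's documentation:
cr0x@server:~$ sudo tee /etc/unbound/unbound.conf.d/stats.conf >/dev/null <<'EOF'
server:
    # Extended counters: cache hits/misses, rcodes, and more.
    extended-statistics: yes
    # Keep counters until explicitly reset (stats vs stats_noreset).
    statistics-cumulative: no
remote-control:
    control-enable: yes
    control-interface: 127.0.0.1
EOF
cr0x@server:~$ sudo unbound-control-setup
...output...
cr0x@server:~$ sudo systemctl restart unbound
...output...
cr0x@server:~$ sudo unbound-control stats_noreset | grep -E 'total.num.(queries|cachehits)'
total.num.queries=1024
total.num.cachehits=890
The counter values are illustrative; the counter names are real Unbound statistics.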
Task 6: Check logs for obvious resolver loop or upstream timeouts
cr0x@server:~$ sudo journalctl -u unbound -n 50 --no-pager
...output...
unbound[1234]: info: resolving example.com. A IN
unbound[1234]: info: response for example.com. A IN
...output...
What the output means: You want “resolving” followed by “response”. Repeated “timed out” or “SERVFAIL” patterns are your early smoke.
Decision: If timeouts dominate, suspect network path, firewall, or MTU/fragmentation. If SERVFAIL dominates, suspect DNSSEC or loop.
Task 7: Confirm the host is using Unbound via libc path (not just dig)
cr0x@server:~$ getent ahosts example.com | head
93.184.216.34 STREAM example.com
93.184.216.34 DGRAM example.com
93.184.216.34 RAW example.com
What the output means: libc resolver is getting a usable answer. This catches cases where dig works but system resolver doesn’t (nsswitch oddities, search domains, etc.).
Decision: If getent fails but dig works, you likely have an NSS/nsswitch or resolv.conf issue, not Unbound itself.
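If you hit that split, check the NSS lookup order first; "files dns" (possibly with systemd's resolve module before dns) is the usual sane state:
cr0x@server:~$ grep '^hosts:' /etc/nsswitch.conf
hosts:          files dns
What the output means: libc consults /etc/hosts, then DNS via resolv.conf. Exotic entries here explain "dig works, applications don't" mysteries.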
Task 8: Check cache hit signals by timing repeated lookups
cr0x@server:~$ for i in 1 2 3; do dig @127.0.0.1 www.cloudflare.com A +noall +stats | grep -E 'Query time|SERVER'; done
;; Query time: 24 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; Query time: 1 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; Query time: 1 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
What the output means: Cold → warm cache behavior is visible. If it’s always slow, you’re not caching effectively.
Decision: If always slow, check whether you configured prefetch off, cache size too small, or you’re forwarding to an upstream that disables caching semantics (rare but happens with “smart” gateways).
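If the cache really is too small, the knobs are msg-cache-size and rrset-cache-size. A sketch with illustrative sizes for a busy central resolver; the defaults (4m each) are fine for a single-host cache, and the rrset cache is conventionally about twice the message cache:
cr0x@server:~$ sudo tee /etc/unbound/unbound.conf.d/cache-size.conf >/dev/null <<'EOF'
server:
    msg-cache-size: 64m
    rrset-cache-size: 128m
EOF
...output...
cr0x@server:~$ sudo unbound-checkconf && sudo systemctl restart unbound
unbound-checkconf: no errors in /etc/unbound/unbound.conf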
Task 9: Detect DNSSEC-related failures quickly
cr0x@server:~$ dig @127.0.0.1 www.iana.org A +dnssec +noall +comments | head -n 5
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4242
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
...output...
What the output means: The ad flag indicates validated data (when available). If you never see ad, validation might not be working, or the client isn’t requesting/printing it properly.
Decision: If you require validation and don’t see AD, inspect trust anchors, time sync, and whether you accidentally disabled validator module.
Task 10: Check time sync (DNSSEC’s quiet dependency)
cr0x@server:~$ timedatectl status
Local time: Wed 2025-12-31 10:00:00 UTC
Universal time: Wed 2025-12-31 10:00:00 UTC
RTC time: Wed 2025-12-31 10:00:01
System clock synchronized: yes
NTP service: active
...output...
What the output means: DNSSEC validation can fail if your clock is wrong enough to make signatures look expired or not-yet-valid.
Decision: If clock isn’t synchronized, fix NTP before you blame Unbound or upstream resolvers.
Task 11: Inspect firewall rules for DNS (UDP and TCP)
cr0x@server:~$ sudo iptables -S | grep -E 'dport 53|sport 53' || true
-A OUTPUT -p udp -m udp --dport 53 -j ACCEPT
-A OUTPUT -p tcp -m tcp --dport 53 -j ACCEPT
What the output means: Outbound DNS is allowed. If TCP is blocked, you’ll see weird intermittent failures when answers don’t fit in UDP.
Decision: Allow TCP/53 outbound. Blocking it is a classic “works until it doesn’t” policy.
Task 12: Confirm upstream reachability (if forwarding)
cr0x@server:~$ dig @1.1.1.1 example.com A +time=2 +tries=1 +noall +stats
;; Query time: 18 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
...output...
What the output means: If your chosen forwarder is slow/unreachable from this host, Unbound will look bad even if it’s innocent.
Decision: Pick forwarders that are reachable, low-latency, and policy-compatible (especially in corporate networks with split DNS).
Task 13: Catch a resolver loop with packet observation
cr0x@server:~$ sudo tcpdump -ni lo port 53 -c 10
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:00:01.000000 IP 127.0.0.1.55555 > 127.0.0.1.53: 1234+ A? example.com. (28)
10:00:01.000100 IP 127.0.0.1.53 > 127.0.0.1.55555: 1234* 1/0/0 A 93.184.216.34 (44)
...output...
What the output means: Loopback DNS traffic exists (good for local resolver). If you see repeated queries without responses, or traffic bouncing between 127.0.0.1 and 127.0.0.53 in patterns, you’ve built the loop.
Decision: If evidence suggests a loop, stop guessing and fix the chain. One resolver should be “the last hop.”
Task 14: Check Unbound’s root trust anchor file exists
cr0x@server:~$ sudo ls -l /var/lib/unbound/root.key
-rw-r--r-- 1 root root 1091 Jan 1 10:00 /var/lib/unbound/root.key
What the output means: Trust anchor file present. If missing, validation will behave unpredictably or fail.
Decision: If missing, install the package that provides it (often unbound-anchor behavior or distro-specific hooks) or generate anchors properly, then restart.
Forwarding vs full recursion: pick one on purpose
Unbound can do full recursion, or it can forward everything (or specific zones) to upstream resolvers. There’s no moral purity here. There are tradeoffs.
Full recursion: autonomy, fewer dependencies, more surface area
With full recursion, Unbound queries root servers and walks down. Benefits:
- Less dependence on upstream resolver behavior.
- Predictable caching and validation policy.
- Often better resilience if one upstream provider has a bad day.
Costs:
- More traffic diversity (many authoritative servers), which can trigger restrictive egress policies.
- More exposure to MTU quirks and “DNS is blocked except to our corporate resolvers” networks.
- More responsibility: you own the whole recursion path.
Forwarding: simplicity, compliance, and corporate reality
Forwarding is common in enterprises: Unbound caches locally but sends misses to a known recursive resolver (internal, ISP, security stack). It’s often the right choice when:
- You need split-horizon internal DNS and can’t replicate the zones elsewhere.
- You have mandated filtering or logging upstream.
- Egress is restricted to only a few resolver IPs.
Safe forwarding config (and how not to forward into your own face)
Add a forward zone file. Use explicit IPs. Not resolv.conf. Not 127.0.0.53.
cr0x@server:~$ sudo tee /etc/unbound/unbound.conf.d/forwarders.conf >/dev/null <<'EOF'
forward-zone:
    name: "."
    forward-addr: 1.1.1.1
    forward-addr: 1.0.0.1
EOF
...output...
cr0x@server:~$ sudo unbound-checkconf
unbound-checkconf: no errors in /etc/unbound/unbound.conf
cr0x@server:~$ sudo systemctl restart unbound
...output...
Decision: If you’re on a corporate network with internal zones, you’ll likely want forwarding to corporate resolvers (not public ones), plus explicit forwarding for internal suffixes if needed.
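Per-zone forwarding for internal names looks like this. A sketch with a hypothetical internal suffix corp.example and placeholder resolver IPs; only the syntax is Unbound's:
cr0x@server:~$ sudo tee /etc/unbound/unbound.conf.d/internal-zones.conf >/dev/null <<'EOF'
server:
    # If the internal zone is unsigned, exempt it from DNSSEC validation.
    domain-insecure: "corp.example."
forward-zone:
    name: "corp.example."
    forward-addr: 10.10.0.2
    forward-addr: 10.10.0.3
EOF
...output...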
Performance and reliability tuning that won’t bite you later
Most Unbound installs don’t need heroic tuning. DNS is light compared to databases. The mistakes come from “optimizing” without a model, and from treating DNS as stateless.
Threading and file descriptors
If you’re running a central resolver serving many clients, consider multiple threads and ensure your system limits aren’t tight. On a single host local cache, one thread is fine.
Typical checks:
cr0x@server:~$ systemctl show unbound -p LimitNOFILE
LimitNOFILE=1048576
Decision: If LimitNOFILE is small (like 1024) on a busy resolver, raise it via systemd override. If it’s already high, don’t touch it just because you can.
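Raising it is a standard systemd drop-in. The value below is illustrative; pair it with Unbound's num-threads option if the resolver is genuinely busy:
cr0x@server:~$ sudo mkdir -p /etc/systemd/system/unbound.service.d
cr0x@server:~$ sudo tee /etc/systemd/system/unbound.service.d/limits.conf >/dev/null <<'EOF'
[Service]
LimitNOFILE=65536
EOF
cr0x@server:~$ sudo systemctl daemon-reload && sudo systemctl restart unbound
...output...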
EDNS buffer size: the quiet fix for “random” failures
We set edns-buffer-size: 1232 for a reason: it reduces fragmentation risk over typical modern paths. This is particularly useful in environments with VPNs, tunnels, or firewalls that mishandle fragments.
When this matters, symptoms are obnoxious: some domains work, others timeout, and retries mysteriously succeed over TCP sometimes. DNSSEC makes it worse because answers are bigger.
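A quick smoke test for this failure mode: request a large signed answer over UDP, then force TCP. The root DNSKEY set is a conveniently big response; timings below are illustrative. If UDP times out or truncates while TCP succeeds, go find the device eating fragments.
cr0x@server:~$ dig @127.0.0.1 . DNSKEY +dnssec +noall +stats | grep 'Query time'
;; Query time: 32 msec
cr0x@server:~$ dig @127.0.0.1 . DNSKEY +dnssec +tcp +noall +stats | grep 'Query time'
;; Query time: 35 msec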
Prefetch: good when you have repeat traffic, bad when you don’t
prefetch: yes means Unbound may refresh popular records before they expire. That smooths tail latency for hot names. But if you have a resolver serving a huge variety of one-off names (think tracking domains, telemetry, ad tech), prefetch can create extra upstream load.
Don’t overthink it; measure. If upstream traffic spikes and you don’t have stable hot sets, disable prefetch.
Logging: verbosity is not observability
Unbound can log a lot. In production, “a lot” becomes “noise plus bills.” Keep it low by default. When you need detail, raise temporarily.
Access controls: avoid becoming part of someone else’s botnet
If you bind Unbound to a LAN IP, you must set access-control correctly. An open recursive resolver will be abused for amplification attacks. You don’t want your DNS server to star in someone else’s incident report.
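The shape of a correct LAN-facing config, with placeholder addresses: deny by default, then allow exactly your subnets. Unbound already refuses unknown sources, but being explicit survives refactors:
cr0x@server:~$ sudo tee /etc/unbound/unbound.conf.d/lan.conf >/dev/null <<'EOF'
server:
    interface: 127.0.0.1
    # Placeholder LAN IP; use your resolver's real address.
    interface: 192.0.2.10
    access-control: 0.0.0.0/0 refuse
    access-control: 127.0.0.0/8 allow
    # Placeholder client subnet.
    access-control: 192.0.2.0/24 allow
EOF
...output...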
Joke #2 (short, relevant): Running an open resolver is a great way to make new friends on the internet. Unfortunately, they only visit when they need bandwidth.
Three corporate mini-stories (realistic, anonymized)
Incident #1: the wrong assumption (the “resolv.conf tells the truth” fallacy)
A mid-sized SaaS company wanted faster deployments. Their CI runners were doing a lot of DNS: pulling containers, hitting artifact stores, calling internal APIs. Someone suggested “just install Unbound on the runners for caching.” Reasonable.
The engineer wrote an Ansible role that installed Unbound and then created a forwarder config pointing to the resolver found in /etc/resolv.conf. This was meant to “respect local network settings.” On paper, very polite. In reality, many runners had /etc/resolv.conf symlinked to systemd-resolved’s stub at 127.0.0.53.
Separately, another automation change had set systemd-resolved’s DNS server to 127.0.0.1 so it would “use the local cache.” Nobody connected the two changes because they landed weeks apart and were owned by different teams.
The incident was subtle at first. CI jobs started timing out during “random” steps: sometimes docker pulls, sometimes a curl to an internal API, sometimes even apt. A few reruns would succeed. The queue length crept up; engineers blamed the build system.
What finally cracked it was a packet capture on loopback showing DNS requests bouncing between 127.0.0.1 and 127.0.0.53 with repeating IDs. Resolver loop. The fix was boring: stop forwarding based on resolv.conf, configure explicit upstreams, and document a single supported resolver chain. Post-incident, the new rule was: “resolv.conf is not a source of truth; it’s an output of policy.”
Incident #2: an optimization that backfired (aggressive caching and surprise outages)
An enterprise internal platform team ran Unbound as a central resolver cluster. They were proud of their DNS performance graphs. One quarter, they decided to “reduce upstream load” by increasing cache-max-ttl dramatically and setting cache-min-ttl to a non-zero value so records would stay warm even if authoritative TTLs were small.
Within days, a downstream team rotated a set of service IPs behind a DNS name as part of a blue/green rollout. The authoritative TTL was intentionally low. The resolver cluster, now enforcing a higher minimum TTL, kept serving stale IPs. Traffic flowed to drained nodes. The rollout looked like a partial outage.
The platform team argued that “DNS shouldn’t be used for load balancing anyway.” They weren’t wrong in principle, but they were wrong in practice: the organization already did it, and the resolver change silently changed the meaning of TTL. That’s not an optimization; that’s a semantic change in a core dependency.
The fix was twofold: revert cache-min-ttl to 0, and create a policy exception list only for domains where they controlled the entire lifecycle and could accept staleness. They also added a pre-change review item: “Will this change alter TTL behavior?” You’d be amazed how many outages start with someone forgetting DNS is part of the contract.
Incident #3: the boring correct practice that saved the day (staging + canary + fast rollback)
A financial services company ran a pair of Unbound resolvers per datacenter, fronted by anycast within each site. Their DNS was intentionally boring: minimal customization, explicit forwarders for internal zones, recursion for everything else, and tight access controls.
One Tuesday, a network team rolled out a firewall policy update. It unintentionally blocked outbound TCP/53 from the resolver VLAN, while leaving UDP/53 allowed. Most queries continued to work. Then the failures began: certain domains (especially DNSSEC-signed with larger responses) started timing out. Applications that depended on those domains failed in ways that looked like “TLS handshake issues” or “random API timeouts.”
Here’s what saved them: they had a canary resolver in each site on a separate policy path, and they routinely ran a small suite of DNS checks against it (including forced TCP DNS queries). The canary lit up within minutes. They could say, confidently, “this is DNS transport, not application.”
Rollback was clean because their Unbound config was managed with a simple versioned approach, and their firewall team had a tested revert play. Service impact stayed contained. Nobody got a hero badge. That’s the point.
Fast diagnosis playbook
When DNS is “slow” or “down,” you don’t have time for philosophy. You need a tight sequence that finds the bottleneck fast.
First: confirm who you’re talking to
- Check /etc/resolv.conf and whether it's a symlink.
- Run dig explicitly against Unbound (@127.0.0.1).
- Compare dig example.com (default) vs dig @127.0.0.1 example.com.
If default queries are slow but direct-to-Unbound is fast, the problem is your system resolver path (resolved/NSS/VPN), not Unbound.
Second: determine whether it’s upstream, DNSSEC, or transport
- Check Unbound logs for timeouts vs SERVFAIL patterns.
- Test known-good domains with +dnssec and look for ad, or expected SERVFAIL on a mis-signed domain.
- Force TCP with +tcp to see if UDP fragmentation is involved.
If TCP works and UDP fails intermittently, suspect MTU/firewall fragment handling. If SERVFAIL appears mostly on signed domains, suspect DNSSEC/time/trust anchor.
Third: verify resource constraints and saturation
- Check if Unbound is CPU bound (load, process CPU), or file descriptor constrained.
- Check for packet loss and retransmits on the resolver’s egress interface.
- Check cache hit rate via unbound-control stats_noreset (if enabled).
If you don’t have stats enabled, enable them before the next incident. “We’ll add observability later” is how later becomes never.
Common mistakes: symptoms → root cause → fix
1) Symptom: SERVFAIL for most queries, sometimes works on retry
Root cause: Resolver loop (Unbound forwarding to systemd stub, which forwards back), or upstream timeouts.
Fix: Make the chain one-way. Use explicit forwarders in Unbound, or disable systemd-resolved. Confirm with readlink -f /etc/resolv.conf and tcpdump -ni lo port 53.
2) Symptom: Some domains consistently fail; others are fine
Root cause: DNSSEC validation failures, MTU/fragmentation problems, or blocked TCP/53.
Fix: Test with dig +dnssec and dig +tcp. Set edns-buffer-size: 1232. Ensure outbound TCP/53 is allowed. Check time sync.
3) Symptom: DNS “works” on the resolver host, but clients can’t use it
Root cause: Unbound bound only to 127.0.0.1; missing access-control for client subnets; firewall blocks inbound 53.
Fix: Add interface: 0.0.0.0 or specific LAN IP, add access-control: 10.0.0.0/8 allow (or your subnet), and allow inbound UDP/TCP 53 on the interface.
4) Symptom: High latency spikes every few minutes
Root cause: Cache too small, prefetch behavior causing bursts, upstream resolver rate limiting, or packet loss.
Fix: Enable stats and inspect cache hit/miss. If forwarding, test upstream latency directly. Consider disabling prefetch if your query set is extremely diverse.
5) Symptom: Internal domains fail, public domains succeed
Root cause: You chose full recursion but your internal zones are only resolvable via corporate resolvers (split-horizon). Or you forward “.” to public resolvers that can’t see internal names.
Fix: Add per-zone forwarding for internal suffixes to corporate DNS servers. Keep recursion or forwarding for the rest as appropriate.
6) Symptom: Everything breaks after enabling DNSSEC validation
Root cause: Clock skew, missing trust anchor, or broken DNSSEC in upstream chain (common with some middleboxes and old forwarders).
Fix: Fix NTP/time. Verify /var/lib/unbound/root.key. If forwarding to upstream that mangles DNSSEC, either switch upstream or disable validation for specific internal zones rather than globally.
7) Symptom: Random NXDOMAIN for names that should exist
Root cause: Search domain mishandling, split DNS confusion, or negative caching after transient upstream failure.
Fix: Inspect resolver search domains and try fully-qualified names with trailing dot in dig. Consider lowering cache-max-negative-ttl if transient NXDOMAINs are frequent (but don’t set it to zero unless you enjoy upstream load).
Checklists / step-by-step plan
Plan A: Localhost caching resolver on a single server (15-minute version)
- Install unbound and dnsutils.
- Configure Unbound to listen on 127.0.0.1 only.
- Enable DNSSEC, set edns-buffer-size: 1232.
- Start Unbound and confirm it's listening on 127.0.0.1:53 (UDP+TCP).
- Choose: disable systemd-resolved (simplest) or configure it to forward to Unbound.
- Verify with dig @127.0.0.1 and getent.
- Run the loop check: ensure Unbound is not forwarding to 127.0.0.53.
Plan B: Central resolver for a subnet (do this if you want other machines to use it)
- Bind Unbound to a LAN IP (not 0.0.0.0 unless you mean it).
- Add access-control entries for client subnets.
- Firewall: allow inbound UDP/TCP 53 from those subnets; allow outbound UDP/TCP 53 to upstreams or the internet (if recursing).
- Decide recursion vs forwarding; configure explicit forwarders if needed.
- Canary test from a client: dig @resolver-ip example.com.
- Add a second resolver for redundancy; point clients to both.
- Instrument: enable Unbound stats; ensure logs are sane and rate-limited.
Plan C: Change management that prevents DNS incidents
- Make resolver chain explicit in docs: client → stub (optional) → Unbound → upstream.
- Add config validation to deployment (unbound-checkconf must pass).
- Canary queries: include a DNSSEC test domain and a forced TCP query (sketched after this list).
- Have a rollback play: revert config, restart, confirm listening and query success.
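A minimal sketch of such a canary, assuming a resolver at 127.0.0.1; the script path and name are hypothetical, and you would add one check per internal name you care about:
cr0x@server:~$ sudo tee /usr/local/bin/dns-canary >/dev/null <<'EOF'
#!/bin/sh
# Exit non-zero on the first failed check.
set -e
R=127.0.0.1
# 1. Plain resolution returns something.
dig @"$R" example.com A +time=2 +tries=1 +short | grep -q .
# 2. Validation still rejects a known-bad domain (SERVFAIL expected).
dig @"$R" dnssec-failed.org A +time=2 +tries=1 | grep -q SERVFAIL
# 3. TCP transport is alive (catches blocked TCP/53).
dig @"$R" example.com A +tcp +time=2 +tries=1 +short | grep -q .
echo "dns-canary: all checks passed"
EOF
cr0x@server:~$ sudo chmod +x /usr/local/bin/dns-canary
cr0x@server:~$ dns-canary
dns-canary: all checks passed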
FAQ
1) Should I run Unbound on every host or centrally?
If you want the simplest reliability story for a fleet, run a small central pool and point clients to it. If you want best per-host latency and resilience against network wobble, run it locally. Hybrid is fine if you document the chain and avoid loops.
2) Is Unbound better than systemd-resolved?
They solve different problems. systemd-resolved is a stub resolver with per-link logic; Unbound is a real recursive caching resolver. Use resolved for network plumbing and Unbound for DNS policy/caching/validation. Just don’t let them chase each other in circles.
3) Do I need DNSSEC validation?
If you can keep time sync solid and your network doesn’t mangle DNS, validation is a good safety net. If your environment includes flaky middleboxes or broken internal DNSSEC, scope it carefully or you’ll trade security for outages.
4) Why do I need TCP/53 if DNS is “UDP”?
Because large responses exist (DNSSEC, large TXT, big NS sets), and truncation forces TCP fallback. Blocking TCP/53 creates intermittent, domain-specific failures that look like ghosts.
5) What’s the safest way to configure forwarders?
Use explicit IPs in forward-zone. Do not “forward to resolv.conf.” resolv.conf is often a stub, and stubs are how loops happen.
6) Can Unbound serve as authoritative DNS too?
It can do some local-data, but it’s not an authoritative DNS server in the way NSD or BIND authoritative mode is. Use it primarily as a resolver and keep authoritative DNS separate unless you’re doing a small, intentional hack (like overriding a few names).
7) How do I know caching is actually working?
Run the same dig query twice against Unbound and compare Query time. For deeper proof, enable statistics and watch cache hit metrics. Also verify that clients are actually sending queries to Unbound, not bypassing it.
8) What’s a good negative caching TTL?
Five minutes (cache-max-negative-ttl: 300) is a pragmatic default. Lower it if transient NXDOMAIN hurts you. Higher values can make mistakes and temporary upstream failures linger longer than you’d like.
9) I enabled Unbound, but VPN split DNS stopped working. Why?
Many VPN clients integrate with systemd-resolved to install per-domain or per-interface DNS routes. If you disable resolved (Option A), you may lose that behavior. Use Option B (resolved stub + Unbound upstream) or manage split DNS explicitly in Unbound with forward zones.
10) How do I avoid becoming an open resolver?
Bind only to needed interfaces, and set access-control to only allow your client networks. Don’t expose it to the public internet. Also verify with ss that you’re not listening on 0.0.0.0 unless intended.
Next steps you should actually do
You can get Unbound running quickly. Getting it running correctly is the part that saves you future incidents.
- Lock down the resolver chain: decide whether systemd-resolved is in the path and make it one-way. No loops.
- Verify with evidence: dig @127.0.0.1 twice for caching, +dnssec for validation, +tcp for transport sanity.
- Choose recursion vs forwarding intentionally: corporate networks usually want forwarding (at least for internal zones).
- Add a canary check: one DNSSEC test, one forced TCP test, one internal name. Run it continuously.
- Keep config boring: avoid clever TTL overrides and “optimize” only after measuring.
DNS is a dependency you can’t avoid. The good news is you can make it predictable. Unbound is one of the few tools in this space that behaves like it wants you to sleep at night.