BIND9: The Config That Looks Right but Causes Intermittent NXDOMAIN

Intermittent NXDOMAIN is the kind of outage that makes grown operators start using the phrase “it’s haunted” in serious meetings. One minute your name resolves; the next minute the same query from the same client returns NXDOMAIN, and your monitoring looks like a seismograph.

This isn’t usually “DNS is down.” It’s DNS doing exactly what you accidentally told it to do. BIND9 is especially good at this: a configuration can look clean, pass named-checkconf, serve most queries correctly, and still manufacture NXDOMAIN in bursts—because caching, views, forwarders, and DNSSEC are all stateful in ways people underestimate.

What intermittent NXDOMAIN really means

NXDOMAIN is a specific statement: “the name does not exist.” It is not “I can’t reach an upstream.” It is not “timeout.” It is not “your packet got dropped.” That’s why intermittent NXDOMAIN is so revealing: something, somewhere, had enough confidence to assert non-existence.

When BIND returns NXDOMAIN, it’s usually in one of these categories:

  • Authoritative NXDOMAIN: BIND is authoritative for the zone and the name truly does not exist (or your zone file says it doesn’t).
  • Cached negative answer: BIND previously learned NXDOMAIN (or NODATA) and cached it. Now it’s repeating it.
  • Policy-generated NXDOMAIN: Response Policy Zones (RPZ), views, ACLs, or special logic cause an NXDOMAIN-like result.
  • Forwarder-provided NXDOMAIN: Your forwarder says NXDOMAIN and BIND trusts it (sometimes too much).
  • DNSSEC corner behavior: Typically DNSSEC failures are SERVFAIL, but misinterpretations and upstream behavior can lead to NXDOMAIN-looking outcomes, especially when you’re mixing validation settings and forwarders.

The “intermittent” part typically comes from state changing underneath you:

  • Negative cache entries expiring at different times (SOA MINIMUM / negative TTL, RFC 2308 behavior).
  • Multiple resolvers or anycast nodes with divergent cache contents.
  • Views selecting different data based on source IP, EDNS Client Subnet, or TSIG key.
  • Forwarders returning inconsistent answers (or one forwarder is broken and chosen some of the time).
  • DNS propagation or zone transfer timing differences.
  • Packet size/fragmentation differences affecting only DNSSEC-signed answers, pushing queries onto TCP for some clients and not others.

If you’re trying to “fix intermittent NXDOMAIN” by restarting named until it stops, you’re not fixing it—you’re playing cache roulette. This is one of those problems where you either instrument and reason about it, or you donate time to the void.

Fast diagnosis playbook

Do these in order. Don’t improvise. The goal is to quickly pin the NXDOMAIN’s origin: authoritative, cached, policy, or upstream.

1) Confirm what you’re actually getting (and from where)

  • Query the suspect resolver directly (not via your OS stub).
  • Capture the response flags: aa, ra, ad, and the SOA in the authority section.
  • Compare with a known-good resolver (a different recursive resolver or a different BIND instance).

2) Decide: authoritative or recursive?

  • If the server responds aa, you’re in authoritative land: zone contents, transfers, and views.
  • If it responds without aa but with ra, you’re in recursive land: cache, forwarders, DNSSEC validation, and policy.

3) Check negative caching and policy before touching DNSSEC

  • Negative caching is the top cause of “it comes and goes.”
  • RPZ/view mismatches can look random when clients come from different NAT pools or IPv6 addresses.

4) Validate the upstream path

  • If you use forwarders, test each forwarder individually for the failing name.
  • Run dig +trace from a non-forwarding resolver host to see what the world thinks.

5) Only then dig into DNSSEC/EDNS/MTU weirdness

  • If failures correlate with DNSSEC-signed zones and UDP size, suspect fragmentation or broken middleboxes.

Paraphrased idea from Werner Vogels (Amazon CTO): “You build it, you run it.” If DNS is “someone else’s,” you’ll keep inheriting mystery failures.

The config that looks right (and why it lies)

Here’s the trap: BIND has enough knobs that two different mental models can both “fit” the same config. One person thinks they built a pure recursive resolver. Another thinks it’s a “forwarding cache.” Someone else believes views are only for split-horizon internal zones. And someone quietly added RPZ for “security.”

Everything can appear to work—until a few names wobble.

A classic “looks right” setup that breeds intermittent NXDOMAIN has these ingredients:

  • Forwarders with mixed behavior: one forwarder is a corporate DNS appliance doing DNS rewriting, another is a plain resolver. BIND rotates or fails over.
  • Views: internal clients get one view, VPN clients get another, and IPv6 clients accidentally fall into “default” view.
  • Negative caching enabled (as designed) with a long negative TTL due to a zone’s SOA MINIMUM or a misread of RFC 2308 semantics.
  • Stale cache settings (or the lack of them) causing answers to differ during upstream brownouts.
  • RPZ or “blocklists” producing NXDOMAIN for a subset of queries based on QNAME, client subnet, or view.
  • Multiple named instances behind a VIP/anycast with uncoordinated cache and update timing.

BIND is deterministic. Your environment isn’t.

Joke 1: DNS is the only system where “non-existent” can be cached as a performance feature. Reality, but with TTLs.

Failure modes that produce intermittent NXDOMAIN

1) Negative caching: NXDOMAIN that persists after you fixed the record

Negative caching is not an implementation quirk; it’s part of DNS. When a resolver learns that foo.example.com doesn’t exist, it can cache that fact. How long? Based on the zone’s SOA and the negative caching rules from RFC 2308. In practice, operators get burned when:

  • They create a record shortly after querying it (or after some automated check queried it before provisioning finished).
  • The resolver cached NXDOMAIN with a negative TTL longer than expected.
  • They only “fixed” one authoritative server, and another still returns NXDOMAIN sometimes.

The intermittency comes from which resolver instance or which cache you hit, and from negative TTLs expiring at different times across a fleet.
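For reference, the negative TTL is derived from the zone's SOA record. A hedged zone-file sketch (names and values are illustrative, matching the examples in this article):

```
; Illustrative SOA for corp.example. Per RFC 2308, a resolver caches a
; negative answer for min(SOA record TTL, MINIMUM) -- here 300 seconds.
; BIND additionally caps this with max-ncache-ttl (default 3 hours).
corp.example.	300	IN	SOA	ns1.corp.example. hostmaster.corp.example. (
		2026020401	; serial
		3600		; refresh
		600		; retry
		1209600		; expire
		300 )		; minimum -> negative-caching TTL
```

If urgent cutovers are common in your environment, this is the field to design deliberately rather than inherit from a template.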

2) Views: correct zone data, wrong clients

Views are powerful and sharp. They select different zone data and options based on client identity (source IP, TSIG, ACL match). One subtle failure: IPv6 clients don’t match the intended ACLs and fall into a default view that doesn’t have the internal zone loaded. They get NXDOMAIN. IPv4 looks fine. This can feel intermittent when dual-stack clients flip between v4 and v6, or when NAT pools spread clients across multiple source ranges.
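A hedged named.conf sketch of the fix: make the internal ACL explicitly dual-stack so IPv6 clients don't fall through to the default view (all addresses are illustrative):

```
acl internal-clients {
	10.0.0.0/8;
	192.168.0.0/16;
	2001:db8:10::/48;	// internal IPv6 range -- without this line,
				// dual-stack clients silently match a later view
};

view "internal" {
	match-clients { internal-clients; };
	zone "corp.example" { type master; file "/etc/bind/zones/db.corp.example"; };
};
```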

3) Forwarders: “forward first” masking upstream weirdness

If you use forward first;, BIND will try the forwarders and, if they fail (timeout/REFUSED), it will fall back to full recursion. That sounds resilient. It’s also a source of inconsistency:

  • If a forwarder returns NXDOMAIN quickly (even wrongly), BIND accepts it and won’t recurse.
  • If a forwarder times out, BIND may recurse and find the correct answer.
  • So the same name can alternate between NXDOMAIN and NOERROR depending on forwarder health.

In regulated environments, forwarders can also apply “helpful” search suffixes, split-horizon, or blocklists. If that policy changes, your BIND becomes the messenger, and it gets blamed.
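If the forwarders are a hard policy requirement, one hedged option is to make the behavior consistent rather than mixed. A named.conf sketch (forwarder addresses are illustrative):

```
options {
	// "forward only" never falls back to recursion: answers are
	// consistently whatever the forwarders say, right or wrong.
	// "forward first" silently mixes policy answers with recursion.
	forwarders { 10.10.10.10; 10.10.10.11; };
	forward only;
};
```

Consistency doesn't make a wrong forwarder right, but it makes the failure reproducible, which is what you need during an incident.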

4) RPZ: policy NXDOMAIN that’s hard to notice

RPZ can rewrite answers to NXDOMAIN, NODATA, CNAME-to-walled-garden, or something else. If RPZ is applied only in one view, only some clients see NXDOMAIN. If RPZ updates are pulled asynchronously, a subset of your resolvers may have the policy loaded while others don’t. Congratulations: intermittent NXDOMAIN.
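For orientation, a minimal RPZ zone sketch: in RPZ, the CNAME target encodes the action, and `rpz-passthru.` exempts a name from policy (all names here are illustrative):

```
$TTL 300
@	IN	SOA	ns.rpz.local. hostmaster.rpz.local. 1 3600 600 1209600 300
	IN	NS	ns.rpz.local.

; CNAME to the root (.) means "answer NXDOMAIN" for this name
badhost.example.net	IN	CNAME	.
; CNAME to "*." would mean NODATA; "rpz-passthru." exempts the name
goodhost.example.net	IN	CNAME	rpz-passthru.
```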

5) Authoritative inconsistency: partial propagation, broken transfers, or “hidden master” assumptions

If BIND is authoritative (or you have authoritative servers elsewhere), intermittent NXDOMAIN often means not all authoritative servers agree. Common causes:

  • One secondary isn’t transferring updates (TSIG mismatch, firewall, notify issues).
  • Serial numbers aren’t incremented reliably.
  • Dynamic updates (DDNS) are writing to one server, but secondaries don’t pick it up, or views split the update path.

Resolvers pick different authoritative servers based on RTT and cache. You see “sometimes NXDOMAIN.” It’s not sometimes. It’s “from that one server.”

6) Search domains + ndots: the “I didn’t ask for that name” problem

Lots of NXDOMAINs are self-inflicted by clients. With search domains and ndots, a query for api might turn into api.corp.example, api.prod.corp.example, and finally api. The resolver might cache negative responses for those expansions. If your app logs only the short name, your DNS logs show a different name. The resulting incident report reads like a surrealist play.
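A hedged sketch of the client side that produces these expansions (a glibc-style resolv.conf; values are illustrative):

```
# /etc/resolv.conf -- with ndots:5, any name containing fewer than
# five dots is tried with each search suffix first, so a lookup of
# "api" becomes api.prod.corp.example, api.corp.example, then "api."
nameserver 10.0.0.53
search prod.corp.example corp.example
options ndots:5
```

Each failed expansion is a candidate for negative caching on the resolver, multiplying one typo into several cached NXDOMAINs.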

7) DNSSEC/EDNS/MTU: not a direct NXDOMAIN generator, but a chaos amplifier

DNSSEC failures usually show up as SERVFAIL on validating resolvers. But DNSSEC affects payload sizes and transport behavior. Some clients retry over TCP; some don’t. Some middleboxes drop fragments. A forwarder might degrade into returning NXDOMAIN-like behavior for certain names if its own validation or policy engine gets confused. The intermittent part often correlates with:

  • Only some networks having broken PMTUD or fragment reassembly.
  • Only some names being DNSSEC-signed or having large responses (many records, long RRSIG chains).
  • Anycast or multi-resolver fleets where only some nodes have the broken path.
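If fragmentation is implicated, a hedged named.conf sketch is to cap the advertised EDNS buffer so oversized answers fall back to TCP instead of fragmenting (1232 bytes is the value commonly recommended since DNS Flag Day 2020):

```
options {
	// Advertise a UDP payload size that fits in a single unfragmented
	// packet on typical paths; larger answers get TC=1 and retry over TCP.
	edns-udp-size 1232;
	max-udp-size 1232;
};
```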

Practical tasks: commands, outputs, and decisions (12+)

These are the moves I actually want people to run during an incident. Each task includes: command, what the output means, and what decision to make next.

Task 1: Query the resolver directly and read the flags

cr0x@server:~$ dig @127.0.0.1 www.example.com A +noall +answer +authority +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 53188
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; AUTHORITY SECTION:
example.com.	300	IN	SOA	ns1.example.net. hostmaster.example.com. 2026020401 3600 600 1209600 300

Meaning: Status NXDOMAIN, recursion available (ra), not authoritative (aa absent). This is a recursive answer (or forwarded) and the SOA tells you which zone asserted non-existence.

Decision: Treat as cache/forwarder/policy problem first. If aa were present, pivot to authoritative zone data and transfers.
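The flag-reading step can be scripted for fleet-wide checks. A minimal sketch, assuming dig's +comments output on stdin (classify_response is a hypothetical helper name, not a BIND tool):

```shell
# classify_response: read `dig ... +comments` output on stdin, print the
# status code and whether the answer was authoritative (aa flag present).
classify_response() {
  awk '
    /->>HEADER<<-/ {
      for (i = 1; i <= NF; i++)
        if ($i == "status:") { s = $(i + 1); sub(/,$/, "", s) }
    }
    /^;; flags:/ {
      split($0, p, ";")                     # p[3] holds the flag list
      mode = index(p[3] " ", " aa ") ? "authoritative" : "recursive"
    }
    END { print s, mode }
  '
}

# Usage: dig @127.0.0.1 www.example.com A +comments | classify_response
```

Run against each resolver node behind a VIP and diff the results; divergent output is your "intermittent."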

Task 2: Compare against a known-good resolver

cr0x@server:~$ dig @9.9.9.9 www.example.com A +noall +answer +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4821
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

www.example.com.	60	IN	A	203.0.113.10

Meaning: The public resolver gets NOERROR with an A record. Your BIND returning NXDOMAIN is not “the internet.” It’s your stack.

Decision: Investigate negative caching, views, RPZ, and forwarders. Don’t touch the authoritative zone yet.

Task 3: See whether your BIND is forwarding and to whom

cr0x@server:~$ sudo named-checkconf -p | sed -n '1,160p'
options {
	directory "/var/cache/bind";
	recursion yes;
	allow-recursion { 10.0.0.0/8; 192.168.0.0/16; };
	forwarders { 10.10.10.10; 10.10.10.11; };
	forward first;
	dnssec-validation auto;
};

zone "corp.example" IN {
	type master;
	file "/etc/bind/zones/db.corp.example";
};

Meaning: You are forwarding first to two internal forwarders. That alone can create intermittent differences if forwarders disagree or one is unhealthy.

Decision: Test each forwarder directly for the failing name (next task). If forwarders return NXDOMAIN, BIND is just the courier.

Task 4: Query forwarders individually

cr0x@server:~$ dig @10.10.10.10 www.example.com A +noall +answer +authority +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 30211
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

example.com.	600	IN	SOA	ns1.example.net. hostmaster.example.com. 2026020401 3600 600 1209600 600
cr0x@server:~$ dig @10.10.10.11 www.example.com A +noall +answer +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6114
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

www.example.com.	60	IN	A	203.0.113.10

Meaning: Your forwarders disagree. With forward first, your BIND may sometimes accept NXDOMAIN from one forwarder and sometimes get the correct answer from the other.

Decision: Fix or remove the misbehaving forwarder. If you can’t, consider forward only; (for strict policy) or drop forwarding and recurse yourself (for correctness), but don’t mix “sometimes policy” with “sometimes reality.”

Task 5: Check if the NXDOMAIN is being cached (negative cache)

cr0x@server:~$ sudo rndc dumpdb -cache
Dumping caches.
cr0x@server:~$ sudo grep -n "www.example.com" /var/cache/bind/named_dump.db | head
21483:www.example.com.	300	negative
21484:www.example.com.	300	IN	SOA	example.com. hostmaster.example.com. 2026020401 3600 600 1209600 300

Meaning: BIND has a cached negative entry for www.example.com with 300 seconds remaining.

Decision: If you know the name exists now, flush it to stop serving stale NXDOMAIN and then fix the upstream cause that injected NXDOMAIN.

Task 6: Flush a single name (surgical, not “flush everything”)

cr0x@server:~$ sudo rndc flushname www.example.com
flushed

Meaning: Removes cached data for that name (including negative cache), forcing re-resolution.

Decision: Re-query immediately. If NXDOMAIN returns again, it’s not “stale cache”—it’s upstream or policy.

Task 7: Check whether RPZ is rewriting the answer

cr0x@server:~$ sudo named-checkconf -p | grep -n "response-policy"
42:	response-policy { zone "rpz-block"; };
cr0x@server:~$ dig @127.0.0.1 www.example.com A +noall +answer +authority
rpz-block.	300	IN	SOA	ns.rpz.local. hostmaster.rpz.local. 2026020401 3600 600 1209600 300

Meaning: The SOA in the authority section is from an RPZ-related domain, not example.com. That’s a smoking gun: policy generated this “NXDOMAIN.”

Decision: Inspect RPZ zone content and update pipeline. Decide whether this name should be blocked. If not, fix the RPZ feed or override with policy exceptions.

Task 8: Validate whether views are selecting different zone data

cr0x@server:~$ sudo named-checkconf -p | sed -n '1,220p' | grep -n "view\|match-clients\|zone \"corp.example\""
61:view "internal" {
62:	match-clients { 10.0.0.0/8; 192.168.0.0/16; };
80:	zone "corp.example" { type master; file "/etc/bind/zones/db.corp.example"; };
95:};
97:view "external" {
98:	match-clients { any; };
112:	zone "corp.example" { type master; file "/etc/bind/zones/db.corp.example.external"; };
130:};

Meaning: Different clients can get different versions of the same zone. That’s not automatically wrong. It’s automatically risky.

Decision: Confirm which clients match which view (next tasks) and whether the “external” zone file intentionally omits names that are now needed by VPN/cloud clients.

Task 9: Query from different source addresses to reproduce view selection

cr0x@server:~$ dig @10.0.0.53 app.corp.example A +noall +answer +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15054
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

app.corp.example.	30	IN	A	10.20.30.40
cr0x@server:~$ dig @198.51.100.53 app.corp.example A +noall +answer +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 4490
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

Meaning: Same resolver software, different address/path, different answer. That is usually views or policy, not “DNS randomness.”

Decision: Decide whether app.corp.example should be visible externally. If yes, fix the external view zone data or match-clients rules.

Task 10: Verify authoritative consistency with +trace

cr0x@server:~$ dig +trace www.example.com A
; <<>> DiG 9.18.24 <<>> +trace www.example.com A
;; Received 811 bytes from 127.0.0.1#53(127.0.0.1) in 0 ms

example.com.	172800	IN	NS	ns1.example.net.
example.com.	172800	IN	NS	ns2.example.net.
;; Received 525 bytes from 192.0.2.53#53(a.root-servers.net) in 19 ms

www.example.com.	60	IN	A	203.0.113.10
;; Received 116 bytes from 203.0.113.53#53(ns1.example.net) in 24 ms

Meaning: The authoritative path returns NOERROR. If your resolver returns NXDOMAIN, the issue is between your resolver and the authority: forwarders, cache, policy, or broken resolution path.

Decision: If forwarders are involved, fix them. If not, inspect your resolver logs and cache behavior.

Task 11: Turn on targeted query logging (temporarily) to see the source of NXDOMAIN

cr0x@server:~$ sudo rndc querylog
query logging is now on
cr0x@server:~$ sudo journalctl -u bind9 -n 20 --no-pager
Feb 04 11:12:01 dns1 named[1234]: client @0x7f2a1c0 10.1.2.3#55214 (www.example.com): query: www.example.com IN A +E(0)K (10.10.10.10)
Feb 04 11:12:01 dns1 named[1234]: client @0x7f2a1c0 10.1.2.3#55214 (www.example.com): forwarding failed, response NXDOMAIN from 10.10.10.10

Meaning: The log explicitly shows forwarding and the upstream that returned NXDOMAIN.

Decision: Stop debating. Fix 10.10.10.10 or remove it. Then turn query logging back off; it’s noisy in production.

Task 12: Check DNSSEC validation status (the ad flag) before chasing transport issues

cr0x@server:~$ dig @127.0.0.1 dnskey example.com +dnssec +noall +comments +answer
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38712
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

Meaning: ad indicates validated data (when the client requested it and the resolver validates). If you never see ad where you expect it, validation may be off or failing.

Decision: If DNSSEC is in play and failures correlate with specific networks, inspect EDNS size and TCP fallback (next tasks). Even if it’s not NXDOMAIN directly, it can trigger different retry behavior that looks “intermittent.”

Task 13: Check EDNS UDP size and force TCP to test fragmentation hypotheses

cr0x@server:~$ dig @127.0.0.1 www.example.com A +dnssec +bufsize=4096 +noall +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19640
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
cr0x@server:~$ dig @127.0.0.1 www.example.com A +tcp +noall +comments +answer
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28640
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

www.example.com.	60	IN	A	203.0.113.10

Meaning: If UDP sometimes fails but TCP always works, you likely have a path MTU / fragmentation issue somewhere between clients and resolver, or between resolver and upstream.

Decision: Reduce advertised UDP size on the resolver (or fix the network). Don’t “solve” it by disabling DNSSEC everywhere unless you enjoy future incidents.

Task 14: Inspect BIND statistics for spikes in NXDOMAIN/negative answers

cr0x@server:~$ sudo rndc stats
Statistics dump file written to /var/cache/bind/named.stats
cr0x@server:~$ sudo sed -n '1,120p' /var/cache/bind/named.stats
+++ Statistics Dump +++
++ Incoming Requests ++
               18452 QUERY
++ Outgoing Requests ++
               10211 QUERY
++ Name Server Statistics ++
                1423 NXDOMAIN
                 211 SERVFAIL

Meaning: High NXDOMAIN counts can be normal (typos, search domains), or they can be your incident. Pair this with query logging or a qname filter to see what names dominate.

Decision: If NXDOMAIN spikes are tied to a single domain/service, focus there. If they’re broad, suspect forwarder/policy/view selection.

Task 15: Verify zone file correctness if authoritative

cr0x@server:~$ sudo named-checkzone corp.example /etc/bind/zones/db.corp.example
zone corp.example/IN: loaded serial 2026020401
OK

Meaning: The zone parses and loads; serial is visible.

Decision: If intermittent NXDOMAIN occurs and you’re authoritative, compare serials across all authorities. A lagging secondary is a common culprit.

Task 16: Confirm secondaries have the same serial (authoritative consistency)

cr0x@server:~$ dig @ns1.corp.example corp.example SOA +noall +answer
corp.example.	300	IN	SOA	ns1.corp.example. hostmaster.corp.example. 2026020401 3600 600 1209600 300
cr0x@server:~$ dig @ns2.corp.example corp.example SOA +noall +answer
corp.example.	300	IN	SOA	ns1.corp.example. hostmaster.corp.example. 2026020317 3600 600 1209600 300

Meaning: ns2 is behind (older serial). Resolvers hitting ns2 may get NXDOMAIN for recently added names.

Decision: Fix zone transfers (TSIG, allow-transfer, notify, firewall). Don’t paper over it with low TTLs; that just makes the pain arrive faster.
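The serial comparison in this task is worth automating. A minimal sketch under the same illustrative names (check_serials is a hypothetical helper; the commented-out gathering loop assumes dig is installed):

```shell
# check_serials: read "server serial" lines on stdin; print a MISMATCH
# line for every server whose serial differs from the first one seen.
check_serials() {
  awk 'NR == 1 { ref = $2 }
       $2 != ref { print "MISMATCH: " $1 " serial " $2 " != " ref }'
}

# Gathering step (zone and server names are illustrative):
# for ns in ns1.corp.example ns2.corp.example; do
#   printf '%s %s\n' "$ns" \
#     "$(dig @"$ns" corp.example SOA +short | awk '{print $3}')"
# done | check_serials
```

Wire the output into your alerting and a lagging secondary stops being a mystery and becomes a ticket.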

Common mistakes: symptom → root cause → fix

This is the section you’ll recognize mid-incident, because you’ve seen at least two of these in your career. If you haven’t, congratulations on your charmed life.

1) “It resolves for me but not for users”

Symptom: IT staff on corp network get NOERROR; remote users/VPN/mobile get NXDOMAIN.

Root cause: Views or split-horizon. The remote users land in a different view (often due to IPv6 source addresses or different NAT pools), which lacks the record or has RPZ applied.

Fix: Audit match-clients ACLs for both v4 and v6, and ensure the correct zones/policies exist in all intended views. Test from representative source networks, not only from the DNS server itself.

2) “We added the record, but it’s still NXDOMAIN sometimes”

Symptom: After provisioning, some clients still get NXDOMAIN for minutes to hours.

Root cause: Negative caching. The NXDOMAIN was learned and cached before the record existed, and the negative TTL is longer than you assumed.

Fix: Flush specific names on recursive resolvers (rndc flushname) when doing urgent cutovers. Longer term: lower negative TTLs by designing SOA MINIMUM/negative TTL consciously, and avoid pre-checks that query names before they exist (or isolate those checks to non-production resolvers).

3) “It fails only on Mondays (or randomly), and restarting BIND helps”

Symptom: Intermittent NXDOMAIN that “goes away” after restart.

Root cause: Cache reset hides the underlying inconsistent upstream: forwarders disagree, RPZ feed updates are inconsistent, or one authoritative is out of sync.

Fix: Stop restarting as a cure. Identify which upstream generated NXDOMAIN via query logging or forwarder testing. Fix the source; then cache resets become unnecessary and rare.

4) “Only some records under a zone fail, others are fine”

Symptom: A handful of hostnames NXDOMAIN; others under the same zone resolve.

Root cause: Wildcards and empty non-terminals, or the record exists only in one view/one authoritative. Another common root: you created _service._tcp-style records but clients query different types and get NODATA/NXDOMAIN confusion.

Fix: Check authoritative answers directly. Inspect zone files for delegations, wildcards, and missing nodes. Verify all authorities have the same serial and data.

5) “It’s fine on IPv4, broken on IPv6”

Symptom: Dual-stack clients see intermittent NXDOMAIN depending on which stack they use.

Root cause: View ACLs missing IPv6 ranges, upstream reachability differences, or missing AAAA/glue records causing different resolution paths.

Fix: Make view ACLs explicitly dual-stack. Ensure forwarders and root hints are reachable over IPv6 if you serve v6 clients. Test with dig -6 and compare.

6) “It happens only for some public domains”

Symptom: A subset of external domains returns NXDOMAIN from your resolver, but the rest resolve.

Root cause: RPZ blocks, DNS filtering upstream, or a broken forwarder doing NXDOMAIN remapping for “security.”

Fix: Identify whether the SOA in the NXDOMAIN response points to the real zone or a policy zone. If policy, fix policy. If forwarder, replace or reconfigure it. Don’t accept “security appliance says so” without auditability.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

The company had two internal DNS resolvers behind a load balancer. The team assumed they were “identical,” because they were built from the same automation. That assumption lived happily for months, as assumptions often do.

Then a new internal service launched with a hostname that didn’t exist until the deployment pipeline created it. A health check—run early—queried the name repeatedly. For a brief window, resolvers learned NXDOMAIN and cached it. Soon after, the record appeared and started resolving… sometimes. Some clients hit resolver A (still negative-caching NXDOMAIN), others hit resolver B (cache expired already). The on-call saw it as a flaky app rollout.

They restarted the load balancer pool member that “seemed bad.” The symptoms disappeared. The postmortem almost ended there, which is how you keep the same incident on subscription.

A week later, it happened again. This time someone bothered to compare the SOA MINIMUM fields for their internal zone. One resolver had an older zone version with a longer negative TTL due to a stale secondary transfer. Not identical. Not even close when it mattered.

The fix wasn’t heroic. They repaired transfers, standardized SOA/negative TTL settings, and changed the deployment health check to query only after record creation or to query a staging resolver. The wrong assumption was “identical nodes.” DNS makes liars of that assumption quickly.

Mini-story 2: The optimization that backfired

A different org was proud of their “fast DNS.” They enabled forwarders to an internal caching layer that promised low latency and “security filtering.” They also set forward first so BIND could recurse if the forwarder had issues. Best of both worlds, right?

It worked—until the forwarder vendor pushed a policy update. A set of domains started returning NXDOMAIN due to categorization mistakes. Meanwhile, the forwarder cluster was slightly overloaded; sometimes it timed out. In those cases, BIND fell back to recursion and returned the correct answer. So users saw intermittent NXDOMAIN depending on whether the forwarder responded quickly (wrongly) or slowly (not at all).

The vendor insisted “no outage.” Their systems were up. True. The customers’ names were down, intermittently. The BIND team got to be the adult in the room and prove, with logs, that “fast wrong answers” were worse than “slow correct answers.”

They switched to forward only temporarily to make behavior consistent while they escalated the policy issue. Consistency reduced tickets immediately, even though it meant “always blocked” for those domains until policy was corrected. After the forwarder was fixed, they reevaluated whether they needed that layer at all.

Joke 2: The only thing worse than a slow DNS response is a fast one that’s confidently wrong.

Mini-story 3: The boring but correct practice that saved the day

A finance-heavy company ran authoritative DNS for a few internal zones and a public zone for customer endpoints. They had a routine: every zone change required a serial bump, named-checkzone, and a quick SOA serial verification across all authorities. No exceptions. It was unpopular with people who thought DNS is “just a text file.”

One afternoon, a change request added a new customer endpoint record. The primary updated fine. One secondary didn’t. But because the team always checked SOA serials, they caught it within minutes. They hadn’t received a single customer complaint yet.

The root cause was mundane: a firewall rule had been “cleaned up” the night before, and TCP/53 for zone transfers was removed between the primary and that secondary. The secondary kept serving the old zone. If this had reached customers, they would have seen intermittent NXDOMAIN depending on which authoritative server their resolver picked.

The team restored the firewall rule, confirmed the transfer, and the record became globally consistent. The boring practice—serial checks—prevented a customer-facing incident. In ops, boring is often another word for “effective.”

Interesting facts and historical context

  • NXDOMAIN is standardized: It’s DNS RCODE 3, meaning “Name Error,” and it’s distinct from “no data” (NOERROR with empty answer).
  • Negative caching wasn’t always common: RFC 2308 formalized negative caching behavior to reduce unnecessary repeated queries for non-existent names.
  • SOA MINIMUM got repurposed: Historically, SOA MINIMUM was used as a default TTL for records; later, it became associated with negative caching TTL semantics.
  • BIND has been the reference implementation for decades: BIND’s behavior influenced operational expectations across the industry, for better and occasionally for “why is it doing that.”
  • Split-horizon DNS is older than cloud: Enterprises have used views/split-horizon since long before Kubernetes made everyone discover service discovery the hard way.
  • RPZ was built for policy at scale: It lets operators inject response modifications without modifying authoritative zones, which is powerful and easy to abuse.
  • DNSSEC made answers bigger: Validation introduces additional records (RRSIG, DNSKEY, DS) that increase response size and make transport/path issues more visible.
  • Resolvers prefer UDP until they can’t: DNS commonly starts on UDP/53; TCP fallback behavior exists, but client stacks and middleboxes vary widely.
  • “Authoritative” is per-zone, not per-server: A server can be authoritative for one zone and recursive for everything else, which is a classic way to confuse troubleshooting.

Checklists / step-by-step plan

Checklist A: Determine where NXDOMAIN is coming from

  1. Run dig @your-resolver name type +noall +answer +authority +comments.
  2. Record: status, presence of aa, presence of ra, and the SOA owner name in AUTHORITY.
  3. If SOA matches the real zone: likely genuine authoritative NXDOMAIN or cached from authority.
  4. If SOA looks like an RPZ/policy zone: policy-generated NXDOMAIN.
  5. If you use forwarders: query each forwarder directly and compare.

Checklist B: Fix without collateral damage

  1. Flush only the impacted name(s) on recursive resolvers: rndc flushname.
  2. If authoritative inconsistency: verify SOA serials across all authoritative servers and repair transfers.
  3. If views: ensure the record exists in the intended view(s), and ACLs include IPv4 and IPv6 client ranges.
  4. If forwarders disagree: remove the bad one or fix its policy/zone. Don’t keep “sometimes wrong” in rotation.
  5. If RPZ: confirm whether it’s a true positive block; add exceptions if needed; fix feed sync.

Checklist C: Make it hard to regress

  1. Add synthetic monitoring that queries each resolver node directly, not just the VIP.
  2. Alert on sudden increases in NXDOMAIN for specific critical names (not overall NXDOMAIN volume).
  3. Track forwarder health independently and remove them automatically if they return nonsense for a canary domain.
  4. For authoritative zones, automate SOA serial parity checks across all authorities.
  5. Keep view definitions minimal. Every view is a branch in your incident tree.

FAQ

1) Why is NXDOMAIN intermittent instead of consistently wrong?

Because DNS is distributed and cached. Different resolvers, different cache states, different upstreams, different views. Your system can be “consistently inconsistent.”

2) How can I tell if BIND is authoritative for a zone?

Query and check for the aa flag in the response. Also inspect named-checkconf -p for the zone definition. If it’s type master or type slave, it’s authoritative for that zone.

3) What’s the difference between NXDOMAIN and NODATA?

NXDOMAIN means the name doesn’t exist at all. NODATA is NOERROR but empty answer for that type (for example, the name exists but has no A record). Both can be negative-cached, and both can break apps in different ways.

4) Can DNSSEC cause NXDOMAIN?

DNSSEC validation failures typically produce SERVFAIL on validating resolvers. But DNSSEC increases response size and complexity, which can trigger transport issues or forwarder quirks that show up as intermittent wrong answers, including NXDOMAIN from upstream policy engines.

5) Should I use forward first or forward only?

If you need strict policy enforcement from forwarders, use forward only and accept that policy is the source of truth. If you want correctness and independence, don’t forward at all—recurse directly. forward first can create inconsistent behavior when forwarders are flaky or wrong.

6) Is flushing the cache a valid fix?

Flushing a specific name is a valid mitigation when negative caching is the immediate problem. Flushing the entire cache is a blunt instrument that hides root causes, increases upstream load, and tends to make the next failure worse.

7) How do I prove it’s an RPZ issue?

Look at the SOA in the authority section of the NXDOMAIN response; policy NXDOMAIN often points at an RPZ-controlled SOA. Also check for response-policy in the effective config and inspect the RPZ zone contents.

8) Why do only some clients hit the wrong view?

Because match-clients ACLs often don’t include all real client source ranges (especially IPv6), or because traffic comes through NAT, VPN pools, or split tunneling. Clients you think are “internal” might not look internal to the DNS server.

9) What’s the quickest way to isolate a broken forwarder?

Query each forwarder directly for a failing name and a known-good canary name, back-to-back. If one returns NXDOMAIN while others return NOERROR, it’s the culprit. Then confirm with query logging on BIND.

10) How do I stop this class of incident from recurring?

Monitor per-node resolver behavior, control negative TTL expectations, keep views/policy explicit and audited, and test forwarders like dependencies—because they are.

Conclusion: next steps you can actually do

Intermittent NXDOMAIN isn’t a mystical DNS mood swing. It’s almost always one of four things: negative caching, views, forwarders, or policy (RPZ). DNSSEC and network transport problems are frequent accomplices, not usually the original author of NXDOMAIN.

Practical next steps:

  1. Pick one failing name and reproduce with dig directly against your resolver. Save the output with flags and SOA.
  2. Query each forwarder individually. If they disagree, you’ve found your “intermittent.”
  3. Dump cache and confirm whether a negative entry exists. Flush the specific name to mitigate.
  4. Inspect views and RPZ configuration in the rendered config (named-checkconf -p), not in someone’s half-remembered mental model.
  5. If authoritative, verify SOA serial parity across all authoritative servers before you blame resolvers.
  6. After the incident: add per-node DNS checks and a canary query that should never be NXDOMAIN, and alert when it is.

Do those, and the next time someone says “DNS is flaky,” you can respond with evidence, not vibes.
