DNS Resolvers: Negative Caching — The Setting That Makes Outages Last Longer


Here’s a classic outage timeline: you fat-finger a DNS record, realize it immediately, fix it in a minute… and users keep failing for another 30–60 minutes. The ticket says “DNS is still broken,” your dashboards say “authoritative is correct,” and the business says “are you sure?”

Welcome to negative caching: the resolver feature that caches failures (like NXDOMAIN) and makes short mistakes feel like long incompetence. It’s not malicious. It’s just doing its job—often with defaults that were chosen for yesterday’s internet and today’s latency budget.

Negative caching in plain English

DNS is a distributed database that mostly returns “here’s the IP” (or “here’s the CNAME”). Negative caching is what happens when it returns “no such name” (NXDOMAIN) or “no data” (NODATA: name exists, but not that record type), and the resolver decides to remember that failure for a while.

That “for a while” is the whole story. Negative caching converts transient lookups into fewer upstream queries, reduces load on authoritative servers, and makes users faster when a name truly doesn’t exist. But when a name starts existing—because you just created it, fixed it, or changed delegation—negative caching can keep clients stuck in the past.

Two different truths can coexist:

  • Your authoritative server answers correctly right now.
  • Your customers’ resolvers are still convinced the name does not exist.

If you’ve ever watched an SRE team “wait out DNS” with clenched teeth, this is why.

Dry-funny joke #1: DNS propagation isn’t real—DNS cache expiration is. Unfortunately, marketing prefers the first one.

How negative caching actually works (and why SOA matters)

NXDOMAIN vs NODATA: same pain, different semantics

NXDOMAIN means the queried name doesn’t exist in that zone. Example: you ask for api.example.com, and the authoritative server says “never heard of it.”

NODATA (often seen as NOERROR with an empty answer section) means the name exists, but not the requested type. Example: www.example.com exists with an A record, but you ask for AAAA and the zone has none. Many resolvers negative-cache this too.

Operationally: both can wedge rollouts, especially dual-stack rollouts (AAAA) and service discovery patterns that lean on “missing means disabled.”
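A quick way to tell the two apart during an incident is to parse dig's header instead of eyeballing it. This is a sketch: the parsing assumes dig 9.x output format, and the hostname in the usage comment is a placeholder.

```shell
#!/bin/sh
# Classify a captured dig response as NXDOMAIN, NODATA, or POSITIVE
# by reading the status and ANSWER count from the header lines.
classify() {
  out="$1"
  status=$(printf '%s\n' "$out" | sed -n 's/.*status: \([A-Z]*\),.*/\1/p')
  answers=$(printf '%s\n' "$out" | sed -n 's/.*ANSWER: \([0-9]*\),.*/\1/p')
  if [ "$status" = "NXDOMAIN" ]; then
    echo "NXDOMAIN (name does not exist; cacheable negative)"
  elif [ "$status" = "NOERROR" ] && [ "$answers" = "0" ]; then
    echo "NODATA (name exists, type missing; also cacheable)"
  elif [ "$status" = "NOERROR" ]; then
    echo "POSITIVE"
  else
    echo "OTHER:$status (SERVFAIL etc. is not negative caching)"
  fi
}

# Usage against a live resolver (hypothetical name):
#   classify "$(dig api.example.internal A)"
```

The same three-way split drives the playbook later: NXDOMAIN and NODATA are cache problems; anything else is a different investigation.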

Negative caching TTL is usually driven by the SOA

Negative caching is standardized behavior (RFC 2308), and the TTL used to cache negative answers is derived from the zone’s SOA record.

Here’s the practical version:

  • Authoritative answers often include the zone’s SOA in the authority section for negative responses.
  • The resolver uses the SOA’s TTL and/or “minimum” field (depending on server and resolver behavior) to decide how long to cache the negative response.
  • Resolvers may also cap negative TTLs with local policy (for sanity and security).

If your zone SOA is set with a negative TTL of an hour, you just gave every resolver permission to remember your mistake for an hour. That may be fine for typosquatting protection and query load. It is not fine for fast-moving service deployments that create names on demand.
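Per RFC 2308, the effective negative TTL is the lesser of the SOA record's own TTL and its minimum field. A small sketch that computes it from a SOA line as dig prints it; the zone name and values are placeholders.

```shell
#!/bin/sh
# Effective negative TTL per RFC 2308: the lesser of the SOA record's
# own TTL (field 2) and its minimum field (field 11), as dig prints it.
neg_ttl() {
  printf '%s\n' "$1" | awk '{ t=$2+0; m=$11+0; print (t < m ? t : m) }'
}

# Hypothetical zone: SOA TTL 1800, minimum 1800 -> 1800
neg_ttl 'example.internal. 1800 IN SOA ns1.example.internal. hostmaster.example.internal. 2026020401 3600 900 1209600 1800'
# If the SOA record's own TTL is lower, it wins: -> 300
neg_ttl 'example.internal. 300 IN SOA ns1.example.internal. hostmaster.example.internal. 2026020401 3600 900 1209600 1800'
```

Worth running against your own zones: the number this prints is the permission slip you hand every resolver on the internet.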

Who is “the resolver” anyway?

DNS resolution often crosses multiple caches:

  • Stub resolver on the host (or systemd-resolved)
  • Node-local cache (common in Kubernetes, e.g., NodeLocal DNSCache)
  • Recursive resolver (Unbound, BIND, PowerDNS Recursor, corporate resolvers, ISP resolvers)
  • Application caches (JVM DNS cache, Go’s resolver behavior, in-process caches)
  • CDN / security appliances that proxy DNS

Negative caching can happen at several layers. That’s why “I flushed my laptop DNS” sometimes changes nothing: the cache holding the grudge isn’t on your laptop.

Quote worth taping to your monitor

Hope is not a strategy. — often attributed to Gene Kranz

DNS incidents love “hope.” Hope that caches expire soon. Hope that clients retry. Hope that the other team didn’t set TTLs to geological time. Don’t hope. Measure, bound, and verify.

Facts and historical context worth knowing

  1. Negative caching was formalized to protect the DNS from needless repeated queries when names don’t exist (RFC 2308). It’s a scaling feature, not a bug.
  2. “SOA minimum” used to mean default TTL in older DNS practice; modern zones use explicit TTLs, but that legacy naming still causes confusion during incident response.
  3. Resolvers almost always cache NXDOMAIN unless configured otherwise, because re-asking authoritative servers for nonexistent names over and over can load them just as heavily as success traffic.
  4. Negative caching applies to more than NXDOMAIN: “no AAAA record” is often cached as a negative answer. That matters for IPv6 rollouts.
  5. Some resolvers cap negative TTLs (for example to a few minutes) to reduce blast radius of mistakes, but enterprises frequently override caps “for performance.”
  6. DNSSEC changed the game for negative responses: authenticated denial of existence (NSEC/NSEC3) makes negative responses cryptographically provable, which also makes them more confidently cacheable.
  7. Browsers and runtimes got more aggressive about caching over time to reduce latency. That helps page loads; it also prolongs “we fixed it” moments.
  8. Kubernetes made DNS more central to application health than many enterprises were ready for; negative caching is particularly painful when services and endpoints are ephemeral.

Where negative caches live: stub, recursive, application, and “helpful” middleboxes

1) Host-level stub resolvers

On Linux, you may be dealing with systemd-resolved, nscd, dnsmasq, or nothing at all (glibc sending queries straight to configured recursors). Each has its own caching behavior.

Negative caching at this layer can make a single machine appear “possessed” while others are fine. It’s a great way to waste 45 minutes blaming routing.

2) Recursive resolvers (the real cache)

This is where negative caching tends to hurt most, because a recursive resolver serves entire populations: office networks, VPCs, clusters, and VPN users.

Recursive resolvers typically:

  • Cache negative answers with a TTL
  • May “serve expired” answers under some conditions (usually for positive responses, but behavior varies)
  • Have policy knobs for maximum/minimum TTLs, including negative TTL caps
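Both major open-source recursors expose such a cap. A sketch of the relevant settings; the 60-second value is illustrative, not a recommendation:

```
# unbound.conf -- cap how long negative answers may be cached
server:
    cache-max-negative-ttl: 60

# BIND named.conf -- same idea, different knob
options {
    max-ncache-ttl 60;
};
```

A resolver-side cap is your defense against someone else's SOA: one zone with an hour-long negative TTL shouldn't get to impose an hour-long outage on your whole population.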

3) Application-level caches and runtime behavior

Even if DNS is “fixed,” the app might keep failing because it cached the negative result.

Common culprits:

  • Java can cache negative lookups; defaults depend on security settings and JVM version.
  • Go can use the cgo resolver or pure Go resolver depending on build/runtime; caching is usually minimal in the runtime, but your app or sidecars might cache.
  • Envoy / service meshes may cache DNS results for cluster discovery.

4) Middleboxes and “security” DNS

Some corporate DNS filtering appliances return NXDOMAIN (or sinkhole IPs) for blocked domains and cache those results aggressively. That can collide with internal domain patterns and cause… creative incidents.

Dry-funny joke #2: The nice thing about negative caching is it’s the only part of DNS that never lies—if it says “no,” it means “no, for the next 3600 seconds.”

Three corporate-world mini-stories (the kind you recognize)

Mini-story #1: The incident caused by a wrong assumption

The company had a tidy DNS playbook: set low TTLs for services, automate record creation, and use health checks to shift traffic. A new team shipped an internal API gateway and decided to create per-tenant DNS names on demand: tenant-id.api.internal.

The wrong assumption was simple: “If the record doesn’t exist, clients will just retry later and it’ll work once we create it.” In their heads, DNS behaved like an eventually consistent database with a fast convergence time.

In reality, the first lookup happened before automation finished. The recursive resolver returned NXDOMAIN. That answer carried an SOA with a fat negative TTL. Now the resolver had permission to keep telling everyone “doesn’t exist” for a long time, even after the record was created correctly.

The incident symptoms were strange: only new tenants failed, and only for a slice of users. Existing tenants were fine. The gateway was fine. The authoritative zone looked correct. The team rotated pods and restarted services, because that’s what you do when you’re losing and it’s late.

The fix was not heroic: reduce the zone’s negative caching TTL, add a “warm-up” step that creates the record before any client can attempt resolution, and in the short term flush caches on the corporate recursors. The lesson stuck: DNS failure is cacheable state, not a transient error.

Mini-story #2: The optimization that backfired

A different organization ran busy recursive resolvers and saw high query rates for random, nonexistent names—typos, telemetry, broken clients, the usual background noise. An engineer decided to “optimize” by raising negative TTL caps significantly and caching NXDOMAIN for longer. It worked: upstream query volume dropped, CPU looked great, and everyone high-fived quietly because DNS engineers don’t really high-five.

Then they did a rebrand. New hostnames, new subdomains, lots of marketing-driven DNS changes. The authoritative zone was ready. The recursors were ready. But a large portion of the enterprise had already tried to resolve some of those new names during prelaunch testing, when records weren’t in place yet.

Those NXDOMAINs were now cached for a long time due to the “optimization.” Launch day arrived. Some users were fine, others got hard failures. The helpdesk saw it as a flaky rollout. The SREs saw it as “DNS is correct.” Both were technically right. The incident lasted as long as the negative caches did.

They rolled back the negative TTL cap, but that didn’t undo the already-cached failures. They had to flush caches and wait. The backfire wasn’t just the setting—it was the lack of a policy: negative caching is a reliability decision, not a performance tweak.

Mini-story #3: The boring but correct practice that saved the day

A fintech company had a rule that annoyed product teams: “No DNS name may be referenced by production code until it resolves successfully from at least two independent recursive resolvers.” It sounded bureaucratic. It was. It also prevented a very specific class of outage.

They were migrating a payment callback endpoint to a new region. The migration plan included creating callback.payments.example as a CNAME to a regional name, then switching it during maintenance. They pre-created the record days in advance and verified it from (1) their own recursors and (2) a separate resolver stack used for monitoring.

During the change window, an engineer accidentally applied a zone file update missing that record. Authoritative started returning NXDOMAIN. Monitoring caught it within minutes because it checked from multiple resolvers and compared against expected answers. They reverted the zone quickly.

Here’s the part that mattered: because the record had existed and been stable for days, the caches held positive entries, not negative ones. The window where clients observed NXDOMAIN was small, and many clients never saw it at all. The boring practice—precreate, verify, monitor from multiple vantage points—turned a potential hour-long “why is DNS still broken” incident into a short blip.

Fast diagnosis playbook

When you suspect negative caching, you’re racing two clocks: the one in the cache, and the one in the business’s patience. Don’t shotgun random flushes. Prove where the bad answer is coming from.

First: identify what failure you’re actually seeing

  • Is it NXDOMAIN? Name doesn’t exist (cached hard).
  • Is it NODATA? Name exists, but missing the type (common for AAAA).
  • Is it SERVFAIL/timeouts? Not negative caching; usually DNSSEC, reachability, or upstream issues.

Second: compare answers from three places

  1. Authoritative nameserver directly (bypass caches).
  2. Your normal recursive resolver (what clients actually use).
  3. A known “clean” resolver (another recursive stack under your control, or a test resolver in a different network segment).

Third: extract the negative TTL and decide your move

  • If negative TTL is small (seconds to a few minutes): wait it out, but keep proving improvement.
  • If negative TTL is large (tens of minutes to hours): fix the SOA negative caching settings long-term, and flush recursive caches short-term if you control them.
  • If only some clients fail: suspect a local caching layer (node-local DNS, systemd-resolved, JVM), not the authoritative zone.

Fourth: check for split-horizon and conditional forwarding

“Works on VPN, fails off VPN” or “works in one VPC, fails in another” often means different resolvers are getting different truths. Negative caching then locks those different truths in place for longer than anyone expects.
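Comparing those different truths side by side is faster than arguing about them. A sketch that queries one name across several resolvers and prints each one's verdict; the resolver addresses are placeholders for your stub, recursor, and authoritative servers.

```shell
#!/bin/sh
# Compare the answer status for one name across several resolvers.
status_of() {
  # Extract the dig status (NOERROR, NXDOMAIN, ...) from captured output.
  printf '%s\n' "$1" | sed -n 's/.*status: \([A-Z]*\),.*/\1/p'
}

check() {  # check NAME RESOLVER...
  name="$1"; shift
  for r in "$@"; do
    echo "$r -> $(status_of "$(dig @"$r" "$name" A +time=2 +tries=1)")"
  done
}

# Usage (hypothetical addresses):
#   check api.example.internal 127.0.0.53 10.10.0.53 192.0.2.53
```

When the authoritative server says NOERROR and one recursor in the list says NXDOMAIN, you've found the cache holding the grudge.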

Hands-on tasks: commands, outputs, and decisions (12+)

These tasks assume Linux tooling. If you run resolvers in containers or appliances, the same logic applies: query, compare, extract TTL, act.

Task 1: Confirm the symptom with dig against the default resolver

cr0x@server:~$ dig api.example.internal A

; <<>> DiG 9.18.24 <<>> api.example.internal A
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 34219
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; AUTHORITY SECTION:
example.internal.  1800  IN  SOA  ns1.example.internal. hostmaster.example.internal. 2026020401 3600 900 1209600 1800

;; Query time: 12 msec
;; SERVER: 10.10.0.53#53(10.10.0.53) (UDP)
;; WHEN: Tue Feb  4 10:01:22 UTC 2026
;; MSG SIZE  rcvd: 132

What it means: Status is NXDOMAIN. The SOA is shown with TTL 1800 seconds. That’s a strong hint that negative caching is 30 minutes.

Decision: Don’t redeploy the app yet. First verify authoritative truth and confirm the negative TTL policy is the cause.

Task 2: Query the authoritative server directly (bypass recursion)

cr0x@server:~$ dig @192.0.2.53 api.example.internal A +norecurse

; <<>> DiG 9.18.24 <<>> @192.0.2.53 api.example.internal A +norecurse
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1201
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
api.example.internal. 60 IN A 198.51.100.20

;; Query time: 3 msec
;; SERVER: 192.0.2.53#53(192.0.2.53) (UDP)
;; WHEN: Tue Feb  4 10:01:31 UTC 2026
;; MSG SIZE  rcvd: 62

What it means: Authoritative says the record exists and is correct (NOERROR, aa flag, answer present).

Decision: This is almost certainly cached NXDOMAIN in recursion layers. Your zone is fine right now; your caches are not.

Task 3: Ask the recursive resolver and inspect authority SOA TTL

cr0x@server:~$ dig @10.10.0.53 api.example.internal A +noall +answer +authority +comments

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 54530
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; AUTHORITY SECTION:
example.internal.  1742  IN  SOA  ns1.example.internal. hostmaster.example.internal. 2026020401 3600 900 1209600 1800

What it means: The TTL is now 1742 seconds, counting down. That’s cached negative state decaying in real time.

Decision: If you can’t flush, you can at least predict when it will self-heal: the remaining 1742 seconds put recovery about 29 minutes out.
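Turning that countdown into a wall-clock time gives you something concrete to put in the incident channel. A sketch, assuming GNU date (Linux, as in the rest of these tasks):

```shell
#!/bin/sh
# Print when a cached negative entry expires, given the remaining TTL
# (seconds) read off the SOA line in the resolver's answer.
expires_at() {
  date -u -d "@$(( $(date +%s) + $1 ))" '+%Y-%m-%d %H:%M:%S UTC'
}

expires_at 1742   # roughly 29 minutes from now
```

"It self-heals at 10:30 UTC" lands better with the business than "DNS should sort itself out soonish."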

Task 4: Confirm whether the issue is type-specific (NODATA for AAAA)

cr0x@server:~$ dig @10.10.0.53 www.example.internal AAAA

; <<>> DiG 9.18.24 <<>> @10.10.0.53 www.example.internal AAAA
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7711
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; AUTHORITY SECTION:
example.internal.  900  IN  SOA  ns1.example.internal. hostmaster.example.internal. 2026020401 3600 900 1209600 1800

What it means: NOERROR with empty answer can be cached as “no AAAA.” This can break clients that prefer IPv6 first (Happy Eyeballs helps, but not always).

Decision: Decide whether you need to publish AAAA or adjust client behavior. Don’t treat it as “DNS is fine.” It’s a real, cacheable negative.

Task 5: Use dig +trace to see where denial happens

cr0x@server:~$ dig +trace api.example.internal A

; <<>> DiG 9.18.24 <<>> +trace api.example.internal A
;; Received 811 bytes from 127.0.0.53#53(127.0.0.53) in 0 ms

example.internal. 172800 IN NS ns1.example.internal.
example.internal. 172800 IN NS ns2.example.internal.
;; Received 151 bytes from 198.41.0.4#53(a.root-servers.net) in 17 ms

api.example.internal. 60 IN A 198.51.100.20
;; Received 62 bytes from 192.0.2.53#53(ns1.example.internal) in 4 ms

What it means: The authoritative path returns the correct record. So if clients see NXDOMAIN, it’s caching or split-horizon, not missing delegation.

Decision: Focus on recursors and intermediates. Stop editing zone files “just in case.”

Task 6: Check what resolver your host is using

cr0x@server:~$ resolvectl status

Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub

Link 2 (eth0)
    Current Scopes: DNS
         Protocols: +DefaultRoute
Current DNS Server: 10.10.0.53
       DNS Servers: 10.10.0.53 10.10.0.54
        DNS Domain: example.internal

What it means: The host uses systemd-resolved stub mode. There’s a local cache plus upstream recursors.

Decision: If only this host fails, flush local cache. If many hosts fail, flushing one host won’t help.

Task 7: Flush systemd-resolved cache (host-level)

cr0x@server:~$ sudo resolvectl flush-caches

What it means: No output is normal. It doesn’t guarantee upstream caches are cleared.

Decision: Re-test with dig. If NXDOMAIN persists, it’s upstream recursion caching, not host cache.

Task 8: Check if nscd is caching hosts lookups

cr0x@server:~$ systemctl is-active nscd
inactive

What it means: nscd isn’t your culprit on this host.

Decision: Don’t waste time restarting it. Move on.

Task 9: Inspect Unbound’s negative caching related settings

cr0x@server:~$ sudo unbound-control get_option neg-cache-size
neg-cache-size: 4m

What it means: Unbound allocates space for negative cache. Size isn’t TTL, but it confirms negative caching is in play.

Decision: If you’re memory-pressured and evictions are happening, you may see inconsistent behavior. Otherwise, proceed to flush or adjust caps.

Task 10: Flush a single name from Unbound (surgical, not napalm)

cr0x@server:~$ sudo unbound-control flush api.example.internal
ok

What it means: The cached entry for that name is removed from Unbound.

Decision: Re-query. If it returns NOERROR now, you’ve proven cached NXDOMAIN was the blocker. If many names were affected, unbound-control flush_zone example.internal clears the whole subtree, and unbound-control flush_negative drops every cached negative answer.

Task 11: Flush a name from BIND (if you run named)

cr0x@server:~$ sudo rndc flushname api.example.internal

What it means: No output typically means success. Some builds log it rather than printing.

Decision: Immediately verify with dig via that resolver. If still NXDOMAIN, the cache might be elsewhere (forwarder, node-local, client runtime).

Task 12: Find negative TTL policy in a BIND zone SOA

cr0x@server:~$ dig @192.0.2.53 example.internal SOA +noall +answer

example.internal. 300 IN SOA ns1.example.internal. hostmaster.example.internal. 2026020401 3600 900 1209600 1800

What it means: Per RFC 2308, the negative caching TTL is the lesser of the SOA record’s own TTL and the last (“minimum”) field; in practice, many servers and resolvers key off the minimum field, and resolver-side caps may apply on top. Here the SOA record’s own TTL is 300, while the minimum field is 1800.

Decision: If you need faster recovery from mistakes and rapid name creation, lower that minimum/negative TTL. Don’t set it to 0 unless you enjoy self-inflicted query storms.

Task 13: Validate what your client actually sees via getent

cr0x@server:~$ getent ahostsv4 api.example.internal
198.51.100.20   STREAM api.example.internal
198.51.100.20   DGRAM
198.51.100.20   RAW

What it means: The libc resolver path resolves successfully. If dig fails but getent works (or vice versa), you may have split DNS behavior or different resolver paths (stub vs direct).

Decision: Debug the path your application uses, not the one you prefer. Many outages are “works in dig” while the app uses something else.

Task 14: Check whether Kubernetes CoreDNS is caching negatives too long

cr0x@server:~$ kubectl -n kube-system get configmap coredns -o yaml | sed -n '1,120p'
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 10.10.0.53 10.10.0.54
        cache 30
        loop
        reload
        loadbalance
    }

What it means: cache 30 caps cached entries at 30 seconds, and that cap applies to both successful and negative (“denial”) responses. Your upstream recursor might still cache NXDOMAIN much longer.

Decision: If the cluster is the affected population, tune CoreDNS caching deliberately and ensure upstream negative TTL caps align with service discovery needs.
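The cache plugin can also cap negative ("denial") entries separately from positive ones. A sketch of a tuned stanza; the capacities and TTLs are illustrative, not recommendations:

```
cache 30 {
    success 9984 30
    denial  9984 5
}
```

A short denial TTL in the cluster buys fast recovery for ephemeral service names without giving up positive caching for everything else.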

Task 15: Observe whether a forwarder is returning cached NXDOMAIN

cr0x@server:~$ dig @10.10.0.53 api.example.internal A +stats

; <<>> DiG 9.18.24 <<>> @10.10.0.53 api.example.internal A +stats
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 32602
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; Query time: 1 msec
;; SERVER: 10.10.0.53#53(10.10.0.53) (UDP)
;; WHEN: Tue Feb  4 10:02:01 UTC 2026
;; MSG SIZE  rcvd: 132

What it means: A 1ms response time strongly suggests this was served from cache, not fetched from authoritative servers.

Decision: You need cache management (flush/cap TTL) more than you need “try again.”

Common mistakes: symptoms → root cause → fix

1) “We created the record but users still get NXDOMAIN”

Symptoms: Authoritative shows the record; clients still fail with NXDOMAIN; failures slowly disappear over tens of minutes.

Root cause: Recursive resolvers cached NXDOMAIN earlier; negative TTL derived from SOA is long.

Fix: Lower negative TTL in the zone (SOA minimum/negative TTL policy), flush caches where you control them, and precreate records before any client can query them.

2) “Only IPv6 clients fail” (or only some libraries)

Symptoms: A lookups work; AAAA lookups return empty; some clients time out or behave strangely.

Root cause: NODATA for AAAA is being cached; client prefers IPv6; app doesn’t gracefully fall back.

Fix: Publish AAAA where appropriate, or adjust client behavior; consider reducing negative caching for AAAA-related NODATA in resolver policy if it hurts rollouts.

3) “Works on my laptop, fails in the data center”

Symptoms: Same query gives different answers depending on network.

Root cause: Split-horizon DNS, conditional forwarding, or different recursive resolver stacks with different cached negative state.

Fix: Identify which resolver each environment uses; compare directly; unify policy; flush the specific resolver population that’s wrong.

4) “Restarting pods fixed it… briefly”

Symptoms: App works after restart, then fails again; or different pods show different behavior.

Root cause: Node-local caches differ; some nodes hit cached NXDOMAIN; others don’t. Restarts change which node you land on.

Fix: Fix at the resolver layer (node-local DNS cache / CoreDNS / upstream recursor). Don’t use restarts as cache invalidation.

5) “We set TTLs low, so why is it still broken?”

Symptoms: Positive TTLs are low (like 60s), but NXDOMAIN persists much longer.

Root cause: Negative TTL is not your record TTL. It’s derived from SOA negative caching settings and resolver policy caps.

Fix: Audit SOA fields and resolver negative TTL caps. Treat negative TTL as a first-class SLO control.

6) “DNSSEC made it worse”

Symptoms: After enabling DNSSEC, negative answers linger or SERVFAIL appears, and flushing seems ineffective.

Root cause: DNSSEC-validating resolvers cache authenticated denial of existence confidently; plus misconfig can yield SERVFAIL (not NXDOMAIN).

Fix: Separate NXDOMAIN vs SERVFAIL. For SERVFAIL, validate DNSSEC chain and signatures. For NXDOMAIN, tune negative TTL policy and cache flush strategy.

Checklists / step-by-step plan

Checklist: before you create brand-new hostnames in production

  1. Set a sane negative caching TTL for the zone (often 30–300 seconds for fast-moving internal zones; longer for stable public zones).
  2. Precreate names (or wildcard carefully) before any client can attempt resolution.
  3. Verify from at least two recursive resolvers (different stacks or network segments).
  4. Verify both A and AAAA behavior explicitly (even if you “don’t use IPv6”).
  5. Document how to flush caches in your resolver fleet, and test that it works.
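Item 3 of this checklist is automatable. A sketch of a gate that refuses to pass until a name resolves cleanly from every resolver listed; the name and addresses in the usage comment are placeholders.

```shell
#!/bin/sh
# Gate: succeed only if NAME resolves (NOERROR with at least one answer)
# from every resolver given. Wire this into the deploy pipeline.
resolves_ok() {  # resolves_ok NAME RESOLVER
  out=$(dig @"$2" "$1" A +time=2 +tries=1)
  printf '%s\n' "$out" | grep -q 'status: NOERROR' &&
    ! printf '%s\n' "$out" | grep -q 'ANSWER: 0,'
}

gate() {  # gate NAME RESOLVER...
  name="$1"; shift
  for r in "$@"; do
    resolves_ok "$name" "$r" || { echo "FAIL: $name via $r"; return 1; }
  done
  echo "OK: $name resolves from all resolvers"
}

# Usage (hypothetical): gate callback.payments.example 10.10.0.53 10.20.0.53
```

Run it as a precondition, and the "no code references a name until it resolves from two resolvers" rule stops depending on humans remembering it.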

Checklist: during an incident where “DNS is fixed but still broken”

  1. Confirm the failure type: NXDOMAIN vs NODATA vs SERVFAIL.
  2. Query authoritative directly with +norecurse.
  3. Query the recursive resolver your clients use; extract the remaining TTL from the SOA line.
  4. Check whether response time suggests cache (1–2ms) vs upstream (10ms+).
  5. Flush the specific name on the recursive resolver if you control it; re-test.
  6. If you don’t control it, move clients to a different resolver temporarily (last resort, but sometimes the only lever).
  7. After restoration, lower negative TTL if it extended outage duration.

Step-by-step plan: right-sizing negative caching without melting your auth servers

  1. Inventory zones and categorize them: stable public, internal stable, internal dynamic (service discovery, ephemeral names).
  2. Measure NXDOMAIN rates on recursors and authoritative servers. High NXDOMAIN volume suggests typos, scanning, or broken clients.
  3. Set negative TTL targets per category:
    • Stable public zones: longer can be fine.
    • Internal dynamic zones: short negative TTL is usually worth it.
  4. Apply resolver caps so one bad SOA doesn’t impose an hour-long outage on everyone.
  5. Load-test authoritative capacity if you reduce negative TTL drastically. Fewer cached NXDOMAINs means more upstream queries.
  6. Roll out gradually across resolver fleet; monitor query volume, latency, and SERVFAIL.

FAQ

1) Is negative caching the same as “DNS propagation”?

No. “Propagation” is a folk term. What’s actually happening is caches obeying TTLs—positive and negative—across multiple resolver layers.

2) What controls how long NXDOMAIN is cached?

Typically the zone’s SOA (negative TTL semantics) plus any caps or overrides on recursive resolvers. Some resolvers also apply minimum/maximum policies.

3) If I set my record TTL to 60 seconds, will NXDOMAIN also cache for 60 seconds?

No. Record TTL affects positive answers. NXDOMAIN caching is governed by negative caching TTL policy, usually derived from the SOA.

4) Can I just set negative TTL to 0 everywhere?

You can, but you’ll pay for it with increased query load and potentially self-inflicted DoS of your authoritative servers. For dynamic internal zones, small (not zero) is the usual sweet spot.

5) How do I tell if I’m seeing cached NXDOMAIN versus authoritative NXDOMAIN?

Query authoritative directly with dig @auth +norecurse. If authoritative says NOERROR but recursive says NXDOMAIN, it’s cached or split-horizon. Also watch TTL count down in the SOA line on repeated queries.

6) Why do only some users see the failure?

Because different users hit different resolvers, or different caches at different layers. Some may have cached the old failure; others never queried during the bad window.

7) Does flushing caches always fix it immediately?

Only if you flush the cache that holds the bad entry and clients actually use that resolver. Flushing a laptop won’t fix an enterprise recursive cache upstream of it.

8) What about browser DNS caches?

Browsers may cache DNS results indirectly through connection reuse and internal caches, but the bigger operational issue is usually recursive resolver caching and application runtime caching.

9) How does DNSSEC affect negative caching?

DNSSEC can make negative answers verifiable (authenticated denial of existence), which can increase confidence in caching. It also introduces failure modes (SERVFAIL) when validation fails—different from NXDOMAIN.

10) What’s the safest way to roll out a new hostname for a critical service?

Precreate the record well ahead of time, verify from multiple recursive resolvers, keep negative TTL short in that zone, and monitor resolution continuously.

Next steps you can do this week

Negative caching isn’t a villain. It’s a lever. If you don’t set it intentionally, it will be set for you—by defaults, by legacy, and by the last person who “optimized DNS.”

Do these three things

  1. Audit your SOA negative TTLs for zones used by dynamic services. If the number is measured in hours, you’re choosing long outages.
  2. Add one dashboard panel: NXDOMAIN rate and top NXDOMAIN names on your recursors. You can’t manage what you don’t see, and NXDOMAIN spikes are often the first smoke.
  3. Write a cache flush runbook for every resolver tier you operate (systemd-resolved, node-local, CoreDNS, Unbound/BIND). Include verification commands and expected outcomes.
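The runbook in item 3 can start as a one-screen cheat sheet. A sketch that maps each tier to its flush command; the CoreDNS line assumes a standard kube-system deployment, and all of it is a template to review, not a button to press.

```shell
#!/bin/sh
# Print the cache-flush command for a given resolver tier and name.
# A cheat-sheet generator: review before running anything in production.
flush_cmd() {  # flush_cmd TIER NAME
  case "$1" in
    systemd-resolved) echo "resolvectl flush-caches" ;;
    unbound)          echo "unbound-control flush $2" ;;
    bind)             echo "rndc flushname $2" ;;
    coredns)          echo "kubectl -n kube-system rollout restart deployment coredns" ;;
    *)                echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

flush_cmd unbound api.example.internal   # -> unbound-control flush api.example.internal
```

Pair each line with a verification query (dig via that tier, expect NOERROR) and you have the runbook the incident channel will actually use.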

If you want a simple operational rule: keep negative caching short where names are born and die quickly, and keep it longer where the namespace is stable. That’s not ideology. That’s matching cache behavior to how your business actually ships changes.
