DNS Cache Poisoning Basics: Harden Your Resolver Without Overengineering


If DNS breaks, everything looks broken. Your app “can’t reach the database,” your CI can’t fetch dependencies, your metrics drop out,
and someone inevitably suggests “it’s probably the network” with the confidence of a horoscope.

Cache poisoning is the nasty version of DNS breaking: resolution still works, but it works wrong. Users get sent to the attacker.
Or they get sent to your own stale IPs. Either way, you’ll spend hours proving that “yes, the packets are real” while the business asks why
the login page now sells discount sneakers.

What cache poisoning actually is (and what it isn’t)

“DNS cache poisoning” is when a recursive resolver stores a bogus answer and serves it to clients until it expires (or gets evicted).
The key word is cache. The attacker doesn’t need to compromise your authoritative DNS server to hurt you; they just need
your resolver to believe a lie for long enough.

This is different from:

  • Hijacking at the registrar (someone changes your NS records through the registrar, often via a compromised account). That’s not cache poisoning; it’s account compromise.
  • On-path tampering (MITM rewriting DNS responses). That’s a network attack; caching may amplify it, but the root cause is traffic interception.
  • Bad configuration (wrong A/AAAA records, stale TTLs, split-horizon surprises). Looks identical from the client’s chair.
  • NXDOMAIN storms (lots of negative answers). Not poisoning; still painful.

Poisoning matters because DNS is a trust distribution system that predates modern threat models. It was designed for reliability and
scale, not active adversaries. We’ve bolted defenses onto it. They work, but only if you turn them on and don’t sabotage them with
“helpful” optimizations.

The blast radius is shaped by two decisions you control:

  • Who can query your resolver? If you run an open resolver, you’re making strangers’ problems into your problems.
  • How strictly does the resolver validate answers? DNSSEC validation and sane bailiwick rules turn many poisoning attempts into harmless noise.

Short joke #1: DNS is like office gossip: once the wrong thing hits the cache, everyone repeats it with absolute confidence.

How poisoning works: from guesses to persistence

A recursive resolver asks upstream servers questions and caches the answers. Attackers want to insert a fake answer into that cache.
Historically, poisoning was about racing the real answer with a forged response and winning the resolver’s trust.

The moving parts that make poisoning possible

A forged DNS response needs to match what the resolver expects: the transaction ID, the source port, and the question section.
Older resolvers used predictable IDs and fixed ports. That reduced the guessing space to something an attacker could brute force.
Modern resolvers randomize IDs and source ports, making off-path guessing far harder.

But “harder” isn’t “impossible.” Attackers look for:

  • NAT devices that collapse source ports (port randomization gets neutered by middleboxes).
  • Side channels that leak port or ID information.
  • Weak forwarding setups (forwarders that accept junk, or that don’t validate DNSSEC).
  • Misconfigured resolvers that accept out-of-bailiwick additional records (classic poisoning vector).
  • Long TTLs that make a single win last a long time.
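
If you want a quick reality check on the randomization itself (the first two bullets above), DNS-OARC has long operated a test zone that grades the source ports your resolver actually uses. Availability isn’t guaranteed, so treat this as an optional probe, not a dependency:

cr0x@server:~$ dig @10.0.0.53 porttest.dns-oarc.net TXT +short

When the service answers, the TXT record rates the observed port spread, roughly GOOD, FAIR, or POOR. A POOR rating from a resolver you believe randomizes its ports usually means a middlebox is rewriting them on the way out.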

Why “additional section” matters (and why bailiwick is a grown-up word)

DNS responses can contain extra records in the “additional” section, like glue A records for nameservers.
A resolver should be picky about what it caches from there. The rule of thumb is “bailiwick”: cache additional data only if it’s within
the authority of the server that provided it, and even then with caution.

Poisoning often tries to smuggle in something like:

  • An A record for www.victim.com smuggled into a response from a server that isn’t authoritative for it.
  • An NS record pointing victim.com at an attacker-controlled nameserver, plus glue for that nameserver.

Once a resolver caches a malicious delegation (poisoned NS), the attacker gets leverage: future queries can be steered at their servers,
which can answer consistently and “legitimately” from the resolver’s point of view. That’s when the incident becomes persistent.
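
On Unbound, the settings that enforce this discipline are explicit configuration options. A minimal sketch of the relevant server: block lines follows; several of these are already defaults in modern versions, so the point is to verify nobody turned them off (file layout varies by distro):

server:
    harden-glue: yes                    # only use glue that is within the delegating server's bailiwick
    harden-dnssec-stripped: yes         # treat answers with stripped DNSSEC data as bogus for signed zones
    use-caps-for-id: yes                # 0x20 case randomization for extra entropy against forged replies
    unwanted-reply-threshold: 10000000  # defensively flush caches if floods of unmatched replies arrive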

DNSSEC changes the game—if you actually validate

DNSSEC adds signatures to DNS data so a validating resolver can detect tampering. It does not encrypt DNS. It does not stop someone from
learning what you query. It also doesn’t protect unsigned zones, and it can be undermined by bad trust-anchor handling or validation
being turned off “temporarily.”

Turning on DNSSEC validation is like wearing a seatbelt. It doesn’t make you invincible, but the alternative is trusting physics to be kind.
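
One way to prove validation is actually happening, not just configured, is to query a deliberately broken test zone. dnssec-failed.org has been maintained for years for exactly this purpose (its availability is outside your control, so treat the result accordingly):

cr0x@server:~$ dig @10.0.0.53 dnssec-failed.org A +noall +comments | grep status

A validating resolver should report SERVFAIL for that name; a resolver that cheerfully hands back an A record is not validating, whatever its configuration claims.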

One quote, because operations is a contact sport: “Hope is not a strategy.” — Vince Lombardi.

Interesting facts and historical context (short, concrete)

  • DNS is older than the web: DNS standardized in the 1980s; HTTP came later. The threat model was “campus network,” not “global adversary.”
  • 2008 was a turning point: Dan Kaminsky’s disclosure showed practical cache poisoning at scale, forcing widespread patching and randomization.
  • Source port randomization became default after that era because 16-bit transaction IDs alone were guessable with enough attempts.
  • “0x20 encoding” is a thing: Some resolvers randomize letter case in queries to add entropy; it’s clever but not a substitute for real validation.
  • NAT can sabotage security: Some older NAT devices rewrote source ports predictably, shrinking the entropy you thought you had.
  • DNSSEC signing started in the 1990s conceptually, but real deployment took ages due to operational complexity and key management fear.
  • The root zone got signed in 2010, which made end-to-end DNSSEC validation viable without custom trust hacks.
  • Negative caching is standardized (SOA minimum / negative TTL behavior). It’s essential for scale and also a common source of “why won’t it resolve yet?” confusion.
  • Resolvers are DDoS amplifiers when open: not poisoning, but related; open recursion turns your infrastructure into someone else’s weapon.

Fast diagnosis playbook

You’re on call. Something smells like DNS. Don’t boil the ocean. Do this in order and stop when you find the smoking crater.

First: determine whether the resolver is lying or upstream is changing

  1. Query your resolver and a known-good external resolver for the same name and compare answer + TTL (a quick loop is sketched after this list).
  2. Check DNSSEC validation status (is it on, and is it failing?).
  3. Trace delegation (are you being sent to unexpected NS servers?).
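
A minimal sketch of step 1, assuming dig is installed and using the same resolver IPs as the tasks below (10.0.0.53 internal, with 1.1.1.1 and 9.9.9.9 as public references):

cr0x@server:~$ for ns in 10.0.0.53 1.1.1.1 9.9.9.9; do
>   echo "== $ns"
>   dig @"$ns" www.example.com A +noall +answer +ttlid
> done

Divergent IPs or wildly different TTLs between your resolver and the public references are what you’re hunting for. Identical answers move suspicion elsewhere.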

Second: look for cache persistence and scope

  1. Check TTL behavior: is the wrong answer counting down (cached) or fluctuating (upstream)?
  2. Check multiple clients/subnets: is it one resolver node, one site, or everywhere?
  3. Check if a forwarder is involved: many “resolver” deployments are just forwarders with extra steps.

Third: confirm it’s not self-inflicted

  1. Search config management diffs for DNS changes (forwarders, views, ACLs, DNSSEC toggles).
  2. Check for split-horizon collisions (internal zone shadowing external).
  3. Inspect NAT/firewall behavior if you suspect off-path poisoning risk (port rewriting, UDP timeouts).

The objective: separate “poisoned cache” from “authoritative truth changed” from “we broke ourselves.” Those are three different incident classes.

Hands-on tasks: commands, outputs, and the decision you make

These are deliberately practical. Run them on a client host and on the resolver host. Compare outputs. Make decisions.
I’ll use 10.0.0.53 as the internal recursive resolver IP and resolver1 as its hostname.

Task 1 — Confirm what resolver your host is actually using

cr0x@server:~$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.0.0.53
       DNS Servers: 10.0.0.53 10.0.0.54

What it means: This host is sending DNS to 10.0.0.53 (primary) and 10.0.0.54 (secondary).
DNSSEC here is about the stub resolver and may say “unsupported” even if recursion validates upstream.

Decision: If the “Current DNS Server” isn’t your intended resolver, stop and fix DHCP/systemd-resolved settings.
Debugging poisoning on the wrong box is how you end up blaming the moon.

Task 2 — Compare your resolver’s answer to a known external reference

cr0x@server:~$ dig @10.0.0.53 www.example.com A +noall +answer +ttlid
www.example.com.  296  IN  A  203.0.113.50
cr0x@server:~$ dig @1.1.1.1 www.example.com A +noall +answer +ttlid
www.example.com.  300  IN  A  93.184.216.34

What it means: Your resolver is returning a different IP than a public resolver. That’s either poisoning, split-horizon, or upstream divergence.

Decision: If the domain should be public and consistent, treat this as an incident and continue. If it’s intentionally internal, document it and move on.

Task 3 — Check whether the wrong answer is cached (TTL counting down)

cr0x@server:~$ dig @10.0.0.53 www.example.com A +noall +answer +ttlid
www.example.com.  296  IN  A  203.0.113.50
cr0x@server:~$ sleep 5; dig @10.0.0.53 www.example.com A +noall +answer +ttlid
www.example.com.  291  IN  A  203.0.113.50

What it means: TTL decreasing indicates the resolver is serving from cache.

Decision: Cached wrong answer implies you need cache purge/flush (carefully) and root cause. If TTL resets upward, your resolver is fetching it repeatedly from somewhere.
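
If the decision is to purge, the usual knobs look like this, after you have captured evidence (the names are this task’s example domain; substitute your own):

cr0x@server:~$ sudo unbound-control flush www.example.com
cr0x@server:~$ sudo unbound-control flush_zone example.com
cr0x@server:~$ sudo rndc flushname www.example.com
cr0x@server:~$ sudo rndc flushtree example.com

The zone/tree variants remove everything at and below the name, which is usually what you want when a delegation may be poisoned; the single-name variants are gentler.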

Task 4 — Trace delegation to see where answers come from

cr0x@server:~$ dig +trace www.example.com A
; <<>> DiG 9.18.24 <<>> +trace www.example.com A
;; global options: +cmd
.                       3600    IN      NS      a.root-servers.net.
...
example.com.            172800  IN      NS      a.iana-servers.net.
example.com.            172800  IN      NS      b.iana-servers.net.
www.example.com.        300     IN      A       93.184.216.34

What it means: +trace bypasses your resolver’s cache and performs iterative resolution from the root down.
If +trace yields the correct record but your resolver doesn’t, the resolver is suspect.

Decision: If trace is correct but resolver answer differs, inspect cache poisoning or internal override. If trace is also “wrong,” the authoritative truth changed or your network is intercepting DNS.

Task 5 — Ask for DNSSEC validation status in the response

cr0x@server:~$ dig @10.0.0.53 www.example.com A +dnssec +noall +answer +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42012
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
www.example.com.  300  IN  A  93.184.216.34

What it means: You don’t see the ad (Authenticated Data) flag. That can mean DNSSEC validation is off, unsupported for this name, or your resolver isn’t a validating resolver.

Decision: If you expect validation, confirm on the resolver. If you don’t expect it, decide whether that’s still acceptable in 2025 (it usually isn’t).
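
If BIND’s tooling is installed, delv gives a convenient second opinion: it fetches the answer through the resolver but performs its own DNSSEC validation. A sketch; the exact output wording varies by version:

cr0x@server:~$ delv @10.0.0.53 www.example.com A

Look for a validation verdict such as “; fully validated” versus an unsigned or failed result. If delv validates the data but your resolver never sets ad, the resolver is the piece that isn’t validating.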

Task 6 — On Unbound, verify DNSSEC is enabled and working

cr0x@server:~$ sudo unbound-control status
version: 1.17.1
verbosity: 1
threads: 2
modules: 2 [subnetcache validator]
uptime: 18432 seconds
options: reuseport control(ssl)

What it means: The module list includes validator, a good sign that the DNSSEC validator module is loaded and active.

Decision: If you don’t see validator, you’re not validating. Fix configuration before you argue about poisoning.

Task 7 — On BIND9, confirm recursion and who is allowed to use it

cr0x@server:~$ sudo named-checkconf -p | sed -n '1,120p'
options {
        recursion yes;
        allow-recursion { 10.0.0.0/8; 192.168.0.0/16; };
        allow-query-cache { 10.0.0.0/8; 192.168.0.0/16; };
        dnssec-validation auto;
        minimal-responses yes;
        ...
};

What it means: Recursion is enabled (fine) but restricted to internal networks (also fine). dnssec-validation auto; is the normal safe default.

Decision: If allow-recursion is set to any (or missing entirely), treat it as a security issue. Open recursion is an invitation to abuse and also increases your exposure.

Task 8 — Check for SERVFAIL patterns (often DNSSEC or upstream trouble)

cr0x@server:~$ dig @10.0.0.53 badsig.example A +dnssec
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 61457
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

What it means: SERVFAIL can mean validation failures, timeouts, or upstream breakage. If it correlates with DNSSEC-enabled zones, it’s usually validation.

Decision: Check resolver logs for DNSSEC errors; don’t just disable validation to “fix” production. That’s like solving a smoke alarm by removing the batteries.
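
A quick way to separate “validation is rejecting the data” from “upstream is simply broken” is to repeat the query with checking disabled. The +cd flag asks the resolver to skip validation for this query (badsig.example is this task’s placeholder name):

cr0x@server:~$ dig @10.0.0.53 badsig.example A +dnssec +noall +comments | grep status
cr0x@server:~$ dig @10.0.0.53 badsig.example A +dnssec +cd +noall +comments | grep status

If the first query returns SERVFAIL and the +cd query returns NOERROR with data, validation is doing its job against bad data. If both fail, look at reachability, fragmentation, or the zone itself.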

Task 9 — Verify the resolver isn’t forwarding to an untrusted or broken upstream

cr0x@server:~$ sudo ss -uapn | grep -E ':(53)\b'
UNCONN 0      0         10.0.0.53:53         0.0.0.0:*    users:(("unbound",pid=1187,fd=6))
UNCONN 0      0         10.0.0.53:49322      8.8.8.8:53   users:(("unbound",pid=1187,fd=9))
UNCONN 0      0         10.0.0.53:42617      1.1.1.1:53   users:(("unbound",pid=1187,fd=10))

What it means: This resolver is making outbound UDP/53 queries to public resolvers. That’s a forwarding design, not full recursion.

Decision: Decide if forwarding is intentional. If you forward, you inherit upstream behavior and reduce your ability to do end-to-end validation and troubleshooting.

Task 10 — Inspect NAT behavior that can reduce source-port entropy

cr0x@server:~$ sudo conntrack -L -p udp --dport 53 2>/dev/null | head
udp      17 25 src=10.0.0.53 dst=1.1.1.1 sport=42617 dport=53 src=1.1.1.1 dst=203.0.113.10 sport=53 dport=42617 [ASSURED]
udp      17 25 src=10.0.0.53 dst=8.8.8.8 sport=49322 dport=53 src=8.8.8.8 dst=203.0.113.10 sport=53 dport=49322 [ASSURED]

What it means: You can see the UDP flows and source ports. If you repeatedly observe sequential or reused ports due to NAT, you’re losing entropy.

Decision: If your NAT is predictable, fix NAT configuration or move the resolver so it can preserve randomness.

Task 11 — Check Unbound statistics for cache anomalies

cr0x@server:~$ sudo unbound-control stats_noreset | egrep 'num.queries|num.cachehits|num.cachemiss|unwanted|cache.count'
total.num.queries=1842099
total.num.cachehits=1320091
total.num.cachemiss=522008
unwanted.queries=183
unwanted.replies=91
msg.cache.count=76231
rrset.cache.count=54012

What it means: High cache hits are normal. Rising unwanted.replies can indicate spoofed responses arriving that don’t match outstanding queries.

Decision: If unwanted.replies spikes, investigate network spoofing attempts or broken upstream devices. It’s not proof of poisoning, but it’s a clue worth chasing.

Task 12 — Capture DNS traffic to confirm what’s on the wire

cr0x@server:~$ sudo tcpdump -ni any udp port 53 -vv -c 5
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
IP 10.0.0.53.42617 > 1.1.1.1.53: 42012+ A? www.example.com. (33)
IP 1.1.1.1.53 > 10.0.0.53.42617: 42012* 1/0/1 A 93.184.216.34 (49)
IP 10.0.0.53.49322 > 8.8.8.8.53: 1337+ DS? example.com. (29)
IP 8.8.8.8.53 > 10.0.0.53.49322: 1337 1/0/1 DS 370 13 2 49AAC11D... (68)
IP 203.0.113.66.53 > 10.0.0.53.42617: 42012 1/0/0 A 203.0.113.50 (49)

What it means: The third-party IP 203.0.113.66 is sending a response that matches the transaction ID and port for the outstanding query.
If that IP isn’t your upstream and it arrives before the real one, you’ve got a serious problem: spoofing is happening on-path or via routing/ACL failure.

Decision: Escalate immediately: network team for ACLs/routing, security team for incident response, and isolate the resolver if you can.

Task 13 — Check BIND query logs for unexpected clients and recursion abuse

cr0x@server:~$ sudo rndc querylog
query logging is now on
cr0x@server:~$ sudo tail -n 5 /var/log/named/named.log
client @0x7f3c9c0a 198.51.100.77#53124 (randomstring.badness.example): query: randomstring.badness.example IN A +E(0)K (10.0.0.53)
client @0x7f3c9c0b 10.10.14.22#40331 (www.example.com): query: www.example.com IN A +E(0)K (10.0.0.53)
client @0x7f3c9c0c 10.20.1.9#55817 (api.internal.corp): query: api.internal.corp IN A +E(0)K (10.0.0.53)

What it means: A public IP querying your resolver suggests open recursion or firewall leakage. Also note random subdomains: could be amplification or cache-bypass attempts.

Decision: If any public IPs are hitting recursion, block immediately and fix ACLs. This is both security and stability.
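
Resolver ACLs are the primary control, but a host firewall on the resolver is a cheap second layer. A minimal iptables sketch, reusing the internal ranges from the earlier BIND config; adapt it to nftables or your cloud security groups as appropriate:

cr0x@server:~$ sudo iptables -A INPUT -p udp --dport 53 -s 10.0.0.0/8 -j ACCEPT
cr0x@server:~$ sudo iptables -A INPUT -p tcp --dport 53 -s 10.0.0.0/8 -j ACCEPT
cr0x@server:~$ sudo iptables -A INPUT -p udp --dport 53 -s 192.168.0.0/16 -j ACCEPT
cr0x@server:~$ sudo iptables -A INPUT -p tcp --dport 53 -s 192.168.0.0/16 -j ACCEPT
cr0x@server:~$ sudo iptables -A INPUT -p udp --dport 53 -j DROP
cr0x@server:~$ sudo iptables -A INPUT -p tcp --dport 53 -j DROP

Order matters: accepts before the final drops. These rules only filter inbound port 53, so the resolver’s own outbound queries (and their replies on high ports) are unaffected.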

Task 14 — Check whether you’re leaking internal zones to the world (views/split horizon)

cr0x@server:~$ dig @10.0.0.53 api.internal.corp A +noall +answer
api.internal.corp.  60  IN  A  10.55.0.21
cr0x@server:~$ dig @1.1.1.1 api.internal.corp A +noall +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 11610

What it means: Internal name exists only internally. Good. But if an external resolver returns something, you have a collision or leakage.

Decision: Ensure internal suffixes are truly private and not colliding with public TLDs. Consider using a reserved internal domain strategy to avoid surprises.

Task 15 — Validate that your resolver rejects out-of-bailiwick junk (sanity test)

cr0x@server:~$ dig @10.0.0.53 ns1.example.com A +noall +authority +additional
example.com.            172800  IN  NS  a.iana-servers.net.
example.com.            172800  IN  NS  b.iana-servers.net.
a.iana-servers.net.     172800  IN  A   199.43.135.53
b.iana-servers.net.     172800  IN  A   199.43.133.53

What it means: The additional section contains glue records for the nameservers. That’s expected and in-bailiwick glue for the delegation chain.

Decision: If you see your resolver caching unrelated additional A records (for names not involved in the delegation), tighten configuration and update resolver software.

Resolver hardening that moves the needle

Hardening resolvers is mostly about default-deny thinking and reducing ambiguity. You’re not trying to be clever.
You’re trying to be boring, correct, and resistant to both attackers and your coworkers’ “small improvements.”

1) Stop running open recursion (seriously)

Restrict who can query and who can recurse. Two different ACLs. Get them both right.
If you run an open resolver, you attract abuse, you generate confusing traffic patterns, and you increase the chance that spoofed garbage reaches your cache.

  • BIND: set allow-recursion and allow-query-cache to internal networks only.
  • Unbound: configure access-control for allowed subnets; deny everything else (a sketch of both follows below).
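
A minimal sketch of both, reusing the internal ranges from the earlier named-checkconf output; your networks and defaults will differ:

# BIND (named.conf options)
options {
        recursion yes;
        allow-recursion { 10.0.0.0/8; 192.168.0.0/16; };
        allow-query-cache { 10.0.0.0/8; 192.168.0.0/16; };
};

# Unbound (unbound.conf)
server:
    access-control: 10.0.0.0/8 allow
    access-control: 192.168.0.0/16 allow
    access-control: 0.0.0.0/0 refuse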

2) Enable DNSSEC validation and don’t treat it as optional

DNSSEC validation is the most effective poisoning mitigation for signed zones. It turns “maybe” into “no.”
Yes, it introduces new failure modes (expired signatures, broken DS records). That’s fine. We know how to monitor those.

What you should not do: flip DNSSEC off during an outage and forget to turn it back on. That’s how you graduate from “temporary instability”
to “quiet compromise.”

3) Keep your resolver software current

Resolver security is a moving target: bailiwick fixes, better randomization, hardened parsing, QNAME minimization improvements,
and mitigation for new side channels. Old versions accumulate sharp edges. You will bleed.

4) Preserve entropy end-to-end (ports, IDs, and the network in between)

Source-port randomization only helps if the packets keep their ports. If your resolver is behind a NAT that rewrites ports predictably,
you lose entropy. Either move the resolver to the edge, fix NAT behavior, or use a design that avoids that NAT path.

5) Prefer full recursion over blind forwarding (when you can)

Forwarding to public resolvers is operationally easy. It’s also a trust shift. You’re outsourcing part of your security posture and all of your
troubleshooting clarity. Full recursion makes root cause analysis cleaner and reduces dependence on a single upstream’s quirks.

If you must forward (compliance, policy, egress constraints), do it to controlled resolvers you operate or to trusted, validated upstreams,
and keep DNSSEC validation on locally where possible.

6) Turn on QNAME minimization

QNAME minimization reduces information leakage and slightly reduces the value of some spoofing games by minimizing what you reveal to each
upstream. This is not your primary poisoning defense, but it’s a cheap win.
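
Enabling it is a one-liner in either resolver. A sketch; newer releases enable it by default, so verify rather than assume:

# Unbound (unbound.conf)
server:
    qname-minimisation: yes

# BIND (named.conf options)
options {
        qname-minimization relaxed;
};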

7) Use rate limiting where it protects stability, not as a magic security spell

Rate limiting helps against floods and some brute-force patterns. It won’t “solve poisoning.”
But it can keep the resolver alive long enough for you to respond and it reduces amplification potential.

8) Log with intent: enough to investigate, not enough to drown

Query logs 24/7 are expensive and noisy. Use sampling or short-term toggles (rndc querylog, Unbound verbosity increase),
and keep structured metrics always on: cache hit ratio, SERVFAIL rates, unwanted replies, top QNAMEs by volume.
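
The short-term toggles mentioned above, as commands. Both revert easily; leave them off by default and pair enabling them with a runbook step to turn them back down:

cr0x@server:~$ sudo rndc querylog on
cr0x@server:~$ sudo rndc querylog off
cr0x@server:~$ sudo unbound-control verbosity 3
cr0x@server:~$ sudo unbound-control verbosity 1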

9) Pin down internal zones and avoid collisions

Split-horizon DNS is often necessary. It’s also a fertile ground for accidental “poisoning-like” behavior: different answers depending on source,
wrong views applied, internal zones shadowing real public domains. Use clear suffixes, test externally, and automate checks.

10) Decide on your stance for EDNS, cookies, and TCP fallback

EDNS increases DNS message sizes and capabilities. It’s good, but it interacts with MTU, fragmentation, and middleboxes.
EDNS Cookies can add resilience against spoofed replies in some scenarios by adding a token to the exchange (supported variably).
Ensure TCP fallback works: some attacks and failures rely on breaking UDP assumptions.
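
Two quick checks for that stance, using flags dig already has: force TCP to prove fallback works, and cap the advertised UDP buffer to see whether large answers survive your path (1232 bytes is the commonly recommended conservative value):

cr0x@server:~$ dig @10.0.0.53 example.com SOA +tcp +noall +answer
cr0x@server:~$ dig @10.0.0.53 example.com DNSKEY +bufsize=1232 +noall +answer

If +tcp hangs or is refused, fix that before an incident forces the issue; truncated UDP answers with broken TCP fallback look exactly like “DNS is flaky.”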

Short joke #2: Every time someone says “DNS is simple,” a resolver drops a fragmented UDP packet and quietly plots revenge.

Three corporate-world mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A company ran two recursive resolvers per site. The design doc said “anycast,” but the implementation was “two IPs in DHCP.”
An engineer assumed that meant clients would seamlessly fail over when one resolver was unhealthy.

One Monday, a resolver node started returning stale records for a small set of high-traffic domains. Not totally dead. Just wrong enough.
The TTLs were long. The helpdesk got reports of “login page sometimes wrong.” Which is the kind of report you wish was a prank.

The wrong assumption: the team thought the secondary resolver would mask the primary’s problems. In reality, most client stacks stick to the first server
unless it times out. A resolver that answers quickly—incorrectly—doesn’t trigger failover. It becomes an authoritative source of lies for that client population.

The fix wasn’t heroic. They stopped treating “two IPs” as high availability. They added resolver health checks, removed unhealthy nodes from DHCP pools,
and standardized cache flush procedures. Most importantly, they built a tiny synthetic check that compared answers from each resolver against an external trace.

The postmortem conclusion was blunt: “fast wrong answers are worse than timeouts.” That line became a design requirement, not a lesson.

Mini-story 2: The optimization that backfired

A different org had latency complaints from remote offices. Someone proposed an optimization: forward all DNS to a single pair of central resolvers
because “cache hits will be higher.” Technically true. Operationally, a trap.

They implemented forwarding over a VPN. Query volume was fine, but EDNS responses started getting fragmented, and the VPN path had a lower effective MTU.
Some fragments were dropped. Certain domains started intermittently failing with SERVFAIL or long hangs due to retry logic and TCP fallback issues.

The team noticed the symptom but diagnosed the wrong thing first: “authoritative servers are flaky.” They opened tickets, waited, and got nowhere.
Meanwhile, an attacker in a shared network environment (not on-path to the datacenter, but near some branch clients) had an easier time spoofing responses
because the forwarding path and NAT behavior reduced effective entropy and increased retransmissions.

Nothing catastrophic happened, but the incident was a master class in self-inflicted fragility. The “optimization” concentrated trust and failure.
The fix was to deploy local resolvers in branches with full recursion, preserve port randomization locally, and keep forwarding only as an emergency policy lever.

The final lesson: caching is not a reason to create a single point of weirdness. Especially not for DNS.

Mini-story 3: The boring but correct practice that saved the day

A payments company had a policy: resolvers validate DNSSEC, and disabling validation requires a formal change with an expiry.
People grumbled about it during calm weeks. During a real incident, it paid for itself.

A third-party SaaS domain began resolving to an unexpected IP for a subset of networks. Public resolvers disagreed with each other.
The company’s validating resolvers returned SERVFAIL for some responses and correct answers for others. That looked like a DNS outage to the application teams.

Because metrics were in place, SRE saw an increase in validation failures and “unwanted replies.” Packet captures showed forged responses arriving
faster than legitimate ones on one ISP path. DNSSEC validation refused the forged data. The impact was degraded resolution (timeouts and retries),
but not silent misdirection to the attacker.

They temporarily routed resolver egress through a different provider, keeping validation on. Service stabilized.
Later, the ISP acknowledged an issue consistent with traffic interference. The company never had to tell customers “we might have sent you to the wrong host.”

The boring practice wasn’t just “DNSSEC on.” It was “DNSSEC on, monitored, and hard to disable.” Boring is good. Boring scales.

Common mistakes: symptoms → root cause → fix

1) Symptom: Some users reach the wrong IP, others are fine

Root cause: One resolver node has poisoned/stale cache, or clients are pinned to the first DNS server.

Fix: Compare answers per resolver IP; flush cache on the bad node; remove it from rotation; add per-node synthetic checks.

2) Symptom: Wrong answer persists for hours even after “fixing DNS”

Root cause: Long TTL cached at recursive resolvers; negative caching; intermediate forwarder caching; or apps caching DNS internally.

Fix: Measure TTL countdown at each layer. Purge caches where you control them; reduce TTL before migrations; restart only as last resort.

3) Symptom: Sudden spike in SERVFAIL for popular domains

Root cause: DNSSEC validation failures (broken DS chain, expired signatures) or upstream timeouts/fragmentation issues.

Fix: Check logs for validation errors; confirm EDNS/MTU path; ensure TCP fallback; do not disable validation without a scoped plan.

4) Symptom: Resolver CPU is fine, but clients see timeouts

Root cause: Packet loss, firewall dropping fragments, or UDP state timeouts; occasionally conntrack exhaustion.

Fix: tcpdump on resolver; check conntrack; allow DNS TCP/53; tune EDNS buffer size or path MTU; fix network ACLs.

5) Symptom: Public IPs appear in resolver logs as clients

Root cause: Open recursion exposed to the internet or VPN misrouting; sometimes a cloud security group mistake.

Fix: Lock down recursion with ACLs; firewall UDP/TCP 53; confirm cloud SG/NACL; verify you’re not inadvertently advertising resolver IPs.

6) Symptom: Resolver returns “correct” A record but TLS fails with certificate mismatch

Root cause: DNS answer points to wrong host (poisoning or split-horizon), or CDNs shifting edges and your pinning assumptions are outdated.

Fix: Compare with +trace and multiple resolvers; check whether you’re overriding the zone internally; validate authoritative records.

7) Symptom: Problems only for large responses (TXT, DNSKEY, some HTTPS/SVCB)

Root cause: EDNS fragmentation drops; path MTU issues; broken middleboxes; TCP blocked.

Fix: Ensure TCP/53 works end-to-end; tune EDNS buffer; prefer modern resolver versions with sane defaults.

8) Symptom: Random subdomain queries spike (e.g., a1b2c3.example.com)

Root cause: Cache-bypass flood, DDoS attempt, or malware beaconing; can also be misbehaving service discovery.

Fix: Use RPZ or local policy to block known bad patterns; implement rate limiting; identify source IPs; coordinate with security.

Checklists / step-by-step plan

Phase 1 — Baseline your resolver security posture (1–2 days)

  1. Inventory resolvers: IPs, versions, whether they recurse or forward, and where their egress goes.
  2. Confirm recursion ACLs: only internal networks can recurse; everyone else gets refused.
  3. Enable DNSSEC validation: validate on the resolver, not just on clients.
  4. Verify TCP/53 is allowed: both inbound to resolver from clients (if needed) and outbound from resolver.
  5. Confirm port randomization survives NAT: check conntrack/flows; fix predictable mappings.
  6. Turn on metrics: cache hit rate, SERVFAIL, NXDOMAIN, unwanted replies, latency histograms.

Phase 2 — Add guardrails that prevent “temporary” insecurity (1–2 weeks)

  1. Config as code: resolvers managed via your normal pipeline, not artisanal SSH edits.
  2. Change control for DNSSEC toggles: any disablement requires an expiry and an owner.
  3. Synthetic checks per resolver node: compare against +trace for a small domain set; alert on divergence.
  4. Short-term query logging toggle: operational runbook to enable logs for 15 minutes with safe retention.

Phase 3 — Hardening without overengineering (ongoing)

  1. Prefer full recursion unless policy forces forwarding. If you forward, be explicit about trust and monitor upstream behavior.
  2. Minimize split-horizon complexity: keep internal zones small and clear; avoid shadowing public domains.
  3. Patch cadence: treat resolver updates as security updates, because they are.
  4. Tabletop exercise: simulate “wrong DNS answer” and “DNSSEC validation failure” and practice the response.

FAQ

1) Is DNS cache poisoning still a thing in 2025?

Yes, but the easy versions are mostly dead on modern resolvers with good randomization and bailiwick rules.
The remaining risk is misconfiguration, forwarding trust, NAT entropy loss, on-path interference, and “validation turned off.”

2) If I use DoH/DoT, am I safe from cache poisoning?

DoH/DoT protects the client-to-resolver path from on-path tampering. It does not automatically validate DNSSEC, and it doesn’t protect
you if the resolver itself is compromised or misconfigured. It’s a transport improvement, not a truth oracle.

3) What’s the single best hardening step?

Enable DNSSEC validation on the recursive resolver and keep it on. Then restrict recursion to internal clients.
Those two changes remove entire classes of pain.

4) Why does my resolver return SERVFAIL when public resolvers return an IP?

Most often: DNSSEC validation failure. Public resolvers might be configured differently, might be ignoring validation for that zone,
or might have different caching state. Check your logs before blaming the internet.

5) How can I tell poisoning from split-horizon DNS?

Split-horizon tends to be consistent per source network and configured zones. Poisoning often shows divergence between resolvers,
unexpected delegation changes, and anomalies like unwanted replies. Use +trace and compare internal vs external answers.

6) Should we flush caches during a suspected poisoning incident?

Sometimes, yes—after you capture evidence. Flushing can destroy the proof you need (packet captures, logs, cached RRsets).
If you flush first and ask questions later, your postmortem will be a creative writing exercise.

7) Are long TTLs always bad?

No. Long TTLs are great for stability and cost. They’re bad when the data is wrong. Operationally, you should lower TTLs before planned
migrations and avoid extremely long TTLs on records that change frequently.

8) Does running multiple resolvers fix poisoning?

It reduces the chance a single poisoned cache affects everyone, but only if clients actually use multiple resolvers and you monitor per-node behavior.
Two resolvers with identical bad forwarding upstream is just redundancy of the same mistake.

9) What about “0x20 encoding” and other entropy tricks?

They can help in specific cases, but they’re not a primary defense. Treat them as seasoning, not the meal.
DNSSEC validation and proper randomization plus ACLs are the meal.

10) How do I avoid overengineering?

Don’t build a bespoke DNS security platform. Run a modern resolver (Unbound or BIND), validate DNSSEC, lock down recursion,
monitor key metrics, and practice incident response. Most teams fail on basics, not on lack of exotic features.

Conclusion: practical next steps

DNS cache poisoning is less about Hollywood hacks and more about operational hygiene: who can query you, what you trust, and whether your resolver
can tell truth from trash. Attackers like ambiguity. So do outages. Your job is to remove both.

  1. This week: verify recursion ACLs, confirm DNSSEC validation is enabled, and run the “compare against external” checks on your top domains.
  2. Next week: add per-resolver synthetic monitoring for answer divergence and validation failures; make “disable DNSSEC” a controlled change with expiry.
  3. This quarter: modernize resolver versions, audit NAT/egress behavior for entropy loss, and simplify split-horizon zones.

If you do only one thing: stop accepting fast wrong answers as “healthy.” DNS is infrastructure. Treat it like it matters, because everything else depends on it.
