DNS is the one dependency everybody forgets until it’s on fire. Your web tier can autoscale, your caches can absorb, your databases can shed. DNS? DNS just answers questions until it can’t—and then everything looks “down” even though your servers are fine.
Load spikes hit DNS in ugly ways: sudden popularity, broken client retries, bot scans, NXDOMAIN floods, reflection attacks, or a “helpful” change that turns a quiet zone into a hot mess. The trick isn’t merely blocking traffic. It’s limiting the right things, at the right layer, without lying to legitimate clients or detonating your own reliability.
What actually breaks during DNS spikes
When DNS falls over, you usually don’t get a clean “DNS is down” alarm. You get a parade of second-order symptoms: login failures, random API timeouts, Kubernetes nodes “flapping,” and a CEO who can’t reach the marketing site (which, tragically, becomes “P0”).
Failure mode 1: packet handling collapses before CPU does
Most DNS daemons are fast, but they still need the kernel to receive packets, queue them, and deliver them to user space. Under bursty QPS, you can hit:
- NIC ring exhaustion (drops at the driver).
- Socket receive buffer overruns (drops at the kernel).
- Single-thread bottlenecks (daemon can’t drain queues).
Result: clients see timeouts and retry, which multiplies traffic and turns “spike” into “spiral.” DNS is one of the rare places where “timeout” is not just a symptom—it’s a traffic amplifier.
Failure mode 2: latency kills you long before errors do
DNS clients often wait only a few hundred milliseconds before retrying or failing over. Your server can still be answering, but if p95 creeps from 5 ms to 200 ms you’ll see “random” outages across applications.
Failure mode 3: the wrong traffic wins
Not all queries deserve equal treatment. During attacks, you’ll get floods of:
- Random subdomain queries (e.g., asdf123.example.com) designed to bypass caches and create NXDOMAIN or NODATA work.
- ANY queries (less relevant now but still seen) or query types that trigger big responses.
- EDNS-enabled probes that try to coerce large UDP responses for reflection.
If your infrastructure treats these like normal traffic, you end up protecting the attacker’s QPS while legitimate clients queue up behind them. That’s not neutrality; it’s operational self-harm.
Failure mode 4: your “protection” is the outage
Rate limiting done badly is indistinguishable from an attack. Too strict, and you blackhole real users behind NAT or carrier-grade NAT. Too clever, and you throttle your own recursive resolvers or health checks. The goal is controlled degradation: some responses slow down or get truncated, but the service stays reachable and recovers quickly.
One dry truth: DNS is where “simple” configurations become distributed systems. You are negotiating with stub resolvers, recursive resolvers, caches, middleboxes, and people running antique firmware in hotel Wi‑Fi routers.
One quote to remember (paraphrased idea): “Hope is not a strategy,” often attributed to engineers and operators discussing reliability. Treat DNS spikes the same way: design, measure, rehearse.
Joke #1: DNS is the only system where the best-case outcome is being ignored because everyone cached you.
Interesting facts and historical context
- DNS predates the modern web: it was designed in the early 1980s to replace the HOSTS.TXT file distribution model.
- UDP was the default for speed: DNS over UDP was chosen because it avoids connection setup, which is great until you meet spoofing and reflection abuse.
- DNS caching is deliberate load shedding: TTLs are not just correctness controls; they’re an economic system for QPS.
- The root server system isn’t a single box: it’s a globally anycasted constellation; “root is down” is nearly always your local problem.
- ANY queries became a DDoS tool: many operators now minimize or restrict ANY responses, because they were used to trigger oversized replies.
- DNSSEC increased response sizes: adding signatures improves integrity but can worsen fragmentation and amplification risk if you don’t manage EDNS sizes.
- Response Rate Limiting (RRL) emerged from pain: authoritative operators needed a way to cut reflection usefulness without going offline.
- NXDOMAIN attacks are old: random subdomain floods have been used for years because they bypass normal caching behavior.
- Anycast changed the playbook: spreading authoritative DNS across POPs reduces single-site saturation but adds debugging complexity.
Fast diagnosis playbook (first/second/third)
This is the “stop guessing” sequence. You can do it in 10 minutes while everyone else argues in Slack.
First: confirm it’s DNS, not “DNS-shaped”
- From an affected client network, query your authoritative name servers directly (bypassing recursion). Check latency and timeouts.
- From inside the data center/VPC, do the same. If internal is fine and external is not, suspect edge saturation or upstream filtering.
- Check whether the spike is on authoritative, recursive, or both. They fail differently.
Second: locate the bottleneck layer
- Network ingress: packet drops at NIC/kernel, conntrack storms, PPS limits.
- Daemon: CPU per thread, lock contention, cache miss storms, DNSSEC signing or validation costs.
- Backend dependencies: dynamic backends (databases), slow zone transfers, overloaded logging or telemetry sinks.
Third: apply safe containment, then optimize
- Deploy rate limiting at the edge (dnsdist / firewall / load balancer) before you touch daemon internals.
- Prioritize: keep basic A/AAAA/CNAME/SOA/NS functional. Degrade “nice to have” (ANY, TXT, large responses).
- Stabilize latency (p95) first. Then chase absolute QPS.
Measure first: tasks, commands, outputs, decisions
You can’t rate-limit your way out of not knowing what’s happening. Below are practical tasks you can run during an incident or rehearsal. Each includes a command, a sample output, what it means, and the decision you make.
Task 1: Verify authoritative response and latency (direct query)
cr0x@server:~$ dig @203.0.113.53 www.example.com A +tries=1 +time=1
; <<>> DiG 9.18.24 <<>> @203.0.113.53 www.example.com A +tries=1 +time=1
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43112
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 1
;; ANSWER SECTION:
www.example.com. 60 IN A 198.51.100.10
;; Query time: 12 msec
;; SERVER: 203.0.113.53#53(203.0.113.53) (UDP)
;; WHEN: Tue Dec 31 12:00:00 UTC 2025
;; MSG SIZE rcvd: 93
What it means: You’re getting an authoritative answer (flag aa), with low latency.
Decision: If this is fast but users complain, the problem may be recursive resolvers, network path, or cache poisoning/negative caching behavior—not the authoritative itself.
Task 2: Compare UDP vs TCP behavior (detect fragmentation/EDNS pain)
cr0x@server:~$ dig @203.0.113.53 example.com DNSKEY +dnssec +time=1 +tries=1
;; Truncated, retrying in TCP mode.
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5321
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; Query time: 210 msec
;; SERVER: 203.0.113.53#53(203.0.113.53) (TCP)
What it means: UDP truncation forced TCP retry. Under load, TCP costs more (state, accept queues, kernel overhead).
Decision: Tune EDNS buffer size and DNSSEC response behavior; consider limiting large responses, and ensure TCP handling capacity is sane.
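If TCP fallback is part of the plan, make sure the daemon and the kernel can actually accept those connections. A minimal sketch for BIND on Linux follows; the numbers are illustrative placeholders, not recommendations, and other daemons have their own equivalents.
// named.conf (options): raise TCP concurrency and the listen backlog
options {
    tcp-clients 2048;        // simultaneous TCP clients the daemon will serve
    tcp-listen-queue 128;    // backlog hint for the DNS listening sockets
};
cr0x@server:~$ sudo sysctl -w net.core.somaxconn=4096
cr0x@server:~$ sudo sysctl -w net.ipv4.tcp_max_syn_backlog=8192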
Task 3: Check server-side QPS and top qtypes on BIND (authoritative)
cr0x@server:~$ sudo rndc stats
cr0x@server:~$ sudo tail -n 20 /var/cache/bind/named.stats
++ Incoming Requests ++
[View: default]
1987632 QUERY
504112 NOTIFY
++ Incoming Queries ++
[View: default]
912331 A
701221 AAAA
389002 TXT
211332 DNSKEY
122880 ANY
What it means: You have a high TXT/DNSKEY/ANY mix. That’s a classic “make responses bigger” signature.
Decision: Consider tightening response policy: minimize ANY, rate-limit qtypes that are being abused, and review DNSSEC/EDNS sizing.
Task 4: Inspect Unbound stats quickly (recursive)
cr0x@server:~$ sudo unbound-control stats_noreset | egrep 'total\.num\.queries|cachehits|cachemiss|recursion\.time\.avg|num\.query\.type|unwanted'
total.num.queries=18422301
total.num.cachehits=11022334
total.num.cachemiss=7401130
total.recursion.time.avg=0.145000
num.query.type.A=10233444
num.query.type.AAAA=6012231
num.query.type.TXT=1589222
unwanted.queries=433221
What it means: Cache misses are high and recursion time is rising. Unwanted queries are non-trivial—likely abuse or misbehaving clients.
Decision: Add client rate limiting, tighten access control (who can use recursion), and increase cache where it’s safe. If recursion time is high due to upstream timeouts, consider forwarding strategy or root/TLD reachability.
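If you go down that road on Unbound, the controls live in unbound.conf. A minimal sketch, with placeholder networks and numbers you should replace with values derived from your own baseline:
server:
    # who is allowed to use recursion at all
    access-control: 10.0.0.0/8 allow
    access-control: 0.0.0.0/0 refuse
    # per-client and per-zone rate limits (queries per second)
    ip-ratelimit: 100
    ratelimit: 1000
    # give the cache enough room that misses stay rare
    msg-cache-size: 256m
    rrset-cache-size: 512m
ip-ratelimit shapes individual clients; ratelimit caps how hard Unbound will hit any single upstream zone when misses pile up.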
Task 5: Identify top talkers and query names (packet sample)
cr0x@server:~$ sudo tcpdump -ni eth0 udp port 53 -c 50
12:00:01.001234 IP 192.0.2.10.53211 > 203.0.113.53.53: 1234+ A? asdf91k2.example.com. (38)
12:00:01.001260 IP 192.0.2.10.53212 > 203.0.113.53.53: 2234+ AAAA? asdf91k2.example.com. (38)
12:00:01.001289 IP 198.51.100.77.41922 > 203.0.113.53.53: 3321+ ANY? example.com. (28)
...
What it means: Random labels for your zone suggest an NXDOMAIN/NODATA cache-bypass flood; ANY queries suggest amplification probing.
Decision: Apply per-source and per-qname/qtype rate limiting at the edge, and consider serving “synthesized” minimal answers where appropriate (carefully) or aggressive negative caching settings.
Task 6: Check kernel drops and UDP receive errors
cr0x@server:~$ netstat -su
Udp:
1283221 packets received
0 packets to unknown port received
22111 packet receive errors
904321 packets sent
22111 receive buffer errors
What it means: Receive buffer errors imply the kernel is dropping UDP because the app can’t drain fast enough or buffers are too small.
Decision: Increase socket buffers and tune kernel networking; also reduce per-packet CPU by moving rate limiting earlier (dnsdist, XDP, firewall) and reducing logging overhead.
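The usual buffer knobs look like this; sizes are illustrative, and you should measure drops before and after rather than trusting the change:
cr0x@server:~$ sudo sysctl -w net.core.rmem_max=8388608
cr0x@server:~$ sudo sysctl -w net.core.rmem_default=4194304
cr0x@server:~$ sudo sysctl -w net.core.netdev_max_backlog=65536
The daemon also has to request the larger buffer (Unbound exposes so-rcvbuf, for example), and anything that works during the incident belongs in /etc/sysctl.d/ so a reboot doesn’t quietly undo it.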
Task 7: Check NIC and kernel-level drops
cr0x@server:~$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
RX: bytes packets errors dropped overrun mcast
9876543210 12345678 0 4321 0 0
TX: bytes packets errors dropped carrier collsns
8765432109 11223344 0 0 0 0
What it means: RX drops at the interface are already happening before your daemon sees traffic.
Decision: Add capacity (more instances/anycast), use higher PPS NICs, or enforce upstream filtering/rate limiting so the host never sees the worst of it.
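Before buying hardware, check whether the RX rings are simply at their (often small) defaults. A quick sketch; the interface name and size are placeholders:
cr0x@server:~$ ethtool -g eth0                 # current vs hardware-maximum ring sizes
cr0x@server:~$ sudo ethtool -G eth0 rx 4096    # raise the RX ring toward the hardware max
Bigger rings buy burst tolerance, not sustained capacity; if drops persist, the answer is filtering upstream or adding instances.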
Task 8: Confirm conntrack isn’t accidentally involved
cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 982312
net.netfilter.nf_conntrack_max = 1048576
What it means: You’re near conntrack max. DNS over UDP shouldn’t need conntrack in many designs, but firewalls/NAT can drag it in.
Decision: Avoid stateful tracking for high-QPS DNS if possible; move filtering to stateless rules or dedicated edge devices, or raise limits with care.
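If nothing on the host needs NAT or stateful filtering for port 53, telling conntrack to ignore DNS entirely is a cheap win. A minimal nftables sketch, assuming a reasonably recent nft and that no other rule depends on conntrack state for these flows (the table name is arbitrary):
table inet rawdns {
    chain prerouting {
        type filter hook prerouting priority raw; policy accept;
        udp dport 53 notrack
    }
    chain output {
        type filter hook output priority raw; policy accept;
        udp sport 53 notrack
    }
}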
Task 9: Validate daemon health and thread saturation (systemd + CPU)
cr0x@server:~$ systemctl status named --no-pager
● named.service - BIND Domain Name Server
Loaded: loaded (/lib/systemd/system/named.service; enabled)
Active: active (running) since Tue 2025-12-31 11:20:12 UTC; 39min ago
Main PID: 1023 (named)
Tasks: 42
Memory: 1.3G
CPU: 18min 22.331s
cr0x@server:~$ top -b -n1 | head -n 15
top - 12:00:10 up 10 days, 3:22, 1 user, load average: 18.21, 17.44, 16.02
%Cpu(s): 92.1 us, 4.1 sy, 0.0 ni, 1.2 id, 0.0 wa, 0.0 hi, 2.6 si, 0.0 st
MiB Mem : 32110.0 total, 2100.0 free, 12000.0 used, 18010.0 buff/cache
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1023 bind 20 0 2460.0m 1320.0m 8.1m R 580.0 4.1 18:25.12 named
What it means: CPU is hot; named is consuming multiple cores. If latency is still bad, CPU isn’t the only bottleneck—check drops and queueing.
Decision: If CPU is the limiter, scale out and/or reduce expensive work (DNSSEC signing/validation, logging, dynamic updates). If drops are present, rate-limit before the daemon.
Task 10: Check DNS response size distribution (spot amplification risk)
cr0x@server:~$ sudo tcpdump -ni eth0 udp port 53 -vv -c 10
12:00:12.100001 IP 203.0.113.53.53 > 192.0.2.10.53211: 1234 NXDomain 0/1/0 (112)
12:00:12.100120 IP 203.0.113.53.53 > 198.51.100.77.41922: 3321 4/0/6 (1452)
What it means: 1452-byte UDP responses are flirting with fragmentation (depending on path MTU). Fragmentation increases loss and can make you a better reflection weapon.
Decision: Clamp EDNS UDP payload size (often around 1232 bytes is a pragmatic choice), reduce additional records, and ensure TC=1 fallback behavior is acceptable.
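On a BIND authoritative, the clamp is two options in named.conf; 1232 bytes is the widely used conservative value, not a law of nature:
options {
    edns-udp-size 1232;   // EDNS buffer size we advertise on queries we send
    max-udp-size 1232;    // largest UDP response we will emit; bigger answers get TC=1
};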
Task 11: Check dnsdist front-end stats (edge rate limiter)
cr0x@server:~$ sudo dnsdist -c -e 'showStats()'
acl-drops          12345
responses          9812234
queries            9921100
rdqueries          221001
rule-drop          88321
rule-nxdomain      0
latency0-1         7122331
latency1-10        2501123
latency10-50       210992
latency50-100      12012
latency100-1000    542
What it means: You’re dropping traffic via ACL/rules while keeping most latency under 10 ms. That’s the shape you want during an attack.
Decision: If drops are too high and legit users are impacted, refine rules (per-netblocks, per-qtype, per-qname) instead of raising global limits.
Task 12: Validate zone correctness and negative caching TTL (SOA)
cr0x@server:~$ dig @203.0.113.53 example.com SOA +noall +answer
example.com. 300 IN SOA ns1.example.com. hostmaster.example.com. 2025123101 7200 3600 1209600 60
What it means: The last field (minimum/negative TTL in modern practice) is 60 seconds. That governs how long resolvers may cache NXDOMAIN/NODATA.
Decision: During NXDOMAIN floods, a too-low negative TTL makes everything worse because resolvers re-ask constantly. Raise it thoughtfully (not to days; to minutes or an hour depending on your change cadence).
Task 13: Find whether clients are retrying excessively (log sampling)
cr0x@server:~$ sudo journalctl -u named --since "10 min ago" | egrep 'client|query' | head
Dec 31 11:51:01 ns1 named[1023]: client @0x7f2c1c0a: query (cache) 'asdf91k2.example.com/A/IN' denied
Dec 31 11:51:01 ns1 named[1023]: client @0x7f2c1c0b: query (cache) 'asdf91k2.example.com/AAAA/IN' denied
Dec 31 11:51:01 ns1 named[1023]: client @0x7f2c1c0c: query (cache) 'asdf91k2.example.com/A/IN' denied
What it means: You see repeated queries for the same random name, potentially from recursive resolvers. Denies may be ACLs or rate-limit decisions.
Decision: If legitimate resolvers are being denied, you’re cutting too close. Prefer “slip”/truncate strategies or per-source shaping over blanket denies.
Task 14: Confirm time sync (DNSSEC and TTL sanity)
cr0x@server:~$ timedatectl
Local time: Tue 2025-12-31 12:00:30 UTC
Universal time: Tue 2025-12-31 12:00:30 UTC
RTC time: Tue 2025-12-31 12:00:29
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
What it means: Time is sane. Bad time can cause DNSSEC validation failures and weird caching behavior.
Decision: If unsynchronized, fix NTP first. Debugging DNS under time skew is like debugging storage while the disk is on fire.
Rate-limiting patterns that work (and what to avoid)
Rate limiting DNS isn’t “block bad IPs.” Attack traffic is often spoofed, distributed, or coming from legitimate recursive resolvers that are themselves overwhelmed. You’re shaping a system where the client identity is fuzzy and the transport is usually UDP.
Principle: rate-limit at the edge, not inside the brain
Your authoritative daemon should spend CPU on answering legitimate queries, not on being a traffic cop. Put a layer in front:
- dnsdist for DNS-aware filtering and shaping.
- iptables/nftables for coarse PPS limits and ACLs.
- Anycast + multiple POPs to turn “one target” into many smaller targets.
Do not make your authoritative server parse and log every abusive query at debug level. That’s not observability. That’s printing the DDoS to paper.
Pattern 1: Per-source token bucket (with NAT reality)
A per-IP QPS limit is the simplest control. It’s also how you accidentally block an entire mobile carrier behind CGNAT. Use it, but (a dnsdist sketch follows the list):
- Set limits high enough for big resolvers.
- Prefer per-/24 or per-/56 limits only if you understand collateral damage.
- Whitelist known recursive resolvers you rely on (your own, major public resolvers if appropriate).
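A minimal dnsdist sketch of this pattern: allow-list first, then a per-source bucket. The addresses are documentation placeholders and the QPS number is made up; derive yours from baseline data.
-- backend authoritative (placeholder address)
newServer({address="203.0.113.53"})
-- resolvers we never throttle: our own recursives, key partners
trusted = newNMG()
trusted:addMask("198.51.100.0/24")
addAction(NetmaskGroupRule(trusted), AllowAction())
-- everyone else: per-source limit, grouped per /32 for IPv4 and /56 for IPv6
addAction(MaxQPSIPRule(50, 32, 56), DropAction())
Order matters: dnsdist evaluates rules in the order they were added, so the allow rule has to come before the limiter.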
Pattern 2: Rate-limit by qname pattern (random subdomain floods)
Random subdomains are designed to cause cache misses. Rate-limiting by qname suffix and “entropy” can be effective (a BIND RRL sketch for the NXDOMAIN throttle follows the list):
- Throttle NXDOMAIN responses per source.
- Throttle queries for “deep” labels (e.g., 20+ chars random prefix) if your business doesn’t use them.
- Consider wildcard records carefully—wildcards can turn NXDOMAIN into NOERROR, which changes caching and attack economics.
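For the first bullet, BIND’s Response Rate Limiting is the classic tool. A sketch with placeholder numbers; exempt your own resolvers so the protection doesn’t throttle you:
options {
    rate-limit {
        responses-per-second 20;
        nxdomains-per-second 5;
        errors-per-second 5;
        window 5;
        ipv4-prefix-length 24;
        exempt-clients { 198.51.100.0/24; };
    };
};
nxdomains-per-second is the lever that matters during random-subdomain floods; the per-/24 grouping stops attackers from dodging the limit by rotating source addresses inside a small range.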
Pattern 3: Minimize amplification value
Reflection attacks want large responses to small queries. Make your server a lousy amplifier (a config sketch follows the list):
- Minimize ANY responses (many authoritative servers answer minimally or refuse ANY).
- Clamp EDNS UDP size to reduce fragmentation and oversized replies.
- Avoid gratuitous additional records (like extra glue) unless necessary.
- Ensure you’re not an open resolver if you’re authoritative. That mistake never dies.
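A BIND-flavored sketch covering most of the list; minimal-any needs a reasonably modern BIND, and other servers have equivalent switches:
options {
    minimal-responses yes;     // skip unnecessary authority/additional records
    minimal-any yes;           // answer ANY with a single RRset instead of everything
    recursion no;              // authoritative only; never an open resolver
    allow-recursion { none; };
};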
Pattern 4: “Slip” (truncate) instead of hard drop, when it helps
Some rate-limit implementations can “slip” responses: send truncated replies (TC=1) occasionally so legitimate resolvers can retry over TCP, while attackers lose efficiency. This isn’t magic—TCP can also be attacked—but it can shift cost back toward the requester.
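In BIND’s rate-limit block this is the slip setting; at a dnsdist edge the same idea is just a different action, sketched here with a placeholder threshold:
-- past a soft per-source limit, answer TC=1 instead of staying silent:
-- genuine resolvers retry over TCP, spoofed reflection traffic gains nothing useful
addAction(MaxQPSIPRule(30), TCAction())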
What to avoid: global QPS caps without classifying traffic
A global cap says “I don’t know what’s going on, so I’ll hurt everyone equally.” That’s a fast way to keep the attack running while your customers churn.
Joke #2: The only thing scarier than a DNS DDoS is a DNS DDoS plus an “emergency logging increase” from someone who misses the 1990s.
Design for survival: architecture and capacity moves
Use anycast, but don’t worship it
Anycast spreads query load across multiple sites by advertising the same IP from many locations. It’s the industry standard for authoritative at scale. It also introduces operational realities:
- Traffic follows BGP, not your intent. A route flap can move half a continent’s DNS to one POP.
- Debugging “why is this resolver hitting that POP?” becomes a networking exercise.
- State must be minimal: zone content and keys must replicate reliably.
Do anycast if you can run it well. Otherwise, use multiple unicast NS with strong DDoS protection upstream. The reliability goal is diversity.
Separate “answering” from “thinking”
The best authoritative systems do minimal computation per query:
- Pre-signed zones (if using DNSSEC) or efficient signing infrastructure.
- Memory-resident zone data.
- No synchronous calls to databases on query path. If you absolutely must do dynamic responses, front them with caching and strict circuit breakers.
Tune EDNS and UDP size like you mean it
Large UDP answers fragment. Fragments get dropped. Dropped fragments trigger retries. Retries multiply QPS. This is how you get a “mysterious” spike that looks like the internet collectively forgot how to cache.
Pragmatic move: keep UDP answers reasonably sized. Clamp EDNS UDP payload to a conservative value; accept that TCP fallback is sometimes necessary and engineer for it.
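On the resolver side the clamp is a single line, and dig can show whether a given answer now truncates. The address is a placeholder; +ignore keeps dig from retrying over TCP so you can see the TC flag:
# unbound.conf
server:
    edns-buffer-size: 1232
cr0x@server:~$ dig @203.0.113.53 example.com DNSKEY +dnssec +bufsize=1232 +ignore
If the reply comes back with the tc flag set, that answer no longer fits your UDP clamp and resolvers will fetch it over TCP; decide whether that cost is acceptable or whether the response itself should shrink.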
TTL strategy: caching is your cheapest capacity
Short TTLs feel agile. They’re also expensive. Under attack or viral traffic, the difference between TTL 30 and TTL 300 is a factor of 10 in query volume from recursors that respect TTL.
Don’t set TTLs based on vibes. Set them based on:
- How fast you truly need to change records in emergencies.
- Whether you have other routing controls (load balancers, anycast, failover).
- Your tolerance for caching stale data during incidents.
Logging: sample, don’t drown
Query logging at full volume during spikes is a classic self-own. Use:
- Sampling (1 in N queries).
- Aggregated metrics (qtype counts, NXDOMAIN rate, top clients by PPS).
- Short packet captures with tight filters.
Keep your storage subsystem out of the blast radius. DNS is a network service; don’t turn it into a disk benchmark.
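When you do need packets, keep the capture tight and bounded. A sketch; the interface, netblock, and path are placeholders:
cr0x@server:~$ sudo tcpdump -ni eth0 -s 512 -c 2000 -w /var/tmp/dns-sample.pcap 'udp port 53 and net 198.51.100.0/24'
A couple of thousand packets with a short snap length is usually enough to characterize a flood without turning the capture into its own incident.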
Three corporate mini-stories from the trenches
Mini-story 1: the incident caused by a wrong assumption
A mid-size SaaS company ran authoritative DNS on two VMs behind a “reliable” cloud load balancer. The team believed the load balancer provided “DDoS protection by default,” because it came from a big vendor and had a nice dashboard.
Then a partner integrated a new SDK. Because someone misread a caching guideline, the SDK did an SRV lookup not just at app startup but before every request. That created a perfectly legitimate spike: not an attack, just a hundred thousand clients being enthusiastic. The authoritative servers got hammered with repetitive, cache-bypassing queries because the records had a TTL of 30 seconds “for flexibility.”
Latency climbed. Timeouts increased. Clients retried. The load balancer started queueing, and eventually health checks failed because the DNS servers were busy answering real traffic. The balancer drained them, which concentrated load, which made things worse. Classic positive feedback loop, now with a managed service in the middle adding mystery.
The wrong assumption wasn’t “the cloud is unreliable.” It was believing that a generic load balancer equals a DNS-aware edge. They fixed it by increasing TTLs, adding dnsdist in front with sane per-client shaping, and separating health checks from overloaded query paths. They also wrote a test that simulated the SDK’s lookup pattern, because the next incident was going to be weirder.
Mini-story 2: the optimization that backfired
A large enterprise network team decided to “optimize DNS latency” by forcing all internal clients to use a small set of recursive resolvers in one data center. They scaled the resolvers up: lots of RAM, big CPUs, fancy NVMe for logs. The change looked good in a single-office test.
Two months later, a transit issue degraded connectivity from that data center to several TLD name servers. Nothing was fully down, just slower. Recursive resolution time increased, cache misses became painful, and the resolvers started building backlog. Clients—especially Windows endpoints and some IoT devices—retried aggressively on perceived slowness.
The “optimization” created a centralized failure domain. Worse: the team had enabled detailed query logging to prove the improvement, and under spike conditions the log pipeline saturated disk I/O and CPU. The resolvers weren’t just slow because the internet was slow; they were slow because they were narrating their suffering to disk.
The fix was unglamorous: distribute recursion across regions, reduce logging to sampled/aggregated metrics, and enable serve-expired with guardrails so transient upstream slowness didn’t instantly become client retry storms. They also stopped treating DNS like a web service where “one big cluster” is always a win.
Mini-story 3: the boring but correct practice that saved the day
A financial services company ran authoritative DNS with anycast across multiple POPs. Nothing fancy in the zone—A, AAAA, CNAME, a few TXT. The boring part was their discipline: every change went through a staging zone, they tracked TTL intent, and they rehearsed a DDoS runbook quarterly.
One afternoon, an attack started: high-QPS random subdomains plus TXT and DNSKEY probes. Their monitoring flagged a sharp rise in NXDOMAIN rate and UDP receive errors at one POP. The on-call didn’t panic. They followed the playbook: confirm drops, confirm query mix, enable stricter rate limits for NXDOMAIN responses, clamp EDNS size, and temporarily de-preference the overloaded POP at BGP.
Customer impact was minimal. Some resolvers fell back to TCP; some queries were slowed; but the domain stayed resolvable. The attack moved on, as attacks often do when you stop being a fun target.
The saving practice wasn’t a secret product. It was having measured baselines, rehearsed controls, and the authority (organizationally) to apply them quickly. The incident review was short, which is the best kind.
Common mistakes: symptom → root cause → fix
1) Symptom: lots of timeouts, but CPU is not maxed
Root cause: Packet drops at NIC/kernel (ring buffers, socket buffers), or upstream congestion.
Fix: Check ip -s link and netstat -su; increase buffers, reduce per-packet overhead, add edge rate limiting, scale out/anycast.
2) Symptom: sudden spike in NXDOMAIN responses
Root cause: Random subdomain flood, or a misconfigured client generating nonsense qnames.
Fix: Raise negative caching TTL sensibly, apply NXDOMAIN rate limits, and identify the source networks generating entropy-heavy qnames.
3) Symptom: massive increase in TCP DNS connections
Root cause: Truncation due to oversized UDP responses (DNSSEC, large TXT), EDNS mis-sizing, or deliberate forcing to exhaust TCP.
Fix: Clamp EDNS UDP size, minimize large responses, ensure TCP handling is provisioned, and rate-limit abusive qtypes.
4) Symptom: “random” customer failures behind mobile networks
Root cause: Per-IP rate limiting punishing NATed populations or large resolvers.
Fix: Use higher per-IP limits, whitelist major resolvers you depend on, or shift to per-qname/qtype limits that target abusive patterns.
5) Symptom: authoritative servers become open resolvers “accidentally”
Root cause: Recursion enabled or ACLs too broad; sometimes introduced by copying a recursive config into authoritative.
Fix: Disable recursion on authoritative; audit config management; validate externally that recursion is refused.
6) Symptom: everything breaks after enabling full query logging
Root cause: I/O and CPU overhead; log pipeline saturation; lock contention in daemon.
Fix: Turn it off first. Then implement sampling and aggregated metrics. Capture short tcpdump samples instead of trying to log the apocalypse.
7) Symptom: traffic “moves” during an attack and one POP melts
Root cause: Anycast path changes, BGP preference quirks, or one POP getting targeted specifically.
Fix: Control routing: de-preference routes for the hot POP, ensure capacity symmetry, and confirm each POP has upstream DDoS handling.
8) Symptom: resolver recursion time explodes, cache hit rate drops
Root cause: Cache-bypass via random qnames; upstream reachability issues; too-small caches.
Fix: Rate-limit clients, enable protections (unwanted query mitigations), scale resolvers horizontally, and ensure upstream connectivity diversity.
Checklists / step-by-step plan
Phase 0: before you’re under attack (do this on a calm Tuesday)
- Baseline your normal. Record typical QPS, qtype mix, NXDOMAIN rate, p50/p95 latency, UDP/TCP split.
- Separate roles. Authoritative and recursive on different instances and ideally different subnets/edges.
- Set sane TTLs. Don’t run 30-second TTLs for records that change once a month.
- Decide EDNS size policy. Clamp UDP payload size and verify behavior with DNSSEC.
- Put a DNS-aware edge in place. dnsdist or equivalent, with a tested emergency ruleset you can enable quickly.
- Rehearse the playbook. Simulate NXDOMAIN floods and big-response probes in a staging environment.
Phase 1: during a spike (keep it online first)
- Confirm where it hurts. Direct dig to authoritative; compare internal vs external.
- Find the limiter. Drops vs CPU vs upstream latency (use the tasks above).
- Apply containment at the edge. Rate-limit per source, then per qtype/qname, then clamp response size.
- Stabilize p95 latency. A “fast error” is often better than slow success in DNS, because slow success triggers retries.
- Reduce expensive features temporarily. Disable verbose logging; consider minimizing certain responses if your software supports it safely.
- Scale out if you can do it cleanly. More instances behind anycast/unicast NS, but ensure you’re not just replicating a bad config.
Phase 2: after the spike (make it less likely next time)
- Classify the traffic. Reflection? NXDOMAIN? Legit surge? Misbehaving client?
- Fix the economic incentives. Increase TTLs, reduce response sizes, minimize amplification, and block open recursion.
- Automate guardrails. Pre-approved rate-limit rules and safe toggles; dashboards that show qtype mix and NXDOMAIN rate.
- Write a one-page incident addendum. What you changed, what you observed, and which counters proved it.
FAQ
1) Should I rate-limit by IP address for DNS?
Yes, but treat it as a blunt instrument. IP-based limits can punish NATed populations and large public resolvers. Combine it with qtype/qname-aware controls.
2) What’s the safest first rate-limit to apply during an attack?
Start with edge shaping that targets clearly abusive patterns: excessive NXDOMAIN per source, suspicious qtypes (ANY bursts, TXT floods), and oversized responses. Avoid global caps.
3) Is TCP fallback good or bad?
Both. TCP reduces spoofed reflection utility and avoids fragmentation, but it costs more per query and can be targeted too. Engineer capacity for TCP, but don’t force everything into it.
4) Do short TTLs make DNS attacks worse?
They make any spike worse, attack or not, because recursors must refresh more often. Use short TTLs only where you truly need fast cutover.
5) How do I know if I’m being used for amplification?
Look for lots of queries with spoofed sources (hard to prove directly), unusual qtypes (DNSKEY, ANY), and a response-to-query byte ratio that looks silly. Packet samples help.
6) Can I just rely on my DNS provider and stop caring?
You can outsource infrastructure, not responsibility. You still need sane TTLs, reasonable zone content, and an incident plan. Providers can’t fix a zone that’s built to amplify.
7) What’s the difference between NXDOMAIN and NODATA, and why does it matter?
NXDOMAIN means the name doesn’t exist. NODATA means the name exists but not that record type. Attacks use both to bypass caches; negative caching TTL determines how often resolvers re-ask.
8) Should I disable ANY queries entirely?
On authoritative, you should at least minimize them. Some clients still ask ANY for diagnostics. Provide a minimal safe response rather than a huge one.
9) Why does my DNS “work” from inside the VPC but not from the internet?
Internal traffic avoids edge congestion and upstream filtering. External failures often indicate PPS saturation, DDoS scrubbing issues, anycast routing shifts, or MTU/fragmentation problems.
10) What metrics should be on the DNS dashboard?
QPS, PPS, p50/p95 latency, UDP vs TCP ratio, NXDOMAIN rate, top qtypes, response size distribution, kernel drops, and cache hit/miss (for recursion).
Conclusion: next steps that actually reduce downtime
If you want DNS to survive spikes without downtime, stop thinking in terms of “block the attacker.” Think in terms of protecting latency for legitimate traffic while making abuse expensive and uninteresting.
- Implement a DNS-aware edge (dnsdist or equivalent) and pre-stage emergency rules for NXDOMAIN floods and abusive qtypes.
- Clamp EDNS UDP size and verify DNSSEC behavior so you’re not fragmenting yourself into outages.
- Fix TTL strategy: raise TTLs where you can, raise negative caching TTL to a sane value, and keep “fast failover” for the few records that truly need it.
- Instrument drops and latency at the NIC/kernel and daemon levels. CPU graphs alone are how incidents become folklore.
- Rehearse the playbook quarterly. The time to learn which knob hurts customers is not during an attack.
DNS is supposed to be boring. Make it boring again—by engineering it like it’s critical, because it is.