If you’ve ever watched a perfectly healthy service “randomly” fail depending on where the request came from, you’ve met split-horizon DNS in its natural habitat: misconfigured, under-tested, and treated like a magic trick.
The symptoms are familiar. Inside the office, api.example.com works. From a VPN it times out. From a pod in Kubernetes it resolves to something you swear you deleted. From the internet it resolves “correctly,” except your internal monitoring is now screaming. The outage postmortem says “DNS” and everyone nods like that explains anything.
What split-horizon DNS actually is (and what it isn’t)
Split-horizon DNS means different clients receive different DNS answers for the same name, based on where they’re coming from (source IP, interface, TSIG key, view, resolver path, or an explicit policy). That’s it. It’s not inherently evil, and it’s not inherently secure. It’s a routing policy for names.
In the enterprise, it’s commonly used for:
- Returning private RFC1918 addresses internally and public addresses externally.
- Returning different targets for the same service name (internal load balancer vs public CDN).
- Hiding internal-only hostnames from the public internet (though “hiding” is not “securing”).
- Supporting legacy apps that hardcode a single hostname while deployments span multiple networks.
Split-horizon is not:
- A firewall. If you rely on DNS to stop access, you’re building a screen door out of zone files.
- A discovery mechanism that tolerates drift. If the “inside” and “outside” answers aren’t sourced from the same truth, they will diverge.
- A good way to do multi-region traffic engineering unless you treat DNS like the slow, cached, probabilistic system it is.
The hard part isn’t making split-horizon work. The hard part is making it boring. Boring DNS is reliable DNS.
Why split-horizon goes bad in real companies
Split-horizon fails when you lose control of three boundaries:
1) Authoritative vs recursive gets muddled
Your authoritative servers should answer for zones you own. Your recursive resolvers should fetch and cache answers for everyone else. In broken environments, “DNS servers” do everything, forwarding to each other in loops, serving stale data, and guessing which view applies.
2) Clients don’t query what you think they query
Laptops on Wi‑Fi, servers in a VPC, pods in Kubernetes, and mobile clients over VPN often use different resolvers, different search domains, and different caching behavior. You can have “one DNS design” on paper and five in reality.
3) Two sources of truth quietly become twenty
“Internal DNS” in BIND. “External DNS” in a managed provider. A conditional forwarder in Windows DNS. A private hosted zone in cloud. A CoreDNS rewrite rule. A sidecar caching stub. A hand-edited /etc/hosts that someone forgot about. This is how you get the same name resolving to three IPs depending on whether you ask with dig, nslookup, or your application runtime.
Split-horizon isn’t the villain. Unowned complexity is.
A few facts and history that explain the mess
- DNS is older than most of your “legacy” apps. It was standardized in the 1980s to replace the centrally managed HOSTS.TXT file model.
- Negative caching exists. If a name doesn’t exist (NXDOMAIN), that “no” can be cached too, controlled by the zone’s SOA parameters.
- Resolvers aren’t required to rotate fairly across records. Many do, many don’t, and some pin answers longer than you’d like.
- UDP truncation is real. Large DNS responses can require TCP fallback; if TCP 53 is blocked, you get “random” failures with DNSSEC or large TXT records.
- “Split-horizon” as a term was popularized in enterprise network operations. It mirrors the idea of “split horizon” in routing: don’t advertise routes back where they came from.
- CDNs normalized the idea that DNS answers depend on client location. That made policy-based answers feel normal, even when your infrastructure can’t support the operational overhead.
- Caching resolvers are allowed to cap TTLs. Your carefully chosen 30-second TTL might become 300 seconds in some resolvers, especially consumer ISPs.
- Search domains cause “ghost queries.” A query for api can become api.corp.example.com, then api.example.com, then api, and the resolver may cache intermediate failures along the way.
If you remember nothing else: DNS is a distributed cache with opinions.
Fast diagnosis playbook (do this first)
This is the checklist I use when someone says “inside works, outside doesn’t” or “VPN broke DNS” and expects you to read minds.
First: establish which resolver path the failing client uses
- On the failing host, identify configured resolvers and search domains.
- Confirm whether there’s a local stub (systemd-resolved, dnsmasq, nscd) and whether your queries even reach the network.
- Check if the client is behind a forwarding chain (VPN DNS, VPC resolver, on-prem forwarders).
Second: compare answers from authoritative sources vs recursive caches
- Ask the recursive resolver the client uses.
- Ask the authoritative server(s) directly.
- Compare: IPs, CNAME chain, TTL, and whether the response is NXDOMAIN vs NODATA.
Third: verify view selection or conditional forwarding logic
- Confirm the source IP the DNS server sees (NAT and proxies matter).
- Verify BIND views match the client subnet in the order you think they do.
- Check conditional forwarders: are they forwarding the right zone, to the right target, with recursion enabled where needed?
Fourth: confirm transport is healthy (UDP/TCP 53)
- Confirm UDP 53 works.
- Confirm TCP 53 works.
- If DNSSEC or large responses are involved, TCP 53 is not optional.
Fifth: make the cache state visible
- Look at TTLs you’re receiving.
- Flush caches deliberately (client stub, recursive resolvers) and re-test.
- If flushing “fixes it,” you have a cache invalidation problem. Congratulations: you’ve met the boss level.
Practical tasks: commands, outputs, decisions (12+)
These are real tasks I run during incidents. Each one includes the command, what the output means, and what decision to make next. Run them from both a “working” and “broken” vantage point: inside LAN, VPN, cloud instance, Kubernetes pod, and a public host.
Task 1: See what resolvers and search domains the host is actually using
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.10.0.53
DNS Servers: 10.10.0.53 10.10.0.54
DNS Domain: corp.example.com
Meaning: This host queries a local stub (systemd-resolved) which forwards to 10.10.0.53 and 10.10.0.54 and appends the search domain corp.example.com.
Decision: If the failing client uses a different DNS server than you expected, stop debating theory and test that resolver directly next.
Task 2: Confirm what /etc/resolv.conf points to (and whether it’s a stub)
cr0x@server:~$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Jan 15 09:12 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
Meaning: Queries go to the local stub (typically 127.0.0.53), not directly to your corporate resolvers.
Decision: If you’re troubleshooting, test both: the stub (to catch local caching issues) and the upstream resolver (to catch split-horizon and forwarding issues).
Task 3: Query the name using the same resolver the client uses
cr0x@server:~$ dig +time=2 +tries=1 api.example.com @10.10.0.53
; <<>> DiG 9.18.24 <<>> +time=2 +tries=1 api.example.com @10.10.0.53
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4242
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
api.example.com. 20 IN A 10.20.30.40
;; Query time: 12 msec
;; SERVER: 10.10.0.53#53(10.10.0.53) (UDP)
Meaning: Internal resolver returns a private IP with a low TTL (20s). It’s recursive (ra) and the answer is currently consistent.
Decision: If this differs from what “outside” sees, split-horizon is active. Next, confirm the authoritative answers for each horizon.
Task 4: Query the public resolver and compare
cr0x@server:~$ dig +time=2 +tries=1 api.example.com @1.1.1.1
; <<>> DiG 9.18.24 <<>> +time=2 +tries=1 api.example.com @1.1.1.1
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1234
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
api.example.com. 300 IN A 203.0.113.77
;; Query time: 18 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
Meaning: Public resolver returns a public IP with a longer TTL (300s). That’s the “outside” answer.
Decision: Decide if this is intentional. If yes, you must ensure both answers map to working targets and that clients are reliably classified into the correct horizon.
Task 5: Identify the authoritative name servers for the zone
cr0x@server:~$ dig +short NS example.com
ns1.dns-provider.net.
ns2.dns-provider.net.
Meaning: The public internet considers ns1/ns2 authoritative for example.com.
Decision: If internal DNS serves a different authority (e.g., an internal BIND master), you’re running two authorities. That can be okay, but only with disciplined change control and automation.
Task 6: Ask the authoritative servers directly (bypass caches)
cr0x@server:~$ dig api.example.com @ns1.dns-provider.net +norecurse
; <<>> DiG 9.18.24 <<>> api.example.com @ns1.dns-provider.net +norecurse
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9000
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
api.example.com. 300 IN A 203.0.113.77
Meaning: aa means authoritative answer: the public zone has 203.0.113.77.
Decision: If your internal resolver is supposed to override this, confirm where that override is defined (BIND view, private hosted zone, conditional forwarder) and whether it’s authoritative too.
Task 7: Check if a CNAME chain differs between inside and outside
cr0x@server:~$ dig +noall +answer +authority +additional api.example.com @10.10.0.53
api.example.com. 20 IN CNAME api-internal.example.com.
api-internal.example.com. 20 IN A 10.20.30.40
Meaning: Internal horizon uses a CNAME to an internal-only name.
Decision: Ensure the CNAME target is resolvable for every client that can see it. If VPN clients get classified as “outside” but receive the internal CNAME, they will fail in a way that looks like a network outage.
Task 8: Detect search-domain fallout (the “why did it query that?” problem)
cr0x@server:~$ dig +search api @10.10.0.53
; <<>> DiG 9.18.24 <<>> +search api @10.10.0.53
;; ANSWER SECTION:
api.corp.example.com. 60 IN A 10.9.8.7
Meaning: The client asked for api and the resolver expanded it to api.corp.example.com. That might not be the service you intended.
Decision: If app configs use short names, fix the config. Don’t “fix” DNS by adding more ambiguous records. Ambiguity scales faster than headcount.
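Search-list expansion can be modeled in a few lines. This sketch approximates glibc-style behavior under the common default of ndots:1; the function name, the ordering heuristic, and the example domains are illustrative assumptions, not any particular resolver's exact implementation:

```python
# Simplified model of stub-resolver search-list expansion (assumption:
# ndots=1, the common glibc default). Shows why a query for "api" fans
# out into several "ghost" queries, each of which can be negatively cached.

def expand_query(name: str, search_domains: list[str], ndots: int = 1) -> list[str]:
    """Return the candidate FQDNs a stub resolver may try, in order."""
    if name.endswith("."):
        return [name]                        # trailing dot: absolute, no expansion
    if name.count(".") >= ndots:
        # "Qualified enough": try the name as-is first, then search domains
        return [name + "."] + [f"{name}.{d}." for d in search_domains]
    # Too few dots: try search domains first, then the bare name
    return [f"{name}.{d}." for d in search_domains] + [name + "."]

print(expand_query("api", ["corp.example.com", "example.com"]))
# ['api.corp.example.com.', 'api.example.com.', 'api.']
```

This is also why the trailing dot matters in incident notes: `api.example.com.` is one query; `api.example.com` may be several.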
Task 9: Confirm whether TCP fallback works (large answers / DNSSEC / truncation)
cr0x@server:~$ dig example.com DNSKEY @1.1.1.1 +dnssec +tcp +time=2 +tries=1
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7777
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
Meaning: TCP query succeeded. If UDP works but TCP fails, you get intermittent resolution failures when responses are too big for UDP without EDNS0.
Decision: If TCP 53 is blocked anywhere between client and resolver/authoritative, fix that first. Stop negotiating with firewalls about “but DNS is UDP.” It’s both.
Task 10: Observe actual DNS traffic and the source IP (view selection depends on it)
cr0x@server:~$ sudo tcpdump -ni eth0 port 53 -vv
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:22:31.100001 IP 172.16.50.10.51123 > 10.10.0.53.53: 4242+ A? api.example.com. (33)
10:22:31.100220 IP 10.10.0.53.53 > 172.16.50.10.51123: 4242* 1/0/0 A 10.20.30.40 (49)
Meaning: The resolver sees the client as 172.16.50.10. If you expected a different subnet (e.g., VPN pool vs NAT pool), your view matching might be wrong.
Decision: Adjust view ACLs to match the addresses the server actually sees. Don’t match on the addresses you wish were true.
Task 11: Check BIND view ordering and zone attachment
cr0x@server:~$ sudo named-checkconf -p | sed -n '1,120p'
acl "internal-nets" { 10.0.0.0/8; 172.16.0.0/12; 192.168.0.0/16; };
view "internal" {
match-clients { "internal-nets"; };
recursion yes;
zone "example.com" { type master; file "/etc/bind/zones/db.example.com.internal"; };
};
view "external" {
match-clients { any; };
recursion yes;
zone "example.com" { type master; file "/etc/bind/zones/db.example.com.external"; };
};
Meaning: Two different zone files for the same zone name, selected by client source IP. View ordering matters: the first match wins.
Decision: Confirm that VPN/NAT subnets are included in internal-nets if they should receive internal answers. If not, you’ve found your “inside/outside madness.”
Task 12: Validate zone file integrity before blaming “DNS randomness”
cr0x@server:~$ sudo named-checkzone example.com /etc/bind/zones/db.example.com.internal
zone example.com/IN: loaded serial 2026020401
OK
Meaning: Zone parses and loads; serial is visible. If this fails, you may be serving an older version or not loading at all.
Decision: If the serial is not what you expect, find the actual master and the propagation path (AXFR/IXFR, git deploy, config management).
Task 13: Check what Unbound has cached (stale answers are a lifestyle)
cr0x@server:~$ sudo unbound-control lookup api.example.com
api.example.com. 20 IN A 10.20.30.40
Meaning: Unbound currently has a cached record with remaining TTL. If that answer is wrong, clients will keep seeing it until TTL expires (or you flush).
Decision: If you just changed the zone and the cache still has the old data, flush that specific name and confirm the authoritative data is correct.
Task 14: Flush a specific cached name (surgical, not “reboot DNS”)
cr0x@server:~$ sudo unbound-control flush api.example.com
ok
Meaning: Cache entry removed; next query should fetch fresh data.
Decision: If flushing fixes it, you need a TTL strategy and a change process that acknowledges caching—especially for split-horizon where two caches might disagree.
Task 15: Prove whether the application uses the OS resolver or something else
cr0x@server:~$ strace -f -e trace=network -s 128 -p $(pidof myapp) 2>&1 | head -n 20
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, IPPROTO_IP) = 42
connect(42, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.53")}, 16) = 0
sendto(42, "\252\252\1\0\0\1\0\0\0\0\0\0\3api\7example\3com\0\0\1\0\1", 33, 0, NULL, 0) = 33
Meaning: The app talks to 127.0.0.53, the local stub. If you expected direct queries to a corporate resolver, your “DNS fix” might not touch the actual path.
Decision: Decide whether to fix at the stub, the upstream resolver, or the app (e.g., JVM DNS cache settings). Blaming “DNS” without locating the resolver is performance art.
Task 16: Check JVM DNS caching (because Java will remember grudges)
cr0x@server:~$ jcmd $(pidof java) VM.system_properties | grep -E 'networkaddress\.cache'
networkaddress.cache.ttl=-1
networkaddress.cache.negative.ttl=10
Meaning: A TTL of -1 means cache forever (unless overridden by a security manager policy). This defeats DNS-based failover and can preserve the “wrong horizon” across network moves.
Decision: If you rely on DNS changes for recovery, set sane TTLs in the runtime or move service discovery away from DNS for fast failover.
Joke #1: DNS is the only system where “it’s cached” is both an explanation and a confession.
The fix: a split-horizon design that stays sane
The reliable fix is not “add another view” or “lower TTL to 5 seconds.” The fix is to make split-horizon a deliberate product with a single owner, a single source of truth, and a predictable resolver path.
Step 1: Decide the naming model: same zone name, or different zone names
You have two viable patterns:
- Same zone name (classic split-horizon): api.example.com exists in both internal and external views, with different answers.
  - Pros: simple for users, one hostname everywhere.
  - Cons: every misclassification becomes an outage; debugging is slower; caches amplify mistakes.
- Different zone names (recommended when possible): internal services under api.corp.example.com and public under api.example.com.
  - Pros: fewer ambiguous answers; “wrong horizon” often still works because it’s a different name.
  - Cons: requires app/config changes; humans must learn which name belongs where.
My opinionated guidance: if you can afford it, use different names. If you can’t, use same-name split-horizon but treat it like production code with tests, CI, and change review.
Step 2: Pick one authoritative workflow per horizon
If you run two different authorities (public DNS provider and internal BIND), accept that you’re running two products.
Your job is to eliminate “hand edits” and drift:
- Store zone data in version control (even if it’s generated).
- Generate both internal and external zones from a shared inventory, with explicit overrides.
- Require serial discipline (monotonic, automated) and validate before deploy.
- Make “who owns this record?” answerable in one minute.
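A minimal sketch of that generation step: one inventory, two rendered horizons, with differences expressed only as explicit overrides. The inventory structure, names, and addresses here are hypothetical, not a real tool:

```python
# Sketch: render internal and external zone variants from one shared
# inventory. A horizon differs from "default" only via an explicit
# override, so every split answer is visible in one diff-able place.

INVENTORY = {
    "api.example.com.": {
        "default":  {"type": "A", "ttl": 300, "value": "203.0.113.77"},
        "internal": {"type": "A", "ttl": 300, "value": "10.20.30.40"},
    },
    "www.example.com.": {
        "default": {"type": "A", "ttl": 300, "value": "203.0.113.80"},
    },
}

def render_zone(horizon: str) -> list[str]:
    """One source of truth; fall back to 'default' when no override exists."""
    lines = []
    for name, records in sorted(INVENTORY.items()):
        rec = records.get(horizon, records["default"])
        lines.append(f'{name} {rec["ttl"]} IN {rec["type"]} {rec["value"]}')
    return lines

print("\n".join(render_zone("internal")))
print("\n".join(render_zone("external")))
```

The payoff is that “who owns this record?” and “does this name split?” become grep questions against version control, not archaeology across two DNS consoles.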
Step 3: Separate authoritative DNS from recursive DNS
This is where many environments rot. They put BIND on a box, enable recursion, add views, add forwarding, and now the same daemon is:
authoritative for internal zones, authoritative for external overrides, recursive for everything else, and exposed to networks it shouldn’t be.
A clean layout looks like this:
- Authoritative layer: serves internal zones (and internal view of shared zones) only to recursive resolvers, not to clients.
- Recursive layer: clients talk to recursive resolvers; resolvers talk to authoritative servers and the internet as needed.
- Policy layer: split-horizon logic lives in a small number of places (BIND views on authoritative, or conditional forwarding on recursors), not sprinkled across laptops and VPN clients.
Step 4: Make view selection deterministic and documented
If you use BIND views, match on the IPs the DNS server sees. That means you must account for:
- VPN pools (often different subnets than office LANs)
- NAT gateways (clients appear as the NAT IP, not their original subnet)
- Proxying resolvers (a forwarder’s IP might be the “client”)
- Dual-stack behavior (IPv6 clients may bypass IPv4-only assumptions)
The single most common split-horizon outage is a new network path (new VPN, new NAT, new VPC) that wasn’t added to the “internal” ACL. Suddenly “internal users” become “external users,” and the DNS answers become incompatible with internal routing.
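You can sanity-check classification offline before a new network path goes live. This sketch mirrors BIND’s first-match-wins view selection using the standard ipaddress module; the subnets are the RFC1918 ranges from the earlier named-checkconf example, not your real topology:

```python
# Sketch: which BIND view would a given source IP match?
# Feed it the source addresses the DNS server actually sees (post-NAT).
import ipaddress

# Assumed ACL, mirroring the "internal-nets" example earlier.
INTERNAL_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def view_for(source_ip: str) -> str:
    """First match wins, like BIND view ordering; 'any' is the fallback."""
    addr = ipaddress.ip_address(source_ip)
    if any(addr in net for net in INTERNAL_NETS):
        return "internal"
    return "external"

print(view_for("172.16.50.10"))   # internal
print(view_for("203.0.113.9"))    # external: a VPN client NATed behind
                                  # a public gateway silently falls here
```

Run this with the candidate egress addresses of every new VPN, NAT gateway, and VPC resolver path before cutover, and the most common split-horizon outage becomes a pre-merge check instead of a Monday incident.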
Step 5: Keep internal and external answers compatible whenever feasible
When you must share a name, design it so the “wrong” answer fails gracefully:
- Prefer returning routable addresses from both horizons when possible (e.g., internal clients can reach public LB too).
- If internal returns private IPs, ensure external clients never see those records (no leakage via wrong view, forwarder, or resolver in the wrong subnet).
- Use CNAMEs carefully: they extend the blast radius because the target might not exist in the other horizon.
- Be cautious with wildcard records; they can turn typos into “valid” names that route somewhere expensive.
Step 6: TTL strategy: stop cargo-culting “set TTL to 5”
TTL is your lever for how long mistakes live. It’s also your lever for how hard you hit your DNS infrastructure during normal operation.
Practical TTL guidance:
- For stable records, use moderate TTLs (5–30 minutes). Low TTLs increase query volume and don’t guarantee fast propagation due to resolver TTL caps.
- For migration windows, pre-lower TTLs well in advance (hours to days), then change records, then raise TTLs after stability.
- Remember negative caching: NXDOMAIN can stick around and ruin your “just created that record” moment.
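The pre-lowering arithmetic is worth writing down explicitly. A hedged back-of-envelope sketch; the resolver cap value is an assumption, and real resolvers vary widely:

```python
# Sketch of TTL migration timing. Key insight: after you lower a TTL,
# caches can hold the OLD TTL for up to the old TTL's duration, so the
# earliest safe record flip is old_ttl seconds after the lowering change.

def migration_plan(old_ttl: int, low_ttl: int, resolver_cap: int = 300) -> dict:
    """Timeline in seconds, relative to the TTL-lowering change.

    resolver_cap models resolvers that clamp very low TTLs upward
    (300s is an illustrative assumption, not a standard).
    """
    effective_low = max(low_ttl, resolver_cap)
    return {
        "lower_ttl_at": 0,
        "earliest_safe_flip": old_ttl,              # old answers fully expired
        "worst_case_stale_after_flip": effective_low,
    }

print(migration_plan(old_ttl=3600, low_ttl=30))
# Even with TTL=30, a resolver clamping to 300s keeps stale data ~5 min.
```

Note what the model says about cargo-cult low TTLs: lowering from 30 to 5 seconds buys nothing against a resolver that clamps to 300, but multiplies your query load every day of the year.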
Step 7: Test split-horizon like you test deployments
You need tests from each vantage point. Not “I ran dig once from my laptop.” Tests that run continuously and alert on divergence:
- Inside LAN: expected internal answer(s).
- VPN: expected internal answer(s) or explicitly expected external behavior.
- Cloud VPC: expected internal answer(s) if it’s part of the corporate network.
- Public internet: expected external answer(s).
- From the recursive resolvers directly: expected authoritative behavior, correct TTL, correct CNAME chain.
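The core of such a probe is a pure comparison, independent of how you collect the answers (dig in cron, a monitoring agent, whatever). The expected answer sets and horizon names below are hypothetical:

```python
# Sketch of the drift check behind a horizon probe: each vantage point
# reports the answer set it received, and we flag any horizon whose
# observed answers diverge from policy.

EXPECTED = {
    "lan":    {"10.20.30.40"},
    "vpn":    {"10.20.30.40"},
    "public": {"203.0.113.77"},
}

def drifted(observed: dict[str, set[str]]) -> list[str]:
    """Return the horizons whose observed answers diverge from expected."""
    return [h for h, ips in observed.items() if ips != EXPECTED.get(h)]

# The classic failure mode: VPN clients classified as "outside".
obs = {"lan": {"10.20.30.40"}, "vpn": {"203.0.113.77"}, "public": {"203.0.113.77"}}
print(drifted(obs))  # ['vpn']
```

Alerting on this comparison, rather than on “did it resolve at all,” is what catches misclassification before users do.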
Here’s the operational quote that should haunt your DNS design reviews:
“Hope is not a strategy.” — General Gordon R. Sullivan
DNS outages often begin as hope: hope that caches expire quickly, hope that VPN subnets never change, hope that nobody adds a “temporary” forwarder. Don’t run production on hope.
Three mini-stories from corporate life
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran split-horizon on a single pair of BIND servers. Internal view returned private IPs for git.example.com; external view returned a public reverse proxy.
For years it worked. Quietly. Which is how most incidents start.
They replaced their VPN concentrator. New vendor, new “modern” architecture: client traffic hairpinned through a set of NAT gateways. The team updating the VPN focused on auth and throughput. DNS was “unchanged.”
Monday morning, remote engineers could authenticate to VPN and reach internal IPs by address, but git.example.com resolved to the external proxy. The external proxy enforced SSO that was blocked behind conditional access rules when accessed from corporate IP ranges. The engineers were on VPN, but DNS classified them as “outside.” Now they were “outside” with “inside” network access and “outside” DNS. The worst of both worlds.
The wrong assumption was simple: “VPN clients will still come from the VPN pool subnets.” They didn’t. The DNS server saw only the NAT gateway IPs. BIND views matched the NAT range as any, so they hit the external view.
The fix was boring: add the NAT gateway ranges to the internal ACL, then create a monitoring probe that performs the same DNS query from behind that NAT and validates the internal answer. They also documented that “DNS view classification depends on source IP after NAT,” which should have been obvious, but incidents are just tuition you pay to learn obvious things.
Mini-story 2: The optimization that backfired
A large enterprise wanted faster failover for a customer-facing API. They lowered TTL from 300 seconds to 5 seconds on the external A record and set up automation to flip between two load balancers. It looked great in a demo.
Then production happened. Their recursive resolver fleet (internal and at some major ISPs) capped low TTLs upward. Not consistently; not predictably. Some clients re-resolved quickly, others kept using the old address for minutes. Meanwhile, the lower TTL multiplied query rates. Their authoritative provider handled it, but their internal forwarders didn’t. CPU spiked. Cache hit ratios dropped. Latency increased. The “optimization” increased failure probability during the exact moment they needed DNS to be calm.
Worse: internal split-horizon records for the same name used different TTLs and were managed in a different system. During a failover test, internal monitors still resolved to the old internal target for longer than the external clients. The incident channel filled with “but the dashboard says it’s still down” while customers were already back.
They backed out the 5-second TTL, moved failover to the load balancer layer where state changes propagate faster than DNS caches, and kept DNS TTLs in the “normal” range. They also aligned internal and external TTL policies for shared names and started tracking query rates as a first-class SLO.
Joke #2: Setting TTL to 5 seconds is the DNS equivalent of driving faster by removing the brakes.
Mini-story 3: The boring but correct practice that saved the day
Another org ran a clean separation: authoritative servers for internal zones, recursive resolvers for clients, and a tiny set of conditional forwarders for cloud private zones. They had a written rule: “No client points to authoritative DNS directly.” It sounded pedantic. It was.
One day, a new internal zone file deploy introduced a syntax error in the internal view for a shared zone. The authoritative daemon refused to load that zone. The recursive resolvers kept serving cached answers, buying time. Their monitoring didn’t just test resolution; it tested authoritative zone load health and serial increments.
The on-call got paged for “authoritative zone not loaded” before the cache aged out. They rolled back the zone file change within minutes. Clients never noticed. It was the kind of incident that doesn’t make a good story at a conference because nothing caught fire. That’s the point.
The practice that saved them was the most boring one in DNS: validate configs before reload, monitor authoritative health, and don’t let clients hit the authority directly. It turned a potentially company-wide incident into a quiet rollback.
Common mistakes: symptoms → root cause → fix
1) Symptom: “Works in office, fails on VPN”
Root cause: VPN clients are NATed or assigned a subnet not included in internal view ACL; they receive external DNS answers.
Fix: Update view ACLs to include VPN/NAT ranges as seen by the DNS server; add continuous probes from VPN egress.
2) Symptom: “dig works, app doesn’t”
Root cause: App uses a different resolver path (local stub, container DNS, language runtime cache) than your test.
Fix: Trace the app’s DNS calls (strace), inspect runtime caching (JVM, Go netdns, .NET), and test the exact resolver it uses.
3) Symptom: “nslookup shows one IP, dig shows another”
Root cause: Different default servers; one tool queries a different resolver, or search domain expansion changes the query name.
Fix: Always specify the resolver with @server; use fully qualified names with a trailing dot when needed; verify search domains.
4) Symptom: “After changing DNS, some clients still hit the old target”
Root cause: Resolver caching, TTL caps, and application-level caching; negative caching for previous NXDOMAIN.
Fix: Use planned TTL reductions before migrations; flush caches where appropriate; don’t rely on DNS for sub-minute failover.
5) Symptom: “Intermittent failures with DNSSEC or large TXT records”
Root cause: TCP 53 blocked; UDP fragmentation or EDNS0 issues; truncated responses not retried properly.
Fix: Allow TCP/UDP 53 end-to-end; verify EDNS0 support; test with dig +tcp.
6) Symptom: “External users sometimes get private IPs”
Root cause: Split-horizon data leaked via misapplied views, a forwarder exposed to the internet, or a public resolver forwarding to internal.
Fix: Never expose internal recursive resolvers publicly; restrict recursion; audit forwarding chains; ensure internal zones aren’t served in public contexts.
7) Symptom: “Kubernetes pods resolve differently than nodes”
Root cause: Pods use CoreDNS and cluster DNS policy; nodes use system resolver; conditional forwarding differs.
Fix: Align CoreDNS forwarders with your intended recursive resolvers; add pod-based probes; avoid rewrite hacks unless you can test them.
8) Symptom: “Only some sites/offices fail”
Root cause: Different forwarders per site; inconsistent conditional forwarding; stale zone transfers to a site-local resolver.
Fix: Standardize resolver configuration; monitor zone serials per site; prefer centralized recursion with local caching only when necessary.
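Serial monitoring has one trap worth coding around: zone serials use RFC 1982 serial arithmetic and wrap at 2^32, so a plain less-than comparison can misfire near the wrap point. A sketch of a wrap-safe drift check; server names and serial values are illustrative:

```python
# Sketch: flag secondaries serving an older zone serial than the master,
# using RFC 1982 serial-number arithmetic (serials wrap at 2^32).

def serial_lt(a: int, b: int) -> bool:
    """True if serial a is older than serial b per RFC 1982."""
    return (a != b) and (((b - a) % 2**32) < 2**31)

def lagging(master: int, secondaries: dict[str, int]) -> list[str]:
    """Names of servers whose serial is behind the master's."""
    return [n for n, s in secondaries.items() if serial_lt(s, master)]

print(lagging(2026020401, {"site-a": 2026020401, "site-b": 2026020315}))
# ['site-b']
```

With date-based serials (YYYYMMDDNN) you will likely never hit the wrap, but automation that generates serials from epoch seconds or counters eventually will, and this comparison keeps the alert honest either way.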
Checklists / step-by-step plan
Step-by-step: rebuilding split-horizon so it stops hurting
1) Inventory the resolver paths.
- List corporate recursive resolvers, cloud resolvers, VPN-provided resolvers, and any local stubs.
- Decide: which clients should use which resolver, and why.
2) Draw the authority boundary.
- Which servers are authoritative for internal zones?
- Which provider is authoritative for public zones?
- Who owns changes in each?
3) Pick the split-horizon mechanism.
- BIND views on authoritative? Conditional forwarding on recursive? Cloud private zones?
- Minimize the number of policy points. Two is already plenty.
4) Define view classification rules.
- Enumerate internal networks as seen by DNS servers (post-NAT).
- Include VPN egress and cloud NAT ranges deliberately.
5) Unify record generation.
- Generate internal/external variants from shared data with explicit diffs.
- Ban manual edits on production DNS outside a controlled emergency process.
6) Set TTL policy and stick to it.
- Define standard TTLs per record type and environment.
- Define a migration playbook for pre-lowering TTLs.
7) Build tests from every horizon.
- At minimum: internal LAN, VPN, cloud, public internet.
- Test the name, the CNAME chain, and the final reachability (HTTP/TCP health).
8) Deploy with validation gates.
- Zone syntax check, serial monotonic check, and a canary resolver reload before global rollout.
9) Monitor what matters.
- NXDOMAIN rate, SERVFAIL rate, recursion failures, query latency, and TCP fallback failures.
- Zone load status and serial drift across secondaries.
Operational checklist: before you change a split-horizon record
- Does this name exist in more than one horizon? List them.
- Will the change affect CNAME targets that don’t exist externally?
- Is TTL low enough already to support the migration window? If not, lower it ahead of time.
- Are you changing the authoritative source or only a recursive override?
- Which monitoring probes will tell you it worked from each horizon?
- Do you have a rollback record set ready (and tested)?
Operational checklist: during an incident
- Identify the resolver path from the failing client.
- Compare answers: client resolver vs authoritative.
- Confirm view matching with packet capture or resolver logs.
- Validate TCP 53.
- Decide: fix classification, fix record data, or flush caches (and where).
- Write down the exact query name (with trailing dot if needed) and the resolver IP. Ambiguity is how incidents drag on.
FAQ
1) Should I use split-horizon DNS at all?
Use it when you truly need one hostname to map to different targets based on client location. Otherwise, prefer different names for internal vs external services. It reduces ambiguity and cuts incident time.
2) Is split-horizon DNS the same as “private DNS” in cloud?
Conceptually similar outcome: different answers depending on where you query from. Mechanically different: cloud private zones often attach to VPCs and use provider resolvers, while classic split-horizon uses views or policy on your DNS servers.
3) Why do I see the “wrong” answer only from Kubernetes pods?
Pods typically query CoreDNS (or equivalent), which forwards according to cluster config. Nodes might use different resolvers. Fix by aligning CoreDNS forwarding/conditional rules with the corporate recursive resolver strategy and test from inside a pod.
4) Can I rely on TTL for fast failover?
Not for sub-minute behavior. Some resolvers cap TTLs upward; applications cache; connection pools keep using old IPs. Put fast failover in load balancers or service meshes; keep DNS for steady-state discovery and slow migration cutovers.
5) What’s the biggest operational risk with BIND views?
Misclassification. If a new subnet (VPN, NAT, cloud egress) isn’t included in the right ACL, those clients get the wrong zone. Then you’re debugging “network” while the actual problem is policy.
6) How do I prevent internal records from leaking to the internet?
Don’t expose recursive resolvers publicly. Disable recursion on authoritative servers that face untrusted networks. Restrict zone transfers. Audit forwarding chains so no public-facing resolver forwards private zones to internal infrastructure.
7) Why does flushing caches sometimes “fix” it, but then it comes back?
Because you flushed one cache, not the whole chain. Client stub caches, recursive caches, application runtime caches, and sometimes intermediate forwarders all exist. If the authoritative data is wrong, flushing just buys you a short-lived illusion of competence.
8) Should internal and external TTLs match?
Not always, but they should be governed by the same policy. If the same name is split-horizon, drastically different TTLs can confuse monitoring, slow incident confirmation, and create asymmetric behavior during migrations.
9) What’s a safe pattern for “one name everywhere” without split-horizon?
Terminate traffic at a single public endpoint that internal networks can also reach (public LB with internal allowlists, or dual-access VIP). Then DNS can return one answer globally. This trades DNS complexity for network policy, which is usually a win.
10) How do I monitor split-horizon properly?
Run the same DNS query from each horizon and alert on unexpected drift. Also monitor SERVFAIL/NXDOMAIN rates, resolver latency, and zone serial consistency. If you only monitor “does it resolve,” you’ll miss “does it resolve to the right thing.”
Conclusion: practical next steps
Split-horizon DNS is survivable if you stop treating it like a clever trick and start treating it like infrastructure with failure modes. The fix is not a new record. It’s a new discipline: clear resolver paths, deterministic classification, and one truth for zone data.
Do these next, in order
- Map resolver paths for LAN, VPN, cloud, Kubernetes, and a public host. Write them down.
- Pick your policy point: views on authoritative or conditional forwarding on recursive. Minimize count.
- Make classification explicit: enumerate subnets as seen by the DNS servers (including NAT/VPN egress).
- Unify record management: generate internal/external variants from shared data; no hand edits.
- Add horizon probes and alert on drift in answers, not just resolution success.
- Adopt a TTL migration playbook so you can change targets without begging caches for mercy.
If you want the “inside/outside madness” to stop, your DNS must become boring. Not minimal. Not clever. Boring. That’s how production survives.