Split-horizon DNS: Make LAN names resolve without breaking the public internet

Everything works… until it doesn’t. You add a shiny internal hostname like git.company.com pointing at a private IP,
and suddenly laptops on the guest Wi‑Fi can’t reach the real public site. Or worse: the internal name leaks to the outside world,
and now your “private” infrastructure is gossiping with strangers.

Split-horizon DNS—also called split-brain DNS—is the grown-up solution: different answers depending on where the question comes from.
Done right, LAN names resolve cleanly, public resolution stays correct, and your on-call rotation sleeps. Done wrong, you invent new
and exciting outage modes.

What split-horizon DNS actually is (and what it is not)

Split-horizon DNS means your DNS infrastructure returns different records for the same name depending on the client’s network context:
source IP, interface, VPN state, or explicit “view” rules. The classic case is internal clients resolving app.example.com
to 10.10.20.30 while external clients resolve it to 203.0.113.10 behind a CDN or WAF.

Split-horizon DNS is not “use an internal suffix like .lan and call it a day.” That might work for a home lab,
but in enterprises it tends to grow sharp corners: device roaming, VPN split tunneling, internal TLS, and the uncomfortable reality
that humans will type whatever looks like the public name.

Also: split-horizon isn’t a magic performance knob. DNS is already fast when you aren’t breaking caching or creating loops. The goal
is correctness and predictable resolution. “Predictable” is the key word. A resolver that answers quickly but inconsistently is just a
high-speed outage generator.

A crisp mental model

  • Authoritative servers publish the “truth” for a zone (or multiple truths, via views).
  • Recursive resolvers look up answers on behalf of clients and cache results.
  • Stub resolvers live on clients and forward questions to a recursive resolver.

Split-horizon can be implemented at the authoritative layer (two different authoritative “truths” depending on source) or at the
recursive layer (override specific names internally while otherwise resolving via the public internet). Both work; which one
scales cleanly depends on who in your org owns what.

Joke #1: DNS is the only system where “it worked yesterday” is considered strong evidence of nothing.

Interesting facts and history you can use in arguments

These are short, concrete context points. They’re useful when you’re trying to convince security, networking, and app teams to stop
doing “creative” things with names.

  1. DNS predates modern security assumptions. Early DNS (mid-1980s) was built for cooperative networks; integrity and
    authenticity came later, bolted on with DNSSEC.
  2. Split-horizon got popular with perimeter firewalls. When NAT and DMZ patterns became common, a single name often
    needed different answers for internal vs external routing.
  3. RFC 1918 private IP ranges (1996) normalized the idea that internal addressing is fundamentally different from
    public routing—and DNS had to reflect that reality.
  4. “Split-brain DNS” is older than cloud. The pattern existed long before VPCs; cloud just made it easier to have many
    “internals” with overlapping names and half-owned zones.
  5. DNS caching is why your changes “don’t work.” The TTL and negative caching behavior (NXDOMAIN caching) frequently
    explains why one laptop sees the new record and another doesn’t.
  6. DNS search domains are a foot-gun. Many resolvers will append search suffixes and generate multiple queries.
    Misconfigured search lists can create surprising lookups and leakage.
  7. The “.local” suffix is special in many stacks. It’s used by multicast DNS (mDNS). Using it for unicast DNS is a
    classic “works on my machine” trap.
  8. EDNS Client Subnet changed how “where you are” looks. Some public resolvers can forward client subnet hints to CDNs.
    That’s great for performance and confusing for debugging.
  9. DNS over HTTPS and DNS over TLS complicate enterprise control. If clients bypass your resolver, your split-horizon
    logic may never run.

Design decisions that matter in production

1) Choose your namespace strategy: same name vs different name

You have two broad approaches:

  • Same FQDN inside and outside (e.g., git.company.com resolves internally to RFC1918 and externally to
    public IP/CDN). This is the most ergonomic for humans and applications. It is also the easiest to accidentally misconfigure.
  • Different internal name (e.g., git.internal.company.com or git.corp). This reduces the
    chances of conflicts with public DNS but increases operational friction: certs, redirects, CORS, OAuth callbacks, and user behavior.

Opinion: if you’re running serious internal apps used by real people and devices roam (laptops, phones, VPN), use the same FQDN and do
split-horizon properly. If it’s a small static lab, different internal names are acceptable—until you start issuing TLS and integrating
with SaaS, at which point you’ll wish you’d chosen the boring path.

2) Decide where the split happens: authoritative vs recursive

Implement split-horizon at the authoritative layer when you control the zone and need deterministic answers for internal/external
sources. Example: BIND views, NSD with different instances, or cloud DNS split views.

Implement split-horizon at the recursive layer when you want to keep public authoritative DNS simple and override only a small set of
names internally. Example: Unbound local zones, dnsmasq overrides, or conditional forwarding to internal auth servers.

Opinion: for enterprises, prefer split at the recursive edge plus a clean internal authoritative zone. Put as little “policy” as
possible into authoritative servers that are shared across environments. For smaller networks, a single BIND instance with views is
perfectly fine as long as you treat it like production and test changes.

3) Avoid “shadow zones” unless you like surprises

A shadow zone is when an internal DNS server pretends it is authoritative for a public zone (say company.com) but only
contains a handful of records. Everything else turns into NXDOMAIN or stale answers. That breaks random things in ways that feel like
cursed magic.

If you must override within a public zone, do it with:

  • authoritative split views that still contain a complete zone view, or
  • recursive overrides for specific names, while letting all other names resolve publicly (a minimal Unbound sketch follows).
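
A minimal sketch of the second approach in Unbound, assuming the config-include layout below and illustrative names and addresses:
the transparent zone type answers from local data where a name matches and falls through to normal public resolution for everything
else, which is exactly the non-shadowing behavior you want.

cr0x@server:~$ cat /etc/unbound/unbound.conf.d/overrides.conf
server:
    # transparent: answer from local-data where present, resolve publicly otherwise
    # (use typetransparent if other record types for these names must pass through)
    local-zone: "company.com." transparent
    local-data: "app.company.com. 60 IN A 10.10.20.30"
    local-data: "git.company.com. 60 IN A 10.10.20.31"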

4) TTL strategy: low enough for changes, not so low you melt caches

TTL is not “how fast DNS is.” TTL is how long your mistake persists.

For internal overrides that might change during incident response (failover, DR), TTLs of 30–300 seconds are common. For stable
internal records, 300–3600 seconds is fine. For public records behind CDNs, you usually play by the vendor’s rules.

Low TTLs increase query volume and amplify outages when resolvers flap. High TTLs increase recovery time when you need to move an
endpoint. Pick intentionally, per record class, and measure resolver QPS before you “optimize.”

5) Make DNSSEC a conscious decision, not an accident

DNSSEC validation is increasingly common (enterprise resolvers, some ISPs). Split-horizon can interact badly with DNSSEC if you serve
different signed/unsigned answers, or if you override signed public names internally without controlling the signing keys.

Practical guidance:

  • If you’re overriding names within a DNSSEC-signed public zone, do it at the recursive layer with care, or expect validation
    failures if clients validate independently.
  • If you control the zone signing, you can sign both views, but keep operational complexity in mind. (A quick validation
    spot-check follows this list.)
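
A quick spot-check for the first case, with illustrative output: query your resolver with +dnssec and inspect the flags. An ad
flag means the resolver validated the answer; its absence on an override inside a signed zone is expected, and a client that
validates independently may turn the same situation into SERVFAIL.

cr0x@server:~$ dig @10.10.0.53 app.company.com +dnssec +noall +comments
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24180
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1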

6) Account for modern clients: DoH, VPNs, and “helpful” OS resolvers

Split-horizon assumes you know where the client sends DNS. But browsers and OSes have opinions:

  • DNS over HTTPS can bypass your resolver entirely. Your internal names won’t resolve, and your policy won’t apply.
  • systemd-resolved can do per-interface DNS and route queries by domain (see the sketch after this list). That’s great
    when configured and confusing when not.
  • VPN split tunneling may route DNS differently than traffic, creating “DNS says internal, packets go external” chaos.
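
When systemd-resolved is configured deliberately, per-domain routing looks like this (the tun0 interface and resolver addresses
are assumptions for illustration): internal domains go to internal resolvers on the VPN link, everything else follows the default
route. The ~ prefix marks a routing-only domain, so it steers queries without acting as a search suffix.

cr0x@server:~$ sudo resolvectl dns tun0 10.10.0.53 10.10.0.54
cr0x@server:~$ sudo resolvectl domain tun0 '~corp.company.com'
cr0x@server:~$ resolvectl domain
Global:
Link 2 (ens192):
Link 3 (tun0): ~corp.company.com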

Reference architectures: the three sane patterns

Pattern A: Authoritative views (BIND “views”) + internal recursion

You run one set of authoritative servers that answer external clients with public records and internal clients with private records,
based on source IP ACLs. Internal resolvers query them, external resolvers query them, everyone’s “happy.”

Pros: one zone name, clear policy, deterministic. Cons: views are configuration-complex; mistakes can leak internal records or serve
incomplete zones; testing needs discipline.

Pattern B: Recursive overrides (Unbound/dnsmasq) + public authoritative stays public

Public zone stays as-is, hosted wherever. Inside the LAN, you run recursive resolvers that override specific names (or forward
specific zones) to internal authoritative servers.

Pros: minimal coupling; easy to reason about; fewer ways to break the public internet. Cons: you now depend on clients using your
resolver; DoH can bypass you; you need resolver HA.

Pattern C: Separate internal subdomain + conditional forwarding

Put internal-only names under a delegated subdomain like corp.company.com and forward that zone internally. Public DNS
either doesn’t publish it or publishes only what you want public.

Pros: clean separation; fewer conflicts; DNSSEC easier if you sign the internal zone separately. Cons: humans will still type the
public name; apps integrate with public callbacks; you’ll end up needing split for some names anyway.
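
Conditional forwarding for the delegated subdomain is one stanza in Unbound (addresses illustrative). If the internal zone is
unsigned and the resolver validates DNSSEC, you may also need to mark the zone insecure so validation doesn’t reject the
forwarded answers:

server:
    domain-insecure: "corp.company.com."

forward-zone:
    name: "corp.company.com."
    forward-addr: 10.10.0.10
    forward-addr: 10.10.0.11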

Opinion: Pattern B is the best default for most orgs. Pattern A is powerful when you truly need “same name, two truths” at the
authoritative layer. Pattern C is a comfort blanket that works until it doesn’t.

Practical tasks: commands, outputs, and what you decide next

These are real tasks you can run during design, rollout, and incident response. Each one includes: command, sample output, what it
means, and the decision you make.

Task 1: Confirm which resolver your host is actually using

cr0x@server:~$ resolvectl status
Global
         Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
  resolv.conf mode: stub
Current DNS Server: 10.10.0.53
       DNS Servers: 10.10.0.53 10.10.0.54
        DNS Domain: corp.company.com
Link 2 (ens192)
    Current Scopes: DNS
         Protocols: +DefaultRoute -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 10.10.0.53
       DNS Servers: 10.10.0.53 10.10.0.54

What it means: systemd-resolved is in play; DNS goes to 10.10.0.53/.54. There’s a domain
routing hint for corp.company.com.

Decision: If users report “internal names don’t resolve,” verify they’re on these resolvers. If you see public DNS
(ISP, 1.1.1.1, 8.8.8.8), you don’t have split-horizon; you have hope.

Task 2: Prove the split with a targeted query to internal resolver

cr0x@server:~$ dig @10.10.0.53 app.company.com +noall +answer
app.company.com.        60      IN      A       10.10.20.30

What it means: Internal resolver returns private IP with TTL 60.

Decision: If this is correct, your internal view/override works. Next: check what external/public sees.

Task 3: Compare against public resolution (without trusting your resolver)

cr0x@server:~$ dig @1.1.1.1 app.company.com +noall +answer
app.company.com.        300     IN      A       203.0.113.10

What it means: Public answer differs (expected in split-horizon). TTL is higher.

Decision: If public returns the private IP, you’ve leaked. If public returns NXDOMAIN, the public zone simply lacks the record: fine for an intentionally internal-only name, a bug if the name is supposed to exist publicly.

Task 4: Identify where an answer came from (authoritative vs cache)

cr0x@server:~$ dig @10.10.0.53 app.company.com +norecurse
; <<>> DiG 9.18.24-1 <<>> @10.10.0.53 app.company.com +norecurse
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
app.company.com.        60      IN      A       10.10.20.30

What it means: aa indicates the server considers itself authoritative for this answer.

Decision: If you expected a recursive override and see aa, you might be serving a view/zone you didn’t intend (risk of shadowing).

Task 5: Check for negative caching (NXDOMAIN that won’t die)

cr0x@server:~$ dig @10.10.0.53 newhost.corp.company.com +noall +answer +authority
corp.company.com.       900     IN      SOA     ns1.corp.company.com. hostmaster.corp.company.com. 2025123101 3600 600 1209600 60

What it means: No answer section; authority shows SOA. The resolver likely cached NXDOMAIN; negative TTL often derives from SOA minimum/negative TTL.

Decision: If you just created newhost, either wait out negative cache, flush cache, or reduce negative TTL in SOA for zones that change a lot.

Task 6: Flush resolver cache safely (systemd-resolved)

cr0x@server:~$ resolvectl flush-caches
cr0x@server:~$ resolvectl statistics
DNSSEC supported by current servers: no
Transactions
  Current Transactions: 0
Cache
  Current Cache Size: 12
          Cache Hits: 221
        Cache Misses: 45

What it means: Cache flushed; stats show ongoing behavior.

Decision: If flushing “fixes” things repeatedly, you have a TTL/negative caching issue or inconsistent upstream answers.

Task 7: Validate BIND zone integrity before reload (no hero reloads)

cr0x@server:~$ named-checkzone corp.company.com /etc/bind/zones/db.corp.company.com
zone corp.company.com/IN: loaded serial 2025123101
OK

What it means: Zone file parses, serial present, BIND is unlikely to reject it.

Decision: If it fails, do not reload. Fix syntax first; otherwise you may serve an older zone or fail the view entirely.

Task 8: Validate BIND configuration (views and ACLs are where the bodies are)

cr0x@server:~$ named-checkconf -z
zone corp.company.com/IN: loaded serial 2025123101
zone company.com/IN: loaded serial 2025123105

What it means: Config is syntactically valid and zones load.

Decision: If named-checkconf errors, don’t restart BIND. Fix config; a restart under load is a good way to get paged twice.

Task 9: Confirm which view a client hits (BIND querylog)

cr0x@server:~$ sudo rndc querylog on
cr0x@server:~$ sudo tail -n 3 /var/log/named/query.log
31-Dec-2025 11:03:22.918 client @0x7f2c8c1a 10.10.50.21#53821 (app.company.com): view internal: query: app.company.com IN A +E(0)K (10.10.0.53)
31-Dec-2025 11:03:23.104 client @0x7f2c8c2b 198.51.100.77#40112 (app.company.com): view external: query: app.company.com IN A +E(0)K (10.10.0.53)
31-Dec-2025 11:03:23.201 client @0x7f2c8c2b 198.51.100.77#40113 (www.company.com): view external: query: www.company.com IN A +E(0)K (10.10.0.53)

What it means: Two clients (or NAT’d sources) are being classified into different views.

Decision: If an internal client is tagged external, fix ACLs, source addresses, or NAT behavior. Don’t “add another record”; that’s how you create a slow-motion disaster.

Task 10: Detect forwarding loops (the silent resolver killer)

cr0x@server:~$ dig @10.10.0.53 corp.company.com SOA +time=1 +tries=1
;; communications error to 10.10.0.53#53: timed out
;; no servers could be reached

What it means: The resolver is timing out quickly. Under the hood, this often happens when Resolver A forwards to Resolver B which forwards back to A.

Decision: Check forwarder configuration and ensure there is a terminal authoritative source for the internal zones. Break the loop; then watch query latency drop back to normal.

Task 11: Inspect Unbound local overrides (what is it lying about?)

cr0x@server:~$ sudo unbound-control list_local_zones
corp.company.com. transparent
company.com. static

What it means: corp.company.com is transparent (local data answers where present; everything else resolves normally). company.com is static: it answers only from local data and shadows public records for every other name in the zone.

Decision: If you didn’t intend to shadow company.com broadly, switch the zone to transparent (or remove it) and keep only specific local-data host overrides.

Task 12: Query Unbound for local data and see whether it’s authoritative-ish

cr0x@server:~$ dig @10.10.0.53 app.company.com +noall +answer +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61545
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
app.company.com.        60      IN      A       10.10.20.30

What it means: This is recursive behavior (rd/ra) rather than authoritative (aa absent). Good for “override specific names” patterns.

Decision: If you expected authoritative answers, you’re debugging the wrong layer. If you expected recursion and see aa, something else is answering.

Task 13: Verify conditional forwarding (Windows/AD environments)

cr0x@server:~$ dig @10.10.0.10 _ldap._tcp.dc._msdcs.corp.company.com SRV +noall +answer
_ldap._tcp.dc._msdcs.corp.company.com. 600 IN SRV 0 100 389 dc01.corp.company.com.

What it means: AD DNS is returning SRV records; forwarding for corp.company.com is likely working.

Decision: If SRV lookups fail intermittently, treat it as a split-horizon routing problem first (wrong resolver on some clients) before blaming Kerberos.

Task 14: Check if clients are bypassing you with DoH (browser-level reality check)

cr0x@server:~$ sudo tcpdump -ni any port 53 -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
11:09:41.120421 ens192 In  IP 10.10.50.21.53821 > 10.10.0.53.53: 31614+ A? app.company.com. (33)
11:09:41.120987 ens192 Out IP 10.10.0.53.53 > 10.10.50.21.53821: 31614 1/0/0 A 10.10.20.30 (49)
11:09:44.908112 ens192 In  IP 10.10.50.21.34620 > 10.10.0.53.53: 1720+ A? www.company.com. (33)
11:09:44.908621 ens192 Out IP 10.10.0.53.53 > 10.10.50.21.34620: 1720 1/0/0 A 203.0.113.10 (49)
5 packets captured

What it means: At least for these lookups, the client is using classic DNS to your resolver.

Decision: If users complain but you see no port 53 traffic, suspect DoH/DoT or captive portal weirdness. Your split-horizon can’t help if it never sees the question.

Task 15: Confirm reverse DNS matches the forward split (or you’ll break audits and some auth)

cr0x@server:~$ dig @10.10.0.53 -x 10.10.20.30 +noall +answer
30.20.10.10.in-addr.arpa. 300 IN PTR app.company.com.

What it means: PTR exists and points back to the expected name.

Decision: If reverse DNS is missing or points at a public name, fix it. Some systems (mail, logging, security tools) use rDNS for correlation and sanity checks.

Task 16: Measure DNS latency and detect timeouts (the real “DNS is slow” test)

cr0x@server:~$ dig @10.10.0.53 app.company.com +stats +noall +answer
app.company.com.        60      IN      A       10.10.20.30
;; Query time: 2 msec
;; SERVER: 10.10.0.53#53(10.10.0.53) (UDP)
;; WHEN: Wed Dec 31 11:12:12 UTC 2025
;; MSG SIZE  rcvd: 49

What it means: 2 ms is healthy on-LAN.

Decision: If you see 200–2000 ms, suspect forwarding issues, packet loss, MTU problems with EDNS, or overloaded resolver. Fix latency before you touch application retries.

Joke #2: If you want to hide infrastructure details, don’t put them in DNS—DNS is basically the office intercom.

Fast diagnosis playbook

When resolution breaks, you don’t have time for philosophy. You need to find the bottleneck and decide who owns it: client, resolver,
authoritative server, or network path. This is the checklist I run before I let anyone “just restart DNS.”

First: confirm the client’s resolver path

  • Check which DNS servers the host uses (resolvectl status on Linux, network settings on others).
    If it’s not your internal resolver, split-horizon isn’t in effect.
  • Run tcpdump on the resolver: do you even see the client queries?
    If not, suspect DoH/DoT, VPN, or a different resolver.

Second: compare internal vs public answers for the same name

  • Query the internal resolver directly with dig @internal (a side-by-side loop follows this list).
  • Query a known public resolver with dig @1.1.1.1 (or your preferred baseline).
  • If public is “wrong,” your public authoritative is wrong or leaking.
    If internal is “wrong,” your internal override/view/forwarding is wrong.
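
A compact way to run that comparison side by side (resolver addresses and answers are illustrative):

cr0x@server:~$ for ns in 10.10.0.53 1.1.1.1; do echo "== $ns =="; dig @$ns app.company.com +noall +answer; done
== 10.10.0.53 ==
app.company.com.        60      IN      A       10.10.20.30
== 1.1.1.1 ==
app.company.com.        300     IN      A       203.0.113.10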

Third: determine whether the answer is authoritative, cached, or failing upstream

  • +norecurse plus flag inspection: aa means authoritative.
  • Look for timeouts: repeated communications error usually means network or forwarding loops.
  • Check negative caching: SOA in authority with no answers often indicates cached NXDOMAIN.

Fourth: isolate transport and EDNS issues

  • If UDP fails but TCP works, suspect MTU/fragmentation or firewall rules.
  • If DNS works for small names but fails for responses with many records, suspect EDNS buffer issues (quick transport checks follow this list).
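
Two dig flags isolate most of this quickly (outputs illustrative): +tcp forces TCP, and +bufsize=512 advertises a small EDNS
buffer so oversized answers truncate instead of fragmenting. If plain UDP queries time out but +tcp answers, look at
fragmentation or a firewall dropping large UDP.

cr0x@server:~$ dig @10.10.0.53 app.company.com +tcp +noall +answer
app.company.com.        60      IN      A       10.10.20.30
cr0x@server:~$ dig @10.10.0.53 app.company.com +bufsize=512 +noall +answer
app.company.com.        60      IN      A       10.10.20.30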

Fifth: confirm view/ACL classification (if using views)

  • Enable query logs briefly and identify which view clients hit.
  • Fix ACLs or NAT behavior; don’t paper over with duplicate records.

One quote to keep handy when people demand “a quick fix” that’s actually a gamble, paraphrasing John Allspaw: the real work
is understanding the system; blaming people doesn’t improve reliability.

Common mistakes: symptoms → root cause → fix

1) “Internal app works on VPN but not on office Wi‑Fi”

Symptom: Same laptop, different network, different answer.

Root cause: Different resolvers per interface; office Wi‑Fi hands out public DNS, VPN pushes internal DNS (or vice versa).

Fix: Standardize DHCP DNS options; use per-domain routing (split DNS) on VPN so only internal domains go to internal resolvers; verify with resolvectl status.

2) “Some users get the old IP for hours”

Symptom: Stale answers long after change.

Root cause: TTL too high, or intermediate caching resolvers (including home routers) ignoring your intended behavior; negative caching for NXDOMAIN also bites.

Fix: Reduce TTL ahead of migrations; flush caches on controlled resolvers; communicate that unmanaged resolvers will lag. For NXDOMAIN, tune SOA negative TTL if appropriate.

3) “Public users can resolve private IPs for our services”

Symptom: External queries return RFC1918 or internal hostnames.

Root cause: View ACL misclassification, NAT making external queries appear internal, or a recursive resolver accidentally exposed to the internet.

Fix: Lock recursion down to internal networks; validate views with query logging; ensure public-facing DNS answers come only from the external view/zone.

4) “Everything in company.com is NXDOMAIN internally except a few records”

Symptom: Random public sites under your domain break internally.

Root cause: Shadow zone: internal DNS is authoritative for company.com but doesn’t have the full zone contents.

Fix: Stop serving partial authoritative zones. Either host full zone internally with proper sync, or do per-record overrides at the recursive layer.

5) “DNS randomly times out under load”

Symptom: Intermittent timeouts, high latency, clients retrying.

Root cause: Forwarding loops, overloaded resolver, packet loss, or firewall rate limiting. Low TTLs can amplify this by increasing QPS.

Fix: Trace forwarding paths; break loops; add resolver capacity; tune TTLs; ensure firewall allows UDP/TCP 53 reliably on internal segments.

6) “It works with dig but browsers fail”

Symptom: Command-line resolution is fine; browser can’t reach internal names.

Root cause: Browser uses DoH to a public resolver, bypassing internal DNS overrides.

Fix: Manage DoH policies via enterprise controls; provide internal DoH endpoint if needed; verify with packet captures and browser settings audits.

7) “Certificates don’t match after split-horizon change”

Symptom: TLS errors internally after pointing name to private IP.

Root cause: Internal service doesn’t present certificate for the public name, or TLS termination differs internally vs externally.

Fix: Terminate TLS consistently; use certs with correct SANs; don’t rely on “internal users will click through.” They won’t, and they shouldn’t.

Three corporate mini-stories (and the lessons they paid for)

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company wanted internal users to hit the “fast path” to their staging environment. Someone suggested split-horizon for
staging.company.com: internal resolves to a private load balancer; external resolves to a hardened public endpoint.

The change went in on a Friday afternoon, because of course it did. Within an hour, support tickets arrived: engineers on home networks
couldn’t reach staging anymore, and some office users were seeing certificate warnings. The immediate assumption was “the load balancer
is down.” It wasn’t.

The wrong assumption: they believed “internal” meant “requests originating from our public IP range.” Their DNS view ACL treated all
NAT egress IPs from the corporate office as internal—fine. But their VPN concentrator also NAT’d remote users behind the same egress IP
block. Depending on route state, a user could be classified as internal DNS-wise while their traffic still went out through the public
internet without access to the private load balancer.

The result was a nasty split: DNS said “go private,” routing said “you can’t,” and the browser said “I hate your certificate.” The fix
was embarrassingly straightforward: classify views based on source subnets that truly represent internal reachability, not just “stuff
we own.” They also moved to per-domain DNS routing on the VPN client so internal names go to internal resolvers only when the VPN is
active and routes exist.

Lesson: split-horizon is not about identity or ownership. It’s about reachability. If your “internal” view includes clients that can’t
reach internal services, you’ve built a confusion machine.

Mini-story 2: The optimization that backfired

A large enterprise had a perfectly acceptable setup: recursive resolvers in each site, conditional forwarding for internal zones, and
public resolution for everything else. Someone looked at DNS query rates and decided they could “reduce chatter” by increasing TTLs
across internal records to several hours.

It worked beautifully in graphs. Resolver QPS dropped. Caches looked warm and cozy. Then a routine maintenance event required moving
an internal service VIP to a different subnet because of a network re-segmentation. The DNS record changed, and half the estate kept
hitting the old IP until caches expired. Some applications had their own DNS caching on top, because why have one cache when you can
have three.

The incident wasn’t spectacular. It was worse: it was slow. Intermittent failures, partial recoveries, and a helpdesk flood where
every team saw a different symptom. A few resolvers were flushed; a few weren’t. Some users were “fixed” by rebooting their laptops.
That’s not a fix; that’s a ritual.

The post-incident outcome was a more nuanced TTL strategy: low TTLs for endpoints that participate in failover/migration, moderate TTLs
for stable records, and clear runbooks for cache invalidation. They also introduced a practice of temporarily lowering TTLs before
planned cutovers—because operational foresight is cheaper than heroics.

Lesson: DNS “optimization” that ignores change velocity is just deferred pain with better charts.

Mini-story 3: The boring but correct practice that saved the day

A financial services firm ran split-horizon for several critical names: authentication portals, internal APIs, and a handful of partner
endpoints reachable only over private connectivity. They had two recursive resolver tiers per datacenter, plus an internal
authoritative cluster for corp.company.com. Nothing fancy—just redundant, monitored, and documented.

One morning, a network change upstream caused intermittent packet loss to one of the authoritative nodes. The recursive resolvers
started timing out for a subset of lookups. This is where things often devolve into “DNS is down,” followed by random restarts and a
change freeze.

But their boring practice kicked in: they had continuous DNS checks from each site that queried a canary record and measured latency.
The alert wasn’t “DNS down.” It was “DNS latency increased, authoritative node A intermittently unreachable from site X.” They also had
query logging ready to enable briefly, and a known-good fallback path.

They drained the flaky authoritative node from the resolver forwarder set, traffic stabilized, and the incident ended without touching
application deployments. Later, networking fixed the packet loss. No drama, no heroics, no “we rebooted everything and it worked.”

Lesson: redundancy is table stakes. Observability and controlled failover are what keep you from guessing at 3 a.m.

Checklists / step-by-step plan

Step-by-step plan: implementing split-horizon without self-harm

1) Define the names and audiences

  • List the FQDNs that need different internal vs external answers.
  • Define “internal” by reachability (subnets/VPN routes), not by “people we like.”
  • Decide if roaming devices must work seamlessly (usually yes).

2) Choose the enforcement point

  • Prefer recursive overrides/conditional forwarding when public authoritative is owned by a different team/vendor.
  • Use authoritative views when you must publish the same zone with two answer sets and you control authoritative DNS.

3) Build resolver redundancy first

  • At least two internal recursive resolvers per site or per failure domain.
  • Lock down recursion to internal networks only.
  • Monitor query latency, SERVFAIL rate, and upstream health (a canary-check sketch follows this list).
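
A minimal canary check in shell; the canary record name, resolver addresses, and the 100 ms threshold are assumptions for
illustration, and real alerting belongs in your metrics stack:

#!/bin/sh
# Query a canary record on each resolver; flag failures and slow answers.
for ns in 10.10.0.53 10.10.0.54; do
    out=$(dig @"$ns" canary.corp.company.com +noall +answer +stats 2>/dev/null)
    echo "$out" | grep -Eq 'IN[[:blank:]]+A[[:blank:]]' || echo "ALERT: $ns failed canary lookup"
    ms=$(echo "$out" | awk '/Query time:/ {print $4}')
    [ "${ms:-0}" -gt 100 ] && echo "WARN: $ns slow (${ms} ms)"
done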

4) Implement overrides narrowly

  • Override specific hostnames (A/AAAA/CNAME) before you override entire zones.
  • If you must override a zone, ensure you can serve a complete zone view, not a partial shadow.

5) Get TTLs right and plan migrations

  • Use low TTL for names you might swing during incidents or deploys.
  • Lower TTL ahead of planned cutovers; raise it back afterward if appropriate.
  • Remember negative caching; tune SOA if needed for dynamic zones.

6) Validate from multiple vantage points

  • Inside LAN, on VPN, and outside internet.
  • From at least one managed client and one “weird” client (BYOD/phone).
  • Verify both forward and reverse records where they matter.

7) Control DoH/DoT behavior

  • Decide whether clients may use external DoH resolvers.
  • If no, enforce via enterprise policy and network controls where feasible.
  • If yes, accept that internal-only names won’t resolve unless you provide an internal DoH endpoint.

8) Document the “what breaks if DNS breaks” list

  • Auth (Kerberos/OIDC), package repos, time sync dependencies, internal PKI endpoints.
  • Prioritize monitoring for those names and zones.

Operational checklist: before you change split-horizon

  • Run named-checkconf/named-checkzone (or equivalent) before reload/restart.
  • Confirm which clients are in which view / which forwarding path they use.
  • Lower TTLs ahead of time if you need a fast rollback.
  • Have a rollback plan that does not require “restore from memory.”
  • Schedule a short window to enable query logging during rollout, then disable it (logs are useful; disk fills are not).

FAQ

1) Is split-horizon DNS the same as split DNS on VPN clients?

Related, not identical. Split-horizon is “different answers depending on where you ask.” Split DNS on VPN is “send only certain domain
queries to VPN DNS servers.” You often want both: VPN routes internal domains to internal resolvers, and internal resolvers provide the
internal answers.

2) Should I use .local for internal names?

No. Many systems treat .local as mDNS territory. You can make it work in some environments, but you’ll keep paying for it
in strange resolution behavior and debugging time.

3) Can I just run two separate DNS servers and switch DHCP depending on network?

You can, and sometimes that’s enough. The failure mode is roaming and mixed networks: devices with cached resolvers, VPN overlays, and
multiple interfaces. Split-horizon is what you do when “just DHCP” stops being deterministic.

4) What’s the safest way to override a few public names internally?

Do it at the recursive layer: local-data overrides (Unbound), host overrides (dnsmasq), or conditional forwarding for a dedicated
internal subdomain. Avoid becoming authoritative for an entire public zone unless you can serve it completely.

5) How do I prevent internal DNS records from leaking externally?

Don’t expose recursive resolvers to the internet. Lock down recursion with ACLs. If using views, verify ACL classification and test
from external vantage points. Also be careful with logs and monitoring exports—names leak via telemetry too.

6) Does split-horizon break DNSSEC?

It can. If clients validate DNSSEC and you override records inside a signed zone without the correct signing chain, you can trigger
validation failures. Keep overrides at the recursive layer under your control, or ensure your authoritative setup handles signing
consistently across views.

7) Why do I get different answers from different machines on the same LAN?

Usually one of: different resolvers configured, cached data with different TTL remaining, or per-interface DNS selection (especially on
laptops with VPNs, Wi‑Fi, and docked Ethernet). Start by confirming resolver settings and querying the same DNS server explicitly with
dig @server.

8) Do I need reverse DNS (PTR) for internal services?

Not always, but when you do, you really do: logging correlation, some security tools, some auth flows, and human sanity. If you’re
building split-horizon for anything critical, treat reverse zones as part of the system, not decoration.

9) How low should TTL be for failover records?

Low enough that recovery happens in minutes, not hours—commonly 30–300 seconds. Then test query load and ensure your resolvers can
handle it. Also remember clients and libraries can cache DNS independently of TTL.

10) Is it okay to CNAME internal names to public names or vice versa?

Sometimes, but be careful: CNAME chains cross boundaries and complicate split behavior, caching, and TLS expectations. Prefer direct A/AAAA
where you need deterministic answers, especially for critical services.

Conclusion: practical next steps

Split-horizon DNS is one of those infrastructure patterns that feels “simple” until you run it under real conditions: roaming clients,
VPNs, caches, DoH, and the occasional well-meaning change that turns your namespace into performance art.

Next steps that pay off quickly:

  • Pick your pattern (authoritative views vs recursive overrides) and write down why.
  • Inventory the exact names that need split answers; keep the list small and intentional.
  • Stand up redundant internal recursive resolvers and lock recursion down.
  • Test from three vantage points: inside LAN, on VPN, and outside the org.
  • Build your “fast diagnosis” runbook into on-call reality: the commands above, plus a known-good canary name.

The goal isn’t clever DNS. The goal is boring name resolution that stays correct while everything else is on fire.
