Debian 13: DNS split-horizon gone wrong — fix internal names without breaking the internet (case #33)

You’re on Debian 13. Half your fleet can’t reach git.internal anymore, but the other half can. Someone “fixed DNS” yesterday and now your laptop resolves internal names at the coffee shop, which is cute until it leaks queries and breaks zero-trust assumptions.

This is the classic split-horizon DNS failure: internal names should resolve only on internal networks (or via VPN), while public DNS must remain intact. The fun part is that split-horizon can fail in at least four different layers on Debian 13: the authoritative server, the recursive resolver, the stub resolver, and the per-link routing of DNS servers. If you guess wrong, you “fix” it by making the internet worse.

What actually breaks in split-horizon on Debian 13

Split-horizon DNS means the same query name returns different answers depending on where you ask from. Usually that means:

  • Inside the network: git.internal resolves to an RFC1918 address, maybe behind an internal load balancer.
  • Outside the network: git.internal either doesn’t exist (NXDOMAIN), or resolves to a public address (rare and risky), or resolves to nothing useful.

In real production, “split-horizon” is not a single feature. It’s an agreement between:

  • Authoritative DNS for your internal zone(s), usually BIND, Knot, PowerDNS, Windows DNS, or something managed.
  • Recursive resolvers clients talk to (Unbound, BIND recursive, corporate resolvers, VPN-provided resolvers).
  • Client-side stub resolver and policy routing: on Debian 13 that’s often systemd-resolved + resolvectl decisions per link/interface.
  • Search domains and routing domains that decide whether a query should go to “internal DNS” or “internet DNS”.
  • Caches everywhere, including negative caching (NXDOMAIN) that can keep you wrong for minutes or hours even after you fix it.

Debian 13 adds one sharp edge: people assume the OS still behaves like “edit /etc/resolv.conf, done.” On a modern system with systemd-resolved, that file may be a stub, a symlink, or a lie you told yourself to sleep better.

When split-horizon “goes wrong,” the failures usually fall into one of these buckets:

  1. Wrong resolver is being used (clients ask public DNS for internal names, or ask internal DNS for public names).
  2. Routing domain is missing (internal suffixes aren’t tied to the VPN interface, so queries go out the default route).
  3. Authoritative data mismatch (internal and external zones diverged, or one got deleted).
  4. DNSSEC interaction (validating resolvers treat your internal answer as bogus because of broken trust assumptions).
  5. Negative caching (clients keep NXDOMAIN even after you fix the zone).
  6. “Optimization” features like EDNS Client Subnet, DNS64, or aggressive caching behave differently across resolvers and your assumptions explode.

One operational rule: split-horizon should be explicit and testable. If it’s implicit (“it usually works when you’re on VPN”), it will eventually stop working during a meeting with someone important. DNS has a talent for timing.
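
If you want that rule to be more than a slogan, the minimal test is two queries per canary name: one against the internal resolver, one against a public resolver, each with an expected result. This is a sketch using the article's example name and resolvers (substitute your own); dig +short prints nothing when the public side correctly returns NXDOMAIN:

cr0x@server:~$ dig +short @10.60.0.53 git.internal.example A
10.60.12.44
cr0x@server:~$ dig +short @1.1.1.1 git.internal.example A

If the second command ever starts printing an address, either you published the name publicly on purpose or something is leaking.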

One quote that still holds up in ops: “Hope is not a strategy.” The attribution is fuzzy (several operations leaders get the credit); the advice isn’t.

Joke #1: DNS is the only system where “it’s cached” is both an explanation and a threat.

Interesting facts and historical context (the stuff that explains today’s pain)

  • Split-horizon predates cloud by decades. It grew out of enterprises publishing internal names while exposing a curated external view for partners and the early internet.
  • BIND “views” became the canonical mechanism for serving different answers to different client subnets; that concept still shapes how many teams think about split DNS.
  • Negative caching is standardized. Since RFC 2308, resolvers may cache NXDOMAIN based on the zone’s SOA minimum/negative TTL; a “fixed” record can appear broken for longer than you expect.
  • Search domains used to be the main trick. Before per-link DNS routing, admins leaned heavily on search in resolv.conf. That worked… until laptops met multiple networks and VPNs.
  • The “.local” footgun is historical, not theoretical. mDNS uses .local (RFC 6762). If you used it for unicast internal DNS, you probably have ghosts in your resolver.
  • ICANN’s 2013 “name collision” work exposed how often enterprises used “private TLDs” that later conflicted with public DNS behavior or new TLDs.
  • systemd-resolved introduced per-link DNS routing so VPN DNS can be used only for certain domains. It solves a real problem, but it also means the old mental model is obsolete.
  • EDNS0 changed packet sizing behavior. Modern resolvers often send larger UDP packets; if PMTU or firewall rules are wrong, you get weird timeouts that look like “DNS is flaky.”
  • Some resolvers do “NXDOMAIN cut” and aggressive NSEC caching. With DNSSEC, a resolver may infer non-existence for adjacent names. If your internal DNSSEC story is messy, debugging becomes interpretive dance.

Fast diagnosis playbook (what to check, in order)

First: confirm what resolver path the client is actually using

  • Is systemd-resolved running and is /etc/resolv.conf a stub?
  • Which DNS servers are configured per interface (VPN vs LAN vs Wi‑Fi)?
  • Do you see routing domains (~corp.example) applied to the VPN link?

Second: reproduce with one tool, one name, one server

  • Pick a single internal name (e.g., git.internal).
  • Query the configured resolver, then query the authoritative server directly.
  • Compare answers, TTLs, and whether you get NXDOMAIN vs SERVFAIL vs timeout.

Third: look for caching and policy interference

  • Flush client cache and retry.
  • Check negative TTL in SOA.
  • Check DNSSEC validation state if you’re validating.
  • Inspect firewall/MTU if you see intermittent timeouts.

Fourth: only then touch configuration

  • If the client is querying the wrong server: fix per-link DNS routing or NetworkManager/VPN configuration.
  • If the resolver forwards wrong: fix conditional forwarding / stub zones.
  • If authoritative data is wrong: fix the zone, then set expectations about caching delay.

Practical tasks (commands, outputs, what it means, what you decide)

These are the tasks I actually run when a split-horizon incident arrives. Each includes a realistic sample output and the decision you make from it. Replace names/IPs with your reality.

Task 1 — Check if systemd-resolved is in the path

cr0x@server:~$ systemctl status systemd-resolved --no-pager
● systemd-resolved.service - Network Name Resolution
     Loaded: loaded (/lib/systemd/system/systemd-resolved.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-12-30 09:12:18 UTC; 2h 4min ago
       Docs: man:systemd-resolved.service(8)
   Main PID: 612 (systemd-resolve)
     Status: "Processing requests..."

Meaning: You are not dealing with “just resolv.conf.” There is a stub resolver and a cache.

Decision: Use resolvectl to inspect per-link DNS and routing domains; don’t blindly edit /etc/resolv.conf.

Task 2 — Inspect /etc/resolv.conf reality

cr0x@server:~$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Dec 30 09:12 /etc/resolv.conf -> /run/systemd/resolve/stub-resolv.conf

Meaning: Applications reading /etc/resolv.conf are pointed at the local stub (typically 127.0.0.53).

Decision: Configure DNS via systemd/network manager, not by rewriting /etc/resolv.conf (unless you intentionally disable resolved).

Task 3 — See what DNS servers and domains are applied per link

cr0x@server:~$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
 resolv.conf mode: stub
Current DNS Server: 1.1.1.1
       DNS Servers: 1.1.1.1 8.8.8.8

Link 2 (enp1s0)
    Current Scopes: DNS
         Protocols: +DefaultRoute
       DNS Servers: 192.168.10.53
        DNS Domain: office.example

Link 5 (tun0)
    Current Scopes: DNS
         Protocols: -DefaultRoute
       DNS Servers: 10.60.0.53
        DNS Domain: ~internal.example ~svc.internal.example

Meaning: VPN link has routing domains (~internal.example) and no default route for DNS. That’s good: only those domains should go to VPN DNS.

Decision: If internal names are failing, check whether the queried name matches the routing domain. If it’s git.internal but your routing domain is ~internal.example, you’ve found the mismatch.
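
As a quick, non-persistent confirmation that the mismatch really is the cause, you can add the missing routing domain to the VPN link by hand; the change is lost on reconnect or reboot, and the interface, servers, and domains below are the article's examples:

cr0x@server:~$ sudo resolvectl dns tun0 10.60.0.53 10.60.0.54
cr0x@server:~$ sudo resolvectl domain tun0 '~internal.example' '~svc.internal.example' '~internal'
cr0x@server:~$ resolvectl domain tun0
Link 5 (tun0): ~internal.example ~svc.internal.example ~internal

If internal names start resolving after this, the durable fix belongs in the VPN or NetworkManager profile (see Option A below), not in anyone's shell history.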

Task 4 — Test internal name using the stub (what apps see)

cr0x@server:~$ resolvectl query git.internal.example
git.internal.example: 10.60.12.44                      -- link: tun0

-- Information acquired via protocol DNS in 21.4ms.
-- Data is authenticated: no

Meaning: The client is routing this query over the VPN interface. Good. The answer exists.

Decision: If apps still fail, you’re likely dealing with app-level caching, wrong hostname, or TLS/SNI mismatch—not DNS split-horizon.

Task 5 — When it fails: distinguish NXDOMAIN vs SERVFAIL vs timeout

cr0x@server:~$ resolvectl query git.internal
git.internal: resolve call failed: 'git.internal' not found

Meaning: Not found in the DNS path that was used. This is usually NXDOMAIN, but resolved abstracts it.

Decision: Run dig against specific servers to see whether it’s truly NXDOMAIN, or a policy/routing miss.

Task 6 — Query the VPN resolver directly

cr0x@server:~$ dig +noall +answer @10.60.0.53 git.internal A
git.internal.            300     IN      A       10.60.12.44

Meaning: The internal resolver knows this name (so the zone apex is probably the private TLD “internal.”, or you’ve delegated a private TLD internally).

Decision: Ensure the client actually routes git.internal queries to the VPN resolver. If your routing domains are only ~internal.example, this name will leak to public DNS and fail.

Task 7 — Query the public resolver directly (to see leakage)

cr0x@server:~$ dig +noall +comments +answer @1.1.1.1 git.internal A
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 29144

Meaning: Public DNS returns NXDOMAIN. If clients ask public resolvers for git.internal, they will fail and may cache failure.

Decision: Fix split DNS routing (routing domains/conditional forwarding). Don’t create public records for internal names as a “quick fix.” That’s how you turn internal architecture into internet architecture.

Task 8 — Check search domains and why short names are a trap

cr0x@server:~$ resolvectl domain
Link 2 (enp1s0): office.example
Link 5 (tun0): ~internal.example ~svc.internal.example

Meaning: Only office.example is a search domain; VPN domains are routing-only. A query for git may become git.office.example, not git.internal.example.

Decision: Use fully qualified names for critical tooling (FQDNs). If humans must type short names, define consistent search domains and accept the operational cost.
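
If short names must keep working over the VPN, the resolved semantics are worth spelling out: a per-link domain listed without the ~ prefix is used both for search expansion and for routing, while a ~domain is routing-only. A hand-applied (non-persistent) sketch with this article's names:

cr0x@server:~$ sudo resolvectl domain tun0 internal.example '~svc.internal.example'
cr0x@server:~$ resolvectl domain tun0
Link 5 (tun0): internal.example ~svc.internal.example

Now git can expand to git.internal.example and route to the VPN resolver, with the collision risk the Decision above warns about; make the same change persistent in the VPN profile if you keep it.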

Task 9 — Flush caches (client side) to kill stale NXDOMAIN

cr0x@server:~$ resolvectl flush-caches
cr0x@server:~$ resolvectl statistics
DNSSEC supported by current servers: no
Transactions: 1283
Cache size: 0
Cache hits: 0
Cache misses: 43

Meaning: Cache flushed. If the issue “fixes itself” after this, you were fighting cached failure.

Decision: Inspect negative TTL in your zone SOA and adjust if you’re seeing long-lived NXDOMAIN after changes.

Task 10 — Check authoritative zone SOA for negative TTL (the quiet saboteur)

cr0x@server:~$ dig +noall +answer @10.60.0.53 internal.example SOA
internal.example.  3600 IN SOA ns1.internal.example. hostmaster.internal.example. 2025123001 3600 600 1209600 3600

Meaning: The last SOA field (3600) is the negative cache TTL in modern practice. NXDOMAIN can stick for an hour.

Decision: For zones with frequent changes, set negative TTL lower (e.g., 60–300 seconds). Don’t set it to 0 unless you enjoy higher query load and new failure modes.
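
A hypothetical SOA for a frequently changing zone, using the names above, might look like the snippet below. Resolvers take the negative-cache TTL from the smaller of the SOA minimum field and the SOA record's own TTL, so both are lowered here:

internal.example.  300 IN SOA ns1.internal.example. hostmaster.internal.example. (
        2025123002 ; serial - bump it, or secondaries keep the old data
        3600       ; refresh
        600        ; retry
        1209600    ; expire
        300 )      ; minimum / negative-cache TTL (was 3600)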

Task 11 — Validate that the authoritative server serves different answers by client subnet (BIND views case)

cr0x@server:~$ dig +noall +answer @192.168.10.53 git.internal.example A
git.internal.example. 300 IN A 10.60.12.44
cr0x@server:~$ dig +noall +answer @192.168.10.53 git.internal.example A +subnet=203.0.113.10/32
git.internal.example. 300 IN A 198.51.100.44

Meaning: The server is performing split-horizon behavior (possibly via views or ECS-aware logic). That second answer is a red flag if it’s not intentional.

Decision: If you’re not explicitly using EDNS Client Subnet, disable it on recursors or ensure views are based on source IP, not ECS. ECS can leak topology and complicate caching.
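
If the recursor in question is Unbound, the module list is the ECS switch: EDNS Client Subnet handling comes from the subnetcache module, and the stock module set omits it. A sketch of the relevant server option (file name hypothetical):

cr0x@server:~$ sudo sed -n '1,40p' /etc/unbound/unbound.conf.d/no-ecs.conf
server:
  # default module set: no subnetcache module, so no EDNS Client Subnet processing
  module-config: "validator iterator"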

Task 12 — Verify that the right DNS server is being used from the application perspective

cr0x@server:~$ getent ahosts git.internal.example
10.60.12.44     STREAM git.internal.example
10.60.12.44     DGRAM
10.60.12.44     RAW

Meaning: NSS resolution (glibc) resolves the name. If dig works but getent fails, you likely have NSS issues, container DNS, or different resolver libraries.

Decision: Treat getent as “what most apps see.” Debug DNS at the NSS layer, not only with dig.
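
One quick NSS-layer check is the hosts line in /etc/nsswitch.conf, which decides whether glibc goes through resolved, plain DNS, or something else. The exact line depends on which NSS modules are installed; on a Debian 13 box with libnss-resolve it typically looks like this:

cr0x@server:~$ grep '^hosts:' /etc/nsswitch.conf
hosts: files resolve [!UNAVAIL=return] dns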

Task 13 — Confirm NetworkManager/VPN handed down DNS correctly

cr0x@server:~$ nmcli dev show tun0 | sed -n '1,80p'
GENERAL.DEVICE:                         tun0
GENERAL.TYPE:                           tun
GENERAL.STATE:                          100 (connected)
IP4.DNS[1]:                             10.60.0.53
IP4.DOMAIN[1]:                          ~internal.example
IP4.DOMAIN[2]:                          ~svc.internal.example

Meaning: VPN profile is pushing DNS server and routing domains. Good. If this is missing, resolved can’t make smart choices.

Decision: Fix the VPN profile (or its plugin) to push the right DNS and domains. Don’t “fix” it by hardcoding global resolvers.

Task 14 — Inspect Unbound forwarding if you run a local recursive resolver

cr0x@server:~$ sudo unbound-control status
version: 1.19.0
verbosity: 1
threads: 2
modules: 3 [ subnetcache validator iterator ]
uptime: 7261 seconds
options: control(ssl)

Meaning: Unbound is running and might be the component doing conditional forwarding.

Decision: Verify that Unbound has explicit forward-zone or stub-zone for internal domains; otherwise it will ask the public internet and fail (or leak).
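
To see what Unbound actually has loaded, rather than what the config files claim, unbound-control can list the runtime forward and stub zones; the output here is abbreviated and illustrative for the article's zones:

cr0x@server:~$ sudo unbound-control list_forwards
internal.example. IN forward 10.60.0.53 10.60.0.54
svc.internal.example. IN forward 10.60.0.53 10.60.0.54
cr0x@server:~$ sudo unbound-control list_stubs

If your internal zones are missing from both lists, you have found the leak.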

Task 15 — See which process is bound to port 53 locally

cr0x@server:~$ sudo ss -lntup | grep ':53 '
udp   UNCONN 0      0         127.0.0.53%lo:53       0.0.0.0:*    users:(("systemd-resolve",pid=612,fd=15))

Meaning: Only the stub listener is on 127.0.0.53. If you expected dnsmasq/unbound to be listening, it isn’t.

Decision: Decide whether you want systemd-resolved alone, or a local recursor. If you add one, plan port ownership and integration explicitly.
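
If you do add a local recursor, the usual division of labor is: the recursor owns 127.0.0.1:53, and systemd-resolved stops listening on 127.0.0.53 and uses the recursor as its upstream. A minimal sketch via a drop-in (path hypothetical, recursor assumed on 127.0.0.1):

cr0x@server:~$ sudo sed -n '1,20p' /etc/systemd/resolved.conf.d/local-recursor.conf
[Resolve]
DNS=127.0.0.1
DNSStubListener=no
cr0x@server:~$ sudo systemctl restart systemd-resolved

With the stub listener off, make sure /etc/resolv.conf no longer points applications at 127.0.0.53 (switch the symlink to /run/systemd/resolve/resolv.conf or manage the file deliberately), or nothing will resolve.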

Case #33: DNS split-horizon gone wrong

This case shows up in three variations. Same movie, different actors.

Variation A: The internal zone “worked” until Debian 13 laptops arrived

For years, clients used a simple rule: internal Wi‑Fi handed out internal DNS servers via DHCP. External networks handed out public resolvers. People used a VPN, but it didn’t matter much because internal names were mostly needed on-prem.

Then remote work happened. VPN became the default. Somebody enabled split DNS “properly” so only internal domains go to internal DNS, and public stays public. Great goal.

But the internal naming scheme was a pile of short names and a private TLD (like .internal), while the VPN pushed routing domains for ~internal.example. Debian 13 didn’t guess what you meant. It routed git.internal to public resolvers. NXDOMAIN. Cached. Outage.

Variation B: The authoritative server was fine; the client policy was wrong

Admins proved the record exists by querying the internal DNS server directly. Everyone nods. Then they “fix” the client by adding internal DNS servers as global resolvers.

Now public DNS gets slow. Worse, internal resolvers start receiving queries for public domains they can’t resolve efficiently (or aren’t allowed to resolve). Someone adds forwarding for “.” to a public resolver. Congratulations, you built a recursion chain that leaks internal queries and adds latency.

Variation C: The resolver path was correct; caching made it look wrong

The day before the incident, a zone change removed a record briefly. Resolvers cached NXDOMAIN with a 1-hour negative TTL. The record came back 5 minutes later. People kept seeing failures for the next hour.

That’s when somebody runs a fleet-wide “DNS flush” script with sudo. It helps. It also becomes the new ritual. This is how tribal DNS religions form.

Three workable designs (and which one to pick)

Design 1: Authoritative split-horizon using BIND views (classic)

How it works: The same authoritative server answers internal clients from an internal view and external clients from an external view. Clients just query “the DNS server,” and the server chooses based on source address.

Pros: Central control; clients stay dumb; old devices work.

Cons: Source-IP-based policy fails when clients come through NAT or shared resolvers; harder with multi-cloud; mistakes are catastrophic because the authority is the source of truth.

Pick it when: Your network boundaries are clear, you control recursion, and you can reliably identify client networks.

Design 2: One authoritative view, split done in recursion via conditional forwarding (modern enterprise)

How it works: Authoritative zones are clean and consistent. Split behavior happens in recursive resolvers: internal resolvers know how to reach internal authoritative servers; public resolvers don’t. Clients use internal recursors only when appropriate (VPN/on-prem).

Pros: Cleaner DNS architecture; easier DNSSEC story; avoids “two sources of truth.”

Cons: Requires correct client routing (per-link domains). If that part is wrong, internal queries leak to public resolvers and fail.

Pick it when: You have mixed networks (office, home, VPN) and you want deterministic behavior without relying on source IP on the authoritative.

Design 3: Client-side split DNS via systemd-resolved routing domains (best for Debian 13 endpoints)

How it works: The client has multiple DNS servers. For specific domains, queries go to VPN DNS. Everything else goes to public DNS. This is exactly what systemd-resolved is good at, when configured correctly.

Pros: Prevents leaks; supports multiple networks; explicit; debuggable with resolvectl.

Cons: You must be disciplined about domain naming. Private TLDs and short names are pain multipliers. Some apps bypass the stub.

Pick it when: You have laptops, VPN, and you care about not breaking the public internet on client machines.

If you’re running Debian 13 endpoints, Design 3 is the pragmatic answer most of the time. If you’re running servers in a stable network, Design 2 is usually cleaner. Design 1 is still valid, but it’s easy to get wrong at scale.

Concrete configurations that fix the problem (without “just hardcode 8.8.8.8”)

Option A: Fix per-link routing domains on Debian 13 (systemd-resolved)

The key idea: make internal suffixes route to VPN DNS, and keep global DNS for the rest.

NetworkManager-managed VPN (recommended on endpoints)

Ensure your VPN connection sets DNS and routing domains. You can do it in the connection profile:

cr0x@server:~$ nmcli connection modify corp-vpn ipv4.dns "10.60.0.53 10.60.0.54"
cr0x@server:~$ nmcli connection modify corp-vpn ipv4.dns-search "~internal.example ~svc.internal.example"
cr0x@server:~$ nmcli connection up corp-vpn
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/12)

Meaning: The ~ marks routing-only domains, not search domains. That’s what you want: only queries for these suffixes go to VPN DNS.

Decision: If you rely on git.internal (no dot-suffix), stop and rename or provide FQDNs. Routing domains match suffixes, not vibes.

systemd-networkd (servers or minimal installs)

For a VPN interface, you can set DNS and domains in .network units. Example snippet:

cr0x@server:~$ sudo sed -n '1,120p' /etc/systemd/network/50-tun0.network
[Match]
Name=tun0

[Network]
DNS=10.60.0.53
DNS=10.60.0.54
Domains=~internal.example ~svc.internal.example

Meaning: This pins routing domains to the VPN link. Public DNS remains global.

Decision: Restart networking carefully. On remote systems, do this in a maintenance window or with console access.
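
On current systemd you can apply a changed .network file without bouncing the whole stack: networkctl reload re-reads the unit files, reconfigure reapplies them to the link, and resolvectl shows the result. Assuming the tun0 unit above is networkd-managed:

cr0x@server:~$ sudo networkctl reload
cr0x@server:~$ sudo networkctl reconfigure tun0
cr0x@server:~$ resolvectl status tun0

Confirm the DNS Servers and DNS Domain lines for tun0 match what you set before walking away.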

Option B: Conditional forwarding in Unbound (good for a local “smart” resolver)

If you want a local recursive resolver that forwards internal zones to internal DNS (and resolves the rest normally), Unbound is boring and excellent.

Example /etc/unbound/unbound.conf.d/forward-internal.conf:

cr0x@server:~$ sudo sed -n '1,120p' /etc/unbound/unbound.conf.d/forward-internal.conf
forward-zone:
  name: "internal.example."
  forward-addr: 10.60.0.53
  forward-addr: 10.60.0.54

forward-zone:
  name: "svc.internal.example."
  forward-addr: 10.60.0.53
  forward-addr: 10.60.0.54

Meaning: Queries for those zones go to internal resolvers; everything else resolves via root hints or your configured upstreams.

Decision: If you do this on endpoints, still prefer per-link routing domains to avoid leaking internal names when off VPN.

Restart and verify:

cr0x@server:~$ sudo systemctl restart unbound
cr0x@server:~$ sudo unbound-control reload
ok
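
Then prove the path end to end by asking the local Unbound instance itself; this assumes it listens on 127.0.0.1 (the package default), and the expected answer is the internal record from the earlier tasks:

cr0x@server:~$ dig +noall +answer @127.0.0.1 git.internal.example A
git.internal.example.   300     IN      A       10.60.12.44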

Option C: BIND views on the authoritative (useful, but handle with care)

Views are powerful: they let you serve different zone files to different clients. They’re also a great way to hide bugs until 2 AM.

Example structure (simplified) for named.conf:

cr0x@server:~$ sudo sed -n '1,200p' /etc/bind/named.conf.local
acl "internal-nets" { 10.0.0.0/8; 192.168.0.0/16; 172.16.0.0/12; };

view "internal" {
  match-clients { internal-nets; };
  recursion no;

  zone "internal.example" {
    type master;
    file "/etc/bind/zones/db.internal.example.internal";
  };
};

view "external" {
  match-clients { any; };
  recursion no;

  zone "internal.example" {
    type master;
    file "/etc/bind/zones/db.internal.example.external";
  };
};

Meaning: Clients from internal nets see internal answers; everyone else sees the external zone (which might be empty or NXDOMAIN-ish depending on your setup).

Decision: Ensure monitoring queries both views. Most orgs monitor only from inside and then act surprised when partners can’t resolve anything.
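
Before reloading a views configuration, validate it; one typo can take both views down at once. The zone file paths are the ones from the example above, and the checkzone output shown is the normal success form:

cr0x@server:~$ sudo named-checkconf /etc/bind/named.conf
cr0x@server:~$ sudo named-checkzone internal.example /etc/bind/zones/db.internal.example.internal
zone internal.example/IN: loaded serial 2025123001
OK
cr0x@server:~$ sudo named-checkzone internal.example /etc/bind/zones/db.internal.example.external
zone internal.example/IN: loaded serial 2025123002
OK
cr0x@server:~$ sudo rndc reload
server reload successful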

Option D: dnsmasq for a small site or lab

dnsmasq can do split forwarding and caching in one small daemon. It’s great until you accidentally turn your laptop into a DNS server for the entire café.

Example snippet:

cr0x@server:~$ sudo sed -n '1,120p' /etc/dnsmasq.d/split-dns.conf
server=/internal.example/10.60.0.53
server=/svc.internal.example/10.60.0.53
no-resolv
server=1.1.1.1
server=8.8.8.8
cache-size=10000

Meaning: Internal zones forward to internal DNS, everything else to public resolvers.

Decision: If you use this, lock down interface binding and firewall rules. “Accidental open resolver” is a career-limiting move.
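
A minimal lockdown for the laptop or lab case is to keep dnsmasq loopback-only; these options can live in the same drop-in directory as the split rules above:

cr0x@server:~$ sudo sed -n '1,40p' /etc/dnsmasq.d/lockdown.conf
# answer only on loopback; by default dnsmasq binds the wildcard address
listen-address=127.0.0.1
bind-interfaces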

Joke #2: Running an open resolver is like leaving your car unlocked with a “free rides” sign—someone will accept the offer.

Common mistakes (symptom → root cause → fix)

1) Internal names fail only on VPN

Symptom: Connected to VPN; git.internal.example fails, but public sites work.

Root cause: VPN DNS server is configured, but routing domains aren’t. Queries for internal domains still go to public resolvers.

Fix: Push routing domains via VPN profile (~internal.example) and verify with resolvectl status. Don’t set VPN DNS as global default unless you mean it.

2) Public DNS becomes slow or flaky after “fixing” internal DNS

Symptom: After adding internal DNS servers globally, browsing slows; intermittent resolution failures for public domains.

Root cause: Internal resolvers aren’t designed/allowed to recurse for the internet, or they forward to another resolver with latency and filtering. You inserted an unnecessary hop.

Fix: Use split routing: internal zones to internal DNS, everything else to public DNS or a proper corporate recursor.

3) Works with dig, fails in applications

Symptom: dig returns the right A record, but curl/git/apt fails to resolve.

Root cause: App uses different resolver path (container DNS, static resolv.conf in chroot, NSS misconfig), or it cached failure.

Fix: Use getent ahosts, inspect container /etc/resolv.conf, check resolvectl and flush caches. Validate that the app is not pinned to a custom DNS.

4) It fails for AAAA but works for A (or vice versa)

Symptom: IPv4 works; IPv6 resolution leads to timeouts or wrong answers.

Root cause: Missing AAAA records internally, broken DNS64/NAT64 assumptions, or firewall blocks for larger DNS responses when AAAA triggers bigger sets.

Fix: Decide whether you support IPv6 internally. If yes, publish AAAA and ensure MTU/firewall handles EDNS0. If no, consider filtering AAAA at the recursor (last resort) and be explicit.

5) Intermittent SERVFAIL from validating resolvers

Symptom: Some clients get SERVFAIL; others are fine; the authoritative server seems to respond.

Root cause: DNSSEC validation failure due to inconsistent signing, broken chain, or split-horizon returning different DNSSEC material than expected.

Fix: Either properly sign internal zones and manage trust anchors, or disable validation for internal zones at the validating resolver. Half-DNSSEC is worse than no DNSSEC.

6) Short internal names stop working after moving to routing domains

Symptom: Users type git and it used to work; now it doesn’t.

Root cause: Search domains changed or vanished; routing domains don’t expand short names.

Fix: Move humans to FQDNs for critical services. If you must keep short names, add search domains knowingly and test collisions (e.g., git.office.example vs git.internal.example).

7) “Fix” causes internal hostname leakage to public DNS logs

Symptom: Security reports internal names seen by public resolvers; privacy concerns raised.

Root cause: Clients query public resolvers for internal suffixes because split DNS routing is missing or wrong.

Fix: Enforce routing domains and use internal recursors that refuse to forward internal zones to the public internet. Monitor for leaks by querying public resolvers from controlled clients and checking NXDOMAIN patterns.
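
A controlled leak check can be a few lines of shell run from an outside vantage point: ask a public resolver for a canary internal name and complain if it returns anything other than NXDOMAIN. The name and resolver are this article's examples; wiring the alert into monitoring is up to you:

cr0x@server:~$ sed -n '1,30p' ~/bin/canary-leak-check.sh
#!/bin/sh
# Canary check: internal names must not resolve (or exist) on the public internet.
CANARY="git.internal.example"
PUBLIC_RESOLVER="1.1.1.1"
# Extract the status field (NOERROR, NXDOMAIN, SERVFAIL, ...) from the dig header.
status=$(dig +time=3 +tries=1 +noall +comments @"$PUBLIC_RESOLVER" "$CANARY" A \
  | awk -F'status: ' '/status:/ {print $2}' | cut -d, -f1)
if [ "$status" != "NXDOMAIN" ]; then
    echo "ALERT: $CANARY returned status '$status' from $PUBLIC_RESOLVER (expected NXDOMAIN)" >&2
    exit 1
fi
echo "OK: $CANARY is NXDOMAIN on $PUBLIC_RESOLVER"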

Three corporate-world mini-stories (anonymized, plausible, technically accurate)

Story 1: The outage caused by a wrong assumption

They were a mid-sized company with a clean internal DNS setup: corp.example for users, svc.corp.example for services. The VPN team added split DNS and pushed ~svc.corp.example to endpoints. Everything looked correct in their ticket.

The wrong assumption was subtle: they believed all internal services lived under svc.corp.example. In reality, one legacy service was branded as jira.corp.example, not jira.svc.corp.example. Users didn’t know; they had bookmarks.

When people connected from home, jira.corp.example went to public DNS, returned NXDOMAIN, and got cached. A few users tried again later on the office network and still got NXDOMAIN for a while, because now their local caches were poisoned with “does not exist.”

Engineering spent a morning looking at the Jira server and its load balancer. Both were fine. The incident commander finally demanded a single reproduction with explicit DNS servers. That’s when the mismatch of routing domains and real naming conventions became obvious.

The fix was boring: add ~corp.example to the VPN routing domains and publish a small internal “DNS contract” document: which suffixes are internal, which are public, and who owns changes. The big change was cultural: no more “internal names are obvious.” They aren’t.

Story 2: The optimization that backfired

A different company had two internal recursors and a public upstream. Someone noticed DNS latency spikes and decided to “optimize” by setting internal recursors as global resolvers everywhere, including laptops off VPN. Their thinking: internal recursors have better caches and can forward to public anyway.

It worked, until it didn’t. Laptops at home began querying internal resolvers over a VPN tunnel that wasn’t up yet, or over a network path that filtered UDP fragments. DNS queries timed out, then retried over TCP, then triggered slow fallback behavior in applications. Pages loaded like it was 2003.

The ops team then added a second layer: local dnsmasq on laptops to cache more aggressively. That introduced another cache, another set of timeouts, and a new failure mode: when the VPN came up, dnsmasq didn’t always pick up the updated internal servers quickly. Users saw “it works after I reboot,” which is the IT equivalent of “turn it off and on again,” except now it’s policy.

They eventually rolled it back and used per-domain routing: internal suffixes to internal DNS only when on VPN. Latency improved because they removed the misrouted path. The lesson was uncomfortable: performance “optimization” in DNS is often just moving the wait time to a less visible place.

Story 3: The boring but correct practice that saved the day

At a regulated firm, DNS changes required a change request with a pre-flight and post-flight checklist. Everyone complained it was too slow. Then a split-horizon incident happened during a merger: two networks, two sets of internal zones, and a VPN bridge.

The saving practice wasn’t fancy tech. It was that the DNS team had a standard “canary” host outside each network that continuously queried a small list of internal and external names against the intended resolvers. It also queried from inside to ensure internal views stayed internal. Every check recorded NXDOMAIN/SERVFAIL and TTL.

When the incident started, they didn’t argue about whether DNS was broken. The canary showed: internal names were being asked on public resolvers from VPN-connected endpoints. That narrowed it to client routing domains or VPN config, not the authoritative servers.

The VPN team had pushed a new profile that dropped routing domains and replaced them with search domains. The canary caught it within minutes. Rollback was fast because they had the previous profile pinned and tested. Everyone went back to complaining about bureaucracy, but production stayed alive.

Checklists / step-by-step plan (do this, in this order)

Step 0 — Decide what “internal” means

  • List the internal DNS suffixes (example: internal.example, svc.internal.example).
  • Ban or phase out private TLDs (for example .internal or .corp) in favor of suffixes under a domain you own, unless you have a deliberate and documented policy.
  • Decide whether you will support short names. If yes, define search domain behavior and accept collisions.

Step 1 — Validate authoritative data

  • Query SOA and NS for each internal zone from inside the network.
  • Confirm TTL and negative TTL values.
  • If split is on authoritative via views, test both views from known subnets.

Step 2 — Validate recursive behavior

  • Ensure internal resolvers can reach internal authoritative servers reliably.
  • Ensure internal resolvers do not become accidental open resolvers.
  • Configure conditional forwarding/stub zones explicitly, not via undocumented magic.

Step 3 — Fix clients the Debian 13 way

  • Use resolvectl status to confirm per-link DNS servers.
  • Ensure VPN links carry routing domains (~suffix), not just search domains.
  • Keep global DNS for public resolution; don’t route everything through the VPN unless policy requires it.

Step 4 — Put guardrails in place

  • Monitoring from multiple vantage points (inside, outside, VPN) with explicit resolvers.
  • Alert on sudden spikes in NXDOMAIN for internal suffixes.
  • Runbooks that start with “which resolver path is used,” not “restart bind.”

Step 5 — Communicate caching reality

  • Tell stakeholders that a DNS change can take up to the old record TTL to propagate through caches, and up to the negative TTL for names that previously didn’t exist.
  • Have a controlled cache flush procedure for endpoints (not “everyone reboot”).
  • Keep TTLs sane: low enough for change, high enough to avoid query storms.

FAQ

1) Why did this start happening after moving to Debian 13?

Because the client-side resolver stack is more policy-driven. systemd-resolved supports per-link DNS and routing domains, and it will not “guess” that git.internal belongs to your VPN unless you tell it.

2) Should I disable systemd-resolved and go back to plain /etc/resolv.conf?

Only if you’re prepared to replace its per-link routing capabilities with something equivalent. For laptops and VPN split DNS, disabling it usually makes things worse, not better.

3) What’s the difference between a search domain and a routing domain (~domain)?

A search domain expands short names (query git becomes git.office.example). A routing domain tells resolved which DNS server to use for names within that suffix. Routing domains do not expand short names.

4) Why is NXDOMAIN “sticky” even after I add the record?

Negative caching. Resolvers cache “does not exist” based on the zone’s SOA negative TTL. Flush caches to confirm, then adjust negative TTL if your change rate requires it.

5) Is it okay to use .internal or .corp as a private TLD?

It works until it doesn’t. The safer practice is to use subdomains of a domain you control (like internal.example). Private TLDs complicate split DNS routing, leak detection, and interoperability.

6) Why does dig work but my browser doesn’t?

dig queries exactly what you tell it. Your browser uses the system resolver, may use DNS-over-HTTPS, or may be inside a container namespace. Validate with getent and check browser DoH settings and container DNS configs.

7) Can I solve this by putting internal records in public DNS?

You can, but it’s usually the wrong trade. You leak internal topology, complicate access control, and risk exposing services unintentionally. Fix routing domains and conditional forwarding instead.

8) Do I need DNSSEC for internal zones?

Not strictly, but you need a coherent plan. Validating resolvers plus inconsistent internal DNSSEC is a recipe for SERVFAIL. Either sign and manage it correctly, or disable validation for those internal zones at the resolver.

9) What’s the cleanest way to support both on-prem and VPN users?

Use FQDNs under a domain you own (e.g., svc.internal.example), push routing domains via VPN, and keep public DNS separate. Then test from three vantage points: on-prem, VPN, and outside.

Conclusion: next steps that won’t bite you later

Split-horizon DNS fails when you rely on implied behavior. Debian 13’s resolver stack is explicit: it needs explicit routing domains, explicit forwarding rules, and explicit naming conventions. That’s good news, because it means you can make it deterministic.

Do this next:

  1. Pick your internal suffixes and standardize them under a domain you control.
  2. On Debian 13 endpoints, verify per-link DNS and routing domains with resolvectl status.
  3. Fix VPN profiles to push ~internal.example-style routing domains, not just DNS servers.
  4. Confirm authoritative SOA negative TTLs aren’t sabotaging your recovery time.
  5. Implement monitoring from at least two network vantage points and alert on internal-name leakage to public resolvers.

Then stop “fixing DNS” by hardcoding resolvers globally. That’s not a fix. It’s a new outage with better marketing.
