You can build a flawless VPN tunnel and still ship a privacy leak, a reliability mess, and a helpdesk bonfire—just by getting DNS wrong. The usual symptom is “VPN is connected but internal apps don’t work,” followed by “also, why is my laptop talking to a random resolver in a hotel?”
DNS is tiny packets, big consequences. If your split tunnel sends only “important” traffic through the VPN but your DNS queries still wander out to the local network, you’ll see leaks, broken internal name resolution, and weird latency spikes that look like “the VPN is slow” even when the tunnel is fine.
The mental model: what “local DNS for VPN users” actually means
“Local DNS for VPN users” is a phrase people use when they mean three different things. If you don’t clarify which one you’re implementing, you’ll ship a system that works in your lab and fails in a hotel lobby.
Meaning #1: “Use an internal resolver reachable through the VPN”
This is the classic corporate model: the VPN client should send DNS queries to a resolver inside your network (or inside your cloud VPC), not to whatever resolver the user’s Wi‑Fi hands them. This is the foundation for split-horizon DNS (internal names resolve to internal IPs), internal-only zones, and consistent logging.
Meaning #2: “Run a resolver on the client and forward intelligently”
You run a local stub resolver on the device (or rely on the OS’s stub) and configure conditional forwarding: internal domains go to internal DNS over the VPN; everything else goes to a public resolver (or also through the VPN, depending on policy). This is often the cleanest way to make split tunnel behave without punting all internet DNS through the corporate network.
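A minimal sketch of Meaning #2 with dnsmasq as the on-device forwarder (dnsmasq is just one option; Unbound and systemd-resolved can express the same split, and the domain and resolver IP below are placeholders):

# /etc/dnsmasq.d/split-dns.conf (sketch)
# Internal zone: forward corp.example queries to the internal resolver reachable via the VPN.
server=/corp.example/10.20.0.53
# Everything else: forward to a chosen public resolver (or route it through the VPN, per policy).
server=1.1.1.1
# Ignore /etc/resolv.conf so upstream selection stays explicit.
no-resolv
# Cache locally to smooth over flaky links.
cache-size=10000

The OS stub then points at 127.0.0.1, and the forwarder decides which upstream sees which query.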
Meaning #3: “Provide a DNS service close to the user to reduce latency”
This is performance-driven: deploy resolvers regionally, reachable through the VPN, so remote workers don’t hairpin to headquarters for every lookup. It also reduces blast radius when one DNS node dies, because “nearest” changes.
All three are valid. Mixing them without intention is how you end up with partial leaks, unpredictable resolver order, and caches that fight each other.
Operational truth: DNS “works” until it doesn’t, and then it fails like a soap bubble—quietly, randomly, and always during a live demo.
Interesting facts and history that explain today’s mess
- DNS predates modern VPNs by years. The DNS spec (RFC 1034/1035) is from 1987, designed for a friendlier network than today’s hostile Wi‑Fi and captive portals.
- “Split horizon” is older than cloud. Enterprises were doing internal vs external DNS views long before Kubernetes made everything feel new again.
- UDP was a feature, then a liability. DNS over UDP was perfect for small queries, but it’s also easy to spoof, block, or mishandle on flaky networks.
- EDNS0 changed packet sizes. Larger DNS responses (think DNSSEC, lots of records) increased fragmentation issues—exactly the kind of thing that gets weird over VPN MTUs.
- DNSSEC improved integrity, not privacy. It helps prevent tampering but doesn’t stop DNS leaks or stop observers from seeing what you queried.
- DoH/DoT changed who “owns” name resolution. Applications can bypass OS resolver settings, which means your VPN’s “set DNS server” might be ignored by the browser.
- Windows NRPT exists for a reason. Microsoft added the Name Resolution Policy Table (NRPT) to control which namespaces use which resolvers—because “just set a DNS server” wasn’t enough.
- systemd-resolved brought split DNS to Linux mainstream. It’s powerful, but it also introduced new failure modes when admins assume /etc/resolv.conf tells the full story.
- Corporate DNS is often the last on-call frontier. Plenty of teams modernize auth, deploy fancy proxies, and still run DNS like it’s 2009 because “it’s always been fine.”
Design goals: pick what you’re optimizing for
Before you touch configs, decide what “correct” means for your organization. DNS is policy disguised as plumbing.
Goal A: Stop DNS leaks (privacy + compliance)
Leaks happen when DNS queries for internal domains (or any domains, depending on policy) go to a resolver outside your control. Fixing this may require forcing DNS to the VPN interface, blocking port 53 off-tunnel, and dealing with DoH.
Goal B: Make split tunnel actually work (reliability)
Split tunnel usually means: internal apps through VPN, internet direct. The DNS part must match that. If internal zones resolve via public resolvers, internal apps fail. If public zones resolve via internal resolvers over a congested tunnel, the internet “feels slow.”
Goal C: Keep latency and load sane (performance)
Centralizing DNS behind a single VPN endpoint is easy, and it’s also how you teach users that “the VPN is slow.” Regional resolvers, caching, and correct TTL handling make DNS fast enough that nobody thinks about it—which is the highest compliment you’ll ever get.
Goal D: Keep operations boring (supportability)
DNS debugging on remote endpoints is already painful. Don’t make it worse with cleverness you can’t observe. Prefer designs where you can answer: which resolver did the client use, which interface did it go out, and what did it get back?
Paraphrased idea from John Allspaw: reliability is built from the ability to respond to the unexpected, not from pretending the unexpected won’t happen.
Reference architectures that work in production
1) Full-tunnel DNS: force all DNS through VPN to internal resolvers
When to use: strict compliance, regulated environments, or when you can’t tolerate any off-tunnel DNS.
How it works: VPN pushes internal DNS servers; client routes DNS queries over tunnel; firewall blocks outbound DNS on the local interface; optionally intercepts/redirects to the internal resolver.
Trade-offs: more load on your VPN and DNS; higher latency for public lookups unless your internal resolver does good upstream selection; must handle DoH separately.
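The “block outbound DNS on the local interface” piece can be as small as this client-side nftables sketch (a sketch only, assuming the tunnel interface is named wg0; commercial VPN clients often ship an equivalent “DNS lockdown” switch):

# Client-side nftables sketch: plain DNS may only leave via the tunnel.
table inet dns_lockdown {
  chain output {
    type filter hook output priority 0; policy accept;
    # Queries to the local stub (systemd-resolved, dnsmasq) stay on loopback.
    oifname "lo" accept
    # DNS over the tunnel is allowed.
    oifname "wg0" udp dport 53 accept
    oifname "wg0" tcp dport 53 accept
    # Anything else on port 53 is a leak; drop it and count it.
    udp dport 53 counter drop
    tcp dport 53 counter drop
  }
}

If your VPN endpoint is a hostname, carve out an exception (or use a cached IP) so the tunnel can still come up. And note this does nothing about DoH, which is handled separately below.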
2) Split DNS with conditional forwarding (recommended default)
When to use: most organizations with split tunnel and internal namespaces.
How it works: internal domains (e.g., corp.example, svc.cluster.local, cloud private zones) go to internal resolvers over VPN; everything else uses the local network’s resolver or a chosen public resolver.
Trade-offs: more configuration complexity across OSes; requires careful resolver-order control and domain routing rules; still must consider leak paths.
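On a Linux client where systemd-networkd manages the tunnel link, the split can be expressed declaratively (a sketch under that assumption; wg-quick, NetworkManager, and most commercial clients have equivalent knobs):

# /etc/systemd/network/90-wg0.network (sketch)
[Match]
Name=wg0

[Network]
# Internal resolver, reachable only through the tunnel.
DNS=10.20.0.53
# The "~" prefix makes this a routing domain: only corp.example queries use this link's DNS.
Domains=~corp.example

With that in place, corp.example goes to 10.20.0.53 over wg0 and everything else follows the other links’ resolvers.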
3) Client-local resolver + encrypted upstream (DoT) over VPN
When to use: you want consistent behavior and fewer OS quirks, and you can manage an agent.
How it works: run a local caching stub (like unbound) on the client; forward internal zones to internal DNS; forward public to DoT endpoints either through VPN or direct. This can tame resolver races and reduce repeated queries on flaky links.
Trade-offs: agent lifecycle management; one more moving part; debugging shifts from OS tools to your stub’s logs.
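A minimal Unbound sketch for this architecture (zone name, resolver IP, and DoT upstream are placeholders; pick upstreams that match your policy):

# /etc/unbound/unbound.conf.d/split.conf (sketch)
server:
  # Listen only for the local stub; point /etc/resolv.conf (or the OS stub) at 127.0.0.1.
  interface: 127.0.0.1
  # CA bundle used to authenticate DoT upstreams (path varies by distro).
  tls-cert-bundle: /etc/ssl/certs/ca-certificates.crt

# Internal zone: plain DNS to the internal resolver, reachable only via the VPN.
forward-zone:
  name: "corp.example."
  forward-addr: 10.20.0.53

# Everything else: DNS over TLS to a controlled public upstream.
forward-zone:
  name: "."
  forward-tls-upstream: yes
  forward-addr: 1.1.1.1@853#cloudflare-dns.com

The caching layer is the point: repeated lookups survive brief tunnel flaps, and you get one place to log which upstream answered.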
4) VPN-accessible anycast resolvers (for scale)
When to use: large remote workforce across regions; low-latency requirements; you already run global routing well.
How it works: advertise the same resolver IP(s) in multiple regions reachable via VPN; BGP or overlay routing takes the client to the nearest instance. Works best when coupled with health checks and fast failover.
Trade-offs: operational maturity required; anycast + stateful firewalls can get spicy; debugging “which node answered” needs good observability.
Opinion: if you’re not regulated into full-tunnel DNS, start with split DNS and conditional forwarding. It’s the best balance of user experience and control—if you implement it honestly and test on real client OSes.
Client behavior: Windows, macOS, Linux, mobile (and their bad habits)
Windows: multiple resolvers, suffix search, and policy tables
Windows can maintain different DNS servers per interface and has rules for which interface wins. It also has suffix search lists that can generate surprising queries. For split DNS, Windows NRPT can route specific namespaces to specific DNS servers. If you’re doing corporate VPN at scale, learn NRPT. It’s the difference between “works for IT” and “works for everyone.”
macOS: resolver order is real, and sometimes it’s personal
macOS uses a dynamic resolver configuration. You can have multiple resolvers active with per-domain routing. The UI and VPN client may claim one thing while the underlying resolver does another. Always inspect the actual resolver state, not just the network panel.
Linux: /etc/resolv.conf is a liar on modern systems
On many distros, /etc/resolv.conf points to a local stub (like 127.0.0.53) managed by systemd-resolved, NetworkManager, or something else. Debugging requires querying the resolver manager, not just reading the file.
Mobile: captive portals, “helpful” privacy features, and app-specific DNS
Phones roam, switch networks, and aggressively try to keep connectivity. Some apps use their own DNS methods, and modern platforms may prefer encrypted DNS if configured. Your “push DNS over VPN” story might not cover the browser’s DoH setting or a security agent doing its own resolution.
Joke #1: DNS is like office gossip—if you don’t control where it goes, it will absolutely end up in the wrong hallway.
Where DNS leaks and split routing failures come from
Leak path 1: DNS server set, but route missing
VPN pushes an internal DNS server IP, but the client doesn’t route that IP through the tunnel (common with split tunnel). Result: the DNS server is “configured” but unreachable. The OS fails over to another resolver—often the local network’s—causing leaks and breakage.
Leak path 2: Resolver order and fallback
Clients may try multiple resolvers. If the internal resolver times out (latency, MTU issues, packet loss), the OS uses the next resolver. That next resolver is usually external. Congrats, you just built a probabilistic privacy policy.
Leak path 3: Application-level DNS (DoH/DoT)
If browsers or agents use DoH directly to a public provider, your VPN DNS settings can be bypassed. Sometimes that’s desired. Often it’s not. Decide your policy, then enforce it with endpoint management and/or network controls.
Split routing failure: internal names resolve to public IPs (or NXDOMAIN)
If internal zones aren’t routed to internal resolvers, users will either get NXDOMAIN or, worse, resolve internal hostnames to public records with the same name. Split horizon DNS exists because name reuse happens. Ignoring that fact is how you open tickets you don’t deserve.
Split routing failure: internal names resolve fine, but traffic goes direct
Even if DNS is correct, the resulting A/AAAA record must land on an IP range routed via the VPN. If your split tunnel routes only some prefixes, and internal services live elsewhere, name resolution succeeds and connectivity still fails. Users call it “DNS” anyway. Sometimes they’re right, but not for the reason they think.
Practical tasks: commands, outputs, and what you decide from them
These are the checks I run in roughly this order when someone says “VPN DNS is broken.” Each task includes the command, what typical output looks like, what it means, and the decision you make next.
Task 1: Identify which DNS server the system is actually using (Linux with systemd-resolved)
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.20.0.53
DNS Servers: 10.20.0.53 1.1.1.1
DNS Domain: corp.example
Link 3 (wg0)
Current Scopes: DNS
Protocols: +DefaultRoute
Current DNS Server: 10.20.0.53
DNS Servers: 10.20.0.53
DNS Domain: ~corp.example
What it means: DNS is split: wg0 owns ~corp.example and uses 10.20.0.53. There’s also a global fallback (1.1.1.1).
Decision: If internal queries still leak, verify that internal queries are being routed to wg0 and that ~corp.example is correctly set. If it’s missing, fix the per-domain routing.
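If the routing domain is missing, a quick live repair looks like this (same interface and resolver as above; persist the change in your VPN client or network config, because resolvectl settings don’t survive the link going down):

cr0x@server:~$ sudo resolvectl dns wg0 10.20.0.53
cr0x@server:~$ sudo resolvectl domain wg0 '~corp.example'

Then re-run resolvectl status and confirm wg0 lists ~corp.example before declaring victory.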
Task 2: Check /etc/resolv.conf without being fooled by it
cr0x@server:~$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Jan 2 10:11 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
What it means: You’re using a stub resolver. Reading the file won’t show the real upstream choices.
Decision: Use resolvectl (or NetworkManager tools) for truth. If apps bypass the stub, you have a different problem.
Task 3: Confirm that internal DNS server is reachable via the VPN route
cr0x@server:~$ ip route get 10.20.0.53
10.20.0.53 dev wg0 src 10.20.0.10 uid 1000
cache
What it means: The route to the internal resolver goes via wg0. Good.
Decision: If it routes via wlan0 or another local interface, fix split tunnel routes (AllowedIPs, pushed routes, or policy routing).
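For WireGuard with wg-quick, “fix split tunnel routes” usually means making sure the resolver’s prefix is inside AllowedIPs (a sketch using this article’s example addresses; the endpoint hostname and key are placeholders):

# /etc/wireguard/wg0.conf (client side, sketch)
[Interface]
Address = 10.20.0.10/32
# Pushed resolver; wg-quick hands this to resolvconf/systemd-resolved.
DNS = 10.20.0.53

[Peer]
PublicKey = <gateway-public-key>
Endpoint = vpn.example.com:51820
# Split tunnel: include the resolver's subnet as well as the service ranges.
AllowedIPs = 10.20.0.0/16, 10.30.0.0/16

The classic failure is listing only the service ranges and forgetting the subnet the resolver lives in.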
Task 4: Test internal resolution explicitly against the internal resolver
cr0x@server:~$ dig @10.20.0.53 git.corp.example +time=2 +tries=1
; <<>> DiG 9.18.24 <<>> @10.20.0.53 git.corp.example +time=2 +tries=1
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40211
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
git.corp.example. 60 IN A 10.30.4.21
;; Query time: 21 msec
;; SERVER: 10.20.0.53#53(10.20.0.53)
What it means: DNS server responds fast and returns an internal IP.
Decision: If this works but apps fail, the client may not be using that resolver (resolver order problem), or traffic routing to 10.30.4.21 is wrong.
Task 5: Test the same name using the system default path (catch leaks and misrouting)
cr0x@server:~$ dig git.corp.example +time=2 +tries=1
; <<>> DiG 9.18.24 <<>> git.corp.example +time=2 +tries=1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 5851
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
What it means: Your system default resolver path doesn’t know the internal zone (likely using external DNS).
Decision: Fix split DNS routing so corp.example queries go to the internal resolver. On Linux that’s per-link domain routing; on Windows that’s often NRPT; on macOS it’s per-domain resolver entries.
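On macOS, the low-level way to express the per-domain part is a file under /etc/resolver named after the internal zone (a sketch; most MDM/VPN clients create the equivalent for you, and the values are this article’s examples):

# /etc/resolver/corp.example (sketch; see resolver(5))
nameserver 10.20.0.53
search_order 1

Note this affects apps using the system resolver APIs; command-line tools like dig talk to servers directly and won’t reflect it, so verify with scutil --dns as well.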
Task 6: Detect whether DNS packets actually traverse the VPN interface
cr0x@server:~$ sudo tcpdump -ni wg0 port 53 -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on wg0, link-type RAW (Raw IP), snapshot length 262144 bytes
10:22:41.111112 IP 10.20.0.10.53321 > 10.20.0.53.53: 40211+ A? git.corp.example. (33)
10:22:41.132908 IP 10.20.0.53.53 > 10.20.0.10.53321: 40211 1/0/1 A 10.30.4.21 (49)
What it means: DNS is going through the tunnel, and replies return. This is the clean baseline.
Decision: If you see nothing on wg0 but do see queries on wlan0, that’s a leak. Fix routing or resolver selection.
Task 7: Catch “fallback resolver” behavior by watching multiple interfaces
cr0x@server:~$ sudo tcpdump -ni wlan0 port 53 -c 3
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on wlan0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:23:02.002201 IP 192.168.1.25.58732 > 192.168.1.1.53: 5851+ A? git.corp.example. (33)
10:23:02.004991 IP 192.168.1.1.53 > 192.168.1.25.58732: 5851 NXDomain 0/1/0 (102)
What it means: Your device is leaking internal queries to the local network resolver.
Decision: Stop the fallback by implementing split DNS correctly and/or blocking off-tunnel DNS. If policy allows, enforce “internal zones must go through VPN” at the OS policy level.
Task 8: Check for MTU/fragmentation problems that look like “DNS timeouts”
cr0x@server:~$ ping -M do -s 1472 10.20.0.53 -c 2
PING 10.20.0.53 (10.20.0.53) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420
--- 10.20.0.53 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1026ms
What it means: The tunnel interface MTU is 1420, smaller than the 1500 bytes you tried to send unfragmented. Big UDP responses may fragment or drop, leading to intermittent DNS failure.
Decision: Adjust VPN MTU, enable TCP fallback reliability, or ensure DNS responses stay small where possible. If you use DNSSEC-heavy responses, be extra cautious.
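Two knobs worth a look, shown as a sketch (the numbers are examples; measure your own path first): clamp the tunnel MTU in the client config, and keep UDP answers small enough on the resolver that oversized ones fall back to TCP instead of fragmenting.

# WireGuard client (wg-quick): pin the interface MTU below the measured path MTU.
[Interface]
MTU = 1400

# Unbound: advertise a smaller EDNS buffer (1232 is the commonly recommended value).
server:
  edns-buffer-size: 1232

Make sure TCP/53 is actually open end to end, or the fallback just becomes a slower failure.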
Task 9: Verify the client is not using DoH in a way that bypasses your plan (network-side view)
cr0x@server:~$ sudo tcpdump -ni wg0 'tcp port 443 and (host 1.1.1.1 or host 8.8.8.8)' -c 3
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on wg0, link-type RAW (Raw IP), snapshot length 262144 bytes
10:24:10.500000 IP 10.20.0.10.41220 > 1.1.1.1.443: Flags [S], seq 123456, win 64240, options [mss 1360,sackOK,TS val 1 ecr 0], length 0
What it means: You’re seeing HTTPS traffic to a public resolver provider. That could be normal web traffic—or it could be DoH.
Decision: If your policy forbids DoH bypass, enforce it via endpoint management or egress controls. If policy allows it, ensure internal domains can’t leak via DoH (which often means split DNS rules at the application layer or blocking internal suffixes from leaving).
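If the decision is “no DoH/DoT to public resolvers,” a blunt network-side sketch looks like this (the IP list is illustrative and will age; pair it with endpoint policy such as managed browser settings, which is usually the more reliable half):

# nftables sketch: block DoH (443) and DoT (853) toward well-known public resolvers.
table inet doh_policy {
  set doh_resolvers {
    type ipv4_addr
    elements = { 1.1.1.1, 1.0.0.1, 8.8.8.8, 8.8.4.4, 9.9.9.9 }
  }
  chain output {
    type filter hook output priority 0; policy accept;
    ip daddr @doh_resolvers tcp dport { 443, 853 } counter drop
  }
}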
Task 10: Confirm which DNS server answered a query (trace in dig)
cr0x@server:~$ dig git.corp.example +trace +time=2 +tries=1
; <<>> DiG 9.18.24 <<>> git.corp.example +trace +time=2 +tries=1
;; Received 525 bytes from 192.168.1.1#53(192.168.1.1) in 3 ms
What it means: The trace bootstrapped from your local router’s resolver (192.168.1.1) and then walks the public DNS tree, so an internal-only name gets asked about outside your control. That’s a leak, and it will never return the internal answer.
Decision: Fix resolver selection and domain routing. Tracing is useful for proving which path a lookup actually takes, without arguing over UI screenshots.
Task 11: Validate that internal service IPs are routed through VPN (post-DNS)
cr0x@server:~$ ip route get 10.30.4.21
10.30.4.21 dev wg0 src 10.20.0.10 uid 1000
cache
What it means: Traffic to the internal service will go via VPN. DNS might not be your real problem after all.
Decision: If this routes off-tunnel, fix split tunnel prefixes. DNS is fine; routing is not.
Task 12: Check whether the internal resolver can reach upstreams (resolver-side)
cr0x@server:~$ dig @127.0.0.1 example.com +time=2 +tries=1
; <<>> DiG 9.18.24 <<>> @127.0.0.1 example.com +time=2 +tries=1
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 30001
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; Query time: 2000 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
What it means: Local resolver (on a DNS server) is failing to resolve public names. Upstreams blocked, broken, or timed out.
Decision: Fix resolver egress, upstream configuration, or firewall rules. Otherwise VPN users will blame “VPN DNS” even though the issue is your resolver’s upstream path.
Task 13: Check firewall counters for blocked DNS (server-side sanity)
cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
chain input {
type filter hook input priority 0; policy drop;
ct state established,related accept
iifname "wg0" udp dport 53 accept
iifname "wg0" tcp dport 53 accept
counter packets 1203 bytes 90211 drop
}
}
What it means: DNS from VPN interface is allowed; everything else is dropped by default.
Decision: If VPN clients still can’t resolve, it’s not because the DNS server blocks them (at least not at the input chain). Move to service health, routing, or MTU checks.
Task 14: Check resolver performance and cache hit rate (Unbound example)
cr0x@server:~$ sudo unbound-control stats_noreset | egrep 'total.num.queries|total.num.cachehits|total.num.cachemiss|avg'
total.num.queries=145233
total.num.cachehits=109887
total.num.cachemiss=35346
total.requestlist.avg=8.132
What it means: Cache hit ratio is decent; request list average suggests some concurrency but not necessarily overload.
Decision: If cache hits are low and latency high, add more caching close to users, tune prefetching, or increase resolver capacity. If miss rate spikes during incidents, you might be experiencing upstream flakiness.
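If you do need to tune, the usual Unbound levers look like this (starting points, not gospel; size them against query volume and available memory):

# Unbound cache/concurrency tuning sketch
server:
  # Refresh popular records before they expire so users rarely wait on a cold miss.
  prefetch: yes
  # Bigger caches keep more answers warm across client reconnects.
  msg-cache-size: 128m
  rrset-cache-size: 256m
  # Roughly one thread per core, with matching slab counts (powers of two).
  num-threads: 4
  msg-cache-slabs: 4
  rrset-cache-slabs: 4
  infra-cache-slabs: 4
  key-cache-slabs: 4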
Joke #2: Split DNS is like “working from home” policy—everyone agrees in principle, then the edge cases move in rent-free.
Fast diagnosis playbook
This is the “stop guessing” order of operations. Run it like a checklist, not like a debate.
First: confirm what resolver the client is using
- On Linux: resolvectl status (don’t trust /etc/resolv.conf alone).
- On macOS: inspect resolver configuration (you want per-domain resolver routing, not just “DNS servers” in UI).
- On Windows: check interface DNS plus NRPT rules if you use them.
Decision: If the client is not configured to use internal DNS for internal domains, stop. Fix configuration before chasing “network problems.”
Second: confirm the route to the internal DNS server goes through the VPN
- Run ip route get <dns_ip> (Linux) or equivalent tooling.
- If it routes off-tunnel, you have a split-tunnel routing policy failure.
Decision: Fix routes (AllowedIPs, pushed routes, policy routing) before touching DNS servers.
Third: test internal name resolution directly against internal DNS
- dig @internal-dns internalhost.corp.example
- Measure latency, response codes, and returned IPs.
Decision: If direct queries fail, the resolver is unhealthy or unreachable. If direct queries work, the client’s resolver order or per-domain routing is wrong.
Fourth: sniff to confirm where DNS packets go
- tcpdump on the VPN interface and the local interface for port 53.
- Look for internal zones leaking to local resolvers.
Decision: If you see leaks, enforce split DNS or block off-tunnel DNS. If you see timeouts, investigate MTU and packet loss.
Fifth: handle the “DoH bypass” reality
- Decide whether you allow app-level encrypted DNS outside your resolver.
- If not, enforce via endpoint policy and egress controls; otherwise document the behavior and its impact on internal zones.
Common mistakes (symptoms → root cause → fix)
1) “VPN connected, internal sites don’t resolve”
Symptom: internal hostnames return NXDOMAIN or resolve to public IPs.
Root cause: client still uses local network DNS; no conditional routing for internal suffix; NRPT/resolver domains missing.
Fix: implement split DNS: route corp.example to internal DNS via VPN; verify with dig and packet capture that queries go through tunnel.
2) “Works on Ethernet, fails on Wi‑Fi/hotel”
Symptom: intermittent timeouts, especially for larger responses; sometimes only certain records fail.
Root cause: MTU mismatch causing fragmentation drops; UDP response truncation mishandled.
Fix: tune VPN MTU; ensure resolvers support TCP fallback; consider limiting EDNS buffer size if needed.
3) “Only some users leak DNS”
Symptom: inconsistent leaks; users on the same VPN config behave differently.
Root cause: OS differences (systemd-resolved vs legacy resolvconf), multiple active interfaces, resolver fallback behavior.
Fix: standardize client DNS management tooling; test on each OS build you support; explicitly configure per-domain routing and resolver priorities.
4) “Internal resolution works, but apps still can’t connect”
Symptom: DNS returns correct internal IPs; TCP connections fail.
Root cause: split tunnel routes don’t include the service IP ranges; security groups/firewalls block; asymmetric routing.
Fix: fix route advertisement and security policy; validate with ip route get and traceroutes over the tunnel.
5) “Public DNS feels slow when VPN is on”
Symptom: browsing delays; lots of “waiting for DNS” in developer tools.
Root cause: you forced all DNS to internal resolvers in one region; latency and congestion add up; cache misses amplify pain.
Fix: use split DNS for public lookups or deploy regional resolvers accessible via VPN; ensure caching and upstream selection are healthy.
6) “We blocked port 53, but leaks still happen”
Symptom: internal domain lookups still visible externally.
Root cause: DoH/DoT bypass; DNS embedded in apps; resolver uses non-53 ports.
Fix: handle encrypted DNS policy explicitly: endpoint configs, DNS proxying, or controlled DoH endpoints. Don’t pretend port 53 is the whole story.
7) “Everything breaks after enabling DNSSEC validation”
Symptom: SERVFAIL on many domains; intermittent success.
Root cause: upstream path MTU/fragmentation, broken middleboxes, or time skew affecting validation chains.
Fix: validate MTU, allow TCP/53, ensure correct time sync, and consider staged rollout. DNSSEC is not a toggle; it’s a commitment.
Three corporate-world mini-stories (anonymized)
Mini-story 1: An incident caused by a wrong assumption
A mid-sized SaaS company rolled out split-tunnel VPN to reduce bandwidth costs. The VPN pushed internal DNS servers, and the team assumed that meant “DNS is handled.” It passed a basic test: connect from home Wi‑Fi, resolve internal Git, ship the change.
Then Monday happened. Users in coworking spaces reported internal apps failing, but only sometimes. The helpdesk escalated to “VPN unstable.” Network team checked tunnel health: green. Authentication: green. CPU on VPN gateways: fine. Meanwhile, someone noticed that internal queries appeared in the logs of a public DNS provider—because a subset of clients fell back when the internal resolver timed out.
The wrong assumption was subtle: “If a DNS server is configured, queries will go there.” In reality, if the route to that resolver isn’t guaranteed over the VPN, clients try anyway, fail, then quietly use the next resolver. Split tunnel had routes for app subnets, but not for the DNS server subnet. So the DNS server IP was configured, but unreachable from many networks.
The fix was boring: ensure the resolver IPs are always routed through the tunnel, and enforce conditional routing for internal domains. They also added packet captures to their runbook to prove whether queries were leaving the tunnel. The incident ended not with a heroic patch, but with a routing table change and a policy decision to stop relying on “fallback.”
Mini-story 2: An optimization that backfired
A large enterprise wanted “fast DNS everywhere,” so they deployed regional internal resolvers and got fancy with forwarding rules. Public lookups were forwarded to local ISP resolvers (low latency!) while internal lookups went through VPN to internal authoritative servers. On paper, that’s a clean split.
In practice, it created a cache coherence problem and a troubleshooting nightmare. Some client devices had their own caching layers; some used a corporate agent; others used OS stub resolvers. TTLs differed across chains. A record change for an internal service propagated unevenly, and a portion of users kept hitting old IPs even after the service moved.
Then the real backfire: in a subset of regions, the ISP resolvers did aggressive filtering and occasionally responded oddly to uncommon DNS record types. That didn’t matter for “normal browsing” but broke a few security products and developer workflows that relied on specific DNS behavior. Suddenly “DNS optimization” became “developers can’t build,” and the network team got dragged into debugging third-party resolver quirks.
The eventual fix was to stop outsourcing the public half of DNS behavior to random ISP resolvers. They kept regional internal resolvers, but forwarded public queries to a consistent, controlled upstream set (still regional) with predictable behavior and observability. Performance remained good. The pager load dropped. The optimization stayed, but the uncontrolled variable was removed.
Mini-story 3: A boring but correct practice that saved the day
A smaller fintech ran two internal recursive resolvers per region, reachable over VPN, and they treated DNS like any other tier: monitoring, alerts, change control, and capacity planning. Nothing exotic. Just discipline.
One day, a cloud provider had a partial networking event that caused intermittent packet loss between the VPN ingress and one resolver node. End users saw “some internal names fail.” It smelled like application issues, then like VPN issues, then like “maybe DNS.” Classic whodunit.
The reason they didn’t spiral: they had per-resolver latency metrics and query failure counters, and they logged which resolver answered each query (at least for the corporate stub forwarders). The on-call could see that one resolver’s query timeouts spiked while the other stayed normal. They pulled the bad node out of service and watched error rates collapse.
No magic. Just health-checked resolvers, redundancy, and observability. The incident report was short, which is the real sign of maturity.
Checklists / step-by-step plan
Step 1: Decide your DNS policy (write it down)
- Are internal domain queries allowed to leave the device off-tunnel? (Usually: no.)
- Are public domain queries allowed to use local network DNS? (Depends: performance vs compliance.)
- Do you allow DoH/DoT directly from clients? If yes, for which apps and which endpoints?
- Do you require logging for DNS queries? If yes, where and for how long?
Step 2: Define namespaces and routing boundaries
- List internal DNS zones: corp.example, internal.example, cloud private zones, Kubernetes clusters.
- List resolver IPs and ensure they are reachable via VPN routes.
- List internal service CIDRs that must route via VPN (not just the DNS subnet).
Step 3: Implement split DNS explicitly
- Linux: configure per-link domains (e.g., ~corp.example) and DNS servers on the VPN link.
- Windows: use NRPT for internal namespaces if you have a managed fleet.
- macOS: ensure per-domain resolver entries exist for internal zones.
Step 4: Stop the leak paths you actually have
- Block off-tunnel UDP/TCP 53 if policy requires no external DNS.
- Handle DoH/DoT with policy and enforcement; don’t rely on wishful thinking.
- Prevent fallback resolvers from capturing internal namespaces.
Step 5: Make DNS resilient and observable
- At least two resolvers per region or per VPN POP.
- Health checks that reflect real resolution, not just “port 53 open” (a minimal probe sketch follows this list).
- Logs or metrics: query rate, SERVFAIL, NXDOMAIN rate, latency distribution, cache hit rate.
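A health check that reflects real resolution can be as small as this sketch (the probe name healthcheck.corp.example is hypothetical; use a record you actually guarantee exists, and wire the exit code into your monitoring):

#!/bin/sh
# Minimal resolver health probe: succeed only if the resolver answers a known name quickly.
RESOLVER="${1:-10.20.0.53}"
PROBE_NAME="healthcheck.corp.example"

ANSWER=$(dig +time=2 +tries=1 "@${RESOLVER}" "${PROBE_NAME}" A +short) || exit 1
# Empty output means NXDOMAIN, SERVFAIL, or an otherwise useless answer: treat it as unhealthy.
[ -n "$ANSWER" ] || exit 1
echo "ok: ${RESOLVER} answered ${PROBE_NAME} -> ${ANSWER}"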
Step 6: Test from hostile networks
- Hotel Wi‑Fi, coffee shop, mobile hotspot, captive portals.
- IPv6 on/off scenarios (some leaks happen via IPv6 paths you forgot exist).
- Clients with multiple interfaces active (Wi‑Fi + Ethernet + virtual adapters).
Step 7: Ship a runbook that matches reality
- Include the commands in the “Practical tasks” section.
- Include a definition of “DNS leak” for your org (internal-only or all domains).
- Include escalation boundaries: endpoint team vs network team vs DNS team.
FAQ
1) What exactly counts as a DNS leak?
A DNS leak is any DNS query that goes to a resolver outside your intended trust boundary. For many orgs, leaks are specifically internal namespaces (like corp.example) going to public or local resolvers. In stricter environments, any DNS off-tunnel is a leak.
2) Why does split tunneling make DNS harder?
Because DNS has to match traffic intent. If the VPN only routes some subnets, but DNS is configured to use an internal resolver, that resolver must be reachable through the tunnel. Otherwise, clients fail over to other resolvers, often outside the VPN.
3) If I push DNS servers via the VPN, isn’t that enough?
No. You must ensure routing to those DNS servers goes through the VPN, and you must control resolver selection rules so internal namespaces don’t get answered by fallback resolvers. Configuration is not enforcement.
4) Should we run our own recursive resolvers or just forward to public DNS?
If you care about consistent behavior, observability, and internal zones, run your own recursive resolvers (or a managed resolver you control) and forward upstream predictably. Forwarding to random local resolvers is a reliability gamble disguised as cost savings.
5) What about encrypted DNS (DoH/DoT)?
DoH/DoT protects DNS privacy on the wire, but it can bypass your VPN’s DNS routing. Decide policy: allow it with guardrails, or block/redirect it with endpoint and network enforcement. Pretending it doesn’t exist is how leaks survive “DNS blocking.”
6) How do we handle internal DNS when users are on IPv6 networks?
Make sure your VPN and DNS plan covers IPv6 explicitly: resolver reachability, routes, and record responses (AAAA). If you only handle IPv4, some clients will still resolve and connect over IPv6 off-tunnel, and you’ll chase ghosts.
7) Why do some internal lookups work and others time out?
Common causes: MTU issues dropping larger responses, packet loss on the VPN path, overloaded resolvers, or resolver fallback behavior. Validate with tcpdump, MTU testing, and direct dig against the internal resolver.
8) Can we “just block port 53” to stop leaks?
Blocking 53 helps, but it’s not sufficient. Apps can use DoH over 443, and some environments use DNS on nonstandard ports. Also, blocking 53 without providing a working on-tunnel resolver turns “leak prevention” into “internet outage.”
9) What’s the simplest safe approach for a mixed OS fleet?
Use split DNS with conditional forwarding, plus guaranteed routes to internal resolvers. Standardize client configuration via MDM/GPO where possible. Add monitoring on resolvers. Then test from hostile networks.
10) How do we prove we fixed the leak?
Capture packets on the client: verify internal zone queries appear on the VPN interface and do not appear on the local interface. Also log queries on the internal resolvers and correlate. Proof beats screenshots.
Practical next steps
- Pick a policy. Decide what “leak” means for you and whether public DNS should go through VPN.
- Make resolver reachability non-negotiable. Ensure DNS server IPs route through the VPN even in split tunnel mode.
- Implement split DNS deliberately. Conditional routing for internal namespaces, not just “set DNS server.”
- Observe it. Add resolver health checks, latency metrics, and enough logging to answer “which resolver answered.”
- Test in the real world. Hotels, captive portals, IPv6 networks, and devices with multiple interfaces—because your users live there.
If you do these five things, DNS stops being a weekly mystery and becomes what it should have been all along: boring infrastructure that quietly does its job.