You deploy a VPN, you set up split-horizon DNS, and everything works—until it doesn’t. One Monday morning your ticket queue fills with
“can’t reach intranet,” “SSO loop,” and the evergreen classic: “it was working Friday.” The network didn’t change. The app didn’t change.
DNS changed. Quietly. On endpoints.
DNS over HTTPS (DoH) and DNS over TLS (DoT) are real privacy upgrades. They also short-circuit the basic assumption behind
split-horizon DNS: that clients ask your resolver. When clients bypass your resolver, your carefully designed internal zones turn into
internet rumors. This is fixable—but only if you treat encrypted DNS as a first-class part of your architecture, not a browser setting you’ll “deal with later.”
What actually breaks: split-horizon DNS vs encrypted DNS
Split-horizon DNS is an old, boring trick that works because the client’s resolver choice is predictable. You publish different answers
for the same name depending on where the query comes from. Inside the network, app.corp.example resolves to an RFC1918 address.
Outside, it resolves to a public address or NXDOMAIN. It’s not “security” by itself, but it’s absolutely a mechanism that many security
controls and routing designs lean on.
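As a concrete sketch of the mechanism, BIND expresses split-horizon as views. This fragment is illustrative only; the ACL ranges, zone names, and file paths are placeholders:

```
// Illustrative BIND 9 split-horizon views (placeholder names and paths).
acl internal { 10.0.0.0/8; 172.16.0.0/12; 192.168.0.0/16; };

view "inside" {
    match-clients { internal; };
    zone "corp.example" {
        type primary;
        file "zones/corp.example.internal";  // app.corp.example -> 10.x.y.z
    };
};

view "outside" {
    match-clients { any; };
    zone "corp.example" {
        type primary;
        file "zones/corp.example.public";    // public records, or none at all
    };
};
```

BIND checks views in order and answers from the first match-clients hit, which is exactly why the client's choice of resolver decides which version of reality it gets.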
Encrypted DNS changes the resolver relationship. With DoH/DoT, the DNS traffic is wrapped in TLS, and—more importantly—clients can
choose resolvers outside your control. That means:
- Your internal zones might never be consulted.
- Your conditional forwarders might never run.
- Your DNS-based access controls (allow/block lists, sinkholes, security telemetry) might see nothing.
- Your VPN’s “push these DNS servers and these search domains” becomes more like a suggestion.
The common misconception is “DoH/DoT is just encryption; it doesn’t change DNS.” In practice, it changes who answers. That’s the break.
The privacy win and the operational loss are the same event.
DoH vs DoT in operational terms
DoT typically runs on TCP/853 to a resolver. It looks like “a DNS protocol with TLS.” It’s often configured at the OS level
(Android “Private DNS” is a famous example). It’s easier to identify by port, but not always easy to intercept responsibly.
DoH runs over HTTPS (TCP/443) and uses an HTTP request/response format. It's "DNS disguised as web traffic," and that's
not even an insult—it's exactly the point. It blends in, rides existing HTTPS infrastructure (CDNs, proxies, HTTP/2), and slips past
many network controls because it looks like every other web request.
The split-horizon failure usually comes from one of these patterns:
- Browser DoH overrides OS DNS. Your OS resolver points at the VPN DNS, but the browser ships queries to a public DoH endpoint.
- OS-level DoT is configured to a public resolver. The endpoint is “securely wrong” all day long.
- Multiple configured resolvers race. The client queries more than one; the faster public resolver wins and caches the wrong (or negative) answer.
- Captive portal / intercept infrastructure creates edge cases. The client sees TLS failures, falls back, and now you have nondeterminism.
One short joke, because we’ve earned it: Encrypted DNS is like whispering your question into a megaphone pointed at someone else’s help desk.
Private? Sure. Useful? Depends who answers.
Facts and history you can use in arguments with security and product
These aren’t trivia-night facts. They’re the kind of context you use to make a decision in a change review without needing a two-hour debate.
- DNS privacy problems predate DoH/DoT by decades. “DNS is plaintext” was always true; what changed is that browsers and OSes started acting on it.
- DoT (RFC 7858) landed before DoH (RFC 8484). The industry’s first standard answer was “TLS on a DNS-specific port,” not “HTTP everything.”
- Browsers popularized resolver choice as a product feature. That shifted resolver selection away from network operators to application vendors.
- Split-horizon DNS is older than most VPN products people use today. The technique predates “zero trust” branding by a wide margin.
- Enterprise DNS isn’t just name-to-IP. It’s often the distribution plane for service discovery, SRV records, and internal certificate validation flows.
- DNSSEC doesn’t encrypt. It authenticates answers, but it doesn’t hide queries or prevent on-path observation of names.
- EDNS Client Subnet (ECS) made privacy worse for some users. It can leak client network information to authoritative servers to improve CDN routing.
- Some platforms implement “opportunistic” encrypted DNS. If the resolver supports it, it’s used; if not, it silently falls back. That can create environment-dependent behavior.
- Resolvers are not equal. Different recursive resolvers have different cache policies, filtering behaviors, and negative caching. You can see different answers for the same name.
The operational takeaway: DoH/DoT wasn’t invented to annoy enterprises. It was invented because plaintext DNS is a liability. Your job is to
keep that privacy improvement without letting endpoint software randomly choose your control plane.
Failure modes: how it shows up in production
1) Internal names resolve to public IPs (or not at all)
The classic split-horizon break: jira.corp.example resolves to something public, or NXDOMAIN, because the public resolver doesn’t know your internal zone.
Your internal DNS would have returned 10.30.4.17. Instead you get nothing—or worse, a public CDN hostname you didn’t intend.
This hits VPN users hardest because they expect “internal names work when VPN is up.” But if the client’s resolver goes around the VPN,
the VPN is just a fancy router with no idea what name you typed.
2) SSO and certificate validation go sideways
Internal apps often rely on internal CAs, internal OCSP responders, internal IdPs, and service endpoints whose names only exist internally.
If the endpoint asks an outside resolver and gets NXDOMAIN, you see:
- SSO redirect loops (the IdP hostname doesn’t resolve)
- “Can’t validate certificate” (OCSP/CRL endpoints not reachable)
- Weird partial failures where the app loads but authentication fails
3) DNS-based security controls lose visibility
If your security program depends on DNS logs, blocking, sinkholing, or anomaly detection, DoH can punch a hole through it. Not because encryption is evil,
but because the query never hits your resolver. The first time you learn this will be during an incident, which is a terrible time to learn anything.
4) “It only breaks on some laptops” (the worst kind of break)
Endpoint diversity turns this into a nightmare. One user has Android Private DNS set to a public resolver. Another uses a browser with DoH enabled.
Another is on a managed build where you disabled DoH, but their VPN client has its own DNS stack. You get a blend of behaviors that look like a flaky network.
5) Kubernetes / service mesh adds a second layer of confusion
In clusters, you already have CoreDNS, stub domains, and in-cluster service discovery. If nodes or pods start using external DoH/DoT, you can get
resolution that differs between the host and the pod, or between pods on different nodes. Debugging “DNS” becomes debugging three DNS stacks at once.
Fast diagnosis playbook (check 1/2/3)
When internal names break, don’t start with packet captures. Start by proving which resolver is answering and whether the client is bypassing your intended path.
The goal is to reduce the problem to: wrong resolver, right resolver but wrong answer, or right answer but traffic can’t reach it.
Check 1: Identify the resolver actually in use
- On the client, check configured DNS servers (VPN interface vs Wi-Fi interface).
- Check whether the browser or OS is using DoH/DoT to an external resolver.
- Verify with a query that explicitly targets your internal resolver and compare.
Check 2: Compare internal vs external answers
- Query internal resolver for the failing name.
- Query a known public resolver for the same name.
- If the answers differ, you have split-horizon working—and the client is choosing the wrong horizon.
Check 3: Validate the path and policy
- If the client is correctly using internal DNS, check forwarding, views, and ACLs on the resolver.
- Check firewall rules: is the client allowed to reach internal DNS on 53/udp and 53/tcp?
- Check for DNS interception on guest networks or “security” appliances that rewrite DNS.
Only after these three checks do you start digging into caches, negative TTLs, EDNS behavior, and the really fun stuff.
Practical tasks: commands, outputs, and the decision you make
These are the tasks I actually run in incidents. Each one has: command, sample output, what it means, and the decision it drives.
Adjust hostnames and IPs to your environment.
Task 1: See which DNS servers systemd-resolved thinks you’re using (Linux)
cr0x@server:~$ resolvectl status
Global
Protocols: +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.20.0.53
DNS Servers: 10.20.0.53 1.1.1.1
DNS Domain: corp.example
Link 2 (wg0)
Current Scopes: DNS
Protocols: +DefaultRoute
Current DNS Server: 10.20.0.53
DNS Servers: 10.20.0.53
DNS Domain: corp.example
Meaning: You have both an internal resolver (10.20.0.53) and a public resolver configured (1.1.1.1).
Even if wg0 looks correct, the global 1.1.1.1 entry means queries can fall back to (or race against) the public resolver.
Decision: Remove public resolvers from managed clients when split-horizon is required, or enforce per-domain routing with a local stub resolver.
Task 2: Check if systemd-resolved is using DNS-over-TLS
cr0x@server:~$ resolvectl dnsovertls
Global setting: no
Link 2 (wg0): no
Link 3 (wlp0s20f3): no
Meaning: DoT isn’t enabled here. If you still see split-horizon breaks, look for browser DoH or a third-party agent.
Decision: Shift focus to browser settings, endpoint security software, or Android/iOS Private DNS features.
Task 3: Compare answers from internal vs public resolvers (dig)
cr0x@server:~$ dig +short jira.corp.example @10.20.0.53
10.30.4.17
cr0x@server:~$ dig +short jira.corp.example @1.1.1.1
Meaning: Internal resolver returns the private IP; public resolver returns nothing (NXDOMAIN or empty due to policy).
Decision: This is not an “app down” incident. It’s a resolver selection problem. Fix client resolver routing, not the app.
Task 4: Confirm whether a browser is doing DoH by observing connections (Linux)
cr0x@server:~$ sudo ss -tpn | grep -E ':443' | head
ESTAB 0 0 192.168.1.50:52114 104.16.248.249:443 users:(("firefox",pid=2148,fd=91))
ESTAB 0 0 192.168.1.50:52122 8.8.8.8:443 users:(("firefox",pid=2148,fd=93))
Meaning: The browser is talking to public IPs on 443. That’s normal for web. The question is whether any of those are DoH endpoints.
If you suspect DoH to a specific provider, correlate with known endpoint IPs from your own allowlist/intel, or inspect SNI/HTTP paths in a controlled environment.
Decision: If this is a managed enterprise, disable browser DoH via policy or point it at your internal DoH service.
Task 5: Confirm DNS queries are hitting your resolver (tcpdump on resolver)
cr0x@server:~$ sudo tcpdump -ni eth0 port 53 -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:41:11.225318 IP 10.20.14.22.52433 > 10.20.0.53.53: 12345+ A? jira.corp.example. (35)
12:41:11.225902 IP 10.20.0.53.53 > 10.20.14.22.52433: 12345* 1/0/0 A 10.30.4.17 (51)
Meaning: At least one client is using your resolver for that name. If the user still reports failure, the issue might be caching on the client,
a different resolver being used intermittently, or routing/firewall to the returned IP.
Decision: If you see no queries during a user test, the client is bypassing your resolver (DoH/DoT or different DNS server).
Task 6: Check for DoT traffic (TCP/853) leaving a client subnet (firewall or host)
cr0x@server:~$ sudo tcpdump -ni eth0 tcp port 853 -c 3
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:42:09.114220 IP 10.20.14.22.48932 > 9.9.9.9.853: Flags [S], seq 366451, win 64240, options [mss 1460,sackOK,TS val 248490 ecr 0,nop,wscale 7], length 0
Meaning: That client is attempting DoT to a public resolver. If your split-horizon depends on internal DNS, this is a direct cause.
Decision: Block outbound 853 from corporate networks (with exceptions for your own DoT service), and enforce endpoint policy.
Task 7: Inspect Windows DNS server selection (PowerShell)
cr0x@server:~$ powershell.exe -NoProfile -Command "Get-DnsClientServerAddress -AddressFamily IPv4 | Format-Table -AutoSize"
InterfaceAlias ServerAddresses
-------------- --------------
Wi-Fi {192.168.1.1}
CorpVPN {10.20.0.53, 10.20.0.54}
Meaning: Wi-Fi has its own DNS (often the home router), and VPN has internal DNS. Windows typically prefers the VPN interface for DNS when configured right,
but metrics and NRPT rules matter.
Decision: Confirm interface metrics and add name resolution policy rules so corp.example always uses the VPN resolver.
Task 8: Check Windows Name Resolution Policy Table (NRPT)
cr0x@server:~$ powershell.exe -NoProfile -Command "Get-DnsClientNrptRule | Select-Object Namespace,NameServers,DirectAccess | Format-List"
Namespace : .corp.example
NameServers : {10.20.0.53, 10.20.0.54}
DirectAccess : False
Meaning: Queries for *.corp.example should go to the internal resolvers regardless of general DNS configuration.
Decision: If missing, add NRPT (via GPO/MDM). It’s one of the cleanest ways to preserve split-horizon in a Windows fleet.
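The rule itself is small. An illustrative PowerShell sketch (addresses are placeholders; in practice you would push this via GPO or MDM rather than run it per host):

```powershell
# Illustrative: pin *.corp.example to the internal resolvers via NRPT.
Add-DnsClientNrptRule -Namespace ".corp.example" `
    -NameServers "10.20.0.53", "10.20.0.54"

# Verify the rule is present
Get-DnsClientNrptRule | Select-Object Namespace, NameServers
```

NRPT matches by namespace before general interface DNS settings apply, which is why it survives flaky interface metrics.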
Task 9: Confirm macOS resolver domains and order
cr0x@server:~$ scutil --dns | sed -n '1,80p'
DNS configuration
resolver #1
search domain[0] : corp.example
nameserver[0] : 10.20.0.53
nameserver[1] : 10.20.0.54
if_index : 12 (utun3)
flags : Request A records
reach : 0x00000002 (Reachable)
resolver #2
nameserver[0] : 192.168.1.1
if_index : 6 (en0)
reach : 0x00000002 (Reachable)
Meaning: VPN resolver is scoped to corp.example. That’s good. If users still fail, suspect application-level DoH or a resolver not honoring scoped domains.
Decision: Keep scoped resolvers; push a managed DoH profile if you need encrypted DNS without losing split-horizon.
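On managed Apple devices, a "managed DoH profile" is a configuration profile carrying the com.apple.dnsSettings.managed payload. A minimal illustrative fragment, where doh.corp.example is a placeholder for your own endpoint:

```xml
<!-- Illustrative payload fragment; doh.corp.example is a placeholder. -->
<key>PayloadType</key>
<string>com.apple.dnsSettings.managed</string>
<key>DNSSettings</key>
<dict>
    <key>DNSProtocol</key>
    <string>HTTPS</string>
    <key>ServerURL</key>
    <string>https://doh.corp.example/dns-query</string>
</dict>
```

Delivered via MDM, this keeps encrypted DNS pointed at infrastructure you operate instead of whatever the user or app picked.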
Task 10: Test whether the internal resolver can recurse and forward correctly
cr0x@server:~$ dig +nocmd +noall +answer www.example.net @10.20.0.53
www.example.net. 300 IN A 93.184.216.34
Meaning: Internal resolver can reach upstreams and answer internet names. If it can’t, clients may “helpfully” switch to public resolvers.
Decision: Fix internal resolver egress and latency. If internal DNS is slow or flaky, DoH adoption becomes an outage multiplier.
Task 11: Check BIND view/ACL logic (named-checkconf)
cr0x@server:~$ sudo named-checkconf -z /etc/bind/named.conf
zone corp.example/IN: loaded serial 2026020401
zone 0.20.10.in-addr.arpa/IN: loaded serial 2026020401
zone example.com/IN: loaded serial 2026011502
Meaning: Config parses and zones load. This doesn’t prove the right view matches the client, but it eliminates “fat-fingered config” as the immediate cause.
Decision: If zones load but clients get wrong answers, validate view match by source IP and test from multiple subnets.
Task 12: Validate Unbound forwarding and local-zone behavior
cr0x@server:~$ sudo unbound-control status
version: 1.17.1
verbosity: 1
threads: 4
modules: 2 [ validator iterator ]
uptime: 86400 seconds
options: control(yes)
cr0x@server:~$ sudo unbound-control list_local_zones | grep corp.example
corp.example. transparent
Meaning: Unbound is up and has a local-zone entry for the internal domain. “transparent” means it will resolve using local-data if present, otherwise recurse/forward.
Decision: If you require strict internal-only behavior, prefer “static” or use explicit local-data and block leakage to public recursion.
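If Unbound is the client-facing resolver, one sketch of "internal zone to internal servers, everything else recurses" looks like this; addresses are placeholders, and adjust the DNSSEC line if your internal zone is signed:

```
# Illustrative unbound.conf fragment (placeholder addresses).
server:
    # permit RFC1918 answers for the internal zone (rebind-protection exception)
    private-domain: "corp.example"
    # if corp.example is unsigned internally, skip validation for it
    domain-insecure: "corp.example"

# send all corp.example queries to the internal servers instead of recursing
forward-zone:
    name: "corp.example"
    forward-addr: 10.20.0.53
    forward-addr: 10.20.0.54
```

Without the forward-zone, a transparent local-zone will happily recurse internal names out to the public internet, which is the leak you're trying to prevent.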
Task 13: Spot negative caching hurting you (dig +trace and SOA TTL)
cr0x@server:~$ dig jira.corp.example @1.1.1.1 +noall +authority
corp.example. 900 IN SOA ns1.public-dns.example. hostmaster.public-dns.example. 1 7200 900 1209600 900
Meaning: If a public resolver returns an SOA in the authority section for NXDOMAIN, the negative caching TTL can cause “it keeps failing even after we fixed it.”
Decision: Flush caches on the client/resolver where possible, and reduce negative TTLs on public-facing zones where appropriate.
Task 14: Detect clients bypassing DNS by checking resolver logs (example: BIND querylog)
cr0x@server:~$ sudo tail -n 3 /var/log/named/query.log
04-Feb-2026 12:44:11.112 client @0x7f0a3c 10.20.14.22#52433 (jira.corp.example): query: jira.corp.example IN A +E(0)K (10.20.0.53)
04-Feb-2026 12:44:11.114 client @0x7f0a3c 10.20.14.22#52434 (ocsp.corp.example): query: ocsp.corp.example IN A +E(0)K (10.20.0.53)
04-Feb-2026 12:44:11.116 client @0x7f0a3d 10.20.14.23#40102 (idp.corp.example): query: idp.corp.example IN A +E(0)K (10.20.0.53)
Meaning: You can confirm which clients actually query internal DNS. If a user reproduces a failure and you see no corresponding log lines,
their device is not asking you. That’s the key fact you need for a policy conversation.
Decision: Treat bypass as a managed endpoint compliance issue, not as “DNS server tuning.”
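BIND's querylog format varies slightly across versions, but the querying client always appears as an ip#port token. A small hypothetical helper (count_clients is not a BIND tool, just a sketch) to turn the log into "who is actually asking us" during a user test:

```shell
# Count queries per client IP in a BIND query log.
# Scans each line for the "ip#port" client token so minor
# format differences between BIND versions don't break it.
count_clients() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^[0-9.]+#[0-9]+$/) {   # client address#port
        split($i, a, "#")
        n[a[1]]++
        next
      }
    }
  }
  END { for (c in n) print n[c], c }' "$@" | sort -rn
}
```

Run it as `count_clients /var/log/named/query.log` while the user reproduces the failure; if their IP never shows up, their device is not asking you.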
The fix: keep privacy, keep split-horizon
The right fix depends on your threat model and your tolerance for endpoint autonomy. But the wrong fix is always the same:
pretending encrypted DNS won’t happen. It already has.
Here’s the design goal: clients should use encrypted DNS to resolvers you operate or explicitly trust, and they should use
your internal resolver for internal domains every time. You can accomplish that with policy, architecture, or a mix.
Pattern A: Run your own encrypted DNS endpoints (recommended)
If you want the privacy win and you run a serious network, you should provide an encrypted DNS service:
DoT on 853 and/or DoH on 443. Put it close to users, anycast it if you can, and log responsibly.
Then you configure endpoints (via MDM/GPO) to use your encrypted DNS service, not a random public one.
For split-horizon, your resolver must have the same view logic as your existing internal DNS, or it must forward internal zones to the right place.
Operational detail that matters: you need capacity planning and latency monitoring. If your encrypted DNS endpoint is slower than public resolvers,
clients and apps will “helpfully” switch. It’s not malice; it’s user experience.
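One low-friction way to build this is a TLS-terminating proxy in front of the resolvers you already run. A dnsdist-style sketch, where certificate paths and backend addresses are placeholders:

```lua
-- Illustrative dnsdist config: terminate DoT/DoH at the edge,
-- forward to the existing split-horizon resolvers (placeholder values).

-- DoT listener on TCP/853
addTLSLocal("0.0.0.0:853", "/etc/dnsdist/fullchain.pem", "/etc/dnsdist/privkey.pem")

-- DoH listener on TCP/443, standard /dns-query path
addDOHLocal("0.0.0.0:443", "/etc/dnsdist/fullchain.pem", "/etc/dnsdist/privkey.pem", "/dns-query")

-- Backends: the internal resolvers keep all the view/forwarding logic
newServer({address="10.20.0.53:53", name="internal-1"})
newServer({address="10.20.0.54:53", name="internal-2"})
```

The point of this shape is that encryption terminates on infrastructure you operate, so split-horizon logic stays where it already lives.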
Pattern B: Enforce per-domain routing (NRPT / scoped resolvers / split DNS)
If you can’t centralize resolver choice completely, enforce at least the internal domain routing:
*.corp.example, *.svc.cluster.local, and other internal namespaces must go to internal resolvers.
This is the minimum viable split-horizon safety rail.
- Windows: NRPT rules are your friend. They're boring, which is why they work.
- macOS: scoped resolvers via VPN profiles; verify with scutil --dns.
- Linux: systemd-resolved supports per-link domains; NetworkManager can push those via VPN.
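On Linux, the per-link routing can be pinned in configuration. An illustrative systemd-networkd fragment for a WireGuard link (resolver addresses are placeholders; a NetworkManager VPN profile can express the same thing):

```ini
# /etc/systemd/network/50-wg0.network (illustrative)
[Match]
Name=wg0

[Network]
DNS=10.20.0.53
DNS=10.20.0.54
# "~" makes this a routing-only domain: corp.example queries use this link's DNS
Domains=~corp.example
# don't use this link's DNS for unrelated (non-corp) queries
DNSDefaultRoute=false
```

This is the systemd-resolved equivalent of an NRPT rule: internal names go to internal resolvers no matter what the default DNS path looks like.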
Pattern C: Block what you must, but don’t pretend it’s “solved”
Blocking outbound TCP/853 reduces DoT bypass. Blocking DoH is harder because it rides HTTPS.
You can block known DoH endpoints, but endpoint lists change and CDNs move.
If you choose to block DoH at the network, do it with eyes open:
- You will play whack-a-mole unless you use endpoint management too.
- You might break legitimate HTTPS services if you go too broad.
- Some clients will fall back to plaintext DNS. That can be acceptable or unacceptable depending on your policy.
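The DoT half of that is straightforward to express at the network edge. An illustrative nftables fragment that exempts corporate resolvers and logs everything else on 853 (addresses and chain layout are placeholders for your ruleset):

```
# Illustrative nftables ruleset fragment (placeholder addresses).
table inet filter {
  chain forward {
    type filter hook forward priority 0; policy accept;
    # allow DoT to the corporate encrypted-DNS service
    ip daddr { 10.20.0.53, 10.20.0.54 } tcp dport 853 accept
    # log and drop all other outbound DoT attempts
    tcp dport 853 log prefix "dot-bypass " drop
  }
}
```

The log prefix matters: those hits are your inventory of noncompliant endpoints, not just blocked packets.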
Pattern D: Stop using internal-only names that collide with the public DNS
A big chunk of pain comes from internal namespaces that are not cleanly separated. If you use names that collide with real public domains,
you create ambiguity even without DoH. Encrypted DNS just makes the ambiguity surface faster.
Use a subdomain you control publicly (like corp.example) for internal use. Avoid “shortcut” internal TLDs and avoid naming that relies
on search suffix magic for critical systems. If you must keep legacy names, treat them as technical debt and schedule the migration.
Pattern E: Make internal DNS reliable enough that users don’t seek alternatives
This is the unglamorous truth: people enable DoH/DoT because DNS is a privacy problem and because their ISP DNS is often slow or sketchy.
If your internal DNS is slow, flaky, or blocked by your own firewall rules, users will route around it.
Reliability basics:
- Redundant resolvers in each major site/region.
- Clear forwarding policy for internal zones and for internet recursion.
- Monitoring: latency, SERVFAIL rate, NXDOMAIN rate, top QNAMEs, upstream health.
- Strict change control for zone delegations and split-horizon records.
One quote, because it's still true in DNS land. Werner Vogels, paraphrased: "Everything fails all the time; design so systems keep working anyway."
Second short joke, then we get back to work: Split-horizon DNS is a lot like corporate org charts—everyone thinks they know who answers, until they ask the wrong person.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company rolled out a new VPN client with “secure DNS” in the marketing copy. The network team assumed this meant “it uses our DNS servers,
but encrypted.” They approved it quickly because, frankly, they were tired of plaintext DNS arguments and wanted the win.
Two weeks later, developers started reporting that internal Git over HTTPS was intermittently failing with certificate errors.
The hostname resolved sometimes to an internal IP, sometimes to a public IP belonging to a completely different service that happened to share a similar name.
The browser would show a cert mismatch, users would click through in some cases (because of course they would), and now you’ve got an incident with a side of bad habits.
The root issue wasn’t TLS. It wasn’t the Git service. It was resolver choice. The VPN client’s “secure DNS” feature enabled DoH to a third-party resolver by default.
When the VPN was up, internal DNS still worked for some apps. But browsers with DoH consistently asked the wrong resolver, and the failures looked random because different apps used different resolver stacks.
The fix was painfully simple and politically annoying: they disabled the VPN client’s third-party DoH and published an internal DoH endpoint.
Then they wrote a one-page policy: “Corporate devices use corporate resolvers. Personal devices on guest Wi-Fi can do what they want.”
The real lesson was also simple: never approve “secure DNS” without asking “secure to which resolver?”
Mini-story 2: The optimization that backfired
Another organization tried to improve performance for remote workers by pushing a public resolver as a secondary DNS server.
The internal resolver was reachable over the VPN, but latency was higher from some regions. The thinking was: “If internal is slow, public will be fast, and only internal names need internal DNS anyway.”
That sentence contains the seed of the outage.
The first sign of trouble was a spike in helpdesk tickets: “Some internal tools are down, but only for remote people.”
Engineers ran quick tests and saw that internal resolvers were fine. Apps were up. VPN was up. Yet name resolution was inconsistent.
Some clients resolved api.corp.example to an internal IP, others got NXDOMAIN.
The backfire came from resolver race behavior and caching. Some clients queried both resolvers; the public one responded faster with NXDOMAIN,
which got cached. Later, even when the internal resolver would have answered correctly, the client didn’t ask—it trusted the negative cache.
A performance “optimization” became a correctness failure.
The fix was to remove public resolvers from VPN configuration and instead deploy regional internal resolvers (or forwarders) close to users.
They also reviewed negative caching TTLs where they had control. The new rule: never add a “backup resolver” that can’t answer your internal zones.
That’s not a backup. That’s a fork in reality.
Mini-story 3: The boring but correct practice that saved the day
A large enterprise had already standardized on a dedicated internal namespace (corp.example) and had documented conditional forwarding rules
between their on-prem DNS and their cloud private zones. They also had endpoint policies disabling browser-managed DoH unless it pointed to the corporate DoH service.
It was dull. It was also effective.
During a broader internet incident affecting a popular public resolver network, they saw fewer problems than peers.
Their corporate devices weren’t depending on external DNS for internal names, and their internal resolvers had redundant upstreams for internet recursion.
Internal apps stayed reachable. VPN users kept working. The helpdesk had a quiet day, which is the closest thing ops people get to a trophy.
The moment that mattered happened in the war room: someone suggested “just tell people to use this public DNS temporarily.”
The DNS team refused—not because they were stubborn, but because they’d already played that movie and didn’t like the ending.
Instead, they increased capacity on internal resolvers and temporarily tightened egress rules that were allowing unmanaged DoT.
The postmortem was short and almost boring: the boring practice (stable namespaces, documented forwarding, endpoint policy) reduced blast radius.
Not everything needs an innovative solution. Sometimes you just need a policy that survives reality.
Common mistakes: symptom → root cause → fix
1) Symptom: “VPN is connected but internal domains don’t resolve”
Root cause: Browser DoH or OS Private DNS bypasses the VPN-provided resolver.
Fix: Enforce endpoint policy: disable third-party DoH/DoT or point it to corporate encrypted DNS; add per-domain rules for corp.example.
2) Symptom: “It works for ping, fails in browser”
Root cause: Different resolver stacks. CLI tools use OS DNS; browser uses DoH.
Fix: Align resolver policy. In managed environments, centrally configure browser DoH. In unmanaged, publish guidance and make internal apps accessible via public DNS + auth if appropriate.
3) Symptom: “Some users get NXDOMAIN for internal names after a change”
Root cause: Negative caching from a public resolver or an internal forwarder returning NXDOMAIN with a long TTL.
Fix: Flush caches where possible; reduce negative TTLs in zones you manage; stop sending internal names to resolvers that can’t answer them.
4) Symptom: “DNS logs show nothing during user tests”
Root cause: Bypass via DoH/DoT, hardcoded resolvers, or local agent.
Fix: Identify the bypass: block outbound 853 where policy allows; enforce DoH settings; use endpoint compliance checks to detect noncompliant DNS settings.
5) Symptom: “Internal names sometimes resolve to public IPs”
Root cause: Split-horizon exists but client is using the wrong horizon intermittently (multi-homed DNS, race behavior).
Fix: Eliminate public resolvers from corporate configs; use scoped domains/NRPT; ensure internal resolvers are low latency and highly available.
6) Symptom: “Security team sees drop in DNS telemetry”
Root cause: DoH to third-party resolvers; DNS filtering moved off your path.
Fix: Provide corporate DoH/DoT with logging and policy; enforce resolver use via MDM/GPO; consider application-layer controls that don’t assume DNS visibility.
7) Symptom: “Kubernetes pods can’t resolve internal zones but nodes can”
Root cause: Pod DNS config or node-local DNS cache forwarding to external resolvers; stub domains missing.
Fix: Configure CoreDNS stubDomains/forward for internal zones; ensure node resolvers don’t use public DoH/DoT for internal namespaces.
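In CoreDNS terms that usually means an extra server block (or the equivalent stubDomains entry in kubelet configuration) so the internal zone never follows the default upstream path. An illustrative Corefile snippet with placeholder addresses:

```
# Illustrative Corefile: corp.example goes to internal resolvers,
# everything else follows the normal cluster path.
corp.example:53 {
    forward . 10.20.0.53 10.20.0.54
    cache 30
}
.:53 {
    kubernetes cluster.local in-addr.arpa ip6.arpa
    forward . /etc/resolv.conf
    cache 30
}
```

Note the `.:53` block forwards to /etc/resolv.conf: if the node's resolv.conf points somewhere wrong, pods inherit the wrong horizon, which is the host-vs-pod divergence described above.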
Checklists / step-by-step plan
Step-by-step: stabilize split-horizon under DoH/DoT pressure
- Inventory internal namespaces. List internal zones and critical hostnames. If you can’t list them, you can’t protect them.
- Decide resolver policy. Pick one: “corporate resolvers only” (managed) or “per-domain enforcement only” (hybrid).
- Deploy corporate encrypted DNS endpoints. Offer DoH/DoT as a supported service with HA, monitoring, and change control.
- Configure endpoints via MDM/GPO. Disable third-party DoH/DoT; set corporate resolvers; enforce NRPT/scoped domains.
- Block outbound TCP/853 where appropriate. Allow exceptions for your own resolvers. Document exceptions explicitly.
- Handle DoH deliberately. Either:
- Allow corporate DoH and block known third-party DoH where feasible, or
- Allow DoH broadly but ensure internal domains are resolved via internal channels (NRPT/scoped resolvers).
- Make internal DNS fast. Add regional forwarders, caching, and redundancy. Monitor latency and failure rates.
- Test with real clients. Validate on Windows/macOS/Linux plus at least one mobile platform if you support it.
- Operationalize: logging + dashboards. Track resolver query volume, SERVFAIL, NXDOMAIN, and top domains. Watch for sudden drops (bypass).
- Write the runbook. Include the Fast diagnosis playbook and the 14 tasks above. Future you will appreciate it.
Change review checklist (printable logic)
- Will this change affect which resolver answers corporate names?
- Does the solution support split-horizon views or conditional forwarding?
- Do we have an answer for browser-level DoH and OS-level DoT?
- What happens during partial failure (fallback behavior)?
- Do we have metrics for latency and error rates on corporate resolvers?
- Is there an explicit policy for unmanaged devices?
FAQ
1) Is DoH “bad for enterprises”?
No. Unmanaged DoH is bad for enterprises. Managed DoH is fine, often better. The conflict is control plane, not encryption.
2) Should I block DoH everywhere?
Only if you can enforce an alternative that meets privacy and reliability needs. Blind blocking often causes fallback to plaintext DNS or pushes users to workarounds.
Better: provide corporate DoH/DoT and manage resolver choice.
3) Why does split-horizon DNS exist if it’s so fragile?
Because it’s effective and simple when the resolver path is stable. It becomes fragile when endpoints choose resolvers dynamically or bypass the network’s resolver entirely.
That fragility is manageable with policy and per-domain routing.
4) Does DNSSEC solve this?
DNSSEC helps verify authenticity of DNS answers. It does not encrypt queries and doesn’t stop endpoints from using the wrong resolver.
You can (and often should) use DNSSEC alongside DoH/DoT, but it’s not a substitute.
5) If I run internal DoH, do I still need split-horizon?
If you have internal-only endpoints, yes. DoH doesn’t replace split-horizon; it changes transport. You still need views/forwarding rules so internal names resolve correctly.
6) What’s the safest minimum fix if I can’t do a big project?
Enforce per-domain routing for internal namespaces (NRPT/scoped domains) and remove public resolvers from managed configurations.
Then block outbound TCP/853 to reduce unmanaged DoT.
7) How do I handle mobile devices?
Treat them as a separate class. If they are managed, push a DNS profile and restrict Private DNS settings. If they are unmanaged BYOD, don’t assume split-horizon will work.
Provide internal apps via public DNS with strong auth, or use a managed container/work profile.
8) Why does adding a “backup” public DNS server cause outages?
Because it’s not a backup for internal zones. Some clients will use it first due to latency, interface metrics, or retry logic. Then negative caching can make failure sticky.
If a resolver can’t answer your internal zones, it’s not a safe secondary.
9) Can I rely on search domains instead of fully qualified names?
For convenience, sure. For critical reliability, no. Search domains multiply ambiguity and create confusing leak paths when clients use public resolvers.
Use FQDNs for critical services and keep internal namespaces unambiguous.
10) What’s the best metric to detect DoH bypass?
A sudden drop in query volume to your internal resolvers, especially from networks where endpoint count didn’t change.
Pair that with egress observations (TCP/853, known DoH endpoints) and endpoint compliance reporting.
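As a sketch of that heuristic, with a made-up helper name and an arbitrary 40% default threshold (tune both to your environment):

```shell
# Hypothetical helper: flag a suspicious drop in resolver query volume.
# prev/curr are query counts for equal-length windows; threshold is a percent.
qps_drop_alert() {
  prev=$1; curr=$2; threshold=${3:-40}
  # no baseline means nothing to compare against
  [ "$prev" -gt 0 ] || { echo "OK: no baseline"; return 0; }
  # integer math is fine for an alerting heuristic
  drop=$(( (prev - curr) * 100 / prev ))
  if [ "$drop" -ge "$threshold" ]; then
    echo "ALERT: query volume down ${drop}% vs previous window"
  else
    echo "OK: query volume change ${drop}%"
  fi
}
```

Feed it counts from your resolver metrics (e.g. yesterday's window vs today's) and page a human on ALERT; a sharp drop with a stable endpoint count is the bypass signature.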
Conclusion: next steps you can execute this week
DoH and DoT are not a fad, and they’re not your enemy. They’re the default direction of travel for the industry: more encryption, more endpoint autonomy, fewer assumptions about the network.
Split-horizon DNS still works—if you stop assuming clients will ask you nicely.
Practical next steps:
- Run the Fast diagnosis playbook against one real failing user and document which resolver answered.
- Remove “secondary public DNS” from VPN profiles and managed endpoint configs.
- Implement per-domain routing for internal namespaces (NRPT/scoped resolvers) as a baseline.
- Plan and deploy corporate DoH/DoT endpoints, then push them via MDM/GPO.
- Monitor resolver latency and error rates; treat DNS like production infrastructure, because it is.
Privacy is a win. Reliability is non-negotiable. You can have both—but only if you design for encrypted DNS instead of hoping it won’t notice your split-horizon.