Somebody in Office B can’t reach “fileserver01” anymore, but it works fine in Office A. The VPN is up.
The firewall “didn’t change anything.” The ticket says: “DNS issue??”
If you run Windows + Linux across multiple offices, DNS becomes the quiet dependency that ruins your afternoon.
The fix isn’t “set 8.8.8.8 and pray.” The fix is a deliberate design: zones, forwarders, recursion boundaries,
search suffix rules, and a habit of measuring where resolution actually breaks.
What you’re really solving: “name” → “route” → “policy”
“Make hostnames resolve everywhere” sounds like a DNS-only problem. In practice it’s a three-part chain:
the client’s resolver picks a DNS server, that DNS server returns an answer (or doesn’t), and then your
network/security stack decides whether the resulting IP is reachable.
You can have perfect DNS and still fail because the answer is “right” but useless: the branch office gets
an internal datacenter IP that isn’t routed across the VPN; or it gets an old record because replication is
lagging; or it gets the “public” answer because split-horizon DNS is misconfigured. Conversely, you can have
routing and still fail because the resolver is slow, multi-second timeouts add up, and your application
burns its own retry budget.
My opinionated goal: choose one authoritative source of truth for internal names, ensure every office can
reach it (directly or via forwarded resolution), and make your clients deterministic about which DNS servers
they use. Then verify with repeatable commands, not vibes.
Facts and short history that matter in production
- DNS is older than the modern internet as you experience it. The core DNS RFCs (1034 and 1035) landed in 1987, cementing the early-1980s design that replaced the centralized HOSTS.TXT model, which didn’t scale.
- “Split-horizon DNS” isn’t a hack; it’s a pattern. Serving different answers based on where the query comes from is common in enterprises and in CDNs.
- Windows made DNS everyone’s problem. Active Directory depends on DNS (SRV records) for locating domain controllers, Kerberos, LDAP, and more.
- TTL is a blunt instrument. Short TTLs reduce stale data but increase query volume and expose latency across WAN links.
- Negative caching exists. If a client asks for a name and gets NXDOMAIN, that failure can be cached; “I fixed it” may not propagate instantly.
- Search suffixes are both productivity and foot-gun. “ping printer01” can resolve quickly… or can trigger a parade of failed queries across multiple suffixes.
- systemd-resolved changed the Linux default mental model. On many distros, /etc/resolv.conf is a stub and the real logic lives behind a local listener.
- EDNS0 and larger DNS packets are normal now. Firewalls that “helpfully” block fragments or large UDP responses can selectively break modern DNS.
- DNSSEC is not the same as “secure DNS.” DNSSEC validates authenticity; it does not encrypt queries. (DoH/DoT are the encryption story.)
One quote worth keeping on a sticky note, because DNS failures love cascading: “Hope is not a strategy.”
— Gene Kranz.
(It’s short, and it applies uncomfortably well to “it usually resolves.”)
A target architecture that doesn’t rot
Decide your internal namespace like you mean it
Use a domain you own. If you already have Active Directory, this is typically your AD DNS domain, and you
should treat it as your internal naming authority. If you don’t have AD, you still need an internal zone
with sane ownership. The “just use .local” era should stay in 2006 where it belongs.
Cross-office hostname resolution becomes easy when your internal names are:
- Authoritative in one place (or a set of synced authoritative servers)
- Reachable from all offices (directly or via forwarding)
- Consistent with routing (answers are reachable from the client’s location)
Choose a pattern and stick to it
Here are the patterns that work in the real world. Pick one based on your size and constraints:
- AD-integrated DNS everywhere (recommended if you have AD).
  Put at least one DNS-capable DC in each office, use AD replication, and keep clients pointed at local DNS.
  This reduces WAN sensitivity and makes failures more localized.
- Central authoritative DNS + branch forwarding resolvers.
  Branch offices run caching resolvers that forward queries for internal zones to HQ; everything else goes to upstream resolvers.
  This works well if you don’t want DCs in branches (a config sketch follows this list).
- Split-horizon with internal and external views.
  Same name, different answers internally vs externally. Useful when public DNS exists for your domain but internal
  answers must differ (VPN endpoints, internal VIPs, etc.).
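Here’s that sketch: a minimal unbound config for a branch resolver, reusing this article’s sample addressing (10.10.0.10/.11 match the HQ forwarders shown in Task 13 later; the upstream resolvers 10.10.0.20/.21 are hypothetical controlled egress points):
# /etc/unbound/unbound.conf.d/branch.conf (illustrative, not a hardened config)
server:
    interface: 0.0.0.0
    access-control: 10.20.0.0/16 allow      # answer only branch clients
forward-zone:
    name: "corp.example.com"                # internal zone -> HQ DNS
    forward-addr: 10.10.0.10
    forward-addr: 10.10.0.11
forward-zone:
    name: "."                               # everything else -> controlled upstreams
    forward-addr: 10.10.0.20
    forward-addr: 10.10.0.21
The point of the shape: internal names never leave your network, external recursion is centralized, and the branch keeps a warm cache when the WAN gets moody.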
What I avoid: “Every site uses public resolvers (Google/Cloudflare), and we sprinkle hosts files for internal names.”
That’s not a design; that’s a future incident report.
Joke #1: DNS is like office gossip: the one wrong record spreads faster than the correction.
Windows foundations: AD DNS, sites, and zone replication
AD DNS is not optional if you expect AD to behave
In an AD environment, DNS does two jobs: it resolves A/AAAA records like any DNS, and it publishes service
location using SRV records. When branch offices complain “login is slow,” the root cause is often DNS: clients
are discovering domain controllers across the WAN, or failing to locate a global catalog, or timing out on SRV
lookups because they’re pointing at a resolver that can’t see the AD zone.
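You can watch that dependency directly; a quick sketch querying the DC locator SRV record for this article’s sample domain against the branch resolver:
cr0x@server:~$ dig _ldap._tcp.dc._msdcs.corp.example.com SRV +short @10.20.0.10
0 100 389 dc1.corp.example.com.
If a branch resolver can’t answer this, clients can’t find a DC, no matter how healthy the VPN looks.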
Use Sites and Services to control where clients land
If you run multiple offices and you aren’t using AD Sites properly, you’re leaving the steering wheel on the roof
of the car. Define subnets per office, map them to sites, and ensure each site has local domain controllers (or at least
local DNS forwarders for AD zones). Otherwise, clients will happily authenticate to “some DC” across a slow link,
and then you’ll blame “the VPN.”
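As a PowerShell sketch (assuming the ActiveDirectory module and a site already created for the branch; the subnet and site name here are hypothetical):
cr0x@server:~$ New-ADReplicationSubnet -Name "10.20.1.0/24" -Site "Branch1"
cr0x@server:~$ Get-ADReplicationSubnet -Filter * | Select-Object Name,Site
The second command is the audit step: every office subnet should map to exactly one site, and unmapped subnets are how clients end up authenticating across the WAN.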
Zone replication scope is a security and performance decision
For AD-integrated zones, replication scope choices (domain-wide vs forest-wide vs custom application partitions)
affect what gets copied where. Replicate only what you need. The goal is predictable resolution, not “everything everywhere.”
Conditional forwarders are the grown-up way to cross-resolve
If you have multiple AD forests or separate internal domains (common after mergers), conditional forwarders keep you sane.
They also avoid “full recursion everywhere” which is both slow and hard to secure.
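Creating one is a one-liner on Windows DNS; a sketch reusing the partner zone that appears in Task 12 below:
cr0x@server:~$ Add-DnsServerConditionalForwarderZone -Name "corp.partner.example" -MasterServers 192.0.2.53,192.0.2.54 -ReplicationScope "Forest"
Note the two master servers: a conditional forwarder with a single target is a single point of failure you chose on purpose.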
Linux foundations: resolv.conf, systemd-resolved, and DNS search
Know which resolver stack you’re actually running
On modern Linux, you might have:
- The glibc stub resolver reading /etc/resolv.conf directly
- systemd-resolved acting as a local stub (often 127.0.0.53) with per-link DNS
- NetworkManager managing resolvers and writing resolv.conf or talking to resolved
- dnsmasq/unbound running locally as a cache/forwarder
Many outages are just mismatched assumptions: someone edits /etc/resolv.conf by hand, NetworkManager
“fixes” it back, and the user concludes “Linux ignores DNS settings.” No, it ignores your manual edits.
Search domains and “ndots” can create phantom failures
Linux resolver behavior includes search suffixes and the ndots option. If ndots:5
is in play (common in Kubernetes environments, sometimes copied into servers by mistake), then a lookup for
fileserver01 can cause multiple appended-domain lookups before trying it “as-is.” Across a WAN, that’s
a latency multiplier.
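A sketch of the foot-gun in /etc/resolv.conf (hypothetical values, of the kind often copied off a Kubernetes node):
search app.svc.cluster.local svc.cluster.local cluster.local corp.example.com
nameserver 10.20.0.10
options ndots:5
With ndots:5, even fileserver01.corp.example.com (three dots) sits below the threshold, so every search suffix is appended and fails with NXDOMAIN before the name is tried as-is. A trailing dot (fileserver01.corp.example.com.) bypasses the search list entirely.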
Make Linux deterministic in branches
In a multi-office setup, Linux servers should point to local resolvers first. If you can’t deploy local resolvers,
then at least ensure the remote resolvers are reachable and low-latency. “Two random DNS IPs in resolv.conf”
without a plan is how you get half your queries going across the VPN because the resolver rotates or fails over.
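A sketch of pinning per-link DNS with resolvectl (runtime-only; these settings revert on restart, so persist them via NetworkManager, netplan, or your distro’s tooling):
cr0x@server:~$ sudo resolvectl dns ens160 10.20.0.10 10.20.0.11
cr0x@server:~$ sudo resolvectl domain ens160 branch1.corp.example.com corp.example.com
cr0x@server:~$ resolvectl status ens160
The last command is the verification: the link should show exactly the servers and search domains you intended, and nothing a VPN client or DHCP lease snuck in.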
Practical tasks: commands, outputs, and decisions (Windows + Linux)
These are real checks you can run today. Each includes (1) the command, (2) what the output means, and
(3) the decision you make next. DNS work is not a philosophy seminar; it’s a measurement sport.
Task 1 (Linux): Identify who owns DNS on this host
cr0x@server:~$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 May 3 10:12 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
Meaning: This system is using systemd-resolved’s stub file; editing /etc/resolv.conf directly won’t persist.
Decision: Use resolvectl and your network manager settings, not hand edits.
Task 2 (Linux): Show effective DNS servers and search domains
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.20.0.10
DNS Servers: 10.20.0.10 10.20.0.11
DNS Domain: corp.example.com
Link 2 (ens160)
Current Scopes: DNS
DNS Servers: 10.20.0.10 10.20.0.11
DNS Domain: branch1.corp.example.com corp.example.com
Meaning: DNS is per-link; the machine will prefer these servers and will append those search domains.
Decision: Confirm these IPs are local-to-site resolvers. If they’re remote HQ IPs, expect latency and plan a branch resolver.
Task 3 (Linux): Check raw resolver config (for ndots/search surprises)
cr0x@server:~$ cat /etc/resolv.conf
nameserver 127.0.0.53
options edns0 trust-ad
search branch1.corp.example.com corp.example.com
Meaning: Apps will query the local stub; search domains will be tried.
Decision: If lookups for short names are slow, consider reducing search list size and ensure short-name usage is intentional.
Task 4 (Linux): Confirm which server answers, and whether it’s authoritative
cr0x@server:~$ dig fileserver01.corp.example.com +noall +answer +authority +comments
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41457
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
fileserver01.corp.example.com. 300 IN A 10.30.4.21
Meaning: The aa flag indicates an authoritative answer from the responding server.
Decision: If you expect authoritative but don’t see aa, you might be hitting a forwarder/cache; check forward chain and source of truth.
Task 5 (Linux): Catch “DNS works, but it’s slow”
cr0x@server:~$ dig fileserver01.corp.example.com @10.20.0.10 +stats
;; ANSWER SECTION:
fileserver01.corp.example.com. 300 IN A 10.30.4.21
;; Query time: 412 msec
;; SERVER: 10.20.0.10#53(10.20.0.10)
;; WHEN: Sun Dec 28 10:12:55 UTC 2025
;; MSG SIZE rcvd: 74
Meaning: 412 ms is WAN-ish. For an in-site resolver you want low tens of milliseconds or less.
Decision: If query time is consistently high, deploy a local caching resolver in the office or fix routing/MTU/firewall issues on UDP/53.
Task 6 (Linux): Verify the resolver is reachable and not blocked (UDP and TCP)
cr0x@server:~$ nc -vz -u 10.20.0.10 53
Connection to 10.20.0.10 53 port [udp/domain] succeeded!
cr0x@server:~$ nc -vz 10.20.0.10 53
Connection to 10.20.0.10 53 port [tcp/domain] succeeded!
Meaning: Both UDP and TCP 53 are reachable. TCP matters for large responses and truncation retries.
Decision: If UDP works but TCP fails, fix firewall rules; some records (DNSSEC, many SRVs) will break intermittently.
Task 7 (Linux): Observe DNS retries and which names are being tried
cr0x@server:~$ sudo tcpdump -ni ens160 port 53 -c 8
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ens160, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:13:11.120134 IP 10.20.1.55.56621 > 10.20.0.10.53: 11209+ A? fileserver01.branch1.corp.example.com. (55)
10:13:11.170221 IP 10.20.1.55.56621 > 10.20.0.10.53: 28041+ A? fileserver01.corp.example.com. (45)
10:13:11.220401 IP 10.20.1.55.56621 > 10.20.0.10.53: 19901+ AAAA? fileserver01.corp.example.com. (45)
Meaning: The client is trying search-suffixed names first, then the base domain, then AAAA.
Decision: If those first search attempts are unnecessary and slow (NXDOMAIN delays), reduce search domains or use FQDNs in configs.
Task 8 (Windows): Show what DNS servers and suffixes a client actually uses
cr0x@server:~$ ipconfig /all
Windows IP Configuration
Host Name . . . . . . . . . . . . : WKS-221
Primary Dns Suffix . . . . . . . : corp.example.com
DNS Servers . . . . . . . . . . . : 10.20.0.10
10.20.0.11
Connection-specific DNS Suffix . . : branch1.corp.example.com
Meaning: The client has a primary suffix and a connection-specific suffix, and will try them during resolution.
Decision: If DNS servers listed are not local-to-site, fix DHCP options or GPO; don’t let laptops in branches point at HQ DNS by default.
Task 9 (Windows): Prove whether the DNS server can answer for the zone
cr0x@server:~$ nslookup
> server 10.20.0.10
Default Server: dns-branch1.corp.example.com
Address: 10.20.0.10
> set type=soa
> corp.example.com
corp.example.com
primary name server = dc1.corp.example.com
responsible mail addr = hostmaster.corp.example.com
serial = 2457
refresh = 900 (15 mins)
retry = 600 (10 mins)
expire = 86400 (1 day)
default TTL = 3600 (1 hour)
Meaning: You got an SOA, so the server can speak authoritatively (or at least can fetch it reliably).
Decision: If SOA fails or times out, this is not a client problem. Fix DNS service reachability or forwarding/zone hosting.
Task 10 (Windows): Check which server answered and whether recursion is involved
cr0x@server:~$ nslookup fileserver01.corp.example.com 10.20.0.10
Server: dns-branch1.corp.example.com
Address: 10.20.0.10
Name: fileserver01.corp.example.com
Address: 10.30.4.21
Meaning: You forced a specific DNS server. Good. You eliminated “client picks the wrong server.”
Decision: If this works but the same lookup without specifying server fails, fix the client’s DNS server list and ordering.
Task 11 (Windows Server / AD): Validate AD DNS health quickly
cr0x@server:~$ dcdiag /test:dns /v
Directory Server Diagnosis
Performing initial setup:
Trying to find home server...
Home Server = DC1
* Identified AD Forest.
Done gathering initial info.
Doing DNS tests...
DNS Tests are running and not hung. Please wait a few minutes...
PASS: Basic (Domain membership)
PASS: Forwarders/Root hints
PASS: Delegation
PASS: Records registration
Meaning: Core DNS tests passed on that DC.
Decision: If this fails in branches only, focus on replication/topology/forwarding between sites, not “DNS is down everywhere.”
Task 12 (Windows Server): Inspect conditional forwarders (when merging domains)
cr0x@server:~$ Get-DnsServerConditionalForwarderZone
ZoneName MasterServers ReplicationScope
-------- ------------- ----------------
corp.partner.example {192.0.2.53,192.0.2.54} Forest
lab.example.com {10.50.0.10} Domain
Meaning: Conditional forwarders exist and replicate; scope tells you how widely.
Decision: If branch DNS servers don’t have the forwarder, fix replication scope or create it on the right DNS role holders.
Task 13 (Linux): Identify whether your BIND/unbound is forwarding correctly
cr0x@server:~$ sudo named-checkconf -p | sed -n '1,80p'
options {
directory "/var/cache/bind";
recursion yes;
allow-recursion { 10.20.0.0/16; };
};
zone "corp.example.com" {
type forward;
forward only;
forwarders { 10.10.0.10; 10.10.0.11; };
};
Meaning: This resolver forwards the internal zone to HQ DNS and will not try other methods (forward only).
Decision: If HQ DNS is unreachable, branch resolution for that zone will fail hard. Consider adding local authoritative if you need branch autonomy.
Task 14 (Linux): Confirm caching behavior and TTL sanity
cr0x@server:~$ dig fileserver01.corp.example.com @10.20.0.10 +noall +answer
fileserver01.corp.example.com. 300 IN A 10.30.4.21
cr0x@server:~$ dig fileserver01.corp.example.com @10.20.0.10 +noall +answer
fileserver01.corp.example.com. 297 IN A 10.30.4.21
Meaning: TTL is decrementing, so caching is happening (second answer is served from cache or a near cache path).
Decision: If TTL doesn’t decrement and response times are high each time, you’re not caching; fix your resolver choice or local forwarder config.
Task 15 (Windows): Flush client cache when testing record changes
cr0x@server:~$ ipconfig /flushdns
Windows IP Configuration
Successfully flushed the DNS Resolver Cache.
Meaning: Client cache is cleared.
Decision: Use this when validating a fix. If it “fixes it” temporarily, the underlying issue may be stale TTL/replication, not client magic.
Task 16 (Linux): Flush systemd-resolved cache (if applicable)
cr0x@server:~$ sudo resolvectl flush-caches
cr0x@server:~$ resolvectl statistics
DNSSEC supported by current servers: no
Cache
Current Cache Size: 0
Cache Hits: 118
Cache Misses: 54
Meaning: Cache cleared; statistics show prior hit/miss behavior.
Decision: If miss count is huge and servers are remote, you’re paying WAN latency repeatedly. Add local caching.
Three corporate mini-stories (how this fails in real life)
Mini-story 1: The incident caused by a wrong assumption
A mid-size company had two offices and a small datacenter. They added a third office quickly, “temporary” VPN,
“temporary” DHCP config, “temporary” everything. The new office’s DHCP handed out public resolvers because “DNS is DNS.”
Internal hostnames were expected to work via hosts files for “a few weeks.”
On a Monday, a Windows update rolled through, and laptops started preferring IPv6. Public resolvers returned
a public AAAA record for a name that was meant to be internal-only (someone had accidentally created a matching record
during a web migration months earlier). Result: clients tried to reach an internal service over the internet, failed,
and the application interpreted that as “service down,” triggering a call tree.
The wrong assumption wasn’t “IPv6 is weird.” The wrong assumption was: internal names don’t leak.
They do—through collisions, through split-horizon mistakes, through people copy-pasting records between zones.
Once the team forced all branch clients to use internal resolvers (and created a proper internal zone with split-horizon),
the problem disappeared.
The lesson: “temporary DNS” is permanent, just with worse documentation.
Mini-story 2: The optimization that backfired
Another org chased performance complaints in a branch office. An engineer lowered TTLs across the internal zone to
something tiny so “changes propagate faster.” It worked—changes propagated faster. It also turned DNS into a constant WAN
stream because the branch had no local cache and the only DNS servers were in HQ.
Nobody noticed until a routine firewall update introduced slightly higher latency on UDP/53 due to new inspection rules.
Suddenly applications that did multiple lookups per transaction started failing. Not because DNS was down, but because
the aggregate resolution time exceeded the application’s timeout budget.
They reverted TTLs upward, added a small caching resolver in the branch, and made a rule: TTL changes require a performance
review when WAN links are involved. “Fast propagation” is great; “amplified dependence on WAN latency” is not.
Joke #2: Setting a 5-second TTL everywhere is like scheduling daily fire drills—technically everyone’s prepared, practically nobody gets work done.
Mini-story 3: The boring but correct practice that saved the day
A larger enterprise ran AD-integrated DNS with one DC per site and a consistent policy: clients use only local DNS servers,
and those servers forward external queries to a small set of controlled upstream resolvers. They also had a simple operational
habit: every site change ticket included a DNS validation checklist and a “dig from the branch” screenshot equivalent.
One night, a carrier issue partially degraded the MPLS between two regions. Latency spiked, packet loss appeared, and a bunch of
teams started reporting “random failures.” The DNS team didn’t panic. They already had local resolvers, so most queries stayed in-site.
Authentication remained stable because clients weren’t hunting for remote domain controllers.
The few failures they did see were traced quickly: conditional forwarders for a partner zone pointed only at resolvers in the affected
region. They added a second forwarder target in another region, tested it, and moved on.
Nothing heroic happened. That’s the point. Boring design reduces the number of things that can become exciting.
Fast diagnosis playbook
When “hostname doesn’t resolve in Office B” hits your queue, the goal is to find the bottleneck in minutes, not to
meditate on DNS theory. Run this in order.
First: confirm it’s DNS, not reachability
- From the failing client: resolve the FQDN with a specified server (nslookup/dig). If that fails, DNS path is broken.
- If it resolves, try reaching the IP (ping/curl/port check). If that fails, routing/firewall/policy is the problem.
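A minimal bash sketch that runs both checks in order (the name, server, and port are this article’s sample values; adjust to your environment):
#!/usr/bin/env bash
# Minimal triage: is it DNS, or is it reachability/policy?
NAME=fileserver01.corp.example.com
DNS=10.20.0.10
PORT=445    # the service port you actually care about (SMB here)

IP=$(dig +short "$NAME" @"$DNS" | tail -n1)
if [ -z "$IP" ]; then
    echo "DNS path broken: no answer for $NAME from $DNS"
    exit 1
fi
echo "Resolved: $NAME -> $IP"

if nc -z -w 3 "$IP" "$PORT"; then
    echo "Reachable on $PORT: DNS and routing are fine; look at app/policy"
else
    echo "Resolves but unreachable: routing/firewall/policy, not DNS"
fi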
Second: find which DNS server the client is actually using
- Windows: ipconfig /all and confirm DNS server list and suffixes.
- Linux: resolvectl status or inspect /etc/resolv.conf depending on stack.
- Look for split-brain: VPN pushes DNS, DHCP pushes different DNS, NetworkManager rotates, etc.
Third: determine whether the DNS server is authoritative or forwarding
- Query SOA/NS for the zone against the server in question.
- If it forwards, test forwarder reachability from that DNS server (not from your laptop).
- Check for TCP fallback failures and EDNS-related truncation issues.
Fourth: check latency and retries
- Measure query time with dig +stats.
- Use tcpdump/Wireshark to see repeated queries, search suffix expansion, and timeouts.
- If latency is high, implement local caching and keep internal authoritative as close as practical.
Fifth: validate data correctness (stale/wrong answer)
- Compare the returned IP with expected routing per office.
- Check TTLs and negative caching.
- If AD-integrated: confirm replication health and that the record exists where you think it exists.
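For the AD case, a quick replication summary from any DC (output abbreviated and illustrative):
cr0x@server:~$ repadmin /replsummary
Replication Summary Start Time: 2025-12-28 10:15:02

Source DSA          largest delta    fails/total  %%   error
 DC1                       12m:04s        0 / 5     0
 DC2-BRANCH1               58m:41s        1 / 5    20   (1722) The RPC server is unavailable.
Non-zero fails on the branch DC explains “the record exists in HQ but not here” faster than any amount of client-side cache flushing.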
Design choices: forwarders vs stub zones vs recursion everywhere
Conditional forwarders: the default answer for cross-office DNS
Conditional forwarders say: “for queries in zone X, ask these specific servers.” They’re simple, auditable, and predictable.
In branch offices, a local caching resolver with conditional forwarders gives you good performance without duplicating zone authority.
Failure mode: if the forwarder targets are only in one site, that site becomes a hidden dependency. Add at least two targets
in different failure domains when possible.
Stub zones: useful when you want referral logic, not full forwarding
Stub zones pull the NS (and sometimes glue A records) for a zone and keep them updated. They’re handy when you want the resolver
to learn authoritative servers dynamically, rather than pinning a fixed forwarder list.
Failure mode: stub zones still need reachability to authoritative servers. If your routing is asymmetric or filtered, you can get
“intermittent” resolution that makes everyone argue.
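A BIND-flavored sketch, reusing the partner zone from Task 12 (the zone file name is hypothetical):
zone "corp.partner.example" {
    type stub;
    masters { 192.0.2.53; 192.0.2.54; };  // spelled "primaries" in newer BIND
    file "db.stub.corp.partner.example";
};
Unlike a forwarder, the resolver then chases the NS records it learned, so those authoritative servers must be reachable from every site that uses the stub.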
Full recursion everywhere: tempting, messy
Letting every DNS server do full recursion to the internet can work, but it expands your attack surface, complicates logging,
and makes egress filtering harder. In enterprises, it’s generally cleaner to centralize recursion to a small set of resolvers
and forward to them.
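In BIND terms, the centralized pattern is a sketch like this (extending the Task 13 config; 10.10.0.20/.21 are hypothetical central resolvers):
options {
    recursion yes;
    allow-recursion { 10.20.0.0/16; };       // serve only internal clients
    forwarders { 10.10.0.20; 10.10.0.21; };  // recursion is centralized upstream
    forward only;                             // never walk the internet directly
};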
Split-horizon: do it deliberately or don’t do it
Split-horizon DNS is appropriate when the same zone exists publicly and internally but with different answers.
The only sane way to operate it is with clear ownership, clear testing, and explicit policies about which clients see which view.
Failure mode: someone updates public DNS but forgets internal (or vice versa). This is how you get “works on mobile hotspot”
and “fails on VPN” at the same time.
Common mistakes: symptom → root cause → fix
1) “nslookup works but the app still fails”
Symptom: Manual lookup succeeds; application errors mention “host not found” or times out.
Root cause: The app uses a different resolver path (container DNS, JVM caching, /etc/nsswitch.conf differences),
or the app is trying AAAA first and timing out on IPv6 connectivity.
Fix: Verify resolution from the app runtime (inside container, same user, same network namespace). Check AAAA behavior and IPv6 routing.
Consider lowering JVM DNS cache TTLs only if you understand the cost.
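One fast discriminator: getent resolves through the same NSS path applications use, while dig goes straight to DNS. If they disagree, suspect nsswitch.conf, hosts files, or the container’s own resolver config rather than the DNS server:
cr0x@server:~$ getent hosts fileserver01.corp.example.com
10.30.4.21      fileserver01.corp.example.com
cr0x@server:~$ dig fileserver01.corp.example.com +short
10.30.4.21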
2) “Only one office can resolve short names”
Symptom: ping fileserver01 works in HQ but fails in branches; FQDN works everywhere.
Root cause: Search suffix configuration differs between sites (DHCP/GPO/NetworkManager), or the branch lacks the primary DNS suffix.
Fix: Standardize suffix search lists via DHCP options and/or GPO; reduce suffix sprawl. Prefer FQDNs in critical configs.
3) “Resolution is slow only over the VPN”
Symptom: Lookups take 500–2000 ms; apps time out; packet captures show retries.
Root cause: Branch clients query HQ DNS across the VPN; no local cache; MTU/fragmentation issues cause dropped DNS responses; TCP/53 blocked.
Fix: Deploy a local caching resolver or site-local DC/DNS; fix MTU and allow TCP/53; ensure EDNS UDP size isn’t being broken by middleboxes.
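Two probes separate these causes quickly: if the small-buffer query works while default-size queries fail intermittently, a middlebox is eating large UDP responses; if the +tcp query hangs, TCP/53 is blocked:
cr0x@server:~$ dig fileserver01.corp.example.com @10.20.0.10 +bufsize=512
cr0x@server:~$ dig fileserver01.corp.example.com @10.20.0.10 +tcp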
4) “Randomly gets the wrong IP”
Symptom: Sometimes resolves to 10.x, sometimes to 192.168.x, sometimes to a public IP.
Root cause: Split-horizon misapplied, or clients are using mixed DNS servers (one internal, one public), or stale records with different TTLs.
Fix: Enforce DNS servers via DHCP/VPN/GPO; remove public resolvers from internal clients; audit split-horizon policies and ensure consistent views.
5) “A new record exists in HQ but not in the branch”
Symptom: DC1 resolves it; DC2 in branch returns NXDOMAIN.
Root cause: AD replication issues, zone replication scope mismatch, or branch DNS server not hosting the zone.
Fix: Verify AD replication health; confirm zone is AD-integrated and replication scope includes that DNS server; fix site links and schedules if needed.
6) “Linux servers ignore the DNS settings we pushed”
Symptom: You set resolv.conf; it reverts; behavior differs after reboot.
Root cause: systemd-resolved/NetworkManager managing resolv.conf, or DHCP overwriting settings per interface.
Fix: Configure DNS through the correct manager (resolvectl/NetworkManager connection settings), or run a local resolver and point to 127.0.0.1.
7) “Domain join works, but logons are slow in one office”
Symptom: Join succeeds; later logons take long; GP updates sluggish.
Root cause: Clients discover DCs in another site due to missing/misconfigured AD Sites subnets, or branch points to non-AD DNS.
Fix: Define subnets and map them to sites; ensure local DC/DNS or forwarding for AD zones; verify SRV record resolution locally.
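To verify site-local SRV resolution, query the site-scoped locator record (the site name Branch1 and the DC name are hypothetical; substitute your AD site):
cr0x@server:~$ nslookup -type=SRV _ldap._tcp.Branch1._sites.dc._msdcs.corp.example.com 10.20.0.10
_ldap._tcp.Branch1._sites.dc._msdcs.corp.example.com   SRV service location:
          priority       = 0
          weight         = 100
          port           = 389
          svr hostname   = dc-branch1.corp.example.com
If this returns a DC in another site (or nothing), clients in the branch will hunt across the WAN no matter how the subnets look on paper.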
Checklists / step-by-step plan
Step-by-step: make hostnames resolve everywhere (the “do this, not that” plan)
1) Pick your internal authority. If you have AD, internal DNS authority is AD DNS. If not, pick BIND/Windows DNS as the authoritative source and document ownership.
2) Inventory zones and overlaps. List internal zones, external zones, and any collisions (same name used in multiple places).
3) Define site-local DNS per office. Best: a DNS-capable DC in each office. Good: a small caching resolver VM in each office with conditional forwarders to HQ.
4) Standardize client DNS distribution. DHCP for desktops, VPN profiles for remote users, and GPO where appropriate. Remove public resolvers from internal clients (a DHCP sketch follows this list).
5) Implement conditional forwarders (or stub zones) for cross-domain needs. Especially after mergers or when partner zones exist.
6) Get split-horizon under control. If you need it, formalize it: internal view vs external view, consistent change control, and explicit testing from each office.
7) Fix routing alignment. Don’t return IPs that the querying site can’t route to. If you have site-specific VIPs, you may need geo/site-aware answers.
8) Test from each office with repeatable commands. Use the same hostnames, measure query time, and verify which server answered.
9) Add logging and basic alerting. Alert on resolver timeouts, SERVFAIL spikes, and forwarder reachability. DNS problems are rarely silent; they’re just ignored.
10) Document the boring parts. Which DNS servers are “allowed,” which zones are internal, how to add records, and how to validate from each office.
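For step 4, a sketch of pinning a branch scope on Windows DHCP (assuming the DhcpServer module; the scope ID and resolver IPs are hypothetical site-local values):
cr0x@server:~$ Set-DhcpServerv4OptionValue -ScopeId 10.20.1.0 -DnsServer 10.20.1.10,10.20.1.11 -DnsDomain branch1.corp.example.com
Run it per scope, then confirm with ipconfig /all on a client in that office; DHCP is only “standardized” once the lease actually hands out what you think it does.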
Operational checklist: before you declare “DNS fixed”
- Clients in each office use the intended DNS servers (no public resolvers sneaking in).
- Internal names resolve to routable IPs from each office.
- Query latency is acceptable (measure, don’t guess).
- TCP/53 works end-to-end for resolvers.
- Search suffix lists are consistent and not bloated.
- AD replication (if used) is healthy and zones replicate to the right scope.
- Caches are understood: TTLs, negative caching, and when to flush during tests.
FAQ
1) Should each office have its own DNS server?
Yes, unless you enjoy WAN latency being part of every login and app call. A local caching resolver per office is often enough.
If you have AD, a local DC/DNS per office is the cleanest pattern.
2) Can I just use public DNS resolvers and add internal records somewhere?
Not reliably. Public resolvers won’t serve your private zone, and split-horizon via public infrastructure is not how you want
to spend your weekends. Keep internal resolution internal.
3) What’s better: conditional forwarders or stub zones?
Conditional forwarders are simpler and more predictable. Stub zones are helpful when you want automatic updates of authoritative server lists.
If you’re not sure, choose conditional forwarders and keep the target list redundant.
4) Why does it work in Office A but not Office B?
Usually one of: different DNS servers handed out by DHCP/VPN, different search suffix list, missing zone replication to branch DNS,
or routing mismatch (branch cannot reach the returned IP). Measure which one, then fix the specific layer.
5) How do I handle split-horizon DNS safely?
Treat it like production code: change control, testing from inside and outside, and clear separation of internal vs external zone management.
Don’t let two teams update “the same name” in two places without coordination.
6) Why are DNS lookups slow even though the VPN is “up”?
VPN “up” just means some packets pass. DNS is sensitive to latency, packet loss, MTU/fragmentation, and TCP fallback.
Measure query time, confirm UDP and TCP 53, and add local caching to remove WAN dependency.
7) How do I standardize DNS search suffixes across Windows and Linux?
Windows: DHCP option for DNS suffix and GPO for suffix search list where appropriate.
Linux: configure via NetworkManager/systemd-resolved (or your distro tooling), not by editing resolv.conf by hand.
Keep the list short; every extra suffix is a potential delay.
8) What about DoH/DoT—should branches use encrypted DNS?
For internal zones, you usually want plain DNS on your trusted network segments, with strict server selection and logging.
DoH/DoT can be useful for client privacy to external domains, but it can also bypass corporate DNS policies if unmanaged.
Decide intentionally; don’t let freelancing browsers choose your resolver.
9) How many DNS servers should clients have configured?
Typically two, ideally site-local. More isn’t always better; some clients rotate or fail over in ways that create cross-site traffic.
The right answer is “two reachable servers in the same failure domain,” plus proper server redundancy behind the scenes.
10) How do we avoid stale records in multi-office DNS?
Use dynamic DNS updates where appropriate, scavenging/aging in AD DNS with care, and monitor replication health.
Keep TTLs reasonable; don’t set them tiny to compensate for sloppy record lifecycle management.
Next steps you can do this week
If you want hostnames to resolve everywhere across offices, stop treating DNS like a background utility and treat it like a
production system with topology. Put a resolver close to users. Make internal zones authoritative somewhere you control.
Use conditional forwarders for cross-domain resolution. Keep client configuration consistent and auditable.
Practical next steps:
- Pick one branch office and measure DNS latency today (dig +stats / nslookup against the intended servers).
- Audit DHCP/VPN/GPO so clients in that office only use the intended resolvers.
- Deploy a local caching resolver (or DC/DNS) in that office and re-test query times.
- Standardize suffix search lists; remove accidental suffix sprawl and mysterious ndots settings.
- Write a one-page DNS runbook: “Which servers, which zones, how to test, how to change.” Then actually use it.
Your reward is boring, consistent resolution. Which is exactly what DNS should be: invisible, fast, and never the headline.