Two offices. One shiny site-to-site VPN. Suddenly printers disappear, phones reboot, file shares “work for me,”
and someone says the phrase that makes SREs age in dog years: “We didn’t change anything.”
This mess usually isn’t the VPN’s fault. It’s your addressing and DHCP model, exposed by the VPN because you just
connected two networks that were never designed to meet. The good news: this is fixable, and you can make it boring.
Boring is the whole point.
The mental model: what actually breaks when you connect offices
A site-to-site VPN is not magic. It’s a pipe. Pipes don’t negotiate semantics; they move packets. The chaos starts
when those packets contain assumptions that were fine in isolation:
- Assumption #1: “192.168.1.0/24 means my office.” After the VPN, it might mean “both offices.” That’s not a meaning; it’s a collision.
- Assumption #2: DHCP is “local and harmless.” It isn’t. DHCP is a control plane for identity (IP), routing (default gateway), and name resolution (DNS).
- Assumption #3: Broadcast-based discovery will work across offices. It won’t, unless you deliberately bridge, and bridging is how you import every L2 problem into your L3 network.
- Assumption #4: “The VPN is slow” when a route is wrong, DNS is split-brained, or MTU is quietly eating your packets.
The VPN changes the failure domain. A typo in one office now fails both. A rogue DHCP server in one office now
has a bigger hunting ground (depending on bridging/relays). And your troubleshooting gets harder because
symptoms show up far away from causes.
If you want a single sentence to guide your design, use this:
Keep each office as an independent L3 domain, connect them with routing, and only centralize what you can monitor and constrain.
One quote worth keeping on a sticky note for VPN and DHCP work: "Hope is not a strategy." — Gene Kranz.
Yes, it’s from mission control, not networking. The point stands.
Joke #1: A site-to-site VPN is like a marriage: it doesn’t fix your communication issues, it just makes them louder.
Interesting facts and historical context (because history repeats)
- RFC 1918 (1996) gave us private IPv4 ranges (10/8, 172.16/12, 192.168/16). It solved address scarcity, and created the modern sport of overlapping subnets.
- DHCP became a standard in the early 1990s (as an evolution of BOOTP). It was designed for LANs with broadcast; the “over a WAN” story came later via relays.
- Broadcast domains were always a scaling limit. Early campus networks learned quickly that L2 sprawl turns “one bad box” into “company-wide mystery.”
- IPsec (late 1990s) aimed at securing IP at Layer 3. Many deployments still accidentally build Layer 2 behavior on top of it, and pay for it.
- NAT wasn’t meant to be forever. It was a practical hack for IPv4 exhaustion. It’s still here, and it still breaks end-to-end assumptions and complicates troubleshooting.
- Split DNS exists because reality is messy. The idea of different answers depending on where you ask is older than many “cloud-native” teams.
- “VPN is slow” became a trope because MTU issues are silent. The history of PMTUD black holes is long and full of packet loss.
- DHCP option sets became a de facto configuration system. Phones, PXE boot, printers, and Wi‑Fi controllers all expect specific options, so DHCP mistakes cause oddly specific outages.
- BGP over IPsec is common now for dynamic routing between sites, but it only works well when your addressing plan is sane to begin with.
IP addressing strategy: the part you must decide upfront
Rule 1: never allow overlapping subnets between sites
If Office A and Office B both use 192.168.1.0/24, you do not have a “VPN problem.” You have an identity problem.
The routing table cannot distinguish “192.168.1.50 in Office A” from “192.168.1.50 in Office B.” You’ll see:
asymmetric routing, ARP weirdness if you bridge, and intermittent connectivity if someone adds NAT as a band-aid.
The opinionated take: renumber one site unless you have a short-term business reason to use NAT as a bridge tactic.
NAT is sometimes necessary, but it’s never the clean end state.
Rule 2: allocate from a real plan, not vibes
Pick a private range large enough for growth, then carve by site and function. One simple pattern:
- 10.64.0.0/10 reserved for corporate internal networks.
- Each office gets a /16: Office A = 10.64.0.0/16, Office B = 10.65.0.0/16, etc.
- Within the /16, carve /24s: users, servers, voice, printers, Wi‑Fi, guest, management.
Why /16 per site? Because renumbering is expensive, and people underestimate growth and “temporary” networks.
It also makes routes simple and summarizable.
Rule 3: decide where the default gateway lives for each VLAN
If you have a router/firewall per site, keep default gateways local. A remote default gateway over VPN turns every
packet into WAN traffic. Your users will notice. Your incident queue will notice more.
Rule 4: document the authoritative source of truth
You need a minimal IPAM approach, even if it’s “a Git repo with YAML.” What matters is that there is one place
where subnets, DHCP scopes, reservations, and key static addresses are recorded and reviewed.
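If "a Git repo with YAML" sounds hand-wavy, here is a minimal sketch of what that file can look like. The layout and field names are invented for illustration; the prefixes follow the example plan above.

# sites.yaml: minimal IPAM source of truth (hypothetical layout)
office-a:
  prefix: 10.64.0.0/16
  vlans:
    users:    { subnet: 10.64.10.0/24, gateway: 10.64.10.1, dhcp_pool: 10.64.10.100-10.64.10.199 }
    servers:  { subnet: 10.64.20.0/24, gateway: 10.64.20.1, dhcp_pool: none }
    printers: { subnet: 10.64.30.0/24, gateway: 10.64.30.1, dhcp_pool: reservations-only }
office-b:
  prefix: 10.65.0.0/16
  vlans:
    users:    { subnet: 10.65.10.0/24, gateway: 10.65.10.1, dhcp_pool: 10.65.10.100-10.65.10.199 }

Review changes to this file the way you review code. The file is boring; that is exactly what you want.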
DHCP patterns over a VPN (and why most are bad)
Pattern A: DHCP stays local at each office (recommended)
Each site runs its own DHCP server (or HA pair) for its local subnets. The VPN only carries routed traffic.
This is the simplest failure domain: if Site B’s DHCP server dies, Site A does not care.
Centralize DNS if you want. Centralize authentication if you must. But keep DHCP close to the clients unless
you have strong operational reasons.
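A minimal sketch of Pattern A in ISC dhcpd terms, assuming Office A's user VLAN from the example plan (10.64.10.0/24) and the internal DNS servers used in later examples (10.64.20.10/.11); names and ranges are illustrative, not prescriptive.

# /etc/dhcp/dhcpd.conf on the Office A DHCP server (sketch)
authoritative;
default-lease-time 43200;        # 12 hours
max-lease-time 86400;            # 24 hours

subnet 10.64.10.0 netmask 255.255.255.0 {
  range 10.64.10.100 10.64.10.199;              # dynamic pool; keep .1-.99 for statics and reservations
  option routers 10.64.10.1;                    # local default gateway, not something across the VPN
  option domain-name-servers 10.64.20.10, 10.64.20.11;
  option domain-name "corp.example";
}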
Pattern B: central DHCP server with DHCP relay at each site (works, but be disciplined)
This can work if you want centralized scope management and consistent options. The key is that relays must be
correctly configured, and the VPN must be reliable. Remember: leases expire even if your WAN is down.
If you choose this, set longer lease times for remote sites, and make sure your relay agent information (Option 82)
is either used intentionally or disabled consistently. Option mismatch can make troubleshooting feel haunted.
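If you choose Pattern B anyway, the moving parts are a relay at the remote site and a scope for the remote subnet on the central server. A hedged sketch, assuming ISC dhcrelay/dhcpd, the central DHCP server at 10.64.10.5 from later examples, and Office B's user subnet:

# On the Office B router: relay DHCP from the local user VLAN to the central server
cr0x@server:~$ sudo dhcrelay -4 -i eth1 10.64.10.5

# On the central dhcpd: a scope for the remote subnet, with longer leases
subnet 10.65.10.0 netmask 255.255.255.0 {
  range 10.65.10.100 10.65.10.199;
  option routers 10.65.10.1;                 # Office B's local gateway
  default-lease-time 172800;                 # 48 hours, so brief WAN drops don't trigger mass renewals
}

The relay path rides the VPN, so monitor it like the dependency it is.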
Pattern C: bridge L2 across the VPN (don’t, unless you like pain)
Bridging means DHCP broadcast traverses the VPN, ARP traverses the VPN, and every “who has” becomes a cross-site
conversation. This makes both security and stability worse. It can be justified for niche cases (legacy protocols,
very small networks, short-lived migrations), but it should come with an explicit exit plan.
Pattern D: NAT to hide overlap (acceptable as a temporary escape hatch)
When subnets overlap and you cannot renumber immediately, you can NAT one side so the other sees a translated range.
It “works” in the sense that packets move. It fails in the sense that identity becomes weird: logs lie, ACLs become
awkward, and anything that embeds IPs (some VoIP, certain licensing systems, old-school SMB weirdness) may break.
Joke #2: NAT is like sweeping dirt under the rug—impressive until you trip over the rug.
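If Pattern D is unavoidable for a while, prefer a predictable 1:1 prefix translation over ad-hoc masquerading, so at least the mapping is mechanical. A hedged sketch using the iptables NETMAP target (verify it is available on your platform), assuming one site keeps 192.168.0.0/24 locally and is presented to the other site as 10.99.0.0/24, a made-up translation range, over a wg0 tunnel:

# On the overlapping site's VPN router (sketch)
# Inbound: remote traffic addressed to 10.99.0.x is rewritten to the real 192.168.0.x
cr0x@server:~$ sudo iptables -t nat -A PREROUTING -i wg0 -d 10.99.0.0/24 -j NETMAP --to 192.168.0.0/24
# Outbound: local sources 192.168.0.x appear as 10.99.0.x inside the tunnel
cr0x@server:~$ sudo iptables -t nat -A POSTROUTING -o wg0 -s 192.168.0.0/24 -j NETMAP --to 10.99.0.0/24

Remember that internal DNS has to hand out the translated addresses to the other site, which is exactly the kind of split-brain bookkeeping that keeps this a temporary tactic.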
DHCP scope design: avoid the subtle foot-guns
- Reserve ranges for statics (network gear, servers, hypervisors). Don’t “just pick something outside the pool” without documenting it.
- Use DHCP reservations for printers and special endpoints. Random printer IPs are how ticket queues achieve immortality.
- Standardize option sets (DNS servers, search domains, NTP). If you have multiple DHCP servers, keep the options consistent via config management.
- Keep lease times sane: 8–24 hours for user networks, longer for stable devices. Ultra-short leases create churn during WAN instability.
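To make the reservations bullet concrete, a small ISC dhcpd sketch with a hypothetical printer MAC, pinning the device to a documented address outside the dynamic pool:

# dhcpd.conf (sketch): printer reservation outside the dynamic range
host prn-a-floor2 {
  hardware ethernet 00:11:22:33:44:55;   # hypothetical printer MAC
  fixed-address 10.64.30.21;             # recorded in IPAM, outside any range statement
}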
Routing, NAT, and why “just NAT it” is a trap
Static routes vs dynamic routing
For two offices with a handful of subnets, static routes are fine. Add a third site, or multiple VLANs per site,
and static routes become a change-management tax. That’s when you consider:
- OSPF for internal routing if you control both ends and want fast convergence.
- BGP if you want explicit route control and clean summarization across tunnels.
Dynamic routing isn’t “enterprise theater.” It’s how you stop forgetting a route during a stressful migration.
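For the two-site static case, the entire routing "design" can be one route per side plus matching tunnel configuration; a sketch assuming WireGuard on wg0 and the example /16-per-site plan:

# Office A router: everything for Office B goes into the tunnel
cr0x@server:~$ sudo ip route add 10.65.0.0/16 dev wg0
# Office B router: the mirror image
cr0x@server:~$ sudo ip route add 10.64.0.0/16 dev wg0
# With WireGuard, each peer's AllowedIPs must also include the remote /16,
# or the kernel will route the packet and WireGuard will refuse to send it.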
Split tunnel vs full tunnel
Don’t hairpin Internet traffic through a VPN “because security.” That’s a policy choice with a performance bill.
If you do it, do it knowingly: size the links, tune MTU, monitor, and accept that voice/video will complain first.
MTU: the silent saboteur
IPsec adds overhead. GRE adds overhead. WireGuard adds overhead. If you don’t adjust MTU or clamp MSS, you get
fragmentation and PMTUD issues. The classic symptom: small pings work, large transfers stall, and HTTPS sometimes
times out in ways that feel random.
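A hedged sketch of the two usual fixes, assuming WireGuard on wg0 (where an effective MTU around 1420 is typical on a 1500-byte WAN) and an nftables firewall:

# Set the tunnel MTU explicitly so endpoints stop assuming 1500
cr0x@server:~$ sudo ip link set dev wg0 mtu 1420

# Clamp TCP MSS to the path MTU for traffic forwarded into the tunnel (nftables)
cr0x@server:~$ sudo nft add rule inet filter forward oifname "wg0" tcp flags syn tcp option maxseg size set rt mtu

MSS clamping only helps TCP; UDP-based protocols still depend on the interface MTU and working PMTUD, which is why voice and QUIC complain first.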
Security boundaries
A VPN is not a permission model. You still need firewall policy: which subnets can talk, which ports are allowed,
where admin interfaces live, and how you log denies. “We put them on a VPN so they can access everything” is how
you turn one compromised office PC into a multi-site incident.
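As a design artifact, the policy can stay small and explicit. A sketch in nftables terms, in the same spirit as the ruleset inspected in Task 12 below: default-deny forwarding, then only the flows your subnet-to-subnet matrix actually calls for.

# nftables sketch: Office B users to Office A servers, nothing else forwarded
table inet filter {
  chain forward {
    type filter hook forward priority filter; policy drop;
    ct state established,related accept
    # only the ports the matrix allows (illustrative: HTTPS and SMB)
    iifname "wg0" ip saddr 10.65.10.0/24 ip daddr 10.64.20.0/24 tcp dport { 443, 445 } accept
    # count and log what the policy rejects, so denies are visible during incidents
    counter log prefix "fwd-drop " drop
  }
}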
Three mini-stories from corporate life
Mini-story 1: an incident caused by a wrong assumption
A mid-size company acquired a smaller firm and rushed to “connect the offices” so finance could access the ERP
system. The network diagrams looked clean. The VPN came up on the first try. Everyone congratulated the team and
went back to their calendars.
Within hours, helpdesk got tickets: printers in the acquired office were “offline,” then “online,” then “offline”
again. VoIP phones rebooted at random. A few laptops could reach internal systems; others couldn’t. The VPN was blamed.
The wrong assumption was simple: both sides used 192.168.0.0/24 for the main office LAN, because of course they did.
Routing tables on both firewalls had identical connected networks. So each side believed 192.168.0.50 was local.
Packets never even entered the tunnel. People tried adding static routes (which didn’t help) and tweaking IPsec lifetimes
(which also didn’t help).
The fix was not glamorous: renumber the acquired office to a unique subnet, then update DHCP scopes, printer configs,
and a handful of static devices. During the transition, they used temporary NAT to allow a few critical systems to talk.
Once renumbering finished, the “VPN instability” vanished overnight.
The lesson: the tunnel isn’t your identity layer. IP addressing is. Treat it like one.
Mini-story 2: an optimization that backfired
A different org wanted to “simplify management” by centralizing DHCP in the main data center. The remote office would
just run a relay. It sounded clean: one scope configuration, one place to audit options, one change process.
Then a WAN provider had a bad week. Short drops—30 seconds here, two minutes there. Nothing long enough to trigger a
full outage declaration. Long enough to be infuriating.
What failed wasn’t existing clients; it was churn. Phones rebooted after power flickers, laptops moved between SSIDs,
and Wi‑Fi clients tried to renew. Relays couldn’t reach the central DHCP server during drops. Clients fell back to
APIPA addresses or held stale leases that no longer matched DNS. It looked like “Wi‑Fi is flaky,” because that’s how
it surfaced to humans.
They “optimized” further by lowering DHCP lease times to make IP reuse more efficient. That increased renewal frequency,
which increased dependency on the unstable WAN. The team built a high-frequency failure machine and then wondered why
it was loud.
The fix: move DHCP back on-site (or add a local failover partner), increase lease times, and treat centralization as a
tool—one that has reliability requirements. Centralize when the link is as boring as power. Otherwise, keep it local.
Mini-story 3: the boring but correct practice that saved the day
A regulated company had a policy that felt bureaucratic: every subnet allocation went into an IPAM spreadsheet backed by
a Git repo, reviewed by two people. DHCP scope changes required a short change ticket and a pre/post validation checklist.
No one loved it. Everyone complied because audits were real.
During a hurried office move, a contractor installed a “temporary” router with DHCP enabled on a lab bench. In many places,
that’s where the story turns into a two-day outage. Here, it turned into a 20-minute inconvenience.
The on-call engineer followed the checklist: verify DHCP offers, identify the server IP, find the switchport via MAC table,
shut the port, then validate renewal on a test client. The IPAM records made it obvious which scopes were legitimate and which
weren’t. The change ticket history showed a planned scope update that hadn’t been applied yet, avoiding a second misstep.
The incident review was almost boring. That’s the point. The “process tax” was paid upfront, so the outage bill was small.
Fast diagnosis playbook
When “VPN + DHCP” goes sideways, you can waste hours guessing. Don’t. Triage like you mean it.
Your goal is to find the bottleneck class quickly: addressing overlap, routing, DHCP authority, DNS, or MTU.
First: confirm whether you have overlapping networks
- Compare the local site LAN(s) with the remote site LAN(s). If any overlap, stop and plan remediation or NAT.
- Check routes: if the same prefix exists as “connected” on both ends, you have a design conflict.
Second: confirm routing symmetry and firewall policy
- Can Site A reach Site B’s router interface? Can Site B reach Site A’s?
- Do return routes exist? Is policy allowing the traffic both directions?
- Check for asymmetric NAT rules that rewrite only one direction.
Third: decide if it’s DHCP, DNS, or MTU
- DHCP problem if clients are getting wrong gateway/DNS, duplicate IP warnings, or APIPA addresses.
- DNS problem if ping by IP works but names fail, or internal names resolve to public/incorrect targets.
- MTU problem if small pings work, large payloads fail, and HTTPS/SMB are flaky.
Fourth: capture packets at the right place
Capturing on the client is good. Capturing on the router interface facing the LAN is better.
Capturing on the tunnel interface is best when you’re proving what crossed the VPN.
Practical tasks with commands (what to run, what it means, what to decide)
These tasks assume Linux-based routers/servers for examples. The ideas translate directly to pfSense/OPNsense, Windows DHCP,
and managed firewalls; the point is what you check and why.
Task 1: list local interfaces and subnets (spot the obvious overlaps)
cr0x@server:~$ ip -br addr
lo UNKNOWN 127.0.0.1/8 ::1/128
eth0 UP 10.64.10.2/24
eth1 UP 10.64.20.1/24
What it means: this host is directly connected to 10.64.10.0/24 and 10.64.20.0/24.
Decision: compare these prefixes to the remote office prefixes. If any overlap, you’re in remediation mode, not tuning mode.
Task 2: inspect the routing table (prove where traffic will go)
cr0x@server:~$ ip route
default via 10.64.10.1 dev eth0
10.64.10.0/24 dev eth0 proto kernel scope link src 10.64.10.2
10.64.20.0/24 dev eth1 proto kernel scope link src 10.64.20.1
10.65.0.0/16 via 10.64.10.254 dev eth0
What it means: traffic to 10.65.0.0/16 (remote office) goes via 10.64.10.254 (likely the VPN router).
Decision: if the remote office networks aren’t present here, add routes (static or dynamic). If you see a more specific route that points somewhere else, fix that conflict first.
Task 3: verify tunnel status (don’t troubleshoot DHCP through a down tunnel)
cr0x@server:~$ sudo wg show
interface: wg0
public key: 2lQf...redacted...
listening port: 51820
peer: 3mHf...redacted...
endpoint: 203.0.113.10:51820
allowed ips: 10.65.0.0/16
latest handshake: 42 seconds ago
transfer: 1.23 GiB received, 980 MiB sent
What it means: WireGuard is up and exchanging traffic recently.
Decision: if handshakes are stale, fix the tunnel first (keys, firewall, NAT, reachability). No amount of DHCP logic matters until packets flow.
Task 4: test reachability and path (confirm routing and ACLs)
cr0x@server:~$ ping -c 3 10.65.10.1
PING 10.65.10.1 (10.65.10.1) 56(84) bytes of data.
64 bytes from 10.65.10.1: icmp_seq=1 ttl=63 time=22.1 ms
64 bytes from 10.65.10.1: icmp_seq=2 ttl=63 time=21.7 ms
64 bytes from 10.65.10.1: icmp_seq=3 ttl=63 time=22.4 ms
--- 10.65.10.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2004ms
What it means: L3 connectivity exists at least to the remote gateway.
Decision: if ping fails, check routes and firewall. If ping works but applications fail, you’re likely dealing with DNS, MTU, or port-level filtering.
Task 5: test MTU with “do not fragment” pings (find the black hole)
cr0x@server:~$ ping -M do -s 1472 -c 2 10.65.10.1
PING 10.65.10.1 (10.65.10.1) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1420
--- 10.65.10.1 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1026ms
What it means: a full 1500-byte packet does not fit; here the local tunnel interface already reports an MTU of 1420, which is typical for WireGuard. If the limit were further along the path, you would see timeouts or ICMP "frag needed" instead of a local error.
Decision: clamp TCP MSS on the tunnel or set appropriate MTU on interfaces. If you ignore this, you’ll keep “debugging” flaky SMB/HTTPS forever.
Task 6: show DHCP leases on a Linux DHCP server (validate scope behavior)
cr0x@server:~$ sudo tail -n 8 /var/lib/dhcp/dhcpd.leases
lease 10.64.10.101 {
  starts 5 2025/12/27 10:22:10;
  ends 5 2025/12/27 22:22:10;
  cltt 5 2025/12/27 10:22:10;
  binding state active;
  hardware ethernet 3c:52:82:aa:bb:cc;
  client-hostname "laptop-hr";
}
What it means: the server is handing out addresses in the expected subnet to a known MAC.
Decision: if leases show addresses from the wrong scope or unexpected subnets, you’ve misconfigured scopes or relays (or you have a rogue DHCP server elsewhere).
Task 7: sniff for DHCP offers (catch rogue DHCP and mis-relays)
cr0x@server:~$ sudo tcpdump -ni eth0 -vvv 'port 67 or port 68' -c 6
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:31:00.102345 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 3c:52:82:aa:bb:cc, length 300
10:31:00.105210 IP 10.64.10.5.67 > 255.255.255.255.68: BOOTP/DHCP, Reply, length 300, yiaddr 10.64.10.101, Server-ID 10.64.10.5
10:31:00.106901 IP 10.64.10.250.67 > 255.255.255.255.68: BOOTP/DHCP, Reply, length 300, yiaddr 10.64.10.199, Server-ID 10.64.10.250
What it means: two different DHCP servers are offering leases (10.64.10.5 and 10.64.10.250).
Decision: find and disable the rogue server or isolate it. In a multi-site VPN scenario, confirm you didn’t accidentally bridge L2 or relay DHCP where you shouldn’t.
Task 8: check if a DHCP relay is configured (and where it forwards)
cr0x@server:~$ ps aux | grep -E 'dhcrelay|isc-dhcrelay' | grep -v grep
root 1123 0.0 0.1 18720 4120 ? Ss 10:01 0:00 /usr/sbin/dhcrelay -4 -i eth1 10.64.10.5
What it means: this box relays DHCP requests from interface eth1 to DHCP server 10.64.10.5.
Decision: if that server is across the VPN, ensure the relay has stable routing, firewall rules permit UDP 67/68, and lease times are appropriate for WAN dependency.
Task 9: validate DNS resolution from a remote site (separate DNS from routing)
cr0x@server:~$ resolvectl status | sed -n '1,25p'
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.64.20.10
DNS Servers: 10.64.20.10 10.64.20.11
DNS Domain: corp.example
Link 2 (eth0)
Current Scopes: DNS
What it means: the system uses internal DNS servers (10.64.20.10/.11).
Decision: if a remote office client points to a DNS server it can’t reach (because of missing routes or firewall), DHCP is “working” but delivering unusable configuration. Fix the DHCP options or routing.
Task 10: prove name resolution vs connectivity (stop blaming the VPN for DNS)
cr0x@server:~$ getent hosts fileserver.corp.example
10.65.20.20 fileserver.corp.example
What it means: DNS (or NSS) resolves the name to a remote IP.
Decision: if resolution fails but IP ping works, you need to fix DNS (forwarders, split-horizon, search domains, or DHCP option 15/6). If resolution works but connection fails, it’s routing/firewall/MTU.
Task 11: trace the route (spot asymmetric or unexpected hops)
cr0x@server:~$ tracepath 10.65.20.20
1?: [LOCALHOST] pmtu 1500
1: 10.64.10.254 1.012ms
2: 10.65.0.1 22.104ms
3: 10.65.20.20 23.001ms reached
Resume: pmtu 1420 hops 3 back 3
What it means: traffic crosses the VPN router and reaches the destination; PMTU is 1420.
Decision: if the hop list shows Internet hairpinning or an unexpected router, fix your routes. If PMTU is low and apps break, clamp MSS.
Task 12: check firewall counters for dropped traffic (prove policy issues)
cr0x@server:~$ sudo nft list ruleset | sed -n '1,80p'
table inet filter {
  chain forward {
    type filter hook forward priority filter; policy drop;
    ct state established,related accept
    iifname "wg0" oifname "eth1" ip saddr 10.65.0.0/16 ip daddr 10.64.20.0/24 accept
    counter packets 0 bytes 0 drop
  }
}
What it means: the default forward policy is drop, only a specific wg0→eth1 flow is allowed, and the final counting drop rule records everything else (still zero here).
Decision: if counters show drops for expected flows, adjust policy intentionally. If you “temporarily allow all” to test, remember to revert; temporary rules have a habit of becoming permanent architecture.
Task 13: detect duplicate IPs from an endpoint (ARP tells on you)
cr0x@server:~$ ip neigh show dev eth0 | grep '10.64.10.101'
10.64.10.101 lladdr 3c:52:82:aa:bb:cc REACHABLE
What it means: the IP maps to one MAC right now.
Decision: if this MAC changes frequently without the device moving, you may have IP conflicts or a rogue DHCP server. Correlate with DHCP logs and switch MAC tables.
Task 14: identify the switchport for a suspicious MAC (find the rogue box)
cr0x@server:~$ ssh admin@switch-a01 "show mac address-table address 3c:52:82:aa:bb:cc"
Mac Address Table
-------------------------------------------
Vlan Mac Address Type Ports
---- ----------- -------- -----
10 3c:52:82:aa:bb:cc DYNAMIC Gi1/0/24
What it means: the MAC is learned on port Gi1/0/24 in VLAN 10.
Decision: if that port leads to an unmanaged switch or a “temporary” router, shut it down or move it to an isolated VLAN. Then re-test DHCP offers.
Task 15: validate DHCP server identity from a Windows client (common in offices)
cr0x@server:~$ ssh user@winclient "ipconfig /all | findstr /I \"DHCP Server IPv4 Address\""
DHCP Server . . . . . . . . . . . : 10.64.10.5
IPv4 Address. . . . . . . . . . . : 10.64.10.101
What it means: the client is leasing from 10.64.10.5.
Decision: if the DHCP Server changes between renewals, you have multiple DHCP responders. That’s a containment problem, not a “VPN tuning” problem.
Task 16: confirm DHCP options delivered (gateway/DNS mistakes look like VPN outages)
cr0x@server:~$ sudo dhclient -v -1 eth0
DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
DHCPOFFER of 10.64.10.120 from 10.64.10.5
DHCPREQUEST for 10.64.10.120 on eth0 to 255.255.255.255 port 67
DHCPACK of 10.64.10.120 from 10.64.10.5
bound to 10.64.10.120 -- renewal in 36000 seconds.
What it means: DHCP negotiation succeeded.
Decision: inspect the applied route and resolv.conf after this. If the gateway points to a remote site over VPN by accident, fix the scope options; don’t blame the tunnel for doing what you told it to.
Common mistakes: symptom → root cause → fix
1) “Some users can reach the remote office, some can’t”
- Symptom: intermittent access; one desk works, another doesn’t; reboot “fixes” it briefly.
- Root cause: overlapping subnets, or multiple DHCP servers assigning different gateways/DNS.
- Fix: eliminate overlap by renumbering; sniff DHCP offers; shut down rogue DHCP; standardize scope options.
2) “VPN is up but nothing routes”
- Symptom: tunnel shows connected/handshake OK, but remote subnets unreachable.
- Root cause: missing routes (or missing allowed IPs in WireGuard), or firewall forward policy blocking.
- Fix: add explicit routes on both sides; verify forward/NAT rules; confirm return path symmetry with tracepath.
3) “DNS works locally, fails from the other office”
- Symptom: ping by IP works; hostnames time out; or resolve to wrong targets.
- Root cause: DHCP provides unreachable DNS servers; split DNS misconfigured; firewall blocks TCP/UDP 53 across VPN.
- Fix: ensure remote sites can reach DNS servers; add local caching resolvers if needed; permit DNS across VPN; validate with getent/dig.
4) “SMB/HTTPS is flaky, but ping works”
- Symptom: file copies stall; web apps partially load; RDP sometimes freezes.
- Root cause: MTU/PMTUD black hole; missing MSS clamping.
- Fix: clamp TCP MSS on the tunnel; set interface MTU appropriately; retest with ping -M do and tracepath PMTU.
5) “After connecting the VPN, printers started duplicating or changing IPs”
- Symptom: printer IPs change; print queues break; duplicate IP warnings appear.
- Root cause: printers that still hold an old static config while also taking DHCP leases; two sites using the same printer subnet; or DHCP reservations not used.
- Fix: reserve printer MACs; keep printers in a dedicated VLAN per site; avoid overlapping printer subnets; document statics in IPAM.
6) “Guest Wi‑Fi can see corporate resources after the VPN project”
- Symptom: guest clients reach internal IPs or services.
- Root cause: overly broad VPN encryption domains/allowed IPs; firewall rules not segmented; routes leaked between VRFs/VLANs.
- Fix: restrict VPN selectors/allowed IPs to required subnets; enforce inter-VLAN firewall policy; explicitly block guest→corp over the tunnel.
7) “Everything breaks during WAN outages, then recovers slowly”
- Symptom: after brief drops, lots of endpoints lose connectivity; recovery takes hours.
- Root cause: centralized DHCP dependent on WAN; short lease times; remote DNS dependency with no local cache.
- Fix: keep DHCP local or add local failover; lengthen leases; deploy local caching resolvers; treat WAN as unreliable until proven otherwise.
8) “We fixed overlap with NAT, now logs and ACLs are weird”
- Symptom: security tools show the NAT gateway as the source; per-host rules don’t work cleanly; troubleshooting is harder.
- Root cause: NAT hides real addresses by design; you traded routing clarity for translation complexity.
- Fix: use NAT only as a time-boxed migration tool; implement renumbering; update ACLs to match real addressing; improve logging with NAT translation logs during transition.
Checklists / step-by-step plan
Step-by-step: designing a new office-to-office VPN without future regret
- Inventory subnets at both sites. List every VLAN/subnet, including guest Wi‑Fi, voice, printers, management, and “temporary” lab networks.
- Pick a consistent address plan. Allocate at least a /16 per site if you can. Write it down in your source of truth.
- Define what must be reachable. Subnet-to-subnet matrix: who needs to talk to whom, on which ports. Default deny, then allow deliberately.
- Choose DHCP model. Prefer local DHCP per site. If you centralize, commit to relay configuration, monitoring, and longer leases.
- Decide DNS architecture. Central resolvers are fine if reachable; otherwise use local caches forwarding to central. Plan split DNS for internal zones.
- Pick routing strategy. Static routes for small and stable. Dynamic routing when growth or change frequency is high.
- Engineer MTU upfront. Set tunnel MTU/MSS clamping early. Validate with do-not-fragment tests.
- Lock down VPN selectors/allowed IPs. Only advertise the internal prefixes you intend to route. No “0.0.0.0/0 because it worked.”
- Monitoring and logs. Tunnel up/down, latency, packet loss, DHCP server health, DNS latency, firewall denies. If you can’t see it, you can’t operate it.
- Run a pre-cutover test plan. DHCP lease, DNS resolution, reachability to key services, and a file transfer test (SMB/HTTPS) to catch MTU.
Step-by-step: when you already have overlapping subnets
- Confirm overlap and identify the minimum blast radius. Which prefixes overlap? Which systems must communicate now?
- Choose a remediation path:
- Best: renumber one site (or at least the subnets that need cross-site connectivity).
- Temporary: NAT one side to a translation range that does not overlap.
- Avoid: bridging to “make it one LAN.” That’s a merger of broadcast domains, not a solution.
- If renumbering: create new DHCP scopes, update gateways, DNS, and reservations; migrate VLAN by VLAN; keep a rollback plan.
- If using NAT: log translations, document the mapping, and set an expiration date. Make sure applications that depend on source IP are reviewed.
- After remediation: remove temporary NAT/rules, tighten firewall, and update IPAM/source-of-truth.
Operational checklist: preventing DHCP chaos
- Enable DHCP snooping on access switches where available, and trust only uplinks to your real DHCP servers/relays (a vendor-style sketch follows this checklist).
- Keep DHCP servers authoritative for their scopes; avoid duplicate scope definitions in multiple places unless using true failover.
- Standardize DHCP option sets and version-control them.
- Use reservations for “semi-static” devices (printers, phones, badge readers).
- Audit for rogue DHCP quarterly (or after office moves).
- Monitor lease exhaustion and alert before the pool hits the wall.
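For the DHCP snooping item above, a hedged sketch in Cisco IOS-style syntax; command names vary by vendor and platform, so treat this as the shape of the change rather than copy-paste:

switch(config)# ip dhcp snooping
switch(config)# ip dhcp snooping vlan 10
switch(config)# interface Gi1/0/48
switch(config-if)# description uplink-to-core
switch(config-if)# ip dhcp snooping trust
! Access ports stay untrusted by default, so a "temporary" router on a lab bench
! can no longer answer DHCP for the whole VLAN.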
FAQ
1) Can I run one DHCP server for both offices over a VPN?
Yes, via DHCP relay. But you’re moving a critical control plane dependency onto your WAN. If the WAN is not extremely stable,
keep DHCP local or deploy local failover.
2) Why can’t DHCP just “cross the VPN” naturally?
DHCP uses broadcast for discovery on a LAN. Routed VPNs don’t carry broadcasts. You need a relay, or you need to bridge L2,
and bridging is usually the wrong trade.
3) What’s the cleanest fix for overlapping subnets?
Renumber one side to a unique prefix. It’s work, but it buys you correctness. NAT is acceptable as a time-boxed migration tactic,
not as your permanent personality.
4) If we must NAT, what should we watch out for?
Logging and access control become less granular, some apps embed IPs, and troubleshooting gets harder because packet captures show
translated identities. Document the mapping and plan to remove it.
5) Why do small pings work but file transfers fail across the VPN?
Classic MTU/PMTUD issue. Encapsulation reduces effective MTU. Without MSS clamping or correct MTU settings, larger packets get dropped,
and TCP stalls in ways that look like “random slowness.”
6) How do I detect a rogue DHCP server quickly?
Run a packet capture for DHCP offers on the affected VLAN and look for multiple Server-ID addresses. Then locate the MAC on the switch
MAC table and shut down or isolate that port.
7) Should both offices share the same DNS servers?
They can, but ensure reachability and low latency. A good pattern is local caching resolvers in each office that forward to central
authoritative servers. That reduces VPN dependency for every single DNS query.
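One way to implement that pattern is a small Unbound instance per office; a sketch reusing earlier example addresses, with the Office B resolver caching locally and forwarding the internal zone to the central DNS servers:

# /etc/unbound/unbound.conf.d/corp.conf (sketch)
server:
    interface: 0.0.0.0
    access-control: 10.65.0.0/16 allow      # answer Office B clients only
forward-zone:
    name: "corp.example"
    forward-addr: 10.64.20.10               # central internal DNS
    forward-addr: 10.64.20.11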
8) Static routes or dynamic routing for a small company?
Two sites, a couple subnets: static routes are fine. More sites, frequent changes, or lots of VLANs: use OSPF or BGP so you stop
forgetting routes during urgent work.
9) What lease time should we use for remote office clients?
If DHCP is local, standard lease times (8–24 hours for user networks) are fine. If DHCP is centralized across VPN, use longer leases
so brief WAN issues don’t trigger mass renewals and churn.
10) Is bridging ever acceptable across a site-to-site VPN?
Rarely, and usually temporarily: niche legacy needs, short migrations, or very small environments where you accept the risks knowingly.
If you bridge, document the exit plan on day one.
Conclusion: next steps that actually reduce risk
Office-to-office VPN projects fail in predictable ways. Not because tunnels are mysterious, but because addressing, DHCP,
DNS, and routing are governance problems pretending to be connectivity problems.
Do these next, in order:
- Write down every subnet at every site and eliminate overlap. If you can’t eliminate it immediately, use NAT with an explicit retirement plan.
- Choose your DHCP model deliberately: local per site by default; relay to central only when the WAN and ops maturity justify it.
- Lock routing and firewall policy to exactly what you need, then monitor drops and tunnel health.
- Validate MTU/MSS early so you don’t spend a week blaming “the VPN” for a packet-size math problem.
- Make it boring: a small IPAM source of truth, repeatable checklists, and packet-capture competence beat heroics every time.
If you do this well, the VPN becomes what it was always supposed to be: an unremarkable transport. Your users won’t notice it.
That’s success.