You connect two offices (or acquire a company), bring up the site-to-site VPN, and suddenly printers vanish, file shares map to the wrong servers, and someone insists “it worked yesterday.” The culprit is usually boring: both sites used the same RFC1918 ranges—most often 10.0.0.0/8 or 192.168.0.0/16—and now the routing table is forced to pick a favorite.
You don’t have time to rebuild every DHCP scope, static IP, firewall rule, and third-party whitelist across two org charts. You need something that works this quarter. Preferably this week. Here are three production-grade solutions that don’t require renumbering the world, plus how to diagnose the mess, prove which fix fits, and deploy without inventing new outages.
What “overlapping subnets” actually breaks (and why)
Overlapping subnets aren’t “a routing problem” in the abstract. They’re a name collision problem with real consequences. If both offices have 10.10.0.0/16 and you try to route between them, a host at Site A can’t distinguish “10.10.1.20 at Site A” from “10.10.1.20 at Site B.” Your routers can’t either—at least not with classic IP routing semantics.
What you’ll see in production
- Asymmetric reachability: A can reach B’s server sometimes; replies vanish because the return path resolves to the local twin.
- ARP and neighbor cache lies: On L2 extensions (please don’t), MAC flaps and “duplicate IP” logs appear because the same IP exists twice.
- NAT/Firewall policy ambiguity: “Allow 10.10.1.0/24” becomes a coin flip unless you’re explicit about which 10.10.1.0/24 you mean.
- DNS makes it worse: If both sites publish internal names to the same zone, you get “correct” answers pointing at the wrong twin.
- Monitoring becomes fiction: Your NMS pings 10.10.1.20 and thinks the remote system is up—because it pinged the local one.
The usual human response is denial. The network is “up” because the tunnel shows green. But overlap doesn’t kill the tunnel; it kills meaning. And systems run on meaning.
Opinionated guidance: Don’t attempt to “make routing prefer the remote site.” You can’t route your way out of a namespace collision without adding a new namespace (NAT, VRF, or overlay identities).
Fast diagnosis playbook (first/second/third checks)
This is the fastest path to proving the overlap and choosing the least-terrible fix. If you run these checks in order, you stop arguing based on vibes.
1) Confirm overlap is real (not just a firewall block)
Pick one “problem IP” and verify whether it exists in both places. Check ARP tables, DHCP leases, or local ping behavior.
- If a host pings “remote” 10.x.x.x even when the VPN is down, you’re not reaching remote anything. You’re hitting the local twin.
- If traceroute never enters the tunnel and stays inside the site, you’re routing locally to the twin network.
2) Identify what must talk across sites (and what can be ignored)
Inventory flows: AD, DNS, file services, ERP, VoIP, printer subnets, RDP/SSH, monitoring, backup replication. Then rank by business impact and required bidirectionality.
- Bidirectional protocols (SMB, AD, Kerberos, VoIP) are less tolerant of kludges.
- Unidirectional access (users to a web app) can often be solved more quickly with NAT.
3) Check where policy can be enforced: edge firewall, core router, or hosts
Your best solution depends on where you can control the namespace and route selection.
- If you control both edges: NAT or VRF are usually fastest.
- If you control only one side (partner, vendor, acquired site still “independent”): NAT is your lever.
- If you need future-proof multi-site scaling: overlay is expensive but clean.
Once you know (a) the overlap scope, (b) the required flows, and (c) your control points, you can choose a solution without turning your VPN into a haunted house.
Three working solutions (no rebuild required)
All three solutions add a new namespace boundary so identical addresses stop colliding. They differ in where the boundary lives and how much operational debt you accept.
Decision matrix (what to pick and when)
- Pick NAT when you need results fast, you can tolerate address translation, and your main goal is “users here reach servers there.”
- Pick VRFs when you need clean separation, you have decent routing gear, and you want minimal translation weirdness.
- Pick an overlay when you need to connect many sites or integrate cloud/on-prem cleanly and you’re done with duct tape.
You can combine them, too. In real life you often will. Just don’t stack abstractions thoughtlessly. Every added layer is one more place for 3 a.m. lies to hide.
Solution 1: NAT (a.k.a. make the other site look different)
NAT is the blunt instrument that works because it changes the namespace. If Site B’s 10.10.0.0/16 conflicts with Site A’s 10.10.0.0/16, you can map Site B to a “virtual” range—say 172.20.10.0/24 or 100.64.10.0/24—only across the tunnel. To Site A, Site B is no longer 10.10.0.0/16. Problem solved, at least for IP routing.
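To make that concrete: iptables’ NETMAP target does exactly this kind of 1:1 prefix translation. A minimal sketch, assuming the rules live on Site B’s VPN edge, the tunnel interface is ipsec0, and only one /24 slice needs translating (all three are assumptions):
# Inbound from the tunnel: rewrite the virtual range to the real local range.
iptables -t nat -A PREROUTING -i ipsec0 -d 172.20.10.0/24 -j NETMAP --to 10.10.1.0/24
# Outbound into the tunnel: present local hosts under the virtual range.
iptables -t nat -A POSTROUTING -o ipsec0 -s 10.10.1.0/24 -j NETMAP --to 172.20.10.0/24
NETMAP preserves host bits, so 10.10.1.20 is always 172.20.10.20 from Site A’s point of view. That determinism is what lets DNS records and firewall objects stay sane.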
Two NAT patterns that actually work
- Bi-directional static NAT (1:1 or many:many): Best when servers need to be reachable from both sides with stable addresses.
- Source NAT (SNAT) on one direction: Best when only one side initiates connections and you can keep the remote side mostly unaware.
Where NAT belongs: at the boundary device that also participates in the VPN (firewall/router). Don’t sprinkle NAT rules on random internal hops unless you enjoy troubleshooting state tables across multiple boxes.
What NAT breaks (plan for this)
- IP-based ACLs and allowlists: Remote systems will see translated addresses. Update policies or use consistent mappings.
- Protocols that embed IPs: Some legacy apps, SIP without proper helpers, certain FTP modes. Modern stacks are better, but don’t assume.
- Kerberos/AD edge cases: AD can work through NAT, but you must be careful with name resolution, SPNs, and site topology. If you can avoid NAT in the middle of DC-to-DC replication, do.
Joke #1: NAT is like putting name tags on identical twins—useful until someone decides to swap shirts for fun.
When NAT is the best answer
If you’re in a merger and you need “access to a handful of systems” more than “a perfectly integrated enterprise network,” NAT is your friend. It buys time. Time is what you spend later on renumbering, segmentation, or a proper overlay.
What to avoid: “Temporary NAT” that becomes permanent without documentation. If you do NAT, treat it as a product: clear translation maps, consistent subnets, logging, and a rollback plan.
Solution 2: VRFs / segmentation (keep both truths separate)
VRFs (Virtual Routing and Forwarding instances) give you multiple routing tables on the same hardware. That means you can have 10.10.0.0/16 in VRF-A and 10.10.0.0/16 in VRF-B, and they won’t collide because they live in different routing universes.
VRFs are the “grown-up” solution when you have routing control and you want to preserve original IP addresses end-to-end. They’re also the correct solution when you need multiple overlapping tenants (common in MSSPs, large enterprises, or transitional mergers).
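On Linux routers, the moving parts are small enough to lab before touching production gear. A minimal sketch, assuming eth1 faces the acquired site, table 100 is unused, and 192.0.2.1 is a transit next hop (all assumptions):
# Create a separate routing universe and put the remote-facing interface in it.
ip link add vrf-siteb type vrf table 100
ip link set vrf-siteb up
ip link set dev eth1 master vrf-siteb
# Routes for the overlapping prefix now live only in table 100.
ip route add 10.10.0.0/16 via 192.0.2.1 dev eth1 table 100
ip route show vrf vrf-siteb    # verify the VRF's own table
The same 10.10.0.0/16 can now exist in the main table (or another VRF) without either side ever seeing the other.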
Two VRF deployment patterns
- VRF at the edge (recommended): Put the remote site in its own VRF on the VPN router/firewall. Import/export only the needed routes.
- VRF in the core (bigger change): If overlap exists inside campus cores, you may need VRFs closer to distribution to prevent leakage.
VRFs don’t magically solve everything
VRFs isolate routing, not identity. If you still need apps in VRF-A to talk to identical-address hosts in VRF-B, you need a bridge mechanism: route leaking with NAT, or a proxy, or application-level changes. VRF is a containment strategy first, a connectivity strategy second.
Opinionated guidance: If your goal is “connect two overlapping sites,” VRF alone won’t do it unless you also introduce some translation or application proxies. If your goal is “connect both sites to shared services without merging them,” VRF is perfect. In corporate reality, that second goal comes up more often than people admit.
Failure modes to anticipate
- Route leaking mistakes: One leaked default route later and you’ve built a surprise backhaul.
- Security policy drift: Firewalls need VRF-aware policies. If your tooling isn’t VRF-aware, auditors will learn new words.
- Operational tooling gaps: NMS, flow logs, and packet captures must be VRF-scoped or you’ll troubleshoot ghosts.
Solution 3: Overlay networks (route above the mess)
Overlays give endpoints a new address space (or a new identity) that is independent of the underlay. Think VXLAN/EVPN, SD-WAN fabric addressing, WireGuard-based mesh with assigned “overlay IPs,” or even L3-only application networks. The point is consistent identity across sites without caring what underlay subnets are doing.
If NAT is a screwdriver and VRF is a toolbox drawer, an overlay is a whole new workbench. It’s not the quickest fix, but it’s the one that scales when you add more sites, more clouds, and more “temporary” exceptions.
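As one concrete flavor: a WireGuard-based mesh with assigned overlay IPs takes only a few commands per host. The key path, peer key placeholder, endpoint, and ranges below are assumptions, chosen to match the examples later in this article:
ip link add wg0 type wireguard
wg set wg0 listen-port 51820 private-key /etc/wireguard/wg0.key
wg set wg0 peer <peer-public-key> endpoint 203.0.113.10:51820 allowed-ips 100.64.10.0/24 persistent-keepalive 25
ip addr add 100.64.10.2/24 dev wg0   # this host's overlay identity
ip link set wg0 up mtu 1420          # leave headroom for encapsulation
The address 100.64.10.2 exists only in the overlay, so it cannot collide with either site’s 10.10.0.0/16, no matter how messy the underlay is.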
Overlay patterns that are realistic without a rebuild
- Dedicated overlay subnet for inter-site services: You don’t overlay everything—just the servers that must be reachable across sites. Each server gets an additional IP (or a loopback) in the overlay range.
- Service gateways: Instead of touching every endpoint, you deploy gateways per site that advertise overlay routes and proxy/route to local services.
- SD-WAN with segmentation: Many SD-WAN products support overlapping underlay LANs by mapping them into distinct VPN/VRF-like segments and advertising non-overlapping “virtual” routes.
What overlays cost you
- Operational complexity: Control plane, encryption, MTU, encapsulation overhead, and telemetry. You’re adding moving parts.
- MTU fragility: Encapsulation reduces effective MTU. If you don’t test PMTUD behavior, you will break something “random.”
- Skill requirements: Your team needs to be comfortable reading route tables, EVPN advertisements, or mesh peer status—not just “is the tunnel up.”
Joke #2: Overlays are great until someone says “it’s just networking” and then changes the MTU in one place.
Why overlays are often the best long-term answer
Because they decouple business integration from IP hygiene. You can merge companies, move workloads, and survive imperfect legacy LANs while still providing consistent connectivity for the things that matter. If you’re building a multi-site platform (not just a one-time merger), overlays reduce the amount of “special casing” you do per site.
Three corporate mini-stories from the field
1) The incident caused by a wrong assumption: “10.50.0.0/16 is unique. Surely.”
The integration team connected two offices over IPsec. Both sides had neat spreadsheets. Both sides swore their internal ranges were “unique.” The tunnel came up; monitoring lit green. Everyone went home early, which is always suspicious.
Monday morning, the help desk got reports that finance couldn’t reach a reporting server. The server was “up,” the firewall logs showed accepts, and the application team insisted the service hadn’t changed. The network team ran a traceroute from a workstation. It never hit the VPN. It stayed local and ended at a switch SVI. Classic sign: the route to 10.50.12.40 was internal, not remote.
Turns out both sites used 10.50.0.0/16. The reporting server existed in Site B, but Site A had a printer on the same IP in a forgotten VLAN. The printer wasn’t even broken; it was just responding to ICMP like a cheerful liar. The “it pings” test was worthless.
The fix wasn’t heroic. They added a NAT mapping for the handful of critical servers in Site B into a 172.20.50.0/24 translation range, updated DNS for those services, and stopped trying to route overlap directly. The lesson that stuck: assumptions about IP uniqueness age like milk, especially after acquisitions.
2) The optimization that backfired: compressing routes and “simplifying” policies
A different company tried to be clever. They had overlap in 192.168.1.0/24 and 192.168.2.0/24 across multiple branches. Instead of careful translations per subnet, they summarized policies and routes to “just allow 192.168.0.0/16 across the tunnel.” The VPN throughput improved slightly. Their change request was approved in record time, which should have been a second clue.
Within hours, oddities: RDP sessions connected to the wrong machines. Asset scanners “found” devices that weren’t there. Worse, a branch could now hit another branch’s admin interfaces because the coarse rule set permitted it. Nothing exploded immediately, which is how security problems often introduce themselves—quietly.
The backfire wasn’t that summarization is always wrong. It’s that summarization across overlapping spaces removes the last bits of specificity you had. They’d traded correctness for convenience. When overlap exists, specificity is your seatbelt.
They rolled back, then rebuilt with explicit NAT pools per branch and explicit firewall objects per translated subnet. It took longer. It also stopped the “teleporting RDP” phenomenon, which is not a feature.
3) The boring but correct practice that saved the day: deterministic translation + test harness
At a larger enterprise, the merger plan was realistic: “We will not renumber either company for at least 18 months.” They implemented a deterministic NAT scheme: each site got a translated /16 out of 100.64.0.0/10 (CGNAT space), derived from a site ID. They documented it like a product: mapping tables, DNS rules, firewall object naming conventions, and a small CI job that validated no two sites were assigned overlapping translated ranges.
Then they did the unglamorous part: a test harness. A couple of Linux hosts in each site ran scheduled pings, TCP handshakes, and DNS queries to a list of translated service IPs. Results went into a dashboard with latency, packet loss, and “first failure time.” No magic. Just instrumentation.
Six months later, an ISP change introduced an MTU regression that only broke SMB over the tunnel (ICMP still worked, because of course it did). The harness caught it within minutes. The team adjusted MSS clamping on the VPN edge and avoided a full-day outage for file services.
No one got a trophy for it. But they didn’t spend their weekend in a war room either, which is the better prize.
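The mapping logic behind such a scheme is almost embarrassingly simple, which is the point. A sketch of the idea in shell (the formula is an assumption, not their production scheme): 100.64.0.0/10 contains 64 non-overlapping /16s, so a site ID from 0 to 63 deterministically selects one.
# Hypothetical site-ID-to-prefix mapping; validate against IPAM before use.
SITE_ID=7
echo "site ${SITE_ID} -> 100.$((64 + SITE_ID)).0.0/16"
# prints: site 7 -> 100.71.0.0/16
Because the translated range is a pure function of the site ID, the CI check reduces to “no two sites share an ID,” which is an easy thing to make true.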
Practical tasks: commands, outputs, and decisions
You can’t fix what you can’t prove. Below are real tasks you can run from Linux hosts and typical network-adjacent systems. Each includes: the command, what the output means, and the decision you make from it.
Task 1: Prove the “remote IP” is actually local (ARP check)
cr0x@server:~$ ip neigh show 10.10.1.20
10.10.1.20 dev eth0 lladdr 3c:52:82:aa:bb:cc REACHABLE
Meaning: Your host believes 10.10.1.20 is on the local L2 segment (has a MAC on eth0). That cannot be a remote routed host across a VPN.
Decision: Stop debugging the tunnel. You have overlap (or a proxy ARP misconfiguration). Plan for NAT/VRF/overlay.
Task 2: Confirm which route wins for the overlapping destination
cr0x@server:~$ ip route get 10.10.1.20
10.10.1.20 dev eth0 src 10.10.1.100 uid 1000
cache
Meaning: Kernel routes 10.10.1.20 out eth0 locally.
Decision: If you expected the VPN, you need a new namespace boundary; you can’t “policy route” your way out if the destination exists locally.
Task 3: Traceroute to see if traffic ever hits the VPN path
cr0x@server:~$ traceroute -n 10.10.1.20
traceroute to 10.10.1.20 (10.10.1.20), 30 hops max, 60 byte packets
1 10.10.1.1 0.412 ms 0.381 ms 0.372 ms
2 10.10.1.20 0.663 ms 0.631 ms 0.612 ms
Meaning: Two hops, entirely local. Not crossing a tunnel.
Decision: Don’t change VPN selectors yet. First solve the overlap.
Task 4: Verify the VPN interface exists and is up (sanity check, not victory)
cr0x@server:~$ ip link show dev wg0
6: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/none
Meaning: Interface is up. This says nothing about correct routing for overlapping destinations.
Decision: Proceed to routing/NAT/VRF verification; don’t declare success.
Task 5: Check WireGuard peer health (example overlay/mesh)
cr0x@server:~$ sudo wg show
interface: wg0
public key: qXn2mB3lP9z2gJw8v7m9Zk1a2b3c4d5e6f7g8h9i=
listening port: 51820
peer: lH8k1m2n3b4v5c6x7z8a9s0d1f2g3h4j5k6l7p8o9=
endpoint: 203.0.113.10:51820
allowed ips: 100.64.10.0/24
latest handshake: 32 seconds ago
transfer: 1.24 MiB received, 2.88 MiB sent
Meaning: Peer is alive, allowed IPs define the overlay route range.
Decision: If overlap persists, ensure you’re routing to overlay IPs, not the underlay duplicates.
Task 6: Validate DNS is pointing at translated/overlay addresses, not the conflicting underlay
cr0x@server:~$ dig +short reports.internal.example A
172.20.50.40
Meaning: Name resolves to a non-overlapping translated address (good sign for NAT-based integration).
Decision: If it returns 10.50.12.40 and that overlaps, fix split-horizon DNS or create integration-specific records.
Task 7: Verify NAT is happening (conntrack view)
cr0x@server:~$ sudo conntrack -L -p tcp --dport 443 | head
tcp 6 431999 ESTABLISHED src=10.10.1.100 dst=172.20.50.40 sport=51922 dport=443 src=10.50.12.40 dst=10.10.1.100 sport=443 dport=51922 [ASSURED] mark=0 use=1
Meaning: The first tuple is the original flow (client to the translated address 172.20.50.40); the reply tuple shows the real server 10.50.12.40 answering. That pair is the translation, visible in one line. Depending on where the translator sits, you may see the mirror image instead.
Decision: If no conntrack entries appear during a test, the NAT rule isn’t matching or traffic isn’t reaching the translator.
Task 8: Confirm iptables NAT rules match what you think they match
cr0x@server:~$ sudo iptables -t nat -S
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-A PREROUTING -i wg0 -d 172.20.50.40/32 -p tcp -m tcp --dport 443 -j DNAT --to-destination 10.50.12.40:443
-A POSTROUTING -o wg0 -s 10.50.12.40/32 -p tcp -m tcp --sport 443 -j SNAT --to-source 172.20.50.40
Meaning: Deterministic bidirectional NAT for a single service (443). DNAT inbound, SNAT outbound.
Decision: Keep NAT symmetric for stateful protocols. If you only DNAT one way, return traffic may route to the local twin and die.
Task 9: Check for MTU/MSS problems that masquerade as “overlap issues”
cr0x@server:~$ ping -M do -s 1372 172.20.50.40 -c 3
PING 172.20.50.40 (172.20.50.40) 1372(1400) bytes of data.
1380 bytes from 172.20.50.40: icmp_seq=1 ttl=61 time=18.4 ms
1380 bytes from 172.20.50.40: icmp_seq=2 ttl=61 time=18.1 ms
1380 bytes from 172.20.50.40: icmp_seq=3 ttl=61 time=18.3 ms
--- 172.20.50.40 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
Meaning: 1400-byte packets pass without fragmentation. Good for many VPN/overlay setups.
Decision: If this fails but small pings work, clamp MSS on the edge and set correct tunnel MTU before blaming NAT/VRF config.
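If the big ping fails while small ones succeed, clamping is usually a one-rule fix on the edge. A minimal sketch for a Linux edge, assuming the tunnel is wg0 (interface name and numbers are assumptions):
# Clamp TCP MSS to the discovered path MTU for flows entering the tunnel:
iptables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
# Or pin an explicit value: a 1420-byte tunnel MTU minus 40 bytes of IPv4+TCP headers leaves 1380.
iptables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1380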
Task 10: Confirm which interface a packet leaves from (useful with policy routing/VRFs)
cr0x@server:~$ ip rule show
0: from all lookup local
1000: from 10.10.1.0/24 lookup 100
32766: from all lookup main
32767: from all lookup default
Meaning: Source-based routing is in play for 10.10.1.0/24 via table 100.
Decision: If overlap forces you into policy routing tricks, verify that the source networks are unique. Policy routing doesn’t solve identical destinations on the same host segment.
Task 11: Inspect the alternate routing table for the VRF/policy domain
cr0x@server:~$ ip route show table 100
default via 100.64.10.1 dev wg0
100.64.10.0/24 dev wg0 proto kernel scope link src 100.64.10.2
Meaning: Table 100 sends traffic into the overlay via wg0.
Decision: If services should use overlay IPs, ensure the correct sources land in the correct table; otherwise they’ll leak to main and hit local twins.
Task 12: Detect duplicate IPs from logs (when users swear it’s “random”)
cr0x@server:~$ sudo journalctl -k | grep -iE 'duplicate|martian|moved' | tail -n 5
Dec 28 09:11:02 edge kernel: IPv4: martian source 10.10.1.20 from 10.10.1.1, on dev eth0
Dec 28 09:12:44 edge kernel: arp: 10.10.1.20 moved from 3c:52:82:aa:bb:cc to 00:11:22:33:44:55 on eth0
Meaning: The same IP is seen with different MACs. That’s either a legit move (VM migration) or—more likely in this context—duplicate IPs/overlap/L2 leak.
Decision: If you see MAC moves across sites, stop. You may have accidentally bridged L2 across WAN or extended VLANs. That’s a different emergency.
Task 13: Verify firewall path selection with packet capture (prove where traffic goes)
cr0x@server:~$ sudo tcpdump -ni wg0 host 172.20.50.40 and tcp port 443 -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on wg0, link-type RAW (Raw IP), snapshot length 262144 bytes
09:21:11.102334 IP 10.10.1.100.51922 > 172.20.50.40.443: Flags [S], seq 39184422, win 64240, options [mss 1360,sackOK,TS val 123 ecr 0], length 0
09:21:11.120118 IP 172.20.50.40.443 > 10.10.1.100.51922: Flags [S.], seq 91233110, ack 39184423, win 65160, options [mss 1360,sackOK,TS val 456 ecr 123], length 0
Meaning: SYN/SYN-ACK visible on the tunnel interface; traffic is correctly entering the overlay/VPN for the translated IP.
Decision: If you see SYN leaving but no SYN-ACK returning, suspect remote-side NAT/policy, route back, or MTU. If you see nothing on wg0, routing is wrong upstream.
Task 14: Validate that “integration subnets” do not overlap with anything else
cr0x@server:~$ ipcalc 172.20.50.0/24
Address: 172.20.50.0 10101100.00010100.00110010. 00000000
Netmask: 255.255.255.0 = 24 11111111.11111111.11111111. 00000000
Network: 172.20.50.0/24 10101100.00010100.00110010. 00000000
Broadcast: 172.20.50.255 10101100.00010100.00110010. 11111111
HostMin: 172.20.50.1 10101100.00010100.00110010. 00000001
HostMax: 172.20.50.254 10101100.00010100.00110010. 11111110
Hosts/Net: 254
Meaning: You’re defining a clean translation/overlay segment.
Decision: Put these ranges under change control. If another team later uses 172.20.50.0/24 locally, you’ll reintroduce overlap—just with different numbers.
Task 15: Quick application-level test that avoids “ping lies”
cr0x@server:~$ curl -skI https://172.20.50.40/ | head -n 5
HTTP/1.1 200 OK
Server: nginx
Date: Sun, 28 Dec 2025 09:25:02 GMT
Content-Type: text/html
Connection: keep-alive
Meaning: The service is reachable at the translated/overlay IP. This is more meaningful than ICMP.
Decision: If curl works but the app still fails, focus on DNS names, certificates, SSO/identity, or firewall L7 policies—not “the VPN.”
Task 16: Confirm the local network already uses your proposed translation range (the “don’t be clever” check)
cr0x@server:~$ ip route show | grep -E '172\.20\.50\.|^100\.' || echo "no local routes found"
no local routes found
Meaning: Your chosen translation/overlay range doesn’t appear locally on this host.
Decision: Repeat on core routers and DHCP/IPAM sources. If the range is in use anywhere, pick a different one now, not after rollout.
Common mistakes: symptoms → root cause → fix
1) “The tunnel is up but nothing works”
Symptom: VPN shows connected; pings to “remote” IPs succeed inconsistently; traceroute stays local.
Root cause: Overlapping subnet causes local route/ARP to win; you’re reaching local twins.
Fix: Introduce a non-overlapping namespace (NAT/overlay) and use DNS to direct clients to translated addresses.
2) “Some apps work, SMB/VoIP don’t”
Symptom: Web apps are fine; file shares stall; large transfers hang; VoIP one-way audio.
Root cause: MTU/MSS issues due to VPN encapsulation, often revealed after adding overlays.
Fix: Set tunnel MTU appropriately and clamp TCP MSS on edges; verify with do-not-fragment pings and real app tests.
3) “We did NAT and now the remote team can’t whitelist us”
Symptom: Remote security policies reject traffic; logs show unexpected source IPs.
Root cause: SNAT changes client identity; remote allowlists were built for original ranges.
Fix: Use deterministic NAT ranges per site, document them, and provide a stable source CIDR for whitelisting.
4) “DNS is correct but users still hit the wrong system”
Symptom: Name resolves to translated/overlay IP; yet connections land on an unexpected host.
Root cause: Split-horizon DNS mismatch, stale caches, or local hosts file entries; sometimes proxy auto-config routes around your plan.
Fix: Flush caches where appropriate, audit DHCP-provided DNS, remove hosts overrides, and validate from client subnets with dig/nslookup.
5) “VRF rollout broke monitoring and logs”
Symptom: NMS can’t reach devices; syslog stops; NetFlow missing for one segment.
Root cause: Tooling traffic lives in the wrong VRF, or collectors aren’t reachable via route leaking.
Fix: Explicitly plan management-plane routing: either leak routes for management services or run collectors per VRF/segment.
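On a Linux-flavored VRF deployment, that plan can be as small as one leaked /32 plus a test that runs in the right context. A sketch with hypothetical names, tables, and addresses:
# Leak only the collector into the VRF's table, not a default route:
ip route add 10.99.0.10/32 via 10.0.0.1 dev eth0 table 100
# Test from inside the VRF; a ping from the default VRF proves nothing:
ip vrf exec vrf-siteb ping -c 3 10.99.0.10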
6) “We summarized routes to simplify, then weird access appeared”
Symptom: Users reach admin interfaces in other branches; segmentation violated.
Root cause: Over-broad route and policy summaries across overlapping ranges removed necessary specificity.
Fix: Reintroduce specificity: per-site translation, per-subnet objects, and least-privilege policies based on translated/overlay identities.
7) “Packets arrive but replies go missing”
Symptom: SYN seen leaving; no SYN-ACK; or request reaches remote but response returns to local twin network.
Root cause: Asymmetric routing due to one-way NAT, missing return routes, or remote side preferring its local overlapping route.
Fix: Ensure symmetric NAT (DNAT+SNAT pair) or enforce return routing via the same tunnel (policy-based VPN selectors or route-based VPN with correct routes).
Checklists / step-by-step plan
Checklist A: Choose the solution in 30 minutes
- List required cross-site flows (source subnet → destination service → ports → bidirectional?)
- Confirm overlap scope: exact prefixes that collide (10.10.0.0/16 vs 10.10.0.0/16, etc.).
- Identify control points: do you manage both edges? Can you change DNS? Can you add gateways?
- Pick the smallest hammer:
- Need access to specific services quickly → NAT
- Need isolation and shared services without full merge → VRFs
- Need scalable multi-site fabric or cloud integration → overlay
- Pick a non-overlapping integration range and reserve it (don’t “borrow” a random /24).
Checklist B: NAT rollout plan that doesn’t turn into a crime scene
- Design deterministic mappings (per-site translated blocks; avoid one-off snowflakes).
- Decide directionality:
- Users → servers only: SNAT may be sufficient.
- Bidirectional services: use symmetric static NAT.
- Update DNS so clients use translated IPs for cross-site names.
- Update firewall policies to permit translated ranges only (least privilege, not “any any”).
- Validate MTU/MSS on tunnel path for the largest expected packets.
- Instrument: packet captures on the VPN interface; basic TCP probes (a probe sketch follows this checklist); log NAT hits.
- Document mappings in one authoritative place and gate changes through review.
- Rollback plan: a single toggle/commit revert on the edge device, plus DNS TTL planning.
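For the instrumentation bullet above, even a crude TCP probe beats ICMP, because ping happily talks to local twins. A minimal bash sketch, assuming a targets file with one “name ip port” entry per line (both file paths are hypothetical):
# Reads targets, attempts a TCP connect with a 3-second timeout, logs the result.
while read -r name ip port; do
  if timeout 3 bash -c "exec 3<>/dev/tcp/${ip}/${port}" 2>/dev/null; then
    echo "$(date -Is) OK   ${name} ${ip}:${port}"
  else
    echo "$(date -Is) FAIL ${name} ${ip}:${port}"
  fi
done < /etc/integration/probe-targets.txt >> /var/log/integration-probes.log
Run it from cron on a host in each site and graph “first failure time”; that is the entire test harness from the third mini-story, minus the dashboard.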
Checklist C: VRF deployment plan for containment and shared services
- Define VRF boundaries: which VLANs/subnets belong to which VRF (local vs acquired/remote).
- Attach WAN/VPN to VRF for the remote site’s routing domain.
- Plan “shared services” access (DNS, AD, logging, patching) via controlled route leaking or proxies.
- Update security policies to be VRF-aware; verify management-plane access.
- Test in a lab or a pilot site with representative apps (SMB, Kerberos, VoIP if relevant).
- Operationalize: VRF-specific monitoring checks, runbooks, and packet capture procedures.
Checklist D: Overlay deployment plan when you’re done with exceptions
- Pick overlay identity scheme (per-site /24s, per-service /32s, or per-host assignments).
- Decide endpoint strategy:
- Dual-home critical servers into overlay (fastest for limited scope).
- Deploy gateways per site (less endpoint change).
- Validate MTU end-to-end and configure MSS clamping where needed.
- Define routing: which prefixes are advertised where; avoid advertising underlay overlaps into overlay.
- Security: treat overlay as production network—logging, ACLs, key rotation, segmentation.
- Migration: move one service at a time; update DNS; measure; then proceed.
One reliability quote to keep you honest
Hope is not a strategy.
— General Gordon R. Sullivan
Facts and historical context (worth knowing)
- RFC1918 (1996) created private address ranges explicitly to reduce IPv4 consumption—overlap was an accepted side effect, not a bug.
- NAT took off in the mid-1990s as the practical response to IPv4 scarcity; it also normalized the idea that “addresses can be rewritten” in transit.
- CGNAT (Carrier-Grade NAT) popularized using 100.64.0.0/10 as shared space; enterprises now borrow it internally for translation because it’s less likely to collide with home routers.
- VRF concepts matured alongside MPLS VPNs; the same mechanism that isolates customers for carriers isolates overlapping tenants in enterprises.
- Route-based VPNs (interfaces + routing) became dominant over policy-based VPNs because they scale better, but overlap still requires a namespace fix.
- EVPN/VXLAN overlays emerged to solve multi-tenant data center scale problems; those same patterns now show up in campus and branch designs.
- DNS split-horizon has been a standard enterprise trick for decades, but it becomes non-optional when translated/overlay addresses differ by site.
- Large enterprises routinely run multiple IP “realities” during mergers: original address space, translated integration space, and future-state renumbered space—often simultaneously.
FAQ
1) Can I fix overlapping subnets by changing route metrics or adding static routes?
Not reliably. If a subnet exists locally, your hosts and routers will prefer the connected route. You need a namespace boundary: NAT, VRF separation, or overlay addressing.
2) Is NAT always the quickest fix?
Usually, yes—especially for a limited set of services. But “quick” isn’t “free.” NAT adds translation state, policy complexity, and makes troubleshooting more subtle. If you expect deep integration (AD, lots of east-west), consider VRF/overlay planning early.
3) What translation range should I use?
Pick a range that is unlikely to exist in either company and reserve it centrally. Many teams use 100.64.0.0/10 for translation/overlay because it’s less commonly used on LANs than 10/8. The key is governance: reserve it, document it, and keep it out of DHCP scopes.
4) Do I need bidirectional NAT?
If both sides initiate connections to the same service, yes—plan symmetric DNAT/SNAT so return paths stay consistent. For one-way “clients to servers” access, SNAT alone can work, but you still must ensure responses return through the translator.
5) Can VRFs solve overlap without NAT?
VRFs prevent collisions by isolating routing domains, but they don’t let two identical addresses talk to each other directly. If VRF-A must reach a host in VRF-B that shares the same IP as something in VRF-A, you still need translation or a proxy.
6) What’s the biggest risk with overlays?
MTU and operational complexity. Encapsulation reduces effective MTU; if PMTUD breaks, apps fail in weird ways. Also, overlays add a control plane that must be monitored and understood.
7) How do I keep DNS sane during this?
Decide which names should resolve to translated/overlay IPs and enforce it with split-horizon DNS. Keep TTLs short during migration. And audit for hardcoded IPs—there will be some.
8) What if only a handful of users need access across sites?
Use a jump host or application proxy in a neutral, non-overlapping segment (or overlay). It limits blast radius and reduces the number of translated flows. Don’t open the whole address space “because it’s easier.”
9) Will this affect backups and replication?
Yes, and sometimes in surprising ways. Replication tools may pin to IPs, enforce source allowlists, or behave badly under NAT if they embed addresses. Test with real transfer sizes and measure throughput and error rates before declaring victory.
10) Should we still plan to renumber eventually?
Yes, if you want a simpler long-term life. These solutions are valid, but they introduce layers. Renumbering is painful; living with permanent overlap plus ad-hoc NAT is worse.
Conclusion: next steps you can execute
Overlapping subnets aren’t rare; they’re the default outcome of decentralized IT plus RFC1918 plus time. What matters is how you respond. If you pretend it’s “just a routing issue,” you’ll burn days proving the same failure in different ways.
Do this next:
- Run the fast diagnosis: ARP/route/traceroute to prove overlap and identify which flows matter.
- Pick a solution on purpose:
- NAT for quick, service-focused access.
- VRFs for containment and controlled shared services.
- Overlay for scalable, future-proof integration.
- Reserve integration address space (translated/overlay), document it, and protect it from accidental reuse.
- Instrument from day one: not just “tunnel up,” but application probes, MTU checks, and packet captures you can reproduce.
- Write the runbook while the pain is fresh. Future-you will be tired and unamused.
If you must pick one approach for most real corporate mergers: start with deterministic NAT for critical services, then graduate to VRF/overlay as you learn what “integration” truly means in your environment.