NAT Through VPN: Connect Conflicting Networks Without Breaking Services

Nothing spices up a routine VPN project like discovering both sides use 10.0.0.0/8. Suddenly every route is a lie, half your packets are heading to the wrong place, and someone asks why “the VPN is up” but the database is still down.

NAT-through-VPN is the grown-up way to connect overlapping networks without renumbering a whole environment today. It’s not “clean,” but it’s often the least-worst move that keeps production stable while you buy time for proper IP hygiene.

The actual problem: overlapping networks and ambiguous routing

When two networks share the same IP space, routing becomes ambiguous. If Site A is 10.20.0.0/16 and Site B is also 10.20.0.0/16, your router can’t distinguish “remote 10.20.1.10” from “local 10.20.1.10.” A VPN tunnel doesn’t fix that. A VPN is just a path; routing still decides where packets go.

In non-overlapping designs, you advertise routes (static, BGP, OSPF, whatever), and traffic flows. In overlapping designs, route advertisements are actively dangerous. Your hosts may try to reach “remote” addresses by ARPing locally. Or you might blackhole traffic by selecting the “wrong” route due to metric preference. Worse: it can look fine from one direction and fail from the other, which is the kind of issue that turns on-call brains into soup.

NAT-through-VPN sidesteps the ambiguity by translating one side (sometimes both) into a unique “virtual” address range for the purpose of crossing the tunnel. You keep the internal addressing unchanged. The tunnel sees a non-overlapping world. Routing becomes deterministic again.

Two rules to keep you honest:

  • You are not fixing overlap; you are isolating it. NAT is a compatibility layer, not a cure.
  • If you can renumber safely, do it. NAT-through-VPN is what you do when “safely” is not available this quarter.

Facts and history that explain why this keeps happening

  • RFC 1918 (1996) defined private IPv4 space (10/8, 172.16/12, 192.168/16), which made “just use 10.x” the default corporate reflex.
  • NAT became mainstream in the late 1990s as IPv4 exhaustion anxiety met the reality of enterprise growth. It solved address scarcity but normalized address reuse.
  • IPsec was designed with end-to-end IP identity in mind; NAT was an awkward neighbor, leading to NAT-T (NAT traversal) to carry ESP over UDP.
  • Many companies copied vendor lab ranges (10.0.0.0/8 and 192.168.0.0/16) because it “worked in the demo,” then scaled that mistake to thousands of subnets.
  • Mergers and acquisitions are an overlap factory: two mature networks collide, both convinced their 10.0.0.0/8 is the One True 10/8.
  • Cloud VPC defaults (like 10.0.0.0/16 starter networks) made overlap even more likely when teams spun up environments without central IPAM.
  • Carrier-grade NAT trained a generation to accept translation as normal, even when it breaks protocols that embed IPs in payloads.
  • Some industrial and legacy systems hardcode IPs in configs, licenses, or ACLs, turning renumbering into an “ask legal” event instead of an engineering task.

One paraphrased idea worth keeping, from Richard Cook (operations and safety): systems succeed because people adapt; failures happen when complexity outruns those adaptations.

Design patterns: how NAT-through-VPN is done in real systems

Pattern 1: One-way “alias subnet” NAT (most common, least confusing)

Pick a non-overlapping “alias” CIDR that exists only as a translation target. Example:

  • Site A real: 10.20.0.0/16
  • Site B real: 10.20.0.0/16 (yes, same)
  • Alias for Site B as seen from Site A: 172.31.20.0/24 (or bigger)

Site A routes 172.31.20.0/24 into the VPN. At the VPN edge on Site B, you DNAT 172.31.20.x → 10.20.x on the way in, and SNAT the forwarded traffic so replies are forced back through the same gateway and show up at Site A as coming from 172.31.20.x. This makes sessions symmetrical and routable.

When to use it: when only one side needs to reach the other, or when you can tolerate asymmetric “real vs alias” addressing in configs.
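
A minimal nftables sketch of Pattern 1, assuming a Linux gateway at Site B with the tunnel on wg0 and the LAN on eth0, using per-host placeholder addresses (the same ones the packet walk below uses):

# Sketch only: one alias host mapped to one real host; adjust interfaces and IPs to your topology.
table ip nat {
  chain prerouting {
    type nat hook prerouting priority dstnat; policy accept;
    # traffic arriving from the tunnel for the alias address becomes the real server
    iifname "wg0" ip daddr 172.31.20.25 dnat to 10.20.8.25
  }
  chain postrouting {
    type nat hook postrouting priority srcnat; policy accept;
    # pin the return path: replies must pass back through this gateway to be un-translated
    oifname "eth0" ip daddr 10.20.8.25 snat to 10.20.5.254
  }
}

Load it with nft -f and verify with the checks in the Tasks section below before declaring victory.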

Pattern 2: Bi-directional NAT (two alias spaces)

If both sides overlap and both sides must initiate connections, you often need two alias spaces:

  • Alias for A as seen by B: 172.31.10.0/24
  • Alias for B as seen by A: 172.31.20.0/24

This avoids “who is 10.20.1.10?” entirely across the tunnel. It also doubles the moving parts, which doubles the ways to get it subtly wrong.
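
Here is what one half of Pattern 2 might look like on Site A's gateway, sketched with per-host placeholder mappings (Site B mirrors it with the alias ranges swapped):

# Sketch only: Site A gateway; all hosts and alias IPs are illustrative.
table ip nat {
  chain prerouting {
    type nat hook prerouting priority dstnat; policy accept;
    # inbound from the tunnel: Site B reaches Site A hosts via A's alias range
    iifname "wg0" ip daddr 172.31.10.10 dnat to 10.20.5.10
  }
  chain postrouting {
    type nat hook postrouting priority srcnat; policy accept;
    # outbound toward B's alias range: present this Site A host as its alias address
    oifname "wg0" ip saddr 10.20.5.10 ip daddr 172.31.20.0/24 snat to 172.31.10.10
  }
}

Both alias ranges must appear in AllowedIPs or IPsec selectors on both ends, or one direction will silently never enter the tunnel.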

Pattern 3: NAT only for specific services (surgical NAT)

Sometimes you don’t need “network connectivity.” You need “TCP/5432 from app to db.” In those cases, translate only the service VIPs:

  • Create a small alias range (even /32s)
  • DNAT those alias IPs to the real servers
  • Limit firewall policies to those ports

This reduces blast radius. It also forces you to document dependencies, which is always popular right up until you ask someone to do it.
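
A hedged sketch of the surgical version: a single alias /32 with a port-scoped translation, and nothing else crossing (addresses and port are placeholders):

# Sketch only: translate one service VIP for one port.
table ip nat {
  chain prerouting {
    type nat hook prerouting priority dstnat; policy accept;
    # only TCP/5432 to the alias /32 is translated; everything else is untouched
    iifname "wg0" ip daddr 172.31.20.25 tcp dport 5432 dnat to 10.20.8.25
  }
}
# Pair this with the same SNAT as Pattern 1 and a forward policy that allows only this port.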

Pattern 4: Application-layer proxies instead of NAT (when NAT will break you)

Some protocols hate NAT. SIP, some FTP modes, weird license daemons, and anything that bakes IPs into payloads can behave like it’s 2003 again. If the application breaks, stop arguing with packets and use a proxy:

  • HTTP(S): reverse proxy or forward proxy
  • Databases: TCP proxy with health checks
  • SSH: bastion/jump host

Proxies cost operational effort but can save you from translation edge cases, especially with multi-connection protocols.

Pattern 5: Route-leaking with VRFs (avoid NAT when the overlap is “organizational”)

If overlap exists because networks were logically separated, VRFs can sometimes fix it without NAT by keeping two identical prefixes in separate routing tables. That’s elegant on capable routers and miserable on a random Linux VM acting as a VPN gateway. Use it if you already run VRFs and your VPN endpoints support them cleanly.

Joke #1: NAT is like duct tape—strong, versatile, and somehow always involved in the crime scene reconstruction.

How to choose an approach (and what you’re really trading)

Your decision is mostly about three tensions:

  • Operational clarity vs speed: Renumbering is clear and future-proof, but slow. NAT is fast, but you’ll debug it at 2 a.m.
  • Symmetry vs simplicity: One-way NAT is simpler but can surprise the “other” side. Bi-directional NAT is symmetric but complex.
  • Observability vs opacity: Translation can hide original IPs unless you log conntrack/NAT mappings and preserve context in logs.

Use this as a practical rubric:

  • If this is a merger bridge with many services and unknown dependencies: start with surgical NAT or proxies. Expand only when you understand traffic.
  • If it’s one app → one dependency: do service VIP NAT. Do not NAT whole subnets “because it’s easier.”
  • If it’s long-term interconnect with many initiators both ways: consider bi-directional NAT but also schedule renumbering. NAT shouldn’t become architecture.

Packet walk: what happens to one TCP connection

Example: app at Site A (10.20.5.10) needs Postgres at Site B (10.20.8.25). Overlap exists, so we give B an alias IP 172.31.20.25 that A will use.

  1. App connects: 10.20.5.10:54012 → 172.31.20.25:5432.
  2. Site A routing sends 172.31.20.0/24 to the VPN gateway, into the tunnel.
  3. Site B VPN gateway DNAT rewrites destination: 172.31.20.25 → 10.20.8.25.
  4. Site B gateway SNAT (important) rewrites source to a stable alias (or gateway IP) so return packets are forced back through the gateway and can be un-translated consistently.
  5. Server replies to the SNAT source, not directly to 10.20.5.10 (which would be ambiguous locally and likely wrong).
  6. Gateway untranslates on return path, so the app sees replies from 172.31.20.25, maintaining session consistency.

If you skip the SNAT step, you’ll get the classic “SYN goes through, SYN-ACK disappears” failure. That’s not a mystery. That’s asymmetric routing caused by your own optimism.

Practical tasks: commands, outputs, and the decision you make

The goal here is not to run commands for sport. Each task answers a question, and each answer forces a decision.

Task 1: Confirm the overlap is real (and how big the blast radius is)

cr0x@server:~$ ip -br addr
lo               UNKNOWN        127.0.0.1/8 ::1/128
eth0             UP             10.20.5.254/24
wg0              UP             10.99.0.1/24

What it means: You see local subnets. This host is in 10.20.5.0/24.

Decision: If the remote side also uses 10.20.0.0/16 or similar, you cannot route it directly. You need NAT, VRFs, or renumbering.

Task 2: Check your route to the alias subnet (is traffic even entering the VPN?)

cr0x@server:~$ ip route get 172.31.20.25
172.31.20.25 dev wg0 src 10.99.0.1 uid 0
    cache

What it means: Traffic to the alias IP will go into wg0.

Decision: If it says dev eth0 or shows a local gateway, you’re not using the tunnel. Fix routing before touching NAT.
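
If the route is simply missing, a minimal fix is to point the alias range at the tunnel interface (wg-quick normally installs this for you from AllowedIPs; interface and prefix are the example values):

sudo ip route add 172.31.20.0/24 dev wg0

Then re-run ip route get and confirm it now picks wg0.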

Task 3: Verify the VPN handshake and latest traffic (WireGuard example)

cr0x@server:~$ sudo wg show
interface: wg0
  public key: fM0...redacted
  listening port: 51820

peer: 8kQ...redacted
  endpoint: 198.51.100.10:51820
  allowed ips: 172.31.20.0/24
  latest handshake: 1 minute, 12 seconds ago
  transfer: 128.34 MiB received, 141.02 MiB sent

What it means: Handshake is fresh; AllowedIPs includes the alias subnet. Traffic counters move.

Decision: If latest handshake is “never,” stop. You have a VPN problem, not a NAT problem.
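
For reference, the peer stanza behind that output would look roughly like this in wg0.conf on the Site A gateway (key and endpoint copied from the redacted output above; keepalive is optional):

[Peer]
# Site B VPN gateway; AllowedIPs must include the alias range or nothing enters the tunnel
PublicKey = 8kQ...redacted
Endpoint = 198.51.100.10:51820
AllowedIPs = 172.31.20.0/24
PersistentKeepalive = 25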

Task 4: Confirm IP forwarding is on (the most boring prerequisite)

cr0x@server:~$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1

What it means: Linux will route packets.

Decision: If it’s 0, enable it and persist via /etc/sysctl.d/*.conf. Without forwarding, NAT rules can be perfect and still do nothing.
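
Enabling and persisting it takes a few lines (the drop-in file name is just an example):

sudo sysctl -w net.ipv4.ip_forward=1
echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-vpn-nat.conf
sudo sysctl --system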

Task 5: Validate the firewall policy allows forwarding (nftables)

cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
  chain forward {
    type filter hook forward priority filter; policy drop;
    iifname "wg0" oifname "eth0" ip daddr 10.20.8.25 tcp dport 5432 accept
    iifname "eth0" oifname "wg0" ct state established,related accept
  }
}

What it means: Default drop, explicit allow for Postgres, and established return traffic is accepted.

Decision: If you don’t see an established/related rule for the return direction, you’ll get one-way traffic and blame NAT. Don’t.

Task 6: Inspect NAT rules (nftables)

cr0x@server:~$ sudo nft list table ip nat
table ip nat {
  chain prerouting {
    type nat hook prerouting priority dstnat; policy accept;
    iifname "wg0" ip daddr 172.31.20.25 dnat to 10.20.8.25
  }

  chain postrouting {
    type nat hook postrouting priority srcnat; policy accept;
    oifname "eth0" ip daddr 10.20.8.25 snat to 10.20.5.254
  }
}

What it means: DNAT translates alias to real. SNAT forces replies back to the gateway.

Decision: If SNAT is missing, add it or use a more general SNAT/masquerade for the alias-to-real flow. Otherwise return traffic may bypass the gateway.

Task 7: Watch conntrack entries while testing (are sessions being translated?)

cr0x@server:~$ sudo conntrack -L | grep 5432 | head
tcp      6 431999 ESTABLISHED src=10.20.5.10 dst=172.31.20.25 sport=54012 dport=5432 src=10.20.8.25 dst=10.20.5.254 sport=5432 dport=54012 [ASSURED] mark=0 use=1

What it means: You can see both the original tuple and the translated tuple. The reply path is anchored to the gateway IP (10.20.5.254) due to SNAT.

Decision: If conntrack shows nothing while you test, your packets aren’t hitting the gateway. Re-check routing and AllowedIPs/selectors.

Task 8: Packet capture on both interfaces (prove where it dies)

cr0x@server:~$ sudo tcpdump -ni wg0 host 172.31.20.25 and tcp port 5432 -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on wg0, link-type RAW (Raw IP), snapshot length 262144 bytes
IP 10.20.5.10.54012 > 172.31.20.25.5432: Flags [S], seq 221733, win 64240, options [mss 1380,sackOK,TS val 123 ecr 0,nop,wscale 7], length 0
IP 10.20.5.10.54012 > 172.31.20.25.5432: Flags [S], seq 221733, win 64240, options [mss 1380,sackOK,TS val 1123 ecr 0,nop,wscale 7], length 0

What it means: SYNs are entering the tunnel interface. If you never see SYN-ACKs, the problem is after the tunnel ingress (DNAT, firewall, server, return routing).

Decision: If SYNs enter the tunnel but no SYN-ACKs return, capture on eth0 to confirm DNAT happened.

cr0x@server:~$ sudo tcpdump -ni eth0 host 10.20.8.25 and tcp port 5432 -c 5
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
IP 10.20.5.254.54012 > 10.20.8.25.5432: Flags [S], seq 221733, win 64240, options [mss 1460,sackOK,TS val 555 ecr 0,nop,wscale 7], length 0

What it means: SNAT and DNAT are working. The server sees the gateway source IP.

Decision: If the server expects the real client IP for ACLs, you need app-layer auth, PROXY protocol (where supported), or a different logging strategy. NAT will hide clients unless you design around it.

Task 9: Check reverse path filtering (the silent dropper)

cr0x@server:~$ sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth0.rp_filter
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.eth0.rp_filter = 1

What it means: Strict-ish reverse path filtering is enabled. With asymmetry and NAT, rp_filter can drop valid packets.

Decision: If you see intermittent drops, set rp_filter to 2 (loose) on relevant interfaces, or redesign to keep routing symmetric. Don’t disable globally unless you understand the risk.
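
A hedged example of switching only the relevant interfaces to loose mode (the kernel uses the maximum of the "all" and per-interface values, so setting the interfaces to 2 is enough):

sudo sysctl -w net.ipv4.conf.eth0.rp_filter=2
sudo sysctl -w net.ipv4.conf.wg0.rp_filter=2
# persist the same keys under /etc/sysctl.d/ once you've confirmed the drops stop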

Task 10: Validate MTU and PMTUD behavior (VPN + NAT loves fragmentation)

cr0x@server:~$ ping -M do -s 1420 172.31.20.25 -c 3
PING 172.31.20.25 (172.31.20.25) 1420(1448) bytes of data.
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420

--- 172.31.20.25 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2041ms

What it means: Your path MTU is lower than you assumed. VPN encapsulation adds overhead; NAT devices and firewalls sometimes mishandle ICMP fragmentation-needed messages.

Decision: Lower tunnel MTU (e.g., WireGuard), clamp MSS at the edge, or fix ICMP handling. If TCP sessions stall under load but small pings work, MTU is a prime suspect.
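
Two common fixes, sketched; the MTU value and interface names are illustrative, so measure your own path first:

# Option 1: set the tunnel MTU explicitly (wg-quick, under [Interface] in wg0.conf)
MTU = 1380

# Option 2: clamp TCP MSS to the path MTU on forwarded SYNs (nftables)
table inet mangle {
  chain forward {
    type filter hook forward priority mangle; policy accept;
    oifname "wg0" tcp flags syn tcp option maxseg size set rt mtu
    iifname "wg0" tcp flags syn tcp option maxseg size set rt mtu
  }
}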

Task 11: Trace route with TCP SYN to the real service (prove L3 vs L4)

cr0x@server:~$ sudo traceroute -T -p 5432 172.31.20.25 -n
traceroute to 172.31.20.25 (172.31.20.25), 30 hops max, 60 byte packets
 1  10.99.0.2  12.112 ms  11.901 ms  11.774 ms
 2  172.31.20.25  24.310 ms  24.221 ms  24.105 ms

What it means: You’re reaching the alias endpoint in two hops (tunnel peer, then target). It doesn’t prove the port is open, but it proves routing is sane.

Decision: If traceroute dies at hop 1, your AllowedIPs/selectors or routing to the tunnel is wrong. Don’t touch server configs yet.

Task 12: Verify the service is actually listening (on the destination side)

cr0x@server:~$ sudo ss -lntp | grep 5432
LISTEN 0      4096       0.0.0.0:5432      0.0.0.0:*    users:(("postgres",pid=1442,fd=6))

What it means: Postgres is listening on all interfaces.

Decision: If it’s bound to 127.0.0.1 or a specific IP that doesn’t match the DNAT target, fix that first. NAT can’t connect to what isn’t listening.

Task 13: Confirm the server’s route back to the gateway (prevent bypass)

cr0x@server:~$ ip route get 10.20.5.254
10.20.5.254 via 10.20.8.1 dev eth0 src 10.20.8.25 uid 0
    cache

What it means: The server will return to its local router, which should then reach the NAT gateway (depending on topology). If the NAT gateway is L2-adjacent, you may see dev eth0 directly.

Decision: If the server has a more direct route back to the client subnet, you can get asymmetric return. Consider policy routing or ensure SNAT uses an address that forces return to the gateway.
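
If policy routing is the route you take, one sketch on the server is a dedicated table for the SNAT address (table number and next hop are placeholders; a plain /32 host route often does the same job):

# replies to the NAT gateway's SNAT address always go via the router that reaches it
sudo ip route add 10.20.5.254/32 via 10.20.8.1 dev eth0 table 100
sudo ip rule add to 10.20.5.254/32 lookup 100 priority 100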

Task 14: Check NAT rule counters to confirm hits (nftables counters)

cr0x@server:~$ sudo nft -a list chain ip nat prerouting
table ip nat {
  chain prerouting { # handle 2
    type nat hook prerouting priority dstnat; policy accept;
    iifname "wg0" ip daddr 172.31.20.25 counter packets 128 bytes 7680 dnat to 10.20.8.25 # handle 12
  }
}

What it means: DNAT rule is matching. Counters increase when you test.

Decision: If counters stay at zero, either traffic isn’t arriving, or the match is wrong (interface name, destination IP, address family). Fix match conditions before inventing new rules.

Task 15: Validate IPsec selectors (policy-based VPNs can “eat” NAT designs)

cr0x@server:~$ sudo ip xfrm policy
src 172.31.20.0/24 dst 10.20.8.0/24
	dir out priority 1000 ptype main
	tmpl src 203.0.113.2 dst 198.51.100.2 proto esp reqid 1 mode tunnel
src 10.20.8.0/24 dst 172.31.20.0/24
	dir in priority 1000 ptype main
	tmpl src 198.51.100.2 dst 203.0.113.2 proto esp reqid 1 mode tunnel

What it means: The kernel policies define what gets encrypted. If your alias subnet isn’t in selectors, packets won’t enter the tunnel even if routes exist.

Decision: Update selectors (on both ends) to match post-NAT addresses as seen by the tunnel. For policy-based IPsec, NAT and selectors must agree or you’ll chase ghosts.
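
With strongSwan managed via swanctl, for example, the selectors live in the child section; a sketch matching the policies above (connection and child names are placeholders, auth and endpoint settings omitted):

connections {
  site-b {
    children {
      alias-to-b {
        local_ts  = 172.31.20.0/24
        remote_ts = 10.20.8.0/24
      }
    }
  }
}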

Task 16: Test from the application host with explicit source (prove who is being NATed)

cr0x@server:~$ psql "host=172.31.20.25 port=5432 user=app dbname=appdb connect_timeout=3" -c "select 1;"
 ?column? 
----------
        1
(1 row)

What it means: The end-to-end path works for the real application protocol. ICMP success doesn’t count; app success counts.

Decision: If psql fails but TCP SYN/SYN-ACK works, look at authentication (pg_hba.conf), TLS SNI/certs, or L7 expectations that include IPs.

Fast diagnosis playbook

When someone says “VPN is up but service is down,” don’t wander. Run this in order. Stop as soon as you find a mismatch.

First: Is the tunnel real, and is traffic eligible to use it?

  • WireGuard: sudo wg show → handshake recent? transfer counters move?
  • IPsec: check SAs and policies → do selectors include the alias subnet?
  • Routing: ip route get <alias-ip> → does it choose the tunnel interface?

If any of these fail, NAT rules are irrelevant.

Second: Are packets translated and forwarded?

  • sysctl net.ipv4.ip_forward → must be 1
  • NAT counters (nftables/iptables) → do DNAT/SNAT rules match?
  • conntrack -L → does an entry exist for the flow?

If DNAT hits but SNAT doesn’t, expect SYN-ACK loss and timeouts.

Third: Is return traffic forced back through the gateway?

  • Capture on inside interface: do you see server replies to the SNAT source?
  • Check rp_filter settings; loose mode often needed on NAT edges.
  • Confirm server route to the SNAT source is sane.

Fourth: Is the application actually willing to talk?

  • Port open, service listening, TLS names correct, ACLs updated for SNAT’d sources.
  • Test with the real client tool (psql, curl, ldapsearch). Ping is a morale booster, not a validation method.

Common mistakes: symptoms → root cause → fix

1) Symptom: “SYN goes out, nothing comes back”

Root cause: Missing SNAT, so the server replies to the original source IP via its normal route, bypassing the VPN gateway (asymmetric return).

Fix: Add SNAT/masquerade on the VPN gateway for the aliased flows, or use policy routing to force return via the gateway.

2) Symptom: Some ports work, others hang mysteriously

Root cause: Firewall forward rules allow a test port but not the actual application flows, or conntrack state isn’t allowed back.

Fix: Treat forwarding as its own policy. Allow established/related in both directions. Confirm counters increment on the specific rule.

3) Symptom: Works for small requests, dies on big transfers

Root cause: MTU/PMTUD issues over encapsulation; ICMP “fragmentation needed” blocked; MSS not clamped.

Fix: Lower tunnel MTU and/or clamp MSS on SYN packets. Ensure ICMP is permitted for PMTUD or accept you’ll be tuning MTU forever.

4) Symptom: Random drops under load, especially multi-homed gateways

Root cause: Reverse path filtering drops packets that don’t match strict routing expectations after NAT.

Fix: Set rp_filter to loose (2) on relevant interfaces; keep routing symmetric where possible.

5) Symptom: “VPN is connected, but no one can reach the alias subnet”

Root cause: Policy-based IPsec selectors don’t include the alias range, so traffic never enters the tunnel.

Fix: Update selectors to match post-NAT addresses. For route-based IPsec, verify VTI routing instead.

6) Symptom: Application logs show wrong client IP, and rate-limits ban the gateway

Root cause: SNAT collapses many clients into one source IP; L7 systems interpret it as one noisy client.

Fix: Prefer per-client NAT pools, PROXY protocol where supported, or move to an application proxy that preserves client identity.

7) Symptom: DNS works, but connecting to hostnames hits local machines instead

Root cause: Split-brain DNS returns overlapping A records; clients resolve to 10.x and route locally.

Fix: Publish alias IPs in the DNS view used by the remote side, or use conditional forwarding to a DNS zone that returns alias addresses.
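
One lightweight way to do that is a dedicated cross-site name that only ever resolves to the alias address, for example with dnsmasq serving the remote side (name and IP are placeholders):

# Site A clients resolve this name to the alias, never to a local 10.20.x host
address=/db.siteb-alias.internal/172.31.20.25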

8) Symptom: New NAT rules “work” in testing, then break a week later

Root cause: Conntrack table exhaustion, or NAT rules depend on dynamic interface names/addresses that changed.

Fix: Monitor conntrack usage, size appropriately, pin interface names, and avoid fragile matches. Make rules explicit and review them like code.

Three corporate mini-stories (the ones you learn from)

Mini-story #1: The incident caused by a wrong assumption

Company A acquired a smaller outfit and needed “quick connectivity” so finance systems could pull data for month-end. Both sides used 10.0.0.0/8, naturally. The network team built a neat alias range and did DNAT on the far-end VPN gateway. Basic tests passed: ping, traceroute, even a quick TCP connect to the database port.

At 9:05 a.m. on close day, the ETL jobs started and promptly timed out. Packet captures showed SYN and SYN-ACK, then long silences. The team escalated to the VPN vendor, because that’s what teams do when they’re tired and slightly insulted by reality.

The wrong assumption: “If DNAT works, return traffic will find its way back.” It won’t. The database replied to the original client IP (still overlapping), and the response got delivered to a local host with the same address on the database side. Sometimes that host existed and sent RSTs. Sometimes it didn’t and the replies vanished. The network was, in a sense, doing exactly what it was told.

The fix was blunt and effective: SNAT on the VPN gateway for the translated flows, forcing all return traffic back through the same box. Once the sessions were symmetric, the ETL jobs ran. Nobody thanked NAT, but everyone stopped yelling at the VPN concentrator.

Afterward they did what should have been done earlier: documented a single packet walk with both translations, and made “symmetry check” part of the change review. The incident wasn’t caused by complexity. It was caused by skipping the boring parts of complexity.

Mini-story #2: The optimization that backfired

Company B had a long-lived NAT-through-IPsec bridge between two datacenters with overlapping lab networks. It wasn’t pretty, but it was stable. Then someone noticed high CPU on the VPN gateways during peak hours and decided to optimize: “Let’s reduce conntrack pressure by using broader NAT rules and fewer state entries.”

They replaced a set of narrow SNAT rules with a big masquerade, effectively NATing more traffic than intended. CPU dropped. Victory lap scheduled. Two days later, security monitoring lit up: intrusion detection reported a sudden spike in “lateral movement” patterns across the tunnel. It was false—but expensive false.

The backfire was observability. With broader masquerade, multiple internal systems collapsed into the same translated identity. Logs became ambiguous. Alerts that previously tied events to specific sources now pointed at “the NAT gateway,” which is the networking equivalent of “somebody.”

Worse, a rate-limiting system on the remote side saw a single source hammering APIs and began throttling it—throttling everyone. The optimization improved one metric (CPU) by burning another (diagnostic clarity) and then accidentally triggered traffic shaping.

The rollback restored narrow translations. The real fix was more grown-up: scale the gateways properly, increase conntrack limits with monitoring, and keep NAT scope as small as possible. Performance problems are often solved by capacity and architecture, not by making your logs useless.

Mini-story #3: The boring but correct practice that saved the day

Company C ran a regulated environment with painfully strict change control. Engineers grumbled about it until the day it saved them. They needed to connect an on-prem environment to a cloud VPC that had—surprise—overlapping ranges with an internal shared services segment.

The team proposed NAT-through-VPN with an alias range. Before implementing, they did three boring things: reserved the alias range in IPAM (even though it wasn’t “real”), wrote a one-page packet walk, and created synthetic monitoring from both sides using the alias addresses. They also added conntrack and NAT counter checks to their runbook.

During cutover, a firewall rule was missing for return established connections. The synthetic check failed immediately, and the NAT counters showed DNAT hits but no return traffic. No customers noticed because the team caught it in minutes, not hours.

The best part was the postmortem: it didn’t include heroics. It included a checklist, a counter, and a graph. That’s the kind of boring you should aspire to.

Joke #2: If you ever feel useless, remember there’s a “temporary” NAT rule from 2017 still routing payroll traffic.

Checklists / step-by-step plan

Step-by-step plan for NAT-through-VPN that won’t sabotage you later

  1. Define the business scope. List the exact services and ports required. If the list is “everything,” you’re about to build a second network. Stop and renegotiate.
  2. Pick alias CIDRs deliberately. Use ranges that won’t collide with either side now or in plausible future expansions. Reserve them in IPAM like you mean it.
  3. Decide on one-way vs bi-directional NAT. One-way is less operationally expensive. Bi-directional is more uniform for bidirectional apps but harder to reason about.
  4. Pick the translation point. Do NAT at the VPN gateway if you need symmetrical return and centralized control. Avoid “NAT sprinkled across random hosts.”
  5. Write a packet walk. One diagram, one example flow, both translations shown. If you can’t explain it, you can’t operate it.
  6. Implement routing/selectors first. Make sure alias CIDRs route into the tunnel and are allowed by WireGuard AllowedIPs or IPsec selectors.
  7. Implement firewall forward rules second. Start with explicit allow for required ports and established/related returns. Default drop is fine if you’re disciplined.
  8. Implement DNAT + SNAT together. Treat them as a pair for stateful services. DNAT without SNAT is a trap unless you know return routing is pinned.
  9. Instrument the edges. Collect: NAT counters, conntrack usage, tunnel bytes, and interface drops. Alert on changes, not just failures.
  10. Test at L7. Use real client tools. For HTTP, verify headers and TLS names. For DB, run a real query.
  11. Plan rollback. Being brave is not a rollback strategy. Keep old routing and policy configs ready to reapply.
  12. Set a deprecation date. NAT-through-VPN tends to become permanent unless you schedule renumbering or a cleaner interconnect.

Pre-change checklist (printable in your head)

  • Alias CIDR reserved and documented
  • Routing/AllowedIPs/selectors include alias CIDR
  • DNAT and SNAT rules reviewed for symmetry
  • Forward firewall allows required ports + established/related
  • rp_filter evaluated (loose where needed)
  • MTU plan (explicit tunnel MTU or MSS clamp)
  • Monitoring and synthetic tests ready
  • Rollback plan tested (at least once in a lab)

FAQ

1) Do I always need SNAT when doing DNAT across a VPN?

Not always, but assume yes until proven otherwise. If the destination host can return traffic via the same gateway deterministically (no bypass routes), you might skip SNAT. In overlapping networks, return determinism is rare. Use SNAT to force symmetry.

2) Can I solve overlapping networks by advertising more specific routes?

If the overlap is identical (both have 10.20.0.0/16), more specifics don’t fix identity ambiguity on hosts. You might win on routers, then lose on ARP and local routing. NAT or VRFs are the usual fixes; renumbering is the real fix.

3) Should the alias range be RFC1918 or can it be public-looking space?

Use RFC1918 unless you have a very controlled environment and a strong reason. Public-looking space can leak into logs, monitoring, or third parties and confuse incident response. The goal is uniqueness and internal clarity, not cosplay public internet.

4) What’s the difference between NAT-through-VPN and NAT traversal (NAT-T)?

NAT-T is about getting IPsec through a NAT device on the path by encapsulating ESP in UDP. NAT-through-VPN is you deliberately translating addresses to resolve overlap or policy constraints. Similar acronym energy, totally different problem.

5) Can WireGuard do NAT for me?

WireGuard is routing/crypto; it doesn’t implement NAT by itself. You do NAT with nftables/iptables on the gateway. WireGuard’s AllowedIPs does act like a routing/filtering mechanism, and it’s easy to misconfigure when you add alias ranges.

6) How do I keep real client IP visibility for logs and ACLs?

You usually can’t with pure SNAT; that’s the point of SNAT. Options: per-client SNAT pools, application proxies that preserve client identity, or protocol-specific features (like PROXY protocol) where supported.

7) Is bi-directional NAT a bad idea?

It’s not evil; it’s just easy to misunderstand. If you must support initiations from both sides and you can’t renumber, bi-directional NAT can be correct. It requires better documentation, tighter monitoring, and more careful DNS/service discovery.

8) What about IPv6—does it make this problem go away?

IPv6 reduces address scarcity and should reduce overlap pressure, but reality includes legacy systems and partial deployments. Also, people can still overlap ULA prefixes if they treat addressing like a suggestion. The discipline problem doesn’t vanish.

9) Why does DNS become a special kind of messy with NAT aliases?

Because names are shared concepts and IPs are not. If “db.internal” resolves to 10.20.8.25 on one side and that’s also a valid local host on the other, clients will connect to the wrong thing confidently. Use split-horizon DNS or dedicated alias names that resolve to alias IPs for cross-site use.

10) How long can we keep NAT-through-VPN in place?

Technically: years. Operationally: until you forget how it works and then need to change it under pressure. Put an owner on it, keep it monitored, and plan an exit (renumbering, VRFs, or clean network integration).

Conclusion: next steps that won’t ruin your week

NAT through a VPN is a pragmatic tool for a messy world: mergers, shadow IT, default cloud CIDRs, and “we’ll clean it up later.” It works when you respect two things: symmetry and scope. Make return traffic deterministic. Translate only what you must. Instrument everything you touch.

Next steps you can do immediately:

  • Pick and reserve alias CIDRs in IPAM (even if they’re “virtual”).
  • Write a one-page packet walk for a single critical flow, including DNAT and SNAT.
  • Implement NAT counters + conntrack monitoring on the VPN gateways.
  • Build a synthetic check that uses the alias addresses from both sides and fails loudly.
  • Schedule the uncomfortable meeting about renumbering, because NAT is not a retirement plan.