Site-to-site VPN routing: why the tunnel is up but nothing works (and how to fix it)

The tunnel is “up.” Your dashboard is green. Phase 1 is established, Phase 2 is installed, the peer is “connected.”
And yet: no ping, no SSH, no database connections, no application traffic. You have achieved the classic enterprise
milestone: cryptography is working, networking is not.

This failure mode is so common it deserves its own runbook. The tunnel being up only means the devices agreed on
how to encrypt. It says almost nothing about whether packets can find the tunnel, survive the tunnel, and emerge on
the other side in a way that the destination will accept and reply to.

What “tunnel is up” actually means (and what it doesn’t)

Most site-to-site VPNs have two separate “up” concepts:

  • Control plane up: IKE/handshake completed, keys negotiated, SA established. This is what dashboards love.
  • Data plane working: packets are routed into the tunnel, match selectors, pass firewall/NAT, survive MTU, and return traffic follows the same path back.

Your problem is almost always in the data plane. Specifically:

  • Routing: packets never enter the tunnel (wrong routes, missing routes, wrong policy routing, wrong VRF).
  • Selectors / policy: packets enter the box but don’t match the VPN’s traffic selectors (policy-based IPsec) or allowed IPs (WireGuard).
  • NAT / firewall: packets match the tunnel but get NATed into something the other side doesn’t expect, or blocked by stateful rules.
  • Path/MTU: packets disappear due to fragmentation, PMTUD black holes, or MSS not clamped.
  • Asymmetry: packets go out via the VPN and come back via the internet (or another WAN), triggering firewalls, rp_filter, or plain old “no route back.”

The only reliable test is simple: send a packet from a host behind Site A to a host behind Site B, then prove—using
captures and counters—where it stops.

Interesting facts and historical context (that still bite you today)

A little history explains why modern VPN routing feels like archaeology: different eras of networking left behind
different assumptions, and your current setup is a fossil mashup.

  1. IPsec predates “cloud networking”: the core specs were formed when most networks were static and private. Today we bolt it onto NAT, overlays, and dynamic routing.
  2. Policy-based VPNs came first in many firewalls: instead of “route traffic into an interface,” early designs matched packets against selectors and encrypted them. Easy for small setups, painful at scale.
  3. NAT-Traversal (NAT-T) exists because NAT broke IPsec: ESP didn’t survive typical NAT devices, so UDP encapsulation on port 4500 became a standard escape hatch.
  4. PMTUD black holes are older than most dev teams: ICMP “Fragmentation Needed” has been blocked by “security” teams since forever, and VPN encapsulation makes it worse.
  5. “Split tunnel” used to be a client VPN debate: now the same concept appears in site-to-site as selective routing, multiple tunnels, and traffic steering policies.
  6. BGP over IPsec became popular because humans are unreliable: static routes are fine until the third site shows up, then you’re one spreadsheet away from an outage.
  7. Reverse path filtering (rp_filter) is a Linux response to spoofing: it’s great… until your routing is asymmetric, then it starts dropping legitimate VPN traffic.
  8. WireGuard is intentionally minimal: it gives you a tunnel and keys, not policy frameworks. Misconfigured AllowedIPs is the modern equivalent of selector mismatch.
  9. Overlapping RFC1918 space is a corporate tradition: 10.0.0.0/8 was supposed to be private, not universal. Mergers ensured it became universal anyway.

Fast diagnosis playbook (find the bottleneck fast)

When someone says “the VPN is up but nothing works,” do not start by changing crypto settings. That’s how you turn a
routing bug into a long night. Start with a tight loop: prove path, prove policy, prove return.

First: prove the packet enters the VPN gateway

  1. From a host behind Site A, ping/traceroute a specific host behind Site B (not “the subnet”). Pick one IP you control.
  2. On the Site A VPN gateway, capture traffic on the LAN interface to confirm the packet arrives.
  3. Check the gateway’s routing decision: does it send that destination into the tunnel interface/VTI or match a policy?

Second: prove the packet matches the VPN policy/selectors

  1. On IPsec, verify the installed SAs include the exact source/destination subnets you’re testing.
  2. On WireGuard, verify AllowedIPs includes the remote subnet, and the remote peer has a route back.
  3. Look at SA byte counters. If they stay at zero while you generate traffic, your packet is not entering the tunnel policy.
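
If your gateway is Linux with kernel IPsec (strongSwan or similar), the XFRM state exposes per-SA byte counters regardless of what the dashboard shows. A minimal sketch of that check, assuming root on the gateway; the byte values here are illustrative:

cr0x@server:~$ sudo ip -s xfrm state | grep -A1 'lifetime current'
    lifetime current:
      18240(bytes), 152(packets)
--
    lifetime current:
      16512(bytes), 148(packets)

If those numbers don’t move while you generate test traffic, stop staring at tunnel status and go back to routing and selectors.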

Third: prove the return path and state

  1. Capture on the remote site LAN interface: does the packet emerge decrypted?
  2. If it arrives, capture the reply: does it route back into the tunnel?
  3. Check firewall state and rp_filter on both sides.

Fourth: check MTU/MSS only after routing/policy basics

MTU issues can look like “nothing works,” but they usually present as “ping works, TCP stalls,” or “small requests work, large ones hang.”
Don’t prematurely blame MTU because it’s fashionable.

Joke #1: A VPN dashboard that says “Up” is like a coffee machine light that says “Ready.” It doesn’t guarantee you’re getting coffee.

A mental model: the four gates every packet must pass

Here’s the model I use when debugging. Every packet trying to cross a site-to-site VPN must pass four gates:

Gate 1: Can the source host reach its gateway, and is the gateway the right one?

A surprising number of “VPN” problems are actually “wrong default gateway,” “host firewall,” or “wrong VLAN.” If your host is not
sending traffic to the VPN gateway (or to the router that knows the VPN), you’re debugging the wrong device.

Gate 2: Does the gateway route the packet into the tunnel?

Route-based VPNs rely on routing tables (static, OSPF/BGP, policy routing). Policy-based VPNs rely on selectors:
“if packet matches this ACL, encrypt it.” Both can be wrong in boring ways.

The classic “tunnel is up but nothing works” case is a missing route for the remote subnet on the LAN-side router: packets head out to the internet instead and never hit the VPN policy.

Gate 3: Does the tunnel accept the packet?

IPsec has traffic selectors (local/remote subnets). If your packet doesn’t match, it won’t be encrypted by the SA you think exists.
WireGuard has AllowedIPs, which is both a crypto policy and a routing policy: it decides what gets encrypted and what gets accepted from a peer.

Gate 4: Can the remote side deliver the packet and return it symmetrically?

Even if the remote side receives the packet, it has to deliver it to the destination host and get a reply back the same way.
Replies that go out another path are the enemy of stateful firewalls, conntrack, NAT expectations, and rp_filter.

Practical tasks with commands: verify, decide, fix

The following tasks are meant to be run during an incident. Each has: a command, what the output means, and the decision you make.
Commands are shown for Linux because it’s a common platform for VPN gateways, jump hosts, and test machines—even if your VPN device itself is a firewall appliance, the same checks apply conceptually.

Task 1: Verify the host is sending traffic to the correct gateway

cr0x@server:~$ ip route show default
default via 10.20.0.1 dev eth0 proto dhcp src 10.20.0.55 metric 100

Meaning: The host uses 10.20.0.1 as default gateway. If your VPN gateway/router isn’t 10.20.0.1 (or doesn’t know routes to the remote site),
your packet never even gets a chance.

Decision: If default gateway is wrong, fix DHCP/static config or add a specific route for the remote subnet via the correct router.

Task 2: Confirm there is a route to the remote subnet (host perspective)

cr0x@server:~$ ip route get 172.31.40.10
172.31.40.10 via 10.20.0.1 dev eth0 src 10.20.0.55 uid 1000
    cache

Meaning: Traffic to 172.31.40.10 goes to 10.20.0.1. That router must have a route into the VPN, otherwise you’re leaking to the wrong path.

Decision: If the next hop isn’t your expected VPN-aware router, add a more specific route or fix routing announcements.

Task 3: Check routing on the VPN gateway (route-based VPN)

cr0x@server:~$ ip route get 172.31.40.10
172.31.40.10 dev vti0 src 10.20.0.1 uid 0
    cache

Meaning: The gateway will send this destination into vti0 (a virtual tunnel interface). Good sign: routing is pointing at the tunnel.

Decision: If it routes out the WAN interface instead, you need routes (static/BGP) or policy routing to steer it into the tunnel.

Task 4: Identify policy routing or multiple tables (Linux gateways)

cr0x@server:~$ ip rule show
0:      from all lookup local
1000:   from 10.20.0.0/24 lookup 100
32766:  from all lookup main
32767:  from all lookup default

Meaning: Traffic sourced from 10.20.0.0/24 uses routing table 100. If your VPN route lives in main but traffic uses table 100,
the tunnel won’t be used.

Decision: Put the VPN route in the correct table, or adjust the rule. Treat policy routing as production code: review, test, document.

Task 5: Confirm IP forwarding is enabled on a Linux VPN gateway

cr0x@server:~$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1

Meaning: Forwarding is on. If it’s 0, the gateway is a decorative router.

Decision: If off, enable it and persist in /etc/sysctl.conf or a drop-in. Then re-test traffic.
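
A minimal fix sketch for a Linux gateway; the drop-in file name below is an arbitrary choice, not a required convention:

cr0x@server:~$ sudo sysctl -w net.ipv4.ip_forward=1
net.ipv4.ip_forward = 1
cr0x@server:~$ echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-vpn-forwarding.conf
net.ipv4.ip_forward = 1

The first command takes effect immediately; the drop-in makes it survive a reboot.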

Task 6: Check rp_filter (asymmetry killer)

cr0x@server:~$ sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth0.rp_filter
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.eth0.rp_filter = 1

Meaning: Strict reverse path filtering is enabled. If return traffic is asymmetric (common with dual-WAN or multiple routes),
Linux may drop packets as “spoofed.”

Decision: If you expect asymmetry, set to loose mode (2) on relevant interfaces, or fix routing symmetry (preferred).
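
If you can’t fix symmetry right away, a loose-mode sketch for the interfaces that carry VPN return traffic (eth0 and vti0 here are taken from this article’s examples; persist the values the same way as in Task 5). The kernel applies the higher of the “all” and per-interface values, so 2 on the interface wins over 1 in “all”:

cr0x@server:~$ sudo sysctl -w net.ipv4.conf.eth0.rp_filter=2
net.ipv4.conf.eth0.rp_filter = 2
cr0x@server:~$ sudo sysctl -w net.ipv4.conf.vti0.rp_filter=2
net.ipv4.conf.vti0.rp_filter = 2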

Task 7: Verify WireGuard peer and AllowedIPs (tunnel up, traffic not)

cr0x@server:~$ sudo wg show wg0
interface: wg0
  public key: v4bq6nFh1lEoC4QyM2u0n9bZ1gYpP7oGQf+XvYdW2mU=
  listening port: 51820

peer: q0P3sZtVd8l0Rk2m1yG1CwVf1mQf0Qx4lQw2uXn8G3E=
  endpoint: 198.51.100.20:51820
  allowed ips: 172.31.40.0/24
  latest handshake: 18 seconds ago
  transfer: 12.34 KiB received, 9.87 KiB sent

Meaning: Handshake is recent and transfer counters increment. allowed ips includes 172.31.40.0/24, so outbound encryption for that subnet should occur.

Decision: If counters don’t change while testing, your traffic is not matching AllowedIPs, or the host is routing around wg0.
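
If AllowedIPs is missing the subnet you’re testing, the fix is configuration, not crypto. A hedged sketch of the relevant peer stanza on the Site A side, reusing the values from the output above; the Site B side must mirror this with AllowedIPs = 10.20.0.0/24 or its replies won’t be accepted:

# /etc/wireguard/wg0.conf, relevant peer section only (sketch)
[Peer]
PublicKey = q0P3sZtVd8l0Rk2m1yG1CwVf1mQf0Qx4lQw2uXn8G3E=
Endpoint = 198.51.100.20:51820
AllowedIPs = 172.31.40.0/24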

Task 8: Confirm routes created for WireGuard (Linux)

cr0x@server:~$ ip route show dev wg0
172.31.40.0/24 proto kernel scope link src 10.99.0.1

Meaning: The kernel has a route to the remote subnet via wg0. If missing, you might be using Table = off or a custom routing design.

Decision: Add the route explicitly, or adjust WireGuard configuration so it programs routes as intended.
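
A minimal sketch of adding the route by hand, assuming you manage routes yourself rather than through wg-quick:

cr0x@server:~$ sudo ip route add 172.31.40.0/24 dev wg0
cr0x@server:~$ ip route get 172.31.40.10
172.31.40.10 dev wg0 src 10.99.0.1 uid 1000
    cache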

Task 9: Inspect IPsec SAs and traffic selectors (strongSwan)

cr0x@server:~$ sudo swanctl --list-sas
siteA-siteB: #3, ESTABLISHED, IKEv2, 4c2b1b3c3e5b2a6f_i* 6a8d9c0b2a1f4e3d_r
  local  '203.0.113.10' @ siteA
  remote '198.51.100.20' @ siteB
  AES_GCM_16-256/PRF_HMAC_SHA2_256/ECP_256
  established 241s ago, rekeying in 53m
  siteA-siteB-child: #5, INSTALLED, TUNNEL, ESP:AES_GCM_16-256
    local 10.20.0.0/24
    remote 172.31.40.0/24
    bytes_i 18240, bytes_o 16512, rekeying in 49m

Meaning: The child SA is installed with selectors 10.20.0.0/24 ↔ 172.31.40.0/24. Byte counters are non-zero, so traffic is flowing through the SA.

Decision: If selectors don’t match what you’re testing, fix the local/remote traffic selectors. If bytes stay at 0, packets aren’t matching or aren’t reaching the gateway.
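
If the selectors are wrong, the change belongs in the connection definition. A hedged swanctl.conf fragment showing only the traffic-selector lines; the connection and child names mirror the output above, and your actual file layout may differ:

# /etc/swanctl/conf.d/siteA-siteB.conf (fragment, assumed layout)
connections {
  siteA-siteB {
    children {
      siteA-siteB-child {
        local_ts  = 10.20.0.0/24
        remote_ts = 172.31.40.0/24
      }
    }
  }
}

After editing, apply with swanctl --load-all and re-check swanctl --list-sas; both ends must agree on the selectors or negotiation will narrow or fail.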

Task 10: Check if firewall/NAT rules are interfering (nftables)

cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
  chain forward {
    type filter hook forward priority filter; policy drop;
    ct state established,related accept
    iifname "eth0" oifname "vti0" ip saddr 10.20.0.0/24 ip daddr 172.31.40.0/24 accept
    iifname "vti0" oifname "eth0" ip saddr 172.31.40.0/24 ip daddr 10.20.0.0/24 accept
  }
}
table ip nat {
  chain postrouting {
    type nat hook postrouting priority srcnat; policy accept;
    oifname "eth1" masquerade
  }
}

Meaning: Forward policy is drop; only explicit rules allow LAN↔VPN. NAT masquerade applies to eth1 only (WAN), not the tunnel.

Decision: If NAT is accidentally applied to tunnel traffic (masquerade on all interfaces), fix it. If forward rules are missing, add explicit allow rules for the VPN subnets.

Task 11: Look for NAT on the VPN path (iptables legacy setups)

cr0x@server:~$ sudo iptables -t nat -S
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-A POSTROUTING -o eth1 -j MASQUERADE

Meaning: NAT is only on WAN interface eth1. Good. If you see MASQUERADE without -o eth1, you may be NATing VPN traffic unintentionally.

Decision: Tighten NAT rules so VPN subnets are exempt, or scope masquerade to WAN only.
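
A sketch of the usual exemption pattern, using this article’s subnets; ACCEPT in the nat table means “stop NAT processing for this flow,” so it must sit before the masquerade rule:

cr0x@server:~$ sudo iptables -t nat -I POSTROUTING 1 -s 10.20.0.0/24 -d 172.31.40.0/24 -j ACCEPT
cr0x@server:~$ sudo iptables -t nat -S POSTROUTING
-P POSTROUTING ACCEPT
-A POSTROUTING -s 10.20.0.0/24 -d 172.31.40.0/24 -j ACCEPT
-A POSTROUTING -o eth1 -j MASQUERADE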

Task 12: Prove where packets die using tcpdump (on both sides)

cr0x@server:~$ sudo tcpdump -ni eth0 host 172.31.40.10
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:02:11.102938 IP 10.20.0.55 > 172.31.40.10: ICMP echo request, id 3812, seq 1, length 64

Meaning: The request reaches the gateway’s LAN interface. Next, capture on the tunnel interface to see whether it’s being forwarded/encrypted.

Decision: If it appears on LAN but not on tunnel, it’s routing/policy/firewall on the gateway. If it appears on tunnel but not on remote LAN, it’s tunnel policy/MTU/remote firewall.
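
The follow-up captures, assuming a route-based setup with vti0 and eth1 as the WAN (with policy-based IPsec there is no tunnel interface, so watch the WAN for ESP or NAT-T instead):

cr0x@server:~$ sudo tcpdump -ni vti0 host 172.31.40.10
cr0x@server:~$ sudo tcpdump -ni eth1 'esp or udp port 4500'

If the first capture is silent, the gateway never forwards into the tunnel (routing/policy/firewall). If the second is also silent, nothing is being encrypted at all.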

Task 13: Check tunnel interface counters (route-based VPN)

cr0x@server:~$ ip -s link show dev vti0
7: vti0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1436 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
    RX:  bytes packets errors dropped  missed   mcast
       82912     612      0      0       0       0
    TX:  bytes packets errors dropped carrier collsns
       74488     581      0      0       0       0

Meaning: Counters incrementing indicates traffic is being sent/received on the VTI. If TX increments but RX stays flat, return traffic is missing or blocked.

Decision: If only one direction moves, focus on return route, remote firewall, remote selectors, or NAT.

Task 14: Diagnose MTU/PMTUD with ping and DF bit

cr0x@server:~$ ping -M do -s 1472 -c 3 172.31.40.10
PING 172.31.40.10 (172.31.40.10) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1436
ping: local error: message too long, mtu=1436
ping: local error: message too long, mtu=1436

--- 172.31.40.10 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2032ms

Meaning: The route toward the destination has an MTU of 1436 (the tunnel), so a full 1500-byte packet with DF set can’t be sent. That’s expected: encapsulation overhead shrinks the usable MTU below the LAN’s 1500.

Decision: Clamp MSS for TCP, reduce MTU on tunnel interface, or ensure ICMP “frag needed” isn’t blocked. If small pings work but TCP stalls, MTU is likely guilty.
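
A minimal sketch of the MTU-lowering option; 1400 is an illustrative conservative value, and the successful reply below assumes the path now fits:

cr0x@server:~$ sudo ip link set dev vti0 mtu 1400
cr0x@server:~$ ping -M do -s 1372 -c 1 172.31.40.10
PING 172.31.40.10 (172.31.40.10) 1372(1400) bytes of data.
1380 bytes from 172.31.40.10: icmp_seq=1 ttl=63 time=4.21 ms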

Task 15: Check MSS clamping status (nftables example)

cr0x@server:~$ sudo nft list chain inet mangle forward
table inet mangle {
  chain forward {
    type filter hook forward priority mangle; policy accept;
    oifname "vti0" tcp flags syn tcp option maxseg size set 1360
  }
}

Meaning: SYN packets going out the tunnel have MSS set to 1360. This prevents TCP from sending segments too large for the tunnel MTU.

Decision: If you see PMTUD issues, implement MSS clamping at the edge. It’s not elegant; it is effective.

Task 16: Trace path and identify “wrong exit” routing

cr0x@server:~$ traceroute -n 172.31.40.10
traceroute to 172.31.40.10 (172.31.40.10), 30 hops max, 60 byte packets
 1  10.20.0.1  0.431 ms  0.382 ms  0.369 ms
 2  203.0.113.1  2.171 ms  2.133 ms  2.114 ms
 3  198.18.0.5  6.990 ms  6.945 ms  6.903 ms

Meaning: This traffic is going out to the WAN (203.0.113.1) instead of into the tunnel. That’s the “missing route” smell.

Decision: Fix routing to steer 172.31.40.0/24 into the tunnel. If you rely on BGP, verify the prefix is advertised and accepted.
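
With static routing, steering the prefix is one line on the VPN gateway (sketch; if you run BGP, fix the advertisement or filters instead of adding statics):

cr0x@server:~$ sudo ip route add 172.31.40.0/24 dev vti0
cr0x@server:~$ ip route get 172.31.40.10
172.31.40.10 dev vti0 src 10.20.0.1 uid 1000
    cache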

Task 17: Verify BGP-learned routes and next-hop (FRR)

cr0x@server:~$ sudo vtysh -c "show ip route 172.31.40.0/24"
Routing entry for 172.31.40.0/24
  Known via "bgp", distance 20, metric 0, best
  Last update 00:03:12 ago
  * 10.255.255.2, via vti0

Meaning: BGP is providing the route, next hop over vti0. That’s what you want for a route-based VPN with dynamic routing.

Decision: If the route is missing or points somewhere else, investigate BGP session state, route filters, and next-hop reachability.
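
Useful FRR follow-ups, assuming the neighbor address from the route output above (10.255.255.2); outputs are omitted because they depend entirely on your peering:

cr0x@server:~$ sudo vtysh -c "show bgp summary"
cr0x@server:~$ sudo vtysh -c "show ip bgp neighbors 10.255.255.2 routes"
cr0x@server:~$ sudo vtysh -c "show ip bgp neighbors 10.255.255.2 advertised-routes"

The first confirms session state, the second shows what you accept from the peer, and the third shows whether you are advertising your own LAN prefix back (the return route).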

Task 18: Check conntrack for stateful firewall weirdness

cr0x@server:~$ sudo conntrack -L -p icmp | head
icmp     1 29 src=10.20.0.55 dst=172.31.40.10 type=8 code=0 id=3812 [UNREPLIED] src=172.31.40.10 dst=10.20.0.55 type=0 code=0 id=3812

Meaning: The gateway sees outbound echo requests but no replies. If you expect replies, they’re not returning or are being dropped before conntrack sees them.

Decision: Capture on remote side, check return route, and check remote firewall. If replies return on a different interface, that’s asymmetry.

Common mistakes: symptoms → root cause → fix

1) Tunnel shows “up,” but SA byte counters stay at zero

Symptoms: IKE established; pings/timeouts; counters don’t move.

Root cause: Traffic never matches selectors/AllowedIPs, or never reaches the gateway due to routing.

Fix: Verify routes to remote subnet on the source host and gateway. For IPsec policy-based, ensure the exact source/destination subnets match the child SA. For WireGuard, ensure AllowedIPs includes the destination and the kernel routes point to wg0.

2) One-way traffic: A can reach B, B can’t reach A (or replies never arrive)

Symptoms: TCP SYN seen at remote, SYN-ACK never returns; ICMP request seen, reply missing; VTI TX increases but RX flat.

Root cause: Missing return route on remote LAN/router; asymmetric routing due to multiple uplinks; NAT on one side changes source IP unexpectedly.

Fix: Add/advertise the return route for the source subnet via the VPN. Ensure NAT exemption for VPN subnets. If multiple exits exist, enforce symmetry using routing policy or design.

3) Ping works, but HTTPS/SSH hangs or large transfers stall

Symptoms: Small packets succeed; TCP connections establish but stall on data; specific apps fail mysteriously.

Root cause: MTU/PMTUD issues due to encapsulation overhead; ICMP blocked; MSS too high.

Fix: Clamp TCP MSS on traffic going into the tunnel. Set tunnel interface MTU appropriately. Allow ICMP “Fragmentation Needed” on relevant paths if your security posture can handle reality.

4) Works for one subnet, fails for another

Symptoms: Some hosts can reach remote; others can’t; one VLAN works, another doesn’t.

Root cause: Selectors/AllowedIPs only include one subnet; missing routes for additional subnets; firewall rules scoped too narrowly.

Fix: Expand selectors/AllowedIPs symmetrically on both ends. Add routes and firewall rules for each subnet. For IPsec, remember both sides must agree on traffic selectors.

5) Traffic works until a rekey, then dies

Symptoms: Stable for minutes/hours, then sudden blackhole after rekey; “tunnel up” still shown.

Root cause: Rekey mismatch, NAT rebinding, stateful firewall timeouts, or DPD handling differences causing child SAs to desync.

Fix: Align lifetimes/rekey margins; ensure NAT-T stable; confirm DPD settings. Monitor SA install/remove events, not just IKE state.

6) Random drops under load, especially UDP or VoIP

Symptoms: Jitter, drops, bursts of loss; ping stable but real-time traffic bad.

Root cause: QoS not applied inside tunnel; CPU bottleneck on encryption; queueing and bufferbloat; fragmentation of UDP payloads.

Fix: Profile CPU and interface queueing. Shape/mark traffic before encryption if possible. Reduce MTU for UDP-heavy apps. Consider hardware offload or stronger instances.

7) Everything breaks only from one direction after “security hardening”

Symptoms: After a baseline hardening change, VPN traffic dies; local traffic fine.

Root cause: rp_filter set to strict, firewall default policy changed, or ICMP blocked globally.

Fix: Re-evaluate rp_filter and forwarding rules with the VPN design. Put explicit allow rules for VPN subnets. Allow the ICMP needed for PMTUD, or compensate with MSS clamp.

8) Overlapping networks: both sides use 10.0.0.0/8

Symptoms: Traffic goes to local resources instead of the tunnel; routes fight; intermittent results based on longest prefix and local policy.

Root cause: Address space overlap. VPN routing can’t distinguish “their 10.20.0.0/24” from “our 10.20.0.0/24.”

Fix: Use NAT (carefully) or renumber. If you must NAT, do it consistently and document the translated ranges; update selectors and firewall rules accordingly.
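
A hedged sketch of 1:1 prefix translation (NETMAP) on a Linux gateway with a route-based tunnel on vti0. Here the local overlapping 10.20.0.0/24 is presented to the remote site as 192.168.20.0/24; that translated range is a hypothetical choice for illustration, and the remote side, selectors, and firewall rules must all use it consistently:

cr0x@server:~$ sudo iptables -t nat -A POSTROUTING -o vti0 -s 10.20.0.0/24 -j NETMAP --to 192.168.20.0/24
cr0x@server:~$ sudo iptables -t nat -A PREROUTING -i vti0 -d 192.168.20.0/24 -j NETMAP --to 10.20.0.0/24

The remote site now talks to 192.168.20.x addresses, and this gateway maps them 1:1 back to the real hosts. Document the mapping, or future-you will re-learn it during an incident.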

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran a “simple” site-to-site IPsec VPN between HQ and a warehouse. The tunnel came up. The firewall UI smiled.
The warehouse team tried to reach an inventory API at HQ. Nothing. They escalated. Naturally, the first move was to rotate keys and
re-run IKE debugging, because that’s what people do when they’re anxious.

The wrong assumption was: “If the tunnel is up, the firewall will route traffic through it.” The actual setup was policy-based IPsec.
Only traffic that matched a specific pair of subnets would be encrypted. Meanwhile, the new inventory API had been deployed onto a
different VLAN at HQ, with a subnet not included in the phase 2 selectors.

The warehouse hosts could reach the old subnet fine. Nobody noticed because the test checklist was “ping any HQ host.”
The new API subnet? Not in selectors, so it was routed out the default internet path and dropped upstream. SA counters stayed near zero,
but nobody looked because the UI didn’t display them prominently.

The fix was boring: add the missing subnet to the local encryption domain on one side and to the remote encryption domain on the other,
then mirror firewall rules. After that, counters moved, the API was reachable, and everyone pretended it was an “ISP issue” for
morale reasons.

Mini-story 2: The optimization that backfired

A global enterprise had multiple site-to-site tunnels into a central hub. Latency mattered. Someone proposed an optimization:
reduce overhead by raising MTU and letting PMTUD “figure it out.” The change was rolled to several gateways, and it looked fine in
synthetic pings. Everyone went home early.

Two days later, a finance app started timing out only on large report downloads. Authentication succeeded. Small pages loaded.
Downloads hung. The VPN tunnel was up; counters moved; CPU was fine. It smelled like the worst kind of problem: “intermittent.”

The backfire was classic: PMTUD relied on ICMP messages that were blocked by a security policy somewhere between sites.
With the new MTU settings, encapsulated packets exceeded the real path MTU. The network could not signal “fragmentation needed,” so the packets vanished.
TCP kept retrying until timeouts did the rest.

The fix was to clamp MSS on tunnel egress and reduce the tunnel MTU to a safe value that worked across all paths.
They also adjusted firewall policy to permit the specific ICMP types needed for PMTUD—where they could.
The “optimization” was rolled back, and the lesson stuck: MTU is a shared truth, not a personal preference.

Mini-story 3: The boring but correct practice that saved the day

A SaaS company ran route-based VPNs from their primary data center to several partners. They had a habit that looked tedious:
every tunnel had a one-page “contract” describing local/remote prefixes, NAT policy, tunnel MTU, and the expected next-hop routing.
They also kept a basic “known good” packet capture recipe and verified it quarterly.

One afternoon, a partner reported they couldn’t reach a specific subnet. The tunnel was up. The partner’s team insisted “nothing changed.”
The SaaS on-call pulled the contract doc, then ran a quick route lookup: the advertised prefix from the partner had quietly shrunk from a /16 to a /24.
BGP was still up, but the route set wasn’t what their firewall rules expected.

Because the practice was boringly consistent, the on-call had a baseline to compare against. They could say, with evidence, “the tunnel is fine,
but your side is no longer advertising the aggregate; we’re only learning this more specific prefix.” That statement ended the blame game quickly.

They implemented a temporary static route for the missing range (scoped, time-limited), while the partner fixed their route announcements.
No heroics. No guessing. Just documented expectations and quick verification.

Joke #2: BGP is like office politics—everyone claims they’re sharing information, but you still need to check what they actually said.

Checklists / step-by-step plan

A. Incident checklist (15–30 minutes to isolate)

  1. Pick one test pair: a specific source host and destination host, with IPs you can access.
  2. Confirm local host routing: ip route get <dest> must point to the VPN-aware gateway/router.
  3. Confirm gateway sees the traffic on LAN: capture on LAN interface.
  4. Confirm gateway routes into tunnel or matches policy: ip route get (route-based) or SA counters/selectors (policy-based).
  5. Confirm tunnel counters increment: VTI counters, SA byte counters, WireGuard transfer counters.
  6. Confirm remote side receives decrypted traffic: capture on remote LAN side.
  7. Confirm return path: capture reply leaving remote side and ensure it routes back into tunnel.
  8. Check firewall rules both directions: forward/ACL and NAT exemption.
  9. Only then check MTU/MSS: if TCP hangs or large packets fail.
  10. Document the failure point: “dies before tunnel,” “dies in tunnel policy,” “dies after decrypt,” or “return path broken.”

B. Build-it-right checklist (before you go live)

  1. Choose route-based unless you have a reason not to: it scales better and makes troubleshooting sane.
  2. Decide on dynamic routing or static: if there will be more than a couple of prefixes, prefer BGP with clear filters.
  3. Write down prefix contracts: local and remote prefixes, including future growth expectations.
  4. Define NAT policy explicitly: NAT exemption for VPN traffic, and translated ranges if overlaps exist.
  5. Set MTU/MSS intentionally: don’t “let it ride.” Measure, then clamp.
  6. Plan for asymmetry: single exit is simplest; if not possible, enforce routing policy and validate rp_filter settings.
  7. Monitor data-plane signals: SA bytes, interface counters, packet drops, rekeys—dashboards should show traffic, not just “connected.”
  8. Create a test suite: ping, TCP connect, HTTP request, large transfer, and DNS—run after every change.

C. Change checklist (so you don’t self-own production)

  1. Before change: capture baseline SA selectors, routes, and MTU values.
  2. Make one change at a time: routing, then policy, then firewall, then MTU.
  3. After change: verify with the same test pair and compare counters before/after.
  4. Have rollback: keep prior configs ready, and know how to revert routes and firewall rules.

FAQ

1) Why does the VPN say “up” if no traffic passes?

Because “up” usually means the IKE control plane is established. Data plane depends on routing, selectors/AllowedIPs, firewall/NAT, and return paths.

2) What’s the fastest proof that routing is the problem?

Run ip route get <remote-host> on the gateway. If it points out the WAN interface instead of the tunnel/VTI, you found the issue.
Also, traceroute showing internet hops instead of the tunnel path is a dead giveaway.

3) What are “traffic selectors” in IPsec, and how do they break things?

Traffic selectors define which source/destination subnets the child SA will encrypt. If your real traffic uses a subnet not in selectors,
it won’t match and won’t be encrypted. Result: tunnel up, traffic fails.

4) WireGuard handshakes but I can’t reach the remote subnet. What’s the usual cause?

Misconfigured AllowedIPs or missing kernel routes. WireGuard can handshake with zero useful routing. Also check the remote side has a route back to your subnet.

5) How do I tell NAT is breaking my site-to-site VPN?

If the remote side sees traffic coming from a translated source IP (not your real local subnet), it may not match selectors or firewall rules.
On Linux gateways, inspect NAT rules and confirm VPN subnets are exempt from masquerade.

6) Why does ping work but TCP fails?

Usually MTU/MSS. ICMP echo uses small packets by default; TCP may negotiate large segments that get dropped due to encapsulation overhead and blocked PMTUD.
Clamp MSS and set tunnel MTU conservatively.

7) Do I need BGP over the VPN?

If you have multiple prefixes, multiple sites, failover links, or you expect growth, yes—BGP reduces human error. If it’s truly one subnet each side and never changes, static routes can be fine.

8) What’s the difference between policy-based and route-based VPNs for troubleshooting?

Policy-based: you debug selectors/ACL matches and SA counters. Route-based: you debug routes to a tunnel interface. Route-based tends to be easier to reason about and scale.

9) How can asymmetric routing happen if there’s only one VPN tunnel?

Because the rest of the network may have multiple exits. The outbound packet might take the tunnel, while the reply takes a different WAN or a different router with a different view of routes.
Stateful firewalls and rp_filter hate this. Fix with routing design, not hope.

10) What should I monitor so this doesn’t become a monthly surprise?

Monitor SA byte counters, VTI interface counters, rekey frequency, packet drops, and tunnel MTU-related symptoms (TCP retransmits).
“Tunnel up” is not a metric; it’s a mood.

Conclusion: next steps you can do today

If you remember one thing: a VPN tunnel being up proves the two endpoints can agree on encryption. It does not prove your packets are being routed, selected, forwarded, and returned.
Treat the VPN like a network path with a control plane and a data plane, and debug it like you would any other path.

Practical next steps:

  1. Create a single test pair (source and destination hosts) and write it into your runbook.
  2. Add data-plane monitoring: SA bytes and tunnel interface counters with alerting on “up but idle during business hours.”
  3. Standardize on route-based designs where possible, and document prefix contracts and NAT rules.
  4. Set MTU/MSS intentionally, then test with large transfers—not just pings.
  5. Run the fast diagnosis playbook the next time someone says “VPN is up but nothing works,” and refuse to touch crypto until routing is proven.

“Everything fails, all the time.” — Werner Vogels
