Your offices don’t experience “VPN problems.” They experience payroll runs that stall, VoIP that turns into underwater interpretive dance,
and file shares that time out exactly when the CFO is watching. VPN design choices are boring until they aren’t. Then they become a
high-stakes argument about who changed what, and why the internet is “down” in one building but not the other.
Route-based vs policy-based VPN is one of those choices. Both can encrypt packets and keep auditors calm. Only one tends to keep
operators sane once you have multiple subnets, overlapping address space, cloud attachments, and the occasional acquisition where
nobody knows what IP ranges exist anymore. I’m going to tell you what to pick for offices, when to break the rule, and how to diagnose it fast.
What route-based and policy-based VPNs actually are
Policy-based VPN: “match traffic, then encrypt it”
A policy-based VPN is typically an IPsec tunnel where the firewall uses selectors (often called “proxy IDs” or “traffic selectors”)
to decide what gets encrypted: source subnet(s), destination subnet(s), sometimes ports/protocols.
If traffic matches the policy, it goes into the tunnel; if it doesn’t, it doesn’t.
The mental model is: security policy drives forwarding. Routing still exists, but it’s not the primary decision point.
In many devices, policy-based VPNs feel like writing ACLs that happen to encrypt.
This is why policy-based VPNs are common on older gear and on “SMB-grade” firewalls: they’re easy to explain and quick to deploy
for one subnet to one subnet. The price is paid later, with interest.
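To make "selectors" concrete, here is a minimal sketch in strongSwan's swanctl.conf syntax, using subnets that appear elsewhere in this article (10.20.6.0/24 stands in for a hypothetical voice VLAN); other vendors express the same idea as proxy IDs or crypto ACLs:

connections {
  hq {
    version = 2
    local_addrs  = 198.51.100.20
    remote_addrs = 203.0.113.10
    children {
      # one child SA (selector pair) per local/remote subnet combination
      users-to-hq {
        local_ts  = 10.20.5.0/24
        remote_ts = 10.40.12.0/24
        start_action = trap
      }
      # hypothetical voice VLAN: a new subnet means a new selector on both ends
      voice-to-hq {
        local_ts  = 10.20.6.0/24
        remote_ts = 10.40.12.0/24
        start_action = trap
      }
    }
  }
}

Every additional subnet pair is another child like these, negotiated and kept in sync on both peers.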
Route-based VPN: “build an interface, then route into it”
Route-based VPNs create a logical tunnel interface (often VTI: Virtual Tunnel Interface) and let your routing table decide what should go
over the tunnel. Encryption happens for whatever is routed through that interface (plus whatever your IPsec policy allows, but the point is:
you design it like a network, not like a pile of special cases).
The mental model is: routing drives forwarding; security policy controls permission. That separation is gold in operations.
It means you can do normal things: dynamic routing (BGP/OSPF), route failover, multiple subnets without combinatorial explosion,
and sane troubleshooting (“what route did we pick?” is a better question than “which of 37 selectors matched?”).
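A minimal Linux sketch of that pattern with iproute2 (addresses match the examples later in this article; 10.40.13.0/24 is a hypothetical extra HQ subnet, and the IPsec SA itself, e.g. strongSwan with a mark matching the VTI key and wide selectors, is configured separately):

# tunnel interface bound to the public peer addresses; key 42 is an arbitrary mark
sudo ip link add vti0 type vti local 198.51.100.20 remote 203.0.113.10 key 42
sudo ip addr add 169.254.21.1/30 dev vti0
sudo ip link set vti0 up mtu 1436

# routing, not selectors, decides what gets encrypted
sudo ip route add 10.40.12.0/24 dev vti0
# adding another remote subnet is just another route
sudo ip route add 10.40.13.0/24 dev vti0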
What people mean (and where they get confused)
Vendors are not consistent with terminology. Some boxes claim “route-based” but still require traffic selectors; others implement
“policy-based” but let you bind it to a tunnel-like object. Don’t argue about labels. Argue about these properties:
- Is there a real tunnel interface with an IP address? (Often yes for route-based, no for pure policy-based.)
- Can you install routes (static or dynamic) pointing to the tunnel?
- Do you need to enumerate every local/remote subnet pair? (Policy-based tends to.)
- Can you run BGP over it? (Route-based tends to make this straightforward.)
- How does failover behave? (Route-based can be clean; policy-based can be “surprising”.)
One quote worth keeping on your wall, because it’s the whole job in a sentence:
“Hope is not a strategy.”
(The attribution wanders depending on who tells the story; the operations lesson doesn’t.)
Policy-based VPNs run on hope more often than teams admit. They work until a new subnet arrives, or until someone asks for active/active,
or until the cloud team wants BGP. Then the selectors become a brittle pile of assumptions.
Office reality: what changes everything
Offices are messy. They grow sideways. They inherit printer VLANs from 2014 and guest Wi‑Fi designs from 2017 and a “temporary” subnet
added for a contractor that never left. VPN design that survives an office environment needs to handle:
- Many subnets per site (users, voice, cameras, IoT, servers, guest, OT).
- Frequent changes (new SaaS breakouts, new DNS, new split-tunnel exceptions).
- Acquisitions (overlapping RFC1918 space is not a theoretical problem; it is a calendar event).
- Multiple upstreams (dual ISPs, LTE backup, SD-WAN underlays).
- Mixed vendors (because of course it’s a Forti at one end and a Palo at the other, with “helpful” NAT in the middle).
- Security policy pressure (zero trust-ish segmentation, auditing, MFA for admin access, and “block everything by default”).
The core question isn’t “which VPN is more secure.” Both can be secure. The question is: which model fails in a predictable,
diagnosable way when the office changes? Because the office will change. It always does.
Which is better for offices (and the exceptions)
The default recommendation: route-based for almost every office-to-office and office-to-datacenter VPN
For offices, route-based wins in the long run because it aligns with how networks are actually operated:
you add routes, you monitor routes, you fail over routes, you summarize routes, you debug routes. You don’t want VPN reachability
to depend on a matrix of selectors that only one engineer understands and nobody wants to touch on a Friday.
Route-based VPNs also play better with real-world needs:
- Dynamic routing (BGP especially) for many sites, cloud interconnects, and failover logic.
- Multiple subnets without selector explosion (one tunnel, many routes).
- Cleaner HA (track tunnel interface, track BGP session, withdraw routes).
- Less brittle change management (adding a subnet becomes “add route + firewall rule,” not “add N selectors”).
When policy-based is still fine
Policy-based VPN is acceptable when:
- It’s truly small: one subnet to one subnet (or a few), and it won’t grow.
- The device doesn’t support VTI/route-based properly (common on older/low-end gear).
- You need per-application/per-port encryption decisions and the platform does it cleanly (rare, but possible).
- You’re integrating with a partner who insists on strict traffic selectors and will not run dynamic routing.
But understand what you’re signing up for: every new subnet is a policy change. Every policy change is a chance to break things in
a way that doesn’t show up in “tunnel is up” checks.
Where teams get burned: “tunnel up” is not “traffic works”
Policy-based VPNs love to show green lights: IKE is up, IPsec is up, bytes are moving. Meanwhile, the new VLAN can’t reach anything
because its selectors were never added, or because NAT snuck in and changed the source IP, or because the remote end is strict and
silently drops traffic that doesn’t match.
Route-based designs can still break, but they break like networks break: wrong route, missing route, asymmetric routing, MTU issues,
firewall rule. Those are solvable at 2 a.m. with a terminal and a pulse.
Joke #1: A policy-based VPN is like a VIP guest list. It works great until your company keeps hiring people.
What “better” means operationally
In office environments, “better” is not elegance. It’s mean time to restore, blast radius, and how many things can change safely.
Route-based tends to win on:
- Observability: route tables, interface stats, BGP state, standard tooling.
- Change safety: smaller diffs, fewer coupled conditions.
- Scalability: adding sites/subnets doesn’t multiply configuration objects across both ends.
- Interoperability: clouds and modern firewalls increasingly assume route-based patterns.
Decision matrix: pick in 5 minutes
Pick route-based if any of these are true
- You have more than two subnets at either site, or expect to within a year.
- You need failover that isn’t a coin flip (dual ISP, dual hubs, active/active).
- You want BGP (or might want it later).
- You have cloud VPNs in the mix (most cloud VPN gateways are route-centric).
- You need to summarize routes or deal with overlapping subnets via NAT.
- You want to troubleshoot with normal tools: routes, ping, tcpdump, counters.
Pick policy-based if all of these are true
- It’s a small, stable, static subnet-to-subnet link.
- No dynamic routing. No HA complexity. No growth expectations.
- The platform’s route-based implementation is weak or unavailable.
- You can document selectors and keep them synchronized across both ends.
Pick “route-based, but with guardrails” for most offices
The best office VPNs are route-based plus discipline:
explicit allowed prefixes, route filters, sane defaults on MTU/MSS, and monitoring that treats the tunnel like a WAN link.
You’re building a network, not a magic portal.
Interesting facts and context you can repeat in meetings
- IPsec started as a security layer for IPv6 work in the 1990s, then got widely deployed for IPv4 because reality always wins.
- “Proxy IDs” come from early IPsec selector ideas: define exactly which traffic should be protected, a model that fit the subnet-to-subnet era.
- Virtual Tunnel Interfaces (VTI) became popular as vendors realized operators wanted IPsec to behave like GRE: an interface you can route over.
- IKEv2 (2005-era standardization) cleaned up many of IKEv1’s rough edges, especially around negotiation and resilience.
- NAT-Traversal exists because the internet is messy: UDP encapsulation (typically 4500) was a pragmatic fix for NAT devices breaking IPsec.
- Cloud VPN products pushed route-based adoption because clouds want route tables and dynamic routing, not selector spreadsheets.
- SD-WAN accelerated the “tunnel as interface” mindset: underlays change, overlays route; policy-based selectors don’t adapt well.
- Route-based isn’t automatically “any-to-any”: you still need firewall policy and route filters. The difference is you can manage them separately.
Three corporate mini-stories from the trenches
Mini-story #1: the incident caused by a wrong assumption
A mid-sized company had a policy-based site-to-site VPN between HQ and a branch. It had been stable for years: one /24 at the branch,
a couple of /24s at HQ. Green lights everywhere. Nobody touched it.
Then the branch added a second VLAN for VoIP phones. The network engineer added the VLAN on the branch firewall, added DHCP,
and assumed “the VPN will carry it because it’s the same tunnel.” That assumption is how outages are born.
Phones registered to the local PBX? Fine. Phones reaching the centralized call manager at HQ? Dead. Meanwhile, the VPN still showed “up”
because existing subnets still matched selectors and kept traffic flowing. Monitoring didn’t catch it because the checks were a ping from
the old subnet.
The fix was adding new selectors (proxy IDs) on both ends, plus firewall rules. The lesson wasn’t “policy-based is bad.”
The lesson was: policy-based VPNs make reachability opt-in per subnet pair, so the default failure mode is silent partial outage.
After that, the company migrated to route-based tunnels for branches. The same change today is: add VLAN, add route advertisement,
add firewall rule. Monitoring pings from multiple VLANs, not one “legacy good” subnet.
Mini-story #2: the optimization that backfired
A different organization wanted to “optimize” their VPN by tightening selectors to reduce “unnecessary encryption.”
They had a policy-based tunnel with many subnet pairs, and someone decided to remove broad selectors and keep only the “active” ones.
On paper, it reduced configuration clutter.
Two months later, an application team restarted a service in a DR subnet that hadn’t been used recently. Traffic started from a different
source prefix, still valid and expected. It didn’t match any selector anymore, so the firewall sent it in the clear to the internet-facing
routing path, where upstream ACLs dropped it. The tunnel stayed up. Logs were noisy but not obvious.
The outage lasted longer than it should have because people chased the wrong layer: app logs, DNS, then firewall rules, then ISP.
Only later did someone compare packet captures and notice the traffic wasn’t entering IPsec at all.
The fix was restoring selectors, then later redesigning to route-based with route filtering and proper segmentation. The “optimization”
was really a brittle hard-coding of assumptions. Encryption should be a property of a path, not a fragile list of exceptions.
Mini-story #3: the boring but correct practice that saved the day
A company with 40+ offices ran route-based IPsec with BGP to two regional hubs. Nothing fancy. The “boring” part was their change discipline:
every office had a standard prefix plan, route filters, and a pre-change test list that included MTU checks and per-VLAN reachability.
One day a hub firewall upgrade introduced a different default TCP MSS clamp behavior. Suddenly, file transfers from some offices crawled.
Latency looked fine; pings worked; small HTTP requests worked. Big SMB copies were miserable.
Their runbook required a quick PMTU check and an IPsec interface counter check. Within minutes they saw fragmentation and retransmits.
They adjusted MSS clamp on the hub tunnel interfaces and confirmed improvement. No vendor escalation. No “it’s the ISP” finger-pointing.
The practice that saved them wasn’t a clever trick. It was standardization + repeatable tests + routing visibility.
When your VPN behaves like an interface with routes, it’s easier to pin down where things got weird.
Joke #2: The fastest way to learn VPN troubleshooting is to schedule a “quick” firewall change right before a long weekend.
Practical tasks: commands, outputs, and decisions (12+)
These are operator-grade checks you can run from Linux hosts, firewalls that expose shells, or jump boxes near endpoints.
The point isn’t the command; it’s what you decide based on the output.
1) Confirm the route you expect exists (route-based sanity)
cr0x@server:~$ ip route get 10.40.12.20
10.40.12.20 via 169.254.21.2 dev vti0 src 169.254.21.1 uid 1000
cache
Meaning: Traffic to 10.40.12.20 will go out vti0 via the tunnel next hop. Good.
Decision: If it shows your default gateway instead, fix routing (static route, BGP, route-map) before touching IPsec.
2) Check if IPsec SAs exist and are rekeying (strongSwan example)
cr0x@server:~$ sudo ipsec statusall
Status of IKE charon daemon (strongSwan 5.9.8, Linux 6.8.0, x86_64):
uptime: 2 hours, since Dec 28 09:11:02 2025
Connections:
office-hub: 198.51.100.20...203.0.113.10 IKEv2
Security Associations (1 up, 0 connecting):
office-hub[12]: ESTABLISHED 37 minutes ago, 198.51.100.20[198.51.100.20]...203.0.113.10[203.0.113.10]
office-hub{7}: INSTALLED, TUNNEL, reqid 1, ESP in UDP SPIs: c2f4a1d2_i c8a19f10_o
office-hub{7}: 0.0.0.0/0 === 0.0.0.0/0
Meaning: IKEv2 is established; ESP is installed; selectors are wide open (0.0.0.0/0) and routing via the VTI decides what actually enters the tunnel (typical route-based pattern).
Decision: If no SAs, focus on IKE negotiation, PSK/certs, NAT-T, and UDP reachability.
3) Verify traffic is actually being encrypted (look at XFRM stats)
cr0x@server:~$ ip -s xfrm state
src 198.51.100.20 dst 203.0.113.10
proto esp spi 0xc8a19f10 reqid 1 mode tunnel
replay-window 32 flag af-unspec
auth-trunc hmac(sha256) 0x... 128
enc cbc(aes) 0x...
anti-replay context: seq 0x0000007b, oseq 0x0000007b, bitmap 0xffffffff
stats: replay-window 0 replay 0 failed 0 bytes 981234 packets 812 add 2025-12-28 09:15:01 use 2025-12-28 10:02:31
Meaning: Packets/bytes counters incrementing means traffic is going through IPsec.
Decision: If counters stay flat while apps fail, your routing/policy isn’t steering traffic into the tunnel (or selectors don’t match in policy-based).
4) Detect MTU/PMTU black holes (classic office VPN pain)
cr0x@server:~$ ping -M do -s 1372 10.40.12.20 -c 3
PING 10.40.12.20 (10.40.12.20) 1372(1400) bytes of data.
1380 bytes from 10.40.12.20: icmp_seq=1 ttl=61 time=21.3 ms
1380 bytes from 10.40.12.20: icmp_seq=2 ttl=61 time=21.1 ms
1380 bytes from 10.40.12.20: icmp_seq=3 ttl=61 time=21.2 ms
--- 10.40.12.20 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
Meaning: PMTU is at least ~1400 bytes end-to-end. Good for many IPsec paths.
Decision: If you see “Frag needed,” clamp TCP MSS on tunnel interface or lower MTU. If you see timeouts only for large packets, suspect PMTU filtering.
5) Observe path MTU with tracepath (find where it drops)
cr0x@server:~$ tracepath 10.40.12.20
1?: [LOCALHOST] pmtu 1500
1: 169.254.21.2 2.123ms
2: 10.40.12.20 21.004ms reached
Resume: pmtu 1400 hops 2 back 2
Meaning: The discovered PMTU is 1400. That’s normal with ESP overhead.
Decision: If it reports pmtu 1500 but large TCP transfers stall, suspect ICMP being blocked somewhere and enforce an MSS clamp anyway.
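If you do clamp, a minimal Linux/iptables sketch (the value assumes the 1436-byte vti0 MTU from these examples; pick one of the two rules, and most vendor firewalls expose the same knob as an MSS-adjust setting on the tunnel interface):

# clamp TCP MSS on SYNs leaving via the tunnel so segments fit the path MTU
sudo iptables -t mangle -A FORWARD -o vti0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
# or pin an explicit value: MTU 1436 minus 40 bytes of IP and TCP headers
sudo iptables -t mangle -A FORWARD -o vti0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1396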
6) Confirm DNS is not the “VPN outage”
cr0x@server:~$ resolvectl query fileserver.corp.example
fileserver.corp.example: 10.40.12.20 -- link: vti0
-- Information acquired via protocol DNS in 19.9ms.
-- Data is authenticated: no
Meaning: Name resolves to the remote IP and the resolver associates it with the tunnel link.
Decision: If DNS points to public IPs or wrong internal ranges, fix split-DNS or conditional forwarding before blaming IPsec.
7) Prove the TCP handshake works (or doesn’t)
cr0x@server:~$ nc -vz 10.40.12.20 445
Connection to 10.40.12.20 445 port [tcp/microsoft-ds] succeeded!
Meaning: VPN and firewall policy allow SMB to that host.
Decision: If it times out, decide whether it’s routing (no reply path), firewall policy (blocked), or IPsec selectors (policy-based mismatch).
8) Inspect interface counters for drops on the tunnel interface
cr0x@server:~$ ip -s link show dev vti0
6: vti0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1436 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
RX: bytes packets errors dropped missed mcast
18922310 152233 0 12 0 0
TX: bytes packets errors dropped carrier collsns
20100291 160112 0 0 0 0
Meaning: Drops on RX suggest queue/MTU/policy issues, or kernel/xfrm handling trouble.
Decision: If drops climb during transfers, focus on MTU/MSS and CPU; if drops are zero but app fails, look at firewall policy or remote host.
9) Validate BGP adjacency over the tunnel (route-based power move)
cr0x@server:~$ sudo vtysh -c "show ip bgp summary"
BGP router identifier 169.254.21.1, local AS number 65010
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
169.254.21.2 4 65000 8123 7991 144 0 0 03:12:44 37
Meaning: BGP is up for hours and receiving 37 prefixes. That’s what “routing-driven VPN” looks like.
Decision: If BGP flaps, treat the tunnel like a WAN: check underlay loss, MTU, DPD timers, and CPU before touching route filters.
10) Spot route leaks or missing routes (what you advertise matters)
cr0x@server:~$ sudo vtysh -c "show ip route 10.40.12.0/24"
Routing entry for 10.40.12.0/24
Known via "bgp", distance 20, metric 0, best
Last update 00:02:11 ago
* 169.254.21.2, via vti0
Meaning: The prefix exists and is reachable via the tunnel.
Decision: If the route is missing, stop. Don’t debug IPsec. Fix the routing advertisement/import policy first.
11) Confirm that policy-based selectors match (generic debugging via packet capture)
cr0x@server:~$ sudo tcpdump -ni eth0 host 203.0.113.10 and udp port 4500 -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:04:11.201122 IP 198.51.100.20.4500 > 203.0.113.10.4500: UDP-encap: ESP(spi=0xc8a19f10,seq=0x0000007c), length 148
10:04:11.221009 IP 203.0.113.10.4500 > 198.51.100.20.4500: UDP-encap: ESP(spi=0xc2f4a1d2,seq=0x00000080), length 148
Meaning: You’re seeing ESP-in-UDP. If you generate test traffic and see no ESP packets, it’s not entering IPsec.
Decision: In policy-based setups, that usually means selector mismatch or NAT changed the inner addresses.
12) Check conntrack for asymmetric routing symptoms
cr0x@server:~$ sudo conntrack -L -p tcp --dport 445 | head
tcp 6 431999 ESTABLISHED src=10.20.5.44 dst=10.40.12.20 sport=51234 dport=445 src=10.40.12.20 dst=10.20.5.44 sport=445 dport=51234 [ASSURED] mark=0 use=1
Meaning: Flow is established in both directions; return traffic is observed.
Decision: If you see lots of SYN_SENT with no reply, suspect missing return route on the remote side or a firewall rule blocking.
13) Verify NAT isn’t sabotaging your selectors
cr0x@server:~$ sudo iptables -t nat -S | sed -n '1,20p'
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-A POSTROUTING -o eth0 -j MASQUERADE
Meaning: A broad MASQUERADE rule can rewrite source IPs, breaking policy-based selectors and confusing route-based debugging.
Decision: Add NAT exemptions for VPN destinations or narrow NAT scope to “internet only.”
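A minimal iptables sketch of that exemption, assuming 10.40.0.0/16 is the remote office space from these examples; the bypass must sit before the MASQUERADE so VPN-bound traffic keeps its real source address:

# exempt VPN destinations from the broad MASQUERADE shown above
sudo iptables -t nat -I POSTROUTING 1 -d 10.40.0.0/16 -j ACCEPT
# alternative: exempt anything an IPsec policy will protect on the way out
sudo iptables -t nat -I POSTROUTING 1 -m policy --dir out --pol ipsec -j ACCEPT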
14) Watch CPU and softirqs during throughput complaints (encryption costs real cycles)
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.8.0 (vpn-gw1) 12/28/2025 _x86_64_ (8 CPU)
10:06:31 AM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
10:06:32 AM all 12.10 0.00 9.88 0.00 0.00 21.44 0.00 56.58
10:06:33 AM all 10.34 0.00 8.91 0.00 0.00 24.10 0.00 56.65
10:06:34 AM all 11.02 0.00 9.33 0.00 0.00 25.89 0.00 53.76
Meaning: High %soft can indicate packet processing pressure (often VPN + NAT + firewalling).
Decision: If CPU is pegged during business hours, you need hardware crypto acceleration, better sizing, or traffic engineering.
15) Measure throughput with iperf3 (stop guessing)
cr0x@server:~$ iperf3 -c 10.40.12.20 -t 10 -P 4
Connecting to host 10.40.12.20, port 5201
[SUM] 0.00-10.00 sec 842 MBytes 707 Mbits/sec sender
[SUM] 0.00-10.00 sec 839 MBytes 704 Mbits/sec receiver
Meaning: You have ~700 Mbps of real throughput across the tunnel under test conditions.
Decision: If throughput is low but latency is fine, focus on MTU/MSS, crypto CPU, or policing/shaping on underlay.
16) Confirm DPD/keepalive behavior in logs (tunnel stability)
cr0x@server:~$ sudo journalctl -u strongswan --since "10 minutes ago" | tail -n 8
Dec 28 10:00:01 vpn-gw1 charon[1123]: 12[IKE] sending DPD request
Dec 28 10:00:01 vpn-gw1 charon[1123]: 12[IKE] received DPD response, peer alive
Dec 28 10:05:01 vpn-gw1 charon[1123]: 12[IKE] sending DPD request
Dec 28 10:05:01 vpn-gw1 charon[1123]: 12[IKE] received DPD response, peer alive
Meaning: The peer is responding. The underlay path isn’t dead, and state is being maintained.
Decision: If DPD fails intermittently, suspect ISP loss, NAT mapping timeouts, or aggressive firewall UDP timeouts.
Fast diagnosis playbook
The fastest way to lose hours is to start in the middle. Start at the edges and follow the packet.
When someone says “VPN is slow” or “site is down,” run this in order.
First: is the problem routing, encryption, or policy?
- Check the route from a host near the source: ip route get <remote_ip>. If it doesn’t point at the tunnel/VTI, stop and fix routing.
- Check SA state and counters: IKE established? ESP installed? Do bytes/packets increase when you generate traffic? If SAs are down, fix IKE/underlay. If SAs are up but counters flat, it’s steering/selectors.
- Check firewall policy in the data path: can you nc -vz to the port? If not, verify rules and return routes.
Second: isolate “works for ping” from “works for real traffic”
- PMTU test with ping -M do and tracepath. If large packets fail, clamp MSS or lower MTU.
- Throughput test with iperf3. If it’s far below circuit capacity, check CPU, softirqs, and shaping.
- Packet capture on the public interface: do you see ESP/UDP 4500 when you generate traffic?
Third: verify symmetry and return path
- Conntrack or state table: do flows go ESTABLISHED or get stuck SYN_SENT?
- Remote-side route table: does it know how to reach your source prefix via the tunnel?
- Route filters (if dynamic routing): did you accidentally block the new prefix?
Quick bottleneck fingerprinting
- High latency + loss: underlay ISP, NAT device, LTE failover, or provider congestion.
- Low throughput, latency fine: MTU/MSS, CPU crypto limit, QoS policing, or single-flow limitation.
- One subnet works, another doesn’t: policy-based selector mismatch or missing route/advertisement.
- Works one direction only: return route missing, asymmetric routing, or stateful firewall drop.
Common mistakes: symptoms → root cause → fix
1) “Tunnel is up, but new VLAN can’t reach HQ”
Symptoms: Existing subnets work; new subnet fails; VPN status page is green.
Root cause: Policy-based selectors not updated on both ends, or route advertisements not updated for route-based.
Fix: For policy-based, add matching local/remote selectors (proxy IDs) on both ends and ensure NAT exemption. For route-based, advertise the prefix or add static routes, then allow it in firewall policy.
2) “Small web requests work, file copies hang”
Symptoms: Ping works; RDP works; SMB/large HTTPS stalls; retransmits in captures.
Root cause: MTU/PMTU black hole due to ESP overhead + blocked ICMP fragmentation needed.
Fix: Clamp TCP MSS on tunnel interfaces; reduce MTU; ensure ICMP types needed for PMTU are permitted on underlay where appropriate.
3) “Traffic works for hours, then randomly dies until we rekey/restart”
Symptoms: Intermittent outages; logs show DPD failures; often behind NAT.
Root cause: NAT mapping timeout for UDP 4500/500, or mismatched lifetimes leading to awkward rekey behavior.
Fix: Enable NAT-T keepalives; align IKE/ESP lifetimes; tune DPD; ensure upstream firewalls keep UDP mappings long enough.
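A strongSwan-flavored sketch of those knobs, with illustrative values and file locations; other vendors expose the same settings as NAT keepalive interval, DPD timers, and phase 1/2 lifetimes:

# strongswan.conf fragment: keep the NAT mapping warm with UDP keepalives
charon {
    keep_alive = 20s
}

# swanctl.conf fragment: DPD plus lifetimes aligned with the peer
connections {
  office-hub {
    dpd_delay = 30s
    rekey_time = 4h            # IKE SA lifetime
    children {
      office-hub {
        rekey_time = 1h        # ESP/child SA lifetime
      }
    }
  }
}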
4) “We enabled active/active, now some sessions break”
Symptoms: Half the users fine; others see resets; state tables inconsistent.
Root cause: Asymmetric routing across two tunnels without state synchronization, or ECMP without per-flow consistency.
Fix: Use per-flow hashing with consistent return path, or use active/passive for stateful services; prefer BGP with proper attributes and health checks.
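One hedged way to make the active/passive intent explicit with BGP attributes, as an FRR sketch (the second hub peer 169.254.22.2 over a hypothetical vti1 is illustrative): prefer routes learned from the primary tunnel and let the backup win only when the primary’s routes disappear.

! routes learned from the primary hub get higher local-preference
route-map PREFER-PRIMARY permit 10
 set local-preference 200
! routes learned from the backup hub get a lower value
route-map PREFER-BACKUP permit 10
 set local-preference 100
!
router bgp 65010
 neighbor 169.254.21.2 remote-as 65000
 neighbor 169.254.22.2 remote-as 65000
 address-family ipv4 unicast
  neighbor 169.254.21.2 route-map PREFER-PRIMARY in
  neighbor 169.254.22.2 route-map PREFER-BACKUP in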
5) “Partner VPN only works for one subnet; they refuse to add more”
Symptoms: Only a single internal prefix reaches partner; expanding is slow and political.
Root cause: Partner is using strict policy-based selectors and change control is painful.
Fix: Negotiate route-based with VTI if possible; otherwise use NAT (carefully) to map multiple internal prefixes into one allowed selector, and document it like it’s hazardous material.
6) “After we tightened NAT rules, VPN broke”
Symptoms: IKE up but traffic doesn’t flow; selectors appear correct.
Root cause: NAT exemption removed; source IP changed; policy-based match fails or remote drops unexpected inner addresses.
Fix: Add explicit NAT bypass rules for remote prefixes; validate with packet capture that inner source addresses match what the peer expects.
7) “Cloud to office VPN flaps, on-prem to on-prem is stable”
Symptoms: Periodic rekeys; BGP resets; underlay looks fine.
Root cause: Cloud gateway timer expectations and idle timeouts; sometimes aggressive DPD; sometimes multiple paths with inconsistent NAT.
Fix: Match timers to cloud recommendations; keepalives; avoid double NAT; verify UDP 4500 stability; consider dual tunnels with health-checked routing.
8) “VPN is slow only at peak hours”
Symptoms: Complaints at 9–11am; CPU spikes; drops increase; VoIP jitter.
Root cause: Gateway CPU bound on crypto or packet processing, or underlay congestion with no QoS.
Fix: Size gateways properly, enable hardware offload if available, enforce QoS on underlay, and consider splitting traffic across tunnels/regions.
Checklists / step-by-step plan
Step-by-step: designing a new office VPN (recommended path)
- Pick route-based (VTI) unless you have a concrete reason not to.
- Define prefix ownership: per-office summarized blocks if you can (makes routing and filters sane).
- Decide routing:
- Small environment: static routes are fine, but keep them documented and symmetrical.
- Multi-office or cloud-heavy: run BGP over the tunnel. Route-based makes this clean (see the sketch after this checklist).
- Set MTU/MSS intentionally: don’t wait for SMB to teach you about PMTU in production.
- Firewall rules: least privilege without selector spaghetti; allow required ports between zones and log denies at a useful rate.
- Monitoring: check IKE/IPsec state, route presence, and real application reachability from more than one VLAN.
- Failover design:
- Prefer dual tunnels with routing health (BGP or tracked static routes).
- Make failure explicit: withdraw routes when the tunnel is unhealthy.
- Change management: define how a new VLAN gets connected: “add prefix + advertise + policy.” Make it a checklist item, not a hero story.
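For the BGP option above, a minimal FRR sketch of a branch peering with a hub over the tunnel (AS numbers and the hub block follow earlier examples; 10.20.6.0/24 and the /22 summary are illustrative):

router bgp 65010
 neighbor 169.254.21.2 remote-as 65000
 address-family ipv4 unicast
  network 10.20.5.0/24
  network 10.20.6.0/24
  neighbor 169.254.21.2 prefix-list FROM-HUB in
  neighbor 169.254.21.2 prefix-list TO-HUB out
!
! the route filters that keep "it routes" from becoming "everything routes"
ip prefix-list FROM-HUB seq 10 permit 10.40.0.0/16 le 24
ip prefix-list TO-HUB seq 10 permit 10.20.4.0/22 le 24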
Step-by-step: migrating policy-based to route-based without drama
- Inventory selectors: list all local/remote subnet pairs in use today.
- Map to routing intent: which prefixes should actually be reachable? Which are legacy and should die?
- Stand up a parallel route-based tunnel (different peer IP or different tunnel ID) while keeping the old one.
- Move one prefix at a time by routing: install the route via the VTI and remove that prefix’s selector from the old tunnel only once it is validated (see the sketch after this list).
- Validate MTU/MSS early; route-based migrations often expose old PMTU sins.
- Lock down route filters: don’t accidentally create any-to-any between offices because “it routes now.”
- Cut over monitoring to the new tunnel, then retire the old policies.
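A hedged sketch of what “validated” means for a single prefix, reusing checks from the tasks section and the addresses from earlier examples:

# 1. steer one remote prefix into the new route-based tunnel
sudo ip route add 10.40.12.0/24 dev vti0
# 2. confirm the kernel now resolves that prefix via the VTI
ip route get 10.40.12.20
# 3. PMTU sanity before user traffic moves
ping -M do -s 1372 -c 3 10.40.12.20
# 4. confirm ESP counters on the new SA increase while you generate traffic
ip -s xfrm state
# 5. only then remove the 10.40.12.0/24 selector from the old policy-based tunnel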
Operational checklist: what to verify after any VPN change
- IKE established and stable (no constant rekeys or DPD failures).
- ESP counters increment when you generate test traffic.
- Routes installed for all required prefixes (and only those prefixes).
- Firewall policy allows required ports; denies are logged usefully.
- PMTU test passes to representative hosts; MSS clamp in place if needed.
- At least one test per critical VLAN: users, voice, and “weird” IoT (a minimal probe loop is sketched after this checklist).
- Failover tested: pull ISP link or disable one tunnel; confirm routes withdraw and sessions recover as designed.
- Monitoring/alerts updated: no “tunnel up” vanity metrics without traffic checks.
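A minimal bash sketch for the per-VLAN probes mentioned above; the target hosts and ports are placeholders for whatever representative service each VLAN actually has, and for the full picture run it from (or with a source address in) each critical local VLAN:

#!/usr/bin/env bash
# one representative target per critical VLAN across the tunnel
targets="10.40.12.20:445 10.40.13.10:5060 10.40.14.5:443"
for t in $targets; do
  host=${t%:*}; port=${t#*:}
  if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
    echo "OK   $host:$port"
  else
    echo "FAIL $host:$port"
  fi
done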
FAQ
1) Is route-based VPN more secure than policy-based?
Not inherently. Security comes from crypto settings, key management, and firewall policy. Route-based is usually operationally safer
because it’s less likely to create accidental partial connectivity or silent bypasses during changes.
2) Why do policy-based VPNs cause “it works for some subnets” outages?
Because reachability is defined by selectors. If you don’t explicitly include a subnet pair, traffic won’t be encrypted, and may be dropped
or leak to a different path. The tunnel can stay “up” while your new network is effectively invisible.
3) Can route-based VPNs still use traffic selectors?
Yes, many implementations still negotiate selectors, but in route-based designs they’re often narrow (host-to-host for VTI endpoints) and
routing decides what inner prefixes traverse the tunnel. The operational behavior is the key distinction.
4) Do I need BGP to justify route-based?
No. Static routes over a VTI are still easier to reason about than a selector matrix, especially when offices have multiple VLANs.
BGP just makes it scale and fail over more gracefully.
5) What about overlapping RFC1918 space between offices?
Overlaps happen during mergers and bad planning. Route-based makes it easier to introduce NAT at the edge in a controlled way and keep
the routing logic explicit. Policy-based can do NAT too, but debugging becomes a logic puzzle involving selectors plus translations.
6) Which model is better for dual ISP failover?
Route-based, almost always. You can track tunnel interface status, use dynamic routing, and withdraw routes cleanly. Policy-based failover
often ends up as duplicated policies with subtle mismatches and “why did this flow choose that tunnel” surprises.
7) Why does “tunnel up” monitoring fail to catch real outages?
Because IKE/IPsec control-plane health doesn’t guarantee data-plane correctness. You can have SAs up with wrong routes, wrong selectors,
blocked ports, PMTU black holes, or missing return paths. Monitor both control plane and representative application probes.
8) Does route-based mean I accidentally created full mesh access between offices?
Only if you let it. Route-based makes it easier to route many prefixes, but you still enforce segmentation with route filters and firewall policy.
“It routes” is not permission.
9) What’s the most common throughput killer on office VPNs?
MTU/MSS issues and CPU limits. MTU problems create retransmits and stalls that look like “slow internet.” CPU limits show up at peak hours
when encryption and packet filtering compete for cycles.
10) If my firewall only supports policy-based, what should I do?
Keep the selector set minimal and documented, avoid broad NAT that rewrites inner addresses, and build monitoring that tests each critical subnet.
If the office is growing, plan a hardware refresh; policy-based at scale becomes a long-term tax.
Next steps you can actually do
If you run offices and you want fewer VPN surprises, do this:
- Standardize on route-based VPN for office links unless constrained by partner gear or legacy platforms.
- Make routing explicit: static routes for small deployments; BGP for many sites or cloud integration.
- Engineer MTU/MSS on day one. Don’t wait for SMB to ruin your afternoon.
- Monitor the data plane: per-VLAN probes, not just “IKE is up.”
- Document the failure modes you actually see: selector mismatches, NAT exemptions, asymmetric routing, PMTU black holes.
- Practice failover by breaking it on purpose during business-friendly windows. A design you haven’t failed over is a rumor.
Route-based VPNs won’t make your network perfect. They just make it diagnosable, scalable, and less dependent on tribal knowledge.
For offices, that’s the difference between “a secure WAN” and “a recurring incident theme.”