Office VPN failover: keep tunnels up with 2 ISPs (without manual babysitting)

The office loses the internet. Someone yells “the VPN is down,” Slack explodes, and suddenly you’re debugging from your phone in a parking lot with one bar of LTE. Meanwhile, the second ISP you pay for every month sits there like a spare tire that’s never been inflated.

Dual-ISP office VPN failover is not hard, but it is easy to do in a way that looks redundant on a diagram and still falls over in production. This is the pragmatic version: designs that actually keep tunnels up, health checks that don’t lie, routing that doesn’t blackhole, and the boring operational habits that stop incidents from turning into all-hands therapy sessions.

What “failover” should mean (and what it should not)

In office-land, “VPN failover” usually means “the tunnel should stay up when an ISP dies.” That’s the outcome. But you only get there if you define what “up” means and what you’re willing to accept during failure.

Define success in operational terms

  • Recovery time objective (RTO): How many seconds of packet loss is acceptable? 2 seconds? 30 seconds? “Whenever Bob notices” is not a metric.
  • Recovery point objective (RPO): For VPNs this is less about data and more about state. Stateful flows (VoIP, RDP, long uploads) will break on path change unless you’re doing session preservation (rare). Decide whether breaking sessions is acceptable.
  • Blast radius: Does failover affect only site-to-site traffic, or also internet breakout, DNS, SaaS, voice?
  • Operator action: “Automatic” means no CLI needed at 2am. It can still alert loudly. It just shouldn’t require you to kick it.

Also: failover is not a magic button that makes upstream routing perfect. If your secondary ISP has degraded peering, your tunnel can be “up” while your applications melt. The system needs to detect usefulness, not just link state.

Joke #1: Dual WAN without health checks is like owning two umbrellas and still getting soaked because you only open them when the weather app says “rain.”

A few facts and history that explain today’s failure modes

Some context makes modern VPN failover less mysterious and more “oh, that’s why that breaks.” Here are concrete facts worth keeping in your head:

  1. IPsec predates today’s “always online” offices. The first IPsec RFCs appeared in the mid-1990s, with the widely deployed architecture standardized in 1998, when static sites and long-lived links were assumed; rapid failover wasn’t the primary design goal.
  2. NAT wasn’t a first-class citizen. NAT traversal (NAT-T) became common later; plenty of odd behaviors come from stuffing ESP into UDP/4500 to cross NAT devices.
  3. IKEv1 vs IKEv2 behavior differs under failure. Many “it works until it doesn’t” setups rely on IKEv1 quirks; IKEv2 generally behaves better but still depends on timers and DPD settings.
  4. Consumer ISP “failures” are often partial. Many outages are not link-down. They’re DNS resolution failures, broken PMTU, or a routing flap upstream. That’s why dead-gateway detection by pinging the ISP next hop is a lie.
  5. BGP became the grown-up way to do multi-homing for enterprises. But most offices can’t get provider-independent address space or BGP from cheap circuits; thus we simulate resilience at higher layers.
  6. SD-WAN didn’t invent health checks; it productized them. The core idea—measure loss/latency to a real target and steer traffic—has existed for decades in homegrown scripts and router features.
  7. Packet loss hurts VPNs more than casual browsing. IPsec encapsulation overhead plus retransmits can turn “1% loss” into “why is Teams unusable?” faster than you’d like.
  8. PMTU blackholes are an old, repeatable failure mode. Tunnel overhead shrinks effective MTU; if ICMP is blocked, Path MTU Discovery breaks and large packets disappear into the void.

Topologies that work: active/standby, active/active, and “don’t do this”

Topology A: Active/standby tunnels (simplest, often best)

You build two site-to-site VPN tunnels: one over ISP1, one over ISP2. Only one carries traffic under normal conditions. If health checks fail, you move the route to the standby tunnel. You can do this with:

  • Route-based VPN (preferred): two tunnel interfaces, two next hops, metrics/priority.
  • Policy-based VPN (works, but gets brittle as networks grow).
  • Dynamic routing (BGP/OSPF) over the tunnels (clean failover, more moving parts).
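
For the route-based option, the failover primitive is nothing more exotic than two routes to the same remote prefix with different metrics. A minimal sketch, assuming tunnel interfaces named vti0 (over ISP1) and vti1 (over ISP2) and a remote network of 10.50.0.0/16, matching the task outputs later in this article:

# Primary tunnel wins on metric; standby route exists but stays idle
ip route add 10.50.0.0/16 dev vti0 metric 10
ip route add 10.50.0.0/16 dev vti1 metric 50

Your health-check logic then only has to delete or demote the vti0 route when probes fail; the kernel falls back to vti1 on its own.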

Why it works: stable, predictable, fewer asymmetric routing surprises. Debugging at 3am is tolerable.

Tradeoff: you’re “wasting” the standby link. That’s fine. Your CFO wastes more money on conference room chairs.

Topology B: Active/active tunnels (harder, use only when you must)

Both tunnels carry traffic simultaneously. You do per-prefix routing, per-application steering, or ECMP. This is common when you want more throughput or you want to keep one ISP from being “dead weight.”

Where it breaks:

  • Asymmetric routing if return traffic doesn’t follow the same tunnel.
  • Stateful firewalls that pin flows to a path and then reject packets arriving on the “wrong” interface.
  • Remote-side limitations (some cloud VPN gateways dislike frequent path changes).

My opinion: do active/standby unless you have a measured throughput constraint that justifies complexity. If you need active/active, be honest: you’re building a small WAN. Treat it like one.

Topology C: “Failover by DNS” (don’t)

Some teams try: “If ISP1 dies, change DNS for the VPN endpoint.” That’s not failover; that’s a scheduled maintenance plan disguised as resilience. DNS caches, TTLs ignored by clients, and long-lived sessions will make sure your users are the ones implementing your cutover test.

Mechanics: IKE/IPsec behavior, NAT, DPD, and rekey reality

VPN failover is a timer game. You’re balancing how quickly you detect failure against how often you flap during transient loss.

DPD: Dead Peer Detection is necessary, not sufficient

DPD (or equivalent keepalives) can tell you the peer isn’t responding. But if the internet path is half-broken—packets go out, but replies don’t come back—you need additional proof.

Good practice:

  • Enable DPD with reasonable values (e.g., 10–30s intervals, a small number of retries).
  • Use data-plane probes too (ICMP or TCP to a target across the tunnel).
  • Make failover decisions on sustained loss/latency, not a single missed probe.
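
As a concrete example of the “reasonable values” above, here is a hedged strongSwan fragment (swanctl.conf); the connection name is a placeholder, and note that with IKEv2 the time to actually declare a peer dead is governed by the daemon’s retransmission settings rather than a per-connection timeout:

connections {
    vpn-primary {
        dpd_delay = 20s    # send a liveness check after 20s of inactivity
    }
}

Treat this as the control-plane half only; the data-plane probes are covered in the health-check section below.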

NAT and source addresses: pick stable identities

On dual ISP, the local VPN gateway may present different public IPs. Remote peers may authenticate by ID (FQDN) and not care about changing source IP, or they may pin to a specific address. Many cloud VPN services expect you to configure both tunnels explicitly; that’s good. You want the remote side to expect failure.

Rekey behavior during failover

Rekey events are a classic “it fails every hour” mystery. During failover, rekey timers can align badly with path instability, leading to repeated negotiation attempts. When possible:

  • Stagger rekey lifetimes between primary and secondary tunnels.
  • Ensure both tunnels can negotiate independently (no shared SA confusion).
  • Keep logs. Rekey failures are rarely solved by vibes.
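
On strongSwan, for instance, staggering is just giving each tunnel’s CHILD_SA a different lifetime; the child name and numbers below are placeholders:

children {
    office-to-dc {
        rekey_time = 60m    # e.g. use 45m on the secondary tunnel's child
    }
}

The exact values matter less than the guarantee that both tunnels never renegotiate at the same moment.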

MTU: the silent tunnel killer

Encapsulation reduces MTU. If you run 1500-byte LAN MTU and encapsulate into IPsec over PPPoE (common on DSL), you may need to clamp TCP MSS or lower interface MTU. If you don’t, you get intermittent failures that look like “some sites load, some don’t.”
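
A hedged nftables sketch of the usual fix, assuming an inet filter table that already has a forward chain and a tunnel interface named vti0; pick the MSS to match your measured overhead (Task 8 below shows how to measure it):

# Clamp TCP MSS on SYNs forwarded into the tunnel (fixed value variant)
nft add rule inet filter forward oifname "vti0" tcp flags syn tcp option maxseg size set 1360

# Or derive it from the route MTU instead of hard-coding a number
nft add rule inet filter forward oifname "vti0" tcp flags syn tcp option maxseg size set rt mtu

Pair this with an explicit tunnel MTU (for example, ip link set dev vti0 mtu 1436) so the interface itself reflects the encapsulation overhead, and clamp both directions if traffic is initiated from either side.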

Joke #2: MTU issues are the IT equivalent of a squeaky chair: harmless until it drives everyone insane and you start questioning your life choices.

Routing that doesn’t lie: policy-based vs route-based vs BGP

Route-based VPN: default recommendation

Route-based VPN gives you a tunnel interface. You can run static routes, use metrics, and use your normal routing table. It’s easier to combine with health checks and dynamic routing.

Rule: if your platform supports route-based VPN, use it unless you have a specific constraint.

Policy-based VPN: acceptable for tiny networks

Policy-based VPN ties encryption to traffic selectors (subnet-to-subnet). It works fine until you add a third subnet, then a fourth, then someone wants to route a /32 for a test VM, and suddenly you’re doing spreadsheet-driven security policy.

Dynamic routing over tunnels: clean failover, higher skill requirement

BGP over IPsec (or GRE over IPsec) is the most robust way to fail over routes between tunnels, because routing protocols already solve “which path is viable” if you feed them correct interface and keepalive signals.

When to use it:

  • You have multiple sites or multiple prefixes.
  • You need automatic reroute without vendor-specific SD-WAN features.
  • You can operate it: monitoring, route filters, and change control.

When not to:

  • You’re a one-office shop with one remote network and no appetite for routing policy.
  • You can’t guarantee that both ends can run BGP cleanly (some managed services restrict it).
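
If you do run BGP over the tunnels, the configuration can stay small. A minimal FRR-flavored sketch; the AS numbers, tunnel peer addresses (10.255.0.1 over the primary, 10.255.1.1 over the secondary), and the office prefix are all placeholders:

router bgp 65010
 neighbor 10.255.0.1 remote-as 65020
 neighbor 10.255.0.1 timers 10 30
 neighbor 10.255.1.1 remote-as 65020
 neighbor 10.255.1.1 timers 10 30
 address-family ipv4 unicast
  network 10.10.0.0/16
  neighbor 10.255.0.1 activate
  neighbor 10.255.0.1 route-map PREFER-PRIMARY in
  neighbor 10.255.1.1 activate
 exit-address-family
!
route-map PREFER-PRIMARY permit 10
 set local-preference 200

When the primary tunnel dies, its BGP session expires with the hold timer and its routes are withdrawn; the secondary’s routes take over without any script touching the routing table.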

One reliability quote (paraphrased idea): Gene Kranz, NASA flight director, emphasized being “tough and competent” under pressure—reliability is mostly preparation, not heroics.

Health checks: proving the path is good, not just “up”

The core problem with office failover is that most “dual WAN” setups detect the wrong failure. The cable modem is still synced, so the router thinks it’s fine. Meanwhile, upstream routing is blackholing your traffic to the VPN peer.

What to probe

  • Underlay reachability: Can you reach a stable internet target via each ISP? Use multiple targets in different networks. Avoid “ISP gateway ping” as the only test.
  • Overlay reachability: Can you reach something across the tunnel? A loopback IP on the remote firewall is ideal. Even better: a service port (TCP/443 to a health endpoint) that proves more than ICMP.
  • Quality: Measure latency and packet loss. Decide thresholds that match your applications (voice, VDI, file shares).

How to make probes trustworthy

  • Bind probes to the interface/source address for each ISP so you’re testing the right path.
  • Use multiple probes and require consecutive failures before declaring dead.
  • Rate-limit failback. Flapping between ISPs is how you turn a minor upstream issue into a full outage.
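
If your firewall has no built-in path monitoring, the decision logic is small enough to sketch in shell. This is a hedged, minimal example meant to run from cron or a systemd timer, not a finished product: the names, addresses, metrics, and thresholds are assumptions (they line up with the task outputs later in this article), and commercial platforms expose the same idea as gateway groups, IP SLA tracking, or SD-WAN policies.

#!/usr/bin/env bash
# One-shot overlay probe for the primary tunnel. Demotes the primary route
# after sustained failure; restores it only after a hold-down of healthy runs.
# Assumes a standby route already exists (e.g. 10.50.0.0/16 dev vti1 metric 50).
set -u

PROBE_IP="10.50.255.1"        # remote loopback reachable only through the tunnel
PRIMARY_DEV="vti0"            # primary tunnel interface
REMOTE_NET="10.50.0.0/16"     # remote prefix we steer
GOOD_METRIC=10                # metric when the primary is healthy
BAD_METRIC=500                # metric when the primary is demoted
FAIL_LIMIT=3                  # consecutive failed runs before demoting
HOLD_DOWN=10                  # consecutive healthy runs before restoring
STATE_DIR="/run/vpn-failover"

mkdir -p "$STATE_DIR"
fails=$(cat "$STATE_DIR/fails" 2>/dev/null || echo 0)
oks=$(cat "$STATE_DIR/oks" 2>/dev/null || echo 0)

# Probe pinned to the primary tunnel interface, regardless of current routing.
if ping -I "$PRIMARY_DEV" -c 3 -W 2 "$PROBE_IP" >/dev/null 2>&1; then
    fails=0; oks=$((oks + 1))
else
    oks=0; fails=$((fails + 1))
fi

if [ "$fails" -ge "$FAIL_LIMIT" ]; then
    # Demote: remove the preferred route, park the primary at a losing metric.
    ip route del "$REMOTE_NET" dev "$PRIMARY_DEV" metric "$GOOD_METRIC" 2>/dev/null
    ip route replace "$REMOTE_NET" dev "$PRIMARY_DEV" metric "$BAD_METRIC"
elif [ "$oks" -ge "$HOLD_DOWN" ]; then
    # Restore: recovery has been sustained, make the primary preferred again.
    ip route del "$REMOTE_NET" dev "$PRIMARY_DEV" metric "$BAD_METRIC" 2>/dev/null
    ip route replace "$REMOTE_NET" dev "$PRIMARY_DEV" metric "$GOOD_METRIC"
fi

echo "$fails" > "$STATE_DIR/fails"
echo "$oks" > "$STATE_DIR/oks"

The hold-down counter is what prevents flapping: one good probe after an outage does not undo the failover, ten in a row do.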

Practical tasks (with commands, outputs, and decisions)

These are hands-on tasks you can run on a Linux-based VPN gateway (or a troubleshooting host) to validate a dual-ISP VPN design. Commands are realistic. Outputs are representative. After each, you’ll see what the output means and what decision to make.

Task 1: Confirm both WANs have valid routes

cr0x@server:~$ ip route show
default via 203.0.113.1 dev wan1 proto static metric 100
default via 198.51.100.1 dev wan2 proto static metric 200
10.10.0.0/16 dev lan0 proto kernel scope link src 10.10.0.1
203.0.113.0/24 dev wan1 proto kernel scope link src 203.0.113.20
198.51.100.0/24 dev wan2 proto kernel scope link src 198.51.100.20

Meaning: Two default routes exist; wan1 preferred via lower metric. LAN route is present.

Decision: If you want active/standby underlay, this is correct. If you intended active/active, you need policy routing or ECMP, not just two defaults with metrics.
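
For reference, the active/active underlay variant is a single multipath default instead of two defaults with different metrics; this is a sketch, the kernel hashes flows across the two next hops, and the state/symmetry caveats from the topology section still apply:

ip route replace default \
    nexthop via 203.0.113.1 dev wan1 weight 1 \
    nexthop via 198.51.100.1 dev wan2 weight 1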

Task 2: Verify policy routing rules (if you do dual underlay probing or active/active)

cr0x@server:~$ ip rule show
0:	from all lookup local
1000:	from 203.0.113.20 lookup 100
1001:	from 198.51.100.20 lookup 200
32766:	from all lookup main
32767:	from all lookup default

Meaning: Traffic sourced from each WAN IP uses a specific routing table.

Decision: Keep this if you need probes (and some tunnel traffic) pinned to the correct ISP. If absent, your “wan2 probe” may silently exit wan1 and lie to you.
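
If the rules are absent, here is a hedged sketch of how they are typically built (the table numbers 100/200 are arbitrary, and in practice you would persist this in your network configuration rather than typing it live):

ip route add default via 203.0.113.1 dev wan1 table 100
ip route add default via 198.51.100.1 dev wan2 table 200
ip rule add from 203.0.113.20 lookup 100 priority 1000
ip rule add from 198.51.100.20 lookup 200 priority 1001

With this in place, anything sourced from a WAN address (including interface-bound probes and IKE traffic for the secondary tunnel) exits the matching ISP instead of following the main default route.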

Task 3: Confirm each WAN can reach the VPN peer underlay (bound to interface)

cr0x@server:~$ ping -I wan1 -c 3 192.0.2.10
PING 192.0.2.10 (192.0.2.10) from 203.0.113.20 wan1: 56(84) bytes of data.
64 bytes from 192.0.2.10: icmp_seq=1 ttl=55 time=18.4 ms
64 bytes from 192.0.2.10: icmp_seq=2 ttl=55 time=18.1 ms
64 bytes from 192.0.2.10: icmp_seq=3 ttl=55 time=18.3 ms

--- 192.0.2.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 18.1/18.2/18.4/0.1 ms

Meaning: Underlay to the peer is reachable via wan1, with low loss.

Decision: Repeat for wan2. If wan2 can’t reach the peer, don’t blame IPsec yet; fix routing, ISP filtering, or upstream issues.

Task 4: Check that IKE is listening on both WANs

cr0x@server:~$ sudo ss -lunp | grep -E ':(500|4500)\s'
UNCONN 0      0        0.0.0.0:500     0.0.0.0:*    users:(("charon",pid=1324,fd=12))
UNCONN 0      0        0.0.0.0:4500    0.0.0.0:*    users:(("charon",pid=1324,fd=13))

Meaning: The IKE daemon (example: strongSwan charon) listens on UDP/500 and UDP/4500 on all interfaces.

Decision: If it’s only bound to one IP, secondary failover will never work. Fix daemon bind settings or system firewall.
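
On strongSwan specifically, a common culprit is an interface restriction in strongswan.conf. A hedged fragment showing the knob; by default charon listens on all interfaces, which is what dual-WAN failover wants:

charon {
    # Either remove restrictions entirely, or make sure both WANs are listed
    interfaces_use = wan1, wan2
}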

Task 5: Validate IPsec tunnel state (strongSwan example)

cr0x@server:~$ sudo swanctl --list-sas
vpn-primary: #12, ESTABLISHED, IKEv2, rekeying in 46 minutes
  local  'office-fw' @ 203.0.113.20[4500]
  remote 'dc-fw' @ 192.0.2.10[4500]
  AES_GCM_16_256/HMAC_SHA2_256_128/PRF_HMAC_SHA2_256/ECP_256
  vpn-primary-child: #34, INSTALLED, TUNNEL, reqid 1
    local  10.10.0.0/16
    remote 10.50.0.0/16
vpn-secondary: #0, CONNECTING, IKEv2
  local  'office-fw' @ 198.51.100.20[4500]
  remote 'dc-fw' @ 192.0.2.10[4500]

Meaning: Primary is established; secondary is trying but not up.

Decision: If secondary never establishes, verify remote peer is configured to accept the second tunnel identity/source and that NAT/firewall rules allow it.
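
For orientation, the shape of the dual-tunnel setup on strongSwan is simply two independent connections pinned to different local addresses. Every identifier, address, and option below is a placeholder sketch, not a drop-in config:

connections {
    vpn-primary {
        version = 2
        local_addrs  = 203.0.113.20
        remote_addrs = 192.0.2.10
        local {
            auth = psk
            id = office-fw
        }
        remote {
            auth = psk
            id = dc-fw
        }
        children {
            office-to-dc {
                local_ts  = 10.10.0.0/16
                remote_ts = 10.50.0.0/16
                start_action = start
            }
        }
    }
    vpn-secondary {
        version = 2
        local_addrs  = 198.51.100.20
        remote_addrs = 192.0.2.10
        # same auth and children as above, ideally with a staggered rekey_time
    }
}

The remote side needs the mirror image of both: two peers, matched identities, and routes back to the office over whichever tunnel is active.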

Task 6: Confirm routes to remote subnets prefer the primary tunnel

cr0x@server:~$ ip route get 10.50.1.25
10.50.1.25 dev vti0 src 10.255.0.2 uid 0
    cache

Meaning: Traffic to the remote network uses vti0 (primary tunnel interface).

Decision: If it goes out wan1 directly, your route-based VPN isn’t actually routing. Fix routes, VTI, or traffic selectors.

Task 7: Prove overlay reachability (probe a remote loopback across the tunnel)

cr0x@server:~$ ping -c 3 10.50.255.1
PING 10.50.255.1 (10.50.255.1) 56(84) bytes of data.
64 bytes from 10.50.255.1: icmp_seq=1 ttl=63 time=21.9 ms
64 bytes from 10.50.255.1: icmp_seq=2 ttl=63 time=22.1 ms
64 bytes from 10.50.255.1: icmp_seq=3 ttl=63 time=22.0 ms

--- 10.50.255.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2004ms
rtt min/avg/max/mdev = 21.9/22.0/22.1/0.1 ms

Meaning: Data plane works across the tunnel.

Decision: If underlay is up but overlay fails, focus on IPsec selectors, firewall policies, or routing on the remote side.

Task 8: Check for MTU/fragmentation trouble with DF ping

cr0x@server:~$ ping -c 3 -M do -s 1472 10.50.255.1
PING 10.50.255.1 (10.50.255.1) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1436
ping: local error: message too long, mtu=1436
ping: local error: message too long, mtu=1436

--- 10.50.255.1 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2051ms

Meaning: The local error reports the tunnel interface MTU: 1436. Your 1500-byte packet can’t leave with DF set, and the usable MTU across the tunnel is at most that.

Decision: Clamp TCP MSS (e.g., to 1360–1400 depending on overhead) or set tunnel/LAN MTU appropriately. If ICMP is blocked upstream, you’ll see blackholes instead of clean errors.

Task 9: Inspect xfrm state to confirm SAs are installed (Linux IPsec)

cr0x@server:~$ sudo ip xfrm state
src 203.0.113.20 dst 192.0.2.10
	proto esp spi 0xc12f4a9d reqid 1 mode tunnel
	replay-window 32 flag af-unspec
	aead rfc4106(gcm(aes)) 0x4f9b... 128
src 192.0.2.10 dst 203.0.113.20
	proto esp spi 0x1a2b3c4d reqid 1 mode tunnel
	replay-window 32 flag af-unspec
	aead rfc4106(gcm(aes)) 0x0aa1... 128

Meaning: ESP SAs exist in both directions for primary.

Decision: If SAs are missing while IKE claims “established,” you may have CHILD_SA negotiation issues or policy conflicts.

Task 10: Verify NAT traversal and UDP encapsulation on the wire

cr0x@server:~$ sudo tcpdump -ni wan1 udp port 4500 -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on wan1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:04:11.120111 IP 203.0.113.20.4500 > 192.0.2.10.4500: UDP-encap: ESP(spi=0xc12f4a9d,seq=0x2f1), length 132
12:04:11.140225 IP 192.0.2.10.4500 > 203.0.113.20.4500: UDP-encap: ESP(spi=0x1a2b3c4d,seq=0x19a), length 132
12:04:12.120332 IP 203.0.113.20.4500 > 192.0.2.10.4500: UDP-encap: ESP(spi=0xc12f4a9d,seq=0x2f2), length 132
12:04:12.140447 IP 192.0.2.10.4500 > 203.0.113.20.4500: UDP-encap: ESP(spi=0x1a2b3c4d,seq=0x19b), length 132
12:04:13.120552 IP 203.0.113.20.4500 > 192.0.2.10.4500: UDP-encap: ESP(spi=0xc12f4a9d,seq=0x2f3), length 132
5 packets captured

Meaning: ESP is encapsulated in UDP/4500 and traffic flows both ways.

Decision: If you only see outbound packets, suspect upstream filtering, wrong NAT, or remote side not responding. If you see only IKE/500 and no 4500 while behind NAT, NAT-T may be mis-negotiated.

Task 11: Confirm firewall allows IKE/IPsec on both WANs

cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    iif "lo" accept
    ct state established,related accept
    iifname "wan1" udp dport { 500, 4500 } accept
    iifname "wan2" udp dport { 500, 4500 } accept
    iifname "lan0" accept
    counter reject with icmpx type admin-prohibited
  }
}

Meaning: UDP/500 and UDP/4500 are accepted on both WAN interfaces.

Decision: If wan2 is missing, secondary tunnel negotiations will fail silently. Fix firewall before touching IPsec configs.
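
If the wan2 rule is missing, a hedged one-liner adds it ahead of the final reject (insert places the rule at the top of the chain; adjust the position to taste):

nft insert rule inet filter input iifname "wan2" udp dport { 500, 4500 } accept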

Task 12: Watch route changes during induced failure

cr0x@server:~$ ip monitor route
Deleted default via 203.0.113.1 dev wan1 proto static metric 100
Added default via 198.51.100.1 dev wan2 proto static metric 200
Added 10.50.0.0/16 dev vti1 proto static metric 50
Deleted 10.50.0.0/16 dev vti0 proto static metric 10

Meaning: Primary default and primary tunnel route were removed; secondary became active and route switched to vti1.

Decision: This is what you want. If routes don’t change, your failover logic isn’t wired into routing (or it’s failing to trigger). If they flap repeatedly, add hold-down timers and widen probe thresholds.

Task 13: Confirm application-level health across the VPN (TCP test)

cr0x@server:~$ nc -vz -w 2 10.50.20.10 443
Connection to 10.50.20.10 443 port [tcp/https] succeeded!

Meaning: The overlay can reach a real service. ICMP-only success can hide firewall or MTU issues.

Decision: Use this as a probe target when you care about app uptime. If it fails during failover but ping works, suspect MSS/MTU, stateful inspection, or routing asymmetry.

Task 14: Check for asymmetric routing symptoms with conntrack

cr0x@server:~$ sudo conntrack -S
cpu=0 found=24872 invalid=36 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0

Meaning: The invalid counter is non-zero; it often climbs during asymmetric routing or path changes.

Decision: If invalid spikes during failover, consider active/standby instead of active/active, or ensure symmetric routing with policy routing and consistent NAT/state handling.

Fast diagnosis playbook

When the VPN “fails over” but users still complain, you need to isolate where the failure lives: underlay (ISP), overlay (IPsec), routing (your box), or the remote side. Do this in order.

First: underlay sanity (prove the ISP path is usable)

  1. Check link and IP addressing on both WANs (carrier up, DHCP/PPPoE, routes).
  2. Probe multiple stable internet targets per WAN (interface-bound ping or TCP). If only one ISP is degraded, don’t touch the VPN yet.
  3. Confirm you can reach the remote VPN peer public IP from each WAN.

Second: overlay control plane (is IKE negotiating?)

  1. Confirm UDP/500 and UDP/4500 are allowed both ways.
  2. Check IKE logs for negotiation loops, auth failures, or proposal mismatches.
  3. Validate that the secondary tunnel is configured on the remote peer and not blocked by “only allow known source IP.”

Third: overlay data plane (can packets pass?)

  1. Ping a remote loopback or test host across the tunnel.
  2. Run a TCP connect test to a real service across the tunnel.
  3. Test PMTU with DF pings and clamp MSS if needed.

Fourth: routing and symmetry (where did traffic actually go?)

  1. Check route selection to remote subnets.
  2. Look for asymmetric routing with conntrack invalids or firewall drops.
  3. If dynamic routing is involved, verify route advertisements and filtering.

Fifth: remote side and application dependencies

  1. Confirm remote firewall has routes back to office networks via the active tunnel.
  2. Check DNS, identity providers, and any “helpful” proxies that may have pinned to the old path.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

The office had two ISPs. The slide deck proudly said “redundant connectivity.” The reality: ISP2 was plugged in and had an IP, but nobody had ever tested a site-to-site VPN over it because “the firewall supports it.” That phrase should come with a warning label.

One Tuesday, ISP1 didn’t go down; it got weird. Outbound traffic worked for some destinations. The VPN peer’s network? Blackholed upstream. The firewall’s “dead gateway detection” kept declaring the link healthy because it could ping the ISP next hop. Users could browse the web, which made leadership confident the problem was “the VPN team.”

The team forced failover manually to ISP2. The tunnel never came up. The remote peer only allowed the known public IP of ISP1, and the certificate identity was tied to that assumption. They’d built redundancy on paper, but operationally it was a single point of failure with a subscription fee attached.

They fixed it the right way: configured both tunnels on the remote, used a stable IKE identity not tied to a single source IP, and implemented overlay probes that tested a loopback behind the remote firewall. Two weeks later, they simulated failure during business hours and watched the tunnels behave like adults.

Mini-story 2: The optimization that backfired

A different company wanted “use both circuits because finance.” They turned on active/active with ECMP across two IPsec VTIs and congratulated themselves when iperf showed more throughput in a lab.

In production, the helpdesk started collecting tickets that sounded unrelated: intermittent file share disconnects, VoIP one-way audio, and “the ERP spins forever.” Nothing was fully down. Everything was slightly haunted.

The culprit was asymmetric routing combined with stateful security policies on the remote side. Some flows went out tunnel A and returned on tunnel B. The firewall dropped them as invalid. Even when packets weren’t dropped, reordering and jitter spiked because paths were different.

The fix was unglamorous: revert to active/standby for most traffic, reserve active/active only for a few stateless bulk flows, and enforce symmetry with policy routing where needed. Throughput graphs looked less exciting. Users stopped calling. That’s the KPI that matters.

Mini-story 3: The boring but correct practice that saved the day

A mid-sized office ran dual ISP with two IPsec tunnels to a data center. Nothing fancy: route-based VPN, primary/secondary, health checks that probed a remote loopback and a TCP port, plus a 60-second hold-down timer before failing back.

They also did something nobody loves: quarterly failover drills during a maintenance window. They didn’t just unplug cables; they simulated partial failures—packet loss injection, DNS outages, upstream route blackholes—because that’s what real ISPs do to you when you’ve made weekend plans.

When a construction crew took out the fiber (predictable, eternal), the office dropped a few seconds of traffic and kept moving. The tunnel switched, routes updated, and the monitoring system alerted with a clean “failover succeeded” message. No one needed to SSH in from a phone.

The post-incident review was short: confirm timelines, confirm the drill matched reality, send the ISP ticket, go back to work. Boring is a feature.

Common mistakes: symptom → root cause → fix

This is the section you read when you’re already late for something.

1) Symptom: ISP1 fails and tunnel drops; tunnel over ISP2 never comes up

  • Root cause: Remote peer only allows one source IP / only one configured tunnel / incorrect IKE identity for secondary.
  • Fix: Configure both tunnels on the remote side explicitly; use stable ID (FQDN or certificate DN) not tied to a single WAN IP; confirm UDP/500 and UDP/4500 reachability on ISP2.

2) Symptom: Tunnel says “up” but users can’t reach anything

  • Root cause: Control plane is established; data plane broken due to missing routes, wrong traffic selectors, or remote-side routing.
  • Fix: Probe a remote loopback across the tunnel; verify route to remote subnet points to the tunnel interface; check remote routes back to office subnets.

3) Symptom: Failover happens, but SaaS works and internal apps don’t

  • Root cause: Only the VPN path is broken (peer reachability, ESP blocked, or overlay route not swapped), while general internet remains fine.
  • Fix: Underlay probe the specific VPN peer public IP from each ISP; use overlay probes; don’t rely on “internet is up” as evidence the VPN is fine.

4) Symptom: Random websites or large downloads fail only over VPN, especially after failover

  • Root cause: MTU/PMTU blackhole; ICMP blocked; TCP MSS too large for encapsulated path.
  • Fix: Clamp MSS on the tunnel or LAN egress; set tunnel MTU; allow ICMP fragmentation-needed where possible; verify with DF pings.

5) Symptom: Active/active looks fine until voice calls sound like robots

  • Root cause: Path jitter/reordering; asymmetric routing; ECMP distributing flows across different quality links.
  • Fix: Use application-aware steering; pin voice to the best link; or stop being clever and run active/standby.

6) Symptom: Flapping—frequent failover/failback every few minutes

  • Root cause: Aggressive probe thresholds; single probe target; failback with no hold-down.
  • Fix: Use multiple probe targets; require consecutive failures; add a hold-down timer for failback; tune thresholds to your app tolerance.

7) Symptom: Secondary tunnel comes up, but return traffic dies (one-way traffic)

  • Root cause: Remote side still routes back via primary tunnel; route propagation delayed; asymmetric routing + stateful firewall drop.
  • Fix: Ensure remote uses route priorities matching tunnel health; if using BGP, validate route withdrawal/advertisement; if static, use tracked routes tied to tunnel status.

8) Symptom: Everything breaks only during rekey, especially on secondary

  • Root cause: Rekey timers coincide with unstable link; mismatched proposals; DPD timing too slow/fast, causing repeated negotiations.
  • Fix: Align crypto proposals; stagger rekey; tune DPD; validate logs during rekey windows.

Checklists / step-by-step plan

If you want office VPN failover that doesn’t require a dedicated emotional support engineer, follow this plan. It’s opinionated because reality is opinionated.

Step 1: Choose your target architecture

  • Default: route-based IPsec, active/standby tunnels, data-plane health checks driving route changes.
  • Upgrade path: add dynamic routing (BGP) over both tunnels when you have multiple prefixes/sites.
  • Avoid: DNS-based “failover,” scripts that edit config files live, and ECMP unless you understand symmetry and state.

Step 2: Make identities and configuration robust to WAN changes

  • Use stable IKE IDs (FQDN or certificates) rather than “my public IP.”
  • Configure both tunnels on the remote peer explicitly.
  • Document the required ports and protocols (UDP/500, UDP/4500; ESP if not NAT-T).

Step 3: Implement health checks that measure usefulness

  • Underlay probes: 2–3 internet targets per WAN (interface-bound), not just ISP next hop.
  • Overlay probes: ping a remote loopback plus TCP connect to an internal service.
  • Decision logic: consecutive failures, thresholds, and hold-down for failback.

Step 4: Tie health to routing, not hope

  • Tracked routes or route metrics that change when probes fail.
  • Clear preference order: primary unless proven bad.
  • If you must do active/active, implement symmetry controls and be prepared to debug state drops.

Step 5: Engineer for MTU and rekey reality

  • Set MTU/MSS correctly early. This is not a “nice to have.”
  • Stagger rekey timers if your platform allows it.
  • Keep logs long enough to catch hourly/daily patterns.

Step 6: Monitoring and alerting that doesn’t cry wolf

  • Alert on: tunnel down, overlay probe failures, and flap frequency.
  • Record: which ISP is active, probe loss/latency, and route changes.
  • Have one “VPN failover succeeded” signal; it reduces panic and ticket noise.

Step 7: Test like an adult

  • Test full ISP loss (link down).
  • Test partial failures (drop UDP/4500, inject loss, break DNS).
  • Test failback behavior (hold-down and stability).
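
Two cheap ways to simulate the partial-failure cases on a Linux gateway, both hedged sketches to run only in a maintenance window and to remove afterwards:

# Inject 30% packet loss on the primary WAN (undo with: tc qdisc del dev wan1 root)
tc qdisc add dev wan1 root netem loss 30%

# Temporarily drop inbound UDP/4500 (NAT-T / ESP-in-UDP) on the primary WAN
nft insert rule inet filter input iifname "wan1" udp dport 4500 drop

Watch what your health checks and routing do under each condition; that behavior, not the cable-pull test, is what you will see in real incidents.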

FAQ

1) Should I do active/active or active/standby for office VPN?

Active/standby unless you have a measured bandwidth need that one link cannot meet. Active/active introduces symmetry and state issues that are expensive to troubleshoot.

2) Can I rely on DPD alone for failover?

No. DPD detects peer responsiveness, not application usefulness. You also need data-plane probes across the tunnel and, ideally, an application-level TCP check.

3) Why does my firewall say the tunnel is up but nothing works?

Because IKE can be established while routes, selectors, or remote-side return routing are wrong. Always test overlay reachability to a known remote IP/service.

4) What’s the best probe target for overlay health?

A loopback interface IP on the remote VPN gateway (stable, always routed) plus a TCP probe to a real service you care about (e.g., HTTPS on an internal load balancer).

5) Do I need BGP to get good failover?

Not for a single office to a single site. Static routes with tracking can be perfectly reliable. Use BGP when you have multiple prefixes, multiple sites, or you’re tired of managing static routes.

6) How fast should failover be?

Fast enough that users don’t notice, slow enough that you don’t flap. In practice: detection in ~10–30 seconds for many offices, with failback hold-down of 60–300 seconds. Tune based on measured link behavior.

7) Why do large file transfers fail but small pings work?

Classic MTU/PMTU blackhole. Small ICMP packets pass, larger TCP segments don’t. Fix MSS clamping and confirm with DF pings.

8) Can I do failover by changing public DNS for the VPN endpoint?

You can, but it’s not reliable failover. DNS caching and long-lived sessions will betray you. Use two tunnels and real routing/health logic.

9) What if my ISP uses CGNAT on the backup link?

Still possible if the remote peer accepts NAT-T from changing source ports/addresses, but it’s riskier. Prefer a real public IP for the VPN termination, or use a cloud relay/terminator you control.

10) How do I prevent failback flapping when ISP1 is “mostly back”?

Add a hold-down timer and require sustained healthy probes before failing back. Also probe multiple targets; one recovered path doesn’t mean the internet is actually sane.

Conclusion: next steps you can actually do this week

If you want office VPN failover that doesn’t require manual babysitting, stop thinking in terms of “two circuits” and start thinking in terms of “measurable, steerable paths.” Build two tunnels, prove both work, and make the router/firewall choose based on data-plane health—not wishful link lights.

Practical next steps

  1. Inventory reality: Confirm both ISPs can reach the remote VPN peer and that UDP/500/4500 pass both ways.
  2. Stand up the secondary tunnel: Configure it on both ends. Verify it establishes and can pass traffic before calling it “redundant.”
  3. Add overlay probes: Ping a remote loopback and run a TCP probe to a service across the tunnel. Make routing follow probe results.
  4. Fix MTU early: Test DF pings, clamp MSS, and document the chosen values.
  5. Run a failover drill: Induce failure in a controlled window, watch route changes, confirm user-impact, and tune thresholds to stop flapping.

Do these and you’ll get the real prize: the second ISP stops being a monthly donation to the gods of networking, and starts being what it was supposed to be—boring insurance that actually pays out.
