You buy a second ISP link for “resilience,” plug it into a spare port on the MikroTik, and feel like a responsible adult.
Then the VPN starts flapping, calls sound like a robot swallowing a modem, and the finance team can’t reach the ERP unless they stand in one specific corner of the office.
Dual WAN is easy to turn on. Making it behave—especially with VPN tunnels—is where the bodies are buried.
This is the field guide for doing failover on MikroTik without turning your office network into a low-budget chaos experiment.
What you’re really building (and why it fails)
When people say “failover,” they usually mean “if ISP A dies, use ISP B.” For VPNs, that’s only the outer shell.
The real system is a chain of dependencies: routing decisions, NAT behavior, tunnel peer reachability, session state, and how quickly you detect the problem without oscillating.
The enemy isn’t lack of features. RouterOS has the knobs. The enemy is half-configurations that fight each other:
default routes competing with policy routes, connection tracking pinning flows to a dead uplink, NAT rewriting replies onto the wrong interface,
and health checks that measure the wrong thing.
Here’s the most important mental model: VPN failover is not a single event; it’s a controlled change in path while preserving determinism.
Determinism means: for any given flow, you can predict which WAN it uses, what source IP it uses, what NAT does, and which tunnel it exits.
If you can’t predict it, your users will “discover” it via outages.
Joke #1: Dual WAN is like buying two umbrellas and assuming the rain will schedule itself around your meetings.
The three failure modes you see in the wild
- Asymmetric routing: outbound uses ISP A, inbound return comes via ISP B (or vice versa). VPNs and stateful firewalls hate this.
- False health: gateway responds to ping but upstream is dead; your “failover” never triggers because you measured the wrong hop.
- Flap storms: tiny packet loss makes the router bounce routes/tunnels every few seconds. The VPN is “up,” but nothing stays up.
One quote, because it’s true in operations
Werner Vogels (paraphrased idea): “Everything fails, all the time.” Build systems that expect that, and your on-call rotation gets to sleep.
Facts and historical context (short, useful, slightly depressing)
- IPsec predates modern “cloud VPN” convenience. The core RFC work dates back to the late 1990s, which explains some of its sharp edges.
- NAT traversal (NAT-T) became normal because everyone insisted on NAT. Encapsulating ESP in UDP/4500 was a practical compromise, not an elegant one.
- Stateful firewalls changed routing expectations. Once you track connections, “any path back” stops being acceptable; return traffic must match state.
- Multi-WAN got popular in SMB before it got clean. Many small offices adopted dual ISPs long before they had staff to operate policy routing safely.
- BFD (Bidirectional Forwarding Detection) exists because routing protocols needed faster failure detection than hello timers. MikroTik supports BFD with some protocols, but most SMB designs never use it.
- “Check-gateway=ping” is not a reachability oracle. It tells you about one target only, and that target might still answer while the Internet is on fire.
- WireGuard is young (2010s) and intentionally minimal. That simplicity makes it easier to reason about failover compared to legacy stacks, but you still need routing discipline.
- Carrier-grade NAT (CGNAT) quietly broke assumptions. Your “public IP” might not be yours, and inbound VPN initiation can become impossible without help.
- Connection tracking timeouts matter more than people think. With failover, stale conntrack entries can pin flows to a dead egress even after routes change.
Design principles that keep VPN failover sane
1) Decide what “success” means: link up, Internet up, or VPN usable?
A physical link can be up while the provider is blackholing routes. A default gateway can answer while upstream DNS is dead.
And a VPN can show “established” while your apps are timing out due to MTU issues or asymmetric routing.
Your health check target should match the level you care about:
ISP gateway ping detects L2/L3 failures; public IP ping detects upstream reachability;
remote VPN endpoint check detects actual service viability. Most offices need at least upstream reachability.
2) Prefer deterministic routing over “magic”
MikroTik can do a lot automatically, but the automatic behaviors are often “reasonable” only in single-WAN environments.
Dual WAN + VPN requires explicit choices:
- What source IP should each tunnel use?
- What WAN should each tunnel prefer?
- What traffic is allowed to fail over, and what must be pinned?
- How do you prevent existing sessions from sticking to a dead WAN?
3) Build for stable failover, not fast failover
Yes, you want quick recovery. No, you don’t want a router that changes its mind every time a packet sneezes.
Add hysteresis: multiple consecutive failures before switching, and multiple consecutive successes before switching back.
4) Treat NAT as part of routing (because it is)
With two WANs, NAT rules must be interface-aware and must align with your routing policy.
If traffic leaves via ISP B but is NATed to ISP A’s address, you’ve built a self-inflicted outage.
5) Keep VPN routing isolated from office Internet routing
The cleanest failures are scoped failures. Put your VPN decision logic into its own routing tables (full VRFs if you’re doing it seriously, or plain tables selected by routing rules or marks),
and keep your default browsing traffic out of that logic.
Joke #2: If you can’t explain your policy routing in one whiteboard diagram, your router is already plotting against you.
Three workable architectures (pick one, don’t mix them)
Architecture A: Simple active/standby default route with clean VPN initiation
This is the “small office but wants fewer surprises” design. You keep a primary default route via WAN1, secondary via WAN2 with higher distance.
VPN peers are reachable via whichever WAN is active; when WAN1 fails, the router shifts default route and the VPN re-establishes from WAN2.
Pros: easier, fewer mangle rules, less chance of self-DDoS via policy routing. Cons: existing sessions die on failover. That’s fine; you’re failing over, not performing surgery.
Best for: one or two site-to-site tunnels, IPsec with dynamic peers, or WireGuard where reconnect is cheap.
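A minimal sketch of Architecture A’s route pair, using the example gateways that appear in the tasks below (198.51.100.1 for WAN1, 203.0.113.1 for WAN2); distances and comments are placeholders to adapt:
cr0x@server:~$ /ip/route/add dst-address=0.0.0.0/0 gateway=198.51.100.1 distance=1 check-gateway=ping comment="default-via-wan1"
cr0x@server:~$ /ip/route/add dst-address=0.0.0.0/0 gateway=203.0.113.1 distance=5 check-gateway=ping comment="default-via-wan2"
Pair this with the recursive-check variant from Task 4 if you want the failover trigger to test beyond the first hop.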
Architecture B: Policy routing with pinned VPN egress + failover per tunnel
Here you explicitly route the VPN tunnel traffic itself (IKE/ESP or WireGuard UDP) out a preferred WAN, but allow fallback.
You use routing rules (or routing marks) so that “traffic to the VPN peer” uses a dedicated table with two default routes: WAN1 primary, WAN2 backup.
Pros: isolates VPN control traffic from office browsing; avoids random tunnel flaps due to general route changes.
Cons: more rules, more ways to shoot your foot, requires discipline with address lists and route rules.
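A minimal sketch of the dedicated peer table, assuming a single peer at 203.0.113.200 and the example gateways used in the tasks; the table name vpn_peers and the comments are placeholders:
cr0x@server:~$ /routing/table/add name=vpn_peers fib
cr0x@server:~$ /ip/route/add dst-address=0.0.0.0/0 gateway=198.51.100.1 distance=1 check-gateway=ping routing-table=vpn_peers comment="vpn-egress-wan1"
cr0x@server:~$ /ip/route/add dst-address=0.0.0.0/0 gateway=203.0.113.1 distance=5 check-gateway=ping routing-table=vpn_peers comment="vpn-egress-wan2"
cr0x@server:~$ /routing/rule/add dst-address=203.0.113.200/32 action=lookup table=vpn_peers comment="vpn-peer-via-table"
With action=lookup the rule falls back to main if the table has no usable route; use lookup-only-in-table if you’d rather have peer traffic fail hard than leak out an unplanned path.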
Architecture C: Two tunnels (one per ISP) + dynamic routing or manual priority
This is the “do it properly or don’t do it” approach for serious inter-site connectivity. You build two independent tunnels:
one sourced from WAN1, one from WAN2. Then you run a routing protocol (or static routes with checkers) across the tunnels.
Pros: actual redundancy at the VPN layer, less dependency on a single tunnel rekeying cleanly during an ISP event.
Cons: more complexity, requires remote side support, demands careful route filtering and metrics.
If you can run dynamic routing, you can also use BFD for fast detection in some setups. If you can’t, keep it static and stable.
Opinionated choice
For most offices: Architecture B is the sweet spot. It isolates VPN behavior without forcing you into full-blown routing protocol land.
Architecture A is acceptable when you’re OK with short disconnects. Architecture C is for when “short disconnects” get you yelled at in quarterly meetings.
Practical tasks: commands, outputs, and decisions
These are real checks you can run while building or debugging. Each task includes: a command, what typical output means, and the decision you make.
Commands are RouterOS CLI unless stated otherwise. You can paste them into a MikroTik terminal session.
Task 1: Confirm interface state and speed/duplex (don’t assume cabling is fine)
cr0x@server:~$ /interface/ethernet/print detail where name~"ether1|ether2"
0 name="ether1" default-name="ether1" mtu=1500 l2mtu=1598 mac-address=DC:2C:6E:AA:BB:01 arp=enabled
speed=1Gbps full-duplex=yes auto-negotiation=yes link-downs=0
1 name="ether2" default-name="ether2" mtu=1500 l2mtu=1598 mac-address=DC:2C:6E:AA:BB:02 arp=enabled
speed=100Mbps full-duplex=yes auto-negotiation=yes link-downs=7
Meaning: ether2 has link-downs and negotiated at 100Mbps. That’s not “fine”; it’s a hint.
Decision: swap cable, check ISP handoff, or lock speed/duplex if the upstream device is dumb. A flaky WAN link makes failover logic look broken.
Task 2: See your default routes and whether they’re active
cr0x@server:~$ /ip/route/print where dst-address="0.0.0.0/0"
Flags: D - dynamic, A - active, c - connect, s - static
# DST-ADDRESS GATEWAY DISTANCE
0 As 0.0.0.0/0 198.51.100.1 1
1 s 0.0.0.0/0 203.0.113.1 5
Meaning: WAN1 route is active; WAN2 is standby (higher distance).
Decision: if both are active when you expected one, you’ll get ECMP-ish behavior and random session failures unless you planned for it.
Task 3: Verify “check-gateway” behavior and whether it’s lying to you
cr0x@server:~$ /ip/route/print detail where dst-address="0.0.0.0/0"
0 A S dst-address=0.0.0.0/0 gateway=198.51.100.1 distance=1 check-gateway=ping
1 S dst-address=0.0.0.0/0 gateway=203.0.113.1 distance=5 check-gateway=ping
Meaning: you’re pinging the gateway IP, not the Internet.
Decision: for real failover, prefer recursive routes with a public target (or scripted checks) so you’re testing beyond the first hop.
Task 4: Check recursive routing targets (recommended approach for upstream reachability)
cr0x@server:~$ /ip/route/print where comment~"rec-check"
Flags: D - dynamic, A - active, c - connect, s - static
# DST-ADDRESS GATEWAY DISTANCE COMMENT
0 As 1.1.1.1/32 198.51.100.1 1 rec-check-wan1
1 s 0.0.0.0/0 1.1.1.1 1 default-via-wan1
2 As 8.8.8.8/32 203.0.113.1 1 rec-check-wan2
3 s 0.0.0.0/0 8.8.8.8 5 default-via-wan2
Meaning: the default routes resolve recursively via public targets. When check-gateway can no longer ping that target through its WAN, the corresponding default route goes inactive and the standby takes over.
Decision: use this when you want failover based on “Internet reachability,” not “gateway answers ARP.”
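A sketch of the configuration behind that output; scope=10 on the host routes is what lets the defaults resolve recursively, and check-gateway=ping on the defaults is what pulls them down when the target stops answering via that WAN:
cr0x@server:~$ /ip/route/add dst-address=1.1.1.1/32 gateway=198.51.100.1 scope=10 comment="rec-check-wan1"
cr0x@server:~$ /ip/route/add dst-address=0.0.0.0/0 gateway=1.1.1.1 distance=1 check-gateway=ping comment="default-via-wan1"
cr0x@server:~$ /ip/route/add dst-address=8.8.8.8/32 gateway=203.0.113.1 scope=10 comment="rec-check-wan2"
cr0x@server:~$ /ip/route/add dst-address=0.0.0.0/0 gateway=8.8.8.8 distance=5 check-gateway=ping comment="default-via-wan2"
Caveat: each check target becomes reachable only via its pinned WAN, so don’t reuse 1.1.1.1 or 8.8.8.8 as the office resolver unless you accept that coupling.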
Task 5: Confirm the router can reach health targets via the intended WAN
cr0x@server:~$ /ping 1.1.1.1 routing-table=main count=3
SEQ HOST SIZE TTL TIME STATUS
0 1.1.1.1 56 57 11ms
1 1.1.1.1 56 57 10ms
2 1.1.1.1 56 57 11ms
sent=3 received=3 packet-loss=0% min-rtt=10ms avg-rtt=10ms max-rtt=11ms
Meaning: reachability exists, but this still doesn’t prove WAN-specific path unless you constrain it.
Decision: if you need per-WAN checks, use routing rules/tables and ping with routing-table=....
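A sketch of per-WAN check tables, reusing the table names that show up in Task 6; each table carries a single default via its own ISP, so the ping below can only succeed through that WAN:
cr0x@server:~$ /routing/table/add name=to_wan1_vpn fib
cr0x@server:~$ /routing/table/add name=to_wan2_vpn fib
cr0x@server:~$ /ip/route/add dst-address=0.0.0.0/0 gateway=198.51.100.1 routing-table=to_wan1_vpn
cr0x@server:~$ /ip/route/add dst-address=0.0.0.0/0 gateway=203.0.113.1 routing-table=to_wan2_vpn
cr0x@server:~$ /ping 1.1.1.1 routing-table=to_wan1_vpn count=3
cr0x@server:~$ /ping 1.1.1.1 routing-table=to_wan2_vpn count=3
If the second ping fails while the first succeeds, you know which ISP to blame before anyone opens a ticket.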
Task 6: Inspect routing rules (policy routing) and catch shadow rules
cr0x@server:~$ /routing/rule/print
Flags: X - disabled
#  SRC-ADDRESS      DST-ADDRESS       ACTION  TABLE
0  10.10.10.0/24                      lookup  main
1                   203.0.113.10/32   lookup  to_wan1_vpn
2                   203.0.113.11/32   lookup  to_wan2_vpn
Meaning: you’re routing traffic to specific VPN peers via dedicated tables.
Decision: ensure ordering makes sense; a broader earlier rule can steal traffic from later, more specific rules.
Task 7: Validate NAT rules are interface-aware (and ordered correctly)
cr0x@server:~$ /ip/firewall/nat/print
Flags: X - disabled, I - invalid, D - dynamic
# CHAIN ACTION SRC-ADDRESS OUT-INTERFACE COMMENT
0 srcnat masquerade 10.10.10.0/24 ether1 NAT-to-WAN1
1 srcnat masquerade 10.10.10.0/24 ether2 NAT-to-WAN2
2 srcnat accept 10.10.10.0/24 (none) no-NAT-to-VPN
Meaning: the “no-NAT-to-VPN” rule is last, so it will never match before masquerade.
Decision: move accept rules above masquerade. NAT ordering is not a philosophy debate; it’s a line-by-line execution engine.
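A sketch of the fix, assuming the remote VPN subnet is 10.20.20.0/24 as in the IPsec policy example; place-before drops the exception above the first masquerade rule, after which the old misplaced accept rule should be disabled or removed:
cr0x@server:~$ /ip/firewall/nat/add chain=srcnat action=accept src-address=10.10.10.0/24 dst-address=10.20.20.0/24 comment="no-NAT-to-VPN-fixed" place-before=[find comment="NAT-to-WAN1"]
cr0x@server:~$ /ip/firewall/nat/print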
Task 8: See active IPsec SAs and confirm which local address they use
cr0x@server:~$ /ip/ipsec/active-peers/print detail
0 address=203.0.113.200 local-address=198.51.100.10 state=established
nat-traversal=yes ike2=yes exchange-mode=ike2
Meaning: tunnel is established and sourced from WAN1 public IP.
Decision: if failover happens and it still sources from the dead WAN, you need explicit local-address handling or route policy for peer traffic.
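How you handle this depends on the architecture: for a single peer that should fail over (A or B), leave local-address unset so IKE sources from whichever egress your routing policy picks; for one-peer-per-WAN designs (C), pin each peer explicitly. A sketch of the pinned variant, with dc-wan1 as a placeholder peer name:
cr0x@server:~$ /ip/ipsec/peer/add name=dc-wan1 address=203.0.113.200/32 exchange-mode=ike2 local-address=198.51.100.10 comment="pinned-to-wan1"
cr0x@server:~$ /ip/ipsec/peer/print detail where name="dc-wan1"
If an existing peer turns out to have local-address pinned to the dead WAN, that pin is your failover blocker, not the routes.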
Task 9: Check IPsec policies and whether they’re too broad (causing route leaks)
cr0x@server:~$ /ip/ipsec/policy/print
Flags: T - template, D - dynamic, X - disabled, A - active
# SRC-ADDRESS DST-ADDRESS SA-SRC-ADDRESS SA-DST-ADDRESS
0 A 10.10.10.0/24 10.20.20.0/24 198.51.100.10 203.0.113.200
1 0.0.0.0/0 0.0.0.0/0 198.51.100.10 203.0.113.200
Meaning: there’s a 0/0 policy. That’s a “route all traffic into the tunnel” foot-gun unless you are intentionally building a full-tunnel.
Decision: remove or disable broad policies unless you can explain them to your future self at 3 a.m.
Task 10: Verify WireGuard peer handshakes and transfer on each link
cr0x@server:~$ /interface/wireguard/peers/print detail
0 interface=wg0 public-key="..." endpoint-address=203.0.113.210 endpoint-port=51820
allowed-address=10.20.20.0/24 persistent-keepalive=25s
last-handshake=1m12s rx=145.2MiB tx=162.9MiB
Meaning: handshakes are occurring; traffic is flowing. If last-handshake grows during “failover,” you’re not re-establishing.
Decision: verify endpoint routing and NAT rules; consider two peers/endpoints if the remote side supports both ISPs.
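If the remote side is reachable on more than one public address, the endpoint can be repointed as part of your failover logic; 203.0.113.211 below is purely hypothetical, and the persistent keepalive already visible above is what keeps NAT mappings fresh so the new path handshakes quickly:
cr0x@server:~$ /interface/wireguard/peers/set [find interface=wg0 endpoint-address=203.0.113.210] endpoint-address=203.0.113.211
Script this from the same place you handle route failover; typing it by hand during an outage is how typos become incidents.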
Task 11: Check connection tracking for flows stuck on the wrong WAN
cr0x@server:~$ /ip/firewall/connection/print where dst-address~"203.0.113.200" and protocol=udp
Flags: S - seen-reply, A - assured, C - confirmed, D - dying
# PROTO SRC-ADDRESS:PORT DST-ADDRESS:PORT TIMEOUT TCP-STATE
0 SAC udp 198.51.100.10:4500 203.0.113.200:4500 00:00:27
Meaning: IPsec NAT-T flow is tracked and pinned to a specific source. During failover, old entries can keep trying the dead path.
Decision: after route change, you may need to clear specific conntrack entries (surgically) to force re-establishment.
Task 12: Flush only what you must (surgical conntrack reset)
cr0x@server:~$ /ip/firewall/connection/remove [find dst-address~"203.0.113.200" and protocol=udp]
Meaning: removes tracked flows to the peer so the tunnel can renegotiate using the new path.
Decision: avoid flushing all connections unless you enjoy explaining to the CEO why every call dropped.
Task 13: Capture traffic to confirm which interface is actually used
cr0x@server:~$ /tool/sniffer/quick interface=ether1 ip-address=203.0.113.200
TIME NUM DIR SRC-ADDRESS DST-ADDRESS PROTOCOL SIZE
0.01 1 tx 198.51.100.10 203.0.113.200 udp 146
0.02 2 rx 203.0.113.200 198.51.100.10 udp 146
Meaning: traffic is using ether1. If you believe you failed over to ether2 but see packets on ether1, routing/NAT is not doing what you think.
Decision: trust packet capture over dashboards and feelings.
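For a failover test, a file-based capture across all interfaces is more convincing than a quick sniff on one port; a minimal sketch, with the file name as a placeholder:
cr0x@server:~$ /tool/sniffer/set filter-ip-address=203.0.113.200/32 file-name=vpn-failover file-limit=1000KiB
cr0x@server:~$ /tool/sniffer/start
cr0x@server:~$ /tool/sniffer/stop
cr0x@server:~$ /file/print where name~"vpn-failover"
Pull the capture into Wireshark afterwards and you get an unarguable timeline of which interface carried the peer traffic during the event.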
Task 14: Observe route changes live while you simulate failure
cr0x@server:~$ /log/print follow where topics~"route"
12:01:14 route,info default-via-wan1 inactive
12:01:14 route,info default-via-wan2 active
12:01:16 route,info default-via-wan1 active
12:01:17 route,info default-via-wan2 inactive
Meaning: you’re flapping. That’s not “fast”; it’s unstable.
Decision: introduce hysteresis: longer check intervals, multiple failures before switching, and delayed failback.
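If you want hysteresis beyond what check-gateway gives you, a scheduler-driven script with failure and success counters is enough. This is a minimal sketch, assuming the primary default route is commented default-via-wan1 and the per-WAN check table from Task 5 exists; the thresholds are placeholders to tune:
:global wan1Fail; :global wan1Ok
:if ([:typeof $wan1Fail] = "nothing") do={ :set wan1Fail 0 }
:if ([:typeof $wan1Ok] = "nothing") do={ :set wan1Ok 0 }
:if ([/ping 1.1.1.1 routing-table=to_wan1_vpn count=3] = 0) do={
    :set wan1Fail ($wan1Fail + 1); :set wan1Ok 0
} else={
    :set wan1Ok ($wan1Ok + 1); :set wan1Fail 0
}
:if ($wan1Fail >= 3) do={ /ip/route/set [find comment="default-via-wan1"] disabled=yes }
:if ($wan1Ok >= 9) do={ /ip/route/set [find comment="default-via-wan1"] disabled=no }
Save it under /system/script and run it from /system/scheduler every 20 seconds or so: with these numbers you fail over after roughly a minute of sustained failure and fail back only after about three minutes of sustained success.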
Task 15: Check CPU load during failover events (VPN crypto can be the bottleneck)
cr0x@server:~$ /system/resource/print
uptime: 3d4h22m
version: 7.16.1 (stable)
cpu-load: 84
free-memory: 214.3MiB
total-memory: 512.0MiB
Meaning: high CPU during rekey or tunnel rebuild can delay convergence and make health checks time out.
Decision: reduce crypto overhead (algorithm choices), reduce tunnel churn, or upgrade hardware. “But it’s just an office” is not a capacity plan.
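To see whether the load is crypto or something else, check the per-process breakdown while the tunnels are rebuilding; classifier names vary by RouterOS version, but an encryption-related entry dominating the list points at crypto rather than routing:
cr0x@server:~$ /tool/profile
Stop it with Q (or Ctrl-C). On hardware without crypto offload, this is often the moment people discover what their router’s CPU was actually bought for.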
Task 16: Verify MTU/MSS clamping if you see weird app-specific failures
cr0x@server:~$ /ip/firewall/mangle/print where comment~"MSS"
Flags: X - disabled, I - invalid, D - dynamic
# CHAIN ACTION PROTOCOL TCP-FLAGS NEW-MSS COMMENT
0 forward change-mss tcp syn 1360 clamp-mss-for-vpn
Meaning: MSS clamp exists. If you don’t have this and you run tunnels over PPPoE or mixed MTUs, you can get “VPN up, app down.”
Decision: if large downloads stall or specific SaaS breaks only over VPN, test PMTUD and consider MSS clamp on the right path.
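A sketch of adding the clamp for traffic in and out of the tunnel, assuming the remote subnet 10.20.20.0/24 from earlier examples; 1360 is an example value, and the tcp-mss match keeps the rule from touching connections that already advertise a smaller MSS:
cr0x@server:~$ /ip/firewall/mangle/add chain=forward protocol=tcp tcp-flags=syn tcp-mss=1361-65535 dst-address=10.20.20.0/24 action=change-mss new-mss=1360 passthrough=yes comment="clamp-mss-to-vpn"
cr0x@server:~$ /ip/firewall/mangle/add chain=forward protocol=tcp tcp-flags=syn tcp-mss=1361-65535 src-address=10.20.20.0/24 action=change-mss new-mss=1360 passthrough=yes comment="clamp-mss-from-vpn"
If a stalled transfer starts working right after this, you had an MTU problem, not a routing problem.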
Fast diagnosis playbook
When the VPN “doesn’t fail over” (or fails over into a wall), you need a repeatable triage order.
This is optimized for speed and signal, not elegance.
First: determine whether you have a link problem, a routing problem, or a state problem
- Link: interface down, flapping, negotiation issues, packet loss.
- Routing: wrong route active, wrong routing table/rule, recursive target still reachable via the wrong WAN.
- State: conntrack pinning, IPsec/WireGuard handshake stale, NAT mapping stuck.
Second: measure the WANs separately, not “the Internet”
- Ping a public target through WAN1’s routing table.
- Ping a public target through WAN2’s routing table.
- Sniff traffic to the VPN peer to confirm actual egress interface.
Third: verify the tunnel control plane before the data plane
For IPsec, check active peers, SAs, and logs for rekey failures. For WireGuard, check last handshake.
If the tunnel isn’t established, debugging routes to subnets is wasted motion.
Fourth: confirm NAT and firewall ordering
Wrong NAT order is a classic: you “allow no-NAT to VPN,” but masquerade matched first.
Always read rules top to bottom with the same cynical suspicion you reserve for vendor release notes.
Fifth: clear only the state that’s blocking convergence
Remove conntrack entries for the peer. Bounce the tunnel if necessary. Avoid rebooting the router unless you’ve run out of ideas and credibility.
Common mistakes: symptoms → root cause → fix
1) Symptom: VPN stays “up” but traffic dies after failover
Root cause: asymmetric routing or stale conntrack. The tunnel control packets may still work, but data flows are pinned to a dead egress.
Fix: enforce policy routing for peer traffic; clear conntrack for the peer after failover; ensure NAT rules match out-interface correctly.
2) Symptom: VPN re-establishes, but only some subnets work
Root cause: overlapping routes, too-broad IPsec policy, or missing routes in the remote site after failover path changes.
Fix: tighten selectors/policies; verify routes on both ends; avoid 0/0 policies unless you truly want full tunnel.
3) Symptom: Failover never triggers during “ISP outage”
Root cause: you’re pinging the ISP gateway, which still answers while upstream is dead.
Fix: use recursive default routes via public IP health targets, or scripted checks that test beyond the gateway.
4) Symptom: Failover triggers constantly (flapping)
Root cause: health check too sensitive; target unreliable; WAN has intermittent loss; or CPU spikes delay responses.
Fix: add hysteresis (multiple failures/successes), choose stable targets, fix the physical link, and watch CPU during churn.
5) Symptom: After failback to WAN1, VPN won’t come back without manual intervention
Root cause: remote side still expects the old source IP; NAT mappings stuck; IPsec peer locked to a specific address; DPD timers too slow.
Fix: configure dual peers or dynamic identity where possible; reduce DPD detection time carefully; clear state on both ends during testing.
6) Symptom: Internet works, but VPN negotiation fails only on WAN2
Root cause: WAN2 is behind CGNAT or blocks UDP/500/4500; or MTU differs enough to break fragmentation assumptions.
Fix: test reachability of UDP ports; prefer WireGuard over IPsec in hostile NAT environments; apply MSS clamp; consider provider changes if they block VPN.
7) Symptom: Users complain “Teams calls break exactly when failover happens”
Root cause: that’s normal. Real-time flows don’t survive path change without session-level resilience.
Fix: accept that failover breaks live sessions; focus on recovery time and stability; communicate expected behavior.
8) Symptom: Random outbound connections fail when both WAN routes are active
Root cause: accidental ECMP without symmetric return; NAT/conntrack mismatch.
Fix: keep one default active unless you implement proper PCC/load balancing and matching NAT rules; do not “accidentally multi-path.”
Three corporate mini-stories from the trenches
Mini-story 1: The outage caused by a wrong assumption
A mid-size firm added a second fiber circuit after a memorable outage where someone tried to hotspot an office VPN through a phone.
The directive was simple: “automatic failover.” The network admin set two default routes with different distances and enabled gateway ping checks.
It looked clean. It was also based on one wrong assumption: “If the gateway pings, the Internet is fine.”
A month later, ISP A had a routing issue upstream. The gateway stayed reachable, ARP was fine, and pings to the gateway were perfect.
Meanwhile, anything beyond the ISP edge was blackholed intermittently. The MikroTik refused to fail over because its check target was still green.
Users experienced a slow-motion disaster: DNS timeouts, SaaS logins failing, VPN tunnels “up” but unusable.
The admin tried to “force” failover by disabling the route manually. Traffic moved to ISP B and immediately stabilized.
Then, as soon as they re-enabled the WAN1 route, the router preferred it again (distance=1), and the pain returned. Cue helpdesk tickets with contradictory symptoms.
The fix was not heroic. They moved to recursive default routes with public reachability checks, and they added hysteresis so failback wouldn’t happen after one good ping.
The lesson that stuck wasn’t technical. It was operational: your monitoring target defines your reality, even when reality disagrees.
Mini-story 2: The optimization that backfired
Another company wanted “zero downtime” between ISPs. Someone had read about load balancing and decided to be clever:
they enabled per-connection balancing across both WANs while also running IPsec site-to-site to a datacenter.
The goal was to use both uplinks and “get more bandwidth for free.”
What they got was a helpdesk bingo card. File transfers would start fast and then stall.
Some web apps worked, others randomly failed to authenticate. The VPN would renegotiate multiple times per hour.
The monitoring graphs looked alive, which is the cruelest thing a graph can do.
The problem wasn’t that load balancing is evil. It’s that they didn’t design for symmetric paths.
Some traffic for the VPN peer went out WAN1, return packets arrived on WAN2, conntrack rejected it, and the tunnel’s data plane became a coin toss.
NAT compounded it: flows leaving WAN2 were occasionally masqueraded as WAN1 because the rules were too generic and misordered.
The rollback was basically a confession: they reverted to one active default route, pinned VPN traffic to a dedicated routing table, and stopped trying to be smarter than stateful networking.
Bandwidth “optimization” is not free if it buys you outages. The business was happy with stable and slightly slower.
Mini-story 3: The boring but correct practice that saved the day
A finance-heavy organization had two ISPs and one key site-to-site VPN to a hosted accounting platform.
They had a change process that was aggressively unglamorous: any network change required a written rollback step, a timed maintenance window,
and a failover simulation performed at least once per quarter. Nobody loved it. Nobody put it on their résumé.
Then a construction crew cut a conduit. WAN1 dropped hard. WAN2 stayed up.
The MikroTik failed over within a couple of check intervals, the VPN re-established, and users saw a brief disconnect and then normal operations.
The helpdesk had maybe a handful of calls, mostly from people asking if “the Internet is being weird.”
What made it work wasn’t a fancy design. It was the boring discipline:
separate routing tables for VPN control traffic, NAT rules pinned to out-interfaces, and a tested procedure that included clearing peer conntrack if needed.
They also kept logs shipped off the router, so they could prove the timeline when vendors started the usual blame dance.
Later, during the post-incident review, someone asked why it went so smoothly.
The network admin’s answer was basically: “We practice the thing we claim to have.” It’s not poetry, but it’s how production stays boring.
Checklists / step-by-step plan
Step-by-step: build dual-WAN VPN failover on MikroTik (stable, not fancy)
- Inventory realities. Confirm whether each ISP gives you a real public IP, whether either is CGNAT, and whether UDP/500/4500 is blocked. If WAN2 is CGNAT, plan for outbound-only initiation or WireGuard with keepalives.
- Name interfaces and document them. “ether1” is not documentation. Use comments, interface lists, and consistent naming.
- Choose your failover signal. Prefer recursive default routes via public targets, or scripted checks that test real reachability.
- Keep default route logic simple. One active default route, one standby with higher distance. If you want to use both links, that’s a separate project with separate failure modes.
- Isolate VPN peer routing. Create a routing table for VPN peer traffic (and optionally one per WAN) and add rules for peer destination IPs.
- Make NAT deterministic. Put no-NAT/accept rules for VPN destinations above masquerade. Use out-interface match on masquerade rules.
- Validate tunnel establishment on each WAN independently. Temporarily disable WAN1 route and confirm tunnel can come up on WAN2, and vice versa.
- Add hysteresis. Avoid rapid failback. Require sustained success before returning to WAN1.
- Define a state-reset procedure. Know which conntrack entries to clear, how to bounce the tunnel, and what logs to check.
- Test with a controlled failure. Don’t yank cables randomly. Disable a route, or disable an interface at a planned time, and watch logs/sniffer output.
- Observe MTU issues. If apps behave oddly over VPN on one ISP, test MSS clamp and path MTU behavior.
- Write it down. The config is not self-explanatory, and you won’t remember the intent six months later.
Operational checklist: before you push changes in production
- Have an out-of-band access plan (LTE console, remote hands, or at least a local person with a laptop).
- Export the current config and store it somewhere safe.
- Know your rollback commands and the order to apply them.
- Pick a test that matches your business: DNS lookup, ERP login, file transfer, VoIP call—whatever breaks first.
- Decide acceptable downtime per failover event and communicate it.
FAQ
1) Can MikroTik do true “seamless” VPN failover without dropping sessions?
Usually no. Most application sessions will break when the public egress IP changes.
You can reduce downtime and make reconnection fast, but seamless requires application-layer resilience or more complex designs (often with the remote side involved).
2) Should I use “check-gateway=ping” on the default route?
Only if you accept that it detects “my gateway is alive,” not “the Internet works.”
For offices, recursive routing via a stable public target is typically a better failover trigger.
3) Is WireGuard easier than IPsec for failover?
Operationally, yes. WireGuard’s handshake model and minimalism tend to make behavior easier to reason about.
But you still must solve routing and NAT determinism; WireGuard won’t save you from sloppy policy routing.
4) Why does the VPN show established but users can’t reach remote subnets?
Because control plane success doesn’t guarantee data plane success. Common culprits: wrong NAT order, missing routes, asymmetric path, or MTU issues.
Packet capture to the peer and a route lookup usually reveals the truth.
5) Do I need two VPN tunnels (one per ISP)?
If the VPN is business-critical and you can coordinate both ends, two tunnels is the most robust approach.
For a typical office with tolerance for brief reconnection, a single tunnel that can re-establish over either WAN is often sufficient.
6) What about load balancing across both WANs and VPN failover?
It’s doable, but don’t stumble into it accidentally. You need consistent per-connection classification, matching NAT rules, and a plan for symmetric return.
If you’re not ready to test for days, keep one WAN primary.
7) How do I avoid flapping between ISPs?
Use hysteresis: multiple failures before switching, and delayed failback until stability is proven.
Also pick stable health targets and fix underlying packet loss. No script can outsmart a bad fiber splice.
8) How do I test failover safely during business hours?
Don’t pull cables. Disable the primary default route briefly, or lower its priority, and watch route logs and tunnel status.
Time-box the test, communicate a short disruption window, and have a rollback ready.
9) Why does WAN2 fail even though general browsing works?
VPN protocols are pickier than web traffic. WAN2 might block UDP/500/4500, might be CGNAT, or might have a smaller MTU that breaks encapsulation.
Test protocol reachability and consider WireGuard if IPsec is being bullied by the ISP path.
Next steps (the boring stuff that prevents tickets)
Build the simplest architecture that meets the business requirement, then make it deterministic:
explicit routing for VPN peers, NAT rules that match the chosen egress, and health checks that measure the right layer.
Keep failover stable, not twitchy. If you want faster convergence, invest in better detection—not more randomness.
Practical next steps you can do this week:
- Replace gateway pings with recursive reachability checks to public targets.
- Isolate VPN peer routing into a dedicated routing table and verify with packet capture.
- Audit NAT rule order and add no-NAT exceptions for VPN subnets at the top.
- Run a controlled failover test and record: time to detection, time to tunnel re-establish, and the one weird app that always complains first.
- Write a one-page runbook: what to check, what to clear, and how to roll back. Future-you will send a thank-you note in the form of fewer alerts.