MikroTik IPsec IKEv2 Between Offices: How to Make It Stable (and Why It Flaps)

Your inter-office VPN is “up” until it isn’t. Teams calls freeze, ERP transactions time out, and someone says “IPsec is unreliable” like it’s a law of physics.
Then you log into a MikroTik and see the tunnel renegotiating like it’s being paid per handshake.

IKEv2 on MikroTik can be rock-solid. But only if you treat it like a production service: control the packet path, align lifetimes and identities, and stop the network from
“helping” with NAT, MTU, and reordering. This is the field guide: why it flaps, how to catch it in the act, and what to change so it stays boring.

What “flapping” really is (and what it is not)

When people say an IPsec tunnel “flaps,” they usually mean one of three different things:

  • IKE SA churn: the IKEv2 control plane keeps re-establishing. You’ll see repeated “initiator” / “responder” negotiations, frequent deletes, or DPD timeouts.
  • Child SA churn: IKE stays up, but the ESP (data) SAs get rekeyed, deleted, or replaced too often, breaking long-lived flows.
  • Traffic blackholing: the tunnel is up, but traffic intermittently doesn’t match policy, routes change, NAT hits, or MTU breaks only certain apps.

Don’t treat them as the same problem. They have different signatures, different root causes, and different fixes.
“Tunnel is up” is not a success metric. “The business traffic stays boring” is.

Also: if your WAN link drops, your VPN drops. That’s not “flapping,” that’s physics. The trick is making sure the VPN doesn’t invent drops on a healthy WAN.

Interesting facts and history you can use at 2 a.m.

  1. IKEv2 was designed to reduce complexity versus IKEv1, especially around rekeying and NAT traversal, but implementations still differ in edge behavior.
  2. NAT traversal is older than many engineers’ careers: UDP encapsulation for ESP (NAT-T) became common in the early 2000s because NAT devices mangled IPsec.
  3. DPD exists because “idle” does not mean “healthy”. Middleboxes silently drop state; DPD is a controlled way to discover that before users do.
  4. Perfect Forward Secrecy (PFS) is not free: it adds CPU during rekey. On small routers, aggressive lifetimes can become self-inflicted outages.
  5. UDP 500/4500 is not inherently unreliable; it’s just the protocol most likely to be “optimized” by stateful firewalls, CGNAT, and ISP gear.
  6. Path MTU discovery has been “almost working” forever. ICMP filtering still breaks it, and VPN encapsulation makes the blast radius bigger.
  7. RouterOS has evolved IPsec significantly across major versions; behavior around proposals, identities, and hardware offload differs by platform and release.
  8. Rekey collisions are a classic: two peers rekey simultaneously, both think they “won,” and traffic drops until one cleans up. Modern stacks handle it better, but not perfectly.

A stable reference architecture for office-to-office IKEv2 on MikroTik

Pick one model: policy-based or route-based. Don’t run a hybrid by accident.

MikroTik supports policy-based IPsec (selectors in policies) and can do route-based designs using VTI-like approaches, most commonly a GRE or IPIP tunnel interface protected by IPsec, depending on RouterOS version and features.
The flapping stories I’ve lived usually come from “mostly policy-based, but we added a route here, and now half the subnets work on Tuesdays.”

For two offices, policy-based can be fine if you have stable subnets and a small number of selectors. It becomes fragile when:
you add more subnets, you do hairpin traffic, you have overlapping ranges, or you introduce multiple WANs.
At that point, prefer a route-based design where possible and keep routing explicit.

Identity and addressing: avoid “whatever the ISP gives us” designs

If at least one side has a dynamic public IP, you can still be stable, but you must be deliberate:
use FQDN identities, strict peer matching, and a clean responder configuration. If you can buy static IPs, do it. It’s cheaper than your downtime.
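
A minimal sketch of what FQDN identities can look like on Office A, written as a RouterOS fragment that the jump host saves to a local file for review (nothing is sent to a router). The peer name office-b matches the runbook output later in this guide; the example.net hostnames, the file name, and the secret placeholder are assumptions for illustration. Check the property names against your RouterOS release before importing.

```shell
# Write a RouterOS identity fragment for Office A to a local file for review.
# Hostnames, the file name and the secret are placeholders (assumptions).
cat > office-a-identity.rsc <<'EOF'
# Identify by FQDN on both sides so a changed public IP does not break peer matching.
/ip/ipsec/identity/add peer=office-b auth-method=pre-shared-key secret="CHANGE-ME" my-id=fqdn:office-a.example.net remote-id=fqdn:office-b.example.net
EOF
```

Office B would carry the mirrored identity, with my-id and remote-id swapped.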

Crypto and lifetimes: boring wins

The stable baseline I use unless a policy requires otherwise:

  • IKE (Phase 1): AES-256, SHA-256, DH group 14+ (or ECP groups if both sides handle them well)
  • ESP (Phase 2): AES-256-GCM if supported end-to-end; otherwise AES-256 + SHA-256
  • Lifetimes: not “as low as possible.” Aim for sane values (hours, not minutes), aligned on both ends.
  • PFS: enable if you can afford CPU; otherwise disable it and rely on strong IKE crypto plus a regular IKE SA rekey, which still runs a fresh DH exchange.

Your tunnel should not rekey so often that the router spends its day negotiating instead of forwarding packets.
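
The baseline above could be pinned roughly as follows. This is a sketch, not a recommendation for every site: the profile and proposal names (ike2-prof, esp-prof) are taken from the runbook output later in this guide, while the 8h and 1h lifetimes, the DPD values, and the file name are assumptions chosen to land in the "hours, not minutes" range. The fragment is only written to a local file on the jump host so it can be reviewed and applied identically on both routers.

```shell
# Write an explicit IKE profile and ESP proposal to a local file for review.
# Lifetimes and DPD values below are illustrative assumptions, not vendor defaults.
cat > ipsec-crypto.rsc <<'EOF'
# IKE (Phase 1): AES-256, SHA-256, DH group 14 (modp2048), lifetime in hours.
/ip/ipsec/profile/add name=ike2-prof enc-algorithm=aes-256 hash-algorithm=sha256 dh-group=modp2048 lifetime=8h dpd-interval=30s dpd-maximum-failures=5 nat-traversal=yes
# ESP (Phase 2): AES-256 + SHA-256 with PFS on the same DH group.
# If both ends handle GCM well, the alternative is enc-algorithms=aes-256-gcm with auth-algorithms=null.
/ip/ipsec/proposal/add name=esp-prof auth-algorithms=sha256 enc-algorithms=aes-256-cbc pfs-group=modp2048 lifetime=1h
EOF
```

Using one file for both sites is the point: the same explicit values on each end remove "defaults differ" as a variable.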

NAT and firewall rules: make the control plane boring

Allow UDP 500 and UDP 4500 in, and allow ESP (IP protocol 50) if you’re not using NAT-T (most are NAT-T). Make sure your NAT rules do not translate traffic that should be protected.
A stable tunnel often dies because a NAT rule is “helpfully” masquerading the office subnet on the way into the tunnel.
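
As a sketch of those rules for Office A, the fragment below accepts IKE and NAT-T from the peer and exempts tunnel traffic from source NAT. The peer address 198.51.100.20 comes from the runbook output later in this guide; the comments and the file name are assumptions. The add commands append to the end of each chain, so after importing, the rules still have to be moved above any drop or masquerade rule.

```shell
# Write control-plane accept rules and the NAT exemption to a local file for review.
cat > office-a-firewall.rsc <<'EOF'
# IKE (UDP 500) and NAT-T (UDP 4500) from the remote office only.
/ip/firewall/filter/add chain=input action=accept protocol=udp dst-port=500,4500 src-address=198.51.100.20 comment="IKE and NAT-T from office-b"
# Native ESP is only needed when NAT-T is not in use.
/ip/firewall/filter/add chain=input action=accept protocol=ipsec-esp src-address=198.51.100.20 comment="ESP from office-b (no NAT-T)"
# Do not NAT traffic that an IPsec policy will protect. Must sit above masquerade.
/ip/firewall/nat/add chain=srcnat action=accept ipsec-policy=out,ipsec comment="no NAT for tunnel traffic"
EOF
```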

MTU/MSS: treat it as a first-class dependency

Encapsulation overhead means your effective MTU is smaller. If you don’t plan for it, you get selective failure: web works, file copies crawl, RDP stutters,
and someone blames “the ISP.” It’s almost always MTU plus blocked ICMP.
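
The arithmetic is simple enough to keep in a helper on the jump host. This is a rough sketch for IPv4: the function names are hypothetical, and the 80-byte encapsulation overhead in the usage example is an assumed conservative figure, since the real overhead depends on cipher, mode, and whether NAT-T is in use.

```shell
# tunnel_mtu WAN_MTU OVERHEAD: inner MTU left after IPsec encapsulation overhead.
tunnel_mtu() { echo $(( $1 - $2 )); }

# mss_for_mtu MTU: TCP MSS for IPv4 (MTU minus 20-byte IP and 20-byte TCP headers).
mss_for_mtu() { echo $(( $1 - 40 )); }

# ping_payload_for_mtu MTU: ICMP payload size that exactly fills the MTU
# (MTU minus 20-byte IP and 8-byte ICMP headers), for "ping -M do -s SIZE".
ping_payload_for_mtu() { echo $(( $1 - 28 )); }

# Example: 1500-byte WAN, assumed 80 bytes of overhead.
inner=$(tunnel_mtu 1500 80)
echo "inner MTU: $inner, MSS: $(mss_for_mtu "$inner"), DF ping payload: $(ping_payload_for_mtu "$inner")"
```

Treat the result as a starting point for DF ping tests, not as a measured value.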

Why MikroTik IKEv2 flaps: the real failure modes

1) NAT state timeouts and port changes (especially behind CGNAT)

IKEv2 over NAT-T runs over UDP 4500. Many NAT devices time out UDP state aggressively, especially when “idle.”
When the mapping changes, the peer sends packets to a dead port and you get DPD failures or stalled SAs.
If you’re behind CGNAT, this gets worse: the NAT is not even yours to configure.

Fix: ensure NAT keepalives are enabled and DPD is sane. Don’t set DPD to “hammer mode” unless you enjoy self-induced renegotiation storms.

2) Rekey collisions and mismatched lifetimes

If both ends rekey at the same time (or if lifetimes don’t align well), you can end up with Child SAs being replaced in a way that interrupts flows.
The symptom is “every X minutes, connections reset,” often suspiciously regular.

Fix: align lifetimes, consider rekey margins, and avoid extremely short lifetimes. Measure CPU during rekey, too—on small MikroTiks it matters.

3) Proposal mismatch that only shows up on rekey

Initial negotiation may succeed with a shared subset of algorithms, but rekey may select a different proposal order or hit a corner case. Then it fails and falls back,
or it deletes and renegotiates. This looks like random instability, because it’s time-based.

Fix: explicitly set proposals on both sides. Don’t rely on “default.” Defaults change across RouterOS versions and hardware families.

4) MTU/fragmentation problems (IKE messages and data plane)

IKEv2 messages can be large, especially with certificates, multiple proposals, or vendor IDs.
If UDP fragmentation is blocked somewhere, the handshake can fail intermittently.
Then the data plane has the classic “big packets die” issue if PMTUD is broken.

Fix: keep crypto config lean, allow fragmentation where needed, and clamp MSS on TCP flows to avoid relying on PMTUD.

5) Routing or policy selector drift

Policy-based tunnels depend on selectors. Add a new subnet at one site and forget to add it at the other? The tunnel still comes up, but traffic to that subnet fails.
Worse: if you have overlapping policies, traffic can match the wrong one and trigger “unexpected” SA installs.

Fix: keep selectors symmetric, keep policy ordering intentional, and log drops on the forward chain so you can see what’s actually blocked.

6) Firewall rules that are almost correct

“We allow UDP 500 and 4500.” Great. Do you allow related/established? Do you allow IPsec policy traffic in the forward chain?
Did you put a fasttrack rule in front of IPsec exceptions?

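RouterOS can fasttrack traffic in ways that bypass IPsec processing if not exempted. That can create traffic blackholes that look like tunnel flaps.

Fix: accept IPsec policy traffic (ipsec-policy=in,ipsec and ipsec-policy=out,ipsec) in the forward chain above the fasttrack rule, or disable fasttrack.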
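
One way to express the exemption, as a sketch: the fragment below adds forward-chain accepts for IPsec policy traffic and asks RouterOS to place them before the fasttrack rule. It assumes exactly one fasttrack-connection rule exists in the filter table; with zero or several, the place-before lookup will not resolve to a single rule and the ordering has to be set by hand. The file name and comments are assumptions.

```shell
# Write forward-chain IPsec accepts, positioned ahead of fasttrack, to a local file.
# Assumes a single fasttrack-connection rule; verify rule order with print afterwards.
cat > office-a-fasttrack.rsc <<'EOF'
/ip/firewall/filter/add chain=forward action=accept ipsec-policy=in,ipsec comment="ipsec in, before fasttrack" place-before=[/ip/firewall/filter/find where action=fasttrack-connection]
/ip/firewall/filter/add chain=forward action=accept ipsec-policy=out,ipsec comment="ipsec out, before fasttrack" place-before=[/ip/firewall/filter/find where action=fasttrack-connection]
EOF
```

After importing, print the forward chain and confirm both accepts really sit above the fasttrack rule before trusting the result.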

7) Performance headroom: the tunnel is stable, the router is not

Under CPU pressure, keepalives and DPD probes get delayed. Then peers declare each other dead and renegotiate.
On paper the bandwidth is fine; in reality the router is busy encrypting, queuing, and doing firewall/NAT with no offload.

Fix: measure CPU during peak, verify hardware acceleration support, simplify rules, and stop running rekeys every few minutes.

Joke #1: IPsec isn’t flaky; it’s just very honest about every small lie your network tells it.

Fast diagnosis playbook (first/second/third/fourth)

When a tunnel “flaps,” don’t start by changing crypto. Start by proving which layer is unstable.
Here’s the fast triage order that saves time and avoids superstition.

First: prove the WAN is stable enough for UDP 4500

  • Check interface errors/drops, link up/down events.
  • Check if public IP changes (dynamic IP, PPPoE reconnects, LTE failover).
  • Check for packet loss and jitter to the peer’s public IP over time.

Second: prove IKE control plane health

  • Watch IKE SA age, rekey events, DPD timeouts.
  • Look for repeated “no proposal chosen,” “auth failed,” “peer not responding.”
  • Confirm NAT-T detection and consistent ports.

Third: prove data plane correctness

  • Verify policies/selectors match the actual traffic.
  • Confirm routing sends the right subnets into IPsec.
  • Validate MTU/MSS with real payload tests (don’t trust ping defaults).
  • Look for fasttrack/NAT interfering.

Fourth: only then tune crypto and lifetimes

If the WAN and policies are clean, then tune proposals, PFS, lifetimes, and rekey margins.
Doing this earlier is how people end up with “working sometimes” configs.

Practical tasks: commands, outputs, and decisions

These tasks are written like an on-call runbook: run a command, interpret the output, then make a decision.
Commands are shown as if executed from a Linux jump host that can SSH into both MikroTiks. Adapt hostnames.

Task 1: Confirm RouterOS version and platform (behavior varies)

cr0x@server:~$ ssh admin@mtk-office-a '/system/resource/print'
                   uptime: 2w1d3h
                  version: 7.15.3 (stable)
               build-time: 2025-08-01 10:12:45
              free-memory: 312.4MiB
             total-memory: 512.0MiB
                      cpu: ARM
                cpu-count: 4
            cpu-frequency: 1400MHz
                 cpu-load: 18%

What it means: The RouterOS major/minor version affects IPsec stack details and defaults. CPU and memory tell you if you’re operating near the edge.

Decision: If versions differ widely between sites, align them. If CPU load spikes during drops, you have a capacity problem, not a crypto problem.

Task 2: Check link stability and WAN IP churn

cr0x@server:~$ ssh admin@mtk-office-a '/ip/address/print where interface~"wan"'
 #   ADDRESS            NETWORK         INTERFACE
 0   203.0.113.10/24    203.0.113.0     wan1
cr0x@server:~$ ssh admin@mtk-office-a '/log/print where message~"link down|link up|pppoe"'
 0 time=2025-12-27 10:12:01 topics=interface,info message=wan1 link down
 1 time=2025-12-27 10:12:05 topics=interface,info message=wan1 link up (speed 1G, full duplex)

What it means: If the WAN interface is bouncing, IPsec is collateral damage. The tunnel isn’t “flapping”; the ISP is.

Decision: Fix physical link/ISP first. If the IP changes frequently, prefer FQDN identity or a DDNS workflow and verify peer matching.

Task 3: Check IPsec peers and see if NAT-T is used

cr0x@server:~$ ssh admin@mtk-office-a '/ip/ipsec/peer/print detail'
 0 name="office-b" address=198.51.100.20/32 local-address=203.0.113.10
   exchange-mode=ike2 send-initial-contact=yes nat-traversal=yes
   dpd-interval=10s dpd-maximum-failures=3 profile=ike2-prof

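What it means: NAT traversal enabled is typically correct. DPD interval and failure count define how fast you declare the peer dead. On recent RouterOS releases these settings belong to the profile the peer references, so also check /ip/ipsec/profile/print.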

Decision: If you see nat-traversal=no while one side is behind NAT, fix it. If DPD is too aggressive for your WAN, relax it.

Task 4: Inspect active SAs and look for churn timing

cr0x@server:~$ ssh admin@mtk-office-a '/ip/ipsec/active-peers/print detail'
 0 ike2=yes name="office-b" state=established
   local-address=203.0.113.10 remote-address=198.51.100.20
   side=initiator nat-traversal=yes
   uptime=00:43:12 last-dpd=00:00:07
cr0x@server:~$ ssh admin@mtk-office-a '/ip/ipsec/installed-sa/print detail'
 0 spi=0xC1A2B3C4 src-address=203.0.113.10 dst-address=198.51.100.20
   state=mature auth-algorithm=sha256 enc-algorithm=aes-256-gcm
   lifetime=00:58:21 expires-in=00:01:39

What it means: If SAs expire and re-install constantly, you likely have lifetimes too short or rekey instability.

Decision: If rekeys happen more frequently than hourly for a basic office-to-office tunnel, revisit lifetimes and PFS.

Task 5: Enable focused IPsec debug logging briefly (then turn it off)

cr0x@server:~$ ssh admin@mtk-office-a '/system/logging/add topics=ipsec,debug action=memory'
cr0x@server:~$ ssh admin@mtk-office-a '/log/print where topics~"ipsec"'
 0 time=2025-12-28 09:14:02 topics=ipsec,debug message=initiating IKE_SA to 198.51.100.20
 1 time=2025-12-28 09:14:02 topics=ipsec,debug message=IKE_SA established
 2 time=2025-12-28 09:15:33 topics=ipsec,debug message=DPD timeout, peer not responding
 3 time=2025-12-28 09:15:33 topics=ipsec,debug message=deleting IKE_SA

What it means: The log shows the control-plane cause: here it’s DPD timeouts, not proposal mismatches.

Decision: If DPD timeouts occur while the WAN is stable, suspect NAT state expiry, upstream filtering, or CPU delay.
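
To make that classification repeatable, a saved log can be summarized by failure signature. The sketch below is a hypothetical helper, not a RouterOS feature: it counts lines by the message strings used in this guide's sample output and in the playbook, and real RouterOS log wording may differ by version, so adjust the patterns to what your routers actually print.

```shell
# classify_flaps: read IPsec log text on stdin and count events by signature.
# Patterns are assumptions based on the sample messages shown in this guide.
classify_flaps() {
  awk '
    /IKE_SA established/                { est++ }
    /DPD timeout|peer not responding/   { dpd++ }
    /no proposal chosen/                { prop++ }
    /auth failed/                       { auth++ }
    END { printf "established=%d dpd=%d proposal=%d auth=%d\n", est, dpd, prop, auth }
  '
}

# Usage from the jump host (hostname as in the tasks above):
#   ssh admin@mtk-office-a '/log/print where topics~"ipsec"' | classify_flaps
```

Many establishments with a matching DPD count point at the path or NAT state; proposal or auth counts point at configuration.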

Task 6: Validate firewall allows IKEv2 and NAT-T to the router

cr0x@server:~$ ssh admin@mtk-office-a '/ip/firewall/filter/print where chain="input"'
 0 chain=input action=accept connection-state=established,related
 1 chain=input action=accept protocol=udp dst-port=500,4500
 2 chain=input action=drop in-interface=wan1

What it means: UDP 500/4500 allowed before the drop. Good. If those rules are after a drop, you’ve built a VPN-shaped paperweight.

Decision: Ensure the accept rules are above drops. If you use address-lists, restrict to peer IPs for hygiene.

Task 7: Detect fasttrack interfering with IPsec

cr0x@server:~$ ssh admin@mtk-office-a '/ip/firewall/filter/print where action~"fasttrack"'
 0 chain=forward action=fasttrack-connection connection-state=established,related

What it means: A generic fasttrack rule can bypass IPsec processing unless you exempt IPsec flows.

Decision: Add an exception for IPsec policy traffic before fasttrack, or disable fasttrack if you can afford it.
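
Rule order is easy to misread by eye, so a small check can help. The helper below is a hypothetical sketch that assumes one rule per line in evaluation order, in the simplified key=value form shown in this guide; real print output carries flags and comments, so it may need a terse or filtered print to parse cleanly.

```shell
# fasttrack_order_ok: read forward-chain rules on stdin (one per line, in order).
# Exits 0 when there is no fasttrack rule or both IPsec accepts precede it.
fasttrack_order_ok() {
  awk '
    /action=fasttrack-connection/ && !ft                     { ft = NR }
    /action=accept/ && /ipsec-policy=in,ipsec/ && !pin       { pin = NR }
    /action=accept/ && /ipsec-policy=out,ipsec/ && !pout     { pout = NR }
    END {
      if (!ft) { print "ok: no fasttrack rule"; exit 0 }
      if (pin && pout && pin < ft && pout < ft) {
        print "ok: IPsec accepts precede fasttrack"; exit 0
      }
      print "risk: fasttrack is evaluated before an IPsec accept rule"; exit 1
    }
  '
}

# Usage:
#   ssh admin@mtk-office-a '/ip/firewall/filter/print where chain="forward"' | fasttrack_order_ok
```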

Task 8: Confirm NAT rules don’t NAT protected subnets

cr0x@server:~$ ssh admin@mtk-office-a '/ip/firewall/nat/print'
 0 chain=srcnat action=masquerade out-interface=wan1
 1 chain=srcnat action=accept ipsec-policy=out,ipsec
 2 chain=srcnat action=masquerade src-address=10.10.0.0/16 out-interface=wan1

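What it means: Rule 1 is the critical “do not NAT IPsec” exception, but in this output it sits below the catch-all masquerade in rule 0. Rules are evaluated top to bottom, so tunnel-bound traffic leaving wan1 is masqueraded before the exception is reached and then no longer matches the selectors.

Decision: Ensure the accept rule exists and sits above every masquerade rule. If you see masquerade hitting VPN traffic, fix the ordering.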

Task 9: Verify policy selectors match both directions

cr0x@server:~$ ssh admin@mtk-office-a '/ip/ipsec/policy/print detail'
 0 src-address=10.10.10.0/24 dst-address=10.20.10.0/24 action=encrypt
   tunnel=yes peer=office-b proposal=esp-prof
cr0x@server:~$ ssh admin@mtk-office-b '/ip/ipsec/policy/print detail'
 0 src-address=10.20.10.0/24 dst-address=10.10.10.0/24 action=encrypt
   tunnel=yes peer=office-a proposal=esp-prof

What it means: Symmetry matters. If Office B has 10.10.0.0/16 while Office A uses 10.10.10.0/24, you’ll get partial reachability and confusing SA installs.

Decision: Normalize selectors. If you need many subnets, consider consolidating selectors carefully or move to a route-based approach to reduce policy sprawl.
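
Selector symmetry can be checked mechanically once each site's selectors are written down as "src dst" pairs, one per line. The helper below is a hypothetical sketch; producing the two input files from the policy print output is left to you (by hand or with your own parser), and the file names are assumptions.

```shell
# check_symmetry A_FILE B_FILE
# Each file holds "src-address dst-address" pairs, one per line.
# Prints every selector of site A that has no mirrored (dst src) entry on site B.
# Exit status is 1 if any selector is unmatched, 0 otherwise.
check_symmetry() {
  awk '
    NR == FNR { mirrored[$2 " " $1] = 1; next }    # first file: site B, stored reversed
    !(($1 " " $2) in mirrored) { print "no mirror on B for: " $0; bad = 1 }
    END { exit bad }
  ' "$2" "$1"
}

# Usage: run it both ways to catch extras on either side.
#   check_symmetry office-a.selectors office-b.selectors
#   check_symmetry office-b.selectors office-a.selectors
```

With the example from this task, 10.10.10.0/24 on A against 10.10.0.0/16 on B is reported as unmatched, which is exactly the partial-reachability case described above.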

Task 10: Observe drops and policy matches using counters

cr0x@server:~$ ssh admin@mtk-office-a '/ip/firewall/filter/print stats where chain="forward"'
 0 chain=forward action=accept ipsec-policy=in,ipsec packet-count=124992 byte-count=98311234
 1 chain=forward action=accept ipsec-policy=out,ipsec packet-count=121887 byte-count=96733122
 2 chain=forward action=drop in-interface=wan1 packet-count=22 byte-count=1452

What it means: IPsec policy traffic is being accepted and counted. If counters stay at zero while users complain, traffic isn’t matching policy (routing, NAT, or selectors).

Decision: If the IPsec accept counters don’t increment, stop tweaking IPsec and fix routing/NAT/addresses.

Task 11: MTU reality check with a “do not fragment” ping

cr0x@server:~$ ping -M do -s 1472 10.20.10.10 -c 3
PING 10.20.10.10 (10.20.10.10) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1460
ping: local error: message too long, mtu=1460
ping: local error: message too long, mtu=1460

--- 10.20.10.10 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss

What it means: The path MTU is smaller than expected. Over VPN, that’s normal—but you need to accommodate it for TCP.

Decision: Clamp TCP MSS on the MikroTik for traffic going into the tunnel, or adjust interface MTUs if you control the path.
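
A clamp for tunnel traffic could look like the sketch below, again written to a local file for review. The value 1360 is an assumption taken from the low end of the commonly used 1360–1400 range; replace it with what your own DF ping tests support. The rules only rewrite SYN packets whose advertised MSS is larger than the clamp.

```shell
# Write MSS clamp rules for traffic entering and leaving the tunnel to a local file.
# new-mss=1360 is an assumed starting value; measure before settling on it.
cat > office-a-mss.rsc <<'EOF'
/ip/firewall/mangle/add chain=forward action=change-mss new-mss=1360 protocol=tcp tcp-flags=syn tcp-mss=1361-65535 ipsec-policy=out,ipsec passthrough=yes comment="clamp MSS into tunnel"
/ip/firewall/mangle/add chain=forward action=change-mss new-mss=1360 protocol=tcp tcp-flags=syn tcp-mss=1361-65535 ipsec-policy=in,ipsec passthrough=yes comment="clamp MSS out of tunnel"
EOF
```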

Task 12: Check for IKE fragmentation or large handshake issues (symptom-driven)

cr0x@server:~$ ssh admin@mtk-office-a '/log/print where message~"fragment|too large|invalid"'
 0 time=2025-12-28 08:01:19 topics=ipsec,debug message=received packet too large, dropping
 1 time=2025-12-28 08:01:19 topics=ipsec,debug message=peer not responding

What it means: Some networks drop fragmented UDP. Large IKE messages can die in transit, causing intermittent negotiation failures.

Decision: Reduce proposal list bloat, consider using PSK vs certificate chain size where appropriate, and ensure upstream devices aren’t blocking fragments.

Task 13: Measure CPU and interrupt pressure during renegotiation

cr0x@server:~$ ssh admin@mtk-office-a '/tool/profile'
name                    cpu        memory
firewall                22%        0 B
ipsec                   31%        0 B
ethernet                6%         0 B
management              2%         0 B

What it means: IPsec consuming a third of CPU is not automatically bad, but if it spikes to 90% during peak, DPD and rekey timers get sloppy.

Decision: If IPsec is a top CPU consumer and you see flaps, increase lifetimes, reduce PFS cost, enable hardware offload if supported, or upgrade hardware.

Task 14: Verify routing to remote subnets (don’t assume)

cr0x@server:~$ ssh admin@mtk-office-a '/ip/route/print where dst-address=10.20.10.0/24'
 0 dst-address=10.20.10.0/24 gateway=ipsec policy distance=1 scope=30 target-scope=10

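What it means: Depending on design, you may see policy-driven behavior without explicit routes, or you may rely on routes. With policy-based IPsec, RouterOS applies policies after the routing decision, so traffic only needs some usable route (often the default route) before a policy can match; route-based designs need explicit routes via the tunnel interface. What you can’t rely on is vibes.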

Decision: If routes point somewhere else (default route, wrong gateway, wrong VRF), the tunnel can be “up” while traffic goes out to the internet and dies.

Task 15: Capture packets to prove if you’re losing IKE or ESP

cr0x@server:~$ ssh admin@mtk-office-a '/tool/sniffer/quick interface=wan1 ip-protocol=udp port=500,4500'
 0  2025-12-28 09:21:11 203.0.113.10:4500 -> 198.51.100.20:4500 UDP  92
 1  2025-12-28 09:21:21 203.0.113.10:4500 -> 198.51.100.20:4500 UDP  92
 2  2025-12-28 09:21:31 203.0.113.10:4500 -> 198.51.100.20:4500 UDP  92

What it means: You see outbound keepalives/DPD packets, but do you see replies? If not, it’s upstream filtering, NAT state, or the far end is dead.

Decision: If outbound exists but inbound doesn’t, stop reconfiguring the local router. Prove the path or the remote side is dropping.
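
Counting packets per direction turns "do you see replies?" into a number you can paste into an escalation. The helper below is a hypothetical sketch that assumes capture lines in the "src:port -> dst:port" form shown in this task; the real sniffer output layout differs between RouterOS versions, so the parsing may need adjusting.

```shell
# count_directions LOCAL_IP
# Reads capture lines on stdin and counts packets sent from and received by LOCAL_IP.
count_directions() {
  awk -v me="$1" '
    {
      for (i = 2; i < NF; i++) if ($i == "->") {
        src = $(i - 1); dst = $(i + 1)
        sub(/:[0-9]+$/, "", src)      # strip the port
        sub(/:[0-9]+$/, "", dst)
        if (src == me) out++
        else if (dst == me) inb++
      }
    }
    END { printf "out=%d in=%d\n", out, inb }
  '
}

# Usage: save the capture to a file first, then:
#   count_directions 203.0.113.10 < capture.txt
```

A result like out=3 in=0 over a window that should contain replies is the "we send, we don't receive" evidence.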

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company had two offices and a “simple” IKEv2 tunnel. It worked for months, then started dropping every afternoon. The helpdesk blamed the ISP,
the network team blamed IPsec, and the security team blamed “someone changing crypto.”

The wrong assumption was subtle: they assumed that “no traffic” means “no problems.” In reality, the tunnel was mostly idle until staff began using a new SaaS
app that triggered periodic bursts plus long idle windows. The ISP’s NAT device had an aggressive UDP timeout. The NAT mapping for UDP 4500 would expire during idle,
then the next packet would go out with a new source port. The far end kept sending to the old mapping until DPD declared the peer dead.

The logs were telling on repeat: DPD timeout, delete IKE SA, renegotiate. But everyone looked at it only during the moment it was “back up,” saw SAs established,
and moved on.

Fix was boring: enable regular NAT keepalives (and stop disabling them “to reduce noise”), loosen DPD to tolerate brief jitter, and set lifetimes so rekey wasn’t
happening right when the office did its daily backup traffic. The tunnel became invisible again, which is the best kind of VPN.

Mini-story 2: The optimization that backfired

Another shop had a RouterOS firewall with fasttrack enabled. They were proud of it. Throughput was great, CPU was low, graphs were pretty.
Then they migrated from IKEv1 to IKEv2 and suddenly had intermittent one-way traffic. The tunnel never “went down,” but file transfers would hang and VoIP would
go robotic for 10–30 seconds at random.

The “optimization” was fasttrack on established/related connections without an IPsec exception. Some flows were being fasttracked in ways that bypassed the policy
checks needed for encryption/decryption. In effect, some traffic took the fast lane straight into a wall.

They tried to fix it by changing crypto. Then by changing lifetimes. Then by changing hardware. Classic. None of that helps if the firewall is skipping the part where
it decides traffic belongs to IPsec.

The fix was surgical: insert an accept rule for ipsec-policy=in,ipsec and ipsec-policy=out,ipsec ahead of fasttrack, and ensure NAT rules had
a top-level “accept if ipsec-policy=out,ipsec.” Throughput dipped slightly. Stability improved dramatically. Everyone pretended this was the plan all along.

Mini-story 3: The boring but correct practice that saved the day

A distributed company had three small offices and one “hub” site. Each office used a different ISP, and one office was on LTE as a backup.
VPN issues were expected; nobody expected them to be diagnosable quickly.

The boring practice: they had a standardized runbook and consistent naming. Peers were named by site, proposals were identical across routers, lifetimes were aligned,
and they logged IPsec events to a central syslog with just enough debug to be useful without being a firehose.

One day, an office started flapping after an ISP maintenance. The on-call engineer didn’t guess. They checked the WAN logs, saw no link changes. They checked IPsec logs,
saw DPD timeouts at a new pattern. They ran a sniffer and confirmed outbound UDP 4500 with no inbound replies. The conclusion was immediate: upstream path problem, not
MikroTik misconfig.

They escalated with evidence: timestamps, packet capture summaries, and clear “we send, we don’t receive.” The ISP fixed a broken stateful rule in their edge. VPN was stable.
Nobody had to roll back firmware at midnight. Nobody had to “just reboot it.” Boring won.

Common mistakes: symptom → root cause → fix

1) Tunnel renegotiates every few minutes like clockwork

  • Symptom: predictable drops every 5–30 minutes; logs show rekey/delete cycles.
  • Root cause: very short lifetimes, rekey collisions, or CPU spikes during rekey.
  • Fix: increase IKE/ESP lifetimes, align both sides, reduce PFS cost if needed, and ensure proposal sets are explicit and identical.

2) “Peer not responding” but WAN looks fine

  • Symptom: DPD timeouts; tunnel re-establishes quickly; users see brief outages.
  • Root cause: NAT mapping expiration or upstream dropping UDP 4500 intermittently.
  • Fix: enable keepalives, tune DPD to match WAN reality, avoid double-NAT when possible, and prove packet loss with sniffer.

3) Tunnel is up, but some subnets work and others don’t

  • Symptom: one VLAN can reach remote office; another cannot; SAs exist.
  • Root cause: selector mismatch, missing policies, or NAT translating one subnet.
  • Fix: make selectors symmetric; add missing policies; add ipsec-policy=out,ipsec NAT accept rule above masquerade.

4) Web works, large transfers fail, RDP is flaky

  • Symptom: small packets succeed; big packets stall; intermittent timeouts.
  • Root cause: MTU/PMTUD failure due to encapsulation overhead and blocked ICMP fragmentation-needed messages.
  • Fix: clamp TCP MSS on VPN traffic; allow necessary ICMP types; test with DF pings and real payload sizes.

5) Works until you enable fasttrack, then “random” breakage

  • Symptom: tunnel stays established, but data path becomes intermittent, often one-way.
  • Root cause: fasttrack bypassing processing needed for IPsec policy traffic.
  • Fix: add explicit accept rules for IPsec policy traffic before fasttrack or disable fasttrack.

6) Negotiation fails only after a firmware upgrade

  • Symptom: “no proposal chosen,” “auth failed,” or new behavior around lifetimes.
  • Root cause: defaults changed, algorithm ordering changed, or deprecated ciphers removed.
  • Fix: pin proposals and profiles explicitly; keep versions aligned; test rekey behavior before rolling to production.

7) Only one direction passes traffic

  • Symptom: Office A can reach B; B cannot reach A, or vice versa.
  • Root cause: asymmetric selectors, routing asymmetry, or firewall forward-chain rules missing IPsec policy accepts.
  • Fix: ensure both directions have matching policies; verify routes; add forward accept rules for in/out IPsec policy.

Joke #2: The fastest way to “fix” a VPN is to declare it “working as designed” and go to lunch—right until the CFO tries to print.

Checklists / step-by-step plan for stability

Step 1: Standardize and document the tunnel contract

  • List local/remote subnets, including future growth.
  • Pick policy-based or route-based intentionally.
  • Pick crypto suite and lifetimes; write them down.
  • Name objects consistently: peers, identities, proposals, policies.

Step 2: Make the control plane resilient

  • Allow UDP 500 and 4500 to the router from the peer.
  • Enable NAT-T unless you can guarantee no NAT in path.
  • Set DPD so it detects real failure without panicking at jitter.
  • Align IKE and ESP lifetimes; avoid “tiny” lifetimes.

Step 3: Make the data plane explicit

  • Add forward-chain rules to accept ipsec-policy=in,ipsec and ipsec-policy=out,ipsec.
  • Add NAT exemption: chain=srcnat action=accept ipsec-policy=out,ipsec above any masquerade.
  • Verify policy selectors match exactly on both sides.
  • Prefer non-overlapping RFC1918 ranges between sites.

Step 4: Fix MTU/MSS before users find it for you

  • Test PMTU with DF ping payloads.
  • Clamp TCP MSS for traffic entering the tunnel (common value: 1360–1400; measure, don’t guess).
  • Allow ICMP fragmentation-needed messages if your security posture permits it.

Step 5: Monitor what matters

  • Log IPsec events (not full debug forever) and keep timestamps.
  • Graph CPU, interface drops/errors, and IPsec SA counts over time.
  • Alert on frequent re-establishments, not just “tunnel down.”

Step 6: Operational discipline for change

  • Upgrade RouterOS in a controlled window; keep both ends compatible.
  • Change one variable at a time: DPD, lifetimes, proposals, NAT rules.
  • After any change, force a rekey test and validate traffic for all subnets.

A reliability principle worth stealing

Paraphrased idea: John Ousterhout argued that complexity is the enemy of reliability; simpler systems fail in fewer surprising ways.

FAQ

1) Should I use IKEv2 or IKEv1 on MikroTik for site-to-site?

Use IKEv2 unless you have a legacy peer that forces IKEv1. IKEv2 generally behaves better with NAT traversal and rekey logic.
Stability still depends more on your network path and configuration discipline than the version label.

2) What DPD settings should I use to avoid flapping?

Start conservative: DPD interval around 10–30 seconds and maximum failures around 3–5, then adjust based on WAN behavior.
If you set it too aggressive on a jittery link, you’re telling the router to rage-quit the relationship on minor delays.
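
As a rough way to reason about those numbers, the time before a silent peer is declared dead is about the interval multiplied by the allowed failures. The helper below is a hypothetical sketch of that simplified model; it assumes each failed probe costs one full interval, and actual RouterOS retransmission timing may differ.

```shell
# dpd_detect_seconds INTERVAL_S MAX_FAILURES
# Approximate seconds until a silent peer is declared dead (simplified model).
dpd_detect_seconds() { echo $(( $1 * $2 )); }

# Compare an aggressive and a relaxed setting.
echo "10s x 3: about $(dpd_detect_seconds 10 3)s"
echo "30s x 5: about $(dpd_detect_seconds 30 5)s"
```

If the WAN regularly has loss or delay bursts longer than the smaller figure, that setting will produce renegotiations on a link that would have recovered on its own.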

3) My tunnel is established but traffic doesn’t pass. Where do I look first?

Check NAT exemptions and firewall forward-chain rules for IPsec policy traffic. Then check selectors/policies match both sides.
After that, check routing and overlapping subnets.

4) Do I need to allow ESP (protocol 50) on the firewall?

If you’re using NAT-T (UDP 4500), ESP is encapsulated in UDP and you usually only need UDP 500/4500.
If NAT-T is disabled and there is no NAT in path, then yes, ESP must be allowed.

5) How do I know if MTU is my problem?

If small pings work but large transfers stall, suspect MTU. Prove it with DF pings and adjust TCP MSS.
Also check whether ICMP fragmentation-needed is blocked anywhere between sites.

6) Should I enable PFS?

If you have CPU headroom and compliance/security requirements, enable it. If your MikroTik is small and already busy,
PFS plus short lifetimes can cause rekey storms. Security is a system property; stability matters too.

7) Why does it flap more during peak hours?

Peak hours correlate with CPU load, queue pressure, and WAN congestion. If IPsec timers (DPD, rekey) get delayed, peers declare failure.
Check /tool/profile, interface drops, and whether your router is doing too much (NAT, firewall, queues) alongside encryption.

8) Can fasttrack stay enabled with IPsec?

Sometimes, but you must add exceptions so IPsec policy traffic is not fasttracked in a way that bypasses processing.
If you can’t reason about your rule order under stress, disable fasttrack and buy back performance with better hardware.

9) Is “send-initial-contact=yes” good or bad?

Usually good for cleaning up stale SAs on the responder when the initiator reconnects (especially after IP change).
In multi-peer or HA scenarios it can delete SAs you didn’t mean to touch. Use it when you understand who is allowed to connect as that identity.

10) What’s the simplest way to reduce rekey-related drops?

Align lifetimes, keep the proposal set small and explicit, and don’t rekey too frequently. Then confirm CPU headroom during rekey.
Rekey should be a non-event, not an outage drill.

Conclusion: next steps that actually reduce flaps

Stable IKEv2 between offices is not magic. It’s a chain of boring truths:
the WAN must be stable enough, NAT must keep state, firewall/NAT rules must respect IPsec, selectors must match, and MTU must be handled proactively.
If you get those right, crypto becomes a policy choice instead of a troubleshooting roulette wheel.

Practical next steps:

  1. Run the fast diagnosis playbook and classify the problem: WAN, control plane, or data plane.
  2. Pin proposals and lifetimes explicitly on both ends; stop relying on defaults.
  3. Add and verify NAT exemption rules and IPsec forward accept rules; check counters.
  4. Test MTU with DF pings and clamp MSS so applications stop discovering MTU the hard way.
  5. Measure CPU during peak and during rekey; if you’re near the edge, tune lifetimes or upgrade hardware.
  6. Keep logs just long enough to catch flaps, and keep names consistent so your future self can parse the situation quickly.

Your goal is not a tunnel that can connect. Your goal is a tunnel that nobody thinks about. That’s the production-grade definition of “stable.”
