Site-to-site VPNs fail in boring ways. The handshake is “up” and everyone relaxes, then the CFO can’t reach the file server, printers go dark, and someone swears “it worked yesterday.”
Most of the time, it’s not WireGuard’s fault. It’s your IP plan, your NAT rules, your routes, or your MTU—quietly waiting to embarrass you at 9:07 AM.
This is a production-minded template for two offices on MikroTik RouterOS using WireGuard. It’s opinionated: routed, minimal, debuggable, and designed to survive real corporate networks.
Copy it, adjust a few variables, and keep the troubleshooting sections close when the tickets start.
Design goals (what we’re optimizing for)
A two-office VPN can be “working” and still be an operational mess. The goal here is not to show off cleverness.
The goal is to be able to answer, quickly and with receipts, these questions:
- Is the tunnel up? (handshake + bytes increasing, not vibes)
- Is the path correct? (routes and NAT bypass, not accidental hairpins)
- Is performance stable? (MTU/MSS, queues, fasttrack interactions)
- Can a junior admin maintain it? (clean naming, minimal rules)
- Can we expand to more sites later? (address plan and peer layout)
This template uses a routed model (distinct LAN subnets at each office). No bridging. No L2 extension.
Bridging is how you import broadcast storms and mystery Windows elections into your life. You already have enough excitement.
Also: keep the tunnel IP space separate from LAN space. If you reuse a LAN subnet inside the tunnel “because it’s available,”
you’re building a future outage. It’s just delayed gratification, but for pain.
A few interesting facts (WireGuard + MikroTik context)
Some short context points that actually help you make better decisions:
- WireGuard is intentionally small. It was designed with a tiny codebase compared to legacy VPN stacks, aiming to reduce attack surface and audit complexity.
- It uses modern cryptography by default. No cipher-suite negotiation circus; the protocol chooses a narrow set (like ChaCha20-Poly1305) to avoid downgrade games.
- “Handshake up” doesn’t mean “traffic flows.” WireGuard can establish a handshake even when routing/NAT/firewall prevents payload packets from going anywhere useful.
- WireGuard is stateless in a specific way. It doesn’t build a traditional “session” like some VPNs; peers are identified by public keys, and endpoints can roam.
- PersistentKeepalive exists for NAT reality. If one side is behind NAT, periodic packets keep the mapping alive so inbound traffic doesn’t die after idle time.
- MikroTik added WireGuard relatively late. RouterOS support landed years after WireGuard became popular on Linux, which is why older RouterOS fleets often show a mix of legacy VPNs and WG.
- WireGuard’s simplicity shifts complexity to routing. That’s a feature. You want routing to be explicit, inspectable, and logged.
- MTU issues became more visible with modern web apps. Bigger TLS records, HTTP/2, and “everything is encapsulated” stacks make fragmentation and blackhole MTUs show up as weird app failures.
One quote, because operations people deserve poetry too:
Everything fails, all the time.
— Werner Vogels
Not cynical. Practical. We build systems that fail predictably and recover cleanly.
Network plan: subnets, tunnel IPs, and DNS
Office layout (example)
- Office A LAN: 10.10.10.0/24 (router: 10.10.10.1)
- Office B LAN: 10.20.20.0/24 (router: 10.20.20.1)
- WireGuard tunnel network: 10.99.0.0/30
- Office A wg IP: 10.99.0.1/30
- Office B wg IP: 10.99.0.2/30
- WG UDP port: 13231/udp (pick one port, document it)
Why /30 for the tunnel?
Because it’s a point-to-point link. You need two IPs. A /30 is boring, tight, and prevents you from later “helpfully” adding random hosts onto the tunnel network.
If you plan for 10+ sites, use /24 and allocate /32s per site—but for two offices, don’t overbuild.
DNS strategy (choose one, deliberately)
Site-to-site routing is half the battle. Naming is the other half. You have three sane options:
- Each site resolves local names locally. If Office A needs Office B’s resources, use FQDNs that resolve to 10.20.20.0/24 and vice versa.
- Central DNS at one site. Point both sites at one authoritative DNS (reachable via the tunnel). This is common in corporate setups but makes DNS availability dependent on the tunnel.
- Split-horizon DNS. Best for hybrid networks, most annoying to maintain. Worth it if you have overlapping services or cloud/private split needs.
My operational bias: keep DNS local where possible, and explicitly forward only what’s needed across the tunnel. It reduces “VPN is down” pages that are actually “DNS is down.”
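If you go with “local DNS plus targeted forwarding,” RouterOS 7.6+ can do conditional forwarding with static FWD records. Here is a minimal sketch for Office A, assuming a hypothetical internal zone corp.office-b.example served by a DNS server at 10.20.20.10; verify the FWD record type exists on your RouterOS version before relying on it.
cr0x@server:~$ cat dns-forward-example.rsc
# Office A: forward only the Office B internal zone across the tunnel.
# Assumptions: RouterOS 7.6+ (static FWD entries); zone name and server IP are examples.
/ip/dns/static/add type=FWD name=corp.office-b.example match-subdomain=yes \
forward-to=10.20.20.10 comment="S2S-WG: conditional forward to OfficeB DNS"
# Everything else keeps resolving via whatever local/upstream resolvers you already use.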
Clean template config: Office A and Office B
The template is designed for RouterOS v7+. If you are on older RouterOS, upgrade. If you can’t upgrade, you can’t have nice things.
(That’s not a joke. That’s a maintenance policy.)
Naming conventions that save you later
- Interface name: wg-s2s-officeB (tell future you what it is)
- Address list names: lan-officeA, lan-officeB
- Firewall comment prefix: S2S-WG: so you can filter fast
Office A (public IP, listens)
Assumptions:
Office A’s router has a public WAN IP (or a stable port-forward to it). Office B may be behind NAT.
We’ll make Office A listen on UDP 13231 and Office B initiate with keepalive.
cr0x@server:~$ cat office-a-routeros.rsc
# Office A: MikroTik RouterOS v7+ WireGuard site-to-site template
# Variables you must set:
# - Replace PUBLIC_WAN_INTERFACE if needed (example: ether1)
# - Set correct LAN interface/list and subnets
# - Paste actual keys (do not reuse this placeholder layout)
# 1) WireGuard interface
/interface/wireguard/add name=wg-s2s-officeB listen-port=13231 mtu=1420 comment="S2S-WG: OfficeA<->OfficeB"
# 2) WireGuard tunnel IP
/ip/address/add address=10.99.0.1/30 interface=wg-s2s-officeB comment="S2S-WG: tunnel IP OfficeA"
# 3) Peer (Office B)
/interface/wireguard/peers/add interface=wg-s2s-officeB public-key="OFFICE_B_PUBLIC_KEY" \
allowed-address=10.99.0.2/32,10.20.20.0/24 \
comment="S2S-WG: peer OfficeB"
# 4) Route to Office B LAN via WireGuard
/ip/route/add dst-address=10.20.20.0/24 gateway=wg-s2s-officeB comment="S2S-WG: route to OfficeB LAN"
# 5) Address lists for cleaner firewall rules
/ip/firewall/address-list/add list=lan-officeA address=10.10.10.0/24 comment="OfficeA LAN"
/ip/firewall/address-list/add list=lan-officeB address=10.20.20.0/24 comment="OfficeB LAN"
# 6) Firewall: allow WireGuard inbound on WAN
/ip/firewall/filter/add chain=input action=accept protocol=udp dst-port=13231 in-interface=ether1 \
comment="S2S-WG: allow WireGuard UDP from WAN"
# 7) Firewall: allow WireGuard interface traffic to the router itself (optional but practical)
/ip/firewall/filter/add chain=input action=accept in-interface=wg-s2s-officeB \
comment="S2S-WG: allow input from WireGuard (management/ping as needed)"
# 8) Firewall: allow forwarding between LANs over the tunnel
/ip/firewall/filter/add chain=forward action=accept in-interface=wg-s2s-officeB out-interface-list=LAN \
comment="S2S-WG: allow WG to LAN forwarding"
/ip/firewall/filter/add chain=forward action=accept in-interface-list=LAN out-interface=wg-s2s-officeB \
comment="S2S-WG: allow LAN to WG forwarding"
# 9) NAT bypass (no masquerade between sites)
# Place this BEFORE any general masquerade rule.
/ip/firewall/nat/add chain=srcnat action=accept src-address=10.10.10.0/24 dst-address=10.20.20.0/24 \
comment="S2S-WG: no-NAT OfficeA->OfficeB"
# 10) Optional: log drops related to the tunnel during bring-up (disable after)
/ip/firewall/filter/add chain=forward action=log log-prefix="S2S-WG DROP " src-address=10.10.10.0/24 dst-address=10.20.20.0/24 \
comment="S2S-WG: temporary debug log OfficeA->OfficeB"
Office B (behind NAT, initiates)
Office B will set an endpoint pointing to Office A’s public IP/DNS name, and set persistent-keepalive=25s.
That’s not magic; it’s just teaching NAT boxes to keep the mapping alive.
cr0x@server:~$ cat office-b-routeros.rsc
# Office B: MikroTik RouterOS v7+ WireGuard site-to-site template
# 1) WireGuard interface
/interface/wireguard/add name=wg-s2s-officeA listen-port=13231 mtu=1420 comment="S2S-WG: OfficeB<->OfficeA"
# 2) WireGuard tunnel IP
/ip/address/add address=10.99.0.2/30 interface=wg-s2s-officeA comment="S2S-WG: tunnel IP OfficeB"
# 3) Peer (Office A)
/interface/wireguard/peers/add interface=wg-s2s-officeA public-key="OFFICE_A_PUBLIC_KEY" \
endpoint-address=203.0.113.10 endpoint-port=13231 persistent-keepalive=25s \
allowed-address=10.99.0.1/32,10.10.10.0/24 \
comment="S2S-WG: peer OfficeA"
# 4) Route to Office A LAN via WireGuard
/ip/route/add dst-address=10.10.10.0/24 gateway=wg-s2s-officeA comment="S2S-WG: route to OfficeA LAN"
# 5) Address lists for cleaner firewall rules
/ip/firewall/address-list/add list=lan-officeA address=10.10.10.0/24 comment="OfficeA LAN"
/ip/firewall/address-list/add list=lan-officeB address=10.20.20.0/24 comment="OfficeB LAN"
# 6) Firewall: allow WireGuard traffic (input from WAN not needed if behind NAT, but allow from WG interface)
/ip/firewall/filter/add chain=input action=accept in-interface=wg-s2s-officeA \
comment="S2S-WG: allow input from WireGuard (management/ping as needed)"
# 7) Firewall: allow forwarding between LANs over the tunnel
/ip/firewall/filter/add chain=forward action=accept in-interface=wg-s2s-officeA out-interface-list=LAN \
comment="S2S-WG: allow WG to LAN forwarding"
/ip/firewall/filter/add chain=forward action=accept in-interface-list=LAN out-interface=wg-s2s-officeA \
comment="S2S-WG: allow LAN to WG forwarding"
# 8) NAT bypass (no masquerade between sites)
# Place this BEFORE any general masquerade rule.
/ip/firewall/nat/add chain=srcnat action=accept src-address=10.20.20.0/24 dst-address=10.10.10.0/24 \
comment="S2S-WG: no-NAT OfficeB->OfficeA"
# 9) Optional: log drops related to the tunnel during bring-up (disable after)
/ip/firewall/filter/add chain=forward action=log log-prefix="S2S-WG DROP " src-address=10.20.20.0/24 dst-address=10.10.10.0/24 \
comment="S2S-WG: temporary debug log OfficeB->OfficeA"
Joke #1: WireGuard is so simple that you’ll spend most of your time debugging the parts you didn’t configure, which is impressively on-brand for networking.
Key handling (do not freestyle this)
Generate a keypair for each router and exchange only the public keys. Keep private keys private. Yes, that sentence is obvious.
It’s still where teams fail under pressure. The example below uses the Linux wg tool on a host at Office A; a RouterOS-native alternative is sketched just after the key-handling decision.
cr0x@server:~$ ssh admin@office-a 'wg genkey | tee /tmp/wg.key | wg pubkey > /tmp/wg.pub && ls -l /tmp/wg.* && cat /tmp/wg.pub'
-rw------- 1 admin admin 45 Dec 28 10:12 /tmp/wg.key
-rw-r--r-- 1 admin admin 45 Dec 28 10:12 /tmp/wg.pub
hF2m3dZyJf1q3oV6n8r2b1P9kY0aQk7mZxv6b2rNQXo=
Decision: store the private key in the router’s WireGuard interface config only; do not email it, do not paste it into tickets, do not “temporarily” keep it in shared chat.
If you need auditability, store it in a proper secret manager with strict access controls.
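On RouterOS v7 you don’t even have to move the private key: if you add the WireGuard interface without specifying one, the router generates a keypair itself, and you only read out the public key to paste into the other side’s peer config. A minimal sketch using the interface name from this template:
cr0x@server:~$ cat read-pubkey.rsc
# Print only the public key of the interface created in step 1; the private key never leaves the router.
:put [/interface/wireguard/get [find name="wg-s2s-officeB"] public-key]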
Firewall and NAT: what to allow, what to refuse
Your firewall policy should be explicit. “We allow WireGuard” is not a policy; it’s a mood.
For site-to-site, you want:
- Input: allow UDP port 13231 to the router (only on the side that listens on a public WAN)
- Forward: allow LAN↔WG traffic, tightly scoped if you can (a scoped example follows this list)
- NAT: accept (bypass) site-to-site traffic before general masquerade
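If you want the forward policy tighter than “anything between the LANs,” scope it with the address lists from the template and explicit services. A sketch for Office A, assuming (hypothetically) that SMB and RDP are the only cross-site services; adjust the port list to your reality and keep the drop rule below the accepts.
cr0x@server:~$ cat scoped-forward-example.rsc
# Office A: allow only specific services from OfficeA LAN to OfficeB LAN, plus reply traffic.
# Assumptions: lists lan-officeA/lan-officeB exist (template step 5); 445/tcp and 3389/tcp are example services.
/ip/firewall/filter/add chain=forward action=accept connection-state=established,related \
comment="S2S-WG: allow established/related (skip if your default firewall already has this)"
/ip/firewall/filter/add chain=forward action=accept protocol=tcp dst-port=445,3389 \
src-address-list=lan-officeA dst-address-list=lan-officeB out-interface=wg-s2s-officeB \
comment="S2S-WG: OfficeA clients to OfficeB SMB/RDP"
/ip/firewall/filter/add chain=forward action=drop in-interface=wg-s2s-officeB \
comment="S2S-WG: drop everything else arriving from the tunnel"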
About FastTrack (MikroTik-specific reality)
FastTrack can be great until it isn’t. Depending on your RouterOS version and configuration, FastTrack sends established connections around most of the firewall filter, mangle, and queue processing,
and that can interfere with VPN traffic shaping, firewall accounting, or MSS clamping rules.
My rule: do not FastTrack traffic that goes into or out of WireGuard until you’ve proven it behaves. Then, if you enable it, do it with exceptions and measure.
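If you do keep a FastTrack rule for general internet traffic, one common pattern is to accept inter-site flows before the fasttrack-connection rule, so those connections never get fasttracked and keep hitting your mangle (MSS clamp) and accounting rules. A sketch, assuming the address lists from this template and an existing generic fasttrack rule; confirm ordering with a print afterwards.
cr0x@server:~$ cat fasttrack-exception.rsc
# Accept site-to-site flows BEFORE the action=fasttrack-connection rule so they are never fasttracked.
# Assumption: your default forward chain contains a generic fasttrack-connection rule.
/ip/firewall/filter/add chain=forward action=accept src-address-list=lan-officeA \
dst-address-list=lan-officeB comment="S2S-WG: keep inter-site traffic out of fasttrack"
/ip/firewall/filter/add chain=forward action=accept src-address-list=lan-officeB \
dst-address-list=lan-officeA comment="S2S-WG: keep inter-site traffic out of fasttrack (return)"
# Then move both rules above the fasttrack rule (for example with /ip/firewall/filter/move)
# and verify with /ip/firewall/filter/print.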
NAT bypass: the most common “it connects but nothing works” cause
If you masquerade OfficeA→OfficeB traffic, Office B will see packets coming from the Office A router’s tunnel IP or WAN IP, not the original client IP.
That breaks auditing, can break ACLs, and makes troubleshooting a sad guessing game.
Add explicit srcnat action=accept rules for the inter-site subnets, and place them before any generic masquerade.
Then keep a short comment on the masquerade rule warning future you not to reorder it.
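If the bypass rule does get reordered anyway, you can move it back to the top instead of deleting and recreating it. A sketch using the comment from this template; confirm the move syntax on your RouterOS version.
cr0x@server:~$ cat fix-nat-order.rsc
# Move the no-NAT accept rule to position 0 so it matches before any masquerade rule.
# Assumption: the rule still carries the comment from template step 9.
/ip/firewall/nat/move [find comment="S2S-WG: no-NAT OfficeA->OfficeB"] destination=0
# Verify: the accept rule must print above the masquerade rule.
/ip/firewall/nat/print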
Routing design: static routes that don’t surprise you
For two sites, static routes are the correct amount of boring. Dynamic routing can be great, but it’s also a second moving system
with its own failure modes, its own security model, and its own “someone changed a metric” tickets.
Use one route per remote LAN, with the gateway set to the WireGuard interface. That’s it.
If you later add more sites, you can still keep static routes if the count stays reasonable.
If it grows, use a routing protocol intentionally, not because your coworker wants to practice.
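If you want the route to go inactive automatically when the far side stops answering, a common variation is to point the gateway at the remote tunnel IP instead of the interface and add check-gateway=ping. A sketch for Office A using this template’s addressing; test the failover behavior on your RouterOS version before depending on it.
cr0x@server:~$ cat route-check-gateway.rsc
# Office A: route to OfficeB LAN via the remote tunnel IP, withdrawn if pings to it fail.
# Assumption: 10.99.0.2 is the OfficeB tunnel IP from this template.
/ip/route/add dst-address=10.20.20.0/24 gateway=10.99.0.2 check-gateway=ping \
comment="S2S-WG: route to OfficeB LAN (ping-checked)"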
MTU, MSS clamping, and why “small pings work” is a trap
WireGuard encapsulates packets inside UDP. Encapsulation adds overhead. Overhead reduces effective MTU. That’s normal.
The failure mode is also normal: some networks drop ICMP “fragmentation needed” messages, so path MTU discovery breaks.
Then large TCP packets vanish, and users report “some websites load, some don’t.”
Set WireGuard MTU to something conservative (1420 is a common starting point). Then test with large pings and real TCP.
If you see weirdness, clamp TCP MSS on traffic going through the tunnel.
MSS clamping rule (example)
This is a practical fix for blackhole MTU issues. It is not a substitute for understanding your path, but it keeps production stable.
cr0x@server:~$ cat mss-clamp-routeros.rsc
# Apply on both sides if needed
/ip/firewall/mangle/add chain=forward action=change-mss new-mss=1360 protocol=tcp tcp-flags=syn \
in-interface=wg-s2s-officeB out-interface-list=LAN comment="S2S-WG: clamp MSS WG->LAN"
/ip/firewall/mangle/add chain=forward action=change-mss new-mss=1360 protocol=tcp tcp-flags=syn \
in-interface-list=LAN out-interface=wg-s2s-officeB comment="S2S-WG: clamp MSS LAN->WG"
Decision: only add MSS clamping if you observe MTU-related failures (or you know you traverse links with lower MTU).
Clamping everything “just in case” can reduce throughput. Not catastrophic, but don’t cargo-cult.
Operational tasks: commands, outputs, and decisions (12+)
These are the tasks you run during bring-up and during incidents. Each includes what to look for and what decision you make.
Use them as a runbook, not as a ritual.
Task 1: Verify WireGuard interface exists and is running (RouterOS)
cr0x@server:~$ ssh admin@office-a 'interface/wireguard/print detail where name="wg-s2s-officeB"'
0 name="wg-s2s-officeB" mtu=1420 listen-port=13231 private-key="<hidden>" running=yes
Output meaning: running=yes confirms the interface is operational.
Decision: if running=no, fix interface creation, key presence, or RouterOS version issues before touching firewall/routing.
Task 2: Confirm the peer shows a recent handshake
cr0x@server:~$ ssh admin@office-a 'interface/wireguard/peers/print detail where comment~"OfficeB"'
0 interface=wg-s2s-officeB public-key="OFFICE_B_PUBLIC_KEY" allowed-address=10.99.0.2/32,10.20.20.0/24
last-handshake=1m12s rx=183204 tx=201889 endpoint-address=198.51.100.55 endpoint-port=51820
Output meaning: last-handshake is recent and rx/tx counters increase.
Decision: if handshake is empty/old, check UDP reachability and endpoint settings before debugging routes.
Task 3: Validate tunnel IP addresses are present
cr0x@server:~$ ssh admin@office-a 'ip/address/print where interface="wg-s2s-officeB"'
0 address=10.99.0.1/30 network=10.99.0.0 interface=wg-s2s-officeB
Output meaning: the router has the correct tunnel IP.
Decision: if missing, traffic may still handshake but routing will be wrong; add the address before proceeding.
Task 4: Check that the static route exists and is active
cr0x@server:~$ ssh admin@office-a 'ip/route/print detail where dst-address=10.20.20.0/24'
0 dst-address=10.20.20.0/24 gateway=wg-s2s-officeB distance=1 scope=30 target-scope=10 active=yes
Output meaning: active=yes indicates the router considers the route usable.
Decision: if inactive, you may have interface down, incorrect gateway, or route conflict (more specific route elsewhere).
Task 5: Ping the remote tunnel IP from the router
cr0x@server:~$ ssh admin@office-a 'ping 10.99.0.2 count=5'
SEQ HOST SIZE TTL TIME STATUS
0 10.99.0.2 56 64 23ms ok
1 10.99.0.2 56 64 22ms ok
2 10.99.0.2 56 64 22ms ok
3 10.99.0.2 56 64 23ms ok
4 10.99.0.2 56 64 22ms ok
Output meaning: tunnel IP connectivity is good.
Decision: if this fails but handshake exists, look for firewall input rules on the remote side and allowed-address mistakes.
Task 6: Ping a remote LAN host from a local LAN host (end-to-end)
cr0x@server:~$ ping -c 4 10.20.20.50
PING 10.20.20.50 (10.20.20.50) 56(84) bytes of data.
64 bytes from 10.20.20.50: icmp_seq=1 ttl=62 time=25.1 ms
64 bytes from 10.20.20.50: icmp_seq=2 ttl=62 time=24.6 ms
64 bytes from 10.20.20.50: icmp_seq=3 ttl=62 time=24.9 ms
64 bytes from 10.20.20.50: icmp_seq=4 ttl=62 time=24.7 ms
--- 10.20.20.50 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3005ms
Output meaning: actual business traffic path works.
Decision: if router-to-router works but host-to-host fails, you likely have missing LAN firewall rules, wrong default gateway on hosts, or asymmetric routing.
Task 7: Trace the path to confirm routing is actually used
cr0x@server:~$ traceroute -n 10.20.20.50
traceroute to 10.20.20.50 (10.20.20.50), 30 hops max, 60 byte packets
1 10.10.10.1 0.395 ms 0.324 ms 0.311 ms
2 10.99.0.2 23.012 ms 22.911 ms 22.903 ms
3 10.20.20.50 24.771 ms 24.701 ms 24.679 ms
Output meaning: hop 2 is the remote tunnel endpoint; routing is correct.
Decision: if hop 2 shows an ISP hop, your traffic is leaking to the internet because your route/NAT is wrong.
Task 8: Validate NAT bypass rule ordering
cr0x@server:~$ ssh admin@office-a 'ip/firewall/nat/print where chain="srcnat"'
0 chain=srcnat action=accept src-address=10.10.10.0/24 dst-address=10.20.20.0/24 comment="S2S-WG: no-NAT OfficeA->OfficeB"
1 chain=srcnat action=masquerade out-interface=ether1 comment="MASQ: internet"
Output meaning: accept rule is above masquerade. Good.
Decision: if masquerade is above accept, fix ordering. This is not negotiable if you care about source IP preservation.
Task 9: Confirm firewall counters increment on the intended rules
cr0x@server:~$ ssh admin@office-a 'ip/firewall/filter/print stats where comment~"S2S-WG: allow"'
0 chain=input action=accept packets=921 bytes=87244 comment="S2S-WG: allow WireGuard UDP from WAN"
1 chain=input action=accept packets=110 bytes=9320 comment="S2S-WG: allow input from WireGuard (management/ping as needed)"
2 chain=forward action=accept packets=12844 bytes=11833492 comment="S2S-WG: allow WG to LAN forwarding"
3 chain=forward action=accept packets=12192 bytes=11011220 comment="S2S-WG: allow LAN to WG forwarding"
Output meaning: traffic is matching your intended allow rules, not falling through to a default drop.
Decision: if counters stay at zero while users complain, you’re not on the path you think you are—check routes and interface lists.
Task 10: Capture packets on the WireGuard interface (RouterOS sniffer)
cr0x@server:~$ ssh admin@office-a 'tool/sniffer/quick interface=wg-s2s-officeB ip-address=10.20.20.50'
TIME NUM DIR SRC-ADDRESS DST-ADDRESS PROTOCOL SIZE
1.234 12 <-> 10.10.10.25 10.20.20.50 icmp 98
1.456 13 <-> 10.20.20.50 10.10.10.25 icmp 98
Output meaning: packets traverse the WG interface both ways.
Decision: if you see one direction only, suspect firewall drops on the far side or return route/default gateway issues on the destination host.
Task 11: Check for route conflicts (more specific routes winning)
cr0x@server:~$ ssh admin@office-a 'ip/route/print where dst-address~"10.20.20."'
0 dst-address=10.20.20.0/24 gateway=wg-s2s-officeB distance=1 active=yes
1 dst-address=10.20.20.0/24 gateway=10.10.10.254 distance=1 active=no
Output meaning: there’s another route for the same prefix, currently inactive.
Decision: remove or correct conflicting routes; otherwise, future link state changes may flip your traffic unexpectedly.
Task 12: MTU test with “do not fragment” ping
cr0x@server:~$ ping -M do -s 1380 -c 3 10.20.20.50
PING 10.20.20.50 (10.20.20.50) 1380(1408) bytes of data.
1388 bytes from 10.20.20.50: icmp_seq=1 ttl=62 time=25.9 ms
1388 bytes from 10.20.20.50: icmp_seq=2 ttl=62 time=25.4 ms
1388 bytes from 10.20.20.50: icmp_seq=3 ttl=62 time=25.5 ms
--- 10.20.20.50 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
Output meaning: payload size 1380 works without fragmentation; that’s a decent sign for TCP performance.
Decision: if it fails at moderate sizes (say 1360–1380), implement MSS clamping and/or lower WG MTU.
Task 13: Confirm WireGuard UDP port is reachable from the internet (from a test host)
cr0x@server:~$ sudo nmap -sU -p 13231 203.0.113.10
Starting Nmap 7.94 ( https://nmap.org ) at 2025-12-28 10:40 UTC
Nmap scan report for 203.0.113.10
Host is up (0.021s latency).
PORT STATE SERVICE
13231/udp open|filtered unknown
Nmap done: 1 IP address (1 host up) scanned in 1.23 seconds
Output meaning: UDP often shows open|filtered because there’s no handshake like TCP.
Decision: if it shows closed, you likely have ISP/edge blocking or wrong port-forward/firewall. Fix reachability before touching peers.
Task 14: Watch WireGuard transfer counters during a test
cr0x@server:~$ ssh admin@office-b 'interface/wireguard/peers/print detail where comment~"OfficeA"'
0 interface=wg-s2s-officeA public-key="OFFICE_A_PUBLIC_KEY" allowed-address=10.99.0.1/32,10.10.10.0/24
last-handshake=14s rx=901233 tx=884120 endpoint-address=203.0.113.10 endpoint-port=13231
Output meaning: counters change as you generate traffic. If they don’t, your test may not be going through the tunnel at all.
Decision: if rx increases but tx does not (or vice versa), check firewall and routes in the direction that’s missing.
Fast diagnosis playbook
When it’s broken and someone is watching, you need a ruthless order of operations. Don’t bounce the tunnel.
Don’t “just reboot the router.” That’s how you erase evidence and extend outages.
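Before you change anything, capture a snapshot so you argue from evidence, not memory. A read-only sketch for Office A, built from the same print commands used in the tasks above:
cr0x@server:~$ cat s2s-snapshot.rsc
# Read-only snapshot of tunnel state on Office A; nothing here changes configuration.
/interface/wireguard/print detail where name="wg-s2s-officeB"
/interface/wireguard/peers/print detail where comment~"OfficeB"
/ip/route/print detail where dst-address=10.20.20.0/24
/ip/firewall/nat/print stats where chain="srcnat"
/ip/firewall/filter/print stats where comment~"S2S-WG"
/ping 10.99.0.2 count=3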
First: is the peer actually alive?
- Check last-handshake on both sides.
- Check rx/tx counters while you generate traffic.
- If handshake is stale: validate UDP reachability (ISP, port-forward, WAN firewall).
Bottleneck call: if handshake is dead, your bottleneck is underlay reachability (WAN path, NAT, firewall).
Second: does routing point into the tunnel?
- Verify static routes exist and are active.
- From a LAN host, traceroute to the remote LAN IP.
- On the router, sniff on the WG interface for that flow.
Bottleneck call: if handshake is fine but traceroute shows ISP hops, your bottleneck is routing/NAT policy.
Third: is firewall/NAT rewriting or dropping?
- Check NAT order: bypass accept must be above masquerade.
- Check firewall counters on allow rules.
- Temporarily enable targeted log drops for the two LAN subnets.
Bottleneck call: if packets hit drop logs or NAT masquerade counters unexpectedly, your bottleneck is policy (filter/NAT).
Fourth: is it MTU/performance weirdness?
- Ping with DF and larger payloads.
- Test a single TCP flow (file copy or iperf if you have it).
- If intermittent app failures: clamp MSS and retest.
Bottleneck call: if small pings work but larger fail, your bottleneck is MTU blackholing.
Common mistakes: symptom → root cause → fix
1) Handshake is up, but no one can reach the other site
Symptom: last-handshake is recent, but ping/traceroute to remote LAN fails.
Root cause: missing static route, wrong allowed-address, or NAT masquerade hitting inter-site traffic.
Fix: add/activate route to remote LAN via WG; ensure peer allowed-address includes remote LAN; add NAT bypass accept rule above masquerade.
2) Only one direction works (A→B works, B→A fails)
Symptom: you can reach B from A, but not A from B.
Root cause: asymmetric routing or return path blocked; remote LAN host default gateway wrong; firewall forward rules missing on one side.
Fix: confirm default gateways on destination hosts point to the site router; verify forward accept rules in both directions; sniff on WG interface to see if return traffic exists.
3) Some applications work, others hang (especially HTTPS, SMB, RDP)
Symptom: DNS resolves, small pings work, but web apps time out or SMB stalls.
Root cause: MTU/PMTUD blackhole; fragmentation needed messages blocked upstream.
Fix: lower WireGuard MTU (start at 1420 then adjust) and/or clamp TCP MSS on forwarded SYN packets.
4) Tunnel drops after a few minutes of idle time
Symptom: handshake becomes old; traffic resumes only after someone “tries again.”
Root cause: NAT mapping expiration on the side behind NAT; no keepalive.
Fix: set persistent-keepalive=25s on the NATed peer; verify outbound UDP is permitted.
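A quick remediation sketch for the NATed side (Office B in this template), assuming the peer still carries the comment from the template:
cr0x@server:~$ cat set-keepalive.rsc
# Office B: enable keepalive on the existing peer so the NAT mapping stays warm.
/interface/wireguard/peers/set [find comment~"OfficeA"] persistent-keepalive=25s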
5) Inter-site traffic is NATed and ACLs break
Symptom: remote servers see traffic from the remote router, not the actual client IP; access rules fail or auditing is useless.
Root cause: generic masquerade rule matches before your bypass.
Fix: add srcnat action=accept for the LAN-to-LAN traffic and place it above masquerade; verify counters.
6) You can ping tunnel IPs but not remote LANs
Symptom: 10.99.0.1↔10.99.0.2 pings fine, but 10.10.10.0/24↔10.20.20.0/24 fails.
Root cause: allowed-address includes only tunnel /32s, not the LAN prefixes; or forward chain rules don’t permit LAN forwarding.
Fix: include remote LAN subnets in allowed-address; add forward accept rules for LAN↔WG in both directions.
7) CPU spikes and throughput collapses during transfers
Symptom: big file copies peg CPU on one router; latency climbs; users complain the internet “feels slow” too.
Root cause: underpowered router model, queues/fasttrack misconfiguration, or overly broad firewall logging.
Fix: disable debug logging rules; ensure you’re not logging every forward packet; consider hardware upgrade or traffic shaping with intent (not guesswork).
Three corporate mini-stories (how this goes wrong)
Incident caused by a wrong assumption: “Handshake means it’s fine”
A mid-sized company stitched two offices together with WireGuard on MikroTik. The admin did the classic validation:
handshake is recent, therefore VPN is up. They announced success. People started mapping network drives.
Within an hour, tickets came in: some PCs could access a remote file server, others couldn’t. The “working” PCs were on a VLAN
that happened to have a permissive firewall policy. The rest were on a restricted VLAN whose forward chain rules didn’t include the WireGuard interface.
The admin kept poking WireGuard settings. Nothing changed, because the tunnel wasn’t the problem. The proof was right there:
peer counters incremented during pings, but forward chain counters didn’t move for real user traffic.
They were debugging the wrong layer because they assumed handshake equals end-to-end connectivity.
The fix was mundane: add explicit forward rules for the specific LAN subnets to and from the WG interface, and tighten them to only the required ports.
After that, connectivity became consistent, and the security team stopped hovering.
Optimization that backfired: “FastTrack everything”
Another organization had a performance complaint: large transfers between offices were slower than expected. Someone suggested enabling FastTrack widely
to reduce CPU load and increase throughput. They did it in the forward chain with a broad rule matching most established connections.
For a day, things looked great. CPU dropped. The graphs calmed down. Then the weird reports started:
some HTTPS sessions would stall, particularly for apps that used long-lived connections. SMB copies would start fast and then degrade.
The helpdesk classified it as “intermittent internet,” which is corporate for “we don’t know what’s happening.”
The root cause: the FastTrack rule bypassed parts of the firewall/mangle pipeline that they relied on for MSS clamping and traffic accounting.
Some flows were going through clamped, others weren’t, depending on connection state and rule ordering. Performance became non-deterministic.
They rolled back FastTrack for WireGuard-related traffic and kept it for simple internet NAT flows only.
Throughput stabilized. The lesson wasn’t “FastTrack is bad.” The lesson was “know what you’re bypassing,” and be disciplined about exceptions.
Boring but correct practice that saved the day: “Comment, baseline, and measure”
A third team had a tidy habit: every change to firewall/NAT for the VPN had a comment prefix, and they kept a short baseline capture of
what “healthy” looked like—handshake times, route status, and rule counters during a standard ping test.
One morning, Office B lost access to Office A’s ERP server. The tunnel was up. Routing looked correct at first glance.
Instead of trying random fixes, the on-call engineer compared counters to the baseline: the NAT bypass rule was no longer matching.
Turns out a routine “cleanup” the night before reordered NAT rules. The masquerade rule moved above the bypass accept rule.
Everything still “worked” for some traffic, but ERP servers had ACLs expecting real client IPs and started refusing sessions.
Because the team had comments and baselines, diagnosis took minutes. They restored NAT ordering, added a guard comment on the masquerade rule,
and implemented a small peer review policy for rule ordering changes. It was not glamorous. It also prevented a recurring outage class.
Joke #2: The most dangerous network device is the one labeled “temporary,” because it will outlive your org chart.
Checklists / step-by-step plan
Step-by-step bring-up (two offices)
- Pick subnets: confirm LANs do not overlap (10.10.10.0/24 and 10.20.20.0/24) and choose a tunnel /30 (10.99.0.0/30).
- Upgrade RouterOS: both routers on v7+ with the same major release if possible.
- Generate keys on each router; exchange public keys out-of-band.
- Create WG interfaces with explicit names and MTU 1420.
- Assign tunnel IPs to WG interfaces.
- Configure peers:
  - Office A peer allowed-address includes Office B tunnel /32 and Office B LAN /24.
  - Office B peer allowed-address includes Office A tunnel /32 and Office A LAN /24.
  - Office B sets endpoint to Office A and persistent keepalive if NATed.
- Add static routes to the remote LAN via the WG interface.
- Add firewall rules:
  - Allow UDP port 13231 on the Office A WAN.
  - Allow forward LAN↔WG in both directions.
- Add NAT bypass accept rules before masquerade.
- Test in layers:
  - Handshake
  - Ping tunnel IPs
  - Ping remote LAN IP from router
  - Ping remote host from LAN host
  - MTU test with DF ping
- Disable debug logging rules after validation.
- Write down the known-good state: routes, counters, handshake cadence, and where keepalive is configured.
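Capturing that known-good state can be as simple as an export plus the runbook outputs saved to files while everything works. A sketch; RouterOS v7 hides sensitive values in exports by default, but double-check the file before sharing it anywhere.
cr0x@server:~$ cat capture-baseline.rsc
# Save a config export plus the health outputs you will compare against during incidents.
/export file=s2s-baseline
/interface/wireguard/peers/print detail file=s2s-peers-baseline
/ip/route/print detail file=s2s-routes-baseline
/ip/firewall/filter/print stats file=s2s-filter-baseline
/ip/firewall/nat/print stats file=s2s-nat-baseline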
Change checklist (every time you touch it)
- Does this change affect rule ordering in NAT or filter? If yes, validate NAT bypass still matches.
- Are we changing any subnet definitions? If yes, update allowed-address and static routes on both sides.
- Did we introduce overlapping subnets via a new VLAN? If yes, stop and redesign before deploying.
- Did we enable FastTrack or new mangle rules? If yes, retest MTU/MSS and measure throughput.
- Did we change ISP or WAN addressing? If yes, validate endpoint and UDP reachability immediately.
Security checklist (minimum viable hygiene)
- Restrict input to UDP port 13231 on WAN (Office A) and keep other input closed; a stricter source-filtered variant is sketched after this checklist.
- Scope forward rules by subnets (and even ports/services if you can).
- Keep WireGuard keys out of tickets and chat logs.
- Log only what you need, and only when you need it. Permanent “log everything” is a self-inflicted DoS.
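If Office B’s public IP is static, you can tighten the Office A input rule to that single source; if Office B’s address changes, skip this or you will cut the tunnel off yourself. A sketch using the example endpoint address from Task 2’s output:
cr0x@server:~$ cat tighten-input.rsc
# Office A: accept WireGuard UDP only from OfficeB's known public IP.
# Assumption: OfficeB's WAN address (198.51.100.55 here) is static; otherwise keep the broader rule.
/ip/firewall/filter/add chain=input action=accept protocol=udp dst-port=13231 \
src-address=198.51.100.55 in-interface=ether1 comment="S2S-WG: WG UDP from OfficeB only"
# Place it above, then disable or remove, the generic "allow WireGuard UDP from WAN" rule.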
FAQ
1) Should I bridge the two offices so it’s “one LAN”?
No, not unless you have a very specific requirement and you’re ready to own the broadcast and spanning-tree consequences.
Routed is cleaner, debuggable, and scales better operationally.
2) Why do we add remote LAN subnets to allowed-address?
In WireGuard, allowed-address is the cryptokey routing policy: outbound, it decides which destination prefixes may be sent to that peer; inbound, it filters which source addresses are accepted from it.
If you omit the remote LAN subnet, you’ll get a handshake and maybe tunnel pings, but not actual inter-site routing.
3) Do I need persistent-keepalive on both sides?
Usually only on the side behind NAT (or behind aggressive stateful firewalls). If both sides have public IPs and stable UDP,
you can often skip it. If in doubt, set it only on the NATed side at 25 seconds.
4) What port should I use for WireGuard?
Any UDP port you can reliably allow through your WAN edge. Pick one, document it, and don’t reuse random ports across different VPNs.
Operational clarity beats clever hiding.
5) Can I run multiple site-to-site tunnels on one router?
Yes. Use separate interfaces or carefully managed peer configs on one interface. Keep naming strict and address planning consistent.
The moment you lose track of which peer owns which subnets, you’ll route traffic to the wrong place.
6) The handshake is recent but rx increases and tx doesn’t (or vice versa). What does that mean?
You have one-way traffic. Either the return path is broken (routing/default gateway), or the firewall is dropping one direction.
Sniff on the WG interface and verify forward rule counters to see where it dies.
7) How do I avoid overlapping subnets when the business acquires another office?
You plan for it: maintain an internal IPAM record and reserve ranges per site. If you inherit overlap, you must renumber one side
or implement NAT between sites (which is operational debt). Renumbering hurts once; NAT hurts forever.
8) Should I NAT traffic between sites to make it “simpler”?
Avoid it. NAT hides source IPs, breaks ACL expectations, and makes troubleshooting harder.
Only use inter-site NAT as a temporary compatibility measure (like overlapping subnets you can’t fix immediately), and write a removal plan.
9) Do I need special QoS for WireGuard site-to-site?
Not by default. Measure first. If voice/video or critical apps suffer during large transfers, then introduce shaping with clear classes and limits.
Don’t queue your way into a new bottleneck without baseline data.
10) What’s the single best “health check” metric?
Recent handshake plus steadily increasing transfer counters during a scheduled synthetic test (like a ping plus small TCP probe).
Handshake alone is not enough; counters alone can be stale. You want both, observed over time.
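The router can run a minimal version of that synthetic test itself: a scheduler that pings a remote host over the tunnel and logs when replies go missing. A sketch for Office A, targeting the hypothetical file server used in the tasks above (10.20.20.50); pair it with a TCP probe from a monitoring host if you have one.
cr0x@server:~$ cat synthetic-check.rsc
# Office A: every 5 minutes, ping an OfficeB host through the tunnel and log on packet loss.
# In RouterOS scripts, /ping returns the number of replies received.
/system/scheduler/add name=s2s-wg-synthetic interval=5m \
on-event=":if ([/ping 10.20.20.50 count=3] < 3) do={:log warning \"S2S-WG: synthetic ping to OfficeB lost packets\"}" \
comment="S2S-WG: synthetic tunnel check"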
Conclusion: next steps you can do today
If you only do three things after reading this, do these:
- Lock in a clean IP plan (distinct LANs, separate tunnel /30) and write it down where it won’t disappear.
- Implement NAT bypass correctly (accept rules above masquerade) and verify via counters, not hope.
- Adopt the fast diagnosis order: handshake → routing → firewall/NAT → MTU. It keeps you from debugging the wrong layer.
Then schedule one small maintenance window to run the operational tasks section as a rehearsal: capture known-good outputs, confirm counters,
and test MTU. The time to learn your network’s quirks is not during an outage call with five managers silently breathing into the conference bridge.