Site-to-site VPNs fail in a very specific way: everything looks “up,” nobody can reach the file server, and your phone starts vibrating like it’s trying to tunnel too.
WireGuard is usually the antidote—clean config, fast crypto, few moving parts—but site-to-site still has enough routing and firewall sharp edges to keep an SRE employed.
This guide is how I actually build an office-to-office WireGuard link in production: two gateways, two LANs, real routing, NAT decisions you can defend, and a diagnosis playbook for when reality disagrees with your diagram.
Topology, assumptions, and what we’re building
We’ll connect two offices over the public internet using two Linux gateways that run WireGuard.
Each office has its own private LAN. The gateways will route between the LANs through a WireGuard tunnel.
Clients inside each office should reach the other office’s subnets without installing WireGuard themselves.
Example addressing (change to your reality)
- Office A LAN: 10.10.0.0/24
- Office B LAN: 10.20.0.0/24
- WireGuard tunnel network: 10.99.0.0/24
- Gateway A public IP: 198.51.100.10 (example)
- Gateway B public IP: 203.0.113.20 (example)
- Gateway A wg IP: 10.99.0.1
- Gateway B wg IP: 10.99.0.2
- Listen port: 51820/udp
What “done” looks like
- From a host in Office A (e.g., 10.10.0.50), you can reach 10.20.0.60 in Office B (ping + TCP).
- From a host in Office B, you can reach Office A LAN services.
- Tunnel stays up across NAT and idle periods.
- Routing is explicit; NAT is a conscious choice, not a “because it worked” accident.
- When it breaks, you can pinpoint whether it’s crypto, transport, routing, or policy in minutes.
Interesting facts and short history
- WireGuard is young by VPN standards. It was introduced in the mid-2010s as a simpler, modern alternative to IPsec and OpenVPN.
- It landed in the Linux kernel in 2020. That matters: fewer out-of-tree modules, fewer upgrade surprises, better performance.
- Noise protocol framework under the hood. WireGuard’s handshake is based on Noise patterns—small, reviewable, and designed for modern cryptography.
- No cipher-suite negotiation. That’s not a missing feature; it’s a deliberate choice to reduce downgrade attacks and config sprawl.
- It’s intentionally “IP-layer boring.” WireGuard carries IP packets; it doesn’t try to be a full network management system.
- “AllowedIPs” is both routing and access control. It’s a policy primitive that doubles as a routing table selector.
- Roaming is a first-class concept. If a peer’s source IP changes (mobile, NAT shift), WireGuard can follow it after a valid handshake.
- UDP keeps it simple. It avoids TCP-over-TCP meltdowns and most head-of-line blocking pain you get from tunneling over TCP.
- Small codebase. WireGuard is famously compact compared to many VPN stacks, which helps audits and reduces bug surface.
Design choices that matter (and the ones that don’t)
Decide: route vs NAT between offices
For site-to-site, routing is the default you should aim for: Office A sees Office B as 10.20.0.0/24, and vice versa.
NAT “works,” but it hides identity, complicates logging, and creates weird application behavior (Kerberos, IP-based ACLs, anything that cares about source IP).
Use NAT only when you have unavoidable overlaps (both offices used 192.168.1.0/24 because the router came in a box and nobody felt like changing it),
or when the remote side refuses to add routes. If you can route, route.
Decide: where routing lives
You need the LAN hosts to send “remote LAN” traffic to the WireGuard gateway. There are two sane patterns:
- Proper routing: add a static route on each office router/core. On Office A’s router, 10.20.0.0/24 goes to Gateway A’s LAN IP; on Office B’s router, 10.10.0.0/24 goes to Gateway B’s LAN IP.
- Host routes: add routes on individual hosts. This is fine for small labs and terrible for offices with humans.
Decide: firewall policy boundaries
A site-to-site tunnel is not a magical trust umbrella. Treat it like plugging in a long Ethernet cable to someone else’s switch.
Permit what you need; deny what you don’t; log what you’ll regret not logging.
MTU and MSS: choose to be happy
Most mysterious “it pings but SMB dies” bugs are MTU/MSS problems in costume.
Start with a conservative WireGuard interface MTU like 1420, and clamp TCP MSS on the gateway firewall if you traverse PPPoE, LTE, or unknown middleboxes.
One quote to keep you honest: “Hope is not a strategy.” — a staple operations maxim (attribution and exact phrasing vary).
Prerequisites and baseline checks
- Two Linux gateways (Debian/Ubuntu examples below). They can be VMs, dedicated boxes, or small appliances.
- Each gateway has:
  - One interface on its office LAN (e.g., eth1)
  - One interface to the internet (e.g., eth0)
- You control routing on each office network, or at least the default gateway/router.
- UDP port 51820 can be forwarded to each gateway if it sits behind NAT.
- Non-overlapping LAN subnets. If they overlap, you’re doing NAT/translation (covered later as an escape hatch).
Joke #1: If you find both offices using 192.168.0.0/24, congratulations—you’ve discovered the world’s most popular corporate tradition.
Checklists / step-by-step plan
Phase 0: get the routing story straight
- Pick a dedicated WireGuard tunnel subnet (example: 10.99.0.0/24).
- Confirm Office A LAN and Office B LAN do not overlap.
- Pick which device in each office is the “gateway” and ensure it has reliable power and stable internet.
- Decide routing method:
- Preferred: add static routes on the office routers.
- Fallback: add routes on hosts (small deployments only).
Phase 1: install WireGuard and generate keys
- Install WireGuard tooling on both gateways.
- Generate keypairs; store private keys with strict permissions.
- Create wg0 config files with explicit addresses and AllowedIPs.
Phase 2: enable forwarding and firewall
- Enable IPv4 forwarding on both gateways.
- Allow UDP 51820 inbound to each gateway (public interface).
- Permit forwarding between wg0 and the LAN interface for the specific subnets/services you want.
- Optionally clamp MSS.
Phase 3: bring it up and validate in layers
- Bring up WireGuard, verify handshake times, and verify endpoints.
- Ping WireGuard tunnel IPs (10.99.0.1 ↔ 10.99.0.2).
- Ping LAN-to-LAN using a test host on each side.
- Validate TCP services (SSH/HTTP/SMB) both directions.
- Lock down firewall rules and logging once it works.
Practical tasks: commands, outputs, and decisions
The difference between a lab VPN and a production VPN is observability and disciplined decision-making.
Below are concrete tasks I run in order. Each includes what the output means and what I decide next.
Task 1: confirm interface names and addresses
cr0x@gw-a:~$ ip -br addr
lo UNKNOWN 127.0.0.1/8 ::1/128
eth0 UP 198.51.100.10/24
eth1 UP 10.10.0.1/24
Meaning: eth0 is public, eth1 is Office A LAN. Good.
If your “LAN” interface shows the public IP, stop. Your wiring (or VM NIC mapping) is wrong.
Decision: Note the LAN interface for firewall forwarding rules and the LAN gateway IP for static routes.
Task 2: install WireGuard tools
cr0x@gw-a:~$ sudo apt-get update
...output...
cr0x@gw-a:~$ sudo apt-get install -y wireguard
...output...
Meaning: You have wg, wg-quick, and kernel support (or DKMS) installed.
Decision: If this pulls a DKMS module on an old kernel, plan a kernel upgrade. Production VPNs shouldn’t depend on fragile out-of-tree builds.
Task 3: generate keypair (do this on each gateway)
cr0x@gw-a:~$ sudo sh -c 'umask 077; wg genkey | tee /etc/wireguard/privatekey | wg pubkey > /etc/wireguard/publickey'
cr0x@gw-a:~$ sudo ls -l /etc/wireguard
total 8
-rw------- 1 root root 45 privatekey
-rw-r--r-- 1 root root 45 publickey
Meaning: Private key is root-only. Public key is shareable with the peer.
Decision: If permissions aren’t strict, fix them. The private key is the identity.
Task 4: create the WireGuard config on Gateway A
cr0x@gw-a:~$ sudo tee /etc/wireguard/wg0.conf >/dev/null <<'EOF'
[Interface]
Address = 10.99.0.1/24
ListenPort = 51820
PrivateKey = __GW_A_PRIVATE_KEY__
MTU = 1420
# Optional: ensure routes are added by wg-quick
# (wg-quick will add routes for AllowedIPs on peers)
[Peer]
PublicKey = __GW_B_PUBLIC_KEY__
Endpoint = 203.0.113.20:51820
AllowedIPs = 10.99.0.2/32, 10.20.0.0/24
PersistentKeepalive = 25
EOF
Meaning: Gateway A will route traffic destined to Office B LAN (10.20.0.0/24) into the tunnel, and also knows the peer tunnel IP.
Decision: Keep AllowedIPs tight. Do not put 0.0.0.0/0 here for site-to-site unless you truly want all traffic hairpinned through the other office.
Task 5: create the WireGuard config on Gateway B (mirror it)
cr0x@gw-b:~$ sudo tee /etc/wireguard/wg0.conf >/dev/null <<'EOF'
[Interface]
Address = 10.99.0.2/24
ListenPort = 51820
PrivateKey = __GW_B_PRIVATE_KEY__
MTU = 1420
[Peer]
PublicKey = __GW_A_PUBLIC_KEY__
Endpoint = 198.51.100.10:51820
AllowedIPs = 10.99.0.1/32, 10.10.0.0/24
PersistentKeepalive = 25
EOF
Meaning: Symmetry. Each side is authoritative for its LAN subnet in AllowedIPs.
Decision: If either gateway is behind NAT and has a dynamic public IP, you’ll need a stable endpoint strategy (static IP, DNS name, or one side as “dial-out only”). Don’t pretend it’s stable.
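If you go the DNS-name route, remember that wg-quick resolves Endpoint exactly once, at interface start. A small cron-driven nudge keeps the tunnel following a renumbered peer. Minimal sketch, assuming a DNS name vpn-b.example.com that you keep updated and the peer’s public key saved at /etc/wireguard/peer-b.pub (both names are mine, not part of the configs above):
#!/bin/sh
# /usr/local/sbin/wg-refresh-endpoint.sh (hypothetical path)
# Re-resolve the peer's DNS name; 'wg set ... endpoint' resolves hostnames at call time.
PEER_KEY="$(cat /etc/wireguard/peer-b.pub)"
wg set wg0 peer "$PEER_KEY" endpoint vpn-b.example.com:51820
Run it from cron every couple of minutes; when the name hasn’t changed, re-setting the endpoint is harmless.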
Task 6: enable IP forwarding (both gateways)
cr0x@gw-a:~$ sudo sysctl -w net.ipv4.ip_forward=1
net.ipv4.ip_forward = 1
cr0x@gw-a:~$ echo 'net.ipv4.ip_forward=1' | sudo tee /etc/sysctl.d/99-wireguard-forward.conf
net.ipv4.ip_forward=1
Meaning: The gateway can route packets between interfaces.
Decision: If you forget this, the tunnel will look fine but LAN-to-LAN traffic will die quietly. Enable it permanently via sysctl drop-in.
Task 7: bring up the interface and confirm it’s running
cr0x@gw-a:~$ sudo systemctl enable --now wg-quick@wg0
Created symlink /etc/systemd/system/multi-user.target.wants/wg-quick@wg0.service → /lib/systemd/system/wg-quick@.service.
cr0x@gw-a:~$ sudo wg show
interface: wg0
public key: 7wz...REDACTED...nU=
private key: (hidden)
listening port: 51820
peer: X1p...REDACTED...b0=
endpoint: 203.0.113.20:51820
allowed ips: 10.99.0.2/32, 10.20.0.0/24
latest handshake: 18 seconds ago
transfer: 84.12 KiB received, 90.44 KiB sent
persistent keepalive: every 25 seconds
Meaning: “Latest handshake” is your first green light. If it says “(none),” you don’t have a tunnel yet—don’t waste time on LAN routes.
Decision: No handshake means focus on UDP reachability, endpoints, keys, and time skew (rare, but real). Handshake present means move up-stack to routing and firewall.
Task 8: verify the kernel created routes for AllowedIPs
cr0x@gw-a:~$ ip route show table main | grep -E '10\.20\.0\.0/24|10\.99\.0\.2'
10.20.0.0/24 dev wg0 scope link
10.99.0.2 dev wg0 scope link
Meaning: Packets destined for Office B LAN will be routed into wg0. This is essential.
Decision: If the route is missing, your config isn’t being applied by wg-quick, or you edited the wrong file. Fix before touching routers.
Task 9: open the firewall for WireGuard transport (UDP 51820)
cr0x@gw-a:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
chain input {
type filter hook input priority 0; policy drop;
iif "lo" accept
ct state established,related accept
tcp dport 22 accept
}
}
Meaning: Input policy is drop; UDP 51820 isn’t allowed. That’s why handshakes often don’t happen.
Decision: Add an explicit rule to permit UDP 51820 on the public interface.
cr0x@gw-a:~$ sudo nft add rule inet filter input iif "eth0" udp dport 51820 ct state new accept
cr0x@gw-a:~$ sudo nft list chain inet filter input
table inet filter {
chain input {
type filter hook input priority 0; policy drop;
iif "lo" accept
ct state established,related accept
tcp dport 22 accept
iif "eth0" udp dport 51820 ct state new accept
}
}
Meaning: We allow WireGuard transport from the internet. Still minimal exposure.
Decision: Mirror this on Gateway B. If either side is behind NAT, ensure port forwarding lands on the correct internal host.
Task 10: allow forwarding between LAN and wg0 (routing mode)
cr0x@gw-a:~$ sudo nft add table inet wgsite
cr0x@gw-a:~$ sudo nft add chain inet wgsite forward '{ type filter hook forward priority 0; policy drop; }'
cr0x@gw-a:~$ sudo nft add rule inet wgsite forward ct state established,related accept
cr0x@gw-a:~$ sudo nft add rule inet wgsite forward iif "eth1" oif "wg0" ip saddr 10.10.0.0/24 ip daddr 10.20.0.0/24 accept
cr0x@gw-a:~$ sudo nft add rule inet wgsite forward iif "wg0" oif "eth1" ip saddr 10.20.0.0/24 ip daddr 10.10.0.0/24 accept
cr0x@gw-a:~$ sudo nft list chain inet wgsite forward
table inet wgsite {
chain forward {
type filter hook forward priority 0; policy drop;
ct state established,related accept
iif "eth1" oif "wg0" ip saddr 10.10.0.0/24 ip daddr 10.20.0.0/24 accept
iif "wg0" oif "eth1" ip saddr 10.20.0.0/24 ip daddr 10.10.0.0/24 accept
}
}
Meaning: Only LAN-to-LAN traffic is allowed through the gateway. Everything else is dropped by default.
Decision: If you need specific services only, narrow it further using TCP ports. Start broad while testing, then tighten.
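When you tighten, the broad LAN-to-LAN accepts above shrink into service rules. A sketch, assuming Office A clients only need SMB and HTTPS on one Office B file server (the server IP comes from our examples; the port list is my assumption):
cr0x@gw-a:~$ sudo nft add rule inet wgsite forward iif "eth1" oif "wg0" ip saddr 10.10.0.0/24 ip daddr 10.20.0.60 tcp dport '{ 445, 443 }' accept
The established,related rule already in the chain handles return packets, so you only enumerate the initiating direction. Delete the broad accepts once the narrow rules are proven.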
Task 11: add static routes on the office routers (preferred)
This is vendor-specific, but the logic is always the same: tell the office router that the remote LAN is reachable via the local WireGuard gateway’s LAN IP.
If you can’t set it on the router, you can set it on a test host to validate.
cr0x@host-a:~$ ip route
default via 10.10.0.254 dev eth0
10.10.0.0/24 dev eth0 proto kernel scope link src 10.10.0.50
Meaning: No route to 10.20.0.0/24. The host will send that traffic to the default gateway, which likely has no clue.
Decision: Add a temporary route on the host for testing, then implement it properly on the office router.
cr0x@host-a:~$ sudo ip route add 10.20.0.0/24 via 10.10.0.1
cr0x@host-a:~$ ip route get 10.20.0.60
10.20.0.60 via 10.10.0.1 dev eth0 src 10.10.0.50 uid 1000
cache
Meaning: The host will now forward remote-LAN traffic to Gateway A (10.10.0.1).
Decision: If this makes everything work, you have proven the VPN and gateway forwarding are fine; now you just need proper routing at the network edge.
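If the device that owns Office A’s routing happens to be a Linux box managed with netplan, the persistent version of that temporary route is a few lines of YAML (the file name and the eth0 key are assumptions; match your renderer and interface):
# /etc/netplan/60-wg-site-routes.yaml (hypothetical file)
network:
  version: 2
  ethernets:
    eth0:
      routes:
        - to: 10.20.0.0/24
          via: 10.10.0.1   # Gateway A's LAN IP
Apply with sudo netplan apply, then re-run ip route get 10.20.0.60 to confirm nothing regressed.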
Task 12: test tunnel IP connectivity (gateway-to-gateway)
cr0x@gw-a:~$ ping -c 3 10.99.0.2
PING 10.99.0.2 (10.99.0.2) 56(84) bytes of data.
64 bytes from 10.99.0.2: icmp_seq=1 ttl=64 time=21.3 ms
64 bytes from 10.99.0.2: icmp_seq=2 ttl=64 time=20.9 ms
64 bytes from 10.99.0.2: icmp_seq=3 ttl=64 time=21.1 ms
--- 10.99.0.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 20.9/21.1/21.3/0.16 ms
Meaning: The encrypted path works and forwarding for the tunnel subnet is correct.
Decision: If this fails but handshake exists, you likely blocked ICMP on forwarding rules or you mis-set AllowedIPs for the tunnel IP. Fix policy first.
Task 13: test LAN-to-LAN connectivity from a host
cr0x@host-a:~$ ping -c 3 10.20.0.60
PING 10.20.0.60 (10.20.0.60) 56(84) bytes of data.
64 bytes from 10.20.0.60: icmp_seq=1 ttl=62 time=24.7 ms
64 bytes from 10.20.0.60: icmp_seq=2 ttl=62 time=24.1 ms
64 bytes from 10.20.0.60: icmp_seq=3 ttl=62 time=23.9 ms
--- 10.20.0.60 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
Meaning: Routing through the gateways works. TTL 62 means the reply was forwarded twice (once per gateway), which is exactly the path you built. Good.
Decision: If ping works but TCP fails, it’s likely firewall, MTU/MSS, or asymmetric routing. Don’t guess—measure.
Task 14: test TCP connectivity (because “ping works” is a trap)
cr0x@host-a:~$ nc -vz -w 2 10.20.0.60 445
Connection to 10.20.0.60 445 port [tcp/microsoft-ds] succeeded!
Meaning: At least the TCP handshake completes to SMB. That implies routing + stateful firewall allow it.
Decision: If it times out, check forward rules and any internal host firewall on the destination. If it’s “refused,” the service isn’t listening or ACLs block it—different problem.
Task 15: confirm traffic is actually flowing inside wg0
cr0x@gw-a:~$ sudo wg show wg0 transfer
peer: X1p...REDACTED...b0=
transfer: 2.31 MiB received, 2.88 MiB sent
Meaning: Counters moving confirms real traffic traversed the tunnel, not just a handshake.
Decision: If handshakes happen but transfer stays near zero during tests, traffic is missing the gateway (routing) or blocked pre-encryption (firewall on LAN side).
Task 16: use tcpdump to locate where packets die
cr0x@gw-a:~$ sudo tcpdump -ni eth1 host 10.20.0.60
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:14:31.120331 IP 10.10.0.50 > 10.20.0.60: ICMP echo request, id 1923, seq 1, length 64
Meaning: The gateway is receiving packets from the LAN destined for the remote LAN. Good.
Decision: Now check if they exit via wg0. If they enter on eth1 but never appear on wg0, your forward policy or routing is wrong.
cr0x@gw-a:~$ sudo tcpdump -ni wg0 host 10.20.0.60
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on wg0, link-type RAW (Raw IP), snapshot length 262144 bytes
10:14:31.120882 IP 10.10.0.50 > 10.20.0.60: ICMP echo request, id 1923, seq 1, length 64
Meaning: The packet is being routed into the tunnel. If the remote host doesn’t reply, the problem is on the far side (routing back, firewall, host policy).
Task 17: check reverse path on Gateway B
cr0x@gw-b:~$ ip route get 10.10.0.50
10.10.0.50 dev wg0 src 10.99.0.2 uid 0
cache
Meaning: Gateway B knows to send traffic back to Office A via the tunnel. If it says “via eth1,” you’ve created asymmetric routing and your firewall will punish you.
Decision: Fix AllowedIPs and routes until both sides agree on the path.
Task 18: hunt MTU issues with “do not fragment” pings
cr0x@host-a:~$ ping -M do -s 1380 -c 3 10.20.0.60
PING 10.20.0.60 (10.20.0.60) 1380(1408) bytes of data.
1388 bytes from 10.20.0.60: icmp_seq=1 ttl=62 time=25.4 ms
1388 bytes from 10.20.0.60: icmp_seq=2 ttl=62 time=24.9 ms
1388 bytes from 10.20.0.60: icmp_seq=3 ttl=62 time=25.2 ms
Meaning: Large packets survive without fragmentation. Good sign.
Decision: If you get “Frag needed,” lower wg MTU (e.g., 1380–1420) or clamp MSS on the gateways. Do this before blaming “the ISP.”
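If you’d rather measure the ceiling than guess, walk DF-ping payload sizes downward until one passes. A throwaway sketch (the size list is mine; the on-wire size is payload + 28 bytes for IPv4 ICMP):
cr0x@host-a:~$ for s in 1472 1452 1420 1392 1380 1352; do
>   ping -M do -s "$s" -c 1 -W 2 10.20.0.60 >/dev/null 2>&1 \
>     && echo "payload $s: passes" || echo "payload $s: blocked"
> done
The largest passing payload plus 28 is your effective path MTU; set the wg interface MTU comfortably below it, remembering WireGuard itself adds roughly 60 bytes of overhead over IPv4 (80 over IPv6).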
Task 19: clamp TCP MSS if you see stalls on large transfers
cr0x@gw-a:~$ sudo nft add table ip mangle
cr0x@gw-a:~$ sudo nft add chain ip mangle forward '{ type filter hook forward priority -150; policy accept; }'
cr0x@gw-a:~$ sudo nft add rule ip mangle forward oif "wg0" tcp flags syn tcp option maxseg size set 1360
cr0x@gw-a:~$ sudo nft list chain ip mangle forward
table ip mangle {
chain forward {
type filter hook forward priority -150; policy accept;
oif "wg0" tcp flags syn tcp option maxseg size set 1360
}
}
Meaning: New TCP connections exiting into the tunnel will advertise a smaller MSS, preventing PMTUD-dependent black holes.
Decision: Clamp only if you need it. It’s a pragmatic fix, but it’s also an admission that path MTU discovery is not reliable in your environment (often due to ICMP blocking).
Task 20: make sure systemd sees wg-quick healthy
cr0x@gw-a:~$ systemctl status wg-quick@wg0 --no-pager
● wg-quick@wg0.service - WireGuard via wg-quick(8) for wg0
Loaded: loaded (/lib/systemd/system/wg-quick@.service; enabled; preset: enabled)
Active: active (exited) since Tue 2025-12-09 10:01:12 UTC; 2h 8min ago
Docs: man:wg-quick(8)
man:wg(8)
Process: 1234 ExecStart=/usr/bin/wg-quick up wg0 (code=exited, status=0/SUCCESS)
Meaning: “active (exited)” is normal for oneshot services. The interface persists in the kernel.
Decision: If it’s failed, read journal logs. Don’t keep restarting; fix the config error (bad key formatting, missing interface address, etc.).
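Reading those logs, assuming default systemd journaling:
cr0x@gw-a:~$ sudo journalctl -u wg-quick@wg0 --no-pager -n 30
Key-format complaints, a missing Address line, or a bad Endpoint name all show up here verbatim, which beats re-running wg-quick by hand.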
Three corporate mini-stories from the trenches
Mini-story 1: the incident caused by a wrong assumption
A mid-size company merged with a smaller one and needed “a quick tunnel” so finance could reach an ERP system.
The new VPN went in over a weekend. On Monday, the tunnel was up, handshakes were fresh, and the dashboard glowed reassuringly green.
Users still couldn’t log in to the ERP app. The network team immediately blamed the application. The application team blamed the network.
The wrong assumption: “If the WireGuard tunnel is up, routing must be fine.” It wasn’t.
Office A had a static route to Office B on the gateway. Office B did not.
Their default gateway was a managed firewall where nobody had permission to add routes, so return traffic went to the internet edge, hit a drop rule, and vanished.
The tell was in tcpdump: packets entered the remote gateway’s wg0, then replies never returned through the tunnel.
The fix was boring: add the missing static route on Office B’s edge, and tighten firewall policy so only the needed ERP ports crossed.
Afterward, they wrote down a rule: never declare “VPN works” until you’ve verified return routing with ip route get on both gateways.
Lesson: handshake success proves encryption; it does not prove packet delivery, and it definitely does not prove symmetric routing.
Debug the path like an SRE: one hop at a time, one interface at a time, with evidence.
Mini-story 2: the optimization that backfired
Another organization decided the tunnel should be “as fast as possible,” so they cranked MTU up, removed MSS clamping, and enabled aggressive offloads on the NICs.
Throughput tests looked great in a controlled scenario. Then accounting started moving big PDFs over SMB and everything froze intermittently.
Not slow. Frozen. Classic.
The optimization: larger packets and fewer CPU cycles. The backfire: a real-world WAN path with a lower effective MTU and ICMP filtering.
Path MTU discovery failed silently. The result was a black hole where large TCP segments disappeared.
Small traffic (pings, tiny HTTP requests) worked. Big transfers died mid-stream, and the helpdesk got to enjoy that special kind of ticket: “Sometimes it works.”
The fix was to stop trying to outsmart the internet.
They set wg MTU to 1420, added a conservative MSS clamp, and kept offloads at defaults.
Performance became stable, which is the only performance metric users actually notice.
Lesson: “fast” that fails is slower than “boring” that works. Benchmark after you have correctness, not before.
Mini-story 3: the boring practice that saved the day
A company with two offices and a small data center ran WireGuard for years without drama.
The secret wasn’t magic configuration; it was routine.
Every change to AllowedIPs, firewall rules, or office routing went through a tiny runbook: test tunnel IP ping, test LAN-to-LAN ping, test a TCP port, and capture wg show output in the change record.
One day, an ISP swapped the modem in Office B, and the public-facing NAT rules changed.
The tunnel stopped handshaking. This could have been a two-hour finger-pointing session.
Instead, the on-call engineer followed the playbook and found “no handshake” in under a minute, then verified UDP port reachability was dead from the outside.
Because they had the old modem’s port-forward configuration documented, they replicated it on the new device quickly.
No deep packet sorcery, no vendor escalations, no midnight heroics.
The VPN came back, and most people never learned it had been down—which is the highest compliment operations can receive.
Lesson: do the boring paperwork while you’re calm. Your future self is an unreliable narrator under pressure.
Fast diagnosis playbook
When a site-to-site WireGuard link fails, you want to answer one question fast:
is the bottleneck transport/crypto, routing, or firewall/policy?
Here’s the order that minimizes thrash.
First: is there a handshake?
- Run wg show on both gateways.
- If “latest handshake” is recent on both sides, transport and keys are likely fine.
- If no handshake:
  - Check UDP 51820 reachability and NAT/port-forwarding (see the probe sketch below).
  - Confirm endpoints are correct and public IPs didn’t change.
  - Confirm keys match the peer you think they match.
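To prove UDP reachability without guessing, watch the listening side while the far side sends anything at all. A minimal sketch (the probe payload is junk on purpose; WireGuard will ignore it, you only care whether it arrives):
cr0x@gw-a:~$ sudo tcpdump -ni eth0 udp port 51820
(meanwhile, from the other site)
cr0x@gw-b:~$ echo probe | nc -u -w1 198.51.100.10 51820
If the datagram never appears on eth0, the problem lives upstream of WireGuard: NAT port-forwarding, ISP filtering, or a wrong public IP.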
Second: can the gateways reach each other’s wg IP?
- Ping 10.99.0.2 from Gateway A and 10.99.0.1 from Gateway B.
- If this fails but handshake exists, suspect:
  - AllowedIPs missing the tunnel /32
  - Forward rules dropping ICMP or all forwarding
  - Policy routing conflicts
Third: is LAN traffic reaching the gateway and entering wg0?
- On Gateway A, run both captures at once:
  - tcpdump -ni eth1 host 10.20.0.60 while pinging from a LAN host
  - tcpdump -ni wg0 host 10.20.0.60 at the same time
- Interpretation:
  - Seen on eth1, not on wg0: routing/forward firewall on Gateway A.
  - Seen on wg0 on A, not seen on wg0 on B: transport path/endpoint/NAT.
  - Seen on wg0 on B, replies not returning: routing back or firewall on B or destination host.
Fourth: chase MTU only after the basics
- If ping works, TCP handshake works, but big transfers stall: test DF pings and clamp MSS.
- Don’t “tune MTU” as your first move. That’s how you waste an afternoon and learn nothing.
Joke #2: MTU bugs are like vampires—invite them in by blocking ICMP, and they’ll live in your network forever.
Common mistakes: symptoms → root cause → fix
1) Symptom: “Latest handshake: (none)” on one or both gateways
- Root cause: UDP port blocked, wrong endpoint IP/port, NAT forwarding missing, or incorrect peer public key.
- Fix: Permit UDP 51820 inbound on the public interface, confirm port-forwarding to the gateway, verify peer public keys and endpoints match reality, and consider PersistentKeepalive=25 if behind NAT.
2) Symptom: handshake works, but gateways can’t ping each other’s wg IP
- Root cause: Missing tunnel /32 in AllowedIPs, or forward policy drops ICMP/forwarding.
- Fix: Ensure each peer’s AllowedIPs includes the other tunnel IP (e.g., 10.99.0.2/32), and allow forwarding between wg0 and the LAN.
3) Symptom: office hosts can’t reach remote LAN, but gateways can
- Root cause: Office router lacks static route to the remote subnet via the WireGuard gateway.
- Fix: Add static routes on the office routers, not on random hosts, and verify with ip route get from a client.
4) Symptom: one-way connectivity (A→B works, B→A fails)
- Root cause: Asymmetric routing or firewall state tracking dropping return packets.
- Fix: Verify routes on both gateways and on both office routers. Confirm forward rules permit both directions. Use tcpdump on both ends to prove where return traffic stops.
5) Symptom: ping works, small TCP works, large transfers hang
- Root cause: MTU mismatch and broken PMTUD, often due to ICMP filtering on the path.
- Fix: Lower wg MTU (start at 1420) and clamp MSS on SYN packets exiting into wg0. Confirm with DF pings.
6) Symptom: random drops every few minutes, especially behind NAT
- Root cause: NAT idle timeouts expiring UDP mappings.
- Fix: Set PersistentKeepalive=25 on the side behind NAT (or both sides if uncertain). Verify handshake timestamps remain fresh.
7) Symptom: remote LAN becomes reachable, but DNS or AD behaves strangely
- Root cause: You NATed traffic between sites and broke identity assumptions, or you forgot to route DNS requests to the right resolvers.
- Fix: Prefer routed mode. If you must NAT, document it and adjust application expectations (ACLs, logs). Ensure DNS forwarding and conditional zones align with the new connectivity.
8) Symptom: WireGuard works until you reboot, then it doesn’t
- Root cause: Config not enabled as a service, firewall rules not persisted, or sysctl forwarding not persisted.
- Fix: Enable wg-quick@wg0 and persist nftables/sysctl configuration properly. Reboot in a maintenance window and verify.
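On Debian/Ubuntu with the stock nftables package, persistence is one snapshot plus one service (verify with an actual reboot before trusting it; the flush line keeps reloads idempotent):
cr0x@gw-a:~$ sudo sh -c 'echo "flush ruleset" > /etc/nftables.conf; nft list ruleset >> /etc/nftables.conf'
cr0x@gw-a:~$ sudo systemctl enable nftables
The wg-quick unit and the sysctl drop-in from Task 6 already cover the rest.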
Hardening, operations, and reliability moves
Keep AllowedIPs minimal and intentional
AllowedIPs is not just “what can pass.” It also influences routing decisions.
If you accidentally claim a subnet on the wrong peer, Linux will happily route traffic into the tunnel and you’ll “debug the wrong office” for hours.
Keep it to:
- Peer tunnel IP (/32)
- Peer LAN subnet(s) (e.g., 10.20.0.0/24)
Prefer routed mode; use NAT only as a controlled compromise
If you must NAT because of overlapping subnets, do it explicitly and document the translation scheme.
Typical pattern: Office B appears to Office A as 10.120.0.0/24 even though it’s actually 192.168.1.0/24.
That requires NAT rules and careful DNS/service mapping. It’s doable. It’s also debt.
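For completeness, the 1:1 prefix translation itself is small; the debt is everything around it. A sketch using the long-standing iptables NETMAP target on Gateway B, with the example prefixes from above (if you run pure nftables, check your version’s netmap support instead):
# Traffic arriving from the tunnel for the translated prefix maps to the real LAN
cr0x@gw-b:~$ sudo iptables -t nat -A PREROUTING -i wg0 -d 10.120.0.0/24 -j NETMAP --to 192.168.1.0/24
# Replies leaving into the tunnel get the translated prefix as their source
cr0x@gw-b:~$ sudo iptables -t nat -A POSTROUTING -o wg0 -s 192.168.1.0/24 -j NETMAP --to 10.120.0.0/24
Remember that Gateway A’s AllowedIPs must then carry 10.120.0.0/24, the translated prefix, not the real one.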
Logging: capture the right evidence
WireGuard itself doesn’t spam logs, which is good for your disks and bad for your assumptions.
Log at the firewall boundary where policy decisions happen. When a ticket arrives, you want to know:
did the packet enter the gateway, did it enter the tunnel, did it leave the tunnel?
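A cheap way to produce exactly that evidence, assuming the wgsite table from Task 10 (the rate limit keeps one noisy scan from flooding the journal):
cr0x@gw-a:~$ sudo nft insert rule inet wgsite forward ct state new limit rate 10/second log prefix \"wg-fwd: \"
The rule has no verdict, so packets continue to the accept/drop logic below it; you get a log line per new connection attempt crossing the boundary.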
Monitoring: measure what matters
- Handshake age (wg show; scriptable, see the sketch below)
- Transfer counters moving (sent/received)
- Packet loss/latency between wg IPs
- Service-level checks across the tunnel (TCP port checks for your real apps)
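Handshake age is the most useful of those, and wg makes it scriptable. A minimal freshness probe, assuming anything older than three minutes deserves attention (the threshold is mine):
#!/bin/sh
# 'wg show wg0 latest-handshakes' prints: <peer-public-key><TAB><unix-timestamp>
# A timestamp of 0 means the peer has never completed a handshake.
now=$(date +%s)
wg show wg0 latest-handshakes | while read -r peer ts; do
    age=$(( now - ts ))
    [ "$age" -gt 180 ] && echo "stale handshake: peer $peer is ${age}s old"
done
Wire it into whatever alerting you already run; “handshake went stale” should page before “users can’t print” does.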
Change management: treat VPN changes like production changes
The tunnel is now part of your production network. Changes to it can take down payroll, inventory, or authentication.
Use staged rollouts: test from gateways first, then from a single client host, then roll static routes to the whole office.
High availability (optional, but you should think about it)
WireGuard itself doesn’t do clustering. You can still build redundancy:
- Two gateways per office with VRRP/keepalived for the LAN next-hop
- Multiple peers and selective routing (more complex)
- Fast replacement: immutable configs and quick redeploy
If you are small, don’t over-engineer. A single well-managed gateway with spares and a tested restore procedure beats a half-baked HA design.
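If you do take the VRRP route, the LAN next-hop piece is the easy half; the hard half is keeping keys, configs, and state expectations in sync across both boxes. A keepalived sketch for the Office A pair, assuming 10.10.0.1 becomes a floating address rather than a fixed one (virtual_router_id and priority are arbitrary picks):
# /etc/keepalived/keepalived.conf on the primary Office A gateway (sketch)
vrrp_instance WG_LAN {
    state MASTER
    interface eth1
    virtual_router_id 51
    priority 150
    advert_int 1
    virtual_ipaddress {
        10.10.0.1/24
    }
}
The standby runs the same block with state BACKUP and a lower priority. Static routes on the office router keep pointing at 10.10.0.1 and follow whichever box holds it.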
FAQ
1) Do I need a dedicated tunnel subnet like 10.99.0.0/24?
Yes. Give the WireGuard interfaces their own addresses. It simplifies routing, monitoring, and firewall policy.
A /30 or /31 works too, but /24 is fine and readable.
2) What exactly does AllowedIPs do?
Two things: it defines which destination IPs are routed to a peer, and it defines which source IPs are accepted from that peer.
Treat it like a combined routing prefix list and access control list.
3) Should I set PersistentKeepalive on both sides?
Set it on the side behind NAT (or where you suspect UDP mappings expire). If you don’t know, setting it on both sides is acceptable.
It’s a small, periodic packet; the cost is trivial compared to outage time.
4) Can I run WireGuard on the office router directly?
If the router supports it well and you can manage it safely, yes. Many teams still prefer a Linux gateway because it’s debuggable and automatable.
Pick the platform you can operate at 3 a.m., not the one that looks neat in a diagram.
5) Why does “handshake successful” not guarantee traffic works?
Because the handshake only proves the peers can exchange authenticated messages.
Your LAN traffic can still fail due to missing routes, IP forwarding disabled, firewall forward policies, or MTU black holes.
6) Do I need NAT for a site-to-site WireGuard VPN?
No, not if you control routing on both sides and the LAN subnets don’t overlap.
NAT is a workaround for constraints (overlap, no route control), not a default.
7) How do I handle overlapping subnets between offices?
Best fix: renumber one site. Yes, it’s painful; it’s also the clean solution.
If you cannot renumber, use NAT with a dedicated translated prefix, update DNS expectations, and be ready for odd app behavior.
8) What MTU should I use?
Start with 1420. If you have PPPoE, LTE, or you see black-hole symptoms, drop further and/or clamp MSS.
Validate with DF pings and real application transfers.
9) Can WireGuard do dynamic routing (OSPF/BGP) over the tunnel?
WireGuard doesn’t provide routing protocols itself, but you can run OSPF/BGP across it like any other IP link.
For two offices and a couple subnets, static routes are usually simpler and more predictable.
10) What’s the safest way to rotate keys?
Add a new peer key in parallel (where possible), validate handshake and traffic, then remove the old key.
Avoid “swap keys on both sides at once” unless you enjoy coordinated outages.
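Mechanically, rotation is peer management with wg set. One caveat the “where possible” above hints at: a given AllowedIPs prefix belongs to exactly one peer per interface, so assigning it to the new key silently takes it from the old one; “parallel” is really a fast cutover. A sketch on Gateway A, assuming the new and old Gateway B public keys sit in new-b.pub and old-b.pub (my file names):
cr0x@gw-a:~$ sudo wg set wg0 peer "$(cat new-b.pub)" endpoint 203.0.113.20:51820 persistent-keepalive 25 allowed-ips 10.99.0.2/32,10.20.0.0/24
cr0x@gw-a:~$ sudo wg show wg0 latest-handshakes
cr0x@gw-a:~$ sudo wg set wg0 peer "$(cat old-b.pub)" remove
Wait for a fresh handshake on the new key before removing the old peer, and update /etc/wireguard/wg0.conf to match, or the next wg-quick restart resurrects the old identity.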
Conclusion: practical next steps
If you want this to work on the first try, build it in layers: transport, tunnel IPs, routing, firewall, then application tests.
Don’t skip straight to “it must be DNS” before you’ve proven packets traverse wg0 in both directions.
- Deploy the configs and confirm handshakes on both gateways.
- Confirm routes exist for the remote LAN subnets on both gateways.
- Add static routes on office routers so clients actually use the tunnel.
- Lock down forwarding rules to only what you need, and add minimal logging.
- Run the fast diagnosis playbook once while everything is healthy, and record the expected outputs.