You notice it first in the little things: file shares feel “sticky,” VoIP calls between offices sound like someone is talking through a fishtank, and the CFO’s favorite dashboard loads just slowly enough to be blamed on “the cloud.”
Then you look at the path and realize you’re hairpinning traffic from Office A to Office B through a central data center three states away because that’s how the VPN was built in 2017 and nobody wants to touch it.
Full mesh WireGuard looks like the cure: just connect every office directly to every other office. Latency drops, internet breakouts stay local, and you stop paying the “central hub tax.”
It’s also a good way to build an operational snowflake that explodes the moment you add your 13th site.
What you’re really deciding: topology and failure domains
People talk about “full mesh” like it’s a product feature. It isn’t. It’s a bet.
The bet is that reducing path length (and avoiding a hub choke point) is worth increasing the number of relationships you must manage: keys, endpoints, routes, firewall rules, monitoring, and incident blast radius.
WireGuard itself is not the hard part. WireGuard is a clean, minimal protocol with a small codebase and predictable behavior.
The hard part is the system around it: routing policy, NAT reality, address management, MTU/fragmentation edge cases, and humans doing change management at scale.
The scaling math nobody wants on a slide
In a full mesh with n sites, you have n(n-1)/2 tunnels (or peer relationships) if everyone directly peers with everyone.
That’s 10 sites → 45 relationships. 20 sites → 190. 50 sites → 1225. The growth is quadratic, and your patience is not.
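If you want that math handy in a planning meeting, the same table falls out of a one-line shell loop (the site counts are just the ones from the paragraph above):
cr0x@server:~$ for n in 10 20 50; do echo "$n sites -> $(( n * (n - 1) / 2 )) tunnels"; done
10 sites -> 45 tunnels
20 sites -> 190 tunnels
50 sites -> 1225 tunnels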
You can automate config generation and distribution. You should. But automation doesn’t remove operational complexity; it just lets you create complexity faster.
What WireGuard does well (and what it doesn’t pretend to do)
WireGuard gives you authenticated encryption, a simple interface model, and a nice property: if there’s no traffic, there’s basically no state.
But WireGuard does not do dynamic routing. It does not do path selection. It does not tell you “peer B is down; use peer C.”
Those behaviors come from your routing layer (static routes, BGP/OSPF via FRR, policy routing, SD-WAN logic) and your network design.
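A tiny sketch of that division of labor, assuming you manage the interface by hand rather than with wg-quick (which installs routes from AllowedIPs for you) and borrowing the 10.60.x.x addressing used in the task examples below: the OS route decides that a destination enters the tunnel, and the peer’s AllowedIPs decides which peer it goes to.
cr0x@server:~$ sudo ip route add 10.60.2.0/24 dev wg0
cr0x@server:~$ sudo wg set wg0 peer <office-b-public-key> allowed-ips 10.60.2.0/24,10.60.200.2/32
Note that wg set replaces the peer’s AllowedIPs list rather than appending to it, and none of this survives a restart unless it also lands in the config. WireGuard never learns, prefers, or withdraws any of it on its own.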
The reliable systems lesson here is old and still true: keep the control plane simple, but don’t pretend you don’t have one.
One quote to keep you honest
Hope is not a method.
— General Gordon R. Sullivan
It applies to VPN topologies beautifully. If your plan depends on “the hub probably won’t have issues,” you have a plan made of wishes.
When direct office-to-office tunnels are worth it
Full mesh can be the right answer. Not because it’s trendy, but because it solves specific physics and organizational problems.
Here are the cases where I’ll recommend it without making the “are you sure” face.
1) You have real-time, site-to-site traffic that suffers from hairpin latency
VoIP, video, VDI, remote desktop to on-prem apps, OT/SCADA telemetry, and anything interactive.
If Office A and Office B are 8 ms apart, and your hub is a 40 ms detour, the math is unkind.
Jitter is often worse than raw latency, and hub congestion adds jitter for free.
A direct tunnel gives you a consistent path that doesn’t compete with unrelated traffic bound for the hub.
2) Your hub is a bottleneck you can’t (or shouldn’t) scale endlessly
Sometimes the hub is a pair of virtual firewalls with licenses priced like a small yacht. Sometimes it’s a cramped DC circuit that procurement won’t upgrade until next quarter, meaning “never.”
Full mesh spreads load and reduces the “all roads lead to Rome” failure mode.
3) You need local internet breakouts but still want strong site-to-site connectivity
If every office backhauls internet through a central egress, you get consistent security controls and easy logging. You also get terrible SaaS performance and fun times with asymmetric routing.
A mesh can let offices keep their own egress while preserving direct reachability for internal services.
4) Your WAN transport is heterogeneous and you want path diversity
Offices might have fiber, cable, LTE, or weird “business broadband” that behaves like it’s allergic to uptime.
With multiple WANs per site, you can do multiple WireGuard peers and policy routing to keep traffic off the worst link.
Mesh doesn’t automatically give you traffic engineering, but it gives you more path options to engineer.
5) You have a small number of sites and a strong automation culture
Full mesh is manageable at 3–8 sites. It can be fine at ~10–15 if you generate configs centrally, track IP allocations, enforce review, and test changes.
Beyond that, you need a serious plan for routing and config distribution or you’re building a Rube Goldberg machine that’s one coffee spill away from downtime.
6) You need to contain failures to pairs of sites
In a hub-and-spoke design, hub instability becomes everyone’s problem. In a mesh, a tunnel flap affects the two endpoints.
That narrower blast radius can be the difference between “a ticket” and “a company-wide incident call.”
7) You’re migrating off MPLS and the old “any-to-any” expectation is baked in
Organizations that grew up on MPLS expect that any office can talk to any other office. When you replace that with hub-and-spoke, users notice.
A mesh (or partial mesh with routing) can match expectations with fewer political battles.
When full mesh is a trap
Mesh fails in predictable ways. The failures are not exotic crypto bugs; they’re “humans and routing tables” failures.
Here’s when I’d avoid a full mesh and pick a different pattern.
1) You have more sites than you have operational attention
If you’re heading toward 20+ locations, full mesh without dynamic routing and strong automation becomes a maintenance hazard.
Key rotation alone turns into a scheduling problem with calendar invites and awkward escalation emails.
Also, when every change touches dozens of peers, every change is a “big change,” which means fewer changes, which means more drift, which means worse outages.
2) Your sites are behind hostile NAT or frequently changing IPs
WireGuard can work behind NAT, but it still needs a way to find the other side.
If both ends are behind carrier-grade NAT with changing addresses and no stable rendezvous (or you can’t use a relay), direct tunnels become an uptime hobby.
3) You need clean segmentation or compliance boundaries
Full mesh tends to encourage “everything can reach everything” unless you put in serious routing/firewall policy.
If you’re in a regulated environment, or you have business units that should not share networks, a mesh becomes a policy enforcement puzzle.
The more edges you have, the more places you can accidentally permit traffic.
4) You can’t tolerate human error in AllowedIPs
AllowedIPs is both WireGuard’s access control and its routing selector. That’s elegant. It’s also a sharp knife.
A mistaken AllowedIPs entry can blackhole traffic, steal traffic from the right peer, or create routing loops when combined with OS routes.
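A minimal sketch of the difference, written as wg-quick-style peer entries (keys are placeholders; subnets reuse the 10.60.x.x examples from later in the article). The precise entry carries exactly one office; the broad summary quietly absorbs any destination that no other peer matches more specifically.
# Precise: Office B's LAN plus its overlay address, nothing else
[Peer]
PublicKey = <office-b-public-key>
AllowedIPs = 10.60.2.0/24, 10.60.200.2/32

# Risky: a convenience summary that will steal traffic for every 10.60.x.x
# destination not claimed more specifically by another peer
# AllowedIPs = 10.60.0.0/16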
5) Your team needs “set and forget”
If the networking bench is shallow and changes are rare, prefer a design with fewer moving parts:
hub-and-spoke, dual hubs, or a small number of regional hubs.
You’re not less professional for choosing boring. You’re more employed.
Short joke #1: Full mesh is like group chat—great with five people, and a disaster once everyone’s aunt gets added.
Design options that aren’t “mesh or nothing”
Hub-and-spoke (single hub) — the starter kit
Benefits: simplest to reason about, easiest to secure centrally, easiest monitoring.
Costs: hub becomes throughput bottleneck and failure domain; hairpin latency; hub circuit costs.
Use it when: you have few inter-office flows, mostly “branch to data center” traffic, and the hub is robust and redundant.
Dual hub — the “grown-up” version
Two hubs (preferably in different regions/providers) with branches connecting to both. Use routing preference and failover.
This reduces the single point of failure and can cut hairpin latency if you choose hubs wisely.
This is often the best value: you avoid quadratic growth while improving resilience.
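One way to express the preference with plain static routes, assuming two tunnel interfaces (the names wg-hub1 and wg-hub2 are illustrative) and a summarized internal range: lower metric wins, the higher metric is the fallback.
cr0x@server:~$ sudo ip route add 10.60.0.0/16 dev wg-hub1 metric 50
cr0x@server:~$ sudo ip route add 10.60.0.0/16 dev wg-hub2 metric 100
Keep in mind this only fails over if the primary route actually disappears; a WireGuard interface stays “up” even when its tunnel is dead, so real failover needs health checks or a routing daemon doing the demotion for you.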
Regional hubs (partial mesh) — latency without insanity
Create 3–6 regional aggregation points; offices mesh within a region or connect to their nearest region. Inter-region traffic goes hub-to-hub.
You keep most latency wins and reduce peer count dramatically.
Mesh with dynamic routing (FRR + BGP) — the “serious network” option
If you insist on mesh at scale, don’t do it with static routing and hope.
Run a routing daemon (commonly FRR) and exchange routes over WireGuard. Let the routing layer handle reachability and failover.
Caveat: you’ve now introduced an actual routing control plane. You must secure it, monitor it, and keep it consistent. It’s worth it, but it’s not free.
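As a starting shape (not a drop-in config), here is a minimal frr.conf sketch with made-up ASNs (65001 locally, 65002 for the Office B neighbor), the overlay addressing from the task examples below, strict inbound filtering, and a prefix limit:
router bgp 65001
 neighbor 10.60.200.2 remote-as 65002
 neighbor 10.60.200.2 description office-b-over-wg0
 address-family ipv4 unicast
  network 10.60.1.0/24
  neighbor 10.60.200.2 prefix-list FROM-OFFICE-B in
  neighbor 10.60.200.2 maximum-prefix 20
 exit-address-family
!
ip prefix-list FROM-OFFICE-B seq 10 permit 10.60.2.0/24
ip prefix-list FROM-OFFICE-B seq 90 deny 0.0.0.0/0 le 32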
Overlay + relay for NATed sites — accept reality
Some sites cannot do stable inbound connectivity. In that case, consider a relay (a VPS or hub) that both ends can reach.
It’s not “pure mesh,” but it gives you working connectivity with predictable behavior.
Interesting facts and historical context
- WireGuard was created by Jason A. Donenfeld in the mid-2010s with a goal of being small, modern, and auditable compared to traditional VPN stacks.
- Linux kernel integration landed in 2020, which materially changed its operational story: upgrades and performance got easier for many shops.
- WireGuard’s handshake is built on the Noise protocol framework, favoring modern primitives and minimal negotiation complexity.
- It intentionally avoids algorithm agility—a controversial choice historically, but one that reduces downgrade and configuration complexity in real deployments.
- AllowedIPs is a dual-purpose mechanism (routing + access control). That design is elegant and also the top source of “why is traffic going there?” tickets.
- Full mesh VPNs predate WireGuard by decades: early corporate WAN designs tried any-to-any connectivity long before SD-WAN made it marketing-friendly.
- MTU pain is older than VPNs. Fragmentation issues appeared in early IP networks; VPN encapsulation just reintroduces the same physics with new wrappers.
- “Hub-and-spoke” became popular partly due to firewall centralization—it’s easier to audit one choke point than 40 edges, especially under compliance pressure.
- Operational failure modes tend to dominate crypto failure modes in mature VPNs: routing leaks, NAT assumptions, and monitoring gaps cause more outages than cryptography.
Practical tasks: commands, outputs, and decisions
These are the tasks I actually run when I’m deciding whether a mesh is behaving, misbehaving, or quietly plotting against me.
Each task includes: a command, example output, what it means, and the decision you make.
Task 1: Confirm WireGuard is up and handshaking
cr0x@server:~$ sudo wg show
interface: wg0
  public key: pGZ2...Qm8=
  private key: (hidden)
  listening port: 51820

peer: 7y8N...aZk=
  endpoint: 198.51.100.10:51820
  allowed ips: 10.60.1.0/24, 10.60.200.2/32
  latest handshake: 32 seconds ago
  transfer: 1.21 GiB received, 980.45 MiB sent
  persistent keepalive: every 25 seconds
What it means: A recent “latest handshake” plus moving transfer counters indicates the tunnel is alive.
If handshake is “never,” you have endpoint reachability, NAT, firewall, or key mismatch problems.
Decision: If handshake is stale/never, don’t touch routing yet. Fix reachability first (firewall, UDP, endpoints, keepalive).
Task 2: Verify the interface has the right addresses
cr0x@server:~$ ip -br addr show dev wg0
wg0 UNKNOWN 10.60.200.1/24
What it means: The overlay IP is configured. “UNKNOWN” is normal for WireGuard interfaces.
Decision: If the address is missing or wrong, fix the local config before blaming the network.
Task 3: Confirm the kernel routes match your intent
cr0x@server:~$ ip route show table main | grep -E 'wg0|10\.60\.'
10.60.1.0/24 dev wg0 scope link
10.60.2.0/24 dev wg0 scope link
10.60.200.0/24 dev wg0 proto kernel scope link src 10.60.200.1
What it means: Routes exist pointing subnets to wg0. If you’re using wg-quick, it will install routes based on AllowedIPs.
Decision: If a subnet route is missing, check AllowedIPs for the peer that should carry it, or your routing daemon if using BGP.
Task 4: Identify “route stealing” from overlapping AllowedIPs
cr0x@server:~$ sudo wg show wg0 allowed-ips
7y8N...aZk= 10.60.1.0/24 10.60.200.2/32
Jp2c...9Qs= 10.60.0.0/16 10.60.200.3/32
What it means: One peer claims 10.60.0.0/16, which overlaps 10.60.1.0/24. WireGuard chooses the most specific prefix, but broad prefixes are a common foot-gun.
Decision: Remove broad AllowedIPs unless you’re intentionally doing a “default via this peer” design. Be explicit per-site.
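If the /16 is a mistake, a hedged example of tightening it in place, assuming that peer should only carry Office B’s /24 and its overlay /32 (the key is a placeholder; wg set replaces the peer’s whole AllowedIPs list):
cr0x@server:~$ sudo wg set wg0 peer <peer-public-key> allowed-ips 10.60.2.0/24,10.60.200.3/32
Re-run wg show wg0 allowed-ips to confirm, then fix the peer’s entry in the config itself (or wg-quick save wg0) so the correction survives a restart.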
Task 5: Check if NAT or firewall blocks UDP/51820
cr0x@server:~$ sudo ss -ulpn | grep 51820
UNCONN 0 0 0.0.0.0:51820 0.0.0.0:* users:(("wg",pid=1322,fd=6))
What it means: The machine is listening on UDP/51820 on all interfaces.
Decision: If nothing is listening, you’re debugging a service/config issue, not a network path issue.
Task 6: Validate firewall rules for the WireGuard port
cr0x@server:~$ sudo nft list ruleset | grep -n '51820'
117: udp dport 51820 accept
What it means: There’s an allow rule. If you don’t see one, assume it’s blocked until proven otherwise.
Decision: Add explicit allow rules and log drops temporarily if you’re chasing intermittent handshakes.
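If no rule exists, a sketch of adding one with a counter so you can watch hits while chasing intermittent handshakes. It assumes an inet-family table named filter with an input chain; adjust the names to your ruleset, and remember that nft add appends to the end of the chain, so use nft insert instead if an earlier rule already drops the traffic.
cr0x@server:~$ sudo nft add rule inet filter input udp dport 51820 counter accept
cr0x@server:~$ sudo nft list chain inet filter input | grep 51820
udp dport 51820 counter packets 187 bytes 27940 accept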
Task 7: Test basic reachability across the tunnel (ICMP)
cr0x@server:~$ ping -c 3 10.60.200.2
PING 10.60.200.2 (10.60.200.2) 56(84) bytes of data.
64 bytes from 10.60.200.2: icmp_seq=1 ttl=64 time=8.12 ms
64 bytes from 10.60.200.2: icmp_seq=2 ttl=64 time=8.25 ms
64 bytes from 10.60.200.2: icmp_seq=3 ttl=64 time=8.09 ms
--- 10.60.200.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2005ms
rtt min/avg/max/mdev = 8.089/8.155/8.246/0.070 ms
What it means: Basic L3 connectivity works and latency is in the expected range.
Decision: If ping works but apps don’t, suspect MTU, TCP MSS, or firewall policy on forwarded traffic.
Task 8: Find MTU trouble with DF pings
cr0x@server:~$ ping -M do -s 1420 -c 2 10.60.200.2
PING 10.60.200.2 (10.60.200.2) 1420(1448) bytes of data.
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420
--- 10.60.200.2 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1015ms
What it means: The local interface MTU (or a path MTU limitation) prevents that packet size. Encapsulation overhead makes this common.
Decision: Reduce wg0 MTU (for example 1380 or 1360), or clamp TCP MSS on forwarded traffic. Do not guess—test sizes until it passes.
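A hedged example of the usual fix, using 1380 as a conservative starting point rather than a law: set it live, verify with a DF ping sized to fit, then persist it so a restart doesn’t quietly undo your outage fix.
cr0x@server:~$ sudo ip link set dev wg0 mtu 1380
cr0x@server:~$ ping -M do -s 1352 -c 2 10.60.200.2
A 1352-byte payload plus 28 bytes of ICMP/IP headers is exactly 1380. For persistence, add MTU = 1380 to the [Interface] section of the wg-quick config (or to your config generator’s template).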
Task 9: Inspect path and where latency spikes (traceroute)
cr0x@server:~$ traceroute -n 10.60.2.10
traceroute to 10.60.2.10 (10.60.2.10), 30 hops max, 60 byte packets
1 10.60.200.2 8.431 ms 8.207 ms 8.189 ms
2 10.60.2.10 8.922 ms 8.743 ms 8.701 ms
What it means: Two-hop path indicates clean routing via the peer. If you see unexpected hops (like a hub), your routing policy isn’t doing what you think.
Decision: If traffic hairpins, fix route preference (metrics, policy routing, BGP local-pref) or remove conflicting routes.
Task 10: Confirm forwarding is enabled (classic gotcha)
cr0x@server:~$ sysctl net.ipv4.ip_forward net.ipv6.conf.all.forwarding
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 0
What it means: IPv4 forwarding is on. IPv6 forwarding is off (which might be correct or might explain IPv6-only failures).
Decision: If you’re routing between office LANs, you generally want forwarding enabled on your VPN routers. For IPv6, decide intentionally; don’t accidentally blackhole it.
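If this box is supposed to route between office LANs, a sketch of enabling IPv4 forwarding now and persisting it across reboots (the drop-in filename is arbitrary):
cr0x@server:~$ sudo sysctl -w net.ipv4.ip_forward=1
net.ipv4.ip_forward = 1
cr0x@server:~$ echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-wg-router.conf
net.ipv4.ip_forward = 1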
Task 11: Watch packets to prove where they die
cr0x@server:~$ sudo tcpdump -ni wg0 host 10.60.2.10 and tcp port 445 -c 6
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on wg0, link-type RAW (Raw IP), snapshot length 262144 bytes
12:01:14.220123 IP 10.60.1.50.51234 > 10.60.2.10.445: Flags [S], seq 2209001, win 64240, options [mss 1360,sackOK,TS val 111 ecr 0], length 0
12:01:15.221004 IP 10.60.1.50.51234 > 10.60.2.10.445: Flags [S], seq 2209001, win 64240, options [mss 1360,sackOK,TS val 112 ecr 0], length 0
12:01:16.221998 IP 10.60.1.50.51234 > 10.60.2.10.445: Flags [S], seq 2209001, win 64240, options [mss 1360,sackOK,TS val 113 ecr 0], length 0
What it means: SYN retransmits with no SYN-ACK. Traffic enters the tunnel but doesn’t get answered.
That’s usually a firewall on the far side, missing return route, or asymmetric routing causing replies to exit elsewhere.
Decision: Check return path: routes on the destination LAN, firewall policies, and NAT rules. If replies aren’t coming back via the same path, fix routing symmetry.
Task 12: Detect asymmetric routing with policy rules
cr0x@server:~$ ip rule show
0: from all lookup local
1000: from 10.60.2.0/24 lookup 100
32766: from all lookup main
32767: from all lookup default
What it means: There’s a policy rule forcing traffic sourced from 10.60.2.0/24 to use table 100. If table 100 routes out the internet instead of wg0, you’ve built asymmetry.
Decision: Ensure policy routing matches your VPN design. If you do local internet breakout, be explicit about return paths for inter-office traffic.
Task 13: Validate rp_filter isn’t dropping legitimate forwarded traffic
cr0x@server:~$ sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.wg0.rp_filter
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.wg0.rp_filter = 1
What it means: Strict reverse path filtering can drop packets when return paths don’t match the kernel’s idea of “best route.” This bites multi-homed and policy-routed designs.
Decision: For complex routing, set rp_filter to 2 (loose) on relevant interfaces, or redesign for symmetry. Don’t just disable it everywhere without thinking.
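A sketch of the loose setting scoped to the tunnel interface. The kernel uses the maximum of the “all” and per-interface values, so setting wg0 to 2 relaxes the check there even while “all” stays strict; persist it in /etc/sysctl.d/ and write down why it exists.
cr0x@server:~$ sudo sysctl -w net.ipv4.conf.wg0.rp_filter=2
net.ipv4.conf.wg0.rp_filter = 2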
Task 14: Measure real throughput and spot CPU or MTU ceilings
cr0x@server:~$ iperf3 -c 10.60.200.2 -P 4 -t 10
Connecting to host 10.60.200.2, port 5201
[SUM] 0.00-10.00 sec 1.05 GBytes 902 Mbits/sec 12 sender
[SUM] 0.00-10.00 sec 1.04 GBytes 893 Mbits/sec receiver
What it means: You’re getting ~900 Mbps with low retransmits. If you see much lower numbers, check CPU, MTU, or shaping on the WAN.
Decision: If CPU is pegged during iperf, your router sizing is wrong or you need hardware offload (or simply fewer tunnels on that box).
Task 15: Check CPU saturation during encryption
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.5.0 (edge-a) 12/27/2025 _x86_64_ (4 CPU)
12:02:01 AM CPU %usr %nice %sys %iowait %irq %soft %idle
12:02:02 AM all 62.00 0.00 30.00 0.00 0.00 6.00 2.00
12:02:02 AM 0 78.00 0.00 20.00 0.00 0.00 2.00 0.00
12:02:02 AM 1 60.00 0.00 36.00 0.00 0.00 4.00 0.00
12:02:02 AM 2 55.00 0.00 35.00 0.00 0.00 8.00 2.00
12:02:02 AM 3 55.00 0.00 29.00 0.00 0.00 10.00 6.00
What it means: You’re nearly out of idle CPU while pushing traffic. WireGuard is efficient, but it can still saturate small routers at higher throughput.
Decision: Scale up hardware, reduce concurrency, or reduce the number of high-throughput tunnels per box (regional hubs instead of full mesh).
Task 16: Confirm MSS clamping if you do NAT or have MTU constraints
cr0x@server:~$ sudo iptables -t mangle -S | grep -E 'TCPMSS|wg0'
-A FORWARD -o wg0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
What it means: SYN packets have MSS adjusted to avoid fragmentation issues. This often “fixes” weird app stalls that ping won’t show.
Decision: If you see MTU-related stalls, add clamping at the VPN edge. If you already have it and problems persist, test actual PMTU and reduce wg0 MTU.
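If the edge runs nftables instead of iptables, the equivalent clamp looks like the sketch below, again assuming an inet-family table named filter with a forward chain:
cr0x@server:~$ sudo nft add rule inet filter forward oifname "wg0" tcp flags syn tcp option maxseg size set rt mtu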
Task 17: Verify that a peer’s endpoint is what you think it is
cr0x@server:~$ sudo wg show wg0 endpoints
7y8N...aZk= 198.51.100.10:51820
What it means: WireGuard shows the current endpoint it’s using. If the peer roams (NAT change), this updates after receiving packets.
Decision: If the endpoint is wrong (old IP), ensure the remote side can initiate traffic (keepalive) or update DNS/IP management. Mesh with unstable endpoints needs a plan.
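For a peer that sits behind NAT and needs to keep the mapping warm, a minimal example run on that NATed site (the peer key is a placeholder; in wg-quick terms it’s PersistentKeepalive = 25 in the matching [Peer] section):
cr0x@server:~$ sudo wg set wg0 peer <stable-site-public-key> persistent-keepalive 25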
Fast diagnosis playbook
When the CEO says “Office-to-office is slow,” you don’t have time for a philosophy seminar. You need a triage sequence that gets you to the bottleneck fast.
First: decide if it’s a tunnel problem or a routing/path problem
- Check handshake and counters: wg show. If handshake is stale/never, it’s reachability/NAT/firewall/keys.
- Check where traffic is going: ip route get <dest> and traceroute -n. If it hairpins through a hub, it’s routing preference or missing direct routes.
- Capture a few packets: tcpdump -ni wg0. If SYNs go out but no SYN-ACK returns, it’s return routing or firewall on destination side.
Second: isolate MTU and fragmentation issues (the silent killer)
- Run DF pings with increasing size: ping -M do -s 1200/1300/1400.
- If it fails at modest sizes, reduce wg0 MTU and/or clamp MSS.
- Re-test the actual application (SMB, HTTPS, RDP). MTU bugs love “works for small things” illusions.
Third: decide if it’s throughput, CPU, or WAN shaping
- Run iperf3 across the tunnel for a baseline.
- Watch CPU with mpstat during the test.
- Check for packet loss/jitter with ping and for drops with interface counters (ip -s link).
Fourth: check for policy routing and rp_filter sabotage
- ip rule show for policy routing surprises.
- sysctl net.ipv4.conf.all.rp_filter when asymmetric paths exist.
- Fix the design, not just the symptoms. But if you must patch, set rp_filter to loose on the right interfaces and document why.
Short joke #2: MTU issues are the networking equivalent of a “check engine” light—vague, ominous, and somehow always your problem.
Common mistakes: symptom → root cause → fix
1) “Tunnel is up, but only some apps work”
Symptom: Ping and small HTTP requests work; SMB, SSH with larger payloads, or HTTPS uploads stall.
Root cause: MTU/PMTUD blackhole due to encapsulation overhead; ICMP “fragmentation needed” blocked; MSS not clamped.
Fix: Set wg0 MTU conservatively (often 1380-ish), and clamp MSS on forwarded TCP SYN packets. Verify with DF ping tests.
2) “Handshakes stop after a few minutes unless there’s traffic”
Symptom: Latest handshake becomes stale; traffic resumes only after manual ping from one side.
Root cause: NAT mapping expires; neither side initiates to refresh mapping; no keepalive on NATed side.
Fix: Configure PersistentKeepalive = 25 (or similar) on the side behind NAT. Ensure firewall allows return UDP.
3) “Traffic to Office B goes to Office C and dies”
Symptom: Wrong peer gets traffic; traceroute shows unexpected first hop; tcpdump shows packets exiting the wrong tunnel.
Root cause: Overlapping AllowedIPs or broad prefixes installed for convenience; route stealing by a more general prefix combined with OS route metrics.
Fix: Make AllowedIPs precise per site subnet; avoid 0.0.0.0/0 and big summaries unless you truly mean it. Audit with wg show allowed-ips.
4) “It works from A to B, but not from B to A”
Symptom: One-way flows; SYNs seen one direction only; replies exit via internet or another WAN.
Root cause: Asymmetric routing due to policy routing, dual WAN, or missing return routes on LAN routers.
Fix: Ensure return path points back to the WireGuard router; use policy routing carefully; consider source-based routing tables per subnet.
5) “After adding a new office, random other offices break”
Symptom: New peer addition causes unrelated connectivity loss; intermittent blackholes.
Root cause: Address overlap (duplicate tunnel IPs or LAN subnets), or AllowedIPs accidentally includes someone else’s subnet.
Fix: Enforce IPAM for overlay and LAN ranges; reject duplicate allocations in CI; run a route overlap check before deploy.
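A deliberately dumb pre-deploy check that still catches real mistakes, assuming your generator writes per-site configs into a generated/ directory: list every AllowedIPs prefix across all configs and print exact duplicates. Silence is good; any printed prefix is claimed by more than one site. It won’t catch containment overlaps (10.60.0.0/16 versus 10.60.1.0/24); that needs IPAM or a small script that understands CIDR.
cr0x@server:~$ grep -h '^AllowedIPs' generated/*.conf | cut -d= -f2 | tr ',' '\n' | tr -d ' ' | sort | uniq -d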
6) “Throughput is terrible but CPU is fine”
Symptom: iperf shows low Mbps; CPU idle; packet loss spikes; performance varies by time of day.
Root cause: WAN shaping, bufferbloat, or congested ISP path; sometimes UDP policing on business broadband.
Fix: Confirm with sustained tests; add QoS/SQM on the edge; consider multiple WANs with failover; don’t blame WireGuard for your ISP’s personality.
7) “Everything breaks during key rotation”
Symptom: Some tunnels come back, some don’t; handshakes never resume for a subset.
Root cause: Inconsistent config rollout; old public keys still referenced; stale configs not reloaded; partial automation.
Fix: Rotate with a staged approach (add new peer keys in parallel where possible), reload deterministically, and verify handshakes with automated checks.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company expanded from six offices to eleven in under a year. The network team decided to “finish what we started” and build a full mesh with static WireGuard configs.
Their assumption: “These are offices; the public IPs are stable.” That was true for fiber sites. It was hilariously untrue for the two new locations using “temporary” broadband that became permanent.
One Monday morning, two offices lost reachability to three others. Not all. Just enough to create a mess of half-working services: phone presence failed, but calls sometimes worked; file shares mapped but wouldn’t open; printers showed offline.
The monitoring dashboard was green because the hub tunnels were fine, and the team had built health checks that only validated branch-to-hub.
The root cause was mundane: those broadband circuits changed IPs over the weekend. Some peers updated endpoints via roaming (because the NATed side initiated traffic), and some didn’t (because the other sites never initiated).
So half the mesh converged and the other half froze with “latest handshake: 3 days ago.”
The fix was not just “set PersistentKeepalive.” They also introduced a design rule: any site without a stable endpoint does not participate as a first-class mesh node.
Those sites connect to a small set of stable rendezvous hubs. Direct office-to-office is allowed only between stable sites or via a controlled relay path.
The lesson: mesh assumes reachability symmetry and stable discovery. If you don’t have those, you’re not building a mesh; you’re building a distributed guessing game.
Mini-story 2: The optimization that backfired
Another company had a hub-and-spoke VPN and hated the latency between two regions.
Someone proposed a “simple win”: add direct tunnels between all offices in Region East and all offices in Region West. It sounded reasonable—until it met BGP summaries and human optimism.
They were already using dynamic routing inside the data center. To avoid advertising dozens of small subnets, they summarized routes at each office boundary.
Then they turned on the cross-region mesh links and imported the summaries without strict filters, because the change window was short and the rollback plan was basically “turn it off and pretend it never happened.”
For a week, things were great. Then a single office router in Region East mis-advertised a summary route due to a configuration template bug.
Suddenly, traffic for a big chunk of Region East started flowing to the wrong office, across a mesh link, and then got dropped by a firewall that never expected transit traffic.
The outage was painful because it looked like “random SaaS slowness” at first. The WAN graphs were normal; the hub was healthy; only a subset of internal apps failed.
It took a packet capture and a routing table diff to spot the bad summary.
They kept the mesh links, but they stopped treating routing policy like a side quest.
Strict prefix filters, max-prefix limits, and route validation became non-negotiable. The optimization was fine; the lack of guardrails was not.
Mini-story 3: The boring but correct practice that saved the day
A company with nine offices ran a partial mesh: regional hubs plus a few direct tunnels for latency-sensitive apps.
It wasn’t glamorous. It was documented. That alone put them ahead of most of the industry.
Their “boring practice” was a weekly automated audit: export wg show, route tables, and firewall rules into a snapshot, then diff against last week.
The output went to a ticket queue nobody loved, but it meant drift was visible.
One afternoon, a helpdesk tech doing “cleanup” removed what looked like an unused static route on an office router.
Within minutes, a warehouse management system stopped syncing inventory updates to another site. Users blamed the application. Of course they did.
The audit diff flagged the removed route and pointed at the exact device and time. The fix was immediate: restore the route and add a comment in config that explained why it existed.
The incident never became a multi-team blame festival because the evidence was sitting there, boring and precise.
The lesson: mesh doesn’t fail only because it’s complex. It fails because people forget why things exist. Make the network explain itself.
Checklists / step-by-step plan
Decide if you should do full mesh at all
- List top inter-office flows: VoIP, SMB, ERP, database replication, backup traffic. Put rough bandwidth and latency sensitivity next to each.
- Count sites now and in 18 months. If you’re heading past ~15, assume you need routing automation and/or a partial mesh.
- Classify endpoints: stable public IP, dynamic IP, behind CGNAT, dual WAN, LTE backup. Mesh hates uncertainty.
- Decide your security model: flat any-to-any, segmented, or “only specific apps.” Mesh plus segmentation demands policy discipline.
- Pick failure domains intentionally: do you want “hub outage impacts all,” or “pairwise tunnel outage impacts two”?
Build it like you plan to operate it
- Allocate overlay IPs via IPAM (even a simple Git-managed registry). Never “just pick one.”
- Standardize interface naming: wg0 for site-to-site, wg-mgmt for management, etc. Consistency buys you incident speed.
- Generate configs from source-of-truth. Human-edited mesh configs are a retirement plan for your future therapist.
- Define AllowedIPs patterns: per-site LAN prefixes only; avoid giant summaries unless backed by routing policy and review.
- Set MTU intentionally and document the reasoning. If you don’t, you’ll rediscover it during an outage, which is an expensive way to learn.
- Plan key rotation: schedule, blast radius, and verification steps. Make it a procedure, not a ritual.
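To make that list concrete, here is a sketch of what one generated per-site file might look like; every value (keys, subnets, endpoint, MTU) is illustrative and comes from the source of truth, never from a human editing the file in place:
# /etc/wireguard/wg0.conf -- GENERATED for office-a; do not hand-edit
[Interface]
Address = 10.60.200.1/24
ListenPort = 51820
PrivateKey = <office-a-private-key>
MTU = 1380

[Peer]
# office-b
PublicKey = <office-b-public-key>
Endpoint = 198.51.100.10:51820
AllowedIPs = 10.60.2.0/24, 10.60.200.2/32
PersistentKeepalive = 25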
Operate it with guardrails
- Monitoring: alert on stale handshakes for critical peers, rising packet loss, and CPU saturation on VPN routers.
- Change management: every site add/remove should be reproducible and reviewable. Mesh amplifies mistakes.
- Testing: after each change, run ping/DF ping/iperf to at least one peer and one remote LAN host.
- Drift detection: snapshot configs and routing state; diff regularly. “We didn’t change anything” is usually false.
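A minimal sketch of that snapshot-and-diff habit (paths and the weekly cadence are illustrative). Avoid wg show all dump in snapshots: it includes private and preshared keys, and your audit archive shouldn’t.
cr0x@server:~$ sudo wg show all endpoints > ~/net-audit/wg-endpoints-$(date +%F).txt
cr0x@server:~$ sudo wg show all allowed-ips > ~/net-audit/wg-allowedips-$(date +%F).txt
cr0x@server:~$ ip route show table all > ~/net-audit/routes-$(date +%F).txt
cr0x@server:~$ sudo nft list ruleset > ~/net-audit/nft-$(date +%F).txt
cr0x@server:~$ diff -u ~/net-audit/wg-allowedips-<last-week>.txt ~/net-audit/wg-allowedips-$(date +%F).txt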
FAQ
1) Is WireGuard full mesh more secure than hub-and-spoke?
Not inherently. WireGuard security is about key management and peer restrictions. Mesh can increase attack surface because there are more edges and more places to misconfigure AllowedIPs or firewall rules.
If you do mesh, treat policy as first-class: restrict routes, enforce segmentation, and audit regularly.
2) What’s the biggest practical advantage of direct office-to-office tunnels?
Latency and jitter reduction for inter-office traffic. If the hub is out of the way geographically or congested, direct tunnels can turn “barely usable” into “fine.”
3) What’s the biggest operational downside of full mesh?
Peer count and change blast radius. Adding a site becomes a multi-peer update. Key rotation and troubleshooting become combinatorial problems unless you automate and monitor like you mean it.
4) How many sites is “too many” for a static full mesh?
Roughly: beyond 10–15 sites, static mesh becomes unpleasant unless your automation and routing discipline are excellent. Past 20, you generally want dynamic routing and/or regional hubs.
5) Should I run BGP over WireGuard?
If you need scale, failover, and fewer static route headaches, yes—BGP (via FRR) is a solid choice. But it’s not magic.
You must implement prefix filters, max-prefix limits, and monitoring, or you’ll turn a routing bug into a company-wide outage.
6) How do I handle offices with dynamic IPs?
Use PersistentKeepalive on the NATed side so it can roam and keep the NAT mapping alive, and prefer stable rendezvous points (hubs/relays) for those sites.
Expect that “pure direct mesh everywhere” will be fragile if multiple sites have unstable endpoints.
7) Can I do active-active tunnels between offices?
Not with WireGuard alone. WireGuard gives you encrypted links; routing decides how to use them.
Active-active is typically done with equal-cost multipath (ECMP) routing via a routing daemon, or application-level load balancing. Be careful: ECMP plus asymmetric paths can upset firewalls and stateful middleboxes.
8) Why does WireGuard use AllowedIPs for routing?
It’s a design choice that keeps WireGuard minimal (the project calls it cryptokey routing): on send, the destination prefix selects the peer; on receive, a packet is dropped unless its source address falls within that peer’s AllowedIPs.
It reduces configuration surface but makes correctness your responsibility. Treat AllowedIPs like production code.
9) Should I NAT traffic across site-to-site tunnels?
Prefer not to. NAT hides addressing mistakes until they become incident-grade. Route real prefixes when possible.
NAT is sometimes useful for overlapping RFC1918 spaces during mergers, but treat it as a transitional tool with a sunset plan.
10) What’s the cleanest way to avoid overlapping office LAN subnets?
Establish an addressing standard early (per-site allocations) and enforce it with IPAM and review. Overlaps are a top cause of “mesh is haunted” behavior.
Practical next steps
If you’re deciding whether to build direct office-to-office WireGuard tunnels, do this in order:
- Measure current hub hairpin latency and jitter between the offices that complain the most.
- Prototype one direct tunnel between two sites and validate: handshake stability, MTU, throughput, and return routing.
- Choose a topology based on site count and endpoint stability: full mesh (small/stable), partial mesh (regional), or dual hubs (usually the best trade).
- Automate config generation before you scale past a handful of sites. If you wait, you’ll never catch up.
- Install guardrails: route filters, prefix sanity checks, drift detection, and a standard MTU/MSS policy.
- Write the runbook while it still feels obvious. Future-you will be tired and suspicious during the outage.
Direct tunnels are worth it when they remove a real bottleneck and you can keep the operational model clean.
They’re not worth it when you’re using them to avoid choosing a network architecture. Your topology will pick one for you—usually during a Friday change window.