Your three offices can’t “just VPN together.” One is behind double NAT, another has a “helpful” ISP router you’re not allowed to touch, and the third is the one with the server room that also stores the Christmas decorations. Yet everyone expects shared services, stable VoIP, and a file share that doesn’t feel like it’s hosted on a toaster.
The hub-and-spoke model with WireGuard is how you get predictable routing without turning your network into interpretive art. But it only stays predictable if you respect how WireGuard actually works: it’s a secure UDP tunnel with a routing table attached. If you treat it like a magical “VPN checkbox,” it will politely let you misroute packets until the day you’re on call.
Pick the topology and be honest about constraints
Hub-and-spoke means: each office (spoke) builds a WireGuard tunnel only to the central gateway (hub). The hub routes traffic between spokes. Spokes never need to reach each other directly, which is the whole point: fewer tunnels, simpler key management, fewer firewall holes, fewer surprises.
It’s not the only model. Full mesh exists, but it scales like a group chat where everyone can reply-all. For three offices you could do mesh, but you’ll still want a central control point for policy, logging, and “who can talk to what.” Hub-and-spoke wins in corporate reality: centralized egress control, centralized observability, and only one place to enforce “no, the printers don’t need to talk to finance.”
What the hub must be
- A stable public endpoint (static IP or stable DNS, but treat DNS failures as real).
- A router with Linux IP forwarding, firewalling, and preferably nftables or iptables you control.
- A box with operational hygiene: time sync, logs, backups, change control, and a clear owner.
What the spokes must be
- A device that can run WireGuard and route a LAN (Linux box, supported router distro, or a small appliance).
- Consistency: one interface for WAN, one for LAN, one for WireGuard. Don’t get clever.
- Known LAN subnets (non-overlapping). Overlaps are how “temporary fixes” become permanent outages.
Decide early whether the hub is also an “internet breakout” for the spokes (spokes route 0.0.0.0/0 through the hub) or whether the hub only routes inter-office traffic. For three offices, inter-office only is usually the first step; central breakout can come later when you’re ready to own the support burden.
Joke #1: A hub-and-spoke VPN is like a corporate org chart: everyone reports to the middle, and the middle spends its life routing complaints.
Interesting facts (and why they matter operationally)
- WireGuard is intentionally minimal. It doesn’t negotiate algorithms or support a zoo of ciphers; it uses a small modern set to reduce misconfiguration and attack surface.
- It uses the Noise protocol framework. That’s not trivia; it explains why handshakes are fast and why state is lean compared to older VPN stacks.
- WireGuard runs in-kernel on Linux. This is why performance is generally excellent and why you should still care about kernel upgrades and regressions like any other datapath change.
- “AllowedIPs” is both ACL and routing. Many outages come from forgetting that this field determines what routes get installed and what traffic is accepted from a peer.
- Roaming is a first-class behavior. Peers can change source IP/port; WireGuard learns the new endpoint after authenticated traffic, which is perfect for offices behind NAT—until keepalives are missing.
- It’s UDP. Which means it can sail through many networks, but it also means stateful firewalls and NAT timeouts become your problem.
- There’s no built-in “user auth.” It’s machine-to-machine keys. If someone wants per-user VPN access, do that elsewhere (or build a separate layer).
- It’s young compared to IPsec. IPsec has decades of scar tissue and interoperability baggage. WireGuard has fewer knobs, fewer footguns, and fewer “but the vendor said…” meetings.
- Small configs are a feature. When you can print the entire VPN config on a single page, on-call debugging is suddenly survivable.
One paraphrased idea that still holds in operations comes from Werner Vogels (Amazon CTO): everything fails eventually, so design so that failure is routine and recovery is boring. That's the mindset for VPNs too: assume NATs reboot, links flap, and someone edits the wrong file at 2 a.m.
Address plan and routing model
Start with the address plan. If you skip this, you will pay later, with interest.
Example network plan (three offices + hub)
- Office A LAN: 10.10.10.0/24
- Office B LAN: 10.10.20.0/24
- Office C LAN: 10.10.30.0/24
- WireGuard transit network: 10.99.0.0/24 (only for tunnel interface IPs)
- Hub wg0: 10.99.0.1/24
- Spoke A wg0: 10.99.0.11/32
- Spoke B wg0: 10.99.0.12/32
- Spoke C wg0: 10.99.0.13/32
Use /32 addresses for peers on the tunnel interface, and keep the transit network separate from office LANs. You want the wire to be its own universe. Also: don’t reuse RFC1918 ranges you already use in offices. Overlapping private subnets are the corporate equivalent of “we’ll just wing it.”
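Non-overlap is cheap to verify before anything is deployed. A minimal sketch using Python's stdlib ipaddress module, with the example plan above (the dictionary keys are illustrative names, not anything WireGuard reads):

```python
import ipaddress

# Example plan from above; the transit network is deliberately separate from LANs.
plan = {
    "office-a": ipaddress.ip_network("10.10.10.0/24"),
    "office-b": ipaddress.ip_network("10.10.20.0/24"),
    "office-c": ipaddress.ip_network("10.10.30.0/24"),
    "wg-transit": ipaddress.ip_network("10.99.0.0/24"),
}

# Every pair of subnets must be disjoint; overlaps become routing fights later.
names = list(plan)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if plan[a].overlaps(plan[b]):
            raise SystemExit(f"OVERLAP: {a} {plan[a]} vs {b} {plan[b]}")
print("address plan OK: all subnets disjoint")
```

Run it whenever the plan changes; a failed exit in review is cheaper than a routing fight in production.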
Routing intent
Your intent in hub-and-spoke is simple:
- Spokes know: “to reach the other offices’ LANs, send traffic into wg0.”
- Hub knows: “to reach each office LAN, send traffic out wg0 to the correct peer.”
- LAN clients in each office know: “for other office subnets, default gateway is the office router.”
That’s it. No NAT between offices unless you have overlapping subnets you can’t fix (and if you can’t fix it, budget time to fix it anyway). NAT hides problems and makes them harder to diagnose because it erases source identity—exactly what you want to preserve for auditing and security policy.
Hub configuration: the central gateway
Assume the hub is a Linux VM in a datacenter or cloud with a public IP. Name it wg-hub-1. It has:
- eth0 = public/WAN interface
- wg0 = WireGuard interface
Install WireGuard
On modern distributions, it’s a package install and a kernel module that’s already there. Keep it boring.
Generate keys
Do this on the hub and on each spoke; never copy private keys through chat apps or ticket comments. Treat them like root passwords.
Hub: /etc/wireguard/wg0.conf
This example routes three office subnets. The hub will be the only configured endpoint; spokes will point at it.
cr0x@server:~$ sudo sed -n '1,200p' /etc/wireguard/wg0.conf
[Interface]
Address = 10.99.0.1/24
ListenPort = 51820
PrivateKey = HUB_PRIVATE_KEY_REDACTED
SaveConfig = false
# Enable NAT only if you want spokes to reach the hub's WAN or internet via hub.
# For pure site-to-site, you typically do not NAT between offices.
PostUp = sysctl -w net.ipv4.ip_forward=1
PostUp = nft add table inet wg; nft 'add chain inet wg forward { type filter hook forward priority 0; policy drop; }'
PostUp = nft add rule inet wg forward iifname "wg0" oifname "wg0" accept
PostUp = nft add rule inet wg forward iifname "wg0" oifname "eth0" accept
PostUp = nft add rule inet wg forward iifname "eth0" oifname "wg0" ct state established,related accept
PostDown = nft delete table inet wg
[Peer]
PublicKey = SPOKE_A_PUBLIC_KEY_REDACTED
AllowedIPs = 10.99.0.11/32, 10.10.10.0/24
PersistentKeepalive = 25
[Peer]
PublicKey = SPOKE_B_PUBLIC_KEY_REDACTED
AllowedIPs = 10.99.0.12/32, 10.10.20.0/24
PersistentKeepalive = 25
[Peer]
PublicKey = SPOKE_C_PUBLIC_KEY_REDACTED
AllowedIPs = 10.99.0.13/32, 10.10.30.0/24
PersistentKeepalive = 25
Notes you should not skip:
- AllowedIPs on the hub is effectively the hub's routing table for spokes. If it's wrong, the hub will either blackhole or misdeliver.
- PersistentKeepalive on the hub isn't always necessary, but it helps keep NAT mappings alive on the remote side. In branch-office reality, NAT mappings die when you look away.
- SaveConfig = false prevents runtime changes from being written back. This avoids "someone used wg set and now the file is lying."
- The nftables rules above are intentionally strict-ish. Many setups "accept all forward," which is fine until the VPN becomes an unmonitored transit for everything.
Enable and start
Use systemd units for consistency. You want boot-time recovery without custom scripts.
Spoke configuration: each office router
Each office has a router/firewall box (Linux) with:
- eth0 = WAN uplink toward the ISP router/modem
- eth1 = LAN toward the office switch
- wg0 = WireGuard tunnel to the hub
Spoke A: /etc/wireguard/wg0.conf
cr0x@server:~$ sudo sed -n '1,200p' /etc/wireguard/wg0.conf
[Interface]
Address = 10.99.0.11/32
PrivateKey = SPOKE_A_PRIVATE_KEY_REDACTED
ListenPort = 51820
SaveConfig = false
PostUp = sysctl -w net.ipv4.ip_forward=1
PostUp = nft add table inet wg; nft 'add chain inet wg forward { type filter hook forward priority 0; policy drop; }'
PostUp = nft add rule inet wg forward iifname "eth1" oifname "wg0" accept
PostUp = nft add rule inet wg forward iifname "wg0" oifname "eth1" ct state established,related accept
PostUp = nft add rule inet wg forward iifname "wg0" oifname "wg0" accept
PostDown = nft delete table inet wg
[Peer]
PublicKey = HUB_PUBLIC_KEY_REDACTED
Endpoint = 203.0.113.10:51820
AllowedIPs = 10.99.0.1/32, 10.10.20.0/24, 10.10.30.0/24
PersistentKeepalive = 25
Spoke B and C are identical except for their Address, and each spoke's AllowedIPs should list the other offices' LANs, never its own. You can include the hub's wg address as well so you can ping the hub.
Two rules for spokes:
- Do not put 0.0.0.0/0 in AllowedIPs unless you are intentionally forcing all traffic through the hub (central breakout). Don't "test it" in production and forget.
- Do not NAT inter-office traffic. If you need NAT to make something "work," you're probably routing wrong.
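Both rules are checkable before a config ships. A small sketch (the function name is mine, not a WireGuard tool) that flags a default route or a spoke's own LAN showing up in its AllowedIPs:

```python
import ipaddress

def check_spoke_allowed_ips(own_lan: str, allowed_ips: list[str]) -> list[str]:
    """Return problems with a spoke's AllowedIPs list (a sketch, not a full linter)."""
    own = ipaddress.ip_network(own_lan)
    problems = []
    for entry in allowed_ips:
        net = ipaddress.ip_network(entry)
        if net.prefixlen == 0:
            # 0.0.0.0/0 means full tunnel; only acceptable for intentional central breakout.
            problems.append(f"{entry}: default route -> full tunnel, is that intended?")
        if net.overlaps(own):
            problems.append(f"{entry}: overlaps the spoke's own LAN {own}")
    return problems

# Spoke A from the example: hub wg IP plus the *other* offices only.
print(check_spoke_allowed_ips(
    "10.10.10.0/24",
    ["10.99.0.1/32", "10.10.20.0/24", "10.10.30.0/24"],
))  # -> []
```

An empty list is the only acceptable result; anything else means the config needs a second look before it reaches a router.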
Routing, NAT, and firewall rules that won’t surprise you
WireGuard doesn’t do routing for you; it only encrypts traffic you route into it. Linux will route based on:
- interface addresses (connected routes),
- routes installed by wg-quick from AllowedIPs, and/or
- your explicit static routes.
Prefer explicit routing over cleverness
When you bring up wg0 with wg-quick, it will typically add routes for AllowedIPs. That’s convenient, but you should understand exactly which routes exist and why. Convenience becomes mystery during an outage.
Forwarding: the three gates
For packets to traverse from office A LAN to office B LAN via the hub, three things must be true:
- Office A clients send traffic to their local router (default gateway).
- Spoke A router forwards to wg0 and encrypts to hub.
- Hub forwards from wg0 to wg0 (to the other peer) based on routing.
Each device needs:
- IP forwarding enabled (net.ipv4.ip_forward=1)
- firewall rules allowing the forwarding path
- routes installed for the remote LANs
When to NAT
There are only a few reasons to NAT in this design:
- Central internet breakout. Spokes route 0.0.0.0/0 via the hub; the hub NATs to its public interface.
- Overlapping subnets you cannot renumber. You should renumber anyway, but sometimes you have a merger, a vendor appliance, or a building automation system that refuses to move.
- Temporary migration. Use NAT as scaffolding with an expiry date, not as architecture.
If you NAT between offices by default, you will eventually break something that relies on source IP (ACLs, logging, SMB security rules, VoIP SBC rules). Worse: you’ll lose the ability to answer “who accessed this system?” without playing packet-translation detective.
MTU, fragmentation, and why “it works for me” is not a metric
MTU issues are the most common “it connects but some stuff hangs” problem in WireGuard site-to-site. The tunnel adds overhead. If you carry packets that are too large for some path segment, they’ll fragment or drop, and you’ll get failures that look like application bugs.
WireGuard encapsulates IP in UDP. The per-packet overhead is 60 bytes over an IPv4 underlay (20 outer IP + 8 UDP + 32 WireGuard framing and auth tag) and 80 bytes over IPv6. If your WAN MTU is 1500, a WireGuard MTU of 1420 is safe either way; many distros default wg interfaces to 1420 for exactly this reason.
But you cannot assume 1500 on WAN. PPPoE links often have 1492. Some LTE/5G links behave differently. Some ISPs do weird things with ICMP, which breaks PMTU discovery and makes everything “mostly work.”
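The arithmetic is worth writing down once, under the standard assumptions (20-byte IPv4 or 40-byte IPv6 outer header, 8-byte UDP header, 32 bytes of WireGuard framing and auth tag):

```python
# WireGuard data-packet overhead per encapsulated packet:
# outer IP header + 8-byte UDP header + 16-byte WireGuard header + 16-byte auth tag.
WG_OVERHEAD_V4 = 20 + 8 + 16 + 16   # 60 bytes over an IPv4 underlay
WG_OVERHEAD_V6 = 40 + 8 + 16 + 16   # 80 bytes over an IPv6 underlay

def safe_wg_mtu(wan_mtu: int, ipv6_underlay_possible: bool = True) -> int:
    """Largest tunnel MTU that never fragments the outer packet."""
    overhead = WG_OVERHEAD_V6 if ipv6_underlay_possible else WG_OVERHEAD_V4
    return wan_mtu - overhead

print(safe_wg_mtu(1500))  # 1420 -> the common wg0 default
print(safe_wg_mtu(1492))  # 1412 -> a typical PPPoE link
```

The same function answers "what MTU should wg0 get on that weird LTE backup link" without anyone guessing.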
Joke #2: MTU bugs are the adult version of stepping on a Lego—everything hurts, and you can’t immediately prove what caused it.
My production stance on MTU
- Set MTU = 1420 on wg0 unless you have a measured reason to do otherwise.
- If you see hangs on large transfers or certain websites/services, test PMTU aggressively and adjust.
- Do not disable ICMP globally. You’re not “hardening,” you’re blindfolding.
Observability: what to log, what to graph
A VPN that “works” but can’t be debugged is a future outage you’re pre-paying. For hub-and-spoke, instrument the hub like it’s production routing infrastructure—because it is.
What to watch on the hub
- Handshake timestamps per peer. A peer that hasn’t handshaken in hours is either idle (fine) or dead (not fine). Correlate with traffic counters.
- RX/TX byte counters. Sudden drops to zero during business hours mean someone broke routing or firewalling.
- CPU softirq and NIC drops. Rare at three offices, but if the hub VM is tiny or oversubscribed, you’ll see it.
- Firewall counters. If you don’t count drops, you’ll argue with yourself at 3 a.m.
- System time drift. Crypto and time-sensitive handshakes do not enjoy time travel.
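Handshake freshness is easy to script against the machine-readable output of wg show wg0 latest-handshakes (one pubkey and epoch timestamp per line, 0 meaning never). A sketch; the 180-second threshold is my assumption to tune, and remember a stale handshake can also just mean an idle peer, so correlate with traffic counters:

```python
def stale_peers(latest_handshakes: str, now: float, max_age_s: int = 180) -> list[str]:
    """Parse `wg show wg0 latest-handshakes` output (pubkey<TAB>epoch, 0 = never)
    and return peers whose last handshake is older than max_age_s."""
    stale = []
    for line in latest_handshakes.strip().splitlines():
        pubkey, epoch = line.split()
        ts = int(epoch)
        if ts == 0 or now - ts > max_age_s:
            stale.append(pubkey)
    return stale

# Sample output: spoke A handshook a minute ago, spoke C never has.
sample = "zZp2...A\t1766830000\nmK1r...C\t0\n"
print(stale_peers(sample, now=1766830060))  # -> ['mK1r...C']
```

Wire it to a systemd timer and an alert, and "the VPN has been down since lunch" stops being a user-reported event.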
Logs: be selective
WireGuard itself is quiet. That’s good. Don’t try to “log every packet.” Instead:
- log key interface up/down events,
- log firewall drops at a sampled rate or per-rule counter inspection,
- keep change history of configs and kernel updates.
Practical tasks: commands, outputs, and decisions
These are real tasks you’ll run in production. Each includes a command, example output, what it means, and what decision you make next. Use them on the hub and on spokes. Consistency wins outages.
Task 1: Verify WireGuard interface state
cr0x@server:~$ sudo wg show
interface: wg0
public key: 3kN9...REDACTED
listening port: 51820
peer: zZp2...REDACTED
endpoint: 198.51.100.24:60433
allowed ips: 10.99.0.11/32, 10.10.10.0/24
latest handshake: 1 minute, 12 seconds ago
transfer: 1.42 GiB received, 1.87 GiB sent
persistent keepalive: every 25 seconds
Meaning: The tunnel is up enough to handshake; endpoint shows where the peer currently is (NATed port is expected). Transfer counters prove real traffic.
Decision: If latest handshake is “never” or stale while users complain, move immediately to firewall/NAT reachability and routing checks.
Task 2: Bring up the interface and confirm systemd status
cr0x@server:~$ sudo systemctl status wg-quick@wg0
● wg-quick@wg0.service - WireGuard via wg-quick(8) for wg0
Loaded: loaded (/lib/systemd/system/wg-quick@.service; enabled; preset: enabled)
Active: active (exited) since Sat 2025-12-27 10:11:02 UTC; 7min ago
Docs: man:wg-quick(8)
man:wg(8)
Process: 1724 ExecStart=/usr/bin/wg-quick up wg0 (code=exited, status=0/SUCCESS)
Meaning: wg-quick is a one-shot; “active (exited)” is normal. It did not fail during interface creation.
Decision: If status is failed, read the journal for syntax errors or missing kernel module before touching networking further.
Task 3: Inspect wg-quick logs for errors
cr0x@server:~$ sudo journalctl -u wg-quick@wg0 -n 50 --no-pager
Dec 27 10:11:02 wg-hub-1 wg-quick[1724]: [#] ip link add wg0 type wireguard
Dec 27 10:11:02 wg-hub-1 wg-quick[1724]: [#] wg setconf wg0 /dev/fd/63
Dec 27 10:11:02 wg-hub-1 wg-quick[1724]: [#] ip -4 address add 10.99.0.1/24 dev wg0
Dec 27 10:11:02 wg-hub-1 wg-quick[1724]: [#] ip link set mtu 1420 up dev wg0
Dec 27 10:11:02 wg-hub-1 wg-quick[1724]: [#] sysctl -w net.ipv4.ip_forward=1
Meaning: Interface created, MTU set, forwarding enabled. If you don’t see MTU being set, you might be relying on defaults—fine, but know it.
Decision: If the logs show “RTNETLINK answers: File exists” or nft failures, you have leftover state. Clean up and restart cleanly.
Task 4: Confirm forwarding is actually enabled
cr0x@server:~$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1
Meaning: Kernel will forward IPv4 packets. Without this, your VPN becomes an expensive ping toy.
Decision: If it’s 0, enable persistently in /etc/sysctl.d/ and stop pretending it will “stay set.”
Task 5: Validate routes to office subnets exist on the hub
cr0x@server:~$ ip route show | egrep '10\.10\.(10|20|30)\.0/24|10\.99\.0\.0/24'
10.10.10.0/24 dev wg0 scope link
10.10.20.0/24 dev wg0 scope link
10.10.30.0/24 dev wg0 scope link
10.99.0.0/24 dev wg0 proto kernel scope link src 10.99.0.1
Meaning: The hub believes all office LANs are reachable via wg0. That’s required for hub routing.
Decision: If routes are missing, your AllowedIPs on the hub peers is wrong or wg-quick didn’t install routes (policy routing customizations can do that).
Task 6: Validate routes on a spoke
cr0x@server:~$ ip route show | egrep '10\.10\.(20|30)\.0/24|wg0'
10.10.20.0/24 dev wg0 scope link
10.10.30.0/24 dev wg0 scope link
10.99.0.1 dev wg0 scope link
Meaning: Spoke A will send traffic for office B and C into the tunnel.
Decision: If office subnets are routed elsewhere (or not at all), correct AllowedIPs or add explicit routes. Do not “fix” it with NAT.
Task 7: Confirm the hub can reach each spoke’s wg IP
cr0x@server:~$ ping -c 3 10.99.0.11
PING 10.99.0.11 (10.99.0.11) 56(84) bytes of data.
64 bytes from 10.99.0.11: icmp_seq=1 ttl=64 time=18.6 ms
64 bytes from 10.99.0.11: icmp_seq=2 ttl=64 time=18.4 ms
64 bytes from 10.99.0.11: icmp_seq=3 ttl=64 time=18.9 ms
--- 10.99.0.11 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
Meaning: Encrypted connectivity to the spoke router exists.
Decision: If this fails but handshake exists, check firewall on the spoke (INPUT/FORWARD) for ICMP, or check that the spoke assigned the wg IP you think it did.
Task 8: Trace routing from hub to a host in Office B
cr0x@server:~$ traceroute -n 10.10.20.50
traceroute to 10.10.20.50 (10.10.20.50), 30 hops max, 60 byte packets
1 10.99.0.12 20.112 ms 20.098 ms 20.085 ms
2 10.10.20.50 21.004 ms 20.992 ms 20.980 ms
Meaning: Hub is routing to the correct spoke (10.99.0.12) and then to the destination host.
Decision: If hop 1 is wrong, you have an AllowedIPs overlap on the hub (two peers claiming the same subnet) or missing routes.
Task 9: Check for overlapping AllowedIPs on the hub
cr0x@server:~$ sudo wg show wg0 allowed-ips
zZp2...REDACTED 10.99.0.11/32 10.10.10.0/24
aPq8...REDACTED 10.99.0.12/32 10.10.20.0/24
mK1r...REDACTED 10.99.0.13/32 10.10.30.0/24
Meaning: Each office subnet is uniquely claimed by one peer. That’s what you want.
Decision: If two peers list the same subnet, fix it immediately; routing will be nondeterministic and you’ll get “sometimes it works” tickets.
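The same check can be automated against the output of wg show wg0 allowed-ips. A sketch (the function name is mine); it treats any overlap between different peers as a conflict, not just exact duplicates:

```python
import ipaddress

def allowed_ips_conflicts(wg_output: str) -> list[tuple[str, str, str, str]]:
    """Parse `wg show wg0 allowed-ips` (pubkey<TAB>cidr cidr ...) and report
    subnet pairs claimed by two different peers."""
    claims = []  # (pubkey, network)
    for line in wg_output.strip().splitlines():
        pubkey, *cidrs = line.split()
        for cidr in cidrs:
            if cidr == "(none)":  # a peer with no AllowedIPs prints "(none)"
                continue
            claims.append((pubkey, ipaddress.ip_network(cidr)))
    conflicts = []
    for i, (pa, na) in enumerate(claims):
        for pb, nb in claims[i + 1:]:
            if pa != pb and na.overlaps(nb):
                conflicts.append((pa, str(na), pb, str(nb)))
    return conflicts

# Healthy output from the task above: no conflicts expected.
healthy = (
    "zZp2 10.99.0.11/32 10.10.10.0/24\n"
    "aPq8 10.99.0.12/32 10.10.20.0/24\n"
    "mK1r 10.99.0.13/32 10.10.30.0/24\n"
)
print(allowed_ips_conflicts(healthy))  # -> []
```

Run it in change review: a non-empty result is a hard stop, because overlapping claims mean nondeterministic routing.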
Task 10: Validate UDP reachability (port open) on the hub
cr0x@server:~$ sudo ss -ulnp | grep 51820
UNCONN 0 0 0.0.0.0:51820 0.0.0.0:*
Meaning: The hub is listening on UDP/51820. With in-kernel WireGuard the socket has no owning userspace process, so the users column is empty.
Decision: If nothing is listening, your interface isn’t up, or you changed ports and didn’t restart the service.
Task 11: Inspect nftables rules and counters
cr0x@server:~$ sudo nft list table inet wg
table inet wg {
chain forward {
type filter hook forward priority filter; policy drop;
iifname "wg0" oifname "wg0" counter packets 18392 bytes 24581231 accept
iifname "wg0" oifname "eth0" counter packets 0 bytes 0 accept
iifname "eth0" oifname "wg0" ct state established,related counter packets 2901 bytes 501832 accept
}
}
Meaning: Forwarding policy is drop by default; specific paths are allowed. Counters show real inter-office forwarding (wg0→wg0).
Decision: If counters on wg0→wg0 stay at zero while users complain, the hub is not forwarding—either routes are wrong or traffic never arrives at the hub.
Task 12: Capture traffic to prove whether packets arrive
cr0x@server:~$ sudo tcpdump -ni wg0 host 10.10.20.50 -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on wg0, link-type RAW (Raw IP), snapshot length 262144 bytes
10:22:18.114331 IP 10.10.10.25.53422 > 10.10.20.50.445: Flags [S], seq 13922311, win 64240, options [mss 1360,sackOK,TS val 2911221 ecr 0,nop,wscale 7], length 0
10:22:19.120482 IP 10.10.10.25.53422 > 10.10.20.50.445: Flags [S], seq 13922311, win 64240, options [mss 1360,sackOK,TS val 2912227 ecr 0,nop,wscale 7], length 0
Meaning: Traffic is arriving on the hub’s wg0 from Office A destined to Office B host. If there’s no SYN-ACK, either the destination host isn’t responding or traffic isn’t being forwarded correctly onward.
Decision: If you see packets on the hub but they don’t reach the far office, check hub routing to the correct peer, then check spoke B forwarding and office B LAN firewall.
Task 13: Check MTU behavior with “do not fragment” pings
cr0x@server:~$ ping -M do -s 1372 -c 3 10.10.20.50
PING 10.10.20.50 (10.10.20.50) 1372(1400) bytes of data.
1380 bytes from 10.10.20.50: icmp_seq=1 ttl=63 time=22.1 ms
1380 bytes from 10.10.20.50: icmp_seq=2 ttl=63 time=22.0 ms
1380 bytes from 10.10.20.50: icmp_seq=3 ttl=63 time=21.8 ms
--- 10.10.20.50 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
Meaning: A 1400-byte packet makes it end-to-end without fragmentation. That’s a good sign for typical TCP MSS sizing.
Decision: If this fails with “Frag needed,” reduce wg MTU (e.g., 1380) and retest, or investigate WAN MTU constraints.
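When probing several sizes, remember that ping's -s flag sets the ICMP payload, not the packet size: the on-wire IPv4 packet is payload plus 28 bytes (20 IPv4 + 8 ICMP). A tiny helper, assuming no IP options:

```python
IPV4_HEADER = 20  # assumes no IP options
ICMP_HEADER = 8

def ping_payload_for_mtu(target_mtu: int) -> int:
    """Payload size for `ping -M do -s N` that yields a packet of exactly target_mtu."""
    return target_mtu - IPV4_HEADER - ICMP_HEADER

# Matches the task above: probing a 1400-byte path uses -s 1372.
for mtu in (1500, 1420, 1400, 1380):
    print(mtu, "-> ping -M do -s", ping_payload_for_mtu(mtu))
```

This is the difference between "I tested MTU 1400" and "I accidentally tested MTU 1428 and drew the wrong conclusion."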
Task 14: Verify policy routing isn’t silently hijacking traffic
cr0x@server:~$ ip rule show
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
Meaning: Default rules only. That’s good unless you intentionally added complex routing.
Decision: If you see extra rules, confirm they’re required. Policy routing is powerful and also a reliable way to confuse Future You.
Task 15: Confirm reverse path filtering isn’t dropping asymmetric traffic
cr0x@server:~$ sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.wg0.rp_filter
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.wg0.rp_filter = 0
Meaning: rp_filter is off here, avoiding drops when return path differs from expected interface. On routers, strict rp_filter often breaks VPN forwarding.
Decision: If it’s 1 or 2 and you see one-way traffic, set it to 0 for wg0 (or tune carefully) and document why.
Fast diagnosis playbook
This is the order that finds the bottleneck quickly, without wandering into “maybe it’s DNS” fantasy land. Work from the center outward, because the hub is your choke point and your best vantage point.
First: is the tunnel alive?
- On the hub: wg show — check handshake age and transfer counters per peer.
- On the hub: ss -ulnp | grep 51820 — confirm it's listening.
- On the spoke: wg show — confirm it sees the hub, handshake updates, and AllowedIPs match intent.
If handshake is “never”: it’s usually endpoint reachability (UDP blocked), wrong keys, wrong port, or NAT not maintained.
Second: is routing correct?
- On hub and spoke: ip route show — confirm remote office subnets point to wg0.
- On hub: wg show wg0 allowed-ips — confirm no overlapping claims.
- Run a traceroute from hub to a remote LAN host — confirm the first hop is the correct spoke.
If routes exist but traceroute hops are wrong: you have conflicting AllowedIPs or policy routing.
Third: is forwarding/firewall blocking transit?
- Check sysctl net.ipv4.ip_forward on hub and spokes.
- Inspect nftables/iptables rules and counters on each node.
- Run tcpdump on hub wg0 and on spoke wg0 to see where packets stop.
If packets arrive on hub but not on far spoke: hub routing/AllowedIPs mismatch or hub firewall drop. If packets arrive on far spoke but not on LAN, it’s spoke forwarding or LAN firewall.
Fourth: if it’s “works for small things”
- Test MTU with ping -M do at multiple sizes.
- Check TCP MSS clamping (if you use it) and confirm it matches the effective MTU.
- Look for ICMP being blocked along the path (PMTU needs it).
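The MSS/MTU relationship is fixed arithmetic for IPv4 TCP without options: MSS = path MTU - 40 (20 IP + 20 TCP). A sketch for sanity-checking clamp values against what you see in captures:

```python
IPV4_HEADER = 20  # assumes no IP options
TCP_HEADER = 20   # assumes no TCP options in the MSS calculation

def expected_mss(path_mtu: int) -> int:
    """TCP MSS implied by an IPv4 path MTU."""
    return path_mtu - IPV4_HEADER - TCP_HEADER

def implied_mtu(mss: int) -> int:
    """Reverse direction: the path MTU a clamped or advertised MSS corresponds to."""
    return mss + IPV4_HEADER + TCP_HEADER

print(expected_mss(1420))  # 1380 -> the clamp value for a 1420 tunnel MTU
print(implied_mtu(1360))   # 1400 -> what an observed `mss 1360` in a capture implies
```

If the clamp you configured and the MSS you observe in tcpdump disagree, one of them is lying, and it's usually the config you forgot to apply on one border.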
Fifth: performance complaints (slow file shares, choppy calls)
- Check hub CPU and NIC drops: sar, ethtool -S, ss -s.
- Look at RTT and loss (simple pings are fine; also check jitter).
- Confirm no accidental central internet breakout for heavy traffic.
Common mistakes: symptoms → root cause → fix
1) “Handshake never happens”
Symptoms: wg show shows no latest handshake (or one that never updates). No traffic counters increment.
Root cause: UDP/51820 blocked inbound to hub, wrong hub endpoint/port, wrong public key, or spoke behind NAT without keepalive and NAT mapping expired.
Fix: Verify ss -ulnp on hub, allow UDP in cloud firewall, confirm hub public IP/port, re-check keys, set PersistentKeepalive=25 on spokes (and optionally on hub peer entries).
2) “It pings, but SMB/RDP/VoIP is flaky or hangs”
Symptoms: Small pings succeed; larger transfers stall; some applications connect and then freeze.
Root cause: MTU/PMTU blackhole; ICMP blocked; tunnel MTU too high for WAN path.
Fix: Set wg MTU to 1420 (or lower), test with ping -M do, avoid blocking ICMP fragmentation-needed messages. If you must, clamp TCP MSS at borders.
3) “Office A can reach Office B, but not Office C”
Symptoms: One remote subnet works; another doesn’t. Handshake is fine.
Root cause: Missing route/AllowedIPs entry for the non-working subnet on the spoke or on the hub.
Fix: Add the missing subnet to the correct AllowedIPs and confirm routes. Recheck wg show allowed-ips for overlap.
4) “Only one direction works”
Symptoms: A host in Office A can reach Office B, but replies never come back (or vice versa).
Root cause: Reverse path filtering, asymmetric routing due to another VPN, or a LAN firewall that doesn’t know about the remote subnets.
Fix: Disable rp_filter for wg interfaces, ensure return routes exist, and update LAN firewall policies to allow remote subnets.
5) “It worked yesterday, now no one can connect after ‘minor changes’”
Symptoms: After a reboot or update, tunnels don’t come up or routing is different.
Root cause: Runtime changes were saved back unexpectedly, nftables table name conflicts, or sysctl forwarding not persistent.
Fix: Use SaveConfig=false, define firewall rules in a dedicated, idempotent system (systemd unit or configuration management), set sysctl in /etc/sysctl.d/.
6) “Randomly switches which office gets traffic for a subnet”
Symptoms: Some sessions go to the wrong destination; traceroute first hop changes; intermittent reachability.
Root cause: Overlapping AllowedIPs on the hub. Two peers claiming the same subnet is a routing fight with no winner.
Fix: Make AllowedIPs mutually exclusive and enforce it in review. Treat it like a routing table: uniqueness is non-negotiable.
7) “VPN is up, but clients in the office can’t reach remote networks”
Symptoms: Spoke router can ping remote office hosts; office desktops cannot.
Root cause: Office clients lack route to remote subnets (default gateway wrong), or spoke firewall doesn’t forward LAN→wg0.
Fix: Ensure clients use the office router as gateway (DHCP), or add static routes on core switch if you’re doing L3 internally; fix forward rules and verify with counters.
Checklists / step-by-step plan
Step-by-step rollout plan (do this, in this order)
- Inventory constraints. For each office: WAN type, NAT presence, who controls the router, and whether inbound UDP is possible. Assume “no” until proven otherwise.
- Lock the IP plan. Pick non-overlapping office subnets and a dedicated WireGuard transit range.
- Build the hub first. Harden OS basics: updates, NTP, firewall baseline, backups of /etc/wireguard.
- Generate keys per peer. Track public keys in a controlled place. Never rotate keys ad hoc without coordination.
- Configure hub wg0. Add peers with unique AllowedIPs (both their wg /32 and their office subnet).
- Configure one spoke (pilot). Enable forwarding, routes, and firewall rules. Bring tunnel up.
- Test from hub to spoke wg IP. Ping, then traceroute to a pilot host in the office.
- Test from office client to another office. Verify both directions, and verify at least one “real” protocol (SMB or HTTPS) not just ping.
- Confirm MTU. Run DF pings at 1400-ish. If it fails, adjust MTU now, not after complaints.
- Add the other spokes. Repeat the same validation for each office.
- Instrument. At minimum, a cron/systemd timer that captures wg show snapshots, plus firewall counters.
- Change control. Treat WireGuard config edits like firewall changes: reviewed, staged, and with rollback.
Operational checklist (weekly, boring, effective)
- Check handshake freshness and traffic counters during business hours.
- Confirm hub disk space (logs and core dumps don’t care about your feelings).
- Review any kernel/networking updates applied to hub or spokes.
- Verify backups include WireGuard keys and configs, encrypted and access-controlled.
- Spot-check nftables counters for unexpected drops.
Security checklist (practical, not performative)
- Restrict hub inbound to UDP/51820 and admin access (SSH) from known management IPs.
- On hub, default-drop forwarding and explicitly allow only what you intend between spokes.
- Keep keys per site; do not reuse keys across offices.
- Document which office subnets are allowed to access which services; enforce it at the hub.
- Plan key rotation (quarterly/biannually) and rehearse it. Rehearsal is where you discover hidden dependencies.
Three corporate mini-stories from the trenches
Mini-story 1: The outage caused by a wrong assumption
They “standardized” the three branch offices by buying the same ISP package everywhere. Same model modem/router. Same plan. Same marketing promises. The assumption was that NAT behavior would also be “the same,” so they skipped persistent keepalives. After all, WireGuard is modern and handles roaming. What could go wrong?
Two weeks later, Office C started filing tickets: “VPN drops randomly.” It wasn’t random. The ISP device had an aggressive UDP timeout and would garbage-collect the NAT mapping after a short idle period. When the office was quiet (lunch breaks, early mornings), the mapping expired. The next packet from Office C went out, but the inbound response from the hub came back to a dead port mapping.
On the hub, the peer endpoint looked stale. Handshake would update only after someone in Office C tried repeatedly, giving the NAT enough outbound traffic to recreate the mapping. The team chased DNS, then chased “maybe it’s the hub CPU,” then argued about whether UDP should be “more reliable.”
The fix was embarrassingly small: PersistentKeepalive = 25 on the spoke. The lesson was bigger: don’t assume NAT behavior is consistent just because the plastic box looks the same. NAT timeouts vary by firmware, configuration, and mood.
Mini-story 2: The optimization that backfired
A network engineer decided the hub’s firewall rules were “too strict.” Their logic: if the tunnel is encrypted, why not just accept all forwarded traffic from wg0? They replaced explicit forwarding rules with a broad accept, and then—because they liked symmetry—they allowed forwarding from wg0 to eth0 as well. “Future-proofing,” they called it.
It was fine for months. Then a new SaaS migration happened. One office had a misconfigured route on a test VLAN that pointed a big chunk of internet-bound traffic into the VPN. The hub happily forwarded it out to the internet. The hub’s egress became saturated during business hours.
The symptoms were classic: VoIP jitter, file transfers crawling, and the kind of vague tickets that say “the network feels slow.” Because the hub was now also an unintended internet breakout, the problem looked like “WireGuard performance” when it was actually “you turned your hub into a transit ISP.”
The rollback was to re-introduce tight forwarding rules and explicitly block wg0→eth0 for everything except known management subnets. The optimization—“fewer firewall rules”—saved exactly zero minutes long-term and created a hard-to-see failure mode. Keep your hub honest: route only what you intend.
Mini-story 3: The boring practice that saved the day
A different company had a habit that looked painfully dull: every network change required a small “before/after” capture of wg show, ip route, and firewall rules. They stored it next to the change ticket. No heroics, no mystery.
One Friday, an office lost access to two internal services hosted in another branch. The tunnels were up. Pings worked to the routers. But application traffic failed. The on-call engineer pulled the last “known good” snapshot and compared it to current state.
The diff was obvious: an office subnet had been changed from 10.10.30.0/24 to 10.10.30.0/23 during a “minor expansion,” but the hub’s AllowedIPs still claimed only the /24. Half the office’s hosts were now outside the routed range: hosts that had landed inside the original /24 still worked, while hosts in the new half of the /23 were blackholed. It looked random, because it was random relative to DHCP allocation.
The fix took minutes: update AllowedIPs on the hub peer and restart the interface. The saving grace wasn’t genius; it was a paper trail. Boring practices don’t get applause. They do keep you from spending your weekend proving you’re right.
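A before/after capture like theirs fits in a dozen lines of shell. This is a sketch under assumptions — paths and tool names are placeholders, and it degrades gracefully if a tool is missing:

```shell
#!/bin/sh
# Snapshot the three things that explain most routing incidents:
# WireGuard peer state, the kernel routing table, and the firewall ruleset.
snapdir="${1:-/tmp/netsnap-$(date +%Y%m%d-%H%M%S)}"
mkdir -p "$snapdir"

capture() {  # capture <outfile> <command...>
  out="$snapdir/$1"; shift
  if command -v "$1" >/dev/null 2>&1; then
    "$@" > "$out" 2>&1 || true
  else
    echo "$1: not installed" > "$out"
  fi
}

capture wg-show.txt   wg show
capture ip-route.txt  ip route
capture nft-rules.txt nft list ruleset

echo "snapshot written to $snapdir"
# Later, during an incident:  diff -u /path/to/known-good/wg-show.txt "$snapdir/wg-show.txt"
```

Run it before and after every change, attach the directory to the ticket, and the Friday diff takes seconds instead of a weekend.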
FAQ
1) Do I need a full mesh for only three offices?
No. You can, but hub-and-spoke is easier to operate. One hub to secure, monitor, and troubleshoot. Mesh becomes “everyone is responsible for everyone else’s reachability.”
2) Should the hub NAT between offices?
Not by default. Route real subnets end-to-end so logs and ACLs keep meaning. NAT is acceptable for central internet breakout or temporary overlap mitigation, not as a standard path.
3) Where should I set PersistentKeepalive?
On spokes, almost always, especially behind NAT. 25 seconds is a common value. If your link is metered or expensive, you can increase it, but don’t remove it casually.
4) Can I use DNS names instead of a static IP for the hub endpoint?
Yes, but don’t pretend DNS is infallible. If you use a DNS name, ensure the spokes can resolve it reliably and consider how you’ll handle a hub IP change without human intervention.
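One common mitigation is a small cron job on each spoke that re-sets the endpoint: `wg set` accepts a hostname and resolves it at set time, so re-applying the endpoint forces a fresh lookup. A sketch — the interface name, key, and hostname are placeholders:

```shell
#!/bin/sh
# Re-resolve the hub endpoint periodically (e.g. every 5 minutes from cron on each spoke).
IFACE="wg0"
PEER="HUB_PUBLIC_KEY_BASE64="        # the hub's public key; placeholder, not a real key
ENDPOINT="hub.example.com:51820"

if command -v wg >/dev/null 2>&1 && ip link show "$IFACE" >/dev/null 2>&1; then
  # Re-setting the endpoint triggers a fresh DNS resolution of the hostname.
  wg set "$IFACE" peer "$PEER" endpoint "$ENDPOINT"
else
  echo "wg or $IFACE not present; nothing to do"
fi
```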
5) Why are you using /32 addresses on spokes for wg0?
Because it reduces ambiguity. The tunnel interface doesn’t need to pretend it’s a shared L2 segment; it needs a stable identifier per peer. A /32 is clean, and routing is governed by AllowedIPs anyway.
6) What’s the best MTU for WireGuard site-to-site?
Start at 1420. If you have PPPoE or weird WAN paths, you may need lower. Measure with DF pings and validate with real traffic (SMB/HTTPS) before declaring victory.
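The arithmetic behind that 1420 default, plus the probe command, sketched in shell (the far-office address is a placeholder):

```shell
# Why 1420: worst-case outer encapsulation is IPv6 (40) + UDP (8) + WireGuard framing (32) = 80 bytes.
WAN_MTU=1500
OVERHEAD_V4=$((20 + 8 + 32))   # IPv4 outer header + UDP + WireGuard framing
OVERHEAD_V6=$((40 + 8 + 32))   # IPv6 outer header + UDP + WireGuard framing
echo "IPv4-only tunnel MTU: $((WAN_MTU - OVERHEAD_V4))"   # 1440
echo "safe default MTU:     $((WAN_MTU - OVERHEAD_V6))"   # 1420

# To measure instead of calculate, send don't-fragment pings across the tunnel and bisect:
# ICMP payload 1392 + 8 (ICMP header) + 20 (IP header) = 1420 on the wire inside the tunnel.
# ping -M do -s 1392 -c 3 10.10.30.1    # placeholder far-office address
```

If the DF ping at your chosen size fails but a smaller one succeeds, something on the WAN path (PPPoE, a second tunnel, a cautious ISP) is eating headroom; lower the MTU until both the ping and a real transfer behave.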
7) Can I limit which offices can talk to each other?
Yes, and you should. Enforce it on the hub with forward chain rules (wg0→wg0) that match source/destination subnets. Don’t rely on “they probably won’t.”
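An inter-office policy might be sketched like this in nftables — the subnets are illustrative (say A = 10.10.10.0/24, B = 10.10.20.0/24 with the shared services, C = 10.10.30.0/24):

```
table inet interoffice {
  chain forward {
    type filter hook forward priority filter; policy drop;
    ct state established,related accept

    # Every office may reach the shared services in office B.
    iifname "wg0" oifname "wg0" ip daddr 10.10.20.0/24 accept

    # A and C have no rule allowing them to talk directly, so the policy drops it.
    counter drop comment "count inter-office denials"
  }
}
```

The absence of a rule is the policy. That’s the point of default-drop: “they probably won’t” becomes “they can’t, and we count the attempts.”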
8) How do I handle overlapping office subnets without renumbering?
You can NAT one side or use 1:1 mapping, but it’s operational debt. You’ll spend time debugging identity and ACL issues. If this is permanent, plan a renumbering project.
9) Does WireGuard provide user-level access control?
No. It’s peer keys and AllowedIPs. For user VPN, you typically terminate on a separate system and integrate with identity, MFA, and device posture controls.
10) Do I need high availability for the hub?
If the offices depend on it for core operations, yes. At minimum: backups, automated rebuild, and a tested failover plan (secondary hub, or a floating IP in environments that support it). “We’ll just restore it” is not a plan until you’ve timed it.
Next steps that actually reduce pager noise
WireGuard hub-and-spoke is a solid design, but only if you treat the hub like infrastructure, not a side project. Here’s what to do next, in a practical order:
- Write down the intent. Which subnets exist, which ones should route over WireGuard, and which offices are allowed to access which services.
- Make AllowedIPs reviewable. Keep configs in version control, require review, and enforce “no overlaps” as a human checklist item.
- Codify firewall policy. Default-drop forwarding on the hub, then allow only the inter-office flows you want. Count drops.
- Measure MTU once, properly. DF ping testing and a real file transfer test. Then set MTU and stop touching it unless the WAN changes.
- Build a repeatable diagnosis routine. The playbook above is your first responder. Print it, pin it, and use it.
- Decide whether you want central breakout. If yes, do it intentionally with explicit routes and NAT, and plan for bandwidth and policy enforcement.
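If you do choose central breakout, the NAT side is small; an nftables sketch with illustrative subnets and interface names:

```
# Central internet breakout on the hub -- only if chosen deliberately.
# Illustrative: all offices in 10.10.0.0/16 egress via the hub's eth0.
table ip breakout {
  chain postrouting {
    type nat hook postrouting priority srcnat; policy accept;
    oifname "eth0" ip saddr 10.10.0.0/16 masquerade
  }
}
```

Remember the other half: spokes would also need a default route into the tunnel and `0.0.0.0/0` (or a broad split) in their AllowedIPs toward the hub, and the hub’s uplink now carries everyone’s browsing. Size and monitor it accordingly.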
If you do those things, your three offices will stop acting like three separate planets. And when something breaks—as it eventually will—you’ll have enough signal to fix it without ritual sacrifices to the networking gods.