Three offices. One “quick” VPN. Suddenly the CFO can reach the lab printer, but the helpdesk can’t reach the ticketing DB, and the CEO’s Zoom calls sound like a robot drowning. You didn’t “set up a tunnel.” You built a network. Networks have physics, policy, and sharp edges.
This is a practical, production-minded design for a hub-and-spoke WireGuard VPN connecting three offices, with access rules by role. We’ll do it with sane routing, explicit segmentation, and enough observability that you can debug the inevitable “it worked yesterday” page without guessing.
Architecture decisions that won’t bite you later
Why hub-and-spoke for three offices?
With three sites, full-mesh is tempting: “just connect everyone to everyone.” That works until you add roles, shared services, future sites, and a compliance auditor with a clipboard and dreams. Hub-and-spoke gives you a single choke point where you can:
- enforce policy consistently (firewall, logging, DNS)
- make routing predictable (spokes send non-local traffic to hub)
- reduce operational complexity (one place to debug inter-site issues)
- add a fourth office without learning new ways to suffer
The trade-off is that the hub becomes important. That’s fine. Production systems always have important parts; the trick is knowing which parts are important and treating them accordingly.
Define the roles before you define the rules
“Access rules by role” is not a firewall feature. It’s a business decision expressed in packets. Start with roles that map to traffic patterns:
- IT/Admin: can reach management subnets, jump hosts, monitoring, maybe everything during incidents.
- Finance: can reach accounting systems, file shares, limited printers, not R&D test networks.
- Engineering: can reach build systems, repos, staging environments, not HR systems.
- Guest/Vendor: can reach one or two apps, and nothing else.
The moment your “roles” are actually “people,” you’ll end up with an ACL snowstorm. Keep roles few. Attach users/devices to roles via WireGuard peer IPs or separate interfaces, then enforce the policy centrally.
Don’t use WireGuard as your only policy engine
WireGuard is excellent at what it does: it’s a small, fast encrypted tunnel with a clean key model. It is not an RBAC system. WireGuard’s AllowedIPs is both routing and a very coarse admission control. It’s not a firewall, and pretending it is leads to weird outages.
Use WireGuard for secure transport and basic peer scoping. Use nftables (or iptables, if you must) on the hub to enforce who can talk to what. This gives you auditability and a single location to implement “Finance can reach 10.10.20.0/24 but not 10.10.30.0/24”.
Joke #1: A VPN without firewall rules is just a really confident LAN cable.
Assume NAT exists somewhere and plan for it
Offices often sit behind commodity routers. WireGuard is UDP-based and NAT-friendly, but NAT will still influence keepalives, MTU, and inbound reachability. Your hub should have a static public IP if possible. If not, you’ll need a more creative approach, but for three offices, pay for the static IP and be done with it.
Decide: routed vs bridged. Choose routed.
Bridging L2 across sites is how you import broadcast storms and “mystery ARP” into your life. Use routed subnets per office. If someone insists they “need L2,” make them name the application and the protocol, then fix that application instead.
Interesting facts and context (because history repeats in tickets)
- WireGuard’s design goal was simplicity: far fewer lines of code than traditional VPN stacks, which reduces attack surface and debugging time.
- It uses modern cryptography by default: Curve25519 for key exchange, ChaCha20-Poly1305 for authenticated encryption, and BLAKE2s for hashing.
- “Cryptokey routing” is WireGuard’s core idea: the key you use also determines what IP ranges you’re allowed to send to that peer.
- IPsec is older than many of your routers: it originated in the 1990s, and its complexity is partly a fossil record of old requirements.
- Site-to-site VPNs used to be hardware-only: early enterprise deployments leaned on dedicated appliances because CPUs were slower and crypto was expensive.
- UDP VPNs can outperform TCP-based VPNs: TCP-over-TCP meltdown is real, and WireGuard avoids that class of misery by staying on UDP.
- MTU bugs are timeless: path MTU discovery and ICMP filtering have caused “everything works except this one app” incidents for decades.
- Linux routing policy databases (multiple routing tables and ip rule) exist because “one default route” stopped being enough in real networks.
One paraphrased idea worth keeping on your monitor: “Hope is not a strategy” — often attributed in ops culture to General Gordon R. Sullivan. Treat it as a mindset, not a quote police exercise.
IP plan and routing model (the boring part that saves you)
Office subnets: keep them unique and boring
If your three offices all use 192.168.1.0/24, you don’t have “a VPN problem.” You have an addressing problem. Fix it first. Renumbering hurts, but less than running NAT inside your VPN forever.
A clean, readable plan:
- Office A LAN: 10.10.10.0/24
- Office B LAN: 10.10.20.0/24
- Office C LAN: 10.10.30.0/24
- VPN transit (WireGuard): 10.200.0.0/24
The WireGuard interface IPs live in 10.200.0.0/24. This is not a “LAN.” It’s a transit network. Don’t put printers there. Don’t put people there. Keep it sacred and slightly boring.
Where do the gateways live?
Each office gets a small Linux gateway (physical box, VM, or router that can run WireGuard). The hub is also a Linux gateway in a data center or cloud VPC with a stable public IP.
Example naming:
- hub1 (public IP, central policy): wg0: 10.200.0.1/24
- spoke-a: wg0: 10.200.0.11/24, LAN interface on 10.10.10.0/24
- spoke-b: wg0: 10.200.0.21/24, LAN interface on 10.10.20.0/24
- spoke-c: wg0: 10.200.0.31/24, LAN interface on 10.10.30.0/24
Routing model: spokes default to hub for remote office networks
Spokes should route traffic destined for other office subnets through the hub. The hub routes to the correct spoke based on destination. This gives you centralized policy and avoids “spoke-to-spoke exceptions” multiplying.
You will still have local internet breakout at each office. This is not a “send all traffic to the hub” design unless you specifically need that and enjoy debugging web browsing over a WAN link.
Role-based access: what “by role” actually means on WireGuard
Three ways to implement roles
You can implement “roles” at different layers. Pick one and be consistent.
- By source subnet (recommended): assign roles to internal LAN VLANs/subnets in each office (e.g., Finance VLAN). The hub firewall enforces access from those subnets. Pros: scalable, easy to audit. Cons: requires internal network hygiene.
- By WireGuard peer IP: give each role (or each device) a VPN IP and write firewall rules using those source IPs. Pros: works even if LANs are flat. Cons: can get messy; you’re encoding identity as IP.
- By separate WireGuard interfaces per role: wg-finance, wg-eng, etc. Pros: very explicit. Cons: more interfaces, more moving parts; usually unnecessary for three offices unless compliance demands it.
The practical recommendation
Do roles by source subnet inside each office, and enforce at the hub. If Office A is currently one flat subnet, split it. Yes, that’s “networking work.” That’s also the point.
Example role VLANs at Office A:
- Office A Finance: 10.10.11.0/24
- Office A Engineering: 10.10.12.0/24
- Office A IT/Admin: 10.10.13.0/24
Repeat similarly for Offices B and C, or keep roles consistent across offices if you can. Consistency is a force multiplier when you’re sleepy.
What WireGuard can and can’t do for RBAC
WireGuard can ensure that a given peer is only used for certain destination prefixes via AllowedIPs. That prevents accidental routing leaks and some classes of spoofing. It does not stop a peer from reaching everything it can route to once inside the tunnel. That’s the firewall’s job.
WireGuard configs: hub, spokes, and what AllowedIPs really does
Hub configuration
The hub terminates all three spokes. The hub’s AllowedIPs for each peer should include:
- the spoke’s WireGuard tunnel IP (a /32)
- the spoke’s LAN subnets (office LAN and role VLANs)
cr0x@server:~$ sudo cat /etc/wireguard/wg0.conf
[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = HUB_PRIVATE_KEY
# Optional but practical:
SaveConfig = false
# Spoke A
[Peer]
PublicKey = SPOKE_A_PUBLIC_KEY
AllowedIPs = 10.200.0.11/32, 10.10.10.0/24, 10.10.11.0/24, 10.10.12.0/24, 10.10.13.0/24
# Spoke B
[Peer]
PublicKey = SPOKE_B_PUBLIC_KEY
AllowedIPs = 10.200.0.21/32, 10.10.20.0/24, 10.10.21.0/24, 10.10.22.0/24, 10.10.23.0/24
# Spoke C
[Peer]
PublicKey = SPOKE_C_PUBLIC_KEY
AllowedIPs = 10.200.0.31/32, 10.10.30.0/24, 10.10.31.0/24, 10.10.32.0/24, 10.10.33.0/24
Spoke configuration
Each spoke has one peer: the hub. The spoke’s AllowedIPs for the hub should include:
- the hub’s tunnel IP (/32) and any central services on the hub side
- all other office LANs reachable via the hub
That means the spoke routes other offices via the hub. No spoke-to-spoke peering. No special snowflakes.
cr0x@server:~$ sudo cat /etc/wireguard/wg0.conf
[Interface]
Address = 10.200.0.11/24
PrivateKey = SPOKE_A_PRIVATE_KEY
[Peer]
PublicKey = HUB_PUBLIC_KEY
Endpoint = hub1.example.net:51820
AllowedIPs = 10.200.0.1/32, 10.200.0.0/24, 10.10.20.0/24, 10.10.21.0/24, 10.10.22.0/24, 10.10.23.0/24, 10.10.30.0/24, 10.10.31.0/24, 10.10.32.0/24, 10.10.33.0/24
PersistentKeepalive = 25
Why include 10.200.0.0/24 in AllowedIPs on the spoke?
So the spoke can reach other peers’ tunnel IPs if needed (monitoring, ping, health checks). If you’d rather lock it down, only include 10.200.0.1/32 and rely on hub-based monitoring. Both are valid; pick one. I usually allow the transit /24 and firewall it at the hub.
IP forwarding: you must enable it
This is the part people forget, then blame WireGuard. The tunnel is up, but nobody can reach anything because the Linux box isn’t routing.
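A minimal sketch of turning forwarding on now and making it survive reboots; the drop-in file name is arbitrary.
cr0x@server:~$ sudo sysctl -w net.ipv4.ip_forward=1
net.ipv4.ip_forward = 1
cr0x@server:~$ echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-vpn-forwarding.conf
net.ipv4.ip_forward = 1
Do this on the hub and every spoke, then verify with Task 3 below.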
Enforcing access with nftables (not vibes)
Principle: default deny between roles, allow by need
The hub should enforce segmentation between office role subnets and shared services. Start with “deny inter-office by default,” then allow the flows that are truly required. It feels strict. It also prevents “Finance laptop can scan the entire engineering network” incidents.
Traffic direction you care about
- Forwarded traffic through the hub: office-to-office and office-to-central-services
- Input traffic to the hub itself: SSH, monitoring, DNS if the hub runs it
Example nftables policy on the hub
This is a simplified but realistic pattern: accept established, then explicit allows, then drop. Log drops sparingly, or you’ll DDoS your own disk during an incident.
cr0x@server:~$ sudo cat /etc/nftables.conf
flush ruleset
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        iif lo accept
        ct state established,related accept
        # Allow WireGuard
        udp dport 51820 accept
        # Allow SSH from IT/Admin role subnets only
        ip saddr { 10.10.13.0/24, 10.10.23.0/24, 10.10.33.0/24 } tcp dport 22 accept
        # Allow monitoring from IT/Admin
        ip saddr { 10.10.13.0/24, 10.10.23.0/24, 10.10.33.0/24 } tcp dport { 9100, 9182 } accept
        # Optional: ICMP for troubleshooting
        ip protocol icmp accept
        ip6 nexthdr icmpv6 accept
    }
    chain forward {
        type filter hook forward priority 0; policy drop;
        ct state established,related accept
        # Allow IT/Admin roles cross-office (jump hosts, mgmt)
        ip saddr { 10.10.13.0/24, 10.10.23.0/24, 10.10.33.0/24 } ip daddr { 10.10.0.0/16 } accept
        # Allow Finance to reach accounting service subnet (central or specific office)
        ip saddr { 10.10.11.0/24, 10.10.21.0/24, 10.10.31.0/24 } ip daddr 10.10.50.0/24 tcp dport { 443, 5432 } accept
        # Allow Engineering to reach build systems
        ip saddr { 10.10.12.0/24, 10.10.22.0/24, 10.10.32.0/24 } ip daddr 10.10.60.0/24 tcp dport { 22, 443, 8080 } accept
        # Allow limited printer access within each office only (example)
        ip saddr 10.10.11.0/24 ip daddr 10.10.10.50 tcp dport { 631, 9100 } accept
        ip saddr 10.10.21.0/24 ip daddr 10.10.20.50 tcp dport { 631, 9100 } accept
        ip saddr 10.10.31.0/24 ip daddr 10.10.30.50 tcp dport { 631, 9100 } accept
        # Log and drop everything else
        limit rate 5/second log prefix "vpn-forward-drop " flags all counter drop
    }
}
This is where you earn your keep. The hub becomes a policy enforcement point. When someone asks “why can’t vendor X reach Y,” you can answer with a rule, a log line, and a change request—not folklore.
DNS across sites without turning it into a haunted house
Pick one DNS strategy and stick to it
Multi-office VPNs fail in two ways: routing breaks, or DNS breaks. Routing failures are obvious; DNS failures are slow, weird, and emotionally draining.
Common strategies:
- Central DNS: one internal DNS service reachable over the VPN. Easiest policy, single source of truth. Ensure redundancy.
- Per-office DNS with conditional forwarding: each office resolves local names and forwards other zones across VPN. Works well but needs discipline.
- Pure IP + hosts files: technically possible, socially disastrous. Don’t.
Recommendation
For three offices, run central DNS at the hub (or two hubs for HA), and configure office clients to use it for internal zones. If offices must resolve local-only names, use conditional forwarders. Keep internal zones explicit (e.g., corp.internal, svc.corp.internal).
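A sketch of the conditional-forwarder flavor, assuming dnsmasq on a spoke gateway. The zone names match the examples above; the file path and the hub resolver IP (10.200.0.1) are assumptions for illustration.
cr0x@server:~$ cat /etc/dnsmasq.d/corp-internal.conf
# Forward internal zones (and internal reverse lookups) to the central resolver over the VPN
server=/corp.internal/10.200.0.1
server=/10.in-addr.arpa/10.200.0.1
Everything else follows the office’s normal upstream resolvers.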
Performance: MTU, UDP realities, and why “fast” is a design choice
MTU is not optional trivia
WireGuard adds overhead. If you push packets that are too large, they fragment or get dropped. If ICMP “fragmentation needed” is blocked somewhere (it often is), you get classic symptoms: SSH works but file transfers stall; web apps half-load; SMB becomes performance art.
Set MTU deliberately. A common safe value is 1420 on WireGuard interfaces. Sometimes you need lower (1412, 1380) depending on upstream links and encapsulations.
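If wg-quick manages the interface, the cleanest place to pin it is the [Interface] section. A sketch for Spoke A; 1420 is a starting point, not a law.
[Interface]
Address = 10.200.0.11/24
MTU = 1420
PrivateKey = SPOKE_A_PRIVATE_KEY
To test a lower value on a live interface without editing anything: sudo ip link set dev wg0 mtu 1412, then re-run the large-packet pings from Task 10.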
UDP and “state”
WireGuard is UDP. That’s good for performance and avoids TCP-over-TCP issues, but it means:
- NAT timeouts can silently break return paths unless you use keepalives.
- Some networks treat UDP as suspicious and rate-limit it.
- You must monitor handshake freshness and traffic counters, not “connection state.” A minimal check is sketched below.
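A minimal freshness check, assuming wg0 and an arbitrary 180-second staleness threshold; feed it into whatever monitoring you already run.
cr0x@server:~$ sudo wg show wg0 latest-handshakes | awk -v now="$(date +%s)" '{ age = now - $2; print $1, age " seconds since handshake", (age > 180 ? "STALE" : "ok") }'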
Joke #2: MTU problems are like cats—if you think you don’t have one, it’s just hiding under the couch.
Bandwidth shaping and fairness
If one office saturates the hub link, everyone gets to share the pain. If this is a concern, use traffic control (tc) to shape by interface or by source subnet. Don’t do it prematurely. But do it on purpose if your business apps get stuck behind someone’s offsite backup job.
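A deliberately simple sketch using a token bucket filter on the hub’s WAN interface; the interface name and rate are placeholders for your actual uplink.
cr0x@server:~$ sudo tc qdisc add dev eth0 root tbf rate 50mbit burst 256kbit latency 50ms
cr0x@server:~$ tc -s qdisc show dev eth0
Per-subnet fairness needs classful queuing (htb plus filters), which is more machinery; only go there when the simple shaper demonstrably isn’t enough.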
Operations: key rotation, change control, and upgrades
Key management isn’t hard, but it is real
WireGuard uses static public/private keys per peer. Keys don’t expire by default. That’s both a strength and a governance problem. You should:
- rotate keys on a schedule (e.g., annually) and on staff departures; the mechanics are sketched after this list
- store private keys securely (root-only files, backups encrypted)
- document which key belongs to which site/role
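Generation itself is mechanically simple. A sketch for producing a new spoke keypair; the file names are arbitrary, keep them root-only.
cr0x@server:~$ umask 077
cr0x@server:~$ wg genkey | tee spoke-a.key | wg pubkey > spoke-a.pub
Update the hub’s [Peer] PublicKey for that spoke, reload both ends, and confirm a fresh handshake before destroying the old key.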
Change management that doesn’t ruin your weekend
The hub is central. Treat its config like production code:
- keep /etc/wireguard and firewall config in version control (private repo)
- use peer naming conventions and comments outside the config file if needed
- deploy changes with a predictable process (Ansible, Salt, or at least a scripted rsync); a minimal apply pattern is sketched below
- always have a rollback plan
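One low-drama apply pattern, assuming wg-quick manages wg0 and /etc/nftables.conf is the canonical policy file: syntax-check the firewall before loading it, and sync WireGuard peers without bouncing the interface.
cr0x@server:~$ sudo nft -c -f /etc/nftables.conf
cr0x@server:~$ sudo nft -f /etc/nftables.conf
cr0x@server:~$ sudo bash -c 'wg syncconf wg0 <(wg-quick strip wg0)'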
Upgrades
WireGuard in-kernel on Linux is stable. Still, upgrades change kernels, nftables behavior, and NIC drivers. Don’t upgrade the hub and all spokes in one go. Stagger. Verify handshakes and routes after each step.
Practical tasks: commands, outputs, decisions (12+ you will actually run)
Task 1: Verify WireGuard interface state
cr0x@server:~$ sudo wg show
interface: wg0
  public key: 1Xh9nR...hubpub...
  private key: (hidden)
  listening port: 51820
peer: rGm8z...spokeA...
  allowed ips: 10.200.0.11/32, 10.10.10.0/24, 10.10.11.0/24, 10.10.12.0/24, 10.10.13.0/24
  latest handshake: 37 seconds ago
  transfer: 2.31 GiB received, 3.02 GiB sent
  persistent keepalive: every 25 seconds
What it means: “latest handshake” tells you if the peer is alive. Transfer counters tell you if traffic is flowing both ways.
Decision: If handshakes are stale (>2 minutes) for a NATed spoke, check keepalive, UDP reachability, and endpoint correctness before touching routing.
Task 2: Confirm interface IP and MTU
cr0x@server:~$ ip -d link show dev wg0
8: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/none
wireguard version 1.0.20210914
What it means: MTU is set to 1420; interface is UP.
Decision: If you see MTU 1500 and you’re on PPPoE or other overhead-heavy links, plan an MTU reduction and retest file transfers.
Task 3: Ensure IPv4 forwarding is enabled on hub and spokes
cr0x@server:~$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1
What it means: Linux routing is enabled.
Decision: If it’s 0, enable it and make it persistent; otherwise you have a “VPN that connects but doesn’t route,” which is a classic.
Task 4: Check routes for office subnets on the hub
cr0x@server:~$ ip route show table main | grep 10.10.
10.10.10.0/24 dev wg0 scope link
10.10.20.0/24 dev wg0 scope link
10.10.30.0/24 dev wg0 scope link
What it means: The hub believes those subnets are reachable via wg0.
Decision: If routes are missing, your hub’s peer AllowedIPs may be incomplete or the interface isn’t applying them as routes (depending on how you manage routing). Fix AllowedIPs first.
Task 5: Validate that a spoke routes other offices via the hub
cr0x@server:~$ ip route get 10.10.20.10
10.10.20.10 dev wg0 src 10.200.0.11 uid 0
cache
What it means: Spoke A routes traffic to Office B via wg0.
Decision: If it tries to go out the WAN interface, the spoke’s AllowedIPs for its hub peer is missing Office B’s subnets.
Task 6: Check nftables rules are loaded (hub)
cr0x@server:~$ sudo nft list ruleset | sed -n '1,80p'
table inet filter {
    chain input {
        type filter hook input priority filter; policy drop;
        iif "lo" accept
        ct state established,related accept
        udp dport 51820 accept
        ip saddr { 10.10.13.0/24, 10.10.23.0/24, 10.10.33.0/24 } tcp dport 22 accept
        ip protocol icmp accept
    }
    chain forward {
        type filter hook forward priority filter; policy drop;
        ct state established,related accept
        ip saddr { 10.10.13.0/24, 10.10.23.0/24, 10.10.33.0/24 } ip daddr 10.10.0.0/16 accept
        limit rate 5/second log prefix "vpn-forward-drop " flags all counter drop
    }
}
What it means: Your policy exists and is enforced with default-drop.
Decision: If rules aren’t present, you’re either not loading nftables at boot or another tool overwrote it. Fix the service ordering before adding more rules.
Task 7: Find why a flow is blocked using counters and logs
cr0x@server:~$ sudo nft list chain inet filter forward
table inet filter {
    chain forward {
        type filter hook forward priority 0; policy drop;
        ct state established,related accept
        ip saddr { 10.10.13.0/24, 10.10.23.0/24, 10.10.33.0/24 } ip daddr 10.10.0.0/16 accept
        limit rate 5/second log prefix "vpn-forward-drop " flags all counter packets 41 bytes 2460 drop
    }
}
What it means: 41 forwarded packets were dropped by the catch-all rule.
Decision: If users report access issues and this counter climbs during their test, you need an explicit allow rule (or you’re seeing unexpected traffic that should remain blocked).
Task 8: Confirm UDP 51820 reachability from a spoke
cr0x@server:~$ sudo tcpdump -ni eth0 udp port 51820 -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:44:01.102938 IP 198.51.100.22.39422 > 203.0.113.10.51820: UDP, length 148
12:44:01.128440 IP 203.0.113.10.51820 > 198.51.100.22.39422: UDP, length 92
What it means: Packets arrive and responses go back. Basic reachability is fine.
Decision: If you see inbound but no outbound, suspect firewall on hub. If you see nothing inbound, suspect upstream NAT/port forwarding or wrong endpoint IP.
Task 9: Validate that NAT isn’t accidentally being applied to VPN traffic
cr0x@server:~$ sudo nft list table ip nat
table ip nat {
    chain postrouting {
        type nat hook postrouting priority srcnat; policy accept;
        oifname "eth0" masquerade
    }
}
What it means: Everything leaving eth0 is NATed, which might include VPN-routed office traffic if it egresses via the hub.
Decision: If you intend pure routed office-to-office traffic, ensure inter-office traffic stays on wg0 and doesn’t hairpin to eth0. If you do central internet breakout, NAT may be correct—but then you must plan bandwidth and logging accordingly.
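If you want to be certain inter-office traffic is never masqueraded, one sketch is to scope the NAT rule to non-internal destinations. This assumes all of your internal space lives under 10.0.0.0/8; adjust to your real plan.
table ip nat {
    chain postrouting {
        type nat hook postrouting priority srcnat; policy accept;
        # Masquerade only traffic that is actually leaving for the internet
        oifname "eth0" ip daddr != 10.0.0.0/8 masquerade
    }
}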
Task 10: Spot MTU black holes with ping + DF
cr0x@server:~$ ping -M do -s 1372 10.10.20.10 -c 3
PING 10.10.20.10 (10.10.20.10) 1372(1400) bytes of data.
1380 bytes from 10.10.20.10: icmp_seq=1 ttl=62 time=18.9 ms
1380 bytes from 10.10.20.10: icmp_seq=2 ttl=62 time=18.7 ms
1380 bytes from 10.10.20.10: icmp_seq=3 ttl=62 time=18.8 ms
--- 10.10.20.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
What it means: A 1400-byte packet (including headers) passes without fragmentation. Good sign.
Decision: If this fails at sizes well below 1400, lower the WireGuard MTU and re-test, or stop blocking ICMP “frag needed” somewhere upstream.
Task 11: Confirm reverse path filtering isn’t dropping routed VPN traffic
cr0x@server:~$ sysctl net.ipv4.conf.all.rp_filter
net.ipv4.conf.all.rp_filter = 0
What it means: Reverse-path filtering is disabled globally (0). Strict mode (1) often drops legitimate traffic on multi-homed gateways where routing is asymmetric, which is why routed VPN gateways frequently relax it.
Decision: If rp_filter is 1 or 2 and you see one-way traffic, adjust it (carefully) on the relevant interfaces. Don’t disable security blindly; understand the routing asymmetry first.
Task 12: Verify LAN-side routes exist back to remote office subnets
cr0x@server:~$ ip route | grep 10.10.20.0/24
10.10.20.0/24 via 10.10.10.1 dev eth1
What it means: A host on Office A LAN knows to send traffic to Office B via the local gateway (10.10.10.1).
Decision: If office clients don’t have a route, either advertise routes via your office router, or set the Linux gateway as default gateway for those VLANs. VPN routing fails if only the gateways know the routes.
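If that office edge router happens to be Linux, the missing piece is a pair of static routes pointing at the local VPN gateway (10.10.10.1 here, matching the output above). This is a sketch; persist the routes in the router’s own configuration, not as one-off commands.
cr0x@server:~$ sudo ip route add 10.10.20.0/24 via 10.10.10.1
cr0x@server:~$ sudo ip route add 10.10.30.0/24 via 10.10.10.1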
Task 13: Check conntrack state for a specific flow
cr0x@server:~$ sudo conntrack -L -p tcp --dport 443 2>/dev/null | head
tcp 6 431999 ESTABLISHED src=10.10.11.22 dst=10.10.50.10 sport=53422 dport=443 src=10.10.50.10 dst=10.10.11.22 sport=443 dport=53422 [ASSURED] mark=0 use=1
What it means: The hub sees an established TCP flow between a Finance client and the accounting service.
Decision: If users complain of resets/timeouts but conntrack never shows ESTABLISHED, suspect firewall drops, routing, or MTU. If it shows ESTABLISHED but app is slow, look at congestion/packet loss.
Task 14: Confirm time is sane (handshakes hate time travel)
cr0x@server:~$ timedatectl status | sed -n '1,8p'
Local time: Sun 2025-12-28 10:25:44 UTC
Universal time: Sun 2025-12-28 10:25:44 UTC
RTC time: Sun 2025-12-28 10:25:44
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
What it means: Clock is synchronized. Good.
Decision: If time is off, fix NTP first. Weird handshake issues and log correlation failures are guaranteed when clocks drift.
Fast diagnosis playbook
When the phone rings and someone says “the VPN is down,” they usually mean “an application is broken.” Don’t debug the application first. Debug the path.
First: is the tunnel alive?
- On hub: wg show — check latest handshake for the affected spoke.
- On spoke: wg show — confirm it’s talking to the hub endpoint and counters increase during a test.
- If handshake is stale: check UDP reachability, endpoint IP/port, NAT keepalive, and any recent ISP/router changes.
Second: is routing correct on both ends?
- On spoke: ip route get <remote-ip> — should select wg0.
- On hub: ip route get <destination> — ensure it points to the correct peer via wg0.
- Check the office LAN: do clients have routes back via the local gateway?
Third: is the firewall doing what you told it to do?
- On hub: nft list chain inet filter forward — watch counters while reproducing.
- Look for drops matching the relevant subnets/ports.
- Temporarily add a narrow allow rule (source subnet + destination + port) to confirm the hypothesis, and roll it back if it proves nothing; a sketch follows below.
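A sketch of that temporary rule and its cleanup. The addresses and the rule handle (42) are hypothetical; the comment exists purely so you can find and delete the rule afterwards.
cr0x@server:~$ sudo nft insert rule inet filter forward ip saddr 10.10.11.22 ip daddr 10.10.50.10 tcp dport 443 counter accept comment \"temp-debug\"
cr0x@server:~$ sudo nft -a list chain inet filter forward | grep temp-debug
cr0x@server:~$ sudo nft delete rule inet filter forward handle 42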
Fourth: is it an MTU/fragmentation problem?
- Use ping -M do -s tests across the VPN.
- If small packets work and large ones fail or stall: adjust WireGuard MTU and/or fix ICMP filtering.
Fifth: is it DNS?
- If users can ping IPs but not names, stop touching routes.
- Check resolvers, conditional forwarders, and split-DNS settings. Confirm the DNS server is reachable from the role subnet.
The point is speed: tunnel → route → firewall → MTU → DNS. In that order. Every time.
Common mistakes: symptom → root cause → fix
“Tunnel is up but I can’t reach anything”
Symptom: wg show shows handshakes; pings to remote LAN fail.
Root cause: IP forwarding disabled on hub or spoke, or missing LAN-side routes.
Fix: Enable net.ipv4.ip_forward=1. Ensure office clients route remote subnets via the local gateway; don’t assume they magically know.
“Office A can reach Office B, but not the other way”
Symptom: One-way connectivity; TCP sessions hang.
Root cause: Asymmetric routing or rp_filter drops; also possible firewall state issues.
Fix: Check routes on both sides and hub. Set rp_filter appropriately on multi-homed gateways. Verify hub forward rules permit both directions or use established/related handling.
“Only some apps fail; file transfers stall; SSH works”
Symptom: Small packets succeed; large payloads time out.
Root cause: MTU black hole / fragmentation issues, often with ICMP blocked upstream.
Fix: Set mtu 1420 (or lower) on wg0. Test with ping -M do. Fix ICMP filtering where possible.
“A new office router was installed and now VPN flaps”
Symptom: Handshakes intermittently stale; traffic counters freeze then jump.
Root cause: NAT timeout or UDP handling changed; missing keepalive.
Fix: Set PersistentKeepalive = 25 on spokes behind NAT. Verify the hub port is reachable and not remapped unexpectedly.
“Finance can access engineering services (oops)”
Symptom: Users can reach things they shouldn’t; no obvious firewall denies.
Root cause: Overly broad allow rule, or relying on AllowedIPs as “policy.”
Fix: Implement default-drop in hub forward chain and add explicit allows per role. Keep allow rules narrow: source subnet, destination subnet, ports.
“After a reboot, nothing routes until someone runs a command”
Symptom: Works after manual intervention.
Root cause: Service ordering: WireGuard up before sysctl/nftables, or nftables not loaded.
Fix: Ensure sysctl persistence, enable nftables service, and make WireGuard bring-up depend on networking readiness. Test cold boot, not warm restarts.
“DNS is slow across sites”
Symptom: First lookup takes seconds; then it’s fine.
Root cause: Resolver tries unreachable DNS servers first (wrong order), or UDP fragmentation for DNS responses.
Fix: Ensure clients use reachable internal DNS first. Consider TCP fallback for DNS and fix MTU if needed.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company connected three offices with a shiny new WireGuard hub. The network diagram looked clean. The rollout plan was simple: install gateways, add routes, celebrate.
The assumption was the kind you only notice after it hurts: “clients will learn routes automatically.” The gateways had the right routes. The hub had the right AllowedIPs. Handshakes were fresh. Pings between gateways worked. But users couldn’t reach anything across the VPN unless they happened to be on a subnet where the gateway was the default route.
The helpdesk escalated to “VPN down.” The SRE on call checked WireGuard first, found it healthy, and wasted an hour chasing phantom firewall issues because “tunnel up” usually implies “paths exist.” It didn’t.
The fix was painfully ordinary: update the office routers to advertise static routes for the other offices via the local VPN gateway (or make the VPN gateway the default gateway for the role VLANs). After that, everything “magically” worked, which is what routing looks like when you do it correctly.
The lesson: when you deploy a routed VPN, your LAN must participate. If your office edge router doesn’t know where 10.10.20.0/24 lives, neither do your clients. Networks are not mind readers.
Mini-story 2: The optimization that backfired
Another org got ambitious. They wanted to reduce latency between Offices B and C, so they added a direct spoke-to-spoke WireGuard tunnel “just for heavy traffic.” The hub remained the official path. The direct tunnel was the “fast lane.”
It worked beautifully in a lab test. In production, it created a second routing truth. Some subnets preferred the direct tunnel; others followed the hub. Firewall policies lived on the hub, so now traffic either bypassed policy or got dropped depending on which path the route chose that day.
Then came the weirdest outage: file shares worked from B to C but not from C to B, and only for users in one VLAN. They had accidentally created asymmetric routing: request packets took the direct path, replies returned via the hub. Stateful firewalls do not applaud creativity.
The “optimization” was removed. They went back to hub-and-spoke, then improved performance the boring way: set correct MTU, fixed QoS on the WAN edge, and stopped running backups during business hours. Latency improved, and more importantly, predictability returned.
The lesson: in multi-site networks, “one more tunnel” is almost never “just one more tunnel.” It’s one more routing universe.
Mini-story 3: The boring but correct practice that saved the day
A third company treated their hub like a real production service. They kept configs in version control. Every change had a small description and an expected impact. They also had a one-page “VPN fast diagnosis” runbook taped to the inside of the on-call notebook, like it was 2004 and printers still mattered.
One afternoon, Office A lost access to an internal application hosted in Office C. The tunnel was up. The app was up. People started pointing fingers at “the VPN” and “the firewall” as if those are moods.
The on-call engineer followed the runbook: checked handshakes (fresh), checked routing (correct), checked nftables forward counters (drops increasing on the default drop rule). That narrowed it to policy. The last change was a tightened allow rule for Engineering that accidentally excluded one subnet used by that application.
They reverted the rule in minutes, restored service, and then reintroduced a corrected rule with a proper test plan. The incident stayed small because their changes were traceable and their debugging order was disciplined.
The lesson: boring hygiene—versioned configs, narrow changes, and a deterministic diagnosis order—beats heroics. Every time.
Checklists / step-by-step plan
Step-by-step build plan (do this, in this order)
- Fix IP overlaps: ensure each office has unique subnets. If not, renumber now. Don’t “temporary NAT” yourself into permanent misery.
- Define role subnets: create VLANs/subnets per role where possible. Start with IT/Admin and Finance; add more only when needed.
- Provision the hub: stable public IP, Linux, WireGuard, nftables, NTP, logging, monitoring agent.
- Provision three spokes: each with WireGuard and ability to route between LAN and wg0.
- Generate keys: one keypair per gateway (and optionally per role or per device if you extend later).
- Configure WireGuard: hub peers for each spoke with correct AllowedIPs; spokes point to hub endpoint with remote office subnets in AllowedIPs.
- Enable forwarding and basic sysctls: ip_forward, rp_filter tuning as needed.
- Install routing on office LAN: office router advertises routes for other office subnets via local gateway, or the VPN gateway becomes the VLAN gateway.
- Implement hub firewall: default drop for forward, explicit allows by role. Keep a temporary “break-glass” rule set you can apply during severe incidents.
- DNS plan: decide central vs conditional forwarding. Test name resolution from each role VLAN.
- MTU validation: set wg MTU (start 1420), test large packet pings and real workloads (file copy, HTTPS, database).
- Monitoring: track handshake age, transfer rates, packet drops (nft counters), interface errors, CPU, and bandwidth.
- Document: IP plan, role rules, key ownership, and the fast diagnosis order. Your future self is a stakeholder.
Pre-change checklist (every time you touch policy)
- Do I know which subnets/roles are impacted?
- Do I have a test case (source IP, destination IP, port) to validate?
- Did I check current counters/logs to establish baseline?
- Is the change narrow (least privilege) and reversible?
- Do I have console access if I lock myself out of the hub?
Post-change checklist (prove it, don’t assume it)
- wg show on hub: all peers still handshaking
- Ping and one application test per role (Finance, Eng, IT)
- Check nftables counters: expected allow rule increments; drop rule stable
- Validate DNS resolution for internal zones
- Record the change and observed impact
FAQ
1) Should I do full-mesh instead of hub-and-spoke?
For three offices, you can, but you’ll regret it when you add role-based rules and need consistent enforcement. Hub-and-spoke centralizes policy and makes troubleshooting simpler.
2) Can I enforce roles using only AllowedIPs?
Not reliably. AllowedIPs is coarse and primarily about routing and peer scoping. Use nftables on the hub to enforce role-based access, because that’s where you can express “Finance can reach these ports on those subnets.”
3) Where should access rules live: hub or spokes?
Put the canonical policy on the hub. You can add local egress rules on spokes for defense-in-depth, but don’t make policy a distributed puzzle unless you enjoy inconsistent behavior.
4) Do I need BGP/OSPF for three sites?
No. Static routing is fine and often better for predictability. Dynamic routing can be useful later, but it adds a second control plane to debug. Earn that complexity.
5) How do I handle overlapping subnets if renumbering is politically impossible?
You can use NAT (1:1 or masquerade) at the spokes, but treat it as technical debt with interest. Document the translations, expect application edge cases, and plan a renumbering project anyway.
6) Should I run the hub in the cloud or on-prem?
Cloud is often easier for a stable public IP, good bandwidth, and remote hands. On-prem can be fine if you have redundant internet and power. The deciding factor is operational maturity, not ideology.
7) How do I add remote users later without breaking the site-to-site model?
Add a separate WireGuard interface or peer group for remote users, assign them a dedicated subnet (e.g., 10.200.10.0/24), and enforce role rules on the hub the same way. Avoid mixing road-warrior clients into the same address pool as gateways.
8) What does “PersistentKeepalive” do and when do I need it?
It sends periodic packets to keep NAT mappings alive. Use it on spokes behind NAT (common) and on roaming clients. Don’t bother on the hub with a public IP unless you have a specific reason.
9) How do I make this highly available?
Run two hubs and either use anycast/BGP (advanced) or DNS + failover tooling (simpler), plus dual tunnels from each spoke. Keep state minimal (WireGuard is stateless-ish), but your firewall and DNS must also be redundant.
10) What should I log without drowning in data?
Log dropped forwards at a rate limit on the hub, track handshake age and traffic counters, and capture configuration changes. Don’t log every accepted packet; your storage system deserves better.
Next steps that make Monday calmer
If you want a VPN that behaves like infrastructure instead of a magic trick, build it like infrastructure: unique subnets, routed design, central policy, and observability. Hub-and-spoke WireGuard is a solid pattern for three offices, especially when you need access by role rather than “everyone can see everything.”
Practical next steps:
- Write down the IP plan and role subnets, then enforce uniqueness across offices.
- Stand up the hub with WireGuard + nftables default-drop forwarding.
- Bring up one spoke, validate routing end-to-end, then clone the pattern to the other two.
- Implement role rules as explicit allows, watch counters, and keep a rollback ready.
- Add monitoring for handshake age, bandwidth, and drop counters, and keep the fast diagnosis order close at hand.
You’ll still get tickets. But they’ll be the solvable kind, not the supernatural kind.