The office VPN is always “fine” until payroll runs, VoIP turns into robot poetry, or a new branch opens and suddenly nobody remembers which side is supposed to initiate.
Then you get the real ticket: “Site-to-site down. Nothing changed.” (Something changed.)
This is a practical field guide for people who actually have to keep it alive: WireGuard vs IPsec in office environments, what’s easier to maintain, where each one bites, and how to debug it fast.
You’ll get concrete commands, decision points, and failure modes—because vibes are not a monitoring strategy.
Make the decision like an operator
Both WireGuard and IPsec can do office-to-office connectivity well. The difference is not “security” in the abstract.
The difference is how much time you’ll spend proving that a packet ever existed, and how often a tiny mismatch turns into a black hole.
My biased, production-oriented recommendation
- Default to WireGuard for new office site-to-site VPNs when you control both ends (Linux routers, modern firewalls, or appliances that implement it cleanly). It’s simpler to reason about, lighter to debug, and predictable under churn.
- Use IPsec/IKEv2 when you must interoperate with existing enterprise gear, compliance-driven standards checklists, or where “IPsec is the only thing the vendor supports without finger-pointing.” Also: if you need mature hub-and-spoke at scale with existing IPsec tooling in-house, don’t fight your organization.
- Do not pick based on cryptographic buzzwords. Pick based on observability, failure modes, and who will be on-call at 03:00.
What “easier to maintain” really means
Maintenance isn’t the initial setup. It’s the second office you add, the ISP swap, the firewall upgrade, and the “quick” subnet change that wasn’t written down.
It’s also credential rotation, monitoring, and recovering from partial failure (one-way traffic is the king of “looks up, acts down”).
WireGuard tends to win on configuration clarity and debuggability.
IPsec tends to win on institutional compatibility and feature breadth—at the cost of more moving parts and more negotiation failure surfaces.
One operational truth: if your VPN requires a meeting to change a cipher suite, you don’t have a VPN—you have a weekly ritual.
Werner Vogels’ principle applies here (paraphrased): you build it, you run it. The operability cost is part of the design, not a postscript.
Interesting facts and historical context (the stuff that explains today’s mess)
- IPsec started in the 1990s as part of IPv6’s original vision, then got pulled into IPv4 reality. The result: decades of extensions, profiles, and vendor “interpretations.”
- IKE (Internet Key Exchange) evolved because manual keying was a pain. IKEv1 was flexible but complicated; IKEv2 simplified the protocol and improved reliability under changing IPs.
- NAT broke the clean IPsec model. ESP doesn’t like being NATed; NAT-T (UDP encapsulation) became the duct tape that made IPsec workable on the modern Internet.
- WireGuard is intentionally small. Its codebase is famously compact compared to typical IPsec stacks. Small doesn’t automatically mean perfect, but it reduces the “unknown unknowns” surface.
- WireGuard uses modern primitives (Noise-based handshake). It made opinionated choices to avoid a sprawling negotiation matrix of algorithms.
- Linux mainline adoption changed the WireGuard game. Once it landed in the Linux kernel, performance and packaging got simpler, and it stopped feeling like a side project.
- IPsec often looks “stateful” to humans because it is. It maintains Security Associations (SAs) with lifetimes, rekeying, and potentially multiple child SAs.
- WireGuard looks “stateless” until you learn it isn’t. It has handshakes and roaming behavior, but it avoids long negotiation sequences and most of the “proposal mismatch” drama.
- Many firewall vendors treat IPsec as a first-class product feature. That matters in offices because support contracts and UI-driven operations are real constraints.
Maintenance reality: what your future self will hate
WireGuard: fewer knobs, fewer ways to be wrong
WireGuard config is basically: keys, endpoint, allowed IPs, and keepalive if you’re behind NAT. That’s it.
The operational trick is that AllowedIPs is both routing and access control. It’s elegant, until someone thinks it’s only routing and accidentally grants reachability to an entire RFC1918 range.
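For orientation, here is a minimal site-to-site sketch of a branch router’s wg-quick config. The placeholders, hostname, and prefixes are illustrative assumptions, not a recommended addressing plan; the point is the narrow AllowedIPs and the keepalive on the NATed side.
# /etc/wireguard/wg0.conf on the branch router (illustrative values)
[Interface]
PrivateKey = <branch-private-key>
# Dedicated tunnel subnet, not a LAN prefix
Address = 10.255.0.2/24
ListenPort = 51820

[Peer]
# HQ router: only HQ prefixes and its tunnel address, nothing broader
PublicKey = <hq-public-key>
Endpoint = hq-vpn.example.com:51820
AllowedIPs = 10.10.0.0/16, 10.255.0.1/32
# Branch sits behind NAT, so keep the mapping alive
PersistentKeepalive = 25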
Maintenance work in WireGuard mostly looks like:
- Key rotation without breaking peers (a rotation sketch follows this list).
- Keeping AllowedIPs clean, non-overlapping, and documented.
- Preventing “it works from Bob’s laptop but not from the office router” by standardizing NAT and firewall rules.
- Monitoring handshakes and throughput to catch silent failure early.
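A rotation sketch, assuming Linux routers managed with the wg tool; the key path, placeholders, and addresses are illustrative, and expect a brief traffic blip while both sides switch over.
cr0x@server:~$ wg genkey | sudo tee /etc/wireguard/wg0.key.new | wg pubkey
<new-public-key>
cr0x@server:~$ sudo chmod 600 /etc/wireguard/wg0.key.new
cr0x@server:~$ # on the remote router: add the new public key so it will accept the rotated peer
cr0x@server:~$ sudo wg set wg0 peer <new-public-key> endpoint 198.51.100.10:51820 allowed-ips 10.10.0.0/16
cr0x@server:~$ # back on this router: switch the interface to the new private key
cr0x@server:~$ sudo wg set wg0 private-key /etc/wireguard/wg0.key.new
cr0x@server:~$ # on the remote router: remove the old peer entry, then persist both wg0.conf files
cr0x@server:~$ sudo wg set wg0 peer <old-public-key> remove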
IPsec: the land of “it should work” and “it’s still negotiating”
IPsec maintenance is usually dominated by interoperability and negotiation surfaces:
IKE/ESP proposals, DH groups, lifetimes, rekey timers, identities, PSKs vs certificates, policy-based vs route-based tunnels, DPD/keepalives, NAT-T, fragmentation, and vendor-specific defaults.
You can run IPsec smoothly for years, but you only get there by being disciplined:
standardize profiles, document the exact knobs, and monitor rekeying behavior. If you “just click until green,” you’ll pay later.
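As a concrete example of what a standardized profile can look like, here is a strongSwan swanctl.conf sketch. The connection name, addresses, identities, and lifetimes are illustrative assumptions; the one non-negotiable point is that the far end uses the exact same proposals.
# /etc/swanctl/conf.d/office-b.conf (illustrative; mirror the proposals on the far end)
connections {
    office-b {
        version = 2
        local_addrs = 198.51.100.10
        remote_addrs = 203.0.113.44
        # one IKE proposal, everywhere
        proposals = aes256gcm16-prfsha256-ecp256
        rekey_time = 4h
        local {
            auth = psk
            id = office-a
        }
        remote {
            auth = psk
            id = office-b
        }
        children {
            office-a-office-b {
                local_ts = 10.10.0.0/16
                remote_ts = 10.20.0.0/16
                # one ESP proposal, everywhere
                esp_proposals = aes256gcm16-ecp256
                rekey_time = 1h
                start_action = start
                dpd_action = restart
            }
        }
    }
}
Load it with swanctl --load-all and keep the same file (minus addresses and identities) on every gateway.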
What actually drives operational toil (regardless of protocol)
- NAT and asymmetric routing: the VPN is blamed, the routing table is guilty.
- MTU/PMTUD issues: small pings work; large payloads die; the helpdesk loses a week.
- DNS expectations: people want “office DNS” to magically work across tunnels without split-horizon planning.
- Identity drift: a peer’s IP or hostname changes; the config doesn’t.
- Key lifecycle: nobody schedules it, then it becomes an outage when you try.
Joke #1: VPNs are like office coffee machines—nobody knows how they work, but everyone notices the moment they don’t.
WireGuard in offices: operational behavior and traps
What WireGuard is great at
- Simple site-to-site routing when each office has stable subnets and you want predictable connectivity.
- Roaming and flaky links: if an endpoint’s public IP changes, WireGuard can recover quickly as long as the peer can be reached and handshakes happen.
- Low-overhead debugging: the “show me the state” commands tend to be direct: last handshake, transfer counters, endpoint.
- Performance per CPU: often excellent on modern hardware, with less overhead than many IPsec implementations.
The traps that cause real outages
Trap: AllowedIPs overlap and route hijacking
In WireGuard, AllowedIPs on the receiving end decides what traffic gets accepted from a peer. On the sending end, it decides what gets routed into the tunnel.
Overlap those and you can create a routing nightmare: traffic disappears into the wrong peer because the kernel picks the most specific route, or worse, your “temporary” 0.0.0.0/0 becomes permanent.
Practical rule: treat AllowedIPs like firewall rules plus routing table entries. Review them like security policy, not like “some IPs we need.”
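If you catch an overbroad entry at runtime, you can narrow it on the spot. The public key below is a placeholder; note that wg set replaces the peer’s entire AllowedIPs list and does not touch wg0.conf or any routes wg-quick installed, so fix those too.
cr0x@server:~$ sudo wg set wg0 peer <branch-public-key> allowed-ips 10.20.30.0/24
cr0x@server:~$ sudo wg show wg0 allowed-ips
<branch-public-key>	10.20.30.0/24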
Trap: NAT and idle timeouts without persistent keepalive
Many office links sit behind NAT—ISP routers, LTE backup, “business” gateways with mystery firmware.
If neither side sends traffic for a while, state expires, and the next packet goes nowhere. This looks like intermittent failure, the worst kind.
For NATed peers, set PersistentKeepalive (often 25 seconds) on the NATed side.
Trap: MTU mismatch and black-hole TCP
WireGuard adds overhead. If you run it over PPPoE, VLAN tags, or other encapsulation, the effective MTU can shrink fast.
The symptom: SSH works, small HTTP works, but “the CRM times out” or file transfers stall.
The fix is usually to set the WireGuard interface MTU lower and/or clamp MSS on the firewall.
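A typical clamp on a Linux edge, assuming the nftables table and chain layout shown later in the diagnostics section. Use insert (not add) so the clamp runs before any accept verdicts in the chain; the statement rewrites the MSS on SYNs to fit the route MTU.
cr0x@server:~$ sudo nft insert rule inet filter forward oifname "wg0" tcp flags syn tcp option maxseg size set rt mtu
Do the same on the far gateway so both directions get clamped. If the path is known to be hostile to PMTUD, a fixed value (for example, size set 1340) is blunter but more predictable.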
Trap: forgetting it’s not a firewall
WireGuard will happily move packets. It will not design your segmentation.
You still need nftables/iptables rules, and you still need to think about lateral movement between office networks.
Many teams mistakenly assume “the VPN is the boundary.” It’s not. It’s a cable.
Trap: key management by spreadsheet
WireGuard keys are static. That’s fine, but you must treat them like credentials: ownership, rotation, and revocation.
The classic failure mode is “who has the private key for the old branch router?” and the answer is “a former MSP.”
Store keys in a secrets system, not in someone’s home directory.
IPsec/IKEv2 in offices: operational behavior and traps
What IPsec is great at
- Vendor interoperability: every firewall speaks some dialect of it, and many have mature UI workflows for it.
- Certificates and PKI integration: scalable identity management when done correctly.
- Route-based tunnels with VTI (on capable platforms): closer to “normal routing,” which operators understand.
- Compliance checkbox compatibility: some orgs have a policy that explicitly names IPsec.
The traps that cause real outages
Trap: proposal mismatch and the “negotiation black hole”
IPsec negotiation depends on agreeing on parameters: encryption, integrity, PRF, DH groups, lifetimes, and more.
One mismatch can cause the tunnel to flap, never come up, or come up but fail to pass traffic due to child SA mismatch.
The logs often look like polite disagreement. The business impact is not polite.
Trap: NAT-T and UDP fragmentation pain
In a NATed world, IPsec often runs over UDP/4500. On some paths, large UDP packets get dropped.
That can break rekeying, break data, or create a “works for a while then dies” pattern.
You’ll end up tuning MTU, enabling fragmentation support, or clamping MSS anyway. Welcome to the club.
Trap: policy-based vs route-based confusion
Policy-based IPsec (selectors for local/remote subnets) looks simple until you need overlapping subnets, multiple networks, or dynamic routing.
Route-based (VTI) tends to be more maintainable for offices because it behaves like a normal interface and routing problem.
But not every platform implements it consistently, and some UIs hide the complexity until it breaks.
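On Linux gateways, the route-based pattern looks roughly like this; the addresses, MTU, and mark value are illustrative, and the IPsec daemon must install SAs with a matching mark (for example mark_in/mark_out in strongSwan) for traffic to actually use the tunnel.
cr0x@server:~$ sudo ip link add vti0 type vti local 198.51.100.10 remote 203.0.113.44 key 42
cr0x@server:~$ sudo ip link set vti0 up mtu 1400
cr0x@server:~$ sudo sysctl -w net.ipv4.conf.vti0.disable_policy=1
cr0x@server:~$ sudo ip route add 10.20.0.0/16 dev vti0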
Trap: rekey timers that don’t match reality
Rekeying is normal. Rekey storms are not.
If lifetimes are too short, or if one side rekeys aggressively and the other can’t keep up, you get periodic packet loss that looks like “random ISP jitter.”
Monitor rekey frequency. Make lifetimes boring.
Trap: identity configuration that breaks after ISP changes
Many office IPsec deployments bind identity to IP addresses, or assume static public IPs forever.
Then the ISP swaps CPE, or a branch switches to LTE, and IKE identity no longer matches.
Use stable identifiers: FQDN identities and certificates, or at minimum document the dependency and plan changes.
Joke #2: IPsec is a lot like committee-designed office naming conventions—technically comprehensive, emotionally exhausting.
Fast diagnosis playbook (find the bottleneck before you “change stuff”)
When a tunnel is “down,” it’s usually one of four things: transport reachability, negotiation/handshake, routing, or MTU/statefulness, with resource exhaustion as the occasional fifth.
The trick is to check in the right order so you don’t waste an hour proving the wrong layer.
First: is the underlay reachable?
- Confirm public IP/port reachability (UDP 51820 for WireGuard, UDP 500/4500 for IPsec).
- Check if NAT or firewall rules changed.
- Validate that both ends agree on the peer address (or are configured for dynamic endpoints appropriately).
Second: does the control plane succeed?
- WireGuard: check last handshake time, endpoint, and transfer counters.
- IPsec: check IKE SA establishment, child SA installation, and rekey churn.
Third: is the data plane routed correctly?
- Verify routes on both sides (and that return path matches).
- Look for overlapping subnets and wrong “AllowedIPs” or traffic selectors.
- Confirm forwarding and firewall policies permit the traffic.
Fourth: is MTU/fragmentation silently killing you?
- Test with “do not fragment” pings and increasing sizes.
- Clamp MSS for TCP if you see stalls.
- Watch for rekey failures correlated with large packets.
Fifth: is it a resource problem?
- CPU saturation on crypto.
- Queue drops on the WAN interface.
- IRQ imbalance on cheap routers with “VPN acceleration” marketing.
Practical tasks with commands (and what the output means)
These are not lab commands. They’re the ones you run while someone in Finance is asking if “the VPN is fixed yet.”
Each task includes: command, example output, what it means, and the decision you make.
1) WireGuard: check peer state and last handshake
cr0x@server:~$ sudo wg show
interface: wg0
public key: 9nQ1x7bWJrZxqQYcVw9m0mQ9mJ3ZJ0mJm8yJm3s9YHg=
listening port: 51820
peer: 7bq1oYVxRjv0y9u8mS5o2k1m8K9p0t7x6w5v4u3y2x1=
endpoint: 203.0.113.44:53122
allowed ips: 10.20.0.0/16
latest handshake: 1 minute, 12 seconds ago
transfer: 1.21 GiB received, 980.33 MiB sent
persistent keepalive: every 25 seconds
Meaning: “Latest handshake” is recent; endpoint is known; counters moving.
Decision: If traffic still fails, stop blaming the tunnel and check routing/firewall/MTU inside. If handshake is “never” or hours old, focus on underlay reachability or keys.
2) WireGuard: check kernel routes created by AllowedIPs
cr0x@server:~$ ip route show table main | grep wg0
10.20.0.0/16 dev wg0 scope link
Meaning: The system routes 10.20.0.0/16 into wg0.
Decision: If you expected only 10.20.30.0/24 and you see /16, you found why traffic is disappearing into the tunnel. Fix AllowedIPs and route policy.
3) WireGuard: verify the interface MTU and adjust if needed
cr0x@server:~$ ip link show dev wg0
7: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/none
Meaning: MTU 1420 is common for WireGuard over typical Ethernet. Over PPPoE or additional encapsulation, it might still be too high.
Decision: If you see black-hole TCP, lower MTU (e.g., 1380) and/or clamp MSS on the edge.
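To test that theory quickly (the value is illustrative):
cr0x@server:~$ sudo ip link set dev wg0 mtu 1380
If that fixes the stalls, persist it with MTU = 1380 in the [Interface] section of wg0.conf.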
4) Path MTU test with DF ping (find black holes)
cr0x@server:~$ ping -M do -s 1372 -c 3 10.20.30.10
PING 10.20.30.10 (10.20.30.10) 1372(1400) bytes of data.
1380 bytes from 10.20.30.10: icmp_seq=1 ttl=63 time=18.7 ms
1380 bytes from 10.20.30.10: icmp_seq=2 ttl=63 time=18.5 ms
1380 bytes from 10.20.30.10: icmp_seq=3 ttl=63 time=18.6 ms
--- 10.20.30.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
Meaning: Payload 1372 with DF succeeds, so at least ~1400-byte packets pass over that path.
Decision: If this fails with “Frag needed,” tune MTU/MSS. If it just times out, suspect filtering of ICMP or a real black hole—clamp MSS and reduce MTU proactively.
5) WireGuard: confirm the process is listening on the expected port
cr0x@server:~$ sudo ss -lunp | grep 51820
UNCONN 0 0 0.0.0.0:51820 0.0.0.0:*
Meaning: UDP/51820 is open locally. With kernel-mode WireGuard the socket belongs to the kernel, so no owning process is listed; a userspace implementation (wireguard-go) would show one.
Decision: If it’s not listening, your service didn’t start or bound incorrectly. Fix systemd/wg-quick and then check firewall/NAT.
6) WireGuard: capture handshake attempts (prove packets exist)
cr0x@server:~$ sudo tcpdump -ni eth0 udp port 51820 -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:10:01.123456 IP 198.51.100.22.51820 > 203.0.113.44.53122: UDP, length 148
12:10:01.223901 IP 203.0.113.44.53122 > 198.51.100.22.51820: UDP, length 92
Meaning: Bidirectional UDP is flowing; underlay and firewall are probably okay.
Decision: If you see only outbound and no reply, it’s firewall/NAT/ISP. If you see bidirectional but no handshake in wg show, suspect key mismatch or wrong peer configuration.
7) IPsec (strongSwan): show IKE and child SA state
cr0x@server:~$ sudo swanctl --list-sas
vpn-office: #12, ESTABLISHED, IKEv2, 3c2f2e8d1d0a9c1f_i* 9a8b7c6d5e4f3210_r
local 'office-a' @ 198.51.100.10[4500]
remote 'office-b' @ 203.0.113.44[4500]
AES_GCM_16-256/PRF_HMAC_SHA2_256/ECP_256
established 42 minutes ago, rekeying in 2 hours
office-a-office-b: #14, INSTALLED, TUNNEL, reqid 1
local 10.10.0.0/16
remote 10.20.0.0/16
AES_GCM_16-256, 51023 bytes_i, 48890 bytes_o, rekeying in 46 minutes
Meaning: IKE SA established, child SA installed, selectors match office subnets.
Decision: If IKE is up but child SA missing, you have selector/proposal mismatch. If both are up but traffic fails, check routing, firewall, and MTU/fragmentation.
8) IPsec: check charon logs for negotiation errors
cr0x@server:~$ sudo journalctl -u strongswan --since "10 min ago" | tail -n 12
Dec 28 12:02:11 gw-a charon-systemd[998]: 14[IKE] received NO_PROPOSAL_CHOSEN notify error
Dec 28 12:02:11 gw-a charon-systemd[998]: 14[IKE] failed to establish CHILD_SA, keeping IKE_SA
Dec 28 12:02:11 gw-a charon-systemd[998]: 14[IKE] peer supports MOBIKE
Meaning: The peer rejected the proposal for the child SA (ESP transforms, DH group, etc.).
Decision: Align cipher suites/proposals on both sides. Don’t “open everything.” Standardize a known-good profile and apply it everywhere.
9) IPsec: verify UDP 500/4500 listeners
cr0x@server:~$ sudo ss -lunp | egrep ':500|:4500'
UNCONN 0 0 0.0.0.0:500 0.0.0.0:* users:(("charon",pid=998,fd=12))
UNCONN 0 0 0.0.0.0:4500 0.0.0.0:* users:(("charon",pid=998,fd=13))
Meaning: The daemon is listening on required IKE/NAT-T ports.
Decision: If absent, your service is down or bound wrong. Fix that first; don’t touch proposals yet.
10) IPsec: confirm NAT-T is in play (or not)
cr0x@server:~$ sudo tcpdump -ni eth0 udp port 4500 -c 3
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:06:44.001122 IP 198.51.100.10.4500 > 203.0.113.44.4500: UDP, length 244
12:06:44.101455 IP 203.0.113.44.4500 > 198.51.100.10.4500: UDP, length 276
Meaning: IPsec is encapsulated in UDP/4500. That’s typical when NAT is detected or forced.
Decision: If UDP/4500 is blocked anywhere, you’ll get intermittent or total failure. Open it, and watch for UDP fragmentation issues.
11) Routing sanity: verify the return path (asymmetry kills VPNs)
cr0x@server:~$ ip route get 10.20.30.10
10.20.30.10 dev wg0 src 10.10.0.1 uid 0
cache
Meaning: The kernel believes 10.20.30.10 is reachable via wg0.
Decision: Run the same check on the remote side for your source subnet. If one side routes correctly and the other doesn’t, you have asymmetric routing or missing routes.
12) Firewall sanity: confirm forwarding policy isn’t silently dropping
cr0x@server:~$ sudo nft list ruleset | sed -n '1,80p'
table inet filter {
        chain forward {
                type filter hook forward priority filter; policy drop;
                iifname "wg0" oifname "lan0" accept
                iifname "lan0" oifname "wg0" accept
                ct state established,related accept
        }
}
Meaning: Forward policy is drop by default; explicit accepts exist between wg0 and lan0.
Decision: If these rules are missing, the tunnel can be up but traffic won’t pass. Add minimal, explicit forwarding rules and log drops during rollout.
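During rollout, a temporary log rule at the end of the forward chain makes silent drops visible; the prefix is arbitrary, and remember to remove the rule once things are stable.
cr0x@server:~$ sudo nft add rule inet filter forward log prefix \"vpn-fwd-drop \" counter
cr0x@server:~$ sudo journalctl -k --since "5 minutes ago" | grep vpn-fwd-drop | tail -n 3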
13) See live drops and errors on interfaces (spot congestion/MTU)
cr0x@server:~$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
RX: bytes packets errors dropped missed mcast
9876543210 8123456 0 1203 0 12345
TX: bytes packets errors dropped carrier collsns
8765432109 7345678 0 987 0 0
Meaning: Drops are non-zero. That might be queue pressure, policing, or a driver issue.
Decision: If drops spike during VPN use, investigate WAN shaping/QoS, CPU, and MTU. Don’t randomly tweak crypto first.
14) Confirm IP forwarding is enabled (the classic “tunnel up, no traffic”)
cr0x@server:~$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1
Meaning: The host will route packets.
Decision: If it’s 0 on a Linux router, you found your outage. Enable it persistently and confirm firewall forward rules.
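Persisting it is one file and one reload; the filename is illustrative.
cr0x@server:~$ echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-vpn-forwarding.conf
net.ipv4.ip_forward = 1
cr0x@server:~$ sudo sysctl --system | grep -F 'net.ipv4.ip_forward'
net.ipv4.ip_forward = 1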
15) Prove DNS is the problem, not the tunnel
cr0x@server:~$ dig +short app.internal.example @10.20.0.53
10.20.30.45
Meaning: Name resolution works over the tunnel to the remote DNS server.
Decision: If the IP resolves but the app still fails, look at app routing/ACLs. If it doesn’t resolve, your “VPN down” report is actually DNS reachability or split-horizon design.
16) Trace the path to a remote subnet (find a rogue default route)
cr0x@server:~$ traceroute -n 10.20.30.10
traceroute to 10.20.30.10 (10.20.30.10), 30 hops max, 60 byte packets
1 10.10.0.1 0.334 ms 0.291 ms 0.271 ms
2 10.20.30.10 18.821 ms 18.654 ms 18.590 ms
Meaning: Traffic reaches the remote host in two hops—likely correct routing through the tunnel gateway.
Decision: If you see hops going out to the ISP, you’re leaking routes. Fix routing and consider policy-based routing or more specific prefixes.
17) For IPsec, check xfrm state/policy (Linux kernel view)
cr0x@server:~$ sudo ip xfrm state | sed -n '1,60p'
src 198.51.100.10 dst 203.0.113.44
proto esp spi 0xc0ffee12 reqid 1 mode tunnel
replay-window 32 flag af-unspec
auth-trunc hmac(sha256) 0xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 128
enc cbc(aes) 0xbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
lifetime config: soft 0 hard 0
lifetime current: soft 0 hard 0
Meaning: Kernel has an ESP SA installed. If policies are missing, traffic won’t match and will go cleartext (or be dropped by firewall).
Decision: If the daemon claims “up” but xfrm is empty, the control plane didn’t install state—check privileges, kernel support, or daemon errors.
18) Measure throughput and packet loss across the tunnel
cr0x@server:~$ iperf3 -c 10.20.30.10 -t 10
Connecting to host 10.20.30.10, port 5201
[ 5] local 10.10.0.50 port 58312 connected to 10.20.30.10 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 412 MBytes 346 Mbits/sec 92 1.12 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 0.00-10.00 sec 412 MBytes 346 Mbits/sec 92 sender
[ 5] 0.00-10.00 sec 409 MBytes 343 Mbits/sec receiver
Meaning: Retransmits exist but throughput is decent. If retransmits are huge or throughput collapses, suspect MTU, loss, or CPU saturation.
Decision: Correlate with interface drops and CPU. If CPU is high, consider hardware upgrade, offload options (carefully), or reducing encryption overhead only if policy allows.
Common mistakes: symptoms → root cause → fix
1) “Tunnel is up, but nothing can talk”
Symptoms: Handshake/SA established, pings to tunnel endpoint work, but subnet-to-subnet traffic fails.
Root cause: Missing IP forwarding, missing forward firewall rules, or no route to remote subnets.
Fix: Enable forwarding (sysctl net.ipv4.ip_forward=1), add explicit forward rules, verify routes with ip route get on both sides.
2) “Works for a few minutes, then dies until we ‘restart the VPN’”
Symptoms: Intermittent connectivity, often after idle periods.
Root cause: NAT idle timeout. WireGuard without keepalive, or IPsec NAT-T state expiring.
Fix: WireGuard: set PersistentKeepalive = 25 on NATed peers. IPsec: ensure DPD/keepalive is configured appropriately; validate UDP/4500 stability.
3) “Small pings work, apps stall, file transfers hang”
Symptoms: SSH ok, web UI half-loads, SMB/HTTPS stalls, large uploads time out.
Root cause: MTU/PMTUD failure or UDP fragmentation loss.
Fix: Lower tunnel MTU, clamp TCP MSS, and run DF ping tests to find safe packet sizes.
4) “After adding a new office, another office broke”
Symptoms: Adding peer C makes peer B unreachable; routes flip-flop.
Root cause: AllowedIPs overlap (WireGuard) or traffic selectors overlap (policy-based IPsec). Route preference changes.
Fix: Use non-overlapping prefixes per site. Prefer route-based IPsec (VTI) and a routing protocol if you’re growing.
5) “IPsec won’t come up, but both sides swear it’s configured right”
Symptoms: Repeated negotiation attempts, NO_PROPOSAL_CHOSEN, AUTH failed, or no matching CHILD_SA.
Root cause: Proposal mismatch, identity mismatch (IDs/certs), PSK mismatch, or clock skew impacting certificate validation.
Fix: Standardize proposals; verify identities; confirm time sync; inspect logs on both ends and align the exact transforms and selectors.
6) “WireGuard handshakes are happening, but traffic counters don’t move”
Symptoms: wg show shows fresh handshake but transfer stays at 0; or only one direction increments.
Root cause: Wrong AllowedIPs on one end, firewall blocks forwarding, or asymmetric routing on the LAN side.
Fix: Validate AllowedIPs on both peers; check routing tables; confirm forward rules; run tcpdump on wg0 and LAN interface.
7) “Everything broke after ISP changed the branch public IP”
Symptoms: Tunnel never comes back; logs show identity mismatch or peer unreachable.
Root cause: Hard-coded peer endpoint/identity bound to old IP; upstream NAT changed; firewall pinholes missing.
Fix: Use stable identities (certs/FQDN for IPsec), dynamic DNS with careful security controls, or a hub model where branches initiate outward.
8) “VPN performance is terrible during business hours”
Symptoms: High latency, low throughput, packet loss spikes, voice issues.
Root cause: CPU saturation on crypto, WAN congestion, bufferbloat, or misconfigured QoS.
Fix: Measure with iperf3, check interface drops and CPU; apply proper shaping/QoS; upgrade hardware if needed.
Three corporate mini-stories (anonymized, plausible, and painfully familiar)
Incident caused by a wrong assumption: “The VPN is the network”
A mid-size company opened a new office. They’d already used WireGuard for remote users, so they reused the same playbook for site-to-site.
The rollout went quickly: tunnel up, routes added, basic pings successful. Everyone high-fived and went back to their day jobs.
Two weeks later, someone noticed that the new office could reach a legacy subnet that was supposed to be isolated. Not “oops, one server.” The whole subnet.
The assumption had been: “If it’s in the tunnel, it’s trusted like the internal LAN.” That assumption had never been true—but it was now encoded in AllowedIPs and permissive forwarding rules.
The root issue was subtle: AllowedIPs included a broad 10.0.0.0/8 because “we might add more subnets later.”
On the firewall, forwarding between wg0 and lan0 was allowed with a blanket accept.
No one wrote down segmentation requirements because “it’s just an office tunnel.”
The fix wasn’t heroic. They narrowed AllowedIPs to exact site prefixes, implemented explicit ACLs between office networks, and added logging for cross-site flows.
The real correction was cultural: they stopped treating the VPN as a magical trust boundary and started treating it like a transport.
The lesson: the fastest way to create surprise reachability is to “plan for growth” by routing the entire universe on day one.
Optimization that backfired: crypto tuning as a substitute for capacity planning
Another org ran IPsec between headquarters and several branches, mostly on mid-range firewall appliances.
Business complained about slow file sync and choppy calls. The networking team did what teams do under pressure: they looked for a knob.
They changed IPsec proposals to “lighter” settings and shortened lifetimes, hoping to improve performance and stability.
The change did move CPU a little—but it also increased rekey frequency and made packet loss bursts show up every time tunnels rotated keys.
The user experience got worse, and now it was worse in a periodic, hard-to-explain way.
Debugging took time because each branch behaved differently. Some were behind aggressive NAT; some had PPPoE; some had ISP gear that hated fragmented UDP.
Shorter lifetimes meant more frequent large control-plane exchanges. Those were the exact packets the path hated.
The final fix was boring: revert to sane lifetimes, clamp MSS, and shape traffic properly on the WAN edge.
Then they upgraded two overloaded appliances that simply didn’t have the CPU headroom for peak hours.
The “optimization” wasn’t wrong in theory; it was wrong in context.
The lesson: if your WAN is saturated or your router is underpowered, fiddling with cipher suites is just rearranging chairs during a fire drill.
Boring but correct practice that saved the day: standard profiles + pre-change validation
A company with 15 offices ran a mix of WireGuard and IPsec due to acquisitions. Their SRE team hated outages more than they hated paperwork, so they built a habit:
every change had a pre-flight checklist, and every tunnel had a standard “known-good” profile.
One Friday afternoon (always Friday), a branch ISP had an unannounced maintenance window and swapped the public IP.
The tunnel dropped. But instead of guessing, the on-call ran the same three commands they always ran: check underlay reachability, check control-plane state, check routes.
Monitoring already had alerts for “handshake older than N minutes” and “SA flapping.”
They already had a hub model where branches initiated outward, plus a documented procedure for updating branch endpoints.
They updated the endpoint, watched handshakes resume, and verified key applications with a small suite of pings and TCP checks.
Total impact: noticeable, but contained.
What saved them wasn’t a fancy protocol. It was consistency:
standard configs, good observability, and a culture of “validate before and after.”
Checklists / step-by-step plan (ship it without hating yourself)
Pick the model first: mesh vs hub-and-spoke
- Mesh (every office to every office): simple concept, operationally messy as you grow. Keys/profiles multiply; troubleshooting becomes combinatorial.
- Hub-and-spoke (offices to HQ/cloud hub): usually best for offices. Centralized monitoring, fewer tunnels, easier policy control.
Decision rule: if you have more than a handful of sites and no dedicated network team, build a hub.
If you need full-mesh for latency reasons, do it with automation and strong conventions or don’t do it at all.
WireGuard rollout plan (office-to-office)
- Define addressing: allocate non-overlapping prefixes per site; document them.
- Choose a listening port and standardize it (don’t get “creative” per site).
- Generate keys per site router; store private keys in your secrets system.
- Write AllowedIPs narrowly: only the remote site prefixes; avoid catch-all ranges.
- Set PersistentKeepalive for NATed sites.
- Implement firewall policy: explicit allowlists between subnets; default deny for cross-site lateral movement.
- Set MTU and/or MSS clamp based on your WAN reality (PPPoE/LTE almost always needs attention).
- Monitoring: alert on handshake age, transfer counters stalling, and packet loss/latency to key services (a minimal check script follows this list).
- Run a post-change test: ping, TCP connect to key ports, DNS query, and a small throughput test.
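A minimal handshake-age check you can run from cron or a monitoring agent; the interface name, threshold, and script path are assumptions, not a standard.
cr0x@server:~$ cat /usr/local/bin/check-wg-handshake.sh
#!/bin/sh
# Warn when any wg0 peer's last handshake is stale; wg prints one "<pubkey> <epoch>" line per peer.
THRESHOLD=180
NOW=$(date +%s)
wg show wg0 latest-handshakes | while read -r peer last; do
    if [ "$last" -eq 0 ]; then
        echo "CRITICAL: peer $peer has never completed a handshake"
    elif [ $((NOW - last)) -gt "$THRESHOLD" ]; then
        echo "WARNING: peer $peer last handshake $((NOW - last))s ago"
    fi
done
Run it as root (wg needs privileges) and feed the output into whatever alerting you already have.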
IPsec rollout plan (IKEv2, office-to-office)
- Choose route-based if possible (VTI). If stuck with policy-based, keep selectors simple and explicit.
- Standardize a single proposal set across the org. One. Not five. Avoid vendor “auto” modes unless you enjoy archaeology.
- Decide PSK vs certificates: PSK is quick but doesn’t scale well; certificates require PKI hygiene but pay off long-term.
- Set lifetimes to sane values and keep them consistent to avoid churn.
- Enable NAT-T (as needed) and validate UDP/4500 reachability.
- Plan for MTU: clamp MSS early; confirm PMTUD works or assume it doesn’t.
- Logging and monitoring: track SA up/down events, rekey frequency, and error notifies like NO_PROPOSAL_CHOSEN.
- Document identities: what ID each side uses; what changes when the ISP IP changes; where certificates live.
- Post-change validation: same as WireGuard, plus ensure child SA selectors match the intended networks.
Operational hygiene checklist (applies to both)
- Every site has: owner, purpose, subnets, peer endpoints, and a rollback plan.
- Monitoring exists for: tunnel control-plane health and application reachability.
- MTU strategy is written down (including PPPoE/LTE exceptions).
- Change windows include pre/post checks with saved outputs.
- Keys/PSKs/certs have rotation and revocation procedures.
- Firewall rules are reviewed as code, not edited by “whoever had access.”
FAQ
1) Is WireGuard “more secure” than IPsec?
Not in a way that matters for most office deployments. Both can be secure when configured correctly.
WireGuard reduces configuration complexity (fewer ways to misconfigure), while IPsec can be extremely robust but has a larger negotiation and policy surface.
2) Which one is easier to troubleshoot on-call?
WireGuard, typically. wg show tells you handshake age, endpoint, and byte counters in one shot.
IPsec troubleshooting often requires reading negotiation logs and understanding which SA failed and why.
3) What’s the most common WireGuard office mistake?
Overbroad or overlapping AllowedIPs. It causes route hijacks, unintended access, or traffic disappearing into the wrong tunnel.
Treat AllowedIPs as both routing and authorization.
4) What’s the most common IPsec office mistake?
“Auto” proposal settings and mismatched profiles between vendors. It works until it doesn’t, and then nobody can explain why.
Standardize a single IKEv2/ESP profile and apply it everywhere.
5) Do I need dynamic routing (OSPF/BGP) over the VPN?
If you have more than a few sites or expect changes, dynamic routing reduces toil—especially with route-based tunnels.
If you’re small and stable, static routes can be fine, but document them and test failover paths.
6) How do I avoid MTU issues without becoming a packet wizard?
Be conservative: lower tunnel MTU and clamp MSS at the edge.
Validate with DF pings and a real TCP transfer test. Assume PMTUD will fail on at least one ISP path.
7) Should branches initiate the tunnel, or should HQ initiate?
Prefer branches initiating outward to a stable hub. It avoids inbound firewall pinholes at branches and tolerates changing branch IPs better.
It also simplifies incident response: one hub to watch.
8) Can I mix WireGuard and IPsec in one organization?
Yes, and many do due to acquisitions or vendor constraints. The risk is operational inconsistency.
Mitigate it with standard runbooks, monitoring that speaks in outcomes (latency, reachability), and a plan to converge over time.
9) Do I need certificates for IPsec, or is PSK fine?
PSK is fine for a small number of sites and disciplined handling. Certificates scale better, especially when personnel change and you need revocation.
If you do PKI, do it properly: time sync, renewal automation, and clear identity mapping.
10) What does “tunnel is up but one-way traffic” usually mean?
Asymmetric routing, NAT state weirdness, or firewall rules allowing one direction.
Prove it with counters (wg show or SA bytes) and captures on both ends; then fix routing symmetry and forwarding policy.
Next steps you can actually ship
If you’re starting from scratch and you control both ends, deploy WireGuard in a hub-and-spoke layout.
Keep AllowedIPs narrow, set keepalive for NATed sites, and clamp MSS early. Add monitoring for handshake age and application checks, not just “tunnel up.”
If you’re in IPsec land already, don’t rip it out because it’s annoying. Make it boring:
standardize one IKEv2 profile, move to route-based tunnels where you can, align lifetimes, and instrument rekeys and errors.
Most IPsec “mysteries” evaporate once you stop allowing everyone to invent their own proposal set.
Finally: write the runbook you wish existed. Use the fast diagnosis order, keep the commands handy, and store pre/post outputs with every change.
Your future self will still get paged, but at least they’ll have receipts.