Two offices, one set of services, and suddenly everything is “slow,” “intermittent,” or “only broken for finance.” You can’t reproduce it on your laptop. The CEO can, five minutes before a board call. The network diagram is a screenshot of a whiteboard that no longer exists.
A site-to-site VPN can absolutely make two offices behave like one network. It can also become the thing everybody blames forever. The difference is rarely “the tunnel.” It’s IP planning, routing decisions, MTU hygiene, DNS discipline, and boring monitoring that notices problems before humans do.
What you actually want (and what you probably don’t)
When people say “connect two offices into one network,” they usually mean one or more of these:
- Users in Office B can reach internal apps hosted in Office A (and vice versa).
- Both sites can reach shared infrastructure: AD/LDAP, file shares, Git, monitoring, VoIP, printers (yes, still).
- Consistent access policies: “HR can reach HR systems” is true regardless of which building has better coffee.
- Stable performance and predictable failure modes.
What they think they want is “make it like we stretched a switch between buildings.” Don’t do that. L2 extension across the internet is how you end up debugging broadcast storms with a CFO in the room.
Your goal is simple: route traffic between two (or more) well-defined IP networks over an encrypted tunnel, with clear boundaries, clear DNS, and enough observability to answer “what’s broken?” in under five minutes.
And remember: a VPN is not a magic portal to productivity. It’s a transport. If your apps can’t handle latency, packet loss, or intermittent links, the tunnel will merely make the pain geographically diverse.
Short joke #1: A VPN doesn’t “make networks faster,” but it does make blame travel at wire speed.
A few facts and history that explain today’s pitfalls
Some context helps, because half the trouble comes from legacy ideas that survive long past their usefulness.
- IPsec predates modern “cloud thinking.” The core standards landed in the mid-to-late 1990s; many vendors still ship assumptions from that era.
- IKEv1 vs IKEv2 matters. IKEv2 (mid-2000s) fixed real problems: negotiation complexity, mobility, and stability. If you have a choice, pick IKEv2.
- NAT traversal (NAT-T) exists because the internet got NAT’d. IPsec ESP doesn’t play nicely with NAT; UDP encapsulation (usually port 4500) was the practical fix.
- “VPN over UDP” isn’t new. L2TP/IPsec and later WireGuard leaned into UDP because it behaves better through middleboxes than raw ESP in many environments.
- WireGuard is intentionally small. It was designed to be auditable and minimal compared to traditional IPsec stacks. That’s a real operational advantage.
- BGP over tunnels is older than your current firewall. Carriers have been doing dynamic routing across encapsulated links for decades; it’s not overkill when you need clean failover.
- MTU pain is historical, not a personal failure. Encapsulation adds overhead; Path MTU Discovery is often blocked or mangled; “works on my ping” has been lying since forever.
- Corporate networks used to be trusted by default. The “flat LAN” era left a lot of organizations with broad internal access. A site-to-site VPN can accidentally recreate that, at internet scale.
Pick a topology that fails well
The default: routed site-to-site (hub-and-spoke or point-to-point)
For two offices, a simple routed tunnel is usually enough: Office A has a gateway, Office B has a gateway, and you route A-subnets to B-subnets through the tunnel.
If you’re adding more sites later, choose a hub-and-spoke early: one or two hubs (data center or cloud), spokes at offices. That makes policy and monitoring simpler, and it avoids the “full mesh of sadness” where every office must talk securely to every other office with separate tunnels.
IPsec vs WireGuard (opinionated)
If you have enterprise firewalls already and need vendor support: use IPsec with IKEv2. It’s everywhere, it interoperates, it can be HA, and auditors recognize it.
If you want simplicity and you control both ends: WireGuard is hard to beat. Less negotiation. Fewer moving parts. Debugging is more “packets in/packets out” and less “Phase 1 is up but Phase 2 is haunted.”
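To show what “fewer moving parts” looks like in practice, here is a minimal sketch of a wg-quick config for the Office A gateway; the keys, tunnel addresses, endpoint, and MTU value are placeholders to adapt, and AllowedIPs doubles as the routing statement (wg-quick installs the route for you):
# /etc/wireguard/wg0.conf on the Office A gateway (sketch; keys and addresses are placeholders)
[Interface]
PrivateKey = <office-a-private-key>
Address = 169.254.10.1/30
ListenPort = 51820
MTU = 1420

[Peer]
# Office B gateway
PublicKey = <office-b-public-key>
Endpoint = 203.0.113.20:51820
AllowedIPs = 10.20.0.0/16
PersistentKeepalive = 25
That is the entire tunnel definition; debugging it later is mostly wg show plus ordinary routing tools.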
Either way, do not pick based on a single benchmark chart. Pick based on what you can operate at 3 a.m., and what you can replace cleanly.
Layer 2 bridging: the trap
Bridging office networks over a VPN sounds appealing when you want “zero change.” It’s also how you transport every broadcast, ARP, and accidental loop across a tunnel. Route. Always route. If someone insists on L2, make them own the outage call.
Checklists / step-by-step plan (the simple plan that works)
Phase 0: prerequisites you do before you touch a tunnel
- Pick non-overlapping subnets. If Office A uses 192.168.1.0/24, Office B must not. This is not negotiable.
- Define which subnets need to talk. “Everything to everything” is lazy and dangerous.
- Choose a routing model. Static routes for small, stable networks. BGP if you need failover or plan to grow.
- Pick an authentication method. Pre-shared key (PSK) is okay for small setups; certificates scale better and rotate cleaner.
- Inventory NAT and firewalls. Know who owns public IPs, what ports are allowed, and whether either side sits behind carrier-grade NAT.
Phase 1: IP plan and traffic policy
Do this on paper first:
- Office A LAN: e.g., 10.10.0.0/16
- Office B LAN: e.g., 10.20.0.0/16
- Reserved infrastructure ranges (servers, VoIP, printers) with smaller subnets for sanity
- Decide whether clients should reach the internet locally (split tunnel) or through a central egress (full tunnel)
Opinion: For two offices, default to split tunnel unless you have a compliance reason to centralize egress. Full tunnel over site-to-site is a great way to turn one office’s ISP hiccup into a company-wide incident.
Phase 2: tunnel configuration (minimum viable, production-friendly)
- Encryption: use modern suites. Avoid deprecated algorithms and “because it’s compatible.” If you’re stuck with compatibility, isolate and plan an upgrade.
- Perfect Forward Secrecy: enable it.
- Rekey intervals: keep defaults unless you have a reason; overly aggressive rekeying can create periodic brownouts.
- DPD/keepalives: enable dead peer detection so you fail fast.
- Firewall rules: permit only necessary traffic over the tunnel (by subnet and port), then expand intentionally.
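If you went the IPsec route, here is a minimal sketch of the same idea in strongSwan’s classic ipsec.conf format, roughly matching the connection shown later in Task 10 (IKEv2, AEAD ciphers, PFS via a DH group in the ESP proposal, DPD on). The addresses, identities, and cipher strings are placeholders; adapt them to what your platform and peer actually support:
# /etc/ipsec.conf on the Office A gateway (sketch; values are placeholders)
conn office-a-office-b
    keyexchange=ikev2
    left=10.0.0.1
    leftid=@gw-a
    leftsubnet=10.10.0.0/16
    right=203.0.113.20
    rightid=@gw-b
    rightsubnet=10.20.0.0/16
    ike=aes256gcm16-prfsha384-ecp384!
    esp=aes256gcm16-ecp384!
    dpddelay=30s
    dpdaction=restart
    authby=secret
    auto=start

# /etc/ipsec.secrets (sketch): rotate this, store it in a secret manager, never email it
@gw-a @gw-b : PSK "replace-with-a-long-random-string"
The ! suffix keeps the proposals strict, and the ecp384 group in the esp line is what gives you PFS on child rekeys.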
Phase 3: routing and DNS
- Routing: install routes on gateways (or via BGP). Confirm with traceroute that the path goes through the tunnel.
- DNS: make name resolution work across sites. This is where “the tunnel is up but the app is down” lives.
- Time sync: ensure both sites have reliable NTP. Certificate validation and Kerberos are not fans of time travel.
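For the static-route case, the gateway-side change is small. A sketch, assuming a route-based tunnel interface named wg0 (a manually managed WireGuard interface, or a VTI/XFRM interface for IPsec; wg-quick adds these routes from AllowedIPs, and policy-based IPsec needs no route at all):
cr0x@server:~$ sudo ip route add 10.20.0.0/16 dev wg0
cr0x@server:~$ ip route show 10.20.0.0/16
10.20.0.0/16 dev wg0 scope link
Then run the traceroute check from both sides, not just the side you happen to be sitting in.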
Phase 4: hardening and observability
- Log tunnel state changes and rekey events.
- Monitor packet loss and latency between representative hosts, not just “tunnel up.”
- Record MTU/MSS settings and validate with real transfers.
- Document: subnets, routes, allowed ports, ownership, and a rollback plan.
Phase 5: test like a pessimist
- Simulate an ISP failure at one site (disable WAN, fail to backup link if you have one).
- Test DNS resolution from both sites to shared services.
- Test large file transfers and long-lived connections (VoIP calls, RDP/SSH sessions, database connections).
- Test after rekey (wait through at least one rekey interval).
Routing: static vs dynamic, and the one rule you must not break
The one rule: no overlapping IP space
If the same RFC1918 range exists on both sides, you will eventually NAT your way into a corner. Sometimes it works for a few weeks. Then you add a third site, or a cloud VPC, or a contractor VPN, and everything becomes a choose-your-own-adventure of broken routing.
Renumbering is painful. Still easier than permanent NAT hacks.
Static routing: good for small, stable networks
If each office has a single WAN link and your subnets don’t change often, static routes are fine. They’re simple, debuggable, and do not require routing daemons.
Static routing fails when:
- You add redundancy and want clean failover.
- You add more networks and forget one route on one side (it happens).
- You rely on humans to update routes during an incident (humans are unreliable under stress, including yours truly).
BGP: not mandatory, but it’s often the cleanest failover tool
If you have two tunnels (or two ISPs) and want traffic to shift quickly, BGP is worth it. You can run BGP over IPsec or WireGuard, exchange only the prefixes you want, and let the protocol handle failover.
Keep it conservative:
- Use prefix-lists.
- Use max-prefix limits.
- Use route-maps to prevent “oops, advertised 0.0.0.0/0.”
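Here is a minimal sketch of what “conservative” looks like in FRR; the private ASNs (65010/65020) and the 169.254.10.0/30 tunnel addresses are placeholders:
! Office A gateway, frr.conf fragment (sketch)
router bgp 65010
 neighbor 169.254.10.2 remote-as 65020
 neighbor 169.254.10.2 description office-b-via-tunnel
 address-family ipv4 unicast
  ! originate the office aggregate (a matching static/blackhole route must exist in the RIB)
  network 10.10.0.0/16
  neighbor 169.254.10.2 prefix-list OFFICE-A-OUT out
  neighbor 169.254.10.2 prefix-list OFFICE-B-IN in
  neighbor 169.254.10.2 maximum-prefix 20
 exit-address-family
!
ip prefix-list OFFICE-A-OUT seq 10 permit 10.10.0.0/16
ip prefix-list OFFICE-B-IN seq 10 permit 10.20.0.0/16
Nothing is advertised or accepted unless it is explicitly on a list, which is exactly the property that prevents the 0.0.0.0/0 “oops.”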
Short joke #2: BGP is like office gossip: powerful, fast, and it will absolutely spread the wrong thing if you don’t set boundaries.
DNS and identity: the silent killer
Most “VPN problems” are actually DNS problems wearing a VPN costume. The tunnel is up. The route exists. But users can’t reach the intranet, authentication fails, or a database client stalls.
Decide your naming model
- Single internal domain (common in AD environments): both offices resolve the same internal zones.
- Split-horizon DNS: names resolve differently depending on the source site.
- Explicit service naming: apps use FQDNs pointing at site-local VIPs or global load balancers; less “office as a concept.”
Make DNS resilient across the tunnel
At minimum:
- Each office should have local DNS resolvers.
- Resolvers should forward or replicate zones as needed.
- Clients should not rely on “the other office’s DNS” as their primary. That’s how a WAN glitch becomes an authentication storm.
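A sketch of the “forward” option on the Office B resolver, assuming Unbound and the internal zone name used elsewhere in this article (internal.example); the forwarder addresses stand in for whatever is authoritative in Office A:
# /etc/unbound/unbound.conf.d/internal.conf on the Office B resolver (sketch)
server:
    # answer clients from the local office networks
    access-control: 10.20.0.0/16 allow
    # assumption: the internal zone is not DNSSEC-signed, so skip validation for it
    domain-insecure: "internal.example"

forward-zone:
    name: "internal.example"
    forward-addr: 10.10.0.53
    forward-addr: 10.10.0.54
If you care about the “names must resolve during a WAN outage” rule that shows up later in this article, replicate the zone as a secondary instead of forwarding, so answers survive a dead tunnel.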
Identity systems hate partial connectivity
Kerberos, LDAP, and certificate-based auth can fail in ways that look like “random slowness.” If DNS flips between site A and B, you can get intermittent auth timeouts. Keep local identity services local where possible, or design explicit failover with health checks.
MTU, MSS, and why “it pings” is meaningless
Encapsulation adds bytes. IPsec adds overhead (ESP headers, possible UDP encapsulation). WireGuard adds overhead too. If you don’t account for that, you create a special kind of outage: small packets work, large ones stall. Users report “some websites load, file shares hang, video calls freeze.” You start blaming application teams. It’s MTU.
Practical guidance
- Assume your effective MTU over the tunnel is smaller than 1500.
- Clamp TCP MSS on the tunnel edge if you can (common on firewalls and routers). It’s a blunt instrument, but it prevents fragmentation-related misery.
- Test with ping plus “do not fragment” and realistic payload sizes.
What you’re aiming for
You want a path where TCP can establish sessions and transfer large data without blackholing. If PMTUD is blocked in the path (common), MSS clamping often saves you from a week of pointless meetings.
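MSS clamping itself is usually one rule on the gateway. A sketch for the nftables layout used later in Task 14 (table inet filter, chain forward); “rt mtu” clamps to the route MTU, or substitute a fixed value such as 1360 if you prefer something explicit. The iptables equivalent is included for older gateways:
cr0x@server:~$ sudo nft add rule inet filter forward tcp flags syn tcp option maxseg size set rt mtu
cr0x@server:~$ sudo iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
Re-check an established session with ss -ti afterwards (Task 8): the negotiated MSS should shrink to fit the tunnel.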
Security posture: encryption is not access control
A site-to-site VPN encrypts traffic between gateways. It does not decide who should access what. That’s your job.
Do not make “the other office” a trusted zone
Many networks treat the remote office subnet as if it’s just another VLAN. Then a compromised laptop in Office B can probe servers in Office A. Encryption faithfully protects the attacker’s traffic. That is not the win you want.
Minimum viable segmentation
- Allow only necessary subnet-to-subnet flows (e.g., clients to app subnets, app to DB subnets).
- Block east-west between client VLANs by default.
- Log denied flows during rollout so you can adjust without guesswork.
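During rollout it also helps to make that final deny visible. A sketch for the nftables ruleset shown later in Task 14: make the last rule in the forward chain a rate-limited log-and-drop instead of a silent drop (the prefix string is arbitrary):
# last rule in the forward chain (sketch): log what you deny without flooding the journal
limit rate 10/second burst 20 packets log prefix "xsite-denied " counter drop
Watch that prefix while users test; every hit is either a rule you forgot or traffic you are glad you blocked.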
Key management and rotation
For IPsec PSKs: treat them like passwords; rotate them, store them in a secret manager, and don’t email them. For certificates: automate issuance and renewal if you can. Expired certs are the kind of outage that feels personal.
Paraphrased idea (Werner Vogels, operations/reliability): “Build systems assuming things fail; resilience comes from designing for that reality.”
High availability: what “redundant” really means
Redundancy is a chain, and the weakest link is usually “one ISP.” You can build an HA VPN pair and still have a single backhoe take you out.
HA patterns that actually work
- Dual WAN at each site (ideally different carriers, different physical paths).
- Two VPN gateways (active/standby or active/active) with state sync if supported.
- Two tunnels (one per WAN) with routing failover (BGP preferred, or tracked static routes).
Beware the “HA that fails during failover”
Common failure mode: HA is configured, but no one ever tested it. Then the primary link drops, the tunnel flips, and long-lived sessions die. Users call it “the VPN is unstable.” The VPN is doing what you asked. You asked for a switchover without application-level tolerance.
Be explicit about what you promise:
- “Failover within 30 seconds, existing TCP sessions may reset.” (honest)
- “No noticeable impact.” (rare, expensive, and requires more than a VPN)
Monitoring and SLOs for the tunnel (yes, really)
If your monitoring is “the tunnel is up,” you will learn about packet loss from Slack. That is not observability; that’s crowd-sourcing incident detection.
What to measure
- State: tunnel established, rekey events, flaps.
- Latency: RTT between representative hosts across the tunnel.
- Loss: packet loss percentage, preferably at multiple packet sizes.
- Throughput: periodic iperf tests between two test endpoints (scheduled, rate-limited).
- Errors: interface drops, fragmentation counters, firewall denies over the tunnel.
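A minimal sketch of a synthetic probe covering the first three bullets, meant to run from a test host in each office via cron or a systemd timer; the target, port, payload size, and log path are placeholders:
#!/bin/sh
# cross-site synthetic probe (sketch): adjust TARGET/PORT/SIZE/LOG for your environment
TARGET=10.20.8.50
PORT=443
SIZE=1392          # DF payload: measured path MTU minus 28 bytes of IP+ICMP headers
LOG=/var/log/xsite-probe.log

STAMP=$(date -Is)
SUMMARY=$(ping -c 10 -q "$TARGET")

# loss and average RTT from the ping summary lines
LOSS=$(printf '%s\n' "$SUMMARY" | grep -o '[0-9.]*% packet loss')
RTT=$(printf '%s\n' "$SUMMARY" | awk -F'/' '/^rtt/ {print $5}')

# service-level TCP check with a short timeout
if nc -z -w 2 "$TARGET" "$PORT"; then TCP=ok; else TCP=fail; fi

# MTU canary: a large don't-fragment ping should still get through
if ping -M do -s "$SIZE" -c 2 -q "$TARGET" >/dev/null 2>&1; then MTU=ok; else MTU=fail; fi

echo "$STAMP tcp=$TCP mtu=$MTU loss=$LOSS rtt_avg_ms=$RTT" >> "$LOG"
Feed the log (or its metrics-shaped cousin) into whatever alerting you already have; the point is that the probe fails the way users fail.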
Set expectations (lightweight SLO)
Example: “Inter-office connectivity: 99.9% monthly availability; RTT p95 under 40 ms during business hours; packet loss under 0.5%.” Adjust to reality. Then alert when you violate it, not when someone complains.
Practical tasks: commands, outputs, and decisions (12+)
These are Linux-centric because Linux is honest about what it’s doing. Even if your gateways are appliances, you can run these from test hosts on each side.
Task 1: Confirm your local IP and route to remote subnet
cr0x@server:~$ ip addr show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
inet 10.10.12.34/24 brd 10.10.12.255 scope global eth0
valid_lft forever preferred_lft forever
cr0x@server:~$ ip route get 10.20.8.50
10.20.8.50 via 10.10.12.1 dev eth0 src 10.10.12.34 uid 1000
cache
What it means: Your host thinks traffic to 10.20.8.50 goes to the local gateway (10.10.12.1). Good. If it instead routes to some other interface (or default route you didn’t expect), you’re debugging the wrong box.
Decision: If the route is wrong, fix local routing/DHCP option 121/static routes before touching VPN configs.
Task 2: Check that the remote subnet is actually routed back
cr0x@server:~$ traceroute -n 10.20.8.50
traceroute to 10.20.8.50 (10.20.8.50), 30 hops max, 60 byte packets
1 10.10.12.1 0.407 ms 0.352 ms 0.331 ms
2 172.16.100.1 10.982 ms 11.104 ms 11.015 ms
3 10.20.8.50 12.201 ms 12.144 ms 12.087 ms
What it means: Hop 2 is likely your VPN gateway/tunnel interface. You can see the path is going through the expected intermediate.
Decision: If traceroute dies at hop 2, suspect tunnel policy, firewall rules, or missing return routes on the far end.
Task 3: Verify you don’t have overlapping subnets (on a host)
cr0x@server:~$ ip route | egrep '10\.10\.|10\.20\.|192\.168\.'
10.10.12.0/24 dev eth0 proto kernel scope link src 10.10.12.34
10.20.0.0/16 via 10.10.12.1 dev eth0
What it means: Your host has a clean route to the remote network. Overlaps often show up as multiple routes for the same prefix or unexpected connected networks.
Decision: If you see the remote prefix as “connected” locally, you have overlap or a rogue DHCP configuration. Stop and fix addressing.
Task 4: Test basic reachability and distinguish “no route” from “filtered”
cr0x@server:~$ ping -c 3 10.20.8.50
PING 10.20.8.50 (10.20.8.50) 56(84) bytes of data.
64 bytes from 10.20.8.50: icmp_seq=1 ttl=62 time=12.4 ms
64 bytes from 10.20.8.50: icmp_seq=2 ttl=62 time=12.2 ms
64 bytes from 10.20.8.50: icmp_seq=3 ttl=62 time=12.5 ms
--- 10.20.8.50 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 12.210/12.378/12.512/0.121 ms
What it means: ICMP works and latency is reasonable. This does not prove TCP works at scale, but it’s a start.
Decision: If ping fails but traceroute reaches the host, ICMP may be blocked; test TCP next rather than declaring the VPN dead.
Task 5: Test TCP port reachability (service-level check)
cr0x@server:~$ nc -vz -w 2 10.20.8.50 443
Connection to 10.20.8.50 443 port [tcp/https] succeeded!
What it means: Routing and firewall policy allow TCP/443. This is closer to what users experience than ping.
Decision: If this fails but ping works, suspect firewall rules, security groups, or host-based firewall on the server.
Task 6: Validate DNS resolution across sites
cr0x@server:~$ resolvectl status | sed -n '1,25p'
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Link 2 (eth0)
Current Scopes: DNS
Protocols: +DefaultRoute -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 10.10.0.53
DNS Servers: 10.10.0.53 10.20.0.53
cr0x@server:~$ dig +short app.internal.example
10.20.8.50
What it means: You have resolvers from both sites available, and the name resolves to the expected remote IP.
Decision: If resolution flips between different answers (or times out), fix DNS forwarding/replication before touching routing.
Task 7: Catch MTU problems with “do not fragment” pings
cr0x@server:~$ ping -M do -s 1472 -c 2 10.20.8.50
PING 10.20.8.50 (10.20.8.50) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1420
cr0x@server:~$ ping -M do -s 1392 -c 2 10.20.8.50
PING 10.20.8.50 (10.20.8.50) 1392(1420) bytes of data.
1400 bytes from 10.20.8.50: icmp_seq=1 ttl=62 time=13.1 ms
1400 bytes from 10.20.8.50: icmp_seq=2 ttl=62 time=13.0 ms
What it means: Path MTU is around 1420 bytes. If endpoints try 1500-byte packets without MSS clamping or PMTUD, you’ll get stalls.
Decision: Clamp MSS (e.g., 1360–1380) or adjust interface MTU to avoid fragmentation/blackholing.
Task 8: Check TCP MSS currently negotiated (spot clamping)
cr0x@server:~$ ss -ti dst 10.20.8.50:443 | sed -n '1,20p'
ESTAB 0 0 10.10.12.34:52514 10.20.8.50:443
cubic wscale:7,7 rto:204 rtt:13.1/1.2 mss:1360 pmtu:1420 rcvmss:1360 advmss:1360 cwnd:10 bytes_sent:2145 bytes_acked:2145 segs_out:17 segs_in:14 data_segs_out:10 data_segs_in:9
What it means: MSS is 1360 and PMTU 1420; this looks consistent with a tunnel overhead scenario.
Decision: If MSS shows 1460 while PMTU is ~1420, you’re asking for trouble. Implement MSS clamping on the gateway.
Task 9: Measure latency and loss with mtr (not just ping)
cr0x@server:~$ mtr -n -r -c 50 10.20.8.50
Start: 2025-12-27T11:12:41+0000
HOST: server Loss% Snt Last Avg Best Wrst StDev
1.|-- 10.10.12.1 0.0% 50 0.4 0.4 0.3 0.9 0.1
2.|-- 172.16.100.1 0.0% 50 11.2 11.0 10.7 13.9 0.6
3.|-- 10.20.8.50 1.0% 50 12.5 12.7 12.0 18.4 1.2
What it means: 1% loss at the destination is enough to make certain apps feel “slow,” especially chatty ones.
Decision: If loss appears at hop 2 and 3, suspect WAN or tunnel transport. If only at hop 3, suspect the host or its local network.
Task 10: Verify IPsec SA state on a Linux gateway (strongSwan)
cr0x@server:~$ sudo ipsec statusall | sed -n '1,40p'
Status of IKE charon daemon (strongSwan 5.9.8, Linux 6.8.0, x86_64):
uptime: 2 hours, since Dec 27 09:11:02 2025
worker threads: 10 of 16 idle, 6/0/0/0 working, job queue: 0/0/0/0, scheduled: 3
Connections:
office-a-office-b: 10.0.0.1...203.0.113.20 IKEv2, dpddelay=30s
office-a-office-b: local: [gw-a] uses pre-shared key authentication
office-a-office-b: remote: [gw-b] uses pre-shared key authentication
Security Associations (1 up, 0 connecting):
office-a-office-b[12]: ESTABLISHED 15 minutes ago, 10.0.0.1[gw-a]...203.0.113.20[gw-b]
office-a-office-b{8}: INSTALLED, TUNNEL, reqid 1, ESP in UDP SPIs: c9f2a1d1_i c6f8b02f_o
office-a-office-b{8}: 10.10.0.0/16 === 10.20.0.0/16
What it means: IKEv2 is established, and the child SA (ESP) is installed for the expected subnets.
Decision: If IKE is up but no child SA is installed, you likely have mismatched traffic selectors/subnets or policy issues.
Task 11: Check for rekey flaps and negotiation errors in logs
cr0x@server:~$ sudo journalctl -u strongswan --since "2 hours ago" | egrep -i "rekey|established|failed|proposal" | tail -n 20
Dec 27 10:55:03 gw-a charon[1124]: 12[IKE] initiating IKE_SA office-a-office-b[12] to 203.0.113.20
Dec 27 10:55:03 gw-a charon[1124]: 12[IKE] IKE_SA office-a-office-b[12] established between 10.0.0.1[gw-a]...203.0.113.20[gw-b]
Dec 27 10:55:03 gw-a charon[1124]: 12[CHD] CHILD_SA office-a-office-b{8} established with SPIs c9f2a1d1_i c6f8b02f_o and TS 10.10.0.0/16 === 10.20.0.0/16
What it means: Clean establishment. If you see repeated failures, you likely have intermittent WAN, mismatched proposals, or NAT issues.
Decision: If rekeys are frequent during business hours, investigate CPU saturation, packet loss, or too-aggressive lifetimes.
Task 12: Verify WireGuard peer state (if you use WireGuard)
cr0x@server:~$ sudo wg show
interface: wg0
public key: 0u9z4x2m3u9l4t...
listening port: 51820
peer: 2q8m9n0b1v2c3x...
endpoint: 203.0.113.20:51820
allowed ips: 10.20.0.0/16
latest handshake: 52 seconds ago
transfer: 1.32 GiB received, 1.08 GiB sent
persistent keepalive: every 25 seconds
What it means: Recent handshake and traffic counters increasing. If “latest handshake” is “never,” you have routing/NAT/firewall issues.
Decision: If handshakes occur but traffic fails, suspect allowed IPs/routes or firewall policy beyond the tunnel.
Task 13: Confirm forwarding and policy routing on a Linux gateway
cr0x@server:~$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1
cr0x@server:~$ ip rule show
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
What it means: Forwarding is enabled; no unusual policy routing rules are present.
Decision: If forwarding is 0, your gateway is a very expensive host. Enable forwarding and ensure firewall rules permit forwarding.
Task 14: Inspect firewall counters for dropped tunnel traffic (nftables)
cr0x@server:~$ sudo nft list ruleset | sed -n '1,80p'
table inet filter {
chain forward {
type filter hook forward priority filter; policy drop;
ct state established,related accept
iifname "wg0" ip saddr 10.20.0.0/16 ip daddr 10.10.0.0/16 tcp dport { 53, 443, 445 } accept
counter packets 184 bytes 11040 drop
}
}
What it means: Forward policy is drop, and you have a specific allow rule from wg0 into Office A’s subnets. The counter on the final drop rule indicates something is being denied.
Decision: If drop counters climb when users complain, review which ports/subnets are missing. Add rules intentionally; don’t “allow all” out of frustration.
Task 15: Measure throughput (controlled) to spot CPU or shaping limits
cr0x@server:~$ iperf3 -c 10.20.8.50 -P 4 -t 10
Connecting to host 10.20.8.50, port 5201
[ 5] local 10.10.12.34 port 46318 connected to 10.20.8.50 port 5201
[ 7] local 10.10.12.34 port 46334 connected to 10.20.8.50 port 5201
[ 9] local 10.10.12.34 port 46344 connected to 10.20.8.50 port 5201
[ 11] local 10.10.12.34 port 46356 connected to 10.20.8.50 port 5201
[SUM] 0.00-10.00 sec 420 MBytes 352 Mbits/sec sender
[SUM] 0.00-10.00 sec 418 MBytes 350 Mbits/sec receiver
What it means: You’re getting ~350 Mbps across the tunnel between these hosts. That might be great or terrible depending on your WAN and gateway hardware.
Decision: If throughput is far below WAN capacity, check gateway CPU, encryption offload, traffic shaping, or a single-thread bottleneck.
Task 16: Check gateway CPU during load (encryption bottleneck hint)
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.8.0 (gw-a) 12/27/2025 _x86_64_ (4 CPU)
11:14:01 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:14:02 AM all 35.12 0.00 20.44 0.00 0.00 5.33 0.00 0.00 0.00 39.11
11:14:02 AM 0 80.00 0.00 18.00 0.00 0.00 2.00 0.00 0.00 0.00 0.00
11:14:02 AM 1 10.00 0.00 25.00 0.00 0.00 8.00 0.00 0.00 0.00 57.00
11:14:02 AM 2 20.00 0.00 15.00 0.00 0.00 5.00 0.00 0.00 0.00 60.00
11:14:02 AM 3 30.00 0.00 24.00 0.00 0.00 6.00 0.00 0.00 0.00 40.00
What it means: CPU0 is pinned. That often indicates a single queue/interrupt affinity issue or a process not scaling well. VPN throughput can collapse under one hot core.
Decision: If you see a hot core during peak traffic, consider tuning IRQ affinity, enabling multiqueue, upgrading hardware, or changing cipher/offload settings (carefully).
Fast diagnosis playbook
This is the “five minutes, no drama” flow. You’re trying to find the bottleneck quickly: routing, tunnel state, MTU, DNS, or the far-end service.
First: is it actually the tunnel, or just one app?
- Pick a test host in each office (or a jump box) that you control.
- Test TCP to a known service (e.g., nc -vz remote-ip 443).
- Test DNS resolution for the same service name from both sides.
If only one app breaks but basic TCP works, stop blaming the VPN. Move up-stack (TLS, auth, app dependencies).
Second: confirm routing and return path
- Traceroute both ways if possible (A→B and B→A). Asymmetric routing is common when one side has multiple gateways.
- Check gateway routes to ensure each subnet is known and points into the tunnel.
- Look for NAT where you didn’t expect it. NAT inside a routed site-to-site is usually a symptom of overlap or a band-aid.
Third: MTU/MSS sanity check
- Run DF pings to find the approximate path MTU.
- Check an established TCP session’s MSS (ss -ti).
- If blackholing is likely, clamp MSS at the gateway.
Fourth: is the transport link sick?
- Run mtr for 50–200 packets to see loss and jitter.
- Check WAN interface errors/drops on gateways.
- Check if the ISP link is saturating (shaping or bufferbloat can mimic “VPN slowness”).
Fifth: tunnel daemon logs and state
- For IPsec: confirm IKE and child SAs are installed and stable.
- For WireGuard: confirm handshakes and traffic counters increase.
- Look for periodic events: rekeys, DPD timeouts, endpoint changes (NAT).
Common mistakes: symptoms → root cause → fix
1) “Tunnel is up, but SMB/RDP/file transfers hang”
Symptoms: Ping works. Small HTTP requests work. Large downloads stall. SMB copies start fast then freeze.
Root cause: MTU blackhole or missing MSS clamping; PMTUD blocked somewhere.
Fix: Determine path MTU with DF ping; clamp TCP MSS on the VPN gateway; consider lowering tunnel MTU. Re-test with iperf and real transfers.
2) “Only one direction works”
Symptoms: Office A can reach Office B, but not vice versa. Or a specific subnet can initiate but can’t respond.
Root cause: Missing return route, asymmetric routing via a different internet exit, or firewall state assumptions.
Fix: Verify routes on both gateways. Confirm the far end knows how to reach the initiating subnet. Ensure stateful firewalls see both directions on the same path.
3) “Works for an hour, then drops every so often”
Symptoms: Periodic disconnects. VoIP calls drop at predictable intervals. Logs show rekeys or DPD timeouts.
Root cause: Aggressive lifetimes, NAT mapping timeouts, or unstable WAN with brief packet loss that kills keepalives.
Fix: Tune keepalives/persistent keepalive (WireGuard) or DPD settings (IPsec). Avoid overly short rekey intervals. If behind NAT, ensure NAT-T and stable UDP pinholes.
4) “DNS randomly fails, logins are slow, Kerberos errors appear”
Symptoms: Intermittent timeouts to internal names; authentication sometimes fails; retries magically work.
Root cause: Clients using remote-site DNS as primary; DNS forwarding over a flaky tunnel; split-horizon misconfiguration; time skew.
Fix: Prefer local resolvers; replicate zones; add conditional forwarding; validate NTP at both sites; keep DNS dependencies local where feasible.
5) “We added NAT to make overlaps work; now everything is weird”
Symptoms: Some services work, others don’t. Logs show unexpected source IPs. Access control breaks. Troubleshooting becomes archaeology.
Root cause: Overlapping subnets masked with NAT; application assumptions about source IP or reverse lookups.
Fix: Renumber one site (yes, really). If you must NAT temporarily, document it, isolate it, and set a removal deadline.
6) “Throughput is terrible but the ISP link is fast”
Symptoms: Speed tests to the internet look good; over-VPN transfers are slow; CPU spikes on gateway.
Root cause: Encryption is CPU-bound; single core pinned; lack of offload; poor queueing; small MTU causing overhead.
Fix: Measure with iperf. Check CPU per-core. Upgrade gateway hardware, tune IRQ affinity, or use a cipher/offload path supported by your platform. Don’t guess—measure before/after.
Three corporate mini-stories from the trenches
Mini-story 1: the incident caused by a wrong assumption
The company had two offices after an acquisition. Office A ran everything: identity, file shares, internal web apps. Office B had a small network that “looked fine” and a firewall someone configured years ago. The plan was a straightforward IPsec site-to-site tunnel. The project manager’s favorite phrase was “it’s just a tunnel.”
Day one, the tunnel came up. Everyone applauded. Office B still couldn’t log into the CRM by hostname. By IP, it worked. The war room immediately blamed the VPN because that’s what changed. The VPN team started swapping ciphers and rekey timers like it was a cooking show.
Two hours later, someone finally ran dig from an Office B client. It was using Office A’s DNS server as primary, across the tunnel. Under light traffic, it worked. Under morning login load, DNS requests queued behind big SMB transfers. Name lookups timed out, authentication cascaded into retries, and users experienced “the internet is broken.”
The wrong assumption was that DNS is “small traffic” and can ride along with everything else without priority or locality. The fix was boring: add local DNS resolvers in Office B, implement conditional forwarding for internal zones, and explicitly allow DNS traffic with higher priority. The tunnel didn’t change. The outcome did.
Afterwards, they wrote a single operational rule: “Every site must be able to resolve internal names locally during a WAN outage.” That rule prevented at least three later incidents, including one where the ISP went down and nobody noticed for fifteen minutes because the site remained functional.
Mini-story 2: the optimization that backfired
A different org had solid connectivity but wanted “more speed.” Their WAN links were upgraded, and someone decided the VPN was now the bottleneck. They enabled a set of “performance” knobs on the gateway: shorter rekey intervals, aggressive DPD, and a more CPU-intensive cipher suite because “stronger is better.” They also turned on traffic shaping “just to smooth things out,” without defining targets.
The tunnel still came up. Most traffic seemed fine. Then, at suspiciously regular intervals, a wave of user complaints hit: voice calls dropped, remote desktop froze, and file transfers stalled then resumed. The pattern looked like a haunted clock.
It was not haunted. It was rekey churn combined with shaping that introduced queueing delay during negotiation bursts. Rekey packets competed with data packets. DPD fired when latency spiked. The gateways tore down and rebuilt state, causing sessions to reset. Everyone experienced “random instability” that was actually perfectly timed.
The rollback fixed it immediately: restore reasonable lifetimes, keep DPD sane, and remove shaping until they had measured requirements. Later, they added QoS selectively for voice, not a blanket “make it nice.” The lesson stuck: optimizing without a measurable bottleneck is just adding complexity with a timer on it.
Mini-story 3: the boring but correct practice that saved the day
An operations team maintained a hub-and-spoke VPN connecting two offices and a small cloud VPC. Nothing fancy: two tunnels per site, BGP with prefix filters, and a monitoring job that tested DNS, TCP/443, and a large-packet ping every minute from each side. The dashboards were not pretty, but they were honest.
One afternoon, Office B reported “slowness to everything in Office A.” The tunnel state was up. From the user perspective, that meant “the VPN is up, so it must be the apps.” The app team braced for impact.
The monitoring told a better story: latency jumped and packet loss spiked only on the path that used ISP #1 in Office B. The backup tunnel over ISP #2 looked clean. BGP was still preferring ISP #1 because the tunnel was “up.” The team adjusted BGP metrics to prefer the clean path and temporarily de-preferenced the bad link.
Users recovered in minutes. No heroic debugging. No finger-pointing. Later, they added tracking based on measured loss/latency so “up” wasn’t the only signal. The boring practice was having synthetic tests that matched user pain and a routing design that could react. It saved the day because the day didn’t need saving; it needed evidence.
FAQ
1) Should I use IPsec or WireGuard for a site-to-site VPN?
If you need appliance support, compliance checkboxes, or easy integration with existing firewalls, use IPsec with IKEv2. If you control both endpoints and want operational simplicity, WireGuard is often easier to run and debug.
2) Can I connect two offices if one side is behind NAT or carrier-grade NAT?
Usually, yes. For IPsec, use NAT-T (UDP 4500) and ensure keepalives/DPD are configured. For WireGuard, persistent keepalive helps maintain NAT mappings. If inbound is impossible, you may need a reachable hub (cloud VM) and connect both sites outward.
3) Do I need a static public IP at both sites?
It makes life easier, but it’s not strictly required. Dynamic IPs work with DDNS or with a hub approach. Still, for production: pay for static IPs if you can. It’s cheaper than the hours you’ll spend debugging “why did the endpoint change?”
4) Should we route all internet traffic from Office B through Office A?
Only if you have a clear reason (centralized security tooling, compliance). Otherwise, keep internet egress local and route only internal prefixes over the VPN. Full-tunneling site-to-site is a classic way to create surprise outages.
5) How many subnets should be allowed through the tunnel?
As few as you can get away with. Start with the specific server subnets and required ports. Expand based on observed needs. “Allow any-any” is how internal incidents become cross-site incidents.
6) Why does everything break when someone starts a big file copy?
Most commonly MTU/MSS issues or bufferbloat/saturation on the WAN link. Validate path MTU with DF pings, check MSS with ss -ti, and measure loss/latency with mtr while the copy runs.
7) Can we run BGP over the VPN, or is that too fancy?
You can, and it’s often the cleanest way to handle redundancy and avoid manual route edits. If you have only one tunnel and two subnets, static is fine. If you have two tunnels and want reliable failover, BGP is the grown-up move.
8) How do we make failover not break user sessions?
You usually can’t guarantee that with only a VPN change. You can minimize disruption with fast convergence and stable MTU, but many apps and TCP sessions will reset on path change. If “no disruption” is a requirement, you’re in application resilience and load-balancing territory.
9) What ports do we need open for a typical setup?
Depends. Commonly: IPsec uses UDP 500 for IKE and UDP 4500 for NAT-T, plus ESP (IP protocol 50, which is not a port) if you’re not encapsulating in UDP. WireGuard uses a single UDP port you choose (often 51820). Then your internal allowed ports are your policy decision, not a VPN requirement.
10) Is it safe to treat two offices as one trusted network if the traffic is encrypted?
No. Encryption protects data in transit; it does not protect you from a compromised device on the far side. Segment, restrict, and log. Assume breach, because reality already does.
Next steps that won’t embarrass you later
- Write the IP plan and ban overlaps. If you already have overlaps, plan the renumbering instead of “temporary NAT forever.”
- Choose routed tunnels with clear allowed prefixes. Avoid L2 extension unless you enjoy pain.
- Pick IPsec (IKEv2) or WireGuard based on what you can operate, not what a vendor slide deck claims.
- Decide split vs full tunnel deliberately. Default split. Full tunnel only with a reason.
- Make DNS local at each site, with forwarding/replication. Test name resolution during a simulated WAN outage.
- Fix MTU/MSS early and validate with DF pings and real transfers. “It pings” is not a test plan.
- Add monitoring that matches user pain: DNS, TCP, loss/latency, not just “tunnel up.”
- Test failover like you mean it. Then document what breaks (sessions may reset) so leadership isn’t surprised later.
If you do these in order, you’ll have a site-to-site VPN that behaves like grown-up infrastructure: boring, predictable, and only exciting in the ways you choose.