VPN Full-Mesh for Three Offices: When You Need It and How to Keep It Manageable

Three offices. Each has its own internet circuit, its own local quirks, and its own person who “just needs to access the other site.”
You set up a site-to-site VPN, then another, and suddenly you’re juggling routes, firewall rules, DNS weirdness, and a printer that only fails on Tuesdays.

A full-mesh VPN sounds like the grown-up answer: everyone talks to everyone directly. Sometimes it is. Sometimes it’s an expensive way to create
a distributed incident generator. Let’s decide which world you’re in—and how to run it without becoming the human routing table.

What “full-mesh” really means for 3 offices (and what it doesn’t)

With three sites—call them NYC, DAL, SFO—a full mesh means you build
three site-to-site tunnels: NYC↔DAL, DAL↔SFO, SFO↔NYC. Every site has a direct encrypted path to the others.
No hairpin through a central hub.

This is the part vendors love to put in a neat triangle diagram. In real life, a “full mesh” is not just tunnels. It’s:

  • Routing policy: which prefixes are advertised where, and what wins when multiple paths exist.
  • Security policy: what’s allowed east-west between offices.
  • Operational policy: how changes are rolled out without breaking half your WAN.
  • Identity & naming: DNS, split-horizon, AD site topology, or whatever your auth stack is doing.

For three offices, a full mesh is not automatically “complex.” It’s easy to make complex. That’s the distinction that matters.

Quick math: why the pain grows

The number of tunnels in a full mesh is n(n−1)/2. With 3 sites: 3 tunnels. With 6 sites: 15. With 10 sites: 45.
Most teams can keep 3 tunnels healthy. Many teams can’t keep 15 tunnels consistently healthy without automation and telemetry.

Interesting facts and a bit of history (because you’ll inherit old ideas)

  • Fact 1: IPsec started life in the 1990s as part of IPv6’s security story, but got adopted heavily on IPv4 because… reality.
  • Fact 2: “NAT traversal” (UDP 4500) exists because IPsec ESP didn’t play nicely with NAT; it wasn’t a nice-to-have, it was survival.
  • Fact 3: Many “site-to-site VPN” products still inherit a worldview from leased-line days: stable endpoints, predictable paths, fewer rekeys.
  • Fact 4: BGP became the default for dynamic WAN routing not because it’s friendly, but because it’s brutally explicit about paths and policy.
  • Fact 5: The MTU/MSS mess in VPNs is decades old; it’s not a “cloud era” issue—PPP, GRE, and IPsec all had their turns breaking packets.
  • Fact 6: Split-horizon DNS predates “zero trust” by years; it’s an old tool that remains relevant when different networks need different answers.
  • Fact 7: SD-WAN’s big win wasn’t encryption (VPNs already did that); it was centralized control, observability, and path selection under loss/jitter.
  • Fact 8: “Full mesh” topologies were common in early corporate WANs using Frame Relay PVCs—then everyone remembered billing and moved to hubs.

When you actually need a full-mesh

You need a full mesh when direct paths are materially better for your business and you can’t reliably get that result with a hub-and-spoke.
“Materially better” means latency, bandwidth, availability domains, or failure isolation. Not vibes.

1) Your traffic is genuinely site-to-site, not site-to-datacenter

If users in NYC constantly hit file servers in DAL, and DAL constantly hits build runners in SFO, sending all that through a hub is
like routing local mail through another state “for visibility.” Your hub becomes a tax.

2) You need failure isolation between sites

In a hub-and-spoke, the hub is a shared failure domain: DDoS on the hub circuit, hub firewall upgrade, bad policy push—everyone feels it.
A mesh can keep NYC↔DAL alive when SFO is having an exciting day.

3) You have multiple internet circuits and want better path choices

If each office has its own solid internet, a mesh lets each pair use the best path between them. You can still do this with a hub if you
do clever policy routing and accept hairpin, but you’re choosing complexity anyway—at least get the latency benefit.

4) You have regulated segmentation requirements across sites

This sounds backwards (“mesh equals more trust”), but if you do it right, mesh can reduce exposure by not funneling everything through a hub
where it mingles. You can enforce pairwise policies: DAL can reach NYC finance, but not NYC lab; SFO can reach NYC SSO, but not DAL OT.

5) You can operate it like a system, not a one-off project

Full mesh is an operational commitment. If you don’t have monitoring, config versioning, and a change process, you’re building a nice triangle
that will become modern art the first time someone changes an ISP CPE.

Paraphrased idea (John Allspaw): reliability comes from how work is done in real conditions, not from perfect plans.
Mesh VPNs punish “perfect plan” thinking.

When full-mesh is the wrong tool

The most common reason teams choose full mesh is emotional: “I don’t want a hub.” Fair. But avoiding a hub by creating three independent,
inconsistently managed tunnels is not a strategy. It’s a hobby.

Don’t do full mesh if you really have a core site

If most traffic is “branch to HQ systems,” build hub-and-spoke. Make the hub redundant. Give it multiple uplinks. Monitor it aggressively.
You’ll get simpler policy and better blast-radius control.

Don’t do full mesh if you can’t standardize equipment and configs

If NYC runs a firewall appliance, DAL runs a Linux box someone named “vpn1,” and SFO runs a cloud router with a different IKE stack,
you’re signing up to debug three dialects of the same protocol. That’s not engineering. That’s linguistics.

Don’t do full mesh if your problem is “remote access”

Site-to-site mesh is for networks. Remote access is for people and devices. If you’re trying to solve laptop-to-office access by meshing offices,
you’ll end up with a network that is perfectly connected and still doesn’t know who the user is.

Joke #1: A full mesh is like a group chat—great until someone adds the printer, and suddenly everyone’s getting notifications at 3 a.m.

Design options: IPsec, WireGuard, SD-WAN, and “cheap and cheerful”

Option A: IPsec (IKEv2) site-to-site

IPsec is boring in the good way: widely supported, interoperable, understood by firewalls, and usually acceptable to auditors.
It’s also full of knobs that can hurt you: proposals, lifetimes, DPD, NAT-T, PFS, fragmentation behavior, and vendor “helpfulness.”

If you go IPsec, prefer:

  • IKEv2 over IKEv1. Fewer ghosts.
  • Route-based tunnels (VTI) rather than policy-based when possible. Routing belongs to routing.
  • Consistent proposals across all tunnels, ideally a single suite your devices all support well.
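
If you want to see what those choices add up to, here is a minimal route-based sketch of the NYC↔DAL leg in strongSwan’s swanctl.conf format. It assumes certificate authentication, a VTI keyed to mark 12, and install_routes = no in strongswan.conf; the identities, certificate file, and mark value are illustrative placeholders, not a drop-in config.

connections {
  nyc-dal {
    version = 2                   # IKEv2 only
    local_addrs  = 198.51.100.10
    remote_addrs = 203.0.113.20
    proposals = aes256gcm16-prfsha256-ecp256
    local {
      auth = pubkey
      certs = nyc-gw.pem          # placeholder certificate
      id = nyc
    }
    remote {
      auth = pubkey
      id = dal
    }
    children {
      nyc-dal-child {
        local_ts  = 0.0.0.0/0     # route-based: selectors wide open, routing decides
        remote_ts = 0.0.0.0/0
        esp_proposals = aes256gcm16-ecp256
        mark_in  = 12             # must match the VTI key
        mark_out = 12
        start_action = start
        dpd_action = restart
      }
    }
  }
}

Use the same suite on all three tunnels; a surprising amount of IPsec mesh pain is one pair quietly negotiating a different proposal than the other two.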

Option B: WireGuard site-to-site

WireGuard is operationally pleasant: minimal config, fast rekey behavior, and a strong “do less” philosophy. It’s also not a routing protocol.
You still need to decide how routes propagate and how to prevent “AllowedIPs” from becoming your accidental global routing table.

For three sites, WireGuard can be excellent if:

  • You control both ends (Linux routers or appliances that support it well).
  • You can standardize on the same firewall posture and monitoring.
  • You’re comfortable handling dynamic routing separately (static routes or BGP/OSPF on top).
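
As a concrete example, here is a wg-quick style sketch of the NYC end of the NYC↔DAL tunnel, assuming static routing where AllowedIPs doubles as the route filter. The keys, port, and addresses are placeholders.

[Interface]
# NYC gateway end of the NYC<->DAL tunnel (wg-quick syntax)
Address = 169.254.20.1/30
ListenPort = 51820
# placeholder; keep real private keys out of version control
PrivateKey = <nyc-private-key>

[Peer]
# DAL gateway (public key is a placeholder)
PublicKey = <dal-public-key>
Endpoint = 203.0.113.20:51820
# only DAL's prefixes: AllowedIPs is also your routing filter
AllowedIPs = 169.254.20.2/32, 10.20.0.0/16
PersistentKeepalive = 25

Keeping AllowedIPs limited to the peer’s real prefixes is exactly what stops it from becoming that accidental global routing table. If you run BGP on top instead, give each peer its own interface and widen AllowedIPs to whatever the routing protocol is allowed to steer.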

Option C: SD-WAN (managed policy + overlays)

SD-WAN is what you buy when you want less bespoke networking and more repeatable operations. It gives you centralized policy, path selection,
and better observability. You pay in licensing and in “the controller is now part of production.”

For three offices, SD-WAN can be overkill—unless you’re growing, have multiple circuits per site, or need application-aware steering.

Option D: Hub-and-spoke with smart routing (often the correct compromise)

If you want “mostly direct” but not full mesh complexity, do hub-and-spoke with:

  • Redundant hubs (active/active or active/passive).
  • BGP between sites and hubs.
  • Policy that prefers local breakout for internet and prefers hub for core services.

You get manageable topology and still avoid some hairpin for internet-bound traffic.

Routing, DNS, and identity: where full-mesh dies in practice

Routing model: static routes vs dynamic routing

For three offices, static routes can work. They also quietly rot.
The moment you add a new subnet, change a LAN range, or introduce a second circuit, you’ll find the tunnel that didn’t get updated.
That’s not a hypothetical. It’s Tuesday.

Dynamic routing (typically BGP) on top of route-based tunnels is the “adult” option:

  • Routes propagate automatically.
  • You can express preference and failover policy.
  • You can filter what each site learns (vital for segmentation).

Choosing the routing approach for exactly three sites

Here’s the opinionated take:

  • Static routes are fine if each office has one LAN prefix, no overlapping networks, and you won’t add sites soon.
  • BGP is worth it if any office has multiple internal segments, you want clean failover, or you plan to grow beyond three sites.
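
If you pick BGP, the per-site config stays small. Here is a hedged FRR sketch for the NYC gateway, reusing the private ASNs and tunnel addresses that show up in the diagnostic examples later in this article; other platforms differ in syntax, not in shape: advertise only local prefixes and filter what you export.

router bgp 65010
 bgp router-id 10.10.0.1
 ! DAL peer over vti-nyc-dal, SFO peer over vti-nyc-sfo
 neighbor 169.254.20.2 remote-as 65020
 neighbor 169.254.30.2 remote-as 65030
 address-family ipv4 unicast
  network 10.10.0.0/16
  ! export only NYC prefixes so NYC never becomes accidental transit
  neighbor 169.254.20.2 prefix-list NYC-ONLY out
  neighbor 169.254.30.2 prefix-list NYC-ONLY out
 exit-address-family
!
ip prefix-list NYC-ONLY seq 10 permit 10.10.0.0/16 le 24

Task 9 in the hands-on section shows how to verify what this actually advertises.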

Split-horizon DNS: the quiet requirement

Most inter-office pain is not IP connectivity. It’s name resolution and service discovery:

  • NYC’s DNS returns an internal IP that only NYC can reach unless routes are correct.
  • Clients in DAL resolve “fileserver” to the wrong site because someone set a search suffix and called it a day.
  • SaaS split tunneling sends DNS one place and traffic another, and then you get “it works on my phone.”

Decide whether you want:

  • Single internal DNS view (all offices see the same internal answers), or
  • Site-aware DNS (answers vary by source site, usually to keep clients local).

Site-aware DNS is powerful. It’s also how you accidentally build a distributed system without noticing.
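
If you choose site-aware answers, the usual mechanism is DNS views. A hedged BIND-style sketch for the NYC resolver; the zone name, file names, and client ranges are placeholders:

// NYC clients get NYC-local answers
view "from-nyc" {
    match-clients { 10.10.0.0/16; };
    zone "corp.internal" {
        type primary;
        file "zones/corp.internal.nyc";
    };
};

// DAL and SFO clients get answers that point at targets they can actually reach
view "from-remote" {
    match-clients { 10.20.0.0/16; 10.30.0.0/16; };
    zone "corp.internal" {
        type primary;
        file "zones/corp.internal.remote";
    };
};

Whichever model you pick, test resolution and reachability together from each site; an answer that resolves but doesn’t route is the distributed system sneaking in.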

Overlapping subnets: the “we’ll never do that” lie

If any office uses RFC1918 space that overlaps with another office—or with a partner network—you are going to have a bad time.
Renumbering is painful, but NAT-over-VPN is worse long-term because it leaks into application configs and troubleshooting.

Security controls that keep a mesh from becoming lateral movement heaven

A full mesh increases connectivity. Attackers love connectivity.
Your job is to ensure the mesh increases intentional connectivity, not accidental trust.

Principle: route everything you need, allow almost nothing by default

Routing and firewalling are different layers. You can advertise routes broadly (for simplicity) and still enforce strict policy at L3/L4.
Or you can tightly filter routes and keep firewall rules simpler. Pick one as the primary control plane and be consistent.

Recommended baseline controls

  • Inter-site ACLs per zone (user LAN, server LAN, management, voice, OT).
  • Management plane isolation: VPN gateways should have a dedicated management interface or subnet, not “admin from anywhere.”
  • Logging for denies across the VPN, with sampling or rate limits so you don’t DDoS your SIEM.
  • Key rotation and cert hygiene: PSKs tend to live forever; certs force adults to show up.

Segmentation pattern that works

Use zones and explicit allow-lists:

  • NYC-users → DAL-apps: allow 443, 22 (to bastion only), 445 (only to file cluster), deny rest.
  • DAL-users → SFO-users: deny (most orgs do not need “user LAN to user LAN” at all).
  • All-sites → monitoring: allow to central metrics/logging endpoints only.
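
In nftables terms, that pattern looks roughly like this on the DAL gateway; the zone subnets, bastion, and file-cluster addresses are made-up examples:

table inet interoffice {
  chain forward {
    type filter hook forward priority filter; policy drop;
    ct state established,related accept
    # NYC users (10.10.9.0/24) to DAL apps: HTTPS broadly, SSH to the bastion only, SMB to the file cluster only
    iifname "vti-nyc-dal" ip saddr 10.10.9.0/24 ip daddr 10.20.8.0/24 tcp dport 443 accept
    iifname "vti-nyc-dal" ip saddr 10.10.9.0/24 ip daddr 10.20.8.10 tcp dport 22 accept
    iifname "vti-nyc-dal" ip saddr 10.10.9.0/24 ip daddr 10.20.8.50 tcp dport 445 accept
    # log denies at a sane rate, then count and drop everything else
    limit rate 10/second log prefix "intersite-drop "
    counter drop
  }
}

Note that “deny user LAN to user LAN” is simply the absence of an accept rule; the chain policy handles it.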

Joke #2: If you allow “any-any” between offices “just for now,” congratulations—you’ve invented a time machine where “now” lasts three years.

Keeping it manageable: naming, automation, monitoring, change control

Standardize the primitives

Pick a consistent model and stick to it:

  • Route-based tunnels with VTIs named predictably (e.g., vti-nyc-dal).
  • Consistent subnets per site (avoid cleverness). Example: NYC=10.10.0.0/16, DAL=10.20.0.0/16, SFO=10.30.0.0/16.
  • Consistent BGP ASNs if you use BGP: private ASNs per site, documented.
  • Consistent firewall zone layout: users, servers, management, guest.
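
For the tunnel primitive itself, a hedged Linux sketch for the NYC side of the NYC↔DAL pair; the key must match whatever mark your IPsec config sets, and the 169.254.20.0/30 link net lines up with the BGP neighbor addresses used in this article’s examples.

# NYC gateway: predictable, route-based tunnel interface toward DAL
ip link add vti-nyc-dal type vti local 198.51.100.10 remote 203.0.113.20 key 12
ip addr add 169.254.20.1/30 dev vti-nyc-dal
ip link set dev vti-nyc-dal mtu 1436 up

Whatever your platform’s equivalent is, the point is that it comes from the same template at every site, not from memory.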

Make configuration diffable

If your VPN config lives only in a web UI, you don’t have config management—you have hope.
Export config, store it in version control, and treat changes as code even if the “code” is vendor syntax.

Monitoring: measure what breaks first

VPNs fail in boring ways:

  • Packet loss and jitter break voice/VDI before “the tunnel is down.”
  • MTU issues break specific apps (SMB, some HTTPS) while ping works.
  • Rekeys flap and cause brief brownouts that users report as “random.”

Monitor:

  • Tunnel state (IKE SA, child SA, handshake health).
  • Latency/loss between sites (not just to the internet).
  • Throughput and drops on the VPN interface and WAN interface.
  • Routing adjacency (BGP session up/down, route count changes).
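
The loss/latency piece doesn’t need a product on day one. Here is a hedged sketch of a cron-able probe that pings the other sites’ tunnel addresses and logs loss and average RTT to syslog; a real deployment would push these numbers into whatever metrics system you already run.

#!/bin/sh
# probe loss and latency toward the other sites' tunnel addresses (example IPs)
for peer in 169.254.20.2 169.254.30.2; do
  out=$(ping -c 20 -i 0.2 -q "$peer")
  loss=$(echo "$out" | grep -o '[0-9.]*% packet loss')
  rtt=$(echo "$out" | awk -F'/' '/^rtt/ {print $5}')
  logger -t vpn-probe "peer=$peer loss=${loss:-unknown} avg_rtt_ms=${rtt:-unreachable}"
done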

Change control that doesn’t hate you

You don’t need a bureaucracy. You need a habit:

  • Do changes during a window.
  • Have a rollback plan that’s real (saved config, tested procedure).
  • Change one tunnel at a time unless you’re doing a coordinated cutover.
  • After change: verify routing, verify DNS, verify a real app flow.

Fast diagnosis playbook

When “the VPN is slow” or “site B can’t reach site C,” you don’t have time for a philosophical discussion about topology.
You need to find the bottleneck quickly. Here’s the order that works in production.

First: is it really the VPN?

  • Check if the issue is app-specific (SMB vs HTTPS vs RDP).
  • Check if it’s one direction only (asymmetric routing, firewall state).
  • Check if it’s one site pair only (NYC↔DAL fine, DAL↔SFO broken).

Second: underlay health (ISP path) beats overlay arguments

  • Measure loss and latency between public endpoints of the VPN gateways.
  • Look for packet loss bursts, not averages.
  • Confirm no one is saturating the uplink (backups, cloud sync, surprise video meeting).

Third: tunnel and crypto health

  • Is the IKE SA stable, or rekeying/flapping?
  • Are there retransmits, fragment drops, NAT-T issues?
  • Is hardware offload active (if relevant), or did a firmware change disable it?

Fourth: routing and MTU (the two silent killers)

  • Routing: verify next-hop and ensure no hairpin loops.
  • MTU: test with DF set; clamp MSS if needed.

Fifth: policy (firewall) and identity (DNS/AD)

  • Firewall denies across zones. Look for implicit denies.
  • DNS returning unreachable targets; AD site mappings wrong; split-brain DNS.

Hands-on tasks: commands, expected output, and what decision to make

Below are practical tasks you can run from Linux-based VPN gateways or diagnostic hosts at each site.
The point isn’t the command; it’s the loop: observe → interpret → decide.

Task 1: Confirm basic reachability between site gateways (underlay)

cr0x@server:~$ ping -c 5 203.0.113.20
PING 203.0.113.20 (203.0.113.20) 56(84) bytes of data.
64 bytes from 203.0.113.20: icmp_seq=1 ttl=54 time=18.7 ms
64 bytes from 203.0.113.20: icmp_seq=2 ttl=54 time=19.1 ms
64 bytes from 203.0.113.20: icmp_seq=3 ttl=54 time=18.6 ms
64 bytes from 203.0.113.20: icmp_seq=4 ttl=54 time=120.3 ms
64 bytes from 203.0.113.20: icmp_seq=5 ttl=54 time=19.0 ms

--- 203.0.113.20 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 18.6/39.1/120.3/40.6 ms

What it means: No packet loss, but a latency spike suggests jitter or transient congestion.
Decision: If spikes correlate with “VPN is slow,” investigate ISP path, saturation, or bufferbloat before touching VPN settings.

Task 2: Trace the public path to detect weird detours

cr0x@server:~$ traceroute -n 203.0.113.20
traceroute to 203.0.113.20 (203.0.113.20), 30 hops max, 60 byte packets
 1  198.51.100.1  1.012 ms  0.941 ms  0.915 ms
 2  198.51.100.9  3.122 ms  3.101 ms  3.094 ms
 3  203.0.113.1   9.884 ms  9.860 ms  9.841 ms
 4  203.0.113.20  18.992 ms  18.970 ms  18.948 ms

What it means: Short path, stable hops. If you see sudden long paths or timeouts, the underlay is suspect.
Decision: If the path changed recently, open an ISP ticket or switch to a secondary circuit if you have one.

Task 3: Check tunnel interfaces and counters (overlay health)

cr0x@server:~$ ip -s link show dev vti-nyc-dal
6: vti-nyc-dal: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1436 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/none
    RX:  bytes  packets  errors  dropped overrun mcast
      987654321 1234567  0       12      0       0
    TX:  bytes  packets  errors  dropped carrier collsns
      876543210 1122334  0       0       0       0

What it means: RX drops on the tunnel often point to MTU issues, queueing, or policy drops upstream.
Decision: If drops increment during complaints, move to MTU/MSS tests and check firewall logs for drops on the VTI.

Task 4: Verify route to a remote subnet (routing correctness)

cr0x@server:~$ ip route get 10.30.5.10
10.30.5.10 via 169.254.30.1 dev vti-nyc-sfo src 10.10.0.1 uid 0
    cache

What it means: The kernel will send traffic to SFO via the NYC↔SFO tunnel.
Decision: If this unexpectedly goes via DAL (hairpin), fix routing preference (metrics/BGP policy) before blaming the application.

Task 5: Detect asymmetric routing (classic “works one way”)

cr0x@server:~$ sudo tcpdump -ni vti-nyc-dal -c 3 host 10.20.8.50 and tcp port 445
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vti-nyc-dal, link-type RAW (Raw IP), snapshot length 262144 bytes
IP 10.10.9.25.49812 > 10.20.8.50.445: Flags [S], seq 112233, win 64240, options [mss 1360,sackOK,TS val 123 ecr 0,nop,wscale 7], length 0
IP 10.10.9.25.49812 > 10.20.8.50.445: Flags [S], seq 112233, win 64240, options [mss 1360,sackOK,TS val 1123 ecr 0,nop,wscale 7], length 0
IP 10.10.9.25.49812 > 10.20.8.50.445: Flags [S], seq 112233, win 64240, options [mss 1360,sackOK,TS val 2123 ecr 0,nop,wscale 7], length 0
3 packets captured

What it means: SYN retransmits with no SYN-ACK returning on this interface. Either the return path is elsewhere (asymmetric) or blocked.
Decision: Capture on the remote site’s tunnel too. If SYN-ACK returns via a different tunnel, fix routing symmetry or stateful firewall expectations.

Task 6: Test MTU properly with DF set (stop trusting “ping works”)

cr0x@server:~$ ping -M do -s 1372 -c 3 10.30.5.10
PING 10.30.5.10 (10.30.5.10) 1372(1400) bytes of data.
1380 bytes from 10.30.5.10: icmp_seq=1 ttl=63 time=32.1 ms
1380 bytes from 10.30.5.10: icmp_seq=2 ttl=63 time=31.8 ms
1380 bytes from 10.30.5.10: icmp_seq=3 ttl=63 time=32.0 ms

--- 10.30.5.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 31.8/32.0/32.1/0.1 ms

What it means: Payload 1372 (1400 bytes on the wire) passes with DF, so the path MTU is at least 1400.
Decision: If larger fails (e.g., 1472), set tunnel MTU accordingly and clamp TCP MSS to avoid fragmentation blackholes.

Task 7: Check for fragmentation needed errors (the smoking gun)

cr0x@server:~$ ping -M do -s 1472 -c 2 10.30.5.10
PING 10.30.5.10 (10.30.5.10) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1436
ping: local error: message too long, mtu=1436

--- 10.30.5.10 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1012ms

What it means: Your local interface MTU is 1436; 1500-byte frames won’t fit through the tunnel.
Decision: Clamp MSS (e.g., 1360) and ensure internal interfaces don’t assume jumbo/1500 end-to-end across the VPN.
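
Clamping is a one-line rule on a Linux gateway. A sketch assuming the nftables forward chain shown in Task 13; the older iptables equivalent is the TCPMSS target with --clamp-mss-to-pmtu.

# clamp TCP MSS on SYNs leaving via the tunnel to the discovered path MTU
nft add rule inet filter forward oifname "vti-nyc-dal" tcp flags syn tcp option maxseg size set rt mtu

Repeat per tunnel interface, and do it on both gateways so both directions of each connection get clamped.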

Task 8: Check BGP session state and route counts

cr0x@server:~$ vtysh -c "show ip bgp summary"
BGP router identifier 10.10.0.1, local AS number 65010
BGP table version is 42
3 BGP AS-PATH entries
1 BGP community entries

Neighbor        V         AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
169.254.20.2    4      65020     812     799       42    0    0 02:11:34        18
169.254.30.2    4      65030     805     790       42    0    0 02:09:57        22

What it means: Both neighbors are Established; you’re receiving prefixes. Sudden drops to 0 prefixes are a routing outage even if the tunnel is “up.”
Decision: If prefixes are missing, check filters, next-hop reachability, and whether the tunnel interface IP changed.

Task 9: Validate what routes you’re actually exporting (avoid accidental transit)

cr0x@server:~$ vtysh -c "show ip bgp neighbors 169.254.20.2 advertised-routes"
BGP table version is 42, local router ID is 10.10.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.10.0.0/16      0.0.0.0                  0         32768 i
*> 10.10.50.0/24     0.0.0.0                  0         32768 i

What it means: You are advertising only NYC prefixes, not “the whole world.”
Decision: If you see DAL routes advertised to SFO via NYC, you may be unintentionally making NYC a transit; fix export policy.

Task 10: Confirm IP forwarding and rp_filter (Linux routers bite back)

cr0x@server:~$ sysctl net.ipv4.ip_forward net.ipv4.conf.all.rp_filter
net.ipv4.ip_forward = 1
net.ipv4.conf.all.rp_filter = 1

What it means: Forwarding is on, but strict reverse-path filtering can drop asymmetric flows (common in multi-tunnel meshes).
Decision: If you have asymmetric routing by design or during failover, set rp_filter to loose (2) on relevant interfaces.
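
A hedged example of that fix (interface names are examples). Linux applies the higher of the global and per-interface values, so loose (2) on the tunnel interfaces takes effect even with all=1; persist it under /etc/sysctl.d/ or it disappears on reboot.

sysctl -w net.ipv4.conf.vti-nyc-dal.rp_filter=2
sysctl -w net.ipv4.conf.vti-nyc-sfo.rp_filter=2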

Task 11: Check strongSwan IKE and CHILD SAs (IPsec truth)

cr0x@server:~$ sudo swanctl --list-sas
nyc-dal: #12, ESTABLISHED, IKEv2, 1d 02:11:04, 198.51.100.10[nyc]...203.0.113.20[dal]
  local  'nyc' @ 198.51.100.10[4500]
  remote 'dal' @ 203.0.113.20[4500]
  AES_GCM_16_256/PRF_HMAC_SHA2_256/ECP_256
  nyc-dal-child: #22, INSTALLED, TUNNEL, reqid 1, 2h 11:04, ESP in UDP SPIs: c1a2b3c4_i c4b3a2c1_o
    AES_GCM_16_256, 12345678 bytes_i, 11223344 bytes_o, rekeying in 31 minutes

What it means: Tunnel is established, child SA installed, traffic counters incrementing, rekey scheduled.
Decision: If bytes stay at 0 while apps fail, you likely have routing/firewall issues, not “VPN down.”

Task 12: Watch for rekey flaps in logs (brownouts)

cr0x@server:~$ sudo journalctl -u strongswan --since "30 min ago" | tail -n 12
Aug 22 10:11:03 nyc-gw charon[2014]: 12[IKE] rekeying IKE_SA nyc-dal[12]
Aug 22 10:11:04 nyc-gw charon[2014]: 12[IKE] sending CREATE_CHILD_SA request 1 [ SA No KE TSi TSr ]
Aug 22 10:11:09 nyc-gw charon[2014]: 12[IKE] retransmit 1 of request with message ID 47
Aug 22 10:11:14 nyc-gw charon[2014]: 12[IKE] retransmit 2 of request with message ID 47
Aug 22 10:11:19 nyc-gw charon[2014]: 12[IKE] giving up after 3 retransmits
Aug 22 10:11:19 nyc-gw charon[2014]: 12[IKE] deleting IKE_SA nyc-dal[12] between 198.51.100.10[nyc]...203.0.113.20[dal]

What it means: Rekey failed; the tunnel got torn down. Users experience a short outage that looks “random.”
Decision: Investigate underlay packet loss, mismatched lifetimes/proposals, NAT-T behavior, or CPU saturation during rekey.

Task 13: Verify firewall policy counters for inter-site allows/denies

cr0x@server:~$ sudo nft list ruleset | sed -n '1,80p'
table inet filter {
  chain forward {
    type filter hook forward priority filter; policy drop;
    ct state established,related accept
    iifname "vti-nyc-dal" oifname "lan0" ip saddr 10.20.0.0/16 ip daddr 10.10.50.0/24 tcp dport { 443, 445 } accept
    counter packets 1842 bytes 221040 drop
  }
}

What it means: Default drop policy; a specific allow exists. The final counter-drop is catching something.
Decision: If drop counter increases during a reported failure, add a targeted allow (or fix the app to use expected ports), don’t open “any-any.”

Task 14: Measure real throughput and loss with iperf3 (not guesswork)

cr0x@server:~$ iperf3 -c 10.20.8.50 -t 10 -P 4
Connecting to host 10.20.8.50, port 5201
[  5] local 10.10.9.25 port 41862 connected to 10.20.8.50 port 5201
[  6] local 10.10.9.25 port 41864 connected to 10.20.8.50 port 5201
[  7] local 10.10.9.25 port 41866 connected to 10.20.8.50 port 5201
[  8] local 10.10.9.25 port 41868 connected to 10.20.8.50 port 5201
[SUM]   0.00-10.00  sec   412 MBytes   346 Mbits/sec  0             sender
[SUM]   0.00-10.00  sec   410 MBytes   344 Mbits/sec                receiver

What it means: You can move ~344 Mbps between sites under test conditions.
Decision: If business traffic is slower, look at per-flow limits, QoS, TCP windowing, or application-layer bottlenecks—not “the VPN.”

Three corporate-world mini-stories

Mini-story 1: The incident caused by a wrong assumption

The company had three offices and a tidy full mesh. The network diagram looked like a geometry homework problem, which made everyone feel safe.
They also had an assumption: “If the tunnel is up, routes are fine.” Nobody wrote that down, which is how assumptions spread—quietly, like dust.

Then a small change landed: a new VLAN in DAL for a contractor lab. The engineer added the subnet locally, updated one tunnel’s static routes,
and moved on. The lab could reach NYC, so the ticket got closed with a cheerful note. SFO couldn’t reach the lab, but no one tested that path.

A week later, a build pipeline in SFO started failing when it tried to pull artifacts from a DAL host that had moved into the new VLAN.
The failures were intermittent because some jobs still hit cached artifacts elsewhere. The team called it “flaky CI,” which is the phrase you use
when you don’t yet know what’s broken.

The fix was not exotic: stop trusting tunnel status as a proxy for connectivity, and stop using one-off static routes without an inventory.
They moved to route-based tunnels and BGP, and they implemented a post-change test that hit one representative host in every remote subnet from every site.
The result was boring. Which is the goal.

Mini-story 2: The optimization that backfired

Another org wanted “maximum performance,” so they enabled aggressive crypto settings and shortened lifetimes because someone read that rekeys are safer.
They also cranked up logging to “debug” during rollout and forgot to turn it down. Classic.

For a while everything looked fine. Then Monday hit: more users, more traffic, more rekeys. The gateways started spiking CPU during rekey events,
and because lifetimes were short, the spikes were frequent. Users didn’t see a full outage. They saw micro-outages: calls dropping, SMB pauses,
RDP freezing for five seconds and then snapping back.

The helpdesk did what helpdesks do: they correlated it with “VPN issues” and escalated. The network team saw tunnels established and moved on.
Meanwhile the log volume was high enough that the disk on one gateway started filling, and then the box got “interesting” in new ways.

They backed out the “optimization”: sane lifetimes, debug logging only during planned sessions, and hardware sizing based on peak rekey load.
Performance improved not because the crypto got faster, but because the system stopped punching itself in the face every few minutes.

Mini-story 3: The boring but correct practice that saved the day

A finance-heavy company ran three offices with a mesh, but they treated the VPN like production: config in version control, change windows,
and a simple runbook for validation. It wasn’t glamorous. It was also why they slept sometimes.

One afternoon, an ISP in one city had a partial outage: not a full drop, just enough packet loss to ruin UDP-encapsulated traffic.
The tunnel stayed “up” long enough to confuse everyone, but application symptoms were immediate—voice jitter, slow file access, strange timeouts.

Their monitoring caught it because they measured loss between gateways and tracked IPsec retransmits. The on-call followed the runbook:
check underlay loss, confirm tunnel counters, run MTU tests, and then fail traffic over to the secondary circuit. They didn’t argue with the network.
They moved the workload.

The best part: they had a post-change validation checklist that included “restore primary” steps, so the failback didn’t become a second incident.
No heroics. Just practiced boredom.

Common mistakes (symptoms → root cause → fix)

1) Symptom: “Ping works, but SMB/HTTPS stalls”

Root cause: Path MTU blackhole; TCP packets larger than tunnel MTU get dropped, ICMP fragmentation-needed blocked, or MSS not clamped.

Fix: Set tunnel MTU appropriately and clamp TCP MSS on the VPN ingress/egress. Confirm with DF pings and real app tests.

2) Symptom: “Tunnel is up, but one subnet is unreachable”

Root cause: Missing route advertisement (static route not added, BGP filter, wrong next-hop), or overlapping subnets causing wrong route selection.

Fix: Validate route tables on all three sites; if using BGP, check advertised/received routes and prefix-lists. Fix overlaps by renumbering.

3) Symptom: “Works from NYC to DAL, not from DAL to NYC”

Root cause: Asymmetric routing plus stateful firewall drop; rp_filter on Linux; policy routes sending return traffic out the wrong tunnel.

Fix: Ensure symmetric routing for stateful flows or use stateless rules where appropriate. Set rp_filter to loose mode as needed.

4) Symptom: “Random 10–30 second outages”

Root cause: Rekey flaps due to packet loss, mismatched lifetimes, CPU spikes, or DPD settings too aggressive.

Fix: Align lifetimes/proposals; tune DPD; investigate underlay loss; size CPU; avoid extreme rekey intervals.

5) Symptom: “DAL can reach SFO only when NYC tunnel is down”

Root cause: Routing preference wrong; NYC is accidentally acting as transit because of route leaks, metrics, or BGP local-pref.

Fix: Fix routing policy: prefer direct tunnel, filter transit routes, set correct local-pref/weight, and ensure next-hop-self where appropriate.

6) Symptom: “DNS works, but the returned IP is unreachable from one office”

Root cause: Split-horizon mismatch; site-aware records not aligned with routing; internal services pinned to local subnets.

Fix: Decide on a DNS model (single view vs site-aware). Make it explicit and test resolution + connectivity from each site.

7) Symptom: “Throughput is terrible only for single flows”

Root cause: Per-flow shaping, TCP window issues, or application using one stream; encryption overhead reduces effective MSS/throughput.

Fix: Use iperf3 with parallel streams to compare. Consider QoS, TCP tuning, or app-side changes; don’t just “increase bandwidth.”

8) Symptom: “Everything dies during backups”

Root cause: No QoS and uplink saturation; bufferbloat; backup traffic taking the tunnel hostage.

Fix: Rate-limit backups, schedule them, or apply QoS. Measure drops and latency under load; prioritize interactive traffic.

Checklists / step-by-step plan

Step-by-step plan: build a three-office mesh that won’t haunt you

  1. Decide the topology for a reason: latency/bandwidth/failure isolation. Write the reason down.
  2. Normalize addressing: pick non-overlapping RFC1918 ranges per site; reserve room for growth.
  3. Pick tunnel type: route-based tunnels (VTI) by default.
  4. Pick routing: static if truly small and stable; BGP if anything changes more than quarterly.
  5. Define security zones: users, servers, management, guest, voice/OT as needed.
  6. Write inter-site policy: allow-lists per zone pair; deny user-to-user LAN by default.
  7. DNS plan: single view or site-aware; document what internal names resolve to at each site.
  8. MTU/MSS baseline: set tunnel MTU, clamp MSS, and verify with DF pings.
  9. Observability: tunnel state, BGP state, loss/latency probes, interface drops, and log sampling.
  10. Config management: export configs; store in version control; use consistent naming conventions.
  11. Testing: for every change, test from every site to at least one host in every remote subnet.
  12. Failure drills: simulate a tunnel down, a circuit degraded, and a DNS mis-route. Practice failover and failback.
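
Step 11 is the one that catches the mistakes the other steps let through, and it’s scriptable. A hedged sketch to run from each site after a change; the targets are examples (substitute one representative host and port per remote subnet), and the TCP check assumes the OpenBSD netcat.

#!/bin/sh
# post-change smoke test: ICMP plus a real TCP port per remote target (example targets)
targets="10.20.8.50:445 10.30.5.10:443"
for t in $targets; do
  host=${t%:*}
  port=${t#*:}
  if ping -c 3 -W 2 -q "$host" >/dev/null && nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
    echo "OK   $host port $port"
  else
    echo "FAIL $host port $port"
  fi
done

Run it from every site, not just the one you changed; in a mesh, the pair nobody tested is the pair that breaks.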

Pre-change checklist (do this before touching production)

  • Current configs exported and saved (with timestamps and change ID).
  • Rollback procedure written and feasible without internet heroics.
  • List of critical flows between sites (ports + subnets + owners).
  • Maintenance window communicated (including “what will break”).
  • Monitoring dashboards open: loss/latency, tunnel status, BGP, interface errors.

Post-change verification checklist (the part everyone skips)

  • Tunnels established and stable (no flapping in logs).
  • Routes present on all sites (expected prefix counts).
  • MTU test passes at expected size (DF ping).
  • At least one real application transaction per site pair (not just ping).
  • Firewall deny counters not exploding.
  • Document updated: prefixes, peers, policies, and any exceptions.

FAQ

1) For three offices, is full mesh always better than hub-and-spoke?

No. If most traffic goes to a “core” site (datacenter/HQ), hub-and-spoke is simpler and often more reliable.
Full mesh is better when site-to-site traffic is heavy or when you need pairwise failure isolation.

2) Should I use static routes or BGP for a three-site mesh?

Static routes are acceptable if the network is tiny and stable. BGP is worth it if you have multiple subnets per site, expect growth,
or want clean failover and filtering. If you’re already maintaining more than a handful of static routes, you’ve outgrown them.

3) What’s the fastest way to avoid route leaks in a mesh?

Use prefix-lists (or equivalent) and only advertise local prefixes to each neighbor. Assume every neighbor will happily accept nonsense unless you stop it.
Also decide whether any site is allowed to be transit. Usually the answer is “no.”

4) Why does my tunnel show “up” but users still complain?

Because “up” usually means the control plane (the IKE SA) is alive. The data plane can still be broken by MTU blackholes, packet loss, routing loops,
or firewall drops. Measure loss/latency, check counters, and validate actual application flows.

5) Do I need to clamp MSS?

If your effective MTU across the VPN is below 1500 (common with IPsec/NAT-T), clamping MSS is often the easiest way to prevent hard-to-debug stalls.
Verify with DF ping tests and observe if “large responses” fail.

6) Can WireGuard do full mesh between offices safely?

Yes, if you manage keys properly and you’re disciplined with AllowedIPs and firewall policy. WireGuard makes tunnels easy; it does not make routing policy automatic.
For three sites, it’s excellent on Linux-based gateways with consistent configs and monitoring.

7) How do I keep a mesh from becoming a security nightmare?

Default-deny inter-site forwarding, segment by zone, and explicitly allow only required flows.
Log denies with sane rate limits. And keep management access off the general inter-site fabric.

8) What do I monitor to catch problems before users do?

Loss and latency between gateways, tunnel rekey stability, interface drops/errors, BGP session state and prefix counts,
and a couple of synthetic application checks (DNS resolution + TCP connect + small data transfer).

9) What about high availability for three sites?

HA helps, but it’s not magic. Start with dual internet circuits if possible, then redundant gateways with a tested failover method.
Make sure failover doesn’t change source IPs in a way that breaks peer expectations, and rehearse failback.

10) How do I know if SD-WAN is worth it for only three offices?

It’s worth it when you need centralized policy, better observability, multiple circuits per site, and application-aware steering.
If your environment is stable and you can standardize on one gateway stack, classic VPN with BGP is often enough.

Conclusion: practical next steps

For three offices, a VPN full mesh is either a clean, low-latency fabric—or a triangle-shaped blame game. The difference is not the topology.
It’s whether you treat routing, DNS, MTU, and policy as first-class citizens, and whether you operate the thing with discipline.

Next steps you can do this week:

  • Write down your actual reason for mesh (latency, bandwidth, isolation). If you can’t, reconsider hub-and-spoke.
  • Inventory subnets and kill overlaps before they multiply.
  • Run the MTU DF ping tests and clamp MSS where needed.
  • If you’re using static routes, count them. If the count annoys you, move to BGP.
  • Add monitoring for loss/latency between gateways and for rekey flaps. “Tunnel up” is not observability.
  • Implement default-deny inter-site policy and add explicit allow-lists for real business flows.

Build the mesh like you’ll have to debug it at 2 a.m. Because you will. The goal is that it’s boring when you do.
