You don’t pick a VPN protocol because it’s trendy. You pick it because the helpdesk queue is on fire, roaming users keep dropping tunnels, and the compliance team wants proof that your access paths are controlled, not “some config in a Git repo.”
WireGuard is elegant. OpenVPN is familiar. But there are very specific production environments where IKEv2/IPsec is the grown-up choice: boring in the best way, well-integrated with operating systems, and supported by security appliances that already sit in your racks collecting dust and maintenance renewals.
The decision frame: what you’re really choosing
“VPN” is not one decision. It’s a bundle of them:
- Client surface area: Do you want a first-party OS client with MDM policy knobs, or a third-party app you must deploy, update, and debug?
- Identity model: Do you want certificates, EAP (RADIUS/AD-backed), device identity, per-user policy, and clean revocation? Or a shared key and a prayer?
- Network hostility tolerance: Can you survive NAT, captive portals, carrier-grade NAT, and Wi‑Fi roaming without flapping tunnels?
- Operational visibility: Can you capture, interpret, and audit what happened at 3 a.m.?
- Vendor ecosystem: Does your stack include firewalls, load balancers, and identity gear that already “speaks IPsec” fluently?
WireGuard wins for simplicity and performance. OpenVPN wins for “it runs everywhere and we already know it.” IKEv2/IPsec wins when you care about enterprise integration, roaming stability, policy, and compliance-grade operational predictability.
Here’s the blunt heuristic I use in production:
- If you’re building a developer-friendly overlay between known endpoints and can manage keys cleanly, start with WireGuard.
- If you need maximum compatibility with weird legacy clients and third-party environments, OpenVPN still earns its keep.
- If you need native clients, MDM-driven configuration, strong identity integration, and tunnels that survive roaming, pick IKEv2/IPsec and don’t apologize for it.
Facts and history that actually matter in production
Some context points that influence why IKEv2/IPsec behaves the way it does today—and why it’s often the “enterprise default.”
- IKEv2 was designed to replace IKEv1’s complexity and fragility (fewer message exchanges, clearer state machine). You feel this when debugging: less “phase 1/phase 2 folklore,” more structured negotiation.
- MOBIKE (Mobility and Multihoming) is a first-class feature in IKEv2. It’s built for clients moving between networks without tearing down the tunnel.
- NAT traversal (NAT-T) standardized UDP encapsulation for IPsec, making it workable through most consumer NATs. It’s not perfect, but it’s not a hack taped onto the side.
- Many OS vendors shipped IKEv2 as the “native VPN” path while treating OpenVPN/WireGuard as third-party. That changes everything in fleet management: policy distribution, certificate stores, and security posture enforcement.
- IPsec is entrenched in network appliances (firewalls, routers, SD‑WAN boxes). Even when the UI is painful, the interoperability story is mature.
- Cryptographic agility is a real operational issue: IPsec suites have long supported multiple algorithms and negotiation, which is useful when compliance changes faster than your endpoint rollout.
- IKEv2 supports EAP methods (like EAP‑TLS, EAP‑MSCHAPv2) for user authentication; this is why it’s a staple in RADIUS-backed enterprises.
- Perfect Forward Secrecy (PFS) is normal behavior in modern IPsec profiles, and auditors tend to like that it’s explicit and standard rather than “trust us.”
One quote I keep taped to the mental dashboard whenever someone proposes a “clever” VPN setup:
Paraphrased idea — Werner Vogels: you should design for failure because everything fails eventually, and the system must keep working anyway.
Where IKEv2/IPsec beats WireGuard and OpenVPN
1) Native clients and MDM policy control (the underrated superpower)
In the real world, “VPN deployment” means: laptops managed by MDM, smartphones on conditional access, and a security team that wants posture checks and certificate-based identity. Native IKEv2 clients are built into Windows, macOS, iOS, and many Android builds. That matters because:
- You can push profiles via MDM without shipping a third-party client.
- You can leverage OS certificate stores and keychain/TPM-backed keys.
- You can use platform features like Always On VPN (Windows) with IKEv2.
- You reduce client update churn and “VPN app broke after OS update” incidents.
WireGuard is small and clean, but on managed endpoints you’re still deploying software (or relying on built-in support that isn’t universal). OpenVPN almost always means deploying a client and a configuration bundle, then dealing with “which version is this user running?” for the next five years.
2) Roaming stability with MOBIKE
If you support mobile users, your VPN must survive: office Wi‑Fi to LTE, hotel Wi‑Fi to tethering, IP address changes, NAT rebinding, and networks that silently drop idle UDP mappings. IKEv2 with MOBIKE was built for exactly this. It can update endpoints and keep the SA alive without renegotiating everything from scratch.
WireGuard also handles roaming nicely (it’s basically designed around it), but IKEv2 gives you that behavior inside enterprise policy frameworks and vendor appliances that security teams already trust.
3) Enterprise authentication models: EAP + RADIUS + certificates
WireGuard’s identity model is public keys. That’s a feature, not a bug—until you need user identity, group membership, MFA hooks, and clean deprovisioning tied to HR offboarding.
IKEv2 shines when you need:
- EAP-TLS (strongest: certs on both sides, no passwords flying around)
- RADIUS-backed auth with centralized policy and accounting
- Per-user/per-device access control via attributes, groups, and cert profiles
OpenVPN can integrate with these too, but you’re often stitching it together with plugins, scripts, and “please don’t upgrade OpenVPN until we retest the auth stack.” IPsec stacks tend to have these integrations as first-class citizens.
4) Compliance and audit posture
Some organizations need to answer questions like: “Which cipher suites are allowed?”, “Are we enforcing PFS?”, “Can we prove device identity?”, “Can we log connection start/stop and user identity?”
IKEv2/IPsec tends to fit neatly into audit narratives because the policy vocabulary is standardized, the implementations are mature, and the appliance/OS ecosystems expose logging and configuration controls in ways auditors recognize.
5) Interop with site-to-site and existing network gear
If you already run site-to-site IPsec between offices, adding remote access IKEv2 often reuses:
- the same PKI,
- the same firewall vendors,
- the same operational runbooks,
- the same crypto policy baseline.
WireGuard can do site-to-site beautifully, but if your perimeter is anchored on appliances and change control, IPsec is often the path of least organizational resistance. That’s not “politics.” That’s survival.
6) Traffic engineering and QoS in networks that care
In larger networks, you will eventually want to mark traffic, shape it, and observe it without guesswork. IPsec implementations commonly integrate with DSCP handling, policy-based routing, and enterprise QoS tooling. WireGuard is straightforward but doesn’t come with the same “network team already has knobs and dashboards” ecosystem.
Short joke #1: IPsec is like a Swiss Army knife—useful, respected, and somehow always missing the one tool you need at 2 a.m.
When not to use IKEv2/IPsec
IKEv2/IPsec is not “always better.” It’s better when your constraints match its strengths.
Don’t pick IKEv2/IPsec when you need extreme simplicity
If the team is small and you want a VPN you can explain on a whiteboard in five minutes, WireGuard is your friend. IPsec has more moving parts: proposals, transforms, SAs, lifetimes, EAP, certificates, NAT-T. That’s manageable, but it’s not minimal.
Don’t pick it when you’re behind hostile networks that block UDP
Many IPsec deployments run over UDP 500/4500. Some networks block or mangle those. OpenVPN over TCP/443 can sometimes punch through where IPsec can’t. This is not a performance recommendation. It’s a reality recommendation.
Don’t pick it when your platform support is uneven
Native clients are great—until you have to support platforms with weird IKEv2 implementations or limited EAP methods. If you’re supporting niche embedded devices, OpenVPN or WireGuard may be easier to standardize.
Don’t pick it if you can’t commit to PKI hygiene
IKEv2 with EAP-TLS and certificates is excellent. It also requires you to run PKI like you mean it: issuance workflows, revocation, expiry monitoring, rotation plans. If your current certificate practice is “we’ll calendar a reminder for next year,” you’re going to have a bad time.
Architecture choices: road warrior, site-to-site, and “always-on”
Remote access (road warrior) IKEv2
This is the “users connect from anywhere” model. Typical choices:
- Route-based VPN (preferred): assign a virtual IP to client, route networks via the tunnel.
- Split tunnel vs full tunnel: split reduces bandwidth and blast radius; full tunnel simplifies security controls but increases reliance on VPN availability.
With IKEv2, route-based setups usually feel cleaner operationally than policy-based. You’ll debug fewer “why does only this subnet break?” mysteries.
Site-to-site IKEv2
This is classic IPsec territory. If your organization already has site-to-site tunnels, standardizing on IKEv2 brings consistency and sometimes better failover behavior.
In mixed-vendor environments, spend time on proposal compatibility. Most “IPsec is broken” tickets are actually “proposal mismatch” with a thin layer of denial on top.
Always-on / device tunnel use cases
IKEv2 is often the protocol behind “always-on” enterprise VPN strategies because it integrates with OS boot-time networking and device identity. This matters for:
- domain-joined devices that must reach controllers before user login,
- management tooling that needs connectivity even when no user is logged in,
- conditional access policies based on device compliance.
Authentication and identity: EAP, certs, PSKs, and the foot-guns
PSK: fast, familiar, and usually the wrong default
Pre-shared keys are tempting because they’re easy. They’re also a scalability and security liability in remote access:
- Revocation is messy: you rotate the PSK and break everyone at once.
- Distribution becomes a secret-sharing problem.
- Auditing “who used the key” is weak unless layered with other identity checks.
PSKs are fine for tightly controlled site-to-site tunnels with strong operational discipline. For users, use certificates and/or EAP.
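EAP-MSCHAPv2: works, but treat it like a legacy bridge
EAP-MSCHAPv2 is common in corporate environments because it integrates with existing password-based identity systems. From a security standpoint it is a weak method: the challenge-response is derived from the user’s password hash, so a captured exchange can be attacked offline. That makes strict validation of the gateway certificate on the client non-negotiable. If you can do EAP-TLS, do it.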
EAP-TLS: the “yes, we meant security” option
EAP-TLS gives you strong mutual authentication with certificates. The operational win is clean deprovisioning: revoke the cert and, as long as the gateway actually checks revocation (CRL or OCSP), the device/user is out. The operational risk is certificate lifecycle management. You must monitor expirations and automate renewal where possible.
Crypto proposals: standardize or suffer
Pick a small set of approved algorithms and lifetimes. Publish them. Enforce them. If every environment has a different suite, you will eventually debug “works on my laptop” at the cryptographic level, which is the nerdiest kind of misery.
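One way to make “enforce them” concrete is a drift check: compare the suite each gateway actually negotiated against the published baseline. The following is a minimal sketch. The baseline entries and the function names (suite_is_approved, audit_suites) are illustrative assumptions, not a recommended policy, and the suite strings simply follow the slash-separated form strongSwan prints.

```shell
# Hypothetical baseline: one approved IKE suite per line (illustrative only).
APPROVED_IKE_SUITES='AES_GCM_16_256/PRF_HMAC_SHA2_256/ECP_256
AES_GCM_16_256/PRF_HMAC_SHA2_384/ECP_384'

# suite_is_approved SUITE -> exit 0 only if SUITE equals a baseline line exactly.
suite_is_approved() {
  printf '%s\n' "$APPROVED_IKE_SUITES" | grep -Fxq -- "$1"
}

# audit_suites: read "name suite" pairs on stdin and print the ones
# that are outside the baseline, prefixed with DRIFT.
audit_suites() {
  local name suite
  while read -r name suite; do
    [ -n "$name" ] || continue
    suite_is_approved "$suite" || printf 'DRIFT %s %s\n' "$name" "$suite"
  done
}
```

Feed audit_suites from whatever inventory you already have (one line per gateway or profile). An empty result means nothing drifted; anything else is a ticket, not a debate.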
Short joke #2: The fastest way to learn IPsec is to misconfigure one proposal; the second-fastest is to do it in production.
Operations reality: monitoring, debugging, and change management
IKEv2/IPsec has a reputation for being “hard to troubleshoot.” That’s partially deserved. It’s also because many teams try to run it without observability. Don’t.
What to monitor
- Tunnel up/down counts and churn rate per client ASN / network type (Wi‑Fi vs cellular patterns reveal MTU and NAT issues fast).
- Authentication failures split by reason (cert invalid, EAP failure, proposal mismatch).
- Packet drops on UDP 500/4500 and ESP, especially at firewalls/NAT devices.
- Latency and throughput baselines to a known internal endpoint.
Change management: IPsec punishes casual edits
WireGuard is forgiving: change a peer, reload, done. IPsec changes can ripple:
- Proposal changes require both sides to agree.
- Certificate chain changes can break clients that cached an older CA.
- NAT-T behavior changes can expose path filtering issues.
Use staged rollouts. Keep known-good profiles. Log everything. Treat VPN like a production service, not a networking side quest.
Practical tasks with commands: what to run, what it means, what to decide
Below are concrete tasks you can run on a Linux-based IKEv2/IPsec gateway (strongSwan assumed) and surrounding infrastructure. Each task includes: command, example output, how to read it, and the decision you make.
Task 1: Confirm the IKE daemon is healthy
cr0x@server:~$ systemctl status strongswan-starter
● strongswan-starter.service - strongSwan IPsec IKEv1/IKEv2 daemon using ipsec.conf
Loaded: loaded (/lib/systemd/system/strongswan-starter.service; enabled; vendor preset: enabled)
Active: active (running) since Sat 2025-12-27 08:11:02 UTC; 2h 14min ago
Main PID: 1187 (starter)
Tasks: 18 (limit: 9381)
Memory: 21.4M
CGroup: /system.slice/strongswan-starter.service
├─1187 /usr/lib/ipsec/starter
└─1214 /usr/lib/ipsec/charon
Meaning: If it’s not active (running), stop debugging the network and start debugging the service.
Decision: If inactive/crashing, collect logs (Task 2) and revert the last config change before touching firewall rules.
Task 2: Inspect recent IKE negotiation errors
cr0x@server:~$ journalctl -u strongswan-starter -n 80 --no-pager
Dec 27 10:02:41 vpn-gw charon[1214]: 12[IKE] received AUTHENTICATION_FAILED notify error
Dec 27 10:02:41 vpn-gw charon[1214]: 12[IKE] EAP method EAP_MSCHAPV2 failed
Dec 27 10:02:41 vpn-gw charon[1214]: 12[IKE] IKE_SA vpn-ra[12] state change: AUTHENTICATING => DESTROYING
Dec 27 10:03:18 vpn-gw charon[1214]: 15[IKE] no proposal chosen
Dec 27 10:03:18 vpn-gw charon[1214]: 15[IKE] peer supports: IKE:AES_GCM_16_256/PRF_HMAC_SHA2_256/MODP_2048
Dec 27 10:03:18 vpn-gw charon[1214]: 15[IKE] configured: IKE:AES_CBC_256/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_1024
Meaning: Two separate failure modes: an EAP auth failure and a crypto proposal mismatch (no proposal chosen).
Decision: Fix identity issues (RADIUS/EAP) separately from crypto policy. Don’t “open up all ciphers” as a panic move; align proposals explicitly.
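To keep the two failure modes separate at a glance, you can bucket log lines by reason before reading them one by one. This is a rough sketch: the function name classify_ike_failures is hypothetical, and the three patterns are taken only from the sample log above, so a real deployment will need more of them.

```shell
# classify_ike_failures: read charon log lines on stdin and print
# "reason count" for each failure class that appeared, sorted by reason.
classify_ike_failures() {
  awk '
    /AUTHENTICATION_FAILED/                 { n["auth_failed_notify"]++ }
    /EAP method .* failed/                  { n["eap_failure"]++ }
    /no proposal chosen|NO_PROPOSAL_CHOSEN/ { n["proposal_mismatch"]++ }
    END {
      for (k in n) printf "%s %d\n", k, n[k]
    }
  ' | sort
}

# Typical use (not executed here):
#   journalctl -u strongswan-starter --since "1 hour ago" --no-pager | classify_ike_failures
```

A spike in proposal_mismatch after a policy change points at crypto config; a spike in eap_failure points at the identity backend. Different owners, different fixes.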
Task 3: List established SAs and traffic counters
cr0x@server:~$ sudo swanctl --list-sas
vpn-ra: #14, ESTABLISHED, IKEv2, 8f3a1b9c1b2e3a2d_i* 21a9de88c0d3b1aa_r
local 'vpn.example' @ 203.0.113.10[4500]
remote 'user@corp' @ 198.51.100.77[51234]
AES_GCM_16_256/PRF_HMAC_SHA2_256/ECP_256
established 612s ago, rekeying in 41m
vpn-ra{27}: INSTALLED, TUNNEL, reqid 7, ESP in UDP SPIs: c1b2a3d4_i c9d8e7f6_o
vpn-ra{27}: 10.10.50.21/32 === 10.0.0.0/16
vpn-ra{27}: in 1483920 bytes, 1240 packets
vpn-ra{27}: out 932481 bytes, 1102 packets
Meaning: The tunnel is up, and packets are flowing both directions. If users report “connected but nothing works,” you compare these counters with application-level symptoms.
Decision: If SAs are established but counters don’t increase, suspect routing/firewall/NAT on the gateway or client-side split tunnel settings.
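The “counters don’t increase” check is easier to trust as a number than as an eyeball comparison. The sketch below sums the byte counters from swanctl --list-sas output and compares two samples. The function names are hypothetical, and the parser only assumes that each counter line carries a direction word (in/out) followed later by a number and the word bytes, as in the sample above; adjust it if your strongSwan version prints something different.

```shell
# sa_byte_totals: read `swanctl --list-sas` output on stdin,
# print "<total in bytes> <total out bytes>" across all CHILD_SAs.
sa_byte_totals() {
  awk '
    {
      dir = ""
      for (i = 1; i <= NF; i++) {
        if ($i == "in" || $i == "out") dir = $i
        # The number right before the "bytes" token is the counter.
        if (dir != "" && i > 1 && $i ~ /^bytes,?$/) { total[dir] += $(i - 1); break }
      }
    }
    END { printf "%.0f %.0f\n", total["in"], total["out"] }
  '
}

# counters_state IN1 OUT1 IN2 OUT2 -> flowing | inbound-only | outbound-only | stalled
counters_state() {
  local din dout
  din=$(( $3 - $1 ))
  dout=$(( $4 - $2 ))
  if [ "$din" -gt 0 ] && [ "$dout" -gt 0 ]; then echo flowing
  elif [ "$din" -gt 0 ]; then echo inbound-only
  elif [ "$dout" -gt 0 ]; then echo outbound-only
  else echo stalled
  fi
}

# Typical use (not executed here; $before/$after are left unquoted on purpose):
#   before=$(sudo swanctl --list-sas | sa_byte_totals); sleep 10
#   after=$(sudo swanctl --list-sas | sa_byte_totals)
#   counters_state $before $after
```

inbound-only is the classic signature of missing return routes: the client’s packets arrive, nothing comes back through the tunnel.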
Task 4: Verify UDP 500/4500 is listening
cr0x@server:~$ sudo ss -lunp | grep -E ':(500|4500)\s'
UNCONN 0 0 0.0.0.0:500 0.0.0.0:* users:(("charon",pid=1214,fd=12))
UNCONN 0 0 0.0.0.0:4500 0.0.0.0:* users:(("charon",pid=1214,fd=13))
Meaning: If you don’t see these sockets, your daemon isn’t bound or is blocked by local policy.
Decision: No listener → fix daemon config/bind address before changing upstream firewalls.
Task 5: Check firewall rules for IPsec ports
cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
chain input {
type filter hook input priority 0; policy drop;
ct state established,related accept
iif "lo" accept
udp dport { 500, 4500 } accept
ip protocol icmp accept
tcp dport 22 accept
counter drop
}
}
Meaning: UDP 500/4500 is allowed. If policy is drop and these are missing, your VPN will look “dead” from the outside.
Decision: If missing, add explicit rules. Don’t temporarily set policy to accept; you’ll forget to undo it.
Task 6: Confirm kernel IP forwarding and rp_filter
cr0x@server:~$ sysctl net.ipv4.ip_forward net.ipv4.conf.all.rp_filter
net.ipv4.ip_forward = 1
net.ipv4.conf.all.rp_filter = 1
Meaning: Forwarding is on (good). But strict reverse-path filtering (rp_filter=1) can drop tunneled traffic in asymmetric routing setups.
Decision: If you have multiple uplinks, policy routing, or NAT-T complexity, consider rp_filter=2 (loose) on relevant interfaces and validate with packet capture.
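One detail that trips people up when changing rp_filter per interface: the kernel applies the larger of the “all” value and the per-interface value. The sketch below encodes that rule so you can see what a change will actually do. The function names are hypothetical; the sysctl keys are the standard ones shown above.

```shell
# effective_rp_filter ALL_VALUE IFACE_VALUE -> the value the kernel enforces
# on that interface (the maximum of the two settings).
effective_rp_filter() {
  if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# show_rp_filter IFACE: read both sysctls and print the effective mode.
show_rp_filter() {
  local all iface eff
  all=$(sysctl -n net.ipv4.conf.all.rp_filter) || return 1
  iface=$(sysctl -n "net.ipv4.conf.$1.rp_filter") || return 1
  eff=$(effective_rp_filter "$all" "$iface")
  case "$eff" in
    0) echo "$1: off" ;;
    1) echo "$1: strict" ;;
    2) echo "$1: loose" ;;
  esac
}
```

The practical consequence: with all=1, setting an interface to 0 changes nothing (the effective value stays strict), while setting it to 2 does give you loose mode on that interface.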
Task 7: Check routes for VPN client pools
cr0x@server:~$ ip route show
default via 203.0.113.1 dev eth0
10.0.0.0/16 via 10.0.1.1 dev eth1
10.10.50.0/24 dev vti0 proto kernel scope link src 10.10.50.1
Meaning: Client pool is on vti0. If internal networks don’t know how to return traffic to 10.10.50.0/24, clients will connect but won’t receive replies.
Decision: Ensure return routes exist in your internal routers, or perform NAT (with awareness of the audit/security implications).
Task 8: Validate NAT rules when doing split/full tunnel
cr0x@server:~$ sudo iptables -t nat -S
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-A POSTROUTING -s 10.10.50.0/24 -o eth1 -j MASQUERADE
Meaning: VPN client traffic heading to internal network interface eth1 is NATed. That can “fix” routing but hides client identity from internal logs.
Decision: Use NAT only if you cannot add return routes. If you care about per-user audit trails internally, route properly and avoid NAT.
Task 9: Confirm ESP-in-UDP (NAT-T) packets are arriving
cr0x@server:~$ sudo tcpdump -ni eth0 udp port 4500 -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
10:21:11.205931 IP 198.51.100.77.51234 > 203.0.113.10.4500: UDP-encap: ESP(spi=0xc1b2a3d4,seq=0x0000048a), length 148
10:21:11.306102 IP 203.0.113.10.4500 > 198.51.100.77.51234: UDP-encap: ESP(spi=0xc9d8e7f6,seq=0x000003f2), length 132
10:21:11.406245 IP 198.51.100.77.51234 > 203.0.113.10.4500: UDP-encap: ESP(spi=0xc1b2a3d4,seq=0x0000048b), length 148
10:21:11.506311 IP 203.0.113.10.4500 > 198.51.100.77.51234: UDP-encap: ESP(spi=0xc9d8e7f6,seq=0x000003f3), length 132
10:21:11.606478 IP 198.51.100.77.51234 > 203.0.113.10.4500: UDP-encap: ESP(spi=0xc1b2a3d4,seq=0x0000048c), length 148
Meaning: Bidirectional ESP-in-UDP is flowing. If you see inbound but not outbound, suspect egress firewalling or routing on the gateway.
Decision: Use this to separate “client can reach gateway” from “gateway can respond.” It avoids hours of guessing.
Task 10: Detect MTU/fragmentation pain early
cr0x@server:~$ ping -M do -s 1372 10.0.0.10 -c 3
PING 10.0.0.10 (10.0.0.10) 1372(1400) bytes of data.
1380 bytes from 10.0.0.10: icmp_seq=1 ttl=63 time=18.1 ms
1380 bytes from 10.0.0.10: icmp_seq=2 ttl=63 time=18.5 ms
1380 bytes from 10.0.0.10: icmp_seq=3 ttl=63 time=17.9 ms
--- 10.0.0.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
Meaning: This tests a payload size that approximates common MTU constraints once IPsec overhead is applied. If this fails while smaller pings work, you have an MTU/PMTUD problem.
Decision: If it fails, clamp MSS (Task 11) or adjust interface MTU on tunnel endpoints. Don’t blame DNS yet.
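The 1372 in that test is just arithmetic: a 1400-byte IPv4 packet minus 28 bytes of IPv4 and ICMP headers. If you would rather find the real limit than guess one, a binary search over DF pings gets there in a handful of probes. The following is a sketch under stated assumptions: the function names are hypothetical, the probe relies on iputils ping flags (-M do, -W), and the search assumes the low bound already passes.

```shell
# ping_payload_for_mtu MTU -> ICMP payload that produces an IPv4 packet of MTU bytes
# (20-byte IPv4 header + 8-byte ICMP header = 28 bytes of overhead).
ping_payload_for_mtu() { echo $(( $1 - 28 )); }

# df_probe HOST SIZE: send one DF-bit ping; exit 0 if it got through unfragmented.
df_probe() { ping -M do -c 1 -W 2 -s "$2" "$1" >/dev/null 2>&1; }

# find_max_payload HOST LOW HIGH [PROBE]: largest payload in [LOW, HIGH] that passes.
# PROBE defaults to df_probe; it is swappable so the search can be tested offline.
find_max_payload() {
  local host lo hi probe mid
  host=$1; lo=$2; hi=$3; probe=${4:-df_probe}
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi + 1) / 2 ))
    if "$probe" "$host" "$mid"; then lo=$mid; else hi=$(( mid - 1 )); fi
  done
  echo "$lo"
}

# Typical use (not executed here): find_max_payload 10.0.0.10 1000 1472
```

Add 28 to the result to get the usable path MTU through the tunnel, and compare it across Wi‑Fi and cellular clients before picking a tunnel MTU.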
Task 11: Clamp TCP MSS to avoid black-holed PMTUD
cr0x@server:~$ sudo iptables -t mangle -A FORWARD -o vti0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
cr0x@server:~$ sudo iptables -t mangle -S FORWARD | tail -n 1
-A FORWARD -o vti0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
Meaning: MSS clamping often mitigates broken ICMP filtering and PMTUD failures that manifest as “websites hang, but ping works.”
Decision: If user apps stall on large transfers, clamp MSS and re-test. Then schedule a proper fix (allow ICMP fragmentation-needed, set correct MTU).
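Clamping to PMTU only helps if the gateway’s idea of the path MTU is right. When it isn’t, pinning the MSS to a fixed value is a common alternative. A small sketch of the arithmetic follows; the helper names are hypothetical, the rule is printed rather than applied, and the 40-byte figure assumes IPv4 with no IP or TCP options.

```shell
# mss_for_mtu MTU -> TCP MSS for IPv4 (20-byte IP header + 20-byte TCP header).
mss_for_mtu() { echo $(( $1 - 40 )); }

# print_mss_rule IFACE MTU: print an iptables rule that pins the MSS on
# forwarded SYN packets leaving IFACE. Review it, then run it yourself.
print_mss_rule() {
  printf 'iptables -t mangle -A FORWARD -o %s -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss %s\n' \
    "$1" "$(mss_for_mtu "$2")"
}
```

For example, print_mss_rule vti0 1400 prints a rule with --set-mss 1360. Like the clamp, this is mitigation; the real fix is still a correct MTU and working PMTUD.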
Task 12: Validate XFRM state and policy in the kernel
cr0x@server:~$ sudo ip xfrm state
src 203.0.113.10 dst 198.51.100.77
proto esp spi 0xc9d8e7f6 reqid 7 mode tunnel
replay-window 32 flag af-unspec
auth-trunc hmac(sha256) 0x3b... 128
enc cbc(aes) 0x9f...
src 198.51.100.77 dst 203.0.113.10
proto esp spi 0xc1b2a3d4 reqid 7 mode tunnel
replay-window 32 flag af-unspec
auth-trunc hmac(sha256) 0xa1... 128
enc cbc(aes) 0x11...
Meaning: The kernel has active ESP SAs. If charon claims “established” but XFRM state is absent, you have a control-plane/data-plane mismatch.
Decision: If missing, suspect kernel modules, policy install failures, or conflicting IPsec tooling (e.g., multiple daemons).
Task 13: Check for proposal mismatch from the client side (captured on gateway)
cr0x@server:~$ sudo tcpdump -ni eth0 udp port 500 -vv -c 3
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
IP 198.51.100.77.500 > 203.0.113.10.500: isakmp 2.0 msgid 00000000: phase 1 IKE_SA_INIT (sa payload)
IP 203.0.113.10.500 > 198.51.100.77.500: isakmp 2.0 msgid 00000000: phase 1 IKE_SA_INIT (sa payload) notify(NO_PROPOSAL_CHOSEN)
IP 198.51.100.77.500 > 203.0.113.10.500: isakmp 2.0 msgid 00000000: phase 1 IKE_SA_INIT (sa payload)
Meaning: The gateway rejected the proposal (NO_PROPOSAL_CHOSEN). That’s not “network.” That’s configuration mismatch.
Decision: Align IKE/ESP transforms. Avoid broad “enable everything” responses; define a compatible baseline and enforce it.
Task 14: Confirm DNS behavior for VPN clients (a frequent “VPN is broken” impostor)
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Link 5 (vti0)
Current Scopes: DNS
Protocols: +DefaultRoute
Current DNS Server: 10.0.0.53
DNS Servers: 10.0.0.53 10.0.0.54
DNS Domain: corp.example
Meaning: The tunnel interface has DNS servers and a search domain. If clients can reach IPs but not names, DNS assignment or split-DNS policy is wrong.
Decision: Fix DNS distribution first. Users don’t care that the tunnel is “up” if their browser can’t resolve internal names.
Task 15: Confirm RADIUS reachability and response codes (when using EAP)
cr0x@server:~$ sudo radtest alice 'CorrectHorseBatteryStaple' 10.0.0.20 0 testing123
Sent Access-Request Id 164 from 0.0.0.0:51682 to 10.0.0.20:1812 length 76
User-Name = "alice"
User-Password = "CorrectHorseBatteryStaple"
NAS-IP-Address = 10.0.0.1
NAS-Port = 0
Message-Authenticator = 0x00
Received Access-Accept Id 164 from 10.0.0.20:1812 to 10.0.0.1:51682 length 44
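Meaning: RADIUS is reachable and accepts the credentials. Note that radtest sends a plain PAP request by default, so this proves reachability, shared secret, and password, not the EAP exchange itself. If VPN auth fails but this succeeds, the EAP method/cert trust path may be the problem.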
Decision: Separate identity backend health from VPN EAP negotiation. If RADIUS is fine, focus on EAP config, cert chain, and client settings.
Fast diagnosis playbook
When someone says “VPN is slow” or “VPN is down,” you don’t start with philosophical debates about protocols. You start with a tight loop that finds the bottleneck fast.
First: is it control-plane or data-plane?
- Control-plane check: Are IKE SAs established? Use swanctl --list-sas. If there are no SAs, you’re in auth/proposal/firewall territory.
- Data-plane check: If SAs exist, do counters increase? If not, you’re in routing/NAT/firewall/XFRM territory.
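That first split is mechanical enough to script, which helps when the person on call at 3 a.m. is not the person who built the gateway. The following is a minimal sketch; the function names and the wording of the verdicts are hypothetical.

```shell
# count_established: count ESTABLISHED IKE SAs in `swanctl --list-sas` output on stdin.
# `|| true` keeps the exit status clean when the count is zero.
count_established() { grep -c 'ESTABLISHED' || true; }

# triage SA_COUNT COUNTERS_MOVING(yes|no) -> which layer to look at first.
triage() {
  if [ "$1" -eq 0 ]; then
    echo "control-plane: check auth, proposals, and UDP 500/4500 reachability"
  elif [ "$2" = "no" ]; then
    echo "data-plane: check routing, NAT, forward firewall rules, and XFRM state"
  else
    echo "tunnel healthy: move on to MTU, DNS, and asymmetric routing"
  fi
}

# Typical use (not executed here):
#   triage "$(sudo swanctl --list-sas | count_established)" no
```

It does not diagnose anything by itself. It just stops the team from debugging routing while authentication is what is failing.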
Second: prove packets traverse the perimeter
- Run ss -lunp to confirm UDP 500/4500 listeners.
- Run tcpdump on the WAN interface for UDP 500/4500.
- If inbound appears but outbound does not, check local firewall and routing; if neither appears, check upstream firewall/NAT.
Third: hunt the silent killers (MTU, DNS, asymmetry)
- MTU: symptoms include “SSH works but websites hang,” “large downloads stall.” Use DF ping tests and apply MSS clamp as mitigation.
- DNS: “connected but nothing works” is often “connected but can’t resolve.” Confirm DNS servers and split-DNS behavior.
- Asymmetric routing: tunnel traffic leaves one interface and returns another, then gets dropped by rp_filter or upstream firewalls.
Fourth: validate performance is not a CPU/crypto issue
- Check CPU saturation on the gateway under load.
- Verify AES-NI / crypto acceleration is enabled if applicable.
- Confirm you’re not doing unnecessary encapsulation layers (double NAT, double tunnel).
Common mistakes (symptom → root cause → fix)
1) “Connected” but no access to internal subnets
Symptom: Client shows connected; internal services time out; gateway shows SAs established.
Root cause: Missing return routes to the VPN client pool, or firewall rules not permitting forwarded traffic.
Fix: Add routes on internal routers for the client pool via the VPN gateway; validate with ip route and counters in swanctl --list-sas. Only use NAT as a last resort.
2) Frequent reconnects when users roam (Wi‑Fi ↔ LTE)
Symptom: Tunnel drops whenever IP changes; users complain on mobile networks.
Root cause: MOBIKE disabled or unsupported in profile; NAT rebinding issues; aggressive DPD settings.
Fix: Enable MOBIKE, tune DPD/keepalives to keep NAT mappings alive, verify with logs that addresses are updated rather than full reauth.
3) “No proposal chosen” after a security policy update
Symptom: Immediate failure at IKE_SA_INIT; logs show no proposal chosen.
Root cause: Gateway and client suites diverged—often after disabling SHA1/older DH groups without updating clients.
Fix: Define a compatibility matrix per OS version. Roll out client profile changes before tightening server policy. Keep one transitional suite temporarily, with a removal date.
4) Works on some networks, fails on hotel/guest Wi‑Fi
Symptom: Users connect from home/cellular but not from certain Wi‑Fi networks.
Root cause: UDP 500/4500 blocked or mangled; captive portal interference; symmetric NAT corner cases.
Fix: Confirm with tcpdump if packets arrive. If blocked, consider offering a fallback (sometimes OpenVPN/TCP is the pragmatic escape hatch) or use a VPN gateway in a more permissive egress path.
5) Slow throughput after “tightening security”
Symptom: Tunnels connect fine; throughput halves; CPU spikes on gateway.
Root cause: Choosing CPU-expensive algorithms or disabling hardware acceleration paths; excessive rekeying.
Fix: Prefer modern AEAD suites like AES-GCM when supported; ensure crypto acceleration is enabled; increase rekey intervals sensibly.
6) Random hangs on large uploads/downloads
Symptom: Small packets fine; large transfers stall; some apps work, others don’t.
Root cause: MTU/PMTUD black hole—ICMP “fragmentation needed” blocked somewhere.
Fix: Clamp MSS as mitigation; then fix PMTUD by allowing ICMP and setting correct MTU on tunnel interfaces.
7) Users authenticate but get the wrong access
Symptom: Some users can reach restricted networks; others can’t reach what they should.
Root cause: RADIUS attributes/groups mapped incorrectly; inconsistent per-user policy on the gateway.
Fix: Normalize policy mapping. Log applied groups/ACLs per session. Treat VPN authorization like application authorization: explicit, versioned, reviewed.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
They migrated remote access from an aging appliance to a new Linux gateway running IKEv2. Pilot users were happy. Throughput was great. The change window closed with high-fives and a new dashboard tile.
Two days later, the helpdesk got a wave: “VPN connects, but nothing loads.” Not everyone—just users in a particular office and a few on specific ISPs. The tunnel status looked fine. SAs were established. Bytes were moving. The team assumed “it must be DNS” because it’s always DNS until it isn’t.
The wrong assumption was subtle: they believed internal routing already had a route back to the new VPN client pool because “it’s just another internal subnet.” It wasn’t. The old appliance NATed client traffic into a familiar source range, so return routing never mattered. The new setup was cleaner—routed, no NAT—so it depended on return routes. Those routes did not exist everywhere.
Once someone bothered to trace a single flow end-to-end, the pattern snapped into focus: SYN packets went in, SYN‑ACKs went to the default gateway, and died somewhere else. They added routes in the core, confirmed on edge routers, and the incident evaporated.
The lesson wasn’t “routing is hard.” The lesson was: know whether your predecessor solved a problem with NAT, and don’t accidentally remove that behavior without replacing the dependency.
Mini-story 2: The optimization that backfired
A different company had a clean IKEv2 deployment with EAP‑TLS and managed certificates. Performance was acceptable, but they wanted more. Someone noticed CPU spikes on the VPN gateway and decided to “optimize” by adjusting crypto settings and rekey timers.
The change was made with good intentions: shorter lifetimes for “better security,” and a switch to a different proposal that looked stronger on paper. No one measured CPU cost. No one checked which algorithms hit hardware acceleration paths. They deployed it on a Friday because “it’s just config.”
By Monday morning, the gateway was thrashing. Rekeying storms began during peak hours, and mobile clients—already sensitive to packet loss—started to flap. Users described it as “the VPN feels haunted,” which is not a metric, but it’s a vibe you should respect.
The rollback fixed it immediately. A follow-up test showed the “stronger” suite caused more CPU per packet and the rekey interval multiplied control-plane work. Security posture didn’t meaningfully improve because the original suite was already compliant, but availability got worse, which is a security problem of its own.
The lesson: crypto knobs are performance knobs. Change them like you change database indexes: with benchmarks, staged rollout, and a rollback plan.
Mini-story 3: The boring but correct practice that saved the day
One org ran IKEv2 for remote access and site-to-site, with certificates issued by an internal CA. Nothing fancy. The magic was in the boring parts: they tracked certificate expirations, had alerts, and enforced renewal through automation on managed devices.
Six months into operation, their CA hierarchy had to be rotated due to a security policy shift. That kind of change usually causes chaos: clients don’t trust the new chain, gateways present the wrong intermediate, and everyone plays “guess the trust store” for a week.
They didn’t. They had a documented trust-chain rollout plan: push the new root/intermediate to endpoints first, verify trust via MDM compliance, then update gateways to present the new chain. Only after telemetry showed most devices trusted the new CA did they revoke the old intermediate.
The rotation completed with minor noise. No outage. No mass re-enrollment. The best compliment was silence from the helpdesk.
The lesson: certificate lifecycle management is not paperwork; it’s uptime engineering.
Checklists / step-by-step plan
Choosing IKEv2/IPsec (decision checklist)
- Do you need native clients and MDM-driven configuration? If yes, lean IKEv2.
- Do you need per-user identity with centralized policy (RADIUS/AD) and good audit trails? If yes, lean IKEv2.
- Do you have a roaming-heavy workforce and want stable tunnels across network changes? If yes, require MOBIKE and test it.
- Is your environment hostile to UDP 500/4500? If yes, plan a fallback strategy (or reconsider protocol choice).
- Can you run PKI with discipline (issuance, renewal, revocation, expiry alerts)? If no, fix that first or pick a model that matches your reality.
Production rollout plan (step-by-step)
- Define a compatibility matrix: supported OS versions, EAP methods, cipher suites, and lifetimes.
- Choose identity: EAP‑TLS if possible; otherwise EAP with RADIUS and MFA integration. Avoid PSK for user VPN.
- Build a reference gateway: logging enabled, metrics exported, packet capture access controlled.
- Implement routing deliberately: decide routed vs NATed client traffic and document return routes.
- MTU plan: set tunnel MTU, allow ICMP fragmentation-needed, and add MSS clamping only as mitigation.
- Staged client rollout: internal pilot, then a single department, then broad deployment. Keep the old VPN until churn rate is stable.
- Chaos testing (lightweight): roam between Wi‑Fi and LTE, suspend/resume laptops, test captive portal networks.
- Runbooks: document your Fast diagnosis playbook and who owns which layer (network/security/endpoint).
- Rotation drills: rehearse certificate renewal and gateway config rollback.
Day-2 operations checklist
- Monitor tunnel churn, auth failures, and per-user connection success rates.
- Alert on certificate expirations: gateway certificates, client certificates, and the CAs that issue them.
- Quarterly review of crypto policy vs client capability (avoid surprise “no proposal chosen” incidents).
- Test from at least two hostile networks (guest Wi‑Fi, mobile hotspot) to catch egress filtering issues.
- Keep one known-good client profile in escrow (for recovery).
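The expiry alert in this checklist can be sketched in a few lines. This is a minimal illustration, assuming you already export a certificate inventory as name and `notAfter` pairs; the `expiring_soon` helper, the inventory names, and the 30-day threshold are all hypothetical, not taken from any particular gateway or PKI product.

```python
from datetime import datetime, timedelta, timezone


def expiring_soon(certs, now, warn_days=30):
    """Return (name, days_left) for certs that expire within warn_days.

    Already-expired certs are included with a negative days_left.
    Results are sorted so the most urgent entry comes first.
    """
    cutoff = now + timedelta(days=warn_days)
    hits = [
        (name, (not_after - now).days)
        for name, not_after in certs.items()
        if not_after <= cutoff
    ]
    return sorted(hits, key=lambda item: item[1])


# Hypothetical inventory: gateway leaf, issuing CA, and one client cert.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
inventory = {
    "gateway-leaf": datetime(2025, 1, 20, tzinfo=timezone.utc),
    "issuing-ca": datetime(2027, 6, 1, tzinfo=timezone.utc),
    "client-alice": datetime(2024, 12, 30, tzinfo=timezone.utc),
}

alerts = expiring_soon(inventory, now)
for name, days_left in alerts:
    print(f"{name}: {days_left} days left")
```

Including expired certificates in the result is deliberate: an alert that goes quiet once the certificate has actually expired is worse than no alert.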
FAQ
Is IKEv2/IPsec “more secure” than WireGuard?
Not automatically. WireGuard has a smaller attack surface and modern crypto choices. IKEv2/IPsec can be extremely secure when configured well. The real differentiator is often identity and policy integration, not raw cryptography.
Why does IKEv2 feel more reliable for corporate laptops?
Because the OS owns the client stack: it integrates with certificate stores, network transitions, and device management policy. Fewer third-party moving parts means fewer weird breakages after OS updates.
What’s the single best reason to choose IKEv2 over OpenVPN?
Native client support plus enterprise authentication models (EAP/RADIUS/certs) with predictable policy control. OpenVPN can do a lot, but you pay a “client management tax.”
What’s the single best reason to choose IKEv2 over WireGuard?
When you need user/device identity and centralized policy that plugs into corporate IAM workflows, plus the comfort of standard IPsec interop with existing network gear.
Should I use PSK for IKEv2 remote access?
Generally no. PSK is hard to rotate safely at scale and poor for per-user auditing. Use EAP‑TLS or EAP backed by RADIUS, ideally with MFA.
Do I need to allow ESP (IP protocol 50) through my firewall?
Not necessarily. Many deployments rely on NAT-T, which encapsulates ESP inside UDP 4500 (the IKE negotiation itself starts on UDP 500). If you control both ends and avoid NAT, native ESP can be used, but UDP 4500 is the common practical path.
Why do I see “no proposal chosen” and how do I stop it?
It means the client and server can’t agree on IKE/ESP algorithms and parameters. Fix by standardizing suites and rolling out client profiles before tightening server policy.
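One way to check this before tightening server policy is to compare the planned server proposals against what each client profile offers. The sketch below is a simplification: it treats every proposal as one atomic (encryption, integrity/PRF, DH group) tuple, whereas real IKEv2 negotiation matches transforms inside each proposal. The function names, client names, and algorithm labels are illustrative assumptions.

```python
def common_proposals(server, clients):
    """Map each client to the server proposals it also offers, in server order."""
    return {
        name: [proposal for proposal in server if proposal in offered]
        for name, offered in clients.items()
    }


def clients_without_overlap(server, clients):
    """Names of clients that would fail with 'no proposal chosen'."""
    shared = common_proposals(server, clients)
    return sorted(name for name, matches in shared.items() if not matches)


# Illustrative labels only; use the identifiers your gateway actually expects.
modern = ("AES-256-GCM", "PRF-SHA-384", "ECP-384")
older = ("AES-256-CBC", "SHA-256", "MODP-2048")

server_today = [modern, older]
server_tightened = [modern]

clients = {
    "managed-laptop": [modern, older],
    "legacy-tablet": [older],
}

print(clients_without_overlap(server_today, clients))
print(clients_without_overlap(server_tightened, clients))
```

Running the tightened policy through a check like this against your compatibility matrix shows which profiles must be updated first.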
My tunnel is up but web apps hang—what’s the likely cause?
MTU/PMTUD black holes. Large packets get dropped silently. Validate with DF pings and mitigate with MSS clamping while you correct ICMP/MTU handling.
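The arithmetic behind DF pings and MSS clamping is simple enough to sketch. The header sizes below are the fixed minimums for IPv4, IPv6, TCP, and ICMP echo; the 1400-byte tunnel MTU is only an assumed example, so substitute the value you measured.

```python
IPV4_HEADER = 20  # bytes, without IP options
IPV6_HEADER = 40  # bytes, fixed header
TCP_HEADER = 20   # bytes, without TCP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def clamp_mss(tunnel_mtu, ipv6=False):
    """TCP MSS that keeps a full-sized segment within the tunnel MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return tunnel_mtu - ip_header - TCP_HEADER


def df_ping_payload(mtu):
    """IPv4 ICMP echo payload size that produces a packet of exactly mtu bytes."""
    return mtu - IPV4_HEADER - ICMP_HEADER


tunnel_mtu = 1400  # assumed example value
print("MSS clamp (IPv4):", clamp_mss(tunnel_mtu))
print("MSS clamp (IPv6):", clamp_mss(tunnel_mtu, ipv6=True))
print("DF ping payload:", df_ping_payload(tunnel_mtu))
```

If a DF ping with that payload size succeeds but one byte more fails without returning a fragmentation-needed error, something on the path is dropping the ICMP feedback that PMTUD depends on.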
Is IKEv2 good for site-to-site and remote access together?
Yes, and this is one of its operational advantages. You can unify crypto policy, monitoring, and vendor support models across both use cases.
What should I log for compliance without drowning in noise?
Log auth events (success/failure with reasons), assigned virtual IPs, applied policy/ACL identifiers, and session start/stop. Avoid full packet logs except during incident windows.
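As a rough sketch of what such a record could look like, the example below emits one JSON line per event. The `VpnSessionEvent` class and its field names are hypothetical, chosen to mirror the items listed above; map them to whatever your gateway and log pipeline actually provide.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class VpnSessionEvent:
    """One auditable VPN event; all field names here are illustrative."""
    timestamp: str             # ISO 8601, UTC
    event: str                 # e.g. "auth_failure", "session_start", "session_stop"
    user: str
    reason: Optional[str]      # why auth succeeded or failed, if known
    virtual_ip: Optional[str]  # address assigned to the client, if any
    policy_id: Optional[str]   # identifier of the applied policy or ACL


def to_log_line(event):
    """Serialize an event as a single JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


line = to_log_line(
    VpnSessionEvent(
        timestamp="2025-01-01T03:00:00Z",
        event="session_start",
        user="alice",
        reason=None,
        virtual_ip="10.20.0.15",
        policy_id="remote-staff-acl",
    )
)
print(line)
```

One event per line with a fixed set of keys keeps the volume predictable and makes it straightforward to answer "who had which address under which policy, and when."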
Practical next steps
If you’re choosing between IKEv2/IPsec, WireGuard, and OpenVPN, stop arguing in the abstract. Make the decision with the constraints that create outages:
- Inventory your clients (OS versions, management state, roaming patterns). If you need native + MDM, IKEv2 moves to the front.
- Pick an identity model you can operate for years. If you can run EAP‑TLS with decent PKI hygiene, do it.
- Define and publish crypto suites and keep them intentionally boring. Then test across all client platforms you support.
- Decide routed vs NATed remote access up front and document return routing, so a later change doesn’t quietly alter the security/audit story.
- Adopt the Fast diagnosis playbook as your on-call muscle memory: control-plane vs data-plane, perimeter proof, then MTU/DNS/asymmetry.
Pick the protocol you can operate cleanly. The best VPN is the one that doesn’t turn your network team into part-time detectives.