“VPN won’t connect” is not a symptom. It’s a complaint. And the difference matters, because complaints get escalated; symptoms get diagnosed.
When the CEO’s laptop won’t connect from a hotel Wi‑Fi that blocks half the internet, you don’t need vibes. You need the right logs, the right commands, and a fast way to decide whether you’re dealing with auth, routing, crypto mismatch, NAT, MTU, or a plain old firewall drop.
Table of contents
- What logs actually matter (and why most don’t)
- Fast diagnosis playbook (first/second/third checks)
- Build a mental model: where the truth lives
- MikroTik: logging that catches failures without drowning you
- Linux: journalctl, kernel, and VPN daemon logs that pay rent
- Practical tasks (commands + outputs + decisions)
- Log patterns that map to real root causes
- Common mistakes: symptom → root cause → fix
- Checklists / step-by-step plan (production workflow)
- Three corporate-world mini-stories (realistic, anonymized)
- Interesting facts & short history (things that explain today’s pain)
- FAQ
- Conclusion: next steps that reduce repeats
What logs actually matter (and why most don’t)
VPN troubleshooting is a latency game: the longer you spend staring at irrelevant log noise, the longer your users stay offline. Your goal is to gather just enough evidence to pick the right branch: auth vs crypto vs network path vs policy/routing.
There are only a few log categories that consistently answer “why won’t it connect?”
- Handshake logs: IKE/IPsec negotiations, TLS handshakes, WireGuard handshakes. These tell you “we couldn’t agree” or “we never reached the peer.”
- Authentication/authorization logs: username/password failures, certificate validation, EAP decisions, RADIUS replies. These tell you “we reached it, but it said no.”
- Firewall/NAT logs: drops and translations around UDP 500/4500, ESP, TCP 443/1194, etc. These tell you “packets died here.”
- Routing and policy logs: which routes got installed, which selectors matched, which policies were applied. These tell you “connected, but useless.”
- Kernel/driver logs: MTU problems, fragmentation, offload weirdness, interface flaps. These tell you “physics disagrees with your config.”
What does not matter as much as people think: random PPP “connected/disconnected” churn without context, generic “peer not responding” without packet capture correlation, or mega-debug logs collected after the fact with no timestamps aligned across systems. Your first move should be to synchronize time, then filter to the minute of failure.
One reliable heuristic: when a VPN “won’t connect,” the cause is usually visible in one of these places:
- client-side VPN daemon logs (what it attempted)
- server-side VPN daemon logs (what it rejected or never saw)
- network edge logs (what got blocked or mangled)
Get those three views, line up timestamps, and the story writes itself.
Fast diagnosis playbook (first/second/third checks)
First: confirm the path (is anything reaching anything?)
- On MikroTik: look for firewall drops on the VPN ports/protocols and for IKE initiation attempts.
- On Linux: look for inbound packets on the expected interface, then check whether the VPN daemon logged a new handshake attempt.
Decision: if you see no traffic at the server/router during the attempt window, you’re not debugging crypto. You’re debugging reachability (routing, firewall, ISP, hotel Wi‑Fi policy).
Second: classify the failure (auth vs negotiation vs policy)
- Auth failures: “AUTH_FAILED”, “no shared key found”, “EAP failure”, “bad username or password”, “certificate verify failed”.
- Negotiation failures: “no proposal chosen”, “invalid KE payload”, “TS_UNACCEPTABLE”, “encryption algorithm not supported”.
- Policy/routing failures: tunnel up but no routes, wrong selectors, traffic not matching policy, wrong split tunnel settings.
Decision: pick the smallest possible fix, such as a single identity string, one cipher suite, one subnet selector, or one firewall rule.
Third: eliminate the silent killers (time, MTU, NAT-T)
- Time: cert validation and IKE lifetimes fall apart when clocks drift.
- MTU: “connects” but stalls; web works but SMB dies; large packets vanish.
- NAT traversal: UDP 500 works until NAT appears; UDP 4500 blocked; ESP mangled.
Decision: if symptoms are inconsistent across networks (home works, hotel fails), suspect MTU/NAT/blocked ports before you rewrite the VPN.
Paraphrased idea, widely repeated in operations circles and of disputed origin: “Hope is not a strategy.”
Build a mental model: where the truth lives
VPN troubleshooting fails when you treat logs like a scrapbook. Treat them like a distributed transaction trace: each component records a partial view, and the overlap is where causality emerges.
For a typical remote-access setup (client ↔ internet ↔ MikroTik ↔ Linux VPN endpoint or services), your data sources are:
- Client logs: tell you what the client offered, what it expected, and what it considered fatal.
- MikroTik logs: tell you about IKE, L2TP/PPP, firewall decisions, and NAT behavior at the edge.
- Linux VPN daemon logs: strongSwan, OpenVPN, WireGuard tools; they tell you the protocol-level reasons.
- Linux kernel + netfilter logs: dropped packets, conntrack/NAT oddities, MTU/fragmentation hints.
- AAA logs: RADIUS/LDAP or local user database decisions.
Clock alignment matters more than you want it to. If the MikroTik is 90 seconds off and your Linux host is NTP-disciplined, you will “prove” the wrong thing. Fix time first, then chase ghosts.
MikroTik: logging that catches failures without drowning you
MikroTik can be wonderfully blunt. It will tell you “no proposal chosen” and you can go home early. Or it will say “peer not responding” and you’ll waste two hours unless you correlate with firewall counters and packet flow.
What to enable (and what to keep off)
Enable logs that capture:
- ipsec negotiation details at info level (and temporarily at debug for a single source)
- l2tp, ppp for authentication and session establishment (if you run L2TP/PPTP/PPPoE pieces)
- firewall for drops on the relevant chains (limited, rate-limited, and targeted)
Avoid leaving full debug on permanently. Debug logs are like caffeine: useful in a pinch, then you wonder why your heart is racing and your storage is full.
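A minimal sketch of “limited and targeted”: log only the failing client’s attempts on the VPN ports, put the rule at the top of the input chain, and remove it when you’re done. The source address is a placeholder for the failing client’s public IP, and the chain layout is assumed to match the examples later in this article.
cr0x@server:~$ ssh admin@mtk-edge '/ip firewall filter add chain=input protocol=udp dst-port=500,4500 src-address=198.51.100.23 action=log log-prefix="vpn-dbg-" place-before=0'
cr0x@server:~$ ssh admin@mtk-edge '/ip firewall filter remove [find log-prefix="vpn-dbg-"]'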
What “good” looks like in MikroTik logs
For IPsec/IKEv2, “good” often includes:
- peer initiates
- proposal accepted
- SA established
- policies installed
For L2TP/IPsec, you want to see:
- IPsec established first
- then L2TP control channel comes up
- then PPP auth succeeds
If you only see the first half, you know where to aim.
Linux: journalctl, kernel, and VPN daemon logs that pay rent
Linux gives you more visibility than you deserve. The trick is not collecting logs; it’s asking narrow questions.
Log sources that matter
- systemd journal via journalctl (most distros)
- /var/log/auth.log or /var/log/secure (PAM, SSH, some VPN auth hooks)
- strongSwan logs (charon daemon)
- OpenVPN logs (service unit output or file log)
- kernel logs for xfrm/IPsec, WireGuard, netfilter drops
When possible, route service logs into journald and use structured filtering by unit. Grepping flat files works, but it’s 2025; we can do better.
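A narrow question looks like one unit and one time window, nothing more. The unit names here (strongswan, openvpn-server@server) are the ones used in the tasks below; yours may differ by distro and packaging.
cr0x@server:~$ sudo journalctl -u strongswan --since "2025-12-28 15:40" --until "2025-12-28 15:46" --no-pager
cr0x@server:~$ sudo journalctl -u openvpn-server@server --since "-15min" --no-pager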
Practical tasks (commands + outputs + decisions)
These are the tasks I actually run when a VPN won’t connect. Each includes a command, what typical output looks like, and the decision you make from it.
Task 1: Confirm system time alignment (Linux)
cr0x@server:~$ timedatectl
Local time: Sun 2025-12-28 15:40:11 UTC
Universal time: Sun 2025-12-28 15:40:11 UTC
RTC time: Sun 2025-12-28 15:40:11
Time zone: UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
Meaning: “System clock synchronized: yes” is what you want. Cert-based VPNs will fail in stupid ways if time is wrong.
Decision: If unsynchronized, fix NTP before debugging VPN. Otherwise you will misread cert errors and replay windows.
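If it isn’t synchronized, a minimal fix looks like this, assuming systemd-timesyncd (or equivalent) on the Linux host and RouterOS v7 on the MikroTik; v6 uses primary-ntp/secondary-ntp instead of servers, and the pool hostname is a placeholder.
cr0x@server:~$ sudo timedatectl set-ntp true
cr0x@server:~$ ssh admin@mtk-edge '/system ntp client set enabled=yes servers=pool.ntp.org'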
Task 2: Check whether packets arrive at the Linux server during an attempt
cr0x@server:~$ sudo tcpdump -ni eth0 'udp port 500 or udp port 4500 or ip proto 50' -c 10
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
15:41:02.101234 IP 198.51.100.23.55231 > 203.0.113.10.500: isakmp: parent_sa ikev2_init[I]
15:41:02.105991 IP 203.0.113.10.500 > 198.51.100.23.55231: isakmp: parent_sa ikev2_init[R]
15:41:03.220110 IP 198.51.100.23.55231 > 203.0.113.10.4500: UDP-encap: ESP(spi=0xcafebabe,seq=0x1)
Meaning: You have reachability and NAT-T activity. If this is silent, the issue is upstream: firewall, routing, ISP, or the client never sent.
Decision: If no packets arrive, stop tweaking ciphers. Start checking ACLs, NAT, and whether the client is hitting the right IP.
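Two quick checks that close off the “wrong destination” and “filtered upstream” branches: resolve the hostname the client actually dials (vpn.example.com is a placeholder), and sniff the WAN interface on the MikroTik edge during a fresh attempt.
cr0x@server:~$ dig +short vpn.example.com A
cr0x@server:~$ ssh admin@mtk-edge '/tool sniffer quick interface=WAN port=4500'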
Task 3: strongSwan IKEv2 logs for handshake failure classification
cr0x@server:~$ sudo journalctl -u strongswan --since "15:40" --no-pager
Dec 28 15:41:02 server charon[1123]: 09[IKE] received IKE_SA_INIT request from 198.51.100.23[55231]
Dec 28 15:41:02 server charon[1123]: 09[IKE] no proposal chosen
Dec 28 15:41:02 server charon[1123]: 09[IKE] sending NO_PROPOSAL_CHOSEN notify to 198.51.100.23[55231]
Meaning: Crypto mismatch. The peer offered algorithms you didn’t accept (or vice versa).
Decision: Align proposals. Do not “open everything” permanently; add the minimum compatible suite and schedule a cleanup.
Task 4: strongSwan “authentication failed” vs “not found”
cr0x@server:~$ sudo journalctl -u strongswan --since "15:40" --no-pager | tail -n 6
Dec 28 15:41:22 server charon[1123]: 11[IKE] authentication of 'vpnuser' with EAP failed
Dec 28 15:41:22 server charon[1123]: 11[IKE] sending AUTHENTICATION_FAILED notify
Dec 28 15:41:22 server charon[1123]: 11[IKE] IKE_SA closed
Meaning: Network path is fine; credentials or AAA integration is not.
Decision: Stop staring at firewall rules. Check RADIUS/LDAP, user secrets, and EAP method compatibility.
Task 5: WireGuard handshake status (Linux)
cr0x@server:~$ sudo wg show
interface: wg0
public key: 7E8sN8ZQk0yqJfYHk2m2pX1G4vQn4w2XhQ0v0n0n0n0=
listening port: 51820
peer: GkYp9q3yR1l4tD0Nw7xqzK9uJm9c0q0H0e0o0r0e0r0=
endpoint: 198.51.100.23:60211
allowed ips: 10.66.66.2/32
latest handshake: 2 minutes, 11 seconds ago
transfer: 18.32 MiB received, 22.10 MiB sent
persistent keepalive: every 25 seconds
Meaning: If “latest handshake” is “never,” you’re dealing with reachability, blocked ports, or a key mismatch.
Decision: If handshake exists but traffic doesn’t flow, focus on AllowedIPs, routing, and firewall. If no handshake, focus on UDP reachability and correct keys.
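To separate AllowedIPs problems from routing problems, a couple of read-only checks help (wg0 and the peer address are the ones from the output above):
cr0x@server:~$ sudo wg show wg0 allowed-ips
cr0x@server:~$ ip route get 10.66.66.2
cr0x@server:~$ ip route show | grep -w wg0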
Task 6: WireGuard kernel log hints (Linux)
cr0x@server:~$ sudo journalctl -k --since "15:00" --no-pager | grep -i wireguard | tail
Dec 28 15:12:01 server kernel: wireguard: wg0: Handshake for peer 1 (198.51.100.23:60211) did not complete after 5 seconds, retrying (try 2)
Dec 28 15:12:06 server kernel: wireguard: wg0: No response from peer 1 (198.51.100.23:60211), retrying (try 3)
Meaning: Outbound packets leave; responses don’t return. Classic NAT/firewall/port-forward breakage.
Decision: Check upstream filtering, NAT mappings, and whether the endpoint IP/port is correct.
Task 7: OpenVPN service logs (Linux)
cr0x@server:~$ sudo journalctl -u openvpn-server@server --since "15:40" --no-pager
Dec 28 15:41:10 server openvpn[2210]: TLS Error: TLS key negotiation failed to occur within 60 seconds (check your network connectivity)
Dec 28 15:41:10 server openvpn[2210]: TLS Error: TLS handshake failed
Meaning: Usually not a certificate issue; it’s reachability (blocked TCP/UDP port) or asymmetric routing. Sometimes it’s MTU.
Decision: Verify port exposure and whether packets hit the daemon. If they do, look for cipher mismatch and client/server TLS settings next.
Task 8: Confirm Linux is listening on the expected ports
cr0x@server:~$ sudo ss -lunpt | egrep ':(500|4500|1194|51820)\b'
udp UNCONN 0 0 0.0.0.0:4500 0.0.0.0:* users:(("charon",pid=1123,fd=14))
udp UNCONN 0 0 0.0.0.0:500 0.0.0.0:* users:(("charon",pid=1123,fd=13))
udp UNCONN 0 0 0.0.0.0:1194 0.0.0.0:* users:(("openvpn",pid=2210,fd=5))
Meaning: If a userspace service isn’t listening, your firewall can be perfect and nothing will work. Note that kernel-mode WireGuard sockets don’t show up in ss at all; confirm the WireGuard listening port with wg show (Task 5) instead.
Decision: If missing, fix service config/startup before touching network devices.
Task 9: Check netfilter policy (Linux) without guessing
cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
chain input {
type filter hook input priority 0; policy drop;
ct state established,related accept
iif "lo" accept
udp dport { 500, 4500, 51820, 1194 } accept
ip protocol icmp accept
counter drop
}
}
Meaning: Policy drop is fine if you explicitly allow VPN ports/protocols. If you forget UDP/4500, IKEv2 behind NAT will fail in the most annoying way.
Decision: If missing accepts, add them; if present, stop blaming Linux firewall and move outward.
Task 10: Confirm IP forwarding and XFRM policy health for IPsec
cr0x@server:~$ sysctl net.ipv4.ip_forward net.ipv6.conf.all.forwarding
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 0
Meaning: For site-to-site or routed remote access, forwarding matters. Some setups still “connect” but pass no traffic if forwarding is off.
Decision: If forwarding is off and you expect routed traffic, enable it and verify firewall forward chain rules.
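A minimal sketch of enabling forwarding now and keeping it across reboots (the drop-in filename is an arbitrary choice):
cr0x@server:~$ sudo sysctl -w net.ipv4.ip_forward=1
cr0x@server:~$ echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-vpn-forward.conf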
Task 11: Inspect IPsec SAs on Linux (strongSwan)
cr0x@server:~$ sudo swanctl --list-sas
vpn-ikev2: #12, ESTABLISHED, IKEv2, 3c1f0b1f2d3e4a5b_i* 2a1b3c4d5e6f7a8b_r
local '203.0.113.10' @ 203.0.113.10[4500]
remote '198.51.100.23' @ 198.51.100.23[60211]
AES_GCM_16_256/PRF_HMAC_SHA2_256/ECP_256
established 98s ago
vpn-ikev2{21}: INSTALLED, TUNNEL, reqid 7, ESP in UDP SPIs: c1234567_i c89abcde_o
vpn-ikev2{21}: 10.10.10.0/24 === 10.66.66.2/32
Meaning: “ESTABLISHED” plus “INSTALLED” tells you Phase 1 and Phase 2 are up. If traffic still fails, you’re looking at routing, firewall, or selectors mismatch.
Decision: If no CHILD_SA is installed, fix traffic selectors/proposals. If installed, trace packet flow after decryption.
Task 12: MikroTik: see IPsec peers and active SAs
cr0x@server:~$ ssh admin@mtk-edge '/ip ipsec active-peers print; /ip ipsec installed-sa print'
0 address=198.51.100.23 port=60211 state=established phase1-established=yes
0 spi=0xcafebabe src-address=203.0.113.1 dst-address=198.51.100.23 state=mature
auth-algorithm=sha256 enc-algorithm=aes-gcm enc-key-size=256
Meaning: Peer established + installed SA means the edge sees a real tunnel, not just wishful thinking.
Decision: If peer isn’t established, work on negotiation/auth. If established but no traffic, inspect routes, NAT exemptions, and firewall forward rules.
Task 13: MikroTik: turn on targeted logging for IPsec and PPP (temporarily)
cr0x@server:~$ ssh admin@mtk-edge '/system logging add topics=ipsec,debug action=memory; /system logging add topics=ppp,info action=memory; /log print where topics~"ipsec|ppp"'
15:44:01 ipsec,debug initiate new phase 1 (Identity_Protection): 198.51.100.23[60211]<=>203.0.113.1[500]
15:44:02 ipsec,error no proposal chosen
15:44:03 ppp,info <l2tp-user> disconnected
Meaning: “no proposal chosen” is immediate crypto mismatch. PPP disconnects afterward are collateral.
Decision: Fix IPsec proposal alignment first. Don’t chase PPP until the IPsec layer is stable.
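To see what the MikroTik side is willing to accept, print the phase 1 profile and phase 2 proposal, then adjust one of them toward what the client actually offers. The algorithm values in the second command are examples, not a recommendation; align first, tighten later.
cr0x@server:~$ ssh admin@mtk-edge '/ip ipsec profile print detail; /ip ipsec proposal print detail'
cr0x@server:~$ ssh admin@mtk-edge '/ip ipsec proposal set [find name=default] enc-algorithms=aes-256-gcm pfs-group=ecp256'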
Task 14: MikroTik: check firewall counters for VPN ports
cr0x@server:~$ ssh admin@mtk-edge '/ip firewall filter print stats where chain=input and (protocol=udp and (dst-port=500 or dst-port=4500))'
0 chain=input action=accept protocol=udp dst-port=500,4500 in-interface=WAN packets=1823 bytes=219120
1 chain=input action=drop in-interface=WAN packets=44 bytes=3520
Meaning: You have accepts and some drops. The question is: are the drops your VPN, or background noise?
Decision: If accepts are zero during an attempt, the traffic isn’t arriving or isn’t matching your rule. If drops spike during attempts, add a specific accept before the drop.
Task 15: Confirm MTU path symptoms with ping DF (Linux)
cr0x@server:~$ ping -M do -s 1472 -c 3 10.66.66.2
PING 10.66.66.2 (10.66.66.2) 1472(1500) bytes of data.
From 10.10.10.1 icmp_seq=1 Frag needed and DF set (mtu = 1420)
From 10.10.10.1 icmp_seq=2 Frag needed and DF set (mtu = 1420)
From 10.10.10.1 icmp_seq=3 Frag needed and DF set (mtu = 1420)
--- 10.66.66.2 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss
Meaning: You’re exceeding a path MTU. VPN encapsulation overhead makes this common, and it can look like “connects but nothing works.”
Decision: Reduce tunnel MTU, clamp TCP MSS, or fix PMTUD blocking. Don’t just “disable DF everywhere” and call it a day.
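A minimal Linux-side sketch, assuming a WireGuard tunnel on wg0 and an nftables forward chain in the inet filter table (the Task 9 ruleset only shows an input chain, so put the clamp wherever your forward filtering actually lives):
cr0x@server:~$ sudo ip link set dev wg0 mtu 1380
cr0x@server:~$ sudo nft add rule inet filter forward tcp flags syn tcp option maxseg size set rt mtu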
Joke #1: MTU issues are the only networking bug that can make a VPN “work” and still ruin your weekend.
Log patterns that map to real root causes
Pattern: “no proposal chosen” (IKE/IPsec)
Usually means: mismatch in encryption/integrity/PRF/DH group, or you’re mixing IKEv1 assumptions into IKEv2 config.
What to do: Compare the offered proposals from the client with server’s accepted set. On MikroTik, check IPsec proposal settings; on strongSwan, check ike= and esp= definitions. Make one side more permissive temporarily, then tighten.
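As a sketch, aligned proposals in legacy ipsec.conf style might look like the following; the connection name and suites are illustrative, the trailing ! means “these and nothing else,” and swanctl.conf expresses the same thing with proposals = and esp_proposals =.
conn rw-ikev2
    keyexchange=ikev2
    ike=aes256gcm16-prfsha384-ecp384,aes256-sha256-modp2048!
    esp=aes256gcm16,aes256-sha256!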
Pattern: “AUTHENTICATION_FAILED”
Usually means: wrong credentials, wrong identity string, certificate mismatch, or RADIUS policy rejecting the user.
What to do: Verify the ID the client presents (FQDN/email/DN) and what the server expects. In cert setups, check EKU, SAN, and chain trust. In EAP, confirm method compatibility (EAP-MSCHAPv2 vs EAP-TLS).
Pattern: “peer not responding” / timeouts
Usually means: packets aren’t returning. Firewall blocks, NAT breaks, wrong destination IP, asymmetric route, or UDP 4500 blocked.
What to do: Packet capture on both ends. If you see outbound but no inbound responses, move to the network. If you see inbound and the daemon stays quiet, you’re hitting the wrong host/port or local firewall/SELinux is eating it.
Pattern: tunnel up, but “no traffic”
Usually means: routing/AllowedIPs/selectors wrong, NAT exemption missing, or firewall forward chain blocks post-decrypt traffic.
What to do: For IPsec, inspect installed policies/selectors. For WireGuard, inspect AllowedIPs on both ends and confirm routes exist. Then trace with tcpdump on the inside interface, not just the WAN.
Pattern: intermittent connectivity depending on network
Usually means: NAT traversal edge cases, MTU/PMTUD, or captive portals blocking UDP.
What to do: Force NAT-T (where appropriate), test alternate ports/protocols (e.g., OpenVPN on TCP 443), and clamp MSS. Logs will show you timeouts and retransmits; your job is to connect them to the network environment.
Common mistakes: symptom → root cause → fix
1) Symptom: “works at home, fails on hotel Wi‑Fi”
Root cause: UDP blocked or mangled; NAT-T UDP 4500 blocked; captive portal; or strict firewall policies.
Fix: Offer a TCP-based fallback (often TCP 443), ensure NAT-T is enabled for IPsec, and log firewall drops on WAN input for the VPN ports.
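A minimal sketch of such a fallback OpenVPN instance; the path, device, tunnel subnet, and hostname are placeholders, and it still needs your usual ca/cert/key/tls directives.
# /etc/openvpn/server/tcp443.conf
port 443
proto tcp-server
dev tun1
server 10.67.67.0 255.255.255.0
The matching client profile uses proto tcp-client and remote vpn.example.com 443.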
2) Symptom: “IKE_SA establishes, but no internal access”
Root cause: missing routes, wrong traffic selectors, missing NAT exemption, or forward chain drop.
Fix: Verify installed CHILD_SA selectors, add routes, add explicit allow rules post-decrypt, and ensure you’re not NATing traffic that should be protected.
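On a MikroTik edge, the NAT exemption is usually one srcnat accept rule above the masquerade rule. The subnets below follow this article’s examples (LAN 10.10.10.0/24, VPN clients 10.66.66.0/24), and place-before=0 assumes masquerade is currently rule 0; adjust both to your layout.
cr0x@server:~$ ssh admin@mtk-edge '/ip firewall nat add chain=srcnat action=accept src-address=10.10.10.0/24 dst-address=10.66.66.0/24 place-before=0 comment="no-nat-to-vpn"'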
3) Symptom: “authentication failed” right after negotiation
Root cause: wrong identity (IDi/IDr), wrong EAP method, RADIUS rejects, cert trust chain incomplete.
Fix: Log the presented identity, align ID settings, validate cert chain on server, and check AAA logs for reject reasons.
4) Symptom: OpenVPN “TLS handshake failed” but port is open
Root cause: client/server TLS settings mismatch (tls-crypt/tls-auth), cipher mismatch, or MTU causing handshake fragmentation loss.
Fix: Compare configs; temporarily simplify to a known-good cipher/TLS setting, then reintroduce hardening. Check MTU and fragment settings.
5) Symptom: WireGuard “latest handshake: never”
Root cause: wrong endpoint, UDP blocked, wrong public key, or NAT that needs persistent keepalive.
Fix: Verify keys and endpoint; add keepalive for clients behind NAT; confirm UDP reachability with tcpdump.
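For a client behind NAT, the relevant client-side peer block looks roughly like this; the endpoint hostname is a placeholder, AllowedIPs follows the article’s example subnets, and the public key is the server’s.
[Peer]
PublicKey = <server public key>
Endpoint = vpn.example.com:51820
AllowedIPs = 10.10.10.0/24
PersistentKeepalive = 25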
6) Symptom: VPN connects, browsing works, file shares time out
Root cause: MTU/MSS issues. Small packets pass; large packets die.
Fix: Clamp TCP MSS on the tunnel interface/edge, reduce MTU, and ensure ICMP “frag needed” isn’t blocked.
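On a MikroTik edge, the clamp is typically one mangle rule; a minimal sketch (consider scoping it by out-interface or address list instead of clamping all forwarded traffic):
cr0x@server:~$ ssh admin@mtk-edge '/ip firewall mangle add chain=forward protocol=tcp tcp-flags=syn action=change-mss new-mss=clamp-to-pmtu passthrough=yes comment="mss-clamp-vpn"'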
7) Symptom: L2TP/IPsec connects for some users, others fail
Root cause: per-user PPP secrets, profile mismatches, RADIUS attribute differences, or address pool exhaustion.
Fix: Check PPP auth logs, verify IP pool availability, standardize profiles, and compare AAA replies.
8) Symptom: site-to-site IPsec flaps every few minutes
Root cause: DPD mismatch, rekey timing mismatch, NAT mapping timeouts, or unstable WAN link.
Fix: Align lifetimes and DPD settings; check WAN errors; consider keepalives; log rekey events and correlate with interface state changes.
Checklists / step-by-step plan (production workflow)
Checklist A: the 10-minute triage (don’t be clever yet)
- Get the exact failure time window and the client network type (home/cellular/hotel).
- Confirm DNS resolves to the expected public IP (avoid “wrong destination” rabbit holes).
- Verify server/router time and Linux time are synced.
- On Linux, run tcpdump for UDP 500/4500 (or the relevant port) during a fresh attempt.
- On MikroTik, check firewall counters and IPsec/PPP logs for the same minute.
- Classify: no traffic, negotiation failure, auth failure, or “up but no data.”
- Pick one hypothesis and test it with one change or one command.
Checklist B: negotiation/auth deep-dive (when packets do arrive)
- Extract the exact error string from the VPN daemon logs (NO_PROPOSAL_CHOSEN, AUTHENTICATION_FAILED, TS_UNACCEPTABLE).
- List the server’s configured proposals and compare with the client’s supported suite.
- Validate cert chain, expiration, and identity fields (SAN, CN as relevant).
- Check AAA logs (RADIUS/LDAP) for reject reasons; don’t guess.
- Re-test once. If the error changes, you made progress. If it doesn’t, you didn’t.
Checklist C: “connected but nothing works” (the sneaky one)
- Confirm SA is established and selectors match expected subnets.
- Confirm routes exist on both ends (and on the client for split tunnel).
- Confirm firewall allows post-decrypt traffic (forward chain, not just input).
- Confirm NAT exemption for VPN subnets (don’t NAT inside-to-inside unless you mean it).
- Test MTU with DF ping or clamp MSS and retest.
Joke #2: The VPN isn’t “down.” It’s just practicing chaos engineering without telling you.
Three corporate-world mini-stories (realistic, anonymized)
Mini-story 1: The incident caused by a wrong assumption
The helpdesk escalated a “VPN outage” at 8:17 AM. The symptom was clean: a chunk of remote users couldn’t connect, and the ones who could were mostly on home broadband. Cellular users were fine. Everyone blamed certificates because someone had renewed “something” last week.
The first wrong assumption: “If some users can connect, the VPN server is healthy.” That’s emotionally comforting and sometimes true, but it’s also how you miss path-dependent failures like UDP filtering and NAT-T quirks. The second wrong assumption: “If it’s IPsec, it must be crypto.” No. IPsec is often network policy wearing a crypto costume.
We pulled a tcpdump on the Linux endpoint: nothing arrived on UDP 4500 during the failure attempts. MikroTik firewall counters for UDP 4500 input were flat too, while UDP 500 spiked. That was the clue: clients behind NAT start on UDP 500 then move to UDP 4500 for NAT traversal. They were getting stuck at the door.
The cause was boring: a newly deployed upstream firewall template allowed UDP 500 but not UDP 4500. The template had “IPsec” in the name, so everyone assumed it was complete. We added UDP 4500, watched strongSwan logs flip from timeouts to established SAs, and the incident closed.
The postmortem takeaway wasn’t “remember UDP 4500.” It was: never diagnose “won’t connect” without first proving traffic reaches the device on the port it actually needs.
Mini-story 2: The optimization that backfired
A network team decided to “reduce log noise” on a MikroTik edge by disabling most VPN-related logging and relying on a central Linux syslog from the application tier. The motivation was understandable: storage budgets, alert fatigue, and a dashboard that looked like a seismograph.
Two weeks later, an IKEv2 rollout started failing for a specific client population. The Linux endpoint showed sporadic “received packet with unknown SPI” and occasional AUTH failures. But there was no consistent pattern, and we had no edge logs to tell us whether traffic was being NATed, dropped, or translated in a weird way.
We re-enabled targeted MikroTik IPsec logs and added a single firewall rule with counters for UDP 500/4500. Within minutes the story appeared: a NAT rule was hairpinning certain source ranges differently, causing the endpoint to see changing source ports mid-handshake. strongSwan treated the changes as suspicious, and handshakes died intermittently.
The “optimization” was turning off the very logs that tell you what the edge is doing to packets. The fix wasn’t “enable all logs.” The fix was disciplined logging: small, targeted, time-bound debug plus persistent counters.
Afterward, the team kept the noise down but preserved the ability to reconstruct a failure. That’s the difference between observability and hoarding.
Mini-story 3: The boring but correct practice that saved the day
A different org did something unfashionable: they standardized VPN failure triage into a tiny runbook. Every incident started with the same three captures: client timestamp, edge counters, server daemon logs. They also enforced NTP on every MikroTik and every Linux host, and they tested it monthly like adults.
When a “VPN won’t connect” incident hit during a vendor demo, they didn’t argue about blame. They aligned timestamps and looked at the overlap window. Edge showed inbound UDP 500 packets accepted, then nothing on UDP 4500. Server tcpdump showed the same. That narrowed it to upstream filtering or client network policy.
The user was on a guest Wi‑Fi that blocked UDP 4500. The runbook’s next step was a documented fallback: OpenVPN on TCP 443, with an explicit log filter to confirm a TLS handshake. They switched profiles, saw “TLS negotiation succeeded,” and the demo continued with only mild sweating.
The saving practice wasn’t fancy. It was having a fallback path and the habit of confirming packet reachability before rewriting configs. Boring wins. You can frame that and hang it in the office, preferably over the espresso machine.
Interesting facts & short history (things that explain today’s pain)
- IPsec’s original enterprise push was in the 1990s, when “security at the network layer” sounded like it would simplify everything. It didn’t, but it did standardize a lot of VPN vocabulary.
- IKEv2 (mid-2000s) was designed to fix IKEv1’s complexity and improve mobility and NAT traversal, but it still depends heavily on correct proposals and identities.
- NAT wasn’t designed with IPsec in mind. ESP (protocol 50) doesn’t have ports, which makes classic NAT devices awkward; NAT-T wrapped ESP in UDP 4500 as a pragmatic patch.
- PPP-era VPNs (L2TP/PPTP) lingered because they were easy to deploy, not because they were great. Logs still show PPP auth failures that have nothing to do with crypto.
- WireGuard went mainstream by being intentionally small: fewer knobs, fewer legacy modes, and therefore fewer “mystery” negotiation errors—when it fails, it usually fails plainly.
- OpenVPN popularized “VPN over TCP/443” as a survival tactic for restrictive networks, even though TCP-over-TCP can be a performance footgun.
- MTU pain got worse as encryption became default: every encapsulation layer steals bytes, and modern networks often block ICMP needed for PMTUD.
- Logging practices changed with systemd: journald made it easier to filter by unit and time window, which is exactly what VPN troubleshooting needs.
FAQ
1) If the VPN “won’t connect,” where do I look first: client, MikroTik, or Linux?
Start where you can prove packet arrival. If you control the server, run tcpdump and check daemon logs. If you don’t see packets, pivot to the MikroTik edge and upstream path.
2) Why do I see UDP 500 packets but not UDP 4500?
Many clients begin IKE on UDP 500 and switch to UDP 4500 when NAT is detected. Missing UDP 4500 usually means firewall policy or upstream filtering, not crypto.
3) What does “no proposal chosen” actually mean?
The peers couldn’t agree on algorithms (encryption/integrity/PRF/DH group) for IKE or ESP. Fix by aligning proposal sets; don’t randomly enable legacy suites unless you understand the risk.
4) The tunnel establishes, but internal subnets are unreachable. Is that still a VPN problem?
Yes, but it’s not a handshake problem. It’s routing, firewall forward rules, NAT exemptions, or traffic selectors/AllowedIPs. Check SAs and routes, then trace traffic after decryption.
5) Why does WireGuard show handshakes but still no traffic?
Usually AllowedIPs or routing. A handshake only proves keys and reachability; it doesn’t prove you’re routing the right subnets through the tunnel or allowing them through the firewall.
6) What’s the fastest way to confirm a Linux firewall isn’t blocking VPN?
Inspect the ruleset (nft list ruleset) and confirm explicit accepts for the VPN ports/protocols, plus counters if you can. Don’t “just flush iptables” in production unless you enjoy incident reports.
7) Should I enable debug logging permanently on MikroTik and strongSwan?
No. Enable targeted debug briefly, ideally scoped by time window and source IP if possible, then turn it off. Keep lightweight info logs and counters always.
8) How do I know it’s MTU?
Classic sign: small requests work, large transfers stall; some apps connect, others hang. Confirm with DF ping tests and by clamping MSS. Logs often show retransmits/timeouts, not “MTU” in neon.
9) If OpenVPN over TCP 443 works everywhere, should I just use that?
It’s a useful fallback, not always a primary. TCP-over-TCP can melt performance under loss. Use it when you need to punch through restrictive networks, and measure before standardizing.
10) What single habit reduces VPN incidents the most?
Timestamp discipline: NTP everywhere and a habit of collecting client attempt time, edge counters, and server daemon logs for the same minute.
Conclusion: next steps that reduce repeats
Stop treating “VPN won’t connect” like a riddle. Treat it like a short investigation with evidence collection and branching logic.
- Standardize the fast playbook (traffic arrival → classify error → check time/MTU/NAT-T) and make it the default response.
- Keep targeted logging always-on: MikroTik IPsec info logs, firewall counters for VPN ports, and Linux daemon logs in journald.
- Practice one fallback path for restrictive networks (often TCP 443) and test it periodically.
- Make MTU/MSS a first-class check for “connects but unusable” incidents—because it keeps happening and pretending otherwise won’t help.
When you can answer “did packets arrive?” and “what exact error fired?” in under five minutes, VPN incidents shrink from drama to routine maintenance. That’s the goal: fewer heroics, more uptime.