VPNs are sold as privacy in a box: click connect, you’re safe, the tunnel is “encrypted,” and everyone sleeps. Then Monday arrives. Someone’s laptop gets popped on hotel Wi‑Fi, the attacker rides the VPN like a company shuttle, and suddenly you’re explaining to leadership why “private tunnel” didn’t mean “private network.”
This isn’t a piece about fear. It’s about failure modes. VPN incidents tend to be boring in the worst way: a default left in place, a shortcut that felt reasonable, a log you didn’t keep because it cost money, a subnet that quietly allowed everything. Let’s fix the mechanics, not the marketing.
Interesting facts and historical context
- “VPN” started as a corporate connectivity tool, not a privacy product. Early designs focused on site-to-site links between offices, where “endpoint trust” was assumed.
- IPsec standardization kicked off in the mid‑1990s. It solved “encrypt on the wire” but not “don’t let a compromised endpoint ruin your day.” That part is still on you.
- PPTP (an early VPN protocol) was widely deployed despite weak cryptography. It’s a reminder that convenience can outlive correctness for embarrassingly long periods.
- TLS VPNs became popular because they traversed NAT and firewalls easily. That usability win also made remote access VPNs attractive targets: one internet-facing port with a high-value blast radius.
- Split tunneling was originally a bandwidth optimization when backhauling all internet traffic over corporate links was expensive. Today it’s often a security compromise dressed up as “performance.”
- DNS leaks were a known issue for years because operating systems try hard to “be helpful” with resolvers, interface metrics, and fallback behaviors. Helpful is not the same as safe.
- Certificate lifetimes used to be measured in years because rotation was operationally painful. Automation made short lifetimes realistic; many orgs still run 5-year certs because habits are persistent.
- WireGuard’s design is intentionally small (a few thousand lines of core code, compared to much larger legacy stacks). Smaller doesn’t mean perfect, but it changes the audit and failure surface.
- Many major VPN incidents didn’t start with “broken encryption”; they started with exposed management interfaces, unpatched appliances, or credentials reused across systems.
Two things never change: people over-trust tunnels, and attackers love any component that’s internet-facing, authenticated, and connected to internal networks.
Fast diagnosis playbook (find the bottleneck fast)
When a VPN “is down” you have to decide: are we dealing with connectivity, authentication, routing, DNS, performance, or policy? This is how you avoid spending two hours blaming “the VPN” like it’s a sentient being.
First: establish what’s failing (control plane vs data plane)
- Can clients reach the VPN listener? Check basic network reachability and service status.
- Can clients authenticate? If MFA/IdP is flaky, the tunnel never forms. If certs expired, same outcome.
- Does traffic pass after connect? That’s routing, firewall policy, MTU, DNS, or split-tunnel config.
Second: identify whether the problem is global or scoped
- One user? Endpoint posture, local firewall, stale client config, Wi‑Fi captive portal, DNS cache, clock drift.
- One office/region? ISP routing, geofenced IdP endpoints, anycast weirdness, or a dead POP.
- Everyone? Server resource exhaustion, certificate expiration, IdP outage, firewall rule push, or a kernel upgrade that “shouldn’t” affect networking.
Third: pinpoint the bottleneck with three cheap signals
- Server CPU (crypto, packet processing, interrupt storms).
- Packet drops (NIC queues, conntrack, firewall, MTU).
- Logs around auth (rate limits, lockouts, failed MFA, invalid certs).
Paraphrased idea from Werner Vogels: you build reliable systems by assuming everything fails and designing accordingly. VPNs are no exception.
Joke #1: A VPN is like a bouncer: if you let everyone in and don’t check IDs, you’ve built a very polite doorway.
The 12 mistakes (and what to do instead)
1) Treating “encrypted” as “secure”
Encryption protects data in transit. It does not validate the endpoint, limit access, prevent lateral movement, or stop data exfiltration once the attacker is inside the tunnel.
Do instead: treat VPN as an untrusted ingress. Apply least privilege routing, per-user policy, segmentation, and logging. Assume endpoints get compromised.
2) Flat network access: “VPN users can reach everything”
This is the classic incident multiplier. If the VPN assigns an IP inside your “trusted” RFC1918 space and your firewalls treat that as internal, you’ve created a roaming office LAN with no walls.
Do instead: create a dedicated VPN address pool/subnet, route it through a policy enforcement point, and explicitly allow only what’s required. Default deny beats “we’ll tighten later.” “Later” becomes “after the incident.”
3) Split tunneling enabled by default without compensating controls
Split tunneling means corporate routes go over the VPN, but internet traffic goes directly out from the client. That can be fine—if you understand the consequences: DNS leaks, traffic interception, and a compromised device bridging between networks.
Do instead: decide intentionally. If split tunnel is required, lock down DNS (internal resolvers over the tunnel), enforce host firewall policy, and restrict which internal routes are advertised. For high-risk roles, disable split tunneling.
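If you run OpenVPN, that policy can be pushed from the server. A minimal sketch of the relevant server.conf directives, reusing the placeholder resolver and domain that show up in the tasks below (block-outside-dns applies to Windows clients only):
# advertise only the subnets this group actually needs
push "route 10.30.10.0 255.255.255.0"
# internal resolver and search domain, reached over the tunnel
push "dhcp-option DNS 10.20.0.53"
push "dhcp-option DOMAIN corp.example"
# Windows clients: refuse resolvers outside the tunnel
push "block-outside-dns"
The point isn’t these exact directives; it’s that routes and DNS are pushed deliberately instead of inherited by accident.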
4) Weak authentication: passwords alone, shared accounts, no MFA
Remote access VPN with password-only auth is an open invitation to credential stuffing. Shared accounts are worse: no accountability, no scoped revocation, and no clean audit trail.
Do instead: require MFA and prefer certificate/device-bound authentication. Disable shared accounts. If you need break-glass, create time-limited accounts with tight routing and extra monitoring.
5) Long-lived credentials and neglected certificate rotation
Certificates and static keys that live for years turn “credential compromise” into “multi-year persistent access.” Rotation feels like operational pain until you automate it; then it becomes a calendar event handled by a bot.
Do instead: shorten lifetimes, automate renewal and deployment, and build a revocation story that actually works (CRL/OCSP where applicable; for WireGuard, rotate keys and prune peers).
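For WireGuard specifically, rotation is peer bookkeeping. A minimal sketch, assuming the interface is managed with wg-quick, the client generates its new keypair locally, and the keys below are placeholders:
cr0x@server:~$ sudo wg set wg0 peer 'OLD_PEER_PUBLIC_KEY=' remove
cr0x@server:~$ sudo wg set wg0 peer 'NEW_PEER_PUBLIC_KEY=' allowed-ips 10.44.0.10/32
cr0x@server:~$ sudo wg-quick save wg0
The save step (or an equivalent config-management run) matters: runtime changes to wg0 evaporate on the next restart.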
6) Exposing management interfaces to the internet
Your VPN gateway’s admin UI is not a SaaS demo page. If it’s reachable from the internet, it will be scanned, brute-forced, and hammered by exploit kits. This is not speculation; it’s a Tuesday.
Do instead: put management behind a separate admin network or bastion, restrict by source IP, require MFA, and log every administrative action. Better: no web UI at all for prod changes—use config management and version control.
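In nftables terms, the management exception is one scoped accept placed before the chain’s final reject, in the same style as the SSH rule in Task 2 below. A sketch where the admin port and management subnet are placeholders:
tcp dport 8443 ip saddr 192.0.2.0/28 accept
If a review ever shows that rule without the saddr restriction, treat it as a finding, not a style preference.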
7) No meaningful logging (or logs you can’t query under pressure)
VPN logs are not optional. Without them you can’t answer basic questions: who connected, from where, what they accessed, and whether access patterns changed before the incident. During IR, “we don’t have that data” is a career-limiting phrase.
Do instead: centralize logs, keep enough retention to cover your detection and investigation window, and index by username/device/IP. Log auth outcomes, assigned IPs, bytes transferred, and config changes.
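For OpenVPN, the baseline is a few server.conf directives; shipping those files into your central pipeline is a separate (and non-optional) step. A sketch, with paths as placeholders:
# connection table: who is connected, assigned IP, bytes; rewritten every 30 seconds
status /var/log/openvpn/status.log 30
# persistent log that survives restarts
log-append /var/log/openvpn/server.log
# verbosity 3 is a reasonable default; 4+ gets noisy fast
verb 3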
8) Ignoring endpoint posture and device trust
VPNs often authenticate the user but not the device. So a phished password on an unmanaged laptop becomes “legitimate access.”
Do instead: enforce device posture where possible: managed device certs, EDR presence, disk encryption, minimum OS versions, and revoked access for non-compliant endpoints. If you can’t enforce it, reduce routes and privileges.
9) Bad DNS handling: leaks, split-horizon confusion, and internal domain exposure
VPN clients that keep using public resolvers leak internal domains (useful to attackers) and break applications. Clients that use internal DNS for everything can create outages when the resolver is unreachable or slow.
Do instead: configure DNS explicitly. Use internal resolvers over the tunnel, set search domains intentionally, and monitor resolver latency. Consider DoT/DoH policies carefully: they can bypass corporate DNS controls unless managed.
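On a systemd-resolved client, split DNS is two commands (persist the same settings via wg-quick’s DNS= line or your VPN client profile). A sketch using the placeholder resolver and domain from this article; the ~ prefix makes corp.example a routing domain, so only those queries go to the VPN resolver:
cr0x@server:~$ resolvectl dns wg0 10.20.0.53
cr0x@server:~$ resolvectl domain wg0 '~corp.example'
These are runtime settings; if they vanish after a reconnect, your client profile is overriding them.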
10) MTU/fragmentation blind spots that look like “random outages”
Encapsulation reduces effective MTU. Some paths drop ICMP fragmentation-needed messages. Result: certain apps hang, large packets stall, and everyone blames “the VPN being flaky.”
Do instead: set MSS clamping or adjust MTU based on encapsulation overhead. Validate with real tests (not just ping). Monitor for retransmits and blackhole MTU patterns.
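The classic clamp is one mangle-table rule per direction; a sketch assuming wg0 is the tunnel interface (nftables has an equivalent: tcp option maxseg size set rt mtu in the forward chain):
cr0x@server:~$ sudo iptables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
cr0x@server:~$ sudo iptables -t mangle -A FORWARD -i wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
Clamping only helps TCP; UDP-heavy apps still need a sane interface MTU.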
11) “Performance optimizations” that remove safety rails
Disabling rekeying because it “causes blips.” Turning off perfect forward secrecy because CPU. Increasing session timeouts to reduce login prompts. These are all tempting, and they all age badly.
Do instead: optimize with measurement and preserve security properties. Scale gateways, use modern crypto, offload where appropriate, and tune timeouts with threat models—not irritation levels.
12) No segmentation and no egress control once inside
Even with least-privilege inbound, you still need egress control. If a VPN client can reach internal assets and also exfiltrate freely to the internet, you’ve built a data pump.
Do instead: restrict access to sensitive networks, add per-segment firewalling, and monitor egress. For highly sensitive environments, force all traffic through inspection and control outbound destinations.
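As a sketch of what egress control from the VPN pool can look like in an nftables forward chain (the proxy address and port are hypothetical), allow the sanctioned path out and log everything else:
iif "wg0" oif "eth0" ip daddr 10.20.0.80 tcp dport 3128 accept
iif "wg0" oif "eth0" counter log prefix "vpn-egress-denied " drop
The log line is not decoration; it’s how you notice the data pump before the data finishes pumping.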
Practical tasks with commands (output → decision)
These are tasks you can run today. Each one includes a realistic command, an example output, what it means, and what decision you make from it. Assume Linux servers for VPN endpoints; adapt as needed.
Task 1: Confirm the VPN service is listening (and on what address)
cr0x@server:~$ sudo ss -lntup | grep -E ':(1194|443|51820)\b'
udp UNCONN 0 0 0.0.0.0:1194 0.0.0.0:* users:(("openvpn",pid=1842,fd=6))
tcp LISTEN 0 4096 0.0.0.0:443 0.0.0.0:* users:(("nginx",pid=1021,fd=12))
udp UNCONN 0 0 0.0.0.0:51820 0.0.0.0:* users:(("wg-quick",pid=911,fd=3))
What it means: The host is listening on OpenVPN UDP/1194, WireGuard UDP/51820, and nginx on TCP/443 (possibly a TLS VPN portal, possibly an unrelated web service; worth confirming).
Decision: If the expected port isn’t present, stop debugging clients and fix the server/service. If it’s bound to 127.0.0.1 or an internal IP unexpectedly, you’ve found why remote clients can’t connect.
Task 2: Check firewall exposure of VPN and management ports
cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
chain input {
type filter hook input priority 0; policy drop;
ct state established,related accept
iif "lo" accept
tcp dport 22 ip saddr 198.51.100.10/32 accept
udp dport 51820 accept
udp dport 1194 accept
tcp dport 443 accept
counter reject with icmpx type port-unreachable
}
}
What it means: Default drop, with exceptions. SSH is restricted to a single IP; VPN ports are open to the world.
Decision: Management ports should be restricted; VPN listener ports are typically public. If you see admin UI ports (or SSH) open broadly, fix that before you argue about cipher suites.
Task 3: Detect brute-force or credential stuffing patterns in auth logs (OpenVPN example)
cr0x@server:~$ sudo awk '/AUTH_FAILED/ {print $NF}' /var/log/openvpn/server.log | sort | uniq -c | sort -nr | head
148 203.0.113.77
61 203.0.113.78
14 198.51.100.200
What it means: The same IPs are producing large numbers of authentication failures.
Decision: Add rate limiting, temporarily block abusive IPs, and validate MFA enforcement. Also review whether usernames are being enumerated.
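A containment sketch while you investigate, using the offending addresses from the output above (insert places the rule ahead of the existing accepts; the durable fix is rate limiting or fail2ban-style automation, not a hand-curated blocklist):
cr0x@server:~$ sudo nft insert rule inet filter input ip saddr { 203.0.113.77, 203.0.113.78 } udp dport 1194 counter drop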
Task 4: Verify WireGuard peer activity and spot suspicious “always on” peers
cr0x@server:~$ sudo wg show
interface: wg0
public key: n5nGx8mX3lqvOa4mGz4bQj2oTt8f2p8h5zQpYqzQqXY=
listening port: 51820
peer: qk9Wb7q9j0p9xqv0t2h1V9kqk3rW8Z8l0y6H7m8b9cA=
preshared key: (hidden)
endpoint: 203.0.113.23:51122
allowed ips: 10.44.0.10/32
latest handshake: 12 seconds ago
transfer: 18.34 GiB received, 92.10 GiB sent
persistent keepalive: every 25 seconds
peer: NlG3b8pVj1kP0yXcVx3BzYt1d9v0m8a1q2w3e4r5t6Y=
endpoint: (none)
allowed ips: 10.44.0.11/32
latest handshake: 9 days, 2 hours, 41 minutes ago
transfer: 122.10 MiB received, 88.03 MiB sent
What it means: One peer is actively connected and transferring large volumes; another hasn’t connected in nine days.
Decision: Large sustained transfers might be normal (backups) or might be exfiltration—correlate with user/device identity and expected behavior. Stale peers should be pruned or rotated; “ghost” credentials are a gift to attackers.
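Pruning is scriptable. A sketch using the stale peer from the output above; latest-handshakes prints one Unix timestamp per peer (0 means never):
cr0x@server:~$ sudo wg show wg0 latest-handshakes
cr0x@server:~$ sudo wg set wg0 peer 'NlG3b8pVj1kP0yXcVx3BzYt1d9v0m8a1q2w3e4r5t6Y=' remove
Remove the peer from the persistent config too, or it resurrects on the next interface restart.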
Task 5: Confirm assigned VPN IPs and overlap with internal subnets (routing risk)
cr0x@server:~$ ip -4 addr show dev tun0
7: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 500
inet 10.8.0.1/24 scope global tun0
valid_lft forever preferred_lft forever
What it means: The VPN pool is 10.8.0.0/24. If your internal network also uses 10.8.0.0/16 anywhere, you’re in for route collisions and “random” reachability issues.
Decision: Choose non-overlapping address space and document it. If you already overlap, migrate: collisions become security bugs and outages.
Task 6: Verify what routes are pushed to clients (OpenVPN server config)
cr0x@server:~$ sudo grep -E 'push "route|push "dhcp-option' /etc/openvpn/server.conf
push "route 10.20.0.0 255.255.0.0"
push "route 10.30.10.0 255.255.255.0"
push "dhcp-option DNS 10.20.0.53"
push "dhcp-option DOMAIN corp.example"
What it means: Clients are being told to route two internal networks and use an internal DNS resolver plus search domain.
Decision: Ensure these routes represent least privilege, not “everything internal.” If you see a default route being pushed (redirect-gateway or 0.0.0.0/0), that’s full tunnel, which is good or bad depending on what you actually decided. If DNS isn’t pushed, expect leaks and broken internal name resolution.
Task 7: Catch DNS leaks from a connected client (Linux client example)
cr0x@server:~$ resolvectl status | sed -n '1,120p'
Global
LLMNR setting: yes
MulticastDNS setting: no
DNSOverTLS setting: no
DNSSEC setting: no
DNSSEC supported: no
Link 3 (wg0)
Current Scopes: DNS
Protocols: +DefaultRoute
Current DNS Server: 10.20.0.53
DNS Servers: 10.20.0.53
DNS Domain: corp.example
Link 2 (wlp2s0)
Current Scopes: DNS
Protocols: +DefaultRoute
Current DNS Server: 1.1.1.1
DNS Servers: 1.1.1.1 8.8.8.8
What it means: Both the VPN interface and Wi‑Fi interface have DNS resolvers. Depending on resolver ordering and routing, some queries may leak to public DNS.
Decision: Configure the client to prefer VPN DNS for internal domains at minimum (split DNS), or force VPN DNS entirely for full tunnel. Test with internal hostnames and watch where queries go.
Task 8: Validate MTU and detect blackhole fragmentation symptoms
cr0x@server:~$ ping -M do -s 1420 -c 3 10.30.10.25
PING 10.30.10.25 (10.30.10.25) 1420(1448) bytes of data.
1428 bytes from 10.30.10.25: icmp_seq=1 ttl=63 time=22.1 ms
1428 bytes from 10.30.10.25: icmp_seq=2 ttl=63 time=21.8 ms
1428 bytes from 10.30.10.25: icmp_seq=3 ttl=63 time=22.0 ms
--- 10.30.10.25 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
What it means: A 1420-byte ICMP payload with the DF bit set produces a 1448-byte IP packet, so the path MTU is at least 1448 bytes (still not the full TCP story, but a good signal).
Decision: If this fails at sizes that should work, implement MSS clamping (iptables/nft) or reduce interface MTU. Treat “some sites load, others hang” as MTU until proven otherwise.
Task 9: Check TCP retransmits and drops on the VPN interface (performance clue)
cr0x@server:~$ ip -s link show dev wg0
6: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/none
RX: bytes packets errors dropped missed mcast
9876543210 812345 0 213 0 0
TX: bytes packets errors dropped carrier collsns
1234567890 543210 0 17 0 0
What it means: Drops exist. Not necessarily catastrophic, but if drops climb with load, you have congestion, queueing, or CPU issues.
Decision: If drops are non-zero and correlated with complaints, investigate NIC queues, CPU saturation, and firewall/conntrack limits. Consider scaling out gateways or applying QoS.
Task 10: Confirm system time and NTP sync (auth and certificate failures love clock drift)
cr0x@server:~$ timedatectl
Local time: Sat 2025-12-27 10:14:05 UTC
Universal time: Sat 2025-12-27 10:14:05 UTC
RTC time: Sat 2025-12-27 10:14:04
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: no
NTP service: active
RTC in local TZ: no
What it means: NTP is active but the clock isn’t synchronized yet. That can break TLS handshakes, MFA tokens, and cert validation.
Decision: Fix time sync immediately. If your IdP says “invalid token” and your VPN says “certificate not yet valid,” check time before you check anything else.
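The fix is usually one command plus a minute of patience; if the clock never syncs, check that UDP/123 egress is allowed from the gateway. A sketch on a systemd-timesyncd host:
cr0x@server:~$ sudo timedatectl set-ntp true
cr0x@server:~$ timedatectl | grep -E 'synchronized|NTP service'
System clock synchronized: yes
NTP service: active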
Task 11: Audit who can SSH to the VPN gateway (reduce blast radius)
cr0x@server:~$ sudo grep -E '^(AllowUsers|AllowGroups|PasswordAuthentication|PermitRootLogin)' /etc/ssh/sshd_config
PermitRootLogin no
PasswordAuthentication no
AllowGroups sre-vpn-admins
What it means: Root login is disabled, password auth is off (good), and only a specific group can SSH.
Decision: Keep management tight. If password auth is enabled, disable it unless you have a very specific need and compensating controls.
Task 12: Check for risky NAT rules that let VPN clients masquerade everywhere
cr0x@server:~$ sudo iptables -t nat -S | sed -n '1,80p'
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-A POSTROUTING -s 10.8.0.0/24 -o eth0 -j MASQUERADE
What it means: VPN clients (10.8.0.0/24) are NATed out of eth0. That could be intended for internet access—or it could be an accidental “everyone can exfiltrate” path.
Decision: If you allow VPN clients outbound, enforce egress filtering and logging. If you don’t need it, remove NAT and route only to required internal destinations.
Task 13: Confirm IKEv2/IPsec status and failing negotiations (strongSwan example)
cr0x@server:~$ sudo ipsec statusall | sed -n '1,120p'
Status of IKE charon daemon (strongSwan 5.9.8, Linux 6.8.0):
uptime: 2 hours, since Dec 27 08:10:31 2025
worker threads: 10 of 16 idle, job queue: 0/0/0/0, scheduled: 3
Listening IP addresses:
192.0.2.10
Connections:
roadwarrior: %any...192.0.2.10 IKEv2, dpddelay=30s
roadwarrior: local: [vpn-gw] uses EAP
roadwarrior: remote: uses EAP
Security Associations (1 up, 0 connecting):
roadwarrior[7]: ESTABLISHED 4 minutes ago, 203.0.113.55[alice]...192.0.2.10[vpn-gw]
roadwarrior{9}: INSTALLED, TUNNEL, reqid 1, ESP in UDP SPIs: c3b2a1f2_i c4d3b2a1_o
What it means: The daemon is up, listening, and at least one SA is established. If users can’t connect, it’s not a total service outage.
Decision: If you see many connections “connecting” and none established, look at auth backends (RADIUS/IdP), proposal mismatches, or blocked UDP 500/4500 (IKE/NAT-T) traffic.
Task 14: Validate that VPN client traffic is constrained by firewall policy (example)
cr0x@server:~$ sudo nft list chain inet filter forward
table inet filter {
chain forward {
type filter hook forward priority 0; policy drop;
ct state established,related accept
iif "wg0" oif "lan0" ip daddr 10.30.10.25 tcp dport { 443, 22 } accept
iif "wg0" oif "lan0" ip daddr 10.30.20.0/24 tcp dport 5432 accept
counter reject with icmpx type admin-prohibited
}
}
What it means: VPN interface traffic is only allowed to specific destinations/ports; everything else is denied.
Decision: This is what “least privilege VPN” looks like. If your forward chain policy is ACCEPT (or rules are wide-open), you’re one compromised laptop away from an internal scanning party.
Common mistakes: symptoms → root cause → fix
“VPN connects, but nothing internal loads”
Symptoms: Client reports connected; internal sites time out; sometimes ping works, sometimes not.
Root cause: Missing routes (not pushed), wrong subnet mask, firewall drops on forward chain, or DNS pointing to public resolvers.
Fix: Confirm route push (server config), verify client routing table, enforce DNS over tunnel, and ensure forward rules explicitly allow required destinations.
“Only some websites/apps break over VPN”
Symptoms: Slack works, certain SaaS pages hang, file uploads stall, RDP freezes, large downloads fail.
Root cause: MTU blackholing due to encapsulation and blocked ICMP fragmentation-needed; MSS not clamped.
Fix: Lower MTU on VPN interface or clamp TCP MSS on the gateway. Test with DF pings and real TCP sessions.
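A sketch of the interface-MTU approach for a WireGuard tunnel; 1400 is a conservative placeholder, not a magic number, and the persistent equivalent is MTU = 1400 in the wg-quick [Interface] section (or the clamp rules shown earlier):
cr0x@server:~$ sudo ip link set dev wg0 mtu 1400
Retest with the DF pings from Task 8 after every change, not just the first one.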
“Users get repeated MFA prompts or reauth loops”
Symptoms: People authenticate, then immediately get kicked; tokens “invalid” intermittently.
Root cause: Clock drift on gateway or IdP integration issues; overly aggressive session timeouts; broken SSO callback due to DNS/egress.
Fix: Fix NTP, verify time sync; align session lifetimes; ensure required IdP endpoints are reachable and not blocked by egress controls.
“One user is slow, everyone else fine”
Symptoms: Complaints isolated to one person; speed tests vary wildly.
Root cause: Endpoint CPU/crypto, Wi‑Fi issues, captive portal interference, local firewall, or DNS resolver mismatch.
Fix: Have them test on wired network or alternate ISP, compare DNS settings, capture interface stats, and verify they’re on the expected VPN profile.
“We see internal scanning from a VPN IP”
Symptoms: IDS alerts, port scans, authentication attempts against many internal hosts from an address in VPN pool.
Root cause: Compromised endpoint with broad network access; no segmentation; insufficient anomaly detection; stale credential not revoked.
Fix: Immediately quarantine that VPN identity (disable account, revoke cert/key), restrict VPN routes, implement per-user/per-group policy, and add alerting on scan-like behavior from VPN pools.
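Revocation mechanics depend on the stack. A sketch assuming an OpenVPN server backed by an easy-rsa 3 PKI (paths and the username are placeholders, and the server config must already reference the CRL via crl-verify); for WireGuard, the equivalent is the peer removal shown earlier:
cr0x@server:~$ cd /etc/openvpn/easy-rsa && sudo ./easyrsa revoke alice
cr0x@server:~$ sudo ./easyrsa gen-crl && sudo cp pki/crl.pem /etc/openvpn/crl.pem
Revocation only blocks new handshakes, so also terminate the identity’s live sessions and disable the IdP account.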
“After a gateway update, clients can’t connect”
Symptoms: Suddenly no one can establish tunnels; logs show handshake failures or proposal mismatch.
Root cause: Crypto suite changes, disabled legacy algorithms without client updates, or a config management drift that changed listener binding.
Fix: Use staged rollouts, keep compatibility windows, and test with representative clients. Treat VPN like an API with versioned behavior.
Three corporate-world mini-stories
Mini-story #1: The incident caused by a wrong assumption
They had a “secure VPN” because it used modern encryption and MFA. Leadership liked that sentence. The VPN pool lived inside a large RFC1918 range that the company already used across data centers, branches, and cloud VPCs, because “it’s all private anyway.”
When a contractor’s laptop got infected, the attacker didn’t need to break the VPN. They authenticated normally. Now they had what looked like an internal address. Internal services trusted those source IPs because a decade of firewall rules equated “private IP” with “employee network.” A few legacy admin panels were reachable. A few dev environments had shared credentials. You can guess the rest.
The first alert came from a database team noticing odd login attempts from a “known internal subnet.” No one treated it as hostile at first. The second alert was a burst of internal DNS queries for hostnames that only show up in old runbooks. That’s when the SRE on call got curious.
The post-incident review was uncomfortable because nothing was “hacked” in the Hollywood sense. The assumption did the damage: “VPN equals internal equals trusted.” The fix was equally unglamorous: move VPN clients into a dedicated subnet, put that subnet behind explicit policy, and remove IP-based trust from admin surfaces.
They also learned that “private address space” is not a security boundary. It’s just math.
Mini-story #2: The optimization that backfired
A different company had persistent complaints about VPN speed. Their gateway CPU spiked at peak hours, and the helpdesk got tired of hearing “my video calls stutter.” Someone proposed enabling split tunneling by default to keep internet traffic off the VPN. The change reduced gateway load immediately. Charts went down and everyone smiled.
Two months later, a security engineer noticed a pattern: internal hostnames showing up in public DNS logs from a few endpoints. Not the whole query (thanks to search domain quirks), but enough to reveal internal naming conventions. Meanwhile, a phishing kit targeted remote users on hotel Wi‑Fi and served a fake SSO page. The attacker got credentials and a live session token.
With split tunneling, the compromised laptop’s internet traffic bypassed corporate inspection and egress controls. The attacker used the browser session to access internal web apps over VPN routes while exfiltrating data directly over the local internet connection. It was fast, quiet, and didn’t trip the usual outbound alarms because the exfiltration never touched corporate networks.
The optimization worked. The security posture didn’t. They kept split tunneling, but only after they implemented split DNS properly, tightened routes, enforced endpoint posture, and added client-side firewall rules to block forwarding/bridging. The performance charts stayed good. The risk stopped being invisible.
Mini-story #3: The boring but correct practice that saved the day
One org ran two VPN gateways per region with identical config, pushed by config management. Every change required a pull request, and every PR had a checklist: ports, routes, auth methods, DNS, logging, and a rollback plan. It was as exciting as a spreadsheet, which is a compliment.
One Friday, their identity provider had a partial outage affecting a subset of MFA challenges. Users began retrying repeatedly, causing a wave of authentication failures. The VPN gateways stayed up, but the auth logs looked like an attack: lots of failures, many IPs, high volume. Most places would have reacted by blocking IPs aggressively and making the outage worse.
They didn’t. Because they had good logs, they could distinguish “failed MFA challenges from legitimate accounts” from “invalid usernames from random addresses.” They also had rate limits that slowed brute force attempts without punishing normal retries too harshly.
Most importantly, they had a tested break-glass procedure with a separate, tightly scoped access path for on-call engineers. That path used device certificates and allowed access only to critical admin hosts. The rest of the company waited out the IdP issue; the people who needed to fix production could still get in.
Joke #2: The only thing less glamorous than config management is explaining to auditors why you don’t have it.
Checklists / step-by-step plan
Step-by-step hardening plan (do this in order)
- Define the threat model. Remote access for employees? Contractors? Site-to-site? High-risk roles? If you can’t say who you’re protecting against, you’ll optimize for vibes.
- Carve a dedicated VPN subnet/pool. Route it through a policy enforcement point (firewall/security group) with default deny.
- Implement least-privilege routing. Push only required routes. Use per-group policies (engineering vs finance vs vendors).
- Require MFA and device-bound authentication. Prefer certificates or managed device identity plus MFA, not passwords alone.
- Lock down management access. No admin UI on the public internet. Restrict SSH and admin APIs. Log admin actions.
- Get DNS right. Decide full tunnel vs split tunnel; implement split DNS where needed; prevent leaks; monitor resolver performance.
- Handle MTU explicitly. Set MTU/MSS clamping based on encapsulation and validate with tests.
- Centralize logging and alerting. Log connections, auth outcomes, assigned IPs, bytes, and config changes. Alert on anomalies from VPN pools.
- Build rotation and revocation. Rotate certs/keys regularly; ensure offboarding removes access within hours, not weeks.
- Run incident drills. Practice “disable a user,” “rotate a key,” “isolate a subnet,” and “roll back a config” under time pressure.
Operational checklist for every VPN change
- Does this change expand reachable networks or ports from VPN pools?
- Are we accidentally enabling split tunneling or full tunneling?
- Are we changing crypto suites, rekey intervals, or session lifetimes?
- Will this break older clients, and do we have a compatibility plan?
- Are logs still being generated and forwarded after the change?
- Do we have a rollback path that doesn’t require the VPN to be working?
Offboarding checklist (the access you forget is the access that gets abused)
- Disable the identity account (IdP) and revoke sessions.
- Revoke certificate/device identity or remove WireGuard peer.
- Invalidate API tokens used by VPN clients (if applicable).
- Remove from VPN authorization groups and re-run policy deployment.
- Search logs for recent VPN activity and flag anomalies.
FAQ
Is a VPN still worth using if we’re moving to “zero trust”?
Yes, but treat it as one access method, not a trust grant. A VPN can be a transport. Zero trust is the policy model on top: least privilege, continuous verification, segmentation, and strong identity.
Should we disable split tunneling?
For high-risk roles and sensitive environments, yes by default. For general corporate use, split tunneling can be acceptable if you implement split DNS, endpoint controls, and strict internal route scoping.
What’s the single biggest VPN security improvement?
Stop giving VPN clients broad network access. Put them in a dedicated subnet and allow only what’s needed. That one decision shrinks incident blast radius dramatically.
Are VPN appliances inherently insecure compared to self-hosted?
No. Appliances fail when they’re unpatched, overexposed, or poorly monitored—same as self-hosted. The difference is operational control: you need a reliable patch pipeline, visibility, and rollback either way.
How do we detect compromised VPN credentials?
Look for anomalies: new geographies, unusual connection times, high auth failure rates followed by success, unexpected byte transfer spikes, and internal scanning patterns from VPN pools.
Do we need full packet capture on the VPN gateway?
Not always, and it can be expensive and sensitive. Start with strong connection/auth logs and flow logs. Use targeted packet capture for investigations and MTU/performance debugging.
Is “private IP space” safe to trust?
No. Private IP space is not a security property. Trust should come from authenticated identity, device posture, and explicit policy—not from an address that happens to start with 10.
What about always-on VPN for managed laptops?
Always-on can be great: consistent policy enforcement and fewer “forgot to connect” risks. But it raises availability stakes: if your VPN breaks, your workforce breaks. Build redundancy and test failure modes.
How often should we rotate VPN certificates/keys?
As often as you can operationally sustain with automation. Many orgs aim for months rather than years. Also rotate immediately on suspected compromise and on major personnel changes for shared endpoints.
Can we rely on MFA alone?
No. MFA reduces credential theft impact, but it doesn’t prevent compromised endpoints, misrouted networks, overbroad access, or data exfiltration. MFA is a seatbelt, not a roll cage.
Conclusion: next steps that actually reduce risk
VPN incidents rarely come from broken math. They come from over-trust, overreach, and under-logging. Your tunnel can be perfectly encrypted and still function as a high-speed on-ramp for attackers if you treat it like a magic cloak.
Do these next, in this order:
- Move VPN clients into a dedicated subnet and apply default-deny forwarding.
- Reduce routes to least privilege (per group), and tighten DNS behavior deliberately.
- Require MFA plus device-bound auth, and automate rotation/revocation.
- Remove internet exposure from management surfaces; enforce admin logging.
- Implement the fast diagnosis playbook as an on-call runbook, with the commands above pre-approved.
If you do nothing else: stop treating VPN as “inside.” It’s outside with better encryption. Design accordingly.