WireGuard is usually boring—in the best way. Until it isn’t. You deploy a tunnel, the interface comes up, and then the status sits there like a dead fish: latest handshake: (none) or “Handshake did not complete.” Meanwhile your on-call phone starts inventing new ways to vibrate.
This failure is deceptively simple: one side can’t complete the cryptographic dance with the other. The trick is that the root cause is often not cryptography at all. It’s NAT state expiring, a UDP port quietly blocked, a route pointing into the void, MTU fragmentation turning packets into confetti, or time drift making replay protection do its job a little too well.
What “handshake” really means (and what it doesn’t)
WireGuard’s handshake is a small, fast exchange of UDP packets that establishes ephemeral session keys. If it completes, you’ll see a “latest handshake” timestamp and bytes flowing. If it doesn’t, the interface can still appear “up” because it’s just an interface; WireGuard doesn’t do “connected” in the TCP sense.
That last sentence matters because a ton of troubleshooting mistakes happen here. People see wg0 exists and assume connectivity. WireGuard isn’t making a promise; it’s offering an opportunity.
Also: “handshake did not complete” is not one bug. It’s a class of failures:
- Packets never leave the client (local firewall, wrong routing table, wrong endpoint hostname).
- Packets leave but never arrive (ISP blocks UDP, upstream firewall, wrong port forward, CGNAT).
- Packets arrive but response can’t return (asymmetric routing, NAT mapping mismatch, rp_filter, egress firewall).
- Handshake completes but traffic fails (AllowedIPs, routing, MTU, DNS). This one looks similar if you only watch “ping”.
- Handshake intermittently fails (NAT timeout, roaming endpoint changes, keepalive missing).
- Handshake rejected (wrong keys, stale peer, time/replay problems). Less common, but it happens.
One quote, because it’s still the best operating advice in this entire space: “Hope is not a strategy.” — Gene Kranz. You’re going to measure, not guess.
Fast diagnosis playbook (check first/second/third)
If you only do one thing from this article, do this in order. It will cut your troubleshooting time from “ruin my evening” to “mildly annoying.”
First: prove UDP reachability to the server port
- On the server, verify it’s listening on the port you think it is (`wg show` and `ss -lunp`).
- On the server, run `tcpdump` on the public interface for that UDP port.
- From the client, initiate traffic (even a single ping through the tunnel, or just bring up the interface).
- If server tcpdump sees nothing: it’s upstream of WireGuard (NAT/port forward/firewall/ISP/CGNAT/wrong IP).
Second: if packets arrive, prove the server replies and the client receives replies
- Keep tcpdump on the server: do you see outgoing UDP back to the client?
- Run tcpdump on the client: do you see incoming UDP from the server?
- If the server replies but the client never sees it: NAT in the middle, return path, or a stateful firewall drop.
Third: if handshakes happen but traffic fails, debug AllowedIPs, routes, and MTU
- `wg show` should show a recent handshake and increasing counters.
- `ip route get` for destination IPs should point into `wg0` when appropriate.
- Ping with DF set (or MSS clamp) to spot MTU black holes.
That’s it. You’re hunting for the first broken link in the chain: “client emits UDP → server receives UDP → server responds → client receives response → keys established → routes send traffic into tunnel → traffic survives MTU.”
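That chain of links can be encoded as a tiny helper: feed it yes/no observations (taken from tcpdump and wg show) in order, and it names the first broken link. A sketch, not a tool; the stage names simply mirror the chain above.

```shell
#!/usr/bin/env bash
# Sketch: the diagnostic chain as an ordered check. Arguments are yes/no
# answers to each stage, in order; output is the first stage that failed.
first_broken_link() {
  local stages=("client emits UDP" "server receives UDP" "server responds" \
                "client receives response" "keys established")
  local i=0 ok
  for ok in "$@"; do
    if [ "$ok" != "yes" ]; then
      echo "first broken link: ${stages[$i]}"
      return 0
    fi
    i=$((i+1))
  done
  echo "chain intact: debug routing/AllowedIPs/MTU instead"
}
```

For example, `first_broken_link yes no` reports that the server never receives UDP, which sends you to NAT, port forwards, and upstream firewalls rather than to key material.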
Facts & context you can use in arguments with coworkers
- WireGuard was designed to be minimal. The codebase is famously small compared to older VPN stacks, which reduces “unknown unknowns” during outages.
- It uses UDP by design. That avoids TCP-over-TCP meltdown and plays nicely with roaming, but it means you must think in terms of reachability and NAT state.
- The handshake is based on Noise. The protocol family is a modern approach to authenticated key exchange that aims for simplicity and strong properties.
- “AllowedIPs” is both ACL and routing policy. This is a deliberate design choice: it’s powerful, and it’s a common foot-gun.
- WireGuard doesn’t do liveness the way people expect. There’s no “connected” state; it sends handshake initiation when it has traffic to send.
- PersistentKeepalive exists mostly for NAT. If you’ve ever yelled at a hotel Wi‑Fi, this option is WireGuard’s polite way of tapping the NAT table every N seconds.
- Many consumer routers have short UDP timeouts. It’s not unusual to see mappings expire in 30–120 seconds when idle.
- Carrier-grade NAT is now normal. Plenty of “public internet” connections are not actually public from inbound perspective. No port forward can fix that.
- Replay protection is time-sensitive. If clocks are wildly off, some legitimate packets can be treated like replays. Security is doing its job; you’re the one time-traveling.
Joke 1/2: NAT is like a corporate helpdesk: it forgets you exist the moment you stop sending tickets.
A practical mental model: packets, state, and time
Think of the handshake as four things that must all be true:
- Correct addressing: client knows server endpoint (IP:port), server knows how to reply.
- Bidirectional UDP reachability: not “ping works,” but “UDP packets traverse both ways.”
- State holds long enough: NAT and firewalls keep the mapping/state long enough for handshake and subsequent traffic.
- Cryptographic identity matches: public keys and allowed IPs align; replay protection isn’t tripped by time weirdness.
This model is intentionally boring. You want boring models in production. Fancy models are for conference talks and people who don’t carry pagers.
Hands-on tasks: commands, outputs, and decisions (12+)
These are the tasks I run in order when debugging “handshake did not complete.” Each includes what you should expect, what weird output means, and what decision to make next.
Task 1: Confirm WireGuard is actually running and which port it listens on
cr0x@server:~$ sudo wg show
interface: wg0
public key: 8x3u...redacted...
listening port: 51820
peer: 7p1Q...redacted...
endpoint: (none)
allowed ips: 10.10.0.2/32
latest handshake: (none)
transfer: 0 B received, 0 B sent
What it means: you have a listening port (good). “endpoint: (none)” is normal on servers; WireGuard learns client endpoints dynamically.
Decision: if the listening port is not what you configured, you’re debugging the wrong port. Fix config first, then proceed.
Task 2: Verify the OS is listening on UDP as expected
cr0x@server:~$ sudo ss -lunp | grep 51820
UNCONN 0 0 0.0.0.0:51820 0.0.0.0:*
What it means: the kernel is bound on 0.0.0.0:51820, so it should accept packets to any local address. With in-kernel WireGuard there is no owning userspace process, so the process column is empty; only userspace implementations like wireguard-go show one.
Decision: if you don’t see it, WireGuard isn’t up (or bound to a different address). Fix service first; don’t touch NAT yet.
Task 3: Check the endpoint configured on the client (you’d be shocked)
cr0x@client:~$ sudo wg show
interface: wg0
public key: dE5...redacted...
listening port: 48712
peer: 8x3u...redacted...
endpoint: 203.0.113.10:51820
allowed ips: 10.10.0.0/24
latest handshake: (none)
transfer: 0 B received, 0 B sent
persistent keepalive: 25
What it means: client thinks server is at 203.0.113.10:51820. Keepalive is set (good for NAT).
Decision: if endpoint IP is wrong (old public IP, typo, wrong DNS), fix it and retest before you do anything else.
Task 4: Prove the server receives any UDP packets on the WireGuard port
cr0x@server:~$ sudo tcpdump -ni eth0 udp port 51820
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
Now bring up the client tunnel or generate traffic.
cr0x@client:~$ sudo wg-quick up wg0
[#] ip link add wg0 type wireguard
[#] wg setconf wg0 /dev/fd/63
[#] ip address add 10.10.0.2/32 dev wg0
[#] ip link set mtu 1420 up dev wg0
[#] ip route add 10.10.0.0/24 dev wg0
[#] resolvconf -a wg0 -m 0 -x
If packets are arriving, tcpdump should show something like:
cr0x@server:~$ sudo tcpdump -ni eth0 udp port 51820
12:22:31.104512 IP 198.51.100.23.48712 > 203.0.113.10.51820: UDP, length 148
12:22:31.104889 IP 203.0.113.10.51820 > 198.51.100.23.48712: UDP, length 92
What it means: packets reach server and server replies. If handshake still doesn’t complete, you’re in “cryptographic identity / AllowedIPs / time weirdness” territory.
Decision:
- No inbound lines at all: this is upstream (NAT/firewall/ISP/port forward). Go there next.
- Inbound but no outbound reply: server firewall or policy routing is blocking replies.
- Both directions present: move on to keys/time/AllowedIPs.
Task 5: Check iptables/nftables rules that might drop UDP/51820
cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
chain input {
type filter hook input priority 0; policy drop;
ct state established,related accept
iif "lo" accept
tcp dport 22 accept
udp dport 51820 accept
counter drop
}
}
What it means: UDP 51820 is explicitly allowed. Great.
Decision: if you don’t see an allow rule (or policy is drop), add it. If policy is accept but still nothing arrives in tcpdump, the firewall is not your problem.
Task 6: Verify IP forwarding if this is a site-to-site or you expect LAN access
cr0x@server:~$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 0
What it means: forwarding is off. Handshake can still work, but routed traffic won’t.
Decision: if you need to route between wg0 and another interface, enable forwarding and configure firewall/NAT accordingly.
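Making that persistent might look like this (the file path is illustrative; any file under /etc/sysctl.d works, applied with `sysctl --system`):

```ini
# /etc/sysctl.d/99-wg-forward.conf  (illustrative path)
net.ipv4.ip_forward = 1
# If you also route IPv6 through the tunnel:
net.ipv6.conf.all.forwarding = 1
```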
Task 7: Check rp_filter (asymmetric routing’s petty enforcer)
cr0x@server:~$ sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth0.rp_filter
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.eth0.rp_filter = 1
What it means: strict reverse path filtering is on. If replies go out a different interface than the kernel expects, packets can be dropped.
Decision: in multi-homed or policy routing setups, set rp_filter to 2 (loose) or adjust routing so return path matches. Don’t flip it blindly on internet-facing boxes unless you understand why.
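On a multi-homed box where you’ve decided loose mode is appropriate, a persistent sketch (path and interface name illustrative):

```ini
# /etc/sysctl.d/99-wg-rpfilter.conf  (illustrative path)
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.eth0.rp_filter = 2
```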
Task 8: Confirm the server has the correct public IP and the route back to the client
cr0x@server:~$ ip route get 198.51.100.23
198.51.100.23 via 203.0.113.1 dev eth0 src 203.0.113.10 uid 0
cache
What it means: server will reply out eth0 using 203.0.113.10. Good.
Decision: if the route points somewhere unexpected, you’re debugging asymmetric routing. Fix routing before touching WireGuard configs.
Task 9: Confirm keys match what you think they are (without leaking secrets)
cr0x@server:~$ sudo wg show wg0 public-key
8x3u...redacted...
cr0x@client:~$ sudo wg show wg0 peers
8x3u...redacted...
What it means: client is targeting the server’s public key.
Decision: if the keys don’t match, handshake will never complete. Fix keys, restart interface, retry tcpdump.
Task 10: Check system time and NTP sync on both ends
cr0x@server:~$ timedatectl
Local time: Sat 2025-12-27 12:26:02 UTC
Universal time: Sat 2025-12-27 12:26:02 UTC
RTC time: Sat 2025-12-27 12:26:01
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
What it means: server clock is sane and synced.
Decision: if one side is not synchronized and time is far off, fix NTP before you keep chasing ghosts.
Task 11: Observe handshake state transitions and counters in real time
cr0x@client:~$ watch -n 1 sudo wg show wg0
Every 1.0s: sudo wg show wg0
interface: wg0
public key: dE5...redacted...
listening port: 48712
peer: 8x3u...redacted...
endpoint: 203.0.113.10:51820
allowed ips: 10.10.0.0/24
latest handshake: 4 seconds ago
transfer: 3.12 KiB received, 2.98 KiB sent
persistent keepalive: 25
What it means: handshake now completes; counters move.
Decision: if handshake timestamp updates but your applications still fail, stop blaming the handshake. Move to routing/AllowedIPs/MTU/DNS.
Task 12: Validate AllowedIPs and routing decisions with ip route get
cr0x@client:~$ ip route get 10.10.0.1
10.10.0.1 dev wg0 src 10.10.0.2 uid 1000
cache
What it means: traffic to 10.10.0.1 will go into wg0.
Decision: if it goes out your default interface, your AllowedIPs/routes aren’t installed the way you think.
Task 13: Detect MTU black holes with DF ping
cr0x@client:~$ ping -M do -s 1380 -c 3 10.10.0.1
PING 10.10.0.1 (10.10.0.1) 1380(1408) bytes of data.
1388 bytes from 10.10.0.1: icmp_seq=1 ttl=64 time=32.1 ms
1388 bytes from 10.10.0.1: icmp_seq=2 ttl=64 time=31.8 ms
1388 bytes from 10.10.0.1: icmp_seq=3 ttl=64 time=32.0 ms
--- 10.10.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
What it means: at least 1380-byte payloads survive; MTU is probably okay.
Decision: if you get “Frag needed and DF set” or silent loss at larger sizes, adjust MTU (commonly 1420, 1380, or lower depending on encapsulation and path).
Task 14: Prove the server is or isn’t behind CGNAT (port forwarding won’t save you)
cr0x@server:~$ ip -4 addr show dev eth0 | sed -n '1,5p'
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
inet 100.64.12.34/24 brd 100.64.12.255 scope global eth0
valid_lft forever preferred_lft forever
What it means: 100.64.0.0/10 is a carrier-grade NAT range. Your “server” does not have a real public IPv4 on its interface.
Decision: stop trying to port forward on that box; you need a public endpoint elsewhere (VPS, IPv6, relay design, or a different ISP plan).
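The range check above can be done without eyeballing octets. A hedged sketch that classifies an IPv4 address as CGNAT (100.64.0.0/10), RFC1918 private, or probably-public, using pure shell arithmetic:

```shell
#!/usr/bin/env bash
# Sketch: classify an IPv4 address. 100.64.0.0/10 is CGNAT (RFC 6598);
# 10/8, 172.16/12, 192.168/16 are RFC1918 private; everything else is
# treated as public (ignoring loopback/link-local for brevity).
ip_class() {
  local a b c d n
  IFS=. read -r a b c d <<< "$1"
  n=$(( (a << 24) + (b << 16) + (c << 8) + d ))
  if (( n >= (100 << 24) + (64 << 16) && n < (100 << 24) + (128 << 16) )); then
    echo cgnat
  elif (( a == 10 || (a == 192 && b == 168) || (a == 172 && b >= 16 && b <= 31) )); then
    echo private
  else
    echo public
  fi
}
```

Run it against your WAN address: `cgnat` or `private` means inbound port forwarding on this box alone cannot work.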
Task 15: Confirm the server sees the client’s source IP/port consistently (NAT rebinding)
cr0x@server:~$ sudo wg show wg0 | sed -n '1,30p'
interface: wg0
public key: 8x3u...redacted...
listening port: 51820
peer: 7p1Q...redacted...
endpoint: 198.51.100.23:48712
allowed ips: 10.10.0.2/32
latest handshake: 12 seconds ago
transfer: 14.21 KiB received, 13.88 KiB sent
What it means: WireGuard has learned the endpoint. If the endpoint keeps changing every minute, a NAT device is rebinding frequently.
Decision: set PersistentKeepalive = 25 on the client (or whichever side is behind NAT) and consider using a stable upstream.
NAT and port forwarding: how it fails in real networks
NAT is the #1 reason handshakes don’t complete, and it’s not close. That’s because the “endpoint” is a moving target when either side sits behind a device that:
- rewrites source ports unpredictably,
- expires UDP mappings quickly,
- doesn’t support “hairpin NAT” (NAT loopback),
- or isn’t actually the NAT you need to configure (double NAT).
Know which side must be reachable
WireGuard can work in several patterns, but the simplest is: server has a public IP and open UDP port, clients connect out. That’s the “default internet shape.”
If both sides are behind NAT and neither has an inbound port mapping, you’re doing NAT traversal without a mediator. Sometimes it works by accident. Production is where accidents go to die.
Double NAT: the hidden second router
Double NAT is common: ISP modem/router doing NAT, then your own firewall doing NAT again. You forward UDP/51820 on the inner router, feel proud, and nothing works because the outer router drops it.
How to spot it quickly:
- Your router WAN IP is in a private range (192.168/10.0/172.16) or CGNAT (100.64/10).
- Server tcpdump sees nothing even though you’re sure you forwarded the port.
NAT timeouts: why it works for 30 seconds, then dies
A UDP “connection” is a NAT illusion. When a client sends UDP out, the NAT device creates a mapping (src IP:src port → public IP:public port). That mapping expires when idle. If it expires, the server’s reply goes into a void. WireGuard can recover, but only when new traffic triggers a new handshake. Users call this “randomly disconnects.” SREs call it “predictably idle.”
The fix is usually PersistentKeepalive on the NATed side. Not on the public server. On the side that needs to keep a hole open.
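On the NATed side, that looks like an ordinary peer stanza (keys, addresses, and the 25-second interval are illustrative):

```ini
# Client-side wg0.conf fragment: the NATed peer keeps the mapping alive
[Peer]
PublicKey = <server-public-key>
Endpoint = 203.0.113.10:51820
AllowedIPs = 10.10.0.0/24
PersistentKeepalive = 25
```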
Port forwarding: the three rules people ignore
- Forward UDP, not TCP. I’ve seen “TCP/UDP” toggles default to TCP-only. That’s a bad day.
- Forward to the correct internal host IP. DHCP changes are how you create intermittent outages and blame “the internet.”
- Don’t forward to a host that also runs another UDP service on that port. Sounds obvious; it’s less obvious when you run multiple tunnels and copy-paste configs.
Firewalls and UDP filtering: proving the negative
Firewalls are the second most common culprit because UDP drops don’t give you the courtesy of an error. They give you silence, which your brain interprets as mystery.
Use tcpdump as your truth serum
If the server never sees packets on UDP/51820, stop editing WireGuard configs. You have a network reachability problem. The correct tool is packet capture. Everything else is mood lighting.
Stateful firewalls and “related/established” myths
Many firewall policies allow inbound packets only if they match an established connection. UDP is “connectionless,” but stateful firewalls still track it as flows with timeouts. That’s why a keepalive helps, and it’s why return traffic might be dropped if the firewall didn’t see the outbound initiation in the expected direction.
Cloud security groups vs. host firewall
In cloud environments you often have:
- a cloud security group / network ACL,
- a host firewall (nftables/iptables/ufw/firewalld),
- and sometimes a managed load balancer that doesn’t like UDP unless configured explicitly.
Pick one source of truth and document it. If you “just allow it everywhere,” you’ll still break it later—only now you’ve also increased your blast radius.
Joke 2/2: Debugging UDP through three firewalls is like office politics—nobody admits they dropped your packet, but everyone had a “policy reason.”
Routing and AllowedIPs: the silent deal-breaker
WireGuard’s AllowedIPs is not a decorative setting. It does two jobs:
- Inbound filtering: what source IPs you’ll accept from that peer.
- Outbound routing: which destination IPs get encrypted to that peer.
This is elegant. It’s also why a single wrong CIDR can make traffic disappear without any handshake errors. Sometimes the handshake completes fine, but you can’t reach anything because nothing is routed into the tunnel. Other times, the handshake fails because the peer’s tunnel IP isn’t considered valid for that peer.
Common AllowedIPs patterns
- Road warrior client: server sets `AllowedIPs = 10.10.0.2/32` for that client. Client sets `AllowedIPs = 0.0.0.0/0, ::/0` if you want full-tunnel, or just the private ranges if split-tunnel.
- Site-to-site: each side advertises its local LANs via AllowedIPs so the other side routes to them. This is where overlaps and RFC1918 collisions become your villain.
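The road-warrior pattern, as wg0.conf fragments (keys and addresses illustrative):

```ini
# Server side: one [Peer] stanza per client, each with its /32 tunnel IP
[Peer]
PublicKey = <client-public-key>
AllowedIPs = 10.10.0.2/32

# Client side, full tunnel: route everything to the server
[Peer]
PublicKey = <server-public-key>
Endpoint = 203.0.113.10:51820
AllowedIPs = 0.0.0.0/0, ::/0
```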
Overlapping subnets: the corporate classic
If both sides use 192.168.1.0/24 internally, you can establish a tunnel, but routing becomes a coin toss. People then “fix” it by adding more static routes until the network resembles a conspiracy wall. Don’t. Renumber one side or use NAT inside the tunnel deliberately and document it.
Time and replay protection: when clocks ruin your day
Time rarely causes handshake failures, but when it does, it’s infuriating because it feels like magic. WireGuard’s handshake initiation carries a timestamp, and a peer rejects initiations whose timestamp isn’t newer than the last one it accepted from that key; that is its replay protection. If clocks are severely skewed, or a VM is paused and resumed with time jumps, legitimate initiations can look like replays, and you get symptoms that look like “it just won’t handshake,” especially across reinstalls or snapshots.
What to do in practice:
- Make NTP boring. Use systemd-timesyncd/chronyd. Ensure it starts early in boot.
- On virtualized hosts, confirm the hypervisor time sync doesn’t fight NTP.
- If you use snapshots and restores, expect weirdness: a restored VM may have old WireGuard state and old time.
How to tell if time is the issue
You usually see:
- handshakes that complete right after reboot or time sync, then fail later,
- or failures after VM suspend/resume,
- or a system clock that’s wildly wrong (`timedatectl` shows unsynchronized).
MTU and fragmentation: the handshake’s slower cousin
Strictly speaking, MTU problems more often break data than handshakes, because handshake packets are small. But in real life, people diagnose “handshake did not complete” because the tunnel “doesn’t work,” and they never separate control-plane success from data-plane failure.
MTU issues show up as:
- handshake works, but large transfers stall,
- some sites work, others don’t (classic PMTUD black hole),
- SSH works, HTTPS hangs on large responses,
- VoIP is choppy and everyone blames the codec.
Fix it the adult way
Start with the default MTU from wg-quick (often 1420). If you have additional encapsulation (PPPoE, VLANs, other tunnels), lower it. Validate with DF ping, then move to MSS clamping if you’re routing TCP-heavy traffic.
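The arithmetic behind those numbers is simple enough to script. A sketch that derives a WireGuard MTU from the underlay path MTU, assuming the usual per-packet overhead (outer IP header, UDP header, and WireGuard’s 32 bytes of type, indices, counter, and auth tag):

```shell
#!/usr/bin/env bash
# Sketch: wg_mtu <underlay_mtu> [4|6]
# Overhead: outer IP header (20 bytes IPv4, 40 IPv6) + UDP (8) + WG (32).
wg_mtu() {
  local base=$1 fam=${2:-4} ip_hdr=20
  [ "$fam" = "6" ] && ip_hdr=40
  echo $(( base - ip_hdr - 8 - 32 ))
}
```

`wg_mtu 1500 6` gives 1420, the wg-quick default (sized for the IPv6 worst case on a 1500-byte path); PPPoE at 1492 over IPv6 gives 1412, which is why “lower it further” is the right move under extra encapsulation.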
Three corporate mini-stories from the trenches
Incident 1: the outage caused by a wrong assumption
A mid-sized company rolled out WireGuard as a “simple remote access VPN” for engineers. The design doc said: “Open UDP/51820 on the VPN server.” Straightforward.
They deployed the server in a colo rack behind an edge firewall managed by a separate network team. The application team requested “open port 51820.” The network team opened TCP/51820 because their ticket system template defaulted to TCP, and nobody questioned it. Everyone tested from inside the office network, where an internal NAT hairpin path made things appear to work “sometimes,” which is the worst kind of working.
On launch day, remote users reported “Handshake did not complete.” The application team spent hours rotating keys and rebuilding configs. Meanwhile the network team insisted “the port is open.” It was. Just not the right protocol.
The fix was a one-line firewall change allowing UDP/51820 inbound. The lesson was less about WireGuard and more about assumptions: never accept “port open” without “protocol open,” and never troubleshoot cryptography before you’ve confirmed packets arrive.
Incident 2: an optimization that backfired
Another org wanted to “reduce background chatter” on mobile clients to save battery and data. Someone noticed PersistentKeepalive = 25 and decided it was wasteful. They removed it fleet-wide.
It worked in their office Wi‑Fi and on corporate LTE plans. Then field staff started using random networks: hotel Wi‑Fi, coffee shop Wi‑Fi, airport captive portals. Those networks often had aggressive UDP timeouts. Idle tunnels died silently. When users resumed activity, some traffic triggered a new handshake, but not always in the right order for their applications. They saw intermittent failures: sometimes it worked after 10–30 seconds, sometimes it didn’t until the app was restarted.
The incident response focused on WireGuard versions, kernels, and “maybe the encryption is broken.” It wasn’t. The “optimization” removed the one mechanism keeping NAT mappings alive.
They reintroduced keepalive—selectively. Always-on laptops got keepalive. Phones got a longer interval and only for profiles used on hostile networks. The deeper lesson: optimization without a failure model is just you volunteering for future outages.
Incident 3: the boring but correct practice that saved the day
A financial services team had a habit that looked old-fashioned: every network service change came with a packet capture on both ends during a test window, stored with the change record. Not because compliance demanded it, but because they liked sleeping.
One Friday, a minor ISP routing change caused their WireGuard server’s public IP to move (planned). DNS updated quickly, but a subset of clients still hit the old IP due to caching and a stale resolver path. Users saw “Handshake did not complete.” The team on-call pulled the last known good packet capture and compared it to a new capture in minutes.
The captures told the story: clients were sending UDP to the old IP; the server never saw it. No need to touch keys, MTU, or firewall rules. They reduced DNS TTL for that record for the duration of the migration and pushed an updated endpoint IP to clients that couldn’t rely on DNS.
It was boring, measurable, and fast. The practice wasn’t glamorous, but it turned guesswork into a short incident.
Common mistakes: symptom → root cause → fix
- Symptom: client shows “latest handshake: (none)”; server tcpdump shows no inbound UDP.
  Root cause: wrong public IP/endpoint, upstream firewall, missing port forward, CGNAT, or ISP UDP block.
  Fix: verify the endpoint IP, open UDP in every firewall layer, configure the correct port forward, or move the server to a true public IP (or IPv6).
- Symptom: server sees inbound UDP packets but never sends replies.
  Root cause: host firewall drops output, policy routing sends replies out the wrong interface, rp_filter drops, or WireGuard isn’t bound/started correctly.
  Fix: allow outbound UDP, fix routing, relax rp_filter where appropriate, validate `ss -lunp` and `wg show`.
- Symptom: server replies in tcpdump; client never sees replies.
  Root cause: client-side firewall, NAT mapping expired, symmetric NAT behavior, or return path blocked.
  Fix: add `PersistentKeepalive` on the client, allow inbound UDP from the server, test from a different network, or use a server with stable connectivity.
- Symptom: handshake completes, but you can’t reach anything through the tunnel.
  Root cause: wrong `AllowedIPs`, missing routes, IP forwarding disabled, missing NAT/forward rules on the server, or overlapping subnets.
  Fix: correct AllowedIPs, verify `ip route get`, enable forwarding, add forward/NAT rules, renumber or NAT one side.
- Symptom: works for a minute, then dies when idle; handshake time stops updating.
  Root cause: UDP NAT timeout; the mapping expires.
  Fix: `PersistentKeepalive = 25` (or an appropriate value) on the NATed peer; consider raising idle UDP timeouts on the firewall if you control it.
- Symptom: works on some networks, never on others (especially corporate guest networks).
  Root cause: UDP blocked or rate-limited; only TCP/443 allowed out.
  Fix: provide an alternate egress path (different network), or deploy a design that doesn’t require raw UDP egress for those clients (an organizational decision, not a WireGuard knob).
- Symptom: after VM resume or restore, handshakes fail unpredictably.
  Root cause: clock jump/time drift, stale state, or replay-protection edge cases.
  Fix: ensure time sync, bring the interface down and up, and avoid restoring old snapshots as a “repair step” for network services.
- Symptom: handshake works, small pings work, large downloads hang.
  Root cause: MTU/PMTUD black hole; fragmentation blocked.
  Fix: lower the WG MTU, test with DF pings, clamp TCP MSS on forwarding paths.
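Where MSS clamping is the fix, one hedged nftables sketch (table and chain names are illustrative; your ruleset layout may differ):

```nft
table inet mangle {
  chain forward {
    type filter hook forward priority mangle; policy accept;
    # Clamp TCP MSS to the route MTU for traffic crossing the tunnel
    oifname "wg0" tcp flags syn tcp option maxseg size set rt mtu
    iifname "wg0" tcp flags syn tcp option maxseg size set rt mtu
  }
}
```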
Checklists / step-by-step plan
Checklist A: “Handshake never completes” (hard failure)
- Confirm endpoint: the client’s `Endpoint` IP and port are correct.
- Confirm server is listening: `wg show` and `ss -lunp`.
- Packet capture on server: `tcpdump -ni <if> udp port <port>`.
- If the server sees nothing: inspect port forwards, outer firewalls, and whether the server has a real public IP (not 100.64/10).
- If server sees inbound only: check server firewall egress, routing, and rp_filter.
- If server sees both directions: validate keys and time sync; then check whether a middlebox is rewriting/blackholing replies on the client side.
Checklist B: “Handshake completes but traffic doesn’t” (soft failure)
- Confirm counters move: `wg show` should show increasing bytes.
- Check AllowedIPs: ensure destination networks are included on the sender’s side; ensure peer tunnel IPs are correct on the receiver’s side.
- Check routing: `ip route get <dest>` must point into wg0 for tunneled destinations.
- Check forwarding/NAT: if you expect LAN access, enable forwarding and allow forward-chain traffic.
- Test MTU: DF ping with increasing sizes; adjust WG MTU or clamp MSS.
- Then DNS: only after IP connectivity works. DNS issues masquerade as “VPN broken” in a way that wastes hours.
Checklist C: “It drops every few minutes” (intermittent failure)
- Observe endpoint changes: a flapping endpoint field in the server’s `wg show` indicates NAT rebinding.
- Add keepalive: `PersistentKeepalive = 25` on the NATed side; tune as needed.
- Check UDP timeouts: on firewalls you control, increase the UDP session timeout for that port/host pair.
- Look for competing NAT devices: double NAT and guest Wi‑Fi “client isolation” can behave like packet loss.
FAQ
1) Does “Handshake did not complete” always mean the UDP port is blocked?
No. It means the handshake packets aren’t successfully exchanged. Blocked UDP is common, but wrong endpoint IP, asymmetric routing, or key mismatch can do it too. Use tcpdump to separate “no packets arrive” from “packets arrive but handshake fails.”
2) If I can ping the server’s public IP, why won’t WireGuard handshake?
Ping is ICMP. WireGuard uses UDP. Networks frequently allow ICMP but block or rate-limit UDP, especially in corporate guest networks and some ISPs. Test UDP reachability with packet captures, not with vibes.
3) Should I change the WireGuard port from 51820?
Sometimes. If you suspect upstream filtering on common VPN ports, moving to a random high UDP port can help. But don’t cargo-cult it: if tcpdump shows nothing on the server, you still have to open/forward the new port everywhere.
4) Where should I set PersistentKeepalive?
On the peer behind NAT that needs to remain reachable. Usually that’s the client. Setting it on the public server does nothing for client-side NAT mappings.
5) Can wrong AllowedIPs prevent the handshake?
Yes, in a couple ways. If the server expects the peer to use a certain tunnel IP but AllowedIPs doesn’t include it, packets can be dropped as “not from this peer.” More often, the handshake succeeds but data goes nowhere because routing never sends packets into the tunnel.
6) Why does it work on my phone hotspot but not at the office?
Office networks commonly block outbound UDP except for DNS/NTP, or they run aggressive stateful inspection that kills long-idle flows. Your phone hotspot tends to be simpler and kinder to UDP.
7) Do I need to restart WireGuard after changing firewall rules?
No, firewall changes apply immediately. Restarting can help clear confusion, but it’s not required. Prefer to change one variable at a time so you know what fixed it.
8) How do I tell if I’m behind CGNAT?
If your WAN interface has an IP in 100.64.0.0/10, you’re behind CGNAT. Also, if your router WAN IP is private (192.168/10.0/172.16), you’re behind another NAT upstream. Inbound port forwarding won’t work unless you control the upstream NAT too.
9) The handshake completes, but only some subnets are reachable. Why?
That’s usually routing/AllowedIPs mismatch or overlapping subnets. Confirm ip route get for each destination and verify each peer’s AllowedIPs includes the right CIDRs.
10) Is time sync really relevant for WireGuard?
Most of the time, no. But when clocks are badly wrong—especially after VM suspend/resume or snapshot restore—replay protection and state can behave in ways that look like random handshake failure. Keep NTP healthy and boring.
Conclusion: next steps that actually reduce pages
When WireGuard says “Handshake did not complete,” it’s not asking you to meditate on cryptography. It’s asking you to follow the packets. Do the boring sequence:
- Verify endpoint/port and that the server is listening.
- Capture packets on the server: do inbound UDP packets arrive?
- If yes, do replies leave and does the client receive them?
- If handshakes succeed, stop staring at handshakes and fix routing/AllowedIPs/forwarding/MTU.
Operationally, your best long-term win is to standardize a minimal diagnostic bundle: wg show output, ss -lunp, a 30-second tcpdump on both ends, and timedatectl. Keep that as muscle memory. Your future self will thank you—quietly, because they’re finally asleep.
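That bundle is small enough to script. A sketch of a collector (the command list, interface default, and file naming are assumptions; commands that fail for lack of root or a missing tool simply leave their error message in the bundle):

```shell
#!/usr/bin/env bash
# Sketch: collect a minimal WireGuard diagnostic bundle into a temp dir.
wg_diag() {
  local iface=${1:-wg0} out c
  out=$(mktemp -d) || return 1
  for c in "wg show $iface" "ss -lun" "timedatectl" "ip route" "ip -4 addr"; do
    # One file per command; stderr is kept so failures are visible too.
    sh -c "$c" > "$out/$(echo "$c" | tr ' /' '__').txt" 2>&1 || true
  done
  echo "bundle in $out"
}
```

Run it on both ends during an incident and attach the directories to the ticket; comparing two bundles is faster than comparing two memories.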