Your phone flips from office Wi‑Fi to LTE, your VPN drops, Slack reconnects, your SSH session freezes, and your on-call brain starts doing math it didn’t agree to.
If you’ve lived this, you’re not imagining things: most VPNs were designed with the romantic notion that IP addresses stay put.
WireGuard is one of the rare VPNs that treats roaming as a first-class reality. It’s not “mobile-friendly” in the marketing sense.
It’s mobile-friendly in the “I can walk out of a building mid-incident and my tunnel doesn’t implode” sense.
The real roaming problem: networks lie
“Roaming” sounds polite. Like your laptop is strolling through a meadow and casually switching access points.
In production, roaming means your source IP and source port change, NAT mappings evaporate, paths get weird, and UDP state is treated like a disposable cup.
Most mobile VPN pain is not cryptography. It’s state. Specifically: who thinks the peer is “at” what IP:port right now, and whether intermediate devices agree long enough for packets to arrive.
Cellular networks add more chaos: carrier-grade NAT, aggressive idle timeouts, and power-saving behavior that pauses your app and its sockets.
Classic VPN stacks often bind identity to a session that implicitly assumes stable transport. When the transport changes, the session needs renegotiation.
Negotiation takes time. Some implementations do it poorly. Some do it “correctly” but require keepalives so frequent they burn battery and still lose.
WireGuard’s approach is blunt: your identity is your public key. Transport details are replaceable.
If a valid authenticated packet arrives from a new address, WireGuard updates its idea of where that peer lives.
That’s roaming. It’s not magic; it’s a design decision that operationalizes the obvious: phones move.
Why mobile VPNs “disconnect” even when the tunnel is up
Many incidents get misfiled as “VPN disconnect.” The tunnel might still be configured, the interface still exists, and the client still shows “connected.”
What’s actually happening is one of these:
- NAT mapping expired: the server replies to an address/port that no longer maps to the client.
- Path MTU changed: LTE path shrank; packets black-hole; TCP stalls; users call it “disconnect.”
- DNS changed: you roamed to a network with different resolvers; your VPN DNS isn’t applied; name lookups fail.
- Routing policy changed: OS decides the default route moved; split tunnel rules conflict; some traffic escapes or loops.
- Firewall state mismatch: UDP “sessions” are a suggestion; mid-roam you lose return traffic.
WireGuard doesn’t solve every one of these by itself. But it solves the core identity/endpoint issue cleanly, which removes a huge class of flakiness.
How WireGuard roaming actually works
WireGuard is a layer-3 tunnel with a small protocol and a strict idea: packets must be authenticated.
If the server receives an authenticated packet for a peer from a new IP:port, it updates that peer’s endpoint to the new address and replies there.
No “reconnect” ceremony required.
That endpoint update happens on both sides. The client can roam too: if it hears back from the server at a new address (less common, but relevant for anycast or multi-homed servers),
it updates its endpoint as well.
Handshake vs. data packets: what updates what
WireGuard uses a Noise-based handshake. A handshake packet is authenticated. A data packet is also authenticated.
Endpoint learning happens when a packet is successfully authenticated and decrypted.
That’s a critical detail: random UDP spam cannot move your endpoint. Only a peer that has the right keys can.
So roaming is not “accept packets from anywhere.” It’s “accept packets from anywhere if they prove who they are.”
That’s why you can safely let mobile clients move across Wi‑Fi, LTE, hotel networks, and the occasional airport captive portal from a decade ago.
Why keepalives exist if roaming is so great
Roaming fixes the endpoint-change problem. It does not defeat NAT timeouts by sheer willpower.
If a client is behind NAT and idle, the mapping can expire. The server will still send replies to the last known endpoint, which now points to nothing.
The fix is PersistentKeepalive on the side behind NAT (often the mobile client).
It sends periodic empty packets to keep the NAT mapping alive.
You should use it deliberately, not out of superstition.
Keepalives are a cost: battery, radio wakeups, and sometimes higher data usage.
But the alternative is your tunnel quietly becoming a decorative UI element.
One quote that operations people should tattoo on their runbooks
“Hope is not a strategy.” — General Gordon R. Sullivan
If your mobile VPN depends on “the NAT usually keeps UDP mappings for a while,” you’re running on hope. Hope pages you at 2 a.m.
Facts and history that explain the design
WireGuard didn’t show up as a “better OpenVPN UI.” It came out of a specific frustration: VPN stacks were big, hard to audit, and too tolerant of bad defaults.
Here are some concrete facts and context points that make roaming feel inevitable rather than surprising:
- WireGuard started around 2015 as a project to build a simpler, more auditable VPN with modern crypto defaults.
- It uses the Noise protocol framework (specifically a Noise IK pattern variant) to structure handshakes with strong security properties.
- It aims for a small codebase compared to legacy VPN implementations, which reduces attack surface and makes audits less miserable.
- It originally lived out-of-tree for Linux and later landed in the Linux kernel (mainline) in 2020, a big vote of confidence and a deployment accelerant.
- It uses UDP by design to avoid TCP-over-TCP meltdown and because UDP tolerates changing network conditions better.
- Identity is a public key, not an IP address; IPs are just routed inside the tunnel as “AllowedIPs.”
- Roaming is a first-class behavior: endpoints are learned dynamically based on authenticated packets, not pinned forever.
- It deliberately avoids crypto negotiation; there’s no “pick from 17 ciphers” menu. This prevents downgrade games and configuration drift.
- It’s commonly implemented on mobile using OS-native networking stacks (for example, tunnel providers), which helps stability but still inherits OS power policies.
What to configure (and what not to)
Roaming “just works” only if you don’t sabotage it with the wrong assumptions.
WireGuard is simple, but that simplicity means you’re closer to the metal. You can absolutely shoot yourself in the foot.
The gun is small. It still works.
Decide your tunnel intent: full tunnel vs split tunnel
For mobile users, split tunnel is usually the sane default unless you have a specific compliance requirement.
Full tunnel is appealing until you route video calls through your data center, accidentally become an ISP, and discover that users judge you for physics.
- Full tunnel: client AllowedIPs includes 0.0.0.0/0, ::/0. You must provide DNS, egress NAT, and handle MTU carefully.
- Split tunnel: client AllowedIPs includes only internal subnets (and maybe a few service ranges). Less breakage, less bandwidth cost.
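The contrast is easiest to see in the client's [Peer] entry. A minimal sketch of both shapes (pick one per client; the public key and endpoint are placeholders):

```ini
# Full tunnel: everything rides the VPN. You now own DNS, egress NAT, and MTU.
[Peer]
PublicKey = <server-public-key>
Endpoint = <server-ip>:51820
AllowedIPs = 0.0.0.0/0, ::/0

# Split tunnel: only internal ranges enter the tunnel; the rest stays local.
[Peer]
PublicKey = <server-public-key>
Endpoint = <server-ip>:51820
AllowedIPs = 10.0.0.0/8, 172.16.0.0/12
```

Note that AllowedIPs does double duty: it is both the client's routing table for the tunnel and the filter for which source addresses are accepted from that peer.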
Use PersistentKeepalive on NATed clients (usually mobile)
If the mobile client is behind NAT and you need it reachable (for replies, push traffic, or simply reliable sessions), set:
PersistentKeepalive = 25 on the client's peer entry for the server (the value is in seconds).
Why 25? It’s a pragmatic value that beats many NAT idle timeouts without being absurd. But don’t worship it. Measure your networks.
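In wg-quick terms, the setting is one line in the client's peer entry for the server. A sketch with placeholder key and endpoint:

```ini
[Peer]
PublicKey = <server-public-key>
Endpoint = <server-ip>:51820
AllowedIPs = 10.6.0.0/24
# Keeps the NAT mapping warm from the inside; 25s undercuts
# the common 30s UDP idle timeout with a little margin.
PersistentKeepalive = 25
```

Set it on the side behind NAT. Setting it on a publicly reachable server accomplishes nothing except extra packets.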
MTU: the quiet killer of “it disconnects” tickets
Roaming changes paths. Paths change MTU. LTE and some Wi‑Fi setups can be allergic to fragmentation.
When MTU is wrong, you get stalls, partial connectivity, and users describing it as “VPN is flaky.”
My default for mobile WireGuard is to start around 1280 for IPv6 friendliness and fewer surprises.
On cleaner networks you can go higher (1420 is common). But if you’re troubleshooting mobile drops, reduce MTU before you touch crypto.
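The common 1420 default isn't arbitrary; it falls out of per-packet overhead. A sketch of the arithmetic, assuming a 1500-byte outer link and the worst case of an IPv6 outer header (overhead figures per WireGuard's data-packet framing):

```shell
# Compute the WireGuard inner MTU from the outer link MTU.
# Worst-case overhead per data packet: outer IPv6 header (40)
# + UDP header (8) + WireGuard type/index/counter header (16)
# + Poly1305 authentication tag (16) = 80 bytes.
outer_mtu=1500
overhead=$((40 + 8 + 16 + 16))
inner_mtu=$((outer_mtu - overhead))
echo "$inner_mtu"   # prints 1420, the common wg-quick default
```

If the real outer path is smaller than 1500 (common on LTE and tunneled carrier networks), the same subtraction explains why dropping to 1280 makes "flaky" networks suddenly behave.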
DNS: pick one story and tell it consistently
A lot of “disconnects” are actually name resolution failures after roaming.
Decide whether DNS is:
- Inside the tunnel: set DNS in the client config; ensure the DNS server is reachable via AllowedIPs; ensure the server forwards/answers.
- Outside the tunnel: accept local DNS; then don’t pretend internal hostnames will work everywhere.
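For the "inside the tunnel" option, a sketch of the client side (10.6.0.1 is an assumed in-tunnel resolver; the key is a placeholder):

```ini
[Interface]
Address = 10.6.0.2/32
# Resolver reachable only through the tunnel
DNS = 10.6.0.1

[Peer]
PublicKey = <server-public-key>
# The resolver's IP must be covered by AllowedIPs,
# or DNS queries never enter the tunnel at all.
AllowedIPs = 10.6.0.1/32, 172.16.0.0/12
```

The failure mode to avoid: DNS points at 10.6.0.1, but AllowedIPs doesn't include it. The tunnel is "up," and every lookup dies on the local network.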
Firewall: allow the thing you built
WireGuard uses UDP. If your perimeter thinks UDP is suspicious (it often is), you must explicitly allow it.
Also: if you run WireGuard on a non-standard port to “hide it,” be honest about what you’re optimizing for.
Security through obscurity is like wearing a disguise to a staff meeting; it mostly confuses your colleagues.
Practical tasks: commands, outputs, decisions
These are real tasks I’d run during setup or while debugging roaming and mobile stability.
Each includes a command, representative output, what the output means, and the decision you make next.
Task 1: Confirm WireGuard interface state on the server
cr0x@server:~$ sudo wg show
interface: wg0
public key: 6E9q...yZ0=
private key: (hidden)
listening port: 51820
peer: oYt9...r3Q=
allowed ips: 10.6.0.2/32
latest handshake: 14 seconds ago
transfer: 22.34 MiB received, 48.11 MiB sent
persistent keepalive: every 25 seconds
Meaning: The peer is handshaking recently; transfers are happening; keepalive is configured.
Decision: If latest handshake is “never” or very old, stop blaming MTU or DNS and focus on reachability (UDP/port/NAT).
Task 2: Check if the server is actually listening on the expected UDP port
cr0x@server:~$ sudo ss -lunp | grep 51820
UNCONN 0 0 0.0.0.0:51820 0.0.0.0:*
Meaning: The kernel is listening on UDP/51820. With kernel-mode WireGuard there is no owning userspace process, so the process column is empty.
Decision: If nothing is listening, your issue is local service/config, not “roaming.” Fix systemd unit or config first.
Task 3: Verify firewall allows WireGuard UDP
cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
chain input {
type filter hook input priority 0; policy drop;
ct state established,related accept
iif "lo" accept
udp dport 51820 accept
ip protocol icmp accept
ip6 nexthdr icmpv6 accept
}
}
Meaning: Default-drop policy, but UDP/51820 is explicitly accepted.
Decision: If you don’t see an accept rule, add one. If you rely on “security group probably allows it,” you’re outsourcing your uptime to vibes.
Task 4: Watch live handshakes to confirm roaming endpoint changes
cr0x@server:~$ sudo wg show wg0 endpoints
oYt9...r3Q=	203.0.113.50:49213
2xK1...a9c=	198.51.100.77:61102
cr0x@server:~$ sudo wg show wg0 latest-handshakes
oYt9...r3Q=	1707142421
2xK1...a9c=	1707142414
Meaning: You can see each peer’s current endpoint (IP:port). On mobile, that port often changes when roaming.
Decision: If endpoint never changes even when the user roams, you might be seeing traffic pinned through a proxy, or the client isn’t sending anything after the roam (sleep/power policy).
Task 5: Confirm IP forwarding on the server (for routed traffic)
cr0x@server:~$ sysctl net.ipv4.ip_forward net.ipv6.conf.all.forwarding
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1
Meaning: The server will route packets between interfaces.
Decision: If forwarding is 0, clients may handshake fine but can’t reach anything beyond the server. Fix forwarding before chasing “roaming bugs.”
Task 6: Confirm NAT is configured (for full tunnel or internet egress)
cr0x@server:~$ sudo nft list table ip nat
table ip nat {
chain postrouting {
type nat hook postrouting priority 100; policy accept;
oifname "eth0" ip saddr 10.6.0.0/24 masquerade
}
}
Meaning: Traffic from the VPN subnet gets masqueraded out via eth0.
Decision: If you’re doing full tunnel and NAT is missing, clients will connect but won’t reach the internet. That gets reported as “VPN drops sites.”
Task 7: Validate routing table for the WireGuard subnet
cr0x@server:~$ ip route show
default via 203.0.113.1 dev eth0
10.6.0.0/24 dev wg0 proto kernel scope link src 10.6.0.1
Meaning: The server knows the VPN subnet is directly on wg0.
Decision: If the VPN subnet route is missing, the interface may not be up or has the wrong address. Fix that first.
Task 8: Check MTU on wg0 and path MTU symptoms
cr0x@server:~$ ip link show dev wg0
5: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/none
Meaning: Interface MTU is 1420 (common default).
Decision: If mobile users report stalls on LTE, try lowering MTU (e.g., 1280) on client and/or server. Then retest. MTU fixes look like miracles because they are boring.
Task 9: Test “don’t fragment” ping through the tunnel (MTU discovery)
cr0x@server:~$ ping -M do -s 1400 10.6.0.2 -c 3
PING 10.6.0.2 (10.6.0.2) 1400(1428) bytes of data.
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420
--- 10.6.0.2 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2046ms
Meaning: The packet size exceeds the interface MTU constraints; this can indicate your tunnel MTU doesn’t support the payload you’re trying to push without fragmentation.
Decision: Reduce MTU and repeat until it succeeds. If you can’t get a stable “do not fragment” size, suspect an intermediate black hole or ICMP filtering.
Task 10: Capture WireGuard traffic to see if packets arrive during roam
cr0x@server:~$ sudo tcpdump -ni eth0 udp port 51820 -c 10
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:13:01.112233 IP 203.0.113.50.49213 > 203.0.113.10.51820: UDP, length 148
12:13:01.113005 IP 203.0.113.10.51820 > 203.0.113.50.49213: UDP, length 92
12:13:26.144990 IP 198.51.100.22.60133 > 203.0.113.10.51820: UDP, length 148
12:13:26.145701 IP 203.0.113.10.51820 > 198.51.100.22.60133: UDP, length 92
Meaning: You see packets from multiple source IPs/ports. This is exactly what roaming looks like at the edge: same peer key, different endpoint over time.
Decision: If you see inbound packets but wg show never updates handshake times, suspect wrong keys or replay/clock issues. If you see nothing inbound, it’s network/firewall/carrier NAT.
Task 11: Confirm client’s peer settings (on Linux client)
cr0x@client:~$ sudo cat /etc/wireguard/wg0.conf
[Interface]
PrivateKey = (hidden)
Address = 10.6.0.2/32
DNS = 10.6.0.1
[Peer]
PublicKey = 6E9q...yZ0=
AllowedIPs = 10.0.0.0/8, 172.16.0.0/12
Endpoint = 203.0.113.10:51820
PersistentKeepalive = 25
Meaning: Split tunnel is configured (only RFC1918 ranges), DNS points inside the tunnel, keepalive is on.
Decision: If AllowedIPs is too broad, you might hijack local networks (printers, captive portals). If too narrow, internal services won’t route.
Task 12: Check for conflicting routes and policy routing on the client
cr0x@client:~$ ip rule show
0: from all lookup local
32764: from all lookup main
32765: from all lookup default
cr0x@client:~$ ip route show table main | sed -n '1,12p'
default via 192.168.1.1 dev wlan0 proto dhcp metric 600
10.0.0.0/8 dev wg0 scope link
172.16.0.0/12 dev wg0 scope link
Meaning: Routes for internal subnets go via wg0; default route stays on Wi‑Fi. That’s split tunnel behaving.
Decision: If default route points to wg0 unexpectedly, you configured full tunnel. Confirm that’s what you intended before blaming “roaming instability.”
Task 13: Confirm DNS resolver in effect (systemd-resolved example)
cr0x@client:~$ resolvectl status | sed -n '1,80p'
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Link 5 (wg0)
Current Scopes: DNS
Protocols: +DefaultRoute
Current DNS Server: 10.6.0.1
DNS Servers: 10.6.0.1
Meaning: DNS is pinned to the VPN DNS server on the WireGuard link.
Decision: If DNS servers are still the Wi‑Fi network’s resolver while you expect internal DNS, fix client DNS config. Many “VPN down” reports are “DNS wrong.”
Task 14: Check conntrack timeouts (why your UDP goes missing)
cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_udp_timeout net.netfilter.nf_conntrack_udp_timeout_stream
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 180
Meaning: UDP flows can be expired quickly (30s). On some firewalls/NATs, this is even shorter.
Decision: If you see idle disconnects around 30 seconds, keepalive at 25 seconds will typically stabilize it. If battery is a concern, tune based on measured timeouts.
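The decision above can be mechanized into a quick sanity check; a sketch, assuming you've already measured the idle timeout on your own path:

```shell
# Sanity-check: the keepalive interval must undercut the shortest
# UDP idle timeout on the path, with some margin for packet loss.
keepalive=25        # PersistentKeepalive on the client (seconds)
udp_timeout=30      # measured nf_conntrack_udp_timeout (seconds)
margin=$((udp_timeout - keepalive))
if [ "$margin" -ge 5 ]; then
  echo "ok: ${margin}s of margin"   # prints "ok: 5s of margin"
else
  echo "risk: only ${margin}s margin; lower keepalive or raise the timeout"
fi
```

The 5-second margin is an arbitrary safety buffer, not a standard; one lost keepalive packet should not be enough to expire the mapping.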
Task 15: Confirm time sync (handshakes hate time skew)
cr0x@server:~$ timedatectl
Local time: Tue 2026-02-04 12:21:42 UTC
Universal time: Tue 2026-02-04 12:21:42 UTC
RTC time: Tue 2026-02-04 12:21:42
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
Meaning: Clock is synchronized. Good.
Decision: If system clock isn’t synced, you can see handshake weirdness and replay rejection. Fix NTP before inventing new theories.
Task 16: Verify packets route between VPN client and internal subnet
cr0x@server:~$ ip route get 10.20.30.40 from 10.6.0.2 iif wg0
10.20.30.40 from 10.6.0.2 iif wg0
via 10.0.0.1 dev eth0
Meaning: The kernel would route traffic from the VPN client toward the internal network via eth0 (example).
Decision: If this points back to wg0 or nowhere, you’ve got routing asymmetry. Roaming won’t save you from bad routing.
Fast diagnosis playbook
When a mobile WireGuard user says “VPN disconnects when I leave Wi‑Fi,” don’t start with ideology.
Start with the shortest path to isolating the bottleneck: transport, handshake, routing, MTU, DNS.
First: Is UDP reaching the server at all?
- On server: run a short capture on the WireGuard port while the user toggles Wi‑Fi → LTE.
cr0x@server:~$ sudo tcpdump -ni eth0 udp port 51820 -c 20
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:15:01.100001 IP 198.51.100.22.60133 > 203.0.113.10.51820: UDP, length 148
12:15:01.100550 IP 203.0.113.10.51820 > 198.51.100.22.60133: UDP, length 92
If you see nothing: firewall, security group, ISP filtering, wrong endpoint/port, or client not sending because OS suspended it. Fix reachability.
If you see packets: move to handshake validation.
Second: Is WireGuard accepting them (handshake updates)?
cr0x@server:~$ sudo wg show wg0
interface: wg0
public key: 6E9q...yZ0=
listening port: 51820
peer: oYt9...r3Q=
endpoint: 198.51.100.22:60133
latest handshake: 6 seconds ago
transfer: 4.12 MiB received, 6.88 MiB sent
If handshake is recent: the tunnel is alive. “Disconnect” is probably routing/MTU/DNS/app behavior.
If handshake stays old/never: wrong keys, wrong peer config, or packets aren’t valid WireGuard for this interface. Verify configs.
Third: Is the problem only certain traffic (MTU/DNS/routing)?
- Test ping by IP (bypasses DNS).
- Test DNS resolution through the VPN DNS server.
- Test a TCP connection with a small payload, then larger payloads.
If small requests work but large downloads stall, MTU is your prime suspect.
If IP works but hostnames fail, DNS.
If only some subnets fail, AllowedIPs or routing.
Common mistakes: symptom → root cause → fix
These are the recurring patterns behind “WireGuard roaming doesn’t work,” which usually means “we built a perfect tunnel into an imperfect network.”
1) Symptom: “It works on Wi‑Fi, fails on LTE”
Root cause: Carrier NAT and short UDP idle timeouts; mapping expires during idle, server replies to dead endpoint.
Fix: Set PersistentKeepalive = 25 on the mobile client peer entry; optionally lower it (e.g., 15) if the carrier is extra aggressive. Confirm with handshake timestamps.
2) Symptom: “Connected, but nothing loads after roaming”
Root cause: Handshake is fine, but MTU black-hole after path change.
Fix: Lower MTU on client (start 1280) and/or server. Validate with ping -M do tests and real app transfers.
3) Symptom: “Some internal services work, others time out”
Root cause: AllowedIPs missing routes, or overlapping private ranges with the local network (coffee shop uses 10.0.0.0/8 too).
Fix: Use more specific routes, avoid routing entire RFC1918 if you can, or readdress internal networks if you enjoy long projects. If you must use broad routes, add policy routing exceptions.
4) Symptom: “VPN says connected, but DNS is broken”
Root cause: DNS not applied by the client OS, or DNS server not reachable via AllowedIPs, or DNS server only listens on LAN.
Fix: Ensure DNS server IP is within AllowedIPs and reachable; configure the resolver integration (systemd-resolved, NetworkManager, mobile client DNS field). Verify with resolver status.
5) Symptom: “Roaming causes a new handshake, and sessions die anyway”
Root cause: Application-level sessions don’t tolerate path changes (some TCP sessions drop), or stateful firewalls reset flows.
Fix: Prefer app protocols that retry cleanly; avoid stateful middleboxes between clients and server when possible; keep the WireGuard server close to the edge.
6) Symptom: “Multiple clients ‘steal’ each other’s connectivity”
Root cause: Duplicate peer keys or duplicated AllowedIPs; WireGuard routes by AllowedIPs and expects uniqueness.
Fix: One unique keypair per device; one unique AllowedIPs per peer (at least for /32 or assigned tunnel IP). Audit configs and rotate keys.
7) Symptom: “Handshake is recent, but packets don’t reach the internal network”
Root cause: Missing IP forwarding, missing routes, missing NAT, or asymmetric routing in the internal network (return path doesn’t know 10.6.0.0/24).
Fix: Enable forwarding; add internal route back to VPN subnet; or NAT (less elegant but effective) depending on network constraints.
8) Symptom: “It dies when the phone screen turns off”
Root cause: OS power management suspends the VPN process or throttles background network; keepalive cadence becomes irrelevant if the process isn’t scheduled.
Fix: Use platform-specific VPN settings to allow always-on VPN; configure on-demand rules; on Android, exempt the app from battery optimization where policy allows.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company rolled out WireGuard for a mobile engineering team. The migration was smooth—until it wasn’t.
Users complained that the VPN “randomly disconnected” when leaving the office. The team assumed it was a WireGuard bug because the UI still showed “connected.”
The first responder did what everyone does under pressure: restarted things. It helped, temporarily.
Roaming from Wi‑Fi to LTE created new source ports. The server saw packets arriving. Handshakes updated. But application traffic still died after a minute of idleness.
The wrong assumption was subtle: they believed the NAT in front of the WireGuard server was “stateful enough” and would keep UDP mappings alive for several minutes.
In that environment it did not. The NAT’s UDP timeout was short, and the mobile carrier’s behavior was even shorter on idle.
The fix was boring and immediate: set PersistentKeepalive for the mobile clients. The “disconnects” disappeared.
The postmortem takeaway wasn’t “WireGuard needs improvement.” It was “measure the network you have, not the one you remember.”
They also updated their runbook: any report of “disconnect after idle” triggers a conntrack/keepalive check first.
The incident didn’t repeat, which is the highest form of praise in operations.
Mini-story 2: The optimization that backfired
Another org wanted to reduce battery usage. Someone noticed that keepalives were every 25 seconds and decided that was “wasteful.”
They pushed a change to increase keepalive to 120 seconds for mobile devices. In the office it looked fine; at home on Wi‑Fi it looked fine.
Then the sales team went on the road. Hotels, airports, tethering, rideshare LTE—exactly the environments where NAT timeouts are least predictable.
Tickets spiked: CRM wouldn’t load, internal dashboards timed out, “VPN unstable.” The helpdesk started advising users to toggle airplane mode, which is not a solution; it’s a ritual.
The backfire had two parts. First, some networks expired UDP state well under two minutes.
Second, when phones slept, the first packet after wake had to fight through new NAT mappings and sometimes packet loss; without frequent keepalive traffic, endpoint learning lagged behind user expectations.
They rolled back to 25 seconds for those users and introduced a policy: keepalive values are tiered by user profile.
Battery-sensitive devices got 40 seconds after testing; high-mobility roles stayed at 25. The right answer wasn’t “lower keepalives forever.”
It was “stop optimizing without a failure budget.”
They also learned to do roaming tests outside the office. A VPN that only works on your corporate Wi‑Fi is a LAN with cosplay.
Mini-story 3: The boring but correct practice that saved the day
A finance company ran WireGuard as part of its remote access stack. Nothing fancy: stable port, explicit firewall rules, a conservative MTU, and strict peer management.
Their change management was equally unglamorous: version-controlled configs, staged rollout, and a weekly audit for duplicate AllowedIPs.
One Monday, a new batch of devices was provisioned by a separate team. A subtle error introduced duplicated tunnel IPs for a handful of clients.
In WireGuard, AllowedIPs are not just “what routes to the peer.” They are also how WireGuard decides which peer should receive packets.
Duplicates create chaos. Not always instantly. The worst kind.
Here’s where the boring practices paid rent. Their audit job flagged duplicates quickly.
Their logs and wg show snapshots were already collected in the same place, so the on-call could correlate “peer flaps” with the provisioning change.
They paused provisioning, rotated the affected peers, and recovered before it became a broader incident.
Nobody wrote a triumphant internal blog post. Good. The outcome was not “heroic recovery.” The outcome was “customers didn’t notice.”
In ops, boring is a feature. Excitement is a leading indicator of future paperwork.
Checklists / step-by-step plan
Step-by-step: Deploy WireGuard for mobile users who roam
- Pick your routing model: split tunnel unless you have a real reason for full tunnel.
- Assign a stable tunnel subnet: e.g., 10.6.0.0/24, one /32 per device.
- Ensure uniqueness: unique keypair per device; unique tunnel IP per device; no shared “mobile key.”
- Set conservative MTU: start at 1280 for mobile-heavy deployments; raise only after testing across LTE.
- Enable keepalive for NATed clients: start at 25 seconds, adjust based on observed idle timeouts and battery impact.
- Make firewall rules explicit: allow UDP to the WireGuard port; log drops during rollout.
- Enable forwarding and routes: don’t rely on NAT unless necessary; if you do full tunnel, configure NAT intentionally.
- Decide DNS behavior: if DNS is inside, ensure resolver reachability and client integration.
- Test roaming in the real world: Wi‑Fi → LTE, LTE → Wi‑Fi, sleep/wake, tethering, captive portal scenarios.
- Instrument basics: store periodic wg show snapshots, handshake ages, and interface stats; alert on “handshake never” for active users.
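The checklist collapses into a small pair of wg-quick configs. A sketch with placeholder keys and an assumed vpn.example.com endpoint, applying the defaults above (10.6.0.0/24 subnet, MTU 1280, keepalive 25):

```ini
# Server: /etc/wireguard/wg0.conf
[Interface]
Address = 10.6.0.1/24
ListenPort = 51820
PrivateKey = <server-private-key>
MTU = 1280

[Peer]
# One unique keypair and one /32 per device
PublicKey = <phone-public-key>
AllowedIPs = 10.6.0.2/32

# Client (phone): the mirror image
[Interface]
Address = 10.6.0.2/32
PrivateKey = <phone-private-key>
DNS = 10.6.0.1
MTU = 1280

[Peer]
PublicKey = <server-public-key>
Endpoint = vpn.example.com:51820
AllowedIPs = 10.6.0.0/24, 172.16.0.0/12
PersistentKeepalive = 25
```

Note the asymmetry: the server pins each peer to one /32, while the client routes whole internal ranges. That asymmetry is what keeps peers from stealing each other's traffic.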
Operational checklist: When someone reports “VPN disconnects on mobile”
- Is the server receiving UDP during the failure window? (tcpdump)
- Do handshake timestamps advance? (wg show)
- Does endpoint change when the user roams? (wg show endpoints)
- Is keepalive configured on the NATed side? (wg showconf)
- Can you ping by IP through the tunnel? (ping to tunnel IP / internal IP)
- Do DNS lookups succeed through VPN DNS? (resolvectl / dig)
- Does lowering MTU fix large transfers? (ip link + ping -M do)
- Any duplicate AllowedIPs or duplicated tunnel IPs? (config audit)
Joke #1: If your troubleshooting plan starts with “reinstall the VPN app,” you’re basically turning it off and on again with extra steps.
FAQ
1) Does WireGuard roaming mean I don’t need keepalives?
No. Roaming handles endpoint changes when packets flow. Keepalives handle idle NAT timeouts so packets can flow when you resume activity.
If your client sits idle behind NAT, keepalive is often the difference between “stable” and “mysteriously dead.”
2) Why does my WireGuard client say “connected” when nothing works?
Because “connected” often means “interface exists and configuration is loaded,” not “packets are currently flowing.”
Check latest handshake. If it’s old, you’re not really connected. If it’s recent, look at MTU, routes, and DNS.
3) What’s a good PersistentKeepalive value for phones?
Start at 25 seconds. If you see idle drops sooner than that, lower it. If battery impact is a problem and your networks are tolerant, raise it carefully.
Do not set it blindly to 0 and hope carriers respect your uptime goals.
4) Will WireGuard reconnect faster than IPsec/IKE when roaming?
Often, yes—because it doesn’t require a heavy renegotiation ceremony just to accept a new endpoint.
But end-user perception also depends on DNS, MTU, and the application’s ability to retry. WireGuard removes one big speed bump; it doesn’t repave the whole road.
5) Does changing networks mid-SSH session still break my session?
It can. WireGuard can keep the tunnel alive, but TCP sessions might still reset if packets are lost or if middleboxes drop state.
Use mosh for remote shells when you expect roaming, or ensure your network path doesn’t involve stateful devices that panic at change.
6) Should I run WireGuard on port 53/123/443 to “get through networks”?
Avoid it unless you have a clear, tested requirement. Port masquerading can collide with real services, trigger filtering, and complicate debugging.
If you need reliable traversal, solve it with proper network policy or a known allowed port—not cosplay as DNS.
7) Why do some internal IP ranges break on public Wi‑Fi?
Overlapping private address space. If your company routes 10.0.0.0/8 through the VPN and the hotel also uses 10.0.0.0/8 locally,
you’ve created a routing bar fight. Narrow your AllowedIPs or readdress.
8) Is MTU really that common of a problem with WireGuard?
Yes. Especially on mobile. Roaming changes the underlying path, and path MTU discovery is not reliably supported across all networks.
If large transfers stall or some sites load and others don’t, MTU is a top suspect.
9) Does WireGuard support “multi-endpoint” servers for redundancy?
Not in the sense of listing multiple endpoints for one peer. You can build redundancy with DNS, anycast, failover tooling, or multiple tunnels,
but WireGuard itself expects one current endpoint per peer that updates as it learns.
10) Is WireGuard “stateless”?
No. It keeps state for peers: keys, counters, endpoints, handshake time. It’s just less ceremonious than many VPNs.
The key is that the state updates quickly and safely when the endpoint changes.
Joke #2: NAT devices have two hobbies: expiring UDP mappings and teaching you humility.
Next steps you can do this week
If you want fewer mobile VPN tickets and fewer “it disconnected again” complaints, do these in order:
- Measure handshake health: collect wg show handshake ages for active users; alert on “never” or stale handshakes during business hours.
- Standardize keepalive for mobile: set PersistentKeepalive for NATed clients, and document when you deviate from the default.
- Set a conservative MTU baseline: choose 1280 for mobile-first; only increase after real LTE roaming tests succeed.
- Make DNS behavior explicit: either own DNS inside the tunnel end-to-end or stop promising internal names off-network.
- Audit peer configs: detect duplicated keys and AllowedIPs; rotate anything suspicious immediately.
- Practice the fast diagnosis playbook: tcpdump → handshake → MTU/DNS/routing. Train it until it’s muscle memory.
WireGuard roaming is the fix that removes an entire category of mobile VPN disconnects.
But you still have to run it like a production system: measure, test on hostile networks, and keep your configuration boring on purpose.