Nothing erodes confidence in “IT” like an ERP screen that freezes right when someone posts invoices. The cursor spins, the session times out, and suddenly your finance team is doing the ancient ritual: close everything, reopen, re-enter, swear quietly.
VPNs get blamed because they’re visible. Sometimes that’s fair. More often, the VPN is where fragile application behavior meets latency, packet loss, MTU weirdness, DNS confusion, and stateful firewalls with short attention spans. This is how you make that mess behave in production—without guessing.
What “freezes” really are (and why they’re not mystical)
When users say “the ERP froze,” the system is usually doing exactly what you told it to do: wait. It’s waiting for a network response, a database round trip, a file lock to release, or a TLS session to renegotiate. Most line-of-business apps (ERP/CRM) are chatty: lots of small requests, lots of dependencies, and lots of assumptions about LAN-quality networking.
VPNs change the physics:
- Latency increases (even a “good” VPN adds a few to tens of milliseconds).
- Packet loss becomes visible (Wi‑Fi, LTE, and consumer ISPs love micro-loss).
- Path MTU breaks (encapsulation overhead, blocked ICMP, and sudden fragmentation).
- Routing/DNS changes (split DNS, NRPT, resolver reachability, search domains).
- Stateful devices enforce timeouts (NAT, firewalls, proxies, IDS).
ERP/CRM apps amplify this because they often have:
- Long-lived sessions with fragile keepalives.
- Hard-coded timeouts tuned for office LANs.
- Mixed protocols (HTTP(S), WebSockets, SMB, database drivers, RDP/ICA, sometimes all in one “workflow”).
- Large payload spikes (report exports, attachments, client updates).
The job is not “make VPN faster.” The job is make VPN predictable, and keep the app’s critical flows away from the known traps.
Interesting facts and historical context (things you can use at meetings)
- IPsec was standardized in the 1990s with the goal of securing IP itself, not just web traffic. That “network-level” mindset is why it’s still everywhere in site-to-site VPNs.
- SSL VPNs became popular because port 443/HTTPS gets through almost anywhere. It wasn’t elegance—it was survival in the era of locked-down egress.
- TCP-over-TCP “meltdown” has been known for decades: tunneling TCP traffic inside a TCP VPN can cause compounding retransmits and awful throughput under loss.
- Path MTU Discovery depends on ICMP “fragmentation needed” messages getting through. Blocking ICMP “for security” is a reliable way to create mysterious timeouts.
- NAT and stateful firewalls turned networking into a lease: if you don’t send something periodically, your “connection” may be deleted mid-session, even if both endpoints are fine.
- ERP vendors historically optimized for LANs because that’s where ERPs lived: on-prem servers, thick clients, and low-latency office networks. Cloud and remote work forced those assumptions into daylight.
- SMB has a long history of being latency-sensitive. It has improved (SMB2/3), but it still punishes high RTT and loss, especially with small I/O and chatty metadata operations.
- Wi‑Fi retransmissions can look like “random” packet loss at the VPN layer. VPN encryption hides application visibility, so the network team often only sees “encrypted UDP” and shrugs.
- Modern TLS can resume sessions quickly, but middleboxes that inspect or proxy can break those optimizations, forcing more handshakes and more stalls.
Fast diagnosis playbook: find the bottleneck quickly
If you do nothing else, do this in order. The point is to stop debating and start isolating.
1) Establish whether it’s latency, loss, or MTU
- Latency: the app works but feels slow everywhere; every click waits.
- Loss/jitter: the app “freezes” then catches up; sessions drop; reconnects happen; voice/video is also unhappy.
- MTU/MSS: small things work; large uploads/downloads hang; specific pages stall; “saving” fails; some TLS sites half-load.
2) Prove whether the VPN is on the path for the ERP/CRM traffic
- Check routing (client) and confirm whether you’re full-tunnel or split-tunnel for the ERP/CRM endpoints.
- Verify DNS: are you resolving the ERP hostname to the correct IP (internal vs public)?
3) Identify the choke point: client Wi‑Fi, office uplink, VPN concentrator, or app backend
- Compare behavior from: office LAN, home via VPN, home without VPN (if ERP is SaaS), and a test host in the same subnet as the ERP servers.
- Look for asymmetric performance: download fast/upload slow often indicates shaping, bufferbloat, or policing.
4) Don’t debug the ERP until the tunnel is boring
When the transport is stable—no loss spikes, no MTU problems, no DNS flapping—then you’re allowed to blame the application. Before that, you’re just making noise.
One quote to keep nearby: “Hope is not a strategy.” The attribution is murky, but ops and engineering teams repeat it for a reason.
The usual failure modes: latency, loss, MTU, DNS, and session state
Latency: death by a thousand round trips
ERP/CRM apps frequently do one request per UI action, sometimes dozens. Add 30–60 ms of VPN RTT and a workflow with 200 sequential calls picks up 6–12 seconds of pure waiting (200 × 30–60 ms); to the user, that is a coffee break. This is why “bandwidth upgrades” don’t fix it: the problem is delay, not throughput.
Typical sources:
- VPN concentrator far from users (bad geography).
- Hairpin routing: user → VPN → data center → SaaS → back through VPN, because full tunnel and “security.”
- DNS forcing internal paths when external would be faster (or vice versa).
- RDP/ICA over VPN over Wi‑Fi with bufferbloat (latency spikes under load).
Packet loss and jitter: the freeze button
Loss hurts everything, but it hurts interactive and chatty apps the most. A 0.5% loss rate can turn a good day into a ticket storm. VPN encryption also makes QoS classification harder unless you plan for it.
Common sources:
- Consumer Wi‑Fi, especially in crowded environments.
- ISP congestion and peak-hour shaping.
- UDP-based VPNs on networks that deprioritize UDP.
- VPN concentrator CPU saturation (crypto isn’t free).
MTU and MSS: the “works until it doesn’t” category
Encapsulation adds overhead. If you don’t account for it, you get fragmentation or blackholing. The classic symptom is: login works, navigation works, then a report export hangs forever. Or file attachments time out. Or a specific page loads halfway and stops.
Common sources:
- ICMP blocked somewhere along the path, breaking PMTUD.
- VPN overhead not compensated with MSS clamping.
- Mixed tunnels (e.g., nested VPNs) with inconsistent MTU.
DNS and routing: the silent saboteurs
VPN failures often look like application failures because resolution and routing decide which server you hit. Split DNS can be correct, but only if the client can reach the resolver reliably and uses it consistently.
Patterns that cause pain:
- ERP resolves to a private IP, but split tunnel doesn’t route it.
- Multiple DNS suffixes cause the client to try the wrong hostname first (slow fallback).
- DNS queries go over the VPN, but DNS responses go direct (or vice versa) and get dropped.
Session state and idle timeouts: mid-day disconnect theater
ERP/CRM sessions may be HTTP cookies, JWTs, database sessions, or long-lived WebSockets. VPNs add another session layer, and in the middle are NAT and firewalls with idle timers. When any layer expires silently, the user experiences a “freeze” until the app gives up.
This is where keepalives and timeouts matter. Not as folklore. As engineering.
Joke #1: A VPN is like an elevator: everyone only notices it when it stops between floors.
Hands-on tasks: commands, output, and decisions (16 of them)
These are practical checks you can run from a Linux jump host, a VPN gateway, or a troubleshooting VM near the user segment. The outputs below are representative. Use them to decide what to fix next.
Task 1 — Confirm the route to the ERP/CRM endpoint
cr0x@server:~$ ip route get 10.42.8.25
10.42.8.25 via 10.8.0.1 dev tun0 src 10.8.0.23 uid 1000
cache
What it means: Traffic to 10.42.8.25 goes through the VPN tunnel interface (tun0) via 10.8.0.1.
Decision: If this should be split-tunneled (e.g., SaaS), change policy. If it must go through VPN, continue diagnosing tunnel health.
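If the answer is “split-tunnel it,” that intent has to live in the route policy, not in a wiki page. A minimal sketch in OpenVPN server-config terms, assuming the example subnets above and clients that accept pushed routes (other VPN stacks have equivalent route or traffic-selector settings):
# server.conf fragment: tunnel only what must be private
push "route 10.42.0.0 255.255.0.0"
# and do NOT push this if you want split tunnel:
# push "redirect-gateway def1"
Re-run ip route get afterwards to confirm the client actually routes the ERP prefix the way you intended.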
Task 2 — Check DNS resolution and whether it matches your intent
cr0x@server:~$ getent hosts erp.internal.example
10.42.8.25 erp.internal.example
What it means: Name resolves to a private address.
Decision: Ensure routing includes 10.42.0.0/16 over VPN. If users should hit SaaS/public IP, fix split DNS or conditional forwarding.
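If the client should use an internal resolver only for internal names, split DNS has to be explicit. A minimal sketch for clients running systemd-resolved, where 10.42.0.53 stands in for a hypothetical internal resolver:
cr0x@server:~$ sudo resolvectl dns tun0 10.42.0.53
cr0x@server:~$ sudo resolvectl domain tun0 '~internal.example'
cr0x@server:~$ resolvectl status tun0
The ~ prefix scopes internal.example queries to the tunnel resolver without changing the default DNS for everything else; most managed VPN clients can push the same thing automatically.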
Task 3 — Measure baseline latency and jitter (quick and dirty)
cr0x@server:~$ ping -c 20 10.42.8.25
PING 10.42.8.25 (10.42.8.25) 56(84) bytes of data.
64 bytes from 10.42.8.25: icmp_seq=1 ttl=62 time=38.2 ms
64 bytes from 10.42.8.25: icmp_seq=2 ttl=62 time=41.7 ms
64 bytes from 10.42.8.25: icmp_seq=3 ttl=62 time=120.5 ms
...
--- 10.42.8.25 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 19021ms
rtt min/avg/max/mdev = 37.8/52.3/120.5/18.4 ms
What it means: No loss, but jitter spikes (max 120 ms). That feels like “freezes” in interactive apps.
Decision: Investigate bufferbloat/shaping (home router, ISP uplink), VPN queueing, or overloaded concentrator. Don’t tune the ERP yet.
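To confirm bufferbloat rather than assume it, watch RTT while the uplink is deliberately saturated. A rough sketch, assuming an iperf3 server is reachable across the tunnel at 10.42.8.10 (a hypothetical test host):
cr0x@server:~$ ping -i 0.5 10.42.8.25 | tee /tmp/rtt-under-load.txt &
cr0x@server:~$ iperf3 -c 10.42.8.10 -t 30
cr0x@server:~$ kill %1
If per-packet RTT in the log jumps from ~40 ms to hundreds of milliseconds while iperf3 runs, the path is buffering too much; that is exactly what AQM (Task 13) is for.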
Task 4 — Detect packet loss under load with a longer ping
cr0x@server:~$ ping -i 0.2 -c 200 10.42.8.25
...
--- 10.42.8.25 ping statistics ---
200 packets transmitted, 196 received, 2% packet loss, time 40123ms
rtt min/avg/max/mdev = 39.1/55.7/310.2/42.8 ms
What it means: 2% loss is plenty to ruin a chatty ERP, especially over TCP.
Decision: Prioritize fixing loss: Wi‑Fi, ISP, UDP policing, or VPN device drops. Consider switching VPN transport, tuning keepalives, and enforcing wired connections for critical users.
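To see where the loss starts instead of just proving it exists, mtr combines traceroute with repeated probing. Assuming mtr is installed on the client or gateway:
cr0x@server:~$ mtr -rwbc 100 10.42.8.25
Loss that appears at one hop and persists on every hop after it is real; loss at a single middle hop that vanishes downstream is usually just that router deprioritizing its own ICMP replies.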
Task 5 — Verify Path MTU with “do not fragment” ping
cr0x@server:~$ ping -M do -s 1472 -c 3 10.42.8.25
PING 10.42.8.25 (10.42.8.25) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420
ping: local error: message too long, mtu=1420
What it means: Effective MTU toward that destination is 1420, not 1500; the “local error” just means your own stack already knows the limit (interface MTU or a cached path MTU) without asking the network.
Decision: Clamp MSS or set tunnel MTU to avoid fragmentation/blackholes. If ICMP is blocked, fix that too—or you’ll keep chasing ghosts.
Task 6 — Confirm interface MTU on the VPN client/gateway
cr0x@server:~$ ip link show tun0
7: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 500
link/none
What it means: Tunnel MTU is 1400. That’s common for some VPN setups, but must match reality.
Decision: If your PMTU test shows 1420, 1400 is safe. If PMTU is smaller than tunnel MTU, you’ll fragment or blackhole—adjust.
Task 7 — Check TCP MSS clamping on a Linux gateway (iptables)
cr0x@server:~$ sudo iptables -t mangle -S | grep -i mss
-A FORWARD -o tun0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
What it means: SYN packets going out tun0 will have MSS adjusted to PMTU.
Decision: If MTU issues are suspected and this is missing, add it (or equivalent on your firewall). If it’s present, look elsewhere.
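If it’s missing on a Linux gateway, the rule is one line. A sketch, assuming tun0 is your tunnel interface and iptables manages the FORWARD chain:
cr0x@server:~$ sudo iptables -t mangle -A FORWARD -o tun0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
--clamp-mss-to-pmtu follows the kernel’s idea of the path MTU; if PMTUD is unreliable on your path, --set-mss with an explicit value (for example 1360 for a 1400-byte tunnel MTU) is the blunter but safer choice.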
Task 8 — Confirm VPN transport and avoid TCP-over-TCP when possible
cr0x@server:~$ sudo ss -Htanp | grep -E 'openvpn|wg|strongswan' | head
ESTAB 0 0 10.0.0.5:1194 198.51.100.20:51432 users:(("openvpn",pid=1342,fd=6))
What it means: ss -t lists TCP sockets only, and the OpenVPN server shows an established session on 1194, so this tunnel is running over TCP transport.
Decision: If you’re tunneling lots of TCP application traffic, strongly consider UDP-based VPN transport (or WireGuard/IPsec) unless blocked by policy. TCP VPN can work, but it’s brittle under loss.
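What “UDP transport with sane MTU handling” looks like in OpenVPN terms, as a sketch only (values are illustrative; measure your own path before copying them):
# server.conf fragment
proto udp
port 1194
dev tun
tun-mtu 1400      # keep the tunnel under the measured path MTU
mssfix 1360       # clamp TCP MSS for traffic inside the tunnel
keepalive 10 60   # ping every 10 s, assume peer gone after 60 s
WireGuard and IPsec reach the same outcome with their own MTU and keepalive knobs; the point is that transport, MTU, and keepalives get decided together, not separately.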
Task 9 — Check firewall/NAT state timeouts impact (conntrack)
cr0x@server:~$ sudo conntrack -S
cpu=0 found=23145 invalid=12 insert=89234 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=123
What it means: Non-zero invalid and a lot of search_restart can indicate stress. Not proof, but a smell.
Decision: If users complain about drops under load, examine conntrack table size and timeouts. Increase capacity or reduce session churn (e.g., split tunnel SaaS).
Task 10 — See if you’re hitting conntrack table limits
cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 257923
net.netfilter.nf_conntrack_max = 262144
What it means: You’re near the limit. When the table fills, new flows get dropped. Users call that “freezing.”
Decision: Increase nf_conntrack_max (and RAM), reduce needless full-tunnel traffic, or scale out the gateway.
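Raising the limit is a one-liner; persisting it is the part people forget. A sketch, assuming the new value fits your RAM budget (each entry costs a few hundred bytes):
cr0x@server:~$ sudo sysctl -w net.netfilter.nf_conntrack_max=524288
net.netfilter.nf_conntrack_max = 524288
cr0x@server:~$ echo 'net.netfilter.nf_conntrack_max = 524288' | sudo tee /etc/sysctl.d/90-conntrack.conf
On most modern kernels you can also raise net.netfilter.nf_conntrack_buckets so hash lookups keep up with the larger table.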
Task 11 — Identify whether the gateway is CPU bound (crypto)
cr0x@server:~$ top -b -n 1 | head -n 12
top - 12:43:11 up 31 days, 4:02, 2 users, load average: 7.92, 7.10, 6.55
Tasks: 221 total, 1 running, 220 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.1 us, 2.0 sy, 0.0 ni, 75.2 id, 16.3 si, 0.0 st
MiB Mem : 32000.0 total, 1100.3 free, 9800.2 used, 21100.1 buff/cache
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1342 root 20 0 1450m 220m 12m S 220.0 0.7 80:12.43 openvpn
What it means: High softirq (si) and a VPN process eating multiple cores. That can mean packet processing bottlenecks.
Decision: Enable AES-NI/crypto offload if available, tune interrupts/RPS, or move to a more efficient VPN stack. Sometimes the fix is “buy a bigger box,” and yes, that is engineering.
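Before buying that bigger box, confirm the crypto you already have is being used. Two quick checks; the second only benchmarks OpenSSL’s AES-GCM, which is a rough proxy for tunnel data-plane cost, not a measurement of your VPN stack itself:
cr0x@server:~$ grep -m1 -o aes /proc/cpuinfo
aes
cr0x@server:~$ openssl speed -evp aes-256-gcm
No “aes” output means no AES-NI, and software crypto will happily eat cores; if the benchmark numbers dwarf what the tunnel actually pushes, the bottleneck is packet processing or interrupts, not the cipher.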
Task 12 — Check interface errors/drops on the gateway
cr0x@server:~$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq state UP mode DEFAULT group default qlen 1000
RX: bytes packets errors dropped missed mcast
9876543210 8123456 0 12453 0 112233
TX: bytes packets errors dropped carrier collsns
8765432109 7456789 0 9821 0 0
What it means: Drops exist. Some drops are normal under shaping, but persistent growth during complaints is a clue.
Decision: Inspect qdisc, NIC ring sizes, and shaping policies. If drops correlate with ERP freezes, you have a network problem, not an ERP problem.
Task 13 — Validate QoS/queue discipline on Linux (bufferbloat control)
cr0x@server:~$ tc qdisc show dev eth0
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
What it means: fq_codel helps reduce bufferbloat and latency under load.
Decision: If you see a basic pfifo_fast or huge buffers and users complain about freezes during uploads, implement modern AQM (fq_codel/cake) where possible.
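If you control the egress and know its real uplink rate, cake is the low-effort upgrade. A sketch, assuming the sch_cake module is available and the uplink is roughly 200 Mbit/s (shape slightly below the true rate so the queue forms on your box, where AQM can manage it):
cr0x@server:~$ sudo tc qdisc replace dev eth0 root cake bandwidth 190mbit
cr0x@server:~$ tc qdisc show dev eth0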
Task 14 — Capture evidence of MTU blackholing with tcpdump
cr0x@server:~$ sudo tcpdump -ni tun0 icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on tun0, link-type RAW (Raw IP), snapshot length 262144 bytes
12:47:01.112233 IP 10.8.0.23 > 10.42.8.25: ICMP echo request, id 3921, seq 1, length 1480
12:47:01.112987 IP 10.10.0.1 > 10.8.0.23: ICMP 10.42.8.25 unreachable - need to frag (mtu 1420), length 556
What it means: PMTUD is working (an intermediate hop sent “need to frag” with its MTU). Good—now clamp MSS and stop oversized packets. Note that the message comes from a router in the middle, not from the ERP host, which is why filtering the capture on the ERP’s address alone would have hidden it.
Decision: If you never see the ICMP “need to frag,” suspect ICMP filtering and fix it, or set conservative MTU.
Task 15 — Validate application-layer reachability and TLS handshake time
cr0x@server:~$ curl -sk -o /dev/null -w "dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n" https://erp.internal.example/login
dns=0.004 connect=0.037 tls=0.210 ttfb=0.612 total=0.613
What it means: curl’s timers are cumulative, so the TCP connect finished at 37 ms, TLS at 210 ms, and the first byte arrived at 612 ms. That can be normal, but if these balloon during “freezes,” you have path instability or backend saturation.
Decision: If DNS/connect/tls jump around, fix network/VPN. If only TTFB rises while network timings are stable, investigate ERP web/app/database tiers.
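To correlate these timings with freeze reports instead of arguing from memory, leave a timestamped probe running through business hours. A simple sketch reusing the same curl invocation (the URL is the example hostname from above):
cr0x@server:~$ while true; do printf '%s ' "$(date +%T)"; curl -sk -o /dev/null -w "dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer}\n" https://erp.internal.example/login; sleep 30; done | tee erp-timing.log
When a user reports a freeze at 14:32, you either see dns/connect/tls spike around that time (network/VPN) or you don’t (go look at the app tier).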
Task 16 — Check for asymmetric routing (often breaks stateful devices)
cr0x@server:~$ traceroute -n 10.42.8.25 | head -n 8
traceroute to 10.42.8.25 (10.42.8.25), 30 hops max, 60 byte packets
1 10.8.0.1 2.011 ms 1.902 ms 2.144 ms
2 10.10.0.1 8.220 ms 8.101 ms 8.333 ms
3 10.42.0.1 35.009 ms 34.882 ms 35.120 ms
4 10.42.8.25 38.551 ms 38.420 ms 38.600 ms
What it means: Forward path looks sane. Asymmetry is about the return path, so you still need to verify routing on the ERP subnet side.
Decision: If ERP servers return traffic via a different firewall than the one that saw the SYN, you’ll see “random” stalls and resets. Fix routing symmetry or use stateless rules where appropriate.
Design the VPN for business apps, not heroics
Decide: full tunnel vs split tunnel (with grown-up rules)
Full tunnel is attractive because it feels controllable: “all traffic goes through us.” It also turns your VPN into the office internet for every remote user, which is a great way to learn what “capacity planning” means in real time.
For ERP/CRM, the best answer is usually selective tunneling:
- Tunnel what must be private: on-prem ERP, internal APIs, file services tied to ERP workflows.
- Do not hairpin SaaS: if your CRM is SaaS, routing it through the office adds latency and failure domains for no functional gain.
- Use split DNS carefully: internal names resolve internally; public names resolve publicly. Keep it deterministic.
Security teams often worry that split tunneling means “less secure.” The more accurate statement is: split tunneling means you must design controls intentionally (EDR, device posture, DNS protections, and egress policy), not accidentally via hairpinning.
Pick the right transport: UDP when you can, TCP only when forced
For interactive business applications, UDP-based tunneling is generally more forgiving under loss because it avoids TCP-over-TCP feedback loops. WireGuard and IPsec typically perform well and are operationally sane. OpenVPN can be fine too, but avoid TCP transport unless you have no choice.
If the network blocks UDP, you can still succeed with TCP—just expect to spend more time on MTU/MSS and keepalive tuning, and don’t pretend it’s “just as good.”
Make MTU boring: set it, clamp it, monitor it
MTU problems are classic because they’re intermittent and payload-size dependent. Solve them systematically:
- Measure effective PMTU across the tunnel.
- Set tunnel MTU slightly below that.
- Clamp TCP MSS on tunnel egress to PMTU.
- Allow ICMP “fragmentation needed” (type 3 code 4) through relevant firewalls.
If someone says “we block ICMP for security,” hand them a pager. Not as a threat. As a learning experience.
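On a Linux firewall in the path, letting PMTUD signals through is two rules. A sketch, assuming iptables/ip6tables manage the FORWARD chain; commercial firewalls have an equivalent checkbox:
cr0x@server:~$ sudo iptables -A FORWARD -p icmp --icmp-type fragmentation-needed -j ACCEPT
cr0x@server:~$ sudo ip6tables -A FORWARD -p ipv6-icmp --icmpv6-type packet-too-big -j ACCEPT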
Keep connections alive (because middleboxes forget)
Idle timers are not theoretical. They’re set by NAT devices, firewalls, and sometimes the VPN concentrator itself. When an ERP client keeps a session idle while a user reads a screen, the underlying transport may go idle too.
Use:
- VPN keepalives at an interval lower than the shortest state timeout in the path.
- Application keepalives when supported (WebSocket ping/pong, DB keepalive parameters).
- Firewall session timeout tuning for known-good app flows.
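What those keepalives look like depends on the stack. Two common examples, with intervals that are illustrative only; pick yours below the shortest middlebox timeout you found:
# OpenVPN: ping the peer every 10 s, declare it gone after 60 s of silence
keepalive 10 60
# WireGuard, [Peer] section on a client behind NAT:
PersistentKeepalive = 25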
Reduce chatty dependencies: use app-tier proximity or published apps
The most reliable way to make a chatty ERP behave over a not-always-great network is to stop shipping every round trip over the VPN. Two common patterns:
- Virtual desktop / published application close to the ERP (RDP/ICA): only pixels/keystrokes cross the WAN. You trade some UX preferences for stability.
- Web front-ends and APIs close to the database: keep chatty DB calls inside the data center, not across the VPN.
Yes, this is architectural. That’s the point. You can’t “tune” physics away forever.
Observe like an adult: measure user experience, not just tunnel up/down
“VPN connected” is a vanity metric. You need:
- RTT and loss to key internal subnets.
- Tunnel packet drops, rekeys, renegotiations.
- Gateway CPU, softirq, NIC drops.
- Application timings (TTFB, error rates, session drops).
Correlate. Freeze reports that line up with gateway drops are network. Freeze reports that line up with database lock waits are application. Your job is to stop the blame carousel.
Three mini-stories from corporate life
Mini-story #1 — The incident caused by a wrong assumption
The company had just moved its ERP web tier into a private subnet and published it through a VPN. Remote staff complained that “every afternoon” the ERP froze when they uploaded invoice PDFs. The infrastructure team saw clean graphs: VPN tunnels were up, CPU was fine, bandwidth wasn’t saturated. They labeled it “ERP bug” and escalated to the vendor.
The vendor asked for HAR logs. The logs showed a pattern: small API calls succeeded, but the upload request stalled and eventually timed out. Someone suggested MTU. The room got quiet—MTU is the kind of thing people remember exists right before it ruins their week.
They tested PMTU across the tunnel and found an effective MTU of 1412 due to an upstream WAN device. The VPN was configured with MTU 1500 because “Ethernet is 1500,” which is true in the same way that “lunch is at noon” is true: it depends on where you are and who’s in charge.
Fix was straightforward: clamp MSS on the gateway and set the tunnel MTU conservatively. Upload freezes disappeared immediately. The postmortem’s real lesson wasn’t MTU. It was the wrong assumption that “if small requests work, the network is fine.” Payload-size sensitivity is a huge tell.
Mini-story #2 — The optimization that backfired
A different org wanted to “speed up” the VPN. Someone noticed UDP was occasionally blocked on hotel Wi‑Fi, so they switched the VPN to TCP globally “for reliability.” It did help a handful of travelers connect more often. Tickets dropped for two weeks. Then month-end close arrived.
During peak finance activity, ERP response time went from “okay” to “frozen” for a chunk of remote users. The VPN stayed connected, but throughput cratered whenever there was mild packet loss. Users reported that the ERP would hang for 20–60 seconds and then suddenly catch up.
Packet captures showed classic TCP-over-TCP behavior: retransmissions inside retransmissions, congestion windows collapsing, and the tunnel becoming a performance amplifier—in the bad direction. Their “optimization” improved connectivity in hostile networks but made normal jitter far more destructive.
The fix was not to shame the person who made the change. The fix was to restore UDP as the default, add a TCP fallback profile for the environments that needed it, and make clients select based on reachability. Reliability isn’t “pick TCP.” Reliability is “design for the network you actually have.”
Mini-story #3 — The boring but correct practice that saved the day
A manufacturing company ran an on-prem ERP with a thick client that talked to a database and a file share. Remote work was growing, and they knew VPN would be a new dependency. They did two unsexy things early: they documented the critical ERP flows (which servers, which ports, which protocols) and they set up continuous probes from the VPN subnet to those endpoints.
Months later, a firewall policy update went in. Nobody touched the VPN configuration. Yet the next morning, remote ERP users saw intermittent freezes when opening attachments. The helpdesk escalated quickly because the monitoring dashboard already showed a change: SMB response time from the VPN subnet spiked, and there were new TCP resets on the attachment server’s path.
Because the team had the flow map, they didn’t waste half a day testing random theories. They found a session timeout mismatch on a firewall for SMB flows from the VPN pool and reverted that specific policy. The ERP vendor never got pulled into it. Finance never had to re-enter anything.
Nothing heroic happened. That’s why it worked. Boring observability and known-good baselines beat drama every time.
Common mistakes: symptoms → root cause → fix
This is the part you can paste into a ticketing system and look wise.
1) “Login works, but uploads/downloads hang”
- Symptoms: Attachments, report exports, or specific pages stall; small API calls succeed.
- Root cause: MTU blackholing or fragmentation issues; PMTUD broken due to ICMP filtering; no MSS clamping.
- Fix: Measure PMTU, set tunnel MTU conservatively, clamp MSS on egress, allow ICMP “frag needed.”
2) “It freezes for 30 seconds, then catches up”
- Symptoms: Intermittent long stalls; eventually resumes; especially during busy Wi‑Fi or peak hours.
- Root cause: Packet loss/jitter with TCP retransmit backoff; sometimes TCP VPN transport amplifying loss.
- Fix: Prefer UDP transport; reduce loss sources (wired, better Wi‑Fi), apply AQM (fq_codel/cake), check gateway drops.
3) “VPN is connected but ERP says ‘cannot reach server’”
- Symptoms: Tunnel up; name resolves; connection fails or times out.
- Root cause: Split tunnel routes missing; ERP resolves to internal IP but client routes it to default gateway; ACL missing.
- Fix: Fix route push/policies; confirm with ip route get; align DNS with routing intent.
4) “Works in the morning, fails in the afternoon”
- Symptoms: Time-of-day correlation; more complaints during calls/uploads.
- Root cause: Congestion and bufferbloat; WAN uplink saturation; gateway CPU under peak.
- Fix: Shape traffic; implement AQM; scale gateway; stop hairpinning SaaS through VPN.
5) “Only some users are affected, and it changes”
- Symptoms: Random subsets of users see freezes; hard to reproduce.
- Root cause: Anycast/geo differences to VPN POP, ISP path variance, Wi‑Fi conditions, or MTU differences with nested tunnels.
- Fix: Standardize client profiles; offer regional VPN endpoints; gather client-side measurements (RTT/loss/MTU).
6) “RDP to the ERP desktop freezes, but web browsing is fine”
- Symptoms: RDP/ICA stutters or freezes; general internet seems OK.
- Root cause: UDP blocked or shaped, or high jitter; RDP is sensitive to latency spikes; VPN queueing.
- Fix: Ensure stable UDP where possible; apply QoS at the edge; reduce latency spikes with AQM; consider enabling RDP’s UDP transport where the environment allows it.
7) “It logs out constantly when users are reading screens”
- Symptoms: Idle users get disconnected; session resets after a few minutes of inactivity.
- Root cause: NAT/firewall idle timeouts shorter than app expectations; missing keepalives.
- Fix: Set VPN keepalives; tune session timeouts; ensure long-lived flows are permitted and tracked correctly.
Joke #2: If your troubleshooting plan is “restart the VPN,” you’re not fixing the system—you’re performing a ritual for the packet gods.
Checklists / step-by-step plan
Step-by-step: stabilize ERP/CRM over VPN in 10 moves
- Inventory the flows: ERP/CRM endpoints, ports, protocols, dependencies (file shares, identity providers, APIs).
- Decide routing intent: split tunnel for SaaS; tunnel only internal networks that must be private.
- Fix DNS determinism: conditional forwarding/split DNS so names resolve to the right place consistently.
- Measure PMTU across the tunnel and standardize MTU settings per VPN profile.
- Clamp MSS on tunnel egress for TCP.
- Prefer UDP transport for the VPN where feasible; keep TCP fallback as a separate profile.
- Set keepalives below the shortest middlebox timeout; align firewall session timers with business reality.
- Implement AQM (fq_codel/cake) on the gateway edge or WAN device if you control it.
- Capacity plan the concentrator: CPU (crypto), RAM (conntrack/state), NIC drops, and concurrent sessions.
- Monitor user-experience signals: RTT/loss, tunnel drops, DNS failures, HTTP timing, and backend health.
Checklist: before you change anything in production
- Can you reproduce the freeze and capture timestamps?
- Do you have baseline RTT/loss from a known-good user?
- Do you know whether the traffic is full-tunnel or split-tunnel?
- Have you validated DNS resolution and routes match?
- Did you test PMTU and confirm MSS clamping?
- Do you know the VPN transport (UDP/TCP) and why?
- Did you check gateway CPU, drops, conntrack utilization during the incident window?
Checklist: if you must keep full-tunnel (policy says so)
- Ensure the VPN egress has enough bandwidth and sane shaping/AQM.
- Don’t hairpin obvious SaaS if you can use secure web gateways or device controls instead.
- Pin critical business apps to known-good routes and resolvers; avoid “magic” DNS that changes per network.
- Segment VPN pools (executives/finance vs general) so one group’s streaming doesn’t become everyone’s problem.
- Log and alert on tunnel renegotiations, gateway drops, and DNS failure rates.
FAQ
1) Should we use split tunneling for ERP/CRM?
For on-prem ERP, you’ll tunnel the internal subnets. For SaaS CRM/ERP, split tunnel is usually the right move to avoid hairpin latency and reduce VPN load—if you have endpoint controls and sane DNS.
2) Is OpenVPN “bad” for ERP?
No. OpenVPN can work well. The common foot-gun is running it over TCP for general use and then wondering why performance collapses under loss. UDP transport is usually better for interactive apps.
3) What MTU should we set?
Measure PMTU across your real path. Then pick a tunnel MTU that fits with headroom. If you can’t rely on ICMP, choose conservatively and clamp MSS. Guessing 1500 because “Ethernet” is how you earn night shifts.
4) Why do users say “it freezes” instead of “it’s slow”?
Because many apps block the UI thread waiting on network I/O or a lock. Packet loss and retransmit backoff look like a hard freeze even if the process is alive.
5) Can bandwidth upgrades fix timeouts?
Sometimes, if you’re saturating an uplink and queueing everything. But most ERP/CRM pain over VPN is latency, jitter, MTU, or state timeouts—not raw bandwidth. Measure before buying.
6) Why does it work on mobile hotspot but not home Wi‑Fi?
Different last-mile behavior: Wi‑Fi interference, router bufferbloat, ISP shaping, or UDP handling. Hotspots can be cleaner paths, or they can be worse. The point is: it’s not “the VPN,” it’s the path.
7) Do we need QoS if everything is encrypted?
You can still do QoS on the outer tunnel (by user/group, DSCP marking before encryption, or per-tunnel shaping). If you do nothing, the loudest flow wins—usually someone uploading a video while finance is posting journals.
8) How do we stop idle disconnects?
Set VPN keepalives and align firewall/NAT session timeouts. Also check that rekey/renegotiation doesn’t interrupt long-lived sessions. If the app supports its own keepalive, enable it.
9) Is RDP a good workaround for ERP over VPN?
Often yes. If your ERP is chatty or uses SMB/DB connections directly from the client, publishing the app near the servers turns WAN pain into a single, controllable protocol. It’s not fashionable, but it’s effective.
10) What’s the single most common root cause you see?
MTU/MSS problems for “hang on upload,” and packet loss/jitter for “random freezes.” Close third: bad DNS/routing decisions causing hairpin paths or wrong endpoints.
Conclusion: next steps that actually reduce tickets
ERP/CRM over VPN fails in predictable ways. If you treat it like a mystery, you’ll get mysteries. Treat it like a system with measurable properties—RTT, loss, MTU, routing, DNS, state timeouts—and it becomes boring. Boring is the goal.
Practical next steps:
- Run the fast diagnosis playbook on one affected user path and one known-good path. Capture RTT/loss/PMTU and routing/DNS evidence.
- Fix MTU/MSS and ICMP handling first. It’s low effort and high impact.
- Stop hairpinning SaaS through the office unless you have a specific, validated reason.
- Prefer UDP-based VPN transport; keep TCP as a fallback profile, not the default.
- Instrument your VPN gateway like it matters: drops, conntrack, CPU/softirq, and per-path experience checks to ERP endpoints.
- If the ERP is inherently chatty, move the user interaction closer to the app (published apps/VDI) instead of trying to perfect the WAN.