Everything works on the LAN. SSH is fine, HTTP is fine, DNS looks healthy. Then you try from “outside” (VPN, mobile hotspot, partner network, the actual internet) and it’s dead. Or worse: it half-works, with timeouts that feel like a prank.
This failure pattern is rarely “the app.” It’s almost always routing, NAT, filtering, or a quiet mismatch between what you think the path is and what packets are doing at 2 a.m. The cure is not magical. It’s methodical, capture-driven, and a little bit ruthless with assumptions.
The mental model: why LAN success proves almost nothing
When a service “works on LAN,” you’ve proven exactly one thing: something can reach it on a directly reachable path with minimal translation and often with simpler firewall policy. LAN traffic tends to be:
- Single-hop or a small number of hops (routing is simpler).
- Often not NATed (or NAT behaves differently).
- Less likely to traverse stateful middleboxes with opinionated timeouts.
- Less likely to hit asymmetric routing (the return path is usually obvious).
- Less likely to trigger “internet-facing hardening” rules.
WAN traffic adds layers: edge routers, ISP CPE, security groups, cloud load balancers, and NAT, sometimes with multiple NATs stacked like pancakes made of sadness.
The diagnostic trick is to stop thinking in terms of “client can’t connect” and start thinking in terms of packet phases:
- Ingress: Does the SYN/UDP request arrive at the server interface you expect?
- Local policy: Does Linux accept it (firewall, rp_filter, conntrack)?
- Service binding: Is the app bound to the right IP/port and reachable via that path?
- Return path: Does the reply leave via the same edge (or at least a routable one)?
- Translation: Is NAT rewriting the right addresses/ports and tracking state correctly?
If you can answer those five, you can usually name the root cause with confidence, not vibes.
Fast diagnosis playbook (first/second/third)
First: prove whether packets reach the box
Run a capture on the public-facing interface while you attempt a connection from the WAN. If you don’t see inbound packets, the problem is upstream (edge NAT/port forward, ISP firewall, cloud security group, wrong public IP).
Second: prove whether the box responds and where replies go
If you see the SYN arrive, immediately look for the SYN-ACK leaving. If it leaves a different interface than expected, you’re in asymmetric routing/policy routing land. If no response leaves at all, you’re in firewall/rp_filter/app binding land.
Third: prove state and translation
If you see replies leave but the client never gets them, look at NAT state (conntrack) and intermediate devices. Many “WAN fails” cases are return traffic being NATed incorrectly, dropped due to rp_filter, or failing because of MTU/MSS issues on the WAN path.
Opinionated rule: Don’t “just add a rule.” Capture first, change second. Otherwise you’ll stack fixes until you can’t reason about the system anymore.
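If you want the three phases as commands you can paste, here is a minimal sketch, assuming eth0 is the WAN-facing interface and 198.51.100.88 is the test client (both illustrative); the tasks below walk through each step with real output.
# Steps 1 and 2: does the SYN arrive, and does a reply leave this interface?
sudo tcpdump -ni eth0 'host 198.51.100.88 and tcp port 443'
# Step 2, multi-homed variant: which interface do replies actually use?
sudo tcpdump -ni any 'host 198.51.100.88 and tcp port 443'
# Step 3: is NAT/conntrack tracking the flow, and with the right translation?
sudo conntrack -L | grep 198.51.100.88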
Interesting facts and historical context (9 quick ones)
- NAT wasn’t the original plan. It became popular in the 1990s as IPv4 address scarcity met consumer broadband.
- Linux netfilter arrived in the 2.4 kernel era. It replaced older ipchains and made stateful firewalling mainstream in Linux.
- conntrack is stateful memory. NAT relies on connection tracking; if conntrack is full, “random” failures appear that aren’t random.
- rp_filter exists to fight spoofing. Reverse path filtering drops packets that “shouldn’t” arrive on an interface, which is great until you do asymmetric routing or policy routing.
- Path MTU Discovery has been fragile for decades. Filtering ICMP “Fragmentation Needed” breaks PMTUD, leading to WAN-only stalls and MSS/MTU “mystery” bugs.
- Hairpin NAT is a special kind of awkward. Accessing a service via its public IP from inside the same NAT can fail unless the router supports reflection/hairpinning.
- nftables didn’t kill iptables overnight. Many distros run iptables as a compatibility layer over nftables, which can confuse debugging if you don’t check what’s actually active.
- Default routes are political. Multi-homed hosts (two uplinks) need explicit policy; otherwise Linux picks “best” routes that are best only in its own head.
- UDP “works on LAN” is not the same as “works on WAN.” NAT timeouts and stateful firewalls treat UDP like a mayfly.
Practical tasks: commands, outputs, and decisions (14 tasks)
All tasks assume Debian/Ubuntu on the server side unless noted. Replace interface names, IPs, and ports with your reality. The point is the shape of the evidence.
Task 1: Confirm the service is listening on the right IP and port
cr0x@server:~$ sudo ss -lntup | grep -E '(:80|:443|:22)\b'
tcp LISTEN 0 4096 0.0.0.0:80 0.0.0.0:* users:(("nginx",pid=1432,fd=6))
tcp LISTEN 0 4096 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=911,fd=3))
tcp LISTEN 0 4096 [::]:443 [::]:* users:(("nginx",pid=1432,fd=7))
What it means: If you see 127.0.0.1:80 or a private-only address, WAN won’t work (the app isn’t bound to the public-facing path). If you only see [::]:443 and your WAN test is IPv4, you might be listening only on IPv6 or vice versa.
Decision: Fix bind addresses first. Don’t touch NAT until the service is actually reachable on the interface/IP you expect.
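If nginx is the service in question, a minimal sketch of the relevant listen lines, assuming you want it reachable on both address families (the server_name and certificate lines are illustrative):
server {
    listen 0.0.0.0:443 ssl;          # IPv4, all addresses
    listen [::]:443 ssl;             # IPv6 as well, if the WAN path uses it
    server_name app.example.internal;
    # ssl_certificate / ssl_certificate_key as usual
}
After changing it, re-run ss -lntup and confirm the socket actually moved; a reload that fails validation quietly keeps the old binding.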
Task 2: Verify local firewall policy (nftables/iptables/UFW) without guessing
cr0x@server:~$ sudo nft list ruleset | sed -n '1,140p'
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        ct state established,related accept
        iif "lo" accept
        tcp dport { 22, 80, 443 } accept
        ip protocol icmp accept
        counter drop
    }
}
What it means: Default drop with explicit allow is fine. Missing your service port is not fine. If you’re using UFW, it may be managing rules under the hood, but nft shows what’s actually active.
Decision: If inbound WAN hits this host directly, open the port here. If traffic is DNATed to an internal host, open it on the destination host too.
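If this host is the one that should accept the traffic, a hedged sketch of opening an extra port in the ruleset above (8443 is an illustrative port):
# 'insert' puts the rule at the top of the chain, ahead of the final counter drop;
# a plain 'add' would append it after the drop and it would never match.
sudo nft insert rule inet filter input tcp dport 8443 ct state new accept
# Persist the change in /etc/nftables.conf (or wherever your ruleset is managed),
# or it vanishes on the next reboot.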
Task 3: Check whether iptables is really nftables in disguise
cr0x@server:~$ sudo iptables -V
iptables v1.8.9 (nf_tables)
What it means: Your iptables commands may be manipulating the nftables backend. That’s okay, but mixing “legacy” and “nf_tables” tooling can produce rules you don’t see where you’re looking.
Decision: Pick one view (prefer nft list ruleset) and stick to it while debugging.
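On Debian/Ubuntu, a quick way to confirm which backend the iptables command really is, and to check that nothing is hiding in the legacy tables (output omitted; it varies by system):
sudo update-alternatives --display iptables   # shows whether iptables-nft or iptables-legacy is selected
sudo iptables-legacy -S                       # any rules here are invisible to iptables-nft
sudo iptables-legacy -t nat -S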
Task 4: Confirm the routing table and default route on the server
cr0x@server:~$ ip route
default via 203.0.113.1 dev eth0 proto dhcp src 203.0.113.20 metric 100
10.10.0.0/24 dev eth1 proto kernel scope link src 10.10.0.10
203.0.113.0/24 dev eth0 proto kernel scope link src 203.0.113.20
What it means: The server thinks replies to the internet go out eth0 via 203.0.113.1. If your WAN path is actually via another interface, you’ve found a core issue.
Decision: If you’re multi-homed, prepare to use policy routing (ip rule) rather than “hoping” Linux chooses the interface you like.
Task 5: Inspect policy routing rules (the silent kingmakers)
cr0x@server:~$ ip rule show
0: from all lookup local
100: from 10.10.0.0/24 lookup 100
32766: from all lookup main
32767: from all lookup default
What it means: Traffic sourced from 10.10.0.0/24 uses routing table 100. That can cause WAN failures if return traffic for a WAN connection uses the wrong source IP or wrong uplink.
Decision: If a WAN service is DNATed to 10.10.0.10, ensure the replies route back through the interface the client can actually reach, usually the same NAT gateway.
Task 6: Check rp_filter (reverse path filtering), the classic “WAN only” assassin
cr0x@server:~$ sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth0.rp_filter net.ipv4.conf.eth1.rp_filter
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.eth0.rp_filter = 1
net.ipv4.conf.eth1.rp_filter = 1
What it means: Strict mode (1) drops packets if the kernel thinks the return path wouldn’t use the same interface. Asymmetric routing or policy routing can trigger drops before your firewall even gets a vote.
Decision: If you have asymmetric routing by design (multi-uplink, VRFs, policy routing), set rp_filter to loose (2) on affected interfaces, or disable carefully (0) where appropriate.
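A minimal sketch of making loose mode persistent, assuming eth0/eth1 are the interfaces with intentional asymmetry (the file name is illustrative). The kernel applies the higher of the “all” and per-interface values, so setting the interface to 2 is enough even if “all” stays at 1.
# /etc/sysctl.d/99-rpfilter.conf
net.ipv4.conf.eth0.rp_filter = 2
net.ipv4.conf.eth1.rp_filter = 2
Apply with sudo sysctl --system and re-test before touching anything else.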
Task 7: Prove packets arrive: tcpdump on the WAN-facing interface
cr0x@server:~$ sudo tcpdump -ni eth0 'tcp port 443 and (tcp[tcpflags] & (tcp-syn) != 0)'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:44:10.101010 IP 198.51.100.88.50122 > 203.0.113.20.443: Flags [S], seq 1234567890, win 64240, options [mss 1460,sackOK,TS val 101 ecr 0,nop,wscale 7], length 0
What it means: The SYN arrives. Upstream routing and public IP are fine. Now the ball is in your court: local policy, NAT, or return routing.
Decision: If you don’t see packets, stop editing the server. Fix edge NAT, security groups, upstream firewall, or the “wrong public IP” situation.
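To make the capture easy to read, generate a probe you control from the WAN-side client. A hedged sketch (IP illustrative; -k because you are hitting the raw IP, so the certificate name will not match):
curl -4 -skv --connect-timeout 5 https://203.0.113.20/ -o /dev/null
# If that times out, trace where the path dies with a TCP probe on the same port:
mtr -n -T -P 443 203.0.113.20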
Task 8: Prove replies leave: capture SYN and SYN-ACK together
cr0x@server:~$ sudo tcpdump -ni eth0 'host 198.51.100.88 and tcp port 443'
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:44:10.101010 IP 198.51.100.88.50122 > 203.0.113.20.443: Flags [S], seq 1234567890, win 64240, options [mss 1460,sackOK,TS val 101 ecr 0,nop,wscale 7], length 0
12:44:10.101200 IP 203.0.113.20.443 > 198.51.100.88.50122: Flags [S.], seq 987654321, ack 1234567891, win 65160, options [mss 1460,sackOK,TS val 202 ecr 101,nop,wscale 7], length 0
What it means: The server responds correctly on the right interface. If the client still can’t connect, something between server and client is dropping replies (edge firewall, ISP, DDoS protection, wrong NAT state).
Decision: Shift focus outward: perimeter device, NAT gateway, cloud firewall, or return route beyond this host.
Task 9: Catch asymmetric return routing (reply leaves the wrong interface)
cr0x@server:~$ sudo tcpdump -ni any 'host 198.51.100.88 and tcp port 443'
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
12:44:10.101010 eth0 IP 198.51.100.88.50122 > 203.0.113.20.443: Flags [S], seq 1234567890, win 64240, length 0
12:44:10.101300 eth1 IP 10.10.0.10.443 > 198.51.100.88.50122: Flags [S.], seq 987654321, ack 1234567891, win 65160, length 0
What it means: The SYN came in on eth0 to the public IP, but the SYN-ACK is leaving from the private IP on eth1. The client will never accept that. This is the signature of wrong source IP selection, policy routing, or missing src in routes.
Decision: Fix source address selection and routing. Typical fixes: set correct src on routes, add ip rule by source, or ensure DNATed services reply via the NAT gateway.
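A minimal sketch of pinning replies from the public address to the WAN uplink, assuming 203.0.113.20 lives on eth0 with gateway 203.0.113.1 (the table number and priority are arbitrary illustrative choices):
sudo ip route add default via 203.0.113.1 dev eth0 src 203.0.113.20 table 200
sudo ip rule add from 203.0.113.20/32 lookup 200 priority 110
# Sanity check: which route would a reply to the client take when sourced from the public IP?
ip route get 198.51.100.88 from 203.0.113.20
Make the rule and route persistent via whatever manages your network config (netplan, systemd-networkd, ifupdown); commands typed by hand disappear on reboot.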
Task 10: Validate NAT rules on a gateway (MASQUERADE/SNAT/DNAT)
cr0x@server:~$ sudo nft list table ip nat
table ip nat {
    chain prerouting {
        type nat hook prerouting priority -100; policy accept;
        tcp dport 443 dnat to 10.10.0.10:443
    }
    chain postrouting {
        type nat hook postrouting priority 100; policy accept;
        oif "eth0" masquerade
    }
}
What it means: DNAT sends inbound 443 to an internal server. Postrouting masquerade ensures replies leaving eth0 get a public source. If masquerade is missing, internal hosts may respond with private addresses that die on the internet.
Decision: If WAN clients can’t complete the handshake, confirm both DNAT (inbound) and SNAT/MASQUERADE (outbound) are correct for the path.
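To confirm the gateway is actually translating the flow, look at its conntrack entry (client IP illustrative):
sudo conntrack -L -p tcp --orig-port-dst 443 | grep 198.51.100.88
# A healthy DNAT entry shows the public IP as the original destination and 10.10.0.10
# as the source in the reply direction; if the reply tuple still shows the public IP,
# DNAT is not happening for that flow.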
Task 11: Confirm forwarding is enabled (only relevant on gateways/routers)
cr0x@server:~$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1
What it means: If this is a NAT gateway and ip_forward is 0, you can DNAT all day and nothing will forward. LAN testing may still work if you’re bypassing the gateway or hitting services locally.
Decision: If you expect routing/NAT through this box, net.ipv4.ip_forward must be 1 and the forward chain must allow traffic.
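A hedged sketch of both halves on the gateway: persisting ip_forward and allowing the DNATed flow through the forward hook (file name, chain layout, and addresses are illustrative; skip the chain creation if one already exists):
echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-forward.conf
sudo sysctl --system
sudo nft 'add chain inet filter forward { type filter hook forward priority 0; policy drop; }'
sudo nft add rule inet filter forward ct state established,related accept
sudo nft add rule inet filter forward iif "eth0" oif "eth1" ip daddr 10.10.0.10 tcp dport 443 accept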
Task 12: Check conntrack utilization (the “it works until peak” problem)
cr0x@server:~$ sudo conntrack -S
cpu=0 found=120 new=42 invalid=3 ignore=0 delete=8 delete_list=8 insert=42 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=1 found=115 new=38 invalid=1 ignore=0 delete=6 delete_list=6 insert=38 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cr0x@server:~$ cat /proc/sys/net/netfilter/nf_conntrack_count
51234
cr0x@server:~$ cat /proc/sys/net/netfilter/nf_conntrack_max
65536
What it means: You’re close to max. When the table fills, new connections drop. LAN tests may succeed because they don’t traverse NAT/conntrack, or because the LAN path is simpler.
Decision: If counts approach max, increase nf_conntrack_max, reduce timeouts for noisy protocols, or stop funneling everything through one overloaded state table.
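If the numbers justify it, a hedged tuning sketch (values are illustrative; size them to your RAM and real connection rates, and keep monitoring afterwards):
# /etc/sysctl.d/99-conntrack.conf
net.netfilter.nf_conntrack_max = 262144
# Shorter timeouts for chatty UDP so dead entries don't pile up:
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 120
Apply with sudo sysctl --system, then watch nf_conntrack_count against the new ceiling.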
Task 13: Detect MTU/MSS issues (WAN stalls, small payload works)
cr0x@server:~$ ip link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
cr0x@server:~$ ping -M do -s 1472 198.51.100.1 -c 3
PING 198.51.100.1 (198.51.100.1) 1472(1500) bytes of data.
From 203.0.113.1 icmp_seq=1 Frag needed and DF set (mtu = 1492)
From 203.0.113.1 icmp_seq=2 Frag needed and DF set (mtu = 1492)
From 203.0.113.1 icmp_seq=3 Frag needed and DF set (mtu = 1492)
--- 198.51.100.1 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2031ms
What it means: Your path MTU is smaller than 1500 (common with PPPoE/VPN). If ICMP is blocked upstream, PMTUD breaks and TCP sessions can hang mid-flight—often only on WAN.
Decision: Fix MTU on the interface/tunnel, allow ICMP “Fragmentation needed,” or clamp MSS on the edge firewall for TCP.
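On a Linux gateway, a minimal MSS clamping sketch, assuming an inet filter table with a forward chain as in the gateway sketches above; “rt mtu” clamps to whatever the route’s MTU says rather than a hardcoded value:
sudo nft add rule inet filter forward tcp flags syn tcp option maxseg size set rt mtu
# And do not filter ICMP type 3 code 4 (destination unreachable / fragmentation needed)
# anywhere on the path, or PMTUD goes back to guessing.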
Task 14: Confirm DNS isn’t the real culprit (split-horizon, wrong A/AAAA)
cr0x@server:~$ dig +short app.example.internal A
10.10.0.10
cr0x@server:~$ dig +short app.example.internal @1.1.1.1 A
203.0.113.20
What it means: Internal DNS returns private IP; public resolver returns public IP. That’s normal for split-horizon. It becomes a problem when clients on “WAN-but-not-really” (VPN users, partner networks) get the wrong view.
Decision: Decide which clients should see which address. Fix conditional forwarding on VPN DNS, or stop trying to use one name for two worlds unless you can control resolvers.
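One handy way to test the public path by name without fighting the resolver: pin the name to a specific address with curl (address illustrative):
# Forces app.example.internal to resolve to the public IP for this one request:
curl -v --resolve app.example.internal:443:203.0.113.20 https://app.example.internal/ -o /dev/null
If that works but the normal lookup does not, the problem is the DNS view, not the service.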
Joke #1: NAT is like office politics: everything works until someone asks who’s actually allowed to talk to whom.
Three corporate mini-stories from the trenches
Mini-story 1: The outage caused by a wrong assumption
One company ran a customer portal behind a Debian-based reverse proxy. Internally, everyone accessed it via the service name, which resolved to a private VIP. From the internet, it resolved to a public IP on an edge firewall that forwarded 443 to the same reverse proxy.
A team spun up a second uplink for “redundancy.” The server became multi-homed: eth0 toward the firewall, eth1 toward a new SD-WAN appliance. They assumed Linux would “do the right thing” and return traffic through the interface it came in on. Linux did what Linux does: it routed according to its tables, not according to anyone’s hopes.
LAN tests kept passing because internal clients hit the private VIP and traffic stayed on the private side. WAN tests failed intermittently. Sometimes the handshake succeeded, sometimes it didn’t, depending on which source address got selected and which routing rule won that moment.
The debugging breakthrough was a tcpdump -ni any capture that showed SYNs arriving on eth0 but SYN-ACKs leaving via eth1 with the wrong source IP. Once the team added source-based routing rules and set rp_filter to loose on the relevant interfaces, the WAN path stabilized.
The wrong assumption wasn’t “Linux is broken.” It was believing symmetric routing is the default in a multi-homed world. It’s not. It never was.
Mini-story 2: The optimization that backfired
Another org wanted to reduce latency. They moved NAT and firewalling from a dedicated appliance to an Ubuntu VM “close to the workload.” It looked great in synthetic tests. The VM had plenty of CPU, and iperf numbers were fantastic inside the DC.
Then the real traffic hit: lots of short-lived HTTPS connections from the internet, plus some UDP-based monitoring from partner networks. The conntrack table filled during bursty periods. When it hit the ceiling, new connections dropped. The app team filed a bug: “WAN flaky, LAN fine.” The ticket came with screenshots and disappointment.
LAN was fine because internal requests didn’t traverse that NAT path or created fewer states. WAN was a conntrack carnival. They had “optimized” by collapsing roles without sizing state tables, timeouts, and kernel parameters for internet behavior.
The fix wasn’t heroic. They measured conntrack usage, increased nf_conntrack_max, reduced specific UDP timeouts where appropriate, and added monitoring for conntrack saturation. Most importantly, they stopped treating NAT as stateless plumbing. NAT is state, and state needs capacity planning.
Joke #2: Nothing says “high availability” like a single conntrack table that panics whenever marketing runs a campaign.
Mini-story 3: The boring practice that saved the day
A finance company had a simple rule: every internet-facing change included a packet capture before and after, stored with the change record. Not a novel. Just enough to show ingress and egress behavior.
One Friday, a small firewall update went out. Immediately, remote users reported they could reach the login page but file downloads hung. Inside the office, everything worked. The first instinct was to blame the application or storage backend because “downloads are big.”
The on-call engineer pulled the “before” capture from last week and took a fresh “after” capture. The new capture showed TCP sessions stalling after a certain packet size. Then they saw the real change: ICMP type 3 code 4 had been blocked “for security.” PMTUD broke, and MSS wasn’t being clamped. Classic WAN-only pain.
They reverted the ICMP block, added an explicit allow for the needed ICMP messages, and documented when to use MSS clamping on VPN links. The incident lasted minutes, not hours, because they had baseline evidence and the discipline to compare it. Boring, correct, repeatable. The best kind of ops.
Reliability quote (paraphrased idea): “You can’t improve what you don’t measure.” — W. Edwards Deming
Common mistakes: symptom → root cause → fix
1) “Works from LAN, times out from WAN”
Symptom: No connection at all from the internet; LAN clients succeed.
Root cause: Inbound traffic never reaches the host (wrong public IP, missing port forward, cloud security group blocks, ISP blocks).
Fix: Capture on the WAN interface. If nothing arrives, fix upstream NAT/firewall. Don’t touch the app.
2) “SYN arrives, no SYN-ACK leaves”
Symptom: tcpdump shows inbound SYN, but no response.
Root cause: Local firewall drop, service not listening on that IP/port, rp_filter drop, or local route to client is broken.
Fix: Check ss -lntup, then firewall rules (nft), then rp_filter, then routing.
3) “SYN-ACK leaves but client never completes handshake”
Symptom: Server responds, but client keeps retransmitting SYN.
Root cause: Return path broken beyond server, wrong source IP on reply (asymmetric routing), or upstream drops.
Fix: Capture on any to see which interface reply uses. Fix policy routing or NAT on the gateway. Verify upstream rules.
4) “VPN users fail, internet users okay”
Symptom: Remote VPN clients can’t reach service using public name; others can.
Root cause: Split DNS mismatch, or hairpin NAT not supported on the VPN egress path.
Fix: Fix DNS view for VPN, or provide an internal name/VIP for VPN users, or implement hairpin NAT correctly.
5) “Small requests work, large downloads hang”
Symptom: Login page loads, but file transfers stall on WAN.
Root cause: MTU/PMTUD failure due to blocked ICMP or tunnel overhead; missing MSS clamping.
Fix: Allow ICMP fragmentation-needed, set correct MTU, clamp MSS on edge for TCP where needed.
6) “Everything breaks during peak, then recovers”
Symptom: Random WAN failures correlated with load.
Root cause: conntrack table exhaustion, NAT port exhaustion, or stateful firewall saturation.
Fix: Measure conntrack count vs max, adjust limits/timeouts, scale out NAT, or reduce unnecessary state creation.
7) “WAN IPv4 fails but IPv6 works” (or vice versa)
Symptom: One IP family works; the other is dead.
Root cause: App binding mismatch, firewall mismatch, missing AAAA/A record or wrong route for one family.
Fix: Check ss for v4/v6 listening sockets, verify firewall rules per family, validate DNS records per family.
8) “LAN works via IP, fails via hostname”
Symptom: curl https://10.10.0.10 works but curl https://app.example.internal fails.
Root cause: DNS points to a different address on WAN, or certificate/SNI routes differently on reverse proxy.
Fix: Compare DNS from inside/outside, check reverse proxy vhost and SNI config, confirm correct target IP.
Checklists / step-by-step plan
Checklist A: Single host directly on public IP (no DNAT)
- Confirm listening socket: ss -lntup. If it’s bound to localhost/private-only, fix service config.
- Confirm local firewall: nft list ruleset. If policy is drop, explicitly allow the port.
- Confirm packets arrive: tcpdump -ni eth0 while testing from WAN.
- If packets arrive but no reply: check rp_filter and routes to the client subnet.
- If reply leaves but client doesn’t see it: engage upstream firewall/ISP, verify return routing, verify any DDoS scrubber policy.
- For “large payload hangs”: test MTU/PMTUD and ICMP allowance.
Checklist B: NAT gateway doing port forwarding (DNAT to private server)
- On the gateway, confirm DNAT rule exists and matches interface/port.
- On the gateway, confirm forward chain allows the traffic (stateful allow is typical).
- On the gateway, confirm SNAT/MASQUERADE for outbound replies on the WAN interface.
- On the internal server, confirm it sees the inbound connection (tcpdump on private interface).
- On the internal server, confirm its default route points back to the gateway (otherwise replies bypass NAT and die); see the sketch after this checklist.
- Watch conntrack on the gateway during tests; NAT needs state.
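For the default-route item above, the check and the usual fix on the internal server, with addresses and interface illustrative (10.10.0.1 standing in for the NAT gateway’s LAN address):
ip route show default
# If it points anywhere other than the NAT gateway, replies bypass translation:
sudo ip route replace default via 10.10.0.1 dev eth0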
Checklist C: Multi-homed host (two interfaces, two paths)
- Capture on any and verify ingress and egress interface for the same flow.
- Check ip rule and all routing tables used.
- Set rp_filter to loose (2) where asymmetric routing is expected.
- Pin source addresses using src on routes or policy routing by source subnet.
- Retest and verify the reply leaves through the correct uplink consistently.
Checklist D: DNS and hairpin sanity (common in offices and VPNs)
- Compare DNS answers from inside and outside using dig with specific resolvers.
- If internal clients resolve the public IP, test whether the edge supports hairpin NAT (a sketch follows this checklist).
- If hairpin is not supported, stop forcing it: give internal/VPN clients an internal name or conditional DNS.
- Verify certificates and SNI routing match the hostname users actually use.
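For the hairpin case above, a minimal nftables sketch on the edge, assuming eth1 is the LAN side, 203.0.113.20 the public IP, and 10.10.0.10 the internal server (all illustrative):
# LAN clients hitting the public IP still get DNATed to the internal server...
sudo nft add rule ip nat prerouting iif "eth1" ip daddr 203.0.113.20 tcp dport 443 dnat to 10.10.0.10:443
# ...and the source is rewritten so the server replies back through the gateway,
# not directly to the LAN client (which would break the translation).
sudo nft add rule ip nat postrouting ip saddr 10.10.0.0/24 ip daddr 10.10.0.10 tcp dport 443 masquerade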
FAQ
1) If it works on LAN, doesn’t that prove the app is fine?
It proves the app can respond on some path. WAN adds different routing, NAT, and firewall policy. Treat LAN success as “app not totally dead,” nothing more.
2) What’s the single fastest test to avoid guessing?
tcpdump on the WAN interface during a WAN connection attempt. If nothing arrives, stop blaming the server.
3) Why does rp_filter break WAN but not LAN?
LAN traffic often uses a straightforward symmetric path. WAN traffic on multi-homed hosts or policy-routed networks can look “spoofed” to strict rp_filter, so the kernel drops it early.
4) I opened the port in UFW but it still fails. Why?
Either UFW isn’t the active firewall layer you think it is, or the traffic doesn’t reach the host, or you’re DNATing to another box that’s still blocking it. Always validate the real active rules with nft list ruleset and capture traffic.
5) How do I tell if I have asymmetric routing?
Capture on any and watch a single flow. If SYN enters on one interface and SYN-ACK leaves another, you have asymmetry. Then check ip rule and per-table routes.
6) Why do large downloads fail but the login page works?
That’s the MTU/PMTUD failure pattern. Small packets pass; larger ones require fragmentation or proper PMTUD signaling. If ICMP “fragmentation needed” is blocked, TCP stalls.
7) Can conntrack exhaustion really look like “WAN only”?
Yes. Many LAN paths don’t traverse NAT/conntrack-heavy devices, or create fewer states. Internet-facing traffic is bursty and stateful, and it’s very good at finding your table limits.
8) We DNAT to an internal host. Why does the internal host need the correct default route?
Because replies must go back through the NAT gateway to be translated correctly. If the internal host sends replies out a different gateway, the client sees private source IPs or mismatched state and the session dies.
9) Is IPv6 relevant to “LAN works, WAN fails”?
Absolutely. Many environments accidentally expose IPv6 internally but not externally (or the reverse). Check bind addresses, DNS AAAA records, and firewall rules per family.
10) Should I just enable MASQUERADE everywhere?
No. That’s how you create debugging debt and surprise paths. SNAT/MASQUERADE should be precise: correct egress interface, correct source subnets, and only where you actually need translation.
Conclusion: next steps you should actually do
If you only take one habit from this: capture first. “Works on LAN, fails on WAN” is a routing/NAT/filtering story until proven otherwise. Your job is to prove where the packet disappears.
- Run tcpdump on the WAN interface during a failing attempt. Decide: upstream vs local.
- If it’s local, check in order: listening sockets → firewall rules → rp_filter → routing/policy routing.
- If it’s DNAT, verify both directions: DNAT in, SNAT/MASQUERADE out, and correct default route on the internal host.
- If it’s flaky under load, measure conntrack and state table limits and stop treating NAT as free.
- If it’s “big payloads hang,” fix MTU/PMTUD and don’t randomly block ICMP you don’t understand.
Do those, and “works on LAN, fails on WAN” becomes less of a mystery and more of a checklist with receipts.