Office-to-office access control: enforce “only servers, not the whole LAN” rules

You want two offices connected—but you don’t want every printer, BYOD laptop, and “temporary” IoT camera in Office A having a straight shot into Office B’s desktops. You want the boring, sensible rule: office-to-office traffic is for servers and a small set of managed services, not the entire LAN.

Most teams say they want that. Then a “quick” VPN comes online, routes get leaked, and suddenly you’re running a distributed flat network with two snack kitchens. The job here is to keep the connection while removing the blast radius.

The model: offices are untrusted, services are allowed

“Only servers, not the whole LAN” sounds like a firewall rule. It’s actually a networking posture:

  • Office user segments are noisy, unpredictable, and full of devices you don’t patch fast enough.
  • Server segments (or “service networks”) are managed, monitored, and are where you can enforce identity and least privilege.
  • Inter-office connectivity should be treated like a hostile boundary, even if the other office is “us.”

Practically, “only servers” means:

  • Between Office A and Office B, only a small set of destination subnets are reachable (e.g., 10.20.10.0/24 for servers, not 10.20.0.0/16 for everything).
  • Even within allowed subnets, only required ports are allowed (e.g., 443 to reverse proxies, 22 only from bastions, 445 only from a file services gateway—if you must).
  • Routing and policy align: you don’t advertise a route you intend to block, and you don’t block something you’re accidentally advertising with a more specific prefix.
  • Logging is non-optional at the boundary. If you can’t answer “who initiated what to where,” you’re not controlling access—you’re guessing.

Opinionated take: if your “office-to-office VPN” is just a tunnel with 0.0.0.0/0 on both sides and you’re hoping host firewalls will save you, you’re building an incident pipeline.

Define “server” like you mean it

Teams get stuck because “server” becomes a vibe. Make it concrete:

  • Server subnets are IP ranges dedicated to managed workloads (VM clusters, container nodes, NAS, AD/LDAP, reverse proxies).
  • Client subnets are user endpoints (wired, Wi‑Fi, guest, labs, conference rooms).
  • Infrastructure subnets are network devices, monitoring, management planes, jump hosts, and VPN concentrators.

Then codify: inter-office connectivity is allowed only to server + infrastructure subnets, with a port allowlist, and preferably with identity on top (mTLS, SSO-based proxies, or device posture).
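
If the boundary device is a Linux gateway running nftables, the codified version can be surprisingly small: a named set for the allowed destination prefixes plus a default-deny forward chain. The sketch below is standalone and illustrative, with hypothetical interface names (lan_clients, vpn0) and the example prefixes used later in this article; a real gateway also needs rules for its other forwarding paths (LAN to WAN, management, and so on).

cr0x@gateway:~$ sudo nft -f - <<'EOF'
table inet interoffice {
  # the only remote destinations office clients may initiate connections to
  set officeb_services {
    type ipv4_addr
    flags interval
    elements = { 10.20.10.0/24, 10.20.0.0/24 }
  }

  chain forward {
    type filter hook forward priority 0; policy drop;
    ct state established,related accept
    iifname "lan_clients" oifname "vpn0" ip daddr @officeb_services tcp dport { 443 } accept
    iifname "lan_clients" oifname "vpn0" log prefix "DROP interoffice " drop
  }
}
EOF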

Facts and history that explain why this keeps going wrong

Some context helps. Not because it’s cute, but because these failure modes repeat for the same reasons.

  1. Early enterprise networks grew “flat” because it was cheaper. VLANs and inter-VLAN routing were often a later retrofit, not the original plan.
  2. Site-to-site VPNs exploded in the early 2000s as Internet links became stable enough to replace leased lines; many designs carried over “trusted WAN” assumptions from the leased-line era.
  3. MPLS culture trained people to trust the carrier cloud. When teams moved to Internet VPNs, they often kept “any-to-any” routing, just with encryption.
  4. CIDR (1993) enabled aggregation but also made it easy to over-summarize: advertising /16 because it’s tidy, not because it’s safe.
  5. NAT became a security crutch in many offices. It hides addressing, but it doesn’t enforce least privilege; it just makes debugging worse.
  6. SMB and Windows browsing historically assumed LAN adjacency. When offices get bridged, legacy discovery traffic leaks across and turns into “why is my login slow?” tickets.
  7. Printers are historically treated as harmless. Then you learn many run ancient stacks and can be pivot points. Printers: the gift that keeps on printing—and occasionally exfiltrating.
  8. Firewalls got faster, so people got lazier. When a box can push tens of Gbps, “just allow it” feels harmless. It isn’t; scale makes mistakes louder.

One quote worth carrying into every design review (paraphrased idea): “Hope is not a strategy.” It circulates as a traditional operations/SRE saying and the attribution varies, so treat it as the idea, not scripture.

Reference architecture: routes, policies, and choke points

The simplest working pattern is also the most defensible: one choke point per site, explicit routing, and policies that default-deny. The goal is to avoid “distributed enforcement” where every host firewall is expected to be perfect forever.

Pick your choke points

At each office, you want a device (or HA pair) that is:

  • the termination point for the site-to-site tunnel (or SD-WAN overlay),
  • the L3 gateway for the server and client VLANs (or at least the inter-VLAN router),
  • the enforcement point for inter-office policies,
  • the logging point.

Common choices: firewall appliance, router with ACLs, Linux-based gateway with nftables, or an SD-WAN edge. The brand matters less than the discipline: deny-by-default, narrow allows, and measured route advertisement.

Routing strategy: advertise only what you intend to permit

This is the part people skip because it feels “networky.” It’s also where the policy becomes real.

Options:

  • Static routes: good for small environments, obvious failure modes, easy to audit. Bad when you scale to many subnets and sites.
  • Dynamic routing (BGP/OSPF over the tunnel): scalable and fast failover, but it will happily propagate your mistakes at protocol speed.
  • SD-WAN policy routing: convenient centrally-managed intent, but you must understand the actual underlay and what happens when policies conflict.

Rule: don’t advertise client subnets across the inter-office boundary. If Office A clients need a service in Office B, they should reach a service endpoint in Office B (reverse proxy, gateway, jump host, published app), not the client VLAN.
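
If the tunnel runs BGP under FRR, “advertise only what you intend to permit” is an outbound prefix-list on the neighbor. A sketch from the Office B side, with a hypothetical private ASN and policy name; the neighbor address is the Office A end of the tunnel and the prefixes match the examples used in this article:

cr0x@router-b:~$ sudo vtysh
router-b# configure terminal
router-b(config)# ip prefix-list EXPORT-TO-OFFICE-A seq 10 permit 10.20.10.0/24
router-b(config)# ip prefix-list EXPORT-TO-OFFICE-A seq 20 permit 10.20.0.0/24
router-b(config)# ip prefix-list EXPORT-TO-OFFICE-A seq 100 deny 0.0.0.0/0 le 32
router-b(config)# router bgp 65020
router-b(config-router)# address-family ipv4 unicast
router-b(config-router-af)# neighbor 172.31.255.2 prefix-list EXPORT-TO-OFFICE-A out
router-b(config-router-af)# end
router-b# write memory

The explicit final deny is redundant (prefix-lists deny by default), but it makes the intent readable during a 02:00 incident review.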

Policy strategy: default deny with explicit allows

At the inter-office boundary, implement a policy matrix that has:

  • Source zones (Office A client, Office A server, Office A infra)
  • Destination zones (Office B server, Office B infra)
  • Service groups (HTTPS, SSH from bastion, monitoring, directory services, backup replication)
  • Logging for denies and for a subset of allows (or at least sampling), so you can prove reality matches intent.

Stateful inspection is your friend. Asymmetric routing is not. We’ll diagnose that later, because it will happen to you.

Identity on top: “server subnet” is not enough

Subnet allowlists reduce attack surface. They don’t prevent a compromised server from becoming a beachhead.

For high-value paths, add:

  • mTLS between services
  • SSH via bastions (no direct SSH from office clients into remote server networks)
  • Application proxies (HTTP reverse proxies, RDP gateways, file access gateways)
  • Device posture (managed device checks) where feasible

Yes, it’s extra work. No, you won’t regret it at 02:00.
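
For the bastion point specifically, OpenSSH can make the safe path the default path on admin workstations, so nobody is tempted to request direct SSH routes. A sketch with hypothetical names and addresses (bastion-b at 10.20.0.10); the boundary still only permits TCP/22 from the bastion itself:

cr0x@admin-ws:~$ cat ~/.ssh/config
# Everything in Office B's server subnet is reached through the Office B bastion.
Host bastion-b
    HostName 10.20.0.10
    User cr0x

Host 10.20.10.*
    ProxyJump bastion-b

cr0x@admin-ws:~$ ssh 10.20.10.50    # transparently hops via bastion-b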

Control options and when each is the right hammer

1) Route filtering (best first line)

If the remote office never learns a route to your client VLAN, most problems disappear. Route filtering is clean, scalable, and doesn’t rely on packet inspection performance.

Use it when you control routing (static or BGP/OSPF) and want a high-confidence boundary.

2) Stateful firewall policy (the actual rulebook)

Even with route filtering, firewall policy is still required because:

  • some routes must exist (to server subnets), and
  • you still need port-level control and logging.

3) Microsegmentation / host firewalls (good, not sufficient alone)

Host firewalls (Windows Firewall, ufw, firewalld) and microsegmentation agents can drastically reduce lateral movement. But as the only control? That’s betting your boundary on endpoint hygiene. That bet loses eventually.

4) App-layer publishing (best user experience for many cases)

Instead of letting Office A reach Office B servers broadly, publish specific apps:

  • HTTPS apps behind a reverse proxy with SSO
  • RDP via a gateway with MFA
  • SMB via a file gateway or sync mechanism

When people say “we need network access,” ask what they actually need. It’s usually “we need one web app and one file share,” not “we need to discover all devices in a /16.”
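
“Publish the app” often means a reverse proxy in Office B’s server subnet is the only thing Office A ever reaches on 443. A minimal nginx sketch, assuming 10.20.10.50 is that proxy (matching the HTTPS tests later) and a hypothetical backend at 10.20.10.80; certificate paths and the SSO/MFA layer depend on your stack:

cr0x@proxy-b:~$ cat /etc/nginx/conf.d/app.officeb.internal.conf
server {
    listen 443 ssl;
    server_name app.officeb.internal;

    ssl_certificate     /etc/ssl/certs/app.officeb.internal.pem;
    ssl_certificate_key /etc/ssl/private/app.officeb.internal.key;

    location / {
        # the backend stays unreachable from the other office; only the proxy is published
        proxy_pass http://10.20.10.80:8080;
    }
}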

Joke 1: A flat network is like an open-plan office: great for collaboration, terrible for preventing someone from hearing your secrets.

Hands-on tasks: commands, outputs, and decisions (12+)

These are practical checks you can run from Linux gateways, server jump hosts, or admin workstations. Each task includes: command, sample output, what it means, and the decision you make.

Task 1: Prove what routes you actually have (and whether client subnets leaked)

cr0x@server:~$ ip route show
default via 10.10.0.1 dev eth0 proto dhcp src 10.10.0.20 metric 100
10.10.0.0/24 dev eth0 proto kernel scope link src 10.10.0.20
10.20.10.0/24 via 10.10.0.1 dev eth0
10.20.30.0/24 via 10.10.0.1 dev eth0

Meaning: This host can route to two remote subnets: 10.20.10.0/24 and 10.20.30.0/24. If 10.20.30.0/24 is a client VLAN, you’ve already violated “only servers.”

Decision: Fix routing advertisement or static routes first. Don’t “paper over” leaked routes with host rules.

Task 2: Check what your VPN interface thinks it’s doing

cr0x@gateway:~$ ip -br addr show
lo               UNKNOWN        127.0.0.1/8 ::1/128
eth0             UP             10.10.0.1/24 fe80::a00:27ff:fe4e:66a1/64
lan_clients      UP             10.10.50.1/24 fe80::a00:27ff:fe4e:66b2/64
vpn0             UP             172.31.255.2/30

Meaning: Tunnel interface is up with a /30. That’s fine. It doesn’t tell you what’s routed over it, only that the interface exists.

Decision: Correlate with the routing table and firewall policy; don’t assume “vpn0 up” means “policy correct.”

Task 3: Validate the path to a remote server subnet (and catch surprise detours)

cr0x@server:~$ tracepath -n 10.20.10.50
 1?: [LOCALHOST]                                         pmtu 1500
 1:  10.10.0.1                                            0.410ms
 2:  172.31.255.1                                         1.283ms
 3:  10.20.10.50                                          2.104ms reached
     Resume: pmtu 1500 hops 3 back 3

Meaning: Traffic goes to local gateway, then across the tunnel peer, then to the destination. Clean path, MTU 1500.

Decision: If hops include unexpected routers (like Internet edge), you likely have policy-based routing or route leaks.

Task 4: Check PMTU issues (classic “it pings but the app hangs”)

cr0x@server:~$ ping -M do -s 1472 -c 3 10.20.10.50
PING 10.20.10.50 (10.20.10.50) 1472(1500) bytes of data.
From 10.10.0.1 icmp_seq=1 Frag needed and DF set (mtu = 1400)
From 10.10.0.1 icmp_seq=2 Frag needed and DF set (mtu = 1400)
From 10.10.0.1 icmp_seq=3 Frag needed and DF set (mtu = 1400)

--- 10.20.10.50 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss

Meaning: MTU is effectively 1400 on the path. Large packets with DF set can’t pass. Some apps will stall or retransmit forever.

Decision: Clamp MSS on the VPN/firewall, or set the tunnel MTU appropriately. Then retest with smaller/larger sizes until stable.
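
On an nftables gateway like the one shown in Task 5 below, the clamp is a single rule; use insert rather than add so it sits ahead of the accept rules and actually sees new SYNs. The interface name is the same hypothetical vpn0; on iptables-based systems the TCPMSS target with --clamp-mss-to-pmtu does the same job:

cr0x@gateway:~$ sudo nft insert rule inet filter forward oifname "vpn0" tcp flags syn tcp option maxseg size set rt mtu

Then rerun the DF pings and a real TCP transfer; oversized ICMP with DF set will still be rejected at the tunnel MTU, which is expected.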

Task 5: Confirm what the firewall is dropping (nftables example)

cr0x@gateway:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
  chain forward {
    type filter hook forward priority 0; policy drop;
    ct state established,related accept
    iifname "lan_clients" oifname "vpn0" ip daddr 10.20.10.0/24 tcp dport { 443 } accept
    iifname "lan_clients" oifname "vpn0" counter log prefix "DROP interoffice " drop
  }
}

Meaning: Default drop in forward chain. Only allow client VLAN to remote server subnet on 443. Everything else logs and drops.

Decision: This is the right shape. Next, make sure your rules are symmetrical (return path) and your zones match actual interfaces.

Task 6: Watch drops live to see what users are trying (and what you forgot)

cr0x@gateway:~$ sudo journalctl -k -f | grep "DROP interoffice"
Aug 21 10:13:02 gw kernel: DROP interoffice IN=lan_clients OUT=vpn0 SRC=10.10.50.23 DST=10.20.30.44 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=5421 DF PROTO=TCP SPT=51844 DPT=445 WINDOW=64240 SYN
Aug 21 10:13:04 gw kernel: DROP interoffice IN=lan_clients OUT=vpn0 SRC=10.10.50.23 DST=10.20.10.60 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=5512 DF PROTO=TCP SPT=51845 DPT=3389 WINDOW=64240 SYN

Meaning: Someone is trying SMB (445) to a remote client subnet (10.20.30.44), and RDP (3389) to a server (10.20.10.60). Your policy only allowed 443.

Decision: Don’t knee-jerk allow 445/3389. Ask: should SMB traverse offices? Should RDP go through a gateway? Usually the answer is “publish properly.”

Task 7: Verify BGP advertisements aren’t leaking client ranges

cr0x@router-b:~$ vtysh -c "show ip bgp neighbors 172.31.255.2 advertised-routes" | sed -n '1,40p'
BGP table version is 44, local router ID is 10.20.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.20.10.0/24    0.0.0.0                  0         32768 i
*> 10.20.99.0/24    0.0.0.0                  0         32768 i

Meaning: Office B’s router advertises two prefixes toward Office A. If 10.20.99.0/24 is a client/Wi‑Fi VLAN, you’re exporting it over the inter-office link.

Decision: Add prefix-lists/route-maps to export only server and infrastructure prefixes. Then confirm on the remote side’s received routes.

Task 8: Check received routes on the other side (confirm the fix worked)

cr0x@router-a:~$ vtysh -c "show ip bgp neighbors 172.31.255.1 received-routes" | sed -n '1,40p'
BGP table version is 107, local router ID is 10.10.0.1

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.20.10.0/24    172.31.255.1            0             0 i

Meaning: Office A receives only Office B’s server subnet. Good. Less reachability means fewer ways to be surprised. (In FRR, received-routes needs soft-reconfiguration inbound on the neighbor; the plain “routes” variant shows what was actually accepted.)

Decision: Keep it that way. If someone later requests “temporary access” to a client subnet, treat it like a change request with an expiry.

Task 9: Confirm IP forwarding and rp_filter settings (asymmetric routing trap)

cr0x@gateway:~$ sysctl net.ipv4.ip_forward net.ipv4.conf.all.rp_filter
net.ipv4.ip_forward = 1
net.ipv4.conf.all.rp_filter = 1

Meaning: IP forwarding is enabled. Reverse path filtering is strict (1). Strict rp_filter can drop legitimate traffic when return paths differ (common with dual links, SD-WAN, or policy routing).

Decision: If you observe one-way traffic with valid state, consider setting rp_filter to 2 (loose) on the VPN interfaces, but only after you’ve understood routing symmetry.
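
If loose mode on the tunnel is the right call, the change is per interface and worth persisting. The kernel applies the maximum of the all and per-interface values, so with all set to 1, setting the interface to 2 yields loose mode there. Interface name and the sysctl.d file name are illustrative:

cr0x@gateway:~$ sudo sysctl -w net.ipv4.conf.vpn0.rp_filter=2
net.ipv4.conf.vpn0.rp_filter = 2
cr0x@gateway:~$ echo 'net.ipv4.conf.vpn0.rp_filter = 2' | sudo tee /etc/sysctl.d/80-vpn-rpfilter.conf
net.ipv4.conf.vpn0.rp_filter = 2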

Task 10: Capture traffic at the boundary to distinguish “no route” vs “firewall drop”

cr0x@gateway:~$ sudo tcpdump -ni vpn0 host 10.10.50.23 and host 10.20.10.50 and tcp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vpn0, link-type RAW (Raw IP), snapshot length 262144 bytes
10:15:22.011221 IP 10.10.50.23.51910 > 10.20.10.50.443: Flags [S], seq 405483990, win 64240, options [mss 1360,sackOK,TS val 120119 ecr 0,nop,wscale 7], length 0
10:15:22.032909 IP 10.20.10.50.443 > 10.10.50.23.51910: Flags [S.], seq 29199210, ack 405483991, win 65160, options [mss 1360,sackOK,TS val 99122 ecr 120119,nop,wscale 7], length 0

Meaning: SYN and SYN-ACK are visible on vpn0. The handshake is happening. If the client still fails, the issue is likely beyond the inter-office boundary (host firewall, TLS, proxy, app).

Decision: Stop changing VPN/firewall policies and move up the stack.

Task 11: Test port reachability without lying to yourself

cr0x@client-jumphost:~$ nc -vz -w 3 10.20.10.50 443
Connection to 10.20.10.50 443 port [tcp/https] succeeded!

Meaning: TCP connect works. That’s L3/L4 good enough. It does not confirm the application is correct.

Decision: If connect fails, check firewall drops and routes. If it succeeds but app fails, check TLS certs, SNI, proxies, and server-side ACLs.

Task 12: Confirm that a remote client subnet is truly unreachable (the goal)

cr0x@client-jumphost:~$ ping -c 2 10.20.30.44
PING 10.20.30.44 (10.20.30.44) 56(84) bytes of data.

--- 10.20.30.44 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1021ms

Meaning: No ICMP response. This alone doesn’t prove it’s blocked (ICMP can be filtered), but it’s a decent signal when paired with routing and firewall logs.

Decision: Validate with ip route get 10.20.30.44 and firewall counters/logging. Your goal is: no route, or explicit drop with logs.
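
On the gateway, that validation can look like the sketch below (the WAN interface wan0 and its addresses are illustrative). If the lookup falls through to the Internet default, the tunnel has no route to the client VLAN, which is what you want; a blackhole route makes the intent explicit and keeps RFC1918 destinations from leaking out the WAN interface:

cr0x@gateway:~$ ip route get 10.20.30.44
10.20.30.44 via 198.51.100.1 dev wan0 src 198.51.100.10 uid 1000
    cache
cr0x@gateway:~$ sudo ip route add blackhole 10.20.30.0/24
cr0x@gateway:~$ ip route show | grep blackhole
blackhole 10.20.30.0/24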

Task 13: Show which rule is matching (iptables counters example)

cr0x@gateway:~$ sudo iptables -L FORWARD -v -n --line-numbers | sed -n '1,25p'
Chain FORWARD (policy DROP 0 packets, 0 bytes)
num   pkts bytes target     prot opt in          out         source          destination
1     918K  712M ACCEPT     all  --  *           *           0.0.0.0/0       0.0.0.0/0       ctstate RELATED,ESTABLISHED
2     231K   18M ACCEPT     tcp  --  lan_clients vpn0        10.10.50.0/24    10.20.10.0/24    tcp dpt:443
3     1245  104K DROP       all  --  lan_clients vpn0        0.0.0.0/0       0.0.0.0/0

Meaning: Rule 2 is actively used; rule 3 drops the rest. Counters make policy real. If rule 3 is climbing unexpectedly, users/apps are attempting prohibited paths.

Decision: Investigate what’s being dropped before adding new allows. Sometimes the right fix is “stop doing that,” not “open the firewall.”

Task 14: Prove DNS isn’t causing “it’s down” illusions across offices

cr0x@client-jumphost:~$ dig +short app.officeb.internal
10.20.10.50

Meaning: Name resolves to a server subnet IP. Good. If it resolves to a client subnet IP or a public IP unexpectedly, your access control plan will look “random” to users.

Decision: Ensure internal DNS answers differ by site only when intentional. Don’t let split-horizon DNS become split-brain.
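
A quick way to spot unintentional split-horizon is to ask each site’s resolver directly and compare answers. Resolver addresses here are illustrative, and this assumes your policy lets the admin host query both:

cr0x@client-jumphost:~$ dig +short app.officeb.internal @10.10.0.53
10.20.10.50
cr0x@client-jumphost:~$ dig +short app.officeb.internal @10.20.10.53
10.20.10.50

Matching answers mean both sites send users to the published endpoint; differing answers should be a documented decision, not an archaeology project.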

Fast diagnosis playbook

This is the order that finds the bottleneck quickly, without thrashing changes and making things worse.

First: routing truth (reachability at L3)

  • On the source host: ip route get <dest-ip> to see the egress interface and next hop.
  • On the gateway: confirm the route exists and points into the tunnel/SD-WAN overlay.
  • On the remote gateway: confirm the return route exists back to the source subnet (or NAT policy is deliberate and consistent).

If there’s no route, stop. Fix routing advertisement or static routes before touching firewall rules.

Second: firewall policy and counters (is it dropped or allowed?)

  • Check default policy: drop or accept.
  • Check explicit allow for the exact tuple: src subnet, dst subnet, protocol, port.
  • Check counters/logs while reproducing the issue.

If counters show drops, decide: is it a legitimate request (publish/gateway it), or is it lateral movement waiting to happen (keep dropping)?

Third: statefulness and asymmetry (the “it should work” trap)

  • Look for asymmetric return paths (dual WAN, policy routing, ECMP, SD-WAN failover events).
  • Check rp_filter and state table behavior.
  • Use tcpdump on both interfaces (LAN and VPN) to confirm packets traverse as expected.

Fourth: MTU and fragmentation (the silent killer)

  • Use ping -M do with sizes to find the working MTU.
  • Look for “Frag needed and DF set”.
  • Clamp MSS on the tunnel.

Fifth: application-layer and identity

  • Test TCP connect (nc), then TLS (openssl s_client if needed), then the actual app.
  • Check server-side firewalls and ACLs.
  • Validate DNS answers from the affected site.
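
For the TLS step in that list, openssl s_client separates “TCP works” from “TLS and SNI work.” Output below is abbreviated and illustrative; the name matches the DNS example above:

cr0x@client-jumphost:~$ openssl s_client -connect 10.20.10.50:443 -servername app.officeb.internal -brief </dev/null
CONNECTION ESTABLISHED
Protocol version: TLSv1.3
Ciphersuite: TLS_AES_256_GCM_SHA384
Peer certificate: CN = app.officeb.internal
Verification: OK
DONE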

Common mistakes: symptom → root cause → fix

These are the ones that keep showing up because they’re seductive.

1) “We only allowed server subnets, but users can still reach remote desktops”

Symptom: Office A users RDP into random machines in Office B.

Root cause: The “server subnet” includes VDI pools, jump boxes with broad reach, or misclassified desktops. Or you allowed 3389 broadly to the server VLAN.

Fix: Separate admin/jump/VDI into distinct subnets and apply tighter rules. Use an RDP gateway with MFA; block direct 3389 inter-office.

2) “Pings work, HTTPS times out”

Symptom: ICMP is fine, TCP handshake maybe works, but large responses stall.

Root cause: MTU/MSS mismatch in the tunnel or overlay; PMTUD blocked.

Fix: MSS clamping on the VPN edge, allow ICMP fragmentation-needed messages, set tunnel MTU properly, then verify with DF pings.

3) “We blocked everything, now nothing works—including what we allowed”

Symptom: Even explicit allow rules seem ignored.

Root cause: Wrong zone/interface mapping, policy applied in the wrong direction, or policy bypass due to hardware offload/fast-path settings.

Fix: Validate interface names, zone membership, and rule order. Temporarily disable fast-path for troubleshooting. Confirm with counters and tcpdump.

4) “After failover, some flows die and never recover”

Symptom: SD-WAN failover happens; some users reconnect, some are stuck.

Root cause: Stateful devices lose session tables during path change; asymmetric routing returns on a different tunnel; strict rp_filter drops return packets.

Fix: Ensure symmetric routing or session synchronization in HA. Use consistent NAT policies. Consider loose rp_filter on tunnel interfaces.

5) “We didn’t advertise client subnets, but remote office still reaches them”

Symptom: You swear routes are filtered, yet connectivity exists.

Root cause: Someone added a supernet static route, or a default route crosses the tunnel, or NAT hairpins traffic unintentionally.

Fix: Audit static routes on gateways, check for 0/0 or large summaries in the VPN selectors/crypto ACLs, and verify with ip route and BGP tables.

6) “File share access is intermittent and slow across offices”

Symptom: SMB works, but users complain constantly.

Root cause: SMB chatty behavior over higher latency links, opportunistic locking issues, and name resolution/discovery traffic being blocked in weird ways.

Fix: Don’t stretch SMB across offices as a first choice. Use DFS with careful design, file sync, or publish via a gateway closer to users.

7) “The security team asked for ‘no east-west’ but we need monitoring”

Symptom: Monitoring or backup breaks after segmentation.

Root cause: Monitoring/backups were implicitly relying on broad reachability (ICMP, SNMP, agents, RPC).

Fix: Make monitoring a first-class service: dedicated monitoring subnet, explicit ports, and ideally pull-based agents with mTLS.

Joke 2: A VPN without filtering is just a long Ethernet cable with a passport stamp.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

Two offices. One “corporate” site, one “satellite.” The satellite needed access to a handful of internal web apps and an SSH bastion. The network team deployed a site-to-site tunnel and asked for “the subnets” to route.

The application owner replied with something like, “We’re using 10.40.0.0/16.” That was true in the same way “the ocean is wet” is true: technically accurate, operationally useless. Inside that /16 lived server VLANs, user VLANs, printers, and a lab network where engineers tested things they probably shouldn’t.

The wrong assumption was simple: “If it’s internal, it’s trusted.” Within a week, a compromised laptop at the satellite (phishing, nothing exotic) scanned across the tunnel, found an old SMB share on a desktop VLAN in HQ, and used it as a pivot. The attacker didn’t need a zero-day. They needed reachability and time.

The forensics were uncomfortable. Not because the exploit was clever—it wasn’t—but because the logs were thin. The VPN device had “allow any” with minimal logging to “avoid noise.” The endpoint team insisted the laptop was patched. The network team insisted the tunnel was “private.” Everyone was correct in the way that gets you breached.

The fix was not heroic. They stopped advertising the /16, split the network into published server prefixes, and forced admin access through a bastion with MFA. The incident response report used the word “segmentation” about forty times. The real lesson was fewer words: stop routing what you don’t want to secure.

Mini-story 2: The optimization that backfired

A mid-sized company rolled out SD-WAN to connect multiple offices. Performance improved immediately; jitter dropped and video calls stopped sounding like underwater interviews. The network team then “optimized” the security policy: instead of dozens of granular rules, they summarized them into a few large ones. Fewer rules, fewer mistakes. Reasonable, right?

The problem was the summary. They replaced multiple destination server prefixes with a broader aggregate. It made routing cleaner and the firewall policy shorter. It also quietly included a “temporary” VLAN used for contractors, which had devices that were managed… in the sense that someone once managed them.

At first nothing happened. That’s why this class of failure is dangerous: it waits. Then a contractor machine started generating suspicious traffic to an internal Git service in another office. The access wasn’t blocked because it matched the new summarized allow rule. The alerting didn’t fire because the traffic looked like normal HTTPS.

The postmortem was not kind to the optimization. They didn’t lose data, but they lost time. Lots of it. The team had to unwind the aggregate, identify what the contractor VLAN should have accessed (almost nothing), and introduce route-maps to prevent accidental inclusion in future summaries.

Takeaway: aggregation is a performance optimization; segmentation is a risk decision. Don’t let a routing convenience rewrite your security model.

Mini-story 3: The boring but correct practice that saved the day

A finance org with two main offices had a strict inter-office policy: only server subnets, only specific ports, and all rules were change-controlled with an owner and an expiration date. It was not exciting. People complained about the process. Naturally.

One afternoon, a helpdesk ticket came in: “User in Office A can’t access something in Office B.” The request sounded legitimate and urgent. The usual pressure started: “Just open it temporarily.”

The on-call SRE did the boring thing. They checked the firewall logs and saw repeated attempts to reach TCP/445 and TCP/135 across offices from a user VLAN—classic lateral movement patterns, not “access the app.” They also checked DNS and found the user was trying to reach a hostname that resolved to a desktop subnet in Office B, not the published service endpoint.

Instead of opening the firewall, they fixed the problem: corrected DNS so the hostname pointed to a reverse proxy in the server VLAN, and added a narrow allow to TCP/443 for that proxy only. Meanwhile, the security team investigated the user endpoint. It turned out to be infected with something that was trying its luck.

Nothing dramatic happened because the policy was already designed to make dramatic things boring. That’s the goal: make the safe path the easy path, and make the unsafe path noisy and blocked.

Checklists / step-by-step plan

Step-by-step plan: implement “only servers” without breaking the business

  1. Inventory subnets by function: client, server, infra, guest, lab. If you can’t label a subnet, it’s not ready to be routed anywhere.
  2. Define the allowed destination prefixes per office: usually server + infra only. Keep it short.
  3. Define service ports per use-case:
    • Web apps: 443 to reverse proxies
    • Admin: 22 only from bastions; avoid direct office-to-server SSH
    • Directory: only what’s required, from specific sources
    • Monitoring: from monitoring subnet only
  4. Fix routing first: route filtering or static routes for only the allowed prefixes. Do not export client VLANs.
  5. Apply firewall policies: default deny, explicit allows, with logging. Ensure both directions are accounted for (stateful policies help, but don’t assume magic).
  6. Decide on NAT policy: ideally none for internal-to-internal, but if you must NAT, do it consistently and document why. NAT can hide routing sins; it doesn’t absolve them.
  7. Publish apps instead of networks where possible: reverse proxies, gateways, SSO, MFA. Users want apps, not subnets.
  8. Implement a change workflow: every new inter-office allow needs an owner and an expiry. Permanent rules are fine; eternal temporary rules are a lie.
  9. Test from both sides: route checks, port checks, and application checks. Validate MTU early.
  10. Baseline logs and counters: you should be able to answer “what got dropped” and “what got allowed” during an incident.
  11. Run a tabletop failure drill: simulate a route leak or a mis-summarized prefix and verify detection (alerts on new prefixes, spikes in drops, unexpected flows).

Operational checklist: what to monitor continuously

  • New routes learned over the inter-office adjacency (especially broad aggregates and client VLANs).
  • Firewall rule hit counts for “deny interoffice” rules (spikes often mean scanning or misconfiguration).
  • Tunnel MTU/fragmentation indicators and TCP retransmits.
  • SD-WAN path changes and failover events correlated with application complaints.
  • DNS anomalies: internal names resolving to unexpected subnets per site.
  • Identity signals: MFA failures, unusual bastion usage, abnormal service account behavior.
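
A low-tech way to cover the first item on that list, new prefixes appearing over the adjacency, is to keep a baseline and diff against it; the file paths and the alerting hook (cron, your monitoring agent) are up to you:

cr0x@router-a:~$ vtysh -c "show ip bgp neighbors 172.31.255.1 routes" | awk '/^\*>/{print $2}' | sort > /tmp/interoffice-prefixes.now
cr0x@router-a:~$ diff /var/lib/interoffice/prefixes.baseline /tmp/interoffice-prefixes.now && echo "no new prefixes"
no new prefixes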

FAQ

1) Is blocking routes better than blocking with firewall rules?

Yes, when you can. Route filtering reduces the reachable surface area before packets even hit policy. Use both: route filtering for reachability, firewall for ports and logging.

2) We need Office A users to access a database in Office B. Do we allow the DB subnet?

Prefer publishing through an app tier or a proxy in Office B. If you must allow DB access, restrict it to specific source subnets (or hosts) and specific ports, and log it. Databases are not casual cross-office services.

3) Can we rely on Windows Firewall on servers instead of network firewalls?

Use Windows Firewall, absolutely. But don’t rely on it as the only boundary control. Network enforcement gives you consistent policy, centralized logs, and a harder-to-bypass choke point.

4) What about “we trust our offices” because it’s the same company?

Trust is not a network design primitive. Offices contain unmanaged devices, contractors, conference-room gear, and user endpoints that browse the Internet all day. Treat the inter-office link as untrusted transport carrying explicitly allowed services.

5) Should we NAT between offices to “hide” networks?

Only if you have a clear reason (overlapping IP space, merger integration). NAT can simplify addressing collisions, but it complicates debugging, breaks some protocols, and can create false confidence. If you NAT, document it and standardize it.

6) How do we handle overlapping RFC1918 ranges after an acquisition?

Short-term: NAT at the boundary, and keep allowed services narrow. Medium-term: renumber one side or introduce a translation segment with clear ownership. Long-term: stop treating IP space as an afterthought during integrations.

7) Our security team says “zero trust.” What does that mean for office-to-office?

It means network location isn’t enough. Keep subnet allowlists, but add identity: mTLS, SSO proxies, device posture, and MFA for admin paths. “Only servers” becomes “only specific services on specific servers with verified identity.”

8) How do we prevent “temporary” rules from becoming permanent?

Make expiry dates mandatory, and automate reminders (or automatic disable) when the expiry hits. If a rule is still needed, renew it with a justification. Friction here is a feature.
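
If the boundary is nftables, expiry can be mechanical rather than cultural: keep temporary allows in a set with element timeouts, so access disappears on its own. A sketch against the table from Task 5, with an illustrative destination and port:

cr0x@gateway:~$ sudo nft add set inet filter temp_allow '{ type ipv4_addr . inet_service; flags timeout; }'
cr0x@gateway:~$ sudo nft insert rule inet filter forward iifname "lan_clients" oifname "vpn0" ip daddr . tcp dport @temp_allow accept
cr0x@gateway:~$ sudo nft add element inet filter temp_allow '{ 10.20.10.61 . 8443 timeout 14d }'

When the timeout lapses, the element and the access go with it; renewing means re-adding the element with a fresh justification.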

9) What ports should never be broadly allowed between offices?

Broadly: SMB (445), NetBIOS (137–139), RPC endpoint mapper (135), and direct RDP (3389). There are exceptions, but they should feel like exceptions and come with compensating controls.

10) What’s the quickest proof that we achieved “only servers”?

From a client subnet in Office A, you should have (a) no route to Office B client VLANs, and (b) firewall logs showing drops for any attempted access. From approved sources, approved ports to server subnets should succeed with clean counters.

Conclusion: next steps you can do this week

“Only servers, not the whole LAN” is not a slogan. It’s a design you enforce with routing discipline, default-deny policy, and logs that tell the truth when someone insists “it used to work.”

Practical next steps:

  1. List every subnet in both offices and label it client/server/infra/guest/lab.
  2. Stop advertising client VLANs over the inter-office link. Use prefix-lists or tighten static routes.
  3. Implement a default-deny inter-office firewall policy with explicit allows to server subnets and narrow ports.
  4. Publish apps via gateways/proxies rather than opening broad network access (especially for RDP and SMB).
  5. Add logging and counters, then baseline what “normal” looks like before the next incident teaches you the hard way.

If you do just one thing: filter routes so the wrong networks aren’t reachable at all. Every other control gets easier when reachability is already constrained.
