Office VPN Zero Trust: Replace Flat Networks with Role-Based Access

Your office VPN works—until it works too well. Someone connects, gets an IP on a big internal subnet, and suddenly “remote access” quietly becomes “remote adjacency.” One compromised laptop later, you’re doing incident response while Slack fills with the corporate equivalent of smoke.

Flat VPN networks are comforting because they’re familiar. They’re also a reliability and security tax you keep paying with interest. Zero trust isn’t a product you buy; it’s a set of constraints you enforce: identity, device posture, and explicit authorization per application. The goal is boring: make the safe path the easy path, and make lateral movement expensive.

What you’re replacing: the flat office VPN

A typical “office VPN” model goes like this:

  • User authenticates to a VPN concentrator (often with MFA, sometimes not).
  • Client gets an internal IP (or routes to internal subnets).
  • Firewall rules allow broad east-west access because “they’re on the VPN.”
  • Access control is mostly implicit: if you can route to it, you can try to log into it.

This is not “zero trust with a VPN.” This is “castle-and-moat with a teleporter.” It fails in predictable ways:

  • Lateral movement becomes cheap. Attackers love adjacency because it turns “one credential” into “many systems.”
  • Authorization is fragmented. App teams implement auth differently; network teams paper over it with reachability.
  • Network segmentation rots. People create exceptions, then exceptions get copied, then you’re basically back to flat.
  • Operational complexity hides in routing. Split tunnel vs full tunnel debates become religious wars, not engineering decisions.

Zero trust isn’t magic dust. It’s a different contract: no user or device gets “internal network access” by default. They get access to specific applications, over specific ports, under specific conditions, with auditing that can survive a bad day.

One practical rule: if your policy is expressible as “VPN users can access 10.0.0.0/8,” you don’t have a policy. You have a shrug.

Facts and historical context that actually matter

Zero trust is often marketed like it was invented last quarter. It wasn’t. Here are concrete points that help you reason about it:

  1. VPNs became mainstream in the late 1990s with IPsec standardization (IKE, ESP). The goal was confidentiality across hostile networks, not fine-grained authorization.
  2. SSL VPNs (early 2000s) popularized “clientless” web access and later full-tunnel clients. They improved deployment, not lateral-movement resistance.
  3. Perimeter assumptions eroded quickly as SaaS and cloud moved “internal” apps onto the public internet, behind logins rather than behind subnets.
  4. Microsegmentation became fashionable as virtualization and SDN made east-west filtering possible without rewiring buildings. Most orgs learned that writing the rules is the hard part.
  5. Google’s BeyondCorp model (mid-2010s) pushed the idea that the corporate network is not a security boundary; identity and device state are.
  6. Credential theft outpaced malware for many incidents because it’s cheaper to phish one user than to exploit ten different hosts.
  7. “Flat network” is often an accident, not a design: mergers, temporary VLANs, and “we’ll clean it up later” become permanent.
  8. MFA is necessary but not sufficient: it stops some attacks, but once inside a flat VPN, the blast radius is still enormous.
  9. Zero trust products tend to converge on a few primitives: an identity provider, a policy engine, a connector/proxy, and strong telemetry.

History lesson over. The takeaway: VPNs solved confidentiality and connectivity. Zero trust solves authorization and containment, assuming the network is already hostile.

Target outcome: role-based access, not role-based hope

“Role-based access” in a zero trust office context means:

  • Users authenticate with an identity provider (IdP).
  • They get mapped into roles/groups (Engineering, Finance, IT, Vendor-ACME).
  • Policies grant access to specific apps/services based on role + device posture + context.
  • Connectivity is app-scoped (or service-scoped), not subnet-scoped.
  • Every access event is logged with identity, device, and decision reason.

The mistake is thinking “RBAC” means “add groups and call it a day.” Real RBAC needs:

  • Role hygiene: stable roles, minimal sprawl, clear ownership.
  • Resource inventory: you cannot protect what you can’t name.
  • Authorization boundaries: apps should still authenticate; network policy is a second line, not the only line.
  • Time to revoke: if offboarding takes hours to remove access, it will happen during the hours you didn’t want.

Werner Vogels’ well-known point, paraphrased: everything fails, all the time, so systems must be designed to tolerate failure. Apply that here: assume credentials leak, laptops get owned, and networks misroute. Design so the blast radius is small and the audit trail is usable.

Second practical rule: your policy must be readable by humans at 2 a.m. during an outage. If the only person who understands it is on parental leave, you have a risk—not a system.

Joke #1: A flat VPN is like giving everyone a master key because you trust them not to touch the supply closet. The supply closet disagrees.

Architecture options: pick your battle

There are three broad patterns you’ll see. All can work; each fails differently.

1) ZTNA via identity-aware proxy (best for HTTP/S and modern apps)

Users access internal web apps through a proxy that enforces identity, device posture, and policy. The proxy connects to internal services via connectors (agents) or private routing. This shines for:

  • Web apps (internal dashboards, Git, CI/CD UIs)
  • APIs
  • Admin portals that already support SSO

It can be awkward for:

  • Raw TCP services (databases, SSH) unless the product supports it well
  • Legacy thick clients that assume LAN adjacency

2) Per-app tunnels (best for TCP/SSH/DB access with explicit targeting)

Instead of “connect to VPN,” the user connects to “prod-db-readonly” or “k8s-admin” through a broker. Connectivity is established only to the specific destination(s) allowed by policy. This is where you push developers and IT admins when they insist they “need SSH.” Fine. They get SSH to specific hosts, with short-lived credentials and logging.
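
For SSH, the client experience can stay plain OpenSSH. A minimal sketch, assuming the broker is reachable as an SSH jump host (hostnames and paths are illustrative placeholders, not a specific product’s layout):

# ~/.ssh/config: one named destination, reachable only via the broker
Host prod-db-01
    HostName prod-db-01.internal.example
    User alex
    ProxyJump broker.corp.example
    IdentityFile ~/.ssh/id_ed25519
    CertificateFile ~/.ssh/id_ed25519-cert.pub   # short-lived cert from your CA

“ssh prod-db-01” then works only while policy allows it and the certificate is valid; there is no route to anything else.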

3) Microsegmented VPN (a transitional pattern, not the end state)

You keep a VPN, but you stop routing users into a shared internal subnet. You issue per-role IP pools and enforce strict ACLs. This is still network-based trust, but at least it’s not a single blast radius. Use it as a stepping stone if your app landscape is deeply legacy.
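
What “strict ACLs” means in practice, as a sketch (the pool, app address, and table name are illustrative assumptions): suppose Finance lands in 10.8.20.0/24 and needs exactly one app.

# nftables sketch: default-drop forwarding, one explicit allow per role pool
table inet vpn_segments {
  chain forward {
    type filter hook forward priority filter; policy drop;
    ct state established,related accept
    # Finance pool may reach the payroll app over HTTPS, nothing else
    ip saddr 10.8.20.0/24 ip daddr 10.60.1.10 tcp dport 443 accept
  }
}

One explicit allow per role pool keeps the blast radius legible, which is the whole point of this transitional pattern.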

Opinionated guidance: default to identity-aware proxy for web, per-app tunnels for admin protocols, and only keep a general VPN for rare cases with a retirement date.

Policy model: identities, roles, and “who can reach what”

Start with an access graph, not a subnet map

Network engineers love IP ranges because they’re concrete. Zero trust forces you to model something closer to an access graph:

  • Subjects: users, service accounts, devices
  • Attributes: role, department, risk score, managed/unmanaged, OS version, patch level
  • Objects: apps, APIs, databases, admin endpoints, SSH bastions
  • Actions: HTTP GET/POST, SSH, RDP, database connect, kubectl
  • Conditions: MFA, device posture, geolocation, time, network, ticket reference

A good policy reads like a sentence (a machine-readable sketch follows the examples):

  • “Engineers on managed devices with disk encryption and recent patch level may access Git and CI over HTTPS.”
  • “On-call SREs may SSH to production bastion with MFA and a just-in-time approval, sessions recorded.”
  • “Finance may access payroll app from managed devices only; block unmanaged and BYOD.”
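
None of those sentences mention IP addresses; that’s the point. Products encode them in their own policy DSLs. As a vendor-neutral sketch (this schema is hypothetical, not any real product’s format), the first sentence might serialize like this:

# Hypothetical policy document; illustrative schema only
- name: eng-git-ci-https
  subjects:
    roles: [engineering]
  conditions:
    device: managed
    disk_encryption: required
    max_patch_age_days: 30
  objects:
    apps: [git-web, ci-web]
  actions: [https]
  decision: allow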

Define roles the boring way

Roles should be stable and coarse-grained. You can always add finer permissions later. You cannot easily delete 400 micro-roles once every team has one. Start with:

  • Employee vs contractor vs vendor
  • Department-level roles (Eng, Sales, Finance, HR)
  • Privileged roles (IT Admin, SRE On-call, Security)
  • Environment roles (Prod access, Staging access)

Then impose constraints:

  • Role ownership: every role has a human owner who approves membership.
  • Membership source of truth: ideally HRIS → IdP → access system, with exceptions tracked.
  • Time-bounded elevation: break-glass roles that expire.

Don’t confuse authentication with authorization

Strong identity is required. But you still need authorization decisions close to the resource. In practice:

  • For web apps: use SSO and enforce app-level roles; the proxy is a gate, not your only lock.
  • For SSH: use short-lived certificates and forced commands when possible (signing sketch after this list).
  • For databases: avoid shared passwords; use IAM auth or per-user credentials; restrict network path as a second layer.
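
Short-lived certificates need no exotic tooling; stock OpenSSH can issue them. A minimal sketch, assuming you operate the CA yourself (paths and principal names are illustrative):

# Sign a user key for 8 hours, valid only for principal "alex"
ssh-keygen -s /etc/ssh/user_ca -I alex@example.com -n alex -V +8h ~/.ssh/id_ed25519.pub

# On servers (sshd_config): trust the CA and map principals to accounts
# TrustedUserCAKeys /etc/ssh/user_ca.pub
# AuthorizedPrincipalsFile /etc/ssh/principals/%u

When the certificate expires, access expires with it: nothing to revoke, nothing left behind.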

Joke #2: “Just put it behind the VPN” is the security equivalent of “just reboot it.” Sometimes it works; it’s never a strategy.

Checklists / step-by-step plan: migrate without lighting yourself on fire

Step 0: inventory what the VPN is really used for

  • List destinations: subnets, hostnames, services, ports.
  • List user groups: who connects, when, and for what.
  • Classify traffic: web apps, SSH/RDP, databases, file shares, internal DNS, AD services.

Decision outcome: you’ll discover 80% of usage is predictable and can be moved to per-app access. The remaining 20% is where the weirdness lives.

Step 1: choose your enforcement point

  • Identity-aware proxy for HTTP/S and SSO-able apps.
  • Per-app tunnels for SSH/DB/Kubernetes and other TCP protocols.
  • Keep legacy VPN only for “cannot migrate yet,” but shrink the routes.

Step 2: define “managed device” and enforce posture

  • Pick MDM/EDR signals you trust: encryption, OS patch level, screen lock, known EDR agent.
  • Decide how to handle BYOD: separate role with limited access, or no access to sensitive apps.

Decision outcome: access is not only “who you are” but “what you’re using.” Attackers hate this.

Step 3: migrate the low-drama apps first

  • Internal wiki, dashboards, ticketing, documentation portals.
  • Apps already behind SSO.
  • Apps with clean hostnames and TLS.

Step 4: tackle admin protocols with explicit workflows

  • Introduce a bastion or access broker for SSH/RDP.
  • Implement just-in-time elevation for production.
  • Turn on session recording where feasible.

Step 5: shrink the VPN until it’s embarrassing

  • Replace 10.0.0.0/8 routes with only the specific legacy subnets (config sketch after this list).
  • Segment users by role into separate IP pools and firewall them aggressively.
  • Set a date: “This VPN route dies.” Then actually kill it.
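
If the legacy VPN is OpenVPN (the status-log format later in this article assumes it), route shrinking is a few server-config lines; the subnets here are illustrative:

# Before: push "route 10.0.0.0 255.0.0.0"  <- the flat-network route
push "route 10.30.0.0 255.255.0.0"
push "route 10.40.0.0 255.255.0.0"
# Assign per-role pools via client-config-dir, then firewall each pool
client-config-dir /etc/openvpn/ccd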

Step 6: measure and iterate

  • Track: denied requests, new app onboarding time, support tickets, median latency.
  • Review policies quarterly like you review on-call: prune, simplify, re-own.

Practical tasks (with commands): verify, diagnose, decide

These are the kinds of tasks you run during migration and during “why can’t I reach X” tickets. Each one includes what the output means and what decision you make.

Task 1: Identify what routes the VPN client is actually installing

cr0x@server:~$ ip route show
default via 192.168.1.1 dev wlp2s0 proto dhcp metric 600
10.0.0.0/8 via 10.8.0.1 dev tun0 proto static metric 50
10.8.0.0/24 dev tun0 proto kernel scope link src 10.8.0.23 metric 50

What it means: The VPN is routing an entire RFC1918 supernet (10/8) through tun0. That’s classic flat-network adjacency.

Decision: Replace the broad route with per-app access or, at minimum, narrow routes to the legacy destinations that truly need it.

Task 2: Confirm split tunnel vs full tunnel behavior

cr0x@server:~$ curl -s https://ifconfig.me
203.0.113.47

What it means: Your egress IP belongs to your ISP (split tunnel) rather than to corporate egress (full tunnel). If you expect corporate egress for compliance, this is a miss.

Decision: For zero trust, prefer app-scoped access and avoid forcing full tunnel unless you have a specific requirement (and capacity) for it.

Task 3: Check DNS resolution path (a common migration footgun)

cr0x@server:~$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.0.0.53
       DNS Servers: 10.0.0.53 1.1.1.1
DNS Domain: corp.example

What it means: You’re using an internal DNS server plus a public fallback. This can leak queries or cause inconsistent resolution if corp zones aren’t handled correctly.

Decision: In a zero trust model, decide which names must resolve internally and ensure your access path provides DNS appropriately (split-horizon, DoH policy, or proxy-based host routing).
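
If clients run systemd-resolved (as the resolvectl output above suggests), you can scope corp zones to the tunnel so only those queries hit the internal resolver. A sketch:

cr0x@server:~$ sudo resolvectl dns tun0 10.0.0.53
cr0x@server:~$ sudo resolvectl domain tun0 '~corp.example'

The tilde marks corp.example as a routing-only domain: queries for it go to tun0’s server, and everything else stays on the default path.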

Task 4: Verify that a specific internal service is reachable by policy (TCP)

cr0x@server:~$ nc -vz git.corp.example 443
Connection to git.corp.example 443 port [tcp/https] succeeded!

What it means: Network path exists to 443. If the app still fails, the issue is likely TLS/SSO/app auth rather than routing.

Decision: Shift debugging to HTTP/TLS/identity layers, not firewall rules.

Task 5: Diagnose whether the problem is DNS vs connectivity

cr0x@server:~$ dig +short git.corp.example
10.20.30.40

What it means: Name resolves to a private IP. If you’re using an identity-aware proxy, you might not want users resolving private addresses at all.

Decision: Either (a) publish the app via proxy with a public name that doesn’t leak internal IPs, or (b) ensure per-app tunnel covers that IP/port and DNS is consistent.

Task 6: Confirm TLS and SNI behavior to an app endpoint

cr0x@server:~$ openssl s_client -connect git.corp.example:443 -servername git.corp.example -brief
CONNECTION ESTABLISHED
Protocol version: TLSv1.3
Ciphersuite: TLS_AES_256_GCM_SHA384
Peer certificate: CN=git.corp.example
Verification: OK

What it means: TLS is healthy, certificate matches, SNI works. If users see browser warnings, it’s likely intercept/proxy issues or device trust store problems.

Decision: If your ZTNA solution does TLS termination, ensure it presents a trusted cert chain and doesn’t break app expectations.

Task 7: Inspect firewall rules that accidentally kept the network flat

cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
  chain forward {
    type filter hook forward priority filter; policy drop;
    iifname "tun0" ip daddr 10.0.0.0/8 accept
    ct state established,related accept
  }
}

What it means: There’s an explicit allow from VPN interface to 10/8. That’s your lateral movement highway.

Decision: Replace the blanket allow with app-specific rules or eliminate forwarding entirely in favor of proxy/tunnel access.

Task 8: Confirm identity group membership used for RBAC (IdP via SSO token introspection)

cr0x@server:~$ jq -r '.email, .groups[]' /tmp/id_token_claims.json
alex@example.com
role:engineering
env:staging
priv:none

What it means: The user’s claims show engineering + staging access, no privileged elevation.

Decision: If they need prod access, you don’t “just add them.” You implement a time-bounded privileged group with approvals and audit.

Task 9: Check device posture signal (managed vs unmanaged) from the endpoint

cr0x@server:~$ sudo osqueryi --line "select name, encrypted, type from disk_encryption;"
     name = /dev/nvme0n1p3
encrypted = 1
     type = luks

What it means: Disk encryption is enabled (good). This is one of the few posture checks that correlates with “lost laptop” risk.

Decision: If encryption is off, block access to sensitive apps and route the user through enrollment. Don’t negotiate with physics.

Task 10: Validate that a per-app tunnel is only exposing intended destinations

cr0x@server:~$ ss -tnlp | grep 127.0.0.1:15432
LISTEN 0 4096 127.0.0.1:15432 0.0.0.0:* users:(("ztna-client",pid=2214,fd=9))

What it means: The local client is listening on localhost only. Good: it’s not opening a port to your LAN or the whole world.

Decision: Keep local bind to 127.0.0.1 for DB tunnels; enforce that pattern in your standard operating procedure.

Task 11: Confirm audit trail exists for an access attempt

cr0x@server:~$ sudo journalctl -u ztna-connector --since "10 min ago" | tail -n 8
Dec 28 12:11:02 connector-01 ztna-connector[1189]: allow user=alex@example.com app=git-web device=managed policy=eng-web-mfa
Dec 28 12:11:04 connector-01 ztna-connector[1189]: deny user=alex@example.com app=prod-bastion reason=missing_jit_approval

What it means: You have explicit allow/deny events with reasons. That’s gold during incidents and access reviews.

Decision: If your system can’t explain a deny, you’ll end up bypassing it. Fix observability before scaling rollout.

Task 12: Identify latency sources (DNS, connect time, TLS, TTFB)

cr0x@server:~$ curl -o /dev/null -s -w "dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n" https://git.corp.example/
dns=0.012 connect=0.031 tls=0.112 ttfb=0.487 total=0.512

What it means: DNS/connect/TLS are fine; server response (TTFB) dominates. That’s likely app/backend load or proxy buffering, not “the VPN is slow.”

Decision: Escalate to app performance or proxy capacity; don’t waste days tweaking MTU if the backend is the bottleneck.

Task 13: Check MTU and path MTU issues (classic VPN pain, still shows up in tunnels)

cr0x@server:~$ ip link show dev tun0
6: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1280 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 500
    link/none

What it means: MTU is 1280 (common in tunneled environments). Some legacy services break if you assume 1500 and blackhole ICMP.

Decision: If you see stalls on large transfers, validate PMTUD and allow needed ICMP types, or clamp MSS on the tunnel edge.
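
Both fixes are one-liners on the tunnel edge. A sketch with nftables, reusing the table and chain names from Task 7 (your names may differ):

cr0x@server:~$ sudo nft add rule inet filter forward tcp flags syn tcp option maxseg size set rt mtu
cr0x@server:~$ sudo nft add rule inet filter input icmp type destination-unreachable accept
cr0x@server:~$ sudo nft add rule inet filter input icmpv6 type packet-too-big accept

The first rule clamps MSS to the route MTU on forwarded SYNs; the other two keep PMTUD’s ICMP signaling alive.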

Task 14: Confirm whether a “blocked” app is actually being accessed directly instead of via proxy

cr0x@server:~$ traceroute -n git.corp.example | head -n 6
traceroute to git.corp.example (10.20.30.40), 30 hops max, 60 byte packets
 1  192.168.1.1  1.103 ms  0.921 ms  0.877 ms
 2  203.0.113.1  9.902 ms  10.004 ms  9.811 ms
 3  * * *

What it means: You’re attempting to route to a private IP over the public internet (it will fail). This happens when DNS points to internal IPs but users are no longer on the VPN.

Decision: Fix DNS and publishing: private IPs shouldn’t leak to clients who aren’t supposed to have network-level access.

Task 15: Verify segmentation boundaries inside the office network (east-west reality check)

cr0x@server:~$ sudo tcpdump -ni eth0 'tcp port 445' -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:20:11.102938 IP 10.50.12.34.51322 > 10.50.0.10.445: Flags [S], seq 122233, win 64240, options [mss 1460,sackOK,TS val 1 ecr 0,nop,wscale 7], length 0

What it means: You’re seeing SMB attempts inside a supposedly segmented environment. That’s lateral movement surface area in a trench coat.

Decision: Ensure workstation VLANs cannot reach server VLANs except via explicit allow rules; remove legacy file-share exposure where possible.
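
On the VLAN gateway, make the boundary explicit: default-drop east-west, with the file server published as a single allow. A minimal nftables sketch (the table name is arbitrary; addresses are taken from the capture above):

table inet eastwest {
  chain forward {
    type filter hook forward priority filter; policy drop;
    ct state established,related accept
    # Workstations reach the file server on SMB only; nothing else forwards
    ip saddr 10.50.12.0/24 ip daddr 10.50.0.10 tcp dport 445 accept
  }
}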

Task 16: Validate who is still using the legacy VPN and for what

cr0x@server:~$ sudo awk -F',' '{print $1, $2, $3, $4}' /var/log/openvpn/status.log | tail -n 5
CLIENT_LIST alex@example.com 192.0.2.44:51820 10.8.0.23
CLIENT_LIST sam@example.com 198.51.100.19:51902 10.8.0.24
ROUTING_TABLE 10.30.0.0/16 alex@example.com 192.0.2.44:51820
ROUTING_TABLE 10.40.0.0/16 sam@example.com 198.51.100.19:51902
GLOBAL_STATS Max bcast/mcast queue length 0

What it means: Users are still pulling routes to large internal networks. The routing table entries show what they can reach.

Decision: Use this data to prioritize migration targets and to justify route removal with evidence instead of vibes.

Fast diagnosis playbook: find the bottleneck quickly

When someone says “zero trust broke my access,” do not start by changing policies. Start by figuring out which layer is failing. This is the order that saves time.

First: name resolution and destination correctness

  • Does the hostname resolve? To what—public proxy address or private IP?
  • Is the user using the right URL (proxy front door) vs old internal hostname?
  • Are they hitting a stale bookmark that bypasses your controls?

Second: path establishment (network + tunnel/proxy)

  • Can you establish TCP to the proxy/tunnel endpoint?
  • Is MTU/PMTUD causing partial hangs (especially large downloads, git clones, container pulls)?
  • Is there asymmetric routing between connector and app?

Third: identity and policy decision

  • Does the user have correct group/role claims?
  • Does device posture pass? (managed, encrypted, compliant)
  • Is the policy denying with a reason you can see in logs?

Fourth: application-level auth and performance

  • SSO loops? Cookie domain mismatches? TLS termination confusion?
  • Backend slowness misattributed to “VPN latency”?
  • Rate limits or WAF blocks triggered by proxy egress IP?

Shortcut rule: if you can’t see a deny reason in logs within five minutes, your system is not operable yet. Fix telemetry before continuing rollout.

Common mistakes: symptoms → root cause → fix

1) “Users can’t reach internal apps unless they use the old VPN”

Symptoms: Browser timeouts; traceroute shows public hops; DNS resolves private IPs.

Root cause: Split-horizon DNS still returns internal addresses to off-network clients; app isn’t properly published via proxy/tunnel.

Fix: Publish the app behind an identity-aware front door with a resolvable name for remote clients; stop leaking internal IPs in external DNS contexts.

2) “ZTNA is slower than VPN” (said loudly, in all caps)

Symptoms: Git clones slow; large downloads stall; interactive shells lag.

Root cause: MTU mismatch, PMTUD blocked, or connector/proxy underprovisioned. Sometimes it’s just the backend app being slow and now you can see it.

Fix: Measure with timing breakdown (DNS/connect/TLS/TTFB). Allow required ICMP for PMTUD or clamp MSS. Scale connectors horizontally and place them close to workloads.

3) “We added RBAC groups, but people still have too much access”

Symptoms: Users in one department can hit unrelated services; pentest shows lateral movement remains possible.

Root cause: Network policy still grants broad subnet reachability; app access is not actually app-scoped.

Fix: Remove subnet routes from user devices. Replace with per-app connectors/proxies and explicit allow lists by service identity.

4) “Contractors can access prod because they’re in the same ‘Engineering’ group”

Symptoms: Audit flags; uncomfortable meetings; sudden interest in separation of duties.

Root cause: Roles modeled around org chart, not risk boundaries. Contractor status not encoded in identity attributes.

Fix: Create separate identities/roles for contractors and vendors; require stronger conditions (managed VDI, JIT approvals) for any sensitive access.

5) “SSO loops forever” after proxying an internal web app

Symptoms: Redirect loop; cookies not sticking; user sees repeated login prompts.

Root cause: Misconfigured callback URLs, cookie domain issues, mixed HTTP/HTTPS assumptions, or double TLS termination confusing the app.

Fix: Standardize on HTTPS end-to-end. Align app base URL with the external proxy URL. Validate IdP redirect URIs and cookie settings (Secure, SameSite).
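
A quick client-side check of what the app actually sets (hostname reused from earlier tasks):

cr0x@server:~$ curl -sI https://git.corp.example/ | grep -i '^set-cookie'

Verify Secure is present, SameSite is deliberate, and the Domain attribute matches the URL users visit through the proxy, not an internal name.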

6) “Access works from laptops but fails from phones”

Symptoms: Mobile users denied; desktop fine.

Root cause: Device posture requirements exclude unmanaged devices; or mobile lacks required client/cert.

Fix: Decide explicitly: allow limited mobile access to low-risk apps or require managed mobile enrollment. Don’t silently half-support it.

7) “We turned on full tunnel for ‘security’ and everything broke”

Symptoms: SaaS logins fail; video calls degrade; traffic hairpins; helpdesk queue grows teeth.

Root cause: Corporate egress capacity wasn’t sized; split tunnel exceptions are messy; DNS and proxy policies conflict.

Fix: Prefer app-scoped access. If full tunnel is required, invest in capacity, local egress, and clear routing rules. Measure before and after.

8) “Our policies are correct, but the logs are useless”

Symptoms: Deny events missing context; no correlation ID; cannot answer “who accessed what.”

Root cause: Logging not treated as a first-class requirement; data not centralized; time sync issues.

Fix: Standardize log fields (user, device, app, decision, reason). Centralize to SIEM/log store. Ensure NTP everywhere.
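
If the decision logs are JSON (an assumption; the field names below are illustrative), normalizing them for the SIEM is a one-liner:

cr0x@server:~$ jq -r '[.time, .user, .device, .app, .decision, .reason] | @tsv' decisions.json

Whichever fields you pick, pick them once, emit them everywhere, and keep NTP healthy so the rows correlate.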

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company migrated to a shiny new access platform. The project plan said: “No more VPN. Everything is behind the proxy.” They flipped the switch for remote access and congratulated themselves with a calendar invite titled “ZTNA done.” That invite aged poorly.

The wrong assumption was subtle: they believed that removing the VPN client removed network adjacency. In reality, office Wi‑Fi was still flat, and remote users still had a fallback “legacy VPN” profile for “exceptions.” The exceptions list grew quietly because it always does. A few services still relied on internal DNS names resolving to private IPs, and people were trained to “just use the VPN” when something didn’t load.

Then a contractor laptop got phished. The attacker didn’t need fancy exploits; they used the contractor’s VPN access, landed on a workstation VLAN, and scanned. File shares responded. A management interface on an old hypervisor responded. The access platform wasn’t bypassed; it was simply irrelevant to the paths the attacker chose.

During response, the team’s dashboards were full of proxy logs showing “nothing suspicious.” True, and also useless. The suspicious activity was on the old VPN routes and the flat office network. The proxy was guarding the front door while the back door was propped open for “just a few weeks.”

The fix wasn’t a new tool. It was a boundary decision: remove broad VPN routes, enforce role-based segmentation on office networks, and publish apps in a way that didn’t rely on internal DNS leakage. They also reclassified contractors as a distinct risk tier with different conditions. The lesson: zero trust dies when you keep a trusted network path “temporarily.”

Mini-story 2: The optimization that backfired

An enterprise IT team wanted to reduce latency. They moved their connectors to a central data center and forced all remote user traffic through it because “single egress is easier to monitor.” Security agreed, because monitoring is comforting and everyone likes graphs.

Performance complaints started within a week. Not because the access platform was inherently slow, but because the traffic path was now absurd: user → proxy → central connector → cloud workload in another region → back again. The extra round trips were deadly for chatty protocols and for apps with many small HTTP requests. The ticket volume went up, and people started whispering the forbidden phrase: “Can we just go back to the VPN?”

The team doubled down and increased connector CPU and bandwidth. Costs rose; the user experience barely improved. The bottleneck wasn’t raw throughput; it was latency and path design. They had optimized for monitoring convenience, not for physics.

The eventual recovery was boring engineering. They deployed connectors closer to workloads (including inside cloud VPCs), used regional egress where compliance allowed, and kept monitoring by centralizing logs rather than centralizing packets. They also split policies by app sensitivity: payroll traffic kept the stricter path; internal wiki took the faster one.

Optimization lesson: “one chokepoint” is operationally attractive until it becomes the chokepoint in the literal sense.

Mini-story 3: The boring but correct practice that saved the day

A company with a mature SRE culture rolled out per-app access for production administration. Nothing dramatic: they required just-in-time elevation for prod SSH, used short-lived certificates, and recorded sessions. The implementation was unglamorous. It also worked.

Months later, a senior engineer’s laptop was stolen from a car. The engineer did the right thing and reported it quickly. Security rotated tokens, revoked sessions, and removed device trust. Still, everyone was tense because “stolen laptop” is a phrase that makes executives learn new vocabulary.

Here’s what didn’t happen: there was no evidence of production access from that device after the theft. Session logs showed the last successful elevation, and the access system’s deny events showed attempted reuse from an unmanaged device state. The attacker had a machine, but not the posture, not the certs, and not the JIT approval.

The incident was resolved without a full production credential rotation. They didn’t have to take down half the fleet to be safe. They did a targeted review, confirmed constraints held, and moved on. The team’s most valuable asset wasn’t a tool; it was a practice: time-bounded privileges plus clean audit trails.

Boring saved the day. That’s the job.

FAQ

1) Is zero trust just “MFA everywhere”?

No. MFA helps prove it’s really the user, once. Zero trust also limits what that user can reach, from what device, under what conditions, and it logs the decision.

2) Do we have to eliminate VPN entirely?

No. But you should eliminate flat VPN routing. Keep a legacy VPN only for clearly defined exceptions, with narrow routes and an end-of-life plan.

3) What’s the difference between ZTNA and microsegmentation?

ZTNA focuses on user-to-app access decisions using identity and posture. Microsegmentation focuses on workload-to-workload east-west controls. You usually need both: ZTNA reduces user blast radius; microsegmentation reduces what a compromised server can reach.

4) How do we handle SSH access in a zero trust model?

Use a broker/bastion with short-lived credentials (certs), JIT elevation for production, and logging/session recording. Avoid long-lived shared keys and avoid giving laptops broad routing into server subnets.

5) Will role-based access explode into role sprawl?

It will if you let every team invent roles freely. Set guardrails: stable roles, owners, quarterly reviews, and a bias toward coarse roles plus app-level authorization.

6) What about service accounts and automation—do they do “zero trust” too?

Yes, or you’ve built a secure front door and left a robotic side entrance. Use workload identity, short-lived tokens, and explicit service-to-service policy. Don’t tunnel entire networks for CI runners.

7) How do we avoid breaking legacy apps that assume LAN access?

Start by isolating them: per-app tunnels, restricted destination lists, and strict posture. Then plan modernization: SSO, TLS, and removing dependencies on broadcast, old protocols, and shared file shares.

8) What’s the quickest win with the biggest risk reduction?

Stop routing remote users into broad internal subnets. Move the most sensitive admin paths (prod SSH/RDP, database admin) to per-app access with JIT and logging.

9) How do we prove to auditors that this is working?

Show explicit policies, group membership controls, posture requirements, and audit logs that answer: who accessed what, when, from which device, and why it was allowed.

10) What if the access proxy goes down—do we lose everything?

If you design it poorly, yes. Build for redundancy: multiple connectors, health checks, and clear break-glass procedures with short-lived, audited elevation. Availability is a security requirement because outages create bypasses.

Conclusion: next steps that survive contact with reality

If you remember one thing, make it this: your goal isn’t to replace a VPN client with a different client. Your goal is to replace implicit network trust with explicit, role-based, identity-aware access.

Next steps

  1. Inventory actual VPN routes, destinations, and user populations. Use logs, not guesses.
  2. Publish web apps through identity-aware access first. It’s the fastest way to reduce subnet dependency.
  3. Move admin access (SSH/RDP/kubectl/db admin) to per-app tunnels with short-lived creds and JIT elevation.
  4. Shrink VPN routing aggressively: remove 10/8 and friends. If someone “needs everything,” they need an architecture review, not a route.
  5. Make posture real: define managed devices, enforce encryption and patch level, and treat BYOD as a separate tier.
  6. Operationalize it: logging with deny reasons, fast diagnosis playbook, quarterly role review, and a clear break-glass path that doesn’t become the default.

Do this well and your office network stops being a privileged club where everyone gets a wristband that opens every door. It becomes what it always should have been: a set of narrowly defined paths that exist only when there’s a reason, and disappear when there isn’t.
