Tailscale: VPN Without Pain, and the Access Mistakes That Still Hurt

You’re on-call. Someone pings: “Can you hop into the prod database host? VPN’s down again.”
Half the time the “VPN” is actually a collection of outdated client configs, expired certs, and a firewall rule nobody dares touch.
The other half is a user error that looks like a network outage until you stare at it long enough.

Tailscale sells the dream: secure private networking with fewer moving parts. It mostly delivers.
But the sharp edges didn’t disappear; they moved. Access control, routes, DNS, and identity are the new places people cut themselves.
Let’s set it up properly, and then talk about how it fails in the real world—because it will.

What Tailscale really is (and what it isn’t)

Tailscale is a WireGuard-based overlay network that ties connectivity to identity.
You get a private IP range (typically 100.64.0.0/10), nodes authenticate via an identity provider,
and the control plane coordinates keys and routes so devices can talk directly when possible.

The crucial mental model: Tailscale is not “a VPN server.” It’s a distributed mesh with a coordination layer.
Most traffic is peer-to-peer. If it can’t be (NATs, firewalls, symmetric NAT, etc.), traffic relays through DERP.
DERP is not “a backdoor”; it’s a pragmatic relay so your laptop can still reach a box in a crusty datacenter.

What Tailscale is not: a replacement for basic network hygiene. You still need host firewalls,
correct routing, sane DNS, and a plan for privileged access. Tailscale makes it easier to do those things.
It does not do them for you.

And yes, you can still lock yourself out. Tailscale just helps you do it faster and from more places.

Facts and history worth knowing

  • WireGuard is young by VPN standards. It hit Linux mainline in 2020, which matters when you compare it to decades-old IPsec stacks.
  • Tailscale’s default address space (100.64.0.0/10) lives in the CGNAT range, reducing collisions with internal RFC1918 networks—usually.
  • DERP is a relay, not a tunnel concentrator. It exists to handle the cases where NAT traversal fails and would otherwise block connectivity.
  • “Zero trust” became a budget line item after perimeter-only models kept failing in hybrid cloud + remote work environments.
  • Traditional VPNs normalized shared secrets and static configs (client profiles, PSKs, long-lived certs). Tailscale pushes toward short-lived identity-bound auth.
  • Split tunneling fights old instincts. Many corporate VPNs forced “all traffic through HQ”; Tailscale makes it optional via exit nodes, on purpose.
  • ACLs are policy-as-code now. That’s great—until someone treats the ACL file like a magic incantation instead of a security boundary.
  • Subnet routing predates Tailscale by decades, but overlay subnet routing makes it tempting to “just route the whole datacenter,” which is how audits start.
  • Identity providers became the new perimeter, which is why an IdP outage can feel like a network outage with a suit on.

A sane “VPN without pain” setup

If you want Tailscale to stay boring, decide what “boring” means before you click any toggles.
My definition: engineers can reach what they need, access is least-privilege by default,
and the failure mode is “can’t connect” rather than “connected to everything.”

Start with a minimal topology

Don’t begin with subnet routes, exit nodes, and clever tag-based exceptions in week one.
Begin with device-to-device connectivity among a small set of admin workstations and a couple of target hosts.
Confirm performance, confirm identity flow, confirm logs. Then scale.

Pick your identity story and commit

If you’re using an IdP (Google Workspace, Microsoft Entra ID, Okta, etc.), enforce MFA and device trust where possible.
The IdP is now your “VPN login.” Treat it as production-critical.
If your org is allergic to SSO, use Tailscale auth keys—but understand the lifecycle, rotation, and scope.

Decide how nodes get authorized

For production environments, prefer a model where:

  • Users can’t just add random devices without approval (or at least without visibility).
  • Servers are registered with scoped, reusable auth keys (or one-time ephemeral keys) plus tags (see the sketch after this list).
  • Deprovisioning a user or device is a single action that actually works.
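
For the server side, a minimal sketch of that registration (the auth key is a placeholder for a pre-authorized, tagged key generated in the admin console; tag:db must be declared in your policy's tagOwners for the tag to apply):

cr0x@server:~$ sudo tailscale up \
    --authkey=tskey-auth-XXXXXXXXXXXX \
    --hostname=db-01 \
    --advertise-tags=tag:db

For CI runners and other disposable machines, prefer ephemeral keys: the nodes clean themselves up after going offline, which keeps the device list honest.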

Make the network plan explicit

Overlay networks invite “just this once” decisions that become permanent.
Write down:

  • Which services must be reachable (SSH, RDP, Postgres, internal web, etc.).
  • From which roles (on-call, CI, DBAs, support).
  • Whether you’ll use subnet routes, or require installing Tailscale on the actual hosts.
  • Whether you’ll allow exit nodes, and for whom.

Joke #1: A VPN is like a shared office kitchen—if you don’t label things, someone will drink the milk, and it’ll be your outage.

Identity, ACLs, tags: where access goes right or wrong

Tailscale’s big win is also its biggest trap: access control gets easy enough that people stop thinking about it.
In classic VPN land, you had a network boundary and a pile of firewall rules. In Tailscale land, ACLs are your firewall.
If you treat them like “that config file we copy-paste,” you will ship security bugs.

Use groups for humans, tags for machines

A practical rule: humans belong in groups, servers get tags.
Humans change jobs, quit, take vacations, and bring personal devices. Servers should not inherit human identity.
Tags are how you express machine roles (“tag:db”, “tag:bastion”, “tag:monitoring”).

Then write ACLs like you’re going to be deposed about them.
Least privilege. Explicit ports. No broad “allow all” policies to “get it working.”
Getting it working is easy; getting it safe is the work.
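
Here's a minimal sketch of that shape in Tailscale's policy file (HuJSON, so comments and trailing commas are allowed). The members, tags, and ports are placeholders; the structure is the point: humans in groups, machines in tags, explicit ports.

{
  // Humans: membership follows the org chart, not the hardware.
  "groups": {
    "group:oncall": ["alice@example.com", "bob@example.com"],
    "group:dba":    ["carol@example.com"],
  },

  // Machines: tags describe roles; tagOwners says who may apply them.
  "tagOwners": {
    "tag:bastion": ["group:oncall"],
    "tag:db":      ["group:dba"],
  },

  // Explicit sources, destinations, and ports. No allow-all "to get it working."
  "acls": [
    {"action": "accept", "src": ["group:oncall"], "dst": ["tag:bastion:22"]},
    {"action": "accept", "src": ["group:dba"],    "dst": ["tag:db:5432"]},
  ],
}

Anything not matched by a rule is denied, which is exactly the property you want to be able to explain later.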

Avoid “temporary” broad rules

The most dangerous ACL rule is the one added during an incident.
The second most dangerous is the one nobody remembers was added during an incident.
Put time pressure on yourself: if you add an emergency exception, schedule its removal while you still remember why it exists.

Think in terms of blast radius

If one laptop is compromised, what can it touch?
If one auth key leaks, what can it register?
If the IdP is misconfigured, what does an attacker get?
Answer these questions before your auditor asks them in a calm room with fluorescent lighting.

Subnet routes and exit nodes: powerful, easy to abuse

Subnet routing is how you make Tailscale clients reach networks that don’t run the Tailscale client.
Exit nodes are how you route a client’s default internet traffic through a Tailscale node.
Both features are operationally useful. Both can quietly become the “new corporate VPN,” with all the old problems.

Subnet routers: when and how

Use subnet routers for:

  • Legacy networks you can’t modify (appliances, old hypervisors, lab gear).
  • Short-term migrations where installing Tailscale everywhere is unrealistic.
  • Site-to-site connectivity where you want identity-based access, not site-wide trust.

Avoid subnet routers as a default if you can install Tailscale on the hosts. Host-level clients give you better attribution,
better segmentation, and fewer “why is all traffic hairpinning through that one VM” surprises.
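
When a subnet router is genuinely the right tool, keep the setup small enough to review line by line. A sketch for a Linux router node with example prefixes (the advertised route still needs approval in the admin console):

# On the router node: enable forwarding, then advertise only the prefix you need.
cr0x@server:~$ sudo sysctl -w net.ipv4.ip_forward=1
net.ipv4.ip_forward = 1
cr0x@server:~$ sudo tailscale up --advertise-routes=10.20.30.0/24

# On Linux clients that should use it (route acceptance is not automatic on Linux):
cr0x@laptop:~$ sudo tailscale set --accept-routes=true

The smaller the prefix and the fewer the clients accepting it, the smaller the surprise when something scans the wrong network.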

Exit nodes: treat as privileged infra

Exit nodes are not just routing features; they’re policy enforcement points and traffic concentrators.
If you allow them, enforce:

  • Dedicated, hardened exit nodes (not “someone’s desktop”).
  • Restricted ACLs: only specific groups can use them.
  • Monitoring: bandwidth, CPU, packet drops, and whether clients are unexpectedly using DERP.

Exit nodes also complicate incident response. If your egress IP is suddenly “wrong,” it might be because a client selected an exit node
(intentionally or accidentally). That’s not a network mystery; it’s a checkbox.
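
If you do allow them, the moving parts are small; the policy around them is the real work. A sketch with placeholder names for the exit node and the group allowed to use it:

# On the dedicated, hardened exit node (requires approval in the admin console):
cr0x@server:~$ sudo tailscale up --advertise-exit-node

# On a client, only when actually needed:
cr0x@laptop:~$ sudo tailscale set --exit-node=exit-us-east-1
# ...and switched back off afterwards:
cr0x@laptop:~$ sudo tailscale set --exit-node=

In the policy file, exit-node usage is granted with a rule whose destination is autogroup:internet, which is how you restrict it to specific groups instead of everyone who finds the checkbox.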

DNS and MagicDNS: name resolution as a reliability feature

Most “VPN is down” reports are actually DNS failures wearing a network costume.
Tailscale’s MagicDNS can clean this up by giving stable names to nodes and handling split DNS for internal domains.
It can also create new failure modes if you half-configure it.

Decide what “internal DNS” means

If you already run internal DNS (Bind, Unbound, Infoblox, Route 53 private zones, etc.), decide whether:

  • Tailscale names are enough (e.g., web-01.tailnet-abc.ts.net), or
  • You need internal domains resolved over Tailscale (split DNS), or
  • You need both, with clear precedence rules.

Then test on macOS, Windows, Linux, and mobile. DNS stacks are not consistent; they are a zoo with paperwork.
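
A quick way to separate "MagicDNS is broken" from "this OS resolver is being creative" is to query the Tailscale resolver directly, then the system resolver, with the same name (100.100.100.100 is the client's built-in MagicDNS resolver; the name is from the earlier examples):

cr0x@server:~$ dig +short @100.100.100.100 web-01.tailnet-abc.ts.net
100.64.10.12
cr0x@server:~$ dig +short web-01.tailnet-abc.ts.net
100.64.10.12

If the first query works and the second doesn't, MagicDNS is fine and the OS resolver configuration is the problem. If both fail, go look at the tailnet DNS settings before touching routing.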

Tailscale SSH: remove keys, not accountability

Tailscale SSH can replace traditional SSH key distribution by tying SSH access to Tailscale identity and policy.
That’s attractive: fewer long-lived keys, less “who owns this authorized_keys line,” and better audit trails.
It’s also a change to your operational muscle memory.

A good approach:

  • Enable Tailscale SSH on a small set of hosts first (see the sketch below).
  • Keep break-glass access (console, cloud serial, out-of-band) because you will misconfigure something eventually.
  • Decide whether you still allow OpenSSH over non-Tailscale networks. Usually, no.
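
The host side is one flag; the policy side is an ssh section in the same policy file as your ACLs. A sketch with placeholder group and tag names, using autogroup:nonroot as a conservative default for which accounts people may log in as:

# On the pilot host:
cr0x@server:~$ sudo tailscale up --ssh

// In the policy file:
"ssh": [
  {
    "action": "accept",
    "src":    ["group:oncall"],
    "dst":    ["tag:bastion"],
    "users":  ["autogroup:nonroot"],
  },
],

For sensitive hosts, "check" instead of "accept" forces a fresh identity verification before the session is allowed.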

One reliability quote that still holds: “Hope is not a strategy,” often credited to General Gordon R. Sullivan.
It’s not exclusively an ops quote, but it’s painfully accurate when your access path is a single brittle control plane.

Practical tasks: commands, outputs, and decisions (12+)

These are the commands you run when someone says “Tailscale is broken,” plus what the output means and the decision you make next.
They’re intentionally mundane. Mundane saves production.

Task 1: Verify the node is actually connected

cr0x@server:~$ tailscale status
100.64.10.12   web-01           linux   active; direct 198.51.100.24:41641, tx 123456 rx 234567
100.64.10.20   db-01            linux   active; relay "iad", tx 45678 rx 56789
100.64.10.99   cr0x-laptop      macOS   active; direct 203.0.113.77:54012, tx 98765 rx 87654

Meaning: If you don’t see your peer, you’re not in the same tailnet or the peer is offline. “direct” vs “relay” tells you the path.

Decision: If peers are missing, check login/authorization. If it’s relayed unexpectedly, start looking at NAT/firewall.

Task 2: Check local daemon health and login state

cr0x@server:~$ tailscale status --json | jq '.BackendState, .Self.DNSName, .Self.Online'
"Running"
"web-01.tailnet-abc.ts.net."
true

Meaning: BackendState “Running” is good. DNSName confirms which tailnet you’re in.

Decision: If not running, restart service. If DNSName is wrong, you’re logged into the wrong org.

Task 3: Check NAT traversal and DERP reachability (netcheck)

cr0x@server:~$ tailscale netcheck
Report:
        * UDP: true
        * IPv4: yes, 198.51.100.24:41641
        * IPv6: no
        * MappingVariesByDestIP: false
        * HairPinning: true
        * Nearest DERP: iad
        * DERP latency:
                - iad: 18.2ms   (nearest)
                - ord: 35.4ms
                - lax: 72.9ms

Meaning: UDP true is key for direct WireGuard. Nearest DERP is where you’ll relay if needed.

Decision: If UDP is false, fix firewall/NAT. If DERP latency is huge, expect sluggish interactive sessions.

Task 4: See active preferences (exit node, SSH, routes)

cr0x@server:~$ tailscale debug prefs | jq '{ControlURL, RouteAll, ExitNodeID, CorpDNS, RunSSH}'
{
  "ControlURL": "https://controlplane.tailscale.com",
  "RouteAll": false,
  "ExitNodeID": "",
  "CorpDNS": true,
  "RunSSH": false
}

Meaning: ExitNodeID empty means no exit node is in use. RouteAll false means this client isn’t accepting advertised routes. CorpDNS true means it’s using tailnet DNS config.

Decision: If someone claims “internet is slow,” check ExitNodeID first. It’s often self-inflicted.

Task 5: Confirm you can reach a peer over the tailnet IP

cr0x@server:~$ ping -c 2 100.64.10.20
PING 100.64.10.20 (100.64.10.20) 56(84) bytes of data.
64 bytes from 100.64.10.20: icmp_seq=1 ttl=64 time=22.4 ms
64 bytes from 100.64.10.20: icmp_seq=2 ttl=64 time=22.1 ms

--- 100.64.10.20 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 22.141/22.262/22.384/0.121 ms

Meaning: Basic reachability works.

Decision: If ping fails but status shows “active,” suspect ACLs or host firewall blocking ICMP (or ping disabled).

Task 6: Test the actual service port (because ping lies)

cr0x@server:~$ nc -vz 100.64.10.20 5432
Connection to 100.64.10.20 5432 port [tcp/postgresql] succeeded!

Meaning: TCP path to Postgres is open.

Decision: If this fails but ping works, it’s ACL or host firewall or the service isn’t listening on the tailnet interface.

Task 7: Check the service is listening on the right interfaces

cr0x@server:~$ sudo ss -lntp | grep 5432
LISTEN 0      244          127.0.0.1:5432       0.0.0.0:*    users:(("postgres",pid=1342,fd=6))

Meaning: Postgres is bound to localhost only. Tailscale can’t reach it remotely.

Decision: Bind to the Tailscale IP or to 0.0.0.0 (with firewall/ACL constraints), then retest.
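
For Postgres, that usually means two small edits, sketched here with the db host's tailnet address from the earlier status output (restart Postgres afterwards, and keep the host firewall and ACLs in front of it):

# postgresql.conf: listen on loopback plus the Tailscale address.
listen_addresses = 'localhost, 100.64.10.20'

# pg_hba.conf: accept connections only from the tailnet range, with strong auth.
host    all    all    100.64.0.0/10    scram-sha-256

One caveat: if you bind to the Tailscale IP directly, tailscaled must be up before Postgres starts or the bind fails; binding to 0.0.0.0 and leaning on the firewall avoids that ordering dependency.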

Task 8: Inspect Linux routing table for subnet route surprises

cr0x@server:~$ ip route show
default via 192.0.2.1 dev eth0
100.64.0.0/10 dev tailscale0 scope link
10.20.0.0/16 dev tailscale0 scope link

Meaning: 10.20.0.0/16 is routed via tailscale0—likely a subnet route you accepted.

Decision: If this route is unexpected, identify the advertising router and decide whether to disable route acceptance.
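
On a Linux client, turning acceptance off is one command, and the route should drop out of the table immediately (a sketch; macOS and Windows manage this from the client UI):

cr0x@server:~$ sudo tailscale set --accept-routes=false
cr0x@server:~$ ip route show | grep tailscale0
100.64.0.0/10 dev tailscale0 scope link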

Task 9: Confirm which subnet routes are being advertised from a router node

cr0x@server:~$ tailscale debug prefs | jq '.AdvertiseRoutes'
[
  "10.20.0.0/16"
]

Meaning: This node is advertising 10.20.0.0/16 to the tailnet.

Decision: If you didn’t intend that, clear the advertised routes and fix any automation (or saved flags) that re-applies them.
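
Clearing it on the router node is also a one-liner (a sketch; an empty value stops advertising). Re-run the check above afterwards, and pull the route's approval in the admin console so it can't quietly come back:

cr0x@server:~$ sudo tailscale set --advertise-routes=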

Task 10: Check whether your client is using DERP (relay) unexpectedly

cr0x@server:~$ tailscale ping 100.64.10.99
pong from cr0x-laptop (100.64.10.99) via DERP(iad) in 36ms

Meaning: You’re relaying through DERP, not direct.

Decision: If performance matters, investigate UDP reachability and NAT behavior on both ends.

Task 11: Check host firewall counters (Linux nftables example)

cr0x@server:~$ sudo nft list ruleset | sed -n '1,120p'
table inet filter {
  chain input {
    type filter hook input priority filter; policy drop;
    iif "lo" accept
    ct state established,related accept
    iif "tailscale0" tcp dport { 22, 5432 } accept
    counter packets 0 bytes 0 drop
  }
}

Meaning: tailscale0 is allowed for ports 22 and 5432; policy drop otherwise.

Decision: If counters show drops increasing during your test, fix firewall rules before blaming Tailscale.

Task 12: Validate DNS resolution for MagicDNS names

cr0x@server:~$ dig +short web-01.tailnet-abc.ts.net
100.64.10.12

Meaning: Name resolves to the tailnet IP.

Decision: If it doesn’t resolve, check MagicDNS enablement and client DNS settings; don’t waste time on routing yet.

Task 13: Inspect systemd service logs for auth and network errors

cr0x@server:~$ sudo journalctl -u tailscaled -n 50 --no-pager
Dec 27 11:02:14 web-01 tailscaled[812]: wgengine: Reconfig: configuring userspace WireGuard config (with 2 peers)
Dec 27 11:02:18 web-01 tailscaled[812]: magicsock: derp-https: connected to derp-iad
Dec 27 11:02:24 web-01 tailscaled[812]: health: state=running

Meaning: You see normal bring-up, DERP connection, and healthy running state.

Decision: If logs show repeated auth failures, key expiry, or “not authorized,” stop and fix identity/approval.

Task 14: Confirm client version consistency (to avoid weird edge bugs)

cr0x@server:~$ tailscale version
1.76.6
  tailscale commit: 3f2c1a7e4
  go version: go1.22.6

Meaning: You know what you’re running.

Decision: If one end of the problem path is on an ancient version, upgrade before deep debugging. Mixed versions create time-wasting ghosts.

Fast diagnosis playbook

When connectivity breaks, you want to locate the bottleneck layer quickly: identity, policy, routing, transport, or the service itself.
Here’s the order that tends to minimize wasted time.

First: confirm identity + membership

  • Does tailscale status show the peer at all?
  • Is the node logged into the correct tailnet (DNSName suffix)?
  • Is the device approved/authorized in the admin console (if approval is enabled)?

If this layer is wrong, nothing else matters. Routing can be perfect; you’re still not invited to the party.

Second: confirm policy (ACLs, tags, SSH policy)

  • Can you ping the tailnet IP?
  • Can you connect to the actual port with nc -vz?
  • If using Tailscale SSH, is it enabled and allowed for this identity?

Most production failures are policy mismatches: a tag didn’t apply, a rule is too broad (and got tightened), or a group changed.

Third: confirm transport (direct vs DERP, UDP reachability)

  • Run tailscale netcheck and look at UDP.
  • Run tailscale ping and see if it’s direct or DERP.

If it’s DERP and slow, you’re not “down,” you’re just paying the NAT tax.
Fixing NAT/firewall can turn a sluggish remote shell into a normal one.

Fourth: confirm routing (subnet routes, exit nodes)

  • Check ip route for unexpected routes via tailscale0.
  • Check whether an exit node is enabled (tailscale debug prefs).

Routing issues often look like “random internal service is broken.” It’s not random; it’s deterministic and undocumented.

Fifth: confirm the service and host firewall

  • Is the service listening on the right interface (ss -lntp)?
  • Is the firewall allowing tailscale0 traffic (nftables/iptables)?

If Tailscale is fine but the service binds to 127.0.0.1, your overlay network is irrelevant. This is the boring part. Do it anyway.

Common mistakes: symptoms → root cause → fix

1) “I can see the node, but SSH times out”

Symptoms: Node appears in tailscale status. Ping may work. SSH/RDP/app port times out.

Root cause: Host firewall blocks tailscale0, or service only listens on localhost, or ACL blocks the port.

Fix: Check nc -vz, then ss -lntp, then nftables/iptables. Ensure ACL explicitly allows the destination port, and the daemon binds to the reachable interface.

2) “Everything is slow today”

Symptoms: Interactive sessions lag; file transfers crawl; intermittent delays.

Root cause: Traffic is relayed via DERP due to UDP blocked or symmetric NAT; sometimes a forced exit node adds latency.

Fix: Run tailscale ping to confirm DERP vs direct and tailscale netcheck for UDP. Fix firewall to allow outbound UDP/41641 (or at least UDP generally), and avoid exit nodes for latency-sensitive workflows unless needed.

3) “We enabled a subnet route and now weird things happen”

Symptoms: Some internal IPs route differently; services become reachable from places they shouldn’t; tickets mention “split tunnel broke.”

Root cause: Clients accepted a subnet route that overlaps with existing routes, or an overbroad route was advertised (like /16) when /24 would do.

Fix: Audit routes with ip route and the router’s tailscale status --json. Narrow the advertised prefixes. Disable auto-accept routes where appropriate, and enforce ACL restrictions for the subnet destinations.

4) “A contractor can still reach prod after offboarding”

Symptoms: A user is removed from the IdP group, but their device still connects or still has access to something.

Root cause: Device remains authorized; ACLs are tag-based and tags are too broad; or a long-lived auth key registered a node not tied to the contractor’s identity.

Fix: Enforce device approval/expiry, regularly prune devices, and prefer user-bound identity for human access. For machine keys, scope them tightly and rotate. Make “offboarding checklist includes Tailscale devices” non-optional.

5) “MagicDNS doesn’t work on my laptop, but works on servers”

Symptoms: dig fails for tailnet names; IP access works; behavior differs by OS.

Root cause: Client DNS settings overridden by another VPN, local resolver, captive portal, or OS DNS priority quirks.

Fix: Verify with dig and inspect OS DNS settings. Disable conflicting VPN DNS, and standardize client configuration policy. If you rely on split DNS, test it on every platform you support.

6) “We can’t reach the datacenter subnet from Tailscale”

Symptoms: Tailnet nodes can talk to each other, but not to 10.x/192.168.x behind a subnet router.

Root cause: Subnet router isn’t forwarding (IP forwarding off), firewall blocks, or routes not approved in admin UI.

Fix: Enable IP forwarding on the router node, allow forwarding in firewall, and confirm that advertised routes are approved and clients accept them. Then test with traceroute and nc.
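
The forwarding piece is the one people forget to persist. A sketch of checking it and making it survive reboots on a Linux router node (the file name is a convention, not a requirement):

# Is forwarding actually on right now?
cr0x@server:~$ sysctl net.ipv4.ip_forward net.ipv6.conf.all.forwarding
net.ipv4.ip_forward = 0
net.ipv6.conf.all.forwarding = 0

# Turn it on and persist it.
cr0x@server:~$ echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-tailscale.conf >/dev/null
cr0x@server:~$ echo 'net.ipv6.conf.all.forwarding = 1' | sudo tee -a /etc/sysctl.d/99-tailscale.conf >/dev/null
cr0x@server:~$ sudo sysctl -p /etc/sysctl.d/99-tailscale.conf
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1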

7) “Exit node enabled and now SaaS logins look suspicious”

Symptoms: Geo-based alerts; repeated logouts; sites show different region; bandwidth spikes on one node.

Root cause: Users unknowingly route all traffic through an exit node, changing egress IP and location.

Fix: Restrict exit node use to specific groups. Educate users on when to use it. Monitor exit nodes. Consider dedicated “egress regions” if compliance requires it.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company rolled out Tailscale to replace an old IPsec setup. The plan was clean: engineers connect from laptops,
reach a bastion host, then access internal services. Someone added a subnet router to make a legacy monitoring network reachable.

The wrong assumption was subtle: they assumed “advertise a subnet route” meant “only a few people can use it.”
In reality, every client accepted routes by default, and the ACLs were written broadly (“engineering can reach internal”)
because that matched the old VPN culture.

A week later, an engineer on a personal device joined the tailnet, installed a debugging tool, and accidentally scanned the monitoring subnet.
Nothing malicious. But the alerts looked like reconnaissance. Security woke up. Management woke up. Everyone woke up cranky.

The fix wasn’t dramatic: tighten ACLs to specific destinations and ports, require device approval for new nodes,
and make subnet route acceptance explicit for only the roles that need it. The real lesson was that overlay routing changes who can “see” networks.
If you don’t make that explicit, you’ll discover it during an incident review.

Mini-story 2: The optimization that backfired

An enterprise IT team decided to standardize remote access by pushing all internet traffic through exit nodes “for security monitoring.”
They deployed a pair of exit nodes per region and required users to enable them. On paper, this gave central visibility and consistent egress.

The backfire came in three acts. First, bandwidth costs rose and the exit nodes became hot spots. Second, interactive work got slower for
engineers who were now hairpinning to an exit node and back out to cloud services hosted in a different region. Third, a routine kernel update
plus a NIC driver quirk caused packet loss on one exit node, which turned into a company-wide “internet is down” event.

The incident was brutal because it looked like everything: DNS, SaaS outages, “Wi-Fi problems,” you name it.
The core issue was simply that they had reintroduced a central chokepoint—exactly what Tailscale typically helps you avoid.

They recovered by making exit node usage optional, restricted to specific scenarios (public Wi-Fi, geo-restricted resources, compliance testing).
They also added monitoring and capacity headroom for exit nodes that remained. The “optimization” was actually a topology change, and topology changes have consequences.

Mini-story 3: The boring but correct practice that saved the day

A fintech ran Tailscale across cloud and on-prem. They had a strict habit: every production-access change required a small change request,
and every ACL change required a peer review, even if it was “just opening port 443 to a new service.”

During an incident, an on-call engineer needed urgent access to a storage node for log collection. The quickest path would have been
to add a broad ACL exception for the on-call group to reach the entire storage subnet. That would have worked in minutes.

But the team followed the boring practice: a narrow ACL rule scoped to a single host tag and a single port, plus a short-lived device posture check.
They also documented it in the incident notes and scheduled its removal. The access worked, the incident moved forward, and nothing “extra” stayed open.

Weeks later, a compromised contractor laptop hit the tailnet. The attacker’s reach was limited.
The storage fleet wasn’t in scope, because the ACLs were precise and reviewed. The team didn’t celebrate.
They just moved on. That’s what “correct” looks like in operations: it’s almost disappointingly uneventful.

Checklists / step-by-step plan

Step-by-step: rolling out Tailscale in a production-minded way

  1. Define access goals. List services and ports that must be reachable, by role. If you can’t list it, you can’t secure it.
  2. Choose identity and enforce MFA. If SSO is available, use it and require MFA. If not, treat auth keys like secrets with lifecycle management.
  3. Enable device visibility and approvals. Decide whether new devices are auto-approved. For production, default to approval required.
  4. Establish naming and tags. Humans in groups. Servers in tags. No exceptions “because it’s easier.”
  5. Start with host-installed Tailscale where possible. Subnet routes are a bridge, not a lifestyle.
  6. Write minimal ACLs. Start with a single “admin → bastion:22” rule and expand carefully.
  7. Turn on MagicDNS intentionally. Test on all OSes you support. Document how split DNS works if you use it.
  8. Decide on exit nodes. If allowed, build dedicated exit nodes and restrict usage. Otherwise, disable the feature or policy-block it.
  9. Keep break-glass access. Console access, cloud serial, iLO/iDRAC, or equivalent. You will need it one day.
  10. Instrument and alert. Watch DERP usage changes, exit node load, and tailscaled health logs on critical nodes.
  11. Train on-call with the diagnosis playbook. Make sure people can distinguish ACL failure from DNS failure from service binding issues.
  12. Do quarterly access hygiene. Prune old devices, rotate keys, review ACLs, and audit subnet routes.

Checklist: before you enable subnet routing

  • Confirm the subnet doesn’t overlap with existing routes used by clients.
  • Confirm the router node has IP forwarding enabled and firewall forwarding rules in place.
  • Advertise the smallest prefix that works (avoid /16 if /24 is enough).
  • Restrict access to the subnet using ACLs (don’t rely on “only some people will use it”).
  • Decide which clients accept routes and make it explicit.

Checklist: before you enable exit nodes

  • Use dedicated exit nodes with patching and monitoring.
  • Lock down who can use them (groups only).
  • Validate performance and MTU path, especially for video calls and large downloads.
  • Plan for region selection and egress IP consistency if compliance cares.

Joke #2: The fastest way to “improve security” is to route everything through one box—right up until that box becomes your new hobby.

FAQ

Is Tailscale “just WireGuard”?

Under the hood, it uses WireGuard for the data plane. The difference is the control plane: identity, key distribution,
NAT traversal coordination, ACLs, and optional features like MagicDNS and Tailscale SSH. That’s the part that replaces your DIY glue scripts.

Why do I see “relay” in tailscale status?

It means the connection couldn’t be established directly (usually NAT/firewall/UDP issues), so traffic is going through DERP.
It’s still encrypted end-to-end, but latency and throughput may suffer.

Can I use Tailscale without installing it on every server?

Yes: use a subnet router. But understand the trade-off: you lose per-host identity and you create a routing dependency.
It’s fine for legacy networks and migrations. It’s not my first choice for greenfield servers.

What’s the difference between subnet routing and an exit node?

Subnet routing advertises specific internal prefixes (like 10.20.0.0/16) into the tailnet.
An exit node routes a client’s default route (0.0.0.0/0) through a tailnet node, affecting general internet traffic.

Should we allow personal devices on the tailnet?

If you do, treat it like a real policy decision. Require device approval and consider device posture checks.
Personal devices are not automatically evil, but they have a different patching and control reality than managed endpoints.

How do ACLs and host firewalls interact?

ACLs control what Tailscale allows at the overlay layer. Host firewalls control what the OS allows.
You want both. ACLs stop a connection from being established; host firewalls stop unexpected traffic even if ACLs are too permissive.

Is Tailscale SSH safe to use in production?

Yes, if you treat it like a privileged access system: tight policies, careful rollout, and break-glass access.
The main risk is operational—misconfiguration that locks you out—not cryptographic weakness.

What’s the most common “it’s down” false alarm?

DNS. Either MagicDNS isn’t resolving, split DNS isn’t applied, or another VPN/client hijacked resolver settings.
Test by connecting to the tailnet IP directly; if that works, it’s name resolution, not routing.

How do we offboard cleanly?

Remove the user from the IdP groups, disable their SSO, and also review and remove their authorized devices from the tailnet.
If you use auth keys for machines, rotate keys on schedule and ensure keys are scoped so they can’t register random devices.

Does using DERP mean Tailscale can read our traffic?

No. DERP relays encrypted packets; it doesn’t terminate the encryption.
Your performance may change, but your traffic content isn’t exposed by the relay mechanism.

Conclusion: next steps that don’t age badly

Tailscale earns its “VPN without pain” reputation when you treat it like a production system: explicit identity, explicit policy,
and explicit routing. Most failures aren’t mystical. They’re basic: wrong tailnet, wrong ACL, wrong route, DNS confusion, or a service binding to localhost.
The trick is to debug in the right order and to keep the blast radius small.

Practical next steps:

  • Adopt the fast diagnosis playbook and teach it to on-call.
  • Audit your ACLs for least privilege and remove “temporary” broad rules.
  • Inventory subnet routes and make acceptance explicit; shrink prefixes where possible.
  • If you use exit nodes, treat them as critical infra: dedicated hosts, monitoring, and strict access.
  • Run a quarterly hygiene sweep: devices, keys, tags, and DNS behavior across platforms.

Keep it boring. Boring is how you sleep.
