Debian 13: SSH keys rotated — revoke access cleanly and avoid key sprawl (case #13)

Key rotation looks easy until it isn’t. You flip a key, someone can’t deploy, another team’s emergency access breaks, and a “temporary” contractor key you forgot about keeps working—quietly—on a box nobody remembers owning.

This is the part of operations where you either run a controlled change or you get a slow-motion incident with a thousand paper cuts. Debian 13 doesn’t change the fundamentals of SSH, but it’s a good moment to stop treating keys like loose change in the couch.

What “key rotation” actually means in production

In a lab, rotating SSH keys is a neat two-step: add new key, remove old key. In production, it’s a multi-party contract you’re renegotiating under load.

You’re not just changing a file. You’re changing:

  • Identity: which private keys are considered valid proof of who you are.
  • Authorization: which public keys map to which accounts, and with what restrictions.
  • Reachability: which bastions, CI runners, and automation accounts can still reach what they need.
  • Evidence: whether you can later prove who had access at the time of an incident.

Key rotation done well has three properties:

  1. Reversible in minutes (during rollout), but irreversible after the grace window ends (old keys truly dead).
  2. Auditable: you can answer “who can log in where” without grepping the world by hand.
  3. Scoped: one person’s key compromise doesn’t turn into “rotate everything everywhere” unless it must.

Key rotation done poorly usually means you rotate the wrong thing (client keys vs host keys), you delete first and ask questions later, or you keep stacking keys forever because removal is scary.

Host keys vs user keys: rotate the right thing

User keys are the ones in ~/.ssh/authorized_keys that let people (or bots) in. Host keys are what your clients use to verify they’re talking to the right server.

Rotating host keys is a different procedure with different blast radius. If your problem statement is “ex-employee still has access” or “contractor key leaked,” you care about user keys and their authorization paths—not host keys.

One quote worth keeping in your head

Paraphrased idea (Richard Cook): “Success in complex systems often comes from people constantly adapting, not from the plan working as written.”

Key rotation is exactly that: you need a plan, but you win by adapting quickly when reality disagrees.

Facts & history: why SSH keys sprawl the way they do

Key sprawl isn’t a moral failing. It’s a natural outcome of incentives: shipping beats housekeeping, and SSH is frictionless when you let it be. A few facts and bits of context help frame what you’re fighting.

  1. SSH replaced rsh/telnet mostly on trust and encryption. The early win was “not in cleartext,” not “perfect identity governance.” The culture stuck.
  2. OpenSSH defaulted to convenience for decades: per-user authorized_keys is simple, so everyone used it, and fleets grew around it.
  3. Key types evolved: DSA keys were once common, were deprecated, and are no longer supported by current OpenSSH releases; RSA stayed everywhere; Ed25519 became the “modern default” because it’s small and fast.
  4. SSH agent forwarding was designed to reduce key distribution, but in practice it often became “remote box can now use my agent,” which is… a mood.
  5. Known_hosts is a UX compromise: trust-on-first-use was practical, but it also trained people to ignore scary warnings when host keys change.
  6. Authorized keys can contain options (command restrictions, source IP restrictions). Most orgs don’t use them, then reinvent the same controls poorly elsewhere.
  7. Short-lived credentials are historically hard in SSH without extra tooling; long-lived keys became the default because they “just work.”
  8. CI/CD made sprawl worse: robots need access, and teams often “temporarily” drop deploy keys onto boxes. Temporary is the longest time unit in operations.
  9. Central directories exist (LDAP/SSSD, etc.), but SSH authorization is often left local because it’s easy and outages are scary.

Sprawl happens because it’s convenient, it’s distributed, and it’s invisible—until it’s suddenly very visible.

Fast diagnosis playbook (what to check first)

This is the triage flow when someone says “we rotated keys and now SSH is broken” or “we revoked keys but access still works.” You want signal fast, not an essay.

First: is it authentication, authorization, or routing?

  1. Client-side quick test: does the client offer the expected key, and what does the server reply?
  2. Server logs: what does sshd think happened?
  3. Server config source: are you actually using authorized_keys files, or an AuthorizedKeysCommand, or both?

Second: find the controlling “truth source”

  • If authorization is local files, the truth is distributed: you must locate every authorized_keys and any includes/alternates.
  • If you use a command/SSSD/LDAP, the truth is centralized: verify it’s reachable, correct, and cached as expected.
  • If you use SSH certificates, the truth is the CA and its signing policy; revocation might be TTL-based or key-id based.

Third: confirm you actually revoked what you think you revoked

  • Old key removed from one place doesn’t help if the same key lives in ten other accounts or on a bastion.
  • Key comments lie. Fingerprints don’t. Always match by fingerprint.
  • Don’t forget automation: root, deploy users, Git mirror users, backup accounts, break-glass accounts.

When you’re stuck: use verbose client logs and server-side auth logs. Everything else is vibes.

Build a real inventory: accounts, keys, and paths

Before you revoke, you inventory. Not because paperwork is fun, but because revocation without inventory is how you delete the only working key to a storage node at 2 a.m.

Where keys hide on Debian systems

  • /home/*/.ssh/authorized_keys — the obvious one.
  • /root/.ssh/authorized_keys — the painful one.
  • Service accounts with custom homes: /var/lib/*, /srv/*.
  • Alternate locations via AuthorizedKeysFile (can be multiple paths).
  • Central key fetch via AuthorizedKeysCommand (keys might never touch disk per-user).
  • Configuration management drop points (e.g., managed fragments, generated files).
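The first few hiding places can be enumerated from the account database instead of guessing at directory names. The function below is a minimal sketch, assuming the default AuthorizedKeysFile pattern (.ssh/authorized_keys and .ssh/authorized_keys2 under each home); the name list_key_files and its optional root-prefix argument are illustrative, not part of any tool.

```shell
#!/bin/sh
# list_key_files [ROOT_PREFIX]
# Reads passwd-format lines on stdin and prints "user<TAB>path" for every
# key file that exists under that account's home directory.
# Assumption: default AuthorizedKeysFile locations only.
list_key_files() {
    prefix=${1:-}
    while IFS=: read -r user pw uid gid gecos home shell; do
        for name in authorized_keys authorized_keys2; do
            f="$prefix$home/.ssh/$name"
            if [ -f "$f" ]; then
                printf '%s\t%s\n' "$user" "$f"
            fi
        done
    done
}
```

On a live host this could be fed with getent passwd | list_key_files, run as root so unreadable homes are not skipped. The prefix argument exists so the same function can inspect a mounted image or chroot. It will not see keys served by AuthorizedKeysCommand or stored in non-default paths.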

Decide your revocation model

Pick one model per environment, not per team.

  • Local files, managed: keys live in authorized_keys, but they’re owned by a CM system (Ansible/Puppet/etc). Simple, but still distributed.
  • Centralized keys: AuthorizedKeysCommand pulls keys from a central store (directory, API, git-backed DB). Great auditability, more moving parts.
  • SSH certificates: users authenticate with short-lived certs signed by your CA; servers trust the CA. Minimal sprawl, best revocation story if you do TTL right.

If you’re already large enough to say “fleet,” certificates win. If you’re small but disciplined, managed local files are fine. If you’re medium and chaotic, centralized keys are a sanity upgrade—if you engineer it like production, not like a weekend project.

Revoke access cleanly: techniques that don’t create outages

Revocation should be boring. Boring is the goal. The trick is to stage changes so you can roll forward, not scramble backward.

Technique 1: Add-then-remove with a grace window

For human users: add new key, validate login, then remove old key after a deadline. For automation: you need dual-key support during rollout and explicit cutover time.

The grace window should be short. Days, not months. Long windows are how you end up supporting two worlds indefinitely.

Technique 2: Prefer revoking by fingerprint, not by “name”

People copy keys, rename comments, or paste the wrong public key. Always identify keys by fingerprint. Treat the comment as user-provided metadata—because it is.
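As a rough illustration of matching on key material rather than labels, the sketch below pulls the key type and base64 body out of authorized_keys lines, ignoring both the trailing comment and any leading options. The function name key_bodies is made up for this example; the key body stands in for the fingerprint here because it can be compared with plain text tools.

```shell
#!/bin/sh
# key_bodies FILE...
# Prints "type body" for each key line, skipping comments and blank lines.
# A field counts as the key type only if the next field looks like a
# base64 key blob (those start with "AAAA"), so options such as
# from="..." or command="..." in front of the key are stepped over.
key_bodies() {
    awk '
        /^[ \t]*(#|$)/ { next }
        {
            for (i = 1; i < NF; i++)
                if ($i ~ /^(ssh-|ecdsa-sha2-|sk-)/ && $(i + 1) ~ /^AAAA/) {
                    print $i, $(i + 1)
                    next
                }
        }
    ' "$@"
}
```

Piping the result through sort | uniq -c shows the same key under different comments as a single entry with a count, which is the point of ignoring the comment field.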

Technique 3: Use key options to scope blast radius

If you must keep long-lived keys (and many shops do), at least restrict them. Example restrictions include:

  • from="10.0.0.0/8" to limit source IPs
  • command="..." for forced commands (careful)
  • no-agent-forwarding, no-port-forwarding, no-pty

These aren’t perfect controls, but they turn “one leaked key equals full shell from anywhere” into “one leaked key equals limited access from specific places.” That’s a meaningful difference when you’re doing incident response.
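To make the shape of such an entry concrete, here is a small sketch that assembles a restricted authorized_keys line from the options listed above. The helper name restricted_entry, the network range, and the forced-command path are all placeholders, and the helper does no quoting of its arguments, so it is only suitable for values you control.

```shell
#!/bin/sh
# restricted_entry SOURCE_CIDR FORCED_COMMAND PUBLIC_KEY_LINE
# Prefixes a public key with from=, command=, and the no-* options so the
# key is usable only from the given network and only for that command.
# Assumption: none of the arguments contain double quotes.
restricted_entry() {
    printf 'from="%s",command="%s",no-agent-forwarding,no-port-forwarding,no-pty %s\n' \
        "$1" "$2" "$3"
}
```

A call such as restricted_entry '10.0.0.0/8' '/usr/local/bin/deploy-only' "$(cat ci-deploy.pub)" would produce one line ready to be placed in the deploy account's authorized_keys by whatever manages that file.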

Technique 4: Kill access at the account layer when needed

If a user is terminated or a key is confirmed compromised, waiting for key cleanup is too slow. Lock the account and remove keys.

Account lockouts are not a substitute for key hygiene, but they’re a fast circuit breaker.

Technique 5: Design a break-glass story on purpose

You need emergency access. But “everyone has a root key just in case” is not emergency access; it’s key sprawl with feelings.

Break-glass should mean: a small set of controlled keys/certs, stored securely, with logging, and exercised in drills. If you never test it, it will fail when you need it most.

Joke #1: SSH key sprawl is like a junk drawer: one day you’ll need it, and it will still disappoint you.

Avoid key sprawl: centralized control, options, and SSH certificates

Revocation is the cleanup. Avoiding sprawl is the prevention. Prevention is cheaper, faster, and less humiliating.

Centralize the decision, not necessarily the keys

Many teams misread “centralize” as “put all keys in one file.” That’s not the goal. The goal is centralizing authority: who can grant access, how it’s approved, and how it’s audited.

Good patterns:

  • Configuration management generates authorized_keys from a central inventory and enforces it.
  • AuthorizedKeysCommand fetches keys from a directory service and returns only current, approved keys.
  • SSH certificates remove the need to distribute user public keys to every server.

SSH certificates: the adult option for fleets

SSH user certificates (OpenSSH) let servers trust a CA, not individual keys. Users keep private keys; the CA signs a short-lived cert that says “this key is valid for user X for N hours, with these principals.”

What changes operationally:

  • Rotation is mostly CA policy: issue short-lived certs, and “revocation” becomes “wait for TTL” plus disabling issuance.
  • Onboarding is faster: servers already trust the CA; new users don’t require touching every host.
  • Offboarding is cleaner: stop signing for the user and wait out the TTL. No hunt-the-key across the fleet.

Failure modes exist: CA availability, signing workflow bugs, time sync issues. But those are problems you can engineer around. Key sprawl is a social-technical mess that never ends.
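For readers who have not seen the signing step, the sketch below walks through it in a scratch directory using ssh-keygen. Everything specific in it is an illustrative assumption: the file names, the key ID alice-example, the single principal alice, and the 8-hour validity. A real CA key would live somewhere far better protected than a temp directory.

```shell
#!/bin/sh
# Sketch: issue a short-lived user certificate in a scratch directory.
workdir=$(mktemp -d)

if command -v ssh-keygen >/dev/null 2>&1; then
    # 1. CA key pair (illustrative; keep a real one offline or in an HSM).
    ssh-keygen -q -t ed25519 -N '' -C 'example-user-ca' -f "$workdir/user_ca"
    # 2. The user's own key pair; the private half stays with the user.
    ssh-keygen -q -t ed25519 -N '' -C 'alice@laptop' -f "$workdir/id_alice"
    # 3. Sign the public key: -I key ID (shows up in logs), -n principals,
    #    -V validity window. Writes id_alice-cert.pub next to the key.
    ssh-keygen -q -s "$workdir/user_ca" -I 'alice-example' -n alice -V +8h \
        "$workdir/id_alice.pub"
    # 4. Inspect the certificate: type, key ID, principals, validity.
    cert_info=$(ssh-keygen -L -f "$workdir/id_alice-cert.pub")
else
    cert_info='ssh-keygen not installed'
fi
printf '%s\n' "$cert_info"
```

On the server side, the matching piece is a TrustedUserCAKeys line in sshd_config pointing at the CA's public key. After that, no per-user key needs to be copied to the host.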

Make bastions boring and strict

A bastion is where your access policy becomes enforceable. If you allow direct SSH from laptops to everything, you’ve chosen chaos.

Strong defaults for bastions:

  • No agent forwarding unless justified and constrained.
  • Strong logging (session recording if your risk profile requires it).
  • Network controls: servers only accept SSH from bastions/VPN ranges.
  • Short-lived access where possible.

Practical tasks with commands, outputs, and decisions (12+)

These are field tasks you can actually run on Debian 13. Each includes: a command, what the output means, and what decision you make next. Use them during rotation, incident response, or audits.

Task 1: Confirm what sshd is really using for authorized keys

cr0x@server:~$ sudo sshd -T | grep -Ei 'authorizedkeys(file|command)|pubkeyauthentication|passwordauthentication'
pubkeyauthentication yes
passwordauthentication no
authorizedkeysfile .ssh/authorized_keys .ssh/authorized_keys2
authorizedkeyscommand none

Meaning: This host uses per-user key files; no centralized key command. Password auth is off (good).

Decision: Revocation must touch files on disk (or be moved to a managed/central model). Inventory authorized_keys paths for each account.

Task 2: Verify sshd is running the config you think it is

cr0x@server:~$ systemctl status ssh
● ssh.service - OpenBSD Secure Shell server
     Loaded: loaded (/lib/systemd/system/ssh.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-12-29 11:02:14 UTC; 3h 18min ago
       Docs: man:sshd(8)
             man:sshd_config(5)
   Main PID: 1023 (sshd)
      Tasks: 1 (limit: 18956)
     Memory: 5.7M
        CPU: 1.421s
     CGroup: /system.slice/ssh.service
             └─1023 "sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups"

Meaning: sshd is active. The command line shows the daemon is the expected binary.

Decision: If changes don’t take effect, you likely changed the wrong file, have include order issues, or forgot to reload.

Task 3: Validate config includes and catch foot-guns before reload

cr0x@server:~$ sudo sshd -t
cr0x@server:~$ echo $?
0

Meaning: Exit code 0 means config syntax is valid.

Decision: Proceed with systemctl reload ssh. If non-zero, don’t reload—fix the config first or you’ll lock yourself out on restart.

Task 4: Show who can log in (local accounts) and spot surprises

cr0x@server:~$ getent passwd | awk -F: '($3>=1000 && $1!="nobody"){print $1":"$6":"$7}' | head
alice:/home/alice:/bin/bash
bob:/home/bob:/bin/bash
deploy:/srv/deploy:/bin/bash
backup:/var/lib/backup:/usr/sbin/nologin
ci-runner:/var/lib/ci-runner:/bin/bash

Meaning: You’ve got interactive users and service accounts. Some accounts have non-standard homes. The UID filter (1000 and up) leaves out root and system accounts, which can hold keys too.

Decision: Inventory keys not just in /home but in custom home directories. Also check whether “nologin” accounts have keys and whether they should.

Task 5: Find every authorized_keys file on the host (fast, noisy, effective)

cr0x@server:~$ sudo find / -xdev -type f \( -name authorized_keys -o -name authorized_keys2 \) 2>/dev/null | head
/root/.ssh/authorized_keys
/home/alice/.ssh/authorized_keys
/home/bob/.ssh/authorized_keys
/srv/deploy/.ssh/authorized_keys

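Meaning: These are the key gates on this system (given Task 1). If you see unexpected locations, that’s usually where “temporary” became permanent. Because -xdev keeps find on the root filesystem, repeat the search for any separately mounted /home, /srv, or /var.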

Decision: For rotation, you must update each of these. For prevention, you should enforce a single managed source of truth.

Task 6: Extract fingerprints from authorized_keys and dedupe across accounts

cr0x@server:~$ sudo ssh-keygen -lf /home/alice/.ssh/authorized_keys
256 SHA256:6Oq2j0J3j7cD4Yp8v0Fh7u2i9lXkYxJr3tZx0mQy7xM alice@laptop (ED25519)

Meaning: You now have a stable fingerprint for the key. That’s the identity you rotate/revoke against.

Decision: Build a list of fingerprints to remove. If the same fingerprint appears in multiple accounts, revocation must be coordinated or you’ll miss an access path.

Task 7: Confirm a suspect key is still accepted (server-side test with sshd logs)

cr0x@server:~$ sudo journalctl -u ssh -n 30 --no-pager
Dec 30 09:10:42 server sshd[22144]: Accepted publickey for alice from 10.30.4.18 port 51222 ssh2: ED25519 SHA256:6Oq2j0J3j7cD4Yp8v0Fh7u2i9lXkYxJr3tZx0mQy7xM
Dec 30 09:11:03 server sshd[22190]: Failed publickey for bob from 10.30.4.18 port 51301 ssh2: ED25519 SHA256:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Dec 30 09:11:03 server sshd[22190]: Connection closed by authenticating user bob 10.30.4.18 port 51301 [preauth]

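Meaning: The server logs show which fingerprint was accepted or rejected. At the default LogLevel INFO, sshd may not log every rejected public key attempt; raise it to VERBOSE if you need each rejected fingerprint recorded.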

Decision: If the “revoked” fingerprint is still being accepted, it exists somewhere in the auth path (another account, another file, centralized keys, or a forced command wrapper).

Task 8: Check whether an account is locked (fast offboarding circuit breaker)

cr0x@server:~$ sudo passwd -S alice
alice P 2025-12-01 0 99999 7 -1

Meaning: Status P indicates the account has a usable password hash. This doesn’t directly indicate SSH key access, but it tells you whether password login could work if enabled.

Decision: For immediate offboarding: lock the account (L) and remove keys. Also confirm PasswordAuthentication is disabled (Task 1).

Task 9: Lock an account and verify it’s locked

cr0x@server:~$ sudo usermod -L alice
cr0x@server:~$ sudo passwd -S alice
alice L 2025-12-01 0 99999 7 -1

Meaning: L means the password is locked. Key-based SSH can still work unless you also control it via sshd options or key removal.

Decision: Don’t stop here. Remove the key material or disable the account for SSH using DenyUsers/Match User rules if you need immediate SSH cut-off.

Task 10: Enforce “no SSH for this user” using a Match block (surgical, reversible)

cr0x@server:~$ sudo sh -c 'printf "\nMatch User alice\n  PasswordAuthentication no\n  PubkeyAuthentication no\n  KbdInteractiveAuthentication no\n" >> /etc/ssh/sshd_config.d/99-disable-alice.conf'
cr0x@server:~$ sudo sshd -t
cr0x@server:~$ sudo systemctl reload ssh
cr0x@server:~$ sudo sshd -T -C user=alice,host=server,addr=10.30.4.18 | grep -Ei 'pubkeyauthentication|passwordauthentication|kbdinteractiveauthentication'
pubkeyauthentication no
passwordauthentication no
kbdinteractiveauthentication no

Meaning: The effective config for that user has pubkey auth disabled. This blocks SSH even if keys still exist.

Decision: Use this as an emergency block while you hunt down distributed keys. A reload only affects new connections, so also terminate any sessions the user already has open. Remove the Match rule once cleanup is complete.

Task 11: Confirm file permissions won’t silently break key auth

cr0x@server:~$ sudo namei -l /home/alice/.ssh/authorized_keys
f: /home/alice/.ssh/authorized_keys
drwxr-xr-x root  root  /
drwxr-xr-x root  root  home
drwxr-xr-x alice alice alice
drwx------ alice alice .ssh
-rw------- alice alice authorized_keys

Meaning: This passes the usual strict permission expectations. With StrictModes yes (the default), sshd ignores the key file if the home directory, ~/.ssh, or the file itself is writable by group or others.

Decision: Fix permissions before blaming keys. “Key rotation failed” is often actually “sshd refused to read the file.”

Task 12: Test client behavior with verbose SSH (the truth serum)

cr0x@server:~$ ssh -vvv -i ~/.ssh/id_ed25519 alice@server
debug1: Offering public key: /home/cr0x/.ssh/id_ed25519 ED25519 SHA256:6Oq2j0J3j7cD4Yp8v0Fh7u2i9lXkYxJr3tZx0mQy7xM explicit
debug1: Server accepts key: /home/cr0x/.ssh/id_ed25519 ED25519 SHA256:6Oq2j0J3j7cD4Yp8v0Fh7u2i9lXkYxJr3tZx0mQy7xM explicit
debug1: Authentication succeeded (publickey).
Welcome to Debian GNU/Linux

Meaning: You can see which key was offered, whether the server accepted it, and whether auth succeeded.

Decision: If the wrong key is being offered, fix the client (agent, config, identity files). If the server accepts an old key, it’s still authorized somewhere.

Task 13: Remove a specific key safely by matching the key body, not the comment

cr0x@server:~$ sudo grep -nF 'AAAAC3NzaC1lZDI1NTE5AAAAIFakeKeyBodyGoesHereButLooksRealToHumans' /home/alice/.ssh/authorized_keys
3:ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFakeKeyBodyGoesHereButLooksRealToHumans old-laptop
cr0x@server:~$ sudo sed -i '3d' /home/alice/.ssh/authorized_keys
cr0x@server:~$ sudo tail -n +1 /home/alice/.ssh/authorized_keys
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINewKeyBodyLooksDifferent new-laptop

Meaning: You found the exact line by key body and removed it by line number.

Decision: This is safer than searching by the comment. After removal, test login and check logs for unexpected fallbacks.

Task 14: Detect duplicate keys across the host (cheap sprawl detector)

cr0x@server:~$ sudo find /home /root /srv -xdev -name authorized_keys -type f -print0 2>/dev/null | xargs -0 awk '{print $1" "$2}' | sort | uniq -c | sort -nr | head
      4 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDuplicateKeyBodyExample
      2 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQAnotherDupExample

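Meaning: The same public key appears multiple times across accounts/paths. The awk assumes each line starts with the key type, so entries that begin with key options are miscounted, and the search only covers /home, /root, and /srv.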

Decision: Duplicate keys are almost always a governance problem. Decide whether this key is a shared bot key (bad but common) or accidental reuse (worse). Either way, plan to split identities.

Task 15: Confirm sshd will reject passwords and keyboard-interactive

cr0x@server:~$ sudo sshd -T | grep -Ei 'passwordauthentication|kbdinteractiveauthentication|usepam'
passwordauthentication no
kbdinteractiveauthentication no
usepam yes

Meaning: Password and keyboard-interactive are disabled, but PAM is still used (often for session setup).

Decision: Good baseline for key-based auth. If you enable any interactive auth methods during troubleshooting, put a reminder to disable them again—temporary exceptions fossilize fast.

Task 16: Spot and stop agent forwarding on servers where it shouldn’t exist

cr0x@server:~$ sudo sshd -T | grep -Ei 'allowagentforwarding|allowtcpforwarding|permittty'
allowagentforwarding yes
allowtcpforwarding yes
permittty yes

Meaning: Agent forwarding and TCP forwarding are allowed by default here.

Decision: On production servers, especially those that can reach other sensitive systems, consider disabling forwarding or limiting it with Match rules. Forwarding expands the blast radius of a compromised server session.

Joke #2: If you don’t rotate SSH keys, they don’t “age like wine.” They age like milk in a server room.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

The company had a policy: “All SSH keys are managed by configuration management.” Everyone believed it because it had been true for the main application fleet. The SREs rotated a compromised developer key and removed it from the CM repo. They declared victory, wrote the post-incident note, and moved on.

A week later, an auditor asked a simple question: “Can that key still reach anything?” Someone ran an SSH attempt from a quarantined machine (safely, with approvals) and got in—on a database reporting box. Not the main DB. A “temporary” reporting host spun up for a quarter-end project years ago, never properly folded into CM.

The wrong assumption wasn’t technical; it was organizational. They assumed their tooling coverage matched their mental model. It didn’t. The key lived in /root/.ssh/authorized_keys on a host nobody patched regularly because it “wasn’t production.” It still had network reach to read replicas. That’s production enough.

The fix wasn’t heroic. They built a job that enumerated authorized key footprints across all reachable hosts and compared fingerprints to an approved list. The first run was embarrassing. The second run was actionable. The third run was routine.

Takeaway: if you can’t measure key placement, you don’t have key management. You have hope.
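A single-host version of that comparison job might look roughly like the sketch below. It is not the tooling from the story. It compares key bodies rather than fingerprints so that it needs nothing beyond awk, and both the function name unapproved_keys and the approved-list format (one "type body" pair per line) are assumptions made for the example.

```shell
#!/bin/sh
# unapproved_keys APPROVED_LIST KEYFILE...
# Prints "file: type body" for every key that is present in a key file
# but absent from the approved list.
unapproved_keys() {
    approved=$1
    shift
    awk -v approved="$approved" '
        # Load the approved "type body" pairs into a lookup table.
        BEGIN { while ((getline line < approved) > 0) ok[line] = 1 }
        /^[ \t]*(#|$)/ { next }
        {
            # Step over any leading key options to find the key itself.
            for (i = 1; i < NF; i++)
                if ($i ~ /^(ssh-|ecdsa-sha2-|sk-)/ && $(i + 1) ~ /^AAAA/) {
                    k = $i " " $(i + 1)
                    if (!(k in ok)) print FILENAME ": " k
                    next
                }
        }
    ' "$@"
}
```

Run across every key file found on a host, an empty result means the host matches the approved list. Anything printed is either an inventory gap or a key that should not be there.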

Mini-story 2: The optimization that backfired

A platform team decided to centralize SSH keys quickly. They built an AuthorizedKeysCommand that queried an internal HTTP service: give it a username, get back keys. Great idea, shipped fast, applauded in a meeting.

Then came the first real outage. The key service had a partial failure: the API was up, but latency spiked due to a downstream database issue. SSH logins didn’t “error.” They just hung during authentication. Engineers couldn’t get in to fix the very incident causing the latency. Meanwhile, automation that relied on SSH started failing slowly and unpredictably.

The team had optimized for “single source of truth” and forgot to engineer for failure. No caching, no fallback, no circuit breaker, no local emergency path. They had moved the problem from “distributed keys” to “central dependency for every login.”

They recovered by temporarily restoring local authorized_keys for on-call and break-glass accounts, then reworking the key service: caching on hosts, strict timeouts, and a design that fails closed for normal users but fails open for break-glass under controlled conditions (and with alarms).

Takeaway: centralization is not a free lunch. It’s trading a mess you can grep for a service you must run like it matters—because it does.

Mini-story 3: The boring but correct practice that saved the day

A finance-adjacent company ran a small fleet but had one habit that looked painfully conservative: every key rotation had a runbook and a “two-key overlap” window. They also maintained a tiny break-glass process with quarterly tests. People complained it was slow. People always complain about things that work.

One afternoon, a laptop with SSH keys was stolen from an engineer’s car. The engineer reported it immediately. Security declared a credential compromise and asked for proof that access was revoked across production.

The on-call followed the runbook: identify fingerprints from the engineer’s known public keys, block the user via a Match file on bastions within minutes, and then push a CM update removing the fingerprints from all production hosts. The overlap window wasn’t needed; this was a hard revoke. They validated revocation by attempting auth from a controlled environment and checking sshd logs for rejected fingerprints.

The audit evidence wrote itself: change tickets, CM diff, deployment logs, and system logs showing rejected attempts. The incident stayed small because the practice was boring, documented, and rehearsed.

Takeaway: the most effective security controls are usually the ones your team can execute under stress without improvising.

Common mistakes: symptom → root cause → fix

These are the recurring failure modes that make key rotation feel cursed. Most are predictable. That’s good news.

1) Symptom: “We removed the key, but it still logs in”

Root cause: The key exists in another account’s authorized_keys, in another path defined by AuthorizedKeysFile, or is being fetched by AuthorizedKeysCommand. Sometimes the same private key is being used from another machine through agent forwarding.

Fix: Identify by fingerprint in sshd logs, then search the fleet for that key body/fingerprint. Also check bastions and shared accounts. Disable agent forwarding where not required.

2) Symptom: “New key doesn’t work on one server, works everywhere else”

Root cause: Permissions/ownership on ~/.ssh or authorized_keys are too open, or the user’s home directory is not accessible with correct perms.

Fix: Use namei -l to verify each path component. Fix ownership and mode. Confirm sshd effective config and StrictModes behavior.

3) Symptom: “CI deploys started failing right after rotation”

Root cause: Automation used a shared deploy key that wasn’t rotated, or it was pinned in a secrets store and not updated everywhere. Sometimes the CI runner uses an SSH agent with multiple identities and now offers the wrong key first.

Fix: Inspect CI job logs for the offered fingerprint. Pin the exact key with IdentitiesOnly yes and explicit IdentityFile on the runner. Rotate bot keys separately with a staged cutover.

4) Symptom: “After centralizing keys, SSH logins randomly hang”

Root cause: AuthorizedKeysCommand depends on a slow service. Authentication path is now network-coupled and latency-sensitive.

Fix: Add caching, timeouts, and observability. Ensure the command fails fast. Consider local cached snapshots of keys with periodic refresh.
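One way to picture "fail fast with a cached snapshot" is a thin wrapper around the real lookup. The sketch below is an assumption-heavy illustration: FETCH_CMD stands for a hypothetical program that prints authorized_keys lines for a user, and the cache directory and 2-second timeout are arbitrary defaults. It has no cache expiry, which a production version would need so that a revoked key cannot live on in a stale cache indefinitely.

```shell
#!/bin/sh
# Sketch of an AuthorizedKeysCommand wrapper with a hard timeout and cache.
: "${FETCH_CMD:=/usr/local/bin/fetch-ssh-keys}"   # hypothetical fetcher
: "${CACHE_DIR:=/var/cache/ssh-keys}"             # illustrative location
: "${FETCH_TIMEOUT:=2}"                           # seconds

keys_for_user() {
    user=$1
    # Refuse names that could escape the cache directory.
    case $user in ''|*/*|.*) return 1 ;; esac
    cache="$CACHE_DIR/$user"
    tmp=$(mktemp "$CACHE_DIR/.fetch.XXXXXX") || return 1
    if timeout "$FETCH_TIMEOUT" "$FETCH_CMD" "$user" >"$tmp" 2>/dev/null; then
        # A successful fetch always replaces the cache, even when empty,
        # so a central revocation reaches this host on the next login.
        mv "$tmp" "$cache"
    else
        # Slow or failing service: drop the partial result, keep old cache.
        rm -f "$tmp"
    fi
    if [ -f "$cache" ]; then cat "$cache"; fi
}
```

Installed as a script that ends with keys_for_user "$1", this would be referenced from sshd_config through AuthorizedKeysCommand with %u as the argument, plus an AuthorizedKeysCommandUser that is allowed to write the cache directory.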

5) Symptom: “People keep appending keys and never remove them”

Root cause: No owner, no process, and no visibility. Removal feels risky because no one knows what will break.

Fix: Enforce management via CM or central store. Add reporting (keys per account, age, last-seen usage via logs). Require an expiry or review cadence.

6) Symptom: “Host key warnings popped up during the change”

Root cause: You rotated host keys (or rebuilt servers) without a distribution plan for known_hosts, or you have DNS/IP reuse causing mismatches.

Fix: Separate host key management from user key rotation. Use stable host key distribution mechanisms or carefully managed known_hosts updates in automation.

Checklists / step-by-step plan

This is the plan that works when you’re tired, busy, and surrounded by systems that don’t care about your feelings.

Checklist A: Controlled rotation (planned change)

  1. Define scope: which users, which hosts, which automation accounts.
  2. Extract fingerprints of old keys to be retired (store them in the change ticket).
  3. Inventory authorization paths: sshd -T, find authorized_keys files, identify any AuthorizedKeysCommand.
  4. Stage new keys: add new keys everywhere required; don’t remove old keys yet.
  5. Validate access: use ssh -vvv for representative clients; check journalctl -u ssh for accepted fingerprints.
  6. Set the grace window: communicate deadline; keep it short.
  7. Cutover: remove old fingerprints; reload sshd if config changed.
  8. Prove revocation: attempt login with old key from a controlled environment; verify logs show rejection.
  9. Clean up: remove temporary Match blocks, tickets, and stale key files. Update inventory.

Checklist B: Emergency revocation (suspected compromise)

  1. Identify fingerprints of suspected keys (from employee device inventory, logs, or key stores).
  2. Block quickly at chokepoints: bastions first. Use Match User or DenyUsers as an immediate stopgap.
  3. Disable issuance if using certificates; otherwise remove keys from central store/CM.
  4. Hunt for duplicates: shared deploy keys and copied keys are common. Don’t assume one location.
  5. Validate with logs: verify rejected attempts for the fingerprint.
  6. Post-incident cleanup: rotate bot keys, reduce forwarding, constrain key options, and fix the inventory gap that allowed sprawl.

Checklist C: Anti-sprawl baseline for Debian 13 sshd

  • Password authentication off unless you have a strong reason.
  • Centralize authorization decisions (CM, central key command, or certs).
  • Restrict network ingress to SSH (bastion/VPN only).
  • Disable agent/tcp forwarding by default; enable per role via Match.
  • Log enough to audit (and ship logs off-host).
  • Break-glass access exists, is minimal, and is tested.

FAQ

1) Do I need to rotate SSH keys regularly, even without a breach?

Yes, but not blindly. Rotate on role changes, offboarding, device loss, and policy-based intervals for high-risk access. If you can move to short-lived SSH certificates, rotation becomes mostly automatic.

2) What’s the cleanest way to revoke access for one user right now?

Block SSH for that user with a Match User rule on bastions (or directly on the host if needed), then remove their keys from the real authorization source. Validate by logs and a controlled login attempt.

3) If I lock the Linux account, does that block SSH key login?

Not reliably. Account lock affects password authentication and sometimes PAM behavior, but public key auth can still succeed depending on configuration. Treat account lock as a circuit breaker, not the whole solution.

4) How do I know which key was used in a login?

Check sshd logs: successful publickey authentication lines include the key type and fingerprint. On Debian, journalctl -u ssh is usually the fastest route.
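As a sketch of turning those log lines into something countable, the filter below reduces each "Accepted publickey" line to user, source address, and fingerprint. The function name accepted_keys is invented for this example. For certificate logins it reports the first fingerprint on the line, which is the user's key rather than the CA's.

```shell
#!/bin/sh
# accepted_keys
# Reads sshd log lines on stdin and prints "user source fingerprint"
# for each successful public key login.
accepted_keys() {
    awk '/Accepted publickey for / {
        user = ""; src = ""; fp = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "for"  && user == "") user = $(i + 1)
            if ($i == "from" && src  == "") src  = $(i + 1)
            if ($i ~ /^SHA256:/ && fp == "") fp = $i
        }
        print user, src, fp
    }'
}
```

Fed from sudo journalctl -u ssh --since "90 days ago" --no-pager and followed by sort | uniq -c, this gives a per-key usage count. Fingerprints that are authorized but never appear in that list are candidates for removal.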

5) We have hundreds of servers. Is grepping authorized_keys everywhere the only option?

It’s the “works today” option, not the scalable one. At fleet scale, use configuration management enforcement, a centralized authorized keys command with caching, or—best—SSH user certificates.

6) Are SSH certificates complicated to run?

They’re operationally different, not inherently harder. You trade key distribution for CA operations: secure signing, policy, time sync, and issuance workflows. If you already run production PKI or identity systems, it’s a natural fit.

7) Should we allow agent forwarding?

Default to no. Allow it only where it meaningfully reduces key distribution and where the target hosts are trusted and hardened. Agent forwarding turns a compromised server into a stepping stone.

8) What’s the fastest way to detect key sprawl on a single host?

Run sshd -T to confirm the auth mechanism, then enumerate authorized keys files, and count duplicate key bodies/fingerprints across them. Duplicates are your first red flag.

9) Why do we still see old keys in files after we “migrated to centralized keys”?

Because migrations are often partial. sshd consults AuthorizedKeysFile as well as AuthorizedKeysCommand, so legacy files that still exist are still honored. Decide on one path, then enforce it: remove the legacy files, or set AuthorizedKeysFile none so they are ignored.

10) How do we prove to auditors that a key was revoked?

Store the fingerprint in the change record, show the diff/removal from the authorization source, and capture sshd logs showing rejection/absence of acceptance after the cutover.

Conclusion: next steps that stick

If you take one lesson from case #13, make it this: key rotation is a governance problem wearing a crypto hat. Debian 13 gives you the same sharp tools; your job is to wield them consistently.

Do this next, in order:

  1. Pick your authorization model (managed local files, centralized keys, or certificates) and stop mixing casually.
  2. Build a fingerprint-based inventory so you can answer “who has access” without guessing.
  3. Practice emergency revocation using bastion chokepoints and tested Match rules.
  4. Shorten credential lifetime wherever you can—cert TTLs, tighter key options, fewer shared keys.
  5. Measure sprawl: number of keys per account, duplicates, and keys not seen in logs for months.

Make key hygiene boring. Your future self, on-call at an unreasonable hour, will be grateful—and will still complain, but with fewer outages.
