You rotated SSH keys. Everyone nodded. Tickets closed. Then a contractor who “definitely lost access” still logs in from a coffee shop IP you’ve never seen before.
This is how key rotation fails in the real world: not at the cryptography layer, but at the human layer—copy/paste, drift, forgotten bastions, and that one
legacy box under a desk that “can’t be rebooted.”
Debian 13 makes it easy to run a sane OpenSSH stack. The hard part is revoking access cleanly and proving it. This case file is a production playbook:
how to inventory where keys live, rotate without downtime, revoke precisely, and stop key sprawl from re-growing like weeds in a neglected parking lot.
Decide what you mean by “revoked”
“We rotated SSH keys” is a sentence that can mean three different things, and only one of them is operationally satisfying.
- Key replacement: you issued new keys and told people to use them. Old keys may still work. This is not revocation; it’s a polite suggestion.
- Key removal: you removed old public keys from authorized_keys (and anywhere else they could authenticate). This is revocation,
but only for paths you remembered.
- Access invalidation: you removed/blocked keys, killed surviving sessions where appropriate, audited alternate paths (bastions, shared accounts,
agent forwarding, jump hosts), and you can prove old keys no longer authenticate. This is what you want.
In Debian 13, OpenSSH is capable, boring, and strict if you let it be. Use that. Your enemy is not the math. Your enemy is
key sprawl: the gradual, invisible replication of keys into random places because the fastest way to “fix access” is to append another line
to another authorized_keys.
A useful mental model: treat SSH keys as credentials with lifecycle, not as “files developers keep in their home directory.”
Lifecycle means inventory, issuance, rotation, expiration, revocation, and review. If any one of those is missing, you don’t have a system; you have folklore.
One quote I keep on a sticky note, because it’s the job:
Hope is not a strategy.
— Gen. Gordon R. Sullivan
Joke #1: SSH keys don’t “expire” on their own. They’re like avocados: if you don’t watch them, they’re either unripe forever or suddenly a security incident.
Interesting facts & history (why today’s mess exists)
A little context helps you diagnose why your fleet looks like it was managed by a committee of raccoons.
Here are concrete facts that tend to matter in post-rotation cleanup.
- SSH1 vs SSH2: SSH2 replaced SSH1 in the early 2000s because SSH1 had security weaknesses. “SSH” today practically means SSH2, but legacy
assumptions linger in old scripts and appliances.
- DSA keys were deprecated: ssh-dss (DSA) fell out of favor and is disabled by default in modern OpenSSH because of algorithm
constraints and policy. If you still see DSA keys in authorized_keys, you’re looking at archaeological layers.
- Ed25519 became the default recommendation: Ed25519 keys are compact, fast, and generally preferred over RSA in many environments. Rotation
projects often double as algorithm upgrades.
- authorized_keys was designed for simplicity: It’s just a file with lines. That simplicity is why key sprawl is so easy:
distribution tools are optional, copy/paste is always available.
- Certificates exist, but most orgs don’t use them: OpenSSH supports CA-signed user certificates, which drastically reduce sprawl, but they
require a small amount of thinking up front. Many teams postpone that thinking indefinitely.
- Revocation is not symmetric: With raw public keys, you revoke by removing them everywhere. With certificates, you can revoke by serial or
key ID using a revocation list, but you need the CA plumbing.
- Agent forwarding made “just hop through the bastion” popular: It also made credential boundaries fuzzy. When revocation fails, agent
forwarding is often in the cast.
- StrictModes exists for a reason: SSH refuses to use insecurely-permissioned key files because it’s trying to protect you from yourself.
When people “fix” permission errors by disabling checks, they usually create a wider blast radius.
- SSH logs are useful but not friendly: OpenSSH logs what happened, not how you feel about it. The fastest incident recoveries come from
teams that know exactly which log lines map to which authentication paths.
Fast diagnosis playbook
When someone says “old key still works” or “new key doesn’t work,” don’t start editing files blindly.
Find the bottleneck in three checks.
1) Identify where the authentication decision is made
Is the user authenticating directly to the target host, through a bastion, via a forced command wrapper, or via a configuration management account?
If you guess wrong here, you’ll “revoke” the wrong place and nothing changes.
2) Confirm which key is actually being offered
Client-side SSH will happily offer a small parade of keys unless you tell it otherwise. People often believe they’re testing the new key while their agent
offers an old one first and it succeeds.
3) Validate the server-side key source
On Debian 13, keys might come from ~/.ssh/authorized_keys, but also from AuthorizedKeysFile paths, or
AuthorizedKeysCommand (LDAP/HTTP/whatever), or from a shared account’s authorized_keys you forgot existed.
Only after these three are pinned down do you start removing lines, deploying config changes, and terminating sessions.
Practical tasks (commands, outputs, decisions)
Below are production tasks you can run on Debian 13 (or on your admin workstation) that move you from “we rotated keys” to “we revoked access and proved it.”
Each task includes: a command, what typical output means, and the decision you make next.
Task 1 — Verify OpenSSH server version (know what features you have)
cr0x@server:~$ /usr/sbin/sshd -V
OpenSSH_9.7p1 Debian-3, OpenSSL 3.3.1 4 Jun 2024
Meaning: You’re on a modern OpenSSH with good defaults and certificate support.
Decision: Prefer Ed25519 keys or OpenSSH user certificates; avoid clinging to legacy algorithms “for compatibility” without proof.
Task 2 — List active SSH listeners (avoid revoking on the wrong daemon)
cr0x@server:~$ ss -ltnp | grep ssh
LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=812,fd=3))
LISTEN 0 128 [::]:22 [::]:* users:(("sshd",pid=812,fd=4))
Meaning: sshd is listening on 22 on IPv4 and IPv6.
Decision: If you expected a bastion port (e.g., 2222) or only IPv4, reconcile that now. “Revoked on port 22” won’t help if users connect elsewhere.
Task 3 — Confirm sshd authentication sources (AuthorizedKeysFile / Command)
cr0x@server:~$ sudo sshd -T | grep -E 'authorizedkeys(file|command)|pubkeyauthentication|passwordauthentication|kbdinteractiveauthentication'
pubkeyauthentication yes
authorizedkeysfile .ssh/authorized_keys .ssh/authorized_keys2
authorizedkeyscommand none
passwordauthentication no
kbdinteractiveauthentication no
Meaning: Keys come from local files in users’ home directories; no external key command.
Decision: Revocation means editing those files (or replacing them via config management). If AuthorizedKeysCommand is set, fix the central source as well; sshd still falls back to the AuthorizedKeysFile paths, so a key must be gone from both.
Task 4 — Inventory local users with home directories (find where keys can exist)
cr0x@server:~$ getent passwd | awk -F: '$6 ~ /^\/home\// {print $1 "\t" $6}'
alice /home/alice
buildbot /home/buildbot
deploy /home/deploy
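Meaning: These users likely have ~/.ssh directories. The filter only matches homes under /home, so root and any service account living elsewhere (for example under /var/lib) are not listed.
Decision: Scope the key audit, and widen the filter if accounts live outside /home. Don’t forget service accounts like deploy: they are where “temporary” keys go to live forever.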
Task 5 — Find every authorized_keys on the host (including odd paths)
cr0x@server:~$ sudo find / -xdev -type f \( -name 'authorized_keys' -o -name 'authorized_keys2' \) 2>/dev/null
/home/alice/.ssh/authorized_keys
/home/deploy/.ssh/authorized_keys
/root/.ssh/authorized_keys
Meaning: Three key entry points exist locally.
Decision: If you revoke a contractor, check all of them. Root is frequently forgotten and frequently abused.
Task 6 — Count keys and spot obvious drift (comments matter)
cr0x@server:~$ sudo awk 'NF && $1 !~ /^#/ {print FILENAME ":" FNR ":" $0}' /home/*/.ssh/authorized_keys /root/.ssh/authorized_keys | head -n 6
/home/alice/.ssh/authorized_keys:1:ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAI... alice@laptop
/home/alice/.ssh/authorized_keys:2:ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQ... old-admin
/home/deploy/.ssh/authorized_keys:1:command="/usr/local/bin/deploy-wrap" ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAI... ci@runner
/root/.ssh/authorized_keys:1:ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAI... breakglass@vault
Meaning: You have mixed algorithms and at least one “old-admin” key. You also have a forced command key for deploy.
Decision: Treat any line without a useful comment as suspect and require re-issuance. For deploy keys, verify the forced command is doing what you think.
Task 7 — Fingerprint a known-bad public key (so you can search for it reliably)
cr0x@server:~$ ssh-keygen -lf /tmp/contractor_id_ed25519.pub
256 SHA256:Wm3xNq7nq0M9m+eOQmR0f0sY0p+9QH8Zq4xkYw1o9X8 contractor@mbp (ED25519)
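Meaning: This fingerprint is the stable identifier for the key when comments lie. It is what sshd writes to its logs; authorized_keys files store the base64 key blob instead, so search logs by fingerprint and files by blob.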
Decision: Use fingerprints as your canonical identifier in tickets and audits. Comments are hints, not identity.
Task 8 — Search for the fingerprint across authorized_keys (find sprawl)
cr0x@server:~$ sudo grep --line-number --fixed-strings "$(awk '{print $2}' /tmp/contractor_id_ed25519.pub)" /home/*/.ssh/authorized_keys /root/.ssh/authorized_keys
/home/alice/.ssh/authorized_keys:7:ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAI... contractor@mbp
/root/.ssh/authorized_keys:4:ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAI... contractor@mbp
Meaning: The key exists in multiple places (including root).
Decision: Revoke in all locations, and open a follow-up incident: “Why did a contractor key land in root’s authorized_keys?”
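Tasks 5 and 8 combine naturally into one search. The sketch below is a minimal version of that idea, assuming you have the contractor's .pub file at hand; the function name find_key_everywhere is hypothetical, and it only looks at files named authorized_keys or authorized_keys2, so custom AuthorizedKeysFile paths still need their own pass.

```shell
# find_key_everywhere PUBFILE ROOT
# Prints "file:line" for every authorized_keys or authorized_keys2 file under
# ROOT that contains the key blob from PUBFILE. Exit status 1 means no match.
find_key_everywhere() (
    blob=$(awk '{print $2; exit}' "$1")   # field 2 of a .pub file is the base64 key blob
    [ -n "$blob" ] || { echo "no key blob in $1" >&2; exit 2; }
    find "$2" -xdev -type f \( -name authorized_keys -o -name authorized_keys2 \) \
        -exec grep -nF -- "$blob" /dev/null {} + 2>/dev/null |
        cut -d: -f1,2 | grep .
)

# Example (run as root so every home directory is readable):
#   find_key_everywhere /tmp/contractor_id_ed25519.pub /
```

Matching on the blob rather than the comment means a renamed or uncommented copy of the key is still found.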
Task 9 — Remove the key safely (surgical edit, keep audit trail)
cr0x@server:~$ sudo cp -a /root/.ssh/authorized_keys /root/.ssh/authorized_keys.bak.$(date +%F)
cr0x@server:~$ sudo sed -i '\#contractor@mbp#d' /root/.ssh/authorized_keys
cr0x@server:~$ sudo ssh-keygen -lf /root/.ssh/authorized_keys | head -n 3
256 SHA256:... breakglass@vault (ED25519)
Meaning: You took a backup, removed the contractor line by comment match, and verified remaining entries parse.
Decision: If comments are unreliable, delete by matching the key blob substring instead. Always keep a dated backup for forensics and rollback.
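Deleting by key blob can be sketched like this. It is one possible approach, not the only safe one: the function name remove_key_blob is hypothetical, and it assumes the key to revoke is available as a .pub file.

```shell
# remove_key_blob PUBFILE AUTHORIZED_KEYS_FILE
# Backs up the file with a dated suffix, then drops every line that contains
# the key blob from PUBFILE, whatever the comment on that line says.
remove_key_blob() (
    blob=$(awk '{print $2; exit}' "$1")   # field 2 of a .pub file is the base64 key blob
    [ -n "$blob" ] || { echo "no key blob in $1" >&2; exit 2; }
    bak="$2.bak.$(date +%F)"
    cp -p -- "$2" "$bak" || exit 1        # -p keeps mode and timestamps on the backup
    # Write back into the existing file so owner and mode stay untouched.
    # grep exits 1 when every line was removed; that is not an error here.
    grep -vF -- "$blob" "$bak" > "$2" || [ $? -eq 1 ]
)

# Example:
#   sudo sh -c '. ./remove_key_blob.sh; remove_key_blob /tmp/contractor_id_ed25519.pub /root/.ssh/authorized_keys'
```

The example invocation assumes you saved the function in a file called remove_key_blob.sh; adapt it to however your team ships helper scripts.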
Task 10 — Reload sshd (don’t restart if you can avoid it)
cr0x@server:~$ sudo systemctl reload ssh
cr0x@server:~$ systemctl is-active ssh
active
Meaning: sshd re-read its configuration and stayed up. Edits to authorized_keys need no reload at all; sshd reads those files on every authentication attempt.
Decision: Reload only when sshd_config changed, and prefer reload over restart on busy bastions unless a change genuinely needs a restart (rare).
Task 11 — Prove the old key is rejected (server-side log evidence)
cr0x@server:~$ sudo journalctl -u ssh -S "10 minutes ago" | tail -n 6
Dec 31 09:12:41 server sshd[22418]: Connection from 203.0.113.44 port 51221 on 10.0.0.10 port 22 rdomain ""
Dec 31 09:12:41 server sshd[22418]: Failed publickey for alice from 203.0.113.44 port 51221 ssh2: ED25519 SHA256:Wm3xNq7nq0M9m+eOQmR0f0sY0p+9QH8Zq4xkYw1o9X8
Dec 31 09:12:43 server sshd[22418]: Connection closed by authenticating user alice 203.0.113.44 port 51221 [preauth]
Meaning: The exact fingerprint failed. That’s your proof.
Decision: Mark the key revoked for this host. Repeat on other hosts, or—better—automate fleet-wide searches so you don’t “trust sampling.”
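If you collect this evidence often, a small filter saves squinting at journal output. The sketch below assumes the log line shapes shown above ("Failed publickey for" and "Accepted publickey for" followed by the fingerprint); the function name auth_summary is hypothetical.

```shell
# auth_summary FINGERPRINT
# Reads sshd log lines on stdin and counts accepted and failed publickey
# attempts for one fingerprint. Exit status 1 means the key was accepted
# at least once in the input, which is the case you want to alert on.
auth_summary() {
    awk -v fp="$1" '
        index($0, fp) && / Accepted publickey for / { ok++ }
        index($0, fp) && / Failed publickey for /   { bad++ }
        END { printf "accepted=%d failed=%d\n", ok, bad; exit (ok > 0) }
    '
}

# Example:
#   sudo journalctl -u ssh -S "10 minutes ago" | auth_summary 'SHA256:...'
```

Replace the elided fingerprint with the one from your ticket. A result with failed attempts and zero accepted ones after the change is the proof you attach to the audit trail.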
Task 12 — Confirm which key the client is offering (stop testing the wrong thing)
cr0x@server:~$ ssh -vvv -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_new alice@server 2>&1 | grep -E 'Offering public key|Server accepts key|Authentications that can continue|Authentication succeeded'
debug1: Offering public key: /home/alice/.ssh/id_ed25519_new ED25519 SHA256:9aZkNf...
debug1: Authentications that can continue: publickey
debug1: Server accepts key: /home/alice/.ssh/id_ed25519_new ED25519 SHA256:9aZkNf...
debug1: Authentication succeeded (publickey).
Meaning: The new key is used and accepted.
Decision: If it still fails, the issue is likely server-side restrictions (command=, from=, principals, permissions) or you’re hitting a different host/path.
Task 13 — List active SSH sessions (revocation doesn’t kill existing shells)
cr0x@server:~$ who
alice pts/1 2025-12-31 09:03 (203.0.113.44)
deploy pts/2 2025-12-31 08:50 (10.0.2.15)
Meaning: Users have active sessions already authenticated.
Decision: If the key was compromised, you likely need to terminate sessions. If it’s routine rotation, you may let sessions drain to avoid disruption.
Task 14 — Terminate a compromised session (safely, with intent)
cr0x@server:~$ ps -ft pts/1
UID PID PPID C STIME TTY TIME CMD
alice 22391 22388 0 09:03 pts/1 00:00:00 -bash
cr0x@server:~$ sudo kill -HUP 22391
cr0x@server:~$ ps -p 22391
PID TTY TIME CMD
Meaning: The shell exited and the session is gone.
Decision: Use targeted kills, not “restart ssh.” Restarting the daemon doesn’t necessarily kill sessions and creates collateral damage.
Task 15 — Validate file permissions (avoid “it works on one host” nonsense)
cr0x@server:~$ sudo namei -l /home/alice/.ssh/authorized_keys
f: /home/alice/.ssh/authorized_keys
drwxr-xr-x root root /
drwxr-xr-x root root home
drwxr-x--- alice alice alice
drwx------ alice alice .ssh
-rw------- alice alice authorized_keys
Meaning: Permissions are strict enough; sshd should accept the file.
Decision: If you see group/world-writable directories or files, fix them instead of disabling StrictModes.
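To run the same check across many accounts without reading namei output by eye, something like the following works. It is a simplified sketch: it assumes the conventional ~/.ssh/authorized_keys layout and checks only write bits, not ownership, so it covers less than sshd's own StrictModes logic. The function name loose_perms is hypothetical.

```shell
# loose_perms FILE
# Prints FILE, its directory, or that directory's parent (the home directory in
# the usual layout) when any of them is writable by group or others.
# Exit status 1 means something was printed.
loose_perms() (
    dir=$(dirname -- "$1")
    home=$(dirname -- "$dir")
    # -prune stops find from descending; only the three named paths are tested.
    out=$(find "$1" "$dir" "$home" -prune \( -perm -020 -o -perm -002 \) -print)
    [ -z "$out" ] || { printf '%s\n' "$out"; exit 1; }
)

# Example:
#   sudo sh -c '. ./loose_perms.sh; loose_perms /home/alice/.ssh/authorized_keys'
```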
Task 16 — Detect key reuse across accounts (one key, many doors)
cr0x@server:~$ sudo awk 'NF && $1 !~ /^#/ {for (i = 1; i < NF; i++) if ($i ~ /^(ssh-|ecdsa-|sk-)/) {print $(i+1); break}}' /home/*/.ssh/authorized_keys /root/.ssh/authorized_keys | sort | uniq -c | sort -nr | head
2 AAAAC3NzaC1lZDI1NTE5AAAAI...
1 AAAAB3NzaC1yc2EAAAADAQABAAABAQ...
Meaning: One key blob appears multiple times across the system.
Decision: Ban key reuse across accounts. It destroys attribution and makes revocation harder than it needs to be.
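A count tells you reuse exists; for cleanup you also want to know which files share the key. The sketch below is one way to get that, with a hypothetical function name key_reuse. It finds the key blob by looking for the key-type field, a heuristic that copes with option prefixes such as command= but could be fooled by an option string that itself contains a word starting with ssh-.

```shell
# key_reuse FILE...
# Prints "<count> <blob> <files>" for every key blob that appears in more than
# one of the given authorized_keys files.
key_reuse() {
    awk '
        /^[ \t]*(#|$)/ { next }                       # skip comments and blank lines
        {
            for (i = 1; i < NF; i++)
                if ($i ~ /^(ssh-(ed25519|rsa|dss)|ecdsa-sha2-|sk-(ssh|ecdsa)-)/) {
                    blob = $(i + 1)                   # the blob follows the key type
                    if (!((blob, FILENAME) in seen)) {
                        seen[blob, FILENAME] = 1
                        count[blob]++
                        files[blob] = files[blob] " " FILENAME
                    }
                    break
                }
        }
        END { for (b in count) if (count[b] > 1) print count[b], b files[b] }
    ' "$@"
}

# Example:
#   sudo sh -c '. ./key_reuse.sh; key_reuse /home/*/.ssh/authorized_keys /root/.ssh/authorized_keys'
```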
Task 17 — Check for ssh-agent forwarding usage (it changes the threat model)
cr0x@server:~$ sudo grep -R -i --line-number 'ForwardAgent' /etc/ssh/ssh_config /etc/ssh/ssh_config.d 2>/dev/null
/etc/ssh/ssh_config.d/20-bastion.conf:3: ForwardAgent yes
Meaning: Agent forwarding is enabled for something (likely bastion hops).
Decision: If you’re revoking due to compromise, treat forwarded agent paths as suspect. Consider turning it off and using certificates or ProxyJump patterns.
Task 18 — Validate sshd config before deploying (prevent self-inflicted lockouts)
cr0x@server:~$ sudo sshd -t
cr0x@server:~$ echo $?
0
Meaning: Config syntax is valid.
Decision: Never reload sshd without sshd -t in automation. The only thing worse than key sprawl is locking out the on-call.
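In automation, the rule above reduces to a few lines. This is a sketch under assumptions: the function name safe_reload and the two override variables are hypothetical, and the defaults match the Debian unit name used in Task 10.

```shell
# Both settings can be overridden from the environment, which also makes the
# function easy to exercise on a machine without sshd installed.
SSHD_BIN=${SSHD_BIN:-/usr/sbin/sshd}
RELOAD_CMD=${RELOAD_CMD:-systemctl reload ssh}

# safe_reload: reload sshd only when the configuration passes "sshd -t".
safe_reload() {
    if "$SSHD_BIN" -t; then
        $RELOAD_CMD          # unquoted on purpose so the command and its arguments split
    else
        echo "sshd -t failed; not reloading" >&2
        return 1
    fi
}

# Example (as root):
#   safe_reload
```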
Revocation strategies that actually work
Strategy A: Raw public keys in authorized_keys (the default reality)
This is the world most teams live in: per-user authorized_keys files scattered across hosts.
Revocation here is blunt: remove the key everywhere it appears, including shared accounts and root (if you allow it).
The job is not “remove a line.” The job is “remove all doors that line opens.”
If you are doing raw keys, enforce these policies:
- One person → one keypair per device (laptop, workstation, CI runner). No shared keys. No “team key.”
- Every key line must be attributable: comment includes owner and purpose, and you track fingerprint in your IAM or ticketing system.
- Every key gets an owner and a review date. If you can’t name the owner, you can’t justify the risk.
- No manual edits on pets: if a host matters, keys are managed by configuration management (or a centralized key command).
The tricky part is drift: keys copied to places you didn’t plan, keys left behind after user offboarding, and keys
embedded in images/AMIs/containers. That’s why inventory must be repeatable.
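The attribution and review-date policies are easy to lint once the comment format is fixed. The sketch below assumes a hypothetical comment convention of owner=<name>;purpose=<text>;review=YYYY-MM-DD; nothing in OpenSSH defines or enforces it, and the function name key_lint is equally made up. Adapt the pattern to whatever format you actually adopt.

```shell
# key_lint TODAY FILE...
# TODAY is an ISO date (YYYY-MM-DD). Prints "file:line: reason" for key lines
# that lack an owner, lack a review date, or are past their review date.
# Exit status 1 means at least one finding.
key_lint() (
    today=$1; shift
    awk -v today="$today" '
        /^[ \t]*(#|$)/ { next }                       # skip comments and blank lines
        {
            comment = ""
            for (i = 1; i < NF; i++)
                if ($i ~ /^(ssh-(ed25519|rsa|dss)|ecdsa-sha2-|sk-(ssh|ecdsa)-)/) {
                    for (j = i + 2; j <= NF; j++) comment = comment " " $j
                    break
                }
            where = FILENAME ":" FNR ": "
            if (comment !~ /owner=[^;]+/) { print where "no owner"; bad = 1; next }
            if (!match(comment, /review=[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/)) {
                print where "no review date"; bad = 1; next
            }
            due = substr(comment, RSTART + 7, 10)     # the date after "review="
            if (due < today) { print where "review overdue since " due; bad = 1 }
        }
        END { exit bad }
    ' "$@"
)

# Example:
#   key_lint "$(date +%F)" /home/deploy/.ssh/authorized_keys
```

ISO dates compare correctly as plain strings, which is why the check needs no date arithmetic.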
Strategy B: Centralize keys with AuthorizedKeysCommand (good for large fleets)
With AuthorizedKeysCommand, sshd executes a command to fetch keys at login time. That command can query LDAP, a local cache,
a secrets manager, or a simple file generated from your source of truth.
Operational upside: Revocation is immediate and centralized. No more “did we update that one box?”.
Operational downside: You’ve introduced a dependency into your login path. If it’s slow, down, or misconfigured, nobody logs in.
If you go this route, treat it like a production service:
- Cache results locally with TTL; protect the cache permissions.
- Have an emergency static fallback path (break-glass key in root, with strict controls).
- Instrument and alert on command failures and latency.
- Rate-limit and harden input parsing. sshd will run your command a lot.
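The simplest form of such a command reads a per-user file from a local cache directory. The sketch below assumes a hypothetical script path, cache directory (/var/cache/ssh-keys), and function name keys_for; the process that fills the cache from your source of truth, and any TTL handling, are out of scope here. sshd requires the program to be owned by root and not writable by group or others, and AuthorizedKeysCommandUser must be set alongside it.

```shell
#!/bin/sh
# Hypothetical /usr/local/bin/ssh-keys-for, wired up in sshd_config with:
#   AuthorizedKeysCommand /usr/local/bin/ssh-keys-for %u
#   AuthorizedKeysCommandUser nobody
# Assumption: some other job writes one file per user into KEYS_DIR.
KEYS_DIR=${KEYS_DIR:-/var/cache/ssh-keys}

# keys_for USER: print the cached authorized_keys lines for USER, if any.
keys_for() {
    case $1 in
        # Reject empty or unexpected usernames before touching the filesystem.
        ''|*[!a-z0-9_-]*) return 1 ;;
    esac
    [ -r "$KEYS_DIR/$1" ] && cat -- "$KEYS_DIR/$1"
}

# When installed as the script above, finish with:
#   keys_for "$1"
```

The username filter is deliberately narrow (lowercase letters, digits, underscore, hyphen); widen it only if your account names need it.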
Strategy C: Move to OpenSSH user certificates (the anti-sprawl option)
Certificates are how you stop key sprawl instead of constantly mowing it.
Users still have a private key, but they authenticate with a short-lived certificate signed by your SSH CA.
Servers trust the CA, not every individual public key.
What changes:
- On servers: you deploy the CA public key once (or rarely). That’s the trust anchor.
- On clients: users obtain a cert (ideally short-lived) from a signing service after SSO/MFA.
- Revocation: you can revoke by disabling issuance, expiring certs quickly, and (if needed) using key revocation lists for specific certs.
Opinionated take: If you have more than a handful of servers, and you rotate keys more than once a year, certificates pay for themselves.
They turn revocation from “hunt lines” into “stop issuing; wait for TTL.”
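The mechanics fit in a handful of ssh-keygen calls. The sketch below runs the whole cycle in a scratch directory with a throwaway CA, so it is safe to try on a workstation; the function name cert_demo, the file names, and the 8 hour lifetime are illustrative choices, not recommendations. On servers, the CA public key and the revocation list are what the TrustedUserCAKeys and RevokedKeys directives in sshd_config point at.

```shell
# cert_demo DIR
# Creates a throwaway CA and user key in DIR, signs a short-lived certificate,
# revokes it by serial, and prints the revocation check result.
cert_demo() (
    cd "$1" || exit 2
    ssh-keygen -q -t ed25519 -N '' -C 'demo CA' -f ca          # ca.pub is what servers would trust
    ssh-keygen -q -t ed25519 -N '' -C 'alice@laptop' -f user   # the user's own keypair
    # Sign user.pub: key ID, principal (login name), 8 hour validity, serial 1.
    # The result is written to user-cert.pub.
    ssh-keygen -q -s ca -I alice@laptop -n alice -V +8h -z 1 user.pub
    ssh-keygen -L -f user-cert.pub | grep -E 'Key ID|Serial|Valid'
    # Build a key revocation list that revokes serial 1 issued by this CA.
    echo 'serial: 1' > revoke.spec
    ssh-keygen -k -f revoked.krl -s ca.pub revoke.spec
    # Query the list; a non-zero exit status means the certificate is revoked.
    ssh-keygen -Q -f revoked.krl user-cert.pub
)

# Example:
#   cert_demo "$(mktemp -d)"
```

In a real deployment the CA private key lives in a signing service rather than on disk next to user keys, and the short validity window does most of the revocation work for you.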
Debian 13 sshd hardening after rotation
Rotation projects are a gift: you’re touching sshd anyway, so fix the structural issues that made sprawl possible.
The goal is not “maximum paranoia.” The goal is predictable access control.
Make authentication paths explicit
If you rotated keys because of compromise or policy, consider enforcing that policy in sshd so you don’t regress silently.
For example, disable password and keyboard-interactive auth if you truly don’t use them; otherwise, attackers will simply take the other door.
You already saw in Task 3 how to verify.
Stop letting random accounts become remote entry points
The fastest way to end up in key sprawl hell is to allow lots of accounts to accept arbitrary keys.
Prefer fewer login accounts, and use sudo with tight policy for privilege. Even better, use per-person accounts and avoid shared accounts completely.
When you can’t avoid a shared account (CI runners, deploy users), fence it in:
- Force a command for CI keys (as in Task 6).
- Restrict source IPs with from="10.0.2.15" in authorized_keys.
- Disable PTY, port forwarding, and X11 forwarding for non-human access.
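Those three fences can be combined on a single authorized_keys line. The helper below is a small sketch that builds such a line from a .pub file; the function name restricted_line is hypothetical. It uses the restrict option, which turns off PTY allocation and port, agent, and X11 forwarding in one keyword, then adds the source address and forced command.

```shell
# restricted_line PUBFILE FROM COMMAND
# Prints an authorized_keys line for the key in PUBFILE with all forwarding and
# PTY allocation disabled, a source address limit, and a forced command.
restricted_line() {
    awk -v from="$2" -v cmd="$3" '
        NF >= 2 { printf "restrict,from=\"%s\",command=\"%s\" %s\n", from, cmd, $0; exit }
    ' "$1"
}

# Example, using the address and wrapper path from earlier in this playbook
# and a hypothetical ci_runner.pub:
#   restricted_line ci_runner.pub 10.0.2.15 /usr/local/bin/deploy-wrap
```

Generate the line in your config management templates rather than by hand, so the restrictions cannot quietly disappear during the next edit.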
Use Match blocks for bastions and special users
Don’t “just set it globally” and hope it works. Bastions are different. CI users are different. Root is different.
Debian’s OpenSSH supports granular Match rules. They are readable when you keep them short.
Break-glass is not optional; it’s controlled
You need a way in when your key service or config deploy breaks. But break-glass must be:
rarely used, heavily logged, and quickly rotated.
One carefully stored emergency key beats five random “just in case” keys scattered around.
Joke #2: A “temporary SSH key” is the most permanent thing in IT—right up until auditors show up and it suddenly becomes “a misunderstanding.”
Three corporate mini-stories (all real enough to hurt)
Incident caused by a wrong assumption: “We removed the key, so it can’t log in”
A mid-sized SaaS company rotated SSH keys after a laptop theft. The team did the obvious work: remove the compromised public key from
the engineer’s authorized_keys across production. They used config management, the changes deployed, and the incident ticket was closed.
Two days later, a suspicious login hit a database host. The logs clearly showed public key authentication with the old fingerprint.
Panic followed, as it does, and the first wave of responses were predictable: restart sshd, rotate keys again, and block the IP.
The login kept happening from new IPs.
The actual problem was small and humiliating: the database host didn’t use per-user keys at all. It was accessed through a legacy
shared account called dbmaint that lived outside config management because it was created “for a vendor, years ago.”
That account had its own authorized_keys file, and the compromised key was in there too.
The wrong assumption was that “removing the key from the user” equaled “removing the key from the system.”
Keys don’t care about org charts. They care about files and trust paths.
The fix was not heroic: inventory every authorized_keys on every host, remove the fingerprint fleet-wide, and then remove the
shared account entirely. The best part? After the cleanup, future revocations became boring, because they had a complete list of where keys could exist.
Optimization that backfired: “We’ll speed up onboarding by copying keys everywhere”
A large enterprise platform team had an onboarding problem: new engineers needed access to dozens of hosts, and approvals took time.
Someone proposed a “simple optimization”: a script that appends a new hire’s public key to every server’s /etc/skel template
and to a couple of common service accounts, so access would “just work.”
It did work, in the short term. Onboarding time improved. Fewer tickets. Everyone congratulated the script.
Then the first offboarding event happened for a senior engineer who left on bad terms.
They rotated her keys. They removed her key from the obvious places. She still had access.
The script had created sprawl faster than any human could. Keys were in user homes created from /etc/skel, in shared accounts,
and in a few random “ops helper” accounts people had created over the years. Nobody knew the full set. Nobody had intended to create a parallel IAM.
They had, anyway.
The backfire was operational cost: revocation now required a fleet-wide forensic search, and it had to be repeated for every departure.
They ended up replacing the script with a real join process: centralized key distribution, limited login accounts, and a move toward short-lived certificates.
The lesson: optimizing for onboarding speed by duplicating credentials is like optimizing for faster fires by storing gasoline in the hallways.
It’s great until you need to stop something quickly.
Boring but correct practice that saved the day: “We can prove who can log in”
A fintech infra team ran Debian hosts with a strict rule: no manual edits to authorized_keys on production. Keys were deployed
from a Git repo through config management. Every key line contained an owner, a purpose, and a review date in the comment.
They also recorded fingerprints in the access request ticket.
One afternoon, an alert fired: repeated failed SSH attempts followed by a successful login to a bastion. The successful login used a key fingerprint
that wasn’t in their repository. That’s not supposed to happen.
Here’s where boring discipline pays: they didn’t debate. They ran a fleet-wide search for that fingerprint. It appeared only on the bastion,
added manually within the past hour. The file timestamp and audit logs lined up with a human mistake: someone had pasted a vendor key during a support call.
They removed the key, reloaded sshd, killed the session, and then compared the bastion’s authorized_keys against the repo-managed version.
The delta was exactly one line. They restored the file from config management, opened a process incident, and tightened the support workflow.
No dramatic forensics. No week-long “key audit project.” Just fast containment because the system had a source of truth and a rule against ad-hoc changes.
Common mistakes (symptom → root cause → fix)
1) “Old key still works” after rotation
Symptom: A revoked key continues to authenticate on some hosts.
Root cause: The key exists in multiple accounts or multiple key sources (shared accounts, root, alternate AuthorizedKeysFile paths, bastions).
Fix: Fingerprint the key and search fleet-wide across all authorized_keys locations (and any centralized source). Remove everywhere. Verify in logs.
2) “New key fails, but it works on another server”
Symptom: Same user, same key, different result depending on the host.
Root cause: Permission issues (StrictModes), different sshd config, different user home path, or the key line has restrictions (from=, command=) that don’t match.
Fix: Run sshd -T on the failing host, verify permissions with namei -l, and check restrictions in the authorized_keys line.
3) “We removed the key but the session stayed active”
Symptom: User remains connected and can keep doing things.
Root cause: SSH key checks happen at authentication time; removing keys doesn’t retroactively unauthenticate existing sessions.
Fix: Identify sessions with who / ss / ps and terminate them selectively when needed.
4) “We locked ourselves out”
Symptom: After reload/restart, nobody can log in.
Root cause: Bad sshd config, wrong AuthorizedKeysFile path, permissions breakage, or overzealous auth method disabling without a tested key.
Fix: Always run sshd -t before reload, keep an emergency console path, and roll out changes progressively with verification.
5) “Contractor offboarded, but CI keeps failing”
Symptom: After key cleanup, automated jobs can’t deploy.
Root cause: CI used the contractor’s personal key (yes, it happens) or a shared account key got removed unintentionally.
Fix: CI must use its own dedicated key/cert with forced command and restrictions. Restore service account access, then re-architect.
6) “We revoked the key on hosts, but access still happens through bastion”
Symptom: Direct login fails but bastion hop succeeds.
Root cause: The key is still valid on the bastion, or agent forwarding allows a different key to be used on the final host.
Fix: Revoke on bastion first, review agent forwarding settings, and force explicit IdentityFile usage during incident response tests.
7) “Audit says there are 400 keys; we only have 60 staff”
Symptom: authorized_keys files are bloated with unknown entries.
Root cause: Historic accumulation: old employees, vendor access, one-off troubleshooting, copied images, and key reuse.
Fix: Establish a baseline: remove keys without owners, require re-requests, and move to a centralized model or certificates.
Checklists / step-by-step plan
Phase 0 — Decide the policy (one page, enforced)
- Allowed algorithms (prefer Ed25519; allow RSA only if required by clients you can name).
- Key ownership rules (no shared human keys; service accounts get dedicated keys).
- Required key comments format (owner + purpose + optional expiry/review date).
- Where keys are stored and managed (config management repo or AuthorizedKeysCommand source).
- Emergency access method (break-glass) and rotation cadence.
Phase 1 — Inventory (the part everyone wants to skip)
- Enumerate all login-capable accounts (humans and services).
- Enumerate all key sources: AuthorizedKeysFile paths, root keys, shared accounts, any central key command.
- Extract fingerprints into an inventory table (fingerprint → user → host → file → line comment).
- Flag unknown keys and duplicates (same key blob in multiple accounts).
Phase 2 — Rotate (introduce new keys without breaking access)
- Issue new keys/certs to users and services.
- Deploy new public keys alongside old keys temporarily (if not compromised).
- Validate with controlled tests: one user, one host, explicit -i identity, log confirmation.
- Roll out gradually: bastion first, then high-value targets, then the rest.
Phase 3 — Revoke (make old keys unusable and prove it)
- Remove old keys fleet-wide, including bastions and shared accounts.
- Reload sshd after validation.
- During compromise: terminate active sessions authenticated with revoked credentials where feasible.
- Collect evidence: log lines showing failed publickey attempts with the revoked fingerprint.
Phase 4 — Prevent regrowth (stop key sprawl at the source)
- Ban manual edits on production; enforce via config management and monitoring file integrity.
- Reduce the number of accounts allowed to accept SSH logins.
- Apply restrictions to service keys (forced commands, no forwarding, source IP limits).
- Adopt certificates or centralized key lookup if the fleet size warrants it.
- Schedule regular audits (monthly/quarterly) and tie them to offboarding.
FAQ
1) If we remove a key from authorized_keys, is it immediately revoked?
For new connections, yes. For existing sessions, no. SSH doesn’t re-check authorized_keys mid-session. If the key was compromised, terminate sessions deliberately.
2) What’s the single best way to avoid key sprawl?
Stop treating authorized_keys as a scratchpad. Use a source of truth (config management or centralized key lookup) and prohibit manual edits on production.
If you can, use short-lived OpenSSH certificates.
3) Why do we need fingerprints if we already have comments?
Because comments lie, get copied, or go missing. Fingerprints identify the actual key material. In incidents, your audit trail should reference fingerprints.
4) We rotated keys, but users still authenticate with the old key from their agent. How?
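Their SSH client may offer multiple keys automatically, and -i alone adds a key to that list rather than replacing it. Use ssh -vvv -o IdentitiesOnly=yes -i to force the intended key during tests, and set per-host IdentityFile and IdentitiesOnly rules in SSH config.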
5) Should we disable root SSH login during or after rotation?
If you can, yes—disable direct root login and require sudo from named accounts. If you must keep it, restrict it heavily and treat root’s authorized_keys as sacred.
6) Is it safe to use AuthorizedKeysCommand in production?
It can be, if you treat it like a dependency: cache, monitor, and have a break-glass fallback. If your org can’t reliably run small services, stick to config-managed files.
7) How do we revoke access for a vendor who used a shared account?
Remove their key from the shared account’s authorized_keys everywhere, then replace the pattern: vendors should get time-bounded access with tight restrictions,
preferably via bastion plus certificates or per-vendor accounts.
8) How do we prove to auditors that a key is revoked?
Show: (1) the key fingerprint, (2) evidence it was removed from the source(s), and (3) log evidence of rejected authentication attempts using that fingerprint
after the change, on representative hosts or fleet-wide where feasible.
9) What about keys baked into images or containers?
If an image contains authorized_keys content, you will keep resurrecting revoked keys. Scan images, fix the build pipeline, and invalidate old images.
In other words: stop shipping credentials as artifacts.
10) Should we rotate host keys too?
Different problem. User key rotation revokes user access; host key rotation changes server identity and impacts known_hosts trust.
Rotate host keys when there’s compromise or policy, but plan for client trust updates.
Next steps you can do this week
If you want a rotation that sticks, do three things in the next seven days:
- Build a fingerprint inventory from every authorized_keys on a representative slice of hosts. You’ll find surprises. That’s the point.
- Pick a revocation mechanism you can operate: config-managed authorized_keys, AuthorizedKeysCommand with caching, or (best) short-lived certificates.
- Write the “no manual edits” rule and enforce it, starting with bastions and production. Drift is how today’s keys become tomorrow’s incident.
Key rotation is not a ceremony. It’s an access control change with measurable outcomes: old keys fail, new keys succeed, and there’s a paper trail you’d be willing to read at 3 a.m.
Do it once, do it cleanly, and then design the system so you don’t have to re-learn the same lesson every quarter.