It’s 02:17. The pager is screaming, error budgets are evaporating, and the dashboard looks like it got hit with a shovel. Someone asks, “Who knows how this works?” The room goes quiet. Then a name comes up. One name.
If your production system requires a specific human to remain employed, healthy, reachable, and awake, you don’t have reliability. You have a fragile truce with physics and HR.
What “bus factor” really means in production
The bus factor is the number of people who could disappear (bus, illness, resignation, promotion, acquisition, life) before your project stalls or your service fails to recover. People treat it like a morbid joke; ops teams live it as a budget line.
In practice, the bus factor is rarely “one person knows everything.” It’s usually “one person knows the sharp edges.” The undocumented flag. The reason we pin that kernel version. The actual meaning of the custom Prometheus label that gates alerts. The secret handshake to import snapshots into the standby cluster. The pager routing rule that was “temporary” eighteen months ago.
Bus factor isn’t just knowledge. It’s authority (only one person can approve changes), access (only one person has the right tokens), and confidence (everyone else is afraid to touch it). If you fix documentation but keep access locked to a single laptop, you’re still running a hostage negotiation, just with nicer Markdown.
One quote worth keeping in your head when you’re tempted to “just ask Alex” instead of writing it down:
Paraphrasing Gene Kim: the goal is fast flow and stability; heroics are a sign that your system and processes are failing.
Two clarifications that matter:
- High bus factor doesn’t mean low skill. It can mean strong specialization. The problem is when specialization becomes an untested single point of failure.
- Bus factor is a systems problem. The “bus factor engineer” is often doing the rational thing inside an irrational org: shipping, patching, firefighting, and absorbing context because nobody else is allowed time to learn.
Joke #1: The bus factor engineer isn’t “irreplaceable.” They’re replaceable; the system just charges a consulting fee in outages.
Why it happens (and why smart teams still fall into it)
1) Incentives reward speed, not transfer
Most orgs promote the person who unblocks launches, not the person who makes sure launches can happen without them. If your performance system rewards “impact,” the easiest visible impact is to become a bottleneck. You’re suddenly in every meeting, every review, every incident. Your calendar becomes the architecture.
2) The “temporary exception” becomes permanent infrastructure
Emergency changes are fine. Permanent emergency changes are a smell. The bus factor engineer is often the only one who remembers why we bypassed the normal pipeline “just this once.” That bypass then becomes the pipeline.
3) Complexity piles up where change is easiest
The place where changes can be made quickly becomes the place where changes are made. If one person has the privileges and the context, the system evolves around their workflow. Over time, the workflow becomes a dependency. That’s how you end up with production automation that requires a human to SSH in and run a one-liner from shell history.
4) Documentation fails when it’s treated as prose
Docs that say “restart the service” aren’t documentation; they’re vibes. Real ops documentation is executable, verifiable, and tied to reality: exact commands, expected output, and what decision you make next.
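A minimal sketch of the difference, using a made-up service, port, and metric name; the shape is the point, not the specifics:

Check ingest backlog:
  Run:     curl -s localhost:9102/metrics | grep ingest_queue_depth
  Expect:  ingest_queue_depth below 1000
  If it’s above 1000 and climbing: shed load (next step). If the endpoint doesn’t answer: the ingest worker is down; restart it and re-check.

Three lines like that survive contact with a 02:00 incident far better than a paragraph of narrative.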
5) Fear is contagious
If the system has bitten people before—data loss, kernel panics, “mysterious corruption”—the team learns not to touch it. Over time, the only person who touches it becomes the only person who can touch it. This is how brittle storage stacks get “owned” by a single engineer for years.
Facts and historical context that explain the pattern
Bus factor feels modern because we talk about it in GitHub terms, but the pattern is old: specialization plus speed plus fragile interfaces. Some context that helps frame why this keeps happening:
- The term “bus factor” is a cousin of older “truck factor” language used in software teams to describe the same risk; the metaphor evolved, the failure mode didn’t.
- NASA’s “single string” versus “dual string” designs (single vs redundant control paths) popularized the idea that redundancy isn’t optional in high-stakes systems; people are part of the system.
- In early telephony and mainframe eras, “operator knowledge” was literal. Runbooks existed because systems required manual steps; when knowledge was trapped in people, uptime suffered.
- The rise of on-call rotations in web ops forced teams to formalize tribal knowledge, because incidents don’t schedule themselves around the only expert’s vacation.
- The “DevOps” movement didn’t just argue for dev+ops collaboration; it implicitly argued against black-box ownership and heroic gatekeeping as a scaling model.
- Postmortem culture (blameless or not) became popular because organizations needed a tool to convert incidents into shared learning instead of private knowledge.
- Configuration management (CFEngine, Puppet, Chef, Ansible) was partly a response to “snowflake servers” that only one admin understood; codifying state is a bus-factor antidote.
- Modern incident command structures (ICS-style) spread beyond emergencies because separating “who does” from “who knows” makes response scalable and teachable.
- Secret management systems exist because “the password is in Steve’s head” is not security; it’s a denial-of-service waiting to happen.
Diagnosing bus-factor risk: signals you can measure
Bus factor is emotional—teams argue about ownership and trust—but it’s also measurable. If you want to reduce it, stop debating and start observing.
Signal 1: Who merges changes?
If one person approves or merges the majority of changes to a system, you’ve already centralized knowledge and authority. It might be justified for security, but then you need a second approver with equivalent competence and access.
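You can get a rough read on this from version control alone. A minimal sketch, assuming Git and a six-month window; the repo path and names are hypothetical:

cr0x@server:~$ git -C ~/repos/infra log --merges --since="6 months ago" --format='%an' | sort | uniq -c | sort -rn | head -n 3
    412 Alex Chen
     23 Priya N.
      9 Sam Ortega

If one name dominates merges into a production-critical repo, you’ve measured the bottleneck (for squash-based workflows, count authors on the main branch instead).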
Signal 2: Who gets paged and who fixes?
Look at incident timelines. If the same person consistently appears as “joined and resolved,” they’re not just knowledgeable; they’re the only one trusted to act. That’s not a compliment. That’s a failure mode you’re normalizing.
Signal 3: “It’s in Slack” documentation
If the runbook is “search the channel and find the thread from last year,” you have no runbook. You have archaeology.
Signal 4: Access asymmetry
If the IAM model allows only one person to break-glass into production, you have a single point of failure. If you can’t grant more access, then you need a tightly controlled, audited, documented break-glass process that any on-call can execute.
Signal 5: Recovery is a ritual
When recovery depends on “do these seven steps in the right order and don’t ask why,” you’re in a knowledge trap. Systems that can be restored only by ritual are systems that will not be restored under stress.
Signal 6: The system is stable… until it isn’t
Bus factor risk hides in stable systems because they don’t demand attention. Then something changes—kernel update, firmware, cloud API behavior, certificate rotation—and you discover the system was stable only because one person had a set of compensating behaviors.
Fast diagnosis playbook: find the bottleneck quickly
This is the “walk into a burning room” checklist. It’s about figuring out the most likely bottleneck in minutes, not composing a dissertation while the database is on fire.
First: identify the failure domain and stop making it worse
- Confirm scope: one host, one AZ, one region, one dependency, one customer segment.
- Freeze risky deploys: pause pipelines, stop auto-scaling thrash, halt batch jobs if they’re crowding out production.
- Check for obvious saturation: CPU, memory, disk IO, network, file descriptors.
Second: verify the “boring” plumbing
- DNS and certificates: expiring certs and DNS failures imitate app failures disturbingly well.
- Time: NTP drift can break auth, logs, distributed coordination, and your sanity.
- Storage health: IO stalls will masquerade as “application slowness” until someone checks the disks; quick checks for all three are sketched below.
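Each of these has a cheap check. A sketch, assuming a systemd host with OpenSSL installed; the hostname is hypothetical, and for the last command no output is the good case:

cr0x@server:~$ echo | openssl s_client -connect api.internal.example:443 -servername api.internal.example 2>/dev/null | openssl x509 -noout -enddate
notAfter=Feb  9 08:00:00 2026 GMT
cr0x@server:~$ timedatectl | grep -E 'synchronized|NTP service'
System clock synchronized: yes
              NTP service: active
cr0x@server:~$ sudo dmesg -T | grep -iE 'i/o error|timeout' | tail -n 3

A certificate expiring in days, an unsynchronized clock, or fresh I/O errors in the kernel log each explain a surprising number of “the app is broken” incidents.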
Third: find the current limiter
- Latency budget accounting: is time spent in the app, the database, the network, or waiting on IO?
- Look for queue growth: thread pools, connection pools, message brokers, kernel run queue, IO queues.
- Confirm the rollback / failover path: if you can safely shed load or fail over, do it early rather than arguing for an hour.
The bus-factor twist: if the “fast diagnosis” steps require a specific person to interpret them, your observability is decorative. Fix that before the next incident.
Three corporate mini-stories from the trenches
Mini-story #1: The incident caused by a wrong assumption
A mid-sized SaaS company ran a homegrown storage gateway in front of object storage. The gateway cached metadata locally for speed. It worked well enough that nobody questioned it; the original engineer moved to a different team but still “owned” it in practice.
During a routine maintenance window, another engineer rotated the host certificates and rebooted the gateway nodes one at a time. On paper: safe. In reality: the metadata cache was not just a cache. It contained authoritative state for in-flight multipart uploads, and the “rehydration” job that rebuilt state from object storage was disabled months earlier after causing load spikes.
The wrong assumption was subtle: “If it’s a cache, it’s rebuildable.” That’s true until someone quietly adds “temporary but critical” state and forgets to update the mental model.
After the reboot, clients began retrying uploads. Retries amplified traffic, traffic amplified latency, and the gateway started timing out health checks. Auto-healing replaced nodes, which erased more “cache,” which erased more state. The system didn’t just fail; it churned itself into a crater.
The fix wasn’t heroic. They re-enabled the rehydration job with rate limits, wrote a runbook that explicitly stated what data lived where, and added a canary reboot test to CI for the gateway AMI. Most importantly, they made “cache semantics” a documented contract and put two engineers on rotation for storage changes.
Mini-story #2: The optimization that backfired
A trading analytics platform had a Postgres cluster backed by fast NVMe. One engineer—brilliant, sleep-deprived, and trusted—decided to “help the kernel” by tuning dirty page writeback and IO scheduler settings. They reduced latency during benchmarks and declared victory.
The change lived in a custom sysctl file on the database hosts, not in configuration management, because it was “just a quick tweak.” Months later the company upgraded the OS image. The new kernel handled writeback differently. The old sysctl values became pathological under real workload: bursts of writes would accumulate, then flush in massive spikes that blocked reads. Queries didn’t slow gradually; they hit walls.
On-call saw CPU idle and assumed the database was “fine.” App engineers chased connection pools. Meanwhile the storage layer was doing synchronized suffering. The only person who understood the sysctl tweak was on a transatlantic flight.
When the expert landed, they reverted the sysctl and the system recovered immediately. The postmortem was uncomfortable: the optimization was real, but it was coupled to kernel behavior, undocumented, untested, and effectively unowned by the team.
The durable remediation was to treat kernel tuning like application code: versioned configuration, explicit owners, a canary host, performance regression tests, and a hard rule that “fast” settings must have a rollback plan and a clear reason.
Mini-story #3: The boring but correct practice that saved the day
A healthcare company ran ZFS-backed NFS for internal analytics. Nothing glamorous. The storage engineer insisted on a weekly scrub schedule, monthly restore drills, and a tiny runbook with exact commands and screenshots of expected outputs. Everyone teased them for being old-school.
One Thursday, a firmware bug in a batch of SSDs started returning intermittent read errors. ZFS detected checksum mismatches and began repairing from redundancy. Alerts fired, but the service remained mostly healthy. The on-call team wasn’t panicking because the runbook explained what a scrub does, what “errors: No known data errors” means, and when to escalate.
They replaced the failing drives in a controlled window, verified resilver progress, and ran a restore drill from the latest snapshots. No drama, no “we think it’s okay,” no waiting for the specialist to wake up.
The point wasn’t ZFS magic. The point was operational muscle memory: the team had practiced the boring steps, and those steps were written down in a way that made a non-expert dangerous in the right direction.
Joke #2: Every team has “that system nobody touches.” It’s like a museum exhibit, except it occasionally pages you.
Hands-on tasks: commands, outputs, and decisions (12+)
These are not “tips.” These are the sorts of concrete checks that turn a bus-factor system into a team system. Each task includes a command, what the output means, and the decision you make from it.
Task 1: Identify who has production access (Linux host)
cr0x@server:~$ getent group sudo
sudo:x:27:root,alice,bob,oncall
What it means: This host allows root, alice, bob, and oncall to escalate privileges.
Decision: If production break-glass depends on one named person, add a role-based group (like oncall) with audited membership and an access review cadence.
Task 2: Verify SSH keys and last logins (find “only Pat can log in” situations)
cr0x@server:~$ sudo last -a | head
reboot system boot 6.5.0-21-generic Mon Jan 29 03:12 still running - server
alice pts/0 10.0.2.14 Mon Jan 29 03:20 still logged in - 10.0.2.14
bob pts/1 10.0.3.19 Mon Jan 29 02:55 - 03:40 (00:45) - 10.0.3.19
What it means: Who is actually accessing the box, and from where. If you only ever see one username, you might have a shadow ownership problem.
Decision: Enforce personal accounts + shared on-call role accounts for emergency access, and require at least two people to be able to log in during an incident.
Task 3: Confirm secrets aren’t “in someone’s home directory”
cr0x@server:~$ sudo grep -R --line-number "BEGIN PRIVATE KEY" /home 2>/dev/null | head
/home/alice/.ssh/id_rsa:1:-----BEGIN PRIVATE KEY-----
What it means: You found a private key sitting in a home directory. That might be legitimate for user SSH, but often it’s a service key or deployment credential that became “Alice’s problem.”
Decision: Move service credentials into a managed secret store; rotate compromised/overused keys; document access and rotation.
Task 4: Check whether services are configured by hand (drift detection)
cr0x@server:~$ sudo systemctl cat nginx | sed -n '1,25p'
# /lib/systemd/system/nginx.service
[Unit]
Description=A high performance web server and a reverse proxy server
After=network-online.target remote-fs.target nss-lookup.target
[Service]
Type=forking
ExecStartPre=/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'
ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;'
What it means: The service file is stock. Good. If you see overrides in /etc/systemd/system/nginx.service.d/override.conf that aren’t in Git, that’s drift.
Decision: Put overrides into configuration management; add a “drift check” in CI or a scheduled audit.
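One low-tech drift check, assuming your intended unit files live in a Git checkout; the repo path here is hypothetical:

cr0x@server:~$ diff -ru ~/repos/infra/systemd/nginx.service.d/ /etc/systemd/system/nginx.service.d/
Only in /etc/systemd/system/nginx.service.d/: override.conf

An override that exists on the host but not in the repo is exactly the kind of change one person made, once, and now only that person remembers.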
Task 5: Confirm what is actually listening (catch “mystery ports” owned by one engineer)
cr0x@server:~$ sudo ss -lntp
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 4096 0.0.0.0:443 0.0.0.0:* users:(("nginx",pid=1321,fd=6))
LISTEN 0 4096 127.0.0.1:9000 0.0.0.0:* users:(("gatewayd",pid=2104,fd=12))
What it means: There’s a local-only service on 9000 called gatewayd. If nobody can explain it without pinging one person, that’s bus factor made visible.
Decision: Map every listening port to an owner, repo, and runbook. If you can’t, schedule a short “service inventory” incident before the real incident arrives.
Task 6: Check disk health quickly (SRE triage, storage edition)
cr0x@server:~$ lsblk -o NAME,SIZE,TYPE,MOUNTPOINT,FSTYPE,MODEL
NAME     SIZE TYPE MOUNTPOINT FSTYPE MODEL
sda      1.8T disk                   Samsung_SSD_860
├─sda1   512M part /boot      ext4
└─sda2   1.8T part /          ext4
What it means: Basic inventory. If the system relies on a specific disk model/firmware quirk known only to one person, you want that visible.
Decision: Record drive models/firmware and standardize. Heterogeneous fleets are fine until you need consistent failure handling.
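If smartmontools is installed, model and firmware are one command away; the values shown here are illustrative:

cr0x@server:~$ sudo smartctl -i /dev/sda | grep -E 'Model|Firmware|Serial'
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 860 EVO 2TB
Serial Number:    S3YZNB0K000000
Firmware Version: RVT04B6Q

Put these in the inventory. When a firmware-specific bug shows up, “which drives are affected” should be a query, not a memory test for one engineer.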
Task 7: Look for IO pressure (is the system slow because storage is slow?)
cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0-21-generic (server) 02/02/2026 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.31 0.00 4.22 25.80 0.00 57.67
Device r/s w/s rkB/s wkB/s await %util
sda 80.0 120.0 6400.0 9800.0 38.2 98.7
What it means: %iowait is high and disk %util is near 100% with elevated await. This is classic “apps look slow, CPU looks fine.”
Decision: Throttle batch jobs, reduce write amplification (logs, compactions), or fail over to less-loaded storage. Start a storage-focused incident thread; don’t let everyone chase application ghosts.
Task 8: Check filesystem capacity and inode exhaustion
cr0x@server:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 1.8T 1.7T 55G 97% /
cr0x@server:~$ df -i /
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda2 120586240 119900000 686240 100% /
What it means: You’re out of inodes. This failure looks like “random” write errors even with free disk space.
Decision: Find and delete high-file-count directories (cache, tmp, log shards). Add monitoring for inode usage. Update runbook: “disk full” includes inodes.
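To see where the inodes went, GNU du can count them; a sketch assuming the culprit lives somewhere under /var:

cr0x@server:~$ sudo du --inodes -xd1 /var | sort -rn | head -n 5
9512466 /var
9401220 /var/lib
  88210 /var/log
  12033 /var/cache
   9821 /var/spool

Descend into the winner the same way until you find the directory with millions of tiny files, then decide whether to delete, archive, or fix the thing generating them.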
Task 9: Detect “kernel is the bottleneck” via run queue and load
cr0x@server:~$ uptime
02:21:10 up 17 days, 6:14, 2 users, load average: 24.18, 23.77, 20.06
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
18 6 0 128000 52000 910000 0 0 820 2400 4100 8200 13 5 55 27 0
What it means: Load is high, and b (blocked) plus wa (IO wait) suggest processes are stuck waiting on IO.
Decision: Treat as IO/backing store bottleneck, not CPU scaling. Escalate to storage, reduce IO, and identify the top IO consumers.
Task 10: Find top IO consumers (prove who is hurting the disk)
cr0x@server:~$ sudo iotop -boP -n 3 | head -n 12
Total DISK READ: 120.00 M/s | Total DISK WRITE: 85.00 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
2104 be/4 app 65.00 M/s 12.00 M/s 0.00 % 15.20 % gatewayd --cache /var/lib/gw
1887 be/4 postgres 40.00 M/s 60.00 M/s 0.00 % 12.30 % postgres: checkpointer
What it means: You have a smoking gun: gatewayd and Postgres checkpoint writes dominate IO.
Decision: For immediate relief: tune checkpoint behavior, reduce cache churn, or temporarily disable non-critical workloads. For bus factor: document what gatewayd is and who owns its IO profile.
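If the checkpointer is the culprit, read its configuration before changing it; a sketch assuming Postgres defaults and local access as the postgres user:

cr0x@server:~$ sudo -u postgres psql -Atc "SHOW checkpoint_timeout" -c "SHOW max_wal_size"
5min
1GB

A short timeout and a small max_wal_size force frequent, spiky checkpoints under heavy writes; any change here belongs in versioned configuration with an owner, not in a one-off tweak that only one person remembers.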
Task 11: Validate ZFS pool health (if you run ZFS, you must know this)
cr0x@server:~$ sudo zpool status -x
all pools are healthy
What it means: No known pool issues right now.
Decision: If you don’t regularly see and understand this output, schedule a training session. ZFS is not “set and forget”; it’s “set and verify.”
Task 12: If ZFS is unhealthy, read it like a decision tree
cr0x@server:~$ sudo zpool status tank
pool: tank
state: DEGRADED
status: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device repaired.
scan: scrub repaired 0B in 00:12:11 with 0 errors on Mon Jan 29 01:00:01 2026
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ata-SAMSUNG_SSD_860-1 ONLINE 0 0 0
ata-SAMSUNG_SSD_860-2 FAULTED 0 0 12
What it means: The mirror member has checksum errors and is faulted. Scrub shows 0 repaired but checksum errors exist: treat the device as suspect.
Decision: Replace the failed disk, confirm resilver, then clear errors. Most importantly: ensure the replacement procedure is in a runbook that a non-specialist can execute under supervision.
Task 13: Confirm scrubs and snapshots exist (boring, correct, lifesaving)
cr0x@server:~$ sudo zpool status tank | grep scan:
scan: scrub repaired 0B in 00:12:11 with 0 errors on Mon Jan 29 01:00:01 2026
cr0x@server:~$ sudo zfs list -t snapshot | head
NAME USED AVAIL REFER MOUNTPOINT
tank/data@daily-001 128M - 1.2T -
tank/data@daily-002 130M - 1.2T -
What it means: The last scrub finished recently with no errors, and snapshots exist (good). The scan line only proves a scrub ran; it doesn’t prove one is scheduled, so you still need to verify the schedule (cron or systemd timers) and the alerting around it.
Decision: Ensure there’s a known schedule and alerting for scrub completion and errors. If scrubs “just happen” because one person runs them manually, that’s pure bus-factor risk.
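Two places to look for a schedule; paths vary by distro, so treat this as a sketch (on Debian/Ubuntu, zfsutils-linux usually installs a cron entry):

cr0x@server:~$ grep -r scrub /etc/cron.d/ 2>/dev/null
/etc/cron.d/zfsutils-linux:# Scrub the second Sunday of every month.
/etc/cron.d/zfsutils-linux:24 0 8-14 * * root if [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ]; then /usr/lib/zfs-linux/scrub; fi
cr0x@server:~$ systemctl list-timers 2>/dev/null | grep -i scrub

If both come back empty, nobody scheduled scrubs; they have been happening because somebody runs them by hand, which is the bus factor in its natural habitat.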
Task 14: Check replication / backup job status (prove recoverability)
cr0x@server:~$ systemctl status zfs-replication.service --no-pager
● zfs-replication.service - ZFS snapshot replication
Loaded: loaded (/etc/systemd/system/zfs-replication.service; enabled)
Active: active (exited) since Mon 2026-02-02 01:00:02 UTC; 1h 22min ago
Process: 4421 ExecStart=/usr/local/sbin/replicate-tank.sh (code=exited, status=0/SUCCESS)
What it means: The replication unit ran successfully recently.
Decision: If replication success is only “status=0,” add verification: list latest snapshot on the destination and test a restore path. Trust, but verify with evidence.
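Exit code 0 proves the script ran; proving data arrived means asking the destination. A sketch with a hypothetical backup host and dataset:

cr0x@server:~$ ssh backup01 zfs list -t snapshot -o name,creation -s creation tank/data | tail -n 2
tank/data@daily-001  Sun Feb  1 01:00 2026
tank/data@daily-002  Mon Feb  2 01:00 2026

If the newest snapshot on the destination is from last Tuesday, the green systemd status has been lying to you politely for a week.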
Task 15: Confirm what version and config you’re really running (no “works on my kernel” mysteries)
cr0x@server:~$ uname -a
Linux server 6.5.0-21-generic #21-Ubuntu SMP PREEMPT_DYNAMIC x86_64 GNU/Linux
cr0x@server:~$ dpkg -l | egrep 'zfs|linux-image' | head
ii linux-image-6.5.0-21-generic 6.5.0-21.21 amd64 Signed kernel image generic
ii zfsutils-linux 2.2.2-0ubuntu1 amd64 command-line tools to manage ZFS filesystems
What it means: You can see kernel and ZFS versions. That matters because behavior changes across versions, and “tribal tuning” often assumes a specific version.
Decision: Record versions in the runbook and in incident templates. Require a canary upgrade and a rollback plan for kernel/storage stack changes.
Task 16: Identify “only one person knows” cron jobs and timers
cr0x@server:~$ sudo systemctl list-timers --all | head -n 15
NEXT LEFT LAST PASSED UNIT ACTIVATES
Mon 2026-02-02 03:00:00 UTC 37min left Mon 2026-02-02 02:00:00 UTC 22min ago logrotate.timer logrotate.service
Mon 2026-02-02 04:00:00 UTC 1h 37min Mon 2026-02-02 01:00:00 UTC 1h 22min ago apt-daily-upgrade.timer apt-daily-upgrade.service
What it means: Timers show scheduled automation. If your most critical tasks aren’t here—or aren’t in a known orchestration system—they might be manual rituals.
Decision: Convert manual operational steps into timers/jobs with logs, alerts, and ownership. A job that only runs when “someone remembers” is a reliability bug.
Common mistakes: symptom → root cause → fix
1) Symptom: “We can’t deploy because only one person can approve”
Root cause: Approval policy substitutes for engineering controls; risk is managed socially, not technically.
Fix: Replace personal approval with automated checks (tests, policy-as-code), and require two trained approvers on rotation. Keep emergency override but audit it.
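A minimal sketch of what “two trained approvers on rotation” can look like in the review layer, assuming GitHub-style CODEOWNERS; the org and team names are hypothetical:

# .github/CODEOWNERS
# Reviews come from a rotation (a team), not from a named individual.
/services/gatewayd/    @acme-corp/storage-oncall
/deploy/terraform/     @acme-corp/platform-rotation

Pair this with branch protection that requires a code-owner review, and any current member of the rotation can approve; the person stops being the policy.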
2) Symptom: “On-call can’t restore, but the expert can in 10 minutes”
Root cause: Recovery path is undocumented, untested, and likely involves private context (hidden dependencies, implicit order).
Fix: Write a restore runbook with commands and expected outputs, then run quarterly restore drills with non-experts while the expert watches silently until needed.
3) Symptom: “The service is stable; we don’t need to touch it”
Root cause: Stability is being provided by a person’s habits (manual checks, careful avoidance), not by the system’s design.
Fix: Force safe, small changes: canary reboots, dependency pin checks, periodic failover tests. Stability must be demonstrated under controlled change.
4) Symptom: “Docs exist, but nobody trusts them”
Root cause: Docs are stale, non-executable, and written as narrative rather than operations.
Fix: Make docs command-first: “Run X, see Y, then do Z.” Attach docs to alert pages and incident templates. Add doc ownership and review dates.
5) Symptom: “We have automation, but it’s brittle and only one person debugs it”
Root cause: Automation is a custom system without tests, with poor observability, and with implicit assumptions.
Fix: Treat automation like production code: tests, logs, metrics, and a second maintainer. If it can’t be tested, simplify it until it can.
6) Symptom: “We can’t rotate secrets/certs without downtime”
Root cause: The system was never designed for rotation; secrets are embedded in configs or code paths that require restart and manual coordination.
Fix: Add hot-reload where possible, reduce secret sprawl, use short-lived credentials, and rehearse rotation like a deployment.
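For the common case of web-server certificates, “rotation without downtime” is often just a reload instead of a restart; a sketch assuming nginx:

cr0x@server:~$ sudo nginx -t && sudo systemctl reload nginx
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

A reload re-reads certificates and configuration without dropping established connections; rehearse it so nobody reaches for a full restart at 02:00 out of habit.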
7) Symptom: “Storage incidents are terrifying and slow to resolve”
Root cause: Storage has high blast radius and low observability; the team lacks muscle memory for the safe steps (scrub, resilver, snapshot, restore).
Fix: Standardize storage stacks; add clear health dashboards; practice replacement and restore procedures; create a “storage incident” playbook separate from app playbooks.
8) Symptom: “We always wait for the same person to join the incident”
Root cause: Only one person is empowered to act, or everyone else is afraid of making it worse.
Fix: Define decision rights (what on-call can do without permission), create safe mitigations (traffic shedding, feature flags), and run game days that reward learning.
Checklists / step-by-step plan
Step-by-step plan to reduce bus factor in 30–60 days
- Inventory the system: services, hosts, critical cron/timers, storage pools, databases, and dependencies. If it listens on a port, it needs a name and owner.
- Write the “break glass” runbook: how to get access, how to audit access, how to roll back access. This is the difference between resilience and theater.
- Create a minimal incident runbook per service:
- What does “healthy” look like (metrics, logs, status)?
- Top 3 failure modes and how to mitigate safely.
- Exact commands and expected outputs.
- Pick two non-experts and schedule a 60-minute “shadow restore” session where they execute the runbook while the expert watches.
- Standardize credentials: move secrets out of personal machines; enforce rotation; ensure on-call can access secrets through role-based controls.
- Introduce change safety: canaries, feature flags, staged rollouts, and a real rollback path. Bus factor thrives where rollback is mythical.
- Make ownership explicit: define primary/secondary owners for each system component. Secondary means “can operate and recover,” not “is CC’d.”
- Add operational readiness reviews for new systems: “Who can run it at 3 a.m.?” is a launch criterion, not a suggestion.
- Run at least one game day focused on the scary subsystem (storage, networking, auth). The goal is confidence, not chaos.
- Measure improvement: track incidents where the expert was required; track time-to-mitigate when they were absent; track how often runbooks were used and corrected.
What to avoid (because it feels productive and isn’t)
- Don’t punish the bus factor engineer. They didn’t create the incentive structure alone. Fix the system, not the scapegoat.
- Don’t write a 40-page wiki novel. Write a 2-page runbook with commands and decisions, then expand only as needed.
- Don’t mandate “everyone must know everything.” You’re aiming for recoverability and operability, not universal expertise.
- Don’t confuse access sharing with resilience. If two people have the same password, you just doubled the blast radius and kept the bus factor.
A practical “Definition of Done” for de-risking a subsystem
- At least two people can deploy, roll back, and restore without asking for help.
- Runbooks contain exact commands, expected outputs, and decision points.
- Access is via role-based groups with audit logs and periodic review.
- Backups/replication are verified by restore, not by “job succeeded.”
- At least one failure injection or game day has been done in the last quarter.
FAQ
1) Is bus factor just another word for “key person risk”?
Yes, but engineers need it framed operationally. “Key person risk” sounds like a management slide. “Bus factor” sounds like an incident waiting to happen, which is more accurate.
2) What’s an acceptable bus factor?
For production-critical systems, aim for at least 2 for operation and recovery, and 3 for sustained development. If your system can’t survive one person being unavailable, it’s not production-ready.
3) We have documentation. Why is bus factor still high?
Because documentation that isn’t executable is just a story. Runbooks must include commands, expected outputs, and decision points. Also: access and authority matter as much as knowledge.
4) Isn’t specialization unavoidable for storage, networking, and security?
Specialization is fine. Single points of failure are not. The goal is “specialists build and improve,” while “on-call can operate and recover safely using runbooks and guardrails.”
5) How do we reduce bus factor without slowing down delivery?
You’ll slow down slightly now or catastrophically later. The trick is to treat knowledge transfer as part of delivery: every change should update the runbook, dashboards, and rollback steps.
6) What if the bus factor engineer refuses to share knowledge?
Sometimes it’s gatekeeping, sometimes it’s burnout, sometimes it’s fear of losing status. Either way, solve it structurally: make shared ownership a requirement, rotate on-call, and ensure time is allocated for documentation and training.
7) How do we handle “only one person has the credentials” without weakening security?
Use role-based access control, audited break-glass, short-lived tokens, and approvals backed by logging. Security is not “one person owns the keys.” Security is controlled access with traceability.
8) What’s the fastest win?
Pick one critical incident type (restore, failover, certificate rotation) and write a command-based runbook. Then have a non-expert execute it in a drill. The first drill will be messy; that’s the point.
9) How do you measure bus factor improvement?
Track who resolves incidents, how long mitigation takes when the expert is absent, how often runbooks are used, and how many services have primary/secondary owners with tested access.
10) Does automation fix bus factor?
Only if it’s understandable and operable by the team. Automation that only one person can debug is just a faster way to get stuck.
Conclusion: next steps you can actually do
If you suspect you have a bus-factor engineer, you probably do. The giveaway is not their brilliance. It’s the team’s dependency on their presence for recovery and change.
Do this next, in order:
- Pick the scariest recovery path (database restore, storage resilver, region failover) and turn it into a runbook with commands, outputs, and decisions.
- Run a drill where someone else follows the runbook. The expert is allowed to observe and take notes, not drive.
- Fix access: make sure on-call can do the safe mitigations and break-glass steps with audit trails.
- Standardize and test: canary changes, rollback paths, and routine “boring” checks like scrubs and restore tests.
- Make ownership real: primary and secondary owners, with explicit expectations and time allocated for transfer.
Reliability is not a personality trait. It’s a property of a system that assumes humans are fallible, unavailable sometimes, and shouldn’t be single points of failure—no matter how good their shell history is.