Production failures are usually “just” expensive. A pager goes off, revenue blips, a customer yells, we patch and move on.
Therac-25 was different: a set of software and systems decisions turned a medical device into a machine that could deliver
catastrophic overdoses of radiation—while confidently telling the operator everything was fine.
If you build services, platforms, storage, or anything that ships code into the world, this case isn’t ancient history.
It’s a field report on what happens when you remove hardware interlocks, trust concurrency you don’t understand, and treat
incidents as “user error” until your luck runs out.
What Therac-25 was, and why it mattered
Therac-25 was a computer-controlled medical linear accelerator used for radiation therapy in the mid-1980s.
It could operate in multiple modes, including an electron-beam mode and a high-energy X-ray mode. In safe operation,
hardware and software should coordinate to ensure the beam energy and the physical configuration are aligned—target,
filters, collimators, and all the boring mechanical stuff that keeps patients alive.
The “why it mattered” part isn’t that it had software bugs. Everything has software bugs. The story is that
the device design leaned on software to provide safety guarantees that older machines enforced with hardware interlocks.
When the software was wrong—and it was wrong in subtle, timing-dependent ways—the machine could deliver a massive overdose
without a credible alarm or a reliable fail-safe.
If you’ve ever replaced a physical safety rail with “logic in the controller,” you’ve touched the same hot stove.
If you’ve ever removed a “slow” check because it “never happens,” you’ve stood in the same blast radius.
And if you’ve ever treated an operator report as noise because “the system says OK,” you’re already holding the match.
Concrete facts and historical context (short, useful, not trivia night)
- Therac-25 accidents occurred between 1985 and 1987, with six known overdose events tied to software and system design flaws.
- Earlier related machines (Therac-6 and Therac-20) used more hardware interlocks; Therac-25 relied more on software for safety.
- Operators could trigger a hazardous state through rapid input sequences (timing-dependent behavior), a classic concurrency/race-condition smell.
- Error messages were cryptic (e.g., numeric “MALFUNCTION” codes such as “MALFUNCTION 54”), pushing staff toward retries instead of safe shutdown and escalation.
- Some overdoses were initially dismissed as patient reactions or operator mistakes, delaying effective containment and root-cause correction.
- Reused software was assumed safe because it existed in prior products, even though the safety context changed when hardware interlocks were reduced.
- Logging and diagnostics were insufficient to reconstruct events quickly, a gift to denial and a curse to incident response.
- Regulatory and industry software safety practices were less mature than today; formal methods and rigorous independent V&V were not uniformly applied.
- The case became a foundational software safety lesson, used for decades in engineering ethics and reliability training.
Two of these facts should make you uncomfortable: “reused software assumed safe” and “cryptic errors causing retries.”
That’s 2026 in a different hat.
The failure chain: from keystrokes to lethal dose
1) Safety moved from hardware into software, but the software wasn’t built like a safety system
In a safety-critical device, you don’t get to say “the code should do X.” You must say “the system cannot do Y,”
even under partial failure: stuck bits, timing anomalies, unexpected sequences, bad inputs, degraded sensors, power glitches.
Hardware interlocks historically forced safe state transitions regardless of software confusion.
Therac-25 reduced hardware interlocks and leaned on software coordination. That choice isn’t automatically wrong,
but it changes the burden of proof. If you remove physical constraints, your software must be engineered, verified,
and monitored like it’s holding the only parachute.
2) Race conditions: the kind that laugh at your test plan
The infamous pattern in Therac-25 discussions is that certain rapid operator sequences could push the system into an invalid
internal state. The operator could edit treatment parameters quickly, and the system could accept the new values while other
parts of the control logic still assumed old values. Now you have “beam energy set for mode A” and “mechanical configuration
positioned for mode B.” That mismatch is not “a bug.” It is a lethal configuration.
Race conditions are failures of coordination. They slip past unit tests, they hide from normal workflows, and they love
“fast operators” because humans are surprisingly good at generating weird timing. The modern equivalent is a distributed
system that “usually converges” but occasionally writes a stale config to the wrong cohort. In safety contexts, “occasionally”
is unacceptable.
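To make the hazard concrete, here is a minimal Python sketch of the pattern and one guard against it. Everything in it is hypothetical (the class, fields, and the hardware_matches_mode/deliver callbacks are assumptions for illustration, not the actual Therac-25 code): every operator edit bumps a version, verification records which version it checked, and the fire path refuses to proceed if the two no longer match.

import threading

class TreatmentController:
    """Hypothetical controller illustrating the stale-verification hazard."""

    def __init__(self):
        self._lock = threading.Lock()
        self._params_version = 0        # bumped on every operator edit
        self._verified_version = None   # version the last hardware check ran against
        self.mode = "electron"
        self.energy_mev = 6

    def edit_parameters(self, mode, energy_mev):
        with self._lock:
            self.mode = mode
            self.energy_mev = energy_mev
            self._params_version += 1   # any edit invalidates earlier verification

    def verify_hardware_setup(self, hardware_matches_mode):
        with self._lock:
            if hardware_matches_mode(self.mode, self.energy_mev):
                self._verified_version = self._params_version

    def fire_beam(self, deliver):
        with self._lock:
            if self._verified_version != self._params_version:
                # Parameters changed after the last verification: fail closed.
                raise RuntimeError("configuration not verified; refusing to fire")
            deliver(self.mode, self.energy_mev)

The point is not the lock; it is that “verified” is tied to a specific version of the parameters, so a fast edit cannot silently ride on an older check.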
3) The UI and error messages trained staff to keep going
If you want a system to fail safely, you must teach the humans around it what “unsafe” looks like. In Therac-25,
operators reportedly encountered confusing error codes and messages. The device might display an error, allow the operator to
continue after acknowledging it, and not clearly indicate the severity or the correct action.
A safety UI doesn’t optimize for throughput. It optimizes for correct escalation. It uses plain language,
explicit severity, and unambiguous required actions. When a system’s UI makes “retry” the easiest path,
it’s not a user mistake when people retry. It’s design.
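As a sketch of what that looks like in code (the severity levels, wording, and lockout policy are illustrative assumptions, not a standard):

from enum import Enum

class Severity(Enum):
    INFO = 1
    RECOVERABLE = 2
    SAFETY = 3   # anything that could leave the system in an unsafe state

def operator_message(code, severity):
    # A safety-grade message names the severity and the required action;
    # above a threshold, "retry" is simply not offered.
    if severity is Severity.SAFETY:
        return (f"STOP. Fault {code}: treatment halted and locked out. "
                f"Do not resume. Escalate to physics/service and open an incident.")
    if severity is Severity.RECOVERABLE:
        return (f"Fault {code}: operation paused. Review parameters; "
                f"continuing requires explicit confirmation.")
    return f"Notice {code}: informational only, no action required."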
4) Weak observability enabled denial
After an incident, you need forensics: logs, state snapshots, sensor readings, command sequences, timestamps,
and a way to correlate them. Therac-25’s diagnostics were not sufficient for rapid, confident reconstruction.
In the absence of evidence, organizations default to narratives. The most common narrative is “operator error.”
Observability is a control, not a luxury. In safety-critical systems, it’s also a moral obligation:
you cannot fix what you cannot see, and you cannot prove safety with vibes.
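A minimal sketch of forensic-grade event logging, assuming nothing beyond the Python standard library (field names, the audit path, and the example events are illustrative):

import json
import time
import uuid

def log_event(stream, correlation_id, event, **state):
    """Append one structured, timestamped, correlated record."""
    record = {
        "ts_ns": time.time_ns(),          # wall-clock nanoseconds; only comparable across nodes if clocks are synced
        "correlation_id": correlation_id, # ties operator action -> state change -> outcome
        "event": event,
        "state": state,                   # snapshot of the values you will need later
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    stream.flush()                        # evidence should survive a crash or restart

# Usage: one correlation ID per operator action, reused across components.
# with open("/var/log/app/audit.log", "a") as audit:
#     cid = str(uuid.uuid4())
#     log_event(audit, cid, "params_edited", mode="xray", energy_mev=25)
#     log_event(audit, cid, "setup_verified", turntable="xray")
#     log_event(audit, cid, "beam_fired", dose_cgy=180)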
5) Incident response was treated as anomaly, not signal
One overdose should trigger a full-stop safety response: containment, safety notice, independent investigation, and a bias
toward “the system is guilty until proven innocent.” Instead, the historical pattern shows delayed recognition and repeated
harm across events.
This is a recognizable organizational failure mode: treat an incident as an isolated “weird one,” patch locally, move on.
The problem is that the second incident is not “weird.” It’s your system teaching you the truth, louder.
6) Safety is a system property; Therac-25 was optimized like a product
People sometimes summarize Therac-25 as “software bug killed patients.” That’s incomplete and slightly comforting,
which is why it persists. The deeper story is that multiple layers failed:
design assumptions, missing interlocks, concurrency defects, poor diagnostics, and a response culture that didn’t treat
reports as urgent evidence.
One quote belongs here, because it cuts through a lot of modern nonsense:
“Hope is not a strategy.”
— commonly attributed to General Gordon R. Sullivan in reliability and planning contexts.
Replace “hope” with “should,” “usually,” or “can’t happen,” and you get a reliable postmortem generator.
Joke #1: Race conditions are like cockroaches—if you see one in production, assume there are forty more hiding behind “it worked on my machine.”
Systems lessons: what to copy, what to never copy
Lesson A: Don’t remove interlocks unless you replace them with stronger guarantees
Hardware interlocks are blunt instruments, but they are brutally effective. They fail in predictable ways and are hard to
bypass accidentally. Software interlocks are flexible, but flexibility is not the same as safety.
If you move safety into software, you must upgrade engineering discipline accordingly:
formal hazard analysis, explicit safety requirements, independent verification, test harnesses that try to break timing,
and runtime monitoring that assumes the code can be wrong.
In SRE terms: don’t delete your circuit breakers because “the new service mesh has retries.”
Lesson B: Treat timing and concurrency as first-class hazards
Therac-25’s most notorious failure modes are timing-dependent. The core operational lesson is that you must test the system
under adversarial timing: rapid input, delayed sensors, reordering, partial updates, stale caches, and slow peripherals.
If you can’t model it, you fuzz it. If you can’t fuzz it, you isolate it with hard constraints.
In modern platforms: if your safety depends on “this event handler won’t be re-entered,” you’re betting lives on a comment.
Use locks, idempotency, state machines with explicit transitions, and invariants that are checked at runtime.
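A sketch of that last sentence, with hypothetical states and an invariant callback (all names are assumptions): transitions are whitelisted, and the dangerous transition re-checks the invariant at the moment it matters, not when someone last assumed it held.

ALLOWED = {
    ("idle", "configured"),
    ("configured", "verified"),
    ("verified", "delivering"),
    ("delivering", "idle"),
    ("configured", "idle"),
    ("verified", "configured"),   # any edit drops you back to "configured"
}

class SafetyStateMachine:
    def __init__(self):
        self.state = "idle"

    def transition(self, new_state, invariant_ok=lambda: True):
        if (self.state, new_state) not in ALLOWED:
            raise RuntimeError(f"illegal transition {self.state} -> {new_state}")
        if new_state == "delivering" and not invariant_ok():
            # The invariant is re-checked at runtime, not assumed from an earlier code path.
            raise RuntimeError("invariant violated; failing closed")
        self.state = new_state

# invariant_ok should re-read sensors/config and return True only if, for example,
# energy and mechanical configuration actually agree right now.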
Lesson C: UI is part of the safety envelope
Interfaces must make the safe action easy and the unsafe action hard. Error messages must be actionable. Alarms must be graded.
And the system must not present “everything is fine” when it cannot actually prove that.
If your system is uncertain, it should say “I am uncertain,” and it should default to a safe state. Engineers hate this because
it’s “conservative.” Patients love it because they remain alive to complain about your conservative design.
Lesson D: Postmortems are safety controls, not bureaucracy
Therac-25 is also about institutional learning. When incidents happen, you need an incident process that extracts signal:
blameless in tone, ruthless in analysis, and action-oriented in follow-up.
“We retrained operators” is not a fix when the system allowed lethal state mismatches. That’s paperwork cosplay.
Lesson E: Reuse is not free; reused code inherits new obligations
Reusing code from older systems can be smart. It can also be how you smuggle old assumptions into a new environment.
Therac-25 reused code from earlier machines that still had hardware interlocks, so the code’s implicit safety model changed underneath it.
In reliability terms: you changed the dependency graph but kept the old SLO.
When you reuse, you must re-validate the safety case. If you cannot write the safety case in plain language and defend it under cross-exam,
you do not have a safety case.
Three corporate mini-stories (anonymized, plausible, and unfortunately common)
Mini-story 1: The wrong assumption that took down billing (and almost took down trust)
A mid-sized company ran a “simple” event pipeline: API writes to a queue, workers process, database stores results.
The system had been stable for a year, which is how you know it was about to become everyone’s personality for a week.
A senior engineer made a reasonable assumption: message IDs were globally unique. They were unique per partition, not global.
The code used message ID as an idempotency key in a shared Redis store. In normal load, collisions were rare.
Under a traffic spike, collisions went from “rare” to “frequent enough to be a bug.”
The result was silent drops: workers believed they had already processed a message and skipped it. Billing events were missing.
Customers weren’t overcharged; they were undercharged. That sounds fun until finance arrives with spreadsheets and questions.
Monitoring didn’t catch it quickly because error rates stayed low. The system was “healthy” while correctness was on fire.
The postmortem was blunt: the assumption was undocumented, untested, and unobserved. They fixed it by scoping idempotency keys
to partition+offset, adding invariants, and building a reconciliation job that flagged gaps. They also added canary traffic that
intentionally created collisions to verify the fix in production-like conditions.
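The shape of that fix, sketched against a redis-py-style client (the key format, TTL, and function names are assumptions):

def idempotency_key(topic, partition, offset):
    # The message ID alone was only unique per partition; include the coordinates
    # that make the key globally unique in this system.
    return f"processed:{topic}:{partition}:{offset}"

def process_once(redis_client, topic, partition, offset, handler, payload):
    key = idempotency_key(topic, partition, offset)
    # SET with nx=True succeeds only for the first writer; ex= bounds memory growth.
    if redis_client.set(key, "1", nx=True, ex=7 * 24 * 3600):
        # A stricter design marks completion only after handler() succeeds
        # (or uses a transactional outbox); kept simple here.
        handler(payload)
    # else: a genuine duplicate, safe to skip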
Therac-25’s lesson shows up here: when assumptions are wrong, the system doesn’t always crash. Sometimes it smiles and lies.
Mini-story 2: An optimization that backfired (because “fast” is not a requirement)
Another org had a storage-backed analytics service. They wanted to cut latency, so they added a local cache on each node
and removed a “redundant” checksum validation step when reading blobs from object storage. The checksum step was expensive,
they argued, and the storage layer “already handles integrity.”
For months it looked great. Latency down, CPU down, graphs up and to the right. Then a kernel update introduced a rare
DMA-related data corruption issue on one instance type. It was uncommon, hard to reproduce, and the object store wasn’t the culprit.
The corruption happened on the read path between the device and application memory, and the removed checksum was the only practical detection point.
What failed wasn’t just data integrity; it was incident response. The team chased “bad queries” and “weird customers”
for days because their dashboards showed success. The corruption was semantic, not transport-level.
They eventually found it by comparing results across nodes and noticing one node was “creatively wrong.”
The rollback was humiliating but effective: re-enable checksums, add cross-node validation in canaries,
and run periodic scrub jobs. The optimization was reintroduced later with guardrails: checksums were sampled at high rate,
and full validation was forced for critical datasets.
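A sketch of “reintroduced with guardrails” (the hash choice, sampling rate, and fetch callback are assumptions):

import hashlib
import random

def read_blob(fetch, key, expected_sha256, critical=False, sample_rate=0.25):
    # Always validate critical datasets; sample everything else at a high rate.
    data = fetch(key)   # bytes from object storage or the local cache
    if critical or random.random() < sample_rate:
        actual = hashlib.sha256(data).hexdigest()
        if actual != expected_sha256:
            # This is the detection point the original optimization removed.
            raise IOError(f"checksum mismatch for {key}")
    return data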
Therac-25 removed layers that made unsafe states harder. This was the same mistake, just with fewer funerals and more dashboards.
Mini-story 3: The boring practice that saved the day (and nobody got promoted for it)
A payments platform had a practice that engineers mocked as “paranoid”: every release required a staged rollout with automated
invariants checked at each stage. Not just error rate and latency—business invariants: totals, counts, reconciliation against
independent sources.
One Friday afternoon (because of course), a new release introduced a subtle double-apply bug triggered by retry logic
when a downstream service timed out. It didn’t happen in integration tests because the tests didn’t model timeouts realistically.
It did happen in the first canary cell with real traffic.
The invariants caught it within minutes: the canary cell’s ledger totals drifted compared to the control cell.
The rollout stopped automatically. No full outage. No customer impact beyond a handful of transactions that were automatically
reversed before settlement.
The fix was straightforward: idempotency tokens were moved from “best effort” to “required,” and the retry path was changed
to verify commit status before replay. The key point is that the process caught it, not heroics.
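The gate itself can be small. A sketch of the idea (the metric, threshold, and equal-traffic assumption are illustrative):

def ledger_drift(canary_total_cents, control_total_cents):
    # Relative drift between canary and control ledger totals.
    if control_total_cents == 0:
        return 0.0
    return abs(canary_total_cents - control_total_cents) / control_total_cents

def gate_rollout(canary_total_cents, control_total_cents, max_drift=0.001):
    # Assumes canary and control cells see comparable traffic;
    # otherwise normalize per transaction before comparing.
    drift = ledger_drift(canary_total_cents, control_total_cents)
    if drift > max_drift:
        return ("halt", drift)   # stop the rollout; humans investigate
    return ("proceed", drift)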
Nobody got a standing ovation. Everyone went home.
Therac-25 didn’t have that kind of invariant-driven staging, and the result was repeated harm before the pattern was accepted.
Boring is good. Boring is survivable.
Fast diagnosis playbook: what to check first/second/third
Therac-25’s operational disaster was amplified by slow, ambiguous diagnosis. In production systems, speed matters—but not the
“randomly change things fast” kind. The “establish the failure mode fast” kind.
First: decide if the system is lying or failing loudly
- Look for mismatch between “healthy” signals and user-reported harm. If users see corruption, wrong results, or unsafe behavior, treat dashboards as untrusted.
- Check invariants. Counts, totals, state transitions, safety interlocks. If you don’t have invariants, congratulations: you have vibes.
- Confirm the blast radius. One node? One region? One device? One operator workflow?
Second: isolate concurrency, state, and timing
- Suspect race conditions when: symptoms are intermittent, triggered by speed, or disappear when you add logging.
- Force serialization temporarily (single worker, disable parallelism, lock critical section) to see if the issue vanishes.
- Capture event ordering with timestamps and correlation IDs. If you can’t reconstruct order, you can’t reason about safety.
Third: verify the safety envelope and fail-safe behavior
- Test the “safe state” transition. Does the system stop when uncertain, or does it keep going?
- Check interlocks: hardware, software, configuration gating, feature flags, circuit breakers.
- Confirm operator guidance: do error messages drive correct action, or do they train retries?
Joke #2: If your incident runbook says “restart and see,” that’s not a runbook—it’s a Ouija board with better uptime.
Practical tasks: commands, outputs, and the decision you make from them
Therac-25 didn’t fail because someone forgot a Linux command. But the operational theme—insufficient evidence, slow diagnosis,
and missing invariants—shows up everywhere. Below are concrete tasks you can run on real systems to catch the modern equivalents:
race conditions, unsafe state transitions, “healthy but wrong,” and the removal of safety layers.
The pattern for each task is: command → typical output → what it means → what decision you make.
Task 1: Confirm time synchronization (ordering matters)
cr0x@server:~$ timedatectl status
Local time: Thu 2026-01-22 10:14:07 UTC
Universal time: Thu 2026-01-22 10:14:07 UTC
RTC time: Thu 2026-01-22 10:14:06
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
What it means: “System clock synchronized: yes” reduces the odds that logs lie about event ordering.
Decision: If not synchronized, fix NTP/chrony before trusting multi-node timelines.
Task 2: Check for kernel-level issues that can mimic “random” corruption
cr0x@server:~$ dmesg -T | tail -n 12
[Thu Jan 22 10:11:41 2026] nvme nvme0: I/O 42 QID 3 timeout, aborting
[Thu Jan 22 10:11:41 2026] nvme nvme0: Abort status: 0x371
[Thu Jan 22 10:11:43 2026] EXT4-fs (nvme0n1p2): I/O error while writing superblock
[Thu Jan 22 10:11:44 2026] blk_update_request: I/O error, dev nvme0n1, sector 91827364 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
What it means: Storage timeouts and filesystem errors can produce silent wrongness upstream.
Decision: Stop blaming application code until hardware/driver stability is confirmed; drain the node if errors persist.
Task 3: Establish CPU saturation vs. latency from elsewhere
cr0x@server:~$ mpstat -P ALL 1 3
Linux 6.8.0 (prod-app-17) 01/22/2026 _x86_64_ (32 CPU)
10:14:22 AM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
10:14:23 AM all 18.2 0.0 6.1 22.7 0.0 0.5 0.0 52.5
10:14:23 AM 7 88.9 0.0 9.1 0.0 0.0 0.0 0.0 2.0
What it means: High %iowait suggests the CPUs are idle waiting on I/O; one hot CPU hints at lock contention or single-thread bottleneck.
Decision: If iowait is high, pivot to storage/network checks; if one CPU is pegged, suspect serialization points and mutex contention.
Task 4: See if the box is thrashing memory (timing bugs love pressure)
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 1 0 31248 41216 829320 0 0 1024 2048 980 2100 17 6 55 22 0
6 3 0 28992 40128 821004 0 0 4096 8192 1200 8900 25 8 40 27 0
What it means: High context switches (cs) plus rising blocked processes (b) often signal contention or I/O stalls.
Decision: If pressure is high, reduce concurrency and re-test; timing-dependent failures often vanish when load is reduced, which is diagnostic.
Task 5: Confirm filesystem health and mount options (safety rails matter)
cr0x@server:~$ mount | grep " /var "
/dev/nvme0n1p2 on /var type ext4 (rw,relatime,errors=remount-ro)
What it means: errors=remount-ro is a fail-safe: the filesystem will go read-only on serious errors instead of corrupting silently.
Decision: Prefer fail-safe mounts for critical state; if you see permissive options, fix them before you need them.
Task 6: Check RAID / mdadm degradation (the “it’s fine” trap)
cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
104320 blocks super 1.2 [2/1] [U_]
[====>................] recovery = 23.4% (24448/104320) finish=2.1min speed=64000K/sec
What it means: [2/1] [U_] indicates a degraded mirror. You’re one disk away from a very bad day.
Decision: Treat degraded redundancy as an incident; pause risky changes, replace the failed disk, and verify rebuild completion.
Task 7: Validate ZFS pool status (scrub, errors, silent corruption)
cr0x@server:~$ sudo zpool status -v tank
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Restore the file in question if possible.
scan: scrub repaired 0B in 00:12:44 with 1 errors on Thu Jan 22 09:58:13 2026
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 3
errors: Permanent errors have been detected in the following files:
/tank/db/segments/00000042.log
What it means: ZFS is telling you it detected checksum errors and can point to affected files.
Decision: Restore impacted data from replicas/backups; don’t “monitor and see.” If storage says corruption, believe it.
Task 8: Spot TCP retransmits and packet loss (the invisible timing attacker)
cr0x@server:~$ ss -s
Total: 1532
TCP: 812 (estab 214, closed 539, orphaned 0, timewait 411)
Transport Total IP IPv6
RAW 0 0 0
UDP 29 21 8
TCP 273 232 41
INET 302 253 49
FRAG 0 0 0
What it means: This is a coarse snapshot; it doesn’t show retransmits directly, but it helps detect connection churn and timewait storms.
Decision: If estab is low but timewait is huge, suspect aggressive reconnect patterns; for actual retransmit and drop counters, go deeper with nstat or netstat -s next.
Task 9: Check interface errors and drops
cr0x@server:~$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:ab:cd:ef brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
9876543210 8123456 0 12456 0 12034
TX: bytes packets errors dropped carrier collsns
8765432109 7123456 0 342 0 0
What it means: Drops can manifest as retries, timeouts, and reordering—perfect fuel for race conditions and partial updates.
Decision: If drops climb, investigate NIC queues, MTU mismatches, congestion, and noisy neighbors; don’t tune app retries first.
Task 10: Confirm system-wide file descriptor pressure (can cause bizarre failures)
cr0x@server:~$ cat /proc/sys/fs/file-nr
42352 0 9223372036854775807
What it means: The first number is allocated file handles; if it grows near system limits, you’ll get failures that look “random.”
Decision: If pressure is high, locate leaking processes and fix; also set sane limits and alert on growth rate.
Task 11: See if a service is restarting (flapping hides root causes)
cr0x@server:~$ systemctl status api.service --no-pager
● api.service - Example API
Loaded: loaded (/etc/systemd/system/api.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2026-01-22 10:02:18 UTC; 11min ago
Main PID: 2187 (api)
Tasks: 34 (limit: 38241)
Memory: 512.4M
CPU: 2min 12.103s
CGroup: /system.slice/api.service
└─2187 /usr/local/bin/api --config /etc/api/config.yaml
What it means: If Active keeps resetting, you’re in crash-loop territory; if uptime is stable, focus elsewhere.
Decision: If flapping, stop the bleeding: freeze deploys, reduce traffic, and capture core dumps/logs before restart wipes evidence.
Task 12: Query journald around a suspected incident window (get the sequence)
cr0x@server:~$ journalctl -u api.service --since "2026-01-22 09:55" --until "2026-01-22 10:10" --no-pager | tail -n 8
Jan 22 10:01:02 prod-app-17 api[2187]: WARN request_id=9b3e retrying upstream due to timeout
Jan 22 10:01:02 prod-app-17 api[2187]: WARN request_id=9b3e retry attempt=2
Jan 22 10:01:03 prod-app-17 api[2187]: ERROR request_id=9b3e upstream commit status unknown
Jan 22 10:01:03 prod-app-17 api[2187]: WARN request_id=9b3e applying fallback path
Jan 22 10:01:04 prod-app-17 api[2187]: INFO request_id=9b3e response_status=200
What it means: “Commit status unknown” followed by “fallback path” is a correctness red flag: you may have double-applies or partial state.
Decision: Add idempotency verification before retries; if safety-critical, fail closed instead of “fallback and hope.”
Task 13: Inspect open TCP connections to a dependency (find hotspots)
cr0x@server:~$ ss -antp | grep ":5432" | head
ESTAB 0 0 10.20.5.17:48422 10.20.9.10:5432 users:(("api",pid=2187,fd=41))
ESTAB 0 0 10.20.5.17:48424 10.20.9.10:5432 users:(("api",pid=2187,fd=42))
ESTAB 0 0 10.20.5.17:48426 10.20.9.10:5432 users:(("api",pid=2187,fd=43))
What it means: Confirms the service is talking to Postgres and not stuck elsewhere. Also hints at connection pool behavior.
Decision: If connections explode, cap pools and add backpressure; connection storms are the software equivalent of removing an interlock.
Task 14: Validate database locks (concurrency bottlenecks look like “random slowness”)
cr0x@server:~$ psql -h 10.20.9.10 -U app -d appdb -c "select wait_event_type, wait_event, count(*) from pg_stat_activity where wait_event is not null group by 1,2 order by 3 desc;"
wait_event_type | wait_event | count
-----------------+---------------------+-------
Lock | relation | 12
IO | DataFileRead | 4
Client | ClientRead | 2
(3 rows)
What it means: Many relation lock waits indicate contention; your “fast operator” equivalent might be a hot table.
Decision: If lock waits dominate, identify blocking queries and fix schema/indexing or transaction scope; don’t just scale app nodes.
Task 15: Validate checksum/verification settings for storage replication (safety belt check)
cr0x@server:~$ rbd info rbd/patient-images
rbd image 'patient-images':
size 2 TiB in 524288 objects
order 22 (4 MiB objects)
snapshot_count: 2
id: 1a2b3c4d5e6f
block_name_prefix: rbd_data.1a2b3c4d5e6f
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
op_features:
flags:
create_timestamp: Thu Jan 15 08:21:11 2026
What it means: Not a checksum output itself, but confirms object features; you still need end-to-end integrity checks at the application layer.
Decision: If your system’s correctness depends on integrity, add explicit checksums/hashes at boundaries; don’t outsource truth to one layer.
Task 16: Confirm a process is single-threaded or bottlenecked on a lock
cr0x@server:~$ top -H -p 2187 -b -n 1 | head -n 12
top - 10:14:55 up 12 days, 3:21, 1 user, load average: 6.12, 4.98, 4.77
Threads: 36 total, 1 running, 35 sleeping
%Cpu(s): 23.1 us, 7.2 sy, 0.0 ni, 47.3 id, 22.4 wa, 0.0 hi, 0.0 si, 0.0 st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2199 api 20 0 1452784 524200 49200 R 88.7 1.6 0:31.22 api
2201 api 20 0 1452784 524200 49200 S 2.3 1.6 0:02.10 api
What it means: One thread dominates CPU; others sleep. That’s often a lock, a tight loop, or a single hot path.
Decision: Profile that path; if it’s guarding shared state, redesign toward immutable state or explicit state machines with safe transitions.
Common mistakes: symptoms → root cause → fix
1) Symptom: “The system says OK, but users report wrong outcomes”
Root cause: Health checks measure liveness (process up) instead of correctness (invariants). Therac-25 effectively “passed” its own checks while unsafe.
Fix: Add correctness checks: cross-validation, reconciliations, and invariant dashboards. Gate rollouts on invariants, not just 200s.
2) Symptom: Intermittent failures triggered by speed, retries, or high load
Root cause: Race conditions and non-atomic state transitions. The “fast operator” equivalent is parallel requests or reordered events.
Fix: Model state as a finite state machine, enforce atomic transitions, add idempotency keys, and fuzz timing with chaos tests.
3) Symptom: Operators repeatedly “retry” through errors
Root cause: UI/UX trains unsafe behavior; alarms are vague; the system allows continuation without proof of safety.
Fix: Make unsafe continuation impossible. Use explicit severity and required actions. If safety is uncertain, fail closed.
4) Symptom: Post-incident investigation can’t reconstruct what happened
Root cause: Insufficient logging, missing correlation IDs, lack of timestamps, or logs overwritten on restart.
Fix: Structured logs with correlation IDs, immutable append-only storage for audit trails, and captured state snapshots on fault.
5) Symptom: Fixes are “retrain users” and “be careful”
Root cause: Organization is substituting policy for engineering controls because engineering controls are harder.
Fix: Add hard safety rails: interlocks, permissioned transitions, runtime checks, and independent verification. Training is supplemental, not primary.
6) Symptom: Removing “redundant” checks makes things faster… until it doesn’t
Root cause: Optimization removed detection/correction layers (checksums, validation, confirmation reads), turning rare faults into silent corruption.
Fix: Keep end-to-end validation; if you must optimize, sample intelligently and add canary verification. Never optimize away the only alarm.
7) Symptom: Incidents treated as isolated anomalies, not patterns
Root cause: Weak incident management: no central tracking, no forced escalation, incentives to downplay severity.
Fix: Formal incident response with severity classification, cross-site aggregation, and stop-the-line authority when safety/correctness is at risk.
Checklists / step-by-step plan
Step-by-step: building “Therac-proof” safety rails in software-driven systems
- Write the hazard list like you mean it. Enumerate unsafe states, not just failures. Example: “high-energy mode with wrong physical configuration” maps to “privileged action applied with stale config.”
- Define invariants. What must always be true? Example: “Beam energy and mode must match verified mechanical state” maps to “write path must be consistent with versioned config.”
- Enforce state machines. Every transition must be explicit. No “edit in place” of safety-critical parameters without versioning and atomic commit.
- Fail closed on uncertainty. If sensors disagree or commit status is unknown, stop and escalate. Your system should prefer downtime over wrongness when harm is possible.
- Keep layered interlocks. Use hardware where feasible, but also software gating, access control, and runtime assertions. Remove none without replacing with stronger evidence.
- Design the UI for escalation. Plain language errors, severity levels, and mandatory safe actions. Don’t let “Enter” be a path to continued hazard.
- Test with adversarial timing. Fuzz input speed, reorder events, inject delays, and simulate partial failures. “Normal use” tests prove almost nothing about safety.
- Independent verification and validation. Separate incentives. The team shipping features should not be the only team signing off on safety-critical behavior.
- Instrument for forensics. Capture event ordering, state snapshots, and audit logs. Make “what happened?” answerable within an hour.
- Incident response with stop-the-line authority. One credible report of unsafe behavior triggers containment. No debate-by-email while the system keeps operating.
- Practice the failure. Run game days focused on correctness and safety, not just availability. Include operators in the exercise; they’ll teach you where the UI lies.
- Track recurring weak signals. “Weird error code but it worked after retry” is not a closed ticket. It’s a precursor.
Release checklist for safety-critical or correctness-critical changes
- Does this change remove any validation, checksum, lock, or interlock? If yes: what replaces it, and what evidence proves equivalence?
- Are state transitions versioned and atomic? If not, you are shipping a race condition with extra steps.
- Are invariants measured and gated in canary/staging?
- Do error messages tell operators what to do, not just what happened?
- Can we reconstruct event order across components within 60 minutes using logs and timestamps?
- Is “stop the line” authority documented for the on-call and for operators?
FAQ
1) Was Therac-25 “just a bug”?
No. Bugs existed, including timing-dependent ones, but the lethal outcome required a system design that allowed unsafe states,
weak interlocks, poor diagnostics, and an incident response posture that didn’t treat early signals as emergencies.
2) Why do race conditions keep showing up in safety stories?
Because humans and computers create unpredictable timing. Concurrency creates states you didn’t intend, and tests rarely cover
those states unless you intentionally attack timing with fuzzing, fault injection, and explicit state machine modeling.
3) Isn’t hardware safer than software?
Hardware can be safer for certain guarantees because it is harder to bypass and easier to reason about in failure.
But hardware can fail too. The real answer is layered controls: hardware interlocks where practical, plus software assertions,
monitoring, and fail-safe behavior when something doesn’t add up.
4) What’s the modern equivalent of Therac-25 in non-medical systems?
Any system where software controls a high-energy or high-impact action: industrial automation, autonomous vehicles,
financial trading, identity and access control, and infrastructure orchestration. If software can irreversibly harm
people or violate trust, you’re in the same genre.
5) Why is “cryptic error messages” such a big deal?
Because operators behave rationally under pressure. If an error is vague and the system allows continuation,
retries become the default. A good safety message is a control: it dictates safe action and prevents casual override.
6) How do you test for “fast operator” sequences in modern apps?
Use fuzzing and model-based testing around state transitions. Simulate rapid edits, retries, network delays, and reordering.
In distributed systems, inject latency and packet loss; in UIs, script high-speed interactions and verify invariants on every step.
7) What should incident response look like when correctness (not uptime) is at risk?
Treat it as a severity event. Contain first: pause rollouts, reduce traffic, isolate suspicious nodes, and preserve evidence.
Then verify invariants and reconcile data. Only after you can explain the failure mode should you resume normal operation.
8) Is “blameless postmortem” compatible with accountability?
Yes. Blameless means you don’t scapegoat individuals for systemic issues. Accountability means you fix the system and follow through.
If your postmortems end with “human error,” you’re avoiding accountability, not enforcing it.
9) What’s the single most actionable Therac-25 lesson for engineers?
Don’t allow unsafe states. Encode invariants and enforce them at runtime. If the system cannot prove it’s safe, it must stop.
10) What if business pressure demands removing “slow” checks?
Then you treat the check as a safety control and require a safety case for removal: what replaces it, how it will be monitored,
and what new failure modes it introduces. If nobody can write that down, the answer is no.
Conclusion: next steps that actually reduce risk
Therac-25 is a case study in how systems kill: not with one dramatic failure, but with a chain of small, defensible decisions
that removed friction from unsafe outcomes. It’s also a case study in how organizations fail: by treating early incidents as noise,
trusting “healthy” indicators over human reports, and shipping systems that can’t explain themselves when something goes wrong.
Practical next steps, in order of impact:
- Define and monitor invariants for correctness and safety, and gate releases on them.
- Model state transitions explicitly and eliminate “edit in place” flows for safety-critical parameters.
- Reintroduce or add interlocks: hardware where feasible, software assertions everywhere, and “fail closed” when uncertain.
- Upgrade observability for forensics: timestamps, correlation IDs, immutable logs, and state snapshots on fault.
- Run timing-adversarial tests: fuzzing, fault injection, and game days focused on correctness, not just uptime.
- Fix the operator interface so it teaches safe behavior: clear severity, explicit actions, no casual retries through danger.
If you build systems that can hurt people—physically, financially, or socially—treat safety as a property you must continuously prove.
The day you start trusting “it should be fine” is the day you start writing your own Therac-25 chapter.