Mars Climate Orbiter: The Unit Mismatch That Lost a Spacecraft

You can run world-class infrastructure and still lose the whole mission because a number “looked right.”
Not crashed, not NaN, not on fire. Just quietly wrong—wrong units, wrong assumption, wrong interface—until
physics does the postmortem for you.

Mars Climate Orbiter is the canonical case: a spacecraft was lost because one part of the system produced
data in Imperial units while another expected metric. If you’re thinking “that would never happen in my org,”
congratulations—you have the exact mindset that makes it happen in your org.

What actually happened (and why it mattered)

Mars Climate Orbiter (MCO) launched in 1998 and was supposed to study Martian climate: atmosphere, dust,
water vapor, seasonal changes. It also served as a communications relay for its sibling lander. It never
got to do any of that. During Mars Orbit Insertion in September 1999, the spacecraft flew too low, likely
encountered atmospheric forces it wasn’t designed for, and was lost. Either it burned up or it skipped
off into a solar orbit; in operations, “lost” is a polite word for “we have no idea where it is and can’t
talk to it anymore.”

The root cause, in plain engineering terms: ground software supplied thruster performance data expressed
in pound-force seconds, while the navigation software that consumed it expected newton seconds. Same physical
concept (impulse). Different unit system. That mismatch produced a consistent bias in trajectory estimates.
The spacecraft didn’t suddenly do something crazy. It did something consistently wrong for long enough that
the wrong became terminal.

And here’s the operational lesson most people miss: it wasn’t “a typo.” It was a failure of interface
discipline. The system had multiple opportunities to catch it—reviews, checks, telemetry trends, simulation,
anomaly response. The failure wasn’t one bug; it was a stack of non-bug decisions that kept the bug alive.

Joke #1: Unit mismatches are the only bugs that can be fixed by both a compiler and a tape measure.

Fast facts and historical context

  • Part of the Mars Surveyor ’98 program: MCO was paired with Mars Polar Lander, aiming for a “faster, better, cheaper” cadence.
  • Launched on a Delta II rocket: A workhorse vehicle; the launch itself wasn’t the problem.
  • Loss occurred during orbit insertion: The riskiest phase of the mission, a narrow window where navigation accuracy becomes survival.
  • Mismatch involved impulse data: The spacecraft’s small trajectory corrections depended on accurate modeling of thruster firings.
  • Imperial vs metric: Specifically, pound-force seconds (lbf·s) vs newton seconds (N·s). One lbf is about 4.44822 N.
  • Operational trend was visible: Navigation residuals and trajectory predictions drifted; the system was “off” in a way that should have prompted a stop-and-verify.
  • Interface control matters: The unit expectation was documented; enforcement and validation were not.
  • Program pressure was real: Cost and schedule constraints reduced redundancy and made “we’ll catch it later” culturally acceptable.

Anatomy of the failure: where the system lied

The number wasn’t wrong; the meaning was

In reliability work, there’s a special category of failures: correctly computed nonsense. The math is fine.
The software is “working.” The instrumentation is reporting something consistent. And the operator is interpreting
it through a wrong shared assumption. That’s not a bug in a line of code. That’s a bug in the contract between
teams, tools, and reality.

Unit mismatches are particularly nasty because they’re plausible. A thruster impulse of “10” could be 10 N·s
or 10 lbf·s. Neither looks obviously absurd. And if the downstream system expects one and receives the other,
the error scales linearly—no dramatic spikes, no immediate chaos, just a persistent bias.
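
To make the shape of the failure concrete, here is a minimal sketch (plain Python, illustrative thrust and burn
numbers only) of a producer that reports impulse in lbf·s and a consumer that sums the same numbers as if they were
N·s. Nothing crashes; every estimate is simply off by the same factor.

# A minimal sketch (not flight code) of how a unit mismatch becomes a quiet
# linear bias: the producer reports impulse in lbf*s, the consumer treats the
# same numbers as N*s. Burn durations and thrust values are illustrative.

LBF_TO_N = 4.44822  # 1 pound-force = 4.44822 newtons

def reported_impulse_lbf_s(burn_seconds: float, thrust_lbf: float) -> float:
    """Producer side: impulse in pound-force seconds."""
    return thrust_lbf * burn_seconds

def consumer_accumulates(values_assumed_n_s: list[float]) -> float:
    """Consumer side: sums what it believes are newton-seconds."""
    return sum(values_assumed_n_s)

burns = [(5.0, 0.2), (3.0, 0.2), (8.0, 0.2)]  # (seconds, thrust in lbf)
reported = [reported_impulse_lbf_s(t, f) for t, f in burns]

believed = consumer_accumulates(reported)                      # treated as N*s
actual = consumer_accumulates([v * LBF_TO_N for v in reported])

print(f"believed total impulse: {believed:.2f} N*s")
print(f"actual total impulse:   {actual:.2f} N*s")
print(f"constant bias factor:   {actual / believed:.5f}")      # ~4.44822, every time

The point is the last print: the wrongness is a constant multiplier, which is exactly the kind of error that
passes a glance test.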

Bias beats noise

Operations teams are trained to look for spikes: sudden changes, thresholds, error budgets blown in minutes.
Bias is slower and more patient. Bias can sit inside tolerances for days while eating your margin. It will look
like “we’re a bit off, but still within expected variance,” until you are out of corridor during the one maneuver
that matters.
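
One way to hunt for that kind of patience is to watch the rolling mean of a residual stream rather than individual
values. A minimal sketch follows; the window size, threshold, and synthetic data are illustrative, not
recommendations.

# A minimal sketch of bias-aware alerting: alongside spike thresholds, track
# the rolling mean of residuals and alarm when it settles away from zero.
import random
from collections import deque

def rolling_bias_alarm(residuals, window=50, bias_threshold=0.5):
    """Yield (index, rolling_mean, alarmed) for each new residual."""
    buf = deque(maxlen=window)
    for i, r in enumerate(residuals):
        buf.append(r)
        mean = sum(buf) / len(buf)
        # Only alarm on a full window, so startup noise doesn't page anyone.
        alarmed = len(buf) == window and abs(mean) > bias_threshold
        yield i, mean, alarmed

# Synthetic residuals: noise around zero, then a persistent offset appears.
random.seed(1)
stream = [random.gauss(0, 0.3) for _ in range(200)] + [random.gauss(1.0, 0.3) for _ in range(200)]

for i, mean, alarmed in rolling_bias_alarm(stream):
    if alarmed:
        print(f"bias alarm at sample {i}: rolling mean {mean:.2f}")
        break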

Interfaces are where reliability goes to die

The most dangerous code in any system is the code that “just” transforms data between two subsystems. The mapping
layer. The ETL job. The adapter. The script that rewrites a field. The wrapper that “fixes” an API. That’s where
type information disappears, where semantics get hand-waved, where test coverage gets thin, and where “everyone knows”
becomes a substitute for verification.

In MCO, the interface between thrust modeling/ops data and navigation software effectively stripped the semantics down
to numbers. Humans “knew” the units. The system didn’t. When humans rotated, or assumptions drifted, the system had no
guardrails.

A reliability quote worth taping to your monitor

“Failure is not an option” is the line everyone attributes to Gene Kranz, the NASA flight director; it was actually
written for the film Apollo 13, and Kranz liked it enough to make it the title of his memoir. If you’ve seen it
printed on posters, you’ve also seen people use it as a substitute for engineering. Here’s the adult version:
failure is always an option. Your job is to make it expensive for small failures to become catastrophic ones.

Handoffs, interfaces, and the “unit boundary”

Let’s talk about what “unit mismatch” really means in a modern org. It’s not just metric vs Imperial. It’s any
semantic mismatch:

  • milliseconds vs seconds
  • bytes vs kibibytes
  • UTC vs local time
  • inclusive vs exclusive ranges
  • signed vs unsigned values
  • per-second rates vs per-minute rates
  • compressed vs uncompressed sizes
  • “customer_id” meaning account vs user vs household

These failures cross team boundaries. That matters because teams optimize locally. A team shipping a component may
document units and move on. The receiving team might read that doc once, integrate, and then rely on memory.
Memory is not an operational control.

ICDs are not paperwork; they are executable constraints

Space programs use Interface Control Documents (ICDs) to specify how systems talk: formats, timing, units, tolerances.
In enterprise systems, we often have the same thing in weaker forms: OpenAPI specs, protobuf schemas, JSON examples,
data contracts, runbooks, “tribal knowledge.” The failure mode is the same when those artifacts are not enforced by tests
and runtime checks.

If your interface spec isn’t tested, it’s not a spec. It’s a bedtime story.
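
Here is a minimal sketch of what “tested spec” can mean in practice, assuming the interface is described in an
OpenAPI-style openapi.json like the one inspected in Task 3 below. The rule is deliberately blunt, and the suffix
list is illustrative: every numeric property must carry a unit, either as a suffix in its name or as an x-unit
annotation.

# A minimal sketch of treating the spec as an executable constraint: every
# numeric property must declare its unit, by name suffix or by annotation.
import json

UNIT_SUFFIXES = ("_ms", "_seconds", "_bytes", "_newton_seconds", "_c")  # illustrative

def numeric_fields_missing_units(schema: dict) -> list[str]:
    missing = []
    for name, prop in schema.get("properties", {}).items():
        if prop.get("type") in ("number", "integer"):
            has_suffix = name.endswith(UNIT_SUFFIXES)
            has_annotation = "x-unit" in prop
            if not (has_suffix or has_annotation):
                missing.append(name)
    return missing

def test_telemetry_schema_declares_units():
    with open("openapi.json") as f:
        spec = json.load(f)
    telemetry = spec["components"]["schemas"]["Telemetry"]
    assert numeric_fields_missing_units(telemetry) == []

Run it in CI next to the schema. Against the openapi.json shown in Task 3, it fails on the bare impulse field,
which is exactly the conversation you want to have in review rather than in an incident.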

Why reviews don’t catch this reliably

People love to say: “How did code review miss that?” Because review is a human sampling process, not a proof system.
A unit mismatch is a semantic error. The code can look perfectly reasonable. The reviewer may not have full context,
may not know the upstream unit, and may not notice that a variable named impulse should have been impulse_newton_seconds.

Reviews help, but they don’t enforce. Enforcement comes from:

  • typed units or explicit naming conventions (see the sketch after this list)
  • contract tests between producer and consumer
  • runtime assertions and validation
  • end-to-end simulation in realistic conditions
  • telemetry sanity checks with alarms tuned for bias, not just spikes
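
As a sketch of the first item: typed units don’t require a heavyweight framework. Wrapping values in small types
makes mixing lbf·s and N·s a test failure instead of a production bias. (Libraries such as pint do this far more
thoroughly; the classes below are just the shape of the idea.)

# A minimal sketch of "typed units" with no third-party library: impulse
# values are wrapped in small classes so mixing lbf*s and N*s is a TypeError
# you hit in tests, not a bias you meet in production.
from dataclasses import dataclass

LBF_TO_N = 4.44822

@dataclass(frozen=True)
class NewtonSeconds:
    value: float

    def __add__(self, other: "NewtonSeconds") -> "NewtonSeconds":
        if not isinstance(other, NewtonSeconds):
            raise TypeError(f"cannot add {type(other).__name__} to NewtonSeconds")
        return NewtonSeconds(self.value + other.value)

@dataclass(frozen=True)
class PoundForceSeconds:
    value: float

    def to_newton_seconds(self) -> NewtonSeconds:
        return NewtonSeconds(self.value * LBF_TO_N)

# The conversion happens exactly once, at the boundary, on purpose.
ground_report = PoundForceSeconds(12.5)
total = NewtonSeconds(0.0) + ground_report.to_newton_seconds()
print(total)                                   # about 55.6 N*s
# NewtonSeconds(1.0) + PoundForceSeconds(1.0)  # would raise TypeError

The design choice that matters is the single conversion point at the boundary; everything downstream only ever
sees NewtonSeconds.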

What this looks like in production systems

If you run storage, fleets, or data platforms, you’ve already lived some version of MCO. The unit mismatch becomes:

  • a latency SLO measured in milliseconds, but the dashboard plots seconds
  • a rate limiter configured in requests/minute, but the client assumes requests/second
  • a backup retention “30” interpreted as days by one job and hours by another
  • a storage system reporting “GB” (decimal) while finance expects “GiB” (binary)

The moral is not “use metric.” The moral is: make units explicit and verifiable at every boundary.

Joke #2: The fastest way to reduce cloud spend is to accidentally treat milliseconds as seconds—until the CFO meets your outage report.

Fast diagnosis playbook

When you suspect an interface/units/semantic mismatch, do not start by rewriting code or adding retries. Retries are
how you turn a wrong value into a wrong value faster. Instead, run a three-pass diagnosis that prioritizes “what changed”
and “what assumptions are unverified.”

First: prove the contract at the boundary

  1. Capture a real payload (telemetry frame, API request/response, file record) from production.
  2. Compare it against the spec (schema, ICD, protobuf definition, unit expectations).
  3. Check units and reference frames: time base, coordinate system, scaling factors, compression.

If the payload doesn’t self-describe units, assume you are already in danger.

Second: look for bias, not just spikes

  1. Plot residuals (prediction vs observation) over time.
  2. Look for monotonic drift or consistent offset after a deployment or configuration change.
  3. Compare multiple independent signals if possible (e.g., nav estimate vs raw sensor, API derived metric vs raw log count).

Third: reproduce with a “known-good” input

  1. Find a fixture: a historical payload, golden dataset, recorded maneuver sequence.
  2. Run the pipeline/component end-to-end.
  3. Confirm output matches expected within tolerance.

If you can’t define “expected,” your system is not testable, and reliability becomes a betting hobby.

Practical tasks: commands, outputs, decisions

Below are practical tasks you can run on real Linux systems to detect unit/contract mismatches, drift, and “numbers with
no meaning.” Each task includes a command, what the output means, and the decision you make.

Task 1: Find where units are implied (grep for suspicious fields)

cr0x@server:~$ rg -n "ms|millis|seconds|secs|ns|newton|lbf|pound|unit|scale|multiplier|conversion" /etc /opt/app/config
/opt/app/config/telemetry.yaml:41:  impulse_scale: 1.0
/opt/app/config/telemetry.yaml:42:  impulse_unit: "lbf*s"

What it means: Your config explicitly says lbf·s. Good—at least it’s written down.

Decision: Verify every consumer expects the same unit; add a startup assertion that rejects unknown unit strings.
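
A minimal sketch of that startup assertion, assuming the impulse_unit key sits at the top level of telemetry.yaml
(adjust the lookup to match the real nesting) and that PyYAML is available:

# A minimal sketch of a fail-fast unit check at startup: unknown or missing
# unit strings stop the service instead of silently falling back to a default.
import sys
import yaml  # PyYAML

KNOWN_IMPULSE_UNITS = {"N*s", "lbf*s"}

def load_impulse_unit(path: str = "/opt/app/config/telemetry.yaml") -> str:
    with open(path) as f:
        config = yaml.safe_load(f)
    unit = config.get("impulse_unit")
    if unit not in KNOWN_IMPULSE_UNITS:
        # Refuse to start: an ambiguous unit is worse than a crash loop.
        sys.exit(f"fatal: impulse_unit={unit!r} is not one of {sorted(KNOWN_IMPULSE_UNITS)}")
    return unit

if __name__ == "__main__":
    print(f"impulse unit locked to {load_impulse_unit()}")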

Task 2: Check service build/version skew across a fleet

cr0x@server:~$ ansible -i inventory all -m shell -a 'myservice --version'
host-a | CHANGED | rc=0 >>
myservice 2.3.1
host-b | CHANGED | rc=0 >>
myservice 2.2.9

What it means: Mixed versions. If unit handling changed between versions, you may have split-brain semantics.

Decision: Freeze rollouts; force convergence to one version; run contract tests against the oldest still running node.

Task 3: Inspect OpenAPI/protobuf/schema for explicit unit annotations

cr0x@server:~$ jq '.components.schemas.Telemetry.properties.impulse' openapi.json
{
  "type": "number",
  "description": "Thruster impulse",
  "example": 12.5
}

What it means: “Impulse” without units is a trap. The example doesn’t help.

Decision: Update schema descriptions to include units; add an x-unit extension or rename fields (e.g., impulse_newton_seconds).

Task 4: Validate a payload against schema (catch missing/extra fields)

cr0x@server:~$ python3 -m jsonschema -i sample_payload.json telemetry_schema.json
sample_payload.json validated successfully

What it means: Structural shape is fine. This does not validate semantics like units.

Decision: Add semantic validation: acceptable ranges and unit tags; consider rejecting payloads missing unit metadata.
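
A minimal sketch of that semantic layer, using the jsonschema library with a pinned unit enum and an illustrative
range; the field names mirror the payloads shown in this post, the limits are made up.

# A minimal sketch of the semantic checks a structural schema pass skips:
# the unit string is pinned to an enum and the value to a plausible range.
from jsonschema import validate, ValidationError

TELEMETRY_SEMANTIC_SCHEMA = {
    "type": "object",
    "required": ["impulse", "impulse_unit"],
    "properties": {
        "impulse": {"type": "number", "minimum": 0, "maximum": 500},
        "impulse_unit": {"enum": ["N*s", "lbf*s"]},
    },
}

def check_payload(payload: dict) -> None:
    try:
        validate(instance=payload, schema=TELEMETRY_SEMANTIC_SCHEMA)
    except ValidationError as err:
        # Reject, don't guess: a payload without unit metadata is not data.
        raise SystemExit(f"rejecting payload: {err.message}")

check_payload({"impulse": 12.5, "impulse_unit": "lbf*s"})  # passes
check_payload({"impulse": 12.5})                           # rejected: missing unit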

Task 5: Spot scale-factor drift in time series (quick-and-dirty)

cr0x@server:~$ awk '{sum+=$2; n++} END{print "avg=",sum/n}' impulse_residuals.txt
avg= 4.43

What it means: Average residual is biased (not centered near zero). That’s a smell for unit/scaling mismatch.

Decision: Compare the bias magnitude to known conversion factors (e.g., ~4.448 for lbf→N). If it matches, stop guessing and investigate unit boundary.
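
A minimal sketch of the “stop guessing” step: compare the observed bias (or ratio) against the usual suspects.
The 4.43 is the awk average from above; the candidate table is the part worth keeping around.

# A minimal sketch of matching an observed bias against known conversion factors.
SUSPECT_FACTORS = {
    "lbf -> N": 4.44822,
    "minutes -> seconds": 60.0,
    "seconds -> milliseconds": 1000.0,
    "KiB -> bytes": 1024.0,
}

def closest_suspect(observed: float, tolerance: float = 0.05):
    """Return (label, factor) if the observed value is within tolerance of a known factor."""
    for label, factor in SUSPECT_FACTORS.items():
        if abs(observed - factor) / factor <= tolerance:
            return label, factor
    return None

print(closest_suspect(4.43))   # ('lbf -> N', 4.44822): go look at the unit boundary
print(closest_suspect(1.02))   # None: the bias is probably not a unit conversion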

Task 6: Check NTP/time sync (time base mismatches mimic unit bugs)

cr0x@server:~$ timedatectl
               Local time: Wed 2026-01-22 10:48:11 UTC
           Universal time: Wed 2026-01-22 10:48:11 UTC
                 RTC time: Wed 2026-01-22 10:48:11
                Time zone: UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

What it means: Time is sane and in UTC. Good. Unit mismatches often hide behind time drift.

Decision: If unsynchronized, fix time first; otherwise you’ll misdiagnose causality and correlation.

Task 7: Confirm log timestamps and units in structured logs

cr0x@server:~$ jq -r '.ts,.duration_ms' /var/log/myservice/events.json | head
2026-01-22T10:46:01.120Z
250
2026-01-22T10:46:01.421Z
280

What it means: Duration field names include “_ms”. That’s a small miracle.

Decision: Enforce suffix conventions (_ms, _bytes, _ns) via linting and code review checklist.

Task 8: Trace an API call end-to-end to see unit translation points

cr0x@server:~$ curl -sS -H 'X-Request-ID: diag-123' http://localhost:8080/telemetry | jq .
{
  "impulse": 12.5,
  "impulse_unit": "lbf*s",
  "burn_duration_ms": 250
}

What it means: Payload self-describes units for impulse but not necessarily for everything else; duration is explicit.

Decision: Ensure all unit-bearing fields are explicit; if consumers ignore impulse_unit, treat that as a bug and break the build.

Task 9: Confirm which config is live (avoid “fixed in git, broken in prod”)

cr0x@server:~$ systemctl show myservice -p FragmentPath -p Environment
FragmentPath=/etc/systemd/system/myservice.service
Environment=UNIT_SYSTEM=imperial IMPULSE_UNIT=lbf*s

What it means: The service is explicitly set to imperial via environment. That can override code defaults.

Decision: Track config provenance; if a unit system flag exists, lock it down and alert on changes.

Task 10: Identify recent changes (units break at the edges of change)

cr0x@server:~$ git log -n 5 --oneline -- config/telemetry.yaml
a91b2cf set impulse scale
71c0d1a rename burn_duration to burn_duration_ms
9f10bb2 initial telemetry config

What it means: Someone touched impulse scaling recently. That’s your prime suspect.

Decision: Tie the change to an incident timeline; require a test proving the new scale matches expected physics/business rules.

Task 11: Check metric queries for unit confusion (Prometheus example)

cr0x@server:~$ promtool query instant http://localhost:9090 'rate(myservice_impulse_total[5m])'
{job="myservice"} 0.21

What it means: A rate is returned, but the unit is unclear. Is this N·s per second? lbf·s per second? Just “impulses”?

Decision: Encode units in metric names and HELP text; e.g., myservice_impulse_newton_seconds_total and reject ambiguous metrics at review.
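
A minimal sketch of a lint that enforces the naming rule, meant to be fed the text of a /metrics scrape on stdin.
Pure counters are legitimately unitless, so in practice you would add an allowlist; the token list is illustrative
and the script name (metric_name_lint.py) is just a suggestion.

# A minimal sketch of a metric-name lint: unit-bearing metrics must carry a
# unit token in the name, or the check fails and CI blocks the change.
import re
import sys

UNIT_TOKENS = ("seconds", "milliseconds", "bytes", "newton_seconds", "ratio", "celsius")

def ambiguous_metric_names(exposition_text: str) -> list[str]:
    names = set(re.findall(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)", exposition_text, re.M))
    return sorted(n for n in names if not any(tok in n for tok in UNIT_TOKENS))

if __name__ == "__main__":
    bad = ambiguous_metric_names(sys.stdin.read())
    if bad:
        sys.exit("metrics without a unit token in the name:\n  " + "\n  ".join(bad))

Something like curl -s localhost:PORT/metrics | python3 metric_name_lint.py in CI or a canary job turns ambiguous
names into a build failure instead of a debate.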

Task 12: Verify dashboard axis units (Grafana JSON check)

cr0x@server:~$ jq -r '.panels[] | select(.title=="Burn duration") | .fieldConfig.defaults.unit' dashboard.json
ms

What it means: Panel explicitly uses milliseconds. That prevents an operator from eyeballing “250” as “250 seconds.”

Decision: Standardize dashboard units; enforce via CI on dashboard JSON exports.

Task 13: Detect silent scaling changes in binaries (strings can betray you)

cr0x@server:~$ strings /usr/local/bin/myservice | rg -n "lbf|newton|N\\*s|unit"
10233:IMPULSE_UNIT
10234:lbf*s

What it means: The binary contains a default unit string. If config is missing, this default might silently apply.

Decision: Replace defaults with “must specify” behavior for unit-bearing config; fail fast at startup.

Task 14: Confirm your ingestion pipeline preserves unit fields

cr0x@server:~$ jq -r '.impulse_unit' raw_event.json enriched_event.json
lbf*s
null

What it means: Enrichment dropped impulse_unit. Now downstream analytics will assume.

Decision: Treat unit fields as required; block pipeline deploys that drop them; add schema evolution tests.
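
A minimal sketch of that guardrail as a test you can point at recorded events; the sample dicts mirror the
raw/enriched pair above, and the field list is the part to grow over time.

# A minimal sketch of a pipeline guardrail: unit-bearing fields must survive
# every transformation, or the deploy gets blocked.
UNIT_FIELDS = ("impulse_unit",)  # extend as the schema grows

def assert_units_preserved(raw: dict, enriched: dict) -> None:
    for field in UNIT_FIELDS:
        if enriched.get(field) != raw.get(field):
            raise AssertionError(
                f"{field} was {raw.get(field)!r} upstream but "
                f"{enriched.get(field)!r} after enrichment"
            )

raw = {"impulse": 12.5, "impulse_unit": "lbf*s"}
enriched = {"impulse": 12.5, "impulse_unit": None, "region": "eu-1"}  # unit dropped

assert_units_preserved(raw, raw)        # passes: nothing lost
assert_units_preserved(raw, enriched)   # raises: block the deploy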

Three corporate mini-stories (real enough to hurt)

Mini-story 1: The incident caused by a wrong assumption

A payments platform had two services: one calculated “risk score,” another enforced fraud rules. The scoring service
emitted a field called score as an integer from 0 to 1000. The enforcement service believed it was 0.0 to 1.0.
Both teams had documentation. Both teams believed the other had read it. Neither team had a contract test.

The integration passed basic tests because the sample data used small scores and the rule thresholds were set loosely
during early rollout. Then a new model shipped with a wider spread. Suddenly, enforcement started rejecting a clean slice
of traffic. Not all. Just enough to light up customer support and ruin an otherwise calm week.

On-call initially chased latency and database errors because the symptom looked like retries and timeouts: customers were
re-submitting payments. The platform didn’t crash; it just said “no” more often, which is operationally worse because it
looks like a policy decision, not a technical failure.

The fix was embarrassingly simple: rename the field to score_milli (0–1000), add score_unit, and add
an integration test that fails if a score > 1.0 arrives in the “ratio” API. The real fix was cultural: forcing both teams
to treat data contracts as code, not as wiki pages.

Mini-story 2: The optimization that backfired

A storage team optimized an ingestion pipeline by switching from JSON to a compact binary format and “saving bytes” by
dropping descriptive fields. Among the dropped fields: block_size and its unit. Everyone “knew” it was bytes.
Except the consumer team that had historically treated it as kilobytes because their UI labeled it “KB.”

The first sign of trouble was subtle: dashboards showed a slow, steady change in “average block size.” No alert fired.
The platform kept working. But capacity planning started drifting. The organization began buying hardware too early in one
region and too late in another. That mismatch is how you get paged at 3 a.m. by finance.

The optimization also removed the ability to sanity-check values in logs. Operators lost observability in exchange for
a few percent throughput gain. This is a common failure mode: a local optimization that degrades the global system’s
ability to detect errors.

They eventually reintroduced units—not by adding a string to every record (which would be wasteful), but by versioning the
schema and embedding unit metadata in the schema registry. They also added a canary consumer that validates statistical
properties (like typical block sizes) and pages humans when distribution shifts beyond expected tolerance.

Mini-story 3: The boring but correct practice that saved the day

A team running a fleet of database backups had an unglamorous habit: every restore test logged not just “success,” but
restore duration, bytes restored, and throughput with explicit units in the output. They also stored
those results in a small time series database. It was boring. People occasionally complained about “process overhead.”

Then a vendor update changed a default compression setting. Restore throughput dropped and restore duration climbed, but
the job still returned exit code 0. Nobody would have noticed until the day they needed a restore—when it would have been
too late to renegotiate RTO with reality.

Because the team had trendable, unit-explicit metrics, they caught the regression in a weekly review. They rolled back the
change, documented the new default, and added a guardrail: if restore throughput falls below a threshold for two consecutive
tests, the pipeline blocks production rollouts.

The “boring practice” was not the logging. It was the discipline of making the restore test a first-class production signal.
That’s how you survive: not with heroics, but with habits that assume the system will lie to you unless forced not to.

Common mistakes: symptoms → root cause → fix

1) “Everything is within limits” until it isn’t

Symptoms: Slow drift in residuals, gradually worsening accuracy, intermittent “close call” events.

Root cause: Bias introduced by scaling/unit mismatch; alerting tuned for spikes not offsets.

Fix: Add bias detection: rolling mean alarms, CUSUM-style checks, and explicit conversion validation at ingestion.
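
For the CUSUM-style check, a minimal one-sided sketch: it accumulates evidence of a positive offset and alarms when
the cumulative sum crosses a limit. The slack and limit values are illustrative and need tuning against the real
residual stream.

# A minimal sketch of a one-sided CUSUM check for positive drift in residuals.
def cusum_alarm(residuals, target=0.0, slack=0.2, limit=3.0):
    """Return the index where positive drift is detected, or None."""
    s = 0.0
    for i, x in enumerate(residuals):
        s = max(0.0, s + (x - target - slack))
        if s > limit:
            return i
    return None

steady = [0.05, -0.1, 0.0, 0.1, -0.05] * 20
drifting = steady + [0.6] * 10          # small persistent offset, no spike
print(cusum_alarm(steady))              # None
print(cusum_alarm(drifting))            # index where the bias is flagged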

2) Confident dashboards, wrong axis

Symptoms: Operators swear the system is fine because charts look normal; incidents are “surprising.”

Root cause: Visualization layer assumes units; panels use default unit formatting; fields renamed without updating units.

Fix: Enforce dashboard units via CI; mandate unit suffixes in metric names; add “unit” labels where supported.

3) Interface documentation exists, but reality diverges

Symptoms: Producer and consumer teams both cite docs; integration still breaks.

Root cause: Docs not tied to automated tests; contract changes shipped without consumer validation.

Fix: Implement consumer-driven contract tests; treat schema changes as versioned releases with compatibility gates.

4) “We’ll just convert it later”

Symptoms: Conversion logic duplicated across services; inconsistent outputs; hard-to-reproduce bugs.

Root cause: No single source of truth for unit conversions; ad-hoc conversion code in each client.

Fix: Centralize conversion in one well-tested library; forbid local conversions unless justified and reviewed.

5) Silent dropping of semantic fields in pipelines

Symptoms: Downstream reports diverge after “performance optimization”; unit fields become null.

Root cause: ETL/enrichment jobs treat unit fields as optional; schema evolution not tested.

Fix: Make unit-bearing fields required; add pipeline checks to detect lost fields; version schemas and enforce compatibility.

6) Mixed-version semantics during rollout

Symptoms: Only some nodes behave “wrong”; incident seems random; canary looks fine but fleet isn’t.

Root cause: Unit handling changed in a new version; partial rollout causes inconsistent interpretation.

Fix: Use strict backward compatibility; include versioned unit metadata; block rollouts when mixed semantics are detected.

Checklists / step-by-step plan

Step-by-step plan: make unit mismatches hard to ship

  1. Inventory unit-bearing fields across APIs, telemetry, logs, metrics, and configs. If it’s a number, assume it has units.
  2. Rename fields to include units where practical: duration_ms, size_bytes, temp_c.
  3. Add unit metadata when renaming is not possible: unit fields or schema annotations.
  4. Implement contract tests between producers and consumers; block merges on mismatch.
  5. Fail fast at startup when unit config is missing or unknown; defaults are how ambiguity survives.
  6. Define allowed ranges (min/max) for key signals; unit mismatches often produce values outside physical/business plausibility.
  7. Bias-aware alerting: rolling mean, drift detectors, residual tracking.
  8. Golden fixtures: known inputs with known outputs, including conversions (e.g., lbf→N); see the sketch after this list.
  9. Run end-to-end simulations on realistic payloads; synthetic happy paths are where bugs go to hide.
  10. Operational reviews: every on-call handoff includes “what assumptions are we making about units/time/frames?”
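
A minimal sketch of item 8. The convert_impulse() function below is a stand-in for whatever central conversion
library you actually own; the fixtures are the part worth copying, because they pin known inputs to known outputs,
including the lbf·s to N·s case.

# A minimal sketch of golden fixtures for the conversion path.
import math

LBF_TO_N = 4.44822

def convert_impulse(value: float, src: str, dst: str) -> float:
    """Stand-in for the single, well-tested central conversion library."""
    factors = {("lbf*s", "N*s"): LBF_TO_N, ("N*s", "lbf*s"): 1 / LBF_TO_N}
    return value if src == dst else value * factors[(src, dst)]

GOLDEN_CASES = [
    # (value, from_unit, to_unit, expected)
    (1.0, "lbf*s", "N*s", 4.44822),
    (12.5, "lbf*s", "N*s", 55.60275),
    (7.0, "N*s", "N*s", 7.0),
]

def test_conversion_golden_fixtures():
    for value, src, dst, expected in GOLDEN_CASES:
        got = convert_impulse(value, src, dst)
        assert math.isclose(got, expected, rel_tol=1e-6), (
            f"{value} {src} -> {dst}: expected {expected}, got {got}"
        )

test_conversion_golden_fixtures()
print("golden fixtures pass")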

Pre-flight integration checklist (use it like you mean it)

  • All numeric fields have explicit units in name, schema, or metadata.
  • Every conversion happens in one library/module, not scattered.
  • Every interface has a compatibility test in CI.
  • Dashboards specify axis units and do not rely on defaults.
  • Alerts include drift/bias detection, not just thresholds.
  • Rollout plan includes mixed-version behavior analysis.
  • Runbook includes “unit mismatch” as a first-class hypothesis.

FAQ

1) Was Mars Climate Orbiter lost solely because of one unit conversion bug?

The unit mismatch was the triggering technical cause, but the mission was lost because multiple safety nets didn’t catch it:
interface enforcement, validation, operational anomaly response, and review rigor.

2) Why didn’t they detect it earlier?

Because the error manifested as a gradual bias rather than an obvious failure. Drift is easy to rationalize as noise,
especially under schedule pressure and when the system still “looks stable.”

3) What’s the modern equivalent of lbf·s vs N·s in cloud systems?

Seconds vs milliseconds, bytes vs MiB, UTC vs local time, and rate units (per-second vs per-minute) are the big ones.
Also “percent” vs “fraction” (0–100 vs 0–1) shows up constantly in ML and risk systems.

4) Should we store units with every data point?

Not necessarily with every record. You can store units in schemas, registries, or versioned metadata—as long as
consumers can’t accidentally ignore them and still proceed.

5) Isn’t naming fields with suffixes like _ms ugly?

Yes. It’s also effective. Reliability is full of ugly things that keep you alive: circuit breakers, retries with jitter,
and suffixes that prevent humans from hallucinating meaning.

6) What’s the best way to prevent unit bugs in code?

Use unit-aware types where possible (or strong typedef patterns), centralize conversions, and write tests that compare to
known conversion constants and plausible ranges. Most importantly: enforce contracts between services.

7) How do I detect a unit mismatch when I’m already in an incident?

Look for consistent ratios in errors (e.g., ~1000x, ~60x, ~4.448x). Compare independent measurements. Validate the raw
payload at the boundary and search for dropped metadata fields.

8) What if two teams disagree about units and both have “documentation”?

Documentation is not an authority; production is. Capture real payloads, write a failing contract test that encodes the
correct unit expectation, and version the interface so the ambiguity can’t return quietly.

9) Why do these failures keep happening even in mature organizations?

Because organizations scale faster than shared context. Interfaces multiply, and semantics get lost in handoffs. The only
durable fix is to make assumptions executable: types, tests, runtime checks, and enforced schemas.

Conclusion: next steps you can take this week

Mars Climate Orbiter wasn’t defeated by Mars. It was defeated by an interface that let meaning evaporate. That’s the
real operational hazard: numbers that travel without their semantics, crossing organizational boundaries where “everyone
knows” turns into “nobody checked.”

Practical next steps:

  1. Pick your top 10 numeric fields in telemetry or APIs and make units explicit (name or metadata).
  2. Add one contract test that fails on unit mismatch—make it block merges.
  3. Add one drift/bias alert on a key residual metric.
  4. Remove one default unit setting and force explicit configuration.
  5. Run a game day where the injected fault is “units changed upstream.” Watch how long it takes to find.

If you do only one thing: treat interface semantics as production-critical. Because they are. The spacecraft doesn’t care
that your org chart is complicated.
