Time is a dependency. Treat it like one, or it will treat you like an amateur. If you’ve ever chased a “random” latency spike,
a flaky TLS error, or a database failover that “shouldn’t have happened,” you’ve already met time drift in the parking lot after work.
In 1991, a tiny timing error in the Patriot missile system accumulated during long uptime and contributed to a failure to intercept an incoming missile.
In production terms: the system ran too long, the error budget ran out, and reality didn’t care that the math was “close enough.”
What happened (and why “small errors” aren’t small)
The Patriot system tracks targets by predicting where they will be, not just where they are. That prediction depends on time.
If your time estimate is off, your predicted position is off. If the target is fast enough, “off” becomes “missed.”
The failure pattern is painfully familiar to anyone who runs long-lived systems:
a calculation involves a rounded constant; the rounding error is tiny; the system is designed with a typical uptime in mind;
then operations extends uptime because the environment demands it; error accumulates; a threshold flips; the system behaves badly at the worst moment.
Here’s the uncomfortable operational truth: bugs that depend on long uptime are not “rare.” They are “scheduled.”
The clock is literally counting down to your outage.
A single quote that belongs on an incident wall
“Hope is not a strategy.” — a maxim repeated so often in reliability and operations circles that nobody agrees on who said it first
If you’re building or operating systems that must behave under stress, treat time like a critical subsystem.
Measure it. Monitor it. Budget for its failure modes. And never assume “the clock is fine” without evidence.
Historical facts that matter to engineers
- The Patriot was originally designed for aircraft, then adapted for ballistic missile defense. The operational environment changed faster than the software culture did.
- The failure occurred in 1991 during the Gulf War, under real combat conditions with sustained operations and high-stakes constraints.
- The key issue involved time conversion: converting a counter (ticks) into seconds using fixed-point arithmetic and a rounded constant.
- The rounding error was tiny per conversion, but it accumulated with uptime. Small per-event errors become big over long horizons.
- The system’s tracking used prediction, meaning time error turns into position error. Prediction amplifies timing mistakes.
- Longer-than-expected continuous operation increased the drift past a tolerable threshold. The system worked “fine” until it didn’t.
- A software update reportedly existed to mitigate the issue, but fielding changes in wartime is hard, slow, and sometimes politically messy.
- The incident became a textbook case used in software engineering courses about numeric precision, requirements drift, and operational assumptions.
Notice how few of those facts are “math facts.” Most of them are operational facts:
what it was designed for, how it was used, and how long it stayed up. That’s the theme.
The bug mechanics: fixed-point time, rounding, and accumulating drift
Let’s talk mechanics without turning this into a numerical analysis seminar.
The Patriot system used an internal clock that counted time in tenths of a second (or a similar tick-based unit; the important part is that it was an integer tick counter, not a floating-point reading of seconds).
To predict target location, the software needed time in seconds.
Converting ticks to seconds is conceptually easy:
- ticks = integer counter
- seconds = ticks × 0.1
The trap is how you represent 0.1 in a computer that prefers binary. In binary, many decimal fractions are repeating fractions.
0.1 cannot be represented exactly in a finite number of binary digits. So you approximate it.
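You can make that concrete in two lines of Python. This shows IEEE-754 doubles rather than the Patriot’s fixed-point registers, but the underlying problem, a finite binary approximation of one tenth, is the same:
# The double standing in for 0.1 is very slightly larger than one tenth.
print(f"{0.1:.20f}")            # 0.10000000000000000555
print(0.1 + 0.1 + 0.1 == 0.3)   # False: three additions already expose the error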
Fixed-point arithmetic: the embedded engineer’s bargain
In constrained systems (historically especially), floating point could be expensive or unavailable, so engineers use fixed-point:
represent real numbers as integers with an implied scale. Example: store seconds in units of 2^-N, or store “0.1” as an integer ratio.
That bargain comes with a bill: you have to choose how many bits of precision you carry, and when you round.
Rounding once is fine. Rounding repeatedly in a loop that runs for hours is a slow-motion incident.
The specific failure shape: “error per tick” × “ticks since boot”
The drift behaves like this:
- You approximate a conversion factor (like 0.1 seconds per tick) with limited precision.
- Each conversion introduces a tiny error (often a fraction of a tick).
- Over many ticks, that fractional error accumulates into a measurable time offset.
- The time offset becomes a position offset through velocity: position_error ≈ velocity × time_error.
That last line is the one that should make your stomach drop. If a target is moving fast, even tens of milliseconds matter.
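Here is a minimal sketch of that arithmetic. The register width, tick size, and target speed are illustrative assumptions chosen to mirror the commonly cited analysis of the incident, not the Patriot’s actual code:
# Truncate 0.1 to a fixed number of fractional bits, then let the per-tick
# error accumulate with uptime. All constants are assumptions for illustration.
FRACTIONAL_BITS = 23                  # illustrative register precision
TICK_SECONDS = 0.1                    # one tick = a tenth of a second

stored = int(TICK_SECONDS * 2**FRACTIONAL_BITS) / 2**FRACTIONAL_BITS  # truncation, not rounding
per_tick_error = TICK_SECONDS - stored                                # ~9.5e-08 s per tick

uptime_hours = 100
ticks = uptime_hours * 3600 * 10
time_error = ticks * per_tick_error                                   # ~0.34 s

velocity_m_s = 1676                   # roughly the closing speed of a fast ballistic target
print(f"time error after {uptime_hours}h: {time_error:.3f} s")
print(f"position error at {velocity_m_s} m/s: {time_error * velocity_m_s:.0f} m")
With these assumptions you get roughly a third of a second of time error after 100 hours, which at that speed is several hundred meters of predicted position, the same order of magnitude quoted in public analyses of the incident.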
Why this bug survives testing
It’s not because engineers are stupid. It’s because testing is usually bounded:
- Short test runs don’t accumulate enough drift.
- Lab conditions don’t match deployment duty cycles.
- Acceptance criteria focus on “works now,” not “works after 100 hours.”
- Time is often mocked in tests, which is necessary but can hide integration realities.
Long-uptime bugs require long-uptime testing, or at least formal reasoning and monitoring that explicitly accounts for accumulation.
If you can’t run the test for 100 hours, simulate 100 hours with accelerated counters and verify the math under scale.
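A sketch of what that accelerated check can look like, reusing the constants from the sketch above plus a hypothetical drift budget; a real version would live in your test suite next to the conversion library:
# Accelerated soak: run the conversion math over simulated uptimes and
# fail loudly when accumulated drift leaves the declared budget.
FRACTIONAL_BITS = 23            # assumed precision, as in the earlier sketch
TICK_SECONDS = 0.1
DRIFT_BUDGET_S = 0.05           # hypothetical budget; derive yours from requirements

stored = int(TICK_SECONDS * 2**FRACTIONAL_BITS) / 2**FRACTIONAL_BITS

for hours in (1, 10, 100, 300):
    ticks = hours * 3600 * 10
    drift = ticks * (TICK_SECONDS - stored)
    verdict = "OK  " if drift <= DRIFT_BUDGET_S else "FAIL"
    print(f"{verdict} uptime={hours:>4}h accumulated_drift={drift:.4f}s")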
How the drift turned into a missed intercept
The Patriot’s radar observes a target position, then the system predicts where it will be when the interceptor can engage.
The prediction uses time. If the internal time is slightly wrong, the predicted target location is wrong.
A certain amount of error is tolerable; tracking filters can absorb noise. But accumulated time drift is not “noise.”
It’s bias. Bias pushes you consistently in the wrong direction.
In operational terms, the failure looks like this:
- Normal ops: the system runs, drift grows slowly, no one notices.
- Approaching failure: track quality degrades in subtle ways; the system becomes more likely to drop or mis-associate tracks.
- Critical moment: a fast target appears; the prediction window is tight; bias matters; the system fails to correctly align the track for engagement.
The key here is that “time drift” is not a background concern. It is a direct contributor to a real-time decision.
If you run anything time-sensitive—payments, auth, distributed storage, telemetry pipelines—time is in your control plane whether you admit it or not.
Joke #1: Time drift is the only bug that gets worse while you sleep, which is rude because it’s also the only time you’re not paging yourself.
The real lessons (for SREs, embedded engineers, and managers)
1) Uptime is not a virtue by itself
“Five nines” culture sometimes devolves into “never reboot anything.” That’s religion, not engineering.
Long uptime increases exposure to leaks, counter rollovers, slow precision loss, and weird states that only exist after days of continuous operation.
The correct stance: design for long uptime, but operate with intentional maintenance windows.
If a system is safety- or mission-critical, you should know exactly what state accumulates with time, and how it is bounded.
2) Requirements drift is a real bug source
The Patriot was adapted to a new threat environment. That’s not unusual. What’s unusual is believing the original assumptions still hold.
In corporate systems, “requirements drift” is often disguised as “just a configuration change” or “just increase the timeout.”
When operational profiles change—traffic, latency, uptime, target speed—re-validate the math.
Especially math involving time, counters, and numeric conversions.
3) Time is a distributed systems dependency even on one machine
Even a single node has multiple clocks:
- Wall clock (CLOCK_REALTIME): can jump due to NTP corrections or manual changes.
- Monotonic clock (CLOCK_MONOTONIC): steady, but not tied to civil time.
- TSC/HPET hardware realities: drift, frequency scaling, virtualization artifacts.
Use monotonic time for measuring intervals and scheduling internal timeouts.
Use real time for human-facing timestamps and interoperability. Mixing them casually is how you get bugs that feel supernatural.
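As a concrete pattern, here is a minimal polling helper that keys its deadline off the monotonic clock, while the wall clock only appears in the human-readable log line. The function name and poll interval are mine, not from any particular library:
import time

def wait_until(predicate, timeout_s, poll_s=0.05):
    # Deadline from the monotonic clock: an NTP step can neither stretch nor cut the wait.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_s)
    return False

# Wall clock stays where it belongs: a timestamp for humans and logs.
print("checked at", time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()))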
4) Precision should be an explicit design decision
If your system uses fixed-point, document:
- the scale factor
- the maximum representable value before rollover
- the rounding strategy
- the worst-case accumulated error over maximum uptime
If nobody can answer “what’s the maximum drift after 72 hours,” you have not designed the timekeeping. You have merely hoped at it.
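One way to keep those answers out of people’s heads is to encode them next to the constants they depend on. A sketch with assumed values (32-bit tick counter, 10 Hz ticks, 23 fractional bits, a 72-hour uptime cap), not anyone’s real spec:
# Timekeeping facts as code: scale, rollover horizon, worst-case drift over max uptime.
TICK_HZ = 10                    # ticks per second
COUNTER_BITS = 32               # assumed tick-counter width
FRACTIONAL_BITS = 23            # assumed conversion precision
MAX_UPTIME_H = 72               # declared maximum continuous uptime

rollover_days = 2**COUNTER_BITS / TICK_HZ / 86400
stored = int((1 / TICK_HZ) * 2**FRACTIONAL_BITS) / 2**FRACTIONAL_BITS
worst_drift_s = MAX_UPTIME_H * 3600 * TICK_HZ * abs(1 / TICK_HZ - stored)

print(f"counter rolls over after {rollover_days:.1f} days")
print(f"worst-case drift after {MAX_UPTIME_H}h: {worst_drift_s:.3f} s")
If the printed drift exceeds your tolerance, the design answer is more precision, a different scale, or a shorter maximum uptime; pick one deliberately.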
5) Monitoring needs to include time quality, not just time value
Many dashboards show “is NTP enabled” like a checkbox on a compliance form. Useless.
What you need is: offset, frequency correction, jitter, and reachability. And alerts that fire before your offset becomes operationally meaningful.
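A minimal sketch of that kind of check, assuming chrony and a hypothetical 10 ms offset budget; wire the result into whatever metrics or alerting pipeline you already run:
import re
import subprocess

MAX_OFFSET_S = 0.010   # hypothetical budget; take yours from the tightest subsystem's skew tolerance

out = subprocess.run(["chronyc", "tracking"], capture_output=True, text=True, check=True).stdout
match = re.search(r"System time\s*:\s*([\d.]+) seconds (slow|fast)", out)
if not match or "Not synchronised" in out:
    raise SystemExit("CRITICAL: clock is not being disciplined")

offset = float(match.group(1)) * (-1 if match.group(2) == "slow" else 1)
print(f"clock offset: {offset:+.6f} s")
if abs(offset) > MAX_OFFSET_S:
    raise SystemExit(f"WARNING: offset {offset:+.6f} s exceeds budget of {MAX_OFFSET_S} s")
Run it from cron or your monitoring agent and alert on the exit status; the point is that offset becomes a number on a graph, not a checkbox.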
Fast diagnosis playbook: time drift and timing failures
When something smells like “timing” (intermittent auth failures, odd cache invalidations, distributed lock flapping, telemetry out of order),
don’t wander. Run a tight loop.
First: verify you’re not lying to yourself about the current time
- Check local clock state: Is it synchronized? Is NTP/chrony actually controlling it?
- Check offset magnitude: Are you off by milliseconds, seconds, minutes?
- Check step events: Did the clock jump recently?
Second: confirm the time source and the network path
- Who is the time server? Local, corporate stratum servers, or public sources?
- Is UDP/123 reachable? Or are you “syncing” to nothing?
- Is the hypervisor interfering? Virtual time can be “creative.”
Third: map symptoms to the clock type
- Interval bugs (timeouts, retries, rate limiting): suspect monotonic misuse or event-loop stalls.
- Timestamp bugs (JWT validity, TLS, logs ordering): suspect realtime jumps or drift.
- Cross-host inconsistencies: suspect offset between hosts or time source partitioning.
Fourth: bound the blast radius
- Stop the bleeding: pin instances, disable “strict” time checks temporarily if safe (e.g., widen allowed skew) while you restore sync.
- Reduce state accumulation: restart services with known time-sensitive caches if needed.
- Prevent recurrence: fix root cause, then add alerts on offset/jitter/reachability.
Joke #2: If you ever find “timekeeping” under “non-functional requirements,” congratulations—you’ve discovered a functional requirement with better marketing.
Practical tasks with commands: detect, quantify, and decide
Below are real tasks you can run on typical Linux fleets. Each includes:
the command, example output, what it means, and the decision you make.
Use these like a field manual, not like a treasure map.
Task 1: Check if the system thinks it’s synchronized
cr0x@server:~$ timedatectl
Local time: Mon 2026-01-22 14:10:03 UTC
Universal time: Mon 2026-01-22 14:10:03 UTC
RTC time: Mon 2026-01-22 14:10:02
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
What it means: “System clock synchronized: yes” indicates a time sync service has disciplined the clock.
Decision: If it says no, treat time as suspect and proceed to NTP/chrony checks before debugging anything else.
Task 2: Identify whether chrony is healthy (offset, jitter, stratum)
cr0x@server:~$ chronyc tracking
Reference ID : 192.0.2.10 (ntp-a.internal)
Stratum : 3
Ref time (UTC) : Mon Jan 22 14:09:55 2026
System time : 0.000021345 seconds slow of NTP time
Last offset : -0.000012311 seconds
RMS offset : 0.000034112 seconds
Frequency : 12.345 ppm fast
Residual freq : -0.021 ppm
Skew : 0.120 ppm
Root delay : 0.003210 seconds
Root dispersion : 0.001102 seconds
Update interval : 64.0 seconds
Leap status : Normal
What it means: Offset is ~21µs slow; that’s excellent. Frequency correction is small, jitter is low.
Decision: If you see offsets in milliseconds/seconds, or “Leap status: Not synchronised,” stop and fix time sync first.
Task 3: See which NTP sources are reachable and preferred
cr0x@server:~$ chronyc sources -v
210 Number of sources = 3
.-- Source mode '^' = server, '=' = peer, '#' = local clock.
/ .- Source state '*' = current best, '+' = combined, '-' = not combined,
| / '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
|| .- xxxx [ yyyy ] +/- zzzz
|| Reachability register (octal) -. | xxxx = adjusted offset,
|| Log2(Polling interval) --. | | yyyy = measured offset,
|| \ | | zzzz = estimated error.
|| | | \
^* ntp-a.internal 377 6 -21us[ -33us] +/- 312us
^+ ntp-b.internal 377 6 -10us[ -18us] +/- 401us
^? ntp-c.internal 0 6 +0ns[ +0ns] +/- 0ns
What it means: Two healthy sources; one unreachable (^?, reach 0).
Decision: If most sources are unreachable, check firewall/routing. If the “best” source flips frequently, suspect network jitter or a bad server.
Task 4: Confirm UDP/123 is reachable to your time server
cr0x@server:~$ nc -uvz ntp-a.internal 123
Connection to ntp-a.internal 123 port [udp/ntp] succeeded!
What it means: With UDP, “succeeded” only means the probe went out and nothing answered with an ICMP rejection; it does not prove an NTP response came back. Still, it quickly rules out the most obvious “blocked NTP” cases.
Decision: If it fails, coordinate with network/security. Don’t “work around” by disabling time validation in apps as a permanent fix.
Task 5: Verify kernel time discipline status
cr0x@server:~$ timedatectl timesync-status
Server: 192.0.2.10 (ntp-a.internal)
Poll interval: 1min 4s (min: 32s; max 34min 8s)
Leap: normal
Version: 4
Stratum: 3
Reference: 9B1A2C3D
Precision: 1us (-24)
Root distance: 1.5ms
Offset: -17us
Delay: 310us
Jitter: 52us
Packet count: 128
Frequency: +12.345ppm
What it means: Offset and root distance are tiny; the system is disciplined. Note that timedatectl timesync-status reports systemd-timesyncd state; on hosts disciplined by chrony it won’t show this detail, so use chronyc tracking and chronyc ntpdata there.
Decision: If root distance is large (hundreds of ms+), your time quality is poor even if “synchronized: yes.” Consider better sources or closer stratum servers.
Task 6: Detect whether the clock stepped (jumped) recently
cr0x@server:~$ journalctl -u chrony --since "2 hours ago" | tail -n 10
Jan 22 13:02:11 server chronyd[612]: Selected source 192.0.2.10
Jan 22 13:02:11 server chronyd[612]: System clock wrong by -0.742314 seconds
Jan 22 13:02:11 server chronyd[612]: System clock was stepped by -0.742314 seconds
Jan 22 13:02:12 server chronyd[612]: Frequency 12.345 ppm
Jan 22 13:03:16 server chronyd[612]: Source 192.0.2.11 replaced with 192.0.2.12
What it means: A ~742ms step happened. That can break systems that assume time never goes backwards or jumps.
Decision: If you see steps, check why: cold start, VM suspend/resume, or loss of sync. Consider configuring slew-only behavior for sensitive apps (with caution).
Task 7: Check for VM or host time anomalies (dmesg clues)
cr0x@server:~$ dmesg | egrep -i "clocksource|tsc|timekeeping" | tail -n 8
[ 0.000000] tsc: Detected 2294.687 MHz processor
[ 0.000000] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x211f0b6d85a, max_idle_ns: 440795223908 ns
[ 0.125432] clocksource: Switched to clocksource tsc
[ 831.441100] timekeeping: Marking clocksource 'tsc' as unstable because the skew is too large
[ 831.441105] clocksource: Switched to clocksource hpet
What it means: Kernel detected unstable TSC and switched clocksource. That can correlate with drift and timing weirdness, especially in VMs.
Decision: If you see “unstable,” involve platform/virtualization teams. Consider pinning clocksource or fixing host settings; don’t just “restart the app” forever.
Task 8: Compare time between hosts (quick skew check)
cr0x@server:~$ for h in app01 app02 db01; do echo -n "$h "; ssh $h "date -u +%s.%N"; done
app01 1769091003.123456789
app02 1769091003.123991234
db01 1769091002.997000111
What it means: db01 is ~126ms behind app01. That’s enough to break strict skew checks and reorder events.
Decision: If skew > your system tolerance (often 50–200ms depending on protocol), fix time sync before debugging application “mysteries.”
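The loop above also pays the SSH round trip, which inflates apparent skew. A slightly fairer estimate timestamps around the call and compares the remote clock against the midpoint; the hostnames come from the example above, everything else is an assumption:
import subprocess
import time

def estimate_offset(host):
    # Crude NTP-style estimate: remote time minus the midpoint of the local
    # send/receive window. Assumes roughly symmetric network latency.
    t0 = time.time()
    out = subprocess.run(["ssh", host, "date -u +%s.%N"],
                         capture_output=True, text=True, check=True).stdout
    t1 = time.time()
    return float(out.strip()) - (t0 + t1) / 2, t1 - t0

for host in ("app01", "app02", "db01"):
    offset, rtt = estimate_offset(host)
    print(f"{host}: offset {offset * 1000:+8.1f} ms (rtt {rtt * 1000:.0f} ms)")
It is still not an NTP-grade measurement, but it keeps a slow SSH handshake from masquerading as clock skew.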
Task 9: Verify monotonic clock behavior (no backwards jumps)
cr0x@server:~$ python3 - <<'PY'
import time
a = time.monotonic()   # monotonic clock: measures elapsed time, immune to NTP steps
time.sleep(0.2)
b = time.monotonic()
print("delta_ms", (b - a) * 1000)
PY
delta_ms 200.312614
What it means: Monotonic time increases steadily. If your app uses wall clock for timeouts, it can break on steps; monotonic avoids that class.
Decision: If you find code using wall clock for intervals, schedule a fix. It’s not optional; it’s a future incident.
Task 10: Quantify drift rate (ppm) over time using chrony
cr0x@server:~$ chronyc sourcestats -v
210 Number of sources = 2
Name/IP Address NP NR Span Frequency Freq Skew Offset Std Dev
ntp-a.internal 20 12 18m +12.345 0.120 -21us 52us
ntp-b.internal 18 10 18m +11.998 0.200 -10us 60us
What it means: Frequency correction is ~12 ppm fast. That’s normal for commodity clocks; chrony is compensating.
Decision: If frequency is extreme or unstable, suspect hardware issues, thermal problems, VM scheduling, or bad time sources.
Task 11: Inspect whether NTP is configured to use a local “lies-to-you” source
cr0x@server:~$ grep -R "server\|pool\|local" /etc/chrony/chrony.conf
server ntp-a.internal iburst
server ntp-b.internal iburst
# local stratum 10
What it means: Real servers are configured; the “local clock” fallback is commented out. Good.
Decision: If you see local stratum enabled without a strong reason, review it. Local fallback can mask upstream failure and quietly diverge.
Task 12: Detect log timestamp anomalies (time went backwards)
cr0x@server:~$ journalctl --since "1 hour ago" -o short-unix | awk '
{
  if (prev != "" && $1 + 0 < prev + 0) { print "time went backwards:", prev, "->", $1 }
  prev = $1
}' | head
What it means: If output appears, you have ordering anomalies (often due to clock steps or log ingestion mixing hosts). Using -o short-unix makes the timestamps plain epoch seconds, so the comparison is numeric instead of fragile string matching.
Decision: If “backwards” shows up, treat any time-based correlation during incident as suspect; prioritize restoring sync and using monotonic ordering where possible.
Task 13: Validate TLS failures might be clock-related (certificate validity)
cr0x@server:~$ openssl x509 -in /etc/ssl/certs/ca-certificates.crt -noout -dates 2>/dev/null | head -n 2
notBefore=Jan 1 00:00:00 2025 GMT
notAfter=Dec 31 23:59:59 2030 GMT
What it means: openssl x509 only reads the first certificate in that bundle, so treat this as a sanity check; for a specific service, inspect its own certificate (or fetch it with openssl s_client). The point stands: if your system clock sits outside a cert’s notBefore/notAfter window, TLS handshakes fail in ways that look like network issues.
Decision: If you see sudden TLS failures across a subset of hosts, check time offset before rotating certs in a panic.
Task 14: Confirm application containers inherit sane time (host vs container)
cr0x@server:~$ docker exec -it api-1 date -u
Mon Jan 22 14:10:05 UTC 2026
What it means: Containers typically use the host clock; if host time is wrong, every container is wrong in sync.
Decision: If only some nodes are wrong, focus on node-level NTP; if all are wrong, focus on upstream source or network policy changes.
Three corporate mini-stories (wrong assumption, backfired optimization, boring practice)
Mini-story 1: The wrong assumption (“time is stable enough”)
A mid-sized fintech ran an internal event bus used by payment processors and fraud scoring. Nothing exotic: producers stamped messages with a timestamp,
consumers used a “freshness” window to discard anything older than 30 seconds, and the fraud system aggressively ignored “stale” signals to keep latency low.
During a network maintenance window, a subset of application nodes lost reachability to the internal NTP servers. Chronyd kept running, but with no reachable sources it could no longer discipline the clock, which was now free-running.
Over several hours, those nodes drifted. Not by minutes—by hundreds of milliseconds here, a second there. The dashboards still showed “service healthy.”
Of course they did. Most dashboards do not measure time quality.
The fraud pipeline began dropping events as “old.” Not consistently. Just enough to matter.
Analysts later found that decisions were being made with less context: fewer recent device signals, fewer session markers, fewer “this card was just used” indicators.
The false positives increased. Customer support got loud. Engineers got creative in the worst way: they tuned the freshness window larger.
That “fix” made it worse. Now truly stale events were being accepted, which changed the semantics of the fraud features.
The model outputs shifted, and nobody could explain why. The incident ended when someone checked chrony offsets on the drifting hosts and restored NTP reachability.
The root problem was a wrong assumption: that time drift would be negligible and that clocks were “basically correct.”
The real fix wasn’t tuning the window. It was:
(1) alerting on offset/reachability, and
(2) using monotonic time for freshness calculations within a host, while using event sequence IDs across hosts.
Mini-story 2: The optimization that backfired (saving CPU by cutting precision)
A storage team maintained a high-throughput ingestion service that stamped each write with a logical timestamp used for ordering and compaction decisions.
Under load, profiling showed time conversion and formatting in the hot path. Someone proposed a tidy optimization: replace a high-resolution timestamp
with a cheaper tick counter and a precomputed conversion factor, kept in fixed-point. It shaved CPU and looked clean.
The optimization shipped. Latency improved. Everyone high-fived. Then, weeks later, an unrelated ops change extended maintenance intervals.
Nodes stayed up longer. Compactions began to behave oddly: some segments were “in the future,” some were considered already expired,
and the compactor started oscillating. The system wasn’t down, but it was burning IOPS and making tail latencies ugly.
The culprit was not the tick counter. It was the rounding behavior in the conversion factor and the fact that the conversion was done in multiple places.
Different code paths used slightly different scales. Over long uptime, the bias became visible in ordering decisions.
Worse, because this was storage, the effects persisted: bad ordering creates bad merges, and bad merges create more bad merges.
The postmortem ended with three changes:
(1) a single shared time conversion library with explicit precision and tests over simulated long uptime,
(2) switching internal interval math to monotonic nanoseconds,
and (3) adding a sanity check: if ordering deltas exceed expected bounds, the node refuses to make compaction decisions and pages humans.
Optimization is allowed. Unmeasured, unbounded optimization is how you get a slow-burn failure that looks like “entropy.”
Mini-story 3: The boring practice that saved the day (maintenance + time SLOs)
A healthcare platform ran a fleet of API servers and message processors in multiple data centers.
Their reliability team had a policy that sounded painfully conservative: every node gets a scheduled restart within a defined interval,
and every environment has a “time quality” SLO (offset, reachability, and jitter thresholds).
Engineers complained. Restarts are annoying. They disrupt caches. They break long-running debug sessions.
The SREs stuck to it anyway, because they’d seen what happens when “never reboot” becomes doctrine.
One night, a network change partially blocked UDP/123 between one segment and the internal time servers.
The time SLO alerts fired within minutes: offset trending upward on a subset of nodes, reachability dropping.
The on-call didn’t have to infer anything from symptoms; the telemetry pointed at the clock.
The response was boring and effective:
reroute NTP, confirm offsets converge, rotate impacted nodes through restart to clear any time-sensitive state, then validate downstream systems (JWT, TLS, schedulers).
Customers barely noticed. The incident report was short. The most controversial part was who had to file the firewall change ticket.
Boring practices are often just “the right trade,” repeated until people forget why they exist.
Keep the policy. Document the why. And don’t negotiate with physics.
Common mistakes: symptom → root cause → fix
- Symptom: Random TLS failures (“certificate not yet valid”) on a subset of hosts
  Root cause: host clock behind real time; NTP unreachable or stepped backward after resume
  Fix: restore NTP reachability, verify offset via chronyc tracking, then restart clients that cache sessions if needed
- Symptom: Distributed locks flapping; leaders re-elected constantly
  Root cause: time-based leases using wall clock; clock steps cause lease expiry or extension glitches
  Fix: use monotonic time for lease durations; ensure clock sync for timestamp metadata; alert on clock steps
- Symptom: Metrics or logs out of order across hosts; traces look like spaghetti
  Root cause: cross-host skew; one segment lost NTP and free-ran; ingestion pipeline trusts timestamps blindly
  Fix: enforce time sync SLOs; add ingestion logic that tolerates bounded skew; include sequence IDs or monotonic ordering within a host
- Symptom: Rate limiting misbehaves (“suddenly everyone is over limit” or “nobody is”)
  Root cause: counters keyed by time buckets using realtime clock; time jump changes bucket boundaries
  Fix: base buckets on monotonic time or use server-generated epoch from a trusted source; avoid wall-clock bucketing in-process (see the sketch after this list)
- Symptom: Scheduled jobs fire twice or not at all after NTP correction
  Root cause: scheduler uses wall clock and doesn’t handle steps; system time is stepped to correct offset
  Fix: configure time sync to slew when feasible; use monotonic timers; add idempotency keys for jobs
- Symptom: “Works for days then degrades” in tracking, streaming, or control loops
  Root cause: accumulated rounding error, counter rollover, or drift interacting with long uptime assumptions
  Fix: compute worst-case error over max uptime; increase precision; reset state safely during planned maintenance
- Symptom: Only VMs drift; bare metal fine
  Root cause: unstable TSC, host oversubscription, suspend/resume artifacts, poor paravirtual clock configuration
  Fix: coordinate with virtualization team; verify kernel clocksource stability; ensure chrony is configured for VM environments
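For the rate-limiting entry above, here is the monotonic-bucket idea in sketch form: refill is computed from elapsed monotonic time, so a wall-clock step can neither drain nor overfill the bucket. The class and parameter names are illustrative, not a library API:
import time

class TokenBucket:
    # Minimal token bucket keyed off the monotonic clock.
    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill based on elapsed monotonic time, capped at burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
The same shape works for lease renewal and retry backoff: anything that measures “how long since,” not “what time is it.”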
Checklists / step-by-step plan
Checklist A: Prevent the “Patriot shape” bug in your own systems
- Inventory time dependencies: auth tokens, caches, schedulers, ordering, leases, compaction, replay protection.
- Declare maximum supported uptime for components with accumulated state; test against it.
- Use monotonic time for intervals (timeouts, retries, backoff, rate limiting), and wall time for presentation/interchange.
- Define a time quality SLO: max offset, max jitter, minimum reachable sources.
- Alert on reachability and offset trends, not just “NTP running.”
- Test long-uptime behavior: accelerated tick counters, soak tests, or formal bounds analysis.
- Centralize time conversions: one library, one scale, one rounding policy, tested.
- Plan safe restarts: maintenance windows, rolling restarts, state rehydration, idempotency.
Checklist B: During incident response when time is suspected
- Check offset now on affected hosts (chronyc tracking / timedatectl).
- Check reachability of time sources (chronyc sources -v, UDP/123 path).
- Look for steps in logs (journalctl -u chrony).
- Compare cross-host skew quickly (SSH loop with date -u).
- Mitigate blast radius: widen skew tolerances temporarily where safe, pin leaders, pause time-sensitive state transitions.
- Restore time discipline: fix network/policy, ensure multiple sources, verify convergence.
- Clean up state: restart services with time-sensitive caches; re-elect leaders cleanly; re-run failed jobs idempotently.
- Lock in prevention: add time SLO alerts, change management for NTP/firewall rules, and post-incident tests.
FAQ
1) Was the Patriot bug “just floating point”?
No. The core issue is precision and accumulation. Fixed-point arithmetic with rounding can be perfectly valid,
but you must bound the error over maximum uptime and use cases. The Patriot incident is the poster child for “small bias × long time = big miss.”
2) Why does 0.1 cause trouble in binary?
Because many decimal fractions have infinitely repeating binary expansions, just as 1/3 repeats in decimal. If you store 0.1 with a finite number of binary digits, you’re approximating it.
Approximation is fine; unaccounted accumulation is not.
3) Could monitoring have caught it?
Yes—if the system monitored time error growth as a first-class signal and defined a maximum safe uptime or drift threshold.
In many systems, the absence of such monitoring is less a technical limitation and more an organizational choice.
4) Is rebooting a valid mitigation for time-accumulation bugs?
Sometimes, yes. Rebooting resets accumulated state, including drift-related counters or error buildup. But rebooting as a strategy is only acceptable if:
you can do it safely, predictably, and with a clear maximum interval tied to known error bounds.
5) How much clock skew is “too much” in corporate systems?
It depends. For JWT validation and some auth flows, tens of seconds might be tolerated with leeway, but that’s a security trade.
For distributed tracing and ordering, tens of milliseconds can already hurt. For real-time control loops, even less.
The answer should come from your requirements, not from vibes.
6) NTP vs chrony: does it matter?
Both can work. Chrony is often preferred on modern Linux, especially in VM environments, because it handles variable network conditions well.
What matters more than the daemon is whether you:
(1) have multiple sane sources,
(2) can reach them reliably,
and (3) alert on offset/jitter/reachability.
7) Why not just use GPS time everywhere?
GPS can be a great reference, but it introduces its own failure modes: antenna issues, signal loss, spoofing/jamming concerns, and operational complexity.
Many organizations use GPS-backed stratum-1 servers internally, then distribute time over NTP/PTP within controlled networks.
8) What’s the difference between monotonic time and wall clock, practically?
Monotonic time is for measuring durations; it should not jump backward when the system corrects the wall clock.
Wall clock is for “what time is it.” Use the wrong one and you get retries that stall, tokens that expire early, or schedulers that time travel.
9) If time steps are dangerous, should we forbid them?
Not categorically. Stepping can be necessary on boot or when offset is huge. But you should understand which applications break on steps,
prefer slewing during steady-state operation, and design critical components using monotonic intervals and idempotent logic.
10) What’s the operational takeaway from the Patriot incident?
Don’t let “expected operating profile” live only in someone’s head or a decades-old assumption. Encode it:
in tests, in monitors, in restart policy, and in explicit bounds on numeric error.
Next steps you can actually do this week
If you operate production systems, the Patriot bug isn’t a history lesson. It’s a reminder that time is part of your reliability surface area.
Here’s what to do next, in order, without heroics.
- Add time quality metrics to your monitoring: chrony offset, jitter, reachability, and step events.
- Set an explicit skew budget per subsystem (auth, storage ordering, schedulers). Put the number in writing.
- Audit code for wall-clock intervals. Replace with monotonic timers where applicable.
- Run a controlled experiment: block NTP to a staging segment and watch what breaks. Fix what breaks.
- Define a maximum safe uptime for components with known accumulation risks; implement rolling maintenance restarts if needed.
- Centralize time conversions and add long-uptime tests (accelerated counters) so precision regressions are caught before production.
The Patriot incident is famous because the consequences were visible and immediate. Most timing failures in business systems are quieter:
a few dropped events, a few wrong decisions, a week of “it seems slower lately.” Quiet failures still cost money and trust—just on a payment plan.