Ubuntu 24.04 “Clock skew detected” — fix time sync and stop build/deploy failures (case #46)


“Clock skew detected” is one of those errors that feels like the computer is judging you for believing time is real. Your build runs for 12 minutes, then dies because one file appears to be from the future. Your deploy pipeline refuses to sign an artifact because TLS thinks the certificate isn’t valid yet. And your incident channel turns into philosophy class.

On Ubuntu 24.04, you can fix this reliably—if you treat time sync like a production dependency, not a background feature. This is the practical playbook: how to prove where skew comes from, how to stop it, and how to keep it stopped across bare metal, VMs, containers, and CI runners.

What “clock skew detected” actually means (and why it breaks builds)

The message usually comes from make (or tools that behave similarly) when file modification times look inconsistent. The classic example: a generated file has a timestamp later than the current system time, or a dependency file looks newer than its dependents in an impossible way. The build system assumes your clock is wrong because it relies on mtime ordering to decide what needs rebuilding.
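
You can reproduce the warning in a throwaway directory by faking a file from the future (a sketch; the Makefile, the paths, and the exact seconds in the output are illustrative):

cr0x@server:~$ mkdir -p /tmp/skew-demo
cr0x@server:~$ printf 'out: in\n\tcp in out\n' > /tmp/skew-demo/Makefile
cr0x@server:~$ touch -d "$(date -d '+5 minutes' '+%F %T')" /tmp/skew-demo/in
cr0x@server:~$ make -C /tmp/skew-demo
make: Entering directory '/tmp/skew-demo'
make: Warning: File 'in' has modification time 300 s in the future
cp in out
make: warning:  Clock skew detected.  Your build may be incomplete.
make: Leaving directory '/tmp/skew-demo'

The same thing happens in real builds when the clock was wrong while outputs were being written and was corrected afterwards: make sees mtimes it cannot order sensibly.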

But “clock skew detected” is rarely just about make. It’s a symptom that time is no longer monotonic and trustworthy on that host. Once that happens, you get a cascade:

  • CI builds rebuild endlessly or fail because timestamps jump backwards or forwards mid-run.
  • TLS and signed artifacts fail when the system thinks certs are not valid yet / already expired.
  • APT and package repos complain about “Release file is not valid yet” when the client’s clock is behind the Release file’s timestamp.
  • Distributed systems become weird. Not always broken, but weird. You’ll see log ordering issues, token expiry problems, leader elections flapping, and audits that don’t line up.

Time has two big sides on Linux:

  • Wall clock (CLOCK_REALTIME): what humans read, what timestamps use, what TLS checks use.
  • Monotonic clock (CLOCK_MONOTONIC): always moves forward, used for timeouts and measuring durations. A quick way to read both is shown below.
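
To make the distinction concrete, you can read both clocks in one shot (a sketch; assumes python3, which stock Ubuntu 24.04 ships, and the values are illustrative):

cr0x@server:~$ python3 -c 'import time; print("realtime :", time.clock_gettime(time.CLOCK_REALTIME)); print("monotonic:", time.clock_gettime(time.CLOCK_MONOTONIC))'
realtime : 1767087672.832145
monotonic: 254.113209

The realtime value can be stepped by NTP or an admin; the monotonic value only counts up since boot, which is why durations measured with it survive a time correction.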

Most time sync problems are wall-clock problems. But the root causes often live below the OS: firmware, hypervisor timekeeping, CPU power states, clocksource selection, or a host that simply cannot reach its NTP servers.

One reliable mental model: if your environment can’t keep time, it can’t keep promises. Builds, deploys, and security controls all assume timestamps mean something.

Paraphrased idea (attributed): Gene Kranz championed “tough and competent”—ops works when you keep your fundamentals solid under pressure. Time sync is a fundamental.

Joke #1: The only thing worse than a clock skew is a clock skew in a postmortem timeline. Suddenly everyone is innocent because “the logs are lying.”

Fast diagnosis playbook

If you’re in the middle of a failing deploy, you don’t have time for a tour of NTP theory. Here’s the order that finds the bottleneck fastest on Ubuntu 24.04.

First: confirm the skew and whether it’s still happening

  1. Check current time, timezone, and sync state (single command gives you most of it).
  2. Check whether time is “jumping” (a big step) vs “drifting” (slow error accumulation).

Second: identify who is supposed to sync time

  1. Is it systemd-timesyncd or chronyd?
  2. Are you inside a VM/container with special time rules?

Third: validate reachability and selection of time sources

  1. Can you reach UDP/123 to your configured servers?
  2. Are you actually syncing to a good server (low stratum, sane offset, stable)?

Fourth: check the platform (VM, hypervisor, clocksource, suspend/resume)

  1. VMs drifting often means host time is fine but guest integration is misconfigured.
  2. Big jumps often correlate with resume-from-suspend, snapshot restore, or a host overloaded enough to miss timekeeping ticks.

Fifth: mitigate production impact

  1. Fix time, then invalidate broken artifacts (build outputs can be tainted by bad mtimes).
  2. Restart only what needs restarting (time sync daemons, not the entire fleet unless you enjoy chaos).

Facts and historical context that actually help

  • NTP is old and battle-tested. The Network Time Protocol dates back to the 1980s and remains the backbone of time sync on the internet.
  • Leap seconds are a real operational event. They’ve caused outages when systems handled them inconsistently (step vs smear vs ignore).
  • Linux doesn’t just “have a clock.” It has multiple clocks and multiple clocksources (TSC, HPET, ACPI PM timer), and bad choices can show up as drift.
  • Virtualization changed timekeeping. Guests can run “late” when the host is oversubscribed, paused, snapshotted, or migrated.
  • Chrony was built for hostile conditions. It’s popular in data centers because it handles intermittent connectivity and large initial offsets better than classic ntpd in many setups.
  • Systemd-timesyncd is intentionally minimal. It syncs time but doesn’t aim to be a full NTP suite; that’s fine until you need diagnostics and control.
  • Build systems depend on mtimes because it’s fast. It’s also fragile across network filesystems, VM restores, and skewed clocks.
  • TLS is a time machine with rules. Cert validity windows are strict; if your clock is wrong, security stops you (correctly).
  • Monotonic time saved a lot of software. Many timeout bugs disappeared once software stopped using wall clock for durations—your build tools might not have gotten the memo.

Hands-on tasks: commands, outputs, decisions (12+)

These are the commands I run on Ubuntu 24.04 when the pager says “clock skew” and the CI pipeline is on fire. Each task includes: command, what the output means, and what decision you make.

Task 1 — Check system time, sync status, and which service is in charge

cr0x@server:~$ timedatectl
               Local time: Mon 2025-12-30 09:41:12 UTC
           Universal time: Mon 2025-12-30 09:41:12 UTC
                 RTC time: Mon 2025-12-30 09:40:02
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: no
              NTP service: active
          RTC in local TZ: no

Meaning: System clock synchronized: no is the red flag. RTC differs by ~70 seconds, also suspicious.

Decision: Confirm what “NTP service” actually is (chrony or systemd-timesyncd) and check why sync isn’t achieved.

Task 2 — Identify active time sync daemon(s)

cr0x@server:~$ systemctl status chrony systemd-timesyncd --no-pager
● chrony.service - chrony, an NTP client/server
     Loaded: loaded (/usr/lib/systemd/system/chrony.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-12-30 09:37:01 UTC; 4min 11s ago
● systemd-timesyncd.service - Network Time Synchronization
     Loaded: loaded (/usr/lib/systemd/system/systemd-timesyncd.service; disabled; preset: enabled)
     Active: inactive (dead)

Meaning: Chrony is running; timesyncd is disabled. Good: one boss, not two.

Decision: Use chronyc for real diagnostics. If both were active, you’d disable one (usually timesyncd if you want chrony).

Task 3 — Quick chrony health: tracking

cr0x@server:~$ chronyc tracking
Reference ID    : 00000000 ()
Stratum         : 0
Ref time (UTC)  : Thu Jan 01 00:00:00 1970
System time     : 0.832145678 seconds fast of NTP time
Last offset     : +0.832145678 seconds
RMS offset      : 0.500000000 seconds
Frequency       : 0.000 ppm
Residual freq   : +0.000 ppm
Skew            : 0.000 ppm
Root delay      : 1.000000000 seconds
Root dispersion : 10.000000000 seconds
Update interval : 0.0 seconds
Leap status     : Not synchronised

Meaning: Stratum 0 with Reference ID 0 means chrony has no selected source. Not synchronized.

Decision: Look at sources and reachability next. This is typically network, DNS, or bad config.

Task 4 — See configured sources and whether they’re reachable

cr0x@server:~$ chronyc sources -v
  .-- Source mode  '^' = server, '=' = peer, '#' = local clock.
 / .- Source state '*' = current best, '+' = combined, '-' = not combined,
| /             'x' = may be in error, '~' = too variable, '?' = unusable.
||                                                 .- xxxx [ yyyy ] +/- zzzz
||      Reachability register (octal) -.           |  xxxx = adjusted offset,
||      Log2(Polling interval) --.      |          |  yyyy = measured offset,
||                                \     |          |  zzzz = estimated error.
||                                 |    |           \
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^? ntp1.corp.lan                 0   6     0     -     +0ns[   +0ns] +/-    0ns
^? ntp2.corp.lan                 0   6     0     -     +0ns[   +0ns] +/-    0ns

Meaning: ^? and Reach 0 mean it’s not getting replies. Stratum 0 here is “unknown/unreachable”, not “very accurate”.

Decision: Check DNS resolution and UDP/123 path. Don’t touch clocksource yet; it’s probably just blocked.

Task 5 — Confirm DNS resolution for NTP hosts

cr0x@server:~$ getent ahosts ntp1.corp.lan
10.20.30.40     STREAM ntp1.corp.lan
10.20.30.40     DGRAM  ntp1.corp.lan
10.20.30.40     RAW    ntp1.corp.lan

Meaning: Name resolves. Good. If this fails, chrony will look “broken” but it’s just DNS.

Decision: If DNS fails, fix resolver first. If DNS is fine, check firewall/routing for UDP/123.

Task 6 — Validate UDP/123 connectivity (firewall or routing issues)

cr0x@server:~$ sudo nft list ruleset | sed -n '1,80p'
table inet filter {
  chain input {
    type filter hook input priority filter; policy drop;
    iif "lo" accept
    ct state established,related accept
    tcp dport 22 accept
    ip protocol icmp accept
  }
}

Meaning: Policy drop on input with no explicit UDP/123 inbound rule is fine for a client. But outbound traffic must also be allowed, and replies need the stateful accept for established/related connections (present here).

Decision: If egress is blocked upstream (cloud SG, corporate firewall), fix there. On the host, check output chain if you enforce it.
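
To check the actual UDP/123 path rather than just the local ruleset, watch the wire while chrony polls (a sketch; interface names, addresses, and timestamps are illustrative, and you can force a poll with chronyc burst 4/4 in another shell):

cr0x@server:~$ sudo tcpdump -ni any -c 4 udp port 123
tcpdump: data link type LINUX_SLL2
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
09:44:10.120331 eth0  Out IP 10.20.30.5.46210 > 10.20.30.40.123: NTPv4, Client, length 48
09:44:10.121902 eth0  In  IP 10.20.30.40.123 > 10.20.30.5.46210: NTPv4, Server, length 48

Meaning: requests that go Out with no matching In reply point at an upstream block (cloud security group, corporate firewall), not at the host.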

Task 7 — Watch chrony logs for “no reply” and step events

cr0x@server:~$ sudo journalctl -u chrony --since "30 min ago" --no-pager
Dec 30 09:37:01 server chronyd[932]: chronyd version 4.5 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER)
Dec 30 09:37:02 server chronyd[932]: Could not resolve address for ntp2.corp.lan: Name or service not known
Dec 30 09:37:05 server chronyd[932]: No suitable source for synchronisation

Meaning: Here we see the real cause: resolution failure for one source; no suitable source overall.

Decision: Fix DNS or replace sources with IPs temporarily while DNS is repaired.
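
If you need time now and DNS will take a while, you can add a source by IP at runtime (a sketch; the IP is an example, and runtime additions don’t survive a chronyd restart, so put the real fix in /etc/chrony/chrony.conf afterwards):

cr0x@server:~$ sudo chronyc add server 10.20.30.40 iburst
200 OK
cr0x@server:~$ chronyc sources | tail -n 1
^* 10.20.30.40                   2   6    17    14   -812ms[ -812ms] +/-   24ms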

Task 8 — Check whether the RTC is drifting or set wrong

cr0x@server:~$ sudo hwclock --verbose
hwclock from util-linux 2.39.3
System Time: 1735551672.123456
Trying to open: /dev/rtc0
Using the rtc interface to the clock.
Last drift adjustment done at 0 seconds after 1969
RTC time: 1735551602.000000, RTC epoch: 1900, offset: 0
Time since last adjustment is 1735551602 seconds
Calculated Hardware Clock drift is 0.000000 seconds
Hardware clock is on UTC time

Meaning: RTC is ~70 seconds behind system time. That can happen after boot if NTP hasn’t synced and system time is wrong too.

Decision: Once NTP is stable, sync RTC from system time (hwclock --systohc) on bare metal. In VMs, RTC is often virtual; treat carefully.
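
On bare metal, once chronyc tracking reports a synchronised leap status, the write-back is one command (a sketch; the timestamp shown is illustrative):

cr0x@server:~$ sudo hwclock --systohc
cr0x@server:~$ sudo hwclock --show
2025-12-30 09:45:10.502891+00:00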

Task 9 — Confirm what time source systemd thinks is configured (timesyncd setups)

cr0x@server:~$ timedatectl timesync-status
       Server: 185.125.190.57 (ntp.ubuntu.com)
Poll interval: 32min 0s (min: 32s; max 34min 8s)
         Leap: normal
      Version: 4
      Stratum: 2
    Reference: 7B5E1A2F
    Precision: 1us (-20)
Root distance: 28.217ms (max: 5s)
       Offset: +3.122ms
        Delay: 24.503ms
       Jitter: 2.731ms
 Packet count: 41
    Frequency: -12.345ppm

Meaning: If you use timesyncd, this is gold: stratum 2, low offset, stable jitter. That’s healthy.

Decision: If offset is huge or server is missing, move to chrony for better recovery and diagnostics, especially on VMs or flaky networks.

Task 10 — Detect time jumps (suspend, VM restore, or host pausing)

cr0x@server:~$ sudo journalctl --since "2 hours ago" | grep -E "Time has been changed|clock.*jump|System clock"
Dec 30 08:55:11 server systemd[1]: Time has been changed
Dec 30 08:55:11 server chronyd[932]: System clock was stepped by -38.214567 seconds

Meaning: A step of -38 seconds is a backwards jump. That’s exactly how you get “file from the future” or “dependencies in the past”.

Decision: Identify why stepping occurs. Chrony steps on startup/large offsets by design (configurable). If it happens mid-uptime, suspect VM pause/restore or broken time source.

Task 11 — Validate the kernel clocksource (rare, but real)

cr0x@server:~$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

Meaning: tsc is usually fine on modern hardware and VMs, but if you see instability or known platform bugs, switching can help (carefully).

Decision: Only change clocksource if you have evidence (time jumps/drift correlated with specific platforms). Otherwise you’re treating the symptom with a hammer.
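
If you do have that evidence, this is how you see the candidates and try one without a reboot (a sketch; which sources are available depends on your platform, and hpet is only an example):

cr0x@server:~$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
cr0x@server:~$ echo hpet | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet

If the change genuinely helps, make it persistent with the clocksource= kernel parameter; if it doesn’t, switch back the same way and keep looking at the platform.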

Task 12 — Compare time against a known-good external reference quickly

cr0x@server:~$ chronyc -n tracking | sed -n '1,6p'
Reference ID    : C0A80101 (192.168.1.1)
Stratum         : 3
Ref time (UTC)  : Mon Dec 30 09:42:01 2025
System time     : -0.000231456 seconds slow of NTP time
Last offset     : -0.000120001 seconds

Meaning: Offset sub-millisecond: excellent. That’s not the box causing your build failures (unless your filesystem mtimes are corrupted).

Decision: Move investigation to the build environment itself (shared volumes, checkout timestamps, container host time, etc.).

Task 13 — Find whether you’re building on a network filesystem with timestamp weirdness

cr0x@server:~$ findmnt -T "$PWD"
TARGET SOURCE                FSTYPE OPTIONS
/      /dev/mapper/vg0-root  ext4   rw,relatime

Meaning: Local ext4 is predictable. If you see nfs, cifs, or something exotic, mtimes can reflect server time, not client time.

Decision: If build workspace is on NFS/CIFS, ensure the server is time-synced too, or move builds to local disk.
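
A quick probe on a suspect mount: create a file and compare its mtime with the client clock (a sketch; the mount path and the output are made up):

cr0x@server:~$ date -u '+%Y-%m-%d %H:%M:%S'; touch /mnt/ci-workspace/.timeprobe; ls -l --full-time /mnt/ci-workspace/.timeprobe
2025-12-30 09:46:02
-rw-r--r-- 1 cr0x cr0x 0 2025-12-30 09:47:41.000000000 +0000 /mnt/ci-workspace/.timeprobe

Meaning: a freshly created file stamped ~100 seconds ahead of the client clock means whichever machine assigned the mtime (typically the NFS server) is off, and your builds will inherit that.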

Task 14 — Prove the mtime anomaly that triggers make

cr0x@server:~$ ls -l --full-time build/output.o src/input.c
-rw-r--r-- 1 cr0x cr0x  8216 2025-12-30 09:50:01.000000000 +0000 build/output.o
-rw-r--r-- 1 cr0x cr0x   412 2025-12-30 09:41:10.000000000 +0000 src/input.c

Meaning: If current time is 09:42 but output.o is 09:50, the future file came from a bad clock at creation time.

Decision: Clean and rebuild after fixing time; do not trust incremental builds once timestamps are poisoned.
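
To see how far the poison spread before wiping, GNU find can list everything stamped later than now (a sketch; the paths in the output are illustrative):

cr0x@server:~$ find build/ -newermt now -print
build/output.o
build/gen/parser.c

If that list is non-empty after time is fixed, delete the build directory rather than trying to repair individual mtimes.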

Task 15 — After fixing sync: verify “synchronized” and stable sources

cr0x@server:~$ timedatectl
               Local time: Mon 2025-12-30 09:43:22 UTC
           Universal time: Mon 2025-12-30 09:43:22 UTC
                 RTC time: Mon 2025-12-30 09:43:22
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

Meaning: This is the end-state you want: synchronized yes, RTC aligned, UTC everywhere.

Decision: Now remediate the build outputs (clean workspace, purge caches that store timestamps) and add monitoring so this doesn’t return next Tuesday.

Fix patterns that work on Ubuntu 24.04

Ubuntu 24.04 will happily run with either systemd-timesyncd or chrony. Pick one. Make it correct. Monitor it. The biggest time-sync outages I’ve dealt with weren’t caused by the “wrong” daemon; they were caused by ambiguity and neglect.

Pattern A: Standard server/VM fleet — use chrony, keep it boring

Chrony is usually the right default for servers and CI runners because it handles “real life”: intermittent connectivity, VMs that pause, nodes that boot from snapshots, and networks that sometimes block UDP for fun.

Install and enable chrony (if not already)

cr0x@server:~$ sudo apt update
...output...
cr0x@server:~$ sudo apt install -y chrony
...output...
cr0x@server:~$ sudo systemctl enable --now chrony
...output...

Decision point: If you already had timesyncd active, disable it to avoid dueling discipline loops.

cr0x@server:~$ sudo systemctl disable --now systemd-timesyncd
Removed "/etc/systemd/system/sysinit.target.wants/systemd-timesyncd.service".

Configure sane sources

Edit /etc/chrony/chrony.conf. In corporate networks, point to internal NTP servers (ideally redundant). In smaller environments, use Ubuntu pool defaults or your infrastructure provider’s time service. The key is redundancy and reachability.

Example (don’t blindly copy names; use your real servers):

cr0x@server:~$ sudo sed -n '1,80p' /etc/chrony/chrony.conf
pool ntp.ubuntu.com        iburst maxsources 4
keyfile /etc/chrony/chrony.keys
driftfile /var/lib/chrony/chrony.drift
ntsdumpdir /var/lib/chrony
logdir /var/log/chrony
makestep 1.0 3

What matters:

  • iburst speeds up initial sync after boot.
  • makestep 1.0 3 allows stepping (jumping) the clock if the offset is > 1s during the first 3 updates. Good for boot; dangerous if you see it happening later.

Opinionated advice: keep makestep enabled for early boot. CI nodes that come up with a bad clock will waste more money than stepping costs you.

Restart and verify

cr0x@server:~$ sudo systemctl restart chrony
...output...
cr0x@server:~$ chronyc sources -v
...output...
cr0x@server:~$ chronyc tracking
...output...

Decision point: You want a selected source (*), a non-zero reach register (not 0), and a sensible stratum (typically 2–4 in enterprises). If you have none: it’s still network/DNS/firewall.

Pattern B: Minimal desktops or appliances — timesyncd is fine

Systemd-timesyncd does the job for many non-critical systems. But it’s easier to outgrow it than to regret chrony. If you keep timesyncd, at least lock down a reliable server list.

Edit /etc/systemd/timesyncd.conf and set explicit servers if your network blocks public pools.

cr0x@server:~$ sudo sed -n '1,120p' /etc/systemd/timesyncd.conf
[Time]
NTP=ntp1.corp.lan ntp2.corp.lan
FallbackNTP=ntp.ubuntu.com
cr0x@server:~$ sudo systemctl restart systemd-timesyncd
...output...
cr0x@server:~$ timedatectl timesync-status
...output...

Decision point: If timesyncd can’t reach servers, it will quietly fail until something else screams (like your pipeline). If you need stronger diagnostics and resilience, switch to chrony.

Pattern C: VM guests — stop fighting the hypervisor, but don’t trust it blindly

VM time drift is common. The guest configuration can be correct, yet the VM still gets paused, resumed, snapshotted, migrated, or throttled. Every one of those is a time event.

Practical guidance:

  • Always run a guest time sync service. Even if the hypervisor “helps,” you want the guest to discipline itself.
  • Align host and guest strategy. If your hypervisor injects time and your guest also aggressively steps, you may get oscillation.
  • Watch for step events mid-uptime. That’s usually not “normal drift”; it’s a platform event.

If you observe stepping while the VM is running normally, look for host oversubscription or CPU steal time. That’s an SRE problem disguised as a time problem.

Pattern D: Containers — you can’t fix host time from inside the container

Containers use the host kernel. If a build container says “clock skew detected,” the host’s time is wrong or the workspace is mounted from somewhere with bad timestamps. You can install chrony inside a container and feel productive, but it won’t discipline the host clock unless you’re doing privileged container gymnastics—which is a different kind of incident.

Fix the node. Or fix the filesystem that supplies mtimes.
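
If you want proof that the container simply mirrors the host clock, compare epoch seconds on both sides (a sketch; assumes Docker and the ubuntu:24.04 image, and the numbers are illustrative):

cr0x@server:~$ date -u +%s; docker run --rm ubuntu:24.04 date -u +%s
1767087890
1767087891

The one-second gap is container start-up, not skew. If the host is wrong, every container on it is wrong by the same amount.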

Pattern E: CI/CD — clean artifacts after time correction

Once you fix time, you still have poisoned artifacts: generated files with future timestamps, caches containing metadata, and incremental build graphs that now lie. The correct action is usually:

  • Clean workspace (git clean -xfd in a disposable runner; more careful on persistent hosts).
  • Invalidate caches that store mtimes (language-specific build caches, compiler caches, artifact caches).
  • Rebuild from scratch once to reset the dependency chain (a concrete cleanup sketch follows).
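
On a disposable runner, that cleanup might look like this (a sketch; the ccache step only applies if you actually use ccache, and the cache paths are examples):

cr0x@server:~$ git clean -xfd
Removing build/
Removing .cache/
cr0x@server:~$ ccache -C
Clearing... 100.0% [==============================]
cr0x@server:~$ rm -rf ~/.cache/ci-artifacts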

Joke #2: Time sync is like brushing your teeth—skip it for a week and suddenly everything is expensive and everyone’s upset.

Three corporate mini-stories (anonymized, plausible, and technically painful)

Mini-story 1 — The incident caused by a wrong assumption

The company had a new Ubuntu 24.04 runner image for CI. It was “hardened,” so outbound UDP was blocked by default. The assumption: “We don’t run services that need UDP.” That statement is the kind that sounds tidy in a spreadsheet and gets ugly in production.

Within hours, builds started failing with make: warning: Clock skew detected. Meanwhile, a separate team reported intermittent “certificate not valid yet” errors when pulling from an internal registry. The symptoms looked unrelated. The shared cause was time drift: the runners booted with a clock a few minutes off, then drifted further because NTP couldn’t reach anything.

The incident review was classic corporate archaeology. Security had enforced the UDP block. Platform had swapped from timesyncd to chrony for “better accuracy” without validating egress rules. CI owned the failures but didn’t own the network policy. The dashboards showed CPU and memory; time sync wasn’t monitored at all.

The fix was unromantic: allow UDP/123 egress from runners to the internal NTP servers, add a second NTP source, and alert when timedatectl reports unsynchronized for more than a few minutes after boot. The biggest change was cultural: they stopped assuming “UDP is only for weird legacy stuff.” NTP is not weird. It’s a foundation.

Mini-story 2 — The optimization that backfired

An infra team wanted faster deployments. They enabled aggressive snapshot-based scaling for build agents: restore a VM snapshot, run a build, discard. It cut provisioning time dramatically—until the first time the snapshot was taken while the clock was slightly wrong and the guest tools had been paused.

The restored VMs had time offsets that varied by seconds to minutes. Sometimes chrony corrected it quickly. Sometimes it stepped the clock backwards in the middle of a build, right when the build system was generating headers. The resulting artifacts had incoherent timestamps, and builds started failing sporadically. “Sporadically” is the word that makes engineers age.

The team’s first fix attempt was to disable stepping because “stepping breaks builds.” That improved some failures and worsened others. Without stepping, some VMs remained offset long enough for TLS and token expiry logic to fail, and the registry pull step became flaky instead.

The actual solution was to treat snapshots as time-unsafe unless you design for it: ensure the snapshot is taken after time sync is stable, force a sync on boot, and consider discarding any cached build directories that survived the snapshot. They also introduced a health gate: if the VM wasn’t synchronized within a short window, it was terminated and replaced. That’s not elegant. It’s reliable. Reliability beats elegance in production.

Mini-story 3 — The boring but correct practice that saved the day

A different organization had a rule: every node has two independent time sources (internal stratum-1 appliances and a provider time service), and every cluster alerts if any node is unsynchronized for more than 10 minutes. Nobody loved this rule. It was “just more monitoring noise” until it wasn’t.

One Tuesday, a corporate firewall change blocked access to the primary NTP appliances from a new subnet. Most teams didn’t notice immediately because things “mostly worked.” But the monitoring did. Alerts fired: nodes were falling back to the secondary source and showing increased jitter.

Because they had redundancy, nothing user-visible broke. Deployments continued. Builds continued. TLS continued to be happy. The team had time to fix the firewall during business hours instead of during an outage.

This is the part that sounds boring in a write-up and glorious at 3 a.m.: the correct practice wasn’t fancy. It was redundancy and alerting on time synchronization state. It saved them from the kind of incident where you can’t trust logs, tokens, or certificates. That’s a “whole-company incident,” not a “small outage.”

Common mistakes: symptom → root cause → fix

1) Symptom: “Clock skew detected” during make, especially in CI

Root cause: system time stepped backwards or forwards, or build workspace contains files created when time was wrong (future mtimes).

Fix: stabilize time sync first; then clean the workspace and rebuild. Don’t try to “touch” a few files and hope. Use ls -l --full-time to find future timestamps, then wipe outputs.

2) Symptom: “Release file is not valid yet” from apt

Root cause: your node’s clock is behind the repository metadata timestamp, so the Release file’s date still looks like it’s in the future.

Fix: fix NTP; don’t disable APT’s validity checks as a workaround. After sync, rerun apt update.

3) Symptom: TLS errors “certificate is not yet valid” or token expiry weirdness

Root cause: clock offset (often minutes) on client or server; sometimes caused by VM restore or blocked NTP.

Fix: verify both ends are synchronized. Fix network reachability to NTP. For fleets, alert on sync state and offset.

4) Symptom: chrony running but never synchronizes (stratum 0, reach 0)

Root cause: NTP servers unreachable, DNS failures, UDP/123 blocked, or wrong server addresses.

Fix: validate resolution and network path; add redundant sources; ensure you’re not pointing to an address that only works on a different VLAN.

5) Symptom: time “snaps” by tens of seconds during uptime

Root cause: VM pause/resume, snapshot restore, host overload, or chrony stepping due to huge offset discovered late.

Fix: investigate platform events; tune makestep to only allow early boot stepping; ensure VM guest tools and host NTP are configured sanely.

6) Symptom: logs out of order, distributed traces nonsense

Root cause: some nodes have skew; others don’t. Your observability stack is faithfully recording lies.

Fix: enforce time sync across the fleet; consider rejecting nodes that fail time sync health checks (especially for Kubernetes workers).
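
A minimal health gate for a chrony-based node might look like this (a sketch, not a standard tool; the script name and the 0.5 s threshold are choices you’d tune):

cr0x@server:~$ sudo tee /usr/local/bin/timesync-healthcheck >/dev/null <<'EOF'
#!/bin/sh
# Illustrative gate: fail if the clock is not NTP-synchronized,
# or if chrony reports more than 0.5 s of offset from NTP time.
sync=$(timedatectl show --property=NTPSynchronized --value)
offset=$(chronyc tracking | awk '/^System time/ {print $4}')
[ "$sync" = "yes" ] || { echo "FAIL: clock not synchronized"; exit 1; }
awk -v o="$offset" 'BEGIN { if (o > 0.5) exit 1 }' || { echo "FAIL: offset ${offset}s"; exit 1; }
echo "OK: synchronized, offset ${offset}s"
EOF
cr0x@server:~$ sudo chmod +x /usr/local/bin/timesync-healthcheck
cr0x@server:~$ timesync-healthcheck
OK: synchronized, offset 0.000231456s

Wire that into whatever your fleet already uses for node health (readiness checks, cron plus alerting); the point is that an unsynchronized node fails loudly instead of quietly poisoning logs.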

7) Symptom: only builds on NFS-backed workspace fail with skew warnings

Root cause: server/client time mismatch or filesystem timestamp semantics, especially when the NFS server isn’t synced.

Fix: sync NFS server time; move workspaces local; ensure the NFS infrastructure is part of the time-sync monitoring domain.

8) Symptom: two time daemons fighting (oscillating offset, frequent “stepped” events)

Root cause: both timesyncd and chrony (or other tools) are trying to discipline the clock.

Fix: pick one daemon. Disable the other. Verify sync stability after the change.

Checklists / step-by-step plan

Checklist A — Emergency response when builds/deploys are failing

  1. Confirm skew is real: run timedatectl and record sync state.
  2. Identify the time daemon: systemctl status chrony systemd-timesyncd.
  3. Check time source health: chronyc tracking and chronyc sources -v (or timedatectl timesync-status).
  4. Fix reachability: validate DNS and network policy for UDP/123.
  5. Force stabilization: restart the time service after fixing network/DNS.
  6. Verify stable sync: synchronized = yes, selected NTP source exists, low offset.
  7. Clean tainted build outputs: wipe workspace/build directories and rebuild once cleanly.
  8. Re-run deploy: if TLS/token failures were present, retry after time correction.

Checklist B — Hardening a fleet so this doesn’t come back

  1. Standardize: choose chrony (recommended for servers/CI) or timesyncd (minimal), not both.
  2. Redundancy: configure at least two NTP sources on different infrastructure paths.
  3. Monitoring: alert on unsynchronized state and large offsets. Don’t wait for make to tell you.
  4. Platform alignment: ensure hypervisors and bare metal hosts also sync time; guests can’t compensate forever.
  5. CI hygiene: treat time fixes as cache invalidation events; purge artifact caches when skew occurs.
  6. Change control: firewall and DNS changes should include “does NTP still work?” as a test.
  7. Incident breadcrumbs: keep logs of time step events and NTP source changes for forensic timelines.

Checklist C — When you suspect VM/platform time drift

  1. Search logs for “System clock was stepped”.
  2. Correlate with VM lifecycle events (snapshot restore, migration, pause/resume).
  3. Check clocksource and kernel messages if drift is extreme.
  4. Validate host time sync; a bad host creates bad guests.
  5. Use chrony and restrict stepping to early boot unless you have a strong reason.

FAQ

1) Why does “clock skew detected” show up in make at all?

make relies on filesystem modification times to decide what needs rebuilding. If a dependency appears newer than it should be—or newer than “now”—it warns you because it can’t reason about the build graph reliably.

2) I fixed NTP, but make still warns. Why?

Because you still have files with future timestamps created during the bad-time period. Fix time first, then clean outputs and rebuild. Incremental builds after a skew event are untrustworthy unless you sanitize mtimes.

3) Should I use chrony or systemd-timesyncd on Ubuntu 24.04?

For servers, CI runners, and anything that must recover from imperfect networks or VM weirdness: chrony. For simple desktops or minimal appliances: timesyncd can be fine. The wrong choice is running both.

4) Can I just manually set the time with date and move on?

You can, but it’s a temporary bandage. If the machine can’t reach NTP or keeps drifting due to platform issues, it will break again. Also: manual time changes mid-flight can disrupt TLS, caches, and logs.

5) What’s the deal with stepping vs slewing?

Stepping jumps the clock to the correct time quickly. Slewing adjusts gradually. Stepping can break timestamp-sensitive workloads; slewing can leave you wrong for longer. Chrony’s makestep gives you a sane compromise: step only early in boot when you’re already in a fragile state.

6) Does timezone configuration cause clock skew errors?

Timezone misconfig usually causes human confusion, not skew. The system time (UTC internally) can still be correct. But if you see mismatched RTC/localtime settings, it can indicate sloppy provisioning. Keep servers on UTC. Always.

7) Why do containers get clock skew errors if they can’t set time?

Because they inherit the host kernel time. If a build container complains, fix the host’s time sync or the mounted filesystem timestamps. Installing NTP inside an unprivileged container is mostly theater.

8) How much offset is “too much” for CI and deploys?

Sub-second is normal. A few seconds can break strict systems (some tokens, some signing workflows). Minutes will absolutely break TLS and package management. If you’re seeing tens of seconds, treat it as a production defect.

9) We blocked UDP everywhere. Can we sync time without UDP/123?

Classic NTP uses UDP/123. Some environments use alternative time distribution methods internally, but the operational truth remains: you must allow whatever time protocol you chose. If you block it, your systems will eventually invent their own timeline.

10) Should I sync the hardware clock (RTC) too?

On bare metal: yes, once your system time is correct and stable, write it back with hwclock --systohc. On VMs: be cautious; RTC behavior depends on hypervisor and guest integration.

Next steps you should actually take

If you’re fixing an active failure: make time sync work first, then nuke tainted build artifacts. Don’t negotiate with poisoned mtimes. They don’t get better with hope.

For a durable fix on Ubuntu 24.04:

  • Standardize on one time service (chrony is the pragmatic choice for servers and CI).
  • Configure redundant NTP sources that your network actually allows.
  • Alert on unsynchronized state and on time step events.
  • Audit your VM lifecycle practices (snapshots, restores, migrations) for time impact.
  • After any skew incident, clean builds and caches once to reset the world.

The punchline is dull, which is how you know it’s correct: reliable time is a dependency like DNS and storage. You don’t “set it and forget it.” You run it, you monitor it, and you keep it boring.
