Debian 13: NTP Works but Drift Persists — Hardware Clock and Chrony Tweaks (Case #19)

Chrony says you’re synced. Monitoring still screams “clock skew.” Kerberos randomly refuses tickets. TLS handshakes fail like it’s 2012 again. And every reboot makes the box “teleport” a few seconds (or minutes) away from reality.

This is the timekeeping failure mode that wastes the most engineer-hours: NTP is technically working, yet drift persists. The culprit is usually not “bad NTP.” It’s the relationship between the system clock, the hardware clock (RTC), and how chrony decides to correct time. Fix that relationship and the problem stops being spooky.

A practical mental model: who owns time?

Linux timekeeping is a three-party relationship:

  • System clock (kernel time): what your processes use. This is what matters for logs, certificates, timeouts, and distributed systems.
  • Hardware clock (RTC): a tiny battery-backed clock on the motherboard. It’s crude. It drifts. It survives power loss. It is not automatically correct.
  • Synchronizer: chrony (in this case) decides how to steer the system clock based on time sources and rules (slew vs step, thresholds, offsets, stability).

The failure pattern “NTP works but drift persists” is usually one of these:

  • RTC is wrong, and you read it on boot (or after suspend) and re-infect the system clock with bad time.
  • Chrony is syncing, but it’s constrained to slew slowly while the system is constantly being yanked by something else (VM host clock, buggy TSC, another time service).
  • Chrony is syncing to the wrong thing (a local source, a flaky upstream, or it’s isolated by network conditions) and stays “happy” while you drift away from true time.

Time bugs are sneaky because everything still “runs.” It’s just running in the wrong decade. One of the only honest signals is the delta between clocks over time.

A paraphrase worth attributing: Werner Vogels has long pushed the reliability mindset that you should expect failures and design systems to handle them. Timekeeping belongs in that bucket: assume your clock will be wrong, then actively control it.

Fast diagnosis playbook (what to check, in order)

If you only have 10 minutes before the incident channel gets spicy, do this in order.

1st: Is chrony the only thing steering the system clock?

Conflicts create the classic symptom: “chrony reports synced, but time still jumps.” Look for competing daemons and virtualization time agents.

2nd: Is the system clock stable, and is it stepping?

Big jumps imply step corrections (or external resets), not drift. Drift implies stable but wrong frequency. The fix differs.

3rd: Is the RTC poisoning you on boot/resume?

If the machine is correct after running for a while but wrong right after reboot, suspect RTC and the UTC/localtime setting first, not your NTP servers.

4th: Is the time source quality real?

Chrony will happily pick a “reachable” but unstable source. You want low jitter, low dispersion, and a sane stratum (most of the time).

5th: Decide policy: slew-only or allow steps?

Production systems often need controlled stepping at boot (or after long outages) and slewing during steady state. You configure that explicitly.
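
If you want the first two checks as a single paste-able step, here is a minimal triage sketch (the service names are the Debian defaults; adjust the pattern for your fleet):

cr0x@server:~$ systemctl list-units --type=service --state=running | egrep -i 'chrony|timesync|ntp'
cr0x@server:~$ chronyc tracking | egrep 'System time|Frequency|Leap'

On a healthy host, the first command should show only chrony, and the second gives you offset, frequency error, and leap status in one glance.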

Interesting facts and historical context (quick, useful)

  1. NTP dates back to the early 1980s, designed for a world where networks were slower and clocks were much worse than today.
  2. Linux distinguishes “time” from “frequency”: disciplining the clock means correcting both the current offset and the rate it ticks.
  3. RTC chips are not precision instruments. Many are cheap oscillators whose drift changes with temperature and age.
  4. UTC vs localtime in RTC is a cultural war: Linux defaults to UTC; Windows historically prefers localtime. Dual-boot machines suffer for it.
  5. Leap seconds exist because Earth is rude: rotation isn’t perfectly consistent, so civil time occasionally inserts a second.
  6. Chrony was built to behave better on intermittent networks, like laptops or systems that can’t maintain constant NTP reachability.
  7. Kernel timekeeping changed significantly with modern CPUs: TSC stability, invariant TSC, and clocksource selection all affect drift.
  8. Virtualization used to be a timekeeping horror show: early hypervisors exposed unstable cycle counters, causing guests to drift or jump.
  9. PPS (Pulse Per Second) from GPS can give sub-microsecond discipline in the right setup, which makes NTP over WAN look like guesswork.

Tools and terms you actually need

You don’t need a PhD. You need the right lenses:

  • Offset: how far your system clock is from the source right now.
  • Frequency / skew: how fast your clock runs relative to real time; chrony adjusts this (a worked example follows this list).
  • Step: jump the clock immediately (good for boot-time correction, bad for some workloads).
  • Slew: slowly speed up or slow down the clock until it matches (safe for most workloads, slow if you’re far off).
  • RTC: the hardware clock you read/write via hwclock.
  • Kernel clocksource: what the kernel uses to count time (TSC, HPET, ACPI PM timer, etc.). A bad choice can drift.
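
To make ppm tangible: parts-per-million of error times seconds-in-a-day gives drift per day. A one-liner sketch of the arithmetic, using the 18.421 ppm value you’ll meet again in Task 4:

cr0x@server:~$ awk 'BEGIN { ppm = 18.421; printf "%.2f seconds/day\n", ppm * 86400 / 1e6 }'
1.59 seconds/day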

One short joke, as promised: when your clock drifts, your postmortem timeline becomes performance art.

Practical tasks (commands, output meaning, decisions)

The rules here are strict: every task includes a runnable command, what the output means, and what decision you make next. Do them in order if you’re unsure.

Task 1: Confirm your time stack (systemd + chrony)

cr0x@server:~$ timedatectl
               Local time: Tue 2025-12-30 11:12:03 UTC
           Universal time: Tue 2025-12-30 11:12:03 UTC
                 RTC time: Tue 2025-12-30 11:10:41
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

Meaning: “System clock synchronized: yes” is necessary but not sufficient. The interesting lines are RTC time and RTC in local TZ.

Decision: If RTC differs by more than a second or two, plan to fix RTC discipline (later tasks). If “RTC in local TZ: yes” on a server, you likely want to switch to UTC unless you have a dual-boot reason.

Task 2: Make sure chrony is actually the NTP engine

cr0x@server:~$ systemctl status chrony --no-pager
● chrony.service - chrony, an NTP client/server
     Loaded: loaded (/lib/systemd/system/chrony.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-12-30 09:10:22 UTC; 2h 1min ago
       Docs: man:chronyd(8)
             man:chronyc(1)
   Main PID: 712 (chronyd)
      Tasks: 1 (limit: 38219)
     Memory: 3.2M
        CPU: 1.821s
     CGroup: /system.slice/chrony.service
             └─712 /usr/sbin/chronyd -F 1

Meaning: Chronyd is running. The “-F 1” is Debian’s default system-call (seccomp) filter level, a hardening flag; it has nothing to do with frequency or drift correction.

Decision: If chrony isn’t active, fix that first. If it is active, check for conflicts next.

Task 3: Detect time-service conflicts (timesyncd, ntpd, vendor agents)

cr0x@server:~$ systemctl list-units --type=service | egrep -i 'chrony|timesync|ntp|openntpd'
chrony.service                      loaded active running chrony, an NTP client/server
systemd-timesyncd.service           loaded active running Network Time Synchronization

Meaning: You have systemd-timesyncd running alongside chrony. That’s a classic steering wheel duel.

Decision: Disable timesyncd if chrony is the chosen authority.

cr0x@server:~$ sudo systemctl disable --now systemd-timesyncd.service
Removed "/etc/systemd/system/sysinit.target.wants/systemd-timesyncd.service".

Meaning: timesyncd is stopped and won’t return at boot.

Decision: Re-check drift symptoms after a few hours; if time jumps stop, you just solved it.

Task 4: Check chrony’s view of sync health (tracking)

cr0x@server:~$ chronyc tracking
Reference ID    : 192.0.2.10 (ntp1.example.net)
Stratum         : 3
Ref time (UTC)  : Tue Dec 30 11:12:00 2025
System time     : 0.000214567 seconds slow of NTP time
Last offset     : -0.000180123 seconds
RMS offset      : 0.000355901 seconds
Frequency       : 18.421 ppm slow
Residual freq   : -0.012 ppm
Skew            : 0.098 ppm
Root delay      : 0.012345678 seconds
Root dispersion : 0.001234567 seconds
Update interval : 64.0 seconds
Leap status     : Normal

Meaning: The system is ~214 microseconds slow. That’s fine. But note Frequency 18.421 ppm slow: your clock would lose ~1.6 seconds per day without correction.

Decision: If frequency is large (say >100 ppm) or skew is huge, suspect unstable clocksource or virtualization. If offsets are small but your apps claim skew, you may have stepping/jumping events (Task 7/8).

Task 5: Inspect sources and selection behavior

cr0x@server:~$ chronyc sources -v
  .-- Source mode  '^' = server, '=' = peer, '#' = local clock.
 / .-- Source state '*' = current best, '+' = combined, '-' = not combined,
| /   '?' = unreachable, 'x' = time may be in error, '~' = too variable.
||                                                 .- xxxx [ yyyy ] +/- zzzz
||      Reachability register (octal)              |  xxxx = adjusted offset,
||      Log2(Polling interval)                     |  yyyy = measured offset,
||                                                 |  zzzz = estimated error.
||                                                 |
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* ntp1.example.net              2   6   377    21   -180us[ -220us] +/-  900us
^+ ntp2.example.net              2   6   377    19   +120us[  +80us] +/- 1100us
^- 203.0.113.44                  3   6   377    17   +980us[ +950us] +/- 5500us

Meaning: You have a preferred source (^*) and a combined backup (^+). The third source is not combined (^-), likely due to worse error bounds.

Decision: If the best source has high error (+/- tens of milliseconds), fix your upstream or network path. If sources flap to unreachable, your drift may actually be long intervals without sync.

Task 6: Verify chrony can consistently reach its sources

cr0x@server:~$ chronyc activity
200 OK
3 sources online
0 sources offline
0 sources doing burst (return to online)
0 sources doing burst (return to offline)
0 sources with unknown address

Meaning: Chrony sees stable reachability. Good.

Decision: If sources are frequently offline, fix networking/firewalls before touching chrony tuning.

Task 7: Find evidence of stepping (time jumps)

cr0x@server:~$ journalctl -u chrony --since "24 hours ago" --no-pager | egrep -i 'step|slew|makestep|System clock wrong'
Dec 30 09:10:22 server chronyd[712]: System clock wrong by -3.842311 seconds, adjustment started
Dec 30 09:10:22 server chronyd[712]: System clock was stepped by -3.842311 seconds

Meaning: Chrony stepped the clock at service start. That’s okay if it’s only at boot and within an expected threshold.

Decision: If stepping happens during business hours, tune makestep policy (later), or find the external actor forcing the step (VM tools, manual scripts, RTC misreads).

Task 8: Check clocksource stability (drift at the hardware level)

cr0x@server:~$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
cr0x@server:~$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm

Meaning: You’re on TSC. That’s usually good on modern CPUs with invariant TSC. On some platforms (or certain VM configs) it’s trouble.

Decision: If you see wild frequency values in chronyc tracking or time jumps correlated with CPU power changes, test another clocksource on next reboot via kernel parameter (carefully, with change control).
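
If you decide to test an alternate clocksource, the usual route is a kernel parameter at the next boot. A hedged sketch for a GRUB-based Debian install (the hpet choice is illustrative, not a recommendation; go through change control):

cr0x@server:~$ sudoedit /etc/default/grub
# add clocksource=hpet to the kernel command line, e.g.:
# GRUB_CMDLINE_LINUX_DEFAULT="quiet clocksource=hpet"
cr0x@server:~$ sudo update-grub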

Task 9: Check the RTC itself and compare to system time

cr0x@server:~$ sudo hwclock --show
2025-12-30 11:10:41.123456+00:00
cr0x@server:~$ date -u
Tue Dec 30 11:12:07 UTC 2025

Meaning: RTC is ~86 seconds behind. If you reboot now, you’ll likely come up behind and then chrony will correct.

Decision: If RTC is consistently off by tens of seconds or minutes, make RTC discipline explicit: either write system time back to RTC at controlled times, or use chrony’s RTC tracking to compensate.

Task 10: Confirm whether your RTC is treated as UTC or localtime

cr0x@server:~$ timedatectl show -p Timezone -p LocalRTC
Timezone=Etc/UTC
LocalRTC=no

Meaning: RTC is UTC. This is the sane server default.

Decision: If LocalRTC=yes and you’re not dual-booting, change it (Task 11). If you are dual-booting, decide who wins and document it.

Task 11: Set RTC to UTC (typical server best practice)

cr0x@server:~$ sudo timedatectl set-local-rtc 0 --adjust-system-clock

Meaning: The OS will treat RTC as UTC going forward. With --adjust-system-clock, systemd reinterprets the RTC under the new setting and sets the system clock from it, so the system clock can move; without the flag, the RTC is rewritten from the system clock instead.

Decision: If chrony is already synced and correct, consider omitting --adjust-system-clock so the verified system clock is written to the RTC, not the other way around. Either way, do this during a maintenance window in case the system clock steps, then write a correct system time to RTC (Task 12) once chrony is stable.

Task 12: Write corrected system time to RTC (one-time cleanup)

cr0x@server:~$ chronyc waitsync 30
try: 1, refid: C000020A, correction: -0.000214567, skew: 0.098
cr0x@server:~$ sudo hwclock --systohc --utc

Meaning: waitsync blocks until chrony reports sync (the 30 is a try limit), so you only snapshot a verified time into RTC. --systohc writes the current system time to the RTC.

Decision: If RTC drift is extreme, you may repeat this after a day and measure drift rate. If drift is just “normal bad,” you can compensate with chrony’s drift and RTC behaviors instead of frequent writes.

Task 13: Confirm chrony step/slew policy (you want it explicit)

cr0x@server:~$ grep -v '^\s*#' /etc/chrony/chrony.conf | sed '/^\s*$/d'
pool ntp.example.net iburst maxsources 4
driftfile /var/lib/chrony/chrony.drift
rtcsync
makestep 1.0 3
keyfile /etc/chrony/chrony.keys
leapsectz right/UTC
logdir /var/log/chrony

Meaning: makestep 1.0 3 means: step if offset > 1 second, but only in the first 3 updates. That’s a sane “fix at boot, then slew.” rtcsync keeps RTC roughly in line.

Decision: If you see no makestep, you might be slewing forever after a long outage, which looks like persistent drift. If you see very permissive stepping, you might be causing mid-day jumps.

Task 14: Look for suspend/resume time damage (laptops, some servers)

cr0x@server:~$ journalctl --since "7 days ago" --no-pager | egrep -i 'suspend|resume|PM: suspend|PM: resume'
Dec 29 18:44:02 server kernel: PM: suspend entry (deep)
Dec 29 19:12:19 server kernel: PM: suspend exit

Meaning: If this is a laptop or a server with aggressive power management, resume can cause time discontinuities depending on platform/firmware.

Decision: If skew spikes after resume, consider tighter chrony step policy after resume (service restart hook), or disable deep suspend on systems that must keep time strictly.
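
One way to force a quick correction on resume is a oneshot unit hooked to suspend.target. A sketch with a made-up unit name; chronyc burst is a real command that requests a rapid run of measurements:

cr0x@server:~$ sudo tee /etc/systemd/system/chrony-resume.service <<'EOF'
[Unit]
Description=Kick chrony after resume (hypothetical helper unit)
After=suspend.target

[Service]
Type=oneshot
ExecStart=/usr/bin/chronyc burst 4/4

[Install]
WantedBy=suspend.target
EOF
cr0x@server:~$ sudo systemctl enable chrony-resume.service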

Task 15: Verify no one is manually setting time (yes, really)

cr0x@server:~$ journalctl --since "48 hours ago" --no-pager | egrep -i 'set time|time has been changed|CLOCK_SET|adjtimex'
Dec 30 10:03:11 server systemd[1]: Time has been changed

Meaning: Something triggered a time change. That could be chrony stepping, or an admin, or another agent.

Decision: Correlate timestamps with chrony logs (Task 7). If it’s not chrony, hunt the culprit: configuration management, hypervisor tools, or hand-rolled scripts.

Chrony tweaks that actually change outcomes

Chrony is often “fine” out of the box. But “fine” isn’t the same as “boring,” and boring is what you want for time. These are the knobs that matter when drift persists.

Makestep: control when stepping is allowed

Why you care: If a server boots 30 seconds wrong and you only allow slewing, it may take a long time to converge. During that time you get authentication failures and confusing log order. If you allow stepping at any time, some applications will break when time moves backward.

What I do on most Debian servers:

  • Allow steps early after boot only, with a low threshold.
  • After that, slew corrections only.

Example policy (already common in Debian-ish defaults, but verify):

cr0x@server:~$ sudo bash -lc 'printf "%s\n" "makestep 1.0 3" | tee /etc/chrony/conf.d/10-makestep.conf'
makestep 1.0 3

Meaning: Step only if the offset is more than one second and only during the first three updates.

Decision: If your boot-time offsets are routinely larger than a second (bad RTC, long downtime), fix the RTC problem rather than raising this threshold to something lazy like 30 seconds.
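
One detail worth making explicit: drop-ins under /etc/chrony/conf.d/ are read only at startup (Debian’s stock chrony.conf includes the directory via a confdir directive; verify yours does), so restart and confirm the policy took effect:

cr0x@server:~$ sudo systemctl restart chrony
cr0x@server:~$ journalctl -u chrony -n 5 --no-pager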

rtcsync vs RTC drift tracking: know what you’re buying

rtcsync instructs chronyd to periodically copy system time to RTC (via the kernel 11-minute mode). It helps keep RTC close enough that reboot doesn’t start wildly wrong.

But if your RTC is extremely drifty, writing frequently doesn’t fix the oscillator; it just hides the symptom until the next long power-off. If you need more, consider:

  • Improving the RTC environment (firmware updates, disabling weird power features).
  • Using a stable upstream time source (local NTP servers, PTP in some environments).
  • For truly strict systems, a local reference (PPS, GPS) and tighter holdover strategy.
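
If you want chrony to measure and compensate RTC drift rather than just overwriting the RTC, it has a dedicated mode for that. A sketch with real directives but a hypothetical drop-in path; note that rtcfile and rtcsync are mutually exclusive, so remove rtcsync first:

cr0x@server:~$ sudo tee /etc/chrony/conf.d/20-rtc-tracking.conf <<'EOF'
# Let chrony track the RTC's gain/loss rate and trim the RTC
# when its error exceeds 30 seconds.
rtcfile /var/lib/chrony/chrony.rtc
rtcautotrim 30
EOF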

Poll interval and burst behavior: stop being shy on boot

iburst is good: it uses rapid samples at start to converge faster. Keep it for network sources. If your network is lossy, you may need to keep more sources and avoid single points of time failure.

For a pool line:

cr0x@server:~$ grep -R "^\s*pool\|^\s*server" /etc/chrony
/etc/chrony/chrony.conf:pool ntp.example.net iburst maxsources 4

Meaning: Using a pool (DNS rotated) with up to 4 sources. That’s decent.

Decision: For production, prefer multiple stable internal sources if you have them; otherwise multiple external sources with sane firewall rules (UDP 123) and good DNS.
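
If reachability is flaky, you can also bound how often chrony polls each source. A sketch using real server-line options (the values are illustrative, not a recommendation):

server ntp1.example.net iburst minpoll 4 maxpoll 8

minpoll 4 and maxpoll 8 keep the poll interval between 2^4 = 16 and 2^8 = 256 seconds, trading a little network chatter for faster recovery after gaps.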

Local stratum and “I’ll be my own clock”: don’t

Chrony can advertise itself as a time source even without upstream. That’s useful for isolated labs, and dangerous everywhere else. You’ll create a confident liar: a server that tells others “I am time” while drifting.

If you see local stratum directives, pause and ask why. Most environments should not do this except as a consciously designed fallback.
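
If you truly need the fallback, make it honest about its quality. A minimal sketch for an isolated-island design:

local stratum 10 orphan

The orphan option makes identically configured servers elect one leader instead of all claiming authority at once, and stratum 10 keeps any real upstream preferred the moment it returns.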

Hardware clock and hwclock: make it boring

The RTC is not an atomic clock. It’s a small chip with a battery, often accurate enough for a wristwatch and not much more. Yet we treat it like a sacred truth at boot. That’s on us.

What “RTC is wrong” looks like

  • Time is correct after the system has been up for hours, then wrong right after reboot.
  • Time is off by a consistent amount after every power loss.
  • Dual-boot machines show exactly one-hour errors around DST changes (a localtime/UTC mismatch).

Stop doing uncontrolled RTC writes

Some admins run cron jobs like “hwclock --systohc every minute.” That turns RTC writes into a noisy ritual and increases the chance you write wrong time during a transient bad sync. If you’re going to write to RTC, do it after sync is confirmed, and not constantly.

Measure RTC drift rate (so you can pick a policy)

You can quantify drift without special tools. Do this:

  1. Sync system time via chrony.
  2. Write system time to RTC once.
  3. After 24 hours, compare RTC to system time again (without writing).

Commands for the compare step:

cr0x@server:~$ sudo hwclock --show
2025-12-31 11:11:12.000000+00:00
cr0x@server:~$ date -u
Wed Dec 31 11:12:07 UTC 2025

Meaning: RTC is ~55 seconds slow over ~24h, which is severe. That’s not a chrony tuning problem; that’s hardware/firmware or a platform issue.

Decision: For that level of drift, rely on NTP discipline and treat RTC as “bootstrapping only,” while investigating BIOS updates, motherboard issues, or replacing the hardware if it matters.
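
If you’d rather compute the delta than eyeball timestamps, a minimal sketch (assumes GNU date, which parses hwclock’s ISO output):

cr0x@server:~$ echo $(( $(date -u +%s) - $(date -d "$(sudo hwclock --show)" +%s) ))
55

A positive number means the RTC is behind the system clock; divide by the elapsed hours since your last --systohc write to get a drift rate.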

Virtualization and laptops: drift factories

On bare metal servers with stable clocksources, chrony usually converges and stays boring. VMs and laptops add chaos:

  • VM time sync agents can step the guest clock behind chrony’s back.
  • CPU frequency scaling and sleep states can change oscillator behavior and timing stability.
  • Host overload can delay vCPU scheduling, making time appear to “lag,” then catch up.

Detect a VM and check for hypervisor time features

cr0x@server:~$ systemd-detect-virt
kvm

Meaning: You’re in KVM. That’s usually fine, but you must ensure the guest uses paravirtualized clocks and that no competing time sync tools are stepping time.

Decision: If you run guest agents that sync time, either disable their time sync or configure them to not fight chrony.
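
On KVM specifically, the host can expose its clock to guests as a PTP device via the ptp_kvm module, and chrony can use that as a reference clock. A sketch, assuming your guest kernel ships the module:

cr0x@server:~$ sudo modprobe ptp_kvm
cr0x@server:~$ ls /dev/ptp*
/dev/ptp0

Then a refclock line in chrony’s config (device name taken from the listing above):

refclock PHC /dev/ptp0 poll 2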

Get per-source measurement detail from chrony (when tracking isn’t enough)

cr0x@server:~$ chronyc ntpdata 192.0.2.10
Remote address  : 192.0.2.10 (C000020A)
Leap status     : Normal
Version         : 4
Mode            : Server
Stratum         : 2
Poll interval   : 6 (64 seconds)
Precision       : -20 (0.000000954 seconds)
Root delay      : 0.012346 seconds
Root dispersion : 0.001235 seconds
Offset          : -0.000214 seconds
Peer delay      : 0.001233 seconds
Peer dispersion : 0.000058 seconds
Response time   : 0.000091 seconds
Jitter asymmetry: +0.00
NTP tests       : 111 111 1111
Authenticated   : No

Meaning: This is chrony’s raw view of one source: offset, delay, dispersion, and the NTP sanity tests. Don’t reach for timedatectl timesync-status here: it reports systemd-timesyncd’s state, so once chrony is the authority it shows nothing useful (or errors out).

Decision: If root distance is huge or offset is swinging, focus on source quality and virtualization scheduling first.

Second short joke: if you’re debugging VM clock drift at 3 a.m., remember the guest isn’t late—it’s just living in a different timezone called “host contention.”

Three corporate mini-stories (how this fails in real life)

Mini-story 1: The incident caused by a wrong assumption

The company ran a fleet of Debian-based API nodes behind a load balancer. Everything looked normal: low latency, low error rates. Then, during a certificate rotation, a small percentage of nodes started failing outbound TLS calls with “certificate not yet valid.” The failures weren’t uniform; they came in waves.

The on-call engineer checked chrony and saw it was synced. That seemed to rule out clock skew. They escalated to networking and PKI, because “NTP is fine.” That assumption was the trap: “synced” was being interpreted as “stable.”

The real problem was boot-time: the RTC was kept in localtime on some hosts due to a dual-boot imaging artifact from a hardware test workflow. On reboot, those nodes came up minutes off, then chrony stepped them back into shape within a few update cycles. Between boot and stabilization, TLS validity windows bit them.

The fix wasn’t fancy. They standardized RTC to UTC, disabled the second time service that occasionally fought chrony, and added a health gate: nodes could not enter the load balancer pool until chronyc waitsync succeeded.

The lesson: “NTP synchronized” is a status, not a guarantee. Treat time like any other dependency. You don’t put a node in service just because the process is running.

Mini-story 2: The optimization that backfired

A performance team wanted faster boots and less background noise. They noticed chrony sometimes did an initial burst and a step at startup. That looked “messy.” Someone decided to remove iburst and disable stepping entirely—slew-only, always. It felt more elegant. Time should be smooth, right?

It worked fine on healthy nodes. Then a subset of machines spent a weekend in a rack with the top-of-rack switch misconfigured. No NTP reachability for two days. When networking was fixed, the servers were tens of seconds off.

Slew-only correction meant they took a long time to converge. During that convergence, distributed traces were scrambled, caches expired too early/late, and a batch job that depended on time-based partitions started reading “future” paths. Operations couldn’t decide what happened when, which is a special kind of misery.

They restored a sane makestep policy: step only early in the sync process. They also added alerting on “time since last sync” and “current absolute offset,” not just “daemon running.”

Optimization lesson: smooth time is good, but correctness is better. Boot-time stepping is not a moral failure; it’s a controlled correction. Make it explicit and bounded.

Mini-story 3: The boring but correct practice that saved the day

A bank-ish enterprise ran a mixed environment: bare metal databases, VM application servers, and a few legacy appliances. Their time architecture was unglamorous: two internal NTP servers per datacenter, each with multiple upstreams, and strict configuration management to ensure every node used chrony with the same baseline policy.

One morning, an upstream provider had issues that caused NTP jitter spikes. Many environments would have quietly degraded. In this one, chrony’s source selection and max error bounds helped, but the real hero was the boring practice: they had a routine check that compared chronyc tracking output across a sample of nodes and alerted on outliers in frequency and skew, not just offset.

They caught two VM clusters where host scheduling delays caused guests to wander more than usual. Those guests didn’t look “down.” They looked “a bit weird.” The outlier detection flagged them before authentication started failing.

The mitigation was simple: temporarily pin vCPU resources for the worst offenders and rebalance workloads. The long-term fix involved tuning the hypervisor time settings and ensuring guest agents weren’t stepping clocks.

The point: timekeeping is one of those domains where boring processes are the whole game. Alert on the right metrics, enforce consistency, and the weird stuff becomes visible early.

Common mistakes: symptom → root cause → fix

  • Symptom: Chrony shows “System clock synchronized: yes” but your app logs show time going backward.
    Root cause: Clock stepping during runtime (chrony makestep too permissive, or another agent stepping).
    Fix: Restrict stepping to early boot (makestep 1.0 3 style), disable competing time services, and investigate VM tools time sync.
  • Symptom: Time is wrong right after reboot, then slowly gets correct.
    Root cause: RTC is wrong or treated as localtime when it should be UTC.
    Fix: Set RTC to UTC (timedatectl set-local-rtc 0), then after chrony sync, write system time to RTC once (hwclock --systohc --utc).
  • Symptom: Offset keeps growing despite “good” sources.
    Root cause: Unstable clocksource or virtualization scheduling causing frequency instability; chrony is constantly correcting but losing ground.
    Fix: Validate clocksource, update firmware, consider alternate clocksource, and fix hypervisor contention.
  • Symptom: Chrony alternates its best source frequently, and sources carry the “~” (too variable) marker.
    Root cause: Bad upstream NTP quality, asymmetric network path, or firewall/NAT oddities.
    Fix: Use fewer, better sources; prefer internal NTP; ensure UDP 123 is stable; avoid hairpin NAT for time.
  • Symptom: Drift happens mainly after suspend/resume.
    Root cause: Platform firmware/clocksource behavior during sleep; timekeeping discontinuities on resume.
    Fix: Re-sync immediately on resume (restart chrony or force burst), or disable deep sleep for systems with strict time needs.
  • Symptom: Kerberos “clock skew too great” sporadically on only some nodes.
    Root cause: Nodes converge slowly after outages due to slew-only policy, or RTC poisoning on boot.
    Fix: Allow controlled stepping early, gate service readiness on chronyc waitsync, fix RTC discipline.
  • Symptom: After changing timezone/DST settings, timestamps are off by exactly one hour.
    Root cause: RTC set to localtime combined with DST transitions, or dual-boot conflicts.
    Fix: Standardize RTC to UTC on Linux hosts; if dual-booting, pick one policy and implement it consistently.

Checklists / step-by-step plan

Checklist A: Stop the bleeding (15–30 minutes)

  1. Confirm chrony is running: systemctl status chrony.
  2. Disable competing time sync daemons: remove systemd-timesyncd if chrony is primary.
  3. Check for recent steps: journalctl -u chrony | egrep -i 'stepped|wrong by'.
  4. Inspect tracking and sources: chronyc tracking and chronyc sources -v.
  5. If offset is large and you’re early after boot, allow a controlled step (configure makestep) and restart chrony in a maintenance window.

Checklist B: Make RTC behavior deterministic (1–2 hours)

  1. Decide RTC policy: UTC for servers unless a strong reason exists.
  2. Set policy: timedatectl set-local-rtc 0 --adjust-system-clock.
  3. Wait for sync: chronyc waitsync 30.
  4. Write correct system time to RTC once: hwclock --systohc --utc.
  5. Enable rtcsync in chrony for “keep RTC close enough.”

Checklist C: Verify stability over time (next 24–72 hours)

  1. Record baseline: chronyc tracking frequency/skew and RTC difference (hwclock --show vs date -u).
  2. Watch for stepping events in logs.
  3. If frequency is extreme or unstable, investigate clocksource and virtualization.
  4. Alert on “time since last sync” and “absolute offset,” not just “chrony is active” (a sketch follows below).
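
For item 4, here is a minimal offset-check sketch you could wire into any monitoring agent. The path and the 50 ms threshold are assumptions, and it parses chrony’s human-readable output, so verify the format on your chrony version:

cr0x@server:~$ cat /usr/local/bin/check-time-offset
#!/bin/sh
# Alert if the current absolute offset exceeds the threshold (seconds).
# chronyc prints the offset magnitude plus "slow"/"fast", so $4 is already absolute.
threshold=0.05
offset=$(chronyc tracking | awk '/^System time/ {print $4}')
if awk -v o="$offset" -v t="$threshold" 'BEGIN { exit !(o > t) }'; then
    echo "ALERT: clock offset ${offset}s exceeds ${threshold}s"
    exit 1
fi
echo "OK: offset ${offset}s"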

FAQ

1) Chrony says synced. Why does my monitoring still report drift?

Monitoring may be comparing against a different reference (another NTP server, a monitoring appliance, or application-level skew). Verify with chronyc tracking and ensure no other service is changing time. Also check for stepping events in journalctl.

2) Should I keep the hardware clock in UTC or localtime?

UTC for servers. Localtime is mainly for dual-boot convenience with Windows. UTC avoids DST weirdness and reduces “exactly one hour off” incidents.

3) Is it safe to allow time stepping?

Yes, if you bound it. Allow stepping early after boot (or after long unsynced periods) using makestep. Avoid stepping during steady state for systems sensitive to backward jumps.

4) What does “Frequency 18 ppm slow” mean in chronyc tracking?

Your clock runs slow by 18 parts per million. That’s roughly 1.6 seconds per day without correction. Chrony compensates by adjusting the system clock rate.

5) My RTC differs from system time by a minute. Is that normal?

No. A few seconds over days might happen on weak RTCs. A minute suggests firmware/hardware issues, misconfiguration (localtime/UTC), or that the RTC isn’t being maintained (no rtcsync and no periodic sync policy).

6) Do I need to run hwclock --systohc regularly?

Usually no. Use rtcsync and let the system keep RTC close enough. Do a one-time write after fixing configuration, and only add scheduled writes if you’ve measured a need.

7) Can virtualization cause drift even with perfect NTP sources?

Yes. Guest scheduling delays, unstable virtual clocks, and hypervisor tools that step time can all create drift or jumps. Confirm virtualization type and remove competing time sync mechanisms.

8) What’s the quickest way to prove RTC is the problem?

Compare RTC to system time immediately before reboot, reboot, then compare again right after boot before chrony converges. If the post-boot time starts wrong in the same direction, RTC is poisoning the system clock at startup.

9) Why does drift get worse during high CPU load?

Under heavy load, VM guests can be descheduled and “fall behind,” then catch up. On bare metal, load can expose clocksource quirks or firmware power-management behavior. The symptom is often increased jitter and unstable frequency estimates.

10) Should I replace chrony with another NTP client?

Not as a first move. Chrony is generally excellent on Debian, especially for intermittent connectivity and modern networks. Most “chrony drift” bugs are actually RTC mismanagement, conflicts, or upstream quality issues.

Conclusion: next steps that stop time creep

When NTP “works” but drift persists, you’re not looking at one problem. You’re looking at a governance problem: who is allowed to set time, when, and based on what authority.

  1. Pick one time authority on the host (chrony) and remove competitors (timesyncd, vendor agents that step time).
  2. Make stepping policy explicit: step early, slew later. Use makestep like you mean it.
  3. Fix the RTC contract: set RTC to UTC, write correct time once after sync, and use rtcsync to keep it close enough.
  4. Validate the platform clocksource if frequency/skew looks unstable, especially in VMs.
  5. Operationalize it: gate service readiness on sync, and alert on offset and loss of sync—not just “daemon up.”

If you do those five things, Debian 13 timekeeping becomes what it should be: invisible, boring, and not the reason your authentication stack is suddenly “mysteriously flaky.”
