Debian 13: NTP Works but Drift Persists — Hardware Clock and Chrony Tweaks (Case #79)

Everything says “synced.” Your dashboards are green. Yet your logs insist the server is slowly time-traveling: TLS handshakes flake, Kerberos gets moody, backup windows slide, and distributed traces look like modern art.

This is the specific kind of failure that makes SREs suspicious of reality itself: NTP is “working,” but drift persists. The fix is rarely “restart chrony.” The fix is understanding which clock is lying, who is disciplining whom, and what your hardware (or hypervisor) is quietly doing behind your back.

What “NTP works but drift persists” actually means

When people say “NTP works,” they usually mean one of these:

  • The service is running (chronyd is active).
  • There are reachable sources (some servers show as online).
  • The system clock is not wildly wrong at the moment they checked.

When they say “drift persists,” they usually mean something else entirely:

  • The clock is correct right after boot but wrong hours later.
  • The clock is correct, but jumps forward/backward occasionally.
  • NTP reports synchronized, yet application timestamps disagree across nodes.
  • The system clock is okay, but the hardware clock (RTC) is wrong, so every reboot starts from the wrong time again.

The trap: NTP is not a “set time once” feature. It’s a control system. If the control loop is fighting a bad oscillator, a broken clocksource, a hypervisor time smear, or a policy that forbids stepping, you’ll get “synced” status and still experience drift-like symptoms.

Interesting facts and historical context (so you stop fighting physics)

  1. NTP predates the modern Internet. The first NTP spec work started in the early 1980s; it’s older than most “cloud-native” careers.
  2. Unix time is not monotonic. The “wall clock” can go backward; monotonic time is a separate counter used for measuring intervals.
  3. Leap seconds are not theoretical. They’ve been inserted 27 times since 1972, and handling differs across systems.
  4. The PC RTC was never meant to be a precision time source. It’s a convenience clock designed to survive power cycles, not to win accuracy awards.
  5. Quartz oscillators drift with temperature and age. Commodity parts routinely drift tens of ppm; that’s seconds per day, not per month.
  6. Some hypervisors “help” by forcing guest time. That can fight chrony, producing sawtooth offsets and “mysterious” jumps.
  7. Kernel clocksource choices matter. A stable TSC can be excellent; an unstable one is chaos with plausible deniability.
  8. Chrony was built for ugly networks. It handles intermittent connectivity and large latency variation better than classic ntpd in many cases.

Fast diagnosis playbook

When time drift is reported, you want to find the bottleneck fast: is it input (bad sources), control policy (won’t step, wrong thresholds), the oscillator (drifty hardware/VM), or the reboot path (wrong RTC)? Work through these four checks in order; a combined triage sketch follows step 4.

1) Confirm which time is wrong (system clock vs RTC vs “app time”)

  • Check system time, timezone, and sync state.
  • Check RTC time and whether it’s stored as UTC.
  • Check whether the complaint is about log ordering, TLS validity, or monotonic timers (different causes).

2) Look at chrony’s view of reality

  • Is chrony tracking a stable source?
  • Is it slewing painfully slowly due to limits?
  • Is it frequently stepping because offsets get huge?

3) Look for time jumps and fights

  • Kernel log messages about clocksource instability.
  • VM tools / hypervisor guest agent forcing time.
  • Multiple time daemons running at once.

4) If it’s stable while running but wrong after reboot: RTC path

  • hwclock config, UTC/localtime mismatch, missing periodic sync.
  • Bad RTC battery (yes, still a thing).
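
If you want steps 1–4 as one quick pass, here’s a minimal, read-only triage sketch (nothing here changes state; each command is unpacked in the tasks below):

cr0x@server:~$ timedatectl
cr0x@server:~$ sudo hwclock --show; date -u
cr0x@server:~$ chronyc tracking
cr0x@server:~$ chronyc sources -v
cr0x@server:~$ journalctl -k --no-pager | grep -i clocksource
cr0x@server:~$ cat /etc/adjtime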

One idea worth taping to your monitor, paraphrased from Gene Kranz: “Be tough and competent.” Timekeeping bugs require both traits.

Mental model: three clocks and a couple of liars

On a Debian 13 system you care about three time constructs:

  • System clock (a.k.a. wall clock): what date prints; used for timestamps, cert validity, cron, Kerberos, logs.
  • Monotonic clock: used for measuring intervals; should not go backwards; time daemons can adjust its rate indirectly, but it won’t “jump” the same way.
  • Hardware clock (RTC): a battery-backed clock on the motherboard (or a virtual equivalent). Used at boot to seed the system clock.

NTP/chrony disciplines the system clock. It does not automatically “fix” your RTC unless you explicitly sync it or configure a service to do so. If your RTC is wrong, you reboot into wrongness, then chrony spends time correcting it—sometimes slowly, sometimes not at all if policy forbids stepping.
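
A quick, read-only way to see all three side by side (/proc/uptime is a convenient stand-in for the monotonic counter):

cr0x@server:~$ date -u               # system (wall) clock
cr0x@server:~$ cat /proc/uptime      # seconds since boot, paced by the monotonic clock
cr0x@server:~$ sudo hwclock --show   # hardware clock (RTC)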

Also: Debian can run systemd-timesyncd or chrony (or both, which is not a cool flex). Pick one. For servers, pick chrony. It’s more transparent, more tunable, and it tells you what it thinks is happening.

Joke #1: Time is an illusion. Chrony is an illusion with logs.

Practical tasks: commands, outputs, and decisions (12+)

These are the tasks I run in production when someone claims “NTP is fine” while the system time clearly isn’t. Every task includes: the command, a realistic output shape, and what decision you make next.

Task 1: Check the current state with timedatectl

cr0x@server:~$ timedatectl
               Local time: Wed 2025-12-31 10:14:22 UTC
           Universal time: Wed 2025-12-31 10:14:22 UTC
                 RTC time: Wed 2025-12-31 09:58:05
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

What it means: The system clock is synced, but the RTC is behind by ~16 minutes. That’s a reboot trap.

Decision: If this is a real hardware RTC, plan to sync RTC from system clock after chrony stabilizes (and investigate why RTC is drifting). If it’s a VM, check guest/host time sync settings.

Task 2: Verify which time sync service is active (avoid daemon fights)

cr0x@server:~$ systemctl status chrony --no-pager
● chrony.service - chrony, an NTP client/server
     Loaded: loaded (/lib/systemd/system/chrony.service; enabled; preset: enabled)
     Active: active (running) since Wed 2025-12-31 09:50:10 UTC; 24min ago
       Docs: man:chronyd(8)
             man:chronyc(1)
    Process: 612 ExecStart=/usr/sbin/chronyd -F 1 (code=exited, status=0/SUCCESS)
   Main PID: 640 (chronyd)
      Tasks: 1
     Memory: 3.2M
        CPU: 120ms
     CGroup: /system.slice/chrony.service
             └─640 /usr/sbin/chronyd -F 1
cr0x@server:~$ systemctl status systemd-timesyncd --no-pager
● systemd-timesyncd.service - Network Time Synchronization
     Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; disabled; preset: enabled)
     Active: inactive (dead)

What it means: Only chrony is active, which is what you want.

Decision: If both are active, disable systemd-timesyncd (or remove chrony, but don’t do that on servers without a reason).

Task 3: Get chrony’s tracking view (is it confident or guessing?)

cr0x@server:~$ chronyc tracking
Reference ID    : 203.0.113.10 (ntp1.example.net)
Stratum         : 3
Ref time (UTC)  : Wed Dec 31 10:14:18 2025
System time     : 0.000423812 seconds slow of NTP time
Last offset     : -0.000221901 seconds
RMS offset      : 0.000612554 seconds
Frequency       : 24.731 ppm fast
Residual freq   : -0.112 ppm
Skew            : 0.931 ppm
Root delay      : 0.022741 seconds
Root dispersion : 0.004921 seconds
Update interval : 64.2 seconds
Leap status     : Normal

What it means: System time is effectively fine right now. Frequency is 24.7 ppm fast, which is not crazy for commodity hardware, but it tells you drift exists and chrony is compensating.

Decision: If users still report drift, it’s likely reboots/RTC, periodic jumps, or another component overriding time.

Task 4: Inspect sources quality (you can’t discipline against garbage)

cr0x@server:~$ chronyc sources -v
  .-- Source mode  '^' = server, '=' = peer, '#' = local clock.
 / .- Source state '*' = current best, '+' = combined, '-' = not combined,
| /             'x' = may be in error, '~' = too variable, '?' = unreachable.
||                                                 .- xxxx [ yyyy ] +/- zzzz
||      Reachability register (octal) -.           |  xxxx = adjusted offset,
||      Log2(Polling interval) --.    |            |  yyyy = measured offset,
||                                \   |            |  zzzz = estimated error.
||                                 |  |            |
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* ntp1.example.net              2   6   377    22   -221us[ -311us] +/-  8ms
^+ ntp2.example.net              2   6   377    21   -102us[ -190us] +/- 10ms
^- ntp3.example.net              3   6   377    20   +891us[ +830us] +/- 22ms

What it means: Healthy reach (377), multiple sources, small offsets. Not your problem.

Decision: If you see ? unreachable or ~ too variable, fix network path, firewall, or choose better servers.

Task 5: Check for stepping events and big corrections

cr0x@server:~$ journalctl -u chrony --since "today" --no-pager
Dec 31 09:50:10 server chronyd[640]: chronyd version 4.5 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +NTS)
Dec 31 09:50:11 server chronyd[640]: System clock was stepped by -963.241 seconds
Dec 31 09:50:11 server chronyd[640]: Selected source 203.0.113.10
Dec 31 09:52:15 server chronyd[640]: System clock TAI offset set to 37 seconds

What it means: At startup, the clock was almost 16 minutes off and got stepped. That usually comes from a bad RTC or bad initial time seed.

Decision: Fix RTC, and ensure chrony is allowed to step early in boot (makestep), otherwise services start with wrong time and never fully recover.

Task 6: Compare RTC vs system time directly

cr0x@server:~$ sudo hwclock --show
2025-12-31 09:58:15.123456+00:00
cr0x@server:~$ date -u
Wed Dec 31 10:14:31 UTC 2025

What it means: RTC is behind. If you reboot now, you’ll boot in the past.

Decision: If this is a physical server, consider RTC battery/firmware issues. If it’s a VM, stop treating RTC as authoritative and fix host/guest time integration.

Task 7: Check whether RTC is configured as UTC (it should be)

cr0x@server:~$ cat /etc/adjtime
0.000000 1767174612 0.000000
0
UTC

What it means: RTC is expected to be UTC. Good.

Decision: If you see LOCAL here on a server, fix it unless you have a very specific dual-boot reason.

Task 8: Identify clocksource and whether the kernel thinks it’s stable

cr0x@server:~$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
cr0x@server:~$ dmesg | grep -i clocksource | tail -n 8
[    0.000000] tsc: Detected 2592.000 MHz processor
[    0.120000] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2557f2f71a, max_idle_ns: 440795309246 ns
[    0.130000] clocksource: Switched to clocksource tsc

What it means: Using TSC. On modern hardware, this can be great. On some virtualized or power-managed scenarios, it can be trouble.

Decision: If you see instability warnings or frequent time jumps, test another clocksource (see later section).

Task 9: Detect time jumps and “clock went backwards” at the application layer

cr0x@server:~$ journalctl -k --since "today" --no-pager | grep -E "time.*(jump|backward)|clocksource.*unstable"
Dec 31 10:02:44 server kernel: clocksource: timekeeping watchdog on CPU2: Marking clocksource 'tsc' as unstable because the skew is too large
Dec 31 10:02:44 server kernel: clocksource: Switched to clocksource hpet

What it means: The kernel caught TSC instability and switched clocksources. That is a smoking crater for “drift persists.”

Decision: Fix BIOS/firmware settings, disable aggressive C-states, or pin to a stable clocksource. On VMs, fix host time/tsc settings.

Task 10: Ensure NTP packets are not being blocked or mangled

cr0x@server:~$ sudo ss -uapn | grep :123
UNCONN 0      0          0.0.0.0:123        0.0.0.0:*    users:(("chronyd",pid=640,fd=5))
UNCONN 0      0             [::]:123           [::]:*    users:(("chronyd",pid=640,fd=6))
cr0x@server:~$ sudo nft list ruleset | grep -n "dport 123" | head
127: udp dport 123 accept

What it means: chrony is listening; firewall allows NTP.

Decision: If no accept rule exists and sources are unreachable, add a rule or fix upstream firewalling.
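
If the rule is missing and sources are unreachable, a sketch of the fix (this assumes an inet table named filter with an output chain; your ruleset’s names may differ, and a runtime rule needs persisting in /etc/nftables.conf to survive a reboot):

cr0x@server:~$ sudo nft add rule inet filter output udp dport 123 accept
cr0x@server:~$ sudo nft list ruleset | grep -n "dport 123"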

Task 11: Verify the system clock discipline status (kernel PLL behavior)

cr0x@server:~$ timedatectl show-timesync --all
SystemNTPServers=
FallbackNTPServers=
ServerName=
ServerAddress=
RootDistanceMaxUSec=5s
PollIntervalMinUSec=32s
PollIntervalMaxUSec=34min 8s
Frequency=24731

What it means: This output describes systemd-timesyncd, not chrony; with timesyncd inactive, the values here are stale or empty. Don’t treat it as truth when chrony is the active daemon.

Decision: Use chrony’s own tools for truth; avoid mixing interpretations between daemons.

Task 12: Check if chrony is allowed to step when needed (boot-time correctness)

cr0x@server:~$ grep -nE "^(makestep|rtcsync|driftfile|leapsectz)" /etc/chrony/chrony.conf
12:driftfile /var/lib/chrony/chrony.drift
18:makestep 1.0 3
22:rtcsync
26:leapsectz right/UTC

What it means: makestep 1.0 3 allows stepping if offset > 1 second, but only during the first 3 updates. rtcsync will periodically copy system time to RTC.

Decision: If you boot far off, increase the step window (count) or fix RTC so you don’t need heroics. Keeping rtcsync is usually correct on physical hardware.

Task 13: Validate driftfile updates (is chrony learning the oscillator?)

cr0x@server:~$ sudo ls -l /var/lib/chrony/chrony.drift
-rw-r--r-- 1 _chrony _chrony 18 Dec 31 10:14 /var/lib/chrony/chrony.drift
cr0x@server:~$ sudo cat /var/lib/chrony/chrony.drift
24.731 0.931

What it means: Chrony has measured frequency offset (~24.7 ppm) and estimated error. If this never updates, something is wrong (permissions, read-only FS, container constraints).

Decision: Fix persistence; without drift learning, you’ll re-learn on every boot and spend longer converging.

Task 14: Confirm nothing else is force-setting time (VM tools, scripts)

cr0x@server:~$ ps aux | egrep -i "(chronyd|ntpd|timesyncd|openntpd|ptp4l|phc2sys)" | grep -v egrep
root         640  0.0  0.1  81200  3520 ?        Ssl  09:50   0:00 /usr/sbin/chronyd -F 1

What it means: Only chronyd appears in the obvious list.

Decision: If you find multiple time daemons, pick one and disable the rest. If you’re on a VM, check guest agents separately (see virtualization section).

Chrony tweaks that actually move the needle

Chrony defaults are reasonable, but “reasonable” assumes your RTC isn’t a disaster and your platform doesn’t randomly jump time. When drift persists, you tune for two goals:

  1. Boot into correct time quickly so dependent services don’t start in the wrong century.
  2. Stay stable even when sources are noisy or connectivity is intermittent.

1) Allow stepping early, but not forever: makestep

Stepping is a sudden jump. Slewing is a gradual rate adjustment. Stepping is what you want at boot when you’re minutes off; slewing is what you want during normal operation when you’re milliseconds off.

A typical server setting:

cr0x@server:~$ sudo grep -n "^makestep" /etc/chrony/chrony.conf
18:makestep 1.0 3

Opinionated guidance: If your RTC is sometimes wrong by tens of seconds, set something like makestep 1.0 10 temporarily while you fix RTC discipline. Don’t leave “step whenever” policies in place on latency-sensitive databases unless you enjoy debugging weird transaction timestamps.

2) Persist oscillator learning: driftfile must be writable

Chrony can compensate for a consistently fast/slow clock if it can learn and store the frequency offset. That learning lives in the driftfile.

If you run immutable images or read-only root filesystems, you need to ensure /var/lib/chrony persists. If it doesn’t, you’ll get repeated long convergence windows after every reboot.
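
A fast way to prove the directory is writable by the daemon (Debian runs chronyd as _chrony; the test file is harmless):

cr0x@server:~$ sudo -u _chrony touch /var/lib/chrony/.write-test && echo writable
writable
cr0x@server:~$ sudo rm /var/lib/chrony/.write-test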

3) Sync RTC from system time (physical servers): rtcsync

rtcsync enables periodic copying of system time into the RTC. This does not make your RTC a precision device. It makes your next boot less embarrassing.

On bare metal, I generally want rtcsync. On some VMs, “RTC” is an abstraction and syncing it can be irrelevant or counterproductive if the hypervisor also manages it.

4) Control update frequency and network sensitivity

If your NTP sources are across a flaky WAN, chrony will cope, but your offsets may show more noise. You can improve stability by choosing closer servers, or by reducing reliance on a single source.

When someone proposes “just poll more often,” remember: more frequent polling can increase noise sensitivity and load. It can also upset upstream corporate NTP servers if you’re an enthusiastic poller.

5) Use good sources, not “whatever resolves”

For enterprise networks, point chrony at:

  • your internal stratum-1/2 servers (if they are competently run),
  • or a well-managed provider NTP service,
  • or local network appliances with GPS/PTP backing (where appropriate).

And always configure multiple sources. One source is a single point of self-deception.
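
As a sketch, the relevant chrony.conf lines might look like this (hostnames are placeholders; iburst speeds the initial sync, and minpoll/maxpoll bound how often chrony polls):

cr0x@server:~$ grep -nE "^(server|pool)" /etc/chrony/chrony.conf
30:server ntp1.example.net iburst minpoll 6 maxpoll 10
31:server ntp2.example.net iburst minpoll 6 maxpoll 10
32:server ntp3.example.net iburst minpoll 6 maxpoll 10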

Hardware clock (RTC) and hwclock: make it boring again

Most “drift persists” cases are not actually NTP failures. They’re RTC lifecycle failures: the machine boots with wrong time, then NTP corrects it, then someone reboots and blames NTP again. You’re not chasing drift; you’re chasing a reset-to-bad-state loop.

RTC should be UTC, almost always

On Linux servers, keep RTC in UTC. The operating system can apply timezones; the RTC should not. Timezone rules change. UTC does not. This is not ideology; it’s damage control.
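
If timedatectl reports “RTC in local TZ: yes”, flip it back (this rewrites /etc/adjtime and syncs the RTC from the system clock):

cr0x@server:~$ sudo timedatectl set-local-rtc 0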

Write system time to RTC after sync (when appropriate)

If chrony is providing good time, update RTC:

cr0x@server:~$ sudo hwclock --systohc --utc

What it means: The RTC is set from the current system time, stored as UTC.

Decision: Do this on bare metal once you’ve confirmed chrony is stable. If your RTC is drifting fast, expect to repeat and investigate hardware.

Read RTC into system time (mostly for recovery scenarios)

cr0x@server:~$ sudo hwclock --hctosys --utc

What it means: You set system time from RTC. This is typically used very early at boot by the OS anyway.

Decision: Don’t use this to “fix” a running system that’s already NTP-disciplined unless you’re intentionally overriding for incident response.

When RTC drift is real hardware

On physical servers, persistent RTC drift can be:

  • A failing RTC battery (classic, boring, real).
  • Firmware bugs around RTC updates.
  • Environmental temperature swings in edge deployments.
  • An RTC device that is just bad.

For a data center server, a rapidly drifting RTC is often a “replace the part” situation, not a “tune chrony harder” situation.

Kernel clocksource, TSC, and why “stable” is conditional

If your system clock jumps, chrony will look guilty because it’s the visible timekeeper. But the kernel clocksource is the engine under the hood. If it’s unstable, you can observe:

  • chrony offsets that sawtooth or never settle,
  • kernel logs about clocksource watchdog actions,
  • applications reporting “clock moved backwards,”
  • VM guests drifting in lockstep with host load or migration events.

Check available clocksources

cr0x@server:~$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm

What it means: Multiple clocksources exist; the kernel chooses one. TSC is usually fastest and best when invariant/stable.

Decision: If the kernel flags TSC as unstable, consider pinning to another source as a mitigation while you fix firmware/VM settings.

Pin a clocksource (mitigation, not a lifestyle)

As a test, you can set a kernel parameter at boot (via GRUB) like clocksource=hpet or similar. That’s environment-specific and should be tested. The goal is not “HPET forever.” The goal is “stop jumping time while we correct the platform.”
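
A sketch of both moves (the sysfs write is a live test that reverts on reboot; the GRUB line makes it persistent after update-grub; whether hpet is the right fallback depends on your hardware):

cr0x@server:~$ echo hpet | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
cr0x@server:~$ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet clocksource=hpet"
cr0x@server:~$ sudo update-grub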

Joke #2: The TSC is like an intern’s timesheet—accurate if supervised, creative if left alone.

Virtualization pitfalls: when your host is gaslighting your guest

On VMs, “hardware clock” is whatever the hypervisor emulates. That can be stable, or it can be “close enough” until a live migration, host overload, or suspension/resume event.

The classic VM failure mode: two masters

You run chrony inside the guest, while the hypervisor (or guest agent) also periodically forces guest time. This creates a fight:

  • chrony slews slowly, trying to remain stable,
  • hypervisor steps abruptly, to “help,”
  • applications see time jump backward/forward,
  • chrony logs stepping events and gets blamed.

Detect common guest agents

For example, on some platforms you might see qemu-guest-agent or similar:

cr0x@server:~$ systemctl status qemu-guest-agent --no-pager
● qemu-guest-agent.service - QEMU Guest Agent
     Loaded: loaded (/lib/systemd/system/qemu-guest-agent.service; enabled; preset: enabled)
     Active: active (running) since Wed 2025-12-31 09:49:58 UTC; 25min ago
   Main PID: 510 (qemu-ga)
      Tasks: 1
     Memory: 2.8M
        CPU: 52ms
     CGroup: /system.slice/qemu-guest-agent.service
             └─510 /usr/sbin/qemu-ga

What it means: Guest agent is running. That doesn’t guarantee it’s setting time, but it’s a suspect.

Decision: Check hypervisor policy: either let the guest own time via chrony, or have the host enforce time and disable guest NTP. Mixed control is where sanity goes to die.
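
One concrete example: on VMware guests with open-vm-tools, you can ask whether host-driven periodic time sync is on (the command prints Enabled or Disabled):

cr0x@server:~$ vmware-toolbox-cmd timesync status
Disabled

If it says Enabled while chrony runs in the guest, pick one owner and disable the other.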

Live migration and “sudden drift” reports

If drift appears after migrations, check for:

  • host TSC stability settings,
  • VM CPU model and invariant TSC exposure,
  • host NTP health (if the host is wrong, the guests inherit weirdness).

In enterprise VM fleets, you fix time at the host layer first. Guests are downstream consumers.

Three corporate-world mini-stories (anonymized)

Incident caused by a wrong assumption: “Synced” means correct after reboot

A mid-size company ran a Debian-based logging tier on bare metal. Monitoring checked chronyc tracking once an hour and alerted if the system time offset exceeded a threshold. It never did. Everyone felt good about time.

Then a maintenance window arrived. They rebooted half the fleet to apply kernel updates. Immediately, ingestion lag spiked and a chunk of logs landed out of order. The on-call person saw “System clock synchronized: yes” and assumed the problem was elsewhere.

The real issue was mundane: RTC drift. Each server’s RTC was several minutes behind. Pre-reboot, chrony had disciplined system time, so monitoring looked perfect. Post-reboot, services started with the wrong time, wrote a burst of incorrectly timestamped records, and only then did chrony step the clock back into place. The “damage” was already committed.

The fix was not exotic. They enabled rtcsync, verified RTC was UTC, and added a boot-time guard: services that required correct time started after chrony reported sync. They also changed their monitoring to explicitly compare RTC vs system time once a day. The lesson stuck because it was embarrassing in the simplest possible way.

Optimization that backfired: aggressive stepping avoidance on latency-sensitive nodes

A different org had a strict stance: “Never step time; slewing only.” The rationale sounded professional: stepping can confuse databases and mess with ordering. They set chrony to never step after startup and set a tiny threshold early on.

It worked fine until they moved part of the fleet into a noisy network segment. NTP packets arrived with variable delay, and offsets occasionally climbed above a second during brief brownouts. Chrony did what they asked—slew only—but slewing at the kernel’s maximum rate took a long time to correct multi-second offsets.

Applications didn’t crash. They did something subtler: auth tokens expired “early,” scheduled jobs drifted, and cross-node coordination behaved inconsistently. Because time wasn’t stepping, the incidents were harder to notice. It looked like random flakiness across unrelated systems.

They eventually adopted a more realistic policy: allow stepping only in the first N updates after boot (makestep), and fix the underlying network jitter. The “never step” rule became “step only when the alternative is hours of wrongness.” That’s less pure, more effective.

Boring but correct practice that saved the day: time gating before starting critical services

A payments platform (regulated, audited, and allergic to surprises) had an unsexy operational rule: systems that sign tokens or negotiate Kerberos do not start until time sync is confirmed. No exceptions, no “it’ll probably be fine.”

They implemented it with systemd ordering: chrony starts early; the application unit has an ExecStartPre check that loops on chronyc tracking until leap status is normal and the system is synchronized. They also baked a “time sanity” unit that refuses to start if the clock is wildly out of range.

This saved them during a data center network incident where NTP reachability was temporarily broken during boot for a subset of hosts. Without gating, those machines would have come up with stale RTC time and started issuing tokens with invalid timestamps. With gating, they simply waited. Startup was slower; correctness stayed intact.

It was boring. It was correct. It prevented an incident that would have been blamed on “the network,” “Linux,” or “Mercury retrograde,” depending on who wrote the postmortem.

Common mistakes: symptoms → root cause → fix

1) “timedatectl says synchronized, but time is still wrong after reboot”

Symptom: Immediately after boot, time is off by minutes; later it looks fine.

Root cause: RTC is wrong and system boot seeds time from RTC. NTP corrects later.

Fix: Ensure RTC is UTC (/etc/adjtime), enable rtcsync in chrony, run hwclock --systohc --utc after sync, investigate RTC battery/firmware.

2) “chrony shows reachable sources, but offsets never converge”

Symptom: chronyc sources shows reach, but tracking RMS offset stays high; frequency swings.

Root cause: Noisy network path, asymmetric routing, or bad upstream servers; sometimes a local firewall/NAT doing something weird with UDP.

Fix: Use closer/better sources, confirm UDP/123 path, avoid polling too aggressively, consider internal stratum servers.

3) “Clock jumps backward/forward sometimes; apps log ‘time went backwards’”

Symptom: Kernel or app logs mention backward jumps; chrony logs stepping outside expected windows.

Root cause: Kernel clocksource instability, hypervisor time forcing, or multiple time daemons fighting.

Fix: Check dmesg for clocksource watchdog messages, ensure only one time sync mechanism, tune VM time settings, mitigate by selecting a stable clocksource.

4) “Everything is fine until the VM migrates”

Symptom: Offset spikes correlate with migrations or host maintenance.

Root cause: Host time is wrong or host exposes unstable TSC; guest sees discontinuities.

Fix: Fix host NTP/PTP first, ensure consistent CPU model/invariant TSC exposure, avoid conflicting guest agent time sync.

5) “Chrony keeps stepping by large amounts at startup”

Symptom: chrony journal shows large step each boot.

Root cause: RTC not being updated, driftfile not persistent, or system boots without network for too long and services rely on wrong seed time.

Fix: Enable rtcsync, ensure driftfile persistence, gate critical service startup until sync is achieved.

6) “Time slowly drifts even while chrony is running”

Symptom: Over hours, offset grows and chrony can’t keep it in bounds.

Root cause: Kernel refuses to discipline fast enough due to configuration limits, or clocksource is unstable, or NTP sources are inconsistent.

Fix: Verify chrony frequency and skew; confirm sources; check for clocksource instability; consider PTP if you need tighter bounds than NTP can reliably deliver in your environment.

Checklists / step-by-step plan

Step-by-step: fix “NTP works but drift persists” on Debian 13

  1. Pick one time daemon.

    Use chrony on servers; disable systemd-timesyncd if present.

    cr0x@server:~$ sudo systemctl disable --now systemd-timesyncd
    Removed "/etc/systemd/system/sysinit.target.wants/systemd-timesyncd.service".
    

    Decision: If disabling timesyncd breaks anything, you had hidden dependencies. Fix those, don’t re-enable daemon duels.

  2. Confirm chrony has stable sources.

    cr0x@server:~$ chronyc sources -v
    ...^* ntp1.example.net ...
    

    Decision: If you can’t reach any sources, fix network/firewall before touching chrony tuning.

  3. Allow stepping early at boot.

    In /etc/chrony/chrony.conf ensure something like:

    cr0x@server:~$ sudo grep -n "^makestep" /etc/chrony/chrony.conf
    18:makestep 1.0 3

    Decision: If offsets are often > 1s at boot, increase the count temporarily while you fix RTC.

  4. Make drift learning persistent.

    cr0x@server:~$ sudo stat /var/lib/chrony/chrony.drift
      File: /var/lib/chrony/chrony.drift
      Size: 18        	Blocks: 8          IO Block: 4096   regular file
    Access: (0644/-rw-r--r--)  Uid: (  109/_chrony)   Gid: (  116/_chrony)
    Access: 2025-12-31 10:14:18.000000000 +0000
    Modify: 2025-12-31 10:14:18.000000000 +0000
    Change: 2025-12-31 10:14:18.000000000 +0000

    Decision: If it’s not updating, fix permissions or persistence (especially in images/containers).

  5. Fix RTC: UTC and synced.

    cr0x@server:~$ cat /etc/adjtime
    0.000000 1767174612 0.000000
    0
    UTC
    cr0x@server:~$ sudo hwclock --systohc --utc
    

    Decision: If RTC continues to drift quickly on bare metal, treat it as hardware/firmware, not an NTP tuning exercise.

  6. Hunt time jumps: clocksource and VM fights.

    cr0x@server:~$ journalctl -k --since "today" --no-pager | grep -i clocksource | tail
    Dec 31 10:02:44 server kernel: clocksource: Switched to clocksource hpet

    Decision: If you see instability, fix platform settings (BIOS/host) and consider pinning a clocksource as mitigation.

  7. Gate critical services on time sync.

    For systems that sign tokens, do auth, or coordinate distributed transactions, don’t start them before sync is achieved. This is boring and correct.
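
    A minimal sketch using chronyc’s built-in waitsync (the unit name and drop-in path are hypothetical; waitsync 60 0.1 retries up to 60 times until the remaining correction is under 0.1 s):

    cr0x@server:~$ cat /etc/systemd/system/myapp.service.d/timegate.conf
    [Unit]
    After=chrony.service
    Wants=chrony.service

    [Service]
    ExecStartPre=/usr/bin/chronyc waitsync 60 0.1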

Operational checklist for ongoing safety

  • Monitor RTC vs system clock delta daily on bare metal (a minimal sketch follows this list).
  • Alert on chrony stepping events outside the expected early-boot window.
  • Keep at least three NTP sources, ideally in different failure domains.
  • Record whether a node is VM or bare metal; time failure modes differ.
  • During VM migrations, correlate with offset spikes to prove causality.
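
For the RTC-delta bullet, a minimal daily check might look like this (the 5-second threshold and the alert transport are yours to choose; GNU date parses hwclock’s output directly):

#!/bin/sh
# Sketch: complain if RTC and system clock diverge by more than 5 seconds.
rtc=$(date -d "$(sudo hwclock --show)" +%s)
sys=$(date -u +%s)
delta=$((rtc - sys))
[ "${delta#-}" -gt 5 ] && echo "RTC delta ${delta}s exceeds threshold" >&2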

FAQ

1) If System clock synchronized: yes, how can time still be “wrong”?

Because “synchronized” is a state at a moment and often refers to the system clock only. Your RTC can still be wrong (reboot issue), or you can have intermittent jumps from clocksource/hypervisor behavior.

2) Should I use systemd-timesyncd or chrony on Debian 13?

For servers: chrony. It’s more diagnosable and tunable. For minimalist desktops or small appliances: timesyncd can be adequate. Don’t run both.

3) What’s the difference between stepping and slewing, operationally?

Stepping changes wall-clock time instantly. Slewing changes the rate so the clock gradually converges. Step at boot if you’re far off. Slew during normal operation to avoid confusing applications.

4) Is it safe to enable rtcsync?

On bare metal servers, yes, generally. It reduces reboot-time surprises. On some VMs, it may be irrelevant, and the hypervisor may manage guest RTC semantics anyway.

5) My RTC is set to local time. Is that really a problem?

On servers, yes. DST changes and timezone rule updates are where “localtime RTC” goes to generate tickets. Use UTC unless you have a specific dual-boot requirement.

6) Why does my clock drift more under load?

Heavy load can reveal clocksource issues, VM scheduling delays, and network jitter. Chrony can handle a lot, but if the underlying clock is unstable or NTP packets are delayed unpredictably, offsets get noisy.

7) Can I fix persistent drift by polling NTP more frequently?

Sometimes it helps a little; often it just amplifies noise. Better sources and a stable clocksource beat aggressive polling. If you need tighter bounds, consider PTP in environments that support it.

8) Why do I see large steps at startup even with chrony configured?

Because the initial time seed (RTC) is far off or the driftfile isn’t persistent. Chrony can only correct after it starts and reaches sources; anything started before that can see wrong time.

9) How do I know if a hypervisor is overriding time?

Look for time jumps correlated with migrations or host events, and check for guest agents or platform settings that “sync time.” In practice, you decide: host-managed time or guest-managed time, not both.

10) What accuracy should I expect from NTP on a typical LAN?

Often milliseconds, sometimes better. On WAN or noisy networks, tens of milliseconds or worse. If you require sub-millisecond or microsecond-level coordination, you’re in PTP territory and need hardware support.

Next steps you should actually take

  1. Measure the right thing: track system offset (chronyc tracking) and RTC delta (hwclock --show vs date -u) separately.
  2. Eliminate fights: ensure only one time sync mechanism is in control on each node (and define whether the host or guest owns time in VM environments).
  3. Make boot safe: configure makestep for early correction and gate time-sensitive services until sync is real.
  4. Make RTC boring: keep it UTC, enable rtcsync on bare metal, and replace failing RTC batteries without drama.
  5. Investigate clocksource warnings immediately: if the kernel marks a clocksource unstable, treat it like a hardware error with a timestamp.

If you do those five, “NTP works but drift persists” stops being a ghost story and becomes a checklist item. Production likes boring. Timekeeping should be aggressively boring.
