You’re not really “debugging TLS.” You’re debugging time. The symptom looks like certificates suddenly “expired” or “not yet valid,” package updates start failing, APIs refuse to talk, and your pager begins to develop opinions about your sleep schedule.
On Ubuntu 24.04, the right fix is boring: make time synchronization deterministic, observable, and resilient. That usually means Chrony configured intentionally, not “whatever the image shipped with.” Let’s do it like we mean it.
What breaks when time drifts (and why TLS panics)
Time drift is one of those failures that looks like “everything is broken” because it attacks the assumptions underneath everything. TLS is just the first system to complain loudly because it has a strict sense of reality: certificates are valid only within a time window. If your host clock is wrong, the certificate might be:
- Not yet valid (your clock is too far in the past)
- Expired (your clock is too far in the future)
That’s just the opening act. You can also see:
- APT failures: repositories “suddenly” fail with TLS handshake errors or metadata validity issues.
- OAuth/JWT auth failures: tokens have nbf and exp claims; a skewed clock makes valid tokens look invalid.
- Kerberos failures: Kerberos is famously strict about time skew.
- Distributed storage weirdness: leases, heartbeats, and monotonic ordering assumptions can misbehave when wall-clock time jumps.
- Monitoring lies: graphs get gaps, alerts fire late, or logs arrive “from the future.”
Here’s the most important mental model: fixing time is not the same as stepping the clock immediately. In production systems, abrupt time steps can break things too—especially databases, caches, and anything using time-based eviction or ordering. Chrony exists partly because it can correct time gradually (“slew”) while staying sane.
One quote worth keeping on a sticky note: “Hope is not a strategy.” It’s a line often repeated in reliability and engineering-management culture, and it applies here: time sync should be designed, not wished into existence.
Fast diagnosis playbook
If you’re on-call and TLS just exploded across a fleet, you need a fast path. Don’t chase cert chains for an hour. Check time first, then decide how to correct it safely.
First: confirm you actually have a time problem
- Check local wall time and sync status (is it wildly wrong? is it synced?).
- Check Chrony health (sources, offset, leap status).
- Check whether time is stepping (VM resume, RTC issues, manual changes).
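In command form, that first pass is three quick reads (each is unpacked in the task section below):
cr0x@server:~$ timedatectl
cr0x@server:~$ chronyc tracking
cr0x@server:~$ chronyc sources -v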
Second: determine the blast radius
- Is it one host (bad RTC, misconfigured chrony, VM host issue)?
- Is it a whole cluster (broken internal NTP, firewall rules, image regression)?
- Is it only one network segment (blocked UDP/123, NAT weirdness, split horizon DNS)?
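A quick way to scope blast radius from a jump host is to sample a few machines per segment. This is a sketch, with host1/host2/host3 standing in for your own inventory:
cr0x@server:~$ for h in host1 host2 host3; do echo "== $h"; ssh "$h" "chronyc tracking | grep -E 'System time|Leap status'"; done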
Third: fix time with the least dangerous method
- If drift is small: let Chrony slew it back.
- If drift is large (minutes/hours): plan for a controlled step (service impact), then verify TLS and auth paths.
- If time keeps drifting: fix the underlying cause (bad NTP sources, VM clock settings, broken RTC, aggressive power saving, suspended instances).
Interesting facts and historical context
Timekeeping in computing is older than your most “legacy” service. A few concrete facts that help explain today’s behavior:
- NTP predates the commercial web. It was designed in the 1980s to synchronize clocks over unreliable networks—still relevant, still in use.
- TLS validity is deliberately time-bound so stolen certificates can’t be used forever, and so clients can reason about revocation and rotation periods.
- Leap seconds exist, and software has historically handled them inconsistently; some systems step, some smear, some panic.
- Chrony was designed to handle intermittent connectivity better than classic ntpd, which matters for laptops, VMs, and isolated subnets.
- Virtual machines can drift hard after pause/suspend/migration events—especially when host timekeeping integration is misconfigured.
- Wall-clock time and monotonic time are different. Many systems depend on monotonic timers for intervals; TLS cares about wall-clock time.
- Certificate lifetimes have been shrinking in the industry to reduce risk, which makes time accuracy more important, not less.
- Some enterprise networks run internal NTP hierarchies with strict ACLs; a single mis-edit can isolate thousands of machines from time.
Time is a dependency like DNS. You don’t notice it until it’s gone, then everything becomes interpretive dance.
Joke #1: Time drift is the only bug that can make your logs claim the outage ended before it began. It’s like time travel, but with worse documentation.
Practical tasks: commands, outputs, decisions
Below are hands-on tasks you can run on Ubuntu 24.04 to diagnose and repair time sync issues. Each task includes: a command, what typical output means, and what decision to make.
Task 1: Check system time, RTC, and sync flag
cr0x@server:~$ timedatectl
Local time: Sun 2025-12-28 10:41:12 UTC
Universal time: Sun 2025-12-28 10:41:12 UTC
RTC time: Sun 2025-12-28 10:41:10
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: no
NTP service: active
RTC in local TZ: no
What it means: NTP service is “active” but the system clock is not synchronized. That usually means the daemon is running but can’t reach sources or doesn’t trust them yet.
Decision: Move to Chrony status and sources. If “System clock synchronized” stays no for more than a couple minutes after boot, you likely have source/connectivity issues.
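For scripts and fleet sweeps, the same flag is available in machine-readable form (it prints yes or no):
cr0x@server:~$ timedatectl show -p NTPSynchronized --value
no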
Task 2: Identify whether Chrony is actually installed and running
cr0x@server:~$ systemctl status chrony --no-pager
● chrony.service - chrony, an NTP client/server
Loaded: loaded (/usr/lib/systemd/system/chrony.service; enabled; preset: enabled)
Active: active (running) since Sun 2025-12-28 10:39:44 UTC; 1min 26s ago
Docs: man:chronyd(8)
man:chronyc(1)
Main PID: 1325 (chronyd)
Tasks: 1 (limit: 38228)
Memory: 2.7M (peak: 3.1M)
CPU: 148ms
CGroup: /system.slice/chrony.service
└─1325 /usr/sbin/chronyd -F 1
What it means: Chrony is running. Good. Now we need to see if it has usable sources.
Decision: If Chrony is not installed, install it and disable competing time daemons. If it’s running, inspect its tracking state and sources.
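If Chrony turns out not to be installed, a minimal install sequence looks like this; afterwards, re-check that only one time daemon is active (see Tasks 7 and 8):
cr0x@server:~$ sudo apt-get install -y chrony
cr0x@server:~$ sudo systemctl enable --now chrony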
Task 3: Check Chrony tracking (the single most useful snapshot)
cr0x@server:~$ chronyc tracking
Reference ID : 00000000 ()
Stratum : 0
Ref time (UTC) : Thu Jan 01 00:00:00 1970
System time : 12.483912345 seconds slow of NTP time
Last offset : +0.000000000 seconds
RMS offset : 0.000000000 seconds
Frequency : 0.000 ppm
Residual freq : 0.000 ppm
Skew : 0.000 ppm
Root delay : 1.000000000 seconds
Root dispersion : 1.000000000 seconds
Update interval : 0.0 seconds
Leap status : Not synchronised
What it means: Stratum 0, reference ID empty, leap status not synchronized: Chrony is not locked to any source. The clock is 12.48 seconds slow; that’s enough to trigger strict TLS checks in some environments.
Decision: Look at chronyc sources -v. If sources are unreachable, fix network/DNS/ACLs. If reachable but “not selected,” fix NTP server list or trust settings.
Task 4: Inspect sources and selection
cr0x@server:~$ chronyc sources -v
.-- Source mode '^' = server, '=' = peer, '#' = local clock.
/ .- Source state '*' = current best, '+' = combined, '-' = not combined,
| / 'x' = may be in error, '~' = too variable, '?' = unusable.
|| .- xxxx [ yyyy ] +/- zzzz
|| Reachability register (octal) | xxxx = adjusted offset,
|| Log2(Polling interval) | yyyy = measured offset,
|| | zzzz = estimated error.
|| |
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^? ntp1.corp.local 0 6 0 - +0ns[ +0ns] +/- 0ns
^? ntp2.corp.local 0 6 0 - +0ns[ +0ns] +/- 0ns
What it means: ^? indicates unusable sources, and Reach is 0. Chrony can’t talk to them (DNS failure, routing, firewall, ACL, or the servers are down).
Decision: Test DNS resolution and UDP/123 connectivity. If corporate NTP is unreachable, you need a temporary public/internal alternate approved by policy—or stand up a local time source.
Task 5: Verify DNS resolution for NTP sources
cr0x@server:~$ resolvectl query ntp1.corp.local
ntp1.corp.local: 10.20.30.40 -- link: eth0
-- Information acquired via protocol DNS in 5.2ms.
-- Data is authenticated: no
What it means: DNS resolves. Good. If this fails, you’re debugging DNS, not NTP.
Decision: If DNS fails, fix resolver configuration or split-horizon issues. If DNS works, test network reachability to UDP/123.
Task 6: Test NTP reachability (UDP/123) with chronyc
cr0x@server:~$ chronyc -N ntpdata ntp1.corp.local
Remote address : 10.20.30.40 (10.20.30.40)
Leap status : Normal
Version : 4
Stratum : 2
Poll interval : 6 (64 seconds)
Precision : -23 (119.209ns)
Root delay : 0.001297 seconds
Root dispersion : 0.003418 seconds
Reference ID : 4C4F434C
Reference time : Sun Dec 28 10:41:10 2025
Offset : -0.000412345 seconds
Peer delay : 0.000812 seconds
Peer dispersion : 0.000244 seconds
What it means: The server responds and looks sane (stratum 2, normal leap status). This suggests the network path is fine and the issue is Chrony configuration or selection.
Decision: If ntpdata times out, fix firewall/ACL routing. If it works but sources still show ^?, look for Chrony restrictions, wrong interface binding, or systemd sandboxing.
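If you suspect the host’s own firewall rather than the network path, a quick local look helps. This assumes nftables, the Ubuntu 24.04 default; your rule sets will differ:
cr0x@server:~$ sudo nft list ruleset | grep -n '123'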
Task 7: Check whether another time service is fighting Chrony
cr0x@server:~$ systemctl list-units --type=service | grep -E 'chrony|timesync|ntp'
chrony.service loaded active running chrony, an NTP client/server
systemd-timesyncd.service loaded active running Network Time Synchronization
What it means: Both Chrony and systemd-timesyncd are running. That’s not a clever redundancy plan; it’s two pilots grabbing the yoke.
Decision: Pick one. On servers, prefer Chrony. Disable systemd-timesyncd when using Chrony.
Task 8: Disable systemd-timesyncd if Chrony is your chosen source of truth
cr0x@server:~$ sudo systemctl disable --now systemd-timesyncd
Removed "/etc/systemd/system/sysinit.target.wants/systemd-timesyncd.service".
Stopped systemd-timesyncd.service - Network Time Synchronization.
What it means: timesyncd is no longer running; Chrony gets to do its job without interference.
Decision: Re-check timedatectl and Chrony sources after a minute.
Task 9: Inspect Chrony configuration for bad server lines and policies
cr0x@server:~$ grep -E '^(server|pool|makestep|rtcsync|keyfile|driftfile|bindcmdaddress|bindaddress)' /etc/chrony/chrony.conf
pool ntp.ubuntu.com iburst
makestep 1 3
rtcsync
driftfile /var/lib/chrony/chrony.drift
What it means: The default public pool is in use. makestep 1 3 steps the clock if the offset exceeds 1 second, but only during the first three clock updates (typically shortly after startup); after that Chrony only slews. rtcsync keeps the hardware clock (RTC) updated periodically.
Decision: In corporate networks you usually want explicit internal NTP servers, not public pools. Also decide whether makestep is appropriate for your fleet (more on that below).
Task 10: Replace NTP sources with explicit, redundant servers (example)
cr0x@server:~$ sudoedit /etc/chrony/chrony.conf
...file opened in editor...
What it means: You’re editing for intent. A good server list is explicit, redundant, and local to your network topology.
Decision: Use at least 3 sources if possible. Prefer internal stratum 1/2 servers. Keep iburst for faster initial sync.
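An illustrative edit, with the ntpN.corp.local names standing in for your own internal servers:
# replace the default "pool ntp.ubuntu.com iburst" line with explicit servers:
server ntp1.corp.local iburst
server ntp2.corp.local iburst
server ntp3.corp.local iburst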
Task 11: Restart Chrony and force a quick re-evaluation of sources
cr0x@server:~$ sudo systemctl restart chrony
cr0x@server:~$ sudo chronyc online
200 OK
cr0x@server:~$ chronyc sources -v
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^+ ntp1.corp.local 2 6 377 12 -221us[ -341us] +/- 6.1ms
^* ntp2.corp.local 2 6 377 10 -12us[ -128us] +/- 4.8ms
^+ ntp3.corp.local 3 6 377 11 +145us[ +31us] +/- 8.9ms
What it means: Reach 377 means packets are flowing. ^* is the selected best source. Offsets in microseconds are fine; estimated error in milliseconds is acceptable for TLS and most distributed systems.
Decision: If you still see ^? or Reach stays low, debug connectivity and server health. If you see ^~ (too variable), consider network jitter, VPN paths, or a noisy VM host.
Task 12: Confirm the system considers time synchronized now
cr0x@server:~$ timedatectl
Local time: Sun 2025-12-28 10:44:18 UTC
Universal time: Sun 2025-12-28 10:44:18 UTC
RTC time: Sun 2025-12-28 10:44:18
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
What it means: The kernel clock is synced and the RTC is aligned. That’s what “fixed” looks like.
Decision: Now re-test the failing TLS client paths (apt, curl, your app).
Task 13: If time is wildly wrong, step safely (and admit it will be disruptive)
cr0x@server:~$ chronyc tracking
Reference ID : 0A141E29 (ntp2.corp.local)
Stratum : 3
Ref time (UTC) : Sun Dec 28 10:44:40 2025
System time : 4231.218123456 seconds fast of NTP time
Last offset : -0.003212345 seconds
RMS offset : 0.001002000 seconds
Frequency : 18.122 ppm
Residual freq : -0.441 ppm
Skew : 2.112 ppm
Root delay : 0.001102 seconds
Root dispersion : 0.010843 seconds
Update interval : 64.0 seconds
Leap status : Normal
What it means: You are over an hour fast. Letting this slew back could take a long time, and you’ll have continued TLS/JWT failures meanwhile.
Decision: Consider a controlled step. On many services a step of an hour will cause temporary weirdness; plan it, communicate it, and restart the most time-sensitive daemons afterwards.
Task 14: Force a step with Chrony (use sparingly)
cr0x@server:~$ sudo chronyc makestep
200 OK
What it means: Chrony stepped the clock immediately to correct the offset. This can fix TLS instantly, and break other things instantly. Choose your poison, but choose it consciously.
Decision: After stepping, validate critical apps (databases, message queues, authentication). If you run time-sensitive services, you may need restarts.
Task 15: Check for recent manual time changes (smoking gun)
cr0x@server:~$ journalctl -u chrony -u systemd-timesyncd --since "2 hours ago" --no-pager | tail -n 25
Dec 28 10:39:44 server chronyd[1325]: chronyd version 4.5 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +NTS +SECHASH)
Dec 28 10:40:12 server chronyd[1325]: System clock wrong by 12.487 seconds, adjustment started
Dec 28 10:43:01 server chronyd[1325]: Selected source 10.20.30.41
Dec 28 10:44:19 server chronyd[1325]: System clock was stepped by 4231.218 seconds
What it means: The logs admit a step happened. If you didn’t do it, something else did (cloud-init, a misconfigured script, a golden image, or an overeager “fix time” runbook).
Decision: Identify the actor: automation, human, or VM platform. Then prevent repeats. The worst incident is the one you “fix” every Tuesday.
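One way to hunt for other actors is to search the whole journal around the step; exact message wording varies by systemd and chrony version, so treat the pattern as a starting point, not gospel:
cr0x@server:~$ journalctl --since "2 hours ago" --no-pager | grep -iE 'time has been changed|clock was stepped' | tail -n 40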
Task 16: Validate TLS from the host after time is correct
cr0x@server:~$ openssl s_client -connect archive.ubuntu.com:443 -servername archive.ubuntu.com -brief
CONNECTION ESTABLISHED
Protocol version: TLSv1.3
Ciphersuite: TLS_AES_256_GCM_SHA384
Peer certificate: CN=*.ubuntu.com
Hash used: SHA256
Signature type: RSA-PSS
Verification: OK
Server Temp Key: X25519, 253 bits
What it means: Verification is OK. If this was failing with “not yet valid” before, time was the culprit.
Decision: If verification still fails, you might have a CA trust store issue or interception proxy problems. Time is necessary, not always sufficient.
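To make the time dependency explicit, compare the certificate’s validity window to the clock you just fixed:
cr0x@server:~$ openssl s_client -connect archive.ubuntu.com:443 -servername archive.ubuntu.com </dev/null 2>/dev/null | openssl x509 -noout -dates
Check notBefore and notAfter against date -u; if the window brackets “now,” any remaining failure isn’t about time.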
Task 17: Verify APT can negotiate TLS again
cr0x@server:~$ sudo apt-get update
Hit:1 http://archive.ubuntu.com/ubuntu noble InRelease
Hit:2 http://archive.ubuntu.com/ubuntu noble-updates InRelease
Hit:3 http://security.ubuntu.com/ubuntu noble-security InRelease
Reading package lists... Done
What it means: APT is happy. When time is wrong, you’ll often see TLS handshake errors or metadata validity warnings.
Decision: If APT still fails but OpenSSL works, suspect proxy configuration, pinned certs, or repository metadata issues.
Fix Chrony the right way on Ubuntu 24.04
Chrony is usually the correct choice for servers: fast convergence, good behavior on flaky networks, and solid observability. The wrong way to “fix time” is to run a random one-liner that steps the clock and declare victory. You’ll get your TLS back and then spend the afternoon debugging a database that now thinks the future happened already.
Choose one time sync daemon and commit
On Ubuntu, you’ll often encounter:
- systemd-timesyncd: lightweight SNTP client, fine for simple endpoints.
- chronyd (Chrony): full NTP client/server, better diagnostics and control.
Pick one. For production servers, pick Chrony unless you have a very small footprint and strict simplicity goals. Running both isn’t resilience; it’s self-sabotage with uptime ambitions.
Use explicit time sources, not vibes
Default pools can be fine on the public internet, but corporate reality includes firewalls, proxies, split DNS, and “approved egress.” On fleets, you want explicit servers with redundancy:
- At least three servers if you can.
- Same region / low latency where possible.
- Different failure domains (not three VMs on the same host).
Example Chrony configuration fragment (illustrative):
cr0x@server:~$ sudo bash -lc 'cat > /etc/chrony/sources.d/corp.sources <<EOF
server ntp1.corp.local iburst
server ntp2.corp.local iburst
server ntp3.corp.local iburst
EOF'
What it means: A clean separation: base chrony.conf plus a dedicated file for corporate sources. Easier to manage with configuration tooling.
Decision: Put sources in a managed file so “someone hotfixes the server list” doesn’t become your long-term architecture.
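Note that chronyd won’t notice a new file in sources.d by itself; reload the source list (a service restart also works):
cr0x@server:~$ sudo chronyc reload sources
200 OK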
Understand stepping vs slewing (and set policy)
Chrony can correct time in two ways:
- Slew: gradually adjust clock rate. Safer for apps. Slower to correct large offsets.
- Step: jump time immediately. Fixes TLS fast. Can break time-sensitive software.
The makestep directive is the policy knob. Example:
- makestep 1 3: step if offset > 1 s during the first 3 updates, then never step again automatically.
- makestep 0.5 -1: step if offset > 0.5 s at any time (aggressive; use only if you know why).
On servers with strict TLS auth, stepping during boot can be reasonable. Auto-stepping during steady state is riskier; you don’t want a mid-day time jump because a source went weird and Chrony “helped.”
Make the hardware clock (RTC) behave
Bad RTC configuration is a classic cause of “time is wrong after reboot.” On Linux servers, keep RTC in UTC:
cr0x@server:~$ sudo timedatectl set-local-rtc 0
cr0x@server:~$ timedatectl
Local time: Sun 2025-12-28 10:45:22 UTC
Universal time: Sun 2025-12-28 10:45:22 UTC
RTC time: Sun 2025-12-28 10:45:22
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
What it means: RTC is in UTC. Good. Dual-boot systems sometimes set RTC to local time; servers should not.
Decision: If you’re dual-booting a server, you have bigger problems, but yes: keep RTC in UTC.
Be careful with VMs and cloud instances
If your host is virtualized, time can drift when:
- VM is paused/resumed
- Live migration occurs
- Host is overloaded and the guest loses CPU time
- Hypervisor time sync integration fights NTP
The fix is platform-dependent, but the pattern is constant: choose a single authority. If the hypervisor provides stable timekeeping, let the guest use NTP to discipline minor drift—not to fight big jumps created by suspend/resume events. If you see frequent large steps, fix the VM lifecycle behavior first.
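A platform-neutral first check is which clocksource the guest kernel is using; the name you see (kvm-clock, tsc, hyperv_clocksource_tsc_page, and so on) depends on the hypervisor:
cr0x@server:~$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
kvm-clock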
Make Chrony observable
If you can’t measure offset, you’re guessing. At minimum, capture:
- chronyc tracking periodically (offset, stratum, leap status)
- chronyc sources -v for reachability and jitter
- daemon logs around boot and resume
Most outages here are not “Chrony is broken.” They’re “Chrony is telling you the truth and you didn’t look.”
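A minimal collection sketch, assuming a node-exporter-style textfile collector; the output path and metric name are assumptions, so adapt them to your monitoring stack:
#!/usr/bin/env bash
# Hypothetical collector: publish Chrony's current offset as a Prometheus-style metric.
set -euo pipefail
# "System time : 12.483912345 seconds slow of NTP time" -> grab the numeric field.
# Note: this captures magnitude only; the slow/fast direction is in the same line if you need it.
offset=$(chronyc tracking | awk '/^System time/ {print $4}')
echo "chrony_system_time_offset_seconds ${offset}" > /var/lib/node_exporter/textfile/chrony_offset.prom
Run it from cron or a systemd timer and alert when the value stays above your threshold for more than a few minutes.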
Joke #2: NTP is the only service where “reach 377” is good news. Networking is a strange career choice.
TLS/cert validation checks after time repair
Once time is stable, validate that TLS failures are actually gone—and confirm you didn’t uncover a second problem that time was masking.
Check the exact error message
Common time-related TLS messages include:
- certificate is not yet valid
- certificate has expired
- bad certificate (less specific; can still be time-related in some stacks)
If the error is about “unknown CA” or “self-signed certificate in certificate chain,” fixing time won’t help. Don’t force it.
Validate local trust store health (quick sanity check)
cr0x@server:~$ dpkg -l | grep -E '^ii\s+ca-certificates\s'
ii ca-certificates 20240203 all Common CA certificates
What it means: CA bundle package is installed. Good.
Decision: If missing or corrupted, reinstall. If present, focus on time, proxy interception, or application pinning.
Confirm the kernel and userspace agree on time
cr0x@server:~$ date -u
Sun Dec 28 10:46:01 UTC 2025
cr0x@server:~$ python3 -c 'import datetime; print(datetime.datetime.now(datetime.timezone.utc).isoformat())'
2025-12-28T10:46:02.193847+00:00
What it means: Close agreement between tools suggests no weird container namespace time tricks or broken libc time calls.
Decision: If containers show different time than host, check container runtime settings and host clock namespace assumptions (rare, but real in some hardened environments).
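If you run containers, a spot check that a container sees the same clock as the host is cheap (Docker shown here; the image tag is arbitrary):
cr0x@server:~$ date -u && docker run --rm ubuntu:24.04 date -u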
Three corporate-world mini-stories
Mini-story 1: The incident caused by a wrong assumption
They had a “secure internal network.” No outbound internet. Everything went through approved egress points, and for time they used two internal NTP servers. Someone assumed those NTP servers were as foundational as DNS, so they never made it part of a monitored service catalog. They were just… there. Like gravity.
A change window came and went. A network engineer tightened ACLs on a core switch, aiming to reduce unnecessary UDP traffic. UDP/123 wasn’t explicitly allowed from a new subnet used by freshly provisioned application nodes. It wasn’t blocked with malice; it was blocked with indifference, which is how most outages are born.
Within hours, nodes in that subnet began to drift. Nothing obvious happened at first. Then a certificate rotation pushed a new leaf certificate with a validity start time slightly in the future relative to the drifting nodes. Suddenly services in the new subnet couldn’t call anything. On-call saw TLS errors and rotated certificates again, which did nothing except create more meetings.
The fix was two lines in an ACL and a postmortem action: treat internal NTP as a Tier-0 dependency with monitoring, alerts, and change reviews. The wrong assumption wasn’t “ACLs are safe.” It was “time will take care of itself.”
Mini-story 2: The optimization that backfired
A performance-minded team wanted faster boot times on ephemeral compute. They trimmed services. They disabled “extra daemons.” They also replaced Chrony with a minimal SNTP client and aggressive stepping because “we only need time to be roughly correct.” This is the kind of sentence that should set off a small siren in the back of your skull.
Most of the time it worked. Then an upstream time source had a short-lived issue: one server started reporting time with a noticeable offset. A robust NTP client would compare sources, detect inconsistency, and avoid selecting the bad actor. The minimal client had fewer guardrails and stepped the clock mid-flight.
Stepping time caused their cache layer to evict entries incorrectly (time-based TTL logic went sideways). Metrics got scrambled. A few background jobs ran twice because “next run time” moved backwards. Their incident review was painful because logs did not agree on ordering across nodes. Every graph looked like modern art.
They reintroduced Chrony, pinned it to trusted internal sources, and set a conservative stepping policy: step early during boot if needed, slew during steady state. The “optimization” saved seconds and cost them a day. That’s not a trade; it’s a prank.
Mini-story 3: The boring but correct practice that saved the day
A different company had a dull rule: every server image must ship with the same Chrony configuration, and every environment must have three reachable NTP sources. No exceptions, no cleverness. They also collected Chrony tracking metrics and raised a ticket automatically if offset exceeded a small threshold for more than a few minutes.
One morning, a hypervisor cluster began experiencing heavy load. Some guests started drifting, not by seconds but by tens of seconds. Before application teams noticed, monitoring flagged the time offset anomaly. SREs correlated the issue to the hypervisor pool and shifted workloads away while the virtualization team fixed scheduling pressure.
The best part: nothing dramatic happened. No TLS meltdown. No token storms. No heroic midnight debugging. The incident report was short and deeply unsexy: “Detected drift early; moved load; fixed host contention; verified sync.” That’s the kind of boring you should aspire to.
Common mistakes: symptom → root cause → fix
This is the section where most time outages reveal they were self-inflicted. Not because people are careless—because distributed systems punish ambiguity.
1) TLS says “certificate is not yet valid” right after reboot
Root cause: RTC is wrong, or Chrony can’t reach sources early in boot; system comes up with stale time.
Fix: Ensure RTC is UTC, enable rtcsync, and configure makestep to step during initial updates. Also verify NTP sources are reachable from the subnet at boot (ACLs, routing, DNS).
2) “NTP service: active” but “System clock synchronized: no” forever
Root cause: Daemon is running but has no valid sources (^?, Reach 0), or it’s competing with another daemon.
Fix: Use chronyc tracking and chronyc sources -v. Disable systemd-timesyncd if using Chrony. Fix firewall rules for UDP/123.
3) Time jumps backwards/forwards occasionally; apps get weird
Root cause: Aggressive stepping policy (makestep too permissive), or VM suspend/resume causing sudden drift and correction.
Fix: Limit stepping to early boot (e.g., makestep 1 3). Fix VM host timekeeping and avoid guest/hypervisor sync fights.
4) Chrony shows sources but marks them “too variable” (^~)
Root cause: High jitter path (VPN, overloaded network, asymmetric routing) or unstable upstream time sources.
Fix: Prefer local NTP servers. Reduce jitter. Add better sources. If needed, increase polling stability and avoid crossing WAN/VPN for primary time.
5) Everything breaks after someone ran date -s
Root cause: Manual time setting steps the clock without coordination; apps see time discontinuity.
Fix: Stop manually setting time on production servers. Use Chrony with controlled stepping and document when a forced step is acceptable.
6) Kubernetes / service mesh certs fail across nodes intermittently
Root cause: Node clock skew causes mTLS handshakes to fail or short-lived cert rotation to misalign.
Fix: Enforce time sync at node bootstrap, monitor drift, and block scheduling onto nodes with high offset until corrected.
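The containment step is mechanical once drift is detected; a sketch assuming kubectl access, with worker-17 as a hypothetical node name:
cr0x@server:~$ kubectl cordon worker-17
cr0x@server:~$ # fix time on the node, confirm chronyc tracking looks sane, then:
cr0x@server:~$ kubectl uncordon worker-17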
Checklists / step-by-step plan
Checklist A: Immediate incident response (single host)
- Run timedatectl; if “System clock synchronized: no”, proceed.
- Run chronyc tracking; confirm leap status and offset magnitude.
- Run chronyc sources -v; check Reach and selection (^*).
- If sources are unreachable: test DNS and NTP reachability (resolvectl query, chronyc ntpdata).
- Ensure only one time daemon is active (disable systemd-timesyncd if using Chrony).
- Restart Chrony and bring sources online (systemctl restart chrony, chronyc online).
- If the offset is huge and you must restore TLS now: chronyc makestep, then validate critical services.
- Re-test TLS and APT (openssl s_client, apt-get update).
Checklist B: Fleet-wide containment
- Pick a canary host per subnet; measure offset and reachability.
- Validate internal NTP server health and ACLs from each network zone.
- Roll out Chrony config with explicit servers and conservative stepping policy.
- Add monitoring: alert when leap status is not synchronized for sustained period or offset exceeds threshold.
- Block or cordon nodes with high drift (platform-specific) until corrected.
Checklist C: Prevent recurrence (the part people skip)
- Make NTP sources a managed dependency with change control (like DNS).
- Ensure three sources minimum and test during provisioning.
- Log and alert on time steps; unexpected steps should be investigated.
- Document whether your environment allows stepping during steady state. Most shouldn’t.
- For VMs: validate hypervisor integration and behavior on suspend/resume/migration events.
FAQ
1) Why do TLS errors show up before anything else?
TLS certificate validation is explicitly time-based. If wall-clock time is wrong, the handshake fails immediately. Other systems might tolerate skew or fail later.
2) Should I use systemd-timesyncd or Chrony on Ubuntu 24.04 servers?
Use Chrony for production servers unless you have a strong reason not to. It has better diagnostics, better handling of imperfect networks, and more control over stepping/slewing.
3) Is it safe to run both Chrony and systemd-timesyncd?
No. Pick one. Two daemons adjusting the same clock is a reliability anti-pattern that produces intermittent, hard-to-explain drift and steps.
4) What’s the quickest proof that time is the problem?
timedatectl showing “System clock synchronized: no” plus chronyc tracking showing “Leap status: Not synchronised” is usually enough. Also, TLS errors that mention “not yet valid” are practically a confession.
5) When should I use chronyc makestep?
When offset is large enough that slewing would take too long and you need to restore critical TLS/auth paths quickly. Do it knowingly: stepping can break time-sensitive applications.
6) Why does Chrony show sources but still not synchronize?
Because not all sources are trustworthy or selectable. Check chronyc sources -v for ^*, Reach, and states like ^? (unusable) or ^x (in error).
7) My NTP servers are reachable, but offset keeps growing. What then?
Look at virtualization and host load. Guests can drift under CPU starvation, and suspend/resume can cause large jumps. Fix the platform behavior; Chrony can’t fix physics.
8) Do leap seconds still matter for this problem?
Usually not for day-to-day TLS failures, but they matter for long-term correctness and for systems that handle leap events poorly. The practical takeaway: use a disciplined time system, not ad-hoc fixes.
9) How accurate does time need to be for TLS?
TLS generally tolerates small skew, but modern systems with short-lived certs, strict clients, and token-based auth can be sensitive. Aim for tight synchronization and alert on drift before it becomes seconds-to-minutes.
10) After fixing time, why do some services still fail until restarted?
Some applications cache time-dependent decisions (token validation windows, session expiry, scheduled tasks) or get confused after a time step. If you stepped time, restarts of critical components can be the cleanest recovery.
Conclusion: next steps you should actually do
If TLS/cert errors appeared after time drift, don’t treat it like a certificate mystery. Treat it like an infrastructure dependency failing quietly. The right path is consistent:
- Prove it’s time: timedatectl, chronyc tracking, chronyc sources -v.
- Make Chrony the single authority (or explicitly choose timesyncd, but not both).
- Use explicit, redundant NTP sources that match your network reality.
- Set a stepping policy you can live with: step during boot if needed; avoid surprise steps in steady state.
- Instrument and alert on drift so you fix time before time fixes you.
Do those, and the next time certificates “randomly expire,” you’ll fix it in minutes—with fewer meetings, fewer mysteries, and fewer people learning what UTC stands for under stress.