The outage starts with a lie so small nobody sees it: the clock on one side of the WAN isn’t the same as the clock on the other.
Users call it “VPN is down.” Security calls it “certs are broken.” AD admins call it “Kerberos being Kerberos.”
You call it Tuesday.
Time sync across offices is boring until it isn’t. Then it’s a distributed systems failure with a single integer at the center: seconds.
If your branch offices don’t share time like they share DNS, you’re one flaky ISP link away from authentication chaos.
Why time breaks everything (AD, VPN, TLS)
NTP is not “infrastructure hygiene.” It’s a dependency of trust. The moment two systems disagree on time, they disagree on whether an event happened.
Authentication systems don’t negotiate with uncertainty; they reject it.
Active Directory and Kerberos: clock skew is a security feature
Kerberos tickets have timestamps. Not as a cute detail—timestamps are part of the replay protection model.
If the client’s time is too far from the domain controller’s time, tickets are considered unsafe and get rejected.
On Windows, you’ll see variations of “KRB_AP_ERR_SKEW” or “The time difference between the client and server is too great.”
By default, AD tolerates a limited skew (commonly five minutes). Five minutes sounds generous until you realize how easy it is for a VM host,
a sleepy laptop, and a branch office firewall doing “helpful” NAT to stack small drifts into a big one.
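On the client side, Kerberos implementations expose this tolerance directly. A minimal MIT Kerberos sketch (Linux; 300 seconds is the usual default, and raising it is almost always the wrong fix):

# /etc/krb5.conf (fragment)
[libdefaults]
    # Maximum tolerated clock difference between this client and the KDC.
    # Raising this masks broken time distribution instead of fixing it.
    clockskew = 300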
VPN and SSO: the token that “expired” before it was issued
Modern VPN stacks are glued to identity providers: SAML assertions, OIDC tokens, signed cookies, short-lived device posture checks.
These are time-sensitive. A token that’s valid for 60 seconds is a great security control until your branch office is 90 seconds in the future.
The failure mode is predictable: users authenticate successfully somewhere, then the VPN gateway rejects the follow-on assertion as “not yet valid”
or “expired.” Everyone blames the IdP. The IdP is innocent. Your clocks are not.
TLS certificates: “not yet valid” is time drift, not PKI drama
TLS has a brutally simple rule: certificates have validity windows. If your node’s time is wrong, the cert can be rejected even if it was issued correctly.
“x509: certificate has expired or is not yet valid” is the TLS equivalent of “your wristwatch is lying to you.”
Distributed systems are time-sensitive even when they claim they aren’t
You can build distributed systems without strict synchronized clocks, but you still need monotonicity and reasonable wall-clock agreement for:
log correlation, incident response, billing records, token validation, scheduled jobs, and audit trails.
The question isn’t “do we need NTP?” It’s “how much failure are we willing to accept when NTP degrades?”
Joke #1: Time sync is like flossing—everyone agrees it’s necessary, and most people only do it after something starts bleeding.
How NTP really works across a WAN
NTP is simple in the way TCP is “simple”: a small protocol with a lot of operational edge cases.
Across offices you’ll run into the nasty trio: asymmetric latency, packet loss, and devices that think they’re smarter than time.
Offset, delay, jitter: three numbers that decide your day
When an NTP client queries a server, it’s trying to estimate the difference between its clock and the server’s clock (offset).
It also measures round-trip network delay (delay). Over time it observes variance (jitter).
If delay is high but stable, NTP can still work. If jitter is high, the offset estimate becomes noisy, and the client may distrust its sources or keep switching between them.
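Concretely, the client timestamps the exchange at four points: T1 (request sent), T2 (request received by the server), T3 (reply sent), T4 (reply received). The standard NTP estimates are:

offset = ((T2 - T1) + (T3 - T4)) / 2
delay  = (T4 - T1) - (T3 - T2)

The offset formula silently assumes the outbound and return paths take equal time. Asymmetric WAN routing breaks that assumption, and the asymmetry lands directly in your offset estimate as invisible error.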
Stepping vs slewing: the difference between “correct” and “safe”
A clock can be corrected by stepping (jumping instantly) or slewing (gradually adjusting frequency).
Stepping is fast and often necessary when a machine boots with garbage time. It’s also dangerous mid-flight:
stepping backwards can break databases, confuse logs, and make scheduled tasks run twice.
A sane setup allows stepping at boot (or when offset is huge) and slewing during normal operation.
Chrony is generally better at this than the classic ntpd in messy networks. On Windows, w32time has its own rules and limitations;
treat it like a special citizen, not a Linux clone.
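In chrony terms, the "step at boot, slew afterwards" policy is a single directive. A minimal sketch (the threshold and update count are illustrative; tune them for your fleet):

# /etc/chrony/chrony.conf (fragment)
# Step the clock if the offset exceeds 1 second, but only during
# the first 3 clock updates (i.e., right after boot).
# All later corrections are slewed gradually.
makestep 1.0 3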
Stratum is not “quality,” it’s “distance”
Stratum is how far away a server is from a reference clock. Stratum 1 is directly connected to a reference source (GPS, atomic clock, radio).
Stratum 2 syncs from stratum 1, and so on. Lower is not automatically “better” if the path is unreliable.
A stratum 3 server with stable connectivity can beat a stratum 1 server across a fragile WAN.
WAN reality: asymmetry and filtering
NTP assumes the network delay is roughly symmetric. Across office links, it often isn’t.
SD-WAN can reorder packets. LTE backup links can spike latency. Stateful firewalls can treat UDP/123 like a suggestion.
A single “security hardening” rule that blocks outbound UDP/123 from branch VLANs can create a drift island.
“Just use pool servers” is not a corporate design
Public NTP pools are excellent for the public Internet. Enterprises are different: you need predictable behavior, tight firewall rules,
deterministic dependencies, audit-friendly change control, and monitoring.
Use external sources carefully at the edge, then distribute time internally with your own servers.
Interesting facts and history (because it explains the traps)
- NTP is old—by design: the protocol dates back to the early 1980s and evolved to survive unreliable networks.
- UTC leap seconds are political: leap seconds exist because Earth’s rotation isn’t constant, and standards bodies keep debating whether to stop inserting them.
- Kerberos inherited its skew tolerance: the “few minutes” skew window comes from balancing security (replay prevention) and real-world clock imperfection.
- Stratum isn’t a badge: it’s a hop count from a reference clock; the Internet is full of low-stratum servers that are wrong.
- Early Windows domains were time-sensitive before “Zero Trust” was trendy: domain auth has long depended on time; it’s just more visible now.
- Virtualization made time harder: guest clocks can drift if the host is overloaded or time sync is misconfigured in both guest and hypervisor.
- NTP can be weaponized: misconfigured servers have been abused for reflection/amplification attacks, which is why security teams sometimes “solve” NTP by blocking it.
- Monotonic and wall clocks are different tools: the OS keeps both; your app logs use wall time while schedulers and timeouts prefer monotonic time.
Architecture that survives offices, links, and auditors
What you want: a small internal time hierarchy
The goal is boring: every system in every office agrees on time within a tight bound, even if the WAN is flaky.
You get there with a hierarchy:
- External references: a few well-chosen upstream sources (public pool, ISP NTP, GPS at HQ) feeding a small set of internal servers.
- Internal time servers: at least two per region or major site, reachable by all clients.
- Branch offices: clients sync from internal servers, not the Internet. Larger branches may run a local NTP relay for resilience.
- Domain controllers: follow the AD time hierarchy correctly instead of freelancing.
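A minimal sketch of an internal regional server in chrony terms; every hostname and subnet below is a placeholder for your environment:

# /etc/chrony/chrony.conf (fragment) -- internal regional time server
# Upstream references: a GPS-backed HQ source plus external servers.
server ntp-gps.hq.example.internal iburst
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst

# Serve internal clients only; refuse everyone else.
allow 10.0.0.0/8
deny all

# Step only around boot; slew during normal operation.
makestep 1.0 3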
AD-specific rule: PDC Emulator is the “time boss”
In an AD domain, time flows in a specific hierarchy. Typically:
the PDC Emulator role holder should be configured to sync from reliable external sources (or a GPS-backed internal source),
and other DCs sync from the domain hierarchy. Domain members sync from DCs.
If you let multiple DCs pull time from random sources, you’ve created competing authorities. You’ll get intermittent skew and you’ll hate your life.
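The usual way to express that on the PDC Emulator is w32tm; the peer list below is a placeholder, and the 0x8 flag tells w32time to act as a proper NTP client:

PS C:\> w32tm /config /manualpeerlist:"ntp-hq-1.example.internal,0x8 ntp-hq-2.example.internal,0x8" /syncfromflags:manual /reliable:yes /update
PS C:\> Restart-Service w32time

Other DCs and domain members should keep the default /syncfromflags:domhier behavior instead.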
Branch office pattern: local NTP cache when WAN is unreliable
For small branches with stable WAN, pointing clients at two internal regional servers is fine.
For branches with intermittent connectivity, add a local time server (or use the branch DC if it’s reliable and treated carefully) that:
- syncs to HQ/regional upstreams when the link is up,
- serves local clients continuously,
- has sane limits so it doesn’t free-run into nonsense.
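In chrony terms, that branch relay pattern is only a few directives; hostnames, subnet, and the orphan stratum below are placeholders to adapt:

# /etc/chrony/chrony.conf (fragment) -- branch-local NTP relay
# Sync from regional servers over the WAN when the link is up.
server ntp-hq-1.example.internal iburst
server ntp-hq-2.example.internal iburst

# Keep serving local clients during WAN loss, but at a deliberately
# high stratum so nothing outside the branch ever prefers this box.
local stratum 10 orphan

# Serve only the branch subnet.
allow 10.60.8.0/24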
Security posture: NTS is nice; controlled NTP is necessary
If your environment supports it, NTS (Network Time Security) provides cryptographic protection for NTP.
Many corporate environments aren’t there yet, especially with mixed Windows, network gear, and appliances.
So you do the next best thing:
- limit who can query and who can serve time,
- use internal servers,
- monitor offsets and source selection,
- treat NTP like an authentication dependency.
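If your chrony build is 4.0 or newer and the upstream offers NTS, enabling it is one keyword; the access restrictions apply either way (the server name is a placeholder):

# /etc/chrony/chrony.conf (fragment)
# NTS-protected upstream (requires NTS support on both ends).
server ntp-nts.example.internal iburst nts

# Refuse to serve time or accept control commands from anyone unexpected.
deny all
cmddeny all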
Virtualization and cloud: pick one time master per layer
The classic foot-gun is double-sync: hypervisor tools trying to set guest time while chrony/ntpd also adjusts it.
Pick one. Usually: disable “set time” in the hypervisor tools, keep the guest’s NTP client enabled, and ensure the host itself is synced.
The exception is when the vendor explicitly documents the opposite for a specific platform; then you follow the platform, not your instincts.
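For example, on VMware guests running open-vm-tools, the tools’ periodic sync can be checked and turned off from inside the guest (other hypervisors have their own equivalents):

cr0x@server:~$ vmware-toolbox-cmd timesync status
Enabled
cr0x@server:~$ sudo vmware-toolbox-cmd timesync disable
Disabled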
Quote (paraphrased idea), attributed to John Allspaw: “Reliability comes from designing systems that assume failure, then practicing how you respond.”
Fast diagnosis playbook
When “AD is broken” and “VPN is down” show up in the same hour, check time before you check anything else.
Not because time is always the culprit, but because it’s fast to disprove and catastrophic when true.
First: confirm time reality on one failing client and one authority
- On the client: check current time, time zone, and NTP sync status.
- On a domain controller (or VPN gateway): check its NTP sources and offset.
- Compare wall time with a known-good reference (your internal NTP server, not your phone).
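One quick way to compare a host against a specific server without touching its clock is chronyd’s query-only mode; the hostname is a placeholder, and exact log wording varies by version:

cr0x@server:~$ chronyd -Q 'server ntp-hq-1 iburst'
2025-12-28T09:14:25Z chronyd version 4.2 starting (+CMDMON +NTP +REFCLOCK +RTC ...)
2025-12-28T09:14:29Z System clock wrong by 0.000214 seconds (ignored)
2025-12-28T09:14:29Z chronyd exiting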
Second: check NTP reachability and who the client thinks its server is
- Is UDP/123 allowed both ways?
- Is the client configured for internal time servers, or did it fall back to something weird?
- Is there an SD-WAN policy rewriting or rate-limiting UDP/123?
Third: determine if the issue is drift, step, or source selection
- Huge offset (minutes/hours): machine booted wrong, CMOS clock issue, VM time reset, or NTP was blocked for a while.
- Small but growing offset: the time service isn’t disciplining the clock (stuck), or the oscillator is bad.
- Offset flapping: multiple time sources fighting, or network jitter/asymmetry confusing selection.
Fourth: decide blast radius
- If only one branch is off: treat it as a branch WAN/firewall/local NTP problem.
- If all offices are off: suspect the top of your time hierarchy (PDC emulator config, internal NTP servers, upstream reachability).
- If only VPN users are failing: check VPN gateway time and IdP assertion validity windows.
Practical tasks: commands, outputs, decisions
These are the checks that pay rent. Each task includes a command, what “good” and “bad” look like, and the decision you make next.
Commands are shown with sample output; adjust hostnames to your environment.
Task 1: Check the current clock and time zone (Linux)
cr0x@server:~$ timedatectl
Local time: Sun 2025-12-28 09:14:22 UTC
Universal time: Sun 2025-12-28 09:14:22 UTC
RTC time: Sun 2025-12-28 09:14:21
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
What it means: “System clock synchronized: yes” is a good sign; “NTP service: active” says your client is running.
If the time zone is wrong, logs and cert validation can still look wrong even if NTP is fine.
Decision: If synchronized is “no,” move immediately to checking the NTP daemon and reachability.
Task 2: Check chrony tracking (Linux)
cr0x@server:~$ chronyc tracking
Reference ID : 10.20.1.10 (ntp-hq-1)
Stratum : 3
Ref time (UTC) : Sun Dec 28 09:14:18 2025
System time : 0.000183421 seconds fast of NTP time
Last offset : +0.000091233 seconds
RMS offset : 0.000512344 seconds
Frequency : 12.345 ppm fast
Residual freq : -0.210 ppm
Skew : 0.900 ppm
Root delay : 0.012345 seconds
Root dispersion : 0.001234 seconds
Update interval : 64.0 seconds
Leap status : Normal
What it means: sub-millisecond offsets are excellent; tens of milliseconds are usually fine; hundreds of milliseconds might be survivable;
seconds are a problem for auth. Decision: If “Leap status” isn’t normal or stratum jumps high, inspect sources next.
Task 3: Inspect NTP sources and selection (chrony)
cr0x@server:~$ chronyc sources -v
210 Number of sources = 2
.-- Source mode '^' = server, '=' = peer, '#' = local clock.
/ .- Source state '*' = current best, '+' = combined, '-' = not combined.
| / Reachability register (octal) - valid samples are marked '377'.
||
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^* ntp-hq-1 2 6 377 35 +0.182ms[ +0.210ms] +/- 1.2ms
^+ ntp-hq-2 2 6 377 34 -0.311ms[ -0.290ms] +/- 1.6ms
What it means: reach “377” means successful recent polls; “^*” is the current source.
If reach is “0” or low, packets aren’t returning. Decision: Low reach drives you to firewall/WAN checks immediately.
Task 4: Verify NTP UDP/123 connectivity (Linux)
cr0x@server:~$ nc -vu -w 2 ntp-hq-1 123
Connection to ntp-hq-1 123 port [udp/ntp] succeeded!
What it means: this only proves the port isn’t immediately blocked; it doesn’t prove NTP replies are correct.
Decision: If this fails from a branch subnet but works from HQ, you have a network policy problem, not a time daemon problem.
Task 5: Check if something else is using UDP/123 or blocking it (Linux)
cr0x@server:~$ sudo ss -ulpn | grep ':123'
UNCONN 0 0 0.0.0.0:123 0.0.0.0:* users:(("chronyd",pid=812,fd=6))
UNCONN 0 0 [::]:123 [::]:* users:(("chronyd",pid=812,fd=7))
What it means: chronyd is bound to UDP/123 and can serve time.
If you see multiple daemons competing (chronyd and ntpd), that’s a configuration smell.
Decision: If the expected daemon is not listening, fix the service before chasing WAN ghosts.
Task 6: Confirm time sync status and peers (systemd-timesyncd)
cr0x@server:~$ timedatectl timesync-status
Server: 10.20.1.10 (ntp-hq-1)
Poll interval: 1min 4s (min: 32s; max 34min 8s)
Leap: normal
Version: 4
Stratum: 3
Reference: 8A0E5F9C
Precision: 1us (-24)
Root distance: 1.123ms
Offset: +212us
Delay: 502us
Jitter: 147us
Packet count: 178
Frequency: +12.3ppm
What it means: good for simple clients; if your environment is messy (WAN jitter, branches),
chrony is usually the better call. Decision: If offset is large and doesn’t improve, you need better sources or less jitter.
Task 7: Check Windows time status on a domain member (run in PowerShell)
PS C:\> w32tm /query /status
Leap Indicator: 0(no warning)
Stratum: 3 (secondary reference - syncd by (S)NTP)
Precision: -23 (119.209ns per tick)
Root Delay: 0.0312500s
Root Dispersion: 0.1093750s
ReferenceId: 0x0A14010A (source IP: 10.20.1.10)
Last Successful Sync Time: 12/28/2025 9:12:41 AM
Source: ntp-hq-1
Poll Interval: 6 (64s)
What it means: “Source” should be a domain time source (typically a DC) or your approved internal NTP host.
If it’s “Local CMOS Clock” or a public server, you’ve got drift risk. Decision: Correct the time hierarchy before touching Kerberos settings.
Task 8: Check which time source a Windows domain member is using
PS C:\> w32tm /query /source
ntp-hq-1
What it means: quick confirmation of what the machine believes. Decision: If it points to an unreachable server,
fix routing/firewall or update GPO/time configuration.
Task 9: Verify AD time hierarchy settings on the PDC Emulator
PS C:\> w32tm /query /configuration
[Configuration]
EventLogFlags: 2 (Local)
AnnounceFlags: 5 (Local)
TimeJumpAuditOffset: 28800 (Local)
MinPollInterval: 6 (Local)
MaxPollInterval: 10 (Local)
MaxNegPhaseCorrection: 172800 (Local)
MaxPosPhaseCorrection: 172800 (Local)
[TimeProviders]
NtpClient (Local)
DllName: C:\Windows\system32\w32time.dll (Local)
Enabled: 1 (Local)
InputProvider: 1 (Local)
NtpServer (Local)
Enabled: 1 (Local)
What it means: you’re checking whether the machine is configured to act as an NTP server and whether the client is enabled.
Decision: If the PDC isn’t configured to sync from a reliable source, fix that first; everything else inherits the mistake.
Task 10: Force a resync on Windows and interpret errors
PS C:\> w32tm /resync /force
Sending resync command to local computer...
The command completed successfully.
What it means: if this fails with “no time data was available,” your client can’t reach a valid source.
Decision: A successful resync with persistent auth errors suggests the authority side (DC/VPN gateway) is wrong, not the client.
Task 11: Detect time jumps and NTP corrections in logs (Linux)
cr0x@server:~$ sudo journalctl -u chronyd --since "2 hours ago" | tail -n 12
Dec 28 07:22:10 branch-app-7 chronyd[812]: Selected source 10.20.1.10
Dec 28 07:22:42 branch-app-7 chronyd[812]: System clock wrong by 1.832 seconds, adjustment started
Dec 28 07:23:14 branch-app-7 chronyd[812]: System clock synchronized
Dec 28 07:58:15 branch-app-7 chronyd[812]: Source 10.20.1.10 replaced with 10.20.1.11
Dec 28 07:58:47 branch-app-7 chronyd[812]: System clock wrong by 0.412 seconds, adjustment started
What it means: repeated “wrong by X seconds” indicates instability—either the local clock is drifting or upstream is unstable.
Decision: If corrections correlate with WAN events, consider a branch-local time source or better WAN policy for UDP/123.
Task 12: Confirm NTP server is serving and who is querying (Linux server)
cr0x@server:~$ sudo chronyc clients
Hostname                      NTP   Drop Int IntL Last     Cmd   Drop Int  Last
===============================================================================
branch-fw-1                    12      0   6   -     42      0      0   -     -
branch-dc-2                     8      0   6   -     38      0      0   -     -
vpn-gw-1                        6      0   6   -     44      0      0   -     -
What it means: your server sees clients; “Drop” counts show packet drops.
Decision: If a branch is failing but not listed here, it’s not reaching the server—network path issue.
Task 13: Validate NTP server reachability from a branch router/firewall (Linux-based appliance)
cr0x@server:~$ sudo tcpdump -ni eth0 udp port 123 and host 10.20.1.10 -c 6
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
09:14:10.111111 IP 10.60.8.2.49823 > 10.20.1.10.123: NTPv4, Client, length 48
09:14:10.145678 IP 10.20.1.10.123 > 10.60.8.2.49823: NTPv4, Server, length 48
09:14:11.111209 IP 10.60.8.2.49823 > 10.20.1.10.123: NTPv4, Client, length 48
09:14:11.146001 IP 10.20.1.10.123 > 10.60.8.2.49823: NTPv4, Server, length 48
What it means: requests and replies are visible. If you only see requests, replies are being dropped (stateful firewall, ACL, NAT).
Decision: Fix the network policy before tuning chrony; chrony can’t discipline a clock using missing packets.
Task 14: Check for VM guest time sync conflicts (Linux guest example)
cr0x@server:~$ systemctl is-enabled chronyd; systemctl is-enabled systemd-timesyncd
enabled
disabled
What it means: you should run one time sync client. Decision: If both are enabled, disable one and retest drift.
Task 15: Detect certificate validity errors tied to clock drift (Linux client)
cr0x@server:~$ openssl s_client -connect vpn.example.internal:443 -servername vpn.example.internal </dev/null 2>/dev/null | openssl x509 -noout -dates
notBefore=Dec 28 08:00:00 2025 GMT
notAfter=Mar 27 08:00:00 2026 GMT
What it means: compare “notBefore” to the client’s current UTC time. If the client is behind, it may see “not yet valid.”
Decision: Don’t rotate certs to “fix” this. Fix time, then rerun the handshake.
Task 16: Confirm firewall counters for UDP/123 (Linux nftables example)
cr0x@server:~$ sudo nft list ruleset | sed -n '1,80p'
table inet filter {
chain input {
type filter hook input priority 0; policy drop;
ct state established,related accept
iif "lo" accept
udp dport 123 ip saddr 10.60.0.0/16 accept
}
}
What it means: you allow NTP from branch subnets. If the rule isn’t present, your NTP server might be silently ignoring branches.
Decision: Make NTP explicit in firewall policy; “allow all from internal” is how you fail audits and still miss NTP.
Three corporate mini-stories (anonymized, plausible, and painfully real)
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company opened two new offices after an acquisition. They had an existing AD forest, VPN concentrators in HQ, and a decent internal PKI.
The integration plan was careful about DNS and routing. Time was assumed to “just work” because the Windows domain “handles that.”
On day one, users in the new office complained that logging in took ages and sometimes failed. VPN sign-ins succeeded, then the client dropped after a minute.
Meanwhile, a handful of Linux jump hosts started throwing TLS errors against internal APIs. The incident channel filled up with half-truths:
“AD replication is broken.” “The CA issued bad certs.” “The VPN firmware is unstable.”
The actual issue: the branch firewall had a default policy that blocked outbound UDP/123 “for security,” and the new office had laptops that had been asleep for weeks.
They woke up with stale clocks, then drifted further because NTP couldn’t reach internal servers. Kerberos started rejecting tickets, and the VPN gateway—
which validated short-lived assertions—looked like it was randomly flaking out.
Fixing it was almost insultingly simple: allow UDP/123 from the branch to the internal NTP servers and enforce domain hierarchy for Windows clients.
The hard part was admitting the assumption was wrong. The postmortem wasn’t about “someone forgot a firewall rule.” It was about treating time as a dependency,
not background radiation.
Mini-story 2: The optimization that backfired
Another org decided to “optimize WAN traffic.” They had dozens of small sites connected over SD-WAN and wanted to reduce chatter.
Someone noticed NTP was a constant trickle and pushed a policy: rate-limit UDP/123 and lengthen polling intervals on clients.
On paper, it looked like a win: less noise, fewer packets, cleaner graphs.
Then the SD-WAN started doing what SD-WAN does: path changes. A few branches occasionally went from low-latency MPLS to higher-latency Internet transport.
NTP packets arrived late or bursty. Clients with longer polling intervals had fewer samples, so they trusted noisy measurements.
Their offsets started oscillating. Some stepped time during business hours. That created weirdness: cron jobs ran twice, log pipelines mis-ordered events,
and a couple of app servers failed mutual TLS because their clocks jumped.
The kicker: none of this tripped “NTP down” alerts, because NTP wasn’t down. It was degraded. The system was “working” in the same way
a wobbly chair is “working” until you sit down quickly.
The rollback was straightforward: remove the aggressive rate limit, restore sane polling, and add branch-local NTP relays at the few sites with the worst jitter.
The lesson was classic operations: optimizing a tiny resource can destabilize a critical control loop.
Mini-story 3: The boring but correct practice that saved the day
A global company ran a strict internal time service: two NTP servers per region, GPS-backed at HQ, with monitoring on offset and reachability.
Every branch had explicit firewall rules for UDP/123. Windows clients followed domain hierarchy; Linux used chrony with two internal sources.
It wasn’t glamorous. It was documented, reviewed, and tested during office turn-ups.
During a major ISP outage, one region lost access to upstream references. Public Internet NTP was unreachable from that segment.
The time servers entered a “holdover” mode—still serving time, but no longer updating from upstream. This is where many setups go off the rails.
The difference was that this org had policies: maximum allowed drift before alerting, and a runbook that told on-call exactly what to check.
Monitoring showed offsets increasing slowly but staying within acceptable bounds. Clients continued authenticating because the local hierarchy remained consistent.
When upstream returned, the servers re-synchronized carefully without stepping time wildly. Nobody in the business noticed.
The correct practice wasn’t fancy NTS everywhere or a big hardware purchase. It was making time a first-class service with redundancy and measurement.
Boring. Correct. Effective.
Common mistakes: symptom → root cause → fix
1) Users get “The time difference between the client and server is too great”
Symptom: AD logons fail intermittently, especially after sleep/hibernate or after imaging.
Root cause: client clock drift beyond Kerberos skew tolerance; NTP blocked or wrong time source.
Fix: restore NTP reachability to internal servers; enforce domain time hierarchy; resync and verify with w32tm /query /status or chronyc tracking.
2) VPN works for some users, fails for others with token validity errors
Symptom: VPN authentication loops, “assertion expired,” or connection drops after initial success.
Root cause: VPN gateway or IdP connector has wrong time; branch clients are ahead/behind.
Fix: verify gateway time sources; check offset on gateways and IdP proxies; correct NTP and avoid manual time changes.
3) TLS errors: “certificate not yet valid” right after cert rotation
Symptom: new cert deployment seems “broken” in one office only.
Root cause: that office’s clients are behind; cert validity window starts in the “future” relative to them.
Fix: fix NTP first. Re-test with openssl x509 -noout -dates against the service.
4) One DC shows different time than another
Symptom: authentication succeeds against some DCs and fails against others; replication looks noisy.
Root cause: multiple DCs configured with independent NTP sources; PDC emulator not acting as authoritative source.
Fix: configure PDC to sync to approved upstreams; set other DCs to domain hierarchy; audit with w32tm /monitor.
5) Time “looks fine” but offset is noisy and keeps switching sources
Symptom: NTP client flips between servers; jitter is high; occasional clock steps.
Root cause: WAN jitter/asymmetry; aggressive SD-WAN shaping; too few samples due to long polling; unstable upstream.
Fix: use two or more nearby internal sources; reduce jitter by fixing network policy; consider branch-local NTP relay.
6) Someone “fixes” time manually and makes it worse
Symptom: logs become unusable, scheduled tasks misfire, databases complain, incidents become impossible to reconstruct.
Root cause: manual time stepping during normal operation; conflicting time sync services afterward.
Fix: stop the time service, correct configuration, restart, allow controlled step only when required; document a runbook for safe correction.
Joke #2: The only thing more reliable than time drift is someone insisting it can’t be time because “NTP is enabled.”
Checklists / step-by-step plan
Step-by-step: build a resilient cross-office time service
- Pick authoritative internal NTP servers: at least two per major region or HQ. Decide where upstream references live.
- Decide upstream policy: use multiple external sources or GPS-backed sources. Avoid single upstream dependencies.
- Lock down firewall rules: explicitly allow UDP/123 from client subnets to internal NTP servers; deny serving to untrusted networks.
- Make AD time hierarchy correct: configure PDC emulator with upstream; ensure other DCs use domain hierarchy.
- Standardize clients: Linux uses chrony; Windows uses w32time; disable double-sync (guest tools vs NTP daemon).
- Define acceptable drift: pick alert thresholds (example: warn at 250ms, page at 2s for critical systems; tighter for DCs and VPN gateways).
- Instrument and alert: collect offset/jitter/reach metrics; alert on source loss and on increasing offset slope.
- Test failure modes: simulate WAN loss to a branch; confirm branch keeps stable time and recovers without big steps.
- Write the runbook: “what to check first” and “how to fix safely” for on-call and desktop teams.
- Audit quarterly: verify no rogue NTP servers, no public NTP on clients, and that branch firewall rules still exist.
Branch office turn-up checklist (the one you actually use)
- Confirm branch VLANs can reach internal NTP servers over UDP/123 in both directions.
- Confirm at least two NTP sources are configured for servers and critical network gear.
- Confirm the branch DC (if present) is not pointed at random Internet NTP; it should follow domain hierarchy.
- Confirm VPN gateway(s) in-region sync from the same internal hierarchy.
- Confirm monitoring sees branch offsets within thresholds within one hour of opening.
Incident response checklist: when auth is failing and you suspect time
- Measure offset on one failing client and one DC/gateway.
- Check whether NTP packets return (tcpdump if needed).
- Check whether the client is using the correct time source.
- Fix reachability first, then resync; avoid manual time steps during business hours unless absolutely necessary.
- After correction, re-test the original auth path (Kerberos ticket issuance, VPN login, TLS handshake) to confirm causality.
Monitoring and alerting that actually catches it
“NTP service up” is not monitoring. It’s wishful thinking with metrics.
You need to observe the control loop: reachability, offset, jitter, and source changes.
What to measure
- Client offset: especially on domain controllers, VPN gateways, IdP proxies, and certificate-issuing systems.
- Reach register: chrony reach dropping from 377 toward 0 is a leading indicator.
- Source selection changes: frequent switching indicates instability or network jitter.
- Time steps: any step on a critical system should be an alert, because it often correlates with user-visible failures.
- Upstream availability: internal NTP servers losing all upstream sources should alert before clients drift too far.
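A minimal sketch of an offset probe you could hang off any alerting pipeline. It parses the human-readable chronyc tracking output (stable for years, but verify the field position on your chrony version), and the thresholds are placeholders:

cr0x@server:~$ cat /usr/local/bin/check-ntp-offset.sh
#!/bin/sh
# Thresholds in seconds; align them with your paging policy.
WARN=0.25
CRIT=2.0
# "System time : 0.000183 seconds fast of NTP time" -> field 4 is the
# offset magnitude (direction is given by "fast"/"slow", ignored here).
OFFSET=$(chronyc tracking | awk '/^System time/ {print $4}')
[ -z "$OFFSET" ] && { echo "UNKNOWN: no offset from chronyc"; exit 3; }
awk -v o="$OFFSET" -v w="$WARN" -v c="$CRIT" 'BEGIN {
    if (o > c) { print "CRITICAL: offset " o "s"; exit 2 }
    if (o > w) { print "WARNING: offset " o "s"; exit 1 }
    print "OK: offset " o "s"
}'

Exit codes follow the Nagios convention (0 OK, 1 warning, 2 critical, 3 unknown), so most monitoring agents can run it as-is.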
Alert thresholds (opinionated defaults)
- Domain controllers: warn at 100ms, page at 1s, emergency at 5s.
- VPN gateways / IdP connectors: warn at 200ms, page at 2s (tokens are short-lived, users are impatient).
- General servers: warn at 500ms, page at 5s (unless the service is crypto/auth-heavy).
- Workstations: don’t page, but trend and report. If you see a whole site drifting, treat it as a network policy issue.
What to avoid
- Alerting on every transient jitter spike. You’ll train everyone to ignore the alerts.
- Using a single NTP server as a global dependency without redundancy.
- Letting network/security teams block UDP/123 “temporarily” without an explicit exception process.
FAQ
1) Why does AD care so much about time?
Kerberos uses timestamps to prevent replay attacks. If clocks differ too much, the system can’t safely distinguish “late” from “replayed,” so it rejects.
2) Is five minutes of skew always the limit?
It’s a common default (in AD, the Kerberos policy “Maximum tolerance for computer clock synchronization”), not a law of physics. You can change tolerances, but increasing skew tolerance is usually the wrong fix:
it weakens security and masks the real issue (broken time distribution).
3) Should branch offices sync directly to Internet NTP?
Usually no. It complicates firewalling, auditing, and consistency. Prefer internal servers so all offices share the same hierarchy and you can monitor it.
Exceptions exist (very small sites with no VPN and no domain membership), but they’re exceptions you document.
4) chrony vs ntpd vs systemd-timesyncd: what should I use?
For servers and branches with real network variability, chrony is typically the best choice because it handles jitter and intermittent connectivity well.
systemd-timesyncd is fine for simple clients. ntpd can work, but in many modern environments chrony is operationally easier.
5) Can SD-WAN break NTP even if “UDP is allowed”?
Yes. Shaping, policing, path changes, and asymmetric routing can increase jitter and distort delay estimates.
NTP might be reachable but unstable, which is worse than clean failure because it produces inconsistent time.
6) How do I handle time in isolated networks without Internet?
Run internal NTP servers with a local reference (GPS or disciplined oscillator) or a well-managed holdover strategy.
The key is consistency: clients should all follow the same internal hierarchy and you must monitor drift over time.
7) What about leap seconds—do they break stuff?
They can. Some systems handle leap seconds poorly, and inconsistent handling across devices can create brief time disagreements.
Using a consistent time source and disciplined clients reduces the risk; avoid mixing “special” time behaviors across your fleet.
8) Why do certificates fail when time is only a little off?
Certificates have strict validity windows, and some clients cache time-dependent decisions.
A small drift might be enough to cross a validity boundary, especially right after rotation when “notBefore” is recent.
9) Should I allow NTP from clients to domain controllers, or only to dedicated NTP servers?
Either can work, but pick a model and enforce it. Many environments use DCs as time sources for domain members
while ensuring the PDC emulator is the only DC pulling external time. Dedicated NTP servers can be cleaner for non-Windows systems and network gear.
10) If a machine is way off, is it safe to step time immediately?
Safe depends on the workload. For a laptop, stepping is usually fine. For databases, message brokers, and anything with sensitive ordering, stepping can be harmful.
Prefer controlled correction: fix NTP, then allow the time service to correct appropriately, stepping only when required and ideally during a maintenance window.
Next steps (the sane path forward)
If you run AD, VPN, or anything with certificates, treat time like an internal platform service. Not a background feature. A service.
Build a hierarchy, restrict it, monitor it, and test its failure modes.
Practical next steps you can do this week:
- Identify the PDC emulator and verify its upstream time sources are correct and redundant.
- Pick two internal NTP servers per region and make UDP/123 reachability explicit in firewall policy.
- Audit a handful of clients in each office: confirm their sources, offsets, and that they aren’t using public NTP.
- Add alerts on offset and reach (not just daemon uptime) for DCs, VPN gateways, IdP components, and internal NTP servers.
- Run a branch WAN failure test: confirm time remains stable and recovery doesn’t cause large steps.
Do this and you’ll prevent a category of outages that look like “security weirdness,” “network flakiness,” and “Windows being Windows,”
but are actually the same boring root cause: time drift you didn’t measure.