The failure mode is always the same: backups start failing, web UIs show “certificate not yet valid,”
or nodes flake out of a Proxmox cluster like they suddenly forgot how quorum works. You chase storage,
you chase networking, you blame “some TLS thing.” Then someone runs date and discovers the host thinks
it’s last Tuesday.
Time drift is boring until it detonates. Proxmox (and especially Proxmox Backup Server, PBS) depends on sane time
for TLS, ticket lifetimes, cluster membership, and log correlation. When time goes sideways, you don’t just get
slightly wrong timestamps—you get authentication failures, split-brain-ish behavior, and backups that won’t trust
their own target.
Why time drift breaks Proxmox, TLS, and PBS
Time is a dependency you don’t list in your architecture diagrams, but it’s still a dependency. TLS certificates
have “notBefore” and “notAfter” windows. Kerberos, JWTs, and most ticket systems are time-bound. Clusters use time
as a weak signal for liveness and event ordering. Backup systems build retention policies and garbage collection on
the assumption that “now” is monotonic-ish.
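That “monotonic-ish” distinction is observable from a shell. A minimal sketch, Linux-specific: it reads the first field of /proc/uptime, a monotonic, non-decreasing counter of time since boot, while wall-clock time can be stepped backwards at any moment:

```shell
#!/bin/sh
# Wall-clock time can be stepped backwards; monotonic counters cannot.
# /proc/uptime's first field is elapsed time since boot and never decreases,
# which is the property retention/GC logic quietly depends on.
t1=$(cut -d' ' -f1 /proc/uptime)
sleep 1
t2=$(cut -d' ' -f1 /proc/uptime)
# Compare as floats with awk: the second reading must not be smaller.
if awk -v a="$t1" -v b="$t2" 'BEGIN { exit (b >= a) ? 0 : 1 }'; then
    echo "monotonic clock moved forward: ${t1}s -> ${t2}s"
fi
```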
Proxmox VE sits on Debian. The node uses system services (chrony or systemd-timesyncd, sometimes legacy ntpd),
a hardware RTC (real-time clock), and kernel timekeeping. PBS is strict about TLS because it should be. So if your
node’s clock is behind, you get “certificate not yet valid.” If it’s ahead, you get “certificate expired” even when
it isn’t.
How it shows up in real life
- PBS backups fail with TLS handshake errors or authentication failures.
- Proxmox web UI works on one node but not another, or browsers throw warnings unexpectedly.
- Cluster nodes randomly “lose” each other, show odd log ordering, or have confusing corosync events.
- VMs drift even if the host is stable, because guests are their own tiny universe with their own bad habits.
If your instinct is “I’ll just set the time manually,” resist it. Time jumps can break running services in ways that
look like corruption. Gradual corrections are usually safer, and when you do need a step, you do it with intent and
controlled blast radius.
One operational truth worth printing out, paraphrased from Werner Vogels on reliability and operations: everything fails, all the time; design so the system keeps working anyway. Time sync is part of that “keeps working” plan.
Interesting facts and a little history (because clocks are weird)
- NTP is old—really old. It predates most modern “cloud” thinking, and it’s still one of the most deployed protocols on Earth.
- Linux keeps time with multiple clocks. There’s wall clock time, monotonic time, and sometimes TSC/HPET-based counters—each with different guarantees.
- Virtualization made time harder. A paused VM doesn’t experience time; the world does. Catching up is non-trivial.
- Leap seconds are real and annoying. Some systems step; others smear. Mixed strategies can cause weird transient skews.
- RTC drift is normal. Cheap oscillators wander with temperature and age; a few seconds per day can happen on commodity boards.
- Corosync doesn’t “need” perfect time, but debugging cluster behavior without consistent timestamps is like doing surgery with fogged goggles.
- TLS is time-sensitive by design. It’s not being dramatic; “notBefore” prevents replay of future-issued certs and mitigates certain attacks.
- PTP exists because NTP isn’t always enough. When you need sub-millisecond sync (trading, telecom, measurement), PTP with hardware timestamping is the move.
- Google’s leap smear popularized the idea that stepping time is dangerous at scale; many enterprises copied the strategy without understanding the tradeoffs.
Joke #1: Time sync is the only place where “close enough” and “cryptography” meet, and they immediately start arguing.
Fast diagnosis playbook
When Proxmox/PBS starts throwing TLS errors, you want the shortest path to “is it time?” without spiraling into a
week-long archaeology dig through logs.
First: confirm skew and whether the system thinks it’s synced
- Check local time vs reality (and confirm timezone/RTC mode).
- Check NTP synchronization state (chrony or systemd-timesyncd).
- Check whether time is stepping or slewing slowly (big offsets tend to step unless configured otherwise).
Second: identify the time source and whether it’s reachable
- Which service owns NTP? chrony vs systemd-timesyncd vs ntpd: pick one.
- Are your configured servers reachable? DNS, firewall, routing, VLAN ACLs.
- Is the host allowed to set time? containers and some hardened configs can block it.
Third: check virtualization-specific issues
- Host stable, guests drifting? Then you’re fixing guest timekeeping, not NTP on the host.
- After suspend/resume or live migration? That’s a clocksource/guest-agent/stepping policy issue.
- Cluster nodes disagree? Fix time on every node before you touch corosync settings.
The rule: don’t debug TLS until you’ve proven time is correct. TLS errors are honest—your clock is lying, not your certificates.
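The first check in the playbook can be scripted. A minimal sketch of the parsing logic, with a hardcoded timedatectl-style sample standing in for live output (in production you would capture the real command's output instead):

```shell
#!/bin/sh
# Triage sketch: answer "is it time?" from timedatectl-style output.
# The sample below is hardcoded for illustration; replace it with
# captured output of `timedatectl status` on the suspect node.
sample='Local time: Fri 2025-12-26 10:14:36 UTC
System clock synchronized: no
NTP service: active'

# Split on ": " so the value survives even if labels contain spaces.
sync_state=$(printf '%s\n' "$sample" \
    | awk -F': ' '/System clock synchronized/ {print $2}')

if [ "$sync_state" = "yes" ]; then
    verdict="clock claims sync; check the offset next (chronyc tracking)"
else
    verdict="NOT synchronized; fix NTP before touching TLS or certificates"
fi
echo "$verdict"
```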
Practical tasks (commands, outputs, decisions)
You asked for production-grade. Here are real tasks you can run on Proxmox nodes and PBS. Each one includes:
command, sample output, what it means, and what decision you make.
Task 1: Quick sanity check: local time and kernel time
cr0x@server:~$ date -Ins
2025-12-26T10:14:32,118962+00:00
Meaning: This is the wall clock time with timezone offset. If it’s not obviously “now,” stop and fix time before anything else.
Decision: If off by more than a few seconds in a datacenter environment, proceed to NTP status checks. If off by minutes/hours, expect TLS failures and likely a step correction.
Task 2: Confirm whether the system thinks it’s synchronized
cr0x@server:~$ timedatectl status
Local time: Fri 2025-12-26 10:14:36 UTC
Universal time: Fri 2025-12-26 10:14:36 UTC
RTC time: Fri 2025-12-26 10:14:35
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
Meaning: “System clock synchronized: yes” is a good sign. “RTC in local TZ: no” is the sane default for Linux servers (RTC in UTC).
Decision: If synchronized: no, don’t guess. Identify the NTP client and its server reachability next.
Task 3: Identify who owns NTP (chrony vs systemd-timesyncd vs ntpd)
cr0x@server:~$ systemctl is-active chronyd systemd-timesyncd ntp
active
inactive
inactive
Meaning: chrony is active; others are not. Good: one time service.
Decision: If more than one is active, fix that first. NTP clients fighting each other is how you get jitter, stepping, and operator misery.
Task 4: Check chrony’s tracking (offset, frequency, leap status)
cr0x@server:~$ chronyc tracking
Reference ID : 0A0A0A01 (10.10.10.1)
Stratum : 3
Ref time (UTC) : Fri Dec 26 10:14:44 2025
System time : 0.000123456 seconds fast of NTP time
Last offset : +0.000081234 seconds
RMS offset : 0.000210987 seconds
Frequency : 12.345 ppm fast
Residual freq : +0.010 ppm
Skew : 0.120 ppm
Root delay : 0.002345678 seconds
Root dispersion : 0.001234567 seconds
Leap status : Normal
Meaning: Offset is tiny; chrony is locked to 10.10.10.1. Frequency is being tuned. This is healthy.
Decision: If the “System time” offset is seconds or more, you likely have bad reachability, a bad source, or a clocksource problem.
Task 5: Check chrony sources and reachability
cr0x@server:~$ chronyc sources -v
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^* 10.10.10.1 2 6 377 18 +0.0001[+0.0001] +/- 0.0008
^+ 10.10.10.2 2 6 377 17 -0.0002[-0.0002] +/- 0.0012
^- 192.0.2.20 3 6 377 19 +0.0030[+0.0030] +/- 0.0100
Meaning: ^* is the selected source, Reach 377 means successful replies. A Reach near 0 means you’re not talking to that server.
Decision: If all sources show ? or reach is low, you troubleshoot network/DNS/firewall immediately.
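The Reach column trips people up: it is an octal bitmask of the last eight polls, not a count. A small sketch of how to decode it, using the healthy 377 value from the output above:

```shell
#!/bin/sh
# Decode chrony's "Reach" register: an octal bitmask of the last 8 polls.
# 377 octal = 11111111 binary = all eight recent polls got replies.
reach_octal=377
reach_dec=$((0$reach_octal))   # leading 0 makes the shell parse octal
ok=0
v=$reach_dec
while [ "$v" -gt 0 ]; do       # count set bits = successful polls
    ok=$((ok + (v & 1)))
    v=$((v >> 1))
done
echo "successful polls in last 8: $ok"
```

A Reach of 17 (octal) would decode to only two recent successes, which is the "troubleshoot network/DNS/firewall" signal.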
Task 6: Confirm the NTP port path (UDP/123) isn’t blocked
cr0x@server:~$ nc -uvz 10.10.10.1 123
Connection to 10.10.10.1 123 port [udp/ntp] succeeded!
Meaning: Treat this carefully. UDP has no handshake, so nc reporting success is weak evidence (it mostly means no ICMP unreachable came back); a failure, however, is meaningful.
Decision: If it fails, stop tuning chrony and fix the network policy/routing first. If it “succeeds” but chrony reach stays low, suspect silent drops or badly handled stateful firewall timeouts.
Task 7: Detect stepping events and time resets in logs
cr0x@server:~$ journalctl -u chronyd -S -2h --no-pager | tail -n 20
Dec 26 09:12:01 server chronyd[821]: Selected source 10.10.10.1
Dec 26 09:12:01 server chronyd[821]: System clock wrong by -3.421 seconds
Dec 26 09:12:01 server chronyd[821]: System clock was stepped by 3.421 seconds
Dec 26 09:12:11 server chronyd[821]: Source 10.10.10.1 online
Meaning: chrony stepped the clock by 3.421 seconds. Steps are sometimes correct, but frequent steps indicate instability.
Decision: If you see repeated stepping, investigate virtualization pause/migration, bad RTC, or multiple time clients.
Task 8: Verify hardware clock and RTC mode (UTC vs localtime)
cr0x@server:~$ hwclock --show
2025-12-26 10:14:57.123456+00:00
Meaning: RTC is close to system time and in UTC.
Decision: If RTC is wildly different, fix the RTC and ensure Linux uses UTC. Bad RTC + reboot = time travel.
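The RTC-vs-system comparison is straight epoch arithmetic. A sketch with hardcoded sample timestamps (swap in live hwclock --show and date output; requires GNU date for the -d flag):

```shell
#!/bin/sh
# Sketch: quantify RTC vs system clock skew. Timestamps are hardcoded
# samples for illustration; feed in real `hwclock --show` / `date` output.
rtc="2025-12-26 10:14:57"
sys="2025-12-26 10:15:02"
rtc_s=$(date -u -d "$rtc" +%s)   # GNU date: parse timestamp to epoch
sys_s=$(date -u -d "$sys" +%s)
diff=$((sys_s - rtc_s))
[ "$diff" -lt 0 ] && diff=$((-diff))
echo "RTC vs system skew: ${diff}s"
# Rule of thumb: more than a few seconds here means discipline the RTC
# (hwclock --systohc) before the next reboot time-travels.
```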
Task 9: Force a one-time correction safely (chrony)
cr0x@server:~$ chronyc -a makestep
200 OK
Meaning: chrony stepped the clock if needed, according to its policy.
Decision: Use this when you have a big skew and you need to restore sanity now. Then you still fix the underlying cause.
Task 10: Verify TLS error is time-related (PBS client side)
cr0x@server:~$ proxmox-backup-client status --repository pbs@pam@pbs01:datastore
Error: TLS handshake failed: certificate verify failed: certificate is not yet valid
Meaning: “Not yet valid” is a neon sign for clock skew on the client (this Proxmox node) or sometimes the server (pbs01).
Decision: Check time on both ends. Fix time first; do not reissue certificates as your first move.
Task 11: Confirm time on PBS itself
cr0x@server:~$ ssh pbs01 'timedatectl status'
Local time: Fri 2025-12-26 10:14:59 UTC
Universal time: Fri 2025-12-26 10:14:59 UTC
RTC time: Fri 2025-12-26 10:15:00
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
Meaning: PBS is synced. If the client is not, the client is your problem.
Decision: If PBS is unsynced too, fix PBS first—your backup target should be a “time authority,” not another drifting participant.
Task 12: Check Proxmox cluster health and correlate with time issues
cr0x@server:~$ pvecm status
Cluster information
-------------------
Name: prod-cluster
Config Version: 27
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Fri Dec 26 10:15:04 2025
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.2c
Quorate: Yes
Meaning: If the “Date” differs significantly between nodes when you run this on each, you have drift. Quorum might still be “Yes,” but you’re living on borrowed luck.
Decision: Normalize time across all nodes before investigating intermittent cluster weirdness.
Task 13: Detect guest drift from the host using QEMU guest agent (when installed)
cr0x@server:~$ qm agent 104 get-time
{
  "seconds": 1766744105,
  "nanoseconds": 123000000
}
Meaning: You got guest epoch time. Compare it with host epoch (date +%s) to estimate skew.
Decision: If guests are off while the host is stable, fix guest NTP/tools and consider enabling guest agent time sync strategy (OS-dependent).
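The comparison in Task 13 reduces to epoch arithmetic. A sketch with hardcoded values (the guest epoch matches the agent output above; the host epoch and the 5-second threshold are illustrative assumptions, not Proxmox defaults):

```shell
#!/bin/sh
# Sketch: estimate guest clock skew. Both epochs are hardcoded samples;
# in practice, take the guest value from the agent and host from `date +%s`.
guest_s=1766744105
host_s=1766744112
skew=$((host_s - guest_s))
[ "$skew" -lt 0 ] && skew=$((-skew))   # direction doesn't matter for alerting
threshold=5                            # illustrative tolerance, tune per fleet
if [ "$skew" -gt "$threshold" ]; then
    echo "guest skew ${skew}s exceeds ${threshold}s: fix guest NTP"
else
    echo "guest skew ${skew}s within tolerance"
fi
```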
Task 14: Check kernel clocksource and timekeeping stability flags
cr0x@server:~$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
Meaning: Using TSC is common and fast. On modern CPUs it’s usually stable. On some platforms/BIOS settings, it can be problematic.
Decision: If you see frequent time jumps correlated with CPU power states or migrations, validate BIOS settings and consider an alternate clocksource only with evidence.
Joke #2: If you ever want to feel powerful, set a server clock five minutes ahead and watch every security system suddenly become a philosopher.
Root causes: the usual suspects in Proxmox environments
1) Multiple NTP clients fighting
Debian makes it easy to end up with systemd-timesyncd, chrony, and an older ntpd installed across upgrades and migrations.
When two services both believe they own time, you get oscillation: one slews, the other steps, your logs look like a ransom note.
Opinionated guidance: pick chrony for servers and hypervisors unless you have a very specific reason not to. It handles intermittent connectivity,
large offsets, and virtualized environments well.
2) Bad or unreachable time sources
“We use pool servers” is fine for laptops; it’s sloppy for production hypervisors. Your Proxmox nodes should use:
internal NTP servers (or a pair of well-controlled upstream sources) reachable on every relevant VLAN.
If your NTP source is across a firewall that sometimes blocks UDP/123 (or does stateful timeouts badly), chrony will flap between sources,
increasing jitter. PBS will notice in the form of intermittent TLS failures, because certificates don’t care about your firewall change window.
3) RTC misconfigured (localtime) or drifting badly
RTC-in-localtime is a Windows convention. Linux servers want UTC in hardware clock. If you’ve ever dual-booted, repurposed a workstation, or
imported a questionable image, you can inherit the wrong RTC mode. The problem shows up on reboot: system starts with a wrong baseline,
then NTP tries to correct it. The correction might be a step. Some services hate steps.
4) Virtualization pause/migration and guests with weak timekeeping
Guests pause during snapshotting, backup, migration, or host contention. Some guest OSes recover gracefully; others drift, then snap back.
If you’re backing up a VM to PBS while its clock is unstable, you can get authentication problems inside the guest too (app-level TLS, Kerberos,
database replication).
5) Power management, BIOS settings, and unstable clocksources
Modern CPUs do wizardry with frequency scaling and power states. Usually it’s fine. Sometimes it interacts poorly with timekeeping on specific
platforms. If you see drift correlate with deep C-states, BIOS microcode updates, or odd kernel messages, treat it as a hardware/firmware issue,
not an NTP config issue.
6) “Security hardening” that blocks time sync
Some environments clamp down capabilities, container permissions, or outbound traffic so tightly that time sync cannot function.
Then operators compensate by manually setting time. That’s not hardening; it’s self-harm with extra steps.
Fix patterns that actually stick
Pick one time client and configure it cleanly
On Proxmox hosts and PBS, I prefer chrony. It’s robust under jitter, better at handling large offsets, and tends to behave well in VMs too.
The most important “configuration” is governance: one service, managed consistently, same upstream sources across the fleet.
Use internal NTP sources (and make them boring)
A pair of internal NTP servers, each synced to reliable upstream (GPS, PTP grandmaster, or vetted internet sources), is the boring, correct
enterprise pattern. If you can’t run internal time servers, at least choose multiple upstreams and ensure firewall policy is stable.
Control stepping vs slewing policy
Big offsets force a decision: step now, or slew slowly. For hypervisors, stepping is acceptable during maintenance windows or right after boot,
but frequent stepping during steady-state indicates a deeper problem.
chrony supports controlled stepping (example: allow stepping during the first few updates). The goal is: boot fast, then be smooth.
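That policy maps to a couple of chrony directives. An illustrative /etc/chrony/chrony.conf fragment, with placeholder server names standing in for your internal sources:

```
# Illustrative fragment; server names are placeholders for your fleet.
server ntp1.internal.example iburst
server ntp2.internal.example iburst
# Step only if the offset exceeds 1 second, and only during the first
# 3 clock updates after start: boot fast, then be smooth.
makestep 1.0 3
# Keep the RTC disciplined so reboots start from a sane baseline.
rtcsync
```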
Make RTC sane, then keep it sane
Set RTC to UTC and keep it that way across rebuilds. If your RTC is unreliable (some server boards are fine, some embedded stuff isn’t),
consider disciplining it more frequently or replacing the platform for critical nodes. Hypervisors are not the place for “good enough” oscillators.
Handle guests explicitly
For Linux guests: run chrony or systemd-timesyncd in the guest; don’t rely solely on host-provided hints. For Windows guests: make sure the time service
is configured correctly and avoid stacking “helpful” tools that both attempt to set time.
If you use QEMU guest agent, use it for management visibility, not as a substitute for proper guest time sync.
Stabilize the network path to NTP
NTP is sensitive to latency spikes and asymmetric routing. It can tolerate a lot, but if your network policy treats UDP/123 as optional,
it will behave like a flaky dependency—because you made it one.
Three corporate mini-stories from the time-skew trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran a three-node Proxmox cluster with PBS in a separate rack. After a routine security audit, someone tightened egress rules.
The assumption: “NTP is internal, so blocking outbound doesn’t matter.” Reasonable-sounding. Wrong in two ways.
The “internal NTP” servers were actually VMs in the same Proxmox cluster. They were configured to sync from external sources. With outbound UDP/123
blocked, the NTP VMs slowly drifted. Chrony on the Proxmox nodes happily locked to those internal sources—because they were reachable and stable,
just stable in the wrong direction.
Two weeks later, PBS backups started failing with TLS “not yet valid” errors on some nodes and “expired” on others. The drift was not identical across nodes.
Operators rotated certificates because, of course, it looked like certificates. That didn’t help. It couldn’t.
The fix was dull: restore outbound NTP for the internal time VMs or, better, run time sources on dedicated small appliances with explicit network policy.
Then force a controlled correction (makestep) during a maintenance window. No heroics, just removing a broken assumption.
Mini-story 2: The optimization that backfired
Another shop wanted “faster migration” and tuned power management aggressively. CPUs were allowed deep sleep states, and some BIOS defaults were changed
to reduce power draw. The cluster got quieter and cooler. Everyone congratulated themselves.
Then came the weirdness: after live migrations, a subset of VMs would report authentication failures to internal services for a few minutes.
Not always. Not on all hosts. It looked like DNS. It looked like application bugs. It looked like “the network being the network.”
The actual culprit was timekeeping instability under certain power-state transitions on that hardware generation. Guests would drift during migration/pause,
then snap back. Some services (notably anything with short-lived tokens) would reject requests because clients were “in the future” or “in the past.”
The resolution wasn’t to disable power management globally forever, but to apply sane BIOS/firmware updates, validate stable TSC behavior,
and stop optimizing without measurement. They reintroduced a stricter time monitoring alert: offset thresholds per node and per key guest.
Mini-story 3: The boring but correct practice that saved the day
A larger enterprise had a habit that looked almost too conservative: every hypervisor and PBS node pointed at the same pair of internal NTP servers,
and those time servers were treated like core infrastructure. They had monitoring for reachability, offset, and stratum changes. Nothing fancy.
During a provider outage, external connectivity got messy. Some sites lost access to upstream time references. The internal time servers entered a degraded
state, but they stayed consistent across the site, and alerts fired early: “upstream lost; serving last known good with increasing dispersion.”
The key point: Proxmox nodes stayed consistent with each other. TLS continued to work because clocks didn’t jump. Backups continued because PBS and PVE agreed
on what “now” meant. Operators had time (the irony is unavoidable) to respond before expiry windows and token lifetimes became a problem.
Their practice was boring: two internal sources, consistent config management, and alarms on offset and reach. It wasn’t glamorous. It was survivable.
Common mistakes: symptom → root cause → fix
1) “Certificate is not yet valid” on PBS repository
Symptom: PBS client fails with “certificate is not yet valid.”
Root cause: Client clock behind PBS (or both wrong in different directions). Sometimes RTC wrong after reboot.
Fix: Verify timedatectl on both, correct NTP, then run chronyc -a makestep during a safe window.
2) “Certificate has expired” but you just renewed it
Symptom: TLS says expired immediately after renewal.
Root cause: System clock ahead; renewal didn’t fix the timeline.
Fix: Fix time first. Then re-check validity windows. Only reissue certs if they’re genuinely wrong after clocks agree.
3) Chrony shows sources but never selects one
Symptom: chronyc sources shows reachable servers but no * selected.
Root cause: Server quality checks failing (bad stratum, falseticker, excessive jitter), or the local clock is so wrong it needs a step and stepping is disabled.
Fix: Inspect chronyc tracking, enable controlled stepping policy, and fix the time servers (or choose better ones).
4) Time sync works until reboot, then it’s wrong again
Symptom: After reboot, time is hours off before NTP corrects it.
Root cause: RTC wrong, RTC in localtime, dead CMOS battery, or firmware clock not persisting correctly.
Fix: Set RTC to UTC, correct hardware clock, replace battery if needed, validate BIOS time settings.
5) Cluster events look out of order across nodes
Symptom: Log correlation is impossible; events appear “before” causes.
Root cause: Node clocks disagree by seconds/minutes; not necessarily breaking the cluster but breaking your brain.
Fix: Enforce NTP consistency across all nodes; alert on offset deltas between nodes.
6) Guests drift heavily during backups or snapshots
Symptom: Guest time jumps after backup window; app auth fails inside VM.
Root cause: Guest paused, weak guest timekeeping, no NTP in guest, or host contention.
Fix: Run proper time sync inside the guest; reduce pause times (storage performance), and confirm guest agent is installed for observability.
7) “We hardened outbound traffic” and now time is flaky
Symptom: Time sync intermittently fails across the fleet.
Root cause: UDP/123 blocked or stateful firewall handling is inconsistent; DNS for NTP servers fails.
Fix: Allow NTP to your approved servers; prefer IPs or stable internal DNS; monitor reach and offset.
8) Switching from chrony to timesyncd “to simplify”
Symptom: Works in the lab, drifts in production with jitter and packet loss.
Root cause: timesyncd is fine for many cases, but chrony is generally more resilient for server-grade networks and offset management.
Fix: Use chrony on hypervisors and PBS unless you have a measured reason not to. Keep it consistent across nodes.
Checklists / step-by-step plan (the boring path to stable time)
Step 0: Decide what “correct” means in your environment
- Use UTC everywhere on servers (timezone UTC, RTC in UTC).
- Define acceptable offset: typically < 100 ms between nodes for general infrastructure; tighter if you run latency-sensitive apps.
- Pick a time strategy: internal NTP servers, or direct to trusted upstream if you must.
Step 1: Standardize the NTP client on Proxmox nodes
- Pick chrony or systemd-timesyncd. Don’t run both.
- Ensure only one service is enabled and active.
- Configure the same upstream servers on every node.
Step 2: Validate time source reachability and quality
- Confirm UDP/123 to your time sources is allowed from every node and PBS.
- Confirm DNS resolution is stable if you use names.
- Confirm you have at least 2 sources (preferably 3) to avoid single-server lies.
Step 3: Fix the baseline (RTC and current offset)
- Confirm RTC is in UTC, not localtime.
- If offset is large, schedule a controlled step correction.
- After correction, watch for repeated stepping; it’s a symptom, not a feature.
Step 4: Harden PBS and PVE for operational sanity
- Put PBS on the same time policy as the cluster.
- Monitor time offset and NTP reach for both PBS and PVE.
- When backups fail with TLS, check time before rotating secrets/certs.
Step 5: Treat guests as first-class time consumers
- Run NTP inside guests (chrony/timesyncd).
- Install QEMU guest agent for visibility.
- Watch for drift after migrations and snapshot-heavy windows.
Step 6: Operationalize it (alerts and runbooks)
- Alert on offset (absolute), rate of change (drift), and reach (connectivity).
- Alert when the selected NTP source changes frequently.
- Write a runbook that starts with timedatectl and ends with controlled remediation.
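The offset alert reduces to parsing one line of chronyc tracking output against a budget. A sketch with a hardcoded sample line and the 100 ms threshold suggested in Step 0:

```shell
#!/bin/sh
# Alert-logic sketch: parse the "System time" line of `chronyc tracking`
# and flag offsets over a threshold. The sample line is hardcoded;
# direction (fast/slow) is ignored for brevity.
tracking='System time     : 0.250123456 seconds fast of NTP time'
# Field 4 is the offset in seconds; convert to whole milliseconds.
offset_ms=$(printf '%s\n' "$tracking" | awk '{printf "%d", $4 * 1000}')
threshold_ms=100
if [ "$offset_ms" -gt "$threshold_ms" ]; then
    echo "ALERT: offset ${offset_ms}ms > ${threshold_ms}ms"
else
    echo "OK: offset ${offset_ms}ms"
fi
```

Wire the same comparison into whatever monitoring stack you run; the point is alerting on the number, not eyeballing it.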
FAQ
1) Why does time drift break TLS so reliably?
TLS certificate validation depends on the client’s view of time. If the client thinks it’s before notBefore,
the cert is “not yet valid.” If it thinks it’s after notAfter, it’s “expired.” No amount of “but it works on my laptop”
changes that—your laptop has working time sync.
2) Should I reissue PBS or Proxmox certificates when I see TLS errors?
Usually no. First prove the clocks are correct on both ends. Reissuing certs without fixing time just gives you fresh certs that are also “invalid”
in your broken timeline.
3) chrony or systemd-timesyncd on Proxmox?
chrony for most Proxmox and PBS nodes. It’s robust in the face of jitter and intermittent connectivity and gives better diagnostics.
timesyncd can be fine for simpler setups, but consistency and observability matter more than minimalism here.
4) How much time skew is “too much” for Proxmox and PBS?
For TLS, even seconds can hurt depending on certificate validity windows and client behavior. Operationally, keep nodes within tens of milliseconds
if you can. If you’re seeing > 1 second between nodes, treat it as an incident, not a curiosity.
5) Why do VMs drift even if the host is synchronized?
Guests have their own kernels and their own timekeeping. Pauses (snapshots, migrations, host contention) can cause drift. If the guest lacks a working
time client or has conflicting time services, it will misbehave regardless of host stability.
6) Can live migration cause TLS errors inside the VM?
Yes. If migration or snapshotting pauses the VM and the guest clock jumps or slews aggressively afterward, short-lived tokens and certificate checks
inside the guest can fail. The fix is proper guest time sync and minimizing pause time, not blaming PBS.
7) Is using public NTP servers acceptable for Proxmox?
It’s acceptable when you have no alternative, but it’s not ideal. Enterprises prefer internal time servers for consistency, performance, and policy control.
If you must use public sources, use multiple, ensure stable DNS, and don’t let firewall rules change casually.
8) What’s the safest way to correct a large clock skew on a production node?
Prefer doing it in a maintenance window. Use chrony to perform a controlled step (chronyc -a makestep) rather than manual date -s.
Then watch logs for repeated stepping, which indicates the root cause remains.
9) Do I need PTP instead of NTP?
For most Proxmox and PBS deployments, NTP is enough. Consider PTP when you need sub-millisecond sync and you can support it end-to-end (hardware timestamping,
proper network design). PTP done halfway is just an expensive way to be wrong faster.
10) How do I prevent this from coming back?
Standardize the NTP client, use stable internal sources, monitor offset/reach, and treat time as a tier-0 dependency. The technical part is easy.
The discipline is the hard part.
Next steps you can do today
- Run the fast diagnosis checks on every Proxmox node and your PBS box: timedatectl, chronyc tracking, chronyc sources -v.
- Eliminate dueling time services: make sure only one of chrony/timesyncd/ntpd is active.
- Standardize upstreams: point all nodes and PBS to the same two or three trusted time servers.
- Fix RTC: ensure UTC hardware clock and validate it survives reboot.
- Operationalize: alert on offset and reach before TLS and backups start screaming.
Time drift is one of those problems that feels beneath you until it ruins your day. Make it boring. Your future self will quietly approve.