Everything is fine... until it isn't. The graphs look like a calm sea and then, every few minutes, your CPU shoots up like it's trying to win a sprint. Latency bumps. Fans get loud. Your on-call phone starts doing that thing where it vibrates itself off the nightstand.
If the spikes are periodic, assume a scheduled task until proven otherwise. And the scheduled task you should check first is the one you didn't schedule: the system's own timers and vendor agents, not your application. Specifically: systemd timers and cron (plus their "helpful" friends like logrotate, updatedb, apt timers, monitoring agents, and backup clients).
Fast diagnosis playbook (first 10 minutes)
Periodic spikes are a gift. Random spikes are a horror novel. Periodic spikes are a calendar invite.
Minute 0-2: confirm the pattern and capture the offender
- Confirm periodicity. Look at your monitoring: do spikes occur every 1, 5, 10, 15, 60 minutes, or at :00 or :30? Those are classic timer cadences.
- During the next spike, capture top offenders. Use top or pidstat. Don't "look later"; the process name is half the battle.
- Check run queue and iowait. High %usr/%sys suggests CPU work; high %wa suggests an I/O bottleneck with CPU side effects (compression, checksums, fsck-like work, encryption, etc.).
Minute 2-6: identify the scheduler
- List systemd timers. This catches the "I swear there is no cron job" cases.
- List cron sources. User crontabs, /etc/crontab, and /etc/cron.* directories.
- Check vendor agents and monitoring. Many ship as systemd services and schedule their own runs internally.
Minute 6-10: correlate with logs and take a safe action
- Correlate by time. Journal logs around the spike timestamp will usually show the unit that started.
- Safest immediate mitigation. If it's non-critical (updatedb, logrotate, a reporting script), delay it or add jitter. If it's critical (backups, security scans), reduce concurrency or scope rather than disabling.
Paraphrased idea attributed to Gene Kim: Reliability comes from making work visible and repeatable, not from heroics at 3 a.m.
Why periodic spikes scream "scheduled task"
CPU doesn't spike "every five minutes" by accident. Humans schedule things in round numbers. So do operating systems. So do "enterprise" agents that were built by someone who never had to share a hypervisor with your database.
Periodic CPU spikes usually come from one of these categories:
- Maintenance tasks: logrotate, updatedb (mlocate), tmp cleanup, package update checks, certificate renewal hooks.
- Backup and indexing: filesystem scans, snapshotting scripts, dedupe indexing, object-store sync, antivirus scans.
- Monitoring and telemetry: metric scrapes, inventory collectors, security posture checks.
- Storage CPU work: compression, checksums, encryption, RAID parity, scrub-like operations, snapshot diff calculations.
- Application "helpfulness": cache warming, periodic report generation, compaction (databases), reindexing jobs.
Here's the trap: you can stare at CPU graphs all day and still miss the cause if you don't line up timestamps. Every periodic task leaves a footprint: a process start, a log line, a unit activation, a file touched, a network burst.
Joke #1: Scheduled tasks are like office birthdays: somehow they happen every year, and somehow they still surprise everyone.
The scheduled task to check first (and why)
The first thing I check is systemd timers, then cron. Not because cron is rare (cron is everywhere) but because teams often check "cron" and stop. Meanwhile, systemd timers are quietly triggering:
- apt-daily.timer and apt-daily-upgrade.timer (Debian/Ubuntu)
- fstrim.timer (SSD trim; can hit storage subsystems)
- logrotate.timer (or logrotate from cron, depending on distro)
- man-db.timer (yes, really)
- updatedb.timer (filesystem scan)
- Vendor timers for backup agents, EDR, compliance scanners
Why first? Because systemd timers can run with persistent catch-up behavior. If a host was down at the scheduled time, a timer with Persistent=true may fire immediately at boot. That's how you get "CPU spikes every few minutes" after an outage: a herd of missed timers making up for lost time.
Also, timers can be configured with randomized delays. That's good. But when they aren't, fleets do synchronized work. If you've ever watched 500 nodes run the same job at minute 0, you already know how this story ends.
Hands-on: 12+ tasks with commands, outputs, and decisions
Below are practical tasks I'd run on a Linux box in production. For each: command, what the output means, and what decision you make next. Run them during a spike if you can. If you can't, you can still gather enough evidence between spikes.
Task 1: Watch CPU per process over time (catch the spike)
cr0x@server:~$ pidstat -u -h 1
Linux 6.5.0 (server) 02/05/2026 _x86_64_ (16 CPU)
# Time UID PID %usr %system %CPU Command
12:00:01 0 1423 0.00 0.00 0.00 systemd
12:00:02 0 22891 92.00 6.00 98.00 updatedb
12:00:03 0 22891 88.00 5.00 93.00 updatedb
Meaning: updatedb is saturating CPU during the spike. That's not your app. It's a scheduled filesystem index update.
Decision: Identify who triggers updatedb (timer or cron), then reschedule, throttle, or exclude paths.
Task 2: Confirm overall CPU mode (user/system/iowait)
cr0x@server:~$ mpstat -P ALL 1 5
Linux 6.5.0 (server) 02/05/2026 _x86_64_ (16 CPU)
12:00:01 PM CPU %usr %nice %sys %iowait %irq %soft %steal %idle
12:00:01 PM all 62.50 0.00 12.30 0.20 0.00 0.50 0.00 24.50
12:00:02 PM all 89.10 0.00 8.40 0.10 0.00 0.40 0.00 2.00
Meaning: High %usr suggests pure compute (hashing/compressing/scanning) rather than waiting on disk.
Decision: Focus on the process and its scheduling. If %iowait were high, you'd pivot to storage latency and queue depth.
Task 3: See load average vs run queue (CPU contention vs "just busy")
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 81234 10240 913024 0 0 2 5 120 300 12 3 85 0 0
12 0 0 80120 10240 913500 0 0 0 0 3400 9800 88 9 3 0 0
Meaning: r=12 on a 16-core box means a lot of runnable threads; not blocked, just competing for CPU.
Decision: If this aligns with a timer firing, you want to reduce concurrency (nice/ionice, task settings, or the job itself).
Task 4: List systemd timers (the usual suspects)
cr0x@server:~$ systemctl list-timers --all
NEXT LEFT LAST PASSED UNIT ACTIVATES
Thu 2026-02-05 12:05:00 UTC 2min 12s Thu 2026-02-05 12:00:01 UTC 2min 47s updatedb.timer updatedb.service
Thu 2026-02-05 12:10:00 UTC 7min 12s Thu 2026-02-05 12:00:08 UTC 2min 40s apt-daily.timer apt-daily.service
Thu 2026-02-05 12:15:00 UTC 12min 12s Thu 2026-02-05 12:00:10 UTC 2min 38s logrotate.timer logrotate.service
Meaning: You've got multiple timers firing at minute 0 and repeating. The LAST timestamp is gold: match it to your CPU spike time.
Decision: Inspect the timer configuration and the service it activates. Apply jitter, change schedule, or disable safely if appropriate.
Task 5: Inspect a timer and service definition (what exactly runs)
cr0x@server:~$ systemctl cat updatedb.timer
# /lib/systemd/system/updatedb.timer
[Unit]
Description=Update a database for mlocate
[Timer]
OnCalendar=*:0/5
Persistent=true
RandomizedDelaySec=0
[Install]
WantedBy=timers.target
Meaning: Every 5 minutes, persistent catch-up, and zero jitter. That is practically an invitation to synchronized pain.
Decision: Add RandomizedDelaySec, reduce frequency, or disable if you don't need locate on a server.
Task 6: Check cron (system-wide and per-user)
cr0x@server:~$ sudo ls -l /etc/cron.hourly /etc/cron.daily /etc/cron.d
/etc/cron.d:
total 12
-rw-r--r-- 1 root root 240 Jan 10 2026 backup-agent
-rw-r--r-- 1 root root 180 Dec 2 2025 mlocate
-rw-r--r-- 1 root root 210 Nov 18 2025 sysstat
Meaning: Cron is still in play. /etc/cron.d/mlocate is likely driving updatedb on distros that don't use the timer, or in addition to it.
Decision: Ensure you don't have duplicate scheduling (timer + cron). Pick one, disable the other.
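One rough way to hunt for that kind of duplicate is to search the cron sources for the binary a timer's service already runs. The sketch below is only an illustration: the function name cron_files_mentioning is made up for this example, and it assumes the cron entries name the binary literally rather than hiding it behind a wrapper script.

```shell
#!/bin/sh
# cron_files_mentioning BINARY PATH...
# Hypothetical helper: print the cron files under the given paths that
# mention BINARY as a whole word (so "updatedb" matches "updatedb.mlocate").
cron_files_mentioning() {
    binary=$1
    shift
    # -r: recurse into directories, -l: print file names only,
    # -w: whole-word match, -F: treat BINARY as a fixed string, not a regex.
    grep -rlwF -- "$binary" "$@" 2>/dev/null | sort
}

# Example (not run automatically): which cron files also launch updatedb?
# cron_files_mentioning updatedb /etc/crontab /etc/cron.d /etc/cron.hourly /etc/cron.daily
```

If this prints a file and systemctl list-timers also shows a timer for the same job, you have two schedulers doing the same work.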
Task 7: View crontab entries for a specific user (where surprises hide)
cr0x@server:~$ sudo crontab -l -u root
*/5 * * * * /usr/local/sbin/inventory-scan --json --upload
15 * * * * /usr/local/sbin/storage-report
Meaning: The five-minute cadence is right there. inventory-scan smells like "agent work," often CPU-heavy (filesystem walk, package query, hashing).
Decision: Measure runtime and CPU, then either reduce frequency, scope it (exclude directories), or move it off critical hosts.
Task 8: Correlate with journal logs at the spike timestamp
cr0x@server:~$ sudo journalctl --since "2026-02-05 11:58:00" --until "2026-02-05 12:02:00" --no-pager
Feb 05 12:00:01 server systemd[1]: Started Update a database for mlocate.
Feb 05 12:00:01 server updatedb[22891]: updatedb: pruning "/var/lib/docker/overlay2"
Feb 05 12:00:10 server systemd[1]: Started Rotate log files.
Feb 05 12:00:10 server logrotate[22940]: rotating pattern: /var/log/*.log forced from command line (1 rotations)
Meaning: Now you have a timeline. It's not "mystery load"; it's two maintenance jobs stacking.
Decision: Stagger them. Maintenance jobs colliding is the classic "why do spikes look worse at midnight?" phenomenon.
Task 9: Identify which cgroup/unit a hot PID belongs to
cr0x@server:~$ cat /proc/22891/cgroup
0::/system.slice/updatedb.service
Meaning: The process is owned by updatedb.service. That makes it easy to tune with systemd knobs.
Decision: Apply CPUQuota, Nice, or IOSchedulingClass in a drop-in if you can't change the job itself.
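If you do this often, pulling the unit name out of the cgroup line is easy to script. This is a minimal sketch: unit_from_cgroup is a hypothetical helper name, and it assumes the unit is the last path component and ends in .service or .scope, as in the output above.

```shell
#!/bin/sh
# unit_from_cgroup
# Hypothetical helper: read /proc/PID/cgroup content on stdin and print the
# last path component of the first line ending in .service or .scope.
unit_from_cgroup() {
    awk -F/ '/\.(service|scope)$/ { print $NF; exit }'
}

# Example (not run automatically):
# unit_from_cgroup < /proc/22891/cgroup
```

Feeding the result to systemctl status is usually the next step, since that shows the unit file path and what triggered it.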
Task 10: See what command line the process is really running
cr0x@server:~$ ps -p 22891 -o pid,ppid,ni,etimes,cmd
PID PPID NI ELAPSED CMD
22891 1 0 18 /usr/bin/updatedb.mlocate --prunepaths=/tmp /var/lib/docker /var/lib/kubelet
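Meaning: The command line tells you what the scan skips: --prunepaths lists excluded paths, here /tmp, /var/lib/docker, and /var/lib/kubelet. Any churn-heavy tree missing from that list (container layers stored elsewhere, build caches) gets walked in full. Trees like that hold millions of inodes. That's CPU plus metadata I/O.
Decision: Extend the exclusions to cover those paths, or stop running updatedb on container hosts. "Locate" is not a production SLO.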
Task 11: Check historical CPU spikes with sar (prove itâs periodic)
cr0x@server:~$ sar -u -s 11:30:00 -e 12:10:00
Linux 6.5.0 (server) 02/05/2026 _x86_64_ (16 CPU)
11:35:00 AM %user %system %iowait %idle
11:35:00 AM 6.20 1.10 0.10 92.60
11:40:00 AM 68.30 8.90 0.20 22.60
11:45:00 AM 7.10 1.20 0.10 91.60
11:50:00 AM 70.40 9.10 0.10 20.40
11:55:00 AM 6.90 1.00 0.10 92.00
12:00:00 PM 71.20 9.00 0.10 19.70
Meaning: Spikes every 10 minutes (or every 5 minutes depending on sampling). That lines up with timers and crons.
Decision: Stop debating "is it the app?" and start matching schedules to those timestamps.
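To put a number on the cadence instead of eyeballing it, you can compute the gaps between busy samples. The sketch below is an assumption-laden illustration: spike_intervals is a made-up name, and it expects input already reduced to "HH:MM:SS busy_percent" lines on a 24-hour clock within one day (for example 100 minus %idle from the sar output above).

```shell
#!/bin/sh
# spike_intervals THRESHOLD
# Hypothetical helper.
# stdin:  lines of "HH:MM:SS busy_percent" (24-hour clock, same day).
# stdout: seconds between consecutive samples at or above THRESHOLD.
spike_intervals() {
    awk -v t="$1" '
        $2 + 0 >= t + 0 {
            split($1, hms, ":")
            now = hms[1] * 3600 + hms[2] * 60 + hms[3]
            if (seen) print now - prev
            prev = now
            seen = 1
        }'
}

# Example (not run automatically), with samples saved in a file:
# spike_intervals 50 < busy.txt
```

A column of identical numbers (600, 600, 600) is a schedule. A scatter of unrelated numbers is something else, and you should go back to profiling.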
Task 12: Use perf to verify where CPU time is going (when it's not obvious)
cr0x@server:~$ sudo perf top -p 22891
Samples: 2K of event 'cpu-clock', Event count (approx.): 2000000000
35.12% updatedb libc.so.6 [.] __memmove_avx_unaligned_erms
22.40% updatedb libz.so.1 [.] deflate_slow
10.08% updatedb updatedb [.] hash_path
Meaning: Compression and hashing dominate. That's CPU-heavy by design, not a kernel bug.
Decision: Reduce data scanned, reduce frequency, add cgroup limits, or move the work off the box.
Task 13: Find which unit started recently (quick journal grep)
cr0x@server:~$ sudo journalctl -S "12:00:00" -U "12:01:00" -g "Started " --no-pager
Feb 05 12:00:01 server systemd[1]: Started Update a database for mlocate.
Feb 05 12:00:10 server systemd[1]: Started Rotate log files.
Meaning: This is the fastest "what just started?" view when you can bracket the spike time.
Decision: Investigate the started unit(s) before diving into application profiling.
Task 14: Throttle a systemd service safely (drop-in override)
cr0x@server:~$ sudo systemctl edit updatedb.service
# (creates a drop-in override)
cr0x@server:~$ sudo cat /etc/systemd/system/updatedb.service.d/override.conf
[Service]
Nice=10
CPUQuota=20%
IOSchedulingClass=best-effort
IOSchedulingPriority=7
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart updatedb.service
Meaning: You didn't "fix" updatedb; you made it stop bullying the rest of the host.
Decision: Use throttling as a mitigation, then do the real fix: reschedule, jitter, and reduce scan scope.
Task 15: Add jitter to a timer (stop the fleet stampede)
cr0x@server:~$ sudo systemctl edit updatedb.timer
cr0x@server:~$ sudo cat /etc/systemd/system/updatedb.timer.d/override.conf
[Timer]
RandomizedDelaySec=180
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart updatedb.timer
Meaning: The job still runs, but not simultaneously across the fleet and not exactly on the minute.
Decision: If many nodes spike together, jitter is often the single highest ROI change you can make.
Cron vs systemd timers vs "agents": how they really behave
You don't need religious wars about schedulers. You need a correct mental model.
Cron: simple, ubiquitous, and easy to accidentally duplicate
Cron runs commands on a schedule. That's it. Its simplicity is why it's still everywhere, and why it gets used as a dumping ground for "temporary" scripts that live for five years.
Where cron hides work:
- System crontab: /etc/crontab
- Drop-in directories: /etc/cron.d/, /etc/cron.hourly/, /etc/cron.daily/, etc.
- User crontabs: in /var/spool/cron or /var/spool/cron/crontabs
Cron's classic failure mode in fleets is synchronization: every machine runs the same thing at the same time, because everyone copy-pasted the same crontab entry.
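Cron has no built-in jitter, so one common workaround is to have the job sleep for a host-specific offset before doing real work. Here is a minimal sketch of that idea; host_jitter is a hypothetical helper, and deriving the offset from a cksum of the hostname is just one assumption about how you might make it stable per host.

```shell
#!/bin/sh
# host_jitter NAME MAX
# Hypothetical helper: print a stable offset in [0, MAX) seconds derived
# from NAME (for example a hostname), so each host always gets the same delay.
host_jitter() {
    # cksum prints "CRC SIZE" for stdin; keep only the CRC.
    sum=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $(( sum % $2 ))
}

# Example (not run automatically), inside a wrapper script called from cron:
# sleep "$(host_jitter "$(hostname)" 120)"
# exec /usr/local/sbin/inventory-scan --json --upload
```

A stable offset keeps each host's run time predictable for debugging while still spreading the fleet across the window. A random sleep spreads the load too, but then you lose the ability to say when a given host ran.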
systemd timers: more visible, more powerful, and capable of surprise
Timers are first-class systemd objects. They have logs. They have status. They can catch up after downtime. They can add jitter. They can have accuracy windows that delay firing slightly to coalesce wakeups.
Timers' classic failure mode is persistence + boot storms. After maintenance, reboot, or autoscaling events, a bunch of missed timers fire. If each timer is "small," the combined load is not.
Agents: "we run every 5 minutes" is not a contract, it's a threat
Monitoring agents, inventory collectors, endpoint security, backup agents, and "compliance posture" tools often implement their own schedules internally. You may not see a timer or cron at all; you'll see a long-running daemon that wakes up and does a bunch of work.
The diagnostic trick is to look for:
- Regular bursts of CPU in a single daemon process.
- Regular bursts of child processes forked by that daemon.
- Log entries repeating every N minutes.
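For the first item, you can sample a daemon's cumulative CPU time straight from /proc and watch for regular jumps. This is a sketch, not a monitoring tool: cpu_ticks is a hypothetical helper, and it relies on the documented /proc/PID/stat layout where utime and stime are fields 14 and 15.

```shell
#!/bin/sh
# cpu_ticks
# Hypothetical helper: read one /proc/PID/stat line on stdin and print
# utime + stime, the clock ticks spent in user and kernel mode so far.
cpu_ticks() {
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so drop everything through the last ") " before splitting.
    # After that, original field N is awk field N-2.
    sed 's/^.*) //' | awk '{ print $12 + $13 }'
}

# Example (not run automatically): sample an agent once a minute.
# while sleep 60; do
#     printf '%s %s\n' "$(date +%T)" "$(cpu_ticks < /proc/1234/stat)"
# done
```

Divide the difference between two samples by the value of getconf CLK_TCK to get CPU seconds burned in that interval. A daemon that is flat for four samples and jumps on the fifth has an internal five-minute schedule, whatever its documentation says.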
The storage angle: CPU spikes from I/O work in disguise
As a storage engineer, I'll say the quiet part out loud: a lot of "CPU spikes" are storage chores wearing a CPU costume.
Metadata walks are CPU work
Jobs like updatedb, antivirus scans, backup file enumeration, and integrity scanners do massive directory traversals. Even if your disks are fast, turning millions of dentries and inodes into a list costs CPU. On networked filesystems it's worse: every metadata call becomes a network round trip, plus client-side parsing and caching.
Compression and checksumming are CPU work
If you enabled compression "to save space," congratulations: you also signed up to pay CPU on every write (and sometimes on reads). Many backup tools compress by default. Log shipping pipelines compress. Even logrotate can compress archives. Those jobs often align to time boundaries and create bursts.
TRIM and scrubs are not free
fstrim can cause noticeable device activity. A RAID scrub or filesystem scrub can cause CPU usage in kernel threads, especially if checksums are involved. You might see the CPU spike, but the root cause is scheduled storage maintenance colliding with your peak traffic.
Encryption is predictable, and spiky when scheduled
At-rest encryption and encrypted backups cost CPU cycles. That's fine when it's steady. It's painful when it's bursty. A nightly backup that encrypts and compresses at the same time can look like a "mystery CPU spike" to anyone not staring at the backup schedule.
Joke #2: Nothing says "enterprise-ready" like a backup agent that benchmarks your CPU at noon without asking.
Three corporate mini-stories (true-to-life)
Mini-story 1: The incident caused by a wrong assumption ("We don't use cron")
A mid-size company ran a set of API servers behind a load balancer. Every five minutes, error rates ticked up. Not a full outage, just enough to make the SLO burn look like a slow leak. The on-call's first assumption was the usual: garbage collection or a noisy neighbor on the hypervisor.
They checked application logs. Nothing obvious. They checked deploys. Nothing. They checked "cron." No relevant entries in crontab -l. They declared, with confidence, "We don't use cron on these boxes." The investigation drifted into profiling endpoints and tuning thread pools.
The breakthrough happened when someone compared CPU spike timestamps with systemctl list-timers. There it was: updatedb.timer running every five minutes, persistent, no jitter. It came from a baseline OS image that had been "hardened" in every way except the part where it did a filesystem scan repeatedly.
Why did it hurt the API? The servers also hosted container runtimes. The updatedb scan walked huge overlay filesystem trees. It didn't just use CPU; it churned page cache and metadata caches. The app became slightly slower, then recovered, over and over, like a tiny denial-of-service attack from inside the house.
The fix was boring: disable updatedb on those hosts and add a policy that baseline images must document enabled timers. The postmortem's key lesson wasn't about locate. It was about assumptions. "No cron" didn't mean "no scheduled tasks." It meant "we didn't look in the right scheduler."
Mini-story 2: The optimization that backfired (compression everywhere)
An internal platform team wanted to cut storage costs. Logs were big, backups were bigger, and the CFO had discovered the word "efficiency." The team enabled aggressive compression in the backup pipeline, set the job to run every 15 minutes for better RPO, and celebrated the space savings.
Then came the CPU spikes: sharp, rhythmic, and ugly. They started seeing increased request latency and occasional timeouts across services that shared the same nodes as the backup client. The backup team insisted nothing had changed "for the app," which was technically true and operationally useless.
What actually happened: compressing small batches every 15 minutes caused constant bursts of CPU rather than one predictable nightly burn. The bursts aligned with other platform timers (logrotate, metrics compaction, package update checks) and created a recurring "busy minute." The system wasn't overloaded on average; it was overloaded in a repeating pattern.
The attempted optimization (more frequent incremental backups with maximum compression) backfired because it ignored contention and co-tenancy. They fixed it by lowering compression level, adding CPU quotas to the backup service, and adding randomized delays. Space usage increased slightly. Stability improved a lot. The CFO got a graph that went down. The on-call got a night of sleep.
Mini-story 3: The boring practice that saved the day (jitter and budgets)
A finance company ran a large Linux fleet with strict latency requirements. They had learned, the hard way, that "midnight maintenance" is just a way to create synchronized disasters at midnight. So they treated background work like production traffic: it needed budgets, observability, and spread.
Every new scheduled task had to declare: frequency, expected runtime, CPU profile (rough), and whether it could be delayed. Tasks were added as systemd timers with jitter by default. They used RandomizedDelaySec on anything non-urgent, and they avoided scheduling everything on the hour.
One day, they rolled out a compliance agent update that was heavier than expected. It did a filesystem inventory and cryptographic verification. CPU usage increased, but it didn't spike across the fleet. Why? The timer had a 10-minute schedule with a 6-minute random delay, and the service had a CPU quota. The work spread out. The load balancers never saw a synchronized dip.
It wasn't glamorous engineering. It was the operational equivalent of putting your tools back after using them. But it turned a potentially nasty incident into a non-event, which is the nicest thing production can do.
Common mistakes: symptom -> root cause -> fix
This is the part where most teams waste hours. Don't.
1) Spikes every 5 minutes, "no cron jobs," nothing in app logs
- Symptom: CPU jumps to 80-100% every 5 minutes. App latency bumps. Nothing obvious in app logs.
- Root cause: systemd timer (often updatedb.timer, agent timers, package update timers) or a daemon that wakes periodically.
- Fix: systemctl list-timers --all, correlate with journalctl, add jitter or disable non-essential timers; cap CPU with a drop-in override.
2) Spikes at :00 across many hosts (fleet-wide) and graphs look synchronized
- Symptom: Whole cluster sees CPU spikes at top of the hour; downstream services report p95 latency bumps.
- Root cause: synchronized schedules (cron or timers without RandomizedDelaySec), often due to identical image configs.
- Fix: add jitter, stagger schedules, use systemd timer accuracy and randomized delay; avoid "minute 0" conventions.
3) CPU spikes coincide with disk busy time, but CPU looks like the culprit
- Symptom: CPU spikes, but %iowait rises too; storage latency alarms fire; app threads block.
- Root cause: scheduled I/O-heavy maintenance (logrotate compress, backup, scrub/trim) triggering CPU work (compression, checksums) and saturating storage queues.
- Fix: reschedule I/O chores off peak; reduce parallelism; use ionice/IOSchedulingClass; verify storage health and queue depth.
4) Spikes started after enabling "security scanning" or "inventory" tooling
- Symptom: periodic CPU bursts, lots of short-lived processes, frequent filesystem stats.
- Root cause: EDR/compliance agent walking the filesystem, hashing binaries, scanning containers.
- Fix: tune exclusions (container dirs, build caches), reduce scan frequency, push vendor for a better mode; isolate on dedicated nodes if needed.
5) Spikes only after reboot or after downtime
- Symptom: boot seems fine, then within minutes CPU spikes repeatedly, sometimes multiple tasks back-to-back.
- Root cause: systemd timers with Persistent=true "catching up" on missed runs; multiple missed timers executing on boot.
- Fix: add RandomizedDelaySec, review the Persistent setting, and ensure boot-critical services aren't contending with maintenance tasks.
6) Spikes disappear when you run the job manually "for testing"
- Symptom: you run the suspected script by hand and it seems fine; spikes keep happening later.
- Root cause: the scheduled environment differs: different PATH, different nice/ionice, different args, different working directory, or it runs alongside other tasks.
- Fix: capture the exact command line and environment from systemd unit or cron logs; reproduce with the same parameters and concurrency.
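For cron jobs, a quick way to approximate the scheduled environment is to run the command with almost everything stripped out. The sketch below assumes the common cron defaults of PATH=/usr/bin:/bin and SHELL=/bin/sh; your cron implementation or crontab may set different values, so check before trusting it. run_like_cron is a hypothetical helper name.

```shell
#!/bin/sh
# run_like_cron COMMAND_STRING
# Hypothetical helper: run a command with a stripped-down environment,
# roughly what many cron implementations provide by default.
run_like_cron() {
    # env -i starts from an empty environment; only these three survive.
    env -i HOME="${HOME:-/}" PATH=/usr/bin:/bin SHELL=/bin/sh \
        /bin/sh -c "$1"
}

# Example (not run automatically):
# run_like_cron '/usr/local/sbin/inventory-scan --json --upload'
```

If the job behaves differently here than in your login shell, the difference is in the environment. For systemd units, systemctl show with -p ExecStart, -p Nice, and -p Environment gives you the equivalent facts to reproduce.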
Checklists / step-by-step plan
Checklist A: Single host with periodic spikes
- Write down the spike cadence (every 5 minutes? at :00?).
- Capture top CPU processes during a spike (pidstat or top -H).
- Check CPU mode (mpstat) and run queue (vmstat).
- List systemd timers and match timestamps (systemctl list-timers --all).
- List cron sources (/etc/cron.d, crontab -l for key users).
- Correlate with logs (journalctl around the spike time).
- Confirm the unit/cgroup of the hot PID (/proc/PID/cgroup).
- Mitigate safely (CPUQuota/Nice, add jitter, reschedule).
- Fix properly (exclude paths, reduce frequency, remove duplicates).
- Verify over at least 3 spike intervals with sar or your monitoring.
Checklist B: Fleet-wide synchronized spikes
- Confirm synchronization across multiple nodes (same minute, same shape).
- Identify common timers enabled in the base image (systemctl list-timers on multiple nodes).
- Look for vendor agents deployed everywhere (inventory/EDR/backup).
- Apply jitter universally for non-urgent tasks.
- Set CPU budgets for background work (cgroups via systemd service overrides).
- Stagger unavoidable heavy tasks by role (e.g., different schedules for web vs db nodes).
- Re-check peak latency and error rate after changes.
Checklist C: When the culprit is "storage work" not "CPU work"
- Check iowait and disk utilization during spikes.
- Identify jobs doing compression/hashing (backup, logrotate, scanners).
- Move heavy compression to off-host or off-peak when possible.
- Throttle I/O class for background jobs (ionice or systemd I/O scheduling settings).
- Prevent directory-walk jobs from scanning container and build cache paths.
Interesting facts and history that matter in production
- Fact 1: Cron's design dates back to early Unix; its "run at exact minute" behavior is why fleets still stampede on :00.
- Fact 2: systemd timers can be persistent, meaning they'll run missed jobs after downtime: great for laptops, spicy for servers.
- Fact 3: Many distros migrated classic cron tasks (like logrotate) to systemd timers, so "we checked cron" stopped being sufficient years ago.
- Fact 4: The updatedb/locate database exists to make filename search fast: useful on dev machines, often pointless on production servers.
- Fact 5: logrotate can compress rotated logs, and compression is CPU-heavy; one large log file can cause a bigger spike than a dozen small ones.
- Fact 6: TRIM (fstrim) is scheduled weekly on many Linux systems; on some storage backends it can create noticeable bursts.
- Fact 7: Periodic spikes are easier to diagnose than constant load because you can correlate them with scheduler metadata, if you actually look.
- Fact 8: Randomized delay (jitter) exists specifically to prevent thundering herds; not using it in a fleet is an avoidable self-own.
- Fact 9: Many "agents" run their own internal schedules and may not show up as cron/timers at all, so you must observe their CPU behavior directly.
FAQ
1) What's the single first thing to check for periodic CPU spikes?
systemd timers: systemctl list-timers --all. It's the quickest way to catch OS maintenance tasks and vendor timers that people forget exist.
2) I checked cron and found nothing. What now?
Check systemd timers, then look for long-running daemons (agents) that wake periodically. Correlate with journalctl at the spike time.
3) Why do spikes happen exactly every 5 minutes?
Because somebody set */5 * * * * in cron or OnCalendar=*:0/5 in a timer. Computers are literal. Humans love round numbers.
4) Should I just disable the offending timer?
Sometimes yes (e.g., updatedb on a production server). Sometimes no (security scans, backups). Prefer: reduce frequency, add jitter, exclude heavy paths, and cap CPU.
5) How do I prove a timer is the cause, not correlation?
Match the timer's LAST run time to the spike timestamp, then check the started unit in journalctl and confirm the hot PID's cgroup is that unit.
6) The spike is mostly %sys (system CPU). What does that suggest?
Kernel-heavy work: filesystem metadata churn, networking overhead, encryption, or kernel threads doing storage maintenance. Look for scans, backups, scrubs, or heavy logging.
7) Why did spikes get worse after a reboot?
Persistent timers can "catch up" after downtime. Multiple missed tasks may run shortly after boot, stacking CPU load.
8) How do I stop synchronized spikes across a whole fleet?
Add jitter (RandomizedDelaySec) and avoid scheduling everything at :00. For heavy tasks, enforce CPU quotas so background work can't starve your service.
9) Can storage cause CPU spikes even if disks are fine?
Yes. Directory walks, checksums, compression, and encryption are CPU work triggered by storage-related tasks. "Disk looks fine" doesn't mean "storage-related work isn't happening."
10) What if the process name is unhelpful, like "python"?
Grab the full command line (ps -p PID -o cmd), check its parent process, and check cgroup membership. If it's a systemd unit, systemctl status UNIT will often reveal the script path.
Conclusion: practical next steps
If your CPU spikes are periodic, act like an SRE, not a fortune teller. Treat it as scheduled work until you have evidence otherwise.
- On one host: capture the hot process during a spike with pidstat, then map it to a unit or cron entry.
- On the scheduler: list systemd timers first, then cron. Look for five-minute and top-of-hour patterns.
- Mitigate safely: throttle with systemd drop-ins (CPUQuota/Nice) and add jitter to stop synchronized fleet spikes.
- Fix the root: exclude heavy paths, reduce frequency, remove duplicate schedules, and stop running developer conveniences on production servers.
- Verify: watch at least a few intervals and confirm the spike is gone, or at least no longer impacts your latency budget.
Do this well and the graph stops looking like a heart monitor. Which is nice, because it means you can stop treating your infrastructure like a patient in triage.