CPU Spikes Every Few Minutes: The Scheduled Task You Should Check First

Everything is fine… until it isn’t. The graphs look like a calm sea and then—every few minutes—your CPU shoots up like it’s trying to win a sprint. Latency bumps. Fans get loud. Your on-call phone starts doing that thing where it vibrates itself off the nightstand.

If the spikes are periodic, assume a scheduled task until proven otherwise. And the scheduled task you should check first is the one you didn’t schedule: the system’s own timers and vendor agents, not your application. Specifically: systemd timers and cron (plus their “helpful” friends like logrotate, updatedb, apt timers, monitoring agents, and backup clients).

Fast diagnosis playbook (first 10 minutes)

Periodic spikes are a gift. A random spike is a horror novel; a periodic one is a calendar invite.

Minute 0–2: confirm the pattern and capture the offender

  1. Confirm periodicity. Look at your monitoring: do spikes occur every 1, 5, 10, 15, 60 minutes, or at :00 or :30? Those are classic timer cadences.
  2. During the next spike, capture top offenders. Use top or pidstat. Don’t “look later”; the process name is half the battle.
  3. Check run queue and iowait. High %usr/%sys suggests CPU work; high %wa suggests I/O bottleneck with CPU side-effects (compression, checksums, fsck-like work, encryption, etc.).
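
If your monitoring can export spike timestamps, step 1 can be checked mechanically instead of by eyeball. A minimal sketch (the input format, epoch seconds one per line, is an assumption; adapt it to whatever your metrics store exports):

```shell
#!/bin/sh
# Sketch: read spike timestamps (epoch seconds, one per line) on stdin
# and print the most common gap between consecutive spikes, in seconds.
# A gap of 300, 600, or 3600 is a scheduler, not an accident.
awk '
NR > 1 { gaps[$1 - prev]++ }   # count each interval between spikes
{ prev = $1 }
END {
    best = 0
    for (g in gaps)
        if (gaps[g] > best) { best = gaps[g]; mode = g }
    print mode                 # modal interval in seconds
}'
```

Feed it four timestamps 300 seconds apart and it prints 300; a fuzzy spread of results (290, 305, 310) usually means jitter is already in play somewhere.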

Minute 2–6: identify the scheduler

  1. List systemd timers. This catches the “I swear there is no cron job” cases.
  2. List cron sources. User crontabs, /etc/crontab, and /etc/cron.* directories.
  3. Check vendor agents and monitoring. Many ship as systemd services and schedule their own runs internally.

Minute 6–10: correlate with logs and take a safe action

  1. Correlate by time. Journal logs around the spike timestamp will usually show the unit that started.
  2. Safest immediate mitigation. If it’s non-critical (updatedb, logrotate, a reporting script), delay it or add jitter. If it’s critical (backups, security scans), reduce concurrency or scope rather than disabling.

Paraphrased idea attributed to Gene Kim: Reliability comes from making work visible and repeatable, not from heroics at 3 a.m.

Why periodic spikes scream “scheduled task”

CPU doesn’t spike “every five minutes” by accident. Humans schedule things in round numbers. So do operating systems. So do “enterprise” agents that were built by someone who never had to share a hypervisor with your database.

Periodic CPU spikes usually come from one of these categories:

  • Maintenance tasks: logrotate, updatedb (mlocate), tmp cleanup, package update checks, certificate renewal hooks.
  • Backup and indexing: filesystem scans, snapshotting scripts, dedupe indexing, object-store sync, antivirus scans.
  • Monitoring and telemetry: metric scrapes, inventory collectors, security posture checks.
  • Storage CPU work: compression, checksums, encryption, RAID parity, scrub-like operations, snapshot diff calculations.
  • Application “helpfulness”: caches warming, periodic report generation, compaction (databases), reindexing jobs.

Here’s the trap: you can stare at CPU graphs all day and still miss the cause if you don’t line up timestamps. Every periodic task leaves a footprint: a process start, a log line, a unit activation, a file touched, a network burst.

Joke #1: Scheduled tasks are like office birthdays—somehow they happen every year, and somehow they still surprise everyone.

The scheduled task to check first (and why)

The first thing I check is systemd timers, then cron. Not because cron is rare—cron is everywhere—but because teams often check “cron” and stop. Meanwhile, systemd timers are quietly triggering:

  • apt-daily.timer and apt-daily-upgrade.timer (Debian/Ubuntu)
  • fstrim.timer (SSD trim; can hit storage subsystems)
  • logrotate.timer (or logrotate from cron, depending on distro)
  • man-db.timer (yes, really)
  • updatedb.timer (filesystem scan)
  • Vendor timers for backup agents, EDR, compliance scanners

Why first? Because systemd timers can run with persistent catch-up behavior. If a host was down at the scheduled time, a timer with Persistent=true may fire immediately at boot. That’s how you get “CPU spikes every few minutes” after an outage: a herd of missed timers making up for lost time.

Also, timers can be configured with randomized delays. That’s good. But when they aren’t, fleets do synchronized work. If you’ve ever watched 500 nodes run the same job at minute 0, you already know how this story ends.

Hands-on: 12+ tasks with commands, outputs, and decisions

Below are practical tasks I’d run on a Linux box in production. For each: command, what the output means, and what decision you make next. Run them during a spike if you can. If you can’t, you can still gather enough evidence between spikes.

Task 1: Watch CPU per process over time (catch the spike)

cr0x@server:~$ pidstat -u -h 1
Linux 6.5.0 (server)  02/05/2026  _x86_64_  (16 CPU)

# Time        UID      PID    %usr %system  %CPU   Command
12:00:01     0       1423     0.00    0.00  0.00   systemd
12:00:02     0      22891    92.00    6.00  98.00  updatedb
12:00:03     0      22891    88.00    5.00  93.00  updatedb

Meaning: updatedb is saturating CPU during the spike. That’s not your app. It’s a scheduled filesystem index update.

Decision: Identify who triggers updatedb (timer or cron), then reschedule, throttle, or exclude paths.

Task 2: Confirm overall CPU mode (user/system/iowait)

cr0x@server:~$ mpstat -P ALL 1 5
Linux 6.5.0 (server)  02/05/2026  _x86_64_  (16 CPU)

12:00:01 PM  all  %usr %nice %sys %iowait %irq %soft %steal %idle
12:00:01 PM  all  62.50  0.00 12.30  0.20 0.00  0.50   0.00 24.50
12:00:02 PM  all  89.10  0.00  8.40  0.10 0.00  0.40   0.00  2.00

Meaning: High %usr suggests pure compute (hashing/compressing/scanning) rather than waiting on disk.

Decision: Focus on the process and its scheduling. If %iowait were high, you’d pivot to storage latency and queue depth.

Task 3: See load average vs run queue (CPU contention vs “just busy”)

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0  81234  10240 913024    0    0     2     5  120  300 12  3 85  0  0
12  0      0  80120  10240 913500    0    0     0     0 3400 9800 88  9  3  0  0

Meaning: r=12 on a 16-core box means a lot of runnable threads; not blocked, just competing for CPU.

Decision: If this aligns with a timer firing, you want to reduce concurrency (nice/ionice, task settings, or the job itself).

Task 4: List systemd timers (the usual suspects)

cr0x@server:~$ systemctl list-timers --all
NEXT                         LEFT          LAST                         PASSED       UNIT                         ACTIVATES
Thu 2026-02-05 12:05:00 UTC  2min 12s      Thu 2026-02-05 12:00:01 UTC  2min 47s     updatedb.timer               updatedb.service
Thu 2026-02-05 12:10:00 UTC  7min 12s      Thu 2026-02-05 12:00:08 UTC  2min 40s     apt-daily.timer              apt-daily.service
Thu 2026-02-05 12:15:00 UTC  12min 12s     Thu 2026-02-05 12:00:10 UTC  2min 38s     logrotate.timer              logrotate.service

Meaning: You’ve got multiple timers firing at minute 0 and repeating. The LAST timestamp is gold: match it to your CPU spike time.

Decision: Inspect the timer configuration and the service it activates. Apply jitter, change schedule, or disable safely if appropriate.

Task 5: Inspect a timer and service definition (what exactly runs)

cr0x@server:~$ systemctl cat updatedb.timer
# /lib/systemd/system/updatedb.timer
[Unit]
Description=Update a database for mlocate

[Timer]
OnCalendar=*:0/5
Persistent=true
RandomizedDelaySec=0

[Install]
WantedBy=timers.target

Meaning: Every 5 minutes, persistent catch-up, and zero jitter. That is practically an invitation to synchronized pain.

Decision: Add RandomizedDelaySec, reduce frequency, or disable if you don’t need locate on a server.

Task 6: Check cron (system-wide and per-user)

cr0x@server:~$ sudo ls -l /etc/cron.hourly /etc/cron.daily /etc/cron.d
/etc/cron.d:
total 12
-rw-r--r-- 1 root root  240 Jan 10  2026 backup-agent
-rw-r--r-- 1 root root  180 Dec  2  2025 mlocate
-rw-r--r-- 1 root root  210 Nov 18  2025 sysstat

Meaning: Cron is still in play. /etc/cron.d/mlocate is likely driving updatedb on distros that don’t use the timer, or in addition to it.

Decision: Ensure you don’t have duplicate scheduling (timer + cron). Pick one, disable the other.
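
One way to spot duplicates, sketched below, is to grep every scheduler location for the command name. SCHED_DIRS is overridable so you can point it at test fixtures; the default list covers the common locations:

```shell
#!/bin/sh
# Sketch: list every scheduler file that mentions a command, to catch
# the same job wired up in both cron and a systemd unit.
cmd="${1:-updatedb}"
SCHED_DIRS="${SCHED_DIRS:-/etc/crontab /etc/cron.d /etc/cron.hourly /etc/cron.daily /etc/systemd/system /lib/systemd/system}"
# -r recurse, -l print filenames only, -s silence missing-path errors.
# $SCHED_DIRS is deliberately unquoted so it splits into separate paths.
grep -rls -- "$cmd" $SCHED_DIRS
```

Two or more hits for the same command is your cue to pick one scheduler and disable the rest.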

Task 7: View crontab entries for a specific user (where surprises hide)

cr0x@server:~$ sudo crontab -l -u root
*/5 * * * * /usr/local/sbin/inventory-scan --json --upload
15 * * * * /usr/local/sbin/storage-report

Meaning: The five-minute cadence is right there. inventory-scan smells like “agent work,” often CPU-heavy (filesystem walk, package query, hashing).

Decision: Measure runtime and CPU, then either reduce frequency, scope it (exclude directories), or move it off critical hosts.

Task 8: Correlate with journal logs at the spike timestamp

cr0x@server:~$ sudo journalctl --since "2026-02-05 11:58:00" --until "2026-02-05 12:02:00" --no-pager
Feb 05 12:00:01 server systemd[1]: Started Update a database for mlocate.
Feb 05 12:00:01 server updatedb[22891]: updatedb: pruning "/tmp"
Feb 05 12:00:10 server systemd[1]: Started Rotate log files.
Feb 05 12:00:10 server logrotate[22940]: rotating pattern: /var/log/*.log  forced from command line (1 rotations)

Meaning: Now you have a timeline. It’s not “mystery load”; it’s two maintenance jobs stacking.

Decision: Stagger them. Maintenance jobs colliding is the classic “why do spikes look worse at midnight?” phenomenon.

Task 9: Identify which cgroup/unit a hot PID belongs to

cr0x@server:~$ cat /proc/22891/cgroup
0::/system.slice/updatedb.service

Meaning: The process is owned by updatedb.service. That makes it easy to tune with systemd knobs.

Decision: Apply CPUQuota, Nice, or IOSchedulingClass in a drop-in if you can’t change the job itself.

Task 10: See what command line the process is really running

cr0x@server:~$ ps -p 22891 -o pid,ppid,ni,etimes,cmd
  PID  PPID  NI ELAPSED CMD
22891     1   0      18 /usr/bin/updatedb.mlocate --prunepaths=/tmp

Meaning: Only /tmp is pruned. Nothing excludes /var/lib/docker or /var/lib/kubelet, so the scan walks the container trees. Those paths churn and contain millions of inodes. That's CPU plus metadata I/O.

Decision: Add those paths to the prune list (PRUNEPATHS in /etc/updatedb.conf) or stop running updatedb on container hosts. "Locate" is not a production SLO.
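
On distros that read /etc/updatedb.conf, the exclusions live there rather than on the command line. A sketch (the exact path and filesystem lists are illustrative; keep whatever your distro ships and append the churny container paths):

```
# /etc/updatedb.conf
PRUNE_BIND_MOUNTS="yes"
PRUNEFS="NFS nfs nfs4 afs binfmt_misc proc smbfs autofs iso9660 udf"
PRUNEPATHS="/tmp /var/spool /media /var/lib/docker /var/lib/kubelet /var/lib/containerd"
```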

Task 11: Check historical CPU spikes with sar (prove it’s periodic)

cr0x@server:~$ sar -u -s 11:30:00 -e 12:10:00
Linux 6.5.0 (server)  02/05/2026  _x86_64_  (16 CPU)

11:35:00 AM     %user     %system     %iowait      %idle
11:35:00 AM      6.20        1.10        0.10      92.60
11:40:00 AM     68.30        8.90        0.20      22.60
11:45:00 AM      7.10        1.20        0.10      91.60
11:50:00 AM     70.40        9.10        0.10      20.40
11:55:00 AM      6.90        1.00        0.10      92.00
12:00:00 PM     71.20        9.00        0.10      19.70

Meaning: Spikes land every 10 minutes like clockwork. With 5-minute sampling, a shorter cadence could hide inside the quiet rows, so tighten the sar interval if the pattern looks ambiguous. Either way, it lines up with timers and crons.

Decision: Stop debating “is it the app?” and start matching schedules to those timestamps.

Task 12: Use perf to verify where CPU time is going (when it’s not obvious)

cr0x@server:~$ sudo perf top -p 22891
Samples: 2K of event 'cpu-clock', Event count (approx.): 2000000000
  35.12%  updatedb   libc.so.6          [.] __memmove_avx_unaligned_erms
  22.40%  updatedb   [kernel.kallsyms]  [k] link_path_walk
  10.08%  updatedb   libc.so.6          [.] __strcmp_avx2

Meaning: Memory moves, kernel path walking, and string comparisons dominate. That's a filesystem scan being CPU-heavy by design, not a kernel bug.

Decision: Reduce data scanned, reduce frequency, add cgroup limits, or move the work off the box.

Task 13: Find which unit started recently (quick journal grep)

cr0x@server:~$ sudo journalctl -S "12:00:00" -U "12:01:00" -g "Started " --no-pager
Feb 05 12:00:01 server systemd[1]: Started Update a database for mlocate.
Feb 05 12:00:10 server systemd[1]: Started Rotate log files.

Meaning: This is the fastest “what just started?” view when you can bracket the spike time.

Decision: Investigate the started unit(s) before diving into application profiling.

Task 14: Throttle a systemd service safely (drop-in override)

cr0x@server:~$ sudo systemctl edit updatedb.service
# (creates a drop-in override)
cr0x@server:~$ sudo cat /etc/systemd/system/updatedb.service.d/override.conf
[Service]
Nice=10
CPUQuota=20%
IOSchedulingClass=best-effort
IOSchedulingPriority=7
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart updatedb.service

Meaning: You didn’t “fix” updatedb; you made it stop bullying the rest of the host.

Decision: Use throttling as a mitigation, then do the real fix: reschedule, jitter, and reduce scan scope.

Task 15: Add jitter to a timer (stop the fleet stampede)

cr0x@server:~$ sudo systemctl edit updatedb.timer
cr0x@server:~$ sudo cat /etc/systemd/system/updatedb.timer.d/override.conf
[Timer]
RandomizedDelaySec=180
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart updatedb.timer

Meaning: The job still runs, but not simultaneously across the fleet and not exactly on the minute.

Decision: If many nodes spike together, jitter is often the single highest ROI change you can make.

Cron vs systemd timers vs “agents”: how they really behave

You don’t need religious wars about schedulers. You need a correct mental model.

Cron: simple, ubiquitous, and easy to accidentally duplicate

Cron runs commands on a schedule. That’s it. Its simplicity is why it’s still everywhere—and why it gets used as a dumping ground for “temporary” scripts that live for five years.

Where cron hides work:

  • System crontab: /etc/crontab
  • Drop-in directories: /etc/cron.d/, /etc/cron.hourly/, /etc/cron.daily/, etc.
  • User crontabs: /var/spool/cron (RHEL-family) or /var/spool/cron/crontabs (Debian-family)

Cron’s classic failure mode in fleets is synchronization: every machine runs the same thing at the same time, because everyone copy-pasted the same crontab entry.
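
Classic cron has no RandomizedDelaySec, but you can fake stable per-host spread by deriving the minute offset from the hostname. A sketch (the hostname override and job path are illustrative):

```shell
#!/bin/sh
# Sketch: derive a stable 0..4 minute offset from the hostname so a
# fleet-wide "*/5" job does not fire everywhere in the same minute.
host="${HOST_OVERRIDE:-$(hostname)}"   # override exists only for testing
sum=$(printf '%s' "$host" | cksum | awk '{print $1}')
offset=$(( sum % 5 ))
# Vixie cron accepts range/step, so "2-59/5" means minutes 2,7,12,...,57.
echo "$offset-59/5 * * * * root /usr/local/sbin/inventory-scan"
```

Because cksum is deterministic, every host keeps its own offset across reboots and re-deploys, unlike a random sleep.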

systemd timers: more visible, more powerful, and capable of surprise

Timers are first-class systemd objects. They have logs. They have status. They can catch up after downtime. They can add jitter. They can have accuracy windows that delay firing slightly to coalesce wakeups.

Timers’ classic failure mode is persistence + boot storms. After maintenance, reboot, or autoscaling events, a bunch of missed timers fire. If each timer is “small,” the combined load is not.
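
A drop-in that tames both failure modes at once might look like this sketch (the unit name is hypothetical and the values are illustrative; whether Persistent=false is safe depends on whether missed runs matter for the job):

```
# /etc/systemd/system/heavy-scan.timer.d/override.conf  (hypothetical unit)
[Timer]
RandomizedDelaySec=600
AccuracySec=1m
Persistent=false
```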

Agents: “we run every 5 minutes” is not a contract, it’s a threat

Monitoring agents, inventory collectors, endpoint security, backup agents, and “compliance posture” tools often implement their own schedules internally. You may not see a timer or cron at all; you’ll see a long-running daemon that wakes up and does a bunch of work.

The diagnostic trick is to look for:

  • Regular bursts of CPU in a single daemon process.
  • Regular bursts of child processes forked by that daemon.
  • Log entries repeating every N minutes.
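
When an agent hides its schedule internally, you can still see it from the outside by sampling its cumulative CPU ticks from /proc. A sketch (it defaults to sampling itself for demonstration; point pid at the daemon in practice):

```shell
#!/bin/sh
# Sketch: print the CPU ticks a process burned in the last second.
# Run in a loop against a daemon PID; periodic bursts show up as
# large deltas every N minutes.
pid="${1:-$$}"
read_ticks() {
    # Fields 14 and 15 of /proc/PID/stat are utime and stime (clock ticks).
    # Caveat: naive field counting breaks if the comm field contains spaces.
    awk '{ print $14 + $15 }' "/proc/$1/stat"
}
t0=$(read_ticks "$pid")
sleep 1
t1=$(read_ticks "$pid")
echo "$(( t1 - t0 ))"
```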

The storage angle: CPU spikes from I/O work in disguise

As a storage engineer, I’ll say the quiet part out loud: a lot of “CPU spikes” are storage chores wearing a CPU costume.

Metadata walks are CPU work

Jobs like updatedb, antivirus scans, backup file enumeration, and integrity scanners do massive directory traversals. Even if your disks are fast, turning millions of dentries and inodes into a list costs CPU. On networked filesystems it’s worse: every metadata call becomes a network round trip, plus client-side parsing and caching.

Compression and checksumming are CPU work

If you enabled compression “to save space,” congratulations: you also signed up to pay CPU on every write (and sometimes on reads). Many backup tools compress by default. Log shipping pipelines compress. Even logrotate can compress archives. Those jobs often align to time boundaries and create bursts.
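
If logrotate's compression is part of the burst, you can cheapen it without dropping it. A sketch of the relevant directives (the log path is illustrative; gzip, the default compressor, is assumed):

```
/var/log/app/*.log {
    daily
    rotate 14
    compress
    delaycompress          # keep the newest rotation uncompressed; also avoids
                           # compressing a file the app may still have open
    compressoptions -1     # lowest gzip level: far less CPU, slightly bigger files
}
```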

TRIM and scrubs are not free

fstrim can cause noticeable device activity. A RAID scrub or filesystem scrub can cause CPU usage in kernel threads, especially if checksums are involved. You might see the CPU spike, but the root cause is scheduled storage maintenance colliding with your peak traffic.

Encryption is predictable—and spiky when scheduled

At-rest encryption and encrypted backups move CPU cycles. That’s fine when it’s steady. It’s painful when it’s bursty. A nightly backup that encrypts and compresses at the same time can look like a “mystery CPU spike” to anyone not staring at the backup schedule.

Joke #2: Nothing says “enterprise-ready” like a backup agent that benchmarks your CPU at noon without asking.

Three corporate mini-stories (true-to-life)

Mini-story 1: The incident caused by a wrong assumption (“We don’t use cron”)

A mid-size company ran a set of API servers behind a load balancer. Every five minutes, error rates ticked up. Not a full outage—just enough to make the SLO burn look like a slow leak. The on-call’s first assumption was the usual: garbage collection or a noisy neighbor on the hypervisor.

They checked application logs. Nothing obvious. They checked deploys. Nothing. They checked “cron.” No relevant entries in crontab -l. They declared, with confidence, “We don’t use cron on these boxes.” The investigation drifted into profiling endpoints and tuning thread pools.

The breakthrough happened when someone compared CPU spike timestamps with systemctl list-timers. There it was: updatedb.timer running every five minutes, persistent, no jitter. It came from a baseline OS image that had been “hardened” in every way except the part where it did a filesystem scan repeatedly.

Why did it hurt the API? The servers also hosted container runtimes. The updatedb scan walked huge overlay filesystem trees. It didn’t just use CPU; it churned page cache and metadata caches. The app became slightly slower, then recovered—over and over, like a tiny denial-of-service attack from inside the house.

The fix was boring: disable updatedb on those hosts and add a policy that baseline images must document enabled timers. The postmortem’s key lesson wasn’t about locate. It was about assumptions. “No cron” didn’t mean “no scheduled tasks.” It meant “we didn’t look in the right scheduler.”

Mini-story 2: The optimization that backfired (compression everywhere)

An internal platform team wanted to cut storage costs. Logs were big, backups were bigger, and the CFO had discovered the word “efficiency.” The team enabled aggressive compression in the backup pipeline, set the job to run every 15 minutes for better RPO, and celebrated the space savings.

Then came the CPU spikes: sharp, rhythmic, and ugly. They started seeing increased request latency and occasional timeouts across services that shared the same nodes as the backup client. The backup team insisted nothing had changed “for the app,” which was technically true and operationally useless.

What actually happened: compressing small batches every 15 minutes caused constant bursts of CPU rather than one predictable nightly burn. The bursts aligned with other platform timers—logrotate, metrics compaction, package update checks—and created a recurring “busy minute.” The system wasn’t overloaded on average; it was overloaded in a repeating pattern.

The attempted optimization—more frequent incremental backups with maximum compression—backfired because it ignored contention and co-tenancy. They fixed it by lowering compression level, adding CPU quotas to the backup service, and adding randomized delays. Space usage increased slightly. Stability improved a lot. The CFO got a graph that went down. The on-call got a night of sleep.

Mini-story 3: The boring practice that saved the day (jitter and budgets)

A finance company ran a large Linux fleet with strict latency requirements. They had learned, the hard way, that “midnight maintenance” is just a way to create synchronized disasters at midnight. So they treated background work like production traffic: it needed budgets, observability, and spread.

Every new scheduled task had to declare: frequency, expected runtime, CPU profile (rough), and whether it could be delayed. Tasks were added as systemd timers with jitter by default. They used RandomizedDelaySec on anything non-urgent, and they avoided scheduling everything on the hour.

One day, they rolled out a compliance agent update that was heavier than expected. It did a filesystem inventory and cryptographic verification. CPU usage increased—but it didn’t spike across the fleet. Why? The timer had a 10-minute schedule with a 6-minute random delay, and the service had a CPU quota. The work spread out. The load balancers never saw a synchronized dip.

It wasn’t glamorous engineering. It was the operational equivalent of putting your tools back after using them. But it turned a potentially nasty incident into a non-event, which is the nicest thing production can do.

Common mistakes: symptom → root cause → fix

This is the part where most teams waste hours. Don’t.

1) Spikes every 5 minutes, “no cron jobs,” nothing in app logs

  • Symptom: CPU jumps to 80–100% every 5 minutes. App latency bumps. Nothing obvious in app logs.
  • Root cause: systemd timer (often updatedb.timer, agent timers, package update timers) or a daemon that wakes periodically.
  • Fix: systemctl list-timers --all, correlate with journalctl, add jitter or disable non-essential timers; cap CPU with a drop-in override.

2) Spikes at :00 across many hosts (fleet-wide) and graphs look synchronized

  • Symptom: Whole cluster sees CPU spikes at top of the hour; downstream services report p95 latency bumps.
  • Root cause: synchronized schedules (cron or timers without RandomizedDelaySec), often due to identical image configs.
  • Fix: add jitter, stagger schedules, use systemd timer accuracy and randomized delay; avoid “minute 0” conventions.

3) CPU spikes coincide with disk busy time, but CPU looks like the culprit

  • Symptom: CPU spikes, but %iowait rises too; storage latency alarms fire; app threads block.
  • Root cause: scheduled I/O-heavy maintenance (logrotate compress, backup, scrub/trim) triggering CPU work (compression, checksums) and saturating storage queues.
  • Fix: reschedule I/O chores off peak; reduce parallelism; use ionice/IOSchedulingClass; verify storage health and queue depth.

4) Spikes started after enabling “security scanning” or “inventory” tooling

  • Symptom: periodic CPU bursts, lots of short-lived processes, frequent filesystem stats.
  • Root cause: EDR/compliance agent walking the filesystem, hashing binaries, scanning containers.
  • Fix: tune exclusions (container dirs, build caches), reduce scan frequency, push vendor for a better mode; isolate on dedicated nodes if needed.

5) Spikes only after reboot or after downtime

  • Symptom: boot seems fine, then within minutes CPU spikes repeatedly, sometimes multiple tasks back-to-back.
  • Root cause: systemd timers with Persistent=true “catching up” on missed runs; multiple missed timers executing on boot.
  • Fix: add RandomizedDelaySec, review Persistent setting, and ensure boot-critical services aren’t contending with maintenance tasks.

6) Spikes disappear when you run the job manually “for testing”

  • Symptom: you run the suspected script by hand and it seems fine; spikes keep happening later.
  • Root cause: the scheduled environment differs: different PATH, different nice/ionice, different args, different working directory, or it runs alongside other tasks.
  • Fix: capture the exact command line and environment from systemd unit or cron logs; reproduce with the same parameters and concurrency.

Checklists / step-by-step plan

Checklist A: Single host with periodic spikes

  1. Write down the spike cadence (every 5 minutes? at :00?).
  2. Capture top CPU processes during a spike (pidstat or top -H).
  3. Check CPU mode (mpstat) and run queue (vmstat).
  4. List systemd timers and match timestamps (systemctl list-timers --all).
  5. List cron sources (/etc/cron.d, crontab -l for key users).
  6. Correlate with logs (journalctl around the spike time).
  7. Confirm the unit/cgroup of the hot PID (/proc/PID/cgroup).
  8. Mitigate safely (CPUQuota/Nice, add jitter, reschedule).
  9. Fix properly (exclude paths, reduce frequency, remove duplicates).
  10. Verify over at least 3 spike intervals with sar or your monitoring.

Checklist B: Fleet-wide synchronized spikes

  1. Confirm synchronization across multiple nodes (same minute, same shape).
  2. Identify common timers enabled in the base image (systemctl list-timers on multiple nodes).
  3. Look for vendor agents deployed everywhere (inventory/EDR/backup).
  4. Apply jitter universally for non-urgent tasks.
  5. Set CPU budgets for background work (cgroups via systemd service overrides).
  6. Stagger unavoidable heavy tasks by role (e.g., different schedules for web vs db nodes).
  7. Re-check peak latency and error rate after changes.

Checklist C: When the culprit is “storage work” not “CPU work”

  1. Check iowait and disk utilization during spikes.
  2. Identify jobs doing compression/hashing (backup, logrotate, scanners).
  3. Move heavy compression to off-host or off-peak when possible.
  4. Throttle I/O class for background jobs (ionice or systemd I/O scheduling settings).
  5. Prevent directory-walk jobs from scanning container and build cache paths.

Interesting facts and history that matter in production

  • Fact 1: Cron dates back to Version 7 Unix (1979); its "run at the exact minute" behavior is why fleets still stampede on :00.
  • Fact 2: systemd timers can be persistent, meaning they’ll run missed jobs after downtime—great for laptops, spicy for servers.
  • Fact 3: Many distros migrated classic cron tasks (like logrotate) to systemd timers, so “we checked cron” stopped being sufficient years ago.
  • Fact 4: The updatedb / locate database exists to make filename search fast—useful on dev machines, often pointless on production servers.
  • Fact 5: logrotate can compress rotated logs, and compression is CPU-heavy; one large log file can cause a bigger spike than a dozen small ones.
  • Fact 6: TRIM (fstrim) is scheduled weekly on many Linux systems; on some storage backends it can create noticeable bursts.
  • Fact 7: Periodic spikes are easier to diagnose than constant load because you can correlate them with scheduler metadata—if you actually look.
  • Fact 8: Randomized delay (jitter) exists specifically to prevent thundering herds; not using it in a fleet is an avoidable self-own.
  • Fact 9: Many “agents” run their own internal schedules and may not show up as cron/timers at all, so you must observe their CPU behavior directly.

FAQ

1) What’s the single first thing to check for periodic CPU spikes?

systemd timers: systemctl list-timers --all. It’s the quickest way to catch OS maintenance tasks and vendor timers that people forget exist.

2) I checked cron and found nothing. What now?

Check systemd timers, then look for long-running daemons (agents) that wake periodically. Correlate with journalctl at the spike time.

3) Why do spikes happen exactly every 5 minutes?

Because somebody set */5 * * * * in cron or OnCalendar=*:0/5 in a timer. Computers are literal. Humans love round numbers.

4) Should I just disable the offending timer?

Sometimes yes (e.g., updatedb on a production server). Sometimes no (security scans, backups). Prefer: reduce frequency, add jitter, exclude heavy paths, and cap CPU.

5) How do I prove a timer is the cause, not correlation?

Match the timer’s LAST run time to the spike timestamp, then check the started unit in journalctl and confirm the hot PID’s cgroup is that unit.

6) The spike is mostly %sys (system CPU). What does that suggest?

Kernel-heavy work: filesystem metadata churn, networking overhead, encryption, or kernel threads doing storage maintenance. Look for scans, backups, scrubs, or heavy logging.

7) Why did spikes get worse after a reboot?

Persistent timers can “catch up” after downtime. Multiple missed tasks may run shortly after boot, stacking CPU load.

8) How do I stop synchronized spikes across a whole fleet?

Add jitter (RandomizedDelaySec) and avoid scheduling everything at :00. For heavy tasks, enforce CPU quotas so background work can’t starve your service.

9) Can storage cause CPU spikes even if disks are fine?

Yes. Directory walks, checksums, compression, and encryption are CPU work triggered by storage-related tasks. “Disk looks fine” doesn’t mean “storage-related work isn’t happening.”

10) What if the process name is unhelpful, like “python”?

Grab the full command line (ps -p PID -o cmd), check its parent process, and check cgroup membership. If it’s a systemd unit, systemctl status UNIT will often reveal the script path.

Conclusion: practical next steps

If your CPU spikes are periodic, act like an SRE, not a fortune teller. Treat it as scheduled work until you have evidence otherwise.

  1. On one host: capture the hot process during a spike with pidstat, then map it to a unit or cron entry.
  2. On the scheduler: list systemd timers first, then cron. Look for five-minute and top-of-hour patterns.
  3. Mitigate safely: throttle with systemd drop-ins (CPUQuota/Nice) and add jitter to stop synchronized fleet spikes.
  4. Fix the root: exclude heavy paths, reduce frequency, remove duplicate schedules, and stop running developer conveniences on production servers.
  5. Verify: watch at least a few intervals and confirm the spike is gone—or at least no longer impacts your latency budget.

Do this well and the graph stops looking like a heart monitor. Which is nice, because it means you can stop treating your infrastructure like a patient in triage.
