If you’ve ever been woken up because “the nightly job didn’t run,” you already know the problem: scheduling is boring right up until it’s the only thing anyone cares about. Cron is simple, ubiquitous, and… surprisingly good at failing quietly.
On Debian 13, systemd timers are the grown-up option: better observability, sane dependency handling, missed-run catch-up, and a failure model you can actually automate. They also have a few sharp edges that will happily remove a finger if you grab them wrong.
The decision: when to stick with cron vs switch
Let’s be blunt: cron is fine for a surprising amount of stuff. If you have a single host, a couple of scripts, and you can tolerate “it ran at some point” as a success criterion, cron still earns its keep.
But production systems don’t fail in the “happy path.” They fail during reboots, deployments, DNS incidents, full disks, hung NFS mounts, and accidental environment changes. That’s where cron’s simplicity becomes an operational tax.
Use cron when
- You need a schedule and nothing else.
- You have stable hosts, minimal dependencies, and you actively monitor outcomes.
- You’re maintaining legacy workloads and a migration would add risk for little reward.
- You’re running in a container where systemd isn’t PID 1 and you don’t want to fight that war.
Use systemd timers when
- You care about missed runs after reboot or downtime (Persistent=true is a career-saving line).
- You want logs in the journal, with consistent metadata and easy filtering.
- You need dependency ordering (after network-online, after mounts, after a database service).
- You need resource control: CPU/IO caps, timeouts, and isolation.
- You want monitoring that isn’t “did we get an email from cron?”
- You want to prevent overlapping runs without building a brittle lockfile pile.
Opinionated guidance: on Debian 13, new scheduled jobs should default to systemd timers unless you have a clear reason not to. Cron becomes the exception, not the rule.
One idea that still holds up in operations, paraphrasing John Ousterhout: complexity is the enemy of reliability. Timers aren’t about adding complexity; they’re about putting complexity in the right place (the service manager) instead of scattering it across scripts.
Interesting facts and a little history (because it matters)
Understanding how we got here helps you predict failure modes. Scheduling has always been less about “run at 2am” and more about “run when the world is on fire and still behave.”
- Cron dates back to the late 1970s, originally built for Unix to run periodic tasks with minimal overhead. It assumes the host is up and the clock is sane.
- Classic cron uses per-user crontabs and a global system crontab; that separation is convenient, but it also fragments ownership and auditability.
- Anacron was introduced to address missed jobs on machines that aren’t always on (like laptops). systemd timers effectively absorb that “catch up after downtime” idea.
- systemd timers came from the “units everywhere” model: a scheduled run is just a trigger for a service unit. That’s why you get dependency ordering and consistent logging.
- Debian’s cron ecosystem historically leaned on email for alerting. But modern environments often lack local MTA configuration, so failures become silent.
- systemd integrates with cgroups. That means you can throttle runaway jobs or protect the rest of the system—something cron itself never tried to do.
- systemd timers support randomization to avoid the “thundering herd” of thousands of hosts hitting the same API at midnight.
- Time synchronization became operationally critical once distributed systems became normal. Both cron and timers suffer with bad clocks, but systemd gives you clearer evidence and ordering tools.
- Logging moved from files to journals on many Debian deployments. Timers benefit directly because stdout/stderr are naturally captured without hacks.
Joke #1: Cron is like that coworker who “totally sent the email” — you’ll only know it didn’t happen when someone else complains.
Mental model: what cron does vs what systemd does
Cron: a minimal scheduler
Cron reads crontab entries, wakes up once per minute, and checks if a job should run. It executes commands with a sparse environment, under a user identity, with output optionally mailed. That’s basically it.
Its strengths are also its weaknesses:
- Strength: predictable simplicity. Weakness: you have to build everything else yourself (locking, retries, timeouts, dependencies, logging, alerting).
- Strength: widely understood. Weakness: “understood” often means “assumed,” and assumptions rot.
systemd timers: a trigger plus a service contract
A timer unit schedules activation of a service unit. The service unit defines how the job runs: user, environment, working directory, timeouts, resource controls, and what counts as success/failure. That separation is the magic.
The result is less tribal knowledge. A systemd unit file is self-describing in a way a random shell snippet in crontab is not.
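A minimal sketch of that separation, using hypothetical unit names (a fuller, production-grade example appears later in this piece):
# /etc/systemd/system/example-report.service (hypothetical)
[Unit]
Description=Example scheduled job
[Service]
Type=oneshot
# Assumed script path; any executable works here.
ExecStart=/usr/local/bin/example-report.sh
# /etc/systemd/system/example-report.timer (hypothetical)
[Unit]
Description=Run example-report daily
[Timer]
OnCalendar=daily
[Install]
WantedBy=timers.target
The timer only answers “when”; everything about “how” lives in the service.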
Common operational differences that matter
- Observability: timers write to journald by default; cron often writes nowhere unless you redirect.
- Missed runs: timers can catch up; cron doesn’t unless you add anacron or custom logic.
- Dependencies: timers can run after mounts/network; cron can only “hope.”
- Overlaps: systemd can gate concurrency; cron will happily stampede unless you lock.
- Failure semantics: systemd records exit codes, restart behavior, and rate limits; cron mostly shrugs.
One more thing: timers are not “more reliable” by being magical. They’re more reliable because they make the job an explicit unit of execution, with explicit behaviors. Reliability is mostly about making the implicit painfully explicit.
Practical tasks (commands, outputs, decisions)
These are the commands I actually run when migrating or debugging scheduled jobs. Each one includes what you’re looking at and the decision it drives.
Task 1: Identify cron implementations and cron-driven system jobs
cr0x@server:~$ dpkg -l | grep -E 'cron|anacron|systemd'
ii cron 3.0pl1-... amd64 process scheduling daemon
ii systemd 257-... amd64 system and service manager
What it means: You have classic cron installed. If you also see anacron packages, you may already have partial “missed job” behavior for some daily/weekly tasks.
Decision: Don’t uninstall cron yet. First inventory jobs and understand what’s relying on it.
Task 2: Inventory system-wide cron entries (the ones everyone forgets)
cr0x@server:~$ ls -la /etc/cron.d /etc/cron.daily /etc/cron.hourly /etc/cron.weekly /etc/cron.monthly
/etc/cron.d:
total 20
-rw-r--r-- 1 root root 201 Jan 10 10:12 sysstat
-rw-r--r-- 1 root root 349 Jan 10 10:12 logrotate
...
/etc/cron.daily:
total 16
-rwxr-xr-x 1 root root 539 Jan 10 10:12 apt-compat
-rwxr-xr-x 1 root root 377 Jan 10 10:12 man-db
What it means: Debian packages often ship jobs here. Some of these already have systemd equivalents on newer releases; others still lean on cron.
Decision: For packaged maintenance jobs, prefer distribution defaults unless you have strong reasons. Migrating vendor-supplied cron scripts is rarely a win.
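Before you decide, check whether a packaged job already has a timer; on recent Debian releases several cron.daily scripts (logrotate, for example) typically defer to a systemd timer when systemd is running. Output will vary by what’s installed:
cr0x@server:~$ systemctl list-unit-files --type=timer
cr0x@server:~$ systemctl status logrotate.timer --no-pager
If a packaged timer exists and is enabled, leave the matching cron script alone; it is usually already a no-op under systemd.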
Task 3: Inventory user crontabs (where the weird stuff lives)
cr0x@server:~$ sudo ls -1 /var/spool/cron/crontabs
postgres
www-data
backup
What it means: These are per-user crontabs. They often contain business-critical jobs that nobody documented.
Decision: Treat each crontab as production code. Export and review line-by-line before changing anything.
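A quick way to snapshot every per-user crontab into files you can review and commit (assumes the crontabs directory listed above):
cr0x@server:~$ for u in $(sudo ls -1 /var/spool/cron/crontabs); do sudo crontab -u "$u" -l > "crontab-export-$u.txt"; done
cr0x@server:~$ ls crontab-export-*.txt
Put these exports in version control before you change a single line.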
Task 4: Dump a specific crontab and look for environment traps
cr0x@server:~$ sudo crontab -u backup -l
MAILTO=ops-alerts
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
15 2 * * * /opt/jobs/backup-nightly.sh
What it means: Cron’s environment is explicit here (good), but MAILTO assumes mail delivery actually works (often false on modern fleets).
Decision: When migrating, bake environment into the systemd unit and replace “mail as monitoring” with explicit alerting.
Task 5: Check timer inventory (what systemd already schedules)
cr0x@server:~$ systemctl list-timers --all
NEXT LEFT LAST PASSED UNIT ACTIVATES
Mon 2025-12-29 02:15:00 UTC 3h 12min left Sun 2025-12-28 02:15:02 UTC 21h ago backup-nightly.timer backup-nightly.service
Mon 2025-12-29 00:00:00 UTC 57min left - - logrotate.timer logrotate.service
What it means: Timers show you NEXT/LAST runs. This immediately answers “did it run?” without grepping some file.
Decision: If a job matters, it should be visible here (or in an orchestrator), not hidden in a personal crontab.
Task 6: Inspect a timer’s schedule and whether it catches up
cr0x@server:~$ systemctl cat backup-nightly.timer
# /etc/systemd/system/backup-nightly.timer
[Unit]
Description=Nightly backup
[Timer]
OnCalendar=*-*-* 02:15:00
Persistent=true
RandomizedDelaySec=10m
[Install]
WantedBy=timers.target
What it means: Persistent=true means systemd will run it ASAP after boot if it missed the scheduled time. RandomizedDelaySec spreads load.
Decision: For backups, log rotation, and reporting jobs, Persistent=true is usually correct. For “page someone at 02:15 precisely” jobs, it’s not.
Task 7: Inspect the associated service unit (this is where reliability lives)
cr0x@server:~$ systemctl cat backup-nightly.service
# /etc/systemd/system/backup-nightly.service
[Unit]
Description=Nightly backup job
Wants=network-online.target
After=network-online.target
[Service]
Type=oneshot
User=backup
Group=backup
WorkingDirectory=/opt/jobs
ExecStart=/opt/jobs/backup-nightly.sh
TimeoutStartSec=3h
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=7
What it means: This is explicit: identity, working directory, timeout. Cron jobs often forget all of these details and rely on luck.
Decision: If your script depends on mounts, network, or specific directories, declare it here instead of encoding it in fragile script logic.
Task 8: Validate the schedule parsing (catch “Feb 31” energy before production)
cr0x@server:~$ systemd-analyze calendar "*-*-* 02:15:00"
Original form: *-*-* 02:15:00
Normalized form: *-*-* 02:15:00
Next elapse: Mon 2025-12-29 02:15:00 UTC
(in UTC): Mon 2025-12-29 02:15:00 UTC
From now: 3h 12min left
What it means: systemd tells you what it thinks you meant, and when it will next fire. This is your first-line defense against schedule misunderstandings.
Decision: If normalized form differs from your expectation, stop and fix it now. Don’t “wait and see.”
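If you want more than the next firing, project several upcoming elapses; this catches weekday and month-boundary surprises before they happen (same expression as above):
cr0x@server:~$ systemd-analyze calendar --iterations=3 "*-*-* 02:15:00"
It prints the normalized form once and then lists the next three elapse times.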
Task 9: Check whether a timer actually fired and what happened
cr0x@server:~$ journalctl -u backup-nightly.service -n 20 --no-pager
Dec 28 02:15:02 server systemd[1]: Starting backup-nightly.service - Nightly backup job...
Dec 28 02:15:03 server backup-nightly.sh[1142]: snapshot created: tank/backups@2025-12-28
Dec 28 03:01:29 server backup-nightly.sh[1142]: upload complete
Dec 28 03:01:29 server systemd[1]: backup-nightly.service: Deactivated successfully.
Dec 28 03:01:29 server systemd[1]: Finished backup-nightly.service - Nightly backup job.
What it means: You get start/finish lines plus script output, tied to a unit name. This is far cleaner than “where did that redirect go?”
Decision: If the output is too noisy, fix the script’s logging. Don’t throw away visibility by redirecting everything to /dev/null.
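When the unit is chatty, narrow the journal to the window around one run instead of scrolling (timestamps here are illustrative):
cr0x@server:~$ journalctl -u backup-nightly.service --since "2025-12-28 02:00" --until "2025-12-28 03:10" --no-pager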
Task 10: Prove whether the last run failed (and how)
cr0x@server:~$ systemctl status backup-nightly.service --no-pager
● backup-nightly.service - Nightly backup job
Loaded: loaded (/etc/systemd/system/backup-nightly.service; static)
Active: inactive (dead) since Sun 2025-12-28 03:01:29 UTC; 21h ago
Duration: 46min 26.113s
Process: 1142 ExecStart=/opt/jobs/backup-nightly.sh (code=exited, status=0/SUCCESS)
What it means: You have the exit status, the runtime duration, and the exact command invoked.
Decision: If status isn’t 0/SUCCESS, don’t guess. Extract the exit code and handle it: retry, alert, or fail fast.
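If a wrapper or monitoring check needs the outcome in machine-readable form, the same data is available as unit properties:
cr0x@server:~$ systemctl show backup-nightly.service -p ExecMainStatus -p Result
Result distinguishes success from exit-code, timeout, and signal, which is exactly the split you want when deciding between “retry” and “page someone.”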
Task 11: Detect overlapping runs (the silent killer of backups and ETL)
cr0x@server:~$ systemctl show backup-nightly.service -p ExecMainStartTimestamp -p ExecMainExitTimestamp -p ActiveEnterTimestamp -p ActiveExitTimestamp
ExecMainStartTimestamp=Sun 2025-12-28 02:15:03 UTC
ExecMainExitTimestamp=Sun 2025-12-28 03:01:29 UTC
ActiveEnterTimestamp=Sun 2025-12-28 02:15:02 UTC
ActiveExitTimestamp=Sun 2025-12-28 03:01:29 UTC
What it means: You can pull structured timestamps to see if the job duration is approaching the schedule interval.
Decision: If runtime is regularly close to the interval, add concurrency protection and consider widening the schedule.
Task 12: Check for time sync issues that make schedules lie
cr0x@server:~$ timedatectl
Local time: Mon 2025-12-29 00:03:11 UTC
Universal time: Mon 2025-12-29 00:03:11 UTC
RTC time: Mon 2025-12-29 00:03:11
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
What it means: If clock sync is off, both cron and timers become chaotic. “It ran at 2am” stops meaning anything.
Decision: Fix time first. Debugging scheduling on a host with a drifting clock is like debugging storage on a server with loose SATA cables.
Task 13: Verify a mount dependency (common for backups, reports, ingest jobs)
cr0x@server:~$ systemctl status mnt-backups.mount --no-pager
● mnt-backups.mount - /mnt/backups
Loaded: loaded (/proc/self/mountinfo; generated)
Active: active (mounted) since Sun 2025-12-28 00:00:41 UTC; 1 day 0h ago
Where: /mnt/backups
What: /dev/mapper/vg0-backups
What it means: Your target mount exists and is active. If it isn’t, your job might write to the root filesystem by accident.
Decision: Add explicit mount ordering using RequiresMountsFor=/mnt/backups in the service unit.
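You don’t need to rewrite the unit to add that ordering; a drop-in is enough. A minimal sketch, using the unit from this example:
cr0x@server:~$ sudo systemctl edit backup-nightly.service
Then add these lines to the drop-in that opens:
[Unit]
RequiresMountsFor=/mnt/backups
systemctl edit reloads the manager configuration when you save, so no separate daemon-reload is needed.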
Task 14: Confirm the environment your job actually sees under systemd
cr0x@server:~$ systemctl show backup-nightly.service -p Environment -p User -p Group -p WorkingDirectory
Environment=
User=backup
Group=backup
WorkingDirectory=/opt/jobs
What it means: No implicit environment variables are being set. If your script needs AWS_REGION or a PATH tweak, you must declare it.
Decision: Put required environment in an EnvironmentFile= with tight permissions, or use explicit full paths in scripts. I prefer explicit full paths for utilities that must not change behavior.
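A minimal sketch of the EnvironmentFile pattern; the file path and the variable value are placeholders for illustration, not something this backup script necessarily needs:
cr0x@server:~$ sudo install -m 640 -o root -g backup /dev/null /etc/backup-nightly.env
cr0x@server:~$ echo 'AWS_REGION=placeholder-region' | sudo tee /etc/backup-nightly.env > /dev/null
cr0x@server:~$ sudo systemctl edit backup-nightly.service
Add to the drop-in:
[Service]
EnvironmentFile=/etc/backup-nightly.env
The file is root-owned and readable only by root and the service group; the unit file itself stays free of secrets.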
Task 15: Rate limiting and start bursts (why your retries “did nothing”)
cr0x@server:~$ systemctl show backup-nightly.service -p StartLimitIntervalUSec -p StartLimitBurst
StartLimitIntervalUSec=10s
StartLimitBurst=5
What it means: systemd will stop trying after a burst of failures within the limit interval. This prevents flapping from turning into self-DoS.
Decision: If you use retries, set limits consciously. Otherwise your “restart on failure” plan may quietly stop working after five fast failures.
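If you rely on retries, make the limits explicit instead of inheriting defaults you’ve never read. A sketch of a drop-in with illustrative values (assumptions, not a recommendation):
[Unit]
StartLimitIntervalSec=10min
StartLimitBurst=3
This says: if the unit attempts to start more than three times within ten minutes, further starts are refused and it lands in a failed state (start-limit-hit) that you can alert on.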
How to build a timer + service that won’t embarrass you
A good migration is mostly about deciding what your job is: identity, dependencies, timeouts, concurrency, and output. Cron never forced you to decide those things. systemd will, and that’s the point.
Start with the service unit
Write the service first. The timer should be the boring part.
- Use Type=oneshot for scripts that run and exit. Don’t pretend it’s a daemon.
- Set WorkingDirectory= if your script expects relative paths. Better: remove relative paths, but reality is messy.
- Use full paths in ExecStart= and inside scripts for critical tooling. Relying on PATH is how you get “works in shell, fails at 2am.”
- Set timeouts. If the job can hang, it will hang. Give systemd permission to kill it.
- Declare dependencies like mounts and network online. If the job needs a mount, say so. If it needs DNS, say so.
Then write the timer unit
Timers are compact, but they have two fields that define your reliability posture:
- Persistent=true: catches up missed runs after downtime.
- RandomizedDelaySec=: avoids herds. Great for fleets, dangerous for “exact time” jobs.
Concrete example: migrate a nightly backup job
Here’s a robust baseline. Not “perfect.” But good enough to ship without creating a new on-call hobby.
cr0x@server:~$ sudo tee /etc/systemd/system/backup-nightly.service > /dev/null <<'EOF'
[Unit]
Description=Nightly backup job
Wants=network-online.target
After=network-online.target
RequiresMountsFor=/mnt/backups
[Service]
Type=oneshot
User=backup
Group=backup
WorkingDirectory=/opt/jobs
ExecStart=/opt/jobs/backup-nightly.sh
TimeoutStartSec=3h
KillMode=control-group
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=7
# Basic hardening without breaking scripts:
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/mnt/backups /opt/jobs /var/log
EOF
cr0x@server:~$ sudo tee /etc/systemd/system/backup-nightly.timer > /dev/null <<'EOF'
[Unit]
Description=Nightly backup timer
[Timer]
OnCalendar=*-*-* 02:15:00
Persistent=true
RandomizedDelaySec=10m
AccuracySec=1m
Unit=backup-nightly.service
[Install]
WantedBy=timers.target
EOF
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl enable --now backup-nightly.timer
Created symlink '/etc/systemd/system/timers.target.wants/backup-nightly.timer' → '/etc/systemd/system/backup-nightly.timer'.
Why these choices:
- RequiresMountsFor= prevents the classic “backup wrote to /mnt/backups which wasn’t mounted, so actually it filled /” incident.
- KillMode=control-group makes sure child processes don’t escape if the job times out.
- Basic hardening reduces blast radius. If your script needs more access, explicitly allow it. Let the unit file be the contract.
- AccuracySec= tells systemd how precisely it must fire; 1m matches the default, but stating it makes the tolerance explicit and stops anyone from “fixing” it to the exact second later.
Jitter and “why did it run at 02:23?”
If you enable RandomizedDelaySec, the job will run sometime in that window. This is not a bug. It’s a fleet-safety feature. If finance wants the report at 02:15:00, don’t randomize it. If the job hits a shared storage array, randomize it and enjoy sleeping.
Joke #2: RandomizedDelaySec is the polite version of “everyone stop touching the storage array at midnight.”
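If you want jitter across a fleet but a stable offset on each host (so a given machine lands at roughly the same minute every night), the timer can pin its random delay. A sketch, assuming a reasonably recent systemd such as the one shipped with Debian 13:
[Timer]
RandomizedDelaySec=10m
FixedRandomDelay=true
FixedRandomDelay=true derives the delay from the machine and unit identity instead of rolling new dice on every activation.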
Reliability features you should actually use
systemd gives you a lot of knobs. Using all of them is how you create a unit file that nobody wants to touch. Use the knobs that buy reliability per line of config.
1) Catch up missed runs: Persistent=true
If a server is down at 02:15, cron shrugs. Timers can run on boot. For backups, log rotation, periodic cleanup, and “eventually consistent” data tasks, persistent timers are a straightforward win.
Don’t use it when your job must run only at a specific wall-clock time (e.g., coordinated market actions). In those cases, treat the schedule as a contract and handle downtime explicitly.
2) Avoid the thundering herd: RandomizedDelaySec
If you have more than a handful of hosts, don’t schedule everything at the top of the hour. Your DNS, your object store, your database, and your shared storage will all notice at once.
Randomization is cheap insurance. You trade exactness for system-wide stability.
3) Timeouts: TimeoutStartSec and friends
Every job that talks to the network can hang. Every job that touches storage can hang. If your script can’t hang, congratulations: you’ve never met NFS on a bad day.
Pick a timeout based on your SLO. If the job normally takes 10 minutes, a 3-hour timeout is not “safe,” it’s “you won’t notice it’s broken.” Use realistic budgets.
4) Concurrency control: one run at a time
cron’s default is “fire and forget,” which becomes “fire twice and regret.” With systemd, you can design for non-overlap.
A practical pattern:
- Make the service refuse to start if another instance is active.
- Or put a lock in the script with flock, but treat that lock as part of the contract (and log when you skip).
A timer won’t re-trigger a service that is still running, so the simple case is covered by design; the overlaps that hurt come from manual runs, multiple units touching the same data, and children that outlive the script. systemd doesn’t hand you a single “no overlap, ever” flag, but it does make overlaps visible and controllable: you can see active units, define timeouts, and enforce a single execution path.
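A minimal sketch of the flock pattern applied at the unit level, so a manual systemctl start and the timer contend for the same lock (the lock path is an assumption; this would replace the ExecStart shown earlier):
[Service]
Type=oneshot
ExecStart=/usr/bin/flock -n /run/lock/backup-nightly.lock /opt/jobs/backup-nightly.sh
With -n, a second invocation exits nonzero immediately instead of queueing; decide up front whether “skipped because locked” should alert or just log.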
5) Dependencies: mounts and network are not “probably there”
The worst failures are not “job failed.” The worst failures are “job succeeded at doing the wrong thing.” If a backup runs without the backup mount, it can fill root and still exit 0.
Use RequiresMountsFor= for any path that must be mounted. Use After=network-online.target only if you truly need network; otherwise you slow boot and complicate ordering.
6) Environment: make it explicit or eliminate it
Cron’s environment is famously minimal. systemd’s is also minimal, but in a different way. Relying on interactive shell initialization is how you get “it works when I run it.” That phrase should trigger a small internal alarm.
If you need secrets or configuration, use an EnvironmentFile= readable only by the service user, or load them from a root-owned file with strict permissions. Don’t bake secrets into unit files.
7) Resource control: stop jobs from eating the host
This is where systemd quietly beats cron by a mile. With cgroups, you can cap or prioritize jobs. A compression-heavy report job shouldn’t starve production services.
Useful controls include Nice, IOSchedulingClass, and (when you’re ready) CPU/IO controls. Start small: priority and timeouts already cover most “runaway job” incidents.
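When priority and timeouts aren’t enough, the cgroup knobs are one drop-in away. A sketch with illustrative values (assumptions to tune per host, not recommendations):
[Service]
CPUQuota=50%
MemoryMax=2G
IOWeight=50
CPUQuota caps CPU time, MemoryMax hard-limits memory, and IOWeight deprioritizes the job’s I/O relative to the default weight of 100.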
8) Logging: stop throwing output into the void
Journald makes it easy to capture output consistently. Keep stdout/stderr. Make scripts log in a structured way if possible (even simple “key=value” lines are gold during incidents).
Then monitor it. A scheduled job without monitoring is a scheduled surprise.
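One low-effort way to make the journal actionable is an OnFailure= hook: when the job fails, systemd starts another unit that pushes the event into whatever alerting you already have. The notify script below is hypothetical; the template-unit and %n mechanism is standard systemd:
# add to the job's [Unit] section:
OnFailure=notify-failure@%n.service
# /etc/systemd/system/notify-failure@.service (sketch)
[Unit]
Description=Send failure notification for %i
[Service]
Type=oneshot
# Hypothetical script that posts the failed unit name to your alerting system.
ExecStart=/usr/local/bin/notify-failure %i
This covers “it failed”; you still need a separate check for “it never ran,” which no failure hook can see.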
Fast diagnosis playbook
This is the order I check things when a timer-backed job “didn’t run” or “ran but nothing happened.” The goal is to identify the bottleneck in minutes, not in a Slack thread that lasts until lunch.
First: did systemd think it fired?
- Check systemctl list-timers --all for LAST and NEXT.
- Check systemctl status yourjob.timer for activation state.
- Decision: If LAST is missing or ancient, it’s scheduling/enablement. If LAST is recent, it’s execution.
Second: did the service start and what exit code did it return?
- Run systemctl status yourjob.service and capture exit status.
- Decision: Non-zero exit means job-level failure; zero means “it ran” and your problem is likely “it ran wrong” or “it did nothing useful.”
Third: what do the logs say right around the run?
- Use journalctl -u yourjob.service with time filters or -n.
- Decision: If logs are empty, you might be hitting the wrong unit, wrong timer target, or wrong host. If logs show a hang, look at dependencies or timeouts.
Fourth: verify prerequisites (mounts, network, DNS, credentials)
- Check mounts with systemctl status mnt-*.mount or findmnt.
- Check time sync with timedatectl.
- Decision: If prerequisites aren’t stable, fix them; don’t paper over with retries.
Fifth: look for overlap and backpressure
- Check job duration and frequency. If the job takes longer than its interval, you’ll get overlaps or perpetual backlog.
- Decision: If overlap exists, enforce a single-run policy and adjust schedule or optimize safely.
Common mistakes: symptoms → root cause → fix
This section exists because these mistakes show up over and over, especially during cron-to-timer migrations. If you see the symptom, don’t debate. Jump to the root cause.
1) Symptom: “The timer is enabled, but the job never runs”
Root cause: The timer unit is enabled but its schedule is malformed, or it isn’t properly installed into timers.target.
Fix: Validate schedule and confirm enablement.
cr0x@server:~$ systemd-analyze calendar "Mon..Fri *-*-* 02:15:00"
Original form: Mon..Fri *-*-* 02:15:00
Normalized form: Mon..Fri *-*-* 02:15:00
Next elapse: Mon 2025-12-29 02:15:00 UTC
From now: 3h 12min left
cr0x@server:~$ systemctl is-enabled backup-nightly.timer
enabled
2) Symptom: “It runs manually, but fails from the timer”
Root cause: Environment differences: missing PATH, missing working directory, missing credentials, different user.
Fix: Make the unit define the contract: User=, WorkingDirectory=, full paths, EnvironmentFile=.
3) Symptom: “Job ran, exit 0, but produced no output / did nothing”
Root cause: Script checks some state and exits silently, or wrote output somewhere unexpected, or the job ran against the wrong instance (wrong config path).
Fix: Add explicit logging. Also ensure the unit file points at the correct script and config. Review WorkingDirectory and absolute paths.
4) Symptom: “It ran after reboot at a weird time”
Root cause: Persistent=true caused a catch-up run, possibly combined with RandomizedDelaySec.
Fix: Decide if catch-up is desired. If not, remove Persistent=true. If yes, communicate that behavior to stakeholders and monitor it.
5) Symptom: “Two instances ran at once and corrupted state”
Root cause: Service allows concurrent starts, timer interval is shorter than worst-case runtime, or manual run overlapped scheduled run.
Fix: Add locking (script-level with flock), increase interval, and set a realistic timeout. Also consider making manual operations go through systemctl start so they’re visible.
6) Symptom: “It failed once and then never ran again”
Root cause: Start rate limiting was hit due to rapid failures, or the timer is fine but the service is failing immediately and being suppressed.
Fix: Inspect status and logs; tune start limits if you intentionally retry, and fix the underlying failure.
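If the start limit was hit, the unit sits in a failed state (start-limit-hit) and further start requests are refused until the interval passes or you clear it. After fixing the underlying failure:
cr0x@server:~$ sudo systemctl reset-failed backup-nightly.service
cr0x@server:~$ sudo systemctl start backup-nightly.service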
7) Symptom: “Backups filled root filesystem”
Root cause: Backup destination mount was missing; script wrote to a directory that existed on root.
Fix: Use RequiresMountsFor= and write backups to a path that only exists when mounted (or at least verify mount identity in the script).
8) Symptom: “Cron used to email failures; now we see nothing”
Root cause: Cron’s mail behavior was acting as accidental alerting. systemd logs to the journal, but nobody is watching it.
Fix: Establish monitoring: scrape unit state, alert on failures, alert on missing runs, and optionally forward journal logs to your log pipeline.
Three corporate-world mini-stories
Mini-story 1: The incident caused by a wrong assumption
A team migrated a database export job from cron to a systemd timer. The old cron entry ran as postgres. The new service ran as root because “root can do anything.” That assumption lasted exactly one week.
The export script wrote to a directory owned by postgres and used a local socket expecting ~postgres defaults. Under root, it still connected—sometimes—but created output files owned by root with restrictive permissions. The downstream job, still running as postgres, started failing to read the exports. The first sign was not a clean failure; it was a stale report and an executive asking why numbers didn’t move.
On-call checked the database first (of course). Then storage (also of course). Hours later, someone noticed the export directory had files with mixed ownership and timestamps that didn’t line up with the reporting pipeline.
The fix wasn’t heroic: set User=postgres, lock down WorkingDirectory, and stop using implicit home directory behavior. They also added a unit-level ReadWritePaths= to make it impossible for the job to scribble elsewhere. The incident ended with a simple postmortem line: “We assumed root was safer. It was just less visible.”
Mini-story 2: The optimization that backfired
A company had a fleet of Debian hosts running log compaction and shipping every five minutes. Someone decided to “optimize boot time” by removing network-online.target dependencies from scheduled services. Faster boot, fewer ordering constraints. On paper, it looked clean.
In practice, the timer fired soon after boot, before DNS was consistent and before a particular overlay network interface was stable. The job didn’t always fail loudly. Sometimes it queued data locally and exited 0. Sometimes it wrote to a fallback destination. Sometimes it stalled on a socket connect until the default timeout (which was effectively “a while”).
After a week, storage alarms started: boot storms (planned maintenance) correlated with log backlogs, which correlated with ingestion lag that made dashboards lie. The problem with the “optimization” wasn’t that the job logic was wrong; it was that the system now behaved differently under boot churn.
The boring fix was to put dependencies back—but only the correct ones. They replaced broad “wait for network” with a mount dependency and a small preflight check that validated name resolution for the specific endpoint. They also set a sane TimeoutStartSec and logged failures explicitly. Boot was slightly slower. Operations got dramatically calmer.
Mini-story 3: The boring but correct practice that saved the day
A finance-adjacent team ran monthly reconciliation reports. Everyone hated touching it because the script was old and the business rules were fiddly. What they did right was not clever: they treated the scheduled job like a service with an SLO.
They migrated to a systemd timer and added two checks: (1) alert if the service fails, and (2) alert if the timer hasn’t run successfully within the expected window. They also pinned the environment via an EnvironmentFile and used explicit full paths for every tool invoked.
One month, a routine package upgrade changed behavior of a parsing utility (same name, different default output formatting). The script still ran, but produced output that didn’t match downstream validation. The job exited non-zero due to a strict check, and the unit went to failed state.
Because the team had “boring monitoring,” the failure was caught within minutes during business hours. They rolled back the package on that host, adjusted parsing to be explicit, and re-ran the service manually through systemd so the rerun was logged and attributable. No late-night panic, no hand-waved numbers, no emergency meeting. Just a clean failure and a clean fix.
Checklists / step-by-step plan
This is a practical migration plan that minimizes surprises. Use it whether you’re migrating one job or a hundred.
Step 1: Inventory and classify jobs
- List system cron directories and user crontabs.
- Classify each job: critical, important, nice-to-have.
- Decide desired behavior after downtime: catch up or skip?
Step 2: Define the execution contract for each job
- What user should run it?
- What directories must exist and be writable?
- What mounts must be present?
- What network conditions are required?
- What’s the worst acceptable runtime?
- Is overlap allowed?
Step 3: Build the systemd service unit first
- Start with Type=oneshot.
- Add User=, Group=, WorkingDirectory=.
- Add a realistic TimeoutStartSec=.
- Add mount and network dependencies only when required.
- Add minimal hardening: NoNewPrivileges, PrivateTmp, and scoped write paths.
Step 4: Add the timer unit
- Validate the schedule with systemd-analyze calendar.
- Use Persistent=true where missed runs must be caught up.
- Use RandomizedDelaySec on fleets to reduce load spikes.
Step 5: Test like you mean it
- Manually start the service: systemctl start yourjob.service.
- Check status and logs.
- Simulate failure modes: missing mount, DNS failure, permission issue.
Step 6: Run in parallel briefly (carefully)
- If safe, run the systemd job in “dry run” mode while cron remains primary; or cut over to the timer and keep the cron definition around so you can revert quickly.
- Never run two stateful jobs in parallel unless you have explicit locking and you are sure the side effects are safe.
Step 7: Cut over and monitor
- Disable the cron entry once the timer is enabled and verified.
- Alert on failures and on missing successful runs.
- Review logs after the first few real executions.
Step 8: Reduce risk over time
- Improve scripts to be idempotent.
- Add structured logging.
- Make dependencies explicit and remove accidental ones.
FAQ
1) Are systemd timers always “better” than cron?
No. They’re better when you need reliability features and observability. Cron is fine for simple, low-impact tasks. The win is not “modernity,” it’s operational clarity.
2) What’s the systemd equivalent of “run every 5 minutes”?
Use OnUnitActiveSec=5min (monotonic) or an OnCalendar expression like *:0/5 depending on what you mean by “every 5 minutes.” Monotonic timers run relative to the last activation; calendar timers align to wall-clock boundaries.
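A sketch of both flavors; note that a purely monotonic timer needs an anchor like OnBootSec= for its first run, because OnUnitActiveSec= is measured from the last time the service was activated:
# calendar-aligned: fires at :00, :05, :10, ...
[Timer]
OnCalendar=*:0/5
# monotonic: fires 5 minutes after boot, then 5 minutes after each activation
[Timer]
OnBootSec=5min
OnUnitActiveSec=5min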
3) How do I prevent missed runs after a reboot?
Use Persistent=true in the timer. Then ensure your job is safe to run after downtime (idempotent or state-aware).
4) Why does my job run with a different PATH than in my shell?
Because scheduled jobs shouldn’t inherit your interactive shell setup. Fix it by using full paths in scripts, or explicitly setting Environment=PATH=... (or an EnvironmentFile) in the service unit.
5) Should I put the logic in the timer or in the script?
Put scheduling in the timer and execution semantics in the service. Put business logic in the script/app. Don’t re-implement scheduling and retries in shell if systemd can do it with clearer behavior.
6) What replaces “cron emailed me the output”?
Two things: (1) journald logs for stdout/stderr, and (2) monitoring that alerts on failed units and missing successful runs. Email is not monitoring; it’s a nostalgia trap.
7) Can I run timers for non-root users?
Yes. You can use system timers running services as a specific User=, or user-level systemd units (lingering may be required). For servers, system units with explicit User= are usually easier to manage centrally.
8) How do I handle dependencies on mounts and network cleanly?
Use RequiresMountsFor= for filesystem paths and After=network-online.target/Wants=network-online.target only when the job truly requires stable networking. Avoid over-broad dependencies that slow boot and hide real readiness checks.
9) What about jobs that must never overlap?
Assume overlap will happen unless you prevent it. Use lock semantics (often flock) and ensure your monitoring distinguishes “skipped due to lock” from “ran successfully.” Also tune intervals and timeouts so the system behavior is predictable.
10) Is it safe to disable cron entirely on Debian 13?
Sometimes. But many packages still ship cron-based maintenance tasks. Disabling cron globally can create subtle breakage. Prefer migrating your own jobs first, then decide whether cron should remain installed for packaged maintenance.
Conclusion: next steps that pay off
If you run production Debian 13 systems and you care about not missing scheduled work, systemd timers are the default you should reach for. They don’t eliminate failure. They make failure legible, testable, and monitorable.
Practical next steps:
- Inventory cron jobs and tag the critical ones.
- Migrate one critical job to a timer + service with explicit user, working directory, timeout, and mount dependencies.
- Turn on Persistent=true where catch-up behavior is desirable, and add jitter where fleet load matters.
- Wire monitoring: alert on service failures and on “no successful run in window.”
- Only then start deleting cron entries. Keep the old definitions in version control, not in the dumpster.
The best scheduling system is the one that makes “did it run?” a two-minute question with a boring answer. systemd gets you there—if you treat unit files as operational contracts, not as a new place to hide shell scripts.