Ubuntu 24.04: Services in restart loop — stop the loop and catch the root error

You know the smell: dashboards full of red, pager screaming, and a service that “starts” hundreds of times per minute while doing exactly zero useful work. On Ubuntu 24.04, systemd is not being dramatic. It’s doing what you told it to do—restart a failing process—until it either hits a rate limit or turns your logs into confetti.

This is how you stop the loop without losing the evidence, extract the first real error (not the 500th), and then fix the underlying cause. We’ll do it like operators: minimal heroics, maximal signal, and with a bias toward changes you can defend in a postmortem.

What a restart loop actually means in systemd

A “restart loop” is just a feedback loop between a unit’s restart policy and a process that returns to systemd too quickly. The process exits (crash, misconfiguration, missing dependency, permission issue), systemd applies the unit file’s Restart= rules, and tries again. Repeat. If it’s fast enough, systemd eventually says “enough” using start rate limiting (StartLimitIntervalSec + StartLimitBurst) and marks it failed.
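
For orientation, here’s a minimal sketch of the directives involved. The values are illustrative, not recommendations for any particular service:

[Unit]
# Circuit breaker: allow at most 5 start attempts in a 10-second window,
# then mark the unit failed instead of trying again.
StartLimitIntervalSec=10
StartLimitBurst=5

[Service]
# Restart policy: restart after unclean exits, waiting 2 seconds between attempts.
Restart=on-failure
RestartSec=2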

Two key operator points:

  • The root error is almost always in the first few attempts. After that you mostly get repeated noise: “Main process exited” and “Scheduled restart job.”
  • systemd doesn’t know “healthy,” it knows “process exited.” A service can be “running” and still be useless, or be marked “failed” because it daemonized into the background when systemd expected it to stay in the foreground.

There’s a common anti-pattern here: “Fixing” the loop by setting Restart=no and calling it a day. That’s not fixing; that’s turning off the smoke alarm because it’s loud.

Fast diagnosis playbook (do this first)

This is the order that saves time under pressure. It’s optimized for “find the bottleneck quickly,” not for theoretical completeness.

1) Confirm you’re dealing with a restart loop and capture its tempo

  • Check whether systemd is actively restarting it, and how fast.
  • Look for rate limiting (“Start request repeated too quickly”), which means you missed the earliest errors and need to pull logs by time window.

2) Pull the earliest failure logs, not the latest

  • Use journalctl with --since (and --until) to window in on the first bad line after a start, rather than scrolling back from the newest entries.
  • Get exit status and signal info from systemd’s own messages.

3) Freeze the patient (stop the loop) without deleting evidence

  • Mask or stop the unit, or temporarily set Restart=no via an override.
  • Do not wipe the journal. Do not reboot “just to clear it.” That’s how you lose the only proof you had.

4) Classify the failure mode

Most loops fall into one of these buckets:

  • Immediate exit: bad config, missing env var, wrong working directory, invalid flag, binary missing.
  • Crash: segfault, abort, illegal instruction (often a library or CPU feature mismatch), OOM.
  • Readiness mismatch: the service considers itself started, but the unit’s Type= expects a readiness signal it never sends, or the process forks when systemd expects it to stay in the foreground.
  • Dependency failure: network not ready, DNS broken, mount not present, permissions/SELinux/AppArmor.
  • Resource limits: file descriptors, memlock, tasks, ulimit differences under systemd.

5) Reproduce with the same environment systemd uses

  • Run the same ExecStart in a clean environment, or use systemd-run to emulate.
  • Check unit sandboxing options that may block file access (e.g., ProtectSystem=, PrivateTmp=).

Stop the loop safely (and keep evidence)

If a service is flapping, it’s actively damaging your system: it burns CPU, spams logs, reopens sockets, churns caches, and can trigger external rate limits. Your first job is to stop the bleeding without erasing the crime scene.

Option A: Stop the unit (short-term pause)

This is the quick “pause” button. systemd may still try to restart if other units depend on it and pull it in, so verify after stopping.

Option B: Mask the unit (hard stop)

Masking prevents starts (manual or dependency-triggered). It’s the right move when the loop is causing collateral damage and you need a stable host to investigate.

Option C: Temporary override: disable restart policy

This is cleaner than editing vendor unit files. You create a drop-in override to change Restart= and optionally add a longer RestartSec= so you can read logs between attempts.
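
A minimal sketch of that drop-in, reusing this article’s myapp.service; the values are illustrative:

cr0x@server:~$ sudo systemctl edit myapp.service
# In the editor that opens, add:
[Service]
Restart=no
RestartSec=30
# Save and exit; systemctl edit writes the drop-in and reloads systemd.
# Undo later with: sudo systemctl revert myapp.service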

Joke #1: A restart loop is like an intern with infinite enthusiasm and zero context—fast, persistent, and somehow making everything worse.

Catch the root error: logs, exit codes, and coredumps

systemd is good at telling you that something died. You need it to tell you why. The “why” is usually in one of these places:

  • journal logs for the unit and its dependencies
  • systemd’s exit metadata: status code, signal, core dump notes
  • application logs: file logs, stderr/stdout capturing, structured logs
  • coredumps: for real crashes
  • kernel logs: OOM killer, segfault, mount errors

One quote you can safely build an incident response around:

“Hope is not a strategy.” — paraphrased idea often credited in engineering management circles

Translation: stop guessing. Capture evidence. Make one change at a time.

Practical tasks: commands, outputs, decisions (12+)

These are the tasks I actually run when a service is flapping on Ubuntu 24.04. Each one includes what the output means and what decision follows.

Task 1: Confirm current state, recent failures, and restart counters

cr0x@server:~$ systemctl status myapp.service --no-pager -l
● myapp.service - MyApp API
     Loaded: loaded (/lib/systemd/system/myapp.service; enabled; preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Mon 2025-12-29 10:14:07 UTC; 2s ago
    Process: 18422 ExecStart=/usr/local/bin/myapp --config /etc/myapp/config.yml (code=exited, status=1/FAILURE)
   Main PID: 18422 (code=exited, status=1/FAILURE)
        CPU: 38ms

Dec 29 10:14:07 server systemd[1]: myapp.service: Main process exited, code=exited, status=1/FAILURE
Dec 29 10:14:07 server systemd[1]: myapp.service: Failed with result 'exit-code'.
Dec 29 10:14:07 server systemd[1]: myapp.service: Scheduled restart job, restart counter is at 17.

What it means: It exits with status 1 quickly and has restarted 17 times. That’s not a “slow startup,” it’s a deterministic failure.

Decision: Stop the loop and pull the earliest logs around the first restart. Don’t waste time tuning timeouts yet.

Task 2: Stop the loop immediately (pause)

cr0x@server:~$ sudo systemctl stop myapp.service
cr0x@server:~$ systemctl is-active myapp.service
inactive

What it means: systemd stopped it. If it flips back to “activating,” something else is pulling it in.

Decision: If it restarts due to dependencies, mask it (next task).

Task 3: Mask it to prevent restarts triggered by dependencies

cr0x@server:~$ sudo systemctl mask myapp.service
Created symlink /etc/systemd/system/myapp.service → /dev/null.
cr0x@server:~$ systemctl status myapp.service --no-pager -l
● myapp.service
     Loaded: masked (Reason: Unit myapp.service is masked.)
     Active: inactive (dead)

What it means: It cannot be started until unmasked. This is reversible and leaves logs intact.

Decision: Investigate calmly. Unmask only when you’re ready to test again.

Task 4: Pull unit-specific logs for the current boot (last 200 lines)

cr0x@server:~$ journalctl -u myapp.service -b --no-pager -n 200
Dec 29 10:13:59 server myapp[18391]: FATAL: cannot read config file: open /etc/myapp/config.yml: permission denied
Dec 29 10:13:59 server systemd[1]: myapp.service: Main process exited, code=exited, status=1/FAILURE
Dec 29 10:13:59 server systemd[1]: myapp.service: Failed with result 'exit-code'.
Dec 29 10:14:00 server systemd[1]: myapp.service: Scheduled restart job, restart counter is at 1.

What it means: This is gold: permission denied reading config.

Decision: Fix permissions/ownership, or check hardening options that changed the filesystem view.

Task 5: Identify the exact ExecStart and unit settings

cr0x@server:~$ systemctl cat myapp.service
# /lib/systemd/system/myapp.service
[Unit]
Description=MyApp API
After=network-online.target
Wants=network-online.target

[Service]
User=myapp
Group=myapp
ExecStart=/usr/local/bin/myapp --config /etc/myapp/config.yml
Restart=always
RestartSec=1
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true

[Install]
WantedBy=multi-user.target

What it means: ProtectSystem=strict mounts essentially the entire filesystem read-only for this service (except /dev, /proc, /sys, and paths you explicitly allow); ProtectHome=true blocks access to /home. If the service needs to write under a protected path, or read auxiliary files from /home, you can break it “by improving security.”

Decision: If permissions look fine, suspect unit hardening. Either adjust paths or relax specific options with surgical overrides.
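
If hardening turns out to be the blocker, prefer a surgical drop-in over deleting ProtectSystem=. A sketch, assuming the app needs to write under /var/lib/myapp (the state path this article’s app uses later):

[Service]
# Keep ProtectSystem=strict, but allow writes to the app’s state path.
ReadWritePaths=/var/lib/myapp

StateDirectory=myapp (see Task 15) achieves the same result more declaratively and also fixes ownership.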

Task 6: Check permissions and SELinux/AppArmor constraints (Ubuntu: usually AppArmor)

cr0x@server:~$ namei -l /etc/myapp/config.yml
f: /etc/myapp/config.yml
drwxr-xr-x root root /
drwxr-xr-x root root etc
drwx------ root root myapp
-rw------- root root config.yml

What it means: Directory /etc/myapp is 700, and the file is 600 owned by root. If the unit runs as user myapp, it can’t read it.

Decision: Change ownership or ACLs. Don’t run the service as root “because it works.” That’s how you end up with a different incident next week.

Task 7: Fix ownership safely (example) and validate

cr0x@server:~$ sudo chown -R root:myapp /etc/myapp
cr0x@server:~$ sudo chmod 750 /etc/myapp
cr0x@server:~$ sudo chmod 640 /etc/myapp/config.yml
cr0x@server:~$ sudo -u myapp test -r /etc/myapp/config.yml && echo OK
OK

What it means: The service user can read the file now.

Decision: Unmask and start once; watch logs. If it still fails, you’ve eliminated one root cause and kept the change minimal.

Task 8: Unmask, start once, and follow logs live

cr0x@server:~$ sudo systemctl unmask myapp.service
Removed "/etc/systemd/system/myapp.service".
cr0x@server:~$ sudo systemctl start myapp.service
cr0x@server:~$ journalctl -u myapp.service -f --no-pager
Dec 29 10:22:41 server myapp[19012]: INFO: loaded config /etc/myapp/config.yml
Dec 29 10:22:41 server myapp[19012]: INFO: listening on 0.0.0.0:8080

What it means: It starts and stays up (no further restarts appear).

Decision: If stable, re-enable monitoring alerts and proceed to verify dependencies (DB, storage, upstream).

Task 9: If logs are noisy, find the first failure window by time

cr0x@server:~$ journalctl -u myapp.service --since "2025-12-29 10:10:00" --until "2025-12-29 10:15:00" --no-pager
Dec 29 10:13:59 server myapp[18391]: FATAL: cannot read config file: open /etc/myapp/config.yml: permission denied
Dec 29 10:13:59 server systemd[1]: myapp.service: Main process exited, code=exited, status=1/FAILURE
Dec 29 10:14:00 server systemd[1]: myapp.service: Scheduled restart job, restart counter is at 1.

What it means: You can isolate the earliest error even if restarts spam the journal.

Decision: Always time-box journal queries during restart storms. It keeps you sane and makes incident notes credible.

Task 10: Inspect dependency chain and ordering

cr0x@server:~$ systemctl list-dependencies --reverse myapp.service
myapp.service
● multi-user.target

What it means: Not much depends on it (good). If the reverse list is huge, stopping it might break other units.

Decision: For large dependency trees, mask carefully and communicate impact. “I stopped the thing” is not a complete plan.

Task 11: Catch crashes: check coredumps

cr0x@server:~$ coredumpctl list myapp
TIME                            PID   UID   GID SIG COREFILE  EXE
Mon 2025-12-29 09:58:12 UTC   17021  1001  1001  11 present   /usr/local/bin/myapp
cr0x@server:~$ coredumpctl info 17021
           PID: 17021 (myapp)
           UID: 1001 (myapp)
           GID: 1001 (myapp)
        Signal: 11 (SEGV)
     Timestamp: Mon 2025-12-29 09:58:12 UTC (17min ago)
  Command Line: /usr/local/bin/myapp --config /etc/myapp/config.yml
    Executable: /usr/local/bin/myapp
 Control Group: /system.slice/myapp.service
          Unit: myapp.service
       Message: Process 17021 (myapp) of user 1001 dumped core.

What it means: If you see SIGSEGV/SIGABRT, you’re in real crash territory: corrupted config parsing, incompatible library, or memory bug.

Decision: Stop trying random unit tweaks. Collect the core, build ID, and package versions; then reproduce in staging or with the vendor.

Task 12: Check kernel logs for OOM kills and segfaults

cr0x@server:~$ journalctl -k -b --no-pager -n 200
Dec 29 10:01:44 server kernel: Out of memory: Killed process 17555 (myapp) total-vm:812340kB, anon-rss:612220kB, file-rss:1100kB, shmem-rss:0kB, UID:1001 pgtables:1872kB oom_score_adj:0
Dec 29 10:01:44 server kernel: oom_reaper: reaped process 17555 (myapp), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

What it means: The kernel killed the process. systemd restarts it. Loop achieved.

Decision: Fix memory pressure (limits, leaks, concurrency) or add systemd memory controls (MemoryMax=) to fail fast and predictably. Also check swap and co-located workloads.
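
A sketch of the systemd-side controls, with illustrative limits; size them from real measurements, not from this example:

cr0x@server:~$ sudo systemctl edit myapp.service
# In the editor, add:
[Service]
# Soft threshold: the kernel reclaims and throttles above this.
MemoryHigh=384M
# Hard cap: above this the service's cgroup is OOM-killed, predictably and in isolation.
MemoryMax=512M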

Task 13: Inspect start rate limiting (why it “stops trying”)

cr0x@server:~$ systemctl show myapp.service -p StartLimitIntervalUSec -p StartLimitBurst -p NRestarts -p Restart -p RestartUSec
StartLimitIntervalUSec=10s
StartLimitBurst=5
NRestarts=17
Restart=always
RestartUSec=1s

What it means: It allows 5 starts per 10 seconds. With RestartSec=1, you’ll hit the limiter fast.

Decision: Don’t “fix” by cranking burst to 10,000. Increase RestartSec to slow the storm and preserve resources while debugging.

Task 14: Reproduce under systemd-run to match the service environment

cr0x@server:~$ sudo systemd-run --unit=myapp-debug --property=User=myapp --property=Group=myapp \
  /usr/local/bin/myapp --config /etc/myapp/config.yml
Running as unit: myapp-debug.service
cr0x@server:~$ journalctl -u myapp-debug.service --no-pager -n 50
Dec 29 10:30:10 server myapp[19411]: FATAL: cannot open /var/lib/myapp/state.db: permission denied

What it means: Your “it runs fine when I run it” test probably ran as root, or in a different working directory, or with different env vars.

Decision: Always reproduce with the service user and systemd context. It saves hours and reduces folklore.

Task 15: Verify and fix RuntimeDirectory/StateDirectory ownership (modern systemd pattern)

cr0x@server:~$ systemctl show myapp.service -p StateDirectory -p RuntimeDirectory
StateDirectory=myapp
RuntimeDirectory=myapp
cr0x@server:~$ ls -ld /var/lib/myapp /run/myapp
ls: cannot access '/var/lib/myapp': No such file or directory
drwxr-xr-x 2 root root 40 Dec 29 10:12 /run/myapp

What it means: If StateDirectory= is configured, systemd should create it under /var/lib with correct ownership—unless the unit is old, mis-specified, or overridden. Here it’s missing, and runtime dir is owned by root.

Decision: Fix the unit to let systemd manage directories, or create them with correct ownership. Directory ownership bugs are classic restart-loop fuel.
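
A sketch of letting systemd manage both directories via a drop-in; after a restart, both paths should exist and belong to the service user:

cr0x@server:~$ sudo systemctl edit myapp.service
# In the editor, add:
[Service]
StateDirectory=myapp
RuntimeDirectory=myapp
cr0x@server:~$ sudo systemctl restart myapp.service
cr0x@server:~$ ls -ld /var/lib/myapp /run/myapp
# Expect both directories owned by myapp:myapp, not root.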

When it only fails under systemd

This one gets people because it feels supernatural: you run the binary manually and it works. Under systemd, it fails immediately. That’s not supernatural. That’s environment.

Here’s what changes under systemd that commonly matters:

  • User/group and supplementary groups: manual tests often run as root or a logged-in user.
  • Working directory: systemd defaults to / unless WorkingDirectory= is set.
  • Environment variables: your shell loads profile scripts; systemd does not. If the app needs PATH tweaks or JAVA_HOME, declare it explicitly.
  • File descriptor limits: systemd may set different defaults, and your service may hit EMFILE (too many open files).
  • Sandboxing/hardening: ProtectSystem, PrivateTmp, RestrictAddressFamilies, and friends can break apps in subtle ways.
  • Startup type: if the app daemonizes or forks but the unit is Type=simple (the default), systemd may treat the parent’s exit as the service exiting and restart it.

Joke #2: systemd doesn’t “hate your app.” It just refuses to participate in your app’s interpretive dance around PID 1.

Storage and filesystem failure modes that look like “app bugs”

As a storage person, I’ll say the quiet part loudly: a shocking number of restart loops are storage problems wearing an application costume.

Read-only filesystem flips

ext4 and friends can remount read-only after certain I/O errors to prevent further damage. Your service then fails creating PID files, writing state, rotating logs, or updating SQLite. systemd restarts it forever because the process keeps exiting “for some reason.”
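
Two quick checks, as a sketch; adjust the mount points for your layout:

cr0x@server:~$ findmnt -no TARGET,OPTIONS /
cr0x@server:~$ findmnt -no TARGET,OPTIONS /var
# "ro" in the options column means the filesystem has flipped to read-only.
cr0x@server:~$ journalctl -k -b --no-pager | grep -i "read-only"
# Kernel messages about remounting read-only confirm an I/O error path.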

Full disk (or full inode table)

A service that can’t write to /var or /tmp often dies early. Worse: log spam fills the remaining space and takes other services down with it. Check both blocks and inodes.
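
A sketch of the both-sides check:

cr0x@server:~$ df -h /var /tmp
cr0x@server:~$ df -i /var /tmp
# 100% in either output breaks writes, even if the other looks healthy.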

Mount ordering and network filesystems

Services that expect /mnt/data to exist will crash if it’s not mounted. This gets spicy with NFS, iSCSI, and encrypted volumes: the mount can be “present” but not ready, or the network is up but DNS isn’t, or vice versa.

Permissions on state paths

After a restore, rsync, or an “improvement,” ownership drifts. Many daemons will exit if they detect wrong permissions on sensitive files (SSH, Postgres, etc.). That’s a feature, but it creates restart loops if you don’t see the first log line.

Network and DNS traps

Ubuntu 24.04 typically uses systemd-resolved and Netplan. A restart loop often happens when a service tries to resolve a hostname or connect to an upstream dependency during startup, fails quickly, and exits. Then systemd restarts it. Congratulations: your DNS flakiness is now a CPU benchmark.
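
To see the resolver view the service actually gets, a quick sketch; the hostname is hypothetical, use whatever the app resolves at startup:

cr0x@server:~$ resolvectl status
cr0x@server:~$ resolvectl query db.internal.example
# If this fails or hangs, the service was never going to start cleanly.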

Common network-driven triggers:

  • Service starts before network is “online” (link up isn’t the same as routable).
  • DNS points to a dead resolver; resolv.conf is managed and your old assumptions break.
  • Firewall rules block egress; app treats it as fatal rather than retrying.
  • IPv6 misbehavior: app binds to v6 only, or tries v6 first and times out.

Dependency ordering and readiness

Restart loops aren’t always about the app failing. Sometimes the app starts correctly, but systemd thinks it didn’t, or starts it at the wrong time.

Wrong service Type

If your daemon forks into the background but the unit is Type=simple, systemd can interpret the parent exiting as failure (or success, depending), and then restart. The fix is usually Type=forking with a PIDFile=, or better: configure the app to stay in the foreground and keep Type=simple.
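
Two hedged sketches of the unit side; the --foreground flag and the PID file path are illustrative, not taken from a real myapp binary:

# Preferred: keep the process in the foreground and let systemd supervise it.
[Service]
Type=simple
ExecStart=/usr/local/bin/myapp --config /etc/myapp/config.yml --foreground

# If the daemon insists on forking:
[Service]
Type=forking
PIDFile=/run/myapp/myapp.pid
ExecStart=/usr/local/bin/myapp --config /etc/myapp/config.yml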

Readiness notification mismatch

Type=notify expects the process to call sd_notify. If it doesn’t, systemd waits, times out, kills it, restarts. That loop looks like an app crash, but it’s a contract violation.

Mount and network targets aren’t magic

After=network-online.target helps, but only when paired with Wants=network-online.target (After= is ordering only) and when the relevant “wait online” service is enabled for your network stack. For mounts, use RequiresMountsFor= to tie a unit to a path that must be mounted.
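
A sketch of both patterns in a unit’s [Unit] section; /mnt/data is the example path from earlier:

[Unit]
# Both lines are needed: Wants= pulls the target in, After= only orders against it.
Wants=network-online.target
After=network-online.target
# Refuse to start unless /mnt/data is mounted, and order after that mount.
RequiresMountsFor=/mnt/data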

Rate limits, restart policy, and how to tune without lying

Restart policy is not just “make it highly available.” It’s “declare how the system behaves under failure.” In production, that behavior must be deliberate.

Restart=always is rarely the default you want

Restart=always restarts even after clean exit. That’s great for workers designed to run forever; it’s terrible for oneshot jobs and for daemons that intentionally exit after a successful migration.

Better patterns:

  • Restart=on-failure for most daemons.
  • Restart=on-abnormal when error exit codes are deliberate: it restarts only on signals, timeouts, and watchdog failures, not on non-zero exits.
  • No restart for batch jobs; let your scheduler handle retries.

Use RestartSec to avoid self-inflicted denial of service

If a service is going to fail repeatedly, make it fail slowly enough that humans can read logs and the machine can breathe. Increasing RestartSec from 100ms to 5s can be the difference between “minor incident” and “host unusable.”
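
As a drop-in, the change is small; 5 seconds is the illustrative value from the paragraph above, not a universal constant:

cr0x@server:~$ sudo systemctl edit myapp.service
# In the editor, add:
[Service]
Restart=on-failure
RestartSec=5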

StartLimit… is a circuit breaker, not a fix

Start limits stop the immediate storm, but they don’t solve the root cause. Treat “start request repeated too quickly” as systemd politely telling you: you’re debugging too late in the timeline.

Common mistakes: symptom → root cause → fix

This is the part you’ll recognize in ten seconds, because you’ve lived it.

1) Symptom: “Active: activating (auto-restart)” with status=1/FAILURE

Root cause: deterministic startup failure (config parse, missing file, permission denied).

Fix: stop/mask, then journalctl -u by time window. Fix file permissions or config. If the unit runs as non-root, validate access using sudo -u.

2) Symptom: “Start request repeated too quickly” and then it stays failed

Root cause: rate limiting triggered; earlier errors scrolled away.

Fix: query logs with --since/--until, or -b and search for the first failure. Consider increasing RestartSec while debugging, not StartLimitBurst.

3) Symptom: service works when run manually, fails under systemd

Root cause: environment mismatch (user, working directory, PATH, ulimits, sandboxing).

Fix: reproduce via systemd-run with the service user. Check systemctl cat for hardening and directory options. Set WorkingDirectory= and explicit Environment= or EnvironmentFile=.
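
A sketch of making the environment explicit; the directory, variable, and file name are illustrative, not taken from the article’s app:

cr0x@server:~$ sudo systemctl edit myapp.service
# In the editor, add:
[Service]
WorkingDirectory=/var/lib/myapp
Environment=JAVA_HOME=/usr/lib/jvm/default-java
# The leading "-" means "ignore this file if it doesn't exist".
EnvironmentFile=-/etc/default/myapp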

4) Symptom: exit status shows “status=203/EXEC”

Root cause: ExecStart= points to a missing binary, a file that isn’t executable, or a binary of the wrong architecture/format.

Fix: verify path and permissions; check shebang lines for scripts. Ensure the binary exists on the host, not just in your memory.
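
Three quick checks, as a sketch:

cr0x@server:~$ ls -l /usr/local/bin/myapp        # exists, with the execute bit set?
cr0x@server:~$ file /usr/local/bin/myapp         # right architecture and format?
cr0x@server:~$ head -n1 /usr/local/bin/myapp     # valid shebang, if it's a script?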

5) Symptom: status shows “code=killed, signal=KILL” with TimeoutStartSec messages

Root cause: systemd killed it after startup timeout; often readiness mismatch or slow dependency.

Fix: correct Type=, readiness notifications, or make startup not block on dependencies. If startup legitimately takes longer, raise TimeoutStartSec with justification.

6) Symptom: kernel logs show OOM killer entries for the service

Root cause: memory pressure or leak; could be co-tenancy or runaway concurrency.

Fix: reduce memory use, add swap if appropriate, or enforce MemoryMax=. Then fix the leak. Don’t just add RAM and call it “capacity planning.”

7) Symptom: “permission denied” writing to /run or /var/lib

Root cause: runtime/state directories missing or owned by root after deployment or restore.

Fix: use RuntimeDirectory= and StateDirectory=, or fix ownership in tmpfiles.d. Avoid ad-hoc mkdir in ExecStartPre unless you enjoy race conditions.

8) Symptom: failures correlate with reboots; mount path missing at boot

Root cause: dependency ordering and mount readiness; network FS not up yet.

Fix: add RequiresMountsFor=/path and correct After=. For network-online, ensure the proper wait-online service is enabled.

Three corporate mini-stories (pain, learning, receipts)

Mini-story 1: The incident caused by a wrong assumption

The team had a small internal API running as a systemd service. Nothing fancy. It read a YAML file from /etc/company/app.yml and wrote a little state into /var/lib/app. It had run like that for months.

Then a security hardening sprint landed. The unit file was “improved” with ProtectSystem=strict and NoNewPrivileges=true. Everyone nodded. Secure by default is good. The service restarted every second and threw alerts across the environment.

The wrong assumption was subtle: the config was “obviously readable” because it had always worked. But the file permissions had been lazily left as root-only since day one, and the service had been running as root. The hardening sprint also changed it to run as an unprivileged user. Security posture improved, and everything broke at once.

The fix was boring: correct ownership and permissions, then keep the hardening. The lesson wasn’t “don’t harden.” The lesson was “don’t mix hardening changes with privilege model changes without a test plan.” On the bright side, the service ended up safer and the team stopped pretending root was a feature flag.

Mini-story 2: The optimization that backfired

A platform group wanted faster recovery times. Someone lowered RestartSec across a fleet from 5 seconds to 100ms. The idea: if processes crash, bring them back immediately. In a lab, it looked crisp. In production, it turned a small failure into a systemic problem.

One service started failing due to a rotated secret file with the wrong permissions. Instead of one failure every few seconds with readable logs, it attempted ten restarts per second on dozens of nodes. Journals ballooned. Disk write amplification spiked. Nodes started reporting latency. Other services on the same hosts got slower. The restart storm became the incident.

The postmortem was uncomfortable because the original failure was minor and local. The optimization added a giant megaphone and a resource tax. Restart policies are part of system design; “faster” is not automatically “better.”

They backed out the aggressive restart timing, introduced per-service restart delays, and required a justification for very low RestartSec values. Most importantly, they stopped treating restart as remediation. It’s only a tool for resilience if the system can also surface the root cause efficiently.

Mini-story 3: The boring but correct practice that saved the day

A different org ran a payment-adjacent service with a strict change process. Not glamorous. Every service unit had: a drop-in override policy stored in config management, explicit users/groups, explicit directory ownership using StateDirectory=, and a standard “debug override” that could set Restart=no and increase log verbosity temporarily.

During an Ubuntu upgrade window, one node started flapping a daemon due to a library mismatch after a partial package upgrade. The on-call didn’t freestyle edits on the node. They applied the known debug override: stop restarts, capture logs, grab package versions, and restore the node by completing the upgrade transaction.

Why did it matter? Because the incident didn’t expand. The node stayed stable enough to collect evidence. The fleet didn’t get a wave of inconsistent hotfixes. The team ended up with one clean remediation: fix the upgrade automation to avoid partial upgrades and ensure the service only starts after the right packages are present.

Boring practice works because it reduces the number of new variables you introduce while the system is on fire. It’s not exciting. It’s reliable.

Checklists / step-by-step plan

Step-by-step: stop the loop, then diagnose

  1. Capture the current state: systemctl status -l and copy exit status lines into your incident notes.
  2. Stop or mask: if the service is causing resource churn, mask it.
  3. Pull the earliest failure logs: use journalctl -u with --since/--until around the first known failure time.
  4. Classify failure mode: permission/config vs crash vs dependency vs timeout vs OOM.
  5. Check unit file for traps: user/group, working directory, hardening, Type, timeouts, restart policy.
  6. Reproduce under systemd context: systemd-run or run as the service user.
  7. Check the platform layer: disk space/inodes, mounts, DNS, kernel OOM logs.
  8. Make one change: smallest possible change that addresses the root cause.
  9. Test once: unmask and start; follow logs.
  10. Restore restart policy intentionally: don’t leave it in debug mode.
  11. Write down what changed: future you will not remember, and auditors definitely won’t.

Hard rules when you’re sleep-deprived

  • Don’t edit the vendor unit file in /lib/systemd/system. Use drop-ins.
  • Don’t “fix” by running the service as root unless the service is inherently privileged (and even then, minimize scope).
  • Don’t raise StartLimitBurst as your first move. Slow restarts instead.
  • Don’t reboot just to make the logs go away. That’s not hygiene; that’s evidence destruction.

Interesting facts and historical context

  • systemd introduced aggressive, explicit restart semantics compared to traditional init scripts, making “flapping” more visible—and more common when misconfigured.
  • Start rate limiting exists partly because early daemons could fork-bomb PID 1 with rapid failure/restart cycles; the limiter is a safety valve.
  • Ubuntu’s move toward systemd-resolved changed how DNS is managed; old assumptions about /etc/resolv.conf being static frequently break services during upgrades.
  • coredump handling moved from “core files everywhere” to centralized collection with systemd-coredump, improving fleet debugging but confusing people expecting core in the working directory.
  • Modern systemd added declarative directory management (StateDirectory, RuntimeDirectory, LogsDirectory) to reduce fragile ExecStartPre scripts.
  • Hardening options like ProtectSystem and PrivateTmp became mainstream because they’re cheap wins against whole classes of compromise—but they also expose sloppy filesystem assumptions.
  • “Works in a shell” has been a lie since the beginning of Unix; daemon environments have always differed (different PATH, no TTY, different ulimits). systemd just makes the contract more explicit.
  • Rate limits and restart policies are reliability tools, not cosmetics; early HA patterns often tried to “restart everything instantly,” then learned about cascading failures the hard way.

FAQ

1) How do I stop a service restart loop without uninstalling anything?

Mask it: sudo systemctl mask myapp.service. That prevents manual starts and dependency-triggered starts. Unmask when ready.

2) Why does systemd keep restarting a service that obviously can’t start?

Because the unit says so. Look for Restart=always or Restart=on-failure. systemd assumes restart is desirable unless told otherwise.

3) Where is the “real error” if systemctl status only shows restart messages?

Usually in journalctl -u myapp.service, and often in the earliest lines after the first start attempt. Use --since/--until to target the beginning of the loop.

4) What does “status=203/EXEC” mean?

systemd couldn’t execute the command in ExecStart=. Common causes: wrong path, missing binary, not executable, or script with bad shebang.

5) What does “Start request repeated too quickly” mean?

The start limiter tripped. The unit started and failed too many times within StartLimitIntervalSec, exceeding StartLimitBurst. It’s a circuit breaker, not the root cause.

6) How do I tell if it’s crashing (segfault) versus exiting cleanly?

In systemctl status and the journal, look for signal-based exits (SIGSEGV, SIGABRT). Then check coredumpctl list and coredumpctl info.

7) The service works when I run it manually. Why not under systemd?

Different user, different environment, different limits, different working directory, and possibly sandboxing restrictions. Reproduce with systemd-run and the service user.

8) Should I increase TimeoutStartSec to fix a loop?

Only if you’ve proven the service is healthy but slow to become ready. If it’s failing fast, a longer timeout just delays the inevitable and wastes time.

9) Is it okay to set Restart=always for everything so it “self-heals”?

No. It can create restart storms and hide deterministic failures. Use Restart=on-failure for most daemons and design apps to retry dependencies internally.

10) How do I keep logs from flooding disk during a restart storm?

Mask the service, then consider slowing restarts (RestartSec) once you re-enable it. If journald disk usage is already large, rotate/vacuum only after you’ve captured the critical window.
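
A sketch of the order of operations; the time window and retention are illustrative:

cr0x@server:~$ journalctl -u myapp.service --since "2025-12-29 10:10:00" --until "2025-12-29 10:20:00" > /root/myapp-incident.log
cr0x@server:~$ journalctl --disk-usage
cr0x@server:~$ sudo journalctl --vacuum-time=7d
# Capture first, measure second, vacuum last.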

Conclusion: next steps that keep you out of trouble

Restart loops feel chaotic, but they’re usually deterministic: a missing permission, a broken dependency, a readiness mismatch, or a real crash. Your job is to stop the churn, grab the first meaningful error, and change the smallest thing that makes the service behave.

Do this next, in order:

  1. Mask the service if it’s flapping and causing collateral damage.
  2. Extract the earliest failure from the journal by time window.
  3. Classify the failure mode (permission/config, crash, timeout, dependency, resource).
  4. Reproduce under systemd context with the correct user.
  5. Apply a minimal fix, unmask, start once, and tail logs.
  6. Restore a sane restart policy (on-failure, with a reasonable RestartSec) so future failures are survivable and debuggable.

If you take nothing else: don’t fight the loop by making systemd quieter. Make the failure louder, earlier, and easier to prove. That’s how incidents stop repeating.
