You reboot a box you own in production, and systemd greets you with the least helpful sentence in Linux history:
“Failed to start …”. No context. No clue which layer lied. Just a sad unit name and a timestamp that’s already in the past.
This is case #2: not a toy demo, not a “just reinstall” situation. It’s the repeatable workflow I use when a service
fails on Ubuntu 24.04 and I need an answer fast: what broke, where it broke, and what decision to make next.
Speed matters, but correctness matters more—because the second reboot is when the real outage begins.
Fast diagnosis playbook (first 5 minutes)
When a unit fails, you’re not “debugging a service.” You’re debugging a transaction systemd tried to run:
dependencies, ordering, environment, privileges, filesystem, network, and the binary itself.
The fastest path is to stop guessing and force systemd to tell you which phase failed.
1) Confirm the exact failing unit and failure mode
- Get the unit name, not “the app name.”
- Get the result and the step (EXEC, SIGNAL, TIMEOUT, etc.).
- Get the last logs for that unit only.
2) Identify the bottleneck layer
The bottleneck is almost always one of these:
ExecStart can’t run (missing binary, permissions, SELinux/AppArmor),
the process starts but exits (config, ports, secrets),
it never becomes “ready” (Type=notify, readiness probe, PIDFile),
or it waits on dependencies (network-online, mount, db, remote storage).
3) Decide if you need a reboot-safe fix or an emergency bypass
There are two kinds of fixes:
proper (unit override, dependency correction, config fix),
and triage (temporary masking, reduced dependency, manual start).
If the host is stuck at boot, the emergency bypass is legitimate—but document it and remove it later.
4) When in doubt, follow the chain, not the symptom
A failed app service often isn’t broken. It’s waiting for a mount, DNS, or a network route that never arrives.
So you treat it like a distributed system: find the earliest failure in the chain.
One quote I still use in postmortems, from W. Edwards Deming: “A bad system will beat a good person every time.”
If you’re staring at “Failed to start,” assume the system (dependencies, ordering, environment) is guilty until proven otherwise.
Nine facts that change how you debug systemd
- systemd replaced Upstart in Ubuntu years ago, but the big shift wasn’t “new init”—it was dependency graph scheduling. Debugging is graph traversal now.
- journald isn’t “just logs”; it stores structured fields (unit, PID, cgroup, executable path). You can slice logs surgically without grep spaghetti.
- “Failed to start” is a UI summary, not a root cause. The real cause is typically in a smaller event: “failed at step EXEC” or “start request repeated too quickly.”
- Units have two orthogonal relationships: ordering (After=) and requirement (Requires=/Wants=). Many outages happen because people confuse them.
- Targets are synchronization points, not “runlevels with a new name.” Debugging boot problems often means understanding which target pulled in the failing unit.
- Type=notify is common in modern daemons. If the service doesn’t send readiness, systemd may declare a timeout even though the process is alive.
- StartLimitBurst/Interval are circuit breakers. A flapping service can “fail” even if the last attempt would have succeeded.
- Drop-in overrides are first-class. Vendors ship units; operators override. Editing files in /lib/systemd/system is how you create future-you pain.
- systemd-run and transient units exist. When you need to reproduce environment issues, a transient unit can replicate cgroup and sandboxing constraints better than your shell.
Joke #1: systemd doesn’t hate you personally. It hates everyone equally—then writes it to the journal with millisecond precision.
The mental model: where “Failed to start” actually comes from
systemd’s job is to take a unit, calculate a dependency transaction, execute it, and track state.
The failure you see is the end of a chain of state transitions, and you want the earliest meaningful divergence.
What “start” means (in systemd terms)
A service “starting” includes:
- Loading the unit (unit file parse, drop-ins, generator output)
- Resolving dependencies (what must exist before this unit can run)
- Executing (ExecStartPre, ExecStart, possibly multiple processes)
- Readiness (Type=simple is immediate; Type=notify requires a signal; Type=forking needs a PID)
- Monitoring (Restart= policies, watchdogs, failures)
Common failure classes (use these as buckets)
- Unit can’t execute: missing binary, wrong permissions, wrong user, missing interpreter, bad working directory.
- Process exits non-zero: config parse errors, port already used, dependency not reachable.
- Timeout: readiness not reported, hang on network, hang on mount, slow disk, blocked entropy (rare now).
- Dependency deadlock: wrong ordering, incorrect After=, dependency loop.
- Policy blocks: AppArmor denials, systemd sandboxing (ProtectSystem, PrivateTmp), capability restrictions.
- Rate limiting: StartLimitHit, restart loops.
Your job in triage is to map “Failed to start” into one of these buckets in under five minutes.
Everything after that is engineering, not guesswork.
Practical triage tasks (commands + meaning + decisions)
Below are the tasks I actually run. Each one includes: the command, what the output means, and the decision you make.
I’m assuming a failing unit named acme-api.service. Substitute yours.
Task 1: Confirm the state, result, and the last error summary
cr0x@server:~$ systemctl status acme-api.service --no-pager
● acme-api.service - Acme API
Loaded: loaded (/etc/systemd/system/acme-api.service; enabled; preset: enabled)
Active: failed (Result: timeout) since Mon 2025-12-30 10:12:48 UTC; 32s ago
Duration: 1min 30.012s
Process: 1842 ExecStart=/usr/local/bin/acme-api --config /etc/acme/api.yaml (code=killed, signal=TERM)
Main PID: 1842 (code=killed, signal=TERM)
CPU: 1.012s
Dec 30 10:11:18 server systemd[1]: Starting acme-api.service - Acme API...
Dec 30 10:12:48 server systemd[1]: acme-api.service: start operation timed out. Terminating.
Dec 30 10:12:48 server systemd[1]: acme-api.service: Failed with result 'timeout'.
Dec 30 10:12:48 server systemd[1]: Failed to start acme-api.service - Acme API.
Meaning: systemd killed the process after a start timeout. That’s not “it crashed.” It may be running fine but never became “ready” in systemd’s eyes.
Decision: Find out whether it’s a readiness/Type problem, or whether it’s blocked waiting on something (network, mount, DB).
Task 2: Pull unit-scoped logs with context (and stop scrolling)
cr0x@server:~$ journalctl -u acme-api.service -b --no-pager -n 200
Dec 30 10:11:18 server systemd[1]: Starting acme-api.service - Acme API...
Dec 30 10:11:19 server acme-api[1842]: loading config from /etc/acme/api.yaml
Dec 30 10:11:19 server acme-api[1842]: connecting to postgres at 10.20.0.15:5432
Dec 30 10:11:49 server acme-api[1842]: still waiting for postgres...
Dec 30 10:12:19 server acme-api[1842]: still waiting for postgres...
Dec 30 10:12:48 server systemd[1]: acme-api.service: start operation timed out. Terminating.
Dec 30 10:12:48 server systemd[1]: acme-api.service: Failed with result 'timeout'.
Meaning: the app is blocked on Postgres, and systemd’s timeout is the messenger.
Decision: Investigate network reachability/DNS/routes, and verify the service’s dependency chain (do we need After=network-online.target? do we want it?).
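If the chain is the problem and the unit genuinely must not start before connectivity exists, the usual shape is a drop-in, not an edit to the unit file itself. A minimal sketch (whether network-online.target is the right dependency here is exactly what Task 3 and the FAQ weigh):
cr0x@server:~$ sudo systemctl edit acme-api.service
# (editor opens; add the two lines below)
cr0x@server:~$ cat /etc/systemd/system/acme-api.service.d/override.conf
[Unit]
Wants=network-online.target
After=network-online.target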
Task 3: Read the unit file that systemd is actually using
cr0x@server:~$ systemctl cat acme-api.service
# /etc/systemd/system/acme-api.service
[Unit]
Description=Acme API
After=network.target
Wants=network.target
[Service]
Type=notify
ExecStart=/usr/local/bin/acme-api --config /etc/acme/api.yaml
User=acme
Group=acme
Restart=on-failure
TimeoutStartSec=90
[Install]
WantedBy=multi-user.target
Meaning: it’s Type=notify, so systemd expects readiness. If the daemon doesn’t send it, systemd will timeout even if it’s “fine.”
Also, it only waits for network.target, which is not “the network is up.”
Decision: Confirm whether the binary actually supports sd_notify. If not, change to Type=simple or fix the daemon. Also evaluate whether network-online.target is appropriate (often it isn’t).
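A cheap first pass on the sd_notify question, before digging into vendor docs: daemons that speak the notify protocol reference the NOTIFY_SOCKET environment variable, so its absence in the binary’s strings is suggestive. This is a heuristic only; stripped or unusual binaries can fool it.
cr0x@server:~$ strings /usr/local/bin/acme-api | grep -c NOTIFY_SOCKET
0
Zero hits points toward Type=simple; the source or vendor docs settle it.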
Task 4: Check if systemd is waiting on dependencies instead of your service
cr0x@server:~$ systemctl list-dependencies --reverse acme-api.service
acme-api.service
● multi-user.target
● graphical.target
Meaning: it’s not pulled in by something surprising. But this doesn’t show ordering delays; it shows who wants it.
Decision: Use critical chain to see what blocked boot, and inspect network-online/mount units if present.
Task 5: Use critical chain to find the first slow or broken unit
cr0x@server:~$ systemd-analyze critical-chain acme-api.service
acme-api.service +1min 30.012s
└─network.target @8.412s
└─systemd-networkd.service @6.901s +1.201s
└─systemd-udevd.service @3.112s +3.654s
└─systemd-tmpfiles-setup-dev-early.service @2.811s +201ms
└─kmod-static-nodes.service @2.603s +155ms
Meaning: the chain says your service spent 90 seconds “starting,” not that network was slow.
This points back to readiness waiting, not ordering.
Decision: Confirm whether systemd ever received READY=1, or whether the app is just blocked waiting for Postgres.
Task 6: Confirm whether the process was alive during the “timeout”
cr0x@server:~$ systemctl show acme-api.service -p MainPID -p ExecMainStatus -p ExecMainCode -p TimeoutStartUSec -p Type
MainPID=0
ExecMainStatus=0
ExecMainCode=0
TimeoutStartUSec=1min 30s
Type=notify
Meaning: MainPID is 0 now because it’s dead (systemd killed it). The key line is Type=notify and a finite timeout.
Decision: Either the daemon never notified readiness or it never reached readiness because it waited on Postgres. Solve the dependency or change the startup behavior.
Task 7: Check for StartLimitHit (the “it failed because it failed” failure)
cr0x@server:~$ systemctl status acme-api.service --no-pager | sed -n '1,18p'
● acme-api.service - Acme API
Loaded: loaded (/etc/systemd/system/acme-api.service; enabled; preset: enabled)
Active: failed (Result: start-limit-hit) since Mon 2025-12-30 10:13:30 UTC; 5s ago
Process: 1912 ExecStart=/usr/local/bin/acme-api --config /etc/acme/api.yaml (code=exited, status=1/FAILURE)
Meaning: systemd stopped trying because it restarted too often.
Decision: Reset the failure counter only after you changed something meaningful; otherwise you’re just accelerating a crash loop.
Task 8: Reset a start-limit and retry intentionally
cr0x@server:~$ sudo systemctl reset-failed acme-api.service
cr0x@server:~$ sudo systemctl start acme-api.service
cr0x@server:~$ systemctl status acme-api.service --no-pager -n 20
● acme-api.service - Acme API
Loaded: loaded (/etc/systemd/system/acme-api.service; enabled; preset: enabled)
Active: activating (start) since Mon 2025-12-30 10:14:01 UTC; 3s ago
Meaning: it’s in activating again—now you watch logs.
Decision: If it repeats, stop and fix the root cause (config, dependency, readiness), don’t keep hammering start.
Task 9: Validate the unit file for obvious footguns
cr0x@server:~$ systemd-analyze verify /etc/systemd/system/acme-api.service
/etc/systemd/system/acme-api.service:6: Unknown lvalue 'Wnats' in section 'Unit'
Meaning: typos happen. systemd ignores unknown keys after this warning, so a dependency you thought was wired (here, a misspelled Wants=) may never have applied.
Decision: Fix unit syntax; reload daemon; retry. If you’re lucky, you just found the outage.
Task 10: Check whether a dependency is failing (mounts and storage are repeat offenders)
cr0x@server:~$ systemctl --failed --no-pager
UNIT LOAD ACTIVE SUB DESCRIPTION
● mnt-data.mount loaded failed failed /mnt/data
● acme-api.service loaded failed failed Acme API
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state.
SUB = The low-level unit activation state.
Meaning: If storage didn’t mount, your app may be collateral damage.
Decision: Fix the mount first. Starting the app before its data path exists is how you get data loss disguised as “availability.”
Task 11: Inspect a failed mount unit quickly
cr0x@server:~$ systemctl status mnt-data.mount --no-pager -n 80
● mnt-data.mount - /mnt/data
Loaded: loaded (/etc/fstab; generated)
Active: failed (Result: exit-code) since Mon 2025-12-30 10:10:02 UTC; 4min ago
Where: /mnt/data
What: UUID=aa1b2c3d-4e5f-6789-a012-b345c678d901
Process: 1222 ExecMount=/usr/bin/mount UUID=aa1b2c3d-4e5f-6789-a012-b345c678d901 /mnt/data (code=exited, status=32)
Status: "Mounting failed."
Dec 30 10:10:02 server mount[1222]: mount: /mnt/data: wrong fs type, bad option, bad superblock on /dev/sdb1, missing codepage or helper program, or other error.
Dec 30 10:10:02 server systemd[1]: mnt-data.mount: Mount process exited, code=exited, status=32/n/a
Meaning: classic mount failure. Could be wrong UUID, fs corruption, missing kernel module, or a changed disk.
Decision: Confirm the block device, run blkid, check dmesg, and only then touch filesystem repair tools.
Task 12: Confirm whether AppArmor blocked the service (Ubuntu loves AppArmor)
cr0x@server:~$ journalctl -k -b --no-pager | grep -i apparmor | tail -n 5
Dec 30 10:11:19 server kernel: audit: type=1400 audit(1735553479.112:88): apparmor="DENIED" operation="open" class="file" profile="/usr/local/bin/acme-api" name="/etc/acme/secret.key" pid=1842 comm="acme-api" requested_mask="r" denied_mask="r" fsuid=1001 ouid=0
Meaning: Your service might be fine; policy isn’t.
Decision: Update the AppArmor profile (or stop confining this binary if you can’t maintain it). Don’t chmod 777 secrets “just to see.”
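To confirm the denial is the whole story (assuming apparmor-utils is installed and the binary has a profile), flip the profile to complain mode and retry; complain logs the access instead of blocking it:
cr0x@server:~$ sudo aa-complain /usr/local/bin/acme-api
Setting /usr/local/bin/acme-api to complain mode.
cr0x@server:~$ sudo systemctl restart acme-api.service
If it starts cleanly now, add the missing rule to the profile and aa-enforce it again. Complain mode is a diagnostic, not a fix.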
Task 13: Reproduce execution under systemd-like constraints (transient unit)
cr0x@server:~$ sudo systemd-run --unit=acme-api-debug --property=User=acme --property=Group=acme /usr/local/bin/acme-api --config /etc/acme/api.yaml
Running as unit: acme-api-debug.service
cr0x@server:~$ systemctl status acme-api-debug.service --no-pager -n 30
● acme-api-debug.service - /usr/local/bin/acme-api --config /etc/acme/api.yaml
Loaded: loaded (/run/systemd/transient/acme-api-debug.service; transient)
Active: failed (Result: exit-code) since Mon 2025-12-30 10:15:22 UTC; 2s ago
Process: 2044 ExecStart=/usr/local/bin/acme-api --config /etc/acme/api.yaml (code=exited, status=1/FAILURE)
Meaning: This isolates “works in my shell” from “works under systemd user/cgroup.”
Decision: Read the transient unit’s logs. If it fails the same way, it’s likely config/dependency, not unit wiring.
Task 14: Show the exact exit status interpretation
cr0x@server:~$ systemctl show acme-api.service -p ExecMainStatus -p ExecMainCode -p Result
ExecMainStatus=1
ExecMainCode=exited
Result=exit-code
Meaning: exit code failure, not timeout, not signal, not watchdog.
Decision: Focus on application stderr logs and configuration. Don’t waste time on dependencies unless logs point there.
Task 15: Check if sockets/ports were the real issue
cr0x@server:~$ ss -ltnp | grep -E ':8080\b'
LISTEN 0 4096 0.0.0.0:8080 0.0.0.0:* users:(("nginx",pid=912,fd=12))
Meaning: Something else owns the port. Many services log this, but sometimes they don’t before exiting.
Decision: Fix the port conflict (change config, stop the other service, or use socket activation properly).
Task 16: When boot is involved, verify the previous boot too
cr0x@server:~$ journalctl -u acme-api.service -b -1 --no-pager -n 80
Dec 30 09:03:12 server systemd[1]: Starting acme-api.service - Acme API...
Dec 30 09:03:13 server acme-api[701]: connecting to postgres at 10.20.0.15:5432
Dec 30 09:03:14 server acme-api[701]: ready
Dec 30 09:03:14 server systemd[1]: Started acme-api.service - Acme API.
Meaning: Yesterday it worked. That’s a gift: something changed (network, secrets, mounts, policy, remote dependency).
Decision: Look for changes between boots: package updates, config deployment, DNS, firewall, routes, storage changes.
Case #2 walkthrough: dependency chain + timeout + a misleading “green” check
Here’s the situation that shows up in real fleets: a service fails to start after a routine reboot, and the error looks local.
It’s not. It’s a dependency/ordering problem wearing a timeout costume.
The setup
You have an API service that reads config, then connects to Postgres. The unit is Type=notify.
It’s configured with After=network.target and Wants=network.target.
On Ubuntu 24.04, the network is managed by systemd-networkd (typical for servers, via netplan) or NetworkManager (typical for desktops).
The failure starts like this:
cr0x@server:~$ systemctl status acme-api.service --no-pager -n 30
● acme-api.service - Acme API
Loaded: loaded (/etc/systemd/system/acme-api.service; enabled; preset: enabled)
Active: failed (Result: timeout) since Mon 2025-12-30 10:12:48 UTC; 7min ago
Dec 30 10:12:48 server systemd[1]: acme-api.service: start operation timed out. Terminating.
Dec 30 10:12:48 server systemd[1]: Failed to start acme-api.service - Acme API.
People often jump straight to “increase TimeoutStartSec.” Sometimes that’s correct. Often it’s lazy.
First, answer the actual question: why did it take too long?
Step 1: The service isn’t “down,” it’s “not ready”
The journal shows it waiting for Postgres. That’s the first clue: the process started and is doing work.
The second clue is Type=notify. That means systemd is waiting for readiness, and the timeout is the wall-clock limit.
The simplest check: can we reach Postgres?
cr0x@server:~$ ping -c 2 10.20.0.15
PING 10.20.0.15 (10.20.0.15) 56(84) bytes of data.
--- 10.20.0.15 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1003ms
Meaning: This host can’t reach the DB IP. That’s not an app bug.
Decision: Debug routing, VLANs, firewall, or the network service state. Don’t touch the app yet.
Step 2: The “network is up” lie (network.target)
network.target mostly means “network management stack is running,” not “you have routes and connectivity.”
If you want “IP configured,” you typically care about network-online.target—but you should treat it like hot sauce:
a little helps; too much ruins dinner.
Check what provides online readiness:
cr0x@server:~$ systemctl status systemd-networkd-wait-online.service --no-pager -n 40
● systemd-networkd-wait-online.service - Wait for Network to be Configured
Loaded: loaded (/usr/lib/systemd/system/systemd-networkd-wait-online.service; enabled; preset: enabled)
Active: failed (Result: timeout) since Mon 2025-12-30 10:10:55 UTC; 9min ago
Process: 876 ExecStart=/usr/lib/systemd/systemd-networkd-wait-online (code=exited, status=1/FAILURE)
Dec 30 10:09:25 server systemd[1]: Starting systemd-networkd-wait-online.service - Wait for Network to be Configured...
Dec 30 10:10:55 server systemd[1]: systemd-networkd-wait-online.service: start operation timed out. Terminating.
Dec 30 10:10:55 server systemd[1]: systemd-networkd-wait-online.service: Failed with result 'timeout'.
Meaning: network didn’t become “online” within the wait timeout. Your API never had a chance if it needs the DB.
Decision: Don’t “fix” the API. Fix why the host never achieved network-online. Usually: wrong netplan, missing VLAN, link down, DHCP failure, or a renamed interface.
Step 3: Find the actual network failure (netplan and interface state)
cr0x@server:~$ ip -br link
lo UNKNOWN 00:00:00:00:00:00
enp0s31f6 DOWN 3c:52:82:aa:bb:cc
Meaning: the interface is down. That’s physical link, driver, or a switch problem.
Decision: If this is a VM, check the hypervisor vNIC. If physical, check cabling/switch port, and dmesg for driver issues.
cr0x@server:~$ journalctl -u systemd-networkd -b --no-pager -n 120
Dec 30 10:09:03 server systemd-networkd[612]: enp0s31f6: Link DOWN
Dec 30 10:09:04 server systemd-networkd[612]: enp0s31f6: Lost carrier
Dec 30 10:09:05 server systemd-networkd[612]: enp0s31f6: DHCPv4 client: No carrier
Meaning: no carrier; DHCP never ran. The “Failed to start” message is innocent—your NIC isn’t.
Decision: Restore link. If link is intentionally absent (isolated network), then the service design is wrong: it shouldn’t block boot.
Step 4: The misleading “green” check
In many orgs, someone will run a health check that says “network OK” because they can resolve localhost or reach a local gateway.
That’s not connectivity to your dependency. It’s a partial truth.
Here’s the check that looks green but isn’t:
cr0x@server:~$ getent hosts postgres.internal
10.20.0.15 postgres.internal
Meaning: DNS (or /etc/hosts) resolution works. That says nothing about routing, firewall, or link.
Decision: Always pair name resolution checks with reachability checks: ip route get, ping, nc, ss.
cr0x@server:~$ ip route get 10.20.0.15
RTNETLINK answers: Network is unreachable
Meaning: kernel has no route. That’s lower-level than the app and higher confidence than any “it should work.”
Decision: Fix netplan/routes. Don’t touch TimeoutStartSec until the host can route to its dependencies.
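Routes and resolution can both be fine while a firewall quietly eats the SYN, so finish with a port-level probe (netcat-openbsd shown; other nc variants word errors differently):
cr0x@server:~$ nc -vz -w 3 10.20.0.15 5432
nc: connect to 10.20.0.15 port 5432 (tcp) failed: Network is unreachable
Here it agrees with ip route get. Once routing is fixed, the same probe tells you whether the DB port itself answers.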
Step 5: The actual fix (and the non-fix)
The non-fix: raising TimeoutStartSec to five minutes. That just gives your outage more time to be mysterious.
The fix is restoring network, and then tightening the service behavior:
if Postgres is down, should the host boot? Usually yes. Should the service keep retrying? Usually yes.
Should it block for 90 seconds and then die? That depends on whether you want the unit “active” without DB.
In practice, you choose one:
- Make it resilient: let it start and serve partial functionality, or keep retrying without failing readiness.
- Make it explicit: keep failing fast, but add clear logs, and make orchestration aware.
If the daemon doesn’t support sd_notify, switching to Type=simple often prevents false timeouts:
cr0x@server:~$ sudo systemctl edit acme-api.service
# (editor opens)
cr0x@server:~$ cat /etc/systemd/system/acme-api.service.d/override.conf
[Service]
Type=simple
TimeoutStartSec=30
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart acme-api.service
Meaning: You’re aligning systemd’s readiness expectations with reality. Also, you shortened the “mystery hang” window.
Decision: Only do this if the service truly doesn’t notify. If it does notify, keep Type=notify and fix the readiness path.
Joke #2: Increasing TimeoutStartSec is like moving the smoke alarm farther away. The kitchen’s still on fire; you just can’t hear it.
Three corporate mini-stories (and the lesson you actually need)
Mini-story #1: The outage caused by a wrong assumption
A mid-sized SaaS shop had a background worker service that processed billing events. It was “simple”:
read from a queue, write to Postgres, emit metrics. It ran for months without drama.
During a security hardening sprint, someone updated the unit file:
After=network.target and Wants=network.target, because that looked “correct” and was already in other units.
The assumption was that network.target meant “network is up.”
A week later, the company migrated a VLAN and introduced a longer DHCP wait on a subset of nodes.
Those nodes booted, started the worker immediately, and the worker tried to connect before the route existed.
The worker had a 60-second startup timeout and exited non-zero. systemd restarted it aggressively. StartLimitHit tripped.
The ugly part: dashboards showed the worker “running” on most nodes. On the impacted nodes, it was “failed,” but nobody had alerts on the unit state.
Billing lagged. Finance noticed before engineering did, which is a special kind of shame.
They fixed it by making connectivity explicit: either wait for a specific dependency (a reachable DB endpoint via a script),
or make the worker start and keep retrying without failing systemd readiness. They also stopped using network.target as a comfort blanket.
Lesson: Don’t encode folklore into unit files. If you depend on connectivity, define what “ready” means and measure it.
Mini-story #2: The optimization that backfired
An enterprise IT team wanted faster boot times on a fleet of Ubuntu servers. Someone noticed a few seconds spent in “wait online”
and decided it was unnecessary. They disabled systemd-networkd-wait-online.service to shave boot time.
It worked—boot was quicker. Then a storage-backed service started failing after reboots. Not always. Just enough to be expensive.
The service mounted an iSCSI LUN, then started a database on top. With wait-online disabled, the iSCSI login raced the network configuration.
Sometimes it won. Sometimes it face-planted.
The team’s first response was to increase database startup timeouts. That made the database “start” more often,
but created a worse failure mode: the database would start on an empty directory if the mount wasn’t present, initialize a new cluster,
and happily serve nonsense until someone noticed.
The eventual fix wasn’t “turn wait-online back on globally.” They made ordering and requirements precise:
the database unit required the mount unit; the mount unit required a network route; and they used per-unit dependency checks.
Boot time stayed fast on machines that didn’t need the network. Machines that did were correct.
Lesson: Global boot optimizations should be treated like changing a shared library. If you can’t describe the dependency graph, don’t “speed it up.”
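In unit terms, “the database unit required the mount unit” is two lines, not a paragraph. A sketch with illustrative names:
cr0x@server:~$ cat /etc/systemd/system/acme-db.service.d/storage.conf
[Unit]
Requires=mnt-data.mount
After=mnt-data.mount
Requires= without After= still permits the race; the pair is what makes the dependency real.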
Mini-story #3: The boring but correct practice that saved the day
A platform team ran a set of internal services with vendor-provided unit files. They never edited files in /usr/lib/systemd/system.
Not once. Instead, every change went into drop-in overrides in /etc/systemd/system/*.d/.
This looked bureaucratic to some engineers—until an urgent package update landed on a Friday night.
The update replaced the vendor unit and changed an ExecStart flag. On hosts where people had edited vendor units directly,
their changes were silently overwritten and services failed.
On the hosts managed the boring way, the drop-ins still applied cleanly. After the update, the services restarted with the new vendor defaults
plus the team’s overrides. The only work required was validating behavior—not recovering from config drift.
The post-incident review was quiet. The team didn’t win points for heroics. They avoided heroics. That’s the whole job.
Lesson: The dull practices (drop-ins, verify, reload, consistent logs) are the ones that survive upgrades.
Common mistakes: symptoms → root cause → fix
1) Symptom: “Failed with result ‘timeout’” during start
Root cause: readiness mismatch (Type=notify without sd_notify), or the service blocks on a dependency (DB, DNS, mount), or TimeoutStartSec too short for real work.
Fix: Determine whether the process is doing useful work. If no sd_notify support, switch to Type=simple.
If it’s waiting on a dependency, fix the dependency or change service behavior to retry after start rather than blocking readiness.
2) Symptom: “failed at step EXEC”
Root cause: ExecStart path wrong, binary missing, wrong permissions, missing interpreter (e.g., script with bad shebang), or filesystem not mounted.
Fix: Check systemctl status and unit file; validate the file exists and is executable; verify mount units; inspect AppArmor denials.
3) Symptom: “start request repeated too quickly” / start-limit-hit
Root cause: crash loop or rapid failure, exacerbated by Restart=always/on-failure; systemd’s rate limiter trips.
Fix: Read logs for the first failure, not the last. Fix the underlying config/port/secrets issue. Then systemctl reset-failed.
Optionally tune StartLimit settings, but don’t use them to hide a loop.
4) Symptom: service works when run manually, fails under systemd
Root cause: different environment (PATH, working directory, ulimits), different user, sandboxing options, missing permissions.
Fix: Use systemd-run with User/Group to reproduce; check unit’s WorkingDirectory, EnvironmentFile, and hardening flags.
5) Symptom: service fails only after reboot, not after manual restart
Root cause: boot-time ordering issue: mount not ready, network not configured, secrets not available, time not synced, dependency service not started yet.
Fix: Use systemd-analyze critical-chain and inspect dependencies. Replace vague After=network.target with correct ordering and explicit requirements (mount unit, specific dependency unit, or a lightweight pre-check).
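One shape of that “lightweight pre-check,” as a drop-in (endpoint, port, and timing are illustrative; the loop is bounded by TimeoutStartSec, so keep the two consistent):
cr0x@server:~$ cat /etc/systemd/system/acme-api.service.d/precheck.conf
[Service]
ExecStartPre=/bin/sh -c 'until nc -z -w 2 10.20.0.15 5432; do sleep 2; done'
This turns a silent in-app wait into a visible activating phase with a clean timeout instead of a mystery.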
6) Symptom: service “starts” but immediately exits and systemd reports success
Root cause: wrong service Type (e.g., forking vs simple), or ExecStart launching a wrapper that exits while the daemon continues (or never continues).
Fix: Verify the daemon model. Use Type=forking with PIDFile only when the daemon truly forks. Prefer Type=simple or notify for modern daemons.
7) Symptom: everything “looks fine,” but the service can’t access a file or socket
Root cause: AppArmor denial, systemd sandboxing (ProtectSystem, ReadOnlyPaths, PrivateTmp), or file ownership mismatch after deployment.
Fix: Search kernel journal for denials; adjust policy or sandboxing; ensure correct ownership and permissions.
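Sandboxing is readable straight from systemd, so check it before blaming the filesystem (these are real unit properties; the values shown are illustrative):
cr0x@server:~$ systemctl show acme-api.service -p User -p ProtectSystem -p ProtectHome -p PrivateTmp -p ReadOnlyPaths
User=acme
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
ReadOnlyPaths=/etc/acme
ProtectSystem=strict with no ReadWritePaths= leaves almost the entire tree read-only to the service; deploy scripts that assume otherwise fail in ways that look like permission bugs.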
Checklists / step-by-step plan
Checklist A: The 90-second triage (single failing service)
- Run systemctl status UNIT --no-pager. Capture Result (timeout/exit-code/start-limit-hit) and any “step” (EXEC).
- Run journalctl -u UNIT -b -n 200 --no-pager. Look for the last meaningful app log line.
- Run systemctl cat UNIT. Note Type=, TimeoutStartSec=, User=, and the ExecStart path.
- Bucket it: EXEC problem, exit-code, timeout/wait, dependency, policy, or rate limiting.
- Pick the next tool based on bucket (don’t freestyle).
Checklist B: The “boot is stuck” workflow
- Run systemctl --failed --no-pager to list the earliest obvious failures (mounts, networking, name resolution).
- Run systemd-analyze critical-chain to identify what delayed reaching the default target.
- Fix the earliest broken dependency first (mount/network) before restarting downstream services.
- If you need the host up immediately: temporarily mask the non-critical failing unit, boot to multi-user, and revert the mask after the fix is deployed (a runtime-mask sketch follows).
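For the emergency bypass, prefer a runtime mask: it lives under /run and evaporates on the next reboot, so it can’t silently become permanent (drop --runtime if you need it to survive reboots, but then the revert is on you):
cr0x@server:~$ sudo systemctl mask --runtime acme-api.service
Created symlink /run/systemd/system/acme-api.service → /dev/null.
cr0x@server:~$ sudo systemctl unmask --runtime acme-api.service   # after the fix lands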
Checklist C: Reboot-safe repair discipline (avoid self-inflicted wounds)
- Never edit vendor unit files in /usr/lib/systemd/system or /lib/systemd/system. Use systemctl edit drop-ins.
- After unit changes: systemctl daemon-reload, then restart the unit.
- Use systemd-analyze verify on units you touched.
- Record what you changed and why. Future-you is an SRE too, and they’re tired.
Checklist D: Storage-aware service starts (because storage failures masquerade as service failures)
- If a service uses a path like /mnt/data, ensure there is a mount unit and the service requires it.
- Check for failed mounts early with systemctl --failed.
- Never let a database initialize on an unmounted directory. Add explicit checks or requirements (see the sketch after this list).
- If the mount is remote (NFS/iSCSI), treat it as a network dependency and design for partial failure.
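Two directives cover most of this checklist (path illustrative): RequiresMountsFor= adds both the requirement and the ordering for whatever mounts a path needs, and ConditionPathIsMountPoint= refuses to start the unit if the path isn’t actually mounted, which is exactly the guard against a database initializing on an empty directory:
cr0x@server:~$ cat /etc/systemd/system/acme-db.service.d/mounts.conf
[Unit]
RequiresMountsFor=/mnt/data
ConditionPathIsMountPoint=/mnt/data
Note that a failed condition skips the unit quietly rather than marking it failed, so pair it with alerting.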
FAQ
1) Why does systemd say “Failed to start” when the binary clearly ran?
Because “start” includes readiness. If Type=notify is used, systemd expects a readiness signal.
If the app blocks on a dependency, systemd may timeout and kill it even though it was “doing something.”
2) Should I always add After=network-online.target?
No. Many services don’t need full connectivity at start, and waiting for “online” can slow boot or introduce new failure modes.
Add it only when the service truly requires network connectivity to become functional, and prefer more specific dependencies when possible.
3) What’s the difference between After= and Requires=?
After= is ordering only: “start me later.” Requires= is a requirement: “if that fails, I fail too.”
Confusing them creates either boot deadlocks (too many requirements) or race conditions (ordering without ensuring the dependency exists).
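In unit-file terms, against a hypothetical db.service:
[Unit]
# Ordering only: if both units are queued, start me after db.service.
# If db.service isn't queued at all, I start anyway.
After=db.service
# Requirement only: queue db.service with me; if it fails, stop me.
# Without After=, the two may still start in parallel.
Requires=db.service
In practice, pair Requires= (or Wants=) with After=; a requirement without ordering is a race you scheduled yourself.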
4) What does “failed at step EXEC” usually mean?
systemd couldn’t execute the configured command. Common causes: wrong path, missing binary, no execute bit, wrong user permissions,
missing interpreter for scripts, or a required filesystem path not mounted.
5) How do I know if AppArmor is blocking my service?
Look for denials in the kernel journal: journalctl -k -b | grep -i apparmor.
If you see DENIED entries matching your service’s binary, you’ve found a policy issue, not an application issue.
6) When should I increase TimeoutStartSec?
Only when the service legitimately needs longer to become ready and you understand why.
If it’s waiting on a flaky dependency, a longer timeout can make recovery slower and failures harder to detect.
7) Why does a service restart loop turn into start-limit-hit?
systemd rate-limits restarts to prevent thrash. If a unit fails repeatedly in a short interval, it stops trying.
Fix the underlying problem, then clear the counter with systemctl reset-failed.
8) The service works when I run it manually. Why not via systemd?
systemd runs it as a configured user with a specific environment, working directory, ulimits, cgroup constraints, and possibly sandboxing.
Reproduce using systemd-run with the same User/Group to narrow the gap.
9) How do I tell if the problem is upstream (dependency) rather than the service?
If the unit logs show “waiting for …” (DB, DNS, mount), or if you see failed mount/network units in systemctl --failed,
treat it as upstream until proven otherwise. Fix the earliest failure in the chain.
Next steps you can do today
If you want “Failed to start” incidents to stop eating your afternoons, do these in order:
- Standardize the triage commands across your team: systemctl status, journalctl -u, systemctl cat, systemd-analyze critical-chain, systemctl --failed.
- Make unit readiness honest: don’t use Type=notify unless the daemon truly notifies; don’t hide dependency waits behind long timeouts.
- Wire dependencies explicitly: if you need a mount, require the mount. If you need a route, prove it (or design to retry without blocking boot).
- Use drop-in overrides for every operational change. Keep vendor units pristine so upgrades don’t surprise you.
- Add alerts on unit failure for the services that matter. The journal is great, but it does not page you.
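A minimal shape for that (handler name and script are illustrative; the OnFailure=/template mechanics are stock systemd):
cr0x@server:~$ cat /etc/systemd/system/acme-api.service.d/onfailure.conf
[Unit]
OnFailure=notify-failed@%n.service
cr0x@server:~$ cat /etc/systemd/system/notify-failed@.service
[Unit]
Description=Alert that %i entered failed state
[Service]
Type=oneshot
ExecStart=/usr/local/bin/notify-failed %i
%n expands to the failing unit’s full name, so one template serves every service you care about.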
The fastest triage workflow isn’t a bag of commands. It’s a habit: identify the failure class, read the unit, read the logs,
follow the dependency chain, and fix the earliest broken thing. Everything else is theater.