You reboot a box you own in production, and systemd greets you with the least helpful sentence in Linux history:
“Failed to start …”. No context. No clue which layer lied. Just a sad unit name and a timestamp that’s already in the past.
This is case #2: not a toy demo, not a “just reinstall” situation. It’s the repeatable workflow I use when a service
fails on Ubuntu 24.04 and I need an answer fast: what broke, where it broke, and what decision to make next.
Speed matters, but correctness matters more—because the second reboot is when the real outage begins.
Fast diagnosis playbook (first 5 minutes)
When a unit fails, you’re not “debugging a service.” You’re debugging a transaction systemd tried to run:
dependencies, ordering, environment, privileges, filesystem, network, and the binary itself.
The fastest path is to stop guessing and force systemd to tell you which phase failed.
1) Confirm the exact failing unit and failure mode
- Get the unit name, not “the app name.”
- Get the result and the step (EXEC, SIGNAL, TIMEOUT, etc.).
- Get the last logs for that unit only.
2) Identify the bottleneck layer
The bottleneck is almost always one of these:
ExecStart can’t run (missing binary, permissions, SELinux/AppArmor),
the process starts but exits (config, ports, secrets),
it never becomes “ready” (Type=notify, readiness probe, PIDFile),
or it waits on dependencies (network-online, mount, db, remote storage).
3) Decide if you need a reboot-safe fix or an emergency bypass
There are two kinds of fixes:
proper (unit override, dependency correction, config fix),
and triage (temporary masking, reduced dependency, manual start).
If the host is stuck at boot, the emergency bypass is legitimate—but document it and remove it later.
4) When in doubt, follow the chain, not the symptom
A failed app service often isn’t broken. It’s waiting for a mount, DNS, or a network route that never arrives.
So you treat it like a distributed system: find the earliest failure in the chain.
One quote I still use in postmortems, from W. Edwards Deming: “A bad system will beat a good person every time.”
If you’re staring at “Failed to start,” assume the system (dependencies, ordering, environment) is guilty until proven otherwise.
Nine facts that change how you debug systemd
- systemd replaced Upstart in Ubuntu years ago, but the big shift wasn’t “new init”—it was dependency graph scheduling. Debugging is graph traversal now.
- journald isn’t “just logs”; it stores structured fields (unit, PID, cgroup, executable path). You can slice logs surgically without grep spaghetti.
- “Failed to start” is a UI summary, not a root cause. The real cause is typically in a smaller event: “failed at step EXEC” or “start request repeated too quickly.”
- Units have two orthogonal relationships: ordering (After=) and requirement (Requires=/Wants=). Many outages happen because people confuse them.
- Targets are synchronization points, not “runlevels with a new name.” Debugging boot problems often means understanding which target pulled in the failing unit.
- Type=notify is common in modern daemons. If the service doesn’t send readiness, systemd may declare a timeout even though the process is alive.
- StartLimitBurst/Interval are circuit breakers. A flapping service can “fail” even if the last attempt would have succeeded.
- Drop-in overrides are first-class. Vendors ship units; operators override. Editing files in /lib/systemd/system is how you create future-you pain.
- systemd-run and transient units exist. When you need to reproduce environment issues, a transient unit can replicate cgroup and sandboxing constraints better than your shell.
Joke #1: systemd doesn’t hate you personally. It hates everyone equally—then writes it to the journal with millisecond precision.
The mental model: where “Failed to start” actually comes from
systemd’s job is to take a unit, calculate a dependency transaction, execute it, and track state.
The failure you see is the end of a chain of state transitions, and you want the earliest meaningful divergence.
What “start” means (in systemd terms)
A service “starting” includes:
- Loading the unit (unit file parse, drop-ins, generator output)
- Resolving dependencies (what must exist before this unit can run)
- Executing (ExecStartPre, ExecStart, possibly multiple processes)
- Readiness (Type=simple is immediate; Type=notify requires a signal; Type=forking needs a PID)
- Monitoring (Restart= policies, watchdogs, failures)
Common failure classes (use these as buckets)
- Unit can’t execute: missing binary, wrong permissions, wrong user, missing interpreter, bad working directory.
- Process exits non-zero: config parse errors, port already used, dependency not reachable.
- Timeout: readiness not reported, hang on network, hang on mount, slow disk, blocked entropy (rare now).
- Dependency deadlock: wrong ordering, incorrect After=, dependency loop.
- Policy blocks: AppArmor denials, systemd sandboxing (ProtectSystem, PrivateTmp), capability restrictions.
- Rate limiting: StartLimitHit, restart loops.
Your job in triage is to map “Failed to start” into one of these buckets in under five minutes.
Everything after that is engineering, not guesswork.
Practical triage tasks (commands + meaning + decisions)
Below are the tasks I actually run. Each one includes: the command, what the output means, and the decision you make.
I’m assuming a failing unit named acme-api.service. Substitute yours.
Task 1: Confirm the state, result, and the last error summary
cr0x@server:~$ systemctl status acme-api.service --no-pager
● acme-api.service - Acme API
Loaded: loaded (/etc/systemd/system/acme-api.service; enabled; preset: enabled)
Active: failed (Result: timeout) since Mon 2025-12-30 10:12:48 UTC; 32s ago
Duration: 1min 30.012s
Process: 1842 ExecStart=/usr/local/bin/acme-api --config /etc/acme/api.yaml (code=killed, signal=TERM)
Main PID: 1842 (code=killed, signal=TERM)
CPU: 1.012s
Dec 30 10:11:18 server systemd[1]: Starting acme-api.service - Acme API...
Dec 30 10:12:48 server systemd[1]: acme-api.service: start operation timed out. Terminating.
Dec 30 10:12:48 server systemd[1]: acme-api.service: Failed with result 'timeout'.
Dec 30 10:12:48 server systemd[1]: Failed to start acme-api.service - Acme API.
Meaning: systemd killed the process after a start timeout. That’s not “it crashed.” It may be running fine but never became “ready” in systemd’s eyes.
Decision: Find out whether it’s a readiness/Type problem, or whether it’s blocked waiting on something (network, mount, DB).
Task 2: Pull unit-scoped logs with context (and stop scrolling)
cr0x@server:~$ journalctl -u acme-api.service -b --no-pager -n 200
Dec 30 10:11:18 server systemd[1]: Starting acme-api.service - Acme API...
Dec 30 10:11:19 server acme-api[1842]: loading config from /etc/acme/api.yaml
Dec 30 10:11:19 server acme-api[1842]: connecting to postgres at 10.20.0.15:5432
Dec 30 10:11:49 server acme-api[1842]: still waiting for postgres...
Dec 30 10:12:19 server acme-api[1842]: still waiting for postgres...
Dec 30 10:12:48 server systemd[1]: acme-api.service: start operation timed out. Terminating.
Dec 30 10:12:48 server systemd[1]: acme-api.service: Failed with result 'timeout'.
Meaning: the app is blocked on Postgres, and systemd’s timeout is the messenger.
Decision: Investigate network reachability/DNS/routes, and verify the service’s dependency chain (do we need After=network-online.target? do we want it?).
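If the chain is the problem and the unit genuinely must not start before connectivity exists, the usual shape is a drop-in, not an edit to the unit file itself. A minimal sketch (whether network-online.target is the right dependency here is exactly what Task 3 and the FAQ weigh):
cr0x@server:~$ sudo systemctl edit acme-api.service
# (editor opens; add the two lines below)
cr0x@server:~$ cat /etc/systemd/system/acme-api.service.d/override.conf
[Unit]
Wants=network-online.target
After=network-online.target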
Task 3: Read the unit file that systemd is actually using
cr0x@server:~$ systemctl cat acme-api.service
# /etc/systemd/system/acme-api.service
[Unit]
Description=Acme API
After=network.target
Wants=network.target
[Service]
Type=notify
ExecStart=/usr/local/bin/acme-api --config /etc/acme/api.yaml
User=acme
Group=acme
Restart=on-failure
TimeoutStartSec=90
[Install]
WantedBy=multi-user.target
Meaning: it’s Type=notify, so systemd expects readiness. If the daemon doesn’t send it, systemd will timeout even if it’s “fine.”
Also, it only waits for network.target, which is not “the network is up.”
Decision: Confirm whether the binary actually supports sd_notify. If not, change to Type=simple or fix the daemon. Also evaluate whether network-online.target is appropriate (often it isn’t).
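A cheap first pass on the sd_notify question, before digging into vendor docs: daemons that speak the notify protocol reference the NOTIFY_SOCKET environment variable, so its absence in the binary’s strings is suggestive. This is a heuristic only; stripped or unusual binaries can fool it.
cr0x@server:~$ strings /usr/local/bin/acme-api | grep -c NOTIFY_SOCKET
0
Zero hits points toward Type=simple; the source or vendor docs settle it.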
Task 4: Check if systemd is waiting on dependencies instead of your service
cr0x@server:~$ systemctl list-dependencies --reverse acme-api.service
acme-api.service
● multi-user.target
● graphical.target
Meaning: it’s not pulled in by something surprising. But this doesn’t show ordering delays; it shows who wants it.
Decision: Use critical chain to see what blocked boot, and inspect network-online/mount units if present.
Task 5: Use critical chain to find the first slow or broken unit
cr0x@server:~$ systemd-analyze critical-chain acme-api.service
acme-api.service +1min 30.012s
└─network.target @8.412s
└─systemd-networkd.service @6.901s +1.201s
└─systemd-udevd.service @3.112s +3.654s
└─systemd-tmpfiles-setup-dev-early.service @2.811s +201ms
└─kmod-static-nodes.service @2.603s +155ms
Meaning: the chain says your service spent 90 seconds “starting,” not that network was slow.
This points back to readiness waiting, not ordering.
Decision: Confirm whether systemd ever received READY=1, or whether the app is just blocked waiting for Postgres.
Task 6: Confirm whether the process was alive during the “timeout”
cr0x@server:~$ systemctl show acme-api.service -p MainPID -p ExecMainStatus -p ExecMainCode -p TimeoutStartUSec -p Type
MainPID=0
ExecMainStatus=0
ExecMainCode=0
TimeoutStartUSec=1min 30s
Type=notify
Meaning: MainPID is 0 now because it’s dead (systemd killed it). The key line is Type=notify and a finite timeout.
Decision: Either the daemon never notified readiness or it never reached readiness because it waited on Postgres. Solve the dependency or change the startup behavior.
Task 7: Check for StartLimitHit (the “it failed because it failed” failure)
cr0x@server:~$ systemctl status acme-api.service --no-pager | sed -n '1,18p'
● acme-api.service - Acme API
Loaded: loaded (/etc/systemd/system/acme-api.service; enabled; preset: enabled)
Active: failed (Result: start-limit-hit) since Mon 2025-12-30 10:13:30 UTC; 5s ago
Process: 1912 ExecStart=/usr/local/bin/acme-api --config /etc/acme/api.yaml (code=exited, status=1/FAILURE)
Meaning: systemd stopped trying because it restarted too often.
Decision: Reset the failure counter only after you changed something meaningful; otherwise you’re just accelerating a crash loop.
Task 8: Reset a start-limit and retry intentionally
cr0x@server:~$ sudo systemctl reset-failed acme-api.service
cr0x@server:~$ sudo systemctl start acme-api.service
cr0x@server:~$ systemctl status acme-api.service --no-pager -n 20
● acme-api.service - Acme API
Loaded: loaded (/etc/systemd/system/acme-api.service; enabled; preset: enabled)
Active: activating (start) since Mon 2025-12-30 10:14:01 UTC; 3s ago
Meaning: it’s in activating again—now you watch logs.
Decision: If it repeats, stop and fix the root cause (config, dependency, readiness), don’t keep hammering start.
Task 9: Validate the unit file for obvious footguns
cr0x@server:~$ systemd-analyze verify /etc/systemd/system/acme-api.service
/etc/systemd/system/acme-api.service:6: Unknown lvalue 'Wnats' in section 'Unit'
Meaning: typos happen. systemd ignores unknown keys after this warning, so a dependency you thought was wired (here, a misspelled Wants=) may never have applied.
Decision: Fix unit syntax; reload daemon; retry. If you’re lucky, you just found the outage.
Task 10: Check whether a dependency is failing (mounts and storage are repeat offenders)
cr0x@server:~$ systemctl --failed --no-pager
UNIT LOAD ACTIVE SUB DESCRIPTION
● mnt-data.mount loaded failed failed /mnt/data
● acme-api.service loaded failed failed Acme API
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state.
SUB = The low-level unit activation state.
Meaning: If storage didn’t mount, your app may be collateral damage.
Decision: Fix the mount first. Starting the app before its data path exists is how you get data loss disguised as “availability.”
Task 11: Inspect a failed mount unit quickly
cr0x@server:~$ systemctl status mnt-data.mount --no-pager -n 80
● mnt-data.mount - /mnt/data
Loaded: loaded (/etc/fstab; generated)
Active: failed (Result: exit-code) since Mon 2025-12-30 10:10:02 UTC; 4min ago
Where: /mnt/data
What: UUID=aa1b2c3d-4e5f-6789-a012-b345c678d901
Process: 1222 ExecMount=/usr/bin/mount UUID=aa1b2c3d-4e5f-6789-a012-b345c678d901 /mnt/data (code=exited, status=32)
Status: "Mounting failed."
Dec 30 10:10:02 server mount[1222]: mount: /mnt/data: wrong fs type, bad option, bad superblock on /dev/sdb1, missing codepage or helper program, or other error.
Dec 30 10:10:02 server systemd[1]: mnt-data.mount: Mount process exited, code=exited, status=32/n/a
Meaning: classic mount failure. Could be wrong UUID, fs corruption, missing kernel module, or a changed disk.
Decision: Confirm the block device, run blkid, check dmesg, and only then touch filesystem repair tools.
Task 12: Confirm whether AppArmor blocked the service (Ubuntu loves AppArmor)
cr0x@server:~$ journalctl -k -b --no-pager | grep -i apparmor | tail -n 5
Dec 30 10:11:19 server kernel: audit: type=1400 audit(1735553479.112:88): apparmor="DENIED" operation="open" class="file" profile="/usr/local/bin/acme-api" name="/etc/acme/secret.key" pid=1842 comm="acme-api" requested_mask="r" denied_mask="r" fsuid=1001 ouid=0
Meaning: Your service might be fine; policy isn’t.
Decision: Update the AppArmor profile (or stop confining this binary if you can’t maintain it). Don’t chmod 777 secrets “just to see.”
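To confirm the denial is the whole story (assuming apparmor-utils is installed and the binary has a profile), flip the profile to complain mode and retry; complain logs the access instead of blocking it:
cr0x@server:~$ sudo aa-complain /usr/local/bin/acme-api
Setting /usr/local/bin/acme-api to complain mode.
cr0x@server:~$ sudo systemctl restart acme-api.service
If it starts cleanly now, add the missing rule to the profile and aa-enforce it again. Complain mode is a diagnostic, not a fix.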
Task 13: Reproduce execution under systemd-like constraints (transient unit)
cr0x@server:~$ sudo systemd-run --unit=acme-api-debug --property=User=acme --property=Group=acme /usr/local/bin/acme-api --config /etc/acme/api.yaml
Running as unit: acme-api-debug.service
cr0x@server:~$ systemctl status acme-api-debug.service --no-pager -n 30
● acme-api-debug.service - /usr/local/bin/acme-api --config /etc/acme/api.yaml
Loaded: loaded (/run/systemd/transient/acme-api-debug.service; transient)
Active: failed (Result: exit-code) since Mon 2025-12-30 10:15:22 UTC; 2s ago
Process: 2044 ExecStart=/usr/local/bin/acme-api --config /etc/acme/api.yaml (code=exited, status=1/FAILURE)
Meaning: This isolates “works in my shell” from “works under systemd user/cgroup.”
Decision: Read the transient unit’s logs. If it fails the same way, it’s likely config/dependency, not unit wiring.
Task 14: Show the exact exit status interpretation
cr0x@server:~$ systemctl show acme-api.service -p ExecMainStatus -p ExecMainCode -p Result
ExecMainStatus=1
ExecMainCode=exited
Result=exit-code
Meaning: exit code failure, not timeout, not signal, not watchdog.
Decision: Focus on application stderr logs and configuration. Don’t waste time on dependencies unless logs point there.
Task 15: Check if sockets/ports were the real issue
cr0x@server:~$ ss -ltnp | grep -E ':8080\b'
LISTEN 0 4096 0.0.0.0:8080 0.0.0.0:* users:(("nginx",pid=912,fd=12))
Meaning: Something else owns the port. Many services log this, but sometimes they don’t before exiting.
Decision: Fix the port conflict (change config, stop the other service, or use socket activation properly).
Task 16: When boot is involved, verify the previous boot too
cr0x@server:~$ journalctl -u acme-api.service -b -1 --no-pager -n 80
Dec 30 09:03:12 server systemd[1]: Starting acme-api.service - Acme API...
Dec 30 09:03:13 server acme-api[701]: connecting to postgres at 10.20.0.15:5432
Dec 30 09:03:14 server acme-api[701]: ready
Dec 30 09:03:14 server systemd[1]: Started acme-api.service - Acme API.
Meaning: Yesterday it worked. That’s a gift: something changed (network, secrets, mounts, policy, remote dependency).
Decision: Look for changes between boots: package updates, config deployment, DNS, firewall, routes, storage changes.
Case #2 walkthrough: dependency chain + timeout + a misleading “green” check
Here’s the situation that shows up in real fleets: a service fails to start after a routine reboot, and the error looks local.
It’s not. It’s a dependency/ordering problem wearing a timeout costume.
The setup
You have an API service that reads config, then connects to Postgres. The unit is Type=notify.
It’s configured with After=network.target and Wants=network.target.
On Ubuntu 24.04, the network is managed by systemd-networkd (typical for servers, via netplan) or NetworkManager (typical for desktops).
The failure starts like this:
cr0x@server:~$ systemctl status acme-api.service --no-pager -n 30
● acme-api.service - Acme API
Loaded: loaded (/etc/systemd/system/acme-api.service; enabled; preset: enabled)
Active: failed (Result: timeout) since Mon 2025-12-30 10:12:48 UTC; 7min ago
Dec 30 10:12:48 server systemd[1]: acme-api.service: start operation timed out. Terminating.
Dec 30 10:12:48 server systemd[1]: Failed to start acme-api.service - Acme API.
People often jump straight to “increase TimeoutStartSec.” Sometimes that’s correct. Often it’s lazy.
First, answer the actual question: why did it take too long?
Step 1: The service isn’t “down,” it’s “not ready”
The journal shows it waiting for Postgres. That’s the first clue: the process started and is doing work.
The second clue is Type=notify. That means systemd is waiting for readiness, and the timeout is the wall-clock limit.
The simplest check: can we reach Postgres?
cr0x@server:~$ ping -c 2 10.20.0.15
PING 10.20.0.15 (10.20.0.15) 56(84) bytes of data.
--- 10.20.0.15 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1003ms
Meaning: This host can’t reach the DB IP. That’s not an app bug.
Decision: Debug routing, VLANs, firewall, or the network service state. Don’t touch the app yet.
Step 2: The “network is up” lie (network.target)
network.target mostly means “network management stack is running,” not “you have routes and connectivity.”
If you want “IP configured,” you typically care about network-online.target—but you should treat it like hot sauce:
a little helps; too much ruins dinner.
Check what provides online readiness:
cr0x@server:~$ systemctl status systemd-networkd-wait-online.service --no-pager -n 40
● systemd-networkd-wait-online.service - Wait for Network to be Configured
Loaded: loaded (/usr/lib/systemd/system/systemd-networkd-wait-online.service; enabled; preset: enabled)
Active: failed (Result: timeout) since Mon 2025-12-30 10:10:55 UTC; 9min ago
Process: 876 ExecStart=/usr/lib/systemd/systemd-networkd-wait-online (code=exited, status=1/FAILURE)
Dec 30 10:09:25 server systemd[1]: Starting systemd-networkd-wait-online.service - Wait for Network to be Configured...
Dec 30 10:10:55 server systemd[1]: systemd-networkd-wait-online.service: start operation timed out. Terminating.
Dec 30 10:10:55 server systemd[1]: systemd-networkd-wait-online.service: Failed with result 'timeout'.
Meaning: network didn’t become “online” within the wait timeout. Your API never had a chance if it needs the DB.
Decision: Don’t “fix” the API. Fix why the host never achieved network-online. Usually: wrong netplan, missing VLAN, link down, DHCP failure, or a renamed interface.
Step 3: Find the actual network failure (netplan and interface state)
cr0x@server:~$ ip -br link
lo UNKNOWN 00:00:00:00:00:00
enp0s31f6 DOWN 3c:52:82:aa:bb:cc
Meaning: the interface is down. That’s physical link, driver, or a switch problem.
Decision: If this is a VM, check the hypervisor vNIC. If physical, check cabling/switch port, and dmesg for driver issues.
cr0x@server:~$ journalctl -u systemd-networkd -b --no-pager -n 120
Dec 30 10:09:03 server systemd-networkd[612]: enp0s31f6: Link DOWN
Dec 30 10:09:04 server systemd-networkd[612]: enp0s31f6: Lost carrier
Dec 30 10:09:05 server systemd-networkd[612]: enp0s31f6: DHCPv4 client: No carrier
Meaning: no carrier; DHCP never ran. The “Failed to start” message is innocent—your NIC isn’t.
Decision: Restore link. If link is intentionally absent (isolated network), then the service design is wrong: it shouldn’t block boot.
Step 4: The misleading “green” check
In many orgs, someone will run a health check that says “network OK” because they can resolve localhost or reach a local gateway.
That’s not connectivity to your dependency. It’s a partial truth.
Here’s the check that looks green but isn’t:
cr0x@server:~$ getent hosts postgres.internal
10.20.0.15 postgres.internal
Meaning: DNS (or /etc/hosts) resolution works. That says nothing about routing, firewall, or link.
Decision: Always pair name resolution checks with reachability checks: ip route get, ping, nc, ss.
cr0x@server:~$ ip route get 10.20.0.15
RTNETLINK answers: Network is unreachable
Meaning: kernel has no route. That’s lower-level than the app and higher confidence than any “it should work.”
Decision: Fix netplan/routes. Don’t touch TimeoutStartSec until the host can route to its dependencies.
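Routes and resolution can both be fine while a firewall quietly eats the SYN, so finish with a port-level probe (netcat-openbsd shown; other nc variants word errors differently):
cr0x@server:~$ nc -vz -w 3 10.20.0.15 5432
nc: connect to 10.20.0.15 port 5432 (tcp) failed: Network is unreachable
Here it agrees with ip route get. Once routing is fixed, the same probe tells you whether the DB port itself answers.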
Step 5: The actual fix (and the non-fix)
The non-fix: raising TimeoutStartSec to five minutes. That just gives your outage more time to be mysterious.
The fix is restoring network, and then tightening the service behavior:
if Postgres is down, should the host boot? Usually yes. Should the service keep retrying? Usually yes.
Should it block for 90 seconds and then die? That depends on whether you want the unit “active” without DB.
In practice, you choose one:
- Make it resilient: let it start and serve partial functionality, or keep retrying without failing readiness.
- Make it explicit: keep failing fast, but add clear logs, and make orchestration aware.
If the daemon doesn’t support sd_notify, switching to Type=simple often prevents false timeouts:
cr0x@server:~$ sudo systemctl edit acme-api.service
# (editor opens)
cr0x@server:~$ cat /etc/systemd/system/acme-api.service.d/override.conf
[Service]
Type=simple
TimeoutStartSec=30
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart acme-api.service
Meaning: You’re aligning systemd’s readiness expectations with reality. Also, you shortened the “mystery hang” window.
Decision: Only do this if the service truly doesn’t notify. If it does notify, keep Type=notify and fix the readiness path.
Joke #2: Increasing TimeoutStartSec is like moving the smoke alarm farther away. The kitchen’s still on fire; you just can’t hear it.
Three corporate mini-stories (and the lesson you actually need)
Mini-story #1: The outage caused by a wrong assumption
A mid-sized SaaS shop had a background worker service that processed billing events. It was “simple”:
read from a queue, write to Postgres, emit metrics. It ran for months without drama.
During a security hardening sprint, someone updated the unit file:
After=network.target and Wants=network.target, because that looked “correct” and was already in other units.
The assumption was that network.target meant “network is up.”
A week later, the company migrated a VLAN and introduced a longer DHCP wait on a subset of nodes.
Those nodes booted, started the worker immediately, and the worker tried to connect before the route existed.
The worker had a 60-second startup timeout and exited non-zero. systemd restarted it aggressively. StartLimitHit tripped.
The ugly part: dashboards showed the worker “running” on most nodes. On the impacted nodes, it was “failed,” but nobody had alerts on the unit state.
Billing lagged. Finance noticed before engineering did, which is a special kind of shame.
They fixed it by making connectivity explicit: either wait for a specific dependency (a reachable DB endpoint via a script),
or make the worker start and keep retrying without failing systemd readiness. They also stopped using network.target as a comfort blanket.
Lesson: Don’t encode folklore into unit files. If you depend on connectivity, define what “ready” means and measure it.
Mini-story #2: The optimization that backfired
An enterprise IT team wanted faster boot times on a fleet of Ubuntu servers. Someone noticed a few seconds spent in “wait online”
and decided it was unnecessary. They disabled systemd-networkd-wait-online.service to shave boot time.
It worked—boot was quicker. Then a storage-backed service started failing after reboots. Not always. Just enough to be expensive.
The service mounted an iSCSI LUN, then started a database on top. With wait-online disabled, the iSCSI login raced the network configuration.
Sometimes it won. Sometimes it face-planted.
The team’s first response was to increase database startup timeouts. That made the database “start” more often,
but created a worse failure mode: the database would start on an empty directory if the mount wasn’t present, initialize a new cluster,
and happily serve nonsense until someone noticed.
The eventual fix wasn’t “turn wait-online back on globally.” They made ordering and requirements precise:
the database unit required the mount unit; the mount unit required a network route; and they used per-unit dependency checks.
Boot time stayed fast on machines that didn’t need the network. Machines that did were correct.
Lesson: Global boot optimizations should be treated like changing a shared library. If you can’t describe the dependency graph, don’t “speed it up.”
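In unit terms, “the database unit required the mount unit” is two lines, not a paragraph. A sketch with illustrative names:
cr0x@server:~$ cat /etc/systemd/system/acme-db.service.d/storage.conf
[Unit]
Requires=mnt-data.mount
After=mnt-data.mount
Requires= without After= still permits the race; the pair is what makes the dependency real.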
Mini-story #3: The boring but correct practice that saved the day
A platform team ran a set of internal services with vendor-provided unit files. They never edited files in /usr/lib/systemd/system.
Not once. Instead, every change went into drop-in overrides in /etc/systemd/system/*.d/.
This looked bureaucratic to some engineers—until an urgent package update landed on a Friday night.
The update replaced the vendor unit and changed an ExecStart flag. On hosts where people had edited vendor units directly,
their changes were silently overwritten and services failed.
On the hosts managed the boring way, the drop-ins still applied cleanly. After the update, the services restarted with the new vendor defaults
plus the team’s overrides. The only work required was validating behavior—not recovering from config drift.
The post-incident review was quiet. The team didn’t win points for heroics. They avoided heroics. That’s the whole job.
Lesson: The dull practices (drop-ins, verify, reload, consistent logs) are the ones that survive upgrades.
Common mistakes: symptoms → root cause → fix
1) Symptom: “Failed with result ‘timeout’” during start
Root cause: readiness mismatch (Type=notify without sd_notify), or the service blocks on a dependency (DB, DNS, mount), or TimeoutStartSec too short for real work.
Fix: Determine whether the process is doing useful work. If no sd_notify support, switch to Type=simple.
If it’s waiting on a dependency, fix the dependency or change service behavior to retry after start rather than blocking readiness.
2) Symptom: “failed at step EXEC”
Root cause: ExecStart path wrong, binary missing, wrong permissions, missing interpreter (e.g., script with bad shebang), or filesystem not mounted.
Fix: Check systemctl status and unit file; validate the file exists and is executable; verify mount units; inspect AppArmor denials.
3) Symptom: “start request repeated too quickly” / start-limit-hit
Root cause: crash loop or rapid failure, exacerbated by Restart=always/on-failure; systemd’s rate limiter trips.
Fix: Read logs for the first failure, not the last. Fix the underlying config/port/secrets issue. Then systemctl reset-failed.
Optionally tune StartLimit settings, but don’t use them to hide a loop.
4) Symptom: service works when run manually, fails under systemd
Root cause: different environment (PATH, working directory, ulimits), different user, sandboxing options, missing permissions.
Fix: Use systemd-run with User/Group to reproduce; check unit’s WorkingDirectory, EnvironmentFile, and hardening flags.
5) Symptom: service fails only after reboot, not after manual restart
Root cause: boot-time ordering issue: mount not ready, network not configured, secrets not available, time not synced, dependency service not started yet.
Fix: Use systemd-analyze critical-chain and inspect dependencies. Replace vague After=network.target with correct ordering and explicit requirements (mount unit, specific dependency unit, or a lightweight pre-check).
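One shape of that “lightweight pre-check,” as a drop-in (endpoint, port, and timing are illustrative; the loop is bounded by TimeoutStartSec, so keep the two consistent):
cr0x@server:~$ cat /etc/systemd/system/acme-api.service.d/precheck.conf
[Service]
ExecStartPre=/bin/sh -c 'until nc -z -w 2 10.20.0.15 5432; do sleep 2; done'
This turns a silent in-app wait into a visible activating phase with a clean timeout instead of a mystery.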
6) Symptom: service “starts” but immediately exits and systemd reports success
Root cause: wrong service Type (e.g., forking vs simple), or ExecStart launching a wrapper that exits while the daemon continues (or never continues).
Fix: Verify the daemon model. Use Type=forking with PIDFile only when the daemon truly forks. Prefer Type=simple or notify for modern daemons.
7) Symptom: everything “looks fine,” but the service can’t access a file or socket
Root cause: AppArmor denial, systemd sandboxing (ProtectSystem, ReadOnlyPaths, PrivateTmp), or file ownership mismatch after deployment.
Fix: Search kernel journal for denials; adjust policy or sandboxing; ensure correct ownership and permissions.
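Sandboxing is readable straight from systemd, so check it before blaming the filesystem (these are real unit properties; the values shown are illustrative):
cr0x@server:~$ systemctl show acme-api.service -p User -p ProtectSystem -p ProtectHome -p PrivateTmp -p ReadOnlyPaths
User=acme
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
ReadOnlyPaths=/etc/acme
ProtectSystem=strict with no ReadWritePaths= leaves almost the entire tree read-only to the service; deploy scripts that assume otherwise fail in ways that look like permission bugs.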
Checklists / step-by-step plan
Checklist A: The 90-second triage (single failing service)
- Run systemctl status UNIT --no-pager. Capture Result (timeout/exit-code/start-limit-hit) and any “step” (EXEC).
- Run journalctl -u UNIT -b -n 200 --no-pager. Look for the last meaningful app log line.
- Run systemctl cat UNIT. Note Type=, TimeoutStartSec=, User=, and the ExecStart path.
- Bucket it: EXEC problem, exit-code, timeout/wait, dependency, policy, or rate limiting.
- Pick the next tool based on bucket (don’t freestyle).
Checklist B: The “boot is stuck” workflow
- Run systemctl --failed --no-pager to list the earliest obvious failures (mounts, networking, name resolution).
- Run systemd-analyze critical-chain to identify what delayed reaching the default target.
- Fix the earliest broken dependency first (mount/network) before restarting downstream services.
- If you need the host up immediately: temporarily mask the non-critical failing unit, boot to multi-user, and revert the mask after the fix is deployed (a runtime-mask sketch follows).
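For the emergency bypass, prefer a runtime mask: it lives under /run and evaporates on the next reboot, so it can’t silently become permanent (drop --runtime if you need it to survive reboots, but then the revert is on you):
cr0x@server:~$ sudo systemctl mask --runtime acme-api.service
Created symlink /run/systemd/system/acme-api.service → /dev/null.
cr0x@server:~$ sudo systemctl unmask --runtime acme-api.service   # after the fix lands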
Checklist C: Reboot-safe repair discipline (avoid self-inflicted wounds)
- Never edit vendor unit files in /usr/lib/systemd/system or /lib/systemd/system. Use systemctl edit drop-ins.
- After unit changes: systemctl daemon-reload, then restart the unit.
- Use systemd-analyze verify on units you touched.
- Record what you changed and why. Future-you is an SRE too, and they’re tired.
Checklist D: Storage-aware service starts (because storage failures masquerade as service failures)
- If a service uses a path like /mnt/data, ensure there is a mount unit and the service requires it.
- Check for failed mounts early with systemctl --failed.
- Never let a database initialize on an unmounted directory. Add explicit checks or requirements (see the sketch after this list).
- If the mount is remote (NFS/iSCSI), treat it as a network dependency and design for partial failure.
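Two directives cover most of this checklist (path illustrative): RequiresMountsFor= adds both the requirement and the ordering for whatever mounts a path needs, and ConditionPathIsMountPoint= refuses to start the unit if the path isn’t actually mounted, which is exactly the guard against a database initializing on an empty directory:
cr0x@server:~$ cat /etc/systemd/system/acme-db.service.d/mounts.conf
[Unit]
RequiresMountsFor=/mnt/data
ConditionPathIsMountPoint=/mnt/data
Note that a failed condition skips the unit quietly rather than marking it failed, so pair it with alerting.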
FAQ
1) Why does systemd say “Failed to start” when the binary clearly ran?
Because “start” includes readiness. If Type=notify is used, systemd expects a readiness signal.
If the app blocks on a dependency, systemd may timeout and kill it even though it was “doing something.”
2) Should I always add After=network-online.target?
No. Many services don’t need full connectivity at start, and waiting for “online” can slow boot or introduce new failure modes.
Add it only when the service truly requires network connectivity to become functional, and prefer more specific dependencies when possible.
3) What’s the difference between After= and Requires=?
After= is ordering only: “start me later.” Requires= is a requirement: “if that fails, I fail too.”
Confusing them creates either boot deadlocks (too many requirements) or race conditions (ordering without ensuring the dependency exists).
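In unit-file terms, against a hypothetical db.service:
[Unit]
# Ordering only: if both units are queued, start me after db.service.
# If db.service isn't queued at all, I start anyway.
After=db.service
# Requirement only: queue db.service with me; if it fails, stop me.
# Without After=, the two may still start in parallel.
Requires=db.service
In practice, pair Requires= (or Wants=) with After=; a requirement without ordering is a race you scheduled yourself.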
4) What does “failed at step EXEC” usually mean?
systemd couldn’t execute the configured command. Common causes: wrong path, missing binary, no execute bit, wrong user permissions,
missing interpreter for scripts, or a required filesystem path not mounted.
5) How do I know if AppArmor is blocking my service?
Look for denials in the kernel journal: journalctl -k -b | grep -i apparmor.
If you see DENIED entries matching your service’s binary, you’ve found a policy issue, not an application issue.
6) When should I increase TimeoutStartSec?
Only when the service legitimately needs longer to become ready and you understand why.
If it’s waiting on a flaky dependency, a longer timeout can make recovery slower and failures harder to detect.
7) Why does a service restart loop turn into start-limit-hit?
systemd rate-limits restarts to prevent thrash. If a unit fails repeatedly in a short interval, it stops trying.
Fix the underlying problem, then clear the counter with systemctl reset-failed.
8) The service works when I run it manually. Why not via systemd?
systemd runs it as a configured user with a specific environment, working directory, ulimits, cgroup constraints, and possibly sandboxing.
Reproduce using systemd-run with the same User/Group to narrow the gap.
9) How do I tell if the problem is upstream (dependency) rather than the service?
If the unit logs show “waiting for …” (DB, DNS, mount), or if you see failed mount/network units in systemctl --failed,
treat it as upstream until proven otherwise. Fix the earliest failure in the chain.
Next steps you can do today
If you want “Failed to start” incidents to stop eating your afternoons, do these in order:
- Standardize the triage commands across your team: systemctl status, journalctl -u, systemctl cat, systemd-analyze critical-chain, systemctl --failed.
- Make unit readiness honest: don’t use Type=notify unless the daemon truly notifies; don’t hide dependency waits behind long timeouts.
- Wire dependencies explicitly: if you need a mount, require the mount. If you need a route, prove it (or design to retry without blocking boot).
- Use drop-in overrides for every operational change. Keep vendor units pristine so upgrades don’t surprise you.
- Add alerts on unit failure for the services that matter. The journal is great, but it does not page you.
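A minimal shape for that (handler name and script are illustrative; the OnFailure=/template mechanics are stock systemd):
cr0x@server:~$ cat /etc/systemd/system/acme-api.service.d/onfailure.conf
[Unit]
OnFailure=notify-failed@%n.service
cr0x@server:~$ cat /etc/systemd/system/notify-failed@.service
[Unit]
Description=Alert that %i entered failed state
[Service]
Type=oneshot
ExecStart=/usr/local/bin/notify-failed %i
%n expands to the failing unit’s full name, so one template serves every service you care about.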
The fastest triage workflow isn’t a bag of commands. It’s a habit: identify the failure class, read the unit, read the logs,
follow the dependency chain, and fix the earliest broken thing. Everything else is theater.