It’s 02:13. The pager says “service down,” the dashboard is a flatline, and your console greets you with: Failed to start …. That message is systemd’s version of “something happened.” Helpful as a fortune cookie.
This is the workflow I use on Ubuntu 24.04 when I need an answer fast: what failed, why it failed, what it depends on, and whether we should fix, rollback, or isolate. No mysticism. Just the shortest path from symptom to decision.
The mental model: what “Failed to start” really means
Systemd is an orchestrator, not a psychic. When it prints “Failed to start,” it’s reporting a state transition: a unit went from “activating” to “failed,” or it never made it to “active,” or it hit a restart throttle, or one of its dependencies failed first and systemd is just the messenger.
On Ubuntu 24.04, you’ll see the same handful of underlying reasons repeat:
- Bad unit definition: typo, wrong path, missing quotes, invalid directive, wrong Type=.
- Runtime failure: executable exits non-zero, crashes, times out, or can’t bind a port.
- Dependency failure: mount not present, network not online, secrets not readable, database down.
- Environment mismatch: moved config file, changed user, changed permissions, SELinux/AppArmor profile conflict.
- System-level pressure: disk full, memory pressure (OOM), file descriptor limits, CPU quotas, cgroup constraints.
- Start throttling: “Start request repeated too quickly,” aka you made a restart loop and systemd got tired.
The key is to treat systemd like a graph engine. Units have ordering (After=, Before=) and requirements (Requires=, Wants=). Ordering is “when,” requirements are “must exist.” Mixing them up is a classic outage generator.
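A minimal sketch of the distinction, using a hypothetical myapp.service that genuinely needs /mnt/data and merely prefers the network to be up (the unit name and mount path are placeholders):

# /etc/systemd/system/myapp.service (illustrative fragment)
[Unit]
# Ordering only: start after these, but don't fail just because they failed
After=network-online.target mnt-data.mount
# Soft preference: pull in network-online, tolerate its failure
Wants=network-online.target
# Hard requirement: if the mount fails, this unit fails too
# (RequiresMountsFor=/mnt/data is the more idiomatic equivalent for mounts)
Requires=mnt-data.mount

Requires= without the matching After= is the sneaky variant: you demanded the dependency exist, but never said you’d wait for it.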
Interesting facts & historical context (because the past keeps showing up in your incidents)
- Systemd landed in mainstream Linux around 2010–2012 and replaced a zoo of init scripts with a single dependency-aware manager.
- Ubuntu switched from Upstart to systemd in 15.04; a decade later, many “init script assumptions” still lurk in custom services.
- journald is binary by default (structured metadata, fast filtering), which is great until you forget to persist logs and reboot wipes the crime scene.
- Units aren’t just services: mounts, sockets, timers, paths, scopes, targets—a surprising share of “failed service” events actually start as a failed mount or socket.
- Start throttling exists to protect the host; without it, a crash-looping service can DOS its own machine with fork/exec storms.
- systemd’s “network-online” is deliberately slippery: it means “a network manager says it’s online,” not “your SaaS endpoint is reachable.”
- Timeout defaults changed over time; older cargo-cult unit files sometimes set timeouts that are either too short for modern boot paths or too long for production expectations.
- cgroups v2 is the default world now; resource controls and OOM behavior can differ from older hosts, surprising services that “always worked.”
And one operational quote worth keeping taped to your terminal:
“Hope is not a strategy.” — General Gordon R. Sullivan
Systemd triage is the antidote to hope.
Fast diagnosis playbook (first/second/third checks)
If you only remember one thing: don’t start by editing files. Start by collecting evidence, then pick the smallest intervention that restores service safely.
First: identify exactly what failed and how
- Check unit state and last result (systemctl status).
- Pull relevant logs from the current boot (journalctl -u ... -b).
- Confirm whether it’s a direct failure or a dependency cascade (systemctl list-dependencies and systemctl show).
Second: classify the failure in 60 seconds
- Exec/exit: look for exit code, signal, core dump, missing file.
- Timeout: “start operation timed out,” often waiting for mounts, network-online, or slow disk.
- Permission: “permission denied,” “cannot open,” or AppArmor denials.
- Resource: OOM kill, ENOSPC, too many open files.
- Restart throttle: “start-limit-hit” or “request repeated too quickly.”
Third: pick the least risky corrective action
- Known-good rollback: revert last config/package change if the timeline matches.
- Temporary isolation: stop dependent units, mask flapping ones, or disable the timer to stabilize the host.
- Surgical fix: adjust unit file, permissions, mount ordering, or environment file—and restart with a clean slate.
One short joke, because you’ll need it: systemd logs are like crime novels—everyone is a suspect, and the perpetrator is usually “a missing file.”
Twelve+ practical tasks with commands, output meaning, and decisions
These are the moves I make on Ubuntu 24.04 when a unit fails. Each task includes the command, a realistic snippet of output, what it means, and what decision you make next.
Task 1: Confirm the unit’s current state, last exit code, and the immediate clue
cr0x@server:~$ systemctl status nginx.service
× nginx.service - A high performance web server and a reverse proxy server
Loaded: loaded (/lib/systemd/system/nginx.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Mon 2025-12-30 02:11:02 UTC; 1min 4s ago
Duration: 83ms
Docs: man:nginx(8)
Process: 2191 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=1/FAILURE)
CPU: 72ms
Dec 30 02:11:02 server nginx[2191]: nginx: [emerg] open() "/etc/nginx/snippets/tls.conf" failed (2: No such file or directory)
Dec 30 02:11:02 server systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
Dec 30 02:11:02 server systemd[1]: nginx.service: Failed with result 'exit-code'.
Dec 30 02:11:02 server systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.
Meaning: Failure is in ExecStartPre config test; nginx refused to start due to a missing file.
Decision: Don’t restart blindly. Fix the missing include or rollback the config change. Verify with nginx -t once corrected.
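If the snippet was simply deleted, a minimal recovery sequence looks like this; the backup location is hypothetical (restore from config management or wherever your copies live), the snippet path is the one from the error above:

cr0x@server:~$ sudo cp /var/backups/nginx/tls.conf /etc/nginx/snippets/tls.conf
cr0x@server:~$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
cr0x@server:~$ sudo systemctl start nginx.service && systemctl is-active nginx.service
active

The order matters: restore, validate with nginx -t, then start. Reversing it just burns another failed start.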
Task 2: Pull the full journal slice for the unit, from the current boot
cr0x@server:~$ journalctl -u nginx.service -b --no-pager -n 80
Dec 30 02:11:02 server nginx[2191]: nginx: [emerg] open() "/etc/nginx/snippets/tls.conf" failed (2: No such file or directory)
Dec 30 02:11:02 server systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
Dec 30 02:11:02 server systemd[1]: nginx.service: Failed with result 'exit-code'.
Dec 30 02:11:02 server systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.
Meaning: Journald confirms the same root cause; no need to chase dependencies yet.
Decision: Fix config file path or restore snippet; then retest and start.
Task 3: If it’s a dependency cascade, ask systemd what it thinks triggered the stop
cr0x@server:~$ systemctl show -p Id -p ActiveState -p SubState -p Result -p ExecMainStatus -p ExecMainCode -p NRestarts nginx.service
Id=nginx.service
ActiveState=failed
SubState=failed
Result=exit-code
ExecMainCode=1
ExecMainStatus=1
NRestarts=0
Meaning: This is a clean “process exited 1” failure, not a signal kill, not a restart loop.
Decision: Fix the underlying executable/config issue; don’t touch start limits or restart policy.
Task 4: Show unit dependencies and look for “dead” requirements
cr0x@server:~$ systemctl list-dependencies --reverse nginx.service
nginx.service
● nginx.service
○ systemd-user-sessions.service
○ multi-user.target
Meaning: Nothing special depends on nginx besides the target; reverse dependencies won’t block a restart.
Decision: Safe to restart after fixing config; low blast radius.
Task 5: Verify unit file correctness and overrides (catch “it’s not using what you think”)
cr0x@server:~$ systemctl cat nginx.service
# /lib/systemd/system/nginx.service
[Unit]
Description=A high performance web server and a reverse proxy server
After=network-online.target remote-fs.target nss-lookup.target
Wants=network-online.target
[Service]
Type=forking
PIDFile=/run/nginx.pid
ExecStartPre=/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'
ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;'
ExecReload=/usr/sbin/nginx -g 'daemon on; master_process on;' -s reload
ExecStop=/sbin/start-stop-daemon --quiet --stop --retry QUIT/5 --pidfile /run/nginx.pid
TimeoutStopSec=5
KillMode=mixed
[Install]
WantedBy=multi-user.target
Meaning: No drop-ins shown; this unit uses the packaged definition. If you expected custom overrides, you’re debugging the wrong file.
Decision: If customization is required, create a drop-in with systemctl edit rather than modifying files under /lib.
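A minimal sketch of the drop-in workflow; the TimeoutStopSec value is purely illustrative:

cr0x@server:~$ sudo systemctl edit nginx.service
# An editor opens; whatever you save becomes
# /etc/systemd/system/nginx.service.d/override.conf, for example:
[Service]
TimeoutStopSec=15

Afterwards, systemctl cat nginx.service shows the packaged unit plus your drop-in, and systemctl edit reloads the manager for you; hand-written files under /etc/systemd/system still need a daemon-reload.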
Task 6: Validate a drop-in override and check for subtle misconfigurations
cr0x@server:~$ systemctl status myapp.service
× myapp.service - My App API
Loaded: loaded (/etc/systemd/system/myapp.service; enabled; preset: enabled)
Drop-In: /etc/systemd/system/myapp.service.d
└─override.conf
Active: failed (Result: exit-code) since Mon 2025-12-30 02:08:19 UTC; 2min 53s ago
Dec 30 02:08:19 server systemd[1]: myapp.service: Failed to run 'start' task: No such file or directory
cr0x@server:~$ systemctl cat myapp.service
# /etc/systemd/system/myapp.service
[Service]
ExecStart=/opt/myapp/bin/myapp --config /etc/myapp/config.yaml
User=myapp
Group=myapp
EnvironmentFile=/etc/myapp/myapp.env
# /etc/systemd/system/myapp.service.d/override.conf
[Service]
ExecStart=/opt/myapp/bin/myappd --config /etc/myapp/config.yaml
Meaning: Drop-in overrides replaced ExecStart. If /opt/myapp/bin/myappd doesn’t exist, systemd can’t exec it.
Decision: Fix the override path or remove the drop-in; then systemctl daemon-reload and restart.
Task 7: Detect “start-limit-hit” (restart throttle) and reset it correctly
cr0x@server:~$ systemctl status myapp.service
× myapp.service - My App API
Loaded: loaded (/etc/systemd/system/myapp.service; enabled; preset: enabled)
Active: failed (Result: start-limit-hit) since Mon 2025-12-30 02:09:02 UTC; 2min 10s ago
Dec 30 02:09:02 server systemd[1]: myapp.service: Scheduled restart job, restart counter is at 5.
Dec 30 02:09:02 server systemd[1]: myapp.service: Start request repeated too quickly.
Dec 30 02:09:02 server systemd[1]: myapp.service: Failed with result 'start-limit-hit'.
Meaning: The service flapped and systemd throttled it. This is usually a symptom, not the disease.
Decision: Fix the crash/exit first. Once fixed, clear the throttle:
cr0x@server:~$ sudo systemctl reset-failed myapp.service
Meaning: Clears failure state so start attempts are allowed again.
Decision: Start the service only after the underlying cause is addressed, or you’ll re-trigger the throttle.
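If you want to see (or carefully loosen) the throttle, the knobs are ordinary unit properties. A sketch; the values shown are systemd defaults, and RestartSec=10 is an illustrative choice, not a recommendation:

cr0x@server:~$ systemctl show myapp.service -p Restart -p RestartUSec -p StartLimitBurst -p StartLimitIntervalUSec
Restart=on-failure
RestartUSec=100ms
StartLimitBurst=5
StartLimitIntervalUSec=10s
cr0x@server:~$ sudo systemctl edit myapp.service
[Service]
# Space restarts out instead of raising the burst limit
RestartSec=10

A longer RestartSec usually buys more than a bigger StartLimitBurst: the service still stops flapping, but it doesn’t hammer its dependencies while it does.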
Task 8: Identify a timeout vs an actual crash
cr0x@server:~$ systemctl status postgresql.service
× postgresql.service - PostgreSQL RDBMS
Loaded: loaded (/lib/systemd/system/postgresql.service; enabled; preset: enabled)
Active: failed (Result: timeout) since Mon 2025-12-30 02:05:41 UTC; 6min ago
Dec 30 02:04:11 server systemd[1]: Starting postgresql.service - PostgreSQL RDBMS...
Dec 30 02:05:41 server systemd[1]: postgresql.service: start operation timed out. Terminating.
Dec 30 02:05:41 server systemd[1]: postgresql.service: Failed with result 'timeout'.
Meaning: It didn’t exit quickly; it hung during startup. Common culprits: slow storage, WAL recovery, locked data dir, DNS delay, or a mount dependency.
Decision: Check logs, storage health, and whether the data directory is on a mount that wasn’t ready.
Task 9: Correlate unit start time with storage/mount readiness
cr0x@server:~$ systemd-analyze critical-chain postgresql.service
postgresql.service +1min 28.122s
└─local-fs.target @12.405s
└─mnt-data.mount @11.902s +1min 15.701s
└─systemd-fsck@dev-disk-by\x2duuid-3a1c...service @3.211s +8.614s
└─dev-disk-by\x2duuid-3a1c....device @2.983s
Meaning: The real delay is mnt-data.mount taking 75 seconds. Postgres is just waiting on the filesystem.
Decision: Fix mount performance (network storage? fsck? device errors?). Or decouple startup ordering if safe, but don’t paper over a failing disk.
Task 10: Inspect mount unit failures and fstab mistakes
cr0x@server:~$ systemctl status mnt-data.mount
× mnt-data.mount - /mnt/data
Loaded: loaded (/etc/fstab; generated)
Active: failed (Result: exit-code) since Mon 2025-12-30 02:03:22 UTC; 8min ago
Dec 30 02:03:22 server mount[612]: mount: /mnt/data: wrong fs type, bad option, bad superblock on /dev/sdb1, missing codepage or helper program.
Dec 30 02:03:22 server systemd[1]: mnt-data.mount: Mount process exited, code=exited, status=32/n/a
Dec 30 02:03:22 server systemd[1]: mnt-data.mount: Failed with result 'exit-code'.
Meaning: The mount is broken; any service depending on it will fail or hang. “wrong fs type” may mean wrong fstab type, missing package for the filesystem, or actual corruption.
Decision: Confirm device and filesystem type. If this is production data, stop thrashing and validate the block device before repeated mounts make it worse.
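Before retrying the mount, take a quick read-only look at what the kernel and the device think happened. A sketch; smartctl requires the smartmontools package and a real disk behind /dev/sdb:

cr0x@server:~$ sudo dmesg | grep -iE 'sdb|xfs|i/o error' | tail -n 5
cr0x@server:~$ sudo blkid /dev/sdb1
/dev/sdb1: UUID="9a1e3f1f-6a71-4bb5-8a45-2e7a5bb1c5b2" TYPE="xfs"
cr0x@server:~$ sudo smartctl -H /dev/sdb
SMART overall-health self-assessment test result: PASSED

Kernel I/O errors or a failing SMART status change the conversation from “fix fstab” to “evacuate the data.”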
Task 11: Confirm what filesystem the kernel thinks it is (and whether the device is there)
cr0x@server:~$ lsblk -f
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
sda
├─sda1 vfat FAT32 7C1A-3F2B 510M 2% /boot/efi
├─sda2 ext4 1.0 2d4b2b3c-5a62-4b2f-8f87-1b9e8f8a0c19 14G 61% /
└─sda3 swap 1 7d3e0c7c-5d48-4d1b-9b8b-2a5d0f3b9e21 [SWAP]
sdb
└─sdb1 xfs 9a1e3f1f-6a71-4bb5-8a45-2e7a5bb1c5b2
Meaning: /dev/sdb1 is XFS. If fstab says ext4, that’s your bug. If fstab says xfs and it still fails, suspect XFS repair needs or missing xfsprogs (rare on Ubuntu, but possible in minimal builds).
Decision: Fix fstab type/options, or run filesystem checks in maintenance mode. Don’t “force” mount options on a sick disk in production unless you like surprise data loss.
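A sketch of a corrected fstab entry for a non-critical data mount; the options are illustrative, and nofail is only appropriate if the system can genuinely run without the mount:

# /etc/fstab (illustrative line)
UUID=9a1e3f1f-6a71-4bb5-8a45-2e7a5bb1c5b2  /mnt/data  xfs  defaults,nofail,x-systemd.device-timeout=30s  0  0

cr0x@server:~$ sudo findmnt --verify

findmnt --verify parses fstab the way the mount machinery will and reports parse errors and warnings per target, which catches typos without a reboot to find them.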
Task 12: Check if a service was killed by OOM (the silent assassin)
cr0x@server:~$ journalctl -b -k --no-pager | tail -n 12
Dec 30 02:07:12 server kernel: Out of memory: Killed process 3310 (myapp) total-vm:3128456kB, anon-rss:1452032kB, file-rss:132kB, shmem-rss:0kB
Dec 30 02:07:12 server kernel: oom_reaper: reaped process 3310 (myapp), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Dec 30 02:07:12 server systemd[1]: myapp.service: Main process exited, code=killed, status=9/KILL
Dec 30 02:07:12 server systemd[1]: myapp.service: Failed with result 'signal'.
Meaning: The kernel killed it, not systemd. Systemd reports the aftermath: SIGKILL.
Decision: This is capacity/limit work: tune memory, reduce usage, add swap (with caution), fix leaks, or set sane cgroup limits. Restarting alone is a temporary lie.
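If the decision is to cap the service rather than let the kernel pick a victim host-wide, cgroup limits go in a drop-in. A sketch with illustrative numbers; size them from observed usage, not from hope:

cr0x@server:~$ sudo systemctl edit myapp.service
[Service]
# Soft pressure first, hard cap above it (illustrative values)
MemoryHigh=1500M
MemoryMax=2G

With MemoryMax=, an over-limit service gets OOM-killed inside its own cgroup instead of destabilizing the whole host; the restart still shows up as a signal kill, but the blast radius stays contained.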
Task 13: Check for “address already in use” (port conflicts) and pick the correct offender
cr0x@server:~$ systemctl status myapp.service
× myapp.service - My App API
Loaded: loaded (/etc/systemd/system/myapp.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Mon 2025-12-30 02:12:48 UTC; 10s ago
Dec 30 02:12:48 server myapp[4021]: listen tcp 0.0.0.0:8080: bind: address already in use
cr0x@server:~$ sudo ss -ltnp | grep ':8080'
LISTEN 0 4096 0.0.0.0:8080 0.0.0.0:* users:(("nginx",pid=2101,fd=12))
Meaning: Nginx (or something else) owns the port. Your app can’t bind.
Decision: Decide who should own the port. Either move the app to a different port, update the reverse proxy, or stop the conflicting service. Don’t kill random PIDs without knowing why they’re there.
Task 14: Confirm whether AppArmor blocked access (common on Ubuntu)
cr0x@server:~$ journalctl -b --no-pager | grep -i apparmor | tail -n 6
Dec 30 02:10:02 server kernel: audit: type=1400 audit(1735524602.112:91): apparmor="DENIED" operation="open" profile="/usr/sbin/nginx" name="/etc/ssl/private/my.key" pid=2191 comm="nginx" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Meaning: The service is blocked from reading a file it needs. This looks like a permission issue but isn’t fixed by chmod 777 (don’t do that).
Decision: Adjust the AppArmor profile properly or relocate secrets to approved paths; then reload the profile and restart.
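On Ubuntu, the conventional place for site-local additions is the profile’s local include, so package updates don’t clobber your rule. A sketch, assuming the profile file is /etc/apparmor.d/usr.sbin.nginx with a local include (match it to the profile named in the denial):

cr0x@server:~$ echo '/etc/ssl/private/my.key r,' | sudo tee -a /etc/apparmor.d/local/usr.sbin.nginx
cr0x@server:~$ sudo apparmor_parser -r /etc/apparmor.d/usr.sbin.nginx
cr0x@server:~$ sudo systemctl restart nginx.service

If the profile has no local include, or the secret genuinely shouldn’t be readable by this service, move the file rather than widening the profile.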
Task 15: Verify the unit can actually see the environment you think it has
cr0x@server:~$ systemctl show myapp.service -p Environment -p EnvironmentFiles
Environment=
EnvironmentFiles=/etc/myapp/myapp.env (ignore_errors=no)
cr0x@server:~$ sudo test -r /etc/myapp/myapp.env && echo readable || echo not_readable
not_readable
Meaning: The env file is missing or unreadable even to root, so systemd can’t load it. Because EnvironmentFile= here has no “-” prefix (ignore_errors=no), that alone fails the start rather than just leaving variables unset.
Decision: Fix ownership/permissions. For secrets: root-readable, service user-readable only if needed, and avoid world-readable configs.
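A minimal permissions fix for the example above, assuming the file exists, contains secrets, and should be readable by the myapp user from the unit:

cr0x@server:~$ sudo chown root:myapp /etc/myapp/myapp.env
cr0x@server:~$ sudo chmod 640 /etc/myapp/myapp.env
cr0x@server:~$ sudo test -r /etc/myapp/myapp.env && echo readable || echo not_readable
readable

Systemd itself reads EnvironmentFile= with root privileges at spawn time, so group readability matters mainly if the application re-reads the file after dropping privileges.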
Failure modes that matter on Ubuntu 24.04
1) Dependency confusion: After= is not Requires=
After=network-online.target means “start after that target,” not “fail if network-online fails.” If your service truly requires something, declare it. Conversely, if you declare Requires= on something flaky, you’ll take your service down whenever that dependency hiccups.
2) Oneshot units that pretend to be long-running services
Many internal services are actually “run a script to do setup” and then exit. If the unit is defined as Type=simple with a short-lived process, systemd thinks it crashed. Use Type=oneshot with RemainAfterExit=yes when appropriate.
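A sketch of the oneshot pattern for a setup-style job; the paths and unit names are illustrative:

# /etc/systemd/system/myapp-setup.service (illustrative)
[Unit]
Description=One-time setup for myapp
Before=myapp.service

[Service]
Type=oneshot
ExecStart=/opt/myapp/bin/setup.sh
# Keep the unit "active (exited)" so dependents can order against it
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

With Type=simple plus an aggressive Restart= policy, the same short-lived script flaps and hits the start limit; with Type=oneshot and RemainAfterExit=yes it simply counts as a completed job.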
3) Network-online is a trap (and not always your friend)
On cloud instances, “network online” may become true before DNS works, before routes converge, or before your overlay network is functional. If your service needs to reach a remote database during startup, prefer explicit retry logic in the app. Systemd ordering won’t save you from flaky upstreams.
4) Storage delays: your service is innocent, your mount is guilty
When storage is slow, services timeout. When storage is broken, services fail. Either way, the unit that gets blamed is rarely the one that caused the delay.
On Ubuntu 24.04, pay attention to:
- fstab entries generating mount units
- remote-fs behavior (NFS, CIFS)
- fsck delays
- device renumbering after hardware changes
5) The “it worked yesterday” myth: packaging and config drift
Ubuntu 24.04 brings newer systemd, newer OpenSSL defaults, newer Python, newer kernels. Services that were effectively undefined behavior on older boxes can become properly broken. Don’t fight that. Fix the assumptions.
Three corporate-world mini-stories (anonymized, plausible, and painful)
Mini-story 1: The outage caused by a wrong assumption
A team migrated a fleet to Ubuntu 24.04 and “standardized” unit files. Someone assumed After=network-online.target meant the service wouldn’t start until the database was reachable. In staging, it looked fine. In production, a subset of hosts came up with delayed DNS after a network change.
The service tried to connect to the database at startup, failed once, and exited. Restart policy was set to Restart=on-failure with a tight loop. Systemd did what it’s designed to do: restarted repeatedly, then throttled with start-limit-hit. Now the service was down and refusing to start, even after DNS stabilized.
The initial response was predictable: people tweaked StartLimitIntervalSec and StartLimitBurst to “let it keep trying.” That created a new problem: the crash loop hammered DNS and the database with connection storms from every host rebooting. This is how you turn a local failure into a shared incident.
The fix was boring and correct: the service gained exponential backoff and retry at the application level, startup was made tolerant of initial upstream failures, and systemd ordering was simplified. They kept After=network-online.target for cleanliness, but stopped pretending it was a connectivity guarantee.
Mini-story 2: The optimization that backfired
A cost-cutting initiative pushed for faster boot and lower memory usage. Someone set aggressive systemd timeouts globally and tightened memory limits for a batch of data-processing units. Boot got faster in synthetic tests, and a slide deck was born.
In real life, a subset of nodes had slower attached storage (not broken, just slower at certain times). Postgres on those nodes occasionally needed longer for recovery after unclean shutdowns. The new timeouts killed it mid-recovery. Not only did the service fail to start—repeated interrupted recovery made startup even slower next time, and the restart loop multiplied the pain.
Operations blamed the database. The database team blamed the kernel. The kernel team blamed the hypervisor. Meanwhile, the root cause was a “performance improvement” that removed slack from a system that needed it.
The rollback was immediate: restore sane timeouts, then set service-specific values based on observed recovery times. The longer-term fix was a policy: timeouts are per-service contracts, not global wishes. And if you want faster boot, fix the actual bottlenecks—usually storage and network, not the stopwatch.
Mini-story 3: The boring practice that saved the day
A payments-adjacent service ran on a cluster with strict change control. It wasn’t glamorous. Every systemd unit had a drop-in file in version control, and every rollout included a “smoke reboot” in a canary environment to catch boot-order and dependency issues.
One day, a routine patch introduced an fstab entry for a new reporting mount. The mount server was available, but one node had a stale DNS resolver config. On reboot, the mount unit failed. The main service had RequiresMountsFor= pointing at that path because someone thought it was “nice to have.” The node didn’t come back cleanly.
Because they had a canary smoke reboot, the issue was caught before the patch hit the fleet. The fix was surgical: the reporting mount was changed to nofail and the service requirement was removed. The primary service didn’t need the mount to process payments; it needed it to emit a report. That distinction matters.
No heroics, no war room. Just a checklist, a canary, and an insistence on modeling dependencies as business-critical vs optional. Boring won.
Common mistakes: symptom → root cause → fix
This is the part where you stop repeating the same incident every quarter.
1) “Start request repeated too quickly”
Symptom: Unit fails with Result: start-limit-hit, won’t restart.
Root cause: Crash loop or immediate exit (bad config, missing binary, port conflict). Throttle is systemd doing self-defense.
Fix: Identify why it exits, fix that, then systemctl reset-failed UNIT. Avoid “solving” by increasing StartLimit unless you enjoy turning one problem into many.
2) “Failed to run ‘start’ task: No such file or directory”
Symptom: systemd can’t exec the command.
Root cause: Wrong ExecStart= path, missing executable, wrong architecture, or overridden ExecStart in a drop-in you forgot about.
Fix: systemctl cat UNIT, confirm the final ExecStart, verify file exists and is executable. Then daemon-reload if you changed unit files.
3) Service “active (exited)” but functionality missing
Symptom: systemctl shows success, but the service isn’t running.
Root cause: Type=oneshot script that exits; or a misdeclared unit where the main process forks and systemd loses track.
Fix: Use correct Type= (simple, forking, notify, oneshot) and set PIDFile= where needed. Validate with systemctl show -p MainPID.
4) “Dependency failed for …” during boot
Symptom: A target or service fails because another unit failed.
Root cause: Hard requirement (Requires=) on a mount/network unit that is actually optional.
Fix: Re-classify dependencies. Use Wants= for optional components. For mounts, consider nofail in fstab and/or remove RequiresMountsFor= if it isn’t truly mandatory.
5) “Permission denied” even as root
Symptom: Service can’t read keys/certs/configs.
Root cause: AppArmor denial or unit runs as a non-root user and lacks file permissions.
Fix: Confirm effective user (User=), file ownership, and AppArmor logs. Adjust profile or relocate files to expected paths.
6) Timeouts on storage-backed services after reboot
Symptom: Database, queue, or app times out starting; mount chain shows long waits.
Root cause: Slow fsck, remote mount delays, degraded disk, or incorrect fstab causing retries.
Fix: Fix mounts first. Validate block device health, correct filesystem type/options, and ensure services don’t depend on optional mounts.
7) “Unit file changed on disk” confusion
Symptom: You edited a unit and it doesn’t take effect.
Root cause: Forgot systemctl daemon-reload, or you edited the wrong file (package unit vs override).
Fix: Use systemctl cat to see the final unit; reload daemon; restart the unit.
Second short joke, then back to work: if you’re debugging a failed unit by rebooting repeatedly, congratulations—you’ve invented chaos engineering, but without the learning.
Checklists / step-by-step plan
Checklist A: When a service fails right now (live incident)
- Capture status and logs (don’t mutate first). Save systemctl status and journalctl -u UNIT -b.
- Classify the failure: exit-code, signal, timeout, dependency, start-limit-hit.
- Confirm the final unit definition with systemctl cat UNIT.
- Check for dependency blockers: mounts, network-online, secrets, ports.
- Decide restore path:
  - If config regression: rollback config.
  - If packaging regression: rollback package or pin version.
  - If infra dependency: fail open (if safe) or degrade gracefully.
- Stabilize: stop flapping units; disable timers; use reset-failed after the root cause is removed.
- Verify: health endpoints, socket listens, synthetic request, and systemctl is-active.
Checklist B: The “I need the host back up” boot triage
- Identify the failing unit(s):
cr0x@server:~$ systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● mnt-data.mount loaded failed failed /mnt/data
● postgresql.service loaded failed failed PostgreSQL RDBMS
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state.
SUB = The low-level unit activation state.
Meaning: The mount failure likely causes the database failure.
Decision: Fix the mount first; then retry the database.
- Get ordering context:
cr0x@server:~$ systemd-analyze blame | head -n 10
1min 15.701s mnt-data.mount
12.233s cloud-init.service
8.614s systemd-fsck@dev-disk-by\x2duuid-3a1c...service
3.812s systemd-networkd-wait-online.service
2.115s snapd.service
1.998s apt-daily.service
1.233s systemd-resolved.service
1.101s ufw.service
981ms systemd-journald.service
742ms systemd-logind.service
Meaning: Boot bottleneck is the mount and wait-online. This guides your next hour.
Decision: If this is a fleet issue, don’t waste time debugging the app; fix storage and network-online semantics.
Checklist C: Safe unit file edits (don’t brick your host)
- Use drop-ins: systemctl edit UNIT (not editing /lib/systemd/system directly).
- Validate syntax and merged config: systemctl cat UNIT.
- Reload manager: systemctl daemon-reload.
- Restart: systemctl restart UNIT.
- Verify logs and MainPID: systemctl status UNIT and systemctl show -p MainPID UNIT.
- Only then enable/disable: systemctl enable --now UNIT if appropriate.
Checklist D: Storage-aware triage (because “Failed to start” is often “disk said no”)
- Check mounts: systemctl status *.mount for failed ones.
- Correlate boot chain: systemd-analyze critical-chain SERVICE.
- Validate devices: lsblk -f and blkid.
- Check disk space and inodes:
cr0x@server:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 20G 19G 200M 99% /
cr0x@server:~$ df -i /
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda2 1310720 1310102 618 100% /
Meaning: You can be “out of inodes” while having space. That breaks logging, PID files, sockets—services fail in weird ways.
Decision: Clear inode-heavy directories (often cache/temp), rotate logs, or expand filesystem. Then restart affected services.
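To find where the inodes went, GNU du can count them directly. A sketch, scoped to one filesystem so a mounted data volume doesn’t skew the numbers; the counts and paths shown are illustrative:

cr0x@server:~$ sudo du --inodes -x / 2>/dev/null | sort -n | tail -n 5
48210   /var/cache/apt/archives
112340  /var/lib/php/sessions
420112  /tmp/myapp-cache
880321  /var
1310050 /

The usual suspects are session directories, mail queues, and application caches full of tiny files.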
FAQ
1) Why does systemctl say “failed” but the process is actually running?
Because systemd tracks what it believes is the “main process.” If the service forks unexpectedly, writes a wrong PID file, or uses the wrong Type=, systemd can lose it. Check systemctl show -p MainPID UNIT and match it to ps. Fix Type= and PID tracking.
2) What’s the difference between “exit-code,” “signal,” and “timeout” in Result?
exit-code means the process returned non-zero. signal means it was killed by a signal (OOM often shows SIGKILL). timeout means systemd waited longer than allowed and terminated it. Each leads you to a different playbook: config/runtime errors vs resource kills vs dependency/performance delays.
3) Why do I see “Unit file changed on disk” after I edit a service?
Because systemd doesn’t automatically re-parse unit files on every start. Run systemctl daemon-reload, then restart the unit. If you’re editing packaged unit files, stop and use a drop-in instead.
4) What’s the fastest way to find the real bottleneck during boot?
Use systemd-analyze blame to find time sinks, then systemd-analyze critical-chain SERVICE to see the ordering path that matters for the unit you care about.
5) Should I increase StartLimitBurst and StartLimitIntervalSec to prevent outages?
Rarely. Start limits prevent crash loops from melting hosts and downstream dependencies. If a service crashes immediately, you want it to stop quickly and loudly. Fix the crash, add backoff in the app, and use sane restart policies.
6) How do I see logs from the previous boot?
Use journalctl -b -1 for the previous boot, and journalctl -u UNIT -b -1 for the unit in that boot. If logs aren’t there, journald may not be persistent on that host.
7) When should I use “mask” vs “disable”?
disable stops it from starting at boot or via wants. It can still be started manually or as a dependency. mask makes it unstartable (symlink to /dev/null). Use mask for a flapping unit you must prevent from starting while you stabilize the system.
8) How do I confirm whether the failure is due to permissions or AppArmor?
Permissions show up as “permission denied” in application logs and can be verified with namei -l PATH and ownership checks. AppArmor shows kernel audit denials in the journal. Search the journal for “apparmor=DENIED” and match the profile to the service.
9) Why does “network-online.target” slow down boot?
Because systemd-networkd-wait-online.service (or NetworkManager equivalent) can wait for interfaces to be configured. That’s useful for services that truly need a configured network, but harmful if you made everything depend on it. Keep dependencies narrow.
10) What should I do when the unit fails due to a missing file under /etc?
First, decide if it’s a deployment bug or a removed package conffile. Restore from configuration management or backup, or rollback the change. Then add a unit-level guard if appropriate (like ConditionPathExists=) so the failure is explicit and fast.
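A sketch of the guard as a drop-in, reusing the nginx snippet path from earlier; adjust the unit and path to your case:

cr0x@server:~$ sudo systemctl edit nginx.service
[Unit]
# Skip the start cleanly when the include is absent
ConditionPathExists=/etc/nginx/snippets/tls.conf

Mind the semantics: a failed Condition means the unit is skipped, not failed. Use AssertPathExists= instead if you want a loud failure rather than a quiet skip.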
Conclusion: next steps you can do today
“Failed to start …” is not a diagnosis. It’s a starting gun. Your job is to turn it into one of a few concrete categories: exec failure, timeout, dependency cascade, permissions/AppArmor, resource pressure, or restart throttle.
Do these next steps while you’re not on fire:
- Make journald persistent on servers where post-reboot forensics matter, so failures don’t evaporate with the reboot (see the sketch after this list).
- Audit custom unit files for correct Type=, explicit dependencies, and sane timeouts.
- Separate optional from critical dependencies (mounts, reporting paths, telemetry). Use Wants= or degrade gracefully.
- Add canary reboot testing for changes touching fstab, networking, storage, or systemd units. Boot-order bugs love production.
- Document the “first three commands” your team runs (systemctl status, journalctl -u ... -b, systemctl cat) and enforce it during incidents.
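For the journald persistence item above, a minimal sketch; the drop-in filename and SystemMaxUse value are illustrative (you can also set Storage= directly in /etc/systemd/journald.conf):

cr0x@server:~$ sudo mkdir -p /etc/systemd/journald.conf.d
cr0x@server:~$ printf '[Journal]\nStorage=persistent\nSystemMaxUse=1G\n' | sudo tee /etc/systemd/journald.conf.d/persistent.conf
cr0x@server:~$ sudo systemctl restart systemd-journald
cr0x@server:~$ journalctl --list-boots

After the next reboot, --list-boots should show more than one entry, which is the whole point: the previous boot’s crime scene survives.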
Systemd is deterministic. If you treat it like a black box, it feels like a gremlin box. Use the workflow above and you’ll spend less time guessing and more time restoring.