Debian 13: Service won’t start after config change — fix it by reading the right log lines (case #1)

You changed a config. You did the responsible thing. You even left a comment like “temporary” that will absolutely still be there in 2027. Now the service won’t start, your monitoring is paging, and systemctl status is being coy.

The good news: Debian 13 plus systemd gives you everything you need to solve this quickly—if you stop reading the wrong log lines. The bad news: most people do exactly that, squint at the last three lines of output, and then begin ritual sacrifices to “the cache”. Don’t. Read the right lines, in the right order, and you’ll fix this in minutes.

Case #1: config change → service won’t start (what actually happened)

This is the most common pattern I see on Debian systems: a service is healthy, someone edits a config file, then restarts the service. The restart fails. On-call opens systemctl status, sees “failed with result ‘exit-code’”, and starts guessing.

The fix is nearly always inside the logs, but not in the part people read first. The useful line is usually one of these (a quick way to pull it out is sketched right after the list):

  • Earlier than the “Main process exited…” line
  • From a helper process (like ExecStartPre) that tested config and quit
  • Or from the daemon itself, emitted once, then buried under systemd boilerplate
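
One way to surface that first meaningful line instead of the tail is to scan the unit's journal for the earliest error-looking message. This is a rough sketch; the keyword list is an assumption and should be adjusted to your daemon's vocabulary:

cr0x@server:~$ journalctl -u nginx.service -b --no-pager | grep -m1 -iE 'emerg|alert|crit|error|denied|fail'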

For case #1, imagine a typical service with a config test step:

  • ExecStartPre=/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'
  • ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;'

The restart fails not because systemd is mysterious, but because the pre-flight config test caught a syntax error, an invalid include path, or a permission problem on a referenced file. The “right log lines” are the ones describing that pre-flight failure. Your job is to pull them out cleanly, without drowning in unrelated noise.

Joke #1: A service restart is like a parachute pack—if you skip the inspection step, you’ll still find out whether it worked.

A few facts and history that explain why logs look like this

Understanding why Debian 13 behaves the way it does makes you faster under pressure. Here are concrete facts that matter when a service refuses to start after a config change:

  1. systemd became Debian’s default init system in Debian 8 (Jessie). That decision standardized service management and logging expectations, but also changed where people look for errors.
  2. journald is not a text file. Logs are stored in a binary journal and queried with journalctl. You can still forward to syslog, but the canonical source is the journal.
  3. systemctl status is a summary, not an investigation. It shows a clipped slice of logs and a high-level unit state. It’s meant to point you to deeper queries, not replace them.
  4. systemd units can have multiple processes before the “real” daemon starts. ExecStartPre, generators, wrapper scripts, and environment files can fail before your service’s PID even exists.
  5. Exit codes are standardized, but often misleading without context. An “exit status 1” might mean “syntax error” or “permission denied” or “port already in use.” You need the message next to it.
  6. Many daemons are designed to fail fast on invalid config. Nginx, Postfix, HAProxy, and others purposely refuse to start if config tests fail—because running with partial/invalid config is worse.
  7. Debian’s packaging tends to add safety checks. Maintainers frequently include pre-start validation in units or wrapper scripts. That’s good engineering, but it means errors can come from scripts you didn’t realize were in the path.
  8. Log ordering can be deceptive. journald is timestamped, but parallel unit start and multiple processes can interleave. The “last line” is not always “the cause.”
  9. Rate limiting is real. journald can rate-limit spammy services; the first error might be recorded, the next 500 might be summarized. If you only look at the summary, you miss the first clue. A quick check for suppression is sketched right after this list.
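
If you suspect rate limiting ate your clue (fact 9), journald itself logs a message whenever it suppresses output from a unit. A quick, hedged check:

cr0x@server:~$ journalctl -b -u systemd-journald --no-pager | grep -i suppress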

One paraphrased idea worth keeping in your head, from Gene Kim: reliability improves when you build fast feedback loops and shorten the distance between change and diagnosis.

Fast diagnosis playbook (first/second/third checks)

This is the order that wins in production. It’s biased toward getting the root cause in under five minutes, not toward feeling busy.

First: confirm what systemd thinks failed (unit-level view)

  • Get the unit state, exit code, and which phase failed (pre-start vs main start).
  • Extract the exact command line systemd ran (including ExecStartPre).

Second: pull the right journal slice (time- and unit-scoped)

  • Query logs for that unit, for the last boot, with minimal noise.
  • Then widen time range if needed; do not widen scope first.
  • Look for the first meaningful error line, not the last “exited” line.

Third: run the daemon’s own config validation manually

  • Most services have a “test config and exit” mode.
  • Run it exactly as systemd would (same user, same environment, same config path); see the sketch after this list.
  • If validation passes manually but fails under systemd, suspect permissions, environment files, AppArmor, or working directory differences.
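
For the “same user” part of that check, two hedged ways to reproduce what systemd would do (assuming the daemon runs as www-data; substitute your unit’s actual User=):

cr0x@server:~$ sudo runuser -u www-data -- /usr/sbin/nginx -t
cr0x@server:~$ sudo systemd-run --uid=www-data --wait --pipe /usr/sbin/nginx -t

The first simply runs the test as the service user; the second runs it as a transient unit, which is closer to how systemd actually launches processes and logs to the journal.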

Fourth: decide between fix, rollback, or temporary bypass

  • If it’s a clear syntax error: fix it now, then restart.
  • If it’s uncertain and production is burning: rollback to last known-good config and restart.
  • Avoid “temporary” bypasses like commenting out validation steps unless you understand the blast radius.

Hands-on tasks: commands, expected output, and decisions (12+)

These tasks are written the way an SRE actually works: run a command, read the output, make a decision. No motivational speeches. Each task includes what the output means and what you do next.

Task 1: Check the unit status (but read it correctly)

cr0x@server:~$ systemctl status nginx.service --no-pager
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Tue 2025-12-30 10:14:03 UTC; 42s ago
   Duration: 2.103s
    Process: 21984 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=1/FAILURE)
        CPU: 29ms

Dec 30 10:14:03 server nginx[21984]: nginx: [emerg] unexpected "}" in /etc/nginx/sites-enabled/app.conf:57
Dec 30 10:14:03 server systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
Dec 30 10:14:03 server systemd[1]: nginx.service: Failed with result 'exit-code'.
Dec 30 10:14:03 server systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.

What it means: The failure happened in ExecStartPre, before nginx daemon start. That’s a config test failure, not a runtime crash.

Decision: Don’t chase ports, PID files, or kernel limits. Fix the config line referenced (app.conf:57) and rerun the config test.

Task 2: Show only the journal for this unit (the last attempt, cleanly)

cr0x@server:~$ journalctl -u nginx.service -b --no-pager -n 60
Dec 30 10:14:03 server systemd[1]: Starting nginx.service - A high performance web server and a reverse proxy server...
Dec 30 10:14:03 server nginx[21984]: nginx: [emerg] unexpected "}" in /etc/nginx/sites-enabled/app.conf:57
Dec 30 10:14:03 server systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
Dec 30 10:14:03 server systemd[1]: nginx.service: Failed with result 'exit-code'.
Dec 30 10:14:03 server systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.

What it means: The journal confirms the exact parser error. No need to infer.

Decision: Open the file, fix the syntax, then test config again before restarting.

Task 3: Pull logs from “since the restart” when the boot is noisy

cr0x@server:~$ systemctl show -p StateChangeTimestamp nginx.service
StateChangeTimestamp=Tue 2025-12-30 10:14:03 UTC
cr0x@server:~$ journalctl -u nginx.service -b --no-pager --since "2 min ago"
Dec 30 10:14:03 server systemd[1]: Starting nginx.service - A high performance web server and a reverse proxy server...
Dec 30 10:14:03 server nginx[21984]: nginx: [emerg] unexpected "}" in /etc/nginx/sites-enabled/app.conf:57
Dec 30 10:14:03 server systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
Dec 30 10:14:03 server systemd[1]: nginx.service: Failed with result 'exit-code'.

What it means: The state-change timestamp tells you when the failure happened, so you can scope the journal by time instead of wading through a whole boot.

Decision: If the error isn’t in that window, widen to 10 minutes; do not remove the unit filter yet.

Task 4: Inspect the unit for pre-start checks and environment files

cr0x@server:~$ systemctl cat nginx.service
# /lib/systemd/system/nginx.service
[Unit]
Description=A high performance web server and a reverse proxy server
After=network-online.target
Wants=network-online.target

[Service]
Type=forking
PIDFile=/run/nginx.pid
ExecStartPre=/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'
ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;'
ExecReload=/usr/sbin/nginx -g 'daemon on; master_process on;' -s reload
TimeoutStopSec=5
KillMode=mixed

[Install]
WantedBy=multi-user.target

What it means: The failure is not inside the long-running daemon; it’s in the validation step. Also, no EnvironmentFile= here, so fewer hidden variables.

Decision: Run the exact pre-start command manually to reproduce; if it fails, you fix config. If it succeeds, the failure is environmental (permissions, AppArmor, include paths).

Task 5: Run the daemon’s config test manually (same command)

cr0x@server:~$ sudo /usr/sbin/nginx -t -q -g "daemon on; master_process on;"
nginx: [emerg] unexpected "}" in /etc/nginx/sites-enabled/app.conf:57

What it means: This is a deterministic config parse error. No systemd weirdness.

Decision: Fix the referenced file and line. Don’t restart repeatedly hoping it “settles”. It won’t.

Task 6: Locate the offending line and validate include structure

cr0x@server:~$ nl -ba /etc/nginx/sites-enabled/app.conf | sed -n '45,70p'
    45  server {
    46      listen 443 ssl;
    47      server_name app.example.internal;
    48      include /etc/nginx/snippets/tls.conf;
    49
    50      location / {
    51          proxy_pass http://127.0.0.1:8080;
    52          proxy_set_header Host $host;
    53      }
    54
    55  }   # end server
    56
    57  }

What it means: There’s an extra closing brace at line 57.

Decision: Remove it, save, rerun config test. If you see brace mismatches often, adopt a style rule: one block per file, consistent indentation, and a config linter in CI.

Task 7: Validate again, then restart (do not skip the validation step)

cr0x@server:~$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

What it means: Now it’s safe to restart.

Decision: Restart once. If restart fails now, it’s a different problem—don’t assume it’s still the config syntax.

cr0x@server:~$ sudo systemctl restart nginx.service
cr0x@server:~$ systemctl is-active nginx.service
active

What it means: Service is running.

Decision: Confirm it serves traffic (local health check) and close the incident properly.
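
A minimal local health check, assuming the vhost from this example answers on 443 on the same host (adjust the URL and Host header to your service); expect a 2xx or 3xx status code:

cr0x@server:~$ curl -ksS -o /dev/null -w '%{http_code}\n' https://127.0.0.1/ -H 'Host: app.example.internal'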

Task 8: When status is unhelpful, show full logs with priority filtering

cr0x@server:~$ journalctl -u nginx.service -b -p warning --no-pager
Dec 30 10:14:03 server nginx[21984]: nginx: [emerg] unexpected "}" in /etc/nginx/sites-enabled/app.conf:57

What it means: You filtered to warnings and worse, so you’re not reading “Started…” fluff.

Decision: Use this when a unit is chatty. If nothing appears at warning/error, you’re either logging elsewhere or you have a silent failure before logging initializes.

Task 9: Confirm which config files changed recently (catch the real culprit)

cr0x@server:~$ sudo find /etc/nginx -type f -printf '%TY-%Tm-%Td %TH:%TM %p\n' | sort -r | head -n 8
2025-12-30 10:12 /etc/nginx/sites-enabled/app.conf
2025-12-29 18:41 /etc/nginx/nginx.conf
2025-12-10 09:03 /etc/nginx/snippets/tls.conf
2025-11-21 15:22 /etc/nginx/mime.types

What it means: You can correlate the start failure with the most recent edit.

Decision: If the error references an include file, check that file’s mtime too. “I only changed one line” is rarely the whole story.

Task 10: If it’s not syntax, check for permission denied (classic after “hardening”)

cr0x@server:~$ journalctl -u nginx.service -b --no-pager -n 30
Dec 30 10:20:11 server nginx[22310]: nginx: [emerg] open() "/etc/nginx/snippets/tls.conf" failed (13: Permission denied)
Dec 30 10:20:11 server systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
cr0x@server:~$ namei -l /etc/nginx/snippets/tls.conf
f: /etc/nginx/snippets/tls.conf
drwxr-xr-x root root /
drwxr-xr-x root root etc
drwxr-xr-x root root nginx
drwx------ root root snippets
-rw------- root root tls.conf

What it means: Restrictive path permissions block whoever parses the config from reading the include. With the stock unit the pre-start check runs as root, so this usually surfaces when the unit sets User=, sandboxing is in effect, or a hardening pass removed the traversal (execute) bit on a directory in the path.

Decision: Fix permissions to the minimum required. Usually: directory execute bit for traversal and file read for the service user or group.
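
A sketch of restoring the usual Debian defaults for that path (assumption: the include holds nothing secret; if it does, prefer group read for the service user instead of world read):

cr0x@server:~$ sudo chmod 0755 /etc/nginx/snippets
cr0x@server:~$ sudo chmod 0644 /etc/nginx/snippets/tls.conf
cr0x@server:~$ sudo -u www-data test -r /etc/nginx/snippets/tls.conf && echo readable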

Task 11: Validate the runtime user and service sandboxing

cr0x@server:~$ systemctl show nginx.service -p User -p Group -p DynamicUser -p ProtectSystem -p ReadWritePaths
User=
Group=
DynamicUser=no
ProtectSystem=no
ReadWritePaths=

What it means: This particular unit isn’t using systemd sandboxing directives. If you do see ProtectSystem=strict or tight ReadWritePaths, config reads/writes may be blocked.

Decision: If sandboxing is enabled, align it with the daemon’s needs rather than disabling it blindly. Add explicit ReadOnlyPaths/ReadWritePaths in an override.
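
A minimal override sketch using a drop-in (the file name and paths are assumptions for this example; the directives are standard systemd options):

cr0x@server:~$ sudo mkdir -p /etc/systemd/system/nginx.service.d
cr0x@server:~$ printf '[Service]\nReadOnlyPaths=/etc/nginx\nReadWritePaths=/var/log/nginx\n' | sudo tee /etc/systemd/system/nginx.service.d/override.conf
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart nginx.service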

Task 12: Interpret failure reasons from systemd’s perspective (exit codes and signals)

cr0x@server:~$ systemctl show nginx.service -p ExecMainStatus -p ExecMainCode -p Result
ExecMainStatus=1
ExecMainCode=exited
Result=exit-code

What it means: The process exited normally with status 1. Not SIGKILL, not OOM, not a timeout.

Decision: Focus on configuration, parameters, and permissions. If you see ExecMainCode=killed or Result=timeout, that’s a different branch entirely.

Task 13: If the service is flapping, stop the restart loop while you read logs

cr0x@server:~$ sudo systemctl reset-failed nginx.service
cr0x@server:~$ sudo systemctl stop nginx.service
cr0x@server:~$ systemctl status nginx.service --no-pager
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; preset: enabled)
     Active: inactive (dead)

What it means: You’re preventing systemd from spamming restarts while you debug. This also makes the journal easier to read.

Decision: Do this when Restart=always is creating noise and load. Then restart intentionally when you have a fix.

Task 14: Compare config changes safely with dpkg metadata (packaging reality check)

cr0x@server:~$ dpkg -S /etc/nginx/nginx.conf
nginx-common: /etc/nginx/nginx.conf
cr0x@server:~$ sudo ls -l /etc/nginx/nginx.conf*
-rw-r--r-- 1 root root 1492 Dec 29 18:41 /etc/nginx/nginx.conf
-rw-r--r-- 1 root root 1479 Nov 21 15:22 /etc/nginx/nginx.conf.dpkg-dist

What it means: You may have a distro-provided new default file or a pending merge. That can interact with your change.

Decision: If the service started failing after an upgrade plus a config edit, examine .dpkg-dist/.dpkg-old and reconcile intentionally.
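
To see exactly what the maintainer’s new default would change, a plain diff is usually enough:

cr0x@server:~$ diff -u /etc/nginx/nginx.conf /etc/nginx/nginx.conf.dpkg-dist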

Task 15: When logs are missing, confirm journald persistence and rate limiting

cr0x@server:~$ sudo grep -E '^(Storage|SystemMaxUse|RateLimitIntervalSec|RateLimitBurst)=' /etc/systemd/journald.conf
Storage=auto
RateLimitIntervalSec=30s
RateLimitBurst=1000
cr0x@server:~$ journalctl --disk-usage
Archived and active journals take up 384.0M in the file system.

What it means: An empty result means you’re on the compiled-in defaults (the stock journald.conf ships fully commented out). If Storage=volatile, you lose logs on reboot. If rate limiting is low, you might miss repeated errors.

Decision: For production, persist logs on disk and size appropriately. For debugging, temporarily raise rate limits if a service is spamming, but fix the spam next.
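
If you need more headroom while debugging, a drop-in is cleaner than editing journald.conf in place. A sketch (the file name and numbers are illustrative, not recommendations):

cr0x@server:~$ sudo mkdir -p /etc/systemd/journald.conf.d
cr0x@server:~$ printf '[Journal]\nRateLimitIntervalSec=30s\nRateLimitBurst=20000\n' | sudo tee /etc/systemd/journald.conf.d/90-debug.conf
cr0x@server:~$ sudo systemctl restart systemd-journald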

Joke #2: “It worked yesterday” is not evidence; it’s just testimony from a witness with a terrible memory.

Three corporate mini-stories (and what they teach)

Mini-story 1: The incident caused by a wrong assumption

The team had a Debian fleet running a mix of web and API services. One afternoon, a routine config change went out: update TLS ciphers, standardize across environments. Someone restarted nginx on a canary. It failed. They ran nginx -t manually; it passed. The assumption formed instantly: “systemd is broken on this host.”

They dug into package versions, kernel parameters, even SELinux (which wasn’t even enabled). Meanwhile, traffic drained from the node and the autoscaler got nervous. They kept retrying restarts “just to see,” which is a great way to bury the one good error line under a pile of restart noise.

The fix was embarrassingly simple: the systemd unit used a different config path via an environment file. Not malicious—just historical. Manual nginx -t tested /etc/nginx/nginx.conf; systemd tested /etc/nginx/nginx-canary.conf. The canary file included a snippet path that didn’t exist on that host.

The lesson isn’t “don’t use environment files.” It’s: never assume your manual reproduction matches the service manager. Extract the exact ExecStartPre/ExecStart command from systemctl cat and run that. If an environment file exists, print it, and stop guessing.
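
A sketch of that extraction (the environment file path is an assumption; use whatever EnvironmentFile= the unit actually names):

cr0x@server:~$ systemctl cat nginx.service | grep -E '^(ExecStartPre|ExecStart|EnvironmentFile)='
cr0x@server:~$ systemctl show nginx.service -p Environment -p EnvironmentFiles
cr0x@server:~$ sudo cat /etc/default/nginx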

Mini-story 2: The optimization that backfired

A platform group decided to “speed up deployments” by switching from restart to reload whenever possible. Reload is cheaper: less connection churn, fewer transient errors. Good intent. Then they generalized it across multiple services with a one-size-fits-most script.

One service, a message broker, accepted reload signals but only partially reloaded configuration. For some settings it required a full restart, but the reload command returned success anyway. Over time, config drift built up: the running config didn’t match what was on disk, and people stopped trusting both.

Eventually a config change introduced a parameter that would have failed a fresh start validation. The reload did nothing useful, said “OK,” and the system ran with the old settings. Days later, a routine host reboot occurred. Now the service had to do a cold start, it read the bad config, and it refused to come up. This failure happened during a maintenance window, which is where you go to meet your future mistakes.

The lesson: reload is not a free lunch. If you choose reload as an optimization, you must also enforce config validation as part of the change process and periodically perform controlled restarts to prove the config is actually startable.

Mini-story 3: The boring but correct practice that saved the day

A finance-adjacent internal service ran on Debian, backed by a database and a web front-end. They had a change policy that was not glamorous: every config edit had to be committed to a repo, and the deployment tooling always ran the service’s built-in config test before touching systemd. If the test failed, the change simply didn’t ship.

People complained about it. “It slows us down.” “I can test it in my head.” The usual. But then came a day when a senior engineer edited a config live during an incident—because the service was misbehaving and they needed a quick mitigation. The edit had a subtle quoting error. The next restart would have killed the service entirely.

The deployment tooling refused to apply the change without a passing config test. That was the whole point: guardrails when stress makes everyone sloppy. They fixed the quoting, re-tested, then restarted safely. Nobody got paged twice.

The lesson: the boring practice isn’t the repo. It’s the automatic validation gate plus a predictable rollback path. Those two things prevent small mistakes from becoming outages.

Common mistakes: symptom → root cause → fix

Here are the repeat offenders. If your service won’t start after a config change, you will likely land in one of these buckets.

1) Symptom: systemctl status shows “failed (Result: exit-code)” with no useful error

Root cause: You’re only seeing the summary. The meaningful line is earlier or truncated.

Fix: Query the journal directly and expand the slice.

cr0x@server:~$ journalctl -u myservice.service -b --no-pager -n 200
...look for the first real error line...

2) Symptom: service fails instantly after restart; logs mention ExecStartPre

Root cause: Pre-start validation failed (syntax, missing include, invalid directive).

Fix: Run the same validation manually and correct config before restarting.

cr0x@server:~$ systemctl cat myservice.service | sed -n '/ExecStartPre/p'
ExecStartPre=/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'

3) Symptom: config test passes manually, fails under systemd

Root cause: Different config path, different user, different environment, or sandbox restrictions.

Fix: Extract exact command and environment from the unit; run as the service user.

cr0x@server:~$ systemctl show myservice.service -p Environment -p EnvironmentFiles
Environment=
EnvironmentFiles=/etc/default/myservice (ignore_errors=no)

4) Symptom: “Permission denied” on includes, certificates, sockets, PID files

Root cause: Hardening change (chmod/chown), new path with restrictive permissions, or service user mismatch.

Fix: Trace path permissions with namei -l; correct directory execute bits and file readability.

5) Symptom: “Address already in use” after config change

Root cause: You changed listen/port binding; another service already owns it; or the old instance didn’t stop cleanly.

Fix: Identify who holds the port; decide whether to change port back, stop the conflicting service, or fix socket activation.

cr0x@server:~$ sudo ss -ltnp | grep ':443 '
LISTEN 0      511          0.0.0.0:443        0.0.0.0:*    users:(("haproxy",pid=1203,fd=7))

6) Symptom: unit shows Result=timeout

Root cause: The daemon hangs during start (waiting on DNS, storage, migrations) or systemd’s timeout is too aggressive for a cold start.

Fix: Read logs around the hang, then adjust TimeoutStartSec only if the startup work is legitimate and bounded.
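
If the slow start is legitimate, extend the timeout with a drop-in rather than editing the packaged unit. A sketch (the 300 seconds and file name are assumptions; size the timeout to the real startup work):

cr0x@server:~$ sudo mkdir -p /etc/systemd/system/myservice.service.d
cr0x@server:~$ printf '[Service]\nTimeoutStartSec=300\n' | sudo tee /etc/systemd/system/myservice.service.d/timeout.conf
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart myservice.service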

7) Symptom: after a config change, service “starts” but doesn’t work

Root cause: You used reload and assumed it applied everything; or the config is accepted but semantically wrong.

Fix: Run an application-level health check, and confirm active config with service introspection if available. Restart if needed.

8) Symptom: journal has no entries for the unit

Root cause: The service logs to a file (or stdout is redirected), journald is volatile, or the unit never executed due to dependency failure.

Fix: Check systemctl list-dependencies and journald settings; inspect traditional log files if configured.
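
Two quick checks for the “unit never executed” case:

cr0x@server:~$ systemctl list-dependencies myservice.service
cr0x@server:~$ systemctl --failed --no-pager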

Checklists / step-by-step plan (safe fixes and rollback)

Step-by-step: diagnose and fix without thrashing the system

  1. Stop the restart loop if present. If the unit is flapping, pause it so you can read stable logs.
  2. Read the unit summary. Identify whether ExecStartPre failed or the main process died.
  3. Query journald by unit and boot. Don’t start with global logs.
  4. Extract the exact start commands. Read systemctl cat and check for drop-ins.
  5. Run the service’s config test manually. Same args, same config path.
  6. Fix the smallest thing that makes it start. Avoid refactors during outage response.
  7. Restart once, then verify at the application layer. “active (running)” isn’t the same as “serving.”
  8. Write down the root cause line. Paste the exact error string in the incident note. Future you will thank present you.

Rollback plan: when you’re not sure your fix is correct

If you can’t prove the fix quickly, rollback. Don’t “iterate in production” while the pager is screaming.

  1. Save the broken config. Copy it with a timestamp so you can analyze later.
  2. Restore last-known-good from your config repo or backups.
  3. Validate config. Always run the daemon’s test mode.
  4. Restart service and verify.
  5. Only after recovery: debug the broken change in a controlled environment.

cr0x@server:~$ sudo cp -a /etc/nginx/sites-enabled/app.conf /root/app.conf.broken.$(date +%F-%H%M%S)
cr0x@server:~$ sudo cp -a /root/rollback/app.conf /etc/nginx/sites-enabled/app.conf
cr0x@server:~$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
cr0x@server:~$ sudo systemctl restart nginx.service
cr0x@server:~$ systemctl is-active nginx.service
active

When you must keep partial service up (damage control)

Sometimes you can’t fully fix it immediately, but you can reduce impact:

  • Restore a minimal config that serves a maintenance page.
  • Disable the broken virtual host while keeping others running.
  • Route traffic away from the node temporarily, fix in isolation, then reintroduce.

Do not disable validation steps to “make it start” unless you’re certain the daemon won’t start in a corrupted state. That path leads to data loss, and you won’t enjoy the postmortem.
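
On Debian, sites-enabled entries are normally symlinks into sites-available, so disabling one broken vhost is a sketch like this (assuming app.conf is the culprit and the real file lives in sites-available):

cr0x@server:~$ sudo rm /etc/nginx/sites-enabled/app.conf
cr0x@server:~$ sudo nginx -t
cr0x@server:~$ sudo systemctl restart nginx.service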

FAQ

1) Why isn’t systemctl status enough?

Because it’s intentionally compact. It shows a small log tail and a unit state summary. Use it to find the unit name, failure phase, and then pivot to journalctl -u for the real diagnosis.

2) What’s the single best journalctl command for this situation?

Usually:

cr0x@server:~$ journalctl -u myservice.service -b --no-pager -n 200

If that’s noisy, add -p warning or restrict time with --since.

3) How do I know whether the failure is config vs runtime?

Look for ExecStartPre failing (config/validation) versus the main process starting and then dying (runtime). systemctl status usually tells you which process failed.

4) Why does running the config test manually sometimes succeed when systemd fails?

Different environment. systemd may use an environment file, a different working directory, sandboxing, or a different user context. Always reproduce using the exact command line from the unit.

5) How do I see drop-in overrides that might change behavior?

cr0x@server:~$ systemctl status myservice.service --no-pager
...look for "Drop-In:" lines...
cr0x@server:~$ systemctl cat myservice.service
...includes /etc/systemd/system/myservice.service.d/*.conf if present...

6) When should I use reload instead of restart?

Only when the service documents that reload applies the changes you made, and you have a validation step. If you’re uncertain, restart during a safe window or after draining traffic.

7) What if there are no logs at all for the unit?

Then either the unit didn’t execute, journald isn’t retaining logs, or logs go elsewhere (like /var/log/*). Check dependencies and journald settings, and inspect file-based logs if configured by the service.

8) How do I quickly tell if this is a permissions problem?

Look for “Permission denied” in the journal, then trace the file path with namei -l. Permissions issues often come from directory traversal bits or new hardening changes that forgot the service user.

9) What’s the safest way to prevent this class of outage?

Automate config validation (daemon test mode) before restart/reload, keep configs in version control, and make rollback trivial. The goal is to catch the broken line before it reaches systemd.

Conclusion: next steps that prevent repeat incidents

A Debian 13 service failing after a config change is rarely a mystery. It’s usually one precise error line that you didn’t extract cleanly. Read the unit to learn what ran. Read the journal scoped to the unit to learn what failed. Then validate config manually using the exact command systemd uses.

Practical next steps:

  • Add a pre-deploy config test step for every service that supports it (a minimal wrapper is sketched after this list).
  • Train your team to treat systemctl status as a pointer, not a diagnosis.
  • Make rollback a first-class operation (copy, restore, validate, restart).
  • Standardize on a short “fast diagnosis” runbook and keep it near the pager rotation.
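
A minimal wrapper sketch for the first bullet, assuming an nginx-style “test, then restart” flow (the script path, unit name, and test command are placeholders):

cr0x@server:~$ cat /usr/local/sbin/safe-restart-nginx
#!/bin/sh
set -eu
# Refuse to touch the service if the config test fails.
/usr/sbin/nginx -t
systemctl restart nginx.service
systemctl is-active --quiet nginx.service && echo "nginx restarted and active"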

Do that, and the next time a service refuses to start, you’ll spend your time fixing the actual problem—not arguing with a summary screen.
