You changed /lib/systemd/system/something.service at 2 a.m., the service “worked,” and then a package upgrade quietly undid your fix.
Now you’re staring at a restart loop, wondering which part of your memory is lying to you.
Ubuntu 24.04 is systemd-land. The correct move is almost never “edit the vendor unit file.” The correct move is overrides: drop-ins,
clear, auditable, reversible. You get the fix without the fragility—and you stop losing fights with dpkg.
The mental model: what systemd reads, and in what order
systemd is a configuration merger. It doesn’t “load a service file.” It builds the final unit configuration from multiple layers, then runs that.
If you treat it as a single file, you’ll keep getting surprised.
Where unit files come from (Ubuntu 24.04)
On Ubuntu, you’ll commonly see:
- /lib/systemd/system/: vendor unit files (installed by packages). You don’t own these. On 24.04, /lib is a symlink to /usr/lib, so tools may report the same files under /usr/lib/systemd/system/.
- /etc/systemd/system/: admin overrides and custom units. You own these.
- /run/systemd/system/: runtime units and transient drop-ins. Ephemeral; they last only until reboot.
Priority is effectively: /etc wins over /run wins over /lib (with some nuance), plus drop-ins applied in lexical order.
The rule of thumb is still useful: if you want a stable fix, put it in /etc.
Drop-ins: the clean scalpel
The canonical override mechanism is a drop-in file:
/etc/systemd/system/<unit>.d/override.conf.
It contains only the deltas you want, not a full unit copy. That matters because it’s readable, reviewable, and resilient to vendor unit evolution.
You can also place multiple drop-ins (e.g., 10-limits.conf, 20-env.conf). systemd applies them in lexical order by file name, so a later file wins when two set the same directive.
This makes “who changed what” much easier to answer than “someone edited the unit file in place and now it’s a bespoke snowflake.”
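If you script drop-ins instead of using an editor, the mechanics are small enough to sketch. This is a minimal illustration, not a packaged tool: write_dropin and the SYSTEMD_ETC variable are hypothetical names, and SYSTEMD_ETC exists only so you can try the function against a scratch directory before pointing it at /etc.

```shell
#!/bin/sh
# Sketch: write a drop-in for a unit under a configurable root directory.
# SYSTEMD_ETC is a hypothetical variable; it defaults to the real admin path.
write_dropin() {
    unit=$1      # e.g. nginx.service
    name=$2      # e.g. 10-limits.conf
    dir="${SYSTEMD_ETC:-/etc/systemd/system}/${unit}.d"
    mkdir -p "$dir" || return 1
    # The drop-in body is read from stdin, so only the deltas get written.
    cat > "$dir/$name" || return 1
    # Print the path that was written, for logs or review.
    printf '%s\n' "$dir/$name"
}
```

Run as root, something like printf '[Service]\nLimitNOFILE=65536\n' | write_dropin nginx.service 10-limits.conf would create the file; systemd still needs a daemon-reload before it notices.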
The difference between editing and overriding (and why you should care)
Editing the vendor unit file is an attractive nuisance. It’s fast, and it creates technical debt in a single keystroke.
The minute the package updates, dpkg will:
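- overwrite the unit without asking (units under /lib are normally not conffiles, so dpkg has nothing to prompt about), or
- if the package does ship it as a conffile, prompt in ways that get “handled” during emergencies, or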
- leave a .dpkg-dist/.dpkg-old mess that no one remembers to reconcile.
Overrides avoid all of that. The package can do what it wants. Your intent still applies.
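To find out whether someone has already edited a vendor unit in place, one option is to filter the output of dpkg --verify. The sketch below assumes dpkg's default rpm-style report, where the file path is the last field on each line, and it will mis-handle paths containing spaces. The function name modified_vendor_units is made up for this example.

```shell
#!/bin/sh
# Sketch: from `dpkg --verify` output on stdin, keep only files that live in a
# vendor unit directory. Assumes the path is the last whitespace-separated field.
modified_vendor_units() {
    awk '$NF ~ /^(\/usr)?\/lib\/systemd\/system\// { print $NF }'
}
```

Typical use would be dpkg --verify | modified_vendor_units. Anything it prints is a candidate for migration into a drop-in.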
One quote to keep you honest
“The subtle bugs are the ones that appear after changes you forgot you made.” — paraphrased idea often attributed to seasoned operations engineers
Keep your changes where you can see them. Overrides are exactly that.
Eight facts and a little history (why overrides exist)
- systemd debuted in 2010 (Lennart Poettering and Kay Sievers) and replaced a zoo of init scripts with declarative unit files.
- Ubuntu adopted systemd by default in 15.04; by 24.04 it’s not “new,” it’s the operating assumption behind most service packaging.
- Drop-ins were designed for distro packaging: vendors ship safe defaults; operators add site policy without forking the unit.
- Unit file search paths are part of the interface; systemd is intentionally built to merge config from multiple locations, not just “read one file.”
- systemctl edit is the blessed workflow; it creates the drop-in directory, opens your editor on the right file, and reloads systemd’s configuration when you save and exit.
- Overriding ExecStart= requires clearing first; systemd treats it as a list, not a single string. If you forget the reset line, systemd either refuses the unit for having more than one ExecStart= or, for Type=oneshot, runs both commands.
- There’s a runtime layer in /run; tools and generators can create transient units, which is useful for boot-time logic but tricky during debugging.
- systemd has strong introspection: systemctl cat, systemd-analyze, journalctl -u, and systemctl show let you see what it actually decided.
Joke #1: Editing /lib/systemd/system in production is like “just testing in prod” — quick, thrilling, and only occasionally career-limiting.
Case #8: a clean override that survives upgrades
Here’s the situation I see constantly on Ubuntu 24.04:
a vendor service starts fine most days, but under load it hits limits (file descriptors, processes, memory locking), or it needs one extra environment variable,
or the working directory is wrong, or it should wait on a mount. The temptation is to edit the unit file where it lives. Don’t.
Your goal is to: (1) identify the minimal change, (2) implement it in /etc/systemd/system as a drop-in, and
(3) prove systemd is running the merged config you intended.
Typical override patterns that are “safe and boring”
- Resource limits: LimitNOFILE=, TasksMax=, MemoryMax= (stable and low drama).
- Startup ordering: After=, Requires=, Wants= (avoid race conditions with mounts and networks).
- Environment: Environment=, EnvironmentFile= (keep secrets out of unit files; use files with permissions).
- Restart behavior: Restart=on-failure, RestartSec=, StartLimitIntervalSec= (tune, but don’t hide real faults).
- ExecStart changes: doable, but deserves extra care and a rollback plan.
When not to override
If the service is broken because the binary is wrong, the config file is wrong, or a dependency is missing, an override won’t save you.
Overrides are for changing how systemd runs a correct program, not for patching the program itself.
Joke #2: If you “fix” a segfault with Restart=always, congratulations—you’ve invented a very fast log generator.
Fast diagnosis playbook: find the bottleneck fast
When a service is unhealthy, you can spend 30 minutes reading unit files… or you can spend 3 minutes looking at the right output.
This is the order I use in production.
1) Establish the symptom: state, exit code, and last error
- Check status: is it failed, activating, or running but degraded?
- Check exit code: status=203/EXEC vs status=1/FAILURE vs OOM kill are different universes.
- Check last log lines: don’t scroll; ask journald for the last 50 lines with timestamps.
2) Confirm what systemd is actually running
- Print the merged unit: systemctl cat shows vendor + drop-ins.
- Inspect properties: systemctl show -p ExecStart,Environment,LimitNOFILE to confirm effective values.
- Check if a generator created something in /run: if yes, your persistent changes might be getting trumped or complemented.
3) Only then edit: minimal change, predictable rollback
- Create a drop-in with systemctl edit.
- Run systemd-analyze verify if you changed syntax-heavy parts.
- daemon-reload, then restart, then validate.
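The read-only part of the playbook (steps 1 and 2) can be captured as a tiny helper so nobody has to remember the order at 2 a.m. This is a sketch: triage_plan is a hypothetical name, and it only prints the commands, assuming the unit name contains no shell-special characters.

```shell
#!/bin/sh
# Sketch: print the triage commands for one unit, in playbook order.
# Nothing is executed here; pipe the output to `sh` to actually run it.
triage_plan() {
    unit=$1
    # 1) Symptom: state, exit code, last log lines with timestamps.
    printf 'systemctl status %s --no-pager\n' "$unit"
    printf 'journalctl -u %s -n 50 --no-pager -o short-iso\n' "$unit"
    # 2) What systemd actually loaded: fragment, drop-ins, effective values.
    printf 'systemctl cat %s\n' "$unit"
    printf 'systemctl show %s -p FragmentPath -p DropInPaths -p ExecStart -p Environment -p LimitNOFILE\n' "$unit"
}
```

triage_plan nginx.service | sh runs all four in sequence. Printing first and executing second keeps the helper safe to paste into a ticket.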
Practical tasks (commands, output meaning, decisions)
Below are real tasks you can run on Ubuntu 24.04. Each one includes what the output means and what decision you make next.
I’ll use nginx.service as an example because it’s common, but the workflow is identical for databases, agents, exporters, and custom daemons.
Task 1: Find where the unit file lives
cr0x@server:~$ systemctl show -p FragmentPath -p DropInPaths nginx.service
FragmentPath=/lib/systemd/system/nginx.service
DropInPaths=/etc/systemd/system/nginx.service.d/override.conf
Meaning: FragmentPath is the main unit file systemd uses. DropInPaths lists applied overrides.
Decision: If your change is in /lib, you’re holding a grenade. Move it into a drop-in under /etc.
Task 2: Read the merged unit file, not just the vendor fragment
cr0x@server:~$ systemctl cat nginx.service
# /lib/systemd/system/nginx.service
[Unit]
Description=A high performance web server and a reverse proxy server
After=network.target
[Service]
Type=forking
ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;'
ExecReload=/usr/sbin/nginx -g 'daemon on; master_process on;' -s reload
ExecStop=/sbin/start-stop-daemon --quiet --stop --retry QUIT/5 --pidfile /run/nginx.pid
PIDFile=/run/nginx.pid
# /etc/systemd/system/nginx.service.d/override.conf
[Service]
LimitNOFILE=65536
Meaning: This is what systemd will apply. It’s the truth. Everything else is commentary.
Decision: If the merged unit doesn’t show your change, you’re editing the wrong place or forgot daemon-reload.
Task 3: Check live state and last exit reason
cr0x@server:~$ systemctl status nginx.service --no-pager
● nginx.service - A high performance web server and a reverse proxy server
Loaded: loaded (/lib/systemd/system/nginx.service; enabled; preset: enabled)
Drop-In: /etc/systemd/system/nginx.service.d
└─override.conf
Active: failed (Result: exit-code) since Tue 2025-12-30 10:12:07 UTC; 15s ago
Docs: man:nginx(8)
Process: 2197 ExecStart=/usr/sbin/nginx -g daemon on; master_process on; (code=exited, status=1/FAILURE)
CPU: 45ms
Meaning: status=1/FAILURE indicates nginx ran and returned failure (often config error), not an exec problem.
Decision: Go to logs next; don’t rewrite ExecStart until you know it’s needed.
Task 4: Pull recent logs for the unit, with timestamps
cr0x@server:~$ journalctl -u nginx.service -n 50 --no-pager -o short-iso
2025-12-30T10:12:07+00:00 server nginx[2197]: nginx: [emerg] invalid number of arguments in "worker_connections" directive in /etc/nginx/nginx.conf:12
2025-12-30T10:12:07+00:00 server systemd[1]: nginx.service: Main process exited, code=exited, status=1/FAILURE
2025-12-30T10:12:07+00:00 server systemd[1]: nginx.service: Failed with result 'exit-code'.
Meaning: The service failed due to app config. systemd is not the problem today.
Decision: Fix nginx config. Overrides won’t help. If your change was intended to fix a config path, validate that hypothesis.
Task 5: Know what --full does before you reach for it
cr0x@server:~$ sudo systemctl edit nginx.service --full
# (editor opens a copy of the full unit)
Meaning: --full creates an entire unit copy in /etc/systemd/system. This is a heavy hammer.
Decision: Prefer systemctl edit nginx.service (drop-in). Use --full only if you must replace the unit wholesale.
Task 6: Create a proper drop-in override (recommended)
cr0x@server:~$ sudo systemctl edit nginx.service
# (editor opens /etc/systemd/system/nginx.service.d/override.conf)
Add your directives, save, and then check what landed on disk:
cr0x@server:~$ sudo cat /etc/systemd/system/nginx.service.d/override.conf
[Service]
LimitNOFILE=65536
TasksMax=4096
Meaning: You are not forking the unit. You are layering policy on top.
Decision: Keep overrides small. If your override.conf is longer than the vendor file, you’re writing your own unit—own it explicitly.
Task 7: Reload systemd after changing unit files
cr0x@server:~$ sudo systemctl daemon-reload
Meaning: systemd re-reads unit definitions. Without this, it may keep old versions.
Decision: If changes “don’t apply,” first suspect you forgot reload. Second suspect: you edited a file systemd isn’t using.
Task 8: Verify the effective values systemd will enforce
cr0x@server:~$ systemctl show nginx.service -p LimitNOFILE -p TasksMax -p Environment
LimitNOFILE=65536
TasksMax=4096
Environment=
Meaning: This is the effective config after merging all fragments and drop-ins.
Decision: If you expected a value and it isn’t here, find what’s overriding you (drop-in order, full unit in /etc, or a transient unit).
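When you verify the same properties on many hosts, eyeballing gets old. A minimal sketch of an automated comparison follows; check_props is a hypothetical helper that reads systemctl show output on stdin and takes the expected KEY=VALUE pairs as arguments.

```shell
#!/bin/sh
# Sketch: compare expected KEY=VALUE pairs against `systemctl show` output.
# The show output is read from stdin so the check can be tried on saved text.
check_props() {
    actual=$(cat)
    rc=0
    for want in "$@"; do
        # -x matches whole lines, -F treats the pair as a literal string.
        if printf '%s\n' "$actual" | grep -qxF -- "$want"; then
            printf 'ok       %s\n' "$want"
        else
            printf 'MISMATCH %s\n' "$want"
            rc=1
        fi
    done
    return "$rc"
}
```

For example: systemctl show nginx.service -p LimitNOFILE -p TasksMax | check_props LimitNOFILE=65536 TasksMax=4096. A non-zero exit status means at least one value is not what you intended.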
Task 9: Override ExecStart correctly (clear first)
This is the one people get wrong. In systemd, many directives are list-type. ExecStart= is one of them.
To replace it, you must clear the existing list.
cr0x@server:~$ sudo cat /etc/systemd/system/nginx.service.d/10-execstart.conf
[Service]
ExecStart=
ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;' -c /etc/nginx/nginx.conf
Meaning: The blank ExecStart= resets the list; the next line becomes the only ExecStart.
Decision: If you don’t reset, you might end up with multiple ExecStart lines. Best case: start fails. Worst case: it “works” until it doesn’t.
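Because the missing reset line is so easy to overlook in review, a crude lint can catch it before a restart does. This sketch is deliberately simplified: lint_execstart is a hypothetical name, and it ignores section headers and line continuations. It only checks that the first ExecStart= assignment in a drop-in is the empty one.

```shell
#!/bin/sh
# Sketch: flag a drop-in whose first ExecStart= assignment is not the empty reset.
# Simplified on purpose: no section tracking, no handling of continuation lines.
lint_execstart() {
    awk '
        /^[ \t]*ExecStart[ \t]*=/ {
            seen++
            value = $0
            # Strip the key and "=" so only the assigned value remains.
            sub(/^[ \t]*ExecStart[ \t]*=[ \t]*/, "", value)
            if (seen == 1 && value != "") bad = 1
        }
        END {
            if (bad) { print "missing reset: add an empty ExecStart= first"; exit 1 }
            print "ok"
        }
    ' "$1"
}
```

Point it at the drop-in file, for example lint_execstart /etc/systemd/system/nginx.service.d/10-execstart.conf. A drop-in that never mentions ExecStart= passes.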
Task 10: Detect a restart storm and stop it safely
cr0x@server:~$ systemctl show nginx.service -p NRestarts -p Restart -p RestartUSec
NRestarts=27
Restart=on-failure
RestartUSec=100ms
Meaning: The service is flapping quickly. This can trash logs and hide the first failure.
Decision: Temporarily stop and mask while debugging (if safe), or increase RestartSec= via a drop-in to slow the loop.
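If you want a script or health check to notice the flapping for you, the NRestarts property is a usable, if coarse, signal. In this sketch is_flapping is a hypothetical helper, and the default threshold of 5 is an arbitrary illustration value, not a systemd default.

```shell
#!/bin/sh
# Sketch: decide whether a unit looks like it is flapping, based on
# `systemctl show` output on stdin. Returns 0 when NRestarts >= threshold.
is_flapping() {
    threshold=${1:-5}
    # Pull the number out of the NRestarts=<n> line, if there is one.
    n=$(sed -n 's/^NRestarts=\([0-9][0-9]*\)$/\1/p' | head -n 1)
    [ -n "$n" ] && [ "$n" -ge "$threshold" ]
}
```

Example: systemctl show nginx.service -p NRestarts | is_flapping 5 && echo "restart storm, stop and investigate".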
Task 11: Check whether you’re fighting a full unit replacement in /etc
cr0x@server:~$ systemctl show nginx.service -p FragmentPath
FragmentPath=/etc/systemd/system/nginx.service
Meaning: Someone created a full unit in /etc, which completely overrides the vendor unit.
Decision: If you didn’t mean to own the whole unit, remove the full unit and replace with drop-ins. This is a common “we used --full once” hangover.
Task 12: Identify ordering/dependency issues (mounts, network, etc.)
cr0x@server:~$ systemctl list-dependencies --reverse nginx.service
nginx.service
● multi-user.target
Meaning: Reverse dependencies show what wants this service. This is “who will break if nginx is down.”
Decision: If critical targets depend on it, plan changes with caution and use staged reloads or maintenance windows.
Task 13: Verify unit syntax before restarting critical services
cr0x@server:~$ systemd-analyze verify nginx.service
Meaning: verify works on units, not on individual drop-in .conf files; naming the unit checks it together with its drop-ins. No output typically means “no issues found.” If there’s a syntax error, you’ll see it before you cut production over.
Decision: Use verify when you touch tricky directives (ExecStart, dependencies, sandboxing). It’s cheaper than a pager storm.
Task 14: Roll back overrides cleanly
cr0x@server:~$ sudo systemctl revert nginx.service
Removed "/etc/systemd/system/nginx.service.d/override.conf".
Removed "/etc/systemd/system/nginx.service.d/10-execstart.conf".
Meaning: revert removes drop-ins and returns the unit to the vendor state (or whatever remains).
Decision: Use this when your override made things worse and you need a fast “back to known baseline” button.
Task 15: Confirm which overrides are currently applied (with file ordering)
cr0x@server:~$ systemctl show nginx.service -p DropInPaths
DropInPaths=/etc/systemd/system/nginx.service.d/10-execstart.conf /etc/systemd/system/nginx.service.d/override.conf
Meaning: Order matters. Lexical order matters. Your 90-* file will beat your 10-* file.
Decision: If you’re layering multiple changes, name them intentionally. Don’t let “override.conf” become a junk drawer.
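To see which file has the last word on a directive, you can approximate systemd's ordering with a byte-order sort. This is a rough sketch under stated simplifications: last_assignment is a hypothetical helper, it looks at a single drop-in directory, and it ignores sections and list-type resets.

```shell
#!/bin/sh
# Sketch: approximate "last drop-in wins" for a single-value directive.
# Simplified: one drop-in directory, no section tracking, no list-type resets.
last_assignment() {
    dir=$1   # e.g. /etc/systemd/system/nginx.service.d
    key=$2   # e.g. LimitNOFILE
    # LC_ALL=C sorts by plain byte order, so digits come before letters.
    find "$dir" -maxdepth 1 -type f -name '*.conf' | LC_ALL=C sort |
    while IFS= read -r f; do
        grep -H "^${key}=" "$f"
    done | tail -n 1
}
```

One consequence worth knowing: in byte order, override.conf sorts after every numbered file, so it beats 90-anything.conf too. If you mix numbered drop-ins with a plain override.conf, the "junk drawer" gets the final say.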
Task 16: Prove the running process got the limits you set
cr0x@server:~$ systemctl show nginx.service -p MainPID
MainPID=2418
cr0x@server:~$ grep -E '^(Limit|Max open files)' /proc/2418/limits
Limit                     Soft Limit           Hard Limit           Units
Max open files            65536                65536                files
Meaning: This shows what the kernel actually enforces on the process. LimitNOFILE= appears as “Max open files.” TasksMax= is a cgroup limit on the whole unit, not a per-process rlimit, so it never shows up in this table; confirm it with systemctl show -p TasksMax.
Decision: If the proc limits don’t match, you may have multiple processes (master/worker), or the service forks and applies limits differently. Verify the right PID.
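LimitNOFILE= corresponds to the “Max open files” row of /proc/<pid>/limits, so a scripted check only needs that one row. A minimal sketch, with open_files_soft_limit as a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: print the soft "Max open files" limit from a /proc/<pid>/limits file.
open_files_soft_limit() {
    limits_file=$1   # e.g. /proc/2418/limits
    # Row layout: "Max open files  <soft>  <hard>  files"; field 4 is the soft limit.
    awk '/^Max open files/ { print $4 }' "$limits_file"
}
```

Combined with the MainPID lookup: open_files_soft_limit "/proc/$(systemctl show nginx.service -p MainPID --value)/limits".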
Three corporate mini-stories from the trenches
Mini-story 1: An incident caused by a wrong assumption
A mid-sized SaaS company had a fleet of Ubuntu servers running a log forwarder as a systemd service.
The forwarder started failing after a routine security update. Engineers saw that the service used
a unit file in /lib/systemd/system and assumed “it’s stable, it won’t change much.”
Months earlier, someone had “temporarily” edited the vendor unit in place to add an environment variable:
Environment=HTTP_PROXY=.... It was undocumented, unreviewed, and invisible to config management because it wasn’t in /etc.
The security update replaced the package, and the proxy vanished. The forwarder couldn’t reach the collector, and logs stopped flowing.
The alert that fired wasn’t “log forwarder down.” It was downstream: a dashboard quietly went blank,
then a compliance export job failed because it expected logs. People chased the wrong system for an hour.
The first clue was in journald: repeated connection errors after the package upgrade time.
The fix was boring: a drop-in under /etc/systemd/system to set the proxy, plus a note in the runbook.
The postmortem recommendation was sharper: never change vendor units directly, and build a unit-audit check
into their configuration drift detection.
Mini-story 2: An optimization that backfired
Another company ran a latency-sensitive service with aggressive restart settings.
Someone noticed that after a crash, recovery time was too slow for their taste.
They reduced RestartSec to 100ms and bumped StartLimitBurst high “to keep it available.”
It did reduce the time between crashes and restarts. That’s the problem.
A latent config bug started causing occasional failures. Instead of one clean crash and a stable restart,
the service entered a rapid crash loop. CPU spiked, logs exploded, and other services on the same host
started missing deadlines due to I/O pressure.
Then the backfire: journald rate limiting kicked in. The very logs needed to debug the original fault became
incomplete. Meanwhile, the monitoring system saw “service is up” intermittently and didn’t page immediately.
Everyone lost time because the symptoms were smeared across the entire machine.
The recovery was to stop the unit, revert the override, then reintroduce sane defaults:
Restart=on-failure, RestartSec=2s, and a modest start limit.
The real optimization was changing the health check to detect “rapid restarts” as a failure and page sooner.
Mini-story 3: A boring but correct practice that saved the day
A large enterprise had a policy: all systemd modifications must be drop-ins, named with a prefix and a ticket number,
and every override must be visible in a single inventory report pulled from systemctl show on each host.
It sounded like bureaucracy until the day it wasn’t.
A vendor shipped an updated unit for a critical agent, changing the default sandboxing options.
On a subset of servers, the agent stopped accessing a directory it needed. The incident could have turned into a
multi-day “it works on my node” hunt because different teams owned different environments.
Instead, the on-call engineer pulled the inventory report and immediately saw which servers had an override that touched
filesystem access (e.g., ReadWritePaths) and which didn’t. The override naming convention made it searchable.
They rolled out a one-line drop-in to align behavior, restarted the agent safely, and moved on.
The practice didn’t prevent the vendor change. It prevented the confusion. In production, clarity is a feature.
Common mistakes: symptom → root cause → fix
1) “My override doesn’t apply”
Symptom: You edited a drop-in, restarted the service, and nothing changed.
Root cause: Forgot systemctl daemon-reload, or you edited a file in the wrong path, or a full unit in /etc supersedes vendor + drop-ins.
Fix: Run systemctl show -p FragmentPath -p DropInPaths unit; then daemon-reload; then systemctl cat unit to confirm merged config.
2) “I overrode ExecStart and now it won’t start”
Symptom: Service fails with odd messages, or tries to execute multiple commands.
Root cause: You added ExecStart=... without clearing the existing list.
Fix: In your drop-in: add a blank ExecStart= line before the new ExecStart=.... Reload and restart.
3) “Package upgrade reverted my fix”
Symptom: Service behavior changes right after apt upgrade.
Root cause: You edited /lib/systemd/system/*.service directly (vendor territory).
Fix: Re-apply the change as a drop-in under /etc/systemd/system/<unit>.d/. Consider systemctl revert to clean accidental full-unit copies.
4) “It starts manually but not via systemd”
Symptom: Running the binary in a shell works; systemd start fails.
Root cause: Missing environment variables, different working directory, or tighter sandboxing in systemd context.
Fix: Compare systemctl show -p Environment -p WorkingDirectory; add EnvironmentFile= or WorkingDirectory= in a drop-in; verify logs.
5) “Service flaps and takes the box down with it”
Symptom: High CPU, massive logs, “service keeps restarting.”
Root cause: Over-aggressive restart policy (Restart=always, tiny RestartSec), or a real crash loop masked by auto-restart.
Fix: Slow it down: Restart=on-failure, RestartSec=2s. Stop the unit, capture logs, fix root cause, then restart.
6) “My override broke after adding multiple drop-ins”
Symptom: Settings appear to “randomly” change depending on host.
Root cause: Drop-in file order differs or another file overrides the same directive later.
Fix: Use explicit numbering (10-, 20-, 90-). Check DropInPaths. Consolidate conflicting directives.
7) “It can’t see a mounted directory at boot”
Symptom: Service fails on boot, but works after manual restart.
Root cause: Missing ordering dependencies on the mount unit; service starts before the filesystem is ready.
Fix: Add a drop-in with RequiresMountsFor=/path or appropriate After=/Requires=. Reload and test reboot behavior.
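For mistake 7, the detail that trips people up is the section: RequiresMountsFor= belongs under [Unit], not [Service]. A small sketch that generates the drop-in text follows; mount_dropin is a hypothetical helper and /srv/data in the usage line is a placeholder path.

```shell
#!/bin/sh
# Sketch: emit a drop-in that makes a unit wait for the mount backing a path.
# The path argument is whatever directory the service needs at start.
mount_dropin() {
    path=$1
    # Ordering and dependency settings live in the [Unit] section.
    printf '[Unit]\nRequiresMountsFor=%s\n' "$path"
}
```

For example: mount_dropin /srv/data | sudo tee /etc/systemd/system/<unit>.d/20-mounts.conf, followed by a daemon-reload and a real reboot test.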
Checklists / step-by-step plan
Checklist A: Make a safe override (the default approach)
- Get the current truth:
systemctl cat unitandsystemctl show -p FragmentPath -p DropInPaths unit. - Decide the smallest possible change (limits, env, ordering, restart policy).
- Create drop-in:
sudo systemctl edit unit. - Add only the directives you need. Avoid copying the vendor unit.
- Validate syntax if you touched tricky stuff:
systemd-analyze verify .... sudo systemctl daemon-reload.- Restart and verify:
sudo systemctl restart unitthensystemctl status unit. - Prove effective config:
systemctl show -p ... unit. - Prove runtime effect (limits/env) via
/procor service-specific diagnostics.
Checklist B: You must replace ExecStart (high-risk change)
- Capture current ExecStart: systemctl show -p ExecStart unit.
- Create a dedicated drop-in file (don’t bury it): /etc/systemd/system/unit.d/10-execstart.conf.
- Reset the list first: include a blank ExecStart= line.
- Include any needed ExecStartPre=, WorkingDirectory=, and env files.
- Verify: systemd-analyze verify.
- Reload and restart.
- Have a rollback ready: systemctl revert unit (or remove the drop-in) and restart.
Checklist C: Rollback and return to baseline
- Stop the service if it’s flapping: sudo systemctl stop unit.
- Snapshot the current merged unit: systemctl cat unit into your incident notes.
- Revert overrides: sudo systemctl revert unit.
- Reload: sudo systemctl daemon-reload.
- Start: sudo systemctl start unit.
- Confirm logs: journalctl -u unit -n 50.
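Checklist C is short enough to script, which helps when the person doing the rollback is not the person who wrote the override. The sketch below is one possible shape, not a standard tool: rollback_unit, run, and the DRY_RUN variable are all hypothetical names. With DRY_RUN=1 it prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: the rollback checklist as a function, with a dry-run mode.
# DRY_RUN is a hypothetical variable: when set to 1, commands are printed, not run.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

rollback_unit() {
    unit=$1
    run systemctl stop "$unit"
    # Snapshot of the config being rolled back; redirect it into incident notes.
    run systemctl cat "$unit"
    run systemctl revert "$unit" || return 1
    run systemctl daemon-reload || return 1
    run systemctl start "$unit" || return 1
    run journalctl -u "$unit" -n 50 --no-pager
}
```

Try DRY_RUN=1 first and read the plan; the real run needs root.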
FAQ
1) Should I ever edit files in /lib/systemd/system?
No, not as a practice. Treat it like vendor firmware: readable, not writable. Use drop-ins in /etc/systemd/system.
If you must experiment, do it on a disposable host and translate the result into an override.
2) What’s the difference between systemctl edit and editing a file manually?
systemctl edit creates the correct directory structure, opens the right file, and aligns with systemd’s model.
Manual edits are fine if you know the paths, but systemctl edit reduces “wrong location” mistakes.
3) Why did my ExecStart= override not replace the old one?
Because ExecStart is list-type. Add a blank ExecStart= line to clear the list first, then add your new ExecStart=....
4) How do I see the final unit that systemd uses?
Use systemctl cat unit for the merged config and systemctl show -p FragmentPath -p DropInPaths unit to see sources and applied drop-ins.
5) Do I need daemon-reload every time?
If you changed unit files or drop-ins, yes. Restarting a service does not reliably imply re-reading the unit definition.
daemon-reload is cheap; outages are not.
6) What’s the cleanest way to remove an override?
sudo systemctl revert unit. It removes drop-ins and returns to vendor defaults. Then run daemon-reload.
7) Can I use drop-ins to change dependencies like After= and Requires=?
Yes, and it’s one of the best uses. If a service depends on a mount, consider RequiresMountsFor=/path.
For network dependencies, be careful: “network up” is not the same as “remote dependency reachable.”
8) My override works on one server but not another. Why?
Most common reasons: another drop-in overrides it later; there’s a full unit in /etc on one host; or a generator produced something in /run.
Compare systemctl show -p FragmentPath -p DropInPaths across hosts.
9) Is it better to put everything in one override.conf?
Not always. One file is easy until it becomes a junk drawer. For larger changes, split drop-ins by purpose with numbering:
10-limits.conf, 20-env.conf, 90-hardening.conf. Order becomes explicit.
10) How do I confirm my limit changes actually apply to the process?
Get the PID with systemctl show -p MainPID and check /proc/<pid>/limits. That’s what the kernel enforces.
Conclusion: next steps you can do today
On Ubuntu 24.04, systemd overrides are the grown-up way to fix services. They survive upgrades, keep your intent separate from vendor defaults,
and make incident response faster because the “what changed” question is answerable in seconds.
Practical next steps:
- Pick one service you’ve “tweaked” before. Run systemctl show -p FragmentPath -p DropInPaths and check whether any edits live in /lib.
- If they do, migrate them into a drop-in under /etc/systemd/system/<unit>.d/.
- Standardize naming for drop-ins (numbered, purpose-based). Your future self is a different person with different sleep.
- Teach your team two commands: systemctl cat and systemctl revert. One finds the truth; the other buys you a rollback.
Do this a few times and you’ll stop treating systemd like a black box. It’s not magic. It’s just very particular about where you put your edits.