There’s a special kind of dread that comes from watching a server “boot” for ten minutes while your deploy pipeline times out and everyone pretends it’s fine. The logs are scrolling, the fans are spinning, and your instance is doing that thing where it’s technically alive but emotionally unavailable.
If you want faster startup, stop hunting ghosts. There’s one list that matters: the ordered set of things you ask the system to start, and the dependency graph that forces the order. Clean that list, and boot gets fast. Ignore it, and you’ll keep buying bigger instances to run the same slow startup script.
The one list: what it really is
When people say “startup takes forever,” they usually mean one of these:
- Boot time: kernel + initramfs + init system getting you to “multi-user” (or equivalent).
- Service readiness time: your app is “started” but not actually accepting traffic yet.
- Node readiness time: in Kubernetes/auto-scaling, the machine exists but isn’t schedulable or is failing health checks.
On most modern Linux, the list you need to clean is visible and editable through systemd. Concretely, it’s:
- The set of enabled units that will be pulled in by a target (usually multi-user.target or graphical.target).
- The dependency graph among them (Wants/Requires/After/Before, plus implicit dependencies and generator output).
- The unit-level timeouts and ordering constraints that turn “parallel boot” into “serial waiting.”
Systems don’t boot slowly because one service is “slow.” They boot slowly because something forces the slow service to be on the critical path. You don’t need more heroics; you need less stuff on that path.
One quote worth keeping on a sticky note:
“Hope is not a strategy.” — an old operations adage, beloved of SRE and reliability folks
Same energy for boot performance: hoping your system starts faster next time is not a plan. Measuring and pruning is.
Fast diagnosis playbook
This is the order I use when an instance is slow to come up and people are already typing in the incident channel.
First: locate the time bucket (firmware/kernel vs initramfs vs userspace)
- If the console sits before “Starting version … systemd,” suspect firmware, kernel init, initramfs, storage discovery, or fsck.
- If systemd starts quickly but “A start job is running…” for 90 seconds, suspect mounts, network-online waits, or timeouts.
- If boot finishes but your app isn’t ready, suspect readiness checks, migrations, DNS, dependency services, or rate limits.
Second: identify the critical chain
- Run systemd-analyze critical-chain and find what’s on the line to default.target.
- Cross-check with systemd-analyze blame, but don’t let “blame” trick you: something can be slow without being on the critical path.
Third: confirm if the delay is waiting or working
- Waiting: timeouts, retries, dependency checks, network-online, mount failures.
- Working: CPU-bound unit startup, disk-bound fsck, entropy starvation, package manager locks, container image pulls.
Fourth: cut the graph, not the corners
- Remove unnecessary units from the boot target.
- Relax ordering (After=) if it’s superstition, not requirement.
- Convert “hard” dependencies to “soft” where safe (Requires → Wants).
- Fix or mask the stuff that is failing repeatedly.
That’s the playbook. Now we’ll make it concrete with commands and decisions.
Measure first: what “slow startup” actually means
Boot speed arguments are usually theology. Don’t argue; measure. systemd gives you clean, structured data, and the journal gives you the ugly truth.
Two quick definitions that matter in practice:
- Userspace time is where most of your tuning lives: services, mounts, and dependency logic.
- Critical path is the chain of units whose delays directly delay reaching the boot target.
Also: “startup” is not only systemd. If you have cloud-init, ignition, config management agents, container runtime pulls, or an app that does migrations on boot, that can dwarf the init system’s work. The point is still the same: enumerate the things you do at boot, then delete the ones you don’t need.
And yes, sometimes it really is storage. A single mount waiting 90 seconds for a dead NFS server can make your “fast” app look like it’s running on a potato.
Joke #1: Boot time is like a meeting with “quick sync” in the title. You can’t speed it up by believing harder.
The practical cleanup tasks (commands, outputs, decisions)
These are not “tips.” They are tasks. Run them, read the output, and make a decision. That’s the job.
Task 1: Get a high-level boot breakdown
cr0x@server:~$ systemd-analyze
Startup finished in 3.201s (kernel) + 7.842s (initrd) + 1min 18.331s (userspace) = 1min 29.374s
graphical.target reached after 1min 17.902s in userspace
What it means: Userspace is the problem: 78 seconds after initrd. Kernel/initrd are not the priority.
Decision: Stay in systemd land: critical chain, blame, failed units, mounts, network waits.
Task 2: Find what’s on the critical path
cr0x@server:~$ systemd-analyze critical-chain
graphical.target @1min 17.902s
└─multi-user.target @1min 17.902s
└─docker.service @52.104s +25.769s
└─network-online.target @51.990s
└─systemd-networkd-wait-online.service @21.411s +30.563s
└─systemd-networkd.service @20.982s +428ms
└─systemd-udevd.service @12.441s +8.512s
└─systemd-tmpfiles-setup-dev.service @11.901s +526ms
What it means: The chain is blocked on systemd-networkd-wait-online. Then Docker is on the path, too.
Decision: Decide whether you truly need “network is fully online” before reaching your target. Most systems don’t.
Task 3: View the “blame” list without worshipping it
cr0x@server:~$ systemd-analyze blame | head -n 15
30.563s systemd-networkd-wait-online.service
25.769s docker.service
10.112s dev-sdb2.device
8.512s systemd-udevd.service
6.902s snapd.service
5.441s cloud-init.service
4.990s apt-daily.service
4.221s systemd-journald.service
3.301s systemd-tmpfiles-setup.service
2.880s systemd-logind.service
What it means: You have multiple slow units, but only the ones on the critical chain delay your boot target.
Decision: Fix wait-online first (critical path), then revisit Docker and anything else you care about.
Task 4: List failed units (often the silent boot tax)
cr0x@server:~$ systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● mnt-backup.mount loaded failed failed /mnt/backup
● nfs-client.target loaded failed failed NFS client services
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
What it means: A mount and NFS client target failed. On some configs, systemd will wait or retry, costing time.
Decision: Either fix the mount, or make it non-blocking (timeouts, nofail, automount), or remove it from startup.
Task 5: Inspect long waits in the journal for this boot
cr0x@server:~$ journalctl -b -0 -p warning..alert --no-pager | tail -n 30
Jan 12 10:04:21 server systemd[1]: mnt-backup.mount: Mount process exited, code=exited, status=32/n/a
Jan 12 10:04:21 server systemd[1]: Failed to mount /mnt/backup.
Jan 12 10:04:21 server systemd[1]: Dependency failed for Local File Systems.
Jan 12 10:04:52 server systemd-networkd-wait-online[418]: Timeout occurred while waiting for network connectivity.
Jan 12 10:04:52 server systemd[1]: systemd-networkd-wait-online.service: Failed with result 'timeout'.
What it means: Two classic boot-killers: a mount failure and a wait-online timeout.
Decision: Fix the mount semantics and remove wait-online requirement unless your node truly can’t function without it.
Task 6: Identify which units require network-online.target
cr0x@server:~$ systemctl list-dependencies --reverse network-online.target
network-online.target
● ├─docker.service
● ├─remote-fs.target
● └─myapp.service
What it means: Docker, remote filesystems, and your app are pulling in network-online. That’s why the wait blocks everything.
Decision: For each: decide if it needs network.target (basic networking) instead of network-online.target (configured, “up”). Most daemons should tolerate the network coming up shortly after.
Task 7: Confirm who pulls in a slow mount and whether it can be on-demand
cr0x@server:~$ systemctl list-dependencies mnt-backup.mount
mnt-backup.mount
● ├─system.slice
● └─remote-fs-pre.target
What it means: This mount is treated as part of filesystem setup. If it blocks, it blocks early.
Decision: If it’s not required for boot, move it out of the critical path with an automount or nofail and sensible timeouts.
Task 8: Audit fstab for boot-hostile entries
cr0x@server:~$ cat /etc/fstab
UUID=9bb0c7d9-4f4d-4d56-9b27-2cbd71a7dbd6 / ext4 defaults 0 1
server:/exports/backup /mnt/backup nfs defaults 0 0
What it means: That NFS mount has no nofail, no x-systemd.automount, and no timeout options. It will happily stall boot while it tries.
Decision: If it’s optional at boot, make it optional. A practical pattern:
cr0x@server:~$ sudo sed -i 's#server:/exports/backup /mnt/backup nfs defaults 0 0#server:/exports/backup /mnt/backup nfs nofail,x-systemd.automount,x-systemd.idle-timeout=60,timeo=5,retrans=2 0 0#' /etc/fstab
Decision follow-up: After editing, validate with systemctl daemon-reload and a test mount (next task). Don’t ship untested fstab changes; that’s how you get a remote “no boot” day.
Task 9: Test mount behavior without rebooting
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart mnt-backup.automount
cr0x@server:~$ ls -la /mnt/backup
total 8
drwxr-xr-x 2 root root 4096 Jan 12 10:06 .
drwxr-xr-x 4 root root 4096 Jan 12 10:06 ..
What it means: With automount, the mount triggers on access. If the server is down, you’ll see a delay at access time, not at boot time.
Decision: For non-essential remote filesystems, automount is usually the right trade: fast boot, controlled delay when you actually touch it.
Task 10: Remove “wait-online” where it’s superstition
cr0x@server:~$ systemctl is-enabled systemd-networkd-wait-online.service
enabled
cr0x@server:~$ sudo systemctl disable systemd-networkd-wait-online.service
Removed "/etc/systemd/system/network-online.target.wants/systemd-networkd-wait-online.service".
What it means: Disabling wait-online means network-online.target may still be reached by other means, but the explicit “wait until online” step is removed.
Decision: If you have services that truly require online networking before starting, fix those services, not the whole boot. Prefer retry logic in the service, or order only that service after network-online, not the entire system.
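If exactly one interface matters, you can also scope wait-online to it instead of disabling it outright. A minimal drop-in sketch, assuming systemd-networkd and the interface name ens5 from the earlier output (adjust both for your host; the binary path varies by distro):

```ini
# /etc/systemd/system/systemd-networkd-wait-online.service.d/10-scope.conf
# Create it with: sudo systemctl edit systemd-networkd-wait-online.service
[Service]
# An empty ExecStart= clears the vendor command before replacing it
ExecStart=
# Wait only for ens5, and fail fast instead of the long default timeout
ExecStart=/usr/lib/systemd/systemd-networkd-wait-online --interface=ens5 --timeout=15
```

Run systemctl daemon-reload afterwards. On newer systemd, --any is also available if “any one link online” is the semantics you actually want.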
Task 11: Find which services are enabled and probably shouldn’t be
cr0x@server:~$ systemctl list-unit-files --state=enabled --no-pager | head -n 25
UNIT FILE STATE PRESET
apt-daily.service enabled enabled
apt-daily.timer enabled enabled
cloud-init.service enabled enabled
cloud-init-local.service enabled enabled
cloud-config.service enabled enabled
cloud-final.service enabled enabled
docker.service enabled enabled
snapd.service enabled enabled
ssh.service enabled enabled
systemd-timesyncd.service enabled enabled
What it means: Some of these are fine. Some are baggage. Some are actively hostile to predictable boot time.
Decision: For servers (not desktops), question background updaters (apt-daily), snap if you don’t use it, and anything cloud-init related on images that aren’t first-boot configured anymore.
Task 12: Verify what a service is waiting on (read the unit file like it’s a contract)
cr0x@server:~$ systemctl cat myapp.service
# /etc/systemd/system/myapp.service
[Unit]
Description=My App API
After=network-online.target remote-fs.target
Wants=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/myapp --config /etc/myapp/config.yml
Restart=on-failure
TimeoutStartSec=120
[Install]
WantedBy=multi-user.target
What it means: Your app is explicitly pulling in network-online and remote-fs. If either is flaky, your app becomes a boot anchor.
Decision: If the app can start without those and retry connections, change to:
- After=network.target (or remove the ordering entirely)
- Drop remote-fs.target unless you actually require remote mounts
- Keep TimeoutStartSec tight and meaningful
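Applied to the unit above, a corrected sketch looks like this (same app, looser coupling; RestartSec and the shorter timeout are my additions, pick values that match your app’s real startup time):

```ini
# /etc/systemd/system/myapp.service — decoupled from network-online/remote-fs
[Unit]
Description=My App API
# Basic networking ordering only; the app retries its own connections
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/myapp --config /etc/myapp/config.yml
Restart=on-failure
RestartSec=2
# Tight enough to fail fast, long enough for real startup work
TimeoutStartSec=30

[Install]
WantedBy=multi-user.target
```

After editing: sudo systemctl daemon-reload, then restart the service and re-run systemd-analyze critical-chain to confirm it left the path.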
Task 13: Look for dependency cycles and ordering dead-ends
cr0x@server:~$ systemd-analyze verify /etc/systemd/system/myapp.service
/etc/systemd/system/myapp.service:8: Unknown lvalue 'TimeoutStartSecs' in section 'Service'
What it means: Verification catches config mistakes that make systemd silently ignore settings you assumed were active. Here a one-character typo (TimeoutStartSecs instead of TimeoutStartSec) means the timeout you thought you set never applied.
Decision: Treat unit parsing warnings as production bugs. Fix them. Misread configs create “why is it still slow?” loops.
Task 14: Find the real time spent in mount units
cr0x@server:~$ systemd-analyze blame | grep -E '\.mount$|remote-fs' | head -n 20
1min 30.002s mnt-backup.mount
12.443s remote-fs.target
What it means: That mount alone costs 90 seconds. This isn’t a “tiny optimization” situation. It’s a “remove it from boot” situation.
Decision: Move it to automount, shorten timeout, add nofail, or remove it entirely. If it’s required for correctness, invest in making the remote service reliable and fast.
Task 15: Check whether DNS is the hidden network wait
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Link 2 (ens5)
Current Scopes: DNS
Protocols: +DefaultRoute
Current DNS Server: 10.0.0.2
DNS Servers: 10.0.0.2 10.0.0.3
What it means: DNS looks configured, but this doesn’t prove it’s reachable during boot. Many “network-online” waits are actually “DNS isn’t ready” in disguise.
Decision: If you’re using a local resolver or dependency that isn’t ready early, decouple your boot from DNS by avoiding hostname-based mounts and ensuring resolvers are available (or use IPs where appropriate).
Task 16: Inspect initramfs delays (storage and drivers)
cr0x@server:~$ dmesg --color=never | grep -E 'EXT4-fs|fsck|timed out|nvme|scsi|link is not ready' | tail -n 25
[ 2.901234] nvme nvme0: I/O 14 QID 0 timeout, reset controller
[ 3.112345] nvme nvme0: reset controller
[ 8.554321] EXT4-fs (nvme0n1p2): recovery complete
What it means: Kernel-level storage timeouts or resets can dominate early boot before systemd has any say.
Decision: If you see controller resets/timeouts, stop blaming systemd. Check cloud storage health, driver/firmware, or the underlying device. On physical hosts, pull SMART and controller logs.
Task 17: Check disk health and pending failures (boring, effective)
cr0x@server:~$ sudo smartctl -a /dev/sda | egrep -i 'Reallocated|Pending|Uncorrect|Power_On_Hours|SMART overall'
SMART overall-health self-assessment test result: PASSED
Power_On_Hours: 18234
Reallocated_Sector_Ct: 0
Current_Pending_Sector: 0
Offline_Uncorrectable: 0
What it means: Not obviously dying. That’s good. It’s not proof of performance, but it rules out one common boot drag: disks retrying reads while pretending they’re fine.
Decision: If values are non-zero or increasing, plan a replacement and stop chasing software “optimizations” to cover hardware failure.
Task 18: Spot the “fsck tax” on boot
cr0x@server:~$ journalctl -b -0 --no-pager | grep -E 'fsck|clean|recovering|UNEXPECTED' | head -n 20
Jan 12 10:03:11 server systemd-fsck[201]: /dev/nvme0n1p2: clean, 245612/6553600 files, 12541021/26214400 blocks
What it means: Here fsck is fast and clean. If you instead see long repairs, your slow startup is “filesystem maintenance,” not “service startup.”
Decision: Investigate unclean shutdowns, storage issues, and filesystem tuning. And stop rebooting boxes as your first troubleshooting step.
Task 19: Find the generators that are creating units at boot
cr0x@server:~$ systemd-analyze unit-paths
/etc/systemd/system.control
/run/systemd/system.control
/run/systemd/transient
/run/systemd/generator.early
/etc/systemd/system
/run/systemd/system
/run/systemd/generator
/usr/local/lib/systemd/system
/usr/lib/systemd/system
/run/systemd/generator.late
What it means: Generators synthesize units from things like fstab, crypttab, kernel cmdline, and network configs. They can create surprises.
Decision: If you keep getting “mystery mounts” or targets, inspect /run/systemd/generator* after boot to see what was generated and why.
Task 20: Validate what’s actually enabled for the default target
cr0x@server:~$ systemctl get-default
graphical.target
cr0x@server:~$ systemctl list-dependencies graphical.target --no-pager | head -n 30
graphical.target
● ├─multi-user.target
● ├─display-manager.service
● └─system.slice
What it means: If this is a server, graphical.target is suspicious. You might be booting a GUI stack you don’t need.
Decision: On servers, set multi-user.target unless you have a real requirement for a graphical session.
Joke #2: Turning on every startup service “just in case” is like packing a treadmill for a camping trip. Technically possible; spiritually wrong.
Common mistakes: symptoms → root cause → fix
Symptom: “A start job is running for /mnt/…”
Root cause: Boot-time mount that can’t be satisfied (NFS/CIFS/iSCSI down, wrong DNS, VPN not up yet), combined with default timeouts and hard dependencies.
Fix: If optional: add nofail, consider x-systemd.automount, shorten timeouts, or remove it from fstab. If required: make the dependency reliable and ensure the network path exists early.
Symptom: Boot blocks on network-online, but networking “works eventually”
Root cause: systemd-networkd-wait-online (or NetworkManager equivalent) waiting for a condition that never occurs: DHCP delay, missing carrier, disabled interface, or DNS requirement.
Fix: Disable wait-online globally when safe; more often, remove Wants=network-online.target from services that don’t need it. If one service needs it, scope it to that service and make it robust (retries/backoff).
Symptom: Everything is “slow” after a new package install or image update
Root cause: New enabled units (telemetry agents, auto-updaters, security scanners), new timers, or cloud-init re-running on every boot because state wasn’t persisted correctly.
Fix: Audit enabled unit files, compare against baseline, and disable what you don’t need. Make cloud-init one-shot again by fixing instance metadata/state behavior.
Symptom: systemd-analyze blame shows a slow unit, but critical-chain doesn’t
Root cause: The unit is slow but started in parallel; it’s not gating your boot target. People chase it anyway because it’s “top of list.”
Fix: Prioritize the critical chain. Only optimize parallel units if they compete for CPU/disk and actually harm readiness or SLOs.
Symptom: App service starts quickly but health checks fail for minutes
Root cause: Your app is doing “startup work” after the process starts: migrations, cache warmups, expensive dependency discovery, container image pulls, or waiting on external APIs.
Fix: Make readiness explicit. Move heavy work out of the hot path. Add backoff, timeouts, and proper dependency checks. Split migrations from web startup if you care about boot.
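The “retry with backoff” idea is small enough to sketch in shell. Here probe and the /tmp/db-ready marker are stand-ins for a real dependency check (a TCP connect, a health URL); the point is that the process starts immediately and polls, instead of holding boot hostage:

```shell
# Stand-in readiness check: replace with a real probe (e.g. a TCP connect)
probe() { [ -e /tmp/db-ready ]; }

touch /tmp/db-ready   # simulate the dependency becoming available

ready=no
for delay in 0 1 2 4 8; do   # bounded, roughly exponential backoff
  sleep "$delay"
  if probe; then ready=yes; break; fi
done
echo "ready=$ready"          # only now report ready to health checks
```

The loop is bounded on purpose: if the dependency never shows up, the service fails visibly instead of blocking forever.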
Symptom: Randomly slow boots; sometimes fast, sometimes terrible
Root cause: External dependencies (DNS, NTP, metadata service, remote mounts), entropy starvation, or flapping network interfaces.
Fix: Remove hard ordering, add caching or local fallbacks, and use timeouts that fail fast. Also check hardware and virtualization layer health.
Symptom: “It used to be fast; now it’s slow on the same host”
Root cause: Unit overrides accumulating, stale generator output assumptions, or a mount/device rename (e.g., by-label changes) causing discovery delays.
Fix: Review drop-ins and overrides; run systemd-delta to see what changed; pin mounts by UUID; tidy old custom units.
Symptom: Boot hangs before systemd starts
Root cause: Initramfs waiting for root device, storage timeouts, broken initramfs, or driver regressions.
Fix: Use console logs/dmesg, check initramfs config, verify root device identifiers, and roll back kernel/initramfs if needed. systemd is not guilty yet.
Three corporate mini-stories from the trenches
1) The incident caused by a wrong assumption: “Network-online means ‘the internet’”
At a mid-sized company, a team standardized their base image and added a simple rule: “Start the API after the network is online.” Reasonable, right? They updated the unit with After=network-online.target and Wants=network-online.target. It passed all staging tests, because staging had clean DHCP and a single interface.
Production was messier. Instances had two NICs: one for internal traffic, one for a restricted management network with no default route by design. During boot, systemd-networkd’s wait-online logic waited for “configured” on both links. The management NIC never satisfied the conditions the service expected. Wait-online hit its timeout, and the dependency chain held the API back.
The incident wasn’t that the API was down. It was that auto-scaling became sluggish. Nodes took too long to join. Deploys rolled slowly. A few regional failovers worsened things because they relied on rapid instance replacement. From the outside it looked like “capacity is fine but nothing is healing.” From the inside it looked like a boot that refused to finish.
The fix was aggressively unsexy: change the app to start after network.target, remove the global wait-online requirement, and add retry logic for upstream connections. They also tuned the wait-online service to only care about the interface that mattered (when they truly needed it). The real mistake was assuming “online” was a single universal truth. On real networks, it’s conditional, and you pay for pretending it isn’t.
2) The optimization that backfired: “Let’s parallelize everything with aggressive timeouts”
A large enterprise platform team got obsessed with boot time because they ran thousands of nodes and wanted faster rolling updates. They went hunting with systemd-analyze blame and found a handful of long-running services: logging, metrics, and a security agent. Someone proposed a “simple improvement”: cut TimeoutStartSec values down across the board and remove ordering constraints so everything starts concurrently.
The rollout looked great for a day. Boot finished faster. Dashboards were green. Then the weirdness started. The security agent would occasionally fail to start due to a slow disk or CPU contention during boot. systemd marked it failed, but the node still reached the boot target and became eligible to serve traffic. The compliance team noticed first, which is never the best kind of monitoring.
Worse, the logging pipeline sometimes started before DNS was stable, failed to resolve its upstream endpoint, and exited. Because it had been “optimized” for fast boot, its restart strategy was insufficient. Nodes came up “successfully” but with missing logs. When an unrelated production incident happened two weeks later, half the nodes had incomplete telemetry. Root cause analysis turned into archaeology.
The correction was to treat boot performance as an SLO trade, not a speedrun. They restored sane timeouts for critical security/telemetry services, added robust restart/backoff, and reintroduced ordering where it represented real dependencies. They still improved boot time—by removing junk from the startup list and fixing mounts—but they stopped cheating by making important services optional by accident.
3) The boring but correct practice that saved the day: “Baseline the unit list and diff it”
A smaller fintech shop had a rule that sounded like bureaucracy: every base image release produced a text artifact listing enabled units, default target, and notable overrides. The release process also compared it to the previous version. If a new service became enabled, someone had to explain why.
One quarter, a routine OS update quietly introduced a new background service related to package management. It wasn’t malicious. It wasn’t even wrong. But it occasionally grabbed locks and did network work early in boot. On some nodes, it collided with configuration management and delayed key services.
Because the team had the baseline, they immediately saw “this unit is new and enabled.” They didn’t need three days of debate about whether the cloud provider had changed something. They simply disabled it on servers where it wasn’t needed at boot and moved update work to a controlled maintenance window.
Nothing heroic happened. No war room. No “SRE saves the day” fan fiction. Just a small process that kept the startup list from quietly growing teeth. That’s the level of boring you should aspire to.
Facts and historical context you can weaponize
- Boot time used to be mostly hardware-bound. On older systems, BIOS/firmware and spinning disks dominated. Now userspace dependency graphs often dominate on servers.
- SysV init was largely sequential. The move to dependency-based init systems (not just systemd) made parallel boot possible, but also made mis-specified dependencies more expensive.
- systemd introduced socket activation. Services can start on-demand when a connection arrives, which is one of the cleanest ways to keep them off the critical path.
- fstab is older than most of your infrastructure. It’s a simple file, but in systemd-era it feeds generators that can create complex unit behavior.
- Network “online” is ambiguous by design. Carrier up, IP configured, default route present, DNS reachable: these are different milestones, and different tools define “online” differently.
- Timeout defaults are often conservative. A 90-second mount timeout might have made sense when networks were slow and servers were fewer. At scale, it’s a self-inflicted outage amplifier.
- cloud-init changed how first boot works. In cloud images, a huge part of “startup time” is often “instance configuration time.” If it reruns unexpectedly, boot balloons.
- Journaling filesystems reduced long fsck events, not eliminated them. If you see frequent repairs, it’s often a symptom of unclean shutdowns or storage instability.
- Parallelism can increase variance. When everything starts at once, contention makes “usually fast” become “sometimes terrible.” This is why critical-path thinking matters.
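Socket activation (mentioned above) deserves a concrete shape. A minimal sketch, assuming a hypothetical myapp that can accept an inherited listening socket:

```ini
# myapp.socket — systemd listens on the port at boot (cheap), and only
# starts myapp.service when the first connection arrives
[Socket]
ListenStream=8080

[Install]
WantedBy=sockets.target
```

Enable the socket, not the service (systemctl enable --now myapp.socket); the service itself stays off the boot path entirely until someone actually connects.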
Checklists / step-by-step plan
Phase 1: Establish a baseline (same day)
- Capture systemd-analyze output.
- Capture systemd-analyze critical-chain output for the current default target.
- Save systemd-analyze blame (top 50 lines is usually enough).
- Save systemctl --failed and warnings/errors from journalctl -b.
Goal: Know what bucket you’re in and what unit is gating boot.
Phase 2: Clean the list (1–3 days, depending on politics)
- Remove boot blockers: fix/make optional remote mounts; kill unnecessary wait-online; remove GUI targets on servers.
- Reduce enabled units: disable services you don’t need on servers (desktop baggage, unused agents, update timers during boot).
- Simplify dependency edges: remove “After=” lines that encode superstition; convert Requires to Wants if safe; avoid pulling remote-fs and network-online unless truly necessary.
- Set sane timeouts: shorten mount timeouts for optional resources; set TimeoutStartSec to a value that reflects reality and fails fast for non-critical services.
Goal: Reduce critical path length and reduce variance.
Phase 3: Make it durable (ongoing)
- Baseline enabled units and diff between image versions.
- Gate new enabled services: require justification in change review.
- Test boot time in CI for base image changes (even a basic smoke test helps).
- For apps, enforce readiness semantics: started is not ready.
Goal: Prevent regression, not just fix the current mess.
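The baseline diff itself is one line once you have sorted lists. A self-contained sketch with sample data (in real use, the input files would come from systemctl list-unit-files --state=enabled on each image build; the unit names here are made up):

```shell
# Two enabled-unit baselines from consecutive image builds (sample data)
printf 'docker.service\nssh.service\n' | sort > /tmp/units-v1.txt
printf 'docker.service\nsnap-refresh.service\nssh.service\n' | sort > /tmp/units-v2.txt

# Units newly enabled in v2 — each of these needs a justification
comm -13 /tmp/units-v1.txt /tmp/units-v2.txt
```

Wire that into your image release pipeline and a nonzero diff becomes a review item, not a mystery three quarters later.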
What to avoid (because it wastes weeks)
- Blindly disabling “slow” services without understanding what they provide.
- Lowering timeouts everywhere to make boot “look fast.” That’s just sweeping reliability under the rug.
- Chasing the top entry in blame while ignoring critical-chain.
- Assuming the slow part is your app when the system is waiting for storage/network.
FAQ
1) What is “the one list” if I’m not using systemd?
It’s still the same concept: the set of startup actions plus their ordering constraints. On OpenRC, it’s runlevels and dependencies; on init scripts, it’s whatever your rc order is. The cleanup logic doesn’t change: prune, decouple, and remove hard waits.
2) Is systemd-analyze blame enough?
No. It’s useful, but it’s not a critical path report. A slow unit that starts in parallel might not delay reaching the boot target. Use critical-chain to find what’s gating.
3) Should I disable systemd-networkd-wait-online?
Often yes, on servers. But do it intentionally: verify which services require network-online.target. If only one service needs it, keep it scoped there, not globally.
4) What’s the safest way to handle optional NFS/CIFS mounts?
Use nofail and x-systemd.automount so boot doesn’t block. Then control access-time behavior with reasonable timeouts. Optional means optional at boot.
5) My app needs the database. Doesn’t that mean it must wait for network-online?
No. It means your app must be able to handle “database not reachable yet.” Start the process, retry connections with backoff, and only mark the service ready when dependencies are reachable.
6) Can containers make boot slow even if systemd is fast?
Absolutely. Container runtimes can pull images, set up storage backends, and wait on networking. If your boot includes “pull a 2GB image on every node,” your startup list includes that cost whether you admit it or not.
7) How do I keep boot time from regressing over time?
Baseline the enabled unit list and diff it per image release. Also baseline boot timing metrics and alert on regressions the same way you’d alert on latency regressions.
8) What if the slow part is before systemd?
Then systemd tuning won’t help. Use console logs and dmesg to find storage discovery timeouts, initramfs waits, driver issues, or fsck. Fix the underlying device/driver/config.
9) Is changing the default target from graphical to multi-user a real win?
On servers, yes. It removes a whole class of services you don’t need. On desktops, obviously not. “Right target for the job” is a real performance knob.
10) What’s the difference between making something fast and making it not block boot?
Fast means it completes quickly. Not blocking means it can be slow without delaying readiness. For many optional dependencies, not blocking is the better engineering outcome.
Conclusion: next steps that actually move the needle
Slow startup is rarely mysterious. It’s usually a dependency graph that turned parallel boot into serialized waiting, plus a handful of timeouts that are way too generous for modern operations.
Do this next, in order:
- Run the fast diagnosis playbook and identify the critical chain blocker.
- Remove boot blockers: optional remote mounts become automount/nofail; wait-online becomes scoped or disabled.
- Prune enabled units to match the server’s purpose. If it doesn’t serve traffic, store data, or provide security/telemetry you truly require, it doesn’t get to be on the startup list.
- Codify a baseline of enabled units and boot timing, and diff it for every image change. Regression is the default state of the universe; you need paperwork to fight physics.
If you clean the list and keep it clean, boot stops being a suspense thriller. It becomes a boring, repeatable procedure. In production, boring is a feature.