Ubuntu 24.04: Reboot required… but you can’t reboot — smart ways to plan maintenance

You patched a CVE. You ran the upgrades. Now Ubuntu politely informs you that a reboot is required, like it’s asking you to water a houseplant.
Meanwhile you’re staring at a production box that’s busy being the company’s revenue stream, identity system, or storage head.

This is where a lot of teams make one of two bad moves: reboot impulsively and take an outage, or postpone forever and call it “risk acceptance.”
The right answer is usually neither. It’s controlled deferral with a plan, a narrow blast radius, and evidence-based decisions.

What “reboot required” actually means on Ubuntu 24.04

On Ubuntu, “reboot required” is not a single condition. It’s a bundle of signals that often get collapsed into a single red badge in a dashboard.
Your job is to split it back into categories and decide what’s truly blocking and what’s manageable.

Reboot required usually means one of these

  • Kernel updated: A new kernel package is installed but you’re running the old kernel. Security fixes may not be active until reboot.
  • Core libc or loader updated: glibc, dynamic loader, or similar foundational libraries were updated. Long-running processes keep old mappings.
  • Firmware/microcode updated: Intel/AMD microcode updates may require reboot to load at early boot.
  • System libraries updated: Services may need restart; reboot is the blunt instrument that guarantees it.
  • Filesystem/storage stack updates: ZFS modules, multipath, NVMe quirks, or driver changes may need reboot to load new modules cleanly.

Ubuntu 24.04 (Noble) is modern enough that the “just restart everything” approach is both more possible and more dangerous.
More possible, because systemd and needrestart can identify affected processes. More dangerous, because workloads are more stateful,
more distributed, and more tightly coupled than they were when “maintenance window” meant “Sunday at 2 AM, nobody cares.”

If you cannot reboot right now, your goal is to answer three questions with evidence:

  1. What changed? Kernel, libc, OpenSSL, systemd, storage drivers—be specific.
  2. What is currently vulnerable or inconsistent? Running kernel version, loaded modules, long-lived daemons.
  3. What is your safest interim action? Restart select services, drain nodes, fail over, or apply live patches.

If this feels like work, yes. That’s the job. Uptime is not free; it’s a subscription you pay in planning and discipline.
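
To collect that evidence in one pass, here is a minimal sweep that reuses the commands covered in detail later in this article (it assumes needrestart is installed, which it usually is on Ubuntu server):

# 1) What changed?
cat /run/reboot-required.pkgs 2>/dev/null
# 2) What is running vs. what is installed?
uname -r
dpkg -l 'linux-image-*generic' | awk '/^ii/{print $2,$3}'
# 3) What can be fixed without a reboot?
sudo needrestart -r l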

Facts and historical context that change how you plan reboots

A few small facts are worth keeping in your head because they change the default decision from “reboot whenever” to “reboot deliberately.”
These aren’t trivia; they’re the reasons your estate behaves the way it does.

  1. The /var/run directory is now /run (a tmpfs) on modern Linux. Reboots and even some service restarts effectively “forget” runtime state.
  2. Ubuntu’s unattended-upgrades has existed for years and is good at installing security updates, but it doesn’t magically make running processes reload new libraries.
  3. Kernel live patching became mainstream in the 2010s as fleets grew too large for frequent reboots; it reduces risk for some CVEs but not all changes.
  4. systemd’s rise changed restart behavior: dependency graphs and socket activation can hide or reveal downtime depending on how services are configured.
  5. “Reboot required” is often a file check: on Ubuntu, /var/run/reboot-required (or /run/reboot-required) is created by packages. It’s a hint, not an oracle.
  6. glibc updates are infamous because long-running processes can keep old versions in memory; you can “patch” the disk and still run old code for weeks.
  7. Microcode updates became normal ops after high-impact CPU vulnerabilities; they’re not exotic anymore and should be part of routine maintenance design.
  8. Containerization did not eliminate reboots: your containers still depend on the host kernel. You can redeploy pods all day and still run an old kernel.
  9. ZFS on Linux (OpenZFS) matured dramatically over the last decade, but kernel-module coupling means kernel changes still deserve extra care on storage hosts.

Fast diagnosis playbook: find the real blocker in minutes

When the page goes out: “Server shows reboot required; can we postpone?” you don’t start with a philosophical debate.
You start with triage. The aim is to identify the bottleneck and the risk category quickly.

First: confirm what triggered the reboot-required flag

  • Is it kernel, microcode, or just service restarts?
  • Is the host part of an HA pair or a singleton?
  • Is there a known exploit in the wild for the updated component?

Second: determine whether you can safely restart services instead of rebooting

  • Restart impacted daemons in dependency order.
  • Validate health checks, latency, and error budgets.
  • Confirm no stateful workload will be disrupted (databases, storage controllers).

Third: plan the reboot route with the smallest blast radius

  • Drain/evacuate traffic, fail over, cordon nodes, or move VIPs.
  • Confirm you have console access (out-of-band).
  • Define rollback: previous kernel in GRUB, snapshot, or known-good image.

You’re not trying to be clever. You’re trying to be predictable.

Practical tasks: commands, outputs, and the decision you make

Below are field-tested tasks for Ubuntu 24.04. Each one includes (1) a command, (2) what the output means, and (3) the decision you make.
Run them as a sequence; they’re designed to progressively narrow the problem.

Task 1: Check whether Ubuntu thinks a reboot is required

cr0x@server:~$ ls -l /run/reboot-required /run/reboot-required.pkgs
-rw-r--r-- 1 root root  0 Dec 30 10:12 /run/reboot-required
-rw-r--r-- 1 root root 73 Dec 30 10:12 /run/reboot-required.pkgs

Meaning: The presence of /run/reboot-required is a package-installed hint that something needs a reboot.
The .pkgs file lists the packages that triggered it.

Decision: Don’t reboot yet. Read the package list first to classify the risk.

Task 2: Read which packages requested the reboot

cr0x@server:~$ cat /run/reboot-required.pkgs
linux-image-6.8.0-55-generic
linux-modules-6.8.0-55-generic
intel-microcode

Meaning: This is a real reboot-trigger set: new kernel + modules + microcode.
Restarting services won’t load the new kernel. Microcode generally loads at boot too.

Decision: Treat as “reboot required for full fix.” If you must defer, document the exposure window and consider live patching.

Task 3: Confirm the running kernel vs installed kernels

cr0x@server:~$ uname -r
6.8.0-49-generic
cr0x@server:~$ dpkg -l 'linux-image-*generic' | awk '/^ii/{print $2,$3}'
linux-image-6.8.0-49-generic 6.8.0-49.49
linux-image-6.8.0-55-generic 6.8.0-55.57

Meaning: You’re running 6.8.0-49 but 6.8.0-55 is installed.

Decision: A reboot (or kexec, rarely) is required to actually run the patched kernel. Plan it; don’t wish it away.

Task 4: Check uptime and reboot history (detect “forever defer” patterns)

cr0x@server:~$ uptime -p
up 97 days, 4 hours, 18 minutes
cr0x@server:~$ last reboot | head -3
reboot   system boot  6.8.0-49-gene Tue Sep 24 06:11   still running
reboot   system boot  6.8.0-41-gene Mon Aug 12 02:03 - 06:10 (42+04:07)
reboot   system boot  6.8.0-31-gene Sun Jun 30 01:58 - 02:02  (00:04)

Meaning: Long uptimes are not automatically good. They often mean you’re carrying latent risk and configuration drift.

Decision: If you see months of uptime on patch-heavy systems, schedule controlled reboots and make them routine, not a crisis.

Task 5: Identify services that should be restarted because of upgraded libraries (needrestart)

cr0x@server:~$ sudo needrestart -r l
NEEDRESTART-VER: 3.6
Services to be restarted:
  ssh.service
  systemd-journald.service
  cron.service
No containers need to be restarted.

Meaning: These daemons are using outdated libraries or binaries and should be restarted to pick up updates.
It also tells you whether containers are implicated.

Decision: If a reboot is impossible today, restart these services during a quiet period and re-check. This reduces exposure without lying to yourself about the kernel.
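
If you take the targeted-restart route, here is a minimal sketch of doing it one unit at a time with a pause and a basic check in between (the unit names are the example list above; substitute your own, and treat the is-active check as a stand-in for real health checks):

for svc in ssh.service cron.service systemd-journald.service; do
  sudo systemctl restart "$svc"
  sleep 3
  systemctl is-active --quiet "$svc" || { echo "STOP: $svc did not come back"; break; }
done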

Task 6: Verify whether a critical process is still mapped to deleted libraries

cr0x@server:~$ sudo lsof -nP 2>/dev/null | grep '(deleted)' | awk '{print $1, $2, $(NF-1), $NF}' | head
systemd 1 /usr/lib/x86_64-linux-gnu/libssl.so.3 (deleted)
nginx 2154 /usr/lib/x86_64-linux-gnu/libcrypto.so.3 (deleted)
php-fpm8 3011 /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0 (deleted)

Meaning: Processes are still using older versions of libraries that have been replaced on disk.
This is a classic “patched but still running old code” situation.

Decision: Restart the affected services (or the host) based on criticality and redundancy. If it’s systemd or core plumbing, prefer a reboot window.
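
For a per-process view of the same problem, /proc gives you a second opinion. A sketch assuming nginx is the process you care about (pidof -s returns a single PID; adapt the process name):

# libraries replaced on disk but still mapped by this process
sudo grep '(deleted)' /proc/"$(pidof -s nginx)"/maps | awk '{print $(NF-1)}' | sort -u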

Task 7: Confirm microcode package status (why the reboot flag won’t clear)

cr0x@server:~$ dpkg -l intel-microcode amd64-microcode | awk '/^ii/{print $2,$3}'
intel-microcode 3.20241112.0ubuntu0.24.04.1

Meaning: Microcode is installed; it will typically apply on boot. The reboot flag is expected.

Decision: If you operate regulated environments, microcode updates can be considered security-relevant; schedule the reboot as a compliance event.
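
If you need evidence of which microcode revision the CPU is actually running (as opposed to what is staged on disk), the kernel exposes it. A quick sketch; revisions are hex values specific to your CPU stepping:

grep -m1 microcode /proc/cpuinfo
sudo dmesg | grep -i microcode | head -n 3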

Task 8: Inspect what unattended-upgrades did (change control without the drama)

cr0x@server:~$ sudo tail -n 30 /var/log/unattended-upgrades/unattended-upgrades.log
2025-12-30 09:58:12,219 INFO Installing linux-image-6.8.0-55-generic:amd64 (6.8.0-55.57) ...
2025-12-30 10:02:41,887 INFO Installing linux-modules-6.8.0-55-generic:amd64 (6.8.0-55.57) ...
2025-12-30 10:06:09,014 INFO Installing intel-microcode:amd64 (3.20241112.0ubuntu0.24.04.1) ...
2025-12-30 10:06:10,120 INFO Packages that were upgraded: linux-image-6.8.0-55-generic linux-modules-6.8.0-55-generic intel-microcode

Meaning: The system did what you asked: it installed updates. It didn’t reboot because that’s a policy choice.

Decision: Use this log as your change record. Don’t guess what changed when you write the maintenance ticket.

Task 9: Check for pending initramfs or bootloader changes

cr0x@server:~$ sudo grep -nE 'linux-image|initramfs-tools' /var/log/dpkg.log | tail -n 5
184392:2025-12-30 10:03:12 status half-configured linux-image-6.8.0-55-generic:amd64 6.8.0-55.57
184411:2025-12-30 10:03:28 status installed linux-image-6.8.0-55-generic:amd64 6.8.0-55.57
184412:2025-12-30 10:03:28 trigproc initramfs-tools:amd64 0.142ubuntu25.3 <none>
184413:2025-12-30 10:03:28 status half-configured initramfs-tools:amd64 0.142ubuntu25.3
184420:2025-12-30 10:03:42 status installed initramfs-tools:amd64 0.142ubuntu25.3

Meaning: initramfs triggers ran; your boot artifacts were updated.
If this host boots from unusual storage or has custom initramfs hooks, this is where surprises come from.

Decision: For complex boot stacks (LUKS, ZFS root, multipath), test reboot procedure on a sibling node first.
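
It can also be worth confirming that the freshly built initramfs actually contains the modules you depend on. A sketch using lsinitramfs (from initramfs-tools); the kernel version and module names are the examples used in this article, so substitute your own:

sudo lsinitramfs /boot/initrd.img-6.8.0-55-generic | grep -E 'zfs|multipath|nvme' | head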

Task 10: Check ZFS/OpenZFS status if this is a storage host

cr0x@server:~$ sudo zpool status
  pool: tank
 state: ONLINE
status: Some supported features are not enabled on the pool.
action: Upgrade the pool to enable all features.
  scan: scrub repaired 0B in 01:12:33 with 0 errors on Sun Dec 29 03:10:01 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          sda3      ONLINE       0     0     0
          sdb3      ONLINE       0     0     0

errors: No known data errors

Meaning: Pool is healthy. The “features not enabled” message is about pool feature flags, not immediate health.

Decision: If you must reboot a storage head, only do it when the pool is clean and no resilver/scrub is mid-flight.

Task 11: Confirm whether ZFS modules match the running kernel

cr0x@server:~$ dkms status | grep -E '^zfs'
zfs/2.2.2, 6.8.0-49-generic, x86_64: installed
zfs/2.2.2, 6.8.0-55-generic, x86_64: installed

Meaning: DKMS built ZFS for both kernels, which is what you want before rebooting.
If the new kernel entry is missing or “build error,” your reboot becomes a storage incident.

Decision: If DKMS didn’t build for the new kernel, fix that before reboot. Otherwise you risk booting into a kernel without your storage module.
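
If the build for the new kernel is missing, a minimal remediation sketch (the header package name follows the standard Ubuntu pattern; verify the exact name for your kernel flavor):

sudo apt-get install --yes linux-headers-6.8.0-55-generic
sudo dkms autoinstall -k 6.8.0-55-generic
dkms status | grep 6.8.0-55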

Task 12: Validate systemd failed units before you touch anything

cr0x@server:~$ systemctl --failed
  UNIT                    LOAD   ACTIVE SUB    DESCRIPTION
● multipathd.service       loaded failed failed Device-Mapper Multipath Device Controller

1 loaded units listed.

Meaning: You already have a failed critical unit. Rebooting might “fix” it or might brick your boot path, depending on what it is.

Decision: Resolve failed units that affect storage/network before scheduling a reboot. If you’re relying on multipath, treat this as a stop sign.

Task 13: Check whether critical services are restartable without downtime

cr0x@server:~$ systemctl show nginx --property=CanReload,CanRestart,ActiveState,SubState
CanReload=yes
CanRestart=yes
ActiveState=active
SubState=running

Meaning: systemd believes nginx can reload and restart. Reload is usually lower risk than restart, but only if your config is valid.

Decision: Prefer reload when possible, validate config first, and only restart when you need to pick up binary/library changes.

Task 14: Validate configuration before restarting (avoid self-inflicted outage)

cr0x@server:~$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

Meaning: Config is syntactically valid; reload/restart is less likely to fail.

Decision: If config validation fails, fix that first. A “maintenance restart” that becomes an outage is how you lose weekends.
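
Chaining the validation and the reload keeps the window for mistakes small. A sketch assuming nginx; the same pattern works for most daemons that ship a config-test mode:

if sudo nginx -t; then
  sudo systemctl reload nginx
else
  echo "config test failed; not touching the running service"
fi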

Task 15: Check journald and kernel logs for boot-related warnings before scheduling reboot

cr0x@server:~$ sudo journalctl -p warning -b | tail -n 10
Dec 30 07:11:02 server kernel: nvme nvme0: missing or invalid SUBNQN field.
Dec 30 07:11:03 server systemd[1]: Failed to start Device-Mapper Multipath Device Controller.
Dec 30 07:11:05 server kernel: xhci_hcd 0000:00:14.0: BIOS handoff failed (BIOS bug?)

Meaning: You have warnings that might matter during reboot (NVMe quirks, multipath).
These are the clues that reboot isn’t “just a reboot.”

Decision: If boot/storage warnings exist, do a controlled reboot with console access and a rollback plan. Do not “YOLO reboot” remotely.

Task 16: Confirm what will boot by default (GRUB) and ensure fallback is available

cr0x@server:~$ grep -E '^GRUB_DEFAULT=|^GRUB_TIMEOUT=' /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT=5

Meaning: Default entry will boot the first menu item; timeout is short. That’s fine until you need manual selection.

Decision: If you’ve had kernel regressions before, consider longer timeout on critical systems, or ensure out-of-band console can interact with GRUB.
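
One useful hedge: grub-reboot selects an entry for the next boot only, so a regression does not become the permanent default. It generally requires GRUB_DEFAULT=saved in /etc/default/grub (not the GRUB_DEFAULT=0 shown above), and the menu entry title must match your system exactly; treat this as a sketch, not a recipe:

# boot the new kernel once; the following boot falls back to the saved default
sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 6.8.0-55-generic"
sudo systemctl reboot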

Smart deferral: how to be safe while you can’t reboot

Sometimes you genuinely can’t reboot: end-of-quarter, a live migration freeze, a customer demo, a database migration, or simply no redundancy.
“No reboot” should not mean “ignore.” It should mean: reduce exposure, prepare the reboot, and schedule it.

Know what you’re deferring

If the reboot-required package list contains a kernel, you’re deferring kernel security fixes. That’s a real exposure.
If it contains only userland libraries, you can often reduce risk by restarting specific services.

Live patching can help for certain kernel CVEs. But it’s not a free pass. It’s a seatbelt, not a roll cage.
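
If Canonical Livepatch is available to you (it comes with Ubuntu Pro, not with a stock install), checking what it currently covers might look like this; both commands assume the Pro client and the livepatch snap are set up:

sudo pro status | grep -i livepatch
canonical-livepatch status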

First joke (and only because we’ve all done it): A “temporary” reboot deferral is like a temporary firewall rule—eventually it becomes part of the architecture.

Service restart strategy (the sober version)

A targeted restart plan is legitimate when:

  • You have redundancy or load balancing, so restarting one node doesn’t drop traffic.
  • The updated components are userland libraries or daemons, not the kernel.
  • You can validate health after each restart with real checks (not vibes).

A targeted restart plan is dangerous when:

  • The host is a single point of failure (SPOF).
  • Storage and network stacks are involved (multipath, iSCSI, ZFS modules).
  • You don’t have a tested backout or you can’t reach console.

Risk framing that works in grown-up environments

Don’t tell stakeholders “we can’t reboot.” Tell them:

  • What’s patched on disk vs running in memory (kernel and affected processes).
  • What the known risk is (component class, exploitability, exposure).
  • What mitigation you’re applying now (service restarts, WAF rules, access restrictions).
  • When the reboot will happen (window, dependencies, fallback plan).

That’s operational honesty. And it’s how you avoid the “we thought it was fine” postmortem.

Designing maintenance windows that don’t hurt (much)

The real fix for “can’t reboot” is to stop building systems that can’t reboot.
Not because reboots are fun—they’re not—but because security and reliability require you to be able to cycle the machine.

Make reboots boring with redundancy

If you’re running a single production node because “it’s cheaper,” you’re already paying.
You’re paying in risk, in delayed patches, in fragile heroics, and in the kind of on-call anxiety that shortens careers.

  • Web tier: at least two nodes behind a load balancer, with health checks that actually fail when the app is broken.
  • Databases: replication with tested failover, or managed services if you can’t staff it.
  • Storage: dual controllers or clustered storage; if not possible, be honest that storage head reboots are outages.
  • Identity/control plane: treat as critical infrastructure; redundancy is not optional.

Design the reboot workflow, not just the window

Maintenance windows fail when they’re treated as a time slot instead of a procedure.
A good reboot procedure includes:

  • Pre-checks (pool health, replication lag, disk space, failed units).
  • Traffic draining (cordon, VIP move, LB disable, or manual failover).
  • Reboot execution with console access.
  • Post-checks (service health, version verification, logs, performance regression check).
  • Documented rollback (previous kernel, snapshot, or revert plan).

Kernel updates: treat them as routine, not emergencies

The hardest kernel reboot is the first one after you’ve deferred for months.
You’re not just applying one change; you’re leaping across a pile of changes, hoping none of them touch your hardware quirks.

The best pattern is cadence: reboot monthly (or more often if your threat model requires) and keep it consistent.
Cadence makes it boring. Boring is the goal.

Second joke (last one, promise): The only thing more expensive than a planned reboot is an unplanned reboot with an audience.

A reliability quote you should actually internalize

Gene Kim’s paraphrased idea from DevOps culture fits here: make small, frequent changes so failures are smaller and easier to recover from.
That’s not just for deployments; it’s also for kernel reboots.

Three corporate-world mini-stories (anonymized)

Mini-story 1: An incident caused by a wrong assumption

A mid-sized SaaS company ran Ubuntu fleets on auto-security updates. They’d recently moved to a new monitoring stack
that flagged /run/reboot-required as a “critical alert.” Great idea. Poor implementation.

An engineer saw a cluster of “reboot required” alerts and assumed it was mostly userland packages.
They decided to “clear the alert” by restarting the services needrestart listed, which looked safe.
The services restarted cleanly, the alert stayed, and they shrugged.

The wrong assumption was subtle: they believed the alert meant “services are stale” and that restarting a few daemons would resolve it.
But the package list included a kernel update and microcode. The running kernel was months behind.
Their threat model included internet-exposed workloads. That gap mattered.

Two weeks later, an incident response exercise turned up that production hosts were not running the patched kernel despite being “fully updated.”
There was no breach, but it became a compliance problem: “installed” didn’t mean “effective.”
The fix wasn’t technical—it was process. They updated the alert to include the package list and kernel version delta, and they added an SLA for reboot completion.

Mini-story 2: An optimization that backfired

A finance company hated reboots because their batch jobs ran long. Someone proposed a “clever” plan:
aggressively restart only affected services after patching, and reboot only once per quarter. They automated it.

For a while, it looked good. Less downtime, fewer maintenance tickets, happy stakeholders. Then the backfire.
They patched OpenSSL and restarted “all the services” using a script that iterated systemd units.
The script didn’t understand service dependencies and restarted the reverse-proxy tier before the app tier.

Result: brief but repeated partial outages—HTTP 502 spikes that didn’t trigger a full page because the load balancer still saw some healthy nodes.
Customers saw intermittent failures. Support got tickets. SREs got graphs that looked like a comb.

The postmortem was awkward because nothing “crashed.” The optimization created a new failure mode: inconsistent restarts across the stack.
They replaced the script with a controlled drain-and-restart workflow per role (proxy/app/worker), and they shortened the kernel reboot cadence.
The “quarterly reboot” idea died quietly, as it should.

Mini-story 3: A boring but correct practice that saved the day

A company running storage-heavy workloads (ZFS on Ubuntu) had a strict rule of no reboot without three green lights:
a clean zpool status, DKMS modules built for the target kernel, and tested console access.
It sounded slow. It was.

One month, unattended-upgrades installed a new kernel overnight. A planned reboot window was scheduled for the next evening.
Pre-checks showed DKMS had failed to build ZFS for the new kernel due to a missing header package (a repository mirror glitch).

If they had rebooted on schedule without looking, they would have booted into a kernel with no ZFS module.
On a storage head, that’s not “degraded.” That’s “your pool didn’t import and now everyone is learning new swear words.”

Instead, they fixed the package issue, rebuilt DKMS, verified module presence, then rebooted.
Nobody outside the infrastructure team noticed. That’s the point of boring, correct practice: nothing happens, and you go home.

Common mistakes: symptoms → root cause → fix

These are patterns that show up in real fleets. They’re not theoretical.

1) Symptom: “Reboot required” won’t go away after restarting services

Root cause: The reboot flag was triggered by a kernel or microcode update; restarting services can’t load a new kernel.

Fix: Confirm with cat /run/reboot-required.pkgs and uname -r. Schedule a reboot, or apply live patching while you plan it.

2) Symptom: Reboot clears the flag, but services behave oddly afterward

Root cause: Config drift and untested restarts. The system has been running so long that the reboot surfaces stale boot-time assumptions (network names, mounts, unit dependencies) nobody has exercised in months.

Fix: Make reboots frequent and routine. Add post-reboot validation checks and fix boot-time unit ordering, mounts, and network config.

3) Symptom: After reboot, storage is missing or pool won’t import

Root cause: Kernel-module mismatch (ZFS DKMS not built), or multipath/iSCSI services failing early in boot.

Fix: Pre-check DKMS status for the target kernel; verify systemctl --failed is clean; ensure initramfs contains necessary modules if applicable.

4) Symptom: Restarting services causes brief 502s or timeouts

Root cause: Restart order and dependency ignorance; load balancer health checks too permissive; insufficient capacity.

Fix: Drain node first; restart in role order; tighten health checks to reflect real readiness; maintain N+1 capacity.

5) Symptom: Security team says “patched,” SRE says “still vulnerable”

Root cause: Confusing “package installed” with “code running.” Long-lived processes and old kernels persist.

Fix: Report both installed and running versions. Use needrestart and kernel version deltas as compliance evidence.
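
A small evidence line you can paste into a ticket, reusing the same files and commands shown earlier in this article:

echo "running=$(uname -r) pending=$(cat /run/reboot-required.pkgs 2>/dev/null | tr '\n' ' ')"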

6) Symptom: Reboot causes prolonged downtime because the host never returns

Root cause: No out-of-band console, broken bootloader config, or remote-only access with a network stack that depends on the reboot succeeding.

Fix: Always verify console access before maintenance. Ensure fallback kernels exist. Don’t schedule risky reboots without hands on the steering wheel.

Checklists / step-by-step plan

Checklist A: When you see “reboot required” and you can’t reboot today

  1. Read /run/reboot-required.pkgs and classify: kernel/microcode vs userland.
  2. Record running kernel (uname -r) and installed target kernel (dpkg list).
  3. Run needrestart -r l; restart low-risk services if appropriate.
  4. Check for deleted library mappings (lsof | ... DEL); restart affected daemons in a controlled order.
  5. Apply mitigations if deferring kernel fixes: tighten network exposure, rate limiting, WAF rules, reduce admin access paths.
  6. Create a reboot ticket with: reason, package list, risk, proposed window, rollback, and validation steps.

Checklist B: Pre-reboot safety checks (15 minutes that save hours)

  1. Confirm console access is working (IPMI/iDRAC/virt console).
  2. systemctl --failed must be empty for storage/network critical units.
  3. Storage hosts: zpool status must be clean; no resilver/scrub in a risky phase.
  4. DKMS modules built for new kernel (dkms status includes target kernel).
  5. Disk space sanity: ensure /boot isn’t full; ensure package operations are not half-configured.
  6. Confirm rollback path: previous kernel still installed; GRUB can select it.
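
The same pre-checks as a single paste-able sketch, written for a storage-flavored host (the target kernel is the example version from this article; drop the ZFS line on hosts without it):

TARGET_KERNEL=6.8.0-55-generic          # example target; adjust
systemctl --failed --no-legend          # should print nothing
df -h /boot                             # room for kernels and initramfs images
dpkg --audit                            # half-configured packages show up here
dkms status | grep "$TARGET_KERNEL" || echo "WARN: no DKMS modules built for $TARGET_KERNEL"
sudo zpool status -x                    # prints 'all pools are healthy' when clean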

Checklist C: Reboot execution with minimal blast radius

  1. Drain traffic or fail over (LB disable, VIP move, cordon node, etc.).
  2. Stop or quiesce stateful workloads if needed (databases, storage exports).
  3. Reboot using systemd and watch console.
  4. Verify kernel version after boot, then verify core services, then restore traffic.

Checklist D: Post-reboot validation (don’t stop at “it pings”)

  1. Confirm running kernel is the expected one.
  2. Confirm critical services are active and healthy.
  3. Check logs for new warnings/errors since boot.
  4. Verify storage mounts/pools and network routes.
  5. Run a synthetic transaction (login, API call, read/write) through the real path.
  6. Close the loop: clear alert, document completion, and set next cadence.
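
A matching post-reboot sketch; the health endpoint on the last line is hypothetical, so swap in a real synthetic transaction for your service:

uname -r                                # should be the kernel you planned to boot
systemctl --failed --no-legend          # should print nothing
sudo journalctl -p err -b | tail -n 20  # errors since this boot
findmnt --verify                        # fstab vs. what is actually mounted
curl -fsS http://localhost/healthz || echo "WARN: health check failed"   # hypothetical endpoint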

FAQ

1) Is “reboot required” always about the kernel?

No. It’s often kernel or microcode, but it can also be triggered by other packages.
Check /run/reboot-required.pkgs to see what actually requested it.

2) If I restart all services, am I secure without rebooting?

You might reduce risk for userland updates, but you will not pick up kernel fixes without rebooting (or a kernel live patch mechanism for specific CVEs).
Treat “restarted services” as mitigation, not resolution.

3) What’s the safest thing to do when I cannot reboot a single critical server?

Be honest: it’s a SPOF, so any meaningful change is risky. Minimize exposure (network restrictions, access hardening),
schedule a maintenance window, and prioritize building redundancy so you can reboot without drama next time.

4) Can I just delete /run/reboot-required to clear the warning?

You can, and it will clear the file-based alert. It will not change reality.
It’s like removing the smoke alarm battery because the kitchen is loud.

5) How do I know which processes are still using old libraries?

Use needrestart for a curated list and lsof to spot deleted library mappings.
The decision is then: restart individual services now, or schedule a reboot if the scope is too wide.

6) What’s different about Ubuntu 24.04 in this context?

The fundamentals are the same, but Ubuntu 24.04 tends to be deployed with modern kernels, systemd behavior,
containers, and DKMS-based modules (like ZFS) that make kernel changes more operationally significant.

7) How often should we reboot production servers?

Often enough that it’s routine. Monthly is a common baseline for many environments; higher-risk environments go faster.
The key is consistency: predictable cadence beats sporadic panic reboots.

8) We run Kubernetes. Is rebooting a node just “cordon and drain”?

Mostly, but details matter: PodDisruptionBudgets, stateful sets, local storage, daemonsets, and networking agents can all complicate drains.
The concept stands: evacuate workloads, reboot, validate, uncordon.
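
The skeleton might look like this; the node name and timeout are placeholders, and PodDisruptionBudgets can legitimately make the drain wait or fail, which is the point:

kubectl cordon node-01
kubectl drain node-01 --ignore-daemonsets --delete-emptydir-data --timeout=10m
# ...reboot the node, verify kernel and kubelet health...
kubectl uncordon node-01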

9) What if the new kernel causes a regression?

That’s why you keep at least one known-good kernel installed and ensure you can select it via console.
Test on a canary node first. Kernel regressions are rare, but “rare” is not “never.”

10) Do storage servers need special handling for reboots?

Yes. Anything with ZFS, multipath, iSCSI, or custom initramfs hooks deserves pre-checks (pool health, module builds, failed units)
and a careful console-watched reboot.

Conclusion: next steps that actually reduce risk

“Reboot required” is not Ubuntu being annoying. It’s Ubuntu telling you the truth: some fixes don’t take effect until the running system changes.
Your job is to decide when and how, not whether reality applies to you.

Do this next:

  1. On every flagged host, record: /run/reboot-required.pkgs, uname -r, and needrestart -r l.
  2. If it’s kernel/microcode: schedule a reboot with a rollback plan and console access. If it’s userland: restart impacted services safely.
  3. Build a cadence: monthly reboots, canary first, then roll through the fleet. Make it boring.
  4. Eliminate “can’t reboot” systems by design: redundancy, failover, and procedures that turn reboots into routine operations.

You don’t win reliability by never rebooting. You win by being able to reboot whenever you need to—and proving it regularly.
