You patched a CVE. You ran the upgrades. Now Ubuntu politely informs you that a reboot is required, like it’s asking you to water a houseplant.
Meanwhile you’re staring at a production box that’s busy being the company’s revenue stream, identity system, or storage head.
This is where a lot of teams make one of two bad moves: reboot impulsively and take an outage, or postpone forever and call it “risk acceptance.”
The right answer is usually neither. It’s controlled deferral with a plan, a narrow blast radius, and evidence-based decisions.
What “reboot required” actually means on Ubuntu 24.04
On Ubuntu, “reboot required” is not a single condition. It’s a bundle of signals that often get collapsed into a single red badge in a dashboard.
Your job is to split it back into categories and decide what’s truly blocking and what’s manageable.
Reboot required usually means one of these
- Kernel updated: A new kernel package is installed but you’re running the old kernel. Security fixes may not be active until reboot.
- Core libc or loader updated: glibc, dynamic loader, or similar foundational libraries were updated. Long-running processes keep old mappings.
- Firmware/microcode updated: Intel/AMD microcode updates may require reboot to load at early boot.
- System libraries updated: Services may need restart; reboot is the blunt instrument that guarantees it.
- Filesystem/storage stack updates: ZFS modules, multipath, NVMe quirks, or driver changes may need reboot to load new modules cleanly.
Ubuntu 24.04 (Noble) is modern enough that the “just restart everything” approach is both more possible and more dangerous.
More possible, because systemd and needrestart can identify affected processes. More dangerous, because workloads are more stateful,
more distributed, and more tightly coupled than they were when “maintenance window” meant “Sunday at 2 AM, nobody cares.”
If you cannot reboot right now, your goal is to answer three questions with evidence:
- What changed? Kernel, libc, OpenSSL, systemd, storage drivers—be specific.
- What is currently vulnerable or inconsistent? Running kernel version, loaded modules, long-lived daemons.
- What is your safest interim action? Restart select services, drain nodes, fail over, or apply live patches.
If this feels like work, yes. That’s the job. Uptime is not free; it’s a subscription you pay in planning and discipline.
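A minimal sketch of gathering that evidence in one pass (assuming needrestart is installed, which it usually is on Ubuntu Server):

cat /run/reboot-required.pkgs 2>/dev/null            # what raised the flag
uname -r                                             # kernel actually running
dpkg -l 'linux-image-*' | awk '/^ii/{print $2,$3}'   # kernels installed on disk
sudo needrestart -r l                                # daemons still running old libraries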
Facts and historical context that change how you plan reboots
A few small facts are worth keeping in your head because they change the default decision from “reboot whenever” to “reboot deliberately.”
These aren’t trivia; they’re the reasons your estate behaves the way it does.
- The /var/run directory is now /run (a tmpfs) on modern Linux. Reboots and even some service restarts effectively “forget” runtime state.
- Ubuntu’s unattended-upgrades has existed for years and is good at installing security updates, but it doesn’t magically make running processes reload new libraries.
- Kernel live patching became mainstream in the 2010s as fleets grew too large for frequent reboots; it reduces risk for some CVEs but not all changes.
- systemd’s rise changed restart behavior: dependency graphs and socket activation can hide or reveal downtime depending on how services are configured.
- “Reboot required” is often a file check: on Ubuntu, /var/run/reboot-required (or /run/reboot-required) is created by packages. It’s a hint, not an oracle.
- glibc updates are infamous because long-running processes can keep old versions in memory; you can “patch” the disk and still run old code for weeks.
- Microcode updates became normal ops after high-impact CPU vulnerabilities; they’re not exotic anymore and should be part of routine maintenance design.
- Containerization did not eliminate reboots: your containers still depend on the host kernel. You can redeploy pods all day and still run an old kernel.
- ZFS on Linux (OpenZFS) matured dramatically over the last decade, but kernel-module coupling means kernel changes still deserve extra care on storage hosts.
Fast diagnosis playbook: find the real blocker in minutes
When the page goes out: “Server shows reboot required; can we postpone?” you don’t start with a philosophical debate.
You start with triage. The aim is to identify the bottleneck and the risk category quickly.
First: confirm what triggered the reboot-required flag
- Is it kernel, microcode, or just service restarts?
- Is the host part of an HA pair or a singleton?
- Is there a known exploit in the wild for the updated component?
Second: determine whether you can safely restart services instead of rebooting
- Restart impacted daemons in dependency order.
- Validate health checks, latency, and error budgets.
- Confirm no stateful workload will be disrupted (databases, storage controllers).
Third: plan the reboot route with the smallest blast radius
- Drain/evacuate traffic, fail over, cordon nodes, or move VIPs.
- Confirm you have console access (out-of-band).
- Define rollback: previous kernel in GRUB, snapshot, or known-good image.
You’re not trying to be clever. You’re trying to be predictable.
Practical tasks: commands, outputs, and the decision you make
Below are field-tested tasks for Ubuntu 24.04. Each one includes (1) a command, (2) what the output means, and (3) the decision you make.
Run them as a sequence; they’re designed to progressively narrow the problem.
Task 1: Check whether Ubuntu thinks a reboot is required
cr0x@server:~$ ls -l /run/reboot-required /run/reboot-required.pkgs
-rw-r--r-- 1 root root 0 Dec 30 10:12 /run/reboot-required
-rw-r--r-- 1 root root 73 Dec 30 10:12 /run/reboot-required.pkgs
Meaning: The presence of /run/reboot-required is a package-installed hint that something needs a reboot.
The .pkgs file lists the packages that triggered it.
Decision: Don’t reboot yet. Read the package list first to classify the risk.
Task 2: Read which packages requested the reboot
cr0x@server:~$ cat /run/reboot-required.pkgs
linux-image-6.8.0-55-generic
linux-modules-6.8.0-55-generic
intel-microcode
Meaning: This is a real reboot-trigger set: new kernel + modules + microcode.
Restarting services won’t load the new kernel. Microcode generally loads at boot too.
Decision: Treat as “reboot required for full fix.” If you must defer, document the exposure window and consider live patching.
Task 3: Confirm the running kernel vs installed kernels
cr0x@server:~$ uname -r
6.8.0-49-generic
cr0x@server:~$ dpkg -l 'linux-image-*generic' | awk '/^ii/{print $2,$3}'
linux-image-6.8.0-49-generic 6.8.0-49.49
linux-image-6.8.0-55-generic 6.8.0-55.57
Meaning: You’re running 6.8.0-49 but 6.8.0-55 is installed.
Decision: A reboot (or kexec, rarely) is required to actually run the patched kernel. Plan it; don’t wish it away.
Task 4: Check uptime and reboot history (detect “forever defer” patterns)
cr0x@server:~$ uptime -p
up 97 days, 4 hours, 18 minutes
cr0x@server:~$ last reboot | head -3
reboot system boot 6.8.0-49-gene Tue Sep 24 06:11 still running
reboot system boot 6.8.0-41-gene Mon Aug 12 02:03 - 06:10 (42+04:07)
reboot system boot 6.8.0-31-gene Sun Jun 30 01:58 - 02:02 (00:04)
Meaning: Long uptimes are not automatically good. They often mean you’re carrying latent risk and configuration drift.
Decision: If you see months of uptime on patch-heavy systems, schedule controlled reboots and make them routine, not a crisis.
Task 5: Identify services that should be restarted because of upgraded libraries (needrestart)
cr0x@server:~$ sudo needrestart -r l
NEEDRESTART-VER: 3.6
Services to be restarted:
ssh.service
systemd-journald.service
cron.service
No containers need to be restarted.
Meaning: These daemons are using outdated libraries or binaries and should be restarted to pick up updates.
It also tells you whether containers are implicated.
Decision: If a reboot is impossible today, restart these services during a quiet period and re-check. This reduces exposure without lying to yourself about the kernel.
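If you take the targeted-restart route, a minimal per-service sequence looks like this; cron.service is just the example from the list above, and you repeat it for each listed unit, least critical first:

sudo systemctl restart cron.service     # pick up the new libraries
systemctl is-active cron.service        # confirm it came back
sudo needrestart -r l                   # re-check; the list should shrink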
Task 6: Verify whether a critical process is still mapped to deleted libraries
cr0x@server:~$ sudo lsof | awk '/DEL/{print $1,$2,$9}' | head
systemd 1 /usr/lib/x86_64-linux-gnu/libssl.so.3 (deleted)
nginx 2154 /usr/lib/x86_64-linux-gnu/libcrypto.so.3 (deleted)
php-fpm8 3011 /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0 (deleted)
Meaning: Processes are still using older versions of libraries that have been replaced on disk.
This is a classic “patched but still running old code” situation.
Decision: Restart the affected services (or the host) based on criticality and redundancy. If it’s systemd or core plumbing, prefer a reboot window.
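To verify a specific process after you restart it, you can read its maps directly. A sketch: PID 2154 is the nginx process from the example output above, and the second line re-resolves the new main PID after a restart.

sudo grep -c '(deleted)' /proc/2154/maps                                           # deleted mappings this PID still holds
sudo grep -c '(deleted)' /proc/"$(systemctl show -p MainPID --value nginx)"/maps   # expect 0 after the restart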
Task 7: Confirm microcode package status (why the reboot flag won’t clear)
cr0x@server:~$ dpkg -l intel-microcode amd64-microcode | awk '/^ii/{print $2,$3}'
intel-microcode 3.20241112.0ubuntu0.24.04.1
Meaning: Microcode is installed; it will typically apply on boot. The reboot flag is expected.
Decision: If you operate regulated environments, microcode updates can be considered security-relevant; schedule the reboot as a compliance event.
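If you want evidence of what the CPU is actually running (as opposed to what is installed on disk), two quick checks:

grep -m1 microcode /proc/cpuinfo            # microcode revision currently loaded on the CPU
sudo journalctl -k -b | grep -i microcode   # what, if anything, the kernel loaded at this boot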
Task 8: Inspect what unattended-upgrades did (change control without the drama)
cr0x@server:~$ sudo tail -n 30 /var/log/unattended-upgrades/unattended-upgrades.log
2025-12-30 09:58:12,219 INFO Installing linux-image-6.8.0-55-generic:amd64 (6.8.0-55.57) ...
2025-12-30 10:02:41,887 INFO Installing linux-modules-6.8.0-55-generic:amd64 (6.8.0-55.57) ...
2025-12-30 10:06:09,014 INFO Installing intel-microcode:amd64 (3.20241112.0ubuntu0.24.04.1) ...
2025-12-30 10:06:10,120 INFO Packages that were upgraded: linux-image-6.8.0-55-generic linux-modules-6.8.0-55-generic intel-microcode
Meaning: The system did what you asked: it installed updates. It didn’t reboot because that’s a policy choice.
Decision: Use this log as your change record. Don’t guess what changed when you write the maintenance ticket.
Task 9: Check for pending initramfs or bootloader changes
cr0x@server:~$ sudo grep -E "initramfs|grub|linux-image" -n /var/log/dpkg.log | tail -n 5
184392:2025-12-30 10:03:12 status half-configured linux-image-6.8.0-55-generic:amd64 6.8.0-55.57
184411:2025-12-30 10:03:28 status installed linux-image-6.8.0-55-generic:amd64 6.8.0-55.57
184412:2025-12-30 10:03:28 trigproc initramfs-tools:amd64 0.142ubuntu25.3 <none>
184413:2025-12-30 10:03:28 status half-configured initramfs-tools:amd64 0.142ubuntu25.3
184420:2025-12-30 10:03:42 status installed initramfs-tools:amd64 0.142ubuntu25.3
Meaning: initramfs triggers ran; your boot artifacts were updated.
If this host boots from unusual storage or has custom initramfs hooks, this is where surprises come from.
Decision: For complex boot stacks (LUKS, ZFS root, multipath), test reboot procedure on a sibling node first.
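One cheap pre-check on such hosts: confirm the initramfs built for the target kernel actually contains the modules you boot from. A sketch, assuming initramfs-tools (the 24.04 default); adjust the module names to your own stack:

lsinitramfs /boot/initrd.img-6.8.0-55-generic | grep -E 'zfs|multipath|crypt' | head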
Task 10: Check ZFS/OpenZFS status if this is a storage host
cr0x@server:~$ sudo zpool status
pool: tank
state: ONLINE
status: Some supported features are not enabled on the pool.
action: Upgrade the pool to enable all features.
scan: scrub repaired 0B in 01:12:33 with 0 errors on Sun Dec 29 03:10:01 2025
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
sda3 ONLINE 0 0 0
sdb3 ONLINE 0 0 0
errors: No known data errors
Meaning: Pool is healthy. The “features not enabled” message is about pool feature flags, not immediate health.
Decision: If you must reboot a storage head, only do it when the pool is clean and no resilver/scrub is mid-flight.
Task 11: Confirm whether ZFS modules match the running kernel
cr0x@server:~$ dkms status | grep -E '^zfs'
zfs/2.2.2, 6.8.0-49-generic, x86_64: installed
zfs/2.2.2, 6.8.0-55-generic, x86_64: installed
Meaning: DKMS built ZFS for both kernels, which is what you want before rebooting.
If the new kernel entry is missing or “build error,” your reboot becomes a storage incident.
Decision: If DKMS didn’t build for the new kernel, fix that before reboot. Otherwise you risk booting into a kernel without your storage module.
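The usual fix is headers plus a rebuild. A sketch for the target kernel from this example; package names assume the stock generic flavour:

sudo apt-get install --reinstall linux-headers-6.8.0-55-generic   # DKMS needs headers to build
sudo dkms autoinstall -k 6.8.0-55-generic                         # rebuild all registered modules for that kernel
dkms status | grep 6.8.0-55                                       # should now show "installed"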
Task 12: Validate systemd failed units before you touch anything
cr0x@server:~$ systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● multipathd.service loaded failed failed Device-Mapper Multipath Device Controller
1 loaded units listed.
Meaning: You already have a failed critical unit. Rebooting might “fix” it or might brick your boot path, depending on what it is.
Decision: Resolve failed units that affect storage/network before scheduling a reboot. If you’re relying on multipath, treat this as a stop sign.
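Before deciding anything, pull the unit's own story:

systemctl status multipathd.service --no-pager                      # current state plus recent log lines
sudo journalctl -u multipathd.service -b --no-pager | tail -n 20    # why it failed this boot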
Task 13: Check whether critical services are restartable without downtime
cr0x@server:~$ systemctl show nginx --property=CanReload,CanRestart,ActiveState,SubState
CanReload=yes
CanRestart=yes
ActiveState=active
SubState=running
Meaning: systemd believes nginx can reload and restart. Reload is usually lower risk than restart, but only if your config is valid.
Decision: Prefer reload when possible, validate config first, and only restart when you need to pick up binary/library changes.
Task 14: Validate configuration before restarting (avoid self-inflicted outage)
cr0x@server:~$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
Meaning: Config is syntactically valid; reload/restart is less likely to fail.
Decision: If config validation fails, fix that first. A “maintenance restart” that becomes an outage is how you lose weekends.
Task 15: Check journald and kernel logs for boot-related warnings before scheduling reboot
cr0x@server:~$ sudo journalctl -p warning -b | tail -n 10
Dec 30 07:11:02 server kernel: nvme nvme0: missing or invalid SUBNQN field.
Dec 30 07:11:03 server systemd[1]: Failed to start Device-Mapper Multipath Device Controller.
Dec 30 07:11:05 server kernel: xhci_hcd 0000:00:14.0: BIOS handoff failed (BIOS bug?)
Meaning: You have warnings that might matter during reboot (NVMe quirks, multipath).
These are the clues that reboot isn’t “just a reboot.”
Decision: If boot/storage warnings exist, do a controlled reboot with console access and a rollback plan. Do not “YOLO reboot” remotely.
Task 16: Confirm what will boot by default (GRUB) and ensure fallback is available
cr0x@server:~$ grep -E '^GRUB_DEFAULT=|^GRUB_TIMEOUT=' /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
Meaning: Default entry will boot the first menu item; timeout is short. That’s fine until you need manual selection.
Decision: If you’ve had kernel regressions before, consider longer timeout on critical systems, or ensure out-of-band console can interact with GRUB.
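If you decide a longer timeout is worth it, the change is small. A sketch; review your own /etc/default/grub before applying it blindly:

sudo sed -i 's/^GRUB_TIMEOUT=.*/GRUB_TIMEOUT=30/' /etc/default/grub   # give yourself time at the console
# If GRUB_TIMEOUT_STYLE=hidden is set, switch it to menu as well, or the menu won't appear
sudo update-grub                                                      # regenerate grub.cfg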
Smart deferral: how to be safe while you can’t reboot
Sometimes you genuinely can’t reboot: end-of-quarter, a live migration freeze, a customer demo, a database migration, or simply no redundancy.
“No reboot” should not mean “ignore.” It should mean: reduce exposure, prepare the reboot, and schedule it.
Know what you’re deferring
If the reboot-required package list contains a kernel, you’re deferring kernel security fixes. That’s a real exposure.
If it contains only userland libraries, you can often reduce risk by restarting specific services.
Live patching can help for certain kernel CVEs. But it’s not a free pass. It’s a seatbelt, not a roll cage.
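On Ubuntu, kernel live patching comes via Ubuntu Pro. A sketch of checking and enabling it, assuming the machine is (or can be) attached to a Pro subscription:

sudo pro status                        # attached? livepatch entitled?
sudo pro enable livepatch              # enables the canonical-livepatch client
canonical-livepatch status --verbose   # which CVEs are currently patched in the running kernel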
First joke (and only because we’ve all done it): A “temporary” reboot deferral is like a temporary firewall rule—eventually it becomes part of the architecture.
Service restart strategy (the sober version)
A targeted restart plan is legitimate when:
- You have redundancy or load balancing, so restarting one node doesn’t drop traffic.
- The updated components are userland libraries or daemons, not the kernel.
- You can validate health after each restart with real checks (not vibes).
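“Real checks” can be as small as a loud curl against the actual service path; the /healthz path and port here are placeholders for whatever your application really exposes:

curl -fsS --max-time 5 -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' http://localhost:8080/healthz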
A targeted restart plan is dangerous when:
- The host is a single point of failure (SPOF).
- Storage and network stacks are involved (multipath, iSCSI, ZFS modules).
- You don’t have a tested backout or you can’t reach console.
Risk framing that works in grown-up environments
Don’t tell stakeholders “we can’t reboot.” Tell them:
- What’s patched on disk vs running in memory (kernel and affected processes).
- What the known risk is (component class, exploitability, exposure).
- What mitigation you’re applying now (service restarts, WAF rules, access restrictions).
- When the reboot will happen (window, dependencies, fallback plan).
That’s operational honesty. And it’s how you avoid the “we thought it was fine” postmortem.
Designing maintenance windows that don’t hurt (much)
The real fix for “can’t reboot” is to stop building systems that can’t reboot.
Not because reboots are fun—they’re not—but because security and reliability require you to be able to cycle the machine.
Make reboots boring with redundancy
If you’re running a single production node because “it’s cheaper,” you’re already paying.
You’re paying in risk, in delayed patches, in fragile heroics, and in the kind of on-call anxiety that shortens careers.
- Web tier: at least two nodes behind a load balancer, with health checks that actually fail when the app is broken.
- Databases: replication with tested failover, or managed services if you can’t staff it.
- Storage: dual controllers or clustered storage; if not possible, be honest that storage head reboots are outages.
- Identity/control plane: treat as critical infrastructure; redundancy is not optional.
Design the reboot workflow, not just the window
Maintenance windows fail when they’re treated as a time slot instead of a procedure.
A good reboot procedure includes:
- Pre-checks (pool health, replication lag, disk space, failed units).
- Traffic draining (cordon, VIP move, LB disable, or manual failover).
- Reboot execution with console access.
- Post-checks (service health, version verification, logs, performance regression check).
- Documented rollback (previous kernel, snapshot, or revert plan).
Kernel updates: treat them as routine, not emergencies
The hardest kernel reboot is the first one after you’ve deferred for months.
You’re not just applying one change; you’re leaping across a pile of changes, hoping none of them touch your hardware quirks.
The best pattern is cadence: reboot monthly (or more often if your threat model requires) and keep it consistent.
Cadence makes it boring. Boring is the goal.
Second joke (last one, promise): The only thing more expensive than a planned reboot is an unplanned reboot with an audience.
A reliability quote you should actually internalize
Gene Kim’s paraphrased idea from DevOps culture fits here: make small, frequent changes so failures are smaller and easier to recover from.
That’s not just for deployments; it’s also for kernel reboots.
Three corporate-world mini-stories (anonymized)
Mini-story 1: An incident caused by a wrong assumption
A mid-sized SaaS company ran Ubuntu fleets on auto-security updates. They’d recently moved to a new monitoring stack
that flagged /run/reboot-required as a “critical alert.” Great idea. Poor implementation.
An engineer saw a cluster of “reboot required” alerts and assumed it was mostly userland packages.
They decided to “clear the alert” by restarting the services needrestart listed, which looked safe.
The services restarted cleanly, the alert stayed, and they shrugged.
The wrong assumption was subtle: they believed the alert meant “services are stale” and that restarting a few daemons would resolve it.
But the package list included a kernel update and microcode. The running kernel was months behind.
Their threat model included internet-exposed workloads. That gap mattered.
Two weeks later, an incident response exercise revealed that production hosts were not running the patched kernel despite being “fully updated.”
There was no breach, but it became a compliance problem: “installed” didn’t mean “effective.”
The fix wasn’t technical—it was process. They updated the alert to include the package list and kernel version delta, and they added an SLA for reboot completion.
Mini-story 2: An optimization that backfired
A finance company hated reboots because their batch jobs ran long. Someone proposed a “clever” plan:
aggressively restart only affected services after patching, and reboot only once per quarter. They automated it.
For a while, it looked good. Less downtime, fewer maintenance tickets, happy stakeholders. Then the backfire.
They patched OpenSSL and restarted “all the services” using a script that iterated systemd units.
The script didn’t understand service dependencies and restarted the reverse-proxy tier before the app tier.
Result: brief but repeated partial outages—HTTP 502 spikes that didn’t trigger a full page because the load balancer still saw some healthy nodes.
Customers saw intermittent failures. Support got tickets. SREs got graphs that looked like a comb.
The postmortem was awkward because nothing “crashed.” The optimization created a new failure mode: inconsistent restarts across the stack.
They replaced the script with a controlled drain-and-restart workflow per role (proxy/app/worker), and they shortened the kernel reboot cadence.
The “quarterly reboot” idea died quietly, as it should.
Mini-story 3: A boring but correct practice that saved the day
A company running storage-heavy workloads (ZFS on Ubuntu) had a strict rule: no reboot without three green lights:
clean zpool status, DKMS modules built for the target kernel, and console access tested.
It sounded slow. It was.
One month, unattended-upgrades installed a new kernel overnight. A planned reboot window was scheduled for the next evening.
Pre-checks showed DKMS had failed to build ZFS for the new kernel due to a missing header package (a repository mirror glitch).
If they had rebooted on schedule without looking, they would have booted into a kernel with no ZFS module.
On a storage head, that’s not “degraded.” That’s “your pool didn’t import and now everyone is learning new swear words.”
Instead, they fixed the package issue, rebuilt DKMS, verified module presence, then rebooted.
Nobody outside the infrastructure team noticed. That’s the point of boring, correct practice: nothing happens, and you go home.
Common mistakes: symptoms → root cause → fix
These are patterns that show up in real fleets. They’re not theoretical.
1) Symptom: “Reboot required” won’t go away after restarting services
Root cause: The reboot flag was triggered by a kernel or microcode update; restarting services can’t load a new kernel.
Fix: Confirm with cat /run/reboot-required.pkgs and uname -r. Schedule a reboot, or apply live patching while you plan it.
2) Symptom: Reboot clears the flag, but services behave oddly afterward
Root cause: Config drift and untested restarts. The system has been running so long that the reboot activates old boot-time assumptions (network names, mounts, dependencies).
Fix: Make reboots frequent and routine. Add post-reboot validation checks and fix boot-time unit ordering, mounts, and network config.
3) Symptom: After reboot, storage is missing or pool won’t import
Root cause: Kernel-module mismatch (ZFS DKMS not built), or multipath/iSCSI services failing early in boot.
Fix: Pre-check DKMS status for the target kernel; verify systemctl --failed is clean; ensure initramfs contains necessary modules if applicable.
4) Symptom: Restarting services causes brief 502s or timeouts
Root cause: Restart order and dependency ignorance; load balancer health checks too permissive; insufficient capacity.
Fix: Drain node first; restart in role order; tighten health checks to reflect real readiness; maintain N+1 capacity.
5) Symptom: Security team says “patched,” SRE says “still vulnerable”
Root cause: Confusing “package installed” with “code running.” Long-lived processes and old kernels persist.
Fix: Report both installed and running versions. Use needrestart and kernel version deltas as compliance evidence.
6) Symptom: Reboot causes prolonged downtime because the host never returns
Root cause: No out-of-band console, broken bootloader config, or remote-only access with a network stack that depends on the reboot succeeding.
Fix: Always verify console access before maintenance. Ensure fallback kernels exist. Don’t schedule risky reboots without hands on the steering wheel.
Checklists / step-by-step plan
Checklist A: When you see “reboot required” and you can’t reboot today
- Read /run/reboot-required.pkgs and classify: kernel/microcode vs userland.
- Record running kernel (uname -r) and installed target kernel (dpkg list).
- Run needrestart -r l; restart low-risk services if appropriate.
- Check for deleted library mappings (lsof | ... DEL); restart affected daemons in a controlled order.
- Apply mitigations if deferring kernel fixes: tighten network exposure, rate limiting, WAF rules, reduce admin access paths.
- Create a reboot ticket with: reason, package list, risk, proposed window, rollback, and validation steps.
Checklist B: Pre-reboot safety checks (15 minutes that save hours)
- Confirm console access is working (IPMI/iDRAC/virt console).
- systemctl --failed must be empty for storage/network critical units.
- Storage hosts: zpool status must be clean; no resilver/scrub in a risky phase.
- DKMS modules built for new kernel (dkms status includes target kernel).
- Disk space sanity: ensure /boot isn't full; ensure package operations are not half-configured.
- Confirm rollback path: previous kernel still installed; GRUB can select it.
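Checklist B is easy to turn into a gate script. A minimal sketch; the kernel version and the ZFS check are this article's examples, not universal defaults:

#!/usr/bin/env bash
# Pre-reboot gate (sketch): exit non-zero if any check fails.
set -eu
target_kernel="6.8.0-55-generic"   # the kernel you expect to boot into (example value)

[ "$(systemctl list-units --state=failed --no-legend | wc -l)" -eq 0 ] || { echo "FAIL: failed units"; exit 1; }
dkms status | grep -q "$target_kernel" || { echo "FAIL: DKMS not built for $target_kernel"; exit 1; }   # skip on hosts without DKMS modules
[ "$(df --output=pcent /boot | tail -1 | tr -dc '0-9')" -lt 90 ] || { echo "FAIL: /boot nearly full"; exit 1; }
zpool status -x | grep -q 'all pools are healthy' || { echo "FAIL: pool not healthy"; exit 1; }          # storage hosts only
echo "Pre-checks passed"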
Checklist C: Reboot execution with minimal blast radius
- Drain traffic or fail over (LB disable, VIP move, cordon node, etc.).
- Stop or quiesce stateful workloads if needed (databases, storage exports).
- Reboot using systemctl reboot and watch the console.
- Verify kernel version after boot, then verify core services, then restore traffic.
Checklist D: Post-reboot validation (don’t stop at “it pings”)
- Confirm running kernel is the expected one.
- Confirm critical services are active and healthy.
- Check logs for new warnings/errors since boot.
- Verify storage mounts/pools and network routes.
- Run a synthetic transaction (login, API call, read/write) through the real path.
- Close the loop: clear alert, document completion, and set next cadence.
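A minimal post-reboot pass that goes beyond “it pings”; the expected kernel and the /healthz endpoint are this article's examples:

uname -r                                             # should be the target kernel now
systemctl list-units --state=failed --no-legend      # should print nothing
sudo journalctl -p err -b --no-pager | tail -n 20    # anything new and ugly since boot?
sudo zpool status -x                                 # storage hosts: expect "all pools are healthy"
curl -fsS -o /dev/null http://localhost:8080/healthz && echo "synthetic check OK"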
FAQ
1) Is “reboot required” always about the kernel?
No. It’s often kernel or microcode, but it can also be triggered by other packages.
Check /run/reboot-required.pkgs to see what actually requested it.
2) If I restart all services, am I secure without rebooting?
You might reduce risk for userland updates, but you will not pick up kernel fixes without rebooting (or a kernel live patch mechanism for specific CVEs).
Treat “restarted services” as mitigation, not resolution.
3) What’s the safest thing to do when I cannot reboot a single critical server?
Be honest: it’s a SPOF, so any meaningful change is risky. Minimize exposure (network restrictions, access hardening),
schedule a maintenance window, and prioritize building redundancy so you can reboot without drama next time.
4) Can I just delete /run/reboot-required to clear the warning?
You can, and it will clear the file-based alert. It will not change reality.
It’s like removing the smoke alarm battery because the kitchen is loud.
5) How do I know which processes are still using old libraries?
Use needrestart for a curated list and lsof to spot deleted library mappings.
The decision is then: restart individual services now, or schedule a reboot if the scope is too wide.
6) What’s different about Ubuntu 24.04 in this context?
The fundamentals are the same, but Ubuntu 24.04 tends to be deployed with modern kernels, systemd behavior,
containers, and DKMS-based modules (like ZFS) that make kernel changes more operationally significant.
7) How often should we reboot production servers?
Often enough that it’s routine. Monthly is a common baseline for many environments; higher-risk environments go faster.
The key is consistency: predictable cadence beats sporadic panic reboots.
8) We run Kubernetes. Is rebooting a node just “cordon and drain”?
Mostly, but details matter: PodDisruptionBudgets, StatefulSets, local storage, DaemonSets, and networking agents can all complicate drains.
The concept stands: evacuate workloads, reboot, validate, uncordon.
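For completeness, the standard sequence looks roughly like this; the node name is a placeholder and the flags shown are the common ones, not a full eviction policy:

kubectl cordon node-01                                              # stop new pods landing here
kubectl drain node-01 --ignore-daemonsets --delete-emptydir-data   # evict what can be evicted
# reboot the node, verify kernel and kubelet health, then:
kubectl uncordon node-01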
9) What if the new kernel causes a regression?
That’s why you keep at least one known-good kernel installed and ensure you can select it via console.
Test on a canary node first. Kernel regressions are rare, but “rare” is not “never.”
10) Do storage servers need special handling for reboots?
Yes. Anything with ZFS, multipath, iSCSI, or custom initramfs hooks deserves pre-checks (pool health, module builds, failed units)
and a careful console-watched reboot.
Conclusion: next steps that actually reduce risk
“Reboot required” is not Ubuntu being annoying. It’s Ubuntu telling you the truth: some fixes don’t take effect until the running system changes.
Your job is to decide when and how, not whether reality applies to you.
Do this next:
- On every flagged host, record: /run/reboot-required.pkgs, uname -r, and needrestart -r l.
- If it's kernel/microcode: schedule a reboot with a rollback plan and console access. If it's userland: restart impacted services safely.
- Build a cadence: monthly reboots, canary first, then roll through the fleet. Make it boring.
- Eliminate “can’t reboot” systems by design: redundancy, failover, and procedures that turn reboots into routine operations.
You don’t win reliability by never rebooting. You win by being able to reboot whenever you need to—and proving it regularly.