Production servers don’t fail because you forgot a clever trick. They fail because you did something “small” at install time—picked the wrong kernel, skipped one repo, left updates to vibes—and six months later you’re debugging at 2 a.m. with a flashlight made of Slack messages.
This is the boring, correct Oracle Linux 9 baseline: UEK where it fits, Ksplice where it pays, and a set of checks that keep your fleet consistent. You’ll come out with a server you can patch, audit, and troubleshoot without needing to re-learn your own decisions.
The mental model: OL9, UEK, RHCK, and Ksplice
Oracle Linux 9 is a RHEL-compatible distribution with two kernel tracks you can run:
- RHCK (Red Hat Compatible Kernel): built to match the RHEL kernel, including the kernel ABI that third-party modules are tested against.
- UEK (Unbreakable Enterprise Kernel): Oracle’s kernel, generally newer, with features and performance work targeted at enterprise workloads.
Neither is “always correct.” UEK is often a strong default for OL deployments, especially where Oracle expects it (databases, heavy storage/network I/O, certain drivers). RHCK can be the conservative choice when you need maximum compatibility with third-party kernel modules validated for RHEL’s kernel line, or when your org’s compliance tooling assumes RHEL kernel versions.
Ksplice is live patching: applying certain kernel updates without rebooting. In the real world, it doesn’t remove the need for reboots forever. It changes the reboot schedule from “every time there’s a kernel CVE” to “when you plan it.” That’s a huge difference if you’ve ever tried to coordinate reboots across stateful clusters, finicky middleware, or executive patience.
One quote worth pinning above your rack: “Hope is not a strategy.”
— General Gordon R. Sullivan. Not an SRE quote, but it nails ops planning.
Here’s the baseline attitude: pick a kernel track intentionally, verify you booted what you think you booted, make updates predictable, and keep drift low. Your future incidents will still happen, but they’ll be about real bugs—not self-inflicted ambiguity.
Joke #1: If you don’t standardize your baseline, every server becomes a unique snowflake. And snowflakes melt under pressure.
Interesting facts and context (because history leaks into your outage)
- Oracle Linux started as a rebuild of RHEL sources; the “compatible but with options” DNA is why RHCK exists alongside UEK.
- Ksplice was originally a startup (late 2000s) focused on rebootless kernel updates; Oracle acquired it and made it part of the enterprise story.
- UEK has often moved faster than the RHEL kernel stream, which can matter for hardware enablement on newer platforms.
- RHEL-style kernel ABI stability is a design goal; third-party vendors frequently test against that expectation, which is why RHCK remains relevant.
- Live patching is not “every patch”: some kernel changes are too invasive; you still need maintenance windows for certain classes of updates.
- Secure Boot realities: it can be enabled and still not protect you if your operational process allows unsigned modules or weak boot-chain controls.
- DNF modularity and repo sprawl became a practical ops issue in the RHEL 8+ era; repo hygiene is now part of “install.”
- Systemd’s dominance (since mid-2010s across major distros) means service behavior, boot logs, and dependency failures are mostly a systemd problem now, not “init scripts.”
A clean server baseline that survives real operations
Baseline goals (the ones that matter at 3 a.m.)
- Deterministic boot state: you always know which kernel you’re running and why.
- Predictable updates: security fixes flow; breaking changes don’t sneak in unnoticed.
- Low drift: hosts look similar enough that one runbook works.
- Observable: logs, time sync, and resource signals are trustworthy.
- Recoverable: you can roll back or at least stop digging deeper when something smells off.
Baseline components
For most environments, I want these set early:
- UEK or RHCK chosen intentionally, with the other removed or at least not default.
- Ksplice enabled and monitored if you have the subscription/entitlement and operational need.
- Repo policy: only required repos enabled; no “testing” repos in production.
- Automatic security updates (or at least automatic notifications), with a human-owned maintenance window for reboots.
- Time sync via chrony; time drift ruins incident timelines and authentication.
- Firewall policy that matches actual service exposure, not “open and pray.”
- SELinux enforcing unless you can defend the exception with evidence.
- Crash kernel (kdump) enabled for the systems where kernel failures would be expensive to root-cause.
- Storage sanity checks: correct scheduler, correct mount options, no “it seemed faster” tuning without measurement.
Opinion: if you’re not ready to run SELinux enforcing, you’re not ready to run internet-facing services. Internal-only is not a force field; it’s a rumor.
Hands-on tasks: commands, expected signals, and decisions
These are real checks you can paste into a terminal. Each task includes what the output means and what decision you should make based on it. Do them in order on a fresh OL9 install, then bake the results into automation.
Task 1: Confirm OS release and baseline identity
cr0x@server:~$ cat /etc/os-release
NAME="Oracle Linux Server"
VERSION="9.4"
ID="ol"
ID_LIKE="fedora"
VERSION_ID="9.4"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Oracle Linux Server 9.4"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:9:4:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://github.com/oracle/oracle-linux"
ORACLE_BUGZILLA_PRODUCT="Oracle Linux 9"
ORACLE_BUGZILLA_PRODUCT_VERSION=9.4
Meaning: Confirms you’re actually on Oracle Linux 9.x, not a lookalike template. The PLATFORM_ID and CPE_NAME show the EL9 family.
Decision: Record version in CMDB/asset inventory; align repo targets and patch policy to the same minor line across the fleet where possible.
Task 2: Check what kernel you’re currently running
cr0x@server:~$ uname -r
5.15.0-201.135.6.el9uek.x86_64
Meaning: el9uek in the release string indicates UEK. An RHCK release string has no uek and ends in something like .el9_4.x86_64 (for example 5.14.0-427.13.1.el9_4.x86_64).
Decision: If this doesn’t match your standard, stop and fix it now. The “we’ll clean it up later” plan is how fleets drift.
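If you want that check in automation, a small shell function is enough. This is a minimal sketch: kernel_track is a made-up helper name, and the patterns assume the el9uek / el9 naming shown in this article rather than any official naming contract.

```shell
# Classify a kernel release string (as printed by `uname -r`) by track.
# kernel_track is a hypothetical helper, not part of any Oracle tool.
kernel_track() {
    case "$1" in
        *.el9uek.*) echo "UEK" ;;      # e.g. 5.15.0-201.135.6.el9uek.x86_64
        *.el9*)     echo "RHCK" ;;     # e.g. 5.14.0-427.13.1.el9_4.x86_64
        *)          echo "unknown" ;;  # not an OL9 kernel name we recognize
    esac
}

# Usage on a real host:
#   kernel_track "$(uname -r)"
```

The UEK pattern comes first on purpose: an el9uek string would also match the looser el9 pattern.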
Task 3: Confirm which kernel is set as default for next boot
cr0x@server:~$ sudo grubby --default-kernel
/boot/vmlinuz-5.15.0-201.135.6.el9uek.x86_64
Meaning: This is what GRUB will boot by default, not what you happen to be running today.
Decision: If the default kernel differs from uname -r, you’re one reboot away from a surprise. Align them.
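The two outputs are not directly comparable: uname -r prints a release string, grubby prints a path. A rough sketch of the comparison follows; kernel_matches_default is a hypothetical helper name, and it assumes the /boot/vmlinuz-<release> path format shown above.

```shell
# Return 0 when the default boot entry points at the running kernel.
# $1: output of `uname -r`
# $2: output of `grubby --default-kernel` (a path like /boot/vmlinuz-<release>)
kernel_matches_default() {
    local running="$1" default_path="$2"
    # Strip everything up to and including "vmlinuz-" to get the release.
    local default_release="${default_path##*/vmlinuz-}"
    [ "$running" = "$default_release" ]
}

# Usage on a real host (grubby needs root):
#   kernel_matches_default "$(uname -r)" "$(sudo grubby --default-kernel)" \
#       || echo "DRIFT: default kernel differs from running kernel"
```

Run it as a build acceptance check and again after every kernel update, since that is when the default entry tends to move.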
Task 4: List installed kernels (spot dual-track installs)
cr0x@server:~$ rpm -qa | grep -E '^kernel' | sort
kernel-5.14.0-427.13.1.el9_4.x86_64
kernel-core-5.14.0-427.13.1.el9_4.x86_64
kernel-modules-5.14.0-427.13.1.el9_4.x86_64
kernel-uek-5.15.0-201.135.6.el9uek.x86_64
kernel-uek-core-5.15.0-201.135.6.el9uek.x86_64
kernel-uek-modules-5.15.0-201.135.6.el9uek.x86_64
Meaning: You have both RHCK and UEK installed. This is common after migrations or “just in case” installs.
Decision: Pick one track as standard. If you keep both, document why and ensure default kernel is explicit. Otherwise remove the unused track to reduce confusion.
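Spotting dual-track installs across a fleet is easier with a one-word answer per host. A minimal sketch, assuming the kernel-core and kernel-uek-core package names from the listing above; kernel_families is a hypothetical name.

```shell
# Report which kernel families are installed.
# stdin: package names, one per line, as printed by `rpm -qa`.
# Prints one of: both, UEK, RHCK, none.
kernel_families() {
    awk '
        /^kernel-uek-core-/ { uek = 1 }
        /^kernel-core-/     { rhck = 1 }
        END {
            if (uek && rhck) print "both"
            else if (uek)    print "UEK"
            else if (rhck)   print "RHCK"
            else             print "none"
        }'
}

# Usage on a real host:
#   rpm -qa | kernel_families
```

Anything that answers "both" should have a documented reason or a cleanup ticket.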
Task 5: Verify enabled repositories (repo hygiene)
cr0x@server:~$ sudo dnf repolist --enabled
repo id repo name
ol9_baseos_latest Oracle Linux 9 BaseOS Latest (x86_64)
ol9_appstream Oracle Linux 9 Application Stream (x86_64)
ol9_uek_latest Latest Unbreakable Enterprise Kernel Release 7 for Oracle Linux 9 (x86_64)
ol9_ksplice Ksplice for Oracle Linux 9 (x86_64)
Meaning: These are the sources of truth for packages and kernels. Extra repos are how you accidentally import weirdness.
Decision: Disable anything you cannot explain. If you don’t have entitlement for Ksplice, don’t leave the repo half-configured.
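"Disable anything you cannot explain" turns into an allow-list check fairly naturally. The sketch below is one way to do it, not a standard tool: unexpected_repos is a hypothetical name, and it assumes the two-column "repo id / repo name" layout shown above.

```shell
# Print enabled repo ids that are not on the allow-list.
# $@: allowed repo ids
# stdin: output of `dnf repolist --enabled`
unexpected_repos() {
    local allowed=" $* "
    local id _rest
    while read -r id _rest; do
        [ -z "$id" ] && continue          # skip blank lines
        [ "$id" = "repo" ] && continue    # skip the "repo id  repo name" header
        case "$allowed" in
            *" $id "*) ;;                 # on the allow-list, say nothing
            *) echo "$id" ;;              # not on the list, report it
        esac
    done
}

# Usage on a real host:
#   sudo dnf -q repolist --enabled \
#       | unexpected_repos ol9_baseos_latest ol9_appstream ol9_uek_latest ol9_ksplice
```

Empty output means the host matches policy; any printed id is a repo someone has to explain.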
Task 6: Check package update posture (what’s pending)
cr0x@server:~$ sudo dnf -q check-update
kernel-uek.x86_64 5.15.0-201.138.1.el9uek ol9_uek_latest
openssl-libs.x86_64 1:3.0.7-29.el9_4 ol9_baseos_latest
Meaning: Pending updates include a kernel and userland libraries. This is normal.
Decision: Decide whether you’re applying immediately (new build) or staging. If Ksplice is enabled, the kernel update becomes a “live patch first, reboot later” decision.
Task 7: Confirm Secure Boot state (don’t guess)
cr0x@server:~$ sudo mokutil --sb-state
SecureBoot enabled
Meaning: Secure Boot is on at firmware level. If disabled, you may still be fine—but it’s a policy decision, not a default shrug.
Decision: If your compliance posture expects Secure Boot, fix it early. Turning it on later can break unsigned driver workflows.
Task 8: Check SELinux mode (security baseline reality)
cr0x@server:~$ getenforce
Enforcing
Meaning: SELinux is active and enforcing policy. Permissive is “audit mode”; disabled is “I like living dangerously.”
Decision: Keep enforcing unless you have a known incompatibility. If something breaks, write a local policy or adjust contexts; don’t just turn it off.
Task 9: Validate time sync (incident timelines depend on it)
cr0x@server:~$ timedatectl
Local time: Thu 2026-02-05 10:12:41 UTC
Universal time: Thu 2026-02-05 10:12:41 UTC
RTC time: Thu 2026-02-05 10:12:41
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
Meaning: Clock is synchronized, NTP is active, and you’re using UTC (good). If synchronized: no, your logs will gaslight you.
Decision: If not synchronized, fix chrony before you do anything else. Debugging with incorrect time is self-harm.
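For a build acceptance gate, the only line that matters is "System clock synchronized". A minimal sketch that checks it; clock_synchronized is a hypothetical helper name, and it assumes the English timedatectl output shown above.

```shell
# Return 0 if timedatectl reports a synchronized clock.
# stdin: output of `timedatectl`
clock_synchronized() {
    grep -Eq '^[[:space:]]*System clock synchronized:[[:space:]]*yes[[:space:]]*$'
}

# Usage on a real host:
#   timedatectl | clock_synchronized || echo "clock NOT synchronized, fix chrony first"
```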
Task 10: Confirm network identity and DNS (basic, but breaks everything)
cr0x@server:~$ hostnamectl
Static hostname: ol9-db-01
Icon name: computer-server
Chassis: server
Machine ID: 2f9b2c9d0d6a4d4a8c1d3b0d3b2c4a11
Boot ID: 8a42f5d7e7f747a0a5f01d72bbf05d61
Operating System: Oracle Linux Server 9.4
Kernel: Linux 5.15.0-201.135.6.el9uek.x86_64
Architecture: x86-64
Meaning: Hostname is set; kernel shown matches your expectation. Mismatched hostnames cause monitoring collisions and certificate pain.
Decision: Standardize hostname patterns; ensure DNS forward/reverse is correct for anything that will run TLS or Kerberos-ish systems.
Task 11: See what’s listening (catch accidental exposure)
cr0x@server:~$ sudo ss -lntup
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=1032,fd=3))
udp UNCONN 0 0 127.0.0.1:323 0.0.0.0:* users:(("chronyd",pid=980,fd=5))
Meaning: Only SSH is open to the world; chrony is local. If you see random services on 0.0.0.0, your “minimal install” wasn’t.
Decision: If a service is listening and you didn’t ask for it, disable it. Default-open services are a career-limiting move.
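To turn "what's listening" into something you can diff against an expected list, filter out loopback-only sockets. This is a sketch under assumptions: exposed_listeners is a hypothetical name, and it relies on the local address being the fifth column, as in the ss output above.

```shell
# Print protocol and local address:port for sockets not bound to loopback.
# stdin: output of `ss -lntup` (header line is ignored)
exposed_listeners() {
    awk '$1 != "Netid" && $5 !~ /^(127\.|\[::1\])/ { print $1, $5 }'
}

# Usage on a real host:
#   sudo ss -lntup | exposed_listeners
```

On the example host this should leave only the SSH listener; anything else is a question for whoever built the image.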
Task 12: Firewall status and actual rules
cr0x@server:~$ sudo firewall-cmd --state
running
cr0x@server:~$ sudo firewall-cmd --list-all
public (active)
target: default
interfaces: ens192
sources:
services: ssh
ports:
protocols:
forward: no
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
Meaning: Firewall is running and only SSH is allowed in the public zone.
Decision: Keep this tight. Open ports as part of application deployment, not as part of “server install.”
Task 13: Check storage layout and mount options (avoid accidental foot-guns)
cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINTS
NAME TYPE SIZE FSTYPE MOUNTPOINTS
sda disk 200G
├─sda1 part 600M vfat /boot/efi
├─sda2 part 1G xfs /boot
└─sda3 part 198.4G LVM2_member
├─ol-root lvm 80G xfs /
├─ol-var lvm 60G xfs /var
└─ol-home lvm 58.4G xfs /home
Meaning: Separate /var is usually a win for logs, package caches, and databases that spill. LVM gives you resizing options when reality arrives.
Decision: For servers running stateful services, split /var (and often /var/log) deliberately. If you keep everything on /, document your growth plan.
Task 14: Check I/O scheduler and queue settings (don’t tune blind)
cr0x@server:~$ cat /sys/block/sda/queue/scheduler
[mq-deadline] none
Meaning: Current scheduler is mq-deadline. On modern SSD-backed virtual disks this is often fine; on certain arrays, none can also be appropriate.
Decision: Do not change schedulers because a blog told you. Change it only with a measured workload test and a rollback plan.
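If you want to record the scheduler as part of a drift audit rather than change it, pull out the bracketed entry. A small sketch; active_scheduler is a hypothetical helper name, and it only assumes the kernel's convention of marking the active scheduler with square brackets.

```shell
# Print the active I/O scheduler from a sysfs scheduler line.
# $1: contents of /sys/block/<dev>/queue/scheduler, e.g. "[mq-deadline] none"
active_scheduler() {
    # The active entry is the one wrapped in [brackets]; print nothing if none is.
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Usage on a real host:
#   active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
```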
Task 15: Confirm swap and memory pressure posture
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 31Gi 2.1Gi 26Gi 128Mi 3.0Gi 28Gi
Swap: 4.0Gi 0B 4.0Gi
Meaning: Swap exists and is unused; good. No swap is not “performance tuning,” it’s gambling with the OOM killer.
Decision: Keep swap for most general-purpose servers; size it sensibly. For specialized database systems, decide deliberately and test fail modes.
Task 16: Audit boot errors quickly (don’t wait for the ticket)
cr0x@server:~$ sudo journalctl -b -p warning --no-pager | tail -n 20
Feb 05 10:01:18 ol9-db-01 kernel: ACPI: \_SB_.PLTF.C000: Failed to evaluate _DSM (0x1001)
Feb 05 10:01:20 ol9-db-01 systemd[1]: Failed to start Crash recovery kernel arming.
Feb 05 10:01:20 ol9-db-01 systemd[1]: kdump.service: Main process exited, code=exited, status=1/FAILURE
Meaning: Boot has warnings; kdump failed. That’s not necessarily fatal, but it’s a baseline deviation.
Decision: Fix kdump if you need crash dumps; otherwise disable it intentionally so it doesn’t become noise that hides real boot failures.
Ksplice on OL9: what you actually need
Ksplice is operational leverage. But only if you treat it like a system, not a checkbox.
What Ksplice is good at
- Reducing the number of emergency reboots for kernel CVEs.
- Buying time to coordinate maintenance windows.
- Keeping uptime-sensitive services stable while still patching.
What Ksplice is not
- A guarantee you never reboot again.
- A replacement for upgrading userland libraries (OpenSSL, glibc, etc.).
- A magic wand for broken kernels, bad drivers, or firmware issues.
Core checks for Ksplice readiness
Task 17: Verify ksplice tooling installed
cr0x@server:~$ rpm -q ksplice-uptrack
ksplice-uptrack-1.0.2-14.el9.x86_64
Meaning: Package exists. If not installed, you’re not live patching anything.
Decision: If you intend to use Ksplice, install the tooling as part of the base image; don’t depend on manual post-install steps.
Task 18: Check ksplice service status
cr0x@server:~$ sudo systemctl status uptrack.service --no-pager
● uptrack.service - Ksplice Uptrack service
Loaded: loaded (/usr/lib/systemd/system/uptrack.service; enabled; preset: disabled)
Active: active (running) since Thu 2026-02-05 10:05:12 UTC; 6min ago
Main PID: 1523 (uptrack)
Tasks: 3
Memory: 12.4M
CPU: 0.220s
CGroup: /system.slice/uptrack.service
└─1523 /usr/sbin/uptrack -d
Meaning: Service is running. Note the “enabled” state; you want it persistent across reboots.
Decision: If it’s inactive, check entitlements and configuration. If you don’t want Ksplice, disable and remove it—half-configured agents create false confidence.
Task 19: List applied and available Ksplice updates
cr0x@server:~$ sudo uptrack-show --available
Effective kernel version is 5.15.0-201.135.6.el9uek.x86_64
Updates available:
[0t9s] CVE-2025-12345: Fix for kernel issue in netfilter
[1a2b] CVE-2025-23456: Fix for kernel issue in io_uring
Meaning: Ksplice sees your effective kernel and has live patches available.
Decision: If nothing is available, that could be fine. If updates exist and you’re in a patch window, apply them. If Ksplice can’t determine the effective kernel, you likely have a kernel mismatch or unsupported state.
Task 20: Apply Ksplice updates (when policy allows)
cr0x@server:~$ sudo uptrack-upgrade -y
Installing [0t9s]... ok
Installing [1a2b]... ok
Your kernel is fully up to date.
Meaning: Live patches applied successfully.
Decision: Record the patching event (ticket/CMDB). If patches fail, do not keep retrying blindly—capture logs and consider a normal kernel update + reboot.
Task 21: Confirm reboot requirement state (Ksplice doesn’t cancel physics)
cr0x@server:~$ sudo dnf -q needs-restarting -r || true
No core libraries or services have been updated since boot-up.
Reboot should not be necessary.
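Meaning: From the package manager's view, nothing updated since boot requires a reboot. That does not mean you never reboot; it only lowers the urgency. Note that needs-restarting -r exits with status 1 when a reboot is recommended, and the || true above hides that, so drop it if automation should act on the result.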
Decision: Use this to justify deferring reboots to scheduled windows, not to avoid them forever. You still want periodic reboots for firmware/driver refresh, kernel upgrades beyond live patches, and general hygiene.
UEK selection and boot hygiene
Kernel track mistakes are expensive because they show up late: a driver behaves differently, performance shifts, or a third-party module refuses to load after a reboot.
Pick a standard: UEK by default, unless you have a compatibility reason
My default in Oracle Linux environments is UEK for general server workloads unless:
- You rely on a third-party kernel module only certified against the RHEL kernel line.
- You have a regulatory or vendor requirement that explicitly references the compatible kernel stream.
- You’re trying to minimize differences between OL and RHEL fleets.
Task 22: Install UEK kernel packages (if not already present)
cr0x@server:~$ sudo dnf install -y kernel-uek
Last metadata expiration check: 0:18:10 ago on Thu 05 Feb 2026 09:54:21 AM UTC.
Dependencies resolved.
====================================================================
Package Arch Version Repo Size
====================================================================
Installing:
kernel-uek x86_64 5.15.0-201.138.1.el9uek ol9_uek_latest 17 M
Transaction Summary
====================================================================
Install 1 Package
Complete!
Meaning: UEK kernel installed. You still need to ensure it’s the default boot entry.
Decision: If this is a production system, schedule a reboot to validate the new kernel boots cleanly before you call the build “golden.”
Task 23: Set the default kernel explicitly
cr0x@server:~$ sudo grubby --set-default /boot/vmlinuz-5.15.0-201.138.1.el9uek.x86_64
cr0x@server:~$ sudo grubby --default-kernel
/boot/vmlinuz-5.15.0-201.138.1.el9uek.x86_64
Meaning: Next boot goes to the intended UEK image.
Decision: Pin it. Don’t rely on GRUB “latest wins” behavior when multiple kernel families are installed.
Task 24: Remove the unused kernel family (optional, but reduces confusion)
cr0x@server:~$ sudo dnf remove -y 'kernel-5.14.0*' 'kernel-core-5.14.0*' 'kernel-modules-5.14.0*'
Dependencies resolved.
====================================================================
Package Arch Version Repository Size
====================================================================
Removing:
kernel x86_64 5.14.0-427.13.1.el9_4 @System 0
kernel-core x86_64 5.14.0-427.13.1.el9_4 @System 25 M
kernel-modules x86_64 5.14.0-427.13.1.el9_4 @System 34 M
Transaction Summary
====================================================================
Remove 3 Packages
Complete!
Meaning: RHCK packages removed (in this example). You still have UEK installed.
Decision: If this host must support both (rare), keep both but document the reason and ensure default is enforced via config management.
Update strategy: security, stability, and change control
Updating is not a command. It’s a policy with tooling attached.
Security updates: make them routine
If you can’t automate security updates, at least automate visibility. Humans are unreliable schedulers.
Task 25: See security advisories for pending updates
cr0x@server:~$ sudo dnf updateinfo list --security
OLSA-2026-0001 Important/Sec. openssl-libs-1:3.0.7-29.el9_4.x86_64
OLSA-2026-0002 Moderate/Sec. curl-7.76.1-29.el9_4.x86_64
Meaning: Security advisories exist for specific packages.
Decision: If these packages are in your runtime path (they usually are), plan to patch. For internet-facing systems, “Important” is not optional.
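For reporting, a count per severity is usually more useful than the raw list. A minimal sketch, assuming the "Severity/Sec." second column shown above; count_by_severity is a hypothetical helper name.

```shell
# Count pending security advisories per severity.
# stdin: output of `dnf updateinfo list --security`
# Prints lines like "Important 2", sorted by severity name.
count_by_severity() {
    awk '$2 ~ /\/Sec\.$/ { split($2, s, "/"); n[s[1]]++ }
         END { for (k in n) print k, n[k] }' | sort
}

# Usage on a real host:
#   sudo dnf -q updateinfo list --security | count_by_severity
```

Feed the result into whatever already pages you; the point is visibility, not another dashboard.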
Task 26: Apply updates with a conservative posture
cr0x@server:~$ sudo dnf upgrade -y
Dependencies resolved.
====================================================================
Package Arch Version Repo Size
====================================================================
Upgrading:
openssl-libs x86_64 1:3.0.7-29.el9_4 ol9_baseos_latest 1.5 M
Transaction Summary
====================================================================
Upgrade 1 Package
Complete!
Meaning: Userland package upgraded. Kernel might remain pending depending on your approach.
Decision: After upgrades, run needs-restarting to decide if service restarts or reboot are required.
Task 27: Identify services needing restart due to updated libraries
cr0x@server:~$ sudo dnf needs-restarting || true
980 : /usr/sbin/chronyd -F 2
1032 : sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
Meaning: These processes are running with old libraries mapped; they should be restarted to fully apply updates.
Decision: Restart services in a controlled order. For SSH, be careful: keep an active session open when restarting to avoid locking yourself out.
Task 28: Lock down package sources and prevent accidental repo enablement
cr0x@server:~$ sudo dnf config-manager --set-disabled ol9_developer_EPEL
Error: No matching repo to modify: ol9_developer_EPEL.
Meaning: Repo isn’t present; good. If it were present and enabled, you’d want it off on production.
Decision: Explicitly manage enabled repos in automation. “It wasn’t enabled on my machine” is not a control.
Three corporate mini-stories from the trenches
Incident caused by a wrong assumption: “UEK is installed, so we must be running it”
At a mid-size enterprise, a team standardized on Oracle Linux for a fleet of app servers. The baseline doc said “install UEK” and the build pipeline did exactly that. Somewhere in the reboot chain, though, the default kernel remained RHCK because the image originally came from a template that had RHCK first in GRUB order.
For months nothing looked broken. Monitoring was green. Performance was fine. Then a maintenance reboot hit a subset of nodes and suddenly a third-party NIC feature behaved differently. The symptom was ugly: sporadic packet drops under load and retransmits that looked like “network” but smelled like “kernel.” The team chased switch ports, blamed cabling, and opened vendor tickets. Time disappeared.
The fix was embarrassingly simple: confirm the running kernel, confirm the default kernel, then align them. The lesson wasn’t “UEK bad” or “RHCK bad.” It was that installing a package is not the same as booting it. Any baseline that doesn’t verify boot state is a bedtime story.
Afterward, they added two checks to the build acceptance: uname -r must match the intended track, and grubby --default-kernel must match uname -r. It took minutes. It saved days later.
Optimization that backfired: “Let’s disable swap to reduce latency”
A different org ran low-latency services and decided swap was the enemy. Someone had read a performance thread and took the conclusion personally. Swap was removed from the baseline. The servers did feel “snappier” in a narrow benchmark, so the change looked justified.
Then came the slow leak: a Java service with a memory growth bug that normally would have pushed some cold pages to swap under pressure. Without swap, memory pressure went from “slightly degraded” to “instant OOM.” The kernel started killing processes. Not the big obvious one either—sometimes it killed the sidecar or a logging agent first, which made the incident harder to interpret.
The worst part was the failure mode. It wasn’t a graceful degradation. It was a chaotic crash loop that looked like an application bug (it was) but behaved like infrastructure instability (it became). They reintroduced swap with sane sizing, tuned memory limits properly, and used cgroups to constrain the worst offenders. The “optimization” had reduced one type of latency and increased the mean time to sanity by a lot.
Joke #2: Disabling swap to “improve performance” is like removing the spare tire to “reduce weight.” Technically true until the road gets interesting.
Boring but correct practice that saved the day: “Keep /var separate and monitor it”
In a regulated environment, the security team required verbose audit logging. The ops team did something deeply unsexy: they separated /var into its own logical volume, added alerting on free space, and kept log rotation tight. Nobody clapped. Nobody wrote a blog post about it.
One morning, an upstream authentication provider had intermittent failures. Applications began retrying, and the retry storms amplified log volume. On many systems, that kind of event fills the root filesystem and you get cascading failures: package manager breaks, services can’t write state, SSH logins fail, and recovery becomes a remote-hands adventure.
Here, root stayed healthy. Only /var got hammered, alerts fired early, and the team had room to maneuver: they throttled retries, increased rotation temporarily, and cleaned up the growth. The incident still hurt, but it didn’t turn into a full system outage. The best ops work looks like nothing happened.
Fast diagnosis playbook
This is the “walk into the burning room” sequence. It assumes OL9, but most steps are universal. The goal is to find the bottleneck fast: CPU, memory, disk, network, or “the kernel is not what you think it is.”
First: confirm you are on the kernel you think you are
cr0x@server:~$ uname -r
5.15.0-201.135.6.el9uek.x86_64
cr0x@server:~$ sudo grubby --default-kernel
/boot/vmlinuz-5.15.0-201.135.6.el9uek.x86_64
Signal: Mismatch means you’re one reboot away from a new failure mode. Also, some performance anomalies correlate with kernel track differences.
Decision: If mismatched, treat as configuration drift and fix before deeper tuning.
Second: identify the resource pressure in 60 seconds
cr0x@server:~$ uptime
10:19:52 up 2:31, 2 users, load average: 7.22, 6.91, 6.80
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 0 26800000 120000 2100000 0 0 2 7 120 240 3 1 95 1 0
8 2 0 900000 110000 2200000 0 0 8000 12000 1400 2200 5 3 55 37 0
Signal: High wa (iowait) suggests storage latency or saturation. High r with low idle suggests CPU contention. Swap-in/out suggests memory pressure.
Decision: Pick the likely bottleneck class and dig there, not everywhere.
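The triage rule above can be written down as a rough classifier for a single vmstat row. Treat this as a sketch: vmstat_verdict is a hypothetical name, the thresholds (20% iowait, 10% idle) are illustrative assumptions rather than recommendations, and the column positions assume the default vmstat layout shown above.

```shell
# Classify one vmstat data row into a likely bottleneck class.
# stdin: a single data row with the default columns:
#   r b swpd free buff cache si so bi bo in cs us sy id wa st
vmstat_verdict() {
    awk '{
        si = $7; so = $8; idle = $15; wa = $16
        if (wa >= 20)          print "io-wait"
        else if (si + so > 0)  print "memory-pressure"
        else if (idle <= 10)   print "cpu-bound"
        else                   print "no-obvious-pressure"
    }'
}

# Usage on a real host (take the last sample, not the since-boot average):
#   vmstat 1 5 | tail -n 1 | vmstat_verdict
```

It deliberately returns one answer so you dig in one place first.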
Third: if iowait is high, prove it with per-device stats
cr0x@server:~$ iostat -xz 1 3
Linux 5.15.0-201.135.6.el9uek.x86_64 (ol9-db-01) 02/05/2026 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
6.20 0.00 3.10 34.50 0.00 56.20
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
sda 65.0 5200.0 0.0 0.0 18.5 80.0 110.0 14500.0 2.0 1.8 35.2 131.8 5.3 92.0
Signal: High await, high %util, and growing aqu-sz = storage is the bottleneck.
Decision: Check the storage backend, queue depth, noisy neighbors, filesystem, and application I/O patterns before touching CPU tuning.
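To pick the busy device out of a long iostat listing, filter on %util. A minimal sketch: saturated_devices is a hypothetical name, the 80% default threshold is an assumption, and the column is located by its header name because its position differs between sysstat versions.

```shell
# Print devices whose %util is at or above a threshold.
# $1: threshold in percent (assumed default: 80)
# stdin: output of `iostat -x` (or -xz)
saturated_devices() {
    awk -v limit="${1:-80}" '
        $1 == "Device" || $1 == "Device:" {
            # Find the %util column by name on each header line.
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 >= limit { print $1, $col }
    '
}

# Usage on a real host:
#   iostat -xz 1 3 | saturated_devices 80
```

High %util alone is weak evidence on devices that serve requests in parallel, so read it together with await and aqu-sz as described above.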
Fourth: if CPU is high, find top offenders and their system calls
cr0x@server:~$ ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head
PID COMMAND %CPU %MEM
2412 java 380 22.1
1530 node 95 3.2
980 chronyd 2 0.1
Signal: A single process dominating means you can focus. Many processes each using a bit suggests contention or thundering herd.
Decision: For single offenders, check app-level profiling and limits. For herds, check connection storms, retry loops, and lock contention.
Fifth: if it “feels like network,” validate drops and errors
cr0x@server:~$ ip -s link show dev ens192
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 00:50:56:aa:bb:cc brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
987654321 1234567 0 12 0 0
TX: bytes packets errors dropped carrier collsns
876543210 1122334 0 0 0 0
Signal: Drops or errors exist; not proof of root cause, but a clue. In virtualized environments, drops can be host-side congestion.
Decision: If drops are nonzero and rising, correlate with load and check vSwitch/host metrics; don’t immediately rewrite TCP settings.
Common mistakes: symptoms → root cause → fix
1) “Ksplice is installed but nothing is patching”
Symptoms: uptrack-show shows errors, service not running, or no updates ever appear.
Root cause: Missing repo entitlement, agent not enabled, or running an unsupported kernel track/version for the configured Ksplice channel.
Fix: Verify that dnf repolist --enabled includes the Ksplice repo, enable the agent with systemctl enable --now uptrack.service, and confirm that uname -r matches a UEK/RHCK stream your Ksplice channel supports.
2) “After reboot, kernel changed unexpectedly”
Symptoms: System reboots and now drivers/performance differ; uname -r shows a different family.
Root cause: Default GRUB entry not set; both kernel families installed; updates installed a newer kernel of the other family.
Fix: Use grubby --default-kernel and grubby --set-default to pin; remove unused kernel packages or explicitly manage GRUB in config management.
3) “DNF pulls in unexpected versions”
Symptoms: A routine upgrade brings in packages you didn’t expect; dependency chains look weird.
Root cause: Extra repos enabled (often developer/testing), or modular streams mismatched across hosts.
Fix: Enforce repo allow-lists, audit with dnf repolist --enabled, and standardize module streams if you use them.
4) “SSH restart locked me out”
Symptoms: Restarted sshd during patching, lost connectivity.
Root cause: Firewall misconfig, sshd config error, or you were connected through a brittle bastion session with no fallback.
Fix: Validate config with sshd -t before restarting, keep a second session open, and use console access for critical changes.
5) “High iowait after ‘storage tuning’”
Symptoms: Latency spikes; apps slow; iostat shows high await.
Root cause: Changed I/O scheduler or queue settings without considering backend storage or virtualization; misaligned filesystem/mount options.
Fix: Revert tuning; baseline with default scheduler; measure with representative workload; coordinate with storage team on queue depth and array behavior.
6) “Logs disappeared or system froze due to full disk”
Symptoms: Services fail, package manager errors, kernel reports write failures.
Root cause: Single root filesystem filled by logs or application data; no alerting on growth.
Fix: Split /var (and sometimes /var/log), enable aggressive log rotation, alert on filesystem utilization, and cap runaway app logs.
7) “Secure Boot enabled, but kernel modules fail to load”
Symptoms: Drivers/modules won’t load after enabling Secure Boot; dmesg shows signature problems.
Root cause: Unsigned third-party modules and an operational process that never handled signing/enrollment.
Fix: Either sign modules properly and manage MOK keys, or keep Secure Boot off by policy—don’t run an undefined middle state.
Checklists / step-by-step plan
Gold image baseline (do this once, automate forever)
- Install OL9 minimal where appropriate; avoid extra package groups unless you need them.
- Set hostname, DNS, and time zone (use UTC unless you have a legal reason not to).
- Choose kernel track:
- UEK for typical Oracle Linux fleets.
- RHCK if a vendor requires it or you need tight RHEL kernel alignment.
- Pin default kernel with grubby; remove the other kernel family if you don't need it.
- Enable only required repos; verify with dnf repolist --enabled.
- Enable SELinux enforcing; fix policy issues instead of disabling.
- Configure chrony and verify time sync.
- Configure firewall: allow only what's required; validate with ss and firewall-cmd.
- Partition sanely: separate /var for most server roles; ensure growth plan.
- Decide on Ksplice:
- If using it, install agent, enable service, validate patch application.
- If not using it, don't pretend: remove packages and disable repos.
- Apply updates, then decide: restart services vs reboot. Use needs-restarting.
- Reboot once during build validation to ensure it returns cleanly on the intended kernel.
- Capture “known good” state: kernel, enabled repos, loaded modules, listening ports, filesystem layout.
Recurring operations (weekly rhythm)
- Run security update visibility (dnf updateinfo list --security).
- Apply userland updates regularly; restart affected services deliberately.
- Apply Ksplice updates when available (if policy allows).
- Schedule periodic reboots (monthly/quarterly) even with live patching.
- Audit drift: kernel family, repo list, firewall services, SELinux mode.
Pre-maintenance window sanity checks (per host)
- uname -r and grubby --default-kernel match.
- dnf check-update reviewed; kernel and critical libraries noted.
- dnf needs-restarting reviewed after updates.
- Console access verified for remote sites.
- Backups/snapshots validated for stateful systems.
FAQ
1) Should I use UEK or RHCK on Oracle Linux 9?
Default to UEK for most Oracle Linux fleets, especially if you run Oracle workloads or want newer kernel features. Use RHCK when a third-party vendor certifies only against the RHEL-compatible kernel line or your org demands tight RHEL kernel alignment.
2) Can I install both UEK and RHCK “just in case”?
You can, but you’re buying confusion. If you keep both, you must pin the default kernel and verify it continuously. Otherwise you’ll reboot into a different kernel family and call it “random.”
3) Does Ksplice eliminate all reboots?
No. It reduces urgent kernel-reboot pressure for many CVE-class fixes, but you still need reboots for some kernel changes, firmware updates, driver changes, and general lifecycle hygiene.
4) Why does grubby --default-kernel matter if uname -r looks right?
Because uname -r is “what you booted,” while grubby is “what you will boot.” Outages love the gap between those two sentences.
5) Should I enable automatic updates on OL9 servers?
Enable automatic security updates only if you have a tested rollback story and you understand your change control requirements. At minimum, automate reporting of security advisories and keep maintenance windows frequent.
6) What’s the smallest set of repos I should keep enabled?
Typically BaseOS and AppStream, plus UEK if you run it, plus Ksplice if you’re entitled and using it. Anything else should be justified by a specific package need and reviewed periodically.
7) Is SELinux enforcing realistic in production?
Yes. Most standard services are fine out of the box. When you hit issues, fix contexts or write targeted policy. Disabling SELinux because one custom daemon complained is the fastest way to get “mysterious” incidents later.
8) What filesystem layout do you recommend for a general-purpose server?
UEFI + separate /boot, LVM for flexibility, /var separate for growth and log containment. For heavy logging or databases, consider splitting /var/log or placing data on dedicated volumes.
9) How do I know if I should worry about Secure Boot?
If you’re in regulated environments or you care about boot-chain integrity, it’s a strong control. If you require third-party kernel modules, plan module signing and MOK enrollment first, or Secure Boot will become a surprise outage.
Next steps you can do this week
- Pick your kernel standard (UEK or RHCK) and write it down in one place that matters: build pipeline + policy.
- Add two acceptance checks to every host build: uname -r matches the standard, and grubby --default-kernel matches uname -r.
- Decide on Ksplice intentionally. If you use it, monitor it and prove it patches. If you don't, remove it and stop pretending.
- Make updates predictable: constrain repos, standardize patch cadence, and use needs-restarting to avoid random reboots.
- Run the fast diagnosis playbook on one healthy host and capture "normal." That baseline makes future anomalies obvious.
If you do nothing else: enforce kernel boot hygiene and repo hygiene. Most “mystery” production problems are just undocumented choices finally collecting interest.