Oracle Linux 9 Install: Ksplice, UEK, and a Clean Server Baseline

Production servers don’t fail because you forgot a clever trick. They fail because you did something “small” at install time—picked the wrong kernel, skipped one repo, left updates to vibes—and six months later you’re debugging at 2 a.m. with a flashlight made of Slack messages.

This is the boring, correct Oracle Linux 9 baseline: UEK where it fits, Ksplice where it pays, and a set of checks that keep your fleet consistent. You’ll come out with a server you can patch, audit, and troubleshoot without needing to re-learn your own decisions.

The mental model: OL9, UEK, RHCK, and Ksplice

Oracle Linux 9 is a RHEL-compatible distribution with two kernel tracks you can run:

  • RHCK (Red Hat Compatible Kernel): tracks the RHEL kernel and keeps its ABI expectations.
  • UEK (Unbreakable Enterprise Kernel): Oracle’s kernel, generally newer, with features and performance work targeted at enterprise workloads.

Neither is “always correct.” UEK is often a strong default for OL deployments, especially where Oracle expects it (databases, heavy storage/network I/O, certain drivers). RHCK can be the conservative choice when you need maximum compatibility with third-party kernel modules validated for RHEL’s kernel line, or when your org’s compliance tooling assumes RHEL kernel versions.

Ksplice is live patching: applying certain kernel updates without rebooting. In the real world, it doesn’t remove the need for reboots forever. It changes the reboot schedule from “every time there’s a kernel CVE” to “when you plan it.” That’s a huge difference if you’ve ever tried to coordinate reboots across stateful clusters, finicky middleware, or executive patience.

One quote worth pinning above your rack: “Hope is not a strategy.” — General Gordon R. Sullivan. Not an SRE quote, but it nails ops planning.

Here’s the baseline attitude: pick a kernel track intentionally, verify you booted what you think you booted, make updates predictable, and keep drift low. Your future incidents will still happen, but they’ll be about real bugs—not self-inflicted ambiguity.

Joke #1: If you don’t standardize your baseline, every server becomes a unique snowflake. And snowflakes melt under pressure.

Interesting facts and context (because history leaks into your outage)

  • Oracle Linux started as a rebuild of RHEL sources; the “compatible but with options” DNA is why RHCK exists alongside UEK.
  • Ksplice was originally a startup (late 2000s) focused on rebootless kernel updates; Oracle acquired it and made it part of the enterprise story.
  • UEK has often moved faster than the RHEL kernel stream, which can matter for hardware enablement on newer platforms.
  • RHEL-style kernel ABI stability is a design goal; third-party vendors frequently test against that expectation, which is why RHCK remains relevant.
  • Live patching is not “every patch”: some kernel changes are too invasive; you still need maintenance windows for certain classes of updates.
  • Secure Boot realities: it can be enabled and still not protect you if your operational process allows unsigned modules or weak boot-chain controls.
  • DNF modularity and repo sprawl became a practical ops issue in the RHEL 8+ era; repo hygiene is now part of “install.”
  • Systemd’s dominance (since mid-2010s across major distros) means service behavior, boot logs, and dependency failures are mostly a systemd problem now, not “init scripts.”

A clean server baseline that survives real operations

Baseline goals (the ones that matter at 3 a.m.)

  • Deterministic boot state: you always know which kernel you’re running and why.
  • Predictable updates: security fixes flow; breaking changes don’t sneak in unnoticed.
  • Low drift: hosts look similar enough that one runbook works.
  • Observable: logs, time sync, and resource signals are trustworthy.
  • Recoverable: you can roll back or at least stop digging deeper when something smells off.

Baseline components

For most environments, I want these set early:

  • UEK or RHCK chosen intentionally, with the other removed or at least not default.
  • Ksplice enabled and monitored if you have the subscription/entitlement and operational need.
  • Repo policy: only required repos enabled; no “testing” repos in production.
  • Automatic security updates (or at least automatic notifications), with a human-owned maintenance window for reboots.
  • Time sync via chrony; time drift ruins incident timelines and authentication.
  • Firewall policy that matches actual service exposure, not “open and pray.”
  • SELinux enforcing unless you can defend the exception with evidence.
  • Crash kernel (kdump) enabled for the systems where kernel failures would be expensive to root-cause.
  • Storage sanity checks: correct scheduler, correct mount options, no “it seemed faster” tuning without measurement.

Opinion: if you’re not ready to run SELinux enforcing, you’re not ready to run internet-facing services. Internal-only is not a force field; it’s a rumor.

Hands-on tasks: commands, expected signals, and decisions

These are real checks you can paste into a terminal. Each task includes what the output means and what decision you should make based on it. Do them in order on a fresh OL9 install, then bake the results into automation.

Task 1: Confirm OS release and baseline identity

cr0x@server:~$ cat /etc/os-release
NAME="Oracle Linux Server"
VERSION="9.4"
ID="ol"
ID_LIKE="fedora"
VERSION_ID="9.4"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Oracle Linux Server 9.4"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:9:4:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://github.com/oracle/oracle-linux"
ORACLE_BUGZILLA_PRODUCT="Oracle Linux 9"
ORACLE_BUGZILLA_PRODUCT_VERSION=9.4

Meaning: Confirms you’re actually on Oracle Linux 9.x, not a lookalike template. The PLATFORM_ID and CPE_NAME show the EL9 family.

Decision: Record version in CMDB/asset inventory; align repo targets and patch policy to the same minor line across the fleet where possible.

Task 2: Check what kernel you’re currently running

cr0x@server:~$ uname -r
5.15.0-201.135.6.el9uek.x86_64

Meaning: el9uek in the release string indicates UEK. An el9 tag without uek, with or without a minor-release suffix (for example 5.14.0-427.13.1.el9_4.x86_64), is RHCK.

Decision: If this doesn’t match your standard, stop and fix it now. The “we’ll clean it up later” plan is how fleets drift.
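
If you want that check in a build gate rather than in someone’s head, the string match can be scripted. This is a minimal sketch, assuming the tag conventions described above (el9uek for UEK, a plain el9 or el9_N tag for RHCK). The names kernel_track and expected_track are hypothetical, not part of any Oracle tooling.

```shell
# Classify a kernel release string as "uek", "rhck", or "unknown".
# Assumption: UEK releases carry an "el<N>uek" tag, while RHCK releases
# carry "el<N>" or "el<N>_<minor>" with no "uek" suffix.
kernel_track() {
  case "$1" in
    *.el[0-9]*uek.*) echo "uek" ;;
    *.el[0-9]*)      echo "rhck" ;;
    *)               echo "unknown" ;;
  esac
}

# Usage on a live host (expected_track is a hypothetical fleet setting):
#   expected_track="uek"
#   [ "$(kernel_track "$(uname -r)")" = "$expected_track" ] || echo "kernel track drift"
```

The function only inspects the string it is given, so it can be tested against recorded uname -r values without touching a server.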

Task 3: Confirm which kernel is set as default for next boot

cr0x@server:~$ sudo grubby --default-kernel
/boot/vmlinuz-5.15.0-201.135.6.el9uek.x86_64

Meaning: This is what GRUB will boot by default, not what you happen to be running today.

Decision: If the default kernel differs from uname -r, you’re one reboot away from a surprise. Align them.
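
The comparison is easy to get wrong by eye when version strings differ by one digit. A small sketch, assuming grubby prints a path of the form /boot/vmlinuz-<release> as in the output above; default_matches_running is a hypothetical helper name.

```shell
# Succeed only if the default boot entry is the kernel that is running.
# $1: path printed by `grubby --default-kernel`
# $2: release printed by `uname -r`
default_matches_running() {
  local default_path="$1"
  local running="$2"
  # Strip everything up to and including "vmlinuz-" to get the release.
  local default_release="${default_path##*/vmlinuz-}"
  [ "$default_release" = "$running" ]
}

# Usage on a live host:
#   default_matches_running "$(sudo grubby --default-kernel)" "$(uname -r)" \
#     || echo "default kernel differs from running kernel"
```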

Task 4: List installed kernels (spot dual-track installs)

cr0x@server:~$ rpm -qa | grep -E '^kernel' | sort
kernel-5.14.0-427.13.1.el9_4.x86_64
kernel-core-5.14.0-427.13.1.el9_4.x86_64
kernel-modules-5.14.0-427.13.1.el9_4.x86_64
kernel-uek-5.15.0-201.135.6.el9uek.x86_64
kernel-uek-core-5.15.0-201.135.6.el9uek.x86_64
kernel-uek-modules-5.15.0-201.135.6.el9uek.x86_64

Meaning: You have both RHCK and UEK installed. This is common after migrations or “just in case” installs.

Decision: Pick one track as standard. If you keep both, document why and ensure default kernel is explicit. Otherwise remove the unused track to reduce confusion.
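
To flag dual-track hosts across a fleet, the package list can be reduced to a single word. This sketch assumes each installed kernel brings a kernel-core or kernel-uek-core package, as in the listing above; installed_kernel_tracks is a hypothetical name.

```shell
# stdin: package names, one per line, as printed by `rpm -qa`.
# Prints "both", "uek", "rhck", or "none".
installed_kernel_tracks() {
  awk '
    /^kernel-uek-core-/ { uek = 1 }
    /^kernel-core-/     { rhck = 1 }
    END {
      if (uek && rhck) print "both"
      else if (uek)    print "uek"
      else if (rhck)   print "rhck"
      else             print "none"
    }'
}

# Usage on a live host:
#   rpm -qa | installed_kernel_tracks
```

A result of "both" is not automatically wrong, but it should map to a documented exception.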

Task 5: Verify enabled repositories (repo hygiene)

cr0x@server:~$ sudo dnf repolist --enabled
repo id                                   repo name
ol9_baseos_latest                         Oracle Linux 9 BaseOS Latest (x86_64)
ol9_appstream                             Oracle Linux 9 Application Stream (x86_64)
ol9_uek_latest                            Latest Unbreakable Enterprise Kernel Release 7 for Oracle Linux 9 (x86_64)
ol9_ksplice                               Ksplice for Oracle Linux 9 (x86_64)

Meaning: These are the sources of truth for packages and kernels. Extra repos are how you accidentally import weirdness.

Decision: Disable anything you cannot explain. If you don’t have entitlement for Ksplice, don’t leave the repo half-configured.
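
“Disable anything you cannot explain” works better as an allow-list than as a memory test. A minimal sketch, assuming the two-column repolist layout shown above; unexpected_repos is a hypothetical helper and the repo ids in the usage line are simply the ones from this example host.

```shell
# stdin: output of `dnf -q repolist --enabled`.
# args:  the repo ids you allow.
# Prints every enabled repo id that is not on the allow-list.
unexpected_repos() {
  local allowed=" $* "
  local id _rest
  while read -r id _rest; do
    # Skip blank lines and the "repo id ... repo name" header row.
    if [ -z "$id" ] || [ "$id" = "repo" ]; then continue; fi
    case "$allowed" in
      *" $id "*) ;;        # explicitly allowed
      *) echo "$id" ;;     # anything else is drift
    esac
  done
}

# Usage on a live host:
#   sudo dnf -q repolist --enabled \
#     | unexpected_repos ol9_baseos_latest ol9_appstream ol9_uek_latest ol9_ksplice
```

Empty output means the host matches the allow-list; any printed id is something to explain or disable.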

Task 6: Check package update posture (what’s pending)

cr0x@server:~$ sudo dnf -q check-update
kernel-uek.x86_64                  5.15.0-201.138.1.el9uek        ol9_uek_latest
openssl-libs.x86_64                1:3.0.7-29.el9_4               ol9_baseos_latest

Meaning: Pending updates include a kernel and userland libraries. This is normal.

Decision: Decide whether you’re applying immediately (new build) or staging. If Ksplice is enabled, the kernel update becomes a “live patch first, reboot later” decision.

Task 7: Confirm Secure Boot state (don’t guess)

cr0x@server:~$ sudo mokutil --sb-state
SecureBoot enabled

Meaning: Secure Boot is on at firmware level. If disabled, you may still be fine—but it’s a policy decision, not a default shrug.

Decision: If your compliance posture expects Secure Boot, fix it early. Turning it on later can break unsigned driver workflows.

Task 8: Check SELinux mode (security baseline reality)

cr0x@server:~$ getenforce
Enforcing

Meaning: SELinux is active and enforcing policy. Permissive is “audit mode”; disabled is “I like living dangerously.”

Decision: Keep enforcing unless you have a known incompatibility. If something breaks, write a local policy or adjust contexts; don’t just turn it off.

Task 9: Validate time sync (incident timelines depend on it)

cr0x@server:~$ timedatectl
               Local time: Thu 2026-02-05 10:12:41 UTC
           Universal time: Thu 2026-02-05 10:12:41 UTC
                 RTC time: Thu 2026-02-05 10:12:41
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

Meaning: Clock is synchronized, NTP is active, and you’re using UTC (good). If synchronized: no, your logs will gaslight you.

Decision: If not synchronized, fix chrony before you do anything else. Debugging with incorrect time is self-harm.
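
For automation, the one line that matters can be tested directly. This is a sketch that matches the human-readable timedatectl output shown above; clock_synchronized is a hypothetical name.

```shell
# stdin: output of `timedatectl`.
# Succeeds only if the "System clock synchronized" line says yes.
clock_synchronized() {
  grep -Eq '^[[:space:]]*System clock synchronized:[[:space:]]*yes[[:space:]]*$'
}

# Usage on a live host:
#   timedatectl | clock_synchronized || echo "clock not synchronized: fix chrony first"
# A machine-readable alternative on systemd-based hosts:
#   timedatectl show -p NTPSynchronized --value
```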

Task 10: Confirm network identity and DNS (basic, but breaks everything)

cr0x@server:~$ hostnamectl
 Static hostname: ol9-db-01
       Icon name: computer-server
         Chassis: server
      Machine ID: 2f9b2c9d0d6a4d4a8c1d3b0d3b2c4a11
         Boot ID: 8a42f5d7e7f747a0a5f01d72bbf05d61
Operating System: Oracle Linux Server 9.4
          Kernel: Linux 5.15.0-201.135.6.el9uek.x86_64
    Architecture: x86-64

Meaning: Hostname is set; kernel shown matches your expectation. Mismatched hostnames cause monitoring collisions and certificate pain.

Decision: Standardize hostname patterns; ensure DNS forward/reverse is correct for anything that will run TLS or Kerberos-ish systems.

Task 11: See what’s listening (catch accidental exposure)

cr0x@server:~$ sudo ss -lntup
Netid State  Recv-Q Send-Q Local Address:Port  Peer Address:Port Process
udp   UNCONN 0      0      127.0.0.1:323      0.0.0.0:*     users:(("chronyd",pid=980,fd=5))
tcp   LISTEN 0      128    0.0.0.0:22         0.0.0.0:*     users:(("sshd",pid=1032,fd=3))

Meaning: Only SSH is open to the world; chrony is local. If you see random services on 0.0.0.0, your “minimal install” wasn’t.

Decision: If a service is listening and you didn’t ask for it, disable it. Default-open services are a career-limiting move.
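
To make “what is exposed” a diffable fact, filter out loopback sockets and keep the rest. A minimal sketch, assuming the ss column order shown above with the header suppressed by -H; exposed_listeners is a hypothetical name.

```shell
# stdin: output of `ss -lntuH` (listening TCP/UDP sockets, numeric, no header).
# Prints "proto address:port" for every socket bound to a non-loopback address.
exposed_listeners() {
  awk '
    {
      local_addr = $5
      # Loopback-only sockets are not reachable from other hosts.
      if (local_addr ~ /^127\./ || local_addr ~ /^\[::1\]/) next
      print $1, local_addr
    }'
}

# Usage on a live host:
#   sudo ss -lntuH | exposed_listeners
```

On the example host this would leave only the SSH listener, which is the list you then compare against the firewall policy.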

Task 12: Firewall status and actual rules

cr0x@server:~$ sudo firewall-cmd --state
running
cr0x@server:~$ sudo firewall-cmd --list-all
public (active)
  target: default
  interfaces: ens192
  sources:
  services: ssh
  ports:
  protocols:
  forward: no
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

Meaning: Firewall is running and only SSH is allowed in the public zone.

Decision: Keep this tight. Open ports as part of application deployment, not as part of “server install.”

Task 13: Check storage layout and mount options (avoid accidental foot-guns)

cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINTS
NAME        TYPE   SIZE FSTYPE MOUNTPOINTS
sda         disk   200G
├─sda1      part   600M vfat   /boot/efi
├─sda2      part     1G xfs    /boot
└─sda3      part 198.4G LVM2_member
  ├─ol-root lvm     80G xfs    /
  ├─ol-var  lvm     60G xfs    /var
  └─ol-home lvm   58.4G xfs    /home

Meaning: Separate /var is usually a win for logs, package caches, and databases that spill. LVM gives you resizing options when reality arrives.

Decision: For servers running stateful services, split /var (and often /var/log) deliberately. If you keep everything on /, document your growth plan.

Task 14: Check I/O scheduler and queue settings (don’t tune blind)

cr0x@server:~$ cat /sys/block/sda/queue/scheduler
[mq-deadline] none

Meaning: Current scheduler is mq-deadline. On modern SSD-backed virtual disks this is often fine; on certain arrays, none can also be appropriate.

Decision: Do not change schedulers because a blog told you. Change it only with a measured workload test and a rollback plan.
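
Before any measured test, record what each disk is actually using. The active scheduler is the bracketed entry, and this sketch pulls it out; active_scheduler is a hypothetical helper.

```shell
# stdin: contents of /sys/block/<dev>/queue/scheduler.
# Prints the active scheduler, which the kernel marks with square brackets.
active_scheduler() {
  sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Usage on a live host: one line per block device.
#   for f in /sys/block/*/queue/scheduler; do
#     printf '%s %s\n' "$f" "$(active_scheduler < "$f")"
#   done
```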

Task 15: Confirm swap and memory pressure posture

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi       2.1Gi        26Gi       128Mi       3.0Gi        28Gi
Swap:          4.0Gi          0B       4.0Gi

Meaning: Swap exists and is unused; good. No swap is not “performance tuning,” it’s gambling with the OOM killer.

Decision: Keep swap for most general-purpose servers; size it sensibly. For specialized database systems, decide deliberately and test fail modes.

Task 16: Audit boot errors quickly (don’t wait for the ticket)

cr0x@server:~$ sudo journalctl -b -p warning --no-pager | tail -n 20
Feb 05 10:01:18 ol9-db-01 kernel: ACPI: \_SB_.PLTF.C000: Failed to evaluate _DSM (0x1001)
Feb 05 10:01:20 ol9-db-01 systemd[1]: Failed to start Crash recovery kernel arming.
Feb 05 10:01:20 ol9-db-01 systemd[1]: kdump.service: Main process exited, code=exited, status=1/FAILURE

Meaning: Boot has warnings; kdump failed. That’s not necessarily fatal, but it’s a baseline deviation.

Decision: Fix kdump if you need crash dumps; otherwise disable it intentionally so it doesn’t become noise that hides real boot failures.

Ksplice on OL9: what you actually need

Ksplice is operational leverage. But only if you treat it like a system, not a checkbox.

What Ksplice is good at

  • Reducing the number of emergency reboots for kernel CVEs.
  • Buying time to coordinate maintenance windows.
  • Keeping uptime-sensitive services stable while still patching.

What Ksplice is not

  • A guarantee you never reboot again.
  • A replacement for upgrading userland libraries (OpenSSL, glibc, etc.).
  • A magic wand for broken kernels, bad drivers, or firmware issues.

Core checks for Ksplice readiness

Task 17: Verify ksplice tooling installed

cr0x@server:~$ rpm -q ksplice-uptrack
ksplice-uptrack-1.0.2-14.el9.x86_64

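Meaning: Package exists. If not installed, you’re not live patching anything. The package name depends on which Ksplice client you run (Oracle’s documentation installs the Uptrack client as uptrack and the enhanced client as ksplice), so query the name your build actually uses.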

Decision: If you intend to use Ksplice, install the tooling as part of the base image; don’t depend on manual post-install steps.

Task 18: Check ksplice service status

cr0x@server:~$ sudo systemctl status uptrack.service --no-pager
● uptrack.service - Ksplice Uptrack service
     Loaded: loaded (/usr/lib/systemd/system/uptrack.service; enabled; preset: disabled)
     Active: active (running) since Thu 2026-02-05 10:05:12 UTC; 6min ago
   Main PID: 1523 (uptrack)
     Tasks: 3
     Memory: 12.4M
        CPU: 0.220s
     CGroup: /system.slice/uptrack.service
             └─1523 /usr/sbin/uptrack -d

Meaning: Service is running. Note the “enabled” state; you want it persistent across reboots.

Decision: If it’s inactive, check entitlements and configuration. If you don’t want Ksplice, disable and remove it—half-configured agents create false confidence.

Task 19: List applied and available Ksplice updates

cr0x@server:~$ sudo uptrack-show --available
Effective kernel version is 5.15.0-201.135.6.el9uek.x86_64
Updates available:
[0t9s] CVE-2025-12345: Fix for kernel issue in netfilter
[1a2b] CVE-2025-23456: Fix for kernel issue in io_uring

Meaning: Ksplice sees your effective kernel and has live patches available.

Decision: If nothing is available, that could be fine. If updates exist and you’re in a patch window, apply them. If Ksplice can’t determine the effective kernel, you likely have a kernel mismatch or unsupported state.
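
For monitoring, the listing can be reduced to a number. This sketch assumes the output format shown above, where each pending update sits on its own line starting with a bracketed ID. Treat that format as an assumption and check it against your client version; count_available_updates is a hypothetical name.

```shell
# stdin: output of `uptrack-show --available`.
# Prints how many lines start with a bracketed update ID.
count_available_updates() {
  # grep -c prints 0 but exits non-zero when nothing matches; normalize that.
  grep -c '^\[' || true
}

# Usage on a live host:
#   pending=$(sudo uptrack-show --available | count_available_updates)
#   [ "$pending" -eq 0 ] || echo "$pending Ksplice updates pending"
```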

Task 20: Apply Ksplice updates (when policy allows)

cr0x@server:~$ sudo uptrack-upgrade -y
Installing [0t9s]... ok
Installing [1a2b]... ok
Your kernel is fully up to date.

Meaning: Live patches applied successfully.

Decision: Record the patching event (ticket/CMDB). If patches fail, do not keep retrying blindly—capture logs and consider a normal kernel update + reboot.

Task 21: Confirm reboot requirement state (Ksplice doesn’t cancel physics)

cr0x@server:~$ sudo dnf -q needs-restarting -r || true
No core libraries or services have been updated since boot-up.
Reboot should not be necessary.

Meaning: From the package manager’s view, you don’t need a reboot for userland changes. This does not mean you never reboot, but it reduces urgency.

Decision: Use this to justify deferring reboots to scheduled windows, not to avoid them forever. You still want periodic reboots for firmware/driver refresh, kernel upgrades beyond live patches, and general hygiene.
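
Note that the || true in the command above throws away the exit status, which is the part automation cares about: needs-restarting -r exits 0 when no reboot is needed and 1 when one is recommended. A minimal sketch that keeps the status and turns it into a word; reboot_state is a hypothetical wrapper.

```shell
# Run a reboot-hint command and translate its exit status.
# Assumption: the command follows the `needs-restarting -r` convention of
# exit 0 = no reboot needed, exit 1 = reboot recommended.
reboot_state() {
  "$@" >/dev/null 2>&1
  case $? in
    0) echo "no-reboot" ;;
    1) echo "reboot-recommended" ;;
    *) echo "unknown" ;;       # tool missing, plugin error, and so on
  esac
}

# Usage on a live host:
#   reboot_state sudo dnf -q needs-restarting -r
```

Keeping "unknown" separate from "no-reboot" matters: a missing plugin should not read as a clean bill of health.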

UEK selection and boot hygiene

Kernel track mistakes are expensive because they show up late: a driver behaves differently, performance shifts, or a third-party module refuses to load after a reboot.

Pick a standard: UEK by default, unless you have a compatibility reason

My default in Oracle Linux environments is UEK for general server workloads unless:

  • You rely on a third-party kernel module only certified against the RHEL kernel line.
  • You have a regulatory or vendor requirement that explicitly references the compatible kernel stream.
  • You’re trying to minimize differences between OL and RHEL fleets.

Task 22: Install UEK kernel packages (if not already present)

cr0x@server:~$ sudo dnf install -y kernel-uek
Last metadata expiration check: 0:18:10 ago on Thu 05 Feb 2026 09:54:21 AM UTC.
Dependencies resolved.
====================================================================
 Package                Arch       Version                         Repo            Size
====================================================================
Installing:
 kernel-uek              x86_64     5.15.0-201.138.1.el9uek         ol9_uek_latest  17 M

Transaction Summary
====================================================================
Install  1 Package

Complete!

Meaning: UEK kernel installed. You still need to ensure it’s the default boot entry.

Decision: If this is a production system, schedule a reboot to validate the new kernel boots cleanly before you call the build “golden.”

Task 23: Set the default kernel explicitly

cr0x@server:~$ sudo grubby --set-default /boot/vmlinuz-5.15.0-201.138.1.el9uek.x86_64
cr0x@server:~$ sudo grubby --default-kernel
/boot/vmlinuz-5.15.0-201.138.1.el9uek.x86_64

Meaning: Next boot goes to the intended UEK image.

Decision: Pin it. Don’t rely on GRUB “latest wins” behavior when multiple kernel families are installed.

Task 24: Remove the unused kernel family (optional, but reduces confusion)

cr0x@server:~$ sudo dnf remove -y 'kernel*5.14.0*' 'kernel-core*5.14.0*' 'kernel-modules*5.14.0*'
Dependencies resolved.
====================================================================
 Package                Arch   Version                    Repository    Size
====================================================================
Removing:
 kernel                  x86_64 5.14.0-427.13.1.el9_4      @System      0
 kernel-core             x86_64 5.14.0-427.13.1.el9_4      @System     25 M
 kernel-modules          x86_64 5.14.0-427.13.1.el9_4      @System     34 M

Transaction Summary
====================================================================
Remove  3 Packages

Complete!

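Meaning: RHCK packages removed (in this example). You still have UEK installed. Run the removal while booted into UEK, because dnf refuses to remove the running kernel. The first glob also matches any other package whose name starts with kernel at version 5.14.0 (kernel-tools and kernel-headers, for example), so drop -y the first time and read the transaction before confirming.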

Decision: If this host must support both (rare), keep both but document the reason and ensure default is enforced via config management.

Update strategy: security, stability, and change control

Updating is not a command. It’s a policy with tooling attached.

Security updates: make them routine

If you can’t automate security updates, at least automate visibility. Humans are unreliable schedulers.

Task 25: See security advisories for pending updates

cr0x@server:~$ sudo dnf updateinfo list --security
ELSA-2026-0001 Important/Sec.  openssl-libs-1:3.0.7-29.el9_4.x86_64
ELSA-2026-0002 Moderate/Sec.   curl-7.76.1-29.el9_4.x86_64

Meaning: Security advisories exist for specific packages.

Decision: If these packages are in your runtime path (they usually are), plan to patch. For internet-facing systems, “Important” is not optional.
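
If the goal is visibility, a per-severity count is usually enough to drive an alert. A sketch assuming the three-column listing shown above, with severity and type joined by a slash in the second column; advisories_by_severity is a hypothetical name.

```shell
# stdin: output of `dnf -q updateinfo list --security`.
# Prints "<count> <severity>" per severity, sorted by severity name.
advisories_by_severity() {
  awk '
    NF >= 3 {
      split($2, part, "/")     # "Important/Sec." -> "Important"
      count[part[1]]++
    }
    END { for (sev in count) print count[sev], sev }' | sort -k2
}

# Usage on a live host:
#   sudo dnf -q updateinfo list --security | advisories_by_severity
```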

Task 26: Apply updates with a conservative posture

cr0x@server:~$ sudo dnf upgrade -y
Dependencies resolved.
====================================================================
 Package             Arch   Version                      Repo               Size
====================================================================
Upgrading:
 openssl-libs         x86_64 1:3.0.7-29.el9_4             ol9_baseos_latest  1.5 M

Transaction Summary
====================================================================
Upgrade  1 Package

Complete!

Meaning: Userland package upgraded. Kernel might remain pending depending on your approach.

Decision: After upgrades, run needs-restarting to decide if service restarts or reboot are required.

Task 27: Identify services needing restart due to updated libraries

cr0x@server:~$ sudo dnf needs-restarting || true
sshd : 1032
chronyd : 980

Meaning: These processes are running with old libraries mapped; they should be restarted to fully apply updates.

Decision: Restart services in a controlled order. For SSH, be careful: keep an active session open when restarting to avoid locking yourself out.

Task 28: Lock down package sources and prevent accidental repo enablement

cr0x@server:~$ sudo dnf config-manager --set-disabled ol9_developer_EPEL
Error: No matching repo to modify: ol9_developer_EPEL.

Meaning: Repo isn’t present; good. If it were present and enabled, you’d want it off on production.

Decision: Explicitly manage enabled repos in automation. “It wasn’t enabled on my machine” is not a control.

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “UEK is installed, so we must be running it”

At a mid-size enterprise, a team standardized on Oracle Linux for a fleet of app servers. The baseline doc said “install UEK” and the build pipeline did exactly that. The default boot entry, though, stayed on RHCK because the image originally came from a template that listed RHCK first in GRUB order.

For months nothing looked broken. Monitoring was green. Performance was fine. Then a maintenance reboot hit a subset of nodes and suddenly a third-party NIC feature behaved differently. The symptom was ugly: sporadic packet drops under load and retransmits that looked like “network” but smelled like “kernel.” The team chased switch ports, blamed cabling, and opened vendor tickets. Time disappeared.

The fix was embarrassingly simple: confirm the running kernel, confirm the default kernel, then align them. The lesson wasn’t “UEK bad” or “RHCK bad.” It was that installing a package is not the same as booting it. Any baseline that doesn’t verify boot state is a bedtime story.

Afterward, they added two checks to the build acceptance: uname -r must match the intended track, and grubby --default-kernel must match uname -r. It took minutes. It saved days later.

Optimization that backfired: “Let’s disable swap to reduce latency”

A different org ran low-latency services and decided swap was the enemy. Someone had read a performance thread and took the conclusion personally. Swap was removed from the baseline. The servers did feel “snappier” in a narrow benchmark, so the change looked justified.

Then came the slow leak: a Java service with a memory growth bug that normally would have pushed some cold pages to swap under pressure. Without swap, memory pressure went from “slightly degraded” to “instant OOM.” The kernel started killing processes. Not the big obvious one either—sometimes it killed the sidecar or a logging agent first, which made the incident harder to interpret.

The worst part was the failure mode. It wasn’t a graceful degradation. It was a chaotic crash loop that looked like an application bug (it was) but behaved like infrastructure instability (it became). They reintroduced swap with sane sizing, tuned memory limits properly, and used cgroups to constrain the worst offenders. The “optimization” had reduced one type of latency and increased the mean time to sanity by a lot.

Joke #2: Disabling swap to “improve performance” is like removing the spare tire to “reduce weight.” Technically true until the road gets interesting.

Boring but correct practice that saved the day: “Keep /var separate and monitor it”

In a regulated environment, the security team required verbose audit logging. The ops team did something deeply unsexy: they separated /var into its own logical volume, added alerting on free space, and kept log rotation tight. Nobody clapped. Nobody wrote a blog post about it.

One morning, an upstream authentication provider had intermittent failures. Applications began retrying, and the retry storms amplified log volume. On many systems, that kind of event fills the root filesystem and you get cascading failures: package manager breaks, services can’t write state, SSH logins fail, and recovery becomes a remote-hands adventure.

Here, root stayed healthy. Only /var got hammered, alerts fired early, and the team had room to maneuver: they throttled retries, increased rotation temporarily, and cleaned up the growth. The incident still hurt, but it didn’t turn into a full system outage. The best ops work looks like nothing happened.

Fast diagnosis playbook

This is the “walk into the burning room” sequence. It assumes OL9, but most steps are universal. The goal is to find the bottleneck fast: CPU, memory, disk, network, or “the kernel is not what you think it is.”

First: confirm you are on the kernel you think you are

cr0x@server:~$ uname -r
5.15.0-201.135.6.el9uek.x86_64
cr0x@server:~$ sudo grubby --default-kernel
/boot/vmlinuz-5.15.0-201.135.6.el9uek.x86_64

Signal: Mismatch means you’re one reboot away from a new failure mode. Also, some performance anomalies correlate with kernel track differences.

Decision: If mismatched, treat as configuration drift and fix before deeper tuning.

Second: identify the resource pressure in 60 seconds

cr0x@server:~$ uptime
 10:19:52 up  2:31,  2 users,  load average: 7.22, 6.91, 6.80
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0      0 26800000 120000 2100000  0    0     2     7  120  240  3  1 95  1  0
 8  2      0  900000 110000 2200000  0    0  8000 12000 1400 2200  5  3 55 37  0

Signal: High wa (iowait) suggests storage latency or saturation. High r with low idle suggests CPU contention. Swap-in/out suggests memory pressure.

Decision: Pick the likely bottleneck class and dig there, not everywhere.
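
The triage rule can be written down so that everyone reads vmstat the same way. This is a rough sketch only: the field positions follow the default vmstat layout shown above, and the thresholds (20% iowait, under 20% idle) are illustrative assumptions, not tuned values. pressure_class is a hypothetical name.

```shell
# stdin: output of `vmstat <interval> <count>`.
# Prints one verdict per data row: memory, io, cpu, or none.
# Fields in the default layout: si=$7 so=$8 id=$15 wa=$16.
pressure_class() {
  awk '
    $1 !~ /^[0-9]+$/ { next }              # skip the two header lines
    {
      if ($7 + $8 > 0)     verdict = "memory"   # any swap traffic
      else if ($16 >= 20)  verdict = "io"       # assumed iowait threshold
      else if ($15 < 20)   verdict = "cpu"      # assumed idle threshold
      else                 verdict = "none"
      print verdict
    }'
}

# Usage on a live host:
#   vmstat 1 5 | pressure_class
```

The first vmstat row reports averages since boot, so weigh the later rows more heavily.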

Third: if iowait is high, prove it with per-device stats

cr0x@server:~$ iostat -xz 1 3
Linux 5.15.0-201.135.6.el9uek.x86_64 (ol9-db-01)  02/05/2026  _x86_64_  (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.20    0.00    3.10   34.50    0.00   56.20

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz  aqu-sz  %util
sda              65.0   5200.0     0.0    0.0   18.5     80.0    110.0  14500.0     2.0    1.8   35.2    131.8    5.3   92.0

Signal: High await, high %util, and growing aqu-sz = storage is the bottleneck.

Decision: Check the storage backend, queue depth, noisy neighbors, filesystem, and application I/O patterns before touching CPU tuning.

Fourth: if CPU is high, find the top offenders

cr0x@server:~$ ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head
  PID COMMAND         %CPU %MEM
 2412 java            380  22.1
 1530 node             95   3.2
  980 chronyd           2   0.1

Signal: A single process dominating means you can focus. Many processes each using a bit suggests contention or thundering herd.

Decision: For single offenders, check app-level profiling and limits. For herds, check connection storms, retry loops, and lock contention.

Fifth: if it “feels like network,” validate drops and errors

cr0x@server:~$ ip -s link show dev ens192
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:50:56:aa:bb:cc brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
      987654321  1234567      0      12       0       0
    TX:  bytes packets errors dropped carrier collsns
      876543210  1122334      0       0       0       0

Signal: Drops or errors exist; not proof of root cause, but a clue. In virtualized environments, drops can be host-side congestion.

Decision: If drops are nonzero and rising, correlate with load and check vSwitch/host metrics; don’t immediately rewrite TCP settings.
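
“Nonzero and rising” needs two samples, so it helps to extract the counter as a plain number. A sketch assuming the column order shown above (bytes, packets, errors, dropped); rx_dropped is a hypothetical name.

```shell
# stdin: output of `ip -s link show dev <iface>`.
# Prints the RX "dropped" counter: the 4th value on the line after "RX:".
rx_dropped() {
  awk '/^[[:space:]]*RX:/ { getline; print $4; exit }'
}

# Usage on a live host: sample twice and compare.
#   before=$(ip -s link show dev ens192 | rx_dropped)
#   sleep 10
#   after=$(ip -s link show dev ens192 | rx_dropped)
#   echo "rx drops in 10s: $((after - before))"
```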

Common mistakes: symptoms → root cause → fix

1) “Ksplice is installed but nothing is patching”

Symptoms: uptrack-show shows errors, service not running, or no updates ever appear.

Root cause: Missing repo entitlement, agent not enabled, or running an unsupported kernel track/version for the configured Ksplice channel.

Fix: Verify that dnf repolist --enabled includes the Ksplice repo, enable the agent with systemctl enable --now uptrack.service, and confirm that uname -r reports a UEK/RHCK stream your Ksplice channel supports.

2) “After reboot, kernel changed unexpectedly”

Symptoms: System reboots and now drivers/performance differ; uname -r shows a different family.

Root cause: Default GRUB entry not set; both kernel families installed; updates installed a newer kernel of the other family.

Fix: Use grubby --default-kernel and grubby --set-default to pin; remove unused kernel packages or explicitly manage GRUB in config management.

3) “DNF pulls in unexpected versions”

Symptoms: A routine upgrade brings in packages you didn’t expect; dependency chains look weird.

Root cause: Extra repos enabled (often developer/testing), or modular streams mismatched across hosts.

Fix: Enforce repo allow-lists, audit with dnf repolist --enabled, and standardize module streams if you use them.

4) “SSH restart locked me out”

Symptoms: Restarted sshd during patching, lost connectivity.

Root cause: Firewall misconfig, sshd config error, or you were connected through a brittle bastion session with no fallback.

Fix: Validate config with sshd -t before restarting, keep a second session open, and use console access for critical changes.
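
The “validate, then restart” habit is easy to encode so nobody skips the first half. A minimal sketch of a generic guard; run_if_valid is a hypothetical helper, and the check is passed as a single string so it can carry its own arguments.

```shell
# Run an action only if a validation command succeeds.
# $1:   validation command as one string, e.g. 'sshd -t'
# rest: the action to run when validation passes
run_if_valid() {
  if [ "$#" -lt 2 ]; then
    echo "usage: run_if_valid '<check>' <action...>" >&2
    return 2
  fi
  local check="$1"
  shift
  if sh -c "$check"; then
    "$@"
  else
    echo "validation failed; not running: $*" >&2
    return 1
  fi
}

# Usage on a live host (keep a second session open anyway):
#   run_if_valid 'sudo sshd -t' sudo systemctl restart sshd
```

The guard only catches config syntax errors. It does nothing for firewall mistakes, which is why the second session and console access still matter.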

5) “High iowait after ‘storage tuning’”

Symptoms: Latency spikes; apps slow; iostat shows high await.

Root cause: Changed I/O scheduler or queue settings without considering backend storage or virtualization; misaligned filesystem/mount options.

Fix: Revert tuning; baseline with default scheduler; measure with representative workload; coordinate with storage team on queue depth and array behavior.

6) “Logs disappeared or system froze due to full disk”

Symptoms: Services fail, package manager errors, kernel reports write failures.

Root cause: Single root filesystem filled by logs or application data; no alerting on growth.

Fix: Split /var (and sometimes /var/log), enable aggressive log rotation, alert on filesystem utilization, and cap runaway app logs.
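
If you have no monitoring agent yet, even a cron-driven check beats silence. A sketch assuming POSIX df -P output and mount points without spaces; the 80% default is an illustrative threshold, and filesystems_over is a hypothetical name.

```shell
# stdin: output of `df -P`.
# $1:    percent-used threshold (default 80, an assumed value).
# Prints "mountpoint percent" for filesystems at or above the threshold.
filesystems_over() {
  local limit="${1:-80}"
  awk -v limit="$limit" '
    NR > 1 {
      used = $5
      sub(/%$/, "", used)                  # "90%" -> "90"
      if (used + 0 >= limit + 0) print $6, used
    }'
}

# Usage on a live host:
#   df -P | filesystems_over 85
```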

7) “Secure Boot enabled, but kernel modules fail to load”

Symptoms: Drivers/modules won’t load after enabling Secure Boot; dmesg shows signature problems.

Root cause: Unsigned third-party modules and an operational process that never handled signing/enrollment.

Fix: Either sign modules properly and manage MOK keys, or keep Secure Boot off by policy—don’t run an undefined middle state.
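
To find the undefined middle state before it finds you, combine two facts per module: whether Secure Boot is on, and whether the module carries a signer. The sketch below assumes mokutil --sb-state prints a line containing "SecureBoot enabled" when it is on, and that modinfo -F signer prints nothing for an unsigned module. The function name and result labels are made up for illustration.

```shell
# Classify the signature risk for one kernel module.
# $1: output of `mokutil --sb-state`
# $2: output of `modinfo -F signer <module>` (empty when the module is unsigned)
module_signature_risk() {
  case "$1" in
    *"SecureBoot enabled"*)
      if [ -z "$2" ]; then
        echo "reject-likely"          # unsigned module with Secure Boot on
      else
        echo "check-enrollment"       # signed; the signing key must still be trusted/enrolled
      fi ;;
    *)
      echo "not-enforced" ;;          # Secure Boot off (or state unknown)
  esac
}

# Usage on a real host (not executed here; <module> is a placeholder):
#   module_signature_risk "$(mokutil --sb-state)" "$(modinfo -F signer <module>)"
```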

Checklists / step-by-step plan

Gold image baseline (do this once, automate forever)

  1. Install OL9 minimal where appropriate; avoid extra package groups unless you need them.
  2. Set hostname, DNS, and time zone (use UTC unless you have a legal reason not to).
  3. Choose kernel track:
    • UEK for typical Oracle Linux fleets.
    • RHCK if a vendor requires it or you need tight RHEL kernel alignment.
  4. Pin default kernel with grubby; remove the other kernel family if you don’t need it.
  5. Enable only required repos; verify with dnf repolist --enabled.
  6. Enable SELinux enforcing; fix policy issues instead of disabling.
  7. Configure chrony and verify time sync.
  8. Configure firewall: allow only what’s required; validate with ss and firewall-cmd.
  9. Partition sanely: separate /var for most server roles; ensure growth plan.
  10. Decide on Ksplice:
    • If using it, install agent, enable service, validate patch application.
    • If not using it, don’t pretend—remove packages and disable repos.
  11. Apply updates, then decide: restart services vs reboot. Use dnf needs-restarting to decide (-r reports whether a reboot is required).
  12. Reboot once during build validation to ensure it returns cleanly on the intended kernel.
  13. Capture “known good” state: kernel, enabled repos, loaded modules, listening ports, filesystem layout.
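
Step 13 can be sketched as a small capture script that writes one text file per fact. The function names and file names are hypothetical; tools that are not installed are skipped rather than treated as errors.

```shell
# Run a command and save its output, skipping tools that are not installed.
# $1: output file; remaining arguments: the command to run.
snap() {
  out=$1; shift
  command -v "$1" >/dev/null 2>&1 || return 0
  "$@" > "$out" 2>/dev/null || rm -f "$out"   # drop the file if the command failed
  return 0
}

# Capture a "known good" snapshot into a directory, one file per fact.
capture_baseline() {
  dir=$1
  mkdir -p "$dir" || return 1
  snap "$dir/kernel.txt"         uname -r
  snap "$dir/default-kernel.txt" grubby --default-kernel
  snap "$dir/repos.txt"          dnf -q repolist --enabled
  snap "$dir/modules.txt"        lsmod
  snap "$dir/ports.txt"          ss -H -tuln
  snap "$dir/mounts.txt"         findmnt --real -n -o TARGET,SOURCE,FSTYPE
}

# Usage on a real host (not executed here; the path is an example):
#   capture_baseline /var/lib/baseline/$(date +%F)
```

Raw lsmod output includes use counts, so expect some noise if you diff that file later; trimming it to module names is a reasonable refinement.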

Recurring operations (weekly rhythm)

  • Review pending security advisories (dnf updateinfo list --security).
  • Apply userland updates regularly; restart affected services deliberately.
  • Apply Ksplice updates when available (if policy allows).
  • Schedule periodic reboots (monthly/quarterly) even with live patching.
  • Audit drift: kernel family, repo list, firewall services, SELinux mode.
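
The drift audit is mostly a diff against something you saved earlier. A minimal sketch, assuming each host keeps a directory with one text file per fact (kernel, repos, and so on) and you capture a fresh copy in the same layout; the layout and function name are assumptions.

```shell
# List facts that differ between a saved baseline and a fresh capture.
# $1: baseline directory; $2: current directory; both hold one .txt file per fact.
drift_report() {
  for f in "$1"/*.txt; do
    [ -e "$f" ] || continue           # no baseline files at all
    name=${f##*/}
    # A missing or different file in the current capture counts as drift.
    cmp -s "$f" "$2/$name" 2>/dev/null || echo "DRIFT: $name"
  done
}

# Usage on a real host (not executed here; paths are examples):
#   drift_report /var/lib/baseline/gold /var/lib/baseline/today
```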

Pre-maintenance window sanity checks (per host)

  • uname -r matches the kernel version in the path printed by grubby --default-kernel.
  • dnf check-update reviewed; kernel and critical libraries noted.
  • dnf needs-restarting reviewed after updates.
  • Console access verified for remote sites.
  • Backups/snapshots validated for stateful systems.
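
The first check on this list is easy to get subtly wrong, because grubby --default-kernel prints a path (/boot/vmlinuz-<version>) while uname -r prints only the version. A sketch of the comparison, with a hypothetical function name:

```shell
# Succeed only if the default boot entry is the kernel running right now.
# $1: output of `uname -r`
# $2: output of `grubby --default-kernel` (a /boot/vmlinuz-<version> path)
default_matches_running() {
  [ "${2##*/vmlinuz-}" = "$1" ]       # strip everything up to "vmlinuz-" and compare
}

# Usage on a real host (not executed here):
#   default_matches_running "$(uname -r)" "$(grubby --default-kernel)" \
#     || echo "default boot entry differs from the running kernel"
#   dnf needs-restarting -r || echo "reboot required"
```

A mismatch is not always a fault: right after a kernel update it is expected. The point is to see it before the window, not after the reboot.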

FAQ

1) Should I use UEK or RHCK on Oracle Linux 9?

Default to UEK for most Oracle Linux fleets, especially if you run Oracle workloads or want newer kernel features. Use RHCK when a third-party vendor certifies only against the RHEL-compatible kernel line or your org demands tight RHEL kernel alignment.

2) Can I install both UEK and RHCK “just in case”?

You can, but you’re buying confusion. If you keep both, you must pin the default kernel and verify it continuously. Otherwise you’ll reboot into a different kernel family and call it “random.”

3) Does Ksplice eliminate all reboots?

No. It reduces urgent kernel-reboot pressure for many CVE-class fixes, but you still need reboots for some kernel changes, firmware updates, driver changes, and general lifecycle hygiene.

4) Why does grubby --default-kernel matter if uname -r looks right?

Because uname -r is “what you booted,” while grubby is “what you will boot.” Outages love the gap between those two sentences.

5) Should I enable automatic updates on OL9 servers?

Enable automatic security updates only if you have a tested rollback story and you understand your change control requirements. At minimum, automate reporting of security advisories and keep maintenance windows frequent.
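
For the "automate reporting" half, a severity summary is usually enough to decide whether the next window can wait. This sketch assumes each row of dnf updateinfo list --security looks like "<advisory-id> <Severity>/Sec. <package>"; verify that against your own output before relying on it. The function name is hypothetical.

```shell
# Count pending security advisories by severity.
# stdin: output of `dnf -q updateinfo list --security`
# Assumption: the 2nd column looks like "Important/Sec."; other rows are ignored.
advisories_by_severity() {
  awk '$2 ~ /\/Sec\.$/ {
         sev = $2
         sub(/\/Sec\.$/, "", sev)     # "Important/Sec." -> "Important"
         count[sev]++
       }
       END { for (s in count) print s, count[s] }' | sort
}

# Usage on a real host (not executed here):
#   dnf -q updateinfo list --security | advisories_by_severity
```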

6) What’s the smallest set of repos I should keep enabled?

Typically BaseOS and AppStream, plus UEK if you run it, plus Ksplice if you’re entitled and using it. Anything else should be justified by a specific package need and reviewed periodically.

7) Is SELinux enforcing realistic in production?

Yes. Most standard services are fine out of the box. When you hit issues, fix contexts or write targeted policy. Disabling SELinux because one custom daemon complained is the fastest way to get “mysterious” incidents later.
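
SELinux has the same "now versus next boot" gap as the kernel: getenforce reports the running mode, while /etc/selinux/config decides what the next boot uses. A sketch that reads the configured value (hypothetical function name):

```shell
# Print the SELINUX= value from config file text on stdin.
# Comment lines and SELINUXTYPE= do not match; if the key repeats, the last one wins.
selinux_configured_mode() {
  sed -n 's/^SELINUX=\([a-z]*\).*/\1/p' | tail -n 1
}

# Usage on a real host (not executed here):
#   [ "$(getenforce)" = "Enforcing" ] \
#     && [ "$(selinux_configured_mode < /etc/selinux/config)" = "enforcing" ] \
#     || echo "SELinux is not enforcing now, or will not be after reboot"
```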

8) What filesystem layout do you recommend for a general-purpose server?

UEFI + separate /boot, LVM for flexibility, /var separate for growth and log containment. For heavy logging or databases, consider splitting /var/log or placing data on dedicated volumes.
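
Whether a host actually follows that layout is quick to verify from the mount table. A sketch that takes /proc/mounts-style text on stdin; the function name is hypothetical and the paths in the usage line are just the ones discussed above.

```shell
# Succeed if a path is its own mount point.
# $1:    path to check
# stdin: mount table in /proc/mounts format (device, mount point, fstype, ...)
is_mountpoint() {
  awk -v p="$1" '$2 == p { found = 1 } END { exit !found }'
}

# Usage on a real host (not executed here):
#   for p in /boot /var; do
#     is_mountpoint "$p" < /proc/mounts || echo "$p is not a separate filesystem"
#   done
```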

9) How do I know if I should worry about Secure Boot?

If you’re in regulated environments or you care about boot-chain integrity, it’s a strong control. If you require third-party kernel modules, plan module signing and MOK enrollment first, or Secure Boot will become a surprise outage.

Next steps you can do this week

  1. Pick your kernel standard (UEK or RHCK) and write it down in one place that matters: build pipeline + policy.
  2. Add two acceptance checks to every host build: uname -r matches the standard, and the kernel path printed by grubby --default-kernel points at that same version.
  3. Decide on Ksplice intentionally. If you use it, monitor it and prove it patches. If you don’t, remove it and stop pretending.
  4. Make updates predictable: constrain repos, standardize patch cadence, and use needs-restarting to avoid random reboots.
  5. Run the fast diagnosis playbook on one healthy host and capture “normal.” That baseline makes future anomalies obvious.

If you do nothing else: enforce kernel boot hygiene and repo hygiene. Most “mystery” production problems are just undocumented choices finally collecting interest.
