You don’t install an enterprise Linux because you’re bored. You install it because you need a box that boots every time,
takes patches without drama, and doesn’t turn your on-call rotation into a lifestyle choice.
Rocky Linux 10 is that kind of distribution: RHEL-compatible, predictable, and free of subscription gymnastics.
The catch is that “free” doesn’t mean “automatic.” A Rocky install can be rock-solid or quietly cursed depending on
choices you make in the first 30 minutes: disk layout, boot mode, repositories, SELinux, and how you validate the result.
This is a field guide for doing it the boring, correct way—the way production likes it.
What you’re actually building (and what you’re not)
Rocky Linux 10 is the “I want RHEL behavior without RHEL paperwork” move. If you’ve spent time in enterprise Linux,
you already know why that matters: your vendors certify against a platform, your auditors expect certain knobs,
and your operational runbooks assume systemd, SELinux, predictable package names, and a long support window.
But let’s be clear about the deal:
- You get RHEL-compatible userland and kernel behavior, a familiar packaging stack (dnf/rpm), and enterprise defaults.
- You don’t get vendor support contracts by default, proprietary management integrations, or subscription-gated repos.
- You absolutely must do your own baseline validation: boot mode, disk scheme, updates, time sync, logs, and security posture.
If your environment needs official vendor support for regulatory or contractual reasons, buy it where it matters.
If your environment needs “works the same, patches the same, and doesn’t surprise us,” Rocky is a solid choice.
One operational truth: most outages aren’t caused by exotic kernel bugs. They’re caused by assumptions—about storage,
about boot order, about repo state, about “we’ll harden later.” Rocky Linux doesn’t prevent those assumptions. You do.
Interesting facts and historical context
These aren’t trivia night bullets. They help explain why Rocky exists and why it behaves the way it does.
- RHEL clones are a recurring pattern. The ecosystem has long had rebuilds of enterprise Linux to capture RHEL compatibility without direct licensing friction.
- CentOS used to be the default “free RHEL-like.” For years, many orgs standardized on CentOS for server fleets because it tracked RHEL closely.
- CentOS Stream changed expectations. When Stream became the focus, it shifted from “downstream rebuild” to a rolling preview of what might land in RHEL.
- Rocky Linux was created in response to that shift. A large segment of the industry needed a stable downstream-compatible platform again.
- The “enterprise Linux contract” is social as much as technical. Stability means low surprise: ABI expectations, kernel cadence, and conservative defaults.
- dnf replaced yum for a reason. Dependency resolution, modularity, and performance improvements weren’t cosmetic; they addressed years of pain in fleet patching.
- SELinux is not optional in serious environments. It’s one of the big reasons “RHEL-like” systems are deployable at scale without turning every service into a bespoke snowflake.
- systemd standardized service behavior across distributions. Love it or hate it, it made service supervision, logging, and dependency management more consistent.
- UEFI + GPT is now the default reality. If you still design like it’s BIOS+MBR everywhere, you’re building time bombs for modern server platforms.
Preflight decisions that prevent outages
1) Decide: UEFI or legacy BIOS (and don’t mix vibes)
If the hardware is remotely modern, use UEFI. It’s not about fashion; it’s about consistent boot management,
better GPT support, and fewer “why did the bootloader vanish after a firmware update” mysteries.
Mixing install media boot modes (installing in legacy BIOS on one host, UEFI on another) creates operational drift.
Drift becomes confusion, and confusion becomes tickets at 03:00.
2) Decide: storage layout by failure domain, not by habit
Your storage plan should answer two questions:
- What fails? Disk, controller, node, rack, cloud volume, human.
- What breaks when it fails? Boot, root filesystem, logs, database, container images.
Simple guidance that holds up in production:
- Boot reliability: Keep /boot straightforward. If you’re doing RAID, ensure your bootloader supports it cleanly.
- Operational control: Put /var on its own logical volume if the box will log heavily or run containers. Log storms should not fill root.
- Safety rails: If you can’t afford downtime, use redundancy (hardware RAID or mdraid) for boot/system disks; don’t pretend cloud snapshots are RAID.
- Performance: Separate hot write paths (databases, journal-heavy apps) from everything else. If you can’t, at least monitor I/O and plan capacity.
3) Decide: LVM or “plain partitions”
Use LVM for almost every server. It makes resizing sane, supports snapshots (with caveats), and gives you a controlled
abstraction boundary. Plain partitions are fine for immutable appliances, but most “temporary” servers live for years.
4) Decide: filesystem choices (XFS vs ext4) with intent
XFS is a common enterprise default for good reasons: scalable, robust, and well-supported. ext4 is perfectly fine too.
Pick one and standardize unless you have a specific workload reason to diverge.
5) Decide: networking plan (static, DHCP reservations, or dynamic)
Servers that run production services should have predictable addressing. That can be static config or DHCP reservations,
but “it’ll get whatever IP” is not a plan.
6) Decide: how you’ll patch (and how you’ll roll back)
Patching is not an event; it’s a pipeline. Decide now:
- Do you update automatically (dnf-automatic), or do you batch and approve?
- What’s your kernel update policy?
- Do you snapshot VM disks before patching? Do you have a tested rollback method?
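If you choose the automatic route, dnf-automatic is configured in one small file. A sketch of the relevant knobs (option names match dnf-automatic’s stock config; verify against the version you actually install):

```ini
# /etc/dnf/automatic.conf (sketch; confirm against your installed dnf-automatic)

[commands]
upgrade_type = security   # or "default" to take everything, not just security errata
apply_updates = yes       # "no" = download/notify only; batch-and-approve shops want "no"

[emitters]
emit_via = stdio          # or motd/email, depending on how you actually watch hosts
```

Then enable the timer (`systemctl enable --now dnf-automatic.timer`) and treat its output as part of your patch pipeline, not as a fire-and-forget setting.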
Treat patching like flossing—everyone agrees it’s good, and most outages happen because someone “meant to start tomorrow.”
Installation walkthrough: UEFI, storage, and sane defaults
Step 0: verify install media and boot mode
Before you click “Install,” confirm you’re booted the way you intend (UEFI vs legacy). On many servers, the boot menu
will show two entries for the same USB ISO—choose the UEFI one unless you have a reason not to.
Step 1: pick a minimal package set unless you need a GUI
For servers, choose a minimal install. Every extra package is:
- another update,
- another potential vulnerability,
- another “why is this service listening on a port we didn’t know about” moment.
Step 2: time, locale, and keyboard (the quiet foot-guns)
Set the timezone correctly and enable time synchronization later. Certificates, Kerberos, log correlation, and incident
timelines all depend on clocks that don’t lie.
Step 3: disk partitioning that won’t haunt you
If you’re installing on a single disk VM, the simplest safe layout is:
- UEFI System Partition (ESP), small
- /boot, small
- LVM PV for the rest
- Logical volumes for /, /var, and swap (swap sizing depends on workload; don’t cargo-cult it)
If you’re installing on physical servers with two system disks, you have two common approaches:
- Hardware RAID1 for system disks: simplest operationally; controller becomes a dependency.
- Software RAID1 (mdraid): transparent and portable across controllers; you need to be careful with bootloader/UEFI placement.
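Either way, the layout belongs in code, not in muscle memory. A Kickstart storage sketch for the single-disk case above (VG/LV names and sizes here are illustrative, not canonical; adjust to your gold layout):

```text
# Kickstart storage sketch: ESP + /boot + LVM, matching the layout above.
# "rl", "root", "var" and the sizes are placeholders, not requirements.
clearpart --all --initlabel --disklabel=gpt
part /boot/efi --fstype=efi --size=600
part /boot --fstype=xfs --size=1024
part pv.01 --fstype=lvmpv --size=1 --grow
volgroup rl pv.01
logvol /    --vgname=rl --name=root --fstype=xfs --size=40960
logvol /var --vgname=rl --name=var  --fstype=xfs --size=81920
logvol swap --vgname=rl --name=swap --fstype=swap --size=8192
```

A fragment like this is also what makes “compare a broken host to a known-good host” possible later: every machine has the same geometry by construction.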
Step 4: network and hostname
Set a real hostname. If your naming convention is a mess, fix the convention, not the hostnames after deployment.
Hostnames show up in logs, monitoring, certificates, and people’s brains.
Step 5: user setup and SSH
Create a non-root admin user. Allow SSH with keys. Keep password SSH off unless you have a specific need, and even then,
constrain it with network policy and MFA where possible.
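A drop-in is the low-drama way to enforce that. This is a sketch, not a compliance baseline; validate with `sshd -t` and keep a console session open before restarting sshd:

```text
# /etc/ssh/sshd_config.d/50-hardening.conf (sketch; adjust to your org policy)
PermitRootLogin no
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
MaxAuthTries 3
```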
Step 6: SELinux should be Enforcing
Keep SELinux in Enforcing. If you disable it “just for now,” you will forget, and the box will become a special case.
Special cases are where pagers go to breed.
Step 7: first boot validation
After first boot, do not immediately install your app stack. Validate the base system: boot mode, disk health,
repo configuration, updates, time sync, and logging.
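That validation is worth scripting, and it’s easier to test if you separate fact collection from policy. A minimal sketch (the collection commands in the comments are the usual suspects, not gospel; extend the checks to match your baseline):

```shell
# First-boot gate as a pure function: collect facts, then assert policy.
# Expected values (UEFI, Enforcing, yes) mirror the checklist in this guide.
check_baseline() {
    boot_mode="$1"; selinux="$2"; time_sync="$3"; fail=0
    [ "$boot_mode" = "UEFI" ]    || { echo "FAIL boot mode: $boot_mode"; fail=1; }
    [ "$selinux" = "Enforcing" ] || { echo "FAIL selinux: $selinux"; fail=1; }
    [ "$time_sync" = "yes" ]     || { echo "FAIL time sync: $time_sync"; fail=1; }
    return "$fail"
}

# On a live host, collect the inputs roughly like this:
#   mode=$( [ -d /sys/firmware/efi ] && echo UEFI || echo BIOS )
#   sel=$(getenforce)
#   sync=$(timedatectl show -p NTPSynchronized --value)
# Demo with hypothetical values:
check_baseline UEFI Enforcing yes && echo "baseline OK"
```

Because the policy is a function over plain values, you can unit-test it without a freshly installed host in the loop.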
Post-install: updates, repos, security baseline, and service hygiene
Repos: keep it boring and intentional
Enterprise Linux lives and dies by repository discipline. You want:
- a known set of repos,
- consistent packages across hosts,
- predictable update behavior.
Don’t enable random third-party repos on production servers because a blog said so. If you need extra packages, make a conscious
decision: mirror them, pin them, and test updates in a staging environment that resembles production.
Updates: patch early, then regularly
Your first real action after install should be to update, then reboot if the kernel or critical libraries changed.
This sets the baseline and avoids the “it’s been running since install day” myth.
Security baseline: SSH, firewalld, and SELinux policy
Set SSH to keys, lock down root login, keep firewalld on, and use SELinux the way it’s meant to be used:
as a safety layer that limits blast radius.
Logging: keep logs, don’t drown in them
Configure journald persistence if you need it, forward logs to central storage, and cap local retention so the host doesn’t
self-DoS via disk fill.
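For journald, a drop-in like the following caps local retention; the sizes are placeholders for whatever your /var budget actually allows:

```ini
# /etc/systemd/journald.conf.d/retention.conf (sketch; size to your /var budget)
[Journal]
Storage=persistent
SystemMaxUse=1G
MaxRetentionSec=1month
```

Apply with `systemctl restart systemd-journald` and confirm with `journalctl --disk-usage`.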
Quote (paraphrased)
Everything fails eventually; resilience comes from designing for failure, not hoping it won’t happen.
— Werner Vogels (paraphrased)
Practical tasks: commands, output meaning, and decisions (12+)
These are the checks I actually run after installing a RHEL-compatible system. Each one ends with a decision,
because commands without decisions are just performance art.
Task 1: Confirm OS version and platform identity
cr0x@server:~$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="10.0 (Red Quartz)"
ID="rocky"
ID_LIKE="rhel fedora"
VERSION_ID="10.0"
PLATFORM_ID="platform:el10"
PRETTY_NAME="Rocky Linux 10.0 (Red Quartz)"
ANSI_COLOR="0;32"
LOGO="rocky"
What it means: You’re on Rocky 10, and it advertises RHEL-family compatibility.
Decision: If PLATFORM_ID isn’t EL10, stop and confirm you didn’t install the wrong ISO or a derivative you don’t want.
Task 2: Verify boot mode (UEFI vs legacy)
cr0x@server:~$ test -d /sys/firmware/efi && echo UEFI || echo BIOS
UEFI
What it means: Presence of EFI firmware directory indicates UEFI boot.
Decision: If you expected UEFI and got BIOS, reinstall correctly now. Don’t “fix it later.” Later is expensive.
Task 3: Inspect block devices and partition map
cr0x@server:~$ lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINTS
sda 200G disk
├─sda1 600M part vfat /boot/efi
├─sda2 1G part xfs /boot
└─sda3 198.4G part LVM2_member
├─rl-root 40G lvm xfs /
├─rl-var 80G lvm xfs /var
└─rl-swap 8G lvm swap [SWAP]
What it means: You have a clean UEFI + /boot + LVM layout, and /var is isolated.
Decision: If /var is not separated on a log-heavy host (containers, databases, CI runners), consider rebuilding before data lands.
Task 4: Check filesystem usage and the “/var filled root” trap
cr0x@server:~$ df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/rl-root xfs 40G 2.3G 38G 6% /
/dev/mapper/rl-var xfs 80G 5.1G 75G 7% /var
/dev/sda2 xfs 960M 220M 740M 23% /boot
/dev/sda1 vfat 599M 6.2M 593M 2% /boot/efi
What it means: Healthy free space across the important mount points.
Decision: If root is small and /var is not separate, you’re one log spike away from a read-only root filesystem.
Task 5: Confirm fstab correctness (boots matter more than beauty)
cr0x@server:~$ cat /etc/fstab
UUID=2f6c4d9b-4c1e-4ed6-a57b-3e1e6e2a9b0a / xfs defaults 0 0
UUID=fa1b70d8-5b0c-4b98-8e12-7c0d8f8195a2 /var xfs defaults 0 0
UUID=8d8a7d9f-9b02-4d5e-9f21-3d65d7f6e4bc /boot xfs defaults 0 0
UUID=0A1B-2C3D /boot/efi vfat umask=0077,shortname=winnt 0 2
/dev/mapper/rl-swap none swap defaults 0 0
What it means: UUID-based mounts reduce surprises if device names change.
Decision: If you see device paths like /dev/sda3 for critical mounts on physical servers, switch to UUIDs now.
Task 6: Verify repo status and avoid “mystery packages”
cr0x@server:~$ dnf repolist
repo id repo name
appstream Rocky Linux 10 - AppStream
baseos Rocky Linux 10 - BaseOS
extras Rocky Linux 10 - Extras
What it means: You have the core repos enabled. That’s usually what you want on day one.
Decision: If you see extra repos you didn’t approve, disable them before installing anything. Reproducibility beats novelty.
Task 7: Check for updates and understand the blast radius
cr0x@server:~$ dnf check-update
Last metadata expiration check: 0:12:31 ago on Fri 06 Feb 2026 10:14:20 AM UTC.
kernel.x86_64 6.12.0-1.el10_0 baseos
openssl-libs.x86_64 3.2.2-4.el10_0 baseos
systemd.x86_64 256.7-2.el10_0 baseos
What it means: Kernel, OpenSSL, and systemd updates are pending—this is not a “maybe later” category.
Decision: Patch now in a maintenance window; plan a reboot if kernel/systemd are updated. If this host is part of a cluster, roll it.
Task 8: Apply updates and capture what changed
cr0x@server:~$ sudo dnf -y update
Dependencies resolved.
================================================================================
Package Arch Version Repository Size
================================================================================
Upgrading:
kernel x86_64 6.12.0-1.el10_0 baseos 12 M
openssl-libs x86_64 3.2.2-4.el10_0 baseos 2.4 M
systemd x86_64 256.7-2.el10_0 baseos 4.1 M
Transaction Summary
================================================================================
Upgrade 3 Packages
Complete!
What it means: You updated core components.
Decision: Reboot soon. Don’t leave a new kernel installed but not running; that creates “it’s patched” illusions.
Task 9: Validate SELinux state (and don’t negotiate with it)
cr0x@server:~$ getenforce
Enforcing
What it means: SELinux is doing its job.
Decision: If it’s Permissive or Disabled, fix it now unless you have a documented exception with compensating controls.
Task 10: Confirm firewall status and open ports intentionally
cr0x@server:~$ sudo firewall-cmd --state
running
cr0x@server:~$ sudo firewall-cmd --list-services
ssh
What it means: firewalld is running and only SSH is allowed by service definition.
Decision: If you see wide-open services you don’t recognize, remove them. “We’ll restrict later” is how test ports become production features.
Task 11: Check time sync status (because logs must agree)
cr0x@server:~$ timedatectl
Local time: Fri 2026-02-06 10:28:01 UTC
Universal time: Fri 2026-02-06 10:28:01 UTC
RTC time: Fri 2026-02-06 10:28:01
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
What it means: The system clock is synced, NTP is active.
Decision: If sync is “no,” fix it before joining domains, issuing certs, or debugging anything distributed.
Task 12: Verify network configuration and DNS resolution
cr0x@server:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
ens192 UP 10.40.12.34/24 fe80::250:56ff:feaa:bbcc/64
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.40.12.10
DNS Servers: 10.40.12.10 10.40.12.11
DNS Domain: corp.example
What it means: You have an address, and DNS is pointed somewhere intentional.
Decision: If DNS points to a random resolver or doesn’t resolve internal zones, fix it now—package installs and service discovery will fail in creative ways.
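A quick way to encode that check: resolve every name the host actually depends on before declaring it healthy. The name list is yours to define; localhost is just the demo input:

```shell
# Sketch: fail fast if names this host depends on don't resolve.
# Pass your real dependencies, e.g.: check_dns corp.example mirrors.rockylinux.org
check_dns() {
    fail=0
    for name in "$@"; do
        if getent hosts "$name" > /dev/null 2>&1; then
            echo "OK   $name"
        else
            echo "FAIL $name"; fail=1
        fi
    done
    return "$fail"
}

check_dns localhost
```

Using getent (rather than dig or nslookup) matters: it exercises the same nsswitch path that applications and dnf actually use.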
Task 13: Confirm system services and catch accidental listeners
cr0x@server:~$ systemctl --failed
0 loaded units listed.
cr0x@server:~$ ss -lntp
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=991,fd=3))
LISTEN 0 4096 127.0.0.1:323 0.0.0.0:* users:(("chronyd",pid=812,fd=5))
What it means: Nothing failed; only expected ports are open.
Decision: If you see unexpected listeners, identify the package and disable the service before it becomes an incident report headline.
Task 14: Validate storage health (hardware exposure varies)
cr0x@server:~$ sudo dmesg -T | grep -E "I/O error|blk_update_request|reset|nvme|ata" | tail -n 8
[Tue Feb 6 10:21:22 2026] nvme nvme0: pci function 0000:5e:00.0
[Tue Feb 6 10:21:22 2026] nvme nvme0: 4/0/0 default/read/poll queues
[Tue Feb 6 10:21:22 2026] nvme0n1: p1 p2 p3
What it means: No obvious storage errors; device enumerated cleanly.
Decision: If you see resets or I/O errors here during install day, replace hardware or fix firmware before trusting this host with anything important.
Task 15: Kernel and reboot hygiene (avoid “patched but not rebooted”)
cr0x@server:~$ uname -r
6.12.0-1.el10_0.x86_64
cr0x@server:~$ sudo dnf -q repoquery --installed --latest-limit=1 kernel
kernel-6.12.0-1.el10_0.x86_64
What it means: Running kernel matches latest installed kernel.
Decision: If these don’t match after an update, schedule a reboot and record it. “We updated” doesn’t count until you’re running it.
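This comparison is easy to automate. A sketch, assuming repoquery output shaped like the task above (name-version.arch; strip the epoch first if your dnf prints one):

```shell
# Sketch: "patched but not rebooted" as a one-line policy check.
# Arg 1: `uname -r` output; arg 2: newest installed kernel package name.
needs_reboot() {
    [ "kernel-$1" != "$2" ]
}

# Live usage (flags as in the task above):
#   needs_reboot "$(uname -r)" "$(dnf -q repoquery --installed --latest-limit=1 kernel)"
# Demo with the values from the task output:
if needs_reboot "6.12.0-1.el10_0.x86_64" "kernel-6.12.0-1.el10_0.x86_64"; then
    echo "reboot required"
else
    echo "running kernel is current"
fi
```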
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company migrated a fleet of internal services from an aging CentOS setup to a RHEL-compatible rebuild.
They treated the OS install as a formality and focused on application deployment.
The assumption was simple: “Disk space is disk space. The defaults are fine.”
The app stack was container-heavy. Logs were noisy but manageable—until a debugging flag got left on in one service.
Within hours, /var ballooned. On their “defaults are fine” install, /var lived on the root filesystem.
Root hit 100%, and the system started failing in ways that looked unrelated: SSH logins hung, systemd units timed out,
and the node stopped accepting new container pulls.
The on-call initially chased CPU and network. Graphs were loud; the disk was quietly screaming.
Eventually someone ran df, saw root full, and deleted logs. The host recovered—sort of.
They then got to enjoy the sequel: corrupted partial downloads and a container runtime that needed a reset.
The fix was not “tell developers to log less” (though yes, also that). The real fix was a storage layout that assumed failure:
separate /var, cap journald retention, and ship logs off-host. The incident ended up being an install lesson,
not an application lesson, which is a special kind of annoying.
Mini-story 2: The optimization that backfired
Another org had a performance mandate: faster CI builds, faster artifact downloads, faster everything.
Someone proposed an “optimization”: mount the build workspace on a single large filesystem, no separate volumes,
and tune for throughput. They also disabled SELinux because “it’s slowing down file operations.”
The CI did get a bit faster. Then it got weird. Occasional permission anomalies appeared, but only under load.
A few builds failed with filesystem errors; retries succeeded. Teams blamed the CI tooling, the network,
and once, briefly, the moon.
The root cause was a cocktail: aggressive tuning plus a workspace that produced massive metadata churn,
combined with a log pipeline that occasionally spiked disk writes. With no isolation between /, /var,
and the build workspace, one workload’s worst day became everyone’s worst day.
Disabling SELinux didn’t “fix performance”; it just removed a guardrail, and the eventual security review forced them
to re-enable it under pressure—during a quarter-end release window. That was… a choice.
They unwound the “optimization”: separate logical volumes, sane I/O scheduling defaults, SELinux Enforcing,
and a measured approach to tuning with benchmarks that matched reality.
The build times were slightly slower than the “fastest” configuration, but the failure rate dropped hard.
In production operations, fastest is rarely the same as best.
Mini-story 3: The boring but correct practice that saved the day
A finance-adjacent team ran Rocky-like systems for internal services.
Their practice was painfully dull: every install followed a checklist, every host had the same partitioning,
repos were pinned, and updates were staged through a canary ring.
It wasn’t glamorous. It was effective.
One morning, a batch of hosts started failing to boot after a firmware update performed by a different team.
The failure mode varied by hardware model. Some systems dropped into a boot prompt; others didn’t find the EFI entry.
This is where you usually see panic and people “fixing” things by hand on the console.
They recovered quickly because their install discipline gave them predictable state:
UEFI everywhere, consistent ESP sizes, consistent bootloader configuration, and a standard way to verify and rebuild EFI entries.
They could compare a broken host to a known-good host and apply the same recovery steps without guesswork.
The result wasn’t a heroic all-nighter. It was a calm incident with a clean postmortem.
The best compliment operations can earn is: “That was boring.” Boring infrastructure is like a seatbelt—you only notice it when you don’t have it.
Fast diagnosis playbook
When a fresh Rocky Linux 10 install “feels slow,” “won’t update,” or “can’t reach things,” don’t wander.
Triage is a sequence. Your goal is to identify the bottleneck class in minutes.
First: identify the failure domain (boot, network, storage, CPU/memory, repos)
- Boot issues: stuck at bootloader, emergency mode, missing root, filesystem check failures.
- Network issues: DNS failures, can’t reach repos, no default route, intermittent packet loss.
- Storage issues: high iowait, journal errors, full filesystems, device resets.
- Compute issues: load average high, OOM kills, runaway process.
- Repo/package issues: dependency conflicts, metadata timeouts, GPG errors.
Second: run the three fastest discriminators
1) Disk space and inode sanity
cr0x@server:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rl-root 40G 2.3G 38G 6% /
/dev/mapper/rl-var 80G 5.1G 75G 7% /var
Interpretation: If any critical filesystem is above ~90%, treat it as an incident cause until proven otherwise.
Decision: Free space, expand LV, or cap logs before doing anything else.
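That threshold rule is scriptable. A sketch that reads `df -P`-style output from stdin, so the same code runs against a live host (`df -P | ...`) or against captured output; 90 is this playbook’s threshold, tune to taste:

```shell
# Sketch: print any mount at or above a usage threshold (percent).
check_usage() {
    awk -v t="$1" 'NR > 1 {
        use = $5; sub(/%/, "", use)           # strip the % sign from Use%
        if (use + 0 >= t) print $6, "at", use "%"
    }'
}

# Demo against captured output (root deliberately near-full):
sample='Filesystem 1024-blocks Used Available Capacity Mounted
/dev/mapper/rl-root 41943040 39000000 2900000 93% /
/dev/mapper/rl-var 83886080 5242880 78643200 7% /var'

printf '%s\n' "$sample" | check_usage 90
```

Using `df -P` (POSIX format) keeps the column positions stable, which is what makes the awk field numbers safe to rely on.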
2) DNS and default route
cr0x@server:~$ ip route
default via 10.40.12.1 dev ens192 proto static metric 100
10.40.12.0/24 dev ens192 proto kernel scope link src 10.40.12.34 metric 100
cr0x@server:~$ getent hosts mirrors.rockylinux.org
203.0.113.20 mirrors.rockylinux.org
Interpretation: If DNS lookup fails, repo operations and many “random” things fail.
Decision: Fix DNS/routing before blaming dnf or the OS.
3) iowait and top offenders
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 312544 42152 812344 0 0 120 220 310 640 6 3 89 2 0
0 1 0 309812 42152 812512 0 0 320 5400 290 610 4 2 65 29 0
0 1 0 310120 42160 812600 0 0 280 5100 300 620 5 2 63 30 0
0 0 0 311004 42160 812900 0 0 140 260 305 635 5 2 90 3 0
1 0 0 310888 42168 812930 0 0 150 240 320 650 6 3 88 3 0
Interpretation: wa (iowait) spiking suggests storage contention or device issues.
Decision: If iowait is high, stop tuning CPU. Investigate disks, queueing, and write-heavy services.
Third: go one layer deeper based on what you found
- If boot is failing: check the journal for the previous boot, check /etc/fstab, verify EFI entries.
- If dnf is failing: verify repos, DNS, clock, and GPG keys; inspect dnf error text like a grownup.
- If performance is bad: identify whether it’s CPU, memory pressure, or storage latency; don’t guess.
Common mistakes: symptoms → root cause → fix
1) “dnf update times out”
Symptoms: Cannot download repomd.xml, slow metadata, intermittent failures.
Root cause: DNS misconfiguration, proxy issues, or broken default route.
Fix: Validate routing and DNS; confirm time sync; then retry.
cr0x@server:~$ curl -I -m 5 https://mirrors.rockylinux.org
HTTP/2 200
date: Fri, 06 Feb 2026 10:45:12 GMT
content-type: text/html
Decision: If curl can’t reach it, dnf won’t either. Fix network path first.
2) “After reboot, it drops into emergency mode”
Symptoms: systemd emergency shell, message about failing to mount a filesystem.
Root cause: Bad /etc/fstab entry, wrong UUID, missing disk, or race with network mounts.
Fix: Use systemctl status and journal messages to identify the mount unit, then correct fstab.
cr0x@server:~$ systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
var.mount loaded failed failed /var
Decision: If it’s a local mount, fix UUIDs. If it’s a network mount, add proper dependencies or nofail with sane timeouts.
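For the network-mount case, the fix usually looks like this fstab sketch (the server and export path are hypothetical; keep local mounts strict, because a silently missing /var is worse than a boot failure you notice):

```text
# Sketch: network mount that must not wedge boot if the server is away.
# nfs1.corp.example:/export is a hypothetical server/path.
nfs1.corp.example:/export  /mnt/data  nfs  defaults,nofail,_netdev,x-systemd.mount-timeout=30s  0 0
```

`nofail` keeps boot moving, `_netdev` orders the mount after the network, and the systemd mount timeout bounds how long a dead server can stall things.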
3) “SSH works, but everything else can’t connect”
Symptoms: You can SSH in, but outbound HTTPS fails or internal services aren’t reachable.
Root cause: Missing default route, wrong subnet mask, or firewall blocking outbound (less common).
Fix: Inspect routes and interface config, confirm gateway reachability.
cr0x@server:~$ ping -c 2 10.40.12.1
PING 10.40.12.1 (10.40.12.1) 56(84) bytes of data.
64 bytes from 10.40.12.1: icmp_seq=1 ttl=64 time=0.388 ms
64 bytes from 10.40.12.1: icmp_seq=2 ttl=64 time=0.402 ms
--- 10.40.12.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss
Decision: If you can’t reach the gateway, stop blaming DNS and fix L2/L3 first.
4) “SELinux is blocking my service; I disabled it”
Symptoms: Service fails to start, AVC denials in logs, someone flips SELinux to permissive/disabled.
Root cause: Mislabelled files, non-standard ports, or service configured outside expected paths.
Fix: Read the denial, label correctly, allow the port type where appropriate, and keep SELinux enforcing.
cr0x@server:~$ sudo ausearch -m avc -ts recent | tail -n 3
----
type=AVC msg=audit(1738839012.412:911): avc: denied { name_connect } for pid=1420 comm="nginx" dest=8080 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:http_port_t:s0 tclass=tcp_socket
Decision: If it’s a legitimate connection, configure ports/labels; if it’s unexpected behavior, treat it as a security finding.
5) “The system is slow under load; CPU looks fine”
Symptoms: Latency spikes, load average elevated, CPU idle high, users complain.
Root cause: Storage latency and iowait, often from log bursts, swap thrash, or overloaded disks.
Fix: Identify top I/O processes, check device health, separate workloads, and adjust logging.
cr0x@server:~$ iostat -x 1 3
Device r/s w/s rkB/s wkB/s await svctm %util
sda 5.0 90.0 80.0 7200.0 28.50 1.10 99.0
Decision: If %util pins near 100% and await climbs, you’re storage-bound. Add IOPS, reduce writes, or redesign layout.
6) “We ran out of space but df says we have space”
Symptoms: Writes fail, but df -h shows free space.
Root cause: Inode exhaustion (rare on XFS, more plausible on ext4) or deleted-but-open files.
Fix: Check inode usage and open deleted files; restart offending services if needed.
cr0x@server:~$ df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/rl-var 524288 81234 443054 16% /var
cr0x@server:~$ sudo lsof +L1 | head
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
rsyslogd 901 root 5w REG 253,1 1048576 0 123 /var/log/messages (deleted)
Decision: If you find large deleted-but-open files, restart that service during a controlled window to release space.
Checklists / step-by-step plan
Checklist A: “I need a reliable Rocky Linux 10 server install”
- Confirm hardware firmware settings: UEFI enabled, secure boot policy consistent with your org.
- Boot installer in the intended mode (UEFI entry, not legacy).
- Choose minimal package set unless a GUI is required.
- Set hostname, timezone, and network addressing plan.
- Storage layout:
  - ESP + /boot + LVM
  - Separate /var for log/container-heavy systems
  - Redundancy for system disks (hardware RAID1 or mdraid) if downtime matters
- Create a non-root admin user; configure SSH keys.
- Keep SELinux Enforcing.
- First boot: verify OS release, boot mode, disk layout, and filesystem mounts.
- Enable and validate time sync.
- Validate repos; disable anything not approved.
- Patch fully; reboot; confirm running kernel matches installed kernel.
- Confirm firewall running; open only needed services/ports.
Checklist B: “Standardize installs across a fleet”
- Write a gold storage layout (LVM names, mount points, sizes, filesystem types).
- Define repo policy: which repos, whether mirrored, and how updates flow (dev → staging → prod).
- Decide baseline services: chronyd on, firewalld on, sshd hardened, unnecessary daemons off.
- Codify configuration with automation (Kickstart + config management). Manual installs don’t scale; they just multiply.
- Create validation checks: boot mode, SELinux, time sync, open ports, disk usage thresholds.
- Define break-glass: console access, rescue boot procedure, and how to recover EFI entries.
Checklist C: “Before putting an app on it”
- Space: verify / and /var headroom; set retention policies.
- Identity: verify hostname, DNS, and time sync.
- Security: SELinux enforcing; SSH keys; firewall policy present.
- Updates: patched, rebooted, and stable.
- Observability: confirm journal persistence as needed and log forwarding operational.
FAQ
1) Is Rocky Linux 10 “the same as RHEL”?
It’s RHEL-compatible in the ways that matter for most workloads: package ecosystem, behavior, enterprise defaults.
It’s not the same product, and it doesn’t come with the same vendor support contract by default.
2) Should I install a GUI on servers?
No, unless you have a strong operational reason. Headless servers are easier to patch, smaller attack surface,
fewer dependencies, fewer surprises. Use remote management tools and web UIs where appropriate.
3) XFS or ext4 for Rocky Linux 10?
XFS is a strong default for general server use and scales well. ext4 is also reliable. Standardize on one unless a workload
requires something specific. The bigger win is partitioning discipline, not filesystem religion.
4) Do I need swap?
Usually yes, but size it based on workload. Swap is not a performance feature; it’s a safety net. On memory-tight systems,
it can prevent an immediate crash while you diagnose. On latency-sensitive workloads, you may constrain it and rely on
proper memory sizing. Don’t blindly allocate “2x RAM” in 2026.
5) Should I disable SELinux if it causes issues?
No. Fix the policy or labeling issue. Disabling SELinux trades a short-term convenience for long-term fragility and risk.
If you absolutely must temporarily set permissive to debug, document it and put a ticket on your own backlog with teeth.
6) How do I keep installs consistent across environments?
Use Kickstart for the OS install and configuration management for state (users, SSH, sysctl, services, app config).
Then validate with automated checks. Humans are great at judgment calls and terrible at repetitive precision.
7) What’s the right disk layout for container hosts?
Separate /var (and sometimes /var/lib depending on runtime) so images, layers, and logs don’t fill root.
Plan for write amplification. Monitor IOPS, not just capacity. If you can put container storage on separate disks, do it.
8) How do I know if I’m CPU-bound or storage-bound?
Check vmstat for iowait and run queue, then look at iostat -x for device saturation.
High CPU usage with low iowait suggests CPU-bound; high iowait and high disk %util suggests storage-bound.
9) Can I join Rocky Linux 10 to a corporate directory?
Yes. The usual prerequisites apply: correct DNS, correct time sync, and a consistent hostname.
Directory joins fail more often due to clocks and DNS than due to the OS.
10) What’s the first thing to automate after install?
Baseline validation and patching workflow. If you automate only one thing, automate the part that prevents configuration drift:
repos, updates, SELinux state, firewall rules, time sync, and log policy.
Conclusion: practical next steps
Rocky Linux 10 gives you the RHEL-compatible operational shape many organizations rely on, without tying your basic ability
to patch a fleet to a subscription workflow. That’s the value. The risk is thinking the install is “done” when the installer
says so.
Next steps that pay off immediately:
- Standardize boot mode and disk layout (UEFI + GPT, separate /var for noisy hosts, LVM almost everywhere).
- Lock down repos and make updates a predictable pipeline (canary → staged rollout).
- Keep SELinux Enforcing and learn to read denials instead of rage-quitting.
- Validate after every install using the practical tasks above; turn them into automated checks.
- Adopt the fast diagnosis playbook so your team stops guessing and starts isolating bottlenecks quickly.
If you do those five things, Rocky Linux 10 becomes exactly what you wanted: a stable, enterprise-shaped OS that fades into the background
and lets you worry about the parts of production that are actually interesting. Like users.