The install went “fine.” Then the first reboot happened, the app team started load tests, and your new fleet developed mysterious slowdowns, weird DNS timeouts,
and logs that vanished exactly when you needed them. That’s not bad luck. That’s an installation that never got treated like a production change.
This is the checklist you use when you’re the person who gets paged. It’s opinionated. It assumes you care about repeatability, auditability, and storage that
doesn’t quietly sabotage you at 02:00.
Quick facts and context (why RHEL installs look the way they do)
- Anaconda is the RHEL installer, and it’s been around since the late 1990s. Its job is boring: be predictable on awful hardware.
- Kickstart automation isn’t a luxury; it’s how enterprises turned “snowflake servers” into cattle long before cloud made it fashionable.
- systemd became the default in RHEL 7. That’s why services, logs, and boot diagnostics now live in a single, scriptable universe.
- XFS became the default filesystem for RHEL 7 for good reasons: scale, maturity, and sane performance under big directory trees.
- SELinux has been in RHEL for decades. Many attacks fail not because SELinux is magical, but because it's annoyingly hard to bypass quietly.
- UEFI displaced BIOS as the modern boot standard; your boot failures increasingly look like “EFI variables” and “Secure Boot state,” not MBR voodoo.
- chrony replaced older NTP daemons because modern networks need faster convergence and better handling of laptops/VMs that suspend or drift.
- LUKS2 is the modern Linux disk encryption format. It’s not just for laptops; it’s for data centers where lost drives become compliance nightmares.
- cgroups v2 (the modern Linux resource control model) changes how CPU/memory limits behave; it matters if you run container stacks or strict quotas.
None of these facts are trivia. They explain why “just click through the installer” is a trap. RHEL is designed for repeatable operations at scale.
If you don’t install like you’re going to operate, you’re volunteering for a future incident report.
Decisions you must make before you boot the installer
1) What are you building: pet server, fleet node, or regulated system?
If it’s a fleet node, you need automation (Kickstart), standardized storage, and predictable networking. If it’s regulated, you need encryption, audit settings,
and a paper trail. If it’s a one-off “utility box,” fine—just don’t pretend it won’t become production.
2) Firmware and boot mode: UEFI or legacy?
Choose UEFI unless you have a hard compatibility reason not to. Legacy BIOS installs are still possible, but your vendor support and future upgrades will
increasingly assume UEFI. Also: Secure Boot policy should be a conscious decision, not a surprise.
3) Disk layout: LVM, plain partitions, or something custom?
For enterprise RHEL installs, the default sane choice is GPT + UEFI + LVM on XFS, with optional LUKS.
LVM gives you operational flexibility: resizing, adding volumes, separating write-heavy paths.
Avoid heroic partitioning schemes that “optimize performance” by hand unless you can defend them during an outage.
Future-you will not remember why /var got 12.7 GiB.
4) Encryption: do you need it, and can you unlock at scale?
Disk encryption is easy. Disk encryption with unattended boot, remote hands, and broken key escrow is not.
If you enable LUKS, decide the unlock mechanism: local passphrase (manual), network-bound disk encryption (where appropriate), or an orchestration-friendly method.
5) Network identity: hostnames, DNS, NTP, and IP plan
Pick hostnames that reflect function and environment. Decide whether you’ll use DHCP, static IPs, or DHCP reservations. Define DNS resolvers and search domains.
Decide on NTP sources and whether you need internal time servers.
6) Update strategy: “latest,” pinned, or staged?
In real enterprises you stage updates. You don’t let a fresh install pull whatever is newest at install-time and call it reproducible.
Decide: base image version, repo snapshotting strategy, and how you roll security errata.
7) Authentication: local users, LDAP/IdM, or SSO?
If you’re going to use centralized identity (common), plan it now. It changes sudo policy, SSH access, audit expectations, and incident response workflow.
Joke #1: Installing RHEL without deciding storage and identity up front is like buying a safe and then leaving the key under the mat.
Checklists / step-by-step plan (pre, install, post)
Pre-install checklist (do this before you touch Anaconda)
- Confirm hardware/VM specs: CPU, RAM, disk model, controller type, and NICs. Get the actual device names you’ll see in Linux.
- Confirm firmware settings: UEFI enabled, RAID mode vs HBA/JBOD mode, Secure Boot policy, virtualization extensions if relevant.
- Choose storage layout: partitions, LVM, mount points, filesystem types, and encryption plan.
- Define network config: VLANs, MTU, bonding/teaming, static routes, DNS, and time sources.
- Decide post-install repos and update method (and whether the box can reach them at install time).
- Prepare automation: a Kickstart file (a minimal sketch follows this list), or at least a written build sheet you can follow exactly.
- Record baseline: intended hostname, IP, serial/asset tag, owner, environment, and purpose.
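Here is a minimal Kickstart sketch of the kind of build sheet I mean. Everything environment-specific is a placeholder: the repo URLs, network identity, disk name (sda), volume sizes, the environment group, and the crypted password hash. Treat it as a starting shape, not a standard.
# Kickstart sketch: URLs, IPs, hostname, disk, sizes, and hash are placeholders
text
url --url="http://repo.corp.example/rhel10/BaseOS"
repo --name="AppStream" --baseurl="http://repo.corp.example/rhel10/AppStream"
lang en_US.UTF-8
keyboard us
timezone Etc/UTC --utc
network --bootproto=static --ip=10.20.30.40 --netmask=255.255.255.0 --gateway=10.20.30.1 --nameserver=10.20.0.53 --hostname=app01.corp.example
rootpw --lock
# replace the placeholder with a real crypted hash from your secrets tooling
user --name=breakglass --groups=wheel --iscrypted --password=<crypted-hash-placeholder>
selinux --enforcing
firewall --enabled --service=ssh
bootloader --boot-drive=sda
clearpart --all --initlabel --drives=sda
reqpart --add-boot
part pv.01 --size=1 --grow --ondisk=sda
volgroup rhel pv.01
logvol /     --vgname=rhel --name=root --size=81920 --fstype=xfs
logvol /var  --vgname=rhel --name=var  --size=51200 --fstype=xfs
logvol /tmp  --vgname=rhel --name=tmp  --size=8192  --fstype=xfs --fsoptions="nodev,nosuid,noexec"
logvol /home --vgname=rhel --name=home --size=10240 --fstype=xfs
logvol swap  --vgname=rhel --name=swap --size=8192
%packages
@^minimal-environment
%end
reboot
Run it through ksvalidator (from pykickstart) before you trust it in a pipeline.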
Installer checklist (choices that bite later)
- Verify you are installing in the intended boot mode (UEFI vs legacy) before partitioning.
- Set correct timezone and enable time sync (or at least plan for it post-install).
- Pick minimal package sets; add what you need intentionally.
- Do not disable SELinux to “make it work.” Fix policy or labels.
- Set root password policy and create an admin user with sudo. Prefer key-based SSH.
- Ensure the correct target disk(s) are selected. Triple-check in multi-disk servers.
Post-install checklist (first boot is where the real install starts)
- Update to your baseline patch level (staged, controlled).
- Lock down SSH: keys, allowed users/groups, disable password auth where feasible.
- Confirm SELinux is enforcing; fix any mislabels from image/custom scripts.
- Configure firewall zones and explicit service openings.
- Set up persistent logging and remote log shipping if required.
- Validate storage: filesystem options, fstab correctness, LVM health, and IO scheduler expectations.
- Validate time sync and name resolution. Broken DNS will masquerade as “random slowness.”
- Baseline performance: CPU steal (VMs), IO latency, and network throughput. Capture initial metrics.
- Register to your management plane (Satellite, internal repos, config mgmt) and mark the build immutable.
Hands-on tasks (commands, outputs, what it means, what you decide)
These are the “walk the box” checks I expect on a new RHEL 10 system before I trust it with anything that has a pager attached.
Each task includes: command, sample output, what it means, and what decision you make.
Task 1: Confirm OS version and install provenance
cr0x@server:~$ cat /etc/redhat-release
Red Hat Enterprise Linux release 10.0 (Plow)
What it means: You’re on the expected major/minor release (and not an older golden image someone “recycled”).
Decision: If the release isn’t what you intended, stop. Don’t “upgrade later” as a recovery plan for a bad base install.
Task 2: Confirm boot mode (UEFI vs legacy)
cr0x@server:~$ test -d /sys/firmware/efi && echo UEFI || echo Legacy
UEFI
What it means: The system booted in UEFI mode. That affects partitioning, GRUB config, and certain recovery workflows.
Decision: If you expected UEFI but got Legacy, reinstall now. Mixed fleets make operations brittle.
Task 3: Inspect block devices and identify the real boot disk
cr0x@server:~$ lsblk -o NAME,MODEL,SIZE,TYPE,FSTYPE,MOUNTPOINTS
NAME MODEL SIZE TYPE FSTYPE MOUNTPOINTS
sda PERC H755 1.8T disk
├─sda1 600M part vfat /boot/efi
├─sda2 1G part xfs /boot
└─sda3 1.8T part LVM2_member
├─rhel-root 80G lvm xfs /
├─rhel-var 50G lvm xfs /var
├─rhel-tmp 8G lvm xfs /tmp
└─rhel-home 10G lvm xfs /home
What it means: You can see the EFI System Partition, separate /boot, and an LVM PV hosting logical volumes.
Decision: Validate mount points match intent. If /var isn’t separate on log-heavy systems, consider fixing early (migration later is pain).
Task 4: Check filesystem types and options (performance and safety knobs)
cr0x@server:~$ findmnt -no TARGET,SOURCE,FSTYPE,OPTIONS / /var /tmp /home
/ /dev/mapper/rhel-root xfs rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k
/var /dev/mapper/rhel-var xfs rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k
/tmp /dev/mapper/rhel-tmp xfs rw,nosuid,nodev,noexec,relatime,seclabel
/home /dev/mapper/rhel-home xfs rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k
What it means: /tmp is hardened (nosuid,nodev,noexec). XFS is in play. SELinux labels are enabled.
Decision: Keep /tmp hardened unless you have a specific build toolchain that truly needs exec in /tmp—then fix that toolchain instead of weakening the OS.
Task 5: Validate LVM health and free space for growth
cr0x@server:~$ vgs
VG #PV #LV #SN Attr VSize VFree
rhel 1 4 0 wz--n- 1.79t 1.64t
What it means: Plenty of free space in the VG. Good: you can extend /var or add app-specific volumes without a ticket to Storage Team Theater.
Decision: If VFree is tiny, you either sized volumes too big or forgot future growth. Fix now, before data exists.
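That free space is only useful if you spend it deliberately. A sketch of the later growth path, using the VG/LV names from the lsblk output above and an arbitrary 20G increment:
cr0x@server:~$ sudo lvextend -r -L +20G /dev/rhel/var
The -r flag grows the filesystem in the same step (xfs_growfs for XFS); without it, run xfs_growfs /var yourself afterwards.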
Task 6: Confirm encryption state (LUKS) where required
cr0x@server:~$ lsblk -o NAME,TYPE,FSTYPE,MOUNTPOINTS
sda disk
├─sda1 part vfat /boot/efi
├─sda2 part xfs /boot
└─sda3 part crypto_LUKS
└─luks-2b3c... crypt LVM2_member
└─rhel-root lvm xfs /
What it means: The root PV is inside a LUKS container. That’s real at-rest encryption, not “we locked the datacenter door.”
Decision: Validate you have a workable unlock process (manual console, remote KVM, or enterprise-approved method). If not, you’ve built a future outage.
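To check what unlock methods actually exist on the container (device name matches the layout above), dump the LUKS header and read the Keyslots and Tokens sections:
cr0x@server:~$ sudo cryptsetup luksDump /dev/sda3
One lonely passphrase keyslot and no tokens means every reboot depends on a human at a console.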
Task 7: Check kernel, boot parameters, and microcode state
cr0x@server:~$ uname -r
6.12.0-xx.el10.x86_64
cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.12.0-xx.el10.x86_64 root=/dev/mapper/rhel-root ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M rhgb quiet
What it means: You're on a known kernel line; the cmdline shows a crashkernel memory reservation (needed for kdump).
Decision: If you run latency-sensitive workloads, consider whether “quiet/rhgb” belongs in prod (I usually remove them on servers to make boot logs visible).
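If you decide verbose boot belongs on your servers, grubby can strip those arguments from every installed kernel (a sketch; confirm the resulting cmdline afterwards):
cr0x@server:~$ sudo grubby --update-kernel=ALL --remove-args="rhgb quiet"
cr0x@server:~$ sudo grubby --info=DEFAULT | grep args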
Task 8: Confirm SELinux enforcement
cr0x@server:~$ getenforce
Enforcing
What it means: SELinux is doing its job: enforcing, not just logging.
Decision: If it’s Permissive/Disabled, fix it before apps land. Otherwise your “security baseline” is a memo, not a control.
Task 9: Confirm firewall state and what’s actually exposed
cr0x@server:~$ sudo firewall-cmd --state
running
cr0x@server:~$ sudo firewall-cmd --get-active-zones
public
interfaces: ens192
cr0x@server:~$ sudo firewall-cmd --zone=public --list-services
ssh
What it means: Firewalld is active; only SSH service is allowed in public zone.
Decision: Don’t open ports “temporarily” and forget. If a service is needed, define it explicitly per zone and capture it in automation.
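When a service legitimately needs to be reachable, add it permanently and reload; https here is only an example service name:
cr0x@server:~$ sudo firewall-cmd --permanent --zone=public --add-service=https
cr0x@server:~$ sudo firewall-cmd --reload
cr0x@server:~$ sudo firewall-cmd --zone=public --list-services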
Task 10: Validate DNS and resolver behavior (fast failures beat slow mysteries)
cr0x@server:~$ resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.20.0.53
DNS Servers: 10.20.0.53 10.20.0.54
DNS Domain: corp.example
What it means: systemd-resolved is managing DNS; you have two resolvers; a search domain is set.
Decision: If DNSSEC or split DNS is required, plan it now. If resolvers are unreachable, fix before you blame “the network.”
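Two quick checks that catch most resolver problems before applications do; the hostname is a placeholder for something that matters in your environment:
cr0x@server:~$ resolvectl query repo.corp.example
cr0x@server:~$ getent hosts repo.corp.example
resolvectl shows which protocol answered and how long it took; getent exercises the same NSS path your applications will use.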
Task 11: Validate time sync (authentication and distributed systems depend on it)
cr0x@server:~$ chronyc tracking
Reference ID : 0A140035 (10.20.0.53)
Stratum : 3
Ref time (UTC) : Thu Feb 05 12:10:28 2026
System time : 0.000123456 seconds fast of NTP time
Last offset : -0.000045678 seconds
RMS offset : 0.000210000 seconds
Frequency : 12.345 ppm fast
Leap status : Normal
What it means: Chrony is synced, small offset, stable stratum. This prevents TLS failures, Kerberos weirdness, and log correlation misery.
Decision: If Leap status isn’t Normal or offsets are large, fix NTP reachability or firewall rules. Don’t “ignore time drift.”
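To see which time sources chrony can reach and which one it selected, check the source list as well:
cr0x@server:~$ chronyc sources -v
A '*' marks the selected source; a column full of '?' entries usually means NTP traffic is being blocked somewhere.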
Task 12: Confirm journald persistence (logs that survive reboots)
cr0x@server:~$ grep -E '^(Storage|SystemMaxUse)=' /etc/systemd/journald.conf
Storage=persistent
SystemMaxUse=1G
cr0x@server:~$ journalctl --disk-usage
Archived and active journals take up 312.0M in the file system.
What it means: Logs persist across reboot and have a cap. You’ve chosen to keep evidence without letting logs eat the root filesystem.
Decision: If Storage is volatile, change it. If SystemMaxUse is unlimited, set a cap and ship logs off-host if you need retention.
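If you have to set this yourself, a drop-in is cleaner than editing the main file and easier to manage from automation; the 1G cap is a choice, not a rule:
cr0x@server:~$ sudo mkdir -p /etc/systemd/journald.conf.d
cr0x@server:~$ printf '[Journal]\nStorage=persistent\nSystemMaxUse=1G\n' | sudo tee /etc/systemd/journald.conf.d/50-persistent.conf
cr0x@server:~$ sudo systemctl restart systemd-journald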
Task 13: Check system resource basics (memory pressure, swap sanity)
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 31Gi 1.2Gi 28Gi 120Mi 1.8Gi 29Gi
Swap: 8.0Gi 0B 8.0Gi
What it means: Plenty of headroom; swap exists and isn’t already in use.
Decision: For many server workloads, some swap is still valuable as a safety net. Avoid “swapoff everywhere” policies unless you understand the failure mode.
Task 14: Validate IO stack and spot obvious latency before apps arrive
cr0x@server:~$ iostat -xz 1 3
Linux 6.12.0-xx.el10.x86_64 (server) 02/05/2026 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.20 0.00 0.55 0.10 0.00 98.15
Device r/s w/s rkB/s wkB/s r_await w_await aqu-sz %util
dm-0 0.20 1.50 8.0 64.0 1.90 2.13 0.01 0.04
What it means: Low r_await/w_await and near-zero utilization; the disk isn't currently the bottleneck.
Decision: If await spikes into tens/hundreds of ms under mild load, stop and investigate storage controller, multipath, or underlying SAN performance.
Task 15: Confirm update channels and what will actually be patched
cr0x@server:~$ sudo dnf repolist
repo id repo name
rhel-10-baseos RHEL 10 BaseOS (Staged)
rhel-10-appstream RHEL 10 AppStream (Staged)
rhel-10-extras RHEL 10 Extras (Staged)
What it means: Repos are configured and likely point to your staged mirror/snapshot.
Decision: If repos point to random sources or are missing, fix it now. Update behavior is part of your security posture.
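Before you patch to baseline, look at what the configured repos would actually change:
cr0x@server:~$ sudo dnf updateinfo list --security
cr0x@server:~$ sudo dnf check-update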
Task 16: Confirm kdump is configured (you won’t care until you really care)
cr0x@server:~$ systemctl is-enabled kdump
enabled
cr0x@server:~$ sudo kdumpctl status
Kdump is operational
What it means: If the kernel panics, you can capture a crash dump for real diagnosis.
Decision: On critical systems, keep kdump. On tiny VMs, you may choose to disable it to reclaim RAM—document that choice.
Storage and filesystems: layouts that survive real life
The default layout I trust (and why)
On general-purpose enterprise servers, I like:
/ (root) sized for OS and tools, /var separate for logs and mutable state,
/tmp separate and hardened, and optional dedicated volumes for databases or queue systems.
Use XFS unless you have a specific, validated reason not to.
The operational goal isn’t elegance. It’s blast-radius control. When an app goes feral and writes infinite logs, /var fills up—not /.
When a build process abuses /tmp, it can't turn it into an executable staging area by accident.
Partitioning and LVM: pick flexibility over prophecy
People over-partition because they want to prevent disk-full incidents. That’s noble, and it usually causes disk-full incidents.
The correct approach is separate high-risk mount points plus LVM free space for controlled growth.
Put it plainly: you cannot predict which directory will grow. You can predict which ones hurt when they do.
UEFI boot partitions: keep them boring
For UEFI: you need an EFI System Partition (vfat) mounted at /boot/efi. Keep it standard sized (hundreds of MB).
Keep /boot separate (XFS is fine) if you’re using LUKS for root—because bootstrapping encrypted root needs clarity.
Encryption: make it real, make it operable
Encrypting data at rest is often a policy requirement. The trap is building a system that can’t reboot unattended after power maintenance,
because someone put the unlock passphrase in a spreadsheet and then rotated it.
If you do LUKS, decide (a network-bound unlock sketch follows this list):
- Where keys live and who can retrieve them during an incident.
- What happens when the host is moved, replaced, or restored from backup.
- Whether remote console access exists during boot (for manual unlock).
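For unattended boot, one common pattern is network-bound disk encryption with clevis and a tang server. A sketch only: the tang URL and device are placeholders, the initramfs may also need network configuration, and you must test a real reboot from a console you can reach.
cr0x@server:~$ sudo dnf install -y clevis clevis-luks clevis-dracut
cr0x@server:~$ sudo clevis luks bind -d /dev/sda3 tang '{"url": "http://tang.corp.example"}'
cr0x@server:~$ sudo clevis luks list -d /dev/sda3
cr0x@server:~$ sudo dracut -f --regenerate-all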
Swap: boring, unfashionable, still useful
Swap is not a performance feature. It’s a failure-mode shaper. A small amount of swap can keep a system alive long enough to page you,
ship logs, and recover gracefully instead of killing processes at random.
Mount options: security and behavior, not folklore
Use nodev,nosuid,noexec where it makes sense (e.g., /tmp). Use noatime only if you understand the side effects;
relatime is usually fine and default. Avoid tuning based on internet cargo cult.
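As a concrete example, the hardened /tmp shown in the findmnt task earlier corresponds to an fstab entry like this (device path matches that LVM layout):
/dev/mapper/rhel-tmp  /tmp  xfs  defaults,nodev,nosuid,noexec  0 0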
Security baseline: SELinux, firewall, crypto, and SSH reality
SELinux: enforcing is the baseline, not the aspiration
SELinux gets blamed for failures it merely exposes. When an app tries to write to the wrong directory or bind the wrong port,
SELinux says “no” and you find out early. That’s a feature.
If you must troubleshoot, use targeted tools: check audit logs, confirm contexts, and apply minimal policy changes.
“Disable SELinux” is the sysadmin version of “remove the smoke alarm because it’s loud.”
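The triage loop in practice looks like this; the path and boolean below are examples, not the answer to your specific denial:
cr0x@server:~$ sudo ausearch -m avc -ts recent
cr0x@server:~$ ls -Z /srv/app
cr0x@server:~$ sudo restorecon -Rv /srv/app
cr0x@server:~$ sudo setsebool -P httpd_can_network_connect on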
Firewall: default deny with explicit intent
On servers, I want firewalld running with minimal services exposed. If you need to open ports, do it deliberately and capture it in automation.
It’s not paranoia; it’s reducing your scan surface.
SSH: keys, tight access, and no mystery users
Use key-based auth for admin access where possible. Disable root SSH login. Limit who can SSH in (AllowUsers or AllowGroups).
For break-glass, use an audited path and document it.
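A sketch of those settings as an sshd drop-in (RHEL's sshd_config includes /etc/ssh/sshd_config.d by default; the group name is a placeholder, and make sure your automation account is in an allowed group before you reload):
cr0x@server:~$ sudo tee /etc/ssh/sshd_config.d/40-hardening.conf <<'EOF'
PermitRootLogin no
PasswordAuthentication no
AllowGroups linux-admins
EOF
cr0x@server:~$ sudo sshd -t && sudo systemctl reload sshd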
Crypto policies: don’t fight the platform
RHEL has system-wide crypto policy management. That’s good: it keeps TLS settings consistent across OpenSSL consumers.
Don’t let random app install scripts weaken crypto policies to “make an old client connect.”
Fix the client, or isolate that legacy system until it can be replaced.
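Check and set the system-wide policy deliberately instead of letting application installers touch TLS settings; DEFAULT below is the stock policy, not a recommendation for your environment:
cr0x@server:~$ update-crypto-policies --show
DEFAULT
cr0x@server:~$ sudo update-crypto-policies --set DEFAULT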
Quote (paraphrased idea) from Werner Vogels: “You build it, you run it” — reliability improves when builders own operational consequences.
Logging and observability: keep the evidence
Persistent journald is non-negotiable for production debugging
If journald is volatile, logs disappear on reboot—the exact moment after kernel updates, power events, or “we changed one small thing.”
Persist the journal, cap its disk usage, and ship important logs off-box if compliance or incident response demands retention.
Log volume isolation: /var as a safety barrier
Separate /var is a classic boring practice because it works. It contains the mess: package caches, logs, spools, and state.
When /var fills, the OS can often keep running long enough to fix it. When / fills, everything becomes exciting.
Metrics baseline: take a snapshot when things are healthy
The best time to collect baseline performance metrics is right after install, before the workload changes.
Capture: CPU idle/steal, IO latency, network throughput, and memory pressure. Later, when someone says “it’s slower,” you’ll have a reference.
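A simple way to capture that snapshot somewhere you can diff later; the intervals, counts, and output path are arbitrary choices:
cr0x@server:~$ ts=$(date +%Y%m%d-%H%M)
cr0x@server:~$ { uptime; free -h; mpstat -P ALL 1 5; iostat -xz 1 5; ip -s link; } | sudo tee /var/log/baseline-$ts.txt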
Joke #2: Logs are like backups—everyone loves them after they realize they don’t have them.
Networking: DNS, time, MTU, bonding, and “why is it slow”
DNS: the root cause of half your “intermittent” incidents
Misconfigured resolvers cause slow application startup, random API timeouts, and confusing failovers.
Validate forward and reverse resolution where required. Know whether systemd-resolved is in use and how it integrates with your tooling.
Time sync: the silent dependency
Time drift breaks TLS, authentication, distributed tracing, and your ability to correlate logs across systems.
Chrony should be configured and verified. Don’t accept “it’s probably fine.”
MTU and VLANs: a classic performance foot-gun
Jumbo frames can help in some environments, and they can also create black-hole packet loss if the path doesn’t support them end-to-end.
If you set MTU 9000, validate it across switches, hypervisors, and NICs. Otherwise you’ll get bizarre partial failures.
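A quick end-to-end check: send a don't-fragment ping sized for jumbo frames (8972 bytes of payload plus 28 bytes of headers is 9000) toward something on the far side of the path; the target IP is a placeholder:
cr0x@server:~$ ping -M do -s 8972 -c 3 10.20.30.1
If this fails while a normal ping succeeds, the path does not carry jumbo frames end-to-end.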
Bonding/teaming: redundancy requires testing
Don’t declare victory because both links show “UP.” Test failover. Pull a cable (or disable a vNIC) and verify the host keeps connectivity.
Also verify your switch-side configuration matches the bonding mode.
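Check the kernel's view of the bond before, during, and after the failover test (bond0 is a placeholder name):
cr0x@server:~$ cat /proc/net/bonding/bond0
cr0x@server:~$ ip -d link show bond0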
Fast diagnosis playbook (find the bottleneck quickly)
When a freshly installed RHEL 10 host is “slow,” don’t start by reinstalling. Start by isolating the bottleneck domain:
CPU, memory, disk, network, or configuration/identity. This sequence finds 80% of issues fast.
First: is the system fundamentally healthy?
- Check boot errors: journalctl -b -p err
- Check basic load: uptime, top or htop
- Check disk full conditions: df -h, df -i
Second: is it CPU/VM scheduling?
- Look for CPU steal (VMs): mpstat -P ALL 1 5 (steal %)
- Check throttling: systemd-cgtop for cgroup pressure
Third: is it IO latency?
- Measure: iostat -xz 1 5 (await, %util)
- Find offenders: pidstat -d 1, iotop if available
- Validate multipath/SAN: multipath -ll (if used)
Fourth: is it the network (or DNS pretending to be network)?
- DNS latency: resolvectl query your-service and check timing
- Packet loss: ping -c 20 gateway
- Throughput: ss -s (socket health), ip -s link (drops/errors)
Fifth: is it policy/config friction?
- SELinux denies: ausearch -m avc -ts recent
- Firewall blocks: firewall-cmd --list-all plus service logs
- Time drift/TLS: chronyc tracking, certificate errors in app logs
The trick is discipline. Don’t chase five hypotheses at once. Pick a domain, prove it guilty or innocent, then move on.
Common mistakes: symptom → root cause → fix
1) Symptom: Random timeouts to internal services
Root cause: DNS resolver misconfiguration, missing search domain, or systemd-resolved stub confusion.
Fix: Validate resolvectl status, ensure resolvers are reachable, set correct domain/search, and test queries explicitly.
2) Symptom: Reboot after updates and now the system won’t boot
Root cause: Wrong boot mode (installed Legacy, expected UEFI), or EFI partition not properly created/mounted.
Fix: Confirm UEFI with /sys/firmware/efi. If wrong, reinstall in correct mode. If EFI partition is missing, repair bootloader from rescue media.
3) Symptom: Logs disappear after reboot
Root cause: journald configured with volatile storage (default in some minimal builds), or /var/log not persistent in an image pipeline.
Fix: Set Storage=persistent in journald.conf, create /var/log/journal, and cap usage with SystemMaxUse.
4) Symptom: “Disk full” takes down the whole host
Root cause: Single root filesystem; /var and / are the same; log storms fill /. Sometimes inode exhaustion.
Fix: Separate /var; monitor df -i; cap journals; logrotate; ship logs off-host.
5) Symptom: Application can’t bind to a port or write to a directory
Root cause: SELinux denies due to wrong context or missing boolean; or firewall blocks inbound connections.
Fix: Check AVC denials via ausearch, correct contexts with restorecon, set required booleans, and open firewall ports intentionally.
6) Symptom: Network is “up” but large transfers stall or are flaky
Root cause: MTU mismatch (jumbo frames not supported end-to-end) or bonding misconfiguration.
Fix: Validate MTU across path; test with pings using DF bit; confirm switch configuration matches bonding mode; test failover.
7) Symptom: Performance is terrible only on VMs
Root cause: CPU steal due to overcommit, storage contention on shared datastores, or missing virtio optimizations.
Fix: Measure steal with mpstat, check IO latency with iostat, coordinate with virtualization team for resource placement and storage tiering.
8) Symptom: Host can’t authenticate to corporate services
Root cause: Time drift, wrong hostname, or missing reverse DNS expectations in Kerberos-like environments.
Fix: Fix time sync first; ensure hostname/FQDN correct; validate forward/reverse DNS; then retry enrollment.
Three corporate mini-stories (what actually goes wrong)
Mini-story 1: The incident caused by a wrong assumption
A team rolled out a new RHEL 10-based jump host pool for production access. They’d done “the normal stuff”: patched, SSH hardened, firewalld running.
Access looked fine in light testing. Then the on-call rotation started using the hosts during a real incident.
Sessions would hang for 20–40 seconds when engineers ran commands that touched internal hostnames. After that, everything would “catch up.”
The first guess was network congestion. The second guess was a broken bastion tool. The third guess was “RHEL 10 DNS is weird.”
The root cause was painfully simple: the install assumed the corporate search domain would be provided by DHCP everywhere. In that environment it wasn’t.
Most hostnames in scripts and day-to-day shell habits were unqualified, so every lookup tried the wrong suffixes, hit timeouts, and only then resolved.
The fix was equally simple: explicitly configure the search domain and resolver list, validate with resolvectl, and add a basic DNS latency check to the build pipeline.
The lesson wasn’t “DHCP is bad.” The lesson was “assumptions are configuration you forgot to write down.”
Mini-story 2: The optimization that backfired
An infrastructure group wanted faster builds for CI runners on RHEL 10. Someone proposed disabling persistent journald because “disk writes are slow” and
“we don’t need logs on ephemeral runners.” It was pitched as a performance optimization and a storage saver. It got merged.
Two weeks later, a subset of runners started failing builds intermittently. The failures weren’t consistent—sometimes package installs broke,
sometimes network fetches died, sometimes tests timed out. The only thing consistent was that rebooting a runner “fixed” it for a while.
It turned out there was a real underlying issue: a NIC driver/firmware interaction on a particular hardware batch caused occasional link flaps.
Journald would have shown a clean timeline of link down/up events, DHCP renewals, and service restarts. But journald was volatile and the boxes rebooted themselves
during automated remediation. So the evidence evaporated.
They re-enabled persistent journald with a tight disk cap, added remote log shipping for critical events, and the debugging time collapsed from days to hours.
The “optimization” didn’t save meaningful IO. It saved the illusion of IO while costing observability, which is usually the most expensive thing to lose.
Mini-story 3: The boring but correct practice that saved the day
A finance-adjacent application cluster ran on RHEL with strict audit requirements. The build standard looked almost comically conservative:
separate /var, swap present, kdump enabled, SELinux enforcing, firewall minimal, journald persistent with caps, and a standardized LVM layout with lots of VG free space.
No one loved it. Everyone tolerated it.
During a busy quarter, a third-party component went rogue and started logging verbose debug output to /var/log at a high rate after a minor config change.
Within hours, /var hit 100% usage. On many systems, that’s where you get a full-host failure and a long night.
Here, the host stayed up. Root filesystem had space. SSH still worked. systemd still worked. Monitoring still worked.
The on-call engineer logged in, confirmed /var was full, rotated and truncated the offending logs, and restarted the component under a corrected configuration.
The postmortem wasn’t glamorous. The build standard got no applause. But it prevented an outage that would have cascaded into missed batch jobs and delayed reporting.
The boring practice—separating /var and capping logs—did exactly what boring practices are supposed to do: reduce the size of disasters.
FAQ
1) Should I use a GUI install or minimal install for RHEL 10 servers?
Minimal, unless you have a strong reason. Fewer packages means fewer CVEs, less patch churn, and fewer weird dependencies.
You can always add tools later; removing a GUI stack later is rarely worth the cleanup.
2) Is XFS always the right filesystem?
For most enterprise server use cases, yes. XFS scales well and behaves predictably under load. If you need a specific feature (like certain small-file patterns or snapshots),
justify it with tests and operational support plans, not opinions.
3) Do I really need a separate /var?
If the system will run anything that logs, spools, caches, or stores mutable state (so: almost everything), yes.
Separate /var is cheap insurance against log storms and runaway caches taking out the OS.
4) Should I encrypt root disks on servers?
If you have regulatory or risk requirements, yes. If not, it’s still worth considering for systems with sensitive data.
But only do it if you can operate it: unlocking, key rotation, and recovery must be designed, not improvised.
5) Can I disable SELinux to speed up deployment?
You can, but you shouldn’t. Disabling SELinux tends to hide misconfigurations until later, when the blast radius is larger.
Use permissive mode briefly for diagnosis if needed, then return to enforcing with proper labels/policy.
6) What’s the baseline for SSH hardening without breaking automation?
Use key-based auth, disable root login, and restrict allowed users/groups. Keep a break-glass path, but make it auditable.
If automation needs access, use dedicated service accounts with scoped sudo rules—not shared keys.
7) How should I handle updates on freshly installed systems?
Align them with your staged repos and patch windows. Don’t let fresh installs pull arbitrary latest packages from wherever.
Standardize on a known baseline, then roll forward in controlled waves.
8) What’s the first thing to check when “the network is slow”?
DNS and MTU. Validate resolver latency and configuration, then check for drops/errors on interfaces.
Next, look at path MTU if large transfers are failing. Many “slow network” complaints are name resolution delays.
9) Do I need kdump on all servers?
On critical systems, yes. On small VMs with tight RAM budgets, maybe not.
Make it a deliberate decision: either keep it for diagnosability or disable it and accept reduced crash insight.
Next steps you can do today
If you want your RHEL 10 installs to stop being “works on my box” artifacts and start being production assets, do three things:
standardize decisions, automate the build, and validate with a repeatable health check.
- Write down your baseline: boot mode, storage layout, encryption stance, DNS/NTP, SELinux/firewall posture, and update strategy.
- Turn it into automation: Kickstart plus post-install configuration management. If you can’t rebuild it, you don’t own it.
- Run the tasks section as a gating checklist: treat failures like failed tests, not “we’ll fix it after go-live.”
- Add the fast diagnosis playbook to on-call docs: the best time to teach debugging is before the incident, not during it.
Production isn’t where you prove your installer clicks. It’s where you prove your defaults. Make them good.