AlmaLinux 10 Install: Enterprise Linux with a Clean Upgrade Path

Most “Linux install guides” assume your job ends when you see a login prompt. In production, the install is where you either buy yourself years of predictable upgrades—or you quietly plant a time bomb that goes off during the first incident call.

AlmaLinux 10 is a solid choice when you want Enterprise Linux semantics without playing subscription games. But the winning move isn’t “install it.” The winning move is installing it in a way that makes upgrades boring, rollback possible, and troubleshooting fast.

Why AlmaLinux 10 (and what you’re really choosing)

AlmaLinux is an Enterprise Linux distribution aimed at being compatible with the RHEL ecosystem. In plain ops language: you get the packaging conventions, systemd behaviors, SELinux defaults, and admin muscle memory that enterprises standardize on. That matters more than the logo.

AlmaLinux 10’s real value is not “free.” It’s predictability. Predictable boot flow (UEFI), predictable security posture (SELinux on), predictable update tooling (DNF), predictable lifecycle patterns (major upgrades are events, minor updates are maintenance).

But compatibility cuts both ways. If you install it like a hobby distro—single giant filesystem, no thought about /var, no plan for UEFI, no baseline—then every “minor” incident becomes a full-contact sport.

Here’s the opinionated stance: install AlmaLinux 10 like you expect to operate it for 5–10 years. If you can’t commit to that, put it in a container, or pick something you’re willing to rebuild frequently.

Interesting facts & historical context (so you don’t repeat history)

  • Enterprise Linux clones used to be “set and forget.” Then upstream policy shifts made “clone strategy” a board-level concern, not a nerd hobby.
  • UEFI became the default boot story because BIOS boot is basically a museum exhibit that still powers your payroll system.
  • SELinux went from “turn it off” to “it saved us.” Modern policy tooling and better defaults made enforcing mode the sane baseline again.
  • DNF replaced YUM to fix dependency resolution and performance issues; the “yum” command is mostly compatibility glue now.
  • systemd won because consistent service management beats a thousand artisanal init scripts, especially during outages.
  • OpenSSH defaults tightened over time (algorithms, root login expectations). Old “just allow everything” configs become upgrade blockers.
  • XFS became the default enterprise filesystem for large-volume performance and consistency; ext4 is still fine, but defaults matter.
  • Kickstart and PXE installs became the adult way to build fleets because “click ops” doesn’t scale or audit.

One paraphrased idea from the reliability world that’s worth pinning to your monitor: “Hope is not a strategy,” attributed to Gen. Gordon R. Sullivan. In ops, “we’ll remember the install steps” is hope with extra steps.

Joke #1: If your only backup is “we can reinstall,” congratulations—you’ve invented disaster recovery by vibes.

Decisions that matter before you click “Begin Installation”

1) UEFI + Secure Boot: decide deliberately

UEFI should be your default unless you’re trapped on legacy hardware. For Secure Boot, be honest: do you need it because of compliance, or because you actually manage boot-chain integrity? If you enable it, you must keep your kernel/module story boring—no random unsigned drivers later.

Operational advice: choose Secure Boot if your platform team can support it consistently. Mixed fleets where some nodes are Secure Boot and some aren’t tend to produce “works on my host” outages.

2) Storage layout: you’re really designing failure domains

Your install-time partitioning choices determine what fills up first, what can be snapshotted, and what blocks upgrades. The classic production outage is / or /var hitting 100% at 3 a.m. because logs, container layers, or a runaway spool grew without bound.

Opinion: separate /var if the system will run databases, containers, CI runners, or anything chatty. Put /var/log on its own only when you have a real reason and monitoring to match. Otherwise you create a second failure mode: “/var/log is full so nothing logs.”

3) Filesystem choice: pick the boring winner

XFS is a great default for server volumes. Ext4 is also fine, particularly for smaller boot/root partitions. If you want ZFS, you’re choosing a different lifecycle and tooling model; it can be fantastic, but don’t pretend it’s “just like XFS.”

4) Time sync and DNS: do not wing it

Time and DNS are the two services everyone assumes are fine—right up until Kerberos fails, TLS certs look “not yet valid,” or your package manager can’t resolve repositories. Ensure NTP/chrony and resolver settings are part of the post-install baseline.

5) Identity and access: local users are a smell

In enterprises, local admin users accumulate like dust. Use centralized auth (SSSD/LDAP/Kerberos) where appropriate, and keep a break-glass local account with controlled access. If you’re in smaller environments, keep it simple: at least require SSH keys, disable password auth, and log everything.
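If you go the simple route, that SSH posture fits in a small drop-in. A minimal sketch (the filename is my choice; any *.conf under sshd_config.d is included ahead of the main config, and sshd keeps the first value it sees for each option):

```
# /etc/ssh/sshd_config.d/50-hardening.conf  (filename is illustrative)
PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes
# Validate before reloading: sshd -t, then systemctl reload sshd
```

Task 13 below shows how to verify the effective values with sshd -T after a reload.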

6) Upgrade path: plan it now, not later

A “clean upgrade path” is half policy, half architecture:

  • Policy: staged rollouts, pinned repositories, and change windows that include rollback time.
  • Architecture: config management, idempotent bootstrap scripts, and separation of data from OS where feasible.

If your application state lives inside /usr/local with hand-edited files and mystery binaries, no distro can save you. AlmaLinux won’t fix your relationship with entropy.

Installation paths: ISO, Kickstart, and golden images

Interactive ISO install (good for first node, bad for fleets)

Use the interactive installer to validate hardware compatibility, storage assumptions, and NIC naming. Then stop. Don’t build ten servers by clicking the same screens ten times and trusting your memory.

Kickstart (the grown-up option)

Kickstart gives you versionable, reviewable installs. That means:

  • Auditable partitioning and package selection.
  • Predictable network naming and baseline services.
  • A path to PXE provisioning and zero-touch rebuilds.

When you get the inevitable “we need to rebuild node 14 right now,” Kickstart turns panic into a routine.
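A skeleton of what a Kickstart looks like, matching the storage opinions in this article (disk name, sizes, and the package set are illustrative; this is a sketch, not a complete production profile):

```
# ks.cfg -- minimal sketch
text
reboot
# Storage: UEFI + LVM, separate /var (sizes are examples)
ignoredisk --only-use=sda
clearpart --all --initlabel
part /boot/efi --fstype=efi  --size=600
part /boot     --fstype=ext4 --size=1024
part pv.01     --grow
volgroup almalv pv.01
logvol /    --vgname=almalv --name=root --fstype=xfs --size=61440
logvol /var --vgname=almalv --name=var  --fstype=xfs --size=122880
# Keep the package set minimal; config management adds the rest
%packages
@^minimal-environment
%end
```

Keep the file in version control and review it like code, because it is.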

Golden images (use carefully)

Golden images are great for clouds and hypervisors, but they can bake in problems: stale machine IDs, duplicated SSH host keys, and weird udev/network artifacts. If you go this route, you need a first-boot initialization step that resets identity and re-seals secrets properly.
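A sketch of that first-boot identity reset, written as a shell function so it can be dry-run against a scratch directory (the function name and structure are mine; in practice this runs from a one-shot systemd unit or cloud-init, as root, with an empty prefix):

```shell
# Hypothetical first-boot identity reset (sketch).
# Pass a root prefix: "" for the real system, a scratch dir for testing.
reset_identity() {
  root="${1:-}"
  # An empty /etc/machine-id makes systemd generate a fresh one at boot.
  : > "${root}/etc/machine-id"
  # Remove cloned SSH host keys; sshd regenerates them on next start.
  rm -f "${root}"/etc/ssh/ssh_host_*_key "${root}"/etc/ssh/ssh_host_*_key.pub
}
```

Add whatever else your template bakes in (DHCP leases, monitoring agent IDs) to the same function, so identity reset stays one auditable place.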

Storage layouts that survive audits and outages

Baseline layout for general-purpose servers (recommended)

For a typical VM or bare-metal server running a few services:

  • /boot (ext4): small, stable.
  • /boot/efi (vfat): UEFI system partition.
  • / (XFS): OS and binaries.
  • /var (XFS): logs, caches, spools, containers.
  • /home optional (XFS): if humans actually log in (try not to).

LVM on top of RAID (hardware or mdadm) gives you flexibility: extend filesystems, add a new LV for /var/lib/containers, or carve out space for application data without reinstalling.

Container hosts (don’t pretend they’re “just servers”)

If the host runs Podman/Docker or Kubernetes components, plan for storage write amplification. Give /var extra headroom or split /var/lib/containers into its own LV. If you don’t, the container runtime will eventually eat your OS.
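If the volume group has free extents, carving that out later is a small change rather than a reinstall. A sketch, assuming the VG is named almalv (stop the runtime and move any existing layers before mounting over the directory):

```
# Dedicated LV for container storage (size is illustrative)
lvcreate -n containers -L 100G almalv
mkfs.xfs /dev/almalv/containers
mount /dev/almalv/containers /var/lib/containers
# Persist it so the split survives reboot
echo '/dev/mapper/almalv-containers /var/lib/containers xfs defaults 0 0' >> /etc/fstab
```

Now a runaway image pull fills one LV, not the OS.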

Database hosts (separate OS from data)

For databases, split OS and data aggressively. Put database data on its own volume(s) with clear mount options, separate I/O queues if possible, and monitoring. Keep / and /var clean so OS updates and logs don’t compete with your primary workload.

Encryption: LUKS is a policy decision

Disk encryption is great—until you need to reboot an unattended server at 2 a.m. Choose LUKS when you also have a plan for remote unlock, console access, and operational runbooks. “We’ll type the passphrase when it reboots” is not a plan if the server is 800 miles away.

Practical tasks: commands, outputs, and the decisions they drive

These are the checks I run after an AlmaLinux 10 install. Each one answers a question that matters during an upgrade or an outage.

Task 1: Verify OS release and kernel line

cr0x@server:~$ cat /etc/os-release
NAME="AlmaLinux"
VERSION="10.0 (Purple Lion)"
ID="almalinux"
ID_LIKE="rhel fedora"
VERSION_ID="10.0"
PLATFORM_ID="platform:el10"
PRETTY_NAME="AlmaLinux 10.0 (Purple Lion)"

What it means: Confirms you’re on the expected major version and platform ID.

Decision: If this doesn’t say el10, stop and fix your image/repo selection. Don’t “upgrade later.”

Task 2: Confirm boot mode is UEFI (or not)

cr0x@server:~$ test -d /sys/firmware/efi && echo UEFI || echo BIOS
UEFI

What it means: UEFI is active if that directory exists.

Decision: If you expected UEFI and got BIOS, fix it now. Mixed boot modes complicate standardization, Secure Boot, and recovery workflows.

Task 3: Check Secure Boot state

cr0x@server:~$ mokutil --sb-state
SecureBoot enabled

What it means: Secure Boot is on.

Decision: If enabled, treat third-party kernel modules and out-of-tree drivers as controlled changes. If disabled but required by policy, fix before production onboarding.

Task 4: Inspect disk and filesystem layout

cr0x@server:~$ lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINTS
sda       476G disk
├─sda1    600M part vfat   /boot/efi
├─sda2      1G part ext4   /boot
└─sda3  474.4G part LVM2_member
  ├─almalv-root  60G lvm   xfs    /
  ├─almalv-var  120G lvm   xfs    /var
  └─almalv-home  20G lvm   xfs    /home

What it means: Confirms UEFI partition, separate /boot, and LVM-based layout.

Decision: If /var is tiny on a service host, fix now (resize LVs) before logs/containers fill it and turn your next upgrade into a hostage situation.

Task 5: Validate fstab is sane (no surprise network mounts at boot)

cr0x@server:~$ cat /etc/fstab
UUID=3E2A-1C0B  /boot/efi  vfat  umask=0077,shortname=winnt  0  2
UUID=7b3e6dd4-0f3e-4a4c-9b0d-5a5f0a8a5f17  /boot  ext4  defaults  1  2
/dev/mapper/almalv-root  /     xfs   defaults  0  0
/dev/mapper/almalv-var   /var  xfs   defaults  0  0

What it means: Boot-critical mounts are local and straightforward.

Decision: If you see NFS/CIFS mounts without nofail and proper timeouts, you’re inviting boot hangs after the first network hiccup.
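For reference, a network mount that degrades gracefully instead of blocking boot looks something like this (server name and paths are placeholders):

```
# NFS entry that won't hang boot if the server is unreachable
nfs01:/exports/data  /mnt/data  nfs  nofail,_netdev,x-systemd.mount-timeout=30s  0  0
```

nofail keeps the mount out of the boot-critical path, _netdev orders it after the network, and the mount timeout bounds how long systemd waits.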

Task 6: Check NIC naming and link state (predictability matters)

cr0x@server:~$ ip -br link
lo               UNKNOWN        00:00:00:00:00:00
ens192           UP             00:50:56:aa:bb:cc

What it means: Interface name and state. UP means link is active.

Decision: If names differ across identical hardware/VM templates, fix your provisioning or udev rules. “Which NIC is prod?” is not a fun game during an outage.

Task 7: Confirm IP, route, and DNS with NetworkManager

cr0x@server:~$ nmcli -g GENERAL.DEVICE,GENERAL.STATE,IP4.ADDRESS,IP4.GATEWAY,IP4.DNS device show ens192
ens192:100 (connected)
10.20.30.40/24
10.20.30.1
10.20.30.10

What it means: Device is connected, has an IP, gateway, and DNS.

Decision: If DNS is wrong, fix it before you debug “dnf is broken.” It’s usually DNS.

Task 8: Verify time sync (TLS and auth depend on it)

cr0x@server:~$ timedatectl
               Local time: Tue 2026-02-05 13:18:19 UTC
           Universal time: Tue 2026-02-05 13:18:19 UTC
                 RTC time: Tue 2026-02-05 13:18:19
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

What it means: Clock synchronized and NTP active.

Decision: If System clock synchronized is no, fix chrony before you touch certificates, Kerberos, or distributed systems.

Task 9: Check SELinux mode and recent denials

cr0x@server:~$ getenforce
Enforcing
cr0x@server:~$ ausearch -m avc -ts recent | tail -n 5
----
time->Tue Feb  5 13:10:41 2026
type=AVC msg=audit(1738761041.223:412): avc:  denied  { name_connect } for  pid=1240 comm="nginx" dest=5432 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:postgresql_port_t:s0 tclass=tcp_socket permissive=0

What it means: SELinux is enforcing; an AVC denial shows nginx tried to connect to Postgres port.

Decision: Don’t disable SELinux. Decide whether the service should talk to that port, then apply the right boolean/labeling. If the connection is legitimate, you fix policy; if not, you fix architecture.
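For this specific denial, the legitimate-path fix is usually a standard database-connection boolean rather than a custom policy module (verify the boolean name on your release with semanage boolean -l):

```
# Allow httpd_t processes (nginx here) to connect to database ports
setsebool -P httpd_can_network_connect_db on
# If the database listens on a non-standard port, label it instead:
# semanage port -a -t postgresql_port_t -p tcp 5433
```

The -P flag makes the boolean persistent across reboots, which is what you want for a deliberate, documented change.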

Task 10: Confirm firewall state and open services

cr0x@server:~$ systemctl is-active firewalld
active
cr0x@server:~$ firewall-cmd --list-all
public (active)
  target: default
  interfaces: ens192
  services: ssh
  ports:
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

What it means: Firewall is on; only SSH service is open in the public zone.

Decision: If your host is exposing random ports, close them now. If your app needs ports, open them explicitly and document it as code.

Task 11: Check repo configuration and update metadata

cr0x@server:~$ dnf repolist
repo id                              repo name
almalinux-baseos                     AlmaLinux 10 - BaseOS
almalinux-appstream                  AlmaLinux 10 - AppStream
almalinux-crb                        AlmaLinux 10 - CRB
cr0x@server:~$ dnf -q check-update || true
kernel.x86_64  6.12.0-55.el10  almalinux-baseos
openssl.x86_64 3.2.2-4.el10    almalinux-baseos

What it means: Repos are enabled; updates are available.

Decision: If repos are missing or unexpected third-party repos appear, stop. Upgrade safety starts with repo hygiene.

Task 12: Inspect enabled services (reduce surprise at boot)

cr0x@server:~$ systemctl list-unit-files --type=service --state=enabled | head -n 15
UNIT FILE                          STATE   PRESET
chronyd.service                     enabled enabled
firewalld.service                   enabled enabled
sshd.service                        enabled enabled
tuned.service                       enabled enabled
NetworkManager.service              enabled enabled

What it means: Core services enabled; you can see what will start on boot.

Decision: Disable what you don’t need. Every enabled daemon is both an attack surface and a future upgrade interaction.

Task 13: Validate SSH hardening posture

cr0x@server:~$ sshd -T | egrep 'passwordauthentication|permitrootlogin|pubkeyauthentication'
passwordauthentication no
permitrootlogin no
pubkeyauthentication yes

What it means: Password auth and root login are disabled; keys are enabled.

Decision: If password auth is enabled on internet-reachable hosts, fix it before you become someone else’s botnet node.

Task 14: Check disk health indicators (NVMe/SATA)

cr0x@server:~$ smartctl -H /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.0-55.el10] (local build)
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

What it means: Device reports health as PASSED (not perfect, but a start).

Decision: If SMART is failing or unavailable, decide whether you need vendor tooling, RAID controller checks, or proactive replacement. Don’t wait for “read-only filesystem” theater.

Task 15: Confirm journald persistence and log volume expectations

cr0x@server:~$ grep -E '^(Storage|SystemMaxUse|RuntimeMaxUse)=' /etc/systemd/journald.conf
Storage=persistent
SystemMaxUse=1G
RuntimeMaxUse=200M

What it means: Logs persist across reboot, with caps.

Decision: If journald is unbounded on small disks, cap it. If you need long retention, ship logs off-host; don’t hoard them locally like souvenirs.

Task 16: Measure boot performance and find slow units

cr0x@server:~$ systemd-analyze
Startup finished in 3.212s (kernel) + 9.844s (userspace) = 13.056s
graphical.target reached after 9.801s in userspace
cr0x@server:~$ systemd-analyze blame | head
4.812s NetworkManager-wait-online.service
1.905s firewalld.service
1.102s chronyd.service

What it means: Pinpoints services slowing boot, often “wait-online.”

Decision: If NetworkManager-wait-online dominates and you don’t need it, disable it. Faster boot means faster recovery and patching.

Task 17: Check CPU/memory and pressure (baseline for later comparisons)

cr0x@server:~$ lscpu | egrep 'Model name|CPU\(s\):|Thread|Core|Socket'
CPU(s):                               4
Model name:                           Intel(R) Xeon(R) CPU
Thread(s) per core:                   2
Core(s) per socket:                   2
Socket(s):                            1
cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           7.7Gi       0.9Gi       5.9Gi       0.0Gi       0.9Gi       6.6Gi
Swap:          4.0Gi         0B       4.0Gi

What it means: Establishes what “normal” looks like on fresh install.

Decision: If memory is already tight on a fresh host, don’t deploy memory-hungry workloads and then act surprised.

Task 18: Check I/O scheduler and filesystem mount options (performance and latency)

cr0x@server:~$ cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none
cr0x@server:~$ findmnt -no TARGET,FSTYPE,OPTIONS / /var
/ xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota
/var xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota

What it means: Shows scheduler and mount options that affect latency and throughput.

Decision: Don’t “tune” mount options randomly. Capture baseline first; change one thing; measure; roll back if it’s worse.

Fast diagnosis playbook (find the bottleneck in minutes)

This is the “I’m on call and something feels slow” sequence. The goal is to identify whether you’re CPU-bound, memory-bound, I/O-bound, network-bound, or blocked by a dependency (DNS, auth, storage backend).

First: confirm the symptom is real and local

  • Is the application slow for everyone or only from one network segment?
  • Is the host slow for all commands or only for the app?
  • Did anything change (deploy, patch, config) in the last hour?

Second: check resource saturation in the simplest tools

cr0x@server:~$ uptime
 13:22:11 up 10 days,  3:41,  1 user,  load average: 6.12, 5.98, 5.44

Interpretation: Load average above CPU count suggests CPU contention or runnable queue growth (could also be blocked I/O contributing to load).

Decision: If load is high, immediately check CPU usage vs iowait and run queue.

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 5  1      0 412112  90124 982112    0    0   120   450  580  900 25 10 45 20  0
 6  2      0 401980  90124 982400    0    0   110   980  600  950 22 11 42 25  0

Interpretation: r is runnable processes, b is blocked, wa is iowait. Here, iowait is significant and blocked processes exist.

Decision: Treat as likely storage bottleneck. Move to disk latency checks before you tune app threads.

Third: differentiate disk latency vs filesystem full vs memory pressure

cr0x@server:~$ df -hT / /var
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/almalv-root xfs    60G   18G   42G  31% /
/dev/mapper/almalv-var  xfs   120G  118G  2.0G  99% /var

Interpretation: /var is effectively full. That can cause slow writes, failed services, and weird package manager behavior.

Decision: Stop the bleeding: rotate logs, clean caches, or extend the LV. Don’t “restart everything” and hope.
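When triaging, I want a number, not a feeling. A small helper worth keeping in your toolkit scripts (the function names and thresholds are my choices):

```shell
# Print the used-space percentage for a mountpoint, digits only.
fs_used_pct() {
  # -P forces POSIX single-line output so awk sees fixed columns
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Example policy: warn at 85%, act at 95% (thresholds are illustrative)
check_fs() {
  pct=$(fs_used_pct "$1")
  if [ "$pct" -ge 95 ]; then
    echo "CRIT $1 ${pct}%"
  elif [ "$pct" -ge 85 ]; then
    echo "WARN $1 ${pct}%"
  else
    echo "OK $1 ${pct}%"
  fi
}
```

Wire check_fs into cron or your monitoring agent for / and /var; the 3 a.m. version of you will appreciate the early warning.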

cr0x@server:~$ dmesg -T | tail -n 8
[Tue Feb  5 13:18:02 2026] XFS (dm-1): log I/O error -5
[Tue Feb  5 13:18:02 2026] XFS (dm-1): Filesystem has been shut down due to log error (0x2).

Interpretation: If you see filesystem shutdowns, the problem isn’t your app. It’s storage or underlying device errors.

Decision: Escalate to storage/hardware immediately. Protect data. Remount read-only if needed, and plan controlled recovery.

Fourth: validate network and DNS (because it’s always in the suspect list)

cr0x@server:~$ resolvectl status | sed -n '1,25p'
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.20.30.10
       DNS Servers: 10.20.30.10 10.20.30.11

Interpretation: Confirms resolver configuration and active DNS server.

Decision: If DNS servers are wrong or unreachable, fix DNS before blaming DNF, SSSD, or the app.

Fifth: check service health and logs with intent

cr0x@server:~$ systemctl --failed
  UNIT          LOAD   ACTIVE SUB    DESCRIPTION
● app.service    loaded failed failed Example Application

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state.
SUB    = The low-level unit activation state.

Interpretation: A failed unit is explicit; don’t guess.

Decision: Jump to journalctl -u app.service -b, fix root cause, then restart. Don’t reboot as a diagnostic tool unless you enjoy losing evidence.

Three corporate-world mini-stories (because someone already made your mistake)

Mini-story 1: The incident caused by a wrong assumption

One finance-adjacent company moved a fleet of internal services onto “RHEL-compatible Linux” to standardize patching. A new AlmaLinux-based image was built quickly, approved quickly, and shipped quickly. The critical assumption: “The installer’s default partitioning is fine; it’s an enterprise distro.”

It was fine for a quiet VM. It was not fine for a busy API node that writes logs, keeps a package cache, and runs a container runtime. Within weeks, /var filled during a peak traffic day. The app began throwing timeouts, and the support team chased phantom network issues because the CPU and network looked normal. Meanwhile, journald started dropping messages because disk space was scarce, and the most useful logs disappeared right when everyone wanted them.

The postmortem was awkward because nobody had done anything “wrong” in the usual sense. The OS installed. The service started. Monitoring existed. The real failure was assuming the default layout matched the workload.

The fix wasn’t heroic: rebuild the nodes with a separate, bigger /var, enforce journald caps, and ship logs off-host. They also added a pre-production soak test that intentionally generated log volume and container layer churn. Once /var stayed healthy for a week of synthetic abuse, they trusted the image again.

Mini-story 2: The optimization that backfired

A different org had a latency-sensitive service and an engineer with a healthy fear of context switches. They “optimized” by disabling a handful of default services and tuning kernel/sysctl values based on a blog post. They also disabled tuned because “it changes things,” and pinned the CPU governor manually in a config snippet.

It benchmarked faster in a synthetic test. Then a real maintenance window arrived. After an update and reboot, boot time increased wildly and the service came up inconsistent across nodes. Some nodes waited forever for network online; others started early, hit DNS timeouts, and failed. One node was fine. The fleet was not. The tuning had drifted because the golden image was updated in-place and different teams “fixed” nodes by hand.

The scary part: the performance win wasn’t the problem. The problem was non-reproducibility. During recovery, they couldn’t tell which changes were present on which node. The incident response became archaeology.

The eventual correction was boring: revert unknown sysctls, re-enable baseline services, and implement performance tuning via version-controlled profiles with explicit measurement gates. They still tuned, but only with a roll-forward/roll-back plan and with config management enforcing sameness.

Mini-story 3: The boring but correct practice that saved the day

A retail org ran AlmaLinux on edge systems that occasionally lost network connectivity. Nothing fancy: some local services, a queue, and periodic sync. Their platform team insisted on two things that no one loved: a strict baseline checklist and routine rebuild drills using Kickstart.

One day a storage controller firmware issue started causing intermittent I/O errors. A couple of nodes went read-only. The application team, understandably, wanted to “just restart” and keep selling. But the platform team had a practiced routine: capture logs, mark node out of rotation, rebuild onto known-good hardware, and restore service from the queue.

Because the OS install was reproducible, they didn’t waste time wondering what state the box was in. Because partitions were consistent, their log collection scripts worked. Because SELinux was enforced everywhere, they didn’t have “it works on node A but not node B” security drift. The outage was contained, not celebrated.

That day nobody got credit for being clever. The company got credit for staying open. That’s the whole job.

Joke #2: The only thing more permanent than a temporary fix is the ticket that says “remove later.”

Common mistakes: symptom → root cause → fix

1) DNF is slow or fails with timeouts

Symptom: dnf makecache hangs or repo metadata downloads time out.

Root cause: DNS misconfiguration, proxy issues, or MTU/path problems. Less often: broken mirror selection.

Fix: Confirm resolver and routing (nmcli, resolvectl), test name resolution, validate proxy settings in /etc/dnf/dnf.conf and environment. If on a VLAN with odd MTU, test the path with ping -M do -s 1472 <gateway> (1472 + 28 bytes of headers = a full 1500-byte frame) and adjust.

2) System won’t boot after enabling Secure Boot

Symptom: Boot fails after installing third-party drivers or custom kernel modules.

Root cause: Unsigned modules blocked by Secure Boot policy.

Fix: Use signed, vendor-supported drivers; enroll keys if you truly manage your own signing; otherwise disable Secure Boot only if policy allows. Don’t mix “strict boot chain” with “random DKMS modules.”

3) Services fail mysteriously after install

Symptom: Service starts fine on one node, fails on another with “permission denied” even as root.

Root cause: SELinux denials, mislabeled files, or incorrect contexts after manual file copy.

Fix: Inspect AVC denials, restore contexts (restorecon), set correct file labels, and use booleans where appropriate. Disabling SELinux is not a fix; it’s a surrender.

4) Boot takes forever, especially on networked hosts

Symptom: Long boot time; systemd-analyze blame shows wait-online.

Root cause: NetworkManager-wait-online waiting for connectivity in environments where it’s not necessary (or where DHCP is flaky).

Fix: Disable wait-online when the workload doesn’t require it at boot, or correct the network dependency. Don’t mask network problems with infinite waits.

5) Disk fills up during normal operation

Symptom: /var hits 100%, services crash, logs stop, DNF fails.

Root cause: Under-sized /var, unbounded logs, container layer growth, runaway spool directories.

Fix: Resize LVs, cap journald, configure logrotate, separate container storage, and set monitoring alerts for inode and space usage.

6) After update, app can’t bind to port or connect outbound

Symptom: Permission denied on bind/connect even with correct app config.

Root cause: SELinux port labeling or firewall rules not aligned with the app.

Fix: Check firewalld zone/ports/services, verify SELinux port types, and apply targeted fixes. Avoid broad “open everything” rules that become permanent liabilities.

7) Host keys or machine identity duplicates across cloned VMs

Symptom: SSH warnings about host key changes; monitoring shows multiple nodes with same ID.

Root cause: Golden image cloned without regenerating machine-id and SSH host keys.

Fix: Use cloud-init or first-boot scripts to reset identity, regenerate keys, and ensure unique hostnames and DHCP reservations where needed.

Checklists / step-by-step plan

Step-by-step: enterprise-grade AlmaLinux 10 install (single server)

  1. Decide boot mode: Use UEFI unless you have a documented reason not to. Decide Secure Boot policy upfront.
  2. Decide storage layout: Plan separate /var for service hosts. Use LVM for flexibility. Choose XFS for most workloads.
  3. Set hostname and network: Configure static IP where appropriate; ensure DNS and NTP are correct.
  4. Minimal packages: Install only what you need. Fewer packages means fewer CVEs and fewer upgrade conflicts.
  5. Create admin access: Use SSH keys. Disable root login over SSH. Create a break-glass path you can audit.
  6. Leave SELinux enforcing: Fix policy issues properly rather than turning it off.
  7. Enable firewall: Open only required ports, explicitly.
  8. Update immediately: Patch to current minor state before onboarding into production.
  9. Capture baseline: Save outputs of key commands (repolist, mount layout, enabled services, time sync).
  10. Register monitoring: Disk space/inodes, CPU, memory, load, and service state at minimum.
  11. Document upgrade plan: What’s your rollback? Snapshot? Rebuild? Where is data stored?
  12. Test a reboot: Verify boot, network, services, and application readiness after restart.
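Step 9 deserves automation so the baseline gets captured the same way every time. A sketch; the output path and command list are my choices, and each command is skipped if it isn’t installed:

```shell
# Capture a post-install baseline into one timestamped file.
capture_baseline() {
  out="${1:-baseline-$(date +%Y%m%d).txt}"
  : > "$out"
  for cmd in "uname -r" "findmnt -no TARGET,FSTYPE,OPTIONS /" \
             "dnf repolist" "systemctl list-unit-files --state=enabled" \
             "timedatectl"; do
    # Word-split the command string; skip if the binary is missing.
    set -- $cmd
    command -v "$1" >/dev/null 2>&1 || continue
    { echo "### $cmd"; $cmd; echo; } >> "$out" 2>&1
  done
  echo "$out"
}
```

Store the resulting file next to the host’s config in version control; diffing two baselines is the fastest honest answer to “what changed?”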

Checklist: what “clean upgrade path” means in practice

  • Reproducible build: Kickstart or image pipeline. No hand-crafted snowflakes.
  • Config management: Ansible/Salt/Puppet/etc. Even if small, have something.
  • Repo control: Known repositories only; third-party repos reviewed and pinned.
  • Data separation: App data is not intermingled with OS files. Backups and restores are tested.
  • Upgrade rehearsals: Test on a representative staging node with real traffic patterns where possible.
  • Rollback story: VM snapshot, LVM snapshot (with caution), or rebuild + restore. Pick one and practice.
  • Observability: Logs shipped off-host, metrics retained, and dashboards known by humans who will be paged.

Checklist: post-install hardening that doesn’t break your future self

  • Disable password SSH auth; enforce keys.
  • Disable direct root SSH login; use sudo.
  • Set journald limits; verify logrotate where applicable.
  • Enable and configure firewalld zones correctly.
  • Confirm chrony/NTP working; use UTC unless there’s a strong reason not to.
  • Keep SELinux enforcing; treat denials as signals, not annoyances.
  • Remove unused packages/services; less is less.

FAQ

1) Should I choose AlmaLinux 10 for new production systems?

If you want Enterprise Linux conventions and a stable ecosystem, yes. The operational win comes from compatibility and predictability, not novelty.

2) Is a “minimal install” actually better?

Usually. Fewer packages reduce security exposure and update interactions. Install what you need, then let config management add the rest.

3) What’s the best filesystem for AlmaLinux 10 servers?

XFS is the safe default for most server filesystems. Ext4 is fine for /boot and small roots. Choose ZFS only if you also choose its operational model.

4) Do I need LVM?

If you operate real systems: yes, most of the time. LVM makes resizing and layout fixes possible without reinstalling. Without it, you’re betting you’ll never guess wrong about disk usage.

5) Should I separate /var?

For service hosts, yes. Especially with containers, CI, heavy logging, or package caching. Keeping /var isolated prevents one noisy workload from bricking the OS.

6) Is Secure Boot worth it?

Worth it when you can support it consistently and your environment cares about boot integrity. If you routinely use out-of-tree modules, Secure Boot can become your recurring maintenance tax.

7) How do I ensure a clean major upgrade path?

Make rebuilds easy. Keep repos controlled. Keep config declarative. Keep data separate. Then upgrades become planned events instead of desperate rituals.

8) SELinux is blocking my app. What’s the right fix?

Find the denial, decide whether the access is legitimate, and apply targeted changes (booleans, proper contexts, allowed ports). Turning off SELinux fixes the symptom by deleting the guardrail.

9) What’s the first thing to check when “the server is slow”?

uptime and vmstat. You want to know if you’re CPU-bound, I/O-bound, or memory pressured before you touch the application.

10) Can I rely on golden images instead of Kickstart?

You can, but only if you have a first-boot process that resets identity and you keep the image pipeline disciplined. Otherwise you’ll clone yesterday’s mistakes at scale.

Conclusion: practical next steps

If you want AlmaLinux 10 to feel “enterprise,” install it like you plan to live with it. Most pain comes from unplanned growth: logs, containers, caches, and human changes that never make it back into automation.

Do these next:

  1. Pick a storage layout with a real /var strategy and write it down.
  2. Decide UEFI + Secure Boot policy fleet-wide, not host-by-host.
  3. Turn your best interactive install into a Kickstart and rebuild one node from scratch to prove it works.
  4. Capture a baseline with the command set above and store it with the system’s config.
  5. Run one simulated failure: fill /var in staging, see what breaks, then fix it before production does it for you.

When upgrades arrive, you’ll still do work. But it will be the kind of work that ends on time, not the kind that ends with a postmortem and a new respect for disk space.
