Nothing says “productive engineering culture” like a lab host that boots into an emergency shell five minutes before a release cut. Or a CI runner that starts dropping builds because you ran out of inodes, not disk. Those aren’t interesting failures. They’re avoidable failures—usually caused by a sloppy install and an even sloppier mental model.
CentOS Stream 10 is a great fit for labs and CI because it’s close enough to RHEL to teach you the same operational lessons, but it’s also where change shows up earlier—exactly what you want when you’re validating pipelines, drivers, kernel behavior, and your own assumptions.
What CentOS Stream 10 actually is (and why you want it)
CentOS Stream is not “free RHEL.” It’s a continuously delivered distro that sits between Fedora and RHEL in the development flow. In practical terms: it’s where changes land before they’re frozen into the next RHEL minor release. For labs and CI, that’s useful because you can catch upcoming ABI changes, tooling shifts, and packaging quirks before they hit your paid fleets—or your customers’ fleets.
There’s a trap here: people install Stream and treat it like a sleepy long-lived enterprise OS. Don’t. Treat it like a controlled, production-like canary. If you want something you’ll ignore for three years, use something else and be honest about it.
When CentOS Stream 10 is the right call
- You build RPMs or kernel modules and want early warning on buildroot changes.
- Your CI needs RHEL-ish behavior: systemd, SELinux, firewalld, NetworkManager, and the same general tooling.
- You want to validate Ansible roles and hardening baselines against “next RHEL.”
- You run KVM/libvirt or container hosts and need a stable-ish base, but you can tolerate updates moving faster than classic enterprise.
When it’s the wrong call
- Your lab is really a shadow production environment with no change control.
- You can’t patch frequently or you don’t have test gates.
- Your “CI runner” is a pet VM that people SSH into and mutate by hand.
Facts and history that matter operationally
Here are concrete bits of context that actually change decisions, not trivia for conference small talk:
- CentOS Stream was introduced as a rolling preview of RHEL, shifting the old model where CentOS Linux was rebuilt after RHEL releases. That changes your risk profile: you’re closer to the front of the queue.
- RHEL clones used to be the default “free enterprise Linux” play, which trained a generation of teams to treat rebuilds as identical. Stream breaks that assumption by design.
- systemd has been the init system for the RHEL family for years, and Stream is where you’ll see some service defaults change first (timeouts, dependencies, hardening options).
- NetworkManager is effectively mandatory in the RHEL family now: the installer defaults to it, and cloud-init integration and most tooling assume it. Fighting it wastes time.
- SELinux “Enforcing” is the default in sane environments, and the RHEL ecosystem has spent a decade making that workable. Turning it off is still popular—mostly among people who don’t keep on-call shifts.
- Anaconda has historically been both powerful and easy to misclick, especially around custom partitioning and bootloader placement. “I’m pretty sure I clicked the right disk” is not a storage strategy.
- DNF replaced YUM as the user-facing package manager years ago, and repo metadata behavior (and caching) matters when CI is pulling hundreds of packages per day.
- cgroups v2 has become the standard in modern RHEL-like systems, which changes container and resource-limit behavior compared to older fleets that still had v1 muscle memory.
- Podman (rootless) is a first-class container tool in the RHEL family, and it’s a better default for many CI setups than “just run Docker as root and hope.”
Design goals for labs and CI (pick a side)
Before you boot an installer ISO, decide what you’re optimizing for. Labs and CI are not the same thing, but they share one requirement: predictable failure.
Goal 1: Reproducibility beats cleverness
For CI runners, I want an image that can be rebuilt from scratch and converged with automation in under an hour. If you can’t rebuild it quickly, you’ll keep it around “just in case,” and that’s how you end up debugging a host where the last human change happened during a different fiscal year.
Goal 2: Storage should fail boringly
CI workloads are brutal on storage: lots of small files, high churn, caches that grow until they hit a limit, and logs that never stop. Your layout should make “disk full” noisy and localized, not silent and global.
Goal 3: Security defaults should stay on
Don’t disable SELinux because a build failed once. Fix the labeling. Don’t flush the firewall because your test runner can’t reach a port. Open the port. If your lab is on a flat network, your biggest threat is not Hollywood hackers—it’s the other team’s “temporary” service listening on 0.0.0.0.
One dry reality check: a lab host with no guardrails is just production, except nobody admits it. That’s how you get “surprise” outages that are really “surprise accountability.”
Install path: from ISO to first boot without drama
You can install Stream 10 interactively or via Kickstart. For labs and CI, Kickstart wins because it turns tribal knowledge into a file. Interactive installs are fine for a one-off test VM, but they’re also how “standard build” becomes five different builds.
Choose the right installation profile
For CI runners and headless lab servers: install a minimal environment plus the packages you need. GUI installs are convenient until you’re patching 50 hosts and you realize you’ve been dragging a desktop stack around like a ball and chain.
UEFI vs BIOS: just pick UEFI unless you have a reason
Modern servers and VMs should be UEFI. It’s boring, consistent, and the tooling has matured. BIOS/legacy is for compatibility with ancient hypervisors or embedded junk you can’t replace.
Kickstart posture
A Kickstart for labs/CI should do these things:
- Pin the install disk explicitly (don’t rely on “first disk” ordering).
- Define partitions and LVM volumes with intentional sizes and growth rules.
- Create an admin user with SSH keys (and lock down password SSH).
- Enable SELinux enforcing and firewalld.
- Set timezone, NTP, and a stable hostname scheme.
- Optionally register internal repos/mirrors if CI traffic is heavy.
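The bullets above translate into a short Kickstart skeleton you can diff and review. This is a hedged sketch, not a blessed standard: the disk name (sda), sizes, hostname, user, and the SSH key placeholder are all assumptions to replace with your own values.

```text
# Hedged Kickstart sketch for a minimal CI runner.
# Disk (sda), sizes, hostname, and user are placeholders -- adapt to your fleet.
text
ignoredisk --only-use=sda
zerombr
clearpart --all --initlabel --drives=sda
part /boot/efi --fstype=efi  --size=600
part /boot     --fstype=ext4 --size=1024
part pv.01     --grow
volgroup cs pv.01
logvol /    --vgname=cs --name=root --fstype=xfs --size=40960
logvol /var --vgname=cs --name=var  --fstype=xfs --size=81920
logvol swap --vgname=cs --name=swap --size=4096
selinux --enforcing
firewall --enabled --service=ssh
timezone Etc/UTC --utc
network --hostname=runner01.lab.example --bootproto=dhcp
user --name=admin --groups=wheel
sshkey --username=admin "ssh-ed25519 AAAA...placeholder admin@build"
%packages
@^minimal-environment
%end
```

Note what this buys you: the install disk is pinned with ignoredisk, the VG has slack because only pv.01 grows, and /var gets its own LV from day one.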
Joke #1: Treat “Next, Next, Finish” like a loaded weapon. It only takes one click to teach your bootloader some new and exciting disks.
Storage layout: partitions, LVM, and failure domains
Storage is where most lab/CI installs get lazy. And then storage becomes the bottleneck, and everyone blames “the network” because it’s comforting to blame things you can’t see.
What a good layout looks like
For a single-disk VM or small bare-metal node, a sane default is:
- UEFI system partition (ESP): small, fixed.
- /boot: fixed size, ext4.
- LVM PV for everything else.
- Separate LVs for / (root), /var, and optionally /var/lib/containers or /var/lib/libvirt.
Why split /var? Because CI writes to /var like it’s being paid by the byte. Logs, package caches, container layers, build artifacts, and temp files love /var. If /var fills up and it’s on the same filesystem as /, you don’t get a “disk full” problem—you get a “system can’t write state” problem. That’s a different class of pain.
Ext4 vs XFS
XFS is common in RHEL-like ecosystems and behaves well under large files and parallel IO. Ext4 is still a perfectly respectable choice for /boot and sometimes for smaller volumes. The real rule: don’t get creative. Use what your tooling expects, and what your team can recover at 3 a.m.
LVM thin: be careful
LVM thin provisioning can look like free disk space. It is not free disk space. It’s a promise to your future self that you will monitor pool usage and react before it hits 100%. Thin pools are great in virtualized lab setups where you understand oversubscription. They are catastrophic when nobody watches them.
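A pool nobody watches is a pool that will trapdoor you, so make the watching trivial. A minimal watchdog sketch: it assumes input shaped like `lvs --noheadings -o lv_name,data_percent` (name and percent per line), and the 80% threshold and sample pool names are placeholders.

```shell
# Thin-pool watchdog sketch. Feed it "lv_name data_percent" lines
# (the shape of `lvs --noheadings -o lv_name,data_percent`).
# The 80% limit and the sample pools below are assumptions.
check_thin() {
  limit=$1
  while read -r lv pct; do
    [ -n "$pct" ] || continue
    whole=${pct%%.*}   # integer part for a plain numeric compare
    if [ "$whole" -ge "$limit" ]; then
      printf 'WARN: %s at %s%% (limit %s%%)\n' "$lv" "$pct" "$limit"
    fi
  done
}

# Demo with canned data; prints one WARN line for ci_thin.
printf 'ci_thin 84.20\nscratch 41.10\n' | check_thin 80
```

Wire the real `lvs` output into it from a systemd timer and page on any output at all. Silence is the only acceptable steady state for a thin pool check.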
Swap: pick a policy, not a vibe
For CI runners with memory spikes, some swap can prevent the kernel from killing your build job. For latency-sensitive hosts, too much swap can hide memory pressure until performance becomes “mysteriously slow.” If you’re not sure: keep a modest swap and rely on monitoring to tune.
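If "modest swap plus monitoring" is your policy, write the swap bias down too. A sketch using the standard sysctl knob; the value 10 is an assumed starting point, not gospel:

```text
# /etc/sysctl.d/90-swap.conf -- bias the kernel away from swapping out
# working memory in favor of dropping cache. 10 is an assumption; tune
# against your own monitoring, not vibes.
vm.swappiness = 10
```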
Networking baseline: predictable IPs, predictable DNS
CI runners failing because DNS flapped is not a heroic story. It’s a failure of basics.
Static addressing vs DHCP
For ephemeral CI runners created and destroyed automatically, DHCP is fine if your DHCP and DNS are reliable and integrated. For long-lived lab hosts and bare metal, static IPs reduce surprises. If you do static IPs, do them via NetworkManager profiles, not by hand-editing random files and praying you remember what you did.
Hostnames and search domains
Pick a naming scheme that survives reimaging. For example: role + site + index. Don’t encode secrets or ownership drama in hostnames. Also, keep search domains minimal. Overly broad search domains cause weird delays and strange resolution behavior when internal DNS is unhappy.
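For example, a trivial generator for the role + site + index scheme. The role and site values here are placeholders, not a recommendation:

```shell
# Hostname scheme sketch: role-site-index with zero padding, so names
# sort and reimage cleanly. "runner" and "ams1" are placeholder values.
gen_names() {
  role=$1; site=$2; count=$3
  i=1
  while [ "$i" -le "$count" ]; do
    printf '%s-%s-%02d\n' "$role" "$site" "$i"
    i=$((i + 1))
  done
}

# Prints runner-ams1-01 through runner-ams1-03.
gen_names runner ams1 3
```

Zero padding matters more than it looks: it keeps inventory tools, sort output, and humans all agreeing on ordering.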
Security baseline: SELinux, firewalld, and SSH
Security controls aren’t just about security. They’re about operational predictability. SELinux and firewalld force you to be explicit about what your services do. That makes your systems easier to reason about.
SELinux: keep it enforcing
Enforcing mode catches mislabeling, bad defaults, and “it worked on my laptop” container runs. If something breaks, your first instinct should be: read the AVC denials and fix labels or policy. Your last instinct should be: setenforce 0.
firewalld: define zones and only open what you need
CI hosts typically need inbound SSH and maybe inbound metrics scraping. They don’t need the world to reach random ephemeral ports. If you’re running a registry mirror or artifact cache, that’s different—open those ports intentionally and document them.
SSH: keys, not passwords
Disable password authentication if you can. If you can’t, at least lock it down to internal networks and enforce strong passwords. CI runners should not be a place where password brute force gets a foothold.
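A drop-in sketch for that posture; the 10.20.0.0/16 range is a placeholder for your internal networks, and the Match block is only there if you truly cannot kill passwords yet:

```text
# /etc/ssh/sshd_config.d/50-ci.conf -- sketch; adapt the address range.
PasswordAuthentication no
PermitRootLogin prohibit-password
# Transition escape hatch only -- delete it once keys are everywhere:
# Match Address 10.20.0.0/16
#     PasswordAuthentication yes
```

Drop-ins beat editing sshd_config directly: they survive package updates and they show up in diffs as the deliberate change they are.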
One quote worth keeping near your terminal: Hope is not a strategy.
— traditional SRE saying
CI/runtime choices: Podman, containers, and virtualization
For labs and CI, you’re usually choosing one of three patterns:
- Host-based builds: install toolchains on the host. Fast, but it rots and becomes snowflake territory.
- Container-based builds: isolate build environments with Podman. Reproducible, easier cleanup, good default.
- VM-based builds: full OS per job via libvirt/KVM. Heavier, but closest to real deployment behavior.
My bias: container-based builds for most pipelines, VM-based builds for kernel/driver work and “we need to test boot and system services,” and host-based builds only when you absolutely must (like certain hardware-bound toolchains).
Practical tasks with commands (and what to decide from the output)
These are not toy commands. This is the stuff you run on day one, and again when something feels off. Each task includes: command, sample output, what it means, and what you decide next.
Task 1: Confirm you installed what you think you installed
cr0x@server:~$ cat /etc/os-release
NAME="CentOS Stream"
VERSION="10"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="10"
PLATFORM_ID="platform:el10"
PRETTY_NAME="CentOS Stream 10"
Meaning: You’re on Stream 10, and the platform identifier is EL10-like. If this says something else, your image pipeline is lying.
Decision: If this isn’t exactly what you expect, stop and fix the build source. Don’t “continue anyway.”
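If you gate image promotion on this check, script it instead of eyeballing it. A hedged sketch: the file argument exists only so you can test it against fixtures, and the el10 expectation matches the output above.

```shell
# Image-gate sketch: source an os-release file and refuse to proceed on
# a platform mismatch. The overridable file argument is for testing.
check_platform() {
  file=${1:-/etc/os-release}
  ID='' PLATFORM_ID=''
  . "$file"
  if [ "$ID" = "centos" ] && [ "$PLATFORM_ID" = "platform:el10" ]; then
    echo OK
  else
    echo "MISMATCH: ID=$ID PLATFORM_ID=$PLATFORM_ID" >&2
    return 1
  fi
}

# Demo against a fixture; prints OK.
tmp=$(mktemp)
printf 'ID="centos"\nPLATFORM_ID="platform:el10"\n' > "$tmp"
check_platform "$tmp"
rm -f "$tmp"
```

Run it early in provisioning and fail the pipeline on mismatch. "Continue anyway" is how a Fedora box ends up in the CI pool.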
Task 2: Check kernel and boot mode (UEFI vs legacy)
cr0x@server:~$ uname -r
6.12.0-0.el10.x86_64
cr0x@server:~$ test -d /sys/firmware/efi && echo UEFI || echo BIOS
UEFI
Meaning: Kernel version tells you what you’re debugging. Boot mode matters for grub, secure boot behavior, and how disks are partitioned.
Decision: Standardize on UEFI for new builds unless a platform constraint forces BIOS.
Task 3: Identify disks and ensure you partitioned the right one
cr0x@server:~$ lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINTS,MODEL
NAME          SIZE   TYPE  FSTYPE       MOUNTPOINTS  MODEL
sda           200G   disk                            QEMU HARDDISK
├─sda1        600M   part  vfat         /boot/efi
├─sda2          1G   part  ext4         /boot
└─sda3        198.4G part  LVM2_member
  ├─cs-root    40G   lvm   xfs          /
  ├─cs-var     80G   lvm   xfs          /var
  └─cs-home    20G   lvm   xfs          /home
Meaning: You can see the partition map and what’s mounted. If you see your data disk holding /boot, you’ve already had a bad day.
Decision: If the wrong disk was used, reinstall. Trying to “fix it later” usually costs more than a clean rebuild.
Task 4: Verify filesystem capacity and inode headroom
cr0x@server:~$ df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/cs-root xfs 40G 3.2G 37G 8% /
/dev/mapper/cs-var xfs 80G 12G 68G 15% /var
/dev/sda2 ext4 1020M 238M 713M 26% /boot
cr0x@server:~$ df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/cs-var 41943040 92321 41850719 1% /var
Meaning: Space and inode availability. CI can run out of inodes long before disk space, especially with language ecosystems that love tiny files.
Decision: If /var is small or inode usage is climbing fast, allocate more to /var now. Future you will not do it calmly.
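When inode usage is climbing, find out what is actually burning them. A rough sketch that ranks first-level directories under a path by regular-file count; it defaults to /var, but point it at any mount.

```shell
# Rank top-level directories under a path by file count, to spot what
# is eating inodes. Output format: "<count> <dir>", worst first.
top_file_counts() {
  base=${1:-/var}
  for d in "$base"/*/; do
    [ -d "$d" ] || continue
    printf '%s %s\n' "$(find "$d" -xdev -type f 2>/dev/null | wc -l)" "$d"
  done | sort -rn | head -n 5
}

top_file_counts /var
```

On CI runners the top entries are usually container storage, package caches, and workspace debris, in that order. Whatever tops the list is what your retention policy should name explicitly.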
Task 5: Check LVM health and free extents for growth
cr0x@server:~$ vgs
VG #PV #LV #SN Attr VSize VFree
cs 1 3 0 wz--n- 198.38g 58.38g
cr0x@server:~$ lvs -a -o lv_name,vg_name,lv_size,lv_attr,data_percent,metadata_percent
LV VG LSize Attr Data% Meta%
root cs 40.00g -wi-ao----
var cs 80.00g -wi-ao----
home cs 20.00g -wi-ao----
Meaning: You’ve got free space in the VG to grow /var when CI grows teeth. If you see VFree = 0, you sized it to the edge.
Decision: Leave slack in the VG. “We used all disk capacity” is not an achievement; it’s a future incident.
Task 6: Confirm update cadence and repo health
cr0x@server:~$ dnf repolist
repo id repo name
baseos CentOS Stream 10 - BaseOS
appstream CentOS Stream 10 - AppStream
cr0x@server:~$ dnf check-update
Last metadata expiration check: 0:12:17 ago on Fri 06 Feb 2026 09:10:44 AM UTC.
kernel.x86_64 6.12.2-0.el10 baseos
Meaning: Repos are reachable, metadata is current, and updates exist. On Stream, updates are part of the deal.
Decision: If metadata expiration is huge or repolist is empty, fix DNS/proxy/mirrors before you trust the host for CI.
Task 7: Validate time sync (CI hates skew)
cr0x@server:~$ timedatectl
Local time: Fri 2026-02-06 09:23:18 UTC
Universal time: Fri 2026-02-06 09:23:18 UTC
RTC time: Fri 2026-02-06 09:23:18
Time zone: UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
cr0x@server:~$ chronyc tracking
Reference ID : 0A0B0C0D (ntp1.example)
Stratum : 3
Last offset : -0.000021 seconds
RMS offset : 0.000112 seconds
Meaning: Clock is synchronized. Token signing, TLS, artifact timestamps, and distributed builds all break in stupid ways with time skew.
Decision: If NTP isn’t active, fix it before you debug “random” TLS failures.
Task 8: Inspect network config the NetworkManager way
cr0x@server:~$ nmcli -t -f NAME,DEVICE,TYPE,STATE con show --active
Wired connection 1:ens192:802-3-ethernet:activated
cr0x@server:~$ nmcli dev show ens192 | egrep 'IP4.ADDRESS|IP4.GATEWAY|IP4.DNS'
IP4.ADDRESS[1]: 10.20.30.40/24
IP4.GATEWAY: 10.20.30.1
IP4.DNS[1]: 10.20.30.10
Meaning: You have an active connection profile and sane IP/DNS. If CI can’t resolve package repos, this is where you start.
Decision: If DNS points somewhere weird (like a consumer router), fix it. Don’t “work around it” with /etc/hosts entries.
Task 9: Check SELinux mode and recent denials
cr0x@server:~$ getenforce
Enforcing
cr0x@server:~$ sudo ausearch -m avc -ts recent | tail -n 5
type=AVC msg=audit(1738833941.112:842): avc: denied { name_connect } for pid=2213 comm="podman" dest=53 scontext=system_u:system_r:container_t:s0 tcontext=system_u:object_r:dns_port_t:s0 tclass=tcp_socket permissive=0
Meaning: SELinux is enforcing and you have an AVC denial involving a container trying to reach DNS. This is not “SELinux being annoying”; it’s a signal about labeling/policy and container network rules.
Decision: Investigate the context and policy needed; don’t disable SELinux globally. Fix the root cause (container networking policy, DNS port labeling, or container runtime configuration).
Task 10: Verify firewalld state and open ports
cr0x@server:~$ sudo systemctl status firewalld --no-pager
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; preset: enabled)
Active: active (running) since Fri 2026-02-06 09:02:11 UTC; 22min ago
cr0x@server:~$ sudo firewall-cmd --get-active-zones
public
interfaces: ens192
cr0x@server:~$ sudo firewall-cmd --list-services
ssh
Meaning: Firewall is on, interface is in public zone, only SSH is open. Good baseline.
Decision: If you need node_exporter, add the port/service explicitly. If you see “services: dhcpv6-client samba cockpit whatever,” clean it up.
Task 11: Check system resource pressure (CPU, memory, IO) the fast way
cr0x@server:~$ uptime
09:27:51 up 1:12, 1 user, load average: 0.26, 0.18, 0.09
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 15Gi 1.8Gi 11Gi 170Mi 2.4Gi 13Gi
Swap: 4.0Gi 0B 4.0Gi
Meaning: Load is low, memory is healthy, swap unused. If CI jobs are slow, it’s probably not raw CPU starvation right now.
Decision: If load is high and available memory is low, decide whether to add RAM, reduce concurrency, or isolate noisy jobs.
Task 12: Spot IO bottlenecks and latency spikes
cr0x@server:~$ iostat -xz 1 3
Linux 6.12.0-0.el10.x86_64 (runner01) 02/06/2026 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.10 0.00 3.20 8.60 0.00 76.10
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
sda 8.00 320.0 0.00 0.00 6.20 40.00 45.00 2048.0 2.00 4.26 18.40 45.51 0.92 68.00
Meaning: iowait is non-trivial, write await is high-ish, and disk utilization is high. That’s a classic CI runner profile: heavy writes, metadata churn, and caches.
Decision: Consider faster storage, separating build/workspace disks, or moving caches to tmpfs/ramdisk selectively (with limits). Also consider reducing concurrent jobs per node.
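If you do move caches to RAM, bound it explicitly so a runaway build can't eat the host. A hypothetical fstab line; the mount point and the 4g cap are assumptions to size against real RAM headroom:

```text
# /etc/fstab -- bounded scratch mount for build caches. The path and
# the 4g cap are assumptions; size against actual memory headroom.
tmpfs  /var/cache/ci-scratch  tmpfs  size=4g,mode=1777  0 0
```

The size= cap is the whole point: an unbounded tmpfs just converts an IO bottleneck into an OOM kill.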
Task 13: Confirm journald/log retention won’t eat /var
cr0x@server:~$ sudo journalctl --disk-usage
Archived and active journals take up 1.2G in the file system.
cr0x@server:~$ sudo grep -E 'SystemMaxUse|RuntimeMaxUse' /etc/systemd/journald.conf
#SystemMaxUse=
#RuntimeMaxUse=
Meaning: Journals are already using space, and there’s no explicit cap. On chatty CI nodes, that grows until it hits a filesystem limit.
Decision: Set SystemMaxUse (and maybe SystemMaxFileSize) to a sane cap for your /var sizing.
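A drop-in sketch for that cap; the numbers are assumptions to size against your /var LV, not recommendations:

```text
# /etc/systemd/journald.conf.d/50-cap.conf -- caps are assumptions;
# size them against your /var sizing and retention needs.
[Journal]
SystemMaxUse=1G
SystemMaxFileSize=128M
RuntimeMaxUse=256M
```

Restart systemd-journald after dropping this in, and verify with journalctl --disk-usage that the cap actually took.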
Task 14: Validate container storage location and growth
cr0x@server:~$ sudo podman info --format '{{.Store.GraphRoot}}'
/var/lib/containers/storage
cr0x@server:~$ sudo du -sh /var/lib/containers/storage
6.4G /var/lib/containers/storage
Meaning: Container layers live under /var. That’s why /var sizing matters.
Decision: If you run heavy container builds, consider placing container storage on its own LV (or dedicated disk) to avoid starving system state.
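Keeping the default path but backing it with a dedicated LV needs no Podman config change at all, just a mount. A sketch; the VG/LV names are assumptions:

```text
# /etc/fstab -- dedicate an LV to Podman's default graphroot so layer
# bloat fills its own filesystem first. VG/LV names are assumptions.
/dev/mapper/cs-containers  /var/lib/containers/storage  xfs  defaults  0 0
```

Do the migration with the container runtime stopped and the old contents copied over first, or you'll "lose" every cached layer on the host.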
Task 15: Check cgroups v2 and container compatibility
cr0x@server:~$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
Meaning: You’re on cgroups v2. Some older container tooling and monitoring agents still assume v1 semantics.
Decision: Ensure your CI tooling supports cgroups v2. If not, upgrade tooling rather than downgrading the OS behavior unless you have no choice.
Task 16: Check virtualization readiness (if you run KVM/libvirt runners)
cr0x@server:~$ lscpu | egrep 'Virtualization|Vendor ID|Model name'
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU
Virtualization: VT-x
cr0x@server:~$ lsmod | egrep 'kvm|kvm_intel'
kvm_intel 503808 0
kvm 1490944 1 kvm_intel
Meaning: CPU supports virtualization and KVM modules are loaded.
Decision: If virtualization isn’t available, don’t waste time debugging libvirt. Fix BIOS settings or choose a different host/hypervisor profile.
Fast diagnosis playbook
This is the playbook I want on the wall next to the CI cluster. When builds get slow or installs get flaky, you don’t “poke around.” You follow the shortest path to the bottleneck.
First: prove it’s not DNS/repo access
- Check DNS resolution and reachability to repos/mirrors. CI failures often present as “package install failed” but the root is name resolution or proxy config.
- Confirm time sync. TLS failures and repo metadata errors can be time skew.
cr0x@server:~$ getent hosts mirror.internal
10.20.30.50 mirror.internal
cr0x@server:~$ chronyc tracking | head
Reference ID : 0A0B0C0D (ntp1.example)
Decision: If DNS is slow or flaky, fix that before touching anything else. You can’t optimize a system that can’t find its dependencies.
Second: check storage pressure and IO latency
- Disk full? Inodes full? Thin pool full? These show up as “random failures.”
- IO latency (await) high? That’s why builds are slow.
cr0x@server:~$ df -hT | sed -n '1,6p'
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/cs-root xfs 40G 3.2G 37G 8% /
/dev/mapper/cs-var xfs 80G 78G 2.0G 98% /var
cr0x@server:~$ iostat -xz 1 2 | tail -n 5
sda 9.00 360.0 0.00 0.00 8.10 40.00 60.00 2600.0 1.00 1.64 25.90 43.33 1.40 85.00
Decision: If /var is 98% used, stop. Clean up caches/logs or grow the LV. If disk await is high, reduce concurrency or upgrade storage.
Third: check CPU, memory, and scheduler contention
- High load with low IO wait? Likely CPU saturation.
- Low available memory with swap activity? Memory pressure or runaway jobs.
- High steal time? Hypervisor contention; your VM host is oversubscribed.
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 1123456 81234 2234560 0 0 5 80 150 300 12 3 77 8 0
3 1 0 123456 40000 900000 0 0 0 2000 500 1200 40 10 10 40 0
Decision: If st (steal) is non-zero and persistent, the real bottleneck is upstream. Escalate to the virtualization team or move runners.
Common mistakes: symptoms → root cause → fix
1) Symptom: CI jobs randomly fail with “No space left on device” but df shows free space
Root cause: Inodes exhausted (too many small files) or a different filesystem is full (often /var or /tmp).
Fix: Check df -i and per-mount usage. Increase inode capacity by resizing/recreating filesystem (long-term) and reduce file churn (cleanup, artifact retention). Split /var early.
2) Symptom: Builds are slow only on some runners
Root cause: Storage latency differences (different disk tiers), or VM steal time on oversubscribed hypervisors.
Fix: Compare iostat -xz and vmstat across runners. Standardize disk backend. Stop mixing “fast” and “slow” nodes in the same pool unless your scheduler is topology-aware.
3) Symptom: After update, a service won’t start; logs mention permission denied
Root cause: SELinux denial after a path change, new port, or new unit hardening.
Fix: Use ausearch -m avc to find denials, then correct labeling with restorecon or adjust policy. Don’t disable SELinux globally.
4) Symptom: Host boots into dracut emergency shell after reboot
Root cause: Wrong UUID in fstab, missing initramfs drivers, or disk ordering changes in a VM template.
Fix: Boot rescue, verify blkid and /etc/fstab, rebuild initramfs if needed. Prefer mounting by UUID and stable device naming.
5) Symptom: DNF is painfully slow, “metadata download” hangs
Root cause: DNS issues, proxy MTU weirdness, or mirror selection problems.
Fix: Validate DNS (getent hosts), check MTU, and prefer internal mirrors for CI-heavy environments.
6) Symptom: Container builds fail after OS update, but host builds still work
Root cause: cgroups v2 expectations, rootless networking constraints, or SELinux policy changes affecting container storage.
Fix: Confirm cgroups (stat -fc %T /sys/fs/cgroup), check Podman info, review AVC denials, and update your container tooling/images.
7) Symptom: SSH works from some subnets but not others
Root cause: firewalld zone assignment or upstream network ACL mismatch.
Fix: Check active zones and interfaces, then set explicit rules. Don’t disable firewalld because networking is hard.
Three corporate mini-stories (the kind you don’t brag about)
Mini-story 1: The incident caused by a wrong assumption
The company had a tidy little CI cluster: a dozen VMs, a couple of build caches, and a weekly patch window that everyone ignored until it broke something. They decided to move from a RHEL clone to CentOS Stream for “early compatibility.” Sounds responsible.
The wrong assumption was subtle: they assumed Stream behaved like their old rebuild in one specific way—repo stability. Their pipeline started pulling updates during the workday because someone left a scheduled dnf -y update timer running on the runners. One afternoon, a toolchain update landed, the compiler minor version changed, and a subset of builds started producing slightly different artifacts. Nothing obviously failed at first. The checksum comparisons did, though, and then the release process stopped like it hit a wall.
Engineering teams argued about “nondeterministic builds” and “maybe the caching layer is corrupt.” It was neither. It was the runners updating themselves mid-stream. The painful part wasn’t the fix; the painful part was realizing they had no policy for when and how updates happen on CI infrastructure.
They recovered by freezing updates during business hours, building golden images weekly, and rolling those images through a staging CI pool first. Stream wasn’t the villain. The villain was the belief that updates are something you do when you remember.
Mini-story 2: The optimization that backfired
Another org had a clever storage plan: they used LVM thin provisioning for CI workspaces because it let them “allocate” huge volumes without buying more disk. On paper it was elegant. In reality it was a thin pool shared by too many enthusiastic projects.
They also “optimized” by cranking job concurrency. Builds got faster—until they didn’t. Then one morning, multiple runners started failing with filesystem errors. The thin pool hit 100% data usage. Thin provisioning doesn’t fail gracefully; it fails like a trapdoor. Writes stall, filesystems panic, and everyone suddenly learns what “metadata percent” means.
The immediate fix was ugly: stop the world, delete caches, and extend the underlying storage. The long-term fix was boring: set monitoring on thin pool usage, enforce quotas per project, and stop oversubscribing storage that has no operational guardrails.
They kept thin provisioning, but only where they had alerting and clear ownership. The optimization wasn’t wrong. It was premature, unmonitored, and sold internally as “free capacity,” which is the most expensive lie in storage.
Mini-story 3: The boring but correct practice that saved the day
A third team had a reputation for being “slow” because they insisted on Kickstart-based builds and a strict base image pipeline. Developers wanted the freedom to tweak runners directly. The SREs said no and got grumbled at in meetings.
Then a kernel update in the lab uncovered a regression affecting a specific NIC driver under heavy load. A few runners started dropping network connections during artifact uploads. It looked like random flakiness—exactly the kind that wastes weeks.
Because the team had a golden image pipeline, they could roll back to the previous known-good kernel across the fleet in a controlled way. More importantly, they could reproduce the issue by spinning up test runners from both images and comparing behavior under identical load tests. No archaeology. No “who changed what.” Just controlled experimentation.
The postmortem was almost disappointingly calm. Their boring practice—immutable-ish images, controlled updates, and a staging pool—turned a potential multi-team blame festival into a small, contained operational event. They shipped on time. Nobody wrote a dramatic Slack thread. That’s a win.
Joke #2: The best CI runner is like good plumbing—nobody notices it until someone tries to “optimize” it.
Checklists / step-by-step plan
Step-by-step install plan (interactive or Kickstart-guided)
- Decide your role: CI runner, lab hypervisor, or general-purpose test host. This drives storage and package selection.
- Choose UEFI boot (unless constrained) and confirm the VM firmware setting before install.
- Select minimal install plus required packages; avoid GUI unless you have an explicit need.
- Storage layout:
- Create ESP and /boot fixed partitions.
- Create LVM PV and VG with slack space.
- Create separate LV for /var sized for logs + containers + caches.
- Network: configure via NetworkManager, ensure DNS points to a reliable resolver.
- Time: set timezone to UTC for server fleets; enable NTP.
- Users: create admin user; lock down SSH to keys if possible.
- Security: keep SELinux enforcing; keep firewalld enabled.
- Update policy: decide patch cadence and whether runners self-update. My advice: no auto-updates on runners without gates.
- Snapshot/golden image: capture a base image only after validation commands pass.
Post-install validation checklist (run these before adding to CI pool)
- OS identity correct: /etc/os-release
- Boot mode correct: UEFI check (/sys/firmware/efi)
- Disk layout correct: lsblk, df -hT, df -i
- LVM slack available: vgs
- Repos healthy: dnf repolist, dnf check-update
- Time sync: timedatectl, chronyc tracking
- Network correct: nmcli output matches your intended IP/DNS
- SELinux enforcing: getenforce; no unexpected AVC spam
- firewalld enabled and minimal: firewall-cmd --list-services
- Baseline perf sanity: iostat, vmstat under a sample build
Operational checklist (weekly)
- Apply updates in a controlled window; roll through staging pool first.
- Check /var growth trends; cap journald; prune container layers.
- Verify NTP; watch for time drift on VMs.
- Review AVC denials; address recurring ones properly.
- Confirm CI concurrency matches disk capacity, not wishful thinking.
FAQ
1) Is CentOS Stream 10 stable enough for CI?
Yes, if your CI is engineered to absorb change: staged rollouts, reproducible images, and a patch policy. If your CI runners are hand-maintained pets, Stream will expose that quickly.
2) Should I use Stream 10 for production?
Sometimes, but don’t treat it as the default. For production, the question is less “can it work” and more “do you have the operational discipline to manage faster-moving updates?” Many orgs don’t.
3) Minimal install or full server install?
Minimal. Add what you need. Every extra package is update surface area and potential conflict in CI environments.
4) Do I really need a separate /var?
If you run CI workloads, yes. /var is where your system state and your high-churn CI debris collide. Separating them is cheap insurance.
5) XFS or ext4 for CI runners?
XFS is a solid default for / and /var in RHEL-like systems. Keep ext4 for /boot. Don’t mix exotic filesystems unless your team has a recovery playbook and real experience.
6) Should I disable SELinux to make container builds easier?
No. Use AVC logs to fix labeling/policy issues. Disabling SELinux trades a solvable configuration issue for an ongoing risk and inconsistent behavior across environments.
7) How do I stop runners from drifting over time?
Golden images + configuration management. Rebuild runners regularly. If a runner is “special,” it’s also untrustworthy.
8) My DNF installs are slow in CI. What’s the best fix?
Use an internal mirror or caching proxy, and fix DNS. Then tune DNF caching behavior. CI magnifies package manager inefficiencies into real money and real time.
9) Containers or VMs for CI jobs?
Containers for most builds and tests. VMs when you need to test boot-time behavior, kernel interactions, or system services in a realistic environment.
10) What’s the single most common bottleneck on Stream-based CI hosts?
Storage. Specifically: /var filling up, container layer bloat, and IO latency under concurrency. CPU is usually second.
Conclusion: next steps that actually reduce pager noise
CentOS Stream 10 is the right kind of uncomfortable for labs and CI: it nudges you toward disciplined installs, clear update policies, and infrastructure that can tolerate change. If you install it like a hobby OS, it will behave like one. If you install it like you run production, it becomes a sharp early-warning system for “next RHEL.”
Practical next steps:
- Write (or fix) your Kickstart so disk selection, /var sizing, and security defaults are explicit.
- Stand up a staging CI pool that patches first; only promote to the main pool after a day of clean runs.
- Cap journald, prune container storage, and monitor /var usage and inode consumption.
- Adopt the fast diagnosis playbook and make it the default response to “CI is slow.”
- Decide your update policy in writing. Then enforce it with automation, not good intentions.