Fresh Install Checklist: 20 Minutes Now Saves 20 Hours Later

You know the moment: the server boots, SSH works, the app deploys, and everyone declares victory. Then two weeks later it’s “mysteriously slow,” backups are “coming soon,” time is wrong, and the root filesystem is 97% full because logs found religion.

This is the checklist that prevents those meetings. It’s not romantic. It’s not “cloud native.” It’s the unglamorous set of checks that lets you sleep, migrate, and debug without performing forensic archaeology on your own install.

The mindset: baseline, then build

A fresh install is the one time in a system’s life when everything is clean, deterministic, and still obedient. Once production traffic arrives, entropy follows: packages drift, config files get “temporarily” edited, and the one person who remembers why a kernel parameter was changed is on vacation forever.

Your goal in the first 20 minutes is not “finish setup.” Your goal is to capture a baseline and lock in safety rails:

  • Baseline: Identify the hardware and virtual environment you actually got (not what procurement promised).
  • Confirm invariants: Time, DNS, routing, storage layout, and update channels.
  • Make the future cheap: Snapshots, backups, a place for logs, and a way to debug without guessing.

One idea to pin to your monitor, paraphrased from John Allspaw (operations and reliability): systems fail in surprising ways; your job is to make it safe to learn and recover quickly.

Do the boring checks now. Or do them at 02:00 while someone asks if the problem is “the database.”

Interesting facts and context (why this keeps happening)

  1. Unix timekeeping scars are old. NTP’s ancestors date back decades; time sync bugs still cause authentication failures, TLS errors, and “impossible” log timelines.
  2. Filesystems have history, and it shows. ext4 is conservative because it inherited a lineage that learned hard lessons from power loss and bad disks.
  3. Journaling was a revolution. Before journaling filesystems became common, crashes could turn reboots into multi-hour fsck events. Many “weird” defaults today exist to avoid that era returning.
  4. RAID was never a backup. The phrase is older than most on-call rotations and still ignored weekly; RAID protects against certain disk failures, not deletion or corruption.
  5. DNS failures are a recurring villain. The internet has had multiple high-profile outages caused by DNS mistakes; inside companies, the same pattern repeats with split-horizon and stale resolvers.
  6. OpenSSH defaults evolved for a reason. The move away from weaker algorithms and toward key-based auth wasn’t fashion; it was the accumulated debris of incidents.
  7. cgroups and systemd changed Linux operations. They made resource control and service supervision better, but also introduced failure modes you can only see if you check the right status outputs.
  8. “Works on my machine” got worse with virtualization. Clock drift, noisy neighbors, oversubscribed I/O, and mismatched CPU flags can make two “identical” VMs behave like different species.

These aren’t trivia. They’re why a fresh install checklist is not paranoia. It’s compounding interest.

Checklists / step-by-step plan (the 20-minute routine)

Minute 0–3: Identity and access (before you touch anything else)

  • Confirm hostname, OS version, kernel, and virtualization context.
  • Confirm you can get in safely: SSH keys, sudo, and a second path (console, out-of-band, or VM serial).
  • Record the initial state: package sources, kernel cmdline, disk layout.

Minute 3–8: Storage layout you won’t hate later

  • Verify what disks you have, what they’re called, and whether you’re on NVMe, SATA, or networked storage.
  • Confirm partitioning and mount points. Make sure logs and data aren’t sharing a tiny root by accident.
  • Confirm TRIM/discard behavior for SSDs where it matters.

Minute 8–12: Network reality check

  • Verify IPs, routes, and DNS resolvers.
  • Verify MTU and basic latency. Check for PMTU black holes early.
  • Verify outbound reachability to your package mirrors and monitoring endpoints.

Minute 12–16: Time, updates, and service supervisor

  • Confirm time sync and time zone (and that it stays synced).
  • Confirm automatic security updates policy (or your patching process). “We’ll do it later” is a lie people tell themselves.
  • Confirm systemd health and log persistence.

Minute 16–20: Observability + backups (prove it works)

  • Install a minimal metrics/log shipping agent or at least enable persistent logs.
  • Create a first snapshot (VM or filesystem) once the system is “known-good.”
  • Run a backup job, then attempt a small restore. Yes, immediately.

Short joke #1: Backups are like parachutes: if you only discover they don’t work during the jump, your postmortem will be brief.

Hands-on tasks with commands, outputs, and decisions

The point of commands isn’t to collect trivia. Each command should answer: “What’s true?” and “What do I do next?” Below are practical tasks you can run on most modern Linux distributions. Adjust paths and tooling as needed, but keep the intent.

Task 1: Confirm OS, kernel, and architecture

cr0x@server:~$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04 LTS (Noble Numbat)"
ID=ubuntu

cr0x@server:~$ uname -r
6.8.0-31-generic

cr0x@server:~$ uname -m
x86_64

What the output means: You’ve got the exact distro release, kernel line, and CPU architecture. This affects package availability, kernel defaults (I/O schedulers, cgroups), and driver behavior.

Decision: Confirm this matches your support policy. If you need a specific kernel or LTS minor for drivers (HBA/NIC), decide now—before production data exists.

Task 2: Confirm virtualization / cloud environment

cr0x@server:~$ systemd-detect-virt
kvm

cr0x@server:~$ cat /sys/class/dmi/id/product_name
KVM

What it means: You’re in KVM. Disk naming, clock behavior, and I/O ceilings may differ from bare metal.

Decision: If you expected bare metal, stop and escalate. If you expected a VM, check your storage performance assumptions (network-backed block devices can be “fine” until they aren’t).

Task 3: Record CPU and memory basics (catch underprovisioning)

cr0x@server:~$ lscpu | sed -n '1,12p'
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Little Endian
CPU(s):                               4
On-line CPU(s) list:                  0-3
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R)
Thread(s) per core:                   2
Core(s) per socket:                   2
Socket(s):                            1
Virtualization:                       VT-x
L1d cache:                            64 KiB

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           7.7Gi       420Mi       6.6Gi       4.0Mi       720Mi       7.1Gi
Swap:          0B          0B          0B

What it means: You’ve got 4 vCPUs, ~8 GiB RAM, and no swap configured.

Decision: Decide explicitly on swap. For many server workloads, a small swap can prevent OOM-killer chaos; for latency-critical workloads, you may prefer no swap and strict memory limits. Don’t drift into a default by accident.
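
If you land on "small swap," here is a minimal sketch for an ext4 root (the 2G size and the swappiness value are placeholders; pick numbers that match your memory-pressure tolerance):

cr0x@server:~$ sudo fallocate -l 2G /swapfile      # works on ext4/XFS; use dd if fallocate isn't supported
cr0x@server:~$ sudo chmod 600 /swapfile
cr0x@server:~$ sudo mkswap /swapfile
cr0x@server:~$ sudo swapon /swapfile
cr0x@server:~$ echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
cr0x@server:~$ sudo sysctl vm.swappiness=10        # prefer reclaiming cache over swapping; persist via /etc/sysctl.d if you keep it

Verify with swapon --show and free -h, and write the decision down either way.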

Task 4: Confirm disks, models, and transport

cr0x@server:~$ lsblk -o NAME,TYPE,SIZE,MODEL,TRAN,ROTA,MOUNTPOINTS
NAME   TYPE  SIZE MODEL            TRAN ROTA MOUNTPOINTS
vda    disk  120G Virtio Block Dev  virtio    1
├─vda1 part    1G                          1 /boot
└─vda2 part  119G                          1 /

What it means: One virtio disk, rotational=1 (often meaningless in VMs but still a hint). Root and boot share the same disk.

Decision: If this will host databases, queues, or heavy logs, plan separate volumes or mount points now (or move to LVM/ZFS). Single-root-on-one-disk is how you get “disk full” outages from chatty logs.

Task 5: Verify filesystem type and mount options

cr0x@server:~$ findmnt -no SOURCE,FSTYPE,OPTIONS /
/dev/vda2 ext4 rw,relatime,errors=remount-ro

cr0x@server:~$ findmnt -no TARGET,SOURCE,FSTYPE,OPTIONS /boot
/boot /dev/vda1 ext4 rw,relatime

What it means: ext4 with safe-ish defaults. errors=remount-ro can save you from silent corruption at the cost of turning errors into immediate pain. That’s a good trade.

Decision: Keep sane defaults unless you have a measured reason. If you’re using SSDs and care about write amplification, review noatime vs relatime and your monitoring needs.

Task 6: Check free space and inode headroom

cr0x@server:~$ df -hT /
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/vda2      ext4  117G  6.1G  105G   6% /

cr0x@server:~$ df -ihT /
Filesystem     Type  Inodes IUsed IFree IUse% Mounted on
/dev/vda2      ext4   7.5M   132K  7.4M    2% /

What it means: Space and inodes are healthy. Inode exhaustion is a classic “disk full but not really” trap with tiny files (caches, mail queues, container layers).

Decision: If you expect many small files, consider separate filesystems with tailored inode ratios, or move such workloads to object storage or a database designed for it.

Task 7: Establish a quick I/O baseline (latency matters more than MB/s)

cr0x@server:~$ sudo apt-get update -y
Hit:1 http://archive.ubuntu.com/ubuntu noble InRelease
Get:2 http://security.ubuntu.com/ubuntu noble-security InRelease [110 kB]
Fetched 110 kB in 1s (160 kB/s)
Reading package lists... Done

cr0x@server:~$ sudo apt-get install -y fio
Setting up fio (3.36-1) ...

cr0x@server:~$ fio --name=randread --filename=/tmp/fio.test --size=1G --rw=randread --bs=4k --ioengine=libaio --iodepth=32 --numjobs=1 --direct=1 --runtime=20 --time_based --group_reporting
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.36
randread: (groupid=0, jobs=1): err= 0: pid=2176: Thu Feb  5 10:11:02 2026
  read: IOPS=18.4k, BW=71.9MiB/s (75.4MB/s)(1.40GiB/20001msec)
    slat (usec): min=3, max=140, avg=9.12, stdev=2.88
    clat (usec): min=179, max=4112, avg=1726.44, stdev=312.20
    lat (usec): min=188, max=4124, avg=1735.81, stdev=312.34

What it means: Random read IOPS and, more importantly, latency distribution. Average ~1.7ms might be fine for general workloads; it might be deadly for a tail-latency-sensitive database.

Decision: If latency is higher than expected, investigate the storage backend (virtio on networked block, host caching, encryption overhead). Don’t “optimize” the app until you know the disk isn’t quietly on fire.

Task 8: Check network interfaces, addresses, and link state

cr0x@server:~$ ip -br addr
lo               UNKNOWN        127.0.0.1/8 ::1/128
ens3             UP             10.20.14.37/24 fe80::5054:ff:fe2a:1b2c/64

cr0x@server:~$ ip route
default via 10.20.14.1 dev ens3 proto dhcp src 10.20.14.37 metric 100
10.20.14.0/24 dev ens3 proto kernel scope link src 10.20.14.37

What it means: Interface is up, has an IP, and there’s a default route.

Decision: If this is production, decide whether DHCP is acceptable. For servers, static addressing (or DHCP reservations with strict inventory) is usually worth it.
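
On Ubuntu-family installs that use netplan, a minimal static-addressing sketch reusing the values from the transcripts above (the file name is an example; your interface name and renderer may differ):

# /etc/netplan/01-static.yaml
network:
  version: 2
  ethernets:
    ens3:
      dhcp4: false
      addresses: [10.20.14.37/24]
      routes:
        - to: default
          via: 10.20.14.1
      nameservers:
        addresses: [10.20.0.53, 10.20.0.54]

Apply it with sudo netplan try so a typo rolls back on its own instead of locking you out.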

Task 9: Verify DNS resolution and resolver configuration

cr0x@server:~$ resolvectl status | sed -n '1,25p'
Global
         Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
  resolv.conf mode: stub
Current DNS Server: 10.20.0.53
       DNS Servers: 10.20.0.53 10.20.0.54

cr0x@server:~$ getent hosts archive.ubuntu.com
2620:2d:4000:1::19 archive.ubuntu.com
91.189.91.83 archive.ubuntu.com

What it means: You know which resolvers you’re using, and name resolution works.

Decision: If resolvers are “mystery IPs,” fix that now. Also decide whether you need split DNS, search domains, or DNSSEC validation (rare internally, but know the choice).

Task 10: Check MTU and basic path health

cr0x@server:~$ ip link show ens3 | sed -n '1,2p'
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:2a:1b:2c brd ff:ff:ff:ff:ff:ff

cr0x@server:~$ ping -c 3 10.20.14.1
PING 10.20.14.1 (10.20.14.1) 56(84) bytes of data.
64 bytes from 10.20.14.1: icmp_seq=1 ttl=64 time=0.335 ms
64 bytes from 10.20.14.1: icmp_seq=2 ttl=64 time=0.284 ms
64 bytes from 10.20.14.1: icmp_seq=3 ttl=64 time=0.290 ms

cr0x@server:~$ ping -M do -s 1472 -c 2 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 1472(1500) bytes of data.
1480 bytes from 8.8.8.8: icmp_seq=1 ttl=115 time=2.31 ms
1480 bytes from 8.8.8.8: icmp_seq=2 ttl=115 time=2.28 ms

What it means: MTU is 1500 and you can pass a full-size packet without fragmentation issues.

Decision: If jumbo frames are expected (MTU 9000), confirm end-to-end. Mixed MTUs create “everything works except the important thing” tickets.
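
If jumbo frames are part of the design, a quick sketch of the check (9000-byte MTU minus 28 bytes of IP/ICMP headers leaves 8972 bytes of payload); this only proves the local hop, so repeat it toward storage and replication peers:

cr0x@server:~$ sudo ip link set ens3 mtu 9000      # temporary; make it persistent in your network config
cr0x@server:~$ ip link show ens3 | grep -o 'mtu [0-9]*'
cr0x@server:~$ ping -M do -s 8972 -c 2 10.20.14.1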

Task 11: Confirm time sync and clock health

cr0x@server:~$ timedatectl
               Local time: Thu 2026-02-05 10:14:21 UTC
           Universal time: Thu 2026-02-05 10:14:21 UTC
                 RTC time: Thu 2026-02-05 10:14:22
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

cr0x@server:~$ chronyc tracking
Reference ID    : 0A140035 (10.20.0.53)
Stratum         : 3
Last offset     : -0.000021 seconds
RMS offset      : 0.000110 seconds
Frequency       : 11.432 ppm fast
Skew            : 0.221 ppm
Root delay      : 0.001234 seconds
Root dispersion : 0.000532 seconds
Update interval : 64.2 seconds
Leap status     : Normal

What it means: Time sync is active and stable. Offset is tiny. Good.

Decision: If time isn’t synchronized, fix it before you deploy anything with TLS, Kerberos, JWTs, or log correlation. Which is to say: everything.

Task 12: Check systemd for failed units (catch quiet breakage)

cr0x@server:~$ systemctl --failed
  UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed.

What it means: No failed units. This is the baseline you want to preserve.

Decision: If anything is failed, read logs now, not during an outage. A “failed” unit might be your monitoring agent, your backup timer, or networking.

Task 13: Confirm logging persistence and disk usage controls

cr0x@server:~$ sudo grep -E '^(Storage|SystemMaxUse|RuntimeMaxUse)=' /etc/systemd/journald.conf | grep -v '^#' || true

cr0x@server:~$ journalctl --disk-usage
Archived and active journals take up 64.0M in the file system.

What it means: journald is using defaults; persistence may be runtime-only depending on distro and directory presence.

Decision: Decide if logs must survive reboot. For production debugging, make them persistent and cap usage so logs don’t eat the root filesystem.
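
A minimal sketch for persistent, capped journald storage (the drop-in name and the 1G cap are placeholders; size the cap against your root filesystem):

cr0x@server:~$ sudo mkdir -p /var/log/journal /etc/systemd/journald.conf.d
cr0x@server:~$ sudo tee /etc/systemd/journald.conf.d/90-persist.conf >/dev/null <<'EOF'
[Journal]
Storage=persistent
SystemMaxUse=1G
EOF
cr0x@server:~$ sudo systemctl restart systemd-journald
cr0x@server:~$ journalctl --disk-usage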

Task 14: Confirm firewall state (and don’t pretend “security group” is enough)

cr0x@server:~$ sudo nft list ruleset
table inet filter {
        chain input {
                type filter hook input priority filter; policy accept;
        }
        chain forward {
                type filter hook forward priority filter; policy accept;
        }
        chain output {
                type filter hook output priority filter; policy accept;
        }
}

What it means: Firewall policy is wide open.

Decision: If this host is reachable beyond a tightly controlled network, lock it down. Even in “private” networks, lateral movement is a thing. Set default deny for inbound, allow what you need, and log drops at a sane rate.
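
A minimal default-deny sketch, assuming SSH is the only inbound service you need right now; tighten the SSH rule to your management network, and keep a console session open while you load it:

cr0x@server:~$ sudo tee /etc/nftables.conf >/dev/null <<'EOF'
#!/usr/sbin/nft -f
flush ruleset
table inet filter {
        chain input {
                type filter hook input priority filter; policy drop;
                ct state established,related accept
                ct state invalid drop
                iif "lo" accept
                meta l4proto { icmp, ipv6-icmp } accept
                tcp dport 22 accept comment "SSH - restrict the source when you can"
        }
        chain forward {
                type filter hook forward priority filter; policy drop;
        }
        chain output {
                type filter hook output priority filter; policy accept;
        }
}
EOF
cr0x@server:~$ sudo nft -f /etc/nftables.conf
cr0x@server:~$ sudo systemctl enable nftables      # loads the saved ruleset at boot, if the nftables service is packaged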

Task 15: Confirm SSH configuration and key-only access

cr0x@server:~$ sudo sshd -T | egrep '^(port|passwordauthentication|permitrootlogin|pubkeyauthentication|kexalgorithms|macs|ciphers)'
port 22
passwordauthentication no
permitrootlogin prohibit-password
pubkeyauthentication yes
kexalgorithms sntrup761x25519-sha512@openssh.com,curve25519-sha256,curve25519-sha256@libssh.org
macs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com
ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com

What it means: Password auth is disabled, root login is restricted, and modern crypto is enabled.

Decision: If password auth is on, turn it off once you confirm keys and break-glass access. If root login is allowed, fix it unless you have a tightly managed exception.
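
A minimal hardening sketch using the sshd_config.d include that modern Ubuntu/Debian ship (the drop-in name is an example); validate before reloading, and keep your current session open until a fresh login works:

cr0x@server:~$ sudo tee /etc/ssh/sshd_config.d/90-hardening.conf >/dev/null <<'EOF'
PasswordAuthentication no
KbdInteractiveAuthentication no
PermitRootLogin no
EOF
cr0x@server:~$ sudo sshd -t                        # syntax check before touching the running daemon
cr0x@server:~$ sudo systemctl restart ssh          # established sessions survive an sshd restart
cr0x@server:~$ sudo sshd -T | grep -E '^(passwordauthentication|permitrootlogin)'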

Task 16: Confirm updates policy (security patches are not optional)

cr0x@server:~$ apt-cache policy | sed -n '1,20p'
Package files:
 100 /var/lib/dpkg/status
     release a=now
Pinned packages:

cr0x@server:~$ systemctl status unattended-upgrades --no-pager
● unattended-upgrades.service - Unattended Upgrades Shutdown
     Loaded: loaded (/lib/systemd/system/unattended-upgrades.service; enabled)
     Active: inactive (dead)

What it means: unattended-upgrades is installed and enabled (common on Ubuntu). Don’t be alarmed by “inactive (dead)”: this particular unit mostly exists to let in-flight upgrades finish at shutdown. The periodic work is driven by apt-daily.timer and apt-daily-upgrade.timer, so check those timers rather than this unit’s active state.

Decision: Choose: automatic security updates, or a scheduled patch window with monitoring and enforcement. What you don’t choose is “hope.”
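
On Ubuntu, “automatic” usually comes down to one small APT config plus the apt-daily timers; a quick sketch to confirm or enable it (if the file is missing or shows “0”, dpkg-reconfigure -plow unattended-upgrades will write it):

# /etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";

cr0x@server:~$ systemctl list-timers 'apt-daily*' --no-pager
cr0x@server:~$ sudo unattended-upgrade --dry-run     # shows what would be upgraded without touching anything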

Task 17: Confirm kernel command line (catch accidental flags)

cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.0-31-generic root=/dev/vda2 ro quiet splash

What it means: Nothing exotic. Good.

Decision: If you see tuning flags you didn’t set (disabling mitigations, changing IOMMU, forcing legacy naming), document why. Unknown kernel flags are how you inherit risk.

Task 18: Confirm top talkers quickly (CPU, memory, I/O) after deploying anything

cr0x@server:~$ top -b -n 1 | sed -n '1,20p'
top - 10:17:02 up  1:21,  1 user,  load average: 0.08, 0.04, 0.01
Tasks: 115 total,   1 running, 114 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.2 us,  0.8 sy,  0.0 ni, 97.8 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :   7872.4 total,   6760.8 free,    430.7 used,    680.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   7441.7 avail Mem

What it means: Load is low, CPU idle is high, iowait is zero. This is a “quiet” baseline.

Decision: Capture this output once. Later, when someone says “it’s slow,” you have a before/after anchor.
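
A minimal sketch for capturing that anchor somewhere it will survive (the directory is an arbitrary choice; a wiki page or your CMDB works just as well):

cr0x@server:~$ sudo mkdir -p /var/lib/baseline
cr0x@server:~$ for c in "uname -a" "lsblk" "df -hT" "free -h" "ip -br addr" "systemctl --failed"; do \
      echo "### $c"; $c; echo; done | sudo tee /var/lib/baseline/$(hostname)-$(date +%F).txt >/dev/null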

Storage: partitioning, filesystems, and “future you”

Most outages I’ve seen that start with “the app is slow” end with “the disk is full,” “the disk is lying,” or “we built the storage on an assumption.” Storage is quiet until it isn’t. And when it isn’t, it takes everything down with it.

Partitioning: choose failure domains on purpose

Single root partitions are fine for short-lived VMs and stateless services. They are a trap for anything that emits logs, stores queue files, caches, container layers, or database data.

What to do:

  • Separate /var (or at least /var/log) if you expect noisy logs or package churn.
  • Separate application data (e.g., /srv or /var/lib/<service>) from the OS.
  • If you must keep one filesystem, enforce log retention and set quotas/limits where possible.

Filesystem choice: boring is a feature

ext4 and XFS are “boring” in the best sense: widely understood failure modes, predictable tooling, and mature recovery paths.

ZFS is fantastic when you need snapshots, checksums, replication, and data integrity with a coherent admin story. But ZFS requires you to respect RAM, recordsize, ashift, and the fact that it’s a storage platform, not a mount option.

SSD behavior: TRIM and the myth of infinite IOPS

For SSD-backed storage, ensure TRIM/discard is appropriate. Continuous discard can be fine or harmful depending on backend; scheduled TRIM (fstrim timer) is a common compromise.
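
A quick sketch to see which approach this host actually uses, assuming a systemd-based distro that ships the fstrim timer (most do):

cr0x@server:~$ findmnt -no OPTIONS / | tr ',' '\n' | grep discard     # continuous discard enabled?
cr0x@server:~$ systemctl list-timers fstrim.timer --no-pager          # scheduled TRIM?
cr0x@server:~$ sudo fstrim -av                                        # one-off trim; also proves the backend honors discard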

Also: SSDs are fast at many things and surprisingly bad at a few. Random write latency under sustained load can go from “nice” to “why are we paging the CEO” if the device or backend is saturated.

Encryption: decide, measure, then standardize

Disk encryption (LUKS) is often required, sometimes optional, and always something you should benchmark. Hardware acceleration usually makes it cheap. Bad CPU flags or missing acceleration makes it less cheap.

Make the decision explicit: encrypt OS disk? encrypt data disks? who holds keys? how do you do unattended boots in a VM? If you can’t answer those cleanly, you’re not done.
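
For the “measure” part, assuming LUKS/dm-crypt, two quick checks answer most of it: does the CPU expose AES acceleration to this (virtual) machine, and what does the in-memory cipher benchmark say before you blame the disk?

cr0x@server:~$ grep -m1 -o -w aes /proc/cpuinfo          # present on x86 when AES-NI is exposed to the guest
cr0x@server:~$ sudo cryptsetup benchmark                 # in-memory cipher throughput, no disk I/O involved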

Network: verify reality, not diagrams

Networks are political systems with packets. Your install checklist needs to confirm what’s true: routes, resolvers, MTU, and which security boundary actually enforces policy.

DNS: your dependency’s dependency

Verify resolvers, and verify they’re the right ones for your environment. The most common DNS misconfigurations I see after installs:

  • Using a default resolver that can’t resolve internal zones.
  • Search domains that cause slow lookups and weird name collisions.
  • Two resolvers listed, one dead; half your queries hang in timeouts.

MTU: small setting, big blast radius

When MTU mismatches, some traffic works. Then your VPN breaks, your database replication stalls, or your container overlay starts dropping large packets. Verify early with “do not fragment” pings to representative destinations.

Egress matters

A server that can’t reach package repos, time servers, certificate endpoints, or your monitoring intake is a server that will rot silently. Confirm egress now. If you’re in a locked-down environment, document the exact required destinations and ports so firewall changes don’t become folklore.

Security: harden without breaking operations

Security hardening fails in two ways: you do nothing and get owned, or you do “everything” and lock out the people who have to fix production.

SSH: keys, minimal exposure, predictable access

  • Use key-based auth. Disable password auth after verifying you have working keys.
  • Don’t allow direct root login. Use sudo with auditing.
  • Restrict SSH to management networks where possible.

Firewall: default deny inbound, allow explicitly

If your environment already uses network security controls, great. Still set a host firewall unless there’s a strict reason not to (some appliances, some load balancers, some specialized routing hosts). Defense in depth isn’t a slogan; it’s how you survive a misapplied rule upstream.

Least privilege: users and services

Create service accounts. Avoid running app processes as root. If you need privileged ports or kernel-level access, be explicit with capabilities, systemd unit hardening, and documentation.

Short joke #2: The “temporary” firewall exception is the most permanent employee in most companies.

Observability: logs, metrics, and time

Fresh installs are when you decide whether debugging will be engineering or séance.

Time sync is observability’s skeleton

If timestamps drift, your incident timeline becomes fan fiction. Keep systems on UTC unless you have a strong reason not to. Configure NTP/chrony, confirm synchronization, and monitor it.

Logs: make them persistent, bounded, and searchable

At minimum:

  • Ensure journald persistence (/var/log/journal) or configure rsyslog to store to disk.
  • Cap log storage so it doesn’t consume the root filesystem.
  • Ship logs centrally if you run more than one server. “SSH and grep” doesn’t scale past the first stressful outage.

Metrics: capture the boring ones

CPU, memory, disk latency, filesystem usage, network errors, and process counts are not “advanced observability.” They are the minimum. Capture them from day one so you can answer: “Is this new or normal?”

Core dumps: decide your policy before you need it

Core dumps can save days when debugging a crash. They can also leak secrets and eat disk. Decide:

  • Enable core dumps for specific services in controlled locations.
  • Disable or limit them on sensitive systems where policy forbids it.
  • Always cap size and storage location, and treat dumps as sensitive artifacts.
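
If the policy above lands on “enabled but bounded,” a minimal sketch for systemd-coredump (assumes the systemd-coredump package is installed; the caps are placeholders):

cr0x@server:~$ sudo mkdir -p /etc/systemd/coredump.conf.d
cr0x@server:~$ sudo tee /etc/systemd/coredump.conf.d/90-limits.conf >/dev/null <<'EOF'
[Coredump]
Storage=external
Compress=yes
ProcessSizeMax=2G
MaxUse=2G
EOF
cr0x@server:~$ coredumpctl list --no-pager               # empty on a fresh host; the command working is the point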

Backups: prove restore, not intent

Backups fail in predictable ways: they don’t run, they run but store empty data, they store data you can’t decrypt, or they store data you can’t restore fast enough to matter.

Minimum viable backup design

  • Define scope: OS rebuildable? Then don’t waste time backing it up; backup config and data.
  • Define RPO/RTO: how much data loss is acceptable and how fast you must recover.
  • Make restore a first-class test: every new host should have a restore drill.

What to do immediately after install

  • Create an initial snapshot (VM-level or filesystem-level) once baseline checks pass.
  • Run a backup job and restore a small file or directory to a different path. Verify contents.
  • Verify credentials and encryption keys are stored in your secret manager and accessible under break-glass procedures.
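
The restore test doesn’t need to be elaborate. A minimal sketch with plain tar shows the shape of it; swap in your real backup tool, a real backup target, and a path that matters to this host:

cr0x@server:~$ sudo tar -czf /tmp/drill-backup.tar.gz /etc/ssh /etc/fstab
cr0x@server:~$ mkdir -p /tmp/drill-restore
cr0x@server:~$ sudo tar -xzf /tmp/drill-backup.tar.gz -C /tmp/drill-restore
cr0x@server:~$ sudo diff -r /etc/ssh /tmp/drill-restore/etc/ssh && echo "restore verified"

The point isn’t the tarball; it’s making “restore to an alternate path and compare” muscle memory before the tooling gets complicated.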

If your backup solution is “we have RAID,” stop. Re-read that sentence, then fix your life.

Fast diagnosis playbook: find the bottleneck quickly

When a fresh install “feels slow,” do not start with application tuning. Start with the system’s critical paths. This playbook is ordered: first checks catch the most common and highest-impact failure modes.

First: is the machine starving or sick?

  • Check load and saturation: CPU, iowait, memory pressure.
  • Check failed services: networking, time sync, storage mounts, agent failures.
  • Check disk full / inode full: it’s the classic because it works.

cr0x@server:~$ uptime
10:20:11 up  1:24,  1 user,  load average: 0.12, 0.06, 0.02

cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 6758208  88240 650240    0    0     1     4   75   98  1  1 98  0  0
 0  0      0 6758120  88240 650240    0    0     0     0   70   90  0  0 100 0  0
 1  0      0 6758020  88240 650240    0    0     0    12   88  120 2  1 97  0  0
 0  0      0 6757908  88240 650240    0    0     0     0   66   85  0  0 100 0  0
 0  0      0 6757800  88240 650240    0    0     0     0   69   89  0  0 100 0  0

Interpretation: High wa means I/O wait; rising r means runnable processes piling up; nonzero si/so means swapping. Decide whether you’re CPU-bound, I/O-bound, or memory-bound.

Second: is storage the limiting factor?

  • Check device latency and utilization.
  • Check filesystem errors and mount state.
  • Check for write amplification culprits (logs, journaling, chatty services).

cr0x@server:~$ iostat -xz 1 3
Linux 6.8.0-31-generic (server) 	02/05/2026 	_x86_64_	(4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.21    0.00    0.62    0.05    0.00   98.12

Device            r/s     w/s   rkB/s   wkB/s  rrqm/s  wrqm/s  %util  await  r_await  w_await
vda              5.00    8.00    80.0   120.0    0.00    0.50   2.10   1.10     0.90     1.25

Interpretation: await creeping up and %util near 100 indicates saturation. If latency is high before you have real load, your backend is constrained.

Third: is the network the bottleneck?

  • Check packet loss and RTT to key dependencies.
  • Check DNS latency and failures.
  • Check retransmits and interface errors.

cr0x@server:~$ ip -s link show ens3 | sed -n '1,12p'
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:2a:1b:2c brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
      1283942    9321      0       0       0       0
    TX:  bytes packets errors dropped carrier collsns
       902211    8010      0       0       0       0

cr0x@server:~$ dig +stats +time=2 +tries=1 archive.ubuntu.com A | tail -n 5
;; Query time: 21 msec
;; SERVER: 10.20.0.53#53(10.20.0.53)
;; WHEN: Thu Feb 05 10:22:01 UTC 2026
;; MSG SIZE  rcvd: 79

Interpretation: Interface has no errors/drops; DNS is fast. If query time is hundreds/thousands of ms, DNS is your “random latency” generator.

Fourth: validate the app’s dependencies, not just the app

Databases, message brokers, object storage, auth providers, and license servers can all be the bottleneck. Fast diagnosis means identifying which dependency is slow, then which layer (network vs storage vs CPU) is causing it.

Common mistakes: symptoms → root cause → fix

1) Symptom: disk fills up overnight

Root cause: Logs default to verbose, journald unbounded, or an app writes debug logs to /var/log with no rotation.

Fix: Cap journald, configure logrotate, and separate /var or /var/log if the host is log-heavy. Verify with journalctl --disk-usage and df -h.

2) Symptom: SSH works, but package installs randomly hang

Root cause: DNS resolver partially broken; one resolver times out, causing intermittent delays.

Fix: Use resolvectl status to identify servers, remove dead resolvers, and confirm query times with dig +stats. If you’re using systemd-resolved stub mode, ensure the upstreams are correct.

3) Symptom: TLS errors, “certificate not yet valid,” weird auth failures

Root cause: Time not synchronized; chrony/NTP disabled or blocked.

Fix: Enable time sync, ensure UDP 123 to NTP sources is allowed (or whatever your environment uses), verify with timedatectl and chronyc tracking.

4) Symptom: “server is slow” only during backups

Root cause: Backup job saturates disk I/O or network; no bandwidth limiting; snapshots cause copy-on-write amplification on certain filesystems.

Fix: Measure iostat -xz during backup, set rate limits, schedule off-peak, and consider snapshot-friendly layouts. Test restore speed too, not just backup speed.

5) Symptom: reboot takes forever, then services are missing data

Root cause: Filesystem checks or mounts failing; data volume not mounted; system continues with empty directories.

Fix: Use systemctl --failed, journalctl -b, and findmnt. Add x-systemd.automount or dependencies if appropriate, but don’t mask the failure. Ensure mount points are not silently created on root.
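
A sketch of an fstab entry that won’t wedge boot forever (device and mount point are placeholders), paired with a check that makes the absence loud instead of letting the app write into an empty directory on root:

# /etc/fstab
UUID=REPLACE-WITH-REAL-UUID  /srv/data  ext4  defaults,nofail,x-systemd.device-timeout=10s  0  2

cr0x@server:~$ findmnt /srv/data || echo "ALERT: /srv/data is not mounted"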

6) Symptom: performance regressions after “tuning”

Root cause: Changing kernel/sysctl parameters without baseline measurements; using a tuning guide written for different kernels or storage.

Fix: Capture a baseline (fio, iostat, vmstat) before tuning. Change one variable at a time, record results, and be willing to revert. Production isn’t your lab.

7) Symptom: intermittent connection resets under load

Root cause: MTU mismatch or PMTU discovery blocked; sometimes only large packets fail.

Fix: Validate with ping -M do and test between key nodes. Fix MTU consistently across the path, or allow fragmentation/ICMP as needed by policy.

8) Symptom: can’t log in during incident; only one admin has access

Root cause: No break-glass path, no secondary admin key, no console access documented.

Fix: Add at least two admin keys, confirm sudo works, document console/out-of-band access, and test it. “I think it works” is not a control.

Three corporate mini-stories from the trenches

Mini-story #1: The incident caused by a wrong assumption

Company A rolled out a new batch of application servers. The build was “standard,” and the team was proud: automated provisioning, consistent packages, identical instance types. They even had a checklist, but it mostly read like a shopping list: install agent, install app, done.

Two weeks later, they had a burst of intermittent errors: database timeouts, then bursts of recovery. CPU graphs looked fine. Memory looked fine. The database team insisted it wasn’t them, which was technically true and culturally suspicious.

The culprit was one wrong assumption: that the storage backing the VM disks was local SSD. It wasn’t. A change in the virtualization cluster had moved these VMs onto a network-backed block storage tier with aggressive caching and unpredictable tail latency. Under normal load, it looked OK. Under bursts, random write latency spiked, the app’s local queue piled up, and everything downstream looked guilty.

The fix was unglamorous: they added a simple fio baseline check to the provisioning pipeline, recorded latency percentiles, and hard-failed builds that landed on the wrong storage class. They also separated the queue directory onto a dedicated volume with predictable performance. The incident didn’t teach them to “optimize.” It taught them to verify what they paid for.

Mini-story #2: The optimization that backfired

Company B had a problem that looked like a storage problem: heavy log volume from a chatty service. Someone suggested a “quick win”: mount the log filesystem with more aggressive options to reduce writes. They changed mount options and tweaked a few sysctls they found in a tuning blog post. The box seemed fine in the first hour. Everyone went home satisfied.

Then came the backfire. After an unclean reboot (host maintenance, nothing exotic), the filesystem needed recovery. Startup time ballooned. Worse, the service resumed with subtle data loss in its local state directory because it assumed certain fsync semantics that had been weakened by the “optimization.” The app wasn’t designed for that. Most aren’t.

What made it painful wasn’t just the regression. It was the lack of measurement. No baseline I/O numbers. No note of exactly what changed. No quick rollback plan. The team spent more time proving causality than fixing the issue.

They eventually reverted to safe defaults, moved logs to a separate volume with strict rotation, and reduced log verbosity at the source. The long-term lesson: if you can’t explain the tuning in one paragraph and validate it with a before/after measurement, it’s not tuning. It’s cargo cult.

Mini-story #3: The boring but correct practice that saved the day

Company C ran a fleet of internal services that weren’t glamorous: schedulers, glue APIs, small databases. Nothing that got conference talks. The team had one habit that felt overly cautious: after every fresh install, they ran a restore test.

Not a yearly drill. Not a “we validated the backup system.” A literal restore of a small subset of data from the backup target to an alternate directory on the new host. Every time. The ritual was so consistent that it became part of the provisioning definition of done.

One day, they rotated credentials for the backup target. The new credentials were deployed to most hosts, but not all. Monitoring showed backups “running,” but a subset was failing with auth errors. The failures were masked by a retry loop that logged warnings nobody read because, of course, logs were noisy.

The restore test caught it immediately on new hosts. The team fixed the credential rollout before the gap became a disaster. No heroics, no war room, no weekend rebuild. Just a boring practice with a high signal-to-noise ratio.

FAQ

1) How strict should this checklist be for dev environments?

Stricter than you think. Dev systems are where bad defaults breed, then get promoted to production via “it worked in staging.” You can skip some hardening, but don’t skip baselines, time sync, and disk layout sanity.

2) Should I use ext4, XFS, or ZFS?

Use ext4 or XFS when you want predictable, well-understood operations and you don’t need advanced snapshot/replication features. Use ZFS when you will actually use its strengths (checksums, snapshots, replication) and can run it with discipline.

3) Is swap still a thing on servers?

Yes. The right answer depends on workload and failure tolerance. A small swap can soften spikes and prevent OOM kills. But uncontrolled swapping can destroy latency. Decide explicitly, set monitoring, and test under load.

4) Do I really need a host firewall if my network has security groups?

Usually yes. Security groups get misapplied, duplicated, or temporarily opened. A host firewall is a second lock. It also documents intent right on the machine.

5) What’s the minimum observability for a single server?

Persistent logs with rotation/caps, CPU/memory/disk/network metrics, and time sync monitoring. If you can’t answer “what changed?” after an incident, you’re under-instrumented.

6) How do I avoid “disk full” incidents without overengineering storage?

Separate /var or at least /var/log when feasible, cap journald usage, configure logrotate, and monitor filesystem usage with alerts that fire before 95%.

7) When should I benchmark disk with fio?

Once on a clean system to establish baseline, and again after any storage backend change (new VM host, new volume type, encryption changes). Keep tests small and controlled—don’t torch production disks.

8) What’s the first thing to check when “the app is slow” on a fresh host?

Disk latency (iostat), memory pressure (vmstat), and DNS latency (dig +stats). Those three masquerade as application issues constantly.

9) How do I make sure mounts don’t fail silently?

Use findmnt in health checks, ensure systemd units depend on mounts when needed, and watch boot logs. Avoid designs where a missing mount point results in the app writing to root without noticing.

10) What should I document from a fresh install?

OS/kernel versions, disk layout and filesystem types, network config (IPs/routes/DNS), time sync sources, firewall policy, update policy, and backup/restore verification results. If it isn’t written down, it didn’t happen.

Conclusion: next steps that actually stick

A fresh install checklist is not bureaucracy. It’s latency reduction for your future incidents. The minutes you spend now buy you speed later: faster diagnosis, safer changes, fewer “how did we end up here?” surprises.

Do this next:

  1. Turn this into a runbook your team can execute without you. Store it with your infrastructure code.
  2. Automate the checks that can be automated: failed units, disk usage, time sync, baseline fio test (where safe), and firewall/SSH policy checks.
  3. Require proof for backups: one restore test per new host, logged as an artifact.
  4. Capture baselines (CPU/memory/disk/network) and store them. Without baselines, you only have opinions.
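
For step 2, a minimal sketch of the kind of check that belongs in cron, a systemd timer, or your monitoring agent (thresholds and the failure messages are placeholders):

#!/usr/bin/env bash
# post-install sanity check: exit non-zero if the baseline has drifted
set -u
fail=0

# 1. any failed systemd units?
if [ -n "$(systemctl --failed --no-legend)" ]; then
    echo "FAIL: failed systemd units"; fail=1
fi

# 2. root filesystem usage under control?
pct=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$pct" -ge 90 ]; then
    echo "FAIL: root filesystem at ${pct}%"; fail=1
fi

# 3. clock still synchronized?
if [ "$(timedatectl show -p NTPSynchronized --value)" != "yes" ]; then
    echo "FAIL: clock not synchronized"; fail=1
fi

exit "$fail"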

If you do nothing else, remember the thesis: you are not installing a server. You are installing a future incident response. Make it kind to you.
