You know the moment. It’s 02:13, a database is “slow,” and the only evidence anyone has is a screenshot of top
plus a vague memory that “it was fine yesterday.” The vendor wants “a full system report,” your manager wants a timeline,
and you want sleep. The right move is to make the machine tell its story in one reproducible snapshot.
This is how you build a single script that collects hardware, drivers, storage state, kernel messages, and the errors that
actually matter—without bricking the host, without leaking secrets, and without producing an unreadable 80MB blob that nobody opens.
What “good” looks like in a full system report
A full system report is not a trophy. It’s a tool. If it can’t answer “what changed?” and “what’s failing?” quickly, it’s
just disk usage with extra steps.
Here’s the bar I use in production:
- One command to run. No interactive prompts. No “install this” at 3 a.m.
- Readability first. Text files grouped by topic. A human should be able to skim.
- Machine-diff friendly. Stable formatting so you can compare today vs last week.
- Minimal risk. No benchmarks by default. No invasive firmware tools. No “let’s rescan the bus” heroics.
- Secret-aware. Capture enough to debug; avoid dumping tokens, customer data, and private keys.
- Storage-credible. If the box has ZFS, mdraid, LVM, multipath, NVMe—collect their health properly.
- Error-focused. Grab kernel and journal errors with context. Not just “here are all logs since 2019.”
Also: compress it. Name it sensibly. Include a manifest. If you ever tried to email a vendor “some files I copied from /var/log”
you already know why.
Facts & short history: why this stuff is the way it is
A little context makes the tool better. These details explain why Linux reporting looks like a cabinet full of oddly-shaped wrenches.
- /proc and /sys exist because the kernel needed a “filesystem” API for introspection. It’s not a disk; it’s a view into kernel state.
- lspci comes from PCI utilities that predate a lot of “modern” observability. It’s still the fastest way to map devices to drivers.
- dmesg is old-school ring buffer logging. It’s noisy, but it’s where hardware and driver failures confess first.
- systemd’s journal was designed to fix log fragmentation. It centralizes logs, adds metadata, and makes “show me boots” a first-class query.
- SMART predates NVMe. The NVMe “SMART log” is similar in spirit but not identical; mixing interpretations causes bad calls.
- mdraid has survived because it’s boring and good. It’s in-kernel, stable, and understood—still a default in many enterprises.
- Device names like /dev/sda were never meant to be stable identifiers. That’s why we have /dev/disk/by-id and udev rules.
- Multipath exists because SANs lie by omission. If you don’t verify path health, you’ll believe “redundant” means “safe.”
- Modern CPU topology is complicated because performance is complicated. NUMA, SMT, and power management can make “same CPU model” behave differently.
Joke #1: Logs are like crime scenes—everyone says they “didn’t touch anything,” and then you find fresh fingerprints all over /etc.
Design principles: safe, comparable, supportable
1) Prefer “read-only” inspection over “active probing”
You can get 90% of value from inspection: lsblk, lspci, modinfo, journalctl, SMART reads.
Active probing (stress tests, rescans, firmware updates) can change the system or amplify a failure. Don’t do it in the default path.
2) Capture identity and context first
Before you collect anything else, capture:
- hostname, OS version, kernel version, uptime
- time sync status
- virtualization/container hints
- boot history and last reboot reason (when available)
The reason is simple: many “errors” are just time jumps, partial boots, or kernel mismatches. Context prevents you from diagnosing ghosts.
3) Make output stable enough to diff
Humans like “latest first.” Diff tools like stability. You can have both by writing files with deterministic names and stable formats,
and by separating “current state” from “recent events.” For example: save lsblk output and also save
“journal errors since boot.”
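One way to make “current state” captures diff-friendly is to normalize volatile fields before writing them. This is a minimal sketch: the `normalize` helper name and the timestamp pattern are my own, and a real version would also mask PIDs and counters.

```shell
# Sketch: strip volatile fields (here, ISO timestamps) and sort with a
# fixed locale so `diff` between two captures only shows real drift.
normalize() {
  sed -E 's/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:]+Z?/<TIME>/g' | LC_ALL=C sort
}

# Demo on inline sample lines; on a real host you would pipe command output.
printf 'b 2026-02-05T03:24:11Z\na 2026-02-05T03:20:00Z\n' | normalize
```

The same idea applies to any file you intend to compare across weeks: stable name, stable order, volatile bits masked.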
4) Avoid collecting secrets by accident
Don’t vacuum up:
- /etc/shadow, SSH private keys, cloud metadata tokens
- application configs containing credentials (database URLs, API keys)
- full environment dumps from process listings
Instead, collect sanitized config fragments (package versions, service status, unit files) and logs with bounded time windows.
A support bundle should be safe to share with your own security team without a long apology email.
5) Be explicit about permissions
Many useful commands require root: SMART reads, journal access, some storage tooling. Your script should:
- run as root (recommended) or degrade gracefully
- record which commands failed due to permissions
- never “sudo” inside the script (too many environments forbid it)
6) Capture errors with surrounding evidence
Kernel stack traces, NVMe reset messages, ext4 errors, mpt3sas timeouts—these are the bread crumbs.
When you extract errors, include a few lines of context and include the boot ID / timestamp. Otherwise you’ll paste
one scary line into chat and start a panic you didn’t need.
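Grabbing context is one `grep -C` away. A sketch using an inline sample log; on a real host you would feed `dmesg -T` or `journalctl -k` output instead:

```shell
# Sketch: pull the error line plus one line of context on each side,
# so the excerpt carries its own evidence (what preceded, what followed).
log='Feb 05 03:11:20 server kernel: nvme nvme0: creating 8 I/O queues
Feb 05 03:11:22 server kernel: nvme nvme0: I/O 129 QID 7 timeout, aborting
Feb 05 03:11:23 server kernel: nvme nvme0: reset controller'

printf '%s\n' "$log" | grep -C1 'timeout'
```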
Fast diagnosis playbook (first/second/third)
When someone says “the server is slow,” they mean “my request is slow.” That could be CPU, storage, memory pressure, network,
or a driver spamming resets. Start with the checks that rule out whole classes of problems.
First: confirm the basics (10 seconds)
- Is the clock sane? Time jumps make logs useless and can break TLS, clustering, and caches.
- Is the machine rebooting or flapping? Uptime and boot logs tell you if you’re chasing a moving target.
- Is there a storm in dmesg? If the kernel is shouting, listen before you tune anything.
Second: identify the bottleneck domain (1 minute)
- CPU saturation? High load with high user/sys, or run queue piling up.
- Memory pressure? Swap activity, OOM kills, reclaim stalls.
- I/O wait? Elevated iowait, blocked tasks, disk latency.
- Network? Drops, retransmits, NIC driver resets.
Third: map symptoms to the physical/driver layer (5 minutes)
- Which device is slow? Which controller? Which driver?
- Do errors align with a firmware version or module?
- Are you on a degraded RAID/ZFS pool?
- Did a “minor” kernel update change the storage stack?
A reliable workflow is: logs → topology → health → configuration drift. The opposite (“tune sysctl first”) is how you
end up polishing the hood while the engine is on fire.
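That workflow can be sketched as a read-only skeleton. The `triage` function name and the specific commands per step are placeholders; swap in whatever your distro ships:

```shell
# Sketch of the logs -> topology -> health -> drift order.
# Every step only reads state; nothing here mutates the host.
triage() {
  echo "== 1. logs =="
  dmesg --level=err,warn 2>/dev/null | tail -n 5
  echo "== 2. topology =="
  command -v lsblk >/dev/null 2>&1 && lsblk -d -o NAME,TYPE 2>/dev/null \
    || echo "(lsblk unavailable)"
  echo "== 3. health =="
  cat /proc/mdstat 2>/dev/null || echo "(no mdstat)"
  echo "== 4. drift =="
  uname -r
}

triage
```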
12+ practical tasks: commands, meaning, decision
These are the specific tasks I expect a “full system report” to cover. Each one includes: a command, what the output tells you,
and what decision you can make from it. Use them interactively, and also bake them into the one-script bundle later.
Task 1: Identify OS + kernel + boot mode
cr0x@server:~$ uname -a
Linux server 6.5.0-21-generic #21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC x86_64 x86_64 x86_64 GNU/Linux
Meaning: Kernel version and build flavor matter for driver behavior (especially storage and NIC).
Decision: If this kernel recently changed, treat “performance regression” as “driver/firmware mismatch until proven otherwise.”
cr0x@server:~$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
Meaning: Your package and kernel update policy is implied by distro and release.
Decision: Vendor support often hinges on distro minor release; don’t hand-wave this.
cr0x@server:~$ test -d /sys/firmware/efi && echo UEFI || echo BIOS
UEFI
Meaning: Boot mode influences disk partitioning, bootloader behavior, and sometimes firmware tooling.
Decision: For fleets: standardize boot mode; mixed UEFI/BIOS complicates automation and recovery.
Task 2: Confirm uptime and last boot logs
cr0x@server:~$ uptime -p
up 17 days, 3 hours, 12 minutes
Meaning: Long uptimes hide slow degradation; short uptimes mean your evidence window is tiny.
Decision: If uptime is suspiciously short, prioritize “why did it reboot” over “why is it slow.”
cr0x@server:~$ journalctl -b -1 -n 30 --no-pager
Feb 05 01:02:11 server kernel: Linux version 6.5.0-21-generic (buildd@lcy02-amd64-039) ...
Feb 05 01:02:19 server systemd[1]: Started Journal Service.
Feb 05 01:02:26 server kernel: nvme nvme0: controller is down; will reset: CSTS=0x1, PCI_STATUS=0x10
Feb 05 01:02:28 server kernel: nvme nvme0: reset controller
Meaning: Previous boot messages often show the first occurrence of the problem (like storage resets).
Decision: If prior boots show device resets, stop blaming the application and start validating storage/PCIe health.
Task 3: Quick CPU and memory pressure snapshot
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 512340 92112 4213900 0 0 12 44 320 510 6 2 90 2 0
5 1 0 120480 93200 4061012 0 0 180 920 1250 2100 14 6 60 20 0
Meaning: r is the run queue, b is blocked (uninterruptible) tasks, si/so is swap in/out, and wa is iowait.
Decision: High wa and blocked tasks mean storage latency; don’t “add CPU” to fix a disk problem.
cr0x@server:~$ free -h
total used free shared buff/cache available
Mem: 31Gi 12Gi 490Mi 1.2Gi 18Gi 17Gi
Swap: 0B 0B 0B
Meaning: “free” memory is not the whole story; “available” is.
Decision: If available is low and latency is high, check reclaim/IO; if swap exists and is active, expect tail latency spikes.
Task 4: Find kernel errors fast
cr0x@server:~$ dmesg -T --level=err,warn | tail -n 20
[Mon Feb 5 03:11:22 2026] nvme nvme0: I/O 129 QID 7 timeout, aborting
[Mon Feb 5 03:11:22 2026] nvme nvme0: Abort status: 0x0
[Mon Feb 5 03:11:23 2026] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
[Mon Feb 5 03:11:25 2026] EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 262402 starting block 8392704
Meaning: This is the kernel telling you the ground truth: timeouts, resets, filesystem fallout.
Decision: If you see resets/timeouts, pause performance tuning. Collect firmware versions, PCIe link state, and storage health next.
Task 5: Map PCI devices to drivers
cr0x@server:~$ lspci -nnk | sed -n '1,80p'
00:00.0 Host bridge [0600]: Intel Corporation Device [8086:3e30] (rev 0a)
03:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller [144d:a808]
Subsystem: Samsung Electronics Co Ltd Device [144d:a801]
Kernel driver in use: nvme
Kernel modules: nvme
05:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ [8086:1572]
Kernel driver in use: i40e
Kernel modules: i40e
Meaning: “What device, which driver.” This is how you connect a timeout in dmesg to a specific controller and module.
Decision: If the driver is unexpected (fallback driver, or old out-of-tree module), you have a supportability and stability risk.
Task 6: Confirm loaded modules and versions
cr0x@server:~$ lsmod | head
Module Size Used by
nvme 61440 2
nvme_core 212992 4 nvme
i40e 557056 0
xfs 1703936 1
Meaning: Loaded modules show what’s actually running, not what you believe is installed.
Decision: If storage/NIC modules are missing expected dependencies or show unusual “Used by” counts, suspect driver churn or partial upgrades.
cr0x@server:~$ modinfo i40e | sed -n '1,15p'
filename: /lib/modules/6.5.0-21-generic/kernel/drivers/net/ethernet/intel/i40e/i40e.ko
version: 2.24.0-k
license: GPL
description: Intel(R) Ethernet Connection XL710 Network Driver
firmware: i40e/ddp/i40e-1.3.34.0.pkg
Meaning: Driver version and firmware blob expectations are here.
Decision: If the driver expects a firmware package that isn’t present, you can get degraded features or weird link behavior.
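You can check that expectation directly: `modinfo -F firmware` lists the blobs a module declares, and you can test whether they exist under /lib/firmware. A sketch; the `fw_check` name and the compressed-blob suffixes are assumptions about how your distro packages firmware:

```shell
# Sketch: list the firmware files a module declares and check whether
# each one is present (plain, or compressed as .xz/.zst on newer distros).
fw_check() {
  local mod="$1" fw
  modinfo -F firmware "$mod" 2>/dev/null | while read -r fw; do
    if [ -e "/lib/firmware/$fw" ] || [ -e "/lib/firmware/${fw}.xz" ] \
       || [ -e "/lib/firmware/${fw}.zst" ]; then
      echo "present: $fw"
    else
      echo "MISSING: $fw"
    fi
  done
}

fw_check i40e
```

A module that declares no firmware (or a module name that doesn’t exist) simply produces no output, which keeps this safe inside a bundle script.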
Task 7: Inventory disks, filesystems, and mount options
cr0x@server:~$ lsblk -e7 -o NAME,TYPE,SIZE,MODEL,SERIAL,ROTA,TRAN,HCTL,FSTYPE,MOUNTPOINTS
NAME TYPE SIZE MODEL SERIAL ROTA TRAN HCTL FSTYPE MOUNTPOINTS
nvme0n1 disk 1.8T Samsung SSD 980 S6X... 0 nvme - - -
nvme0n1p1 part 512M - - 0 nvme - vfat /boot/efi
nvme0n1p2 part 1.8T - - 0 nvme - LVM2_member
dm-0 lvm 1.8T - - 0 - - ext4 /
Meaning: This maps physical devices to logical layers (partitions → LVM → filesystem).
Decision: If you see unexpected stacking (e.g., mdraid on top of dm-crypt on top of LVM), performance and recovery become harder. Document it.
cr0x@server:~$ findmnt -no TARGET,SOURCE,FSTYPE,OPTIONS | sed -n '1,20p'
/ /dev/mapper/vg0-root ext4 rw,relatime,errors=remount-ro
/boot/efi /dev/nvme0n1p1 vfat rw,relatime,fmask=0077,dmask=0077
Meaning: Mount options matter: barriers, atime, discard, error handling.
Decision: If critical filesystems mount with risky options (like disabling journaling features or barriers), treat it as an outage waiting politely to happen.
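A bundle script can flag such options mechanically. This sketch reads `TARGET OPTIONS` pairs (as `findmnt -rno TARGET,OPTIONS` would emit them); the “risky” list is an opinionated assumption, so tune it to your fleet’s standards:

```shell
# Sketch: flag mount options that deserve a written rationale.
# stdin: one "TARGET OPTIONS" pair per line.
flag_opts() {
  awk '{
    n = split($2, o, ",")
    for (i = 1; i <= n; i++)
      if (o[i] ~ /^(nobarrier|data=writeback|discard)$/)
        printf "%s: review option %s\n", $1, o[i]
  }'
}

printf '/ rw,relatime,discard\n/boot rw,relatime\n' | flag_opts
```

A hit is not automatically wrong; it just means someone should be able to explain why the default was changed.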
Task 8: Check NVMe health (if present)
cr0x@server:~$ nvme list
Node SN Model Namespace Usage Format FW Rev
/dev/nvme0n1 S6X... Samsung SSD 980 PRO 2TB 1 1.80 TB / 2.00 TB 512 B + 0 B 5B2QGXA7
Meaning: Firmware revision is not trivia; it’s often the difference between “stable” and “mystery resets.”
Decision: If the FW Rev is known-bad in your environment, schedule a controlled update, not an emergency patch in production.
cr0x@server:~$ nvme smart-log /dev/nvme0n1 | sed -n '1,25p'
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0x00
temperature : 44 C
available_spare : 100%
percentage_used : 3%
data_units_read : 12,345,678
data_units_written : 9,876,543
media_errors : 0
num_err_log_entries : 18
Meaning: critical_warning, media_errors, and error log entries are your early warning.
Decision: If error log entries climb and you see kernel timeouts, suspect the device, PCIe link, or controller/firmware interaction.
Task 9: Check classic SMART (SATA/SAS)
cr0x@server:~$ smartctl -a /dev/sda | sed -n '1,40p'
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-6.5.0-21-generic] (local build)
Device Model: ST12000NM0007
Serial Number: ZRT0...
Firmware Version: SN02
SMART overall-health self-assessment test result: PASSED
Meaning: “PASSED” is not a clean bill of health; you still need attribute trends and error logs.
Decision: If reallocated/pending sectors exist or errors climb, replace proactively. Drives don’t heal; they just get poetic.
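Pulling the two attributes that matter most out of `smartctl -A` output is a short awk filter. A sketch, with an inline sample line standing in for a real drive; the column positions follow smartctl’s standard attribute table, where field 10 is the raw value:

```shell
# Sketch: alert on nonzero raw values for the attributes that predict
# failure. Feed it `smartctl -A /dev/sdX` output on a real host.
smart_flags() {
  awk '/Reallocated_Sector_Ct|Current_Pending_Sector/ {
    if ($10 + 0 > 0) printf "ALERT %s raw=%s\n", $2, $10
  }'
}

printf '  5 Reallocated_Sector_Ct   0x0033 100 100 010 Pre-fail Always - 8\n' | smart_flags
```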
Task 10: RAID/ZFS state (whichever applies)
cr0x@server:~$ cat /proc/mdstat
Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid1 sdb1[0] sdc1[1]
976630336 blocks super 1.2 [2/2] [UU]
Meaning: [UU] means all members up. Anything else is degraded, and degraded arrays get slow.
Decision: If degraded, stop “performance investigations” and start “data protection investigations.” Then rebuild safely.
cr0x@server:~$ zpool status
pool: tank
state: ONLINE
scan: scrub repaired 0B in 00:12:31 with 0 errors on Sun Feb 4 02:00:01 2026
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
errors: No known data errors
Meaning: ZFS tells you if redundancy is intact and whether scrubs are finding corruption.
Decision: If scrubs show errors or DEGRADED, treat the hardware layer as guilty until exonerated.
Task 11: LVM and device-mapper sanity
cr0x@server:~$ pvs
PV VG Fmt Attr PSize PFree
/dev/nvme0n1p2 vg0 lvm2 a-- 1.81t 0
Meaning: PV/VG/LV mapping catches “someone resized something” surprises.
Decision: If you see missing PVs or partial VGs, stop. You might be running on a degraded or misassembled stack.
cr0x@server:~$ lvs -a -o +devices
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
root vg0 -wi-ao---- 1.81t /dev/nvme0n1p2(0)
Meaning: Shows which underlying devices back the logical volumes.
Decision: If an LV spans multiple PVs unexpectedly, performance and failure domains change. Update your mental model.
Task 12: Filesystem errors and journal signals
cr0x@server:~$ journalctl -k --since "2 hours ago" -p warning --no-pager | tail -n 30
Feb 05 03:11:22 server kernel: nvme nvme0: I/O 129 QID 7 timeout, aborting
Feb 05 03:11:23 server kernel: EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 262402 starting block 8392704
Meaning: Filtered kernel warnings give you a concise error narrative.
Decision: If filesystem warnings appear, plan a controlled filesystem check and inspect underlying storage immediately.
Task 13: Network errors that masquerade as “storage issues”
cr0x@server:~$ ip -s link show dev ens3
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
987654321 1234567 0 12 0 0
TX: bytes packets errors dropped carrier collsns
123456789 7654321 0 0 0 0
Meaning: Drops and errors can create retransmits and timeouts that look like “the database is slow.”
Decision: If drops spike, check NIC driver/firmware, ring buffers, and upstream congestion before you rip apart storage.
Task 14: Time sync and clock discipline
cr0x@server:~$ timedatectl
Local time: Mon 2026-02-05 03:24:11 UTC
Universal time: Mon 2026-02-05 03:24:11 UTC
RTC time: Mon 2026-02-05 03:24:10
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
Meaning: If time isn’t synchronized, correlating logs across systems becomes guesswork.
Decision: If sync is off, fix NTP/chrony first. Observability without time is interpretive dance.
The one-script support bundle (production-ready)
Now we turn those tasks into a single script. It collects:
- System identity (OS, kernel, uptime, boot mode)
- Hardware inventory (CPU, memory, PCI, USB)
- Drivers/modules and firmware hints
- Storage topology and health (lsblk, LVM, mdraid, ZFS, multipath if present)
- Errors (dmesg warnings/errors, journal excerpts)
- Network state
- A manifest plus command stderr logs
It also does a few “adult” things: defensive checks, time-bounded logs, and redaction of obvious secrets from a couple of outputs.
Redaction is not perfect. It’s a seatbelt, not invincibility.
cr0x@server:~$ cat sysreport.sh
#!/usr/bin/env bash
set -euo pipefail
# Full system report: hardware + drivers + errors
# Safe-by-default. No benchmarks. No destructive actions.
# Run as root for best results.
ts_utc="$(date -u +%Y%m%dT%H%M%SZ)"
host="$(hostname -s 2>/dev/null || hostname)"
out_root="${SYSREPORT_OUTDIR:-/tmp}"
bundle_dir="${out_root%/}/sysreport-${host}-${ts_utc}"
mkdir -p "$bundle_dir"/{meta,identity,hardware,drivers,storage,network,logs,errors,perf}
manifest="$bundle_dir/meta/manifest.txt"
stderr_log="$bundle_dir/meta/stderr.log"
run() {
# run "command string" "output-file"
local cmd="$1"
local out="$2"
{
echo "### CMD: $cmd"
echo "### WHEN_UTC: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo
bash -lc "$cmd"
echo
} >"$out" 2>>"$stderr_log" || true
{
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $out $cmd"
} >>"$manifest"
}
have() { command -v "$1" >/dev/null 2>&1; }
redact() {
# Very light redaction for outputs likely to contain tokens/keys.
# You should still review before sharing externally.
sed -E \
-e 's/(Authorization:)[[:space:]]*Bearer[[:space:]]+[A-Za-z0-9._-]+/\1 Bearer [REDACTED]/Ig' \
-e 's/([Pp]assword=)[^[:space:]]+/\1[REDACTED]/g' \
-e 's/([Tt]oken=)[^[:space:]]+/\1[REDACTED]/g'
}
# Identity
run "hostnamectl 2>/dev/null || true" "$bundle_dir/identity/hostnamectl.txt"
run "uname -a" "$bundle_dir/identity/uname.txt"
run "cat /etc/os-release 2>/dev/null || true" "$bundle_dir/identity/os-release.txt"
run "uptime -p; uptime" "$bundle_dir/identity/uptime.txt"
run "who -b 2>/dev/null || true" "$bundle_dir/identity/last-boot-who.txt"
run "test -d /sys/firmware/efi && echo UEFI || echo BIOS" "$bundle_dir/identity/boot-mode.txt"
run "timedatectl 2>/dev/null || true" "$bundle_dir/identity/timedatectl.txt"
# Hardware
run "lscpu" "$bundle_dir/hardware/lscpu.txt"
run "free -h" "$bundle_dir/hardware/free.txt"
run "cat /proc/meminfo" "$bundle_dir/hardware/proc-meminfo.txt"
run "dmidecode -t system -t baseboard -t bios -t processor -t memory 2>/dev/null || true" "$bundle_dir/hardware/dmidecode.txt"
run "lspci -nnk" "$bundle_dir/hardware/lspci-nnk.txt"
run "lsusb -t 2>/dev/null || true" "$bundle_dir/hardware/lsusb-tree.txt"
run "ls -l /dev/disk/by-id 2>/dev/null || true" "$bundle_dir/hardware/dev-disk-by-id.txt"
# Drivers / kernel
run "lsmod" "$bundle_dir/drivers/lsmod.txt"
run "sysctl -a 2>/dev/null | egrep '^(kernel\\.|vm\\.|fs\\.|net\\.)' | head -n 2000" "$bundle_dir/drivers/sysctl-kernel-vm-fs-net.txt"
run "cat /proc/cmdline" "$bundle_dir/drivers/kernel-cmdline.txt"
run "grep -R . /etc/modprobe.d 2>/dev/null || true" "$bundle_dir/drivers/modprobe-d.txt"
# Storage topology
run "lsblk -e7 -o NAME,TYPE,SIZE,MODEL,SERIAL,ROTA,TRAN,HCTL,FSTYPE,FSVER,UUID,MOUNTPOINTS" "$bundle_dir/storage/lsblk.txt"
run "blkid 2>/dev/null || true" "$bundle_dir/storage/blkid.txt"
run "findmnt -D" "$bundle_dir/storage/findmnt.txt"
run "df -hT" "$bundle_dir/storage/df-ht.txt"
run "mount" "$bundle_dir/storage/mount.txt"
# LVM / mdraid
run "pvs 2>/dev/null || true" "$bundle_dir/storage/lvm-pvs.txt"
run "vgs 2>/dev/null || true" "$bundle_dir/storage/lvm-vgs.txt"
run "lvs -a -o +devices 2>/dev/null || true" "$bundle_dir/storage/lvm-lvs.txt"
run "cat /proc/mdstat 2>/dev/null || true" "$bundle_dir/storage/mdstat.txt"
run "mdadm --detail --scan 2>/dev/null || true" "$bundle_dir/storage/mdadm-detail-scan.txt"
# ZFS (if present)
if have zpool; then
run "zpool status -v" "$bundle_dir/storage/zpool-status.txt"
run "zfs list -o name,used,avail,refer,mountpoint,compression,recordsize,atime,primarycache,secondarycache -t filesystem,volume 2>/dev/null || true" "$bundle_dir/storage/zfs-list.txt"
run "zpool get all 2>/dev/null || true" "$bundle_dir/storage/zpool-get-all.txt"
fi
# Multipath (if present)
if have multipath; then
run "multipath -ll 2>/dev/null || true" "$bundle_dir/storage/multipath-ll.txt"
fi
# NVMe and SMART (if present)
if have nvme; then
run "nvme list" "$bundle_dir/storage/nvme-list.txt"
run "for d in /dev/nvme*n1; do echo '## ' \$d; nvme id-ctrl \$d 2>/dev/null | head -n 80; echo; nvme smart-log \$d 2>/dev/null; echo; done" "$bundle_dir/storage/nvme-health.txt"
fi
if have smartctl; then
run "lsblk -dn -o NAME,TYPE | awk '\$2==\"disk\"{print \"/dev/\"\$1}' | while read -r d; do echo '## ' \$d; smartctl -a \$d 2>/dev/null | head -n 120; echo; done" "$bundle_dir/storage/smartctl-head.txt"
fi
# Performance snapshots (safe)
run "vmstat 1 5" "$bundle_dir/perf/vmstat.txt"
run "iostat -xz 1 3 2>/dev/null || true" "$bundle_dir/perf/iostat.txt"
run "pidstat 1 3 2>/dev/null || true" "$bundle_dir/perf/pidstat.txt"
run "top -b -n 1 | head -n 80" "$bundle_dir/perf/top.txt"
# Network
run "ip -br link" "$bundle_dir/network/ip-link.txt"
run "ip -s link" "$bundle_dir/network/ip-link-stats.txt"
run "ip addr" "$bundle_dir/network/ip-addr.txt"
run "ip route" "$bundle_dir/network/ip-route.txt"
run "ss -s" "$bundle_dir/network/ss-summary.txt"
run "ss -tupna 2>/dev/null | head -n 2000" "$bundle_dir/network/ss-sockets.txt"
# Logs / errors (time-bounded)
run "dmesg -T" "$bundle_dir/logs/dmesg.txt"
run "dmesg -T --level=err,warn" "$bundle_dir/errors/dmesg-warn-err.txt"
if have journalctl; then
run "journalctl -b --no-pager -n 3000" "$bundle_dir/logs/journal-this-boot-tail.txt"
run "journalctl -k -b --no-pager -p warning..alert" "$bundle_dir/errors/journal-kernel-warning-alert.txt"
run "journalctl -b --no-pager -p err..alert --since '24 hours ago'" "$bundle_dir/errors/journal-errors-24h.txt"
run "journalctl --list-boots --no-pager" "$bundle_dir/logs/journal-boot-list.txt"
fi
# A light touch of redaction on socket listings (can include tokens in args on some systems)
if [ -f "$bundle_dir/network/ss-sockets.txt" ]; then
redact <"$bundle_dir/network/ss-sockets.txt" >"$bundle_dir/network/ss-sockets.redacted.txt"
fi
# Meta
run "id; umask; ulimit -a" "$bundle_dir/meta/runtime.txt"
run "dpkg -l 2>/dev/null | head -n 4000 || true" "$bundle_dir/meta/packages-dpkg.txt"
run "rpm -qa 2>/dev/null | head -n 4000 || true" "$bundle_dir/meta/packages-rpm.txt"
# Bundle
tarball="${bundle_dir}.tar.gz"
tar -C "$out_root" -czf "$tarball" "$(basename "$bundle_dir")" 2>>"$stderr_log" || true
echo "Wrote: $tarball"
echo "Manifest: $manifest"
echo "Stderr log: $stderr_log"
How to run it
cr0x@server:~$ sudo bash sysreport.sh
Wrote: /tmp/sysreport-server-20260205T032411Z.tar.gz
Manifest: /tmp/sysreport-server-20260205T032411Z/meta/manifest.txt
Stderr log: /tmp/sysreport-server-20260205T032411Z/meta/stderr.log
Meaning: You get a single tarball you can attach to an incident ticket, share internally, or diff against a “known good” host.
Decision: If stderr contains many permission errors, re-run as root or explicitly accept the reduced visibility and document it.
How to read the bundle like an SRE, not like a tourist
Start with errors, then confirm topology
The fastest path is usually:
- errors/dmesg-warn-err.txt and errors/journal-kernel-warning-alert.txt for the error narrative
- hardware/lspci-nnk.txt to map devices to drivers
- storage/lsblk.txt and RAID/ZFS status to see what storage stack you’re actually running
- perf/vmstat.txt and perf/iostat.txt (if available) for bottleneck direction
Look for these “tells”
- Reset loops: “controller is down; will reset” repeating is not normal noise.
- Filesystem complaints: ext4/xfs warnings often lag behind the actual device failure. Root cause is usually below.
- Driver mismatch: a NIC driver in use that doesn’t match your fleet standard is drift and future incidents.
- Degraded redundancy: mdraid not [UU], ZFS degraded, multipath missing paths. This turns minor hiccups into outages.
- Thermals: high temperature plus PCIe errors is a real pattern, especially in dense racks.
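Telling a reset loop from a one-off is a counting problem. A sketch against the saved error excerpt; the threshold of 3 is an assumption, and `reset_count` is a name I made up:

```shell
# Sketch: count reset messages in an error excerpt and classify.
reset_count() {
  grep -c 'controller is down; will reset'
}

sample='nvme0: controller is down; will reset
nvme0: controller is down; will reset'

n="$(printf '%s\n' "$sample" | reset_count)" || true
if [ "$n" -ge 3 ]; then
  echo "reset loop: escalate to hardware/firmware"
else
  echo "isolated resets: $n (watch, correlate with workload)"
fi
```

In a real bundle you would run this against errors/dmesg-warn-err.txt and compare counts across boots.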
Quote (paraphrased idea) from W. Edwards Deming: “You can’t improve what you don’t measure.” In ops: you can’t fix what you didn’t capture.
Joke #2: A “quick fix” is just a long incident wearing a fake mustache.
Three corporate mini-stories (and the lessons they paid for)
Mini-story 1: The outage caused by a wrong assumption
A mid-sized company ran an internal artifact repository on a pair of “identical” servers. Same CPU model, same RAM size, same distro image.
The team assumed storage was the same too, because procurement said “2TB NVMe” on both line items. Close enough, right?
The incident started as intermittent latency spikes. The application graphs looked like a comb: fine, fine, fine, then a sharp tooth.
The on-call checked CPU (fine), memory (fine), network (fine). They restarted the service. The spikes went away for an hour, then returned.
Classic “works after restart” nonsense.
When they finally pulled a proper system report bundle, the difference was sitting in storage/nvme-list.txt:
one host had a different NVMe model and firmware revision. The kernel log showed periodic controller resets on that specific device.
Under the workload pattern, the drive firmware hit a corner case, the controller would go sideways, the kernel would reset it, I/O would stall,
and the app would look “slow.”
The wrong assumption wasn’t “NVMe is fast.” It was “same capacity means same behavior.” Storage is not a commodity; it’s a personality.
Same interface, different firmware, different failure modes.
Fix was boring: they standardized the drive model/firmware and added the one-script bundle to their incident checklist.
Next time procurement swapped parts, they caught drift in staging instead of discovering it during customer traffic.
Mini-story 2: The optimization that backfired
A finance team had a file processing pipeline backed by an ext4 filesystem on top of LVM. It was “fine” until end-of-month when
throughput collapsed and the job missed its window. Someone proposed a bold optimization: mount the filesystem with more aggressive options,
disable atime, tweak dirty ratios, and turn on discard everywhere to “keep SSDs healthy.”
After the change, the pipeline initially sped up. People celebrated. Then came the slow burn: sporadic stalls during peak write bursts.
Latency got worse, not better. The system report bundle (collected after the fact, because of course) showed two smoking guns.
First, the mount options and sysctl values had drifted far from the distro defaults with no written rationale.
Second, the kernel logs contained NVMe timeouts during heavy discard activity. The “optimization” accidentally aligned discard operations
with the pipeline’s bursty writes. The SSD firmware didn’t love that. The controller reset pattern returned, and so did the missed window.
The hard lesson: performance tweaks that change I/O patterns can expose firmware bugs and controller limits. Defaults are not sacred,
but they’re battle-tested. If you want to tune, do it with a rollback plan and with instrumentation, not vibes.
The team reverted discard to a scheduled, controlled process (or disabled it when the device handled it internally),
restored sane dirty settings, and required that any kernel/storage tuning be documented in the system report bundle’s “meta” directory
as a change note. The pipeline went back to being boring. Boring is what you want in finance.
Mini-story 3: The boring but correct practice that saved the day
A healthcare-ish organization (regulated, cautious, allergic to surprises) maintained a weekly “support bundle snapshot”
for every production database host. It wasn’t fancy: run the script, store the tarball, keep a rolling window.
Engineers complained it was busywork. It didn’t feel like progress. It felt like flossing.
Then a cluster started throwing occasional filesystem warnings. Nothing catastrophic, just enough to make everyone nervous.
The on-call pulled the latest bundle and compared it to last week’s from the same host and its peer.
Differences popped immediately: a kernel update had landed on one node but not the other, and the storage controller driver version changed with it.
They also noticed a small rise in NVMe error log entries on the updated node—visible in the saved NVMe SMART logs—
plus a corresponding increase in kernel warnings. Without the historical bundles, it would have been hard to prove causality.
With them, it was obvious: this wasn’t “random hardware.” It was change + symptom.
The team pinned the kernel version across the cluster, scheduled a maintenance window, and tested a newer driver/firmware combo in staging.
The warnings stopped. The “boring snapshots” practice paid for itself in one incident by turning a spooky mystery into a controlled rollback.
That’s the underappreciated win: support bundles aren’t just for vendors. They’re for you, one week later, when your memory is fiction.
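The comparison itself is just `diff -r` over two bundle directories, which is exactly why stable filenames matter. A sketch with throwaway temp dirs standing in for last week’s and this week’s bundles:

```shell
# Sketch: diff two bundles. Non-zero diff exit just means "drift found",
# which is the interesting case, not an error.
old="$(mktemp -d)"; new="$(mktemp -d)"
echo "kernel 6.5.0-21" > "$old/uname.txt"
echo "kernel 6.5.0-25" > "$new/uname.txt"

drift="$(diff -r "$old" "$new" || true)"
printf '%s\n' "$drift"

rm -rf "$old" "$new"
```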
Common mistakes: symptom → root cause → fix
These are not abstract. I’ve watched teams lose hours (or weekends) to each one.
1) Symptom: “High load average” → Root cause: I/O wait and blocked tasks → Fix: check storage latency and kernel logs
What you see: Load average is high, CPU idle looks decent, users insist “CPU is pegged.”
Root cause: Tasks are stuck in uninterruptible sleep waiting on I/O.
Fix: Use vmstat, iostat, and kernel error logs; then map slow devices via lsblk and controller via lspci.
If dmesg shows timeouts/resets, escalate to hardware/firmware.
2) Symptom: “Random filesystem errors” → Root cause: underlying device resets → Fix: stop treating it as a filesystem problem first
What you see: ext4/xfs warnings, occasional remount read-only, journal messages.
Root cause: The disk/controller is flaking; filesystem is the messenger.
Fix: Collect SMART/NVMe logs, controller driver versions, and look for PCIe errors in dmesg. Replace/upgrade as needed; then repair filesystem.
3) Symptom: “NIC drops under load” → Root cause: driver/firmware mismatch or offload settings → Fix: confirm driver version and link state
What you see: Packet drops climb, retransmits spike, apps time out.
Root cause: NIC firmware mismatch with kernel driver, or a bad offload combination.
Fix: Check lspci -nnk and modinfo. Standardize driver/firmware; validate offloads deliberately, not randomly.
4) Symptom: “Vendor can’t help; they want more info” → Root cause: report lacks versions and topology → Fix: include mapping files
What you see: You send logs; vendor asks for hardware model, driver versions, firmware revs, RAID state.
Root cause: You collected symptoms but not identity and configuration.
Fix: Ensure the script includes lspci -nnk, modinfo, storage health outputs, OS/kernel versions, and boot history.
5) Symptom: “Bundle is huge and useless” → Root cause: unbounded logs and binary dumps → Fix: time-bound and summarize
What you see: 500MB tarball nobody downloads.
Root cause: Full journal export, full package list, copying entire /var/log.
Fix: Capture recent logs (last boot tail, last 24h errors) plus focused summaries; keep the rest as optional flags.
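The time-bounding fix is easy to make concrete. A sketch, assuming journalctl exists (it's guarded if not) and using illustrative filenames:

```shell
# Time-bounded log capture: current-boot tail plus 24h of warning-and-worse,
# instead of a full journal export or a copy of /var/log.
out=$(mktemp -d)
if command -v journalctl >/dev/null 2>&1; then
  journalctl -b --no-pager -n 2000 > "$out/journal-boot-tail.txt" 2>&1 || true
  journalctl --since "24 hours ago" -p warning --no-pager > "$out/journal-24h-warn.txt" 2>&1 || true
fi
dmesg --level=warn,err,crit > "$out/dmesg-warn-err.txt" 2>&1 || true
du -sh "$out"
```

The du at the end is the honesty check: this should come out in megabytes, not hundreds of them.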
6) Symptom: “We can’t share it; it contains secrets” → Root cause: careless collection → Fix: build in safe defaults and review gates
What you see: Security blocks sharing with vendor, incident drags.
Root cause: Script captured configs or process args containing credentials.
Fix: Don’t collect sensitive paths; redact high-risk outputs; store bundle securely; have a review checklist before external sharing.
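A redaction pass before packaging can be sketched as a filter. The patterns below are illustrative assumptions (single-line, case-sensitive); extend them for your environment and still review the bundle by hand:

```shell
# Redaction filter sketch: scrub key=value credentials and private-key
# headers from collected text before it goes into the tarball.
redact() {
  sed -E \
    -e 's/(password|passwd|secret|token|api[_-]?key)=[^[:space:]]*/\1=REDACTED/g' \
    -e 's/-----BEGIN [A-Z ]*PRIVATE KEY-----.*/REDACTED-PRIVATE-KEY/'
}
printf 'db_url user=app password=hunter2 token=abc123\n' | redact
# -> db_url user=app password=REDACTED token=REDACTED
```

Redaction is the second line of defense; not collecting sensitive paths in the first place is still the first.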
Checklists / step-by-step plan
Checklist A: Build your script safely (once)
- Define the purpose. Incident response snapshot? Vendor support bundle? Drift detection? (Pick one primary.)
- Choose a stable output layout. identity/, hardware/, drivers/, storage/, network/, logs/, errors/, meta/.
- Make every command non-interactive. If it might block, put it behind a “have tool” guard.
- Time-bound logs. Current boot tail + last 24h errors is usually the sweet spot.
- Record failures. A stderr.log is not optional. Silent failures waste time.
- Don’t mutate state. No rescans, no filesystem checks, no tuning, no firmware updates in the default run.
- Package it. Tarball plus manifest of commands run.
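Checklist A can be sketched as a skeleton collector. This is a minimal sketch, not the full script: the layout matches the checklist, every command goes through a "have tool" guard, failures land in meta/stderr.log, and the run ends in a tarball plus manifest. It writes to a temp directory for safety.

```shell
# Skeleton collector: stable layout, guarded non-interactive commands,
# recorded failures, tarball + manifest at the end.
cd "$(mktemp -d)"
out="sysreport-$(uname -n)-$(date +%Y%m%d-%H%M%S)"
for d in identity hardware storage logs errors meta; do mkdir -p "$out/$d"; done

run() {  # run <relative output file> <command...>
  dest="$out/$1"; shift
  command -v "$1" >/dev/null 2>&1 || { echo "SKIP (missing): $*" >> "$out/meta/stderr.log"; return 0; }
  echo "$*" >> "$out/meta/manifest.txt"
  "$@" > "$dest" 2>> "$out/meta/stderr.log" || echo "FAILED: $*" >> "$out/meta/stderr.log"
}

run identity/uname.txt        uname -a
run hardware/lspci-nnk.txt    lspci -nnk
run storage/lsblk.txt         lsblk -o NAME,TYPE,SIZE,MODEL,SERIAL
run errors/dmesg-warn-err.txt dmesg --level=warn,err,crit

tar czf "$out.tar.gz" "$out"
echo "bundle: $out.tar.gz"
```

Note what the skeleton never does: no prompts, no package installs, no rescans, no state changes. Everything it runs is read-only.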
Checklist B: Run it during an incident (every time)
- Run on the affected host first, then on a “healthy peer” if you have one.
- Attach the tarball to the ticket immediately. Don’t keep it on your laptop like a precious artifact.
- Skim meta/stderr.log to ensure you actually captured the good stuff.
- Start analysis with errors/. Then map devices/drivers. Then confirm redundancy state.
- Write down the “first bad timestamp” from logs. Use it to correlate with deploys and config changes.
Checklist C: Make it fleet-ready (the difference between a script and a practice)
- Store weekly snapshots for critical tiers (databases, storage nodes, load balancers).
- Standardize kernel/driver/firmware baselines per hardware class.
- Automate diffing against a golden host for drift detection (even a crude diff is better than none).
- Define retention and access controls; bundles can contain sensitive operational details even if redacted.
FAQ
1) Should I run this as root?
Yes, if you want a report that’s actually useful. Root gets you kernel logs, SMART/NVMe health, and storage stack details.
If you can’t run as root, run anyway, but expect missing data and record that limitation in the incident ticket.
2) Will collecting this data impact performance?
Lightly. Commands like lspci, lsblk, journalctl queries, and SMART reads are typically low-impact.
Avoid heavy commands (benchmarks, long SMART tests) in the default script. If you need active testing, do it explicitly and during a window.
3) Why not just ship everything to a logging platform and call it done?
Central logging is great—until it isn’t. Support bundles capture the “local truth” including hardware identity, driver mappings, and storage topology
that often aren’t in your logging pipeline. Also, during outages, the logging pipeline is sometimes the first casualty.
4) Why include both dmesg and journalctl?
Because environments differ. Some systems restrict dmesg for non-root, some rotate logs aggressively, and sometimes you need both views.
Journal adds boot metadata; dmesg shows the kernel ring buffer plainly. Redundancy is a feature, not clutter.
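Capturing both views is cheap. A sketch, with both commands guarded since either can be restricted or absent:

```shell
# Kernel messages, both ways: journalctl -k (with boot metadata) and
# dmesg (the ring buffer, plainly).
out=$(mktemp -d)
if command -v journalctl >/dev/null 2>&1; then
  journalctl -k -b --no-pager -n 500 > "$out/journal-kernel.txt" 2>&1 || true
fi
dmesg > "$out/dmesg.txt" 2>&1 || echo "dmesg restricted for this user" >> "$out/dmesg.txt"
ls "$out"
```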
5) How do I compare two bundles?
Untar them and diff key files: identity/uname.txt, hardware/lspci-nnk.txt, drivers/lsmod.txt,
storage/lsblk.txt, RAID/ZFS status, and error logs. Don’t diff entire logs first; you’ll drown.
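The key-file diff can be sketched as a loop over the bundle layout. The stand-in directories below are assumptions so the sketch runs on its own; in practice they come from untarring two bundles:

```shell
# Bundle comparison sketch: diff the identity/topology files first,
# not the raw logs.
old=$(mktemp -d); new=$(mktemp -d)
# Stand-in content so the sketch is runnable end to end.
mkdir -p "$old/identity" "$new/identity"
echo "Linux host 6.1.0" > "$old/identity/uname.txt"
echo "Linux host 6.5.0" > "$new/identity/uname.txt"
for f in identity/uname.txt hardware/lspci-nnk.txt drivers/lsmod.txt storage/lsblk.txt; do
  if [ -f "$old/$f" ] && [ -f "$new/$f" ]; then
    diff -u "$old/$f" "$new/$f" > /dev/null 2>&1 || echo "CHANGED: $f"
  fi
done
# -> CHANGED: identity/uname.txt
```

One line of "CHANGED:" output per drifted file is exactly the "what changed?" answer the bundle exists to provide.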
6) What about containers and Kubernetes nodes?
Run the script on the node, not in a container, if you care about hardware/driver errors. Containers don’t see the full kernel view.
For Kubernetes, add node identity (kubelet version, runtime info) as an optional extension—but keep the base script generic.
7) What if tools like nvme-cli or smartmontools aren’t installed?
The script checks for tool presence and skips gracefully. In production, standardize a minimal “diagnostics toolset” image per distro.
Don’t make the on-call install packages during an outage unless you enjoy creating new outages.
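The presence check is worth making explicit in the bundle itself, so a sparse report is explained rather than mysterious. A sketch; the tool list is an assumption:

```shell
# Record which diagnostic tools this host actually has. A "missing:" line
# in the bundle beats a silently absent file.
for tool in lspci lsblk smartctl nvme journalctl dmesg ethtool; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "have:    $tool"
  else
    echo "missing: $tool"
  fi
done
```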
8) How do I keep secrets out of the report?
First: don’t collect sensitive paths. Second: time-bound logs. Third: redact selectively (as the script does for a few cases).
Finally: treat the bundle as sensitive anyway—store it in secure systems and review before sharing externally.
9) Can I add application logs to this?
You can, but be disciplined. Add a separate optional flag (for example, --app) and keep it off by default.
App logs are frequently huge and more likely to contain customer data. The base system report should stay focused.
10) What’s the single most valuable file in the bundle?
Usually errors/dmesg-warn-err.txt. It’s where hardware and drivers admit they’re failing. The second-most valuable is
hardware/lspci-nnk.txt, because it turns vague errors into a specific device/driver you can act on.
Conclusion: next steps you’ll actually do
A full system report is not a magical artifact. It’s a disciplined snapshot: identity, topology, health, and errors—captured safely,
in a form you can diff and share. The script above is the backbone. The practice is what makes it pay off.
- Drop the script into your ops repo and treat it like production code: reviewed, versioned, tested.
- Standardize a diagnostics toolset on your images so the script can collect NVMe/SMART and storage details consistently.
- Make “collect bundle” step zero in your incident runbooks, before tuning, restarts, or blame.
- Start keeping weekly snapshots for critical systems. You’ll hate it until the day it saves you.
- Use the bundle to enforce baselines: kernel, drivers, firmware, storage configuration. Drift is an outage in installments.