At 02:13, your pager (or your customer) tells you something is broken. You SSH in, run journalctl -xe, and are greeted by a waterfall of messages that all look equally guilty. Half of them are “normal.” The other half are “not normal, but not the cause.” Somewhere in there is the one line that matters: the first error that made everything else start screaming.
This is about finding that line fast, proving it’s causal (not just noisy), and wiring the result into a boring, reliable email to yourself. Not a “log analytics platform” pitch. A production habit.
A practical mental model: symptoms, triggers, and the first bad event
Event logs are not a narrative. They are a crowd. Your job is to find the person who threw the first punch.
Three categories you must separate
- Symptoms: retries, timeouts, “connection reset,” “failed to send,” “context deadline exceeded,” “upstream unavailable.” These are downstream services expressing pain.
- Triggers: a process crash, a cert expiry, a filesystem remounting read-only, a network flap, a kernel OOM kill. These are typically close to the root cause.
- Noise: periodic health checks, cron jobs, liveness probes, expected restarts, debug logs left on in production, deprecation warnings. Noise is not “unimportant”; it’s just not helpful right now.
When people say “logs are too noisy,” what they usually mean is: they never built a method for turning a log stream into a causal chain. Noise is what you call information you can’t index.
The “first bad event” rule
Most incidents have one of these shapes:
- One hard failure: e.g., disk I/O error → filesystem remount read-only → database panics → API errors.
- One slow failure: e.g., latency spike → request queue grows → thread pool saturates → timeouts everywhere.
- One misconfiguration: e.g., cert rotation missed → TLS handshakes fail → health checks fail → autoscaler thrashes.
- One “nothing changed” lie: a dependency changed outside your control (time, DNS, upstream deployment, cloud control plane).
Your highest-leverage move is to identify the earliest timestamp where reality diverged from normal, then walk forward and watch the blast radius expand. Walk backward, too, but backward is mostly about eliminating red herrings.
Logs only make sense when you align clocks
If your fleet’s clocks are drifting, correlation is fiction. Fix NTP/chrony before you get clever with parsing. Time is the primary key of operations.
One idea worth stapling to your monitor (and you'll see why in the playbook), paraphrased: treat incident response as an evidence-gathering exercise; learn from how the system actually behaved, not from the first plausible story about what went wrong.
— John Allspaw, on the value of evidence-based incident response and learning from real system behavior.
Interesting facts and history that make modern logs weird
- Syslog predates most of your stack. It was designed in the 1980s for simple text messages, not JSON payloads and trace IDs.
- RFC 3164 syslog timestamps lack the year. That’s why some parsers guess and occasionally guess wrong around New Year’s.
- Systemd’s journal stores structured fields. You can filter by _SYSTEMD_UNIT, _PID, _COMM, and more without regexing free text.
- The kernel ring buffer is not a reliable archive. dmesg reads a circular buffer; under stress, the most important messages can be overwritten first.
- Log levels are social constructs. Many applications treat “ERROR” as “I want attention” and “WARN” as “I also want attention but with less shame.”
- rsyslog and syslog-ng were built for throughput. They can drop messages under load if misconfigured or if disk I/O stalls; “missing logs” can be a symptom.
- Email alerts predate webhooks by decades. SMTP is boring, ubiquitous, and still one of the most reliable last-mile notification mechanisms when APIs are on fire.
- journald rate limiting is a double-edged sword. It can save your disk during a spammy failure, but it can also hide the frequency of an error pattern.
- Storage failures often whisper first. UDMA CRC errors, link resets, or ZFS checksum errors can appear long before an outright “I/O error.”
And yes, your logs will contain at least one line that’s technically correct but emotionally useless.
Joke 1: If your incident response plan is “tail the logs and feel feelings,” congratulations—you’ve invented artisanal debugging.
Fast diagnosis playbook: first / second / third checks
This is the sequence I use when I don’t know what’s wrong yet. It’s optimized for “find the bottleneck quickly,” not for building a perfect narrative. You can do the perfect narrative later, during the postmortem, when the system is stable and your coffee is legal.
First: establish the incident window and the blast radius
- Pick a time window. Start with “when users noticed” and expand it 10–30 minutes earlier.
- Identify the affected host(s) and unit(s). Is it one node, one AZ, one service, or everything?
- Look for the first hard error. Kernel, storage, OOM, service crash, TLS failure.
Second: check the “big three” resource constraints
- CPU saturation or throttling? Look for load spikes, scheduler latency, cgroup throttling.
- Memory pressure? OOM kills, swapping, reclaim storms.
- I/O and storage? Disk latency, filesystem errors, controller resets, ZFS events, queue depth.
Third: confirm causality with correlation
- Correlate timestamps. Does the app error follow the kernel/storage event?
- Find the “first occurrence.” The earliest message of the pattern is often the trigger.
- Validate with a second signal. Metrics if you have them; if not, use other logs (auth, kernel, storage).
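Timestamp correlation can be mechanical: ISO-8601 timestamps sort correctly as plain text, so merging per-source slices into one timeline is just sort. A minimal sketch with sample lines (the log content and times are illustrative):

```shell
# Two per-source slices; in real triage, produce these with
# `journalctl -o short-iso` per host or per unit.
kernel_slice() {
cat <<'EOF'
2026-02-05T01:52:11 kernel: blk_update_request: I/O error, dev sdb
EOF
}
app_slice() {
cat <<'EOF'
2026-02-05T01:52:13 api: DB connection failed: timeout
2026-02-05T01:52:12 postgres: PANIC: could not read from log segment
EOF
}
# Merge into one timeline: the kernel event precedes both app-level failures,
# which is exactly the ordering evidence causality needs.
{ kernel_slice; app_slice; } | sort
```

The same trick scales to many hosts: dump each host's slice to a file, then `sort` the concatenation. It only works if the clocks are synced, which is why clock sanity is Task 1.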
When you’re stuck
- Switch from grepping to filtering by origin: unit, PID, executable, container ID.
- Reduce scope: one host, one minute, one unit.
- Look at the kernel and storage layers even if you “know” it’s an app bug. You don’t know.
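Filtering by origin starts with knowing which origins are complaining. A sketch that counts error sources from `journalctl -o json` output; the sample lines stand in for real journal output, so the unit names and messages are illustrative:

```shell
# Sample stands in for: journalctl --since "..." -p err -o json --no-pager
err_json() {
cat <<'EOF'
{"_SYSTEMD_UNIT":"api.service","PRIORITY":"3","MESSAGE":"DB connection failed: timeout"}
{"_SYSTEMD_UNIT":"api.service","PRIORITY":"3","MESSAGE":"DB connection failed: timeout"}
{"_SYSTEMD_UNIT":"postgresql.service","PRIORITY":"2","MESSAGE":"PANIC: could not read from log segment"}
EOF
}
# Which units are erroring, and how loudly? No free-text regex needed:
# journald's flat JSON records are safe to slice with sed.
err_json | sed -n 's/.*"_SYSTEMD_UNIT":"\([^"]*\)".*/\1/p' | sort | uniq -c | sort -rn
```

The loudest unit is usually a symptom; the quietest high-severity one (here, the single PRIORITY=2 panic) is often closer to the trigger.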
Practical tasks (commands, outputs, decisions)
These are real tasks you can run on a Linux box with systemd. Each includes: command, what the output means, and what decision you make next. The goal is to turn “log soup” into a sequence of decisions.
Task 1: Confirm clock sanity (because correlation depends on it)
cr0x@server:~$ timedatectl
Local time: Tue 2026-02-05 02:21:18 UTC
Universal time: Tue 2026-02-05 02:21:18 UTC
RTC time: Tue 2026-02-05 02:21:18
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
Meaning: You want “System clock synchronized: yes” and NTP active.
Decision: If clocks aren’t synced, fix time first (chrony/systemd-timesyncd) or your log correlation will lie to you.
Task 2: Bound the incident window quickly
cr0x@server:~$ journalctl --since "2026-02-05 01:45:00" --until "2026-02-05 02:15:00" -p err..alert --no-pager
Feb 05 01:52:11 server kernel: blk_update_request: I/O error, dev sdb, sector 1946152832 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Feb 05 01:52:11 server kernel: Buffer I/O error on dev sdb1, logical block 243269104, async page read
Feb 05 01:52:12 server systemd[1]: postgresql.service: Main process exited, code=killed, status=9/KILL
Feb 05 01:52:12 server systemd[1]: postgresql.service: Failed with result 'signal'.
Meaning: This shows high-severity events only. The kernel I/O error appears before PostgreSQL dies. That ordering matters.
Decision: Treat storage as suspect immediately; don’t waste 45 minutes “tuning Postgres” while your disk is returning I/O errors.
Task 3: Identify the first occurrence of a pattern
cr0x@server:~$ journalctl --since "2026-02-05 01:00:00" | grep -m1 -E "I/O error, dev sdb|Buffer I/O error"
Feb 05 01:52:11 server kernel: blk_update_request: I/O error, dev sdb, sector 1946152832 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Meaning: -m1 stops at the first match, giving you the earliest known appearance in that stream.
Decision: Use that timestamp as your pivot point. Everything after is potentially consequence.
Task 4: Filter logs by systemd unit instead of grepping everything
cr0x@server:~$ journalctl -u postgresql.service --since "2026-02-05 01:45:00" --no-pager | tail -n 12
Feb 05 01:52:10 server postgres[21455]: LOG: could not read block 243269104 in file "base/16384/2619": read only 0 of 8192 bytes
Feb 05 01:52:10 server postgres[21455]: LOG: unexpected pageaddr 0/0 in log segment 0000000100000000000000A3, offset 0
Feb 05 01:52:11 server postgres[21455]: PANIC: could not read from log segment 0000000100000000000000A3 at offset 0: read only 0 of 8192 bytes
Feb 05 01:52:12 server systemd[1]: postgresql.service: Main process exited, code=killed, status=9/KILL
Feb 05 01:52:12 server systemd[1]: postgresql.service: Failed with result 'signal'.
Meaning: The database is reporting short reads. That’s not a config issue; it’s either storage or kernel.
Decision: Stop assuming “database bug.” Escalate to storage investigation; consider failing over.
Task 5: Check for OOM kills (the silent assassin)
cr0x@server:~$ journalctl -k --since "2026-02-05 01:45:00" | grep -E "Out of memory|oom-kill|Killed process" | tail -n 5
Feb 05 01:47:03 server kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/api.service,task=java,pid=18802,uid=1001
Feb 05 01:47:03 server kernel: Killed process 18802 (java) total-vm:7421860kB, anon-rss:4823112kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:12544kB oom_score_adj:0
Meaning: The kernel killed a process due to memory pressure, and it names the cgroup (/system.slice/api.service).
Decision: If the OOM aligns with the incident start, treat memory pressure as the trigger. If it occurs later, it may be a consequence of retries and queue growth.
Task 6: Check journald rate limiting (because missing logs are a problem)
cr0x@server:~$ journalctl -u systemd-journald --since "2026-02-05 01:45:00" --no-pager | tail -n 8
Feb 05 01:51:58 server systemd-journald[412]: Suppressed 1529 messages from postgresql.service
Feb 05 01:52:01 server systemd-journald[412]: Suppressed 941 messages from kernel
Meaning: You’re losing detail during the hottest part of the incident. The pattern still matters, but counts and sequences may be incomplete.
Decision: If suppression is significant, collect other evidence (kernel counters, storage tools). Consider adjusting rate limits after the incident, not during.
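When you do revisit the limits afterward, the knobs live in journald.conf. A sketch (the values below are illustrative; the defaults allow 10,000 messages per 30 seconds per service):

```ini
# /etc/systemd/journald.conf (a drop-in under /etc/systemd/journald.conf.d/ also works)
[Journal]
# Allow more messages per service before suppression kicks in.
RateLimitIntervalSec=30s
RateLimitBurst=50000
# Setting either to zero disables rate limiting entirely -- usually too far.
```

Restart systemd-journald for the change to take effect, and weigh the extra disk and I/O against the forensic detail you keep.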
Task 7: Spot kernel storage/link issues in dmesg/journal
cr0x@server:~$ dmesg -T | egrep -i "ata[0-9]|nvme|reset|link is down|I/O error|blk_update_request" | tail -n 12
[Tue Feb 5 01:52:10 2026] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Tue Feb 5 01:52:10 2026] ata2.00: failed command: READ FPDMA QUEUED
[Tue Feb 5 01:52:11 2026] blk_update_request: I/O error, dev sdb, sector 1946152832 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Tue Feb 5 01:52:11 2026] ata2: hard resetting link
[Tue Feb 5 01:52:12 2026] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Meaning: Link resets + failed reads is classic “drive, cable, controller, or backplane” territory.
Decision: Plan hardware replacement or migration. If it’s a VM with virtual disks, treat it as underlying storage instability and involve your provider/platform team.
Task 8: Confirm filesystem health signals
cr0x@server:~$ journalctl -k --since "2026-02-05 01:45:00" | egrep -i "EXT4-fs error|XFS.*corruption|Remounting filesystem read-only|I/O error" | tail -n 10
Feb 05 01:52:11 server kernel: EXT4-fs error (device sdb1): ext4_find_entry:1456: inode #131104: comm postgres: reading directory lblock 0
Feb 05 01:52:11 server kernel: Aborting journal on device sdb1-8.
Feb 05 01:52:11 server kernel: EXT4-fs (sdb1): Remounting filesystem read-only
Meaning: The filesystem remounted read-only. Apps will fail in weird ways from here on out.
Decision: Stop “restarting services.” You need to fail over, remount, or repair. Restarts just add noise and can worsen corruption.
Task 9: If you run ZFS, check pool events and errors
cr0x@server:~$ sudo zpool status -x
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
action: Replace the device using 'zpool replace'.
scan: scrub repaired 0B in 0 days 00:12:44 with 3 errors on Tue Feb 5 01:55:21 2026
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
sdb FAULTED 5 0 3 too many errors
sdc ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
/tank/pg/wal/0000000100000000000000A3
Meaning: This is not “maybe.” ZFS is telling you a device is faulted and it can name affected files.
Decision: Replace the failed disk, restore impacted data from replica/backup, and treat application errors as downstream symptoms until proven otherwise.
Task 10: See which services flapped (restart storms create fake root causes)
cr0x@server:~$ systemctl list-units --type=service --state=failed
UNIT LOAD ACTIVE SUB DESCRIPTION
postgresql.service loaded failed failed PostgreSQL RDBMS
api.service loaded failed failed Example API
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state.
SUB = The low-level unit activation state.
Meaning: This is a quick inventory of “what is visibly broken.” It doesn’t tell you why.
Decision: Use it to scope the blast radius, then inspect each unit’s logs around the earliest failure timestamp.
Task 11: Pull structured fields from journald (stop regexing when you can filter)
cr0x@server:~$ journalctl -u api.service --since "2026-02-05 01:45:00" -o json | head -n 3
{"_SYSTEMD_UNIT":"api.service","PRIORITY":"3","MESSAGE":"DB connection failed: timeout","_PID":"18890","_COMM":"java","__REALTIME_TIMESTAMP":"1770256333000000"}
{"_SYSTEMD_UNIT":"api.service","PRIORITY":"3","MESSAGE":"DB connection failed: timeout","_PID":"18890","_COMM":"java","__REALTIME_TIMESTAMP":"1770256334000000"}
{"_SYSTEMD_UNIT":"api.service","PRIORITY":"4","MESSAGE":"Retrying in 500ms","_PID":"18890","_COMM":"java","__REALTIME_TIMESTAMP":"1770256334000000"}
Meaning: You can programmatically parse fields like unit, PID, priority, timestamp. This is how you build reliable alerts.
Decision: If you’re writing automation, prefer JSON output. Text is for humans; structured fields are for machines.
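One wrinkle for automation: __REALTIME_TIMESTAMP is microseconds since the Unix epoch, not seconds. A sketch of the conversion (the sample value is illustrative):

```shell
# journald reports microseconds; divide by 1e6 before handing it to date(1).
ts_us=1770256333000000
TZ=UTC date -d "@$((ts_us / 1000000))" +"%F %T"   # -> 2026-02-05 01:52:13
```

Forgetting the division is a classic way to get alerts "from the year 58060."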
Task 12: Count error bursts to see if you’re dealing with a storm
cr0x@server:~$ journalctl -u api.service --since "2026-02-05 01:45:00" --until "2026-02-05 02:00:00" -p err --no-pager | wc -l
842
Meaning: 842 error-level messages in 15 minutes is not “a few failures.” It’s a systemic issue or an aggressive retry loop.
Decision: If the count is huge, focus on identifying the upstream dependency that’s failing and consider dampening retries or enabling circuit breakers.
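A total count hides the shape of the burst. Bucketing by minute shows whether you have a spike or a sustained storm; the sample lines here stand in for `journalctl -o short-iso` output:

```shell
# Sample stands in for: journalctl -u api.service -p err -o short-iso --no-pager
err_lines() {
cat <<'EOF'
2026-02-05T01:52:13+0000 server api[18890]: DB connection failed: timeout
2026-02-05T01:52:14+0000 server api[18890]: DB connection failed: timeout
2026-02-05T01:52:15+0000 server api[18890]: DB connection failed: timeout
2026-02-05T01:53:01+0000 server api[18890]: DB connection failed: timeout
EOF
}
# Keep "YYYY-MM-DDTHH:MM" (first 16 chars) and count errors per minute.
err_lines | cut -c1-16 | sort | uniq -c
```

A sharp edge at one minute points at a trigger; a slow ramp points at saturation or retry amplification.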
Task 13: Cross-host correlation with a quick-and-dirty approach (when you lack centralized logging)
cr0x@server:~$ for h in app01 app02 db01; do echo "== $h =="; ssh $h "journalctl --since '2026-02-05 01:50:00' --until '2026-02-05 01:55:00' -p err --no-pager | head -n 3"; done
== app01 ==
Feb 05 01:52:13 app01 api[18890]: DB connection failed: timeout
Feb 05 01:52:14 app01 api[18890]: DB connection failed: timeout
Feb 05 01:52:15 app01 api[18890]: DB connection failed: timeout
== app02 ==
Feb 05 01:52:13 app02 api[19011]: DB connection failed: timeout
Feb 05 01:52:14 app02 api[19011]: DB connection failed: timeout
Feb 05 01:52:15 app02 api[19011]: DB connection failed: timeout
== db01 ==
Feb 05 01:52:11 db01 kernel: blk_update_request: I/O error, dev sdb, sector 1946152832 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Feb 05 01:52:11 db01 kernel: EXT4-fs (sdb1): Remounting filesystem read-only
Feb 05 01:52:12 db01 systemd[1]: postgresql.service: Main process exited, code=killed, status=9/KILL
Meaning: App nodes show timeouts; DB node shows storage errors. That’s your chain, with timestamps.
Decision: Stop troubleshooting apps. Initiate DB failover/recovery, then fix underlying storage.
Task 14: Extract the “real error” line with context (the 10 lines around it)
cr0x@server:~$ journalctl --since "2026-02-05 01:45:00" --no-pager | grep -n -E "Remounting filesystem read-only|blk_update_request: I/O error" | head -n 1
1823:Feb 05 01:52:11 server kernel: EXT4-fs (sdb1): Remounting filesystem read-only
cr0x@server:~$ journalctl --since "2026-02-05 01:45:00" --no-pager | sed -n '1813,1833p'
Feb 05 01:52:10 server kernel: ata2.00: failed command: READ FPDMA QUEUED
Feb 05 01:52:11 server kernel: blk_update_request: I/O error, dev sdb, sector 1946152832 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Feb 05 01:52:11 server kernel: Buffer I/O error on dev sdb1, logical block 243269104, async page read
Feb 05 01:52:11 server kernel: EXT4-fs error (device sdb1): ext4_find_entry:1456: inode #131104: comm postgres: reading directory lblock 0
Feb 05 01:52:11 server kernel: Aborting journal on device sdb1-8.
Feb 05 01:52:11 server kernel: EXT4-fs (sdb1): Remounting filesystem read-only
Feb 05 01:52:12 server systemd[1]: postgresql.service: Main process exited, code=killed, status=9/KILL
Feb 05 01:52:13 server api[18890]: DB connection failed: timeout
Meaning: The context turns a single scary line into a causal sequence: ATA read failure → block I/O error → filesystem abort → remount read-only → database dies → app timeouts.
Decision: That’s your incident timeline anchor. Use it to drive action (failover) and later write a clean postmortem.
Task 15: When the “real error” is TLS/certs, find it once, not 5,000 times
cr0x@server:~$ journalctl -u nginx.service --since "2026-02-05 00:00:00" --no-pager | grep -m1 -E "certificate|SSL_do_handshake|PEM_read_bio"
Feb 05 01:03:44 server nginx[901]: [emerg] SSL_CTX_use_PrivateKey_file("/etc/nginx/tls/site.key") failed (SSL: error:0B080074:x509 certificate routines:X509_check_private_key:key values mismatch)
Meaning: One line explains the outage: key/cert mismatch. Everything else is clients failing to connect.
Decision: Fix the cert/key pair, reload, and then add a pre-deploy validation step so this doesn’t recur.
Task 16: Detect “optimization” failures like aggressive log rotation or tmpfs filling
cr0x@server:~$ journalctl --since "2026-02-05 01:00:00" -p warning --no-pager | grep -E "No space left on device|failed to write|ENOSPC" | head -n 5
Feb 05 01:11:02 server systemd-journald[412]: Failed to write entry (24 items, 882 bytes), ignoring: No space left on device
Feb 05 01:11:03 server rsyslogd[777]: action 'action-0-builtin:omfile' suspended (module 'builtin:omfile'), retry 0. There is no space left on device.
Meaning: Your logging system can’t write. That means you’re losing evidence during an incident.
Decision: Free space immediately; then re-evaluate where logs are stored, rotation policy, and whether “save disk” changes are sabotaging observability.
Three corporate mini-stories from the log trenches
Mini-story 1: The incident caused by a wrong assumption
The company was mid-migration: new Kubernetes cluster, new service mesh, new everything. A customer-facing API started returning 502s intermittently. The on-call engineer did what most of us do under pressure: they searched logs for “error” and saw a storm of “upstream reset” messages in the proxy.
The assumption formed instantly: “service mesh bug.” A war room spun up. People tweaked proxy timeouts, rolled sidecars, flipped feature flags. The error rate moved around, which felt like progress, but was actually random variance being mistaken for control.
Eventually, someone checked the kernel logs on one node. Not the app logs, not the proxy logs. The node was reporting NIC link flaps and TCP retransmits. The proxy was innocent; it was just the first component honest enough to complain. When the team correlated timestamps, every upstream reset aligned with a link down/up cycle.
The root cause was mundane: a top-of-rack switch port configured with a wrong speed/duplex setting after routine maintenance. The “mesh incident” was a networking incident wearing a proxy costume.
The lesson wasn’t “always blame the network.” It was: don’t let the first plausible error message become your root cause. Logs tell you what a component observed, not what caused reality to break.
Mini-story 2: The optimization that backfired
A platform team was trying to cut disk usage on a fleet of database nodes. Someone noticed journald consuming a surprising amount of space. The fix looked clean: cap the journal size aggressively and enable tighter rate limiting. Less disk, fewer writes, happier SSDs. Everyone loves a win.
Two weeks later, a node experienced intermittent storage latency spikes. PostgreSQL started logging slow fsyncs and occasional WAL write stalls. The application layer saw timeouts. The on-call tried to piece together the timeline, but the kernel messages around the spike were missing. Journald had suppressed a lot, and the capped journal had rotated away the earliest warnings.
Without the early evidence, the team focused on the database configuration. They tuned checkpoints, moved WAL settings, adjusted kernel dirty ratios. Some changes improved steady-state performance and gave everyone the comforting feeling of “doing engineering.” But the spikes persisted.
Later, during a maintenance window, someone ran a scrub on the storage pool and found checksum errors creeping up. A drive was degrading slowly, and the system had been warning about it. Those warnings were the first things to be rotated out because they happened “before the incident.”
The optimization didn’t cause the hardware failure. It caused the team to lose the only cheap, early signal that could have shortened the incident. Saving disk was fine; saving disk by sacrificing forensic retention was not.
Mini-story 3: The boring but correct practice that saved the day
A payments service had a habit that nobody celebrated: every deploy required a small “event log sanity checklist.” Not a big security audit. Just boring checks—time sync, disk space, journald health, and a simple end-to-end transaction in a canary environment.
One morning, the canary failed. The logs showed TLS handshake errors. The engineer on duty didn’t grep blindly; they filtered by unit and searched for the first certificate-related error. It was immediate: the service had loaded a new certificate chain, but an intermediate certificate file was missing. The service still started. It just refused clients that required the full chain.
The fix was one line: install the missing intermediate and reload. No incident, no customer impact. The checklist caught it before it escaped. The team moved on, slightly annoyed that “nothing happened,” which is the correct emotional response to a well-designed process.
That practice didn’t feel innovative. It was. Most outages are prevented by things that look like paperwork until you need them.
Email yourself: a production-grade alert pipeline from logs
Centralized logging is great. Sometimes you don’t have it. Sometimes it’s the thing that’s down. Email, for all its flaws, is resilient, low-dependency, and available from almost anywhere. “Email yourself” is not primitive; it’s pragmatic.
The trick is to avoid two failure modes:
- Alert spam: you’ll ignore it, then miss the one message that mattered.
- Alert silence: the script fails, SMTP fails, DNS fails, or journald suppresses events and you never know.
Design principles for log-to-email alerts
- Alert on triggers, not symptoms. “EXT4 remounted read-only” beats “API timeout” by a mile.
- Deduplicate aggressively. Email is not a metrics time series.
- Include context. Send the first error line plus a small window of surrounding lines.
- Make it testable. You should be able to run the script and see a single, deterministic output.
- Fail loudly. If the alert pipeline can’t send, log that failure somewhere you’ll see.
A simple, robust implementation: systemd timer + journalctl cursor
We’ll use journald’s cursor to remember where we left off. This avoids “re-sending the world” and is more reliable than guessing time windows.
1) Create an alert script
This script looks for high-severity events and a handful of specific trigger patterns (storage, OOM, read-only remount, certificate mismatch). It sends one email per run, deduplicated, with the “top” finding.
cr0x@server:~$ sudo tee /usr/local/sbin/journal-alert-email.sh >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
STATE_DIR="/var/lib/journal-alert"
CURSOR_FILE="$STATE_DIR/cursor"
HOST="$(hostname -f 2>/dev/null || hostname)"
TO_ADDR="oncall@example.com"
FROM_ADDR="journal-alert@$HOST"
SUBJECT_PREFIX="[journal-alert]"
TMP="$(mktemp)"
trap 'rm -f "$TMP"' EXIT
mkdir -p "$STATE_DIR"
chmod 700 "$STATE_DIR"
# Pull new high-priority logs since last cursor.
# We intentionally include kernel + system units; customize for your environment.
# --show-cursor appends "-- cursor: s=..." after the last entry read, so the
# saved cursor matches exactly what we processed (no gap, no replay).
if [[ -f "$CURSOR_FILE" ]]; then
journalctl --after-cursor "$(cat "$CURSOR_FILE")" --show-cursor -p err..alert -o short-iso --no-pager > "$TMP" || true
else
journalctl --since "10 minutes ago" --show-cursor -p err..alert -o short-iso --no-pager > "$TMP" || true
fi
# Save the cursor before mailing; if sending fails, we skip rather than loop.
if grep -q '^-- cursor: ' "$TMP"; then
sed -n 's/^-- cursor: //p' "$TMP" > "$CURSOR_FILE"
sed -i '/^-- cursor: /d' "$TMP"
chmod 600 "$CURSOR_FILE"
fi
# journalctl prints "-- No entries --" when nothing matched; don't mail that.
sed -i '/^-- No entries --$/d' "$TMP"
# If no new events, exit quietly.
if [[ ! -s "$TMP" ]]; then
exit 0
fi
# Prefer trigger patterns; fall back to first error line.
TRIGGER_LINE="$(grep -m1 -E \
'Remounting filesystem read-only|blk_update_request: I/O error|Buffer I/O error|oom-kill|Killed process|X509_check_private_key|SSL_CTX_use_PrivateKey_file|No space left on device|Permanent errors have been detected' \
"$TMP" || true)"
if [[ -z "$TRIGGER_LINE" ]]; then
TRIGGER_LINE="$(head -n 1 "$TMP")"
fi
# Add some context around the trigger if present in TMP.
CONTEXT="$(awk -v needle="$TRIGGER_LINE" '
BEGIN{found=0}
$0==needle{found=1; start=NR-5; end=NR+10}
{lines[NR]=$0}
END{
if(found){
for(i=(start<1?1:start); i<=end; i++) if(i in lines) print lines[i]
} else {
for(i=1;i<=20;i++) if(i in lines) print lines[i]
}
}' "$TMP")"
SUBJECT="$SUBJECT_PREFIX $HOST $(echo "$TRIGGER_LINE" | cut -c1-120)"
BODY=$(cat <<EOT
Host: $HOST
Time: $(date -Is)
Trigger:
$TRIGGER_LINE
Context:
$CONTEXT
EOT
)
# Send email. Requires a local MTA or msmtp configured.
printf "%s\n" "$BODY" | /usr/bin/mail -s "$SUBJECT" -r "$FROM_ADDR" "$TO_ADDR"
EOF
sudo chmod 750 /usr/local/sbin/journal-alert-email.sh
What it means: This uses journald as the source of truth and a cursor for state. It sends one message per run, which makes it naturally rate-limited.
Decision: If you can’t guarantee a working local MTA, use msmtp or a relay. Don’t rely on “someone’s laptop” for alert delivery.
2) Configure a minimal mail sender (example: msmtp)
On many systems, mail hands off to a local MTA. If you don’t have one, msmtp is a common lightweight choice. This is an example config; adapt it to your environment.
cr0x@server:~$ sudo tee /etc/msmtprc >/dev/null <<'EOF'
defaults
auth on
tls on
tls_trust_file /etc/ssl/certs/ca-certificates.crt
logfile /var/log/msmtp.log
account relay
host smtp.relay.local
port 587
from journal-alert@server.example
user journal-alert@server.example
passwordeval "cat /etc/msmtp-password"
account default : relay
EOF
sudo chmod 600 /etc/msmtprc
What it means: Credentials are kept out of the script. The mailer logs to a file you can audit.
Decision: If you can’t secure credentials appropriately, don’t ship them. Use a local relay with IP-based auth, or use an internal SMTP service.
3) Create a systemd service + timer
cr0x@server:~$ sudo tee /etc/systemd/system/journal-alert-email.service >/dev/null <<'EOF'
[Unit]
Description=Send email alerts for high-severity journal events
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/journal-alert-email.sh
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=7
EOF
cr0x@server:~$ sudo tee /etc/systemd/system/journal-alert-email.timer >/dev/null <<'EOF'
[Unit]
Description=Run journal email alert script every minute
[Timer]
OnBootSec=2min
OnUnitActiveSec=1min
AccuracySec=10s
[Install]
WantedBy=timers.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now journal-alert-email.timer
What it means: The timer fires one minute after each completed run, and two minutes after every boot. Note that Persistent= only affects OnCalendar= timers, so it does nothing on a monotonic timer like this one; OnBootSec= already covers “the host was down” by scanning shortly after startup (useful for catching early boot failures).
Decision: If you’re worried about email volume, increase the interval to 2–5 minutes and dedupe within the script. Don’t run it every 5 seconds. You’ll only learn to hate your own inbox.
4) Test the pipeline end-to-end
cr0x@server:~$ sudo systemctl start journal-alert-email.service
cr0x@server:~$ sudo systemctl status journal-alert-email.service --no-pager
● journal-alert-email.service - Send email alerts for high-severity journal events
Loaded: loaded (/etc/systemd/system/journal-alert-email.service; static)
Active: inactive (dead) since Tue 2026-02-05 02:23:20 UTC; 2s ago
Process: 22901 ExecStart=/usr/local/sbin/journal-alert-email.sh (code=exited, status=0/SUCCESS)
Meaning: status=0/SUCCESS means the script ran. It may have sent nothing if there were no new errors.
Decision: If you need a test email, temporarily lower the threshold or generate a controlled error log line in a safe unit/environment.
Joke 2: Email alerts are like smoke detectors: you only notice them when they’re annoying, and that’s exactly when they’re doing their job.
Checklists / step-by-step plan
Checklist: finding the real error in 10 minutes
- Set the window: define --since and --until; expand earlier if needed.
- Pull high severity first: journalctl -p err..alert for the window.
- Scan for triggers: kernel I/O, filesystem read-only, OOM, service crash, TLS/cert errors, ENOSPC.
- Find the earliest trigger line: use grep -m1 or cursor-based scans.
- Add context: grab 10 lines before/after or filter by unit.
- Correlate across layers: app unit logs + systemd + kernel.
- Decide action: failover, rollback, restart (rarely), or escalate to hardware/network/platform.
- Preserve evidence: copy relevant journal slices and kernel logs before the system reboots or rotates.
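For the evidence step, export the slice in a format that keeps structured fields. A sketch (the window and output path are illustrative; the `|| true` keeps evidence collection from aborting your session mid-incident):

```shell
SINCE="2026-02-05 01:45:00"
UNTIL="2026-02-05 02:15:00"
OUT="/var/tmp/incident-$(date +%Y%m%d).journal"
# -o export preserves every structured field and can be re-imported later with
# systemd-journal-remote; plain-text dumps lose the fields automation needs.
journalctl --since "$SINCE" --until "$UNTIL" -o export > "$OUT" 2>/dev/null || true
echo "saved to $OUT"
```

Copy the file off the host. If the box is about to be rebooted, rebuilt, or repaired, local evidence has a short life expectancy.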
Checklist: production-safe log alerting by email
- Pick trigger patterns you actually want to wake up for.
- Implement cursor state to avoid duplicates and time drift issues.
- Deduplicate within a run (send one email with top trigger + context).
- Use a reliable mail path (local relay or authenticated SMTP).
- Run from systemd timer and monitor failures of the alert service itself.
- Test quarterly (yes, actually do it): alert pipelines rot quietly.
Checklist: storage-focused log triage (because storage lies by being slow)
- Search kernel logs for link resets, I/O errors, filesystem remounts.
- Check RAID/ZFS/MD events if applicable.
- Confirm the “app error” started after storage warnings.
- Fail over or reduce load before attempting repairs.
- After stability, run scrubs/checks and replace marginal components.
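For the “replace marginal components” call, SMART raw values are the cheap early whisper mentioned in the facts list. A sketch that flags nonzero counters; the sample lines mimic the shape of `smartctl -A` output, and the attribute values are illustrative:

```shell
# Sample stands in for: sudo smartctl -A /dev/sdb
smart_sample() {
cat <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       12
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       37
EOF
}
# Nonzero raw values (last column) on these attributes mean the drive is
# whispering; listen before it starts shouting I/O errors.
smart_sample | awk '$NF+0 > 0 {print $2, "raw="$NF}'
```

A rising UDMA_CRC_Error_Count often implicates cabling or the backplane rather than the platters, which changes what you replace.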
Common mistakes: symptom → root cause → fix
1) “Everything is timing out”
Symptom: app logs full of timeouts; reverse proxy shows upstream errors.
Root cause: database node filesystem remounted read-only after I/O errors; app is innocent.
Fix: check kernel + filesystem logs, fail over database, replace failing disk/cable/controller; stop restarting app services.
2) “Service keeps restarting, must be a crash loop bug”
Symptom: systemd shows repeated restarts; app logs end abruptly.
Root cause: kernel OOM killer terminating the process (often due to runaway memory, traffic spikes, or cgroup limits).
Fix: confirm OOM in kernel logs, increase memory or fix leak, adjust cgroup limits, and add backpressure/circuit breakers.
3) “No errors in logs, so it’s not the host”
Symptom: you don’t see the expected kernel/storage messages during the incident.
Root cause: journald suppression, aggressive rotation, or disk-full prevented logs from being written.
Fix: check journald logs for suppression and ENOSPC, increase retention, ensure log partitions have headroom, and alert on “logging pipeline unhealthy.”
4) “We fixed it by increasing timeouts”
Symptom: raising timeouts makes the error rate drop temporarily.
Root cause: slow dependency (storage latency, DNS issues, overloaded DB) was the trigger; timeouts only moved the pain.
Fix: identify the earliest latency/IO trigger, reduce load or fail over; tune timeouts only after stability to avoid retry storms.
5) “Cert rotation is automated, so TLS can’t be the issue”
Symptom: sudden handshake failures after deploy; clients report cert chain errors.
Root cause: mismatched key/cert, missing intermediate, wrong file permissions, or a process that didn’t reload.
Fix: search for the first TLS/key error line, validate cert/key pair in CI, enforce reload checks, and alert on impending expiry.
6) “It’s a network problem” (every time)
Symptom: intermittent resets, retransmits, random failures.
Root cause: sometimes network, yes. Other times: CPU starvation causing delayed ACKs, disk stalls freezing processes, or conntrack exhaustion.
Fix: corroborate with kernel logs and host resource pressure signals. Don’t scapegoat. Prove.
7) “We’ll just grep for ERROR”
Symptom: endless matches, no clarity, wrong focus.
Root cause: log severity misuse and lack of structured filtering (unit/PID/fields).
Fix: filter by unit and priority, then search for trigger patterns; use JSON fields for automation.
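The "use JSON fields for automation" advice looks like this in practice: `journalctl -o json` emits one JSON object per line with fields such as `PRIORITY`, `_SYSTEMD_UNIT`, and `MESSAGE`. A sketch that filters err-or-worse entries for a single unit (the unit name here is hypothetical):

```python
import json

def high_priority(json_lines, unit, max_priority=3):
    """Filter journalctl -o json output: keep err-or-worse entries for one unit.

    journald priorities follow syslog convention: 0=emerg ... 3=err ... 7=debug,
    and PRIORITY arrives as a string digit.
    """
    out = []
    for line in json_lines:
        entry = json.loads(line)
        if (entry.get("_SYSTEMD_UNIT") == unit
                and int(entry.get("PRIORITY", 7)) <= max_priority):
            out.append(entry["MESSAGE"])
    return out
```

This replaces `grep ERROR` with two structured filters (unit, priority), which is exactly the move the fix above describes.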
FAQ
1) What is the “real error” in an event log?
It’s the earliest high-signal event that explains downstream failures: the first I/O error, the first OOM kill, the first TLS key mismatch, the first “read-only remount.” Not the 50th timeout.
2) Why not just ship everything to a log platform and search there?
You should, when you can. But incident response can’t depend on a system that might be degraded by the same incident. Local-first tooling is your lifeboat.
3) Why does journalctl -xe feel useless during incidents?
Because it mixes severity, scope, and time in a way that’s great for “what just happened” but bad for “what caused the incident.” Use bounded windows, priorities, and unit filters.
4) How do I avoid alert fatigue when emailing myself?
Alert on triggers, dedupe, send one email per run, and keep the subject short but specific. If you wouldn’t take action from the email, don’t send it.
5) Should I parse logs with regex or structured fields?
For journald sources, prefer structured fields when possible. Use regex for message content matching, but don’t build a fragile pipeline that depends on exact wording alone.
6) What if my system uses classic syslog files instead of journald?
The method stays the same: bound the window, prioritize high severity, find the first trigger, add context, and correlate across layers. Your commands change (e.g., grep, awk, zgrep on rotated files), but the decisions don’t.
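For classic syslog setups, "find the first trigger" across current and rotated files is a few lines of scripting. A sketch that handles both plain and gzip-rotated files and stops at the first match per file (the pattern and paths are illustrative):

```python
import gzip
import re
from pathlib import Path

def scan_rotated(paths, pattern):
    """Search plain and .gz rotated syslog files, keeping the first matching line per file."""
    pat = re.compile(pattern)
    hits = {}
    for path in map(Path, paths):
        # gzip.open and open share a text-mode interface, so one branch suffices.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", errors="replace") as fh:
            for line in fh:
                if pat.search(line):
                    hits[str(path)] = line.rstrip("\n")
                    break  # the first bad event per file is enough for triage
    return hits
```

Same decisions as the journald flow: bound the search, take the earliest hit, then expand context around it.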
7) How do I know whether an app error is cause or effect?
Check ordering and independence. If kernel/storage/oom events precede app errors, the app is likely downstream. If app errors appear alone with stable kernel logs, the app may be the trigger.
8) What’s the minimum set of trigger patterns worth alerting on?
Start with: filesystem remount read-only, kernel I/O errors, OOM kills, disk full/ENOSPC, repeated service crashes, and TLS/cert load failures. Expand only when you’ve proven an alert is actionable.
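That starter set can be expressed as a small compiled pattern table. The exact message wording varies across kernel, systemd, and library versions, so treat these regexes as illustrative starting points to verify against your own hosts:

```python
import re

# Minimum trigger set from above; wording is approximate and should be
# checked against the actual messages your systems emit.
MIN_TRIGGERS = {
    "fs_readonly": re.compile(r"Remounting filesystem read-only", re.I),
    "io_error":    re.compile(r"I/O error", re.I),
    "oom_kill":    re.compile(r"Out of memory: Killed process"),
    "disk_full":   re.compile(r"No space left on device|ENOSPC"),
    "crash_loop":  re.compile(r"Start request repeated too quickly"),
    "tls_failure": re.compile(r"failed to load (certificate|private key)", re.I),
}

def classify(line):
    """Return the trigger name for a log line, or None if it is not a trigger."""
    for name, pat in MIN_TRIGGERS.items():
        if pat.search(line):
            return name
    return None
```

Naming each trigger keeps alerts greppable and makes the "is this actionable?" review trivial when you prune the list later.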
9) What about containers—does journald still help?
Yes, if container logs are routed to journald or if systemd units manage the container runtime. If you’re using a different logging driver, collect equivalent signals from the runtime and still correlate with kernel logs.
10) How long should I retain logs on a server?
Long enough to include “early warnings” that precede incidents: days to weeks, depending on disk budget. If you can’t afford retention, you can’t afford to debug the next slow-burn failure either.
Conclusion: next steps you can do before lunch
If you do nothing else, do these three things:
- Adopt the “first bad event” rule. Find the earliest trigger line and build the causal chain forward.
- Use journald filtering like a grown-up. Priorities, unit filters, structured fields. Regex is a tool, not a lifestyle.
- Wire a small email alert for the triggers that matter. Cursor-based, deduped, with context. You’ll thank yourself at 02:13.
Then, when the next incident hits, you won’t be reading logs like tea leaves. You’ll be extracting evidence, making a decision, and moving on. That’s the job.