Some failures don’t crash; they just rot. Latency creeps up. A daemon stops answering but never “dies.” A storage path flaps once an hour like it’s clearing its throat. Your monitoring shows a thousand tiny papercuts, and the business wants one big bandage.
Then someone says it, quietly, like a confession: “Should we just reboot it?” The reset button era is when that question stops being shameful and starts being operationally mature—provided you know what a reset actually does, what it hides, and what you must collect before you press it.
Why resets feel honest (and why they’re also a lie)
A reset is a clean break in causality. You stop trying to negotiate with a corrupted runtime and you start fresh with known-good initialization paths: kernel boot, driver init, service startup, config load, cache warm-up. For a huge category of failures—leaks, deadlocks, stuck I/O, wedged drivers—initialization is the only code path that’s been tested 10,000 times.
That’s the honest part. The dishonest part is that a reset often destroys evidence. It clears the crime scene, throws away the stack traces, and resets counters that were telling you exactly how bad things were getting.
So treat resets the way you’d treat painkillers after an injury. Use them to function, sure. But if you keep taking them without a diagnosis, you’re not “tough,” you’re just flying blind.
Here’s the operational truth: reset is a valid mitigation. It buys time. It lowers blast radius. It gets your customers back. But if it’s your only move, you will eventually end up with a system that “needs” resets like a Victorian factory needs child labor: it works, but it’s not a plan.
Joke #1: A reboot is the most sincere apology a computer can offer: it can’t explain what happened, but it promises to be better.
When a reset is the right call
- SLO breach is active and you have a known-good rollback or restart procedure.
- State is already suspect (filesystem errors, kernel warnings, NIC resets, allocator corruption signals).
- You can’t reproduce under observation and need a stable baseline to collect better data next time.
- Blast radius containment: cordon a node, drain workload, reboot, rejoin.
When a reset is malpractice
- You haven’t captured basic forensics (logs, dmesg, top offenders) and you’re about to erase them.
- Rebooting will trigger a rebuild (RAID, ZFS resilver, database recovery) that increases risk.
- The issue is clearly external (upstream dependency down, network partition, expired cert). Rebooting changes nothing.
- You’re using reboot to avoid admitting you don’t have observability.
There’s a paraphrased idea from Werner Vogels (Amazon CTO) that belongs on every ops team’s wall: you build for failure, because everything fails eventually. The reset button era is what happens when you accept that statement literally and operationalize it: failures will happen; the question is whether your resets are controlled and informative—or chaotic and forgetful.
Facts and history: how we got here
Resets aren’t a moral failing; they’re a consequence of complexity. A few concrete points that explain why rebooting has stayed relevant across decades:
- Early microcomputers normalized hard resets. Many home systems encouraged power-cycling as routine recovery because persistent state and journaling weren’t the baseline.
- The “three-finger salute” became cultural muscle memory. Keyboard resets on early PCs made “reset” a user-level fix, not an engineering one.
- Watchdog timers existed long before cloud. Embedded systems used hardware watchdogs to reboot automatically when software stopped petting the dog.
- Journaling filesystems reduced—but didn’t remove—reset pain. They made crash recovery faster and safer, which paradoxically made resets more acceptable in production.
- Virtualization made reboot cheaper. When a “server” is a VM, rebooting it feels less dramatic than rolling a physical cart into a datacenter aisle.
- Containers made restart normal again. A pod restart is a planned event; orchestration frameworks treat processes as disposable.
- Modern systems depend on caches everywhere. DNS caches, page cache, connection pools, JIT caches, routing tables. Resets flush them—sometimes improving things, sometimes causing a thundering herd.
- Firmware and microcode updates blurred the line between “software fix” and “reboot required.” A security patch can require a reboot because the kernel and hardware handshake changed.
- Storage stacks became multi-layered. RAID/HBA firmware, multipath, filesystem, volume manager, encryption, application. A reset may fix a stuck layer, but you must identify which one.
Notice the pattern: resets got easier, and “easier” changed the culture. The danger is that ease becomes default, and default becomes doctrine.
What a reset really changes: layers, caches, state, and time
1) Process memory and allocator state
Most “reboot fixed it” stories are really “fresh address space fixed it.” Memory leaks, fragmentation, allocator contention, runaway threads, file descriptor leaks—rebooting clears them all, at the cost of never proving which one was responsible.
2) Kernel state and driver state
Kernel bugs are rare until they’re not. A reboot resets kernel structures: page tables, scheduler queues, TCP state, device driver state machines. If a driver got wedged—common with NICs, HBAs, GPUs, and some virtual devices—a reboot is a blunt but effective way to force reinitialization.
3) Storage caches and I/O queues
Storage has two kinds of state: data state (on disk) and operational state (in flight). A reset blows away operational state: outstanding I/O, queue depths, multipath decisions, congestion control. That can “fix” a latency spiral, but it can also trigger a rebuild, rescan, or controller failover at the worst possible time.
4) Network sessions and load balancer stickiness
Reboots sever TCP connections, reset conntrack tables, and force clients to reconnect. In a healthy system, that’s fine. In a fragile system, it becomes a synchronized reconnect storm that looks like a DDoS you did to yourself.
5) Time: the silent dependency
Many bugs are time-dependent: certificate expiry, token refresh logic, cron jobs, log rotation, counters reaching a threshold, JVM safepoints under heap pressure. A reboot can “solve” the symptom by resetting uptime-based timers and in-memory counters, while the underlying expiry condition keeps marching toward the next cliff.
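If certificate or token expiry is on the suspect list, check the dates directly instead of letting a reboot reset the timers for you. A minimal sketch using openssl; api.internal:443 and the PEM path are placeholders for your own endpoints:
cr0x@server:~$ echo | openssl s_client -connect api.internal:443 -servername api.internal 2>/dev/null | openssl x509 -noout -enddate   # expiry of the cert actually being served
cr0x@server:~$ openssl x509 -noout -enddate -in /etc/ssl/certs/app.pem   # expiry of a cert file on disk (example path)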
Restart vs reboot vs power-cycle: pick the smallest hammer that works
Service restart
Use when the failure appears confined to a single process or a small set of daemons. Examples: runaway memory in one service, stuck worker pool, wedged internal queue, a config reload gone wrong. A restart preserves kernel state, avoids storage rescan, and is faster to roll back.
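Before declaring victory on a restart, confirm the service came back and stayed back. A minimal sketch, with app-worker.service standing in for your own unit:
cr0x@server:~$ sudo systemctl restart app-worker.service
cr0x@server:~$ systemctl status app-worker.service --no-pager --lines=5   # confirm Active: active (running) and sane recent log lines
cr0x@server:~$ systemctl show -p NRestarts app-worker.service   # a climbing NRestarts means it is flapping, not fixed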
Node reboot
Use when the kernel or hardware-facing layers look suspicious: recurring driver resets, unkillable processes in D state, filesystem I/O hangs, NIC flapping, severe clock skew, or you need to apply a kernel update.
Power-cycle (hard reset)
Use when the system is not responding to a clean reboot, or the management plane shows hardware wedged (BMC events, PCIe issues, HBA locked). This is the “I need the electrons to stop and start again” option. It can be necessary. It is never gentle.
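When you truly need the electrons to stop, the request usually goes through the BMC rather than the OS. A minimal sketch over IPMI, with bmc01.example.net and the admin user as placeholders; pull the hardware event log first, because it survives the power-cycle even though your dmesg won’t:
cr0x@server:~$ ipmitool -I lanplus -H bmc01.example.net -U admin -E sel elist   # BMC event log; -E reads the password from IPMI_PASSWORD
cr0x@server:~$ ipmitool -I lanplus -H bmc01.example.net -U admin -E chassis power status
cr0x@server:~$ ipmitool -I lanplus -H bmc01.example.net -U admin -E chassis power cycle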
Joke #2: Power-cycling is like turning your coworker off and on again—effective, but HR wants a postmortem.
A decision rule that works under pressure
Start at the top of the stack and move down only if evidence pushes you:
- If one service is misbehaving: restart service.
- If many services are misbehaving on one node: reboot node (after draining if possible).
- If the node can’t reboot or the hardware is wedged: power-cycle.
And regardless of which you choose: capture enough evidence to make the next incident shorter.
Fast diagnosis playbook: find the bottleneck before you “fix” it
This is the triage order that works when the pager is loud and you have five minutes to be useful. The goal is not to be brilliant. The goal is to stop guessing.
First: is it one box, one service, or the whole fleet?
- Check whether the symptom correlates with a node, an AZ/rack, a deploy, or a dependency.
- If multiple nodes show identical symptoms simultaneously, reboots are often noise. Look for shared dependencies (DNS, auth, storage backend, network).
Second: CPU, memory, I/O, or network?
- CPU bound: high load, high run queue, low iowait, top shows hot threads.
- Memory pressure: swap activity, reclaim stalls, OOM kills, growing RSS.
- I/O bound: high iowait, rising disk latency, blocked tasks, filesystem warnings.
- Network bound: retransmits, drops, conntrack saturation, DNS timeouts.
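One pass of vmstat covers most of the split above, and a socket summary is a quick network sanity check. A minimal sketch, assuming procps and iproute2 are installed (they almost always are):
cr0x@server:~$ vmstat 1 5   # r = runnable (CPU), si/so = swap traffic (memory), b and wa = blocked on I/O, id = idle headroom
cr0x@server:~$ ss -s   # socket totals; a sudden pile of orphaned or timewait sockets points at the network layer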
Third: kernel and hardware signals
- dmesg/journal shows device resets, link down/up, filesystem errors, hung tasks.
- SMART errors, ZFS pool degradation, multipath flapping.
Fourth: choose the smallest mitigation that restores service
- Restart a service if the node looks healthy.
- Drain and reboot if node-level state is suspect.
- Fail over if storage/network path is unstable.
Fifth: after stability, turn mitigation into diagnosis
- Write down what changed after the reset: latency, error rates, kernel counters, cache hit ratios.
- Create a hypothesis you can test before the next incident.
Practical tasks: commands, outputs, and decisions (12+)
These are the “before you reboot” and “right after you reboot” commands I actually want on an incident checklist. Every one includes what the output means and what decision you make from it.
1) Uptime and load: are we in slow death or sudden failure?
cr0x@server:~$ uptime
14:22:10 up 87 days, 3:41, 2 users, load average: 18.44, 17.90, 16.72
Output meaning: 87 days up; load averages are high and persistent (not a short spike).
Decision: If this is a single node and load is abnormal, proceed to CPU vs I/O separation. Don’t reboot yet; collect why load is high.
2) Top snapshot: CPU vs iowait vs runaway process
cr0x@server:~$ top -b -n 1 | head -n 20
top - 14:22:25 up 87 days, 3:41, 2 users, load average: 18.44, 17.90, 16.72
Tasks: 312 total, 2 running, 310 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.3 us, 2.1 sy, 0.0 ni, 21.7 id, 67.6 wa, 0.0 hi, 0.3 si, 0.0 st
MiB Mem : 64049.2 total, 1800.5 free, 51221.4 used, 11027.3 buff/cache
MiB Swap: 8192.0 total, 8120.0 free, 72.0 used. 10341.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18342 postgres 20 0 12.1g 9.2g 9120 S 120.0 14.7 36:12.44 postgres
Output meaning: iowait is huge (67.6% wa). CPU is mostly waiting on disk, not doing work.
Decision: Don’t reboot a database because it’s waiting on disk; find the storage bottleneck first. Rebooting may worsen recovery time and I/O storms.
3) Identify blocked tasks (classic “reboot fixes it” trap)
cr0x@server:~$ ps -eo pid,state,comm,wchan:32 | awk '$2 ~ /^D/ {print}'
21991 D kworker/u96:2 nvme_poll
18342 D postgres io_schedule
Output meaning: Processes stuck in uninterruptible sleep (D state), waiting on I/O paths.
Decision: Service restart won’t help. Investigate storage/NVMe/HBA/multipath. If the I/O path is wedged and you can fail over, do that; otherwise plan a reboot with evidence captured.
4) Kernel messages for storage or driver resets
cr0x@server:~$ sudo dmesg -T | tail -n 20
[Mon Jan 21 13:58:12 2026] nvme nvme0: I/O 47 QID 4 timeout, aborting
[Mon Jan 21 13:58:12 2026] nvme nvme0: Abort status: 0x371
[Mon Jan 21 13:58:13 2026] nvme nvme0: resetting controller
[Mon Jan 21 13:58:18 2026] nvme nvme0: failed to set APST feature (-19)
Output meaning: NVMe timeouts and controller resets; the kernel is telling you the device is unreliable.
Decision: Reboot might temporarily recover the device, but treat this as hardware/firmware/driver work. Schedule replacement or firmware update; don’t normalize weekly “NVMe nap time.”
5) Disk latency and saturation with iostat
cr0x@server:~$ iostat -x 1 3
Linux 6.5.0-21-generic (server) 01/21/2026 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
7.89 0.00 2.34 66.85 0.00 22.92
Device r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 12.0 420.0 512.0 8752.0 43.1 198.4 451.2 9.8 462.5 2.1 99.7
Output meaning: %util ~99.7 and await ~451ms: the device is saturated and latency is terrible.
Decision: Stop blaming the app. Either reduce write load, move hot data, or add IOPS capacity. Reboot will not change physics.
6) Find which filesystems are full (the oldest “reboot fixes it” myth)
cr0x@server:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 220G 219G 1.2G 100% /
tmpfs 32G 1.2G 31G 4% /run
Output meaning: Root filesystem is effectively full. Many services fail in weird ways when they can’t write logs or temp files.
Decision: Free space immediately (logs, core dumps, old artifacts). Rebooting won’t create disk space; it just postpones the failed writes until the disk is even fuller.
7) Identify big directories fast
cr0x@server:~$ sudo du -xhd1 /var | sort -h | tail -n 10
1.1G /var/cache
2.9G /var/lib
4.8G /var/crash
58G /var/log
Output meaning: /var/log is massive; /var/crash indicates repeated crashes or core dumps.
Decision: Rotate/compress logs, move them off-node, fix crash loop. Consider limiting core dumps. Don’t reboot until you stop the growth.
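If the journal itself is a big slice of that /var/log, you can reclaim space in place while you fix the crash loop. A minimal sketch, assuming systemd-journald and classic logrotate; the size cap is an example, not a recommendation:
cr0x@server:~$ journalctl --disk-usage   # how much the journal alone holds
cr0x@server:~$ sudo journalctl --vacuum-size=2G   # trim archived journal files down to roughly 2G
cr0x@server:~$ sudo logrotate --force /etc/logrotate.conf   # force a rotation pass for classic log files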
8) Memory pressure check: is the kernel reclaiming itself to death?
cr0x@server:~$ free -m
total used free shared buff/cache available
Mem: 64049 59721 811 922 3516 1920
Swap: 8192 6130 2062
Output meaning: Low available memory and significant swap use. The box is likely thrashing.
Decision: If a specific service is bloated, restart that service. If it’s systemic (many processes, kernel caches not reclaiming, fragmentation), plan a reboot but capture culprit processes first.
9) Who is consuming memory right now?
cr0x@server:~$ ps -eo pid,comm,rss --sort=-rss | head -n 10
PID COMMAND RSS
18342 postgres 9654320
22107 java 8421000
9821 node 1320040
Output meaning: A few processes dominate memory. That’s good news: targeted restart may fix it without reboot.
Decision: Restart or recycle the top offender after confirming it can be safely restarted (HA, rolling restart, draining).
10) Systemd service health and recent failures
cr0x@server:~$ systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● app-worker.service loaded failed failed Background worker
● nginx.service loaded failed failed A high performance web server
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state.
SUB = The low-level unit activation state.
Output meaning: Specific services are failed; not necessarily a node failure.
Decision: Try service restart and inspect logs. If they fail due to disk full, DNS, or config, rebooting is irrelevant.
11) Journal logs around the failure window
cr0x@server:~$ sudo journalctl -u nginx.service -S "10 minutes ago" --no-pager | tail -n 20
Jan 21 14:13:41 server nginx[29110]: nginx: [emerg] open() "/var/log/nginx/access.log" failed (28: No space left on device)
Jan 21 14:13:41 server systemd[1]: nginx.service: Main process exited, code=exited, status=1/FAILURE
Output meaning: Explicit failure: no space left on device.
Decision: Free disk space, then restart nginx. Rebooting is a waste of time and will likely fail again.
12) Network drops and errors (interface counters)
cr0x@server:~$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
9876543210 12345678 0 48219 0 0
TX: bytes packets errors dropped carrier collsns
8765432109 11223344 0 0 0 0
Output meaning: RX drops are high. That can mean driver issues, ring buffer limits, or upstream congestion.
Decision: Investigate NIC offload settings, buffer sizing, host CPU contention, or upstream switch congestion. Reboot may reset counters and temporarily “fix” a stuck driver, but you need causality.
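Before rebooting to “fix” the NIC, see where the drops actually land. A minimal sketch with eth0 and ethtool; ring sizes are hardware-dependent, so read the reported maximums before raising anything:
cr0x@server:~$ ethtool -S eth0 | grep -iE 'drop|err|miss'   # per-queue and per-cause drop counters
cr0x@server:~$ ethtool -g eth0   # current vs maximum RX/TX ring sizes
cr0x@server:~$ sudo ethtool -G eth0 rx 4096   # example only: raise the RX ring toward the reported maximum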
13) DNS resolution sanity (fast check for “everything is broken”)
cr0x@server:~$ resolvectl query api.internal
api.internal: resolve call failed: Timeout was reached
Output meaning: DNS timeouts. This often masquerades as application failure across many services.
Decision: Don’t reboot the world. Fix DNS or resolver path. Consider caching resolver locally or failover resolvers.
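Separate “the local resolver path is broken” from “the upstream DNS server is broken” before doing anything drastic. A minimal sketch, assuming systemd-resolved; 10.0.0.53 is a placeholder for your upstream resolver:
cr0x@server:~$ resolvectl status   # which DNS servers each link is actually using
cr0x@server:~$ resolvectl query api.internal   # resolution through the local stub
cr0x@server:~$ dig @10.0.0.53 api.internal +time=2 +tries=1   # the upstream directly, bypassing the stub
cr0x@server:~$ sudo resolvectl flush-caches   # only after you know the upstream answers correctly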
14) Filesystem health: ext4 errors visible in logs
cr0x@server:~$ sudo dmesg -T | grep -E "EXT4-fs error|I/O error|Buffer I/O" | tail -n 5
[Mon Jan 21 14:01:12 2026] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1587: inode #131081: comm nginx: reading directory lblock 0
[Mon Jan 21 14:01:12 2026] Buffer I/O error on dev nvme0n1p2, logical block 9123456, lost async page write
Output meaning: Real filesystem and I/O errors. This is not an app bug.
Decision: Treat as storage incident. Reduce writes, fail over, plan maintenance. Rebooting may worsen corruption risk if hardware is failing.
15) ZFS pool status (if you run ZFS, you must look)
cr0x@server:~$ sudo zpool status -x
pool 'tank' is degraded
One or more devices could not be used because the label is missing or invalid.
Output meaning: Pool is degraded; redundancy is reduced.
Decision: Don’t reboot casually. A reboot can trigger resilvering and load spikes. Replace/fix the missing device first, or schedule a controlled maintenance window with clear I/O limits.
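The pool will tell you exactly which device it wants back, and the replacement is the part you schedule rather than improvise. A minimal sketch with hypothetical device paths:
cr0x@server:~$ sudo zpool status -v tank   # which vdev is missing/faulted and whether a resilver is already running
cr0x@server:~$ sudo zpool replace tank /dev/disk/by-id/old-disk /dev/disk/by-id/new-disk   # example device paths, not yours
cr0x@server:~$ sudo zpool status tank | grep -A 2 'scan:'   # re-run to track resilver progress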
16) Multipath flapping (SAN environments love to “heal” after reboot)
cr0x@server:~$ sudo multipath -ll | head -n 25
mpatha (3600508b400105e210000900000490000) dm-2 DELL,MD36xxi
size=2.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 3:0:0:1 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
`- 4:0:0:1 sdc 8:32 active ready running
Output meaning: Paths exist with different priorities (ALUA). If you see “failed faulty running” or paths toggling, you have a fabric/controller issue.
Decision: Engage storage/network team; don’t mask it with reboots. If queue_if_no_path is active, you may be stacking latency until everything times out.
17) Before reboot: capture a lightweight incident bundle
cr0x@server:~$ sudo sh -c 'date; uname -a; uptime; free -m; df -h; dmesg -T | tail -n 200' > /var/tmp/pre-reboot-triage.txt
Output meaning: You’ve preserved basic state that a reboot will erase (especially dmesg tail and resource snapshots).
Decision: If you must reboot, do it after you save this file somewhere durable (central logs, ticket attachment, scp to a bastion).
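If you want a bit more than the one-liner, the same idea extends to a small bundle you can tar up and ship off the box. A sketch, assuming sysstat is installed and with bastion01 as a placeholder destination:
cr0x@server:~$ D=/var/tmp/triage-$(date +%Y%m%dT%H%M%S); mkdir -p "$D"
cr0x@server:~$ sudo dmesg -T | tail -n 500 > "$D/dmesg.txt"
cr0x@server:~$ uname -a > "$D/uname.txt"; uptime > "$D/uptime.txt"; free -m > "$D/free.txt"; df -h > "$D/df.txt"
cr0x@server:~$ ps aux --sort=-rss | head -n 30 > "$D/ps.txt"; iostat -x 1 3 > "$D/iostat.txt"; ss -s > "$D/ss.txt"
cr0x@server:~$ tar czf "$D.tar.gz" -C /var/tmp "$(basename "$D")" && scp "$D.tar.gz" bastion01:/srv/incidents/   # off the box before the reboot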
Three corporate mini-stories from the reset button era
Mini-story 1: The incident caused by a wrong assumption
The team had a cluster of application servers behind a load balancer. Each node had local NVMe for ephemeral caching, and the real system of record was a remote database. The mental model was: “local disk is just a cache; if it dies, we reboot and it comes back.”
One afternoon, the error rate rose in a clean diagonal line. Latency increased, but only for a subset of requests. The on-call restarted the app service. No change. They rebooted a node. That node came back fast and looked healthy for five minutes, then degraded again. They rebooted two more nodes. Same story, now with more customer-visible thrash.
The wrong assumption was subtle: the local NVMe wasn’t merely a cache. It held a write-ahead spool for outbound events, designed to survive process restarts and short network glitches. Rebooting wiped the spool because it was on a tmpfs mount that was “temporary by design.” In practice, it was temporary by accident.
When the upstream event collector slowed down, the spool filled and the app applied backpressure. That was the correct behavior. But the reboot erased the spool and created the illusion that the system recovered—until the downstream slow path refilled it again. Meanwhile, the erased spool meant dropped events, which triggered compensating retries later and extended the incident’s tail.
They fixed it by moving the spool to persistent storage (with explicit lifecycle policies), adding alerting on spool depth, and changing runbooks: “never reboot to clear backpressure unless you accept data loss.” The reset button didn’t cause the bottleneck. It just made the bottleneck dishonest.
Mini-story 2: The optimization that backfired
A platform group decided to reduce reboot time for a fleet of nodes. Faster reboots meant faster patching, faster evacuations, and less downtime risk. So they optimized boot: parallelized startup, reduced logging, trimmed “slow” checks, and disabled a storage scan that “never found anything.”
For weeks, it looked like a win. Reboots dropped from minutes to under a minute. Change windows got shorter. People stopped dreading kernel upgrades. The reset button became safer, so it got used more.
Then a node rebooted during a routine patch and came up with a degraded storage mirror. No one noticed because the scan that would have screamed about it was disabled. The node rejoined the pool and resumed workload. Under load, the remaining disk began throwing intermittent errors. The application layer saw timeouts and retried. Retries created more load. Classic positive feedback loop.
By the time someone checked storage health, the node was in a failure spiral: error handling was generating more I/O than success. They removed it from rotation and finally looked at the hardware history. The “slow” scan had been catching early warning signs of drive issues. They had traded reliability for a better stopwatch reading.
The fix wasn’t to abandon optimization; it was to understand what they removed. They reinstated the health checks, made them asynchronous where possible, and gated node admission into the cluster on passing them. Fast reboot is fine. Fast reboot into broken storage is just a faster way to hurt yourself.
Mini-story 3: The boring but correct practice that saved the day
A different company, different culture. The platform team had a ritual they called “pre-mortem snapshots.” It wasn’t fancy: before any disruptive mitigation (reboot, failover, rolling restart), they ran a small set of commands and shoved the output into the incident ticket.
During an incident, one node started dropping requests with no obvious pattern. CPU was fine. Memory was fine. The service logs were unhelpful. Everyone wanted to reboot because it was the fastest way to stop the pain.
The on-call followed the boring practice anyway. They captured dmesg tail, iostat, and interface counters. dmesg showed occasional NIC resets. Interface counters showed rising RX drops. iostat was clean. That evidence changed the decision: instead of rebooting repeatedly, they drained the node and rebuilt the NIC configuration (a virtual function in this case). They also adjusted interrupt affinity on the host.
The node returned to service without a single reboot. More importantly, the team didn’t waste hours “fixing” a network issue with resets that would have temporarily cleared counters and reset the driver state. The boring snapshot habit didn’t just save the day; it saved the narrative. The postmortem had evidence, not folklore.
Common mistakes: symptom → root cause → fix
1) “Reboot fixed it” becomes the ticket title
Symptom: Incident resolves after reboot; no further action taken.
Root cause: Evidence destroyed; organizational amnesia. The system stays fragile.
Fix: Require a minimal pre-reboot capture bundle and a post-incident hypothesis. If you can’t do both, admit you’re choosing availability over learning—and schedule learning later.
2) Service restarts loop forever
Symptom: Systemd shows a service flapping; it restarts every few seconds.
Root cause: External dependency down (DNS, DB, disk full), misconfig, or permission failure.
Fix: Use journalctl -u to find the first error; fix the dependency. Restarts are not recovery if the cause is deterministic.
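“Find the first error” is easy to operationalize. A minimal sketch against the failing unit from the symptom above:
cr0x@server:~$ journalctl -u app-worker.service -p err --since "1 hour ago" --no-pager | head -n 5   # the earliest errors, not the newest noise
cr0x@server:~$ systemctl show -p Restart -p RestartUSec -p NRestarts app-worker.service   # how hard systemd is spinning on the restart loop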
3) “It’s CPU” because load average is high
Symptom: Load average climbs; engineers assume compute saturation.
Root cause: Load includes runnable and uninterruptible tasks. I/O waits can inflate load with low CPU usage.
Fix: Check iowait in top/iostat and look for D-state processes. If iowait is high, treat as storage/path issue.
4) Rebooting the database during an I/O incident
Symptom: DB latency spikes; reboot “helps” briefly; then worse.
Root cause: Storage saturation or device errors; a reboot forces crash recovery, WAL replay, and a cold cache, all of which add I/O on top of the original problem.
Fix: Reduce write amplification, stop heavy jobs, move hot data, throttle compactions/checkpoints, and fix storage. Reboot only after containment.
5) Power-cycling hides a failing disk
Symptom: Occasional I/O hangs; power-cycle restores service for days.
Root cause: Drive/controller firmware bug or degrading media; resets clear error states temporarily.
Fix: Correlate dmesg errors with SMART stats or pool degradation; replace hardware, update firmware, validate cabling/backplane.
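Correlating the kernel’s complaints with device health takes about two commands. A minimal sketch, assuming smartmontools and nvme-cli are installed and the device is /dev/nvme0:
cr0x@server:~$ sudo smartctl -a /dev/nvme0   # media errors, error log entries, percentage used, temperature
cr0x@server:~$ sudo nvme smart-log /dev/nvme0   # same story from nvme-cli: critical_warning, media_errors, num_err_log_entries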
6) Cache-flush “fixes” performance
Symptom: After reboot, latency improves dramatically, then slowly worsens.
Root cause: Cache pollution, fragmentation, memory leak, or unbounded cardinality (like per-tenant caches).
Fix: Measure cache hit ratio and eviction behavior; implement cache caps, TTLs, and metrics. Don’t treat reboot as cache management.
7) Kubernetes node reboots cause cascading reschedules
Symptom: Rebooting one node triggers cluster-wide latency spike.
Root cause: Pod disruption budgets absent, too-small spare capacity, aggressive autoscaling, or heavy stateful workloads rescheduled at once.
Fix: Drain properly, enforce PDBs, keep headroom, and rate-limit rollouts. Reboot is fine; stampedes are not.
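Two cheap checks before any node reboot in a cluster: do disruption budgets exist for the workloads that matter, and is there headroom for the evicted pods to land? A minimal sketch, assuming kubectl access; the second command needs metrics-server:
cr0x@server:~$ kubectl get pdb --all-namespaces   # ALLOWED DISRUPTIONS of 0 means drains will stall or stampede
cr0x@server:~$ kubectl top nodes   # is there spare CPU/memory for rescheduled pods?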
8) Treating network timeouts as application bugs
Symptom: Random timeouts across services; rebooting some nodes “helps.”
Root cause: DNS resolver issues, packet drops, conntrack exhaustion, or MTU mismatch.
Fix: Check resolver, interface drops, and conntrack; fix the network condition. Reboots can reset conntrack and appear to help, which is why people keep doing them.
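Conntrack exhaustion is cheap to confirm or rule out. A minimal sketch; the sysctls only exist once the nf_conntrack module is loaded, and conntrack -S comes from the conntrack-tools package:
cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max   # a count near max means drops that look like random timeouts
cr0x@server:~$ sudo conntrack -S | head -n 4   # per-CPU insert_failed and drop counters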
Checklists / step-by-step plan
Checklist A: Before you press reset (production)
- Confirm scope: one node vs fleet vs dependency.
- Capture the minimum forensic bundle: uptime, top snapshot, iostat, df -h, dmesg tail, service failures.
- Check storage health: errors in dmesg, pool/RAID status, device saturation.
- Check network basics: interface drops, DNS resolution, route table sanity if relevant.
- Pick the smallest mitigation: restart service → reboot node → power-cycle.
- Plan customer impact: drain traffic, cordon node, fail over if possible.
- Communicate: “Mitigation is reboot; goal is restore. Evidence captured. Next step is root cause.”
Checklist B: Controlled reboot in a cluster (the safe way)
- Cordon/drain the node (or remove it from the load balancer).
- Confirm stateful workloads are safe (replication healthy, no rebuild in progress).
- Reboot once. If it didn’t help, don’t do it again mindlessly.
- On return, validate health checks and key metrics before reintroducing traffic.
- Watch for cache-warm effects and thundering herds.
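On a Kubernetes node, Checklist B maps onto a short command sequence. A minimal sketch with node-07 as a placeholder; the drain flags are the common ones, adjust them to your workloads:
cr0x@server:~$ kubectl cordon node-07
cr0x@server:~$ kubectl drain node-07 --ignore-daemonsets --delete-emptydir-data --timeout=10m
cr0x@server:~$ ssh node-07 'sudo systemctl reboot'
cr0x@server:~$ kubectl get node node-07 -w   # wait for Ready before letting traffic back
cr0x@server:~$ kubectl uncordon node-07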
Checklist C: After reset (turn it into learning)
- Compare key metrics: latency, error rate, iowait, drops, memory usage pre/post.
- Extract a hypothesis: “Reboot helped because driver reset cleared X” is a starting point, not an ending.
- Make one hardening change: alert, limit, test, or upgrade. One per incident is enough to compound over time.
- Update the runbook: include the exact commands and the decision rule that was correct.
FAQ
1) Is rebooting a server in production always bad?
No. Uncontrolled rebooting is bad. A controlled reboot—after draining traffic and capturing evidence—is a legitimate mitigation, especially for kernel/driver state.
2) Why does rebooting “fix” issues that we can’t reproduce?
Because many failures are emergent state: leaks, fragmentation, queue buildup, or stuck devices. Reboot resets that state. It doesn’t explain it.
3) Should we restart the service or reboot the node?
Restart the service if the node looks healthy (no iowait storm, no D-state pileups, no kernel errors). Reboot the node when you suspect kernel/driver/I/O path issues.
4) What’s the single most important thing to do before reboot?
Capture kernel and resource evidence: dmesg tail, iostat, top, df -h, and relevant service logs. If you reboot without this, you’re choosing ignorance.
5) How do I know if the bottleneck is disk I/O?
Look for high iowait in top, high await and %util in iostat -x, and D-state processes waiting on I/O.
6) Can a reboot make a storage incident worse?
Yes. Reboots can trigger rebuilds/resilvers, cold-cache amplification, journal replays, and reconnect storms to shared storage. Stabilize first, then reboot with intent.
7) In Kubernetes, is rebooting a node acceptable?
Yes, if you cordon and drain properly, respect pod disruption budgets, and keep cluster headroom. Treat node reboot like a rolling upgrade, not a surprise.
8) Why do we see “it works after reboot” with network problems?
Because reboot resets conntrack tables, clears transient driver states, and forces fresh DNS resolution and TCP sessions. That’s also why it masks the real network defect.
9) How do we reduce our dependence on resets?
Add limits (memory, disk, queue depth), add observability (latency breakdown by layer), and make known failure modes self-healing (health checks that restart the right component, not the entire node).
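Limits and targeted restarts are mostly configuration. A minimal sketch of a systemd drop-in that caps one service’s memory so the OOM killer takes out the single offender and systemd restarts it, instead of the whole node thrashing; app-worker.service and the values are examples:
cr0x@server:~$ sudo mkdir -p /etc/systemd/system/app-worker.service.d
cr0x@server:~$ printf '[Service]\nMemoryMax=8G\nRestart=on-failure\nRestartSec=5\n' | sudo tee /etc/systemd/system/app-worker.service.d/limits.conf
cr0x@server:~$ sudo systemctl daemon-reload && sudo systemctl restart app-worker.service
cr0x@server:~$ systemctl show -p MemoryMax app-worker.service   # confirm the cap took effect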
10) Should we automate reboots with watchdogs?
Sometimes. For single-purpose appliances and edge nodes, watchdogs can be correct. For complex stateful systems, automatic reboot without evidence capture and rate limits is a great way to erase the only clues you had.
Conclusion: next steps that reduce “reboot as a lifestyle”
The reset button era isn’t about shame. It’s about clarity. Resets are honest because they admit a system has accumulated state you can’t reason about under pressure. They’re dishonest because they erase the trail unless you deliberately preserve it.
If you run production systems, do three things starting this week:
- Standardize the pre-reboot bundle (five commands, one file, always attached to the incident).
- Adopt the smallest-hammer rule (restart service before reboot; reboot before power-cycle; fail over before rebuilding).
- Turn every successful reset into a hypothesis you can test: storage, network, memory, kernel, or dependency. Then harden one weak spot.
Rebooting isn’t failure. Rebooting without learning is.