A segmentation fault feels small. One process dies. The supervisor restarts it. Someone shrugs and says “cosmic ray,”
then gets back to shipping features. And that's how a single crash ends a quarter: not through the segfault itself,
but through the chain reaction you didn't design for.
In production, a segfault is rarely “a bug in one binary.” It’s a systems event. It tests your deployment hygiene,
your data durability model, your load shedding, your observability, and your organizational honesty. If you treat it like
a local developer inconvenience, you’ll learn about it again—during the board meeting.
What a segfault really means (and what it doesn’t)
A segfault is the operating system enforcing memory protection. Your process tried to access memory it shouldn’t:
an invalid address, a page without permissions, or an address that used to be valid until you freed it and kept using it.
The kernel sends a signal (usually SIGSEGV), and unless you handle it (you almost never should), the process dies.
“Segfault” is a symptom label. The underlying cause could be:
- Use-after-free and other lifetime bugs (the classics; sketched in the example right after this list).
- Buffer overflow (write past bounds, corrupt metadata, crash later somewhere unrelated).
- Null pointer dereference (cheap, embarrassing, still happens in 2026).
- Stack overflow (recursion or large stack frames, often triggered by unexpected input).
- ABI mismatch (plugin loads against wrong library version; the code “runs” until it doesn’t).
- Hardware issues (rare, but not mythical: bad RAM and flaky CPUs do exist).
- Kernel/userspace contract violations (e.g., seccomp filters, unusual memory mappings).
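To make the first item concrete, here is a minimal C sketch (purely illustrative, not from any real service) of why a use-after-free so often crashes far from the buggy line:

/* Illustrative use-after-free: the stale write may "work" today and corrupt
   something unrelated tomorrow. Not real service code. */
#include <stdlib.h>
#include <string.h>

struct session { char token[32]; };

static struct session *lookup(void) {
    struct session *s = malloc(sizeof *s);
    strcpy(s->token, "abc123");
    return s;
}

int main(void) {
    struct session *s = lookup();
    free(s);                                   /* lifetime ends here */

    /* The allocator may hand the freed block to the very next malloc()... */
    char *other = malloc(sizeof(struct session));
    strcpy(other, "completely unrelated data");

    /* ...so this stale write scribbles over `other`, not over anything named `s`.
       Nothing crashes on this line; the damage surfaces wherever `other` is used next. */
    strcpy(s->token, "stale");

    free(other);
    return 0;
}

Under a different allocator or different load, that stale write lands somewhere else entirely, which is exactly why the crash site keeps moving.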
Here’s what it usually doesn’t mean: “the OS randomly killed us.” Linux is not out to get your service.
If a process segfaults, something wrote to memory it shouldn’t, or executed code it shouldn’t, or returned to an address it shouldn’t.
The kernel is just the bouncer.
One dry operational truth: if your incident response starts with “it’s probably just one crash,” you’re already behind.
Crashes are how undefined behavior introduces itself.
Joke #1: A segfault is your program’s way of saying “I would like to speak to the manager,” except the manager is the kernel and it’s not negotiating.
Crash ≠ outage, unless you designed it that way
A single crash can be a non-event if your system is built for it:
graceful degradation, retries with jitter, idempotent operations, bounded queues, and durable state.
But if your crash takes down a shard, corrupts an in-memory cache that your database suddenly depends on,
wedges a leader election, or triggers a restart storm, you’ve turned “one bad pointer” into a revenue event.
Core concepts you should keep in your head during triage
- Crash site vs. bug site: the instruction that crashed is often far from the write that corrupted memory.
- Determinism: if it crashes at the same instruction under the same request, it’s likely a straightforward bug; if not, suspect memory corruption, races, or hardware.
- Blast radius: the technical bug might be tiny; the operational bug (coupling) is what hurts.
- Time-to-first-signal: your first job is not “root cause.” It’s “stop the bleeding with evidence intact.”
Paraphrased idea from John Gall (systems thinker): complex systems that work often evolve from simpler systems that worked. If you can’t survive one crash, your system isn’t done evolving.
Why one crash can end a quarter
The quarter-ending part isn’t melodrama. It’s how modern systems and modern businesses amplify small technical events:
tight coupling, shared dependencies, aggressive SLOs, and the business habit of stacking launches.
The standard cascade
A typical sequence looks like this:
- One process segfaults under a particular request shape or load pattern.
- Supervisor restarts it quickly (systemd, Kubernetes, your own watchdog).
- State is lost (in-flight requests, in-memory session store, cached auth tokens, queue offsets, leader leases).
- Clients retry (sometimes correctly, often in a synchronized stampede).
- Latency spikes because the restarted process warms caches, rebuilds connection pools, replays logs.
- Downstream systems get hammered (databases, object storage, dependency APIs).
- Autoscaling responds late (or not at all) because metrics lag and scaling policies are conservative.
- More crashes occur due to memory pressure, timeouts, and queue buildup.
- Operators panic-restart things and delete the evidence (core dumps and logs vanish).
- Finance notices because customer-visible behavior changes: failed checkouts, missed ad auctions, delayed settlements.
How storage makes it worse (yes, even if “it’s a crash”)
As a storage engineer, I’ll say the quiet part: storage turns crashes into money problems because it’s where “temporary”
state becomes durable state. A segfault can:
- Corrupt local state if your process writes non-atomically and doesn't fsync appropriately (see the sketch after this list).
- Corrupt remote state via partial writes, protocol misuse, or replay bugs.
- Trigger replays (Kafka consumers, WAL replay, object listing retries) that multiply load and cost.
- Cause silent data loss when the process dies between acknowledging and committing.
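On the first bullet: the boring pattern that keeps "crashed mid-write" from becoming "corrupt on disk" is write to a temp file, fsync it, rename over the target, fsync the directory. A minimal C sketch, with paths and error handling simplified for illustration:

/* Minimal sketch of a crash-safe whole-file write: write a temp file, fsync it,
   rename over the target, then fsync the directory so the rename itself is durable.
   Paths and error handling are simplified for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int write_atomically(const char *dir, const char *final_path,
                            const char *tmp_path, const void *buf, size_t len) {
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    if (rename(tmp_path, final_path) != 0)   /* atomic replace on POSIX filesystems */
        return -1;
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);                     /* persist the directory entry change */
    close(dfd);
    return rc;
}

int main(void) {
    static const char state[] = "offset=42\n";
    return write_atomically(".", "state.txt", "state.txt.tmp",
                            state, sizeof state - 1) == 0 ? 0 : 1;
}

If the process segfaults anywhere before the rename lands, readers still see the previous complete file; the leftover temp file is garbage you can delete on the next start.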
Plenty of teams “handle” crashes by adding retries. Retries are not reliability; they are load multipliers.
If your retry policy isn’t bounded, jittered, and informed by idempotency, you’re writing an outage generator.
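For the record, "bounded, jittered, and idempotent" is not exotic. A minimal C sketch of the retry side, where do_request() is a hypothetical stand-in for an idempotent call to a dependency:

/* Sketch of a bounded retry loop with exponential backoff and full jitter.
   do_request() is a hypothetical placeholder for an idempotent dependency call. */
#include <stdlib.h>
#include <time.h>

static int do_request(void) {
    return -1;  /* placeholder: pretend the dependency is down */
}

static int call_with_retries(void) {
    const int max_attempts = 5;      /* bounded: eventually give up and surface the error */
    long backoff_ms = 100;           /* base delay */
    const long cap_ms = 5000;        /* never back off longer than this */

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (do_request() == 0)
            return 0;
        if (attempt == max_attempts)
            break;
        long sleep_ms = rand() % (backoff_ms + 1);   /* full jitter: 0..backoff */
        struct timespec ts = { sleep_ms / 1000, (sleep_ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);
        backoff_ms = (backoff_ms * 2 > cap_ms) ? cap_ms : backoff_ms * 2;
    }
    return -1;   /* let the caller degrade gracefully instead of retrying forever */
}

int main(void) {
    return call_with_retries() == 0 ? 0 : 1;
}

The bound keeps one crashing dependency from turning every caller into a load generator; the jitter keeps a thousand callers from retrying in lockstep.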
The business layer: why this shows up on earnings calls
The business doesn’t care about your stack trace. It cares about:
- Conversion rate drops when latency spikes and timeouts rise.
- SLA/SLO penalties (explicit penalties or churn).
- Support costs and brand damage when customers experience inconsistent behavior.
- Opportunity cost: you freeze deploys and delay launches to stabilize.
- Engineering distraction: the best people get pulled into incident response for days.
A segfault can be “one line of code.” But in production, it’s a test of your system’s shock absorbers.
Most systems fail that test because the shock absorbers were never installed—just promised.
Facts and history: why we keep doing this to ourselves
- Memory protection is older than your company. Hardware-enforced page protection and faults were mainstream long before Linux; segfaults are the “feature working.”
- Early Unix popularized the “core” file. Dumping a process image on crash was a pragmatic debugging tool when interactive debugging was harder.
- SIGSEGV is not the same as SIGBUS. Segfault is invalid access; bus error often indicates alignment faults or issues with mapped files/devices.
- ASLR changed the game. Address Space Layout Randomization made exploitability harder and debugging slightly more annoying; symbolized backtraces matter more.
- Heap allocators evolved because crashes were expensive. Modern allocators (ptmalloc, jemalloc, tcmalloc) trade speed, fragmentation, and debugging features differently.
- Debug symbols became a production concern. The rise of continuous deployment made “we’ll reproduce it locally” a fantasy; you need symbols and build IDs to debug what actually ran.
- Containerization complicated core dumps. Namespaces and filesystem isolation mean cores can vanish unless you deliberately route them somewhere.
- “Fail fast” was misread. Failing fast is good when it prevents corruption; it’s bad when it triggers coordinated retries and state loss without guardrails.
- Modern kernels are chatty in helpful ways. dmesg can include the faulting address, instruction pointer, and even library offsets, if you preserved the logs.
Segfaults didn’t get more common. We just built taller towers of dependencies around them.
Fast diagnosis playbook (first/second/third)
This is the 15–30 minute playbook: find the strongest signal and choose the right containment strategy.
It is not about solving the bug forever, not yet.
First: stop the restart storm and preserve evidence
- Stabilize: scale out, temporarily disable aggressive retries, and add rate limits.
- Preserve: ensure core dumps and logs survive restarts (or at least preserve the last crash).
- Verify blast radius: is it one host, one AZ, one version, one workload?
Second: identify the crash signature
- Where did it crash? function name, module, offset, faulting address.
- When did it start? correlate with deploy, config change, traffic pattern.
- Is it deterministic? same request triggers, same stack trace, same host?
Third: pick the branch based on strongest signal
- Same binary version only: rollback or disable feature gate; proceed with core analysis.
- Only certain hosts: suspect hardware, kernel, libc, or configuration drift; drain and compare.
- Only high load: suspect race, memory pressure, timeouts leading to unsafe cleanup paths.
- After dependency issues: suspect error handling bug (null deref on failed response, etc.).
The trap: diving into GDB immediately while the system is still flapping. Fix the operational failure mode first.
You need a stable patient before you do surgery.
Hands-on tasks: commands, outputs, decisions (12+)
These are real tasks I expect on-call engineers to run during a crash investigation. Each has: a command, realistic output,
what it means, and what decision to make next.
Task 1: Confirm the crash in the journal and get the signal
cr0x@server:~$ sudo journalctl -u checkout-api.service -S "30 min ago" | tail -n 20
Jan 22 03:11:07 node-17 checkout-api[24891]: FATAL: worker 3 crashed
Jan 22 03:11:07 node-17 systemd[1]: checkout-api.service: Main process exited, code=dumped, status=11/SEGV
Jan 22 03:11:07 node-17 systemd[1]: checkout-api.service: Failed with result 'core-dump'.
Jan 22 03:11:08 node-17 systemd[1]: checkout-api.service: Scheduled restart job, restart counter is at 6.
Jan 22 03:11:08 node-17 systemd[1]: Stopped Checkout API.
Jan 22 03:11:08 node-17 systemd[1]: Started Checkout API.
Output meaning: status 11/SEGV confirms SIGSEGV. code=dumped suggests a core dump exists (or systemd thinks it does).
Decision: slow the restart loop (RestartSec), and go find the core dump before it gets rotated.
Task 2: Check kernel messages for the faulting address and instruction pointer
cr0x@server:~$ sudo dmesg -T | tail -n 8
[Wed Jan 22 03:11:07 2026] checkout-api[24891]: segfault at 0000000000000038 ip 000055c1b2f9a4d2 sp 00007ffd7b5d1a10 error 4 in checkout-api[55c1b2f82000+3e0000]
[Wed Jan 22 03:11:07 2026] Code: 48 8b 47 38 48 85 c0 74 0a 48 8b 00 48 8b 50 10 ff 52 08 48 8b 47 10 <48> 8b 40 08
Output meaning: fault at 0x38 is a classic null-pointer+offset pattern (dereferencing a struct field off a null base). The ip points into your binary text segment.
Decision: prioritize “bad pointer in error path” hypotheses; pull the core and symbols to confirm which field.
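Why 0x38 rather than 0x0? Because the faulting address is the base pointer plus the field offset. A hypothetical struct layout (not the actual checkout-api code) that would fault at exactly that address from a null base:

/* Hypothetical struct layout showing why a null base pointer plus a field
   offset faults at a small address like 0x38. Offsets assume a typical
   64-bit (LP64) target. Not the real checkout-api code. */
#include <stdint.h>

struct session {
    char     id[16];        /* offset 0x00 */
    uint64_t created_at;    /* offset 0x10 */
    uint64_t expires_at;    /* offset 0x18 */
    void    *auth_ctx;      /* offset 0x20 */
    void    *cart;          /* offset 0x28 */
    void    *pricing;       /* offset 0x30 */
    uint64_t flags;         /* offset 0x38 */
};

static uint64_t get_flags(const struct session *s) {
    return s->flags;        /* if s is NULL, the load targets address 0x38 */
}

int main(void) {
    const struct session *s = NULL;  /* e.g., a failed lookup nobody checked */
    return (int)get_flags(s);        /* segfaults at address 0x38 */
}

That is why triage starts by mapping the faulting address back to "which field of which struct," not by rereading the whole module.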
Task 3: Locate the core via systemd-coredump
cr0x@server:~$ coredumpctl list checkout-api | tail -n 3
TIME PID UID GID SIG COREFILE EXE
Wed 2026-01-22 03:09:55 UTC 24602 1001 1001 11 present /opt/checkout/bin/checkout-api
Wed 2026-01-22 03:11:07 UTC 24891 1001 1001 11 present /opt/checkout/bin/checkout-api
Output meaning: core files are present for those PIDs. Good—evidence exists.
Decision: immediately extract the newest core to a safe location (cores get vacuumed).
Task 4: Extract the core to disk so it survives rotation
cr0x@server:~$ sudo coredumpctl dump 24891 --output=/var/tmp/checkout-api.core
/var/tmp/checkout-api.core
Output meaning: core written to a stable path.
Decision: snapshot the exact executable and its libraries next, matching build IDs.
Task 5: Verify the binary build ID (so symbols match)
cr0x@server:~$ readelf -n /opt/checkout/bin/checkout-api | grep -A2 "Build ID"
Build ID: 9b7c6fd3b2a9cdb5d3d1c9e0a4f2f7aa12c0f2ab
Output meaning: this build ID must match your debug symbols package/artifact.
Decision: fetch symbols for that build ID; if you can’t, you’ll still do offsets-based debugging but slower and riskier.
Task 6: Quick backtrace from the core (best-effort)
cr0x@server:~$ gdb -q /opt/checkout/bin/checkout-api /var/tmp/checkout-api.core -ex "set pagination off" -ex "thread apply all bt" -ex "quit"
Reading symbols from /opt/checkout/bin/checkout-api...
(No debugging symbols found in /opt/checkout/bin/checkout-api)
[New LWP 24891]
Core was generated by `/opt/checkout/bin/checkout-api --config /etc/checkout/config.yaml'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055c1b2f9a4d2 in ?? ()
#1 0x000055c1b2f63c10 in ?? ()
#2 0x00007f2b8e31c1f5 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x000055c1b2f6456a in ?? ()
Output meaning: no symbols, so frames are unknown. Still, you have the crash address 0x55c1b2f9a4d2.
Decision: translate the crash address into a function using addr2line once symbols are available; meanwhile, use the binary map to compute offsets.
Task 7: Confirm memory mappings to compute the offset
cr0x@server:~$ gdb -q /opt/checkout/bin/checkout-api /var/tmp/checkout-api.core -ex "info proc mappings" -ex "quit"
Mapped address spaces:
Start Addr End Addr Size Offset objfile
0x000055c1b2f82000 0x000055c1b3362000 0x003e0000 0x0000000000000000 /opt/checkout/bin/checkout-api
0x00007f2b8e2f0000 0x00007f2b8e4d0000 0x001e0000 0x0000000000000000 /lib/x86_64-linux-gnu/libc.so.6
Output meaning: the binary is mapped starting at 0x55c1b2f82000. Your crash IP is within that range.
Decision: compute offset: 0x55c1b2f9a4d2 - 0x55c1b2f82000. Use that offset with symbol tools.
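Here that works out to 0x55c1b2f9a4d2 - 0x55c1b2f82000 = 0x184d2. For a position-independent executable, that offset is what addr2line expects once you have debug info matching the build ID from Task 5; the path below is a hypothetical location for that artifact:

cr0x@server:~$ addr2line -e /opt/debug/checkout-api-9b7c6fd3.debug -f -C -i 0x184d2

With matching symbols this prints the function name and file:line (plus inlined frames); without them it prints ?? and you are back to disassembly.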
Task 8: Check whether core dumps are being truncated by limits
cr0x@server:~$ ulimit -c
0
Output meaning: core size limit is zero for the current shell; depending on service manager, the unit may override it, but often this means “no cores.”
Decision: ensure systemd unit has LimitCORE=infinity or configure coredump.conf properly; otherwise you’ll debug blind.
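A minimal sketch of the unit-level fix, assuming a drop-in at /etc/systemd/system/checkout-api.service.d/override.conf (the path follows this example's unit name):

[Service]
LimitCORE=infinity

cr0x@server:~$ sudo systemctl daemon-reload

The new limit applies to processes started after the unit is next (re)started, so the following crash can actually produce a full core.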
Task 9: Verify systemd core dump storage policy
cr0x@server:~$ sudo grep -E '^(Storage|ProcessSizeMax|ExternalSizeMax|MaxUse|KeepFree)=' /etc/systemd/coredump.conf
Storage=external
ProcessSizeMax=2G
ExternalSizeMax=2G
MaxUse=8G
KeepFree=2G
Output meaning: cores are stored externally with caps. If your process is >2G RSS at crash, the core might be cut or missing.
Decision: if cores are missing or truncated, temporarily raise the cap on impacted nodes, or reproduce under controlled conditions with smaller memory footprint.
Task 10: Rule out OOM-kill (different failure, same “it died” symptom)
cr0x@server:~$ sudo journalctl -k -S "1 hour ago" | grep -i -E "oom|killed process" | tail -n 5
Output meaning: empty output suggests no OOM kill in the last hour.
Decision: keep focusing on segfault; if you do see OOM kills, treat as memory pressure first (and the segfault might be secondary corruption under stress).
Task 11: Check for restart storms and rate-limit them
cr0x@server:~$ systemctl show checkout-api.service -p Restart -p RestartUSec -p NRestarts
Restart=always
RestartUSec=200ms
NRestarts=37
Output meaning: 200ms restart delay and 37 restarts is a self-inflicted load test. It will amplify downstream retries and can starve the host.
Decision: increase restart delay (seconds to minutes) and consider StartLimitIntervalSec/StartLimitBurst to prevent flapping from taking the node down.
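A sketch of what that looks like as a drop-in, with illustrative values you should tune to your own recovery time:

[Unit]
StartLimitIntervalSec=10min
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=30s

With this, five failures inside ten minutes stop the unit instead of hammering the host and its downstreams; you trade instant restarts for breathing room and intact evidence.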
Task 12: See if the crash correlates with a deploy (don’t guess)
cr0x@server:~$ sudo journalctl -u checkout-api.service -S "6 hours ago" | grep -E "Started|Stopping|version" | tail -n 15
Jan 21 22:10:02 node-17 systemd[1]: Started Checkout API.
Jan 22 02:58:44 node-17 checkout-api[19822]: version=2.18.7 git=9b7c6fd3 feature_flags=pricing_v4:on
Jan 22 03:11:08 node-17 systemd[1]: Started Checkout API.
Output meaning: you have a version and a feature flag state captured in logs. That’s gold.
Decision: if the crash started after enabling a flag, disable it first. If it started after a version bump, rollback while you analyze.
Task 13: Validate shared library resolution (ABI mismatch catches)
cr0x@server:~$ ldd /opt/checkout/bin/checkout-api | head -n 12
linux-vdso.so.1 (0x00007ffd7b7d2000)
libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007f2b8e600000)
libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007f2b8e180000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2b8df00000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2b8e2f0000)
Output meaning: confirms which libs are loaded. If you expected different versions (e.g., in /opt), you have a packaging or runtime environment problem.
Decision: if dependency drift exists, pin and redeploy; debugging a crash on an unintended ABI is wasted time.
Task 14: Check for host-level anomalies (disk full can ruin coredumps and logs)
cr0x@server:~$ df -h /var /var/tmp
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg0-var 60G 58G 1.6G 98% /var
/dev/mapper/vg0-var 60G 58G 1.6G 98% /var/tmp
Output meaning: you’re nearly out of space where cores are being written. That means “no evidence” soon, and possibly service issues if logs can’t write.
Decision: free space immediately (vacuum journal, rotate cores), or redirect core storage to a larger filesystem for the duration of the incident.
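Two quick ways to buy space without deleting the evidence you actually need, shown as examples rather than policy (the size and destination path are illustrative):

cr0x@server:~$ sudo journalctl --vacuum-size=2G
cr0x@server:~$ sudo mv /var/tmp/checkout-api.core /srv/incident-scratch/

The first trims archived journal files down to roughly 2G without touching the active journal; the second moves the core you already extracted onto a filesystem with headroom before /var hits 100%.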
Task 15: Confirm whether the crash is tied to a specific request (application logs + sampling)
cr0x@server:~$ sudo journalctl -u checkout-api.service -S "10 min ago" | grep -E "request_id|panic|FATAL" | tail -n 10
Jan 22 03:11:07 node-17 checkout-api[24891]: request_id=9f2b4d1f path=/checkout/confirm user_agent=MobileApp/412.3
Jan 22 03:11:07 node-17 checkout-api[24891]: FATAL: worker 3 crashed
Output meaning: a request ID and path right before the crash is a strong hint. Not proof, but a direction.
Decision: pull a sample of recent requests for that endpoint, inspect payload shapes, and consider temporarily rate-limiting or feature-gating that code path.
Three corporate mini-stories from the crash trenches
Mini-story 1: The wrong assumption (the null pointer that “couldn’t happen”)
A payments-adjacent service started segfaulting once a day. Then once an hour. The on-call team treated it like a noisy but manageable bug:
the process restarted quickly, and most customers didn’t notice—until they did. Latency crept up, error rates spiked during peak hours,
and the incident channel became a lifestyle.
The first investigation went after infrastructure: kernel upgrades, host firmware, a recent library patch. Reasonable guesses, wrong target.
The only consistent clue was the faulting address: always a small offset from zero. Classic null dereference.
Yet the codebase had “obviously” checked for null. “Obviously” is an expensive word.
The bug was a wrong assumption about an upstream dependency: “this field is always present.” It was present 99.98% of the time.
Then a partner rolled out a change that omitted it for a niche product tier. The code parsed JSON into a struct, left a pointer unset,
and later used it in a cleanup path that nobody had load-tested because, again, “couldn’t happen.”
The fix was a one-line guard and a better error model. The operational fix was more interesting:
they added a circuit breaker on the partner integration, stopped retrying on malformed inputs, and introduced a canary that simulated the “missing field” case.
The segfault stopped being a quarter risk because the system stopped assuming the world was polite.
Mini-story 2: The optimization that backfired (a faster allocator, a slower company)
A high-throughput ingestion pipeline switched allocators to reduce CPU overhead and improve tail latency. It worked in benchmarks.
In production, crash rates went up slowly, then suddenly. Segfaults. Sometimes a double free. Sometimes an invalid read.
The team suspected their own code, and to be fair, they had earned that suspicion.
The backfire wasn’t “the allocator is bad.” It was that the change altered memory layout and timing enough to surface an existing race.
Previously, the race scribbled on memory that nobody touched for a while. With the new allocator, the same region got reused sooner.
Undefined behavior went from “latent” to “loud.”
Then the operational multiplier showed up: their orchestrator had aggressive liveness checks and immediate restarts.
Under crash loops, ingestion lag grew. Consumers fell behind. Backpressure didn’t propagate correctly.
The pipeline tried to catch up by pulling more work. It became a self-harming machine.
The long-term fix was correcting the race and adding TSAN/ASAN coverage in CI for key components.
The short-term fix that saved the business week was rolling back the allocator and putting a hard cap on concurrency while lag was above a threshold.
They learned the boring truth: performance changes are correctness changes wearing a hoodie.
Mini-story 3: The boring but correct practice that saved the day (symbols, cores, and a calm rollback)
An internal RPC service segfaulted after a routine dependency bump. The on-call engineer did not start by spelunking the codebase.
They started by making sure the system would stop flapping and that the crash would be debuggable.
Restart delay was increased. Affected nodes were drained. Cores were preserved.
The team had done a dull, unglamorous thing months earlier: every build published debug symbols keyed by build ID, and the runtime logged the build ID on startup.
There was also a runbook that said “if a process dumps core, copy it off the node before you do anything clever.”
Nobody celebrated this practice when it was introduced. They should have.
Within an hour they had a symbolized backtrace pointing at a specific function handling an error response.
A dependency now returned an empty list where it used to return null; the code treated “empty” as “has at least one element” and indexed it.
One line fix, one targeted test, and a canary deploy.
The service was stable the entire time because the rollback path was clean and rehearsed.
The crash didn’t become a quarter story because the company had invested in evidence, not heroics.
Joke #2: Nothing accelerates “team alignment” like a crash loop at 3 a.m.; suddenly everyone agrees on priorities.
Common mistakes: symptom → root cause → fix
These are patterns that show up over and over. Learn them once. Save yourself the repeats.
1) Symptom: segfault at address 0x0 or small offset (0x10, 0x38, 0x40)
Root cause: null pointer dereference, often in an error handling or cleanup path.
Fix: add explicit null checks; but also fix invariants—why is that pointer allowed to be null? Add tests for missing/invalid inputs.
2) Symptom: crash site changes every time; stack traces look random
Root cause: memory corruption (buffer overflow, use-after-free), often earlier than the crash.
Fix: reproduce with ASAN/UBSAN; enable allocator hardening in staging; reduce concurrency to narrow timing windows; audit unsafe code and FFI boundaries.
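Reproducing under sanitizers is mostly a build-flag exercise. A minimal sketch for a C/C++ component; the compiler choice, file names, and single-file layout are assumptions, and real builds will thread these flags through their build system:

cr0x@server:~$ clang -g -O1 -fsanitize=address,undefined -fno-omit-frame-pointer -o checkout_repro checkout_repro.c
cr0x@server:~$ ./checkout_repro < crashing_request.json

ASan aborts at the first bad access and reports where the memory was allocated and freed, which is usually the fastest route from "random-looking crash" to "specific lifetime bug." For suspected races, build with -fsanitize=thread instead; ASan and TSan cannot be combined in one binary.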
3) Symptom: segfault appears only under load, disappears when you add logs
Root cause: race condition; timing changes hide it. Logging “fixing” a crash is a classic sign you’re dealing with concurrency.
Fix: run with TSAN in CI; add lock discipline; use thread-safe data structures; enforce ownership rules (especially for callbacks).
4) Symptom: only one AZ/host pool sees the crash
Root cause: configuration drift, different library versions, CPU features, kernel differences, or bad hardware.
Fix: compare packages and kernel versions; move workload away; run memory tests if suspicion persists; rebuild golden images and eliminate drift.
5) Symptom: crash coincides with “successful” performance optimization
Root cause: the optimization changed memory layout or timing, surfacing UB; or removed bounds checks.
Fix: roll back first; then reintroduce with guardrails and sanitizers; treat perf work as risky as a schema migration.
6) Symptom: no core dumps anywhere, despite “core-dump” messages
Root cause: core limits set to zero, cores too large and capped, disk full, container runtime not permitting core writes, or systemd-coredump configured to drop them.
Fix: set LimitCORE=infinity; raise coredump size caps; ensure storage space; test core dumping intentionally in staging.
7) Symptom: “segfault” but kernel log shows general protection fault or illegal instruction
Root cause: executing corrupted code, jumping through a bad function pointer, or running on incompatible CPU features (rare, but real with aggressive build flags).
Fix: verify build targets, CPU flags, and library ABI; check for memory corruption; confirm you’re not deploying binaries built for a different microarchitecture.
8) Symptom: crashes stop when you disable one endpoint or feature flag
Root cause: specific input shape triggers undefined behavior; often parsing, bounds, or optional-field handling.
Fix: keep the flag off until fixed; add input validation; add fuzzing for that parser and contract tests for upstream dependencies.
Checklists / step-by-step plan
Containment checklist (first hour)
- Reduce blast radius: drain affected nodes, reduce traffic, or route away from the version.
- Stop restart storms: increase restart delays; cap restarts; avoid thrashing dependencies.
- Preserve evidence: copy cores off-node; snapshot logs; record build IDs and config/flags.
- Get a crash signature: kernel line (fault address, IP), one backtrace (even unsymbolized), frequency, and triggers.
- Choose the safest path: rollback beats “live debugging” when customers are burning.
Diagnosis checklist (same day)
- Symbolize the crash: match build IDs, fetch symbols, produce a readable backtrace.
- Classify: null deref vs. memory corruption vs. race vs. environment/hardware.
- Reproduce: capture the triggering request shape; build a minimal repro; run under sanitizers.
- Confirm scope: versions impacted, host pools impacted, inputs impacted.
- Patch safely: add tests; canary; gradual rollout; validate crash-free time before full deploy.
Prevention checklist (quarterly hygiene that pays rent)
- Always ship build IDs and symbols. Make symbol retrieval boring and automated.
- Keep feature flags for risky code paths. Not everything needs a flag; crash-prone code does.
- Define idempotency and retries. “Retry everything” is how you DDoS yourself.
- Fuzz parsers and boundary code. The edge cases are where segfaults breed.
- Use sanitizers in CI for key components. Especially for C/C++ and FFI boundaries.
- Plan for process death. Stateless where possible; durable where required; graceful everywhere.
FAQ
1) Is a segfault always a software bug?
Almost always, yes. Hardware can cause it (bad RAM), but treat hardware as guilty only after you have evidence:
host-specific crashes, corrected memory errors, or repeatable failures on one machine.
2) What’s the difference between SIGSEGV and SIGBUS?
SIGSEGV is invalid memory access (permissions or unmapped memory). SIGBUS often involves alignment issues or errors on memory-mapped files/devices.
Both mean “your process did something it shouldn’t,” but the debugging path differs.
3) Why does the crash happen “somewhere else” than the bug?
Memory corruption is delayed-action chaos. You overwrite metadata or a neighbor object, and the program keeps running until it touches the poisoned area.
The crash site is where the system noticed, not where you committed the crime.
4) Should I catch SIGSEGV and keep running?
No, not for general application code. Recovering safely is nearly impossible because your process state is untrustworthy.
Use SIGSEGV handlers only for logging minimal diagnostic context and then exiting.
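A minimal sketch of that pattern, here re-raising the signal so the default action still produces a core; this illustrates the shape, not code to ship as-is:

/* Log one async-signal-safe breadcrumb on SIGSEGV, then die with a core.
   Only async-signal-safe calls (write, signal, raise) are used in the handler. */
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void on_segv(int sig) {
    static const char msg[] = "FATAL: SIGSEGV caught, dying to preserve core\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1);  /* not fprintf: no malloc, no locks */
    (void)ignored;
    signal(sig, SIG_DFL);   /* restore default disposition (terminate + dump core) */
    raise(sig);             /* delivered with the default action when the handler returns */
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = on_segv;
    sigaction(SIGSEGV, &sa, NULL);

    /* ... application code ... */
    return 0;
}

The point is what the handler does not do: no logging frameworks, no allocation, no attempt to "recover" and keep serving traffic with corrupted state.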
5) Why are core dumps missing in containers?
Often because core limits are zero, the filesystem is read-only or size-limited, or the container runtime blocks core dumping.
You must intentionally configure core dump behavior per runtime and ensure storage exists.
6) If I don’t have debug symbols, is the core useless?
Not useless, just slower. You can still use instruction pointers, mapping offsets, and build IDs to locate code.
But you’ll waste time and risk misattribution. Symbols are cheap compared to downtime.
7) Can a segfault corrupt data?
Yes. If you crash mid-write without atomicity and durability guarantees, you can leave partial state.
If you acknowledge work before it’s committed, you can lose data. If you retry non-idempotently, you can duplicate data.
8) What’s the fastest safe mitigation when crashes spike after a deploy?
Roll back or disable the feature flag. Do it early. Then analyze with preserved evidence.
Hero debugging while customers burn is how you extend incidents and create folklore instead of fixes.
9) How do I know if it’s a race condition?
Crashes that disappear with extra logging, change with CPU count, or show different stack traces under load are strong indicators.
TSAN and controlled stress tests are your friends.
Conclusion: next steps that actually reduce crashes
A segfault is not a “bug.” It’s a systems failure that reveals what you assumed about memory, inputs, dependencies, and recovery.
The crash is the smallest part of the story. The rest is coupling, retries, state handling, and whether you preserved the evidence.
Do these next, in this order:
- Make cores and symbols non-negotiable. If you can’t debug what ran, you’re running on hope.
- Harden restart behavior. Slow restarts, cap flapping, and prevent retry stampedes.
- Design for process death. Idempotency, durability boundaries, and graceful degradation are how one crash stays one crash.
- Invest in sanitizers and fuzzing where it matters. Especially at parsers, FFI, and concurrency hotspots.
- Write the postmortem like you’ll read it again. Because you will—unless you fix the operational amplifier, not just the line of code.
The goal isn’t to eliminate every segfault forever. The goal is to make a segfault boring: contained, diagnosable, and unable to hijack a quarter.