You’re not imagining it: a single crash can drop a core dump the size of your RAM onto disk, and Debian 13 will happily keep doing it until your
node starts failing writes, your database goes read-only, and your “minor app issue” turns into a platform incident.
The fix is not “disable core dumps everywhere” (that’s how you end up debugging with vibes). The fix is a policy: keep the right dumps, keep them
briefly, compress them when it’s cheap, and make the system stop before it eats the machine.
What’s actually happening when disks fill with core dumps
A core dump is a snapshot of a process’s memory at the moment it died (usually via a fatal signal). It’s useful because it contains the state
you can’t reconstruct later: stack frames, heap objects, thread states, and sometimes sensitive data you’d rather not store for months.
On modern Debian, core dump handling typically goes through systemd-coredump. The kernel sees a crashing process, consults
/proc/sys/kernel/core_pattern, and either writes a file directly or pipes the core to a handler. With systemd, that handler can store
the dump under /var/lib/systemd/coredump/ and log metadata into the journal.
The “disk fills” failure mode has a few common shapes:
- High-churn crash loops: a service restarts, crashes immediately, repeats. Each crash generates a new dump.
- Fat processes: JVMs, browsers, language runtimes, or anything with a big heap. One dump can be tens of gigabytes.
- Misplaced storage: dumps land on /var, which is on your smallest partition (because 2009 called).
- No retention: you assumed “systemd will clean up.” It will, but only if configured and only on its own schedule.
- Security and privacy: someone disables dumps globally to avoid secrets leakage, and you lose the only forensic artifact you had.
If you do SRE for long enough, you learn that disk-full incidents aren’t “storage problems.” They’re control-plane problems: no limits, no ownership,
and no policy. Core dumps are just a particularly effective way to make those problems loud.
Exactly one quote for this whole piece, because we’re engineers and we can count: “Hope is not a strategy,” an idea often repeated in operations circles.
Fast diagnosis playbook (first / second / third)
When a node is paging you because disk is 99% used, you don’t start with philosophy. You start with triage. Here’s the fastest path to “what is
filling the disk, why now, and can I stop the bleeding without destroying evidence?”
First: confirm what filesystem is full and what’s growing
- Identify the full mount point and its backing device.
- Confirm whether it’s /var, /, or a dedicated volume.
- Find the top offenders by directory size, not by guessing.
Second: confirm it’s core dumps and identify the crash source
- Check /var/lib/systemd/coredump size and file timestamps.
- Use coredumpctl to find which executable is crashing and how often.
- Confirm whether this is a restart loop (systemd unit flapping) or one-off failures.
Third: apply a reversible stop-gap, then implement policy
- Short-term: cap dumps, temporarily disable for the crashing service only, or move storage.
- Keep one or two representative dumps for debugging; purge the rest once you have evidence.
- Then: configure retention (systemd-coredump), size limits, and storage placement.
Joke #1 (short and relevant): Core dumps are like office snacks—left unmanaged, they expand to fill all available space and still leave you hungry.
Interesting facts and short history (because this mess has lineage)
Some context makes better decisions. Here are concrete facts that matter operationally:
- Core dumps predate Linux: Unix systems have dumped process memory since the 1970s; the idea is older than most production runbooks.
- “core” wasn’t metaphorical: early machines used magnetic core memory; “core dump” literally meant dumping core memory contents.
- Default locations changed over time: older setups wrote ./core in the current working directory; modern distros often route them to system services.
- The kernel decides if dumping is allowed: setuid binaries and privileged contexts can be restricted from dumping for security reasons.
- Dump size isn’t just “RAM used”: memory mappings, huge pages, and runtime behavior can make dumps larger or smaller than expected.
- Compression is not free: compressing multi-GB dumps burns CPU and can steal cycles from recovery and paging mitigation.
- Metadata lives separately: with systemd, you get journal entries about a dump even if the dump file is missing or later rotated.
- Core dumping can be piped: core_pattern can point at a handler; that’s why you’ll see a leading | in the pattern.
Practical tasks: commands, what the output means, and the decision you make
This is the part that pays rent. Below are real tasks you can run on Debian 13. For each: command, sample output, what it means, and the decision
you make from it. Do them in order during an incident, or cherry-pick if you already know the shape of the problem.
Task 1: Confirm which filesystems are full
cr0x@server:~$ df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda2 ext4 40G 39G 320M 99% /
/dev/sda3 ext4 20G 19G 400M 98% /var
tmpfs tmpfs 3.1G 12M 3.1G 1% /run
Meaning: It’s / and /var that are dying, not some random data volume. That points directly at logs, package caches,
spool, and yes, core dumps.
Decision: Focus on /var and stop writing there. If /var is a separate filesystem, you can often recover faster
by cleaning /var without destabilizing the whole OS.
Task 2: Find the top directories under /var by actual size
cr0x@server:~$ sudo du -xhd1 /var | sort -h
24M /var/cache
180M /var/log
1.1G /var/tmp
17G /var/lib
19G /var
Meaning: /var/lib is the heavyweight. That’s consistent with systemd-coredump storage.
Decision: Drill into /var/lib before touching logs or caches. Your incident timeline improves when you delete the right thing once.
Task 3: Verify if systemd is storing core dumps on disk
cr0x@server:~$ sudo du -sh /var/lib/systemd/coredump
14G /var/lib/systemd/coredump
Meaning: Confirmed: dumps are present and large.
Decision: Don’t mass-delete yet. First identify the crashing program and keep at least one representative dump.
Task 4: List recent core dumps and identify the offender
cr0x@server:~$ sudo coredumpctl list --no-pager | head
TIME PID UID GID SIG COREFILE EXE
Mon 2025-12-29 10:02:41 UTC 9142 998 998 11 present /usr/bin/myservice
Mon 2025-12-29 10:02:12 UTC 9136 998 998 11 present /usr/bin/myservice
Mon 2025-12-29 10:01:43 UTC 9129 998 998 11 present /usr/bin/myservice
Meaning: One binary is dumping repeatedly. Signal 11 is a segfault, classic crash-loop territory.
Decision: Contain the crash loop. Keep the latest dump, then stop further dumps from accumulating.
Task 5: Confirm if it’s a systemd restart loop
cr0x@server:~$ systemctl status myservice.service --no-pager
● myservice.service - Example Service
Loaded: loaded (/etc/systemd/system/myservice.service; enabled; preset: enabled)
Active: activating (auto-restart) (Result: core-dump) since Mon 2025-12-29 10:02:42 UTC; 1s ago
Process: 9142 ExecStart=/usr/bin/myservice (code=dumped, signal=SEGV)
Main PID: 9142 (code=dumped, signal=SEGV)
Meaning: systemd is doing exactly what you asked: restarting the service and collecting cores. It is also eating your disk.
Decision: Stop the unit (or adjust restart policy) to stop the bleeding, then preserve artifacts for debugging.
Task 6: Stop the service to prevent new dumps (triage move)
cr0x@server:~$ sudo systemctl stop myservice.service
cr0x@server:~$ sudo systemctl is-active myservice.service
inactive
Meaning: Crash loop has stopped, at least on this node. Disk usage should stop growing.
Decision: If this is a fleet service, consider traffic shifting or draining this node. Stabilize first; debug second.
Task 7: Identify the largest core files
cr0x@server:~$ sudo ls -lhS /var/lib/systemd/coredump | head
-rw------- 1 root root 6.8G Dec 29 10:02 core.myservice.998.3f7a7c1d2c7d4f2d9c9a0d3a9d1d8f0a.9142.1735466561000000.zst
-rw------- 1 root root 6.7G Dec 29 10:02 core.myservice.998.3f7a7c1d2c7d4f2d9c9a0d3a9d1d8f0a.9136.1735466532000000.zst
-rw------- 1 root root 6.7G Dec 29 10:01 core.myservice.998.3f7a7c1d2c7d4f2d9c9a0d3a9d1d8f0a.9129.1735466503000000.zst
Meaning: Dumps are already compressed (note .zst), and they’re still enormous. That indicates the process memory image is big.
Decision: Keep one dump (usually the newest) for analysis. Plan to delete older duplicates once you have a backtrace.
Task 8: Extract the core dump details without copying it around
cr0x@server:~$ sudo coredumpctl info 9142
PID: 9142 (myservice)
UID: 998 (myservice)
GID: 998 (myservice)
Signal: 11 (SEGV)
Timestamp: Mon 2025-12-29 10:02:41 UTC
Command Line: /usr/bin/myservice --config /etc/myservice/config.yaml
Executable: /usr/bin/myservice
Control Group: /system.slice/myservice.service
Unit: myservice.service
Storage: /var/lib/systemd/coredump/core.myservice.998.3f7a7c1d2c7d4f2d9c9a0d3a9d1d8f0a.9142.1735466561000000.zst (present)
Meaning: You have a precise mapping from incident to artifact: unit, command line, and storage path.
Decision: Capture this info in the incident channel/ticket before deleting anything. It’s the breadcrumb trail for later debugging.
Task 9: Get the stack trace (requires debug symbols to be truly useful)
cr0x@server:~$ sudo coredumpctl gdb 9142 --debugger-arguments='-q -batch -ex "thread apply all bt"'
[New LWP 9142]
Core was generated by `/usr/bin/myservice --config /etc/myservice/config.yaml'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f2f1a2b9c2a in memcpy () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x000055b8f2b1a2f0 in parse_packet (ctx=0x55b8f3c12000) at src/net/parse.c:217
#2 0x000055b8f2b19d71 in worker_loop () at src/worker.c:88
#3 0x00007f2f1a1691f5 in start_thread () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007f2f1a1e8b00 in clone () from /lib/x86_64-linux-gnu/libc.so.6
Meaning: You have a backtrace and file/line, so you likely don’t need 20 more identical dumps.
Decision: Keep the newest core + the trace. Delete the rest to recover disk. If you can’t get symbols, keep one dump and focus on collecting symbols next.
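If the backtrace comes back as a wall of ??, you’re missing symbols. One low-friction option on Debian is gdb’s debuginfod support; this sketch assumes the host you analyze on has outbound access to Debian’s public debuginfod service and a sudo policy that lets the variable through. Installing the relevant -dbgsym packages on a dedicated analysis host works too.
cr0x@server:~$ export DEBUGINFOD_URLS="https://debuginfod.debian.net"
cr0x@server:~$ sudo --preserve-env=DEBUGINFOD_URLS coredumpctl debug 9142
gdb will offer to fetch separate debug info on demand; accept once and it caches what it downloads.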
Task 10: Check current systemd-coredump configuration
cr0x@server:~$ sudo systemd-analyze cat-config systemd/coredump.conf
# /etc/systemd/coredump.conf
[Coredump]
Storage=external
Compress=yes
ProcessSizeMax=8G
ExternalSizeMax=8G
MaxUse=16G
KeepFree=2G
Meaning: Limits exist, but they may be too generous for your /var size (or your crash rate).
Decision: Tighten MaxUse and raise KeepFree, or move storage to a dedicated filesystem.
Task 11: Verify what the kernel will do with core dumps (core_pattern)
cr0x@server:~$ cat /proc/sys/kernel/core_pattern
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %e
Meaning: Cores are piped into systemd-coredump, not written directly by the kernel to a file path.
Decision: Fix retention via systemd-coredump config (not by editing a path pattern and hoping).
Task 12: Check global core size limit (ulimit) for the current shell
cr0x@server:~$ ulimit -c
unlimited
Meaning: At least in your shell, core size is not capped. For services, the effective limit can come from systemd unit settings.
Decision: Set limits in the unit (preferred) or via PAM limits for interactive sessions, depending on the context.
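For the interactive-session side, a limits.d drop-in is the usual control point. The file name and values below are illustrative, not a fleet recommendation; note that limits.conf expresses core size in kilobytes.
cr0x@server:~$ cat /etc/security/limits.d/50-core.conf
# Illustrative caps: soft ~512 MB, hard ~1 GB (values in KB)
*    soft    core    524288
*    hard    core    1048576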
Task 13: Check core limit applied to a running service process
cr0x@server:~$ pidof myservice
9210
cr0x@server:~$ grep -E '^Max core file size' /proc/9210/limits
Max core file size unlimited unlimited bytes
Meaning: The service can generate full-size dumps.
Decision: If you want “some” cores but not huge ones, cap at a size that preserves stacks (often hundreds of MB) without storing full heaps.
Task 14: Confirm if journald is also under pressure
cr0x@server:~$ journalctl --disk-usage
Archived and active journals take up 1.4G in the file system.
Meaning: Journals are not the primary offender here, but 1.4G can still matter on tiny /var.
Decision: Don’t “vacuum” the journal as your first move if cores are 14G. Fix the actual problem first.
Task 15: Free space safely by deleting old dumps after preserving one
cr0x@server:~$ sudo coredumpctl list --no-pager | awk '/myservice/ {print $9}' | head
present
present
present
cr0x@server:~$ sudo rm -f /var/lib/systemd/coredump/core.myservice.*.9136.*.zst /var/lib/systemd/coredump/core.myservice.*.9129.*.zst
cr0x@server:~$ df -h /var
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 20G 12G 7.4G 62% /var
Meaning: You bought breathing room and didn’t wipe everything blindly.
Decision: Now implement proper limits so this doesn’t recur. Manual deletes are for today, not for next week.
A sane retention policy for Debian 13 (keep value, drop bloat)
You want enough crash artifacts to answer: “what happened?” You do not want a time machine of every crash for the last quarter. Core dumps are
high-signal, high-cost artifacts.
A good policy has four parts:
- Scope: which processes may dump cores (all? only critical services? only in staging?).
- Size caps: per-process and global caps that match your disk reality.
- Retention: keep a small window (by time and by space), rotate aggressively.
- Access & privacy: core dumps can include credentials, tokens, customer data, and decrypted secrets. Treat them as sensitive.
Here’s the opinionated version for most production fleets:
- Enable core dumps for services where you actually debug crashes (usually yes).
- Cap per-process dumps to something that preserves backtraces and thread stacks (often 256M–2G, depending on language/runtime).
- Keep at most a few gigabytes per node or per dump filesystem, and keep free-space headroom.
- Prefer storing dumps on a dedicated filesystem (or at least a dedicated directory with predictable quotas/limits).
- Ship metadata (not dumps) centrally. Move dumps off-host only when you decide they’re needed.
If you’re tempted to disable core dumps entirely, ask yourself: do you have reproducible crash telemetry, symbolized stack traces, and deterministic
repro steps? If yes, maybe. If no, you’re choosing longer incidents.
Joke #2 (the second and last): Disabling core dumps everywhere is like removing smoke alarms because they’re loud—quiet, yes; improved situation, no.
systemd-coredump: configuration that works in production
On Debian 13, systemd is your friend here, but only if you tell it what “enough” looks like. The file is typically:
/etc/systemd/coredump.conf (and drop-ins under /etc/systemd/coredump.conf.d/).
Key knobs you should care about
- Storage= Where dumps go: external files, journal, both, or none.
- Compress= Whether to compress core files (often zstd). Useful, but watch CPU during crash storms.
- ProcessSizeMax= Max size of a process dump eligible for handling.
- ExternalSizeMax= Max size of an individual external core file saved.
- MaxUse= Maximum disk space systemd-coredump may consume in total.
- KeepFree= How much free space to keep on the filesystem holding dumps.
A practical baseline for a node with a 20–40G /var is something like:
- MaxUse=2G to 6G, depending on how much you value dumps versus uptime.
- KeepFree=2G to 5G, so journald, package updates, and normal runtime don’t die.
- ExternalSizeMax=512M to 2G, so you keep stack traces without always saving multi-GB heaps.
You don’t have to guess. You can choose a value that’s smaller than “fills disk,” then increase later if you’re missing data.
Example: tighten retention immediately (and safely)
cr0x@server:~$ sudo install -d /etc/systemd/coredump.conf.d
cr0x@server:~$ cat <<'EOF' | sudo tee /etc/systemd/coredump.conf.d/99-retention.conf
[Coredump]
Storage=external
Compress=yes
ExternalSizeMax=1G
MaxUse=4G
KeepFree=4G
EOF
cr0x@server:~$ sudo systemctl daemon-reload
Meaning: You’ve set hard global ceilings. Even if a service goes feral, the node will keep breathing.
Decision: If you routinely need full heaps for memory corruption analysis, raise ExternalSizeMax but move dumps off /var.
Don’t pretend you can have 8G dumps on a 20G filesystem and stay happy.
How rotation actually happens
systemd-coredump applies its own logic for retaining/deleting cores when limits are hit. That’s good, but it’s not instantaneous like a quota.
If you’re in a crash loop, you still want to stop the unit, otherwise you’ll churn CPU and I/O generating dumps that get immediately deleted.
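A rough way to check whether retention is keeping up with your crash rate is to compare what the dumps actually use against the ceilings you configured (both commands appear in the tasks above; here they’re just paired):
cr0x@server:~$ sudo du -sh /var/lib/systemd/coredump
cr0x@server:~$ sudo systemd-analyze cat-config systemd/coredump.conf | grep -E '^(MaxUse|KeepFree|ExternalSizeMax)='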
ulimit, core_pattern, and why “it worked on my laptop” is lying to you
Core dumping has three layers that get confused constantly:
- Kernel eligibility: is core dumping enabled at all for this crash context?
- Process limits: what is the max core size the process is allowed to write?
- Core handler: where does the core go and what policy is applied after creation?
Systemd unit-level limits beat your shell
You can set ulimit -c in a shell and feel productive. The service won’t care. For systemd services, use:
LimitCORE= in the unit file or a drop-in.
Example: cap cores for one service without touching the rest of the node:
cr0x@server:~$ sudo systemctl edit myservice.service
cr0x@server:~$ cat /etc/systemd/system/myservice.service.d/override.conf
[Service]
LimitCORE=512M
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart myservice.service
Meaning: Even if the process is huge, the dump file cannot exceed 512M. Often that’s enough for a usable stack trace.
Decision: For services in a crash loop, cap immediately to stop disk growth while still preserving debugging value.
core_pattern is the switchboard
If /proc/sys/kernel/core_pattern pipes to systemd-coredump, your “where are the core files?” expectations need to change.
You’ll find them under systemd’s management, not your CWD.
If your org has custom handlers (uploaders, filters), validate them. A buggy core handler can fail open (dropping dumps) or fail closed (blocking and
stalling the crashing process exit path). Neither is fun.
Storage strategy: where core dumps should live (and where they should not)
Treat core dumps like mini-backups of memory: big, sensitive, and occasionally priceless. That suggests a few storage rules.
Rule 1: Don’t store dumps on tiny /var if you can avoid it
Many images still carve out a small /var partition because it feels “clean.” It’s also how you end up with an OS that can’t write state
during an incident. If your distro layout isolates /var, plan where dumps go.
Rule 2: A dedicated filesystem beats heroic cleanup
The operational win is isolation: if cores fill their own filesystem, the node can still log, update packages, write runtime state, and recover.
You can mount a dedicated volume at /var/lib/systemd/coredump or bind-mount it.
Example: create a mount point and confirm it’s a separate filesystem (the actual provisioning depends on your environment):
cr0x@server:~$ sudo findmnt /var/lib/systemd/coredump
TARGET SOURCE FSTYPE OPTIONS
/var/lib/systemd/coredump /dev/sdb1 ext4 rw,relatime
Meaning: Dumps are isolated from /var at large. Now your worst-case is “no more dumps,” not “node dead.”
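Provisioning that mount is environment-specific. A minimal sketch, assuming /dev/sdb1 is a volume you have already formatted for this purpose (in real fstab entries, prefer a UUID over a device name):
cr0x@server:~$ sudo mkdir -p /var/lib/systemd/coredump
cr0x@server:~$ echo '/dev/sdb1 /var/lib/systemd/coredump ext4 defaults,nosuid,nodev,noexec 0 2' | sudo tee -a /etc/fstab
cr0x@server:~$ sudo mount /var/lib/systemd/coredump
cr0x@server:~$ findmnt /var/lib/systemd/coredump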
Rule 3: Plan for privacy (core dumps are data)
Core dumps can contain:
- API tokens in memory
- Decrypted payloads
- Customer identifiers
- Encryption keys (yes, sometimes)
That means: restrict permissions, limit retention, and be deliberate about who can exfiltrate dumps for debugging.
“Root can read it” is not a policy.
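A quick check worth adding to the runbook: confirm the dump directory and files are readable only by the accounts you intend (exact modes vary; read the output rather than assuming).
cr0x@server:~$ sudo ls -ld /var/lib/systemd/coredump
cr0x@server:~$ sudo ls -l /var/lib/systemd/coredump | head -5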
Rule 4: Consider keeping metadata centrally, not dumps
In many orgs, the best compromise is:
- Keep core dumps locally for a short window.
- Keep only metadata (timestamps, executable, unit, signal, backtrace) in incident records or centralized logs.
- Only copy the dump off-host when you’ve decided it’s needed.
This avoids turning your log pipeline into a bulk data transport system. Those systems always “work,” right up until they don’t.
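When you do decide a dump needs to leave the host, coredumpctl can export both pieces; the paths below are just examples of a staging location before whatever controlled transfer you use:
cr0x@server:~$ sudo coredumpctl info 9142 > /var/tmp/myservice-9142-metadata.txt
cr0x@server:~$ sudo coredumpctl dump 9142 --output=/var/tmp/myservice-9142.core
These copies inherit none of systemd’s retention logic; delete them when you’re done.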
Three corporate mini-stories from the crash-dump mines
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company rolled out a new Debian-based image to a set of API nodes. The image had a neat partition scheme: small root, separate /var,
and a tidy data volume for application state. Everyone felt mature.
A week later, one node started flapping. Not the whole fleet—just one. It kept restarting a worker service, and after a while it stopped responding
to deployments. Then it stopped responding to logins. The node wasn’t dead; it was just incapable of writing to disk in the places that mattered.
The wrong assumption was simple: “Core dumps go to the data volume.” Someone had seen core files in an application directory years ago and assumed
that’s still the default. On this image, systemd-coredump stored dumps in /var/lib/systemd/coredump, and /var was 10–20G.
Every crash created a multi-GB compressed dump. systemd restarted the service. More dumps arrived. /var filled. Journald started dropping
messages. Package updates failed. The node couldn’t write enough state to recover cleanly.
The fix wasn’t clever: stop the service, keep one dump, delete the rest, and move core dump storage to a dedicated mount. Then tighten KeepFree
so future crash storms couldn’t starve the OS. The bigger fix was cultural: stop assuming defaults and start verifying them with commands that return facts.
Mini-story 2: The optimization that backfired
Another org wanted “better debugging,” so they enabled core dumps broadly and increased retention. They also turned on compression and raised size
limits, because “storage is cheap.”
It was fine… until a release introduced a rare crash in a high-throughput process. The crash rate wasn’t huge, but it wasn’t small either. During
peak hours, cores were generated regularly, each one several gigabytes even after compression.
The backfire came from an angle no one modeled: CPU contention during recovery. Compressing large dumps is CPU work. When crashes started happening,
the system spent a measurable slice of CPU compressing memory snapshots while also trying to shed load, restart services, and serve traffic.
Their “debuggability optimization” translated into prolonged brownouts. Not total outage, worse: a slow bleed where requests timed out and retries
amplified load. The oncall could see the cores and the incidents, but the platform felt like it was moving through syrup.
The eventual policy was more nuanced: cap per-service core size, and in some cases store only metadata plus a minimal dump. For the rare deep dives,
they’d reproduce in staging with full dumps enabled. Debugging value went up, incident pain went down.
Mini-story 3: The boring but correct practice that saved the day
A finance-adjacent company had a habit that nobody celebrated: every critical unit had a small systemd drop-in setting resource limits,
including LimitCORE. It was part of the “golden unit template,” like setting timeouts and restart policies.
A third-party library update triggered a segfault in a background service on a subset of nodes. The service crashed, systemd restarted it, and yes,
core dumps were generated. But they were capped to a few hundred megabytes.
The nodes stayed healthy. Logs kept flowing. Deployments continued. The incident was annoying but controlled. Engineers pulled one core dump,
got a backtrace, correlated the failing code path, and rolled back the library on schedule.
Nobody got a hero moment. That’s the point. The boring limit meant the team could debug without sacrificing availability. It also meant the security
team didn’t panic about massive sensitive memory snapshots accumulating on disks.
Common mistakes: symptom → root cause → fix
These are the repeat offenders. Each one includes a specific fix, not a motivational poster.
Mistake 1: “Disk is full, delete /var/log”
Symptom: Disk usage climbs fast; logs get blamed first; you delete journals and still have a problem tomorrow.
Root cause: Core dumps in /var/lib/systemd/coredump are the actual bulk data.
Fix: Measure with du, then cap and rotate core dumps. Delete duplicate dumps only after capturing at least one backtrace.
Mistake 2: “Disable core dumps globally”
Symptom: Incidents become harder to debug; “it crashed” tickets linger; blame shifts to “infra” because no evidence remains.
Root cause: A blunt response to disk pressure or privacy concerns.
Fix: Keep cores enabled but limited. Use per-service LimitCORE, and system-wide MaxUse/KeepFree.
Treat dumps as sensitive and shorten retention.
Mistake 3: “We set MaxUse, so we’re safe”
Symptom: Disk still hits 100% during crash storms; nodes still go unstable.
Root cause: Limits don’t act like instantaneous quotas; crash loop keeps generating new data, and you’re also burning CPU.
Fix: Stop the offending unit first. Then configure retention. Limits are not a substitute for containment.
Mistake 4: “Cores are small because they’re compressed”
Symptom: You see .zst and assume you’re fine, until you’re very not fine.
Root cause: Compression ratio varies wildly; large heaps and mapped regions still yield multi-GB dumps.
Fix: Cap size. Verify actual file sizes with ls -lhS. Don’t reason from file extensions.
Mistake 5: “Our service can’t be crash-looping; we’d notice”
Symptom: Suddenly you have dozens of cores; service seems “mostly fine” because a load balancer hides one bad node.
Root cause: Partial failure masked by redundancy; one node is dumping cores quietly.
Fix: Alert on core dump creation rate and on service restart loops (Result: core-dump). Use coredumpctl list
during triage.
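A concrete starting point for the dump-rate alert is a tiny script that publishes a textfile metric; the collector path is an assumption, so adjust it to however your node_exporter (or equivalent) is set up.
cr0x@server:~$ cat /usr/local/bin/coredump-metric.sh
#!/bin/sh
# Count core dumps written in the last hour and publish them as a Prometheus
# textfile metric (assumes node_exporter's textfile collector directory below).
COUNT=$(find /var/lib/systemd/coredump -type f -mmin -60 2>/dev/null | wc -l)
printf 'node_coredumps_written_last_hour %s\n' "$COUNT" \
  > /var/lib/node_exporter/textfile_collector/coredumps.prom
Run it from cron or a systemd timer every few minutes and alert when the value stops being zero.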
Mistake 6: “We moved cores off /var, done”
Symptom: The core partition fills, and now you lose debugging artifacts—still not ideal.
Root cause: Isolation without retention is just moving the fire.
Fix: Isolate and set MaxUse/KeepFree/ExternalSizeMax. Add monitoring.
Checklists / step-by-step plan
Checklist A: During an active disk-full incident
- Confirm what is full: df -hT.
- Identify top growth directory: sudo du -xhd1 /var | sort -h.
- Confirm core dumps are the bulk: sudo du -sh /var/lib/systemd/coredump.
- Identify offender: sudo coredumpctl list.
- Stop the crash loop: systemctl status ..., then systemctl stop ... or isolate the node.
- Preserve one representative dump and metadata: coredumpctl info PID.
- Get a backtrace if possible: coredumpctl gdb PID.
- Delete duplicates to recover disk (surgically).
- Set immediate retention limits (coredump.conf.d).
- Only then restart the service under controlled conditions.
Checklist B: Hardening after the incident (the part that prevents repeats)
- Create or confirm a dedicated core dump filesystem, or at least enough headroom on /var.
- Set MaxUse and KeepFree based on your smallest nodes, not your biggest.
- Set ExternalSizeMax and/or per-service LimitCORE.
- Define who can access dumps and how long they persist.
- Ensure debug symbols are available where you analyze dumps (without dragging full toolchains onto every node).
- Add alerting on:
- filesystem free space for the dump volume
- core dump creation rate
- systemd unit restart loops with core-dump results
- Document the “keep one dump, purge duplicates” rule in your runbook.
FAQ
1) Should I disable core dumps in production?
Usually no. Disable selectively if you have a strong reason (sensitive workloads with strict controls, or you already have equivalent crash telemetry).
Prefer size caps and retention limits first.
2) Where does Debian 13 store core dumps by default?
Commonly under /var/lib/systemd/coredump/ when systemd-coredump is in use. Confirm with cat /proc/sys/kernel/core_pattern
and coredumpctl info.
3) Why are my core files already compressed but still huge?
Because compression doesn’t change the fact you’re capturing a large memory image. Big heaps, mapped files, and certain memory patterns compress poorly.
Cap ExternalSizeMax and/or LimitCORE.
4) What’s the quickest way to identify what is crashing?
sudo coredumpctl list shows the executable and whether a core file is present. Pair it with systemctl status to spot restart loops.
5) Can I keep only metadata and not the full dump?
Yes. You can configure storage behavior so that only metadata is kept, but be careful: when you need a dump, you won’t have it. A common compromise
is keeping small capped dumps plus metadata.
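A minimal drop-in for the metadata-only approach would look roughly like this; crash metadata still lands in the journal, but no core file is kept on disk:
cr0x@server:~$ cat /etc/systemd/coredump.conf.d/50-metadata-only.conf
[Coredump]
Storage=none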
6) How do I cap core size for one service without changing the whole node?
Add a systemd drop-in with LimitCORE=... for that unit. That’s the cleanest operational control point for services.
7) Why does disk still spike even after setting MaxUse in coredump.conf?
Because you can still generate cores rapidly during a crash loop, and enforcement isn’t a perfect instantaneous quota. Stop the unit first, then tune
retention. Limits reduce steady-state bloat; containment fixes crash storms.
8) What about privacy and secrets in core dumps?
Assume cores contain secrets. Restrict access, shorten retention, and avoid shipping dumps broadly. Treat them like sensitive incident artifacts, not
like log files.
9) Do I need debug symbols on every production node?
Not necessarily. You need a reliable way to symbolize stacks somewhere. Many teams keep symbols in a dedicated debug environment and only extract
the necessary core dump when needed (with proper access controls).
10) Is it safe to delete core dumps?
Yes, after you’ve preserved at least one representative dump and captured metadata/backtraces. The risk is not system stability; the risk is losing
forensic value. Be intentional: keep one, delete duplicates.
Conclusion: next steps that won’t bite you later
Core dumps are debugging gold and operational kryptonite. Debian 13 gives you the tools to keep the gold and ditch the kryptonite, but it won’t do
it automatically. You have to choose limits that match your disks and your incident tolerance.
Do this next, in this order:
- On one node, verify where dumps go and how big they get: coredumpctl list, du -sh /var/lib/systemd/coredump.
- Set global retention: MaxUse and KeepFree in a coredump drop-in.
- Set per-service caps for the usual suspects (big heaps, crash-prone components): LimitCORE.
- Isolate storage if possible: dedicate a filesystem or mount for dumps.
- Make it observable: alert on dump rate and on restart loops with core dumps.
The goal is not “never generate a core dump.” The goal is: when something crashes at 03:00, you get one useful artifact and a node that still boots.