If you run production Linux long enough, you’ll meet the kind of outage where nothing logs, nothing responds,
and the only useful artifact is the one you didn’t configure: a crash dump. The machine doesn’t “go down”
so much as it evaporates. Your incident report becomes interpretive dance.
Kdump is the antidote. Not a guarantee of truth, but a fighting chance: a vmcore you can post-mortem
when the kernel stops being polite. On Debian 13, kdump is straightforward—until it isn’t. This is the practical,
verify-or-it-didn’t-happen guide.
What you’re building: kexec + a second kernel + a dump path
Kdump works by keeping a “crash kernel” in reserve. When the running kernel panics (or you force a crash for
testing), the system doesn’t do a cold reboot. Instead, it uses kexec to jump directly into the
crash kernel, which runs a tiny userspace from an initramfs and writes memory to disk or network as a vmcore.
The critical requirement: the crash kernel needs reserved RAM that the crashed kernel never touches.
That’s why you set crashkernel= on the command line. If you skip that, you’ll get a kdump service that
“starts” and a crash that produces nothing useful. It’s like installing sprinklers without water pressure.
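In shell terms, the whole design reduces to two checks you can run at any time; the rest of this guide earns them, but here is the short version (a sketch using standard kernel interfaces; output varies by host):
cr0x@server:~$ grep -o 'crashkernel=[^ ]*' /proc/cmdline    # condition 1: memory was reserved at boot
cr0x@server:~$ cat /sys/kernel/kexec_crash_loaded           # condition 2: prints 1 when a crash kernel is staged
If either check comes up empty, a real crash will produce nothing.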
Two quick clarifications that save hours:
- Kdump doesn’t fix crashes. It turns a crash into an artifact you can reason about.
- Kdump is only as good as its dump path. If your dump target depends on the thing that’s broken, you’ll learn humility.
Facts and context that actually matter
- Kdump is old by Linux standards. The kexec-based crash dumping approach matured in the mid-2000s and became a staple in enterprise distros.
- “Second kernel” is literal. Kdump isn’t a debugging feature inside the same kernel; it boots a separate kernel into reserved memory.
- Early kdump was mostly about hardware fault triage. It helped vendors distinguish flaky RAM/PCIe from kernel bugs without guessing.
- Crash dumps drove real kernel reliability work. Once ops teams could hand developers a vmcore, “cannot reproduce” became less fashionable.
- Compression became essential as RAM grew. A 512 MB dump was annoying in 2006; a 512 GB dump is a career-limiting event.
- NUMA and huge systems changed sizing rules. Reserving 128 MB used to be fine; on modern hardware, the crash kernel may need more to load drivers and write dumps.
- Secure Boot complicates kexec in some environments. Depending on policy, kexec may be restricted; you need to validate in your platform’s trust model.
- Initramfs is the real battlefield. Most kdump failures aren’t “kdump,” they’re missing storage/network drivers in the crash initramfs.
- Systemd made kdump service state visible. You can now quickly see whether the crash kernel is loaded, rather than relying on folklore.
Design decisions: where to dump, how to boot, what to reserve
Pick a dump target that survives your likely failures
You typically have three sane dump destinations:
- Local filesystem (fast, simple): good if disks stay up during the crash and the filesystem is mountable in the crash kernel.
- Dedicated local partition or raw device (boring, robust): less dependency on your normal userspace; better odds in messy storage stacks.
- Network (usually NFS) (best for “disk stack might be toast”): avoids local storage complexity, but you must make the crash kernel bring up NIC + routes.
If you run complex storage—LUKS on LVM on MD RAID on multipath—your dump path will be the thing that fails.
In that case, a small dedicated unencrypted partition, or an NFS dump target, is the adult choice.
Yes, that means you’re planning for your own failure. Welcome to SRE.
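If you choose the dedicated-partition route, keep it deliberately boring: plain ext4, no LVM, no encryption. A minimal sketch, assuming a spare partition at /dev/sdb1 (hypothetical device; substitute your own):
cr0x@server:~$ sudo mkfs.ext4 -L crashdump /dev/sdb1
cr0x@server:~$ sudo mkdir -p /var/crash
cr0x@server:~$ echo 'LABEL=crashdump /var/crash ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
cr0x@server:~$ sudo mount /var/crash
The simpler the stack under the dump directory, the fewer drivers the crash initramfs has to get right.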
Crashkernel sizing: prefer “works” over “minimal”
On modern Debian systems, reserving too little memory causes quiet, dumb failures: the crash kernel loads,
then can’t initialize drivers or allocate buffers to write the dump. Don’t be clever.
A rough, practical rule that usually works:
- Small VMs: 256M can be enough.
- General servers with common drivers: 512M is a safe baseline.
- Big boxes, lots of storage/network modules, or heavy NUMA: 768M–1024M.
If you can’t spare 512M of RAM in 2025, your real problem is budgeting, not kdump.
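If one config template has to cover mixed hardware, the kernel also accepts a range-based form of the parameter, so small VMs and big boxes can share a single line. The values below are illustrative, not a sizing recommendation:
crashkernel=1G-4G:256M,4G-64G:512M,64G-:1G
Read it as: hosts with 1–4G of RAM reserve 256M, 4–64G reserve 512M, above 64G reserve 1G. Whatever form you use, verify the reservation in dmesg after reboot.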
One quote you can hang above the dashboard
Hope is not a strategy.
— common operations maxim (paraphrased idea, often attributed in reliability circles)
Install and enable kdump on Debian 13
Debian’s packaging may evolve between point releases, but the operational pattern stays consistent:
install kdump tooling, reserve crashkernel memory, regenerate boot config/initramfs, ensure the crash kernel is loaded.
Task 1 — Confirm kernel, initramfs tooling, and bootloader
cr0x@server:~$ uname -a
Linux server 6.12.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.3-1 (2025-09-10) x86_64 GNU/Linux
What it means: You’re running a Debian kernel that supports kexec/kdump. If you’re on a custom kernel,
validate it’s built with crash dump support. Decision: proceed unless you know you’ve stripped kexec/crash options.
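For a custom or vendor kernel, a quick way to check the relevant build options (a sketch; it assumes Debian's convention of shipping the config in /boot, and option names can shift between kernel series):
cr0x@server:~$ grep -E '^CONFIG_(KEXEC|KEXEC_FILE|CRASH_DUMP|PROC_VMCORE|RELOCATABLE)=' /boot/config-$(uname -r)
You want these enabled (=y); a kernel without crash dump support ends this guide early.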
cr0x@server:~$ ls -l /etc/kernel/cmdline /etc/default/grub 2>/dev/null || true
-rw-r--r-- 1 root root 112 Oct 12 10:03 /etc/default/grub
What it means: This host likely uses GRUB config in /etc/default/grub. Some Debian setups use
/etc/kernel/cmdline with kernel-install flows. Decision: use the file your system actually consumes.
Task 2 — Install kdump components
cr0x@server:~$ sudo apt-get update
Hit:1 http://deb.debian.org/debian trixie InRelease
Reading package lists... Done
cr0x@server:~$ sudo apt-get install -y kdump-tools kexec-tools linux-image-amd64 makedumpfile
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed:
kdump-tools kexec-tools makedumpfile
0 upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Setting up kexec-tools (1:2.0.28-1) ...
Setting up kdump-tools (1:1.8.0-2) ...
Setting up makedumpfile (1:1.7.6-2) ...
What it means: You now have the service wrappers and dump tooling. Decision: next step is crashkernel
reservation; without it, kdump will mostly pretend.
Task 3 — Check whether kdump service is enabled and what it thinks
cr0x@server:~$ systemctl status kdump-tools --no-pager
● kdump-tools.service - Kernel crash dump capture service
Loaded: loaded (/lib/systemd/system/kdump-tools.service; enabled; preset: enabled)
Active: active (exited) since Mon 2025-12-30 09:12:01 UTC; 3min ago
Docs: man:kdump-config(8)
What it means: “active (exited)” is normal; the service loads the crash kernel and then gets out of the way.
Decision: don’t celebrate; you still need to confirm the crash kernel is actually loaded.
Joke #1: Kdump is like backups—everyone “has it” until the day they need it and discover they were actually collecting vibes.
Reserve crashkernel memory (and prove it’s reserved)
Task 4 — Inspect current kernel command line
cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.12.0-1-amd64 root=UUID=0a1b2c3d-4e5f-6789-a012-3456789abcde ro quiet
What it means: No crashkernel= parameter. Right now, kdump likely cannot reserve memory.
Decision: add crashkernel=512M (or more if you have a complicated driver set).
Task 5 — Set crashkernel in GRUB and regenerate config
cr0x@server:~$ sudo sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT=".*\)"/\1 crashkernel=512M"/' /etc/default/grub
cr0x@server:~$ grep -n '^GRUB_CMDLINE_LINUX_DEFAULT' /etc/default/grub
6:GRUB_CMDLINE_LINUX_DEFAULT="quiet crashkernel=512M"
What it means: Crashkernel is now configured for future boots. Decision: update GRUB and reboot at a controlled time.
cr0x@server:~$ sudo update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.12.0-1-amd64
Found initrd image: /boot/initrd.img-6.12.0-1-amd64
done
Task 6 — Reboot and verify reserved memory
cr0x@server:~$ sudo reboot
Connection to server closed by remote host.
cr0x@server:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.12.0-1-amd64 root=UUID=0a1b2c3d-4e5f-6789-a012-3456789abcde ro quiet crashkernel=512M
cr0x@server:~$ dmesg | grep -i crashkernel | head -n 5
[ 0.000000] Reserving 512MB of memory at 0x000000005f000000 for crashkernel (System RAM: 32768MB)
[ 0.000000] crashkernel reserved: 0x000000005f000000 - 0x000000007f000000 (512 MB)
What it means: The reservation is real. If you don’t see lines like this, you don’t have kdump—just feelings.
Decision: if the reservation failed, increase size or fix conflicting kernel params (common on tiny VMs or weird firmware maps).
Task 7 — Confirm kexec crash kernel is loaded
cr0x@server:~$ sudo kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
KDUMP_KERNELVER: 6.12.0-1-amd64
KEXEC: /sbin/kexec
crashkernel addr: 0x000000005f000000
crashkernel size: 0x20000000
What it means: The tools see the reserved region. Decision: now validate the dump path and the crash initramfs content.
cr0x@server:~$ cat /sys/kernel/kexec_crash_loaded
0
What it means: The kernel reports nothing staged in the reserved region yet. Some setups load the crash kernel only on service start or after configuration changes.
Decision: restart kdump-tools and recheck.
cr0x@server:~$ sudo systemctl restart kdump-tools
cr0x@server:~$ cat /sys/kernel/kexec_crash_loaded
1
What it means: Great. Your crash kernel is staged. Decision: move on to dump target configuration and validation.
Configure dump target: local disk vs NFS (and why you should care)
Local disk to /var/crash: simplest when storage is sane
Debian defaults are often “dump under /var/crash.” That’s fine if:
your root filesystem is mountable from the crash initramfs and your storage stack is simple.
For single-disk ext4 on a VM, it’s usually perfect.
Task 8 — Validate you have space for dumps and the filesystem can take it
cr0x@server:~$ df -h /var/crash
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 80G 22G 54G 29% /
What it means: You’ve got space. A full dump is roughly RAM size; makedumpfile can compress/filter.
Decision: if free space is tight, dump to a dedicated volume or NFS, and configure filtering/compression.
Task 9 — Review kdump-tools config (and set compression/filtering)
cr0x@server:~$ sudo sed -n '1,200p' /etc/default/kdump-tools
# kdump-tools configuration
USE_KDUMP=1
KDUMP_COREDIR="/var/crash"
MAKEDUMPFILE_ARGS="-l --message-level 1 -d 31"
KDUMP_KERNELVER=""
What it means: MAKEDUMPFILE_ARGS controls how much you trim. -d 31 is a common “drop lots of pages” mask.
Decision: keep filtering on unless you have a specific debugging need for full dumps, and you can store them reliably.
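For context on what -d 31 drops: the number is a bitmask of page types makedumpfile excludes, per its man page. A hedged sketch of the bits, plus a more conservative alternative using the same variable name as above:
# dump-level bits (sum the page types you want to exclude):
#   1  = pages filled with zero
#   2  = non-private cache pages
#   4  = private cache pages
#   8  = user-process data pages
#   16 = free pages
# -d 31 excludes all of them; -d 17 drops only zero and free pages, keeping cache and user data.
MAKEDUMPFILE_ARGS="-l --message-level 1 -d 17"
Bigger dumps, but more of the state you may actually want to inspect. Pick the trade-off consciously.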
NFS dump target: fewer storage drivers, more network plumbing
NFS is my default when the local storage stack is complex (LUKS-on-everything, exotic RAID controllers, iSCSI boot).
It shifts the dependency from “can we mount the root FS in a broken world” to “can we bring up networking in the crash kernel.”
That’s usually easier to make deterministic, especially on a fixed VLAN.
Task 10 — Configure an NFS dump path (example)
This is an example style of configuration. Your exact file keys may differ by Debian packaging revision, but the operational
requirement is stable: the crash initramfs must know where to write and how to get there.
cr0x@server:~$ sudo grep -R "KDUMP_" -n /etc/default/kdump-tools
2:USE_KDUMP=1
3:KDUMP_COREDIR="/var/crash"
For NFS you typically point the dump directory at an NFS path that the crash environment can reach.
Depending on the kdump-tools packaging revision, that may mean a dedicated NFS setting in /etc/default/kdump-tools or the kdump initramfs hooks mounting the export during a crash; check the comments in the shipped config file.
Either way, you must test it.
Example: create a mountpoint and a normal fstab entry for day-to-day visibility, then ensure the crash initramfs also includes what it needs.
(Do not assume the crash initramfs uses your normal mounts. It’s a tiny separate world.)
cr0x@server:~$ sudo mkdir -p /mnt/kdump-nfs
cr0x@server:~$ echo '10.10.20.50:/exports/kdump /mnt/kdump-nfs nfs4 _netdev,nofail,ro 0 0' | sudo tee -a /etc/fstab
10.10.20.50:/exports/kdump /mnt/kdump-nfs nfs4 _netdev,nofail,ro 0 0
cr0x@server:~$ sudo mount -a
cr0x@server:~$ mount | grep kdump-nfs
10.10.20.50:/exports/kdump on /mnt/kdump-nfs type nfs4 (ro,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.10.20.21,local_lock=none,addr=10.10.20.50)
What it means: Normal userspace can reach the NFS export. Decision: now you must make sure the crash initramfs can do the same
(drivers, IP config, route, NFS client pieces).
cr0x@server:~$ sudo sed -i 's|^KDUMP_COREDIR=.*|KDUMP_COREDIR="/mnt/kdump-nfs"|g' /etc/default/kdump-tools
cr0x@server:~$ grep -n '^KDUMP_COREDIR' /etc/default/kdump-tools
3:KDUMP_COREDIR="/mnt/kdump-nfs"
What it means: You’re directing dumps to the NFS mountpoint. Decision: rebuild the kdump initramfs and validate the crash kernel still loads.
Verify the kdump kernel and initramfs before you crash anything
The fastest way to look foolish is to “test kdump” during business hours without verifying the crash environment can write a dump.
Treat it like a migration: dry-run what you can, prove the pieces exist, then do the disruptive test with a rollback plan.
Task 11 — Rebuild initramfs and restart kdump-tools
cr0x@server:~$ sudo update-initramfs -u -k all
update-initramfs: Generating /boot/initrd.img-6.12.0-1-amd64
cr0x@server:~$ sudo systemctl restart kdump-tools
cr0x@server:~$ cat /sys/kernel/kexec_crash_loaded
1
What it means: After config changes, the crash kernel is still staged. Decision: validate what initramfs contains,
especially storage and NIC modules.
Task 12 — Inspect the kdump initramfs for critical modules (storage and network)
cr0x@server:~$ lsinitramfs /boot/initrd.img-6.12.0-1-amd64 | grep -E '(^lib/modules/.*/kernel/drivers/(net|block)|makedumpfile|nfs|rpc)' | head -n 20
lib/modules/6.12.0-1-amd64/kernel/drivers/block/virtio_blk.ko
lib/modules/6.12.0-1-amd64/kernel/drivers/net/virtio_net.ko
lib/modules/6.12.0-1-amd64/kernel/fs/nfs/nfs.ko
usr/sbin/makedumpfile
What it means: Your initramfs includes at least some required components. Missing drivers here is the #1 reason
dumps never show up. Decision: if you rely on ixgbe, bnx2, megaraid_sas, nvme,
dm_crypt, etc., confirm they’re present.
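One way to spot gaps is to diff the modules loaded on the live system against what the initramfs carries. A rough sketch (the dash/underscore normalization is imperfect, so treat the result as a hint, not a verdict):
cr0x@server:~$ lsmod | awk 'NR>1 {print $1}' | tr '-' '_' | sort -u > /tmp/live-modules
cr0x@server:~$ lsinitramfs /boot/initrd.img-$(uname -r) | grep -E '\.ko(\.xz|\.zst|\.gz)?$' | xargs -n1 basename | sed -E 's/\.ko(\.xz|\.zst|\.gz)?$//' | tr '-' '_' | sort -u > /tmp/initrd-modules
cr0x@server:~$ comm -23 /tmp/live-modules /tmp/initrd-modules
Anything in that output that your boot disk or dump path depends on needs to be added before you trust a crash to produce a vmcore.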
Task 13 — Ensure sysctls are set to actually crash when something goes wrong
cr0x@server:~$ sysctl kernel.panic kernel.panic_on_oops
kernel.panic = 0
kernel.panic_on_oops = 1
What it means: The system will panic on oops (often desirable for capturing dumps) but won’t auto-reboot after panic.
Decision: in production, set kernel.panic=10 (or similar) so the host still reboots on its own if the kdump path ever fails, instead of sitting dead at a panic.
If you set it to 0 forever, you may end up with a dead host that needs manual intervention.
cr0x@server:~$ echo 'kernel.panic = 10' | sudo tee /etc/sysctl.d/99-kdump-panic.conf
kernel.panic = 10
cr0x@server:~$ sudo sysctl --system | tail -n 5
* Applying /etc/sysctl.d/99-kdump-panic.conf ...
kernel.panic = 10
Task 14 — Confirm your dump directory is writable and stable
cr0x@server:~$ sudo test -d /mnt/kdump-nfs && sudo test -w /mnt/kdump-nfs && echo "ok: writable" || echo "not writable"
not writable
What it means: In this example, the NFS mount was read-only (ro in fstab). That’s a self-own.
Decision: fix mount options; your dump target must be writable, obviously.
cr0x@server:~$ sudo sed -i 's/_netdev,nofail,ro/_netdev,nofail,rw/' /etc/fstab
cr0x@server:~$ sudo mount -o remount,rw /mnt/kdump-nfs
cr0x@server:~$ sudo test -w /mnt/kdump-nfs && echo "ok: writable" || echo "not writable"
ok: writable
What it means: Your target is writable in normal userspace. Decision: still not enough—you need the crash kernel to do it too,
but this removes one easy failure mode.
Task 15 — Dry-run write permission and naming behavior
cr0x@server:~$ sudo sh -c 'd=/mnt/kdump-nfs/TEST-$(hostname)-$(date +%s); mkdir "$d" && echo hello > "$d"/marker.txt && ls -l "$d"'
total 4
-rw-r--r-- 1 root root 6 Dec 30 09:24 marker.txt
What it means: The directory structure works and the server accepts writes. Decision: proceed to controlled crash testing
only when you have console access (IPMI/iDRAC/virtual console) and a maintenance window.
Trigger a controlled test crash and confirm vmcore capture
A kdump test crash is disruptive by definition. Do it like you mean it:
maintenance window, console access, ticket opened, someone watching the dump target, and a rollback plan
(at minimum: remove crashkernel= if it breaks boot, though that’s uncommon).
Joke #2: Forcing a kernel crash in production is like testing the fire alarm by setting the kitchen on fire—effective, but you should have an extinguisher.
Task 16 — Ensure SysRq is enabled (required for a safe-ish trigger)
cr0x@server:~$ sysctl kernel.sysrq
kernel.sysrq = 176
What it means: SysRq is not fully disabled; 176 is a bitmask enabling a subset of functions. Root writes to /proc/sysrq-trigger are generally honored regardless of the mask (the mask mainly gates keyboard-triggered SysRq), but verify behavior on your kernel and against any hardening policy.
Decision: if it’s 0, enable it temporarily for the test if policy allows, then decide your long-term stance.
cr0x@server:~$ echo 'kernel.sysrq = 176' | sudo tee /etc/sysctl.d/99-sysrq.conf
kernel.sysrq = 176
cr0x@server:~$ sudo sysctl -p /etc/sysctl.d/99-sysrq.conf
kernel.sysrq = 176
Task 17 — Start a live tail on the dump target (on the NFS server or on the host)
cr0x@server:~$ sudo ls -lah /mnt/kdump-nfs | tail -n 5
drwxr-xr-x 3 root root 4.0K Dec 30 09:24 .
drwxr-xr-x 4 root root 4.0K Dec 30 09:21 ..
drwxr-xr-x 2 root root 4.0K Dec 30 09:24 TEST-server-1735550640
What it means: You can see new directories appear. Decision: keep this view open during the test; it’s your ground truth.
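If you'd rather not re-run ls by hand, a simple poll does the job; run it on the NFS server or another client of the export, since the host under test is about to reboot (the 5-second interval is arbitrary):
cr0x@server:~$ watch -n 5 'ls -lt /mnt/kdump-nfs | head -n 10'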
Task 18 — Trigger the crash via SysRq
This will immediately crash the kernel. Do not do this over an SSH session you aren’t ready to lose.
Use console access if you can.
cr0x@server:~$ echo c | sudo tee /proc/sysrq-trigger
c
What you should see next: the host panics, then kexec boots into the crash kernel, then writes a dump, then reboots.
Timing depends on RAM size and dump target speed. If you reserved 512M but have 256G RAM and no filtering, bring snacks.
Task 19 — After reboot, confirm a dump was written
cr0x@server:~$ sudo ls -lt /mnt/kdump-nfs | head
total 12
drwxr-xr-x 2 root root 4096 Dec 30 09:29 server-2025-12-30-09:29
drwxr-xr-x 2 root root 4096 Dec 30 09:24 TEST-server-1735550640
cr0x@server:~$ sudo ls -lah /mnt/kdump-nfs/server-2025-12-30-09:29
total 1.9G
-rw------- 1 root root 1.9G Dec 30 09:29 vmcore
-rw-r--r-- 1 root root 3.2K Dec 30 09:29 dmesg.txt
-rw-r--r-- 1 root root 178 Dec 30 09:29 kdump-info.txt
What it means: That’s the win condition: a vmcore plus some metadata. Decision: if vmcore is missing,
go straight to the Fast diagnosis playbook and the Common mistakes section below.
Task 20 — Confirm logs show kdump ran (journal from the boot after crash)
cr0x@server:~$ sudo journalctl -b -1 -u kdump-tools --no-pager | tail -n 60
Dec 30 09:28:12 server systemd[1]: Starting kdump-tools.service - Kernel crash dump capture service...
Dec 30 09:28:12 server kdump-tools[642]: kdump-tools: Loading crash kernel: succeeded
Dec 30 09:28:12 server systemd[1]: Finished kdump-tools.service - Kernel crash dump capture service.
What it means: The previous boot’s logs show crash kernel loading. This doesn’t prove the dump was written (your filesystem listing does),
but it proves the staging path is working. Decision: if this shows “failed to load crash kernel,” fix that before any more testing.
Sanity-check the dump and do basic analysis
You don’t need to be a kernel engineer to validate that a vmcore is real and internally consistent.
The goal is operational: confirm you captured memory and can extract a backtrace when needed.
Task 21 — Identify the dump format and confirm it’s not empty garbage
cr0x@server:~$ sudo file /mnt/kdump-nfs/server-2025-12-30-09:29/vmcore
/mnt/kdump-nfs/server-2025-12-30-09:29/vmcore: ELF 64-bit LSB core file, x86-64, version 1 (SYSV)
What it means: It’s an ELF core file, as expected. Decision: proceed to analysis tooling if you need to debug; otherwise archive it.
Task 22 — Confirm the kernel version for symbol matching
cr0x@server:~$ sudo cat /mnt/kdump-nfs/server-2025-12-30-09:29/kdump-info.txt
Kdump: 1.8.0
Kernel: 6.12.0-1-amd64
Dump saved to: /mnt/kdump-nfs/server-2025-12-30-09:29
What it means: You need matching symbols (debug packages) for deep analysis. Decision: in serious environments, mirror debug packages
internally so you can analyze weeks later, after upgrades.
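On Debian the kernel debug symbols ship as a separate, large package. A sketch for the running kernel (package naming and the exact install path vary by release, so verify against your mirror before baking this into automation):
cr0x@server:~$ sudo apt-get install -y linux-image-$(uname -r)-dbg
cr0x@server:~$ find /usr/lib/debug -maxdepth 4 -name 'vmlinux*'
Wherever that vmlinux lands is the path you hand to crash alongside the vmcore.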
Task 23 — Quick smoke test with crash (optional but recommended)
cr0x@server:~$ sudo apt-get install -y crash
Reading package lists... Done
Building dependency tree... Done
The following NEW packages will be installed:
crash
Setting up crash (8.0.5-1) ...
cr0x@server:~$ sudo crash /usr/lib/debug/boot/vmlinux-6.12.0-1-amd64 /mnt/kdump-nfs/server-2025-12-30-09:29/vmcore
crash 8.0.5
GNU gdb (Debian 14.2-1) 14.2
...
KERNEL: /usr/lib/debug/boot/vmlinux-6.12.0-1-amd64
DUMPFILE: /mnt/kdump-nfs/server-2025-12-30-09:29/vmcore
CPUS: 8
DATE: Mon Dec 30 09:29:01 2025
UPTIME: 00:41:22
LOAD AVERAGE: 0.02, 0.05, 0.01
TASKS: 312
NODENAME: server
RELEASE: 6.12.0-1-amd64
VERSION: #1 SMP PREEMPT_DYNAMIC Debian 6.12.3-1 (2025-09-10)
MACHINE: x86_64 (3191 Mhz)
crash>
What it means: You can open the dump. That’s the operational goal. Decision: if this fails with “cannot find vmlinux,”
build a process to keep debug symbols for deployed kernels (or at least for the ones you care about).
cr0x@server:~$ printf "bt\nsys\nquit\n" | sudo crash /usr/lib/debug/boot/vmlinux-6.12.0-1-amd64 /mnt/kdump-nfs/server-2025-12-30-09:29/vmcore | tail -n 20
PID: 0 TASK: ffffffff9a000000 CPU: 3 COMMAND: "swapper/3"
#0 [ffffb0a5000f3d20] machine_kexec at ffffffff8a2a1a5a
#1 [ffffb0a5000f3d80] __crash_kexec at ffffffff8a34b8e1
#2 [ffffb0a5000f3e50] panic at ffffffff8a1a7d9f
#3 [ffffb0a5000f3f20] sysrq_handle_crash at ffffffff8a8c2f10
#4 [ffffb0a5000f3f60] __handle_sysrq at ffffffff8a8c2b3a
#5 [ffffb0a5000f3fb0] write_sysrq_trigger at ffffffff8a8c42d0
What it means: The backtrace clearly shows a SysRq-triggered crash, exactly what we did. That’s a clean test.
Decision: you’ve verified end-to-end kdump capture and basic post-mortem viability.
Fast diagnosis playbook
When kdump “doesn’t work,” don’t thrash. Diagnose in the order that collapses the search space fastest.
This is the sequence I use under pager pressure.
1) Did the kernel reserve crashkernel memory?
- Check /proc/cmdline for crashkernel=.
- Check dmesg | grep -i crashkernel for reservation lines.
If no reservation: no dump. Fix bootloader parameters and reboot. Everything else is a distraction.
2) Is a crash kernel actually loaded?
- Check /sys/kernel/kexec_crash_loaded; a value of 1 means a crash kernel is loaded.
- Check systemctl status kdump-tools and journalctl -u kdump-tools.
If not loaded: fix kdump-tools config, initramfs generation, or missing kernel images.
3) Can the crash environment access the dump target?
- Local disk: ensure required block drivers and filesystems are in initramfs; verify mount happens.
- NFS: ensure NIC driver + IP config + route + NFS client bits are in initramfs.
If the target is unreachable: adjust initramfs hooks, add the missing modules (see the sketch below), simplify the dump path.
Don’t “optimize” yet. Make it work first.
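On Debian's initramfs-tools, the blunt but reliable way to force modules into the generated initramfs is to list them explicitly and rebuild. A sketch; the module names are examples, use whatever your dump path actually needs:
cr0x@server:~$ printf '%s\n' nvme megaraid_sas ixgbe nfs nfsv4 | sudo tee -a /etc/initramfs-tools/modules
cr0x@server:~$ sudo update-initramfs -u -k all
cr0x@server:~$ sudo systemctl restart kdump-tools
Then confirm the modules actually landed with lsinitramfs, and that the crash kernel is still loaded.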
4) Is makedumpfile filtering too aggressive?
- If you get a tiny vmcore that can't be opened, inspect MAKEDUMPFILE_ARGS.
- Temporarily reduce filtering for a test crash if needed.
5) If everything looks right but dumps still vanish
- Consider: secure boot / lockdown restrictions on kexec, firmware oddities, IOMMU quirks, or storage resets.
- Try a simpler dump target (dedicated unencrypted ext4 partition) as a control experiment.
Common mistakes: symptoms → root cause → fix
No vmcore after crash, and nothing obvious in logs
Symptoms: You force a crash; host reboots; dump directory stays empty.
Root cause: No reserved crashkernel memory, or reservation failed.
Fix: Add crashkernel=512M (or larger), reboot, confirm with dmesg | grep -i crashkernel.
Kdump service “active,” but /sys/kernel/kexec_crash_loaded reads 0
Symptoms: systemd shows kdump-tools succeeded; the kernel reports no crash kernel loaded.
Root cause: kdump-tools could not stage a crash kernel, usually because of a missing crashkernel reservation or a mis-generated initramfs.
Fix: Fix crashkernel= first, then update-initramfs -u -k all, restart kdump-tools, recheck.
Dump directory exists but only metadata, no vmcore
Symptoms: You see dmesg.txt or info files, but no vmcore.
Root cause: Dump write failed mid-stream: out of space, NFS hiccup, driver missing, or crash kernel ran out of memory.
Fix: Check space, check NFS stability, increase crashkernel size, ensure NIC/storage modules exist in initramfs.
Dump written locally, but filesystem is corrupted afterwards
Symptoms: vmcore appears, but next boot triggers fsck or errors.
Root cause: Dumping to the same filesystem that was already unhealthy, or crash occurred mid-write and journal replay got spicy.
Fix: Dump to a dedicated partition or NFS. Treat /var as untrusted during disasters.
Dump over NFS never appears; local dumps work
Symptoms: Local disk capture works; NFS target stays empty after crash tests.
Root cause: Crash kernel can’t bring up networking (missing NIC module, no IP config, no route, VLAN needs).
Fix: Ensure NIC driver in initramfs; configure static IP for crash kernel; avoid dependency on complex network services (DHCP, 802.1X).
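For the static-IP part, initramfs-tools understands the classic ip= parameter; where you attach it to the crash kernel's command line depends on your kdump-tools revision, so treat this as the syntax to aim for rather than a drop-in line (addresses and the interface name are examples):
ip=10.10.20.21::10.10.20.1:255.255.255.0:server:eth0:off
The fields are client IP, NFS server IP, gateway, netmask, hostname, device, and autoconf; empty fields are allowed.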
System hangs on crash instead of rebooting into crash kernel
Symptoms: You trigger crash; host freezes; no reboot.
Root cause: Hardware/firmware lockup, NMI issues, or panic path can’t execute kexec; sometimes watchdog not configured.
Fix: Enable hardware watchdog if available; test alternate crash triggers; consider disabling problematic IOMMU settings for test; verify console access.
vmcore exists but crash tool can’t read it
Symptoms: file shows core file, but crash errors.
Root cause: Missing matching vmlinux debug symbols for that exact kernel build, or dump is truncated.
Fix: Retain debug packages for deployed kernels; verify dump size and storage integrity; re-test with less aggressive makedumpfile filtering.
After enabling crashkernel, the system won’t boot or memory is tight
Symptoms: OOMs earlier, containers behave worse, or boot fails on tiny memory systems.
Root cause: Reserving too much on a constrained VM, or conflicting firmware memory map.
Fix: Use a smaller value (256M), or memory-range based crashkernel parameters; confirm reservation in dmesg.
Checklists / step-by-step plan
Minimal “I just need it working” plan (local disk)
- Install: kdump-tools, kexec-tools, makedumpfile.
- Add crashkernel=512M to GRUB, run update-grub, reboot.
- Verify reservation: dmesg | grep -i crashkernel.
- Verify a crash kernel is loaded: cat /sys/kernel/kexec_crash_loaded (should print 1).
- Ensure /var/crash has space and is writable.
- Test crash during a window: echo c > /proc/sysrq-trigger.
- After reboot, verify /var/crash contains vmcore.
Production-credible plan (NFS target)
- Set up an NFS export dedicated to crash dumps (permissions, quota policy, retention).
- Mount it in normal userspace for visibility, but don’t assume crash kernel uses that mount.
- Set KDUMP_COREDIR to the intended path and rebuild initramfs.
- Confirm initramfs contains NIC driver and NFS client bits (inspect with lsinitramfs).
- Set kernel.panic=10 so the system reboots after dumping.
- Run a controlled crash test and watch the NFS directory live.
- Open the dump with crash at least once. Prove it's analyzable, not just "a file exists."
Change management notes (what I write on the ticket)
- Risk: reserving crashkernel reduces available RAM; rare boot issues on odd firmware.
- Rollback: remove crashkernel= from GRUB and reboot.
- Verification: dmesg reservation, crash kernel loaded, test crash produces vmcore.
- Operational follow-up: retention policy, permissions, and symbol availability for analysis.
Three corporate mini-stories (painfully plausible)
1) The incident caused by a wrong assumption
A team rolled out kdump across a fleet after a kernel panic took out a database node and everyone had to shrug in the postmortem.
They installed the packages, enabled the service, and checked that systemctl status was green.
Then they declared victory. The change ticket closed with a screenshot. Nobody tested a crash.
Months later, a production host started panicking under heavy network load. This time they were ready: “we have kdump now.”
They waited for the reboot, then logged in and looked in /var/crash. Empty. Not even a stub file.
The escalation thread started with that familiar sentence: “But we enabled it.”
The root cause was embarrassingly clean. The bootloader parameters never had crashkernel=.
The service “worked” in the sense that it ran and exited successfully, but it had nothing to load into.
The assumption was that installing kdump-tools implies the system reserves memory automatically. It doesn’t.
The fix was easy and the lesson was not: any reliability feature that isn’t verified end-to-end is theater.
They added crashkernel=512M, rebooted the hosts in batches, and ran one controlled crash per rack during a scheduled window.
After that, when the network driver bug reappeared, they had a vmcore and a real backtrace. The vendor conversation changed tone immediately.
2) The optimization that backfired
Another org ran high-memory analytics boxes. Someone decided crash dumps were too large and too slow, so they “optimized”
makedumpfile to drop more pages and compress harder. The dumps got smaller. Everyone applauded.
They also reduced crashkernel from 1G to 256M because “it boots fine and we want RAM back.”
Then a panic hit during a storage controller reset storm. Kdump sometimes wrote a tiny vmcore—a few megabytes—sometimes nothing.
When it did write, crash couldn’t extract useful stacks because key memory regions were filtered out.
The optimized setup delivered a nice artifact that contained the operational equivalent of lorem ipsum.
The post-incident analysis found two compounding failures. The smaller crashkernel meant the crash environment was memory-starved
when loading storage drivers and buffering writes. On top of that, the aggressive page filtering removed kernel data structures
needed for the investigation they actually cared about (I/O path state).
They rolled back to a conservative baseline: crashkernel at 768M and moderate filtering.
Dumps got bigger again. But they became reliable and useful. The real optimization wasn’t compression; it was making sure
dumps landed on an NFS target so local disk weirdness didn’t matter. Performance tuning is great, but only after correctness.
3) The boring but correct practice that saved the day
A platform team ran a mixed Debian fleet and had a dull policy: every kernel upgrade must also archive matching debug symbols
(or at least make them retrievable) for the prior two versions. No exceptions. Engineers complained that it was extra storage,
extra steps, extra bureaucracy.
One night, a host panicked in a way that took down a critical internal service. Kdump worked and wrote a vmcore to NFS.
The on-call pulled the dump into the analysis VM, ran crash, and immediately got a clean backtrace pointing at a specific subsystem.
It wasn’t a full root-cause, but it was enough to route the incident to the right team and apply a mitigation.
The key was that symbol availability turned “we have a vmcore” into “we have an answer.” Without symbols, they’d have had a blob.
With symbols, they had function names and a credible narrative of failure. The boring practice—keeping debug artifacts aligned with deployed kernels—
was the difference between a two-hour incident and a two-day argument.
Nobody wrote a celebration email about the policy afterwards. That’s how you know it was real engineering.
FAQ
1) Do I really need to reboot after adding crashkernel=?
Yes. The memory reservation happens at boot. No reboot means no reserved memory, which means no reliable crash dumping.
2) How do I know kdump is actually armed right now?
Check both: dmesg | grep -i crashkernel (reservation) and cat /sys/kernel/kexec_crash_loaded (a value of 1 means the crash kernel is loaded).
“Service is enabled” is not proof.
3) What’s the safest way to test kdump?
Use SysRq crash trigger (echo c > /proc/sysrq-trigger) during a maintenance window with console access.
Verify the dump appears and is readable with crash.
4) How big should I set crashkernel on Debian 13?
Start with 512M for typical servers. If you have heavy storage/network driver needs, go 768M–1024M.
If you’re on a tiny VM, 256M may be your ceiling—test it.
5) Can I dump to an encrypted filesystem (LUKS)?
You can, but it’s fragile: the crash initramfs must unlock LUKS, which means you need keys available non-interactively.
For most environments, dump to an unencrypted dedicated partition or to NFS instead.
6) Will kdump work if the crash was caused by the storage driver?
Maybe, maybe not. If your dump target depends on the same driver stack that just melted, you’re gambling.
That’s why NFS or a simple dedicated local partition is often the best design.
7) Why do I get a vmcore but it’s too big?
You’re likely dumping most of RAM with little filtering/compression. Tune MAKEDUMPFILE_ARGS to filter pages,
and use a dump target sized for worst case. Also consider that “too big” is often a retention problem, not a kdump problem.
8) Why does crash complain about missing symbols?
You need the matching vmlinux with debug symbols for the exact kernel release that produced the dump.
Build a process to retain or retrieve those artifacts after upgrades.
9) Should I set kernel.panic_on_oops=1?
In many production environments, yes: an oops often means corrupted kernel state, and continuing can cause silent data corruption.
If you choose to panic-on-oops, pair it with kdump and a sane kernel.panic timeout.
10) Do containers or VMs change anything?
Containers don’t control the host kernel, so kdump is a host feature. In VMs, kdump usually works well—just remember the guest needs crashkernel
RAM reserved, and your hypervisor may influence how fast dumps are written to virtual disk or network.
Next steps you should actually do
- Pick a dump target based on failure modes, not convenience. If storage is complex, go NFS or dedicated partition.
- Standardize crashkernel sizing. Start with 512M; document when to use more.
- Do one controlled crash test per platform type. Different NICs/RAID controllers behave differently in the crash kernel.
- Make symbol availability a policy. If you can’t open a dump two weeks later, you’re collecting expensive paperweights.
- Automate verification. At minimum, alert if crashkernel isn't reserved or the crash kernel isn't loaded; see the sketch after this list.
- Write down the operator runbook. Who pulls the vmcore, where it lives, how retention works, and who can access it.
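As promised above, a minimal verification check you can wire into existing monitoring (a sketch; the path and exit codes are arbitrary choices):
cr0x@server:~$ cat /usr/local/sbin/check-kdump
#!/bin/sh
# Alert if crashkernel memory is not reserved or no crash kernel is currently loaded.
set -eu
grep -q 'crashkernel=' /proc/cmdline || { echo 'CRIT: no crashkernel= on the kernel command line'; exit 2; }
[ "$(cat /sys/kernel/kexec_crash_loaded 2>/dev/null)" = "1" ] || { echo 'CRIT: crash kernel not loaded'; exit 2; }
echo 'OK: kdump armed'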
Kdump is a reliability feature that only earns its keep after everything else fails. That’s not a reason to procrastinate.
That’s the reason to do it correctly, test it, and keep it boring.