You enable auditd because compliance wants “visibility.” A week later the storage graph looks like a hedgehog, your SSD SMART wear counter starts climbing like it’s training for a marathon, and every incident review ends with: “Why is the system slower when we’re trying to make it safer?”
This is the sober way to run Linux auditing on Debian 13: keep the security signal, drop the noise, and stop turning your disks into a write-only suggestion.
The mental model: what auditd really does to your box
auditd is not a “log file writer.” It’s a user-space collector for the kernel audit subsystem. Rules you load tell the kernel which events to emit. Those events land in a kernel queue (the backlog) and then get pulled by auditd, formatted, and written—usually to /var/log/audit/audit.log. If you ship them elsewhere, they still exist locally first in most setups.
The disk pain comes from two places:
- Event volume: too many audited syscalls or watched paths, especially on hot code paths (package managers, container runtimes, build systems, web servers writing temp files).
- Write pattern: audit records are small and frequent. Small synchronous-ish appends plus metadata updates can become a random-write generator depending on filesystem, mount options, and log rotation behavior.
Kernel auditing is also ruthless about correctness. If the backlog overflows, you can drop events. If you configure “panic on failure,” you can drop the whole machine. Compliance loves the second one; operations hates it; you will end up negotiating.
Here’s the hard truth: you cannot “optimize” an audit rule set you haven’t scoped. If you start from a giant baseline that audits every openat() and every execve() on every host, you’ll get a lot of data—and a lot of heat. Your job is to decide which questions you actually need to answer during an incident, then audit precisely enough to answer them reliably.
One quote that holds up in operations: “Hope is not a strategy.”
— paraphrased idea often attributed to reliability practitioners. In auditing, hope looks like “we’ll just turn it on everywhere and filter later.” You won’t. You’ll drown first.
A few facts and bits of history (useful ones)
- Audit in Linux predates most “cloud native” tooling. The Linux Audit Framework has been around since the mid-2000s, aimed at meeting regulated environments that needed tamper-evident trails.
- SELinux and audit grew up together. A lot of audit’s early adoption came from environments already deploying SELinux and needing event trails for policy decisions and enforcement debugging.
- The audit pipeline is kernel-first, daemon-second. If user-space can’t keep up, the kernel queue backs up; the daemon isn’t the only moving part.
- “Watch rules” (-w) are path-based and expensive in their own ways. They’re great for a small set of critical files; they’re a foot-gun for entire directories with high churn.
- Syscall rules can explode faster than you expect. Auditing file opens on a busy host is the classic self-inflicted DoS, because everything opens files constantly.
- Audit logs are structured-ish, not human-first. The raw log is meant for tools like ausearch and aureport, not grepping at 3 a.m. (though you will do it anyway).
- Audit has a real "fail closed" story. With audit=1 and strict settings, systems can be configured to treat audit failure as fatal. That's not theoretical; people do it.
- Debian historically defaulted to being conservative. You typically don't get a massive ruleset "for free." The damage usually comes from copied compliance baselines that weren't tested on your workload.
Joke #1: Audit logs are like glitter—once you deploy a noisy rule, you’ll find it everywhere for months.
What “sane auditing” looks like in production
Before touching configs, define targets. Otherwise you’ll argue endlessly with Security about “missing events” without knowing what “enough” means.
Operational targets (practical, measurable)
- No sustained disk write amplification from audit. Short spikes are fine; constant churn is not. Your SSD endurance budget is real.
- No meaningful latency hit to critical services. Measure p99 latency before and after enabling auditing on representative load.
- No audit event loss under normal load. Under incident load (fork bomb, disk full), you decide policy: drop events, throttle, or fail closed.
- Audit logs are queryable and interpretable. If it takes a specialist to answer “who changed SSH config,” your system will rot.
Security targets (what you actually want to know)
- Administrative actions: user creation, sudo usage, privilege changes.
- Credential and auth-related changes: SSH keys, PAM config, /etc/shadow.
- Execution of sensitive binaries: sudo, su, package managers, service managers.
- Changes to critical configuration: a subset of /etc, systemd units, cron, audit config itself.
- Kernel/module actions where relevant: module load/unload on hosts where that matters.
If you need “data exfil detection,” auditd alone is not your tool. It can contribute (e.g., watching access to a specific key file), but it’s not a network DLP system.
Fast diagnosis playbook (bottleneck hunt in minutes)
When someone says “auditd is killing the disks,” don’t debate. Run a short playbook and let the box tell you.
First: is it event volume or disk behavior?
- Check audit event rate and backlog health.
- Check disk IO patterns (IOPS, latency, queue depth).
- Check whether audit rules are syscall-heavy or watch-heavy.
Second: identify the offending rule or workload
- Find the hottest audit keys (if you used keys).
- Correlate timestamps: audit spikes vs. service deploys vs. cron vs. logrotate.
- Look for high-churn directories being watched.
Third: decide the tradeoff knob you’re going to turn
- Reduce events: narrow rules, add filters, remove directory watches.
- Reduce write pain: tune flush/freq, rotation, separate filesystem, avoid pathological mount options.
- Increase buffering: backlog settings, but understand memory and loss semantics.
- Change failure behavior: do you drop audit events under pressure or degrade the host?
Practical tasks (commands, output, decisions)
These are real, runnable checks. Each one ends with a decision: keep, change, or escalate. Do them in order the first time; later you’ll develop instincts.
Task 1: Confirm auditing is enabled and the daemon is healthy
cr0x@server:~$ systemctl status auditd --no-pager
● auditd.service - Security Auditing Service
Loaded: loaded (/lib/systemd/system/auditd.service; enabled; preset: enabled)
Active: active (running) since Mon 2025-12-30 09:12:01 UTC; 2h 3min ago
Docs: man:auditd(8)
Main PID: 712 (auditd)
Tasks: 4
Memory: 3.2M
CPU: 1min 22s
What it means: It’s running. If it’s flapping, you likely have config errors or disk full issues.
Decision: If inactive/failed, fix service health first (disk space, config syntax, permissions) before tuning rules.
Task 2: Check kernel audit status: backlog, failure mode, enabled state
cr0x@server:~$ sudo auditctl -s
enabled 1
failure 1
pid 712
rate_limit 0
backlog_limit 8192
lost 0
backlog 12
backlog_wait_time 60000
What it means: Auditing is enabled. Backlog is mostly empty (backlog 12). failure 1 means “printk” behavior; stricter modes exist.
Decision: If lost is increasing, you’re dropping audit events—treat that as a production incident for compliance-scoped hosts.
Task 3: Measure audit event volume quickly (rough rate)
cr0x@server:~$ sudo awk 'BEGIN{c=0} {c++} END{print c}' /var/log/audit/audit.log
184223
What it means: Raw count in the current file. Combine with file timestamp to estimate throughput.
Decision: If this number grows insanely fast (hundreds of thousands per hour on a small host), you likely have syscall rules that are too broad.
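If you want a rate rather than a raw count, you can derive events per second from the record timestamps instead of eyeballing file age. A minimal sketch, assuming the default audit.log format where every record carries msg=audit(EPOCH.ms:serial); it prints the average rate across the current file:
cr0x@server:~$ sudo awk -F 'audit\(' 'NR==1 {split($2,t,":"); first=t[1]} {split($2,t,":"); last=t[1]} END {printf "%.1f events/sec over %.0f seconds\n", NR/(last-first+1), last-first}' /var/log/audit/audit.log
Bursty workloads hide behind an average, so compare this against peak windows (patching, CI) before declaring victory.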
Task 4: Check which audit keys are hottest (only works if you used -k)
cr0x@server:~$ sudo aureport --summary --key | head
Key Summary Report
=========================================
total file
=========================================
15432 ssh-config
12011 priv-esc
8430 pkg-mgr
4122 cron
What it means: You can see which groups of rules generate the most events.
Decision: If one key dominates, inspect the rules behind it. Reduce scope or add exclusions.
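To see which rules feed a hot key, grep both the live rule list and the rules.d sources for it (pkg-mgr here is just the example key from the report above):
cr0x@server:~$ sudo auditctl -l | grep 'pkg-mgr'
cr0x@server:~$ sudo grep -rn 'pkg-mgr' /etc/audit/rules.d/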
Task 5: Identify high-churn watched paths (watch rules)
cr0x@server:~$ sudo auditctl -l | sed -n '1,12p'
-w /etc/ssh/sshd_config -p wa -k ssh-config
-w /etc/sudoers -p wa -k priv-esc
-w /etc/sudoers.d/ -p wa -k priv-esc
-w /var/log/ -p wa -k logdir
-a always,exit -F arch=b64 -S execve -F euid=0 -k root-exec
What it means: That -w /var/log/ is a neon sign reading “I enjoy pain.” Watching entire log directories is high churn.
Decision: Remove directory-wide watches on high-activity paths (/var/log, /tmp, container storage). Replace with specific files or syscall filters.
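One way to execute that decision: find the offending watch in rules.d, delete the line, and rebuild. The file name below is hypothetical; adjust it to whatever your baseline actually ships. If the host runs in immutable mode, the change only applies after a reboot.
cr0x@server:~$ sudo grep -rln -- '-w /var/log/' /etc/audit/rules.d/
cr0x@server:~$ sudo sed -i '\#^-w /var/log/#d' /etc/audit/rules.d/30-legacy.rules
cr0x@server:~$ sudo augenrules --load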
Task 6: Check disk IO pressure and latency
cr0x@server:~$ iostat -xz 1 3
Linux 6.12.0-1-amd64 (server) 12/30/2025 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
3.12 0.00 1.55 6.78 0.00 88.55
Device r/s w/s rkB/s wkB/s aqu-sz await %util
nvme0n1 2.10 820.0 45.0 9320.0 4.12 5.05 92.30
What it means: Very high writes per second, high utilization. That’s consistent with tiny log appends.
Decision: If w/s is huge and await climbs during audit spikes, you’re disk-bound. Reduce event volume and/or smooth writes with auditd settings and filesystem separation.
Task 7: Verify audit log filesystem and mount options
cr0x@server:~$ findmnt -no SOURCE,FSTYPE,OPTIONS /var/log/audit
/dev/mapper/vg0-varlog ext4 rw,relatime,errors=remount-ro
What it means: Audit logs are on ext4 with relatime. That’s reasonable.
Decision: If audit logs share a busy root filesystem, strongly consider a dedicated filesystem for /var/log or /var/log/audit so audit writes don’t compete with databases or container layers.
Task 8: Confirm journald isn’t duplicating the pain
cr0x@server:~$ sudo journalctl -u auditd -n 20 --no-pager
Dec 30 10:11:22 server auditd[712]: Audit daemon started successfully
Dec 30 10:14:05 server auditd[712]: Audit backlog limit exceeded
Dec 30 10:14:05 server auditd[712]: Audit events lost
What it means: You’re seeing backlog pressure. Journald is mostly not the audit log, but it will capture auditd complaints.
Decision: If backlog is exceeded, you need to reduce event rate or increase capacity (backlog/CPU), not “rotate logs harder.”
Task 9: Watch backlog and lost counters live
cr0x@server:~$ watch -n 1 'sudo auditctl -s | egrep "lost|backlog |backlog_limit|backlog_wait_time"'
Every 1.0s: sudo auditctl -s | egrep "lost|backlog |backlog_limit|backlog_wait_time" server: Tue Dec 30 11:16:02 2025
backlog_limit 8192
lost 0
backlog 23
backlog_wait_time 60000
What it means: If backlog climbs and stays high, auditd can’t drain events fast enough.
Decision: If backlog climbs during specific jobs (e.g., backups), scope rules so those jobs don’t generate irrelevant audit events.
Task 10: Find top talkers in audit.log by record type
cr0x@server:~$ sudo awk '{print $1}' /var/log/audit/audit.log | cut -d'=' -f2 | sort | uniq -c | sort -nr | head
88421 SYSCALL
60210 PATH
9102 CWD
3210 EXECVE
What it means: Lots of SYSCALL and PATH records suggests syscall auditing is the bulk. That’s normal; the question is whether it’s too broad.
Decision: If PATH entries dominate, you may be auditing file opens/attrs widely. Tighten rules with directory/file filters and auid/euid constraints.
Task 11: Confirm log rotation policy for audit logs
cr0x@server:~$ sudo grep -E '^(max_log_file|num_logs|max_log_file_action|space_left_action|admin_space_left_action|disk_full_action|flush|freq)' /etc/audit/auditd.conf
max_log_file = 50
num_logs = 10
max_log_file_action = ROTATE
space_left_action = SYSLOG
admin_space_left_action = SUSPEND
disk_full_action = SUSPEND
flush = INCREMENTAL_ASYNC
freq = 50
What it means: Rotation is enabled. flush = INCREMENTAL_ASYNC with freq=50 reduces sync pressure while still limiting loss window.
Decision: If you see flush = SYNC on busy hosts, expect pain. If you see num_logs too high with huge files, you might be hoarding locally without benefit.
Task 12: Validate the rule files actually loaded (and aren’t duplicated)
cr0x@server:~$ sudo augenrules --check
No change
What it means: Compiled rules match what’s on disk. No pending changes.
Decision: If it reports changes, reload rules in a controlled way; avoid hot-reloading during peak traffic if you’re using immutable mode.
Task 13: Check whether rules are immutable (and thus hard to fix under fire)
cr0x@server:~$ sudo auditctl -s | grep enabled
enabled 1
cr0x@server:~$ sudo auditctl -e 2
Error sending enable request (Operation not permitted)
What it means: If you’re already in immutable mode (-e 2 set at boot), you cannot change rules without rebooting.
Decision: Use immutable mode only on systems where that operational constraint is acceptable and tested. Otherwise you’re one bad baseline away from a long night.
Task 14: Check SSD wear indicators (sanity check, not paranoia)
cr0x@server:~$ sudo smartctl -a /dev/nvme0n1 | egrep -i 'percentage|data units written|wear|media'
Percentage Used: 3%
Data Units Written: 8,421,332 [4.31 TB]
Media and Data Integrity Errors: 0
What it means: You can correlate write growth with audit enablement. Don’t guess; measure.
Decision: If “Data Units Written” slope changes sharply after enabling audit, treat it as a capacity planning issue and tune accordingly.
Task 15: Find the noisiest executables (common culprit: package managers and orchestration)
cr0x@server:~$ sudo ausearch -m SYSCALL -ts today | grep -oE 'exe="[^"]+"' | sort | uniq -c | sort -nr | head
812 exe="/usr/bin/dpkg"
640 exe="/usr/bin/apt"
418 exe="/usr/bin/python3.12"
251 exe="/usr/sbin/cron"
What it means: You have bursts tied to package activity and cron jobs. That’s not evil; it’s predictable.
Decision: If you audit package manager activity heavily, accept bursts during patch windows—or add time-scoped or host-scoped rules for patching fleets.
Task 16: Build a report for “who touched /etc/shadow” (proof that audit is working)
cr0x@server:~$ sudo ausearch -f /etc/shadow -ts this-week --raw | aureport -f -i | head
File Report
===============================================
# date time file syscall success exe auid event
===============================================
1. 12/29/2025 02:10:12 /etc/shadow openat yes /usr/sbin/usermod admin 19402
What it means: This is the kind of question audit should answer quickly. If it can’t, your rules are either too weak or too messy.
Decision: Keep rules that answer real incident questions. Remove rules that create heat but don’t change decisions.
Rules that don’t melt disks: strategy and examples
Audit rules are where disks go to die. Most “auditd tuning” advice is just “turn off auditing.” That’s lazy. The real move is to move from broad syscall coverage to narrow, intention-driven coverage.
Principle 1: Prefer targeted watches for critical files, not busy directories
Watching /etc/shadow is cheap and useful. Watching /var/log is a denial-of-service letter you mailed to yourself.
Good watch examples:
- /etc/passwd, /etc/shadow, /etc/group
- /etc/sudoers, /etc/sudoers.d/
- /etc/ssh/sshd_config and possibly /etc/ssh/sshd_config.d/
- /etc/audit/ (yes, watch the watcher)
- Systemd unit files you actually care about: /etc/systemd/system/
Principle 2: If you audit syscalls, filter by identity and intent
Auditing execve for every user on a build server will generate a small novel every minute. Instead:
- Audit privileged executions (euid=0) initiated by non-system users (auid>=1000).
- Audit changes (write/attribute changes) to a small set of filesystems or directories.
- Exclude known-noise accounts (backup, monitoring) if policy allows, and document that exception.
Principle 3: Use keys religiously
Keys (-k) are not decoration. They’re how you triage. Without keys, you’ll stare at raw records and feel your soul leave your body.
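The payoff is that triage becomes a one-liner. For example, pulling today's privilege-escalation events (using the priv-esc key from the baseline below) in human-readable form:
cr0x@server:~$ sudo ausearch -k priv-esc -ts today --interpret | tail -n 40
If that command returns nothing after you deliberately edit a sudoers include, your rules or keys are wrong, and it's better to learn that now than during an incident.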
A sane baseline ruleset (opinionated)
On Debian, rules are typically managed via /etc/audit/rules.d/*.rules and compiled to /etc/audit/audit.rules using augenrules.
Example: /etc/audit/rules.d/10-sane-baseline.rules (illustrative, not maximal):
cr0x@server:~$ sudo cat /etc/audit/rules.d/10-sane-baseline.rules
## Identity and auth files
-w /etc/passwd -p wa -k identity
-w /etc/shadow -p wa -k identity
-w /etc/group -p wa -k identity
-w /etc/gshadow -p wa -k identity
## Sudoers and SSH
-w /etc/sudoers -p wa -k priv-esc
-w /etc/sudoers.d/ -p wa -k priv-esc
-w /etc/ssh/sshd_config -p wa -k ssh-config
-w /etc/ssh/sshd_config.d/ -p wa -k ssh-config
## Audit configuration itself
-w /etc/audit/ -p wa -k audit-config
## Systemd unit overrides (where persistence happens)
-w /etc/systemd/system/ -p wa -k systemd-units
-w /etc/systemd/system.conf -p wa -k systemd-units
-w /etc/systemd/journald.conf -p wa -k logging
## Cron persistence
-w /etc/cron.d/ -p wa -k cron
-w /etc/crontab -p wa -k cron
-w /var/spool/cron/ -p wa -k cron
## Privileged command execution by real users
-a always,exit -F arch=b64 -S execve -F euid=0 -F auid>=1000 -F auid!=4294967295 -k root-exec
-a always,exit -F arch=b32 -S execve -F euid=0 -F auid>=1000 -F auid!=4294967295 -k root-exec
This baseline does a few important things:
- It focuses on persistence and privilege, which are incident-relevant.
- It avoids syscall auditing for file open/close. That’s where event volume goes feral.
- It tags everything with keys so you can triage in minutes.
Where people go wrong: “log all the things” syscall rules
A common compliance baseline audits file access broadly, often with something like “audit all open/openat failures,” or “audit all writes by anyone.” On a modern Linux system, that’s basically auditing the act of breathing.
If you must audit file writes, do it narrowly:
- Specific directories containing secrets or regulated datasets.
- Specific service accounts interacting with those datasets.
- Only successful writes/attribute changes, unless failure is a security signal for your environment.
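A minimal sketch of what "narrow" can look like, assuming a hypothetical /srv/secrets directory holding regulated data; the path, syscall list, and filters are illustrative, not a drop-in baseline:
cr0x@server:~$ sudo cat /etc/audit/rules.d/30-secrets.rules
## Writes and attribute changes by real users under the regulated dataset only (hypothetical path)
-a always,exit -F arch=b64 -S openat,open,truncate,ftruncate,unlink,unlinkat,rename,renameat -F dir=/srv/secrets -F perm=wa -F auid>=1000 -F auid!=4294967295 -k secrets-write
## Optional: failed opens as a signal, only if your threat model cares
#-a always,exit -F arch=b64 -S openat,open -F dir=/srv/secrets -F success=0 -k secrets-denied
Everything is pinned: a directory, a permission intent, and a real-user auid range. That's the difference between a budget and a firehose.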
Noise controls you should actually use
You can filter out known-noise patterns using fields like auid, uid, exe, and dir. The trap: each filter you add is also a future blind spot. Make sure the exclusion is a policy decision, not a convenience.
Example: exclude a dedicated backup agent user from privileged exec auditing (only if your security team agrees). Audit rules are evaluated in order and the first match wins, so the exclusions file must sort before the baseline file in rules.d for the never rules to take effect (hence the 05- prefix here):
cr0x@server:~$ sudo cat /etc/audit/rules.d/05-exclusions.rules
## Exclude known automation account from root-exec noise
-a never,exit -F arch=b64 -S execve -F auid=1105 -k exclude-backup
-a never,exit -F arch=b32 -S execve -F auid=1105 -k exclude-backup
Joke #2: The fastest way to discover a “rare” code path is to audit it in production.
Audit daemon settings that matter (and why)
Rules decide volume. auditd.conf decides how painful it is to persist that volume and what happens when things go sideways.
Flush strategy: your disk’s mood ring
Key settings:
- flush: how auditd flushes data to disk.
- freq: how often to flush when using incremental modes.
Practical recommendation for most production hosts: flush = INCREMENTAL_ASYNC and a moderate freq (e.g., 50–200). This reduces the “tiny sync write all day long” problem.
What you trade: a larger loss window if the host crashes. Decide if that’s acceptable based on host class. For hardened compliance endpoints, you may be forced toward stricter flushing. If you do that, you must reduce event volume even more aggressively.
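For reference, the relevant auditd.conf lines for that stance look like this; the freq value is a starting point to tune against your measured event rate, not gospel:
cr0x@server:~$ sudo grep -E '^(flush|freq)' /etc/audit/auditd.conf
flush = INCREMENTAL_ASYNC
freq = 100
Restart auditd afterwards so the new flush behavior actually takes effect.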
Log rotation: make it boring
Audit logs rotate differently than general logs; don’t let logrotate surprise you. Use auditd’s own rotation knobs:
- max_log_file and num_logs size the local retention.
- max_log_file_action = ROTATE is usually correct.
- max_log_file_action = KEEP_LOGS is how you end up with a full disk and a very philosophical incident.
Disk space actions: decide what failure looks like
These settings are policy in disguise:
- space_left_action, admin_space_left_action
- disk_full_action, disk_error_action
For general-purpose production hosts, I prefer: alert loudly, keep the box running, and preserve what you can. For regulated endpoints, you may have to suspend auditing or even halt. If your policy is “halt on disk full,” you’d better have separate filesystems and monitoring that catches growth before the cliff.
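A sketch of the "alert loudly, keep the box running" stance for a general-purpose host; the thresholds are placeholders in megabytes and must be sized against your actual log filesystem:
cr0x@server:~$ sudo grep -E '^(space_left|admin_space_left|disk_full_action|disk_error_action)' /etc/audit/auditd.conf
space_left = 500
space_left_action = SYSLOG
admin_space_left = 100
admin_space_left_action = SYSLOG
disk_full_action = SUSPEND
disk_error_action = SYSLOG
Regulated endpoints will likely need stricter actions here; that's a policy call, not a tuning one.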
Backlog and wait time: absorb bursts without lying to yourself
Kernel backlog tuning is about surviving bursts: package updates, CI jobs, cron storms. Bigger backlog can help—until it doesn’t, and then you just delayed the drop.
Use backlog size to handle legitimate bursts, but treat persistent backlog as “your rules are too wide” rather than “add more buffer.” Buffers aren’t throughput.
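Backlog sizing lives in the rules, typically in a header file that sorts first. A minimal sketch mirroring the defaults seen earlier; the file name is a convention, and you should only raise -b after narrowing the rules:
cr0x@server:~$ sudo cat /etc/audit/rules.d/00-header.rules
## Clear existing rules, size the kernel backlog, keep failure mode at printk
-D
-b 8192
--backlog_wait_time 60000
-f 1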
Debian 13 practical stance on immutable mode
Immutable mode (rules locked until reboot) is useful to prevent an attacker from disabling auditing after gaining root. It’s also useful to prevent you from fixing a bad rollout quickly.
My opinion: use immutable mode only on hosts where you have:
- tested rule sets under realistic load,
- a rehearsed rollback that includes reboot,
- clear ownership of “audit broke prod” incidents.
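If you do lock rules, the conventional pattern is a final rules file whose last directive sets immutable mode, so it only applies after everything else has loaded; the file name is a convention, not a requirement:
cr0x@server:~$ sudo cat /etc/audit/rules.d/99-finalize.rules
## Must sort last: once this loads, rule changes require a reboot
-e 2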
Storage and filesystem realities: ext4, XFS, ZFS, NVMe
Audit logging is a small-write workload. On modern storage, small writes aren’t automatically bad—until you turn them into a constant stream with metadata churn and forced flushes.
Separate the blast radius
If you can do only one storage change: put /var/log or at least /var/log/audit on its own filesystem with sane sizing and monitoring. This does three things:
- prevents audit growth from filling root and bricking the host,
- reduces IO contention with application data,
- lets you apply filesystem/mount tuning without touching everything else.
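A rough sketch of that move on an LVM host; the volume group name and size are assumptions, and during the real cutover you'd stop auditd, copy the existing logs onto the new filesystem, then mount it and restart:
cr0x@server:~$ sudo lvcreate -L 8G -n auditlog vg0
cr0x@server:~$ sudo mkfs.ext4 /dev/vg0/auditlog
cr0x@server:~$ echo '/dev/vg0/auditlog /var/log/audit ext4 defaults,relatime 0 2' | sudo tee -a /etc/fstab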
ext4: fine, boring, predictable
ext4 on NVMe is usually fine if you avoid pathological settings. Use relatime. Don’t mount with sync. Don’t run your audit log on the same filesystem as a busy database unless you enjoy blame games.
XFS: strong under concurrency, still not magic
XFS handles parallel workloads well, but audit logs are mostly append writes. If your issue is event volume, XFS won’t save you. If your issue is contention and metadata behavior, it might help. Measure, don’t vibe-check.
ZFS: great tool, different failure modes
If you write audit logs onto ZFS, the knobs change. ZFS can turn many small writes into different IO patterns depending on recordsize, sync settings, and SLOG behavior. It can be excellent, but it’s also easy to accidentally convert “small append writes” into “lots of synchronous transactions.”
Practical stance: if you’re not already running ZFS with operational maturity, don’t introduce it just to “fix auditd.” Fix the rule volume first.
NVMe SSD endurance: not fragile, not infinite
Modern enterprise SSDs can take a beating, but endurance is still a budget. What kills you is sustained needless writes. Audit is often sustained and needless when misconfigured. Use SMART counters to prove your case to stakeholders.
Shipping, aggregation, and keeping audit logs useful
Local audit logs are necessary, not sufficient. In real incidents, the compromised host is the least trustworthy place to store truth.
Local first, remote always (but don’t double-write yourself to death)
Many environments ship audit logs to a SIEM via an agent that tails /var/log/audit/audit.log. That’s normal. Watch for two problems:
- Tail-induced IO amplification: Some agents behave badly on rotations or do frequent fsyncs. Test your shipper.
- Parsing explosions: If your SIEM pipeline chokes, Security may ask for “raw everything” locally forever. Push back. Define retention and prove the remote pipeline works.
Make keys and host metadata first-class
If you don’t tag rules with keys and annotate events in the pipeline with host role, environment, and owner, you’ll end up with a global lake of identical noise. That’s how “auditing” becomes “expensive storage with vibes.”
Retention policy: don’t hoard locally
Local retention is for:
- short-term incident response when central logging is delayed,
- buffering against network outages,
- basic forensic continuity.
It is not for multi-month archival unless you have a specific requirement and a storage plan. Set max_log_file and num_logs based on measured event rates and a realistic outage window for your log pipeline.
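A back-of-the-envelope check with made-up numbers you should replace with your measured rate: about 200 events/sec at roughly 300 bytes per record works out to about 5 GB/day.
cr0x@server:~$ echo "$(( 200 * 300 * 86400 / 1024 / 1024 )) MB/day"
4943 MB/day
At that rate, the earlier 50 MB x 10 files of local retention covers roughly two and a half hours of pipeline outage, which may or may not be the window you think you're covered for.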
Three corporate mini-stories from the trenches
Incident caused by a wrong assumption: “audit writes are just logs, they’ll compress nicely”
At a mid-sized company with a mix of Debian and Ubuntu, Security rolled out a “standard Linux audit baseline” during a quarter when everyone was already stressed. The baseline included syscall auditing for a large set of file operations, because the template had been written for a much smaller fleet and a much quieter workload.
Operations assumed it would behave like regular text logging: “It’s logs, it’ll just be sequential writes; compression and rotation will take care of it.” That assumption died quickly on CI build hosts. Those hosts were already running hot with container pulls, unpacking layers, and writing caches. The new audit rules essentially recorded a high percentage of that churn.
The first symptom wasn’t audit errors. It was build timeouts and sudden increases in IO wait. Latency-sensitive services co-located on the same hypervisors started reporting “random” slowdowns. Then came the real kicker: some hosts began dropping audit events because the backlog filled during peak build waves.
The fix was not heroic. They deleted broad file syscall rules, kept targeted watches on identity and persistence files, and restricted privileged exec auditing to real users via auid constraints. They also moved /var/log/audit to a separate filesystem on hosts that mattered. Compliance was satisfied because the “who changed identity and privilege” questions were answerable again, and the fleet stopped grinding itself into paste.
An optimization that backfired: “make audit durable, set flush to SYNC”
A financial services team wanted stronger guarantees: “No audit events lost, ever.” Someone tuned auditd.conf for maximum durability. They set flush = SYNC and tightened space actions so the system would suspend when storage got tight.
On paper, that reads like responsibility. In practice, it turned routine admin work into latency spikes. Every burst of audited events translated into synchronous write pressure. During patch windows, package managers generated dense event storms (execs, config writes). Storage latency climbed. The patch window stretched. People retried. The retry storm made it worse.
Then the backfire: a log pipeline hiccup delayed off-host shipping. Local logs grew faster than expected. The dedicated log filesystem filled earlier than planned. The system suspended auditing right when it mattered—during a messy operational event—because the disk space actions were strict. They didn’t lose the machine, but they lost the audit trail for the exact period auditors cared about.
The eventual configuration was less “pure” and more reliable: incremental async flushing, a carefully sized local retention window, monitoring on log filesystem utilization, and an agreed policy on what happens under disk pressure. “Never lose events” became “don’t lose events in normal operation; under resource exhaustion, preserve the system and alert immediately.” It wasn’t as satisfying as a hard guarantee, but it actually worked.
A boring but correct practice that saved the day: keys, minimal rules, and a rehearsal
A healthcare SaaS team had been burned before, so they did something unglamorous: they wrote an audit ruleset with strict scoping, assigned keys to every rule, and put it under configuration management with peer review. Every rule had an “incident question” attached in the commit message: “Detect SSH config changes,” “Detect privilege escalation paths,” “Detect persistence via cron.”
They also rehearsed. Once a quarter, they ran a tabletop exercise: simulate a compromised admin account, modify a systemd unit drop-in, add a cron job, and change SSH config. Then they verified they could retrieve those events with ausearch quickly, without tribal knowledge.
When a real incident hit—an engineer’s credentials leaked and were used to modify a sudoers include—the response was boring in the best way. The on-call pulled the priv-esc key summary, extracted the exact change window, and confirmed the actor account via auid. They rolled back, rotated credentials, and had an audit trail that was easy to explain to leadership.
No one celebrated the audit ruleset. It didn’t have a dashboard. It didn’t “integrate with AI.” It just worked, and it didn’t destroy disks in the process.
Common mistakes: symptom → root cause → fix
1) Symptom: Disk write IOPS skyrockets right after enabling auditd
Root cause: Broad syscall rules (often open/openat or directory watches on hot paths) creating massive event volume.
Fix: Remove broad file access syscalls; replace with targeted -w watches on critical files and privileged execve rules filtered by auid and euid. Use keys to identify top offenders.
2) Symptom: lost counter increases; journald shows “backlog limit exceeded”
Root cause: Audit event production exceeds consumption; auditd can’t drain fast enough; backlog overflows.
Fix: First reduce event rate (rule scoping). Second increase backlog_limit for bursts. Third check CPU contention and IO latency; auditd writing blocked can cascade into backlog overflow.
3) Symptom: Host latency spikes during patching or CI jobs
Root cause: Auditing privileged exec or file changes broadly; package managers and build tools trigger bursts. Often combined with flush = SYNC.
Fix: Keep privileged exec auditing but scope it (real users only). Use INCREMENTAL_ASYNC flushing. Consider excluding known automation accounts if policy allows.
4) Symptom: Root filesystem fills, system becomes unstable
Root cause: Audit logs on root FS, KEEP_LOGS, or rotation mis-sized to real event rates. Sometimes remote shipping down + local retention too large.
Fix: Put /var/log/audit on dedicated filesystem. Use ROTATE. Set retention based on measured event rate and a realistic outage window. Monitor utilization.
5) Symptom: You can’t change rules during an incident
Root cause: Immutable mode enabled at boot (-e 2), sometimes without an operational plan.
Fix: Don’t enable immutable mode until rules are validated under production load and rollback is rehearsed. If it’s already enabled, your only real option is reboot with corrected rules.
6) Symptom: Audit logs are massive, but you still can’t answer “who changed X?”
Root cause: No keys, unfocused rules, and a belief that volume equals visibility.
Fix: Rewrite rules around incident questions. Use keys. Test by running a few controlled changes and verifying retrieval with ausearch/aureport.
7) Symptom: High IO even when the system is “idle”
Root cause: Watching high-churn directories like /var/log, /tmp, or container overlay storage. “Idle” is a lie; background agents churn.
Fix: Remove directory watches and audit specific critical files instead. If you need container audit, use runtime-specific telemetry, not blanket path watches.
Checklists / step-by-step plan
Plan A: you’re enabling auditd on Debian 13 for the first time
- Define incident questions. Examples: “Who changed sudoers?”, “Who modified SSH config?”, “Which human executed root commands?”
- Start with a minimal ruleset. Watches for identity/auth/persistence; privileged exec filtered by auid.
- Tag everything with keys. You're building a diagnostic interface, not just a log file.
- Set flush = INCREMENTAL_ASYNC and choose freq. Start at 50; tune based on event rate and acceptable loss window.
- Separate /var/log/audit if the host is busy or critical. Keep the blast radius small.
- Set rotation sanely. max_log_file_action = ROTATE, measured sizes, retention aligned to shipping.
- Load rules with augenrules and verify. Ensure no duplicates or legacy files get loaded.
- Load test under real workload. Watch lost, backlog, IO latency, and service p99 latency.
- Decide failure policy. Under disk full or audit failure: alert, suspend, or halt? Put it in writing.
- Only then consider immutable mode. If you can’t reboot quickly, don’t lock yourself out of fixes.
Plan B: auditd is already deployed and disks are screaming
- Confirm lost events/backlog behavior. If losing events, you’re already failing the main purpose.
- Find the hottest keys and rules. Use aureport --key and auditctl -l.
- Delete the obvious foot-guns. Directory watches on high-churn paths, broad open/openat syscall rules.
- Switch to incremental async flush if safe. If policy forbids, you must reduce volume harder.
- Move audit logs to a dedicated filesystem. Do it during a maintenance window if needed. This is risk reduction.
- Re-measure. IO, backlog, lost events, and your ability to answer key incident questions.
FAQ
1) Will flush = INCREMENTAL_ASYNC make audit logs “unsafe”?
It increases the window of events that could be lost on a sudden crash. For most production systems, that trade is acceptable if you keep event volume sane and ship logs off-host quickly.
2) Should I audit every execve on the system?
No. Audit privileged execve (euid=0) initiated by real users (auid≥1000). That captures most meaningful escalation activity without drowning you.
3) Why is watching a directory so much worse than watching a file?
Directories with churn create a steady stream of create/rename/attribute changes. On log or temp directories, that churn is constant. Watching a handful of specific files yields high signal at low volume.
4) Can I just rely on journald instead of auditd?
Not for kernel audit semantics. Journald is great for service logs. Auditd provides structured security event capture from the kernel audit subsystem with different guarantees and fields.
5) What’s the fastest proof that my ruleset is useful?
Perform a controlled change: edit /etc/ssh/sshd_config, run visudo or modify a file in /etc/sudoers.d, then retrieve events with ausearch -k ssh-config and ausearch -k priv-esc.
6) Does moving /var/log/audit to its own filesystem really help performance?
Sometimes. The biggest win is isolation: you stop audit writes from competing with application IO and you reduce the risk of filling root. Performance improvement depends on your storage stack.
7) How big should backlog_limit be?
Big enough to absorb bursts without dropping, small enough that you notice sustained overload. Start with a few thousand to tens of thousands depending on event rate and memory, then measure backlog behavior during peak jobs.
8) Should I enable immutable mode on all servers?
No. Use it selectively where the operational cost of “must reboot to fix audit rules” is acceptable and rehearsed. Otherwise it turns misconfigurations into prolonged outages.
9) We need “everything” for compliance. What do I tell auditors?
Define “everything” as “everything relevant to security controls.” Provide evidence: rules map to controls (identity changes, privilege escalation, persistence), event loss is monitored, and logs are shipped off-host with retention.
10) Why do I see multiple records (SYSCALL, PATH, CWD) per action?
Audit events are composed of multiple records describing syscall context, working directory, and file paths. That’s normal—and it’s why event volume grows quickly when you audit frequent syscalls.
Conclusion: next steps you can execute today
If your Debian 13 audit deployment is hurting disks, don’t start by tweaking magic numbers. Start by shrinking the event firehose. Remove directory watches on churny paths. Stop auditing broad file access syscalls unless you have a narrowly defined target and a measured budget for it. Keep the rules that answer real incident questions, and tag them with keys so you can prove it.
Then make persistence less painful: incremental async flushing, sane rotation, dedicated filesystem for audit logs, and a clear policy for what happens when resources run out. Finally, rehearse: make a small change to a sensitive file and confirm you can retrieve the audit trail quickly. That’s what “sane auditing” looks like—useful under pressure, and not actively trying to murder your storage.