Disk hits 92%. Alerts start chirping. Somebody opens a “cleanup” ticket that looks harmless: rotate logs, prune images, delete temp files, vacuum journals. The kind of work you do between meetings.
Two hours later, your database is read-only, your node is evicted, and the postmortem title is something like “Deletion Incident #7”. The offending tool? Not malware. Not an attacker. A trusted utility, doing exactly what it was told.
Why cleanup is uniquely dangerous in production
Cleanup is destructive, usually irreversible, and often performed under time pressure. That’s the trifecta. Add a fourth: cleanup work tends to run with more privilege than it needs. Root, cluster-admin, or the “storage maint” role that quietly has access to everything because it “needs to.”
The real trap is that cleanup looks like housekeeping, not engineering. You don’t get a design review for “rm old stuff.” You don’t run load tests for “vacuum logs.” You don’t stage it, because you’re deleting data and staging seems pointless. All of those assumptions are wrong.
Modern systems don’t store “files.” They store relationships: containers referencing overlay layers, databases referencing WAL segments, services referencing sockets and lockfiles, snapshots referencing block histories, log shippers referencing inode positions, backup tools referencing checksums. When a “cleaner” removes an object, it can break a relationship that was never written down anywhere.
One idea worth repeating, paraphrased from James Hamilton’s work on reliability engineering, because the industry learned it the hard way: small operational changes cause a surprising share of outages; treat them like real deployments.
There’s also the human factor. When disk is full, you feel urgency. Urgency breeds heroics. Heroics breed --force. And --force is the adult version of “I’m sure this is fine.”
Joke #1: Cleanup scripts are like cats—they only obey you when it’s inconvenient, and they always know where the expensive stuff is.
Interesting facts and historical context (short, but worth knowing)
- Unix “everything is a file” made cleanup deceptively simple. But it also made it easy to delete device nodes, sockets, and state files that aren’t “data” until they are.
- Early sysadmins used log rotation long before standard tools existed; ad-hoc “move and truncate” patterns still haunt systems where daemons don’t reopen logs cleanly.
- Journaling filesystems (ext3/ext4, XFS) improved crash recovery, not “oops recovery.” Deletion is still deletion; the journal helps consistency, not forgiveness.
- The shift to container layers created new garbage collectors (image prune, layer GC). A “disk cleanup” can now break scheduling capacity across a whole cluster, not just one host.
- Copy-on-write storage (ZFS, btrfs) turned snapshots into first-class operational tools. It also made “free space” a more complex concept: deleting files may not reclaim blocks if snapshots still reference them.
- Inode exhaustion is a classic “disk full” impostor: you can have plenty of bytes free but zero inodes, usually due to tiny files in temp/log directories.
- POSIX semantics let processes keep writing to deleted files. Space won’t be reclaimed until the last file descriptor closes, so “I deleted the big file” doesn’t mean “disk got space back.”
- Systemd introduced tmpfiles policies that can wipe directories you assumed were persistent, especially if you placed state under /tmp or mis-declared a runtime directory.
How “cleaner” tools go wrong: the failure modes you keep seeing
1) The target expanded: globbing, variables, and “helpful” defaults
A cleanup tool rarely deletes “a thing.” It deletes “whatever matches.” That match can expand. Globs expand. Environment variables expand. Symlinks expand. Mounts appear/disappear. And suddenly your carefully scoped delete becomes a campus-wide bonfire.
Typical culprits: rm -rf $DIR/* where $DIR is empty; find missing a -xdev; a symlink created by an install step; a bind mount that moved content into a “temp” path.
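A minimal defensive sketch against expansion surprises (illustrative: CLEAN_DIR, the pattern, and the age threshold are all hypothetical): fail fast on unset variables, refuse dangerous targets, and stay on one filesystem.

#!/usr/bin/env bash
set -euo pipefail                                 # abort on errors and on unset variables
CLEAN_DIR="${CLEAN_DIR:?CLEAN_DIR is not set}"    # an empty or unset variable must never expand toward "/"
[ "$CLEAN_DIR" != "/" ] || { echo "refusing to clean /" >&2; exit 1; }
# Dry run first: -xdev stays on one filesystem, -print shows what WOULD match
find "$CLEAN_DIR" -xdev -type f -name '*.tmp' -mtime +7 -print

Only when the printed list looks right does -print become -delete, and even then a move into quarantine is the safer ending.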
2) Space reclaimed? Not necessarily: open file descriptors and snapshots
Your disk is full. You delete 30 GB. Disk is still full. Panic levels rise. People delete faster. The real issue is often either:
- Open-but-deleted files: the data remains allocated until the process closes the file descriptor.
- Snapshots: the data remains referenced by a snapshot, so blocks can’t be freed.
Cleaners that “remove old logs” can actually make this worse if they delete files that a long-running daemon still writes to. Now you’ve hidden the log path, you didn’t free space, and you made troubleshooting harder.
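If you genuinely must reclaim space from a file a daemon still holds open, truncating it in place usually frees the blocks immediately without hiding the path (illustrative path; a writer that is not in append mode can leave a sparse file behind, so fix log reopening properly afterwards):

cr0x@server:~$ sudo truncate -s 0 /var/log/app/app.log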
3) “Optimization” cleaners that compete with your workload
Some tools are marketed as if cleanup is passive. It’s not. Scanning directories trashes caches. Hashing files burns CPU. Dedup passes wake up disks. Rebalancing metadata causes I/O storms. A cleaner can become the hottest workload on the box.
In storage terms: you just introduced a background random read workload with poor locality. If the system was already I/O constrained, congratulations: you’ve built a throttle-free benchmark and pointed it at production.
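A cheap mitigation where it applies: run the scan at the lowest CPU and I/O priority so foreground work wins every contest (illustrative path; ionice’s idle class only takes effect with schedulers such as BFQ, so on other setups reach for cgroup I/O limits instead).

cr0x@server:~$ sudo ionice -c3 nice -n19 find /var/cache/app -xdev -type f -mtime +30 -print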
4) “Stateless” assumptions that delete state
Plenty of systems put state in places that look temporary:
/var/tmp, /var/lib, /run, a local cache directory, or “just a file in /tmp” that became a lock, a queue, or a spool.
Cleaners that treat these paths as “safe to wipe” cause weird failures: stuck jobs, duplicate processing, dropped metrics, or slow restarts as caches rebuild under load.
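If a service needs durable scratch space, declare it instead of borrowing /tmp. A minimal sketch for a systemd-managed service (unit name and paths are hypothetical):

# excerpt from a unit file, e.g. /etc/systemd/system/payment-sidecar.service
[Service]
StateDirectory=payment-sidecar    # systemd creates /var/lib/payment-sidecar with the right ownership
PrivateTmp=yes                    # the unit gets its own /tmp, isolated from other processes and removed when it stops

If state truly must live under /tmp, a tmpfiles.d exclusion line (type x) can exempt the path from age-based cleaning; relocating the state is still the better fix.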
5) Retention logic that works until time moves
Retention policies are date math plus edge cases. Daylight saving time. Clock skew. Leap seconds. New year boundaries. The moment you switch log formats or rename a directory, your “delete older than 7 days” might delete everything because it can’t parse dates anymore.
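A retention rule keyed off filesystem mtime instead of parsed filenames sidesteps most of that date math (hypothetical path and pattern):

cr0x@server:~$ sudo find /var/log/app -xdev -type f -name '*.log.*' -mtime +7 -print

Review and count the printed list before swapping -print for -delete, and alert if today’s deletion count is wildly different from yesterday’s.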
6) Tools that are safe alone, dangerous together
Your log shipper keeps a cursor. Your cleaner rotates logs. Your compression job moves files. Your backup job reads them. Each tool is “fine.” Together, they create races: double rotations, truncated archives, missing segments, and duplicate ingestion.
Most cleanup disasters aren’t a single bad command. They’re an orchestration failure among well-meaning tools.
7) Permissions and identity: the “it can’t delete that” myth
People assume a tool running as a service account can’t do real damage. Then somebody added the account to a group “temporarily,” or it runs in a privileged container, or the filesystem is mounted with lax ownership, or ACLs grant more than you think.
Cleanup incidents love privilege creep. It’s quiet. It’s convenient. It’s catastrophic.
Joke #2: “I’ll just run a quick cleanup” is how outages get their cardio.
Fast diagnosis playbook: first/second/third checks
When “cleanup” has gone wrong, you need speed, not elegance. The goal is to identify the real bottleneck before you delete more evidence.
First: verify what resource is actually exhausted
- Bytes vs inodes: a system can be “full” in two different ways.
- Filesystem vs thin pool vs snapshot reserve: “df says 90%” is not the full story on LVM thin, ZFS, or container overlays.
- Node local vs remote storage: Kubernetes evictions care about node filesystem pressure, not your fancy SAN.
Second: identify the biggest consumers (and whether deletion will reclaim space)
- Top directories and files.
- Open-but-deleted files.
- Snapshot references or copy-on-write retention.
Third: assess blast radius and stop the bleeding
- Disable or pause the cleanup job (cron, systemd timers, CI runner).
- Freeze further rotations/deletions that might destroy forensic artifacts.
- Stabilize the system: free some space safely (even 2–5%) to restore normal operations (journald, package managers, databases).
Decision rule you can live by
If you can’t explain why deleting a thing will reclaim space, don’t delete it yet. Measure first. Then delete with a rollback plan (snapshot, backup, or at least a quarantine directory on the same filesystem).
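The quarantine flavor of that rollback plan is cheap, because a rename on the same filesystem is nearly instant (hypothetical file name):

cr0x@server:~$ sudo mkdir -p /var/log/.trash
cr0x@server:~$ sudo mv /var/log/nginx/access.log.3.gz /var/log/.trash/

Empty the quarantine a day later, once nothing has complained.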
Practical tasks: commands, outputs, and decisions (12+)
These are the tasks I reach for when a “cleanup” went sideways. Each includes: a command, what typical output means, and the decision you make next. Treat them as building blocks, not a script you copy blindly.
Task 1: Check byte usage by filesystem (quick triage)
cr0x@server:~$ df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/nvme0n1p2 ext4 220G 214G 1.8G 100% /
tmpfs tmpfs 32G 120M 32G 1% /run
/dev/sdb1 xfs 3.6T 2.1T 1.5T 59% /srv
Meaning: Root is effectively full. /srv is fine, but that doesn’t help if the workload writes to /.
Decision: Focus on reclaiming a few GB on / first to restore stability. Don’t “clean /srv” just because it’s big.
Task 2: Check inode exhaustion (the “df lies” moment)
cr0x@server:~$ df -ih
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/nvme0n1p2 14M 14M 0 100% /
Meaning: You’re out of inodes. Deleting a few large files won’t help if the problem is millions of tiny ones.
Decision: Identify directories with massive file counts (spools, caches, temp dirs). Avoid recursive tooling that will take hours and worsen load.
Task 3: Find which directories consume space (bytes)
cr0x@server:~$ sudo du -xhd1 /var | sort -h
120M /var/cache
3.2G /var/log
14G /var/lib
18G /var
Meaning: /var/lib is the heavyweight. That’s usually application state (databases, container runtimes), not “garbage.”
Decision: Drill into /var/lib with care; consider app-aware cleanup rather than blunt deletion.
Task 4: Find which directories consume inodes (file counts)
cr0x@server:~$ sudo find /var -xdev -type f -printf '.' | wc -c
12984217
Meaning: ~13 million files under /var. That’s a lot. You likely have runaway temp files, cache shards, or a broken rotation scheme.
Decision: Identify the hotspot directory next (don’t delete blind).
Task 5: Pinpoint inode hotspots by directory
cr0x@server:~$ sudo find /var -xdev -mindepth 1 -maxdepth 4 -type f -printf '%h\n' | sort | uniq -c | sort -nr | head
8420000 /var/lib/app/spool
1960000 /var/log/nginx
510000 /var/tmp/session-cache
Meaning: /var/lib/app/spool is exploding. Spools are usually “business logic,” not trash.
Decision: Treat as an incident in the application pipeline; cleanup may be a band-aid, not a fix.
Task 6: Check for open-but-deleted files (space not reclaimed)
cr0x@server:~$ sudo lsof +L1 | head
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
java 2310 app 12w REG 259,2 8.0G 0 9123 /var/log/app/app.log (deleted)
nginx 1882 www 5w REG 259,2 1.2G 0 7742 /var/log/nginx/access.log (deleted)
Meaning: Those processes still hold file descriptors to deleted logs. The bytes are still allocated.
Decision: Restart or signal processes to reopen logs (e.g., systemctl restart or kill -HUP), then re-check df.
Task 7: Confirm journald usage before vacuuming
cr0x@server:~$ sudo journalctl --disk-usage
Archived and active journals take up 4.1G in the file system.
Meaning: Journals are non-trivial. Vacuuming may free space quickly, but you’ll lose forensic logs.
Decision: If you’re mid-incident, snapshot or export relevant logs first; then vacuum to a known retention threshold.
Task 8: Vacuum journald safely to a target size
cr0x@server:~$ sudo journalctl --vacuum-size=800M
Vacuuming done, freed 3.3G of archived journals from /var/log/journal.
Meaning: You reclaimed 3.3G from journals.
Decision: Re-check system health. If this was only a stopgap, fix the root cause that filled the disk.
Task 9: Validate logrotate config without running it
cr0x@server:~$ sudo logrotate -d /etc/logrotate.conf
reading config file /etc/logrotate.conf
including /etc/logrotate.d
reading config file nginx
error: nginx:12 duplicate log entry for /var/log/nginx/access.log
Meaning: Logrotate would behave unexpectedly or fail, leaving logs to grow unbounded.
Decision: Fix the configuration and run logrotate manually once (with caution) after validating.
Task 10: Run logrotate once, verbosely, and watch what it changes
cr0x@server:~$ sudo logrotate -vf /etc/logrotate.conf
rotating pattern: /var/log/nginx/*.log after 1 days (14 rotations)
renaming /var/log/nginx/access.log to /var/log/nginx/access.log.1
compressing log with: /bin/gzip
Meaning: Rotation happened; compression kicked in. If a daemon doesn’t reopen logs, you may now have open-but-deleted files.
Decision: Ensure postrotate actions signal services correctly; verify with lsof +L1.
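A rotation stanza that renames and then tells the daemon to reopen, sketched for nginx (the pid path varies by distro; nginx reopens its logs on USR1):

/var/log/nginx/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    sharedscripts
    postrotate
        [ -f /run/nginx.pid ] && kill -USR1 "$(cat /run/nginx.pid)"
    endscript
}

For daemons that cannot reopen logs at all, copytruncate is the fallback, accepting that a few lines written during the copy can be lost.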
Task 11: Check container runtime disk usage (Docker example)
cr0x@server:~$ sudo docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 48 12 38.2GB 24.7GB (64%)
Containers 21 6 3.1GB 1.0GB (32%)
Local Volumes 16 10 220GB 0B (0%)
Build Cache 12 0 5.6GB 5.6GB
Meaning: The big number is local volumes. Image pruning won’t solve your disk pressure; it’s stateful volumes.
Decision: Audit volumes and owners. Consider application-level retention or moving volumes to a larger mount.
Task 12: Prune safely (don’t nuke in-use objects)
cr0x@server:~$ sudo docker image prune -a --filter "until=168h"
Deleted Images:
deleted: sha256:3d2b...
Total reclaimed space: 12.4GB
Meaning: You reclaimed 12.4GB from images older than 7 days.
Decision: If this is a production node, coordinate with deployment cadence; ensure you’re not pruning images required for quick rollbacks.
Task 13: Check Kubernetes node disk pressure symptoms
cr0x@server:~$ kubectl describe node worker-3 | sed -n '/Conditions:/,/Addresses:/p'
Conditions:
Type Status LastHeartbeatTime Reason Message
DiskPressure True 2026-01-22T10:11:02Z KubeletHasDiskPressure kubelet has disk pressure
Ready True 2026-01-22T10:11:02Z KubeletReady kubelet is posting ready status
Meaning: The node is “Ready” but under DiskPressure; evictions will start and workloads will thrash.
Decision: Free space on the node filesystem used by kubelet/container runtime; do not “cleanup inside pods” as your first move.
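Longer term, the kubelet’s own image garbage collection and eviction thresholds decide when this pressure triggers cleanup; a sketch of the relevant KubeletConfiguration fields (values are illustrative, not recommendations):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 80   # start image GC when disk usage crosses 80%
imageGCLowThresholdPercent: 70    # keep collecting until usage falls back under 70%
evictionHard:
  nodefs.available: "10%"         # hard-evict pods when node filesystem free space drops below 10%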
Task 14: ZFS snapshot reality check (why deletes don’t free space)
cr0x@server:~$ sudo zfs list -o name,used,avail,refer,mountpoint tank/app
NAME USED AVAIL REFER MOUNTPOINT
tank/app 980G 120G 240G /srv/app
Meaning: Dataset “USED” is 980G but “REFER” is 240G. The delta is typically snapshots or child datasets.
Decision: Inspect snapshots before deleting anything else; cleanup might require snapshot retention changes.
Task 15: List snapshots and see what’s holding space
cr0x@server:~$ sudo zfs list -t snapshot -o name,used,refer,creation -s used | tail
tank/app@daily-2026-01-15 22G 240G Thu Jan 15 02:00 2026
tank/app@daily-2026-01-16 27G 240G Fri Jan 16 02:00 2026
tank/app@daily-2026-01-17 31G 240G Sat Jan 17 02:00 2026
tank/app@daily-2026-01-18 35G 240G Sun Jan 18 02:00 2026
tank/app@daily-2026-01-19 39G 240G Mon Jan 19 02:00 2026
tank/app@daily-2026-01-20 44G 240G Tue Jan 20 02:00 2026
tank/app@daily-2026-01-21 48G 240G Wed Jan 21 02:00 2026
Meaning: Snapshots are consuming meaningful space. Deleting files from /srv/app won’t reduce USED much while these remain.
Decision: Adjust retention or replicate snapshots elsewhere before pruning; do it deliberately, not in a panic.
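ZFS will also tell you what a retention change buys before you commit: -n makes zfs destroy a dry run, -v prints what it would do, and the % syntax addresses a range of snapshots (names reused from the listing above):

cr0x@server:~$ sudo zfs destroy -nv tank/app@daily-2026-01-15%daily-2026-01-17

The output lists each snapshot it would destroy plus an estimate of reclaimed space; nothing is removed until you drop the -n.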
Task 16: Identify whether a cleanup crossed filesystem boundaries
cr0x@server:~$ sudo find / -xdev -maxdepth 2 -type d -name 'tmp' -print
/tmp
/var/tmp
Meaning: -xdev keeps you on one filesystem. Without it, a cleanup might traverse into mounted volumes, including backups.
Decision: For any find-based delete, add -xdev unless you can justify crossing filesystems in writing.
Three corporate mini-stories: wrong assumption, backfired optimization, boring win
Mini-story 1: The incident caused by a wrong assumption (“/tmp is always safe”)
A mid-sized company ran a payment pipeline with a Java service and a sidecar that handled encryption. The sidecar wrote short-lived artifacts to /tmp. It was supposed to be temporary: encrypt, transmit, delete. Simple.
Over time, “temporary” became “operational.” The sidecar also used /tmp as a recovery queue: if the upstream API rate-limited, it would stash payloads and retry. Nobody documented this because it wasn’t an intentional feature; it was a pragmatic patch added during a past outage and never revisited.
Then a systems engineer enabled a systemd tmpfiles policy to clean /tmp entries older than a day. Totally normal. It reduced inode churn and kept hosts tidy. On the next weekend, traffic spiked, the sidecar rate-limited more often, and the retry queue grew. It crossed the one-day threshold. The cleaner did its job.
Monday morning was “missing transactions.” Not lost in transit, not rejected—silently deleted from the retry path. The service logs were also unhelpful because the logs referenced request IDs, but the payloads were gone.
The technical fix was boring: move the retry queue to a dedicated directory under /var/lib, manage it with explicit retention, and add metrics on queue depth. The cultural fix mattered more: any “cleanup” that touches default OS temp policies now requires an application owner sign-off. Because the phrase “it’s just /tmp” is how money disappears.
Mini-story 2: The optimization that backfired (aggressive image pruning on CI runners)
Another org ran self-hosted CI runners on beefy machines. Disk usage crept up because builds pulled lots of container images. Someone got clever: a nightly job to prune everything older than 24 hours. The goal was noble—stop paging people for “disk 90%.”
It worked for a week. Then build times got worse. Not a little. Worse enough that developers started retrying jobs, which increased load. The prune job ran at night, but the damage wasn’t “nightly.” Every morning, the first wave of builds pulled images again, hammering the registry and saturating network links shared with production services.
The failure mode was not “we deleted needed things.” It was “we erased locality.” Caches exist because networks and storage are not free. By optimizing for disk, they created a distributed denial of service against their own registry and WAN.
The fix: switch from time-based nuking to capacity-based retention with a floor for commonly used base images. Keep a warm set. Prune when the disk crosses a threshold, not on a calendar. Also: pin the prune job to a cgroup with I/O and CPU limits, because background work shouldn’t get to be the loudest process on the host.
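On systemd hosts, “pin the prune job to a cgroup with limits” can be as small as running it as a transient unit (values illustrative; IOWeight requires cgroup v2 with the io controller enabled):

cr0x@server:~$ sudo systemd-run --unit=nightly-prune -p CPUQuota=20% -p IOWeight=10 docker image prune -af --filter "until=168h"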
Mini-story 3: The boring but correct practice that saved the day (quarantine, then delete)
A SaaS platform had a classic problem: log growth on nodes handling bursty traffic. An engineer proposed deleting older logs directly from /var/log. Another engineer insisted on a “quarantine” directory: move candidate files to /var/log/.trash on the same filesystem, wait 24 hours, then delete.
It sounded wasteful—why keep garbage? But the quarantine step was cheap. Rename operations are fast. It also preserved the ability to recover from mistakes without involving backups, tickets, or awkward conversations.
One day, a service updated its log format and started writing to a new filename that matched the cleanup glob. The cleanup job moved the active log into quarantine. The service kept writing, but now it wrote to a reopened file at the original path. Meanwhile, log shipping broke because its inode cursor followed the moved file. Alerts fired quickly because ingestion dropped.
Because the file was quarantined and not deleted, recovery was simple: move it back, adjust the cleanup match, restart the shipper, and re-ingest the gap. No restore. No data loss. The postmortem had a rare pleasant sentence: “No customer impact.” That sentence is almost always purchased with boring practices no one wants to implement.
Common mistakes: symptom → root cause → fix
Disk is full, you deleted big files, but df doesn’t change
Symptom: df still shows 95–100% after deletion.
Root cause: Open-but-deleted files (process still holds the descriptor) or snapshots holding blocks.
Fix: Check lsof +L1 and restart/reload services; for ZFS/btrfs, inspect snapshots and adjust retention deliberately.
Service crashes after “tmp cleanup,” but disk looks healthier
Symptom: After cleaning /tmp or /var/tmp, a daemon won’t start, jobs disappear, or rate spikes.
Root cause: Application misused temp directories as state (queues, locks, sockets, caches required at runtime).
Fix: Move state to an explicit persistent path; codify cleanup with app awareness; add tests for tmpfiles policies in staging.
Log rotation “works,” but you lost logs or log shipping gaps appear
Symptom: Compressed logs exist, but ingestion gaps or missing lines occur.
Root cause: Rotated/truncated logs without signaling the process; shipper follows inodes and gets confused by moves/truncation patterns.
Fix: Use correct postrotate actions (HUP/reload), and align shipper configuration with rotation method (copytruncate vs rename).
Cleanup job spikes I/O and latency across the host
Symptom: Latency jumps when cleanup runs; iowait climbs; databases complain.
Root cause: Cleaner performs deep directory scans, compression, checksum, or dedup at full speed with no throttling.
Fix: Rate-limit via ionice/nice, schedule off-peak, cap scope, or redesign to incremental cleanup with metrics.
“We pruned images” and now rollbacks are slow or fail
Symptom: Rollback requires pulling images and times out; nodes churn.
Root cause: Aggressive pruning removed images assumed to be cached locally.
Fix: Keep a protected warm set; prune by capacity with minimum retention; align with deployment frequency and rollback policy.
Filesystem corruption fear after mass deletion
Symptom: Apps error, directories look odd, someone says “maybe corruption.”
Root cause: Usually not corruption; it’s missing state, permission changes, or deleted sockets/lockfiles. Real corruption is rarer than people think.
Fix: Validate mounts, permissions, and application state; check kernel logs; run filesystem checks only with a plan and downtime window.
Cleanup ran on the wrong host or wrong mount
Symptom: A node unrelated to the alert got “cleaned,” or a backup mount got emptied.
Root cause: Missing guardrails: no hostname checks, no mount checks, no -xdev, automation pointed at wrong inventory group.
Fix: Add hard safety checks (expected mount UUIDs, environment markers), and require interactive confirmation for destructive actions outside maintenance windows.
Checklists / step-by-step plan: safe cleanup without regrets
Step 0: Stop making it worse (5 minutes)
- Pause the cleanup job: disable cron/systemd timer/CI job temporarily.
- Capture evidence: current disk/inode usage, top directories, lsof +L1, and recent syslog/journal snippets.
- Get the system breathing room: reclaim a small amount safely (journald vacuum or move logs to quarantine).
Step 1: Prove what’s holding space (15–30 minutes)
- Bytes vs inodes: run df -hT and df -ih.
- Locate consumers: du -xhd1 on the full filesystem; for inodes, count files by directory.
- Check open deleted files: lsof +L1.
- If using CoW storage: inspect snapshots and dataset usage (ZFS/btrfs).
Step 2: Choose the least risky relief valve
- Prefer deleting data that can be regenerated: build caches, package caches, temporary artifacts you can re-create.
- Prefer app-aware cleanup: database vacuum via vendor tools, container prune with retention filters, logrotate with correct signals.
- Avoid deleting unknowns: anything in /var/lib without knowing the owner is a trap.
Step 3: Implement guardrails before re-enabling automation
A cleanup job without guardrails is not “automation.” It’s just scheduled risk.
- Quarantine then delete: move candidates to a hidden directory on the same filesystem; delete after a delay.
- Mount verification: check expected mountpoints and filesystem types before running.
- Scope limits: use -xdev, explicit paths, and an explicit maximum depth.
- Dry runs: log what would be deleted. Always.
- Rate limiting: use nice and ionice for background jobs.
- Metrics: alert on growth trends, not just thresholds, and track cleanup actions as events.
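A minimal sketch of a guarded cleanup job that combines these ideas (paths, patterns, and thresholds are hypothetical; the mount check assumes /var/log lives on the root filesystem, so adjust it to your layout):

#!/usr/bin/env bash
set -euo pipefail

SCOPE=/var/log/nginx           # explicit scope; no variables expanding into globs
TRASH=/var/log/.trash          # quarantine on the SAME filesystem, so mv is a cheap rename

# Guardrail: refuse to run if the target is not on the filesystem we expect
[ "$(findmnt -n -o TARGET --target "$SCOPE")" = "/" ] || { echo "unexpected mount for $SCOPE" >&2; exit 1; }

mkdir -p "$TRASH"

# Compute and log the candidate list first; a true dry run stops after this step
find "$SCOPE" -xdev -maxdepth 1 -type f -name '*.gz' -mtime +14 -print | tee /var/log/cleanup-candidates.log

# Quarantine instead of delete; quarantined files age out roughly a day later (mv preserves mtime)
xargs -r -d '\n' -a /var/log/cleanup-candidates.log mv -t "$TRASH" --
find "$TRASH" -xdev -type f -mtime +15 -delete

Schedule it under nice/ionice or a resource-limited systemd timer, and treat every run’s candidate log as an auditable event.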
Step 4: Make it boring on purpose
The best cleanup process is one nobody talks about because it never surprises anyone. That means:
retention policies documented, owners assigned, and “temporary” directories treated as contracts, not vibes.
FAQ
1) Why does deleting a log file sometimes not free disk space?
Because a process can keep the file open and continue writing to it even after it’s unlinked. The space is reclaimed only when the last file descriptor closes. Use lsof +L1 to find these and restart/reload the process.
2) Is rm -rf always a bad idea?
It’s a sharp tool. The problem is not rm; it’s ambiguity. If you can’t precisely define the target and prove it’s correct, don’t use recursive force deletion in production. Prefer quarantine moves and app-aware cleanup.
3) What’s the safest “emergency space” to reclaim quickly?
Usually journald archives, package manager caches, or known build caches—things that regenerate. Avoid deleting anything that looks like state (/var/lib, database directories, volume mounts) unless you’ve confirmed ownership and recovery.
4) Why did cleaning /tmp break an application?
Because the application stored state in a temp path: sockets, locks, queues, or cached metadata required at runtime. Fix by relocating state to an explicit persistent directory and declaring proper runtime directories (especially under systemd).
5) Why does ZFS still show high usage after deleting files?
Snapshots hold references to old blocks. Deleting current files may not free space until snapshots are destroyed. Verify with zfs list -t snapshot and adjust retention carefully—snapshots are often your only fast rollback.
6) Should we prune container images aggressively to keep disks clean?
Not aggressively—strategically. Pruning can erase locality and slow deployments/rollbacks. Use capacity thresholds, retain a warm set, and coordinate with release cadence.
7) How do I prevent a cleanup from crossing into mounted volumes?
Use filesystem boundary protections like find -xdev, and validate mountpoints before running. In automation, add checks for filesystem type and expected device IDs.
8) Is “delete older than N days” reliable?
It’s reliable only if your time parsing and file naming are reliable. DST changes, timezone shifts, and naming changes break retention logic. Prefer file mtime-based deletion with explicit scope and dry runs, and monitor deletion counts.
9) What’s the biggest cultural fix for cleanup disasters?
Treat destructive operations like deployments: review, staging, observability, and rollback. Cleanup is not housekeeping; it’s production change.
Next steps you can actually do this week
If you operate production systems, you don’t need a better “cleaner.” You need fewer surprises. The surprise comes from ambiguity: unknown ownership, unclear retention, and tools that are allowed to delete without proving they’re deleting the right thing.
- Inventory your cleanup mechanisms: cron jobs, systemd timers, CI runner maintenance, container GC, snapshot retention. Write down owners.
- Add a quarantine pattern for anything file-based. Rename/move first, delete later.
- Instrument disk pressure: bytes and inodes, plus growth rates. Trend alerts beat “92% at 3 a.m.”
- Codify “fast diagnosis” into your runbooks: bytes vs inodes, open deleted files, snapshots, container runtime usage.
- Throttle background cleanup: your cleaner should never be the top consumer on the host.
- Decide what is allowed to be deleted and what must be managed by app-aware tooling. If it’s state, treat it like state.
Production cleanliness is not about deleting more. It’s about deleting with intent, with guardrails, and with a way back when intent meets reality.