You’re SSH’d into an Ubuntu 24.04 server that “has plenty of space” according to df -h.
Yet every deploy fails, logs won’t rotate, apt can’t unpack, and the kernel keeps throwing
No space left on device like it’s getting paid per error.
This is one of those outages that makes smart people doubt their eyes. The disk is not full.
Something else is. Usually: inodes. And once you understand what that means, the fix is almost boring.
Almost.
What you’re seeing: “disk full” with free GBs
In Ubuntu, “disk full” often means one of three things:
- Block space exhaustion: the classic case; you ran out of bytes.
- Inode exhaustion: you ran out of file metadata entries; bytes can remain free.
- Reserved blocks, quotas, or filesystem corruption: you have space but can’t use it.
Inode exhaustion is the sneaky one because it doesn’t show up in the first command everyone runs.
df without flags only reports blocks used/available. You can have 200 GB free and still be
unable to create a 0-byte file. The filesystem can’t allocate a new inode, so it can’t create a new file.
You’ll notice weird side effects:
- New log files can’t be created, so services crash or stop logging right when you need them.
- apt fails mid-install because it can’t create temp files or unpack archives.
- Docker builds start failing on “writing layer” operations even though volumes look fine.
- Some apps report “disk full” while others keep working (because they’re not creating files).
There’s a simple test: try to create a file in the affected filesystem. If it fails with “No space left”
while df -h shows free space, stop arguing with df and check inodes.
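If you want that test as a one-liner, here’s a minimal sketch, assuming the suspect mount is /data (point it at whatever path is actually failing):
touch /data/.inode-probe && rm -f /data/.inode-probe && echo "writes OK"
df -i /data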
Inodes explained like you run production
An inode is the filesystem’s record for a file or directory. It’s metadata: owner, permissions, timestamps,
size, pointers to data blocks. In many filesystems, the filename is not stored in the inode; it lives in
the directory entries that map names to inode numbers.
The important operational truth: most Linux filesystems have two separate “budgets”:
blocks (bytes) and inodes (file count). If you spend either budget, you’re done.
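A quick sketch to eyeball both budgets for one mount at once (the root filesystem here is just an example):
df -h /     # the bytes budget
df -i /     # the file-count budget
stat -f /   # block and inode totals/free in one view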
Why inodes run out in real systems
Inode exhaustion is usually not “lots of big files.” It’s “millions of tiny files.”
Think caches, mail spools, build artifacts, container layers, CI workspaces, temporary uploads,
and metrics buffers that someone forgot to expire.
A 1 KB file still costs one inode. A 0-byte file still costs one inode. A directory costs one inode too.
When you have 12 million small files, the disk might be mostly empty in bytes, but the inode table is toast.
Which filesystems are most likely to bite you
- ext4: common on Ubuntu; inodes are typically created at format time based on an inode ratio. If you guessed wrong, you can run out.
- XFS: inodes are more dynamic; inode exhaustion is less common but not mythical.
- btrfs: metadata allocation is different; you can still hit metadata space issues, but it’s not the same “fixed inode count” story.
- overlayfs (Docker): not a filesystem type by itself, but it amplifies “many files” behavior in container-heavy hosts.
One quote worth keeping on your mental runbook:
“Hope is not a strategy.” — General Gordon R. Sullivan
Fast diagnosis playbook (first/second/third)
When the alert says “No space left on device” but graphs say you’re fine, don’t wander.
Work the playbook.
First: confirm what “space” is actually exhausted
- Check blocks (df -h) and inodes (df -i) for the affected mount.
- Try creating a file on that mount; confirm the error path.
- Check for a read-only remount or filesystem errors in dmesg.
Second: find the mount and the top inode consumers
- Identify which filesystem path is failing (logs, temp, data directory).
- Find high file-count directories using find and a couple of targeted counts.
- If it’s Docker/Kubernetes, check overlay2, images, containers, and logs.
Third: free inodes safely
- Start with obvious safe cleanups: old logs, caches, temp files, journal vacuum, container garbage.
- Remove files, not directories, if the app expects directory structure.
- Confirm inode usage drops (df -i) and services recover.
You don’t need heroics. You need a controlled deletion plan and a postmortem that stops it happening again.
Practical tasks: commands, output meaning, decisions
Below are real tasks you can run on Ubuntu 24.04. Each one includes what to look for and what decision to make.
Do them in order if you’re on-call; cherry-pick if you already know the mount.
Task 1: Confirm block usage (the obvious check)
cr0x@server:~$ df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda2 ext4 200G 62G 128G 33% /
tmpfs tmpfs 3.1G 1.2M 3.1G 1% /run
/dev/sdb1 ext4 1.8T 1.1T 648G 63% /data
Meaning: Block space is fine on / and /data.
Decision: If you still get “No space left”, move to inode checks. Don’t waste time hunting big files yet.
Task 2: Check inode usage (the actual culprit half the time)
cr0x@server:~$ df -iT
Filesystem Type Inodes IUsed IFree IUse% Mounted on
/dev/sda2 ext4 13107200 13107190 10 100% /
tmpfs tmpfs 790000 420 789580 1% /run
/dev/sdb1 ext4 122142720 982134 121160586 1% /data
Meaning: Root filesystem has IUse% 100%. Only 10 inodes free. That’s game over for creating files.
Decision: You must delete files on / (or move them off) to free inodes. Deleting one huge file barely helps: no matter how large it is, it only returns one inode.
Task 3: Reproduce the failure in a controlled way
cr0x@server:~$ touch /tmp/inode-test-file
touch: cannot touch '/tmp/inode-test-file': No space left on device
Meaning: The filesystem can’t allocate an inode for a tiny file.
Decision: Treat this as an availability incident. Anything needing writes (logs, sockets, PID files, temp files) may fail next.
Task 4: Identify what path is on the impacted filesystem
cr0x@server:~$ findmnt -T /var/log
TARGET SOURCE FSTYPE OPTIONS
/ /dev/sda2 ext4 rw,relatime,errors=remount-ro
Meaning: /var/log is on /. If logs exploded into millions of files, that’s your battlefield.
Decision: Focus your search under /var, /tmp, /var/lib, and any app directories on /.
Task 5: Spot the usual suspects by directory size (bytes)
cr0x@server:~$ sudo du -xh --max-depth=1 /var | sort -h
12M /var/cache
180M /var/log
2.1G /var/lib
2.4G /var
Meaning: Byte usage is not extreme. This is your first clue that file count, not file size, is the issue.
Decision: Stop optimizing for GB. Start optimizing for file count.
Task 6: Find directories with huge file counts (top-level sweep)
cr0x@server:~$ sudo bash -lc 'for d in /var/* /tmp /home; do [ -d "$d" ] && printf "%s\t" "$d" && find "$d" -xdev -type f 2>/dev/null | wc -l; done | sort -n -k2 | tail -n 10'
/home 88
/tmp 120
/var/cache 1320
/var/log 5402
/var/lib 12877190
Meaning: /var/lib has ~12.8 million files. That’s not “a little messy”; that’s your inodes.
Decision: Zoom into /var/lib. If this is a container host, expect /var/lib/docker or /var/lib/containerd.
Task 7: Narrow down inside /var/lib quickly
cr0x@server:~$ sudo bash -lc 'for d in /var/lib/*; do [ -d "$d" ] && printf "%s\t" "$d" && find "$d" -xdev -type f 2>/dev/null | wc -l; done | sort -n -k2 | tail -n 10'
/var/lib/systemd 2200
/var/lib/dpkg 9800
/var/lib/docker 12866012
Meaning: Docker is eating your inode budget via layers, build cache, and container logs.
Decision: Decide whether you can prune safely right now. If this host is a single-purpose node, pruning is usually correct. If it’s a pet server with unknown workloads, be more careful.
Task 8: Check Docker’s view (bytes, not inodes, but still helpful)
cr0x@server:~$ sudo docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 48 12 23.4GB 18.1GB (77%)
Containers 65 18 4.2GB 2.9GB (69%)
Local Volumes 26 20 11.8GB 1.1GB (9%)
Build Cache 152 0 6.4GB 6.4GB
Meaning: There’s plenty to prune, but this output doesn’t show file counts.
Decision: Proceed with prune actions if you understand the impact: stopped containers, dangling images, build cache.
Task 9: Prune Docker safely (start conservative)
cr0x@server:~$ sudo docker image prune -a -f
Deleted Images:
deleted: sha256:1a2b3c...
deleted: sha256:4d5e6f...
Total reclaimed space: 17.8GB
Meaning: Unused images removed; inode count should drop too because layers contain many small files.
Decision: Recheck df -i. If inode usage doesn’t move enough, prune containers and build cache next.
Task 10: Prune build cache (often a silent inode factory)
cr0x@server:~$ sudo docker builder prune -a -f
Deleted build cache objects:
k2r1m3n4o5p6...
Total reclaimed space: 6.4GB
Meaning: CI/build hosts love leaving behind cache shards.
Decision: If you need reproducible builds and fast rebuilds, schedule this cleanup rather than running it constantly. For now, you’re stabilizing the node.
Task 11: Confirm inode recovery (this is your success metric)
cr0x@server:~$ df -i /
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda2 13107200 8123400 4983800 63% /
Meaning: You’re back from the cliff. Services can create files again.
Decision: Restart any services that crashed due to failed writes, and put guardrails in place so you don’t relive this.
Task 12: If logs are the culprit, clean and cap them (systemd-journald)
cr0x@server:~$ sudo journalctl --disk-usage
Archived and active journals take up 1.7G in the file system.
Meaning: Journals are not necessarily inode-heavy (they’re fewer, larger files), but can still be a pressure point.
Decision: If bytes matter too, vacuum. If inodes matter, focus on apps creating lots of separate log files, not the journal itself.
cr0x@server:~$ sudo journalctl --vacuum-time=7d
Vacuuming done, freed 1.2G of archived journals from /var/log/journal.
Meaning: Bytes freed. Inodes freed only modestly.
Decision: Set persistent limits in journald config if this host is prone to noisy logs.
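If you do want persistent caps, a minimal sketch using a journald drop-in; the 1G size and 7-day retention are placeholder values to adapt, not recommendations:
sudo mkdir -p /etc/systemd/journald.conf.d
sudo tee /etc/systemd/journald.conf.d/90-retention.conf >/dev/null <<'EOF'
[Journal]
SystemMaxUse=1G
MaxRetentionSec=7d
EOF
sudo systemctl restart systemd-journald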
Task 13: If apt is failing, clean package caches
cr0x@server:~$ sudo apt-get clean
Meaning: Clears downloaded package archives under /var/cache/apt/archives.
Decision: Good hygiene, but usually not an inode silver bullet. Helps when caches include many small partial files.
Task 14: Find large file-count directories with du (inode-style)
cr0x@server:~$ sudo du -x --inodes --max-depth=2 /var/lib | sort -n | tail -n 10
1200 /var/lib/systemd
9800 /var/lib/dpkg
12866012 /var/lib/docker
12877190 /var/lib
Meaning: This is the money view: inode consumption per directory.
Decision: Target the top consumer. Don’t “clean a little everywhere.” You’ll waste time and still be at 100%.
Task 15: When in doubt, inspect for pathological fan-out
cr0x@server:~$ sudo find /var/lib/docker -xdev -type f -printf '%h\n' 2>/dev/null | sort | uniq -c | sort -n | tail -n 5
42000 /var/lib/docker/containers/8a7b.../mounts
78000 /var/lib/docker/overlay2/3f2d.../diff/usr/lib
120000 /var/lib/docker/overlay2/9c1e.../diff/var/cache
250000 /var/lib/docker/overlay2/b7aa.../diff/usr/share
980000 /var/lib/docker/overlay2/2d9b.../diff/node_modules
Meaning: A container layer with node_modules can generate absurd file counts.
Decision: Fix the build (multi-stage builds, prune dev deps, .dockerignore) and/or move Docker’s data root to a filesystem designed for this workload.
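One hedged way to quantify the fan-out before and after a build fix is to count files inside the image itself; your-app:latest is a placeholder, and the image must ship a shell:
docker run --rm --entrypoint /bin/sh your-app:latest -c 'find / -xdev -type f | wc -l'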
Task 16: Confirm filesystem type and inode provisioning details
cr0x@server:~$ sudo tune2fs -l /dev/sda2 | grep -Ei 'Filesystem features|Inode count|Inode size|Block count|Reserved block count'
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Inode count: 13107200
Inode size: 256
Block count: 52428800
Reserved block count: 2621440
Meaning: ext4 has a fixed inode count here. You can’t “add more inodes” without rebuilding the filesystem.
Decision: If this host’s job is “millions of small files,” plan a migration: new filesystem with a different inode ratio, or a different storage layout.
Joke #1: Inodes are like meeting rooms: you can have an empty building and still be “full” if every room is booked by a single sticky note.
Three corporate mini-stories (anonymized, painfully real)
1) Incident caused by a wrong assumption: “df says we’re fine”
A mid-sized SaaS company ran a fleet of Ubuntu servers handling webhook ingestion. The engineers monitored disk usage in percent,
alerted at 80%, and felt proud: “We never fill disks anymore.” On a Tuesday afternoon, the ingestion pipeline started returning
intermittent 500s. Retries piled up, queues backed up, dashboards lit up.
The on-call did the standard routine: df -h looked healthy. CPU was fine. Memory wasn’t great but survivable.
They restarted a service and it died immediately because it couldn’t create a PID file. That error message finally showed its face:
No space left on device.
Someone suggested “maybe the disk lied,” which is a charmingly human way of saying “we didn’t measure the right thing.”
They ran df -i and found the root filesystem at 100% inode usage. The offender wasn’t a database. It was a
“temporary retry store” implemented as one JSON file per webhook event under /var/lib/app/retry/.
Each file was tiny. There were millions.
The fix was immediate: delete files older than a threshold and restart. The real fix took a sprint:
migrate the retry store to a queue designed for this, stop using the filesystem as a low-rent database, and add inode alerts.
The postmortem title was polite. The internal chat was not.
2) Optimization that backfired: “cache everything on local disk”
A data engineering team sped up ETL jobs by caching intermediate artifacts to local SSD.
They switched from “one artifact per batch” to “one artifact per partition,” because parallelism.
Performance improved. Costs looked better. Everyone moved on.
Weeks later, nodes started failing in a staggered pattern. Not all at once, which made it harder.
Some jobs succeeded, others failed at random when trying to write output. The errors were inconsistent:
Python exceptions, Java IO errors, occasional “read-only filesystem” after the kernel remounted a troubled disk.
The root cause was embarrassingly mechanical: caching created tens of millions of tiny files.
The ext4 filesystem had been formatted with a default inode ratio suitable for general purpose use, not for “millions of shards.”
The nodes didn’t run out of bytes; they ran out of file identities. The “optimization” was effectively an inode stress test.
They “fixed” it by adding disk space. It barely helped: growing an ext4 filesystem only adds inodes for the new space at the same sparse ratio, so the real constraint hardly moved.
They then reformatted with a more appropriate inode density and changed the caching strategy to pack partitions into tar-like bundles.
Performance regressed slightly. Reliability improved dramatically. That’s a trade you take.
3) Boring but correct practice that saved the day: separate filesystems and guardrails
Another organization ran mixed workloads on Kubernetes nodes: system services, Docker/containerd, and some local scratch space.
They had one rule: anything that can explode in file count gets its own filesystem. Docker lived on /var/lib/docker
mounted from a dedicated volume. Scratch lived on a separate mount with aggressive cleanup policies.
They also had two boring monitors: “block usage” and “inode usage.” No fancy ML. Just two time series and alerts that paged
before the cliff. They tested the alerts quarterly by creating a temporary inode storm in staging (yes, that’s a thing).
One day a new build pipeline started producing pathological layers with huge dependency trees.
Inodes on the Docker volume climbed fast. The alert fired early. The on-call didn’t have to learn anything new under stress.
They pruned, rolled back the pipeline, and raised the limits. The rest of the node stayed healthy because root wasn’t involved.
The incident report was short. The fix was boring. Everyone slept.
That’s the whole point of SRE.
Interesting facts and a little history (because it explains the failure modes)
- Inodes come from early Unix: the concept dates back to the original Unix filesystem design, where metadata and data blocks were separate structures.
- Traditional ext filesystems pre-allocate inodes: ext2/ext3/ext4 typically decide inode count at mkfs time based on an inode ratio, not dynamically per workload.
- Default inode ratios are a compromise: they aim for general-purpose workloads; they’re not tailored for container layers, CI caches, or maildir explosions.
- Directories cost inodes too: “We only created directories” is not a defense; each directory is also an inode consumer.
- “No space left on device” is overloaded: the same error string can mean out of blocks, out of inodes, quota exceeded, or even a filesystem flipped read-only after errors.
- Reserved blocks exist for a reason: ext4 usually reserves a percentage of blocks for root, intended to keep the system usable under pressure; it doesn’t reserve inodes the same way.
- Small-file workloads are harder than they look: metadata operations dominate; inode and directory lookup efficiency can matter more than throughput.
- Container images magnify tiny-file patterns: language ecosystems with huge dependency trees (Node, Python, Ruby) can create layers with massive file counts.
- Some filesystems shifted toward dynamic metadata: XFS and btrfs handle metadata differently, which changes the shape of “full” failures, but doesn’t eliminate them.
Fixes: from quick cleanup to permanent prevention
Immediate stabilization (minutes): free inodes without making it worse
Your job during an incident is not “make it pretty.” It’s “make it writable again” without deleting the wrong thing.
Here’s what tends to be safe and effective, in descending order of sanity:
- Delete known ephemeral caches (build cache, package cache, temp files) with commands designed for them.
- Prune container garbage if the host is container-heavy and you can tolerate removing unused artifacts.
- Expire old app-generated files by time, not by guesswork. Prefer “older than N days” policies (a sketch follows this list).
- Move directories off the filesystem if deletion is risky: archive to another mount, then delete locally.
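For the time-based expiry above, a minimal sketch that reuses the retry-store path from the first story as a stand-in for your own hotspot; count first, delete second:
sudo find /var/lib/app/retry -xdev -type f -mtime +7 | wc -l    # how many files a 7-day policy would remove
sudo find /var/lib/app/retry -xdev -type f -mtime +7 -delete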
When it’s log-related: fix the file count problem, not just rotation
Logrotate solves “one file grows forever.” It does not automatically solve “we created one file per request.”
If your application creates unique log files per unit of work (request ID, job ID, tenant ID), you’re doing distributed denial of service
against your own inode table.
Prefer:
- single stream logging with structured fields (JSON is fine, just keep it sane)
- journald integration where appropriate
- bounded local spooling with explicit retention
When it’s Docker: pick a data root that matches the workload
Docker on ext4 can work fine, until it doesn’t. If you know a node will build images, run many containers,
and churn layers, treat /var/lib/docker like a high-churn datastore and give it its own filesystem.
Practical options:
- Separate mount for /var/lib/docker with an inode density that matches the expected file count (see the sketch after this list).
- Clean build cache on a schedule, not manually during outages.
- Fix image builds to reduce file fan-out: multi-stage builds, trim dependencies, use .dockerignore.
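A minimal sketch of repointing Docker’s data root to a dedicated mount, assuming the new volume is already mounted at /mnt/docker-data and /etc/docker/daemon.json is otherwise empty (merge by hand if it isn’t):
sudo systemctl stop docker
sudo rsync -aHX /var/lib/docker/ /mnt/docker-data/
sudo tee /etc/docker/daemon.json >/dev/null <<'EOF'
{
  "data-root": "/mnt/docker-data"
}
EOF
sudo systemctl start docker
Keep the old /var/lib/docker around until you’ve verified that containers and images come back cleanly.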
Permanent fix: design the filesystem for the workload
If inode exhaustion is recurring, you don’t have a “cleanup problem.” You have a capacity planning problem.
On ext4, inode count is fixed at creation time. The only real fix is to migrate to a filesystem with more inodes
(or a different layout), meaning:
- create a new filesystem with a higher inode density
- move data
- update mounts and services
- add monitoring and retention policies
How to build ext4 with more inodes (planned migration)
ext4 inode count is influenced by -i (bytes-per-inode) and -N (explicit inode count); the stock mke2fs default ratio is 16384 bytes per inode.
Lower bytes-per-inode means more inodes. More inodes means more metadata overhead. This is a trade, not free candy.
cr0x@server:~$ sudo mkfs.ext4 -i 8192 /dev/sdc1
mke2fs 1.47.0 (5-Feb-2023)
Creating filesystem with 976754176 4k blocks and 488377344 inodes
Filesystem UUID: 9f1f4a1c-8b1d-4c1b-9d88-8d1aa14d4e1e
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done
Meaning: With -i 8192 (half the default bytes-per-inode), this creates roughly twice as many inodes as a default format would on the same size volume.
Decision: Use this only when you know you need lots of files. For big sequential data, don’t waste metadata.
Joke #2: If you treat the filesystem like a database, it will eventually invoice you in inodes.
Common mistakes: symptom → root cause → fix
1) “df shows 30% used but I can’t write files” → inode exhaustion → check and delete small-file hotspots
- Symptom: writes fail, touch fails, apt fails, services crash creating temp files.
- Root cause: df -i shows 100% inode usage on the mount.
- Fix: identify top file-count directories with du --inodes or find ... | wc -l, remove safe ephemeral files, then prevent recurrence.
2) “Deleted a huge file, still broken” → you freed blocks, not inodes → delete many files instead
- Symptom: GB free increases, but “No space left” continues.
- Root cause: inode usage unchanged.
- Fix: free inodes by deleting file counts, not file sizes. Target caches, spools, build artifacts.
3) “Only root can write; users can’t” → reserved blocks or quotas → verify with tune2fs and quota tools
- Symptom: root can create files, non-root can’t.
- Root cause: reserved block percentage on ext4, or user quotas reached.
- Fix: check reserved blocks with tune2fs; check quotas; adjust carefully (sketch below). Don’t blindly set reserved blocks to 0% on system partitions.
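A hedged sketch of those checks, reusing the /data device from earlier; only relax the reserve on data filesystems, never blindly on /:
sudo tune2fs -l /dev/sdb1 | grep -i 'reserved block count'
sudo tune2fs -m 1 /dev/sdb1    # lower the root-reserved blocks to 1% on a data filesystem
# if quotas are enabled, check them too (repquota -a, from the quota package)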
4) “It became read-only and now everything fails” → filesystem errors → investigate dmesg, run fsck (offline)
- Symptom: the kernel remounted the filesystem with errors=remount-ro; writes fail with read-only errors.
- Root cause: I/O errors or filesystem corruption, not capacity.
- Fix: inspect dmesg; plan a reboot into recovery and run fsck (sketch below). Capacity cleanup won’t fix corruption.
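A sketch of the inspection side, again using the /data filesystem; fsck needs the filesystem unmounted, or a recovery boot if it’s the root filesystem:
sudo dmesg -T | grep -iE 'ext4|i/o error|remount' | tail -n 20
sudo umount /data              # only if nothing critical holds it open
sudo fsck.ext4 -f /dev/sdb1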
5) “Kubernetes node has DiskPressure but df looks fine” → inode pressure from container runtime → prune and separate mounts
- Symptom: pods evicted; kubelet complains; node unstable.
- Root cause: runtime directories fill inode budget (overlay2, logs).
- Fix: prune runtime storage, enforce image garbage collection, put runtime on dedicated volume with monitoring.
6) “We cleaned /tmp; it helped for an hour” → application recreates storm → fix retention at the source
- Symptom: repeated inode incidents after cleanup.
- Root cause: app bug, bad design (one file per event), or missing TTL/rotation.
- Fix: add retention policy, redesign storage (database/queue/object store), enforce caps and alerts.
Checklists / step-by-step plan
On-call checklist (stabilize in 15–30 minutes)
- Run df -hT and df -iT for the failing mount.
- Confirm with touch in the affected path.
- Find the mount mapping with findmnt -T.
- Identify top inode consumers with du -x --inodes --max-depth=2 and targeted find ... | wc -l.
- Pick one cleanup action that is safe and high-impact (Docker prune, cache cleanup, time-based deletion).
- Recheck df -i until you’re under ~90% on the critical mount.
- Restart impacted services (only after writes succeed).
- Capture evidence: commands run, inode counts before/after, directories responsible.
Engineering checklist (prevent recurrence)
- Add inode monitoring and alerts per filesystem (not just percent disk used); a minimal check script is sketched after this checklist.
- Put high-churn directories on dedicated mounts: /var/lib/docker, app spool, build cache.
- Implement retention at the producer: TTLs, caps, periodic compaction, or a different storage backend.
- Review builds and images: reduce layer file count; avoid vendoring huge dependency trees into runtime images.
- If ext4 is used for small-file workloads, design inode density during formatting and document the rationale.
- Run a game day in staging: simulate inode pressure and validate alerts and recovery steps.
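A minimal sketch of the inode alerting idea from the first item; the 90% threshold and mount list are assumptions to adapt, and the echo is where your real alerting hook goes:
#!/usr/bin/env bash
# warn when inode usage crosses a threshold on the listed mounts
THRESHOLD=90
for mount in / /data /var/lib/docker; do
  [ -d "$mount" ] || continue
  usage=$(df --output=ipcent "$mount" | tail -n 1 | tr -dc '0-9')
  if [ -n "$usage" ] && [ "$usage" -ge "$THRESHOLD" ]; then
    echo "WARNING: inode usage on $mount is ${usage}% (threshold ${THRESHOLD}%)"
  fi
done
Run it from cron or a systemd timer and wire the output into whatever pages your on-call.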
Migration checklist (when ext4 inode count is fundamentally wrong)
- Measure current file count and growth rate (daily new files, retention behavior).
- Choose the target: ext4 with higher inode density, XFS, or a different architecture (object storage, database, queue).
- Provision a new volume and filesystem; mount it to the intended path.
- Stop the workload, copy data (preserving ownership/permissions), validate, then cut over.
- Re-enable workload with retention defaults enabled from day one.
- Leave guardrails: alerts, cleanup timers, and a hard cap policy.
FAQ
1) What is an inode in one sentence?
An inode is a filesystem metadata record that represents a file or directory; you need a free inode to create a new file.
2) Why does Ubuntu say “No space left on device” when df -h shows free space?
Because “space” can mean bytes (blocks) or file metadata (inodes). df -h shows blocks; df -i shows inodes.
3) How do I confirm inode exhaustion quickly?
Run df -i on the mount and check for IUse% 100%, then try touch to confirm file creation fails.
4) Can I increase inode count on an existing ext4 filesystem?
Not in a practical online way. ext4 inode count is effectively set at filesystem creation. The real fix is migrating to a new filesystem with more inodes.
5) Why do containers make inode problems more likely?
Image layers and dependency trees can contain huge numbers of small files. Overlay storage multiplies metadata operations, and build caches accumulate quietly.
6) Is deleting one big directory safe?
Sometimes. It’s safer to delete time-bounded files under a known ephemeral path than to delete a directory your service expects. Prefer targeted deletion and verify service configs.
7) What should I monitor to catch this early?
Monitor inode usage per filesystem (df -i metrics) and alert on sustained growth and high-water marks (for example, 85% and 95%), not just disk percent used.
8) If I’m out of inodes, should I reboot?
Rebooting doesn’t create inodes. It might temporarily clear some temp files, but it’s not a fix and can make investigation harder. Free inodes deliberately instead.
9) Why does it sometimes affect only one application?
Because only applications that need to create new files are blocked. Read-heavy services may continue functioning until they try to write logs, sockets, or state.
10) Are journald logs likely to cause inode exhaustion?
Less commonly than “one file per event” application logs. journald tends to store data in fewer larger files, which is more block-heavy than inode-heavy.
Next steps you should actually do
If you take only one habit from this: when you see “No space left on device,” run df -i as automatically as df -h.
The filesystem has two limits, and production doesn’t care which one you forgot to monitor.
Practical next steps:
- Add inode alerts on every persistent mount (especially / and container runtime storage).
- Move high-churn directories onto dedicated filesystems so one bad workload can’t brick the whole node.
- Fix the producer behavior: retention, TTLs, fewer files, better packaging of artifacts.
- For ext4, plan inode density up front for small-file workloads, and document why you chose it.
- Run a small game day: create a controlled inode storm in staging, verify your alerting and cleanup playbook works.
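A staging-only sketch of such a storm, assuming a scratch mount at /mnt/scratch; never aim this at a production filesystem:
mkdir -p /mnt/scratch/inode-storm
( cd /mnt/scratch/inode-storm && seq 1 100000 | xargs touch )   # ~100k inodes consumed
df -i /mnt/scratch                                              # watch your alert fire
rm -rf /mnt/scratch/inode-storm                                 # clean up afterwards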
You don’t need more disk. You need fewer files, better lifecycle controls, and a filesystem layout that matches reality instead of assumptions.