You’re on call. A deploy fails. CI is red. Containers that ran fine yesterday suddenly refuse to start with no space left on device. You SSH in, run df -h, and your disk looks… fine-ish. Or worse: it’s full, and you have no idea what filled it because “we only run a few containers.”
Docker is a great magician. It makes apps appear. It also makes disk disappear—quietly, across multiple layers of storage, logs, caches, and metadata. The trick is knowing where to look, and which cleanups are safe in production.
Fast diagnosis playbook
This is the “get un-stuck in 10 minutes” flow. It prioritizes the checks that tell you whether you have a disk space problem, an inode problem, or a filesystem-specific constraint (overlay quirks, thinp metadata, project quotas).
First: confirm what’s full (bytes vs inodes vs a mount)
- Check free bytes: df -h on the relevant mount (/, /var, /var/lib/docker, and any dedicated Docker data disk).
- Check inodes: df -i. If inodes are at 100%, you can have “no space” with gigabytes free.
- Confirm Docker root: docker info → Docker Root Dir. People check / and forget Docker is on /var (or vice versa).
Second: identify which category is growing
- Docker’s own accounting: docker system df -v to see images, containers, volumes, and build cache.
- Filesystem reality: du -xhd1 /var/lib/docker (or your root dir) to see where bytes really live. Docker’s numbers can lag behind reality, especially with logs.
- Logs: check container JSON logs or journald usage. Logs are the #1 “we didn’t think about that” disk eater.
Third: remediate in the least-destructive order
- Stop the bleeding: rotate logs, cap log drivers, or throttle noisy apps (a per-container log-cap sketch follows this list).
- Free safe space: prune build cache and dangling images. Avoid nuking volumes unless you’re certain.
- Address structural issues: move Docker root to a bigger disk, add monitoring, add quotas, set log retention, and fix CI builder sprawl.
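If one service is the obvious offender and you can recreate it, you can cap its log driver per container without touching daemon-wide config. A minimal sketch, assuming a recreate is acceptable; the container and image names here are illustrative:
cr0x@server:~$ docker run -d --name chatty-app \
    --log-driver json-file \
    --log-opt max-size=50m \
    --log-opt max-file=3 \
    registry.example.com/chatty-app:latest
Log options only apply at container creation, so restarting an existing container is not enough; it has to be recreated.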
Joke #1: Disk is like a hotel minibar—nobody remembers using it until checkout.
What “no space left on device” actually means
The message is a liar by omission. It can mean:
- No free blocks on the filesystem that backs Docker’s writable layer, a volume, or a temp directory.
- No free inodes (you can’t create new files even if you have space).
- Hit a quota (project quotas, XFS quotas, or storage-driver metadata limits).
- Thin pool metadata full (common with old devicemapper setups).
- A different mount is full than the one you checked (e.g., /var is full, / isn’t).
- Overlay filesystem constraints that manifest as space errors (e.g., too many layers, or copy-up behavior exploding usage).
Operationally: treat it as “the kernel refused an allocation.” Your job is to learn which allocation and where.
One quote worth keeping on a sticky note in the data center:
“Hope is not a strategy.” — an operations maxim endlessly repeated in reliability circles
If your disk management strategy is “we’ll prune when it hurts,” you are already running on hope.
Practical tasks: commands, outputs, and decisions
These are real, runnable commands. Each includes what the output means and the decision you make from it. Use them in order, not randomly like a raccoon in a server room.
Task 1: Identify the full filesystem
cr0x@server:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 80G 62G 14G 82% /
/dev/nvme1n1p1 200G 196G 4.0G 99% /var/lib/docker
tmpfs 16G 1.2G 15G 8% /run
Meaning: Your Docker data disk is full (/var/lib/docker at 99%). Root filesystem is not the main issue.
Decision: Focus on Docker’s root dir usage; do not waste time cleaning /.
Task 2: Check inode exhaustion (the sneaky “space” error)
cr0x@server:~$ df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/nvme0n1p2 5242880 841120 4401760 17% /
/dev/nvme1n1p1 13107200 13107200 0 100% /var/lib/docker
Meaning: The Docker filesystem is out of inodes, not blocks. This often happens with millions of tiny files (node_modules, image layer unpacking, build caches).
Decision: Pruning may help short-term, but long-term you likely need an ext4 filesystem recreated with a higher inode density, or a switch to XFS (which allocates inodes dynamically), plus less tiny-file churn.
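Before planning a rebuild, confirm the filesystem type and its inode geometry. A quick sketch, assuming the device from the df output above and an ext4 filesystem (tune2fs only understands ext2/3/4):
cr0x@server:~$ findmnt -no FSTYPE,SOURCE /var/lib/docker
cr0x@server:~$ sudo tune2fs -l /dev/nvme1n1p1 | grep -Ei 'inode count|free inodes'
If findmnt reports xfs, inodes are allocated dynamically and df -i already tells the whole story.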
Task 3: Confirm Docker’s actual root directory
cr0x@server:~$ docker info --format '{{.DockerRootDir}}'
/var/lib/docker
Meaning: Docker agrees it uses /var/lib/docker.
Decision: All subsequent disk analysis should target this path (unless you use an alternate runtime or rootless Docker).
Task 4: Get Docker’s high-level space accounting
cr0x@server:~$ docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 48 12 72.4GB 41.8GB (57%)
Containers 18 7 3.1GB 2.2GB (71%)
Local Volumes 64 9 88.6GB 55.0GB (62%)
Build Cache 214 0 61.3GB 61.3GB
Meaning: Volumes and build cache dominate. This is not primarily “too many containers.”
Decision: Start with build cache prune (usually safe), then audit volumes carefully before pruning.
Task 5: Drill down with verbose Docker accounting
cr0x@server:~$ docker system df -v
Images space usage:
REPOSITORY TAG IMAGE ID CREATED SIZE SHARED SIZE UNIQUE SIZE CONTAINERS
app/api prod 2a1b3c4d5e6f 2 days ago 1.21GB 820MB 390MB 4
app/api old 7f6e5d4c3b2a 3 weeks ago 1.18GB 820MB 360MB 0
Build cache usage:
CACHE ID CACHE TYPE SIZE CREATED LAST USED USAGE SHARED
k9x... regular 2.3GB 2 weeks ago 2 weeks ago 1
...
Meaning: You can spot unused image tags (0 containers) and ancient caches.
Decision: Remove unused images and caches first; consider policies for keeping only N versions per node.
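A simple per-node retention sketch, assuming you want to keep only the three newest local images of one repository (app/api from the listing above) and that older ones can be re-pulled from your registry if needed:
cr0x@server:~$ docker images app/api --format '{{.ID}} {{.CreatedAt}}' \
    | sort -rk2 | awk 'NR>3 {print $1}' | xargs -r docker rmi
Without -f, docker rmi refuses to delete images still referenced by a container, which is the guardrail you want here.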
Task 6: Identify the biggest directories under Docker root (reality check)
cr0x@server:~$ sudo du -xhd1 /var/lib/docker | sort -h
1.1G /var/lib/docker/network
3.6G /var/lib/docker/containers
62G /var/lib/docker/buildkit
112G /var/lib/docker/overlay2
181G /var/lib/docker
Meaning: Overlay2 and buildkit are the big hitters. Containers directory is non-trivial (often logs).
Decision: If containers is big, inspect logs. If buildkit is big, prune build cache. Overlay2 requires careful cleanup via Docker, not manual deletion.
Task 7: Find top container log files (json-file driver)
cr0x@server:~$ sudo find /var/lib/docker/containers -name "*-json.log" -printf "%s %p\n" | sort -nr | head
21474836480 /var/lib/docker/containers/4c2.../4c2...-json.log
9876543210 /var/lib/docker/containers/91a.../91a...-json.log
1234567890 /var/lib/docker/containers/ab7.../ab7...-json.log
Meaning: One container wrote ~20GB of logs. That’s not “a little chatty.” That’s a disk eviction notice.
Decision: Immediately truncate that log (safe short-term), then implement rotation and fix the chatty app.
Task 8: Safely truncate an oversized container log without restarting Docker
cr0x@server:~$ sudo truncate -s 0 /var/lib/docker/containers/4c2.../4c2...-json.log
cr0x@server:~$ sudo ls -lh /var/lib/docker/containers/4c2.../4c2...-json.log
-rw-r----- 1 root root 0 Jan 2 11:06 /var/lib/docker/containers/4c2.../4c2...-json.log
Meaning: You reclaimed space immediately; the file is now empty. The container continues to log.
Decision: Treat this as an emergency bandage. Schedule the proper fix: logging options, log driver choice, or application-level log reduction.
Task 9: Confirm which container maps to the noisy log directory
cr0x@server:~$ docker ps --no-trunc --format 'table {{.ID}}\t{{.Names}}'
CONTAINER ID NAMES
4c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b api-prod-1
Meaning: The worst log offender is api-prod-1.
Decision: Look at the app’s log level, request storms, retries, or error loops. Disk problems are often just a symptom of an upstream failure.
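The reverse lookup belongs in the runbook too: given a container name, Docker reports exactly which log file the json-file driver writes, so you don’t have to eyeball directory hashes.
cr0x@server:~$ docker inspect --format '{{.LogPath}}' api-prod-1
/var/lib/docker/containers/4c2.../4c2...-json.log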
Task 10: Check journald disk usage (if using journald log driver)
cr0x@server:~$ journalctl --disk-usage
Archived and active journals take up 18.7G in the file system.
Meaning: Journald is consuming significant space. This can be Docker logs, system logs, or both.
Decision: Set retention limits in journald configuration and vacuum old logs. Don’t just delete files under /var/log/journal while journald is running.
Task 11: Vacuum journald logs to reclaim space
cr0x@server:~$ sudo journalctl --vacuum-size=2G
Deleted archived journal /var/log/journal/7a1.../system@000...-000...journal
Vacuuming done, freed 16.7G of archived journals on disk.
Meaning: Space was reclaimed safely through journald tooling.
Decision: Implement a persistent journald policy (size/time caps) so this doesn’t return next week.
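A persistent policy is a small drop-in file, not a weekly manual vacuum. A minimal sketch, assuming systemd hosts; the 2G and 14-day caps are example values to tune, not recommendations:
cr0x@server:~$ sudo mkdir -p /etc/systemd/journald.conf.d
cr0x@server:~$ sudo tee /etc/systemd/journald.conf.d/size.conf > /dev/null <<'EOF'
[Journal]
SystemMaxUse=2G
MaxRetentionSec=14d
EOF
cr0x@server:~$ sudo systemctl restart systemd-journald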
Task 12: Prune build cache (usually low-risk, high reward)
cr0x@server:~$ docker builder prune --all --force
Deleted build cache objects:
k9x...
m2p...
Total reclaimed space: 59.8GB
Meaning: You recovered almost 60GB by removing build cache. Builds may be slower until cache warms again.
Decision: If this is a CI builder, schedule periodic pruning or cap cache with policy rather than “panic prune.”
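One way to make the policy stick: a cron entry that caps the cache instead of wiping it, using --keep-storage. A sketch; the weekly schedule and the 20GB ceiling are placeholders to tune:
cr0x@server:~$ echo '0 3 * * 0 root docker builder prune --force --keep-storage 20GB' \
    | sudo tee /etc/cron.d/docker-builder-prune
A systemd timer works just as well; the point is that pruning is scheduled, not an incident response.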
Task 13: Prune unused images (safe-ish, but understand your deploy strategy)
cr0x@server:~$ docker image prune -a --force
Deleted Images:
deleted: sha256:7f6e5d4c3b2a...
deleted: sha256:1a2b3c4d5e6f...
Total reclaimed space: 28.4GB
Meaning: Docker removed images not referenced by any container. If you rely on fast rollback by keeping old images locally, you just removed your safety net.
Decision: On production nodes, consider keeping the last N versions or rely on pulling from a registry with known availability and good caching.
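Docker has no literal “keep the last N” flag, but an age-based filter gets close and is easy to reason about. A sketch, assuming one week (168h) of locally cached images is enough rollback runway for your deploy cadence:
cr0x@server:~$ docker image prune -a --force --filter "until=168h"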
Task 14: Find large volumes and who uses them
cr0x@server:~$ docker volume ls
DRIVER VOLUME NAME
local api_db_data
local prometheus_data
local tmp_ci_run_1738
cr0x@server:~$ sudo du -sh /var/lib/docker/volumes/*/_data | sort -h | tail
6.2G /var/lib/docker/volumes/prometheus_data/_data
48G /var/lib/docker/volumes/api_db_data/_data
71G /var/lib/docker/volumes/tmp_ci_run_1738/_data
Meaning: One “tmp” CI volume is 71GB. That’s probably garbage. The DB volume is large but likely legitimate.
Decision: Audit attachment before deletion: identify which containers use the tmp volume. Do not delete database volumes casually.
Task 15: Map volumes to containers (avoid deleting live state)
cr0x@server:~$ docker ps -a --format '{{.ID}} {{.Names}}' | head
a1b2c3d4e5f6 api-prod-1
d4e5f6a1b2c3 ci-runner-1738
...
cr0x@server:~$ docker inspect -f '{{.Name}} -> {{range .Mounts}}{{.Name}} {{end}}' d4e5f6a1b2c3
/ci-runner-1738 -> tmp_ci_run_1738
Meaning: The large tmp volume belongs to a specific CI runner container (possibly already dead, possibly still used).
Decision: If the container is stopped and the volume is truly ephemeral, remove container and volume. If it’s running, fix the job that’s writing so much.
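To audit the whole host at once instead of one container at a time, run the same inspect format across every container ID:
cr0x@server:~$ docker ps -aq | xargs -r docker inspect -f '{{.Name}} -> {{range .Mounts}}{{.Name}} {{end}}'
Bind mounts show up with blank names and anonymous volumes as long hashes; any named volume you don’t recognize deserves a second look before any prune.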
Task 16: Remove a confirmed-orphan volume
cr0x@server:~$ docker rm -f ci-runner-1738
ci-runner-1738
cr0x@server:~$ docker volume rm tmp_ci_run_1738
tmp_ci_run_1738
Meaning: Container and volume are removed; disk space should drop.
Decision: Add lifecycle automation for CI artifacts so “tmp volumes” don’t become permanent residents.
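One lifecycle pattern: have CI create its scratch volumes with a label, then remove them by that label on a schedule. The label (ci.ephemeral=true) and the volume name below are example conventions, not anything Docker defines:
cr0x@server:~$ docker volume create --label ci.ephemeral=true tmp_ci_run_1739
cr0x@server:~$ docker volume ls -q --filter "label=ci.ephemeral=true" | xargs -r docker volume rm
docker volume rm refuses to delete a volume that is still attached to a container, which is exactly the guardrail you want in automation.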
Task 17: One command to clear the obvious junk (use with judgment)
cr0x@server:~$ docker system prune --all --volumes --force
Deleted Containers:
...
Deleted Images:
...
Deleted Volumes:
...
Total reclaimed space: 132.6GB
Meaning: You just deleted basically everything unused, including volumes. This can be catastrophic if you misclassified “unused.”
Decision: Use this only on disposable hosts (CI, dev builders) or when you have verified volume safety. In production, prefer targeted pruning.
Task 18: Move Docker’s root dir to a bigger disk (the grown-up fix)
When you’re constantly pruning, you’re treating symptoms. Sometimes you need to move the data.
cr0x@server:~$ sudo systemctl stop docker
cr0x@server:~$ sudo rsync -aHAX --numeric-ids /var/lib/docker/ /mnt/docker-data/
cr0x@server:~$ sudo mkdir -p /etc/docker
cr0x@server:~$ sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "data-root": "/mnt/docker-data"
}
EOF
cr0x@server:~$ sudo systemctl start docker
cr0x@server:~$ docker info --format '{{.DockerRootDir}}'
/mnt/docker-data
Meaning: Docker is now using the new data root. If containers fail to start, you likely missed permissions, SELinux contexts, or the rsync flags.
Decision: This is a change-control operation. Do it in a maintenance window, and validate with a canary container first.
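A canary can be one throwaway container that pulls, writes into its writable layer, and exits. A sketch, assuming the alpine image is acceptable on your hosts:
cr0x@server:~$ docker run --rm alpine sh -c 'dd if=/dev/zero of=/probe bs=1M count=64 && echo canary-ok'
Keep the old /var/lib/docker around until the canary and at least one real service have run cleanly on the new data root; rsync is cheap, re-pulling everything is not.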
Three corporate mini-stories from the trenches
Mini-story #1: The incident caused by a wrong assumption (logs “can’t be that big”)
The company was mid-migration from VMs to containers. The core service had been stable for years, and the containerization effort was deliberately minimal: “lift and shift, don’t refactor.” That decision wasn’t wrong. The assumption attached to it was.
They assumed logs were “handled by the platform” because the old VM image had logrotate. In Docker, the app still wrote to stdout/stderr. The platform did handle it—by writing JSON logs to disk, forever, with no rotation. On day one it was fine. On day twenty, one node started returning 500s. The orchestrator kept rescheduling, because “containers are cattle.” Great. The node stayed full because rescheduling didn’t delete the log files fast enough, and the new containers continued logging into the same abyss.
The on-call engineer checked df -h on /, saw 40% free, and declared “not disk.” They missed that Docker lived on /var, and /var was a different mount. A second engineer ran docker system df and saw nothing outrageous—because Docker’s accounting didn’t scream “one log file is 20GB.”
The fix was brutally simple: truncate the log file, cap log size, and lower log level for a hot loop that had been harmless on VMs because logs rotated. The post-incident action was also simple and more important: write down where logs live for each log driver, and alert on growth. This is what “platform work” actually means.
Mini-story #2: The optimization that backfired (BuildKit cache everywhere)
A different team was proud of their CI speed. Builds were down to a few minutes, largely because BuildKit caching was working perfectly. Too perfectly. Their builders were also running some long-lived services (because “we had spare capacity”), and the builders had large local SSDs. It looked efficient: one class of machine, one golden image, everything scheduled anywhere.
Cache grew quietly. Multi-arch builds, frequent dependency updates, and a habit of tagging every commit created a high-churn cache. It didn’t matter for a week. Then a big release branch cut produced a storm of builds and layer variants. The cache ballooned and pushed the disk over the edge during business hours.
The painful part wasn’t the full disk. The painful part was the second-order effect: as disk filled, the builders slowed down, jobs timed out, retries increased load, and the cache grew even faster. The system became a self-feeding loop: the “optimization” made failure more explosive.
The eventual fix was not “prune more.” They separated roles: dedicated builders with scheduled cache capping, dedicated runtimes with strict image retention, and explicit limits for logs. They also stopped pretending that “fast build” is the same KPI as “stable build.”
Mini-story #3: The boring but correct practice that saved the day (quotas and alerts)
A finance-oriented internal platform team had an unpopular habit: they put quotas and alert thresholds on everything. Developers complained, because quotas feel like bureaucracy until you understand blast radius.
They configured log rotation for Docker’s json-file driver and also set journald caps on hosts that used journald. They set alerts on /var/lib/docker usage, inode usage, and on the top-N container log file sizes. The alert noise was low because thresholds were tuned, and alerts had runbooks attached.
One Friday night, a service started spamming an error message due to a downstream credential rotation issue. On other teams’ platforms, that kind of incident becomes “disk full” plus “app down.” On this team’s platform, log files hit their cap, logs rotated, disk stayed healthy, and the on-call got one alert: “service error rate + log volume increase.” They fixed the credential problem. No cleanup panic. No filesystem triage. Boring reliability won, again.
Common mistakes: symptom → root cause → fix
1) “df shows free space, but Docker says no space”
Symptom: Pull/build/start fails with no space left on device; df -h on / shows plenty free.
Root cause: Docker root is on a different mount (/var or dedicated disk), or you’re filling /tmp during builds.
Fix: docker info for Docker Root Dir; run df -h on that mount and on /tmp. Move data-root or expand the correct filesystem.
2) “No space” but you have gigabytes free
Symptom: Writes fail; df -h shows free GBs; errors persist.
Root cause: Inode exhaustion (df -i shows 100%) or thin pool metadata full (devicemapper).
Fix: If inodes: prune tiny-file-heavy caches and rebuild filesystem with appropriate inode density (or use XFS). If devicemapper: migrate to overlay2 or expand thin pool metadata.
3) “docker system prune freed nothing”
Symptom: You pruned, but disk usage barely changed.
Root cause: The culprit is logs or journald, or big named volumes attached to running containers.
Fix: Inspect /var/lib/docker/containers and journald usage; check volume sizes under /var/lib/docker/volumes and map volumes to containers.
4) “We deleted containers, but disk didn’t drop”
Symptom: Removing containers doesn’t free expected space.
Root cause: Volumes persist; images persist; build cache persists; also, deleted-but-open files can keep space allocated until the process exits.
Fix: Check volumes and build cache; if you suspect deleted-but-open files, restart the offender (sometimes Docker daemon or container runtime) after safe cleanup.
5) “Overlay2 directory is huge; can we delete it?”
Symptom: /var/lib/docker/overlay2 dominates disk usage.
Root cause: That’s where image layers and writable layers live. Manual deletion breaks Docker state.
Fix: Use Docker commands to prune unused images/containers; if state is corrupt, plan a controlled wipe-and-recreate for disposable hosts, not production stateful nodes.
6) “After switching to journald logging, disk still fills”
Symptom: You changed the log driver; disk usage continues to grow.
Root cause: journald retention defaults are too permissive, or persistent journal storage is enabled without caps.
Fix: Configure journald size/time limits and validate with journalctl --disk-usage.
7) “CI builders go disk-full weekly”
Symptom: Builder nodes fill predictably.
Root cause: BuildKit cache retention is unbounded; multiple toolchains generate many unique layers; too many tags/branches built on the same node.
Fix: Scheduled docker builder prune; separate builder from runtime; enforce retention and/or rebuild builders periodically (immutable infrastructure actually helps here).
8) “Space is freed, but the service is still broken”
Symptom: You reclaimed disk, but containers still fail to start or behave oddly.
Root cause: Corrupted Docker metadata, partial pulls, or app-level failure that originally caused excessive logging.
Fix: Validate with a known-good container, check daemon logs, and fix the upstream app issue (rate limiting, retry storm, auth failure). Disk was just collateral damage.
Checklists / step-by-step plan
Emergency checklist (production node is full right now)
- Confirm the mount: run df -h and df -i on Docker root and /tmp.
- Stop runaway logs first: find the biggest container log files; truncate the worst offenders; reduce log level if safe.
- Reclaim safe cache: run docker builder prune --all on builder nodes; run docker image prune -a if you understand rollback impact.
- Audit volumes before touching: identify the largest volumes and map them to containers. Remove only confirmed orphan volumes.
- Verify free space: re-run df -h. Keep at least a few GB free; some filesystems and daemons behave badly near 100%.
- Stabilize: restart failing components only after disk pressure is relieved; avoid flapping.
- Write the incident note: what filled disk, how fast it grew, and what policy change prevents it.
Hardening checklist (make it stop happening)
- Set Docker log rotation: cap size and count for json-file logs.
- Set journald retention: cap storage and/or time if using journald.
- Separate concerns: builders and runtimes should not be the same fleet unless you enjoy mystery growth.
- Set pruning policy: scheduled build cache pruning, and image retention rules per host role.
- Move Docker root to dedicated storage: especially on small root filesystems.
- Alert on inodes and bytes: include runbooks that point to these exact commands (a minimal check-script sketch follows this checklist).
- Measure top offenders: biggest volumes, biggest container logs, largest images per host.
- Design for failure: if a downstream breaks and triggers retry storms, your platform should degrade without self-destruction.
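For the “alert on inodes and bytes” item, here is a minimal check-script sketch, assuming GNU df and thresholds you will tune; wire its exit code into whatever alerting you already have (cron, a node-exporter textfile collector, etc.):
cr0x@server:~$ sudo tee /usr/local/bin/docker-disk-check > /dev/null <<'EOF'
#!/usr/bin/env bash
# Exit non-zero when the Docker data filesystem crosses byte or inode thresholds.
set -euo pipefail
MOUNT="${1:-/var/lib/docker}"   # filesystem to watch
BYTES_MAX="${2:-85}"            # max percent of blocks used
INODES_MAX="${3:-85}"           # max percent of inodes used
bytes_used=$(df --output=pcent "$MOUNT" | tail -1 | tr -dc '0-9')
inodes_used=$(df --output=ipcent "$MOUNT" | tail -1 | tr -dc '0-9')
if [ "$bytes_used" -ge "$BYTES_MAX" ] || [ "$inodes_used" -ge "$INODES_MAX" ]; then
  echo "CRITICAL: $MOUNT at ${bytes_used}% bytes / ${inodes_used}% inodes"
  exit 1
fi
echo "OK: $MOUNT at ${bytes_used}% bytes / ${inodes_used}% inodes"
EOF
cr0x@server:~$ sudo chmod +x /usr/local/bin/docker-disk-check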
Recommended baseline Docker daemon settings (practical defaults)
If you use the json-file log driver, set log rotation. This is the single most cost-effective disk control you can do.
cr0x@server:~$ sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}
EOF
cr0x@server:~$ sudo systemctl restart docker
Meaning: Each container’s log file rotates at ~50MB, keeping 5 files (~250MB per container worst case).
Decision: Tune sizes per environment. Production often needs central logging; local logs should be a buffer, not an archive. Note that tee rewrites the whole daemon.json, so if you also set data-root (as in Task 18), keep both keys in the same file.
Interesting facts and historical context
- Fact 1: Early Docker deployments often used devicemapper loopback mode by default, which was slow and prone to “mysterious” space/metadata failures under load.
- Fact 2: Docker’s shift to overlay2 as the common default made storage faster and simpler, but also made copy-up behavior a frequent surprise for teams writing into container filesystems.
- Fact 3: Docker’s default log driver has historically been json-file, which is optimized for simplicity, not for long-term disk hygiene.
- Fact 4: BuildKit’s popularity rose because it made builds faster and more parallel, but the operational tax is cache management—especially on shared builders.
- Fact 5: The phrase “no space left on device” is a generic errno (ENOSPC) returned by the kernel, and it’s used for more than just “disk is full.”
- Fact 6: Inode exhaustion is an old Unix problem that never died; containers brought it back because image extraction and language ecosystems generate huge numbers of small files.
- Fact 7: Many operators learned the hard way that “containers are ephemeral” is not a statement about data. Volumes are state, and state is forever unless you delete it.
- Fact 8: Docker’s own space accounting (docker system df) is useful but not authoritative; the filesystem is the truth, especially for logs and non-Docker temp usage.
FAQ
1) Why does Docker say “no space left on device” when df -h shows space?
Because you checked the wrong mount, or you’re out of inodes, or you hit a quota/metadata limit. Always check Docker root dir and run df -i.
2) Is it safe to run docker system prune -a in production?
Sometimes. It removes unused images, containers, and networks. It can break fast rollback strategies and cause slower redeploys due to image pulls. Use targeted pruning first.
3) Is it safe to run docker system prune --volumes?
Only if you have verified that the “unused” volumes are truly disposable. “Unused” means “not currently referenced,” not “unimportant.” This is how you delete data.
4) Why are my container logs huge?
Because default json-file logging is unbounded unless you set max-size and max-file. Also, a noisy app can generate gigabytes per hour during error loops.
5) If I truncate container logs, will Docker or the app break?
Truncating the json log file is generally safe as an emergency measure. You lose historical logs, and the app keeps logging. Then fix rotation properly.
6) Why does deleting a container not free space?
Because the space is likely in volumes, images, or build cache. Also, space from deleted files may remain allocated if a process still has the file open.
7) Why is /var/lib/docker/overlay2 so big even though I don’t have many images?
Overlay2 includes writable layers and extracted layer contents. A few “large” images plus write-heavy containers can easily dominate disk.
8) What’s the best way to prevent Docker disk incidents on CI builders?
Dedicated builders, scheduled docker builder prune, bounded caches, and rebuilding builders periodically. Treat caches as consumables, not treasures.
9) Can I just delete files under /var/lib/docker manually?
Don’t. Manual deletion often corrupts Docker’s view of the world. Use Docker commands, or do a controlled wipe only on truly disposable hosts.
10) How much free space should I keep on a Docker host?
Enough that pulls/unpacks and log bursts don’t push you to 100%. Practically: keep a buffer of multiple gigabytes and alert well before the cliff.
Conclusion: next steps that actually prevent repeats
When Docker runs out of space, it’s rarely “Docker is big” and almost always “we didn’t manage the boring parts.” Logs, caches, and volumes are boring. They are also where incidents come from.
Your practical next steps:
- Put a cap on logs today (json-file rotation and/or journald retention). This alone eliminates a huge class of outages.
- Define host roles: runtime nodes should not accumulate build caches; builder nodes should have scheduled pruning and predictable rebuilds.
- Alert on bytes and inodes for the Docker root filesystem, plus top container log sizes and largest volumes.
- Stop writing state into writable layers: use volumes intentionally, mount tmpfs for real temporary data, and audit paths your apps write to.
- When you do cleanup, be surgical: caches and unused images first, volumes only with evidence.
Disk is not glamorous. That’s why it wins so many fights. Make it someone’s job—preferably yours, before it becomes your weekend.