You’re on call. A deploy fails. CI is red. Containers that ran fine yesterday suddenly refuse to start with no space left on device. You SSH in, run df -h, and your disk looks… fine-ish. Or worse: it’s full, and you have no idea what filled it because “we only run a few containers.”
Docker is a great magician. It makes apps appear. It also makes disk disappear—quietly, across multiple layers of storage, logs, caches, and metadata. The trick is knowing where to look, and which cleanups are safe in production.
Fast diagnosis playbook
This is the “get un-stuck in 10 minutes” flow. It prioritizes the checks that tell you whether you have a disk space problem, an inode problem, or a filesystem-specific constraint (overlay quirks, thinp metadata, project quotas).
First: confirm what’s full (bytes vs inodes vs a mount)
- Check free bytes: df -h on the relevant mount (/, /var, /var/lib/docker, and any dedicated Docker data disk).
- Check inodes: df -i. If inodes are at 100%, you can have “no space” with gigabytes free.
- Confirm Docker root: docker info → Docker Root Dir. People check / and forget Docker is on /var (or vice versa).
Second: identify which category is growing
- Docker’s own accounting: docker system df -v to see images, containers, volumes, and build cache.
- Filesystem reality: du -xhd1 /var/lib/docker (or your root dir) to see where bytes really live. Docker’s numbers can lag behind reality, especially with logs.
- Logs: check container JSON logs or journald usage. Logs are the #1 “we didn’t think about that” disk eater.
Third: remediate in the least-destructive order
- Stop the bleeding: rotate logs, cap log drivers, or throttle noisy apps (a per-container log-cap sketch follows this list).
- Free safe space: prune build cache and dangling images. Avoid nuking volumes unless you’re certain.
- Address structural issues: move Docker root to a bigger disk, add monitoring, add quotas, set log retention, and fix CI builder sprawl.
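If one service is the obvious offender and you can recreate it, you can cap its log driver per container without touching daemon-wide config. A minimal sketch, assuming a recreate is acceptable; the container and image names here are illustrative:
cr0x@server:~$ docker run -d --name chatty-app \
    --log-driver json-file \
    --log-opt max-size=50m \
    --log-opt max-file=3 \
    registry.example.com/chatty-app:latest
Log options only apply at container creation, so restarting an existing container is not enough; it has to be recreated.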
Joke #1: Disk is like a hotel minibar—nobody remembers using it until checkout.
What “no space left on device” actually means
The message is a liar by omission. It can mean:
- No free blocks on the filesystem that backs Docker’s writable layer, a volume, or a temp directory.
- No free inodes (you can’t create new files even if you have space).
- Hit a quota (project quotas, XFS quotas, or storage-driver metadata limits).
- Thin pool metadata full (common with old devicemapper setups).
- A different mount is full than the one you checked (e.g., /var is full, / isn’t).
- Overlay filesystem constraints that manifest as space errors (e.g., too many layers, or copy-up behavior exploding usage).
Operationally: treat it as “the kernel refused an allocation.” Your job is to learn which allocation and where.
One quote worth keeping on a sticky note in the data center:
“Hope is not a strategy.” — an operations maxim endlessly repeated in reliability circles
If your disk management strategy is “we’ll prune when it hurts,” you are already running on hope.
Practical tasks: commands, outputs, and decisions
These are real, runnable commands. Each includes what the output means and the decision you make from it. Use them in order, not randomly like a raccoon in a server room.
Task 1: Identify the full filesystem
cr0x@server:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 80G 62G 14G 82% /
/dev/nvme1n1p1 200G 196G 4.0G 99% /var/lib/docker
tmpfs 16G 1.2G 15G 8% /run
Meaning: Your Docker data disk is full (/var/lib/docker at 99%). Root filesystem is not the main issue.
Decision: Focus on Docker’s root dir usage; do not waste time cleaning /.
Task 2: Check inode exhaustion (the sneaky “space” error)
cr0x@server:~$ df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/nvme0n1p2 5242880 841120 4401760 17% /
/dev/nvme1n1p1 13107200 13107200 0 100% /var/lib/docker
Meaning: The Docker filesystem is out of inodes, not blocks. This often happens with millions of tiny files (node_modules, image layer unpacking, build caches).
Decision: Pruning may help short-term, but long-term you likely need an ext4 filesystem recreated with a higher inode density, or a switch to XFS (which allocates inodes dynamically), plus less tiny-file churn.
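Before planning a rebuild, confirm the filesystem type and its inode geometry. A quick sketch, assuming the device from the df output above and an ext4 filesystem (tune2fs only understands ext2/3/4):
cr0x@server:~$ findmnt -no FSTYPE,SOURCE /var/lib/docker
cr0x@server:~$ sudo tune2fs -l /dev/nvme1n1p1 | grep -Ei 'inode count|free inodes'
If findmnt reports xfs, inodes are allocated dynamically and df -i already tells the whole story.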
Task 3: Confirm Docker’s actual root directory
cr0x@server:~$ docker info --format '{{.DockerRootDir}}'
/var/lib/docker
Meaning: Docker agrees it uses /var/lib/docker.
Decision: All subsequent disk analysis should target this path (unless you use an alternate runtime or rootless Docker).
Task 4: Get Docker’s high-level space accounting
cr0x@server:~$ docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 48 12 72.4GB 41.8GB (57%)
Containers 18 7 3.1GB 2.2GB (71%)
Local Volumes 64 9 88.6GB 55.0GB (62%)
Build Cache 214 0 61.3GB 61.3GB
Meaning: Volumes and build cache dominate. This is not primarily “too many containers.”
Decision: Start with build cache prune (usually safe), then audit volumes carefully before pruning.
Task 5: Drill down with verbose Docker accounting
cr0x@server:~$ docker system df -v
Images space usage:
REPOSITORY TAG IMAGE ID CREATED SIZE SHARED SIZE UNIQUE SIZE CONTAINERS
app/api prod 2a1b3c4d5e6f 2 days ago 1.21GB 820MB 390MB 4
app/api old 7f6e5d4c3b2a 3 weeks ago 1.18GB 820MB 360MB 0
Build cache usage:
CACHE ID CACHE TYPE SIZE CREATED LAST USED USAGE SHARED
k9x... regular 2.3GB 2 weeks ago 2 weeks ago 1
...
Meaning: You can spot unused image tags (0 containers) and ancient caches.
Decision: Remove unused images and caches first; consider policies for keeping only N versions per node.
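A simple per-node retention sketch, assuming you want to keep only the three newest local images of one repository (app/api from the listing above) and that older ones can be re-pulled from your registry if needed:
cr0x@server:~$ docker images app/api --format '{{.ID}} {{.CreatedAt}}' \
    | sort -rk2 | awk 'NR>3 {print $1}' | xargs -r docker rmi
Without -f, docker rmi refuses to delete images still referenced by a container, which is the guardrail you want here.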
Task 6: Identify the biggest directories under Docker root (reality check)
cr0x@server:~$ sudo du -xhd1 /var/lib/docker | sort -h
1.1G /var/lib/docker/network
3.6G /var/lib/docker/containers
62G /var/lib/docker/buildkit
112G /var/lib/docker/overlay2
181G /var/lib/docker
Meaning: Overlay2 and buildkit are the big hitters. Containers directory is non-trivial (often logs).
Decision: If containers is big, inspect logs. If buildkit is big, prune build cache. Overlay2 requires careful cleanup via Docker, not manual deletion.
Task 7: Find top container log files (json-file driver)
cr0x@server:~$ sudo find /var/lib/docker/containers -name "*-json.log" -printf "%s %p\n" | sort -nr | head
21474836480 /var/lib/docker/containers/4c2.../4c2...-json.log
9876543210 /var/lib/docker/containers/91a.../91a...-json.log
1234567890 /var/lib/docker/containers/ab7.../ab7...-json.log
Meaning: One container wrote ~20GB of logs. That’s not “a little chatty.” That’s a disk eviction notice.
Decision: Immediately truncate that log (safe short-term), then implement rotation and fix the chatty app.
Task 8: Safely truncate an oversized container log without restarting Docker
cr0x@server:~$ sudo truncate -s 0 /var/lib/docker/containers/4c2.../4c2...-json.log
cr0x@server:~$ sudo ls -lh /var/lib/docker/containers/4c2.../4c2...-json.log
-rw-r----- 1 root root 0 Jan 2 11:06 /var/lib/docker/containers/4c2.../4c2...-json.log
Meaning: You reclaimed space immediately; the file is now empty. The container continues to log.
Decision: Treat this as an emergency bandage. Schedule the proper fix: logging options, log driver choice, or application-level log reduction.
Task 9: Confirm which container maps to the noisy log directory
cr0x@server:~$ docker ps --no-trunc --format 'table {{.ID}}\t{{.Names}}'
CONTAINER ID NAMES
4c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b api-prod-1
Meaning: The worst log offender is api-prod-1.
Decision: Look at the app’s log level, request storms, retries, or error loops. Disk problems are often just a symptom of an upstream failure.
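The reverse lookup belongs in the runbook too: given a container name, Docker reports exactly which log file the json-file driver writes, so you don’t have to eyeball directory hashes.
cr0x@server:~$ docker inspect --format '{{.LogPath}}' api-prod-1
/var/lib/docker/containers/4c2.../4c2...-json.log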
Task 10: Check journald disk usage (if using journald log driver)
cr0x@server:~$ journalctl --disk-usage
Archived and active journals take up 18.7G in the file system.
Meaning: Journald is consuming significant space. This can be Docker logs, system logs, or both.
Decision: Set retention limits in journald configuration and vacuum old logs. Don’t just delete files under /var/log/journal while journald is running.
Task 11: Vacuum journald logs to reclaim space
cr0x@server:~$ sudo journalctl --vacuum-size=2G
Deleted archived journal /var/log/journal/7a1.../system@000...-000...journal
Vacuuming done, freed 16.7G of archived journals on disk.
Meaning: Space was reclaimed safely through journald tooling.
Decision: Implement a persistent journald policy (size/time caps) so this doesn’t return next week.
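A persistent policy is a small drop-in file, not a weekly manual vacuum. A minimal sketch, assuming systemd hosts; the 2G and 14-day caps are example values to tune, not recommendations:
cr0x@server:~$ sudo mkdir -p /etc/systemd/journald.conf.d
cr0x@server:~$ sudo tee /etc/systemd/journald.conf.d/size.conf > /dev/null <<'EOF'
[Journal]
SystemMaxUse=2G
MaxRetentionSec=14d
EOF
cr0x@server:~$ sudo systemctl restart systemd-journald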
Task 12: Prune build cache (usually low-risk, high reward)
cr0x@server:~$ docker builder prune --all --force
Deleted build cache objects:
k9x...
m2p...
Total reclaimed space: 59.8GB
Meaning: You recovered almost 60GB by removing build cache. Builds may be slower until cache warms again.
Decision: If this is a CI builder, schedule periodic pruning or cap cache with policy rather than “panic prune.”
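One way to make the policy stick: a cron entry that caps the cache instead of wiping it, using --keep-storage. A sketch; the weekly schedule and the 20GB ceiling are placeholders to tune:
cr0x@server:~$ echo '0 3 * * 0 root docker builder prune --force --keep-storage 20GB' \
    | sudo tee /etc/cron.d/docker-builder-prune
A systemd timer works just as well; the point is that pruning is scheduled, not an incident response.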
Task 13: Prune unused images (safe-ish, but understand your deploy strategy)
cr0x@server:~$ docker image prune -a --force
Deleted Images:
deleted: sha256:7f6e5d4c3b2a...
deleted: sha256:1a2b3c4d5e6f...
Total reclaimed space: 28.4GB
Meaning: Docker removed images not referenced by any container. If you rely on fast rollback by keeping old images locally, you just removed your safety net.
Decision: On production nodes, consider keeping the last N versions or rely on pulling from a registry with known availability and good caching.
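Docker has no literal “keep the last N” flag, but an age-based filter gets close and is easy to reason about. A sketch, assuming one week (168h) of locally cached images is enough rollback runway for your deploy cadence:
cr0x@server:~$ docker image prune -a --force --filter "until=168h"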
Task 14: Find large volumes and who uses them
cr0x@server:~$ docker volume ls
DRIVER VOLUME NAME
local api_db_data
local prometheus_data
local tmp_ci_run_1738
cr0x@server:~$ sudo du -sh /var/lib/docker/volumes/*/_data | sort -h | tail
6.2G /var/lib/docker/volumes/prometheus_data/_data
48G /var/lib/docker/volumes/api_db_data/_data
71G /var/lib/docker/volumes/tmp_ci_run_1738/_data
Meaning: One “tmp” CI volume is 71GB. That’s probably garbage. The DB volume is large but likely legitimate.
Decision: Audit attachment before deletion: identify which containers use the tmp volume. Do not delete database volumes casually.
Task 15: Map volumes to containers (avoid deleting live state)
cr0x@server:~$ docker ps -a --format '{{.ID}} {{.Names}}' | head
a1b2c3d4e5f6 api-prod-1
d4e5f6a1b2c3 ci-runner-1738
...
cr0x@server:~$ docker inspect -f '{{.Name}} -> {{range .Mounts}}{{.Name}} {{end}}' d4e5f6a1b2c3
/ci-runner-1738 -> tmp_ci_run_1738
Meaning: The large tmp volume belongs to a specific CI runner container (possibly already dead, possibly still used).
Decision: If the container is stopped and the volume is truly ephemeral, remove container and volume. If it’s running, fix the job that’s writing so much.
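To audit the whole host at once instead of one container at a time, run the same inspect format across every container ID:
cr0x@server:~$ docker ps -aq | xargs -r docker inspect -f '{{.Name}} -> {{range .Mounts}}{{.Name}} {{end}}'
Bind mounts show up with blank names and anonymous volumes as long hashes; any named volume you don’t recognize deserves a second look before any prune.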
Task 16: Remove a confirmed-orphan volume
cr0x@server:~$ docker rm -f ci-runner-1738
ci-runner-1738
cr0x@server:~$ docker volume rm tmp_ci_run_1738
tmp_ci_run_1738
Meaning: Container and volume are removed; disk space should drop.
Decision: Add lifecycle automation for CI artifacts so “tmp volumes” don’t become permanent residents.
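One lifecycle pattern: have CI create its scratch volumes with a label, then remove them by that label on a schedule. The label (ci.ephemeral=true) and the volume name below are example conventions, not anything Docker defines:
cr0x@server:~$ docker volume create --label ci.ephemeral=true tmp_ci_run_1739
cr0x@server:~$ docker volume ls -q --filter "label=ci.ephemeral=true" | xargs -r docker volume rm
docker volume rm refuses to delete a volume that is still attached to a container, which is exactly the guardrail you want in automation.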
Task 17: One command to clear the obvious junk (use with judgment)
cr0x@server:~$ docker system prune --all --volumes --force
Deleted Containers:
...
Deleted Images:
...
Deleted Volumes:
...
Total reclaimed space: 132.6GB
Meaning: You just deleted basically everything unused, including volumes. This can be catastrophic if you misclassified “unused.”
Decision: Use this only on disposable hosts (CI, dev builders) or when you have verified volume safety. In production, prefer targeted pruning.
Task 18: Move Docker’s root dir to a bigger disk (the grown-up fix)
When you’re constantly pruning, you’re treating symptoms. Sometimes you need to move the data.
cr0x@server:~$ sudo systemctl stop docker
cr0x@server:~$ sudo rsync -aHAX --numeric-ids /var/lib/docker/ /mnt/docker-data/
cr0x@server:~$ sudo mkdir -p /etc/docker
cr0x@server:~$ sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "data-root": "/mnt/docker-data"
}
EOF
cr0x@server:~$ sudo systemctl start docker
cr0x@server:~$ docker info --format '{{.DockerRootDir}}'
/mnt/docker-data
Meaning: Docker is now using the new data root. If containers fail to start, you likely missed permissions, SELinux contexts, or the rsync flags.
Decision: This is a change-control operation. Do it in a maintenance window, and validate with a canary container first.
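A canary can be one throwaway container that pulls, writes into its writable layer, and exits. A sketch, assuming the alpine image is acceptable on your hosts:
cr0x@server:~$ docker run --rm alpine sh -c 'dd if=/dev/zero of=/probe bs=1M count=64 && echo canary-ok'
Keep the old /var/lib/docker around until the canary and at least one real service have run cleanly on the new data root; rsync is cheap, re-pulling everything is not.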
Three corporate mini-stories from the trenches
Mini-story #1: The incident caused by a wrong assumption (logs “can’t be that big”)
The company was mid-migration from VMs to containers. The core service had been stable for years, and the containerization effort was deliberately minimal: “lift and shift, don’t refactor.” That decision wasn’t wrong. The assumption attached to it was.
They assumed logs were “handled by the platform” because the old VM image had logrotate. In Docker, the app still wrote to stdout/stderr. The platform did handle it—by writing JSON logs to disk, forever, with no rotation. On day one it was fine. On day twenty, one node started returning 500s. The orchestrator kept rescheduling, because “containers are cattle.” Great. The node stayed full because rescheduling didn’t delete the log files fast enough, and the new containers continued logging into the same abyss.
The on-call engineer checked df -h on /, saw 40% free, and declared “not disk.” They missed that Docker lived on /var, and /var was a different mount. A second engineer ran docker system df and saw nothing outrageous—because Docker’s accounting didn’t scream “one log file is 20GB.”
The fix was brutally simple: truncate the log file, cap log size, and lower log level for a hot loop that had been harmless on VMs because logs rotated. The post-incident action was also simple and more important: write down where logs live for each log driver, and alert on growth. This is what “platform work” actually means.
Mini-story #2: The optimization that backfired (BuildKit cache everywhere)
A different team was proud of their CI speed. Builds were down to a few minutes, largely because BuildKit caching was working perfectly. Too perfectly. Their builders were also running some long-lived services (because “we had spare capacity”), and the builders had large local SSDs. It looked efficient: one class of machine, one golden image, everything scheduled anywhere.
Cache grew quietly. Multi-arch builds, frequent dependency updates, and a habit of tagging every commit created a high-churn cache. It didn’t matter for a week. Then a big release branch cut produced a storm of builds and layer variants. The cache ballooned and pushed the disk over the edge during business hours.
The painful part wasn’t the full disk. The painful part was the second-order effect: as disk filled, the builders slowed down, jobs timed out, retries increased load, and the cache grew even faster. The system became a self-feeding loop: the “optimization” made failure more explosive.
The eventual fix was not “prune more.” They separated roles: dedicated builders with scheduled cache capping, dedicated runtimes with strict image retention, and explicit limits for logs. They also stopped pretending that “fast build” is the same KPI as “stable build.”
Mini-story #3: The boring but correct practice that saved the day (quotas and alerts)
A finance-oriented internal platform team had an unpopular habit: they put quotas and alert thresholds on everything. Developers complained, because quotas feel like bureaucracy until you understand blast radius.
They configured log rotation for Docker’s json-file driver and also set journald caps on hosts that used journald. They set alerts on /var/lib/docker usage, inode usage, and on the top-N container log file sizes. The alert noise was low because thresholds were tuned, and alerts had runbooks attached.
One Friday night, a service started spamming an error message due to a downstream credential rotation issue. On other teams’ platforms, that kind of incident becomes “disk full” plus “app down.” On this team’s platform, log files hit their cap, logs rotated, disk stayed healthy, and the on-call got one alert: “service error rate + log volume increase.” They fixed the credential problem. No cleanup panic. No filesystem triage. Boring reliability won, again.
Common mistakes: symptom → root cause → fix
1) “df shows free space, but Docker says no space”
Symptom: Pull/build/start fails with no space left on device; df -h on / shows plenty free.
Root cause: Docker root is on a different mount (/var or dedicated disk), or you’re filling /tmp during builds.
Fix: docker info for Docker Root Dir; run df -h on that mount and on /tmp. Move data-root or expand the correct filesystem.
2) “No space” but you have gigabytes free
Symptom: Writes fail; df -h shows free GBs; errors persist.
Root cause: Inode exhaustion (df -i shows 100%) or thin pool metadata full (devicemapper).
Fix: If inodes: prune tiny-file-heavy caches and rebuild filesystem with appropriate inode density (or use XFS). If devicemapper: migrate to overlay2 or expand thin pool metadata.
3) “docker system prune freed nothing”
Symptom: You pruned, but disk usage barely changed.
Root cause: The culprit is logs or journald, or big named volumes attached to running containers.
Fix: Inspect /var/lib/docker/containers and journald usage; check volume sizes under /var/lib/docker/volumes and map volumes to containers.
4) “We deleted containers, but disk didn’t drop”
Symptom: Removing containers doesn’t free expected space.
Root cause: Volumes persist; images persist; build cache persists; also, deleted-but-open files can keep space allocated until the process exits.
Fix: Check volumes and build cache; if you suspect deleted-but-open files, restart the offender (sometimes Docker daemon or container runtime) after safe cleanup.
5) “Overlay2 directory is huge; can we delete it?”
Symptom: /var/lib/docker/overlay2 dominates disk usage.
Root cause: That’s where image layers and writable layers live. Manual deletion breaks Docker state.
Fix: Use Docker commands to prune unused images/containers; if state is corrupt, plan a controlled wipe-and-recreate for disposable hosts, not production stateful nodes.
6) “After switching to journald logging, disk still fills”
Symptom: You changed the log driver; disk usage continues to grow.
Root cause: journald retention defaults are too permissive, or persistent journal storage is enabled without caps.
Fix: Configure journald size/time limits and validate with journalctl --disk-usage.
7) “CI builders go disk-full weekly”
Symptom: Builder nodes fill predictably.
Root cause: BuildKit cache retention is unbounded; multiple toolchains generate many unique layers; too many tags/branches built on the same node.
Fix: Scheduled docker builder prune; separate builder from runtime; enforce retention and/or rebuild builders periodically (immutable infrastructure actually helps here).
8) “Space is freed, but the service is still broken”
Symptom: You reclaimed disk, but containers still fail to start or behave oddly.
Root cause: Corrupted Docker metadata, partial pulls, or app-level failure that originally caused excessive logging.
Fix: Validate with a known-good container, check daemon logs, and fix the upstream app issue (rate limiting, retry storm, auth failure). Disk was just collateral damage.
Checklists / step-by-step plan
Emergency checklist (production node is full right now)
- Confirm the mount: run df -h and df -i on Docker root and /tmp.
- Stop runaway logs first: find the biggest container log files; truncate the worst offenders; reduce log level if safe.
- Reclaim safe cache: run docker builder prune --all on builder nodes; run docker image prune -a if you understand rollback impact.
- Audit volumes before touching: identify the largest volumes and map them to containers. Remove only confirmed orphan volumes.
- Verify free space: re-run df -h. Keep at least a few GB free; some filesystems and daemons behave badly near 100%.
- Stabilize: restart failing components only after disk pressure is relieved; avoid flapping.
- Write the incident note: what filled disk, how fast it grew, and what policy change prevents it.
Hardening checklist (make it stop happening)
- Set Docker log rotation: cap size and count for json-file logs.
- Set journald retention: cap storage and/or time if using journald.
- Separate concerns: builders and runtimes should not be the same fleet unless you enjoy mystery growth.
- Set pruning policy: scheduled build cache pruning, and image retention rules per host role.
- Move Docker root to dedicated storage: especially on small root filesystems.
- Alert on inodes and bytes: include runbooks that point to these exact commands (a minimal check-script sketch follows this checklist).
- Measure top offenders: biggest volumes, biggest container logs, largest images per host.
- Design for failure: if a downstream breaks and triggers retry storms, your platform should degrade without self-destruction.
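For the “alert on inodes and bytes” item, here is a minimal check-script sketch, assuming GNU df and thresholds you will tune; wire its exit code into whatever alerting you already have (cron, a node-exporter textfile collector, etc.):
cr0x@server:~$ sudo tee /usr/local/bin/docker-disk-check > /dev/null <<'EOF'
#!/usr/bin/env bash
# Exit non-zero when the Docker data filesystem crosses byte or inode thresholds.
set -euo pipefail
MOUNT="${1:-/var/lib/docker}"   # filesystem to watch
BYTES_MAX="${2:-85}"            # max percent of blocks used
INODES_MAX="${3:-85}"           # max percent of inodes used
bytes_used=$(df --output=pcent "$MOUNT" | tail -1 | tr -dc '0-9')
inodes_used=$(df --output=ipcent "$MOUNT" | tail -1 | tr -dc '0-9')
if [ "$bytes_used" -ge "$BYTES_MAX" ] || [ "$inodes_used" -ge "$INODES_MAX" ]; then
  echo "CRITICAL: $MOUNT at ${bytes_used}% bytes / ${inodes_used}% inodes"
  exit 1
fi
echo "OK: $MOUNT at ${bytes_used}% bytes / ${inodes_used}% inodes"
EOF
cr0x@server:~$ sudo chmod +x /usr/local/bin/docker-disk-check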
Recommended baseline Docker daemon settings (practical defaults)
If you use the json-file log driver, set log rotation. This is the single most cost-effective disk control you can do.
cr0x@server:~$ sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}
EOF
cr0x@server:~$ sudo systemctl restart docker
Meaning: Each container’s log file rotates at ~50MB, keeping 5 files (~250MB per container worst case).
Decision: Tune sizes per environment. Production often needs central logging; local logs should be a buffer, not an archive. Note that tee rewrites the whole daemon.json, so if you also set data-root (as in Task 18), keep both keys in the same file.
Interesting facts and historical context
- Fact 1: Early Docker deployments often used devicemapper loopback mode by default, which was slow and prone to “mysterious” space/metadata failures under load.
- Fact 2: Docker’s shift to overlay2 as the common default made storage faster and simpler, but also made copy-up behavior a frequent surprise for teams writing into container filesystems.
- Fact 3: Docker’s default log driver has historically been json-file, which is optimized for simplicity, not for long-term disk hygiene.
- Fact 4: BuildKit’s popularity rose because it made builds faster and more parallel, but the operational tax is cache management—especially on shared builders.
- Fact 5: The phrase “no space left on device” is a generic errno (ENOSPC) returned by the kernel, and it’s used for more than just “disk is full.”
- Fact 6: Inode exhaustion is an old Unix problem that never died; containers brought it back because image extraction and language ecosystems generate huge numbers of small files.
- Fact 7: Many operators learned the hard way that “containers are ephemeral” is not a statement about data. Volumes are state, and state is forever unless you delete it.
- Fact 8: Docker’s own space accounting (docker system df) is useful but not authoritative; the filesystem is the truth, especially for logs and non-Docker temp usage.
FAQ
1) Why does Docker say “no space left on device” when df -h shows space?
Because you checked the wrong mount, or you’re out of inodes, or you hit a quota/metadata limit. Always check Docker root dir and run df -i.
2) Is it safe to run docker system prune -a in production?
Sometimes. It removes unused images, containers, and networks. It can break fast rollback strategies and cause slower redeploys due to image pulls. Use targeted pruning first.
3) Is it safe to run docker system prune --volumes?
Only if you have verified that the “unused” volumes are truly disposable. “Unused” means “not currently referenced,” not “unimportant.” This is how you delete data.
4) Why are my container logs huge?
Because default json-file logging is unbounded unless you set max-size and max-file. Also, a noisy app can generate gigabytes per hour during error loops.
5) If I truncate container logs, will Docker or the app break?
Truncating the json log file is generally safe as an emergency measure. You lose historical logs, and the app keeps logging. Then fix rotation properly.
6) Why does deleting a container not free space?
Because the space is likely in volumes, images, or build cache. Also, space from deleted files may remain allocated if a process still has the file open.
7) Why is /var/lib/docker/overlay2 so big even though I don’t have many images?
Overlay2 includes writable layers and extracted layer contents. A few “large” images plus write-heavy containers can easily dominate disk.
8) What’s the best way to prevent Docker disk incidents on CI builders?
Dedicated builders, scheduled docker builder prune, bounded caches, and rebuilding builders periodically. Treat caches as consumables, not treasures.
9) Can I just delete files under /var/lib/docker manually?
Don’t. Manual deletion often corrupts Docker’s view of the world. Use Docker commands, or do a controlled wipe only on truly disposable hosts.
10) How much free space should I keep on a Docker host?
Enough that pulls/unpacks and log bursts don’t push you to 100%. Practically: keep a buffer of multiple gigabytes and alert well before the cliff.
Conclusion: next steps that actually prevent repeats
When Docker runs out of space, it’s rarely “Docker is big” and almost always “we didn’t manage the boring parts.” Logs, caches, and volumes are boring. They are also where incidents come from.
Your practical next steps:
- Put a cap on logs today (json-file rotation and/or journald retention). This alone eliminates a huge class of outages.
- Define host roles: runtime nodes should not accumulate build caches; builder nodes should have scheduled pruning and predictable rebuilds.
- Alert on bytes and inodes for the Docker root filesystem, plus top container log sizes and largest volumes.
- Stop writing state into writable layers: use volumes intentionally, mount tmpfs for real temporary data, and audit paths your apps write to.
- When you do cleanup, be surgical: caches and unused images first, volumes only with evidence.
Disk is not glamorous. That’s why it wins so many fights. Make it someone’s job—preferably yours, before it becomes your weekend.