Some Docker builds are slow for honest reasons: you’re compiling a monster, downloading half the internet, or doing cryptography at scale. But most “slow builds” are self-inflicted. The usual pattern: you’re paying for the same work on every commit, across every laptop, on every CI runner, forever.
BuildKit can fix that—if you stop treating caching as “Docker remembers stuff” and start treating it as a supply chain. This is a field guide to caching that holds up under production pressure: what to measure, what to change, and what will backfire.
The mental model: BuildKit cache is not magic, it’s addresses
Classic Docker builds (the legacy builder) worked like a stack of layers: each instruction created a filesystem snapshot. If the instruction and its inputs were unchanged, Docker could reuse the layer. That story is still mostly true, but BuildKit changed the mechanics and made the cache far more expressive.
BuildKit treats your build as a graph of operations. Each operation produces an output, and that output can be cached by a key derived from inputs (files, args, environment, base images, and sometimes metadata). If the key matches, BuildKit can skip the work and reuse the output.
That last sentence hides the gotcha: caching is only as good as your keys. If you accidentally include “today’s timestamp” in your key, you get a cache miss every time. If you include “the whole repo” as an input to an early step, you invalidate the world when someone edits a README.
What “actually speeds it up” means
There are three separate speed problems that people mash into one complaint:
- Repeated work within a single machine: rebuilding the same Dockerfile locally over and over.
- Repeated work across machines: laptops, ephemeral CI runners, autoscaling build fleets.
- Slow steps even on a cache hit: large contexts, slow image pulls, decompression, and “install dependencies” that is never cached.
BuildKit caching solves all three—but only if you pick the right mechanism:
- Layer cache (classic): good for immutable steps that depend on stable inputs.
- Inline cache metadata: stores cache info inside the image manifest so other builders can reuse it.
- External/remote cache: exports cache to registry/local/CI artifact so ephemeral builders don’t start cold.
- Cache mounts (RUN --mount=type=cache): make "install dependencies" fast without baking caches into the final image.
One operational rule: if you can’t explain why a step is cached, it isn’t. It’s just temporarily lucky.
Paraphrased idea from Werner Vogels (Amazon CTO): “Everything fails, all the time.” Cache misses are a failure mode. Engineer accordingly.
Interesting facts and small history (because it matters)
- BuildKit started as a separate project to replace Docker’s legacy builder with a graph-based engine, enabling parallelism and richer caching.
- Docker’s classic layer cache predates BuildKit and was tied tightly to “one instruction = one layer snapshot.” That model is simple, but it can’t express ephemeral caches cleanly.
- BuildKit can run independent steps in parallel (for example, pulling base images while transferring build context), which is why logs look different and timing can improve without any Dockerfile changes.
- .dockerignore is older than BuildKit but became more critical as repos grew: big contexts don’t just slow builds, they poison cache keys by changing inputs.
- Inline cache metadata was a pragmatic hack: store cache hints inside the image so the next build can reuse layers even on a different machine.
- BuildKit cache mounts were a philosophical shift: “cache as a build-time concern” rather than “cache baked into the image.” That’s the difference between fast builds and bloated images.
- Multi-stage builds changed caching behavior in real teams: you can isolate expensive toolchains in build stages and keep runtime stages stable and cache-friendly.
- Remote cache export became necessary when CI moved to ephemeral runners and autoscaled builders where local disk cache disappears every run.
Joke #1: Docker caching is like office parking—everyone thinks they have a reserved spot until Monday morning proves otherwise.
Fast diagnosis playbook: find the bottleneck in 10 minutes
This is the triage order that saves time. Don’t start “optimizing the Dockerfile” until you know which of these is the real villain.
1) Is BuildKit even enabled?
If you’re on a newer Docker version it usually is, but “usually” isn’t a plan. Legacy builder behaves differently and misses key features like cache mounts.
2) Is the build context huge or unstable?
If you’re sending gigabytes, you’ll be slow even with perfect caching. And if the context changes every commit (logs, build outputs, vendor dirs), cache keys churn.
3) Are you getting cache hits where you expect?
Look at the output: CACHED should appear on expensive steps. If your dependency install step runs every time, that’s your target.
4) Is CI cold-starting every run?
Ephemeral runners mean local cache is gone. Without a remote cache exporter/importer, you’re rebuilding from scratch no matter how “cache-friendly” your Dockerfile looks.
5) Are you blocked on network or CPU?
Dependency downloads, base image pulls, and package index updates are network-heavy. Compiles are CPU-heavy. The fix differs.
6) Are secrets/SSH causing cache misses?
Secrets aren’t part of the cache key by design, but the way you wire private dependencies often changes commands or introduces nondeterminism, which wrecks cache reuse.
7) Are you accidentally busting the cache?
Common culprits: ADD . too early, non-pinned packages, build args changing every run, timestamps, and “cleanup” that changes file mtimes in ways you didn’t expect.
Practical tasks: commands, outputs, decisions (12+)
These are not “toy commands.” They’re the ones you run when your build pipeline is on fire and you need to decide what to change next.
Task 1: Confirm BuildKit is on (and which builder is used)
cr0x@server:~$ docker build --progress=plain -t demo:bk .
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 1.12kB done
#2 [internal] load metadata for docker.io/library/alpine:3.19
#2 DONE 0.8s
#3 [internal] load .dockerignore
#3 transferring context: 2B done
#3 DONE 0.0s
#4 [1/3] FROM docker.io/library/alpine:3.19
#4 resolve docker.io/library/alpine:3.19 done
#4 DONE 0.0s
What it means: The numbered steps and “internal” phases are BuildKit-style output. If you see the old “Step 1/…” style, you’re not using BuildKit.
Decision: If it’s legacy output, enable BuildKit via environment or daemon config before doing anything else. Otherwise you’ll optimize the wrong engine.
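If you do need to flip it on, here is a minimal sketch (assuming a standard Docker Engine install; the daemon.json path and the features key are the documented defaults):
cr0x@server:~$ DOCKER_BUILDKIT=1 docker build -t demo:bk .
cr0x@server:~$ cat /etc/docker/daemon.json
{
  "features": { "buildkit": true }
}
cr0x@server:~$ sudo systemctl restart docker
The environment variable works per invocation; the daemon.json entry makes BuildKit the default for every docker build on that host (a daemon restart is required).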
Task 2: Print Docker version and server features
cr0x@server:~$ docker version
Client: Docker Engine - Community
Version: 26.1.3
API version: 1.45
Server: Docker Engine - Community
Engine:
Version: 26.1.3
API version: 1.45 (minimum version 1.24)
Experimental: false
What it means: BuildKit capabilities vary with engine/buildx versions. Very old engines can be “BuildKit-ish” but missing key features.
Decision: If you’re several major versions behind, upgrade first. Caching fixes on an old engine are like tuning a carburetor on a car that needs an engine swap.
Task 3: Check buildx availability and current builder
cr0x@server:~$ docker buildx ls
NAME/NODE DRIVER/ENDPOINT STATUS BUILDKIT PLATFORMS
default docker
default default running v0.12.5 linux/amd64,linux/arm64
What it means: You have buildx and a builder instance. The BuildKit version matters for some cache exporters/importers.
Decision: If buildx is missing or builder is broken, fix that. Remote caching is easiest with buildx.
Task 4: Measure build context size (the silent killer)
cr0x@server:~$ docker build --no-cache --progress=plain -t demo:ctx .
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 2.01kB done
#2 [internal] load .dockerignore
#2 transferring context: 2B done
#3 [internal] load build context
#3 transferring context: 812.4MB 12.3s done
#3 DONE 12.4s
What it means: 812MB context transfer. Even if everything caches, you just burned 12 seconds before the build started doing real work.
Decision: Fix .dockerignore and/or use a tighter build context path. Don’t accept “it’s fine on my machine” here; CI will pay more.
Task 5: Inspect what’s in your build context (quick reality check)
cr0x@server:~$ tar -czf - . | wc -c
853224921
What it means: Your context is ~853MB compressed. That’s usually build output, node_modules, virtualenvs, or test artifacts sneaking in.
Decision: Add excludes, or build from a subdirectory containing only what the image needs.
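To see which directories are actually inflating the context, a rough sketch (the sizes and paths below are hypothetical, but the usual suspects look like this):
cr0x@server:~$ du -h --max-depth=1 . | sort -rh | head -5
853M    .
612M    ./node_modules
148M    ./.git
61M     ./dist
18M     ./coverage
Anything in that list that the image doesn’t need belongs in .dockerignore.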
Task 6: Verify cache hits step-by-step
cr0x@server:~$ docker build --progress=plain -t demo:cachecheck .
#7 [2/6] COPY package.json package-lock.json ./
#7 CACHED
#8 [3/6] RUN npm ci
#8 CACHED
#9 [4/6] COPY . .
#9 DONE 0.9s
#10 [5/6] RUN npm test
#10 DONE 34.6s
What it means: Dependency install is cached; tests are not (and probably shouldn’t be cached). You’ve separated “stable inputs” from “changing inputs.”
Decision: If the expensive dependency step is not cached, restructure Dockerfile or use cache mounts.
Task 7: Show cache usage on the host
cr0x@server:~$ docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 41 6 9.12GB 6.03GB (66%)
Containers 12 2 311MB 243MB (78%)
Local Volumes 8 3 1.04GB 618MB (59%)
Build Cache 145 0 3.87GB 3.87GB
What it means: Build cache exists and is sizable; “ACTIVE 0” suggests it’s not pinned by running builds. That cache might be pruned aggressively by scripts.
Decision: If cache keeps vanishing, stop running “docker system prune -a” in CI images or shared runners without understanding the impact.
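If pruning is genuinely needed, prefer something targeted over a full wipe. A sketch, assuming a reasonably recent Docker/buildx; the thresholds are arbitrary examples:
cr0x@server:~$ docker builder prune --filter until=72h --keep-storage 10GB -f
This keeps recent, hot cache entries while capping total disk usage, instead of deleting everything the next build will want back.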
Task 8: Inspect BuildKit builder disk usage
cr0x@server:~$ docker buildx du
ID RECLAIMABLE SIZE LAST ACCESSED
m5p8xg7n0f3m2b0c6v2h7w2n1 true 1.2GB 2 hours ago
u1k7tq9p4y2c9s8a1b0n6v3z5 true 842MB 3 days ago
total: 2.0GB
What it means: Buildx has its own accounting. If this is empty on CI, you have no persistent cache between runs.
Decision: For ephemeral builders, plan remote cache export/import.
Task 9: Prove a cache-busting culprit (build args)
cr0x@server:~$ docker build --progress=plain --build-arg BUILD_ID=123 -t demo:arg .
#6 [2/5] RUN echo "build id: 123" > /build-id.txt
#6 DONE 0.2s
cr0x@server:~$ docker build --progress=plain --build-arg BUILD_ID=124 -t demo:arg .
#6 [2/5] RUN echo "build id: 124" > /build-id.txt
#6 DONE 0.2s
What it means: That step will never cache across different BUILD_ID values. If that arg is used early, it invalidates everything after it.
Decision: Move volatile args late, or stop embedding build IDs in filesystem layers unless you truly need them.
Task 10: Confirm base image pull is a bottleneck
cr0x@server:~$ docker pull ubuntu:22.04
22.04: Pulling from library/ubuntu
Digest: sha256:4f2...
Status: Image is up to date for ubuntu:22.04
docker.io/library/ubuntu:22.04
What it means: “up to date” indicates it was already present. If pulls are slow and frequent in CI, you may be missing a shared registry mirror or local cache.
Decision: If CI always pulls, consider caching base images on runners or using a registry closer to the runners.
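One option is a pull-through mirror configured on the runner’s daemon (the mirror URL below is hypothetical). Note that the daemon-level mirror setting only applies to Docker Hub pulls; private registries need their own proximity or caching story.
cr0x@server:~$ cat /etc/docker/daemon.json
{
  "registry-mirrors": ["https://mirror.registry.internal"]
}
cr0x@server:~$ sudo systemctl restart docker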
Task 11: Build with explicit cache export/import to a local directory
cr0x@server:~$ docker buildx build --progress=plain \
--cache-to type=local,dest=/tmp/bkcache,mode=max \
--cache-from type=local,src=/tmp/bkcache \
-t demo:localcache --load .
#11 [4/7] RUN apt-get update && apt-get install -y build-essential
#11 DONE 39.4s
#12 exporting cache to client directory
#12 DONE 0.6s
What it means: You exported a reusable cache to /tmp/bkcache. On the next run, those expensive steps should show CACHED.
Decision: If local cache helps but CI is still slow, you need remote cache export (registry/artifact) rather than local-only.
Task 12: Validate caching on the second run (proof, not vibes)
cr0x@server:~$ docker buildx build --progress=plain \
--cache-from type=local,src=/tmp/bkcache \
-t demo:localcache --load .
#11 [4/7] RUN apt-get update && apt-get install -y build-essential
#11 CACHED
#13 exporting to docker image format
#13 DONE 1.1s
What it means: The expensive apt step is cached. You’ve proven the mechanism works.
Decision: Now you can invest in remote cache safely, because you know your Dockerfile is cacheable.
Task 13: Catch nondeterminism in package install
cr0x@server:~$ docker build --progress=plain -t demo:apt .
#9 [3/6] RUN apt-get update && apt-get install -y curl
#9 DONE 18.7s
cr0x@server:~$ docker build --progress=plain -t demo:apt .
#9 [3/6] RUN apt-get update && apt-get install -y curl
#9 DONE 19.4s
What it means: It rebuilt both times, which can happen if something earlier invalidated the layer, or if you used --no-cache, or if the filesystem inputs changed.
Decision: Ensure the RUN apt-get ... step comes after only stable inputs. Also consider pinning packages or using a base image that already contains common tooling.
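What pinning looks like for apt (the version string below is a placeholder, not a recommendation; real pins come from your own repos or snapshot mirrors):
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl=8.5.0-2ubuntu10.6
Pins trade convenience for determinism: the layer changes when you change the pin, not when the distro publishes a new build.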
Task 14: Identify "COPY . ." too early (cache invalidation test)
cr0x@server:~$ git diff --name-only HEAD~1
README.md
cr0x@server:~$ docker build --progress=plain -t demo:readme .
#6 [2/6] COPY . .
#6 DONE 0.8s
#7 [3/6] RUN npm ci
#7 DONE 45.1s
What it means: Changing README triggered COPY . ., which invalidated npm ci because you copied the entire repo before installing dependencies.
Decision: Copy only dependency manifests first; copy the rest later. That’s not “micro-optimization”; it’s the difference between a 20s rebuild and a 2-minute rebuild.
Dockerfile patterns that make caching real
BuildKit rewards discipline. If your Dockerfile is a junk drawer, your cache will behave like one: full, expensive, and not containing the thing you need.
Pattern 1: Separate dependency manifests from source
This is the classic fix because it’s the most common mistake. You want dependency install keyed off the lockfile, not keyed off every file in the repository.
cr0x@server:~$ cat Dockerfile
# syntax=docker/dockerfile:1.7
FROM node:20-bookworm AS build
WORKDIR /app
COPY package.json package-lock.json ./
RUN --mount=type=cache,target=/root/.npm npm ci
COPY . .
RUN npm run build
FROM node:20-bookworm-slim
WORKDIR /app
COPY --from=build /app/dist ./dist
CMD ["node","dist/server.js"]
Why it works: the expensive npm ci step depends primarily on package-lock.json. A change to README.md shouldn’t reinstall the world.
Pattern 2: Put volatile steps late
Anything that changes every build—build IDs, git SHAs, version stamping—should be near the end. Otherwise you invalidate all downstream cache.
Bad:
cr0x@server:~$ cat Dockerfile.bad
FROM alpine:3.19
ARG GIT_SHA
RUN echo "$GIT_SHA" > /git-sha.txt
RUN apk add --no-cache curl
Good:
cr0x@server:~$ cat Dockerfile.good
FROM alpine:3.19
RUN apk add --no-cache curl
ARG GIT_SHA
RUN echo "$GIT_SHA" > /git-sha.txt
Why it works: You keep stable layers reusable. Your stamp is still there, but it only invalidates itself.
Pattern 3: Multi-stage builds as cache boundaries
Multi-stage builds are not just about smaller runtime images. They also let you isolate “toolchain churn” from “runtime stability.”
- Build stage: compilers, package managers, caches, headers.
- Runtime stage: small, stable, fewer moving parts, fewer cache invalidations.
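A practical consequence: CI can build just the toolchain stage when it only needs to compile or run tests, without producing the runtime image. A sketch, reusing the stage name build from the earlier Node example:
cr0x@server:~$ docker buildx build --target build -t demo:toolchain --load .
Because both stages share the same early layers, the later full build reuses everything the --target build already cached.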
Pattern 4: Be deliberate about base image tags
If you use floating tags like latest, your base image can change without you touching the Dockerfile. That’s a cache invalidation event plus a reproducibility problem.
Use explicit tags, and in high-assurance environments, prefer digest pinning. It’s a policy choice: convenience versus repeatability.
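What that looks like in a Dockerfile (the digest below is truncated and purely illustrative; resolve the real one with docker buildx imagetools inspect or from docker pull output):
# Convenient: floats within the tag, so the base can change underneath you
FROM python:3.12-slim
# Repeatable: digest-pinned, the base never changes silently
FROM python:3.12-slim@sha256:4f2...
Either way, the choice should be a policy, not an accident.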
Pattern 5: Minimize “apt-get update” churn
apt-get update is a frequent source of unpredictable cache behavior. Not because it can’t be cached—because it tends to be placed in layers that get invalidated by unrelated changes.
Also: always combine update and install in one layer. Separate layers cause stale indexes and mysterious 404s later.
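The shape to avoid versus the shape that works (a minimal sketch):
# Bad: the index layer can stay cached long after repos move on,
# so install resolves against a stale index and 404s
RUN apt-get update
RUN apt-get install -y --no-install-recommends curl

# Good: index and install live and die together in one layer
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*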
Pattern 6: Keep the build context clean
If you don’t control your context, you don’t control your cache keys. Use .dockerignore aggressively: build artifacts, dependency directories, logs, test reports, local env files, and anything produced by CI itself.
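A starting point, not a prescription (the entries below are typical offenders; adjust for your stack):
cr0x@server:~$ cat .dockerignore
.git
node_modules
dist
coverage
*.log
.env*
__pycache__
.venv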
Joke #2: If your build context includes node_modules, your Docker daemon is basically doing crossfit—lifting heavy stuff repeatedly for no reason.
Remote cache that survives CI: registry, local, and runners
Local caching is nice. Remote caching is what makes teams stop complaining in Slack. If your CI runner is ephemeral, it wakes up with amnesia every run. That’s not a moral failing; it’s how those systems are designed.
Inline cache: share cache hints via the image
Inline cache stores cache metadata in the image config so later builds can pull the image and reuse layers. This is the “minimum viable” cross-machine caching.
Example build (image push omitted here; the mechanics are the point):
cr0x@server:~$ docker buildx build --progress=plain \
--build-arg BUILDKIT_INLINE_CACHE=1 \
-t demo:inlinecache --load .
#15 exporting config sha256:4c1d...
#15 DONE 0.4s
#16 writing image sha256:0f9a...
#16 DONE 0.8s
What it means: The image now carries cache metadata. Another builder can use it with --cache-from pointing to that image reference.
Decision: Use inline cache when you already push images and want a simple cache path. It’s not always enough, but it’s a solid baseline.
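The consuming side is just --cache-from pointing at that image reference, assuming the image was pushed somewhere the builder can reach (the registry name below is illustrative):
cr0x@server:~$ docker buildx build --progress=plain \
  --cache-from type=registry,ref=registry.internal/demo:inlinecache \
  -t demo:app --load .
#8 [3/6] RUN npm ci
#8 CACHED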
Registry cache: export cache separately (more powerful)
BuildKit can export cache to a registry reference. This is often better than inline cache because it can store more detail and doesn’t require you to “promote” an image just to use its cache.
cr0x@server:~$ docker buildx build --progress=plain \
--cache-to type=registry,ref=registry.internal/demo:buildcache,mode=max \
--cache-from type=registry,ref=registry.internal/demo:buildcache \
-t registry.internal/demo:app --push .
#18 exporting cache to registry
#18 DONE 2.7s
What it means: Cache is stored as an OCI artifact in your registry. Next build imports it, even on a fresh runner.
Decision: If CI is ephemeral and builds are slow, this is usually the move. The trade-off is registry storage growth and occasional permission headaches.
Local directory cache: good for single runner reuse
Local cache exports are great when you have a persistent runner with a workspace that survives runs, or when you can attach a persistent volume.
It’s also a good “prove it works” step before you deal with registry auth and CI policies.
Choosing a cache mode
- mode=min: smaller cache, fewer intermediate results. Good when storage is tight.
- mode=max: more complete cache, better hit rates. Good when you want speed and can afford storage.
In practice: start with mode=max in CI for a week, watch registry growth and hit rates, then decide if you need to constrain it.
Cache mounts that actually speed up dependency installs
Layer caching is blunt. Dependency managers are subtle. They want a cache directory that persists across runs, but you don’t want to bake that cache into the final image. BuildKit’s cache mounts are the right tool.
npm / yarn / pnpm
npm example shown earlier. The gist: mount a cache directory to /root/.npm (or a user cache path) during install.
apt
You can cache apt lists and archives. This is useful when you have repeated installs across builds and your CI network isn’t great. It’s not a silver bullet; apt metadata changes frequently.
cr0x@server:~$ cat Dockerfile.aptcache
# syntax=docker/dockerfile:1.7
FROM ubuntu:22.04
RUN --mount=type=cache,target=/var/cache/apt \
--mount=type=cache,target=/var/lib/apt/lists \
apt-get update && apt-get install -y --no-install-recommends curl ca-certificates \
&& rm -rf /var/lib/apt/lists/*
Operational note: caching /var/lib/apt/lists can speed things up, but can also mask repo changes in surprising ways if you’re debugging package availability. Use it knowingly.
pip
Python builds are classic offenders: wheels download every time, compilation happens every time, and someone eventually “fixes it” by copying the entire venv into the image (don’t do that).
cr0x@server:~$ cat Dockerfile.pipcache
# syntax=docker/dockerfile:1.7
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt ./
RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt
COPY . .
CMD ["python","-m","app"]
Go
Go has two key caches: module download and build cache. BuildKit can persist them without polluting the runtime image.
cr0x@server:~$ cat Dockerfile.go
# syntax=docker/dockerfile:1.7
FROM golang:1.22 AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN --mount=type=cache,target=/go/pkg/mod \
go mod download
COPY . .
RUN --mount=type=cache,target=/root/.cache/go-build \
go build -o /out/app ./cmd/app
FROM gcr.io/distroless/base-debian12
COPY --from=build /out/app /app
CMD ["/app"]
Why this matters: module download cache saves network; build cache saves CPU. Different bottlenecks, same mechanism.
Secrets, SSH, and why your cache disappears
Private dependencies are where “it works locally” goes to die. People hack around it with ARG TOKEN=... and accidentally bake secrets into layers (bad) or break caching in a way that only shows up on CI (also bad).
Use BuildKit secrets instead of ARG for credentials
Secrets mounted during build don’t get stored in image layers. They also don’t become cache keys. That’s a feature and a trap: your step may be cached even if the secret changes, so make sure the output is deterministic.
cr0x@server:~$ cat Dockerfile.secrets
# syntax=docker/dockerfile:1.7
FROM alpine:3.19
RUN apk add --no-cache git openssh-client
RUN --mount=type=secret,id=git_token \
sh -c 'TOKEN=$(cat /run/secrets/git_token) && echo "token length: ${#TOKEN}" > /tmp/token-info'
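The secret is supplied at build time and never lands in a layer (the source path is an example):
cr0x@server:~$ docker buildx build --progress=plain \
  --secret id=git_token,src=$HOME/.secrets/git_token \
  -t demo:secrets .
If you forget --secret, the file at /run/secrets/git_token won’t exist and this particular step fails, which is better than a token baked into an image layer.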
Decision: If you need private git clones, use --mount=type=ssh with forwarded agent or a deploy key, not a token in ARG.
SSH mounts: safe-ish, but mind reproducibility
SSH mounts avoid baking secrets. They also introduce a new failure mode: your build becomes dependent on the network and on the remote repo state. Pin commits/tags for determinism, or your cache is “correctly” invalidated by upstream changes.
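A minimal sketch of the SSH path (repo URL and tag are placeholders; host key handling is simplified, and --ssh default assumes an ssh-agent is running on the build host):
cr0x@server:~$ cat Dockerfile.ssh
# syntax=docker/dockerfile:1.7
FROM alpine:3.19
RUN apk add --no-cache git openssh-client
# Pin a tag or commit so the result depends on your command, not the repo's moving HEAD
RUN --mount=type=ssh GIT_SSH_COMMAND="ssh -o StrictHostKeyChecking=accept-new" \
    git clone --depth 1 --branch v1.4.2 git@github.com:example/private-lib.git /opt/private-lib
cr0x@server:~$ docker buildx build --ssh default -t demo:ssh .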
Three corporate mini-stories from the caching trenches
1) Incident caused by a wrong assumption: “CI caches Docker layers by default”
A mid-sized company migrated their CI from long-lived VMs to ephemeral runners. The migration was pitched as a win: clean environments, fewer flaky builds, less disk maintenance. All true. But someone assumed Docker’s layer cache would “just be there” like it was on the old VMs.
Day one after the cutover, build times tripled. Not a little slower—slow enough that deploy windows missed their change approvals. Engineers responded like engineers: parallelize jobs, add more runners, throw money at it. The build farm got bigger; the builds stayed slow.
The real issue was boring. The old VMs had warm Docker caches. The new runners booted empty every time. Every build pulled base images, downloaded dependencies, and recompiled the same code. Caching existed, but it evaporated at the end of each job.
The fix was equally boring: remote cache export/import using buildx with a registry cache reference. Overnight, build times dropped back near the old baseline. The “optimization” wasn’t a clever Dockerfile trick. It was acknowledging the infrastructure reality: ephemeral runners require external state if you want reuse.
2) Optimization that backfired: “Cache everything in the image to make builds faster”
A different org had a build that ran pip install and took forever on developer laptops. Someone decided to “solve it” by copying the entire pip cache and build artifacts into the image layer after installation. It worked—in a narrow sense. Rebuilds were fast on that machine.
Then the image started ballooning. It also became inconsistent: two engineers produced different images from the same commit because the cache directories contained platform-specific wheels and leftover build junk. QA found “it fails only on staging” behavior that didn’t reproduce locally.
Security got involved because the cache included downloaded artifacts that weren’t tracked in the dependency lock, making provenance analysis a mess. Now the company had a fast build and a slow incident response. That is not a trade you want.
The rollback was painful. They switched to BuildKit cache mounts for pip, kept runtime images slim, and introduced a repeatable dependency pinning policy. Builds stayed fast, and the produced artifacts became deterministic enough to debug.
3) Boring but correct practice that saved the day: “Cache keys based on lockfiles, not repo state”
During a busy quarter, one team shipped a lot of small changes: config tweaks, copy edits, feature flags. Their service builds were Node-based and dependency-heavy, historically slow. But their pipeline stayed stable.
The reason wasn’t heroics. Months earlier, someone had refactored Dockerfiles across services to copy only dependency manifests first, run the dependency install, and only then copy application source. They also enforced a rule: changes to lockfiles require explicit review.
So when a flurry of non-code changes landed, most builds hit cache on dependency installation and base layers. The CI system still built and tested, but it didn’t re-download the internet. Their deploy queue stayed short even under high commit volume.
That practice doesn’t look impressive in a demo. It doesn’t get a slide. But it prevented a very predictable failure mode: “small change, big build cost.” In operations, boring is often the highest compliment.
Common mistakes: symptom → root cause → fix
1) Symptom: “Every build reruns dependency install”
Root cause: You copy the whole repo before installing dependencies, so any file change invalidates the layer.
Fix: Copy only package-lock.json/requirements.txt/go.sum first, run install/download, then copy the rest.
2) Symptom: “CI is always slow, local is fine”
Root cause: CI runners are ephemeral; local machine has warm cache.
Fix: Use docker buildx build with --cache-to/--cache-from (registry or artifact storage). Inline cache helps but isn’t always enough.
3) Symptom: “Build context transfer takes forever”
Root cause: Huge context (node_modules, dist, logs, .git) or unstable files included.
Fix: Tighten .dockerignore. Build from a narrower directory. Don’t send the universe to the daemon.
4) Symptom: “Cache hits locally, misses in CI even with remote cache”
Root cause: Different build args/platforms/targets, or different base image digests. Cache keys diverge.
Fix: Standardize build args, platform (--platform), and targets. Pin base images. Ensure CI imports the same cache reference it exports.
5) Symptom: “Cache works until someone runs a cleanup job”
Root cause: Aggressive pruning removes build cache (docker system prune -a), or runner images reset storage.
Fix: Stop nuking caches blindly. Use targeted pruning policies, and rely on remote cache for CI if runner disks are not persistent.
6) Symptom: “Build is slow even when everything says CACHED”
Root cause: You’re spending time on pulling base images, exporting images, compressing layers, or loading to the daemon.
Fix: Measure: look for time in “exporting,” “writing image,” and pulls. Consider using --output type=registry in CI instead of --load if you don’t need the image locally.
7) Symptom: “Random cache misses”
Root cause: Nondeterministic commands (apt-get update without stable ordering, unpinned dependencies), timestamps embedded into build outputs, or generated files included early.
Fix: Make steps deterministic where possible, pin versions, and isolate generated artifacts to later stages.
8) Symptom: “We enabled inline cache and nothing changed”
Root cause: You didn’t actually import cache (--cache-from), or the image isn’t available/pulled on the builder, or you’re building a different platform.
Fix: In CI, explicitly import the cache. Verify with logs that steps are CACHED and that the builder can reach the reference.
Checklists / step-by-step plan
Checklist A: Make local builds fast (single machine)
- Enable BuildKit and use --progress=plain to see cache behavior.
- Cut build context to the bone with .dockerignore.
- Restructure Dockerfile: copy manifests → install dependencies → copy source.
- Use cache mounts for dependency managers (npm, pip, go, apt if appropriate).
- Move volatile args/stamps to the end.
- Run two consecutive builds and verify expensive steps are CACHED.
Checklist B: Make CI builds fast (ephemeral runners)
- Confirm CI uses buildx and BuildKit output (not legacy).
- Pick a remote cache backend: registry cache is usually simplest operationally.
- Add --cache-to and --cache-from to CI builds.
- Standardize --platform across CI jobs; don’t mix amd64 and arm64 caches unless you mean to.
- Pin base images to stable tags (or digests where required).
- Watch for time spent exporting images; prefer pushing directly from buildx rather than --load on CI.
- Set a policy for cache retention/pruning in the registry; uncontrolled cache growth is a slow-motion outage.
Checklist C: When performance work is worth it
- If your build is <30 seconds and happens rarely, don’t start a caching crusade.
- If your build blocks merges, deploys, or on-call mitigation, treat caching like reliability work.
- If dependency install runs every time, fix that first; it’s the highest ROI in most stacks.
FAQ
1) Why does changing a README trigger a full rebuild?
Because you copied the whole repository into the image before the expensive steps. Cache keys include file inputs. Copy less, later.
2) Is BuildKit caching the same as Docker layer caching?
Layer caching is one mechanism. BuildKit generalizes builds into a graph and adds cache mounts, remote cache export/import, and improved parallelism.
3) Should I always use --mount=type=cache?
Use it for dependency manager caches and compiler caches. Don’t use it to hide nondeterminism or to “make tests faster” by reusing stale outputs.
4) What’s the difference between inline cache and registry cache?
Inline cache stores cache metadata in the image itself. Registry cache exports cache as a separate artifact. Registry cache tends to be more flexible and effective for CI.
5) My CI build uses docker build. Do I need buildx?
You can get some benefits with docker build if BuildKit is enabled, but buildx makes remote caching and multi-platform builds much easier to manage.
6) Why do my cache hits disappear after a Docker cleanup?
Because someone is pruning build cache or the runner filesystem is ephemeral. Fix the cleanup policy, or rely on remote cache rather than local state.
7) Does using secrets disable caching?
Secrets themselves aren’t part of the cache key. But the commands you run using those secrets can still be nondeterministic or depend on moving targets, which causes rebuilds.
8) Can I share cache across architectures (amd64 and arm64)?
Not directly. Cache keys include platform. You can store caches for multiple platforms under the same cache reference, but they’re separate entries.
9) Why is exporting the image slow even when everything is cached?
Because exporting still has to assemble layers, compress, and write/push them. If CI doesn’t need a local image, push directly and avoid --load.
10) When should I pin base images by digest?
When reproducibility and supply-chain control matter more than convenience. Digests reduce surprise cache busts and “same Dockerfile, different image” outcomes.
Conclusion: next steps that pay for themselves
Make this boring. Boring is fast.
- Run a build with --progress=plain and write down which step is slow and whether it’s CACHED.
- Fix build context size first. It’s the tax you pay before caching even gets a vote.
- Restructure Dockerfiles so dependency installs depend on lockfiles, not on your entire repo.
- Add cache mounts for dependency managers to stop re-downloading and re-compiling.
- If CI is ephemeral, implement remote cache export/import. Otherwise you’re optimizing a cache that evaporates on success.
- Standardize build args, platform, and base image policy so cache keys match across environments.
If you do those six things, you’ll stop “optimizing Docker builds” and start running a build system that behaves like production: predictable, measurable, and fast for the right reasons.