Dockerfile “failed to solve”: the errors you can fix instantly

Nothing says “calm, healthy delivery pipeline” like a Docker build exploding with failed to solve five minutes before a release. Your CI log turns into a detective novel written by a compiler. Someone suggests “just rerun it,” as if production systems are powered by optimism.

This guide is the pragmatic version: what failed to solve actually means (BuildKit is complaining, not Docker), how to find the real failure fast, and the fixes that reliably work under real constraints: flaky networks, locked-down runners, oversized repos, and “security” controls that break builds.

What “failed to solve” actually means

When you see failed to solve, you’re almost always looking at a BuildKit error bubble-up. BuildKit is Docker’s modern builder engine: it builds a dependency graph (layers, sources, mounts), then “solves” it like a build plan. If something breaks anywhere in that graph—fetching a base image, reading a file in the build context, executing a RUN command, calculating a checksum, unpacking an archive—BuildKit reports the failure as “failed to solve.” It’s not trying to be mysterious. It’s just not trying to be helpful either.

There are three key implications:

  • The real error is almost always a few lines above. “failed to solve” is the trailer, not the movie.
  • The failing step is a node in a graph. The step number in logs can be misleading if steps run in parallel or are cached.
  • Context matters more than Dockerfile purity. Many failures aren’t “Dockerfile bugs”; they’re environment, network, permissions, or context size problems.

If you want one operationally useful mental model: Docker build is a pipeline of inputs (context files, base images, secrets, network) into deterministic transformations (layers). “Failed to solve” means one input is missing, unreadable, untrusted, or slow enough to be considered dead.

One quote, because it’s still painfully true in build systems and everything else: “Hope is not a strategy.” (a line often attributed to Gene Kranz)

Fast diagnosis playbook

This is the order that saves the most time in the most environments. Don’t freestyle. Follow the steps until you find the bottleneck.

1) Locate the first real error, not the last wrapper

  • Scroll up to the first non-noise failure line. BuildKit often prints multiple cascading errors.
  • Identify which category it is: context, syntax, network, auth, permissions, platform, cache, or runtime.

2) Confirm the builder and its settings

  • Is BuildKit enabled? Are you using buildx? Which driver (docker, docker-container, kubernetes)?
  • Is the failure on a remote builder where filesystem/network differs from your laptop?

3) Check build context health first

  • Most “instant fixes” are context issues: missing files, bad .dockerignore, giant context, permissions.
  • Verify the file paths referenced in COPY/ADD and confirm they’re inside the build context.

4) If it’s network/auth, reproduce with a single fetch

  • Try pulling the base image with the same credentials and network configuration.
  • Try curling the package repo endpoint from the runner (not from your workstation).

5) If it’s a RUN step, reduce it

  • Make the failing RUN command visible (no “&& … && …” mega-lines for debugging).
  • Temporarily drop parallelism flags and add verbose output.

6) If it’s cache-related, prove it

  • Re-run with --no-cache or prune caches to confirm you’re not chasing stale state.
  • Then fix cache keys, not symptoms.

Joke #1: Treat “failed to solve” like a toddler tantrum—something real happened earlier, and now everyone’s just screaming about it.

The instant-fix categories (and why they happen)

Category A: Dockerfile syntax and parser failures

These are the nice ones. They fail fast and consistently. You’ll see errors like:

  • failed to parse Dockerfile
  • unknown instruction
  • invalid reference format

The fix is almost always a typo, missing \, incorrect quoting, or using a feature your builder doesn’t support. If you’re copying snippets from blogs, assume they’re wrong until proven otherwise.
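
For instance, a lost line continuation is enough to trigger “unknown instruction.” A minimal sketch (Debian-style packages, purely illustrative):

  # Broken: "ca-certificates" has no trailing backslash, so the parser
  # treats the next line ("curl") as an instruction and fails
  RUN apt-get update && apt-get install -y \
      ca-certificates
      curl

  # Fixed: every continued line except the last ends with a backslash
  RUN apt-get update && apt-get install -y \
      ca-certificates \
      curl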

Category B: Build context problems (missing files, wrong paths, .dockerignore)

This is the most common “instant fix,” and the easiest to misdiagnose because people confuse repository layout with build context. The context is whatever directory you pass to docker build. BuildKit can only see files under that directory (minus ignored patterns). If your Dockerfile says:

  • COPY ../secrets /secrets (nope)
  • COPY build/output/app /app (maybe, if it exists at build time)
  • COPY . /src (works, but often dumb)

Then context correctness decides your fate. A small .dockerignore mistake can turn into “checksum of ref … not found,” “failed to compute cache key,” or the classic “failed to walk … no such file or directory.”

Category C: Permissions and ownership failures

BuildKit runs steps with specific users, mounts, and read-only behaviors. Between rootless Docker, hardened CI runners, and corporate NFS home directories, you can hit:

  • permission denied when reading context files
  • operation not permitted on chmod/chown
  • failed to create shim task when user namespaces collide

Fixes range from mundane (chmod a file) to structural (stop trying to chown 200k files during build).
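
When the failure is about ownership, it’s usually cheaper to set it during the copy than to rewrite it afterwards. A minimal sketch, assuming an app user and group your image already creates:

  # Layer-heavy: copy everything, then rewrite ownership of every file in a new layer
  COPY app/ /srv/app/
  RUN chown -R app:app /srv/app

  # Same result, no extra chown layer
  COPY --chown=app:app app/ /srv/app/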

Category D: Network, DNS, proxies, and corporate MITM

Docker builds download base images and packages. That means DNS, TLS, proxies, and timeouts. BuildKit also does parallel fetches, which can make a flaky network look worse. You’ll see:

  • failed to fetch oauth token
  • i/o timeout
  • x509: certificate signed by unknown authority
  • temporary failure in name resolution

These are often fixed instantly by setting proxy env vars correctly, importing a corporate CA, or using a registry mirror that’s actually reachable from the runner.

Category E: Registry authentication and rate limits

Unauthenticated pulls hit rate limits. Wrong credentials hit 401/403. Private registries with expired tokens hit you during the worst possible moment. BuildKit sometimes wraps this in a bland “failed to solve” line, but the underlying error is usually explicit if you scroll.

Category F: Platform and architecture mismatch

Multi-arch builds are great until they aren’t. If your base image doesn’t support your target platform, you’ll see errors like:

  • no match for platform in manifest
  • exec format error

The fix: specify --platform, pick a multi-arch base, or stop building amd64 images on an arm64 runner without emulation configured.
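
Being explicit costs one flag. A hedged example (the image name and tag are placeholders):

cr0x@server:~$ docker buildx build --platform linux/amd64 -t testimg:amd64 .

If you need several architectures from the same Dockerfile, --platform linux/amd64,linux/arm64 works too, but only with a builder that supports multi-platform output and with emulation (or native nodes) behind it.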

Category G: Cache/solver weirdness (checksum, cache key, invalidation)

BuildKit is aggressive about caching, which is good until your build relies on something that shouldn’t be cached. Errors can include:

  • failed to compute cache key
  • failed to calculate checksum of ref
  • rpc error: code = Unknown desc = …

These often mean “the file BuildKit expects isn’t in context” or “your cache state is corrupt” or “your remote builder lost a mount.” It’s not mystical; it’s just state.

Category H: Secrets, SSH mounts, and build-time credentials

Modern builds use RUN --mount=type=secret and RUN --mount=type=ssh. Great. But if CI doesn’t pass the secret/ssh agent, BuildKit can fail with:

  • secret not found
  • ssh agent requested but no SSH_AUTH_SOCK

Fix: pass the secret or stop pretending your build has access to private stuff without wiring it up.

Practical tasks: commands, outputs, decisions

These are real checks you can run on a developer machine or CI runner. Each includes: the command, what the output means, and what decision to make next. Run them in the order that matches your failure category.

Task 1: Confirm whether BuildKit is in play

cr0x@server:~$ docker build --help | head -n 5
Usage:  docker build [OPTIONS] PATH | URL | -

Build an image from a Dockerfile

What it means: The help output alone won’t tell you whether BuildKit is in use, but it confirms you’re invoking the classic CLI entrypoint.

Decision: Next, check environment and builder state explicitly.

cr0x@server:~$ echo $DOCKER_BUILDKIT

What it means: Empty usually means “use the default.” On many modern installs, the default is BuildKit anyway.

Decision: Check buildx builders to see what backend you’re actually using.

Task 2: Inspect buildx and active builder

cr0x@server:~$ docker buildx ls
NAME/NODE       DRIVER/ENDPOINT             STATUS    BUILDKIT   PLATFORMS
default         docker                       running   v0.12.5    linux/amd64,linux/arm64
ci-builder *    docker-container            running   v0.12.5    linux/amd64

What it means: You’re using a containerized builder (docker-container) named ci-builder. That builder has its own network, DNS, and cache storage.

Decision: If failures look like DNS/timeouts or missing files, reproduce using the same builder, not the default.

Task 3: Re-run with plain progress to find the real failing line

cr0x@server:~$ docker buildx build --progress=plain -t testimg:debug .
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 1.27kB done
#2 [internal] load metadata for docker.io/library/alpine:3.19
#2 ERROR: failed to do request: Head "https://registry-1.docker.io/v2/library/alpine/manifests/3.19": dial tcp: lookup registry-1.docker.io: temporary failure in name resolution
------
 > [internal] load metadata for docker.io/library/alpine:3.19:
------
failed to solve: failed to do request: Head "https://registry-1.docker.io/v2/library/alpine/manifests/3.19": dial tcp: lookup registry-1.docker.io: temporary failure in name resolution

What it means: Not a Dockerfile problem. DNS from the builder is broken.

Decision: Move to network/DNS checks. Do not touch the Dockerfile yet; you’ll just create new bugs.

Task 4: Verify the build context size (big contexts cause slow “failed to solve”)

cr0x@server:~$ docker buildx build --progress=plain --no-cache -t testimg:ctx .
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 2.10kB done
#2 [internal] load .dockerignore
#2 transferring context: 2B done
#3 [internal] load build context
#3 transferring context: 1.48GB 34.2s done

What it means: You’re shipping 1.48GB of context to the builder. That’s not “a bit large.” That’s “why is your git repo also a file server?”

Decision: Fix .dockerignore and narrow COPY scope. Large contexts cause timeouts, cache thrash, and slow CI.

Task 5: Show what’s being ignored by .dockerignore (practical sanity check)

cr0x@server:~$ sed -n '1,120p' .dockerignore
.git
node_modules
dist
*.log

What it means: Looks reasonable, but maybe you forgot build/, target/, .venv/, or large test data.

Decision: Add ignores for generated artifacts and local caches; ensure the build still has what it needs.

Task 6: Catch “COPY failed: file not found” before you waste time

cr0x@server:~$ grep -nE '^(COPY|ADD) ' Dockerfile
12:COPY build/output/app /app/app
13:COPY configs/prod.yaml /app/config.yaml

What it means: The Dockerfile expects build/output/app to exist in the build context at build time.

Decision: If CI doesn’t generate that path before docker build, the build will fail. Either build artifacts in a prior CI step or do the build inside the Dockerfile (multi-stage).

Task 7: Confirm the referenced files exist in the context

cr0x@server:~$ ls -la build/output/app configs/prod.yaml
ls: cannot access 'build/output/app': No such file or directory
-rw-r--r-- 1 cr0x cr0x  842 Jan  2 09:41 configs/prod.yaml

What it means: One COPY will fail immediately.

Decision: Decide whether to (a) produce the artifact before the docker build, or (b) convert to a multi-stage build that compiles inside the image build.
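
If you choose (b), a minimal multi-stage sketch could look like the following; the Go toolchain, module layout, and output path are assumptions, while configs/prod.yaml matches the Dockerfile above:

  # Stage 1: produce the artifact inside the build graph, so CI step ordering can't break it
  FROM golang:1.22-alpine AS builder
  WORKDIR /src
  COPY go.mod go.sum ./
  RUN go mod download
  COPY . .
  RUN CGO_ENABLED=0 go build -o /out/app ./cmd/app

  # Stage 2: copy only what the runtime needs
  FROM alpine:3.19
  COPY --from=builder /out/app /app/app
  COPY configs/prod.yaml /app/config.yaml
  ENTRYPOINT ["/app/app"]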

Task 8: Validate base image pull and authentication (separate from build)

cr0x@server:~$ docker pull alpine:3.19
3.19: Pulling from library/alpine
Digest: sha256:de0eb0b3f2a47ba1b5d1c61c87a1f1d9b5a4f34b3a4a1d7f3d6c2c6a7f1d2d3c
Status: Image is up to date for alpine:3.19
docker.io/library/alpine:3.19

What it means: Registry access is OK from the Docker host. If buildx uses a remote builder, that builder may still be broken—don’t declare victory yet.

Decision: If the build still fails pulling the same image, inspect builder networking or proxy settings.

Task 9: Check DNS and outbound connectivity from the builder container

cr0x@server:~$ docker ps --filter name=buildx_buildkit --format '{{.ID}} {{.Names}}'
a1b2c3d4e5f6 buildx_buildkit_ci-builder0
cr0x@server:~$ docker exec -it buildx_buildkit_ci-builder0 sh -lc 'cat /etc/resolv.conf && nslookup registry-1.docker.io 2>/dev/null | head -n 5'
nameserver 127.0.0.11
options ndots:0

Server:    127.0.0.11
Address:   127.0.0.11:53

What it means: Builder uses Docker’s embedded DNS (127.0.0.11). If nslookup hangs or fails, you likely have a daemon DNS config issue or blocked outbound UDP/TCP 53.

Decision: Fix Docker daemon DNS settings or provide explicit DNS servers for the builder environment.
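
One boring, effective option is giving the daemon explicit resolvers in /etc/docker/daemon.json; the addresses below are examples, not a recommendation:

cr0x@server:~$ cat /etc/docker/daemon.json
{
  "dns": ["10.0.0.2", "1.1.1.1"]
}
cr0x@server:~$ sudo systemctl restart docker

A docker-container builder created before the change may need a restart (or a docker buildx rm / docker buildx create cycle) before it picks up the new resolvers.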

Task 10: Identify proxy settings mismatch (classic corporate failure mode)

cr0x@server:~$ env | grep -iE 'http_proxy|https_proxy|no_proxy'
HTTP_PROXY=http://proxy.corp:8080
HTTPS_PROXY=http://proxy.corp:8080
NO_PROXY=localhost,127.0.0.1,.corp

What it means: Proxy is set in your shell. The builder container may not inherit it automatically depending on how it’s created.

Decision: Pass proxy build args or configure the builder/daemon to use proxy consistently. If TLS is intercepted, you’ll also need the corporate CA inside the build image.
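
A sketch of wiring it consistently with the docker-container driver; the builder name corp-builder is made up, and the proxy values mirror the env output above:

cr0x@server:~$ docker buildx create --name corp-builder --driver docker-container \
    --driver-opt env.http_proxy=http://proxy.corp:8080 \
    --driver-opt env.https_proxy=http://proxy.corp:8080 \
    --driver-opt env.no_proxy=localhost,127.0.0.1,.corp
cr0x@server:~$ docker buildx build --builder corp-builder \
    --build-arg HTTP_PROXY=$HTTP_PROXY \
    --build-arg HTTPS_PROXY=$HTTPS_PROXY \
    --build-arg NO_PROXY=$NO_PROXY \
    -t testimg:proxy .

The proxy build args are predefined by Docker, so they don’t need matching ARG lines in the Dockerfile.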

Task 11: Catch platform mismatch quickly

cr0x@server:~$ docker buildx imagetools inspect alpine:3.19 | sed -n '1,40p'
Name:      docker.io/library/alpine:3.19
MediaType: application/vnd.oci.image.index.v1+json
Digest:    sha256:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Manifests:
  Name:      docker.io/library/alpine:3.19@sha256:bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
  Platform:  linux/amd64
  Name:      docker.io/library/alpine:3.19@sha256:cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
  Platform:  linux/arm64

What it means: This base image is multi-arch. Good. If your base image shows only one platform and you’re building the other, you’ve found the root cause.

Decision: Choose a base image with the target platform or set --platform explicitly and ensure emulation is configured if needed.

Task 12: Force a clean build to rule out cache corruption

cr0x@server:~$ docker buildx build --no-cache --progress=plain -t testimg:nocache .
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 1.27kB done
#2 [internal] load metadata for docker.io/library/alpine:3.19
#2 DONE 0.9s
#3 [internal] load build context
#3 transferring context: 52.4MB 1.1s done
#4 [1/6] FROM docker.io/library/alpine:3.19@sha256:bbbb...
#4 DONE 0.0s
#5 [2/6] RUN apk add --no-cache ca-certificates
#5 DONE 2.8s

What it means: If a clean build succeeds but cached builds fail intermittently, you have cache/state issues (often remote builder disk pressure or corrupted cache records).

Decision: Clean up builder storage and tighten cache export/import strategy rather than “just disabling cache forever.”
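
Once a clean build proves the point, make caching explicit instead of implicit. A hedged example using registry-backed cache (the registry path is an assumption):

cr0x@server:~$ docker buildx build \
    --cache-from type=registry,ref=registry.corp/app/buildcache \
    --cache-to type=registry,ref=registry.corp/app/buildcache,mode=max \
    -t registry.corp/app:candidate .

mode=max exports cache for intermediate layers too, which is useful, and also exactly the kind of thing that fills builder and registry storage if nobody sets retention.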

Task 13: Check builder disk pressure (the silent killer)

cr0x@server:~$ docker exec -it buildx_buildkit_ci-builder0 sh -lc 'df -h /var/lib/buildkit | tail -n 1'
overlay          20G   19G  1.0G  95% /

What it means: 95% full on the builder’s filesystem. Expect weird errors: unpack failures, “no space left,” random cache issues, and spectacular slowdowns.

Decision: Prune build cache, expand storage, or stop exporting huge caches you never reuse.

Task 14: Prune BuildKit cache safely (and interpret the output)

cr0x@server:~$ docker buildx prune -f --verbose
ID                                              RECLAIMABLE     SIZE        LAST ACCESSED
v1:9n8m7l6k5j4h3g2f1d0s                          true            1.2GB       2 days ago
v1:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa          true            650MB       7 days ago
Total: 1.85GB reclaimed

What it means: You reclaimed 1.85GB. If your builder was at 95%, this might turn a failing build into a working build immediately.

Decision: If pruning fixes it, implement automatic cache retention policies and monitor builder disk usage. If it doesn’t, keep digging.
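
A retention policy can be as boring as time-based pruning on a schedule; the 72-hour window below is arbitrary, so tune it to how often your cache actually gets reused:

cr0x@server:~$ docker buildx prune -f --filter until=72h

Run it from a scheduled maintenance job against the same builder your pipelines use, not just on somebody’s laptop.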

Task 15: Diagnose “secret not found” and fix wiring

cr0x@server:~$ docker buildx build --progress=plain --secret id=npmrc,src=$HOME/.npmrc -t testimg:secrets .
#7 [4/6] RUN --mount=type=secret,id=npmrc,target=/root/.npmrc npm ci
#7 DONE 18.9s

What it means: Your secret mount is present and the step succeeded. If you omit --secret on the build command, BuildKit fails that step during the build with a “secret not found” error.

Decision: In CI, ensure the secret exists, is scoped correctly, and is passed by the build command. Don’t bake it into the image. Ever.

Task 16: Make a failing RUN step debuggable

cr0x@server:~$ docker buildx build --progress=plain --build-arg DEBUG=1 -t testimg:debugrun .
#9 [5/6] RUN set -eux; apk add --no-cache git; git --version
+ apk add --no-cache git
(1/4) Installing ca-certificates (20241121-r0)
(2/4) Installing libcurl (8.6.0-r0)
(3/4) Installing pcre2 (10.42-r2)
(4/4) Installing git (2.45.2-r0)
OK: 26 MiB in 24 packages
+ git --version
git version 2.45.2

What it means: set -eux shows the exact command that fails and stops at the first error. This is how you stop guessing.

Decision: Keep the “debug mode” pattern in your Dockerfile using a build arg so you can turn it on when CI breaks.
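
A sketch of that pattern; DEBUG matches the --build-arg DEBUG=1 used above, and the rest is illustrative:

  ARG DEBUG=0
  # Always fail fast; add command tracing only when the build passes --build-arg DEBUG=1
  RUN set -eu; \
      if [ "$DEBUG" = "1" ]; then set -x; fi; \
      apk add --no-cache git; \
      git --version

Flipping the arg invalidates the cache for that layer, which is usually what you want while debugging anyway.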

Common mistakes: symptoms → root cause → fix

1) “failed to compute cache key” near a COPY/ADD

Symptoms: Build fails at a COPY line with cache-key/checksum language; sometimes mentions “not found.”

Root cause: The source path doesn’t exist in build context (or is ignored by .dockerignore), so BuildKit can’t hash it to compute cache keys.

Fix: Ensure the file exists before the build, adjust the build context directory, or update .dockerignore. Prefer copying only the files you need instead of relying on COPY . .

2) “COPY failed: stat … no such file or directory”

Symptoms: Straightforward missing file error.

Root cause: Wrong relative path, wrong context, or CI builds from a different working directory than you assumed.

Fix: Use absolute clarity: run docker build -f path/to/Dockerfile path/to/context. In CI, print pwd and ls before building. Make paths boring.

3) “failed to do request: Head … temporary failure in name resolution”

Symptoms: Base image metadata fetch fails.

Root cause: DNS broken inside builder, or outbound blocked.

Fix: Configure daemon DNS, fix runner network policy, or run builder with known DNS servers. Validate from inside the builder container.

4) “x509: certificate signed by unknown authority” during package install or image pull

Symptoms: TLS failures when hitting registries or package mirrors, especially in corporate environments.

Root cause: TLS interception / custom corporate CA not trusted in the base image or builder.

Fix: Install the corporate CA in the build stage (and ideally in a shared base image). Do not disable TLS verification as a “fix” unless you enjoy incident response.
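
A hedged sketch of trusting a corporate CA in an Alpine-based stage; the certificate filename is an assumption, and it has to exist in the build context (and not be ignored):

  COPY corp-root-ca.crt /usr/local/share/ca-certificates/corp-root-ca.crt
  RUN apk add --no-cache ca-certificates && update-ca-certificates

Debian/Ubuntu bases use the same directory with apt-get install ca-certificates; other distros differ, so check your base image.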

5) “failed to fetch oauth token” or 401/403 when pulling base images

Symptoms: Registry auth errors, often intermittent when tokens expire.

Root cause: Missing docker login in CI, wrong credential helper, expired tokens, or pulling from a private registry without passing creds to the remote builder.

Fix: Log in before build; for buildx remote builders, ensure credentials are available to that builder context. Confirm with a direct docker pull from the same environment.

6) “no match for platform in manifest”

Symptoms: The build fails early when resolving the base image.

Root cause: Base image doesn’t publish for your target architecture.

Fix: Use a multi-arch base image, pin to the correct platform, or adjust your runners. If you build on arm64 but ship amd64, be explicit.

7) “exec format error” during RUN

Symptoms: The base image pulls, but running binaries fails.

Root cause: Architecture mismatch at runtime (e.g., amd64 binary on arm64 image) or QEMU/emulation not set up.

Fix: Align base image + binaries + platform. If using emulation, ensure binfmt/qemu is configured on the host running the builder.
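
If emulation is the route you need, the setup step commonly shown in the buildx documentation looks like this, run on the host backing the builder (the binfmt image is upstream’s, not something from this article):

cr0x@server:~$ docker run --privileged --rm tonistiigi/binfmt --install all
cr0x@server:~$ docker buildx ls

After installation, docker buildx ls should list the additional platforms for your builder; if it doesn’t, emulation is not actually available to your builds.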

8) “no space left on device” (sometimes disguised as unpack errors)

Symptoms: Random failures unpacking layers, writing cache, or exporting images.

Root cause: Builder storage full (common with docker-container buildx driver), inode exhaustion, or overlayfs limits.

Fix: Prune caches, expand disk, and stop copying giant directories early in the Dockerfile (it multiplies stored layers).

9) “failed to create LLB definition” or weird RPC errors

Symptoms: BuildKit reports gRPC/rpc errors with “unknown desc.”

Root cause: Builder instability, version mismatch, corrupted state, or resource pressure (disk, memory).

Fix: Restart the builder, upgrade BuildKit/buildx, verify resources, and reproduce with --progress=plain. If it vanishes after pruning, it was state pressure.

10) “secret not found” / SSH mount failures

Symptoms: A RUN step that expects a secret/ssh agent fails immediately.

Root cause: Build command didn’t pass --secret or --ssh, or CI didn’t expose the secret.

Fix: Wire it properly. If you can’t, change the design: fetch dependencies in a different step or vendor them. Don’t commit secrets out of frustration.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A team migrated their builds from a self-hosted GitLab runner to a managed runner fleet. Same repo, same Dockerfile, same commands. The first build failed with failed to solve at a COPY step. Developers shrugged and said, “The file is in the repo; it can’t be missing.”

The file was generated. Locally, everyone had it because their dev workflow ran a build tool that produced build/output/app. The old CI runner also had a pre-step that ran the same tool—but it lived in a shared template nobody remembered using. The new CI pipeline was “cleaned up” to be “simpler,” which is corporate-speak for “we deleted the boring parts that made it work.”

The wrong assumption was subtle: they assumed Docker build could see the whole repo state, including things that would exist “by the time the build runs.” But Docker build context is a snapshot of the filesystem at the moment you invoke the build. If the artifact isn’t there, it doesn’t matter how spiritually present it feels.

The fix was to formalize the artifact creation: either a proper CI build step that produces the artifact before building the image, or a multi-stage Docker build that compiles inside a builder stage. They chose multi-stage. It was slower at first, then faster after cache tuning, and most importantly: deterministic.

Postmortem lesson: when builds fail after runner changes, assume your environment dependency was undocumented. It usually was.

Mini-story 2: The “optimization” that backfired

Another org had a goal: reduce CI time. Someone noticed Docker builds were “slow because they send too much context,” and decided to get clever. They added aggressive patterns to .dockerignore to exclude basically everything except the Dockerfile and a couple directories. CI got faster. Applause followed.

Then a release candidate failed with failed to compute cache key: failed to calculate checksum pointing at a file in a COPY line—one that existed, but was now ignored. The team fixed it by un-ignoring that file. Next build failed for a different ignored file. Another tweak. More failures. It turned into whack-a-mole because the ignore rules were written without mapping them to the Dockerfile’s actual dependency graph.

The backfire wasn’t just the failures. It was the time spent rediscovering what the image build truly required, and the subtle security risk: developers started temporarily “fixing” by copying . to get builds green, accidentally shipping credentials and dev junk into images. That’s how you end up with .env in production containers and a compliance team in your calendar.

The final resolution was boring and correct: they rebuilt .dockerignore from first principles. They listed every file copied into the image, added only those directories to the context, and made the Dockerfile copy specific paths in a stable order. Build context shrank without breaking dependencies.

Lesson: optimizing build context without understanding COPY dependencies is like removing bolts to make a bridge lighter.

Mini-story 3: The boring practice that saved the day

A platform team ran a dedicated buildx builder cluster. Nothing fancy: container driver, persistent storage, and a strict policy. Every builder node exported metrics for disk usage, inode usage, and cache size. They also pruned build caches on a schedule with a retention window tuned to their workload.

One Friday, a product team’s build started failing with sporadic “failed to solve” errors: sometimes during layer unpack, sometimes during cache export. Developers suspected “Docker is flaky” (a timeless theory) and began rerunning jobs until they got a green build.

The platform team looked at the dashboards. Builder disks were creeping toward full. Not because of one project, but because multiple teams had turned on cache exports and nobody set limits. When disk pressure got high, BuildKit started failing in messy ways—because storage failures are rarely polite.

They increased the builder volume size and tightened the prune policy. Builds stabilized immediately. No dramatic heroics, just someone who cared about “boring” signals like disk usage. The product team shipped. The release was quiet. That’s the best kind.

Lesson: builder disk is production infrastructure. Treat it like production infrastructure, or it will treat you like a hobby.

Checklists / step-by-step plan

Step-by-step: from red CI to a confirmed root cause

  1. Re-run once with --progress=plain. Your goal is clarity, not hope.
  2. Identify the failing phase: internal/context, metadata fetch, COPY/ADD, RUN, export.
  3. Check build context: verify referenced files exist; check .dockerignore; measure context size.
  4. Check base image resolution: pull manually; inspect platforms; validate auth.
  5. Check builder environment: which buildx builder, driver type, DNS/proxy, disk pressure.
  6. Reduce the failing RUN command: add set -eux, remove chained commands, re-run.
  7. Rule out cache corruption: run with --no-cache; if it fixes it, repair caching strategy.
  8. Apply the smallest fix that makes it deterministic. Deterministic beats clever.

Checklist: make Dockerfile failures less likely next week

  • Keep .dockerignore small, explicit, and reviewed alongside Dockerfile changes.
  • Prefer multi-stage builds so artifact creation is part of the build graph, not an external assumption.
  • Pin base images to tags you control (and consider digests for critical pipelines).
  • Stop using COPY . . as a lifestyle. Copy specific directories.
  • Split risky RUN steps. One step that downloads packages, one step that builds, one step that cleans.
  • Monitor builder disk. Set prune policies. Document builder configuration like it’s a database (because it behaves like one).
  • Handle proxies and corporate CAs explicitly. If you need them, codify them.
  • Use secrets mounts for credentials; verify CI passes them; never bake secrets into layers.

Joke #2: “We’ll just disable cache to fix the build” is the CI equivalent of unplugging the smoke alarm so you can sleep.

Interesting facts and historical context

  • BuildKit started as a separate project and later became the default builder for many Docker installs because the classic builder couldn’t scale caching and concurrency as well.
  • The “LLB” concept (Low-Level Build) is BuildKit’s internal build graph format; “solve” refers to computing and executing that graph.
  • Docker’s original builder executed Dockerfile steps sequentially; BuildKit introduced more parallelism and smarter caching, which also changes how failures surface.
  • .dockerignore exists because early Docker builds were painfully slow when users unknowingly sent entire repos (including .git) as context.
  • Registry rate limits became a real operational problem as CI usage exploded; many orgs learned about it only after pipelines started failing during peak hours.
  • Multi-architecture images became mainstream as ARM servers and Apple Silicon became common; platform mismatch errors rose accordingly.
  • Rootless Docker and hardened runners improved security but made permission assumptions in older Dockerfiles fail more often.
  • Build secrets support improved significantly with BuildKit mounts, reducing the need for ARG-based secret hacks that leak into layers.
  • Reproducible builds became a stronger expectation as supply-chain security concerns grew; “it works if you rerun it” stopped being acceptable.

FAQ

Why does Docker say “failed to solve” instead of the real error?

Because BuildKit reports a high-level failure for the build graph (“solve failed”) and includes the underlying error above it. Use --progress=plain to make the underlying error obvious.

Is “failed to solve” always a Dockerfile problem?

No. It’s frequently network/DNS, registry auth, build context, or builder disk pressure. Treat the Dockerfile as guilty only after you’ve proven the environment is healthy.

Why does it work on my laptop but fail in CI?

CI runners have different network policies, proxies, credentials, filesystem permissions, and often a different build context path. Also, CI may be using a remote buildx builder with its own environment.

How do I quickly see which step actually failed?

Run the build with docker buildx build --progress=plain and scroll to the first ERROR block. Ignore the final wrapper line until you’ve read the actual error.

What’s the fastest way to detect build context issues?

Check .dockerignore, grep COPY/ADD lines, and verify those source paths exist under the context directory. If your context transfer is hundreds of MB or more, fix that too.

Should I pin base images by digest?

If you care about repeatability and supply-chain control, yes—especially for production pipelines. If you need regular security updates, use a controlled process to bump digests rather than floating tags.
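
A minimal sketch of digest pinning; resolve the digest with docker buildx imagetools inspect (as in Task 11) and substitute the placeholder:

  FROM alpine:3.19@sha256:<digest-from-imagetools-inspect>

The tag stays for readability; the digest is what actually pins the content.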

How do I fix corporate proxy and CA issues during builds?

Pass proxy settings consistently to the builder and build steps, and install the corporate CA into the image (and sometimes into the builder environment). Don’t disable TLS verification as a workaround.

What’s the right way to use secrets during docker build?

Use BuildKit secret mounts (--secret and RUN --mount=type=secret). Verify CI provides the secret. Avoid ARG/ENV for secrets because they leak into image history and layers.

Why do cache-related errors look so weird?

Because BuildKit caches are content-addressed and tied to the build graph. When referenced content isn’t available (ignored file, missing mount, corrupted cache), you get checksum and cache-key errors.

When should I use --no-cache?

As a diagnostic tool. If it fixes the build, you’ve proven a caching/state problem. Then fix the cache strategy or builder health—don’t permanently give up caching unless you enjoy slower CI.

Conclusion: next steps that actually reduce pain

“Failed to solve” isn’t a message. It’s a category label. Your job is to peel it back to the first concrete error, then decide whether you’re dealing with context, network, auth, platform, permissions, cache, or runtime.

Do these three things and your future self will send you a thank-you note (silently, by not waking you up at 2 a.m.):

  1. Standardize on --progress=plain in CI logs (at least on failure) so the real error is visible.
  2. Make context deterministic: explicit COPY paths, disciplined .dockerignore, and multi-stage builds for generated artifacts.
  3. Operate your builders: monitor disk, prune intelligently, and treat build infrastructure like production—because it’s upstream of production.

Most Dockerfile “failed to solve” errors are fixed instantly once you stop staring at the last line and start treating builds like systems: inputs, state, and failure domains.
